alamb commented on PR #21828:
URL: https://github.com/apache/datafusion/pull/21828#issuecomment-4347003410

   > I understand where the need comes from, but there is a good reason why 
databases treat scans without ORDER BY as unordered: a lot of 
logical/physical planning optimizations depend on this assumption, and they can 
only rely on metadata to tell whether the plan changes they want to make are 
safe.
   
   I agree with @asolimando on this -- and I think DataFusion should not be 
breaking new ground on what semantics we implement (we should follow other DB 
implementations as much as possible).
   
   > If the underlying data is truly sorted over something that can be encoded 
similarly to what you can write with an ORDER BY (or at least producing the 
same metadata DataFusion uses), that could be fine, but if the order is just 
the order rows happen to have in the files, and we can't encode this promise 
anywhere, then it gets complex.
   
   We did recently add the ability to emit row id from the parquet reader 🤔 -- 
maybe we could make that work and then treat row group skipping as an 
optimization when the data is explicitly `ORDER BY row_number()` 🤔 
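
To make the row-id idea concrete, here is a toy sketch (not DataFusion or parquet reader code; `RowGroup`, `first_row_id`, and `row_groups_to_read` are hypothetical names made up for illustration). It assumes each row group knows the row id of its first row and its row count, so a query that explicitly orders by row id with an offset and limit only needs the row groups whose row id range overlaps the requested window:

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    # Hypothetical metadata: row id of the first row in the group, and its size.
    first_row_id: int
    num_rows: int


def row_groups_to_read(groups, offset, limit):
    """Return indices of row groups overlapping row ids [offset, offset + limit).

    Models a query of the form ORDER BY <row id> LIMIT limit OFFSET offset.
    The explicit ordering is what makes skipping the other groups safe.
    """
    if limit <= 0:
        return []
    lo, hi = offset, offset + limit
    selected = []
    for i, g in enumerate(groups):
        g_lo, g_hi = g.first_row_id, g.first_row_id + g.num_rows
        # Half-open interval overlap test.
        if g_lo < hi and lo < g_hi:
            selected.append(i)
    return selected


groups = [RowGroup(0, 100), RowGroup(100, 100), RowGroup(200, 50)]
print(row_groups_to_read(groups, offset=120, limit=30))
```

Without the explicit ordering there is no encoded promise about which rows a `LIMIT` should return, so a planner could not rely on this kind of pruning being equivalent to the unpruned plan.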
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

