alamb commented on issue #34451:
URL: https://github.com/apache/arrow/issues/34451#issuecomment-1458488544

   > This is super helpful. This is the relevant PR for DataFusion: 
https://github.com/apache/arrow-datafusion/pull/1776. @alamb, if you have 
extra input, it would be nice to hear it.
   
   > Currently, the node expects you to declare which columns are sorted ahead 
of time and, if they aren't, it will give you bad data.
   
   Yes, I think this is the standard situation (I have debugged many 
sortedness-related bugs in various past lives).
   
   DataFusion has become considerably more sophisticated in how it tracks 
sortedness and removes Sorts that its plan metadata shows are not required. 
For example, see 
https://github.com/apache/arrow-datafusion/blob/928662bb12d915aef83abba1312392d25770f68f/datafusion/core/src/physical_optimizer/sort_enforcement.rs#L18
 and 
https://github.com/apache/arrow-datafusion/blob/928662bb12d915aef83abba1312392d25770f68f/datafusion/core/src/physical_optimizer/global_sort_selection.rs
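
As a rough illustration of the idea (this is not DataFusion's actual implementation), a planner can drop a sort when the ordering an input already declares satisfies the ordering an operator requires. The sketch below is a minimal Python version under that assumption; `OrderKey` and `ordering_satisfies` are hypothetical names chosen for the example.

```python
from typing import Sequence, Tuple

# Hypothetical encoding of one sort key: (column, descending, nulls_first).
# This is an illustration only, not an Arrow or DataFusion type.
OrderKey = Tuple[str, bool, bool]


def ordering_satisfies(existing: Sequence[OrderKey],
                       required: Sequence[OrderKey]) -> bool:
    """Return True if data sorted by `existing` is also sorted by `required`.

    This holds when `required` is a prefix of `existing`: the same columns,
    direction, and null placement, in the same order.
    """
    if len(required) > len(existing):
        return False
    return all(have == want for have, want in zip(existing, required))


# The input declares it is sorted by (time ASC, host ASC), nulls last.
declared = [("time", False, False), ("host", False, False)]

# An operator that needs (time ASC) can skip its sort ...
sort_needed_for_time = not ordering_satisfies(declared, [("time", False, False)])

# ... but one that needs (host ASC) cannot.
sort_needed_for_host = not ordering_satisfies(declared, [("host", False, False)])
```

Note that the comparison includes direction and null placement, so an input sorted `time DESC` does not satisfy a requirement of `time ASC` in this sketch.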
   
   In terms of metadata, I recommend adding something to the Arrow standard as 
sorting is so important (and doesn't really vary from system to system)
   
   Things that should be covered:
   1. Where nulls sort (first or last)
   2. Sort direction (ASC / DESC)
   3. Any collation considerations (ideally we would keep it as simple as 
possible)
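
To make the three points concrete, here is a minimal Python sketch of what such a sort-order description could carry, together with a check that data actually honours a declared ordering (the failure mode behind the "bad data" case quoted above). `SortKey`, its field names, and `is_sorted` are assumptions made for this example and are not part of the Arrow format. The sketch also assumes that null placement is independent of direction, and it only handles plain value comparison (no collation).

```python
from dataclasses import dataclass
from typing import Any, Optional, Sequence


@dataclass(frozen=True)
class SortKey:
    """Hypothetical per-column sort description (not an Arrow type)."""
    column: str
    descending: bool = False          # 2. ASC / DESC
    nulls_first: bool = False         # 1. where nulls sort
    collation: Optional[str] = None   # 3. None means plain value comparison


def _compare(a: Any, b: Any, key: SortKey) -> int:
    """Return -1, 0, or 1 as `a` sorts before, equal to, or after `b`."""
    if a is None or b is None:
        if a is None and b is None:
            return 0
        # `a` goes first if it is the null and nulls go first,
        # or if it is the non-null and nulls go last.
        a_first = (a is None) == key.nulls_first
        return -1 if a_first else 1
    if a == b:
        return 0
    less = a < b
    if key.descending:
        less = not less
    return -1 if less else 1


def is_sorted(rows: Sequence[dict], keys: Sequence[SortKey]) -> bool:
    """Check that `rows` (dicts keyed by column name) follow `keys`."""
    if any(key.collation is not None for key in keys):
        raise NotImplementedError("this sketch only supports plain comparison")
    for prev, cur in zip(rows, rows[1:]):
        for key in keys:
            c = _compare(prev[key.column], cur[key.column], key)
            if c < 0:
                break          # strictly ordered on this key; later keys are irrelevant
            if c > 0:
                return False   # out of order
    return True


rows = [{"x": 3}, {"x": 1}, {"x": None}]
ok_nulls_last = is_sorted(rows, [SortKey("x", descending=True)])
ok_nulls_first = is_sorted(rows, [SortKey("x", descending=True, nulls_first=True)])
```

The same rows pass or fail depending only on the declared null placement, which is why that choice needs to be part of the metadata rather than left implicit.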

