alamb commented on issue #34451: URL: https://github.com/apache/arrow/issues/34451#issuecomment-1458488544
> this is super helpful. This is the relevant PR for datafusion: https://github.com/apache/arrow-datafusion/pull/1776. @alamb , if you have extra input it'd be nice to hear.

> Currently, the node expects you to declare which columns are sorted ahead of time and, if they aren't, it will give you bad data.

Yes, I think this is the standard situation (I have debugged many bugs in various past lives related to sortedness).

DataFusion has gotten quite a bit more sophisticated in its sortedness handling, including removing Sorts that are not required, based on metadata. See for example https://github.com/apache/arrow-datafusion/blob/928662bb12d915aef83abba1312392d25770f68f/datafusion/core/src/physical_optimizer/sort_enforcement.rs#L18 and https://github.com/apache/arrow-datafusion/blob/928662bb12d915aef83abba1312392d25770f68f/datafusion/core/src/physical_optimizer/global_sort_selection.rs

In terms of metadata, I recommend adding something to the Arrow standard, as sorting is so important (and doesn't really vary from system to system).

Things that should be covered:
1. Where nulls sort (first or last)
2. ASC / DESC
3. Any collation considerations (ideally we would keep it as simple as possible)
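
As a rough illustration of the points in the comment above (null placement and ASC / DESC), here is a minimal Python sketch of what a declared sort order could look like, and how a consumer might check data against it instead of trusting it blindly. `SortKey`, `compare`, and `is_sorted` are hypothetical names made up for this example; they are not part of Arrow or DataFusion. Collation is left out on purpose, so values compare with Python's default ordering.

```python
from dataclasses import dataclass
from typing import Any, Dict, Sequence


@dataclass(frozen=True)
class SortKey:
    """Hypothetical descriptor for one sorted column (not an Arrow API)."""
    column: str
    descending: bool = False   # ASC (False) or DESC (True)
    nulls_first: bool = False  # where nulls sort relative to non-null values


def compare(a: Any, b: Any, key: SortKey) -> int:
    """Return <0 if a sorts before b under `key`, 0 if tied, >0 if after."""
    if a is None and b is None:
        return 0
    if a is None:
        return -1 if key.nulls_first else 1
    if b is None:
        return 1 if key.nulls_first else -1
    c = (a > b) - (a < b)  # natural ordering, no collation
    return -c if key.descending else c


def is_sorted(rows: Sequence[Dict[str, Any]], keys: Sequence[SortKey]) -> bool:
    """Check that adjacent rows respect the declared lexicographic order."""
    for prev, cur in zip(rows, rows[1:]):
        for key in keys:
            c = compare(prev[key.column], cur[key.column], key)
            if c < 0:
                break         # strictly ordered on this key; later keys don't matter
            if c > 0:
                return False  # declared order is violated
    return True


rows = [
    {"ts": None, "v": 1},
    {"ts": 1, "v": 5},
    {"ts": 1, "v": 3},
    {"ts": 2, "v": 9},
]
declared = [SortKey("ts", nulls_first=True), SortKey("v", descending=True)]
print(is_sorted(rows, declared))
```

The same rows fail the check if the declaration says nulls sort last, which is the kind of mismatch that silently produces bad data when a node trusts the declaration without verifying it.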
