westonpace commented on issue #34451:
URL: https://github.com/apache/arrow/issues/34451#issuecomment-1456922143

   We're pretty close in Acero to the point where we can use this for real 
advantage internally.  For example, there is a PR 
(https://github.com/apache/arrow/pull/34311) up for a more RAM-efficient 
aggregation if we know the data is segmented / sorted.  Currently, the node 
expects you to declare which columns are sorted ahead of time and, if they 
aren't, it will give you bad data.
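
   To make the trade-off concrete, here is a minimal pure-Python sketch of 
segmented aggregation.  It is not the Acero node from the PR, and the name 
`segmented_sum` is made up for illustration.  It assumes rows arrive sorted 
by key, so only the current group is held in memory:

```python
from itertools import groupby
from operator import itemgetter


def segmented_sum(rows):
    """Sum values per key, ASSUMING rows are already sorted by key.

    A group is emitted as soon as the key changes, so memory use does not
    grow with the number of distinct keys (unlike a hash aggregation).
    """
    for key, group in groupby(rows, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


# Input that honours the "sorted by key" promise: one row per key comes out.
sorted_rows = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5)]
sorted_result = list(segmented_sum(sorted_rows))

# Input that breaks the promise: key "a" is emitted twice with partial sums.
unsorted_rows = [("a", 1), ("b", 3), ("a", 2)]
unsorted_result = list(segmented_sum(unsorted_rows))
```

   With the unsorted input nothing raises an error; the key "a" simply 
appears twice with partial sums, which is the kind of bad data described 
above.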
   
   However, if we had a metadata standard in place then pyarrow/R (for 
in-memory tables) and datasets (for on-disk tables) could automatically detect 
this condition and apply the more efficient aggregation.  There are probably a 
few unknowns about how exactly that should happen (An optimization pass based 
on data statistics (e.g. orderedness)?  Detected on the fly while running a 
plan?) but getting a standard in for representing this information would be a 
good first step.
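
   As a rough sketch of the two options in the parenthetical, the choice 
could look something like the following.  Everything here is hypothetical: 
the `sorted_by` metadata key, the function names, and the strategy labels are 
placeholders, not an existing Arrow, pyarrow, or Acero API.

```python
def keys_are_sorted(keys):
    """On-the-fly check: True if the key column is in non-decreasing order."""
    return all(a <= b for a, b in zip(keys, keys[1:]))


def pick_aggregation(group_key, metadata, keys=None):
    """Choose an aggregation strategy for grouping by `group_key`.

    `metadata` is a plain dict standing in for a (not yet standardized)
    ordering declaration; "sorted_by" is an assumed key listing sort columns.
    """
    # Option 1: trust declared ordering metadata (planner-time decision).
    declared = metadata.get("sorted_by", [])
    if declared[:1] == [group_key]:
        return "segmented"
    # Option 2: no declaration, so inspect the data itself if it is available.
    if keys is not None and keys_are_sorted(keys):
        return "segmented"
    # Fall back to the general-purpose hash aggregation.
    return "hash"


declared_choice = pick_aggregation("id", {"sorted_by": ["id", "ts"]})
detected_choice = pick_aggregation("id", {}, keys=[1, 1, 2, 5])
fallback_choice = pick_aggregation("id", {}, keys=[2, 1, 3])
```

   The metadata path costs nothing at run time but relies on the producer 
being honest; the data check is safe but needs a pass over the key column.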


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
