westonpace commented on issue #34451: URL: https://github.com/apache/arrow/issues/34451#issuecomment-1456922143
We're pretty close in Acero to the point where we can use this for real advantage internally. For example, there is a PR (https://github.com/apache/arrow/pull/34311) up for a more RAM-efficient aggregation when we know the data is segmented / sorted. Currently, the node expects you to declare which columns are sorted ahead of time and, if they aren't, it will give you bad data. However, if we had a metadata standard in place, then pyarrow/R (for in-memory tables) and datasets (for on-disk tables) could automatically detect this condition and apply the more efficient aggregation. There are probably a few unknowns about how exactly that should happen (an optimization pass based on data statistics such as orderedness? detection on the fly while running a plan?), but getting a standard in for representing this information would be a good first step.
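
To illustrate why a declared sort order matters, here is a minimal pure-Python sketch. It is not Acero's implementation or API: `segmented_sum`, `is_sorted_by_key`, and the `declared_sorted` flag are hypothetical names, with the flag standing in for whatever a metadata standard would eventually provide. It shows the memory difference between the two strategies and how a wrong declaration produces bad results.

```python
from itertools import groupby
from operator import itemgetter


def segmented_sum(rows, declared_sorted):
    """Sum values per key for an iterable of (key, value) pairs.

    declared_sorted is a hypothetical flag standing in for sortedness metadata.
    """
    if declared_sorted:
        # Streaming path: each run of equal keys is summed and emitted
        # immediately, so only one running total is held at a time.
        return [(key, sum(value for _, value in group))
                for key, group in groupby(rows, key=itemgetter(0))]
    # Fallback path: hash aggregation keeps one accumulator per distinct key.
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return list(totals.items())


def is_sorted_by_key(rows):
    """Check the property a metadata standard could record up front."""
    keys = [key for key, _ in rows]
    return all(a <= b for a, b in zip(keys, keys[1:]))


sorted_rows = [("a", 1), ("a", 2), ("b", 3)]
unsorted_rows = [("a", 1), ("b", 3), ("a", 2)]

# Correct declaration: the streaming path gives one row per key.
good = segmented_sum(sorted_rows, declared_sorted=True)

# Wrong declaration: key "a" appears twice in the output, i.e. bad data.
bad = segmented_sum(unsorted_rows, declared_sorted=True)

# Detecting the condition instead of trusting the caller avoids the problem.
safe = segmented_sum(unsorted_rows, declared_sorted=is_sorted_by_key(unsorted_rows))
```

Here `is_sorted_by_key` scans the data, which is exactly the cost that recorded metadata would avoid; with a standard in place the check could become a metadata lookup.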
