zhuqi-lucas commented on PR #21956: URL: https://github.com/apache/datafusion/pull/21956#issuecomment-4351600543
The benchmark results are expected: row group (RG) reorder alone doesn't skip any row groups. It only changes the read order so that TopK's dynamic filter threshold converges faster.

The significant speedup (2-3x on `sort_pushdown_inexact`) comes from **stats init + cumulative RG prune**, which will be in the follow-up PR. Those optimizations depend on RG reorder as a foundation:

1. **RG reorder**: put the best RGs first (this PR)
2. **Stats init**: initialize the TopK threshold from RG statistics before reading → prune RGs upfront (next PR)
3. **Cumulative prune**: after reorder, truncate the remaining RGs once enough rows are collected (next PR)

Without reorder, cumulative prune might truncate the wrong RGs. Reorder ensures the best RGs come first, making truncation safe and effective.
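To make the dependency between the three steps concrete, here is a minimal standalone sketch for an `ORDER BY col ASC LIMIT k` query. It is not the code from this PR or the follow-up: `RowGroupStats`, `reorder_row_groups`, `stats_init_threshold`, and `cumulative_prune` are hypothetical names, the statistics are plain `i64` min/max values, and the truncation condition (stop once `k` rows are covered and the next RG's min exceeds every max seen so far) is a conservative assumption chosen so the sketch is provably safe.

```rust
/// Min/max statistics for one row group (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq)]
struct RowGroupStats {
    id: usize,
    min: i64,
    max: i64,
    num_rows: usize,
}

/// Step 1 (RG reorder): for `ORDER BY col ASC`, read the RGs with the
/// smallest `min` first. Nothing is skipped here; only the order changes.
fn reorder_row_groups(mut rgs: Vec<RowGroupStats>) -> Vec<RowGroupStats> {
    rgs.sort_by_key(|rg| (rg.min, rg.max));
    rgs
}

/// Step 2 (stats init): find the smallest `max` such that the RGs whose
/// `max` is at or below it hold at least `k` rows. The k-th smallest value
/// cannot exceed this bound, so any RG with `min` above it can be pruned
/// before reading. Returns `None` if there are fewer than `k` rows in total.
fn stats_init_threshold(rgs: &[RowGroupStats], k: usize) -> Option<i64> {
    let mut by_max: Vec<&RowGroupStats> = rgs.iter().collect();
    by_max.sort_by_key(|rg| rg.max);
    let mut rows = 0usize;
    for rg in by_max {
        rows += rg.num_rows;
        if rows >= k {
            return Some(rg.max);
        }
    }
    None
}

/// Step 3 (cumulative prune): `ordered` must already be sorted by `min`
/// ascending (step 1). Walk the RGs, and once at least `k` rows are covered
/// and the next RG starts above every value seen so far, drop it and
/// everything after it.
fn cumulative_prune(ordered: &[RowGroupStats], k: usize) -> &[RowGroupStats] {
    let mut rows = 0usize;
    let mut max_seen = i64::MIN;
    for (i, rg) in ordered.iter().enumerate() {
        if rows >= k && rg.min > max_seen {
            // Later RGs have an even larger `min`, so none can contribute.
            return &ordered[..i];
        }
        rows += rg.num_rows;
        max_seen = max_seen.max(rg.max);
    }
    ordered
}
```

In this sketch the truncation in `cumulative_prune` is only valid because the input is sorted by `min`: if the RGs arrived in file order, an RG later in the list could still hold smaller values than the ones already counted, which is the "truncate the wrong RGs" case described above.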
