alamb commented on PR #6921:
URL: https://github.com/apache/arrow-rs/pull/6921#issuecomment-2718792433

@XiangpengHao and I just had a nice discussion about this ticket and next steps.

One thing that he noted is that reviewing this PR (and understanding its implications) is tricky, as it requires a lot of context. For example, there are two subsets of columns:
* Predicate columns
* Projection columns

Those two sets can be disjoint. This PR caches the intersection of the two sets.

Also, by design, this PR doesn't cache every page (it caches only 2 pages) to avoid increasing memory consumption.

In order to move forward, I think the next steps are:
1. @XiangpengHao will write up the current state of affairs / document the existing code better
2. We then rebase the PR against main
3. Rerun the ClickBench / TPC-H DataFusion benchmarks again

@XiangpengHao mentioned that while ClickBench Q23 gets 2x faster with pushdown enabled and this PR, it is actually even faster when pushdown is enabled without this PR (aka this PR regresses the pushdown performance).

Thus we also thought it would be valuable to:
1. Put this new behavior behind an option that can be disabled in case we encounter issues rolling it out
2. Figure out how to get performance back for Q23 (maybe not needed for this PR)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org