acking-you commented on PR #15694: URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2801131412
> Very cool. It would be nice to run some e2e benchmarks (TPC-H, clickbench) with this to see the impact here. I tried running clickbench, and there wasn't a significant improvement. The optimization is still relatively selective about the cost of expression evaluation. For example, if the computation on the right involves only simple integer comparisons or short string comparisons, skipping a large amount of computation may not yield much benefit. ## New ideas One potential area for significant improvement that I thought of is: heuristically replacing the computation process of [`filter_and_project`](https://github.com/apache/datafusion/blob/61e8a5d0d65859427c5882644e1cc97c4def0bde/datafusion/physical-plan/src/filter.rs#L514-L545). The current state of this function is: 1. Pass in the batch to compute the predicate and obtain a boolean array. 2. Perform the actual filtering process on the batch based on the boolean array and return the result. ## Pros Since the filtering process involves copying to produce a new batch, calculating a boolean array based on the predicate only requires a single copy. If filtering is performed during the predicate computation, multiple copies will occur. ## Cons If the predicate is an AND expression, after the left side is computed, it may already have filtered out most of the batch and computed a new batch. However, the final batch generation is still based on the previous batch and the boolean array, which introduces unnecessary copy overhead. ## Conclusion Considering both aspects, I believe we can strike a balance. For example, if it is discovered during the AND computation process that most of the batch can be filtered out, then early filtering can be performed as the final result. However, this optimization scenario is very limited: it can only be applied when the predicate is entirely an AND computation, such as `a AND b AND c ...`. Once an OR appears in the middle, we cannot perform early filtering to produce the result. The changes required to implement this optimization might be significant, and it would only apply to predicates in the form of `a AND b AND c...`. It might not be worth the effort. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org