acking-you commented on PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2801131412

   > Very cool. It would be nice to run some e2e benchmarks (TPC-H, clickbench) 
with this to see the impact here.
   
   I tried running clickbench, and there wasn't a significant improvement. The 
optimization is still relatively selective about the cost of expression 
evaluation. For example, if the computation on the right involves only simple 
integer comparisons or short string comparisons, skipping a large amount of 
computation may not yield much benefit.
   
   ## New ideas
   One potential area for significant improvement that I thought of is: 
heuristically replacing the computation process of 
[`filter_and_project`](https://github.com/apache/datafusion/blob/61e8a5d0d65859427c5882644e1cc97c4def0bde/datafusion/physical-plan/src/filter.rs#L514-L545).
   
   The current state of this function is:
   1. Pass in the batch to compute the predicate and obtain a boolean array.
   2. Perform the actual filtering process on the batch based on the boolean 
array and return the result.
   
   ## Pros
   Since the filtering process involves copying to produce a new batch, 
calculating a boolean array based on the predicate only requires a single copy. 
If filtering is performed during the predicate computation, multiple copies 
will occur.
   
   ## Cons
   If the predicate is an AND expression, after the left side is computed, it 
may already have filtered out most of the batch and computed a new batch. 
However, the final batch generation is still based on the previous batch and 
the boolean array, which introduces unnecessary copy overhead.
   
   
   ## Conclusion
   Considering both aspects, I believe we can strike a balance. For example, if 
it is discovered during the AND computation process that most of the batch can 
be filtered out, then early filtering can be performed as the final result.
   However, this optimization scenario is very limited: it can only be applied 
when the predicate is entirely an AND computation, such as `a AND b AND c ...`. 
Once an OR appears in the middle, we cannot perform early filtering to produce 
the result.
   
   The changes required to implement this optimization might be significant, 
and it would only apply to predicates in the form of `a AND b AND c...`. It 
might not be worth the effort.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to