Dandandan commented on issue #20324:
URL: https://github.com/apache/datafusion/issues/20324#issuecomment-3904610347

   > Thinking more about this, I wonder if we could model the choice of where 
   > to evaluate a filter as a dynamic filter 🤔
   > 
   > Aka make two filters for each predicate
   > 
   > * The one in the parquet reader
   > * The one in a subsequent FilterExec
   > 
   > And then have the pair somehow coordinate between themselves during 
   > execution to decide which to evaluate 🤔
   > 
   > Though maybe that would be too localized (should take multiple predicates 
   > into account)
   
   IMO there is not a large difference between evaluating a filter in the 
   Parquet reader and in a `FilterExec`, and we should be able to optimize the 
   filter execution in the parquet reader (e.g. use masks more, see 
   https://github.com/apache/arrow-rs/issues/9416) to further minimize the cost.
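
   To make that concrete, here is a minimal arrow-rs sketch (toy data and 
   schema, not the actual DataFusion / parquet reader code path): in both 
   places the predicate is evaluated to a `BooleanArray` mask that is then 
   applied to the batch, so the per-row work is essentially the same.

   ```rust
   use std::sync::Arc;

   use arrow::array::{ArrayRef, Int32Array};
   use arrow::compute::{filter_record_batch, kernels::cmp::gt};
   use arrow::datatypes::{DataType, Field, Schema};
   use arrow::record_batch::RecordBatch;

   fn main() -> Result<(), arrow::error::ArrowError> {
       // A toy batch standing in for one decoded chunk of a parquet scan.
       let schema = Arc::new(Schema::new(vec![Field::new(
           "col",
           DataType::Int32,
           false,
       )]));
       let col: ArrayRef = Arc::new(Int32Array::from(vec![1, 20, 30, 5]));
       let batch = RecordBatch::try_new(schema, vec![col.clone()])?;

       // Evaluate `col > 10` to a BooleanArray mask and apply it; whether
       // this runs inside the parquet reader or in a FilterExec, the
       // per-row work is essentially this.
       let mask = gt(&col, &Int32Array::new_scalar(10))?;
       let filtered = filter_record_batch(&batch, &mask)?;
       assert_eq!(filtered.num_rows(), 2); // 20 and 30 pass
       Ok(())
   }
   ```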
   
   For adaptive filters there mainly needs to be an option _not to evaluate_ 
   them on all rows (either based on a heuristic or on the adaptiveness 
   @adriangb is focusing on), as the evaluation cost can be higher than what is 
   saved (this is most notable for joins, but it can show up for aggregate / 
   TopK filters as well).
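
   As a strawman for what "not evaluating on all rows" could look like (the 
   struct, warm-up size and threshold below are all made up on my side, not 
   @adriangb's design): track the observed selectivity per filter and stop 
   evaluating a predicate once it demonstrably removes too few rows to pay 
   for itself.

   ```rust
   /// Hypothetical bookkeeping for an adaptive filter: after a warm-up
   /// window, stop evaluating a predicate whose observed selectivity shows
   /// it filters out too few rows to pay for its own evaluation cost.
   struct AdaptiveFilterStats {
       rows_seen: u64,
       rows_passed: u64,
   }

   impl AdaptiveFilterStats {
       fn record(&mut self, batch_rows: u64, passed: u64) {
           self.rows_seen += batch_rows;
           self.rows_passed += passed;
       }

       /// Skip evaluation once the filter is observed to keep almost every
       /// row. The warm-up size and 0.95 threshold are made-up tuning knobs.
       fn should_evaluate(&self) -> bool {
           const WARMUP_ROWS: u64 = 64 * 1024;
           const MAX_KEEP_FRACTION: f64 = 0.95;
           if self.rows_seen < WARMUP_ROWS {
               return true; // not enough observations yet
           }
           (self.rows_passed as f64 / self.rows_seen as f64) < MAX_KEEP_FRACTION
       }
   }

   fn main() {
       let mut stats = AdaptiveFilterStats { rows_seen: 0, rows_passed: 0 };
       stats.record(128 * 1024, 127 * 1024); // filter keeps ~99% of rows
       assert!(!stats.should_evaluate()); // no longer worth evaluating
   }
   ```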
   
   I think for the adaptiveness to work well, the effectiveness of the filters 
   needs to be somehow computed on several fronts:
   
   Benefit:
   * When pushing down as `RowFilter` we save the parquet decoding overhead (or 
more in very effective cases) for the columns that follow (if the predicate is 
more effective than other candidates)
   * When either pushing down or evaluating in a `FilterExec`, we save 
   upstream work in hash repartition / join / aggregate / etc.
   
   Cost:
   * Evaluating _extra_ expressions not needed for result correctness (even 
   ones as simple as `col > 10`, `col < 50`, etc.) can take considerable time 
   itself, both as a `RowFilter` and as a `FilterExec`; most notably with 
   joins it can lead to 5-10x slowdowns in the more extreme cases (e.g. 
   nested / multi-predicate joins with non-effective filters).
   * Pushing down `RowFilter`s with multiple columns means first paying the 
   cost of decoding all of them (based on the current selection), which can be 
   considerable, so you could make "bad" choices wrt. predicate order (see the 
   back-of-envelope sketch after this list).
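
   To illustrate that last point, a back-of-envelope sketch (made-up numbers, 
   decode cost approximated as bytes decoded, predicates assumed independent): 
   the most selective predicate is not automatically the best one to run first 
   once per-column decode width is taken into account.

   ```rust
   // Back-of-envelope decode-cost model for two RowFilter predicate orders.
   // All numbers are made up for illustration; this is not DataFusion code.
   fn main() {
       let rows = 1_000_000.0_f64;
       // (fraction of rows kept, bytes decoded per value)
       let cheap = (0.5, 4.0);   // e.g. `col > 10` on an Int32 column
       let strong = (0.1, 32.0); // e.g. a predicate on a wide string column

       // The first predicate decodes its column for every row; the second
       // only decodes the rows that survived the first.
       let cost = |first: (f64, f64), second: (f64, f64)| -> f64 {
           rows * first.1 + rows * first.0 * second.1
       };

       println!("cheap-first:  {:>10.0} bytes", cost(cheap, strong)); // 20_000_000
       println!("strong-first: {:>10.0} bytes", cost(strong, cheap)); // 32_400_000
       // The less selective but cheaper predicate wins here (~1.6x fewer
       // bytes decoded): "most selective first" is not always the best order.
   }
   ```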
   
   I think we probably also need to come up with a useful "initial" heuristic 
for ordering predicates that looks at:
   
   * evaluation cost (e.g. integer comparison vs string search)
   * (estimated) selectivity
   * decoding cost (bytes probably a reasonable proxy)
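
   One way to combine those three signals (the scoring below is my own 
   sketch, not an existing DataFusion API) is a classic rank: estimated rows 
   removed per unit of evaluation-plus-decode cost, ordered highest first.

   ```rust
   // Hypothetical "initial heuristic" sketch: order predicates by estimated
   // rows removed per unit of evaluation + decode cost, highest first.
   struct PredicateEstimate {
       name: &'static str,
       selectivity: f64, // estimated fraction of rows *kept* (0.0..=1.0)
       eval_cost: f64,   // relative evaluation cost (int compare << string search)
       decode_cost: f64, // relative decode cost (bytes per value as a proxy)
   }

   impl PredicateEstimate {
       /// Rows removed per unit of total cost; higher means "run me earlier".
       fn rank(&self) -> f64 {
           (1.0 - self.selectivity) / (self.eval_cost + self.decode_cost)
       }
   }

   fn main() {
       let mut preds = vec![
           PredicateEstimate { name: "col > 10", selectivity: 0.5, eval_cost: 1.0, decode_cost: 4.0 },
           PredicateEstimate { name: "s LIKE '%foo%'", selectivity: 0.1, eval_cost: 20.0, decode_cost: 32.0 },
           PredicateEstimate { name: "flag = true", selectivity: 0.9, eval_cost: 1.0, decode_cost: 1.0 },
       ];
       // Highest rank first: the very selective but expensive LIKE ends up last.
       preds.sort_by(|a, b| b.rank().total_cmp(&a.rank()));
       for p in &preds {
           println!("{:<16} rank {:.3}", p.name, p.rank());
       }
   }
   ```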
   
   I also agree that we shouldn't "discard" any useful work - I am willing to 
   experiment with what we can enable/disable in order to minimize the 
   regressions (e.g. only add a flag for enabling it for join filters, or add 
   a flag for all three options but only disable join filters FTM) and/or 
   help with the adaptiveness / minimizing evaluation costs.
   
   And hopefully with the adaptiveness / other changes in place, we can also 
benefit from dynamic join filters in parquet pushdown :)

