Dandandan commented on issue #20324: URL: https://github.com/apache/datafusion/issues/20324#issuecomment-3904610347
> Thinking more about this, I wonder if we could model the choice of where to evaluate a filter as a dynamic filter 🤔
>
> Aka make two filters for each predicate
>
> * The one in the parquet reader
> * The one in a subsequent FilterExec
>
> And then have the pair somehow coordinate between themselves during execution to decide which to evaluate 🤔
>
> Though maybe that would be too localized (should take multiple predicates into account)

IMO there is not a large difference between evaluation in the Parquet reader and in `FilterExec`, and we should be able to optimize filter execution in the parquet reader (e.g. use masks more, https://github.com/apache/arrow-rs/issues/9416) to further minimize the cost.

For adaptive filters there mainly needs to be an option _not to evaluate_ them at all on all rows (either based on a heuristic or on the adaptiveness @adriangb is focusing on), as the evaluation cost can be higher than what is saved (for joins this is notable, but it can show up for aggregate / TopK filters as well).

For the adaptiveness to work well, the filters' effectiveness needs to be computed on several fronts.

Benefit:

* When pushing down as a `RowFilter`, we save the parquet decoding overhead (or more, in very effective cases) for the columns that follow, provided the predicate is more effective than the other candidates.
* When either pushing down or evaluating (e.g. in `FilterExec`), we save upstream work in hash repartition / join / aggregate / etc.

Cost:

* Evaluating _extra_ expressions not needed for result correctness (even ones as simple as `col > 10`, `col < 50`, etc.) can take considerable time itself, both as a `RowFilter` and in a `FilterExec`. This is of course more notable with joins, where it can lead to 5-10x slowdowns in the more extreme cases (e.g. nested / multi-predicate joins with non-effective filters).
* Pushing down `RowFilter`s with multiple columns means first paying the cost of decoding them all (based on the current selection), which can be considerable (so you could make "bad" choices wrt.
predicate order).

I think we probably also need to come up with a useful "initial" heuristic for ordering predicates that looks at:

* evaluation cost (e.g. integer comparison vs string search)
* (estimated) selectivity
* decoding cost (bytes are probably a reasonable proxy)

I also agree that we shouldn't "discard" any useful work. I am willing to experiment with what we can enable/disable in order to minimize the regressions (e.g. only add a flag for enabling it for join filters, or add a flag for all three options but only disable join filters for the moment) and/or help with the adaptiveness / minimizing evaluation costs.

And hopefully with the adaptiveness / other changes in place, we can also benefit from dynamic join filters in parquet pushdown :)
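To make the cost/benefit trade-off above concrete, here is a minimal sketch of per-filter adaptiveness: track how many rows a pushed-down predicate actually removes, and stop evaluating it once the observed ratio suggests the evaluation cost exceeds the savings. The type, the thresholds, and the `should_evaluate` name are all illustrative assumptions, not DataFusion or arrow-rs APIs.

```rust
// Sketch only: a per-filter tally that decides whether a pushed-down
// predicate is still worth evaluating, based on observed effectiveness.
struct AdaptiveFilter {
    rows_seen: u64,
    rows_filtered: u64,
    min_sample: u64,        // don't decide before seeing this many rows
    min_filter_ratio: f64,  // below this ratio, the filter isn't paying for itself
}

impl AdaptiveFilter {
    fn new() -> Self {
        Self { rows_seen: 0, rows_filtered: 0, min_sample: 10_000, min_filter_ratio: 0.05 }
    }

    fn record_batch(&mut self, batch_rows: u64, rows_removed: u64) {
        self.rows_seen += batch_rows;
        self.rows_filtered += rows_removed;
    }

    /// After a warm-up sample, skip evaluation if the filter removes
    /// too few rows to justify its own cost.
    fn should_evaluate(&self) -> bool {
        if self.rows_seen < self.min_sample {
            return true; // still sampling
        }
        (self.rows_filtered as f64 / self.rows_seen as f64) >= self.min_filter_ratio
    }
}

fn main() {
    let mut f = AdaptiveFilter::new();
    f.record_batch(20_000, 100); // removed only 0.5% of rows
    println!("keep evaluating? {}", f.should_evaluate());
}
```

A real implementation would of course weigh per-predicate evaluation cost rather than use a fixed ratio, but this shows the shape of the "option not to evaluate" decision.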
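The "initial" heuristic could be sketched as a single score combining the three factors: cheap, selective predicates over small columns run first, so later predicates see fewer rows. `PredicateStats` and `order_score` are hypothetical names for illustration; the cost and selectivity numbers would come from statistics or sampling in practice.

```rust
// Sketch only: rank candidate predicates by estimated (cost / rows removed)
// before building the RowFilter chain. Not DataFusion's actual API.
#[derive(Debug)]
struct PredicateStats {
    name: &'static str,
    selectivity: f64,   // estimated fraction of rows that PASS (lower = more effective)
    eval_cost: f64,     // relative evaluation cost (int compare vs string search)
    decode_bytes: f64,  // bytes that must be decoded to evaluate the predicate
}

impl PredicateStats {
    /// Lower score = evaluate earlier: pay little, remove many rows.
    fn order_score(&self) -> f64 {
        (self.eval_cost + self.decode_bytes) / (1.0 - self.selectivity).max(1e-9)
    }
}

fn order_predicates(mut preds: Vec<PredicateStats>) -> Vec<PredicateStats> {
    preds.sort_by(|a, b| a.order_score().partial_cmp(&b.order_score()).unwrap());
    preds
}

fn main() {
    let preds = vec![
        PredicateStats { name: "url LIKE '%x%'", selectivity: 0.5, eval_cost: 50.0, decode_bytes: 400.0 },
        PredicateStats { name: "id > 10",        selectivity: 0.1, eval_cost: 1.0,  decode_bytes: 8.0 },
    ];
    for p in order_predicates(preds) {
        println!("{} -> score {:.1}", p.name, p.order_score());
    }
}
```

The decoding-bytes term also captures the multi-column cost above: a predicate over several wide columns gets a high score and sinks to the back of the chain unless it is very selective.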
