adriangb commented on issue #20324: URL: https://github.com/apache/datafusion/issues/20324#issuecomment-3902753597
> I think [#19639](https://github.com/apache/datafusion/pull/19639) is somewhat similar in goal to [apache/arrow-rs#9414](https://github.com/apache/arrow-rs/pull/9414) or do we expect a large difference? I haven't looked at your PR yet (very exciting!) but I think the approaches are indeed similar. I thought about it for a bit and (pending unknowns to me) the tradeoffs are that the arrow version can be dynamic within a file scan (useful for small queries?) while the DataFusion version can be dynamic between files or even between queries. The DataFusion version may also make it easier to use the same code for multiple file formats and doesn't require any API changes or cross-crate syncronization. We could also do a version that tracks stats across files / queries in DataFusion and feeds that into the arrow machinery that then applies it. Regarding the dynamic filter specific selectivity tracking: the difference with these is that we can completely disable / discard them if needed without loosing correctness. It might be possible to merge the efforts if we e.g. add `PhysicalExpr::is_discardable_filter() -> bool` or something, then the more general adaptive selectivity machinery can choose to discard the filter instead of just putting it last or something. I wonder if optimizer rules could derive "discardable" filters from more complex filters e.g. `col in ('large string', 'another')` -> `AdaptiveSelectivity [ col < 'larh` and col > 'anou` ] and col in ('large string', 'another')`. Some other general thoughts: - we should take into account compute needed for calculating the filters. So I think units of the thresholds should be bytes/second (bytes = amount of data discarded, seconds = time taken to compute the filter mask). - We can derive statistical correlations between the filters to do things like group correlated filters together to avoid creating useless masks, e.g. `name = 'adrian' and department = 'engineering' and role ilike '%database%'` could be grouped into `name = 'adrian'` (first, very selective) and then `department = 'engineering' and role ilike '%database%'` (as a single filter because they are correlated). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
