adriangb opened a new pull request, #21752: URL: https://github.com/apache/datafusion/pull/21752
Currently we are either all or nothing applying filters as post-scan or row filters. This PR introduces basic intra-query tracking of filter selectivity and dynamically adjusts how we apply filters. We use heuristics to determine where to start filters and how to order them (hence this replaces the existing `reorder_filters` config) and then use runtime stats to move filters around. Importantly this dynamic optimization impacts the IO patterns: we don't just move the compute, we move the IO as well. When filters are applied as row filters we schedule reads one by one. For example, for a filter like `service_name = 'foo' and duration > 1` we would: 1. Read the entire `duration` column and filter based on it. 2. Compute a mask. 3. Read the service_name column applying this mask. 4. Create a final mask used to filter any projected rows. This means two round trips to object storage (at least), which may be worth it if `duration > 1` filters out 75% of rows and `service_name = 'foo'` filters out another 75%. But if `duration > 1` filters out 99% of rows we might as well just pull the remaining `service_name` rows when we read the projection data and filter in memory. Similarly if `service_name = 'foo'` filtered out 99% of the data we could apply `duration > 1` in memory. If they both filter 99% of the data it's a tossup but we definitely should not be doing 2 IOs, one of them is redundant. This is complimentary to the compute-only optimizations being discussed in https://github.com/apache/arrow-rs/pull/9659. I do think there's an important followup here (which will be visible in the benchmarks): we currently can only adapt between row groups. If the data files are single huge row groups (as in TPCDS, also what DuckDB writes by default) we may not have a chance to adapt properly. Fixing this would require some way to re-evaluate the strategy mid stream (easy to do in DataFusion) but then push a new strategy into `arrow-rs` (new projection, new row filters). Maybe we can do this just in DataFusion if we were able to keep track of which pages have been read or something, but it'd likely be smoother if integrated with arrow-rs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
