adriangb opened a new pull request, #21752:
URL: https://github.com/apache/datafusion/pull/21752

   Currently we are either all or nothing applying filters as post-scan or row 
filters.
   
   This PR introduces basic intra-query tracking of filter selectivity and 
dynamically adjusts how we apply filters.
   We use heuristics to determine where to start filters and how to order them 
(hence this replaces the existing `reorder_filters` config) and then use 
runtime stats to move filters around.
   
   Importantly this dynamic optimization impacts the IO patterns: we don't just 
move the compute, we move the IO as well. When filters are applied as row 
filters we schedule reads one by one. For example, for a filter like 
`service_name = 'foo' and duration > 1` we would:
   
   1. Read the entire `duration` column and filter based on it.
   2. Compute a mask.
   3. Read the service_name column applying this mask.
   4. Create a final mask used to filter any projected rows.
   
   This means two round trips to object storage (at least), which may be worth 
it if `duration > 1` filters out 75% of rows and `service_name = 'foo'` filters 
out another 75%. But if `duration > 1` filters out 99% of rows we might as well 
just pull the remaining `service_name` rows when we read the projection data 
and filter in memory. Similarly if `service_name = 'foo'` filtered out 99% of 
the data we could apply `duration > 1` in memory. If they both filter 99% of 
the data it's a tossup but we definitely should not be doing 2 IOs, one of them 
is redundant.
   
   This is complimentary to the compute-only optimizations being discussed in 
https://github.com/apache/arrow-rs/pull/9659.
   
   I do think there's an important followup here (which will be visible in the 
benchmarks): we currently can only adapt between row groups. If the data files 
are single huge row groups (as in TPCDS, also what DuckDB writes by default) we 
may not have a chance to adapt properly. Fixing this would require some way to 
re-evaluate the strategy mid stream (easy to do in DataFusion) but then push a 
new strategy into `arrow-rs` (new projection, new row filters). Maybe we can do 
this just in DataFusion if we were able to keep track of which pages have been 
read or something, but it'd likely be smoother if integrated with arrow-rs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to