adriangb commented on issue #3463:
URL: https://github.com/apache/datafusion/issues/3463#issuecomment-3708360151

   > I'm not familiar with how DF is handling this currently, but a selectivity 
estimate based approach at plan time might be a good place to start.
   
   The answer is: we are not. The only similar thing we do is use the column 
sizes (from Parquet metadata) to reorder the filters.
   
   I don’t think we have enough information to do anything useful from 
statistics (this is probably why we haven’t done so yet). But if arrow-rs at 
least exposed the selectivity of filters after each file is read (ideally 
after each batch?), we could collect runtime filter selectivity statistics 
and, as we open more files, adapt our approach using the options you 
described above. A further step would be for arrow-rs to allow us to 
rebuild/reshuffle our approach within a scan, but that may require more API 
churn. Adjusting between files should be pretty simple.
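   
   To make the between-files idea concrete, here is a rough sketch of what 
the bookkeeping could look like. Everything in it is hypothetical: 
`FilterStats`, `observe`, and `reorder_for_next_file` are names made up for 
illustration and are not existing DataFusion or arrow-rs APIs. It also assumes 
arrow-rs would report rows seen and rows passed per filter, which it does not 
do today.

```rust
// Hypothetical sketch: none of these types exist in DataFusion or arrow-rs.

/// Running counts for one pushed-down filter, accumulated across files.
#[derive(Debug, Clone)]
struct FilterStats {
    /// Human-readable label for the filter expression.
    name: String,
    /// Bytes of column data needed to evaluate the filter
    /// (assumed to come from Parquet metadata).
    column_bytes: u64,
    /// Rows the filter was evaluated on so far.
    rows_seen: u64,
    /// Rows that passed the filter so far.
    rows_passed: u64,
}

impl FilterStats {
    fn new(name: &str, column_bytes: u64) -> Self {
        Self {
            name: name.to_string(),
            column_bytes,
            rows_seen: 0,
            rows_passed: 0,
        }
    }

    /// Record what was observed for one file (or one batch).
    fn observe(&mut self, rows_seen: u64, rows_passed: u64) {
        self.rows_seen += rows_seen;
        self.rows_passed += rows_passed;
    }

    /// Fraction of rows that pass. Until there is runtime data,
    /// assume the filter removes nothing (1.0).
    fn selectivity(&self) -> f64 {
        if self.rows_seen == 0 {
            1.0
        } else {
            self.rows_passed as f64 / self.rows_seen as f64
        }
    }
}

/// Reorder filters before opening the next file: filters that let fewer
/// rows through go first; ties fall back to the smaller column size.
fn reorder_for_next_file(filters: &mut [FilterStats]) {
    filters.sort_by(|a, b| {
        a.selectivity()
            .total_cmp(&b.selectivity())
            .then(a.column_bytes.cmp(&b.column_bytes))
    });
}
```

   With no observations yet, every filter reports a selectivity of 1.0, so 
the ordering degrades to the column-size heuristic we already use. Once a 
file has been read, a filter that passed 1% of its rows would sort ahead of 
one that passed 90%, regardless of column size.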


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

