Tamar-Posen commented on issue #19764:
URL: https://github.com/apache/datafusion/issues/19764#issuecomment-3742533363

   Yes, it's exactly about scale and metadata overhead.
   
   We operate on 100TB+ datasets with millions of Parquet files, where even 
reading file footers for standard pruning is costly. Our inverted index maps 
values directly to file references, avoiding metadata scans entirely.
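
To make the idea concrete, here is a minimal sketch of such an index, using only the Rust standard library. The `InvertedIndex` type and its methods are hypothetical names chosen for illustration; they are not DataFusion APIs and not our production implementation.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical inverted index: column value -> files containing that value.
#[derive(Default)]
pub struct InvertedIndex {
    postings: HashMap<String, BTreeSet<String>>,
}

impl InvertedIndex {
    /// Record that `file` contains `value` (assumed to happen at ingest time).
    pub fn insert(&mut self, value: &str, file: &str) {
        self.postings
            .entry(value.to_string())
            .or_default()
            .insert(file.to_string());
    }

    /// Files listed for `value`. This is a single map lookup; no Parquet
    /// footer is opened or parsed to answer it.
    pub fn files_for(&self, value: &str) -> BTreeSet<String> {
        self.postings.get(value).cloned().unwrap_or_default()
    }
}
```

A lookup for a value that was never indexed returns an empty set, so under the assumption that the index is complete, no file needs to be opened for that value.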
   
   We often join a small predicate table with a massive base table. Previously, 
we used manual sideways information passing so our custom TableProvider could 
use the index during planning.
   
   With DynamicFilters, this logic correctly moves to execution, but we lose 
the ability to trigger index-based pruning before files are opened.
   
   The goal is to use runtime DynamicFilter values to query the inverted index 
and skip opening most files.
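
As a rough illustration of that goal, the sketch below takes the key values a dynamic filter has collected at runtime (modeled here as a plain slice, not as a DataFusion type) and keeps only the candidate files the index lists for at least one of them. `Index` and `files_to_open` are hypothetical names, and the sketch assumes the index is complete, so a file the index does not list for any runtime value can be skipped safely.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical index shape: join-key value -> files containing it.
pub type Index = HashMap<String, Vec<String>>;

/// Keep only the candidate files that the index lists for at least one of
/// the runtime key values. Candidate order is preserved.
pub fn files_to_open(
    index: &Index,
    runtime_values: &[&str],
    candidates: &[&str],
) -> Vec<String> {
    // Union of the posting lists for every runtime value.
    let mut wanted: HashSet<&str> = HashSet::new();
    for value in runtime_values {
        if let Some(files) = index.get(*value) {
            wanted.extend(files.iter().map(String::as_str));
        }
    }
    // Everything not in the union is skipped without being opened.
    candidates
        .iter()
        .filter(|file| wanted.contains(**file))
        .map(|file| (*file).to_string())
        .collect()
}
```

With an empty set of runtime values the function returns no files, which matches the case where the small side of an inner join produced no keys.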


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

