comphead commented on PR #103: URL: https://github.com/apache/datafusion-site/pull/103#issuecomment-3275714842
> > This makes sense with the filter, but to get the min value for the filter we still need a full scan; that is something I'm still missing. Let's go ahead, yes, thanks for the explanations.
>
> Let's take the best case, which is
>
> * after reading the first batch from the first file, DataFusion has read the actual minimum value
>
> While it is true DataFusion now still needs to check all remaining files to ensure this is actually the minimum value, it **may** not have to actually open, read, and decode the rows in the file -- for example, it could potentially prune (skip) all remaining files using statistics. And even if it can't prune out the entire file, it may be able to prune row groups, or ranges of rows (if `pushdown_filters` is turned on).

Oh, I think I'm getting the picture now. So the filter is not derived only from the data itself (as I was told); it is a hybrid: data + Parquet statistics. That makes sense now: we treat the current value in the heap as a possibly approximate minimum and use it just to remove unnecessary reads, because that is still better than a full scan. In the best case we get the true min value from the first batch; in the worst case it should still be cheaper than a full scan.
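To make the hybrid idea concrete, here is a minimal sketch of the skip decision, using hypothetical names (`FileColumnStats`, `can_skip_file`) rather than DataFusion's actual API: once a candidate minimum has been observed in the data, any file whose footer statistics guarantee that every value is greater than or equal to that candidate cannot improve the result, so it can be skipped without decoding a single row.

```rust
/// Hypothetical per-file column statistics, as recorded in a Parquet
/// footer. (Illustrative only; not DataFusion's real types.)
struct FileColumnStats {
    /// Minimum value of the column across the whole file, if recorded.
    min: Option<i64>,
}

/// A file can be skipped when its statistics prove it contains no value
/// smaller than the best minimum observed so far.
fn can_skip_file(stats: &FileColumnStats, best_min_so_far: i64) -> bool {
    match stats.min {
        Some(file_min) => file_min >= best_min_so_far,
        None => false, // no statistics: must read the file to stay correct
    }
}

fn main() {
    // Suppose the first batch already produced the value 7.
    let best_min = 7;
    let files = [
        FileColumnStats { min: Some(10) }, // prunable: all values >= 10 > 7
        FileColumnStats { min: Some(3) },  // must be read: may contain < 7
        FileColumnStats { min: None },     // must be read: no stats
    ];
    for (i, f) in files.iter().enumerate() {
        println!("file {i}: skip = {}", can_skip_file(f, best_min));
    }
}
```

The same comparison applies one level down to row-group statistics, and with `pushdown_filters` enabled the dynamic predicate can also prune ranges of rows, so even files that survive the footer check may only be partially decoded.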
