comphead commented on PR #103: URL: https://github.com/apache/datafusion-site/pull/103#issuecomment-3275714842
> > This makes sense with the filter, but to get the min value for the filter we still need a full scan; that is something I'm still missing. Let's go ahead, yes, thanks for the explanations.
>
> Let's take the best case, which is
>
> * after reading the first batch from the first file, DataFusion has read the actual minimum value
>
> While it is true DataFusion now still needs to check all remaining files to ensure this is actually the minimum value, it **may** not have to actually open, read, and decode the rows in the file -- for example, it could potentially prune (skip) all remaining files using statistics. And even if it can't prune out the entire file, it may be able to prune row groups, or ranges of rows (if `pushdown_filters` is turned on).

Oh, I think I'm getting the picture now. So the filter is not derived only from the data itself (as I was told); it is a hybrid: data + Parquet statistics. That makes sense now: we treat the current value in the heap as a possibly approximate minimum and use it just to remove unnecessary reads, because that is still better than a full scan. In the best case we get the true min value from the first batch; in the worst case it should still be cheaper than a full scan.
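To make the hybrid idea concrete, here is a minimal sketch of the skip decision, using hypothetical names (`FileColumnStats`, `can_skip_file`) rather than DataFusion's actual API: once a candidate minimum has been observed in the data, any file whose footer statistics guarantee that every value is greater than or equal to that candidate cannot improve the result, so it can be skipped without decoding a single row.

```rust
/// Hypothetical per-file column statistics, as recorded in a Parquet
/// footer. (Illustrative only; not DataFusion's real types.)
struct FileColumnStats {
    /// Minimum value of the column across the whole file, if recorded.
    min: Option<i64>,
}

/// A file can be skipped when its statistics prove it contains no value
/// smaller than the best minimum observed so far.
fn can_skip_file(stats: &FileColumnStats, best_min_so_far: i64) -> bool {
    match stats.min {
        Some(file_min) => file_min >= best_min_so_far,
        None => false, // no statistics: must read the file to stay correct
    }
}

fn main() {
    // Suppose the first batch already produced the value 7.
    let best_min = 7;
    let files = [
        FileColumnStats { min: Some(10) }, // prunable: all values >= 10 > 7
        FileColumnStats { min: Some(3) },  // must be read: may contain < 7
        FileColumnStats { min: None },     // must be read: no stats
    ];
    for (i, f) in files.iter().enumerate() {
        println!("file {i}: skip = {}", can_skip_file(f, best_min));
    }
}
```

The same comparison applies one level down to row-group statistics, and with `pushdown_filters` enabled the dynamic predicate can also prune ranges of rows, so even files that survive the footer check may only be partially decoded.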
