2010YOUY01 commented on issue #18856: URL: https://github.com/apache/datafusion/issues/18856#issuecomment-3587705416
> I think the _for stats pruning only_ is perhaps not the right distinction to make: what can and can't be used for stats pruning is going to vary by file format and changes over time. The distinction that the logical filter pushdown makes, and that maybe we should be making here, is:
>
> * I am going to use the filter, but I can't guarantee exact filtering aka `Inexact`. This usually means it _might_ be used for stats pruning but _will not_ be used for row-level pruning.
> * I am going to apply the filter exactly as `FilterExec` would i.e. `Exact`.
> * I won't use the filter at all i.e. `Unsupported`.
>
> But maybe I'm missing something... what would a node do with the information that another operator is going to use a filter "only for stats pruning"? Not produce the filter?

This extra 'stat-only' message is for the forward pass, like `HashJoinExec --(push down only for stats pruning, not row-by-row filtering)--> ParquetExec`. The backward pass `ParquetExec -> HashJoinExec` should still implement something like `exact/inexact/no`.

Here is an example where this extra forward-pass info helps:

```
select * from locations join spatial_objects on distance(locations.loc, spatial_objects.loc) < 10m;
```

We can calculate a spatial range from the build side, push it to the probe side, and use stats pruning to eliminate some data. Row-level filtering at the scanner is not worthwhile here: the spatial calculation is very expensive to perform row-wise, so it is wasteful to filter once in the scan and then evaluate the predicate again during join probing. Evaluating it only once, in the join, is better.

This idea is a bit ahead of where we are though, so we should implement it only when needed, not proactively right now.
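To make the two directions concrete, here is a rough Rust sketch. It is not DataFusion's API: `PushdownIntent`, `PushdownAnswer`, `RowGroupStats`, `RangeFilter`, and `Scan` are all hypothetical names, and a 1-D min/max range stands in for the spatial bound. The parent sends an intent down (forward pass), the scan replies with exact/inexact/unsupported (backward pass), and a stats-only filter prunes whole row groups without touching rows.

```rust
/// Hypothetical forward-pass hint: how the parent wants the filter to be used.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PushdownIntent {
    /// The scan may use the filter however it can, including row by row.
    Any,
    /// The scan should use the filter for statistics pruning only.
    StatsOnly,
}

/// Hypothetical backward-pass answer, mirroring exact/inexact/unsupported.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PushdownAnswer {
    Exact,
    Inexact,
    Unsupported,
}

/// Min/max statistics for one row group (1-D stand-in for a spatial bound).
struct RowGroupStats {
    min: f64,
    max: f64,
}

/// Range derived from the build side of the join (illustrative only).
struct RangeFilter {
    lo: f64,
    hi: f64,
}

/// A toy scan node; the two flags are assumptions about its capabilities.
struct Scan {
    row_groups: Vec<RowGroupStats>,
    can_filter_rows: bool,
    has_stats: bool,
}

impl Scan {
    /// Backward pass: report how the filter will be applied for this intent.
    fn answer(&self, intent: PushdownIntent) -> PushdownAnswer {
        match intent {
            // Row-level filtering happens only if the parent allows it.
            PushdownIntent::Any if self.can_filter_rows => PushdownAnswer::Exact,
            // Otherwise fall back to pruning on statistics, if there are any.
            _ if self.has_stats => PushdownAnswer::Inexact,
            _ => PushdownAnswer::Unsupported,
        }
    }

    /// Stats pruning: keep indices of row groups whose [min, max] overlaps the filter range.
    fn surviving_row_groups(&self, filter: &RangeFilter) -> Vec<usize> {
        self.row_groups
            .iter()
            .enumerate()
            .filter(|(_, rg)| rg.max >= filter.lo && rg.min <= filter.hi)
            .map(|(i, _)| i)
            .collect()
    }
}
```

With `StatsOnly`, a scan that could filter rows still answers `Inexact`, so the join knows it must evaluate the expensive predicate itself, exactly once.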
