adriangb commented on issue #22495: URL: https://github.com/apache/datafusion/issues/22495#issuecomment-4529522119
Closing: the current gate `has_statistics() || contains_dynamic_filter()` is correct as-is, and the two candidate refinements don't hold up.

- **"Non-terminated dynamic filter only" (i.e. `Watching`) would regress** the already-complete-at-open case. Example: `SELECT * FROM fact JOIN dim ON fact.part_col = dim.key`, with `fact` partitioned by `part_col`. The hash join's build side (`dim`) completes before the probe-side (`fact`) files open, so the filter is `AllComplete` at file-open. The completed filter (`part_col IN (...)`) still prunes `fact` files via partition-value folding, even with no column stats. Planning could not do this prune because the values were unknown at that point. A `Watching`-only gate would skip building that pruner and miss the prune. "Any dynamic filter (complete or not)" is required to catch it.
- **Structural per-conjunct analysis is unsound** for dynamic filters (their expression changes at runtime; it's typically `lit(true)` at open); see the correction above. The only sound residual optimization is to skip building when the dynamic filters reference *only data columns* and the file has *no* statistics (decided from the filter's fixed children, not its expression). That is marginal (stats-less Parquet is rare) and risks missing prunes if mis-judged.

Not worth tracking as an open issue; can revisit if a real workload shows the wasted builds matter.
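To make the comparison between the two gates concrete, here is a minimal sketch in Rust. It is not DataFusion's actual code: the `OpenContext` struct, the `DynamicFilterState` enum, and both `should_build_pruner*` methods are hypothetical names introduced for illustration. Only the `has_statistics() || contains_dynamic_filter()` shape and the `Watching` / `AllComplete` state names come from the comment.

```rust
/// Hypothetical model of a dynamic filter's state as observed at file-open.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DynamicFilterState {
    /// A producer (e.g. a hash-join build side) may still update the filter.
    Watching,
    /// All producers have finished; the filter holds its final expression.
    AllComplete,
}

/// Hypothetical summary of what the gate looks at for a single file.
#[derive(Clone, Copy, Debug)]
pub struct OpenContext {
    /// Whether the file carries column statistics.
    pub has_statistics: bool,
    /// State of the dynamic filter in the predicate, if there is one.
    pub dynamic_filter: Option<DynamicFilterState>,
}

impl OpenContext {
    /// True for any dynamic filter, complete or not.
    pub fn contains_dynamic_filter(&self) -> bool {
        self.dynamic_filter.is_some()
    }

    /// Shape of the current gate: statistics OR any dynamic filter.
    pub fn should_build_pruner(&self) -> bool {
        self.has_statistics || self.contains_dynamic_filter()
    }

    /// The rejected refinement: only non-terminated dynamic filters count.
    pub fn should_build_pruner_watching_only(&self) -> bool {
        self.has_statistics
            || self.dynamic_filter == Some(DynamicFilterState::Watching)
    }
}
```

For the `fact JOIN dim` example, a `fact` file with no statistics and an already-complete filter corresponds to `OpenContext { has_statistics: false, dynamic_filter: Some(DynamicFilterState::AllComplete) }`. Under these assumptions the current gate builds the pruner, while the `Watching`-only variant skips it and would miss the partition-value prune.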
