adriangb commented on issue #22495:
URL: https://github.com/apache/datafusion/issues/22495#issuecomment-4529522119

   Closing: the current gate `has_statistics() || contains_dynamic_filter()` is 
correct as-is, and the two candidate refinements don't hold up.
   
   - **"Non-terminated dynamic filter only" (i.e. `Watching`) would regress** 
the already-complete-at-open case. Example: `SELECT * FROM fact JOIN dim ON 
fact.part_col = dim.key`, `fact` partitioned by `part_col`. The hash join's 
build side (`dim`) completes before the probe-side (`fact`) files open, so the 
filter is `AllComplete` at file-open. The completed filter (`part_col IN 
(...)`) still prunes `fact` files via partition-value folding, even when the 
files have no column statistics. Planning could not have done this pruning 
because the `IN` values were unknown then. A `Watching`-only gate would skip 
building that pruner and miss the prune. A gate on "any dynamic filter 
(complete or not)" is required to catch it.
   - **Structural per-conjunct analysis is unsound** for dynamic filters (their 
expression changes at runtime; it's typically `lit(true)` at open) — see the 
correction above.
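
   A minimal sketch of the two gates side by side, to make the regression concrete. This is not DataFusion code: `FilterState`, `OpenContext`, and both `should_build_pruner*` methods are hypothetical stand-ins, and only the names `has_statistics`, `contains_dynamic_filter`, `Watching`, and `AllComplete` come from the discussion above.

```rust
/// Hypothetical stand-in for a dynamic filter's state at file-open time.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum FilterState {
    /// Still being updated (e.g. the hash join build side has not finished).
    Watching,
    /// Finalized: the filter expression will not change again.
    AllComplete,
}

/// Hypothetical summary of what is known when a file is opened.
#[derive(Clone, Debug)]
pub struct OpenContext {
    /// Whether the file carries column statistics.
    pub has_statistics: bool,
    /// One entry per dynamic filter present in the predicate.
    pub dynamic_filters: Vec<FilterState>,
}

impl OpenContext {
    /// True if the predicate contains any dynamic filter, complete or not.
    pub fn contains_dynamic_filter(&self) -> bool {
        !self.dynamic_filters.is_empty()
    }

    /// The current gate: build the pruner if the file has statistics
    /// or the predicate contains any dynamic filter.
    pub fn should_build_pruner(&self) -> bool {
        self.has_statistics || self.contains_dynamic_filter()
    }

    /// The rejected refinement: only count dynamic filters that are
    /// still `Watching`.
    pub fn should_build_pruner_watching_only(&self) -> bool {
        self.has_statistics
            || self
                .dynamic_filters
                .iter()
                .any(|s| *s == FilterState::Watching)
    }
}
```

   For the `fact JOIN dim` example, the context at file-open is `has_statistics: false` with a single `AllComplete` filter. The current gate returns `true`, while the `Watching`-only gate returns `false` and so never builds the pruner that would apply `part_col IN (...)`.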
   
   The only sound residual optimization — skip building when the dynamic 
filters reference *only data columns* and the file has *no* statistics (decided 
from the filter's fixed children, not its expression) — is marginal (stats-less 
Parquet is rare) and risks missing prunes if mis-judged. Not worth tracking as 
an open issue; can revisit if a real workload shows the wasted builds matter.
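
   For reference, the residual optimization could look roughly like the sketch below. All names here (`ColumnKind`, `DynamicFilterChildren`, `can_skip_pruner`) are hypothetical and not part of DataFusion; the sketch only assumes that each dynamic filter exposes a fixed list of the columns its children reference, and that each column can be classified as a data column or a partition column.

```rust
/// Hypothetical classification of a column referenced by a filter.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ColumnKind {
    /// Stored in the file; pruning on it needs file statistics.
    Data,
    /// Partition value known per file; prunable without statistics.
    Partition,
}

/// Hypothetical view of a dynamic filter's fixed children. The decision
/// reads these columns, not the filter's current expression, because the
/// expression changes at runtime.
#[derive(Debug)]
pub struct DynamicFilterChildren {
    pub columns: Vec<ColumnKind>,
}

/// Returns true when building the pruner cannot pay off: the file has no
/// statistics and every dynamic filter references only data columns.
pub fn can_skip_pruner(has_statistics: bool, filters: &[DynamicFilterChildren]) -> bool {
    !has_statistics
        && filters
            .iter()
            .all(|f| f.columns.iter().all(|c| *c == ColumnKind::Data))
}
```

   A single partition-column reference in any dynamic filter makes `can_skip_pruner` return `false`, which is the conservative direction: a wrong classification costs a wasted build rather than a missed prune.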


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

