diegoQuinas opened a new issue, #22621:
URL: https://github.com/apache/datafusion/issues/22621

   ### Describe the bug
   
   The sqllogictest `push_down_filter_regression.slt` (added in #22150) is 
**flaky** in CI. The `DynamicFilter` *content* asserted by the `EXPLAIN 
ANALYZE` queries on `agg_dyn_single` is **not** deterministic, contrary to what 
the test's own comment claims.
   
   A recent CI run failed with:
   
   ```
   [SQL] EXPLAIN ANALYZE SELECT MIN(a), MAX(a) FROM agg_dyn_single;
   [Diff] (-expected|+actual)
   - predicate=DynamicFilter [ a@0 < 1 OR a@0 > 8 ], pruning_predicate=... 
a_min@0 < 1 ...
   + predicate=DynamicFilter [ a@0 < 3 OR a@0 > 8 ], pruning_predicate=... 
a_min@0 < 3 ...
   at datafusion/sqllogictest/test_files/push_down_filter_regression.slt:330
   ```
   
   ### Root cause
   
   The test data is split across two files:
   
   - `file_0` → `(5), (1)`: partial `min` = **1** (the global minimum), partial `max` = **5**
   - `file_1` → `(3), (8)`: partial `min` = **3**, partial `max` = **8** (the global maximum)
   
   The comment above the queries states:
   
   > Pruning metrics here are subject to a parallel-execution race (the order 
in which Partial aggregates publish filter updates vs. when the scan reads each 
partition), so the filter **content** is deterministic but the pruning counts 
are not.
   
   That assumption is incorrect. The dynamic filter threshold tightens as each 
`AggregateExec(mode=Partial)` publishes its running `min`/`max`. `EXPLAIN 
ANALYZE` captures a **snapshot** of the filter's state. The *same* race the 
comment acknowledges for the pruning counts also affects the filter 
**content**: if the snapshot is taken after `file_1` has published its partial 
min (`3`) but before `file_0` publishes the global min (`1`), the filter reads 
`a < 3` instead of the final `a < 1`. The `MAX` side already shows its final 
value (`> 8`) in that snapshot because `file_1` also holds the global maximum.
   
   So the filter content is an intermediate value of a converging filter, and 
which value is observed depends on partition scheduling — exactly the 
non-determinism the comment attributes only to the counts.
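
To make the race concrete, here is a minimal Python sketch that enumerates which filter strings a snapshot could observe. It is a simplified model, not DataFusion code: it assumes each partial aggregate publishes exactly once with its per-file `min`/`max`, and that a snapshot can land after any publish. The names `FILES` and `snapshots` are illustrative only.

```python
from itertools import permutations

# Per-file values taken from the issue; the dict itself is illustrative.
FILES = {"file_0": [5, 1], "file_1": [3, 8]}


def snapshots(publish_order):
    """Yield the filter text visible after each partial aggregate publishes.

    Assumption: one publish per file, carrying that file's min and max.
    """
    lo, hi = None, None
    for name in publish_order:
        values = FILES[name]
        lo = min(values) if lo is None else min(lo, min(values))
        hi = max(values) if hi is None else max(hi, max(values))
        yield f"a@0 < {lo} OR a@0 > {hi}"


# Collect every filter text reachable under any publish order.
observed = set()
for order in permutations(FILES):
    observed.update(snapshots(order))

for text in sorted(observed):
    print(text)
```

Under these assumptions the model yields three distinct filter strings, including both the expected `a@0 < 1 OR a@0 > 8` and the `a@0 < 3 OR a@0 > 8` seen in the failing CI run.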
   
   ### To Reproduce
   
   This is hard to reproduce deterministically because it is a thread-scheduling race; 
it surfaces intermittently in CI. The failing assertions are the 
`agg_dyn_single` `EXPLAIN ANALYZE` queries in 
`datafusion/sqllogictest/test_files/push_down_filter_regression.slt` (around 
line 330).
   
   ### Expected behavior
   
   The test should be stable across runs and not depend on the order in which 
partial aggregates publish their filter updates.
   
   ### Additional context
   
   Possible directions (open to maintainer preference):
   
   1. Assert only on the **shape** of the dynamic filter (e.g. that a 
`DynamicFilter` is present with the right column/structure) rather than its 
converged threshold value.
   2. Force a single partition / deterministic scan order for these specific 
queries so the filter is guaranteed to be fully converged at snapshot time.
   3. Use data where every file shares the same per-file min/max so any 
intermediate snapshot equals the final value.
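
As a rough illustration of options 1 and 3, the Python sketch below masks the numeric thresholds so that only the filter's shape is compared, and checks a candidate data layout for identical per-file bounds. The helper `filter_shape`, the `<N>` placeholder, and the replacement data are hypothetical; how such masking would be wired into the sqllogictest runner is not shown here.

```python
import re


def filter_shape(line):
    """Replace integer thresholds on column a@0 with a placeholder."""
    return re.sub(r"(a@0 [<>]) -?\d+", r"\1 <N>", line)


# Option 1: the flaky and the converged output share the same shape.
flaky = "predicate=DynamicFilter [ a@0 < 3 OR a@0 > 8 ]"
final = "predicate=DynamicFilter [ a@0 < 1 OR a@0 > 8 ]"
same_shape = filter_shape(flaky) == filter_shape(final)

# Option 3: hypothetical data in which every file has the same min and max,
# so any intermediate snapshot already equals the final filter.
candidate_files = {"file_0": [1, 8], "file_1": [8, 1]}
per_file_bounds = {(min(v), max(v)) for v in candidate_files.values()}
bounds_are_uniform = len(per_file_bounds) == 1

print(filter_shape(flaky))
print(same_shape, bounds_are_uniform)
```

Masking trades precision for stability: the test would no longer catch a wrong threshold, only a missing or restructured filter. Option 3 keeps the exact assertion but constrains the test data.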
   
   Introduced in #22150. Happy to open a PR once there's agreement on the 
preferred approach.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

