diegoQuinas opened a new issue, #22621: URL: https://github.com/apache/datafusion/issues/22621
### Describe the bug

The sqllogictest `push_down_filter_regression.slt` (added in #22150) is **flaky** in CI. The `DynamicFilter` *content* asserted by the `EXPLAIN ANALYZE` queries on `agg_dyn_single` is **not** deterministic, contrary to what the test's own comment claims.

A recent CI run failed with:

```
[SQL] EXPLAIN ANALYZE SELECT MIN(a), MAX(a) FROM agg_dyn_single;
[Diff] (-expected|+actual)
- predicate=DynamicFilter [ a@0 < 1 OR a@0 > 8 ], pruning_predicate=... a_min@0 < 1 ...
+ predicate=DynamicFilter [ a@0 < 3 OR a@0 > 8 ], pruning_predicate=... a_min@0 < 3 ...
at datafusion/sqllogictest/test_files/push_down_filter_regression.slt:330
```

### Root cause

The test data is split across two files:

- `file_0` → `(5), (1)`: partial `min` = **1** (the global minimum)
- `file_1` → `(3), (8)`: partial `min` = **3**, partial `max` = **8** (the global maximum)

The comment above the queries states:

> Pruning metrics here are subject to a parallel-execution race (the order in which Partial aggregates publish filter updates vs. when the scan reads each partition), so the filter **content** is deterministic but the pruning counts are not.

That assumption is incorrect. The dynamic filter threshold tightens as each `AggregateExec(mode=Partial)` publishes its running `min`/`max`, and `EXPLAIN ANALYZE` captures a **snapshot** of the filter's state. The *same* race the comment acknowledges for the pruning counts also affects the filter **content**: if the snapshot is taken after `file_1` has published its partial min (`3`) but before `file_0` publishes the global min (`1`), the filter reads `a < 3` instead of the final `a < 1`. The `MAX` side (`> 8`) had already converged, because `file_1` also holds the global maximum.

So the asserted filter content is an intermediate value of a converging filter, and which value is observed depends on partition scheduling. That is exactly the non-determinism the comment attributes only to the counts.
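
The race can be made concrete with a small toy model. This is only a sketch, not DataFusion code: it assumes each partial aggregate publishes exactly once, with its final per-file `min`/`max`, and that a snapshot can land after any publish. The names `FILES`, `render`, and `observable_filters` are hypothetical and exist only for this illustration; the per-file values are the ones listed above.

```python
from itertools import permutations

# Per-file values as listed in the issue.
FILES = {
    "file_0": [5, 1],
    "file_1": [3, 8],
}


def render(lo, hi):
    """Format the bounds in the style of the EXPLAIN ANALYZE output."""
    return f"DynamicFilter [ a@0 < {lo} OR a@0 > {hi} ]"


def observable_filters(files):
    """Collect every filter text a snapshot could see in this toy model.

    Assumption: one publish per file, carrying that file's final min/max,
    and a snapshot may be taken right after any publish.
    """
    seen = set()
    for order in permutations(files):
        lo, hi = None, None
        for name in order:
            values = files[name]
            # The shared bounds only ever tighten as partials publish.
            lo = min(values) if lo is None else min(lo, min(values))
            hi = max(values) if hi is None else max(hi, max(values))
            seen.add(render(lo, hi))
    return seen


if __name__ == "__main__":
    for text in sorted(observable_filters(FILES)):
        print(text)
```

Under these assumptions the model yields three distinct filter texts: the converged `a@0 < 1 OR a@0 > 8`, plus `a@0 < 3 OR a@0 > 8` (only `file_1` has published, which matches the CI failure) and `a@0 < 1 OR a@0 > 5` (only `file_0` has published).
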
### To Reproduce

Hard to reproduce deterministically because it is a thread-scheduling race; it surfaces intermittently in CI. The failing assertions are the `agg_dyn_single` `EXPLAIN ANALYZE` queries in `datafusion/sqllogictest/test_files/push_down_filter_regression.slt` (around line 330).

### Expected behavior

The test should be stable across runs and not depend on the order in which partial aggregates publish their filter updates.

### Additional context

Possible directions (open to maintainer preference):

1. Assert only on the **shape** of the dynamic filter (e.g. that a `DynamicFilter` is present with the right column/structure) rather than its converged threshold value.
2. Force a single partition / deterministic scan order for these specific queries so the filter is guaranteed to be fully converged at snapshot time.
3. Use data where every file shares the same per-file min/max so any intermediate snapshot equals the final value.

Introduced in #22150. Happy to open a PR once there's agreement on the preferred approach.
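
As a rough illustration of what direction 1 could mean, the sketch below masks the numeric thresholds in a `DynamicFilter` predicate so that only its shape is compared. It is a standalone example, not a sqllogictest feature; the helper name `filter_shape` and the `?` placeholder are assumptions made for this illustration, and the regular expression only handles integer literals like those in the failing output.

```python
import re

# Matches "<column>@<index> <op> <integer>", e.g. "a@0 < 3".
# Group 1 keeps the column and operator; the literal is what gets masked.
_THRESHOLD = re.compile(r"(\b\w+@\d+\s*(?:<=|>=|<|>)\s*)-?\d+")


def filter_shape(plan_text):
    """Replace integer thresholds with '?' so only the structure remains."""
    return _THRESHOLD.sub(r"\g<1>?", plan_text)


if __name__ == "__main__":
    expected = "predicate=DynamicFilter [ a@0 < 1 OR a@0 > 8 ]"
    actual = "predicate=DynamicFilter [ a@0 < 3 OR a@0 > 8 ]"
    # Both lines reduce to the same shape even though the thresholds differ.
    print(filter_shape(expected))
    print(filter_shape(expected) == filter_shape(actual))
```

With this kind of masking, the expected and actual lines from the CI failure compare equal, while a missing `DynamicFilter` or a different column would still be caught.
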
