alamb opened a new issue, #22782:
URL: https://github.com/apache/datafusion/issues/22782

   ## Describe the bug
   
   The test 
`physical_optimizer::ensure_requirements::test_filter_over_multi_partition_sort_limit`
 (in `datafusion/core/tests/physical_optimizer/ensure_requirements.rs`) is 
**failing on `main`** in CI.
   
   Example failing run: 
https://github.com/apache/datafusion/actions/runs/27028299481/job/79775344825
   
   ### Failure output
   
   ```
   ---- 
physical_optimizer::ensure_requirements::test_filter_over_multi_partition_sort_limit
 stdout ----
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Snapshot Summary 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
   Snapshot: filter_over_multi_partition_sort_limit
   Source: datafusion/core/tests/physical_optimizer/ensure_requirements.rs:405
   
────────────────────────────────────────────────────────────────────────────────
   Expression: p1_str
   
────────────────────────────────────────────────────────────────────────────────
   -old snapshot
   +new results
   
────────────┬───────────────────────────────────────────────────────────────────
       1     1 │ GlobalLimitExec: skip=0, fetch=21
       2     2 │   SortPreservingMergeExec: [a@0 DESC]
       3     3 │     SortExec: expr=[a@0 DESC], preserve_partitioning=[true]
       4     4 │       FilterExec: true
       5       │-        MockMultiPartitionExec
             5 │+        RepartitionExec: partitioning=RoundRobinBatch(32), 
input_partitions=16
             6 │+          MockMultiPartitionExec
   
────────────┴───────────────────────────────────────────────────────────────────
   ```
   
   ## Root cause
   
   This is a **CPU-count-dependent (non-deterministic) snapshot test**, not a 
logic regression.
   
   The test macro builds its config with `ConfigOptions::default()`:
   
   ```rust
   // datafusion/core/tests/physical_optimizer/ensure_requirements.rs
   macro_rules! assert_ensure_requirements_plan {
       ($plan:expr, @ $snapshot:literal $(,)?) => {{
           let config = ConfigOptions::default();
           ...
   ```
   
   `ConfigOptions::default()` leaves `execution.target_partitions` resolving to 
`get_available_parallelism()` (the machine's CPU core count):
   
   ```rust
   // datafusion/common/src/config.rs:562
   pub target_partitions: usize, transform = 
ExecutionOptions::normalized_parallelism, default = get_available_parallelism()
   ```
   
   The test's source has **16** partitions (`MockMultiPartitionExec::new(16)`):
   
   - On a machine with **≤16 cores**, `target_partitions == 16`, the input 
already satisfies the target, so `EnsureRequirements` inserts **no** 
`RepartitionExec` — this matches the recorded snapshot. (Confirmed: passes 
locally on a 16-core machine.)
   - On the CI runner with **32 cores**, `target_partitions == 32 > 16`, so 
`EnsureRequirements` inserts `RepartitionExec: 
partitioning=RoundRobinBatch(32), input_partitions=16` — producing the diff 
above.
   
   The test was introduced in #21976 (commit `fb1c0f342f`), snapshotted on a 
≤16-core machine. It will fail on any runner with >16 cores. The commit 
`031360d3ec` (`feat: implement retract_batch for array_agg(DISTINCT)`, #22719) 
that the failure was first observed after is **unrelated** — it just happened 
to be the first run scheduled onto a higher-core runner.
   
   ## To Reproduce
   
   On a machine with more than 16 logical cores:
   
   ```bash
   cargo test -p datafusion --test core_integration -- 
physical_optimizer::ensure_requirements::test_filter_over_multi_partition_sort_limit
   ```
   
   Or force a higher target partition count.
   
   ## Expected behavior
   
   The test should be deterministic regardless of the host CPU core count.
   
   ## Suggested fix
   
   Set an explicit `target_partitions` in the 
`assert_ensure_requirements_plan!` macro (and/or any other config-building 
sites in this file that rely on `ConfigOptions::default()`), e.g.:
   
   ```rust
   let mut config = ConfigOptions::default();
   config.execution.target_partitions = /* fixed value, e.g. 16 or 8 */;
   ```
   
   so the inserted (or omitted) `RepartitionExec` does not depend on the 
runner's core count.
   
   ## Additional context
   
   - Affected file: 
`datafusion/core/tests/physical_optimizer/ensure_requirements.rs`
   - Introduced by: #21976
   - Other tests in the same file using `ConfigOptions::default()` may have the 
same latent CPU-count dependency and should be audited.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to