geoffreyclaude opened a new pull request, #21958:
URL: https://github.com/apache/datafusion/pull/21958

   ## Which issue does this PR close?
   
   - None.
   
   ## Rationale for this change
   
   The `output_rows_skew` sqllogictest asserts exact per-partition scan row 
distributions. That test was added before dynamic `FileStream` work scheduling, 
where reorderable sibling scan streams can now share one queue of unopened 
files.
   
   With shared work scheduling enabled, the `DataSourceExec` explain output can 
still show four file groups, but the runtime `output_rows` metrics are 
attributed to whichever output partition actually scans each file. That makes 
the expected `[4, 0, 1, 0]` distribution nondeterministic: CI has observed the 
expected `output_rows_skew=84.31%` becoming `output_rows_skew=100%` when one 
stream scans all non-empty files.
   
   This happened on PR #21927 and also reproduced independently on `main` in 
`verify benchmark results (amd64)`, so it is a pre-existing test flake rather 
than a regression from that PR.
   
   ## What changes are included in this PR?
   
   This PR adds `WITH ORDER (x)` to the temporary external Parquet tables used 
by the `output_rows_skew` sqllogictest. An output ordering makes the scan 
order-preserving, which disables shared file work stealing for these tables and 
keeps the test file groups local to their output partitions.
   
   This keeps the exact skew assertions meaningful without relaxing the 
expected metric value.
   
   ## Are these changes tested?
   
   - `cargo test --features backtrace,parquet_encryption,substrait --profile ci 
--package datafusion-sqllogictest --test sqllogictests -- explain_analyze.slt 
--test-threads=1`
   - `cargo fmt --all`
   - `cargo clippy --workspace --all-targets --all-features --exclude 
datafusion-benchmarks -- -D warnings`
   
   I also attempted the repository instruction `cargo clippy --all-targets 
--all-features -- -D warnings`, but it currently fails in 
`datafusion-benchmarks` before reaching this change because `--all-features` 
enables the mutually exclusive `snmalloc` and `mimalloc` allocator features.
   
   ## Are there any user-facing changes?
   
   No. This is a sqllogictest-only determinism fix.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to