andygrove opened a new pull request, #1917:
URL: https://github.com/apache/datafusion-ballista/pull/1917

   # Which issue does this PR close?
   
   No dedicated issue; this strengthens the TPC-H CI added in #1913 by turning 
it
   into a correctness gate. The Q11 scale-factor threshold noted below is left 
for
   a separate follow-up.
   
   # Rationale for this change
   
   The TPC-H CI job runs all queries under AQE off and AQE on, but only checks 
that
   each query *runs without erroring* — it never checks results. A change that 
made
   Ballista return wrong rows (a broadcast/shuffle/serialization bug that drops,
   duplicates, or corrupts rows) would pass CI silently.
   
   This makes the job a correctness gate by comparing every Ballista result 
against
   single-process DataFusion over the same data. Ballista embeds DataFusion, so 
for
   a correct distributed execution the two must agree. This catches
   Ballista-specific divergence — the highest-value class — and is scale-factor
   agnostic (no committed answer files).
   
   # What changes are included in this PR?
   
   - Add a `--verify` flag to `tpch benchmark ballista`. For each query it runs 
the
     same SQL through a single-process DataFusion `SessionContext` over the same
     Parquet data and compares the result.
   - Add `compare_results`, replacing the old `assert_expected_results`. It 
returns
     a descriptive error (row-count, schema, or first differing cell) instead of
     panicking, so a mismatch fails the binary with a useful message. Decimal,
     integer, date, and string columns are compared exactly; only `Float32`/
     `Float64` columns use a small relative-or-absolute tolerance, since
     distributed vs single-process aggregation order differs and float `sum` is
     non-associative. `Utf8`/`Utf8View` (and the large/binary view variants) are
     treated as equal representations.
   - Fix multi-statement result capture: a query file may wrap the answer in
     setup/teardown statements (q15 creates and drops a view). The run loops now
     execute every statement but capture the result of the answer statement (the
     last `SELECT`/`WITH`), instead of reporting a trailing `DROP VIEW`. 
Applied to
     both the Ballista and DataFusion paths.
   - Wire `--verify` into `.github/workflows/tpch.yml` for both the AQE-off and
     AQE-on suites.
   
   Validated locally on TPC-H SF10 (1 scheduler, 1 executor, 4 task slots, 16
   partitions): all 21 queries (q16 omitted, as in CI) pass `--verify` under 
both
   AQE off and AQE on, including the ratio queries (q14/q17/q19/q22) under the 
float
   tolerance and the multi-statement q15.
   
   Known follow-up (out of scope here): q11's `HAVING` threshold literal 
`0.0001`
   is the SF1 value and is not scaled by `1/SF`, so it returns 0 rows at SF10 in
   both Ballista and single-process DataFusion. Since both engines agree, the
   cross-check passes; the fix belongs in the query file and will be handled
   separately.
   
   # Are there any user-facing changes?
   
   No. The change adds an opt-in `--verify` flag to the benchmark tool and 
enables
   it in CI; there are no changes to Ballista's runtime behavior or public APIs.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to