[PR] [SPARK-57420][INFRA] Only generate TPC-DS data when required and check CPU compatibility early in benchmark workflow [spark]

via GitHub Fri, 12 Jun 2026 11:53:40 -0700


iemejia opened a new pull request, #56479:
URL: https://github.com/apache/spark/pull/56479


   ### What changes were proposed in this pull request?
   
   Two improvements to the benchmark workflow:
   
   1. **Skip TPC-DS data generation for non-TPCDS benchmarks.** Change 
`contains(inputs.class, '*')` to `inputs.class == '*'` so wildcard patterns 
like `*VectorizedDeltaReaderBenchmark` no longer trigger the expensive TPC-DS 
generation job (~5-10 min saved per run).
   
   2. **Add an early CPU model check step** that runs immediately after 
checkout, before compilation. The step prints the CPU model as a `::notice::` 
annotation for live visibility in the Actions UI and, when the `expected-cpu` 
input is set, fails fast if the runner CPU does not match it.
   
   ### Why are the changes needed?
   
   The benchmark workflow currently generates TPC-DS data (~5-10 min) for any 
benchmark run whose class pattern contains a wildcard, even when the benchmark 
class does not use TPC-DS data. This is because `contains(inputs.class, '*')` 
matches any class containing `*` (e.g., `*VectorizedDeltaReaderBenchmark`), not 
just the literal `*` (all benchmarks).
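
To illustrate the difference between the two conditions, here is a minimal shell sketch. The functions `old_condition` and `new_condition` are hypothetical stand-ins for the two GitHub Actions expressions; they model only the wildcard part of the check, not any other terms the real `if:` condition may contain.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that mimic the two workflow expressions in plain shell.

# Old check: contains(inputs.class, '*') is true whenever '*' appears anywhere.
old_condition() {
  case "$1" in
    *'*'*) echo true ;;
    *)     echo false ;;
  esac
}

# New check: inputs.class == '*' is true only for the bare wildcard.
new_condition() {
  if [ "$1" = '*' ]; then echo true; else echo false; fi
}

# Compare both checks on the bare wildcard and on a wildcard pattern.
for class in '*' '*VectorizedDeltaReaderBenchmark'; do
  printf '%s old=%s new=%s\n' "$class" \
    "$(old_condition "$class")" "$(new_condition "$class")"
done
```

Both checks accept the bare `*`, but only the old one accepts `*VectorizedDeltaReaderBenchmark`, which is the case that triggered unnecessary data generation.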
   
   Additionally, when benchmark results need to match a specific CPU (e.g., AMD 
EPYC 7763 for consistent comparisons against upstream baselines), there is no 
way to detect a CPU mismatch until the full benchmark completes (~20-30 min). 
The early CPU check allows the job to fail within seconds of starting if the 
runner does not match, saving significant time and compute.
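
A rough sketch of what such an early check could look like as a shell step follows. This is not the step from the PR: the `EXPECTED_CPU` environment variable (standing in for the `expected-cpu` input), the helper names, the use of `/proc/cpuinfo`, and the substring match are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical early CPU check; names and matching rule are assumptions.

# Succeeds when no expectation is set, or when the detected model
# contains the expected text as a substring.
cpu_matches() {
  local detected="$1" expected="$2"
  if [ -z "$expected" ]; then
    return 0
  fi
  case "$detected" in
    *"$expected"*) return 0 ;;
    *)             return 1 ;;
  esac
}

# Reads the first "model name" entry from /proc/cpuinfo (Linux only);
# prints "unknown" when the file or the field is missing.
detect_cpu() {
  local model=""
  if [ -r /proc/cpuinfo ]; then
    model="$(grep -m1 'model name' /proc/cpuinfo | cut -d: -f2- | sed 's/^ *//')"
  fi
  echo "${model:-unknown}"
}

CPU_MODEL="$(detect_cpu)"
# Workflow command that surfaces the CPU model as a notice annotation.
echo "::notice::Runner CPU: ${CPU_MODEL}"

# Fail fast only when an expectation was provided and does not match.
if ! cpu_matches "$CPU_MODEL" "${EXPECTED_CPU:-}"; then
  echo "::error::Expected CPU '${EXPECTED_CPU}' but runner has '${CPU_MODEL}'"
  exit 1
fi
```

With `EXPECTED_CPU` unset or empty, the script only prints the notice and exits successfully, which mirrors the default behavior described for an unset `expected-cpu`.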
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. This only affects the GHA benchmark workflow. Existing behavior is 
preserved when `expected-cpu` is not set (default).
   
   ### How was this patch tested?
   
   The workflow changes are self-contained in 
`.github/workflows/benchmark.yml`. Tested by inspection. The `expected-cpu` 
parameter is optional and defaults to empty (no check), preserving backward 
compatibility.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: OpenCode (Claude claude-opus-4.6)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


