zhengruifeng opened a new pull request, #56369: URL: https://github.com/apache/spark/pull/56369
### What changes were proposed in this pull request?

Compress the shared compile artifacts that CI passes between jobs with `zstd` instead of `gzip`, across the three reusable workflows that produce/consume them:

- `build_and_test.yml` - `compile-artifact` (produced by `precompile`, consumed by the `build`, `pyspark`, `sparkr`, `tpcds-1g`, `docker-integration-tests` and `k8s-integration-tests` jobs)
- `python_hosted_runner_test.yml` - `compile-artifact` (macOS / ARM python matrix)
- `maven_test.yml` - `compile-target` and `compile-m2-spark`

Each producer pipes `tar` through the `zstd` binary, and each consumer decompresses with `zstd -dc`:

```bash
# create
... | tar --null -cf - -T - | zstd -c -T0 > compile-artifact.tar.zst

# extract
zstd -dc compile-artifact.tar.zst | tar -xf -
```

`zstd` is driven through the standalone binary rather than `tar --zstd` on purpose: GitHub's macOS runners ship bsdtar, whose `--zstd` hangs, so letting `tar` do plain (un)archiving and piping through the `zstd` binary keeps one portable idiom across Linux hosts, the container images, `ubuntu-24.04-arm` and macOS.

### Why are the changes needed?

`zstd` compresses and (especially) decompresses much faster than `gzip` at a comparable or better ratio, and `-T0` parallelizes compression across all cores. These artifacts are produced once and downloaded by up to 8 downstream jobs per run, so faster (de)compression shortens the critical path of every matrix entry.

The `zstd` binary is already present on all relevant runners and container images (the latter since SPARK-57278), so no new dependency is introduced.

### Does this PR introduce _any_ user-facing change?

No. CI-only.

### How was this patch tested?

- Local validation: all three workflows parse as YAML, and no `*.tar.gz` reference to these artifacts remains.
- The artifacts are ephemeral per run (`retention-days: 1`, names keyed by `run_id`), so producer and consumer always run the same code; there is no cross-version compatibility concern.
- The `build_and_test.yml` path is exercised by this PR's CI. The `maven_test.yml` and `python_hosted_runner_test.yml` paths run via their scheduled / `workflow_dispatch` callers and can be validated by dispatching those workflows on the fork.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
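
For reviewers who want to try the create/extract idiom from the description outside CI, the following is a minimal local round-trip sketch. It is not taken from the workflows: the temporary directory layout, the `A.class` stand-in file, and the use of `find -print0` to feed the NUL-separated file list (the part elided as `...` in the description) are all assumptions made for illustration. It needs `tar`, `zstd`, `find` and `cmp` on the `PATH`.

```bash
#!/usr/bin/env bash
# Round-trip sketch of the "tar | zstd" idiom. All paths and file names
# below are hypothetical stand-ins, not the ones used by the workflows.
set -euo pipefail

workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT

# Hypothetical stand-in for compiled output.
mkdir -p "$workdir/src/target/classes" "$workdir/out"
printf 'hello\n' > "$workdir/src/target/classes/A.class"

# create: NUL-separated file list -> plain tar -> standalone zstd on all cores
# (find -print0 is an assumed producer for the file list)
(
  cd "$workdir/src"
  find . -type f -print0 \
    | tar --null -cf - -T - \
    | zstd -c -T0 > "$workdir/compile-artifact.tar.zst"
)

# extract: standalone zstd -> plain tar
(
  cd "$workdir/out"
  zstd -dc "$workdir/compile-artifact.tar.zst" | tar -xf -
)

# verify that the extracted file is byte-identical to the original
cmp "$workdir/src/target/classes/A.class" "$workdir/out/target/classes/A.class"
echo "round trip OK"
```

Because `tar` only does plain (un)archiving here and compression is delegated to the `zstd` binary, the same script should behave the same with GNU tar and bsdtar, which is the portability argument made in the description.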
