zhengruifeng opened a new pull request, #56369:
URL: https://github.com/apache/spark/pull/56369

   ### What changes were proposed in this pull request?
   
   Compress the shared compile artifacts that CI passes between jobs with 
`zstd` instead of `gzip`, across the three reusable workflows that 
produce/consume them:
   
   - `build_and_test.yml` - `compile-artifact` (produced by `precompile`, 
consumed by the `build`, `pyspark`, `sparkr`, `tpcds-1g`, 
`docker-integration-tests` and `k8s-integration-tests` jobs)
   - `python_hosted_runner_test.yml` - `compile-artifact` (macOS / ARM Python 
matrix)
   - `maven_test.yml` - `compile-target` and `compile-m2-spark`
   
   Each producer pipes `tar` through the `zstd` binary, and each consumer 
decompresses with `zstd -dc`:
   
   ```bash
   # create
   ... | tar --null -cf - -T - | zstd -c -T0 > compile-artifact.tar.zst
   # extract
   zstd -dc compile-artifact.tar.zst | tar -xf -
   ```
   
   `zstd` is driven through the standalone binary rather than `tar --zstd` on 
purpose: GitHub's macOS runners ship bsdtar, whose `--zstd` hangs, so letting 
`tar` do plain (un)archiving and piping through the `zstd` binary keeps one 
portable idiom across Linux hosts, the container images, `ubuntu-24.04-arm` and 
macOS.
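   
   As an illustration only, the idiom above can be wrapped in two small helpers
   and round-tripped locally. The function names `pack_artifact` and
   `unpack_artifact`, the use of `find -print0` to build the file list, and the
   `-q` flag are assumptions made for this sketch; they are not taken from the
   workflows.
   
   ```bash
   # Hypothetical helpers (names are illustrative, not from the workflows).
   
   pack_artifact() {
     # $1 = output archive; remaining args = paths to search for files.
     local out="$1"; shift
     # NUL-delimited file list -> plain tar stream -> multi-threaded zstd.
     find "$@" -type f -print0 \
       | tar --null -cf - -T - \
       | zstd -q -c -T0 > "$out"
   }
   
   unpack_artifact() {
     # $1 = archive; $2 = existing destination directory.
     # zstd only decompresses; tar only unarchives.
     zstd -dc "$1" | tar -xf - -C "$2"
   }
   ```
   
   For example, `pack_artifact compile-artifact.tar.zst target` followed by
   `unpack_artifact compile-artifact.tar.zst /tmp/out` should recreate the files
   under `/tmp/out/target`. Because `tar` never sees a compression flag, the same
   two functions should behave the same with GNU tar and bsdtar.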
   
   ### Why are the changes needed?
   
   `zstd` compresses and (especially) decompresses much faster than `gzip` at a 
comparable or better ratio, and `-T0` parallelizes compression across all 
cores. These artifacts are produced once and downloaded by up to 8 downstream 
jobs per run, so faster (de)compression shortens the critical path of every 
matrix entry. The `zstd` binary is already present on all relevant runners and 
container images (the latter since SPARK-57278), so no new dependency is 
introduced.
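   
   One way to check the speed and ratio claim on a given machine is a rough
   side-by-side timing. This is a sketch, not a benchmark from this PR: the
   function name `compare_codecs` and the input tarball are hypothetical, and the
   numbers will depend on the runner and on the artifact contents.
   
   ```bash
   # Hypothetical helper: compare gzip and zstd on one uncompressed tarball.
   compare_codecs() {
     # $1 = path to an existing, uncompressed .tar file.
     local tarball="$1"
   
     echo "compress with gzip:"
     time gzip -c "$tarball" > "$tarball.gz"
     echo "compress with zstd -T0:"
     time zstd -q -c -T0 "$tarball" > "$tarball.zst"
   
     # Compressed sizes in bytes, for a rough ratio comparison.
     wc -c "$tarball.gz" "$tarball.zst"
   
     echo "decompress with gzip:"
     time gzip -dc "$tarball.gz" > /dev/null
     echo "decompress with zstd:"
     time zstd -dc "$tarball.zst" > /dev/null
   }
   ```
   
   Running `compare_codecs compile-artifact.tar` prints the wall-clock time of
   each step and the two compressed sizes, and leaves the `.gz` and `.zst` files
   next to the input.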
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. CI-only.
   
   ### How was this patch tested?
   
   - Local validation: all three workflows parse as YAML, and no `*.tar.gz` 
reference to these artifacts remains.
   - The artifacts are ephemeral per run (`retention-days: 1`, names keyed by 
`run_id`), so producer and consumer always run the same code; there is no 
cross-version compatibility concern.
   - The `build_and_test.yml` path is exercised by this PR's CI. The 
`maven_test.yml` and `python_hosted_runner_test.yml` paths run via their 
scheduled / `workflow_dispatch` callers and can be validated by dispatching 
those workflows on the fork.
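   
   The "no `*.tar.gz` reference remains" part of the local validation could be
   scripted roughly as follows. The helper name `check_no_gzip_refs` is
   hypothetical, and the pattern matches any `.tar.gz` string, which is broader
   than just the artifacts touched by this PR.
   
   ```bash
   # Hypothetical helper: fail if any given file still mentions a .tar.gz name.
   check_no_gzip_refs() {
     # Arguments: workflow files to scan.
     # grep prints file name and line number of every leftover reference.
     if grep -HnE '\.tar\.gz' "$@"; then
       return 1
     fi
     return 0
   }
   ```
   
   Assuming the three workflows sit under `.github/workflows/`, a call such as
   `check_no_gzip_refs .github/workflows/build_and_test.yml` returns 0 when the
   file is clean and otherwise prints the offending lines and returns 1.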
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Opus 4.8
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

