zhengruifeng opened a new pull request, #55726:
URL: https://github.com/apache/spark/pull/55726

   ### What changes were proposed in this pull request?
   
   This PR adds a single shared `precompile-pyspark` CI job that runs Spark's 
SBT build once and uploads the resulting `target/` trees as a GitHub Actions 
artifact. The pyspark matrix (8 entries) and the sparkr job now consume that 
artifact instead of re-running the same SBT build themselves.
   
   Concretely:
   
   - New `precompile-pyspark` job in `.github/workflows/build_and_test.yml` 
runs the identical command those jobs run today:
     ```
     ./build/sbt -Phadoop-3 -Pyarn -Pspark-ganglia-lgpl -Phadoop-cloud -Phive \
       -Pkubernetes -Pjvm-profiler -Pkinesis-asl -Phive-thriftserver \
       -Pdocker-integration-tests -Pvolcano \
       Test/package streaming-kinesis-asl-assembly/assembly connect/assembly assembly/package
     ```
     It then tars every `target/` directory (excluding `./build/` and 
`./.git/`) with `zstd -3 -T0` and uploads the archive as 
`spark-compile-${{ github.run_id }}` with `retention-days: 1`, so storage is 
reclaimed within 24h.
   - The `pyspark` matrix job and the `sparkr` job add `precompile-pyspark` to 
`needs:`, download and extract the artifact before running tests, and set 
`SKIP_BUILD: true` in env.
   - `dev/run-tests.py` now skips `build_apache_spark` and 
`build_spark_assembly_sbt` when `SKIP_BUILD` is set, matching the existing 
`SKIP_UNIDOC` / `SKIP_MIMA` pattern.
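
The `SKIP_BUILD` gate from the last bullet can be sketched roughly as follows. This is a minimal illustration, not the actual code in `dev/run-tests.py`: the stub bodies, the `run_build_phase` helper, and the rule that any non-empty value counts as "set" are assumptions made for the example. Only the names `build_apache_spark`, `build_spark_assembly_sbt`, and `SKIP_BUILD` come from the description above.

```python
import os

calls = []  # records which build steps ran (illustration only)


def build_apache_spark(extra_profiles):
    # Stub standing in for the real build step in dev/run-tests.py.
    calls.append(("build_apache_spark", tuple(extra_profiles)))


def build_spark_assembly_sbt(extra_profiles):
    # Stub standing in for the real assembly step.
    calls.append(("build_spark_assembly_sbt", tuple(extra_profiles)))


def run_build_phase(extra_profiles, env=None):
    """Hypothetical helper: run both build steps unless SKIP_BUILD is set.

    Assumption: any non-empty SKIP_BUILD value means "skip".
    Returns True if the build ran, False if it was skipped.
    """
    env = os.environ if env is None else env
    if env.get("SKIP_BUILD"):
        print("SKIP_BUILD is set; reusing precompiled target/ directories")
        return False
    build_apache_spark(extra_profiles)
    build_spark_assembly_sbt(extra_profiles)
    return True
```

With `SKIP_BUILD: true` in the job env, a gate like this would leave both build steps untouched and rely entirely on the extracted artifact.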
   
   ### Why are the changes needed?
   
   Today, each of the 8 pyspark matrix jobs and the sparkr job runs the same 
~13m27s SBT compile independently. Across a single CI run that's roughly 127m 
of redundant compile time (or ~107m on runs where sparkr is skipped), against a 
per-run total of ~700m. This change deduplicates that work.
   
   Estimated savings (based on a recent run of [Build and 
test](https://github.com/zhengruifeng/spark/actions/runs/25432660319)):
   
   |                                | Sparkr skipped | Sparkr included |
   |---                             |---:|---:|
   | Redundant compile time today   | ~107m | ~127m |
   | Add back: shared build + xfer  | ~21m  | ~21m  |
   | **Net CI compute saved per run** | **~93m (~13.4% of total)** | **~106m (~14.3% of total)** |
   
   Workflow wall-clock time is roughly unchanged. Previously each matrix 
runner performed the build itself, in parallel with the others. Sharing it 
adds one serialized ~13m build ahead of the matrix, but every matrix runner, 
including the slowest, gets shorter by about the same amount, so the critical 
path stays within a few minutes of what it was.
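
The critical-path argument can be made concrete with a simplified model. The sketch below is illustrative only: the function name, the per-job test durations, and the transfer cost are hypothetical values chosen for the example, and the model ignores runner queueing. Only the ~13m27s build time is taken from the description above.

```python
def critical_path_minutes(build, tests, transfer=0.0, shared=False):
    """Return workflow wall-clock minutes under a simplified model.

    Per-job build: every matrix job builds and then tests, all in parallel,
    so the critical path is the build plus the slowest test run.
    Shared build: one build runs first, then each job pays a transfer cost
    (artifact upload, download, extract) before its tests start.
    """
    slowest = max(tests)
    if shared:
        return build + transfer + slowest
    return build + slowest


build = 13.45                # ~13m27s, from the description above
tests = [35.0, 48.0, 52.0]   # hypothetical per-job test durations (minutes)
transfer = 2.0               # hypothetical artifact transfer cost (minutes)

before = critical_path_minutes(build, tests)
after = critical_path_minutes(build, tests, transfer=transfer, shared=True)
print(f"before={before:.2f}m after={after:.2f}m delta={after - before:.2f}m")
```

Under this model the critical path grows only by the artifact transfer cost, while total compute drops because the build is no longer repeated in every job.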
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. CI infrastructure change only.
   
   ### How was this patch tested?
   
   The change is exercised by the CI run of this PR itself:
   - If `precompile-pyspark` succeeds and produces an artifact of reasonable 
size, the build phase works.
   - If the pyspark matrix and sparkr jobs complete normally on top of the 
downloaded artifact, the artifact is sufficient and `SKIP_BUILD` is correctly 
skipping the local compile.
   
   Two things to watch in the first run, where I'd appreciate reviewer 
attention:
   - **Artifact size.** Spark's combined `target/` is roughly 1-3 GB raw; 
expect ~400-800 MB after `zstd -3`. The "Package compile output" step prints 
the size with `ls -lh`. If it ever gets close to GHA's 10 GB per-artifact cap 
we should slim the find pattern (e.g., exclude `target/streams` and 
intermediate scaladoc).
   - **`zstd` in the test images.** The pyspark/sparkr Docker images need 
`zstd` for the extract step. If it isn't already present, we can add `apt-get 
install -y zstd` to `dev/spark-test-image/*/Dockerfile`. (The bare 
`ubuntu-latest` runner used by `precompile-pyspark` has it.)
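
To judge which directories are worth excluding if the artifact grows, a per-directory size report can help. The sketch below is a rough stand-in, not the workflow's actual packaging step: the helper names are hypothetical, and it assumes that only the top-level `build/` and `.git/` directories are skipped, as described above.

```python
import os


def find_target_dirs(root, excluded=("build", ".git")):
    """Yield each directory named `target` under root.

    Assumption: only the top-level directories listed in `excluded` are skipped.
    """
    for dirpath, dirnames, _ in os.walk(root):
        if dirpath == root:
            # Prune excluded top-level directories in place so os.walk skips them.
            dirnames[:] = [d for d in dirnames if d not in excluded]
        if "target" in dirnames:
            # Do not descend: anything nested is already part of this target/.
            dirnames.remove("target")
            yield os.path.join(dirpath, "target")


def dir_size_bytes(path):
    """Sum the sizes of regular files under path, ignoring symlinks."""
    total = 0
    for dirpath, _, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def report(root="."):
    """Print target/ directories largest first and return their sizes in bytes."""
    sizes = {d: dir_size_bytes(d) for d in find_target_dirs(root)}
    for d, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{size / 1e6:10.1f} MB  {d}")
    return sizes
```

Running `report(".")` from the repository root after a build would list the raw (pre-compression) size of each `target/` tree, which is a reasonable starting point for deciding what to drop from the archive.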
   
   The doctests in `dev/sparktestsupport/utils.py` continue to pass; no logic 
in `is-changed.py` or the module graph was changed.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Opus 4.7)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

