zhengruifeng opened a new pull request, #55762:
URL: https://github.com/apache/spark/pull/55762

   ### What changes were proposed in this pull request?
   
   Follow-up to 
[SPARK-56768](https://issues.apache.org/jira/browse/SPARK-56768) 
(apache/spark#55726), which introduced a shared `precompile` CI job that runs 
Spark's SBT build once and publishes the resulting `target/` trees as a GitHub 
Actions artifact for the pyspark matrix to consume. This PR extends that 
artifact to the JVM `build` matrix (the Scala/Java test entries: core, sql, 
hive, catalyst, mllib, streaming/connect, etc.).
   
   Each JVM build matrix entry today runs three SBT calls back-to-back:
   
   | Call | Function | Goals | Identical across entries? |
   |---|---|---|---|
   | 1 | `build_spark_sbt` | `Test/package + streaming-kinesis-asl-assembly/assembly + connect/assembly` | ✓ (same 11 profiles) |
   | 2 | `build_spark_assembly_sbt` | `assembly/package` | ✓ |
   | 3 | `run_scala_tests_sbt` | `<module>/test` (per-entry test goals + tag filters) | ✗ (varies) |
   
   I verified this against a recent `apache/spark` master run: all 9 JVM build 
matrix entries produce byte-equivalent compile output (same profile set, same 
goals; only the order in which `[info]` prints the profiles differs). Call 3 is 
the part that varies and stays as-is in this PR.
   
   ### Concrete changes
   
   - The `precompile` job's `if:` gate adds `build == 'true'` so the artifact 
is built whenever the JVM matrix runs, independent of pyspark changes.
   - The `build` matrix job adds `precompile` to `needs:` and uses `if: 
(!cancelled()) && ...` so it can still run when precompile was cancelled or 
skipped (its `if:` evaluated to false).
   - New "Download precompiled artifact" and "Extract precompiled artifact" 
steps with the same optional/fallback design used by the pyspark matrix:
     - `if: needs.precompile.result == 'success'` on download.
     - `continue-on-error: true` on both.
     - `if: steps.download-precompiled.outcome == 'success'` on extract.
   - Inside the existing "Run tests" bash block, `SKIP_SCALA_BUILD=true` is 
exported only when `steps.extract-precompiled.outcome == 'success'`. Otherwise 
the local SBT build path runs as before.
   - No `dev/run-tests.py` change is needed: the `SKIP_SCALA_BUILD` gate landed 
with SPARK-56768.
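   
   The gating chain above can be sketched in plain bash. This is a minimal 
model, not the workflow itself: `decide_build_path` and its three arguments are 
hypothetical stand-ins for `needs.precompile.result`, 
`steps.download-precompiled.outcome`, and `steps.extract-precompiled.outcome`, 
and the real steps are expressed as `if:` conditions in the workflow YAML.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: models how the three gates chain together.
# Prints "reuse" when the precompiled artifact can be used, else "local-build".
decide_build_path() {
  local precompile_result="$1" download_outcome="$2" extract_outcome="$3"

  # Download is attempted only when precompile succeeded; otherwise it is skipped.
  [[ "$precompile_result" == "success" ]] || download_outcome="skipped"

  # Extract is attempted only when the download succeeded.
  [[ "$download_outcome" == "success" ]] || extract_outcome="skipped"

  if [[ "$extract_outcome" == "success" ]]; then
    echo "reuse"
  else
    echo "local-build"
  fi
}

# Example: every stage succeeded, so the local SBT build can be skipped.
path="$(decide_build_path success success success)"
if [[ "$path" == "reuse" ]]; then
  export SKIP_SCALA_BUILD=true
  echo "Reusing precompiled artifact, skipping local SBT build."
fi
```

   Any other combination of inputs yields `local-build`, in which case 
`SKIP_SCALA_BUILD` stays unset and the original SBT build path runs.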
   
   ### Why is this safe? Workspace and Zinc behavior
   
   The precompile job and the JVM `build` matrix both run on bare 
`ubuntu-latest` (no container), which means:
   
   - Same workspace path (`/home/runner/work/spark/spark`) on both sides → Zinc 
analysis files in `target/streams/...` reference paths that exist in the 
consumer.
   - `tar -czf` / `tar -xzf` preserves mtimes by default, so when call 3 
(`<module>/test`) runs, SBT's incremental compiler sees that classes are newer 
than (or as new as) sources and skips recompilation.
   - Call 3 uses a smaller per-entry profile set than calls 1+2, but profiles 
drive *which projects are activated*; pre-built classes for the activated 
projects are present in `target/scala-2.13/{classes,test-classes}` from the 
precompile output, so SBT just runs the tests.
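   
   The mtime claim can be checked locally with a small round trip. This is an 
illustrative sketch only: the directory layout and the `Foo.class` file are 
made up for the demo, and it assumes a `tar` that restores modification times 
on extract (GNU tar does unless `-m`/`--touch` is passed).

```shell
#!/usr/bin/env bash
set -euo pipefail

work="$(mktemp -d)"
mkdir -p "$work/producer/target" "$work/consumer"

# Stand-in for a compiled class file with a known, old modification time.
touch -t 202001020000 "$work/producer/target/Foo.class"

# Producer side: pack the build output.
tar -czf "$work/artifact.tgz" -C "$work/producer" target

# Consumer side: unpack into a different directory.
tar -xzf "$work/artifact.tgz" -C "$work/consumer"

orig="$work/producer/target/Foo.class"
copy="$work/consumer/target/Foo.class"

# The mtime survived the round trip if neither file is newer than the other.
if [[ ! "$orig" -nt "$copy" && ! "$copy" -nt "$orig" ]]; then
  preserved=yes
else
  preserved=no
fi
echo "mtime preserved: $preserved"

rm -rf "$work"
```

   If the extracted file had instead been stamped with the extraction time, the 
`-nt` comparison would report the copy as newer and `preserved` would be `no`.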
   
   ### Optional: graceful fallback if precompile fails
   
   This follows the same pattern as the pyspark and sparkr extensions:
   
   - `precompile` sets `continue-on-error: true` (introduced in SPARK-56768), so 
a failed or cancelled precompile does not fail the workflow.
   - Download/extract have `continue-on-error: true` and skip their work if the 
upstream step didn't succeed.
   - The conditional export inside "Run tests" only sets 
`SKIP_SCALA_BUILD=true` when the extract succeeded; otherwise 
`dev/run-tests.py` runs the original local SBT build.
   
   So a precompile failure degrades the JVM matrix to the pre-PR behavior 
rather than failing the workflow.
   
   ### Estimated savings
   
   | Item | Per-run CI time |
   |---|---:|
   | Redundant SBT compile in JVM build matrix today (~9 entries × ~13m) | ~120m |
   | Add back: shared build (already amortized with pyspark) | 0 |
   | **Net CI compute saved per build-matrix run** | **~120m** |
   
   The JVM `build` matrix runs when `is-changed.py -m 
"core,unsafe,...,sql,hive,..."` returns true, i.e. on roughly any 
non-Python-only PR. On those PRs, this saving stacks on top of the ~96m already 
saved by SPARK-56768. The shared `precompile` job is also already running in 
those cases (pyspark or pyspark-pandas is almost always also changed), so the 
marginal cost of this PR is just the artifact download per JVM matrix entry 
(~30s each, ~9 entries ≈ ~5m).
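   
   As a back-of-the-envelope check of these figures, the arithmetic can be 
written out as below. The inputs are the approximate values quoted in this 
description (9 entries, ~13m of compile and ~30s of download per entry), not 
measurements, and the variable names are only for illustration. Note that 9 × 
13 is 117, which the table rounds to ~120m.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Approximate figures quoted in this description.
entries=9                  # JVM build matrix entries
compile_min_per_entry=13   # redundant SBT compile per entry, in minutes
download_sec_per_entry=30  # artifact download per entry, in seconds

gross_saved_min=$(( entries * compile_min_per_entry ))
download_cost_sec=$(( entries * download_sec_per_entry ))
net_saved_sec=$(( gross_saved_min * 60 - download_cost_sec ))

echo "gross saving:  ${gross_saved_min}m"
echo "download cost: ${download_cost_sec}s"
echo "net saving:    $(( net_saved_sec / 60 ))m (rounded down)"
```

   Under these assumptions the download overhead (270s, about 4.5m) is small 
next to the compile time it replaces.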
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. CI infrastructure change only.
   
   ### How was this patch tested?
   
   The change is exercised by the CI run of this PR itself when the build 
matrix runs. Each matrix entry's "Run tests" log should show `Reusing 
precompiled artifact, skipping local SBT build.` mirroring the pyspark 
behavior. The per-entry test phase (`<module>/test`) runs with the extracted 
classes already on disk; SBT's incremental check should not trigger 
recompilation. If the precompile artifact is unavailable, the matrix entry 
falls back to the local SBT build path identical to today's behavior.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Opus 4.7)

