zhengruifeng opened a new pull request, #55761:
URL: https://github.com/apache/spark/pull/55761

   ### What changes were proposed in this pull request?
   
   Follow-up to 
[SPARK-56768](https://issues.apache.org/jira/browse/SPARK-56768) 
(apache/spark#55726), which introduced a shared `precompile` CI job that runs 
Spark's SBT build once and publishes the resulting `target/` trees as a GitHub 
Actions artifact for the pyspark matrix entries to consume. This PR extends 
that same artifact to the `sparkr` build.
   
   Concretely:
   
   - The `precompile` job's `if:` gate now also fires when the precondition 
output has `sparkr == 'true'`, so the artifact is also built when sparkr is the 
only changed module.
   - The `sparkr` job adds `precompile` to `needs:`, downloads and extracts the 
artifact (with the same graceful fallback as the pyspark matrix), and exports 
`SKIP_SCALA_BUILD=true` for `dev/run-tests.py` only when the artifact was 
successfully extracted.
   - No `dev/run-tests.py` change is needed — the `SKIP_SCALA_BUILD` gate 
landed with SPARK-56768.
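   
   A minimal sketch of the two job-level changes listed above. The `fromJson(needs.precondition.outputs.required)` expression, the `pyspark` clause, and the `(!cancelled())` guard are assumptions about how the surrounding workflow is written, not lines copied from the diff; job bodies are elided.
   
   ```yaml
   jobs:
     precompile:
       needs: precondition
       # Assumed shape of the gate: build the artifact when either consumer is enabled.
       if: >-
         (!cancelled()) && (
           fromJson(needs.precondition.outputs.required).pyspark == 'true' ||
           fromJson(needs.precondition.outputs.required).sparkr == 'true'
         )
       # ... build steps and artifact upload elided ...

     sparkr:
       # precompile is the newly added dependency.
       needs: [precondition, precompile]
       # Assumption: a status-check function in `if:` keeps this job from being
       # skipped when precompile fails, which the fallback relies on.
       if: (!cancelled()) && fromJson(needs.precondition.outputs.required).sparkr == 'true'
       # ... steps elided ...
   ```
   
   Without a status-check function in its `if:`, GitHub Actions would skip the `sparkr` job whenever a job listed in `needs:` failed, so a guard of this kind is presumably what lets the fallback path run at all.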
   
   ### Optional: graceful fallback if precompile fails
   
   Same pattern as the pyspark matrix:
   
   - The "Download precompiled artifact" step is gated on 
`needs.precompile.result == 'success'` and has `continue-on-error: true`.
   - The "Extract precompiled artifact" step is gated on the download 
succeeding and also has `continue-on-error: true`.
   - Inside the "Run tests" bash block, `SKIP_SCALA_BUILD=true` is exported 
only when `steps.extract-precompiled.outcome == 'success'`. Otherwise it stays 
unset and `dev/run-tests.py` falls back to the original local SBT build.
   
   So a precompile/download/extract failure degrades sparkr to the pre-PR 
behavior rather than failing the workflow.
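   
   The three steps described above could look roughly like the following sketch. The step names, the `extract-precompiled` id, the `needs.precompile.result` gate, and `continue-on-error: true` come from this description; the `download-precompiled` id, the artifact name, the paths, the archive file name, and the placement of the `echo` are hypothetical placeholders.
   
   ```yaml
   steps:
     - name: Download precompiled artifact
       id: download-precompiled            # hypothetical step id
       if: needs.precompile.result == 'success'
       continue-on-error: true
       uses: actions/download-artifact@v4
       with:
         name: precompiled-artifact        # hypothetical artifact name
         path: precompiled                 # hypothetical download directory

     - name: Extract precompiled artifact
       id: extract-precompiled
       if: steps.download-precompiled.outcome == 'success'
       continue-on-error: true
       run: tar -xzf precompiled/precompiled.tar.gz   # hypothetical archive name

     - name: Run tests
       run: |
         # Export the flag only if extraction really succeeded; otherwise
         # leave it unset so dev/run-tests.py does the local SBT build.
         if [[ "${{ steps.extract-precompiled.outcome }}" == "success" ]]; then
           echo "Reusing precompiled artifact, skipping local SBT build."
           export SKIP_SCALA_BUILD=true
         fi
         ./dev/run-tests.py   # plus the job's existing arguments
   ```
   
   The checks use `outcome` rather than `conclusion` because, for a step with `continue-on-error: true`, `conclusion` reports `success` even when the step failed, while `outcome` keeps the real result (`failure`, or `skipped` when the step's `if:` was false).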
   
   ### Why are the changes needed?
   
   The sparkr job today runs the same ~13m of redundant SBT compile that the 
pyspark matrix used to run. Reusing the existing precompile artifact removes 
that redundant work. The `precompile` job is already running in any workflow 
run where pyspark changes are present; adding sparkr as another consumer is 
essentially free (just another download of the same artifact).
   
   When sparkr is the only changed module, the `precompile` job is now 
scheduled to run anyway (via the new `sparkr == 'true'` clause in its `if:` 
gate), so this case picks up the same saving.
   
   ### Estimated savings
   
   | | Per sparkr run |
   |---|---:|
   | Redundant SBT compile in sparkr today | ~13m |
   | Add back: download + extract overhead | ~1m |
   | **Net CI compute saved per sparkr run** | **~12m** |
   
   This is on top of the ~96m / ~14% already saved by SPARK-56768. The 
wall-clock time of the sparkr job itself should drop by roughly the same ~12m. 
The workflow's overall wall-clock time is still driven by the pyspark matrix, 
because sparkr is not on the critical path.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. CI infrastructure change only.
   
   ### How was this patch tested?
   
   The change is exercised by the CI run of this PR itself, when the sparkr job 
runs. The expected log signature inside "Run tests" is `Reusing precompiled 
artifact, skipping local SBT build.`, mirroring what the pyspark matrix already 
prints. If the precompile artifact is not available (precompile job failed, or 
this is some future caller that doesn't enable it), sparkr falls back to the 
local SBT build path, which is identical to today's behavior.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Opus 4.7)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

