zhengruifeng opened a new pull request, #55762: URL: https://github.com/apache/spark/pull/55762
### What changes were proposed in this pull request?

Follow-up to [SPARK-56768](https://issues.apache.org/jira/browse/SPARK-56768) (apache/spark#55726), which introduced a shared `precompile` CI job that runs Spark's SBT build once and publishes the resulting `target/` trees as a GitHub Actions artifact for the pyspark matrix to consume. This PR extends that artifact to the JVM `build` matrix (the Scala/Java test entries: core, sql, hive, catalyst, mllib, streaming/connect, etc.).

Each JVM build matrix entry today runs three SBT calls back-to-back:

| | Function | Goals | Identical across entries? |
|---|---|---|---|
| 1 | `build_spark_sbt` | `Test/package + streaming-kinesis-asl-assembly/assembly + connect/assembly` | ✓ (same 11 profiles) |
| 2 | `build_spark_assembly_sbt` | `assembly/package` | ✓ |
| 3 | `run_scala_tests_sbt` | `<module>/test` (per-entry test goals + tag filters) | ✗ (varies) |

I verified this against a recent `apache/spark` master run: all 9 JVM build matrix entries produce byte-equivalent compile output (same profile set, same goals; only the order in which `[info]` prints the profiles differs). Call 3 is the part that varies and stays as-is in this PR.

### Concrete changes

- The `precompile` job's `if:` gate adds `build == 'true'` so the artifact is built whenever the JVM matrix runs, independent of pyspark changes.
- The `build` matrix job adds `precompile` to `needs:` and uses `if: (!cancelled()) && ...` so it can still run when precompile is cancelled or its `if:` was false.
- New "Download precompiled artifact" and "Extract precompiled artifact" steps with the same optional/fallback design used by the pyspark matrix:
  - `if: needs.precompile.result == 'success'` on download.
  - `continue-on-error: true` on both.
  - `if: steps.download-precompiled.outcome == 'success'` on extract.
- Inside the existing "Run tests" bash block, `SKIP_SCALA_BUILD=true` is exported only when `steps.extract-precompiled.outcome == 'success'`.
  Otherwise the local SBT build path runs as before.
- No `dev/run-tests.py` change needed - the `SKIP_SCALA_BUILD` gate landed with SPARK-56768.

### Why is this safe? Workspace and Zinc behavior

The precompile job and the JVM `build` matrix both run on bare `ubuntu-latest` (no container), which means:

- Same workspace path (`/home/runner/work/spark/spark`) on both sides → Zinc analysis files in `target/streams/...` reference paths that exist in the consumer.
- `tar -czf` / `tar -xzf` preserves mtimes by default, so when call 3 (`<module>/test`) runs, SBT's incremental compiler sees that classes are newer than (or as new as) sources and skips recompilation.
- Call 3 uses a smaller per-entry profile set than calls 1+2, but profiles drive *which projects are activated*; pre-built classes for the activated projects are present in `target/scala-2.13/{classes,test-classes}` from the precompile output, so SBT just runs the tests.

### Optional: graceful fallback if precompile fails

Same pattern as pyspark and sparkr extensions:

- `precompile` is `continue-on-error: true` (set in SPARK-56768) - a failed or cancelled precompile does not fail the workflow.
- Download/extract have `continue-on-error: true` and skip their work if the upstream step didn't succeed.
- The conditional export inside "Run tests" only sets `SKIP_SCALA_BUILD=true` when the extract succeeded; otherwise `dev/run-tests.py` runs the original local SBT build.

So a precompile failure degrades the JVM matrix to the pre-PR behavior, not a workflow failure.

### Estimated savings

| | Per-run CI time |
|---|---:|
| Redundant SBT compile in JVM build matrix today (~9 entries × ~13m) | ~120m |
| Add back: shared build (already amortized with pyspark) | 0 |
| **Net CI compute saved per build-matrix run** | **~120m** |

The JVM `build` matrix runs when `is-changed.py -m "core,unsafe,...,sql,hive,..."` returns true, i.e. on roughly any non-Python-only PR.
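The rounded figures in the savings estimate can be reproduced with a few lines of arithmetic. The sketch below uses only the approximate numbers quoted in this description (about 9 entries, about 13 minutes of redundant compile each, about 30 seconds of artifact download each). The constant and function names are illustrative and do not come from the Spark repository.

```python
# Back-of-the-envelope check of the savings estimate.
# All inputs are the rounded figures quoted in the PR description,
# not measured values; the names below are hypothetical.

JVM_ENTRIES = 9               # "~9 entries" in the JVM build matrix
COMPILE_MIN_PER_ENTRY = 13    # "~13m" of redundant SBT compile per entry
DOWNLOAD_SEC_PER_ENTRY = 30   # "~30s each" for the artifact download


def redundant_compile_minutes(entries: int, per_entry_min: float) -> float:
    """Total compile time repeated across matrix entries, in minutes."""
    return entries * per_entry_min


def download_overhead_minutes(entries: int, per_entry_sec: float) -> float:
    """Total artifact download time across matrix entries, in minutes."""
    return entries * per_entry_sec / 60


compile_saved = redundant_compile_minutes(JVM_ENTRIES, COMPILE_MIN_PER_ENTRY)
download_cost = download_overhead_minutes(JVM_ENTRIES, DOWNLOAD_SEC_PER_ENTRY)

print(f"redundant compile: {compile_saved} min")
print(f"download overhead: {download_cost} min")
```

With these inputs the redundant compile time comes to 117 minutes, which the table rounds to ~120m, and the download overhead comes to 4.5 minutes.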
On PRs that trigger the JVM `build` matrix, this saving stacks on top of the ~96m already saved by SPARK-56768. The shared `precompile` job is also already running in those cases (pyspark or pyspark-pandas is almost always also changed), so the marginal cost of this PR is just the artifact download per JVM matrix entry (~30s each, ~9 entries ≈ ~5m).

### Does this PR introduce _any_ user-facing change?

No. CI infrastructure change only.

### How was this patch tested?

The change is exercised by the CI run of this PR itself when the build matrix runs. Each matrix entry's "Run tests" log should show `Reusing precompiled artifact, skipping local SBT build.`, mirroring the pyspark behavior. The per-entry test phase (`<module>/test`) runs with the extracted classes already on disk; SBT's incremental check should not trigger recompilation. If the precompile artifact is unavailable, the matrix entry falls back to the local SBT build path, identical to today's behavior.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
