zhengruifeng opened a new pull request, #55766:
URL: https://github.com/apache/spark/pull/55766

   ### What changes were proposed in this pull request?
   
   Follow-up to 
[SPARK-56768](https://issues.apache.org/jira/browse/SPARK-56768) 
(apache/spark#55726), which introduced the same kind of shared-precompile 
pattern for the SBT-driven `build_and_test.yml`. This PR applies the analogous 
optimization to `.github/workflows/maven_test.yml` - the reusable workflow that 
the scheduled `build_maven*.yml` jobs call to run Maven-based Scala tests 
across multiple JDK versions.
   
   Each of the 12 matrix entries today runs three steps back-to-back:
   
   1. `mvn -DskipTests <profiles> clean install`  (~25-40m of redundant 
compile, identical across all entries)
   2. `mvn clean -pl assembly`  (small cleanup, conditional on module)
   3. `mvn -pl <TEST_MODULES> ... test`  (the actual per-entry test phase)
   
   Step 1 is identical across every matrix entry: the same 9 Maven profiles, 
the same `-DskipTests`, the same `-Djava.version=<input>`. This PR factors it into a 
single `precompile-maven` job whose output every entry consumes.
   
   ### Concrete changes
   
   - New `precompile-maven` job runs `mvn -DskipTests <profiles> clean install` 
once on the same `runs-on: ${{ inputs.os }}` runner. It uses the same shell wrapper, 
`MAVEN_OPTS`, profile set, and `JAVA_VERSION/-ea` substitution that 
the matrix entries use today.
   - The job tars two pieces and uploads them as a multi-file artifact:
     - `compile-target.tar.gz` - all `*/target/` directories from the workspace.
     - `compile-m2-spark.tar.gz` - `~/.m2/repository/org/apache/spark/`, needed 
by the matrix's `mvn -pl X test` to resolve cross-module Spark dependencies 
that aren't in the reactor.
     
     Artifact name: `spark-maven-compile-<branch>-java<java>-<run_id>`. The JDK 
is encoded in the name because `build_maven.yml`, `build_maven_java21.yml`, and 
`build_maven_java25.yml` use different JDKs and bytecode is JDK-specific.
   - The `build` matrix job adds `precompile-maven` to `needs:` and uses `if: 
(!cancelled())` so the matrix runs even if precompile fails or is cancelled.
   - New "Download precompiled artifact" / "Extract precompiled artifact" steps 
with the same optional/fallback design as the SBT version:
     - `if: needs.precompile-maven.result == 'success'` on download.
     - `continue-on-error: true` on both steps.
     - `if: steps.download-precompiled.outcome == 'success'` on extract.
   - Inside the existing "Run tests" bash, the `mvn clean install` line is 
gated:
     ```bash
     if [ "${{ steps.extract-precompiled.outcome }}" = "success" ]; then
       echo "Reusing precompiled artifact, skipping local Maven clean install."
     else
       ./build/mvn ... clean install
     fi
     ```
     The rest of the bash (the `clean -pl assembly` cleanup and the per-entry 
`test` invocations) is unchanged.
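   
   For illustration only, the packing side could look roughly like the sketch below.
   It is not the workflow's actual script: `pack_precompiled` and `artifact_name` are
   hypothetical helper names, `find` is assumed to be an acceptable way to collect
   the `target/` directories, and the output directory is assumed to be an absolute path.
   
   ```bash
   # Hypothetical helper (not taken from the PR): pack the two archives.
   # $1 = workspace root, $2 = local Maven repo root (e.g. ~/.m2/repository),
   # $3 = absolute output directory for the tarballs.
   pack_precompiled() {
     local workspace="$1" m2_repo="$2" out_dir="$3"
     # List every directory named "target"; -prune stops find from descending
     # into it, so each build output tree is passed to tar exactly once.
     (cd "$workspace" && find . -type d -name target -prune -print0 \
       | tar --null -czf "$out_dir/compile-target.tar.gz" -T -) || return 1
     # Only the locally installed Spark modules, not third-party dependencies.
     tar -czf "$out_dir/compile-m2-spark.tar.gz" \
       -C "$m2_repo" org/apache/spark || return 1
   }
   
   # Builds the artifact name in the format given above:
   # spark-maven-compile-<branch>-java<java>-<run_id>
   artifact_name() {
     printf 'spark-maven-compile-%s-java%s-%s\n' "$1" "$2" "$3"
   }
   ```
   
   With this layout, the first archive holds paths relative to the workspace root and the
   second holds paths relative to the local repository root, so both can be unpacked with
   a plain `tar -x -C <root>`.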
   
   ### Optional: graceful fallback if precompile fails
   
   Same pattern as the SBT version:
   
   - `precompile-maven` is `continue-on-error: true` - a failed or cancelled 
precompile does not fail the workflow.
   - Download/extract have `continue-on-error: true` and skip if the upstream 
step didn't succeed.
   - The bash runs the original `mvn clean install` whenever the artifact 
wasn't usable.
   
   So a precompile failure degrades to today's behavior, not a workflow failure.
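   
   The fallback chain can be modelled as a small decision function. This is a sketch of
   the gating logic as described here, not code from the workflow: `plan_install` is a
   hypothetical name, and the `*_would` arguments stand for what a step would report if
   it actually ran. A step whose `if:` condition is false reports the outcome `skipped`.
   
   ```bash
   # Hypothetical model of the gating chain (illustration only).
   # $1 = result of the precompile-maven job
   # $2 = outcome the download step would have if it ran
   # $3 = outcome the extract step would have if it ran
   # Prints "reuse" when the artifact is usable, "build" otherwise.
   plan_install() {
     local precompile_result="$1" download_would="$2" extract_would="$3"
     local download extract
     # Download runs only if the precompile job succeeded.
     if [ "$precompile_result" = "success" ]; then
       download="$download_would"
     else
       download="skipped"
     fi
     # Extract runs only if the download step succeeded.
     if [ "$download" = "success" ]; then
       extract="$extract_would"
     else
       extract="skipped"
     fi
     # The "Run tests" bash checks only the extract outcome.
     if [ "$extract" = "success" ]; then
       echo "reuse"
     else
       echo "build"
     fi
   }
   ```
   
   Any break in the chain, for example `plan_install failure success success`, yields
   `build`, which corresponds to running the original `mvn clean install`.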
   
   ### Why two artifact files
   
   Maven's `mvn -pl X test` resolves cross-module dependencies (other Spark 
modules) from `~/.m2/repository/org/apache/spark/` rather than from the 
workspace's `target/`. We need both:
   
   - `target/` so the matrix entry's main/test classes for module X are present 
(Maven sees they're up-to-date and skips re-compilation thanks to mtime 
preservation by `tar`).
   - `~/.m2/repository/org/apache/spark/` so the artifact resolution for 
inter-module Spark deps doesn't fall back to "module not found" or trigger a 
recursive build.
   
   The matrix entry extracts both into their respective locations 
(`./*/target/...` for the first, `~/.m2/repository/org/apache/spark/` for the 
second).
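   
   A minimal sketch of the extraction side follows, assuming the first archive stores
   paths relative to the workspace root and the second stores paths relative to the
   local repository root (`org/apache/spark/...`). `unpack_precompiled` and
   `same_mtime` are hypothetical helpers, not names from the workflow. `tar` restores
   modification times on extraction by default, which is the property the up-to-date
   check above relies on.
   
   ```bash
   # Hypothetical helper (illustration only): unpack both archives.
   # $1 = directory holding the two tarballs, $2 = workspace root,
   # $3 = local Maven repo root (e.g. ~/.m2/repository).
   unpack_precompiled() {
     local artifact_dir="$1" workspace="$2" m2_repo="$3"
     mkdir -p "$workspace" "$m2_repo" || return 1
     # No -m flag: tar keeps the archived modification times.
     tar -xzf "$artifact_dir/compile-target.tar.gz" -C "$workspace" || return 1
     tar -xzf "$artifact_dir/compile-m2-spark.tar.gz" -C "$m2_repo" || return 1
   }
   
   # Succeeds when both files exist and neither is newer than the other.
   same_mtime() {
     [ -e "$1" ] && [ -e "$2" ] && [ ! "$1" -nt "$2" ] && [ ! "$1" -ot "$2" ]
   }
   ```
   
   `same_mtime` is a quick way to confirm that a class file kept its original timestamp
   after a pack and unpack round trip.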
   
   ### Estimated savings
   
   Each scheduled `build_maven*.yml` workflow today runs all 12 matrix entries 
in parallel, each spending ~25-40m on a redundant `mvn clean install`:
   
   | | Per scheduled-run CI time |
   |---|---:|
   | Redundant Maven clean install today (12 entries × ~30m) | ~360m |
   | Add back: shared precompile + artifact transfer | ~35-45m |
   | **Net CI compute saved per scheduled run** | **~315-325m (~5h)** |
   
   Multiplied across the JDK 17 / JDK 21 / JDK 25 scheduled workflows (plus the ARM 
and macos26 variants) that all use this reusable workflow, the daily saving amounts to 
multiple hours of org-shared CI capacity.
   
   The `sql/hive-thriftserver` matrix entry has a special case at line ~228 
("To avoid a compilation loop ... run `clean install` instead") that re-runs 
`clean install` regardless. That entry won't pick up the saving, so 
roughly 1 of 12 entries (~8% of the matrix) is left unchanged.
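   
   The arithmetic behind these estimates can be reproduced with the sketch below. The
   inputs are the approximate figures quoted above (12 entries, ~30m each, ~35-45m of
   shared overhead), not measurements, and `net_saving` is a hypothetical helper name.
   
   ```bash
   # Net minutes saved = entries that skip the install * minutes per entry
   #                     - shared precompile/transfer overhead.
   # $1 = entries skipping the install, $2 = minutes per entry, $3 = overhead minutes
   net_saving() {
     echo $(( $1 * $2 - $3 ))
   }
   
   # All 12 entries reuse the artifact (the table's assumption):
   net_saving 12 30 35   # 325
   net_saving 12 30 45   # 315
   
   # Excluding the sql/hive-thriftserver entry, which still runs clean install:
   net_saving 11 30 35   # 295
   net_saving 11 30 45   # 285
   ```
   
   Under the same assumptions, accounting for the `sql/hive-thriftserver` exception
   lowers the estimate to roughly 285-295m per scheduled run.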
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. CI infrastructure change only.
   
   ### How was this patch tested?
   
   The change is exercised by a CI run of one of the `build_maven*.yml` 
workflows (the schedule trigger fires daily at 13:00 UTC for 
`build_maven.yml`). Expected log signatures:
   
   - `precompile-maven` job: `[INFO] BUILD SUCCESS` from Maven, plus the `ls 
-lh compile-target.tar.gz compile-m2-spark.tar.gz` line.
   - Matrix entry "Run tests" step: `Reusing precompiled artifact, skipping 
local Maven clean install.` (or, on fallback, the full `mvn clean install` runs 
as before).
   
   If the precompile artifact is missing or extraction fails, each matrix entry 
runs `mvn clean install` itself, identical to today's behavior.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Opus 4.7)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

