(spark) branch master updated: [SPARK-56964][INFRA] Share Maven precompile artifact across maven_test matrix

ruifengz Thu, 21 May 2026 19:25:22 -0700

This is an automated email from the ASF dual-hosted git repository.

zhengruifeng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new 74816d75abf0 [SPARK-56964][INFRA] Share Maven precompile artifact across maven_test matrix
74816d75abf0 is described below

commit 74816d75abf03e8977817bce473256bc9a4b46b7
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Fri May 22 10:25:00 2026 +0800

    [SPARK-56964][INFRA] Share Maven precompile artifact across maven_test matrix
    
    ### What changes were proposed in this pull request?
    
    Follow-up to 
[SPARK-56768](https://issues.apache.org/jira/browse/SPARK-56768) 
(apache/spark#55726), which introduced the same kind of shared-precompile 
pattern for the SBT-driven `build_and_test.yml`. This PR applies the analogous 
optimization to `.github/workflows/maven_test.yml` - the reusable workflow that 
the scheduled `build_maven*.yml` jobs call to run Maven-based scala tests 
across multiple JDK versions.
    
    Each of the 12 matrix entries today runs three steps back-to-back:
    
    1. `mvn -DskipTests <profiles> clean install`  (~25-40 min of redundant compilation, identical across all entries)
    2. `mvn clean -pl assembly`  (small cleanup, conditional on module)
    3. `mvn -pl <TEST_MODULES> ... test`  (the actual per-entry test phase)
    
    Step 1 is byte-equivalent across every matrix entry: same 9 Maven profiles, 
same `-DskipTests`, same `-Djava.version=<input>`. This PR factors it into a 
single `precompile-maven` job whose output every entry consumes.
    
    ### Concrete changes
    
    - New `precompile-maven` job runs `mvn -DskipTests <profiles> clean 
install` once on the same `runs-on: ${{ inputs.os }}` runner. The same shell 
wrapper, same `MAVEN_OPTS`, same profile set, same `JAVA_VERSION/-ea` 
substitution as the matrix entries use today.
    - The job tars two pieces and uploads them as a multi-file artifact:
      - `compile-target.tar.gz` - all `*/target/` directories from the 
workspace.
      - `compile-m2-spark.tar.gz` - `~/.m2/repository/org/apache/spark/`, 
needed by the matrix's `mvn -pl X test` to resolve cross-module Spark 
dependencies that aren't in the reactor.
    
      Artifact name: `spark-maven-compile-<branch>-java<java>-<run_id>`. The 
JDK is encoded in the name because `build_maven.yml`, `build_maven_java21.yml`, 
`build_maven_java25.yml` use different JDKs and bytecode is JDK-specific.
    - The `build` matrix job adds `precompile-maven` to `needs:` and uses `if: 
(!cancelled())` so the matrix runs even if precompile fails or is cancelled.
    - New "Download precompiled artifact" / "Extract precompiled artifact" 
steps with the same optional/fallback design as the SBT version:
      - `if: needs.precompile-maven.result == 'success'` on download.
      - `continue-on-error: true` on both steps.
      - `if: steps.download-precompiled.outcome == 'success'` on extract.
    - Inside the existing "Run tests" bash, the `mvn clean install` line is 
gated:
      ```bash
      if [ "${{ steps.extract-precompiled.outcome }}" = "success" ]; then
        echo "Reusing precompiled artifact, skipping local Maven clean install."
      else
        ./build/mvn ... clean install
      fi
      ```
      The rest of the bash (the `clean -pl assembly` cleanup and the per-entry 
`test` invocations) is unchanged.
    
    ### Optional: graceful fallback if precompile fails
    
    Same pattern as the SBT extensions:
    
    - `precompile-maven` is `continue-on-error: true` - a failed or cancelled 
precompile does not fail the workflow.
    - Download/extract have `continue-on-error: true` and skip if the upstream 
step didn't succeed.
    - The bash runs the original `mvn clean install` whenever the artifact 
wasn't usable.
    
    So a precompile failure degrades to today's behavior, not a workflow 
failure.
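    
    The gating described above can be traced with a small shell sketch. This is an illustration only, not code from the workflow: `decide_build_path` and its `yes`/`no` arguments are hypothetical names, and the sketch assumes that a step whose `if:` condition is false reports the outcome `skipped`.
    
    ```bash
    # Hypothetical helper: models the `if:` gates on the download and extract steps.
    # Usage: decide_build_path PRECOMPILE_RESULT DOWNLOAD_OK EXTRACT_OK
    decide_build_path() (
      precompile_result=$1
      download_ok=$2
      extract_ok=$3
      download_outcome=skipped
      extract_outcome=skipped
    
      # Download runs only when needs.precompile-maven.result == 'success'.
      if [ "$precompile_result" = "success" ]; then
        if [ "$download_ok" = "yes" ]; then
          download_outcome=success
        else
          download_outcome=failure
        fi
      fi
    
      # Extract runs only when steps.download-precompiled.outcome == 'success'.
      if [ "$download_outcome" = "success" ]; then
        if [ "$extract_ok" = "yes" ]; then
          extract_outcome=success
        else
          extract_outcome=failure
        fi
      fi
    
      # "Run tests" skips `mvn clean install` only when extraction succeeded.
      if [ "$extract_outcome" = "success" ]; then
        echo "reuse"
      else
        echo "local-install"
      fi
    )
    
    decide_build_path success yes yes   # precompile, download and extract all fine
    decide_build_path failure yes yes   # precompile failed, download is skipped
    decide_build_path success no yes    # download failed, extract is skipped
    decide_build_path success yes no    # extract failed
    ```
    
    Only the first call prints `reuse`; every other combination prints `local-install`, which corresponds to the original `mvn clean install` path.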
    
    ### Why two artifact files
    
    Maven's `mvn -pl X test` resolves cross-module dependencies (other Spark 
modules) from `~/.m2/repository/org/apache/spark/` rather than from the 
workspace's `target/`. We need both:
    
    - `target/` so the matrix entry's main/test classes for module X are 
present (Maven sees they're up-to-date and skips re-compilation thanks to mtime 
preservation by `tar`).
    - `~/.m2/repository/org/apache/spark/` so the artifact resolution for 
inter-module Spark deps doesn't fall back to "module not found" or trigger a 
recursive build.
    
    The matrix entry extracts both into their respective locations 
(`./*/target/...` for the first, `~/.m2/repository/org/apache/spark/` for the 
second).
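    
    A minimal, self-contained sketch of that package/extract round trip, using throwaway directories in place of the real workspace and `~/.m2`. The module name `core`, the file names, and the variables (`work`, `src`, `dst`, `m2_src`, `m2_dst`, `mtime_preserved`) are made up for the illustration; the `find ... -print0 | tar --null ... -T -` form assumes a tar that accepts those options (GNU tar does).
    
    ```bash
    # Throwaway stand-ins: src/dst for the workspace, m2_src/m2_dst for ~/.m2/repository.
    work=$(mktemp -d)
    src="$work/src"
    dst="$work/dst"
    m2_src="$work/m2-src"
    m2_dst="$work/m2-dst"
    mkdir -p "$src/core/target/classes" "$m2_src/org/apache/spark/spark-core"
    mkdir -p "$dst" "$m2_dst/org/apache"
    echo "compiled" > "$src/core/target/classes/A.class"
    echo "jar" > "$m2_src/org/apache/spark/spark-core/spark-core.jar"
    # Give the class file a fixed, old mtime so preservation is easy to check.
    touch -t 202001010000 "$src/core/target/classes/A.class"
    
    # Package: every target/ directory into one tarball, the Spark subtree into another.
    (cd "$src" && find . -type d -name target -print0 \
      | tar --null -czf "$work/compile-target.tar.gz" -T -)
    tar -C "$m2_src/org/apache" -czf "$work/compile-m2-spark.tar.gz" spark
    
    # Extract: each tarball goes back to its own location.
    tar -C "$dst" -xzf "$work/compile-target.tar.gz"
    tar -C "$m2_dst/org/apache" -xzf "$work/compile-m2-spark.tar.gz"
    
    # The extracted file is neither newer nor older than the original.
    orig_file="$src/core/target/classes/A.class"
    copy_file="$dst/core/target/classes/A.class"
    if [ ! "$copy_file" -nt "$orig_file" ] && [ ! "$copy_file" -ot "$orig_file" ]; then
      mtime_preserved=yes
    else
      mtime_preserved=no
    fi
    echo "mtime preserved: $mtime_preserved"
    ```
    
    tar restores modification times on extraction by default, which is the property the up-to-date check above relies on.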
    
    ### Measured savings
    
    Comparing the apache/spark scheduled `build_maven.yml` run on 2026-05-17 
([25992372470](https://github.com/apache/spark/actions/runs/25992372470)) 
against the validation push of this PR on 2026-05-20 
([26153415924](https://github.com/zhengruifeng/spark/actions/runs/26153415924)),
 both JDK 17 / Scala 2.13 / Hadoop 3:
    
    | | Before | After | Δ |
    |---|---:|---:|---:|
    | Sum of 12 matrix entries | 17:58:04 | 9:44:11 | −8:13:53 |
    | + new `precompile-maven` job | | 0:49:24 | |
    | **Total CI compute per run** | **17:58:04** | **10:33:35** | **−7:24:29 (−41%)** |
    
    Every matrix entry drops by 28–53 min (≈40 min average), matching the redundant `mvn -DskipTests … clean install` (~25–40 min) that this PR removes from each entry. Assuming comparable savings in each of the three scheduled Maven workflows (JDK 17 / 21 / 25), the daily saving is ~22 h of org-shared CI capacity.
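    
    The totals in the table can be re-derived from the three measured durations. The following is a sketch for checking the arithmetic only; `to_seconds` and `to_hms` are hypothetical helpers, and the last line assumes all three workflows save as much as the measured JDK 17 run.
    
    ```bash
    # Convert H:MM:SS to seconds (hypothetical helper).
    to_seconds() (
      h=${1%%:*}
      rest=${1#*:}
      m=${rest%%:*}
      s=${rest#*:}
      # Strip one leading zero so values like "04" are not read as octal.
      h=${h#0}; m=${m#0}; s=${s#0}
      echo $(( ${h:-0} * 3600 + ${m:-0} * 60 + ${s:-0} ))
    )
    
    # Convert seconds back to H:MM:SS (hypothetical helper).
    to_hms() {
      printf '%d:%02d:%02d\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
    }
    
    before=$(to_seconds 17:58:04)        # sum of 12 matrix entries, before
    matrix_after=$(to_seconds 9:44:11)   # sum of 12 matrix entries, after
    precompile=$(to_seconds 0:49:24)     # new precompile-maven job
    
    after=$(( matrix_after + precompile ))
    saved=$(( before - after ))
    
    echo "total after:   $(to_hms "$after")"
    echo "saved:         $(to_hms "$saved")"
    echo "saved percent: $(( saved * 100 / before ))%"
    echo "avg per entry: $(( (before - matrix_after) / 12 / 60 )) min"
    echo "3 workflows:   $(( 3 * saved / 3600 )) h"
    ```
    
    This reproduces 10:33:35, 7:24:29 and 41% from the table, an average per-entry drop of about 41 min, and roughly 22 h across three workflows.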
    
    See [this 
comment](https://github.com/apache/spark/pull/55766#issuecomment-4507484858) 
for the full per-entry breakdown and notes on the wall-clock trade-off 
(precompile + matrix is sequential, so end-to-end wall-clock grows by ~20 min 
on official infra; the much larger compute saving comes from removing the 
redundant compile from every matrix entry).
    
    The `sql/hive-thriftserver` matrix entry has a special case ("To avoid a 
compilation loop ... run `clean install` instead") that re-runs `clean install` 
regardless. In the measured run that entry still saved ~39 min, likely because 
the cached `~/.m2/repository/org/apache/spark/` from the precompile artifact 
shortens its re-run.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. CI infrastructure change only.
    
    ### How was this patch tested?
    
    Exercised end-to-end by validation run 
[26153415924](https://github.com/zhengruifeng/spark/actions/runs/26153415924) 
of `build_maven.yml` on the PR branch (JDK 17). Both expected log signatures 
appeared:
    
    - `precompile-maven` job: `[INFO] BUILD SUCCESS` from Maven, plus the `ls 
-lh compile-target.tar.gz compile-m2-spark.tar.gz` line.
    - Matrix entries' "Run tests" step: `Reusing precompiled artifact, skipping 
local Maven clean install.`
    
    The fallback path (full `mvn clean install` when the artifact is missing or 
extraction fails) is preserved by `continue-on-error: true` on the precompile 
job and the download/extract steps; on that path each matrix entry runs `mvn 
clean install` itself, identical to today's behavior.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    Generated-by: Claude Code (Opus 4.7)
    
    Closes #55766 from zhengruifeng/share-precompile-maven-test.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
---
 .github/workflows/maven_test.yml | 133 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 127 insertions(+), 6 deletions(-)

diff --git a/.github/workflows/maven_test.yml b/.github/workflows/maven_test.yml
index 357d869d1b88..0799d871e6d6 100644
--- a/.github/workflows/maven_test.yml
+++ b/.github/workflows/maven_test.yml
@@ -52,9 +52,96 @@ on:
         type: string
         default: '{}'
 jobs:
+  # Precompile Spark with Maven once and publish target/ + ~/.m2/.../spark as
+  # an artifact for the matrix entries below to consume. Optional: any failure
+  # here degrades the matrix to its original local `clean install` path.
+  precompile-maven:
+    name: "Precompile Spark with Maven"
+    runs-on: ${{ inputs.os }}
+    # If this job fails or is cancelled, the matrix entries fall back to
+    # running `mvn clean install` locally as before.
+    continue-on-error: true
+    env:
+      HADOOP_PROFILE: ${{ inputs.hadoop }}
+      HIVE_PROFILE: hive2.3
+      SPARK_LOCAL_IP: localhost
+      GITHUB_PREV_SHA: ${{ github.event.before }}
+    steps:
+      - name: Checkout Spark repository
+        uses: actions/checkout@v6
+        with:
+          fetch-depth: 0
+          repository: apache/spark
+          ref: ${{ inputs.branch }}
+      - name: Sync the current branch with the latest in Apache Spark
+        if: github.repository != 'apache/spark'
+        run: |
+          echo "APACHE_SPARK_REF=$(git rev-parse HEAD)" >> $GITHUB_ENV
+          git fetch https://github.com/$GITHUB_REPOSITORY.git ${GITHUB_REF#refs/heads/}
+          git -c user.name='Apache Spark Test Account' -c user.email='[email protected]' merge --no-commit --progress --squash FETCH_HEAD
+          git -c user.name='Apache Spark Test Account' -c user.email='[email protected]' commit -m "Merged commit" --allow-empty
+      - name: Cache SBT and Maven
+        # TODO(SPARK-54466): https://github.com/actions/runner-images/issues/13341
+        if: ${{ runner.os != 'macOS' }}
+        uses: actions/cache@v5
+        with:
+          path: |
+            build/apache-maven-*
+            build/*.jar
+            ~/.sbt
+          key: build-${{ hashFiles('**/pom.xml', 'project/build.properties', 'build/mvn', 'build/sbt', 'build/sbt-launch-lib.bash', 'build/spark-build-info') }}
+          restore-keys: |
+            build-
+      - name: Cache Maven local repository
+        # TODO(SPARK-54466): https://github.com/actions/runner-images/issues/13341
+        if: ${{ runner.os != 'macOS' }}
+        uses: actions/cache@v5
+        with:
+          path: ~/.m2/repository
+          key: java${{ inputs.java }}-maven-${{ hashFiles('**/pom.xml') }}
+          restore-keys: |
+            java${{ inputs.java }}-maven-
+      - name: Install Java ${{ inputs.java }}
+        uses: actions/setup-java@v5
+        with:
+          distribution: zulu
+          java-version: ${{ inputs.java }}
+      - name: Build Spark with Maven
+        shell: |
+          bash -c "if script -qec true 2>/dev/null; then script -qec bash\ {0}; else script -qe /dev/null bash {0}; fi"
+        run: |
+          set -e
+          export MAVEN_OPTS="-Xss64m -Xmx4g -Xms4g -XX:ReservedCodeCacheSize=128m -Dorg.slf4j.simpleLogger.defaultLogLevel=WARN"
+          export MAVEN_CLI_OPTS="--no-transfer-progress"
+          export JAVA_VERSION=${{ inputs.java }}
+          ./build/mvn $MAVEN_CLI_OPTS -DskipTests -Pyarn -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pjvm-profiler -Pspark-ganglia-lgpl -Pkinesis-asl -Djava.version=${JAVA_VERSION/-ea} clean install
+      - name: Package compile output
+        run: |
+          # Exclude assembly/ from the artifact: 11 of 12 matrix entries wipe it
+          # right after extraction (SPARK-51628 regression-test-for-SPARK-51600),
+          # and the connect entry rebuilds it via `mvn install -pl assembly`.
+          find . \( -path './build' -o -path './.git' -o -path './assembly' \) -prune \
+            -o -type d -name target -print0 \
+            | tar --null -czf compile-target.tar.gz -T -
+          if [ -d "$HOME/.m2/repository/org/apache/spark" ]; then
+            tar -C "$HOME/.m2/repository/org/apache" -czf compile-m2-spark.tar.gz spark
+          fi
+          ls -lh compile-target.tar.gz compile-m2-spark.tar.gz
+      - name: Upload compile artifact
+        uses: actions/upload-artifact@v6
+        with:
+          name: spark-maven-compile-${{ inputs.branch }}-java${{ inputs.java }}-${{ github.run_id }}
+          path: |
+            compile-target.tar.gz
+            compile-m2-spark.tar.gz
+          retention-days: 1
+          if-no-files-found: error
+
   # Build: build Spark and run the tests for specified modules using maven
   build:
     name: "Build modules: ${{ matrix.modules }} ${{ matrix.comment }}"
+    needs: precompile-maven
+    if: (!cancelled())
     runs-on: ${{ inputs.os }}
     # TODO(SPARK-54466): https://github.com/actions/runner-images/issues/13341
     # timeout-minutes: 150
@@ -184,6 +271,25 @@ jobs:
         run: |
           python3.12 -m pip install 'numpy>=1.23.2' pyarrow 'pandas==2.3.3' pyyaml scipy unittest-xml-reporting 'grpcio==1.76.0' 'grpcio-status==1.76.0' 'protobuf==6.33.5' 'zstandard==0.25.0'
           python3.12 -m pip list
+      - name: Download precompiled artifact
+        id: download-precompiled
+        if: needs.precompile-maven.result == 'success'
+        continue-on-error: true
+        uses: actions/download-artifact@v6
+        with:
+          name: spark-maven-compile-${{ inputs.branch }}-java${{ matrix.java }}-${{ github.run_id }}
+      - name: Extract precompiled artifact
+        id: extract-precompiled
+        if: steps.download-precompiled.outcome == 'success'
+        continue-on-error: true
+        run: |
+          tar -xzf compile-target.tar.gz
+          rm compile-target.tar.gz
+          if [ -f compile-m2-spark.tar.gz ]; then
+            mkdir -p "$HOME/.m2/repository/org/apache"
+            tar -C "$HOME/.m2/repository/org/apache" -xzf compile-m2-spark.tar.gz
+            rm compile-m2-spark.tar.gz
+          fi
       # Run the tests using script command.
       # BSD's script command doesn't support -c option, and the usage is different from Linux's one.
       # The kind of script command is tested by `script -qec true`.
@@ -203,13 +309,28 @@ jobs:
           export ENABLE_KINESIS_TESTS=0
           # Replace with the real module name, for example, connector#kafka-0-10 -> connector/kafka-0-10
           export TEST_MODULES=`echo "$MODULES_TO_TEST" | sed -e "s%#%/%g"`
-          ./build/mvn $MAVEN_CLI_OPTS -DskipTests -Pyarn -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pjvm-profiler -Pspark-ganglia-lgpl -Pkinesis-asl -Djava.version=${JAVA_VERSION/-ea} clean install
-          
-          if [ "$MODULES_TO_TEST" != "connect" ]; then
-            echo "Clean up the assembly module before maven testing"
-            ./build/mvn $MAVEN_CLI_OPTS clean -pl assembly
+          if [ "${{ steps.extract-precompiled.outcome }}" = "success" ]; then
+            echo "Reusing precompiled artifact, skipping local Maven clean install."
+            # SPARK-51628 regression coverage is naturally preserved on the reuse
+            # path: the precompile artifact excludes assembly/, so non-connect
+            # tests already run with the assembly module's jars dir missing.
+            # Connect tests strongly depend on a built assembly module; rebuild
+            # it here.
+            if [ "$MODULES_TO_TEST" = "connect" ]; then
+              echo "Building assembly module for connect tests."
+              ./build/mvn $MAVEN_CLI_OPTS -DskipTests -Pyarn -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pjvm-profiler -Pspark-ganglia-lgpl -Pkinesis-asl -Djava.version=${JAVA_VERSION/-ea} -pl assembly install
+            fi
+          else
+            ./build/mvn $MAVEN_CLI_OPTS -DskipTests -Pyarn -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pjvm-profiler -Pspark-ganglia-lgpl -Pkinesis-asl -Djava.version=${JAVA_VERSION/-ea} clean install
+            # SPARK-51628: wipe the assembly module so tests exercise the
+            # SPARK-51600 prepend fallback path. Connect tests strongly depend
+            # on a built assembly module, so they are excluded.
+            if [ "$MODULES_TO_TEST" != "connect" ]; then
+              echo "Clean up the assembly module before maven testing"
+              ./build/mvn $MAVEN_CLI_OPTS clean -pl assembly
+            fi
           fi
-          
+
           if [[ "$INCLUDED_TAGS" != "" ]]; then
             ./build/mvn $MAVEN_CLI_OPTS -pl "$TEST_MODULES" -Pyarn -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pjvm-profiler -Pspark-ganglia-lgpl -Pkinesis-asl -Djava.version=${JAVA_VERSION/-ea} -Dtest.include.tags="$INCLUDED_TAGS" test -fae
           elif [[ "$MODULES_TO_TEST" == "connect" && "$INPUT_BRANCH" == "branch-4.0" ]]; then

