This is an automated email from the ASF dual-hosted git repository.
zhengruifeng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 74816d75abf0 [SPARK-56964][INFRA] Share Maven precompile artifact across maven_test matrix
74816d75abf0 is described below
commit 74816d75abf03e8977817bce473256bc9a4b46b7
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Fri May 22 10:25:00 2026 +0800
[SPARK-56964][INFRA] Share Maven precompile artifact across maven_test matrix
### What changes were proposed in this pull request?
Follow-up to
[SPARK-56768](https://issues.apache.org/jira/browse/SPARK-56768)
(apache/spark#55726), which introduced the same kind of shared-precompile
pattern for the SBT-driven `build_and_test.yml`. This PR applies the analogous
optimization to `.github/workflows/maven_test.yml` - the reusable workflow that
the scheduled `build_maven*.yml` jobs call to run Maven-based Scala tests
across multiple JDK versions.
Each of the 12 matrix entries today runs three steps back-to-back:
1. `mvn -DskipTests <profiles> clean install` (~25-40m of redundant
compile, identical across all entries)
2. `mvn clean -pl assembly` (small cleanup, conditional on module)
3. `mvn -pl <TEST_MODULES> ... test` (the actual per-entry test phase)
Step 1 is byte-equivalent across every matrix entry: same 9 Maven profiles,
same `-DskipTests`, same `-Djava.version=<input>`. This PR factors it into a
single `precompile-maven` job whose output every entry consumes.
### Concrete changes
- New `precompile-maven` job runs `mvn -DskipTests <profiles> clean
install` once on the same `runs-on: ${{ inputs.os }}` runner. The same shell
wrapper, same `MAVEN_OPTS`, same profile set, same `JAVA_VERSION/-ea`
substitution as the matrix entries use today.
- The job tars two pieces and uploads them as a multi-file artifact:
- `compile-target.tar.gz` - all `target/` directories from the
workspace, except those under `build/`, `.git/`, and `assembly/`.
- `compile-m2-spark.tar.gz` - `~/.m2/repository/org/apache/spark/`,
needed by the matrix's `mvn -pl X test` to resolve cross-module Spark
dependencies that aren't in the reactor.
Artifact name: `spark-maven-compile-<branch>-java<java>-<run_id>`. The
JDK is encoded in the name because `build_maven.yml`, `build_maven_java21.yml`,
`build_maven_java25.yml` use different JDKs and bytecode is JDK-specific.
- The `build` matrix job adds `precompile-maven` to `needs:` and uses `if:
(!cancelled())` so the matrix runs even if precompile fails or is cancelled.
- New "Download precompiled artifact" / "Extract precompiled artifact"
steps with the same optional/fallback design as the SBT version:
- `if: needs.precompile-maven.result == 'success'` on download.
- `continue-on-error: true` on both steps.
- `if: steps.download-precompiled.outcome == 'success'` on extract.
- Inside the existing "Run tests" bash, the `mvn clean install` line is
gated:
```bash
if [ "${{ steps.extract-precompiled.outcome }}" = "success" ]; then
echo "Reusing precompiled artifact, skipping local Maven clean install."
else
./build/mvn ... clean install
fi
```
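The per-entry `test` invocations are unchanged. The `clean -pl assembly`
cleanup moves into the fallback branch, because the artifact already excludes
`assembly/`; on the reuse path only the connect entry rebuilds the assembly
module.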
### Optional: graceful fallback if precompile fails
Same pattern as the SBT version:
- `precompile-maven` is `continue-on-error: true` - a failed or cancelled
precompile does not fail the workflow.
- Download/extract have `continue-on-error: true` and skip if the upstream
step didn't succeed.
- The bash runs the original `mvn clean install` whenever the artifact
wasn't usable.
So a precompile failure degrades to today's behavior, not a workflow
failure.
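The chain of conditions above can be sketched as a small simulation. This is only an illustration, not part of the workflow: the function names (`download_outcome`, `extract_outcome`, `build_plan`) and the `yes`/`no` flags are hypothetical, and the outcome strings simply mirror the `if:` conditions quoted in this description.

```bash
#!/usr/bin/env bash
# Hypothetical simulation of the optional download -> extract -> build chain.

# Download runs only when the precompile result is 'success';
# a failed download is tolerated (continue-on-error: true).
download_outcome() {
  local precompile_result=$1 download_ok=$2
  if [ "$precompile_result" != "success" ]; then
    echo "skipped"
  elif [ "$download_ok" = "yes" ]; then
    echo "success"
  else
    echo "failure"
  fi
}

# Extract runs only when the download outcome is 'success'.
extract_outcome() {
  local download=$1 extract_ok=$2
  if [ "$download" != "success" ]; then
    echo "skipped"
  elif [ "$extract_ok" = "yes" ]; then
    echo "success"
  else
    echo "failure"
  fi
}

# The "Run tests" bash skips the local build only when extraction succeeded.
build_plan() {
  local dl ex
  dl=$(download_outcome "$1" "$2")
  ex=$(extract_outcome "$dl" "$3")
  if [ "$ex" = "success" ]; then
    echo "reuse-artifact"
  else
    echo "local-clean-install"
  fi
}

build_plan success yes yes   # reuse-artifact
build_plan failure yes yes   # local-clean-install
build_plan success no  yes   # local-clean-install
build_plan success yes no    # local-clean-install
```

Only the all-green path reuses the artifact; every other combination lands on the original `mvn clean install`.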
### Why two artifact files
Maven's `mvn -pl X test` resolves cross-module dependencies (other Spark
modules) from `~/.m2/repository/org/apache/spark/` rather than from the
workspace's `target/`. We need both:
- `target/` so the matrix entry's main/test classes for module X are
present (Maven sees they're up-to-date and skips re-compilation thanks to mtime
preservation by `tar`).
- `~/.m2/repository/org/apache/spark/` so the artifact resolution for
inter-module Spark deps doesn't fall back to "module not found" or trigger a
recursive build.
The matrix entry extracts both into their respective locations
(`./*/target/...` for the first, `~/.m2/repository/org/apache/spark/` for the
second).
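A minimal local sketch of the two-tarball round trip, assuming GNU or BSD `tar` and using throwaway temp directories in place of the real workspace and `~/.m2` (all paths and file names here are hypothetical). It only demonstrates the `tar` side: files land in their respective locations and keep their modification times.

```bash
#!/usr/bin/env bash
set -e

work=$(mktemp -d)                 # left in place for inspection
src="$work/src"                   # stands in for the precompile workspace
dst="$work/dst"                   # stands in for a matrix entry's workspace
m2_src="$work/m2-src/org/apache"  # stands in for ~/.m2/repository/org/apache
m2_dst="$work/m2-dst/org/apache"

mkdir -p "$src/core/target/classes" "$src/sql/target/classes" \
         "$dst" "$m2_src/spark/spark-core" "$m2_dst"
echo "class" > "$src/core/target/classes/A.class"
echo "jar"   > "$m2_src/spark/spark-core/spark-core.jar"

# Backdate the class file so a preserved mtime is distinguishable
# from the extraction time.
touch -t 202001010000 "$src/core/target/classes/A.class"

# Pack: every target/ directory, plus the Spark subtree of the local repo.
( cd "$src" && find . -type d -name target -print0 \
    | tar --null -czf "$work/compile-target.tar.gz" -T - )
tar -C "$m2_src" -czf "$work/compile-m2-spark.tar.gz" spark

# Unpack into the two separate destinations.
tar -C "$dst" -xzf "$work/compile-target.tar.gz"
tar -C "$m2_dst" -xzf "$work/compile-m2-spark.tar.gz"

# Neither copy is newer than the other, i.e. the mtime survived.
if [ ! "$dst/core/target/classes/A.class" -nt "$src/core/target/classes/A.class" ] &&
   [ ! "$src/core/target/classes/A.class" -nt "$dst/core/target/classes/A.class" ]; then
  echo "mtime preserved"
fi
```

Whether Maven then treats the extracted classes as up-to-date depends on its own staleness checks; the sketch does not exercise that part.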
### Measured savings
Comparing the apache/spark scheduled `build_maven.yml` run on 2026-05-17
([25992372470](https://github.com/apache/spark/actions/runs/25992372470))
against the validation push of this PR on 2026-05-20
([26153415924](https://github.com/zhengruifeng/spark/actions/runs/26153415924)),
both JDK 17 / Scala 2.13 / Hadoop 3:
| | Before | After | Δ |
|---|---:|---:|---:|
| Sum of 12 matrix entries | 17:58:04 | 9:44:11 | −8:13:53 |
| + new `precompile-maven` job | | 0:49:24 | |
| **Total CI compute per run** | **17:58:04** | **10:33:35** | **−7:24:29 (−41%)** |
Every matrix entry drops by 28–53 min (≈40 min average), matching the
redundant `mvn -DskipTests … clean install` (~25–40 min) that this PR removes
from each entry. Multiplied across the three scheduled Maven workflows (JDK 17
/ 21 / 25), the daily saving is ~22 h of org-shared CI capacity.
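The figures above can be re-derived from the three durations in the table. The helper names `to_seconds` and `to_hms` are hypothetical, and the daily figure assumes the JDK 21 and JDK 25 workflows save the same amount as the measured JDK 17 run.

```bash
#!/usr/bin/env bash
# Convert H:MM:SS to seconds; 10# forces base 10 so "04" is not read as octal.
to_seconds() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Convert seconds back to H:MM:SS.
to_hms() {
  local t=$1
  printf '%d:%02d:%02d\n' $(( t / 3600 )) $(( t % 3600 / 60 )) $(( t % 60 ))
}

before=$(to_seconds 17:58:04)        # sum of 12 matrix entries, before
matrix_after=$(to_seconds 9:44:11)   # sum of 12 matrix entries, after
precompile=$(to_seconds 0:49:24)     # new precompile-maven job

after=$(( matrix_after + precompile ))
saved=$(( before - after ))

echo "total after:   $(to_hms "$after")"                                 # 10:33:35
echo "saved per run: $(to_hms "$saved") ($(( saved * 100 / before ))%)"  # 7:24:29 (41%)
echo "avg per entry: $(( (before - matrix_after) / 12 / 60 )) min"       # 41 min
echo "daily, 3 runs: $(to_hms $(( saved * 3 )))"                         # 22:13:27
```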
See [this
comment](https://github.com/apache/spark/pull/55766#issuecomment-4507484858)
for the full per-entry breakdown and notes on the wall-clock trade-off
(precompile + matrix is sequential, so end-to-end wall-clock grows by ~20 min
on official infra; the much larger compute saving comes from removing the
redundant compile from every matrix entry).
The `sql/hive-thriftserver` matrix entry has a special case ("To avoid a
compilation loop ... run `clean install` instead") that re-runs `clean install`
regardless. In the measured run that entry still saved ~39 min, likely because
the cached `~/.m2/repository/org/apache/spark/` from the precompile artifact
shortens its re-run.
### Does this PR introduce _any_ user-facing change?
No. CI infrastructure change only.
### How was this patch tested?
Exercised end-to-end by validation run
[26153415924](https://github.com/zhengruifeng/spark/actions/runs/26153415924)
of `build_maven.yml` on the PR branch (JDK 17). Both expected log signatures
appeared:
- `precompile-maven` job: `[INFO] BUILD SUCCESS` from Maven, plus the `ls
-lh compile-target.tar.gz compile-m2-spark.tar.gz` line.
- Matrix entries' "Run tests" step: `Reusing precompiled artifact, skipping
local Maven clean install.`
The fallback path (full `mvn clean install` when the artifact is missing or
extraction fails) is preserved by `continue-on-error: true` on the precompile
job and the download/extract steps; on that path each matrix entry runs `mvn
clean install` itself, identical to today's behavior.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.7)
Closes #55766 from zhengruifeng/share-precompile-maven-test.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
---
.github/workflows/maven_test.yml | 133 +++++++++++++++++++++++++++++++++++++--
1 file changed, 127 insertions(+), 6 deletions(-)
diff --git a/.github/workflows/maven_test.yml b/.github/workflows/maven_test.yml
index 357d869d1b88..0799d871e6d6 100644
--- a/.github/workflows/maven_test.yml
+++ b/.github/workflows/maven_test.yml
@@ -52,9 +52,96 @@ on:
type: string
default: '{}'
jobs:
+ # Precompile Spark with Maven once and publish target/ + ~/.m2/.../spark as
+ # an artifact for the matrix entries below to consume. Optional: any failure
+ # here degrades the matrix to its original local `clean install` path.
+ precompile-maven:
+ name: "Precompile Spark with Maven"
+ runs-on: ${{ inputs.os }}
+ # If this job fails or is cancelled, the matrix entries fall back to
+ # running `mvn clean install` locally as before.
+ continue-on-error: true
+ env:
+ HADOOP_PROFILE: ${{ inputs.hadoop }}
+ HIVE_PROFILE: hive2.3
+ SPARK_LOCAL_IP: localhost
+ GITHUB_PREV_SHA: ${{ github.event.before }}
+ steps:
+ - name: Checkout Spark repository
+ uses: actions/checkout@v6
+ with:
+ fetch-depth: 0
+ repository: apache/spark
+ ref: ${{ inputs.branch }}
+ - name: Sync the current branch with the latest in Apache Spark
+ if: github.repository != 'apache/spark'
+ run: |
+ echo "APACHE_SPARK_REF=$(git rev-parse HEAD)" >> $GITHUB_ENV
+ git fetch https://github.com/$GITHUB_REPOSITORY.git ${GITHUB_REF#refs/heads/}
+ git -c user.name='Apache Spark Test Account' -c user.email='[email protected]' merge --no-commit --progress --squash FETCH_HEAD
+ git -c user.name='Apache Spark Test Account' -c user.email='[email protected]' commit -m "Merged commit" --allow-empty
+ - name: Cache SBT and Maven
+ # TODO(SPARK-54466): https://github.com/actions/runner-images/issues/13341
+ if: ${{ runner.os != 'macOS' }}
+ uses: actions/cache@v5
+ with:
+ path: |
+ build/apache-maven-*
+ build/*.jar
+ ~/.sbt
+ key: build-${{ hashFiles('**/pom.xml', 'project/build.properties', 'build/mvn', 'build/sbt', 'build/sbt-launch-lib.bash', 'build/spark-build-info') }}
+ restore-keys: |
+ build-
+ - name: Cache Maven local repository
+ # TODO(SPARK-54466): https://github.com/actions/runner-images/issues/13341
+ if: ${{ runner.os != 'macOS' }}
+ uses: actions/cache@v5
+ with:
+ path: ~/.m2/repository
+ key: java${{ inputs.java }}-maven-${{ hashFiles('**/pom.xml') }}
+ restore-keys: |
+ java${{ inputs.java }}-maven-
+ - name: Install Java ${{ inputs.java }}
+ uses: actions/setup-java@v5
+ with:
+ distribution: zulu
+ java-version: ${{ inputs.java }}
+ - name: Build Spark with Maven
+ shell: |
+ bash -c "if script -qec true 2>/dev/null; then script -qec bash\ {0}; else script -qe /dev/null bash {0}; fi"
+ run: |
+ set -e
+ export MAVEN_OPTS="-Xss64m -Xmx4g -Xms4g -XX:ReservedCodeCacheSize=128m -Dorg.slf4j.simpleLogger.defaultLogLevel=WARN"
+ export MAVEN_CLI_OPTS="--no-transfer-progress"
+ export JAVA_VERSION=${{ inputs.java }}
+ ./build/mvn $MAVEN_CLI_OPTS -DskipTests -Pyarn -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pjvm-profiler -Pspark-ganglia-lgpl -Pkinesis-asl -Djava.version=${JAVA_VERSION/-ea} clean install
+ - name: Package compile output
+ run: |
+ # Exclude assembly/ from the artifact: 11 of 12 matrix entries wipe it
+ # right after extraction (SPARK-51628 regression-test-for-SPARK-51600),
+ # and the connect entry rebuilds it via `mvn install -pl assembly`.
+ find . \( -path './build' -o -path './.git' -o -path './assembly' \) -prune \
+ -o -type d -name target -print0 \
+ | tar --null -czf compile-target.tar.gz -T -
+ if [ -d "$HOME/.m2/repository/org/apache/spark" ]; then
+ tar -C "$HOME/.m2/repository/org/apache" -czf compile-m2-spark.tar.gz spark
+ fi
+ ls -lh compile-target.tar.gz compile-m2-spark.tar.gz
+ - name: Upload compile artifact
+ uses: actions/upload-artifact@v6
+ with:
+ name: spark-maven-compile-${{ inputs.branch }}-java${{ inputs.java }}-${{ github.run_id }}
+ path: |
+ compile-target.tar.gz
+ compile-m2-spark.tar.gz
+ retention-days: 1
+ if-no-files-found: error
+
# Build: build Spark and run the tests for specified modules using maven
build:
name: "Build modules: ${{ matrix.modules }} ${{ matrix.comment }}"
+ needs: precompile-maven
+ if: (!cancelled())
runs-on: ${{ inputs.os }}
# TODO(SPARK-54466): https://github.com/actions/runner-images/issues/13341
# timeout-minutes: 150
@@ -184,6 +271,25 @@ jobs:
run: |
python3.12 -m pip install 'numpy>=1.23.2' pyarrow 'pandas==2.3.3' pyyaml scipy unittest-xml-reporting 'grpcio==1.76.0' 'grpcio-status==1.76.0' 'protobuf==6.33.5' 'zstandard==0.25.0'
python3.12 -m pip list
+ - name: Download precompiled artifact
+ id: download-precompiled
+ if: needs.precompile-maven.result == 'success'
+ continue-on-error: true
+ uses: actions/download-artifact@v6
+ with:
+ name: spark-maven-compile-${{ inputs.branch }}-java${{ matrix.java }}-${{ github.run_id }}
+ - name: Extract precompiled artifact
+ id: extract-precompiled
+ if: steps.download-precompiled.outcome == 'success'
+ continue-on-error: true
+ run: |
+ tar -xzf compile-target.tar.gz
+ rm compile-target.tar.gz
+ if [ -f compile-m2-spark.tar.gz ]; then
+ mkdir -p "$HOME/.m2/repository/org/apache"
+ tar -C "$HOME/.m2/repository/org/apache" -xzf compile-m2-spark.tar.gz
+ rm compile-m2-spark.tar.gz
+ fi
# Run the tests using script command.
# BSD's script command doesn't support -c option, and the usage is different from Linux's one.
# The kind of script command is tested by `script -qec true`.
@@ -203,13 +309,28 @@ jobs:
export ENABLE_KINESIS_TESTS=0
# Replace with the real module name, for example, connector#kafka-0-10 -> connector/kafka-0-10
export TEST_MODULES=`echo "$MODULES_TO_TEST" | sed -e "s%#%/%g"`
- ./build/mvn $MAVEN_CLI_OPTS -DskipTests -Pyarn -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pjvm-profiler -Pspark-ganglia-lgpl -Pkinesis-asl -Djava.version=${JAVA_VERSION/-ea} clean install
-
- if [ "$MODULES_TO_TEST" != "connect" ]; then
- echo "Clean up the assembly module before maven testing"
- ./build/mvn $MAVEN_CLI_OPTS clean -pl assembly
+ if [ "${{ steps.extract-precompiled.outcome }}" = "success" ]; then
+ echo "Reusing precompiled artifact, skipping local Maven clean install."
+ # SPARK-51628 regression coverage is naturally preserved on the reuse
+ # path: the precompile artifact excludes assembly/, so non-connect
+ # tests already run with the assembly module's jars dir missing.
+ # Connect tests strongly depend on a built assembly module; rebuild
+ # it here.
+ if [ "$MODULES_TO_TEST" = "connect" ]; then
+ echo "Building assembly module for connect tests."
+ ./build/mvn $MAVEN_CLI_OPTS -DskipTests -Pyarn -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pjvm-profiler -Pspark-ganglia-lgpl -Pkinesis-asl -Djava.version=${JAVA_VERSION/-ea} -pl assembly install
+ fi
+ else
+ ./build/mvn $MAVEN_CLI_OPTS -DskipTests -Pyarn -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pjvm-profiler -Pspark-ganglia-lgpl -Pkinesis-asl -Djava.version=${JAVA_VERSION/-ea} clean install
+ # SPARK-51628: wipe the assembly module so tests exercise the
+ # SPARK-51600 prepend fallback path. Connect tests strongly depend
+ # on a built assembly module, so they are excluded.
+ if [ "$MODULES_TO_TEST" != "connect" ]; then
+ echo "Clean up the assembly module before maven testing"
+ ./build/mvn $MAVEN_CLI_OPTS clean -pl assembly
+ fi
fi
-
+
if [[ "$INCLUDED_TAGS" != "" ]]; then
./build/mvn $MAVEN_CLI_OPTS -pl "$TEST_MODULES" -Pyarn -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pjvm-profiler -Pspark-ganglia-lgpl -Pkinesis-asl -Djava.version=${JAVA_VERSION/-ea} -Dtest.include.tags="$INCLUDED_TAGS" test -fae
elif [[ "$MODULES_TO_TEST" == "connect" && "$INPUT_BRANCH" == "branch-4.0" ]]; then