[PR] [INFRA] Unify Coursier cache to a single key across all jobs [spark]

via GitHub Fri, 29 May 2026 05:23:44 -0700


zhengruifeng opened a new pull request, #56201:
URL: https://github.com/apache/spark/pull/56201


   ### What changes were proposed in this pull request?
   
   Replace 8 distinct per-job Coursier cache keys with a single 
`coursier-<hash>` key in `.github/workflows/build_and_test.yml`:
   
   - **`precompile`** and **`build`** (Scala test matrix): `actions/cache@v5` — 
both can write `coursier-<hash>`. `precompile` is the primary writer (runs 
first, full dependency superset via all `-P` profiles). `build` is the fallback 
writer — when `precompile` is absent or its save fails, the first `build` 
matrix entry seeds the cache. When `precompile` did save it, `build` gets an 
exact key hit and GHA automatically skips the post-save (caches are immutable).
   - **All other consumers** (`pyspark` ×9, `sparkr`, `lint`, `docs`, 
`tpcds-1g`, `docker-integration-tests`, `k8s-integration-tests`): converted to 
`actions/cache/restore@v5` — restore-only, never write.
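
The writer and restore-only roles described above can be sketched with a small toy model. This is not the real GHA cache service: `CacheStore` and `run_job` are hypothetical names, and jobs are assumed to run one after another, so races between concurrent savers are ignored.

```python
class CacheStore:
    """Toy model of a key-addressed, immutable cache (not the real GHA service)."""

    def __init__(self):
        self.entries = {}  # key -> name of the job that saved it

    def restore(self, key):
        # An exact key hit means the entry already exists.
        return key in self.entries

    def save(self, key, job):
        # Immutable: an existing key is never overwritten.
        if key in self.entries:
            return False
        self.entries[key] = job
        return True


def run_job(store, job, key, can_write):
    """Restore first; save afterwards only on a miss, and only for read-write jobs."""
    hit = store.restore(key)
    saved = (not hit) and can_write and store.save(key, job)
    return hit, saved


KEY = "coursier-<hash>"  # placeholder key, as written in the description

# Normal case: precompile runs first and seeds the cache.
store = CacheStore()
normal = [("precompile", True), ("build", True), ("pyspark", False), ("docs", False)]
results = {job: run_job(store, job, KEY, w) for job, w in normal}

# Fallback case: precompile is absent, so the first build entry seeds the cache.
fallback_store = CacheStore()
fallback = [("pyspark", False), ("build", True), ("docs", False)]
fallback_results = {job: run_job(fallback_store, job, KEY, w) for job, w in fallback}
```

In both scenarios the store ends up with exactly one entry under the shared key. Restore-only jobs that run before any writer get a miss and leave the store untouched.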
   
   ### Why are the changes needed?
   
   The old per-job keys (`$matrix.java-$matrix.hadoop-coursier-`, 
`pyspark-coursier-`, `sparkr-coursier-`, `docs-coursier-` ×2, 
`tpcds-coursier-`, `docker-integration-coursier-`, `k8s-integration-coursier-`) 
all resolved to near-identical ~1.4 GB snapshots of the same dependency 
superset. The repo-wide 10 GB cache budget was almost entirely consumed by 
duplicates on just two branches:
   
   ```
   branch-4.x:  tpcds-coursier      1895 MB
                21-hadoop3-coursier  1437 MB
                docker-integration-coursier  1437 MB   → 4770 MB
   
   master:      precompile-coursier (current hash)  1401 MB
                precompile-coursier (prev hash)     1549 MB
                25-hadoop3-coursier                 1401 MB   → 4351 MB
   
   total: ~9.1 GB of near-duplicate Coursier content in 10 GB budget
   ```
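
As a quick check, the subtotals in the table can be recomputed from the per-entry sizes. The sketch below assumes the 10 GB budget is read as 10,000 MB; the variable names are illustrative only.

```python
# Cache entry sizes in MB, copied from the table above.
sizes_mb = {
    "branch-4.x": [1895, 1437, 1437],
    "master": [1401, 1549, 1401],
}

# Per-branch subtotals and the combined footprint.
subtotals = {branch: sum(entries) for branch, entries in sizes_mb.items()}
total_mb = sum(subtotals.values())

# Assumption: the 10 GB repo-wide budget is taken as 10,000 MB.
budget_mb = 10 * 1000
share = total_mb / budget_mb

print(subtotals, total_mb, f"{share:.1%}")
```

Summing the listed entries gives 4769 MB for branch-4.x, 1 MB below the quoted 4770 MB, which is presumably a rounding difference in the per-entry sizes. The combined total of about 9.1 GB matches the table.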
   
   This left no room for the older maintenance branches (branch-4.0, branch-4.1, 
branch-4.2, branch-3.5): their caches were evicted before their next scheduled 
CI run, so those runs always started cold.
   
   With a single shared cache key per branch, the per-branch footprint drops 
from ~4.5 GB to ~1.4 GB, so about 6 branches fit in the 10 GB budget 
simultaneously.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. CI-only.
   
   ### How was this patch tested?
   
   YAML validates with `python3 -c "import yaml; yaml.safe_load(...)"`.
   
   The correctness of the one-writer design relies on two properties, both 
verified in prior CI runs:
   1. Caches are immutable — an exact key hit skips the post-save step (`Cache 
hit occurred on the primary key …, not saving cache`), so multiple jobs using 
`actions/cache@v5` with the same key don't produce duplicates when the cache 
already exists.
   2. The `precompile` job builds with every profile (`-Phadoop-3 -Pyarn 
-Pspark-ganglia-lgpl -Phadoop-cloud -Phive -Pkubernetes -Pjvm-profiler 
-Pkinesis-asl -Phive-thriftserver -Pdocker-integration-tests -Pvolcano`), so 
its `~/.cache/coursier` is a superset of every consumer job's closure.
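
The superset argument in item 2 can be illustrated with a simple set check. The `precompile` profile list is taken from the text above. The consumer profile sets and the `uncovered` helper are hypothetical and for illustration only, and profiles are used here only as a rough proxy for each job's dependency closure.

```python
# Profiles passed to the precompile job, as listed above.
PRECOMPILE_PROFILES = {
    "hadoop-3", "yarn", "spark-ganglia-lgpl", "hadoop-cloud", "hive",
    "kubernetes", "jvm-profiler", "kinesis-asl", "hive-thriftserver",
    "docker-integration-tests", "volcano",
}

# Hypothetical consumer profile sets; the real jobs may use different ones.
CONSUMER_PROFILES = {
    "docker-integration-tests": {"hadoop-3", "docker-integration-tests"},
    "k8s-integration-tests": {"hadoop-3", "kubernetes", "volcano"},
}


def uncovered(consumers, writer):
    """Return, per job, the profiles that the writer job does not build with."""
    return {
        job: sorted(profiles - writer)
        for job, profiles in consumers.items()
        if not profiles <= writer
    }


# An empty result means every consumer's profiles are covered by the writer.
missing = uncovered(CONSUMER_PROFILES, PRECOMPILE_PROFILES)

# A consumer with a profile the writer lacks would be reported.
extra = uncovered({"example-job": {"hive", "example-extra"}}, PRECOMPILE_PROFILES)
```

A check like this only shows that the profile sets nest. Whether the resolved artifacts also nest depends on how each job invokes the build.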
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (claude-sonnet-4-6)

