(spark) branch branch-4.x updated: [SPARK-56866][INFRA] Pin downstream actions/checkout to a single resolved SHA

ruifengz Mon, 18 May 2026 17:58:55 -0700

This is an automated email from the ASF dual-hosted git repository.

zhengruifeng pushed a commit to branch branch-4.x
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/branch-4.x by this push:
     new 94a0cf7b9fe7 [SPARK-56866][INFRA] Pin downstream actions/checkout to a single resolved SHA
94a0cf7b9fe7 is described below

commit 94a0cf7b9fe789ce07714200bb245e3febb7626e
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Tue May 19 08:58:23 2026 +0800

    [SPARK-56866][INFRA] Pin downstream actions/checkout to a single resolved SHA
    
    ### What changes were proposed in this pull request?
    
    In `.github/workflows/build_and_test.yml`, add a step to the `precondition` job that captures `git rev-parse HEAD` right after the apache/spark checkout and exposes it as a `head_sha` output, then switch every downstream `actions/checkout` from `ref: ${{ inputs.branch }}` to `ref: ${{ needs.precondition.outputs.head_sha }}`. The `precondition` job's own checkout still resolves `inputs.branch`; the 11 downstream checkouts (`build`, `infra-image`, `precompile`, `pyspark`, `sparkr`, `buf`, ` [...]
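    
    A minimal sketch of the pattern described above, reduced to two jobs. This is not the real `build_and_test.yml`: the workflow name, the `downstream` job, the `actions/checkout@v4` version, and the `workflow_call` input declaration are assumptions made for illustration; only the `resolve-sha` step and the two `ref:` expressions are taken from the change itself.
    
    ```yaml
    # Hypothetical two-job workflow illustrating the pinning pattern.
    name: pin-checkout-sketch
    on:
      workflow_call:
        inputs:
          branch:
            type: string
            default: master
    jobs:
      precondition:
        runs-on: ubuntu-latest
        outputs:
          # Exposed so later jobs can check out the exact same commit.
          head_sha: ${{ steps.resolve-sha.outputs.head_sha }}
        steps:
          - uses: actions/checkout@v4   # version is an assumption
            with:
              repository: apache/spark
              ref: ${{ inputs.branch }}   # the only place the branch name is resolved
          - name: Resolve apache/spark HEAD SHA
            id: resolve-sha
            run: echo "head_sha=$(git rev-parse HEAD)" >> "$GITHUB_OUTPUT"
      downstream:   # hypothetical stand-in for build, pyspark, sparkr, ...
        needs: precondition
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
            with:
              repository: apache/spark
              # A fixed SHA, so a later push to the branch cannot change what is tested.
              ref: ${{ needs.precondition.outputs.head_sha }}
          - run: git rev-parse HEAD   # prints the same SHA in every downstream job
    ```
    
    Because job outputs are evaluated once when `precondition` finishes, every job that lists it under `needs` sees the same value regardless of when its runner starts.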
    
    ### Why are the changes needed?
    
    Today each `actions/checkout` step independently re-resolves `ref: ${{ inputs.branch }}` (default `master`) at the moment the runner picks it up. Different jobs in the same workflow run can therefore end up testing different commits.
    
    **This is a long-standing issue.** `ref: ${{ inputs.branch }}` has been in `build_and_test.yml` since commit `9e468cf010f` (SPARK-39521, 2022-06-21), nearly four years. The race has existed the entire time. It usually goes unnoticed because a normal master commit doesn't cross the JVM/Python boundary, so even when jobs do see different commits the tests stay consistent within each job.
    
    **It becomes a real problem during merge bursts.** Commits per hour on master vary wildly; release-prep windows, end-of-week merges, and APAC + EU overlap regularly push 3–6 commits in 20 minutes. The drift window for `pyspark` jobs is structurally ~17 minutes (`precompile` time) plus runner queue wait, so during a merge burst the probability that at least one commit lands inside that window approaches 1. When the unlucky commit happens to add a tightly-coupled change: new Spark Con [...]
    
    ```
    [CONNECT_INVALID_PLAN.INVALID_ONE_OF_FIELD_NOT_SET]
    The Spark Connect plan is invalid. This oneOf field in spark.connect.Relation is not set: RELTYPE_NOT_SET
    ```
    
    Concrete example from 2026-05-14:
    - Run [25835824862](https://github.com/apache/spark/actions/runs/25835824862) triggered by `e19bc35c` (SPARK-56844): `pyspark-connect` failed with 19 NEAREST BY errors.
    - Run [25835929554](https://github.com/apache/spark/actions/runs/25835929554) triggered ~3 minutes later by the next commit `13380e78` (SPARK-56395, which added the NEAREST BY feature): the same job passed.
    
    The first run's `precompile` checked out `e19bc35c` (no NEAREST BY server code), but by the time its `pyspark-connect` job actually started 17 minutes later, master was at `13380e78` and `actions/checkout` resolved that newer commit (with the new Python test files). Pinning every job to the SHA `precondition` saw makes this impossible.
    
    The fix is also forward-leaning: as Spark's release cadence and contributor count grow, the merge-burst probability only goes up; without pinning, "spurious red CI on the previous PR every time someone merges a Connect feature" will keep recurring.
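    
    As a rough sanity check of the merge-burst estimate given earlier in this section, suppose commits arrive as a Poisson process with rate λ. This model is an assumption made here for illustration, not something measured in the PR. With the quoted burst rates of 3 and 6 commits per 20 minutes and a drift window of T = 17 minutes:
    
    ```latex
    \[
      P(\text{at least one commit lands in the window}) = 1 - e^{-\lambda T},
      \qquad T = 17\ \text{min}
    \]
    \[
      \lambda = \tfrac{3}{20}\ \text{min}^{-1}: \quad 1 - e^{-2.55} \approx 0.92
    \]
    \[
      \lambda = \tfrac{6}{20}\ \text{min}^{-1}: \quad 1 - e^{-5.1} \approx 0.99
    \]
    ```
    
    Runner queue wait lengthens T, which pushes both figures closer to 1 under the same assumption.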
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. CI infrastructure only.
    
    ### How was this patch tested?
    
    YAML syntax validated locally. CI will exercise the change end-to-end.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    Generated-by: Claude Code (claude-opus-4-7)
    
    Closes #55879 from zhengruifeng/ci-pin-checkout-sha.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
    (cherry picked from commit 869adad659f8ce5c449daba4123f779f76b41ba6)
    Signed-off-by: Ruifeng Zheng <[email protected]>
---
 .github/workflows/build_and_test.yml | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index 4d7f246360d9..6c5929ad6ae6 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -65,6 +65,8 @@ jobs:
       GITHUB_PREV_SHA: ${{ github.event.before }}
     outputs:
       required: ${{ steps.set-outputs.outputs.required }}
+      # Pinned so every downstream job checks out the same snapshot, even if `master` advances mid-run.
+      head_sha: ${{ steps.resolve-sha.outputs.head_sha }}
       image_url: ${{ steps.infra-image-outputs.outputs.image_url }}
       image_docs_url: ${{ steps.infra-image-docs-outputs.outputs.image_docs_url }}
       image_docs_url_link: ${{ steps.infra-image-link.outputs.image_docs_url_link }}
@@ -81,6 +83,9 @@ jobs:
         fetch-depth: 0
         repository: apache/spark
         ref: ${{ inputs.branch }}
+    - name: Resolve apache/spark HEAD SHA
+      id: resolve-sha
+      run: echo "head_sha=$(git rev-parse HEAD)" >> $GITHUB_OUTPUT
     - name: Sync the current branch with the latest in Apache Spark
       if: github.repository != 'apache/spark'
       run: |
@@ -346,7 +351,7 @@ jobs:
       with:
         fetch-depth: 0
         repository: apache/spark
-        ref: ${{ inputs.branch }}
+        ref: ${{ needs.precondition.outputs.head_sha }}
     - name: Sync the current branch with the latest in Apache Spark
       if: github.repository != 'apache/spark'
       run: |
@@ -464,7 +469,7 @@ jobs:
         with:
           fetch-depth: 0
           repository: apache/spark
-          ref: ${{ inputs.branch }}
+          ref: ${{ needs.precondition.outputs.head_sha }}
       - name: Sync the current branch with the latest in Apache Spark
         if: github.repository != 'apache/spark'
         run: |
@@ -558,7 +563,7 @@ jobs:
       with:
         fetch-depth: 0
         repository: apache/spark
-        ref: ${{ inputs.branch }}
+        ref: ${{ needs.precondition.outputs.head_sha }}
     - name: Sync the current branch with the latest in Apache Spark
       if: github.repository != 'apache/spark'
       run: |
@@ -680,7 +685,7 @@ jobs:
       with:
         fetch-depth: 0
         repository: apache/spark
-        ref: ${{ inputs.branch }}
+        ref: ${{ needs.precondition.outputs.head_sha }}
     - name: Add GITHUB_WORKSPACE to git trust safe.directory
       run: |
         git config --global --add safe.directory ${GITHUB_WORKSPACE}
@@ -830,7 +835,7 @@ jobs:
       with:
         fetch-depth: 0
         repository: apache/spark
-        ref: ${{ inputs.branch }}
+        ref: ${{ needs.precondition.outputs.head_sha }}
     - name: Add GITHUB_WORKSPACE to git trust safe.directory
       run: |
         git config --global --add safe.directory ${GITHUB_WORKSPACE}
@@ -919,7 +924,7 @@ jobs:
       with:
         fetch-depth: 0
         repository: apache/spark
-        ref: ${{ inputs.branch }}
+        ref: ${{ needs.precondition.outputs.head_sha }}
     - name: Sync the current branch with the latest in Apache Spark
       if: github.repository != 'apache/spark'
       run: |
@@ -981,7 +986,7 @@ jobs:
       with:
         fetch-depth: 0
         repository: apache/spark
-        ref: ${{ inputs.branch }}
+        ref: ${{ needs.precondition.outputs.head_sha }}
     - name: Add GITHUB_WORKSPACE to git trust safe.directory
       run: |
         git config --global --add safe.directory ${GITHUB_WORKSPACE}
@@ -1173,7 +1178,7 @@ jobs:
       with:
         fetch-depth: 0
         repository: apache/spark
-        ref: ${{ inputs.branch }}
+        ref: ${{ needs.precondition.outputs.head_sha }}
     - name: Add GITHUB_WORKSPACE to git trust safe.directory
       run: |
         git config --global --add safe.directory ${GITHUB_WORKSPACE}
@@ -1346,7 +1351,7 @@ jobs:
       with:
         fetch-depth: 0
         repository: apache/spark
-        ref: ${{ inputs.branch }}
+        ref: ${{ needs.precondition.outputs.head_sha }}
     - name: Sync the current branch with the latest in Apache Spark
       if: github.repository != 'apache/spark'
       run: |
@@ -1463,7 +1468,7 @@ jobs:
       with:
         fetch-depth: 0
         repository: apache/spark
-        ref: ${{ inputs.branch }}
+        ref: ${{ needs.precondition.outputs.head_sha }}
     - name: Sync the current branch with the latest in Apache Spark
       if: github.repository != 'apache/spark'
       run: |
@@ -1531,7 +1536,7 @@ jobs:
         with:
           fetch-depth: 0
           repository: apache/spark
-          ref: ${{ inputs.branch }}
+          ref: ${{ needs.precondition.outputs.head_sha }}
       - name: Sync the current branch with the latest in Apache Spark
         if: github.repository != 'apache/spark'
         run: |


