This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-4.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-4.0 by this push:
     new c90c62751718 [SPARK-55115][INFRA][4.0] Use composable Dockerfile for release builds
c90c62751718 is described below

commit c90c627517182c68c5c2737edeaff9b1a54f8bce
Author: Wenchen Fan <[email protected]>
AuthorDate: Fri Jan 23 17:40:25 2026 +0800

    [SPARK-55115][INFRA][4.0] Use composable Dockerfile for release builds
    
    ### What changes were proposed in this pull request?
    
    This PR refactors the release Docker image build process to use a composable Dockerfile approach:

    1. **`Dockerfile.base`**: A shared base image containing common tools (Ubuntu 22.04, R packages, Ruby/bundler, TeX, Node.js)
    2. **`Dockerfile`**: Branch-specific image that extends the base with Java/Python versions and packages for this branch
    3. **`do-release-docker.sh`**: Updated to build the base image first, then the branch-specific image
    
    ### Why are the changes needed?
    
    Currently, each branch maintains its own full Dockerfile, which leads to:
    - Duplicated common configuration across branches
    - Difficulty keeping base tools (R packages, Ruby, etc.) in sync
    - Expired GPG keys or outdated base images affecting all branches
    
    With the composable approach:
    - Common tools are defined once in `Dockerfile.base`
    - Each branch only specifies its unique Java/Python requirements
    - Updates to base tools can be applied consistently
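    Under this scheme, a branch-specific Dockerfile reduces to roughly the following (an illustrative sketch only; the exact package lists live in the real `Dockerfile`):

    ```dockerfile
    # Branch-specific image: only the Java/Python delta lives here.
    FROM spark-rm-base:latest

    # Java version for this branch (Spark 4.x uses Java 17)
    RUN apt-get update && apt-get install -y openjdk-17-jdk-headless \
        && rm -rf /var/lib/apt/lists/*

    # Branch-specific Python interpreters from the deadsnakes PPA
    RUN add-apt-repository ppa:deadsnakes/ppa && \
        apt-get update && apt-get install -y python3.9 python3.10 \
        && rm -rf /var/lib/apt/lists/*
    ```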
    
    ### Version changes
    
    | Component | Before | After |
    |-----------|--------|-------|
    | Ubuntu image | jammy-20240911.1 | jammy-20250819 |
    | FULL_REFRESH_DATE | 20241119 | 20250819 |
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. This only affects the release infrastructure.
    
    ### How was this patch tested?
    
    The Docker image was built and verified successfully on a remote machine.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    Yes
    
    Closes #53907 from cloud-fan/release-infra-4.0.
    
    Lead-authored-by: Wenchen Fan <[email protected]>
    Co-authored-by: Wenchen Fan <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
---
 dev/create-release/do-release-docker.sh     |   5 +
 dev/create-release/spark-rm/Dockerfile      | 158 ++++++++++------------------
 dev/create-release/spark-rm/Dockerfile.base | 110 +++++++++++++++++++
 3 files changed, 169 insertions(+), 104 deletions(-)

diff --git a/dev/create-release/do-release-docker.sh b/dev/create-release/do-release-docker.sh
index 3a395e3c266b..e231d7a48eec 100755
--- a/dev/create-release/do-release-docker.sh
+++ b/dev/create-release/do-release-docker.sh
@@ -120,6 +120,11 @@ GPG_KEY_FILE="$WORKDIR/gpg.key"
 fcreate_secure "$GPG_KEY_FILE"
 $GPG --export-secret-key --armor --pinentry-mode loopback --passphrase "$GPG_PASSPHRASE" "$GPG_KEY" > "$GPG_KEY_FILE"
 
+# Build base image first (contains common tools shared across all branches)
+run_silent "Building spark-rm-base image..." "docker-build-base.log" \
+  docker build -t "spark-rm-base:latest" -f "$SELF/spark-rm/Dockerfile.base" "$SELF/spark-rm"
+
+# Build branch-specific image (extends base with Java/Python versions for this branch)
 run_silent "Building spark-rm image with tag $IMGTAG..." "docker-build.log" \
   docker build -t "spark-rm:$IMGTAG" --build-arg UID=$UID "$SELF/spark-rm"
 
diff --git a/dev/create-release/spark-rm/Dockerfile b/dev/create-release/spark-rm/Dockerfile
index eb37fad6cccd..5803a902cd06 100644
--- a/dev/create-release/spark-rm/Dockerfile
+++ b/dev/create-release/spark-rm/Dockerfile
@@ -15,134 +15,84 @@
 # limitations under the License.
 #
 
-# Image for building Spark releases. Based on Ubuntu 22.04.
-FROM ubuntu:jammy-20240911.1
-LABEL org.opencontainers.image.authors="Apache Spark project <[email protected]>"
-LABEL org.opencontainers.image.licenses="Apache-2.0"
-LABEL org.opencontainers.image.ref.name="Apache Spark Release Manager Image"
-# Overwrite this label to avoid exposing the underlying Ubuntu OS version label
-LABEL org.opencontainers.image.version=""
+# Spark 4.0 release image
+# Extends the base image with:
+# - Java 17
+# - Python 3.9/3.10 with required packages
+# - PyPy 3.10 for testing
 
-ENV FULL_REFRESH_DATE=20241119
-
-ENV DEBIAN_FRONTEND=noninteractive
-ENV DEBCONF_NONINTERACTIVE_SEEN=true
+FROM spark-rm-base:latest
 
+# Install Java 17 for Spark 4.x
 RUN apt-get update && apt-get install -y \
-    build-essential \
-    ca-certificates \
-    curl \
-    gfortran \
-    git \
-    subversion \
-    gnupg \
-    libcurl4-openssl-dev \
-    libfontconfig1-dev \
-    libfreetype6-dev \
-    libfribidi-dev \
-    libgit2-dev \
-    libharfbuzz-dev \
-    libjpeg-dev \
-    liblapack-dev \
-    libopenblas-dev \
-    libpng-dev \
-    libpython3-dev \
-    libssl-dev \
-    libtiff5-dev \
-    libwebp-dev \
-    libxml2-dev \
-    msmtp \
-    nodejs \
-    npm \
     openjdk-17-jdk-headless \
-    pandoc \
-    pkg-config \
+    && rm -rf /var/lib/apt/lists/*
+
+# Install Python 3.9 and 3.10 from deadsnakes PPA
+RUN add-apt-repository ppa:deadsnakes/ppa && \
+    apt-get update && apt-get install -y \
+    python3.9 \
+    python3.9-dev \
+    python3.9-distutils \
     python3.10 \
+    python3.10-dev \
     python3-psutil \
-    texlive-latex-base \
-    texlive \
-    texlive-fonts-extra \
-    texinfo \
-    texlive-latex-extra \
-    qpdf \
-    jq \
-    r-base \
-    ruby \
-    ruby-dev \
-    software-properties-common \
-    wget \
-    zlib1g-dev \
+    libpython3-dev \
     && rm -rf /var/lib/apt/lists/*
 
+# Install pip for both Python versions
+RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.9 && \
+    curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10
 
-RUN echo 'deb https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/' >> /etc/apt/sources.list
-RUN gpg --keyserver hkps://keyserver.ubuntu.com --recv-key E298A3A825C0D65DFD57CBB651716619E084DAB9
-RUN gpg -a --export E084DAB9 | apt-key add -
-RUN add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/'
+# Basic Python packages for Spark 4.0
+ARG BASIC_PIP_PKGS="numpy pyarrow>=18.0.0 six==1.16.0 pandas==2.2.3 scipy plotly<6.0.0 \
+    mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2 twine==3.4.1"
 
-# See more in SPARK-39959, roxygen2 < 7.2.1
-RUN Rscript -e "install.packages(c('devtools', 'knitr', 'markdown',  \
-    'rmarkdown', 'testthat', 'devtools', 'e1071', 'survival', 'arrow',  \
-    'ggplot2', 'mvtnorm', 'statmod', 'xml2'), repos='https://cloud.r-project.org/')" && \
-    Rscript -e "devtools::install_version('roxygen2', version='7.2.0', repos='https://cloud.r-project.org')" && \
-    Rscript -e "devtools::install_version('lintr', version='2.0.1', repos='https://cloud.r-project.org')" && \
-    Rscript -e "devtools::install_version('preferably', version='0.4', repos='https://cloud.r-project.org')" && \
-    Rscript -e "devtools::install_version('pkgdown', version='2.0.1', repos='https://cloud.r-project.org')"
-
-# See more in SPARK-39735
-ENV R_LIBS_SITE="/usr/local/lib/R/site-library:${R_LIBS_SITE}:/usr/lib/R/library"
-
-
-RUN add-apt-repository ppa:pypy/ppa
-RUN mkdir -p /usr/local/pypy/pypy3.10 && \
-    curl -sqL https://downloads.python.org/pypy/pypy3.10-v7.3.17-linux64.tar.bz2 | tar xjf - -C /usr/local/pypy/pypy3.10 --strip-components=1 && \
-    ln -sf /usr/local/pypy/pypy3.10/bin/pypy /usr/local/bin/pypy3.10 && \
-    ln -sf /usr/local/pypy/pypy3.10/bin/pypy /usr/local/bin/pypy3
-RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3
-RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas==2.2.3' scipy coverage matplotlib lxml
-
-
-ARG BASIC_PIP_PKGS="numpy pyarrow>=18.0.0 six==1.16.0 pandas==2.2.3 scipy plotly<6.0.0 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2 twine==3.4.1"
 # Python deps for Spark Connect
-ARG CONNECT_PIP_PKGS="grpcio==1.67.0 grpcio-status==1.67.0 protobuf==5.29.1 googleapis-common-protos==1.65.0 graphviz==0.20.3"
+ARG CONNECT_PIP_PKGS="grpcio==1.67.0 grpcio-status==1.67.0 protobuf==5.29.1 \
+    googleapis-common-protos==1.65.0 graphviz==0.20.3"
 
 # Install Python 3.10 packages
-RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10
-RUN python3.10 -m pip install --ignore-installed blinker>=1.6.2 # mlflow needs this
-RUN python3.10 -m pip install --ignore-installed 'six==1.16.0'  # Avoid `python3-six` installation
-RUN python3.10 -m pip install $BASIC_PIP_PKGS unittest-xml-reporting $CONNECT_PIP_PKGS && \
+RUN python3.10 -m pip install --ignore-installed 'blinker>=1.6.2' && \
+    python3.10 -m pip install --ignore-installed 'six==1.16.0' && \
+    python3.10 -m pip install $BASIC_PIP_PKGS unittest-xml-reporting $CONNECT_PIP_PKGS && \
     python3.10 -m pip install 'torch<2.6.0' torchvision --index-url https://download.pytorch.org/whl/cpu && \
     python3.10 -m pip install deepspeed torcheval && \
     python3.10 -m pip cache purge
 
-# Install Python 3.9
-RUN add-apt-repository ppa:deadsnakes/ppa
-RUN apt-get update && apt-get install -y \
-    python3.9 python3.9-distutils \
-    && rm -rf /var/lib/apt/lists/*
-RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.9
-RUN python3.9 -m pip install --ignore-installed blinker>=1.6.2 # mlflow needs this
-RUN python3.9 -m pip install --force $BASIC_PIP_PKGS unittest-xml-reporting $CONNECT_PIP_PKGS && \
+# Install Python 3.9 packages
+RUN python3.9 -m pip install --ignore-installed 'blinker>=1.6.2' && \
+    python3.9 -m pip install --force $BASIC_PIP_PKGS unittest-xml-reporting $CONNECT_PIP_PKGS && \
     python3.9 -m pip install 'torch<2.6.0' torchvision --index-url https://download.pytorch.org/whl/cpu && \
     python3.9 -m pip install torcheval && \
     python3.9 -m pip cache purge
 
+# Sphinx and documentation packages (installed on Python 3.9)
 # Should unpin 'sphinxcontrib-*' after upgrading sphinx>5
-# See 'ipython_genutils' in SPARK-38517
-# See 'docutils<0.18.0' in SPARK-39421
-RUN python3.9 -m pip install 'sphinx==4.5.0' mkdocs 'pydata_sphinx_theme>=0.13' sphinx-copybutton nbsphinx numpydoc jinja2 markupsafe 'pyzmq<24.0.0' \
-ipython ipython_genutils sphinx_plotly_directive 'numpy>=1.20.0' pyarrow pandas 'plotly>=4.8' 'docutils<0.18.0' \
-'flake8==3.9.0' 'mypy==1.8.0' 'pytest==7.1.3' 'pytest-mypy-plugins==1.9.3' 'black==23.12.1' \
-'pandas-stubs==1.2.0.53' 'grpcio==1.67.0' 'grpc-stubs==1.24.11' 'googleapis-common-protos-stubs==2.2.0' \
-'sphinxcontrib-applehelp==1.0.4' 'sphinxcontrib-devhelp==1.0.2' 'sphinxcontrib-htmlhelp==2.0.1' 'sphinxcontrib-qthelp==1.0.3' 'sphinxcontrib-serializinghtml==1.1.5'
-RUN python3.9 -m pip list
-
-RUN gem install --no-document "bundler:2.4.22"
-RUN ln -s "$(which python3.9)" "/usr/local/bin/python"
-RUN ln -s "$(which python3.9)" "/usr/local/bin/python3"
+# See 'ipython_genutils' in SPARK-38517, 'docutils<0.18.0' in SPARK-39421
+RUN python3.9 -m pip install 'sphinx==4.5.0' mkdocs 'pydata_sphinx_theme>=0.13' \
+    sphinx-copybutton nbsphinx numpydoc jinja2 markupsafe 'pyzmq<24.0.0' \
+    ipython ipython_genutils sphinx_plotly_directive 'numpy>=1.20.0' pyarrow pandas \
+    'plotly>=4.8' 'docutils<0.18.0' 'flake8==3.9.0' 'mypy==1.8.0' 'pytest==7.1.3' \
+    'pytest-mypy-plugins==1.9.3' 'black==23.12.1' 'pandas-stubs==1.2.0.53' \
+    'grpcio==1.67.0' 'grpc-stubs==1.24.11' 'googleapis-common-protos-stubs==2.2.0' \
+    'sphinxcontrib-applehelp==1.0.4' 'sphinxcontrib-devhelp==1.0.2' \
+    'sphinxcontrib-htmlhelp==2.0.1' 'sphinxcontrib-qthelp==1.0.3' \
+    'sphinxcontrib-serializinghtml==1.1.5'
+
+# Install PyPy 3.10 for testing
+RUN mkdir -p /usr/local/pypy/pypy3.10 && \
+    curl -sqL https://downloads.python.org/pypy/pypy3.10-v7.3.17-linux64.tar.bz2 | tar xjf - -C /usr/local/pypy/pypy3.10 --strip-components=1 && \
+    ln -sf /usr/local/pypy/pypy3.10/bin/pypy /usr/local/bin/pypy3.10 && \
+    ln -sf /usr/local/pypy/pypy3.10/bin/pypy /usr/local/bin/pypy3 && \
+    curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3 && \
+    pypy3 -m pip install numpy 'six==1.16.0' 'pandas==2.2.3' scipy coverage matplotlib lxml
 
-WORKDIR /opt/spark-rm/output
+# Set Python 3.9 as the default (branch-4.0 uses 3.9 for docs)
+RUN ln -sf "$(which python3.9)" "/usr/local/bin/python" && \
+    ln -sf "$(which python3.9)" "/usr/local/bin/python3"
 
+# Create user for release manager
 ARG UID
 RUN useradd -m -s /bin/bash -p spark-rm -u $UID spark-rm
 USER spark-rm:spark-rm
diff --git a/dev/create-release/spark-rm/Dockerfile.base b/dev/create-release/spark-rm/Dockerfile.base
new file mode 100644
index 000000000000..56e85256d52d
--- /dev/null
+++ b/dev/create-release/spark-rm/Dockerfile.base
@@ -0,0 +1,110 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Base image for building Spark releases. Based on Ubuntu 22.04.
+# This image contains common tools shared across all Spark versions:
+# - Build tools (gcc, make, etc.)
+# - R with pinned package versions
+# - Ruby with bundler
+# - TeX for documentation
+# - Node.js for documentation
+#
+# Branch-specific Dockerfiles should use "FROM spark-rm-base:latest" and add:
+# - Java version (8 or 17)
+# - Python version and pip packages
+
+FROM ubuntu:jammy-20250819
+LABEL org.opencontainers.image.authors="Apache Spark project <[email protected]>"
+LABEL org.opencontainers.image.licenses="Apache-2.0"
+LABEL org.opencontainers.image.ref.name="Apache Spark Release Manager Base Image"
+LABEL org.opencontainers.image.version=""
+
+ENV FULL_REFRESH_DATE=20250819
+
+ENV DEBIAN_FRONTEND=noninteractive
+ENV DEBCONF_NONINTERACTIVE_SEEN=true
+
+# Install common system packages and build tools
+# Note: Java and Python are installed in branch-specific Dockerfiles
+RUN apt-get update && apt-get install -y \
+    build-essential \
+    ca-certificates \
+    curl \
+    gfortran \
+    git \
+    subversion \
+    gnupg \
+    libcurl4-openssl-dev \
+    libfontconfig1-dev \
+    libfreetype6-dev \
+    libfribidi-dev \
+    libgit2-dev \
+    libharfbuzz-dev \
+    libjpeg-dev \
+    liblapack-dev \
+    libopenblas-dev \
+    libpng-dev \
+    libssl-dev \
+    libtiff5-dev \
+    libwebp-dev \
+    libxml2-dev \
+    msmtp \
+    nodejs \
+    npm \
+    pandoc \
+    pkg-config \
+    texlive-latex-base \
+    texlive \
+    texlive-fonts-extra \
+    texinfo \
+    texlive-latex-extra \
+    qpdf \
+    jq \
+    r-base \
+    ruby \
+    ruby-dev \
+    software-properties-common \
+    wget \
+    zlib1g-dev \
+    && rm -rf /var/lib/apt/lists/*
+
+# Set up R CRAN repository for latest R packages
+RUN echo 'deb https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/' >> /etc/apt/sources.list && \
+    gpg --keyserver hkps://keyserver.ubuntu.com --recv-key E298A3A825C0D65DFD57CBB651716619E084DAB9 && \
+    gpg -a --export E084DAB9 | apt-key add - && \
+    add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/'
+
+# Install R packages (same versions across all branches)
+# See more in SPARK-39959, roxygen2 < 7.2.1
+RUN Rscript -e "install.packages(c('devtools', 'knitr', 'markdown', \
+    'rmarkdown', 'testthat', 'e1071', 'survival', 'arrow', \
+    'ggplot2', 'mvtnorm', 'statmod', 'xml2'), repos='https://cloud.r-project.org/')" && \
+    Rscript -e "devtools::install_version('roxygen2', version='7.2.0', repos='https://cloud.r-project.org')" && \
+    Rscript -e "devtools::install_version('lintr', version='2.0.1', repos='https://cloud.r-project.org')" && \
+    Rscript -e "devtools::install_version('preferably', version='0.4', repos='https://cloud.r-project.org')" && \
+    Rscript -e "devtools::install_version('pkgdown', version='2.0.1', repos='https://cloud.r-project.org')"
+
+# See more in SPARK-39735
+ENV R_LIBS_SITE="/usr/local/lib/R/site-library:${R_LIBS_SITE}:/usr/lib/R/library"
+
+# Install Ruby bundler (same version across all branches)
+RUN gem install --no-document "bundler:2.4.22"
+
+# Create workspace directory
+WORKDIR /opt/spark-rm/output
+
+# Note: Java, Python, and user creation are done in branch-specific Dockerfiles

