This is an automated email from the ASF dual-hosted git repository.
wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.5 by this push:
new 854a9d44b831 [SPARK-55115][INFRA][3.5] Use composable Dockerfile for release builds
854a9d44b831 is described below
commit 854a9d44b831e4782a58e0d53f03c26334697435
Author: Wenchen Fan <[email protected]>
AuthorDate: Fri Jan 23 17:42:14 2026 +0800
[SPARK-55115][INFRA][3.5] Use composable Dockerfile for release builds
### What changes were proposed in this pull request?
This PR refactors the release Docker image build process to use a
composable Dockerfile approach:
1. **`Dockerfile.base`**: A shared base image containing common tools
(Ubuntu 22.04, R packages, Ruby/bundler, TeX, Node.js)
2. **`Dockerfile`**: Branch-specific image that extends the base with
Java/Python versions and packages for this branch
3. **`do-release-docker.sh`**: Updated to build the base image first, then
the branch-specific image
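The build order in item 3 can be sketched as a dry run. This is a minimal sketch, not the script itself: `print_build_plan` is a hypothetical helper, and the default values for `SELF` and `IMGTAG` are placeholders assumed for illustration. It only prints the two `docker build` invocations in the order the updated script issues them, so it runs without Docker installed.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the two docker build commands in build order.
# SELF and IMGTAG mirror variable names used in do-release-docker.sh;
# the defaults below are illustrative placeholders, not values from the script.
SELF="${SELF:-dev/create-release}"
IMGTAG="${IMGTAG:-latest}"

# Hypothetical helper (not part of the Spark release scripts).
print_build_plan() {
  # Step 1: the shared base image, built from Dockerfile.base.
  printf 'docker build -t spark-rm-base:latest -f %s/spark-rm/Dockerfile.base %s/spark-rm\n' "$SELF" "$SELF"
  # Step 2: the branch image, whose Dockerfile starts FROM spark-rm-base:latest.
  printf 'docker build -t spark-rm:%s --build-arg UID=%s %s/spark-rm\n' "$IMGTAG" "$(id -u)" "$SELF"
}

print_build_plan
```

The order matters because the branch-specific Dockerfile resolves `spark-rm-base:latest` from the local image store, so the base image has to exist before the second build starts.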
### Why are the changes needed?
Currently, each branch maintains its own full Dockerfile, which leads to:
- Duplicated common configuration across branches
- Difficulty keeping base tools (R packages, Ruby, etc.) in sync
- Expired GPG keys or outdated base images affecting all branches
With the composable approach:
- Common tools are defined once in `Dockerfile.base`
- Each branch only specifies its unique Java/Python requirements
- Updates to base tools can be applied consistently
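One way to sanity-check this layout is to confirm that a branch Dockerfile's first `FROM` instruction names the shared base. The following is a minimal sketch, assuming the `spark-rm-base:latest` tag used by this commit; `extends_base` is a hypothetical helper, and the sample file is written to a temporary directory purely for demonstration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the first FROM instruction in the
# given Dockerfile refers to the shared base image.
extends_base() {
  grep -E -m1 '^FROM[[:space:]]' "$1" \
    | grep -Eq '^FROM[[:space:]]+spark-rm-base:latest([[:space:]]|$)'
}

# Demonstration on a throwaway file that mimics a branch-specific Dockerfile.
tmpdir="$(mktemp -d)"
printf '%s\n' '# Spark 3.5 release image' 'FROM spark-rm-base:latest' 'ARG UID' \
  > "$tmpdir/Dockerfile"

if extends_base "$tmpdir/Dockerfile"; then
  echo "extends base"
else
  echo "does not extend base"
fi
```

A Dockerfile that still starts from `ubuntu:20.04`, like the one this commit replaces, would fail the check.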
### Version changes
| Component | Before | After |
|-----------|--------|-------|
| Ubuntu | 20.04 | 22.04 (jammy-20250819) |
| bundler | 2.3.8 | 2.4.22 |
| R CRAN repo | focal-cran40 | jammy-cran40 |
| docutils | <0.17 | ==0.16 |
| FULL_REFRESH_DATE | (none) | 20250819 |
**Note**: The Ubuntu upgrade from 20.04 to 22.04 is necessary to use a
shared base image across all branches. Ubuntu 20.04 reached end of standard
support in April 2025.
### Does this PR introduce _any_ user-facing change?
No. This only affects the release infrastructure.
### How was this patch tested?
The Docker image was built and verified successfully on a remote machine.
### Was this patch authored or co-authored using generative AI tooling?
Yes
Closes #53908 from cloud-fan/release-infra-3.5.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
---
dev/create-release/do-release-docker.sh | 5 ++
dev/create-release/spark-rm/Dockerfile | 119 +++++++++-------------------
dev/create-release/spark-rm/Dockerfile.base | 110 +++++++++++++++++++++++++
3 files changed, 154 insertions(+), 80 deletions(-)
diff --git a/dev/create-release/do-release-docker.sh b/dev/create-release/do-release-docker.sh
index 4e8fffd08062..c9f651e0ba57 100755
--- a/dev/create-release/do-release-docker.sh
+++ b/dev/create-release/do-release-docker.sh
@@ -120,6 +120,11 @@ GPG_KEY_FILE="$WORKDIR/gpg.key"
fcreate_secure "$GPG_KEY_FILE"
$GPG --export-secret-key --armor --pinentry-mode loopback --passphrase "$GPG_PASSPHRASE" "$GPG_KEY" > "$GPG_KEY_FILE"
+# Build base image first (contains common tools shared across all branches)
+run_silent "Building spark-rm-base image..." "docker-build-base.log" \
+ docker build -t "spark-rm-base:latest" -f "$SELF/spark-rm/Dockerfile.base" "$SELF/spark-rm"
+
+# Build branch-specific image (extends base with Java/Python versions for this branch)
run_silent "Building spark-rm image with tag $IMGTAG..." "docker-build.log" \
docker build -t "spark-rm:$IMGTAG" --build-arg UID=$UID "$SELF/spark-rm"
diff --git a/dev/create-release/spark-rm/Dockerfile b/dev/create-release/spark-rm/Dockerfile
index 7fb9c95bb0a3..8132e8ac6b3e 100644
--- a/dev/create-release/spark-rm/Dockerfile
+++ b/dev/create-release/spark-rm/Dockerfile
@@ -15,86 +15,45 @@
# limitations under the License.
#
-# Image for building Spark releases. Based on Ubuntu 20.04.
-#
-# Includes:
-# * Java 8
-# * Ivy
-# * Python (3.8.5)
-# * R-base/R-base-dev (4.0.3)
-# * Ruby (2.7.0)
-#
-# You can test it as below:
-# cd dev/create-release/spark-rm
-# docker build -t spark-rm --build-arg UID=$UID .
-
-FROM ubuntu:20.04
-
-# For apt to be noninteractive
-ENV DEBIAN_FRONTEND noninteractive
-ENV DEBCONF_NONINTERACTIVE_SEEN true
-
-# These arguments are just for reuse and not really meant to be customized.
-ARG APT_INSTALL="apt-get install --no-install-recommends -y"
-
-# TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
-# See also https://github.com/sphinx-doc/sphinx/issues/7551.
-# We should use the latest Sphinx version once this is fixed.
-# TODO(SPARK-35375): Jinja2 3.0.0+ causes error when building with Sphinx.
-# See also https://issues.apache.org/jira/browse/SPARK-35375.
-ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.1.2 numpy==1.20.3 pydata_sphinx_theme==0.8.0 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 jinja2==2.11.3 twine==3.4.1 sphinx-plotly-directive==0.1.3 sphinx-copybutton==0.5.2 pandas==2.0.3 pyarrow==4.0.0 plotly==5.4.0 markupsafe==2.0.1 docutils<0.17 grpcio==1.56.0 protobuf==4.21.6 grpcio-status==1.56.0 googleapis-common-protos==1.56.4"
-ARG GEM_PKGS="bundler:2.3.8"
-
-# Install extra needed repos and refresh.
-# - CRAN repo
-# - Ruby repo (for doc generation)
-#
-# This is all in a single "RUN" command so that if anything changes, "apt update" is run to fetch
-# the most current package versions (instead of potentially using old versions cached by docker).
-RUN apt-get clean && apt-get update && $APT_INSTALL gnupg ca-certificates && \
- echo 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/' >> /etc/apt/sources.list && \
- gpg --keyserver hkps://keyserver.ubuntu.com --recv-key E298A3A825C0D65DFD57CBB651716619E084DAB9 && \
- gpg -a --export E084DAB9 | apt-key add - && \
- apt-get clean && \
- rm -rf /var/lib/apt/lists/* && \
- apt-get clean && \
- apt-get update && \
- $APT_INSTALL software-properties-common && \
- apt-get update && \
- $APT_INSTALL msmtp && \
- # Install openjdk 8.
- $APT_INSTALL openjdk-8-jdk && \
- update-alternatives --set java $(ls /usr/lib/jvm/java-8-openjdk-*/jre/bin/java) && \
- # Install build / source control tools
- $APT_INSTALL curl wget git maven ivy subversion make gcc lsof libffi-dev \
- pandoc pandoc-citeproc libssl-dev libcurl4-openssl-dev libxml2-dev && \
- curl -sL https://deb.nodesource.com/setup_12.x | bash && \
- $APT_INSTALL nodejs && \
- # Install needed python packages. Use pip for installing packages (for consistency).
- $APT_INSTALL python-is-python3 python3-pip python3-setuptools && \
- # qpdf is required for CRAN checks to pass.
- $APT_INSTALL qpdf jq && \
- pip3 install $PIP_PKGS && \
- # Install R packages and dependencies used when building.
- # R depends on pandoc*, libssl (which are installed above).
- # Note that PySpark doc generation also needs pandoc due to nbsphinx
- $APT_INSTALL r-base r-base-dev && \
- $APT_INSTALL libcurl4-openssl-dev libgit2-dev libssl-dev libxml2-dev && \
- $APT_INSTALL texlive-latex-base texlive texlive-fonts-extra texinfo qpdf texlive-latex-extra && \
- $APT_INSTALL libfontconfig1-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev libwebp-dev && \
- Rscript -e "install.packages(c('curl', 'xml2', 'httr', 'devtools', 'testthat', 'knitr', 'rmarkdown', 'markdown', 'e1071', 'survival'), repos='https://cloud.r-project.org/')" && \
- # See more in SPARK-39959, roxygen2 < 7.2.1
- Rscript -e "devtools::install_version('roxygen2', version='7.2.0', repos='https://cloud.r-project.org')" && \
- Rscript -e "devtools::install_version('lintr', version='2.0.1', repos='https://cloud.r-project.org')" && \
- Rscript -e "devtools::install_version('preferably', version='0.4', repos='https://cloud.r-project.org')" && \
- # See more in SPARK-54371, pkgdown should be installed at the end to avoid version upgrade
- Rscript -e "devtools::install_version('pkgdown', version='2.0.1', repos='https://cloud.r-project.org')" && \
- # Install tools needed to build the documentation.
- $APT_INSTALL ruby2.7 ruby2.7-dev && \
- gem install --no-document $GEM_PKGS
-
-WORKDIR /opt/spark-rm/output
-
+# Spark 3.5 release image
+# Extends the base image with:
+# - Java 8
+# - Python 3.8 with required packages
+
+FROM spark-rm-base:latest
+
+# Install Java 8 for Spark 3.x
+RUN apt-get update && apt-get install -y \
+ openjdk-8-jdk-headless \
+ && rm -rf /var/lib/apt/lists/*
+
+# Install Python 3.8 from deadsnakes PPA
+RUN add-apt-repository ppa:deadsnakes/ppa && \
+ apt-get update && apt-get install -y \
+ python3.8 \
+ python3.8-dev \
+ python3.8-distutils \
+ && rm -rf /var/lib/apt/lists/*
+
+# Install pip for Python 3.8 (using version-specific URL)
+RUN curl -sS https://bootstrap.pypa.io/pip/3.8/get-pip.py | python3.8
+
+# Python packages for Spark 3.5
+# Based on the original branch-3.5 Dockerfile
+ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.1.2 numpy==1.20.3 pydata_sphinx_theme==0.8.0 \
+ ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 jinja2==2.11.3 twine==3.4.1 \
+ sphinx-plotly-directive==0.1.3 sphinx-copybutton==0.5.2 pandas==2.0.3 pyarrow==4.0.0 \
+ plotly==5.4.0 markupsafe==2.0.1 docutils==0.16 grpcio==1.56.0 protobuf==4.21.6 \
+ grpcio-status==1.56.0 googleapis-common-protos==1.56.4"
+
+# Install Python 3.8 packages
+RUN python3.8 -m pip install --ignore-installed $PIP_PKGS
+
+# Set Python 3.8 as the default
+RUN ln -sf "$(which python3.8)" "/usr/local/bin/python" && \
+ ln -sf "$(which python3.8)" "/usr/local/bin/python3"
+
+# Create user for release manager
ARG UID
RUN useradd -m -s /bin/bash -p spark-rm -u $UID spark-rm
USER spark-rm:spark-rm
diff --git a/dev/create-release/spark-rm/Dockerfile.base b/dev/create-release/spark-rm/Dockerfile.base
new file mode 100644
index 000000000000..56e85256d52d
--- /dev/null
+++ b/dev/create-release/spark-rm/Dockerfile.base
@@ -0,0 +1,110 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Base image for building Spark releases. Based on Ubuntu 22.04.
+# This image contains common tools shared across all Spark versions:
+# - Build tools (gcc, make, etc.)
+# - R with pinned package versions
+# - Ruby with bundler
+# - TeX for documentation
+# - Node.js for documentation
+#
+# Branch-specific Dockerfiles should use "FROM spark-rm-base:latest" and add:
+# - Java version (8 or 17)
+# - Python version and pip packages
+
+FROM ubuntu:jammy-20250819
+LABEL org.opencontainers.image.authors="Apache Spark project <[email protected]>"
+LABEL org.opencontainers.image.licenses="Apache-2.0"
+LABEL org.opencontainers.image.ref.name="Apache Spark Release Manager Base Image"
+LABEL org.opencontainers.image.version=""
+
+ENV FULL_REFRESH_DATE=20250819
+
+ENV DEBIAN_FRONTEND=noninteractive
+ENV DEBCONF_NONINTERACTIVE_SEEN=true
+
+# Install common system packages and build tools
+# Note: Java and Python are installed in branch-specific Dockerfiles
+RUN apt-get update && apt-get install -y \
+ build-essential \
+ ca-certificates \
+ curl \
+ gfortran \
+ git \
+ subversion \
+ gnupg \
+ libcurl4-openssl-dev \
+ libfontconfig1-dev \
+ libfreetype6-dev \
+ libfribidi-dev \
+ libgit2-dev \
+ libharfbuzz-dev \
+ libjpeg-dev \
+ liblapack-dev \
+ libopenblas-dev \
+ libpng-dev \
+ libssl-dev \
+ libtiff5-dev \
+ libwebp-dev \
+ libxml2-dev \
+ msmtp \
+ nodejs \
+ npm \
+ pandoc \
+ pkg-config \
+ texlive-latex-base \
+ texlive \
+ texlive-fonts-extra \
+ texinfo \
+ texlive-latex-extra \
+ qpdf \
+ jq \
+ r-base \
+ ruby \
+ ruby-dev \
+ software-properties-common \
+ wget \
+ zlib1g-dev \
+ && rm -rf /var/lib/apt/lists/*
+
+# Set up R CRAN repository for latest R packages
+RUN echo 'deb https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/' >> /etc/apt/sources.list && \
+ gpg --keyserver hkps://keyserver.ubuntu.com --recv-key E298A3A825C0D65DFD57CBB651716619E084DAB9 && \
+ gpg -a --export E084DAB9 | apt-key add - && \
+ add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/'
+
+# Install R packages (same versions across all branches)
+# See more in SPARK-39959, roxygen2 < 7.2.1
+RUN Rscript -e "install.packages(c('devtools', 'knitr', 'markdown', \
+ 'rmarkdown', 'testthat', 'e1071', 'survival', 'arrow', \
+ 'ggplot2', 'mvtnorm', 'statmod', 'xml2'), repos='https://cloud.r-project.org/')" && \
+ Rscript -e "devtools::install_version('roxygen2', version='7.2.0', repos='https://cloud.r-project.org')" && \
+ Rscript -e "devtools::install_version('lintr', version='2.0.1', repos='https://cloud.r-project.org')" && \
+ Rscript -e "devtools::install_version('preferably', version='0.4', repos='https://cloud.r-project.org')" && \
+ Rscript -e "devtools::install_version('pkgdown', version='2.0.1', repos='https://cloud.r-project.org')"
+
+# See more in SPARK-39735
+ENV R_LIBS_SITE="/usr/local/lib/R/site-library:${R_LIBS_SITE}:/usr/lib/R/library"
+
+# Install Ruby bundler (same version across all branches)
+RUN gem install --no-document "bundler:2.4.22"
+
+# Create workspace directory
+WORKDIR /opt/spark-rm/output
+
+# Note: Java, Python, and user creation are done in branch-specific Dockerfiles