This is an automated email from the ASF dual-hosted git repository.
yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 3a9307f59adb [SPARK-55401][PYTHON] Add retry logic and timeout handling to pyspark install download
3a9307f59adb is described below
commit 3a9307f59adb64b59b6afae8e449e9cec6b969fe
Author: Kent Yao <[email protected]>
AuthorDate: Sun Feb 8 00:38:51 2026 +0800
[SPARK-55401][PYTHON] Add retry logic and timeout handling to pyspark install download
### What changes were proposed in this pull request?
This PR adds retry logic and timeout handling to the Spark distribution
download in `pyspark/install.py` to reduce flakiness in
`pyspark.tests.test_install_spark`.
**Changes:**
1. **Added `timeout=10`** to the mirror resolution `urlopen()` call in
`get_preferred_mirrors()` — prevents hanging when `closer.lua` is unresponsive
2. **Added `_download_with_retries()` helper** — wraps the download with:
- Configurable timeout (default: 600s) on `urlopen()` to prevent
indefinite hangs
  - Up to 3 retry attempts with exponential backoff (5s, 10s, 20s; see the backoff sketch after this list)
- Cleanup of partial downloads on failure
- Clear logging of retry attempts for CI debugging
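For reference, the waits come from the `wait = 2**attempt * 5` expression in the new helper (shown in the diff below); a minimal sketch of the schedule that expression produces:

```python
# Backoff schedule implied by wait = 2**attempt * 5 in _download_with_retries.
schedule = [2**attempt * 5 for attempt in range(3)]
print(schedule)  # [5, 10, 20] -- the wait doubles between successive attempts
```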
### Why are the changes needed?
The `pyspark-install` CI job frequently fails due to transient network
issues when downloading ~400MB Spark distributions from Apache mirrors. Current
issues:
- `urlopen()` has no timeout — downloads can hang indefinitely
- No retry logic — a single transient network error causes complete failure
- CI logs show downloads stalling mid-stream (e.g., at 64%) with no recovery
### Does this PR introduce _any_ user-facing change?
No. The download behavior is improved with retries and timeouts, but the
API is unchanged. Users who call `install_spark` or `pip install pyspark` with
`PYSPARK_HADOOP_VERSION` will benefit from more reliable downloads.
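For illustration only (the argument values below are placeholders, not taken from this patch), the call shape is unchanged and matches the `install_spark` signature visible in the diff:

```python
# Hypothetical call site; only the argument values are made up here,
# the signature matches install_spark in python/pyspark/install.py.
from pyspark.install import install_spark

install_spark(
    dest="/tmp/spark-dist",       # directory to unpack the distribution into
    spark_version="spark-4.1.0",  # placeholder release name
    hadoop_version="hadoop3",     # what PYSPARK_HADOOP_VERSION selects
    hive_version="hive2.3",       # placeholder
)
```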
### How was this patch tested?
- Existing unit tests pass: `test_package_name`, `test_checked_versions`
- The download test (`test_install_spark`) exercises the new retry path in CI; a manual smoke-test sketch follows below
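In addition, the helper can be exercised by hand (this is not part of the patch's test suite; the URL and destination path are arbitrary examples):

```python
# Manually exercise _download_with_retries from this patch against a small,
# publicly hosted file; the URL and local path are just examples.
from pyspark.install import _download_with_retries

_download_with_retries(
    "https://downloads.apache.org/spark/KEYS",  # small example file
    "/tmp/KEYS",
    max_retries=3,
    timeout=60,
)
```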
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #54183 from yaooqinn/SPARK-55401.
Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
---
python/pyspark/install.py | 39 +++++++++++++++++++++++++++++++++++++--
1 file changed, 37 insertions(+), 2 deletions(-)
diff --git a/python/pyspark/install.py b/python/pyspark/install.py
index ba67a157e964..c68745853200 100644
--- a/python/pyspark/install.py
+++ b/python/pyspark/install.py
@@ -17,6 +17,7 @@
import os
import re
import tarfile
+import time
import traceback
import urllib.request
from shutil import rmtree
@@ -143,7 +144,7 @@ def install_spark(dest, spark_version, hadoop_version, hive_version):
     tar = None
     try:
         print("Downloading %s from:\n- %s" % (pretty_pkg_name, url))
-        download_to_file(urllib.request.urlopen(url), package_local_path)
+        _download_with_retries(url, package_local_path)

         print("Installing to %s" % dest)
         tar = tarfile.open(package_local_path, "r:gz")
@@ -171,7 +172,7 @@ def get_preferred_mirrors():
     for _ in range(3):
         try:
             response = urllib.request.urlopen(
-                "https://www.apache.org/dyn/closer.lua?preferred=true"
+                "https://www.apache.org/dyn/closer.lua?preferred=true", timeout=10
             )
             mirror_urls.append(response.read().decode("utf-8"))
         except Exception:
@@ -186,6 +187,40 @@ def get_preferred_mirrors():
     return list(set(mirror_urls)) + [x for x in default_sites if x not in mirror_urls]
+def _download_with_retries(url, path, max_retries=3, timeout=600):
+    """
+    Download a file from a URL with retry logic and timeout handling.
+
+    Parameters
+    ----------
+    url : str
+        The URL to download from.
+    path : str
+        The local file path to save the downloaded file.
+    max_retries : int
+        Maximum number of retry attempts per URL.
+    timeout : int
+        Timeout in seconds for the HTTP request.
+    """
+    for attempt in range(max_retries):
+        try:
+            response = urllib.request.urlopen(url, timeout=timeout)
+            download_to_file(response, path)
+            return
+        except Exception as e:
+            if os.path.exists(path):
+                os.remove(path)
+            if attempt < max_retries - 1:
+                wait = 2**attempt * 5
+                print(
+                    "Download attempt %d/%d failed: %s. Retrying in %d seconds..."
+                    % (attempt + 1, max_retries, str(e), wait)
+                )
+                time.sleep(wait)
+            else:
+                raise
+
+
def download_to_file(response, path, chunk_size=1024 * 1024):
total_size = int(response.info().get("Content-Length").strip())
bytes_so_far = 0
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]