This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 3a9307f59adb [SPARK-55401][PYTHON] Add retry logic and timeout handling to pyspark install download
3a9307f59adb is described below

commit 3a9307f59adb64b59b6afae8e449e9cec6b969fe
Author: Kent Yao <[email protected]>
AuthorDate: Sun Feb 8 00:38:51 2026 +0800

    [SPARK-55401][PYTHON] Add retry logic and timeout handling to pyspark install download
    
    ### What changes were proposed in this pull request?
    
    This PR adds retry logic and timeout handling to the Spark distribution download in `pyspark/install.py` to reduce flakiness in `pyspark.tests.test_install_spark`.
    
    **Changes:**
    1. **Added `timeout=10`** to the mirror resolution `urlopen()` call in `get_preferred_mirrors()` — prevents hanging when `closer.lua` is unresponsive
    2. **Added `_download_with_retries()` helper** — wraps the download with:
       - Configurable timeout (default: 600s) on `urlopen()` to prevent indefinite hangs
       - Up to 3 download attempts with exponential backoff between them (5s, then 10s; sketched after this list)
       - Cleanup of partial downloads on failure
       - Clear logging of retry attempts for CI debugging
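
    A minimal sketch (illustration only, not part of the patch) of the backoff schedule computed by `wait = 2**attempt * 5` in the helper below:

    ```python
    # Mirrors the arithmetic in _download_with_retries: sleep after each of the
    # first two failed attempts, re-raise after the third.
    max_retries = 3
    for attempt in range(max_retries):
        if attempt < max_retries - 1:
            print("attempt %d failed, retrying in %ds" % (attempt + 1, 2**attempt * 5))
        else:
            print("attempt %d failed, giving up" % (attempt + 1))
    # attempt 1 failed, retrying in 5s
    # attempt 2 failed, retrying in 10s
    # attempt 3 failed, giving up
    ```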
    
    ### Why are the changes needed?
    
    The `pyspark-install` CI job frequently fails due to transient network issues when downloading ~400MB Spark distributions from Apache mirrors. Current issues:
    - `urlopen()` has no timeout — downloads can hang indefinitely (see the sketch after this list)
    - No retry logic — a single transient network error causes complete failure
    - CI logs show downloads stalling mid-stream (e.g., at 64%) with no recovery
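
    A minimal sketch of the first issue and its fix (placeholder URL; not part of the patch): without `timeout`, `urlopen()` can block indefinitely on a stalled connection, while with it the connect phase raises `urllib.error.URLError` after the given number of seconds.

    ```python
    import urllib.error
    import urllib.request

    # "mirror.invalid" is a placeholder; any unreachable host raises the same
    # exception type, so the caller can fail fast and retry instead of hanging.
    try:
        urllib.request.urlopen("https://mirror.invalid/spark.tgz", timeout=10)
    except urllib.error.URLError as e:
        print("failed fast instead of hanging: %s" % e.reason)
    ```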
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. The download behavior is improved with retries and timeouts, but the API is unchanged. Users who call `install_spark` or `pip install pyspark` with `PYSPARK_HADOOP_VERSION` will benefit from more reliable downloads.
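
    For reference, a hypothetical direct invocation of the patched function (the signature comes from this diff; the argument values are illustrative, not verified defaults):

    ```python
    from pyspark.install import install_spark

    install_spark(
        dest="/tmp/spark-dist",        # unpack destination (illustrative)
        spark_version="spark-4.0.0",   # illustrative version strings
        hadoop_version="hadoop3",
        hive_version="hive2.3",
    )
    ```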
    
    ### How was this patch tested?
    
    - Existing unit tests pass: `test_package_name`, `test_checked_versions`
    - The download test (`test_install_spark`) exercises the new retry path in CI
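
    A hypothetical sketch (not from this patch) of how the retry path could be exercised in isolation, mocking out the network and the backoff sleeps:

    ```python
    import unittest
    from unittest.mock import patch

    class RetrySketch(unittest.TestCase):
        @patch("pyspark.install.time.sleep")  # skip the real backoff waits
        @patch("pyspark.install.download_to_file")
        @patch("pyspark.install.urllib.request.urlopen")
        def test_retries_then_succeeds(self, mock_urlopen, mock_download, mock_sleep):
            from pyspark.install import _download_with_retries

            # First attempt fails with a transient error, second one succeeds.
            mock_urlopen.side_effect = [OSError("connection reset"), object()]
            _download_with_retries("https://mirror.invalid/pkg.tgz", "/tmp/nonexistent.tgz")
            self.assertEqual(mock_urlopen.call_count, 2)
    ```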
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #54183 from yaooqinn/SPARK-55401.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
---
 python/pyspark/install.py | 39 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/install.py b/python/pyspark/install.py
index ba67a157e964..c68745853200 100644
--- a/python/pyspark/install.py
+++ b/python/pyspark/install.py
@@ -17,6 +17,7 @@
 import os
 import re
 import tarfile
+import time
 import traceback
 import urllib.request
 from shutil import rmtree
@@ -143,7 +144,7 @@ def install_spark(dest, spark_version, hadoop_version, hive_version):
         tar = None
         try:
             print("Downloading %s from:\n- %s" % (pretty_pkg_name, url))
-            download_to_file(urllib.request.urlopen(url), package_local_path)
+            _download_with_retries(url, package_local_path)
 
             print("Installing to %s" % dest)
             tar = tarfile.open(package_local_path, "r:gz")
@@ -171,7 +172,7 @@ def get_preferred_mirrors():
     for _ in range(3):
         try:
             response = urllib.request.urlopen(
-                "https://www.apache.org/dyn/closer.lua?preferred=true";
+                "https://www.apache.org/dyn/closer.lua?preferred=true";, 
timeout=10
             )
             mirror_urls.append(response.read().decode("utf-8"))
         except Exception:
@@ -186,6 +187,40 @@ def get_preferred_mirrors():
     return list(set(mirror_urls)) + [x for x in default_sites if x not in mirror_urls]
 
 
+def _download_with_retries(url, path, max_retries=3, timeout=600):
+    """
+    Download a file from a URL with retry logic and timeout handling.
+
+    Parameters
+    ----------
+    url : str
+        The URL to download from.
+    path : str
+        The local file path to save the downloaded file.
+    max_retries : int
+        Maximum number of retry attempts per URL.
+    timeout : int
+        Timeout in seconds for the HTTP request.
+    """
+    for attempt in range(max_retries):
+        try:
+            response = urllib.request.urlopen(url, timeout=timeout)
+            download_to_file(response, path)
+            return
+        except Exception as e:
+            if os.path.exists(path):
+                os.remove(path)
+            if attempt < max_retries - 1:
+                wait = 2**attempt * 5
+                print(
+                    "Download attempt %d/%d failed: %s. Retrying in %d 
seconds..."
+                    % (attempt + 1, max_retries, str(e), wait)
+                )
+                time.sleep(wait)
+            else:
+                raise
+
+
 def download_to_file(response, path, chunk_size=1024 * 1024):
     total_size = int(response.info().get("Content-Length").strip())
     bytes_so_far = 0


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
