Copilot commented on code in PR #54183:
URL: https://github.com/apache/spark/pull/54183#discussion_r2775612429


##########
python/pyspark/install.py:
##########
@@ -186,6 +187,40 @@ def get_preferred_mirrors():
     return list(set(mirror_urls)) + [x for x in default_sites if x not in mirror_urls]
 
 
+def _download_with_retries(url, path, max_retries=3, timeout=600):
+    """
+    Download a file from a URL with retry logic and timeout handling.
+
+    Parameters
+    ----------
+    url : str
+        The URL to download from.
+    path : str
+        The local file path to save the downloaded file.
+    max_retries : int
+        Maximum number of retry attempts per URL.
+    timeout : int
+        Timeout in seconds for the HTTP request.
+    """
+    for attempt in range(max_retries):
+        try:
+            response = urllib.request.urlopen(url, timeout=timeout)
+            download_to_file(response, path)
+            return
+        except Exception as e:
+            if os.path.exists(path):
+                os.remove(path)

Review Comment:
   The current retry loop also retries non-transient failures such as `HTTPError` 404 (bad version / missing artifact) or 403, which can add minutes of unnecessary backoff before the outer loop moves on to the next mirror. Consider special-casing `urllib.error.HTTPError` to retry only likely-transient status codes (e.g., 429/5xx) and fail fast otherwise.
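   A sketch of that shape; the helper name and the exact status set are illustrative, not taken from the PR:

   ```python
   import urllib.error

   # Status codes that are plausibly transient and worth retrying.
   _TRANSIENT_STATUS = frozenset({429, 500, 502, 503, 504})

   def _should_retry(exc):
       # 403/404 and similar client errors mean a bad URL or missing artifact,
       # so retrying only delays falling through to the next mirror.
       if isinstance(exc, urllib.error.HTTPError):
           return exc.code in _TRANSIENT_STATUS
       # Timeouts, connection resets, etc. stay retryable.
       return True
   ```

   The existing `except` block could then re-raise immediately when `_should_retry(e)` is false, leaving the backoff path for transient errors only.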



##########
python/pyspark/install.py:
##########
@@ -186,6 +187,40 @@ def get_preferred_mirrors():
     return list(set(mirror_urls)) + [x for x in default_sites if x not in mirror_urls]
 
 
+def _download_with_retries(url, path, max_retries=3, timeout=600):
+    """
+    Download a file from a URL with retry logic and timeout handling.
+
+    Parameters
+    ----------
+    url : str
+        The URL to download from.
+    path : str
+        The local file path to save the downloaded file.
+    max_retries : int
+        Maximum number of retry attempts per URL.
+    timeout : int
+        Timeout in seconds for the HTTP request.
+    """
+    for attempt in range(max_retries):
+        try:
+            response = urllib.request.urlopen(url, timeout=timeout)
+            download_to_file(response, path)
+            return

Review Comment:
   `response` is never closed, so each attempt can leak an open HTTP 
connection/file descriptor (especially problematic with retries). Wrap the 
`urlopen` call in a context manager or ensure `response.close()` runs in a 
`finally` block even when `download_to_file` raises.
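   A minimal sketch of the context-manager form for the body of the `try` (everything else in the loop unchanged):

   ```python
   with urllib.request.urlopen(url, timeout=timeout) as response:
       # Closed on block exit, even if download_to_file raises mid-transfer.
       download_to_file(response, path)
   return
   ```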



##########
python/pyspark/install.py:
##########
@@ -186,6 +187,40 @@ def get_preferred_mirrors():
     return list(set(mirror_urls)) + [x for x in default_sites if x not in mirror_urls]
 
 
+def _download_with_retries(url, path, max_retries=3, timeout=600):
+    """
+    Download a file from a URL with retry logic and timeout handling.
+
+    Parameters
+    ----------
+    url : str
+        The URL to download from.
+    path : str
+        The local file path to save the downloaded file.
+    max_retries : int
+        Maximum number of retry attempts per URL.
+    timeout : int
+        Timeout in seconds for the HTTP request.
+    """
+    for attempt in range(max_retries):
+        try:
+            response = urllib.request.urlopen(url, timeout=timeout)
+            download_to_file(response, path)
+            return
+        except Exception as e:
+            if os.path.exists(path):
+                os.remove(path)
+            if attempt < max_retries - 1:
+                wait = 2**attempt * 5
+                print(
+                    "Download attempt %d/%d failed: %s. Retrying in %d seconds..."
+                    % (attempt + 1, max_retries, str(e), wait)
+                )
+                time.sleep(wait)

Review Comment:
   Backoff timing here doesn’t match the PR description (5s, 10s, 20s): with 
`max_retries=3`, the code only sleeps for 5s and 10s (there’s no sleep after 
the last attempt). Either adjust the retry/backoff logic (e.g., treat 
`max_retries` as "retries" not "total attempts") or update the PR 
description/docstring so they’re consistent.
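   As a sketch, treating `max_retries` as the number of retries (four attempts total, with 5s/10s/20s backoff) could look like the following; whether to change the code or the description/docstring is of course the author's call, and the final `raise` is only one possible way to surface the last failure:

   ```python
   for attempt in range(max_retries + 1):
       try:
           with urllib.request.urlopen(url, timeout=timeout) as response:
               download_to_file(response, path)
           return
       except Exception as e:
           if os.path.exists(path):
               os.remove(path)
           if attempt == max_retries:
               raise  # retries exhausted; let the caller try the next mirror
           wait = 2**attempt * 5  # 5s, 10s, 20s
           print(
               "Download attempt %d/%d failed: %s. Retrying in %d seconds..."
               % (attempt + 1, max_retries + 1, str(e), wait)
           )
           time.sleep(wait)
   ```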



##########
python/pyspark/install.py:
##########
@@ -171,7 +172,7 @@ def get_preferred_mirrors():
     for _ in range(3):
         try:
             response = urllib.request.urlopen(
-                "https://www.apache.org/dyn/closer.lua?preferred=true"
+                "https://www.apache.org/dyn/closer.lua?preferred=true", timeout=10
             )
             mirror_urls.append(response.read().decode("utf-8"))

Review Comment:
   `urlopen()` responses should be closed to avoid leaking sockets/file 
descriptors. Consider using a context manager (`with 
urllib.request.urlopen(...) as response:`) (or `try/finally: response.close()`) 
around the `response.read()` call here.
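   Concretely, something along these lines (together with the `timeout=10` this PR adds):

   ```python
   with urllib.request.urlopen(
       "https://www.apache.org/dyn/closer.lua?preferred=true", timeout=10
   ) as response:
       mirror_urls.append(response.read().decode("utf-8"))
   ```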



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

