yaooqinn opened a new pull request, #54183:
URL: https://github.com/apache/spark/pull/54183
### What changes were proposed in this pull request?
This PR adds retry logic and timeout handling to the Spark distribution
download in `pyspark/install.py` to reduce flakiness in
`pyspark.tests.test_install_spark`.
**Changes:**
1. **Added `timeout=10`** to the mirror resolution `urlopen()` call in
`get_preferred_mirrors()` — prevents hanging when `closer.lua` is unresponsive
2. **Added `_download_with_retries()` helper** — wraps the download with:
- Configurable timeout (default: 600s) on `urlopen()` to prevent
indefinite hangs
- Up to 3 retry attempts with exponential backoff (5s, 10s, 20s)
- Cleanup of partial downloads on failure
- Clear logging of retry attempts for CI debugging
### Why are the changes needed?
The `pyspark-install` CI job frequently fails due to transient network
issues when downloading the ~400MB Spark distribution from Apache mirrors. The
current implementation has several problems:
- `urlopen()` has no timeout — downloads can hang indefinitely
- No retry logic — a single transient network error causes complete failure
- CI logs show downloads stalling mid-stream (e.g., at 64%) with no recovery
### Does this PR introduce _any_ user-facing change?
No. The download behavior is improved with retries and timeouts, but the API
is unchanged. Users who call `install_spark` or `pip install pyspark` with
`PYSPARK_HADOOP_VERSION` will benefit from more reliable downloads.
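For reference, the env-var-driven install path mentioned above looks like this (illustrative; `PYSPARK_HADOOP_VERSION` selects which Hadoop build of the Spark distribution is downloaded at install time):

```shell
# Download the Spark distribution built for Hadoop 3 during pip install
PYSPARK_HADOOP_VERSION=3 pip install pyspark
```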
### How was this patch tested?
- Existing unit tests pass: `test_package_name`, `test_checked_versions`
- The download test (`test_install_spark`) exercises the new retry path in CI
### Was this patch authored or co-authored using generative AI tooling?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]