Yicong Huang created SPARK-55620:
------------------------------------
Summary: test_connect_session flaky timeout due to shutdown deadlock
Key: SPARK-55620
URL: https://issues.apache.org/jira/browse/SPARK-55620
Project: Spark
Issue Type: Bug
Components: PySpark, Connect
Affects Versions: 4.2.0
Reporter: Yicong Huang
h2. Description
{{test_connect_session}} occasionally times out (450 seconds) in CI. The test
normally completes in 20 seconds but sometimes hangs indefinitely during
shutdown, causing flaky test failures.
h2. Reproduce
This is a flaky bug with ~33% failure rate:
1. Run {{python/run-tests.py --testnames pyspark.sql.tests.connect.test_connect_session}}
2. The test may hang until it hits the 450-second timeout
*Evidence from CI runs:*
- Run 22196465437: Cancelled after 4m10s
- Run 22196593939: Timeout after 1h22m (hung at 450s)
- Run 22237720726: Success in 20s ✓
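Because the hang is intermittent, it can help to run the reproduce command several times and count how often it hits the timeout. The sketch below is only an illustration: {{count_hangs}} is a hypothetical helper, not part of the Spark test tooling, and it assumes it is started from the Spark repository root.

```python
import subprocess
import sys


def count_hangs(cmd, runs, timeout_s):
    """Run cmd `runs` times and count how many runs exceed timeout_s seconds.

    Hypothetical helper for estimating how often a flaky test hangs.
    """
    hangs = 0
    for _ in range(runs):
        try:
            # subprocess.run kills the direct child and raises TimeoutExpired
            # if it does not finish within timeout_s. Grandchild processes
            # spawned by the test runner may need separate cleanup.
            subprocess.run(cmd, timeout=timeout_s, check=False)
        except subprocess.TimeoutExpired:
            hangs += 1
    return hangs


# Command taken from the reproduce step above; the run count is arbitrary.
REPRO_CMD = [
    sys.executable,
    "python/run-tests.py",
    "--testnames",
    "pyspark.sql.tests.connect.test_connect_session",
]
```

Calling {{count_hangs(REPRO_CMD, runs=10, timeout_s=450)}} from the repository root would report how many of ten runs hung. Ten is an arbitrary sample size and gives only a rough estimate of the failure rate.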
h2. Root Cause
Deadlock during Python shutdown when {{ReleaseExecute}} cleanup tasks are still
executing:
{code}
Session.__del__()
→ client.close() waits: concurrent.futures.wait(self._release_futures)
→ Worker thread executes: ReleaseExecute() gRPC call
→ gRPC attempts: threading.Thread().start()
→ Python 3.12 blocks thread creation during shutdown
→ DEADLOCK (main waits for worker, worker waits for thread)
{code}
Thread stacks show:
- Main thread: blocked in {{concurrent.futures.wait()}}
- Worker thread: blocked in {{threading.Thread.start() -> self._started.wait()}}
The {{ReleaseExecute}} tasks are asynchronous cleanup submitted during test
execution. If they haven't completed when Python shuts down, gRPC's attempt to
spawn I/O threads gets blocked.
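The blocking pattern described above can be imitated in isolation. The sketch below is not the Spark Connect client code and not necessarily the fix chosen for this issue. It only shows how an unbounded {{concurrent.futures.wait()}} on a stuck worker would hang, and how passing a {{timeout}} lets the caller give up instead. {{close_with_bounded_wait}} and the {{stuck}} event are hypothetical names, and the stuck worker stands in for a {{ReleaseExecute}} call that never returns.

```python
import concurrent.futures
import threading


def close_with_bounded_wait(release_futures, timeout_s=1.0):
    """Hypothetical stand-in for a close() that waits on cleanup futures.

    Waits at most timeout_s seconds instead of blocking forever.
    Returns (number finished, number still pending).
    """
    done, not_done = concurrent.futures.wait(release_futures, timeout=timeout_s)
    for future in not_done:
        # cancel() only succeeds for tasks that have not started running;
        # an already-running task is left alone.
        future.cancel()
    return len(done), len(not_done)


# Simulate a cleanup task that never completes on its own.
stuck = threading.Event()
executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
futures = [executor.submit(stuck.wait)]

# Without a timeout this wait would block indefinitely, like the main thread
# in the stack above. With a timeout it returns after about 0.2 seconds.
finished, pending = close_with_bounded_wait(futures, timeout_s=0.2)

# Unblock the simulated worker so the example itself exits cleanly.
stuck.set()
executor.shutdown(wait=True)
```

A bounded wait trades guaranteed cleanup for guaranteed termination: any {{ReleaseExecute}} calls still pending at the deadline are abandoned, which may be acceptable at interpreter shutdown but is an assumption that would need checking against the server's own cleanup behavior.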