[
https://issues.apache.org/jira/browse/SPARK-55620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18062358#comment-18062358
]
Tae Hwan Eom commented on SPARK-55620:
--------------------------------------
Hi, I'm new to contributing to Spark, but I'd like to try fixing this issue.
> test_connect_session flaky timeout due to shutdown deadlock
> -----------------------------------------------------------
>
> Key: SPARK-55620
> URL: https://issues.apache.org/jira/browse/SPARK-55620
> Project: Spark
> Issue Type: Bug
> Components: Connect, PySpark
> Affects Versions: 4.2.0
> Reporter: Yicong Huang
> Priority: Minor
>
> h2. Description
> {{test_connect_session}} occasionally times out (450 seconds) in CI. The test
> normally completes in 20 seconds but sometimes hangs indefinitely during
> shutdown, causing flaky test failures.
> h2. Reproduce
> This is a flaky bug with a ~33% failure rate:
> 1. Run {{python/run-tests.py --testnames
> pyspark.sql.tests.connect.test_connect_session}}
> 2. The test may hang until it hits the 450-second timeout
> *Evidence from CI runs:*
> - [Run
> 22196465437|https://github.com/Yicong-Huang/spark/actions/runs/22196465437]:
> Cancelled after 4m10s
> - [Run
> 22196593939|https://github.com/Yicong-Huang/spark/actions/runs/22196593939]:
> Timeout after 1h22m (hung at 450s)
> - [Run
> 22237720726|https://github.com/Yicong-Huang/spark/actions/runs/22237720726]:
> Success in 20s ✓
> h2. Root Cause
> Deadlock during Python shutdown when {{ReleaseExecute}} cleanup tasks are
> still executing:
> {code}
> Session.__del__()
> → client.close() waits: concurrent.futures.wait(self._release_futures)
> → Worker thread executes: ReleaseExecute() gRPC call
> → gRPC attempts: threading.Thread().start()
> → Python 3.12 blocks thread creation during shutdown
> → DEADLOCK (main waits for worker, worker waits for thread)
> {code}
> Thread stacks show:
> - Main thread: blocked in {{concurrent.futures.wait()}}
> - Worker thread: blocked in {{threading.Thread.start() -> self._started.wait()}}
> The {{ReleaseExecute}} tasks are asynchronous cleanup tasks submitted during
> test execution. If they haven't completed when Python shuts down, gRPC's
> attempt to spawn I/O threads gets blocked.
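
One possible direction, sketched here only to illustrate the wait pattern in the trace above: bound the wait on the release futures so that the main thread cannot block forever in concurrent.futures.wait(). This is not the actual PySpark client code and not a proposed patch. The names drain_release_futures, timeout_s and demo are hypothetical, and the blocked Event merely stands in for a ReleaseExecute call that never returns. A bounded wait also does not unblock the worker thread itself, so whether the interpreter then exits cleanly would still need to be checked.

```python
import concurrent.futures
import threading


def drain_release_futures(futures, timeout_s):
    """Wait at most timeout_s seconds for cleanup futures to finish.

    Returns (finished, unfinished) counts. Futures that are still queued are
    cancelled; futures that are already running cannot be cancelled and are
    left alone, so the caller is never blocked indefinitely.
    """
    done, not_done = concurrent.futures.wait(list(futures), timeout=timeout_s)
    for future in not_done:
        future.cancel()  # no effect on tasks that are already running
    return len(done), len(not_done)


def demo():
    # Hypothetical stand-in for a ReleaseExecute call that never returns.
    stuck = threading.Event()
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    futures = [
        pool.submit(lambda: None),  # cleanup task that completes immediately
        pool.submit(stuck.wait),    # cleanup task that blocks until released
    ]
    result = drain_release_futures(futures, timeout_s=0.5)
    stuck.set()                     # release the worker so the pool can exit
    pool.shutdown(wait=True)
    return result
```

Calling demo() returns once the half-second budget has elapsed instead of waiting on the blocked task, and reports how many cleanup tasks finished and how many were abandoned.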