[ https://issues.apache.org/jira/browse/SPARK-55620 ]


    Tae Hwan Eom deleted comment on SPARK-55620:
    --------------------------------------

was (Author: JIRAUSER312450):
I have analyzed the root cause of this flaky timeout and propose a fix below.
 * *Root Cause Analysis:* The hang is a deadlock during Python 3.12+ interpreter 
shutdown. When {{Session.__del__()}} runs during the shutdown phase, it 
eventually calls {{self._channel.close()}}. At that point, gRPC tries to spawn 
a new background thread to clean up ongoing RPCs (such as {{ReleaseExecute}}). 
However, Python 3.12+ forbids creating new threads during finalization, so the 
process hangs indefinitely (a deadlock). _(Note: 
{{shutdown_threadpool_if_idle(wait=False)}} and {{futures.wait(timeout=10)}} 
are already safe; the exact culprit is {{self._channel.close()}} without any 
safeguard.)_


 * *Proposed Solution:* We can prevent this deadlock by checking 
{{sys.is_finalizing()}}. If the interpreter is shutting down 
({{sys.is_finalizing()}} returns True), we can safely skip 
{{self._channel.close()}} inside {{client.close()}} or {{__del__()}}; the OS 
reclaims the network sockets and memory at process exit anyway. Although this 
race condition is hard to reproduce consistently on a local machine, the CI 
logs clearly point to this behavior. I will prepare and submit a PR with this 
fix shortly!
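
A minimal sketch of the proposed guard, assuming a client object that owns a 
gRPC channel (class and attribute names here are illustrative, not the actual 
pyspark source; the real change would go in the Spark Connect client's 
{{close()}} path):

```python
import sys

class SparkConnectClientSketch:
    """Illustrative stand-in for the Spark Connect client (not real pyspark code)."""

    def __init__(self, channel):
        self._channel = channel
        self._closed = False

    def close(self):
        # Idempotent: __del__ may call close() after an explicit close().
        if self._closed:
            return
        self._closed = True
        if sys.is_finalizing():
            # Closing the gRPC channel here would make gRPC spawn a cleanup
            # thread, which Python 3.12+ blocks during interpreter
            # finalization (the deadlock described above). Skip the close;
            # the OS reclaims sockets and memory at process exit anyway.
            return
        self._channel.close()

    def __del__(self):
        self.close()
```

During a normal (non-shutdown) close, {{sys.is_finalizing()}} is False and the 
channel is closed exactly once as before; the guard only changes behavior on 
the interpreter-shutdown path.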

> test_connect_session flaky timeout due to shutdown deadlock
> -----------------------------------------------------------
>
>                 Key: SPARK-55620
>                 URL: https://issues.apache.org/jira/browse/SPARK-55620
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect, PySpark
>    Affects Versions: 4.2.0
>            Reporter: Yicong Huang
>            Priority: Minor
>              Labels: pull-request-available
>         Attachments: image-2026-03-05-13-06-16-885.png
>
>
> h2. Description
> {{test_connect_session}} occasionally times out (450 seconds) in CI. The test 
> normally completes in 20 seconds but sometimes hangs indefinitely during 
> shutdown, causing flaky test failures.
> h2. Reproduce
> This is a flaky bug with a ~33% failure rate:
> 1. Run {{python/run-tests.py --testnames 
> pyspark.sql.tests.connect.test_connect_session}}
> 2. Test may hang at 450 seconds timeout
> *Evidence from CI runs:*
> - [Run 
> 22196465437|https://github.com/Yicong-Huang/spark/actions/runs/22196465437]: 
> Cancelled after 4m10s
> - [Run 
> 22196593939|https://github.com/Yicong-Huang/spark/actions/runs/22196593939]: 
> Timeout after 1h22m (hung at 450s)
> - [Run 
> 22237720726|https://github.com/Yicong-Huang/spark/actions/runs/22237720726]: 
> Success in 20s ✓
> h2. Root Cause
> Deadlock during Python shutdown when {{ReleaseExecute}} cleanup tasks are 
> still executing:
> {code}
> Session.__del__()
>   → client.close() waits: concurrent.futures.wait(self._release_futures)
>     → Worker thread executes: ReleaseExecute() gRPC call
>       → gRPC attempts: threading.Thread().start()
>         → Python 3.12 blocks thread creation during shutdown
>           → DEADLOCK (main waits for worker, worker waits for thread)
> {code}
> Thread stacks show:
> - Main thread: blocked in {{concurrent.futures.wait()}}
> - Worker thread: blocked in {{threading.start() -> self._started.wait()}}
> The {{ReleaseExecute}} tasks are asynchronous cleanup submitted during test 
> execution. If they haven't completed when Python shuts down, gRPC's attempt 
> to spawn I/O threads gets blocked.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
