[ https://issues.apache.org/jira/browse/SPARK-55620 ]
Tae Hwan Eom deleted comment on SPARK-55620:
--------------------------------------
was (Author: JIRAUSER312450):
I have analyzed the root cause of this flaky timeout and came up with a
proposed solution.
* *Root Cause Analysis:* The hang is a deadlock during Python 3.12+
shutdown.
When {{Session.__del__()}} is triggered during the interpreter shutdown phase,
it eventually calls {{self._channel.close()}}.
At that point, gRPC attempts to spawn a new background thread to clean up
in-flight RPCs (such as {{ReleaseExecute}}).
However, Python 3.12+ strictly blocks new thread creation during finalization,
so the process hangs indefinitely (a deadlock). _(Note:
{{shutdown_threadpool_if_idle(wait=False)}} and {{futures.wait(timeout=10)}}
are already safe; the exact culprit is the unguarded
{{self._channel.close()}}.)_
* *Proposed Solution:* We can prevent this deadlock by checking
{{sys.is_finalizing()}}.
If the Python interpreter is shutting down ({{sys.is_finalizing()}} returns
{{True}}), we can safely skip {{self._channel.close()}} inside
{{client.close()}} or {{__del__()}}.
The OS reclaims the network sockets and memory resources on process exit
anyway.
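The shape of the deadlock can be sketched with stand-ins for the gRPC
internals (the names below are illustrative only, not the real Spark Connect
or gRPC call sites):
{code}
import concurrent.futures
import threading

# Illustrative stand-in for gRPC's internal cleanup: completing an RPC
# spawns a short-lived thread. At normal runtime this succeeds; during
# interpreter finalization on Python 3.12+ the inner start() can never
# complete, so the outer wait() below would never return.
def release_execute():
    t = threading.Thread(target=lambda: None)
    t.start()   # <- the call that is blocked during finalization
    t.join()
    return "released"

pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = pool.submit(release_execute)
# The main thread parks here, exactly like client.close() does:
done, not_done = concurrent.futures.wait([future], timeout=10)
pool.shutdown()
{code}
At normal runtime the future completes within the timeout; the hang only
appears when this same pattern runs while the interpreter is finalizing.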
Although this is a race condition that is hard to reproduce consistently on
local machines, the CI logs clearly point to this behavior.
I will prepare and submit a PR with this fix shortly!
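For reference, a minimal sketch of the proposed guard (class and attribute
names are simplified stand-ins, not the actual Spark Connect client):
{code}
import sys

class _Channel:
    """Stand-in for grpc.Channel; close() is what may spawn a thread."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

class Client:
    """Hypothetical, trimmed-down client showing only the guard."""
    def __init__(self):
        self._channel = _Channel()

    def close(self):
        # ... shut down thread pools, wait on release futures, etc. ...
        if sys.is_finalizing():
            # Interpreter is shutting down: gRPC cannot spawn its cleanup
            # thread, so skip the close and let the OS reclaim the socket.
            return
        self._channel.close()

client = Client()
client.close()  # at normal runtime the channel is closed as usual
{code}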
> test_connect_session flaky timeout due to shutdown deadlock
> -----------------------------------------------------------
>
> Key: SPARK-55620
> URL: https://issues.apache.org/jira/browse/SPARK-55620
> Project: Spark
> Issue Type: Bug
> Components: Connect, PySpark
> Affects Versions: 4.2.0
> Reporter: Yicong Huang
> Priority: Minor
> Labels: pull-request-available
> Attachments: image-2026-03-05-13-06-16-885.png
>
>
> h2. Description
> {{test_connect_session}} occasionally times out (450 seconds) in CI. The test
> normally completes in 20 seconds but sometimes hangs indefinitely during
> shutdown, causing flaky test failures.
> h2. Reproduce
> This is a flaky bug with ~33% failure rate:
> 1. Run {{python/run-tests.py --testnames
> pyspark.sql.tests.connect.test_connect_session}}
> 2. Test may hang at 450 seconds timeout
> *Evidence from CI runs:*
> - [Run
> 22196465437|https://github.com/Yicong-Huang/spark/actions/runs/22196465437]:
> Cancelled after 4m10s
> - [Run
> 22196593939|https://github.com/Yicong-Huang/spark/actions/runs/22196593939]:
> Timeout after 1h22m (hung at 450s)
> - [Run
> 22237720726|https://github.com/Yicong-Huang/spark/actions/runs/22237720726]:
> Success in 20s ✓
> h2. Root Cause
> Deadlock during Python shutdown when {{ReleaseExecute}} cleanup tasks are
> still executing:
> {code}
> Session.__del__()
> → client.close() waits: concurrent.futures.wait(self._release_futures)
> → Worker thread executes: ReleaseExecute() gRPC call
> → gRPC attempts: threading.Thread().start()
> → Python 3.12 blocks thread creation during shutdown
> → DEADLOCK (main waits for worker, worker waits for thread)
> {code}
> Thread stacks show:
> - Main thread: blocked in {{concurrent.futures.wait()}}
> - Worker thread: blocked in {{threading.start() -> self._started.wait()}}
> The {{ReleaseExecute}} tasks are asynchronous cleanup submitted during test
> execution. If they haven't completed when Python shuts down, gRPC's attempt
> to spawn I/O threads gets blocked.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)