[PR] [SPARK-57655][CONNECT][PYTHON] Avoid re-entrant Spark Connect ML cache cleanup RPC hang [spark]

via GitHub Tue, 23 Jun 2026 22:24:59 -0700


HyukjinKwon opened a new pull request, #56725:
URL: https://github.com/apache/spark/pull/56725


   ### What changes were proposed in this pull request?
   Add a same-thread re-entrancy guard around the best-effort ML-cache RPCs 
`SparkConnectClient._cleanup_ml_cache` / `_delete_ml_cache`. If one is already 
in flight on the current thread, the nested call is skipped and a `WARNING` is 
logged instead of issuing a second blocking RPC.
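
   The description does not show the guard itself, so the following is only a 
minimal sketch of a same-thread re-entrancy guard. It assumes a 
`threading.local` flag and a decorator; `skip_if_reentrant`, `FakeClient`, and 
the method bodies are hypothetical stand-ins, not the actual 
`SparkConnectClient` code.

```python
import functools
import logging
import threading

logger = logging.getLogger(__name__)

# Per-thread state: each thread sees its own `active` attribute, so a call in
# flight on one thread never blocks or skips a call made on another thread.
_ml_cache_rpc_state = threading.local()


def skip_if_reentrant(func):
    """Skip the call (returning None) if a guarded call is already in flight
    on the current thread. Hypothetical helper for illustration only."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if getattr(_ml_cache_rpc_state, "active", False):
            logger.warning(
                "Skipping re-entrant %s: an ML-cache RPC is already in flight "
                "on this thread",
                func.__name__,
            )
            return None
        _ml_cache_rpc_state.active = True
        try:
            return func(*args, **kwargs)
        finally:
            # Always clear the flag, even if the guarded call raises.
            _ml_cache_rpc_state.active = False

    return wrapper


class FakeClient:
    """Hypothetical stand-in for a client; records calls instead of doing RPCs."""

    def __init__(self):
        self.rpcs = []

    @skip_if_reentrant
    def cleanup_ml_cache(self):
        self.rpcs.append("cleanup")
        # Simulate a finalizer firing on the same thread mid-call.
        self.delete_ml_cache("model-1")

    @skip_if_reentrant
    def delete_ml_cache(self, ref_id):
        self.rpcs.append(f"delete:{ref_id}")
        return ref_id
```

   A plain flag is used rather than a `threading.Lock` because the nested call 
arrives on the thread that already holds the guard: a non-reentrant lock would 
block that thread on itself, and a reentrant lock would let the second RPC 
through.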
   
   ### Why are the changes needed?
   A rare CI hang (for example, `pyspark.ml.tests.connect.test_parity_clustering` 
timing out at 450s) traces to a **re-entrant ML-cache RPC**. While a 
cleanup/delete RPC is blocked in gRPC with the GIL released, CPython runs a 
pending `RemoteModelRef` finalizer (`__del__` → `del_remote_cache` → 
`_delete_ml_cache`) **on the same thread**. The finalizer issues a second 
blocking RPC that deadlocks the channel until the process or test timeout; a 
faulthandler dump confirmed the re-entrant stack. The nested call is redundant: 
the in-flight RPC is already releasing server-side state, which is also evicted 
on session end. Skipping it is therefore safe, and the `WARNING` turns a silent 
multi-minute hang into an observable, attributable signal if it recurs in 
scheduled jobs.
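
   The same-thread re-entrancy can be illustrated without gRPC. The sketch 
below is a deterministic simulation, not a reproduction of the CI hang: 
`fake_rpc` and `FakeModelRef` are hypothetical stand-ins, and an explicit 
`gc.collect()` stands in for the point at which the interpreter runs a pending 
finalizer while the outer call is still on the stack.

```python
import gc
import threading

events = []  # (name, nesting depth, thread id) for every fake_rpc call
depth = 0    # number of fake_rpc calls currently on the stack


def fake_rpc(name):
    """Hypothetical stand-in for a blocking RPC; records the nesting depth."""
    global depth
    depth += 1
    try:
        events.append((name, depth, threading.get_ident()))
        if name == "cleanup":
            # Stand-in for the moment a pending finalizer gets to run while
            # the outer call is still in progress.
            gc.collect()
    finally:
        depth -= 1


class FakeModelRef:
    """Hypothetical stand-in for a remote model reference."""

    def __init__(self):
        # Reference cycle: only the cyclic garbage collector can free this
        # object, so its finalizer stays pending until a collection runs.
        self.cycle = self

    def __del__(self):
        fake_rpc("delete")


gc.disable()         # keep an automatic collection from freeing it early
FakeModelRef()       # unreachable right away; finalizer is now pending
fake_rpc("cleanup")  # the finalizer fires inside this call
gc.enable()

for name, level, thread_id in events:
    print(name, level, thread_id)
```

   The recorded events show `delete` at depth 2 with the same thread id as 
`cleanup`, which is the shape of the re-entrant stack described above.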
   
   ### Does this PR introduce any user-facing change?
   No.
   
   ### How was this patch tested?
   The underlying hang is a rare, timing-dependent flake (observed about once 
in two months) that cannot be reproduced on demand, so this patch **cannot be 
proven to eliminate it**. It is a no-regression safety net plus a diagnostic. 
In normal operation no ML-cache RPC is in flight when another is issued, so 
behavior is unchanged. Verified on a fork by building Spark Connect and running 
`test_parity_clustering` **15×**, with each run actually executing (~30s, not 
skipped). All 15 runs passed.
   
   - ❌ Before (450s hang, scheduled `Build / Non-ANSI (branch-4.x, ...)`, 
module `pyspark-ml-connect`): 
https://github.com/apache/spark/actions/runs/28004040195
   - ✅ After (this fix, `test_parity_clustering` ×15 actually executing, all 
green): https://github.com/HyukjinKwon/spark/actions/runs/28075822467
   
   ### Was this patch authored or co-authored using generative AI tooling?
   Yes, Generated-by: Claude Code
   
   This pull request and its description were written by Isaac.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

