HyukjinKwon opened a new pull request, #56725: URL: https://github.com/apache/spark/pull/56725
### What changes were proposed in this pull request?

Add a same-thread re-entrancy guard around the best-effort ML-cache RPCs `SparkConnectClient._cleanup_ml_cache` / `_delete_ml_cache`. If one is already in flight on the current thread, the nested call is skipped and a `WARNING` is logged instead of issuing a second blocking RPC.

### Why are the changes needed?

A rare CI hang (e.g. `pyspark.ml.tests.connect.test_parity_clustering` timing out at 450s) traces to a **re-entrant ML-cache RPC**. While a cleanup/delete RPC is blocked in gRPC with the GIL released, CPython runs a pending `RemoteModelRef` finalizer (`__del__` → `del_remote_cache` → `_delete_ml_cache`) **on the same thread**, issuing a second blocking RPC that deadlocks the channel until the process/test timeout (a faulthandler dump confirmed the re-entrant stack).

The nested call is redundant: the in-flight RPC is already releasing server-side state, which is also evicted on session end. Skipping it is therefore safe, and the `WARNING` turns a silent multi-minute hang into an observable, attributable signal if it recurs in scheduled jobs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

The underlying hang is a rare, timing-dependent flake (observed ~once in two months) that cannot be reproduced on demand, so this **cannot be proven to eliminate it**; it is a no-regression safety net plus a diagnostic. In normal operation no ML-cache RPC is in flight when another is issued, so behavior is unchanged.

Verified on a fork by building Spark Connect and running `test_parity_clustering` **15×**, each run actually executing (~30s, not skipped); all 15 passed.
- ❌ Before (450s hang, scheduled `Build / Non-ANSI (branch-4.x, ...)`, module `pyspark-ml-connect`): https://github.com/apache/spark/actions/runs/28004040195
- ✅ After (this fix, `test_parity_clustering` ×15 actually executing, all green): https://github.com/HyukjinKwon/spark/actions/runs/28075822467

### Was this patch authored or co-authored using generative AI tooling?

Yes, Generated-by: Claude Code

This pull request and its description were written by Isaac.
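
### Illustrative sketch of the guard

The same-thread re-entrancy guard described under "What changes were proposed" can be sketched roughly as below. This is a minimal illustration, not the code in this PR: the names `ml_cache_rpc_guard`, `delete_ml_cache`, `send_rpc`, and `_state` are hypothetical, and the real client methods issue gRPC calls rather than invoking a callback. It assumes a per-thread flag is enough to detect a nested call made by a finalizer that runs on the thread already blocked in an RPC.

```python
import logging
import threading
from contextlib import contextmanager

logger = logging.getLogger("ml_cache_guard_sketch")

# Hypothetical per-thread state; not the attribute used by SparkConnectClient.
_state = threading.local()


@contextmanager
def ml_cache_rpc_guard(name):
    """Yield True if the caller may issue the RPC, or False if an
    ML-cache RPC is already in flight on the current thread."""
    if getattr(_state, "in_flight", False):
        # Nested call on the same thread (e.g. from a __del__ finalizer):
        # skip it and leave an attributable signal in the logs.
        logger.warning(
            "Skipping re-entrant ML-cache RPC %r on thread %s",
            name,
            threading.current_thread().name,
        )
        yield False
        return
    _state.in_flight = True
    try:
        yield True
    finally:
        # Always clear the flag, even if the RPC raises.
        _state.in_flight = False


def delete_ml_cache(send_rpc):
    """Stand-in for a best-effort ML-cache RPC; `send_rpc` is a
    hypothetical callable representing the blocking call."""
    with ml_cache_rpc_guard("delete_ml_cache") as allowed:
        if not allowed:
            return False  # nested call skipped
        send_rpc()
        return True
```

Calling `delete_ml_cache` from inside `send_rpc` on the same thread returns `False` and logs a warning, while a call from a different thread proceeds normally because the flag is thread-local. A thread-local flag is used here rather than a non-reentrant lock because a lock would block the nested caller, which is the hang the guard is meant to avoid.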
