Re: [PR] [SPARK-57655][CONNECT][PYTHON] Avoid re-entrant Spark Connect ML cache cleanup RPC hang [spark]

via GitHub Tue, 23 Jun 2026 23:03:03 -0700


HyukjinKwon commented on PR #56725:
URL: https://github.com/apache/spark/pull/56725#issuecomment-4786350442


   <!-- ai-code-review -->
   **Automated code review** (posted as a comment, not an approval) — via 
`spark-dev:code-review`.
   
   **Verdict: no blocking findings.** Small, targeted, no-regression guard; the 
diagnosis (faulthandler dump showing same-thread finalizer re-entrancy) is 
sound and the fix matches it.
   
   **Design (Pass A):** The shared `_ml_cache_rpc_thread` slot is declared as a 
class attribute used as a default — writes go through `self.` so each client 
instance gets its own value; there's no cross-instance leakage. `try/finally` 
correctly clears it. `threading` is imported (line 32). The guard is scoped 
exactly to the failure mode (same-thread re-entrancy via 
`threading.get_ident()`), so it won't false-skip a legitimately concurrent call 
from another thread.
   
   **Notes (non-blocking):**
   1. **Skip isn't always strictly redundant.** The PR text frames the nested 
delete as redundant, which is exactly true when the in-flight RPC is the full 
`clean_cache` (the observed `test_parity_clustering` case). But the guard also 
fires when a *targeted* `_delete_ml_cache(refsA)` is in flight and a finalizer 
requests `_delete_ml_cache(refsB)` — those `refsB` are then not proactively 
released and linger until session-end eviction. That's a safe tradeoff 
(avoiding a multi-minute hang >> proactive cache release) and the `WARNING` 
records it, but worth being aware of if stricter eviction is ever needed.
   2. **Single-slot flag, not concurrency-general.** If one client is driven by 
multiple threads issuing ML-cache RPCs concurrently, the flag is a single slot 
— the failure mode is only "the guard may not fire" (no false skip, no 
incorrect result), since the target is same-thread finalizer re-entrancy. Fine 
for the intended scenario.
   
   **Verification caveat (already stated in the PR):** the underlying hang is a 
rare (~1/2-months), timing-dependent flake that can't be reproduced on demand, 
so this can't be proven to eliminate it — it's a no-regression safety net plus 
a `WARNING` diagnostic that makes a recurrence in scheduled jobs observable. 
Confirmed no-regression by running `test_parity_clustering` 15× (each actually 
executing, ~30s) — all green: 
https://github.com/HyukjinKwon/spark/actions/runs/28075822467
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-57655][CONNECT][PYTHON] Avoid re-entrant Spark Connect ML cache cleanup RPC hang [spark]

Reply via email to