HyukjinKwon commented on PR #56725: URL: https://github.com/apache/spark/pull/56725#issuecomment-4786350442
<!-- ai-code-review --> **Automated code review** (posted as a comment, not an approval) — via `spark-dev:code-review`. **Verdict: no blocking findings.** Small, targeted, no-regression guard; the diagnosis (faulthandler dump showing same-thread finalizer re-entrancy) is sound and the fix matches it. **Design (Pass A):** The shared `_ml_cache_rpc_thread` slot is declared as a class attribute used as a default — writes go through `self.` so each client instance gets its own value; there's no cross-instance leakage. `try/finally` correctly clears it. `threading` is imported (line 32). The guard is scoped exactly to the failure mode (same-thread re-entrancy via `threading.get_ident()`), so it won't false-skip a legitimately concurrent call from another thread. **Notes (non-blocking):** 1. **Skip isn't always strictly redundant.** The PR text frames the nested delete as redundant, which is exactly true when the in-flight RPC is the full `clean_cache` (the observed `test_parity_clustering` case). But the guard also fires when a *targeted* `_delete_ml_cache(refsA)` is in flight and a finalizer requests `_delete_ml_cache(refsB)` — those `refsB` are then not proactively released and linger until session-end eviction. That's a safe tradeoff (avoiding a multi-minute hang >> proactive cache release) and the `WARNING` records it, but worth being aware of if stricter eviction is ever needed. 2. **Single-slot flag, not concurrency-general.** If one client is driven by multiple threads issuing ML-cache RPCs concurrently, the flag is a single slot — the failure mode is only "the guard may not fire" (no false skip, no incorrect result), since the target is same-thread finalizer re-entrancy. Fine for the intended scenario. **Verification caveat (already stated in the PR):** the underlying hang is a rare (~1/2-months), timing-dependent flake that can't be reproduced on demand, so this can't be proven to eliminate it — it's a no-regression safety net plus a `WARNING` diagnostic that makes a recurrence in scheduled jobs observable. Confirmed no-regression by running `test_parity_clustering` 15× (each actually executing, ~30s) — all green: https://github.com/HyukjinKwon/spark/actions/runs/28075822467 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
