[PR] [SPARK-55020][PYTHON][FOLLOW-UP] Skip ML cache cleanup when GC is disabled to prevent gRPC deadlock [spark]

via GitHub Sun, 08 Feb 2026 08:37:50 -0800


yaooqinn opened a new pull request, #54211:
URL: https://github.com/apache/spark/pull/54211


   ### What changes were proposed in this pull request?
   
   This is a follow-up to SPARK-55020 (#53783). The original fix disabled GC 
during `_execute` and `_execute_and_fetch_as_iterator` to prevent `__del__` 
methods from sending concurrent gRPC commands. However, the deadlock can still 
occur because:
   
   1. `execute_command()` calls `_execute_and_fetch()` (NOT `@disable_gc` 
decorated)
   2. `_execute_and_fetch()` calls `_execute_and_fetch_as_iterator()` 
(`@disable_gc` decorated)
   3. Between entry of `_execute_and_fetch` and the `@disable_gc` taking effect 
on `_execute_and_fetch_as_iterator`, GC can trigger
   4. GC invokes `__del__` → `release_ref()` → `del_remote_cache()` → 
`execute_command()` → deadlock on gRPC channel lock
   
   This PR adds a guard in `del_remote_cache()` to check if GC is disabled (via 
`gc.isenabled()`). If GC is disabled, it means we are inside a gRPC call 
(protected by `@disable_gc`), and sending another gRPC command would deadlock. 
In this case, we skip the cache cleanup — the cache will be cleaned up when the 
session is closed via `_cleanup_ml_cache()` in `setUp()`.
   
   ### Why are the changes needed?
   
   The `test_parity_clustering` test continues to be flaky on `apache/master` 
CI due to this deadlock:
   - https://github.com/apache/spark/actions/runs/21800415959/job/62894979197
   
   Stacktrace shows the same deadlock pattern as the original SPARK-55020:
   ```
   test_distributed_lda (line 466) → model.__repr__() → execute_command
     ↕ GC triggers __del__ → release_ref() → del_remote_cache() → 
execute_command → DEADLOCK
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. ML model cache cleanup may be slightly deferred when triggered during a 
gRPC call, but the cache is always cleaned up at test teardown and session 
close.
   
   ### How was this patch tested?
   
   - Syntax verified
   - The fix is a defensive guard that prevents deadlocks — the existing 
`test_distributed_lda` and `test_parity_clustering` tests exercise this path
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] [SPARK-55020][PYTHON][FOLLOW-UP] Skip ML cache cleanup when GC is disabled to prevent gRPC deadlock [spark]

Reply via email to