[ https://issues.apache.org/jira/browse/FLINK-35073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17837259#comment-17837259 ]
Julien Tournay edited comment on FLINK-35073 at 4/15/24 1:32 PM:
-----------------------------------------------------------------
Closing this issue because even though I captured a deadlock, that was not what was causing my job to block. Instead it looks like there is a scheduling bug when using the default scheduler with hybrid shuffle. Adaptive scheduler + hybrid shuffle works fine.

was (Author: jto): Closing this issue because even though I captured a deadlock, that was not what was causing my job to block. Instead it looks like there is a scheduling bug when using the default scheduler with hybrid shuffle.

> Deadlock in LocalBufferPool when segments become available in the global pool
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-35073
>                 URL: https://issues.apache.org/jira/browse/FLINK-35073
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.19.0
>            Reporter: Julien Tournay
>            Priority: Critical
>              Labels: deadlock, network
>         Attachments: deadlock_threaddump_extract.json
>
>
> The reported issue is easy to reproduce in batch mode using hybrid shuffle and a somewhat large total number of slots in the cluster. Low parallelism (60) still triggers it.
> Note: Attached a partial thread dump to illustrate the issue.
> When `NetworkBufferPool.internalRecycleMemorySegments` is called, the following chain of calls may happen:
> {code:java}
> NetworkBufferPool.internalRecycleMemorySegments ->
>   LocalBufferPool.onGlobalPoolAvailable ->
>   LocalBufferPool.checkAndUpdateAvailability ->
>   LocalBufferPool.requestMemorySegmentFromGlobalWhenAvailable{code}
> Several instances of `LocalBufferPool` will be notified as soon as a segment becomes available in the global pool because of the implementation of `requestMemorySegmentFromGlobalWhenAvailable`:
> {code:java}
> networkBufferPool.getAvailableFuture().thenRun(this::onGlobalPoolAvailable);{code}
> The issue arises when 2 or more threads go through this specific code path at the same time.
> Each thread will notify the same instances of `LocalBufferPool` by invoking `onGlobalPoolAvailable` on each of them, and in the process try to acquire a series of locks.
> As an example, assume there are 6 `LocalBufferPool` instances A, B, C, D, E and F:
> Thread 1 calls `onGlobalPoolAvailable` on A, B, C and D. It locks A, B, C and tries to lock D.
> Thread 2 calls `onGlobalPoolAvailable` on D, E, F and A. It locks D, E, F and tries to lock A.
> ==> Threads 1 and 2 are mutually blocked.
> The attached thread dump captures this issue:
> The first thread locked java.util.ArrayDeque@41d6a3bb and is blocked on java.util.ArrayDeque@e2b5e34.
> The second thread locked java.util.ArrayDeque@e2b5e34 and is blocked on java.util.ArrayDeque@41d6a3bb.
>
> Note that I'm not familiar enough with Flink internals to know what the fix should be, but I'm happy to submit a PR if someone tells me what the correct behaviour should be.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
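The mutual blocking described above is the classic lock-ordering deadlock: two threads acquire the same monitors in opposite orders. A minimal, self-contained sketch of the standard remedy is below; it is not Flink code, and the names (`poolA`, `poolD`, `notifyPools`) are hypothetical stand-ins for the `ArrayDeque` monitors seen in the thread dump. Because every thread locks the pools in the same global order, neither can end up holding one monitor while waiting on the other.

```java
import java.util.ArrayDeque;

// Hypothetical sketch of a consistent lock-acquisition order.
// Two deques stand in for the per-LocalBufferPool monitors from the
// thread dump; every thread locks poolA strictly before poolD, so the
// A-then-D vs D-then-A cycle from the description cannot form.
public class LockOrderSketch {
    static final ArrayDeque<Integer> poolA = new ArrayDeque<>();
    static final ArrayDeque<Integer> poolD = new ArrayDeque<>();

    // Notify both pools, always in the same global order.
    static void notifyPools(int segment) {
        synchronized (poolA) {      // consistent first lock
            synchronized (poolD) {  // consistent second lock
                poolA.add(segment);
                poolD.add(segment);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> { for (int i = 0; i < 1000; i++) notifyPools(i); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 1000; i++) notifyPools(i); });
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Each thread added 1000 segments to each pool.
        System.out.println("done: " + (poolA.size() + poolD.size())); // prints "done: 4000"
    }
}
```

If reordering the notifications is not feasible in Flink itself, an alternative direction (again, only a guess from outside the codebase) is to avoid holding one pool's lock while calling into another pool at all, e.g. by deferring `onGlobalPoolAvailable` callbacks until the caller has released its own lock.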