[
https://issues.apache.org/jira/browse/FLINK-35073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Tournay updated FLINK-35073:
-----------------------------------
Description:
The reported issue is easy to reproduce in batch mode using hybrid shuffle and
a somewhat large total number of slots in the cluster. Even a relatively low
parallelism (60) still triggers it.
Note: Attached a partial thread dump to illustrate the issue.
When `NetworkBufferPool.internalRecycleMemorySegments` is called concurrently,
the following chain of calls may happen:
{code:java}
NetworkBufferPool.internalRecycleMemorySegments ->
LocalBufferPool.onGlobalPoolAvailable ->
LocalBufferPool.checkAndUpdateAvailability ->
LocalBufferPool.requestMemorySegmentFromGlobalWhenAvailable{code}
`requestMemorySegmentFromGlobalWhenAvailable` can cause `onGlobalPoolAvailable`
to be invoked on another `LocalBufferPool` instance, which triggers the same
chain of actions.
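To make the re-entrant notification concrete, here is a heavily simplified
sketch (hypothetical `GlobalPool` and `LocalPool` classes, not the actual Flink
implementation) of how a single recycling thread can end up acquiring one pool
lock after another without releasing the earlier ones:
{code:java}
import java.util.ArrayDeque;

class GlobalPool {
    private final ArrayDeque<Object> segments = new ArrayDeque<>();
    private final ArrayDeque<LocalPool> waiters = new ArrayDeque<>();

    void registerWaiter(LocalPool pool) {
        synchronized (waiters) { waiters.add(pool); }
    }

    // Called when segments are recycled back into the global pool.
    void recycle(Object segment) {
        synchronized (segments) { segments.add(segment); }
        notifyNextWaiter(); // notification runs on the recycling thread
    }

    Object requestSegment() {
        Object seg;
        synchronized (segments) { seg = segments.poll(); }
        notifyNextWaiter(); // side effect: notifies the NEXT waiting pool,
        return seg;         // still on the same thread
    }

    private void notifyNextWaiter() {
        LocalPool next;
        synchronized (waiters) { next = waiters.poll(); }
        if (next != null) {
            next.onGlobalPoolAvailable(); // re-enters the next pool's lock
        }
    }
}

class LocalPool {
    private final ArrayDeque<Object> buffers = new ArrayDeque<>();
    private final GlobalPool global;

    LocalPool(GlobalPool global) { this.global = global; }

    void onGlobalPoolAvailable() {
        synchronized (buffers) {                  // this pool's lock is held...
            Object seg = global.requestSegment(); // ...while re-entering the
            if (seg != null) {                    // global pool, which notifies
                buffers.add(seg);                 // the next pool: one thread
            }                                     // ends up nesting A, B, C, ...
        }
    }
}
{code}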
The issue arises when two threads go through this specific code path at the
same time.
Each thread will call `requestMemorySegmentFromGlobalWhenAvailable` and in the
process try to acquire locks on a series of `LocalBufferPool` instances.
As an example, assume there are six `LocalBufferPool` instances A, B, C, D, E
and F:
Thread 1 locks A, B, C and tries to lock D
Thread 2 locks D, E, F and tries to lock A
==> Threads 1 and 2 are mutually blocked.
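This is a classic lock-order inversion. Here is a minimal generic Java sketch
of the same pattern (not Flink code), reduced to two locks, matching the two
`ArrayDeque` monitors in the thread dump below:
{code:java}
public class AbbaDeadlock {
    public static void main(String[] args) {
        final Object poolA = new Object();
        final Object poolD = new Object();

        Thread t1 = new Thread(() -> {
            synchronized (poolA) {         // Thread 1 locks A
                sleep(100);                // give Thread 2 time to grab D
                synchronized (poolD) {     // ...then blocks forever on D
                    System.out.println("t1 done");
                }
            }
        });

        Thread t2 = new Thread(() -> {
            synchronized (poolD) {         // Thread 2 locks D
                sleep(100);                // give Thread 1 time to grab A
                synchronized (poolA) {     // ...then blocks forever on A
                    System.out.println("t2 done");
                }
            }
        });

        t1.start();
        t2.start();                        // both threads hang: ABBA deadlock
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
{code}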
The attached thread dump captured this issue:
{noformat}
First thread locked java.util.ArrayDeque@41d6a3bb and is blocked on java.util.ArrayDeque@e2b5e34
Second thread locked java.util.ArrayDeque@e2b5e34 and is blocked on java.util.ArrayDeque@41d6a3bb
{noformat}
Note that I'm not familiar enough with Flink internals to know what the fix
should be, but I'm happy to submit a PR if someone tells me what the correct
behaviour should be. Maybe just moving `toNotify.complete(null);` inside the
`synchronized` block in `internalRecycleMemorySegments` would fix the problem,
but I'm not sure how much thread contention that would create.
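To make that trade-off concrete, here is a generic sketch (not the actual
`NetworkBufferPool` code) showing that completing a `CompletableFuture` inside
a `synchronized` block makes non-async dependent actions run on the completing
thread while the lock is still held:
{code:java}
import java.util.concurrent.CompletableFuture;

public class CompleteUnderLock {
    private static final Object poolLock = new Object();

    public static void main(String[] args) {
        CompletableFuture<Void> toNotify = new CompletableFuture<>();

        // A dependent action, analogous to a registered availability listener.
        toNotify.thenRun(() ->
                System.out.println("callback holds poolLock: "
                        + Thread.holdsLock(poolLock)));

        synchronized (poolLock) {
            toNotify.complete(null); // prints "callback holds poolLock: true"
        }
        // Completing outside the block instead would print "false": callbacks
        // would no longer run under the lock, but they could then interleave
        // with concurrent recycle calls again.
    }
}
{code}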
> Deadlock in LocalBufferPool when
> NetworkBufferPool.internalRecycleMemorySegments is called concurrently
> -------------------------------------------------------------------------------------------------------
>
> Key: FLINK-35073
> URL: https://issues.apache.org/jira/browse/FLINK-35073
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Network
> Reporter: Julien Tournay
> Priority: Critical
> Labels: deadlock, network
> Attachments: deadlock_threaddump_extract.json
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)