Julien Tournay created FLINK-35073:
--------------------------------------
Summary: Deadlock in LocalBufferPool when
NetworkBufferPool.internalRecycleMemorySegments is called concurrently
Key: FLINK-35073
URL: https://issues.apache.org/jira/browse/FLINK-35073
Project: Flink
Issue Type: Bug
Reporter: Julien Tournay
Attachments: deadlock_threaddump_extract.json
The reported issue is easy to reproduce in batch mode using hybrid shuffle and
a somewhat large total number of slots in the cluster. Parallelism does not
seem to matter much.
Note: I attached a partial thread dump to illustrate the issue.
When `NetworkBufferPool.internalRecycleMemorySegments` is called concurrently,
the following chain of calls may happen:
{code:java}
NetworkBufferPool.internalRecycleMemorySegments ->
LocalBufferPool.onGlobalPoolAvailable ->
LocalBufferPool.checkAndUpdateAvailability ->
LocalBufferPool.requestMemorySegmentFromGlobalWhenAvailable{code}
`requestMemorySegmentFromGlobalWhenAvailable` can cause `onGlobalPoolAvailable`
to be invoked on another `LocalBufferPool` instance, which triggers the same
chain of actions.
The issue arises when two threads go through this specific code path at the
same time.
Each thread calls `requestMemorySegmentFromGlobalWhenAvailable` and, in the
process, tries to acquire locks on a series of `LocalBufferPool` instances.
As an example, assume there are 6 `LocalBufferPool` instances A, B, C, D, E and
F:
Thread 1 locks A, B, C and tries to lock D
Thread 2 locks D, E, F and tries to lock A
==> Both threads 1 and 2 are blocked.
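The A..F example above can be sketched in isolation. This is not Flink code: `LockCycleSketch`, `Pool`, `lockChain` and `bothThreadsBlocked` are hypothetical names, and the sketch assumes `ReentrantLock.tryLock` with a timeout in place of the `synchronized` blocks seen in the thread dump, so that the program reports the cycle instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

final class LockCycleSketch {

    // Hypothetical stand-in for a LocalBufferPool: only its lock matters here.
    static final class Pool {
        final ReentrantLock lock = new ReentrantLock();
    }

    // Locks every pool in `held`, waits until the other thread holds its own
    // chain, then tries `wanted` with a timeout instead of blocking forever.
    // Returns true only if `wanted` could be acquired.
    static boolean lockChain(Pool[] held, Pool wanted,
                             CountDownLatch bothHolding, CountDownLatch bothTried) {
        for (Pool p : held) {
            p.lock.lock();
        }
        try {
            bothHolding.countDown();
            bothHolding.await();
            boolean acquired = wanted.lock.tryLock(200, TimeUnit.MILLISECONDS);
            if (acquired) {
                wanted.lock.unlock();
            }
            // Keep our own chain locked until the other thread has also tried.
            bothTried.countDown();
            bothTried.await();
            return acquired;
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            for (int i = held.length - 1; i >= 0; i--) {
                held[i].lock.unlock();
            }
        }
    }

    // Returns true when neither thread could take the lock it was waiting for,
    // i.e. with plain blocking locks both threads would be stuck.
    static boolean bothThreadsBlocked() {
        Pool a = new Pool();
        Pool b = new Pool();
        Pool c = new Pool();
        Pool d = new Pool();
        Pool e = new Pool();
        Pool f = new Pool();
        CountDownLatch bothHolding = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] acquired = new boolean[2];

        // Thread 1 locks A, B, C and then wants D.
        Thread t1 = new Thread(() ->
                acquired[0] = lockChain(new Pool[] {a, b, c}, d, bothHolding, bothTried));
        // Thread 2 locks D, E, F and then wants A.
        Thread t2 = new Thread(() ->
                acquired[1] = lockChain(new Pool[] {d, e, f}, a, bothHolding, bothTried));
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !acquired[0] && !acquired[1];
    }

    public static void main(String[] args) {
        System.out.println("both threads blocked: " + bothThreadsBlocked());
    }
}
```

In this sketch each thread gives up after 200 ms, so `bothThreadsBlocked()` returns `true`. With untimed locks, as in the reported stack, neither thread would ever make progress.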
The attached thread dump captured this issue:
First thread locked java.util.ArrayDeque@41d6a3bb and is blocked on
java.util.ArrayDeque@e2b5e34
Second thread locked java.util.ArrayDeque@e2b5e34 and is blocked on
java.util.ArrayDeque@41d6a3bb
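For anyone trying to confirm the same cycle on their own cluster, the JVM can report it directly. The sketch below uses the standard `java.lang.management.ThreadMXBean` API. `DeadlockProbe` and its methods are hypothetical names, and it assumes the code runs inside the affected TaskManager JVM (for example from a debugging hook), which is not something the report describes.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

final class DeadlockProbe {

    // Number of threads the JVM currently reports as deadlocked (0 if none).
    static int countDeadlockedThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long[] ids = bean.findDeadlockedThreads(); // null when no deadlock exists
        return ids == null ? 0 : ids.length;
    }

    // One line per deadlocked thread: who waits on which lock, held by whom.
    static String describeDeadlocks() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long[] ids = bean.findDeadlockedThreads();
        if (ids == null) {
            return "no deadlock detected";
        }
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : bean.getThreadInfo(ids, true, true)) {
            if (info == null) {
                continue; // thread terminated between the two calls
            }
            sb.append(info.getThreadName())
              .append(" is blocked on ").append(info.getLockName())
              .append(" held by ").append(info.getLockOwnerName())
              .append(System.lineSeparator());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describeDeadlocks());
    }
}
```

In a JVM hitting this issue, the output would be expected to name the two `java.util.ArrayDeque` monitors in opposite order for the two threads, matching the dump above.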
Note that I'm not familiar enough with Flink internals to know what the fix
should be, but I'm happy to submit a PR if someone tells me what the correct
behaviour should be.