[
https://issues.apache.org/jira/browse/FLINK-19391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201555#comment-17201555
]
Arvid Heise commented on FLINK-19391:
-------------------------------------
Merged into master as
https://github.com/apache/flink/commit/c3fab5173c1127c970ee992e8bde948abd22dced.
> Deadlock during partition update
> --------------------------------
>
> Key: FLINK-19391
> URL: https://issues.apache.org/jira/browse/FLINK-19391
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Network
> Affects Versions: 1.12.0
> Reporter: Arvid Heise
> Assignee: Arvid Heise
> Priority: Blocker
> Labels: pull-request-available
>
> Master cron job is currently failing because of a deadlock introduced in
> FLINK-19026.
> {noformat}
> 2020-09-23T21:50:39.2444176Z Found one Java-level deadlock:
> 2020-09-23T21:50:39.2444633Z =============================
> 2020-09-23T21:50:39.2445001Z "Temp writer":
> 2020-09-23T21:50:39.2445484Z waiting to lock monitor 0x00007f4e14004ca8
> (object 0x0000000086501948, a java.lang.Object),
> 2020-09-23T21:50:39.2446418Z which is held by
> "flink-akka.actor.default-dispatcher-2"
> 2020-09-23T21:50:39.2447193Z "flink-akka.actor.default-dispatcher-2":
> 2020-09-23T21:50:39.2447903Z waiting to lock monitor 0x00007f4e14004bf8
> (object 0x0000000086501930, a
> org.apache.flink.runtime.io.network.partition.PrioritizedDeque),
> 2020-09-23T21:50:39.2448703Z which is held by "Temp writer"
> 2020-09-23T21:50:39.2448965Z
> 2020-09-23T21:50:39.2449384Z Java stack information for the threads listed
> above:
> 2020-09-23T21:50:39.2449900Z
> ===================================================
> 2020-09-23T21:50:39.2450325Z "Temp writer":
> 2020-09-23T21:50:39.2451050Z at
> org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.checkAndWaitForSubpartitionView(LocalInputChannel.java:244)
> 2020-09-23T21:50:39.2452264Z - waiting to lock <0x0000000086501948> (a
> java.lang.Object)
> 2020-09-23T21:50:39.2453183Z at
> org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.getNextBuffer(LocalInputChannel.java:205)
> 2020-09-23T21:50:39.2454173Z at
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.waitAndGetNextData(SingleInputGate.java:642)
> 2020-09-23T21:50:39.2455422Z - locked <0x0000000086501930> (a
> org.apache.flink.runtime.io.network.partition.PrioritizedDeque)
> 2020-09-23T21:50:39.2456310Z at
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:619)
> 2020-09-23T21:50:39.2457311Z at
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNext(SingleInputGate.java:602)
> 2020-09-23T21:50:39.2458205Z at
> org.apache.flink.runtime.taskmanager.InputGateWithMetrics.getNext(InputGateWithMetrics.java:105)
> 2020-09-23T21:50:39.2459258Z at
> org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:100)
> 2020-09-23T21:50:39.2460465Z at
> org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:47)
> 2020-09-23T21:50:39.2461344Z at
> org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:59)
> 2020-09-23T21:50:39.2462164Z at
> org.apache.flink.runtime.operators.TempBarrier$TempWritingThread.run(TempBarrier.java:178)
> 2020-09-23T21:50:39.2463418Z "flink-akka.actor.default-dispatcher-2":
> 2020-09-23T21:50:39.2464109Z at
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.queueChannel(SingleInputGate.java:825)
> 2020-09-23T21:50:39.2465336Z - waiting to lock <0x0000000086501930> (a
> org.apache.flink.runtime.io.network.partition.PrioritizedDeque)
> 2020-09-23T21:50:39.2466228Z at
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.notifyChannelNonEmpty(SingleInputGate.java:791)
> 2020-09-23T21:50:39.2467222Z at
> org.apache.flink.runtime.io.network.partition.consumer.InputChannel.notifyChannelNonEmpty(InputChannel.java:154)
> 2020-09-23T21:50:39.2468212Z at
> org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.notifyDataAvailable(LocalInputChannel.java:236)
> 2020-09-23T21:50:39.2469577Z at
> org.apache.flink.runtime.io.network.partition.ResultPartitionManager.createSubpartitionView(ResultPartitionManager.java:76)
> 2020-09-23T21:50:39.2470607Z at
> org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.requestSubpartition(LocalInputChannel.java:133)
> 2020-09-23T21:50:39.2471765Z - locked <0x0000000086501948> (a
> java.lang.Object)
> 2020-09-23T21:50:39.2472685Z at
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.updateInputChannel(SingleInputGate.java:489)
> 2020-09-23T21:50:39.2473727Z - locked <0x0000000086532500> (a
> java.lang.Object)
> 2020-09-23T21:50:39.2474449Z at
> org.apache.flink.runtime.io.network.NettyShuffleEnvironment.updatePartitionInfo(NettyShuffleEnvironment.java:279)
> 2020-09-23T21:50:39.2475394Z at
> org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$updatePartitions$12(TaskExecutor.java:758)
> 2020-09-23T21:50:39.2476235Z at
> org.apache.flink.runtime.taskexecutor.TaskExecutor$$Lambda$406/1860601696.run(Unknown
> Source)
> 2020-09-23T21:50:39.2476973Z at
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
> 2020-09-23T21:50:39.2477714Z at
> akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
> 2020-09-23T21:50:39.2478698Z at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
> 2020-09-23T21:50:39.2479506Z at
> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 2020-09-23T21:50:39.2480263Z at
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 2020-09-23T21:50:39.2481018Z at
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 2020-09-23T21:50:39.2481727Z at
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2020-09-23T21:50:39.2482192Z
> {noformat}
>
> The deadlock was introduced by the alternative fix for FLINK-12510 in
> FLINK-19026 .
> The alternative fix avoided double lock acquisition in
> {{SingleInputGate#waitAndGetNextData}} by moving {{notifyDataAvailable}} from
> {{ResultSubpartition#createReadView}} to
> {{ResultPartitionManager#createSubpartitionView}}, which solved the circular
> deadlock of FLINK-12510 by not acquiring the {{buffers}} lock of
> {{PipelinedSubpartition}}.
> However, that fix didn't go far enough. For local channels, it is still
> possible to create a similar deadlock. While {{SingleInputGate}} reads from
> {{LocalInputChannel}}, it holds the lock {{inputChannelsWithData}}. The
> channel may start requesting the partition and tries to acquire
> {{requestLock}}.
> At the same time there is an update of partition info, which acquires the
> {{requestLock}} and notifies the {{SingleInputGate}}, which needs to lock on
> {{inputChannelsWithData}}.
> The solution is to first acquire the partition and release {{requestLock}}
> before notifying {{SingleInputGate}}.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)