[
https://issues.apache.org/jira/browse/FLINK-2134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571311#comment-14571311
]
ASF GitHub Bot commented on FLINK-2134:
---------------------------------------
GitHub user uce opened a pull request:
https://github.com/apache/flink/pull/773
[FLINK-2134] Close Netty channel via CloseRequest msg
The failing `SuccessAfterNetworkBuffersFailureITCase` uncovered a race
between sending backwards events (e.g. from sync task to iteration head task)
and closing the TCP channel: the close could overtake outstanding backwards
task events. This change guarantees in-order delivery by sending an explicit
close msg.
(I've also tried other approaches, but this struck me as the simplest
solution.)
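As a rough illustration of the idea only (not the code in this pull request), the sketch below models a channel as a single FIFO queue and sends the close as an ordinary message on it. `OrderedCloseSketch`, `Channel`, `TaskEvent` and the `CloseRequest` class shown here are hypothetical stand-ins, not Flink's or Netty's actual classes. Because the close request travels in the same queue as the events, it cannot overtake events that were sent before it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class OrderedCloseSketch {

    /** Hypothetical message type; not Flink's actual class hierarchy. */
    public interface Msg {}

    /** Stand-in for a backwards task event. */
    public static final class TaskEvent implements Msg {
        public final String payload;

        public TaskEvent(String payload) {
            this.payload = payload;
        }
    }

    /** Marker message asking the receiver to close the channel. */
    public static final class CloseRequest implements Msg {}

    /** Stand-in for one TCP channel: messages are processed strictly in send order. */
    public static final class Channel {
        private final Queue<Msg> pending = new ArrayDeque<>();
        public final List<String> delivered = new ArrayList<>();
        public boolean closed;

        public void send(Msg msg) {
            if (closed) {
                throw new IllegalStateException("channel already closed");
            }
            pending.add(msg);
        }

        /** Processes queued messages in FIFO order and stops at the first CloseRequest. */
        public void drain() {
            Msg msg;
            while (!closed && (msg = pending.poll()) != null) {
                if (msg instanceof CloseRequest) {
                    closed = true;
                } else {
                    delivered.add(((TaskEvent) msg).payload);
                }
            }
        }
    }

    public static void main(String[] args) {
        Channel channel = new Channel();
        channel.send(new TaskEvent("end-of-superstep"));
        channel.send(new TaskEvent("termination"));
        // The close is queued behind the two events instead of racing with them.
        channel.send(new CloseRequest());
        channel.drain();
        System.out.println(channel.delivered + " closed=" + channel.closed);
    }
}
```

In this toy model both events are delivered before the channel is marked closed; closing the channel out of band, by contrast, would be free to happen while events are still pending.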
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/uce/incubator-flink event-deadlock-2134
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/773.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #773
----
commit 6a4dcd0d4866ba16c432d15de25df7c161c894b7
Author: Ufuk Celebi <[email protected]>
Date: 2015-06-03T16:41:40Z
[FLINK-2134] Close Netty channel via CloseRequest msg
----
> Deadlock in SuccessAfterNetworkBuffersFailureITCase
> ---------------------------------------------------
>
> Key: FLINK-2134
> URL: https://issues.apache.org/jira/browse/FLINK-2134
> Project: Flink
> Issue Type: Bug
> Affects Versions: master
> Reporter: Ufuk Celebi
>
> I ran into the issue in a Travis run for a PR:
> https://s3.amazonaws.com/archive.travis-ci.org/jobs/64994288/log.txt
> I can reproduce this locally by running
> SuccessAfterNetworkBuffersFailureITCase multiple times:
> {code}
> cluster = new ForkableFlinkMiniCluster(config, false);
> for (int i = 0; i < 100; i++) {
>     // run test programs CC, KMeans, CC
> }
> {code}
> The iteration tasks wait for superstep notifications like this:
> {code}
> "Join (Join at runConnectedComponents(SuccessAfterNetworkBuffersFailureITCase.java:128)) (8/6)" daemon prio=5 tid=0x00007f95f374f800 nid=0x138a7 in Object.wait() [0x0000000123f2a000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00000007f89e3440> (a java.lang.Object)
>         at org.apache.flink.runtime.iterative.concurrent.SuperstepKickoffLatch.awaitStartOfSuperstepOrTermination(SuperstepKickoffLatch.java:57)
>         - locked <0x00000007f89e3440> (a java.lang.Object)
>         at org.apache.flink.runtime.iterative.task.IterationTailPactTask.run(IterationTailPactTask.java:131)
>         at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
>         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> I've asked [~rmetzger] to reproduce this and it deadlocks for him as well.
> The system needs to be under some load for this to occur after multiple runs.