[jira] [Commented] (FLINK-2134) Deadlock in SuccessAfterNetworkBuffersFailureITCase

ASF GitHub Bot (JIRA) Wed, 03 Jun 2015 09:54:22 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-2134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571311#comment-14571311
 ]


ASF GitHub Bot commented on FLINK-2134:
---------------------------------------

GitHub user uce opened a pull request:

    https://github.com/apache/flink/pull/773

    [FLINK-2134] Close Netty channel via CloseRequest msg

    The failing `SuccessAfterNetworkBuffersFailureITCase` exposed a race 
between sending backwards events (e.g. from sync task to iteration head task) 
and closing the TCP channel. The close could overtake outstanding backwards 
task events. This change guarantees that the close is processed after them by 
sending an explicit close message.
    
    (I've also tried other approaches, but this struck me as the simplest 
solution.)
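
A minimal sketch of the ordering idea, not the actual Netty code in this PR: if the close is sent as a message through the same FIFO path as the events, it cannot overtake them. The names `OrderedCloseSketch`, `Channel`, `CLOSE_REQUEST`, `send`, `requestClose` and `awaitClosed` are hypothetical and only model the behaviour with a plain Java queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class OrderedCloseSketch {

    /** Hypothetical marker message; stands in for the PR's CloseRequest. */
    static final Object CLOSE_REQUEST = new Object();

    /** Hypothetical channel model: one FIFO queue drained by one writer thread. */
    static final class Channel {
        private final BlockingQueue<Object> outbound = new LinkedBlockingQueue<>();
        private final List<Object> delivered = new ArrayList<>();
        private final Thread writer = new Thread(this::drain, "channel-writer");

        Channel() {
            writer.start();
        }

        /** Enqueue an event for delivery. */
        void send(Object event) {
            outbound.add(event);
        }

        /** Enqueue the close behind everything already sent. */
        void requestClose() {
            outbound.add(CLOSE_REQUEST);
        }

        /** Wait until the writer has seen the close; return what was delivered. */
        List<Object> awaitClosed() {
            try {
                writer.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for close", e);
            }
            return delivered;
        }

        private void drain() {
            try {
                while (true) {
                    Object msg = outbound.take();
                    if (msg == CLOSE_REQUEST) {
                        return; // close only after all earlier messages were handled
                    }
                    delivered.add(msg);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Send n events, then request close, and return the delivered events. */
    static List<Object> run(int n) {
        Channel ch = new Channel();
        for (int i = 0; i < n; i++) {
            ch.send("event-" + i);
        }
        ch.requestClose();
        return ch.awaitClosed();
    }

    public static void main(String[] args) {
        System.out.println(run(3)); // prints [event-0, event-1, event-2]
    }
}
```

In this model a close issued out of band (for example by stopping the writer thread directly) could drop queued events, which is the kind of reordering the explicit close message is meant to rule out.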

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/uce/incubator-flink event-deadlock-2134

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/773.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #773
    
----
commit 6a4dcd0d4866ba16c432d15de25df7c161c894b7
Author: Ufuk Celebi <[email protected]>
Date:   2015-06-03T16:41:40Z

    [FLINK-2134] Close Netty channel via CloseRequest msg

----


> Deadlock in SuccessAfterNetworkBuffersFailureITCase
> ---------------------------------------------------
>
>                 Key: FLINK-2134
>                 URL: https://issues.apache.org/jira/browse/FLINK-2134
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: master
>            Reporter: Ufuk Celebi
>
> I ran into the issue in a Travis run for a PR: 
> https://s3.amazonaws.com/archive.travis-ci.org/jobs/64994288/log.txt
> I can reproduce this locally by running 
> SuccessAfterNetworkBuffersFailureITCase multiple times:
> {code}
> cluster = new ForkableFlinkMiniCluster(config, false);
> for (int i = 0; i < 100; i++) {
>    // run test programs CC, KMeans, CC
> }
> {code}
> The iteration tasks wait for superstep notifications like this:
> {code}
> "Join (Join at 
> runConnectedComponents(SuccessAfterNetworkBuffersFailureITCase.java:128)) 
> (8/6)" daemon prio=5 tid=0x00007f95f374f800 nid=0x138a7 in Object.wait() 
> [0x0000000123f2a000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>       at java.lang.Object.wait(Native Method)
>       - waiting on <0x00000007f89e3440> (a java.lang.Object)
>       at 
> org.apache.flink.runtime.iterative.concurrent.SuperstepKickoffLatch.awaitStartOfSuperstepOrTermination(SuperstepKickoffLatch.java:57)
>       - locked <0x00000007f89e3440> (a java.lang.Object)
>       at 
> org.apache.flink.runtime.iterative.task.IterationTailPactTask.run(IterationTailPactTask.java:131)
>       at 
> org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
>       at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>       at java.lang.Thread.run(Thread.java:745)
> {code}
> I've asked [~rmetzger] to reproduce this and it deadlocks for him as well. 
> The system needs to be under some load for this to occur after multiple runs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
