[ 
https://issues.apache.org/jira/browse/FLINK-2134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571716#comment-14571716
 ] 

ASF GitHub Bot commented on FLINK-2134:
---------------------------------------

Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/773#issuecomment-108627084
  
    Can you elaborate? Why are there backwards events after the connection is
    closed? The iteration head should not close until the iteration terminates,
    in which case there should be no back events any more.


> Deadlock in SuccessAfterNetworkBuffersFailureITCase
> ---------------------------------------------------
>
>                 Key: FLINK-2134
>                 URL: https://issues.apache.org/jira/browse/FLINK-2134
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: master
>            Reporter: Ufuk Celebi
>
> I ran into the issue in a Travis run for a PR: 
> https://s3.amazonaws.com/archive.travis-ci.org/jobs/64994288/log.txt
> I can reproduce this locally by running 
> SuccessAfterNetworkBuffersFailureITCase multiple times:
> {code}
> cluster = new ForkableFlinkMiniCluster(config, false);
> for (int i = 0; i < 100; i++) {
>    // run test programs CC, KMeans, CC
> }
> {code}
> The iteration tasks wait for superstep notifications like this:
> {code}
> "Join (Join at runConnectedComponents(SuccessAfterNetworkBuffersFailureITCase.java:128)) (8/6)" daemon prio=5 tid=0x00007f95f374f800 nid=0x138a7 in Object.wait() [0x0000000123f2a000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>       at java.lang.Object.wait(Native Method)
>       - waiting on <0x00000007f89e3440> (a java.lang.Object)
>       at org.apache.flink.runtime.iterative.concurrent.SuperstepKickoffLatch.awaitStartOfSuperstepOrTermination(SuperstepKickoffLatch.java:57)
>       - locked <0x00000007f89e3440> (a java.lang.Object)
>       at org.apache.flink.runtime.iterative.task.IterationTailPactTask.run(IterationTailPactTask.java:131)
>       at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
>       at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>       at java.lang.Thread.run(Thread.java:745)
> {code}
> I've asked [~rmetzger] to reproduce this and it deadlocks for him as well. 
> The system needs to be under some load for this to occur after multiple runs.
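
For readers unfamiliar with the pattern in the thread dump above, the following is a minimal sketch of a superstep latch built on a timed Object.wait(). It is not Flink's actual SuperstepKickoffLatch: the class name KickoffLatchSketch, the 2000 ms timeout, and the numbering of supersteps from 1 are all assumptions made for illustration. It only shows why a waiting thread appears as TIMED_WAITING and stays parked for as long as neither the next superstep nor termination is signalled.

```java
// Hypothetical sketch for illustration; NOT Flink's SuperstepKickoffLatch.
final class KickoffLatchSketch {

    private final Object monitor = new Object();
    private int superstep = 1;      // assumption: supersteps are numbered from 1
    private boolean terminated;

    /** Called by the signalling side (e.g. an iteration head) to start the next superstep. */
    void triggerNextSuperstep() {
        synchronized (monitor) {
            if (terminated) {
                throw new IllegalStateException("already terminated");
            }
            superstep++;
            monitor.notifyAll();
        }
    }

    /** Called by the signalling side when the iteration is over. */
    void signalTermination() {
        synchronized (monitor) {
            terminated = true;
            monitor.notifyAll();
        }
    }

    /**
     * Blocks until the awaited superstep has started or termination was signalled.
     * Returns true on termination, false if the awaited superstep started.
     */
    boolean awaitStartOfSuperstepOrTermination(int awaited) throws InterruptedException {
        synchronized (monitor) {
            // Loop guards against spurious wakeups and timeouts. The timed wait
            // (timeout value is an assumption) is what a thread dump reports as
            // TIMED_WAITING (on object monitor).
            while (!terminated && superstep < awaited) {
                monitor.wait(2000);
            }
            return terminated;
        }
    }
}
```

In this sketch, a thread that calls awaitStartOfSuperstepOrTermination(2) returns only after another thread calls triggerNextSuperstep() or signalTermination(). If the signalling side never does either, the waiter loops on the timed wait indefinitely, which matches the symptom in the dump but does not by itself identify the root cause of this issue.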



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
