Ufuk Celebi created FLINK-1627:
----------------------------------

             Summary: Netty channel connect deadlock 
                 Key: FLINK-1627
                 URL: https://issues.apache.org/jira/browse/FLINK-1627
             Project: Flink
          Issue Type: Bug
            Reporter: Ufuk Celebi


[~StephanEwen] reports the following deadlock 
(https://travis-ci.org/StephanEwen/incubator-flink/jobs/52755844, logs: 
https://s3.amazonaws.com/flink.a.o.uce.east/travis-artifacts/StephanEwen/incubator-flink/477/477.2.tar.gz).

{code}
"CHAIN Partition -> Map (Map at 
testRestartMultipleTimes(SimpleRecoveryITCase.java:200)) (2/4)" daemon prio=10 
tid=0x00007f5fdc008800 nid=0xe230 in Object.wait() [0x00007f5fca8f2000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000f2a13530> (a java.lang.Object)
        at 
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:179)
        - locked <0x00000000f2a13530> (a java.lang.Object)
        at 
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:125)
        at 
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:64)
        at 
org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:53)
        at 
org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestIntermediateResultPartition(RemoteInputChannel.java:92)
        at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:287)
        - locked <0x00000000f29dbcd8> (a java.lang.Object)
        at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:306)
        at 
org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:75)
        at 
org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:34)
        at 
org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:59)
        at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:91)
        at 
org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
        at 
org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
        at 
org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:205)
        at java.lang.Thread.run(Thread.java:745)
{code}

{code}
"CHAIN Partition -> Map (Map at 
testRestartMultipleTimes(SimpleRecoveryITCase.java:200)) (3/4)" daemon prio=10 
tid=0x00007f5fdc005000 nid=0xe22f in Object.wait() [0x00007f5fca9f3000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000f2a13530> (a java.lang.Object)
        at 
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:179)
        - locked <0x00000000f2a13530> (a java.lang.Object)
        at 
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:125)
        at 
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:79)
        at 
org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:53)
        at 
org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestIntermediateResultPartition(RemoteInputChannel.java:92)
        at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:287)
        - locked <0x00000000f2896f88> (a java.lang.Object)
        at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:306)
        at 
org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:75)
        at 
org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:34)
        at 
org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:59)
        at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:91)
        at 
org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
        at 
org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
        at 
org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:205)
        at java.lang.Thread.run(Thread.java:745)
{code}

Two tasks try to connect to a task manager during a data shuffle. One of the 
two tries to establish the connection and then both wait for the connect to 
return (waitForChannel).

The problem seems to be related to the channel listener never handing in the 
channel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to