Ufuk Celebi created FLINK-1627:
----------------------------------
Summary: Netty channel connect deadlock
Key: FLINK-1627
URL: https://issues.apache.org/jira/browse/FLINK-1627
Project: Flink
Issue Type: Bug
Reporter: Ufuk Celebi
[~StephanEwen] reports the following deadlock
(https://travis-ci.org/StephanEwen/incubator-flink/jobs/52755844, logs:
https://s3.amazonaws.com/flink.a.o.uce.east/travis-artifacts/StephanEwen/incubator-flink/477/477.2.tar.gz).
{code}
"CHAIN Partition -> Map (Map at
testRestartMultipleTimes(SimpleRecoveryITCase.java:200)) (2/4)" daemon prio=10
tid=0x00007f5fdc008800 nid=0xe230 in Object.wait() [0x00007f5fca8f2000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000000f2a13530> (a java.lang.Object)
at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:179)
- locked <0x00000000f2a13530> (a java.lang.Object)
at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:125)
at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:64)
at
org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:53)
at
org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestIntermediateResultPartition(RemoteInputChannel.java:92)
at
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:287)
- locked <0x00000000f29dbcd8> (a java.lang.Object)
at
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:306)
at
org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:75)
at
org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:34)
at
org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:59)
at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:91)
at
org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
at
org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
at
org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:205)
at java.lang.Thread.run(Thread.java:745)
{code}
{code}
"CHAIN Partition -> Map (Map at
testRestartMultipleTimes(SimpleRecoveryITCase.java:200)) (3/4)" daemon prio=10
tid=0x00007f5fdc005000 nid=0xe22f in Object.wait() [0x00007f5fca9f3000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000000f2a13530> (a java.lang.Object)
at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:179)
- locked <0x00000000f2a13530> (a java.lang.Object)
at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:125)
at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:79)
at
org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:53)
at
org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestIntermediateResultPartition(RemoteInputChannel.java:92)
at
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:287)
- locked <0x00000000f2896f88> (a java.lang.Object)
at
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:306)
at
org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:75)
at
org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:34)
at
org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:59)
at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:91)
at
org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
at
org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
at
org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:205)
at java.lang.Thread.run(Thread.java:745)
{code}
Two tasks try to connect to a task manager during a data shuffle. One of the
two tries to establish the connection and then both wait for the connect to
return (waitForChannel).
The problem seems to be related to the channel listener never handing in the
channel.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)