[ 
https://issues.apache.org/jira/browse/FLINK-1627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi resolved FLINK-1627.
--------------------------------
    Resolution: Fixed

Fixed in 0333109.

It was not a connect deadlock, but the network I/O thread was blocked in an 
infinite loop. That's why the connect calls never returned.

> Netty channel connect deadlock 
> -------------------------------
>
>                 Key: FLINK-1627
>                 URL: https://issues.apache.org/jira/browse/FLINK-1627
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Ufuk Celebi
>
> [~StephanEwen] reports the following deadlock 
> (https://travis-ci.org/StephanEwen/incubator-flink/jobs/52755844, logs: 
> https://s3.amazonaws.com/flink.a.o.uce.east/travis-artifacts/StephanEwen/incubator-flink/477/477.2.tar.gz).
> {code}
> "CHAIN Partition -> Map (Map at 
> testRestartMultipleTimes(SimpleRecoveryITCase.java:200)) (2/4)" daemon 
> prio=10 tid=0x00007f5fdc008800 nid=0xe230 in Object.wait() 
> [0x00007f5fca8f2000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>       at java.lang.Object.wait(Native Method)
>       - waiting on <0x00000000f2a13530> (a java.lang.Object)
>       at 
> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:179)
>       - locked <0x00000000f2a13530> (a java.lang.Object)
>       at 
> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:125)
>       at 
> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:64)
>       at 
> org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:53)
>       at 
> org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestIntermediateResultPartition(RemoteInputChannel.java:92)
>       at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:287)
>       - locked <0x00000000f29dbcd8> (a java.lang.Object)
>       at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:306)
>       at 
> org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:75)
>       at 
> org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:34)
>       at 
> org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:59)
>       at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:91)
>       at 
> org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
>       at 
> org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
>       at 
> org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:205)
>       at java.lang.Thread.run(Thread.java:745)
> {code}
> {code}
> "CHAIN Partition -> Map (Map at 
> testRestartMultipleTimes(SimpleRecoveryITCase.java:200)) (3/4)" daemon 
> prio=10 tid=0x00007f5fdc005000 nid=0xe22f in Object.wait() 
> [0x00007f5fca9f3000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>       at java.lang.Object.wait(Native Method)
>       - waiting on <0x00000000f2a13530> (a java.lang.Object)
>       at 
> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:179)
>       - locked <0x00000000f2a13530> (a java.lang.Object)
>       at 
> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:125)
>       at 
> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:79)
>       at 
> org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:53)
>       at 
> org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestIntermediateResultPartition(RemoteInputChannel.java:92)
>       at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:287)
>       - locked <0x00000000f2896f88> (a java.lang.Object)
>       at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:306)
>       at 
> org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:75)
>       at 
> org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:34)
>       at 
> org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:59)
>       at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:91)
>       at 
> org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
>       at 
> org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
>       at 
> org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:205)
>       at java.lang.Thread.run(Thread.java:745)
> {code}
> Two tasks try to connect to a task manager during a data shuffle. One of the 
> two tries to establish the connection and then both wait for the connect to 
> return (waitForChannel).
> The problem seems to be related to the channel listener never handing in the 
> channel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to