Robert Metzger created FLINK-5553:
-------------------------------------
Summary: Job fails during deployment with IllegalStateException from subpartition request
Key: FLINK-5553
URL: https://issues.apache.org/jira/browse/FLINK-5553
Project: Flink
Issue Type: Bug
Components: Network
Affects Versions: 1.3.0
Reporter: Robert Metzger
While running a test job with Flink 1.3-SNAPSHOT
(6fb6967b9f9a31f034bd09fcf76aaf147bc8e9a0), the job failed with this exception:
{code}
2017-01-18 14:56:27,043 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Sink: Unnamed (9/10) (befc06d0e792c2ce39dde74b365dd3cf) switched from DEPLOYING to RUNNING.
2017-01-18 14:56:27,059 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat Map (9/10) (e94a01ec283e5dce7f79b02cf51654c4) switched from DEPLOYING to RUNNING.
2017-01-18 14:56:27,817 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat Map (10/10) (cbb61c9a2f72c282877eb383e111f7cd) switched from RUNNING to FAILED.
java.lang.IllegalStateException: There has been an error in the channel.
    at org.apache.flink.util.Preconditions.checkState(Preconditions.java:195)
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.addInputChannel(PartitionRequestClientHandler.java:77)
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClient.requestSubpartition(PartitionRequestClient.java:104)
    at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:115)
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:419)
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:441)
    at org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:153)
    at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:192)
    at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:63)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:270)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:666)
    at java.lang.Thread.run(Thread.java:745)
2017-01-18 14:56:27,819 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Misbehaved Job (b1d985d11984df57400fdff2bb656c59) switched from state RUNNING to FAILING.
java.lang.IllegalStateException: There has been an error in the channel.
    at org.apache.flink.util.Preconditions.checkState(Preconditions.java:195)
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.addInputChannel(PartitionRequestClientHandler.java:77)
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClient.requestSubpartition(PartitionRequestClient.java:104)
    at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:115)
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:419)
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:441)
    at org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:153)
    at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:192)
    at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:63)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:270)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:666)
    at java.lang.Thread.run(Thread.java:745)
{code}
This is the first exception that is reported to the JobManager.
I suspect it is related to missing network buffers: on the next deployment
after the restart, the deployment fails with the insufficient number of
buffers exception.
I'll add logs to the JIRA.
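
For readers trying to interpret the stack trace: the exception is thrown by a state check (Preconditions.checkState) reached from addInputChannel, which suggests that the handler had already recorded a channel error before this task requested its subpartition. The following is only a minimal sketch of that pattern, not Flink's actual code. The class ChannelErrorSketch, the nested Handler, and the methods notifyChannelError and registerAfterError are hypothetical names made up for illustration; only the exception message is taken from the log above.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class ChannelErrorSketch {

    /** Hypothetical stand-in for a client handler that remembers an earlier channel failure. */
    static final class Handler {
        private final AtomicBoolean channelError = new AtomicBoolean(false);
        private int registeredChannels = 0;

        /** Assumed hook: invoked when the underlying connection reports a failure. */
        void notifyChannelError() {
            channelError.set(true);
        }

        /** Registers an input channel, refusing once an error has been recorded. */
        void addInputChannel(String channelId) {
            if (channelError.get()) {
                // Same message as in the log; the original cause is no longer visible here.
                throw new IllegalStateException("There has been an error in the channel.");
            }
            registeredChannels++;
        }

        int registeredChannels() {
            return registeredChannels;
        }
    }

    /** Returns the message of the exception raised by a registration after an error, or null. */
    static String registerAfterError() {
        Handler handler = new Handler();
        handler.addInputChannel("channel-1"); // succeeds, no error recorded yet
        handler.notifyChannelError();          // the earlier, real failure happens here
        try {
            handler.addInputChannel("channel-2"); // a later task sharing the connection
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(registerAfterError());
    }
}
```

If the real handler behaves like this sketch, the IllegalStateException is a follow-up symptom, and the underlying cause (possibly the buffer shortage suspected above) would have to be looked for earlier in the TaskManager logs.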