[
https://issues.apache.org/jira/browse/FLINK-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953618#comment-16953618
]
liupengcheng commented on FLINK-14429:
--------------------------------------
[~trohrmann] Even for multi-part jobs, we should fail the whole application if
one job fails in non-detached mode. This is what Spark currently does, and I
think Flink should do the same. Otherwise users will get confused: why does the
client report a failure while the app is still running? This is a common case
for batch jobs, where users will not expect the app to keep running in the
background.
I understand that the current implementation mainly targets interactive jobs
like the Flink shell, where a session cluster is deployed to avoid repeated
deployments. But for a batch multi-job application, one job failing usually
means the whole application has failed, and it is meaningless to keep the
session cluster alive.
Another thing I have to point out is that the shutdown message is sent by the
client, so we just need to carry the right status to the remote side and tell
the dispatcher to deregister the application with that status.
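To illustrate, here is a minimal client-side sketch. JobResult and
ApplicationStatus are existing Flink classes, but the
shutDownCluster(ApplicationStatus) call is hypothetical; today's shutdown
request carries no status at all:
{code:java}
import org.apache.flink.runtime.clusterframework.ApplicationStatus;
import org.apache.flink.runtime.jobmaster.JobResult;

// Sketch: the client already knows how the job ended, so it can derive the
// application status from the job result instead of letting the cluster
// default to SUCCEEDED.
static ApplicationStatus finalStatusOf(JobResult jobResult) {
    return jobResult.isSuccess()
        ? ApplicationStatus.SUCCEEDED
        : ApplicationStatus.FAILED;
}

// Hypothetical call: attach the derived status to the shutdown request so the
// dispatcher can deregister the application with it.
// clusterClient.shutDownCluster(finalStatusOf(jobResult));
{code}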
For the Flink shell, the client only sends the shutdown message when it exits,
so it can live with this behavior.
> Wrong app final status when running batch job on yarn with non-detached mode
> ----------------------------------------------------------------------------
>
> Key: FLINK-14429
> URL: https://issues.apache.org/jira/browse/FLINK-14429
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Affects Versions: 1.9.0
> Reporter: liupengcheng
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2019-10-17-16-47-47-038.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Recently, we found that the app final status is not correct when an
> application fails while running a batch job on YARN in non-detached mode: it
> reports SUCCEEDED, but FAILED is what we expect.
> !image-2019-10-17-16-47-47-038.png!
>
> But the logs and the client reported an error, and the job failed (it was
> caused by an OOM):
> {code:java}
> 2019-10-10 14:36:21,797 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Job TeraSort (d82cbfaae905c695597083b1476e51b8) switched from state FAILING to FAILED.
> org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException: Connection for partition a254412fc7464cd4e0fe04ab9e3a6309@8d5afff58c86dd7f5bc78946f0101699 not reachable.
>     at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:168)
>     at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:237)
>     at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:215)
>     at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:65)
>     at org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:866)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:621)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'zjy-hadoop-prc-st164.bj/10.152.47.8:45704' has failed. This might indicate that the remote task manager has been lost.
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:197)
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:134)
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:86)
>     at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:68)
>     at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:165)
>     ... 7 more
> Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager + 'zjy-hadoop-prc-st164.bj/10.152.47.8:45704' has failed. This might indicate that the remote task manager has been lost.
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:220)
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:134)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:591)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:508)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
>     ... 1 more
> Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: zjy-hadoop-prc-st164.bj/10.152.47.8:45704
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>     at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
>     ... 6 more
> Caused by: java.net.ConnectException: Connection refused
>     ... 10 more
> {code}
>
> I looked into the code and found that this happens because we currently do
> not send the application status and diagnostics to the dispatcher when
> shutting down the cluster. So I propose to add this information to make the
> final status correct.
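> To make this concrete, here is a minimal sketch of the YARN side.
> ApplicationStatus and FinalApplicationStatus are the real Flink and Hadoop
> enums, but toYarnStatus is an illustrative helper, not existing code: once
> the dispatcher knows the final status, the resource manager can translate it
> before unregistering the application master.
> {code:java}
> import org.apache.flink.runtime.clusterframework.ApplicationStatus;
> import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
>
> // Illustrative helper: map Flink's final status onto YARN's, so the AM is
> // unregistered with FAILED instead of an unconditional SUCCEEDED.
> static FinalApplicationStatus toYarnStatus(ApplicationStatus status) {
>     switch (status) {
>         case SUCCEEDED: return FinalApplicationStatus.SUCCEEDED;
>         case FAILED:    return FinalApplicationStatus.FAILED;
>         case CANCELED:  return FinalApplicationStatus.KILLED;
>         default:        return FinalApplicationStatus.UNDEFINED;
>     }
> }
> {code}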
>