[
https://issues.apache.org/jira/browse/FLINK-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956946#comment-16956946
]
liupengcheng commented on FLINK-14429:
--------------------------------------
Thanks [~trohrmann] [~aljoscha], I got it!
> Conceptually, I think the client should not tell the cluster with which return
> code to terminate. This seems like a wrong separation of concerns. The cluster
> should make this decision instead.
I agree with that; the cluster should make this decision itself. BTW, this makes
the behavior more similar to Spark's exit behavior. This is good for the design,
but our users sometimes complain that the app final status is not consistent
with the actual success or failure of their job in yarn-client/non-detached
mode.
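
To sketch what "the cluster decides" could look like (a minimal illustration
only; the class names and exit codes below are hypothetical stand-ins, not the
actual Flink API): the cluster entrypoint maps the job's terminal state to an
application status and derives the process exit code from it, so the client
never has to pass one in.
{code:java}
// Sketch only -- AppStatus, JobStatus, and the exit codes here are
// hypothetical, loosely modeled on the idea of Flink's ApplicationStatus.
public class ExitCodeSketch {

    enum JobStatus { FINISHED, CANCELED, FAILED }

    enum AppStatus {
        SUCCEEDED(0), FAILED(1), CANCELED(2), UNKNOWN(3);

        private final int exitCode;

        AppStatus(int exitCode) { this.exitCode = exitCode; }

        int processExitCode() { return exitCode; }
    }

    // The cluster maps the job's terminal state to an application status
    // and derives the process exit code itself; the client never passes one.
    static int exitCodeFor(JobStatus terminalState) {
        switch (terminalState) {
            case FINISHED: return AppStatus.SUCCEEDED.processExitCode();
            case CANCELED: return AppStatus.CANCELED.processExitCode();
            case FAILED:   return AppStatus.FAILED.processExitCode();
            default:       return AppStatus.UNKNOWN.processExitCode();
        }
    }
}
{code}
This keeps the exit-code policy in one place on the cluster side, which is the
separation of concerns described above.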
> Wrong app final status when running batch job on yarn with non-detached mode
> ----------------------------------------------------------------------------
>
> Key: FLINK-14429
> URL: https://issues.apache.org/jira/browse/FLINK-14429
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Affects Versions: 1.9.0
> Reporter: liupengcheng
> Priority: Minor
> Labels: pull-request-available
> Attachments: image-2019-10-17-16-47-47-038.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Recently, we found that the app final status is not correct when a batch job
> fails on YARN in non-detached mode: it reported SUCCEEDED, but FAILED is what
> we expected.
> !image-2019-10-17-16-47-47-038.png!
>
> But the logs and the client reported an error and the job failed (it was
> caused by an OOM):
> {code:java}
> 2019-10-10 14:36:21,797 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Job TeraSort (d82cbfaae905c695597083b1476e51b8) switched from state FAILING to FAILED.
> org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException: Connection for partition a254412fc7464cd4e0fe04ab9e3a6309@8d5afff58c86dd7f5bc78946f0101699 not reachable.
>     at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:168)
>     at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:237)
>     at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:215)
>     at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:65)
>     at org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:866)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:621)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'zjy-hadoop-prc-st164.bj/10.152.47.8:45704' has failed. This might indicate that the remote task manager has been lost.
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:197)
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:134)
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:86)
>     at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:68)
>     at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:165)
>     ... 7 more
> Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager + 'zjy-hadoop-prc-st164.bj/10.152.47.8:45704' has failed. This might indicate that the remote task manager has been lost.
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:220)
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:134)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:591)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:508)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
>     ... 1 more
> Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: zjy-hadoop-prc-st164.bj/10.152.47.8:45704
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>     at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
>     ... 6 more
> Caused by: java.net.ConnectException: Connection refused
>     ... 10 more
> {code}
>
> I looked into the code and found that this happens because we currently do
> not send the application status and diagnostics message to the dispatcher
> when shutting down the cluster. So I propose to add this information to make
> the final status correct.
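> A rough sketch of the idea (the class and method names below are hypothetical
> stand-ins, not the actual dispatcher or resource manager classes): pass the
> final application status and a diagnostics string along when the cluster
> deregisters itself from YARN, instead of always reporting success:
> {code:java}
> // Sketch only -- ShutdownSketch and YarnDeregistration are hypothetical
> // stand-ins illustrating the proposed fix: propagate status + diagnostics
> // to YARN on cluster shutdown instead of always reporting SUCCEEDED.
> public class ShutdownSketch {
>
>     enum ApplicationStatus { SUCCEEDED, FAILED }
>
>     interface YarnDeregistration {
>         // Tell YARN the final status (and why), not unconditionally SUCCEEDED.
>         void deregisterApplication(ApplicationStatus status, String diagnostics);
>     }
>
>     private final YarnDeregistration yarn;
>
>     ShutdownSketch(YarnDeregistration yarn) { this.yarn = yarn; }
>
>     // Called when the job reaches a globally terminal state in
>     // non-detached mode.
>     void onJobTerminated(boolean jobSucceeded, String failureCause) {
>         if (jobSucceeded) {
>             yarn.deregisterApplication(ApplicationStatus.SUCCEEDED, null);
>         } else {
>             yarn.deregisterApplication(ApplicationStatus.FAILED, failureCause);
>         }
>     }
> }
> {code}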
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)