[ https://issues.apache.org/jira/browse/FLINK-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956025#comment-16956025 ]

Till Rohrmann commented on FLINK-14429:
---------------------------------------

Hmm, the solution looks a bit like a hack to me, tbh. I'm not sure whether the 
actual benefit justifies the additional maintenance overhead we are adding with 
this solution. I'd rather wait for the refactoring of the CLI, which will 
hopefully solve this problem properly.

> Wrong app final status when running batch job on yarn with non-detached mode
> ----------------------------------------------------------------------------
>
>                 Key: FLINK-14429
>                 URL: https://issues.apache.org/jira/browse/FLINK-14429
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.9.0
>            Reporter: liupengcheng
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2019-10-17-16-47-47-038.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Recently, we found that the application's final status is not correct when an 
> application fails while running a batch job on YARN in non-detached mode: YARN 
> reports the application as SUCCEEDED, but FAILED is what we expected.
> !image-2019-10-17-16-47-47-038.png!
>  
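> For background (an illustration only, not code from this issue): the final status 
> YARN shows for an application is the FinalApplicationStatus the ApplicationMaster 
> reports when it unregisters from the ResourceManager, so if Flink's AM is never 
> told that the job failed, it unregisters with SUCCEEDED. A minimal sketch of that 
> unregister call, with hypothetical arguments standing in for the AM's state:
> {code:java}
> // Illustrative sketch only: how a YARN ApplicationMaster reports its final status.
> import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
> import org.apache.hadoop.yarn.client.api.AMRMClient;
> 
> public class UnregisterSketch {
>     // 'amRmClient', 'jobFailed' and 'diagnostics' are hypothetical stand-ins for the AM's state.
>     static void unregister(AMRMClient<AMRMClient.ContainerRequest> amRmClient,
>                            boolean jobFailed, String diagnostics) throws Exception {
>         FinalApplicationStatus finalStatus =
>                 jobFailed ? FinalApplicationStatus.FAILED : FinalApplicationStatus.SUCCEEDED;
>         // YARN records this status and message as the application's final state and diagnostics.
>         amRmClient.unregisterApplicationMaster(finalStatus, diagnostics, null);
>     }
> }
> {code}
>  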
> However, the logs and the client reported an error and the job failed (it was caused by an OOM):
> {code:java}
> 2019-10-10 14:36:21,797 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job TeraSort (d82cbfaae905c695597083b1476e51b8) switched from state FAILING to FAILED.
> org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException: Connection for partition a254412fc7464cd4e0fe04ab9e3a6309@8d5afff58c86dd7f5bc78946f0101699 not reachable.
>       at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:168)
>       at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:237)
>       at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:215)
>       at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:65)
>       at org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:866)
>       at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:621)
>       at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'zjy-hadoop-prc-st164.bj/10.152.47.8:45704' has failed. This might indicate that the remote task manager has been lost.
>       at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:197)
>       at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:134)
>       at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:86)
>       at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:68)
>       at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:165)
>       ... 7 more
> Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager + 'zjy-hadoop-prc-st164.bj/10.152.47.8:45704' has failed. This might indicate that the remote task manager has been lost.
>       at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:220)
>       at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:134)
>       at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
>       at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
>       at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
>       at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
>       at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
>       at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327)
>       at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343)
>       at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
>       at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:591)
>       at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:508)
>       at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)
>       at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
>       ... 1 more
> Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: zjy-hadoop-prc-st164.bj/10.152.47.8:45704
>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>       at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>       at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
>       at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
>       ... 6 more
> Caused by: java.net.ConnectException: Connection refused
>       ... 10 more
> {code}
>  
> I looked into the code and found that this happens because we currently do not 
> send the application status and diagnostics to the dispatcher when shutting down 
> the cluster. So I propose to pass this information along so that the final status 
> is reported correctly.
>  
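> To illustrate the idea, here is a minimal sketch (the ClusterShutdownGateway 
> interface and the shutDownCluster(ApplicationStatus, String) signature are 
> assumptions made for this sketch, not the actual Flink API): the client would 
> forward the job's outcome when it triggers the cluster shutdown, so the AM can 
> unregister from YARN with the matching final status.
> {code:java}
> // Minimal sketch of the proposed direction, not the actual Flink API or patch.
> // ClusterShutdownGateway and shutDownCluster(ApplicationStatus, String) are
> // hypothetical; today the shutdown path does not carry the job status, which is
> // the reported problem.
> import org.apache.flink.runtime.clusterframework.ApplicationStatus;
> 
> public class ShutdownSketch {
>     interface ClusterShutdownGateway {
>         void shutDownCluster(ApplicationStatus finalStatus, String diagnostics);
>     }
> 
>     static void shutDownAfterJob(ClusterShutdownGateway gateway, Throwable jobFailureCause) {
>         if (jobFailureCause == null) {
>             gateway.shutDownCluster(ApplicationStatus.SUCCEEDED, null);
>         } else {
>             // Forward status + diagnostics so YARN ends up reporting FAILED instead of SUCCEEDED.
>             gateway.shutDownCluster(ApplicationStatus.FAILED, jobFailureCause.getMessage());
>         }
>     }
> }
> {code}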



