[
https://issues.apache.org/jira/browse/FLINK-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956812#comment-16956812
]
Aljoscha Krettek commented on FLINK-14429:
------------------------------------------
The proper solution will involve reworking how job submission works for
"attached" (sometimes called NORMAL or simply "non-detached") execution in the
{{CliFrontend}}. This touches other parts, such as the {{ContextEnvironment}}
machinery and how it invokes the client (or doesn't). Currently [~tison],
[~kkloudas], and I are working on these parts, and we will also change the
submission of attached jobs as a last step.
It turns out there is no easy fix for this issue, which is why I said it needs
some investigation first. I'm sorry that you have already made the effort to
implement something, but I agree with Till that this will not be the correct
solution in the end.
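For context on what "attached" means here: in detached mode the client returns
right after submission, while in attached mode the client stays connected,
waits for the job result, and should surface the final status (which is where
the YARN reporting goes wrong). A minimal sketch of that client-side flow,
using Flink-like names ({{ClusterClient}}, {{requestJobResult}}) purely for
illustration, not the actual {{CliFrontend}} code:
{code:java}
// Illustrative sketch only; this is not the CliFrontend rework described
// above, and the names below are assumptions, not the exact 1.9 APIs.
JobSubmissionResult submission = clusterClient.submitJob(jobGraph);

if (detached) {
    // Detached: return immediately; the client is not around later to
    // observe or report the job's final status.
    return;
}

// Attached (non-detached): block on the job result so the client, and
// ultimately the YARN application status, can reflect a FAILED outcome.
JobResult result = clusterClient
        .requestJobResult(submission.getJobID())
        .get();

if (!result.isSuccess()) {
    throw new ProgramInvocationException(
            "Job " + submission.getJobID() + " failed.");
}
{code}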
> Wrong app final status when running batch job on yarn with non-detached mode
> ----------------------------------------------------------------------------
>
> Key: FLINK-14429
> URL: https://issues.apache.org/jira/browse/FLINK-14429
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Affects Versions: 1.9.0
> Reporter: liupengcheng
> Priority: Minor
> Labels: pull-request-available
> Attachments: image-2019-10-17-16-47-47-038.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Recently, we found that the application's final status is not correct when a
> batch job fails on YARN in non-detached mode: YARN reports SUCCEEDED, but
> FAILED is what we expected.
> !image-2019-10-17-16-47-47-038.png!
>
> But the logs and the client reported an error, and the job failed (it was caused by an OOM):
> {code:java}
> 2019-10-10 14:36:21,797 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job TeraSort (d82cbfaae905c695597083b1476e51b8) switched from state FAILING to FAILED.
> org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException: Connection for partition a254412fc7464cd4e0fe04ab9e3a6309@8d5afff58c86dd7f5bc78946f0101699 not reachable.
> 	at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:168)
> 	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:237)
> 	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:215)
> 	at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:65)
> 	at org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:866)
> 	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:621)
> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'zjy-hadoop-prc-st164.bj/10.152.47.8:45704' has failed. This might indicate that the remote task manager has been lost.
> 	at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:197)
> 	at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:134)
> 	at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:86)
> 	at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:68)
> 	at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:165)
> 	... 7 more
> Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager + 'zjy-hadoop-prc-st164.bj/10.152.47.8:45704' has failed. This might indicate that the remote task manager has been lost.
> 	at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:220)
> 	at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:134)
> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
> 	at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327)
> 	at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343)
> 	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
> 	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:591)
> 	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:508)
> 	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)
> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
> 	... 1 more
> Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: zjy-hadoop-prc-st164.bj/10.152.47.8:45704
> 	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> 	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> 	at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
> 	at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
> 	... 6 more
> Caused by: java.net.ConnectException: Connection refused
> 	... 10 more
> {code}
>
> I looked into the code and found that it's because we currently don't send
> the application status and diagnostics message to the dispatcher when
> shutting down the cluster. So I propose to add this information to make the
> final status correct.
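> A rough sketch of what I mean (the {{shutDownCluster(status, diagnostics)}}
> signature is hypothetical, not the actual patch): derive the final
> application status and diagnostics from the job result and pass both along
> when tearing the cluster down:
> {code:java}
> // Hypothetical sketch: shutDownCluster(status, diagnostics) is an assumed
> // signature; the current shutdown path does not take these arguments.
> ApplicationStatus finalStatus = jobResult.isSuccess()
>         ? ApplicationStatus.SUCCEEDED
>         : ApplicationStatus.FAILED;
>
> String diagnostics = jobResult.getSerializedThrowable()
>         .map(SerializedThrowable::getMessage)
>         .orElse(null);
>
> // Hand both to the dispatcher so YARN sees FAILED plus the root cause
> // instead of an unconditional SUCCEEDED.
> dispatcherGateway.shutDownCluster(finalStatus, diagnostics);
> {code}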
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)