[ 
https://issues.apache.org/jira/browse/FLINK-30908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685040#comment-17685040
 ] 

Xintong Song commented on FLINK-30908:
--------------------------------------

After looking further into the logs and the Hadoop code, we believe FLINK-20988 is 
not the cause of this failure.

The test failure is caused by:
1. {{AMRMClientAsync}} sends an {{InterruptedIOException}} to the callback 
handler ({{YarnContainerEventHandler}}) after being stopped.
2. All errors sent to {{YarnContainerEventHandler}} are treated as fatal errors 
in Flink.

This is not a newly introduced issue. 1) exists in Hadoop 2.9+ 
(https://issues.apache.org/jira/browse/YARN-5999), and 2) has been the behavior 
since YARN deployment was first supported. FLINK-20988 did introduce another 
opportunity for exceptions during shutdown to be handled as fatal errors, but 
that is not the cause of this test failure. Given that this issue already exists 
in previous releases, I'm downgrading this ticket to Critical priority.

The proper fix might be to ignore exceptions that reach 
{{YarnContainerEventHandler}} after termination. I'll update the PR to fix 
this.
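
To illustrate the idea, here is a minimal, self-contained sketch of a handler that forwards errors as fatal only until it has been terminated. It is not the actual patch: {{GuardedErrorHandlerSketch}}, {{GuardedHandler}}, {{terminate()}} and the {{Consumer}}-based fatal error hook are hypothetical names chosen for this example, and no Flink or Hadoop classes are used.

```java
import java.io.InterruptedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

/** Illustrative sketch only; none of these types are Flink or Hadoop classes. */
public class GuardedErrorHandlerSketch {

    /** Hypothetical handler that forwards errors as fatal until it is terminated. */
    public static final class GuardedHandler {
        private final AtomicBoolean terminated = new AtomicBoolean(false);
        private final Consumer<Throwable> fatalErrorHandler;

        public GuardedHandler(Consumer<Throwable> fatalErrorHandler) {
            this.fatalErrorHandler = fatalErrorHandler;
        }

        /** Assumed to be called on shutdown, before the async client is stopped. */
        public void terminate() {
            terminated.set(true);
        }

        /** Callback assumed to be invoked by the client's heartbeat thread on failure. */
        public void onError(Throwable error) {
            if (terminated.get()) {
                // Expected during shutdown (e.g. an interrupted heartbeat); ignore it.
                return;
            }
            fatalErrorHandler.accept(error);
        }
    }

    public static void main(String[] args) {
        List<Throwable> fatalErrors = new ArrayList<>();
        GuardedHandler handler = new GuardedHandler(fatalErrors::add);

        // An error while still running is escalated as fatal.
        handler.onError(new IllegalStateException("failure while running"));

        // After termination, the late InterruptedIOException is dropped.
        handler.terminate();
        handler.onError(
                new InterruptedIOException("Interrupted waiting to send RPC request to server"));

        System.out.println("fatal errors recorded: " + fatalErrors.size());
    }
}
```

In this sketch only the first error reaches the fatal error hook; the one delivered after {{terminate()}} is silently ignored. The flag is an {{AtomicBoolean}} because the callback is assumed to run on a different thread than the one performing shutdown.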

> Fatal error in ResourceManager caused 
> YARNSessionFIFOSecuredITCase.testDetachedMode to fail
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-30908
>                 URL: https://issues.apache.org/jira/browse/FLINK-30908
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN, Runtime / Coordination
>    Affects Versions: 1.17.0
>            Reporter: Matthias Pohl
>            Assignee: Xintong Song
>            Priority: Blocker
>              Labels: pull-request-available, test-stability
>         Attachments: mvn-1.FLINK-30908.log
>
>
> There's a build failure in {{YARNSessionFIFOSecuredITCase.testDetachedMode}} 
> which is caused by a fatal error in the ResourceManager:
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=45720&view=logs&j=245e1f2e-ba5b-5570-d689-25ae21e5302f&t=d04c9862-880c-52f5-574b-a7a79fef8e0f&l=29869
> {code}
> Feb 05 02:41:58 java.io.InterruptedIOException: Interrupted waiting to send 
> RPC request to server
> Feb 05 02:41:58 java.io.InterruptedIOException: Interrupted waiting to send 
> RPC request to server
> Feb 05 02:41:58       at org.apache.hadoop.ipc.Client.call(Client.java:1480) 
> ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58       at org.apache.hadoop.ipc.Client.call(Client.java:1422) 
> ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>  ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58       at com.sun.proxy.$Proxy31.allocate(Unknown Source) 
> ~[?:?]
> Feb 05 02:41:58       at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
>  ~[hadoop-yarn-common-3.2.3.jar:?]
> Feb 05 02:41:58       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) ~[?:1.8.0_292]
> Feb 05 02:41:58       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> ~[?:1.8.0_292]
> Feb 05 02:41:58       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:1.8.0_292]
> Feb 05 02:41:58       at java.lang.reflect.Method.invoke(Method.java:498) 
> ~[?:1.8.0_292]
> Feb 05 02:41:58       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>  ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>  ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>  ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>  ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>  ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58       at com.sun.proxy.$Proxy32.allocate(Unknown Source) 
> ~[?:?]
> Feb 05 02:41:58       at 
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:325)
>  ~[hadoop-yarn-client-3.2.3.jar:?]
> Feb 05 02:41:58       at 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:311)
>  [hadoop-yarn-client-3.2.3.jar:?]
> Feb 05 02:41:58 Caused by: java.lang.InterruptedException
> Feb 05 02:41:58       at 
> java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404) ~[?:1.8.0_292]
> Feb 05 02:41:58       at 
> java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:1.8.0_292]
> Feb 05 02:41:58       at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1180) 
> ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58       at org.apache.hadoop.ipc.Client.call(Client.java:1475) 
> ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58       ... 17 more
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
