[
https://issues.apache.org/jira/browse/FLINK-30908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685040#comment-17685040
]
Xintong Song commented on FLINK-30908:
--------------------------------------
After looking further into the logs and the Hadoop code, we believe FLINK-20988 is
not the cause of this failure.
The test failure is caused by:
1. {{AMRMClientAsync}} sends an {{InterruptedIOException}} to the callback
handler ({{YarnContainerEventHandler}}) after being stopped.
2. All errors sent to {{YarnContainerEventHandler}} are treated as fatal errors
in Flink.
This is not a newly introduced issue. 1) exists in Hadoop 2.9+ versions
(https://issues.apache.org/jira/browse/YARN-5999), and 2) has been the behavior
since YARN deployment was first supported. FLINK-20988 did introduce another
chance for exceptions during shutdown to be handled as fatal errors, but that is
not the cause of this test failure. Given that this issue already exists in
previous releases, I'm downgrading this ticket to Critical priority.
The proper fix might be to ignore exceptions that reach
{{YarnContainerEventHandler}} after it has been terminated. I'll update the PR
and fix this.
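To illustrate the idea, here is a minimal sketch of a callback handler that
stops escalating errors once it has been marked as terminated. It is not the
actual Flink change: {{TerminationAwareErrorHandler}}, {{markTerminated}},
{{ignoredCount}} and the {{Consumer}}-based fatal error sink are hypothetical
names chosen for this example, and the real {{YarnContainerEventHandler}} may
be structured quite differently.

```java
import java.io.InterruptedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical stand-in for a callback handler; not Flink's YarnContainerEventHandler. */
final class TerminationAwareErrorHandler {

    // Set once shutdown has started; read by the client's callback thread.
    private final AtomicBoolean terminated = new AtomicBoolean(false);

    // Counts errors dropped after termination, for diagnostics only.
    private final AtomicInteger ignored = new AtomicInteger();

    // Assumed hook that would escalate an error as fatal.
    private final Consumer<Throwable> fatalErrorSink;

    TerminationAwareErrorHandler(Consumer<Throwable> fatalErrorSink) {
        this.fatalErrorSink = fatalErrorSink;
    }

    /** Call this before stopping the underlying client. */
    void markTerminated() {
        terminated.set(true);
    }

    /** Errors that arrive after termination are counted and dropped. */
    void onError(Throwable error) {
        if (terminated.get()) {
            ignored.incrementAndGet();
            return;
        }
        fatalErrorSink.accept(error);
    }

    int ignoredCount() {
        return ignored.get();
    }

    public static void main(String[] args) {
        List<Throwable> fatal = new ArrayList<>();
        TerminationAwareErrorHandler handler = new TerminationAwareErrorHandler(fatal::add);

        // An error while the handler is still running is escalated.
        handler.onError(new IllegalStateException("error while running"));

        // After termination, the late InterruptedIOException is only counted.
        handler.markTerminated();
        handler.onError(
                new InterruptedIOException("Interrupted waiting to send RPC request to server"));

        // prints "fatal=1, ignored=1"
        System.out.println("fatal=" + fatal.size() + ", ignored=" + handler.ignoredCount());
    }
}
```

The flag has to be set before the client is stopped. If it were set afterwards,
an error emitted during the stop could still reach the fatal error path.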
> Fatal error in ResourceManager caused
> YARNSessionFIFOSecuredITCase.testDetachedMode to fail
> -------------------------------------------------------------------------------------------
>
> Key: FLINK-30908
> URL: https://issues.apache.org/jira/browse/FLINK-30908
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN, Runtime / Coordination
> Affects Versions: 1.17.0
> Reporter: Matthias Pohl
> Assignee: Xintong Song
> Priority: Blocker
> Labels: pull-request-available, test-stability
> Attachments: mvn-1.FLINK-30908.log
>
>
> There's a build failure in {{YARNSessionFIFOSecuredITCase.testDetachedMode}}
> which is caused by a fatal error in the ResourceManager:
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=45720&view=logs&j=245e1f2e-ba5b-5570-d689-25ae21e5302f&t=d04c9862-880c-52f5-574b-a7a79fef8e0f&l=29869
> {code}
> Feb 05 02:41:58 java.io.InterruptedIOException: Interrupted waiting to send
> RPC request to server
> Feb 05 02:41:58 java.io.InterruptedIOException: Interrupted waiting to send
> RPC request to server
> Feb 05 02:41:58 at org.apache.hadoop.ipc.Client.call(Client.java:1480)
> ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58 at org.apache.hadoop.ipc.Client.call(Client.java:1422)
> ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58 at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58 at com.sun.proxy.$Proxy31.allocate(Unknown Source)
> ~[?:?]
> Feb 05 02:41:58 at
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
> ~[hadoop-yarn-common-3.2.3.jar:?]
> Feb 05 02:41:58 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method) ~[?:1.8.0_292]
> Feb 05 02:41:58 at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> ~[?:1.8.0_292]
> Feb 05 02:41:58 at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> ~[?:1.8.0_292]
> Feb 05 02:41:58 at java.lang.reflect.Method.invoke(Method.java:498)
> ~[?:1.8.0_292]
> Feb 05 02:41:58 at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58 at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58 at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58 at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58 at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58 at com.sun.proxy.$Proxy32.allocate(Unknown Source)
> ~[?:?]
> Feb 05 02:41:58 at
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:325)
> ~[hadoop-yarn-client-3.2.3.jar:?]
> Feb 05 02:41:58 at
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:311)
> [hadoop-yarn-client-3.2.3.jar:?]
> Feb 05 02:41:58 Caused by: java.lang.InterruptedException
> Feb 05 02:41:58 at
> java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404) ~[?:1.8.0_292]
> Feb 05 02:41:58 at
> java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:1.8.0_292]
> Feb 05 02:41:58 at
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1180)
> ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58 at org.apache.hadoop.ipc.Client.call(Client.java:1475)
> ~[hadoop-common-3.2.3.jar:?]
> Feb 05 02:41:58 ... 17 more
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)