[jira] [Commented] (FLINK-22819) YARNFileReplicationITCase fails with "The YARN application unexpectedly switched to state FAILED during deployment"

Jira Wed, 07 Jul 2021 05:34:23 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-22819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376531#comment-17376531
 ]


David Morávek commented on FLINK-22819:
---------------------------------------

The issue is caused by `yarn.am.liveness-monitor.expiry-interval-ms`, which was 
lowered for the YARN test cases. We can think of two reasons why this option 
was lowered:

A) To make sure that we're actually sending heartbeats from the 
ApplicationMaster to the YARN Resource Manager.
B) To make tests fail faster if the user code that we're executing inside the 
ApplicationMaster gets stuck for some reason.

`expiry-interval-ms` for the AM works as follows (simplified):
1) We start a YARN mini cluster + HDFS (where we upload the Flink jars).
2) We create a new YARN application and allocate a container for the 
ApplicationMaster. This is the *actual startTime for the AM expiration check*.
3) User jars / entrypoints / configs get downloaded from HDFS into the 
ApplicationMaster.
4) The entrypoint (a shell script that starts `YarnJobClusterEntrypoint`) gets 
executed.
5) Inside the Java entrypoint, we construct a YARN Resource Manager Client 
(`AMRMClientAsync<?>`), which we use for operating the YARN cluster. This client 
also *starts sending heartbeats* (it's a side effect; no explicit action needs 
to be taken for sending heartbeats).

We can see there is a lot of work to be done between 2) and 5), and in a 
resource-limited environment such as CI, 20s may not be enough for registering 
the RM Client with the Resource Manager.
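
To illustrate the timing described above, here is a minimal, self-contained 
sketch of the expiry check. It is a simplified model, not YARN's actual 
implementation: the class `AmLivenessSketch`, its methods, and the 15s / 8s 
durations are hypothetical and chosen only to show how the time spent between 
2) and 5) counts against the interval.

```java
import java.time.Duration;

/** Simplified, hypothetical model of the AM liveness check (not YARN code). */
public class AmLivenessSketch {

    private final Duration expiryInterval;
    // Time of the last sign of life, measured from container allocation.
    private Duration lastSeen;

    /** The clock starts at container allocation (step 2), not at the first heartbeat. */
    public AmLivenessSketch(Duration expiryInterval, Duration containerAllocatedAt) {
        this.expiryInterval = expiryInterval;
        this.lastSeen = containerAllocatedAt;
    }

    /** Each heartbeat from the RM client (step 5 onwards) resets the timer. */
    public void heartbeat(Duration now) {
        lastSeen = now;
    }

    /** The AM counts as expired once no heartbeat arrived within the interval. */
    public boolean isExpired(Duration now) {
        return now.minus(lastSeen).compareTo(expiryInterval) > 0;
    }

    public static void main(String[] args) {
        // Hypothetical durations on a slow CI machine (assumptions, not measurements):
        Duration localization = Duration.ofSeconds(15); // step 3: download from HDFS
        Duration jvmStartup = Duration.ofSeconds(8);    // steps 4-5: script + JVM + RM client
        Duration firstHeartbeat = localization.plus(jvmStartup);

        AmLivenessSketch lowered = new AmLivenessSketch(Duration.ofSeconds(20), Duration.ZERO);
        AmLivenessSketch relaxed = new AmLivenessSketch(Duration.ofMinutes(5), Duration.ZERO);

        System.out.println("20s interval expired before first heartbeat: "
                + lowered.isExpired(firstHeartbeat));
        System.out.println("5m interval expired before first heartbeat: "
                + relaxed.isExpired(firstHeartbeat));
    }
}
```

With these assumed numbers the first heartbeat would arrive 23s after 
allocation, so the 20s check trips before the AM ever had a chance to report 
in, while the longer interval leaves plenty of headroom.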

Anyway, I don't think there is a need for this timeout anymore: if we didn't 
start the RM Client, tests would fail anyway, and with this timeout we 
unnecessarily stress YARN's internal implementation.

As for potentially long-running tests, this should already be covered by the 
timeouts in the Azure pipelines.

As a fix, I propose to remove this override and use the 5m default instead. IMO 
this option is actually meant just as a safeguard for when the AM gets into an 
"unhandled" state (network failures / unable to shut down).

> YARNFileReplicationITCase fails with "The YARN application unexpectedly 
> switched to state FAILED during deployment"
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-22819
>                 URL: https://issues.apache.org/jira/browse/FLINK-22819
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.13.1
>            Reporter: Dawid Wysakowicz
>            Assignee: David Morávek
>            Priority: Major
>              Labels: stale-major, test-stability
>             Fix For: 1.14.0
>
>         Attachments: 
> FLINK-22819-YARNFileReplicationITCase-testPerJobModeWithDefaultFileReplication.log
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=18467&view=logs&j=8fd975ef-f478-511d-4997-6f15fe8a1fd3&t=ac0fa443-5d45-5a6b-3597-0310ecc1d2ab&l=32007
> {code}
> May 31 23:14:22 
> org.apache.flink.client.deployment.ClusterDeploymentException: Could not 
> deploy Yarn job cluster.
> May 31 23:14:22       at 
> org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(YarnClusterDescriptor.java:481)
> May 31 23:14:22       at 
> org.apache.flink.yarn.YARNFileReplicationITCase.deployPerJob(YARNFileReplicationITCase.java:106)
> May 31 23:14:22       at 
> org.apache.flink.yarn.YARNFileReplicationITCase.lambda$testPerJobModeWithDefaultFileReplication$1(YARNFileReplicationITCase.java:78)
> May 31 23:14:22       at 
> org.apache.flink.yarn.YarnTestBase.runTest(YarnTestBase.java:287)
> May 31 23:14:22       at 
> org.apache.flink.yarn.YARNFileReplicationITCase.testPerJobModeWithDefaultFileReplication(YARNFileReplicationITCase.java:78)
> May 31 23:14:22       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> May 31 23:14:22       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> May 31 23:14:22       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> May 31 23:14:22       at java.lang.reflect.Method.invoke(Method.java:498)
> May 31 23:14:22       at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> May 31 23:14:22       at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> May 31 23:14:22       at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> May 31 23:14:22       at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> May 31 23:14:22       at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> May 31 23:14:22       at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> May 31 23:14:22       at 
> org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> May 31 23:14:22       at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> May 31 23:14:22       at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> May 31 23:14:22       at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> May 31 23:14:22       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> May 31 23:14:22       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> May 31 23:14:22       at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> May 31 23:14:22       at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> May 31 23:14:22       at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> May 31 23:14:22       at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> May 31 23:14:22       at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> May 31 23:14:22       at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> May 31 23:14:22       at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> May 31 23:14:22       at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> May 31 23:14:22       at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> May 31 23:14:22       at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> May 31 23:14:22       at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> May 31 23:14:22       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> May 31 23:14:22       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> May 31 23:14:22       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> May 31 23:14:22       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
> May 31 23:14:22       at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> May 31 23:14:22       at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> May 31 23:14:22       at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> May 31 23:14:22       at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> May 31 23:14:22 Caused by: 
> org.apache.flink.yarn.YarnClusterDescriptor$YarnDeploymentException: The YARN 
> application unexpectedly switched to state FAILED during deployment. 
> May 31 23:14:22 Diagnostics from YARN: Application 
> application_1622502732791_0001 failed 1 times (global limit =2; local limit 
> is =1) due to ApplicationMaster for attempt 
> appattempt_1622502732791_0001_000001 timed out. Failing the application.
> May 31 23:14:22 If log aggregation is enabled on your cluster, use this 
> command to further investigate the issue:
> May 31 23:14:22 yarn logs -applicationId application_1622502732791_0001
> May 31 23:14:22       at 
> org.apache.flink.yarn.YarnClusterDescriptor.startAppMaster(YarnClusterDescriptor.java:1201)
> May 31 23:14:22       at 
> org.apache.flink.yarn.YarnClusterDescriptor.deployInternal(YarnClusterDescriptor.java:593)
> May 31 23:14:22       at 
> org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(YarnClusterDescriptor.java:474)
> May 31 23:14:22       ... 39 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
