[
https://issues.apache.org/jira/browse/FLINK-22819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376531#comment-17376531
]
David Morávek commented on FLINK-22819:
---------------------------------------
The issue is caused by `yarn.am.liveness-monitor.expiry-interval-ms`, which was
lowered for the YARN test cases. We can think of two reasons why this option
was lowered:
A) To verify that we're actually sending heartbeats from the
ApplicationMaster to the YARN Resource Manager.
B) To make tests fail faster if the user code that we're executing inside the
ApplicationMaster gets stuck for some reason.
`expiry-interval-ms` for AM works as follows (simplified):
1) We start a YARN mini cluster + HDFS (where we upload the Flink jars).
2) We create a new YARN application and allocate a container for the
ApplicationMaster. This is the *actual startTime for the AM expiration check*.
3) User jars / entrypoints / configs get downloaded from HDFS into the
ApplicationMaster.
4) The entrypoint (a shell script that starts `YarnJobClusterEntrypoint`) gets
executed.
5) Inside the Java entrypoint, we construct a YARN Resource Manager client
(`AMRMClientAsync<?>`), which we use for operating the YARN cluster. This client
also *starts sending heartbeats* (as a side effect; no explicit action needs
to be taken to send heartbeats).
We can see that there is a lot of work to be done between 2) and 5), and in a
resource-limited environment such as CI, 20s may not be enough to register the
RM Client with the Resource Manager.
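To make the race concrete, here is a minimal sketch of the timing argument in
plain Java. It is a simplified model, not YARN's actual liveness monitor: the
class name `AmExpiryModel`, the method `expiresBeforeFirstHeartbeat`, and the
per-phase durations are all hypothetical and chosen only for illustration. The
model assumes the expiry clock starts at container allocation (step 2) and that
the first heartbeat arrives only after steps 3) to 5) have finished.

```java
import java.time.Duration;

/** Simplified model of the AM expiry check; not YARN's real implementation. */
public final class AmExpiryModel {

    private AmExpiryModel() {}

    /**
     * Returns true if the time spent between container allocation (step 2)
     * and the first heartbeat (end of step 5) exceeds the expiry interval.
     */
    public static boolean expiresBeforeFirstHeartbeat(
            Duration expiryInterval, Duration... startupPhases) {
        Duration elapsed = Duration.ZERO;
        for (Duration phase : startupPhases) {
            elapsed = elapsed.plus(phase);
        }
        return elapsed.compareTo(expiryInterval) > 0;
    }

    public static void main(String[] args) {
        // Hypothetical durations on a slow CI machine (assumed, not measured).
        Duration localization = Duration.ofSeconds(12); // step 3: download from HDFS
        Duration jvmStartup = Duration.ofSeconds(6);    // step 4: entrypoint script + JVM
        Duration clientSetup = Duration.ofSeconds(5);   // step 5: RM client construction

        // 23s of startup work against the lowered 20s interval.
        System.out.println(expiresBeforeFirstHeartbeat(
                Duration.ofSeconds(20), localization, jvmStartup, clientSetup));

        // The same startup work against a 5 minute interval.
        System.out.println(expiresBeforeFirstHeartbeat(
                Duration.ofMinutes(5), localization, jvmStartup, clientSetup));
    }
}
```

With these assumed numbers, the first call reports an expiry and the second
does not. The model ignores how often the real monitor polls, so it only
illustrates why a short interval is sensitive to slow startup.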
Anyway, I don't think there is a need for this timeout anymore, because if we
didn't start the RM Client, tests would fail anyway, and with this timeout we
unnecessarily stress YARN's internal implementation.
As for potentially long-running tests, this should already be covered by the
timeouts in the Azure pipelines.
As a fix, I propose removing this override and using the 5m default instead.
IMO this option is really meant only as a safeguard for when the AM gets into
an "unhandled" state, e.g. network failures or being unable to shut down.
> YARNFileReplicationITCase fails with "The YARN application unexpectedly
> switched to state FAILED during deployment"
> -------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-22819
> URL: https://issues.apache.org/jira/browse/FLINK-22819
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Affects Versions: 1.13.1
> Reporter: Dawid Wysakowicz
> Assignee: David Morávek
> Priority: Major
> Labels: stale-major, test-stability
> Fix For: 1.14.0
>
> Attachments:
> FLINK-22819-YARNFileReplicationITCase-testPerJobModeWithDefaultFileReplication.log
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=18467&view=logs&j=8fd975ef-f478-511d-4997-6f15fe8a1fd3&t=ac0fa443-5d45-5a6b-3597-0310ecc1d2ab&l=32007
> {code}
> May 31 23:14:22
> org.apache.flink.client.deployment.ClusterDeploymentException: Could not
> deploy Yarn job cluster.
> May 31 23:14:22 at
> org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(YarnClusterDescriptor.java:481)
> May 31 23:14:22 at
> org.apache.flink.yarn.YARNFileReplicationITCase.deployPerJob(YARNFileReplicationITCase.java:106)
> May 31 23:14:22 at
> org.apache.flink.yarn.YARNFileReplicationITCase.lambda$testPerJobModeWithDefaultFileReplication$1(YARNFileReplicationITCase.java:78)
> May 31 23:14:22 at
> org.apache.flink.yarn.YarnTestBase.runTest(YarnTestBase.java:287)
> May 31 23:14:22 at
> org.apache.flink.yarn.YARNFileReplicationITCase.testPerJobModeWithDefaultFileReplication(YARNFileReplicationITCase.java:78)
> May 31 23:14:22 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> May 31 23:14:22 at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> May 31 23:14:22 at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> May 31 23:14:22 at java.lang.reflect.Method.invoke(Method.java:498)
> May 31 23:14:22 at
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> May 31 23:14:22 at
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> May 31 23:14:22 at
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> May 31 23:14:22 at
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> May 31 23:14:22 at
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> May 31 23:14:22 at
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> May 31 23:14:22 at
> org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> May 31 23:14:22 at
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> May 31 23:14:22 at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> May 31 23:14:22 at
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> May 31 23:14:22 at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> May 31 23:14:22 at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> May 31 23:14:22 at
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> May 31 23:14:22 at
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> May 31 23:14:22 at
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> May 31 23:14:22 at
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> May 31 23:14:22 at
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> May 31 23:14:22 at
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> May 31 23:14:22 at
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> May 31 23:14:22 at
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> May 31 23:14:22 at
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> May 31 23:14:22 at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> May 31 23:14:22 at
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> May 31 23:14:22 at
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> May 31 23:14:22 at
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> May 31 23:14:22 at
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> May 31 23:14:22 at
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
> May 31 23:14:22 at
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> May 31 23:14:22 at
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> May 31 23:14:22 at
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> May 31 23:14:22 at
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> May 31 23:14:22 Caused by:
> org.apache.flink.yarn.YarnClusterDescriptor$YarnDeploymentException: The YARN
> application unexpectedly switched to state FAILED during deployment.
> May 31 23:14:22 Diagnostics from YARN: Application
> application_1622502732791_0001 failed 1 times (global limit =2; local limit
> is =1) due to ApplicationMaster for attempt
> appattempt_1622502732791_0001_000001 timed out. Failing the application.
> May 31 23:14:22 If log aggregation is enabled on your cluster, use this
> command to further investigate the issue:
> May 31 23:14:22 yarn logs -applicationId application_1622502732791_0001
> May 31 23:14:22 at
> org.apache.flink.yarn.YarnClusterDescriptor.startAppMaster(YarnClusterDescriptor.java:1201)
> May 31 23:14:22 at
> org.apache.flink.yarn.YarnClusterDescriptor.deployInternal(YarnClusterDescriptor.java:593)
> May 31 23:14:22 at
> org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(YarnClusterDescriptor.java:474)
> May 31 23:14:22 ... 39 more
> {code}