[ 
https://issues.apache.org/jira/browse/FLINK-22819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17356280#comment-17356280
 ] 

Matthias edited comment on FLINK-22819 at 6/3/21, 8:20 AM:
-----------------------------------------------------------

I couldn't find anything specific during my initial investigation (I attached 
the test's logs). We don't get any additional YARN logs because the failure 
happens during application deployment. As the error messages state, the 
deployment times out.
{code}
23:13:00,816 [ContainersLauncher #0] INFO  
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor [] - 
launchContainer: [bash, 
/__w/3/s/flink-yarn-tests/target/flink-yarn-tests-per-job/flink-yarn-tests-per-job-localDir-nm-0_0/usercache/agent0
23:13:12,994 [        Ping Checker] INFO  
org.apache.hadoop.yarn.util.AbstractLivelinessMonitor        [] - 
Expired:appattempt_1622502732791_0001_000001 Timed out after 20 secs
23:13:12,996 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl [] 
- Updating application attempt appattempt_1622502732791_0001_000001 with final 
state: FAILED, and exit status: -1000
{code}

Comparing it to a successful test run of the same build, it appears that some 
time (about 5 secs here) passes between the container launch and the 
successful authentication:
{code}
23:15:52,955 [ContainersLauncher #0] INFO  
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor [] - 
launchContainer: [bash, 
/__w/3/s/flink-yarn-tests/target/flink-yarn-tests-capacityscheduler/flink-yarn-tests-capacityscheduler-localDir-nm-
1_0/usercache/agent03_azpcontainer/appcache/application_1622502943279_0001/container_1622502943279_0001_01_000001/default_container_executor.sh]
23:15:57,954 [Socket Reader #1 for port 44617] INFO  
SecurityLogger.org.apache.hadoop.ipc.Server                  [] - Auth 
successful for appattempt_1622502943279_0001_000001 (auth:SIMPLE)
23:15:57,977 [IPC Server handler 0 on 44617] INFO  
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService [] - AM 
registration appattempt_1622502943279_0001_000001
{code}
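
As a rough cross-check of the gaps in the two excerpts above, here is a minimal 
Python sketch. It is not part of the Flink or YARN tooling; `TS_FORMAT` and 
`gap_seconds` are names made up for illustration. It assumes both timestamps 
fall on the same day and uses only the timestamps quoted above.

```python
from datetime import datetime

# Time-of-day format used in the log excerpts (HH:MM:SS,mmm).
# Assumption: both timestamps belong to the same day, so no date is needed.
TS_FORMAT = "%H:%M:%S,%f"


def gap_seconds(start: str, end: str) -> float:
    """Return the seconds elapsed between two same-day log timestamps."""
    t0 = datetime.strptime(start, TS_FORMAT)
    t1 = datetime.strptime(end, TS_FORMAT)
    return (t1 - t0).total_seconds()


# Failed run: launchContainer -> "Expired: ... Timed out after 20 secs"
failed_gap = gap_seconds("23:13:00,816", "23:13:12,994")

# Successful run: launchContainer -> "Auth successful"
auth_gap = gap_seconds("23:15:52,955", "23:15:57,954")

if __name__ == "__main__":
    print(f"failed run, launch to expiry: {failed_gap:.3f} s")
    print(f"successful run, launch to auth: {auth_gap:.3f} s")
```

In the failed run the attempt expires roughly 12.2 secs after launchContainer, 
while the successful run authenticates roughly 5 secs after launchContainer.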


> YARNFileReplicationITCase fails with "The YARN application unexpectedly 
> switched to state FAILED during deployment"
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-22819
>                 URL: https://issues.apache.org/jira/browse/FLINK-22819
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.13.1
>            Reporter: Dawid Wysakowicz
>            Assignee: Matthias
>            Priority: Major
>              Labels: test-stability
>             Fix For: 1.14.0
>
>         Attachments: 
> FLINK-22819-YARNFileReplicationITCase-testPerJobModeWithDefaultFileReplication.log
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=18467&view=logs&j=8fd975ef-f478-511d-4997-6f15fe8a1fd3&t=ac0fa443-5d45-5a6b-3597-0310ecc1d2ab&l=32007
> {code}
> May 31 23:14:22 
> org.apache.flink.client.deployment.ClusterDeploymentException: Could not 
> deploy Yarn job cluster.
> May 31 23:14:22       at 
> org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(YarnClusterDescriptor.java:481)
> May 31 23:14:22       at 
> org.apache.flink.yarn.YARNFileReplicationITCase.deployPerJob(YARNFileReplicationITCase.java:106)
> May 31 23:14:22       at 
> org.apache.flink.yarn.YARNFileReplicationITCase.lambda$testPerJobModeWithDefaultFileReplication$1(YARNFileReplicationITCase.java:78)
> May 31 23:14:22       at 
> org.apache.flink.yarn.YarnTestBase.runTest(YarnTestBase.java:287)
> May 31 23:14:22       at 
> org.apache.flink.yarn.YARNFileReplicationITCase.testPerJobModeWithDefaultFileReplication(YARNFileReplicationITCase.java:78)
> May 31 23:14:22       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> May 31 23:14:22       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> May 31 23:14:22       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> May 31 23:14:22       at java.lang.reflect.Method.invoke(Method.java:498)
> May 31 23:14:22       at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> May 31 23:14:22       at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> May 31 23:14:22       at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> May 31 23:14:22       at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> May 31 23:14:22       at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> May 31 23:14:22       at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> May 31 23:14:22       at 
> org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> May 31 23:14:22       at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> May 31 23:14:22       at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> May 31 23:14:22       at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> May 31 23:14:22       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> May 31 23:14:22       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> May 31 23:14:22       at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> May 31 23:14:22       at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> May 31 23:14:22       at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> May 31 23:14:22       at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> May 31 23:14:22       at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> May 31 23:14:22       at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> May 31 23:14:22       at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> May 31 23:14:22       at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> May 31 23:14:22       at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> May 31 23:14:22       at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> May 31 23:14:22       at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> May 31 23:14:22       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> May 31 23:14:22       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> May 31 23:14:22       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> May 31 23:14:22       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
> May 31 23:14:22       at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> May 31 23:14:22       at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> May 31 23:14:22       at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> May 31 23:14:22       at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> May 31 23:14:22 Caused by: 
> org.apache.flink.yarn.YarnClusterDescriptor$YarnDeploymentException: The YARN 
> application unexpectedly switched to state FAILED during deployment. 
> May 31 23:14:22 Diagnostics from YARN: Application 
> application_1622502732791_0001 failed 1 times (global limit =2; local limit 
> is =1) due to ApplicationMaster for attempt 
> appattempt_1622502732791_0001_000001 timed out. Failing the application.
> May 31 23:14:22 If log aggregation is enabled on your cluster, use this 
> command to further investigate the issue:
> May 31 23:14:22 yarn logs -applicationId application_1622502732791_0001
> May 31 23:14:22       at 
> org.apache.flink.yarn.YarnClusterDescriptor.startAppMaster(YarnClusterDescriptor.java:1201)
> May 31 23:14:22       at 
> org.apache.flink.yarn.YarnClusterDescriptor.deployInternal(YarnClusterDescriptor.java:593)
> May 31 23:14:22       at 
> org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(YarnClusterDescriptor.java:474)
> May 31 23:14:22       ... 39 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
