[
https://issues.apache.org/jira/browse/FLINK-21148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309649#comment-17309649
]
Matthias commented on FLINK-21148:
----------------------------------
I looked over the issue with [~rmetzger]. The actual reason seems to be that
the YARN containers get [killed at the end of the
test|https://github.com/apache/flink/blob/5e08e55caede0c81100d7032257133854de1155c/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java#L192].
There's a race condition between stopping the TaskManager and stopping the
JobManager. If the JM is stopped first, there is a risk that the TM is trying
to access the JM's BLOB server at that moment. It loses the connection and
reports the connection problem. The exception ends up in the output of the
TaskManager and will trigger the test failure.
The following logs showcase this based on the build reported in the Jira
issues' description (application folder:
{{./container_1611618440792_0002_01_000001/}}).
{code}
[...]
23:48:07,987 [ Time-limited test] INFO
org.apache.flink.yarn.YARNSessionFIFOITCase [] - Two
containers are running. Killing the application
23:48:07,988 [ Time-limited test] INFO org.apache.hadoop.yarn.client.RMProxy
[] - Connecting to ResourceManager at
29c91476178c/172.21.0.2:37502
23:48:07,991 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] -
application_1611618440792_0002 State change from RUNNING to KILLING
23:48:07,991 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl []
- Updating application attempt appattempt_1611618440792_0002_000001 with final
state: KILLED
23:48:07,991 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl []
- appattempt_1611618440792_0002_000001 State change from RUNNING to FINAL_SAVING
23:48:07,991 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService [] -
Unregistering app attempt : appattempt_1611618440792_0002_000001
23:48:07,992 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl []
- appattempt_1611618440792_0002_000001 State change from FINAL_SAVING to KILLED
23:48:07,992 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - Updating
application application_1611618440792_0002 with final state: KILLED
23:48:07,992 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] -
application_1611618440792_0002 State change from KILLING to FINAL_SAVING
23:48:07,992 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore [] -
Storing info for app: application_1611618440792_0002
23:48:07,992 [ResourceManager Event Processor] INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl [] -
container_1611618440792_0002_01_000001 Container Transitioned from RUNNING to
KILLED
23:48:07,992 [ResourceManager Event Processor] INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp
[] - Completed container: container_1611618440792_0002_01_000001 in state:
KILLED event:KILL
23:48:07,992 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] -
application_1611618440792_0002 State change from FINAL_SAVING to KILLED
23:48:07,992 [ResourceManager Event Processor] INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger [] - USER=hadoop
OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS
APPID=application_1611618440792_0002
CONTAINERID=container_1611618440792_0002_01_000001
23:48:07,992 [ResourceManager Event Processor] INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode
[] - Released container container_1611618440792_0002_01_000001 of capacity
<memory:1024, vCores:1> on host 29c91476178c:36323, which currently has 1
containers, <memory:1024, vCores:1> used and <memory:3072, vCores:665> avail
able, release resources=true
23:48:07,992 [ResourceManager Event Processor] INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler [] -
Application attempt appattempt_1611618440792_0002_000001 released container
container_1611618440792_0002_01_000001 on node: host: 29c91476178c:36323
#containers=1 available=3072 used=1024 with event: KILL
23:48:07,992 [ResourceManager Event Processor] INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl [] -
container_1611618440792_0002_01_000002 Container Transitioned from RUNNING to
KILLED
23:48:07,992 [ResourceManager Event Processor] INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp
[] - Completed container: container_1611618440792_0002_01_000002 in state:
KILLED event:KILL
23:48:07,992 [ResourceManager Event Processor] INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger [] - USER=hadoop
OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS
APPID=application_1611618440792_0002
CONTAINERID=container_1611618440792_0002_01_000002
23:48:07,992 [ResourceManager Event Processor] INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode
[] - Released container container_1611618440792_0002_01_000002 of capacity
<memory:1024, vCores:1> on host 29c91476178c:36323, which currently has 0
containers, <memory:0, vCores:0> used and <memory:4096, vCores:666> availabl
e, release resources=true
23:48:07,992 [ResourceManager Event Processor] INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler [] -
Application attempt appattempt_1611618440792_0002_000001 released container
container_1611618440792_0002_01_000002 on node: host: 29c91476178c:36323
#containers=0 available=4096 used=0 with event: KILL
23:48:07,993 [ResourceManager Event Processor] INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo [] -
Application application_1611618440792_0002 requests cleared
23:48:07,993 [ pool-3-thread-4] INFO
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher [] -
Cleaning master appattempt_1611618440792_0002_000001
23:48:07,993 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger [] - USER=hadoop
OPERATION=Application Finished - Killed TARGET=RMAppManager
RESULT=SUCCESS APPID=application_1611618440792_0002
23:48:07,993 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary
[] -
appId=application_1611618440792_0002,name=MyCustomName,user=hadoop,queue=default,state=KILLED,trackingUrl=http://29c91476178c:46794/cluster/app/application_1611618440792_0002,appMasterHost=N/A,startTime=1611618467077,finishTime=16
11618487992,finalStatus=KILLED
23:48:07,996 [Socket Reader #1 for port 36323] INFO
SecurityLogger.org.apache.hadoop.ipc.Server [] - Auth
successful for appattempt_1611618440792_0002_000001 (auth:SIMPLE)
23:48:07,998 [Socket Reader #1 for port 36323] INFO
SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager
[] - Authorization successful for appattempt_1611618440792_0002_000001
(auth:TOKEN) for protocol=interface
org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
23:48:08,000 [IPC Server handler 11 on 36323] INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
[] - Stopping container with container Id:
container_1611618440792_0002_01_000001
23:48:08,000 [IPC Server handler 11 on 36323] INFO
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger [] - USER=hadoop
IP=172.21.0.2 OPERATION=Stop Container Request
TARGET=ContainerManageImpl RESULT=SUCCESS
APPID=application_1611618440792_0002
CONTAINERID=container_1611618440792_0002_01_000001
23:48:08,000 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container
[] - Container container_1611618440792_0002_01_000001 transitioned from RUNNING
to KILLING
23:48:08,000 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch
[] - Cleaning up container container_1611618440792_0002_01_000001
23:48:08,008 [ContainersLauncher #0] WARN
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor [] - Exit
code from container container_1611618440792_0002_01_000001 is : 143
23:48:08,023 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container
[] - Container container_1611618440792_0002_01_000002 transitioned from RUNNING
to KILLING
23:48:08,023 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application
[] - Application application_1611618440792_0002 transitioned from RUNNING to
FINISHING_CONTAINERS_WAIT
23:48:08,023 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container
[] - Container container_1611618440792_0002_01_000001 transitioned from KILLING
to CONTAINER_CLEANEDUP_AFTER_KILL
23:48:08,023 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch
[] - Cleaning up container container_1611618440792_0002_01_000002
23:48:08,029 [ContainersLauncher #1] WARN
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor [] - Exit
code from container container_1611618440792_0002_01_000002 is : 143
23:48:08,043 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container
[] - Container container_1611618440792_0002_01_000002 transitioned from KILLING
to CONTAINER_CLEANEDUP_AFTER_KILL
23:48:08,043 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger [] - USER=hadoop
OPERATION=Container Finished - Killed TARGET=ContainerImpl
RESULT=SUCCESS APPID=application_1611618440792_0002
CONTAINERID=container_1611618440792_0002_01_000001
23:48:08,043 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container
[] - Container container_1611618440792_0002_01_000001 transitioned from
CONTAINER_CLEANEDUP_AFTER_KILL to DONE
23:48:08,043 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application
[] - Removing container_1611618440792_0002_01_000001 from application
application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
[] - Neither virutal-memory nor physical-memory monitoring is needed. Not
running the monitor-thread
23:48:08,044 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices [] - Got
event CONTAINER_STOP for appId application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger [] - USER=hadoop
OPERATION=Container Finished - Killed TARGET=ContainerImpl
RESULT=SUCCESS APPID=application_1611618440792_0002
CONTAINERID=container_1611618440792_0002_01_000002
23:48:08,044 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container
[] - Container container_1611618440792_0002_01_000002 transitioned from
CONTAINER_CLEANEDUP_AFTER_KILL to DONE
23:48:08,044 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application
[] - Removing container_1611618440792_0002_01_000002 from application
application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application
[] - Application application_1611618440792_0002 transitioned from
FINISHING_CONTAINERS_WAIT to APPLICATION_RESOURCES_CLEANINGUP
23:48:08,044 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
[] - Neither virutal-memory nor physical-memory monitoring is needed. Not
running the monitor-thread
23:48:08,044 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices [] - Got
event CONTAINER_STOP for appId application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices [] - Got
event APPLICATION_STOP for appId application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application
[] - Application application_1611618440792_0002 transitioned from
APPLICATION_RESOURCES_CLEANINGUP to FINISHED
23:48:08,044 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler
[] - Scheduling Log Deletion for application: application_1611618440792_0002,
with delay of 10800 seconds
23:48:08,192 [IPC Server handler 35 on 37502] INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger [] - USER=hadoop
IP=172.21.0.2 OPERATION=Kill Application Request
TARGET=ClientRMService RESULT=SUCCESS APPID=application_1611618440792_0002
23:48:08,193 [ Time-limited test] INFO
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl [] - Killed
application application_1611618440792_0002
[...]
{code}
> YARNSessionFIFOSecuredITCase cannot connect to BlobServer
> ---------------------------------------------------------
>
> Key: FLINK-21148
> URL: https://issues.apache.org/jira/browse/FLINK-21148
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN, Tests
> Affects Versions: 1.11.3, 1.13.0
> Reporter: Dawid Wysakowicz
> Assignee: Matthias
> Priority: Major
> Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=12483&view=logs&j=f450c1a5-64b1-5955-e215-49cb1ad5ec88&t=ea63c80c-957f-50d1-8f67-3671c14686b9
> {code}
> java.io.IOException: Could not connect to BlobServer at address
> 29c91476178c/172.21.0.2:44412
> java.io.IOException: Could not connect to BlobServer at address
> 29c91476178c/172.21.0.2:44412
> at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:102)
> ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> at
> org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:137)
> [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> at
> org.apache.flink.yarn.YarnTestBase.ensureNoProhibitedStringInLogFiles(YarnTestBase.java:538)
> at
> org.apache.flink.yarn.YARNSessionFIFOITCase.checkForProhibitedLogContents(YARNSessionFIFOITCase.java:84)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> at
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> at
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
> at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> at
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> at
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> at
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> at
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> at
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> at
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
> at
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> at
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> at
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> at
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)