Danil Serdyuchenko created SAMZA-1746:
-----------------------------------------

             Summary: Coordinator stuck connecting to a decommissioned NM 
                 Key: SAMZA-1746
                 URL: https://issues.apache.org/jira/browse/SAMZA-1746
             Project: Samza
          Issue Type: Bug
    Affects Versions: 0.12.0
            Reporter: Danil Serdyuchenko


We hit an issue while decommissioning a large number of NMs:
 # We decommissioned a large number of NMs.
 # Containers were allocated by the RM to a new NM.
 # That NM was decommissioned immediately after the RM allocation, before the coordinator could start the containers (see the sketch after this list).
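
For context, a minimal sketch (not Samza's actual code; the {{LostNodeTracker}} class below is made up for illustration) of how an AM could use the RM's node-update callbacks ({{AMRMClientAsync.CallbackHandler#onNodesUpdated}}) to notice that an allocated container's node has been lost or decommissioned before handing it to the launcher:
{code:java}
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.NodeId;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;

/**
 * Illustrative only: records nodes reported as LOST/DECOMMISSIONED via
 * AMRMClientAsync.CallbackHandler#onNodesUpdated(List<NodeReport>), so that a
 * pending launch on such a node can be dropped and the container re-requested
 * instead of blocking on an unreachable NM.
 */
public class LostNodeTracker {
  private final Map<NodeId, NodeState> deadNodes = new ConcurrentHashMap<>();

  /** Call from the AMRMClientAsync callback handler's onNodesUpdated(). */
  public void onNodesUpdated(List<NodeReport> updatedNodes) {
    for (NodeReport report : updatedNodes) {
      NodeState state = report.getNodeState();
      if (state == NodeState.LOST || state == NodeState.DECOMMISSIONED) {
        deadNodes.put(report.getNodeId(), state);
      }
    }
  }

  /** Check before handing an allocated container to the launcher. */
  public boolean isLaunchable(Container container) {
    return !deadNodes.containsKey(container.getNodeId());
  }
}
{code}
This only shows where such a check could sit; Samza's ContainerProcessManager / YarnClusterResourceManager would need to plumb it through their own allocation path.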

Relevant logs:
{noformat}
2018-06-08 10:28:12 YarnClusterResourceManager [INFO] Container allocated from 
RM on x.x.x.x
2018-06-08 10:28:12 ContainerProcessManager [INFO] Container allocated from RM 
on x.x.x.x
2018-06-08 10:28:12 ResourceRequestState [INFO] Host affinity not enabled. 
Saving the samzaResource container_1525695824467_0091_06_000106 in the buffer 
for ANY_HOST
2018-06-08 10:28:15 AbstractContainerAllocator [INFO] Found available resources 
on ANY_HOST. Assigning request for container_id 0 with timestamp 1528453690917 
to resource container_1525695824467_0091_06_000106
2018-06-08 10:28:15 YarnClusterResourceManager [INFO] Received launch request 
for 0 on hostname x.x.x.x
2018-06-08 10:28:15 YarnContainerRunner [INFO] Got available container ID (0) 
for container: Container: [ContainerId: container_1525695824467_0091_06_000106, 
NodeId: x.x.x.x:45400, NodeHttpAddress: x.x.x.x:8042, Resource: <memory:1024, 
vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 
x.x.x.x:45400 }, ]
2018-06-08 10:28:15 YarnContainerRunner [INFO] In runContainer in util: 
fwkPath= ;cmdPath=./__package/;jobLib=
2018-06-08 10:28:15 YarnContainerRunner [INFO] Container ID 0 using command 
./__package//bin/run-container.sh
2018-06-08 10:28:15 YarnContainerRunner [INFO] Container ID 0 using environment 
variables: 
SAMZA_CONTAINER_ID=0
EXECUTION_ENV_CONTAINER_ID=container_1525695824467_0091_06_000106
SAMZA_COORDINATOR_URL=http://ip-a-a-a-a:33907/
JAVA_HOME=/opt/jvm/jdk1.8.0_65
JAVA_OPTS=
2018-06-08 10:28:15 YarnContainerRunner [INFO] Samza FWK path: 
./__package//bin/run-container.sh; env={SAMZA_CONTAINER_ID=0, 
EXECUTION_ENV_CONTAINER_ID=container_1525695824467_0091_06_000106, 
SAMZA_COORDINATOR_URL=http://ip-a-a-a-a:33907/, JAVA_HOME=/opt/jvm/jdk1.8.0_65, 
JAVA_OPTS=}
2018-06-08 10:28:15 YarnContainerRunner [INFO] Starting container ID 0 using 
package path http://package.tar.gz
2018-06-08 10:28:15 YarnContainerRunner [INFO] starting container 
http://package.tar.gz Container: [ContainerId: 
container_1525695824467_0091_06_000106, NodeId: x.x.x.x:45400, NodeHttpAddress: 
x.x.x.x:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { 
kind: ContainerToken, service: x.x.x.x:45400 }, ] {SAMZA_CONTAINER_ID=0, 
EXECUTION_ENV_CONTAINER_ID=container_1525695824467_0091_06_000106, 
SAMZA_COORDINATOR_URL=http://ip-a-a-a-a:33907/, JAVA_HOME=/opt/jvm/jdk1.8.0_65, 
JAVA_OPTS=} export SAMZA_LOG_DIR=<LOG_DIR> && ln -sfn <LOG_DIR> logs && exec 
./__package//bin/run-container.sh 1>logs/stdout 2>logs/stderr
2018-06-08 10:28:15 YarnContainerRunner [INFO] set package Resource in 
YarnContainerRunner for scheme: "http" host: "s3-eu-west-1.amazonaws.com" port: 
-1 file: "/package.tar.gz"
2018-06-08 10:28:15 ContainerManagementProtocolProxy [INFO] Opening proxy : 
x.x.x.x:45400
2018-06-08 10:28:35 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 0 time(s); 
maxRetries=45
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from 
RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000016, 
State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus: 
-100, ]
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had 
ID: 9
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 9 
from completed containers
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from 
RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000010, 
State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus: 
-100, ]
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had 
ID: 3
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 3 
from completed containers
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from 
RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000002, 
State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus: 
-100, ]
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had 
ID: 11
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 11 
from completed containers
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from 
RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000102, 
State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus: 
-100, ]
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had 
ID: 8
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 8 
from completed containers
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from 
RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000005, 
State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus: 
-100, ]
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had 
ID: 14
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 14 
from completed containers
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from 
RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000011, 
State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus: 
-100, ]
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had 
ID: 4
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 4 
from completed containers
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from 
RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000008, 
State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus: 
-100, ]
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had 
ID: 1
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 1 
from completed containers
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from 
RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000013, 
State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus: 
-100, ]
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had 
ID: 6
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 6 
from completed containers
2018-06-08 10:28:51 ContainerProcessManager [INFO] Matching container ID found 
9 org.apache.samza.clustermanager.SamzaResource@3c12182f
2018-06-08 10:28:51 ContainerProcessManager [INFO] Got an exit code of -100. 
This means that container container_1525695824467_0091_06_000016 was killed by 
YARN, either due to being released by the application master or being 'lost' 
due to node failures etc. or due to preemption by the RM
2018-06-08 10:28:51 ContainerProcessManager [INFO] Released container 
container_1525695824467_0091_06_000016 was assigned task group ID 9. Requesting 
a new container for the task group.
2018-06-08 10:28:51 SamzaResourceRequest [INFO] Resource Request created for 9 
on ANY_HOST at 1528453731124
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Requesting resources on 
ANY_HOST for container 9
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Making a request for 
ANY_HOST ANY_HOST
2018-06-08 10:28:55 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 1 time(s); 
maxRetries=45
2018-06-08 10:29:06 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 0 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:10 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 1 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:14 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 2 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:17 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 3 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:20 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 4 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:23 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 5 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:26 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 6 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:29 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 7 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:32 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 8 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:35 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 9 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:51 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 0 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:54 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 1 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:57 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 2 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:00 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 3 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:03 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 4 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:06 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 5 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:09 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 6 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:12 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 7 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:15 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 8 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:18 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 9 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:34 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 0 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:38 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 1 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:41 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 2 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:44 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 3 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:47 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 4 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:50 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 5 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:53 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 6 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:56 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 7 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:59 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 8 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)
2018-06-08 10:31:02 Client [INFO] Retrying connect to server: 
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 9 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS){noformat}
So the container gets allocated on x.x.x.x, but by the time the coordinator tries to connect to that NM it is no longer reachable, and the coordinator gets stuck in the retry loop indefinitely. As a result, no new containers get scheduled for the job.
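
The retry messages in the log appear to come from two layers of the YARN client: the {{RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)}} policy matches the defaults of {{ipc.client.connect.max.retries}} / {{ipc.client.connect.retry.interval}}, the {{maxRetries=45}} line matches the default of {{ipc.client.connect.max.retries.on.timeouts}}, and the loop restarting from "tried 0 time(s)" looks like the outer NM-connect retry governed by {{yarn.client.nodemanager-connect.max-wait-ms}}. As a stopgap, lowering these on the AM should at least bound how long a single launch attempt can block. A minimal sketch (the class name and values are illustrative, not actual Samza configuration, and whether the AM honours them depends on how its client configuration is assembled):
{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

/**
 * Hedged sketch only: the keys are standard Hadoop/YARN client properties, but
 * the values are arbitrary examples and the class is hypothetical.
 */
public class NmConnectRetryTuning {
  public static YarnConfiguration boundedNmConnectConf() {
    YarnConfiguration conf = new YarnConfiguration();

    // Hadoop IPC client: per-connection retries (the
    // RetryUpToMaximumCountWithFixedSleep policy in the log) and retries on
    // socket timeouts (the "maxRetries=45" line).
    conf.setInt("ipc.client.connect.max.retries", 3);
    conf.setInt("ipc.client.connect.retry.interval", 1000);
    conf.setInt("ipc.client.connect.max.retries.on.timeouts", 5);

    // Outer NodeManager-connect retry loop used when the AM opens the
    // ContainerManagementProtocol proxy; bounding the total wait keeps a lost
    // NM from holding a launch attempt for many minutes.
    conf.setLong("yarn.client.nodemanager-connect.max-wait-ms", 60_000L);
    conf.setLong("yarn.client.nodemanager-connect.retry-interval-ms", 10_000L);
    return conf;
  }
}
{code}
Even with tighter retry settings the launch still blocks the thread that drives container allocation, so a proper fix would likely need the coordinator to time out the launch or perform it off the dispatch loop.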


