Danil Serdyuchenko created SAMZA-1746:
-----------------------------------------
Summary: Coordinator stuck connecting to a decommissioned NM
Key: SAMZA-1746
URL: https://issues.apache.org/jira/browse/SAMZA-1746
Project: Samza
Issue Type: Bug
Affects Versions: 0.12.0
Reporter: Danil Serdyuchenko
We hit an issue while decommissioning a large number of NMs:
# We decommissioned a large number of NMs.
# The RM allocated containers to a new NM.
# That NM was decommissioned immediately after the RM allocation, before the
coordinator could start the containers.
Relevant logs:
{noformat}
2018-06-08 10:28:12 YarnClusterResourceManager [INFO] Container allocated from
RM on x.x.x.x
2018-06-08 10:28:12 ContainerProcessManager [INFO] Container allocated from RM
on x.x.x.x
2018-06-08 10:28:12 ResourceRequestState [INFO] Host affinity not enabled.
Saving the samzaResource container_1525695824467_0091_06_000106 in the buffer
for ANY_HOST
2018-06-08 10:28:15 AbstractContainerAllocator [INFO] Found available resources
on ANY_HOST. Assigning request for container_id 0 with timestamp 1528453690917
to resource container_1525695824467_0091_06_000106
2018-06-08 10:28:15 YarnClusterResourceManager [INFO] Received launch request
for 0 on hostname x.x.x.x
2018-06-08 10:28:15 YarnContainerRunner [INFO] Got available container ID (0)
for container: Container: [ContainerId: container_1525695824467_0091_06_000106,
NodeId: x.x.x.x:45400, NodeHttpAddress: x.x.x.x:8042, Resource: <memory:1024,
vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service:
x.x.x.x:45400 }, ]
2018-06-08 10:28:15 YarnContainerRunner [INFO] In runContainer in util:
fwkPath= ;cmdPath=./__package/;jobLib=
2018-06-08 10:28:15 YarnContainerRunner [INFO] Container ID 0 using command
./__package//bin/run-container.sh
2018-06-08 10:28:15 YarnContainerRunner [INFO] Container ID 0 using environment
variables:
SAMZA_CONTAINER_ID=0
EXECUTION_ENV_CONTAINER_ID=container_1525695824467_0091_06_000106
SAMZA_COORDINATOR_URL=http://ip-a-a-a-a:33907/
JAVA_HOME=/opt/jvm/jdk1.8.0_65
JAVA_OPTS=
2018-06-08 10:28:15 YarnContainerRunner [INFO] Samza FWK path:
./__package//bin/run-container.sh; env={SAMZA_CONTAINER_ID=0,
EXECUTION_ENV_CONTAINER_ID=container_1525695824467_0091_06_000106,
SAMZA_COORDINATOR_URL=http://ip-a-a-a-a:33907/, JAVA_HOME=/opt/jvm/jdk1.8.0_65,
JAVA_OPTS=}
2018-06-08 10:28:15 YarnContainerRunner [INFO] Starting container ID 0 using
package path http://package.tar.gz
2018-06-08 10:28:15 YarnContainerRunner [INFO] starting container
http://package.tar.gz Container: [ContainerId:
container_1525695824467_0091_06_000106, NodeId: x.x.x.x:45400, NodeHttpAddress:
x.x.x.x:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token {
kind: ContainerToken, service: x.x.x.x:45400 }, ] {SAMZA_CONTAINER_ID=0,
EXECUTION_ENV_CONTAINER_ID=container_1525695824467_0091_06_000106,
SAMZA_COORDINATOR_URL=http://ip-a-a-a-a:33907/, JAVA_HOME=/opt/jvm/jdk1.8.0_65,
JAVA_OPTS=} export SAMZA_LOG_DIR=<LOG_DIR> && ln -sfn <LOG_DIR> logs && exec
./__package//bin/run-container.sh 1>logs/stdout 2>logs/stderr
2018-06-08 10:28:15 YarnContainerRunner [INFO] set package Resource in
YarnContainerRunner for scheme: "http" host: "s3-eu-west-1.amazonaws.com" port:
-1 file: "/package.tar.gz"
2018-06-08 10:28:15 ContainerManagementProtocolProxy [INFO] Opening proxy :
x.x.x.x:45400
2018-06-08 10:28:35 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 0 time(s);
maxRetries=45
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from
RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000016,
State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus:
-100, ]
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had
ID: 9
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 9
from completed containers
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from
RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000010,
State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus:
-100, ]
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had
ID: 3
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 3
from completed containers
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from
RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000002,
State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus:
-100, ]
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had
ID: 11
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 11
from completed containers
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from
RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000102,
State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus:
-100, ]
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had
ID: 8
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 8
from completed containers
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from
RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000005,
State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus:
-100, ]
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had
ID: 14
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 14
from completed containers
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from
RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000011,
State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus:
-100, ]
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had
ID: 4
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 4
from completed containers
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from
RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000008,
State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus:
-100, ]
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had
ID: 1
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 1
from completed containers
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from
RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000013,
State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus:
-100, ]
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had
ID: 6
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 6
from completed containers
2018-06-08 10:28:51 ContainerProcessManager [INFO] Matching container ID found
9 org.apache.samza.clustermanager.SamzaResource@3c12182f
2018-06-08 10:28:51 ContainerProcessManager [INFO] Got an exit code of -100.
This means that container container_1525695824467_0091_06_000016 was killed by
YARN, either due to being released by the application master or being 'lost'
due to node failures etc. or due to preemption by the RM
2018-06-08 10:28:51 ContainerProcessManager [INFO] Released container
container_1525695824467_0091_06_000016 was assigned task group ID 9. Requesting
a new container for the task group.
2018-06-08 10:28:51 SamzaResourceRequest [INFO] Resource Request created for 9
on ANY_HOST at 1528453731124
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Requesting resources on
ANY_HOST for container 9
2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Making a request for
ANY_HOST ANY_HOST
2018-06-08 10:28:55 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 1 time(s);
maxRetries=45
2018-06-08 10:29:06 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 0 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:10 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 1 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:14 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 2 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:17 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 3 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:20 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 4 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:23 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 5 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:26 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 6 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:29 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 7 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:32 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 8 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:35 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 9 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:51 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 0 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:54 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 1 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:29:57 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 2 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:00 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 3 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:03 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 4 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:06 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 5 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:09 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 6 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:12 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 7 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:15 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 8 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:18 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 9 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:34 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 0 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:38 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 1 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:41 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 2 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:44 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 3 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:47 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 4 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:50 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 5 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:53 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 6 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:56 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 7 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:30:59 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 8 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-08 10:31:02 Client [INFO] Retrying connect to server:
ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 9 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS){noformat}
So the container is allocated to x.x.x.x, but by the time the coordinator tries
to connect to the NM, the node is no longer reachable, and the coordinator gets
stuck in the connection retry loop indefinitely. As a result, no new containers
get scheduled for that job.
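A possible mitigation, sketched below as a hypothetical example (this is not the actual Samza or YARN code): bound the blocking NM launch call with a deadline so the coordinator can give up on an unreachable NM, release the container, and re-request a new one from the RM instead of retrying forever. Here {{launchContainer}} is just a stand-in for the real blocking {{ContainerManagementProtocolProxy}} call, and the timeout values are illustrative.

```java
import java.util.concurrent.*;

// Hypothetical sketch: run the blocking NM launch on a worker thread and
// bound it with a deadline, so the coordinator is never stuck retrying a
// decommissioned NodeManager indefinitely.
public class BoundedLaunch {

    // Stand-in for the blocking ContainerManagementProtocolProxy call.
    static String launchContainer(long blockMillis) throws InterruptedException {
        Thread.sleep(blockMillis); // simulates connect/retry time to the NM
        return "LAUNCHED";
    }

    // Returns "LAUNCHED" on success, or "TIMED_OUT" so the caller can
    // release the container and request a replacement from the RM.
    static String launchWithTimeout(long blockMillis, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> launch = pool.submit(() -> launchContainer(blockMillis));
        try {
            return launch.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            launch.cancel(true); // interrupt the stuck launch attempt
            return "TIMED_OUT";
        } catch (Exception e) {
            return "FAILED";
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Healthy NM: the launch completes well inside the deadline.
        System.out.println(launchWithTimeout(10, 500));
        // Decommissioned NM: the call would block past the deadline.
        System.out.println(launchWithTimeout(5_000, 200));
    }
}
```

With this shape, a launch that times out is treated like any other failed container: the coordinator releases it and issues a fresh resource request, rather than blocking the launch path on the Hadoop client's retry loop.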
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)