Sarjeet Singh created MYRIAD-128:
------------------------------------
Summary: Issue with Flex down, Pending NMs stuck in staging and
don't get to active task.
Key: MYRIAD-128
URL: https://issues.apache.org/jira/browse/MYRIAD-128
Project: Myriad
Issue Type: Bug
Components: Scheduler
Affects Versions: Myriad 0.1.0
Reporter: Sarjeet Singh
I am seeing an issue when flexing NMs from the Myriad UI. After flexing down
the active NM, the pending NMs don't move to the active state (they are not
shown in 'Active Tasks') and no active NM is shown on the Myriad UI, although
there is an NM running on the node (verified from jps).
mapr 20528 20526 1 17:23 ? 00:00:26
/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.85.x86_64/bin/java -Dproc_nodemanager
-Xmx1000m -Dhadoop.log.dir=/opt/mapr/hadoop/hadoop-2.7.0/logs
-Dyarn.log.dir=/opt/mapr/hadoop/hadoop-2.7.0/logs -Dhadoop.log.file=yarn.log
-Dyarn.log.file=yarn.log -Dyarn.home.dir= -Dyarn.id.str=
-Dhadoop.root.logger=INFO,console -Dyarn.root.logger=INFO,console
-Djava.library.path=/opt/mapr/hadoop/hadoop-2.7.0/lib/native
-Dyarn.policy.file=hadoop-policy.xml -server
-Dnodemanager.resource.io-spindles=4.0
-Dyarn.resourcemanager.hostname=testrm.marathon.mesos
-Dyarn.nodemanager.container-executor.class=org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor
-Dnodemanager.resource.cpu-vcores=0 -Dnodemanager.resource.memory-mb=0
-Dmyriad.yarn.nodemanager.address=0.0.0.0:31000
-Dmyriad.yarn.nodemanager.localizer.address=0.0.0.0:31001
-Dmyriad.yarn.nodemanager.webapp.address=0.0.0.0:31002
-Dmyriad.mapreduce.shuffle.port=0.0.0.0:31003 -Dhadoop.login=maprsasl
-Dhttps.protocols=TLSv1.2
-Djava.security.auth.login.config=/opt/mapr/conf/mapr.login.conf
-Dzookeeper.sasl.clientconfig=Client_simple
-Dzookeeper.saslprovider=com.mapr.security.simplesasl.SimpleSaslProvider
-Dhadoop.log.dir=/opt/mapr/hadoop/hadoop-2.7.0/logs
-Dyarn.log.dir=/opt/mapr/hadoop/hadoop-2.7.0/logs -Dhadoop.log.file=yarn.log
-Dyarn.log.file=yarn.log -Dyarn.home.dir=/opt/mapr/hadoop/hadoop-2.7.0
-Dhadoop.home.dir=/opt/mapr/hadoop/hadoop-2.7.0
-Dhadoop.root.logger=INFO,console -Dyarn.root.logger=INFO,console
-Djava.library.path=/opt/mapr/hadoop/hadoop-2.7.0/lib/native -classpath
/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop:/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop:/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/hdfs:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/hdfs/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/hdfs/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/nm-config/log4j.properties:/opt/mapr/lib/JPam-1.1.jar
org.apache.hadoop.yarn.server.nodemanager.NodeManager
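The command line above carries many -D flags, some of them repeated (for
example -Dyarn.home.dir appears first empty and later with a path). A minimal
Python sketch for pulling such flags out of a process listing follows; the
helper name jvm_system_properties is hypothetical, and the command string is a
shortened excerpt of the listing above, not the full line.

```python
import shlex


def jvm_system_properties(command_line):
    """Return the -Dkey=value flags of a java command line as a dict.

    Flags without '=' map to an empty string. When a key is repeated,
    the later occurrence overwrites the earlier one in the dict.
    """
    props = {}
    for token in shlex.split(command_line):
        if token.startswith("-D"):
            key, _, value = token[2:].partition("=")
            props[key] = value
    return props


# Shortened excerpt of the NodeManager command line quoted above.
cmd = (
    "java -Dproc_nodemanager -Xmx1000m -Dyarn.home.dir= "
    "-Dnodemanager.resource.cpu-vcores=0 -Dnodemanager.resource.memory-mb=0 "
    "-Dmyriad.yarn.nodemanager.address=0.0.0.0:31000 "
    "-Dyarn.home.dir=/opt/mapr/hadoop/hadoop-2.7.0 "
    "org.apache.hadoop.yarn.server.nodemanager.NodeManager"
)

props = jvm_system_properties(cmd)
print("cpu-vcores:", props["nodemanager.resource.cpu-vcores"])
print("memory-mb:", props["nodemanager.resource.memory-mb"])
print("yarn.home.dir:", props["yarn.home.dir"])
```

Reading the flags this way makes it easier to see that the NM still running on
the node was started with zero vcores and zero memory.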
From Myriad UI:
Active Tasks
Killable Tasks
Pending Tasks
Staging Tasks
nm.large.123badb1-57d8-4bd2-aa2e-de9fc1898c7f
nm.medium.f2c4126c-4cb2-46af-a1e0-690034b914b8
nm.medium.a9e9fd84-350a-48bc-bcd2-8712ecdc8c66
nm.medium.663f9c6e-f28e-4395-8540-70c306eb04c5
nm.medium.93f7cc91-9263-48a7-821e-3b0ffbe70e66
This is still the state about 30 minutes after flexing down the NM.
I tried this only on a single-node cluster, but it looks like the problem can
happen in any setup.
I started the RM from Marathon and was able to get the RM and Myriad up and
running. When the RM is launched, a CGS (medium profile) NM is launched along
with it, and it is shown as an 'Active Task' on the Myriad UI. Then I launched
some large profile and zero profile NMs, which are shown under 'Pending Tasks'
since a (CGS default) NM is already running on the single-node cluster.
Then I flexed down the NM from the Myriad UI, which flexed down the active NM,
and all pending NMs started moving to 'Staging Tasks', where they remained
stuck. After waiting more than 30 minutes, I still don't see any active NM
task, and all of the previously pending NM tasks are shown under 'Staging
Tasks' only. (See the screenshot.)
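The "stuck in staging" condition described above could be checked
mechanically. The following is a minimal Python sketch, assuming a state
snapshot shaped like the four lists the UI shows; the key names (activeTasks,
killableTasks, pendingTasks, stagingTasks), the helper names, and the task IDs
are all hypothetical and not taken from a captured Myriad response.

```python
def update_staging_ages(state, first_seen, now):
    """Track when each task was first observed in staging.

    state      -- dict with a 'stagingTasks' list (assumed key name)
    first_seen -- dict mapping task id -> timestamp, updated in place
    now        -- current time in seconds
    """
    staging = set(state.get("stagingTasks", []))
    for task in staging:
        first_seen.setdefault(task, now)
    # Forget tasks that are no longer in staging.
    for task in list(first_seen):
        if task not in staging:
            del first_seen[task]
    return first_seen


def stuck_tasks(first_seen, now, limit_seconds=30 * 60):
    """Return the tasks that have been in staging for at least limit_seconds."""
    return sorted(t for t, seen in first_seen.items() if now - seen >= limit_seconds)


# Two illustrative snapshots taken 30 minutes apart (placeholder task ids).
snapshot_t0 = {
    "activeTasks": [],
    "killableTasks": [],
    "pendingTasks": [],
    "stagingTasks": ["nm.large.example-1", "nm.medium.example-2"],
}
snapshot_t1 = {
    "activeTasks": ["nm.medium.example-2"],
    "killableTasks": [],
    "pendingTasks": [],
    "stagingTasks": ["nm.large.example-1"],
}

first_seen = {}
update_staging_ages(snapshot_t0, first_seen, now=0)
update_staging_ages(snapshot_t1, first_seen, now=1800)
stuck = stuck_tasks(first_seen, now=1800)
print("stuck in staging:", stuck)
```

In the situation reported here, every task would stay in the staging list
across snapshots, so all of them would be reported once the 30 minute limit
is reached.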
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)