Hi Jim, I was also working on this as part of the review comments I received for the Myriad HA changes. Are you too far along in fixing this? If not, I can send out an updated pull request that includes this by EOD today.
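For concreteness, here is a minimal sketch of the concurrent-map-plus-enum idea. The class and method names below (`TaskStateStore`, `transition`, `tasksIn`) are hypothetical, not the actual Myriad `SchedulerState` API; the point is just that per-task state lives in one `ConcurrentHashMap` instead of being moved between separate unsynchronized sets:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Hypothetical sketch: one concurrent map from taskId to lifecycle state,
// replacing the separate pending/staging/active/killable sets.
public class TaskStateStore {
    public enum TaskState { PENDING, STAGING, ACTIVE, KILLABLE }

    private final Map<String, TaskState> states = new ConcurrentHashMap<>();

    public void put(String taskId, TaskState state) {
        // Atomic per entry; no external lock needed.
        states.put(taskId, state);
    }

    public boolean transition(String taskId, TaskState from, TaskState to) {
        // Atomic compare-and-swap: fails if another thread moved the task
        // out of `from` first, so a task can never be in two states at once.
        return states.replace(taskId, from, to);
    }

    public void remove(String taskId) {
        states.remove(taskId);
    }

    public Set<String> tasksIn(TaskState state) {
        // Weakly consistent snapshot of the tasks currently in `state`.
        return states.entrySet().stream()
                .filter(e -> e.getValue() == state)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }
}
```

The `transition` call is what the unsynchronized set add/remove pairs can't give us today: the move between states either happens entirely or not at all.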
Regards,
Swapnil

On Wed, Aug 26, 2015 at 7:35 AM, Jim Klucar <[email protected]> wrote:

> I took a brief look at this and have an idea about what could be going on.
> Basically, the SchedulerState class isn't thread-safe. There is a lot of
> adding and removing of tasks from the various sets (pending, staging, etc.)
> that isn't thread-safe. Short of synchronization and locks, perhaps we can
> use a concurrent hash map of taskIds to a new enum representing the state.
>
> On Mon, Aug 24, 2015 at 8:58 PM, Sarjeet Singh (JIRA) <[email protected]> wrote:
>
> > [ https://issues.apache.org/jira/browse/MYRIAD-128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> >
> > Sarjeet Singh updated MYRIAD-128:
> > ---------------------------------
> > Attachment: Screen Shot 2015-08-24 at 5.51.38 PM.png
> >
> > Myriad UI screenshot
> >
> > > Issue with Flex down, Pending NMs stuck in staging and don't get to active task.
> > > --------------------------------------------------------------------------------
> > >
> > > Key: MYRIAD-128
> > > URL: https://issues.apache.org/jira/browse/MYRIAD-128
> > > Project: Myriad
> > > Issue Type: Bug
> > > Components: Scheduler
> > > Affects Versions: Myriad 0.1.0
> > > Reporter: Sarjeet Singh
> > > Attachments: Screen Shot 2015-08-24 at 5.51.38 PM.png
> > >
> > > Seeing an issue when I tried flexing NMs from the Myriad UI. On flexing
> > > down an active NM, pending NMs don't go to the active state (not showing
> > > in 'Active Tasks') and there is no active NM showing on the Myriad UI,
> > > although there is an NM running on the node (verified with jps):
> > > mapr 20528 20526 1 17:23 ?
00:00:26
> > > /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.85.x86_64/bin/java -Dproc_nodemanager
> > > -Xmx1000m -Dhadoop.log.dir=/opt/mapr/hadoop/hadoop-2.7.0/logs
> > > -Dyarn.log.dir=/opt/mapr/hadoop/hadoop-2.7.0/logs
> > > -Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log -Dyarn.home.dir=
> > > -Dyarn.id.str= -Dhadoop.root.logger=INFO,console
> > > -Dyarn.root.logger=INFO,console
> > > -Djava.library.path=/opt/mapr/hadoop/hadoop-2.7.0/lib/native
> > > -Dyarn.policy.file=hadoop-policy.xml -server
> > > -Dnodemanager.resource.io-spindles=4.0
> > > -Dyarn.resourcemanager.hostname=testrm.marathon.mesos
> > > -Dyarn.nodemanager.container-executor.class=org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor
> > > -Dnodemanager.resource.cpu-vcores=0 -Dnodemanager.resource.memory-mb=0
> > > -Dmyriad.yarn.nodemanager.address=0.0.0.0:31000
> > > -Dmyriad.yarn.nodemanager.localizer.address=0.0.0.0:31001
> > > -Dmyriad.yarn.nodemanager.webapp.address=0.0.0.0:31002
> > > -Dmyriad.mapreduce.shuffle.port=0.0.0.0:31003 -Dhadoop.login=maprsasl
> > > -Dhttps.protocols=TLSv1.2
> > > -Djava.security.auth.login.config=/opt/mapr/conf/mapr.login.conf
> > > -Dzookeeper.sasl.clientconfig=Client_simple
> > > -Dzookeeper.saslprovider=com.mapr.security.simplesasl.SimpleSaslProvider
> > > -Dhadoop.log.dir=/opt/mapr/hadoop/hadoop-2.7.0/logs
> > > -Dyarn.log.dir=/opt/mapr/hadoop/hadoop-2.7.0/logs
> > > -Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log
> > > -Dyarn.home.dir=/opt/mapr/hadoop/hadoop-2.7.0
> > > -Dhadoop.home.dir=/opt/mapr/hadoop/hadoop-2.7.0
> > > -Dhadoop.root.logger=INFO,console -Dyarn.root.logger=INFO,console
> > > -Djava.library.path=/opt/mapr/hadoop/hadoop-2.7.0/lib/native -classpath
/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop:/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop:/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/hdfs:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/hdfs/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/hdfs/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/nm-config/log4j.properties:/opt/mapr/lib/JPam-1.1.jar
> > > org.apache.hadoop.yarn.server.nodemanager.NodeManager
> > >
> > > From the Myriad UI:
> > > Active Tasks
> > > Killable Tasks
> > > Pending Tasks
> > > Staging Tasks
> > > nm.large.123badb1-57d8-4bd2-aa2e-de9fc1898c7f
> > > nm.medium.f2c4126c-4cb2-46af-a1e0-690034b914b8
> > > nm.medium.a9e9fd84-350a-48bc-bcd2-8712ecdc8c66
> > > nm.medium.663f9c6e-f28e-4395-8540-70c306eb04c5
> > > nm.medium.93f7cc91-9263-48a7-821e-3b0ffbe70e66
> > >
> > > This is the state even after waiting for about 30 min or so after flexing
> > > down the NM.
> > > I tried this on a single-node cluster, but it looks like the problem can
> > > happen in any case.
> > > I started the RM from Marathon and was able to get the RM & Myriad up &
> > > running. With the RM launched, a CGS (medium profile) NM is launched along
> > > with it as well, which is shown as an 'Active Task' on the Myriad UI. Then
> > > I launched some large-profile & zero-profile NMs, which are now shown in
> > > 'Pending Tasks' since there is a (CGS default) NM already running on the
> > > single-node cluster.
> > > Then I tried flexing down an NM from the Myriad UI, which flexed down the
> > > active NM, and all pending NMs started moving to staging tasks, where they
> > > then stayed stuck for a long time. Even after waiting for about 30 min,
> > > I don't see any active task for an NM, and all of the pending NM tasks are
> > > shown in 'Staging Tasks' only. (See the screenshot.)
> >
> > --
> > This message was sent by Atlassian JIRA
> > (v6.3.4#6332)
