Hi Darin,

I have multiple important fixes that I made as part of the HA work.
I feel it will be a lot of work and very time consuming to test each fix
independently, get it reviewed, and then rebase a big change like HA on top.

I am working on rebasing the HA change (I have to anyway). I should be able
to update the PR in an hour or two.
Most of the work has already been reviewed by multiple people; the only work
that still needs review is the fixes.
I would say let's see if we can get it all in. If it really becomes a
problem, I'll send out a separate pull request for each fix.

Regards
Swapnil


On Sat, Aug 29, 2015 at 7:34 AM, Darin Johnson <dbjohnson1...@gmail.com>
wrote:

> If you have a fix, let's do a separate PR for it and then rebase the HA PR
> on it. This will make it easier to reason about the code in each PR and to
> get the bug fix in quicker.
> On Aug 26, 2015 12:02 PM, "Swapnil Daingade" <swapnil.daing...@gmail.com>
> wrote:
>
> > Hi Jim,
> >
> > I was also working on this as part of the review comments that I received
> > for the Myriad HA changes.
> > Are you too far along in fixing this? If not, I can send out an updated
> > pull request including this by eod today.
> >
> > Regards
> > Swapnil
> >
> >
> > On Wed, Aug 26, 2015 at 7:35 AM, Jim Klucar <klu...@gmail.com> wrote:
> >
> > > I took a brief look at this and have an idea about what could be going
> > > on. Basically, the SchedulerState class isn't thread-safe. There is a
> > > lot of adding and removing of tasks from the various sets (pending,
> > > staging, etc.) that isn't thread-safe. Short of synchronization and
> > > locks, perhaps we could use a concurrent hash map keyed by taskId, with
> > > a new enum representing each task's state.
> > >
> > > On Mon, Aug 24, 2015 at 8:58 PM, Sarjeet Singh (JIRA) <j...@apache.org>
> > > wrote:
> > >
> > > >
> > > >      [ https://issues.apache.org/jira/browse/MYRIAD-128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> > > >
> > > > Sarjeet Singh updated MYRIAD-128:
> > > > ---------------------------------
> > > >     Attachment: Screen Shot 2015-08-24 at 5.51.38 PM.png
> > > >
> > > > Myriad UI screenshot
> > > >
> > > > > Issue with Flex down, Pending NMs stuck in staging and don't get to active task.
> > > > > --------------------------------------------------------------------------------
> > > > >
> > > > >                 Key: MYRIAD-128
> > > > >                 URL: https://issues.apache.org/jira/browse/MYRIAD-128
> > > > >             Project: Myriad
> > > > >          Issue Type: Bug
> > > > >          Components: Scheduler
> > > > >    Affects Versions: Myriad 0.1.0
> > > > >            Reporter: Sarjeet Singh
> > > > >         Attachments: Screen Shot 2015-08-24 at 5.51.38 PM.png
> > > > >
> > > > >
> > > > > Seeing some issue when I tried flexing NMs from the Myriad UI. On
> > > > > flexing down an active NM, pending NMs don't go to active state (not
> > > > > showing in 'Active Tasks') and there is no active NM showing on the
> > > > > Myriad UI. Although, there is an NM running on the node (verified
> > > > > from jps).
> > > > > mapr     20528 20526  1 17:23 ?        00:00:26 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.85.x86_64/bin/java -Dproc_nodemanager -Xmx1000m -Dhadoop.log.dir=/opt/mapr/hadoop/hadoop-2.7.0/logs -Dyarn.log.dir=/opt/mapr/hadoop/hadoop-2.7.0/logs -Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log -Dyarn.home.dir= -Dyarn.id.str= -Dhadoop.root.logger=INFO,console -Dyarn.root.logger=INFO,console -Djava.library.path=/opt/mapr/hadoop/hadoop-2.7.0/lib/native -Dyarn.policy.file=hadoop-policy.xml -server -Dnodemanager.resource.io-spindles=4.0 -Dyarn.resourcemanager.hostname=testrm.marathon.mesos -Dyarn.nodemanager.container-executor.class=org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor -Dnodemanager.resource.cpu-vcores=0 -Dnodemanager.resource.memory-mb=0 -Dmyriad.yarn.nodemanager.address=0.0.0.0:31000 -Dmyriad.yarn.nodemanager.localizer.address=0.0.0.0:31001 -Dmyriad.yarn.nodemanager.webapp.address=0.0.0.0:31002 -Dmyriad.mapreduce.shuffle.port=0.0.0.0:31003 -Dhadoop.login=maprsasl -Dhttps.protocols=TLSv1.2 -Djava.security.auth.login.config=/opt/mapr/conf/mapr.login.conf -Dzookeeper.sasl.clientconfig=Client_simple -Dzookeeper.saslprovider=com.mapr.security.simplesasl.SimpleSaslProvider -Dhadoop.log.dir=/opt/mapr/hadoop/hadoop-2.7.0/logs -Dyarn.log.dir=/opt/mapr/hadoop/hadoop-2.7.0/logs -Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log -Dyarn.home.dir=/opt/mapr/hadoop/hadoop-2.7.0 -Dhadoop.home.dir=/opt/mapr/hadoop/hadoop-2.7.0 -Dhadoop.root.logger=INFO,console -Dyarn.root.logger=INFO,console -Djava.library.path=/opt/mapr/hadoop/hadoop-2.7.0/lib/native -classpath /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop:/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop:/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/hdfs:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/hdfs/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/hdfs/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/nm-config/log4j.properties:/opt/mapr/lib/JPam-1.1.jar org.apache.hadoop.yarn.server.nodemanager.NodeManager
> > > > > From myriad UI:
> > > > > Active Tasks
> > > > > Killable Tasks
> > > > > Pending Tasks
> > > > > Staging Tasks
> > > > > nm.large.123badb1-57d8-4bd2-aa2e-de9fc1898c7f
> > > > > nm.medium.f2c4126c-4cb2-46af-a1e0-690034b914b8
> > > > > nm.medium.a9e9fd84-350a-48bc-bcd2-8712ecdc8c66
> > > > > nm.medium.663f9c6e-f28e-4395-8540-70c306eb04c5
> > > > > nm.medium.93f7cc91-9263-48a7-821e-3b0ffbe70e66
> > > > > This is the state even after waiting for about 30 min or so after
> > > > > flexing down the NM.
> > > > > I tried this on a single node cluster, but it looks like the problem
> > > > > can happen in any case.
> > > > > I started the RM from Marathon and was able to get RM & Myriad up &
> > > > > running. With the RM launched, a CGS (medium profile) NM is launched
> > > > > along with it as well, which is shown as an 'Active Task' on the
> > > > > Myriad UI. Then I launched some large profile & zero profile NMs,
> > > > > which are now shown in 'Pending Tasks' since there is a (CGS
> > > > > default) NM already running on the single node cluster.
> > > > > Then I tried flexing down the active NM from the Myriad UI, and all
> > > > > pending NMs started moving to staging tasks, and then they got stuck
> > > > > in staging for a long time. Even after waiting for more than 30 min,
> > > > > I don't see any active NM task, and all of the pending NM tasks are
> > > > > shown in 'Staging Tasks' only. (See the screenshot.)
> > > >
> > > >
> > > >
> > > > --
> > > > This message was sent by Atlassian JIRA
> > > > (v6.3.4#6332)
> > > >
> > >
> >
>
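Jim's suggestion above (replacing the separate pending/staging/active sets with a single concurrent map from taskId to a state enum) could look roughly like this minimal sketch. The names TaskState and TaskStateStore are illustrative, not Myriad's actual classes:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: one concurrent map holds each task's lifecycle state,
// so state transitions are single atomic operations instead of a non-atomic
// remove-from-one-set / add-to-another sequence.
enum TaskState { PENDING, STAGING, ACTIVE, KILLABLE }

class TaskStateStore {
    private final ConcurrentMap<String, TaskState> tasks = new ConcurrentHashMap<>();

    // Register a new task in PENDING state (no-op if it already exists).
    void addPending(String taskId) {
        tasks.putIfAbsent(taskId, TaskState.PENDING);
    }

    // Atomically move a task from an expected state to a new one;
    // returns false if the task was not in the expected state, so a
    // racing transition cannot silently clobber another thread's update.
    boolean transition(String taskId, TaskState from, TaskState to) {
        return tasks.replace(taskId, from, to);
    }

    TaskState get(String taskId) {
        return tasks.get(taskId);
    }
}

class Demo {
    public static void main(String[] args) {
        TaskStateStore store = new TaskStateStore();
        store.addPending("nm.medium.1234");
        store.transition("nm.medium.1234", TaskState.PENDING, TaskState.STAGING);
        store.transition("nm.medium.1234", TaskState.STAGING, TaskState.ACTIVE);
        System.out.println(store.get("nm.medium.1234")); // prints ACTIVE
    }
}
```

The compare-and-set style of `ConcurrentMap.replace(key, oldValue, newValue)` is what makes this safer than the current sets: a flex-down racing with a launch can only win or lose the transition, never leave a task in two collections at once.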
