Hey Stephen,

Looks like this might be slightly different from what I was originally expecting. Sorry to keep asking for more info, but it will help me recreate the issue. Could you possibly get me more of the ResourceManager logs? In particular, I'm trying to figure out where upgradeNodeCapacity is getting called from, and any transitions of slave2. Also, what version of Hadoop are you running? I think I recall it being 2.7.2, but I should verify.
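In the meantime, here is the race I suspect, plus a sketch of the kind of guard I have in mind. To be clear, this is only a guess until I can reproduce it, and the class below (GuardedFairScheduler) is purely illustrative -- it is not the actual MyriadFairScheduler code and not a tested patch. The idea: when slave2 expires, the FairScheduler removes its SchedulerNode, but a NODE_RESOURCE_UPDATE for that same node is still in flight (Myriad resizing the zero-profile NM), so the scheduler dereferences a node it no longer knows about and NPEs, which takes down the RM.

// Illustrative sketch only, not a tested patch. Assumes the NPE comes from the
// scheduler looking up a SchedulerNode that was removed when the node went LOST,
// while a NODE_RESOURCE_UPDATE for that node was still queued.
package org.apache.myriad.scheduler.yarn;

import org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeResourceUpdateSchedulerEvent;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.SchedulerEvent;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.SchedulerEventType;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class GuardedFairScheduler extends FairScheduler {
  private static final Logger LOGGER = LoggerFactory.getLogger(GuardedFairScheduler.class);

  @Override
  public void handle(SchedulerEvent event) {
    if (event.getType() == SchedulerEventType.NODE_RESOURCE_UPDATE) {
      NodeResourceUpdateSchedulerEvent update = (NodeResourceUpdateSchedulerEvent) event;
      // If the node was just removed (RUNNING -> LOST), the scheduler no longer
      // tracks it and updating its resource would NPE; drop the event instead.
      if (getSchedulerNode(update.getRMNode().getNodeID()) == null) {
        LOGGER.warn("Ignoring NODE_RESOURCE_UPDATE for unknown node {}",
            update.getRMNode().getNodeID());
        return;
      }
    }
    super.handle(event);
  }
}

Even if that guess is right, the better fix is probably on the Myriad side (don't fire the capacity update at all for a node that has been deregistered), but a guard like the above would at least keep the RM from exiting while we sort it out.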
Thanks for taking the time to work with me on this.

Darin

On Thu, Jun 30, 2016 at 5:10 PM, Stephen Gran <stephen.g...@piksel.com> wrote:
> Hi,
>
> Yes - the imaginatively named slave2 was a zero-sized nm at that point -
> I am looking at how small a pool of reserved resource I can get away
> with, and use FGS for burst activity.
>
> Here are all the logs related to that host:port combination around that
> time:
>
> 2016-06-30 19:47:43,756 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:slave2:24679 Timed out after 2 secs
> 2016-06-30 19:47:43,771 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node slave2:24679 as it is now LOST
> 2016-06-30 19:47:43,771 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: slave2:24679 Node Transitioned from RUNNING to LOST
> 2016-06-30 19:47:43,909 INFO org.apache.myriad.scheduler.fgs.YarnNodeCapacityManager: Removed task yarn_Container: [ContainerId: container_1467314892573_0009_01_000005, NodeId: slave2:24679, NodeHttpAddress: slave2:23177, Resource: <memory:2048, vCores:1>, Priority: 20, Token: Token { kind: ContainerToken, service: 10.0.5.5:24679 }, ] with exit status freeing 0 cpu and 1 mem.
> 2016-06-30 19:47:43,909 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1467314892573_0009_01_000005 of capacity <memory:2048, vCores:1> on host slave2:24679, which currently has 1 containers, <memory:2048, vCores:1> used and <memory:2048, vCores:1> available, release resources=true
> 2016-06-30 19:47:43,909 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1467314892573_0009_000001 released container container_1467314892573_0009_01_000005 on node: host: slave2:24679 #containers=1 available=<memory:2048, vCores:1> used=<memory:2048, vCores:1> with event: KILL
> 2016-06-30 19:47:43,909 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Node not found resyncing slave2:24679
> 2016-06-30 19:47:43,952 INFO org.apache.myriad.scheduler.fgs.YarnNodeCapacityManager: Removed task yarn_Container: [ContainerId: container_1467314892573_0009_01_000006, NodeId: slave2:24679, NodeHttpAddress: slave2:23177, Resource: <memory:2048, vCores:1>, Priority: 20, Token: Token { kind: ContainerToken, service: 10.0.5.5:24679 }, ] with exit status freeing 0 cpu and 1 mem.
> 2016-06-30 19:47:43,952 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1467314892573_0009_01_000006 of capacity <memory:2048, vCores:1> on host slave2:24679, which currently has 0 containers, <memory:0, vCores:0> used and <memory:4096, vCores:2> available, release resources=true
> 2016-06-30 19:47:43,952 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1467314892573_0009_000001 released container container_1467314892573_0009_01_000006 on node: host: slave2:24679 #containers=0 available=<memory:4096, vCores:2> used=<memory:0, vCores:0> with event: KILL
> 2016-06-30 19:47:43,952 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node slave2:24679 cluster capacity: <memory:4096, vCores:4>
> 2016-06-30 19:47:47,573 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: slave2:24679 Node Transitioned from NEW to RUNNING
> 2016-06-30 19:47:47,936 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node slave2(cmPort: 24679 httpPort: 23177) registered with capability: <memory:0, vCores:0>, assigned nodeId slave2:24679
>
> Looks like it did go into LOST for a bit.
>
> Cheers,
>
> On 30/06/16 21:36, Darin Johnson wrote:
> > Stephen, thanks. I thought I had fixed that, but perhaps a regression was
> > introduced in another merge. I'll look into it; can you answer a few questions?
> > Was the node (slave2) a zero-sized nodemanager (for FGS)? In the node
> > manager logs, had it recently become unhealthy? I'm pretty concerned about
> > this and will try to get a patch soon.
> >
> > Thanks,
> >
> > Darin
> > On Jun 30, 2016 3:53 PM, "Stephen Gran" <stephen.g...@piksel.com> wrote:
> >
> >> Hi,
> >>
> >> Just playing with the 0.2.0 release (congratulations, by the way!)
> >>
> >> I have seen this twice now, although it is by no means consistent - I
> >> will have a dozen successful runs, and then one of these. This exits
> >> the RM, which makes it rather noticeable.
> >>
> >> 2016-06-30 19:47:43,952 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node slave2:24679 cluster capacity: <memory:4096, vCores:4>
> >> 2016-06-30 19:47:43,953 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_RESOURCE_UPDATE to the scheduler
> >> java.lang.NullPointerException
> >>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNodeResource(AbstractYarnScheduler.java:563)
> >>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.updateNodeResource(FairScheduler.java:1652)
> >>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1222)
> >>     at org.apache.myriad.scheduler.yarn.MyriadFairScheduler.handle(MyriadFairScheduler.java:102)
> >>     at org.apache.myriad.scheduler.yarn.MyriadFairScheduler.handle(MyriadFairScheduler.java:42)
> >>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:671)
> >>     at java.lang.Thread.run(Thread.java:745)
> >> 2016-06-30 19:47:43,972 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> >>
> >> --
> >> Stephen Gran
> >> Senior Technical Architect
> >>
> >> picture the possibilities | piksel.com
>
> --
> Stephen Gran
> Senior Technical Architect
>
> picture the possibilities | piksel.com