Hi,

Yes - the imaginatively named slave2 was a zero-sized NodeManager at that point. I am looking at how small a pool of reserved resource I can get away with, using fine-grained scaling (FGS) for burst activity.
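In case it's useful, the NM profile in play is the zero one, along these lines. This is only the shape of the stanza, based on the stock myriad-config-default.yml - the exact keys and instance counts in my setup may differ:

  nmInstances:
    zero: 1          # a single zero-sized NM; real capacity arrives via FGS
  profiles:
    zero:            # advertises no fixed capacity to YARN
      cpu: 0
      mem: 0
    small:           # the small reserved pool for baseline activity
      cpu: 2
      mem: 2048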
Here are all the logs related to that host:port combination around that time:

2016-06-30 19:47:43,756 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:slave2:24679 Timed out after 2 secs
2016-06-30 19:47:43,771 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node slave2:24679 as it is now LOST
2016-06-30 19:47:43,771 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: slave2:24679 Node Transitioned from RUNNING to LOST
2016-06-30 19:47:43,909 INFO org.apache.myriad.scheduler.fgs.YarnNodeCapacityManager: Removed task yarn_Container: [ContainerId: container_1467314892573_0009_01_000005, NodeId: slave2:24679, NodeHttpAddress: slave2:23177, Resource: <memory:2048, vCores:1>, Priority: 20, Token: Token { kind: ContainerToken, service: 10.0.5.5:24679 }, ] with exit status freeing 0 cpu and 1 mem.
2016-06-30 19:47:43,909 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1467314892573_0009_01_000005 of capacity <memory:2048, vCores:1> on host slave2:24679, which currently has 1 containers, <memory:2048, vCores:1> used and <memory:2048, vCores:1> available, release resources=true
2016-06-30 19:47:43,909 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1467314892573_0009_000001 released container container_1467314892573_0009_01_000005 on node: host: slave2:24679 #containers=1 available=<memory:2048, vCores:1> used=<memory:2048, vCores:1> with event: KILL
2016-06-30 19:47:43,909 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Node not found resyncing slave2:24679
2016-06-30 19:47:43,952 INFO org.apache.myriad.scheduler.fgs.YarnNodeCapacityManager: Removed task yarn_Container: [ContainerId: container_1467314892573_0009_01_000006, NodeId: slave2:24679, NodeHttpAddress: slave2:23177, Resource: <memory:2048, vCores:1>, Priority: 20, Token: Token { kind: ContainerToken, service: 10.0.5.5:24679 }, ] with exit status freeing 0 cpu and 1 mem.
2016-06-30 19:47:43,952 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1467314892573_0009_01_000006 of capacity <memory:2048, vCores:1> on host slave2:24679, which currently has 0 containers, <memory:0, vCores:0> used and <memory:4096, vCores:2> available, release resources=true
2016-06-30 19:47:43,952 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1467314892573_0009_000001 released container container_1467314892573_0009_01_000006 on node: host: slave2:24679 #containers=0 available=<memory:4096, vCores:2> used=<memory:0, vCores:0> with event: KILL
2016-06-30 19:47:43,952 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node slave2:24679 cluster capacity: <memory:4096, vCores:4>
2016-06-30 19:47:47,573 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: slave2:24679 Node Transitioned from NEW to RUNNING
2016-06-30 19:47:47,936 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node slave2(cmPort: 24679 httpPort: 23177) registered with capability: <memory:0, vCores:0>, assigned nodeId slave2:24679

Looks like it did go into LOST for a bit.
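Reading the two logs together, the crash looks like a race: the scheduler removes slave2 when the liveliness monitor expires it, but a NODE_RESOURCE_UPDATE for the now-removed node is apparently still in the event queue, and when the handler looks the node up again it gets null and dies. A minimal, self-contained sketch of that shape - the names here are illustrative stand-ins, not the actual Hadoop classes:

  import java.util.HashMap;
  import java.util.Map;

  public class NodeUpdateRace {
      // Stand-in for the scheduler's node map, keyed by nodeId.
      static final Map<String, Integer> nodes = new HashMap<>();

      // Analogous to AbstractYarnScheduler.updateNodeResource(): the
      // lookup result is dereferenced without a null check, so a stale
      // update for a removed node throws a NullPointerException.
      static void updateNodeResource(String nodeId, int newMemory) {
          Integer oldMemory = nodes.get(nodeId); // null once the node is gone
          int delta = newMemory - oldMemory;     // NPE: unboxing null
          nodes.put(nodeId, newMemory);
          System.out.println(nodeId + " delta = " + delta);
      }

      public static void main(String[] args) {
          nodes.put("slave2:24679", 4096);
          nodes.remove("slave2:24679");           // node expired -> LOST
          updateNodeResource("slave2:24679", 0);  // stale event -> NPE
      }
  }

If that reading is right, I'd guess the fix is either for Myriad not to send the resource update once it has seen the node removed, or for the handler to tolerate a missing node - but that's speculation on my part.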
Cheers,

On 30/06/16 21:36, Darin Johnson wrote:
> Stephen, thanks. I thought I had fixed that, but perhaps a regression was
> introduced in another merge. I'll look into it - can you answer a few
> questions?
>
> Was the node (slave2) a zero-sized NodeManager (for FGS)? In the
> NodeManager logs, had it recently become unhealthy? I'm pretty concerned
> about this and will try to get a patch out soon.
>
> Thanks,
>
> Darin
>
> On Jun 30, 2016 3:53 PM, "Stephen Gran" <stephen.g...@piksel.com> wrote:
>
>> Hi,
>>
>> Just playing with the 0.2.0 release (congratulations, by the way!)
>>
>> I have seen this twice now, although it is by no means consistent - I
>> will have a dozen successful runs, and then one of these. This exits
>> the RM, which makes it rather noticeable.
>>
>> 2016-06-30 19:47:43,952 INFO
>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
>> Removed node slave2:24679 cluster capacity: <memory:4096, vCores:4>
>> 2016-06-30 19:47:43,953 FATAL
>> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in
>> handling event type NODE_RESOURCE_UPDATE to the scheduler
>> java.lang.NullPointerException
>>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNodeResource(AbstractYarnScheduler.java:563)
>>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.updateNodeResource(FairScheduler.java:1652)
>>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1222)
>>         at org.apache.myriad.scheduler.yarn.MyriadFairScheduler.handle(MyriadFairScheduler.java:102)
>>         at org.apache.myriad.scheduler.yarn.MyriadFairScheduler.handle(MyriadFairScheduler.java:42)
>>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:671)
>>         at java.lang.Thread.run(Thread.java:745)
>> 2016-06-30 19:47:43,972 INFO
>> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting,
>> bbye..
>>
>> --
>> Stephen Gran
>> Senior Technical Architect
>>
>> picture the possibilities | piksel.com

--
Stephen Gran
Senior Technical Architect

picture the possibilities | piksel.com