Hi,

Yes - the imaginatively named slave2 was a zero-sized NodeManager at that point. I am looking at how small a pool of reserved resource I can get away with, using fine-grained scaling (FGS) for burst activity.
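In case it's useful, the NM profile in play is the zero one, along these lines. This is only the shape of the stanza, based on the stock myriad-config-default.yml - the exact keys and instance counts in my setup may differ:

  nmInstances:
    zero: 1          # a single zero-sized NM; real capacity arrives via FGS
  profiles:
    zero:            # advertises no fixed capacity to YARN
      cpu: 0
      mem: 0
    small:           # the small reserved pool for baseline activity
      cpu: 2
      mem: 2048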
Here are all the logs related to that host:port combination around that time:

2016-06-30 19:47:43,756 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:slave2:24679 Timed out after 2 secs
2016-06-30 19:47:43,771 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node slave2:24679 as it is now LOST
2016-06-30 19:47:43,771 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: slave2:24679 Node Transitioned from RUNNING to LOST
2016-06-30 19:47:43,909 INFO org.apache.myriad.scheduler.fgs.YarnNodeCapacityManager: Removed task yarn_Container: [ContainerId: container_1467314892573_0009_01_000005, NodeId: slave2:24679, NodeHttpAddress: slave2:23177, Resource: <memory:2048, vCores:1>, Priority: 20, Token: Token { kind: ContainerToken, service: 10.0.5.5:24679 }, ] with exit status freeing 0 cpu and 1 mem.
2016-06-30 19:47:43,909 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1467314892573_0009_01_000005 of capacity <memory:2048, vCores:1> on host slave2:24679, which currently has 1 containers, <memory:2048, vCores:1> used and <memory:2048, vCores:1> available, release resources=true
2016-06-30 19:47:43,909 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1467314892573_0009_000001 released container container_1467314892573_0009_01_000005 on node: host: slave2:24679 #containers=1 available=<memory:2048, vCores:1> used=<memory:2048, vCores:1> with event: KILL
2016-06-30 19:47:43,909 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Node not found resyncing slave2:24679
2016-06-30 19:47:43,952 INFO org.apache.myriad.scheduler.fgs.YarnNodeCapacityManager: Removed task yarn_Container: [ContainerId: container_1467314892573_0009_01_000006, NodeId: slave2:24679, NodeHttpAddress: slave2:23177, Resource: <memory:2048, vCores:1>, Priority: 20, Token: Token { kind: ContainerToken, service: 10.0.5.5:24679 }, ] with exit status freeing 0 cpu and 1 mem.
2016-06-30 19:47:43,952 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1467314892573_0009_01_000006 of capacity <memory:2048, vCores:1> on host slave2:24679, which currently has 0 containers, <memory:0, vCores:0> used and <memory:4096, vCores:2> available, release resources=true
2016-06-30 19:47:43,952 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1467314892573_0009_000001 released container container_1467314892573_0009_01_000006 on node: host: slave2:24679 #containers=0 available=<memory:4096, vCores:2> used=<memory:0, vCores:0> with event: KILL
2016-06-30 19:47:43,952 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node slave2:24679 cluster capacity: <memory:4096, vCores:4>
2016-06-30 19:47:47,573 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: slave2:24679 Node Transitioned from NEW to RUNNING
2016-06-30 19:47:47,936 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node slave2(cmPort: 24679 httpPort: 23177) registered with capability: <memory:0, vCores:0>, assigned nodeId slave2:24679

Looks like it did go into LOST for a bit.
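Reading the two logs together, the crash looks like a race: the scheduler removes slave2 when the liveliness monitor expires it, but a NODE_RESOURCE_UPDATE for the now-removed node is apparently still in the event queue, and when the handler looks the node up again it gets null and dies. A minimal, self-contained sketch of that shape - the names here are illustrative stand-ins, not the actual Hadoop classes:

  import java.util.HashMap;
  import java.util.Map;

  public class NodeUpdateRace {
      // Stand-in for the scheduler's node map, keyed by nodeId.
      static final Map<String, Integer> nodes = new HashMap<>();

      // Analogous to AbstractYarnScheduler.updateNodeResource(): the
      // lookup result is dereferenced without a null check, so a stale
      // update for a removed node throws a NullPointerException.
      static void updateNodeResource(String nodeId, int newMemory) {
          Integer oldMemory = nodes.get(nodeId); // null once the node is gone
          int delta = newMemory - oldMemory;     // NPE: unboxing null
          nodes.put(nodeId, newMemory);
          System.out.println(nodeId + " delta = " + delta);
      }

      public static void main(String[] args) {
          nodes.put("slave2:24679", 4096);
          nodes.remove("slave2:24679");           // node expired -> LOST
          updateNodeResource("slave2:24679", 0);  // stale event -> NPE
      }
  }

If that reading is right, I'd guess the fix is either for Myriad not to send the resource update once it has seen the node removed, or for the handler to tolerate a missing node - but that's speculation on my part.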
Cheers,

On 30/06/16 21:36, Darin Johnson wrote:
> Stephen, thanks. I thought I had fixed that, but perhaps a regression was
> introduced in another merge. I'll look into it - can you answer a few
> questions?
>
> Was the node (slave2) a zero-sized NodeManager (for FGS)? In the
> NodeManager logs, had it recently become unhealthy? I'm pretty concerned
> about this and will try to get a patch out soon.
>
> Thanks,
>
> Darin
>
> On Jun 30, 2016 3:53 PM, "Stephen Gran" <stephen.g...@piksel.com> wrote:
>
>> Hi,
>>
>> Just playing with the 0.2.0 release (congratulations, by the way!)
>>
>> I have seen this twice now, although it is by no means consistent - I
>> will have a dozen successful runs, and then one of these. This exits
>> the RM, which makes it rather noticeable.
>>
>> 2016-06-30 19:47:43,952 INFO
>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
>> Removed node slave2:24679 cluster capacity: <memory:4096, vCores:4>
>> 2016-06-30 19:47:43,953 FATAL
>> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in
>> handling event type NODE_RESOURCE_UPDATE to the scheduler
>> java.lang.NullPointerException
>>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNodeResource(AbstractYarnScheduler.java:563)
>>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.updateNodeResource(FairScheduler.java:1652)
>>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1222)
>>         at org.apache.myriad.scheduler.yarn.MyriadFairScheduler.handle(MyriadFairScheduler.java:102)
>>         at org.apache.myriad.scheduler.yarn.MyriadFairScheduler.handle(MyriadFairScheduler.java:42)
>>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:671)
>>         at java.lang.Thread.run(Thread.java:745)
>> 2016-06-30 19:47:43,972 INFO
>> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting,
>> bbye..
>>
>> --
>> Stephen Gran
>> Senior Technical Architect
>>
>> picture the possibilities | piksel.com

--
Stephen Gran
Senior Technical Architect

picture the possibilities | piksel.com