[ 
https://issues.apache.org/jira/browse/TEZ-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-3368:
----------------------------
    Attachment: TEZ-3368.001.patch

Although I don't have the full root cause of the NPE, I think we can make this 
code more robust so that it avoids both the NPE and the app hang.  Attaching a 
patch that avoids extra lookups and object creation when getting the top 
priority.  It also wraps some logic around the DelayedContainerManager so that 
if it crashes we tear down the AM rather than let it hang indefinitely.
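
To illustrate the second part, here is a minimal sketch of the general idea, 
not the code in the attached patch.  The names GuardedWorker and 
FatalErrorHandler are hypothetical and do not exist in Tez; the assumption is 
only that the thread body can be wrapped so that any unexpected exception is 
reported to something able to shut the application down, instead of letting 
the thread die silently.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical sketch: wraps a thread body so that an unexpected crash is
 * reported to a handler instead of silently killing the thread.
 * None of these names come from Tez or from the attached patch.
 */
final class GuardedWorker implements Runnable {

  /** Hypothetical callback; a real AM would trigger its shutdown path here. */
  public interface FatalErrorHandler {
    void onFatalError(Throwable t);
  }

  private final Runnable work;
  private final FatalErrorHandler handler;
  private final AtomicBoolean failed = new AtomicBoolean(false);

  GuardedWorker(Runnable work, FatalErrorHandler handler) {
    this.work = work;
    this.handler = handler;
  }

  @Override
  public void run() {
    try {
      work.run();
    } catch (Throwable t) {
      // Catch Throwable so that runtime exceptions such as an NPE are
      // reported rather than only logged by the uncaught exception handler.
      failed.set(true);
      handler.onFatalError(t);
    }
  }

  /** Returns true once the wrapped work has thrown. */
  boolean hasFailed() {
    return failed.get();
  }
}
```

With a wrapper like this, the handler could request an orderly teardown of 
the AM, so a crash in the worker thread becomes a visible failure rather than 
an indefinite hang.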


> NPE in DelayedContainerManager
> ------------------------------
>
>                 Key: TEZ-3368
>                 URL: https://issues.apache.org/jira/browse/TEZ-3368
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.1
>            Reporter: Jason Lowe
>         Attachments: TEZ-3368.001.patch
>
>
> Saw a Tez AM hang due to an NPE in the DelayedContainerManager:
> {noformat}
> 2016-07-17 01:53:23,157 [ERROR] [DelayedContainerManager] |yarn.YarnUncaughtExceptionHandler|: Thread Thread[DelayedContainerManager,5,main] threw an Exception.
> java.lang.NullPointerException
>         at org.apache.tez.dag.app.rm.TezAMRMClientAsync.getMatchingRequestsForTopPriority(TezAMRMClientAsync.java:142)
>         at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.getMatchingRequestWithoutPriority(YarnTaskSchedulerService.java:1474)
>         at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.access$500(YarnTaskSchedulerService.java:84)
>         at org.apache.tez.dag.app.rm.YarnTaskSchedulerService$NodeLocalContainerAssigner.assignReUsedContainer(YarnTaskSchedulerService.java:1869)
>         at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.assignReUsedContainerWithLocation(YarnTaskSchedulerService.java:1753)
>         at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.assignDelayedContainer(YarnTaskSchedulerService.java:733)
>         at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.access$600(YarnTaskSchedulerService.java:84)
>         at org.apache.tez.dag.app.rm.YarnTaskSchedulerService$DelayedContainerManager.run(YarnTaskSchedulerService.java:2030)
> {noformat}
> After the DelayedContainerManager thread exited, the AM continued to receive 
> the containers it had requested, but they went unused until the container 
> allocations expired.  They were then re-requested, and the cycle repeated 
> indefinitely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
