[ https://issues.apache.org/jira/browse/TEZ-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Lowe updated TEZ-3368:
----------------------------
Attachment: TEZ-3368.001.patch
Although I don't have the full root cause of the NPE, I think we can make this
code more robust to avoid both the NPE and the resulting hang of the app. I'm
attaching a patch that avoids extra lookups and object creation when getting
the top priority. It also wraps some logic around the DelayedContainerManager
so that if the thread crashes we tear down the AM rather than letting it hang
indefinitely.
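
To illustrate the second part of that idea, here is a minimal sketch of
guarding a worker thread so that an unexpected exception is reported to its
owner instead of silently ending the thread. This is not the code in the
attached patch. The names GuardedWorkerSketch, FatalErrorHandler, guarded and
runFailingWorker are hypothetical, and the handler here only records the
error, where a real AM would presumably start its shutdown path.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch; none of these names come from Tez or the patch. */
final class GuardedWorkerSketch {

  /** Hypothetical callback the owner supplies to tear itself down on a fatal error. */
  interface FatalErrorHandler {
    void onFatalError(Throwable cause);
  }

  /** Wraps a worker body so an unexpected exception reaches the handler. */
  static Runnable guarded(Runnable body, FatalErrorHandler handler) {
    return () -> {
      try {
        body.run();
      } catch (Throwable t) {
        // Without this catch the thread would just exit, and the owner
        // would keep running with nobody servicing the work.
        handler.onFatalError(t);
      }
    };
  }

  /** Runs a worker that throws an NPE and returns what the handler received. */
  static Throwable runFailingWorker() throws InterruptedException {
    AtomicReference<Throwable> reported = new AtomicReference<>();
    CountDownLatch done = new CountDownLatch(1);
    // Stand-in for the real scheduling loop; it simply fails immediately.
    Runnable failing = () -> {
      throw new NullPointerException("simulated");
    };
    Thread worker = new Thread(guarded(failing, t -> {
      reported.set(t);   // a real owner would begin shutting down here
      done.countDown();
    }), "DelayedContainerManager-sketch");
    worker.start();
    done.await(5, TimeUnit.SECONDS);
    return reported.get();
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("handler received: " + runFailingWorker());
  }
}
```

The design choice is to treat any escape from the worker loop as fatal for
the owning process, on the assumption that a clean failure is preferable to
an app that requests and expires containers forever.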
> NPE in DelayedContainerManager
> ------------------------------
>
> Key: TEZ-3368
> URL: https://issues.apache.org/jira/browse/TEZ-3368
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.1
> Reporter: Jason Lowe
> Attachments: TEZ-3368.001.patch
>
>
> Saw a Tez AM hang due to an NPE in the DelayedContainerManager:
> {noformat}
> 2016-07-17 01:53:23,157 [ERROR] [DelayedContainerManager] |yarn.YarnUncaughtExceptionHandler|: Thread Thread[DelayedContainerManager,5,main] threw an Exception.
> java.lang.NullPointerException
>         at org.apache.tez.dag.app.rm.TezAMRMClientAsync.getMatchingRequestsForTopPriority(TezAMRMClientAsync.java:142)
>         at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.getMatchingRequestWithoutPriority(YarnTaskSchedulerService.java:1474)
>         at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.access$500(YarnTaskSchedulerService.java:84)
>         at org.apache.tez.dag.app.rm.YarnTaskSchedulerService$NodeLocalContainerAssigner.assignReUsedContainer(YarnTaskSchedulerService.java:1869)
>         at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.assignReUsedContainerWithLocation(YarnTaskSchedulerService.java:1753)
>         at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.assignDelayedContainer(YarnTaskSchedulerService.java:733)
>         at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.access$600(YarnTaskSchedulerService.java:84)
>         at org.apache.tez.dag.app.rm.YarnTaskSchedulerService$DelayedContainerManager.run(YarnTaskSchedulerService.java:2030)
> {noformat}
> After the DelayedContainerManager thread exited, the AM continued to receive
> the containers it had requested, but they went unused until the container
> allocations expired. The containers were then re-requested, and the cycle
> repeated indefinitely.