[
https://issues.apache.org/jira/browse/TEZ-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386583#comment-15386583
]
Jason Lowe commented on TEZ-3368:
---------------------------------
Looking at the code, I don't see how this happened. The line in this build
that corresponds to the top of the stack trace is this one in TezAMRMClientAsync:
{code}
if (lrc.localityRequests.get() == 0) {
{code}
Which means either lrc or localityRequests was null. localityRequests is a
final field initialized to an AtomicInteger in the constructor, so it should be
impossible for that to be null. That means lrc was null, and it was computed
this way:
{code}
public synchronized List<? extends Collection<T>>
    getMatchingRequestsForTopPriority(
        String resourceName, Resource capability) {
  // Sort based on reverse order. By default, Priority ordering is based on
  // highest numeric value being considered to be lowest priority.
  Iterator<Priority> iter =
      knownRequestsByPriority.descendingKeySet().iterator();
  if (!iter.hasNext()) {
    return Collections.emptyList();
  }
  Priority p = iter.next();
  LocalityRequestCounter lrc = knownRequestsByPriority.get(p);
  if (lrc.localityRequests.get() == 0) {
{code}
Basically it's trying to get the value for the last key in the tree map. It
creates a descending view of the map, takes the first key from that view's key
set iterator, then finally does a lookup of the value based on that key. It
would be a lot simpler and safer to just call the lastEntry method on the map
than to create all the extra objects for navigation and iteration.
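As a rough illustration of the lastEntry suggestion, here is a minimal,
self-contained sketch. It is not the Tez code: Integer keys stand in for
Priority, and the Counter class and topCounter method are hypothetical names
made up for the example. The point is that lastEntry returns the key and its
value from the same map entry, so there is no separate lookup that could come
back null.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicInteger;

final class LastEntryDemo {
    // Hypothetical stand-in for LocalityRequestCounter.
    static final class Counter {
        final AtomicInteger localityRequests = new AtomicInteger();
    }

    // Returns the value stored under the largest key, or null if the map is
    // empty. The key and value come from one entry, with no second get().
    static Counter topCounter(TreeMap<Integer, Counter> map) {
        Map.Entry<Integer, Counter> last = map.lastEntry();
        return last == null ? null : last.getValue();
    }

    public static void main(String[] args) {
        TreeMap<Integer, Counter> map = new TreeMap<>();
        System.out.println(topCounter(map));      // prints "null" (empty map)

        Counter top = new Counter();
        map.put(1, new Counter());
        map.put(7, top);
        System.out.println(topCounter(map) == top); // prints "true"
    }
}
```

An empty map still has to be handled, but that is a single null check on the
entry rather than an iterator check followed by a lookup.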
My first thought was some other thread was messing with the
knownRequestsByPriority tree map, but all of the methods that access the tree
map are synchronized on the enclosing TezAMRMClientAsync object, and the map is
private and not exposed via some accessor method.
My second thought was somehow we associated a null value with the key in the
map, but every time we poke something in the map we create a new
LocalityRequestCounter object as the value. Therefore it should be impossible
for a key's value to be null in the map.
Another thought was that iterating the map somehow returned a priority that
was subsequently mutated (e.g. via setPriority) before we tried to do the value
lookup, so the lookup would no longer find the original entry associated
with the iterator. However, I didn't see any way we would call setPriority on a
Priority record after it was created.
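To make the mutated-key hypothesis concrete, the sketch below shows how a
TreeMap lookup can return null for a key that is still in the map once the
key's ordering changes. This is only an illustration of the general hazard,
not a reproduction of the Tez failure: the Key class is a hypothetical mutable
key invented for the example, and whether a given mutation breaks the lookup
depends on the shape of the tree at the time.

```java
import java.util.TreeMap;

final class MutableKeyDemo {
    // Hypothetical mutable key, standing in for a record with a setter.
    static final class Key implements Comparable<Key> {
        int value;
        Key(int value) { this.value = value; }
        @Override
        public int compareTo(Key other) {
            return Integer.compare(value, other.value);
        }
    }

    // Takes the top key via the descending key set, mutates it, then looks
    // it up again, mirroring the iterate-then-get pattern.
    static String lookupAfterMutation() {
        TreeMap<Key, String> map = new TreeMap<>();
        map.put(new Key(1), "one");
        map.put(new Key(2), "two");
        map.put(new Key(3), "three");

        Key top = map.descendingKeySet().iterator().next(); // key with value 3
        top.value = 0; // key now sorts before every other key in the tree

        // The search walks left from the root and never reaches the node
        // that actually holds this key object.
        return map.get(top);
    }

    public static void main(String[] args) {
        System.out.println(lookupAfterMutation()); // prints "null"
    }
}
```

The entry is still physically in the map, but the binary search can no longer
reach it, which is exactly the kind of null that the get call would hand back.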
> NPE in DelayedContainerManager
> ------------------------------
>
> Key: TEZ-3368
> URL: https://issues.apache.org/jira/browse/TEZ-3368
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.1
> Reporter: Jason Lowe
>
> Saw a Tez AM hang due to an NPE in the DelayedContainerManager:
> {noformat}
> 2016-07-17 01:53:23,157 [ERROR] [DelayedContainerManager] |yarn.YarnUncaughtExceptionHandler|: Thread Thread[DelayedContainerManager,5,main] threw an Exception.
> java.lang.NullPointerException
>     at org.apache.tez.dag.app.rm.TezAMRMClientAsync.getMatchingRequestsForTopPriority(TezAMRMClientAsync.java:142)
>     at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.getMatchingRequestWithoutPriority(YarnTaskSchedulerService.java:1474)
>     at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.access$500(YarnTaskSchedulerService.java:84)
>     at org.apache.tez.dag.app.rm.YarnTaskSchedulerService$NodeLocalContainerAssigner.assignReUsedContainer(YarnTaskSchedulerService.java:1869)
>     at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.assignReUsedContainerWithLocation(YarnTaskSchedulerService.java:1753)
>     at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.assignDelayedContainer(YarnTaskSchedulerService.java:733)
>     at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.access$600(YarnTaskSchedulerService.java:84)
>     at org.apache.tez.dag.app.rm.YarnTaskSchedulerService$DelayedContainerManager.run(YarnTaskSchedulerService.java:2030)
> {noformat}
> After the DelayedContainerManager thread exited, the AM continued to receive
> the containers it had requested, but they went unused until the container
> allocations expired. The containers were then re-requested, and the cycle
> repeated indefinitely.