[jira] [Commented] (TEZ-3368) NPE in DelayedContainerManager

Jason Lowe (JIRA) Wed, 20 Jul 2016 13:49:03 -0700

    [ 
https://issues.apache.org/jira/browse/TEZ-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386583#comment-15386583
 ]


Jason Lowe commented on TEZ-3368:
---------------------------------

Looking at the code, I don't see how this happened.  The line in this build 
that corresponds to the top of the stack trace is this one in TezAMRMClientAsync:
{code}
    if (lrc.localityRequests.get() == 0) {
{code}

So either lrc or localityRequests was null.  localityRequests is a final field 
initialized to an AtomicInteger in the constructor, so it should be impossible 
for that to be null.  That means lrc was null, and it was computed this way:
{code}
  public synchronized List<? extends Collection<T>>
    getMatchingRequestsForTopPriority(
        String resourceName, Resource capability) {
    // Sort based on reverse order. By default, Priority ordering is based on
    // highest numeric value being considered to be lowest priority.
    Iterator<Priority> iter =
      knownRequestsByPriority.descendingKeySet().iterator();
    if (!iter.hasNext()) {
      return Collections.emptyList();
    }
    Priority p = iter.next();
    LocalityRequestCounter lrc = knownRequestsByPriority.get(p);
    if (lrc.localityRequests.get() == 0) {
{code}

Basically, it's trying to get the value for the last key in the tree map.  It 
creates a descending key set, iterates it to get the first key, and then 
finally does a lookup of the value based on that key.  It would be a lot 
simpler and safer to just call the lastEntry method on the map than to create 
all the extra objects for navigation and iteration.
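
As a rough sketch of that simplification (not the actual Tez code), the helper 
below reads the value for the greatest key with a single lastEntry call.  The 
class and method names are made up for illustration, and the map is generic 
rather than the real Priority to LocalityRequestCounter map:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, for illustration only; not part of Tez.
final class TopPriorityLookup {
  // Returns the value stored under the greatest key, or null if the map
  // is empty.  Key and value come from the same entry, so there is no
  // second lookup by key that could miss.
  static <K, V> V valueForLastKey(TreeMap<K, V> map) {
    Map.Entry<K, V> last = map.lastEntry();
    return last == null ? null : last.getValue();
  }
}
```

The empty-map case still needs handling, since lastEntry returns null when 
there are no entries.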

My first thought was some other thread was messing with the 
knownRequestsByPriority tree map, but all of the methods that access the tree 
map are synchronized on the enclosing TezAMRMClientAsync object, and the map is 
private and not exposed via some accessor method.

My second thought was somehow we associated a null value with the key in the 
map, but every time we poke something in the map we create a new 
LocalityRequestCounter object as the value. Therefore it should be impossible 
for a key's value to be null in the map.

Another thought was that iterating the map somehow returned a priority that 
was subsequently mutated (e.g.: via setPriority) before we tried to do the 
value lookup, so that the lookup could no longer find the original entry 
associated with the iterator.  However, I didn't see any way we would call 
setPriority on a Priority record after it was created.
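
To make that last scenario concrete, here is a small standalone sketch of how 
a key mutated after insertion can make a TreeMap lookup return null even 
though iteration still yields that key.  The Key class is a hypothetical 
stand-in for a mutable, comparable record; it is not the YARN Priority class, 
and nothing here shows that this is what actually happened in the AM:

```java
import java.util.TreeMap;

// Hypothetical demo, for illustration only; not part of Tez or YARN.
final class MutableKeyDemo {
  // Mutable key whose ordering depends on a field that can change.
  static final class Key implements Comparable<Key> {
    int value;
    Key(int value) { this.value = value; }
    @Override
    public int compareTo(Key other) {
      return Integer.compare(value, other.value);
    }
  }

  // Returns true if get() misses the key that iteration just returned.
  static boolean lookupFailsAfterMutation() {
    TreeMap<Key, String> map = new TreeMap<>();
    Key low = new Key(1);
    Key mid = new Key(2);
    Key high = new Key(3);
    map.put(mid, "mid");
    map.put(low, "low");
    map.put(high, "high");

    // Mutate the greatest key in place.  The tree is not re-sorted, so
    // the entry stays in the rightmost position.
    high.value = 0;

    // Iteration walks the tree structure and still returns the same key.
    Key top = map.descendingKeySet().iterator().next();

    // The lookup compares 0 against the other keys, walks left, and
    // never reaches the rightmost entry.
    return top == high && map.get(top) == null;
  }
}
```

In a case like this, reading the entry directly (as lastEntry does) would 
still return the value, because no comparison-based lookup is involved.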

> NPE in DelayedContainerManager
> ------------------------------
>
>                 Key: TEZ-3368
>                 URL: https://issues.apache.org/jira/browse/TEZ-3368
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.1
>            Reporter: Jason Lowe
>
> Saw a Tez AM hang due to an NPE in the DelayedContainerManager:
> {noformat}
> 2016-07-17 01:53:23,157 [ERROR] [DelayedContainerManager] 
> |yarn.YarnUncaughtExceptionHandler|: Thread 
> Thread[DelayedContainerManager,5,main] threw an Exception.
> java.lang.NullPointerException
>         at 
> org.apache.tez.dag.app.rm.TezAMRMClientAsync.getMatchingRequestsForTopPriority(TezAMRMClientAsync.java:142)
>         at 
> org.apache.tez.dag.app.rm.YarnTaskSchedulerService.getMatchingRequestWithoutPriority(YarnTaskSchedulerService.java:1474)
>         at 
> org.apache.tez.dag.app.rm.YarnTaskSchedulerService.access$500(YarnTaskSchedulerService.java:84)
>         at 
> org.apache.tez.dag.app.rm.YarnTaskSchedulerService$NodeLocalContainerAssigner.assignReUsedContainer(YarnTaskSchedulerService.java:1869)
>         at 
> org.apache.tez.dag.app.rm.YarnTaskSchedulerService.assignReUsedContainerWithLocation(YarnTaskSchedulerService.java:1753)
>         at 
> org.apache.tez.dag.app.rm.YarnTaskSchedulerService.assignDelayedContainer(YarnTaskSchedulerService.java:733)
>         at 
> org.apache.tez.dag.app.rm.YarnTaskSchedulerService.access$600(YarnTaskSchedulerService.java:84)
>         at 
> org.apache.tez.dag.app.rm.YarnTaskSchedulerService$DelayedContainerManager.run(YarnTaskSchedulerService.java:2030)
> {noformat}
> After the DelayedContainerManager thread exited the AM proceeded to receive 
> requested containers that would go unused until the container allocations 
> expired.  Then they would be re-requested, and the cycle repeated 
> indefinitely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
