[ https://issues.apache.org/jira/browse/HADOOP-5641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708384#action_12708384 ]
Hemanth Yamijala commented on HADOOP-5641:
------------------------------------------
The reason for the above exception is as follows:
- MemoryMatcher.getMemReservedForTasks tries to retrieve the JobConf of each
job whose tasks are reported in the TaskTrackerStatus, in order to compute the
available memory on that TT.
- That job has been retired, so the lookup returns null, and the null is never
checked (sketched below).
- This scenario can occur when a TT reports back to the JobTracker with task
reports for a job *after* that job has been killed and retired. So it is a
timing issue.
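To make the failing pattern concrete, here is a minimal sketch. The types
JobLookup and JobInfo are hypothetical stand-ins for the real Hadoop classes
(JobInProgress, TaskTrackerStatus and friends); the actual MemoryMatcher code
differs in detail.
{code}
// Hypothetical stand-in types; illustration only, not the real Hadoop API.
interface JobLookup {
  JobInfo getJob(String jobId); // returns null once the job is retired
}

class JobInfo {
  long getMemForTask() { return 512; } // illustrative per-task memory, in MB
}

class MemoryMatcherSketch {
  private final JobLookup jobTracker;

  MemoryMatcherSketch(JobLookup jobTracker) {
    this.jobTracker = jobTracker;
  }

  // Mirrors the shape of getMemReservedForTasks(TaskTrackerStatus).
  long getMemReservedForTasks(Iterable<String> reportedJobIds) {
    long reserved = 0;
    for (String jobId : reportedJobIds) {
      JobInfo job = jobTracker.getJob(jobId);
      // The bug: 'job' is null if the job has already been retired, and the
      // next call throws the NullPointerException reported above.
      reserved += job.getMemForTask();
    }
    return reserved;
  }
}
{code}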
I was able to replicate the problem as follows:
- Set the number of completed jobs retained per user in the JT's memory to 1
(see the config sketch after these steps).
- Submit a job with long running tasks.
- On one of the TTs running tasks, send a SIGSTOP to the TT.
- Kill the job and submit enough other jobs from the same user so that this
job is retired and drops off the JT's page.
- Submit a new job.
- Send SIGCONT to the TT which was stopped above.
At this point, in the JT log, we can see the above exception.
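For reference, the first step above amounts to lowering the JT's limit on
retained completed jobs. A minimal sketch, assuming the standard
mapred.jobtracker.completeuserjobs.maximum property (on a real cluster this
would be set in the JT's mapred-site.xml; the value 1 is only for this
experiment):
{code}
import org.apache.hadoop.conf.Configuration;

public class RetireJobsFast {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Keep only one completed job per user in the JobTracker's memory,
    // so a killed job is retired almost immediately.
    conf.setInt("mapred.jobtracker.completeuserjobs.maximum", 1);
    System.out.println(
        conf.getInt("mapred.jobtracker.completeuserjobs.maximum", 100));
  }
}
{code}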
To fix this problem, we discussed two possible approaches.
- Check for null in the lookup result for the job, and have the MemoryMatcher
ignore the tasks whose job is not found. In other words, the memory computed
as 'reserved' for tasks would be less than the memory actually in use at that
point on that TT (because the running task is not accounted for). Any
scheduling based on this number might oversubscribe memory on the TT, and the
behavior may be unpredictable in this case.
- The second option is to check for null and, if this scenario is found,
schedule nothing to the TT (a sketch follows below). The rationale is that the
memory state cannot be correctly determined, so we conservatively assign
nothing to the TT until the state can be determined. In this case, we could
potentially waste a heartbeat. Note that, in general, the heartbeat response
will anyway instruct the TT to kill the retired job's running tasks, so the
state becomes determinable on a later heartbeat.
Evaluating the two options, and also considering how rarely this case can
occur, we decided to take the conservative approach and favor consistency of
state over utilization. Hence, the proposal is to implement the second
option.
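A hedged sketch of what the second option might look like, reusing the
hypothetical stand-in types from the sketch above; the committed patch may
differ.
{code}
class MemoryMatcherFixedSketch {
  private final JobLookup jobTracker;

  MemoryMatcherFixedSketch(JobLookup jobTracker) {
    this.jobTracker = jobTracker;
  }

  // Returns null when the reserved memory cannot be determined, i.e. when a
  // reported task belongs to a job that has already been retired.
  Long getMemReservedForTasks(Iterable<String> reportedJobIds) {
    long reserved = 0;
    for (String jobId : reportedJobIds) {
      JobInfo job = jobTracker.getJob(jobId);
      if (job == null) {
        // Conservative choice: the memory state is unknown, so signal the
        // caller to skip scheduling on this TT for this heartbeat. The
        // heartbeat response will kill the retired job's tasks, so the
        // state becomes determinable later.
        return null;
      }
      reserved += job.getMemForTask();
    }
    return reserved;
  }
}
{code}
The caller would then treat a null return as "assign nothing to this tracker
in this heartbeat".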
> Possible NPE in CapacityScheduler's MemoryMatcher
> -------------------------------------------------
>
> Key: HADOOP-5641
> URL: https://issues.apache.org/jira/browse/HADOOP-5641
> Project: Hadoop Core
> Issue Type: Bug
> Reporter: Vinod K V
> Assignee: rahul k singh
>
> MemoryMatcher does a job lookup based on a JobID in its
> getMemReservedForTasks(TaskTrackerStatus taskTracker) method. By this time,
> the job might have been removed from the JT's memory, resulting in an NPE.