[ https://issues.apache.org/jira/browse/HADOOP-5964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721792#action_12721792 ]

Hemanth Yamijala commented on HADOOP-5964:
------------------------------------------

I've looked at most of the code changes (excluding tests and examples). Here 
are a few more comments:

CapacityTaskScheduler:
 - In getTaskFromQueue, I would request a comment explaining why we are not 
reserving tasktrackers in the second pass (the reason, as we discussed 
offline, is that we don't think we need to give users more leeway by 
reserving slots given they are already over their user limit). A rough 
sketch of the intended check follows below.
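
To make the intent concrete, here is a minimal sketch of such a check. All 
names here (the enum, the method, its parameters) are hypothetical, not the 
actual CapacityTaskScheduler API:

  // A minimal sketch, assuming a two-pass scheduling loop; all names
  // are hypothetical, not the actual CapacityTaskScheduler code.
  class ReservationPolicySketch {
    enum Pass { WITHIN_USER_LIMITS, BEYOND_USER_LIMITS }

    // Reserve a tracker for a high-RAM task only in the first pass. In
    // the second pass the user is already over the user limit, so
    // reserving slots would grant that user extra leeway.
    boolean shouldReserveTracker(Pass pass, boolean highRamTaskDidNotFit) {
      return pass == Pass.WITHIN_USER_LIMITS && highRamTaskDidNotFit;
    }
  }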

JobTracker:
 - hostnameToTrackerName seems like the wrong name; it should be 
hostnameToTracker.
 - The comment on trackerExpiryQueue refers to a TreeSet of status objects, 
which looks stale; it should be updated to describe what the queue actually 
holds.
 - In recovery, there is an 'interesting' behavior currently: a job can be 
initialized by both the RecoveryManager and a job initialization thread like 
EagerTaskInitializer or JobInitializationPoller. This means that relying on 
preInitializeJob to set the right number of slots may be broken. (See the 
sketch after this list for one way to make initialization idempotent.)
 - Since we are not storing reservation information across restarts, one 
impact is that the counters tracking how long reservations were held for a 
job on a tracker could be lost. This may not be a big issue, because the 
reservations themselves are lost on restart, but I just wanted to check what 
you thought.
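
On the double-initialization point above, one option is an idempotent guard 
along these lines. This is a hedged sketch under the assumption that two 
threads can race to initialize a job; the class and method names are 
hypothetical, not existing JobTracker code:

  import java.util.concurrent.atomic.AtomicBoolean;

  // Hypothetical guard ensuring a job is initialized exactly once, no
  // matter whether the RecoveryManager or an init thread gets there first.
  class JobInitGuardSketch {
    private final AtomicBoolean initialized = new AtomicBoolean(false);

    // Returns true only for the single caller that wins the race; that
    // caller alone should run initialization, including the slot
    // accounting done in preInitializeJob.
    boolean tryMarkInitialized() {
      return initialized.compareAndSet(false, true);
    }
  }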

mapreduce.TaskTracker:
 - I am wondering whether it would be good to make unreserveSlots 
re-entrant. I struggled a bit to convince myself that it is never called 
twice in any scenario, which does seem to be the case now. But if we make it 
re-entrant by simply ignoring the operation when the reserved job is null, 
it might save us some corner-case bugs. Note we are currently throwing a 
runtime exception; a sketch of the re-entrant version follows below.
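
Something along these lines is what I have in mind. This is a minimal, 
self-contained sketch; the fields and method shapes are assumptions, not the 
actual mapreduce.TaskTracker members:

  // Hedged sketch of a re-entrant unreserveSlots(); names are
  // assumptions, not the real mapreduce.TaskTracker code.
  class TrackerReservationSketch {
    private Object reservedJob;  // stands in for the reserved JobInProgress
    private int reservedSlots;

    synchronized void reserveSlots(Object job, int numSlots) {
      reservedJob = job;
      reservedSlots = numSlots;
    }

    synchronized void unreserveSlots() {
      if (reservedJob == null) {
        return;  // re-entrant: second call is a no-op, not a RuntimeException
      }
      reservedSlots = 0;
      reservedJob = null;
    }
  }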

JobConf:
 - We are not handling the case where memory-based scheduling is disabled 
but the jobconf has a non-default value for the job's memory requirement 
(say, because of user misconfiguration). computeNumSlotsPerMap should 
probably check for this and return 1 when memory-based scheduling is 
disabled; otherwise the slot count could end up negative. A sketch of the 
guard follows below.
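
Roughly, the guard I have in mind looks like this. DISABLED_MEMORY_LIMIT and 
the method signature are modeled on JobConf-style configuration but should 
be treated as assumptions:

  // Hedged sketch of the suggested guard; names modeled on JobConf but
  // hypothetical.
  class SlotComputationSketch {
    static final long DISABLED_MEMORY_LIMIT = -1L;

    static int computeNumSlotsPerMap(long memForMapTaskMB, long slotSizeMB) {
      // If memory-based scheduling is disabled on either side, ignore
      // any (possibly misconfigured) job value and consume one slot.
      if (slotSizeMB == DISABLED_MEMORY_LIMIT
          || memForMapTaskMB == DISABLED_MEMORY_LIMIT) {
        return 1;
      }
      return (int) Math.ceil((double) memForMapTaskMB / slotSizeMB);
    }
  }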

MemoryMatcher:
 - The computation of committed memory included tasks that were in the 
COMMIT_PENDING state for a reason. We'll need to check this with someone 
from the M/R team; my understanding of why they are counted is sketched 
below.
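
For reference, this is how I understand the reason COMMIT_PENDING tasks 
count toward committed memory. The sketch is self-contained with 
hypothetical names; it is not the actual MemoryMatcher code:

  import java.util.EnumSet;

  // A task in COMMIT_PENDING has finished computing but still occupies
  // its memory on the tracker until the commit completes, so it must be
  // counted when summing committed memory. All names are hypothetical.
  class CommittedMemorySketch {
    enum State { RUNNING, COMMIT_PENDING, SUCCEEDED, FAILED, KILLED }

    static final EnumSet<State> OCCUPIES_MEMORY =
        EnumSet.of(State.RUNNING, State.COMMIT_PENDING);

    static long committedMemoryMB(Iterable<TaskStub> tasks) {
      long total = 0;
      for (TaskStub t : tasks) {
        if (OCCUPIES_MEMORY.contains(t.state)) {
          total += t.memoryMB;
        }
      }
      return total;
    }

    static class TaskStub {
      State state;
      long memoryMB;
    }
  }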



> Fix the 'cluster drain' problem in the Capacity Scheduler wrt High RAM Jobs
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-5964
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5964
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>    Affects Versions: 0.20.0
>            Reporter: Arun C Murthy
>            Assignee: Arun C Murthy
>             Fix For: 0.21.0
>
>         Attachments: HADOOP-5964_0_20090602.patch, 
> HADOOP-5964_1_20090608.patch, HADOOP-5964_2_20090609.patch, 
> HADOOP-5964_4_20090615.patch, HADOOP-5964_6_20090617.patch, 
> HADOOP-5964_7_20090618.patch, HADOOP-5964_8_20090618.patch
>
>
> When a HighRAMJob turns up at the head of the queue, the current 
> implementation of support for HighRAMJobs in the Capacity Scheduler has a 
> problem: the scheduler stops assigning tasks to all TaskTrackers in the 
> cluster until the HighRAMJob finds suitable TaskTrackers for all its 
> tasks.
> This causes a severe utilization problem, since effectively no new tasks 
> are allowed to run until the HighRAMJob (at the head of the queue) gets 
> slots.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
