[ https://issues.apache.org/jira/browse/HADOOP-5964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721792#action_12721792 ]
Hemanth Yamijala commented on HADOOP-5964:
------------------------------------------

I've looked at most of the code changes (excluding tests and examples). Here are a few more comments:

CapacityTaskScheduler:
- In getTaskFromQueue, I would request a comment on why we are not reserving tasktrackers in the second pass. (The reason, as we discussed offline, is that we don't think we need to give users more leeway by reserving slots when they are already over their user limit.) A possible wording is sketched after the issue summary below.

JobTracker:
- hostnameToTrackerName seems a wrong name; it should be hostnameToTracker.
- The comment on trackerExpiryQueue refers to a TreeSet of status objects.
- In recovery, there is an 'interesting' behavior currently: a job can be initialized either by the RecoveryManager or by a job-initialization thread such as EagerTaskInitializer or JobInitializationPoller. This means that relying on preInitializeJob to set the right number of slots may be broken.
- Since we are not storing information about reservations across restarts, one impact is that the counter information about how long reservations were held for a job on a tracker could be lost. This may not be a big issue because the reservations themselves are lost on restart, but I just wanted to check what you thought.

mapreduce.TaskTracker:
- I am wondering whether it would be good to make unreserveSlots re-entrant. I struggled a bit to determine that it will never be called twice in any scenario, which does seem to be the case now. But if we can make it re-entrant by simply ignoring the operation when the reserved job is null, it might save us some corner-case bugs. Note that we currently throw a runtime exception. (A rough sketch is appended after the issue summary below.)

JobConf:
- We are not handling the case where memory-based scheduling is disabled but the jobconf has some non-default value for the job's memory requirement (say because of user misconfiguration). computeNumSlotsPerMap should probably check for this and return 1 when memory-based scheduling is disabled; otherwise the slot count could end up negative. (A sketch of the guard is also appended below.)

MemoryMatcher:
- The computation of committed memory included tasks that were in the commit-pending state for a reason. We'll need to check this with someone from the M/R team.

> Fix the 'cluster drain' problem in the Capacity Scheduler wrt High RAM Jobs
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-5964
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5964
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>    Affects Versions: 0.20.0
>            Reporter: Arun C Murthy
>            Assignee: Arun C Murthy
>             Fix For: 0.21.0
>
>         Attachments: HADOOP-5964_0_20090602.patch, HADOOP-5964_1_20090608.patch, HADOOP-5964_2_20090609.patch, HADOOP-5964_4_20090615.patch, HADOOP-5964_6_20090617.patch, HADOOP-5964_7_20090618.patch, HADOOP-5964_8_20090618.patch
>
>
> When a HighRAMJob turns up at the head of the queue, the current implementation of support for HighRAMJobs in the Capacity Scheduler has a problem: the scheduler stops assigning tasks to any TaskTracker in the cluster until the HighRAMJob finds suitable TaskTrackers for all its tasks.
> This causes a severe utilization problem, since effectively no new tasks are allowed to run until the HighRAMJob (at the head of the queue) gets slots.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
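
As an addendum to the CapacityTaskScheduler point above, the requested comment in getTaskFromQueue's second pass could read roughly as follows (wording only, taken from the reasoning above; exact placement is up to you):

{code:java}
// Second pass: we deliberately do NOT reserve tasktrackers here.
// Tasks scheduled in this pass belong to users who are already over
// their user limit, so we don't give them extra leeway by holding
// slots for their high-memory tasks.
{code}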
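
For the mapreduce.TaskTracker point, here is a minimal sketch of the re-entrant unreserveSlots behavior; the class, field and method shapes below (reservedJobs, TaskType, job ids as Strings) are stand-ins for illustration, not the actual types in the patch:

{code:java}
import java.util.EnumMap;
import java.util.Map;

// Sketch only: models a tracker that can reserve its map/reduce slots for a job.
class TrackerReservationSketch {
  enum TaskType { MAP, REDUCE }

  private final Map<TaskType, String> reservedJobs =
      new EnumMap<TaskType, String>(TaskType.class);

  synchronized void reserveSlots(TaskType type, String jobId) {
    reservedJobs.put(type, jobId);
  }

  synchronized void unreserveSlots(TaskType type, String jobId) {
    String reserved = reservedJobs.get(type);
    if (reserved == null) {
      // Re-entrant: a second unreserve is silently ignored instead of
      // throwing a RuntimeException, so a duplicate call in some corner
      // case cannot take the tracker down an error path.
      return;
    }
    if (!reserved.equals(jobId)) {
      throw new IllegalStateException(
          "Slots reserved for " + reserved + ", not " + jobId);
    }
    reservedJobs.remove(type);
  }
}
{code}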
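
And for the JobConf point, a rough illustration of the guard in computeNumSlotsPerMap; the -1 'disabled' sentinel and the field names are assumptions for the sketch:

{code:java}
class SlotComputationSketch {
  // Sentinel meaning memory-based scheduling is disabled; the actual value
  // used by JobConf may differ.
  static final long DISABLED_MEMORY_LIMIT = -1L;

  long memoryForMapTask;  // what the job asked for (possibly misconfigured)
  long memoryPerMapSlot;  // what the cluster allots per map slot

  int computeNumSlotsPerMap() {
    // If memory-based scheduling is disabled, or either value is not sane,
    // fall back to one slot per map instead of computing a negative count.
    if (memoryPerMapSlot == DISABLED_MEMORY_LIMIT
        || memoryForMapTask == DISABLED_MEMORY_LIMIT
        || memoryPerMapSlot <= 0 || memoryForMapTask <= 0) {
      return 1;
    }
    return (int) Math.ceil((double) memoryForMapTask / memoryPerMapSlot);
  }
}
{code}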