[ https://issues.apache.org/jira/browse/HADOOP-5964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721321#action_12721321 ]

Hemanth Yamijala commented on HADOOP-5964:
------------------------------------------

I am looking at this patch as comprising three separate parts:
- Changes to the scheduler to fix the under-utilization problem in the face 
of high-RAM jobs
- The new TaskTracker class, its lifecycle, and the changes in JobTracker to 
support it
- The changes to the old TaskTracker class to account for the number of slots

I've reviewed the first part so far, and part of the second.

Some comments so far:

TaskTrackerStatus:
 - countOccupiedMapSlots: the check for whether a task is running, based on 
its status, is complicated enough to move into a helper that both 
countMapTasks and this API can call. That way, any change to the check keeps 
both APIs correct. Likewise for reduces. See the sketch below.
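
A minimal sketch of the refactoring I have in mind, inside TaskTrackerStatus; 
the helper name and the exact set of 'running' states are my assumptions, not 
necessarily what the patch checks today:

{code}
// Sketch only: a shared predicate so countMapTasks and
// countOccupiedMapSlots cannot drift apart. The states checked here
// are illustrative; use whatever the patch currently considers running.
private static boolean isTaskRunning(TaskStatus ts) {
  TaskStatus.State state = ts.getRunState();
  return state == TaskStatus.State.RUNNING ||
         state == TaskStatus.State.UNASSIGNED ||
         ts.inTaskCleanupPhase();
}

public int countOccupiedMapSlots() {
  int slots = 0;
  for (TaskStatus ts : getTaskReports()) {
    if (ts.getIsMap() && isTaskRunning(ts)) {
      slots += ts.getNumSlots(); // per-task slot count from the patch
    }
  }
  return slots;
}
// countMapTasks() and the reduce-side counterparts would reuse
// isTaskRunning(ts) in the same way, counting one per task.
{code}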

mapreduce.TaskTracker:
 - reserveSlots: the javadoc refers only to reserving 'map' slots.
 - Why do we need to maintain a count of reserved slots (numFallowMapSlots)? 
I see that its accessor API is not used anywhere.

CapacityTaskScheduler:
 - Why are we reserving all available slots on the tasktracker? Shouldn't we 
always reserve only as much as this job requires (see the sketch below)? In 
that case, do we need a re-reservation?
 - When we try to get a task for a job ignoring user limits (i.e. if the 
cluster is free), we are not reserving TTs. Is this by design? Also, is it 
for the same reason that we are not checking user limits when assigning a 
task to a reserved TT?
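
For instance, something along these lines; method names such as 
getAvailableSlots, pendingMaps and getNumSlotsPerTask are my assumptions 
about the patch's API:

{code}
// Sketch: reserve only as many slots as the job still needs on this
// tracker, not everything that is free. Names are illustrative.
int neededSlots = job.pendingMaps() * job.getNumSlotsPerTask(TaskType.MAP);
int freeSlots = taskTracker.getAvailableSlots(TaskType.MAP);
int slotsToReserve = Math.min(freeSlots, neededSlots);
taskTracker.reserveSlots(TaskType.MAP, job, slotsToReserve);
{code}

If we never reserve more than the job needs, the reservation should not have 
to be revised later, which is why I am asking whether re-reservation is 
needed at all.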

JobConf:
 - Since computeNumSlotsPerMap is used only by the CapacityScheduler right 
now, should we just leave this computation out of JobConf and keep it in the 
scheduler? See the sketch below.
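
The computation itself is small enough to live in the scheduler; a sketch, 
with assumed parameter names:

{code}
// Sketch: slot computation kept inside CapacityTaskScheduler rather
// than JobConf. Names are illustrative.
private int computeNumSlotsPerTask(long taskMemoryMB, long slotMemoryMB) {
  // A task that needs more memory than one slot provides occupies
  // ceil(taskMemory / slotMemory) slots.
  return (int) Math.ceil((double) taskMemoryMB / slotMemoryMB);
}
{code}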

JobInitializationPoller:
 - Let's not pass the scheduler instance to the poller. I think it only needs 
the number of map slots and reduce slots, so we can pass just those (see the 
sketch below). We've seen in the past that passing whole objects like the 
scheduler makes classes difficult to test, and not all of that information is 
required.
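
Something like the following shape, purely illustrative and not the patch's 
actual signature, would keep the poller easy to construct in unit tests:

{code}
// Sketch: the poller takes only the two counts it needs instead of
// the whole scheduler instance.
class JobInitializationPoller extends Thread {
  private final int mapSlots;
  private final int reduceSlots;

  JobInitializationPoller(int mapSlots, int reduceSlots) {
    this.mapSlots = mapSlots;
    this.reduceSlots = reduceSlots;
  }
  // polling logic reads mapSlots/reduceSlots directly
}
{code}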

JobTracker:
 - When a job is killed, we are not clearing the trackers reserved for it.
 - Likewise, when a TT is blacklisted, do we need to remove its reservations?
 - It seems the changes in JobTracker can be reduced a little if we do not 
change APIs that are passed a TTStatus object or a tasktracker name. We can 
still change the maps to be built of TaskTracker objects, but retrieve the 
status wherever necessary and pass it to the existing methods (see the 
sketch below). This way the changes should be fewer and easier to verify. 
For example, I think this is possible in the ExpireTrackers class.
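
For example, assuming the new TaskTracker class exposes a getStatus() 
accessor:

{code}
// Sketch: keep the status-taking APIs unchanged and fetch the status
// at the call site. Illustrative only.
TaskTracker taskTracker = taskTrackers.get(trackerName);
if (taskTracker != null) {
  TaskTrackerStatus status = taskTracker.getStatus();
  updateTaskStatuses(status); // existing API, signature unchanged
}
{code}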

Some nits:
- In some places, lines exceed 80 characters (e.g. in 
mapreduce.TaskTracker.java).
- There are a lot of LOG.info statements, presumably added for testing / 
debugging. Can you please remove them?
- 'Fallow' seems a complicated word to understand. Is 'Reserved' good enough?

Will continue with the review...

> Fix the 'cluster drain' problem in the Capacity Scheduler wrt High RAM Jobs
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-5964
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5964
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>    Affects Versions: 0.20.0
>            Reporter: Arun C Murthy
>            Assignee: Arun C Murthy
>             Fix For: 0.21.0
>
>         Attachments: HADOOP-5964_0_20090602.patch, 
> HADOOP-5964_1_20090608.patch, HADOOP-5964_2_20090609.patch, 
> HADOOP-5964_4_20090615.patch, HADOOP-5964_6_20090617.patch, 
> HADOOP-5964_7_20090618.patch
>
>
> When a HighRAMJob reaches the head of the queue, the current 
> implementation of support for HighRAMJobs in the Capacity Scheduler has a 
> problem: the scheduler stops assigning tasks to all TaskTrackers in the 
> cluster until the HighRAMJob finds suitable TaskTrackers for all of its 
> tasks.
> This causes a severe utilization problem, since effectively no new tasks 
> are allowed to run until the HighRAMJob (at the head of the queue) gets slots.
