[jira] Updated: (HADOOP-5964) Fix the 'cluster drain' problem in the Capacity Scheduler wrt High RAM Jobs

Arun C Murthy (JIRA) Fri, 19 Jun 2009 00:01:37 -0700

     [ https://issues.apache.org/jira/browse/HADOOP-5964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Arun C Murthy updated HADOOP-5964:
----------------------------------

    Attachment: HADOOP-5964_8_20090618.patch

Thanks for the review, Hemanth. As you pointed out, the patch needs a bit more 
work to remove logging etc.

I'm attaching a patch which incorporates your feedback.

Some clarifications:


{quote}
TaskTrackerStatus:

    * countOccupiedMapSlots: the check for whether a task is running, based on 
its status, seems complicated enough to move to an API that can be called from 
both countMapTasks and this API. This way, any changes to it will cause the 
right behavior for both APIs. Likewise, for reduces.

mapreduce.TaskTracker:

    * reserveSlots: Javadoc refers to reserving on 'map' slots.
    * Why do we need to maintain a count of slots reserved (numFallowMapSlots)? 
I see that the accessor API is not used anywhere.
{quote}

Fixed.
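
To illustrate the first point, here is a simplified sketch of sharing one 
running check between the two counting APIs. It is not the code in the patch: 
TaskState, TaskReport, TrackerStatusSketch and holdsSlots are hypothetical 
stand-ins, and the exact set of states that count as running is an assumption.

```java
import java.util.List;

// Hypothetical, simplified stand-ins for illustration; not the real Hadoop classes.
enum TaskState { RUNNING, UNASSIGNED, SUCCEEDED, FAILED, KILLED }

class TaskReport {
  final boolean isMap;
  final TaskState state;
  final int slotsRequired;

  TaskReport(boolean isMap, TaskState state, int slotsRequired) {
    this.isMap = isMap;
    this.state = state;
    this.slotsRequired = slotsRequired;
  }
}

class TrackerStatusSketch {
  private final List<TaskReport> reports;

  TrackerStatusSketch(List<TaskReport> reports) {
    this.reports = reports;
  }

  // The one place that decides whether a task still holds its slots.
  // Assumption: RUNNING and UNASSIGNED both count; the real check may differ.
  private static boolean holdsSlots(TaskReport r) {
    return r.state == TaskState.RUNNING || r.state == TaskState.UNASSIGNED;
  }

  // Number of map tasks that currently hold slots.
  int countMapTasks() {
    int n = 0;
    for (TaskReport r : reports) {
      if (r.isMap && holdsSlots(r)) {
        n++;
      }
    }
    return n;
  }

  // Number of map slots held; a high RAM task may hold more than one.
  int countOccupiedMapSlots() {
    int n = 0;
    for (TaskReport r : reports) {
      if (r.isMap && holdsSlots(r)) {
        n += r.slotsRequired;
      }
    }
    return n;
  }
}
```

Since both counters go through holdsSlots, a change to the running check 
changes both results together.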


bq.    * Why are we reserving available slots on the tasktracker. Shouldn't we 
always be reserving only how much this job requires ? In that case, do we need 
a re-reservation ?

We reserve all available slots because, by definition, all of them are for the 
same task; if the task could run right away, there would be nothing to reserve.
We need 're-reservation' since #reserved-slots (on the same tasktracker) might 
change over time and we need to track these for metering 
(JobCounter.FALLOW_SLOTS_MILLIS_{MAPS|REDUCES}).
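
A rough sketch of why re-reservation matters for metering, under my own 
simplifying assumptions: FallowSlotMeter and its methods are hypothetical names, 
not classes from the patch, and timestamps are passed in explicitly to keep the 
example deterministic. Each time the reserved count on a tracker changes, the 
old reservation is closed out and its slot-milliseconds are added to the total.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the patch.
class FallowSlotMeter {
  private static final class Reservation {
    final int slots;
    final long since;

    Reservation(int slots, long since) {
      this.slots = slots;
      this.since = since;
    }
  }

  // tracker name -> current reservation on that tracker
  private final Map<String, Reservation> reserved = new HashMap<>();
  private long fallowSlotsMillis = 0;

  // Reserve (or re-reserve) slots on a tracker. An existing reservation is
  // closed out first so its slot count is metered for the time it was held.
  void reserve(String tracker, int slots, long now) {
    unreserve(tracker, now);
    reserved.put(tracker, new Reservation(slots, now));
  }

  // Drop the reservation, charging slots * elapsed time to the running total.
  void unreserve(String tracker, long now) {
    Reservation r = reserved.remove(tracker);
    if (r != null) {
      fallowSlotsMillis += (long) r.slots * (now - r.since);
    }
  }

  long getFallowSlotsMillis() {
    return fallowSlotsMillis;
  }
}
```

For example, 2 slots held for 500 ms followed by 3 slots held for 500 ms on the 
same tracker meter as 2*500 + 3*500 = 2500 slot-milliseconds.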

bq.    * When we try to get a task for a job ignoring user limits (i.e. if the 
cluster is free), we are not reserving TTs. Is this by design ? Also, is it for 
the same reason that we are not checking for user limits when assigning a task 
to a reserved TT ?

Yes.


bq.    * Let's not pass the scheduler instance to the poller. I think it only 
needs the number of map slots and reduce slots. We can pass just that much. 
We've seen in the past that passing entire objects like the scheduler makes 
testing classes difficult. Also, not all information is required.

Done. I've added a JobInitializationPoller.JobInitializationContext and use 
that rather than passing the scheduler.
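
A minimal sketch of the idea, not the class as it appears in the patch: the 
field names, the accessors, and PollerSketch with its clusterSlots method are 
all assumptions made for illustration. The point is only that the poller 
depends on two numbers rather than on the whole scheduler.

```java
// Hypothetical shape of the context; field and method names are assumptions.
final class JobInitializationContext {
  private final int totalMapSlots;
  private final int totalReduceSlots;

  JobInitializationContext(int totalMapSlots, int totalReduceSlots) {
    this.totalMapSlots = totalMapSlots;
    this.totalReduceSlots = totalReduceSlots;
  }

  int getTotalMapSlots() {
    return totalMapSlots;
  }

  int getTotalReduceSlots() {
    return totalReduceSlots;
  }
}

// Hypothetical stand-in for the poller: it sees only the context.
class PollerSketch {
  private final JobInitializationContext context;

  PollerSketch(JobInitializationContext context) {
    this.context = context;
  }

  // Illustrative use of the context; the real poller's logic is not shown.
  int clusterSlots() {
    return context.getTotalMapSlots() + context.getTotalReduceSlots();
  }
}
```

A test can then construct the poller with plain integers and needs no scheduler 
instance.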


{quote}
JobTracker:

    * When a job is killed, we are not clearing reserved trackers for this job.
    * Likewise, when a TT is blacklisted do we need to remove the reservations ?

{quote}

My bad. Thanks for catching this. Fixed.
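
Roughly, the cleanup amounts to the following. This is a simplified sketch with 
hypothetical names (ReservationTable, jobKilled, trackerBlacklisted), not the 
JobTracker code itself, and it assumes at most one reserving job per tracker.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical bookkeeping for illustration; not the JobTracker's own data structure.
class ReservationTable {
  // tracker name -> id of the job holding the reservation on that tracker
  private final Map<String, String> reservedBy = new HashMap<>();

  void reserve(String tracker, String jobId) {
    reservedBy.put(tracker, jobId);
  }

  // A killed job must release every tracker it had reserved.
  void jobKilled(String jobId) {
    reservedBy.values().removeIf(jobId::equals);
  }

  // A blacklisted tracker cannot run tasks, so its reservation is dropped.
  void trackerBlacklisted(String tracker) {
    reservedBy.remove(tracker);
  }

  // Returns the reserving job id, or null if the tracker is not reserved.
  String reservedFor(String tracker) {
    return reservedBy.get(tracker);
  }

  int size() {
    return reservedBy.size();
  }
}
```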

bq. It seems like the changes in JobTracker can be reduced a little if we do 
not change APIs that are passed a TTstatus object or a tasktracker name. We can 
still change the maps to be built of TaskTracker objects, but retrieve the 
status wherever necessary and pass it to methods. This way the changes may be 
fewer and easier to verify. For e.g. I think this is possible in the 
ExpireTrackers class.

I really don't think it's a good idea to use both TaskTracker and 
TaskTrackerStatus in the long run; it's really hard to maintain, which is why I 
bit the bullet and changed all of them.


> Fix the 'cluster drain' problem in the Capacity Scheduler wrt High RAM Jobs
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-5964
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5964
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>    Affects Versions: 0.20.0
>            Reporter: Arun C Murthy
>            Assignee: Arun C Murthy
>             Fix For: 0.21.0
>
>         Attachments: HADOOP-5964_0_20090602.patch, 
> HADOOP-5964_1_20090608.patch, HADOOP-5964_2_20090609.patch, 
> HADOOP-5964_4_20090615.patch, HADOOP-5964_6_20090617.patch, 
> HADOOP-5964_7_20090618.patch, HADOOP-5964_8_20090618.patch
>
>
> When a HighRAMJob turns up at the head of the queue, the current 
> implementation of support for HighRAMJobs in the Capacity Scheduler has a 
> problem: the scheduler stops assigning tasks to all TaskTrackers in 
> the cluster until the HighRAMJob finds suitable TaskTrackers for all its 
> tasks.
> This causes a severe utilization problem since effectively no new tasks are 
> allowed to run until the HighRAMJob (at the head of the queue) gets slots.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
