[ https://issues.apache.org/jira/browse/HADOOP-5964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715768#action_12715768 ]

Arun C Murthy commented on HADOOP-5964:
---------------------------------------

A _much_ better model is for the scheduler to *pick* specific TaskTrackers and 
reserve slots on them, while accounting for those slots against the HighRAMJob 
and its queue. This would mean that once slots are reserved, per-task of the 
HighRAMJob, other slots in the cluster can still be handed out to other 
jobs/queues in the cluster. 

 Once the accounting for reserved slots is fixed, it would automatically ensure 
that a HighRAMJob can only reserve slots up to the quota of the queue it belongs 
to. Hence the next enhancement is to *pick* specific slots and hold them, rather 
than holding slots on every TaskTracker.

h4. Picking slots for High RAM Jobs
  
 The key to better support for HighRAMJobs is to reserve slots on specific 
TaskTrackers. Of course, one could get arbitrarily clever while *picking* 
slots; the factors to be considered are: 
   * Locality of input for the specific map-task of the job
   * Minimizing the expected delay until the slot is freed on a specific 
!TaskTracker

 For the first cut, I'd propose we consider only locality and not expected 
time. Once we fix _speculative execution_ (HADOOP-2141), we will have more of 
the necessary features to predict expected time etc.; hence the pushback.
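A locality-only pick could look roughly like the sketch below. The class and method names here are illustrative, not the actual Capacity Scheduler API; it simply prefers a tracker that hosts the task's input split and falls back to any tracker otherwise.

```java
import java.util.List;

// Illustrative sketch of locality-only slot picking; names are hypothetical.
public class LocalitySlotPicker {

  // Returns the index of the tracker to reserve on, or -1 if there are none.
  public static int pickTracker(List<String> trackerHosts,
                                List<String> splitHosts) {
    // First pass: prefer a tracker that is local to the task's input split.
    for (int i = 0; i < trackerHosts.size(); i++) {
      if (splitHosts.contains(trackerHosts.get(i))) {
        return i;
      }
    }
    // No local tracker available: fall back to the first tracker.
    return trackerHosts.isEmpty() ? -1 : 0;
  }
}
```

Expected-delay estimation (the second factor above) would slot in as a tie-breaker among local trackers once HADOOP-2141 provides the prediction machinery.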

h4. Accounting for Reserved Slots

 It is critical that we charge the queues of HighRAMJobs when we hold reserved 
slots for them, to ensure that they stay under their capacity and can't run 
away with slots in the cluster. The proposal is to charge jobs/queues 
immediately when we reserve slots on a TaskTracker (i.e. when the task can't 
be run immediately).
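The accounting rule above amounts to counting reserved slots against queue capacity at reservation time. A minimal sketch, with hypothetical names (the real scheduler's per-queue bookkeeping is structured differently):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: charge a queue for reserved slots immediately, so that reserved
// plus running slots together can never exceed the queue's capacity.
public class QueueAccounting {
  private final Map<String, Integer> slotsCharged = new HashMap<>();
  private final Map<String, Integer> capacity = new HashMap<>();

  public QueueAccounting(Map<String, Integer> queueCapacities) {
    capacity.putAll(queueCapacities);
  }

  // Try to reserve numSlots for a HighRAMJob's task in the given queue.
  // Refuses the reservation if it would exceed the queue's capacity.
  public boolean reserve(String queue, int numSlots) {
    int used = slotsCharged.getOrDefault(queue, 0);
    if (used + numSlots > capacity.getOrDefault(queue, 0)) {
      return false; // would run away past the queue's quota
    }
    slotsCharged.put(queue, used + numSlots);
    return true;
  }

  public int charged(String queue) {
    return slotsCharged.getOrDefault(queue, 0);
  }
}
```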

 
h4. Metering

 While metering HighRAMJobs, it would be incorrect to meter jobs (slot-hours 
etc.) by equating reserved slots to _running_ slots. The proposal is to meter 
HighRAMJobs for open-but-held slots and running slots. (Open-but-held slots are 
those which are free on the TaskTracker but are being held while more become 
available for the HighRAMJob's tasks.)
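In other words, the bill covers slots the job is actually using or actively denying to others, but not slots that are merely reserved while still occupied by someone else's running task. As a sketch (names are illustrative):

```java
// Sketch of the metering rule: a HighRAMJob is billed for slots it is
// running on, plus slots that are free on a TaskTracker but held for it;
// reserved slots still occupied by other tasks are not billed.
public class HighRamMetering {

  // Slot-hours accrued by the job over an interval of 'hours' hours.
  public static double slotHours(int runningSlots,
                                 int openButHeldSlots,
                                 double hours) {
    return (runningSlots + openButHeldSlots) * hours;
  }
}
```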

h4. Notes on Implementation and Challenges
 
 As discussed above, the proposal is to consider just data-locality while 
reserving slots. Assuming this, there are a couple of implementation choices 
once we have reserved the slot: 
   * Proposal 1: Hand out the task to the TaskTracker with a directive to start 
the task only when sufficient slots are freed up for it.
   * Proposal 2: Hold the task at the scheduler, noting which slot (i.e. 
TaskTracker) has been reserved for it.

h5. Proposal 1

 Here we would introduce a queue of _ready to run_ tasks at the TaskTracker and 
fill it with the tasks of HighRAMJobs.
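The TaskTracker-side queue could be as simple as the sketch below: tasks sit in a (hypothetical) WAITING_FOR_SLOT state and the head task is launched on a heartbeat once enough slots have freed up. Names are illustrative, not the real TaskTracker API.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of Proposal 1: a TaskTracker-local queue of tasks waiting for slots.
public class ReadyToRunQueue {

  static class WaitingTask {
    final String taskId;
    final int slotsNeeded;
    WaitingTask(String taskId, int slotsNeeded) {
      this.taskId = taskId;
      this.slotsNeeded = slotsNeeded;
    }
  }

  private final Queue<WaitingTask> waiting = new ArrayDeque<>();

  // Task arrives with the directive: start only when slotsNeeded are free.
  public void enqueue(String taskId, int slotsNeeded) {
    waiting.add(new WaitingTask(taskId, slotsNeeded));
  }

  // Called as slots free up: launch the head task if it now fits.
  // Returns the launched task id, or null if it is still waiting.
  public String maybeLaunch(int freeSlots) {
    WaitingTask head = waiting.peek();
    if (head != null && freeSlots >= head.slotsNeeded) {
      waiting.remove();
      return head.taskId;
    }
    return null;
  }
}
```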

h6. Pros
   * The primary advantage of taking this route is that it greatly reduces the 
cost of implementation; it is fairly simple to introduce a WAITING_FOR_SLOT 
state for the task and have the necessary information at the TaskTracker to 
launch it at the appropriate time (i.e. when sufficient slots are free).
   * Looking ahead, this might also be a good start toward more global 
scheduling across jobs.

h6. Cons
   * The major problem with this approach is that it touches a fairly sensitive 
part of the current implementation of the framework; it's fairly risky to 
tweak the TaskTracker code at this point, along with the JVMManager etc.
   * We would still need to tweak the JobTracker to handle the WAITING_FOR_SLOT 
state, e.g. ensure the TaskInitializationThread doesn't kill these tasks.
   * We need to consider how this affects other schedulers (it probably will 
not).

h5. Proposal 2

 Here we would start marking slots as reserved (per task, per job) and maintain 
the information needed to assign the slot to the task when it eventually does 
free up.
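Centrally, the scheduler would need something like the bidirectional bookkeeping sketched below: which task a tracker's slot is reserved for, and on which tracker a task is waiting. All names are illustrative; as noted in the Cons, the JobTracker has no such per-tracker class today.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of Proposal 2: scheduler-side reservation state, maintained in
// both directions (tracker -> task and task -> tracker).
public class SlotReservations {
  private final Map<String, String> trackerToTask = new HashMap<>();
  private final Map<String, String> taskToTracker = new HashMap<>();

  public void reserve(String tracker, String taskId) {
    trackerToTask.put(tracker, taskId);
    taskToTracker.put(taskId, tracker);
  }

  // When the tracker reports a freed slot, hand it to the reserved task.
  // Returns null if nothing was reserved on this tracker.
  public String assignIfReserved(String tracker) {
    String taskId = trackerToTask.remove(tracker);
    if (taskId != null) {
      taskToTracker.remove(taskId);
    }
    return taskId;
  }

  public String trackerFor(String taskId) {
    return taskToTracker.get(taskId);
  }
}
```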

h6. Pros
   * Simpler, since all state management is done centrally.
   * Lower risk, since all information is maintained in the scheduler.

h6. Cons
   * Currently the framework isn't set up to maintain this information: we do 
not have a single place (e.g. a TT class in the !JobTracker) to maintain 
per-tracker information such as reserved slots.
   * More engineering effort to maintain maps from a !TaskTracker to the task 
it's reserved for, and vice-versa.

h5. Recommendation

  * Proposal 1, for the attendant benefits and the leverage it gives us going 
forward (global scheduling etc.)

h4. User Interface

 It is important for users (and queue-admins) to understand that there are 
slots which are _reserved_ for HighRAMJobs, which results in fewer running 
maps/reduces w.r.t. the queue-capacities. It would be nice to add _reserved_ 
slots to the JobTracker/Job UI, and also to the Queue-Info in the Scheduler 
page.


> Fix the 'cluster drain' problem in the Capacity Scheduler wrt High RAM Jobs
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-5964
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5964
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>    Affects Versions: 0.20.0
>            Reporter: Arun C Murthy
>             Fix For: 0.21.0
>
>
> When a HighRAMJob turns up at the head of the queue, the current 
> implementation of support for HighRAMJobs in the Capacity Scheduler has a 
> problem: the scheduler stops assigning tasks to all TaskTrackers in the 
> cluster until the HighRAMJob finds a suitable TaskTracker for all its 
> tasks.
> This causes a severe utilization problem, since effectively no new tasks are 
> allowed to run until the HighRAMJob (at the head of the queue) gets slots.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
