[jira] Commented: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Amar Kamat (JIRA) Mon, 10 Nov 2008 03:03:09 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646223#action_12646223
 ]


Amar Kamat commented on HADOOP-4558:
------------------------------------

{quote}
Here J1 is still using 12 extra map and 1 extra reduce slots
It took nearly two more minutes to when j1 and j2 both starts using MR slots 
equal to their GCs.
{quote}
The reason is as follows:
When job2 is added, a {{ReclaimedResource}} object is added to the reclaim 
queue. After _whenToKill_ units of time, tasks from job1 are killed. But at 
this point job2 has not finished setup and hence cannot schedule tasks, so 
job1 is again selected for scheduling. Once job2 finishes setup, a new reclaim 
request is added for the (extra) scheduled tasks. Hence the observation that 
there are some extra kills and that the guaranteed capacity is allocated only 
after a few minutes.
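
The sequence described in this comment can be sketched as a toy tick-based 
model. This is only an illustration under simplifying assumptions, not the 
scheduler's code: the function name, its parameters, and the one-tick 
granularity are all hypothetical, and job1 is assumed to start out holding 
every slot.

```python
def simulate_reclaim(capacity, job2_gc, when_to_kill, job2_setup_done, horizon):
    """Toy model: returns (tasks_killed, [(tick, job1_slots, job2_slots), ...]).

    Hypothetical names throughout; job2 is submitted at tick 0 and can only
    schedule tasks from tick `job2_setup_done` onwards.
    """
    job1, job2 = capacity, 0          # job1 initially occupies the whole cluster
    reclaim_deadline = when_to_kill   # reclaim request queued when job2 is added
    kills = 0
    timeline = []
    for t in range(horizon):
        # Reclaim timer expired: kill job1 tasks above what job2's GC leaves it.
        if reclaim_deadline is not None and t >= reclaim_deadline:
            excess = job1 - (capacity - job2_gc)
            if excess > 0:
                job1 -= excess
                kills += excess
            reclaim_deadline = None
        free = capacity - job1 - job2
        if t >= job2_setup_done:
            # job2 has finished setup and can take free slots up to its GC.
            take = min(free, job2_gc - job2)
            job2 += take
            free -= take
            # Still short of its GC and no request pending: queue a new one.
            if job2 < job2_gc and reclaim_deadline is None:
                reclaim_deadline = t + when_to_kill
        # Any slot job2 could not use goes straight back to job1.
        job1 += free
        timeline.append((t, job1, job2))
    return kills, timeline


if __name__ == "__main__":
    kills, timeline = simulate_reclaim(
        capacity=10, job2_gc=6, when_to_kill=2, job2_setup_done=3, horizon=8
    )
    print("tasks killed:", kills)
    for row in timeline:
        print(row)
```

With these example numbers the first round of kills at tick 2 is wasted 
(job1 takes the slots back), 12 tasks are killed in total, and job2 reaches 
its 6 slots only at tick 5. With {{job2_setup_done=0}} the same model kills 6 
tasks and job2 has its slots at tick 2.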

I think the issue is more involved. Here are the choices:
1) Let it be: Since the setup task took time to schedule and finish, it's OK 
to keep it as it is. What we guarantee here is that the slots will be 
allocated to the queue as soon as a request is made.
2) Delay: One way to avoid the _thrashing_ is to delay the reclaim until the 
job/queue that wants it actually needs it. The obvious problem with this is 
that it will take some time to kill the tasks and hence there will be a little 
delay in reclaim. Also, the _sla_ needs to be redefined.
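
A rough way to compare the two choices is a back-of-the-envelope model. 
Everything here is an assumption made for illustration: the function and 
parameter names are hypothetical, job2 is taken to be submitted at time 0, and 
under choice 2 the reclaim timer is assumed to start only once job2 has 
finished setup.

```python
def compare_policies(excess_slots, when_to_kill, setup_done):
    """Return {policy: (tasks_killed, time_job2_reaches_gc)} for a toy model.

    excess_slots: job1 tasks that must be killed to free job2's GC
    when_to_kill: delay between a reclaim request and the kill
    setup_done:   time at which job2 can start scheduling tasks
    """
    # Choice 1 ("let it be"): the timer starts at submission (time 0).
    if setup_done <= when_to_kill:
        # job2 is ready when the kill happens, so the freed slots are used.
        eager = (excess_slots, when_to_kill)
    else:
        # The first kill is wasted; a second request is queued at setup_done.
        eager = (2 * excess_slots, setup_done + when_to_kill)
    # Choice 2 ("delay"): the timer starts only when job2 can use the slots.
    delayed = (excess_slots, setup_done + when_to_kill)
    return {"eager": eager, "delayed": delayed}


if __name__ == "__main__":
    # Slow setup: delaying halves the kills at no extra latency.
    print(compare_policies(excess_slots=6, when_to_kill=2, setup_done=3))
    # Fast setup: delaying pushes the reclaim past when_to_kill from submission.
    print(compare_policies(excess_slots=6, when_to_kill=2, setup_done=1))
```

In this model the delay never kills more than {{excess_slots}} tasks, but the 
capacity always arrives _whenToKill_ after setup rather than _whenToKill_ 
after submission, which is why the _sla_ would have to be restated.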

Note that this issue also depends on how set-up tasks are handled in the 
future and when the job actually becomes _RUNNING_.

> Scheduler fails to reclaim capacity if Jobs are submitted to queue one after 
> the other
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4558
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>    Affects Versions: 0.19.0
>         Environment: Cluster Capacity Maps=Reduces =210 each
> Two Queues: 
> Q1:  default, GC (%) =40, GC=84 (Maps and Reduces each). Reclaim time = 3 
> mins.
> Q2: test_q1, GC (%) =60, GC=126 (Maps and Reduces each) Reclaim time = 2 mins
>            Reporter: Karam Singh
>         Attachments: 4558.1.patch
>
>
> Scheduler fails to reclaim capacity if Jobs are submitted to queue one after 
> the other.
> First job submitted with tasks equal to cluster's M/R Capacity
> Second is submitted to different queue when all tasks of First Job are 
> running, scheduler fails to reclaim capacity for second job.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
