[ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646223#action_12646223 ]
Amar Kamat commented on HADOOP-4558:
------------------------------------

{quote}
Here J1 is still using 12 extra map and 1 extra reduce slots
It took nearly two more minutes to when j1 and j2 both starts using MR slots equal to their GCs.
{quote}

The reason is as follows: when job2 is added, a {{ReclaimedResource}} object is added to the reclaim queue. After _whenToKill_ units of time, tasks from job1 are killed. At that point, however, job2 has not finished setup and so cannot schedule tasks, so job1 is selected for scheduling again. Once job2 finishes setup, a reclaim request is added for the (extra) tasks scheduled in the meantime. Hence the observation that there are some extra kills and that the guaranteed capacity is allocated only after a few minutes.

I think the issue is more involved. Here are the choices:

1) Let it be: Since the setup task took time to schedule and finish, it's OK to keep it as it is. What we guarantee here is that the slots will be allocated to the queue as soon as a request is made.

2) Delay: One way to avoid the _thrashing_ is to delay the reclaim until the job/queue that wants it actually needs it. The obvious problem with this is that it will take some time to kill the tasks, and hence there will be a little delay in the reclaim. Also, the _sla_ needs to be redefined.

Note that this issue also depends on how set-up tasks are handled in future and when the job actually becomes _RUNNING_.

> Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4558
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>    Affects Versions: 0.19.0
>         Environment: Cluster Capacity Maps=Reduces =210 each
> Two Queues:
> Q1: default, GC (%) =40, GC=84 (Maps and Reduces each). Reclaim time = 3 mins.
> Q2: test_q1, GC (%) =60, GC=126 (Maps and Reduces each). Reclaim time = 2 mins.
>            Reporter: Karam Singh
>         Attachments: 4558.1.patch
>
>
> Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other.
> First job submitted with tasks equal to cluster's M/R Capacity
> Second is submitted to different queue when all tasks of First Job are running, scheduler fails to reclaim capacity for second job.
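
As a quick sanity check on the Environment numbers in the report, the per-queue guaranteed capacities (GC) follow directly from the cluster capacity and the GC percentages. The sketch below is only illustrative: {{guaranteed_capacity}} is a hypothetical helper, not a capacity-scheduler API, and rounding down to whole slots is an assumption (it does not matter for these values, which divide evenly).

```python
# Illustrative arithmetic only; guaranteed_capacity is a hypothetical helper,
# not a function from the Hadoop capacity scheduler.

def guaranteed_capacity(cluster_slots: int, gc_percent: int) -> int:
    """Slots guaranteed to a queue, given its GC as a percentage of the cluster.

    Rounding down to whole slots is an assumption made for this sketch.
    """
    return cluster_slots * gc_percent // 100


CLUSTER_SLOTS = 210  # map slots; the report uses the same figure for reduces

# Queue name -> GC (%), as listed in the Environment field of the report.
queue_gc_percent = {"default": 40, "test_q1": 60}

gc_slots = {
    name: guaranteed_capacity(CLUSTER_SLOTS, pct)
    for name, pct in queue_gc_percent.items()
}

for name, slots in gc_slots.items():
    print(f"{name}: GC = {slots} slots")
```

The two guaranteed capacities add up to the full cluster capacity, so once the first job occupies every slot, the second queue can only reach its GC through reclaim.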