[ 
https://issues.apache.org/jira/browse/HADOOP-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665540#action_12665540
 ] 

Joydeep Sen Sarma commented on HADOOP-5075:
-------------------------------------------

Question regarding the 'break' when slotsLeft == oldSlots:

This doesn't look correct to me. There seems to be no guarantee that all 
available slots are distributed in one round, which is why we previously had a 
for loop over the slots. Are we now claiming that going over the jobs one last 
time will distribute all the slots?

The basic problem seems to be:

    int share = (int) Math.ceil(oldSlots * weight / totalWeight);
    slotsLeft = giveMinSlots(job, type, slotsLeft, share);

I believe the computed share is quite likely to be less than the maximum 
number of slots that the job can consume. So going from 'floor' to 'ceil' may 
not be enough to guarantee that any slots get consumed (and certainly not 
enough to guarantee that *all* the remaining slots get consumed).

My gut feel is that the correct solution (when oldSlots == slotsLeft) should 
take into account the max tasks that a job can consume (as opposed to its 
weighted share only). 
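
To make the concern concrete, here is a rough, self-contained sketch of one 
pass over the jobs. It is not the scheduler's code: the class name, the 
per-job demand array, the simplified giveMinSlots stand-in, and the choice to 
compute totalWeight over all jobs are assumptions made for illustration. It 
only shows that a single round with ceil-based shares can leave slots 
unassigned even when another job could still use them.

```java
import java.util.Arrays;

// Illustrative sketch only; names and semantics are assumptions, not the
// fair scheduler's real implementation.
class MinSlotsSketch {

    // Hypothetical stand-in for giveMinSlots: grants at most `share` slots,
    // capped by the job's remaining demand and by the slots still left.
    // Returns the number of slots left after the grant.
    static int giveMinSlots(int[] demand, int[] granted, int job,
                            int slotsLeft, int share) {
        int remainingDemand = demand[job] - granted[job];
        int give = Math.min(share, Math.min(slotsLeft, remainingDemand));
        granted[job] += give;
        return slotsLeft - give;
    }

    // One pass over all jobs, using the share formula quoted above.
    // Assumes totalWeight is summed over every job in the pool.
    static int oneRound(double[] weights, int[] demand, int[] granted,
                        int slotsLeft) {
        int oldSlots = slotsLeft;
        double totalWeight = Arrays.stream(weights).sum();
        for (int job = 0; job < weights.length; job++) {
            int share = (int) Math.ceil(oldSlots * weights[job] / totalWeight);
            slotsLeft = giveMinSlots(demand, granted, job, slotsLeft, share);
        }
        return slotsLeft;
    }

    // Repeats rounds until no slots are left or a round makes no progress
    // (the 'break' under discussion). Returns the number of productive rounds.
    static int roundsUntilDone(double[] weights, int[] demand, int slots) {
        int[] granted = new int[demand.length];
        int slotsLeft = slots;
        int rounds = 0;
        while (slotsLeft > 0) {
            int oldSlots = slotsLeft;
            slotsLeft = oneRound(weights, demand, granted, slotsLeft);
            if (slotsLeft == oldSlots) {
                break; // no slot was assigned in this round
            }
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        double[] weights = {1.0, 1.0};
        int[] demand = {1, 10};   // job 0 needs 1 slot, job 1 could use 10
        int[] granted = new int[2];
        int left = oneRound(weights, demand, granted, 10);
        System.out.println("slots left after one round: " + left
                + ", granted: " + Arrays.toString(granted));
        System.out.println("rounds needed: "
                + roundsUntilDone(weights, demand, 10));
    }
}
```

In this toy case each job's share is capped at half of the slots, so the 
first round leaves slots over even though job 1 could absorb them. A variant 
that caps each grant by the job's remaining demand rather than by its weighted 
share alone would be one way to explore the suggestion above.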


> Potential infinite loop in updateMinSlots
> -----------------------------------------
>
>                 Key: HADOOP-5075
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5075
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Priority: Blocker
>             Fix For: 0.19.1, 0.20.0, 0.21.0
>
>         Attachments: hadoop-5075-v2.patch, hadoop-5075-v3.patch, 
> hadoop-5075.patch
>
>
> We ran into a problem at Facebook where the updateMinSlots loop in the 
> scheduler was repeating infinitely. This might happen if, due to rounding, we 
> are unable to assign the last few slots in a pool. This patch adds a break 
> statement to ensure that the loop exits if it hasn't managed to assign any 
> slots.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
