[ 
https://issues.apache.org/jira/browse/HADOOP-5003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663216#action_12663216
 ] 

Vivek Ratan commented on HADOOP-5003:
-------------------------------------

I don't see this as a bug. There will be times when the sum of GCs of all 
queues will not be equal to the actual cluster capacity. In particular, there 
will be times when this sum is greater than the cluster capacity. For example, 
TaskTrackers (TTs) could be down, and it may take a while before that is 
detected and the cluster capacity (as defined by the ClusterStatus object 
obtained from TaskTrackerManager) is updated. So there may simply not be enough slots to fulfill 
requests for all queues. To handle such situations, the Capacity Scheduler has 
a simple check that starts a timer to reclaim capacity only if there is at 
least one queue that is using extra capacity, i.e., there is capacity to 
reclaim. I think this is a sensible check and avoids starting timers too 
quickly. I see the SLA as taking effect when there is legitimate capacity to 
reclaim. In a perfect world, we'd know instantly that TTs were down and 
recompute GCs, but in our distributed world, we have to rely on possibly stale 
data from the ClusterStatus object.
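
To make the check described above concrete, here is a minimal sketch in Java. 
It is not the Capacity Scheduler's actual code: the class ReclaimCheck, the 
QueueUsage snapshot, and its fields are hypothetical names used only for 
illustration, and the extra condition that the requesting queue is itself 
below its GC is an assumption.

```java
import java.util.List;

// Hypothetical sketch of the "is there capacity to reclaim?" check.
final class ReclaimCheck {

    // Illustrative per-queue snapshot; these field names are assumptions.
    static final class QueueUsage {
        final String name;
        final int guaranteedCapacity; // absolute GC, in slots
        final int slotsInUse;

        QueueUsage(String name, int guaranteedCapacity, int slotsInUse) {
            this.name = name;
            this.guaranteedCapacity = guaranteedCapacity;
            this.slotsInUse = slotsInUse;
        }
    }

    // True if at least one queue is using more slots than its GC,
    // i.e. there is extra capacity that could be reclaimed.
    static boolean hasCapacityToReclaim(List<QueueUsage> queues) {
        for (QueueUsage q : queues) {
            if (q.slotsInUse > q.guaranteedCapacity) {
                return true;
            }
        }
        return false;
    }

    // Assumption: a reclaim timer is only worth starting for a queue that is
    // below its GC, and only when some other queue holds extra capacity.
    static boolean shouldStartReclaimTimer(QueueUsage starved, List<QueueUsage> all) {
        boolean belowGc = starved.slotsInUse < starved.guaranteedCapacity;
        return belowGc && hasCapacityToReclaim(all);
    }
}
```

For example, with two queues that each have a GC of 2 but only 3 live slots 
(one TT is down and not yet detected), a queue running 1 task next to a queue 
running 2 tasks would not start a timer, because no queue is above its GC.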

> When computing absolute guaranteed capacity (GC) from a percent value, 
> Capacity Scheduler should round up floats, rather than truncate them.
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5003
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5003
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vivek Ratan
>            Priority: Minor
>
> The Capacity Scheduler calculates a queue's absolute GC value by getting its 
> percent of the total cluster capacity (which is a float, since the configured 
> GC% is a float) and casting it to an int. Casting a float to an int truncates 
> toward zero, so a non-negative value is always rounded down. For very small 
> clusters, this can result in the GC of a queue 
> being one lower than what it should be. For example, if Q1 has a GC of 50%, 
> Q2 has a GC of 40%, and Q3 has a GC of 10%, and if the cluster capacity is 4 
> (as we have, in our test cases), Q1's GC works out to 2, Q2's to 1, and Q3's 
> to 0 with today's code. Q2's capacity should really be 2, since 40% of 4 is 
> 1.6, which rounds to 2. 
> A simple fix is to use Math.round() rather than casting to an int. 
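
The arithmetic in the description can be checked with a small Java sketch. 
The class GcRounding and its two methods are hypothetical helpers written for 
this illustration, not the scheduler's own code; only the cast and 
Math.round() come from the issue text.

```java
// Hypothetical helper comparing the two conversions discussed in the issue.
final class GcRounding {

    // Behaviour described in the issue: the cast truncates toward zero.
    static int truncatedGc(float gcPercent, int clusterCapacity) {
        return (int) (gcPercent * clusterCapacity / 100);
    }

    // Proposed fix: Math.round(float) returns the nearest int.
    static int roundedGc(float gcPercent, int clusterCapacity) {
        return Math.round(gcPercent * clusterCapacity / 100);
    }

    public static void main(String[] args) {
        float[] percents = {50f, 40f, 10f}; // Q1, Q2, Q3 from the description
        int capacity = 4;                   // cluster capacity used in the test cases
        for (float p : percents) {
            System.out.println(p + "% -> cast: " + truncatedGc(p, capacity)
                    + ", round: " + roundedGc(p, capacity));
        }
    }
}
```

With a capacity of 4, the cast yields 2, 1, and 0 for Q1, Q2, and Q3, while 
Math.round() yields 2, 2, and 0. Note that rounding to nearest can also push 
the sum of GCs above the cluster capacity: two queues at 50% of a 3-slot 
cluster each round 1.5 up to 2.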

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
