[ 
https://issues.apache.org/jira/browse/HADOOP-5003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665089#action_12665089
 ] 

Vivek Ratan commented on HADOOP-5003:
-------------------------------------

Well, I thought I was clear about why this isn't a bug, but let me give you a 
detailed example.  My argument is simple: in theory, you do not want the sum of 
GCs to be larger than the cluster size (and we ensure that when we start up and 
read the config file), but in practice, given that this is a distributed 
system, there will be situations when the sum of GCs is greater than the 
'actual' cluster size at a given moment. This happens when TTs fail. Consider 
the situation when you have two queues: Q1 and Q2. Assume their GCs are 5 slots 
(map or reduce, doesn't matter) each, i.e., there are 10 slots in the system. 
For simplification, assume there are 10 TTs and 1 slot per TT. Now suppose that 
Q1 is running at capacity, and Q2 is only using 4 out of 5 slots, because it 
doesn't have any more tasks to run. So, 1 TT  is free. Also assume that the 
tasks are long running, i.e. they take minutes to complete. This is time T0. 
Now, suppose a user submits a job with lots of tasks to Q2, at time T0+1second. 
Also suppose that at around the same time, i.e., at T0+1, the idle TT dies (it 
doesn't have to be the same time, just any time before it sends a heartbeat). 
Further, suppose that the reclaim capacity thread runs at time T0+3 seconds (it 
runs every 5 seconds by default). What should it do? The actual cluster 
capacity is 9, but the JT and scheduler do not know that yet. Remember it takes 
over 10 minutes for the JT to detect that the TT is down, and to update the 
cluster status. So, what does the Scheduler do? 

Your suggestion is that a timer is started for Q2, since it's below capacity 
and has pending tasks. So, at time T0+3, a timer gets started. Assuming that 
the reclaim time for the queue is 3 minutes, this timer will go off at T0+183 
(in seconds). When the timer goes off, what happens? We still haven't detected 
the lost TT (that will happen at T0+600 at the earliest, I believe). The timer 
has gone off, and we need to kill. Well, do we kill from Q1? If you say no (Q1 
is, after all, running at capacity only), the timer is wasted. If you say yes, 
it's unfair to Q1. 

I am arguing that the timer shouldn't have been set in the first place. The SLA 
is valid, as I see it, only IF there is capacity to reclaim. If nobody has 
taken my capacity, there is no SLA. Had we known instantly about the TT going 
down, we would have recomputed capacities: Q2's GC would be 4 instead of 5, and 
it would be at capacity. Q2's demand is not being satisfied because there is an 
incorrect view of what Q2's capacity is. The SLA should not apply here. 

If you look at the way we've worded the requirement for reclaiming of capacity 
in HADOOP-3421, it reads "...the system will guarantee that excess resources 
taken from an Org will be restored to it within N minutes of its need for 
them". The key phrase is 'resources taken from an Org'. If no queue is running 
over capacity, no resources have been taken from a queue, and hence the SLA is 
not in force. 
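
To make the condition concrete, here is a minimal Java sketch of the rule I am 
arguing for. It is not the Capacity Scheduler's actual code: ReclaimCheck, 
QueueState and shouldStartReclaimTimer are hypothetical names used only for 
illustration. The timer is started for a queue only if that queue is below its 
GC with pending tasks AND some other queue is running over its own GC.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; all names here are hypothetical.
final class ReclaimCheck {

    // Snapshot of a queue as the scheduler currently sees it.
    static final class QueueState {
        final String name;
        final int guaranteedSlots; // GC, in slots
        final int runningSlots;    // slots currently occupied by the queue
        final int pendingTasks;    // tasks waiting for a slot

        QueueState(String name, int guaranteedSlots, int runningSlots, int pendingTasks) {
            this.name = name;
            this.guaranteedSlots = guaranteedSlots;
            this.runningSlots = runningSlots;
            this.pendingTasks = pendingTasks;
        }
    }

    // The queue is below its GC and has work it could run.
    static boolean needsCapacity(QueueState q) {
        return q.runningSlots < q.guaranteedSlots && q.pendingTasks > 0;
    }

    // The queue holds more slots than its GC, i.e. it has taken capacity from others.
    static boolean isOverCapacity(QueueState q) {
        return q.runningSlots > q.guaranteedSlots;
    }

    // Start a reclaim timer only when there is actually something to reclaim.
    static boolean shouldStartReclaimTimer(QueueState q, List<QueueState> allQueues) {
        if (!needsCapacity(q)) {
            return false;
        }
        for (QueueState other : allQueues) {
            if (other != q && isOverCapacity(other)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The scenario above: Q1 is exactly at capacity, Q2 is one slot short.
        QueueState q1 = new QueueState("Q1", 5, 5, 0);
        QueueState q2 = new QueueState("Q2", 5, 4, 100);
        // Prints "false": no queue is over capacity, so no timer is started.
        System.out.println(shouldStartReclaimTimer(q2, Arrays.asList(q1, q2)));
    }
}
```

With Q1 at 5 of 5 and Q2 at 4 of 5, no timer is started, because nothing has 
been taken from Q2. If Q1 were running 6 slots instead, the same check would 
start the timer.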

> When computing absolute guaranteed capacity (GC) from a percent value, 
> Capacity Scheduler should round up floats, rather than truncate them.
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5003
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5003
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vivek Ratan
>            Priority: Minor
>
> The Capacity Scheduler calculates a queue's absolute GC value by getting its 
> percent of the total cluster capacity (which is a float, since the configured 
> GC% is a float) and casting it to an int. Casting a float to an int truncates 
> the fractional part, so a positive value is always rounded down. For very 
> small clusters, this can result in the GC of a queue 
> being one lower than what it should be. For example, if Q1 has a GC of 50%, 
> Q2 has a GC of 40%, and Q3 has a GC of 10%, and if the cluster capacity is 4 
> (as we have, in our test cases), Q1's GC works out to 2, Q2's to 1, and Q3's 
> to 0 with today's code. Q2's capacity should really be 2, as 40% of 4, 
> rounded up, should be 2. 
> Simple fix is to use Math.round() rather than cast to an int. 
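
The difference described in the issue can be reproduced with a small Java 
sketch. GcRounding, truncatedGc and roundedGc are hypothetical names chosen for 
this example, and the formula (percent * capacity / 100) is an assumption about 
how the percentage is applied, not a copy of the scheduler's code.

```java
// Illustrative sketch only; class and method names are hypothetical.
final class GcRounding {

    // Behaviour described in the issue: the cast drops the fractional part.
    static int truncatedGc(float gcPercent, int clusterCapacity) {
        return (int) (gcPercent * clusterCapacity / 100);
    }

    // Proposed fix: round to the nearest integer with Math.round().
    static int roundedGc(float gcPercent, int clusterCapacity) {
        return Math.round(gcPercent * clusterCapacity / 100);
    }

    public static void main(String[] args) {
        int clusterCapacity = 4;            // as in the test cases mentioned above
        float[] gcPercents = {50f, 40f, 10f}; // Q1, Q2, Q3

        for (int i = 0; i < gcPercents.length; i++) {
            System.out.println("Q" + (i + 1)
                + ": cast=" + truncatedGc(gcPercents[i], clusterCapacity)
                + ", round=" + roundedGc(gcPercents[i], clusterCapacity));
        }
        // Q1: cast=2, round=2
        // Q2: cast=1, round=2
        // Q3: cast=0, round=0
    }
}
```

Note that Math.round() rounds to the nearest integer rather than always 
rounding up, which is why Q3 (10% of 4 = 0.4) stays at 0 in both cases.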

