> On Nov. 18, 2015, 7:51 a.m., Qian Zhang wrote:
> > src/master/quota_handler.cpp, line 180
> > <https://reviews.apache.org/r/40351/diff/3/?file=1128793#file1128793line180>
> >
> >     Why do we want to rescind the offers that do not contribute to 
> > satisfying the quota request?
> 
> Alexander Rukletsov wrote:
>     Because we may want to rescind more than is necessary to satisfy the 
> quota request (remember the minimal agent count). A check in place would 
> effectively prevent us from doing so. Does that make sense to you?
> 
> Qian Zhang wrote:
>     Suppose the quota request is for 20GB of disk for a role, and there 
> is an offer that includes only 2 CPUs and 2GB of memory and no disk 
> resources at all. Will we rescind this offer too? This seems a little 
> unfair to me.
>     And can you please clarify a little more about why we want to rescind 
> offers from at least `numF` agents? The reason is that we want to ensure each 
> framework in that role will have a chance to get an offer in next allocation 
> cycle?
> 
> Alexander Rukletsov wrote:
>     That's correct, we will rescind that offer and yes, it's a bit unfair. 
> Let me explain why I decided to remove this check. Suppose a quota request 
> is for 6 CPUs for a role with 3 frameworks. The first offer we rescind is 10 
> CPUs, 10GB MEM. Technically, we have enough resources to satisfy quota, but 
> we would like to rescind offers from at least 2 more agents. Having a check 
> in place will prevent us from doing so. Do you think greedy rescinding can be 
> a problem?
>     
>     Yes, we would like to facilitate allocation for each framework in the 
> role, for which quota is set.
> 
> Qian Zhang wrote:
>     What is most unclear to me is why we need to rescind offers from at 
> least numF agents, i.e., in your example above, why do we want to rescind 
> offers from at least 2 more agents after quota has been satisfied? Can you 
> please let me know the motivation behind it? I think quota is a kind of 
> global concept that should not be directly related to agents and 
> frameworks; it should stay at the role level. So I am not sure why we want 
> to facilitate allocation for each framework in the role. Is that something 
> we mentioned in the design doc? Maybe I forgot ... :-)
> 
> Alexander Rukletsov wrote:
>     Nope, it wasn't in the design doc, that's something we decided recently. 
> The main motivation is to improve user experience and simplify debugging. 
> Because the built-in allocator is used in 99% of clusters, it makes sense to 
> exploit some knowledge about how it works. Because of coarse-grained 
> allocations, to facilitate fairness we may want to rescind from more agents 
> than necessary to satisfy quota numbers.

`why do we want to rescind offers from at least 2 more agents after quota has 
been satisfied?`
Just to be clear: it's not numF or more agents *on top of* quota. It's at least 
numF agents in case the quota itself doesn't already rescind offers from that 
many.
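
To make the stopping condition concrete, here is a minimal sketch of the 
loop as I understand it from this thread. It is not the code under review: 
the `Offer` struct, `selectOffersToRescind`, and the CPU-only accounting are 
simplifications assumed purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for an outstanding offer. This is
// NOT the Mesos `Offer` type; only CPUs are tracked to keep it short.
struct Offer
{
  std::string agentId;
  double cpus;
};

// Returns the indices of the offers to rescind. Rescinding continues
// until BOTH conditions hold: the rescinded resources cover the quota,
// and offers from at least `numF` distinct agents have been rescinded.
std::vector<std::size_t> selectOffersToRescind(
    const std::vector<Offer>& offers,
    double quotaCpus,
    std::size_t numF)
{
  assert(quotaCpus >= 0.0);

  std::vector<std::size_t> rescinded;
  std::set<std::string> visitedAgents;
  double rescindedCpus = 0.0;

  for (std::size_t i = 0; i < offers.size(); ++i) {
    // Stop only once quota is covered AND enough agents were visited.
    if (rescindedCpus >= quotaCpus && visitedAgents.size() >= numF) {
      break;
    }

    // Deliberately no "does this offer contribute to quota?" check:
    // such a check would skip offers that are needed only to reach
    // the `numF` agent count.
    rescinded.push_back(i);
    visitedAgents.insert(offers[i].agentId);
    rescindedCpus += offers[i].cpus;
  }

  return rescinded;
}
```

With Alex's example (6 CPUs of quota, 3 frameworks, a first offer of 10 
CPUs), the loop keeps going after the first offer until offers from three 
distinct agents have been rescinded. With `numF` set to 1 it stops right 
after the first offer.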

I'm not sure this is really "unfair", as these are *offers*, and not 
*allocations*. We are not pre-empting tasks. If the resources in the offers 
that are rescinded are not needed for quota, then they will be re-offered using 
the same fair-sharing logic that they were before. In fact, this is *more* 
fair, as we might end up making better offers due to information that has 
changed in the cluster.

The argument for the `numF` condition that Alex is making is one I pushed for. 
We often end up debugging clusters around new features, even not so new 
features. Although the `numF` condition by no means guarantees that every 
framework in the role will receive an offer, it does increase the chances 
greatly. The fact that they will receive any offer at all means we will see 
messages flowing to the framework, and hopefully log lines at the framework 
after receiving the offer. If the offer is still too small to launch a task, at 
least we will see a message at the framework level to that effect. **What we 
are optimizing for** is the ability to quickly eliminate (in most cases) the 
possibility that there is a bug in quota because the framework didn't receive 
any offers.

Please let me know if this is not clear, as I believe it is very important. The 
more of us understand why this extra condition is here, the fewer framework 
writers and cluster operators will be coming on IRC / dev list with debug logs 
that don't allow us to easily eliminate quota as the source of the problem.


- Joris


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40351/#review106977
-----------------------------------------------------------


On Nov. 24, 2015, 4:29 p.m., Alexander Rukletsov wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/40351/
> -----------------------------------------------------------
> 
> (Updated Nov. 24, 2015, 4:29 p.m.)
> 
> 
> Review request for mesos, Bernd Mathiske, Joerg Schad, Joris Van Remoortere, 
> Joseph Wu, and Qian Zhang.
> 
> 
> Bugs: MESOS-3912
>     https://issues.apache.org/jira/browse/MESOS-3912
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp e5e0ed01a56d869cc535687c8dbb6b99f6295b66 
>   src/master/quota_handler.cpp b8e501be43de6bc02aebfa5bd415b4212a96da31 
> 
> Diff: https://reviews.apache.org/r/40351/diff/
> 
> 
> Testing
> -------
> 
> make check (Mac OS X 10.10.4)
> 
> 
> Thanks,
> 
> Alexander Rukletsov
> 
>
