> On Nov. 18, 2015, 3:51 p.m., Qian Zhang wrote:
> > src/master/quota_handler.cpp, line 180
> > <https://reviews.apache.org/r/40351/diff/3/?file=1128793#file1128793line180>
> >
> >     Why do we want to rescind the offers that do not contribute to
> >     satisfying quota request?
>
> Alexander Rukletsov wrote:
>     Because we may rescind more than necessary to satisfy quota request
>     (remember minimal agent count). If we have a check in place, this will
>     effectively prevent us from doing so. Does it make sense to you?
>
> Qian Zhang wrote:
>     Suppose the quota request is to request 20GB disk for a role, and there
>     is an offer which only includes 2 CPU & 2GB memory and has no disk
>     resources at all, so we will rescind this offer too? This seems a
>     little unfair to me. And can you please clarify a little more about
>     why we want to rescind offers from at least `numF` agents? Is the
>     reason that we want to ensure each framework in that role will have a
>     chance to get an offer in the next allocation cycle?
>
> Alexander Rukletsov wrote:
>     That's correct, we will rescind that offer and yes, it's a bit unfair.
>     Let me explain why I decided to remove this check. Suppose a quota
>     request is for 6 CPUs for a role with 3 frameworks. The first offer we
>     rescind is 10 CPUs, 10GB MEM. Technically, we have enough resources to
>     satisfy quota, but we would like to rescind offers from at least 2 more
>     agents. Having a check in place will prevent us from doing so. Do you
>     think greedy rescinding can be a problem?
>
>     Yes, we would like to facilitate allocation for each framework in the
>     role, for which quota is set.
>
> Qian Zhang wrote:
>     What is most unclear to me is why we need to rescind offers from at
>     least numF agents, i.e., in your example above, why do we want to
>     rescind offers from at least 2 more agents after quota has been
>     satisfied? Can you please let me know the motivation behind it?
>     I think quota is a kind of global concept which should not have a
>     direct relation with agent and framework, it should stay at the role
>     level. So I am not sure why we want to facilitate allocation for each
>     framework in the role, is that something that we mentioned in the
>     design doc? Maybe I forget ... :-)
>
> Alexander Rukletsov wrote:
>     Nope, it wasn't in the design doc, that's something we decided recently.
>     The main motivation is to improve user experience and simplify
>     debugging. Because the built-in allocator is used in 99% of clusters,
>     it makes sense to exploit some knowledge about how it works. Because of
>     coarse-grained allocations, to facilitate fairness we may want to
>     rescind from more agents than necessary to satisfy quota numbers.
>
> Joris Van Remoortere wrote:
>     `why do we want to rescind offers from at least 2 more agents after
>     quota has been satisfied?`
>     Just to be clear: it's not numF or more agents *on top of* quota. It's
>     at least numF agents in case the quota itself doesn't already rescind
>     offers from that many.
>
>     I'm not sure this is really "un-fair", as these are *offers*, and not
>     *allocations*. We are not pre-empting tasks. If the resources in the
>     offers that are rescinded are not needed for quota, then they will be
>     re-offered using the same fair-sharing logic that they were before. In
>     fact, this is *more* fair, as we might end up making better offers due
>     to information that has changed in the cluster.
>
>     The argument for the `numF` condition that Alex is making is one I
>     pushed for. We often end up debugging clusters around new features,
>     even not so new features. Although the `numF` condition by no means
>     guarantees that every framework in the role will receive an offer, it
>     does increase the chances greatly. The fact that they will receive any
>     offer at all means we will see messages flowing to the framework, and
>     hopefully log lines at the framework after receiving the offer.
>     If the offer is still too small to launch a task, at least we will see
>     a message at the framework level to that effect. **What we are
>     optimizing for** is the ability to eliminate quickly (in most cases)
>     the possibility that there is a bug in quota because the framework
>     didn't receive any offers.
>
>     Please let me know if this is not clear, as I believe it is very
>     important. The more of us understand why this extra condition is here,
>     the fewer framework writers and cluster operators will be coming on
>     IRC / dev list with debug logs that don't allow us to easily eliminate
>     quota as the source of the problem.
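
For reference, the rescinding rule discussed in the quoted thread can be sketched as below. This is a minimal Python illustration under stated assumptions, not the code under review in quota_handler.cpp: the `Offer` class, the function name `offers_to_rescind`, and the restriction of the quota to CPUs only are all hypothetical and chosen just for the example.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    # Hypothetical stand-in for an outstanding offer; not a Mesos type.
    agent: str
    cpus: float
    mem_gb: float = 0.0
    disk_gb: float = 0.0


def offers_to_rescind(offers, quota_cpus, num_frameworks):
    """Select offers to rescind for a CPU-only quota request.

    Rescinding stops only when BOTH conditions hold:
      1. the rescinded CPUs cover the quota, and
      2. offers from at least `num_frameworks` distinct agents were rescinded.
    """
    rescinded = []
    agents = set()
    cpus = 0.0
    for offer in offers:
        if cpus >= quota_cpus and len(agents) >= num_frameworks:
            break
        # Deliberately no check on whether this offer contributes to the
        # quota: such a check would stop us from reaching the agent count.
        rescinded.append(offer)
        agents.add(offer.agent)
        cpus += offer.cpus
    return rescinded


# Alex's example: quota of 6 CPUs for a role with 3 frameworks.
outstanding = [
    Offer("a1", cpus=10.0, mem_gb=10.0),  # already covers the quota
    Offer("a2", cpus=2.0, mem_gb=2.0),
    Offer("a3", cpus=0.0, disk_gb=20.0),  # contributes nothing to the quota
    Offer("a4", cpus=4.0),
]
picked = offers_to_rescind(outstanding, quota_cpus=6.0, num_frameworks=3)
print([o.agent for o in picked])  # ['a1', 'a2', 'a3']
```

In this sketch the first offer alone satisfies the 6 CPU quota, yet offers from two more agents are rescinded so that three agents are covered; the fourth offer is left outstanding.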

Thanks Joris. So the motivation for this extra condition is to improve debuggability, right? But I am still not clear on why increasing the chances for frameworks to receive an offer will improve the debuggability of Mesos. Can you please clarify?


- Qian


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40351/#review106977
-----------------------------------------------------------


On Nov. 25, 2015, 12:29 a.m., Alexander Rukletsov wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/40351/
> -----------------------------------------------------------
>
> (Updated Nov. 25, 2015, 12:29 a.m.)
>
>
> Review request for mesos, Bernd Mathiske, Joerg Schad, Joris Van Remoortere,
> Joseph Wu, and Qian Zhang.
>
>
> Bugs: MESOS-3912
>     https://issues.apache.org/jira/browse/MESOS-3912
>
>
> Repository: mesos
>
>
> Description
> -------
>
> See summary.
>
>
> Diffs
> -----
>
>   src/master/master.hpp e5e0ed01a56d869cc535687c8dbb6b99f6295b66
>   src/master/quota_handler.cpp b8e501be43de6bc02aebfa5bd415b4212a96da31
>
> Diff: https://reviews.apache.org/r/40351/diff/
>
>
> Testing
> -------
>
> make check (Mac OS X 10.10.4)
>
>
> Thanks,
>
> Alexander Rukletsov
>
>