> On Nov. 18, 2015, 7:51 a.m., Qian Zhang wrote:
> > src/master/quota_handler.cpp, line 180
> > <https://reviews.apache.org/r/40351/diff/3/?file=1128793#file1128793line180>
> >
> >     Why do we want to rescind the offers that do not contribute to
> >     satisfying the quota request?
>
> Alexander Rukletsov wrote:
>     Because we may rescind more than is necessary to satisfy the quota
>     request (remember the minimal agent count). If we have a check in place,
>     it will effectively prevent us from doing so. Does it make sense to you?
>
> Qian Zhang wrote:
>     Suppose the quota request is for 20GB of disk for a role, and there is
>     an offer which only includes 2 CPUs and 2GB of memory and has no disk
>     resources at all, so we will rescind this offer too? This seems a little
>     unfair to me. And can you please clarify a little more about why we want
>     to rescind offers from at least `numF` agents? Is the reason that we want
>     to ensure each framework in that role will have a chance to get an offer
>     in the next allocation cycle?
>
> Alexander Rukletsov wrote:
>     That's correct, we will rescind that offer and yes, it's a bit unfair.
>     Let me explain why I decided to remove this check. Suppose a quota
>     request is for 6 CPUs for a role with 3 frameworks. The first offer we
>     rescind is 10 CPUs, 10GB MEM. Technically, we have enough resources to
>     satisfy quota, but we would like to rescind offers from at least 2 more
>     agents. Having a check in place will prevent us from doing so. Do you
>     think greedy rescinding can be a problem?
>
>     Yes, we would like to facilitate allocation for each framework in the
>     role for which quota is set.
>
> Qian Zhang wrote:
>     What is most unclear to me is why we need to rescind offers from at
>     least numF agents, i.e., in your example above, why do we want to rescind
>     offers from at least 2 more agents after quota has been satisfied? Can
>     you please let me know the motivation behind it? I think quota is kind of
>     a global concept which should not have a direct relation with agents and
>     frameworks; it should stay at the role level. So I am not sure why we
>     want to facilitate allocation for each framework in the role. Is that
>     something that we mentioned in the design doc? Maybe I forget ... :-)
>
> Alexander Rukletsov wrote:
>     Nope, it wasn't in the design doc, that's something we decided recently.
>     The main motivation is to improve user experience and simplify debugging.
>     Because the built-in allocator is used in 99% of clusters, it makes sense
>     to exploit some knowledge about how it works. Because of coarse-grained
>     allocations, to facilitate fairness we may want to rescind from more
>     agents than necessary to satisfy quota numbers.
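
For reference, the stopping rule discussed above can be sketched roughly as follows. This is a simplified illustration and not the code in `quota_handler.cpp`: the `Offer` struct, the `offersToRescind` function, and the CPU-only accounting are assumptions made for the example. The point it shows is that rescinding stops only once both conditions hold (quota covered, and at least `numF` distinct agents visited), and that there is no check on whether an individual offer contributes to quota.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified model of an outstanding offer. Real offers
// carry full resource sets; only CPUs are tracked here.
struct Offer {
  std::string agentId;
  double cpus;
};

// Returns the indices of the offers that would be rescinded.
// `quotaCpus` is the requested quota and `numF` is the number of
// frameworks in the role (both names are assumptions for this sketch).
std::vector<std::size_t> offersToRescind(
    const std::vector<Offer>& offers,
    double quotaCpus,
    std::size_t numF)
{
  std::vector<std::size_t> rescinded;
  std::set<std::string> visitedAgents;
  double rescindedCpus = 0.0;

  for (std::size_t i = 0; i < offers.size(); ++i) {
    // Stop only when the quota is covered AND offers from at least
    // `numF` distinct agents have been rescinded.
    if (rescindedCpus >= quotaCpus && visitedAgents.size() >= numF) {
      break;
    }

    // Deliberately no check on whether this particular offer
    // contributes to the quota: such a check would block the
    // `numF` condition once the quota is numerically satisfied.
    rescinded.push_back(i);
    visitedAgents.insert(offers[i].agentId);
    rescindedCpus += offers[i].cpus;
  }

  return rescinded;
}
```

With the numbers from the thread (a quota of 6 CPUs, a role with 3 frameworks, and a first offer of 10 CPUs), this sketch keeps going after the first offer until offers from three distinct agents have been rescinded, even though the first offer alone already covers the quota.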
`why do we want to rescind offers from at least 2 more agents after quota has been satisfied?`

Just to be clear: it's not numF or more agents *on top of* quota. It's at least numF agents in case satisfying the quota itself doesn't already rescind offers from that many.

I'm not sure this is really "unfair", as these are *offers*, and not *allocations*. We are not pre-empting tasks. If the resources in the rescinded offers are not needed for quota, then they will be re-offered using the same fair-sharing logic as before. In fact, this is *more* fair, as we might end up making better offers due to information that has changed in the cluster.

The argument for the `numF` condition that Alex is making is one I pushed for. We often end up debugging clusters around new features, and even not so new features. Although the `numF` condition by no means guarantees that every framework in the role will receive an offer, it does increase the chances greatly. The fact that they will receive any offer at all means we will see messages flowing to the framework, and hopefully log lines at the framework after receiving the offer. If the offer is still too small to launch a task, at least we will see a message at the framework level to that effect.

**What we are optimizing for** is the ability to quickly eliminate (in most cases) the possibility that there is a bug in quota because the framework didn't receive any offers.

Please let me know if this is not clear, as I believe it is very important. The more of us understand why this extra condition is here, the fewer framework writers and cluster operators will be coming on IRC / dev list with debug logs that don't allow us to easily eliminate quota as the source of the problem.


- Joris


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40351/#review106977
-----------------------------------------------------------


On Nov. 24, 2015, 4:29 p.m., Alexander Rukletsov wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/40351/
> -----------------------------------------------------------
>
> (Updated Nov. 24, 2015, 4:29 p.m.)
>
>
> Review request for mesos, Bernd Mathiske, Joerg Schad, Joris Van Remoortere,
> Joseph Wu, and Qian Zhang.
>
>
> Bugs: MESOS-3912
>     https://issues.apache.org/jira/browse/MESOS-3912
>
>
> Repository: mesos
>
>
> Description
> -------
>
> See summary.
>
>
> Diffs
> -----
>
>   src/master/master.hpp e5e0ed01a56d869cc535687c8dbb6b99f6295b66
>   src/master/quota_handler.cpp b8e501be43de6bc02aebfa5bd415b4212a96da31
>
> Diff: https://reviews.apache.org/r/40351/diff/
>
>
> Testing
> -------
>
> make check (Mac OS X 10.10.4)
>
>
> Thanks,
>
> Alexander Rukletsov
>
>