> On April 23, 2018, 3:43 p.m., Greg Mann wrote: > > src/master/master.cpp > > Line 4512 (original), 4515-4527 (patched) > > <https://reviews.apache.org/r/66568/diff/5/?file=2010510#file2010510line4515> > > > > I was just wondering - perhaps this code belongs in the validation > > function for the ACCEPT call? Since this is something that we can easily > > confirm synchronously in the top-level scheduler API handler and return > > 400, rather than calling into the accept handlers. > > Zhitao Li wrote: > I'm fine in either case as long as it provides better consistency and > less code duplication (which I think is a problem). I guess the reason there > is a split is due to that AuthZ is async right now. > > I actualy wonder what happens if we move the slave connectivity checks to > `Master::accept()`: since master to agent connection through network can be > broken at any moment, I do not see checking that in `_accept()` is 100% safe > anyway, and `send()` calls to agent need to handle failures anyway. > > Alternatively, we can consider moving authorization checks as early as we > can, and condense all other validations after AuthZ returned (assuming AuthZ > do not depends on these validations). > > Zhitao Li wrote: > (replied too fast) I feel both work can reduce the code duplicate, and > makes it less ambiguous for future development. > > Chun-Hung Hsiao wrote: > IMO in general it's better to have validations before authN/authZ, > because validations is usually more light-weight than authN/authZ. Also, > validations should be synchronous, meaning that if the call is invalid, we > will get rid of it faster from the memory (rather than waiting for the > authN/authZ to finish). > > Chun-Hung Hsiao wrote: > But in this case since this is temporary, I'm fine with putting the logic > here. > > Chun-Hung Hsiao wrote: > Question: Are these operations usable with this restriction? It seems to > me that with this restriction, the framework will need to maintain one more > extra state: 1. reserve enough CPUs, mem and free disk (otherwise it is > possible that the framework cannot get enough CPUs and mem after growing the > volume); 2. grow the persistent volume; 3. use the grown volume. If we > implement a proper validation right now, 1 and 2 can be combined, and there > will be no need to rewrite the framework in the future. Can you explain again > what the concern of implementing the correct validation logic again? > > Zhitao Li wrote: > Re: order of valiation vs AuthZ: I agree they should be consistency, but > we already have splits between `accept()` and `_accept()` and framework could > see operations dropped either before or after AuthZ too. Given that 1) AuthZ > will be async by nature and 2) certain states could change between `accept()` > and `_accept()` (offer validaty, agent connectivity), I feel that only check > AuthZ on `accept()` but delay any actual validation to `_accept()` results in > most consistent behavior. Indeed this could increase calls to authorizor but > I don't know whether that can be quantified. > > Re: usability of framework: I generally believe that framework always > needs to model each state to handle corner cases even if it uses chained > operations (because that can have partial success). That's also why I hope to > eventually not support chained operations at all. In the final form of the > code, whatever "correct validation logic" we write now will not be necessary, > and we can implicitly rely on `offeredResources` containment checks chained > operations (because `converted` resources will not be added back > immediately). Implementing this right now would require us to add extra > variables like `observedOfferedResources` and `actualOfferedResources` and > check them on each `case` of operations, which is not be as clean or as local > as current patch (which also easier to back out). > > Chun-Hung Hsiao wrote: > I'm not sure if I get it. Instead of applying the conversion on > `_offeredResources`, we could just do > ``` > _offeredResources -= consumed; > ``` > Wouldn't this solve the issue? Am I missing anything? > > Zhitao Li wrote: > This code block > (https://github.com/apache/mesos/blob/bd688e4cf9f6f35c9d75aad50b99e1fdeb8104da/src/master/master.cpp#L5502) > requires `_offeredResources` properly tracks what allocator needs to > recover. I believe we still need to handle `converted` part of `grow` and > `shrink`, otherwise allocator could see inconsitent views. > > Chun-Hung Hsiao wrote: > Thanks for point it out Zhitao. > > Just briefly chatted with Greg, and it seems that I misunderstood what he > suggested :( > He suggested to check the size of operations here: > https://github.com/apache/mesos/blob/master/src/master/validation.cpp#L547 > which will lead to the entire call being dropped: > https://github.com/apache/mesos/blob/master/src/master/master.cpp#L2350 > And we don't need to send status updates for each LAUNCH and > LAUNCH_GROUP. It seems cleaner to me. > > re: usability of framework. If your framework team has no problem using > the call this way, or doesn't mind updating their frameworks, then it's > probably fine to have this restriction ;)
I like the suggestion to perform the validation there, although I want to know how https://github.com/apache/mesos/blob/master/src/master/master.cpp#L2350 handles recovery of resources from offers if it just drops the operation w/o going through offered resources in each offers (or maybe it does not ?) - Zhitao ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66568/#review201785 ----------------------------------------------------------- On April 23, 2018, 1:07 p.m., Zhitao Li wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66568/ > ----------------------------------------------------------- > > (Updated April 23, 2018, 1:07 p.m.) > > > Review request for mesos, Chun-Hung Hsiao and Greg Mann. > > > Bugs: MESOS-4965 > https://issues.apache.org/jira/browse/MESOS-4965 > > > Repository: mesos > > > Description > ------- > > These two operations are intended to be non-speculative eventually but > are implemented as speculative right now. To avoid frameworks opt-in to > dangerous behavior, we require that accept can only contain one > `GROW_VOLUME` or `SHRINK_VOLUME` and no other operations. > > This is implemented by reuse existing code which already drops `LAUNCH` > or `LAUNCH_GROUP` with proper reason and message. > > > Diffs > ----- > > src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 > > > Diff: https://reviews.apache.org/r/66568/diff/6/ > > > Testing > ------- > > Behavior tested in https://reviews.apache.org/r/66569/. > > > Thanks, > > Zhitao Li > >