> On Aug. 28, 2017, 11:31 p.m., Jie Yu wrote:
> > src/master/validation.cpp
> > Lines 2205 (patched)
> > <https://reviews.apache.org/r/61946/diff/1/?file=1806110#file1806110line2205>
> >
> >     I think `checkpointedResources` should not be used for Resource
> >     Provider provided resources. It should only apply to agent default
> >     resources. For RP provided resources, the checkpointing should be done
> >     by the corresponding resource provider, not the agent.
> >
> >     As a result, for operations like RESERVE/UNRESERVE/CREATE/DESTROY, we
> >     need to send the operation to the corresponding resource provider as
> >     well. This does make sense. If we ask the agent to persist that
> >     information, what will be the semantics if the resource provider is
> >     marked as gone?
> >
> >     However, this does get complicated if we want to guarantee ordering for
> >     operations in one `acceptOffers` call (for backwards compatibility), and
> >     we do want to allow frameworks to launch a task right after a reserve
> >     operation (the current semantics).
> >
> >     To support that, I think we need to speculatively assume the operation
> >     will be successful (thus allowing a subsequent launch immediately at the
> >     master side). However, when the checkpointing fails, we need a way to
> >     abort the subsequent launch at the agent side. This is essentially why
> >     we previously CHECK fail at the agent if the checkpointing fails for
> >     `checkpointedResources`.
> >
> >     For the resource provider case, we should do the same thing. We can
> >     abort the agent if a checkpointing fails. However, this only applies to
> >     a local resource provider (LRP) that lives in the agent process. If an
> >     LRP is outside of the agent process, how to abort the subsequent task
> >     launch if a previous operation fails is something we should think about.
> >     For instance, always reject operations from the agent's RP manager if
> >     the operation is for a stale stream ID?

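As a rough illustration of the stale stream ID idea at the end of the quoted comment, here is a minimal sketch. It assumes each operation carries the ID of the subscription stream it was issued for, and that a resource provider gets a fresh stream ID whenever it resubscribes. The names `Operation` and `StreamRegistry` are hypothetical and are not taken from the Mesos code base or from this patch.

```cpp
#include <map>
#include <string>

// Hypothetical types for illustration only; not part of Mesos.
struct Operation {
  std::string resourceProviderId;
  std::string streamId;  // Stream the operation was issued for.
};

class StreamRegistry {
public:
  // Record the current stream ID when a resource provider (re)subscribes.
  // Any previously recorded stream ID for that provider becomes stale.
  void subscribe(const std::string& providerId, const std::string& streamId) {
    current_[providerId] = streamId;
  }

  // Accept an operation only if it targets the provider's current stream.
  // Operations for unknown providers or stale streams are rejected.
  bool accept(const Operation& op) const {
    auto it = current_.find(op.resourceProviderId);
    return it != current_.end() && it->second == op.streamId;
  }

private:
  std::map<std::string, std::string> current_;
};
```

With something like this, an operation forwarded after the resource provider has failed and resubscribed would no longer match the current stream and would be dropped. Whether this is enough to abort a speculatively accepted launch is exactly the open question.
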
Fully agreed, thanks for bringing up the challenges with handling `RESERVE`/`UNRESERVE`/`CREATE`/`DESTROY` with local and external resource providers. One idea for solving this with external resource providers (ERPs) could be to rescind a launch, similar to how we rescind offers. For example, an ERP would send a rescind message to the master, which then instructs the agent to stop the launch.


- Jan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61946/#review183988
-----------------------------------------------------------


On Aug. 28, 2017, 5:28 p.m., Jan Schlicht wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61946/
> -----------------------------------------------------------
> 
> (Updated Aug. 28, 2017, 5:28 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier and Jie Yu.
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Added validation of resource provider operations.
> 
> 
> Diffs
> -----
> 
>   src/master/validation.hpp f4925752f20ae8ca4de1d9b4a3d5ffc394db9585 
>   src/master/validation.cpp 7c3247d407c9e6aa8cce457d6c6be0c39f4b532f 
> 
> 
> Diff: https://reviews.apache.org/r/61946/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Jan Schlicht
> 
>
