----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/48313/#review139100 -----------------------------------------------------------
src/slave/paths.hpp (lines 86 - 89) <https://reviews.apache.org/r/48313/#comment204244> Actually, I just realized that existing diagram is wrong... So now that you are at it, could you fix the following as well? The whole ``` // |-- boot_id // |-- resources // | |-- resources.info // |-- resources.target ``` thing should be put under the "meta" dir. src/slave/slave.cpp (line 2522) <https://reviews.apache.org/r/48313/#comment204245> Nit: "the checkpoint"? src/slave/slave.cpp (lines 2536 - 2540) <https://reviews.apache.org/r/48313/#comment204292> LOG(ERROR)? src/slave/slave.cpp (lines 2579 - 2580) <https://reviews.apache.org/r/48313/#comment204306> This is more of a comment on the later review: Here we don't know if framework intends to create a new persistent volume, i.e., if both the new checkpointed resources and the old have the same persistent volume and the disk has the directory and it's not empty -> its OK! We only want to make sure the directory is empty when the new checkpointed resources' volume is not in the old checkpointed resources. Plus this condition only applies to MOUNT, we can easily check/CHECK the resources to find out. src/slave/slave.cpp (line 4744) <https://reviews.apache.org/r/48313/#comment204315> You can do: ``` checkpointedResources = resourcesState.get().resources; ``` Then you don't need two arguments in `syncCheckpointedResources`. I wouldn't worry about assigning the member variable before syncing the disk state because (copied from my reply to your comment): After all, the first order of business for `recover()` is to recover the state from disk and by assigning `checkpointedResources` you are faithfully recovering the exact memory state of the `checkpointedResources` variable (and the target) before the crash, right? After you have recovered `checkpointedResources` you proceed with the 2nd step: reattempting to sync the disk state for `checkpointedResources` and crash again or update `checkpointedResources`, this logical order sounds reasonable to me. src/slave/slave.cpp (line 4747) <https://reviews.apache.org/r/48313/#comment204314> Nit: Just FYI you could use `->`: ``` const Resources& target = resourcesState->target.get(); ``` src/slave/slave.cpp (lines 4758 - 4759) <https://reviews.apache.org/r/48313/#comment204316> So here you can: ``` Try<Nothing> result = syncCheckpointedResources(targetResources); ``` src/slave/slave.cpp (line 4781) <https://reviews.apache.org/r/48313/#comment204442> LOG(ERROR) is more suitable. src/slave/slave.cpp (line 4787) <https://reviews.apache.org/r/48313/#comment204318> Here you can do ``` checkpointedResources = targetResources; ``` without the need of the intermediate `resources`. src/slave/slave.cpp (line 4806) <https://reviews.apache.org/r/48313/#comment204319> And we can kill this. src/slave/state.hpp (line 271) <https://reviews.apache.org/r/48313/#comment204321> It looks a bit odd that we are explicitly initiating `target` but not `resources`. Note that it's not necessary to initiate either but it is necessary to initiate errors here. src/slave/state.hpp (lines 277 - 280) <https://reviews.apache.org/r/48313/#comment204328> s/parseResource/recoverResources/ and this could just be a non-member helper hidden in the cpp file (which I suspect can be replaced in a followup). src/slave/state.cpp (line 696) <https://reviews.apache.org/r/48313/#comment204434> Can we avoid using `commit` in code because we are still using ResourcesState.resources and the filename "resources.info" to refer to this thing which is no different from other checkpoints? Something like the following: ``` string path = paths::getResourcesInfoPath(rootDir); ... Try<Resources> resources = recoverResources(path, strict, state.errors); ... state.resources = resources.get(); ... path = paths::getResourcesTargetPath(rootDir); Try<Resources> target = recoverResources(path, strict, state.errors); state.target = target.get(); ``` src/slave/state.cpp (line 699) <https://reviews.apache.org/r/48313/#comment204323> Align the two `<<`s but we don't need to change the logging here. src/slave/state.cpp (line 715) <https://reviews.apache.org/r/48313/#comment204327> Align the two `<<`s. src/slave/state.cpp (line 719) <https://reviews.apache.org/r/48313/#comment204329> Should we use `state.errors`? src/slave/state.cpp (lines 723 - 725) <https://reviews.apache.org/r/48313/#comment204348> Why shouldn't we return an error in this case? The caller would think the target doesn't exist (so the disk state is caught up) but rather it could be that the target is not readable. Strictly speaking the agent can run with the unreadable target file but possibly in a state different than the master's (which is in the target file). Yes the master will send another update but as it could have operations "cancelled out" and look as if nothing is changed. Plus, further down the road the it's likely the file is unwriteable either and crash then. Overall I think it's better to treat resources.info and resources.target the same way. It's simpler to reason about. src/slave/state.cpp (lines 731 - 734) <https://reviews.apache.org/r/48313/#comment204437> Let's change it to a local helper and use plural `recoverResources`. It seems to me the whole method could be replaced by `::protobuf::read<RepeatedPtrField<Resource>>(file);` but let's tackle it later. - Jiang Yan Xu On June 20, 2016, 4:41 p.m., Anindya Sinha wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/48313/ > ----------------------------------------------------------- > > (Updated June 20, 2016, 4:41 p.m.) > > > Review request for mesos, Neil Conway and Jiang Yan Xu. > > > Bugs: MESOS-5448 > https://issues.apache.org/jira/browse/MESOS-5448 > > > Repository: mesos > > > Description > ------- > > When the agent receives CheckpointedResourcesMessage, we store the > target checkpoint on disk. On successful create and destroy of > persistent volumes as a part of handling this messages, we commit > the checkpoint on the disk, and clear the target checkpoint. > > However, incase of any failure we do not commit the checkpoint to > disk, and exit the agent. When the agent restarts and there is a > target checkpoint present on disk which differs from the committed > checkpoint, we retry to sync the target and committed checkpoint. > On success, we reregister the agent with the master, but in case it > fails, we do not commit the checkpoint and the agent exists. > > > Diffs > ----- > > src/slave/paths.hpp 339e539863c678b6ed4d4670d75c7ff4c54daa79 > src/slave/paths.cpp 03157f93b1e703006f95ef6d0a30afae375dcdb5 > src/slave/slave.hpp 58ff2bfac6918d989ab36b67cf6ba2f3657c8356 > src/slave/slave.cpp 4bf01f2b020f5e975fb57cffcd19865d7431eac2 > src/slave/state.hpp 0de2a4ee4fabaad612c4526166157b001c380bdb > src/slave/state.cpp 04c3d42040f186507a0e484a3ee616f1b1a77ea8 > > Diff: https://reviews.apache.org/r/48313/diff/ > > > Testing > ------- > > All tests passed. > > > Thanks, > > Anindya Sinha > >
