> On June 8, 2016, 1:28 p.m., Neil Conway wrote:
> > Overall seems like a reasonable approach.
> >
> > One thing that isn't clear to me: what is the advantage of updating the
> > checkpoint to reflect any partial work that was done before exiting? It
> > seems that adds a bunch of complexity and room for error. Why not only
> > update the checkpoint if all changes were made successfully?
>
> Anindya Sinha wrote:
>     We would need to maintain what was actually successful in any case since
>     in a DESTROY, a failed rmdir does not lead to the agent exiting. So, if we
>     were to do it at one place, we would still need to keep account of the
>     successful operations so as to not update the checkpoint based on a failed
>     rmdir as an example (and hence can be a partial update).
>
>     Since we are keeping track of the result of the operations anyway, I think
>     it is a good idea to update before exiting (the only place we do that is
>     when CREATE fails and the agent exits) so that the subsequent handling of
>     CheckpointResources does not need to redo such operations when the agent
>     reregisters.

On reflection, I wonder whether we should be handling `CREATE` errors
differently from `DESTROY` errors. In both cases, the user has asked the
agent to do something it wasn't able to do. A failed `DESTROY` has the
additional concern that we might have destroyed some but not all of the
data on the volume. Do you think handling `CREATE` vs. `DESTROY` errors
differently is a good idea?


- Neil


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48313/#review136638
-----------------------------------------------------------


On June 9, 2016, 12:22 a.m., Anindya Sinha wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48313/
> -----------------------------------------------------------
>
> (Updated June 9, 2016, 12:22 a.m.)
>
>
> Review request for mesos, Neil Conway and Jiang Yan Xu.
>
>
> Bugs: MESOS-5448
>     https://issues.apache.org/jira/browse/MESOS-5448
>
>
> Repository: mesos
>
>
> Description
> -------
>
> o Checkpoints on the agent are updated only after successful handling
>   of persistent volume creation and deletion to maintain consistency.
> o If volume creation or deletion fails, checkpoint is updated up until
>   that point, and the agent exits.
> o This ensures that after an agent restart, checkpoints are in sync
>   between the master and the agent after the reregistration workflow.
>
>
> Diffs
> -----
>
>   include/mesos/resources.hpp a557e97c65194d4aad879fb88d8edefd1c95b8d8
>   include/mesos/v1/resources.hpp a5ba8fec4c9c3643646308f75a4b28cefe0b3df3
>   src/common/resources.cpp f6ff92b591c15bc8e93fd85e1896349c3a7bb968
>   src/slave/slave.cpp d635dd2c6f6fce5a9eeefc5dcdf84e00cdc833b6
>   src/v1/resources.cpp 8c3f2d1c1529915a59d47fe37bb3fc7a3267079a
>
> Diff: https://reviews.apache.org/r/48313/diff/
>
>
> Testing
> -------
>
> All tests passed.
>
>
> Thanks,
>
> Anindya Sinha
>
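
For reference, the behavior discussed in this thread (a failed `CREATE` is fatal but the checkpoint still records the work done up to that point, while a failed `DESTROY` is non-fatal and is simply left out of the checkpoint update) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the patch: `OpType`, `Operation`, `ApplyResult`, `applyOperations`, and the `execute` callback are all hypothetical names, and a plain set of volume names stands in for the checkpointed resources.

```cpp
#include <functional>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; none of these names come from
// the Mesos code base.
enum class OpType { CREATE, DESTROY };

struct Operation {
  OpType type;
  std::string volume;
};

struct ApplyResult {
  std::set<std::string> checkpoint;  // Volumes to record in the checkpoint.
  bool fatal;                        // True if the agent should exit.
};

// Applies `ops` in order on top of the `current` checkpoint. `execute`
// performs the filesystem work (e.g. mkdir or rmdir) and returns true on
// success.
ApplyResult applyOperations(
    const std::set<std::string>& current,
    const std::vector<Operation>& ops,
    const std::function<bool(const Operation&)>& execute)
{
  ApplyResult result{current, false};

  for (const Operation& op : ops) {
    const bool ok = execute(op);

    if (op.type == OpType::CREATE) {
      if (!ok) {
        // Failed CREATE: keep the partial progress made so far and stop,
        // so the caller can checkpoint it before exiting.
        result.fatal = true;
        break;
      }
      result.checkpoint.insert(op.volume);
    } else if (ok) {
      // Successful DESTROY: drop the volume from the checkpoint.
      result.checkpoint.erase(op.volume);
    }
    // Failed DESTROY: not fatal; the volume stays in the checkpoint.
  }

  return result;
}
```

In this sketch the caller would write `result.checkpoint` to disk in both the success and the failure case, and exit afterwards only when `result.fatal` is set. The alternative raised in the review (update the checkpoint only when every operation succeeds) would instead discard `result.checkpoint` whenever any `execute` call fails.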
