> On June 8, 2016, 1:28 p.m., Neil Conway wrote:
> > Overall seems like a reasonable approach.
> > 
> > One thing that isn't clear to me: what is the advantage of updating the 
> > checkpoint to reflect any partial work that was done before exiting? It 
> > seems that adds a bunch of complexity and room for error. Why not only 
> > update the checkpoint if all changes were made successfully?
> 
> Anindya Sinha wrote:
>     We would need to maintain what was actually successful in any case since 
> in a DESTROY, a failed rmdir does not lead to the agent exiting. So, if we 
> were to do it at one place, we would still need to keep account of the 
> successful operations so as to not update the checkpoint based on a failed 
> rmdir as an example (and hence can be a partial update).
>     
>     Since we are keeping track of result of the operations anyway, I think it 
> is a good idea to update before exiting (only place we do that when CREATE 
> fails and the agent exits) so that the subsequent handling of 
> CheckpointResources does not need to redo such operations when the agent 
> reregisters.

On reflection, I wonder whether we should be handling `CREATE` errors 
differently from `DESTROY` errors. In both cases, the user has asked the agent 
to do something it wasn't able to do. A failed `DESTROY` has the addditional 
concern that we might have destroyed some but not all of the data on the volume.

Do you think handling `CREATE` vs. `DESTROY` errors differently is a good idea?


- Neil


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48313/#review136638
-----------------------------------------------------------


On June 9, 2016, 12:22 a.m., Anindya Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48313/
> -----------------------------------------------------------
> 
> (Updated June 9, 2016, 12:22 a.m.)
> 
> 
> Review request for mesos, Neil Conway and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-5448
>     https://issues.apache.org/jira/browse/MESOS-5448
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> o Checkpoints on the agent are updated only after successful handling
>   of persistent volume creation and deletion to maintain consistency.
> o If volume creation or deletion fails, checkpoint is updated up until
>   that point, and the agent exits.
> o This ensures that after a agent restart, checkpoints are in sync
>   between the master and the agent after the reregistration workflow.
> 
> 
> Diffs
> -----
> 
>   include/mesos/resources.hpp a557e97c65194d4aad879fb88d8edefd1c95b8d8 
>   include/mesos/v1/resources.hpp a5ba8fec4c9c3643646308f75a4b28cefe0b3df3 
>   src/common/resources.cpp f6ff92b591c15bc8e93fd85e1896349c3a7bb968 
>   src/slave/slave.cpp d635dd2c6f6fce5a9eeefc5dcdf84e00cdc833b6 
>   src/v1/resources.cpp 8c3f2d1c1529915a59d47fe37bb3fc7a3267079a 
> 
> Diff: https://reviews.apache.org/r/48313/diff/
> 
> 
> Testing
> -------
> 
> All tests passed.
> 
> 
> Thanks,
> 
> Anindya Sinha
> 
>

Reply via email to