> On June 8, 2016, 1:28 p.m., Neil Conway wrote:
> > Overall seems like a reasonable approach.
> > 
> > One thing that isn't clear to me: what is the advantage of updating the 
> > checkpoint to reflect any partial work that was done before exiting? It 
> > seems that adds a bunch of complexity and room for error. Why not only 
> > update the checkpoint if all changes were made successfully?
> 
> Anindya Sinha wrote:
>     We would need to track which operations actually succeeded in any case, 
> since in a DESTROY a failed rmdir does not lead to the agent exiting. So, 
> even if we were to do it in one place, we would still need to keep account 
> of the successful operations so as to not update the checkpoint based on a 
> failed rmdir, for example (and hence it can be a partial update).
>     
>     Since we are keeping track of the results of the operations anyway, I 
> think it is a good idea to update before exiting (the only place we do that 
> is when CREATE fails and the agent exits) so that the subsequent handling of 
> CheckpointResources does not need to redo such operations when the agent 
> reregisters.
> 
> Neil Conway wrote:
>     On reflection, I wonder whether we should be handling `CREATE` errors 
> differently from `DESTROY` errors. In both cases, the user has asked the 
> agent to do something it wasn't able to do. A failed `DESTROY` has the 
> additional concern that we might have destroyed some but not all of the data 
> on the volume.
>     
>     Do you think handling `CREATE` vs. `DESTROY` errors differently is a good 
> idea?
> 
> Anindya Sinha wrote:
>     Good point. Here is what I think the use cases are:
>     Say we have checkpointed resources (CR) as {V1,V2}, where V1 and V2 are 
> two persistent volumes. So, CR(master) = {V1,V2} and CR(agent) = {V1,V2}.
>     If we now receive a DESTROY(V2): CR(master) = {V1} [since master's view 
> is not dependent on what happened on the agent]. Suppose that fails on the 
> agent, so CR(agent) = {V1,V2} [since we do not update checkpointed resources 
> on the agent on failure in DESTROY, which results in inconsistency between 
> master and agent at this point in time].
>     
>     Case 1 (current implementation): Agent does not restart on failure in 
> DESTROY. Hence, CR(agent) = {V1,V2}. When the next CheckpointResources is 
> received, i.e. on a subsequent CREATE/DESTROY/RESERVE/UNRESERVE on a 
> different resource, DESTROY(V2) will be reattempted, and if that is 
> successful, agent and master will be back in sync. However, if the next 
> CheckpointResources is due to a CREATE(V2) [which can happen since V2 is 
> available as a resource based on an offer from the master], that would be a 
> no-op on the agent, since the agent does not treat it as a new resource 
> based on its checkpoint: at this point CR(master) = {V1,V2} and 
> CR(agent) = {V1,V2}, which would be a problem.
>     
>     Case 2 (if we exit the agent on failure): The agent restarts, which 
> triggers a CheckpointResources from master->agent on ReregisterSlave. That 
> would force a reattempt of DESTROY(V2), since the current view is 
> CR(master) = {V1} and CR(agent) = {V1,V2}, which will bring the checkpointed 
> resources back in sync between master and agent.
>     
>     So, I think it might be a better option to exit the agent on failure in 
> DESTROY as well. However, I think we should still update the checkpoint 
> based on the status of the successful operations (other CREATEs/DESTROYs) on 
> failure (when the agent exits), so as to avoid repeating these operations in 
> a subsequent CheckpointResources message. Does that sound reasonable to you?
>     
>     Note: I think this use case is probably a good example to consider 
> StatusUpdates (or something similar) for operations on reserved resources, 
> viz. CREATE/DESTROY/RESERVE/UNRESERVE, to ensure master and agent are in 
> sync and to guarantee the view of offers (to frameworks) for reserved 
> resources.
> 
> Neil Conway wrote:
>     Thanks for the writeup! Makes sense.
>     
>     If we cause the agent to exit on DESTROY failure but only re-apply 
> DESTROYs at `ReregisterSlave`, it seems to me there is still a window in 
> which another `CREATE` can be applied at the master. That would mean we 
> wouldn't reapply the `DESTROY`, which would be bad.
>     
>     My concern about updating the checkpointed resources to reflect *partial* 
> results is that (a) it seems unnecessary, (b) it means the checkpointed 
> resources don't reflect either the *original* or the *desired* state of the 
> agent, which seems problematic.
>     
>     What about the following: when an agent gets a 
> `CheckpointedResourcesMessage` that differs from its current state, it first 
> writes the desired resource state to a new checkpoint file, 
> `checkpoint.target`. Then it tries to apply the delta between 
> `checkpoint.target` and the current checkpoint. If the on-disk state at the 
> agent is updated successfully to match `checkpoint.target`, we rename 
> `checkpoint.target` -> `checkpoint` and we're done. Otherwise, if any I/O 
> operation fails, we exit the agent. Then on restart, we check for the 
> existence of `checkpoint.target` *before* we try to reregister with the 
> master; if `checkpoint.target` != `checkpoint`, we resume trying to apply the 
> delta between `checkpoint.target` and `checkpoint`; if that fails we exit 
> again, and if it succeeds we rename `checkpoint.target` -> `checkpoint` and 
> then continue to reregister with the master.
>     
>     Let me know what you think.
> 
> Anindya Sinha wrote:
>     > If we cause the agent to exit on DESTROY failure, if we only re-apply 
> DESTROYs at ReregisterSlave, it seems to me there is still a window in which 
> another CREATE can be applied at the master. That would mean we wouldn't 
> reapply the DESTROY, which would be bad.
>     
>     Agreed, but I think we agree that the window of failure is very small. 
> The agent needs to fail during processing of a `CheckpointResourcesMessage` 
> pertaining to a DESTROY, an offer goes out to a framework with the disk space 
> as available, and a CREATE for the same volume and the same persistence id is 
> received prior to the successful reregister of the agent.
>     
>     > What about the following: when an agent gets a 
> CheckpointedResourcesMessage that differs from its current state, it first 
> writes the desired resource state to a new checkpoint file, 
> checkpoint.target. Then it tries to apply the delta between checkpoint.target 
> and the current checkpoint. If the on-disk state at the agent is updated 
> successfully to match checkpoint.target, we rename checkpoint.target -> 
> checkpoint and we're done. Otherwise, if any I/O operation fails, we exit the 
> agent. Then on restart, we check for the existence of checkpoint.target 
> before we try to reregister with the master; if checkpoint.target != 
> checkpoint, we resume trying to apply the delta between checkpoint.target and 
> checkpoint; if that fails we exit again, and if it succeeds we rename 
> checkpoint.target -> checkpoint and then continue to reregister with the 
> master.
>     
>     This would work as far as retrying DESTROY on the agent is concerned; 
> however, the master would still continue to offer frameworks resources that are 
> not available. If DESTROY fails (continuously even after multiple retries), 
> the master has already offered that resource as available to the framework 
> which seems incorrect. I think we need a longer term discussion on something 
> similar to StatusUpdates for reserved resources (agent->master->framework) so 
> that such a resource is offered to the framework only after the agent 
> successfully processes it.
>     
>     I will post an update to this RR to address the remaining comments, and 
> modify it as appropriate based on the outcome of the current thread.
> 
> Neil Conway wrote:
>     > I think we agree that the window of failure is very small.
>     
>     The window is indeed small, but it is still possible. If we can eliminate 
> the window of failure entirely without too much complexity, I think that is 
> worth investigating.
>     
>     > If DESTROY fails (continuously even after multiple retries), the master 
> has already offered that resource as available to the framework which seems 
> incorrect.
>     
>     Indeed -- although the master can offer resources that are unavailable in 
> general (e.g., master offers resource at an agent and concurrently the agent 
> crashes; task launches will return TASK_LOST). I completely agree that we 
> want something akin to status updates + reconciliation for reliably making 
> reservations and dynamic volumes -- +1 to the longer-term discussion.

Sounds good.


- Anindya


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48313/#review136638
-----------------------------------------------------------


On June 15, 2016, 11 p.m., Anindya Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48313/
> -----------------------------------------------------------
> 
> (Updated June 15, 2016, 11 p.m.)
> 
> 
> Review request for mesos, Neil Conway and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-5448
>     https://issues.apache.org/jira/browse/MESOS-5448
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When the agent receives CheckpointedResourcesMessage, we store the
> target checkpoint on disk. On successful create and destroy of
> persistent volumes as a part of handling this message, we commit
> the checkpoint to disk, and clear the target checkpoint.
> 
> However, in case of any failure we do not commit the checkpoint to
> disk, and exit the agent. When the agent restarts and there is a
> target checkpoint present on disk which differs from the committed
> checkpoint, we retry syncing the target and committed checkpoints.
> On success, we reregister the agent with the master, but if that
> fails, we do not commit the checkpoint and the agent exits.
> 
> 
> Diffs
> -----
> 
>   src/slave/paths.hpp 339e539863c678b6ed4d4670d75c7ff4c54daa79 
>   src/slave/paths.cpp 03157f93b1e703006f95ef6d0a30afae375dcdb5 
>   src/slave/slave.hpp 3ed026e4b7e886ae738567369ee5750591ef97d9 
>   src/slave/slave.cpp 0af04d6fe53f92e03905fb7b3bec72b09d5e8e57 
>   src/slave/state.hpp 0de2a4ee4fabaad612c4526166157b001c380bdb 
>   src/slave/state.cpp 04c3d42040f186507a0e484a3ee616f1b1a77ea8 
> 
> Diff: https://reviews.apache.org/r/48313/diff/
> 
> 
> Testing
> -------
> 
> All tests passed.
> 
> 
> Thanks,
> 
> Anindya Sinha
> 
>
