> On June 8, 2016, 1:28 p.m., Neil Conway wrote:
> > Overall seems like a reasonable approach.
> >
> > One thing that isn't clear to me: what is the advantage of updating the
> > checkpoint to reflect any partial work that was done before exiting? It
> > seems that adds a bunch of complexity and room for error. Why not only
> > update the checkpoint if all changes were made successfully?
>
> Anindya Sinha wrote:
>     We would need to track which operations actually succeeded in any case,
>     since for a DESTROY a failed rmdir does not cause the agent to exit. So,
>     if we were to do it in one place, we would still need to keep account of
>     the successful operations so as not to update the checkpoint based on a
>     failed rmdir, for example (and hence the update can be partial).
>
>     Since we are keeping track of the results of the operations anyway, I
>     think it is a good idea to update the checkpoint before exiting (the only
>     place we do that is when CREATE fails and the agent exits) so that the
>     subsequent handling of CheckpointResources does not need to redo such
>     operations when the agent reregisters.
>
> Neil Conway wrote:
> On reflection, I wonder whether we should be handling `CREATE` errors
> differently from `DESTROY` errors. In both cases, the user has asked the
> agent to do something it wasn't able to do. A failed `DESTROY` has the
>     additional concern that we might have destroyed some but not all of the data
> on the volume.
>
> Do you think handling `CREATE` vs. `DESTROY` errors differently is a good
> idea?
>
> Anindya Sinha wrote:
> Good point. Here is what I think are the use cases:
> Say we have checkpointed resources (CR) as {V1,V2} where V1, V2 are 2
> persistent volumes. So, CR(master) = {V1,V2}, and CR(agent) = {V1,V2}.
> If we now receive a DESTROY(V2): CR(master) = {V1} [since master's view
> is not dependent on what happened on the agent]. Suppose that fails on the
> agent, so CR(agent) = {V1,V2} [since we do not update checkpoint resources on
> agent on failure in DESTROY, which results in inconsistency between master
> and agent at this point of time].
>
> Case 1 (current implementation): Agent does not restart on failure in
> DESTROY. Hence, CR(agent) = {V1,V2}. When the next CheckpointResources is
>     received, i.e. on a subsequent CREATE/DESTROY/RESERVE/UNRESERVE on a
>     different resource, DESTROY(V2) will be reattempted and if that is
>     successful, agent and master will be back in sync. However, if the next
>     CheckpointResources is due to a CREATE(V2) [which can happen since V2 is
>     available as a resource based on the offer from the master], that would be
>     a no-op on the agent because the agent does not treat it as a new resource
>     based on its checkpoint; at this point CR(master) = {V1,V2} and
>     CR(agent) = {V1,V2}, which would be a problem.
>
> Case 2 (if we exit agent on failure): The agent restarts which triggers a
> CheckpointResources from master->agent on ReregisterSlave. That would force a
> reattempt of DESTROY(V2) since current view is CR(master) = {V1} and
> CR(agent) = {V1,V2} which will reattempt to bring the checkpointed resources
> back in sync between master and agent.
>
>     So, I think it might be a better option to exit the agent on failure in
>     DESTROY as well. However, I think we should still update the checkpoint
>     based on the status of the successful operations (other CREATEs/DESTROYs)
>     on failure (when the agent exits) so as to avoid repeating those
>     operations in a subsequent CheckpointResources message. Does that sound
>     reasonable to you?
>
>     Note: I think this use case is probably a good example to consider
>     StatusUpdates (or something similar) for operations on reserved resources,
>     viz. CREATE/DESTROY/RESERVE/UNRESERVE, to keep master and agent in sync
>     and guarantee a consistent view of offers (to frameworks) for reserved
>     resources.
>
> Neil Conway wrote:
> Thanks for the writeup! Makes sense.
>
>     If we cause the agent to exit on DESTROY failure and only re-apply
>     DESTROYs at `ReregisterSlave`, it seems to me there is still a window in
> which another `CREATE` can be applied at the master. That would mean we
> wouldn't reapply the `DESTROY`, which would be bad.
>
> My concern about updating the checkpointed resources to reflect *partial*
> results is that (a) it seems unnecessary, (b) it means the checkpointed
> resources don't reflect either the *original* or the *desired* state of the
> agent, which seems problematic.
>
> What about the following: when an agent gets a
> `CheckpointedResourcesMessage` that differs from its current state, it first
> writes the desired resource state to a new checkpoint file,
> `checkpoint.target`. Then it tries to apply the delta between
> `checkpoint.target` and the current checkpoint. If the on-disk state at the
> agent is updated successfully to match `checkpoint.target`, we rename
> `checkpoint.target` -> `checkpoint` and we're done. Otherwise, if any I/O
> operation fails, we exit the agent. Then on restart, we check for the
> existence of `checkpoint.target` *before* we try to reregister with the
> master; if `checkpoint.target` != `checkpoint`, we resume trying to apply the
> delta between `checkpoint.target` and `checkpoint`; if that fails we exit
> again, and if it succeeds we rename `checkpoint.target` -> `checkpoint` and
> then continue to reregister with the master.
>
> Let me know what you think.
>
> Anindya Sinha wrote:
>     > If we cause the agent to exit on DESTROY failure and only re-apply
>     DESTROYs at ReregisterSlave, it seems to me there is still a window in which
> another CREATE can be applied at the master. That would mean we wouldn't
> reapply the DESTROY, which would be bad.
>
>     Agreed, though I think we agree that the window of failure is very small.
>     The agent needs to fail during processing of a `CheckpointResourcesMessage`
>     pertaining to a DESTROY, an offer goes out to a framework with the disk
>     space as available, and a CREATE for the same volume and the same
>     persistence id is received prior to the agent successfully reregistering.
>
> > What about the following: when an agent gets a
> CheckpointedResourcesMessage that differs from its current state, it first
> writes the desired resource state to a new checkpoint file,
> checkpoint.target. Then it tries to apply the delta between checkpoint.target
> and the current checkpoint. If the on-disk state at the agent is updated
> successfully to match checkpoint.target, we rename checkpoint.target ->
> checkpoint and we're done. Otherwise, if any I/O operation fails, we exit the
> agent. Then on restart, we check for the existence of checkpoint.target
> before we try to reregister with the master; if checkpoint.target !=
> checkpoint, we resume trying to apply the delta between checkpoint.target and
> checkpoint; if that fails we exit again, and if it succeeds we rename
> checkpoint.target -> checkpoint and then continue to reregister with the
> master.
>
>     This would work as far as retrying DESTROY on the agent is concerned;
>     however, the master would still continue to offer frameworks resources
>     that are not actually available. If DESTROY fails (continuously, even
>     after multiple retries), the master has already offered that resource as
>     available to the framework, which seems incorrect. I think we need a
>     longer term discussion on something similar to StatusUpdates for reserved
>     resources (agent->master->framework) so that such a resource is offered to
>     the framework only after the agent successfully processes it.
>
> I will post an update to this RR to address the remaining comments, and
> modify it as appropriate as we decide based on the current thread.
>
> Neil Conway wrote:
> > I think we agree that the window of failure is very small.
>
> The window is indeed small, but it is still possible. If we can eliminate
> the window of failure entirely without too much complexity, I think that is
> worth investigating.
>
> > If DESTROY fails (continuously even after multiple retries), the master
> has already offered that resource as available to the framework which seems
> incorrect.
>
> Indeed -- although the master can offer resources that are unavailable in
> general (e.g., master offers resource at an agent and concurrently the agent
> crashes; task launches will return TASK_LOST). I completely agree that we
> want something akin to status updates + reconciliation for reliably making
> reservations and dynamic volumes -- +1 to the longer-term discussion.
Sounds good.
- Anindya
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48313/#review136638
-----------------------------------------------------------
On June 15, 2016, 11 p.m., Anindya Sinha wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48313/
> -----------------------------------------------------------
>
> (Updated June 15, 2016, 11 p.m.)
>
>
> Review request for mesos, Neil Conway and Jiang Yan Xu.
>
>
> Bugs: MESOS-5448
> https://issues.apache.org/jira/browse/MESOS-5448
>
>
> Repository: mesos
>
>
> Description
> -------
>
> When the agent receives a CheckpointedResourcesMessage, we store the
> target checkpoint on disk. On successful create and destroy of
> persistent volumes as part of handling this message, we commit the
> checkpoint to disk and clear the target checkpoint.
>
> However, in case of any failure we do not commit the checkpoint to
> disk, and we exit the agent. When the agent restarts and a target
> checkpoint is present on disk which differs from the committed
> checkpoint, we retry syncing the target and committed checkpoints.
> On success, we reregister the agent with the master, but if that
> fails, we do not commit the checkpoint and the agent exits.
>
>
> Diffs
> -----
>
> src/slave/paths.hpp 339e539863c678b6ed4d4670d75c7ff4c54daa79
> src/slave/paths.cpp 03157f93b1e703006f95ef6d0a30afae375dcdb5
> src/slave/slave.hpp 3ed026e4b7e886ae738567369ee5750591ef97d9
> src/slave/slave.cpp 0af04d6fe53f92e03905fb7b3bec72b09d5e8e57
> src/slave/state.hpp 0de2a4ee4fabaad612c4526166157b001c380bdb
> src/slave/state.cpp 04c3d42040f186507a0e484a3ee616f1b1a77ea8
>
> Diff: https://reviews.apache.org/r/48313/diff/
>
>
> Testing
> -------
>
> All tests passed.
>
>
> Thanks,
>
> Anindya Sinha
>
>