----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71285/#review217398 -----------------------------------------------------------
Patch looks great! Reviews applied: [71284, 71285] Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose --disable-libtool-wrappers --disable-parallel-test-execution' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker-build.sh - Mesos Reviewbot On Aug. 22, 2019, 8:40 p.m., Greg Mann wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71285/ > ----------------------------------------------------------- > > (Updated Aug. 22, 2019, 8:40 p.m.) > > > Review request for mesos, Gastón Kleiman, James Peach, and Joseph Wu. > > > Bugs: MESOS-9875 > https://issues.apache.org/jira/browse/MESOS-9875 > > > Repository: mesos > > > Description > ------- > > Fixes an issue where the agent may incorrectly send an > OPERATION_FINISHED update for a failed offer operation > following agent failover and recovery. > > The agent previously relied on the difference between the > set of checkpointed operations and the set of operation > IDs recovered from the operation status update manager to > apply any operations which had not been applied due to an > ill-timed agent failover. > > However, this approach did not work in the case where a > persistent volume failed to be successfully created by > syncCheckpointedResources(). In order to handle this > case, this patch changes the agent code to continue with > the old approach of a two-phase-commit of persistent > volumes to disk, where the agent will fail to complete > recovery if syncCheckpointedResources() cannot be > executed successfully after failover. > > > Diffs > ----- > > src/slave/paths.hpp e077587fd02bd8e35fee7ce12ae436e3dca25e47 > src/slave/paths.cpp 28a7cf9f9c70fb31eeefe2e823cd7e19ffcf126a > src/slave/slave.cpp 1d0ec9d2428c3ffa28ad3e960b74f171013cf0c2 > src/slave/state.cpp cd3fac72dd57da21ed5ac46b17066531af26d42a > > > Diff: https://reviews.apache.org/r/71285/diff/3/ > > > Testing > ------- > > Tested manually by doing the following: > > 1) Run a master > 2) Run an agent with statically reserved resources for use in the following > step > 3) Run the persistent volume example framework to create a volume on the > agent, which leads to checkpointing of resources > 4) Use `chattr -R +i /agent/volumes/dir` to make the agent's persistent > volume directory immutable > 5) Run the persistent volume example framework again; it will fail and the > agent will crash > 6) Restart the agent; confirm that it continues to crash > 7) Use `chattr -R -i /agent/volumes/dir` to remove the immutable attribute > 8) Restart the agent; confirm that it recovers successfully, with the > persistent volume created on disk > 9) Run the persistent volume example framework again; it should succeed > > > Thanks, > > Greg Mann > >