> On May 23, 2017, 9:32 p.m., Vinod Kone wrote: > > src/slave/slave.cpp > > Line 5954 (original), 5965 (patched) > > <https://reviews.apache.org/r/56895/diff/4-6/?file=1693973#file1693973line5965> > > > > Previously, looks like any errors during the partial recovery after a > > reboot resulted in agent flapping (e.g., checkpointed resources recovery > > failure, fetcher recovery failure), but now looks like the agent will > > remove the symlink and move along? That seems like a change in behavior > > Megha Sharma wrote: > Now we are only cleaning the symlink and continuing if the failure was > because of slaveInfo mismatch. There's no recovery failure caused by fetcher > recovery anymore (removed in ToT) so the only class of failure left is > checkpointed resources failure which will not be affected by agent symlink > cleanup so no cleanup has been performed and it will cause the agent to flap > so no change in behavior here.
What is `ToT` ? If this issue is fixed, then please mark it so. If you are unsure then feel free to ask further questions here that will help us in resolving it. > On May 23, 2017, 9:32 p.m., Vinod Kone wrote: > > src/slave/slave.cpp > > Line 5962 (original) > > <https://reviews.apache.org/r/56895/diff/4-6/?file=1693973#file1693973line5978> > > > > So, we are continuing if there is a recovery error and the agent has > > not rebooted? That doesn't sound right? > > Megha Sharma wrote: > Fixed to continue after symlink cleanup if the recovery error is because > of slaveInfo mismatch, any other errors will cause the agent to behave same > as before. please mark it fixed. see above. > On May 23, 2017, 9:32 p.m., Vinod Kone wrote: > > src/tests/slave_recovery_tests.cpp > > Line 237 (original), 237 (patched) > > <https://reviews.apache.org/r/56895/diff/4-6/?file=1693977#file1693977line237> > > > > why this change in this review? looks independent. > > Megha Sharma wrote: > Actually, this was done to address Neil's comment about the variable name > being too generic which seemed quite reasonable. See the comment below. > > `Can we rename _ack to something that identifies we're waiting for the > agent to see the status update acknowledgment?` please do such changes in a different review. see my comment in the next review. - Vinod ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/56895/#review175852 ----------------------------------------------------------- On June 9, 2017, 4:27 a.m., Megha Sharma wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/56895/ > ----------------------------------------------------------- > > (Updated June 9, 2017, 4:27 a.m.) > > > Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu. > > > Bugs: MESOS-6223 > https://issues.apache.org/jira/browse/MESOS-6223 > > > Repository: mesos > > > Description > ------- > > With partition awareness, the agents are now allowed to re-register > after they have been marked Unreachable. The executors are anyway > terminated on the agent when it reboots so there is no harm in > letting the agent keep its SlaveID, re-register with the master > and reconcile the lost executors. This is a pre-requisite for > supporting persistent/restartable tasks in mesos. > > > Diffs > ----- > > src/slave/containerizer/composing.cpp > a003e1b80dc9b4dec5b3fbbadb2daecf855c90c7 > src/slave/containerizer/docker.cpp 9f84109d7de22a39ace6e44e0c7d8d501bcb24de > src/slave/containerizer/mesos/containerizer.cpp > f3e6210eccd4a6b445ffd4447e69526d424ea36d > src/slave/slave.hpp 7ffaed14035a05259ec72c70532ee4f0affa1f5d > src/slave/slave.cpp 7d147ac6609933ac884bfc29032dba572a0952c6 > src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 > src/slave/state.cpp 18b790d2cc4f537cc9b0c3cca59b9cbaac0eda10 > src/tests/reservation_tests.cpp 6e9c215382ef41700921a673669ac1a7975e9b7f > src/tests/slave_recovery_tests.cpp 38502584186793686f78ff5f4e03f36a3bf7ad1c > > > Diff: https://reviews.apache.org/r/56895/diff/7/ > > > Testing > ------- > > make check > > > Thanks, > > Megha Sharma > >
