> On May 23, 2017, 9:32 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp
> > Line 5954 (original), 5965 (patched)
> > <https://reviews.apache.org/r/56895/diff/4-6/?file=1693973#file1693973line5965>
> >
> >     Previously, looks like any errors during the partial recovery after a 
> > reboot resulted in agent flapping (e.g., checkpointed resources recovery 
> > failure, fetcher recovery failure), but now looks like the agent will 
> > remove the symlink and move along? That seems like a change in behavior
> 
> Megha Sharma wrote:
>     Now we are only cleaning the symlink and continuing if the failure was 
> because of slaveInfo mismatch. There's no recovery failure caused by fetcher 
> recovery anymore (removed in ToT) so the only class of failure left is 
> checkpointed resources failure which will not be affected by agent symlink 
> cleanup so no cleanup has been performed and it will cause the agent to flap 
> so no change in behavior here.

What is `ToT` ?

If this issue is fixed, then please mark it so. If you are unsure then feel 
free to ask further questions here that will help us in resolving it.


> On May 23, 2017, 9:32 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp
> > Line 5962 (original)
> > <https://reviews.apache.org/r/56895/diff/4-6/?file=1693973#file1693973line5978>
> >
> >     So, we are continuing if there is a recovery error and the agent has 
> > not rebooted? That doesn't sound right?
> 
> Megha Sharma wrote:
>     Fixed to continue after symlink cleanup if the recovery error is because 
> of slaveInfo mismatch, any other errors will cause the agent to behave same 
> as before.

please mark it fixed. see above.


> On May 23, 2017, 9:32 p.m., Vinod Kone wrote:
> > src/tests/slave_recovery_tests.cpp
> > Line 237 (original), 237 (patched)
> > <https://reviews.apache.org/r/56895/diff/4-6/?file=1693977#file1693977line237>
> >
> >     why this change in this review? looks independent.
> 
> Megha Sharma wrote:
>     Actually, this was done to address Neil's comment about the variable name 
> being too generic which seemed quite reasonable. See the comment below.
>     
>     `Can we rename _ack to something that identifies we're waiting for the 
> agent to see the status update acknowledgment?`

please do such changes in a different review. see my comment in the next review.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/#review175852
-----------------------------------------------------------


On June 9, 2017, 4:27 a.m., Megha Sharma wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56895/
> -----------------------------------------------------------
> 
> (Updated June 9, 2017, 4:27 a.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
>     https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> With partition awareness, the agents are now allowed to re-register
> after they have been marked Unreachable. The executors are anyway
> terminated on the agent when it reboots so there is no harm in
> letting the agent keep its SlaveID, re-register with the master
> and reconcile the lost executors. This is a pre-requisite for
> supporting persistent/restartable tasks in mesos.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/composing.cpp 
> a003e1b80dc9b4dec5b3fbbadb2daecf855c90c7 
>   src/slave/containerizer/docker.cpp 9f84109d7de22a39ace6e44e0c7d8d501bcb24de 
>   src/slave/containerizer/mesos/containerizer.cpp 
> f3e6210eccd4a6b445ffd4447e69526d424ea36d 
>   src/slave/slave.hpp 7ffaed14035a05259ec72c70532ee4f0affa1f5d 
>   src/slave/slave.cpp 7d147ac6609933ac884bfc29032dba572a0952c6 
>   src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
>   src/slave/state.cpp 18b790d2cc4f537cc9b0c3cca59b9cbaac0eda10 
>   src/tests/reservation_tests.cpp 6e9c215382ef41700921a673669ac1a7975e9b7f 
>   src/tests/slave_recovery_tests.cpp 38502584186793686f78ff5f4e03f36a3bf7ad1c 
> 
> 
> Diff: https://reviews.apache.org/r/56895/diff/7/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>

Reply via email to