-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59584/#review176167
-----------------------------------------------------------


src/slave/flags.cpp
Lines 355 (patched)
<https://reviews.apache.org/r/59584/#comment249521>

    s/sent to the executor/sent to the executor during recovery/


src/slave/flags.cpp
Lines 356 (patched)
<https://reviews.apache.org/r/59584/#comment249522>

    s/MESOS-5322/MESOS-5332/


src/slave/flags.cpp
Lines 365-366 (patched)
<https://reviews.apache.org/r/59584/#comment249523>

    Maybe something like:

    these "old" executors will reply on their half-open connection and
    receive a RST; without any retries, they will fail to reconnect and
    be killed by the agent once the executor re-registration timeout
    elapses.


src/slave/slave.cpp
Lines 5964-5965 (patched)
<https://reviews.apache.org/r/59584/#comment249525>

    Ditto, as above.


src/slave/slave.cpp
Lines 5967 (patched)
<https://reviews.apache.org/r/59584/#comment249526>

    s/an optional/optional/


src/slave/slave.cpp
Lines 5972-5973 (patched)
<https://reviews.apache.org/r/59584/#comment249527>

    Is this TODO necessary, since this entire block only executes when
    `executor->pid.isSome() && executor->pid.get()`?


src/slave/slave.cpp
Lines 5975-5979 (patched)
<https://reviews.apache.org/r/59584/#comment249530>

    Why const ref for the IDs but not for the retry interval?


- Greg Mann


On May 26, 2017, 12:56 a.m., Benjamin Mahler wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59584/
> -----------------------------------------------------------
>
> (Updated May 26, 2017, 12:56 a.m.)
>
>
> Review request for mesos, Anand Mazumdar, Greg Mann, and Vinod Kone.
>
>
> Bugs: MESOS-5332, MESOS-7057 and MESOS-7569
>     https://issues.apache.org/jira/browse/MESOS-5332
>     https://issues.apache.org/jira/browse/MESOS-7057
>     https://issues.apache.org/jira/browse/MESOS-7569
>
>
> Repository: mesos
>
>
> Description
> -------
>
> PID-based v0 executors using Mesos libraries >= 1.1.2 always re-link
> with the agent upon receiving the reconnect message. This avoids the
> executor replying on a half-open TCP connection to the old agent
> (possible if netfilter is dropping packets, see: MESOS-7057).
> However, PID-based executors using Mesos libraries < 1.1.2 do not
> re-link and are therefore prone to replying on a half-open connection
> after the agent restarts. If we only send a single reconnect message,
> these "old" executors will reply on their half-open connection,
> receive a RST, and think the agent just died. To ensure these "old"
> executors can reconnect in the presence of netfilter dropping packets,
> we introduced optional retries of the reconnect message. This results
> in "old" executors correctly establishing a link when processing the
> second reconnect message.
>
> Generally, users should not enable this flag unless they are affected
> by this issue.
>
>
> Diffs
> -----
>
>   src/slave/flags.hpp b66995630f89dfb95a6d0cf66efc5d7590e90cbc
>   src/slave/flags.cpp 0c8276e425a6a7d22ee68edc6cc25b331635ec44
>   src/slave/slave.cpp 15e4d68714556ca30a766acd3b9729367df680c3
>
>
> Diff: https://reviews.apache.org/r/59584/diff/1/
>
>
> Testing
> -------
>
> Added tests in follow up reviews.
>
>
> Thanks,
>
> Benjamin Mahler
>
>
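
For readers following the quoted description, here is a toy model of why a second reconnect message helps "old" executors. It is only a sketch under stated assumptions: `Executor`, `deliverReconnect` and `reconnectWithRetries` are hypothetical names invented for illustration, nothing here is taken from the patch or from the Mesos API, and the half-open socket is reduced to a boolean flag.

```cpp
#include <cassert>
#include <cstddef>

// Toy model (hypothetical, not Mesos code) of a PID-based v0 executor.
struct Executor
{
  explicit Executor(bool relinks) : relinksOnReconnect(relinks) {}

  bool relinksOnReconnect;   // True for libraries >= 1.1.2.
  bool linkHalfOpen = true;  // Stale socket left over from the old agent.
  bool registered = false;   // Set once a reply reaches the new agent.
};

// Delivers one reconnect message to the executor.
// Returns true if the executor's reply reached the restarted agent.
bool deliverReconnect(Executor& executor)
{
  if (executor.relinksOnReconnect) {
    // "New" executors re-link first, so the stale socket is never used.
    executor.linkHalfOpen = false;
  }

  if (executor.linkHalfOpen) {
    // The reply goes out on the stale socket and is answered with a RST.
    // The broken socket is discarded, so the next reply uses a fresh one.
    executor.linkHalfOpen = false;
    return false;
  }

  executor.registered = true;
  return true;
}

// Sends up to `maxAttempts` reconnect messages.
// Returns the attempt number that succeeded, or 0 if none did.
std::size_t reconnectWithRetries(Executor& executor, std::size_t maxAttempts)
{
  assert(maxAttempts > 0);

  for (std::size_t attempt = 1; attempt <= maxAttempts; ++attempt) {
    if (deliverReconnect(executor)) {
      return attempt;
    }
  }

  return 0;
}
```

In this model an executor constructed with `Executor(false)` fails when only one message is sent and succeeds on the second message when retries are allowed, while `Executor(true)` succeeds on the first. The real agent spaces its retries by a configurable interval, which the sketch leaves out.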