> On Feb. 9, 2018, 6:36 p.m., Joseph Wu wrote:
> > src/launcher/default_executor.cpp
> > Lines 539-546 (original), 537 (patched)
> > <https://reviews.apache.org/r/65550/diff/1/?file=1954029#file1954029line539>
> >
> > Since the guard above was removed, this CHECK could potentially be hit
> > now.

Good catch, I'll remove the CHECK.


> On Feb. 9, 2018, 6:36 p.m., Joseph Wu wrote:
> > src/launcher/default_executor.cpp
> > Line 558 (original), 549 (patched)
> > <https://reviews.apache.org/r/65550/diff/1/?file=1954029#file1954029line558>
> >
> > What happens when the executor is disconnected (as is now allowed) and
> > attempts to launch some health checks?
> >
> > Any nested command checks would definitely fail. But I suppose this is
> > better than shutting down the executor.
> >
> > Seems like you need to either delay the creation of the health checks
> > or pause them immediately after creation.

The checker process treats connection errors as transient failures and reschedules the check:
https://github.com/apache/mesos/blob/a86ff8c36532f97b6eb6b44c6f871de24afbcc4d/src/checks/checker_process.cpp#L531-L538

Transient failures are logged, but not treated as health check failures:
https://github.com/apache/mesos/blob/a86ff8c36532f97b6eb6b44c6f871de24afbcc4d/src/checks/checker_process.cpp#L353-L356


> On Feb. 9, 2018, 6:36 p.m., Joseph Wu wrote:
> > src/launcher/default_executor.cpp
> > Lines 626-631 (original), 617-622 (patched)
> > <https://reviews.apache.org/r/65550/diff/1/?file=1954029#file1954029line626>
> >
> > This will be dropped if the executor isn't subscribed. And as far as I
> > can tell, this status update is not sent in any other location

If the executor isn't subscribed, the status updates will be added to the `unacknowledgedUpdates` map, and sent by `doReliableRegistration()` in the next `SUBSCRIBE` call:
https://github.com/apache/mesos/blob/a86ff8c36532f97b6eb6b44c6f871de24afbcc4d/src/launcher/default_executor.cpp#L309-L343

The executor doesn't wait for the updates to be ack'd before shutting down (https://github.com/apache/mesos/blob/a86ff8c36532f97b6eb6b44c6f871de24afbcc4d/src/launcher/default_executor.cpp#L1020-L1024), so there's a possibility that these updates will be dropped if the executor is not connected to the agent when it shuts down. This is tracked in https://issues.apache.org/jira/browse/MESOS-8537.


- Gaston


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65550/#review197217
-----------------------------------------------------------


On Feb. 7, 2018, 11 a.m., Gaston Kleiman wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65550/
> -----------------------------------------------------------
>
> (Updated Feb. 7, 2018, 11 a.m.)
>
>
> Review request for mesos, Anand Mazumdar, Qian Zhang, and Vinod Kone.
>
>
> Bugs: MESOS-8468
>     https://issues.apache.org/jira/browse/MESOS-8468
>
>
> Repository: mesos
>
>
> Description
> -------
>
> The default executor would unnecessarily shut down if, while launching a
> task group, it gets unsubscribed after having successfully launched the
> task group's containers.
>
>
> Diffs
> -----
>
>   src/launcher/default_executor.cpp 4a619859095cc2d30f4806813f64a2e48c83b3ea
>
>
> Diff: https://reviews.apache.org/r/65550/diff/1/
>
>
> Testing
> -------
>
> `make check` on GNU/Linux
>
>
> Thanks,
>
> Gaston Kleiman
>
>