> On Jan. 31, 2020, 8:50 p.m., Qian Zhang wrote:
> > src/launcher/default_executor.cpp
> > Lines 1089-1098 (original), 1095-1104 (patched)
> > <https://reviews.apache.org/r/72029/diff/4/?file=2210076#file2210076line1095>
> >
> > I see `_shutdown` will be called in some error cases, like:
> >
> > https://github.com/apache/mesos/blob/1.9.0/src/launcher/default_executor.cpp#L390:L392
> >
> > https://github.com/apache/mesos/blob/1.9.0/src/launcher/default_executor.cpp#L1041:L1044
> > So for such cases, the previous behavior was to self-terminate after
> > sleeping 1 second, but with your patch it now happens after sleeping
> > 60 seconds. I do not think we should sleep that long before
> > self-termination in those cases.
>
> Andrei Budnik wrote:
> Updated.
I see you have updated `_shutdown` to:
```
void _shutdown()
{
  if (unacknowledgedUpdates.empty()) {
    terminate(self());
  } else {
    // This is a fail safe in case the agent doesn't send an ACK for
    // a status update for some reason.
    const Duration duration = Seconds(60);
    LOG(INFO) << "Terminating after " << duration;
    delay(duration, self(), &Self::__shutdown);
  }
}
```
That's also what I had in mind, and I think it handles the following cases well:
https://github.com/apache/mesos/blob/1.9.0/src/launcher/default_executor.cpp#L390:L392
https://github.com/apache/mesos/blob/1.9.0/src/launcher/default_executor.cpp#L406:L408
But what about the cases like below?
https://github.com/apache/mesos/blob/1.9.0/src/launcher/default_executor.cpp#L559:L565
In such cases, `unacknowledgedUpdates` is likely not empty and the agent has
failed (i.e., it can no longer send ACKs to the executor), so the executor will
sleep for 60s before terminating itself. I think the executor should terminate
itself immediately in this case instead, HDYT?
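To make the suggestion concrete, here is a rough, standalone sketch of the decision I have in mind. It is not Mesos code: `shutdownDelay` and the `agentConnected` flag are hypothetical names used only for illustration, and the real executor would need to track the agent's state in whatever way fits the existing code.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper (not part of Mesos): returns how long the executor
// should wait before terminating itself.
//
// `unacknowledged` stands in for `unacknowledgedUpdates.size()`;
// `agentConnected` is an assumed flag that is false once the agent is known
// to have failed and therefore cannot send any more ACKs.
std::chrono::seconds shutdownDelay(
    std::size_t unacknowledged,
    bool agentConnected)
{
  // Nothing left to wait for, or nobody left to ACK: terminate right away.
  if (unacknowledged == 0 || !agentConnected) {
    return std::chrono::seconds(0);
  }

  // Fail-safe: the agent is still there but has not ACKed everything yet.
  return std::chrono::seconds(60);
}
```

With something like this, the agent-failure case above would yield a zero delay, while the normal case with pending ACKs would keep the 60s fail-safe.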
- Qian
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72029/#review219448
-----------------------------------------------------------
On Jan. 30, 2020, 11:28 p.m., Andrei Budnik wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72029/
> -----------------------------------------------------------
>
> (Updated Jan. 30, 2020, 11:28 p.m.)
>
>
> Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod
> Kone.
>
>
> Bugs: MESOS-8537
> https://issues.apache.org/jira/browse/MESOS-8537
>
>
> Repository: mesos
>
>
> Description
> -------
>
> Previously, the default executor terminated itself after all containers
> had terminated. This could lead to termination of the executor before
> processing of a terminal status update by the agent. In order
> to mitigate this issue, the executor slept for one second to give a
> chance to send all status updates and receive all status update
> acknowledgements before terminating itself. This might have led to
> various race conditions in some circumstances (e.g., on a slow host).
> This patch terminates the default executor once all status updates have
> been acknowledged by the agent and no running containers are left.
> As a fail-safe, this patch also increases the timeout from one second to
> one minute.
>
>
> Diffs
> -----
>
> src/launcher/default_executor.cpp 4369fd0052b2e8496ba63606fa57e17d881ea52c
>
>
> Diff: https://reviews.apache.org/r/72029/diff/5/
>
>
> Testing
> -------
>
> internal CI
>
>
> Thanks,
>
> Andrei Budnik
>
>