> On Jan. 30, 2020, 12:14 a.m., Greg Mann wrote:
> > The patch looks great, thanks Andrei. What about adding a test for this,
> > would it be hard? I'm imagining something like:
> > 1) kill a task under the default executor
> > 2) intercept the ACK from agent to executor
> > 3) verify that the executor is still running
> > 4) send the ACK to the executor
> > 5) verify that the executor has terminated
> >
> > WDYT?
>
> Andrei Budnik wrote:
>     How to implement step (2) and step (4)? Is there an example somewhere in
>     Mesos tests?
>
> Greg Mann wrote:
>     Yep there are some places in the tests where we use `DROP_PROTOBUF` to
>     intercept a message and then inject it manually with `process::post`; see
>     'TaskStatusUpdateManagerTest.DuplicateUpdateBeforeAck', for example.
Oh shoot, for an HTTP executor that won't work though... hmm I'm not sure if
there's a good way to do this in our test code right now.


- Greg


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72029/#review219426
-----------------------------------------------------------


On Jan. 30, 2020, 3:28 p.m., Andrei Budnik wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72029/
> -----------------------------------------------------------
>
> (Updated Jan. 30, 2020, 3:28 p.m.)
>
>
> Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod
> Kone.
>
>
> Bugs: MESOS-8537
>     https://issues.apache.org/jira/browse/MESOS-8537
>
>
> Repository: mesos
>
>
> Description
> -------
>
> Previously, the default executor terminated itself after all containers
> had terminated. This could lead to termination of the executor before
> processing of a terminal status update by the agent. In order
> to mitigate this issue, the executor slept for one second to give a
> chance to send all status updates and receive all status update
> acknowledgements before terminating itself. This might have led to
> various race conditions in some circumstances (e.g., on a slow host).
> This patch terminates the default executor if all status updates have
> been acknowledged by the agent and no running containers left.
> Also, this patch increases the timeout from one second to one minute
> for fail-safety.
>
>
> Diffs
> -----
>
>   src/launcher/default_executor.cpp 4369fd0052b2e8496ba63606fa57e17d881ea52c
>
>
> Diff: https://reviews.apache.org/r/72029/diff/4/
>
>
> Testing
> -------
>
> internal CI
>
>
> Thanks,
>
> Andrei Budnik
>
>
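
For readers following the thread, here is a rough, standalone sketch of the
termination condition described in the patch summary above, and of the five
test steps proposed earlier. It is not the code from
src/launcher/default_executor.cpp and it uses no Mesos or libprocess APIs.
`ShutdownTracker`, `kFailSafeTimeout`, `walkThrough` and all method names are
hypothetical, made up for illustration. The only details taken from the
description are the condition itself (no running containers and no
unacknowledged status updates) and the one-minute fail-safe timeout. When the
real executor starts that timer is an assumption here.

```cpp
#include <cassert>
#include <chrono>
#include <set>
#include <string>

// Fail-safe timeout from the patch description (one minute).
constexpr std::chrono::seconds kFailSafeTimeout{60};

// Hypothetical model of the shutdown decision; not actual Mesos code.
class ShutdownTracker {
public:
  void containerLaunched(const std::string& id) { running_.insert(id); }
  void containerTerminated(const std::string& id) { running_.erase(id); }

  // A status update was sent to the agent and awaits acknowledgement.
  void updateSent(const std::string& uuid) { unacked_.insert(uuid); }
  void updateAcknowledged(const std::string& uuid) { unacked_.erase(uuid); }

  // Normal path: nothing is running and every update has been acknowledged.
  bool canTerminate() const { return running_.empty() && unacked_.empty(); }

  // Fail-safe path (assumed semantics): no containers are running and the
  // timeout has elapsed, even if some acknowledgements never arrived.
  bool mustTerminate(std::chrono::seconds sinceContainersGone) const {
    return running_.empty() && sinceContainersGone >= kFailSafeTimeout;
  }

private:
  std::set<std::string> running_;
  std::set<std::string> unacked_;
};

// Mirrors the five proposed test steps against the model above.
inline bool walkThrough() {
  ShutdownTracker tracker;
  tracker.containerLaunched("container-1");

  // 1) Kill the task: its container terminates and a terminal update is sent.
  tracker.containerTerminated("container-1");
  tracker.updateSent("update-1");

  // 2) + 3) The ACK is withheld, so the executor must still be running.
  const bool stillRunning = !tracker.canTerminate();

  // 4) + 5) Once the ACK is delivered, the executor may terminate.
  tracker.updateAcknowledged("update-1");
  const bool terminated = tracker.canTerminate();

  return stillRunning && terminated;
}
```

In this model `walkThrough()` returns true only if the executor stays up while
the ACK is withheld and becomes eligible to terminate once it is delivered.
How to intercept that ACK for an HTTP executor in the real test suite is the
open question in this thread, and the sketch does not address it.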
