> On Sept. 6, 2016, 11:33 a.m., Vinod Kone wrote: > > src/launcher/executor.cpp, line 682 > > <https://reviews.apache.org/r/46187/diff/5/?file=1488422#file1488422line682> > > > > hmm. what's the guarantee that an HTTP based executor receives an ACK > > within a second? what if the agent is down? > > Qian Zhang wrote: > If executor does not receives an ACK from agent within 1 second, that > means there should be something wrong in agent, so with this code, the > executor will exit with -1 and a message so that we can catch such situation. > Maybe we should enlarge this timeout a bit (e.g., 3 seconds) to be safer? > > Vinod Kone wrote: > Executor exiting with -1 code when an agent is down or restarting > (probably for an upgrade) seems unfortunate. Since we allow agents to be down > for upto "MESOS_RECOVERY_TIMEOUT" (default 15 mins) if "MESOS_CHECKPOINT" is > set, maybe the command executor could wait for "MESOS_RECOVERY_TIMEOUT" if it > is disconnected? If it is connected then yes, it probably should wait for > less time (1s is too short?) and then exit because that's seems like there is > something wrong with the agent. Does that make sense? > > Qian Zhang wrote: > Yes, it makes sense. So what we need to do is: > 1. For adapter based executor, keep what I am doing now, i.e., in > `reaped()` do the self termination with 0 as exit code after 1s. > 2. For HTTP based executor, in `reaped()`, check if `MESOS_CHECKPOINT` is > set: > 2.1 If it is not set, do the self termination with 1 as exit code > after 3s (1s is too short). > 2.2 If it is set, check if the executor is disconnected from agent, if > yes, do the self termination with 1 as exit code after > `MESOS_RECOVERY_TIMEOUT`, if no, do the self termination with 1 as exit code > after 3s. > > Please let me know if you have further comments, I will update the patch > soon. > > Huadong Liu wrote: > Hi Qian, Vinod, I have tried 3s in Yelp's mesos/chronos environments. It > helped at some extent, but seemed not enough for all the cases. Since the > executor will terminate once an ACK is received, why not delay for an > extended period of time, say 10s, or even 30s for rare cases (e.g. agent > being extermely busy)? > > Vinod Kone wrote: > For 2.2) what if agent goes down right after executor checks for > connectedness in `reaped()`? seems like it would only wait 3s which seems > non-ideal. Maybe this logic is becoming too complex? To simplify, how about > having the executor always wait for a resonable amount of time (60s) in > `reaped()`? HTTP based executor would exit early if they get an ACK for > terminal update. How does that sound? We would need to make sure that our > unit tests (that use command executor) are not slowed down because of this. > > Qian Zhang wrote: > > what if agent goes down right after executor checks for connectedness > in reaped()? > > Yes, I thought about this case as well, to handle it, it will make the > logic here even more complex though the logic of what I posted above is > already complex and not that readable. So I agree we should simplify it, what > about the following? > 1. For adapter based executor, in `reaped()` do the self termination with > 0 as exit code after 1s. > 2. For HTTP based executor, in `reaped()`, do the self termination with 1 > as exit code after 60s (@huadongliu, this should satisfy your requirement :-). > Or you think we should always wait for 60s for both adapter based > executor and HTTP based executor? > > Vinod Kone wrote: > your plan sounds good to me.
Thanks Qian. I am a bit confused with the terminology here. +Is HTTP command executor the same thing as Mesos command executor in master, i.e. https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp? Chronos uses the Mesos command executor. +Is the "docker executor" one of the adapter based executors? It uses the ExecutorDriver C++ interface. However, https://github.com/apache/mesos/blob/master/src/docker/executor.cpp#L451 comments about HTTP API. Will the comments simply get removed for "1. For adapter based executor, in reaped() do the self termination with 0 as exit code after 1s."? - Huadong ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/46187/#review147808 ----------------------------------------------------------- On Sept. 7, 2016, 2:03 p.m., Qian Zhang wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/46187/ > ----------------------------------------------------------- > > (Updated Sept. 7, 2016, 2:03 p.m.) > > > Review request for mesos, Anand Mazumdar and Vinod Kone. > > > Bugs: MESOS-5276 > https://issues.apache.org/jira/browse/MESOS-5276 > > > Repository: mesos > > > Description > ------- > > Terminate when receiving the ACK of terminal status update. > > > Diffs > ----- > > src/launcher/executor.cpp 5370634ef9e6f3ac9717fed71f6a77707026a16a > > Diff: https://reviews.apache.org/r/46187/diff/ > > > Testing > ------- > > make check > > > Thanks, > > Qian Zhang > >