> On April 14, 2016, 4:49 p.m., Vinod Kone wrote:
> > src/launcher/http_command_executor.cpp, line 749
> > <https://reviews.apache.org/r/46187/diff/1/?file=1343828#file1343828line749>
> >
> >     Looking at slave::statusUpdate() code there are several scenarios where 
> > the slave ignores a status update sent by the executor; this means this 
> > executor could end up not terminating forever.
> >     
> >     Can you do the following:
> >     
> >     --> Enque a message in the queue to self terminate after some timeout 
> > (you can use the delay() function) to be safe.
> >     
> >     --> Add a TODO that we do this to be safe and also because slave 
> > sometimes doesn't ACK a status update. Link to a ticket that fixes the 
> > slave status update semantics to always ACK a status update sent by an 
> > executor.
> >     
> >     sounds good?
> 
> Vinod Kone wrote:
>     @Qian, any update on this? If this particular review is going to take 
> some time, I think it is still useful two commit the other 2 reviews in this 
> chain. AFAICT, they are independent of this review?
> 
> Qian Zhang wrote:
>     @Vinod, sorry for the late. I have filed a ticket 
> (https://issues.apache.org/jira/browse/MESOS-5262) for enhancing 
> `slave::statusUpdate()` to always ACK the status update sent by executor.
>     
>     And can you please elaborate about the specific scenarios this executor 
> could not terminate forever. Originially I thought the scenario should be: 
> executor sends a terminal status upate to slave when the corresponding 
> framework is in `TERMINATING` state (e.g., operator tears down the 
> framework), then in `Slave::statusUpdate()`, this status update will be 
> ignored, so the executor will not get the ACK. But after testing, I found in 
> this case the executor can still terminate, because the container 
> corresponded to this executor will be destroyed by 
> `Slave::shutdownExecutorTimeout()` -> `MesosContainerizer::destroy()`, so 
> after `--executor_shutdown_grace_period`, the executor can still terminate. 
> So I am not in which case the executor will never terminate.
>     
>     And yes, the other 2 patches are independent of this one, I will make 
> them not depending on this one in the review board, thanks!
> 
> Qian Zhang wrote:
>     After more thinking, I see one scenario the executor could never 
> terminate is: agent is down right after it sends SHUTDOWN event to executor. 
> In this case, when handling the SHUTDOWN event, executor will kill the task 
> and send TASK_KILLED status update to agent, but it will not get ACK since 
> agent is already down, so the executor will still run. But I think once agent 
> is started again, executor will receive the ACK and then terminate. I am not 
> sure if this behavior is OK, or we want executor to terminate once it 
> receives the SHUTDOWN event rather than wait for agent is started again?

I think it is the right thing for the executor to stay up until the agent comes 
back up and sends an ACK. Otherwise, which is currently the case, the agent has 
no idea about the exit status of the executor.

So the cases where I see no ACKs from the agent are:

1) `update` doesn't have UUID.
  --> Is this possible with the HTTP executor? If not, we should probably 
shutdown the executor instead of just ignoring.

2) Framework is unknown.
  --> Don't think we can do much here because we don't have access to Executor 
object?

3) Framework is terminating.
  --> Per your observation the agent is guaranteed to destroy the container. We 
need to add a comment here explainining this.

4) Executor is unknown. While we forward the status update to the framework, 
the ACK is dropped by `_statusUpdate` and `___statusUpdate()`
   --> If the update is generated by the agent (`killTask()`, `_runTask()`) we 
should be fine because there is no executor.
   --> If the update is sent by an executor for a task belonging to another 
executor, that another executor is hopefully already dead. So we should be fine.
   --> Is it possible for the update to be sent by an executor that is not 
known to the agent? Don't think we can do much here since we don't have access 
to the Executor object.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/46187/#review128916
-----------------------------------------------------------


On April 25, 2016, 8:34 p.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/46187/
> -----------------------------------------------------------
> 
> (Updated April 25, 2016, 8:34 p.m.)
> 
> 
> Review request for mesos, Anand Mazumdar and Vinod Kone.
> 
> 
> Bugs: MESOS-4855
>     https://issues.apache.org/jira/browse/MESOS-4855
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Terminate when receiving the ACK of terminal status update.
> 
> 
> Diffs
> -----
> 
>   src/launcher/http_command_executor.cpp 
> 0b4136c040ec9cf585c0d38f8471495a855cd640 
> 
> Diff: https://reviews.apache.org/r/46187/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>

Reply via email to