> On April 15, 2016, 12:49 a.m., Vinod Kone wrote:
> > src/launcher/http_command_executor.cpp, line 749
> > <https://reviews.apache.org/r/46187/diff/1/?file=1343828#file1343828line749>
> >
> >     Looking at slave::statusUpdate() code there are several scenarios where 
> > the slave ignores a status update sent by the executor; this means this 
> > executor could end up not terminating forever.
> >     
> >     Can you do the following:
> >     
> >     --> Enque a message in the queue to self terminate after some timeout 
> > (you can use the delay() function) to be safe.
> >     
> >     --> Add a TODO that we do this to be safe and also because slave 
> > sometimes doesn't ACK a status update. Link to a ticket that fixes the 
> > slave status update semantics to always ACK a status update sent by an 
> > executor.
> >     
> >     sounds good?
> 
> Vinod Kone wrote:
>     @Qian, any update on this? If this particular review is going to take 
> some time, I think it is still useful two commit the other 2 reviews in this 
> chain. AFAICT, they are independent of this review?
> 
> Qian Zhang wrote:
>     @Vinod, sorry for the late. I have filed a ticket 
> (https://issues.apache.org/jira/browse/MESOS-5262) for enhancing 
> `slave::statusUpdate()` to always ACK the status update sent by executor.
>     
>     And can you please elaborate about the specific scenarios this executor 
> could not terminate forever. Originially I thought the scenario should be: 
> executor sends a terminal status upate to slave when the corresponding 
> framework is in `TERMINATING` state (e.g., operator tears down the 
> framework), then in `Slave::statusUpdate()`, this status update will be 
> ignored, so the executor will not get the ACK. But after testing, I found in 
> this case the executor can still terminate, because the container 
> corresponded to this executor will be destroyed by 
> `Slave::shutdownExecutorTimeout()` -> `MesosContainerizer::destroy()`, so 
> after `--executor_shutdown_grace_period`, the executor can still terminate. 
> So I am not in which case the executor will never terminate.
>     
>     And yes, the other 2 patches are independent of this one, I will make 
> them not depending on this one in the review board, thanks!
> 
> Qian Zhang wrote:
>     After more thinking, I see one scenario the executor could never 
> terminate is: agent is down right after it sends SHUTDOWN event to executor. 
> In this case, when handling the SHUTDOWN event, executor will kill the task 
> and send TASK_KILLED status update to agent, but it will not get ACK since 
> agent is already down, so the executor will still run. But I think once agent 
> is started again, executor will receive the ACK and then terminate. I am not 
> sure if this behavior is OK, or we want executor to terminate once it 
> receives the SHUTDOWN event rather than wait for agent is started again?
> 
> Vinod Kone wrote:
>     I think it is the right thing for the executor to stay up until the agent 
> comes back up and sends an ACK. Otherwise, which is currently the case, the 
> agent has no idea about the exit status of the executor.
>     
>     So the cases where I see no ACKs from the agent are:
>     
>     1) `update` doesn't have UUID.
>       --> Is this possible with the HTTP executor? If not, we should probably 
> shutdown the executor instead of just ignoring.
>     
>     2) Framework is unknown.
>       --> Don't think we can do much here because we don't have access to 
> Executor object?
>     
>     3) Framework is terminating.
>       --> Per your observation the agent is guaranteed to destroy the 
> container. We need to add a comment here explainining this.
>     
>     4) Executor is unknown. While we forward the status update to the 
> framework, the ACK is dropped by `_statusUpdate` and `___statusUpdate()`
>        --> If the update is generated by the agent (`killTask()`, 
> `_runTask()`) we should be fine because there is no executor.
>        --> If the update is sent by an executor for a task belonging to 
> another executor, that another executor is hopefully already dead. So we 
> should be fine.
>        --> Is it possible for the update to be sent by an executor that is 
> not known to the agent? Don't think we can do much here since we don't have 
> access to the Executor object.

> I think it is the right thing for the executor to stay up until the agent 
> comes back up and sends an ACK. Otherwise, which is currently the case, the 
> agent has no idea about the exit status of the executor.

But that seems conflict with your original comment. My understanding to your 
original comment is, to avoid executor not terminating forever, executor needs 
to enque a message to self terminate after some timeout (I think we can do it 
via delay() in `reaped()` after terminal status update is sent), if so, then 
when executor handles SHUTDOWN event, it will self terminate anyway during 
agent down, so it is not possible to make executor stay up until agent comes 
back.

With this patch, I think the behavior of HTTP commmand executor and driver 
based command executor will be different, let's see this scenario: a 
checkpoint-enabled frameworks launches a "sleep 100" task, after the task is 
running, agent is down, and during agent down, the task finishes, and then 
agent is started again. In this case, after agent is started again, for 
driver-based command executor, framework will receive a TASK_FAILED since the 
executor has terminated during agent is down, but for HTTP command executor 
(with this patch applied), framework will receive a TASK_FINISHED since the 
executer will NOT terminate until it receives the ACK for the TASK_FINISHED.

I think we need to make them have consistent behavior. Any suggestions? :-)

> 1) update doesn't have UUID.
  --> Is this possible with the HTTP executor? If not, we should probably 
shutdown the executor instead of just ignoring.

This is not possible with HTTP command executor, because in 
`HttpCommandExecutor::update()` we always set UUID in the status upate before 
sending it to agent.

> 2) Framework is unknown.
  --> Don't think we can do much here because we don't have access to Executor 
object?

I do not think this will happen in our case, because framework will only be 
removed (`Slave::removeFramework()`) when it has no executor 
(`framework->executors.empty()`), but in our case, it must have one executor 
(the HTTP command executor).


> 4) Executor is unknown. While we forward the status update to the framework, 
> the ACK is dropped by _statusUpdate and ___statusUpdate()

>   --> If the update is generated by the agent (killTask(), _runTask()) we 
> should be fine because there is no executor.

>   --> If the update is sent by an executor for a task belonging to another 
> executor, that another executor is hopefully already dead. So we should be 
> fine.

>   --> Is it possible for the update to be sent by an executor that is not 
> known to the agent? Don't think we can do much here since we don't have 
> access to the Executor object.

I do not think it is possible that an executor sends a update to agent but 
agent does not know this executor.


- Qian


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/46187/#review128916
-----------------------------------------------------------


On April 27, 2016, 9:23 a.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/46187/
> -----------------------------------------------------------
> 
> (Updated April 27, 2016, 9:23 a.m.)
> 
> 
> Review request for mesos, Anand Mazumdar and Vinod Kone.
> 
> 
> Bugs: MESOS-4855
>     https://issues.apache.org/jira/browse/MESOS-4855
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Terminate when receiving the ACK of terminal status update.
> 
> 
> Diffs
> -----
> 
>   src/launcher/http_command_executor.cpp 
> 0b4136c040ec9cf585c0d38f8471495a855cd640 
> 
> Diff: https://reviews.apache.org/r/46187/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>

Reply via email to