[ 
https://issues.apache.org/jira/browse/MESOS-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029065#comment-17029065
 ] 

Andrei Budnik commented on MESOS-8537:
--------------------------------------

1.5.x
{code}
commit 84b7af3409d8af343da0f0420e168a42de4b110f
Author: Andrei Budnik <abud...@apache.org>
Date:   Wed Jan 29 19:07:50 2020 +0100

    Changed termination logic of the default executor.

    Previously, the default executor terminated itself after all containers
    had terminated. This could lead to termination of the executor before
    processing of a terminal status update by the agent. In order
    to mitigate this issue, the executor slept for one second to give a
    chance to send all status updates and receive all status update
    acknowledgements before terminating itself. This might have led to
    various race conditions in some circumstances (e.g., on a slow host).
    This patch terminates the default executor if all status updates have
    been acknowledged by the agent and no running containers left.
    Also, this patch increases the timeout from one second to one minute
    for fail-safety.

    Review: https://reviews.apache.org/r/72029
{code}

1.6.x
{code}
commit 205525eb56a33e58bed1fc38e0b32189b19d3fbc
Author: Andrei Budnik <abud...@apache.org>
Date:   Wed Jan 29 19:07:50 2020 +0100

    Changed termination logic of the default executor.

    Previously, the default executor terminated itself after all containers
    had terminated. This could lead to termination of the executor before
    processing of a terminal status update by the agent. In order
    to mitigate this issue, the executor slept for one second to give a
    chance to send all status updates and receive all status update
    acknowledgements before terminating itself. This might have led to
    various race conditions in some circumstances (e.g., on a slow host).
    This patch terminates the default executor if all status updates have
    been acknowledged by the agent and no running containers left.
    Also, this patch increases the timeout from one second to one minute
    for fail-safety.

    Review: https://reviews.apache.org/r/72029
{code}

1.7.x
{code}
commit 5b399080eee11ee03f4bc6c09b791c24670da6c1
Author: Andrei Budnik <abud...@apache.org>
Date:   Wed Jan 29 19:07:50 2020 +0100

    Changed termination logic of the default executor.

    Previously, the default executor terminated itself after all containers
    had terminated. This could lead to termination of the executor before
    processing of a terminal status update by the agent. In order
    to mitigate this issue, the executor slept for one second to give a
    chance to send all status updates and receive all status update
    acknowledgements before terminating itself. This might have led to
    various race conditions in some circumstances (e.g., on a slow host).
    This patch terminates the default executor if all status updates have
    been acknowledged by the agent and no running containers left.
    Also, this patch increases the timeout from one second to one minute
    for fail-safety.

    Review: https://reviews.apache.org/r/72029
{code}

1.8.x
{code}
commit a2ca451aab4625e126b9e7b470eb9f7c232dd746
Author: Andrei Budnik <abud...@apache.org>
Date:   Wed Jan 29 19:07:50 2020 +0100

    Changed termination logic of the default executor.

    Previously, the default executor terminated itself after all containers
    had terminated. This could lead to termination of the executor before
    processing of a terminal status update by the agent. In order
    to mitigate this issue, the executor slept for one second to give a
    chance to send all status updates and receive all status update
    acknowledgements before terminating itself. This might have led to
    various race conditions in some circumstances (e.g., on a slow host).
    This patch terminates the default executor if all status updates have
    been acknowledged by the agent and no running containers left.
    Also, this patch increases the timeout from one second to one minute
    for fail-safety.

    Review: https://reviews.apache.org/r/72029
{code}

1.9.x
{code}
commit f37ae68a8f0d23a2e0f31812b8fe4494109769c6
Author: Andrei Budnik <abud...@apache.org>
Date:   Wed Jan 29 19:07:50 2020 +0100

    Changed termination logic of the default executor.

    Previously, the default executor terminated itself after all containers
    had terminated. This could lead to termination of the executor before
    processing of a terminal status update by the agent. In order
    to mitigate this issue, the executor slept for one second to give a
    chance to send all status updates and receive all status update
    acknowledgements before terminating itself. This might have led to
    various race conditions in some circumstances (e.g., on a slow host).
    This patch terminates the default executor if all status updates have
    been acknowledged by the agent and no running containers left.
    Also, this patch increases the timeout from one second to one minute
    for fail-safety.

    Review: https://reviews.apache.org/r/72029
{code}
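
As a rough illustration of the behavior the commit message describes, the sketch below models the shutdown decision as a pure function: terminate once no containers are running and no status updates are unacknowledged, and otherwise fall back to terminating after the one-minute fail-safe timeout. This is not the code from the patch. `ShutdownSnapshot`, `Decision`, `decide`, and `kFailSafeTimeout` are hypothetical names, and starting the timer at the beginning of shutdown is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

using Clock = std::chrono::steady_clock;

// One minute, the fail-safe value given in the commit message.
constexpr std::chrono::minutes kFailSafeTimeout{1};

enum class Decision { Wait, Terminate, TerminateOnTimeout };

// Hypothetical snapshot of the executor's shutdown state.
struct ShutdownSnapshot
{
  Clock::time_point shutdownStarted;  // Assumed start of the fail-safe timer.
  std::size_t runningContainers = 0;
  std::size_t unackedUpdates = 0;
};

// Decides what the executor should do at time `now`.
Decision decide(const ShutdownSnapshot& snapshot, Clock::time_point now)
{
  // A steady clock never goes backwards.
  assert(now >= snapshot.shutdownStarted);

  // Normal path: everything has terminated and been acknowledged.
  if (snapshot.runningContainers == 0 && snapshot.unackedUpdates == 0) {
    return Decision::Terminate;
  }

  // Fail-safe path: stop waiting once the timeout has elapsed.
  if (now - snapshot.shutdownStarted >= kFailSafeTimeout) {
    return Decision::TerminateOnTimeout;
  }

  return Decision::Wait;
}
```

Keeping the decision free of side effects makes the two exit paths easy to tell apart, for example when logging whether the executor exited normally or through the fail-safe.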

> Default executor doesn't wait for status updates to be ack'd before shutting 
> down
> ---------------------------------------------------------------------------------
>
>                 Key: MESOS-8537
>                 URL: https://issues.apache.org/jira/browse/MESOS-8537
>             Project: Mesos
>          Issue Type: Bug
>          Components: executor
>    Affects Versions: 1.4.1, 1.5.0
>            Reporter: Gastón Kleiman
>            Assignee: Andrei Budnik
>            Priority: Major
>              Labels: containerization, default-executor, mesosphere
>             Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.2, 1.10, 1.9.1
>
>
> The default executor doesn't wait for pending status updates to be 
> acknowledged before shutting down. Instead, it sleeps for one second and then 
> terminates:
> {code}
>   void _shutdown()
>   {
>     const Duration duration = Seconds(1);
>     LOG(INFO) << "Terminating after " << duration;
>     // TODO(qianzhang): Remove this hack since the executor now receives
>     // acknowledgements for status updates. The executor can terminate
>     // after it receives an ACK for a terminal status update.
>     os::sleep(duration);
>     terminate(self());
>   }
> {code}
> The event handler should exit if, upon receiving an {{Event::ACKNOWLEDGED}}, 
> the executor is shutting down, no tasks are still running, and all pending 
> status updates have been acknowledged.
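
The exit condition described above can be sketched as a small predicate that is re-evaluated on every acknowledgement. This is a minimal, self-contained illustration, not the Mesos implementation: `ExecutorState`, `onUpdateSent`, `onAcknowledged`, and `canTerminate` are hypothetical names, and tasks and status updates are modeled as plain string IDs.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical, simplified model of the executor's bookkeeping.
struct ExecutorState
{
  bool shuttingDown = false;
  std::set<std::string> runningTasks;    // IDs of tasks still running.
  std::set<std::string> unackedUpdates;  // IDs of updates awaiting an ACK.
};

// True when all three conditions from the issue description hold.
bool canTerminate(const ExecutorState& state)
{
  return state.shuttingDown &&
         state.runningTasks.empty() &&
         state.unackedUpdates.empty();
}

// Records that a status update was sent and now awaits acknowledgement.
void onUpdateSent(ExecutorState& state, const std::string& updateId)
{
  const bool inserted = state.unackedUpdates.insert(updateId).second;
  assert(inserted);  // Update IDs are assumed to be unique.
  (void)inserted;    // Avoids an unused-variable warning under NDEBUG.
}

// Models the handler for an acknowledgement event: drops the update from
// the pending set and reports whether the executor may now terminate.
bool onAcknowledged(ExecutorState& state, const std::string& updateId)
{
  state.unackedUpdates.erase(updateId);
  return canTerminate(state);
}
```

In this model the caller would terminate the executor as soon as `onAcknowledged` returns `true`, rather than sleeping for a fixed interval.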



--
This message was sent by Atlassian Jira
(v8.3.4#803005)