[jira] [Commented] (MESOS-9507) Agent could not recover due to empty docker volume checkpointed files.

Till Toenshoff (JIRA) Mon, 25 Feb 2019 19:05:48 -0800


    [ https://issues.apache.org/jira/browse/MESOS-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16777527#comment-16777527 ]


Till Toenshoff commented on MESOS-9507:
---------------------------------------

1.4.x:
{noformat}
commit fbd865317e96483ba4df3a1803fe242c87c63293
Author: Qian Zhang <[email protected]>
Date:   Wed Feb 20 13:41:51 2019 -0800

    Handled containers which has no checkpointed volumes during recovery.

    There are two cases we need to handle:
      1. The container's checkpointed docker volumes file does not exist.
      2. The container's checkpointed docker volumes file is empty.
    For both of the two cases, in the recovery of `docker/volume` isolator,
    we should construct an info object with empty docker volumes for the
    container and rely on containerizer or `docker/volume` isolator's
    `recover` method to cleanup the container.

    Review: https://reviews.apache.org/r/69972/
    (cherry picked from commit d174f0bb5120c70f48cfee2bea5b84724493d416)
{noformat}

1.5.x:
{noformat}
commit f21ad749e963ad02012ed9776fd3fecdfcb57bb0
Author: Qian Zhang <[email protected]>
Date:   Wed Feb 20 13:41:51 2019 -0800

    Handled containers which has no checkpointed volumes during recovery.

    There are two cases we need to handle:
      1. The container's checkpointed docker volumes file does not exist.
      2. The container's checkpointed docker volumes file is empty.
    For both of the two cases, in the recovery of `docker/volume` isolator,
    we should construct an info object with empty docker volumes for the
    container and rely on containerizer or `docker/volume` isolator's
    `recover` method to cleanup the container.

    Review: https://reviews.apache.org/r/69972/
    (cherry picked from commit d174f0bb5120c70f48cfee2bea5b84724493d416)
{noformat}

1.6.x:
{noformat}
commit c965ae4e34e56574c33022553c9c07ee082fc3b9
Author: Qian Zhang <[email protected]>
Date:   Wed Feb 20 13:41:51 2019 -0800

    Handled containers which has no checkpointed volumes during recovery.

    There are two cases we need to handle:
      1. The container's checkpointed docker volumes file does not exist.
      2. The container's checkpointed docker volumes file is empty.
    For both of the two cases, in the recovery of `docker/volume` isolator,
    we should construct an info object with empty docker volumes for the
    container and rely on containerizer or `docker/volume` isolator's
    `recover` method to cleanup the container.

    Review: https://reviews.apache.org/r/69972/
    (cherry picked from commit d174f0bb5120c70f48cfee2bea5b84724493d416)
{noformat}

1.7.x:
{noformat}
commit eb117529c100fa2da7ba2b63020094d975418e95
Author: Qian Zhang <[email protected]>
Date:   Wed Feb 20 13:41:51 2019 -0800

    Handled containers which has no checkpointed volumes during recovery.

    There are two cases we need to handle:
      1. The container's checkpointed docker volumes file does not exist.
      2. The container's checkpointed docker volumes file is empty.
    For both of the two cases, in the recovery of `docker/volume` isolator,
    we should construct an info object with empty docker volumes for the
    container and rely on containerizer or `docker/volume` isolator's
    `recover` method to cleanup the container.

    Review: https://reviews.apache.org/r/69972/
    (cherry picked from commit d174f0bb5120c70f48cfee2bea5b84724493d416)
{noformat}
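
For illustration only, the two cases named in the commit message (checkpoint file missing, checkpoint file empty) can be sketched as below. This is not the code from the review: `Info`, `recoverVolumes`, and the one-volume-name-per-line file format are hypothetical stand-ins chosen to keep the sketch self-contained, whereas the log in the issue shows that the real checkpoint is parsed as JSON.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <vector>

#include <sys/stat.h>

// Hypothetical stand-in for the isolator's per-container info object.
struct Info {
  std::vector<std::string> volumes;
};

// Hypothetical helper: rebuilds the info object for one container from its
// checkpoint file. ASSUMPTION: the file holds one volume name per line; the
// real isolator checkpoints JSON instead.
Info recoverVolumes(const std::string& checkpointPath) {
  Info info;  // Starts out with no volumes.

  struct stat st;

  // Case 1: the checkpoint file does not exist (stat fails).
  // Case 2: the checkpoint file exists but is empty.
  // In both cases return an info object with empty volumes instead of
  // failing, so that later cleanup can still deal with the container.
  if (::stat(checkpointPath.c_str(), &st) != 0 || st.st_size == 0) {
    return info;
  }

  // Otherwise read the checkpointed volume names.
  std::ifstream in(checkpointPath);
  std::string line;
  while (std::getline(in, line)) {
    if (!line.empty()) {
      info.volumes.push_back(line);
    }
  }

  return info;
}
```

The point of the sketch is only that a missing or empty file yields an empty `Info` rather than a recovery error, which is the behavior the commit message describes.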

> Agent could not recover due to empty docker volume checkpointed files.
> ----------------------------------------------------------------------
>
>                 Key: MESOS-9507
>                 URL: https://issues.apache.org/jira/browse/MESOS-9507
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Gilbert Song
>            Assignee: Qian Zhang
>            Priority: Critical
>              Labels: containerizer
>             Fix For: 1.5.3, 1.7.2, 1.8.0, 1.6.3, 1.4.4
>
>
> Agent could not recover due to empty docker volume checkpointed files. Please 
> see logs:
> {noformat}
> Nov 12 17:12:00 guppy mesos-agent[38960]: E1112 17:12:00.978682 38969 
> slave.cpp:6279] EXIT with status 1: Failed to perform recovery: Collect 
> failed: Collect failed: Failed to recover docker volumes for orphan container 
> e1b04051-1e4a-47a9-b866-1d625cda1d22: JSON parse failed: syntax error at line 
> 1 near:
> Nov 12 17:12:00 guppy mesos-agent[38960]: To remedy this do as follows: 
> Nov 12 17:12:00 guppy mesos-agent[38960]: Step 1: rm -f 
> /var/lib/mesos/slave/meta/slaves/latest
> Nov 12 17:12:00 guppy mesos-agent[38960]: This ensures agent doesn't recover 
> old live executors.
> Nov 12 17:12:00 guppy mesos-agent[38960]: Step 2: Restart the agent. 
> Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service: main process 
> exited, code=exited, status=1/FAILURE
> Nov 12 17:12:00 guppy systemd[1]: Unit dcos-mesos-slave.service entered 
> failed state.
> Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service failed.
> {noformat}
> This might happen after a hard reboot. The Docker volume isolator uses the 
> `state::checkpoint()` function, which creates a temporary file, writes the 
> data to it, and then renames the temporary file to the destination file. This 
> function is atomic and supports an `fsync` option for the data. However, the 
> Docker volume isolator does not use the `fsync` option for performance 
> reasons, so the data might be lost if the page cache is not synced before 
> the reboot.
> In this situation the docker volume is not mounted yet, so the docker volume 
> isolator should skip recovering this volume.
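
The write-then-rename pattern described in the issue can be sketched roughly as follows. This is a minimal POSIX sketch under stated assumptions, not the Mesos `state::checkpoint()` implementation: the function name `checkpoint`, its parameters, and the `.XXXXXX` temporary-file suffix are hypothetical.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <string>

#include <unistd.h>

// Hypothetical helper (NOT the Mesos API): writes `data` to a temporary file
// next to `path`, optionally fsyncs it, then renames it over `path`.
bool checkpoint(const std::string& path, const std::string& data, bool sync) {
  std::string temp = path + ".XXXXXX";  // Template required by mkstemp().

  int fd = ::mkstemp(&temp[0]);
  if (fd < 0) {
    return false;
  }

  bool ok = true;
  const char* cursor = data.data();
  size_t remaining = data.size();

  // Write all bytes, retrying on short writes and EINTR.
  while (remaining > 0) {
    ssize_t written = ::write(fd, cursor, remaining);
    if (written < 0) {
      if (errno == EINTR) {
        continue;
      }
      ok = false;
      break;
    }
    cursor += written;
    remaining -= static_cast<size_t>(written);
  }

  // Without this step the data may still sit only in the page cache when the
  // rename below becomes visible.
  if (ok && sync && ::fsync(fd) != 0) {
    ok = false;
  }

  if (::close(fd) != 0) {
    ok = false;
  }

  // rename() atomically replaces the destination with the temporary file.
  if (ok && std::rename(temp.c_str(), path.c_str()) != 0) {
    ok = false;
  }

  if (!ok) {
    ::unlink(temp.c_str());  // Best-effort cleanup of the temporary file.
  }

  return ok;
}
```

Calling the sketch with `sync` set to `false` corresponds to the situation described in the issue: the rename can survive a hard reboot while the file contents do not, which leaves an empty destination file behind.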



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
