[
https://issues.apache.org/jira/browse/MESOS-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16777527#comment-16777527
]
Till Toenshoff commented on MESOS-9507:
---------------------------------------
1.4.x:
{noformat}
commit fbd865317e96483ba4df3a1803fe242c87c63293
Author: Qian Zhang <[email protected]>
Date: Wed Feb 20 13:41:51 2019 -0800
Handled containers which has no checkpointed volumes during recovery.
There are two cases we need to handle:
1. The container's checkpointed docker volumes file does not exist.
2. The container's checkpointed docker volumes file is empty.
For both of the two cases, in the recovery of `docker/volume` isolator,
we should construct an info object with empty docker volumes for the
container and rely on containerizer or `docker/volume` isolator's
`recover` method to cleanup the container.
Review: https://reviews.apache.org/r/69972/
(cherry picked from commit d174f0bb5120c70f48cfee2bea5b84724493d416)
{noformat}
1.5.x:
{noformat}
commit f21ad749e963ad02012ed9776fd3fecdfcb57bb0
Author: Qian Zhang <[email protected]>
Date: Wed Feb 20 13:41:51 2019 -0800
Handled containers which has no checkpointed volumes during recovery.
There are two cases we need to handle:
1. The container's checkpointed docker volumes file does not exist.
2. The container's checkpointed docker volumes file is empty.
For both of the two cases, in the recovery of `docker/volume` isolator,
we should construct an info object with empty docker volumes for the
container and rely on containerizer or `docker/volume` isolator's
`recover` method to cleanup the container.
Review: https://reviews.apache.org/r/69972/
(cherry picked from commit d174f0bb5120c70f48cfee2bea5b84724493d416)
{noformat}
1.6.x:
{noformat}
commit c965ae4e34e56574c33022553c9c07ee082fc3b9
Author: Qian Zhang <[email protected]>
Date: Wed Feb 20 13:41:51 2019 -0800
Handled containers which has no checkpointed volumes during recovery.
There are two cases we need to handle:
1. The container's checkpointed docker volumes file does not exist.
2. The container's checkpointed docker volumes file is empty.
For both of the two cases, in the recovery of `docker/volume` isolator,
we should construct an info object with empty docker volumes for the
container and rely on containerizer or `docker/volume` isolator's
`recover` method to cleanup the container.
Review: https://reviews.apache.org/r/69972/
(cherry picked from commit d174f0bb5120c70f48cfee2bea5b84724493d416)
{noformat}
1.7.x:
{noformat}
commit eb117529c100fa2da7ba2b63020094d975418e95
Author: Qian Zhang <[email protected]>
Date: Wed Feb 20 13:41:51 2019 -0800
Handled containers which has no checkpointed volumes during recovery.
There are two cases we need to handle:
1. The container's checkpointed docker volumes file does not exist.
2. The container's checkpointed docker volumes file is empty.
For both of the two cases, in the recovery of `docker/volume` isolator,
we should construct an info object with empty docker volumes for the
container and rely on containerizer or `docker/volume` isolator's
`recover` method to cleanup the container.
Review: https://reviews.apache.org/r/69972/
(cherry picked from commit d174f0bb5120c70f48cfee2bea5b84724493d416)
{noformat}
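For readers following the backports, the two cases named in the commit message can be sketched roughly as below. This is a minimal Python illustration, not the Mesos C++ change itself; the helper name `load_checkpointed_volumes` and the `"volumes"` JSON key are assumptions made only for this sketch.

```python
import json
import os


def load_checkpointed_volumes(path):
    """Return the checkpointed volume list, or [] when nothing usable exists.

    Hypothetical helper: the function name and the "volumes" key are
    assumptions for illustration, not Mesos identifiers.
    """
    # Case 1: the checkpointed docker volumes file does not exist.
    if not os.path.exists(path):
        return []

    with open(path, "r") as f:
        data = f.read()

    # Case 2: the checkpointed docker volumes file is empty
    # (whitespace-only content is treated as empty here).
    if not data.strip():
        return []

    # Any other content must still parse, so genuine corruption
    # is reported rather than silently ignored.
    return json.loads(data).get("volumes", [])
```

In both special cases the caller gets an empty list, which corresponds to the "info object with empty docker volumes" described above, so cleanup of the container can proceed instead of recovery aborting on a JSON parse error.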
> Agent could not recover due to empty docker volume checkpointed files.
> ----------------------------------------------------------------------
>
> Key: MESOS-9507
> URL: https://issues.apache.org/jira/browse/MESOS-9507
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Reporter: Gilbert Song
> Assignee: Qian Zhang
> Priority: Critical
> Labels: containerizer
> Fix For: 1.5.3, 1.7.2, 1.8.0, 1.6.3, 1.4.4
>
>
> Agent could not recover due to empty docker volume checkpointed files. Please
> see logs:
> {noformat}
> Nov 12 17:12:00 guppy mesos-agent[38960]: E1112 17:12:00.978682 38969
> slave.cpp:6279] EXIT with status 1: Failed to perform recovery: Collect
> failed: Collect failed: Failed to recover docker volumes for orphan container
> e1b04051-1e4a-47a9-b866-1d625cda1d22: JSON parse failed: syntax error at line
> 1 near:
> Nov 12 17:12:00 guppy mesos-agent[38960]: To remedy this do as follows:
> Nov 12 17:12:00 guppy mesos-agent[38960]: Step 1: rm -f
> /var/lib/mesos/slave/meta/slaves/latest
> Nov 12 17:12:00 guppy mesos-agent[38960]: This ensures agent doesn't recover
> old live executors.
> Nov 12 17:12:00 guppy mesos-agent[38960]: Step 2: Restart the agent.
> Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service: main process
> exited, code=exited, status=1/FAILURE
> Nov 12 17:12:00 guppy systemd[1]: Unit dcos-mesos-slave.service entered
> failed state.
> Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service failed.
> {noformat}
> This can happen after a hard reboot. The Docker volume isolator uses the
> `state::checkpoint()` function, which creates a temporary file, writes the
> data to it, and then renames the temporary file to the destination file.
> This function is atomic and supports `fsync` for the data. However, the
> Docker volume isolator does not use the `fsync` option for performance
> reasons, so the data might be lost if the page cache is not synced before
> the reboot.
> In this situation the docker volume is not mounted yet, so the docker volume
> isolator should skip recovering this volume.
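The write-then-rename pattern described in the issue can be sketched as follows. This is a Python approximation under stated assumptions, not the actual `state::checkpoint()` implementation; the function name `checkpoint` and its `sync` parameter are hypothetical stand-ins for the behavior the description outlines.

```python
import os
import tempfile


def checkpoint(path, data, sync=False):
    """Write `data` (bytes) to `path` via a temporary file and a rename.

    Hypothetical sketch: `sync` stands in for the `fsync` option mentioned
    in the issue description.
    """
    directory = os.path.dirname(os.path.abspath(path))

    # Create the temporary file in the destination directory so the
    # final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            if sync:
                # Push the data to stable storage before the rename.
                tmp.flush()
                os.fsync(tmp.fileno())
        # Atomically replace the destination with the temporary file.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With `sync=False`, the written data may still sit only in the page cache when the rename happens, which matches the failure mode reported here: after a hard reboot the destination file can exist but be empty.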