[
https://issues.apache.org/jira/browse/MESOS-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adam B updated MESOS-5352:
--------------------------
Target Version/s: 1.2.0
Priority: Critical (was: Major)
> Docker volume isolator cleanup can be blocked by first cleanup failure.
> -----------------------------------------------------------------------
>
> Key: MESOS-5352
> URL: https://issues.apache.org/jira/browse/MESOS-5352
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Reporter: Gilbert Song
> Priority: Critical
> Labels: containerizer
>
> The summary title may be confusing; please see the description below for
> details.
> Some background:
> 1). In the docker volume isolator cleanup, we currently do reference counting
> for docker volumes. The volume driver `unmount` is only called for a volume if
> its ref count is 1.
> 2). We maintain a hash map `infos` to track the docker volume mount
> information for each container, keyed by containerId. A containerId is erased
> from the hash map only if all driver `unmount` calls succeed (i.e., each
> subprocess returns a ready future). A simplified sketch of this bookkeeping
> appears right after this list.
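> A minimal, self-contained C++ sketch of that bookkeeping (simplified
> standalone types, hypothetical names such as `referenceCount` and the volume
> keys; not the actual isolator code):
> ```cpp
> // Simplified model of the docker volume isolator's bookkeeping: `infos`
> // maps a containerId to the docker volumes it has mounted, and the driver
> // `unmount` is only invoked for a volume whose reference count is exactly 1.
> #include <iostream>
> #include <map>
> #include <set>
> #include <string>
>
> using ContainerID = std::string;
> using Volume = std::string;  // hypothetical key, e.g. "<driver>/<name>"
>
> std::map<ContainerID, std::set<Volume>> infos;
>
> // Count how many tracked containers reference `volume`.
> int referenceCount(const Volume& volume)
> {
>   int count = 0;
>   for (const auto& entry : infos) {
>     if (entry.second.count(volume) > 0) {
>       ++count;
>     }
>   }
>   return count;
> }
>
> int main()
> {
>   infos["c1"] = {"volume-a"};
>   infos["c2"] = {"volume-a", "volume-b"};
>
>   // Cleaning up "c2": "volume-a" is still referenced by "c1", so only
>   // "volume-b" (ref count 1) would trigger a driver unmount.
>   for (const Volume& v : infos["c2"]) {
>     std::cout << v << ": "
>               << (referenceCount(v) == 1 ? "unmount" : "skip (shared)")
>               << std::endl;
>   }
>   return 0;
> }
> ```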
> The issue in this JIRA: suppose we have a slave running (not shut down or
> rebooted in this case) and keep launching frameworks that make use of docker
> volumes. Once any docker volume isolator cleanup returns a failure, all the
> other `unmount` calls for those volumes will be blocked by the reference
> count: since `_cleanup()` returns a failure, the containerId is not erased
> from the hash map `infos`, even though all of its volumes may have been
> unmounted/detached correctly. (The docker volume isolator calls the driver
> `unmount` as a subprocess, and the driver may return a failure message even
> if all volumes were unmounted/detached correctly.) The stale containerId in
> `infos` then makes every other isolator cleanup count one extra reference for
> those volumes, which means it refuses to call the driver `unmount`. So after
> all tasks finish, the docker volumes from the first failure are still in the
> `attached` status.
> This issue goes away after the slave recovers, but we cannot rely on
> restarting the slave every time we hit this case.
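> A hypothetical, simplified reproduction of that stuck state (again not the
> actual Mesos code; `cleanup` and the volume names are illustrative only): a
> container whose unmount reports a failure is never erased from `infos`, so
> every later container sharing the same volume sees an inflated reference
> count and skips the driver `unmount`.
> ```cpp
> #include <iostream>
> #include <map>
> #include <set>
> #include <string>
>
> std::map<std::string, std::set<std::string>> infos;  // containerId -> volumes
>
> int referenceCount(const std::string& volume)
> {
>   int count = 0;
>   for (const auto& entry : infos) {
>     if (entry.second.count(volume) > 0) {
>       ++count;
>     }
>   }
>   return count;
> }
>
> // Mirrors the problematic behavior: the entry is only erased when every
> // unmount succeeded, analogous to `_cleanup()` returning a Failure otherwise.
> void cleanup(const std::string& containerId, bool unmountSucceeds)
> {
>   for (const std::string& v : infos[containerId]) {
>     if (referenceCount(v) == 1) {
>       std::cout << containerId << ": driver unmount " << v
>                 << (unmountSucceeds ? " (ok)" : " (reports failure)")
>                 << std::endl;
>     } else {
>       std::cout << containerId << ": skip unmount " << v
>                 << " (ref count " << referenceCount(v) << ")" << std::endl;
>     }
>   }
>
>   if (unmountSucceeds) {
>     infos.erase(containerId);
>   }
> }
>
> int main()
> {
>   infos["c1"] = {"volume-a"};
>   cleanup("c1", /*unmountSucceeds=*/false);  // failure: "c1" stays in infos
>
>   // "volume-a" now always has ref count >= 2 for later containers, so its
>   // driver unmount is never attempted again until agent recovery.
>   infos["c2"] = {"volume-a"};
>   cleanup("c2", /*unmountSucceeds=*/true);
>   return 0;
> }
> ```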
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)