[ 
https://issues.apache.org/jira/browse/MESOS-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-5352:
--------------------------
    Target Version/s:   (was: 1.2.0)

> Docker volume isolator cleanup can be blocked by first cleanup failure.
> -----------------------------------------------------------------------
>
>                 Key: MESOS-5352
>                 URL: https://issues.apache.org/jira/browse/MESOS-5352
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Gilbert Song
>            Priority: Critical
>              Labels: containerizer
>
> The summary title may be confusing; please see the description below for 
> details.
> Some background:
> 1). During docker volume isolator cleanup, we currently do reference 
> counting for docker volumes. The volume driver's `unmount` is only called 
> when the ref count is 1.
> 2). We maintain a hash map `infos` that tracks docker volume mount 
> information per containerId. A containerId is erased from the hash map 
> only if all driver `unmount` calls succeed (i.e., each subprocess returns 
> a ready future). A simplified sketch of this logic follows.
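> The sketch below is a minimal, self-contained illustration of the 
> reference counting described above, with simplified stand-in types; the 
> real isolator uses libprocess Futures and driver subprocesses, and 
> `refCount`, `cleanup`, and `driverUnmount` are hypothetical names, not 
> the actual Mesos functions:
>
>     #include <iostream>
>     #include <map>
>     #include <set>
>     #include <string>
>
>     using ContainerId = std::string;
>     using Volume = std::string;
>
>     // `infos`: which docker volumes each container has mounted.
>     std::map<ContainerId, std::set<Volume>> infos;
>
>     // Number of entries in `infos` still referencing `volume`.
>     int refCount(const Volume& volume) {
>       int count = 0;
>       for (const auto& entry : infos) {
>         count += entry.second.count(volume);
>       }
>       return count;
>     }
>
>     // Cleanup for one container. `unmountResult` simulates what the
>     // driver subprocess reports back for each `unmount` call.
>     bool cleanup(const ContainerId& containerId, bool unmountResult) {
>       bool allSucceeded = true;
>       for (const Volume& volume : infos[containerId]) {
>         if (refCount(volume) == 1) {  // last reference: unmount now
>           std::cout << "driver unmount " << volume << " -> "
>                     << (unmountResult ? "ok" : "failure") << std::endl;
>           allSucceeded = allSucceeded && unmountResult;
>         }
>       }
>       if (allSucceeded) {
>         infos.erase(containerId);  // skipped on any reported failure
>       }
>       return allSucceeded;
>     }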
> The issue in this JIRA: suppose we have a slave that keeps running (not 
> shut down or rebooted in this case), and we keep launching frameworks 
> that make use of docker volumes. Once any docker volume isolator cleanup 
> returns a failure, all the other `unmount` calls to those volumes will be 
> blocked by the reference count: `_cleanup()` returns a failure, so the 
> containerId is not erased from the hash map `infos`, even though all 
> volumes may have been unmounted/detached correctly. (The docker volume 
> isolator calls the driver's unmount as a subprocess, and the driver may 
> return a failure message even if all volumes were unmounted/detached 
> correctly.) The stale containerId in `infos` then makes every subsequent 
> isolator cleanup count one extra reference for each of those volumes, 
> which means it refuses to call the driver's unmount. So after all tasks 
> finish, all the docker volumes involved in the first failure will still 
> be in the `attached` status.
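> Continuing the sketch above, the failure cascade can be reproduced like 
> this (a hypothetical walk-through; the second argument to `cleanup` 
> simulates what the driver subprocess reports):
>
>     int main() {
>       // Container "c1" mounts "vol"; the driver reports a (possibly
>       // spurious) failure, so "c1" is never erased from `infos`.
>       infos["c1"] = {"vol"};
>       cleanup("c1", false);
>
>       // Container "c2" mounts the same volume. Its cleanup sees
>       // refCount("vol") == 2 because of the stale "c1" entry, so the
>       // driver unmount is skipped and "vol" stays attached, even
>       // though "c2"'s own cleanup "succeeds".
>       infos["c2"] = {"vol"};
>       cleanup("c2", true);
>
>       std::cout << "stale entries in infos: " << infos.size()
>                 << std::endl;  // prints 1 (the stale "c1")
>       return 0;
>     }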
> This issue goes away after the slave recovers, but we cannot rely on 
> restarting the slave every time we hit this case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)