[ https://issues.apache.org/jira/browse/MESOS-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15707988#comment-15707988 ]

Armand Grillet commented on MESOS-5352:
---------------------------------------

From a user's point of view, the issue can happen in the following case:
* An agent runs a task using a persistent volume.
* The task terminates and the framework (e.g. Marathon) wants to restart it.
* Another agent starts running the task.

Here is the error returned when hitting this case with Marathon:
{code}
Failed to launch container: Unexpected termination of the subprocess: 
time="2016-11-28T09:58:34Z" level=error msg="Plugin Error: VolumeDriver.Mount, 
{\"Error\":\"VolumeInUse: vol-305b6a84 is already attached to an 
instance\n\tstatus code: 400, request id: 
ee4b2e31-90a6-40c2-9820-99b6e1c0878d\"}\n" ; Container destroyed while 
preparing isolators.
{code}

If the framework then restarts the task on the first agent after the failure, 
it works.


> Docker volume isolator cleanup can be blocked by first cleanup failure.
> -----------------------------------------------------------------------
>
>                 Key: MESOS-5352
>                 URL: https://issues.apache.org/jira/browse/MESOS-5352
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Gilbert Song
>              Labels: containerizer
>
> The summary title may be confusing; please see the description below for 
> details.
> Some background:
> 1). In the docker volume isolator cleanup, we currently do reference counting 
> for docker volumes. The volume driver `unmount` will only be called if the 
> ref count is 1.
> 2). We maintain a hash map `infos` to track the docker volume mount 
> information for each containerId. A containerId is erased from the hash map 
> only if all driver `unmount` calls succeed (i.e., each subprocess returns a 
> ready future).
> The issue in this JIRA arises when a slave keeps running (not shut down or 
> rebooted in this case) and keeps launching frameworks that make use of docker 
> volumes. Once any docker volume isolator cleanup returns a failure, all other 
> `unmount` calls for these volumes are blocked by the reference count: 
> `_cleanup()` returns a failure, so the containerId is not erased from the 
> hash map `infos`, even though all volumes may have been unmounted/detached 
> correctly. (The docker volume isolator calls the driver unmount as a 
> subprocess, and the driver may return a failure message even if all volumes 
> were unmounted/detached correctly.) The stale containerId in `infos` then 
> makes every other isolator cleanup count one extra reference for these 
> volumes, so it refuses to call the driver unmount. As a result, after all 
> tasks finish, the docker volumes from the first failure remain in the 
> `attached` status.
> This issue goes away after the slave recovers, but we cannot rely on 
> restarting the slave every time we hit this case.
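
For illustration, here is a minimal, self-contained C++ sketch of the 
reference-counting behaviour described in the quoted description above. Only 
the name `infos` comes from the description; `Info`, `driverUnmount()` and 
`cleanup()` are simplified stand-ins, not the actual isolator code.
{code}
// Hypothetical, simplified model of the cleanup path; not Mesos source.
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Info {
  std::vector<std::string> volumes;  // Volumes mounted for this container.
};

std::map<std::string, Info> infos;   // containerId -> mount info.

// Stand-in for the driver `unmount` subprocess; the flag simulates the driver
// reporting an error even though the volume may really have been detached.
bool driverUnmount(const std::string& volume, bool simulateFailure) {
  std::cout << "driver unmount called for " << volume << std::endl;
  return !simulateFailure;
}

bool cleanup(const std::string& containerId, bool simulateFailure = false) {
  bool allSucceeded = true;

  for (const std::string& volume : infos[containerId].volumes) {
    // Reference counting: how many tracked containers still use this volume?
    int refs = 0;
    for (const auto& entry : infos) {
      for (const std::string& v : entry.second.volumes) {
        if (v == volume) ++refs;
      }
    }

    // Only the last user calls the driver unmount.
    if (refs > 1) {
      std::cout << "skip unmount of " << volume
                << " (ref count " << refs << ")" << std::endl;
      continue;
    }

    if (!driverUnmount(volume, simulateFailure)) {
      allSucceeded = false;  // Analogous to `_cleanup()` returning a failure.
    }
  }

  // The containerId is erased only if every unmount succeeded, so a driver
  // failure leaves a stale entry behind.
  if (allSucceeded) {
    infos.erase(containerId);
  }

  return allSucceeded;
}

int main() {
  // First cleanup: the driver reports an error, so "c1" stays in `infos`
  // even if the volume was actually detached.
  infos["c1"] = {{"vol-305b6a84"}};
  cleanup("c1", /*simulateFailure=*/true);

  // Any later container using the same volume now sees a ref count of 2
  // (its own entry plus the stale "c1"), so the driver unmount is never
  // called again and the volume stays "attached" until the agent restarts.
  infos["c2"] = {{"vol-305b6a84"}};
  cleanup("c2");

  return 0;
}
{code}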



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
