Gilbert Song created MESOS-5352:
-----------------------------------

             Summary: Docker volume isolator cleanup can be blocked by first 
cleanup failure.
                 Key: MESOS-5352
                 URL: https://issues.apache.org/jira/browse/MESOS-5352
             Project: Mesos
          Issue Type: Bug
          Components: containerization
            Reporter: Gilbert Song


The summary title may be confusing; please see the description below for details.

Some background:
1). In the docker volume isolator cleanup, we currently do reference counting for docker volumes: the volume driver `unmount` is only called when the ref count is 1.
2). We maintain a hash map `infos` that tracks the docker volume mount information for each containerId. A containerId is erased from the hash map only if all driver `unmount` calls succeed (i.e., each subprocess returns a ready future). A simplified sketch of this bookkeeping follows below.
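
Roughly, the bookkeeping looks like the following. This is a minimal C++ sketch with hypothetical, simplified types (the real isolator uses libprocess futures and subprocess calls to the driver; only the ref-count logic matters here):

```
#include <map>
#include <string>
#include <vector>

struct Volume
{
  std::string driver;
  std::string name;
};

struct Info
{
  // Docker volumes mounted for one container.
  std::vector<Volume> volumes;
};

// `infos` tracks docker volume mount information per containerId.
std::map<std::string, Info> infos;

// Count how many containers in `infos` still reference a volume.
// Driver `unmount` is only called when this returns 1.
size_t referenceCount(const Volume& volume)
{
  size_t count = 0;
  for (const auto& entry : infos) {
    for (const Volume& v : entry.second.volumes) {
      if (v.driver == volume.driver && v.name == volume.name) {
        count++;
      }
    }
  }
  return count;
}
```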

The issue in this JIRA: suppose a slave keeps running (it is not shut down or rebooted in this case), and we keep launching frameworks that make use of docker volumes. Once any docker volume isolator cleanup returns a failure, all other `unmount` calls for those volumes will be blocked by the reference count: `_cleanup()` returns a failure, so the containerId is not erased from the hash map `infos` even though all of its volumes may have been unmounted/detached correctly. (The docker volume isolator calls driver unmount in a subprocess, and the driver can return a failure message even when all volumes were unmounted/detached correctly.) The stale containerId in `infos` then makes every other isolator cleanup count one extra reference for each of those volumes, which means it declines to call driver unmount. So after all tasks finish, all the docker volumes involved in the first failure are still left in the `attached` status, as illustrated below.
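
Continuing the sketch above (same hypothetical types; the real code returns a failed future rather than a bool), the buggy cleanup path looks roughly like this:

```
// Stand-in for the driver `unmount` call. In the real isolator this is
// a subprocess, and it can report a failure even when the volume was in
// fact unmounted/detached on the host.
bool unmount(const Volume& volume)
{
  // Placeholder: shell out to the volume driver here.
  (void)volume;
  return true;
}

// Simplified `_cleanup()`: unmount each volume whose ref count is 1,
// then erase the container from `infos` only if every unmount succeeded.
bool cleanup(const std::string& containerId)
{
  const Info& info = infos.at(containerId);

  bool allSucceeded = true;
  for (const Volume& volume : info.volumes) {
    // Only the last reference actually calls the driver.
    if (referenceCount(volume) == 1) {
      if (!unmount(volume)) {
        allSucceeded = false;  // Spurious or real driver failure.
      }
    }
  }

  if (!allSucceeded) {
    // BUG: containerId stays in `infos`, so every volume it lists keeps
    // a phantom reference. Any later container sharing one of those
    // volumes sees a ref count >= 2 during its own cleanup and never
    // calls driver unmount, leaving the volume `attached` until the
    // slave restarts and recovers.
    return false;
  }

  infos.erase(containerId);
  return true;
}
```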

This issue goes away after the slave recovers, but we cannot rely on restarting the slave every time we hit this case.


