[
https://issues.apache.org/jira/browse/MESOS-3349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14738557#comment-14738557
]
haosdent edited comment on MESOS-3349 at 9/11/15 9:30 AM:
----------------------------------------------------------
Based on
[CLONE_NEWNS|http://stackoverflow.com/questions/22889241/linux-understanding-the-mount-namespace-clone-clone-newns-flag]
and [bind mounts|https://lwn.net/Articles/159092/], I think I can explain the
behaviour so far.
In LinuxFilesystemIsolatorProcess, we mount the persistent volume (with the
default make-private propagation) before launching the executor. After
LinuxLauncher forks with CLONE_NEWNS, we can unmount the persistent volume in
LinuxFilesystemIsolatorProcess, but this does not affect the executor, which
keeps holding that mount point in its own mount namespace. When the slave
receives TASK_FINISHED and LinuxFilesystemIsolatorProcess tries to rmdir the
mount point, the rmdir fails because the executor is still running and still
holds the mount (I confirmed this by adding trace code that shows when the
executor exits). So a possible fix is to mount the persistent volume as
make-shared or make-slave, but my attempts at this have failed so far; a
sketch of what that means at the mount(2) level and the mountinfo output from
my attempt follow.
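To make the proposed change concrete, here is a minimal sketch of what
mounting the volume as shared or slave would look like at the plain mount(2)
level (placeholder paths, not the actual Mesos fs helpers):
{code}
// Minimal sketch (paths are placeholders, this is not Mesos code): what
// "make-shared" or "make-slave" means at the mount(2) level.
#include <sys/mount.h>
#include <cstdio>

int main()
{
  const char* volume = "/tmp/volumes/roles/role1/id1";  // hypothetical
  const char* target = "/tmp/work_dir/path1";           // hypothetical

  // Bind mount the persistent volume into the executor's work directory.
  if (mount(volume, target, nullptr, MS_BIND, nullptr) != 0) {
    perror("mount MS_BIND");
    return 1;
  }

  // Change the propagation type of the new mount point instead of leaving it
  // private. With MS_PRIVATE (the behaviour described above) mount/umount
  // events never cross mount namespaces. With MS_SHARED they propagate
  // between all peers of the mount, so the copy in the executor's namespace
  // can be released when the agent unmounts. MS_SLAVE would make the copy
  // receive events from its master only, and only has that effect if the
  // mount is already part of a shared peer group.
  if (mount(nullptr, target, nullptr, MS_SHARED, nullptr) != 0) {
    perror("mount MS_SHARED");
    return 1;
  }

  return 0;
}
{code}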
{code}
45 22 8:3
/tmp/PersistentVolumeTest_AccessPersistentVolume_L6fY1a/volumes/roles/role1/id1
/tmp/PersistentVolumeTest_AccessPersistentVolume_L6fY1a/slaves/20150911-170559-162297291-49192-21628-S0/frameworks/20150911-170559-162297291-49192-21628-0000/executors/c6bcf76f-7cf5-42e6-8eb8-2d21e393ba3d/runs/454bbfa3-0305-4900-b05d-389f6b215c32/path1
rw,relatime shared:1 - ext4
/dev/disk/by-uuid/98708f21-a59d-4b80-a85c-27b78c22e316
rw,errors=remount-ro,data=ordered
{code}
{code}
78 48 8:3
/tmp/PersistentVolumeTest_AccessPersistentVolume_L6fY1a/volumes/roles/role1/id1
/tmp/PersistentVolumeTest_AccessPersistentVolume_L6fY1a/slaves/20150911-170559-162297291-49192-21628-S0/frameworks/20150911-170559-162297291-49192-21628-0000/executors/c6bcf76f-7cf5-42e6-8eb8-2d21e393ba3d/runs/454bbfa3-0305-4900-b05d-389f6b215c32/path1
rw,relatime shared:1 - ext4
/dev/disk/by-uuid/98708f21-a59d-4b80-a85c-27b78c22e316
rw,errors=remount-ro,data=ordered
{code}
As the mountinfo output above shows, the persistent volume is already mounted
as shared (shared:1), but the test still fails.
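For reference, the propagation state that mountinfo reports (the optional
fields such as shared:1 or master:1) can be checked programmatically from any
process, e.g. from inside the executor versus from the agent. A minimal
sketch, assuming a placeholder mount point path:
{code}
// Minimal sketch: print the propagation fields for a given mount point by
// reading /proc/self/mountinfo. The target path below is a placeholder.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
  const std::string target = "/tmp/work_dir/path1";  // hypothetical

  std::ifstream mountinfo("/proc/self/mountinfo");
  std::string line;
  while (std::getline(mountinfo, line)) {
    // Format: ID parentID major:minor root mountPoint options
    //         [optional fields...] - fstype source superOptions
    std::istringstream in(line);
    std::string id, parent, device, root, mountPoint, options, field;
    in >> id >> parent >> device >> root >> mountPoint >> options;

    if (mountPoint != target) {
      continue;
    }

    // Everything up to the "-" separator is an optional field such as
    // "shared:1" or "master:1"; a private mount has none.
    std::cout << mountPoint << ":";
    while (in >> field && field != "-") {
      std::cout << " " << field;
    }
    std::cout << std::endl;
  }

  return 0;
}
{code}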
was (Author: [email protected]):
After reading this
http://stackoverflow.com/questions/22889241/linux-understanding-the-mount-namespace-clone-clone-newns-flag
about CLONE_NEWNS, I think I can explain the behaviour so far.
In LinuxFilesystemIsolatorProcess, we mount the persistent volume in the parent (pid 24073):
{code}
I0910 18:07:42.768034 24073 linux.cpp:598] Mounting
'/tmp/PersistentVolumeTest_AccessPersistentVolume_PTx7g0/volumes/roles/role1/id1'
to
'/tmp/PersistentVolumeTest_AccessPersistentVolume_PTx7g0/slaves/20150910-180742-162297291-42795-24055-S0/frameworks/20150910-180742-162297291-42795-24055-0000/executors/72989615-cc6e-449c-a561-264fcee7edc3/runs/0cdc0d01-4c59-48e8-925a-7a6c06feb2ae/path1'
for persistent volume disk(role1)[id1:path1]:64 of container
0cdc0d01-4c59-48e8-925a-7a6c06feb2ae
{code}
After LinuxLauncher forks with CLONE_NEWNS, the child (pid 24071) can unmount
it, but the directory still cannot be removed with rmdir, because another
mount on it is still held on the parent's side:
{code}
I0910 18:07:44.868654 24071 linux.cpp:493] Removing mount
'/tmp/PersistentVolumeTest_AccessPersistentVolume_PTx7g0/slaves/20150910-180742-162297291-42795-24055-S0/frameworks/20150910-180742-162297291-42795-24055-0000/executors/72989615-cc6e-449c-a561-264fcee7edc3/runs/0cdc0d01-4c59-48e8-925a-7a6c06feb2ae/path1'
for persistent volume disk(role1)[id1:path1]:64 of container
0cdc0d01-4c59-48e8-925a-7a6c06feb2ae
E0910 18:07:44.876619 24076 slave.cpp:2870] Failed to update resources for
container 0cdc0d01-4c59-48e8-925a-7a6c06feb2ae of executor
72989615-cc6e-449c-a561-264fcee7edc3 running task
72989615-cc6e-449c-a561-264fcee7edc3 on status update for terminal task,
destroying container: Collect failed: Failed to remove persistent volume mount
point at
'/tmp/PersistentVolumeTest_AccessPersistentVolume_PTx7g0/slaves/20150910-180742-162297291-42795-24055-S0/frameworks/20150910-180742-162297291-42795-24055-0000/executors/72989615-cc6e-449c-a561-264fcee7edc3/runs/0cdc0d01-4c59-48e8-925a-7a6c06feb2ae/path1':
Device or resource busy
{code}
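The behaviour above can also be reproduced outside Mesos. A minimal standalone
sketch (run as root, with placeholder paths; this is not Mesos code): the
parent bind mounts a volume with private propagation, a child enters a new
mount namespace and keeps its own copy of that mount, and the parent's
subsequent umount plus rmdir then fails with the same "Device or resource
busy" error on the kernels where this test fails:
{code}
#include <sched.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main()
{
  const char* volume = "/tmp/volume";  // hypothetical, must exist
  const char* target = "/tmp/path1";   // hypothetical, must exist

  // Bind mount the "persistent volume" and make it private, which is what
  // the isolator's default propagation amounts to.
  if (mount(volume, target, nullptr, MS_BIND, nullptr) != 0 ||
      mount(nullptr, target, nullptr, MS_PRIVATE, nullptr) != 0) {
    perror("mount");
    return 1;
  }

  pid_t pid = fork();
  if (pid == 0) {
    // "Executor": enter a new mount namespace (same effect as being launched
    // with CLONE_NEWNS) and keep running while holding a copy of the mount.
    if (unshare(CLONE_NEWNS) != 0) {
      perror("unshare");
      _exit(1);
    }
    sleep(10);
    _exit(0);
  }

  sleep(1);  // crude synchronization: let the child enter its namespace

  // "Agent": unmount and try to remove the mount point in the original
  // namespace. The rmdir is expected to fail with "Device or resource busy"
  // while the child is alive, matching the error in the log above.
  umount(target);
  if (rmdir(target) != 0) {
    perror("rmdir");
  }

  waitpid(pid, nullptr, 0);
  return 0;
}
{code}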
> PersistentVolumeTest.AccessPersistentVolume fails when run as root.
> -------------------------------------------------------------------
>
> Key: MESOS-3349
> URL: https://issues.apache.org/jira/browse/MESOS-3349
> Project: Mesos
> Issue Type: Bug
> Components: test
> Environment: Ubuntu 14.04, CentOS 5
> Reporter: Benjamin Mahler
> Assignee: haosdent
> Labels: flaky-test
>
> When running the tests as root:
> {noformat}
> [ RUN ] PersistentVolumeTest.AccessPersistentVolume
> I0901 02:17:26.435140 39432 exec.cpp:133] Version: 0.25.0
> I0901 02:17:26.442129 39461 exec.cpp:207] Executor registered on slave
> 20150901-021726-1828659978-52102-32604-S0
> Registered executor on hostname
> Starting task d8ff1f00-e720-4a61-b440-e111009dfdc3
> sh -c 'echo abc > path1/file'
> Forked command at 39484
> Command exited with status 0 (pid: 39484)
> ../../src/tests/persistent_volume_tests.cpp:579: Failure
> Value of: os::exists(path::join(directory, "path1"))
> Actual: true
> Expected: false
> [ FAILED ] PersistentVolumeTest.AccessPersistentVolume (777 ms)
> {noformat}
> FYI [~jieyu] [~mcypark]