[
https://issues.apache.org/jira/browse/MESOS-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174009#comment-16174009
]
Zhitao Li commented on MESOS-7366:
----------------------------------
[~jieyu], sorry for reviving this task, but we might have missed a case for
{{unmount}} in linux.cpp. [This unmount call
|https://github.com/apache/mesos/blob/6f98b8d6d149c5497d16f588c683a68fccba4fc9/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L489]
can still fail if the device is busy.
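To illustrate why a failed unmount matters for gc, here is a minimal sketch of one possible guard: refuse to recursively delete a sandbox directory while any mount target still sits at or beneath it. This is not the Mesos implementation or the fix applied for this issue. The names {{mountTargets}}, {{isUnder}} and {{safeToDelete}} are hypothetical, and the mount table is passed in as text in the format of /proc/self/mountinfo so the logic can be exercised without root. Octal escapes such as \040 in mount paths are not decoded here.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Collects the mount point (fifth whitespace-separated field) of every
// line in text formatted like /proc/self/mountinfo.
std::vector<std::string> mountTargets(const std::string& mountinfo)
{
  std::vector<std::string> targets;
  std::istringstream lines(mountinfo);
  std::string line;

  while (std::getline(lines, line)) {
    std::istringstream fields(line);
    std::string field;

    // Fields: mount ID, parent ID, major:minor, root, mount point, ...
    for (int i = 0; i < 5 && (fields >> field); ++i) {
      if (i == 4) {
        targets.push_back(field);
      }
    }
  }

  return targets;
}

// Returns true if `path` is `dir` itself or lies beneath it.
bool isUnder(const std::string& path, const std::string& dir)
{
  if (path == dir) {
    return true;
  }

  std::string prefix = dir;
  if (prefix.empty() || prefix.back() != '/') {
    prefix += '/';
  }

  return path.compare(0, prefix.size(), prefix) == 0;
}

// Hypothetical gc guard: deleting `dir` is considered safe only when no
// mount target remains at or below it.
bool safeToDelete(const std::string& dir, const std::string& mountinfo)
{
  for (const std::string& target : mountTargets(mountinfo)) {
    if (isUnder(target, dir)) {
      return false;
    }
  }

  return true;
}
```

With a guard along these lines, a run directory whose {{volume}} mount could not be removed would be skipped by gc rather than descended into, so the contents of the persistent volume would stay untouched until the unmount eventually succeeds.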
> Agent sandbox gc could accidentally delete the entire persistent volume
> content
> -------------------------------------------------------------------------------
>
> Key: MESOS-7366
> URL: https://issues.apache.org/jira/browse/MESOS-7366
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 1.0.2, 1.1.1, 1.2.0
> Reporter: Zhitao Li
> Assignee: Jie Yu
> Priority: Blocker
> Fix For: 1.0.4, 1.1.2, 1.2.1
>
>
> When 1) a persistent volume is mounted, 2) umount is stuck or otherwise fails,
> and 3) executor directory gc is invoked, the agent seems to emit a log like:
> ```
> Failed to delete directory <executor_dir>/runs/<uuid>/volume: Device or
> resource busy
> ```
> After this, the persistent volume directory is empty.
> This could trigger data loss on critical workloads, so we should fix this ASAP.
> The triggering environment is a custom executor w/o rootfs image.
> Please let me know if you need more signal.
> {noformat}
> I0407 15:18:22.752624 22758 paths.cpp:536] Trying to chown
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377'
> to user 'uber'
> I0407 15:18:22.763229 22758 slave.cpp:6179] Launching executor
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 with resources
> cpus(cassandra-cstar-location-store, cassandra, {resource_id:
> 29e2ac63-d605-4982-a463-fa311be94e0a}):0.1;
> mem(cassandra-cstar-location-store, cassandra, {resource_id:
> 2e1223f3-41a2-419f-85cc-cbc839c19c70}):768;
> ports(cassandra-cstar-location-store, cassandra, {resource_id:
> fdd6598f-f32b-4c90-a622-226684528139}):[31001-31001] in work directory
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377'
> I0407 15:18:22.764103 22758 slave.cpp:1987] Queued task
> 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' for executor
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014
> I0407 15:18:22.766253 22764 containerizer.cpp:943] Starting container
> d5a56564-3e24-4c60-9919-746710b78377 for executor
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014
> I0407 15:18:22.767514 22766 linux.cpp:730] Mounting
> '/var/lib/mesos/volumes/roles/cassandra-cstar-location-store/d6290423-2ba4-4975-86f4-ffd84ad138ff'
> to
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume'
> for persistent volume disk(cassandra-cstar-location-store, cassandra,
> {resource_id:
> fefc15d6-0c6f-4eac-a3f8-c34d0335c5ec})[d6290423-2ba4-4975-86f4-ffd84ad138ff:volume]:6466445
> of container d5a56564-3e24-4c60-9919-746710b78377
> I0407 15:18:22.894340 22768 containerizer.cpp:1494] Checkpointing container's
> forked pid 6892 to
> '/var/lib/mesos/meta/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/pids/forked.pid'
> I0407 15:19:01.011916 22749 slave.cpp:3231] Got registration for executor
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 from executor(1)@10.14.6.132:36837
> I0407 15:19:01.031939 22770 slave.cpp:2191] Sending queued task
> 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' to executor
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 at executor(1)@10.14.6.132:36837
> I0407 15:26:14.012861 22749 linux.cpp:627] Removing mount
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume'
> for persistent volume
> disk(cassandra-cstar-location-store, cassandra, {resource_id:
> fefc15d6-0c6f-4eac-a3f8-c34d0335c5ec})[d6290423-2ba4-4975-86f4-ffd84ad138ff:volume]:6466445
> of container d5a56564-3e24-4c60-9919-746710b78377
> E0407 15:26:14.013828 22756 slave.cpp:3903] Failed to update resources for
> container d5a56564-3e24-4c60-9919-746710b78377 of executor
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' running task
> node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4 on status update for terminal
> task, destroying container: Collect failed: Failed to unmount unneeded
> persistent volume at
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume':
> Failed to unmount
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume':
> Device or resource busy
> I0407 15:26:14.545647 22747 linux.cpp:810] Unmounting volume
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume'
> for container d5a56564-3e24-4c60-9919-746710b78377
> E0407 15:26:14.546123 22753 slave.cpp:4520] Termination of executor
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 failed: Failed to clean up an
> isolator when destroying container: Failed to unmount volume
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume':
> Failed to unmount
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume':
> Device or resource busy
> I0407 15:26:14.566028 22744 slave.cpp:4646] Cleaning up executor
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 at executor(1)@10.14.6.132:36837
> I0407 15:26:14.566186 22768 gc.cpp:55] Scheduling
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377'
> for gc 6.99999344714074days in the future
> I0407 15:26:14.566299 22768 gc.cpp:55] Scheduling
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7'
> for gc 6.99999344665481days in the future
> I0407 15:26:14.566337 22768 gc.cpp:55] Scheduling
> '/var/lib/mesos/meta/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377'
> for gc 6.99999344637926days in the future
> I0407 15:26:14.566368 22768 gc.cpp:55] Scheduling
> '/var/lib/mesos/meta/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7'
> for gc 6.99999344597333days in the future
> {noformat}