[ 
https://issues.apache.org/jira/browse/MESOS-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174009#comment-16174009
 ] 

Zhitao Li commented on MESOS-7366:
----------------------------------

[~jieyu], sorry for reviving this task, but we might have missed a case for 
{{unmount}} in linux.cpp. [This unmount call 
|https://github.com/apache/mesos/blob/6f98b8d6d149c5497d16f588c683a68fccba4fc9/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L489]
 can still fail if the device is busy.
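
One way to confirm whether a failed unmount left a volume mounted under a sandbox is to inspect the kernel's mount table. The following is a minimal Python sketch, not Mesos code: it assumes the Linux {{/proc/self/mountinfo}} layout (mount point in the fifth field, octal escapes for special characters), and the names {{decode_octal}} and {{mounts_under}} are hypothetical helpers introduced only for illustration.

```python
import re


def decode_octal(path):
    """Decode the octal escapes (e.g. \\040 for a space) used in mountinfo paths."""
    return re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), path)


def mounts_under(mountinfo_text, directory):
    """Return mount points listed in mountinfo text that sit at or below `directory`."""
    base = directory.rstrip("/")
    prefix = base + "/"
    found = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        # Field 5 (index 4) is the mount point, relative to the process root.
        target = decode_octal(fields[4])
        if target == base or target.startswith(prefix):
            found.append(target)
    return found


def current_mounts_under(directory, mountinfo_path="/proc/self/mountinfo"):
    """Read the live mount table (Linux only) and filter it by `directory`."""
    with open(mountinfo_path) as f:
        return mounts_under(f.read(), directory)
```

A non-empty result for an executor run directory would indicate that a volume is still mounted there, for example after an unmount that failed with "Device or resource busy".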


> Agent sandbox gc could accidentally delete the entire persistent volume 
> content
> -------------------------------------------------------------------------------
>
>                 Key: MESOS-7366
>                 URL: https://issues.apache.org/jira/browse/MESOS-7366
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.0.2, 1.1.1, 1.2.0
>            Reporter: Zhitao Li
>            Assignee: Jie Yu
>            Priority: Blocker
>             Fix For: 1.0.4, 1.1.2, 1.2.1
>
>
> When 1) a persistent volume is mounted, 2) the unmount fails or gets stuck, and 3) 
> executor directory gc is invoked, the agent seems to emit a log like:
> ```
>  Failed to delete directory  <executor_dir>/runs/<uuid>/volume: Device or 
> resource busy
> ```
> After this, the persistent volume directory is empty.
> This could trigger data loss on critical workloads, so we should fix this ASAP.
> The triggering environment is a custom executor without a rootfs image.
> Please let me know if you need more signal.
> {noformat}
> I0407 15:18:22.752624 22758 paths.cpp:536] Trying to chown 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377'
>  to user 'uber'
> I0407 15:18:22.763229 22758 slave.cpp:6179] Launching executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 with resources 
> cpus(cassandra-cstar-location-store, cassandra, {resource_id: 
> 29e2ac63-d605-4982-a463-fa311be94e0a}):0.1; 
> mem(cassandra-cstar-location-store, cassandra, {resource_id: 
> 2e1223f3-41a2-419f-85cc-cbc839c19c70}):768; 
> ports(cassandra-cstar-location-store, cassandra, {resource_id: 
> fdd6598f-f32b-4c90-a622-226684528139}):[31001-31001] in work directory 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377'
> I0407 15:18:22.764103 22758 slave.cpp:1987] Queued task 
> 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' for executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014
> I0407 15:18:22.766253 22764 containerizer.cpp:943] Starting container 
> d5a56564-3e24-4c60-9919-746710b78377 for executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014
> I0407 15:18:22.767514 22766 linux.cpp:730] Mounting 
> '/var/lib/mesos/volumes/roles/cassandra-cstar-location-store/d6290423-2ba4-4975-86f4-ffd84ad138ff'
>  to 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume'
>  for persistent volume disk(cassandra-cstar-location-store, cassandra, 
> {resource_id: 
> fefc15d6-0c6f-4eac-a3f8-c34d0335c5ec})[d6290423-2ba4-4975-86f4-ffd84ad138ff:volume]:6466445
>  of container d5a56564-3e24-4c60-9919-746710b78377
> I0407 15:18:22.894340 22768 containerizer.cpp:1494] Checkpointing container's 
> forked pid 6892 to 
> '/var/lib/mesos/meta/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/pids/forked.pid'
> I0407 15:19:01.011916 22749 slave.cpp:3231] Got registration for executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 from executor(1)@10.14.6.132:36837
> I0407 15:19:01.031939 22770 slave.cpp:2191] Sending queued task 
> 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' to executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 at executor(1)@10.14.6.132:36837
> I0407 15:26:14.012861 22749 linux.cpp:627] Removing mount 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume'
>  for persistent volume 
> disk(cassandra-cstar-location-store, cassandra, {resource_id: 
> fefc15d6-0c6f-4eac-a3f8-c34d0335c5ec})[d6290423-2ba4-4975-86f4-ffd84ad138ff:volume]:6466445
>  of container d5a56564-3e24-4c60-9919-746710b78377
> E0407 15:26:14.013828 22756 slave.cpp:3903] Failed to update resources for 
> container d5a56564-3e24-4c60-9919-746710b78377 of executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' running task 
> node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4 on status update for terminal 
> task, destroying container: Collect failed: Failed to unmount unneeded 
> persistent volume at 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume':
>  Failed to unmount 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume':
>  Device or resource busy
> I0407 15:26:14.545647 22747 linux.cpp:810] Unmounting volume 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume'
>  for container d5a56564-3e24-4c60-9919-746710b78377
> E0407 15:26:14.546123 22753 slave.cpp:4520] Termination of executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 failed: Failed to clean up an 
> isolator when destroying container: Failed to unmount volume 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume':
>  Failed to unmount 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume':
>  Device or resource busy
> I0407 15:26:14.566028 22744 slave.cpp:4646] Cleaning up executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 at executor(1)@10.14.6.132:36837
> I0407 15:26:14.566186 22768 gc.cpp:55] Scheduling 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377'
>  for gc 6.99999344714074days in the future
> I0407 15:26:14.566299 22768 gc.cpp:55] Scheduling 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7'
>  for gc 6.99999344665481days in the future
> I0407 15:26:14.566337 22768 gc.cpp:55] Scheduling 
> '/var/lib/mesos/meta/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377'
>  for gc 6.99999344637926days in the future
> I0407 15:26:14.566368 22768 gc.cpp:55] Scheduling 
> '/var/lib/mesos/meta/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7'
>  for gc 6.99999344597333days in the future
> {noformat}
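
The log above shows the unmount failing with "Device or resource busy" and the run directory then being scheduled for gc while the volume is still mounted inside it. As an illustration of the kind of guard that would avoid deleting through a live mount, here is a minimal Python sketch. It is not the Mesos fix; {{find_mounts_under}} and {{gc_sandbox}} are hypothetical names, and the {{is_mount}} predicate is an assumption that the caller can supply. Note that Python's {{os.path.ismount}} may not detect bind mounts on the same filesystem, so a predicate based on the kernel mount table would likely be needed for bind-mounted persistent volumes.

```python
import os
import shutil


def find_mounts_under(root, is_mount=os.path.ismount):
    """Return directories below `root` that `is_mount` reports as mount points."""
    mounts = []
    for dirpath, dirnames, _ in os.walk(root, topdown=True):
        for name in list(dirnames):
            path = os.path.join(dirpath, name)
            if is_mount(path):
                mounts.append(path)
                # Do not descend into the mounted tree at all.
                dirnames.remove(name)
    return mounts


def gc_sandbox(root, is_mount=os.path.ismount):
    """Delete `root` only if no mount point remains below it.

    Returns (removed, mounts): `removed` is False and `mounts` lists the
    offending paths when deletion was refused.
    """
    mounts = find_mounts_under(root, is_mount)
    if mounts:
        return False, mounts
    shutil.rmtree(root)
    return True, []
```

Refusing the whole deletion, rather than deleting around the mount, is a deliberately conservative choice here: the directory can be retried on a later gc pass once the unmount has succeeded.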



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
