[ 
https://issues.apache.org/jira/browse/MESOS-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454853#comment-16454853
 ] 

Chun-Hung Hsiao commented on MESOS-8830:
----------------------------------------

How did you restart the agent as a new one? Did you just remove the {{latest}} 
symlink in the meta dir, or did you remove the runtime dir as well?

When an agent is restarted as a new one, it goes through the runtime dir to 
discover existing containers and checks whether each one has a matching record 
in its checkpoint in the meta dir. In your case, since the agent is a new one, 
there will be no records at all, so every running container discovered in the 
runtime dir will be considered an orphan, and the containerizer will destroy 
and clean them up, which includes running cleanup for each isolator.
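To make the flow concrete, here is a rough, hypothetical sketch of that 
recovery logic. It is not the actual Mesos containerizer code; the paths and 
the destroyOrphan helper are purely illustrative:

{code:cpp}
// Hypothetical sketch of the recovery flow described above, NOT the actual
// Mesos containerizer code. Container IDs discovered in the runtime dir are
// compared against the records checkpointed in the meta dir; anything without
// a record is treated as an orphan and destroyed.
#include <filesystem>
#include <iostream>
#include <set>
#include <string>

namespace fs = std::filesystem;

// Illustrative stand-in: in the real agent this runs cleanup for each isolator.
void destroyOrphan(const std::string& containerId) {
  std::cout << "Destroying orphan container " << containerId << std::endl;
}

int main() {
  // Illustrative paths; the real layout is governed by the agent's
  // --work_dir and --runtime_dir flags.
  const fs::path metaDir = "/var/lib/mesos/meta/containers";
  const fs::path runtimeDir = "/var/run/mesos/containers";

  // Records checkpointed by the agent. For a brand-new agent ID this set is
  // empty, so every container discovered below becomes an orphan.
  std::set<std::string> checkpointed;
  if (fs::exists(metaDir)) {
    for (const auto& entry : fs::directory_iterator(metaDir)) {
      checkpointed.insert(entry.path().filename().string());
    }
  }

  // Containers discovered by walking the runtime dir.
  if (fs::exists(runtimeDir)) {
    for (const auto& entry : fs::directory_iterator(runtimeDir)) {
      const std::string containerId = entry.path().filename().string();
      if (checkpointed.count(containerId) == 0) {
        destroyOrphan(containerId);
      }
    }
  }

  return 0;
}
{code}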

I suspect that, for some reason, the containers that appear in the log were 
still running and were not treated as orphaned containers. Could you verify 
whether this is the case? You could look at the agent log and see whether they 
were cleaned up as orphaned containers during recovery.

> Agent gc on old slave sandboxes could empty persistent volume data
> ------------------------------------------------------------------
>
>                 Key: MESOS-8830
>                 URL: https://issues.apache.org/jira/browse/MESOS-8830
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.3.1
>            Reporter: Zhitao Li
>            Priority: Blocker
>
> We had an issue in which custom Cassandra executors (which do not use any 
> container image and thus run on the host filesystem) had their persistent 
> volume data wiped out.
> Upon revisiting the logs, we found the following suspicious lines:
> {panel:title=log}
> I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max 
> allowed age: 4.764742265646493days
> I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429704593days
> I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429587852days
> I0424 02:06:11.717183 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:11.727033 10994 gc.cpp:146] Deleted 
> '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44'
> I0424 02:06:11.727094 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from 
> 127.0.0.1:53602 with User-Agent='Go-http-client/1.1'
> E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume:
>  Device or resource busy
> E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149:
>  Directory not empty
> E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs:
>  Directory not empty
> E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a:
>  Directory not empty
> E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors:
>  Directory not empty
> E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004:
>  Directory not empty
> E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks:
>  Directory not empty
> E0424 02:06:15.394556 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44:
>  Directory not empty
> {panel}
> (I can try to provide more logs, depending on how much the local archive 
> retains after rotation.)
> This happened on a 1.3.1 agent, although I suspect the issue is not specific 
> to that version.
> The path 
> */var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume*
>  is a bind mount to a persistent volume. The fact that agent gc touched that 
> path makes me believe this is what triggered the data loss.
> We had some misconfigurations on our fleet, and I do not yet know whether the 
> previous slave (ID *70279b87-553a-4213-a85b-46fdc191849d-S4*) was shut down 
> cleanly.
> My questions/suggestions:
> 1) If an executor is asked to shut down by a new agent (with a new ID), how 
> much of the persistent volume cleanup code will be executed (especially if 
> the new agent does not know that this executor should still be running)?
> 2) Should we figure out a way to better protect bind mounts to persistent 
> volumes in slave/gc.cpp (for instance, by skipping any path that looks 
> dangerous), to prevent similar problems? A rough sketch of such a check 
> follows below.
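> A hypothetical sketch (not code from slave/gc.cpp) of what such a safeguard 
> could look like: refuse to delete a directory while any mount target listed 
> in /proc/mounts still lives underneath it. The sandbox path is illustrative:
> {code:cpp}
> // Hypothetical safeguard sketch, NOT actual slave/gc.cpp code. Before
> // recursively deleting a directory, scan /proc/mounts and skip the deletion
> // if any mount target (e.g. a persistent volume bind mount) is nested inside.
> #include <fstream>
> #include <iostream>
> #include <sstream>
> #include <string>
> 
> // Returns true if some mount target in /proc/mounts equals `dir` or is
> // nested underneath it.
> bool containsMountPoint(const std::string& dir) {
>   std::ifstream mounts("/proc/mounts");
>   std::string line;
>   while (std::getline(mounts, line)) {
>     std::istringstream fields(line);
>     std::string device, target;
>     fields >> device >> target;
>     if (target == dir || target.rfind(dir + "/", 0) == 0) {
>       return true;
>     }
>   }
>   return false;
> }
> 
> int main() {
>   // Illustrative sandbox path, not a real agent layout.
>   const std::string sandbox = "/var/lib/mesos/slaves/S44/frameworks/F0";
> 
>   if (containsMountPoint(sandbox)) {
>     std::cerr << "Refusing to gc " << sandbox
>               << ": a mount point still exists inside it" << std::endl;
>   } else {
>     // The real gc path would call os::rmdir(sandbox) here.
>     std::cout << "Safe to gc " << sandbox << std::endl;
>   }
>   return 0;
> }
> {code}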



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
