[ https://issues.apache.org/jira/browse/MESOS-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450468#comment-16450468 ]
Zhitao Li commented on MESOS-8830: ---------------------------------- Minor correction: I mistakenly thought the problematic path is a hard link while it's actually a bind mount. > Agent gc on old slave sandboxes could empty persistent volume data > ------------------------------------------------------------------ > > Key: MESOS-8830 > URL: https://issues.apache.org/jira/browse/MESOS-8830 > Project: Mesos > Issue Type: Bug > Affects Versions: 1.3.1 > Reporter: Zhitao Li > Priority: Blocker > > We had an issue in which custom Cassandra executors (which does not use any > container image thus running on host filesystem) saw its persistent volume > data got wiped out. > Upon revisiting logs, we found following suspicious lines: > {panel:title=log} > I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max > allowed age: 4.764742265646493days > I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining > removal time 2.23508429704593days > I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining > removal time 2.23508429587852days > I0424 02:06:11.717183 10994 gc.cpp:133] Deleting > /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44 > I0424 02:06:11.727033 10994 gc.cpp:146] Deleted > '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44' > I0424 02:06:11.727094 10994 gc.cpp:133] Deleting > /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44 > I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from > 127.0.0.1:53602 with User-Agent='Go-http-client/1.1' > E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory > /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume: > Device or resource busy > E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory > /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149: > Directory not empty > E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory > /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs: > Directory not empty > E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory > /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a: > Directory not empty > E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory > /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors: > Directory not empty > E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory > /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004: > Directory not empty > E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory > /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks: > Directory not empty > E0424 02:06:15.394556 10994 rmdir.hpp:81] Failed to delete directory > /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44: > Directory not empty > {panel} > (I can try to provide more logs, depending on how much local archive after > rotation has) > This happened on a 1.3.1 agent although I suspect it's not local to that > version. > The path > */var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume* > is a bind mount to a persistent volume. The fact that agent gc touched that > process makes me believe this is what triggered the data loss. > We had some misconfigurations on out fleet, and I do not know whether the > previous slave (id-ed *70279b87-553a-4213-a85b-46fdc191849d-S4*) was shutdown > cleanly yet. > My questions/suggestions: > 1) if an executor was asked to shutdown by a new agent (with a new id), how > much of persistent volume clean up code will be executed (especially if new > agent do not really know this executor should be running anymore)? > 2) should we figure out a way to better protect bind mounts to persistent > volumes in slave/gc.cpp (for instance, skip any path which seems dangerous), > to prevent similar problems? -- This message was sent by Atlassian JIRA (v7.6.3#76005)