[ https://issues.apache.org/jira/browse/MESOS-8444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Qian Zhang updated MESOS-8444:
------------------------------
    Description:
I launched a task group with a single task via {{mesos-execute}}; the task just ran {{sleep 10}}. When the task finished, {{Slave::removeExecutor()}} and {{Slave::removeFramework()}} were called, and they tried to gc 3 directories:
# /<slave-work-dir>/slaves/<slaveID>/frameworks/<frameworkID>/executors/<executorID>/runs/<containerID>
# /<slave-work-dir>/slaves/<slaveID>/frameworks/<frameworkID>/executors/<executorID>
# /<slave-work-dir>/slaves/<slaveID>/frameworks/<frameworkID>

For 1 and 2, the code to gc them looks like this:
{code}
garbageCollect(path)
  .then(defer(self(), &Self::detachFile, path));
{code}

Because {{then()}} is used, the detach happens only when the gc succeeds. The problem is that the order in which gc deletes 1, 2 and 3 is not guaranteed; in my tests, 3 was deleted first most of the time. Since 3 is the parent directory of 1 and 2, the gc for 1 and 2 then fails:
{code}
I0111 00:19:33.001655 42889 gc.cpp:208] Deleting /home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000
I0111 00:19:33.002576 42889 gc.cpp:218] Deleted '/home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000'
I0111 00:19:33.004551 42893 gc.cpp:208] Deleting /home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor/runs/b067936a-f4c4-4091-b786-4dd4d4d6da15
W0111 00:19:33.004622 42893 gc.cpp:212] Failed to delete '/home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor/runs/b067936a-f4c4-4091-b786-4dd4d4d6da15': No such file or directory
I0111 00:19:33.006367 42923 gc.cpp:208] Deleting /home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor
W0111 00:19:33.006466 42923 gc.cpp:212] Failed to delete '/home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor': No such file or directory
{code}

So we will NOT do the detach for 1 and 2, which is a leak.

  was:
I launched a task group which has one task via {{mesos-execute}}, and that task just did a {{sleep 10}}, when the task finished, {{Slave::removeExecutor()}} and {{Slave::removeFramework()}} were called and they will try to gc 3 directories:
# /<slave-work-dir>/slaves/<slaveID>/frameworks/<frameworkID>/executors/<executorID>/runs/<containerID>
# /<slave-work-dir>/slaves/<slaveID>/frameworks/<frameworkID>/executors/<executorID>
# /<slave-work-dir>/slaves/<slaveID>/frameworks/<frameworkID>

For 1 and 2, the code to gc them is like this:
{code}
garbageCollect(path)
  .then(defer(self(), &Self::detachFile, path));
{code}

So here {{then()}} is used which means we will only do the detach when the gc succeeds. But the problem is the order of 1, 2 and 3 deleted by gc can not be guaranteed, from my test, 3 will be deleted first for most of times. Since 3 is the parent directory of 1 and 2, so gc to 1 and 2 will fail:
{code}
I0111 00:19:33.001655 42889 gc.cpp:208] Deleting /home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000
I0111 00:19:33.002576 42889 gc.cpp:218] Deleted '/home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000'
I0111 00:19:33.004551 42893 gc.cpp:208] Deleting /home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor/runs/b067936a-f4c4-4091-b786-4dd4d4d6da15
W0111 00:19:33.004622 42893 gc.cpp:212] Failed to delete '/home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor/runs/b067936a-f4c4-4091-b786-4dd4d4d6da15': No such file or directory
I0111 00:19:33.006367 42923 gc.cpp:208] Deleting /home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor
W0111 00:19:33.006466 42923 gc.cpp:212] Failed to delete '/home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor': No such file or directory
{code}

So we will NOT do the detach for 1 and 2 which is a leak.

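The ordering problem described above can be modelled outside Mesos. The following is a minimal sketch only, not libprocess or agent code: {{ToyAgent}}, {{leakedAfterGc}}, the short placeholder paths, and the choice to track attachments only for directories 1 and 2 are all assumptions made for illustration. It deletes the parent directory first, as in the logs, and compares detaching only on gc success (the {{then()}} behaviour) with detaching regardless of the gc outcome.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for the agent state. Hypothetical, for illustration only.
struct ToyAgent {
  std::set<std::string> onDisk;    // directories that still exist
  std::set<std::string> attached;  // virtual paths that are still attached

  // Removes `path` and everything beneath it. Returns false when `path`
  // is already gone, mirroring the "No such file or directory" failure.
  bool garbageCollect(const std::string& path) {
    if (onDisk.count(path) == 0) {
      return false;
    }
    const std::string prefix = path + "/";
    for (auto it = onDisk.begin(); it != onDisk.end();) {
      const bool inside =
          (*it == path) || (it->compare(0, prefix.size(), prefix) == 0);
      if (inside) {
        it = onDisk.erase(it);
      } else {
        ++it;
      }
    }
    return true;
  }

  void detachFile(const std::string& path) { attached.erase(path); }
};

// Runs the gc in the order seen in the logs (parent first) and returns how
// many virtual paths are still attached afterwards.
std::size_t leakedAfterGc(bool detachEvenOnFailure) {
  // Placeholder paths standing in for directories 3, 2 and 1.
  const std::string framework = "/work/frameworks/F";
  const std::string executor = framework + "/executors/E";
  const std::string run = executor + "/runs/C";

  ToyAgent agent;
  agent.onDisk = {framework, executor, run};
  agent.attached = {executor, run};  // assumption: only 1 and 2 are attached

  const std::vector<std::string> order = {framework, run, executor};
  for (const std::string& path : order) {
    const bool ok = agent.garbageCollect(path);
    const bool isSandboxPath = (path != framework);
    // then()-style chaining detaches only when the gc succeeded.
    if (isSandboxPath && (ok || detachEvenOnFailure)) {
      agent.detachFile(path);
    }
  }

  assert(agent.onDisk.empty());  // deleting the parent removed everything
  return agent.attached.size();
}
```

In this toy model, {{leakedAfterGc(false)}} returns 2 (both sandbox paths stay attached), while {{leakedAfterGc(true)}} returns 0. Whether detaching unconditionally is the right fix for the agent itself is a separate question that this sketch does not settle.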
> Agent miss to detach virtual paths for the executor's sandbox
> -------------------------------------------------------------
>
>                 Key: MESOS-8444
>                 URL: https://issues.apache.org/jira/browse/MESOS-8444
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>            Reporter: Qian Zhang
>            Assignee: Qian Zhang
>
> I launched a task group with a single task via {{mesos-execute}}; the task
> just ran {{sleep 10}}. When the task finished, {{Slave::removeExecutor()}}
> and {{Slave::removeFramework()}} were called, and they tried to gc 3
> directories:
> # /<slave-work-dir>/slaves/<slaveID>/frameworks/<frameworkID>/executors/<executorID>/runs/<containerID>
> # /<slave-work-dir>/slaves/<slaveID>/frameworks/<frameworkID>/executors/<executorID>
> # /<slave-work-dir>/slaves/<slaveID>/frameworks/<frameworkID>
>
> For 1 and 2, the code to gc them looks like this:
> {code}
> garbageCollect(path)
>   .then(defer(self(), &Self::detachFile, path));
> {code}
>
> Because {{then()}} is used, the detach happens only when the gc succeeds.
> The problem is that the order in which gc deletes 1, 2 and 3 is not
> guaranteed; in my tests, 3 was deleted first most of the time. Since 3 is
> the parent directory of 1 and 2, the gc for 1 and 2 then fails:
> {code}
> I0111 00:19:33.001655 42889 gc.cpp:208] Deleting /home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000
> I0111 00:19:33.002576 42889 gc.cpp:218] Deleted '/home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000'
> I0111 00:19:33.004551 42893 gc.cpp:208] Deleting /home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor/runs/b067936a-f4c4-4091-b786-4dd4d4d6da15
> W0111 00:19:33.004622 42893 gc.cpp:212] Failed to delete '/home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor/runs/b067936a-f4c4-4091-b786-4dd4d4d6da15': No such file or directory
> I0111 00:19:33.006367 42923 gc.cpp:208] Deleting /home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor
> W0111 00:19:33.006466 42923 gc.cpp:212] Failed to delete '/home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor': No such file or directory
> {code}
>
> So we will NOT do the detach for 1 and 2, which is a leak.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)