[ https://issues.apache.org/jira/browse/MESOS-8444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16328216#comment-16328216 ]
Qian Zhang commented on MESOS-8444:
-----------------------------------
commit 5225a49c495bc7e3362bcee2d460d8c99111c7f4
Author: Qian Zhang
Date: Sun Jan 14 22:02:33 2018 +0800
Detached the virtual paths regardless of the result of gc.
Previously we only detached the following paths when the gc for the
executor's sandbox succeeded:
1. /agent_workdir/frameworks/FID/executors/EID/runs/CID
2. /agent_workdir/frameworks/FID/executors/EID/runs/latest
3. /frameworks/FID/executors/EID/runs/latest
But the problem is that such gc may not always succeed, e.g., it may fail
because the parent directory of the executor's sandbox has already been
gc'ed. With this patch, we detach those paths regardless of the result
of gc.
Review: https://reviews.apache.org/r/65156
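The difference between detaching only on success and detaching regardless of the gc result can be sketched as follows. This is a minimal, self-contained model and not the actual patch: `GcResult`, `Files`, `leaksWithThen` and `leaksWithOnAny` are hypothetical names, and `GcResult` is only a toy stand-in for a completed future whose `then` runs on success and whose `onAny` runs in every case.

```cpp
#include <functional>
#include <set>
#include <string>

// Toy stand-in for a completed future (NOT libprocess): it only records
// whether the gc succeeded.
struct GcResult {
  bool succeeded;

  // Runs `f` only when the gc succeeded.
  const GcResult& then(const std::function<void()>& f) const {
    if (succeeded) f();
    return *this;
  }

  // Runs `f` whether the gc succeeded or failed.
  const GcResult& onAny(const std::function<void()>& f) const {
    f();
    return *this;
  }
};

// Hypothetical registry of attached virtual paths.
struct Files {
  std::set<std::string> attached;
  void detach(const std::string& path) { attached.erase(path); }
};

// Old behaviour: detach is chained with then(), so a failed gc leaves
// the path attached. Returns true if the path leaked.
inline bool leaksWithThen(bool gcSucceeded) {
  Files files;
  const std::string path = "/frameworks/FID/executors/EID/runs/latest";
  files.attached.insert(path);
  GcResult{gcSucceeded}.then([&] { files.detach(path); });
  return files.attached.count(path) > 0;
}

// Behaviour described by the commit: detach regardless of the gc result.
// Returns true if the path leaked.
inline bool leaksWithOnAny(bool gcSucceeded) {
  Files files;
  const std::string path = "/frameworks/FID/executors/EID/runs/latest";
  files.attached.insert(path);
  GcResult{gcSucceeded}.onAny([&] { files.detach(path); });
  return files.attached.count(path) > 0;
}
```

In this model, `leaksWithThen(false)` reports a leak, while `leaksWithOnAny` never does, whatever the gc result.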
> GC failure causes agent miss to detach virtual paths for the executor's
> sandbox
> -------------------------------------------------------------------------------
>
> Key: MESOS-8444
> URL: https://issues.apache.org/jira/browse/MESOS-8444
> Project: Mesos
> Issue Type: Bug
> Components: agent
> Reporter: Qian Zhang
> Assignee: Qian Zhang
> Priority: Major
> Fix For: 1.5.0, 1.6.0
>
>
> I launched a task via {{mesos-execute}} that just ran {{sleep 10}}. When
> the task finished, {{Slave::removeExecutor()}} and
> {{Slave::removeFramework()}} were called, and they tried to gc 3
> directories:
> # /<slave-work-dir>/slaves/<slaveID>/frameworks/<frameworkID>/executors/<executorID>/runs/<containerID>
> # /<slave-work-dir>/slaves/<slaveID>/frameworks/<frameworkID>/executors/<executorID>
> # /<slave-work-dir>/slaves/<slaveID>/frameworks/<frameworkID>
> For 1 and 2, the code to gc them is like this:
> {code}
> garbageCollect(path)
> .then(defer(self(), &Self::detachFile, path));
> {code}
> Because {{then()}} is used here, the detach runs only when the gc
> succeeds. The problem is that the order in which gc deletes 1, 2 and 3 is
> not guaranteed; in my tests, 3 was deleted first most of the time. Since 3
> is the parent directory of 1 and 2, the gc for 1 and 2 then fails:
> {code}
> I0111 00:19:33.001655 42889 gc.cpp:208] Deleting /home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000
> I0111 00:19:33.002576 42889 gc.cpp:218] Deleted '/home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000'
> I0111 00:19:33.004551 42893 gc.cpp:208] Deleting /home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor/runs/b067936a-f4c4-4091-b786-4dd4d4d6da15
> W0111 00:19:33.004622 42893 gc.cpp:212] Failed to delete '/home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor/runs/b067936a-f4c4-4091-b786-4dd4d4d6da15': No such file or directory
> I0111 00:19:33.006367 42923 gc.cpp:208] Deleting /home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor
> W0111 00:19:33.006466 42923 gc.cpp:212] Failed to delete '/home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor': No such file or directory
> {code}
> So the detach for 1 and 2 is never done, which leaks the attached virtual paths.
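The ordering problem described in the issue can be reproduced outside Mesos with a small sketch. It assumes C++17 `std::filesystem`; `removeTree`, `Outcome` and `gcParentFirst` are hypothetical helpers, and `removeTree` is only a stand-in for one gc step that fails when its target is already gone, similar to the "No such file or directory" warnings in the quoted log.

```cpp
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical stand-in for one gc step: it fails if the path no longer
// exists, otherwise it removes the whole tree.
inline bool removeTree(const fs::path& path) {
  std::error_code ec;
  if (!fs::exists(path, ec)) return false;  // "No such file or directory"
  fs::remove_all(path, ec);
  return !ec;
}

// Result of each of the three deletions.
struct Outcome {
  bool framework;
  bool run;
  bool executor;
};

// Creates <root>/frameworks/FID/executors/EID/runs/CID and then deletes
// the framework directory (the parent) before the two nested directories.
inline Outcome gcParentFirst(const fs::path& root) {
  const fs::path framework = root / "frameworks" / "FID";
  const fs::path executor = framework / "executors" / "EID";
  const fs::path run = executor / "runs" / "CID";
  fs::create_directories(run);

  Outcome outcome;
  outcome.framework = removeTree(framework);  // succeeds, removes children too
  outcome.run = removeTree(run);              // fails, already gone
  outcome.executor = removeTree(executor);    // fails, already gone
  return outcome;
}
```

Calling `gcParentFirst` on a scratch directory reports success for the framework directory and failure for the other two, which is the case where a success-only continuation never runs.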