[
https://issues.apache.org/jira/browse/MESOS-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200810#comment-14200810
]
Jie Yu commented on MESOS-2047:
-------------------------------
[~idownes] Do you agree with the points I made in the description? Yes, we can
get away from this particular issue by removing the error checks for os::rm.
However, we never know if there are some other errors in the future (for other
isolators) that we can safely ignore.
> Isolator cleanup failures shouldn't cause TASK_LOST.
> ----------------------------------------------------
>
> Key: MESOS-2047
> URL: https://issues.apache.org/jira/browse/MESOS-2047
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.21.0
> Reporter: Jie Yu
>
> Right now, if isolator cleanup fails, we'll transition all pending tasks to
> TASK_LOST (even in the OOM case, we should have transitioned it to
> TASK_FAILED).
> The problematic code is here:
> {noformat}
> 1052 void MesosContainerizerProcess::___destroy(
>
> 1053 const ContainerID& containerId,
> 1054 const Future<Option<int>>& status,
> 1055 const Future<list<Future<Nothing>>>& cleanups)
> 1056 {
> 1057 // This should not occur because we only use the Future<list> to
>
> 1058 // facilitate chaining.
> 1059 CHECK_READY(cleanups);
> 1060
> 1061 // Check cleanup succeeded for all isolators. If not, we'll fail the
>
> 1062 // container termination and remove the 'destroying' flag but leave
>
> 1063 // all other state. The container is now in an inconsistent state.
> 1064 foreach (const Future<Nothing>& cleanup, cleanups.get()) {
>
> 1065 if (!cleanup.isReady()) {
>
> 1066 promises[containerId]->fail(
>
> 1067 "Failed to clean up an isolator when destroying container '" +
>
> 1068 stringify(containerId) + "' :" +
> 1069 (cleanup.isFailed() ? cleanup.failure() : "discarded future"));
>
> 1070
> 1071 destroying.erase(containerId);
> 1072
> 1073 return;
> 1074 }
> 1075 }
> {noformat}
> Since launcher->destroy already succeeds (all processes are killed), instead
> of failing the promises[containerId], we probably should just export the
> error through metrics (so that people can get alerted on that) and still set
> the termination appropriately.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)