[
https://issues.apache.org/jira/browse/MESOS-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200825#comment-14200825
]
Ian Downes commented on MESOS-2047:
-----------------------------------
I do agree that it's incorrect to return TASK_LOST if all processes have been
terminated but *we* failed to clean up other state, i.e., from the framework's
perspective the task completed (whether it be FAILED or whatever).
The short term fix is to avoid failing on the os::rm but longer term we should
do as you suggest, perhaps even partitioning to ::terminate and ::cleanup so
the slave has visibility into the stages.
> Isolator cleanup failures shouldn't cause TASK_LOST.
> ----------------------------------------------------
>
> Key: MESOS-2047
> URL: https://issues.apache.org/jira/browse/MESOS-2047
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.21.0
> Reporter: Jie Yu
>
> Right now, if isolator cleanup fails, we'll transition all pending tasks to
> TASK_LOST (even in the OOM case, we should have transitioned it to
> TASK_FAILED).
> The problematic code is here:
> {noformat}
> 1052 void MesosContainerizerProcess::___destroy(
>
> 1053 const ContainerID& containerId,
> 1054 const Future<Option<int>>& status,
> 1055 const Future<list<Future<Nothing>>>& cleanups)
> 1056 {
> 1057 // This should not occur because we only use the Future<list> to
>
> 1058 // facilitate chaining.
> 1059 CHECK_READY(cleanups);
> 1060
> 1061 // Check cleanup succeeded for all isolators. If not, we'll fail the
>
> 1062 // container termination and remove the 'destroying' flag but leave
>
> 1063 // all other state. The container is now in an inconsistent state.
> 1064 foreach (const Future<Nothing>& cleanup, cleanups.get()) {
>
> 1065 if (!cleanup.isReady()) {
>
> 1066 promises[containerId]->fail(
>
> 1067 "Failed to clean up an isolator when destroying container '" +
>
> 1068 stringify(containerId) + "' :" +
> 1069 (cleanup.isFailed() ? cleanup.failure() : "discarded future"));
>
> 1070
> 1071 destroying.erase(containerId);
> 1072
> 1073 return;
> 1074 }
> 1075 }
> {noformat}
> Since launcher->destroy already succeeds (all processes are killed), instead
> of failing the promises[containerId], we probably should just export the
> error through metrics (so that people can get alerted on that) and still set
> the termination appropriately.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)