[jira] [Commented] (MESOS-2047) Isolator cleanup failures shouldn't cause TASK_LOST.

Ian Downes (JIRA) Thu, 06 Nov 2014 12:14:02 -0800

    [ 
https://issues.apache.org/jira/browse/MESOS-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200825#comment-14200825
 ]


Ian Downes commented on MESOS-2047:
-----------------------------------

I do agree that it's incorrect to return TASK_LOST if all processes have been 
terminated but *we* failed to clean up other state, i.e., from the framework's 
perspective the task completed (whether it be FAILED or whatever).

The short term fix is to avoid failing on the os::rm but longer term we should 
do as you suggest, perhaps even partitioning to ::terminate and ::cleanup so 
the slave has visibility into the stages.

> Isolator cleanup failures shouldn't cause TASK_LOST.
> ----------------------------------------------------
>
>                 Key: MESOS-2047
>                 URL: https://issues.apache.org/jira/browse/MESOS-2047
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.21.0
>            Reporter: Jie Yu
>
> Right now, if isolator cleanup fails, we'll transition all pending tasks to 
> TASK_LOST (even in the OOM case, we should have transitioned it to 
> TASK_FAILED).
> The problematic code is here:
> {noformat}
> 1052 void MesosContainerizerProcess::___destroy(                              
>                            
> 1053     const ContainerID& containerId,
> 1054     const Future<Option<int>>& status,
> 1055     const Future<list<Future<Nothing>>>& cleanups)
> 1056 {   
> 1057   // This should not occur because we only use the Future<list> to       
>                            
> 1058   // facilitate chaining.
> 1059   CHECK_READY(cleanups);
> 1060                  
> 1061   // Check cleanup succeeded for all isolators. If not, we'll fail the   
>                            
> 1062   // container termination and remove the 'destroying' flag but leave    
>                            
> 1063   // all other state. The container is now in an inconsistent state.
> 1064   foreach (const Future<Nothing>& cleanup, cleanups.get()) {             
>                            
> 1065     if (!cleanup.isReady()) {                                            
>                            
> 1066       promises[containerId]->fail(                                       
>                            
> 1067         "Failed to clean up an isolator when destroying container '" +   
>                            
> 1068         stringify(containerId) + "' :" +
> 1069         (cleanup.isFailed() ? cleanup.failure() : "discarded future"));  
>                            
> 1070   
> 1071       destroying.erase(containerId);
> 1072                  
> 1073       return;    
> 1074     }
> 1075   }
> {noformat}
> Since launcher->destroy already succeeds (all processes are killed), instead 
> of failing the promises[containerId], we probably should just export the 
> error through metrics (so that people can get alerted on that) and still set 
> the termination appropriately.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (MESOS-2047) Isolator cleanup failures shouldn't cause TASK_LOST.

Reply via email to