[jira] [Commented] (MESOS-2047) Isolator cleanup failures shouldn't cause TASK_LOST.

Benjamin Mahler (JIRA) Thu, 06 Nov 2014 11:29:28 -0800

    [ 
https://issues.apache.org/jira/browse/MESOS-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200720#comment-14200720
 ]


Benjamin Mahler commented on MESOS-2047:
----------------------------------------

For the short term, should we change the following known race to not be an 
error?

{code: title=port_mapping.cpp}
  Try<Nothing> unmount = fs::unmount(target, MNT_DETACH);
  if (unmount.isError()) {
    errors.push_back("Failed to umount: " + unmount.error());
  }

  // MNT_DETACH does a lazy umount, which means umount will eventually
  // succeed when the mount point becomes idle, but possiblely not
  // soon enough every time for this remove to go through, e.g,
  // someone entered into the container for debugging purpose. In that
  // case remove will fail, which is okay, because we only leaked an
  // empty file, which could also be reused later if the pid (the name
  // of the file) is used again. However, we still return error to
  // indicate that the cleanup hasn't been successful.
  Try<Nothing> rm = os::rm(target);
  if (rm.isError()) {
    errors.push_back("Failed to remove " + target + ": " + rm.error());
  }
{code}

Seems like this isn't something that should surface a TASK_LOST.

> Isolator cleanup failures shouldn't cause TASK_LOST.
> ----------------------------------------------------
>
>                 Key: MESOS-2047
>                 URL: https://issues.apache.org/jira/browse/MESOS-2047
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.21.0
>            Reporter: Jie Yu
>
> Right now, if isolator cleanup fails, we'll transition all pending tasks to 
> TASK_LOST (even in the OOM case, we should have transitioned it to 
> TASK_FAILED).
> The problematic code is here:
> {noformat}
> 1052 void MesosContainerizerProcess::___destroy(                              
>                            
> 1053     const ContainerID& containerId,
> 1054     const Future<Option<int>>& status,
> 1055     const Future<list<Future<Nothing>>>& cleanups)
> 1056 {   
> 1057   // This should not occur because we only use the Future<list> to       
>                            
> 1058   // facilitate chaining.
> 1059   CHECK_READY(cleanups);
> 1060                  
> 1061   // Check cleanup succeeded for all isolators. If not, we'll fail the   
>                            
> 1062   // container termination and remove the 'destroying' flag but leave    
>                            
> 1063   // all other state. The container is now in an inconsistent state.
> 1064   foreach (const Future<Nothing>& cleanup, cleanups.get()) {             
>                            
> 1065     if (!cleanup.isReady()) {                                            
>                            
> 1066       promises[containerId]->fail(                                       
>                            
> 1067         "Failed to clean up an isolator when destroying container '" +   
>                            
> 1068         stringify(containerId) + "' :" +
> 1069         (cleanup.isFailed() ? cleanup.failure() : "discarded future"));  
>                            
> 1070   
> 1071       destroying.erase(containerId);
> 1072                  
> 1073       return;    
> 1074     }
> 1075   }
> {noformat}
> Since launcher->destroy already succeeds (all processes are killed), instead 
> of failing the promises[containerId], we probably should just export the 
> error through metrics (so that people can get alerted on that) and still set 
> the termination appropriately.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (MESOS-2047) Isolator cleanup failures shouldn't cause TASK_LOST.

Reply via email to