[jira] [Commented] (MESOS-3070) Master CHECK failure if a framework uses duplicated task id.

Klaus Ma (JIRA) Mon, 03 Aug 2015 22:36:47 -0700

    [ 
https://issues.apache.org/jira/browse/MESOS-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653091#comment-14653091
 ]


Klaus Ma commented on MESOS-3070:
---------------------------------

In my current draft, the master kills the old task when it fails to add that 
task back to the framework; see the draft code below. 
With this change the master sends a KillTaskMessage to the slave, and the slave 
then reports the task status to the framework, according to the code. I'm 
testing this in my cluster, but does anyone have suggestions on how to write a 
unit test for this kind of daemon interaction? Should I use mocks?

{code}
   foreachkey (const FrameworkID& frameworkId, slave->tasks) {
     foreachvalue (Task* task, slave->tasks[frameworkId]) {
       Framework* framework = getFramework(task->framework_id());
-      if (framework != NULL) { // The framework might not be re-registered yet.
-        framework->addTask(task);
-      } else {
+      // The framework might not be re-registered yet.
+      if (framework == NULL) {
         // TODO(benh): We should really put a timeout on how long we
         // keep tasks running on a slave that never have frameworks
         // reregister and claim them.
         LOG(WARNING) << "Possibly orphaned task " << task->task_id()
                      << " of framework " << task->framework_id()
                      << " running on slave " << *slave;
+        continue;
+      }
+
+      // If the task cannot be added back, kill it on the slave.
+      // See MESOS-3070.
+      if (!framework->addTask(task)) {
+        KillTaskMessage message;
+        message.mutable_framework_id()->MergeFrom(task->framework_id());
+        message.mutable_task_id()->MergeFrom(task->task_id());
+        send(slave->pid, message);
       }
     }
   }
{code}
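
As a rough illustration of the idea in the draft above, and of one way to 
unit test it without real daemons, here is a minimal self-contained sketch. 
Every type and function in it (Task, KillRequest, Framework, Master, 
reregisterSlave, simulateIssueScenario) is a hypothetical stand-in and not 
the real Mesos class or API. It assumes addTask() is changed to return false 
on a duplicate task id instead of hitting the CHECK, and it records the kill 
requests in a vector instead of calling send(), so a test can inspect them.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the Mesos Task protobuf.
struct Task {
  std::string taskId;
  std::string frameworkId;
};

// Hypothetical stand-in for the fields the draft sets on KillTaskMessage,
// plus the slave the message would be sent to.
struct KillRequest {
  std::string frameworkId;
  std::string taskId;
  std::string slaveId;
};

struct Framework {
  std::map<std::string, Task> tasks;

  // Assumption: returns false on a duplicate task id instead of aborting.
  bool addTask(const Task& task) {
    return tasks.emplace(task.taskId, task).second;
  }
};

struct Master {
  std::map<std::string, Framework> frameworks;
  std::vector<KillRequest> sent;  // Records what would be sent to slaves.

  // Mirrors the loop in the draft: skip tasks whose framework is unknown,
  // and request a kill for tasks that cannot be added back.
  void reregisterSlave(const std::string& slaveId,
                       const std::vector<Task>& tasks) {
    for (const Task& task : tasks) {
      auto it = frameworks.find(task.frameworkId);
      if (it == frameworks.end()) {
        // The framework might not be re-registered yet.
        std::cerr << "Possibly orphaned task " << task.taskId
                  << " of framework " << task.frameworkId
                  << " running on slave " << slaveId << std::endl;
        continue;
      }

      if (!it->second.addTask(task)) {
        sent.push_back({task.frameworkId, task.taskId, slaveId});
      }
    }
  }
};

// Replays steps 4 and 5 of the issue description against the toy master.
Master simulateIssueScenario() {
  Master master;

  // Step 4: frameworkA re-registered and launched 'task_id_1' on slaveB.
  master.frameworks["frameworkA"].addTask(Task{"task_id_1", "frameworkA"});

  // Step 5: slaveA re-registers and still reports the old 'task_id_1',
  // a non-conflicting task, and a task of a framework that is not back yet.
  master.reregisterSlave("slaveA", {Task{"task_id_1", "frameworkA"},
                                    Task{"task_id_2", "frameworkA"},
                                    Task{"task_id_3", "frameworkB"}});
  return master;
}
```

A test can call simulateIssueScenario() and check that exactly one kill 
request was recorded, for 'task_id_1' on slaveA, and that nothing aborted. 
Recording outgoing messages this way is only one option; a mock of the 
sending side would serve the same purpose.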



> Master CHECK failure if a framework uses duplicated task id.
> ------------------------------------------------------------
>
>                 Key: MESOS-3070
>                 URL: https://issues.apache.org/jira/browse/MESOS-3070
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.22.1
>            Reporter: Jie Yu
>            Assignee: Klaus Ma
>
> We observed this in one of our testing clusters.
> One framework (under development) keeps launching tasks using the same 
> task_id. We don't expect the master to crash even if the framework is not 
> doing what it's supposed to do. However, under the following series of 
> events, this can happen and keeps crashing the master.
> 1) frameworkA launches task 'task_id_1' on slaveA
> 2) master fails over
> 3) slaveA has not re-registered yet
> 4) frameworkA re-registered and launches task 'task_id_1' on slaveB
> 5) slaveA re-registers and adds task 'task_id_1' to frameworkA
> 6) CHECK failure in addTask
> {noformat}
> I0716 21:52:50.759305 28805 master.hpp:159] Adding task 'task_id_1' with 
> resources cpus(*):4; mem(*):32768 on slave 
> 20150417-232509-1735470090-5050-48870-S25 (hostname)
> ...
> ...
> F0716 21:52:50.760136 28805 master.hpp:362] Check failed: 
> !tasks.contains(task->task_id()) Duplicate task 'task_id_1' of framework 
> <framework_id>
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
