[ 
https://issues.apache.org/jira/browse/MESOS-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706160#comment-14706160
 ] 

Klaus Ma commented on MESOS-3070:
---------------------------------

Summary of current status & options:

*Status*:
* The UT code has been updated; see 
[MESOS-3070_UT.cpp|https://gist.github.com/klaus1982/971d09bf51c99dbe94fa]
* A *draft* code diff that generates the task UUID in the master was uploaded to 
[RB#37531|https://reviews.apache.org/r/37531/]. *NOTE*: this diff only 
illustrates the overall solution; more details need to be handled before it is 
marked as reviewable.

*Actions*:
* Confirm the final solution and work out the code diff for review.

For now, there are several options:
# send the list of rejected tasks to the slave within 
SlaveReregisteredMessage; the slave kills the executor/tasks accordingly
# persist task info in the registry; reject duplicated tasks when the master 
restarts
# store tasks in the master in a per-slave map
# have Mesos generate a unique task id for each task; there are two possible 
implementations:
## add a task tag (TaskTag) attribute: 1) copy taskID to taskTag, 2) generate a 
UUID and assign it to taskID
## add a task uid (TaskUID) attribute: 1) generate a UUID and assign it to 
taskUID, 2) update the code to use the task UUID between master & slave
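
To make option 4.2 concrete, here is a minimal, self-contained sketch. It is not the code in RB#37531 and does not use Mesos types: {{TaskRecord}}, {{TaskTable}}, {{generateTaskUid}} and {{restoreTask}} are hypothetical names, and the UID is a random 128-bit hex string standing in for a real UUID library. The idea is that the master keys its bookkeeping on the generated UID, so the framework-supplied task id can repeat without colliding.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <random>
#include <sstream>
#include <string>
#include <unordered_map>

// All names below are hypothetical; none of them are Mesos types.
struct TaskRecord {
  std::string taskId;   // chosen by the framework, may repeat
  std::string slaveId;  // slave the task was launched on
  std::string taskUid;  // generated by the master, unique per launch
};

// Random 128-bit value rendered as 32 hex digits. This is a stand-in
// for a real UUID library, not an RFC 4122 UUID.
inline std::string generateTaskUid() {
  static std::mt19937_64 rng{std::random_device{}()};
  std::ostringstream out;
  out << std::hex << std::setfill('0');
  out << std::setw(16) << rng();
  out << std::setw(16) << rng();
  return out.str();
}

class TaskTable {
public:
  // New launch: the master assigns a fresh UID and returns it.
  std::string addTask(const std::string& taskId, const std::string& slaveId) {
    std::string uid = generateTaskUid();
    while (tasks_.count(uid) != 0) {
      uid = generateTaskUid();  // retry on the (very unlikely) collision
    }
    const bool inserted =
        tasks_.emplace(uid, TaskRecord{taskId, slaveId, uid}).second;
    assert(inserted);  // holds because the loop above picked a free UID
    (void)inserted;
    return uid;
  }

  // Slave re-registration: the task already carries its UID. A repeated
  // UID is reported to the caller instead of aborting the process.
  bool restoreTask(const TaskRecord& task) {
    return tasks_.emplace(task.taskUid, task).second;
  }

  const TaskRecord* find(const std::string& uid) const {
    const auto it = tasks_.find(uid);
    return it == tasks_.end() ? nullptr : &it->second;
  }

  std::size_t size() const { return tasks_.size(); }

private:
  std::unordered_map<std::string, TaskRecord> tasks_;
};
```

In the scenario from the issue description, the task on slaveB gets a new UID through {{addTask}}, and the task reported later by slaveA comes back through {{restoreTask}} with its own UID, so both entries coexist even though they share 'task_id_1'.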

I prefer #4.2:
Regarding #1, it would break the current design: task status updates are sent 
back to the master, and the master would be confused by the duplicated task id
Regarding #2, it would become a scalability bottleneck on master failover
Regarding #3, it is similar to #4, using frameworkId + slaveId + taskId as a 
unique task id; but the data structure changes, so the master code must be 
updated to identify a task this way, or the master will still be confused by the 
duplicated task id
Regarding #4.1, it would confuse users because the TaskID is changed
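
For comparison, the composite key from option #3 could look roughly like the sketch below. The names ({{TaskKey}}, {{PerSlaveTaskIndex}}, {{slavesRunning}}) are hypothetical and this is not how the master stores tasks today; it only illustrates why every caller that identifies a task would have to supply the slave id as well.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <tuple>

// Hypothetical composite key: (frameworkId, slaveId, taskId).
using TaskKey = std::tuple<std::string, std::string, std::string>;

class PerSlaveTaskIndex {
public:
  // Returns false if this exact (framework, slave, task) triple exists.
  bool add(const std::string& frameworkId,
           const std::string& slaveId,
           const std::string& taskId) {
    assert(!taskId.empty());  // illustrative precondition only
    return keys_.emplace(frameworkId, slaveId, taskId).second;
  }

  bool contains(const std::string& frameworkId,
                const std::string& slaveId,
                const std::string& taskId) const {
    return keys_.count(TaskKey{frameworkId, slaveId, taskId}) != 0;
  }

  // Number of slaves on which the framework uses this task id. A value
  // above one is ambiguous for any code path that identifies a task by
  // framework and task id alone.
  std::size_t slavesRunning(const std::string& frameworkId,
                            const std::string& taskId) const {
    std::size_t n = 0;
    for (const TaskKey& key : keys_) {
      if (std::get<0>(key) == frameworkId && std::get<2>(key) == taskId) {
        ++n;
      }
    }
    return n;
  }

private:
  std::set<TaskKey> keys_;
};
```

With this layout the same 'task_id_1' can be recorded on both slaveA and slaveB, but a lookup without the slave id has two candidates, which is the remaining confusion mentioned for #3.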

I'd like to get your comments & suggestions to continue the work.

> Master CHECK failure if a framework uses duplicated task id.
> ------------------------------------------------------------
>
>                 Key: MESOS-3070
>                 URL: https://issues.apache.org/jira/browse/MESOS-3070
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.22.1
>            Reporter: Jie Yu
>            Assignee: Klaus Ma
>
> We observed this in one of our testing clusters.
> One framework (under development) keeps launching tasks using the same 
> task_id. We don't expect the master to crash even if the framework is not 
> doing what it's supposed to do. However, under a series of events, this can 
> happen and keep crashing the master.
> 1) frameworkA launches task 'task_id_1' on slaveA
> 2) master fails over
> 3) slaveA has not re-registered yet
> 4) frameworkA re-registers and launches task 'task_id_1' on slaveB
> 5) slaveA re-registers and adds task 'task_id_1' to frameworkA
> 6) CHECK failure in addTask
> {noformat}
> I0716 21:52:50.759305 28805 master.hpp:159] Adding task 'task_id_1' with 
> resources cpus(*):4; mem(*):32768 on slave 
> 20150417-232509-1735470090-5050-48870-S25 (hostname)
> ...
> ...
> F0716 21:52:50.760136 28805 master.hpp:362] Check failed: 
> !tasks.contains(task->task_id()) Duplicate task 'task_id_1' of framework 
> <framework_id>
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
