[ 
https://issues.apache.org/jira/browse/MESOS-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172870#comment-14172870
 ] 

Dominic Hamon commented on MESOS-1830:
--------------------------------------

Discussion from RB:

OK. Here is a proposal for what it could look like.

General idea: We should add as few top-level task states as possible, because 
each new state means more work for frameworks. TASK_LOST should be used for 
cases where we expect a relaunch of the task would succeed (unfortunately this 
principle breaks with reconciliation).

Add 2 new task states to TaskState:
enum TaskState {
  ...
  ...
  ...,
  TASK_UNAUTHORIZED,  # Fold this into TASK_INVALID?
  TASK_INVALID        # Maybe use TASK_ERROR instead, since it already exists
                      # but is unused?
}

We add 2 new fields, "source" and "reason"/"code" (both enums), to TaskStatus.

NOTE: We should take this opportunity to move task validation from the 
scheduler driver to the master, to simplify things. Maybe do this as the first 
patch.

enum Source {
  MASTER,
  SLAVE,
  EXECUTOR,
  SCHEDULER, # Don't need this when we move validation to master.
}

Based on the different status updates, these are the reasons I came up with. 
Let me know if you can't figure out which reason should be used where :)

enum Reason {
  # Set by master.
  INVALID_OFFERS,
  SLAVE_REMOVED,
  SLAVE_DISCONNECTED,
  SLAVE_UNKNOWN,
  TASK_UNKNOWN,

  # Set by scheduler driver for now. But we could kill this and expect the
  # scheduler to not send launch tasks when it is disconnected?
  MASTER_DISCONNECTED,

  # Set by slave.
  GC_ERROR,
  SLAVE_RESTARTED,
  EXECUTOR_TERMINATED,
}
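
For illustration only, the two proposed fields could be modeled roughly as 
below. This is a Python sketch, not the protobuf definition: the enum member 
names come from the proposal above, while the TaskStatus dataclass shape, the 
optional reason, and the describe() helper are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Source(Enum):
    MASTER = auto()
    SLAVE = auto()
    EXECUTOR = auto()
    SCHEDULER = auto()  # Not needed once validation moves to the master.


class Reason(Enum):
    # Set by master.
    INVALID_OFFERS = auto()
    SLAVE_REMOVED = auto()
    SLAVE_DISCONNECTED = auto()
    SLAVE_UNKNOWN = auto()
    TASK_UNKNOWN = auto()
    # Set by scheduler driver (for now).
    MASTER_DISCONNECTED = auto()
    # Set by slave.
    GC_ERROR = auto()
    SLAVE_RESTARTED = auto()
    EXECUTOR_TERMINATED = auto()


@dataclass(frozen=True)
class TaskStatus:
    """Hypothetical stand-in for the TaskStatus message."""
    task_id: str
    state: str  # e.g. "TASK_LOST"
    source: Source
    reason: Optional[Reason] = None


def describe(status: TaskStatus) -> str:
    """Render a status update as a one-line summary (illustrative helper)."""
    reason = status.reason.name if status.reason else "UNSPECIFIED"
    return f"{status.task_id}: {status.state} from {status.source.name} ({reason})"
```

Keeping the state coarse (TASK_LOST) and pushing the detail into source and 
reason is what lets frameworks ignore the extra fields if they do not care.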

Currently the "Reason" makes sense for LOST updates generated by the 
master/slave. Executors might use this code for updates they generate, but it 
is up to the framework how to interpret it. We could also consider adding more 
reasons for TASK_INVALID/TASK_ERROR, which is also generated by the master 
(e.g., TASK_UNAUTHORIZED could be a reason for TASK_INVALID).
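
As a rough sketch of how a source field could feed the per-source LOST counters 
this ticket asks for, the following Python counts TASK_LOST updates by who 
generated them. The (state, source) pair shape and the count_lost_by_source 
name are assumptions for the example, not Mesos APIs.

```python
from collections import Counter
from enum import Enum, auto


class Source(Enum):
    MASTER = auto()
    SLAVE = auto()
    EXECUTOR = auto()


def count_lost_by_source(updates):
    """Count TASK_LOST updates per source.

    `updates` is an iterable of (state, source) pairs; this shape is an
    assumption made for the sketch, not the Mesos wire format.
    """
    counts = Counter()
    for state, source in updates:
        if state == "TASK_LOST":
            counts[source] += 1
    return counts
```

The same idea extends to keying on (source, reason) if finer-grained stats are 
wanted.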

Bill Farner 1 week ago (Oct. 8, 2014, 9:40 a.m.)
This looks good; I have one addendum: frameworks must not be allowed to set 
status update fields in ways that conflict with the master/slave, i.e. an 
executor should not be allowed to specify the Source (or if it does, Mesos 
should overwrite it).
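
A minimal sketch of the "overwrite it" option, in Python: whichever component 
forwards the update stamps the source itself and discards what the sender 
supplied. StatusUpdate and stamp_source are hypothetical names used only for 
this example.

```python
from dataclasses import dataclass, replace
from enum import Enum, auto
from typing import Optional


class Source(Enum):
    MASTER = auto()
    SLAVE = auto()
    EXECUTOR = auto()


@dataclass(frozen=True)
class StatusUpdate:
    """Hypothetical status update; field names are assumptions."""
    task_id: str
    state: str
    source: Optional[Source] = None


def stamp_source(update: StatusUpdate, actual: Source) -> StatusUpdate:
    """Return a copy whose source is the component that really sent it.

    Whatever the sender put in `source` is discarded, so an executor
    cannot pose as the master or slave.
    """
    return replace(update, source=actual)
```

For example, an update arriving from an executor with source already set to 
MASTER would come out of stamp_source(update, Source.EXECUTOR) marked as 
EXECUTOR.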

Vinod Kone 1 week ago (Oct. 8, 2014, 10:24 a.m.)
Yup. That definitely was on my mind :)

Alexander Rukletsov 6 days, 9 hours ago (Oct. 9, 2014, 3:20 a.m.)
Looks good to me. We can also add TASK_FAILED reasons and TASK_KILLED 
explanations to the Reason enum. Generally, my proposal is to use Reason for 
all second-tier states.

> Expose master stats differentiating between master-generated and 
> slave-generated LOST tasks
> -------------------------------------------------------------------------------------------
>
>                 Key: MESOS-1830
>                 URL: https://issues.apache.org/jira/browse/MESOS-1830
>             Project: Mesos
>          Issue Type: Story
>          Components: master
>            Reporter: Bill Farner
>            Assignee: Dominic Hamon
>            Priority: Minor
>
> The master exports a monotonically-increasing counter of tasks transitioned 
> to TASK_LOST.  This loses fidelity of the source of the lost task.  A first 
> step in exposing the source of lost tasks might be to just differentiate 
> between TASK_LOST transitions initiated by the master vs the slave (and maybe 
> bad input from the scheduler).



