[
https://issues.apache.org/jira/browse/MESOS-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172870#comment-14172870
]
Dominic Hamon commented on MESOS-1830:
--------------------------------------
Discussion from RB:
OK. Here is a proposal for what it could look like.
General idea: We should add as few top-level task states as possible, because
each new state is more work for frameworks to handle. TASK_LOST should be used
for cases where we expect a relaunch of the task would succeed (unfortunately
this principle breaks with reconciliation).
Add 2 new task states to TaskState
enum TaskState {
...
...
...,
TASK_UNAUTHORIZED, # Fold this into TASK_INVALID?
TASK_INVALID # Maybe use TASK_ERROR instead, since it already exists but is
# unused?
}
We add 2 new fields, "source" and "reason"/"code", both enums, to TaskStatus.
NOTE: We should take this opportunity to move task validations from the
scheduler driver to the master, to simplify. Maybe do this as the first patch.
enum Source {
MASTER,
SLAVE,
EXECUTOR,
SCHEDULER, # Don't need this when we move validation to master.
}
Based on the different status updates, these are the reasons I came up with.
Let me know if you can't figure out which reason should be used where :)
enum Reason {
# Set by master
INVALID_OFFERS,
SLAVE_REMOVED,
SLAVE_DISCONNECTED,
SLAVE_UNKNOWN,
TASK_UNKNOWN,
# Set by scheduler driver for now. But we could kill this and expect the
# scheduler to not send launch tasks when it is disconnected?
MASTER_DISCONNECTED,
# Set by slave
GC_ERROR,
SLAVE_RESTARTED,
EXECUTOR_TERMINATED,
}
Currently the "Reason" makes sense for LOST updates generated by the
master/slave. Executors might use this code for updates they generate, but it
is up to the framework how to interpret it. We could also consider adding more
reasons for TASK_INVALID/TASK_ERROR, which are also generated by the master
(e.g., TASK_UNAUTHORIZED could be a reason for TASK_INVALID).
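To make the shape of the proposal concrete, here is a minimal Python sketch of a
status update carrying the two new fields. The Python classes, the REASON_SOURCE
table and the lost() helper are hypothetical and for illustration only; in Mesos
these would be protobuf enums on TaskStatus. The reason names and their grouping
by component are taken from the proposal above.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Source(Enum):
    MASTER = auto()
    SLAVE = auto()
    EXECUTOR = auto()
    SCHEDULER = auto()  # Not needed once validation moves to the master.


class Reason(Enum):
    INVALID_OFFERS = auto()
    SLAVE_REMOVED = auto()
    SLAVE_DISCONNECTED = auto()
    SLAVE_UNKNOWN = auto()
    TASK_UNKNOWN = auto()
    MASTER_DISCONNECTED = auto()
    GC_ERROR = auto()
    SLAVE_RESTARTED = auto()
    EXECUTOR_TERMINATED = auto()


# Hypothetical lookup table: which component sets each reason,
# following the "Set by ..." grouping in the proposal.
REASON_SOURCE = {
    Reason.INVALID_OFFERS: Source.MASTER,
    Reason.SLAVE_REMOVED: Source.MASTER,
    Reason.SLAVE_DISCONNECTED: Source.MASTER,
    Reason.SLAVE_UNKNOWN: Source.MASTER,
    Reason.TASK_UNKNOWN: Source.MASTER,
    Reason.MASTER_DISCONNECTED: Source.SCHEDULER,
    Reason.GC_ERROR: Source.SLAVE,
    Reason.SLAVE_RESTARTED: Source.SLAVE,
    Reason.EXECUTOR_TERMINATED: Source.SLAVE,
}


@dataclass
class TaskStatus:
    task_id: str
    state: str
    source: Optional[Source] = None
    reason: Optional[Reason] = None


def lost(task_id: str, reason: Reason) -> TaskStatus:
    """Build a TASK_LOST update whose source is derived from its reason."""
    return TaskStatus(task_id, "TASK_LOST", REASON_SOURCE[reason], reason)
```

With this layout a framework still only switches on TASK_LOST, and can look at
source and reason when it wants more detail.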
Bill Farner 1 week ago (Oct. 8, 2014, 9:40 a.m.)
This looks good; I have one addendum: frameworks must not be allowed to set
status update fields in ways that conflict with the master/slave, i.e., an
executor should not be allowed to specify the Source (or if it does, Mesos
should overwrite it).
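The overwrite option could look roughly like the sketch below. This is an
assumption about one possible approach, not how Mesos implements it: the
TaskStatus class and the stamp_source() helper are hypothetical, and the idea is
simply that whoever receives the update records the real sender, ignoring any
source value the sender supplied.

```python
from dataclasses import dataclass, replace
from enum import Enum, auto
from typing import Optional


class Source(Enum):
    MASTER = auto()
    SLAVE = auto()
    EXECUTOR = auto()


@dataclass(frozen=True)
class TaskStatus:
    task_id: str
    state: str
    source: Optional[Source] = None


def stamp_source(status: TaskStatus, origin: Source) -> TaskStatus:
    """Return a copy of the update with source set to the actual sender.

    Any source value already present on the update is discarded.
    """
    return replace(status, source=origin)
```

For example, an executor that sends an update claiming Source.MASTER would have
it rewritten to Source.EXECUTOR before the update is forwarded.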
Vinod Kone 1 week ago (Oct. 8, 2014, 10:24 a.m.)
yup. that definitely was on my mind :)
Alexander Rukletsov 6 days, 9 hours ago (Oct. 9, 2014, 3:20 a.m.)
Looks good to me. We can also add TASK_FAILED reasons and TASK_KILLED
explanations to the Reason enum. Generally, my proposal is to use Reason for
all second-tier states.
> Expose master stats differentiating between master-generated and
> slave-generated LOST tasks
> -------------------------------------------------------------------------------------------
>
> Key: MESOS-1830
> URL: https://issues.apache.org/jira/browse/MESOS-1830
> Project: Mesos
> Issue Type: Story
> Components: master
> Reporter: Bill Farner
> Assignee: Dominic Hamon
> Priority: Minor
>
> The master exports a monotonically-increasing counter of tasks transitioned
> to TASK_LOST. This loses fidelity of the source of the lost task. A first
> step in exposing the source of lost tasks might be to just differentiate
> between TASK_LOST transitions initiated by the master vs the slave (and maybe
> bad input from the scheduler).
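
A rough sketch of the first step described in the issue, assuming updates carry
a source: keep one monotonically increasing TASK_LOST counter per source
instead of a single total. The LostTaskStats class and its method names are
hypothetical and do not correspond to actual Mesos metric names.

```python
from collections import Counter
from enum import Enum, auto


class Source(Enum):
    MASTER = auto()
    SLAVE = auto()
    SCHEDULER = auto()


class LostTaskStats:
    """Per-source counters for transitions to TASK_LOST."""

    def __init__(self) -> None:
        self._lost = Counter()

    def record(self, state: str, source: Source) -> None:
        # Only TASK_LOST transitions are counted; counters never decrease.
        if state == "TASK_LOST":
            self._lost[source] += 1

    def lost_by(self, source: Source) -> int:
        return self._lost[source]

    def total_lost(self) -> int:
        # Equivalent to the existing single counter.
        return sum(self._lost.values())
```

The existing total stays derivable as the sum of the per-source counters, so
nothing is lost by splitting it.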