This is expected behavior, as STARTING is not a transient state
<https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/reconciliation/TaskTimeout.java#L62-L67>.
I don't believe it ever was.  The rationale is that the ASSIGNED ->
STARTING transition acknowledges the handoff from scheduler to executor
control.  From then on, the executor manages the task state.  This allows
for tasks that have a long delay in STARTING -> RUNNING, which commonly
occurs due to slow package or container image fetching.  Once a task is in
STARTING, your executor is responsible for any timeouts you deem necessary.
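
For illustration, here is a minimal sketch of what an executor-side timeout
for STARTING could look like.  It is not taken from Aurora or Thermos: the
class name, the method names, and the on_timeout callback are all
hypothetical, and it assumes your executor can call a hook when it reports
STARTING and again when the task moves on.  The callback is where your
executor would send a terminal status (for example TASK_FAILED) to Mesos.

```python
import threading


class StartingWatchdog:
    """Hypothetical helper: flags tasks that sit in STARTING too long."""

    def __init__(self, timeout_secs, on_timeout):
        self._timeout_secs = timeout_secs
        # Hypothetical callback taking a task id; a real executor would
        # report a terminal status for that task here.
        self._on_timeout = on_timeout
        self._lock = threading.Lock()
        self._timers = {}  # task_id -> (token, threading.Timer)

    def task_starting(self, task_id):
        """Call right after the executor reports STARTING for task_id."""
        token = object()  # identifies this particular arming of the timer
        timer = threading.Timer(
            self._timeout_secs, self._expire, args=(task_id, token))
        timer.daemon = True
        with self._lock:
            old = self._timers.get(task_id)
            self._timers[task_id] = (token, timer)
        if old is not None:
            old[1].cancel()
        timer.start()

    def task_left_starting(self, task_id):
        """Call when the task reaches RUNNING or any terminal state."""
        with self._lock:
            entry = self._timers.pop(task_id, None)
        if entry is not None:
            entry[1].cancel()

    def _expire(self, task_id, token):
        with self._lock:
            entry = self._timers.get(task_id)
            # Ignore timers that were cancelled or replaced in the meantime.
            if entry is None or entry[0] is not token:
                return
            del self._timers[task_id]
        self._on_timeout(task_id)
```

The timeout value is yours to choose; it should comfortably exceed your
slowest expected package or image fetch.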

On Fri, Jul 20, 2018 at 11:19 PM, meghdoot bhattacharya <
meghdoo...@yahoo.com.invalid> wrote:

> http://aurora.apache.org/documentation/latest/reference/task-lifecycle/
>
>
> Unexpected Termination: LOST
>
> If a Task stays in a transient task state for too long (such
> as ASSIGNED or STARTING), the scheduler forces it into LOST state, creating
> a new Task in its place that’s sent into PENDING state.
>
> So, the behavior we are observing while testing with our custom executor
> is this: when the Mesos task is in staging, i.e. the executor has not yet
> sent the task-starting Mesos status message, the transient timeout works
> and the task is marked as LOST in Aurora. However, if the executor has sent
> the starting status message but then does not send a task running/failed
> status message, the transient timeout does not kick in and Aurora does not
> mark the task LOST. We waited a good 5+ minutes after the timeout to see a
> change, across multiple tests.
> This is Aurora 0.19.
> Thx
>
>
