> On Mar 23, 2018, at 2:21 PM, Zhitao Li <zhitaoli...@gmail.com> wrote:
> 
> Hi everyone,
> 
> I'd like to do an API review for MESOS-8725
> <https://issues.apache.org/jira/browse/MESOS-8725>. We are adding an
> optional `max_duration` to `TaskInfo` field. If a task does not terminate
> within this duration, built-in executors will kill the task with a new
> reason `REASON_MAX_DURATION_REACHED`.
> 
> Proof of concept patch:
> https://reviews.apache.org/r/66258/
> 
> Reference implementation in command executor:
> https://reviews.apache.org/r/66259/
> 
> A design choice we made is to make this relative duration rather than an
> absolute timestamp of deadline. Our rationales:
> 
>   - Cluster could suffer from clock skews, so same absolute deadline would
>   result in inconsistent behavior;
>   - Framework can just trivially translate its own clock as source of
>   truth to translate absolute deadline to current time + max_duration.
> 
> Please let me know what you think. Thanks.

Bringing our conversation about task group semantics back to the list.

The current reviews require all tasks in a group to have the same max_duration. 
This is equivalent to specifying max_duration on the task group itself. This 
means that when the time is up, the whole group gets torn down. Validation on 
the master ensures that schedulers have to set the same value across all the 
tasks.

Alternatively, we could allow the duration to be different for tasks and then 
just kill the individual task when it's time expires. In this case, the task 
will have a final status of TASK_KILLED, which will cause the Mesos default 
executor to tear down the rest of the group. So we have the same effect, though 
it is expressed differently in the API.

So maybe the cleanest way to express this for task groups is to place the 
max_duration in the `TaskGroupInfo`? However if we do that, then we lose any 
information about which task exceeded the duration (since by definition they 
all did). So I'm leaning towards allowing a per-task max_duration.

We should also define what this API means for the final `TaskStatus` of the 
task. In my executor, the rule we follow is that `TASK_KILLED` is only ever 
used in response to explicit KILL requests from the scheduler. If the 
max_duration is exceeded, I think that we should classify that as `TASK_FAILED`.

thanks,
James

Reply via email to