[ 
https://issues.apache.org/jira/browse/MESOS-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16410613#comment-16410613
 ] 

Zhitao Li commented on MESOS-8725:
----------------------------------

[~jamesmulcahy], we actually started down that path, but ran into several 
scalability difficulties:
 * limited compute resources on the scheduler: many schedulers follow the same 
design as the Mesos master and run only one active process, so tracking a timer 
per task uses up precious resources there;
 * network partition: if the master or agent is cut off by a network partition, 
the scheduler cannot terminate the task;
 * recovery upon scheduler restart: this was the biggest problem for us. When 
our scheduler process restarted, it needed to recover "all" running tasks 
from the database and reconstruct what to do for each task (which is also a 
common pattern among schedulers). Any additional feature introduced there 
would make that process even heavier;
 * cheaper to implement in the executor: with isolation mechanisms like `pid`, 
we expect the executor to have a longer lifecycle. Therefore, executors do not 
even need to maintain a busy thread; they can simply use a 
[Timer|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/timer.hpp]
 to terminate the task.
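
To illustrate the last point, here is a minimal sketch of executor-side deadline tracking, written in Python rather than against libprocess. The `DeadlineTask` class, the `on_update` callback, and the choice of TASK_FAILED as the terminal state are all assumptions made for illustration; none of them is an existing Mesos API.

```python
import threading
import time


class DeadlineTask:
    """Hypothetical stand-in for a task managed by an executor."""

    def __init__(self, task_id, deadline_secs, on_update):
        self.task_id = task_id
        self.state = "TASK_RUNNING"
        self._on_update = on_update
        self._lock = threading.Lock()
        # One timer per task; no busy polling thread is needed.
        self._timer = threading.Timer(deadline_secs, self._expire)
        self._timer.daemon = True
        self._timer.start()

    def _finish(self, state, reason):
        # Only the first terminal transition takes effect.
        with self._lock:
            if self.state != "TASK_RUNNING":
                return False
            self.state = state
        self._on_update(self.task_id, state, reason)
        return True

    def _expire(self):
        # A real executor would also kill the task's process here (assumption).
        self._finish("TASK_FAILED", "deadline exceeded")

    def complete(self):
        # The task finished on its own, so cancel the pending timer.
        self._timer.cancel()
        return self._finish("TASK_FINISHED", "completed")


updates = []


def record(task_id, state, reason):
    # Stand-in for sending a status update back to the scheduler.
    updates.append((task_id, state, reason))


slow = DeadlineTask("slow", 0.05, record)  # misses its deadline
fast = DeadlineTask("fast", 5.0, record)   # finishes well before it
fast.complete()
time.sleep(0.3)  # give the short timer time to fire
```

The scheduler only receives the resulting status updates; it keeps no per-task timer of its own and has nothing extra to rebuild after a restart.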

> Support deadline for tasks
> --------------------------
>
>                 Key: MESOS-8725
>                 URL: https://issues.apache.org/jira/browse/MESOS-8725
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Zhitao Li
>            Priority: Major
>
> In our environment, we run a lot of batch jobs, some of which have tight 
> timelines. If any task in a job runs longer than x hours, it no longer makes 
> sense to keep running it. 
>  
> For instance, a team might submit a job which builds a weekly index and 
> repeats every Monday. If the job does not finish before the next Monday for 
> whatever reason, there is no point in keeping any of its tasks running.
>  
> We believe that implementing deadline tracking distributed across our cluster 
> makes more sense as it makes the system more scalable and also makes our 
> centralized state machine simpler.
>  
> One idea I have right now is to add an *optional* *TimeInfo deadline* field 
> to TaskInfo, so that all default executors in Mesos can simply terminate the 
> task and send a proper *StatusUpdate*.
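
The proposed field could be modeled roughly as follows. This is a sketch only: the Python dataclasses stand in for the protobuf messages, the `nanoseconds` field is assumed to mean nanoseconds since the Unix epoch, and `seconds_until_deadline` is a hypothetical helper, not part of Mesos.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimeInfo:
    nanoseconds: int  # assumed: nanoseconds since the Unix epoch


@dataclass(frozen=True)
class TaskInfo:
    task_id: str
    deadline: Optional[TimeInfo] = None  # the proposed optional field


def seconds_until_deadline(task, now_ns):
    """Return the seconds left before the deadline, or None if none is set."""
    if task.deadline is None:
        return None
    # Clamp at zero so an already-missed deadline fires immediately.
    return max(0.0, (task.deadline.nanoseconds - now_ns) / 1e9)


WEEK_NS = 7 * 24 * 3600 * 10**9
weekly = TaskInfo("weekly-index", TimeInfo(WEEK_NS))
no_deadline = TaskInfo("service")

remaining = seconds_until_deadline(weekly, now_ns=0)
overdue = seconds_until_deadline(weekly, now_ns=WEEK_NS + 1)
unset = seconds_until_deadline(no_deadline, now_ns=0)
```

Because the field is optional, tasks that do not set it keep their current behavior, and an executor would only arm a timer when a value is returned.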



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
