[ 
https://issues.apache.org/jira/browse/MYRIAD-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334515#comment-15334515
 ] 

John Yost commented on MYRIAD-220:
----------------------------------

Myriad interacts with the Mesos scheduler via the Mesos SchedulerDriver class 
and receives acknowledgement of requests via a callback mechanism defined in 
the Mesos Scheduler interface.

Myriad Tasks are killed via a process that is ideally one-step but can be 
two-step if there are any system or network issues at the time the task kill 
request is submitted to Mesos:

1. The YarnNodeCapacityManager invokes MyriadDriver.kill, which delegates to 
the Mesos SchedulerDriver.killTask method. At this stage the kill request is 
sent by the SchedulerDriver to Mesos. A Protos.Status object is returned, but 
that Status object only indicates whether the SchedulerDriver is in a running 
state. Until Mesos invokes a callback on the Myriad Scheduler 
implementation, there is no guarantee that the task kill request succeeded.

2. The TaskTerminator is a daemon that periodically checks the SchedulerState 
killable tasks queue and invokes MyriadDriverManager.kill, which is a wrapper 
for the MyriadDriver (and, again, the Mesos SchedulerDriver). In the master 
branch the TaskTerminator also invokes SchedulerState.removeTask to remove the 
killable task from the SchedulerState.  This is potentially an issue because, 
again, the task kill request is not guaranteed to have worked unless the 
corresponding Mesos callback method is invoked.
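
The gap in both steps can be shown with a small, self-contained sketch. It 
does not use the real Mesos or Myriad classes: FakeDriver, DriverStatus and 
TerminatorSketch are hypothetical stand-ins, and the only behaviour modelled 
is that a kill call reports driver state rather than the outcome of the kill.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-ins for illustration; not the real Mesos or Myriad types.
enum DriverStatus { DRIVER_RUNNING, DRIVER_ABORTED }

final class FakeDriver {
    private final boolean deliverKills; // false simulates a system or network issue
    final Set<String> killedTasks = new LinkedHashSet<>();

    FakeDriver(boolean deliverKills) {
        this.deliverKills = deliverKills;
    }

    // The return value reflects driver state only, never whether the kill landed.
    DriverStatus killTask(String taskId) {
        if (deliverKills) {
            killedTasks.add(taskId);
        }
        return DriverStatus.DRIVER_RUNNING;
    }
}

final class TerminatorSketch {
    // Kill, then forget the task straight away (the problematic pattern).
    static void killAndRemove(FakeDriver driver, Set<String> killable) {
        for (String taskId : new ArrayList<>(killable)) {
            driver.killTask(taskId);
            killable.remove(taskId); // forgotten even if the kill request was lost
        }
    }

    // Kill only; the task stays killable until a callback confirms the kill.
    static void killOnly(FakeDriver driver, Set<String> killable) {
        for (String taskId : killable) {
            driver.killTask(taskId);
        }
    }
}
```

With a FakeDriver that drops kill requests, killAndRemove leaves the killable 
set empty although nothing was killed, whereas killOnly keeps the task in the 
set so that a later pass can retry it.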

The only way to ensure that all Killable Myriad tasks are eventually killed is 
to invoke SchedulerState.removeTask from within Myriad's Mesos lifecycle 
callback implementation, MyriadScheduler, either directly or within a local 
Java object.  Specifically, MyriadScheduler implements the 
org.apache.mesos.Scheduler interface, which is the Mesos callback interface for 
all Mesos Task lifecycle methods (e.g., registered, statusUpdate, disconnected). 
One of the states that the statusUpdate callback can report is that the Myriad 
task was killed. 

When a StatusUpdate is received from Mesos, a Myriad StatusUpdateEvent is 
created and fired via Disruptor. The message listener is the 
StatusUpdateEventHandler class. Within its onEvent method, there is logic to 
decline new resource offers as well as remove the corresponding task from the 
Myriad SchedulerState. 
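
As a rough illustration of the callback-driven removal, the following sketch 
removes a task only once a killed status arrives. All types here 
(SketchTaskState, SketchSchedulerState, SketchStatusUpdateHandler) are 
hypothetical simplifications, not the actual Myriad or Mesos classes; the 
Disruptor plumbing and the offer-declining logic are left out, and only the 
killed state is modelled.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-ins for illustration; not the real Mesos or Myriad types.
enum SketchTaskState { TASK_RUNNING, TASK_KILLED }

final class SketchSchedulerState {
    private final Set<String> killableTasks = new LinkedHashSet<>();

    void makeTaskKillable(String taskId) {
        killableTasks.add(taskId);
    }

    void removeTask(String taskId) {
        killableTasks.remove(taskId);
    }

    boolean isKillable(String taskId) {
        return killableTasks.contains(taskId);
    }
}

final class SketchStatusUpdateHandler {
    private final SketchSchedulerState state;

    SketchStatusUpdateHandler(SketchSchedulerState state) {
        this.state = state;
    }

    // Called once per status update; only a confirmed kill removes the task.
    void onEvent(String taskId, SketchTaskState taskState) {
        if (taskState == SketchTaskState.TASK_KILLED) {
            state.removeTask(taskId);
        }
    }
}
```

Because removal happens only here, a task whose kill request was lost stays 
in the killable set and can be retried on a later pass.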

Consequently, the only code change required is to remove the 
SchedulerState.removeTask method invocation from TaskTerminator. I am also 
adding some JUnit tests and Javadoc comments.

--John


> Improve reliability of kill task messaging
> ------------------------------------------
>
>                 Key: MYRIAD-220
>                 URL: https://issues.apache.org/jira/browse/MYRIAD-220
>             Project: Myriad
>          Issue Type: Improvement
>          Components: Scheduler
>            Reporter: John Yost
>            Assignee: John Yost
>
> Currently within the YarnNodeCapacityManager there is a two-step process of 
> killing a YARN task via the following method invocations:
>       state.makeTaskKillable(taskId);
>       myriadDriver.kill(taskId);
> Need to add logic to ensure all killable tasks are killed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
