[ https://issues.apache.org/jira/browse/MYRIAD-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334515#comment-15334515 ]
John Yost commented on MYRIAD-220:
----------------------------------
Myriad interacts with Mesos through the Mesos SchedulerDriver class and
receives acknowledgement of its requests through the callbacks defined in the
Mesos Scheduler interface.
Myriad Tasks are killed via a process that is ideally one-step but can be
two-step if there are any system or network issues at the time the task kill
request is submitted to Mesos:
1. The YarnNodeCapacityManager invokes MyriadDriver.kill, which delegates to
the Mesos SchedulerDriver.killTask method. At this stage the kill request is
sent by the SchedulerDriver to Mesos. A Protos.Status object is returned, but
that Status object only indicates whether the SchedulerDriver is in a running
state. Until a callback is invoked by Mesos upon the Myriad Scheduler
implementation, there is no guarantee that the task kill request succeeded.
2. The TaskTerminator is a daemon that periodically checks the SchedulerState
killable tasks queue and invokes MyriadDriverManager.kill, which is a wrapper
for the MyriadDriver (and, in turn, the Mesos SchedulerDriver). In the master
branch the TaskTerminator also invokes SchedulerState.removeTask to remove the
killable task from the SchedulerState. This is a potential problem because, as
in step 1, the kill request is not guaranteed to have succeeded until the
corresponding Mesos callback is invoked; once the task has been removed from
the SchedulerState, the TaskTerminator will not retry the kill.
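To illustrate the retry behavior described in step 2, here is a minimal sketch.
It is not the actual Myriad code: every type below (SketchDriverStatus,
SketchKillDriver, SketchKillableTasks, SketchTerminator) is a hypothetical
stand-in for Protos.Status, MyriadDriver, SchedulerState and TaskTerminator.
The point it shows is that a terminator pass which only resends kill requests,
and never removes tasks itself, keeps retrying until something else confirms
the kill.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the driver status returned by a kill call.
// It says only whether the driver is running, not whether the kill worked.
enum SketchDriverStatus { DRIVER_RUNNING, DRIVER_ABORTED }

// Hypothetical stand-in for MyriadDriver.kill.
interface SketchKillDriver {
    SketchDriverStatus kill(String taskId);
}

// Hypothetical stand-in for the killable-task part of SchedulerState.
final class SketchKillableTasks {
    private final Set<String> killable = new LinkedHashSet<>();

    synchronized void makeTaskKillable(String taskId) { killable.add(taskId); }
    synchronized void removeTask(String taskId) { killable.remove(taskId); }
    synchronized boolean contains(String taskId) { return killable.contains(taskId); }
    synchronized List<String> snapshot() { return new ArrayList<>(killable); }
}

// Hypothetical stand-in for one pass of the TaskTerminator.
final class SketchTerminator {
    private final SketchKillableTasks state;
    private final SketchKillDriver driver;

    SketchTerminator(SketchKillableTasks state, SketchKillDriver driver) {
        this.state = state;
        this.driver = driver;
    }

    // Resend a kill request for every pending task and return how many were
    // sent. The task is deliberately NOT removed here, because the returned
    // status does not confirm that the kill succeeded.
    int runOnce() {
        int sent = 0;
        for (String taskId : state.snapshot()) {
            driver.kill(taskId);
            sent++;
        }
        return sent;
    }
}

final class KillRetrySketch {
    public static void main(String[] args) {
        SketchKillableTasks state = new SketchKillableTasks();
        state.makeTaskKillable("task-1");

        // A driver that always reports DRIVER_RUNNING, even if the request
        // were lost on the way to Mesos.
        SketchTerminator terminator =
            new SketchTerminator(state, id -> SketchDriverStatus.DRIVER_RUNNING);

        terminator.runOnce();
        System.out.println("still killable after first pass: " + state.contains("task-1"));

        // Simulate the confirmation arriving from elsewhere (the callback path).
        state.removeTask("task-1");
        System.out.println("requests sent after confirmation: " + terminator.runOnce());
    }
}
```

With this shape, a lost kill request costs only one extra terminator pass
rather than leaving the task alive indefinitely.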
The only way to ensure that all Killable Myriad tasks are eventually killed is
to invoke SchedulerState.removeTask from within Myriad's Mesos lifecycle
callback implementation, MyriadScheduler, either directly or within a local
Java object. Specifically, MyriadScheduler implements the
org.apache.mesos.Scheduler interface, which is the Mesos callback interface for
framework and task lifecycle events (e.g., registered, statusUpdate,
disconnected). One of the status updates delivered through the statusUpdate
callback reports that the Myriad task was killed.
When a StatusUpdate is received from Mesos, a Myriad StatusUpdateEvent is
created and fired via Disruptor. The message listener is the
StatusUpdateEventHandler class. Within its onEvent method, there is logic to
decline new resource offers as well as remove the corresponding task from the
Myriad SchedulerState.
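The callback-side removal can be sketched as follows. This is a simplified
illustration, not the real StatusUpdateEventHandler: the Disruptor plumbing and
the offer-declining logic are left out, and SketchTaskState,
SketchStatusUpdate, SketchSchedulerState and SketchStatusUpdateHandler are
hypothetical names. Treating only TASK_KILLED as the trigger for removal is
also an assumption made to keep the example short.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical subset of the Mesos task states, enough for the example.
enum SketchTaskState { TASK_RUNNING, TASK_KILLED }

// Hypothetical stand-in for the status update carried by a StatusUpdateEvent.
final class SketchStatusUpdate {
    final String taskId;
    final SketchTaskState state;

    SketchStatusUpdate(String taskId, SketchTaskState state) {
        this.taskId = taskId;
        this.state = state;
    }
}

// Hypothetical stand-in for the killable-task part of SchedulerState.
final class SketchSchedulerState {
    private final Set<String> killable = new HashSet<>();

    synchronized void makeTaskKillable(String taskId) { killable.add(taskId); }
    synchronized void removeTask(String taskId) { killable.remove(taskId); }
    synchronized boolean isKillable(String taskId) { return killable.contains(taskId); }
}

// Hypothetical stand-in for StatusUpdateEventHandler.onEvent.
final class SketchStatusUpdateHandler {
    private final SketchSchedulerState state;

    SketchStatusUpdateHandler(SketchSchedulerState state) {
        this.state = state;
    }

    // Remove the task only once Mesos has confirmed the kill.
    // Returns true if the update caused a removal.
    boolean onEvent(SketchStatusUpdate update) {
        if (update.state == SketchTaskState.TASK_KILLED) {
            state.removeTask(update.taskId);
            return true;
        }
        return false;
    }
}

final class StatusUpdateSketch {
    public static void main(String[] args) {
        SketchSchedulerState state = new SketchSchedulerState();
        state.makeTaskKillable("task-1");
        SketchStatusUpdateHandler handler = new SketchStatusUpdateHandler(state);

        // A non-terminal update leaves the task in the killable set.
        handler.onEvent(new SketchStatusUpdate("task-1", SketchTaskState.TASK_RUNNING));
        System.out.println("killable after TASK_RUNNING: " + state.isKillable("task-1"));

        // The confirmed kill removes it.
        handler.onEvent(new SketchStatusUpdate("task-1", SketchTaskState.TASK_KILLED));
        System.out.println("killable after TASK_KILLED: " + state.isKillable("task-1"));
    }
}
```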
Consequently, the only code change required is to remove the
SchedulerState.removeTask method invocation from TaskTerminator. I am also
adding some JUnit tests and Javadoc comments.
--John
> Improve reliability of kill task messaging
> ------------------------------------------
>
> Key: MYRIAD-220
> URL: https://issues.apache.org/jira/browse/MYRIAD-220
> Project: Myriad
> Issue Type: Improvement
> Components: Scheduler
> Reporter: John Yost
> Assignee: John Yost
>
> Currently within the YarnNodeCapacityManager there is a two-step process of
> killing a YARN task via the following method invocations:
> state.makeTaskKillable(taskId);
> myriadDriver.kill(taskId);
> Need to add logic to ensure all killable tasks are killed.