[
https://issues.apache.org/jira/browse/MESOS-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13981420#comment-13981420
]
Niklas Quarfot Nielsen commented on MESOS-938:
----------------------------------------------
I have been hacking on this on and off since I reported the issue. I should have
more time to prioritize it after the external containerizer patches.
I'd like to start a thorough design discussion before suggesting any code
(lesson learned).
I suggest splitting the proposed replaceTask into:
- replaceTask(TaskID oldId, TaskInfo newTask, bool rollback = false)
- resizeTask(TaskInfo runningTask, vector<Offer> auxiliaryOffers)
- Extend task launch with a health check feature: TaskInfo will encode a repeated
field of checks, ranging from built-in TCP/HTTP checks to scriptable checks.
replaceTask will reuse this logic to roll back if the new task fails.
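For concreteness, the split could look roughly like this. This is only a sketch:
every type and name below is an invented stand-in, not the actual Mesos driver
API or protobufs, and the bodies merely record the call so the shape is visible.

```cpp
#include <string>
#include <vector>

// Invented placeholder types standing in for the Mesos protobufs.
struct TaskID { std::string value; };
struct Offer { double cpus; double mem; };
struct TaskInfo { TaskID taskId; double cpus; double mem; };

// Sketch of the proposed driver additions; real implementations would
// talk to the master rather than record a string.
class SchedulerDriverSketch {
public:
  // Replace a running task with a new one; optionally roll back to the
  // old task if the new one fails its health checks.
  void replaceTask(const TaskID& oldId,
                   const TaskInfo& newTask,
                   bool rollback = false) {
    lastCall = "replaceTask(" + oldId.value + " -> " +
               newTask.taskId.value + (rollback ? ", rollback" : "") + ")";
  }

  // Grow or shrink a running task, backed by auxiliary offers when growing.
  void resizeTask(const TaskInfo& runningTask,
                  const std::vector<Offer>& auxiliaryOffers) {
    lastCall = "resizeTask(" + runningTask.taskId.value + ", " +
               std::to_string(auxiliaryOffers.size()) + " offers)";
  }

  std::string lastCall;
};
```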
If this sounds reasonable, I will file individual issues for the above, start
detailed discussions around them, and find interested contributors/committers
to develop/shepherd these changes.
> Add replace task primitive
> --------------------------
>
> Key: MESOS-938
> URL: https://issues.apache.org/jira/browse/MESOS-938
> Project: Mesos
> Issue Type: Improvement
> Components: c++ api, master, slave
> Reporter: Niklas Quarfot Nielsen
> Attachments: resource-flow.png, sequence.png
>
>
> A replaceTask primitive could allow easier up- and down-scaling by reusing
> resources from an old task either partially (leaving the resource delta
> available), fully, or augmented (backed by additional offers) for launching a
> new task. Further, recovery logic can restart the old task in case the new
> task fails.
> The signature could be something like:
> replaceTask(TaskID oldTaskId, TaskInfo newTaskInfo, vector<Offer> offers)
> The semantics of replaceTask could be as follows:
> 1) The framework issues replaceTask with the task id of an old running task, a
> new task info, and optionally a set of offers: replaceTask(oldTaskId,
> newTaskInfo, offers)
> 2) The master validates the offers and the task (with respect to the resource
> requirements section) by reusing the offer and task visitors, and sends a
> replaceTask request to the slave.
> a) If the new task violates resource requirements or consistency, TASK_LOST
> is reported for the new task and the entire request ends.
> 3) The slave stores and checkpoints the request (a tuple of the old task id and
> the new task info), sends killTask to the executor, and starts a timer.
> Auxiliary resources (from the optional offers) are reserved before issuing the
> killTask.
> a) If the executor is reregistering, enqueue the entire request, reserve
> resources, and start a timer.
> b) If the executor is terminating or terminated, TASK_LOST is reported for
> the new task, as the old task's resources cannot be reused.
> c) If the old task is unknown, it might be about to be reregistered by an
> executor. Enqueue the entire request, reserve resources, and start a timer.
> d) If the timer times out and a status update for a terminal state (e.g.
> TASK_KILLED) has not been received, the reserved resources are released,
> TASK_LOST is sent for the new task, and the replace request is removed. If the
> executor doesn't report anything, this is no different from an unresponsive
> killTask().
> NOTE: The timers started at request time and during executor reregistration
> can be merged.
> 4) The executor sends a terminal status update for the old task as the usual
> response to killTask(). However, if the old task id is marked for replacement,
> the slave now intercepts the stage between releasing resources and announcing
> the change to the isolator. In this stage, if the new task consumes less than
> the old task, the unneeded resources are announced available to the isolator.
> Thereafter, a launchTask request for the new task is sent to the executor.
> Reserved resources are released (but not announced available) so that they
> become available to the regular launchTask() request.
> 5) Start a timer and await a TASK_RUNNING update. If a terminal status update
> (or no update) is received within the timeout, a rollback is attempted: if the
> new task uses more than or the same resources as the old task, the old task
> should be restartable. If the old task is successfully restarted, a new
> TASK_ROLLEDBACK status update rather than TASK_RUNNING is reported to the
> framework.
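To make steps 3)-5) concrete, the slave-side life of a replace request could be
sketched as a small state machine. This is purely illustrative: the state and
event names below are invented for the sketch, not proposed Mesos identifiers.

```cpp
#include <stdexcept>

// States a replace request moves through on the slave (sketch).
enum class ReplaceState {
  KILLING_OLD,   // killTask sent, timer running (step 3)
  LAUNCHING_NEW, // old task terminal, launchTask sent (step 4)
  RUNNING,       // TASK_RUNNING received for the new task (step 5)
  ROLLED_BACK,   // new task failed, old task restarted (step 5)
  LOST           // request failed; TASK_LOST reported for the new task
};

// Events observed by the slave while the request is in flight (sketch).
enum class Event {
  OLD_TASK_TERMINAL, // terminal update for the old task
  KILL_TIMEOUT,      // timer fired before a terminal update arrived (step 3d)
  NEW_TASK_RUNNING,  // TASK_RUNNING for the new task
  NEW_TASK_FAILED    // terminal update (or timeout) for the new task
};

ReplaceState transition(ReplaceState s, Event e, bool canRollback) {
  switch (s) {
    case ReplaceState::KILLING_OLD:
      if (e == Event::OLD_TASK_TERMINAL) return ReplaceState::LAUNCHING_NEW;
      if (e == Event::KILL_TIMEOUT) return ReplaceState::LOST;
      break;
    case ReplaceState::LAUNCHING_NEW:
      if (e == Event::NEW_TASK_RUNNING) return ReplaceState::RUNNING;
      if (e == Event::NEW_TASK_FAILED)  // step 5: attempt rollback
        return canRollback ? ReplaceState::ROLLED_BACK : ReplaceState::LOST;
      break;
    default:
      break;
  }
  throw std::logic_error("unexpected event in this state");
}
```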
> Resource requirements:
> The resources required for the new task must be covered by either:
> - Using less than or the same resources as the old task: T_{new} <= T_{old}
> - Using more than the old task's resources, but no more than the sum of the
> old resources and the aggregated offer resources: T_{new} <= T_{old} + O_{1} +
> … + O_{n}
> Any resources provided but not used by the new task are announced as
> available.
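The coverage rule can be sketched as a per-resource check, with the leftover
delta being what gets announced available. The types and helper names below are
invented stand-ins for illustration, not the actual Mesos Resources class.

```cpp
#include <vector>

// Minimal stand-in for a resource vector; the real type is richer,
// this sketch only tracks cpus and memory.
struct Res { double cpus; double mem; };

// True iff T_new <= T_old + O_1 + ... + O_n, checked per resource.
bool covered(const Res& oldTask,
             const std::vector<Res>& offers,
             const Res& newTask) {
  Res total = oldTask;
  for (const Res& o : offers) {
    total.cpus += o.cpus;
    total.mem += o.mem;
  }
  return newTask.cpus <= total.cpus && newTask.mem <= total.mem;
}

// The delta not consumed by the new task, i.e. what would be announced
// available after the replacement.
Res announced(const Res& oldTask,
              const std::vector<Res>& offers,
              const Res& newTask) {
  Res total = oldTask;
  for (const Res& o : offers) {
    total.cpus += o.cpus;
    total.mem += o.mem;
  }
  return Res{total.cpus - newTask.cpus, total.mem - newTask.mem};
}
```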
> Fault tolerance (and open questions):
> - What happens if the master fails after receiving a replaceTask request?
> - The master does not store any state on replacements.
> - What happens if the slave fails after receiving a replaceTask request?
> - The slave, however, becomes aware of replacements in flight, which
> requires either 1) checkpointing replacement requests to stable storage, or
> 2) making the executor aware of the replace request and letting
> reregistration reestablish lost requests.
> - What happens if slave fails while waiting for executor behavior?
> - Waiting for executor to run
> - Waiting for old task to reregister
> - Killing old task while holding up resources for new task
> - Just after receiving a terminal update for the old task.
> - Monitoring new task startup
> - Reconciliation behavior
> This can end up being a complicated operation in the slave, but it requires
> virtually no changes in executors.
> An alternative approach could relieve the slave of the replace logic by
> pushing the responsibility onto executors.
> Thoughts?
--
This message was sent by Atlassian JIRA
(v6.2#6252)