Re: Trying to get task reconciliation to work

David Greenberg Fri, 18 Apr 2014 11:22:20 -0700

So task reconciliation will always tell me if a task is finished when the
slave is still running, and it will give me TASK_LOST if the slave or task
is unknown to the master? If so, these semantics are very convenient for
frameworks that fail to failover in a timely manner, and then ask for tasks
that belonged to their previous FrameworkID.



On Fri, Apr 18, 2014 at 1:55 PM, Benjamin Mahler
<[email protected]> wrote:

> Vinod, David is asking about tasks that "belong" to the framework in that
> they were "launched" by it, in which case your answer is not correct. We
> don't keep track of tasks so we don't know whether the task "belongs" to
> the framework in this sense.
>
> David, you will either receive TASK_LOST or nothing (if the slave for
> the task is in a transient state).
>
> This is determined more so by the SlaveID than the TaskID as the Master
> does not persistently track tasks.
>
> (a) If you're asking about an unknown slave, you will get TASK_LOST.
> (b) If you're asking about a known slave and an unknown task, you will get
> TASK_LOST.
> (c) If you're asking about a known slave and a known task with a different
> state, you will be sent the latest state.
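
The three cases above, together with the "nothing" case for a slave in a transient state, can be written down as a small decision function. This is only a sketch, not Mesos code: `MasterView`, `reconcile_reply`, and the `transitioning` set are hypothetical names, and returning `None` when the states already agree is an assumption based on the behaviour described further down the thread.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class MasterView:
    """Hypothetical model of what the master currently knows (not a Mesos type)."""
    # slave_id -> {task_id -> latest known state}
    slaves: Dict[str, Dict[str, str]] = field(default_factory=dict)
    # Slaves in a transient state (for example, still re-registering).
    transitioning: Set[str] = field(default_factory=set)


def reconcile_reply(view: MasterView, slave_id: str, task_id: str,
                    claimed_state: str) -> Optional[str]:
    """Return the state the master would send back, or None for no reply."""
    if slave_id in view.transitioning:
        return None                  # transient slave: no reply
    if slave_id not in view.slaves:
        return "TASK_LOST"           # (a) unknown slave
    tasks = view.slaves[slave_id]
    if task_id not in tasks:
        return "TASK_LOST"           # (b) known slave, unknown task
    if tasks[task_id] != claimed_state:
        return tasks[task_id]        # (c) known task, different state
    return None                      # states agree: assumed to send nothing
```

Note that the function never looks at which framework is asking, which matches the point that the outcome depends on the SlaveID far more than on who launched the task.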
>
> If you consider these semantics, you'll realize that you may receive
> TASK_LOST if you try to reconcile your task that finished correctly. This
> is why I mentioned the need to persist updates in (1) above. Let's say you
> receive a terminal update of TASK_FINISHED and then you still try to
> reconcile against a failed over Master. This new Master will reply with
> TASK_LOST because it is unaware of the task/slave. So, you will always
> receive your valid terminal update before getting a TASK_LOST from
> reconciliation.
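
On the framework side, this suggests that a TASK_LOST produced by reconciliation should never overwrite a terminal state that has already been persisted. A minimal sketch, assuming a hypothetical `apply_update` helper and a plain dict standing in for the framework's durable store; the set of terminal states is the one I believe applied to Mesos at the time of this thread:

```python
# Terminal task states (assumed list for Mesos of this era).
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


def apply_update(store: dict, task_id: str, new_state: str) -> str:
    """Record new_state unless a terminal state is already stored.

    `store` is a stand-in for durable storage. Returns the state that is kept.
    """
    current = store.get(task_id)
    if current in TERMINAL_STATES:
        return current      # the first terminal update wins; ignore later ones
    store[task_id] = new_state
    return new_state
```

With this rule, a task recorded as TASK_FINISHED stays TASK_FINISHED even if a failed-over master later answers a reconciliation request with TASK_LOST.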
>
>
> On Fri, Apr 18, 2014 at 10:46 AM, Vinod Kone <[email protected]> wrote:
>
>> If a framework asks to reconcile a task that doesn't belong to it there
>> would be no response from the master. This is nice because it avoids
>> information leak between frameworks.
>>
>>
>> On Fri, Apr 18, 2014 at 5:04 AM, David Greenberg <[email protected]> wrote:
>>
>> > Piggybacking onto this thread with a follow-up question: what happens if
>> > you ask the master to reconcile some tasks that weren't launched by your
>> > framework? Will you get messages saying that those tasks are unknown or
>> > lost, or will there be no response at all?
>> >
>> >
>> > On Thursday, April 17, 2014, Sharma Podila <[email protected]> wrote:
>> >
>> >> No problem, I have a better understanding now.
>> >> And it was useful to see the three items you listed explicitly.
>> >>
>> >>
>> >> On Thu, Apr 17, 2014 at 2:39 PM, Benjamin Mahler <[email protected]> wrote:
>> >>
>> >> Good to see you were playing around with reconciliation; we should have
>> >> made the current semantics clearer, especially in light of the fact that
>> >> it's not implemented fully until one uses a strict registrar (likely
>> >> 0.20.0).
>> >>
>> >> Think of reconciliation as the fallback mechanism to ensure that state is
>> >> consistent; it's not designed to inform you of things you were already
>> >> told (in this case, that the tasks were running), although we could
>> >> consider sending updates even when task state remains the same.
>> >>
>> >>
>> >> For the purpose of this conversation, let's say we're in the 0.20.0
>> >> world, operating with the registrar. And let's assume your goal is to
>> >> build a highly available framework (I will be documenting how to do this
>> >> for 0.20.0):
>> >>
>> >> (1) *When you receive a status update, you must persist this information
>> >> before returning from the statusUpdate() callback*. Once you return from
>> >> the callback, the driver will acknowledge the slave directly. Slaves will
>> >> retry status update delivery *until* the acknowledgement is received from
>> >> the scheduler driver in order to ensure that the framework processed the
>> >> update.
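
A rough illustration of the ordering in (1): the callback commits the update to durable storage and only then returns, since returning is what allows the acknowledgement to go out. `PersistingScheduler`, `status_update`, and `state_of` are hypothetical names, the signature is simplified compared with the real statusUpdate() callback, and SQLite is just an assumed storage choice for the sketch.

```python
import sqlite3


class PersistingScheduler:
    """Hypothetical scheduler skeleton; only the persistence step is shown."""

    def __init__(self, db_path: str = ":memory:") -> None:
        # Use a file path instead of ":memory:" for real durability.
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS task_state ("
            "task_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )
        self.db.commit()

    def status_update(self, task_id: str, state: str) -> None:
        """Stand-in for statusUpdate(): persist first, then return."""
        self.db.execute(
            "INSERT OR REPLACE INTO task_state (task_id, state) VALUES (?, ?)",
            (task_id, state),
        )
        # Commit before returning; returning is what triggers the acknowledgement.
        self.db.commit()

    def state_of(self, task_id: str):
        """Return the last persisted state for task_id, or None if unknown."""
        row = self.db.execute(
            "SELECT state FROM task_state WHERE task_id = ?", (task_id,)
        ).fetchone()
        return row[0] if row else None
```

If the process crashes before the commit, the update was never acknowledged, so the slave retries it and nothing is lost.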
>> >>
>> >> (2) *When you receive a "slave lost" signal, it means that your tasks
>> >> that were running on that slave are in state TASK_LOST*, and any
>> >> reconciliation you perform for these tasks will result in a reply of
>> >> TASK_LOST. Most of the time we'll deliver these TASK_LOST automatically,
>> >> but with a confluence of Master *and* Slave failovers, we are unaware of
>> >> which tasks were running on the slave as we do not persist this
>> >> information in the Master.
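
For (2), the framework can mark its own tasks as lost as soon as the "slave lost" signal arrives, without waiting for individual TASK_LOST updates. A minimal sketch, assuming the framework keeps a hypothetical `tasks` mapping of task ID to `(slave_id, state)`; `on_slave_lost` is an illustrative helper, not the real callback signature:

```python
# States in which a task is still considered live (assumed list).
NON_TERMINAL_STATES = {"TASK_STAGING", "TASK_STARTING", "TASK_RUNNING"}


def on_slave_lost(tasks: dict, lost_slave_id: str) -> list:
    """Mark live tasks on the lost slave as TASK_LOST.

    `tasks` maps task_id -> (slave_id, state) and is updated in place.
    Returns the IDs of the tasks that were changed.
    """
    affected = []
    for task_id, (slave_id, state) in list(tasks.items()):
        if slave_id == lost_slave_id and state in NON_TERMINAL_STATES:
            tasks[task_id] = (slave_id, "TASK_LOST")
            affected.append(task_id)
    return affected
```

Tasks that already reached a terminal state on that slave are left alone, so a finished task is not retroactively turned into a lost one.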
>> >>
>> >> (3) To guarantee that you have a consistent view of task states, *you
>> >> must also periodically reconcile task state against the Master*. This is
>> >> only because the delivery of the "slave lost" signal in (2) is not
>> >> reliable (the Master could fail over after removing a slave but before
>> >> telling frameworks that the slave was lost).
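
The periodic reconciliation in (3) could be structured roughly as follows. This is a sketch under stated assumptions: `tasks` is a hypothetical mapping of task ID to `(slave_id, state)`, `send` stands in for whatever call submits the reconciliation request (for example a wrapper around reconcileTasks), and the 60-second interval is arbitrary.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

# Terminal task states (assumed list for Mesos of this era).
TERMINAL = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


def tasks_to_reconcile(
    tasks: Dict[str, Tuple[str, str]]
) -> List[Tuple[str, str, str]]:
    """Return (task_id, slave_id, state) for every task not yet terminal."""
    return sorted(
        (task_id, slave_id, state)
        for task_id, (slave_id, state) in tasks.items()
        if state not in TERMINAL
    )


def reconcile_periodically(
    tasks: Dict[str, Tuple[str, str]],
    send: Callable[[List[Tuple[str, str, str]]], None],
    interval_secs: float = 60.0,
    rounds: Optional[int] = None,
) -> None:
    """Call send() with the live tasks every interval_secs.

    rounds=None loops forever; a finite value is handy for testing.
    """
    done = 0
    while rounds is None or done < rounds:
        pending = tasks_to_reconcile(tasks)
        if pending:
            send(pending)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval_secs)
```

Only non-terminal tasks are sent, so a task whose terminal update was already persisted is never reconciled again.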
>> >>
>> >> You'll notice that this model forces one to serially persist all status
>> >> update changes. We are planning to expose mechanisms to allow "batch"
>> >> acknowledgement of status updates in the lower-level API that benh has
>> >> given talks about. With a lower-level API, it is possible to build more
>> >> powerful libraries that hide much of these details!
>> >>
>> >> You'll also perhaps notice that only (1) and (3) are strictly required
>> >> for consistency, but (2) is highly recommended as the vast majority of
>> >> the time the "slave lost" signal will be delivered and you can take action
>> >> quickly, without having to rely on periodic reconciliation.
>> >>
>> >> Please let me know if anything here was not clear!
>> >>
>> >>
>> >> On Thu, Apr 17, 2014 at 1:47 PM, Sharma Podila <[email protected]> wrote:
>> >>
>> >> Should've looked at the code before sending the previous email...
>> >> master/main.cpp confirmed what I needed to know. It doesn't look like I
>> >> will be able to use reconcileTasks the way I thought I could. Effectively,
>> >> a lack of callback could either mean that the master agrees with the
>> >> requested reconcile task state, or that the task and/or slave is currently
>> >> unknown, which makes it an unreliable source of data. I understand this is
>> >> expected to improve later by leveraging the registrar, but I suspect
>> >> there's more to it.
>> >>
>> >> I take it then that individual frameworks need to have their own
>> >> mechanisms to ascertain the state of their tasks.
>> >>
>> >>
>> >> On Thu, Apr 17, 2014 at 12:53 PM, Sharma Podila <[email protected]> wrote:
>> >>
>> >> Hello
>> >>
>> >>
>>
>
>
