So task reconciliation will always tell me if a task is finished when the slave is still running, and it will give me TASK_LOST if the slave or task is unknown to the master? If so, these semantics are very convenient for frameworks that fail to fail over in a timely manner and then ask about tasks that belonged to their previous FrameworkID.
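
For reference, the reply rules in the quoted message below can be sketched as a small Python model. This is only an illustration, not Mesos code: the function name `reconcile_reply`, the `master_tasks` dict and the `transient_slaves` set are hypothetical, and the "states agree, so nothing is sent" branch is taken from Sharma's reading of the master code further down the thread.

```python
from typing import Dict, Optional, Set

TASK_LOST = "TASK_LOST"


def reconcile_reply(
    master_tasks: Dict[str, Dict[str, str]],  # slave_id -> {task_id: latest state}
    transient_slaves: Set[str],               # slaves in a transient state
    slave_id: str,
    task_id: str,
    claimed_state: str,
) -> Optional[str]:
    """Return the state the master would send back, or None for no reply."""
    if slave_id in transient_slaves:
        return None              # slave in a transient state: no reply
    if slave_id not in master_tasks:
        return TASK_LOST         # (a) unknown slave
    tasks = master_tasks[slave_id]
    if task_id not in tasks:
        return TASK_LOST         # (b) known slave, unknown task
    latest = tasks[task_id]
    if latest != claimed_state:
        return latest            # (c) known task with a different state
    return None                  # states agree: no update is sent


# Hypothetical master view: one slave running one task.
master = {"slave-1": {"task-a": "TASK_RUNNING"}}
# A master that has just failed over knows no slaves or tasks at all.
failed_over_master: Dict[str, Dict[str, str]] = {}
```

With `failed_over_master`, asking about a task that already finished yields TASK_LOST, which is the case Benjamin warns about: the framework has to remember the terminal update it already received.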

On Fri, Apr 18, 2014 at 1:55 PM, Benjamin Mahler <[email protected]> wrote:

> Vinod, David is asking about tasks that "belong" to the framework in that
> they were "launched" by it, in which case your answer is not correct. We
> don't keep track of tasks so we don't know whether the task "belongs" to
> the framework in this sense.
>
> David, you will either receive TASK_LOST or nothing (if the slave for
> the task is in a transient state).
>
> This is determined more so by the SlaveID than the TaskID as the Master
> does not persistently track tasks.
>
> (a) If you're asking about an unknown slave, you will get TASK_LOST.
> (b) If you're asking about a known slave and an unknown task, you will get
> TASK_LOST.
> (c) If you're asking about a known slave and a known task with a different
> state, you will be sent the latest state.
>
> If you consider these semantics, you'll realize that you may receive
> TASK_LOST if you try to reconcile your task that finished correctly. This
> is why I mentioned the need to persist updates in (1) above. Let's say you
> receive a terminal update of TASK_FINISHED and then you still try to
> reconcile against a failed over Master. This new Master will reply with
> TASK_LOST because it is unaware of the task/slave. So, you will always
> receive your valid terminal update before getting a TASK_LOST from
> reconciliation.
>
> On Fri, Apr 18, 2014 at 10:46 AM, Vinod Kone <[email protected]> wrote:
>
>> If a framework asks to reconcile a task that doesn't belong to it there
>> would be no response from the master. This is nice because it avoids
>> information leak between frameworks.
>>
>> On Fri, Apr 18, 2014 at 5:04 AM, David Greenberg <[email protected]> wrote:
>>
>> > Piggybacking onto this thread with a follow up question: what happens if
>> > you ask the master to reconcile some tasks that weren't launched by your
>> > framework? Will you get messages that express those tasks were unknown,
>> > lost, or will nothing respond?
>> >
>> > On Thursday, April 17, 2014, Sharma Podila <[email protected]> wrote:
>> >
>> >> No problem, I have a better understanding now.
>> >> And it was useful to see the three items you listed explicitly.
>> >>
>> >> On Thu, Apr 17, 2014 at 2:39 PM, Benjamin Mahler <[email protected]> wrote:
>> >>
>> >> Good to see you were playing around with reconciliation, we should have
>> >> made the current semantics more clear. Especially in light of the fact
>> >> that it's not implemented fully until one uses a strict registrar
>> >> (likely 0.20.0).
>> >>
>> >> Think of reconciliation as the fallback mechanism to ensure that state
>> >> is consistent, it's not designed to be something to inform you of things
>> >> you were already told (in this case, that the tasks were running).
>> >> Although we could consider sending updates even when task state remains
>> >> the same.
>> >>
>> >> For the purpose of this conversation, let's say we're in the 0.20.0
>> >> world, operating with the registrar. And let's assume your goal is to
>> >> build a highly available framework (I will be documenting how to do this
>> >> for 0.20.0):
>> >>
>> >> (1) *When you receive a status update, you must persist this information
>> >> before returning from the statusUpdate() callback*. Once you return from
>> >> the callback, the driver will acknowledge the slave directly. Slaves
>> >> will retry status update delivery *until* the acknowledgement is
>> >> received from the scheduler driver in order to ensure that the framework
>> >> processed the update.
>> >>
>> >> (2) *When you receive a "slave lost" signal, it means that your tasks
>> >> that were running on that slave are in state TASK_LOST*, and any
>> >> reconciliation you perform for these tasks will result in a reply of
>> >> TASK_LOST.
>> >> Most of the time we'll deliver these TASK_LOST automatically, but with
>> >> a confluence of Master *and* Slave failovers, we are unaware of which
>> >> tasks were running on the slave as we do not persist this information
>> >> in the Master.
>> >>
>> >> (3) To guarantee that you have a consistent view of task states, *you
>> >> must also periodically reconcile task state against the Master*. This is
>> >> only because the delivery of the "slave lost" signal in (2) is not
>> >> reliable (the Master could failover after removing a slave but before
>> >> telling frameworks that the slave was lost).
>> >>
>> >> You'll notice that this model forces one to serially persist all status
>> >> update changes. We are planning to expose mechanisms to allow "batch"
>> >> acknowledgement of status updates in the lower-level API that benh has
>> >> given talks about. With a lower-level API, it is possible to build more
>> >> powerful libraries that hide much of these details!
>> >>
>> >> You'll also perhaps notice that only (1) and (3) are strictly required
>> >> for consistency, but (2) is highly recommended as the vast majority of
>> >> the time the "slave lost" signal will be delivered and you can take
>> >> action quickly, without having to rely on periodic reconciliation.
>> >>
>> >> Please let me know if anything here was not clear!
>> >>
>> >> On Thu, Apr 17, 2014 at 1:47 PM, Sharma Podila <[email protected]> wrote:
>> >>
>> >> Should've looked at the code before sending the previous email...
>> >> master/main.cpp confirmed what I needed to know. It doesn't look like I
>> >> will be able to use reconcileTasks the way I thought I could.
>> >> Effectively, a lack of callback could either mean that the master agrees
>> >> with the requested reconcile task state, or that the task and/or slave
>> >> is currently unknown. Which makes it an unreliable source of data.
>> >> I understand this is expected to improve later by leveraging the
>> >> registrar, but, I suspect there's more to it.
>> >>
>> >> I take it then that individual frameworks need to have their own
>> >> mechanisms to ascertain the state of their tasks.
>> >>
>> >> On Thu, Apr 17, 2014 at 12:53 PM, Sharma Podila <[email protected]> wrote:
>> >>
>> >> Hello
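
Rules (1) to (3) from the quoted 2:39 PM message can be sketched from the framework side roughly as follows. This is a minimal sketch under stated assumptions, not the Mesos scheduler API: `TaskStore`, `Scheduler` and their method names are hypothetical, a plain dict stands in for real durable storage, and both "never overwrite a terminal state" and "only reconcile non-terminal tasks" are design choices of this sketch rather than anything the thread prescribes.

```python
from typing import Dict, List, Tuple

# Terminal task states; once one is recorded it is never overwritten here.
TERMINAL = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


class TaskStore:
    """Hypothetical durable store; a dict stands in for real persistence."""

    def __init__(self) -> None:
        # task_id -> (slave_id, last recorded state)
        self.state: Dict[str, Tuple[str, str]] = {}

    def record(self, task_id: str, slave_id: str, state: str) -> None:
        prev = self.state.get(task_id)
        if prev is not None and prev[1] in TERMINAL:
            # Keep the first terminal update, e.g. a TASK_FINISHED that is
            # later followed by TASK_LOST from a failed-over master.
            return
        self.state[task_id] = (slave_id, state)


class Scheduler:
    def __init__(self, store: TaskStore) -> None:
        self.store = store

    def status_update(self, task_id: str, slave_id: str, state: str) -> None:
        # Rule (1): persist before returning, since returning from the
        # callback is what allows the update to be acknowledged.
        self.store.record(task_id, slave_id, state)

    def slave_lost(self, slave_id: str) -> None:
        # Rule (2): tasks on a lost slave are treated as TASK_LOST.
        for task_id, (sid, _state) in list(self.store.state.items()):
            if sid == slave_id:
                self.store.record(task_id, sid, "TASK_LOST")

    def tasks_to_reconcile(self) -> List[Tuple[str, str, str]]:
        # Rule (3): the tasks to send to the master on each periodic
        # reconciliation pass (non-terminal tasks only, in this sketch).
        return [
            (task_id, sid, state)
            for task_id, (sid, state) in self.store.state.items()
            if state not in TERMINAL
        ]
```

A periodic timer would call `tasks_to_reconcile()` and pass the result to whatever reconciliation call the driver offers; any reply arrives through the same `status_update` path, so a late TASK_LOST cannot clobber a stored TASK_FINISHED.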
