> Whereas a TASK_LOST will make me (unnecessarily, in this case) try to
> ensure that the task is actually lost, not running away on the slave
> that got disconnected from the Mesos master. Not all environments may
> need the distinction, but at least some do.
To be clear, are you still planning to do this out-of-band reconciliation
once Mesos provides complete reconciliation (thanks to the Registrar)?
Mesos will ensure that the situation you describe is not possible (in
0.19.0 optionally, and in 0.20.0 by default).

Taking a step back: you will always have to deal with TASK_LOST as a
status *regardless* of what the true status of the task was; this is the
reality of failures in a distributed system. For example, let's say the
Master fails right before we could send you the TASK_INVALID_OFFER update,
or your framework fails right before it could persist the
TASK_INVALID_OFFER update. In both cases you will need to reconcile with
the Master, and the result will be TASK_LOST. Likewise, let's say your
task went TASK_FINISHED on the slave, but the slave fails permanently
before the update could reach the Master. Then when you reconcile with the
Master, it will be TASK_LOST.

For these reasons, we haven't yet found much value in providing more
precise task states for these various conditions.
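For illustration, a minimal sketch of that uniform handling against the
Java scheduler bindings. The believedActive map and the relaunch path are
bookkeeping assumed for the example, not Mesos APIs; reconcileTasks() on
SchedulerDriver is the real call:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.mesos.Protos.MasterInfo;
import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.SchedulerDriver;

// Fragment of a Scheduler implementation (the remaining callbacks of the
// org.apache.mesos.Scheduler interface are elided for brevity).
public class ReconcilingScheduler {
    // Tasks this framework believes are active -- hypothetical bookkeeping
    // kept by the framework itself, not something Mesos maintains.
    private final Map<TaskID, TaskState> believedActive =
        new ConcurrentHashMap<TaskID, TaskState>();

    // Invoked when the driver reconnects to a new master; also a natural
    // point to reconcile after a framework failover.
    public void reregistered(SchedulerDriver driver, MasterInfo masterInfo) {
        List<TaskStatus> beliefs = new ArrayList<TaskStatus>();
        for (Map.Entry<TaskID, TaskState> entry : believedActive.entrySet()) {
            beliefs.add(TaskStatus.newBuilder()
                .setTaskId(entry.getKey())
                .setState(entry.getValue())
                .build());
        }
        // The master replies with a status update per task; any task it
        // no longer knows about comes back as TASK_LOST.
        driver.reconcileTasks(beliefs);
    }

    public void statusUpdate(SchedulerDriver driver, TaskStatus status) {
        if (status.getState() == TaskState.TASK_LOST) {
            // One uniform path for every TASK_LOST, whatever its cause:
            // forget the task and let the scheduling loop relaunch it.
            believedActive.remove(status.getTaskId());
        }
    }
}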
On Tue, May 13, 2014 at 10:10 AM, Sharma Podila <[email protected]> wrote:

> Thanks for confirming that, Adam.
>
>> , but it would be a good Mesos FAQ topic.
>
> I was thinking it might be good to also add this to the documentation in
> the code, either in mesos.proto or MesosSchedulerDriver (mesos.proto
> already refers to the latter for failover, at the FrameworkID
> definition).
>
>> If you were to try to persist the 'ephemeral' offers to another
>> framework instance, and call launchTasks with one of the old offers,
>> the master will respond with TASK_LOST ("Task launched with invalid
>> offers"), since the master no longer knows about that offer
>
> Strictly speaking, shouldn't this produce some kind of an 'invalid offer'
> response instead of the task being lost? A TASK_LOST response is handled
> differently in my scheduler, for example, compared to what I'd do for an
> invalid-offer response. An invalid offer would just mean a simple discard
> of the offer and a retry of the launch with a more recent offer. Whereas
> a TASK_LOST will make me (unnecessarily, in this case) try to ensure that
> the task is actually lost, not running away on the slave that got
> disconnected from the Mesos master. Not all environments may need the
> distinction, but at least some do.
>
> On Mon, May 12, 2014 at 11:12 PM, Adam Bordelon <[email protected]> wrote:
>
>> Correct, Sharma. I don't think this is documented anywhere yet, but it
>> would be a good Mesos FAQ topic.
>> When the master notices that the framework has exited or is deactivated,
>> it disables the framework in the allocator so no new offers will be made
>> to that framework, and removes any outstanding offers (but does not send
>> a RescindResourceOfferMessage to the framework, since the framework is
>> presumably failing over). When a framework reregisters, it is
>> reactivated in the allocator and will start receiving new offers again.
>> If you were to try to persist the 'ephemeral' offers to another
>> framework instance, and call launchTasks with one of the old offers,
>> the master will respond with TASK_LOST ("Task launched with invalid
>> offers"), since the master no longer knows about that offer. So don't
>> bother trying. :)
>> Already running tasks (used offers) continue running, unless the
>> framework failover timeout is exceeded.
>>
>> On Mon, May 12, 2014 at 5:38 PM, Sharma Podila <[email protected]> wrote:
>>
>>> My understanding is that when a framework fails over (either a new
>>> instance starts after the previous one fails, or the same instance
>>> restarts), the Mesos master will automatically cancel any unused offers
>>> it had given to the previous framework instance. This is a good thing.
>>> Can someone confirm this to be the case? Is such an expectation
>>> documented somewhere? I did look at master.cpp and I hope I interpreted
>>> it right.
>>>
>>> Effectively then, the offers are 'ephemeral' and don't need to be
>>> persisted by the framework scheduler to pass along to another of its
>>> instances that may fail over as the leader.
>>>
>>> Thank you.
>>>
>>> Sharma
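As a companion sketch, the failover side of this: the only thing a
failing-over scheduler needs to carry across instances is its FrameworkID,
never offers. The class and names below (FailoverInfo, forFailover,
"my-framework") are invented for the example; the FrameworkInfo fields are
the real protobuf fields from mesos.proto:

import org.apache.mesos.Protos.FrameworkID;
import org.apache.mesos.Protos.FrameworkInfo;

public class FailoverInfo {
    // Build the FrameworkInfo a replacement scheduler instance registers
    // with. `previousId` is the FrameworkID the prior instance persisted
    // (in ZooKeeper, a database, etc. -- Mesos doesn't care where).
    public static FrameworkInfo forFailover(String previousId) {
        return FrameworkInfo.newBuilder()
            .setName("my-framework")  // example name
            .setUser("")              // empty string => current user
            // Reusing the old ID tells the master this registration is a
            // failover of an existing framework, not a new framework, so
            // its running tasks keep running.
            .setId(FrameworkID.newBuilder().setValue(previousId))
            // Seconds the master waits for a replacement instance before
            // it kills the framework's running tasks.
            .setFailoverTimeout(7 * 24 * 3600)
            .build();
    }
}

The replacement instance passes this FrameworkInfo to MesosSchedulerDriver
and then simply waits for fresh resourceOffers() callbacks; the master has
already removed the previous instance's outstanding offers, and calling
launchTasks() with one of them would come back exactly as described above:
TASK_LOST, "Task launched with invalid offers".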
