Re: Registering and framework failover

Vinod Kone Fri, 15 Jul 2016 14:02:09 -0700

As Neil mentioned, we plan to add error codes for asynchronous errors
(error() callback in the old API, Error event in the new API) and
synchronous errors (HTTP 4xx/5xx responses in the new API).


Having said that, I would advise against adding logic in your framework to
do something smart (like removing its framework id from a persistent store)
when it gets an error saying "the framework has been removed". I would
rather have the framework exit/crash and require operator involvement to
rectify the situation. If you end up in this state, something has gone
badly wrong in your cluster relative to your expectations, and it should be
a very rare event. Having human involvement in rare catastrophic events is
probably better.
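
To make the "exit and let an operator sort it out" suggestion concrete,
here is a minimal sketch in Python. It does not use any Mesos API: the
handler name, its parameters, and the logger name are hypothetical, and
the matched string is just the message quoted in this thread, which is
not a stable interface. The idea is to call something like this from the
scheduler's error() callback (or Error event handler).

```python
import logging
import sys

log = logging.getLogger("framework")

# Assumption: this is the error text quoted in this thread. It is not a
# stable API, so it is used only to improve the log message, never to
# drive recovery logic.
FRAMEWORK_REMOVED = "Framework has been removed"


def on_error(message, stored_framework_id, exit_fn=sys.exit):
    """Hypothetical handler for the scheduler's error() callback.

    Logs the error and exits with a non-zero status. It deliberately does
    not delete or replace the persisted framework id; that decision is
    left to a human operator.
    """
    if FRAMEWORK_REMOVED in message:
        log.critical(
            "Master rejected framework id %s: %s. "
            "Leaving the stored id untouched; operator action required.",
            stored_framework_id,
            message,
        )
    else:
        log.critical("Unrecoverable scheduler error: %s", message)
    # Exit non-zero in every case so a supervisor or operator notices.
    exit_fn(1)
```

The exit function is injectable only so the handler can be exercised in a
test without terminating the process; in production the default sys.exit
is what you want.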

On Thu, Jul 14, 2016 at 3:32 AM, Evers Benno <[email protected]> wrote:

> So, given that this probably won't be changed before the 1.0 release,
> are the strings considered part of the stable API? Or is it recommended
> not to rely on `error()` at all? (That's what we did for now, setting
> failover timeout to 5 years)
>
> On 13.07.2016 15:37, Neil Conway wrote:
> > Ah, right -- yes, at the moment you need to look at error strings to
> > decide whether to retry with a new framework ID, unfortunately. IMO we
> > should introduce error codes or enums to make this process more
> > reliable, but no one has done so yet:
> >
> > https://issues.apache.org/jira/browse/MESOS-4548
> > https://issues.apache.org/jira/browse/MESOS-5322
> >
> > Neil
> >
> >
> > On Wed, Jul 13, 2016 at 3:27 PM, Evers Benno <[email protected]> wrote:
> >> Let me try to clarify:
> >>
> >> The problem is that I don't get to decide manually if the framework
> >> should try to take a new id or re-use the old one, but it needs to be
> >> decided programmatically, by an algorithm.
> >>
> >> Afaik it's not possible to get the time when the framework disconnected
> >> from mesos, so it's not possible to know how much time is left until the
> >> failover timeout runs out. Therefore, if I want to attempt task
> >> reconciliation, I just have to try registering with my old framework id
> >> and see what happens.
> >>
> >> However, in the case where the failover timeout already passed, I now
> >> need to programmatically detect this error and try again with an empty
> >> framework id.
> >>
> >> My question was, is it possible to do this?
> >>
> >> (also, we actually use a failover timeout of 1 week, but it doesn't
> >> really change the problem and I mistakenly assumed that an example with
> >> smaller values would be more intuitive)
> >>
> >> On 13.07.2016 14:50, Neil Conway wrote:
> >>> On Wed, Jul 13, 2016 at 2:44 PM, Evers Benno <[email protected]> wrote:
> >>>> imagine the following situation: I am a framework with failover timeout
> >>>> of 1 hour, and 59 minutes and 55 seconds after shutting down I want to
> >>>> register with the master again.
> >>>>
> >>>> If my registration attempt arrives at the master within the time limit
> >>>> everything will be fine and I even get back the old tasks for
> >>>> reconciliation, but if it arrives slightly later the framework id is
> >>>> permanently blocked by mesos, and I am not able to register. Instead, I
> >>>> will receive an error()-callback with the message "Framework has been
> >>>> removed".
> >>>
> >>> Right: if you set a failover_timeout of 1 hour, your framework is
> >>> expected to reregister within one hour. If it does not, all of its
> >>> tasks will be killed and you need to start over with a new
> >>> FrameworkID. Can you clarify which aspect of this behavior is
> >>> problematic for you?
> >>>
> >>> Note that a failover_timeout of 1 hour is probably a little low.
> >>>
> >>>> Is there any way to reliably connect to the master while also
> >>>> reconciling old tasks if possible?
> >>>
> >>> Sorry, not sure what you mean by this.
> >>>
> >>> Neil
> >>>
>
