Re: IEP-14: Ignite failures handling (Discussion)

Dmitry Pavlov Thu, 15 Mar 2018 05:23:01 -0700

Hi Dmitriy,

It seems everyone here agrees that killing the process gives a more
reliable result. The problem is that the majority of the community does
not consider this acceptable when Ignite is started as an embedded
library (e.g. from Java, using Ignition.start()).


What would help in accepting the community's opinion? Let's remember the
Apache principle: "community first".

If release 2.5 shows us that this was impractical, we will change the
default to kill even for the embedded library case. What do you think?

Sincerely,
Dmitriy Pavlov

Thu, 15 Mar 2018 at 5:48, Dmitriy Setrakyan <dsetrak...@apache.org> wrote:

> On Wed, Mar 14, 2018 at 7:12 PM, Andrey Kornev <andrewkor...@hotmail.com>
> wrote:
>
> > I'm not disagreeing with you, Dmitriy.
> >
> > What I'm trying to say is that if we assume that a serious enough bug or
> > some environmental issue prevents an Ignite node from functioning
> > correctly, then it's only logical to assume that the Ignite process is
> > completely hosed (for example, due to a very long STW pause) and can't
> > make any progress at all. In a situation like this the application can't
> > reason about the process state, and the process itself may not even be
> > able to kill itself. The only reliable way to handle cases like that is
> > to have an external observer (a health monitoring tool) that is not
> > itself affected by the bug or the environmental issue and can either
> > make a decision by itself or send a notification to the SRE team.
> >
>
> Agree about the external observers, but that is something a user should do
> outside of Ignite.
>
>
> > In my previous post I only suggested going easy on the "cleverness" of
> > the self-monitoring implementation, as IMHO it won't be used much in
> > production environments. I think Ignite as it is already provides
> > sufficient means of monitoring its health (they may or may not be robust
> > enough, which is a different issue).
> >
>
> The approach I am suggesting is pretty simple - "kill" the process in case
> of a critical error. The only intelligence I would like to add is to
> attempt shutting down the Ignite node gracefully before the "kill" is
> executed. If a node is shut down gracefully, then the restart procedure
> will be faster, so it is worthwhile to try.
>
> Some of the critical errors include running out of disk space or memory,
> losing Ignite system threads, etc. These errors are truly unrecoverable
> from the application's standpoint and should mostly be handled with a
> process restart anyway.
>
> D.
>
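
For illustration only, the "try a graceful stop, then kill" idea quoted
above could be sketched roughly as below. This is not code from IEP-14 or
from Ignite: the class StopThenKill, the method handleCriticalFailure and
the timeout parameter are hypothetical names, and the two actions are passed
in as plain Runnables. In a real deployment the graceful step might be
something like Ignition.stop(true) and the kill step
Runtime.getRuntime().halt(...), but that wiring is an assumption.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: bounded graceful stop followed by an unconditional kill.
final class StopThenKill {
    /**
     * Runs gracefulStop on a helper thread and waits at most timeoutMillis,
     * then always runs kill. Returns true if the graceful stop finished in
     * time (only observable when kill does not terminate the JVM).
     */
    static boolean handleCriticalFailure(Runnable gracefulStop, Runnable kill, long timeoutMillis) {
        // Daemon thread, so a hung graceful stop cannot keep the JVM alive.
        ExecutorService exec = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "graceful-stop");
            t.setDaemon(true);
            return t;
        });
        boolean stoppedGracefully = false;
        try {
            Future<?> f = exec.submit(gracefulStop);
            f.get(timeoutMillis, TimeUnit.MILLISECONDS);
            stoppedGracefully = true;
        } catch (TimeoutException | ExecutionException e) {
            // Graceful stop hung or failed; fall through to the kill step.
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            exec.shutdownNow(); // interrupt a still-running graceful stop
            kill.run();         // the "kill" is executed in every case
        }
        return stoppedGracefully;
    }
}
```

The timeout is the important part of the sketch: after a critical error the
graceful stop itself may hang, so the kill step must not depend on it
completing.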
