If I were the one responsible for running Ignite-based applications (be it embedded or standalone Ignite) in my company's datacenter, I'd prefer the application nodes simply make their current state readily available to external tools (via JMX, health checks, etc.) and leave the decision of when to die and when to continue running up to me. The last thing I need in production is an application so clever that it decides to kill itself based on its local (and perhaps confused) state.
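[Editorial note: the "report, don't self-destruct" approach above can be sketched with a plain health-check endpoint using only the JDK's built-in `com.sun.net.httpserver` package. All names here (`HealthEndpointSketch`, the `/health` path, the `OK`/`FAILED` states) are illustrative assumptions, not part of any Ignite API.]

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.util.concurrent.atomic.AtomicReference;

// Sketch: the node exposes its state over HTTP and leaves the kill
// decision to external tooling. Hypothetical names throughout.
public class HealthEndpointSketch {
    static final AtomicReference<String> status = new AtomicReference<>("OK");

    public static void main(String[] args) throws Exception {
        // Bind to an ephemeral port for the sketch.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/health", exchange -> {
            byte[] body = status.get().getBytes("UTF-8");
            // 200 when healthy, 503 when failed - the convention most
            // monitoring tools and liveness probes understand.
            int code = status.get().equals("OK") ? 200 : 503;
            exchange.sendResponseHeaders(code, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();

        // Simulate a detected critical failure: we only flip the flag;
        // the SRE team's tooling decides whether to restart the process.
        status.set("FAILED");

        URL url = new URL("http://localhost:"
            + server.getAddress().getPort() + "/health");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        System.out.println(conn.getResponseCode());  // prints 503
        server.stop(0);
    }
}
```

An external monitor polling this endpoint gets the node's self-reported state without the node ever deciding to halt its own JVM.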
Usually SRE teams build all sorts of technology-specific tools to monitor the health of their applications, and they like to be as much in control as possible when it comes to killing processes.

I guess what I'm saying is this: keep things simple. Do not over-engineer. In real production environments companies will most likely have this feature disabled (I know I would) and instead rely on their own tooling for handling failures.

Regards,
Andrey

________________________________
From: Vladimir Ozerov <voze...@gridgain.com>
Sent: Tuesday, March 13, 2018 10:43 PM
To: firstname.lastname@example.org
Subject: Re: IEP-14: Ignite failures handling (Discussion)

As far as shutdown goes, what we need to implement is a “hard shutdown” mode. This is when we first close all network sockets, then cancel all registered futures. This would be enough to unblock the cluster and local user threads.

On Wed, Mar 14, 2018 at 8:40, Vladimir Ozerov <voze...@gridgain.com> wrote:

> Valya,
>
> This is very easy to answer - if CommandLineStartup is used, then it is a
> standalone node. In all other cases it is embedded.
>
> If node shutdown hangs - just let it continue hanging, so that application
> admins are able to decide on their own what to do next. Someone would want
> to get the stack trace, others would decide to restart outside of business
> hours (e.g. because Ignite is used only in part of their application),
> someone else would try to gracefully shut down other components before
> stopping the process to minimize negative impact, etc.
>
> I do not quite understand why we are guessing here how embedded Ignite is
> used. It could be used in any way and in any combination with other
> frameworks. Process stop by default is simply not an option.
>
> On Wed, Mar 14, 2018 at 3:12, Valentin Kulichenko <
> valentin.kuliche...@gmail.com> wrote:
>
>> Ivan,
>>
>> If the grid hangs, graceful shutdown would most likely hang as well. You
>> can almost never recover from a bad state using graceful procedures.
>>
>> I agree that we should not create two defaults, especially in this case.
>> It's not even strictly defined what an embedded node is in Ignite. For
>> example, if I start it using a custom main class and/or custom script
>> instead of ignite.sh, would it be an embedded or a standalone node?
>>
>> -Val
>>
>> On Tue, Mar 13, 2018 at 4:58 PM, Ivan Rakov <ivan.glu...@gmail.com>
>> wrote:
>>
>> > One more note: "kill if standalone, stop if embedded" differs from what
>> > you are suggesting ("try graceful, then kill process regardless") only
>> > in the case when graceful shutdown hangs.
>> > Do we have an understanding of how often graceful shutdown hangs?
>> > Obviously, a *grid hang* is a frequent case, but it shouldn't be
>> > confused with a *graceful shutdown hang*. From my experience, if
>> > something went wrong, users just prefer to do kill -9 because it's much
>> > more reliable and easy. Probably, in most cases where kill -9 worked, a
>> > graceful stop would have worked as well - we just don't have such
>> > statistics.
>> > It may be a bad example, but: in our CI tests we intentionally break
>> > the grid in many harsh ways and perform a graceful stop after the test
>> > execution, and it doesn't hang - otherwise we'd see many "Execution
>> > timeout" test suite hangs.
>> >
>> > Best Regards,
>> > Ivan Rakov
>> >
>> >
>> > On 14.03.2018 2:24, Dmitriy Setrakyan wrote:
>> >
>> >> On Tue, Mar 13, 2018 at 7:13 PM, Ivan Rakov <ivan.glu...@gmail.com>
>> >> wrote:
>> >>
>> >>> I just would like to add my +1 for the "kill if standalone, stop if
>> >>> embedded" default option. My arguments:
>> >>>
>> >>> 1) Regarding "If Ignite hangs - it will likely be impossible to
>> >>> stop": Unfortunately, it's true that Ignite can hang during the stop
>> >>> procedure. However, most of the failures described under IEP-14
>> >>> (storage IO exceptions, death of a critical system worker thread,
>> >>> etc.) normally shouldn't turn the node into an "impossible to stop"
>> >>> state.
>> >>> Turning into that state is a bug itself. I guess that we shouldn't
>> >>> choose system behavior on the basis of known bugs.
>> >>>
>> >>
>> >> The whole discussion is about protecting against force majeure issues,
>> >> including Ignite bugs. You are assuming that a user application will
>> >> somehow continue to function if an Ignite node is stopped. In most
>> >> cases it will just freeze itself and cause the rest of the application
>> >> to hang.
>> >>
>> >> Again, "kill+stop" is the most deterministic and the safest default
>> >> behavior. Try a graceful shutdown (which will make restart easier),
>> >> and then kill the process regardless.
>> >>
>> >> Note that we are arguing about the default behavior. If a user does
>> >> not like this default, then this user can change it to another
>> >> behavior.
>> >>
>> >>
>> >>> 2) A user might want to handle an Ignite node crash before shutting
>> >>> down the whole JVM - raise an alert, close external resources, etc.
>> >>>
>> >> Very unlikely, but if a user is this advanced, then this user can
>> >> change the default behavior. Most users will not even know how to
>> >> configure such custom shutdown behavior and would prefer an automatic
>> >> kill.
>> >>
>> >>> 3) The IEP-14 document has important notes: "More than one Ignite
>> >>> node could be started in one JVM process" and "Different nodes in one
>> >>> JVM process could belong to different clusters". This is possible
>> >>> only in embedded mode. I think we shouldn't shock the user with a
>> >>> sudden JVM halt (possibly along with other healthy nodes) if there's
>> >>> a chance of a successful node stop.
>> >>>
>> >> Has anyone actually seen a real example of that? I have not. This
>> >> scenario is extremely unlikely and should not define the default
>> >> behavior.
>> >> Again, if a user is so advanced as to come up with such a
>> >> sophisticated deployment, then the same user should be able to set
>> >> different default behaviors for different clusters.
>> >>
>> >
>>
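[Editorial note: the "change it to another behavior" option debated in this thread became a pluggable API in the `org.apache.ignite.failure` package that shipped after IEP-14 (around Ignite 2.5). A minimal configuration sketch of opting out of self-killing, assuming that API, matching the behavior Andrey argues for at the top of the thread:]

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.NoOpFailureHandler;

// Configuration sketch (assumes Ignite 2.5+): on a critical failure the
// node only records and exposes the failure, leaving process lifecycle
// decisions to external tooling.
public class NoSelfKillConfig {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Without this, the default handler stops the node and may halt
        // the JVM (StopNodeOrHaltFailureHandler).
        cfg.setFailureHandler(new NoOpFailureHandler());

        Ignition.start(cfg);
    }
}
```

This is a configuration fragment requiring the Ignite dependency on the classpath; it is shown only to connect the thread's "default vs. configurable" argument to the API that eventually existed.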