On Wed, Mar 14, 2018 at 3:36 PM, Andrey Kornev <andrewkor...@hotmail.com> wrote:
> If I were the one responsible for running Ignite-based applications (be it
> embedded or standalone Ignite) in my company's datacenter, I'd prefer the
> application nodes simply make their current state readily available to
> external tools (via JMX, health checks, etc.) and leave the decision of
> when to die and when to continue to run up to me. The last thing I need in
> production is a too-clever application that decides to kill itself based
> on its local (perhaps confused) state.
>
> Usually SRE teams build all sorts of technology-specific tools to monitor
> the health of the applications and they like to be as much in control as
> possible when it comes to killing processes.
>
> I guess what I'm saying is this: keep things simple. Do not over-engineer.
> In real production environments the companies will most likely have this
> feature disabled (I know I would) and instead rely on their own tooling for
> handling failures.
>

Andrey, our priority should be to keep the cluster operational. If a frozen
Ignite node is kept around, the whole cluster becomes inoperable. I bet this
is not what you would prefer in production either. However, if we kill the
process, then the cluster should continue to operate.

We are talking about a distributed system in which the failure of one node
should not matter. If we want to keep this promise to the users, then we
must kill the process if an Ignite node freezes.

Also, keep in mind that we are talking about the "default" behavior. If you
are not happy with the "default" mode, then you will be able to configure
other behaviors, like keeping the frozen Ignite node around, if you like.

D.