It would be nice to see the whole design before going into low-level
details. Without it we are jumping from topic to topic. Were the list of
events and the reactions to these events discussed previously? At this
point it is not clear why nodes should be forcefully stopped without any
alternative.

For example, consider the following cases:
1) Exchange thread died. This is a critical situation. But as part of the
analysis, an administrator might want to dump threads before killing the
node. That can be done programmatically, which is difficult and requires
knowledge of Java, or through management utilities such as jstack or
VisualVM. Which is more user-friendly?
2) We start a service with multiple data regions. One data region is
configured incorrectly, which causes IOOME (IgniteOutOfMemoryException) on
multiple nodes. Why do you think that the whole cluster (or many nodes)
should be restarted? This means potential data loss in all caches (not only
in the affected ones) and an interruption of service. Instead, the
administrator might decide to reconfigure and restart nodes gradually, one
by one, rather than killing them all immediately.

This is why we need the design first.
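
(For reference, a minimal sketch of what "programmatically" means in case 1,
using only the standard java.lang.management API. ThreadDumpSketch and dump()
are hypothetical names for illustration, not part of Ignite, and the output
format only approximates what jstack prints.)

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper for illustration only; not an Ignite class.
final class ThreadDumpSketch {
    /** Returns a jstack-like text dump of all live threads in this JVM. */
    static String dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();

        // Request lock details only when the JVM supports them; otherwise
        // dumpAllThreads would throw UnsupportedOperationException.
        ThreadInfo[] infos = mx.dumpAllThreads(
            mx.isObjectMonitorUsageSupported(),
            mx.isSynchronizerUsageSupported());

        StringBuilder sb = new StringBuilder();

        for (ThreadInfo info : infos) {
            // Header line: thread name and state.
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');

            // One line per stack frame.
            for (StackTraceElement frame : info.getStackTrace())
                sb.append("    at ").append(frame).append('\n');

            sb.append('\n');
        }

        return sb.toString();
    }
}
```

Something like System.err.print(ThreadDumpSketch.dump()) would have to run
inside the node's JVM before the node is stopped, whereas jstack <pid> gives
comparable information from the outside without writing any code.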

On Wed, Nov 15, 2017 at 2:39 PM, Anton Vinogradov <avinogra...@gridgain.com>
wrote:

> According to [1]
>
> Reasons are:
> - IgniteOutOfMemoryException
> - Persistence errors
> - ExchangeWorker exits with error
>
> [1]
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
>
> On Wed, Nov 15, 2017 at 2:24 PM, Vladimir Ozerov <voze...@gridgain.com>
> wrote:
>
> > I am not sure I understand how the tasks are split. How can we discuss
> > graceful shutdown without discussing the reasons for this shutdown? What
> > leads to it?
> >
> > On Wed, Nov 15, 2017 at 2:10 PM, Anton Vinogradov <
> > avinogra...@gridgain.com>
> > wrote:
> >
> > > Vova,
> > >
> > > Currently we have a lot of IEPs to improve grid monitoring and behavior.
> > >
> > > Let's split the tasks into:
> > >
> > > 1) Graceful shutdown.
> > > In this case we'd like to give the user the ability to do something;
> > > LifecycleBean is what we are looking for, thanks for the tip!
> > > But we have to keep the shutdown reason somewhere.
> > > If you know where it is already kept, please let us know.
> > >
> > > 2) OOM or any other reason that causes a node crash.
> > > In this case some watchdog (like [1] or [2]) should monitor whether the
> > > node is alive.
> > >
> > > 3) GC and deadlock (Java and tx) issues.
> > > These should be monitored by a special thread [3] or published via
> > > metrics [4].
> > >
> > > 4) Throughput, latency and space issues.
> > > Special metrics should be developed according to [5].
> > >
> > > Andrey is asking about case #1 (graceful shutdown), so let's discuss
> > > only this case.
> > >
> > > [1] https://issues.apache.org/jira/browse/IGNITE-6587
> > > [2] https://wrapper.tanukisoftware.com/doc/english/download.jsp
> > > [3] https://issues.apache.org/jira/browse/IGNITE-6171
> > > [4]
> > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
> > > [5]
> > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-6%3A+Metrics+improvements
> > >
> > >
> > > On Wed, Nov 15, 2017 at 1:34 PM, Vladimir Ozerov <voze...@gridgain.com>
> > > wrote:
> > >
> > > > AFAIK the idea was not only to shut down the node, but also to give the
> > > > user (e.g. an administrator) the ability to observe the problem from
> > > > the outside, e.g. through JMX. If we detect a Java-level deadlock, it
> > > > doesn't mean that the only possible solution is node shutdown. The
> > > > reaction could also be a no-op, e.g. to give the user a chance to
> > > > collect additional system info, or simply because this particular
> > > > deadlock is resolvable (e.g. Lock.lockInterruptibly()). Since we need
> > > > to expose health info through JMX anyway, we could also give the user
> > > > programmatic access to it. Alternatively, we can expose this info
> > > > through JMX only and ask the user to get an instance of that bean
> > > > manually.
> > > >
> > > > On Wed, Nov 15, 2017 at 1:19 PM, Anton Vinogradov <
> > > > avinogra...@gridgain.com>
> > > > wrote:
> > > >
> > > > > Vova,
> > > > >
> > > > > Could you point to the metric you're talking about?
> > > > >
> > > > > On Wed, Nov 15, 2017 at 1:06 PM, Andrey Kuznetsov <stku...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Vladimir,
> > > > > >
> > > > > > Could you please clarify what local metrics are? Should I extend the
> > > > > > Ignite interface by adding something similar to dataRegionMetrics(),
> > > > > > or is there some universal mechanism to handle metrics?
> > > > > >
> > > > > > 2017-11-15 8:30 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
> > > > > > >
> > > > > > > This information should be available through local metrics, so that
> > > > > > > it is accessible from the Ignite instance.
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
