Dmitry,

There are two cases:
1) STW duration is long -> notify monitoring via a JMX metric

2) STW duration exceeds N seconds -> no need to wait for anything.
We already know that the node will be segmented, or that a pause longer than N
seconds will affect cluster performance.
The better option is to kill the node ASAP to protect the cluster. Some customers
have huge timeouts, and such a node can kill the whole cluster if it is not
killed by a watchdog.
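
A rough sketch of the two cases, for illustration only. The class name
StwPolicy, the thresholds and the sleep-based measurement are assumptions
made for this example, not existing Ignite code. Note that a thread inside
the JVM can only observe the pause after it has ended, so case 2 would still
need an external watchdog to stop the node before the STW finishes.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper: classifies an observed pause into the two cases above. */
public class StwPolicy {
    /** What to do about an observed pause. */
    public enum Action { NONE, REPORT_METRIC, KILL_NODE }

    /** Longest pause seen so far, in ms; a real node could expose this via JMX. */
    private static final AtomicLong MAX_PAUSE_MS = new AtomicLong();

    /**
     * Case 1: pause longer than warnMs -> report a metric.
     * Case 2: pause longer than killMs (the "N seconds") -> kill the node.
     */
    public static Action classify(long pauseMs, long warnMs, long killMs) {
        if (pauseMs > killMs)
            return Action.KILL_NODE;
        if (pauseMs > warnMs)
            return Action.REPORT_METRIC;
        return Action.NONE;
    }

    /**
     * Sleeps for sleepMs and returns how much longer than requested the sleep
     * took. A large overshoot suggests the JVM (or the whole host) was paused.
     */
    public static long measurePauseMs(long sleepMs) throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(sleepMs);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        long pauseMs = Math.max(0, elapsedMs - sleepMs);
        MAX_PAUSE_MS.accumulateAndGet(pauseMs, Math::max);
        return pauseMs;
    }

    /** Longest pause recorded by measurePauseMs so far, in ms. */
    public static long maxPauseMs() {
        return MAX_PAUSE_MS.get();
    }
}
```

A detector thread would call measurePauseMs in a loop and pass the result to
classify; what "kill" means in practice (halt, segmentation, external script)
is left open here.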

On Mon, Nov 20, 2017 at 7:23 PM, Dmitry Pavlov <dpavlov....@gmail.com>
wrote:

> Hi Anton,
>
> > - GC STW duration exceeds the maximum possible length (node should be stopped
> > before STW finishes)
>
> Are you sure we should kill the node in case of a long STW? Can we write
> warnings to the logs and monitoring tools and wait a little longer for the
> node to become alive if we detect an STW? In this case we can notify the
> coordinator or another node that 'the current node is in STW, please wait
> longer than 3 heartbeat timeouts'.
>
> Presumably such pauses will not occur often?
>
> Sincerely,
> Dmitriy Pavlov
>
> On Mon, Nov 20, 2017 at 18:53, Anton Vinogradov <avinogra...@gridgain.com> wrote:
>
> > Igniters,
> >
> > Internal problems may, and unfortunately do, cause unexpected cluster
> > behavior.
> > We should define the behavior for the case when any such internal problem
> > happens.
> >
> > Well-known internal problems can be split into:
> > 1) OOM or any other reason that causes a node crash
> >
> > 2) Situations that require graceful node shutdown with custom notification
> > - IgniteOutOfMemoryException
> > - Persistence errors
> > - ExchangeWorker exits with error
> >
> > 3) Performance issues that should be covered by metrics
> > - GC STW duration
> > - Timed out tasks and jobs
> > - TX deadlock
> > - Hung Tx (waits for some service)
> > - Java Deadlocks
> >
> > I created a special issue [1] to make sure all these metrics will be
> > presented in WebConsole or VisorConsole (which is preferred?)
> >
> > 4) Situations that require an external monitoring implementation
> > - GC STW duration exceeds the maximum possible length (node should be stopped
> > before STW finishes)
> >
> > All these problems were reported by different people at different times,
> > so we should reanalyze each of them and, possibly, find better ways to
> > solve them than those described in the issues.
> >
> > P.S. IEP-7 [2] already contains 9 issues; feel free to mention something
> > else :)
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-6961
> > [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
> >
>
