Hi Anton,

> - GC STW duration exceeds maximum possible length (node should be stopped
> before STW finished)

Are you sure we should kill the node in case of a long STW? Could we instead
write warnings to the logs and monitoring tools, and wait a little longer
for the node to come back alive if we detect an STW? In this case we could
notify the coordinator or another node that 'current node is in STW, please
wait longer than 3 heartbeat timeouts'.
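
A minimal sketch of how such a pause could be detected from inside the
JVM, assuming a watchdog thread that sleeps for a fixed interval and
measures how much longer than requested the sleep actually took. The
class name StwPauseDetector, the thresholds and the callback are
hypothetical and are not part of Ignite.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;

/** Hypothetical watchdog; an illustration only, not an Ignite API. */
public class StwPauseDetector implements Runnable {
    /** How long the watchdog asks to sleep on every iteration. */
    private final long sleepMillis;

    /** Overshoot at or above this value is reported as a long pause. */
    private final long warnThresholdMillis;

    /** Receives the detected pause length, e.g. to log a warning. */
    private final LongConsumer onLongPause;

    public StwPauseDetector(long sleepMillis, long warnThresholdMillis, LongConsumer onLongPause) {
        this.sleepMillis = sleepMillis;
        this.warnThresholdMillis = warnThresholdMillis;
        this.onLongPause = onLongPause;
    }

    /** Time spent beyond the requested sleep; never negative. */
    public static long overshootMillis(long requestedSleepMillis, long elapsedMillis) {
        return Math.max(0L, elapsedMillis - requestedSleepMillis);
    }

    @Override public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();

            try {
                Thread.sleep(sleepMillis);
            }
            catch (InterruptedException e) {
                // Restore the flag and stop the watchdog.
                Thread.currentThread().interrupt();
                return;
            }

            long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            long pause = overshootMillis(sleepMillis, elapsed);

            // A large overshoot suggests the whole JVM was paused
            // (GC STW is one possible cause, OS scheduling is another).
            if (pause >= warnThresholdMillis)
                onLongPause.accept(pause);
        }
    }
}
```

It could be started as a daemon thread, for example
new Thread(new StwPauseDetector(50, 500, p -> System.err.println("Possible STW: " + p + " ms"))).
Note that a thread inside the paused JVM only sees the pause after it
has ended, so with this approach any notification to the coordinator
would be sent after the STW, not during it.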

Presumably such pauses will not occur often?

Sincerely,
Dmitriy Pavlov

Mon, Nov 20, 2017 at 18:53, Anton Vinogradov <avinogra...@gridgain.com>:

> Igniters,
>
> Internal problems may, and unfortunately do, cause unexpected cluster
> behavior.
> We should define the behavior for the case when any internal problem
> happens.
>
> Well-known internal problems can be split into:
> 1) OOM or any other reason that causes a node crash
>
> 2) Situations requiring graceful node shutdown with a custom notification
> - IgniteOutOfMemoryException
> - Persistence errors
> - ExchangeWorker exits with error
>
> 3) Performance issues that should be covered by metrics
> - GC STW duration
> - Timed out tasks and jobs
> - TX deadlock
> - Hung Tx (waiting for some service)
> - Java Deadlocks
>
> I created a special issue [1] to make sure all these metrics will be
> presented in WebConsole or VisorConsole (which is preferred?)
>
> 4) Situations requiring an external monitoring implementation
> - GC STW duration exceeds maximum possible length (node should be stopped
> before STW finished)
>
> All these problems were reported by different people at different times,
> so we should reanalyze each of them and, possibly, find better ways to
> solve them than those described in the issues.
>
> P.S. IEP-7 [2] already contains 9 issues, feel free to mention something
> else :)
>
> [1] https://issues.apache.org/jira/browse/IGNITE-6961
> [2]
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
>
