Hi Anton,

> - GC STW duration exceeds maximum possible length (node should be stopped
> before STW finishes)

Are you sure we should kill the node in case of a long STW pause? Could we instead write warnings to the logs and monitoring tools, and wait a little longer for the node to become alive again if we detect STW?

In this case we could notify the coordinator or another node that 'the current node is in STW, please wait longer than 3 heartbeat timeouts'.

Presumably such pauses will not occur often?

Sincerely,
Dmitriy Pavlov

Mon, 20 Nov 2017 at 18:53, Anton Vinogradov <avinogra...@gridgain.com>:

> Igniters,
>
> Internal problems may, and unfortunately do, cause unexpected cluster
> behavior.
> We should define the behavior for the case when any internal problem happens.
>
> Well-known internal problems can be split into:
> 1) OOM or any other reason that causes a node crash
>
> 2) Situations that require graceful node shutdown with custom notification
> - IgniteOutOfMemoryException
> - Persistence errors
> - ExchangeWorker exits with error
>
> 3) Performance issues that should be covered by metrics
> - GC STW duration
> - Timed out tasks and jobs
> - TX deadlock
> - Hung Tx (waits for some service)
> - Java deadlocks
>
> I created a special issue [1] to make sure all these metrics will be
> presented in WebConsole or VisorConsole (which is preferred?)
>
> 4) Situations that require an external monitoring implementation
> - GC STW duration exceeds maximum possible length (node should be stopped
> before STW finishes)
>
> All these problems were reported by different people some time ago,
> so we should reanalyze each of them and, possibly, find better ways to
> solve them than those described in the issues.
>
> P.S. IEP-7 [2] already contains 9 issues, feel free to mention something
> else :)
>
> [1] https://issues.apache.org/jira/browse/IGNITE-6961
> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
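
To make the "warn instead of kill" idea above a bit more concrete, here is a minimal sketch of one common way to notice long pauses from inside the JVM: a daemon thread sleeps for a short interval and treats any large overshoot as a probable STW pause. This is only an illustration, not existing Ignite code: the class name `StwPauseDetector`, the 50 ms sleep interval and the 500 ms warning threshold are all assumptions, and the overshoot can also come from OS scheduling delays, not only from GC.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch, not an Ignite API: warns about probable long JVM pauses. */
public class StwPauseDetector {
    /** How often the watchdog thread wakes up, in milliseconds (assumed value). */
    public static final long SLEEP_MS = 50;

    /** Pauses at least this long are reported as warnings (assumed value). */
    public static final long WARN_THRESHOLD_MS = 500;

    /** Time spent beyond the requested sleep; attributed to a pause. Never negative. */
    public static long pauseMs(long requestedSleepMs, long elapsedMs) {
        return Math.max(0, elapsedMs - requestedSleepMs);
    }

    /** True if the pause is long enough to be worth a warning. */
    public static boolean shouldWarn(long pauseMs) {
        return pauseMs >= WARN_THRESHOLD_MS;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread watchdog = new Thread(() -> {
            long last = System.nanoTime();

            while (!Thread.currentThread().isInterrupted()) {
                try {
                    Thread.sleep(SLEEP_MS);
                }
                catch (InterruptedException e) {
                    return; // Asked to stop.
                }

                long now = System.nanoTime();
                long elapsedMs = TimeUnit.NANOSECONDS.toMillis(now - last);
                last = now;

                long pause = pauseMs(SLEEP_MS, elapsedMs);

                // A real node could also raise a metric or notify the coordinator here.
                if (shouldWarn(pause))
                    System.err.println("Possible long JVM pause: ~" + pause + " ms");
            }
        }, "stw-pause-detector");

        watchdog.setDaemon(true);
        watchdog.start();

        Thread.sleep(1000); // Let the watchdog run briefly for the demo.
        watchdog.interrupt();
    }
}
```

Note that a thread inside the paused JVM can only report the pause after it ends, so this covers the warning and metrics side; stopping a node before the STW finishes would still need an external monitor, as item 4 in the quoted message says.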