In the first iteration I would focus only on reporting facilities, to let the administrator spot dangerous situations. In the second phase, once all reporting and metrics are ready, we can think about automatic actions.
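To illustrate what "reporting only" could mean for two of the items in the list quoted below (Java deadlocks and GC time), here is a minimal sketch built on the standard java.lang.management beans. The class name ProblemReport and its methods are hypothetical and are not Ignite API. Accumulated collection time is used only as a rough proxy for GC STW duration. The sketch only reports and takes no action on the node.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

/** Hypothetical report-only helper (an assumption for illustration, not part of Ignite). */
final class ProblemReport {
    private ProblemReport() {
        // Static helpers only.
    }

    /** Number of threads currently deadlocked on monitors or ownable synchronizers. */
    static int deadlockedThreadCount() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null when no deadlock is found
        return ids == null ? 0 : ids.length;
    }

    /** Approximate accumulated GC time in milliseconds, summed over all collectors. */
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** One-line summary that a console or log line could display. */
    static String summary() {
        return "deadlockedThreads=" + deadlockedThreadCount()
            + ", totalGcTimeMs=" + totalGcTimeMillis();
    }
}
```

Calling ProblemReport.summary() periodically and publishing the result would cover the reporting side; any automatic reaction would belong to the second phase.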

On Wed, Nov 22, 2017 at 10:39 AM, Mikhail Cherkasov <mcherka...@gridgain.com> wrote:

> Hi Anton,
>
> I don't think that we should shut down a node in case of IgniteOOMException.
> If one node has no space, then the others probably don't have it either, so
> rebalancing will cause IgniteOOM on all other nodes and will kill the whole
> cluster. I think that for some configurations the cluster should survive and
> allow the user to clean the cache and/or add more nodes.
>
> Thanks,
> Mikhail.
>
> On Nov 20, 2017 at 6:53 PM, "Anton Vinogradov" <avinogra...@gridgain.com> wrote:
>
> > Igniters,
> >
> > Internal problems may, and unfortunately do, cause unexpected cluster
> > behavior.
> > We should define the behavior for each kind of internal problem.
> >
> > Well-known internal problems can be split into:
> > 1) OOM or any other reason that causes a node crash
> >
> > 2) Situations that require graceful node shutdown with a custom notification
> > - IgniteOutOfMemoryException
> > - Persistence errors
> > - ExchangeWorker exits with error
> >
> > 3) Performance issues that should be covered by metrics
> > - GC STW duration
> > - Timed out tasks and jobs
> > - TX deadlock
> > - Hung TX (waits for some service)
> > - Java deadlocks
> >
> > I created a special issue [1] to make sure all these metrics will be
> > presented in WebConsole or VisorConsole (which is preferred?)
> >
> > 4) Situations that require an external monitoring implementation
> > - GC STW duration exceeds the maximum acceptable length (the node should
> > be stopped before the STW pause finishes)
> >
> > All these problems were reported by different people at different times,
> > so we should reanalyze each of them and possibly find better ways to
> > solve them than those described in the issues.
> >
> > P.S. IEP-7 [2] already contains 9 issues, feel free to mention something
> > else :)
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-6961
> > [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection