Denis, I propose we start with the first three policies (they are already implemented and just await some code cleanup, commit & review). As for the fourth policy (EXEC), I think it is an additional property (some script path) rather than a policy.
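To illustrate the split I have in mind, here is a minimal sketch. It is not the code awaiting review: apart from the policy names NOOP, HALT and RESTART, every name in it (FailureType, FailureHandlingConfig, policyFor, scriptPath) is hypothetical and only shows the policy as an enum with the script path kept as a separate optional property.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Objects;

/** Reactions to a detected failure; the names follow the proposal in this thread. */
enum FailureProcessingPolicy {
    NOOP,    // report the failure and trigger metrics, leave the process untouched
    HALT,    // NOOP actions + process termination
    RESTART  // NOOP actions + process restart
}

/** Hypothetical failure kinds, taken from the examples mentioned in this thread. */
enum FailureType {
    OUT_OF_MEMORY,
    PERSISTENCE_ERROR,
    EXCHANGE_WORKER_ERROR
}

/** Hypothetical holder: a default policy, per-failure overrides and an optional script path. */
final class FailureHandlingConfig {
    private final FailureProcessingPolicy defaultPolicy;
    private final Map<FailureType, FailureProcessingPolicy> overrides =
        new EnumMap<>(FailureType.class);
    private String scriptPath; // null means "no script configured"

    FailureHandlingConfig(FailureProcessingPolicy defaultPolicy) {
        this.defaultPolicy = Objects.requireNonNull(defaultPolicy, "defaultPolicy");
    }

    /** Overrides the policy for one known failure type. */
    FailureHandlingConfig setPolicy(FailureType type, FailureProcessingPolicy policy) {
        overrides.put(Objects.requireNonNull(type, "type"),
            Objects.requireNonNull(policy, "policy"));
        return this;
    }

    /** Sets the script path; it is a property alongside the policy, not a policy itself. */
    FailureHandlingConfig setScriptPath(String scriptPath) {
        this.scriptPath = scriptPath;
        return this;
    }

    /** Returns the override for the given failure type, or the default policy. */
    FailureProcessingPolicy policyFor(FailureType type) {
        return overrides.getOrDefault(type, defaultPolicy);
    }

    String scriptPath() {
        return scriptPath;
    }
}
```

With this shape a user could keep NOOP as the default, set HALT only for persistence errors, and still attach a script to be run regardless of which policy fires. Whether the script runs before or after the policy action is left open here.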
2017-11-23 0:43 GMT+03:00 Denis Magda <dma...@apache.org>:

> Just provide FailureProcessingPolicy with possible reactions:
> - NOOP - exceptions will be reported, metrics will be triggered but an
>   affected Ignite process won’t be touched.
> - HALT (or STOP or KILL) - all the actions of NOOP + Ignite
>   process termination.
> - RESTART - NOOP actions + process restart.
> - EXEC - execute a custom script provided by the user.
>
> If needed, the policy can be set per known failure, such as OOM or
> persistence errors, so that the user can act accordingly based on the
> context.
>
> --
> Denis
>
> > On Nov 21, 2017, at 11:43 PM, Vladimir Ozerov <voze...@gridgain.com>
> > wrote:
> >
> > In the first iteration I would focus only on reporting facilities, to
> > let the administrator spot dangerous situations. And in the second
> > phase, when all reporting and metrics are ready, we can think about
> > some automatic actions.
> >
> > On Wed, Nov 22, 2017 at 10:39 AM, Mikhail Cherkasov
> > <mcherka...@gridgain.com> wrote:
> >
> >> Hi Anton,
> >>
> >> I don't think that we should shut down a node in case of
> >> IgniteOOMException: if one node has no space, then the others probably
> >> don't have it either, so rebalancing will cause IgniteOOM on all other
> >> nodes and will kill the whole cluster. I think for some configurations
> >> the cluster should survive and allow the user to clean the cache
> >> and/or add more nodes.
> >>
> >> Thanks,
> >> Mikhail.
> >>
> >> On Nov 20, 2017, 6:53 PM, "Anton Vinogradov"
> >> <avinogra...@gridgain.com> wrote:
> >>
> >>> Igniters,
> >>>
> >>> Internal problems may and, unfortunately, do cause unexpected cluster
> >>> behavior.
> >>> We should determine the behavior in case any internal problem happens.
> >>>
> >>> Well-known internal problems can be split into:
> >>> 1) OOM or any other reason that causes a node crash
> >>>
> >>> 2) Situations requiring graceful node shutdown with custom notification
> >>> - IgniteOutOfMemoryException
> >>> - Persistence errors
> >>> - ExchangeWorker exits with error
> >>>
> >>> 3) Performance issues that should be covered by metrics
> >>> - GC STW duration
> >>> - Timed out tasks and jobs
> >>> - TX deadlock
> >>> - Hung TX (waits for some service)
> >>> - Java deadlocks
> >>>
> >>> I created a special issue [1] to make sure all these metrics will be
> >>> presented in WebConsole or VisorConsole (which is preferred?)
> >>>
> >>> 4) Situations requiring an external monitoring implementation
> >>> - GC STW duration exceeds the maximum possible length (the node should
> >>>   be stopped before the STW finishes)
> >>>
> >>> All these problems were reported by different people at different
> >>> times, so we should reanalyze each of them and, possibly, find better
> >>> ways to solve them than those described in the issues.
> >>>
> >>> P.S. IEP-7 [2] already contains 9 issues, feel free to mention
> >>> something else :)
> >>>
> >>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
> >>> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection