Dmitry, how will these policies be configured? Do you have any API in mind?

On Thu, Nov 23, 2017 at 6:26 PM, Denis Magda <dma...@apache.org> wrote:

> No objections here. Additional policies like EXEC might be added later
> depending on user needs.
>
> --
> Denis
>
> > On Nov 23, 2017, at 2:26 AM, Дмитрий Сорокин <sbt.sorokin....@gmail.com> wrote:
> >
> > Denis,
> > I propose to start with the first three policies (they are already
> > implemented and just await some code cleanup, commit & review).
> > As for the fourth policy (EXEC), I think it is rather an additional
> > property (some script path) than a policy.
> >
> > 2017-11-23 0:43 GMT+03:00 Denis Magda <dma...@apache.org>:
> >
> >> Just provide FailureProcessingPolicy with possible reactions:
> >> - NOOP - exceptions will be reported, metrics will be triggered but an
> >> affected Ignite process won't be touched.
> >> - HALT (or STOP or KILL) - all the actions of NOOP + Ignite
> >> process termination.
> >> - RESTART - NOOP actions + process restart.
> >> - EXEC - execute a custom script provided by the user.
> >>
> >> If needed, the policy can be set per known failure, such as OOM or
> >> persistence errors, so that the user can act accordingly based on the
> >> context.
> >>
> >> --
> >> Denis
> >>
> >>> On Nov 21, 2017, at 11:43 PM, Vladimir Ozerov <voze...@gridgain.com> wrote:
> >>>
> >>> In the first iteration I would focus only on reporting facilities, to
> >>> let the administrator spot dangerous situations. And in the second
> >>> phase, when all reporting and metrics are ready, we can think about
> >>> some automatic actions.
> >>>
> >>> On Wed, Nov 22, 2017 at 10:39 AM, Mikhail Cherkasov <mcherka...@gridgain.com> wrote:
> >>>
> >>>> Hi Anton,
> >>>>
> >>>> I don't think that we should shut down a node in case of
> >>>> IgniteOOMException. If one node has no space, then the others
> >>>> probably don't have it either, so rebalancing will cause IgniteOOM on
> >>>> all other nodes and will kill the whole cluster. I think for some
> >>>> configurations the cluster should survive and allow the user to clean
> >>>> the cache and/or add more nodes.
> >>>>
> >>>> Thanks,
> >>>> Mikhail.
> >>>>
> >>>> On Nov 20, 2017, 6:53 PM, "Anton Vinogradov" <avinogra...@gridgain.com> wrote:
> >>>>
> >>>>> Igniters,
> >>>>>
> >>>>> Internal problems may, and unfortunately do, cause unexpected
> >>>>> cluster behavior.
> >>>>> We should define the behavior in case any internal problem happens.
> >>>>>
> >>>>> Well-known internal problems can be split into:
> >>>>> 1) OOM or any other reason that causes a node crash
> >>>>>
> >>>>> 2) Situations requiring graceful node shutdown with custom
> >>>>> notification
> >>>>> - IgniteOutOfMemoryException
> >>>>> - Persistence errors
> >>>>> - ExchangeWorker exits with error
> >>>>>
> >>>>> 3) Performance issues that should be covered by metrics
> >>>>> - GC STW duration
> >>>>> - Timed out tasks and jobs
> >>>>> - TX deadlock
> >>>>> - Hung Tx (waits for some service)
> >>>>> - Java deadlocks
> >>>>>
> >>>>> I created a special issue [1] to make sure all these metrics will be
> >>>>> presented in WebConsole or VisorConsole (which is preferred?)
> >>>>>
> >>>>> 4) Situations requiring an external monitoring implementation
> >>>>> - GC STW duration exceeds the maximum possible length (the node
> >>>>> should be stopped before the STW pause finishes)
> >>>>>
> >>>>> All these problems were reported by different people at different
> >>>>> times, so we should reanalyze each of them and, possibly, find
> >>>>> better ways to solve them than those described in the issues.
> >>>>>
> >>>>> P.S. IEP-7 [2] already contains 9 issues, feel free to mention
> >>>>> something else :)
> >>>>>
> >>>>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
> >>>>> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
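
To make the configuration question concrete, here is a minimal sketch of one possible shape for a per-failure policy setting: a default reaction plus optional overrides per known failure. Only the name FailureProcessingPolicy and the reactions come from the thread; the stop reaction is called HALT here (STOP or KILL were suggested as alternatives), and EXEC is left out as proposed. The FailureType enum, the FailurePolicySketch class and its methods are all hypothetical and are not part of any Ignite API.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch only: none of these types exist in Apache Ignite.
final class FailurePolicySketch {
    /** Reactions listed in the thread (EXEC omitted, as proposed there). */
    enum FailureProcessingPolicy { NOOP, HALT, RESTART }

    /** Known failure kinds mentioned in the thread; the constant names are made up. */
    enum FailureType { OUT_OF_MEMORY, PERSISTENCE_ERROR, EXCHANGE_WORKER_ERROR }

    /** Reaction used when no override is registered for a failure type. */
    private final FailureProcessingPolicy defaultPolicy;

    /** Per-failure overrides. */
    private final Map<FailureType, FailureProcessingPolicy> overrides =
        new EnumMap<>(FailureType.class);

    FailurePolicySketch(FailureProcessingPolicy defaultPolicy) {
        this.defaultPolicy = Objects.requireNonNull(defaultPolicy, "defaultPolicy");
    }

    /** Registers an override for one failure type; returns this for chaining. */
    FailurePolicySketch setPolicy(FailureType type, FailureProcessingPolicy policy) {
        overrides.put(
            Objects.requireNonNull(type, "type"),
            Objects.requireNonNull(policy, "policy"));
        return this;
    }

    /** Resolves the reaction for a failure: the override if present, else the default. */
    FailureProcessingPolicy policyFor(FailureType type) {
        return overrides.getOrDefault(type, defaultPolicy);
    }

    public static void main(String[] args) {
        // Halt by default, but keep the node alive on OOM so the user can
        // clean caches or add nodes, as suggested earlier in the thread.
        FailurePolicySketch cfg = new FailurePolicySketch(FailureProcessingPolicy.HALT)
            .setPolicy(FailureType.OUT_OF_MEMORY, FailureProcessingPolicy.NOOP);

        for (FailureType type : FailureType.values())
            System.out.println(type + " -> " + cfg.policyFor(type));
    }
}
```

A default plus overrides keeps the common case to a single setting while still allowing OOM to be treated differently from persistence errors. Whether this would live on the node configuration or elsewhere is an open question.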