I think the failure processing policy should be configured via IgniteConfiguration in a way similar to the segmentation policies.
—
Denis

> On Nov 27, 2017, at 11:28 PM, Vladimir Ozerov <voze...@gridgain.com> wrote:
>
> Dmitry,
>
> How will these policies be configured? Do you have any API in mind?
>
> On Thu, Nov 23, 2017 at 6:26 PM, Denis Magda <dma...@apache.org> wrote:
>
>> No objections here. Additional policies like EXEC might be added later
>> depending on user needs.
>>
>> —
>> Denis
>>
>>> On Nov 23, 2017, at 2:26 AM, Дмитрий Сорокин <sbt.sorokin....@gmail.com> wrote:
>>>
>>> Denis,
>>> I propose starting with the first three policies (they are already
>>> implemented and just await some code cleanup, commit & review).
>>> As for the fourth policy (EXEC), I think it is more an additional
>>> property (some script path) than a policy.
>>>
>>> 2017-11-23 0:43 GMT+03:00 Denis Magda <dma...@apache.org>:
>>>
>>>> Just provide FailureProcessingPolicy with possible reactions:
>>>> - NOOP - exceptions will be reported, metrics will be triggered, but the
>>>> affected Ignite process won't be touched.
>>>> - HALT (or STOP or KILL) - all the actions of NOOP + Ignite
>>>> process termination.
>>>> - RESTART - NOOP actions + process restart.
>>>> - EXEC - execute a custom script provided by the user.
>>>>
>>>> If needed, the policy can be set per known failure, such as OOM or
>>>> persistence errors, so that the user can act according to the context.
>>>>
>>>> —
>>>> Denis
>>>>
>>>>> On Nov 21, 2017, at 11:43 PM, Vladimir Ozerov <voze...@gridgain.com> wrote:
>>>>>
>>>>> In the first iteration I would focus only on reporting facilities, to let
>>>>> the administrator spot dangerous situations. In the second phase, when all
>>>>> reporting and metrics are ready, we can think about automatic actions.
>>>>>
>>>>> On Wed, Nov 22, 2017 at 10:39 AM, Mikhail Cherkasov <mcherka...@gridgain.com> wrote:
>>>>>
>>>>>> Hi Anton,
>>>>>>
>>>>>> I don't think we should shut down a node in case of IgniteOOMException.
>>>>>> If one node has no space, the others probably don't have any either, so
>>>>>> rebalancing will cause IgniteOOM on all other nodes and kill the whole
>>>>>> cluster. I think that for some configurations the cluster should survive
>>>>>> and allow the user to clean the cache and/or add more nodes.
>>>>>>
>>>>>> Thanks,
>>>>>> Mikhail.
>>>>>>
>>>>>> On Nov 20, 2017, at 6:53 PM, "Anton Vinogradov" <avinogra...@gridgain.com> wrote:
>>>>>>
>>>>>>> Igniters,
>>>>>>>
>>>>>>> Internal problems may, and unfortunately do, cause unexpected cluster
>>>>>>> behavior.
>>>>>>> We should define the behavior for each kind of internal problem.
>>>>>>>
>>>>>>> Well-known internal problems can be split into:
>>>>>>> 1) OOM or any other reason that causes a node crash
>>>>>>>
>>>>>>> 2) Situations that require graceful node shutdown with custom notification
>>>>>>> - IgniteOutOfMemoryException
>>>>>>> - Persistence errors
>>>>>>> - ExchangeWorker exits with error
>>>>>>>
>>>>>>> 3) Performance issues that should be covered by metrics
>>>>>>> - GC STW duration
>>>>>>> - Timed out tasks and jobs
>>>>>>> - TX deadlock
>>>>>>> - Hung Tx (waits for some service)
>>>>>>> - Java deadlocks
>>>>>>>
>>>>>>> I created a special issue [1] to make sure all these metrics will be
>>>>>>> presented in WebConsole or VisorConsole (which is preferred?)
>>>>>>>
>>>>>>> 4) Situations that require an external monitoring implementation
>>>>>>> - GC STW duration exceeds the maximum allowed length (the node should be
>>>>>>> stopped before STW finishes)
>>>>>>>
>>>>>>> All these problems were reported by different people some time ago,
>>>>>>> so we should reanalyze each of them and possibly find better ways to
>>>>>>> solve them than those described in the issues.
>>>>>>>
>>>>>>> P.S. IEP-7 [2] already contains 9 issues, feel free to mention something
>>>>>>> else :)
>>>>>>>
>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
>>>>>>> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
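
For illustration only, the configuration idea from the top of this thread (a policy set on the node configuration, optionally overridden per known failure) might be sketched as below. Everything here is an assumption: `FailureConfig`, `FailureType`, and the setter names are hypothetical stand-ins and not existing Ignite classes or methods. Only the policy names NOOP, HALT, and RESTART are taken from the proposal in the thread, and EXEC is left out because it was suggested that it is closer to an extra property than to a policy.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Objects;

public class FailurePolicySketch {
    /** Reactions named in the thread; the enum itself is a hypothetical sketch. */
    enum FailureProcessingPolicy { NOOP, HALT, RESTART }

    /** Hypothetical failure kinds, taken from the examples mentioned in the thread. */
    enum FailureType { OUT_OF_MEMORY, PERSISTENCE_ERROR, EXCHANGE_WORKER_EXIT }

    /** Stand-in for the relevant part of a node configuration; not a real Ignite class. */
    static class FailureConfig {
        /** Policy used when no per-failure override is set. */
        private FailureProcessingPolicy dflt = FailureProcessingPolicy.NOOP;

        /** Optional per-failure overrides. */
        private final Map<FailureType, FailureProcessingPolicy> perType =
            new EnumMap<>(FailureType.class);

        /** Sets the default policy. */
        FailureConfig setFailureProcessingPolicy(FailureProcessingPolicy plc) {
            dflt = Objects.requireNonNull(plc, "plc");
            return this;
        }

        /** Sets a policy for one specific failure type. */
        FailureConfig setFailureProcessingPolicy(FailureType type, FailureProcessingPolicy plc) {
            perType.put(Objects.requireNonNull(type, "type"), Objects.requireNonNull(plc, "plc"));
            return this;
        }

        /** Resolves the policy for a failure: the override if present, else the default. */
        FailureProcessingPolicy policyFor(FailureType type) {
            return perType.getOrDefault(type, dflt);
        }
    }

    public static void main(String[] args) {
        FailureConfig cfg = new FailureConfig()
            .setFailureProcessingPolicy(FailureProcessingPolicy.HALT)
            // Keep the node alive on OOM so the user can clean caches or add nodes.
            .setFailureProcessingPolicy(FailureType.OUT_OF_MEMORY, FailureProcessingPolicy.NOOP);

        System.out.println(cfg.policyFor(FailureType.OUT_OF_MEMORY));
        System.out.println(cfg.policyFor(FailureType.PERSISTENCE_ERROR));
    }
}
```

The per-failure override is one possible way to address the concern that stopping a node on IgniteOOMException could cascade through rebalancing, while other failures still stop the node.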