Denis,
I propose starting with the first three policies (they are already implemented
and just await some code cleanup, commit & review).
As for the fourth policy (EXEC), I think it is an additional property
(some script path) rather than a policy.
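
To illustrate the split between policies and the script property, here is a
minimal sketch of how it could look. All names in it (FailurePolicySketch,
Policy, FailureType, Config, scriptPath) are hypothetical and are not the code
awaiting review; the per-failure override only follows the suggestion quoted
below, and the script path is a placeholder.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

// Hypothetical sketch: none of these types are part of the Ignite API.
final class FailurePolicySketch {
    /** The three reactions proposed for the first iteration (STOP is one of the names suggested in the thread). */
    enum Policy { NOOP, STOP, RESTART }

    /** Failure kinds named in the thread; illustrative, not exhaustive. */
    enum FailureType { OUT_OF_MEMORY, PERSISTENCE_ERROR, EXCHANGE_WORKER_ERROR }

    /** Default policy, optional per-failure overrides, optional script path. */
    static final class Config {
        private final Policy defaultPolicy;
        private final Map<FailureType, Policy> overrides = new EnumMap<>(FailureType.class);
        private String scriptPath; // EXEC modelled as a property, not as a policy

        Config(Policy defaultPolicy) {
            this.defaultPolicy = Objects.requireNonNull(defaultPolicy);
        }

        /** Sets a policy for one failure kind, leaving the default for the rest. */
        Config override(FailureType type, Policy policy) {
            overrides.put(type, Objects.requireNonNull(policy));
            return this;
        }

        /** Sets the script to run in addition to whatever policy applies. */
        Config scriptPath(String path) {
            this.scriptPath = path;
            return this;
        }

        /** Returns the override for this failure kind, or the default policy. */
        Policy policyFor(FailureType type) {
            return overrides.getOrDefault(type, defaultPolicy);
        }

        Optional<String> scriptPath() {
            return Optional.ofNullable(scriptPath);
        }
    }

    public static void main(String[] args) {
        Config cfg = new Config(Policy.NOOP)
            .override(FailureType.PERSISTENCE_ERROR, Policy.STOP)
            .scriptPath("/path/to/on-failure.sh"); // placeholder path

        System.out.println(cfg.policyFor(FailureType.OUT_OF_MEMORY));
        System.out.println(cfg.policyFor(FailureType.PERSISTENCE_ERROR));
        System.out.println(cfg.scriptPath().orElse("<no script>"));
    }
}
```

With NOOP as the default, an out-of-memory failure is only reported, while a
persistence error stops the node. The script stays independent of the policy
that was chosen.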

2017-11-23 0:43 GMT+03:00 Denis Magda <dma...@apache.org>:

> Just provide FailureProcessingPolicy with possible reactions:
> - NOOP - exceptions will be reported, metrics will be triggered but an
> affected Ignite process won’t be touched.
> - HALT (or STOP or KILL) - all the actions of NOOP + Ignite
> process termination.
> - RESTART - NOOP actions + process restart.
> - EXEC - execute a custom script provided by the user.
>
> If needed, the policy can be set per known failure, such as OOM or persistence
> errors, so that the user can act accordingly based on the context.
>
> —
> Denis
>
> > On Nov 21, 2017, at 11:43 PM, Vladimir Ozerov <voze...@gridgain.com>
> wrote:
> >
> > In the first iteration I would focus only on reporting facilities, to let
> > administrators spot dangerous situations. And in the second phase, when all
> > reporting and metrics are ready, we can think about some automatic actions.
> >
> > On Wed, Nov 22, 2017 at 10:39 AM, Mikhail Cherkasov <
> mcherka...@gridgain.com
> >> wrote:
> >
> >> Hi Anton,
> >>
> >> I don't think that we should shut down a node in case of IgniteOOMException.
> >> If one node has no space, then the others probably don't have it either, so
> >> rebalancing will cause IgniteOOM on all other nodes and will kill the whole
> >> cluster. I think for some configurations the cluster should survive and allow
> >> the user to clean the cache and/or add more nodes.
> >>
> >> Thanks,
> >> Mikhail.
> >>
> >> On Nov 20, 2017, at 6:53 PM, "Anton Vinogradov" <
> >> avinogra...@gridgain.com> wrote:
> >>
> >>> Igniters,
> >>>
> >>> Internal problems may, and unfortunately do, cause unexpected cluster
> >>> behavior.
> >>> We should define the behavior for when any such internal problem occurs.
> >>>
> >>> Well-known internal problems can be split into:
> >>> 1) OOM or any other cause of a node crash
> >>>
> >>> 2) Situations requiring graceful node shutdown with custom notification
> >>> - IgniteOutOfMemoryException
> >>> - Persistence errors
> >>> - ExchangeWorker exits with error
> >>>
> >>> 3) Performance issues that should be covered by metrics
> >>> - GC STW duration
> >>> - Timed out tasks and jobs
> >>> - TX deadlock
> >>> - Hung Tx (waiting for some service)
> >>> - Java Deadlocks
> >>>
> >>> I created a special issue [1] to make sure all these metrics are
> >>> presented in WebConsole or VisorConsole (which is preferred?)
> >>>
> >>> 4) Situations requiring an external monitoring implementation
> >>> - GC STW duration exceeds the maximum possible length (node should be stopped
> >>> before the STW pause finishes)
> >>>
> >>> All these problems were reported by different people at different times,
> >>> so we should reanalyze each of them and, possibly, find better ways to
> >>> solve them than those described in the issues.
> >>>
> >>> P.S. IEP-7 [2] already contains 9 issues; feel free to mention something
> >>> else :)
> >>>
> >>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
> >>> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
> >>>
> >>
>
>
