I think the failure processing policy should be configured via 
IgniteConfiguration in a way similar to the segmentation policies.
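
As a rough illustration only, here is a minimal Java sketch of what that
could look like. Today the segmentation policy is set on IgniteConfiguration
through setSegmentationPolicy(...); the sketch mirrors that setter style.
Everything in it is an assumption made for discussion: FailureConfigSketch
is a stand-in for IgniteConfiguration, and the enum names, the FailureKind
values and the per-failure overload are hypothetical, not an existing API.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical reactions, named after the policies proposed in this thread. */
enum FailureProcessingPolicy { NOOP, STOP, RESTART }

/** Hypothetical failure kinds, taken from the examples mentioned in this thread. */
enum FailureKind { OUT_OF_MEMORY, PERSISTENCE_ERROR, EXCHANGE_WORKER_ERROR }

/** Stand-in for IgniteConfiguration (assumption: NOT the real class or its API). */
class FailureConfigSketch {
    /** Reaction used when no per-failure override is set. */
    private FailureProcessingPolicy dfltPlc = FailureProcessingPolicy.NOOP;

    /** Optional overrides for specific known failures. */
    private final Map<FailureKind, FailureProcessingPolicy> overrides =
        new EnumMap<>(FailureKind.class);

    /** Sets the default reaction, in the same style as a segmentation policy setter. */
    FailureConfigSketch setFailureProcessingPolicy(FailureProcessingPolicy plc) {
        dfltPlc = Objects.requireNonNull(plc, "plc");
        return this;
    }

    /** Sets the reaction for one known failure kind. */
    FailureConfigSketch setFailureProcessingPolicy(FailureKind kind, FailureProcessingPolicy plc) {
        overrides.put(Objects.requireNonNull(kind, "kind"), Objects.requireNonNull(plc, "plc"));
        return this;
    }

    /** Returns the override for the given kind, or the default reaction. */
    FailureProcessingPolicy policyFor(FailureKind kind) {
        return overrides.getOrDefault(kind, dfltPlc);
    }
}
```

With something like this, a user could set STOP as the default and still
keep NOOP for OUT_OF_MEMORY, which would cover the concern raised earlier
about OOM-triggered shutdowns cascading through the cluster.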

—
Denis

> On Nov 27, 2017, at 11:28 PM, Vladimir Ozerov <voze...@gridgain.com> wrote:
> 
> Dmitry,
> 
> How will these policies be configured? Do you have any API in mind?
> 
> On Thu, Nov 23, 2017 at 6:26 PM, Denis Magda <dma...@apache.org> wrote:
> 
>> No objections here. Additional policies like EXEC might be added later
>> depending on user needs.
>> 
>> —
>> Denis
>> 
>>> On Nov 23, 2017, at 2:26 AM, Дмитрий Сорокин <sbt.sorokin....@gmail.com> wrote:
>>> 
>>> Denis,
>>> I propose starting with the first three policies (they are already implemented and
>>> just await some code cleanup, commit and review).
>>> As for the fourth policy (EXEC), I think it is more of an additional property
>>> (some script path) than a policy.
>>> 
>>> 2017-11-23 0:43 GMT+03:00 Denis Magda <dma...@apache.org>:
>>> 
>>>> Just provide FailureProcessingPolicy with possible reactions:
>>>> - NOOP - exceptions will be reported, metrics will be triggered but an
>>>> affected Ignite process won’t be touched.
>>>> - HALT (or STOP or KILL) - all the actions of NOOP + Ignite
>>>> process termination.
>>>> - RESTART - NOOP actions + process restart.
>>>> - EXEC - execute a custom script provided by the user.
>>>> 
>>>> If needed, the policy can be set per known failure, such as OOM or Persistence
>>>> errors, so that the user can act accordingly based on the context.
>>>> 
>>>> —
>>>> Denis
>>>> 
>>>>> On Nov 21, 2017, at 11:43 PM, Vladimir Ozerov <voze...@gridgain.com> wrote:
>>>>> 
>>>>> In the first iteration I would focus only on reporting facilities, to let
>>>>> administrators spot dangerous situations. In the second phase, when all
>>>>> reporting and metrics are ready, we can think about some automatic actions.
>>>>> 
>>>>> On Wed, Nov 22, 2017 at 10:39 AM, Mikhail Cherkasov <mcherka...@gridgain.com> wrote:
>>>>> 
>>>>>> Hi Anton,
>>>>>> 
>>>>>> I don't think that we should shut down a node in case of IgniteOOMException.
>>>>>> If one node has no space, then the others probably don't have it either, so
>>>>>> rebalancing will cause IgniteOOM on all other nodes and will kill the whole
>>>>>> cluster. I think for some configurations the cluster should survive and allow
>>>>>> the user to clean the cache and/or add more nodes.
>>>>>> 
>>>>>> Thanks,
>>>>>> Mikhail.
>>>>>> 
>>>>>> On Nov 20, 2017, at 6:53 PM, "Anton Vinogradov" <avinogra...@gridgain.com> wrote:
>>>>>> 
>>>>>>> Igniters,
>>>>>>> 
>>>>>>> Internal problems may, and unfortunately do, cause unexpected cluster
>>>>>>> behavior.
>>>>>>> We should define the behavior in case any internal problem happens.
>>>>>>> 
>>>>>>> Well-known internal problems can be split into:
>>>>>>> 1) OOM or any other reason that causes a node crash
>>>>>>> 
>>>>>>> 2) Situations that require a graceful node shutdown with a custom notification
>>>>>>> - IgniteOutOfMemoryException
>>>>>>> - Persistence errors
>>>>>>> - ExchangeWorker exits with error
>>>>>>> 
>>>>>>> 3) Performance issues that should be covered by metrics
>>>>>>> - GC STW duration
>>>>>>> - Timed out tasks and jobs
>>>>>>> - TX deadlock
>>>>>>> - Hung Tx (waiting for some service)
>>>>>>> - Java Deadlocks
>>>>>>> 
>>>>>>> I created a special issue [1] to make sure all these metrics will be
>>>>>>> presented in WebConsole or VisorConsole (which is preferred?)
>>>>>>> 
>>>>>>> 4) Situations that require an external monitoring implementation
>>>>>>> - GC STW duration exceeds the maximum allowed length (the node should be
>>>>>>> stopped before the STW finishes)
>>>>>>> 
>>>>>>> All these problems were reported by different people at different times,
>>>>>>> so we should reanalyze each of them and, possibly, find better ways to
>>>>>>> solve them than those described in the issues.
>>>>>>> 
>>>>>>> P.S. IEP-7 [2] already contains 9 issues, feel free to mention something
>>>>>>> else :)
>>>>>>> 
>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
>>>>>>> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
>>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>> 
