Igor Seliverstov commented on IGNITE-7648:

[~ascherbakov], the code looks OK, but I'd change
{code}
long delay = failureDetectionTimeoutEnabled() ?
    failureDetectionTimeout() / reconCnt :
    connTimeout0 - (U.currentTimeMillis() - start);
{code}
to something like:
{code}
long delay = failureDetectionTimeoutEnabled() ?
    timeoutHelper.remainingTime(U.currentTimeMillis()) / (reconCnt - attempt) :
    connTimeout0 - (U.currentTimeMillis() - start);
{code}
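
To illustrate the difference between the two formulas, here is a minimal, 
standalone sketch. It is not the actual SPI code: the class and method names 
are hypothetical, the remaining time is passed in as a plain number instead of 
being read from timeoutHelper, and attempt is assumed to be a zero-based 
counter that is always less than reconCnt (otherwise the divisor would be 
zero).

```java
public class ReconnectDelaySketch {
    /** Current formula: every attempt gets the same fixed share of the timeout. */
    static long fixedShare(long failureDetectionTimeout, int reconCnt) {
        return failureDetectionTimeout / reconCnt;
    }

    /**
     * Suggested formula: split the time that is still left among the attempts
     * that are still to come. Assumes 0 <= attempt < reconCnt.
     */
    static long remainingShare(long remainingTime, int reconCnt, int attempt) {
        if (attempt < 0 || attempt >= reconCnt)
            throw new IllegalArgumentException("attempt must be in [0, reconCnt)");

        return remainingTime / (reconCnt - attempt);
    }

    public static void main(String[] args) {
        long timeout = 10_000; // Hypothetical failure detection timeout, ms.
        int reconCnt = 4;      // Hypothetical reconnect count.

        // Hypothetical time actually consumed by each attempt, ms.
        long[] spent = {1_000, 4_000, 2_000, 0};

        long remaining = timeout;

        for (int attempt = 0; attempt < reconCnt; attempt++) {
            System.out.println("attempt=" + attempt +
                " fixed=" + fixedShare(timeout, reconCnt) +
                " adaptive=" + remainingShare(remaining, reconCnt, attempt));

            remaining -= spent[attempt];
        }
    }
}
```

With these numbers, the fixed share stays at 2500 ms for every attempt, while 
the remaining-time share grows to 3000 ms when an earlier attempt finishes 
early, so the overall budget is neither wasted nor exceeded.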

Also, I'm not sure it is a good idea to enable force kill by default.

Let's consider the following example:

A node has successfully joined the topology but, due to some local issue, 
cannot open a direct connection to any other node via the Communication SPI.

With your approach, that node would kill every node it tries to send a 
message to.

Even in its current shape, IGNITE_ENABLE_FORCIBLE_NODE_KILL doesn't look like 
a production feature and, in my opinion, cannot be enabled by default.


> Revert IGNITE_ENABLE_FORCIBLE_NODE_KILL system property.
> --------------------------------------------------------
>                 Key: IGNITE-7648
>                 URL: https://issues.apache.org/jira/browse/IGNITE-7648
>             Project: Ignite
>          Issue Type: Improvement
>    Affects Versions: 2.3
>            Reporter: Alexei Scherbakov
>            Assignee: Alexei Scherbakov
>            Priority: Major
>             Fix For: 2.5
> The IGNITE_ENABLE_FORCIBLE_NODE_KILL system property was introduced in 
> IGNITE-5718 as a way to prevent unnecessary node drops in case of short 
> network problems.
> I believe fixing it this way was the wrong decision.
> We faced issues in production due to the lack of automatic kicking of 
> ill-behaving nodes (for example, nodes hanging due to long GC pauses) until 
> we realised the necessity of changing the default behavior via the property.
> The right solution is to kick nodes only if a failure threshold is reached. 
> Such behavior should always be enabled.
