[
https://issues.apache.org/jira/browse/IGNITE-7648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369323#comment-16369323
]
Igor Seliverstov commented on IGNITE-7648:
------------------------------------------
[~ascherbakov], the code looks OK, but I'd change
{code:java}
long delay = failureDetectionTimeoutEnabled() ?
    failureDetectionTimeout() / reconCnt :
    connTimeout0 - (U.currentTimeMillis() - start);
{code}
to something like:
{code:java}
long delay = failureDetectionTimeoutEnabled() ?
    timeoutHelper.remainingTime(U.currentTimeMillis()) / (reconCnt - attempt) :
    connTimeout0 - (U.currentTimeMillis() - start);
{code}
in {{org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi#createTcpClient}}.
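To illustrate the difference between the two formulas, here is a minimal standalone sketch. It is not the Ignite code: the class and method names are hypothetical, the elapsed time is passed in explicitly instead of being read from a timeout helper, and {{attempt}} is assumed to be 0-based and smaller than {{reconCnt}}.

```java
// Hypothetical standalone sketch; not the actual TcpCommunicationSpi logic.
final class RetryDelaySketch {
    private RetryDelaySketch() {
        // Static helpers only.
    }

    /** Fixed split: every attempt gets the same share of the total timeout. */
    static long fixedShare(long totalTimeout, int reconCnt) {
        return totalTimeout / reconCnt;
    }

    /**
     * Remaining split: the time still left is divided among the attempts
     * that have not been made yet.
     * Assumption: attempt is 0-based and strictly less than reconCnt.
     */
    static long remainingShare(long totalTimeout, long elapsed, int reconCnt, int attempt) {
        if (attempt < 0 || attempt >= reconCnt)
            throw new IllegalArgumentException("attempt must be in [0, reconCnt)");

        // Never return a negative delay once the budget is exhausted.
        long remaining = Math.max(0L, totalTimeout - elapsed);

        return remaining / (reconCnt - attempt);
    }
}
```

For example, with a 10000 ms budget and 5 attempts, if the first attempt fails after 500 ms, the fixed split still gives the second attempt 2000 ms, while the remaining split gives it 9500 / 4 = 2375 ms, so time saved by a fast failure is not lost.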
Also, I'm not sure it is a good idea to enable force kill by default.
Let's consider the following example:
We successfully joined the topology but, due to some local issue, cannot open a
direct connection to any node via Communication SPI.
With your approach, we would then kill every node we try to send a message to.
Even in its current shape, IGNITE_ENABLE_FORCIBLE_NODE_KILL doesn't look like a
production feature and, in my opinion, cannot be enabled by default.
> Revert IGNITE_ENABLE_FORCIBLE_NODE_KILL system property.
> --------------------------------------------------------
>
> Key: IGNITE-7648
> URL: https://issues.apache.org/jira/browse/IGNITE-7648
> Project: Ignite
> Issue Type: Improvement
> Affects Versions: 2.3
> Reporter: Alexei Scherbakov
> Assignee: Alexei Scherbakov
> Priority: Major
> Fix For: 2.5
>
>
> The IGNITE_ENABLE_FORCIBLE_NODE_KILL system property was introduced in
> IGNITE-5718 as a way to prevent unnecessary node drops in case of short
> network problems.
> I suppose it's the wrong decision to fix it in such a way.
> We faced some issues in our production due to the lack of automatic kicking
> of ill-behaving nodes (for example, nodes hanging due to long GC pauses) until we
> realised the necessity of changing the default behavior via the property.
> The right solution is to kick nodes only if the failure threshold is reached. Such
> behavior should always be enabled.
>