Vladsz83 opened a new pull request #8484: URL: https://github.com/apache/ignite/pull/8484
Scenario: Two nodes fail at the same time. Their relative positions in the ring are N-1 and N+2. Node N detects the failure of node N+1 and tries to connect to node N+2. Node N+2 checks the backward connection to node N+1.

Problem: Node N can fail too.

Cause: The timeout on node N to recover the connection to node N+2 appears to be shorter than the timeout on node N+2 to check the connection to node N+1.

Fix: Introduced a base timeout value for checking/recovering a connection. It is derived from the current configuration and is not a constant. The timeouts mentioned above are now relative to it. The backward connection check timeout is now generally shorter than the connection recovery timeout.

Additions:
- Added some log messages for diagnostics. The issue was hard to identify without them.
- Some renamings and minor optimizations to tidy up the ping / connection checks.

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: [email protected]
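
Illustration of the fix described above. This is a minimal sketch, not code from the pull request. It only shows the idea of deriving both timeouts from one configuration-based value, so that the backward connection check finishes before the connection recovery attempt gives up. The class name, method names, and the 1/2 ratio are hypothetical. The actual values used in Ignite may differ.

```java
/**
 * Hypothetical sketch: both timeouts are derived from a single base value
 * instead of being independent, so their ordering is fixed by construction.
 */
final class RingTimeoutSketch {
    private RingTimeoutSketch() {
        // Utility class, no instances.
    }

    /**
     * Time node N allows itself to recover the connection to node N+2.
     * Assumption: it equals the base timeout taken from the configuration.
     */
    static long recoveryTimeoutMs(long baseTimeoutMs) {
        return baseTimeoutMs;
    }

    /**
     * Time node N+2 spends checking the backward connection to node N+1.
     * Assumption: half of the base timeout (the ratio is illustrative only),
     * with a floor of 1 ms.
     */
    static long backwardCheckTimeoutMs(long baseTimeoutMs) {
        return Math.max(1L, baseTimeoutMs / 2);
    }

    public static void main(String[] args) {
        long baseTimeoutMs = 10_000L; // Assumed value read from the node configuration.

        System.out.println("recovery=" + recoveryTimeoutMs(baseTimeoutMs) + " ms, "
            + "backwardCheck=" + backwardCheckTimeoutMs(baseTimeoutMs) + " ms");
    }
}
```

With both values tied to the same base, node N+2 can answer node N before node N's recovery attempt expires, which is the ordering the fix aims for.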
