[
https://issues.apache.org/jira/browse/IGNITE-13663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vladimir Steshin updated IGNITE-13663:
--------------------------------------
Description:
We should document that TcpDiscoverySpi prolongs detection of a node failure if
the node has several addresses.
By default, all available addresses are assigned to the node and the node
listens on any address (0.0.0.0), not on the first non-loopback address as the
documentation says.
A simple example from my ordinary Mac with WiFi, VPN and Docker (from the Ignite
log): `Local node addresses: [192.168.1.42/0:0:0:0:0:0:0:1%lo0, /127.0.0.1,
/10.11.220.206]`.
It can clearly be seen that `ServerImpl.TcpServer.srvrSock` binds to '0.0.0.0'.
The actual delay of failure detection and connection recovery is
`failureDetectionTimeout * addresses_number + connRecoveryTimeout`, which is
usually unexpected. This peculiarity was uncovered in [1] and [2] and
additionally confirmed by the ducktape integration test [3].
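To make the formula concrete, here is a minimal sketch that only does the
arithmetic. The class and method names are hypothetical and are not part of
Ignite, and the timeout values are illustrative inputs rather than a statement
about Ignite defaults.

```java
// Hypothetical helper, not an Ignite API: evaluates the delay formula
// failureDetectionTimeout * addresses_number + connRecoveryTimeout.
final class FailureDetectionEstimate {
    private FailureDetectionEstimate() {
        // Utility class, no instances.
    }

    /**
     * @param failureDetectionTimeoutMs value of failureDetectionTimeout, in milliseconds
     * @param addressCount number of addresses the failed node has published
     * @param connRecoveryTimeoutMs value of connRecoveryTimeout, in milliseconds
     * @return estimated delay in milliseconds according to the formula above
     */
    static long worstCaseDelayMs(long failureDetectionTimeoutMs,
                                 int addressCount,
                                 long connRecoveryTimeoutMs) {
        if (addressCount < 1) {
            throw new IllegalArgumentException("A node has at least one address");
        }
        return failureDetectionTimeoutMs * addressCount + connRecoveryTimeoutMs;
    }

    public static void main(String[] args) {
        // Illustrative values: 10 s for both timeouts, three published addresses.
        System.out.println(worstCaseDelayMs(10_000L, 3, 10_000L)); // 40000
        // The same timeouts with a single address.
        System.out.println(worstCaseDelayMs(10_000L, 1, 10_000L)); // 20000
    }
}
```

With these inputs, the three addresses from the log above would double the
delay compared to a node that publishes a single address.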
To avoid this, the user should set `IgniteConfiguration.localHost` or
`TcpDiscoverySpi.localAddress`. Unfortunately, users frequently skip this
setting and allow the node to use all available IPs.
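A minimal configuration sketch of this setting, assuming Ignite is on the
classpath and that 192.168.1.42 is the address that belongs to the cluster
network (the address is borrowed from the log above purely for illustration;
the class name is hypothetical). Either setter should be enough on its own;
both are shown only to illustrate the two options.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

public class SingleAddressNode {
    public static void main(String[] args) {
        // Assumption: this is the one address other nodes should use.
        String clusterAddr = "192.168.1.42";

        IgniteConfiguration cfg = new IgniteConfiguration();

        // Option 1: set the local host for the whole node.
        cfg.setLocalHost(clusterAddr);

        // Option 2: set the local address for discovery only.
        TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
        discoSpi.setLocalAddress(clusterAddr);
        cfg.setDiscoverySpi(discoSpi);

        try (Ignite ignite = Ignition.start(cfg)) {
            // Print the addresses the node has actually published.
            System.out.println(ignite.cluster().localNode().addresses());
        }
    }
}
```

The printed list can be compared with the `Local node addresses` log line to
check how many addresses other nodes would have to try.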
Middleware often runs in environments with several IP addresses
(virtualization, containers, different networks). The node sends all obtained
addresses, along with its other node info, to the cluster. A connection to the
node is established to the first of its addresses. If that connection is lost,
the other addresses are tried sequentially to reconnect. If those addresses do
not belong to the intended node network or do not represent an existing
physical connection, processing them is just a waste of time.
[1] https://issues.apache.org/jira/browse/IGNITE-13012
[2] https://issues.apache.org/jira/browse/IGNITE-13134
[3]
https://github.com/apache/ignite/blob/ignite-ducktape/modules/ducktests/tests/ignitetest/tests/discovery_test.py
was:
We should document that TcpDiscoverySpi prolongs detection of a node failure if
the node has several addresses.
By default, all available addresses are assigned to the node and the node
listens on any address (0.0.0.0), not on the first non-loopback address as the
documentation says.
A simple example from my ordinary Mac with WiFi, VPN and Docker (from the Ignite
log): `Local node addresses: [192.168.1.42/0:0:0:0:0:0:0:1%lo0, /127.0.0.1,
/192.168.1.42]`.
It can clearly be seen that `ServerImpl.TcpServer.srvrSock` binds to '0.0.0.0'.
The actual delay of failure detection and connection recovery is
`failureDetectionTimeout * addresses_number + connRecoveryTimeout`, which is
usually unexpected. This peculiarity was uncovered in [1] and [2] and
additionally confirmed by the ducktape integration test [3].
To avoid this, the user should set `IgniteConfiguration.localHost` or
`TcpDiscoverySpi.localAddress`. Unfortunately, users frequently skip this
setting and allow the node to use all available IPs.
Middleware often runs in environments with several IP addresses
(virtualization, containers, different networks). The node sends all obtained
addresses, along with its other node info, to the cluster. A connection to the
node is established to the first of its addresses. If that connection is lost,
the other addresses are tried sequentially to reconnect. If those addresses do
not belong to the intended node network or do not represent an existing
physical connection, processing them is just a waste of time.
[1] https://issues.apache.org/jira/browse/IGNITE-13012
[2] https://issues.apache.org/jira/browse/IGNITE-13134
[3]
https://github.com/apache/ignite/blob/ignite-ducktape/modules/ducktests/tests/ignitetest/tests/discovery_test.py
> Represent in the documentation the effect of several node addresses on failure
> detection v2.
> ------------------------------------------------------------------------------------------
>
> Key: IGNITE-13663
> URL: https://issues.apache.org/jira/browse/IGNITE-13663
> Project: Ignite
> Issue Type: Improvement
> Components: documentation
> Affects Versions: 2.7.6, 2.9, 2.8.1
> Reporter: Vladimir Steshin
> Assignee: Denis A. Magda
> Priority: Major
> Labels: iep-45
> Fix For: 2.10
>
> Time Spent: 10m
> Remaining Estimate: 0h