[ 
https://issues.apache.org/jira/browse/IGNITE-23395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-23395:
----------------------------------
    Description: 
Raft logging on any network error consumes about 1 GB per node per 5 minutes on a 
3-node cluster!
 # Start a 3-node cluster
 # Start creating tables in a loop (create 50 tables, insert 1 row into each)
 # Kill 1 node

*Expected result:*

The cluster prints a few errors, updates the topology and continues operations.

*Actual result:*

The logs on the two remaining nodes contain 20 files of ~100 MB each, filled with similar ERRORs:
 * grep "\[ReplicatorGroupImpl\] Fail to check replicator connection to" 
ignite3db* | wc -l
*2 423 492*
 * grep "\[AbstractClientService\] Fail to connect 
TablesAmountCapacityMultiNodeTest_cluster_1, exception: 
org.apache.ignite.internal.raft.PeerUnavailableException: Peer 
TablesAmountCapacityMultiNodeTest_cluster_1 is unavailable." ignite3db* | wc -l
*2 547 696*

In just 9 minutes, on each node!

*Implementation notes*

See also PeerUnavailableException.
Raft writes the lines above to the log every time it fails to send any message 
to the killed node. It could instead remember the dead peer and log only the 
first failure per peer, like this:
{code:java}
        // Must be thread-safe (checked from multiple threads) and give O(1) lookups:
        // a concurrent set, not an ArrayList.
        private final Set<PeerId> deadPeers = ConcurrentHashMap.newKeySet();

        if (!client.connect(peer)) {
            // add() returns true only on the first failure for this peer,
            // so the ERROR is logged once instead of millions of times.
            if (deadPeers.add(peer)) {
                LOG.error("Fail to check replicator connection to peer={}, replicatorType={}.",
                        peer, replicatorType);
            }
            this.failureReplicators.put(peer, replicatorType);
            return false;
        } {code}
There are several places in the code to fix.
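A complement to the snippet above (a hedged sketch only: {{DeadPeerLogGuard}} and its method names are illustrative, not an existing Ignite or JRaft API): the dead-peer set must also be cleared when the peer reconnects, otherwise a failure after recovery would never be logged again.
{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative "log once per dead peer" guard. PeerId is simplified to String
// here; the real code would use the actual PeerId type.
final class DeadPeerLogGuard {
    private final Set<String> deadPeers = ConcurrentHashMap.newKeySet();

    /** Returns true only on the first failure for this peer, i.e. when an ERROR should be logged. */
    boolean shouldLogFailure(String peer) {
        return deadPeers.add(peer);
    }

    /** Call on a successful (re)connection so that a later failure is logged again. */
    void onConnected(String peer) {
        deadPeers.remove(peer);
    }
} {code}
With this, the first failed connection to a peer logs an ERROR, repeated failures stay silent, and once the peer reconnects the guard resets, so the next outage is again reported exactly once.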

  was:
Raft logging on any network error consumes about 1 GB per node per 5 minutes on a 
3-node cluster!
 # Start a 3-node cluster
 # Start creating tables in a loop (create 50 tables, insert 1 row into each)
 # Kill 1 node

Expected result:

The cluster either
 * fails to operate (depending on the configured CMG/MS nodes and killed node) 
or
 * prints a few errors, updates the topology and continues operations.

Actual result:

The logs on the two remaining nodes contain 20 files of ~100 MB each, filled with similar ERRORs:
 * grep  "\[ReplicatorGroupImpl\] Fail to check replicator connection to" 
ignite3db* | wc -l
*2 423 492*
 * grep  "\[AbstractClientService\] Fail to connect 
TablesAmountCapacityMultiNodeTest_cluster_1, exception: 
org.apache.ignite.internal.raft.PeerUnavailableException: Peer 
TablesAmountCapacityMultiNodeTest_cluster_1 is unavailable." ignite3db* | wc -l
*2 547 696*

In just 9 minutes, on each node!


> Raft subsystem spams to log with network exceptions
> ---------------------------------------------------
>
>                 Key: IGNITE-23395
>                 URL: https://issues.apache.org/jira/browse/IGNITE-23395
>             Project: Ignite
>          Issue Type: Improvement
>          Components: networking, persistence
>    Affects Versions: 3.0
>            Reporter: Alexander Belyak
>            Assignee: Denis Chudov
>            Priority: Critical
>              Labels: ignite-3
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
