[jira] [Updated] (IGNITE-23395) Raft subsystem spams to log with network exceptions

Alexander Lapin (Jira) Thu, 21 Nov 2024 09:01:22 -0800


     [ 
https://issues.apache.org/jira/browse/IGNITE-23395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Alexander Lapin updated IGNITE-23395:
-------------------------------------
    Ignite Flags:   (was: Docs Required,Release Notes Required)

> Raft subsystem spams to log with network exceptions
> ---------------------------------------------------
>
>                 Key: IGNITE-23395
>                 URL: https://issues.apache.org/jira/browse/IGNITE-23395
>             Project: Ignite
>          Issue Type: Improvement
>          Components: networking, persistence
>    Affects Versions: 3.0
>            Reporter: Alexander Belyak
>            Assignee: Denis Chudov
>            Priority: Critical
>              Labels: ignite-3
>             Fix For: 3.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> On any network error, Raft logging consumes about 1 GB per node every 5 
> minutes on a 3-node cluster!
>  # Start a 3-node cluster
>  # Start creating tables in a loop (create 50 tables, insert 1 row into each)
>  # Kill 1 node
> *Expected result:*
> The cluster prints a few errors, updates the topology and continues 
> operations.
> *Actual result:*
> Logs on the two remaining nodes contain 20 files of 100 MB each with similar 
> ERRORs:
>  * grep  "[ReplicatorGroupImpl] Fail to check replicator connection to" 
> ignite3db* | wc -l
> *2 423 492*
>  * grep  "[AbstractClientService] Fail to connect 
> TablesAmountCapacityMultiNodeTest_cluster_1, exception: 
> org.apache.ignite.internal.raft.PeerUnavailableException: Peer 
> TablesAmountCapacityMultiNodeTest_cluster_1 is unavailable." ignite3db* | wc 
> -l
> *2 547 696*
> In just 9 minutes! On each node.
> *Implementation notes*
> See also PeerUnavailableException
> Raft writes these lines to the log every time it fails to send a message to 
> the killed node. Instead, it could remember the killed peer and check that 
> record on connection failure, like this:
> {code:java}
>         // Volatile
>         Collection<PeerId> deadPeers = new ArrayList<>();
>         if (!client.connect(peer)) {
>             if (!deadPeers.contains(peer)) {
>                 LOG.error("Fail to check replicator connection to peer={}, "
>                         + "replicatorType={}.", peer, replicatorType);
>                 deadPeers.add(peer);
>             }
>             this.failureReplicators.put(peer, replicatorType);
>             return false;
>         } {code}
> There are several places in the code to fix.
> *Definition of done*
> Raft writes only one message about each dead peer on a node.
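
The log-once idea from the implementation notes can be sketched in plain Java as follows. This is only an illustration under stated assumptions: DeadPeerLogLimiter, shouldLogFailure and markAlive are hypothetical names, not part of Ignite or its Raft code, and the markAlive reset is an added assumption so that a peer which reconnects and later fails again would be reported again.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical helper (not an Ignite class): remembers peers already
 * reported as unavailable so the caller can log each of them only once.
 */
final class DeadPeerLogLimiter<P> {
    // Concurrent set, since connection checks may run on several threads.
    private final Set<P> deadPeers = ConcurrentHashMap.newKeySet();

    /**
     * Records a failed connection attempt.
     * Returns true only for the first failure of this peer since it was
     * last marked alive, i.e. when the caller should write the log line.
     */
    boolean shouldLogFailure(P peer) {
        // Set.add returns false if the peer was already present.
        return deadPeers.add(peer);
    }

    /** Assumed reset hook: call after a successful connection to the peer. */
    void markAlive(P peer) {
        deadPeers.remove(peer);
    }
}
```

At a call site such as the one quoted above, the LOG.error call would be guarded by shouldLogFailure(peer), while the bookkeeping (failureReplicators.put and return false) would stay unconditional.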



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
