[
https://issues.apache.org/jira/browse/IGNITE-23395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexander Lapin updated IGNITE-23395:
-------------------------------------
Ignite Flags: (was: Docs Required,Release Notes Required)
> Raft subsystem spams to log with network exceptions
> ---------------------------------------------------
>
> Key: IGNITE-23395
> URL: https://issues.apache.org/jira/browse/IGNITE-23395
> Project: Ignite
> Issue Type: Improvement
> Components: networking, persistence
> Affects Versions: 3.0
> Reporter: Alexander Belyak
> Assignee: Denis Chudov
> Priority: Critical
> Labels: ignite-3
> Fix For: 3.0
>
> Time Spent: 40m
> Remaining Estimate: 0h
>
> On any network error, Raft logging produces about 1 GB of log output per node
> every 5 minutes on a 3-node cluster!
> # Start a 3-node cluster
> # Start creating tables in a loop (create 50 tables, insert 1 row into each)
> # Kill 1 node
> *Expected result:*
> The cluster prints a few errors, updates the topology and continues
> operations.
> *Actual result:*
> Logs on the two remaining nodes contain 20 files of 100 MB each with similar ERRORs:
> * grep "[ReplicatorGroupImpl] Fail to check replicator connection to"
> ignite3db* | wc -l
> *2 423 492*
> * grep "[AbstractClientService] Fail to connect
> TablesAmountCapacityMultiNodeTest_cluster_1, exception:
> org.apache.ignite.internal.raft.PeerUnavailableException: Peer
> TablesAmountCapacityMultiNodeTest_cluster_1 is unavailable." ignite3db* | wc
> -l
> *2 547 696*
> In just 9 minutes, on each node.
> *Implementation notes*
> See also PeerUnavailableException
> Raft writes the mentioned lines to the log each time it fails to send any
> message to the killed node. It could instead remember the killed peer and
> consult that record on connection failure, like this:
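> {code:java}
> // Volatile (in-memory only): peers already reported as unavailable.
> // A concurrent set, in case connections are checked from several threads.
> Set<PeerId> deadPeers = ConcurrentHashMap.newKeySet();
> if (!client.connect(peer)) {
>     // add() returns true only the first time, so the error is logged once per peer.
>     if (deadPeers.add(peer)) {
>         LOG.error("Fail to check replicator connection to peer={}, replicatorType={}.",
>             peer, replicatorType);
>     }
>     this.failureReplicators.put(peer, replicatorType);
>     return false;
> } {code}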
> There are several places in the code to fix.
> *Definition of done*
> Raft writes only one message about each dead peer on a node.
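
Since the description mentions several places in the code to fix, the "one message per dead peer" rule could be factored into a small shared helper. The following is a minimal sketch, not the actual fix: the class name DeadPeerLogGate and its methods are hypothetical, and forgetting a peer once it reconnects is an assumption that the issue itself does not state.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical helper (not part of Ignite): decides whether a connection
 * failure to a peer should be logged, so each dead peer is reported once.
 */
final class DeadPeerLogGate<P> {
    /** Peers already reported as unavailable; kept in memory only. */
    private final Set<P> deadPeers = ConcurrentHashMap.newKeySet();

    /**
     * Returns true only for the first failure recorded for this peer
     * since it was last marked as connected.
     */
    boolean shouldLogFailure(P peer) {
        // Set.add() is atomic here and returns false if the peer is already present.
        return deadPeers.add(peer);
    }

    /**
     * Assumption: once a peer is reachable again it is forgotten,
     * so a later failure of the same peer is logged once more.
     */
    void onConnected(P peer) {
        deadPeers.remove(peer);
    }
}
```

Under these assumptions, each call site would wrap its existing LOG.error call in a check of shouldLogFailure(peer) and call onConnected(peer) after a successful connection.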