Kirill Sizov created IGNITE-22692:
--------------------------------------
Summary: Race in rebalance counters
Key: IGNITE-22692
URL: https://issues.apache.org/jira/browse/IGNITE-22692
Project: Ignite
Issue Type: Bug
Reporter: Kirill Sizov
{{org.apache.ignite.internal.disaster.ItDisasterRecoveryReconfigurationTest#testManualRebalanceIfPartitionIsLost}}
silently fails on assertion error
{noformat}
java.lang.AssertionError: null
at
org.apache.ignite.internal.distributionzones.rebalance.RebalanceRaftGroupEventsListener.countDownPartitionsFromZone(RebalanceRaftGroupEventsListener.java:267)
~[classes/:?]
at
org.apache.ignite.internal.distributionzones.rebalance.RebalanceRaftGroupEventsListener.lambda$onNewPeersConfigurationApplied$1(RebalanceRaftGroupEventsListener.java:239)
~[classes/:?]
at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
[?:?]
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[?:?]
at
java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
[?:?]
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[?:?]
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[?:?]
at java.base/java.lang.Thread.run(Thread.java:829) [?:?]
{noformat}
*Implementation details*
With extra logging:
{noformat}
[2024-07-09T13:10:31,812][INFO
][%idrrt_tmripil_0%Raft-Group-Client-1][TableManager] Node couldn't get the
leader within timeout so the changing peers is skipped [grp=10_part_0].
[2024-07-09T13:10:31,818][WARN
][%idrrt_tmripil_0%metastorage-watch-executor-1][WatchProcessor] Watch event
processing has been too long [duration=20417,
keys=[assignments.pending.10_part_0, 10_part_0pending.change.trigger]]
[2024-07-09T13:10:31,858][INFO
][%idrrt_tmripil_0%JRaft-FSMCaller-Disruptormetastorage_stripe_0-0][DisasterRecoveryManager]
Partition 10_part_0 returned PENDING_KEY_UPDATED status on reset attempt
[2024-07-09T13:10:31,859][INFO
][%idrrt_tmripil_0%metastorage-watch-executor-0][TableManager] Received update
on pending assignments. Check if new raft group should be started
[key=assignments.pending.10_part_0, partition=0, table=TEST,
localMemberAddress=127.0.0.1:3344, pendingAssignments=Assignments
[nodes=HashSet [Assignment [consistentId=idrrt_tmripil_0, isPeer=true]],
force=true], revision=107]
[2024-07-09T13:10:31,916][INFO
][%idrrt_tmripil_0%JRaft-FSMCaller-Disruptormetastorage_stripe_0-0][TableManager]
Rebalance counter for the zone is updated [zoneId = 9, partId = 0, counter =
[10], revision = 107]
[2024-07-09T13:10:31,917][WARN ][%idrrt_tmripil_0%tableManager-io-15][Loza]
Reset peers for raft group RaftNodeId [groupId=10_part_0, peer=Peer
[consistentId=idrrt_tmripil_0, idx=0]], new configuration is PeersAndLearners
[peers=Set12 [Peer [consistentId=idrrt_tmripil_0, idx=0]], learners=SetN []]
[2024-07-09T13:10:31,934][INFO ][%idrrt_tmripil_0%tableManager-io-15][NodeImpl]
Node <10_part_0/idrrt_tmripil_0> set peers from idrrt_tmripil_1 to
idrrt_tmripil_0.
[2024-07-09T13:10:32,028][INFO
][%idrrt_tmripil_0%tableManager-io-15][LocalRaftMetaStorage] Save raft meta,
path=build/work/ItDisasterRecoveryReconfigurationTest/testManualRebalanceIfPartitionIsLost_189774505606983/idrrt_tmripil_0/10_part_0_idrrt_tmripil_0_0/meta,
term=1, votedFor=, cost time=76 ms
[2024-07-09T13:10:32,029][INFO ][%idrrt_tmripil_0%tableManager-io-15][NodeImpl]
Election timeout was reset to initial value.
[2024-07-09T13:10:32,047][INFO ][%idrrt_tmripil_0%tableManager-io-15][NodeImpl]
Node <10_part_0/idrrt_tmripil_0> reset electionTimeout, currTimer 1 state
STATE_FOLLOWER new electionTimeout 1200.
[2024-07-09T13:10:32,049][WARN
][%idrrt_tmripil_0%tableManager-io-17][TableManager] Raft
changePeersOnRebalance true for Assignments [nodes=HashSet [Assignment
[consistentId=idrrt_tmripil_0, isPeer=true]], force=true]
[2024-07-09T13:10:33,813][INFO ][%idrrt_tmripil_0%lease-updater][LeaseUpdater]
Leases updated (printed once per 10 iteration(s)):
[inCurrentIteration=LeaseStats [leasesCreated=0, leasesPublished=0,
leasesProlonged=0, leasesWithoutCandidates=1], active=0,
currentAssignmentsSize=1].
[2024-07-09T13:10:34,243][INFO
][%idrrt_tmripil_0%JRaft-ElectionTimer-6][NodeImpl] Node
<10_part_0/idrrt_tmripil_0> term 1 start preVote.
[2024-07-09T13:10:34,256][INFO
][%idrrt_tmripil_0%JRaft-ElectionTimer-6][NodeImpl] Node
<10_part_0/idrrt_tmripil_0> start vote and grant vote self, term=1.
[2024-07-09T13:10:34,372][INFO
][%idrrt_tmripil_0%JRaft-ElectionTimer-6][LocalRaftMetaStorage] Save raft meta,
path=build/work/ItDisasterRecoveryReconfigurationTest/testManualRebalanceIfPartitionIsLost_189774505606983/idrrt_tmripil_0/10_part_0_idrrt_tmripil_0_0/meta,
term=2, votedFor=idrrt_tmripil_0, cost time=115 ms
[2024-07-09T13:10:34,373][INFO
][%idrrt_tmripil_0%JRaft-ElectionTimer-6][NodeImpl] Node
<10_part_0/idrrt_tmripil_0> become leader of group, term=2,
conf=idrrt_tmripil_0, oldConf=.
[2024-07-09T13:10:34,373][INFO
][%idrrt_tmripil_0%JRaft-Request-Processor-26][TableManager] Current node=Peer
[consistentId=idrrt_tmripil_0, idx=0] is the leader of partition raft
group=10_part_0. Initiate rebalance process for partition=0, table=TEST
[2024-07-09T13:10:34,393][INFO
][%idrrt_tmripil_0%JRaft-Request-Processor-26][RaftGroupServiceImpl] Sending
changePeersAsync request for group=10_part_0 to peers=[Peer
[consistentId=idrrt_tmripil_0, idx=0]] and learners=[] with leader term=2
[2024-07-09T13:10:34,412][INFO
][%idrrt_tmripil_0%JRaft-Request-Processor-42][BaseCliRequestProcessor] Receive
ChangePeersAsyncRequest with term 2 to <10_part_0/idrrt_tmripil_0> from
127.0.0.1:3344, new conf is idrrt_tmripil_0
[2024-07-09T13:10:34,412][INFO
][%idrrt_tmripil_0%JRaft-Request-Processor-42][NodeImpl] Node
<10_part_0/idrrt_tmripil_0> change peers from idrrt_tmripil_0 to
idrrt_tmripil_0.
[2024-07-09T13:10:34,413][INFO
][%idrrt_tmripil_0%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter]
onLeaderStart: term=2.
[2024-07-09T13:10:34,413][INFO
][%idrrt_tmripil_0%rebalance-scheduler-0][RebalanceRaftGroupEventsListener] New
leader elected. Going to apply new configuration [tablePartitionId=10_part_0,
peers=[idrrt_tmripil_0], learners=[]]
[2024-07-09T13:10:34,413][WARN
][%idrrt_tmripil_0%metastorage-watch-executor-3][WatchProcessor] Watch event
processing has been too long [duration=2554,
keys=[assignments.pending.10_part_0, 10_part_0pending.change.trigger]]
[2024-07-09T13:10:34,414][WARN
][%idrrt_tmripil_0%rebalance-scheduler-1][RebalanceRaftGroupEventsListener]
onNewPeersConfigurationApplied for PeersAndLearners [peers=Set12 [Peer
[consistentId=idrrt_tmripil_0, idx=0]], learners=SetN []]
[2024-07-09T13:10:34,435][INFO
][%idrrt_tmripil_0%rebalance-scheduler-0][RaftGroupServiceImpl] Sending
changePeersAsync request for group=10_part_0 to peers=[Peer
[consistentId=idrrt_tmripil_0, idx=0]] and learners=[] with leader term=2
[2024-07-09T13:10:34,435][WARN
][%idrrt_tmripil_0%rebalance-scheduler-1][RebalanceRaftGroupEventsListener]
countDownPartitionsFromZone counter [10]
[2024-07-09T13:10:34,468][INFO
][%idrrt_tmripil_0%JRaft-Request-Processor-51][BaseCliRequestProcessor] Receive
ChangePeersAsyncRequest with term 2 to <10_part_0/idrrt_tmripil_0> from
127.0.0.1:3344, new conf is idrrt_tmripil_0
[2024-07-09T13:10:34,469][INFO
][%idrrt_tmripil_0%JRaft-Request-Processor-51][NodeImpl] Node
<10_part_0/idrrt_tmripil_0> change peers from idrrt_tmripil_0 to
idrrt_tmripil_0.
[2024-07-09T13:10:34,488][WARN
][%idrrt_tmripil_0%rebalance-scheduler-2][RebalanceRaftGroupEventsListener]
onNewPeersConfigurationApplied for PeersAndLearners [peers=Set12 [Peer
[consistentId=idrrt_tmripil_0, idx=0]], learners=SetN []]
[2024-07-09T13:10:34,506][WARN
][%idrrt_tmripil_0%rebalance-scheduler-2][RebalanceRaftGroupEventsListener]
countDownPartitionsFromZone counter []
[2024-07-09T13:10:34,506][INFO
][%idrrt_tmripil_0%rebalance-scheduler-1][RebalanceRaftGroupEventsListener]
Count down of zone's tables counter is succeeded [zoneId=9, partId=0,
counter=[], appliedPeers=[idrrt_tmripil_0]]
[2024-07-09T13:10:34,525][WARN
][%idrrt_tmripil_0%rebalance-scheduler-2][RebalanceRaftGroupEventsListener]
Unable to count down partitions counter in metastore: 10_part_0
{noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)