Kirill Sizov created IGNITE-22692:
--------------------------------------

             Summary: Race in rebalance counters
                 Key: IGNITE-22692
                 URL: https://issues.apache.org/jira/browse/IGNITE-22692
             Project: Ignite
          Issue Type: Bug
            Reporter:  Kirill Sizov


{{org.apache.ignite.internal.disaster.ItDisasterRecoveryReconfigurationTest#testManualRebalanceIfPartitionIsLost}}
 silently fails on assertion error
{noformat}
 java.lang.AssertionError: null
        at 
org.apache.ignite.internal.distributionzones.rebalance.RebalanceRaftGroupEventsListener.countDownPartitionsFromZone(RebalanceRaftGroupEventsListener.java:267)
 ~[classes/:?]
        at 
org.apache.ignite.internal.distributionzones.rebalance.RebalanceRaftGroupEventsListener.lambda$onNewPeersConfigurationApplied$1(RebalanceRaftGroupEventsListener.java:239)
 ~[classes/:?]
        at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
 [?:?]
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) 
[?:?]
        at 
java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
 [?:?]
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 [?:?]
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 [?:?]
        at java.base/java.lang.Thread.run(Thread.java:829) [?:?]
{noformat}

*Implementation details*
With extra logging:
{noformat}
[2024-07-09T13:10:31,812][INFO 
][%idrrt_tmripil_0%Raft-Group-Client-1][TableManager] Node couldn't get the 
leader within timeout so the changing peers is skipped [grp=10_part_0].
[2024-07-09T13:10:31,818][WARN 
][%idrrt_tmripil_0%metastorage-watch-executor-1][WatchProcessor] Watch event 
processing has been too long [duration=20417, 
keys=[assignments.pending.10_part_0, 10_part_0pending.change.trigger]]
[2024-07-09T13:10:31,858][INFO 
][%idrrt_tmripil_0%JRaft-FSMCaller-Disruptormetastorage_stripe_0-0][DisasterRecoveryManager]
 Partition 10_part_0 returned PENDING_KEY_UPDATED status on reset attempt
[2024-07-09T13:10:31,859][INFO 
][%idrrt_tmripil_0%metastorage-watch-executor-0][TableManager] Received update 
on pending assignments. Check if new raft group should be started 
[key=assignments.pending.10_part_0, partition=0, table=TEST, 
localMemberAddress=127.0.0.1:3344, pendingAssignments=Assignments 
[nodes=HashSet [Assignment [consistentId=idrrt_tmripil_0, isPeer=true]], 
force=true], revision=107]
[2024-07-09T13:10:31,916][INFO 
][%idrrt_tmripil_0%JRaft-FSMCaller-Disruptormetastorage_stripe_0-0][TableManager]
 Rebalance counter for the zone is updated [zoneId = 9, partId = 0, counter = 
[10], revision = 107]
[2024-07-09T13:10:31,917][WARN ][%idrrt_tmripil_0%tableManager-io-15][Loza] 
Reset peers for raft group RaftNodeId [groupId=10_part_0, peer=Peer 
[consistentId=idrrt_tmripil_0, idx=0]], new configuration is PeersAndLearners 
[peers=Set12 [Peer [consistentId=idrrt_tmripil_0, idx=0]], learners=SetN []]
[2024-07-09T13:10:31,934][INFO ][%idrrt_tmripil_0%tableManager-io-15][NodeImpl] 
Node <10_part_0/idrrt_tmripil_0> set peers from idrrt_tmripil_1 to 
idrrt_tmripil_0.
[2024-07-09T13:10:32,028][INFO 
][%idrrt_tmripil_0%tableManager-io-15][LocalRaftMetaStorage] Save raft meta, 
path=build/work/ItDisasterRecoveryReconfigurationTest/testManualRebalanceIfPartitionIsLost_189774505606983/idrrt_tmripil_0/10_part_0_idrrt_tmripil_0_0/meta,
 term=1, votedFor=, cost time=76 ms
[2024-07-09T13:10:32,029][INFO ][%idrrt_tmripil_0%tableManager-io-15][NodeImpl] 
Election timeout was reset to initial value.
[2024-07-09T13:10:32,047][INFO ][%idrrt_tmripil_0%tableManager-io-15][NodeImpl] 
Node <10_part_0/idrrt_tmripil_0> reset electionTimeout, currTimer 1 state 
STATE_FOLLOWER new electionTimeout 1200.
[2024-07-09T13:10:32,049][WARN 
][%idrrt_tmripil_0%tableManager-io-17][TableManager] Raft 
changePeersOnRebalance true for Assignments [nodes=HashSet [Assignment 
[consistentId=idrrt_tmripil_0, isPeer=true]], force=true]
[2024-07-09T13:10:33,813][INFO ][%idrrt_tmripil_0%lease-updater][LeaseUpdater] 
Leases updated (printed once per 10 iteration(s)): 
[inCurrentIteration=LeaseStats [leasesCreated=0, leasesPublished=0, 
leasesProlonged=0, leasesWithoutCandidates=1], active=0, 
currentAssignmentsSize=1].
[2024-07-09T13:10:34,243][INFO 
][%idrrt_tmripil_0%JRaft-ElectionTimer-6][NodeImpl] Node 
<10_part_0/idrrt_tmripil_0> term 1 start preVote.
[2024-07-09T13:10:34,256][INFO 
][%idrrt_tmripil_0%JRaft-ElectionTimer-6][NodeImpl] Node 
<10_part_0/idrrt_tmripil_0> start vote and grant vote self, term=1.
[2024-07-09T13:10:34,372][INFO 
][%idrrt_tmripil_0%JRaft-ElectionTimer-6][LocalRaftMetaStorage] Save raft meta, 
path=build/work/ItDisasterRecoveryReconfigurationTest/testManualRebalanceIfPartitionIsLost_189774505606983/idrrt_tmripil_0/10_part_0_idrrt_tmripil_0_0/meta,
 term=2, votedFor=idrrt_tmripil_0, cost time=115 ms
[2024-07-09T13:10:34,373][INFO 
][%idrrt_tmripil_0%JRaft-ElectionTimer-6][NodeImpl] Node 
<10_part_0/idrrt_tmripil_0> become leader of group, term=2, 
conf=idrrt_tmripil_0, oldConf=.
[2024-07-09T13:10:34,373][INFO 
][%idrrt_tmripil_0%JRaft-Request-Processor-26][TableManager] Current node=Peer 
[consistentId=idrrt_tmripil_0, idx=0] is the leader of partition raft 
group=10_part_0. Initiate rebalance process for partition=0, table=TEST
[2024-07-09T13:10:34,393][INFO 
][%idrrt_tmripil_0%JRaft-Request-Processor-26][RaftGroupServiceImpl] Sending 
changePeersAsync request for group=10_part_0 to peers=[Peer 
[consistentId=idrrt_tmripil_0, idx=0]] and learners=[] with leader term=2
[2024-07-09T13:10:34,412][INFO 
][%idrrt_tmripil_0%JRaft-Request-Processor-42][BaseCliRequestProcessor] Receive 
ChangePeersAsyncRequest with term 2 to <10_part_0/idrrt_tmripil_0> from 
127.0.0.1:3344, new conf is idrrt_tmripil_0
[2024-07-09T13:10:34,412][INFO 
][%idrrt_tmripil_0%JRaft-Request-Processor-42][NodeImpl] Node 
<10_part_0/idrrt_tmripil_0> change peers from idrrt_tmripil_0 to 
idrrt_tmripil_0.
[2024-07-09T13:10:34,413][INFO 
][%idrrt_tmripil_0%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter] 
onLeaderStart: term=2.
[2024-07-09T13:10:34,413][INFO 
][%idrrt_tmripil_0%rebalance-scheduler-0][RebalanceRaftGroupEventsListener] New 
leader elected. Going to apply new configuration [tablePartitionId=10_part_0, 
peers=[idrrt_tmripil_0], learners=[]]
[2024-07-09T13:10:34,413][WARN 
][%idrrt_tmripil_0%metastorage-watch-executor-3][WatchProcessor] Watch event 
processing has been too long [duration=2554, 
keys=[assignments.pending.10_part_0, 10_part_0pending.change.trigger]]
[2024-07-09T13:10:34,414][WARN 
][%idrrt_tmripil_0%rebalance-scheduler-1][RebalanceRaftGroupEventsListener] 
onNewPeersConfigurationApplied for PeersAndLearners [peers=Set12 [Peer 
[consistentId=idrrt_tmripil_0, idx=0]], learners=SetN []]
[2024-07-09T13:10:34,435][INFO 
][%idrrt_tmripil_0%rebalance-scheduler-0][RaftGroupServiceImpl] Sending 
changePeersAsync request for group=10_part_0 to peers=[Peer 
[consistentId=idrrt_tmripil_0, idx=0]] and learners=[] with leader term=2
[2024-07-09T13:10:34,435][WARN 
][%idrrt_tmripil_0%rebalance-scheduler-1][RebalanceRaftGroupEventsListener] 
countDownPartitionsFromZone counter [10]
[2024-07-09T13:10:34,468][INFO 
][%idrrt_tmripil_0%JRaft-Request-Processor-51][BaseCliRequestProcessor] Receive 
ChangePeersAsyncRequest with term 2 to <10_part_0/idrrt_tmripil_0> from 
127.0.0.1:3344, new conf is idrrt_tmripil_0
[2024-07-09T13:10:34,469][INFO 
][%idrrt_tmripil_0%JRaft-Request-Processor-51][NodeImpl] Node 
<10_part_0/idrrt_tmripil_0> change peers from idrrt_tmripil_0 to 
idrrt_tmripil_0.
[2024-07-09T13:10:34,488][WARN 
][%idrrt_tmripil_0%rebalance-scheduler-2][RebalanceRaftGroupEventsListener] 
onNewPeersConfigurationApplied for PeersAndLearners [peers=Set12 [Peer 
[consistentId=idrrt_tmripil_0, idx=0]], learners=SetN []]
[2024-07-09T13:10:34,506][WARN 
][%idrrt_tmripil_0%rebalance-scheduler-2][RebalanceRaftGroupEventsListener] 
countDownPartitionsFromZone counter []
[2024-07-09T13:10:34,506][INFO 
][%idrrt_tmripil_0%rebalance-scheduler-1][RebalanceRaftGroupEventsListener] 
Count down of zone's tables counter is succeeded [zoneId=9, partId=0, 
counter=[], appliedPeers=[idrrt_tmripil_0]]
[2024-07-09T13:10:34,525][WARN 
][%idrrt_tmripil_0%rebalance-scheduler-2][RebalanceRaftGroupEventsListener] 
Unable to count down partitions counter in metastore: 10_part_0
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to