[
https://issues.apache.org/jira/browse/IGNITE-25804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18003529#comment-18003529
]
Alexander Lapin edited comment on IGNITE-25804 at 7/7/25 2:03 PM:
------------------------------------------------------------------
Likely the reason of the failure is the race between doStableSwitch processing
on former leader and new leader onLeaderElected + newPendingAssignments
processing.
{code:java}
[2025-07-04T16:47:22,041][INFO
][%irbpaqt_tnrdqp_3346%rebalance-scheduler-0][RaftGroupServiceImpl] Sending
changePeersAndLearnersAsync request for group=20_part_0 to peers=[Peer
[consistentId=irbpaqt_tnrdqp_3347, idx=0], Peer
[consistentId=irbpaqt_tnrdqp_3344, idx=0], Peer
[consistentId=irbpaqt_tnrdqp_3345, idx=0]] and learners=[] with leader term=1
[2025-07-04T16:47:22,042][INFO
][%irbpaqt_tnrdqp_3346%JRaft-Request-Processor-10][BaseCliRequestProcessor]
Receive ChangePeersAndLearnersAsyncRequest with term 1 to
<20_part_0/irbpaqt_tnrdqp_3346> from 127.0.0.1:3346, new conf is
irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344,irbpaqt_tnrdqp_3345
[2025-07-04T16:47:22,410][INFO
][%irbpaqt_tnrdqp_3346%JRaft-Common-Executor-3][NodeImpl] Catch up phase to
change peers from=[irbpaqt_tnrdqp_3347, irbpaqt_tnrdqp_3344,
irbpaqt_tnrdqp_3346] to=[irbpaqt_tnrdqp_3347, irbpaqt_tnrdqp_3344,
irbpaqt_tnrdqp_3345] was successfully finished
[2025-07-04T16:47:22,411][INFO
][%irbpaqt_tnrdqp_3346%JRaft-FSMCaller-Disruptor_stripe_0-0][StateMachineAdapter]
onConfigurationCommitted:
irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344,irbpaqt_tnrdqp_3345.
[2025-07-04T16:47:22,411][INFO
][%irbpaqt_tnrdqp_3347%JRaft-FSMCaller-Disruptor_stripe_0-0][StateMachineAdapter]
onConfigurationCommitted:
irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344,irbpaqt_tnrdqp_3345.
[2025-07-04T16:47:22,411][INFO
][%irbpaqt_tnrdqp_3344%JRaft-FSMCaller-Disruptor_stripe_0-0][StateMachineAdapter]
onConfigurationCommitted:
irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344,irbpaqt_tnrdqp_3345.
// Spoiled changePeers[2025-07-04T16:47:22,425][INFO
][%irbpaqt_tnrdqp_3347%rebalance-scheduler-0][RaftGroupServiceImpl] Sending
changePeersAndLearnersAsync request for group=20_part_0 to peers=[Peer
[consistentId=irbpaqt_tnrdqp_3347, idx=0], Peer
[consistentId=irbpaqt_tnrdqp_3344, idx=0], Peer
[consistentId=irbpaqt_tnrdqp_3345, idx=0]] and learners=[] with leader term=2
[2025-07-04T16:47:22,676][INFO
][%irbpaqt_tnrdqp_3347%MessagingService-inbound-Default-0-0][RaftGroupServiceImpl]
Sending changePeersAndLearnersAsync request for group=20_part_0 to peers=[Peer
[consistentId=irbpaqt_tnrdqp_3347, idx=0], Peer
[consistentId=irbpaqt_tnrdqp_3344, idx=0], Peer
[consistentId=irbpaqt_tnrdqp_3345, idx=0]] and learners=[Peer
[consistentId=irbpaqt_tnrdqp_3346, idx=0]] with leader term=2
[2025-07-04T16:47:22,676][INFO
][%irbpaqt_tnrdqp_3347%JRaft-Request-Processor-16][BaseCliRequestProcessor]
Receive ChangePeersAndLearnersAsyncRequest with term 2 to
<20_part_0/irbpaqt_tnrdqp_3347> from 127.0.0.1:3347, new conf is
irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344,irbpaqt_tnrdqp_3345,irbpaqt_tnrdqp_3346/learner
[2025-07-04T16:47:22,677][INFO
][%irbpaqt_tnrdqp_3347%JRaft-Request-Processor-16][NodeImpl] Catch up phase to
change peers from=[irbpaqt_tnrdqp_3347, irbpaqt_tnrdqp_3344,
irbpaqt_tnrdqp_3345] to=[irbpaqt_tnrdqp_3347, irbpaqt_tnrdqp_3344,
irbpaqt_tnrdqp_3345] was successfully finished // Learners part is
likely missing.
// 3347, 3344[2025-07-04T16:47:22,739][INFO
][%irbpaqt_tnrdqp_3347%MessagingService-inbound-Default-0-0][RaftGroupServiceImpl]
Sending changePeersAndLearnersAsync request for group=20_part_0 to peers=[Peer
[consistentId=irbpaqt_tnrdqp_3347, idx=0], Peer
[consistentId=irbpaqt_tnrdqp_3344, idx=0]] and learners=[] with leader term=2
[2025-07-04T16:47:22,739][INFO
][%irbpaqt_tnrdqp_3347%JRaft-Request-Processor-0][BaseCliRequestProcessor]
Receive ChangePeersAndLearnersAsyncRequest with term 2 to
<20_part_0/irbpaqt_tnrdqp_3347> from 127.0.0.1:3347, new conf is
irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344
[2025-07-04T16:47:22,740][INFO
][%irbpaqt_tnrdqp_3347%JRaft-Request-Processor-0][NodeImpl] Catch up phase to
change peers from=[irbpaqt_tnrdqp_3347, irbpaqt_tnrdqp_3344,
irbpaqt_tnrdqp_3345] to=[irbpaqt_tnrdqp_3347, irbpaqt_tnrdqp_3344] was
successfully finished
[2025-07-04T16:47:22,740][INFO
][%irbpaqt_tnrdqp_3347%JRaft-FSMCaller-Disruptor_stripe_0-0][StateMachineAdapter]
onConfigurationCommitted: irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344.
// 3347, 3344, 3346[2025-07-04T16:47:22,786][INFO
][%irbpaqt_tnrdqp_3347%MessagingService-inbound-Default-0-0][RaftGroupServiceImpl]
Sending changePeersAndLearnersAsync request for group=20_part_0 to peers=[Peer
[consistentId=irbpaqt_tnrdqp_3347, idx=0], Peer
[consistentId=irbpaqt_tnrdqp_3344, idx=0], Peer
[consistentId=irbpaqt_tnrdqp_3346, idx=0]] and learners=[] with leader term=2
[2025-07-04T16:47:22,786][INFO
][%irbpaqt_tnrdqp_3347%JRaft-Request-Processor-1][BaseCliRequestProcessor]
Receive ChangePeersAndLearnersAsyncRequest with term 2 to
<20_part_0/irbpaqt_tnrdqp_3347> from 127.0.0.1:3347, new conf is
irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344,irbpaqt_tnrdqp_3346
[2025-07-04T16:47:22,786][INFO
][%irbpaqt_tnrdqp_3347%JRaft-Common-Executor-0][NodeImpl] Catch up phase to
change peers from=[irbpaqt_tnrdqp_3347, irbpaqt_tnrdqp_3344]
to=[irbpaqt_tnrdqp_3347, irbpaqt_tnrdqp_3344, irbpaqt_tnrdqp_3346] was
successfully finished
[2025-07-04T16:47:22,787][INFO
][%irbpaqt_tnrdqp_3347%JRaft-FSMCaller-Disruptor_stripe_0-0][StateMachineAdapter]
onConfigurationCommitted:
irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344,irbpaqt_tnrdqp_3346.
[2025-07-04T16:47:22,787][INFO
][%irbpaqt_tnrdqp_3346%JRaft-FSMCaller-Disruptor_stripe_0-0][StateMachineAdapter]
onConfigurationCommitted:
irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344,irbpaqt_tnrdqp_3346.
[2025-07-04T16:47:22,787][INFO
][%irbpaqt_tnrdqp_3344%JRaft-FSMCaller-Disruptor_stripe_0-0][StateMachineAdapter]
onConfigurationCommitted:
irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344,irbpaqt_tnrdqp_3346.
// Delayed spoiled changePeers prcessing. [2025-07-04T16:47:23,837][INFO
][%irbpaqt_tnrdqp_3347%JRaft-Request-Processor-17][BaseCliRequestProcessor]
Receive ChangePeersAndLearnersAsyncRequest with term 2 to
<20_part_0/irbpaqt_tnrdqp_3347> from 127.0.0.1:3347, new conf is
irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344,irbpaqt_tnrdqp_3345
[2025-07-04T16:47:24,365][INFO
][%irbpaqt_tnrdqp_3347%MessagingService-inbound-Default-0-0][RaftGroupServiceImpl]
Sending changePeersAndLearnersAsync request for group=20_part_0 to peers=[Peer
[consistentId=irbpaqt_tnrdqp_3347, idx=0], Peer
[consistentId=irbpaqt_tnrdqp_3344, idx=0], Peer
[consistentId=irbpaqt_tnrdqp_3346, idx=0]] and learners=[Peer
[consistentId=irbpaqt_tnrdqp_3345, idx=0]] with leader term=2
[2025-07-04T16:47:24,366][INFO
][%irbpaqt_tnrdqp_3347%JRaft-Request-Processor-2][BaseCliRequestProcessor]
Receive ChangePeersAndLearnersAsyncRequest with term 2 to
<20_part_0/irbpaqt_tnrdqp_3347> from 127.0.0.1:3347, new conf is
irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344,irbpaqt_tnrdqp_3346,irbpaqt_tnrdqp_3345/learner
[2025-07-04T16:47:24,366][WARN
][%irbpaqt_tnrdqp_3347%JRaft-Request-Processor-2][NodeImpl] Node
<20_part_0/irbpaqt_tnrdqp_3347> refused configuration concurrent changing.
// Переехали на левую конфигурацию [2025-07-04T16:47:24,450][INFO
][%irbpaqt_tnrdqp_3347%JRaft-FSMCaller-Disruptor_stripe_0-0][StateMachineAdapter]
onConfigurationCommitted:
irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344,irbpaqt_tnrdqp_3345.
[2025-07-04T16:47:24,450][INFO
][%irbpaqt_tnrdqp_3346%JRaft-FSMCaller-Disruptor_stripe_0-0][StateMachineAdapter]
onConfigurationCommitted:
irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344,irbpaqt_tnrdqp_3345.
[2025-07-04T16:47:24,451][INFO
][%irbpaqt_tnrdqp_3344%JRaft-FSMCaller-Disruptor_stripe_0-0][StateMachineAdapter]
onConfigurationCommitted:
irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344,irbpaqt_tnrdqp_3345.
// Странно, возможно реордеринг в логах [2025-07-04T16:47:24,451][INFO
][%irbpaqt_tnrdqp_3345%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter]
onConfigurationCommitted:
irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344,irbpaqt_tnrdqp_3345.
[2025-07-04T16:47:24,451][INFO
][%irbpaqt_tnrdqp_3345%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter]
onConfigurationCommitted: irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344.
[2025-07-04T16:47:24,451][INFO
][%irbpaqt_tnrdqp_3345%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter]
onConfigurationCommitted:
irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344,irbpaqt_tnrdqp_3346.
[2025-07-04T16:47:24,451][INFO
][%irbpaqt_tnrdqp_3345%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter]
onConfigurationCommitted:
irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344,irbpaqt_tnrdqp_3345.
[2025-07-04T16:47:24,568][ERROR][%irbpaqt_tnrdqp_3347%JRaft-Request-Processor-6][RpcRequestProcessor]
handleRequest ChangePeersAndLearnersAsyncRequestImpl [groupId=20_part_0,
leaderId=irbpaqt_tnrdqp_3347, newLearnersList=ArrayList [irbpaqt_tnrdqp_3345],
newPeersList=ArrayList [irbpaqt_tnrdqp_3347, irbpaqt_tnrdqp_3344,
irbpaqt_tnrdqp_3346], term=2] failed
[2025-07-04T16:47:24,567][INFO
][%irbpaqt_tnrdqp_3347%JRaft-Request-Processor-6][BaseCliRequestProcessor]
Receive ChangePeersAndLearnersAsyncRequest with term 2 to
<20_part_0/irbpaqt_tnrdqp_3347> from 127.0.0.1:3347, new conf is
irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344,irbpaqt_tnrdqp_3346,irbpaqt_tnrdqp_3345/learner
[2025-07-04T16:47:24,568][ERROR][%irbpaqt_tnrdqp_3347%JRaft-Request-Processor-6][ConfigurationEntry]
Invalid conf entry ConfigurationEntry [id=null,
conf=irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344,irbpaqt_tnrdqp_3346,irbpaqt_tnrdqp_3345/learner,
oldConf=irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344,irbpaqt_tnrdqp_3345], peers
and learners have intersection: [irbpaqt_tnrdqp_3345].
java.lang.IllegalArgumentException: Invalid conf entry:
irbpaqt_tnrdqp_3347,irbpaqt_tnrdqp_3344,irbpaqt_tnrdqp_3346,irbpaqt_tnrdqp_3345/learner
{code}
was (Author: alapin):
Likely the reason of the failure is the race between doStableSwitch processing
on former leader and new leader onLeaderElected + newPendingAssignments
processing.
> ItRebalanceByPendingAssignmentsQueueTest#testNodeRestartDuringQueueProcessing
> is flaky
> --------------------------------------------------------------------------------------
>
> Key: IGNITE-25804
> URL: https://issues.apache.org/jira/browse/IGNITE-25804
> Project: Ignite
> Issue Type: Bug
> Reporter: Alexander Lapin
> Assignee: Alexander Lapin
> Priority: Major
> Labels: MakeTeamcityGreenAgain, ignite-3
>
> {code:java}
> Expected: <[irbpaqt_tnrdqp_3347, irbpaqt_tnrdqp_3346, irbpaqt_tnrdqp_3345,
> irbpaqt_tnrdqp_3344]>
> but: was <[irbpaqt_tnrdqp_3347, irbpaqt_tnrdqp_3346,
> irbpaqt_tnrdqp_3344]>
> java.lang.AssertionError:
> Expected: <[irbpaqt_tnrdqp_3347, irbpaqt_tnrdqp_3346, irbpaqt_tnrdqp_3345,
> irbpaqt_tnrdqp_3344]>
> but: was <[irbpaqt_tnrdqp_3347, irbpaqt_tnrdqp_3346,
> irbpaqt_tnrdqp_3344]>
> at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
> at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
> at
> org.apache.ignite.internal.rebalance.ItRebalanceByPendingAssignmentsQueueTest.lambda$testNodeRestartDuringQueueProcessing$12(ItRebalanceByPendingAssignmentsQueueTest.java:307)
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)