Mikhail Efremov created IGNITE-28013:
----------------------------------------

             Summary: Lease shouldn't be prolonged for node UUID out of the 
current logical topology snapshot
                 Key: IGNITE-28013
                 URL: https://issues.apache.org/jira/browse/IGNITE-28013
             Project: Ignite
          Issue Type: Bug
            Reporter: Mikhail Efremov


*Description*

{{ItHighAvailablePartitionsRecoveryByFilterUpdateTest#testSeveralHaResetsAndSomeNodeRestart}}
 with a default zone of 25+ partitions reliably fails with the following 
assertion error:


{code:java}
2026-02-25T10:42:38,771][ERROR][%ihaprbfut_tshrasnr_3344%lease-updater][FailureManager] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeFailureHandler [nodeName=ihaprbfut_tshrasnr_3344, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=CRITICAL_ERROR, failureCtxId=5e7318b6-f5ed-4e93-b526-cfdfd8ed377e]
org.apache.ignite.internal.failure.StackTraceCapturingException: Error occurred when updating the leases.
  at org.apache.ignite.internal.failure.FailureManager.process(FailureManager.java:199)
  at org.apache.ignite.internal.failure.FailureManager.process(FailureManager.java:176)
  at org.apache.ignite.internal.placementdriver.LeaseUpdater$Updater.run(LeaseUpdater.java:394)
  at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.AssertionError: 8
  at org.apache.ignite.internal.placementdriver.leases.LeaseBatchSerializer.packNodesInfo(LeaseBatchSerializer.java:350)
  at org.apache.ignite.internal.placementdriver.leases.LeaseBatchSerializer.writeLease(LeaseBatchSerializer.java:323)
  at org.apache.ignite.internal.placementdriver.leases.LeaseBatchSerializer.writeLeasesForObject(LeaseBatchSerializer.java:279)
  at org.apache.ignite.internal.placementdriver.leases.LeaseBatchSerializer.writePartitionedGroupLeases(LeaseBatchSerializer.java:245)
  at org.apache.ignite.internal.placementdriver.leases.LeaseBatchSerializer.writeExternalData(LeaseBatchSerializer.java:169)
  at org.apache.ignite.internal.placementdriver.leases.LeaseBatchSerializer.writeExternalData(LeaseBatchSerializer.java:109)
  at org.apache.ignite.internal.versioned.VersionedSerializer.writeExternal(VersionedSerializer.java:71)
  at org.apache.ignite.internal.versioned.VersionedSerialization.toBytes(VersionedSerialization.java:52)
{code}

This means that {{NodesDictionary}} reported {{nameIndexToName.size()}} less 
than or equal to 8 (which is why {{holderIdAndProposedCandidateFitIn1Byte}} 
applied), but {{packNodesInfo}} later got an index greater than or equal to 8 
from the dictionary's {{idToNodeIndex}} map.
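
The mismatch can be illustrated with a minimal sketch. It is not the actual {{NodesDictionary}} or {{LeaseBatchSerializer}} code: the class {{DictionarySketch}}, its methods, and the 3-bits-per-index layout are assumptions made only for illustration. It shows how a name table keyed by consistent ID can stay within 8 entries while the UUID-keyed index map hands out index 8.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/** Hypothetical, simplified stand-in for a node dictionary; not the Ignite class. */
final class DictionarySketch {
    /** One entry per distinct consistent ID (node name). */
    final List<String> names = new ArrayList<>();

    /** One entry per node UUID; a restarted node gets a new UUID and a new index. */
    final Map<UUID, Integer> idToNodeIndex = new HashMap<>();

    void add(UUID id, String consistentId) {
        if (!names.contains(consistentId)) {
            names.add(consistentId);
        }
        idToNodeIndex.putIfAbsent(id, idToNodeIndex.size());
    }

    /** Chooses the compact format by looking only at the number of distinct names. */
    boolean fitsIn1Byte() {
        return names.size() <= 8;
    }

    /** Packs two node indexes into one byte; each index must be below 8 (assumed layout). */
    int pack(UUID holder, UUID candidate) {
        int h = idToNodeIndex.get(holder);
        int c = idToNodeIndex.get(candidate);
        if (h >= 8) {
            throw new AssertionError(h);
        }
        if (c >= 8) {
            throw new AssertionError(c);
        }
        return (h << 3) | c;
    }
}
```

With 8 distinct names and a 9th UUID that reuses one of those names, {{fitsIn1Byte()}} in this sketch still returns true, while {{pack}} fails on index 8 for the restarted node.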

So we somehow end up with nodes that share the same consistentId but have 
different UUIDs (test case: almost all 8 nodes are restarted, which is a corner 
case). But this is mostly not a dictionary issue: we shouldn't have such a lease 
batch at all. The root cause is in {{tryToFindCandidateAmongAssignments}}:


{code:java}
            // Check whether given assignments is actually available in logical topology. It's a best effort check because it's possible
            // for proposed primary candidate to leave the topology at any time. In that case primary candidate will be recalculated.
            InternalClusterNode candidateNode = topologyTracker.nodeByConsistentId(assignment.consistentId());

            if (candidateNode == null) {
                continue;
            }
{code}

We look up the node by consistent ID (node name) instead of UUID, which leads 
to leases for partitions whose leaseholders have the same consistent ID but 
different UUIDs. This should be fixed.
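
A minimal sketch of the difference between the two checks follows. It is not the proposed patch: {{LeaseholderCheckSketch}}, its method names, and the representation of the logical topology as a map from consistent ID to the current node UUID are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.UUID;

/** Hypothetical helper comparing a name-only check with a UUID-aware check. */
final class LeaseholderCheckSketch {
    /**
     * Name-only check: passes for a restarted node, because the consistent ID
     * is reused even though the UUID has changed.
     */
    static boolean presentByName(Map<String, UUID> topology, String holderName) {
        return topology.containsKey(holderName);
    }

    /**
     * UUID-aware check: passes only if the node currently registered under
     * this consistent ID has exactly the leaseholder's UUID.
     */
    static boolean presentById(Map<String, UUID> topology, String holderName, UUID holderId) {
        return holderId.equals(topology.get(holderName));
    }
}
```

For a lease held by the pre-restart UUID, {{presentByName}} in this sketch returns true while {{presentById}} returns false, so a UUID-aware check would not treat the stale leaseholder as present in the topology.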

*Motivation*

A lease batch shouldn't contain leases for node UUIDs that aren't in the 
current logical topology.

*Definition of done*

# The lease candidate is looked up by UUID.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
