Mikhail Efremov created IGNITE-28013:
----------------------------------------
Summary: Lease shouldn't be prolonged for node UUID out of the
current logical topology snapshot
Key: IGNITE-28013
URL: https://issues.apache.org/jira/browse/IGNITE-28013
Project: Ignite
Issue Type: Bug
Reporter: Mikhail Efremov
*Description*
{{ItHighAvailablePartitionsRecoveryByFilterUpdateTest#testSeveralHaResetsAndSomeNodeRestart}}
is guaranteed to fail when the default zone has 25+ partitions, due to the
following assertion failure:
{code:java}
2026-02-25T10:42:38,771][ERROR][%ihaprbfut_tshrasnr_3344%lease-updater][FailureManager]
Critical system error detected. Will be handled accordingly to configured
handler [hnd=StopNodeFailureHandler [nodeName=ihaprbfut_tshrasnr_3344,
super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet
[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]],
failureCtx=CRITICAL_ERROR, failureCtxId=5e7318b6-f5ed-4e93-b526-cfdfd8ed377e]
org.apache.ignite.internal.failure.StackTraceCapturingException: Error occurred when updating the leases.
    at org.apache.ignite.internal.failure.FailureManager.process(FailureManager.java:199)
    at org.apache.ignite.internal.failure.FailureManager.process(FailureManager.java:176)
    at org.apache.ignite.internal.placementdriver.LeaseUpdater$Updater.run(LeaseUpdater.java:394)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.AssertionError: 8
    at org.apache.ignite.internal.placementdriver.leases.LeaseBatchSerializer.packNodesInfo(LeaseBatchSerializer.java:350)
    at org.apache.ignite.internal.placementdriver.leases.LeaseBatchSerializer.writeLease(LeaseBatchSerializer.java:323)
    at org.apache.ignite.internal.placementdriver.leases.LeaseBatchSerializer.writeLeasesForObject(LeaseBatchSerializer.java:279)
    at org.apache.ignite.internal.placementdriver.leases.LeaseBatchSerializer.writePartitionedGroupLeases(LeaseBatchSerializer.java:245)
    at org.apache.ignite.internal.placementdriver.leases.LeaseBatchSerializer.writeExternalData(LeaseBatchSerializer.java:169)
    at org.apache.ignite.internal.placementdriver.leases.LeaseBatchSerializer.writeExternalData(LeaseBatchSerializer.java:109)
    at org.apache.ignite.internal.versioned.VersionedSerializer.writeExternal(VersionedSerializer.java:71)
    at org.apache.ignite.internal.versioned.VersionedSerialization.toBytes(VersionedSerialization.java:52)
{code}
This means that in {{NodesDictionary}} the value of {{nameIndexToName.size()}}
is less than or equal to 8, so {{holderIdAndProposedCandidateFitIn1Byte}} holds,
but later {{packNodesInfo}} gets an index greater than or equal to 8 from the
dictionary's {{idToNodeIndex}} map.
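The mismatch can be illustrated with a minimal sketch. It assumes, based only on the description above, that the compact-form decision looks at the number of distinct node names while the written index comes from a per-UUID map. {{NodesDictionarySketch}}, {{putNode}}, {{namesFitCompactForm}} and {{COMPACT_LIMIT}} are hypothetical names, and the real byte layout of {{LeaseBatchSerializer}} is not modeled.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical, simplified stand-in for a lease batch node dictionary; not the Ignite class. */
public class NodesDictionarySketch {
    /** Assumed limit taken from the description: indexes must stay below 8 for the compact form. */
    static final int COMPACT_LIMIT = 8;

    final Map<String, Integer> nameToIndex = new LinkedHashMap<>();
    final Map<UUID, Integer> idToNodeIndex = new LinkedHashMap<>();

    /** Registers a node: names are deduplicated, but every distinct UUID gets its own index. */
    void putNode(UUID id, String name) {
        nameToIndex.putIfAbsent(name, nameToIndex.size());
        idToNodeIndex.putIfAbsent(id, idToNodeIndex.size());
    }

    /** The compact-form decision in this sketch looks only at the number of distinct names. */
    boolean namesFitCompactForm() {
        return nameToIndex.size() <= COMPACT_LIMIT;
    }

    /** The index that would be written comes from the UUID map. */
    int nodeIndex(UUID id) {
        return idToNodeIndex.get(id);
    }

    public static void main(String[] args) {
        NodesDictionarySketch dict = new NodesDictionarySketch();

        // Eight nodes, each with its own name and UUID.
        for (int i = 0; i < 8; i++) {
            dict.putNode(UUID.randomUUID(), "node-" + i);
        }

        // A restarted node keeps its name but gets a new UUID.
        UUID restarted = UUID.randomUUID();
        dict.putNode(restarted, "node-0");

        System.out.println("names fit compact form: " + dict.namesFitCompactForm());
        System.out.println("index of restarted node: " + dict.nodeIndex(restarted));
    }
}
```

In this sketch the name count stays at 8, so the compact form is chosen, while the restarted node's UUID receives index 8, which no longer fits.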
So, for some reason, we have nodes with the same consistentId but different
UUIDs (the test restarts almost all 8 nodes, which is a corner case). However,
this is mostly not a dictionary issue: such a lease batch should not exist at
all. The root cause is in {{tryToFindCandidateAmongAssignments}}:
{code:java}
// Check whether given assignments is actually available in logical topology. It's a best effort check because it's possible
// for proposed primary candidate to leave the topology at any time. In that case primary candidate will be recalculated.
InternalClusterNode candidateNode = topologyTracker.nodeByConsistentId(assignment.consistentId());
if (candidateNode == null) {
    continue;
}
{code}
We look up the node by consistent ID (node name) instead of by UUID. This leads
to leases for partitions whose leaseholders have the same consistent ID but
different UUIDs. This should be fixed.
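The difference between the two lookups can be shown with a minimal sketch. {{TopologyLookupSketch}} and its methods are hypothetical and only model a snapshot as a map from consistent ID to the UUID of the current node instance; they are not the Ignite topology tracker API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical model of a logical topology snapshot; not Ignite's topology tracker. */
public class TopologyLookupSketch {
    /** Consistent ID (node name) mapped to the UUID of the instance currently in the topology. */
    final Map<String, UUID> idByConsistentId = new HashMap<>();

    /** A node instance joins; a restarted node replaces the old UUID under the same name. */
    void join(String consistentId, UUID nodeId) {
        idByConsistentId.put(consistentId, nodeId);
    }

    /** Lookup by name: succeeds for any instance that shares the name. */
    boolean containsConsistentId(String consistentId) {
        return idByConsistentId.containsKey(consistentId);
    }

    /** Lookup by UUID: succeeds only for the exact instance in the snapshot. */
    boolean containsNodeId(UUID nodeId) {
        return idByConsistentId.containsValue(nodeId);
    }

    public static void main(String[] args) {
        TopologyLookupSketch topology = new TopologyLookupSketch();

        UUID beforeRestart = UUID.randomUUID();
        UUID afterRestart = UUID.randomUUID();

        topology.join("node-0", beforeRestart);
        topology.join("node-0", afterRestart); // restart: same name, new UUID

        // A lease that still refers to the old instance passes the name check
        // but fails the UUID check.
        System.out.println("by consistent ID: " + topology.containsConsistentId("node-0"));
        System.out.println("by old UUID: " + topology.containsNodeId(beforeRestart));
        System.out.println("by new UUID: " + topology.containsNodeId(afterRestart));
    }
}
```

Under these assumptions, a check by consistent ID accepts a leaseholder whose UUID has already left the topology, whereas a check by UUID rejects it.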
*Motivation*
A lease batch should not contain leases for node UUIDs that are not in the
current logical topology.
*Definition of done*
# The lease candidate is looked up by UUID.