[ https://issues.apache.org/jira/browse/IGNITE-17279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vyacheslav Koptilin updated IGNITE-17279:
-----------------------------------------
Description:
It seems that the coordinator node does not correctly update the node2part
mapping for lost partitions, as the following dump shows.
{noformat}
[test-runner-#1%distributed.CachePartitionLostAfterSupplierHasLeftTest%][root] dump partitions state for <default>:
----preload sync futures----
nodeId=b57ca812-416d-40d7-bb4f-271994900000 consistentId=distributed.CachePartitionLostAfterSupplierHasLeftTest0 isDone=true
nodeId=20fdfa4a-ddf6-4229-b25e-38cd8d300001 consistentId=distributed.CachePartitionLostAfterSupplierHasLeftTest1 isDone=true
----rebalance futures----
nodeId=b57ca812-416d-40d7-bb4f-271994900000 isDone=true res=true topVer=null
remaining: {}
nodeId=20fdfa4a-ddf6-4229-b25e-38cd8d300001 isDone=true res=false topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0]
remaining: {}
----partition state----
localNodeId=b57ca812-416d-40d7-bb4f-271994900000 grid=distributed.CachePartitionLostAfterSupplierHasLeftTest0
local part=0 counters=Counter [lwm=200, missed=[], maxApplied=200, hwm=200] fullSize=200 *state=LOST* reservations=0 isAffNode=true
nodeId=20fdfa4a-ddf6-4229-b25e-38cd8d300001 part=0 *state=LOST* isAffNode=true
...
localNodeId=20fdfa4a-ddf6-4229-b25e-38cd8d300001 grid=distributed.CachePartitionLostAfterSupplierHasLeftTest1
local part=0 counters=Counter [lwm=0, missed=[], maxApplied=0, hwm=0] fullSize=100 *state=LOST* reservations=0 isAffNode=true
nodeId=b57ca812-416d-40d7-bb4f-271994900000 part=0 *state=OWNING* isAffNode=true
...
{noformat}
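Read as data, the dump shows the mismatch directly: on the first node both copies
of part 0 are LOST, while the second node's map still reports the first node as
OWNING. The snippet below is a purely illustrative restatement of that state with
plain Java maps; the class and variable names are invented for the example, and
only the node ids and states are taken from the dump.
{noformat}
// Illustrative model only: plain maps stand in for Ignite's node2part
// (GridDhtPartitionFullMap / GridDhtPartitionMap); node ids and states are
// copied from the two "localNodeId" sections of the dump above.
import java.util.Map;
import java.util.UUID;

public class Node2PartDumpModel {
    enum State { OWNING, LOST }

    public static void main(String[] args) {
        UUID nodeA = UUID.fromString("b57ca812-416d-40d7-bb4f-271994900000");
        UUID nodeB = UUID.fromString("20fdfa4a-ddf6-4229-b25e-38cd8d300001");

        // View on nodeA: part 0 is LOST on both nodes -- consistent.
        Map<UUID, Map<Integer, State>> viewOnNodeA = Map.of(
            nodeA, Map.of(0, State.LOST),
            nodeB, Map.of(0, State.LOST));

        // View on nodeB: its own copy of part 0 is LOST, but nodeA is still
        // reported as OWNING -- the stale entry this issue is about.
        Map<UUID, Map<Integer, State>> viewOnNodeB = Map.of(
            nodeA, Map.of(0, State.OWNING),
            nodeB, Map.of(0, State.LOST));

        System.out.println("view on nodeA: " + viewOnNodeA);
        System.out.println("view on nodeB: " + viewOnNodeB);
    }
}
{noformat}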
*Update*:
The root cause of the issue is that the coordinator node incorrectly updates
the mapping of nodes to partition states on PME (see
GridDhtPartitionTopologyImpl.node2part). It seems that the coordinator node
should set the partition state to LOST on all affinity nodes (if the
partition is considered LOST on the coordinator) before creating and sending
a “full map” message.
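A minimal sketch of that adjustment, assuming the coordinator knows its set of
LOST partitions and the affinity nodes of each partition at the point where the
full map is assembled. Plain maps stand in for GridDhtPartitionTopologyImpl's
node2part and the full-map message, and the method name is invented for the
example; this is not the actual patch.
{noformat}
// Illustrative sketch only: simplified types, not Ignite's internals.
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

public class LostPartitionsFixSketch {
    enum State { MOVING, OWNING, LOST }

    /**
     * Before the coordinator turns node2part into the "full map" message,
     * force every partition it considers LOST to LOST on all affinity nodes
     * of that partition, so no remote node is left with a stale OWNING entry.
     */
    static void markLostOnAllAffinityNodes(
        Map<UUID, Map<Integer, State>> node2part,   // node -> (partition -> state), coordinator's view
        Set<Integer> lostParts,                     // partitions the coordinator considers LOST
        Map<Integer, Collection<UUID>> affinity) {  // partition -> affinity nodes of that partition

        for (Integer part : lostParts)
            for (UUID nodeId : affinity.getOrDefault(part, Set.of()))
                node2part.computeIfAbsent(nodeId, id -> new HashMap<>()).put(part, State.LOST);
    }
}
{noformat}
With such a pass applied before the “full map” message is built, the second
node's copy of part 0 in the dump above would read LOST instead of the stale
OWNING.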
> Mapping of partition states to nodes can erroneously skip lost partitions on
> the coordinator node
> -------------------------------------------------------------------------------------------------
>
> Key: IGNITE-17279
> URL: https://issues.apache.org/jira/browse/IGNITE-17279
> Project: Ignite
> Issue Type: Bug
> Reporter: Vyacheslav Koptilin
> Assignee: Vyacheslav Koptilin
> Priority: Minor
> Fix For: 2.14
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)