[ https://issues.apache.org/jira/browse/IGNITE-17279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vyacheslav Koptilin updated IGNITE-17279:
-----------------------------------------
Description:
It seems that the coordinator node does not correctly update the node2part
mapping for lost partitions, as the following dump shows.
{noformat}
[test-runner-#1%distributed.CachePartitionLostAfterSupplierHasLeftTest%][root] dump partitions state for <default>:
----preload sync futures----
nodeId=b57ca812-416d-40d7-bb4f-271994900000 consistentId=distributed.CachePartitionLostAfterSupplierHasLeftTest0 isDone=true
nodeId=20fdfa4a-ddf6-4229-b25e-38cd8d300001 consistentId=distributed.CachePartitionLostAfterSupplierHasLeftTest1 isDone=true
----rebalance futures----
nodeId=b57ca812-416d-40d7-bb4f-271994900000 isDone=true res=true topVer=null
remaining: {}
nodeId=20fdfa4a-ddf6-4229-b25e-38cd8d300001 isDone=true res=false topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0]
remaining: {}
----partition state----
localNodeId=b57ca812-416d-40d7-bb4f-271994900000 grid=distributed.CachePartitionLostAfterSupplierHasLeftTest0
local part=0 counters=Counter [lwm=200, missed=[], maxApplied=200, hwm=200] fullSize=200 *state=LOST* reservations=0 isAffNode=true
nodeId=20fdfa4a-ddf6-4229-b25e-38cd8d300001 part=0 *state=LOST* isAffNode=true
...
localNodeId=20fdfa4a-ddf6-4229-b25e-38cd8d300001 grid=distributed.CachePartitionLostAfterSupplierHasLeftTest1
local part=0 counters=Counter [lwm=0, missed=[], maxApplied=0, hwm=0] fullSize=100 *state=LOST* reservations=0 isAffNode=true
nodeId=b57ca812-416d-40d7-bb4f-271994900000 part=0 *state=OWNING* isAffNode=true
...
{noformat}
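Read as data, the dump shows the mismatch directly: on the first node both copies
of part 0 are LOST, while the second node's map still reports the first node as
OWNING. The snippet below is a purely illustrative restatement of that state with
plain Java maps; the class and variable names are invented for the example, and
only the node ids and states are taken from the dump.
{noformat}
// Illustrative model only: plain maps stand in for Ignite's node2part
// (GridDhtPartitionFullMap / GridDhtPartitionMap); node ids and states are
// copied from the two "localNodeId" sections of the dump above.
import java.util.Map;
import java.util.UUID;

public class Node2PartDumpModel {
    enum State { OWNING, LOST }

    public static void main(String[] args) {
        UUID nodeA = UUID.fromString("b57ca812-416d-40d7-bb4f-271994900000");
        UUID nodeB = UUID.fromString("20fdfa4a-ddf6-4229-b25e-38cd8d300001");

        // View on nodeA: part 0 is LOST on both nodes -- consistent.
        Map<UUID, Map<Integer, State>> viewOnNodeA = Map.of(
            nodeA, Map.of(0, State.LOST),
            nodeB, Map.of(0, State.LOST));

        // View on nodeB: its own copy of part 0 is LOST, but nodeA is still
        // reported as OWNING -- the stale entry this issue is about.
        Map<UUID, Map<Integer, State>> viewOnNodeB = Map.of(
            nodeA, Map.of(0, State.OWNING),
            nodeB, Map.of(0, State.LOST));

        System.out.println("view on nodeA: " + viewOnNodeA);
        System.out.println("view on nodeB: " + viewOnNodeB);
    }
}
{noformat}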
*Update*:
The root cause of the issue is that the coordinator node incorrectly updates
the mapping of nodes to partition states on PME (see
GridDhtPartitionTopologyImpl.node2part). It seems that the coordinator node
should set the partition state to LOST on all affinity nodes (if the
partition is considered LOST on the coordinator) before creating and sending
a “full map” message.
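A minimal sketch of that adjustment, assuming the coordinator knows its set of
LOST partitions and the affinity nodes of each partition at the point where the
full map is assembled. Plain maps stand in for GridDhtPartitionTopologyImpl's
node2part and the full-map message, and the method name is invented for the
example; this is not the actual patch.
{noformat}
// Illustrative sketch only: simplified types, not Ignite's internals.
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

public class LostPartitionsFixSketch {
    enum State { MOVING, OWNING, LOST }

    /**
     * Before the coordinator turns node2part into the "full map" message,
     * force every partition it considers LOST to LOST on all affinity nodes
     * of that partition, so no remote node is left with a stale OWNING entry.
     */
    static void markLostOnAllAffinityNodes(
        Map<UUID, Map<Integer, State>> node2part,   // node -> (partition -> state), coordinator's view
        Set<Integer> lostParts,                     // partitions the coordinator considers LOST
        Map<Integer, Collection<UUID>> affinity) {  // partition -> affinity nodes of that partition

        for (Integer part : lostParts)
            for (UUID nodeId : affinity.getOrDefault(part, Set.of()))
                node2part.computeIfAbsent(nodeId, id -> new HashMap<>()).put(part, State.LOST);
    }
}
{noformat}
With such a pass applied before the “full map” message is built, the second
node's copy of part 0 in the dump above would read LOST instead of the stale
OWNING.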
> Mapping of partition states to nodes can erroneously skip lost partitions on
> the coordinator node
> -------------------------------------------------------------------------------------------------
>
> Key: IGNITE-17279
> URL: https://issues.apache.org/jira/browse/IGNITE-17279
> Project: Ignite
> Issue Type: Bug
> Reporter: Vyacheslav Koptilin
> Assignee: Vyacheslav Koptilin
> Priority: Minor
> Fix For: 2.14
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)