[ 
https://issues.apache.org/jira/browse/IGNITE-20408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

 Kirill Sizov updated IGNITE-20408:
-----------------------------------
    Description: 
*Motivation*

The local map of transaction states (local tx state map) contains the 
non-consistent id of the transaction coordinator node. When resolving write 
intents via the coordinator path, we need to check that the coordinator is still 
present in the cluster and has not restarted (a restart means it has lost its 
volatile state, including its local tx state map). But we can't get the 
coordinator's non-consistent id in the message handler, and can't send the 
message to the node using its non-consistent id, so the following race is 
possible:
 * we receive a message from the coordinator with its consistent id,
 * we try to resolve its non-consistent id to save it in the local tx state map, 
but we get the id of the restarted node from the topology service, so this 
non-consistent id is no longer valid.

There is a ticket for an improvement that will allow us to get a ClusterNode 
containing the non-consistent id in the message handler: IGNITE-20296. After that 
improvement we will be able to get the sender as a ClusterNode and will have to 
replace the coordinator id with a ClusterNode representing the coordinator in the 
local tx state map.

*Definition of done*

The local map of transaction states contains a ClusterNode representing the 
coordinator instead of its non-consistent id, and the message to the 
coordinator is sent using this ClusterNode as the recipient node. 
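
A minimal sketch of the target state, under stated assumptions: the class and 
method names below ({{CoordinatorNodeSketch}}, {{TxMetaSketch}}, 
{{LocalTxStateMapSketch}}, {{coordinatorStillHoldsState}}) are hypothetical 
stand-ins and are not the Ignite classes. The sketch only illustrates that 
storing the whole node makes it possible to compare the stored non-consistent 
id with the one currently in the topology.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a cluster node (not the Ignite ClusterNode class). */
final class CoordinatorNodeSketch {
    final String id;           // non-consistent id, changes on every restart
    final String consistentId; // stable across restarts

    CoordinatorNodeSketch(String id, String consistentId) {
        this.id = id;
        this.consistentId = consistentId;
    }
}

/** Hypothetical tx meta that keeps the coordinator node rather than only its id. */
final class TxMetaSketch {
    final CoordinatorNodeSketch coordinator;

    TxMetaSketch(CoordinatorNodeSketch coordinator) {
        this.coordinator = coordinator;
    }
}

/** Hypothetical local tx state map keyed by transaction id. */
final class LocalTxStateMapSketch {
    private final Map<UUID, TxMetaSketch> states = new ConcurrentHashMap<>();

    /** Stores the node that was received together with the message. */
    void updateTxMeta(UUID txId, CoordinatorNodeSketch sender) {
        states.put(txId, new TxMetaSketch(sender));
    }

    /**
     * Returns true only if the node currently registered under the coordinator's
     * consistent id is the same incarnation (same non-consistent id) that was stored.
     */
    boolean coordinatorStillHoldsState(UUID txId, Map<String, CoordinatorNodeSketch> topology) {
        TxMetaSketch meta = states.get(txId);
        if (meta == null) {
            return false;
        }
        CoordinatorNodeSketch current = topology.get(meta.coordinator.consistentId);
        return current != null && current.id.equals(meta.coordinator.id);
    }

    public static void main(String[] args) {
        LocalTxStateMapSketch map = new LocalTxStateMapSketch();
        UUID txId = UUID.randomUUID();
        map.updateTxMeta(txId, new CoordinatorNodeSketch("A", "1"));

        // Topology after a restart: same consistent id, new non-consistent id.
        Map<String, CoordinatorNodeSketch> restarted = Map.of("1", new CoordinatorNodeSketch("B", "1"));
        System.out.println(map.coordinatorStillHoldsState(txId, restarted));
    }
}
```

With only the id stored, the same check is not reliable, because the id itself 
may already belong to the restarted node.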

*Implementation details*
First, the issue.
# When a {{NetworkMessage}} is processed in 
{{ReplicaManager.onReplicaMessageReceived}}, we get the sender id (which is a 
non-consistent id) from the parameter {{senderConsistentId}}:
{code}
String senderId = 
clusterNetSvc.topologyService().getByConsistentId(senderConsistentId).id();
{code}
# {{senderId}} is then stored in {{TxStateMeta}} when 
{{PartitionReplicaListener}} calls {{txManager.updateTxMeta}} with it.
# Later, when we perform write intent resolution in 
{{TransactionStateResolver.resolveDistributiveTxState}}, we take the previously 
stored sender id as the id of the coordinator node and run 
{code}
resolveTxStateFromTxCoordinator(txId, localMeta.txCoordinatorId(), commitGrpId, 
timestamp0, txMetaFuture);
{code}

If the node was restarted after it had successfully delivered a 
{{NetworkMessage}} but before #1, the code from #1 may return a different 
sender id:
{noformat}
coordinator (localId = A, consistentId = 1): send message M0 (id = 1) --> 
primary: receive message M0 (id = 1)
coordinator (localId = A, consistentId = 1): restart
coordinator (localId = B, consistentId = 1): the same node now has a different 
local id, previous volatile state is lost
primary: Find coordinator for write intent resolution for consistent id = 1. We 
get node B with no state.
{noformat}
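
The timeline above can be reproduced with a small, self-contained sketch. All 
names in it ({{SketchNode}}, {{SketchTopology}}, {{CoordinatorRaceSketch}}) are 
hypothetical; the topology is reduced to a map that remembers only the current 
incarnation of each node, which is an assumption made for illustration and not 
a description of the actual topology service.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a cluster node (not the Ignite ClusterNode class). */
final class SketchNode {
    final String id;           // non-consistent (local) id, changes on every restart
    final String consistentId; // stable across restarts

    SketchNode(String id, String consistentId) {
        this.id = id;
        this.consistentId = consistentId;
    }
}

/** Hypothetical topology that only knows the current incarnation of each node. */
final class SketchTopology {
    private final Map<String, SketchNode> current = new HashMap<>();

    /** Registers a (re)started node, replacing any previous incarnation. */
    SketchNode start(String consistentId, String id) {
        SketchNode node = new SketchNode(id, consistentId);
        current.put(consistentId, node);
        return node;
    }

    SketchNode getByConsistentId(String consistentId) {
        return current.get(consistentId);
    }
}

final class CoordinatorRaceSketch {
    /** Resolves the sender by consistent id after the restart, as in step 1. */
    static String idResolvedInHandler() {
        SketchTopology topology = new SketchTopology();
        SketchNode sender = topology.start("1", "A");     // coordinator sends M0
        String senderConsistentId = sender.consistentId;  // M0 carries only this
        topology.start("1", "B");                         // coordinator restarts
        return topology.getByConsistentId(senderConsistentId).id;
    }

    /** Keeps the node that accompanied the message instead of re-resolving it. */
    static String idCapturedWithMessage() {
        SketchTopology topology = new SketchTopology();
        SketchNode sender = topology.start("1", "A");     // coordinator sends M0
        topology.start("1", "B");                         // coordinator restarts
        return sender.id;
    }

    public static void main(String[] args) {
        System.out.println("resolved in handler:   " + idResolvedInHandler());
        System.out.println("captured with message: " + idCapturedWithMessage());
    }
}
```

In this sketch the late lookup yields the local id of the restarted node, 
while the node captured together with the message still identifies the 
incarnation that actually sent it.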

  was:
*Motivation*

Local map of transaction states (local tx state map) contains non-consistent id 
of a transaction coordinator node. When trying to resolve write intents using 
coordinator path, we need to check whether the coordinator is still present in 
cluster and has not restarted (because if it has restarted it means it lost its 
volatile state, including local tx state map). But we can't get the 
coordinator's non-consistent id in the message handler, and can't send the 
message to the node using its non-consistent id, so the following race is 
possible:
 * we receive message from coordinator with its consistent id,
 * try to resolve its non-consistent id to save it in the local tx state map, 
but we get the id of restarted node from topology service, so this 
non-consistent id is no longer valid.

There is a ticket for the improvement that will allow us to get ClusterNode 
containing non-consistent id in the message handler: IGNITE-20296 . After that 
improvement we will be able to get ClusterNode as a sender and will have to 
replace coordinator id with ClusterNode representing coordinator in tx local 
state map.

*Definition of done*

Local map of transaction states contains ClusterNode representing the 
coordinator instead of its non-consistent id, and the message to the 
coordinator is sent using this ClusterNode as a recipient node. 

*Implementation details*
First, the issue.
# a {{NetworkMessage}} is processed in 
{{ReplicaManager.onReplicaMessageReceived}}, we get sender id (which is a 
non-consistent id) from the parameter {{senderConsistentId}}:
{code}
String senderId = 
clusterNetSvc.topologyService().getByConsistentId(senderConsistentId).id();
{code}
# {{senderId}} is then stored in {{TxStateMeta}} when 
{{PartitionReplicaListener}} calls {{txManager.updateTxMeta}} with it.
# Later when we perform write intent resolution in 
{{TransactionStateResolver.resolveDistributiveTxState}} we take the previously 
stored sender id as the id of a coordinator node and run 
{code}
resolveTxStateFromTxCoordinator(txId, localMeta.txCoordinatorId(), commitGrpId, 
timestamp0, txMetaFuture);
{code}

If the node was restarted after it has successfully delivered a 
{{NetworkMessage}} but before #1, the code from #1 may return a different 
sender id:
{noformat}
coordinator (localId = A, consistentId = 1): send message M0 (id = 1) --> 
primary: receive message M0 (id = 1)
coordinator (localId = A, consistentId = 1): restart
coordinator (localId = B, consistentId = 1): the same node now has a different 
local id, previous volatile state is lost
primary: Find coordinator for write intent resolution for consistent id = 1. We 
get node B with no state.
{noformat}


> Replace tx coordinator non-consistent ID with coordinator ClusterNode in 
> local tx state map
> -------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-20408
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20408
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>
> *Motivation*
> The local map of transaction states (local tx state map) contains the 
> non-consistent id of the transaction coordinator node. When resolving write 
> intents via the coordinator path, we need to check that the coordinator is 
> still present in the cluster and has not restarted (a restart means it has 
> lost its volatile state, including its local tx state map). But we can't 
> get the coordinator's non-consistent id in the message handler, and can't 
> send the message to the node using its non-consistent id, so the following 
> race is possible:
>  * we receive a message from the coordinator with its consistent id,
>  * we try to resolve its non-consistent id to save it in the local tx state 
> map, but we get the id of the restarted node from the topology service, so 
> this non-consistent id is no longer valid.
> There is a ticket for an improvement that will allow us to get a ClusterNode 
> containing the non-consistent id in the message handler: IGNITE-20296. After 
> that improvement we will be able to get the sender as a ClusterNode and will 
> have to replace the coordinator id with a ClusterNode representing the 
> coordinator in the local tx state map.
> *Definition of done*
> The local map of transaction states contains a ClusterNode representing the 
> coordinator instead of its non-consistent id, and the message to the 
> coordinator is sent using this ClusterNode as the recipient node. 
> *Implementation details*
> First, the issue.
> # When a {{NetworkMessage}} is processed in 
> {{ReplicaManager.onReplicaMessageReceived}}, we get the sender id (which is a 
> non-consistent id) from the parameter {{senderConsistentId}}:
> {code}
> String senderId = 
> clusterNetSvc.topologyService().getByConsistentId(senderConsistentId).id();
> {code}
> # {{senderId}} is then stored in {{TxStateMeta}} when 
> {{PartitionReplicaListener}} calls {{txManager.updateTxMeta}} with it.
> # Later, when we perform write intent resolution in 
> {{TransactionStateResolver.resolveDistributiveTxState}}, we take the 
> previously stored sender id as the id of the coordinator node and run 
> {code}
> resolveTxStateFromTxCoordinator(txId, localMeta.txCoordinatorId(), 
> commitGrpId, timestamp0, txMetaFuture);
> {code}
> If the node was restarted after it had successfully delivered a 
> {{NetworkMessage}} but before #1, the code from #1 may return a different 
> sender id:
> {noformat}
> coordinator (localId = A, consistentId = 1): send message M0 (id = 1) --> 
> primary: receive message M0 (id = 1)
> coordinator (localId = A, consistentId = 1): restart
> coordinator (localId = B, consistentId = 1): the same node now has a different 
> local id, previous volatile state is lost
> primary: Find coordinator for write intent resolution for consistent id = 1. 
> We get node B with no state.
> {noformat}


