[
https://issues.apache.org/jira/browse/IGNITE-20408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kirill Sizov updated IGNITE-20408:
-----------------------------------
Description:
*Motivation*
Local map of transaction states (local tx state map) contains non-consistent id
of a transaction coordinator node. When trying to resolve write intents using
coordinator path, we need to check whether the coordinator is still present in
cluster and has not restarted (because if it has restarted it means it lost its
volatile state, including local tx state map). But we can't get the
coordinator's non-consistent id in the message handler, and can't send the
message to the node using its non-consistent id, so the following race is
possible:
* we receive message from coordinator with its consistent id,
* try to resolve its non-consistent id to save it in the local tx state map,
but we get the id of restarted node from topology service, so this
non-consistent id is no longer valid.
There is a ticket for the improvement that will allow us to get ClusterNode
containing non-consistent id in the message handler: IGNITE-20296 . After that
improvement we will be able to get ClusterNode as a sender and will have to
replace coordinator id with ClusterNode representing coordinator in tx local
state map.
*Definition of done*
Local map of transaction states contains ClusterNode representing the
coordinator instead of its non-consistent id, and the message to the
coordinator is sent using this ClusterNode as a recipient node.
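A minimal sketch of what this could look like, not the actual Ignite code. {{NodeRef}} and {{TxStateMapSketch}} are hypothetical stand-ins for {{ClusterNode}} and the local tx state map, and the topology lookup is modelled as a plain function. The sketch assumes a restart can be detected by comparing the stored non-consistent id with the id currently registered for the same consistent id.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical local tx state map that keeps the whole coordinator node, not only its id. */
class TxStateMapSketch {
    private final Map<UUID, NodeRef> coordinatorByTx = new ConcurrentHashMap<>();

    /** Stores the coordinator node exactly as it was seen when the message arrived. */
    void updateTxMeta(UUID txId, NodeRef coordinator) {
        coordinatorByTx.put(txId, coordinator);
    }

    /**
     * Returns true only if the node currently registered under the coordinator's
     * consistent id is the same incarnation that was stored for this transaction.
     */
    boolean coordinatorAlive(UUID txId, Function<String, NodeRef> topologyLookup) {
        NodeRef stored = coordinatorByTx.get(txId);
        if (stored == null) {
            return false; // nothing known about this transaction
        }
        NodeRef current = topologyLookup.apply(stored.consistentId);
        // A restarted node keeps its consistent id but gets a new non-consistent id.
        return current != null && current.id.equals(stored.id);
    }

    public static void main(String[] args) {
        TxStateMapSketch map = new TxStateMapSketch();
        UUID txId = UUID.randomUUID();
        map.updateTxMeta(txId, new NodeRef("A", "1"));

        System.out.println(map.coordinatorAlive(txId, c -> new NodeRef("A", "1"))); // true
        System.out.println(map.coordinatorAlive(txId, c -> new NodeRef("B", "1"))); // false
    }
}

/** Hypothetical stand-in for ClusterNode: non-consistent id plus consistent id. */
final class NodeRef {
    final String id;
    final String consistentId;

    NodeRef(String id, String consistentId) {
        this.id = id;
        this.consistentId = consistentId;
    }
}
```

Because the stored value carries both ids, the restart check needs no second lookup at the time the message is received.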
*Implementation details*
First, the issue.
# When a {{NetworkMessage}} is processed in
{{ReplicaManager.onReplicaMessageReceived}}, we derive the sender id (which is a
non-consistent id) from the parameter {{senderConsistentId}}:
{code}
String senderId =
clusterNetSvc.topologyService().getByConsistentId(senderConsistentId).id();
{code}
# {{senderId}} is then stored in {{TxStateMeta}} when
{{PartitionReplicaListener}} calls {{txManager.updateTxMeta}} with it.
# Later when we perform write intent resolution in
{{TransactionStateResolver.resolveDistributiveTxState}} we take the previously
stored sender id as the id of a coordinator node and run
{code}
resolveTxStateFromTxCoordinator(txId, localMeta.txCoordinatorId(), commitGrpId,
timestamp0, txMetaFuture);
{code}
If the node restarts after it has successfully delivered a
{{NetworkMessage}} but before step 1 runs, the code from step 1 may return a
different sender id:
{noformat}
coordinator (localId = A, consistentId = 1): send message M0 (id = 1) -->
primary: receive message M0 (id = 1)
coordinator (localId = A, consistentId = 1): restart
coordinator (localId = B, consistentId = 1): the same node now has a different
local id, previous volatile state is lost
primary: Find coordinator for write intent resolution for consistent id = 1. We
get node B with no state.
{noformat}
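The timeline above can be reproduced with a small simulation. This is only a sketch: {{NodeSketch}} and {{TopologySketch}} are hypothetical stand-ins for {{ClusterNode}} and the topology service, and the second method assumes that the sender node is handed to the message handler together with the message, as proposed in IGNITE-20296.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Simulates the restart race with hypothetical stand-in types. */
class CoordinatorRaceSketch {
    /** Late lookup: the non-consistent id is resolved only after the restart. */
    static String lateLookupAfterRestart() {
        TopologySketch topology = new TopologySketch();
        topology.join(new NodeSketch("A", "1"));   // coordinator starts as A
        String senderConsistentId = "1";            // message M0 is delivered
        topology.join(new NodeSketch("B", "1"));   // coordinator restarts as B
        // The handler resolves the id now and gets the new incarnation.
        return topology.getByConsistentId(senderConsistentId).id;
    }

    /** The sender node is captured together with the message, so the restart cannot change it. */
    static String senderCapturedAtReceive() {
        TopologySketch topology = new TopologySketch();
        NodeSketch coordinator = new NodeSketch("A", "1");
        topology.join(coordinator);
        NodeSketch sender = coordinator;            // travels with message M0
        topology.join(new NodeSketch("B", "1"));   // coordinator restarts as B
        return sender.id;
    }

    public static void main(String[] args) {
        System.out.println("late lookup stores: " + lateLookupAfterRestart());       // B
        System.out.println("captured sender stores: " + senderCapturedAtReceive());  // A
    }
}

/** Hypothetical stand-in for ClusterNode. */
final class NodeSketch {
    final String id;
    final String consistentId;

    NodeSketch(String id, String consistentId) {
        this.id = id;
        this.consistentId = consistentId;
    }
}

/** Hypothetical stand-in for the topology service: consistent id to current incarnation. */
final class TopologySketch {
    private final Map<String, NodeSketch> byConsistentId = new ConcurrentHashMap<>();

    void join(NodeSketch node) {
        byConsistentId.put(node.consistentId, node); // a restart replaces the old incarnation
    }

    NodeSketch getByConsistentId(String consistentId) {
        return byConsistentId.get(consistentId);
    }
}
```

In the first case the primary stores B, a node that never coordinated the transaction. In the second it stores A, which can later be recognised as no longer present.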
was:
*Motivation*
Local map of transaction states (local tx state map) contains non-consistent id
of a transaction coordinator node. When trying to resolve write intents using
coordinator path, we need to check whether the coordinator is still present in
cluster and has not restarted (because if it has restarted it means it lost its
volatile state, including local tx state map). But we can't get the
coordinator's non-consistent id in the message handler, and can't send the
message to the node using its non-consistent id, so the following race is
possible:
* we receive message from coordinator with its consistent id,
* try to resolve its non-consistent id to save it in the local tx state map,
but we get the id of restarted node from topology service, so this
non-consistent id is no longer valid.
There is a ticket for the improvement that will allow us to get ClusterNode
containing non-consistent id in the message handler: IGNITE-20296 . After that
improvement we will be able to get ClusterNode as a sender and will have to
replace coordinator id with ClusterNode representing coordinator in tx local
state map.
*Definition of done*
Local map of transaction states contains ClusterNode representing the
coordinator instead of its non-consistent id, and the message to the
coordinator is sent using this ClusterNode as a recipient node.
> Replace tx coordinator non-consistent ID with coordinator ClusterNode in
> local tx state map
> -------------------------------------------------------------------------------------------
>
> Key: IGNITE-20408
> URL: https://issues.apache.org/jira/browse/IGNITE-20408
> Project: Ignite
> Issue Type: Improvement
> Reporter: Denis Chudov
> Priority: Major
> Labels: ignite-3
>
> *Motivation*
> Local map of transaction states (local tx state map) contains non-consistent
> id of a transaction coordinator node. When trying to resolve write intents
> using coordinator path, we need to check whether the coordinator is still
> present in cluster and has not restarted (because if it has restarted it
> means it lost its volatile state, including local tx state map). But we can't
> get the coordinator's non-consistent id in the message handler, and can't
> send the message to the node using its non-consistent id, so the following
> race is possible:
> * we receive message from coordinator with its consistent id,
> * try to resolve its non-consistent id to save it in the local tx state map,
> but we get the id of restarted node from topology service, so this
> non-consistent id is no longer valid.
> There is a ticket for the improvement that will allow us to get ClusterNode
> containing non-consistent id in the message handler: IGNITE-20296 . After
> that improvement we will be able to get ClusterNode as a sender and will have
> to replace coordinator id with ClusterNode representing coordinator in tx
> local state map.
> *Definition of done*
> Local map of transaction states contains ClusterNode representing the
> coordinator instead of its non-consistent id, and the message to the
> coordinator is sent using this ClusterNode as a recipient node.
> *Implementation details*
> First, the issue.
> # When a {{NetworkMessage}} is processed in
> {{ReplicaManager.onReplicaMessageReceived}}, we derive the sender id (which is
> a non-consistent id) from the parameter {{senderConsistentId}}:
> {code}
> String senderId =
> clusterNetSvc.topologyService().getByConsistentId(senderConsistentId).id();
> {code}
> # {{senderId}} is then stored in {{TxStateMeta}} when
> {{PartitionReplicaListener}} calls {{txManager.updateTxMeta}} with it.
> # Later when we perform write intent resolution in
> {{TransactionStateResolver.resolveDistributiveTxState}} we take the
> previously stored sender id as the id of a coordinator node and run
> {code}
> resolveTxStateFromTxCoordinator(txId, localMeta.txCoordinatorId(),
> commitGrpId, timestamp0, txMetaFuture);
> {code}
> If the node restarts after it has successfully delivered a
> {{NetworkMessage}} but before step 1 runs, the code from step 1 may return a
> different sender id:
> {noformat}
> coordinator (localId = A, consistentId = 1): send message M0 (id = 1) -->
> primary: receive message M0 (id = 1)
> coordinator (localId = A, consistentId = 1): restart
> coordinator (localId = B, consistentId = 1): the same node now has a different
> local id, previous volatile state is lost
> primary: Find coordinator for write intent resolution for consistent id = 1.
> We get node B with no state.
> {noformat}