[
https://issues.apache.org/jira/browse/IGNITE-26119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Maksim Davydov reassigned IGNITE-26119:
---------------------------------------
Assignee: Maksim Davydov
> Create a set of tests to examine tx protocol behavior against an unstable
> network
> ---------------------------------------------------------------------------------
>
> Key: IGNITE-26119
> URL: https://issues.apache.org/jira/browse/IGNITE-26119
> Project: Ignite
> Issue Type: Task
> Reporter: Sergey Chugunov
> Assignee: Maksim Davydov
> Priority: Major
> Labels: ise
>
> Ignite 2.x TX protocol is based on the Two-Phase Commit (2PC) algorithm, which
> is known to be unreliable when the network is unstable. Lost messages,
> timeouts and network splits (aka split-brain situations) could lead to
> data loss or data inconsistency.
> At the same time, there are no tests that verify the Ignite TX protocol in a
> controlled environment.
> The task is to create a set of such tests and find improvements to the
> protocol, logging and tooling to make it easier to track and fix problematic
> transactions.
> A good example of such a scenario looks like this:
> # Cluster of 5 nodes, cache with 2 backups.
> # A transaction covering two partitions is started.
> # A finish message is sent to a backup node of one partition and the primary
> node of the other partition.
> # The other nodes never receive this message because the tx coordinator and
> the nodes from the previous step become unavailable.
> The task here is to assess what happens on the other three nodes, which have
> never seen the finish request, and how they would recover the transaction. Is
> it possible to get a data inconsistency between different nodes (e.g. the
> other three nodes decide to roll back the tx)? If yes, is it possible to
> prevent this by plugging in a TopologyValidator?
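
To make the scenario above concrete, here is a toy Java model of which nodes are left after the failure. It is only a sketch: the node ids, the partition-to-owner layout and the class name `TxScenarioModel` are made up for illustration, and the coordinator is assumed to be a client node outside the 5 servers (so that exactly three servers remain, as the description says). Nothing here uses or emulates Ignite's real recovery logic; it only shows which surviving owners would have to resolve the transaction without ever having seen the finish message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the failure scenario; hypothetical, not based on Ignite internals. */
final class TxScenarioModel {
    /** Server nodes 0..clusterSize-1 that are not in the failed set. */
    static Set<Integer> survivors(int clusterSize, Set<Integer> failed) {
        Set<Integer> alive = new TreeSet<>();
        for (int node = 0; node < clusterSize; node++) {
            if (!failed.contains(node)) alive.add(node);
        }
        return alive;
    }

    /** Owners of a partition that are still alive, in their original order. */
    static List<Integer> survivingOwners(List<Integer> owners, Set<Integer> alive) {
        List<Integer> result = new ArrayList<>();
        for (int node : owners) {
            if (alive.contains(node)) result.add(node);
        }
        return result;
    }

    /** By convention in this model, the first owner in the list is the primary. */
    static boolean primarySurvived(List<Integer> owners, Set<Integer> alive) {
        return alive.contains(owners.get(0));
    }

    public static void main(String[] args) {
        // Assumed layout: 2 backups means 3 owners per partition, primary listed first.
        List<Integer> p1 = Arrays.asList(0, 1, 2);
        List<Integer> p2 = Arrays.asList(3, 4, 0);

        // Finish reaches a backup of P1 (node 1) and the primary of P2 (node 3);
        // both then become unavailable together with the (client) coordinator.
        Set<Integer> finishRecipients = new HashSet<>(Arrays.asList(1, 3));
        Set<Integer> alive = survivors(5, finishRecipients);

        System.out.println("Survivors (none saw finish): " + alive);
        System.out.println("P1 surviving owners: " + survivingOwners(p1, alive)
            + ", primary alive: " + primarySurvived(p1, alive));
        System.out.println("P2 surviving owners: " + survivingOwners(p2, alive)
            + ", primary alive: " + primarySurvived(p2, alive));
    }
}
```

In this layout one partition keeps its primary while the other loses it, and every node that saw the finish message is gone, which is the situation the tests would need to reproduce and observe.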
> Options to expand this scenario include:
> # Assigning the tx coordinator role to different nodes.
> # Using different transaction concurrency and isolation levels.
> # Setting up different timeouts for tx rollback, network etc.
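
The options above multiply quickly, so it may help to enumerate the combinations up front. The sketch below does that in plain Java. The `Concurrency` and `Isolation` enums are local stand-ins that mirror the values of Ignite's `TransactionConcurrency` and `TransactionIsolation`; the `Coordinator` placements, the timeout values and the class `TxTestMatrix` are hypothetical and only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.List;

/** Enumerates test-case combinations; a hypothetical helper, not part of Ignite. */
final class TxTestMatrix {
    // Stand-ins mirroring Ignite's TransactionConcurrency / TransactionIsolation values.
    enum Concurrency { OPTIMISTIC, PESSIMISTIC }
    enum Isolation { READ_COMMITTED, REPEATABLE_READ, SERIALIZABLE }

    // Assumed coordinator placements relative to the partitions in the tx.
    enum Coordinator { CLIENT, PRIMARY_OWNER, BACKUP_OWNER, NON_OWNER }

    /** One combination of parameters for a single test run. */
    static final class Case {
        final Coordinator coordinator;
        final Concurrency concurrency;
        final Isolation isolation;
        final long txTimeoutMs;

        Case(Coordinator coordinator, Concurrency concurrency, Isolation isolation, long txTimeoutMs) {
            this.coordinator = coordinator;
            this.concurrency = concurrency;
            this.isolation = isolation;
            this.txTimeoutMs = txTimeoutMs;
        }

        @Override public String toString() {
            return coordinator + "/" + concurrency + "/" + isolation + "/" + txTimeoutMs + "ms";
        }
    }

    /** Cartesian product of all coordinator placements, modes and the given timeouts. */
    static List<Case> allCases(long... txTimeoutsMs) {
        List<Case> cases = new ArrayList<>();
        for (Coordinator crd : Coordinator.values())
            for (Concurrency conc : Concurrency.values())
                for (Isolation iso : Isolation.values())
                    for (long timeout : txTimeoutsMs)
                        cases.add(new Case(crd, conc, iso, timeout));
        return cases;
    }

    public static void main(String[] args) {
        // Timeout values are arbitrary examples.
        for (Case c : allCases(500, 5_000))
            System.out.println(c);
    }
}
```

Each `Case` could then drive one parameterized run of the failure scenario, which keeps the scenario logic in one place and the variations in data.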