[ https://issues.apache.org/jira/browse/IGNITE-25665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ivan Bessonov updated IGNITE-25665:
-----------------------------------
Description:
h3. Motivation
We need to persistently track pending rows to ensure they are preserved after a
cluster restart. Otherwise, we risk losing them and inadvertently marking
transaction statuses as aborted (as described in the root issue). This could
lead to resolving write intents as aborted, resulting in permanent client data
loss.
h3. Definition of done
Pending rows are persisted and fully recovered upon cluster restart.
h3. Design
The idea is to have a persistent doubly linked list, built over the subset of
row versions that represent write intents.
Currently, each version chain has the following structure:
{code:java}
Chain 1 = [timestamp, row] -> ... -> []
Chain 2 = [timestamp, row] -> ... -> []{code}
What we want to do is connect all the chains that have write intents as
their heads (i.e. {{timestamp == 0L}}), and enrich them with the information
needed to restore the state of pending transactions:
{code:java}
...
^ |
| v
Chain 1 = [rowId, timestamp, row] -> ... -> []
^ |
| v
Chain 2 = [rowId, timestamp, row] -> ... -> []
^ |
| v
...{code}
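The list structure above can be modelled in memory as follows. This is only a sketch under assumptions: {{PendingRowsListSketch}}, {{WriteIntentNode}} and {{NULL_LINK}} are hypothetical names, {{UUID}} stands in for {{RowId}}, links are plain {{long}} values, and a {{HashMap}} stands in for persistent storage.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/** In-memory model of the pending rows list; all names here are hypothetical. */
final class PendingRowsListSketch {
    /** Stand-in for a "nullable" link that points nowhere. */
    static final long NULL_LINK = 0L;

    /** Head row version of a chain that currently holds a write intent. */
    static final class WriteIntentNode {
        final UUID rowId; // stand-in for RowId
        long prevLink = NULL_LINK;
        long nextLink = NULL_LINK;

        WriteIntentNode(UUID rowId) {
            this.rowId = rowId;
        }
    }

    /** Stand-in for persistent storage: link -> node. */
    private final Map<Long, WriteIntentNode> storage = new HashMap<>();
    private long headLink = NULL_LINK;
    private long nextFreeLink = 1L;

    /** Adds a write intent at the head of the list and returns its link. */
    long addWriteIntent(UUID rowId) {
        long link = nextFreeLink++;
        WriteIntentNode node = new WriteIntentNode(rowId);
        node.nextLink = headLink;
        if (headLink != NULL_LINK) {
            storage.get(headLink).prevLink = link;
        }
        storage.put(link, node);
        headLink = link;
        return link;
    }

    /** Unlinks a write intent, e.g. once it is committed or aborted. */
    void removeWriteIntent(long link) {
        WriteIntentNode node = storage.remove(link);
        if (node == null) {
            return;
        }
        if (node.prevLink != NULL_LINK) {
            storage.get(node.prevLink).nextLink = node.nextLink;
        } else {
            headLink = node.nextLink;
        }
        if (node.nextLink != NULL_LINK) {
            storage.get(node.nextLink).prevLink = node.prevLink;
        }
    }

    /** Walks the list from the head and collects row IDs of all write intents. */
    List<UUID> pendingRowIds() {
        List<UUID> result = new ArrayList<>();
        for (long link = headLink; link != NULL_LINK; link = storage.get(link).nextLink) {
            result.add(storage.get(link).rowId);
        }
        return result;
    }
}
```

Having both links is what allows a write intent to be unlinked in constant time when its transaction finishes, without scanning the list.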
This means enriching the {{RowVersion}} class with:
* {{RowId}} (16 bytes).
* Link to the previous list node, "nullable", 6 bytes.
* Link to the next list node, "nullable", 6 bytes.
That is 28 bytes in total, which is already a lot. The commit replication group
ID and transaction ID will be stored in a tree as metadata, because otherwise
they would add another 22 bytes of constantly duplicated data.
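To make the 16 + 6 + 6 = 28 arithmetic concrete, here is a sketch of how the extra fields could be serialized. It is not the actual {{aipersist}} page layout: the class name, the field order, the use of {{UUID}} for the 16-byte row ID and the encoding of a 6-byte link as a short followed by an int are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

/** Hypothetical layout of the extra RowVersion fields: 16 + 6 + 6 = 28 bytes. */
final class RowVersionExtraFieldsSketch {
    static final int ROW_ID_SIZE = 16;
    static final int LINK_SIZE = 6;
    static final int EXTRA_SIZE = ROW_ID_SIZE + 2 * LINK_SIZE;

    /** Assumed encoding of a "nullable" link that points nowhere. */
    static final long NULL_LINK = 0L;

    /** Writes the lower 48 bits of a link as 6 bytes. */
    static void putLink(ByteBuffer buf, long link) {
        if ((link >>> 48) != 0) {
            throw new IllegalArgumentException("Link does not fit into 6 bytes: " + link);
        }
        buf.putShort((short) (link >>> 32));
        buf.putInt((int) link);
    }

    /** Reads a 6-byte link written by putLink. */
    static long getLink(ByteBuffer buf) {
        long high = buf.getShort() & 0xFFFFL;
        long low = buf.getInt() & 0xFFFF_FFFFL;
        return (high << 32) | low;
    }

    /** Serializes the row ID and both list links into a 28-byte array. */
    static byte[] write(UUID rowId, long prevLink, long nextLink) {
        ByteBuffer buf = ByteBuffer.allocate(EXTRA_SIZE);
        buf.putLong(rowId.getMostSignificantBits());
        buf.putLong(rowId.getLeastSignificantBits());
        putLink(buf, prevLink);
        putLink(buf, nextLink);
        return buf.array();
    }
}
```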
Since version chains don't store the transaction ID, we will get it from the
version chain tree when starting the replica.
{{// TODO it is possible to introduce a *getAll* operation on the B+Tree, which
should make this reading faster.}}
A new partition storage API will be required to read this list.
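One possible shape of that read path on replica start is sketched below. Everything in it is hypothetical: {{PendingRowsReader}} stands for the new storage API, {{VersionChainLookup}} stands for a per-row lookup into the version chain tree, and {{UUID}} is used for both row IDs and transaction IDs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/** Sketch of restoring pending rows on replica start; all names are hypothetical. */
final class PendingRowsRecoverySketch {
    /** Hypothetical read API over the persisted list of write intents. */
    interface PendingRowsReader {
        Iterable<UUID> pendingRowIds();
    }

    /** Hypothetical lookup of a row's transaction ID in the version chain tree. */
    interface VersionChainLookup {
        UUID transactionId(UUID rowId);
    }

    /** Groups the row IDs of all persisted write intents by transaction ID. */
    static Map<UUID, List<UUID>> restorePendingRows(PendingRowsReader reader, VersionChainLookup lookup) {
        Map<UUID, List<UUID>> rowsByTx = new HashMap<>();
        for (UUID rowId : reader.pendingRowIds()) {
            // One tree lookup per row; a bulk read could replace this loop.
            UUID txId = lookup.transactionId(rowId);
            rowsByTx.computeIfAbsent(txId, k -> new ArrayList<>()).add(rowId);
        }
        return rowsByTx;
    }
}
```

The per-row lookup in the loop is the part that the *getAll* idea from the TODO would speed up.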
Obviously, the change must be backwards-compatible.
We should probably disable it for {{aimem}}, because in that case it is just
memory overhead and provides nothing useful.
was:
h3. Motivation
We need to persistently track pending rows to ensure they are preserved after a
cluster restart. Otherwise, we risk losing them and inadvertently marking
transaction statuses as aborted (as described in the root issue). This could
lead to resolving write intents as aborted, resulting in permanent client data
loss.
h3. Definition of done
Pending rows are persisted and fully recovered upon cluster restart.
> Persist pending entries list in "aipersist" engine
> --------------------------------------------------
>
> Key: IGNITE-25665
> URL: https://issues.apache.org/jira/browse/IGNITE-25665
> Project: Ignite
> Issue Type: Bug
> Reporter: Vladislav Pyatkov
> Assignee: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
>