[jira] [Comment Edited] (IGNITE-17083) Universal full rebalance procedure for MV storage

Roman Puchkovskiy (Jira) Wed, 22 Jun 2022 07:40:04 -0700


    [ 
https://issues.apache.org/jira/browse/IGNITE-17083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557475#comment-17557475
 ]


Roman Puchkovskiy edited comment on IGNITE-17083 at 6/22/22 2:39 PM:
---------------------------------------------------------------------

A snapshot must have (associated with it) the largest index of an applied 
command included in the snapshot. If a snapshot is created from the 'current 
state' of the state machine, then we can do the following to obtain the index 
corresponding to the state that the state machine had at the beginning of the 
snapshot:
 # Each write is accompanied by a RAFT log index: the index of the 
command that executes the write (this is already suggested in IGNITE-16907)
 # When a write is executed, its index is saved. It is persisted either 
eventually (during a checkpoint) or immediately
 # When starting a snapshot, we take a lock making sure that no command is 
executed concurrently with us, and read the current index (corresponding to the 
last executed write). We release the lock immediately after reading it. Then we 
send the index to the recipient node as a part of the snapshot metadata.
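
The three steps above might be sketched as follows. This is only a minimal illustration, not the actual Ignite code: the class {{AppliedIndexTracker}} and its methods are hypothetical names, a single lock stands in for whatever mutual exclusion the state machine really uses, and the index is kept in memory here, whereas the real one would be persisted as described in step 2.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch: remembers the RAFT index of the last executed command. */
class AppliedIndexTracker {
    private final ReentrantLock lock = new ReentrantLock();

    // Assumption: held in memory only; a real storage would persist it
    // (immediately or at the next checkpoint).
    private long lastAppliedIndex;

    /** Executes the command's writes and records its index under the lock. */
    void applyCommand(long raftIndex, Runnable writes) {
        lock.lock();
        try {
            writes.run();
            lastAppliedIndex = raftIndex;
        } finally {
            lock.unlock();
        }
    }

    /**
     * Called when a snapshot starts: no command can be executing while the
     * index is read, and the lock is released right after the read.
     */
    long startSnapshot() {
        lock.lock();
        try {
            return lastAppliedIndex;
        } finally {
            lock.unlock();
        }
    }
}
```

The value returned by {{startSnapshot()}} is what would be sent to the recipient node as part of the snapshot metadata.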


was (Author: rpuch):
A snapshot must have (associated with it) the largest index of an applied 
command included in the snapshot. If a snapshot is created from the 'current 
state' of the state machine, then we can do the following to obtain the index 
corresponding to the state that the state machine had at the beginning of the 
snapshot:
 # Each write is accompanied by a RAFT log index: the index of the 
command that executes the write (this is already suggested in IGNITE-16907)
 # When a write is executed, its index is saved. It is persisted either 
eventually (during a checkpoint) or immediately
 # When starting a snapshot, we take a lock making sure that no write is 
executed concurrently with us, and read the current index (corresponding to the 
last executed write). We release the lock immediately after reading it. Then we 
send the index to the recipient node as a part of the snapshot metadata.

NOTE: for this to work, it's required that each command executes at most one 
write operation; otherwise we might end up in a situation where a command (on 
the recipient node) is only partly applied.

> Universal full rebalance procedure for MV storage
> -------------------------------------------------
>
>                 Key: IGNITE-17083
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17083
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>
> The canonical way to perform a "full rebalance" in RAFT is to have persisted 
> snapshots of data. This is not always a good idea. First of all, for 
> persistent storage the data is already stored somewhere and can be read at 
> any time. Second, for volatile storage this requirement is just absurd.
> So, a "rebalance snapshot" should be streamed from one node to another 
> instead of being written to a storage. What's good is that this approach can 
> be implemented independently from the storage engine (with a few adjustments 
> to the storage API, of course).
> h2. General idea
> Once a "rebalance snapshot" operation is triggered, we open a special type of 
> cursor from the partition storage that is able to give us all versioned 
> chains in _some fixed order_. Every time the next chain has been read, 
> its row id is remembered as the last read (let's call it {{lastRowId}} for 
> now). Then all versions for that row id should be sent to the receiver node 
> in "Oldest to Newest" order to simplify insertion.
> This works fine without concurrent load. To account for concurrent load, we 
> need to have an additional collection of row ids associated with a snapshot. 
> Let's call it {{overwrittenRowIds}}.
> With this in mind, every write command should look similar to this:
> {noformat}
> for (var rebalanceSnapshot : ongoingRebalanceSnapshots) {
>   try (var lock = rebalanceSnapshot.lock()) {
>     // The scan has already passed this row, so it has been sent already.
>     if (rowId <= rebalanceSnapshot.lastRowId())
>       continue;
>     // The row was already sent ahead of the scan by an earlier write.
>     if (!rebalanceSnapshot.overwrittenRowIds().put(rowId))
>       continue;
>     rebalanceSnapshot.sendRowToReceiver(rowId);
>   }
> }
> // Now modification can be freely performed.
> // Snapshot itself will skip everything from the "overwrittenRowIds" collection.
> {noformat}
> NOTE: rebalance snapshot scan must also return uncommitted write intentions. 
> Their commit will be replicated later from the RAFT log.
> NOTE: receiving side will have to rebuild indexes during the rebalancing. 
> Just like it works in Ignite 2.x.
> NOTE: Technically it is possible to have several nodes entering the cluster 
> that require a full rebalance. So, while triggering a rebalance snapshot 
> cursor, we could wait for other nodes that might want to read the same data 
> and process all of them with a single scan. This is an optimization, 
> obviously.
> h2. Implementation
> The implementation will have to be split into several parts, because we need:
>  * Support for snapshot streaming in RAFT state machine.
>  * Storage API for this type of scan.
>  * Every storage must implement the new scan method.
>  * The streamer itself should be implemented, along with the specific logic 
> in write commands.
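
For illustration, the interplay between the scan cursor, {{lastRowId}} and {{overwrittenRowIds}} from the "General idea" section could be sketched roughly like this. Everything here is an assumption made for the example: the class {{RebalanceSnapshotSketch}}, the use of a {{TreeMap}} with {{Long}} row ids as the partition storage, strings as row versions, a {{sent}} list standing in for the receiver node, and {{synchronized}} standing in for the snapshot lock. It is not the proposed storage API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical single-partition model of a streamed rebalance snapshot. */
class RebalanceSnapshotSketch {
    // Assumption: rowId -> versions ordered oldest to newest.
    private final TreeMap<Long, List<String>> storage;
    private final Set<Long> overwrittenRowIds = new HashSet<>();
    // Stand-in for the receiver node: records what was "sent", in order.
    private final List<String> sent = new ArrayList<>();
    private Long lastRowId; // null until the first chain has been read

    RebalanceSnapshotSketch(TreeMap<Long, List<String>> storage) {
        this.storage = storage;
    }

    /** Must be called by a write command before it modifies the row. */
    synchronized void beforeWrite(long rowId) {
        if (lastRowId != null && rowId <= lastRowId) {
            return; // the scan has already passed this row
        }
        if (!overwrittenRowIds.add(rowId)) {
            return; // already sent ahead of the scan
        }
        sendRowToReceiver(rowId);
    }

    /** Reads the next chain in row id order; returns false when the scan is done. */
    synchronized boolean scanNext() {
        Map.Entry<Long, List<String>> next =
                lastRowId == null ? storage.firstEntry() : storage.higherEntry(lastRowId);
        if (next == null) {
            return false;
        }
        lastRowId = next.getKey();
        if (!overwrittenRowIds.contains(lastRowId)) {
            sendRowToReceiver(lastRowId);
        }
        return true;
    }

    /** Sends all versions of the row, oldest first; a row that does not exist yet sends nothing. */
    private void sendRowToReceiver(long rowId) {
        List<String> chain = storage.get(rowId);
        if (chain == null) {
            return;
        }
        for (String version : chain) {
            sent.add(rowId + ":" + version);
        }
    }

    synchronized List<String> sent() {
        return new ArrayList<>(sent);
    }
}
```

In this model a write to a row ahead of the scan causes the row's pre-write state to be sent first, and the scan later skips it; the write itself would reach the receiver through the RAFT log.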



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
