[ 
https://issues.apache.org/jira/browse/IGNITE-17087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-17087.
------------------------------------
    Resolution: Won't Fix

> Native rebalance for PDS partitions
> -----------------------------------
>
>                 Key: IGNITE-17087
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17087
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>
> General idea of full rebalance is described in
> https://issues.apache.org/jira/browse/IGNITE-17083
> For persistent storages, there's an option to avoid copy-on-write rebalance 
> algorithms if it's desired. Intuitively, it's a preferable option. Each 
> storage chooses its own format.
> h2. General idea
> In this case, PDS has a checkpointing feature that saves a consistent state on 
> disk. I expect SQL indexes to be in the same partition file as other data.
> For every partition, its state on disk would look like this:
> {code:java}
> part-x.bin
> part-x-1.bin
> part-x-2.bin
> ...
> part-x-n.bin{code}
> part-x.bin is a baseline, and every other file is a delta that should be 
> applied to underlying layers to get consistent data. It can be viewed like 
> full and incremental backups.
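
A minimal Python sketch of the layering described above, not Ignite code. It assumes that each file can be modelled as a mapping from page index to page bytes and that a higher-numbered delta is newer than a lower-numbered one; the names `read_page` and `materialize` are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory model: each layer maps page index -> page bytes.
# layers[0] stands for part-x.bin (baseline); layers[1:] stand for
# part-x-1.bin .. part-x-n.bin, assumed to be ordered oldest to newest.
Layer = Dict[int, bytes]


def read_page(layers: List[Layer], page_idx: int) -> Optional[bytes]:
    """Return the latest version of a page, or None if no layer contains it."""
    # Walk from the newest delta down to the baseline; the first hit wins.
    for layer in reversed(layers):
        if page_idx in layer:
            return layer[page_idx]
    return None


def materialize(layers: List[Layer]) -> Layer:
    """Apply every delta on top of the baseline, like restoring a full
    backup followed by its incremental backups."""
    result: Layer = {}
    for layer in layers:  # oldest first, so newer pages overwrite older ones
        result.update(layer)
    return result
```

For example, with `layers = [{0: b"a0", 1: b"b0"}, {1: b"b1"}]`, `read_page(layers, 1)` returns the page from the delta, while `read_page(layers, 0)` falls through to the baseline.
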
> When a rebalance snapshot is required, we could force a checkpoint and then 
> *prohibit merging* of new deltas into delta files from the snapshot until 
> rebalance is finished. We must guarantee that a consistent state can be read 
> from disk.
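
One possible way to "prohibit merging" is a reference-counted pin on the files that belong to a snapshot. The description only states the requirement, so the mechanism below is an assumption, sketched in Python with the hypothetical names `DeltaFilePins`, `snapshot` and `can_merge_into`.

```python
import threading
from contextlib import contextmanager


class DeltaFilePins:
    """Tracks which files are part of an in-flight rebalance snapshot.

    Hypothetical helper: a merger (compactor) would call can_merge_into()
    before folding new deltas into an existing file.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._pins = {}  # file name -> number of snapshots using it

    @contextmanager
    def snapshot(self, files):
        files = list(files)
        with self._lock:
            for name in files:
                self._pins[name] = self._pins.get(name, 0) + 1
        try:
            yield files  # files stay frozen while the rebalance streams them
        finally:
            with self._lock:
                for name in files:
                    self._pins[name] -= 1
                    if self._pins[name] == 0:
                        del self._pins[name]

    def can_merge_into(self, name):
        """True if no active snapshot depends on this file."""
        with self._lock:
            return name not in self._pins
```

Inside `with pins.snapshot(["part-x.bin", "part-x-1.bin"]):` the pinned files cannot be merge targets; once the block exits, merging is allowed again. Forcing the checkpoint itself is outside the scope of this sketch.
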
> Now, there are several strategies for transferring data:
>  * File-based. We can send baseline and delta files as files. Two possible 
> issues here:
>  ** Files contain duplicated pages, so the volume of data will be bigger than 
> necessary.
>  ** Baseline file has to be truncated, because some delta pages go directly 
> into baseline file as optimization.
>  * Page-based. Latest state of every required page is sent separately. Two 
> strategies here:
>  ** Iterate pages in order of page indexes. Reads incur overhead, but 
> writes are very efficient.
>  ** Iterate pages in order of delta files, skipping already read pages in the 
> process (like snapshots in GridGain, for example). Little overhead on read, 
> but write won't be append-only.
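
The two page-based strategies can be compared with a small Python sketch. It is an illustration under assumptions, not Ignite code: files are modelled as dicts from page index to page bytes, listed oldest to newest, and the second strategy is assumed to walk files from newest to oldest so that the first version seen of a page is the latest one. The function names are hypothetical.

```python
def pages_by_index(layers):
    """Strategy 1: emit pages in ascending page-index order.

    Each page needs a lookup across the layers (read overhead), but the
    receiver gets pages in order and can write them append-only.
    """
    all_indexes = sorted(set().union(*layers))  # union of the dicts' keys
    for idx in all_indexes:
        for layer in reversed(layers):  # newest file first
            if idx in layer:
                yield idx, layer[idx]
                break


def pages_by_delta(layers):
    """Strategy 2: read files one after another, newest first, skipping
    pages that were already sent.

    Reads are close to sequential per file, but page indexes arrive out
    of order, so the receiver cannot write append-only.
    """
    seen = set()
    for layer in reversed(layers):
        for idx, page in layer.items():
            if idx not in seen:
                seen.add(idx)
                yield idx, page
```

Both generators produce the same set of (index, page) pairs; only the order differs, which is exactly the read-versus-write trade-off listed above.
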
> I would argue that slower reads are more appropriate than slower writes. 
> Generally speaking, any write should be slower than any read of the same 
> size, right?
> Should we implement all strategies and give user a choice? It's hard to 
> predict which one is better for which scenario. In the future, I think it 
> would be convenient to implement many options, but at first we should stick 
> to the simplest one.
> There must be a common "infrastructure" or a framework to stream native 
> rebalance snapshots. Data format should be as simple as possible.
> NOTE: of course, it has to be mentioned that this approach might lead to 
> inefficient storage space usage. It can be a problem in theory, but in 
> practice full rebalance isn't expected to occur often, and even then we 
> don't expect that users will rewrite the entire partition data in the span of a 
> single rebalance.
> h2. Possible problems
> Given that "raw" data is sent, including SQL indexes, all incomplete indexes 
> will be sent incomplete. Maybe we should also send a build state for each 
> index so that the receiving side could continue from the right place, not 
> from the beginning.
> This problem will be resolved in the future. Currently we don't have indexes 
> implemented.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
