[
https://issues.apache.org/jira/browse/IGNITE-17087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Bessonov resolved IGNITE-17087.
------------------------------------
Resolution: Won't Fix
> Native rebalance for PDS partitions
> -----------------------------------
>
> Key: IGNITE-17087
> URL: https://issues.apache.org/jira/browse/IGNITE-17087
> Project: Ignite
> Issue Type: Improvement
> Reporter: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
>
> General idea of full rebalance is described in
> https://issues.apache.org/jira/browse/IGNITE-17083
> For persistent storages, there's an option to avoid copy-on-write rebalance
> algorithms if it's desired. Intuitively, it's a preferable option. Each
> storage chooses its own format.
> h2. General idea
> In this case, PDS has a checkpointing feature that saves a consistent state on
> disk. I expect SQL indexes to be in the same partition file as other data.
> For every partition, its state on disk would look like this:
> {code:java}
> part-x.bin
> part-x-1.bin
> part-x-2.bin
> ...
> part-x-n.bin{code}
> part-x.bin is a baseline, and every other file is a delta that should be
> applied to underlying layers to get consistent data. It can be viewed like
> full and incremental backups.
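A minimal sketch of the layered read described above, using an in-memory model rather than real partition files. The class and method names are hypothetical, and it assumes that a higher-numbered delta is newer and overrides older layers:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical in-memory model of a partition: one baseline plus ordered deltas. */
final class LayeredPartition {
    /** Page index -> page content, modelling part-x.bin (the baseline). */
    private final Map<Integer, String> baseline = new HashMap<>();

    /** Deltas in creation order: element 0 models part-x-1.bin, and so on (assumption). */
    private final List<Map<Integer, String>> deltas = new ArrayList<>();

    void writeBaseline(int pageIdx, String content) {
        baseline.put(pageIdx, content);
    }

    /** Starts a new delta layer, as a checkpoint would, and returns it for filling. */
    Map<Integer, String> newDelta() {
        Map<Integer, String> delta = new HashMap<>();
        deltas.add(delta);
        return delta;
    }

    /** Returns the latest version of a page: newest delta first, baseline last. */
    Optional<String> readPage(int pageIdx) {
        for (int i = deltas.size() - 1; i >= 0; i--) {
            String content = deltas.get(i).get(pageIdx);
            if (content != null) {
                return Optional.of(content);
            }
        }
        return Optional.ofNullable(baseline.get(pageIdx));
    }
}
```

A page that never changed after the baseline is served from the baseline, while a page rewritten by several checkpoints is served from the newest delta that contains it.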
> When a rebalance snapshot is required, we could force a checkpoint and then
> *prohibit merging* of new deltas into delta files from the snapshot until
> rebalance is finished. We must guarantee that consistent state can be read
> from disk.
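One possible way to express the "prohibit merging" rule is a pin counter per delta file. This is only a sketch under assumptions: `DeltaMergeGuard` and its methods are hypothetical names, and a real implementation would have to be tied into the checkpointer and compaction logic:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical guard: delta files pinned by a rebalance snapshot must not be merged into. */
final class DeltaMergeGuard {
    /** Delta file index -> number of snapshots currently pinning it. */
    private final Map<Integer, Integer> pins = new HashMap<>();

    /** Called for every delta file that belongs to a newly started snapshot. */
    synchronized void pin(int deltaIdx) {
        pins.merge(deltaIdx, 1, Integer::sum);
    }

    /** Called when a rebalance that used this delta file has finished. */
    synchronized void unpin(int deltaIdx) {
        // Returning null from the remapping function removes the entry.
        pins.computeIfPresent(deltaIdx, (k, v) -> v == 1 ? null : v - 1);
    }

    /** Compaction asks this before merging newer deltas into the given file. */
    synchronized boolean canMergeInto(int deltaIdx) {
        return !pins.containsKey(deltaIdx);
    }
}
```

Counting pins instead of keeping a boolean flag lets several concurrent rebalances share the same delta files.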
> Now, there are several strategies for transferring data:
> * File-based. We can send baseline and delta files as files. Two possible
> issues here:
> ** Files contain duplicated pages, so the volume of data will be bigger than
> necessary.
> ** Baseline file has to be truncated, because some delta pages go directly
> into baseline file as optimization.
> * Page-based. Latest state of every required page is sent separately. Two
> strategies here:
> ** Iterate pages in order of page indexes. This adds overhead on reads, but
> writes are very efficient.
> ** Iterate pages in order of delta files, skipping already read pages in the
> process (like snapshots in GridGain, for example). Little overhead on read,
> but write won't be append-only.
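The two page-based orders can be compared with a small sketch. Layers are modelled as in-memory maps (oldest first, baseline at position 0), all names are hypothetical, and the delta-file variant assumes files are read newest first so that the first copy of a page seen is the latest one:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical comparison of the two page-based send orders. */
final class PageIterationOrders {
    /** Sends pages in ascending page index order: the receiver writes append-only. */
    static List<Integer> byPageIndex(List<Map<Integer, String>> layers) {
        Set<Integer> all = new TreeSet<>();
        for (Map<Integer, String> layer : layers) {
            all.addAll(layer.keySet());
        }
        return new ArrayList<>(all);
    }

    /** Sends pages file by file, newest file first, skipping pages already sent. */
    static List<Integer> byDeltaFile(List<Map<Integer, String>> layers) {
        Set<Integer> seen = new HashSet<>();
        List<Integer> order = new ArrayList<>();
        for (int i = layers.size() - 1; i >= 0; i--) {
            // Sorted within a file to model a sequential read of that file.
            for (Integer pageIdx : new TreeSet<>(layers.get(i).keySet())) {
                if (seen.add(pageIdx)) {
                    order.add(pageIdx);
                }
            }
        }
        return order;
    }
}
```

With a baseline holding pages 0, 1, 2, a first delta holding page 1 and a second delta holding pages 0 and 3, the first method yields 0, 1, 2, 3 and the second yields 0, 3, 1, 2. The second order reads every file sequentially, but the receiver has to write pages out of order.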
> I would argue that slower reads are more appropriate than slower writes.
> Generally speaking, any write should be slower than any read of the same
> size, right?
> Should we implement all strategies and give user a choice? It's hard to
> predict which one is better for which scenario. In the future, I think it
> would be convenient to implement many options, but at first we should stick
> to the simplest one.
> There must be a common "infrastructure" or a framework to stream native
> rebalance snapshots. Data format should be as simple as possible.
> NOTE: of course, it has to be mentioned that this approach might lead to
> inefficient storage space usage. It can be a problem in theory, but in
> practice full rebalance isn't expected to occur often, and even then we
> don't expect that users will rewrite the entire partition data in the span of a
> single rebalance.
> h2. Possible problems
> Given that "raw" data is sent, including SQL indexes, all incomplete indexes
> will be sent incomplete. Maybe we should also send a build state for each
> index so that the receiving side could continue from the right place, not
> from the beginning.
> This problem will be resolved in the future. Currently we don't have indexes
> implemented.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)