[
https://issues.apache.org/jira/browse/IGNITE-17077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Bessonov reassigned IGNITE-17077:
--------------------------------------
Assignee: Ivan Bessonov
> Implement checkpointIndex for PDS
> ---------------------------------
>
> Key: IGNITE-17077
> URL: https://issues.apache.org/jira/browse/IGNITE-17077
> Project: Ignite
> Issue Type: Improvement
> Reporter: Ivan Bessonov
> Assignee: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
>
> Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for
> prerequisites.
> h2. General idea
> The idea doesn't seem complicated. There will be "setUpdateIndex" and
> "getUpdateIndex" methods (names might be different).
> * First one is invoked at the end of every write command, with RAFT commit
> index being passed as a parameter. This is done right before releasing
> checkpoint read lock (or whatever the name we will come up with). More on
> that later.
> * Second one is invoked at the beginning of every write command to validate
> that updates don't come out of order or with gaps. This is the way to
> guarantee that IndexMismatchException can be thrown at the right time.
> So, the write command flow will look like this. All names here are completely
> random.
>
> {code:java}
> try (ConsistencyLock lock = partition.acquireConsistencyLock()) {
>     long updateIndex = partition.getUpdateIndex();
>     long raftIndex = writeCommand.raftIndex();
>     // Reject out-of-order commands and gaps before touching any data.
>     if (raftIndex != updateIndex + 1) {
>         throw new IndexMismatchException(updateIndex);
>     }
>     partition.write(writeCommand.row());
>     for (Index index : table.indexes(partition)) {
>         index.index(writeCommand.row());
>     }
>     // Last step, right before the lock is released.
>     partition.setUpdateIndex(raftIndex);
> }{code}
>
> Some nuances:
> * Mismatch exception must be thrown before any data modifications. Storage
> content must be intact, otherwise we'll just break it.
> * Case above is the simplest one - there's a single "atomic" storage update.
> Generally speaking, we can't or sometimes don't want to work this way.
> Examples of operations, where atomicity this strict is not required:
> ** Batch insert/update from the transaction.
> ** Transaction commit might have a huge number of row ids, we can exhaust
> the memory while committing.
> * If we split a write operation into several operations, we should
> externally guarantee their idempotence. "setUpdateIndex" should be at the end
> of the last "atomic" operation, so that the last command could be safely
> reapplied.
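The split-write nuance can be illustrated with a minimal in-memory sketch. Everything here is hypothetical: SketchPartition, applyChunk and applyBatch are made-up names, a HashMap stands in for real storage, and locking is omitted. It only shows the ordering: the index check comes first, idempotent chunks follow, and the update index moves with the last chunk.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Thrown when a command arrives out of order; carries the last applied index. */
class IndexMismatchException extends RuntimeException {
    final long lastAppliedIndex;

    IndexMismatchException(long lastAppliedIndex) {
        super("Last applied index: " + lastAppliedIndex);
        this.lastAppliedIndex = lastAppliedIndex;
    }
}

/** Hypothetical in-memory stand-in for a partition storage (no locking). */
class SketchPartition {
    private final Map<String, String> rows = new HashMap<>();
    private long updateIndex;

    long getUpdateIndex() {
        return updateIndex;
    }

    void setUpdateIndex(long raftIndex) {
        updateIndex = raftIndex;
    }

    String get(String key) {
        return rows.get(key);
    }

    /** Idempotent by construction: putting the same key/value twice changes nothing. */
    void applyChunk(Map<String, String> chunk) {
        rows.putAll(chunk);
    }

    /** Applies one RAFT command split into several chunks. */
    void applyBatch(long raftIndex, List<Map<String, String>> chunks) {
        // The check happens before any modification, so storage stays intact on mismatch.
        if (raftIndex != updateIndex + 1) {
            throw new IndexMismatchException(updateIndex);
        }
        for (Map<String, String> chunk : chunks) {
            applyChunk(chunk);
        }
        // The index advances only after the last chunk, so a command interrupted
        // halfway is seen as not applied and can be replayed from the start.
        setUpdateIndex(raftIndex);
    }
}
```

If the node stops after the first chunk, getUpdateIndex() still returns the previous value, so replaying the whole command passes the check and rewrites the same rows.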
> h2. Implementation
> "set" method could write a value directly into the partition's meta page.
> This *will* work. But it's not quite optimal.
> Optimal solution is tightly coupled with the way checkpoint should work. This
> may not be the right place to describe the issue, but I do it nonetheless.
> It'll probably get split into another issue one day.
> There's a simple way to touch every meta page only once per checkpoint. We
> just do it while holding checkpoint write lock. This way data is consistent.
> But this solution is equally {*}bad{*}: it forces us to perform page
> manipulation under the write lock. Flushing freelists is enough already.
> (NOTE: we should test the performance without onheap-cache, it'll speed up
> the checkpoint start process, thus reducing latency spikes)
> A better way to do this is not having meta pages in page memory whatsoever.
> Maybe during the start, but that's it. It's common practice to have a
> pageSize equal to 16 KB. Effective payload of a partition meta page in
> Ignite 2.x is just above 100 bytes. I expect it to be way lower in Ignite
> 3.0. Having a loaded page for every partition is just a waste of resources,
> all required data can be stored on-heap.
> Then, let's rely on two simple facts:
> * If meta page data is cached on-heap, no one would need to read it from
> disk. I should also mention that it will mostly be immutable.
> * We can write partition meta page into every delta file even if meta has
> not changed. In actuality, this will be a very rare situation.
> Considering both of these facts, checkpointer may unconditionally write meta
> page from heap to disk at the beginning of writing the delta file. This page
> will become a write-only page, which is basically what we need.
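A rough sketch of that write path follows, under loud assumptions: OnHeapPartitionMeta and DeltaFileSketch are invented names, the meta holds only the update index, the page layout (a single long at offset 0) is made up, and a list of buffers stands in for the delta file on disk.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical on-heap copy of the partition meta; never loaded into page memory. */
class OnHeapPartitionMeta {
    private final AtomicLong updateIndex = new AtomicLong();

    void setUpdateIndex(long raftIndex) {
        updateIndex.set(raftIndex);
    }

    long getUpdateIndex() {
        return updateIndex.get();
    }

    /** Serializes the meta into a fresh page buffer (layout is an assumption). */
    ByteBuffer toPage(int pageSize) {
        ByteBuffer page = ByteBuffer.allocate(pageSize);
        page.putLong(0, updateIndex.get());
        return page;
    }
}

/** Hypothetical stand-in for a delta file: an ordered list of pages. */
class DeltaFileSketch {
    final List<ByteBuffer> pages = new ArrayList<>();

    static DeltaFileSketch write(OnHeapPartitionMeta meta, List<ByteBuffer> dirtyPages, int pageSize) {
        DeltaFileSketch file = new DeltaFileSketch();
        // The meta page goes first and is written unconditionally, changed or not.
        file.pages.add(meta.toPage(pageSize));
        file.pages.addAll(dirtyPages);
        return file;
    }
}
```

Nothing in this sketch ever reads the meta page back, which matches the write-only property described above. At which moment the checkpointer should capture the index value is left open here.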
> h2. Callbacks and RAFT snapshots
> I argue against scheduled RAFT snapshots. They will produce a lot of junk
> checkpoints. This is because checkpoint is a {*}global operation{*}. Imagine
> RAFT triggering snapshots for 100 partitions in a row. This will result in
> 100 minuscule checkpoints that no one needs. So, I'd say, we need two
> operations:
> * partition.getCheckpointerUpdateIndex();
> * partition.registerCheckpointedUpdateIndexListener(closure);
> Both of these methods could be used by RAFT to determine whether it needs to
> truncate its log and to define a specific commit index for truncation.
> In case of PDS checkpointer, implementation for both of these methods is
> trivial.
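One possible shape for these two operations, as a sketch only: CheckpointedIndexTracker and onCheckpointFinished are hypothetical names, the method names follow the ticket text, and the checkpointer is assumed to report the update index that the finished checkpoint covers.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.LongConsumer;

/** Hypothetical per-partition tracker of the last update index made durable. */
class CheckpointedIndexTracker {
    private volatile long checkpointedUpdateIndex;
    private final List<LongConsumer> listeners = new CopyOnWriteArrayList<>();

    /** Last update index covered by a finished checkpoint. */
    long getCheckpointerUpdateIndex() {
        return checkpointedUpdateIndex;
    }

    /** The closure receives the new checkpointed index after every checkpoint. */
    void registerCheckpointedUpdateIndexListener(LongConsumer closure) {
        listeners.add(closure);
    }

    /** Assumed to be called by the checkpointer once the checkpoint is durable. */
    void onCheckpointFinished(long coveredUpdateIndex) {
        checkpointedUpdateIndex = coveredUpdateIndex;
        for (LongConsumer listener : listeners) {
            listener.accept(coveredUpdateIndex);
        }
    }
}
```

A RAFT-side caller could register a closure that truncates its log up to the received index, and could poll getCheckpointerUpdateIndex() on startup to decide whether truncation is needed at all.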