Ivan Bessonov created IGNITE-17077:
--------------------------------------
Summary: Implement checkpointIndex for PDS
Key: IGNITE-17077
URL: https://issues.apache.org/jira/browse/IGNITE-17077
Project: Ignite
Issue Type: Improvement
Reporter: Ivan Bessonov
Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for
prerequisites.
h2. General idea
The idea doesn't seem complicated. There will be "setUpdateIndex" and
"getUpdateIndex" methods (names might be different).
* The first one is invoked at the end of every write command, with the RAFT
commit index passed as a parameter. This is done right before releasing the
checkpoint read lock (or whatever name we come up with). More on that
later.
* The second one is invoked at the beginning of every write command to validate
that updates don't come out of order or with gaps. This is the way to guarantee
that IndexMismatchException is thrown at the right time.
So, the write command flow will look like this. All names here are completely
random.
{code:java}
try (ConsistencyLock lock = partition.acquireConsistencyLock()) {
    long updateIndex = partition.getUpdateIndex();
    long raftIndex = writeCommand.raftIndex();

    // Validate before touching any data: no reordering, no gaps.
    if (raftIndex != updateIndex + 1) {
        throw new IndexMismatchException(updateIndex);
    }

    partition.write(writeCommand.row());

    for (Index index : table.indexes(partition)) {
        index.index(writeCommand.row());
    }

    // Last step before the lock is released.
    partition.setUpdateIndex(raftIndex);
}{code}
Some nuances:
* The mismatch exception must be thrown before any data modifications. Storage
content must be intact, otherwise we'll just break it.
* The case above is the simplest one: there's a single "atomic" storage update.
Generally speaking, we can't or sometimes don't want to work this way. Examples
of operations where atomicity this strict is not required:
** Batch insert/update from the transaction.
** A transaction commit might have a huge number of row ids, so we can exhaust
the memory while committing.
* If we split a write operation into several operations, we should externally
guarantee their idempotence. "setUpdateIndex" should be at the end of the last
"atomic" operation, so that the last command could be safely reapplied.
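A minimal sketch of the last point, assuming a commit that is split into chunks. Everything here is hypothetical and exists only to show the ordering: SketchPartition is an in-memory stand-in, commitRow is assumed to be idempotent, and ChunkedCommit is not a proposed API. The index is validated once up front and advanced only together with the final chunk.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Thrown when a command's RAFT index is not the next expected one. */
class IndexMismatchException extends RuntimeException {
    IndexMismatchException(long updateIndex) {
        super("Last applied index: " + updateIndex);
    }
}

/** Hypothetical in-memory partition; it models only what the sketch needs. */
class SketchPartition {
    private final Map<Long, String> rowState = new HashMap<>();
    private long updateIndex;

    long getUpdateIndex() { return updateIndex; }

    void setUpdateIndex(long updateIndex) { this.updateIndex = updateIndex; }

    /** Idempotent by construction: repeating the call leaves the same state. */
    void commitRow(long rowId) { rowState.put(rowId, "COMMITTED"); }

    int committedRows() { return rowState.size(); }
}

class ChunkedCommit {
    /** Commits rowIds in chunks; only the final chunk advances the update index. */
    static void apply(SketchPartition partition, long raftIndex, List<Long> rowIds, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }

        // Validation happens once, before any modification.
        if (raftIndex != partition.getUpdateIndex() + 1) {
            throw new IndexMismatchException(partition.getUpdateIndex());
        }

        int from = 0;
        do {
            int to = Math.min(from + chunkSize, rowIds.size());

            // In the real flow, each chunk would run under its own consistency lock.
            for (long rowId : rowIds.subList(from, to)) {
                partition.commitRow(rowId);
            }

            if (to == rowIds.size()) {
                // Set in the same "atomic" step as the last chunk. If a crash happens
                // earlier, the index is unchanged and the whole command is reapplied.
                partition.setUpdateIndex(raftIndex);
            }

            from = to;
        } while (from < rowIds.size());
    }
}
```

If the node stops after some chunks but before the last one, the stored index still points at the previous command, so the same command passes validation again and the already committed rows are simply rewritten with the same state.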
h2. Implementation
The "set" method could write a value directly into the partition's meta page.
This *will* work. But it's not quite optimal.
The optimal solution is tightly coupled with the way checkpoint should work.
This may not be the right place to describe the issue, but I'll do it
nonetheless. It'll probably get split into another issue one day.
There's a simple way to touch every meta page only once per checkpoint. We just
do it while holding the checkpoint write lock. This way data is consistent. But
this solution is equally {*}bad{*}: it forces us to perform page manipulation
under the write lock. Flushing freelists is enough already. (NOTE: we should
test the performance without the on-heap cache; it'll speed up the checkpoint
start process, thus reducing latency spikes.)
A better way to do this is not having meta pages in page memory whatsoever.
Maybe during the start, but that's it. It's a common practice to have a
pageSize equal to 16 KB. The effective payload of a partition meta page in
Ignite 2.x is just above 100 bytes. I expect it to be way lower in Ignite 3.0.
Having a loaded page for every partition is just a waste of resources; all
required data can be stored on-heap.
Then, let's rely on two simple facts:
* If meta page data is cached on-heap, no one would need to read it from disk.
I should also mention that it will mostly be immutable.
* We can write the partition meta page into every delta file even if the meta
has not changed. In actuality, this will be a very rare situation.
Considering both of these facts, the checkpointer may unconditionally write the
meta page from heap to disk at the beginning of writing the delta file. This
page will become a write-only page, which is basically what we need.
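A rough sketch of that checkpointer step, under several assumptions: the on-heap meta holds only the update index, a list of buffers stands in for the delta file, and all class and method names are made up for illustration. How a consistent value of the index is captured for the checkpoint is deliberately left out.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical on-heap copy of the partition meta; the single field is an assumption. */
final class PartitionMetaSketch {
    private volatile long updateIndex;

    long updateIndex() { return updateIndex; }

    void updateIndex(long updateIndex) { this.updateIndex = updateIndex; }
}

final class DeltaFileWriterSketch {
    /** 16 KB, the page size mentioned above. */
    static final int PAGE_SIZE = 16 * 1024;

    /** Returns pages in the order they would be written; the list stands in for the file. */
    static List<ByteBuffer> pagesToWrite(PartitionMetaSketch meta, List<ByteBuffer> dirtyPages) {
        List<ByteBuffer> out = new ArrayList<>(dirtyPages.size() + 1);

        // The meta page is built from heap and never read back from page memory.
        ByteBuffer metaPage = ByteBuffer.allocate(PAGE_SIZE);
        metaPage.putLong(0, meta.updateIndex()); // absolute put, position stays at 0

        // Written unconditionally and first, even if the meta has not changed.
        out.add(metaPage);
        out.addAll(dirtyPages);

        return out;
    }
}
```

With this shape the meta page never has to be loaded into page memory after start: it is produced on demand and only ever written.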
--
This message was sent by Atlassian Jira
(v8.20.7#820007)