Ivan Bessonov created IGNITE-17077:
--------------------------------------

             Summary: Implement checkpointIndex for PDS
                 Key: IGNITE-17077
                 URL: https://issues.apache.org/jira/browse/IGNITE-17077
             Project: Ignite
          Issue Type: Improvement
            Reporter: Ivan Bessonov


Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
prerequisites.
h2. General idea

The idea doesn't seem complicated. There will be "setUpdateIndex" and 
"getUpdateIndex" methods (the names might be different).
 * The first one is invoked at the end of every write command, with the RAFT 
commit index passed as a parameter. This is done right before releasing the 
checkpoint read lock (or whatever name we come up with). More on that later.
 * The second one is invoked at the beginning of every write command to validate 
that updates don't come out of order or with gaps. This is how we guarantee that 
an IndexMismatchException can be thrown at the right time.

So, the write command flow will look like this. All names here are completely 
made up.

 
{code:java}
try (ConsistencyLock lock = partition.acquireConsistencyLock()) {
    long updateIndex = partition.getUpdateIndex();
    long raftIndex = writeCommand.raftIndex();

    // The mismatch must be detected before any data is modified.
    if (raftIndex != updateIndex + 1) {
        throw new IndexMismatchException(updateIndex);
    }

    partition.write(writeCommand.row());

    for (Index index : table.indexes(partition)) {
        index.index(writeCommand.row());
    }

    // Advance the update index only after all modifications are applied.
    partition.setUpdateIndex(raftIndex);
}{code}
 

Some nuances:
 * The mismatch exception must be thrown before any data modifications. The 
storage content must stay intact, otherwise we'll simply break it.
 * The case above is the simplest one - a single "atomic" storage update. 
Generally speaking, we can't, or sometimes don't want to, work this way. 
Examples of operations where such strict atomicity is not required:
 ** Batch insert/update from a transaction.
 ** A transaction commit might involve a huge number of row ids; we could 
exhaust memory while committing.
 * If we split a write operation into several operations, we must externally 
guarantee their idempotence. "setUpdateIndex" should be called at the end of the 
last "atomic" operation, so that the last command can be safely reapplied (see 
the sketch after this list).

h2. Implementation

"set" method could write a value directly into partitions meta page. This 
*will* work. But it's not quite optimal.

The optimal solution is tightly coupled with the way the checkpoint should work. 
This may not be the right place to describe the issue, but I'll do it 
nonetheless. It'll probably get split into a separate issue one day.

There's a simple way to touch every meta page only once per checkpoint: we just 
do it while holding the checkpoint write lock. This way the data is consistent. 
But this solution is equally {*}bad{*}: it forces us to perform page 
manipulations under the write lock, and flushing freelists there is already 
enough of a burden. (NOTE: we should test the performance without the onheap 
cache; dropping it would speed up the checkpoint start, thus reducing latency 
spikes.)

A better way is to not have meta pages in page memory at all - maybe during 
start-up, but that's it. It's common practice to use a pageSize of 16Kb, while 
the effective payload of a partition meta page in Ignite 2.x is just above 100 
bytes, and I expect it to be even lower in Ignite 3.0. Keeping a loaded page for 
every partition is just a waste of resources; all the required data can be 
stored on-heap (see the sketch below).
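
A minimal sketch of what such an on-heap meta holder might look like (all names, 
fields and offsets here are hypothetical; the real layout will differ):
{code:java}
import java.nio.ByteBuffer;

/** On-heap replica of the partition meta; the page itself never lives in page memory. */
class PartitionMeta {
    /** Hypothetical offset of the update index field within the meta page. */
    private static final int UPDATE_INDEX_OFFSET = 8;

    /** Last applied RAFT commit index, a.k.a. update index. */
    private volatile long updateIndex;

    /** Other, mostly immutable fields: tree root ids, freelist root ids, etc. */
    private volatile long versionChainTreeRootPageId;
    private volatile long freeListRootPageId;

    long updateIndex() {
        return updateIndex;
    }

    void updateIndex(long updateIndex) {
        this.updateIndex = updateIndex;
    }

    /** Serializes the current on-heap state into a page buffer for the checkpoint. */
    void writeTo(ByteBuffer pageBuffer) {
        pageBuffer.putLong(UPDATE_INDEX_OFFSET, updateIndex);
        // ... write the remaining fields at their offsets ...
    }
}{code}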

Then, let's rely on two simple facts:
 * If the meta page data is cached on-heap, no one will need to read it from 
disk. I should also mention that it will be mostly immutable.
 * We can write the partition meta page into every delta file even if the meta 
has not changed. In actuality, an unchanged meta will be a very rare situation 
anyway.

Considering both of these facts, the checkpointer may unconditionally write the 
meta page from heap to disk at the beginning of every delta file. The meta page 
thus becomes a write-only page, which is basically what we need.
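
A rough sketch of how the checkpointer could do this, under the same assumptions 
as above ("DeltaFilePageStore", "writeDirtyPage" and the rest are made-up names, 
not the real API):
{code:java}
void writePartitionDelta(DeltaFilePageStore deltaFile, PartitionMeta meta, Collection<FullPageId> dirtyPages) {
    // Snapshot the on-heap meta into a page buffer and write it unconditionally,
    // whether or not it changed since the previous checkpoint.
    ByteBuffer metaPageBuffer = ByteBuffer.allocate(pageSize).order(ByteOrder.nativeOrder());
    meta.writeTo(metaPageBuffer);
    deltaFile.write(partitionMetaPageId, metaPageBuffer);

    // The remaining dirty pages are written from page memory as usual.
    for (FullPageId pageId : dirtyPages) {
        writeDirtyPage(deltaFile, pageId);
    }
}{code}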


