[ 
https://issues.apache.org/jira/browse/IGNITE-17077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17077:
-----------------------------------
    Description: 
Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
prerequisites.
h2. General idea

The idea doesn't seem complicated. There will be "setUpdateIndex" and 
"getUpdateIndex" methods (names might be different).
 * The first is invoked at the end of every write command, with the RAFT commit 
index passed as a parameter. This is done right before releasing the 
checkpoint read lock (or whatever name we come up with). More on that 
later.
 * The second is invoked at the beginning of every write command to validate 
that updates don't come out of order or with gaps. This is the way to guarantee 
that IndexMismatchException is thrown at the right time.

So, the write command flow will look like this. All names here are 
placeholders.

 
{code:java}
try (ConsistencyLock lock = partition.acquireConsistencyLock()) {
    long updateIndex = partition.getUpdateIndex();
    long raftIndex = writeCommand.raftIndex();

    // Updates must arrive strictly in order, with no gaps. The check runs
    // before any data modification, so the storage content stays intact.
    if (raftIndex != updateIndex + 1) {
        throw new IndexMismatchException(updateIndex);
    }

    partition.write(writeCommand.row());

    for (Index index : table.indexes(partition)) {
        index.index(writeCommand.row());
    }

    // Last step: only now does the command count as applied.
    partition.setUpdateIndex(raftIndex);
}{code}
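
For context, this is how the caller side might treat the mismatch during RAFT 
log replay. This is just a sketch under assumptions: the "applyWriteCommand" 
wrapper and the "updateIndex()" accessor on the exception are hypothetical, 
not the actual RAFT integration.
{code:java}
// Hypothetical caller-side handling during RAFT log replay. A command
// whose index isn't the expected "updateIndex + 1" was either already
// applied (safe to skip) or reveals a gap (cannot continue safely).
void replay(WriteCommand command, Partition partition) {
    try {
        applyWriteCommand(command, partition); // the flow shown above
    } catch (IndexMismatchException e) {
        if (command.raftIndex() <= e.updateIndex()) {
            return; // already applied before restart, storage is intact
        }

        throw new IllegalStateException("Gap in update indexes", e);
    }
}{code}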
 

Some nuances:
 * The mismatch exception must be thrown before any data modifications. 
Storage content must stay intact, otherwise we'll simply break it.
 * The case above is the simplest one: a single "atomic" storage update. 
Generally speaking, we can't, or sometimes don't want to, work this way. 
Examples of operations where such strict atomicity is not required:
 ** Batch insert/update from a transaction.
 ** A transaction commit might cover a huge number of row ids; committing 
them all at once could exhaust memory.
 * If we split a write command into several storage operations, we must 
externally guarantee their idempotence. "setUpdateIndex" should be called at 
the end of the last "atomic" operation, so that the whole command can be 
safely reapplied (see the sketch after this list).

h2. Implementation

"set" method could write a value directly into partitions meta page. This 
*will* work. But it's not quite optimal.

The optimal solution is tightly coupled with the way checkpointing should 
work. This may not be the right place to describe that, but I'll do it 
nonetheless; it'll probably be split into a separate issue one day.

There's a simple way to touch every meta page only once per checkpoint: we 
just do it while holding the checkpoint write lock. That way the data is 
consistent. But this solution is equally {*}bad{*}: it forces us to manipulate 
pages under the write lock, and flushing freelists there is costly enough 
already. (NOTE: we should test the performance without the onheap-cache; that 
would speed up the checkpoint start process, thus reducing latency spikes.)

A better way is to not have meta pages in page memory at all, except maybe 
during startup. It's common practice to have a pageSize of 16 KiB, while the 
effective payload of a partition meta page in Ignite 2.x is just above 100 
bytes, and I expect it to be even lower in Ignite 3.0. Keeping a loaded page 
for every partition is just a waste of resources; all the required data can 
be stored on-heap.

Then, let's rely on two simple facts:
 * If the meta page data is cached on-heap, no one needs to read it from 
disk. I should also mention that it will be mostly immutable.
 * We can write the partition meta page into every delta file even if the 
meta has not changed. In practice, an unchanged meta will be a very rare 
situation anyway.

Considering both of these facts, the checkpointer may unconditionally write 
the meta page from heap to disk at the beginning of every delta file. The 
meta page thus becomes a write-only page, which is basically what we need. 
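
A minimal sketch of what the on-heap meta could look like, assuming a simple 
snapshot-on-checkpoint scheme (all names are, again, placeholders):
{code:java}
// Hypothetical on-heap replacement for the partition meta page.
// Mutated under the consistency lock; snapshotted by the checkpointer.
class PartitionMeta {
    private volatile long updateIndex;

    long updateIndex() {
        return updateIndex;
    }

    void updateIndex(long updateIndex) {
        this.updateIndex = updateIndex;
    }

    // Called once per checkpoint: the returned immutable copy is
    // unconditionally serialized into the meta page at the beginning
    // of the delta file, making that page effectively write-only.
    PartitionMetaSnapshot snapshot() {
        return new PartitionMetaSnapshot(updateIndex);
    }
}

// Immutable view that the checkpointer writes to disk.
record PartitionMetaSnapshot(long updateIndex) { }{code}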
h2. Callbacks and RAFT snapshots

I argue against scheduled RAFT snapshots: they would produce a lot of junk 
checkpoints, because a checkpoint is a {*}global operation{*}. Imagine RAFT 
triggering snapshots for 100 partitions in a row. This would result in 100 
minuscule checkpoints that no one needs. So, I'd say, we need two operations:
 * partition.getCheckpointedUpdateIndex();
 * partition.registerCheckpointedUpdateIndexListener(closure);

Both of these methods can be used by RAFT to determine whether it needs to 
truncate its log and to pick the specific commit index for truncation.

In the case of the PDS checkpointer, the implementation of both methods is 
trivial.
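
A hedged sketch of how RAFT could use these two operations; "raftLog", 
"truncatePrefix" and the listener's signature are assumptions of mine, not an 
existing API:
{code:java}
// Hypothetical RAFT-side usage. Polling: truncate the log up to the
// last update index that a finished checkpoint has made durable.
long persisted = partition.getCheckpointedUpdateIndex();

if (persisted > lastTruncatedIndex) {
    raftLog.truncatePrefix(persisted);
}

// Or subscribe instead of polling: the checkpointer invokes the
// listener after the delta file containing the index is fsync'ed.
partition.registerCheckpointedUpdateIndexListener(index ->
    raftLog.truncatePrefix(index)
);{code}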



> Implement checkpointIndex for PDS
> ---------------------------------
>
>                 Key: IGNITE-17077
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17077
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
