[ 
https://issues.apache.org/jira/browse/IGNITE-16306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Chugunov updated IGNITE-16306:
-------------------------------------
    Description: 
h3. Goals

We need an in-memory store, similar to Ignite-2. This store must reuse common 
replication infrastructure, in other words, be integrated into raft STM and 
support transactions.

The raft protocol implies some persistent state: metadata, logs, snapshot.

Simplest solution - write a raft persistent state on disk (this is already 
implemented for 
org.apache.ignite.internal.storage.basic.ConcurrentHashMapPartitionStorage). 

Drawback - not fully in-memory solution, doesn't much differ from a database 
cache

We can go the pure in-memory way - keep all raft state in a volatile store.
h3. Raft metadata

Must not be persisted for a pure in-memory cluster, because the state is always 
lost on restart. 

Note: a node must always be removed from the raft group when it’s removed from 
baseline by auto adjust and should join as new (in-memory always works with 
auto-adjust similarly to Ignite 2). *Out of scope.*
h3. Log store

Has working in-memory implementation (currently used in tests): 
org.apache.ignite.raft.jraft.storage.impl.LocalLogStorage

Note: generally speaking, log is only required for "historical rebalancing" 
after the snapshot rebalance. It won't be needed at all once it is possible to 
apply snapshot and concurrent updates at the same time, for example when a 
solution like mvcc is implemented.
h3. Snapshots

Can be implemented over any kv store extended with some kind of Copy-On-Write 
support. Not implemented currently. More details below.
h3. COW buffer

To create an in-memory snapshot, the snapshot data is written to a separate 
in-memory buffer. The buffer is populated from the state machine update thread 
either by the update operations or by a snapshot advance mini-task which is 
submitted to the state machine update thread as needed.

To maintain a snapshot, the state machine needs to keep an snapshot iterator 
boundary key. If a key being updated is smaller or equal than the boundary key, 
there is no need in any additional action because the snapshot iterator has 
already processed this key. If a key being updated is larger than the boundary 
key, the old version of the key is eagerly put to the snapshot buffer and the 
key is marked with snapshot ID (so that the key is skipped during further 
iteration). Snapshot advance mini-task iterates over a next batch of the keys 
starting from the boundary key and puts to the snapshot buffer only keys that 
are not yet marked by the snapshot ID.

This approach has similar memory requirements to the first alternative, but 
does not require to modify the storage tree so that it can store multiple 
versions of the same key. This approach, however, allows for transparent 
snapshot buffer offloading to disk which can reduce memory requirements. It is 
also simpler in implementation because the code is essentially single-threaded 
and only requires synchronization for the in-memory buffer. The downside is 
that snapshot advance tasks will increase tail latency of state machine update 
operations.

Can be implemented on top of any kv store.

Note: we should consider the possibility of streaming the snapshot instead of 
storing it in memory until it is completed.

  was:
Goals

We need an in-memory store, similar to Ignite-2. This store must reuse common 
replication infrastructure, in other words, be integrated into raft STM and 
support transactions.

The raft protocol implies some persistent state: metadata, logs, snapshot.

Simplest solution - write a raft persistent state on disk (this is already 
implemented for 
org.apache.ignite.internal.storage.basic.ConcurrentHashMapPartitionStorage). 

Drawback - not fully in-memory solution, doesn't much differ from a database 
cache

We can go the pure in-memory way - keep all raft state in a volatile store.
h3. Raft metadata

Must not be persisted for a pure in-memory cluster, because the state is always 
lost on restart. 

Note: a node must always be removed from the raft group when it’s removed from 
baseline by auto adjust and should join as new (in-memory always works with 
auto-adjust similarly to Ignite 2). *Out of scope.*
h3. Log store

Has working in-memory implementation (currently used in tests): 
org.apache.ignite.raft.jraft.storage.impl.LocalLogStorage

Note: generally speaking, log is only required for "historical rebalancing" 
after the snapshot rebalance. It won't be needed at all once it is possible to 
apply snapshot and concurrent updates at the same time, for example when a 
solution like mvcc is implemented.
h3. Snapshots

Can be implemented over any kv store extended with some kind of Copy-On-Write 
support. Not implemented currently. More details below.
h3. COW buffer

To create an in-memory snapshot, the snapshot data is written to a separate 
in-memory buffer. The buffer is populated from the state machine update thread 
either by the update operations or by a snapshot advance mini-task which is 
submitted to the state machine update thread as needed.

To maintain a snapshot, the state machine needs to keep an snapshot iterator 
boundary key. If a key being updated is smaller or equal than the boundary key, 
there is no need in any additional action because the snapshot iterator has 
already processed this key. If a key being updated is larger than the boundary 
key, the old version of the key is eagerly put to the snapshot buffer and the 
key is marked with snapshot ID (so that the key is skipped during further 
iteration). Snapshot advance mini-task iterates over a next batch of the keys 
starting from the boundary key and puts to the snapshot buffer only keys that 
are not yet marked by the snapshot ID.

This approach has similar memory requirements to the first alternative, but 
does not require to modify the storage tree so that it can store multiple 
versions of the same key. This approach, however, allows for transparent 
snapshot buffer offloading to disk which can reduce memory requirements. It is 
also simpler in implementation because the code is essentially single-threaded 
and only requires synchronization for the in-memory buffer. The downside is 
that snapshot advance tasks will increase tail latency of state machine update 
operations.

Can be implemented on top of any kv store.

Note: we should consider the possibility of streaming the snapshot instead of 
storing it in memory until it is completed.


> [POC] In-Memory storage integration
> -----------------------------------
>
>                 Key: IGNITE-16306
>                 URL: https://issues.apache.org/jira/browse/IGNITE-16306
>             Project: Ignite
>          Issue Type: Improvement
>    Affects Versions: 3.0.0-alpha3
>            Reporter: Ivan Bessonov
>            Priority: Major
>              Labels: iep-74, ignite-3
>
> h3. Goals
> We need an in-memory store, similar to Ignite-2. This store must reuse common 
> replication infrastructure, in other words, be integrated into raft STM and 
> support transactions.
> The raft protocol implies some persistent state: metadata, logs, snapshot.
> Simplest solution - write a raft persistent state on disk (this is already 
> implemented for 
> org.apache.ignite.internal.storage.basic.ConcurrentHashMapPartitionStorage). 
> Drawback - not fully in-memory solution, doesn't much differ from a database 
> cache
> We can go the pure in-memory way - keep all raft state in a volatile store.
> h3. Raft metadata
> Must not be persisted for a pure in-memory cluster, because the state is 
> always lost on restart. 
> Note: a node must always be removed from the raft group when it’s removed 
> from baseline by auto adjust and should join as new (in-memory always works 
> with auto-adjust similarly to Ignite 2). *Out of scope.*
> h3. Log store
> Has working in-memory implementation (currently used in tests): 
> org.apache.ignite.raft.jraft.storage.impl.LocalLogStorage
> Note: generally speaking, log is only required for "historical rebalancing" 
> after the snapshot rebalance. It won't be needed at all once it is possible 
> to apply snapshot and concurrent updates at the same time, for example when a 
> solution like mvcc is implemented.
> h3. Snapshots
> Can be implemented over any kv store extended with some kind of Copy-On-Write 
> support. Not implemented currently. More details below.
> h3. COW buffer
> To create an in-memory snapshot, the snapshot data is written to a separate 
> in-memory buffer. The buffer is populated from the state machine update 
> thread either by the update operations or by a snapshot advance mini-task 
> which is submitted to the state machine update thread as needed.
> To maintain a snapshot, the state machine needs to keep an snapshot iterator 
> boundary key. If a key being updated is smaller or equal than the boundary 
> key, there is no need in any additional action because the snapshot iterator 
> has already processed this key. If a key being updated is larger than the 
> boundary key, the old version of the key is eagerly put to the snapshot 
> buffer and the key is marked with snapshot ID (so that the key is skipped 
> during further iteration). Snapshot advance mini-task iterates over a next 
> batch of the keys starting from the boundary key and puts to the snapshot 
> buffer only keys that are not yet marked by the snapshot ID.
> This approach has similar memory requirements to the first alternative, but 
> does not require to modify the storage tree so that it can store multiple 
> versions of the same key. This approach, however, allows for transparent 
> snapshot buffer offloading to disk which can reduce memory requirements. It 
> is also simpler in implementation because the code is essentially 
> single-threaded and only requires synchronization for the in-memory buffer. 
> The downside is that snapshot advance tasks will increase tail latency of 
> state machine update operations.
> Can be implemented on top of any kv store.
> Note: we should consider the possibility of streaming the snapshot instead of 
> storing it in memory until it is completed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to