Ivan Bessonov created IGNITE-16306:
--------------------------------------
Summary: [POC] In-Memory storage integration
Key: IGNITE-16306
URL: https://issues.apache.org/jira/browse/IGNITE-16306
Project: Ignite
Issue Type: Improvement
Affects Versions: 3.0.0-alpha3
Reporter: Ivan Bessonov
Goals
We need an in-memory store, similar to Ignite-2. This store must reuse common
replication infrastructure, in other words, be integrated into raft STM and
support transactions.
The raft protocol implies some persistent state: metadata, logs, snapshots.
The simplest solution is to write the raft persistent state to disk (this is
already implemented for
org.apache.ignite.internal.storage.basic.ConcurrentHashMapPartitionStorage).
The drawback is that this is not a fully in-memory solution and does not differ
much from a database cache.
We can go the pure in-memory way - keep all raft state in a volatile store.
h3. Raft metadata
Must not be persisted for a pure in-memory cluster, because the state is always
lost on restart.
Note: a node must always be removed from the raft group when it’s removed from
baseline by auto adjust and should join as new (in-memory always works with
auto-adjust similarly to Ignite 2). *Out of scope.*
h3. Log store
Has a working in-memory implementation (currently used in tests):
org.apache.ignite.raft.jraft.storage.impl.LocalLogStorage
Note: generally speaking, log is only required for "historical rebalancing"
after the snapshot rebalance. It won't be needed at all once it is possible to
apply snapshot and concurrent updates at the same time, for example when a
solution like mvcc is implemented.
h3. Snapshots
Can be implemented over any kv store extended with some kind of Copy-On-Write
support. Not implemented currently. More details below.
h3. COW buffer
To create an in-memory snapshot, the snapshot data is written to a separate
in-memory buffer. The buffer is populated from the state machine update thread
either by the update operations or by a snapshot advance mini-task which is
submitted to the state machine update thread as needed.
To maintain a snapshot, the state machine needs to keep a snapshot iterator
boundary key. If a key being updated is smaller than or equal to the boundary
key, no additional action is needed because the snapshot iterator has already
processed this key. If a key being updated is larger than the boundary key, the
old version of the key is eagerly put into the snapshot buffer and the key is
marked with the snapshot ID (so that the key is skipped during further
iteration). The snapshot advance mini-task iterates over the next batch of keys
starting from the boundary key and puts into the snapshot buffer only the keys
that are not yet marked with the snapshot ID.
This approach has memory requirements similar to the first alternative, but
does not require modifying the storage tree to store multiple versions of the
same key. It also allows transparent offloading of the snapshot buffer to disk,
which can reduce memory requirements. It is simpler to implement as well,
because the code is essentially single-threaded and only requires
synchronization for the in-memory buffer. The downside is that snapshot advance
tasks will increase the tail latency of state machine update operations.
Can be implemented on top of any kv store.
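The boundary-key scheme described above could be sketched roughly as follows.
This is only an illustration under assumptions: the class CowSnapshotSketch,
its methods, and the use of a TreeMap with String keys and values are all
hypothetical and are not part of Ignite or JRaft. All calls are assumed to come
from the single state machine update thread, so the sketch has no
synchronization.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical sketch, not an Ignite API. Call only from the update thread. */
class CowSnapshotSketch {
    /** Live value plus the ID of the snapshot that already handled this key. */
    private static final class Entry {
        String value;
        long snapshotId; // 0 = not marked by any snapshot

        Entry(String value) { this.value = value; }
    }

    private final TreeMap<String, Entry> store = new TreeMap<>();
    private final Map<String, String> snapshotBuffer = new HashMap<>();
    private long snapshotId;    // ID of the current (or last) snapshot
    private String boundaryKey; // last key processed by the iterator, null = none
    private boolean active;     // true while a snapshot is in progress

    public void startSnapshot() {
        snapshotId++;
        boundaryKey = null;
        snapshotBuffer.clear();
        active = true;
    }

    public void put(String key, String value) {
        Entry e = store.get(key);
        if (e == null) {
            e = new Entry(value);
            store.put(key, e);
            // A key created after the snapshot started must be skipped later.
            if (beyondBoundary(key)) e.snapshotId = snapshotId;
            return;
        }
        saveOldVersion(key, e);
        e.value = value;
    }

    public void remove(String key) {
        Entry e = store.remove(key);
        if (e != null) saveOldVersion(key, e);
    }

    /** Snapshot advance mini-task: returns true if more keys remain. */
    public boolean advance(int batchSize) {
        if (!active) return false;
        NavigableMap<String, Entry> tail =
            boundaryKey == null ? store : store.tailMap(boundaryKey, false);
        int processed = 0;
        for (Map.Entry<String, Entry> me : tail.entrySet()) {
            if (processed == batchSize) return true;
            Entry e = me.getValue();
            // Keys marked with the snapshot ID were already handled by an update.
            if (e.snapshotId != snapshotId) snapshotBuffer.put(me.getKey(), e.value);
            boundaryKey = me.getKey();
            processed++;
        }
        active = false; // iteration reached the end of the store
        return false;
    }

    public String get(String key) {
        Entry e = store.get(key);
        return e == null ? null : e.value;
    }

    public Map<String, String> snapshot() { return new HashMap<>(snapshotBuffer); }

    private boolean beyondBoundary(String key) {
        return active && (boundaryKey == null || key.compareTo(boundaryKey) > 0);
    }

    /** Eagerly copies the old version of a key the iterator has not reached. */
    private void saveOldVersion(String key, Entry e) {
        if (beyondBoundary(key) && e.snapshotId != snapshotId) {
            snapshotBuffer.put(key, e.value);
            e.snapshotId = snapshotId;
        }
    }
}
```

In this sketch an update pays for at most one extra buffer write per key per
snapshot, while advance(batchSize) bounds how long a single mini-task occupies
the update thread. A smaller batch size would trade snapshot duration for
lower tail latency.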
Note: we should consider the possibility of streaming the snapshot instead of
storing it in memory until it is completed.