[
https://issues.apache.org/jira/browse/IGNITE-16655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Roman Puchkovskiy reassigned IGNITE-16655:
------------------------------------------
Assignee: Roman Puchkovskiy
> Volatile RAFT log for pure in-memory storages
> ---------------------------------------------
>
> Key: IGNITE-16655
> URL: https://issues.apache.org/jira/browse/IGNITE-16655
> Project: Ignite
> Issue Type: Improvement
> Reporter: Sergey Chugunov
> Assignee: Roman Puchkovskiy
> Priority: Major
> Labels: iep-74, ignite-3
>
> h3. Original issue description
> For in-memory storages, Raft logging can be optimized: we don't need the log
> to be durable while the topology is stable.
> Each write can go directly to the in-memory storage at a much lower cost than
> syncing it to disk, so writing the Raft log to disk can be avoided.
> Since nodes don't persist any state and always join the cluster clean, a full
> snapshot always has to be transferred during rebalancing - there is no need
> to keep a long Raft log for historical rebalancing purposes.
> So we need an API for the Raft component that makes the logging process
> configurable.
> h3. More detailed description
> Apparently, we can't completely skip writing the log. There are several
> situations where it has to be kept:
> * During a regular workload, each node needs to retain a small portion of the
> log in case it becomes the leader. There might be a number of "slow" nodes
> outside of the "quorum" that need older entries re-sent to them. A log entry
> can be truncated only when all nodes have acknowledged it or failed;
> otherwise the entry must be preserved.
> * During a clean node join - the joining node will need to apply the part of
> the log that wasn't included in the full-rebalance snapshot. So everything
> starting from the snapshot's applied index has to be preserved.
> It feels like the second case is just a special case of the first one - we
> can't truncate the log until we've received all acks, and we can't receive an
> ack from the joining node until it finishes its rebalancing procedure.
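The truncation rule above can be sketched in a few lines (a minimal illustration; the class and method names are hypothetical, not actual Ignite or JRaft API):

```java
import java.util.Collection;
import java.util.List;

// Sketch of the truncation rule: the log may only be truncated up to the
// smallest index acknowledged by every live peer. A joining node that has
// not finished rebalancing reports only the snapshot's applied index, so
// the log tail it still needs is preserved automatically.
class TruncationPolicy {
    /**
     * @param ackedIndexes last log index acknowledged by each peer that is
     *                     still considered alive (failed peers are excluded).
     * @return highest log index that is safe to truncate (inclusive).
     */
    static long safeTruncationIndex(Collection<Long> ackedIndexes) {
        return ackedIndexes.stream().mapToLong(Long::longValue).min().orElse(0L);
    }

    public static void main(String[] args) {
        // Three in-sync peers and one joining node stuck at snapshot index 10:
        // nothing above index 10 may be truncated until it catches up.
        System.out.println(safeTruncationIndex(List.of(100L, 98L, 97L, 10L)));
    }
}
```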
> So it all comes down to aggressive log truncation to keep the log short.
> In practice the preserved log can still be quite big, so a disk offloading
> mechanism must be available.
> The easiest way to achieve this is to write into a RocksDB instance with WAL
> disabled. It will keep everything in memory until a flush, and even then the
> amount of flushed data will be small on a stable topology. The absence of a
> WAL is not an issue: the entire RocksDB instance can be dropped on restart,
> since it's supposed to be volatile.
> To avoid even the smallest flush, we can use an additional volatile
> structure, like a ring buffer or a concurrent map, to store part of the log,
> and transfer records into RocksDB only when that structure overflows. This
> sounds more complicated and makes memory management more difficult, but we
> should take it into consideration anyway.
> * Potentially, we could use a volatile page memory region for this purpose,
> since it already provides good control over the amount of memory used. But
> memory overflow would have to be handled carefully: usually it's treated as
> an error and might even cause a node failure.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)