[
https://issues.apache.org/jira/browse/IGNITE-15818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kirill Tkalenko updated IGNITE-15818:
-------------------------------------
Fix Version/s: 3.0.0-alpha6
> [Native Persistence 3.0] Checkpoint, lifecycle and file store refactoring and
> re-implementation
> -----------------------------------------------------------------------------------------------
>
> Key: IGNITE-15818
> URL: https://issues.apache.org/jira/browse/IGNITE-15818
> Project: Ignite
> Issue Type: Task
> Reporter: Sergey Chugunov
> Assignee: Kirill Tkalenko
> Priority: Major
> Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>
> h2. Goal
> Port and refactor the core classes implementing the page-based persistent store
> in Ignite 2.x: GridCacheOffheapManager, GridCacheDatabaseSharedManager,
> PageMemoryImpl, Checkpointer, FileWriteAheadLogManager.
> A new checkpoint implementation to avoid excessive logging.
> A clarified store lifecycle to avoid the complicated and invasive custom
> lifecycle code managed mostly by DatabaseSharedManager.
> h2. Items to pay attention to
> A new checkpoint implementation based on split-file storage, with a new page
> index structure to maintain the disk-memory page mapping.
> The file page store implementation should be extracted from
> GridCacheOffheapManager into a separate entity; the target implementation should
> support the new version of the checkpoint (a split-file store to enable an
> always-consistent store and to eliminate the binary recovery phase).
> Support of big pages (256+ kB).
> Support of throttling algorithms.
> h2. References
> New checkpoint design overview is available
> [here|https://github.com/apache/ignite-3/blob/ignite-14647/modules/vault/README.md]
> h2. Thoughts
> Although there is a technical possibility to have independent checkpoints for
> different data regions, managing them could be a nightmare; it's definitely in
> the realm of optimizations and out of scope right now.
> So, let's assume that there's one good old checkpoint process. There's still a
> requirement to have checkpoint markers, but they will not have a reference to the
> WAL, because there's no WAL. Instead, we will have to store the RAFT log revision
> per partition. Or not; I'm not that familiar with the recovery procedure that's
> currently in development.
> Unlike checkpoints in Ignite 2.x, which had DO and REDO operations, the new
> version will have DO and UNDO. This drastically simplifies both the checkpoint
> itself and node recovery, but it complicates data access.
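> As an illustration of that extra complexity, here is a minimal sketch of a
> layered page read, assuming hypothetical names (this is not the actual Ignite 3
> API): the read probes the checkpoint delta layers from newest to oldest through
> their pageId -> pageIndex maps and falls back to the main partition file.
> {code:java}
> // Hypothetical sketch of a layered page read; all names and types are illustrative.
> import java.nio.ByteBuffer;
> import java.util.List;
> import java.util.Map;
>
> class LayeredPartitionStore {
>     /** Reads a page by its physical index within a file. */
>     interface PageFile {
>         ByteBuffer readPage(int pageIndex);
>     }
>
>     /** One checkpoint delta file plus its pageId -> pageIndex mapping. */
>     record DeltaLayer(PageFile file, Map<Long, Integer> mapping) {}
>
>     private final PageFile mainFile;            // main partition file
>     private final List<DeltaLayer> deltaLayers; // ordered from newest to oldest
>
>     LayeredPartitionStore(PageFile mainFile, List<DeltaLayer> deltaLayers) {
>         this.mainFile = mainFile;
>         this.deltaLayers = deltaLayers;
>     }
>
>     ByteBuffer readPage(long pageId, int mainFileIndex) {
>         // Newer delta files shadow older ones and the main partition file.
>         for (DeltaLayer layer : deltaLayers) {
>             Integer idx = layer.mapping().get(pageId);
>             if (idx != null)
>                 return layer.file().readPage(idx);
>         }
>         // The page was not touched by any un-compacted checkpoint: read the main file.
>         return mainFile.readPage(mainFileIndex);
>     }
> }
> {code}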
> There will be two processes sharing the storage resource: the "checkpointer" and
> the "compactor". Let's examine what the compactor should or shouldn't do:
> * it should not work in parallel with the checkpointer, except for cases when
> there are too many layers (more on that later)
> * it should merge later checkpoint delta files into the main partition files
> * it should delete a checkpoint's markers once all merges for that checkpoint are
> completed, thus decoupling markers from the RAFT log
> About "cases when there are too many layers" - too many layers could
> compromise reading speed. Number of layers should not increase
> uncontrollably. So, when a threshold is exceeded, compactor should start
> working no mater what. If anything, writing load can be throttled, reading
> matters more.
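> As a rough, purely illustrative sketch (the names, threshold handling and
> synchronization below are assumptions, not the planned implementation), the
> compactor could run a loop like this: stay out of the checkpointer's way unless
> the layer count exceeds the threshold, otherwise merge the oldest delta file into
> the main partition file and drop checkpoint markers whose merges are all done.
> {code:java}
> // Hypothetical compactor loop; the nested interfaces are stubs for illustration only.
> class Compactor implements Runnable {
>     interface PartitionStorage {
>         int deltaLayerCount();
>         DeltaFile oldestDeltaFile();           // null if there is nothing to merge
>         void mergeIntoMainFile(DeltaFile delta);
>         void removeDeltaFile(DeltaFile delta);
>         boolean allMergesCompleted(long checkpointId);
>         void deleteCheckpointMarker(long checkpointId);
>     }
>
>     interface DeltaFile {
>         long checkpointId();
>     }
>
>     interface CheckpointState {
>         boolean checkpointInProgress();
>     }
>
>     private final PartitionStorage storage;
>     private final CheckpointState checkpoint;
>     private final int maxLayers; // threshold after which reads degrade too much
>
>     Compactor(PartitionStorage storage, CheckpointState checkpoint, int maxLayers) {
>         this.storage = storage;
>         this.checkpoint = checkpoint;
>         this.maxLayers = maxLayers;
>     }
>
>     @Override public void run() {
>         try {
>             while (!Thread.currentThread().isInterrupted()) {
>                 boolean tooManyLayers = storage.deltaLayerCount() > maxLayers;
>
>                 // Normally don't compete with the checkpointer for the disk, but if
>                 // the layer count threatens read speed, compact anyway (writes can
>                 // be throttled instead).
>                 if (!tooManyLayers && checkpoint.checkpointInProgress()) {
>                     Thread.sleep(100);
>                     continue;
>                 }
>
>                 DeltaFile oldest = storage.oldestDeltaFile();
>                 if (oldest == null) {
>                     Thread.sleep(100); // nothing to merge yet
>                     continue;
>                 }
>
>                 storage.mergeIntoMainFile(oldest); // apply delta pages to the main file
>                 storage.removeDeltaFile(oldest);
>
>                 // Once every merge produced by a checkpoint is done, its marker can go,
>                 // which is what decouples markers from the RAFT log.
>                 if (storage.allMergesCompleted(oldest.checkpointId()))
>                     storage.deleteCheckpointMarker(oldest.checkpointId());
>             }
>         } catch (InterruptedException e) {
>             Thread.currentThread().interrupt(); // stop on shutdown
>         }
>     }
> }
> {code}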
> Recovery procedure (a sketch follows below):
> * read the list of checkpoint markers on engine start
> * remove all data from the unfinished checkpoint, if there is one
> * trim main partition files to their proper size (we should check if it's
> actually beneficial)
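> A hedged sketch of that sequence (hypothetical names; the real engine start-up is
> still being designed):
> {code:java}
> // Illustrative recovery-on-start sketch; not the actual Ignite 3 code.
> import java.util.List;
>
> class RecoveryProcedure {
>     interface CheckpointMarker {
>         long checkpointId();
>         boolean finished(); // true if the END marker was written
>     }
>
>     interface CheckpointMarkerStorage {
>         List<CheckpointMarker> readAll(); // ordered from oldest to newest
>         void remove(long checkpointId);
>     }
>
>     interface PartitionFiles {
>         void removeDeltaFilesOf(long checkpointId);
>         void trimMainFilesToRecordedSize();
>     }
>
>     void recover(CheckpointMarkerStorage markers, PartitionFiles partitions) {
>         // 1. Read the list of checkpoint markers persisted on disk.
>         List<CheckpointMarker> all = markers.readAll();
>
>         // 2. If the newest checkpoint never finished, its delta files are incomplete
>         //    and must not be visible after restart, so drop everything it wrote.
>         CheckpointMarker last = all.isEmpty() ? null : all.get(all.size() - 1);
>         if (last != null && !last.finished()) {
>             partitions.removeDeltaFilesOf(last.checkpointId());
>             markers.remove(last.checkpointId());
>         }
>
>         // 3. Trim main partition files back to their recorded size
>         //    (still to be verified whether this is actually beneficial).
>         partitions.trimMainFilesToRecordedSize();
>     }
> }
> {code}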
> Table start procedure (a sketch follows below):
> * read all layer file headers according to the list of checkpoints
> * construct a list of hash tables (pageId -> pageIndex) for all layers, making it
> as efficient as possible
> * everything else is just like before
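> For illustration, building those lookup structures could look roughly like the
> sketch below (hypothetical names; the assumption is that every delta file header
> lists the (pageId, pageIndex) pairs it contains). A plain HashMap is used here
> for clarity; a more compact representation is discussed below.
> {code:java}
> // Illustrative sketch of building per-layer pageId -> pageIndex maps on table start.
> import java.util.ArrayList;
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
>
> class LayerIndexBuilder {
>     /** One (pageId, pageIndex) pair read from a delta file header (assumed format). */
>     record PageMapping(long pageId, int pageIndex) {}
>
>     interface DeltaFileHeader {
>         List<PageMapping> mappings();
>     }
>
>     /**
>      * Builds one lookup map per layer, ordered from the newest checkpoint to the
>      * oldest, so a page read can probe the layers in order and stop at the first hit.
>      */
>     List<Map<Long, Integer>> buildLayerMaps(List<DeltaFileHeader> headersNewestFirst) {
>         List<Map<Long, Integer>> layers = new ArrayList<>(headersNewestFirst.size());
>
>         for (DeltaFileHeader header : headersNewestFirst) {
>             Map<Long, Integer> map = new HashMap<>();
>
>             for (PageMapping m : header.mappings())
>                 map.put(m.pageId(), m.pageIndex());
>
>             layers.add(map);
>         }
>
>         return layers;
>     }
> }
> {code}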
> Partition removal might be tricky, but we'll see; it's tricky in Ignite 2.x after
> all. The "Restore partition states" procedure could be revisited; I don't know
> how this will work yet.
> How to store the hashmaps:
> regular maps might be too heavy; we should consider a roaring-map implementation
> or something similar that'll occupy less space. This is only a concern for
> in-memory structures. Files on disk may store a plain list of pairs, that's fine.
> Generally speaking, checkpoints with a size of 100 thousand pages are close to
> the upper limit for most users. Splitting that across 500 partitions, for
> example, gives us 200 pages per partition. The entire map should fit into a
> single page.
> The only exception to these calculations is index.bin. The number of pages per
> checkpoint can be orders of magnitude higher there, so we should keep an eye on
> it; it'll be the main target for testing/benchmarking. Anyway, 4 kilobytes is
> enough to fit 512 integer pairs, scaling to 2048 for regular 16-kilobyte pages.
> The map won't be too big IMO.
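> As a back-of-the-envelope check of the sizing above, and as one possible compact
> in-memory representation (an illustrative sketch, not the planned format): a
> (pageId, pageIndex) pair stored as two 4-byte integers takes 8 bytes, and the map
> itself can be two parallel sorted arrays with binary-search lookup instead of a
> general-purpose HashMap.
> {code:java}
> // Sketch of a compact per-layer map: two parallel sorted int arrays + binary search.
> // The capacity math matches the estimate above: 8 bytes per (pageId, pageIndex) pair.
> import java.util.Arrays;
>
> class CompactPageMap {
>     private final int[] pageIds;     // sorted ascending
>     private final int[] pageIndexes; // pageIndexes[i] corresponds to pageIds[i]
>
>     CompactPageMap(int[] sortedPageIds, int[] pageIndexes) {
>         this.pageIds = sortedPageIds;
>         this.pageIndexes = pageIndexes;
>     }
>
>     /** Returns the page index for the given page id, or -1 if this layer has no such page. */
>     int pageIndex(int pageId) {
>         int pos = Arrays.binarySearch(pageIds, pageId);
>         return pos >= 0 ? pageIndexes[pos] : -1;
>     }
>
>     /** How many (pageId, pageIndex) pairs fit into one page of the given size. */
>     static int pairsPerPage(int pageSizeBytes) {
>         return pageSizeBytes / (2 * Integer.BYTES);
>     }
>
>     public static void main(String[] args) {
>         System.out.println(pairsPerPage(4 * 1024));  // 512
>         System.out.println(pairsPerPage(16 * 1024)); // 2048
>     }
> }
> {code}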
> Another important point: we should enable direct IO, which has been supported by
> Java natively since version 9 (I guess). There's a chance that not only will
> regular disk operations become somewhat faster, but fsync will become drastically
> faster as a result. Which is good: fsync can easily take half the time of the
> checkpoint, which is just unacceptable.
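> For reference, here is a minimal sketch of direct I/O from plain Java, assuming a
> made-up file name; it uses the JDK-specific ExtendedOpenOption.DIRECT option from
> com.sun.nio.file (available in recent JDKs), which requires block-aligned
> buffers, positions and transfer sizes.
> {code:java}
> // Minimal direct I/O sketch; the file name is made up and error handling is omitted.
> import com.sun.nio.file.ExtendedOpenOption;
> import java.nio.ByteBuffer;
> import java.nio.channels.FileChannel;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
> import java.nio.file.StandardOpenOption;
>
> public class DirectIoSketch {
>     public static void main(String[] args) throws Exception {
>         Path file = Paths.get("part-0-delta-1.bin");
>
>         // Direct I/O requires the buffer address, file position and transfer size
>         // to be aligned to the file store block size.
>         int blockSize = (int) Files.getFileStore(file.toAbsolutePath().getParent()).getBlockSize();
>
>         try (FileChannel ch = FileChannel.open(file,
>                 StandardOpenOption.CREATE, StandardOpenOption.WRITE,
>                 ExtendedOpenOption.DIRECT)) {
>
>             // Over-allocate and take an aligned slice to satisfy the alignment constraint.
>             ByteBuffer page = ByteBuffer.allocateDirect(blockSize * 2).alignedSlice(blockSize);
>             page.limit(blockSize);
>
>             ch.write(page, 0); // bypasses the OS page cache
>             ch.force(true);    // fsync; expected to be much cheaper without a dirty page cache
>         }
>     }
> }
> {code}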
> h2. Thoughts 2.0
> With high likelihood, we'll get rid of index.bin. This will remove the
> requirement of having checkpoint markers.
> All that we need is a consistently growing local counter that will be used to
> mark partition delta files. It doesn't even need to be global at the level of the
> local node; it can be a local counter per partition, persisted in the meta page.
> This should be discussed further during the implementation.
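> To make the idea concrete (purely illustrative; the file naming scheme and the
> meta page API below are assumptions), each partition could keep a monotonically
> growing counter in its meta page and stamp every new delta file with the next
> value:
> {code:java}
> // Illustrative per-partition delta file counter; naming and meta page API are assumptions.
> import java.nio.file.Path;
>
> class PartitionDeltaFiles {
>     interface PartitionMetaPage {
>         long lastDeltaFileId();              // persisted in the partition meta page
>         void updateLastDeltaFileId(long id); // written out as part of the checkpoint
>     }
>
>     private final Path workDir;
>     private final int partitionId;
>     private final PartitionMetaPage meta;
>
>     PartitionDeltaFiles(Path workDir, int partitionId, PartitionMetaPage meta) {
>         this.workDir = workDir;
>         this.partitionId = partitionId;
>         this.meta = meta;
>     }
>
>     /** Allocates the name of the next delta file for this partition. */
>     Path nextDeltaFile() {
>         long next = meta.lastDeltaFileId() + 1;
>         meta.updateLastDeltaFileId(next);
>
>         // E.g. part-42-delta-7.bin: no node-global counter and no checkpoint marker needed.
>         return workDir.resolve("part-" + partitionId + "-delta-" + next + ".bin");
>     }
> }
> {code}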