[
https://issues.apache.org/jira/browse/IGNITE-15818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Bessonov updated IGNITE-15818:
-----------------------------------
Description:
h2. Goal
Port and refactor core classes implementing page-based persistent store in
Ignite 2.x: GridCacheOffheapManager, GridCacheDatabaseSharedManager,
PageMemoryImpl, Checkpointer, FileWriteAheadLogManager.
New checkpoint implementation to avoid excessive logging.
Store lifecycle clarification to avoid complicated and invasive code of custom
lifecycle managed mostly by DatabaseSharedManager.
h2. Items to pay attention to
New checkpoint implementation based on split-file storage, new page index
structure to maintain disk-memory page mapping.
File page store implementation should be extracted from GridCacheOffheapManager
into a separate entity; the target implementation should support the new version
of the checkpoint (a split-file store, to enable an always-consistent store and
to eliminate the binary recovery phase).
Support of big pages (256+ kB).
Support of throttling algorithms.
h2. References
New checkpoint design overview is available
[here|https://github.com/apache/ignite-3/blob/ignite-14647/modules/vault/README.md]
h2. Thoughts
Although there is a technical opportunity to have independent checkpoints for
different data regions, managing them could be a nightmare and it's definitely
in the realm of optimizations and out of scope right now.
So, let's assume that there's one good old checkpoint process. There's still a
requirement to have checkpoint markers, but they will not have a reference to
the WAL, because there is no WAL. Instead, we will have to store a RAFT log
revision per partition. Or maybe not; I'm not that familiar with the recovery
procedure that's currently in development.
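The marker layout hinted at above (a RAFT log revision per partition instead of a WAL pointer) might be sketched as follows; all names here are assumptions for illustration, not the actual design:

```java
import java.util.Map;

/** Hypothetical checkpoint marker: no WAL reference, only a
 *  RAFT log revision recorded per partition, as suggested above. */
record CheckpointMarker(long checkpointId, Map<Integer, Long> raftRevisionByPartition) {
    /** Revision the given partition would be replayed from on recovery
     *  (0 if the partition is not covered by this checkpoint). */
    long revisionOf(int partitionId) {
        return raftRevisionByPartition.getOrDefault(partitionId, 0L);
    }
}
```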
Unlike checkpoints in Ignite 2.x, which had DO and REDO operations, the new
version will have DO and UNDO. This drastically simplifies both the checkpoint
itself and node recovery, but it complicates data access.
There will be two processes sharing the storage resource: the "checkpointer" and
the "compactor". Let's examine what the compactor should or shouldn't do:
* it should not work in parallel with the checkpointer, except for cases when
there are too many layers (more on that later)
* it should merge later checkpoint delta files into main partition files
* it should delete a checkpoint's markers once all merges for that checkpoint
are completed; markers are thus decoupled from the RAFT log
About "cases when there are too many layers": too many layers could compromise
reading speed, so the number of layers should not grow uncontrollably. When a
threshold is exceeded, the compactor should start working no matter what.
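The scheduling rule above can be sketched as a tiny predicate; the class name, method name, and threshold value are all assumptions, not the actual implementation:

```java
/** Hypothetical sketch of the compactor scheduling rule described above. */
class CompactorTrigger {
    /** Assumed threshold on the number of delta layers per partition. */
    static final int MAX_LAYERS = 8;

    /**
     * The compactor normally stays idle while a checkpoint is in progress,
     * but starts anyway once the layer count exceeds the threshold,
     * because too many layers would degrade read speed.
     */
    static boolean shouldStartCompaction(boolean checkpointInProgress, int layerCount) {
        if (layerCount > MAX_LAYERS)
            return true; // no matter what

        return !checkpointInProgress;
    }
}
```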
Recovery procedure:
* read the list of checkpoint markers on engine start
* remove all data from the unfinished checkpoint, if there is one
* trim main partition files to their proper size (should check if it's
actually beneficial)
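The first two recovery steps can be sketched like this; markers are modeled as simple (id, finished) pairs, and all names are hypothetical. File trimming and actual delta-file deletion are left out:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the recovery steps listed above. */
class Recovery {
    /** A checkpoint marker, modeled as a (checkpointId, finished) pair. */
    record Marker(long checkpointId, boolean finished) {}

    /**
     * Reads the list of checkpoint markers on engine start and keeps only
     * finished checkpoints; an unfinished checkpoint, if present, is dropped
     * (its delta files would be deleted here in a real implementation).
     */
    static List<Long> recover(List<Marker> markers) {
        List<Long> valid = new ArrayList<>();
        for (Marker m : markers) {
            if (m.finished())
                valid.add(m.checkpointId());
        }
        return valid;
    }
}
```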
Table start procedure:
* read all layer file headers according to the list of checkpoints
* construct a list of hash tables (pageId -> pageIndex) for all layers, making
it as efficient as possible
* everything else is just like before
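The per-layer hash tables built at table start might be queried newest-layer-first, falling back to the main partition file when no layer contains the page. A minimal sketch, with all names assumed:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;

/** Hypothetical sketch of the layered (pageId -> pageIndex) mapping above. */
class LayeredPageIndex {
    /** One hash table per layer, newest layer first. */
    private final List<Map<Long, Integer>> layers = new ArrayList<>();

    /** Registers a newer layer on top of the existing ones. */
    void addLayer(Map<Long, Integer> index) {
        layers.add(0, index);
    }

    /**
     * Looks up a page in the newest layer that contains it; an empty result
     * means the page must be read from the main partition file.
     */
    OptionalInt lookup(long pageId) {
        for (Map<Long, Integer> layer : layers) {
            Integer idx = layer.get(pageId);
            if (idx != null)
                return OptionalInt.of(idx);
        }
        return OptionalInt.empty();
    }
}
```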
Partition removal might be tricky, but we'll see; it's tricky in Ignite 2.x
after all. The "Restore partition states" procedure could be revisited; I don't
know how this will work yet.
was:
h2. Goal
Port and refactor core classes implementing page-based persistent store in
Ignite 2.x: GridCacheOffheapManager, GridCacheDatabaseSharedManager,
PageMemoryImpl, Checkpointer, FileWriteAheadLogManager.
New checkpoint implementation to avoid excessive logging.
Store lifecycle clarification to avoid complicated and invasive code of custom
lifecycle managed mostly by DatabaseSharedManager.
h2. Items to pay attention to
New checkpoint implementation based on split-file storage, new page index
structure to maintain disk-memory page mapping.
File page store implementation should be extracted from GridCacheOffheapManager
to a separate entity, target implementation should support new version of
checkpoint (split-file store to enable always-consistent store and to eliminate
binary recovery phase).
Support of big pages (256+ kB).
Support of throttling algorithms.
h2. References
New checkpoint design overview is available
[here|https://github.com/apache/ignite-3/blob/ignite-14647/modules/vault/README.md]
h2. Thoughts
On some occasions, physical WAL records are the only way to debug data
corruption. Having the new persistence without a binary log is unacceptable.
There are at least two choices:
* use WAL for binary recovery, as before. This eliminates
* don't use WAL for recovery and enable it by premise.
Alternatively, log binary data in a different format. But that's too radical;
there are already tools for reading WAL that could be adapted as well.
> [Native Persistence 3.0] Checkpoint, lifecycle and file store refactoring and
> re-implementation
> -----------------------------------------------------------------------------------------------
>
> Key: IGNITE-15818
> URL: https://issues.apache.org/jira/browse/IGNITE-15818
> Project: Ignite
> Issue Type: Task
> Reporter: Sergey Chugunov
> Priority: Major
> Labels: ignite-3
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)