[ 
https://issues.apache.org/jira/browse/IGNITE-15818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-15818:
-----------------------------------
    Description: 
h2. Goal

Port and refactor core classes implementing page-based persistent store in 
Ignite 2.x: GridCacheOffheapManager, GridCacheDatabaseSharedManager, 
PageMemoryImpl, Checkpointer, FileWriteAheadLogManager.

New checkpoint implementation to avoid excessive logging.

Store lifecycle clarification to avoid complicated and invasive code of custom 
lifecycle managed mostly by DatabaseSharedManager.
h2. Items to pay attention to

New checkpoint implementation based on split-file storage, new page index 
structure to maintain disk-memory page mapping.

File page store implementation should be extracted from GridCacheOffheapManager 
to a separate entity, target implementation should support new version of 
checkpoint (split-file store to enable always-consistent store and to eliminate 
binary recovery phase).

Support of big pages (256+ kB).

Support of throttling algorithms.
h2. References

New checkpoint design overview is available 
[here|https://github.com/apache/ignite-3/blob/ignite-14647/modules/vault/README.md]
h2. Thoughts

Although there is a technical opportunity to have independent checkpoints for 
different data regions, managing them could be a nightmare; that's definitely 
in the realm of optimizations and out of scope right now.

So, let's assume that there's one good old checkpoint process. There's still a 
requirement to have checkpoint markers, but they will not have a reference to 
the WAL, because there's no WAL. Instead, we will have to store the RAFT log 
revision per partition. Or maybe not; I'm not that familiar with the recovery 
procedure that's currently in development.
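To make this concrete, a checkpoint marker that references the RAFT log instead of the WAL could look roughly like this. This is a minimal sketch; the class and field names are illustrative assumptions, not an actual Ignite API:

```java
import java.util.Map;
import java.util.UUID;

/** Hypothetical checkpoint marker: instead of a WAL pointer it keeps
 *  the applied RAFT log revision for every partition it covers. */
class CheckpointMarker {
    final UUID checkpointId;                 // unique id of this checkpoint
    final Map<Integer, Long> raftRevisions;  // partitionId -> applied RAFT log revision

    CheckpointMarker(UUID checkpointId, Map<Integer, Long> raftRevisions) {
        this.checkpointId = checkpointId;
        this.raftRevisions = Map.copyOf(raftRevisions); // defensive immutable copy
    }
}
```

With per-partition revisions in the marker, recovery could decide how far each partition's state lags behind the RAFT log without consulting any WAL.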

Unlike checkpoints in Ignite 2.x, which had DO and REDO operations, the new 
version will have DO and UNDO. This drastically simplifies both the checkpoint 
itself and node recovery, but it complicates data access.

There will be two processes sharing the storage resource: the "checkpointer" 
and the "compactor". Let's examine what the compactor should or shouldn't do:
 * it should not work in parallel with the checkpointer, except when there are 
too many layers (more on that later)
 * it should merge checkpoint delta files into the main partition files
 * it should delete a checkpoint's markers once all merges for that checkpoint 
are completed; this decouples the markers from the RAFT log
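As an illustration of the merge step, here is a sketch that models both the main partition file and a delta layer as pageId -> page-bytes maps. That representation is an assumption made purely for the sketch; real code would operate on file offsets:

```java
import java.util.Map;

/** Illustrative merge of one checkpoint delta layer into the main
 *  partition file, with both modeled as pageId -> page-bytes maps. */
class DeltaMerge {
    static void merge(Map<Long, byte[]> mainFile, Map<Long, byte[]> delta) {
        // The delta layer always holds the newer page version,
        // so its pages simply overwrite the main file's pages.
        mainFile.putAll(delta);
    }
}
```

Since a delta never deletes pages, a merge only overwrites or appends, which is what makes the store always-consistent during compaction.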

About the "cases when there are too many layers": too many layers could 
compromise read speed, so their number should not grow uncontrollably. When a 
threshold is exceeded, the compactor should start working no matter what.
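The gating rule above could be sketched as follows; the class name, method signature, and threshold value are all assumptions for illustration:

```java
/** Hypothetical gating logic for the compactor. */
class CompactorGate {
    static final int MAX_LAYERS = 10; // assumed threshold, not a real Ignite setting

    /** Normally the compactor yields to the checkpointer; once the number
     *  of delta layers threatens read speed, it runs no matter what. */
    static boolean shouldRun(boolean checkpointInProgress, int layerCount) {
        return !checkpointInProgress || layerCount > MAX_LAYERS;
    }
}
```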

Recovery procedure:
 * read the list of checkpoint markers on engine start
 * remove all data from an unfinished checkpoint, if there is one
 * trim main partition files to their proper size (we should check whether it's 
actually beneficial)
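The recovery walk above could be sketched like this. All names are hypothetical, and the marker model (an id plus a finished flag) is a convention invented for the sketch:

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the recovery walk over checkpoint markers. */
class RecoverySketch {
    record Marker(int id, boolean finished) {}

    /** Keeps the data of finished checkpoints; an unfinished one is dropped. */
    static List<Integer> recover(List<Marker> markers) {
        List<Integer> kept = new ArrayList<>();
        for (Marker m : markers) {
            if (m.finished()) {
                kept.add(m.id());  // checkpoint is complete, keep its layers
            }
            // else: remove the unfinished checkpoint's delta data here,
            // then trim the main partition files to their proper size
        }
        return kept;
    }
}
```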

Table start procedure:
 * read all layer file headers according to the list of checkpoints
 * construct a list of hash tables (pageId -> pageIndex) for all layers, making 
it as efficient as possible
 * everything else is just like before
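The per-layer hash tables could look like the sketch below. A page lookup walks the layers from newest to oldest, which also shows why too many layers hurt read speed; the names are illustrative:

```java
import java.util.List;
import java.util.Map;

/** Sketch of the per-layer pageId -> pageIndex lookup structure. */
class LayerIndex {
    private final List<Map<Long, Integer>> layers; // newest layer first

    LayerIndex(List<Map<Long, Integer>> layers) {
        this.layers = List.copyOf(layers);
    }

    /** Returns the page index from the newest layer containing the page,
     *  or null if the page lives only in the main partition file. */
    Integer lookup(long pageId) {
        for (Map<Long, Integer> layer : layers) {
            Integer idx = layer.get(pageId);
            if (idx != null)
                return idx;
        }
        return null;
    }
}
```

Every extra layer adds one hash probe to the worst-case read path, which is the cost the compaction threshold is meant to bound.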

Partition removal might be tricky, but we'll see; it's tricky in Ignite 2.x 
after all. The "restore partition states" procedure could be revisited, but I 
don't know how this will work yet.

  was:
h2. Goal

Port and refactor core classes implementing page-based persistent store in 
Ignite 2.x: GridCacheOffheapManager, GridCacheDatabaseSharedManager, 
PageMemoryImpl, Checkpointer, FileWriteAheadLogManager.

New checkpoint implementation to avoid excessive logging.

Store lifecycle clarification to avoid complicated and invasive code of custom 
lifecycle managed mostly by DatabaseSharedManager.
h2. Items to pay attention to

New checkpoint implementation based on split-file storage, new page index 
structure to maintain disk-memory page mapping.

File page store implementation should be extracted from GridCacheOffheapManager 
to a separate entity, target implementation should support new version of 
checkpoint (split-file store to enable always-consistent store and to eliminate 
binary recovery phase).

Support of big pages (256+ kB).

Support of throttling algorithms.
h2. References

New checkpoint design overview is available 
[here|https://github.com/apache/ignite-3/blob/ignite-14647/modules/vault/README.md]
h2. Thoughts

On some occasions, physical WAL records are the only way to debug data 
corruption. Having new persistence without binary log is unacceptable. There 
are at least two choices:
 * use WAL for binary recovery, as before. This eliminates 
 * don't use WAL for recovery and enable it by premise.

Alternatively, log binary data in a different format. But that's too radical; 
there are already tools for reading the WAL that could be adapted as well.


> [Native Persistence 3.0] Checkpoint, lifecycle and file store refactoring and 
> re-implementation
> -----------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-15818
>                 URL: https://issues.apache.org/jira/browse/IGNITE-15818
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Sergey Chugunov
>            Priority: Major
>              Labels: ignite-3
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
