[
https://issues.apache.org/jira/browse/IGNITE-15818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Bessonov updated IGNITE-15818:
-----------------------------------
Description:
h2. Goal
Port and refactor the core classes implementing the page-based persistent store in
Ignite 2.x: GridCacheOffheapManager, GridCacheDatabaseSharedManager,
PageMemoryImpl, Checkpointer, FileWriteAheadLogManager.
New checkpoint implementation to avoid excessive logging.
Clarify the store lifecycle to avoid the complicated and invasive custom
lifecycle code managed mostly by DatabaseSharedManager.
h2. Items to pay attention to
New checkpoint implementation based on split-file storage, with a new page index
structure to maintain the disk-memory page mapping.
The file page store implementation should be extracted from GridCacheOffheapManager
into a separate entity; the target implementation should support the new version of
the checkpoint (a split-file store that enables an always-consistent store and
eliminates the binary recovery phase).
Support for big pages (256+ kB).
Support for throttling algorithms.
h2. References
The new checkpoint design overview is available
[here|https://github.com/apache/ignite-3/blob/ignite-14647/modules/vault/README.md].
h2. Thoughts
Although it is technically possible to have independent checkpoints for
different data regions, managing them could be a nightmare; it's definitely
in the realm of optimizations and out of scope right now.
So, let's assume that there's one good old checkpoint process. There's still a
requirement to have checkpoint markers, but they will not reference the WAL,
because there is no WAL. Instead, we will have to store the RAFT log revision
per partition. Or maybe not; I'm not that familiar with the recovery procedure
that's currently in development.
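Purely for illustration, such a marker could look like the sketch below; the class name, its fields and the per-partition RAFT index idea are assumptions of this ticket, not an agreed-on API.
{code:java}
import java.util.Map;

/**
 * Hypothetical checkpoint marker with no WAL pointer: it only records, per
 * partition, up to which RAFT log index the checkpointed data is consistent.
 */
class CheckpointMarker {
    private final long checkpointId;
    private final Map<Integer, Long> appliedRaftIndexByPartition; // partitionId -> applied RAFT log index

    CheckpointMarker(long checkpointId, Map<Integer, Long> appliedRaftIndexByPartition) {
        this.checkpointId = checkpointId;
        this.appliedRaftIndexByPartition = appliedRaftIndexByPartition;
    }

    /** Whether this checkpoint already covers the given partition state. */
    boolean covers(int partitionId, long raftIndex) {
        Long applied = appliedRaftIndexByPartition.get(partitionId);
        return applied != null && applied >= raftIndex;
    }
}
{code}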
Unlike checkpoints in Ignite 2.x, which had DO and REDO operations, the new
version will have DO and UNDO. This drastically simplifies both the checkpoint
itself and node recovery, but it complicates data access.
There will be two processes sharing the storage resource: the "checkpointer" and
the "compactor". Let's examine what the compactor should or shouldn't do:
* it should not work in parallel with the checkpointer, except for cases when
there are too many layers (more on that later)
* it should merge the delta files of completed checkpoints into the main partition files
* it should delete a checkpoint's markers once all merges for that checkpoint are
completed; thus markers are decoupled from the RAFT log
About "cases when there are too many layers" - too many layers could compromise
reading speed. Number of layers should not increase uncontrollably. So, when a
threshold is exceeded, compactor should start working no mater what. If
anything, writing load can be throttled, reading matters more.
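A minimal sketch of these compactor rules, assuming hypothetical CheckpointStorage/Checkpoint/DeltaFile abstractions and an arbitrary MAX_LAYERS threshold; none of these names are the actual API.
{code:java}
import java.io.IOException;
import java.util.List;

/** Hypothetical view of the store, only what this sketch needs. */
interface CheckpointStorage {
    boolean checkpointInProgress();
    int layerCount();
    List<Checkpoint> completedCheckpoints();
    void mergeIntoPartitionFile(DeltaFile delta) throws IOException;
    void deleteMarker(Checkpoint checkpoint) throws IOException;
}

interface Checkpoint {
    List<DeltaFile> deltaFiles();
}

interface DeltaFile {}

class Compactor {
    /** Illustrative threshold: above this many unmerged layers we compact no matter what. */
    private static final int MAX_LAYERS = 8;

    private final CheckpointStorage storage;

    Compactor(CheckpointStorage storage) {
        this.storage = storage;
    }

    void runOnce() throws IOException {
        // Normally the compactor yields to the checkpointer; it only runs in
        // parallel with it when the number of layers gets out of control.
        if (storage.checkpointInProgress() && storage.layerCount() <= MAX_LAYERS)
            return;

        for (Checkpoint checkpoint : storage.completedCheckpoints()) {
            // Merge every delta file of the completed checkpoint into the main partition files.
            for (DeltaFile delta : checkpoint.deltaFiles())
                storage.mergeIntoPartitionFile(delta);

            // All merges for this checkpoint are done, so its marker can be removed.
            // This is what decouples markers from the RAFT log.
            storage.deleteMarker(checkpoint);
        }
    }
}
{code}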
Recovery procedure (sketched below):
* read the list of checkpoint markers on engine start
* remove all data from the unfinished checkpoint, if there is one
* trim main partition files to their proper size (we should check whether it's
actually beneficial)
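A rough sketch of that start-up sequence with made-up MarkerStore/PartitionStore types, just to pin down the order of steps.
{code:java}
import java.io.IOException;
import java.util.List;

/** Hypothetical recovery at engine start; types and methods are illustrative only. */
class Recovery {
    void recover(MarkerStore markers, PartitionStore partitions) throws IOException {
        // 1. Read the list of checkpoint markers.
        List<String> checkpointIds = markers.list();

        // 2. If the last checkpoint never finished, drop its delta files entirely.
        if (!checkpointIds.isEmpty()) {
            String last = checkpointIds.get(checkpointIds.size() - 1);

            if (!markers.isFinished(last))
                partitions.dropDeltaFilesOf(last);
        }

        // 3. Optionally trim main partition files back to their checkpointed size.
        partitions.truncateToCheckpointedSizes();
    }

    interface MarkerStore {
        List<String> list() throws IOException;
        boolean isFinished(String checkpointId) throws IOException;
    }

    interface PartitionStore {
        void dropDeltaFilesOf(String checkpointId) throws IOException;
        void truncateToCheckpointedSizes() throws IOException;
    }
}
{code}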
Table start procedure (sketched below):
* read all layer file headers according to the list of checkpoints
* construct a list of hash tables (pageId -> pageIndex) for all layers, making
it as efficient as possible
* everything else is just like before
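To make the data-access cost of the layered layout concrete: a read has to probe the per-layer (pageId -> pageIndex) maps from newest to oldest before falling back to the main partition file. A simplified sketch, with all types being assumptions:
{code:java}
import java.util.List;
import java.util.Map;

/** Hypothetical per-partition view built at table start; not the real API. */
class LayeredPartition {
    /** Newest layer first; each map is the (pageId -> pageIndex) table read from a delta file header. */
    private final List<DeltaLayer> layers;
    private final PartitionFile mainFile;

    LayeredPartition(List<DeltaLayer> layers, PartitionFile mainFile) {
        this.layers = layers;
        this.mainFile = mainFile;
    }

    /** Reads a page by probing delta layers from newest to oldest, then falling back to the main file. */
    byte[] readPage(long pageId) {
        for (DeltaLayer layer : layers) {
            Integer index = layer.pageIndex().get(pageId);

            if (index != null)
                return layer.readPageAt(index); // page was rewritten by this checkpoint layer
        }

        return mainFile.readPage(pageId); // untouched pages live in the main partition file
    }

    interface DeltaLayer {
        Map<Long, Integer> pageIndex();
        byte[] readPageAt(int pageIndex);
    }

    interface PartitionFile {
        byte[] readPage(long pageId);
    }
}
{code}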
Partition removal might be tricky, but we'll see. It's tricky in Ignite 2.x
after all. The "restore partition states" procedure could be revisited; I don't
know how this will work yet.
How to store the hash maps:
regular maps might be too heavy; we should consider a roaring-map-like
implementation or something similar that occupies less space. This is only a
concern for in-memory structures. Files on disk may store a plain list of pairs,
that's fine.
Generally speaking, checkpoints of about 100 thousand pages are close to the
upper limit for most users. Splitting that across 500 partitions, for example,
gives us 200 pages per partition. The entire map should fit into a single page.
The only exception to these calculations is index.bin. The number of pages per
checkpoint can be orders of magnitude higher, so we should keep an eye on it;
it'll be the main target for testing/benchmarking. Anyway, 4 kilobytes is enough
to fit 512 integer pairs, scaling to 2048 pairs for regular 16-kilobyte pages.
The map won't be too big IMO.
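The on-disk form of such a map can literally be a flat list of int pairs written into a single page, which is where the 4096 / 8 = 512 and 16384 / 8 = 2048 figures come from. A self-contained sketch of that layout (the exact layout is an assumption):
{code:java}
import java.nio.ByteBuffer;
import java.util.Map;

/** Writes a (pageId -> pageIndex) map as a flat list of int pairs into a single page buffer. */
final class PageMapLayout {
    static final int BYTES_PER_PAIR = 2 * Integer.BYTES; // 8 bytes per pair

    /** How many pairs fit into one page: 4096 / 8 = 512, 16384 / 8 = 2048. */
    static int capacity(int pageSize) {
        return pageSize / BYTES_PER_PAIR;
    }

    static void write(ByteBuffer page, Map<Integer, Integer> pageMap) {
        if (pageMap.size() > capacity(page.capacity()))
            throw new IllegalArgumentException("Map does not fit into a single page");

        for (Map.Entry<Integer, Integer> e : pageMap.entrySet()) {
            page.putInt(e.getKey());   // page index within the partition
            page.putInt(e.getValue()); // page index within the delta file
        }
    }

    public static void main(String[] args) {
        System.out.println(capacity(4096));  // 512
        System.out.println(capacity(16384)); // 2048
    }
}
{code}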
Another important point: we should enable direct IO, which Java supports
natively since JDK 10 via an extended open option. There's a chance that not
only will regular disk operations become somewhat faster, but fsync will become
drastically faster as a result. Which is good: fsync can easily take half the
time of a checkpoint, which is just unacceptable.
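For reference, a minimal example of opening a file with O_DIRECT from Java: ExtendedOpenOption.DIRECT lives in the com.sun.nio.file package (added in JDK 10), and with it the buffer address, file offset and transfer size must be aligned to the filesystem block size. The file name and block size here are placeholders.
{code:java}
import com.sun.nio.file.ExtendedOpenOption;

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class DirectIoExample {
    public static void main(String[] args) throws IOException {
        int blockSize = 4096; // in real code, query the FS block size instead of assuming it

        try (FileChannel ch = FileChannel.open(Path.of("part-0.bin"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE,
                ExtendedOpenOption.DIRECT)) {

            // O_DIRECT requires an aligned buffer; over-allocate and slice to the alignment.
            ByteBuffer page = ByteBuffer.allocateDirect(2 * blockSize).alignedSlice(blockSize);
            page.limit(blockSize);

            ch.write(page, 0); // offset and size are already block-aligned
            ch.force(true);    // fsync; with O_DIRECT there is little dirty page cache left to flush
        }
    }
}
{code}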
h2. Thoughts 2.0
With high likelihood, we'll get rid of index.bin. This will remove the
requirement of having checkpoint markers.
All we need is a monotonically growing local counter that will be used to mark
partition delta files. It doesn't even need to be global at the level of the
local node; it can be a per-partition counter persisted in the partition's meta
page. This should be discussed further during the implementation.
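A small sketch of how such a per-partition counter could name delta files, assuming the last value is persisted in the partition meta page; the file name pattern and types are made up.
{code:java}
import java.nio.file.Path;

/** Hypothetical naming of partition delta files by a per-partition, monotonically growing counter. */
class DeltaFileNames {
    private final Path partitionDir;
    private final int partitionId;
    private long counter; // restored from the partition meta page on start

    DeltaFileNames(Path partitionDir, int partitionId, long lastPersistedCounter) {
        this.partitionDir = partitionDir;
        this.partitionId = partitionId;
        this.counter = lastPersistedCounter;
    }

    /** Next delta file, e.g. "part-12-delta-42.bin"; the new counter value must go back to the meta page. */
    Path nextDeltaFile() {
        counter++;
        return partitionDir.resolve("part-" + partitionId + "-delta-" + counter + ".bin");
    }
}
{code}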
> [Native Persistence 3.0] Checkpoint, lifecycle and file store refactoring and
> re-implementation
> -----------------------------------------------------------------------------------------------
>
> Key: IGNITE-15818
> URL: https://issues.apache.org/jira/browse/IGNITE-15818
> Project: Ignite
> Issue Type: Task
> Reporter: Sergey Chugunov
> Assignee: Kirill Tkalenko
> Priority: Major
> Labels: ignite-3
>
--
This message was sent by Atlassian Jira
(v8.20.7#820007)