[
https://issues.apache.org/jira/browse/IGNITE-15818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Bessonov updated IGNITE-15818:
-----------------------------------
Description:
h2. Goal
Port and refactor the core classes implementing the page-based persistent store in
Ignite 2.x: GridCacheOffheapManager, GridCacheDatabaseSharedManager,
PageMemoryImpl, Checkpointer, FileWriteAheadLogManager.
New checkpoint implementation to avoid excessive logging.
Clarify the store lifecycle to avoid the complicated and invasive custom
lifecycle code managed mostly by DatabaseSharedManager.
h2. Items to pay attention to
New checkpoint implementation based on split-file storage, with a new page index
structure to maintain the disk-memory page mapping.
The file page store implementation should be extracted from GridCacheOffheapManager
into a separate entity; the target implementation should support the new version of
the checkpoint (a split-file store that enables an always-consistent store and
eliminates the binary recovery phase).
Support for big pages (256+ kB).
Support for throttling algorithms.
h2. References
The new checkpoint design overview is available
[here|https://github.com/apache/ignite-3/blob/ignite-14647/modules/vault/README.md].
h2. Thoughts
Although it is technically possible to have independent checkpoints for
different data regions, managing them could be a nightmare; it's definitely
in the realm of optimizations and out of scope right now.
So, let's assume that there's one good old checkpoint process. There's still a
requirement to have checkpoint markers, but they will not reference the WAL,
because there is no WAL. Instead, we will have to store the RAFT log revision
per partition. Or maybe not; I'm not that familiar with the recovery procedure
that's currently in development.
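Purely for illustration, such a marker could look like the sketch below; the class name, its fields and the per-partition RAFT index idea are assumptions of this ticket, not an agreed-on API.
{code:java}
import java.util.Map;

/**
 * Hypothetical checkpoint marker with no WAL pointer: it only records, per
 * partition, up to which RAFT log index the checkpointed data is consistent.
 */
class CheckpointMarker {
    private final long checkpointId;
    private final Map<Integer, Long> appliedRaftIndexByPartition; // partitionId -> applied RAFT log index

    CheckpointMarker(long checkpointId, Map<Integer, Long> appliedRaftIndexByPartition) {
        this.checkpointId = checkpointId;
        this.appliedRaftIndexByPartition = appliedRaftIndexByPartition;
    }

    /** Whether this checkpoint already covers the given partition state. */
    boolean covers(int partitionId, long raftIndex) {
        Long applied = appliedRaftIndexByPartition.get(partitionId);
        return applied != null && applied >= raftIndex;
    }
}
{code}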
Unlike checkpoints in Ignite 2.x, which had DO and REDO operations, the new
version will have DO and UNDO. This drastically simplifies both the checkpoint
itself and node recovery, but it complicates data access.
There will be two processes sharing the storage resource: the "checkpointer" and
the "compactor". Let's examine what the compactor should or shouldn't do:
* it should not work in parallel with the checkpointer, except for cases when
there are too many layers (more on that later)
* it should merge the delta files of completed checkpoints into the main partition files
* it should delete a checkpoint's markers once all merges for that checkpoint are
completed; thus markers are decoupled from the RAFT log
About "cases when there are too many layers" - too many layers could compromise
reading speed. Number of layers should not increase uncontrollably. So, when a
threshold is exceeded, compactor should start working no mater what. If
anything, writing load can be throttled, reading matters more.
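A minimal sketch of these compactor rules, assuming hypothetical CheckpointStorage/Checkpoint/DeltaFile abstractions and an arbitrary MAX_LAYERS threshold; none of these names are the actual API.
{code:java}
import java.io.IOException;
import java.util.List;

/** Hypothetical view of the store, only what this sketch needs. */
interface CheckpointStorage {
    boolean checkpointInProgress();
    int layerCount();
    List<Checkpoint> completedCheckpoints();
    void mergeIntoPartitionFile(DeltaFile delta) throws IOException;
    void deleteMarker(Checkpoint checkpoint) throws IOException;
}

interface Checkpoint {
    List<DeltaFile> deltaFiles();
}

interface DeltaFile {}

class Compactor {
    /** Illustrative threshold: above this many unmerged layers we compact no matter what. */
    private static final int MAX_LAYERS = 8;

    private final CheckpointStorage storage;

    Compactor(CheckpointStorage storage) {
        this.storage = storage;
    }

    void runOnce() throws IOException {
        // Normally the compactor yields to the checkpointer; it only runs in
        // parallel with it when the number of layers gets out of control.
        if (storage.checkpointInProgress() && storage.layerCount() <= MAX_LAYERS)
            return;

        for (Checkpoint checkpoint : storage.completedCheckpoints()) {
            // Merge every delta file of the completed checkpoint into the main partition files.
            for (DeltaFile delta : checkpoint.deltaFiles())
                storage.mergeIntoPartitionFile(delta);

            // All merges for this checkpoint are done, so its marker can be removed.
            // This is what decouples markers from the RAFT log.
            storage.deleteMarker(checkpoint);
        }
    }
}
{code}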
Recovery procedure (sketched below):
* read the list of checkpoint markers on engine start
* remove all data from the unfinished checkpoint, if there is one
* trim main partition files to their proper size (we should check whether it's
actually beneficial)
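A rough sketch of that start-up sequence with made-up MarkerStore/PartitionStore types, just to pin down the order of steps.
{code:java}
import java.io.IOException;
import java.util.List;

/** Hypothetical recovery at engine start; types and methods are illustrative only. */
class Recovery {
    void recover(MarkerStore markers, PartitionStore partitions) throws IOException {
        // 1. Read the list of checkpoint markers.
        List<String> checkpointIds = markers.list();

        // 2. If the last checkpoint never finished, drop its delta files entirely.
        if (!checkpointIds.isEmpty()) {
            String last = checkpointIds.get(checkpointIds.size() - 1);

            if (!markers.isFinished(last))
                partitions.dropDeltaFilesOf(last);
        }

        // 3. Optionally trim main partition files back to their checkpointed size.
        partitions.truncateToCheckpointedSizes();
    }

    interface MarkerStore {
        List<String> list() throws IOException;
        boolean isFinished(String checkpointId) throws IOException;
    }

    interface PartitionStore {
        void dropDeltaFilesOf(String checkpointId) throws IOException;
        void truncateToCheckpointedSizes() throws IOException;
    }
}
{code}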
Table start procedure (sketched below):
* read all layer file headers according to the list of checkpoints
* construct a list of hash tables (pageId -> pageIndex) for all layers, making
it as efficient as possible
* everything else is just like before
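To make the data-access cost of the layered layout concrete: a read has to probe the per-layer (pageId -> pageIndex) maps from newest to oldest before falling back to the main partition file. A simplified sketch, with all types being assumptions:
{code:java}
import java.util.List;
import java.util.Map;

/** Hypothetical per-partition view built at table start; not the real API. */
class LayeredPartition {
    /** Newest layer first; each map is the (pageId -> pageIndex) table read from a delta file header. */
    private final List<DeltaLayer> layers;
    private final PartitionFile mainFile;

    LayeredPartition(List<DeltaLayer> layers, PartitionFile mainFile) {
        this.layers = layers;
        this.mainFile = mainFile;
    }

    /** Reads a page by probing delta layers from newest to oldest, then falling back to the main file. */
    byte[] readPage(long pageId) {
        for (DeltaLayer layer : layers) {
            Integer index = layer.pageIndex().get(pageId);

            if (index != null)
                return layer.readPageAt(index); // page was rewritten by this checkpoint layer
        }

        return mainFile.readPage(pageId); // untouched pages live in the main partition file
    }

    interface DeltaLayer {
        Map<Long, Integer> pageIndex();
        byte[] readPageAt(int pageIndex);
    }

    interface PartitionFile {
        byte[] readPage(long pageId);
    }
}
{code}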
Partition removal might be tricky, but we'll see. It's tricky in Ignite 2.x
after all. The "restore partition states" procedure could be revisited; I don't
know how this will work yet.
How to store the hash maps:
regular maps might be too heavy; we should consider a roaring-map-like
implementation or something similar that occupies less space. This is only a
concern for in-memory structures. Files on disk may store a plain list of pairs,
that's fine.
Generally speaking, checkpoints of about 100 thousand pages are close to the
upper limit for most users. Splitting that across 500 partitions, for example,
gives us 200 pages per partition. The entire map should fit into a single page.
The only exception to these calculations is index.bin. The number of pages per
checkpoint can be orders of magnitude higher, so we should keep an eye on it;
it'll be the main target for testing/benchmarking. Anyway, 4 kilobytes is enough
to fit 512 integer pairs, scaling to 2048 pairs for regular 16-kilobyte pages.
The map won't be too big IMO.
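The on-disk form of such a map can literally be a flat list of int pairs written into a single page, which is where the 4096 / 8 = 512 and 16384 / 8 = 2048 figures come from. A self-contained sketch of that layout (the exact layout is an assumption):
{code:java}
import java.nio.ByteBuffer;
import java.util.Map;

/** Writes a (pageId -> pageIndex) map as a flat list of int pairs into a single page buffer. */
final class PageMapLayout {
    static final int BYTES_PER_PAIR = 2 * Integer.BYTES; // 8 bytes per pair

    /** How many pairs fit into one page: 4096 / 8 = 512, 16384 / 8 = 2048. */
    static int capacity(int pageSize) {
        return pageSize / BYTES_PER_PAIR;
    }

    static void write(ByteBuffer page, Map<Integer, Integer> pageMap) {
        if (pageMap.size() > capacity(page.capacity()))
            throw new IllegalArgumentException("Map does not fit into a single page");

        for (Map.Entry<Integer, Integer> e : pageMap.entrySet()) {
            page.putInt(e.getKey());   // page index within the partition
            page.putInt(e.getValue()); // page index within the delta file
        }
    }

    public static void main(String[] args) {
        System.out.println(capacity(4096));  // 512
        System.out.println(capacity(16384)); // 2048
    }
}
{code}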
Another important point: we should enable direct IO, which Java supports
natively since JDK 10 via an extended open option. There's a chance that not
only will regular disk operations become somewhat faster, but fsync will become
drastically faster as a result. Which is good: fsync can easily take half the
time of a checkpoint, which is just unacceptable.
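For reference, a minimal example of opening a file with O_DIRECT from Java: ExtendedOpenOption.DIRECT lives in the com.sun.nio.file package (added in JDK 10), and with it the buffer address, file offset and transfer size must be aligned to the filesystem block size. The file name and block size here are placeholders.
{code:java}
import com.sun.nio.file.ExtendedOpenOption;

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class DirectIoExample {
    public static void main(String[] args) throws IOException {
        int blockSize = 4096; // in real code, query the FS block size instead of assuming it

        try (FileChannel ch = FileChannel.open(Path.of("part-0.bin"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE,
                ExtendedOpenOption.DIRECT)) {

            // O_DIRECT requires an aligned buffer; over-allocate and slice to the alignment.
            ByteBuffer page = ByteBuffer.allocateDirect(2 * blockSize).alignedSlice(blockSize);
            page.limit(blockSize);

            ch.write(page, 0); // offset and size are already block-aligned
            ch.force(true);    // fsync; with O_DIRECT there is little dirty page cache left to flush
        }
    }
}
{code}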
h2. Thoughts 2.0
With high likelihood, we'll get rid of index.bin. This will remove the
requirement of having checkpoint markers.
All we need is a monotonically growing local counter that will be used to mark
partition delta files. It doesn't even need to be global at the level of the
local node; it can be a per-partition counter persisted in the partition's meta
page. This should be discussed further during the implementation.
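A small sketch of how such a per-partition counter could name delta files, assuming the last value is persisted in the partition meta page; the file name pattern and types are made up.
{code:java}
import java.nio.file.Path;

/** Hypothetical naming of partition delta files by a per-partition, monotonically growing counter. */
class DeltaFileNames {
    private final Path partitionDir;
    private final int partitionId;
    private long counter; // restored from the partition meta page on start

    DeltaFileNames(Path partitionDir, int partitionId, long lastPersistedCounter) {
        this.partitionDir = partitionDir;
        this.partitionId = partitionId;
        this.counter = lastPersistedCounter;
    }

    /** Next delta file, e.g. "part-12-delta-42.bin"; the new counter value must go back to the meta page. */
    Path nextDeltaFile() {
        counter++;
        return partitionDir.resolve("part-" + partitionId + "-delta-" + counter + ".bin");
    }
}
{code}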
> [Native Persistence 3.0] Checkpoint, lifecycle and file store refactoring and
> re-implementation
> -----------------------------------------------------------------------------------------------
>
> Key: IGNITE-15818
> URL: https://issues.apache.org/jira/browse/IGNITE-15818
> Project: Ignite
> Issue Type: Task
> Reporter: Sergey Chugunov
> Assignee: Kirill Tkalenko
> Priority: Major
> Labels: ignite-3
>
--
This message was sent by Atlassian Jira
(v8.20.7#820007)