[
https://issues.apache.org/jira/browse/IGNITE-28429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Bessonov updated IGNITE-28429:
-----------------------------------
Description:
h2. Preface
https://issues.apache.org/jira/browse/IGNITE-27152 described an issue in page
replacement that causes many FSM threads to get stuck until the sorting phase of
a checkpoint is completed. There are two reasons for it:
* Sorting itself is a heavy operation.
* The replacement algorithm waits for the sorting phase, which is itself a
problem.
The proposed fix assumed that the time spent in the "split and sort" phase is
mostly "sort", and that splitting the pages into partitions can be done quickly.
That assumption has been experimentally proven false: grouping dirty pages by
partitions is a heavy operation (especially considering that it's
single-threaded). It must also be optimized.
This, besides the issues mentioned in the linked Jira, is what this epic will
address.
h2. Design
h3. 1. Pre-split dirty pages sets
Grouping pages by partitions requires (a) iterating over a collection of
hash-sets, (b) performing a lookup in a corresponding hash-table for further
insertion, and (c) inserting each page into a corresponding iterable container,
which will later be converted into an array.
The fact that page replacement must wait for the "split" phase makes it
critical performance-wise. More specifically, we must be able to perform these
two operations as quickly as possible:
* Wait until the snapshot of dirty pages is acquired, so that
{{CheckpointPages#removeOnPageReplacement}} can be called.
* Wait until it is possible to determine the content of the corresponding delta
file, so that we can calculate the page's position in that file.
As mentioned in IGNITE-27152, the second wait only needs the results of the
first one; we don't need to wait for the sorting. I'll get back to this point
later.
The proposal is to store dirty pages in a pre-grouped map {{{}GroupPartitionId
-> PartitionDirtyPages{}}}, and to make sure that with this approach both waits
remain fast.
Such a decision forces us to make certain design choices. The majority of the
design will be concentrated on these choices.
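A minimal sketch of such a pre-grouped structure, assuming a concurrent map keyed by {{GroupPartitionId}} (the class and method names other than {{GroupPartitionId}} are illustrative, not the actual API):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: dirty pages are grouped by partition at insertion time,
// so the checkpoint never has to split a flat per-segment set afterwards.
class PreGroupedDirtyPages {
    // Identifies a (groupId, partitionId) pair, as in the proposal.
    record GroupPartitionId(int groupId, int partitionId) {}

    private final Map<GroupPartitionId, Set<Long>> dirtyPages = new ConcurrentHashMap<>();

    // Called on "setDirty": O(1) insert into the partition's own set.
    void markDirty(int groupId, int partitionId, long pageId) {
        dirtyPages
            .computeIfAbsent(new GroupPartitionId(groupId, partitionId),
                    k -> ConcurrentHashMap.newKeySet())
            .add(pageId);
    }

    // A checkpoint reads a partition's pages directly; no global split phase.
    Set<Long> pagesOf(int groupId, int partitionId) {
        return dirtyPages.getOrDefault(new GroupPartitionId(groupId, partitionId), Set.of());
    }
}
```

The point of the sketch is that the expensive "split" work from section 1 is amortized into each {{setDirty}} call instead of being done in one single-threaded pass.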
h3. 2. Decouple PageSupport and PageMemory
The idea is the following: every structure that operates with page memory
already belongs to a single partition, exclusively. We should utilize this
information by providing each partition its own instance of
{{PageSupport}} (probably merged with {{{}PageIdAllocator{}}}). Such an
instance will contain:
* Group ID.
* Partition generation.
* A dirty pages collection.
** The specific properties of this structure are not yet determined. Striping
of this collection may not be necessary, especially if benchmarks show no
issues without it. This is expected, because the number of partitions is
greater than the number of segments.
Partition generation assignment should not be lazy. It's done when we register
a new {{PageSupport}} instance for a partition. It may look like this:
* {{PageSupport PageMemory#newPageSupport(...)}}
* {{void PageSupport#dispose(...)}}
This implies that there's still an internal mapping from a partition to some
descriptor inside the {{PageMemory}} instance.
This move makes dirty pages of different partition generations mutually
exclusive, solving once and for all the problem of filtering obsolete
generations out of the collection of dirty pages, which is good.
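To make the shape concrete, here is one possible reading of the per-partition {{PageSupport}} and the {{newPageSupport}}/{{dispose}} pair quoted above. Everything beyond those two names is an assumption for illustration, including the stub {{PageMemory}} class:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical shape of the per-partition PageSupport described above.
interface PageSupport {
    int groupId();
    int partitionGeneration();
    Set<Long> dirtyPages(); // the per-partition dirty pages collection
    void dispose();
}

// Minimal single-threaded stand-in for PageMemory#newPageSupport(...): each
// registration eagerly assigns the next partition generation, so dirty pages
// of different generations can never mix in one collection.
class PageMemoryStub {
    private int generation = 0;

    PageSupport newPageSupport(int groupId) {
        int gen = ++generation; // eager, not lazy, generation assignment
        Set<Long> dirty = ConcurrentHashMap.newKeySet();
        return new PageSupport() {
            public int groupId() { return groupId; }
            public int partitionGeneration() { return gen; }
            public Set<Long> dirtyPages() { return dirty; }
            public void dispose() { dirty.clear(); }
        };
    }
}
```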
h3. 3. Collecting dirty pages in checkpoint
This should not require holding the write lock anymore. Let me elaborate: we
should use the same trick that we use for partition metadata, namely
cooperation.
* Checkpointer releases the write lock and then starts collecting dirty pages
from all the partitions.
* Every time another thread needs to perform a "setDirty" on a page, it
participates in detaching the dirty pages collection of the previous
checkpoint. This can be achieved with a simple CAS, and it removes the need to
wait for a dedicated checkpointer phase during page replacement (while holding
a segment write lock).
* By the time checkpointer finishes the iteration, all dirty pages are
collected and ready to be sorted.
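The cooperation above can be sketched for a single partition roughly as follows. This is a simplified, single-epoch illustration under the assumption that one CAS-guarded reference holds the set being handed over to the checkpointer; none of these names exist in the codebase:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative "cooperation" for one partition: once a checkpoint is pending,
// whoever touches the partition first (checkpointer or a writer doing
// setDirty) detaches the old dirty set with a single CAS, so writers never
// wait for a dedicated checkpointer phase.
class CooperativeDirtyPages {
    private volatile boolean checkpointPending = false;
    private final AtomicReference<Set<Long>> current = new AtomicReference<>(newSet());
    private final AtomicReference<Set<Long>> snapshot = new AtomicReference<>();

    private static Set<Long> newSet() { return ConcurrentHashMap.newKeySet(); }

    // Checkpointer releases the write lock, then marks the snapshot pending.
    void beginCheckpoint() {
        snapshot.set(null);
        checkpointPending = true;
    }

    // Writer path: if a snapshot is pending, help detach before marking dirty.
    void setDirty(long pageId) {
        if (checkpointPending) detach();
        current.get().add(pageId);
    }

    // Checkpointer path: detach (unless a writer already did) and take the set.
    Set<Long> collect() {
        detach();
        checkpointPending = false;
        return snapshot.get();
    }

    private void detach() {
        Set<Long> old = current.get();
        if (snapshot.get() == null && snapshot.compareAndSet(null, old)) {
            // We won the CAS: swap in a fresh set for post-checkpoint dirt.
            current.compareAndSet(old, newSet());
        }
    }
}
```

The losing side of the CAS simply proceeds; the only consequence of the race is that a page dirtied right at the boundary may land in the current snapshot, which is harmless since it was dirty anyway.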
h3. 4. Sorting and page replacement
The only real reason to sort pages is the order in which they are written to
the storage. The checkpointer iterates a sorted array of page indexes, and the
same sorted array is saved into the delta file's header.
The current implementation presupposes that the delta file's header is
persisted before any other page. That expectation is artificial. In order to
save a page into a delta file, all we need to know is its position, which can
be calculated from an unsorted collection of the page indexes belonging to that
delta file. I'm repeating my position from IGNITE-27152: generally speaking, a
linear scan of an array to calculate the "number of elements less than X" is
faster than sorting that array and then finding X through binary search. This
means that the page replacer may work on an unsorted collection, while the
checkpointer is free to sort pages in the background.
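The "number of elements less than X" computation can be sketched as follows, assuming distinct page indexes per delta file (names are illustrative):

```java
// Illustrative: the write position of page index x in the delta file equals
// the number of page indexes smaller than x. A single linear scan over the
// *unsorted* array yields it, so page replacement need not wait for the sort.
class DeltaFilePosition {
    static int positionOf(int[] unsortedPageIndexes, int x) {
        int rank = 0;
        for (int idx : unsortedPageIndexes) {
            if (idx < x) rank++;
        }
        return rank; // x's slot once the file is eventually laid out sorted
    }
}
```

This produces the same position the page would get from the sorted header, without requiring the sort to have happened yet.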
h3. 5. "isInCheckpoint" checks
The current code performs this check using the collection of dirty pages that
belongs to a given segment. The new structure can make that check more
expensive, due to the additional lookup by partition ID. To address this
potential inefficiency, I propose having two dirty flags, alternating between
checkpoints. For example, the flag itself can be calculated as
{{{}"checkpointSequenceNumber % 2 + 1"{}}}, where the sequence number is
incremented on each checkpoint. This way we can differentiate "who" made a
page dirty. The idea needs further development and could, in theory, be out of
scope. I would prefer it to be in scope, because this design introduces
another level of indirection ("partitionId -> dirtyPages") for this check.
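A sketch of the alternating flag, using the {{"checkpointSequenceNumber % 2 + 1"}} formula quoted above (the surrounding class and the {{CLEAN}} convention are assumptions for illustration):

```java
// Illustrative: two alternating "dirty" flag values derived from the
// checkpoint sequence number, with 0 reserved for "clean". Consecutive
// checkpoints use different values (1, 2, 1, 2, ...), so a page dirtied under
// the previous checkpoint is distinguishable from one dirtied under the
// current one without any "partitionId -> dirtyPages" lookup.
class DirtyFlag {
    static final int CLEAN = 0;

    static int flagFor(long checkpointSequenceNumber) {
        return (int) (checkpointSequenceNumber % 2) + 1; // 1 or 2
    }

    static boolean isInCheckpoint(int pageFlag, long currentCheckpointSequenceNumber) {
        return pageFlag == flagFor(currentCheckpointSequenceNumber);
    }
}
```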
h3. 6. Per-partition dirty pages collection
The current implementation only uses concurrent hash-sets for dirty pages.
This is not an optimal structure. In many cases, a large number of dirty pages
is accompanied by a large number of new allocations (outside of the
free-list). The indexes of new pages form a contiguous range. Utilizing this
knowledge will allow us to:
* Use less memory for dirty pages.
* Avoid sorting the contiguous tail of dirty pages.
* Have a natural distinction between pages that go into the "main" file and
pages that go into the "delta" file, one that doesn't look like an
encapsulation violation.
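A sketch of the "don't sort the tail" idea, assuming pre-existing dirty page indexes are all smaller than the first newly allocated index (which holds when new allocations extend the partition past its old size); all names are illustrative:

```java
import java.util.Arrays;

// Illustrative: new allocations form a contiguous index range [firstNew,
// lastNew], so only the pre-existing dirty pages need sorting; the tail is
// already in order and is appended as-is.
class PartitionDirtyPagesSketch {
    static int[] sortedIndexes(int[] preexistingDirty, int firstNew, int lastNew) {
        int newCount = lastNew - firstNew + 1;
        int[] out = Arrays.copyOf(preexistingDirty, preexistingDirty.length + newCount);
        Arrays.sort(out, 0, preexistingDirty.length); // only the head is sorted
        for (int i = 0; i < newCount; i++) {
            out[preexistingDirty.length + i] = firstNew + i; // tail: no sort needed
        }
        return out;
    }
}
```

The range also needs only two integers of storage instead of one set entry per new page, which is where the memory saving comes from.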
h3. 7. Segment locks
Segment locks will no longer protect partition generations. We should rely on
external synchronization of partition generations, by not allowing a new
{{PageSupport}} instance to be created while an old one exists. Externally,
it's going to be a busy lock.
h3. 8. Partition destruction lock
May become obsolete after all these changes. This part requires further
investigation.
> Rework dirty pages management in `aipersist`
> --------------------------------------------
>
> Key: IGNITE-28429
> URL: https://issues.apache.org/jira/browse/IGNITE-28429
> Project: Ignite
> Issue Type: Epic
> Reporter: Ivan Bessonov
> Assignee: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)