[
https://issues.apache.org/jira/browse/IGNITE-28429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Bessonov updated IGNITE-28429:
-----------------------------------
Description:
h2. Preface
https://issues.apache.org/jira/browse/IGNITE-27152 described an issue in page
replacement that causes many FSM threads to get stuck until the sorting phase of
a checkpoint is completed. There are two reasons for it:
* Sorting itself is a heavy operation.
* The replacement algorithm waits for the sorting phase, which is itself a
problem.
The proposed fix assumed that the time spent in the "split and sort" phase is
mostly "sort", and that splitting the pages into partitions can be done quickly.
That assumption has been experimentally proven false: grouping dirty pages by
partitions is a heavy operation (especially considering that it's
single-threaded). It must also be optimized.
This, besides the issues mentioned in the linked Jira, is what this epic will
address.
h2. Design
h3. 1. Pre-split dirty pages sets
Grouping pages by partitions requires (a) iterating over a collection of
hash-sets, (b) performing a lookup in a corresponding hash-table for further
insertion, and (c) inserting each page into a corresponding iterable container,
which will later be converted into an array.
The fact that page replacement must wait for the "split" phase makes it
critical performance-wise. More specifically, we must be able to perform these
two operations as quickly as possible:
* Wait until the snapshot of dirty pages is acquired, so that
{{CheckpointPages#removeOnPageReplacement}} can be called.
* Wait until it is possible to determine the content of the corresponding delta
file, so that we can calculate the page's position in that file.
As mentioned in IGNITE-27152, the second wait only needs the results of the
first one; we don't need to wait for the sorting. I'll get back to this point
later.
The proposal is to store dirty pages in a pre-grouped map {{{}GroupPartitionId
-> PartitionDirtyPages{}}}, and to make sure that with this approach both waits
remain fast.
Such a decision forces us to make certain design choices. The majority of the
design will be concentrated on these choices.
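A minimal sketch of such a pre-grouped structure, assuming a concurrent map keyed by {{GroupPartitionId}} (the class and method names other than {{GroupPartitionId}} are illustrative, not the actual API):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: dirty pages are grouped by partition at insertion time,
// so the checkpoint never has to split a flat per-segment set afterwards.
class PreGroupedDirtyPages {
    // Identifies a (groupId, partitionId) pair, as in the proposal.
    record GroupPartitionId(int groupId, int partitionId) {}

    private final Map<GroupPartitionId, Set<Long>> dirtyPages = new ConcurrentHashMap<>();

    // Called on "setDirty": O(1) insert into the partition's own set.
    void markDirty(int groupId, int partitionId, long pageId) {
        dirtyPages
            .computeIfAbsent(new GroupPartitionId(groupId, partitionId),
                    k -> ConcurrentHashMap.newKeySet())
            .add(pageId);
    }

    // A checkpoint reads a partition's pages directly; no global split phase.
    Set<Long> pagesOf(int groupId, int partitionId) {
        return dirtyPages.getOrDefault(new GroupPartitionId(groupId, partitionId), Set.of());
    }
}
```

The point of the sketch is that the expensive "split" work from section 1 is amortized into each {{setDirty}} call instead of being done in one single-threaded pass.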
h3. 2. Decouple PageSupport and PageMemory
The idea is the following: every structure that operates with page memory
already belongs to a single partition, exclusively. We should utilize this
information by providing each partition its own instance of
{{PageSupport}} (probably merged with {{{}PageIdAllocator{}}}). Such an
instance will contain:
* Group ID.
* Partition generation.
* A dirty pages collection.
** The specific properties of this structure are not yet determined. Striping
of this collection may not be necessary, especially if benchmarks show no
issues without it. This is expected, because the number of partitions is
greater than the number of segments.
Partition generation assignment should not be lazy. It's done when we register
a new {{PageSupport}} instance for a partition. It may look like this:
* {{PageSupport PageMemory#newPageSupport(...)}}
* {{void PageSupport#dispose(...)}}
This implies that there's still an internal mapping from a partition to some
descriptor inside the {{PageMemory}} instance.
This move makes dirty pages of different partition generations mutually
exclusive, solving once and for all the problem of filtering obsolete
generations out of the collection of dirty pages, which is good.
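To make the shape concrete, here is one possible reading of the per-partition {{PageSupport}} and the {{newPageSupport}}/{{dispose}} pair quoted above. Everything beyond those two names is an assumption for illustration, including the stub {{PageMemory}} class:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical shape of the per-partition PageSupport described above.
interface PageSupport {
    int groupId();
    int partitionGeneration();
    Set<Long> dirtyPages(); // the per-partition dirty pages collection
    void dispose();
}

// Minimal single-threaded stand-in for PageMemory#newPageSupport(...): each
// registration eagerly assigns the next partition generation, so dirty pages
// of different generations can never mix in one collection.
class PageMemoryStub {
    private int generation = 0;

    PageSupport newPageSupport(int groupId) {
        int gen = ++generation; // eager, not lazy, generation assignment
        Set<Long> dirty = ConcurrentHashMap.newKeySet();
        return new PageSupport() {
            public int groupId() { return groupId; }
            public int partitionGeneration() { return gen; }
            public Set<Long> dirtyPages() { return dirty; }
            public void dispose() { dirty.clear(); }
        };
    }
}
```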
h3. 3. Collecting dirty pages in checkpoint
This should not require holding the write lock anymore. Let me elaborate: we
should use the same trick that we use for partition metadata, namely
cooperation.
* Checkpointer releases the write lock and then starts collecting dirty pages
from all the partitions.
* Every time another thread needs to perform a "setDirty" on a page, it
participates in detaching the dirty pages collection of the previous
checkpoint. This can be achieved with a simple CAS, and it removes the need to
wait for a dedicated checkpointer phase during page replacement (while holding
a segment write lock).
* By the time checkpointer finishes the iteration, all dirty pages are
collected and ready to be sorted.
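The cooperation above can be sketched for a single partition roughly as follows. This is a simplified, single-epoch illustration under the assumption that one CAS-guarded reference holds the set being handed over to the checkpointer; none of these names exist in the codebase:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative "cooperation" for one partition: once a checkpoint is pending,
// whoever touches the partition first (checkpointer or a writer doing
// setDirty) detaches the old dirty set with a single CAS, so writers never
// wait for a dedicated checkpointer phase.
class CooperativeDirtyPages {
    private volatile boolean checkpointPending = false;
    private final AtomicReference<Set<Long>> current = new AtomicReference<>(newSet());
    private final AtomicReference<Set<Long>> snapshot = new AtomicReference<>();

    private static Set<Long> newSet() { return ConcurrentHashMap.newKeySet(); }

    // Checkpointer releases the write lock, then marks the snapshot pending.
    void beginCheckpoint() {
        snapshot.set(null);
        checkpointPending = true;
    }

    // Writer path: if a snapshot is pending, help detach before marking dirty.
    void setDirty(long pageId) {
        if (checkpointPending) detach();
        current.get().add(pageId);
    }

    // Checkpointer path: detach (unless a writer already did) and take the set.
    Set<Long> collect() {
        detach();
        checkpointPending = false;
        return snapshot.get();
    }

    private void detach() {
        Set<Long> old = current.get();
        if (snapshot.get() == null && snapshot.compareAndSet(null, old)) {
            // We won the CAS: swap in a fresh set for post-checkpoint dirt.
            current.compareAndSet(old, newSet());
        }
    }
}
```

The losing side of the CAS simply proceeds; the only consequence of the race is that a page dirtied right at the boundary may land in the current snapshot, which is harmless since it was dirty anyway.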
h3. 4. Sorting and page replacement
The only real reason to sort pages is the order in which they are written to
the storage. The checkpointer iterates a sorted array of page indexes, and the
same sorted array is saved into the delta file's header.
The current implementation presupposes that the delta file's header is
persisted before any other page. That expectation is artificial. In order to
save a page into a delta file, all we need to know is its position, which can
be calculated from an unsorted collection of the page indexes belonging to that
delta file. I'm repeating my position from IGNITE-27152: generally speaking, a
linear scan of an array to calculate the "number of elements less than X" is
faster than sorting that array and then finding X through binary search. This
means that the page replacer may work on an unsorted collection, while the
checkpointer is free to sort pages in the background.
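The "number of elements less than X" computation can be sketched as follows, assuming distinct page indexes per delta file (names are illustrative):

```java
// Illustrative: the write position of page index x in the delta file equals
// the number of page indexes smaller than x. A single linear scan over the
// *unsorted* array yields it, so page replacement need not wait for the sort.
class DeltaFilePosition {
    static int positionOf(int[] unsortedPageIndexes, int x) {
        int rank = 0;
        for (int idx : unsortedPageIndexes) {
            if (idx < x) rank++;
        }
        return rank; // x's slot once the file is eventually laid out sorted
    }
}
```

This produces the same position the page would get from the sorted header, without requiring the sort to have happened yet.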
h3. 5. "isInCheckpoint" checks
The current code performs this check using the collection of dirty pages that
belongs to a given segment. The new structure can make that check more
expensive, due to the additional lookup by partition ID. To address this
potential inefficiency, I propose having two dirty flags, alternating between
checkpoints. For example, the flag itself can be calculated as
{{{}"checkpointSequenceNumber % 2 + 1"{}}}, where the sequence number is
incremented on each checkpoint. This way we can differentiate "who" made a
page dirty. The idea needs further development and could, in theory, be out of
scope. I would prefer it to be in scope, because this design introduces
another level of indirection ("partitionId -> dirtyPages") for this check.
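A sketch of the alternating flag, using the {{"checkpointSequenceNumber % 2 + 1"}} formula quoted above (the surrounding class and the {{CLEAN}} convention are assumptions for illustration):

```java
// Illustrative: two alternating "dirty" flag values derived from the
// checkpoint sequence number, with 0 reserved for "clean". Consecutive
// checkpoints use different values (1, 2, 1, 2, ...), so a page dirtied under
// the previous checkpoint is distinguishable from one dirtied under the
// current one without any "partitionId -> dirtyPages" lookup.
class DirtyFlag {
    static final int CLEAN = 0;

    static int flagFor(long checkpointSequenceNumber) {
        return (int) (checkpointSequenceNumber % 2) + 1; // 1 or 2
    }

    static boolean isInCheckpoint(int pageFlag, long currentCheckpointSequenceNumber) {
        return pageFlag == flagFor(currentCheckpointSequenceNumber);
    }
}
```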
h3. 6. Per-partition dirty pages collection
The current implementation only uses concurrent hash-sets for dirty pages.
This is not an optimal structure. In many cases, a large number of dirty pages
is accompanied by a large number of new allocations (outside of the
free-list). The indexes of new pages form a contiguous range. Utilizing this
knowledge will allow us to:
* Use less memory for dirty pages.
* Avoid sorting the contiguous tail of dirty pages.
* Have a natural distinction between pages that go into the "main" file and
pages that go into the "delta" file, one that doesn't look like an
encapsulation violation.
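A sketch of the "don't sort the tail" idea, assuming pre-existing dirty page indexes are all smaller than the first newly allocated index (which holds when new allocations extend the partition past its old size); all names are illustrative:

```java
import java.util.Arrays;

// Illustrative: new allocations form a contiguous index range [firstNew,
// lastNew], so only the pre-existing dirty pages need sorting; the tail is
// already in order and is appended as-is.
class PartitionDirtyPagesSketch {
    static int[] sortedIndexes(int[] preexistingDirty, int firstNew, int lastNew) {
        int newCount = lastNew - firstNew + 1;
        int[] out = Arrays.copyOf(preexistingDirty, preexistingDirty.length + newCount);
        Arrays.sort(out, 0, preexistingDirty.length); // only the head is sorted
        for (int i = 0; i < newCount; i++) {
            out[preexistingDirty.length + i] = firstNew + i; // tail: no sort needed
        }
        return out;
    }
}
```

The range also needs only two integers of storage instead of one set entry per new page, which is where the memory saving comes from.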
h3. 7. Segment locks
Segment locks will no longer protect partition generations. We should rely on
external synchronization of partition generations, by not allowing a new
{{PageSupport}} instance to be created while an old one exists. Externally,
it's going to be a busy lock.
h3. 8. Partition destruction lock
May become obsolete after all these changes. This part requires further
investigation.
> Rework dirty pages management in `aipersist`
> --------------------------------------------
>
> Key: IGNITE-28429
> URL: https://issues.apache.org/jira/browse/IGNITE-28429
> Project: Ignite
> Issue Type: Epic
> Reporter: Ivan Bessonov
> Assignee: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)