[ 
https://issues.apache.org/jira/browse/IGNITE-27152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-27152:
-----------------------------------
    Description: 
The flame graph shows a strong correlation between the page sorting phase of a 
checkpoint and dips in FSM activity:

!image-2025-11-24-16-35-22-721.png|width=1133,height=377!

While there is a single most likely explanation for it (I'll get to it later), 
the product has multiple issues when it comes to the sorting phase. Let's take a 
look at them:
h2. 1. Sorting duration

It has been noticed in practice that sorting even a million pages takes a few 
hundred milliseconds, which is a large amount of time. Some clients may have 
tens or hundreds of millions of dirty pages in a single checkpoint, and the 
sorting time scales non-linearly.
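For a rough sense of scale: comparison-based sorting costs {{O(M*log(M))}} comparisons, 
so going from 10^6 to 10^8 dirty pages grows the input 100x but the comparison count 
roughly 130x; assuming the few-hundred-millisecond figure above for 10^6 pages, 10^8 
pages would land in the tens of seconds even before cache effects are taken into account.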
h3. A proposal

While merging the collections of dirty pages of a data region together, we 
should execute the "split" phase as a fair split: each partition should have 
its own collection.

Pros:
 * As a heuristic, we can assume that in most cases dirty pages are uniformly 
distributed across a subset of local partitions, and that the size of this 
subset is proportional to the number of cores in the system. Under these 
assumptions, we can conclude that:
 ** There may be a significant number of partitions that require little to no 
sorting at all. Excluding them doesn't help much, but it does help.
 ** The remaining {{N}} partitions could be sorted up to {{log(N)}} times faster if 
each partition is sorted separately. Even for 32 partitions that might be a 5x 
gain, and in the majority of real cases we have more partitions than that.

Cons:
 * Maintaining multiple collections instead of a single array inevitably leads 
to extra allocations. There are multiple ways to tackle this problem (a sketch 
follows this list):
 ** Use a segmented list (a list of arrays) instead of an array list. This way 
every allocation is equally cheap, and there is no need to copy already-added 
elements when the list grows.
 ** The checkpointer thread can keep its own cache of pre-allocated segments, if 
necessary.
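
Below is a minimal sketch of the fair split with segmented per-partition lists. All 
names here ({{SegmentedLongList}}, {{DirtyPagesSplitter}}, the {{partitionOf}} 
function) are illustrative assumptions rather than existing classes:
{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongToIntFunction;

/** Sketch: a list of page ids that grows in fixed-size segments, so appending never copies already-added elements. */
class SegmentedLongList {
    private static final int SEGMENT_SIZE = 4096;

    private final List<long[]> segments = new ArrayList<>();
    private int usedInLastSegment = SEGMENT_SIZE; // Forces a new segment on the first add.

    void add(long pageId) {
        if (usedInLastSegment == SEGMENT_SIZE) {
            // A fresh segment; it could instead be taken from a cache of pre-allocated
            // segments owned by the checkpointer thread.
            segments.add(new long[SEGMENT_SIZE]);
            usedInLastSegment = 0;
        }
        segments.get(segments.size() - 1)[usedInLastSegment++] = pageId;
    }

    int size() {
        return segments.isEmpty() ? 0 : (segments.size() - 1) * SEGMENT_SIZE + usedInLastSegment;
    }
}

/** Sketch: "fair split" of a data region's dirty pages into per-partition collections. */
class DirtyPagesSplitter {
    static Map<Integer, SegmentedLongList> splitByPartition(long[] dirtyPageIds, LongToIntFunction partitionOf) {
        Map<Integer, SegmentedLongList> byPartition = new HashMap<>();

        for (long pageId : dirtyPageIds) {
            byPartition.computeIfAbsent(partitionOf.applyAsInt(pageId), p -> new SegmentedLongList()).add(pageId);
        }

        return byPartition;
    }
}
{code}
Each per-partition list can then be filtered and sorted independently, and potentially 
in parallel.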

h2. 2. Filtering of wrong generations

Currently we only filter out pages of wrong generations right before saving 
delta-file metadata. It might be beneficial to do this step before sorting the 
pages of each individual partition. To be even more precise, I propose having 
three steps: Split, Filter, Sort (a sketch of the per-partition Filter and Sort 
steps follows the list below). _Maybe_ split and filter could be executed 
together. Why it's important:
 * We will immediately have a valid collection of dirty page identifiers that 
belong to any given partition.
 * This will, among other things, allow us to accurately estimate the delta 
file size before sorting the list of pages. This is important for later steps.
 * Some usages of {{CheckpointState#PAGES_SORTED}} may eventually be replaced 
with this "filtered" state.
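
A minimal sketch of the per-partition Filter and Sort steps (the predicate name 
{{isCurrentGeneration}} and the class name are illustrative assumptions, not existing APIs):
{code:java}
import java.util.Arrays;
import java.util.function.LongPredicate;

/** Sketch of the per-partition FILTER and SORT steps. */
class PartitionPagePipeline {
    static long[] filterAndSort(long[] partitionPageIds, LongPredicate isCurrentGeneration) {
        // FILTER: drop page ids of stale generations before sorting anything.
        long[] filtered = Arrays.stream(partitionPageIds).filter(isCurrentGeneration).toArray();

        // At this point "filtered.length" is an exact page count for this partition's
        // delta file, so the header size is known before the sort even starts.

        // SORT: only the filtered ids, and only within this single partition.
        Arrays.sort(filtered);

        return filtered;
    }
}
{code}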

h2. 3. Page replacement improvements

I believe that page replacement is exactly what causes the issues seen in the 
flame graph. If we trace the usages of 
{{CheckpointPages#removeOnPageReplacement}} (which joins a certain future), we 
will see that they all happen while the thread is holding a segment write lock 
in the data region. Such a lock does not allow any other thread to acquire 
pages (acquiring a page requires the segment read lock).

First of all, we need to figure out whether that is the right place to wait for 
anything at all. It appears to me that what we actually need is to make sure that 
the checkpointer thread is not iterating over the {{pageIds}} collection while 
merging it into its own internal representation. That is the {{SPLIT}} phase, not 
{{SORT}}. Depending on how far we want to go with the optimization of page 
replacement, we could even introduce an additional {{COPY}} phase and make it the 
fastest one, as sketched below.
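
A rough sketch of the intent only (this is not the real {{CheckpointPages}} API, just an 
illustration of waiting for the cheap copy/split phase instead of the sort):
{code:java}
import java.util.Set;
import java.util.concurrent.CompletableFuture;

/** Sketch only: page replacement waits for the SPLIT/COPY phase rather than SORT. */
class CheckpointPagesSketch {
    private final Set<Long> pageIds;
    private final CompletableFuture<Void> splitPhaseDone; // Completed once pageIds has been copied/split.

    CheckpointPagesSketch(Set<Long> pageIds, CompletableFuture<Void> splitPhaseDone) {
        this.pageIds = pageIds;
        this.splitPhaseDone = splitPhaseDone;
    }

    boolean removeOnPageReplacement(long pageId) {
        // Wait only until the checkpointer has finished copying/splitting its internal
        // representation of the dirty page set. This is much shorter than waiting for the
        // full SORT phase, so less time is spent under the segment write lock.
        splitPhaseDone.join();

        return pageIds.remove(pageId);
    }
}
{code}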

Next, in order to write the page into a delta file, we don't need to wait for 
the {{SORT}} phase either. Technically speaking, we need to wait for the global 
{{SPLIT}} phase (to make sure that we know all pages that can possibly go into 
the file), and then for the {{FILTER}} phase of the particular partition. The 
filtering phase tells us the size of the delta file's header, which is helpful 
because:
 * If the header is already written, then we have a sorted list of page indexes 
and can use a binary search to find the file offset.
 * If the header is not yet written and the page indexes are not yet sorted, we 
can use a linear pass through the actual list of page indexes to calculate the 
file position. A linear pass is, generally speaking, faster than waiting for the 
list to be sorted, which means we unblock the waiting thread much faster as well. 
A sketch of both cases follows this list.
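
A minimal sketch of both lookups, assuming pages are laid out in the delta file in 
ascending page-index order (names are illustrative):
{code:java}
import java.util.Arrays;

/** Sketch: locating a page's slot in a delta file before and after the header is written. */
class DeltaFileSlotSketch {
    /**
     * Returns the slot of {@code pageIndex}, assuming it is present in {@code pageIndexes}
     * (the filtered set for the partition) and that the file stores pages in ascending
     * page-index order.
     */
    static int slotOf(int pageIndex, int[] pageIndexes, boolean headerWritten) {
        if (headerWritten) {
            // Header written => the indexes are already sorted, binary search gives the slot.
            return Arrays.binarySearch(pageIndexes, pageIndex);
        }

        // Header not written yet => the indexes are filtered but unsorted. The slot in the
        // eventual sorted layout is the rank of the index: the number of smaller indexes.
        // A single linear pass is typically cheaper than blocking until the SORT phase is done.
        int rank = 0;
        for (int idx : pageIndexes) {
            if (idx < pageIndex) {
                rank++;
            }
        }
        return rank;
    }
}
{code}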

 



> Throughput dip during the page sorting phase of a checkpoint
> ------------------------------------------------------------
>
>                 Key: IGNITE-27152
>                 URL: https://issues.apache.org/jira/browse/IGNITE-27152
>             Project: Ignite
>          Issue Type: Improvement
>          Components: storage engines ai3
>            Reporter: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>         Attachments: image-2025-11-24-16-35-22-721.png
>
>


