[
https://issues.apache.org/jira/browse/IGNITE-27152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Bessonov updated IGNITE-27152:
-----------------------------------
Description:
The flame graph shows a strong correlation between the page sorting phase of a
checkpoint and dips in FSM activity:
!image-2025-11-24-16-35-22-721.png|width=1133,height=377!
And while there's just a single most likely explanation for it (I'll get to it
later), there are multiple issues in the product when it comes to the sorting
phase. Let's take a look at them:
h2. 1. Sorting duration
It's been noticed in practice that sorting even a million pages takes a few
hundred milliseconds, which is a lot of time. Some clients may have tens or
hundreds of millions of dirty pages in a single checkpoint, and the sorting
time grows super-linearly (comparison sorts are {{O(n log n)}}).
h3. A proposal
While joining the collections of dirty pages of a data region together, we
should execute the "split" phase as a proper split: each partition should get
its own collection.
Pros:
* As a heuristic, we can assume that in most cases dirty pages will be
uniformly distributed across a subset of local partitions whose size is
proportional to the number of cores in the system. Following these
assumptions, we can conclude that:
** There may be a significant number of partitions that require little to no
sorting at all. Excluding them doesn't help much, but it does help.
** The remaining {{N}} partitions could be sorted up to {{log(N)}} times faster
if each partition is sorted separately. Even for 32 partitions that might be a
5x gain, and in the majority of real cases we have more partitions than that.
Cons:
* Maintaining multiple collections instead of a single array inevitably leads
to extra allocations. There are multiple ways to tackle this problem:
** Use a segmented list (a list of arrays) instead of an array list, so that
each allocation is equally cheap and there's no need to copy the head of the
list (see the sketch after this list).
** The checkpointer thread can have its own cache of pre-allocated segments, if
necessary.
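A minimal sketch of that split, assuming page IDs are plain {{long}} values and a fixed
segment size; the class names here are illustrative and do not correspond to existing
Ignite classes:
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Dirty page IDs of a single partition, stored as a list of fixed-size segments. */
class SegmentedPageIdList {
    private static final int SEGMENT_SIZE = 1024;

    private final List<long[]> segments = new ArrayList<>();
    private int lastSegmentSize = SEGMENT_SIZE; // Forces a new segment on the first add.

    /** Appends a page ID; never copies previously added elements. */
    void add(long pageId) {
        if (lastSegmentSize == SEGMENT_SIZE) {
            // A real implementation could take this segment from a per-checkpointer cache.
            segments.add(new long[SEGMENT_SIZE]);
            lastSegmentSize = 0;
        }
        segments.get(segments.size() - 1)[lastSegmentSize++] = pageId;
    }

    int size() {
        return segments.isEmpty() ? 0 : (segments.size() - 1) * SEGMENT_SIZE + lastSegmentSize;
    }

    /** Flattens the segments into one array and sorts it, independently of other partitions. */
    long[] toSortedArray() {
        long[] result = new long[size()];
        for (int i = 0; i < segments.size(); i++) {
            int len = (i == segments.size() - 1) ? lastSegmentSize : SEGMENT_SIZE;
            System.arraycopy(segments.get(i), 0, result, i * SEGMENT_SIZE, len);
        }
        Arrays.sort(result);
        return result;
    }
}

/** The "fair" split: each partition gets its own collection, so sorting is done per partition. */
class PartitionedDirtyPages {
    private final Map<Integer, SegmentedPageIdList> byPartition = new HashMap<>();

    void add(int partitionId, long pageId) {
        byPartition.computeIfAbsent(partitionId, p -> new SegmentedPageIdList()).add(pageId);
    }

    Map<Integer, SegmentedPageIdList> partitions() {
        return byPartition;
    }
}
{code}
Appending never copies previously added elements, and each partition is sorted
independently, so every partition only pays for its own pages.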
h2. 2. Filtering of wrong generations
Currently we only do that before saving delta-file metadata. It might be
beneficial to do this step before sorting the pages of each individual
partition. To be even more precise, I propose having three steps: Split,
Filter, Sort (a sketch follows the list below). _Maybe_ split and filter could
be executed together. Why it's important:
* We will immediately have a valid collection of dirty page identifiers that
belong to any given partition.
* This will, among other things, allow us to accurately estimate the delta
file size before sorting the list of pages. This is important for later steps.
* Some of the {{CheckpointState#PAGES_SORTED}} usages may eventually be replaced
with this "filtered" state.
h2. 3. Page replacement improvements
I believe that page replacement is exactly what causes the issues in the
flame graph. If we trace the usages of
{{CheckpointPages#removeOnPageReplacement}} (which contains a join on a certain
future), we will see that they all happen while the thread is holding a
segment write lock in the data region. Such a lock does not allow any other
thread to acquire pages (that would require a segment read lock).
First of all, we need to figure out whether that's the right place to wait for
anything. It appears to me that what we need is to make sure that the
checkpointer thread won't be iterating over the {{pageIds}} collection in order
to merge it into its own internal representation. That is the {{SPLIT}} phase,
not {{SORT}}. Depending on how far we want to go with the optimization of page
replacement, we can even introduce an additional {{COPY}} phase and make it the
fastest one.
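A minimal sketch of what that could look like, assuming per-checkpoint phase futures;
the class and field names are hypothetical, only the {{removeOnPageReplacement}} name
comes from the product:
{code:java}
import java.util.concurrent.CompletableFuture;

/** Hypothetical per-checkpoint phase futures; only the cheapest prerequisite is awaited. */
class CheckpointPhaseFutures {
    final CompletableFuture<Void> copyDone = new CompletableFuture<>(); // Or splitDone.
    final CompletableFuture<Void> sortDone = new CompletableFuture<>();

    /**
     * Called on page replacement while the segment write lock is held. Waiting only until
     * the checkpointer has copied/split the page IDs (instead of until they are sorted)
     * shortens the time the write lock blocks other threads from acquiring pages.
     */
    boolean removeOnPageReplacement(long pageId) {
        copyDone.join(); // Previously the wait effectively covered the SORT phase as well.
        return removeFromInternalCollection(pageId);
    }

    private boolean removeFromInternalCollection(long pageId) {
        // Placeholder for removal from the checkpointer's internal representation.
        return true;
    }
}
{code}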
Next, in order to then write the page into a delta file, we don't need to wait
for the {{SORT}} phase either. Technically speaking, we need to wait for the
global {{SPLIT}} phase (to make sure that we know all pages that can possibly go
into the file), and then for the {{FILTER}} phase of the particular partition.
The filtering phase lets us know the size of the delta file's header, which is
helpful, because:
* If the header is already written, then we have a sorted list of page indexes,
and thus can use a binary search to find the file offset.
* If the header is not yet written and the page indexes are not yet sorted, we
can use a linear search through the actual list of page indexes in order to
calculate the file position (see the sketch after this list). A linear search
is, generally speaking, faster than waiting for the list to be sorted, which
means that we will unblock the thread much faster as well.
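A sketch of both lookups, assuming the delta file lays pages out after the header in
ascending page-index order (which is what the binary search in the first bullet
implies); the class and method names are illustrative:
{code:java}
import java.util.Arrays;

/** Two ways to compute a page's offset in a delta file, depending on checkpoint progress. */
class DeltaFilePageOffsets {
    /** Header already written: binary search over the sorted page indexes. */
    static long offsetAfterSort(int[] sortedPageIndexes, int pageIndex, int pageSize, int headerSize) {
        int slot = Arrays.binarySearch(sortedPageIndexes, pageIndex);
        return slot < 0 ? -1 : headerSize + (long) slot * pageSize;
    }

    /**
     * Header not yet written: a single linear pass over the filtered (still unsorted) page
     * indexes computes the same slot that the sorted header would eventually yield, without
     * waiting for the SORT phase.
     */
    static long offsetBeforeSort(int[] filteredPageIndexes, int filteredCount, int pageIndex,
            int pageSize, int headerSize) {
        int slot = 0;
        boolean found = false;
        for (int i = 0; i < filteredCount; i++) {
            if (filteredPageIndexes[i] < pageIndex) {
                slot++;
            } else if (filteredPageIndexes[i] == pageIndex) {
                found = true;
            }
        }
        return found ? headerSize + (long) slot * pageSize : -1;
    }
}
{code}
The linear pass yields the same slot that the sorted header would eventually contain,
so the offset stays valid once the {{SORT}} phase completes.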
> Throughput dip during the page sorting phase of a checkpoint
> ------------------------------------------------------------
>
> Key: IGNITE-27152
> URL: https://issues.apache.org/jira/browse/IGNITE-27152
> Project: Ignite
> Issue Type: Improvement
> Components: storage engines ai3
> Reporter: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
> Attachments: image-2025-11-24-16-35-22-721.png
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)