[ 
https://issues.apache.org/jira/browse/IGNITE-27152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-27152:
-----------------------------------
    Summary: Throughput dip during the page sorting phase of a checkpoint  
(was: Page sorter pause)

> Throughput dip during the page sorting phase of a checkpoint
> ------------------------------------------------------------
>
>                 Key: IGNITE-27152
>                 URL: https://issues.apache.org/jira/browse/IGNITE-27152
>             Project: Ignite
>          Issue Type: Improvement
>          Components: storage engines ai3
>            Reporter: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>         Attachments: image-2025-11-24-16-35-22-721.png
>
>
> The flame-graph shows a strong correlation between the page sorting phase of a 
> checkpoint and dips in FSM activity:
> !image-2025-11-24-16-35-22-721.png|width=1133,height=377!
> And while there's a single most likely explanation for it (I'll get to it 
> later), there are multiple issues in the product when it comes to the sorting 
> phase. Let's take a look at them:
> h2. 1. Sorting duration
> It's been noticed in practice that sorting even a million pages takes a few 
> hundred milliseconds, which is a long time. Some clients may have tens or 
> hundreds of millions of dirty pages in a single checkpoint, and sorting time 
> grows super-linearly with the number of pages.
> h3. A proposal
> While joining the collections of dirty pages of a data region together, we 
> should execute the "split" phase as a fair split: each partition should get 
> its own collection.
> Pros:
>  * As a heuristic, we can assume that in most cases dirty pages will be 
> uniformly distributed across a subset of local partitions, whose size is 
> proportional to the number of cores in the system. Under these assumptions:
>  ** There may be a significant number of partitions that require little to no 
> sorting at all. Excluding them is a small win, but a win nonetheless.
>  ** The remaining {{N}} partitions could be sorted up to {{Log(N)}} times 
> faster if each partition is sorted separately: sorting a single array of 
> {{M}} pages costs about {{M*Log(M)}} comparisons, while {{N}} equal chunks 
> cost about {{M*Log(M/N)}}, saving {{Log(N)}} comparisons per page. Even for 
> 32 partitions it might be a 5x gain, and in the majority of real cases we 
> have more than that.
> Cons:
>  * Maintaining multiple collections instead of a single array inevitably 
> leads to extra allocations. There are multiple ways we can tackle this 
> problem (see the sketch after this list):
>  ** Use a segmented list (a list of arrays) instead of an array list. This 
> way each allocation is equally cheap, and there's no need to copy the head of 
> the list on growth.
>  ** The checkpointer thread can have its own cache of pre-allocated segments, 
> if necessary.
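> A minimal sketch of such a segmented list (the class name and segment size 
> are illustrative, not existing Ignite API):
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> 
> /** Append-only list of page IDs that never copies existing data on growth. */
> class PageIdSegmentedList {
>     /** Pages per segment; illustrative value. */
>     private static final int SEGMENT_SIZE = 4096;
> 
>     private final List<long[]> segments = new ArrayList<>();
> 
>     /** Number of used slots in the last segment. */
>     private int used = SEGMENT_SIZE;
> 
>     void add(long pageId) {
>         if (used == SEGMENT_SIZE) {
>             // Each growth step is an equally cheap fixed-size allocation;
>             // segments could also come from the checkpointer's own cache.
>             segments.add(new long[SEGMENT_SIZE]);
>             used = 0;
>         }
>         segments.get(segments.size() - 1)[used++] = pageId;
>     }
> 
>     long get(int i) {
>         return segments.get(i / SEGMENT_SIZE)[i % SEGMENT_SIZE];
>     }
> 
>     int size() {
>         return (segments.size() - 1) * SEGMENT_SIZE + used;
>     }
> }
> {code}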
> h2. 2. Filtering of wrong generations
> Currently we only do that before saving delta-file metadata. It might be 
> beneficial to do this step before sorting the pages of each individual 
> partition. To be even more precise, I propose having three steps: Split, 
> Filter, Sort (sketched below). _Maybe_ split and filter could be executed 
> together. Why it's important:
>  * We will immediately have a valid collection of dirty page identifiers that 
> belong to any given partition.
>  * This will, among other things, allow us to accurately estimate the delta 
> file size before sorting the list of pages. This is important for later steps.
>  * Some of {{CheckpointState#PAGES_SORTED}} usages may eventually be replaced 
> with this "filtered" state.
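> A sketch of the three-step pipeline. The types and helpers here are 
> assumptions, not existing code: {{PartitionDirtyPages}} stands in for a 
> per-partition container, {{generation(id)}} for the real page tag check.
> {code:java}
> import java.util.Arrays;
> import java.util.List;
> 
> /** Hypothetical per-partition container of dirty page IDs. */
> class PartitionDirtyPages {
>     long[] pageIds; // unsorted, unfiltered IDs collected during the SPLIT step
>     long[] filteredSortedIds;
> }
> 
> void splitFilterSort(List<PartitionDirtyPages> partitions, long checkpointGeneration) {
>     for (PartitionDirtyPages part : partitions) {
>         long[] ids = part.pageIds;
> 
>         // FILTER: keep only IDs whose generation matches this checkpoint.
>         int retained = 0;
>         for (long id : ids) {
>             if (generation(id) == checkpointGeneration) {
>                 ids[retained++] = id;
>             }
>         }
> 
>         // The partition's delta file size is now known: 'retained' pages.
>         long[] filtered = Arrays.copyOf(ids, retained);
> 
>         Arrays.sort(filtered); // SORT runs only over the surviving IDs
>         part.filteredSortedIds = filtered;
>     }
> }
> {code}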
> h2. 3. Page replacement improvements
> I believe that page replacement is exactly what causes the dips in the 
> flamegraph. If we trace the usages of 
> {{CheckpointPages#removeOnPageReplacement}} (which joins a certain future), 
> we will see that they all happen while the thread is holding a segment write 
> lock in the data region. Such a lock does not allow any other thread to 
> acquire pages (that would require the segment read lock).
> First of all, we need to figure out whether that's the right place to wait 
> for anything. It appears to me that all we need is to make sure that the 
> checkpointer thread won't be iterating over the {{pageIds}} collection in 
> order to merge it into its own internal representation. That is the 
> {{SPLIT}} phase, not {{{}SORT{}}}. Depending on how far we want to go with 
> the optimization of page replacement, we can even introduce an additional 
> {{COPY}} phase, and make it the fastest one (see the sketch below).
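> A sketch of how the wait could be narrowed, assuming hypothetical per-phase 
> futures (none of these names exist in the code today):
> {code:java}
> import java.util.concurrent.CompletableFuture;
> 
> /** Hypothetical per-checkpoint phase markers. */
> class CheckpointPhases {
>     /** Checkpointer finished copying the shared dirty page collection. */
>     final CompletableFuture<Void> copyDone = new CompletableFuture<>();
> 
>     /** Per-partition split is complete. */
>     final CompletableFuture<Void> splitDone = new CompletableFuture<>();
> 
>     /** Per-partition sorting is complete. */
>     final CompletableFuture<Void> sortDone = new CompletableFuture<>();
> }
> 
> /** Called under the segment write lock during page replacement. */
> void removeOnPageReplacement(CheckpointPhases phases, long pageId) {
>     // It is enough that the checkpointer no longer reads the shared
>     // collection; there is no need to block until sorting is done.
>     phases.copyDone.join();
> 
>     // ... remove pageId from the checkpointer's own copy ...
> }
> {code}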
> Next, in order to write the page into a delta file, we don't need to wait 
> for the {{SORT}} phase either. Technically speaking, we need to wait for the 
> global {{SPLIT}} phase (to make sure that we know all pages that can possibly 
> go into the file), and then for the {{FILTER}} phase of the particular 
> partition. The filtering phase tells us the size of the delta file's header, 
> which is helpful, because:
>  * If the header is already written, then we have a sorted list of page 
> indexes, and thus can use a binary search to find a file offset.
>  * If the header is not yet written, and page indexes are not yet sorted, we 
> can use a linear search through the actual list of page indexes in order to 
> calculate a file position. Linear search is, generally speaking, faster than 
> waiting for the list to be sorted, which means that we will unblock the 
> thread much faster as well. Both lookups are sketched below.
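> Both lookups in one sketch. The file layout here is an assumption: a header 
> listing page indexes, followed by fixed-size pages in the same order.
> {code:java}
> import java.util.Arrays;
> 
> /** Returns the file offset of the page with the given index, or -1 if absent. */
> long pageFileOffset(int[] pageIndexes, boolean sorted, int pageIdx, long headerSize, int pageSize) {
>     int pos;
>     if (sorted) {
>         // Header is written, the list is sorted: binary search.
>         pos = Arrays.binarySearch(pageIndexes, pageIdx);
>     } else {
>         // Not sorted yet: a linear scan is cheaper than waiting for the sort.
>         pos = linearSearch(pageIndexes, pageIdx);
>     }
>     return pos < 0 ? -1 : headerSize + (long) pos * pageSize;
> }
> 
> int linearSearch(int[] a, int key) {
>     for (int i = 0; i < a.length; i++) {
>         if (a[i] == key) {
>             return i;
>         }
>     }
>     return -1;
> }
> {code}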
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
