[
https://issues.apache.org/jira/browse/IGNITE-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995894#comment-16995894
]
Ivan Rakov commented on IGNITE-12429:
-------------------------------------
[~ascherbakov],
1. Can you explain the motivation for configuring WAL history size in time
units? In my opinion, it's definitely not the best approach.
* First of all, it makes the history size unpredictable: a more intensive
workload leads to a larger history. When a user performs capacity planning for
a production deployment, it's very convenient to know how much disk space the
WAL history will consume in the worst case.
* Secondly, using time units will practically disable historical rebalance for
users with a low-intensity workload. Imagine a case: your data is changed very
infrequently (e.g. 1GB of WAL is generated per day), and you shut down one node
for 12 hours. Even if you are ready to reserve 10GB for WAL history, you still
have to fall back to full partition rebalance.
* Finally, the most critical parameter of historical rebalance (and of the
decision whether to use it at all) is its estimated time, which depends
linearly on the available WAL history size in bytes. Rebalance time does not
depend on the period of time over which the corresponding range of WAL was
generated.
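For comparison, the bytes-based knob already exists on DataStorageConfiguration. A minimal configuration sketch (the size value is illustrative, not a recommendation):

```java
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

// Illustrative config fragment: capping the WAL archive in bytes makes the
// worst-case disk consumption (and hence the depth of history available for
// historical rebalance) predictable regardless of workload intensity.
public class WalArchiveConfig {
    public static IgniteConfiguration configure() {
        DataStorageConfiguration dsCfg = new DataStorageConfiguration()
            // 10 GB of WAL history, independent of how long it took to produce.
            .setMaxWalArchiveSize(10L * 1024 * 1024 * 1024);

        return new IgniteConfiguration().setDataStorageConfiguration(dsCfg);
    }
}
```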
2. As you noticed, we already use GroupStateLazyStore with soft references.
It's a doubtful solution: soft references are unpredictable. If the node has
plenty of free heap space, there's a risk that a large share of it will be
occupied by stored counters for no reason (the same RAM could be utilized by
the OS for page caching, for example). On the other hand, if heap space is
scarce, PME speed will be severely affected by synchronous unswapping from
disk.
>Also I do not understand how having sparse map will help us because we need
>all entries for history calculation.
Do we? Why can't we calculate the history even when some intermediate
checkpoints are skipped?
My point is that we can afford to read a bit more WAL than strictly needed.
Reading 13GB instead of 12.77GB is fine, especially when we save a lot of
resources by not keeping info about every checkpoint between 12.77GB and 13GB.
There may be hundreds of checkpoints if the workload is not intensive.
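The lookup side of such a sparse map can stay trivial: to start reading WAL for historical rebalance, take the newest retained checkpoint at or before the needed offset. A minimal sketch (illustrative, not Ignite code; the offset-to-checkpoint map layout is an assumption):

```java
import java.util.TreeMap;

/**
 * Illustrative sketch (not Ignite code): with a sparse checkpoint history,
 * finding a starting point for historical rebalance is a floor lookup.
 * If the exact checkpoint was dropped from the map, we simply start from an
 * earlier retained one and read slightly more WAL than strictly needed.
 */
public class SparseHistoryLookup {
    /** Maps WAL byte offset of a checkpoint to its id (hypothetical layout). */
    public static Long rebalanceStart(TreeMap<Long, String> sparseHistory, long neededOffset) {
        // Newest retained checkpoint whose offset <= neededOffset.
        return sparseHistory.floorKey(neededOffset);
    }
}
```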
> Rework bytes-based WAL archive size management logic to make historical
> rebalance more predictable
> --------------------------------------------------------------------------------------------------
>
> Key: IGNITE-12429
> URL: https://issues.apache.org/jira/browse/IGNITE-12429
> Project: Ignite
> Issue Type: Improvement
> Affects Versions: 2.7, 2.7.5, 2.7.6
> Reporter: Ivan Rakov
> Priority: Major
>
> Since 2.7, DataStorageConfiguration allows specifying the size of the WAL
> archive in bytes (see DataStorageConfiguration#maxWalArchiveSize), which is
> much more transparent to the user.
> Unfortunately, the new logic may be unpredictable when it comes to historical
> rebalance. The WAL archive is truncated when one of the following conditions
> occurs:
> 1. The total number of checkpoints in the WAL archive is bigger than
> DataStorageConfiguration#walHistSize
> 2. The total size of the WAL archive is bigger than
> DataStorageConfiguration#maxWalArchiveSize
> Independently, the in-memory checkpoint history contains only a fixed number
> of the most recent checkpoints (configurable via
> IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE, 100 by default).
> All these peculiarities make it hard for the user to control the usage of
> historical rebalance. Imagine the case when the user has a light load (WAL
> gets rotated very slowly) and the default checkpoint frequency of 3 minutes.
> After 100 * 3 = 300 minutes, updates in the WAL will be impossible to receive
> via historical rebalance even if:
> 1. The user has configured a large DataStorageConfiguration#maxWalArchiveSize
> 2. The user has configured a large DataStorageConfiguration#walHistSize
> At the same time, setting a large IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE
> will help (only combined with the previous two points), but Ignite node heap
> usage may increase dramatically.
> I propose to change the WAL history management logic in the following way:
> 1. *Don't cut* the WAL archive when the number of checkpoints exceeds
> DataStorageConfiguration#walHistSize. WAL history should be managed based
> only on DataStorageConfiguration#maxWalArchiveSize.
> 2. The checkpoint history should contain a fixed number of entries, but
> should cover the whole stored WAL archive (not only its most recent part
> holding the IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE last checkpoints).
> This can be achieved by making the checkpoint history sparse: some
> intermediate checkpoints *may be absent from the history*, while the fixed
> number of retained checkpoints is distributed either uniformly (trying to
> keep a fixed number of bytes between two neighbouring checkpoints) or
> exponentially (trying to keep a fixed ratio between [size of WAL from
> checkpoint(N-1) to the current write pointer] and [size of WAL from
> checkpoint(N) to the current write pointer]).
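The uniform variant of point 2 above could be sketched as follows (illustrative only, not Ignite code; the exponential variant would use a fixed ratio between neighbouring gaps instead of a fixed byte step):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch (not Ignite code): thin a dense list of checkpoint WAL
 * offsets down to at most maxEntries entries, keeping a roughly uniform byte
 * spacing between the retained checkpoints.
 */
public class SparseCheckpointHistory {
    public static List<Long> thin(List<Long> cpOffsets, int maxEntries) {
        if (cpOffsets.size() <= maxEntries)
            return new ArrayList<>(cpOffsets);

        long first = cpOffsets.get(0);
        long last = cpOffsets.get(cpOffsets.size() - 1);

        // Target byte distance between two neighbouring retained checkpoints.
        double step = (double)(last - first) / (maxEntries - 1);

        List<Long> kept = new ArrayList<>();
        double target = first;

        for (long off : cpOffsets) {
            if (kept.size() < maxEntries && off >= target) {
                kept.add(off);
                target += step;
            }
        }

        // The most recent checkpoint must always survive thinning.
        if (kept.get(kept.size() - 1) != last)
            kept.set(kept.size() - 1, last);

        return kept;
    }
}
```

Dropped intermediate checkpoints only mean that historical rebalance may start from a slightly earlier retained checkpoint, i.e. read a bit more WAL than strictly needed, as argued in the comment above.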
--
This message was sent by Atlassian Jira
(v8.3.4#803005)