Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance

ткаленко кирилл Tue, 04 May 2021 01:29:51 -0700

Hello everybody!

At the moment, if historical rebalance is going to be used for some 
partitions, we reserve segments in the WAL archive (that is, we do not allow 
them to be cleaned up) until the rebalance of all cache groups is over.


If the cluster is under load during the rebalance, the WAL archive size may 
significantly exceed the limit set by 
DataStorageConfiguration#getMaxWalArchiveSize until the process is complete. 
This may cause problems for users: nodes may crash with the "No space left on 
device" error.

We have a system property, IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE 
(0.5 by default), which sets the threshold (multiplied by 
getMaxWalArchiveSize) down to which the WAL archive is cleaned up, i.e. it 
sets the size of the WAL archive that will always be kept on the node. I 
propose to replace this system property with 
DataStorageConfiguration#getWalArchiveSize, specified in bytes, with a default 
of (getMaxWalArchiveSize * 0.5), as it is now.
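
To make the default concrete, here is a minimal sketch of how the proposed 
value could be derived. The class and method names are hypothetical and only 
illustrate the arithmetic; none of them are part of the Ignite API.

```java
// Hypothetical helper, not Ignite code: illustrates the proposed default only.
final class WalArchiveThreshold {
    // Mirrors the current default of IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE.
    static final double DEFAULT_THRESHOLD_PERCENTAGE = 0.5;

    private WalArchiveThreshold() {
        // Static helper, no instances.
    }

    // Proposed default for getWalArchiveSize, in bytes.
    static long defaultWalArchiveSize(long maxWalArchiveSize) {
        return (long) (maxWalArchiveSize * DEFAULT_THRESHOLD_PERCENTAGE);
    }

    // Uses the explicitly configured value (in bytes) if present,
    // otherwise falls back to the default derived from the maximum.
    static long effectiveWalArchiveSize(long maxWalArchiveSize, Long configured) {
        return configured != null ? configured : defaultWalArchiveSize(maxWalArchiveSize);
    }
}
```

For example, with a maximum of 1 GiB (1073741824 bytes) and no explicit 
value, the size that is always kept would be 536870912 bytes.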

Main proposal:
When DataStorageConfiguration#getMaxWalArchiveSize is reached, cancel the 
existing reservations of WAL segments and do not grant new ones until we get 
back to DataStorageConfiguration#getWalArchiveSize. In this case, if the 
segment needed for historical rebalance is no longer available, we will 
automatically switch to full rebalance.
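
The intended behaviour can be sketched roughly as follows. This is only an 
illustration under my reading of the proposal (reservations are denied from 
the moment the maximum is reached until cleanup brings the archive back down 
to getWalArchiveSize); the class, enum and method names are hypothetical and 
do not exist in Ignite.

```java
// Hypothetical sketch, not Ignite code: reservation gating with two thresholds.
final class WalReservationPolicy {
    enum RebalanceMode { HISTORICAL, FULL }

    private final long maxWalArchiveSize; // bytes, as in getMaxWalArchiveSize
    private final long walArchiveSize;    // bytes, as in the proposed getWalArchiveSize
    private boolean reservationsAllowed = true;

    WalReservationPolicy(long maxWalArchiveSize, long walArchiveSize) {
        if (walArchiveSize > maxWalArchiveSize)
            throw new IllegalArgumentException("walArchiveSize must not exceed maxWalArchiveSize");

        this.maxWalArchiveSize = maxWalArchiveSize;
        this.walArchiveSize = walArchiveSize;
    }

    // Called whenever the archive size changes; returns whether segments may be reserved.
    boolean onArchiveSizeChanged(long currentSize) {
        if (currentSize >= maxWalArchiveSize)
            reservationsAllowed = false; // Limit reached: cancel and deny reservations.
        else if (currentSize <= walArchiveSize)
            reservationsAllowed = true;  // Cleaned down to the lower threshold: allow again.

        // Between the two thresholds the previous state is kept.
        return reservationsAllowed;
    }

    // Historical rebalance needs both a permitted reservation and an available segment.
    RebalanceMode chooseMode(boolean segmentAvailable) {
        return reservationsAllowed && segmentAvailable ? RebalanceMode.HISTORICAL : RebalanceMode.FULL;
    }
}
```

Keeping the previous state between the two thresholds avoids flapping between 
granting and cancelling reservations while the archive is being cleaned.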
