[ https://issues.apache.org/jira/browse/IGNITE-8017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16435672#comment-16435672 ]

Alexey Goncharuk commented on IGNITE-8017:
------------------------------------------

Ilya,

1. Please check the case when WAL was disabled for rebalancing, then the topology 
changes and the node is not going to rebalance anymore. You listen on the rebalance 
future and enable WAL only if the future succeeds; however, I am not sure whether 
another rebalancing session is triggered in this case.
2. We need to persist the locally disabled WAL state, because checkpoints are still 
running and the local storage for a WAL-disabled cache group may get corrupted. If 
such a situation happens, we need to clean up the corresponding cache group 
storages the same way the global WAL disable does. Please add a corresponding 
test.
3. When WAL is re-enabled, we need to enable WAL first and then trigger a 
checkpoint, not in the reverse order (see the sketch after this list).
4. There may be a race in the {{onGroupRebalanceFinished}} method - we can own a 
partition that we did not rebalance. I think a topology read lock is the proper way 
to synchronize here.
5. Please check that {{changeLocalStatesOnExchangeDone}} is not called while 
holding the checkpoint read lock, otherwise it may deadlock.
6. Add a specific test that checks that WAL is not disabled on cluster nodes when 
the BLT size is reduced.

> Disable WAL during initial preloading
> -------------------------------------
>
>                 Key: IGNITE-8017
>                 URL: https://issues.apache.org/jira/browse/IGNITE-8017
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Ilya Lantukh
>            Assignee: Ilya Lantukh
>            Priority: Major
>              Labels: iep-16
>             Fix For: 2.5
>
>
> While handling a SupplyMessage, a node handles each supplied data entry 
> separately, which causes a WAL record to be written for every entry. This 
> significantly limits preloading speed.
> We can improve rebalancing speed and reduce pressure on disk by disabling WAL 
> until all data is loaded. The disadvantage of this approach is that data 
> might get corrupted if the node crashes - but a node that crashed during 
> preloading has to clear all its data anyway. However, it is important to 
> distinguish the situation when a new node joins the cluster or is added to 
> the baseline topology (and doesn't hold any data) from the situation when 
> additional partitions are assigned to a node after the baseline topology 
> changes (in this case the node has to keep all its data in a consistent state).
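
For context, Ignite 2.4+ already exposes a manual, cluster-wide form of this WAL 
toggle via {{IgniteCluster#disableWal(String)}} and {{IgniteCluster#enableWal(String)}}; 
this issue effectively proposes an automatic, per-cache-group version of the same 
idea during initial preloading. A minimal usage sketch (the cache name and config 
path below are made up for illustration):

{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class WalToggleSketch {
    public static void main(String[] args) {
        // Config path and cache name are illustrative only.
        try (Ignite ignite = Ignition.start("config/example-ignite.xml")) {
            ignite.getOrCreateCache("myCache");

            // Manual, cluster-wide analogue of what this issue automates per
            // cache group: switch WAL off for the bulk load...
            ignite.cluster().disableWal("myCache");

            try (IgniteDataStreamer<Integer, String> streamer = ignite.dataStreamer("myCache")) {
                for (int i = 0; i < 1_000_000; i++)
                    streamer.addData(i, "value-" + i);
            }

            // ...and switch it back on once the data is loaded (re-enabling
            // makes the loaded data durable again).
            ignite.cluster().enableWal("myCache");
        }
    }
}
{code}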


