[ 
https://issues.apache.org/jira/browse/IGNITE-12069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177120#comment-17177120
 ] 

Pavel Pereslegin edited comment on IGNITE-12069 at 8/13/20, 4:11 PM:
---------------------------------------------------------------------

This task has been temporarily put on hold. Testing shows the following results:
 * Index rebuilding takes a very long time. In some cases, the index rebuild 
time (24 threads) exceeds the time of a full rebalance of the indexed cache 
(24 threads).
 * Most of the time is spent on rebuilding the index. The current solution can 
be modified to transfer the index partition if the distribution of partitions 
on the demander matches the distribution on the supplier (affinity can be 
configured for such cases on PARTITIONED caches).
 * Index rebuilding can be started earlier, on each separate (single) partition 
(after this mode is implemented). This should slightly smooth out the index 
rebuild time.
 * A critical slowdown in the transfer of partition files on HDDs was 
revealed, especially with minor concurrent cache updates (in some cases, the 
speed drops tenfold and long timeouts occur, which leads to abnormal 
termination of the process).
 * Single-threaded file transfer can be switched to multi-threaded (which 
should increase the file transfer speed severalfold), because the hard disks 
on the demander are only lightly loaded.
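
As a rough illustration of the precondition in the second bullet, the check 
could look like the sketch below. It assumes the partition ids owned by the 
supplier and by the demander are already available as plain sets; the class 
and method names are hypothetical and are not part of the Ignite API.

```java
import java.util.Set;

/** Hypothetical helper for illustration only; not an Ignite class. */
final class IndexTransferCheck {
    /**
     * Returns true when the demander is assigned exactly the same partitions
     * as the supplier, which is the case where the index partition file could
     * be transferred as-is instead of being rebuilt.
     */
    static boolean canTransferIndex(Set<Integer> supplierParts, Set<Integer> demanderParts) {
        return supplierParts.equals(demanderParts);
    }
}
```

If the sets differ, the index on the demander would still have to be rebuilt 
from the received partition files.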



> Implement file rebalancing management
> -------------------------------------
>
>                 Key: IGNITE-12069
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12069
>             Project: Ignite
>          Issue Type: Sub-task
>            Reporter: Maxim Muzafarov
>            Assignee: Pavel Pereslegin
>            Priority: Major
>              Labels: iep-28
>             Fix For: 2.10
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{Preloader}} should be able to do the following:
>  # build the map of partitions and corresponding supplier nodes from which 
> partitions will be loaded;
>  # switch cache data storage to {{no-op}} and back to original (HWM must be 
> fixed here for the needs of historical rebalance) under the checkpoint and 
> keep the partition update counter for each partition;
>  # asynchronously run index eviction for the list of collected partitions;
>  # send a request message to each node one by one with the list of partitions 
> to load;
>  # wait for the files to be received (listening for the transmission handler);
>  # asynchronously run an index rebuild over the received partitions;
>  # run historical rebalance from LWM to HWM collected above (LWM can be read 
> from the received file meta page).
> h5. Stage 1. Implement "read-only" mode for the cache data store. Implement 
> data store reinitialization on the updated persistence file.
> h6. Tests:
>  - Switching under load.
>  - Check re-initialization of a partition on a new file.
>  - Check that in read-only mode
>  ** H2 indexes are not updated
>  ** update counter is updated
>  ** cache entries eviction works fine
>  ** tx/atomic updates on this partition work fine in the cluster
> h5. Stage 2. Build a map of requested partitions by node, add a message that 
> will be sent to the supplier. Send a demand request, handle the response, 
> switch the datastore when a file is received.
> h6. Tests:
>  - Check partition consistency after receiving a file.
>  - File transmission under load.
>  - Failover: some of the partitions have been switched and the node has been 
> restarted. Rebalancing is expected to continue through the historical 
> rebalance only for fully loaded large partitions; for the rest of the 
> partitions it should restart from the beginning.
> h5. Stage 3. Add WAL history reservation on supplier. Add historical 
> rebalance triggering (LWM (partition) - HWM (read-only)).
> h6. Tests:
>  - File rebalancing with and without load on atomic/tx caches (check 
> existing PDS-enabled rebalancing tests).
>  - Ensure that MVCC groups use regular rebalancing.
>  - Rebalancing on an unstable topology and with failures of the 
> supplier/demander nodes at different stages.
>  - (compatibility) The old nodes should use regular rebalancing.
> h5. Stage 4. Eviction and rebuild of indexes.
> h6. Tests:
>  - File rebalancing of caches with H2 indexes.
>  - Check consistency of H2 indexes.
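
The first {{Preloader}} step in the quoted description (building the map of 
partitions and their supplier nodes) could be sketched roughly as follows. 
This is only an illustration under stated assumptions: the input is assumed to 
be a map from partition id to the id of the chosen supplier node, node ids are 
plain strings, and the class and method names are hypothetical rather than 
Ignite API.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper for illustration only; not an Ignite class. */
final class PartitionRequestPlanner {
    /**
     * Groups partition ids by the supplier node chosen for each partition,
     * so that a single request listing all partitions can be sent per node.
     */
    static Map<String, Set<Integer>> groupBySupplier(Map<Integer, String> supplierByPartition) {
        // Sorted containers give a stable node and partition order.
        Map<String, Set<Integer>> res = new TreeMap<>();

        for (Map.Entry<Integer, String> e : supplierByPartition.entrySet())
            res.computeIfAbsent(e.getValue(), k -> new TreeSet<>()).add(e.getKey());

        return res;
    }
}
```

Each entry of the resulting map corresponds to one request message in step 4 
of the list: the key is the supplier node and the value is the list of 
partitions to load from it.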



