[ https://issues.apache.org/jira/browse/KUDU-3447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781579#comment-17781579 ]
ASF subversion and git services commented on KUDU-3447:
-------------------------------------------------------
Commit f6f6c243971955d57493ae07d6b89e87f6400a82 in kudu's branch
refs/heads/master from xinghuayu007
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=f6f6c2439 ]
[KUDU-3447] Limit tablets copying speed
Copying tablets from one cluster to another with the
kudu local_replica copy_from_remote command is a resource-intensive
operation. If the data size is very large, the copying process will
last for a long time. Other services may be impacted and become
unavailable because the copying process consumes too much disk
and/or network bandwidth. Therefore it is better to limit the tablet
copying speed and make the system more stable. The goal is a
trade-off between the tablet copying speed and the resource
consumption.
As copy_from_remote mainly downloads data from the remote
cluster and writes the data to the local file system, it is better
to control the download speed to control the resource consumption.
This patch uses a throttler to limit the tablet copying speed.
Two parameters are added:
--tablet_copy_throttler_bytes_per_sec limits the copying speed,
and --tablet_copy_throttler_burst_factor limits the maximum
burst copying speed at any single moment.
Change-Id: I1f4834bfb0718a2b6b1d946975287a11f6be1fe3
Reviewed-on: http://gerrit.cloudera.org:8080/19479
Reviewed-by: Yingchun Lai <[email protected]>
Tested-by: Yingchun Lai <[email protected]>
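The interaction between the two flags in the commit message above can be illustrated with a minimal token-bucket sketch. This is not Kudu's throttler code: the class name TokenBucketThrottler, the method delay_for, the helper simulate_copy, and the rule that the bucket capacity equals bytes_per_sec times burst_factor are all assumptions made for illustration. A simulated clock is used so the example is deterministic.

```python
class TokenBucketThrottler:
    """Hypothetical token-bucket throttler (illustration only, not Kudu's code)."""

    def __init__(self, bytes_per_sec, burst_factor, now=0.0):
        self.rate = float(bytes_per_sec)
        # Assumption: the bucket can hold burst_factor seconds' worth of budget.
        self.capacity = self.rate * burst_factor
        self.tokens = self.capacity
        self.last = now

    def _refill(self, now):
        # Add tokens for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

    def delay_for(self, nbytes, now):
        """Charge nbytes to the bucket; return seconds the caller should wait."""
        self._refill(now)
        self.tokens -= nbytes  # a negative balance is a debt repaid by waiting
        if self.tokens >= 0:
            return 0.0
        return -self.tokens / self.rate


def simulate_copy(total_bytes, chunk_bytes, throttler):
    """Download total_bytes in chunks on a simulated clock; return elapsed seconds."""
    now = 0.0
    remaining = total_bytes
    while remaining > 0:
        n = min(chunk_bytes, remaining)
        now += throttler.delay_for(n, now)  # "sleep" by advancing the clock
        remaining -= n
    return now
```

Under these assumptions, copying 1000 bytes in 100-byte chunks at 100 bytes/s with a burst factor of 2 lets the first 200 bytes through immediately and spreads the remaining 800 bytes over 8 simulated seconds.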
> Limit the usage of network bandwidth of tablet copying
> -------------------------------------------------------
>
> Key: KUDU-3447
> URL: https://issues.apache.org/jira/browse/KUDU-3447
> Project: Kudu
> Issue Type: Improvement
> Reporter: Xixu Wang
> Priority: Minor
> Attachments: image-2023-02-09-10-38-50-512.png,
> image-2023-02-09-10-47-58-370.png, image-2023-02-13-17-08-37-256.png,
> image-2023-02-13-17-16-50-491.png, image-2023-02-13-17-22-25-368.png,
> image-2023-02-13-17-25-15-997.png, image-2023-02-13-17-32-11-650.png
>
>
> Copying tablets from an old cluster to a new cluster with the
> kudu local_replica copy_from_remote command is a resource-intensive operation.
> As the following picture shows, memory usage is as high as 75%, the
> network is almost fully occupied (the overall network bandwidth is 2Gb/s),
> and disk reads are very high (the overall disk bandwidth is 200MB/s).
> !image-2023-02-09-10-47-58-370.png|width=996,height=369!
> If the data size is very large, the copying process will last for a long
> time. Other services may be impacted and become unavailable. Therefore it
> is better to limit the tablet copying speed and make the system more stable.
> The goal is to balance the tablet copying speed against the impact on other
> services.
> As copy_from_remote mainly downloads data from the remote cluster and
> writes the data to the local file system, it is better to control the download
> speed to control the resource consumption. There are several algorithms for
> implementing a rate limiter. This patch will use the token bucket algorithm
> implemented by the Facebook Folly library:
> [https://github.com/facebook/folly/blob/main/folly/TokenBucket.h]
>
> *Performance Tests*
> 1. Data size:
> TABLE test_1
> on disk size: 13263880213
> live row count: 66433035
> 2. Test Case:
> case 1:
> kudu local_replica copy_from_remote xxx_tablet_ids src_tserver_addr:7050
> -fs_data_dirs=/test/data_dir -fs_wal_dir=/test/wal_dir
> -tablet_copy_download_threads_nums_per_session=4 -num_threads=4
> case 2:
> kudu local_replica copy_from_remote xxx_tablet_ids src_tserver_addr:7050
> -fs_data_dirs=/test/data_dir -fs_wal_dir=/test/wal_dir
> -tablet_copy_download_threads_nums_per_session=4 -num_threads=4
> -enable_network_speed_limit=true -limit_network_speed=25
> 3. Results:
> 3.1 The usage of CPU
> Left is test case 1, right is test case 2. As we can see, using the speed
> limit feature can reduce CPU consumption.
> !image-2023-02-13-17-08-37-256.png|width=418,height=559!!image-2023-02-13-17-16-50-491.png|width=794,height=369!
> 3.2 CPU load
> Left is case 1, right is case 2. As we can see, using the speed limit feature
> can reduce CPU load.
> !image-2023-02-13-17-22-25-368.png|width=536,height=408!!image-2023-02-13-17-25-15-997.png|width=851,height=402!
> 3.3 Network bandwidth
> Left is case 1, right is case 2. As we can see, using the speed limit feature
> can limit network usage to nearly 25MB/s.
> !image-2023-02-13-17-32-11-650.png|width=1393,height=652!
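As a rough cross-check of the quoted test setup, the expected duration of a throttled copy can be estimated from the table size and the speed limit. This is only a back-of-the-envelope sketch: it assumes the limit of 25 is in MB/s (as the result in 3.3 suggests), that the bytes transferred are close to the reported on-disk size, and that the limit is the only bottleneck. The function name estimated_copy_seconds is hypothetical.

```python
def estimated_copy_seconds(total_bytes, limit_mb_per_sec, bytes_per_mb=1_000_000):
    """Lower-bound duration of a copy that is capped at limit_mb_per_sec."""
    return total_bytes / (limit_mb_per_sec * bytes_per_mb)


ON_DISK_SIZE = 13263880213  # "on disk size" of TABLE test_1 from the issue

# Decimal megabytes (10^6 bytes): roughly 530 seconds, a little under 9 minutes.
seconds_decimal = estimated_copy_seconds(ON_DISK_SIZE, 25)

# Binary megabytes (2^20 bytes): roughly 506 seconds.
seconds_binary = estimated_copy_seconds(ON_DISK_SIZE, 25, bytes_per_mb=1024 * 1024)
```

Either way the throttled copy of this table would be expected to take somewhat under nine minutes, which gives a baseline for judging whether a chosen limit is acceptable for larger data sets.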
--
This message was sent by Atlassian Jira
(v8.20.10#820010)