[
https://issues.apache.org/jira/browse/KUDU-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexey Serbin updated KUDU-3565:
--------------------------------
Description:
The current implementation of scheduling maintenance operations inadvertently
induces much more disk IO demand than necessary. The maintenance manager thread
can schedule multiple IO-heavy operations _on the same tablet_ to run
concurrently when there is more than one maintenance worker thread. This
sometimes creates bottlenecks by saturating disk IO throughput, since the
maintenance worker threads access data backed by the same set of drives when
working on the same tablet.
Such behavior doesn't scale well as the number of maintenance worker threads
increases, especially when data directories are backed by spinning drives
(HDDs).
With four maintenance threads I saw logs like the ones below. Notice that when
four operations are scheduled to run almost at the same time, each takes more
than 3 seconds to complete, while other instances of similar flush operations
on the same tablet take less than 100ms when no other flush operation on that
tablet is running. Notice also that not only is fdatasync_us higher for the
concurrently running operations, but some of them are now waiting on
synchronization primitives, blocking the maintenance threads from doing other
useful work, e.g. compacting/flushing data of other active tablets. In the case
of MRS/DMS flushes, the flusher acquires the corresponding synchronization
primitives as a writer and thereby blocks concurrently running
insert/update/delete operations, since those acquire the same primitives as
readers when checking for a primary key's presence. So, long-running flush
operations induce extra latency in concurrently running write workloads.
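To make that contention concrete, here is a minimal, self-contained C++17 sketch of the reader/writer pattern described above. It uses std::shared_mutex as a stand-in for the rowset's rw-lock; the function names and the simulated 3-second flush are hypothetical and are not Kudu's actual code.
{noformat}
#include <chrono>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Stand-in for the MRS/DMS component lock described above.
std::shared_mutex component_lock;

void FlushMemStore() {
  // The flusher holds the lock as a writer for the whole flush; a slow
  // fdatasync() stretches this critical section.
  std::unique_lock<std::shared_mutex> l(component_lock);
  std::this_thread::sleep_for(std::chrono::seconds(3));  // simulated slow IO
}

void ApplyRowOperation() {
  // Insert/update/delete paths take the same lock as readers while checking
  // for the primary key's presence, so they stall behind the flusher.
  std::shared_lock<std::shared_mutex> l(component_lock);
  // ... key-presence check and mutation would go here ...
}

int main() {
  std::thread flusher(FlushMemStore);
  std::vector<std::thread> writers;
  for (int i = 0; i < 4; ++i) {
    // Writers typically stall until the flusher releases the lock.
    writers.emplace_back(ApplyRowOperation);
  }
  flusher.join();
  for (auto& t : writers) t.join();
  return 0;
}
{noformat}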
It's necessary to improve the scheduling algorithm for maintenance operations,
differentiating operations not only by their score but also by their IO impact
and the dependencies between them. As a simplest approach, it's worth adding an
option to enable the following behavior:
* an IO-heavy operation (such as a flush, a merge compaction, etc.) should not
be scheduled while another such operation is already scheduled or running on
the same tablet (a rough sketch follows below)
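For illustration, here is a rough C++ sketch of that option. It is not Kudu's actual MaintenanceManager code; the type and member names (MaintenanceOpInfo, busy_tablets_, PickNextOp, OpCompleted) are hypothetical. The idea is simply to track which tablets already have an IO-heavy operation scheduled or running and to skip further IO-heavy candidates for those tablets, while still considering operations for other tablets.
{noformat}
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical description of a candidate maintenance operation.
struct MaintenanceOpInfo {
  std::string tablet_id;
  bool io_heavy = false;   // e.g. MRS/DMS flush, merge compaction
  double perf_score = 0.0;
};

// Synchronization between the scheduler and worker threads is omitted for brevity.
class Scheduler {
 public:
  // Pick the best-scoring runnable candidate, honoring the per-tablet
  // exclusion for IO-heavy operations.
  const MaintenanceOpInfo* PickNextOp(
      const std::vector<MaintenanceOpInfo>& candidates) {
    const MaintenanceOpInfo* best = nullptr;
    for (const auto& op : candidates) {
      if (op.io_heavy && busy_tablets_.count(op.tablet_id) > 0) {
        continue;  // an IO-heavy op is already scheduled/running on this tablet
      }
      if (best == nullptr || op.perf_score > best->perf_score) {
        best = &op;
      }
    }
    if (best != nullptr && best->io_heavy) {
      busy_tablets_.insert(best->tablet_id);  // held until OpCompleted()
    }
    return best;
  }

  // Called by a worker thread once the operation has finished.
  void OpCompleted(const MaintenanceOpInfo& op) {
    if (op.io_heavy) {
      busy_tablets_.erase(op.tablet_id);
    }
  }

 private:
  // Tablets that currently have an IO-heavy operation scheduled or running.
  std::unordered_set<std::string> busy_tablets_;
};
{noformat}
With a single scheduling thread this keeps two flushes/compactions of the same tablet from competing for the same drives, while lower-scored operations on other tablets can still keep the worker threads busy.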
{noformat}
I0328 14:36:18.732758 1279870 maintenance_manager.cc:382] P
db78de4070874e75be1482b02b5161e6: Scheduling
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf
score=0.035118
I0328 14:36:18.816906 1278984 maintenance_manager.cc:603] P
db78de4070874e75be1482b02b5161e6:
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real
0.084s user 0.017s sys 0.005s Metrics:
{"bytes_written":33228,"delete_count":0,"fdatasync":3,"fdatasync_us":56207,"lbm_write_time_us":115,",lbm_writes_lt_1ms":6,"reinsert_count":0,"update_count":7500}
...
I0328 14:36:18.848975 1279870 maintenance_manager.cc:382] P
db78de4070874e75be1482b02b5161e6: Scheduling
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035150
I0328 14:36:18.962837 1279870 maintenance_manager.cc:382] P
db78de4070874e75be1482b02b5161e6: Scheduling
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035182
I0328 14:36:19.073328 1279870 maintenance_manager.cc:382] P
db78de4070874e75be1482b02b5161e6: Scheduling
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035214
I0328 14:36:19.182384 1279870 maintenance_manager.cc:382] P
db78de4070874e75be1482b02b5161e6: Scheduling
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035244
...
I0328 14:36:22.128937 1278984 maintenance_manager.cc:603] P
db78de4070874e75be1482b02b5161e6:
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real
3.280s user 0.021s sys 0.005s Metrics:
{"bytes_written":35275,"delete_count":0,"fdatasync":3,"fdatasync_us":167953,"lbm_write_time_us":124,"lbm_writes_lt_1ms":7,"reinsert_count":0,"update_count":7896}
I0328 14:36:22.241256 1278985 maintenance_manager.cc:603] P
db78de4070874e75be1482b02b5161e6:
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real
3.278s user 0.021s sys 0.000s Metrics:
{"bytes_written":28910,"delete_count":0,"fdatasync":3,"fdatasync_us":99705,"lbm_write_time_us":123,"lbm_writes_lt_1ms":6,"mutex_wait_us":3081560,"reinsert_count":0,"update_count":6381}
...
I0328 14:36:22.292019 1278986 maintenance_manager.cc:603] P
db78de4070874e75be1482b02b5161e6:
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real
3.219s user 0.018s sys 0.003s Metrics:
{"bytes_written":30826,"delete_count":0,"fdatasync":3,"fdatasync_us":84332,"lbm_write_time_us":65,"lbm_writes_lt_1ms":6,"mutex_wait_us":3106656,"reinsert_count":0,"update_count":6915}
I0328 14:36:22.341326 1278983 maintenance_manager.cc:603] P
db78de4070874e75be1482b02b5161e6:
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real
3.159s user 0.015s sys 0.002s Metrics:
{"bytes_written":34786,"delete_count":0,"fdatasync":3,"fdatasync_us":326326,"lbm_write_time_us":109,"lbm_writes_lt_1ms":7,"mutex_wait_us":2807569,"reinsert_count":0,"update_count":7866}
{noformat}
> Avoid scheduling multiple IO-heavy maintenance operations at once for the
> same tablet
> -------------------------------------------------------------------------------------
>
> Key: KUDU-3565
> URL: https://issues.apache.org/jira/browse/KUDU-3565
> Project: Kudu
> Issue Type: Improvement
> Components: master, tserver
> Affects Versions: 1.0.0, 1.0.1, 1.1.0, 1.2.0, 1.3.0, 1.3.1, 1.4.0, 1.5.0,
> 1.6.0, 1.7.0, 1.8.0, 1.7.1, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, 1.11.1,
> 1.13.0, 1.14.0, 1.15.0, 1.16.0, 1.17.0
> Reporter: Alexey Serbin
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)