Alexey Serbin created KUDU-3028:
-----------------------------------
Summary: Prefer running concurrent flushes/compactions on
different data directories, if possible
Key: KUDU-3028
URL: https://issues.apache.org/jira/browse/KUDU-3028
Project: Kudu
Issue Type: Improvement
Components: tserver
Reporter: Alexey Serbin
In a Kudu cluster where each tablet server has 9 data directories, each backed
by a separate HDD (spinning disk), and 3 maintenance manager threads, I noticed
a long period (2 hours or so) of 100% IO saturation on one drive, followed by a
long period of 100% IO saturation on another drive.
I noticed that all 3 maintenance threads were hammering the same data directory
for a long time (that was the reason for the 100% IO saturation on the backing
drive). Then they switched to another data directory, saturating the IO there.
That led to extremes like tens of seconds of waiting for fsync to complete.
With a higher number of data directories and more maintenance threads, this may
become even more extreme.
{noformat}
W1218 12:10:04.712692 247413 env_posix.cc:889] Time spent sync call for
/data/6/kudu/tablet/data/data/4b1f42243784484b85a57255c88d8b93.metadata: real
27.245s user 0.000s sys 0.000s
W1218 12:10:04.712724 247412 env_posix.cc:889] Time spent sync call for
/data/6/kudu/tablet/data/data/128658789b56415b82becf42f34c4af1.metadata: real
27.244s user 0.000s sys 0.000s
W1218 12:11:22.690099 247411 env_posix.cc:889] Time spent sync call for
/data/6/kudu/tablet/data/data/ad4c53b4e230488899f55e6580c070af.data: real
15.357s user 0.000s sys 0.000s
{noformat}
{noformat}
W1218 14:17:30.151391 247412 env_posix.cc:889] Time spent sync call for
/data/3/kudu/tablet/data/data/165c86d614c54f9f8bfaf01361ceca16.data: real
10.674s user 0.000s sys 0.000s
W1218 14:17:30.151448 247413 env_posix.cc:889] Time spent sync call for
/data/3/kudu/tablet/data/data/820354e482be40f9858b29484c2db5c6.metadata: real
11.807s user 0.000s sys 0.000s
W1218 14:17:30.151460 247411 env_posix.cc:889] Time spent sync call for
/data/3/kudu/tablet/data/data/483a57ac212544f3b39cbe887bf16946.metadata: real
23.472s user 0.000s sys 0.000s
{noformat}
It would be nice to schedule compactions and flushes so they are spread across
the available data directories, if possible.
Also, it would be great to establish a limit on concurrent compactions/flushes
per data directory, so that even with a larger number of data directories it
would be possible to prevent all the flushing/compacting threads from hammering
a single data directory.
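A minimal sketch of such a policy (hypothetical names, not the actual Kudu MaintenanceManager API): when scheduling an IO-heavy operation, pick the data directory with the fewest in-flight maintenance operations, and refuse to schedule onto a directory that has reached a per-directory cap.

```cpp
// Hypothetical sketch of per-directory scheduling for maintenance operations:
// spread work across data directories and cap concurrency per directory.
// Names (DirScheduler, PickDir, OpFinished) are illustrative only.
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

class DirScheduler {
 public:
  DirScheduler(std::vector<std::string> dirs, int max_ops_per_dir)
      : max_ops_per_dir_(max_ops_per_dir) {
    for (auto& d : dirs) in_flight_[std::move(d)] = 0;
  }

  // Returns the least-loaded directory that is still below the per-directory
  // cap, or nullopt if every directory is already at its concurrency limit.
  std::optional<std::string> PickDir() {
    const std::pair<const std::string, int>* best = nullptr;
    for (const auto& e : in_flight_) {
      if (e.second >= max_ops_per_dir_) continue;  // at the cap: skip
      if (!best || e.second < best->second) best = &e;
    }
    if (!best) return std::nullopt;
    ++in_flight_[best->first];  // account for the newly scheduled op
    return best->first;
  }

  // Called when a flush/compaction touching 'dir' completes.
  void OpFinished(const std::string& dir) {
    assert(in_flight_[dir] > 0);
    --in_flight_[dir];
  }

 private:
  const int max_ops_per_dir_;
  std::map<std::string, int> in_flight_;  // dir -> running op count
};
```

With this policy, consecutive operations land on different directories first, and once every directory hits the cap the scheduler declines to start more IO-heavy work rather than piling it onto one drive.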
Another approach might be switching from the multi-directory structure to a
volume-based approach, where the filesystem or a controller takes care of
fanning out the IO to the multitude of drives backing the volume.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)