[
https://issues.apache.org/jira/browse/KUDU-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999618#comment-16999618
]
Alexey Serbin commented on KUDU-3028:
-------------------------------------
It seems scaling a Kudu node 'vertically' is not possible right now: the more
data directories/drives you add, the slower it can get if all the maintenance
threads pile onto compacting/flushing data stored in the same data directory.
> Prefer running concurrent flushes/compactions on different data directories,
> if possible
> ----------------------------------------------------------------------------------------
>
> Key: KUDU-3028
> URL: https://issues.apache.org/jira/browse/KUDU-3028
> Project: Kudu
> Issue Type: Improvement
> Components: tserver
> Reporter: Alexey Serbin
> Priority: Major
> Labels: compaction, dense-storage, scalability
>
> In a Kudu cluster with tablet servers having 9 data directories, each backed
> by a separate HDD (spinning disk), and 3 maintenance manager threads, I
> noticed a long period (2 hours or so) of 100% IO saturation of one drive,
> followed by a long period of 100% IO saturation of another drive.
> All 3 maintenance threads were hammering the same data directory for a long
> time (that was the reason for the 100% IO saturation of the backing drive).
> Then they switched to another data directory, saturating the IO there. That
> led to extremes like tens of seconds of waiting for fsync to complete. With a
> higher number of data directories and a higher number of maintenance threads,
> this may become even more extreme.
> {noformat}
> W1218 12:10:04.712692 247413 env_posix.cc:889] Time spent sync call for
> /data/6/kudu/tablet/data/data/4b1f42243784484b85a57255c88d8b93.metadata: real
> 27.245s user 0.000s sys 0.000s
> W1218 12:10:04.712724 247412 env_posix.cc:889] Time spent sync call for
> /data/6/kudu/tablet/data/data/128658789b56415b82becf42f34c4af1.metadata: real
> 27.244s user 0.000s sys 0.000s
> W1218 12:11:22.690099 247411 env_posix.cc:889] Time spent sync call for
> /data/6/kudu/tablet/data/data/ad4c53b4e230488899f55e6580c070af.data: real
> 15.357s user 0.000s sys 0.000s
> {noformat}
> {noformat}
> W1218 14:17:30.151391 247412 env_posix.cc:889] Time spent sync call for
> /data/3/kudu/tablet/data/data/165c86d614c54f9f8bfaf01361ceca16.data: real
> 10.674s user 0.000s sys 0.000s
> W1218 14:17:30.151448 247413 env_posix.cc:889] Time spent sync call for
> /data/3/kudu/tablet/data/data/820354e482be40f9858b29484c2db5c6.metadata: real
> 11.807s user 0.000s sys 0.000s
> W1218 14:17:30.151460 247411 env_posix.cc:889] Time spent sync call for
> /data/3/kudu/tablet/data/data/483a57ac212544f3b39cbe887bf16946.metadata: real
> 23.472s user 0.000s sys 0.000s
> {noformat}
> It would be nice to schedule compactions and flushes so they are spread
> across the available data directories, if possible.
> Also, it would be great to establish a limit on concurrent
> compactions/flushes per data directory, so that even with a higher number of
> data directories it is possible to prevent all the flushing/compacting
> threads from hammering a single data directory.
> Another approach might be switching from the multi-directory structure to a
> volume-based approach, where the filesystem or a controller takes care of
> fanning out the IO to the multitude of drives backing the volume.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)