[ 
https://issues.apache.org/jira/browse/KUDU-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999618#comment-16999618
 ] 

Alexey Serbin commented on KUDU-3028:
-------------------------------------

It seems scaling a Kudu node 'vertically' is not effective right now: the more 
data directories/drives you add, the slower the node can get if all the 
maintenance threads pile onto compacting/flushing data stored in the same data 
directory.

> Prefer running concurrent flushes/compactions on different data directories, 
> if possible
> ----------------------------------------------------------------------------------------
>
>                 Key: KUDU-3028
>                 URL: https://issues.apache.org/jira/browse/KUDU-3028
>             Project: Kudu
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Alexey Serbin
>            Priority: Major
>              Labels: compaction, dense-storage, scalability
>
> In a Kudu cluster with tablet servers having 9 data directories, each backed 
> by a separate HDD (spinning disk), and 3 maintenance manager threads, I 
> noticed a long period (2 hours or so) of 100% IO saturation of first one 
> drive, and then a long period of 100% IO saturation of another drive.
> All 3 maintenance threads were hammering the same data directory for a long 
> time (and that was the reason for the 100% IO saturation of the backing 
> drive).  Then they switched to another data directory, saturating the IO 
> there.  That led to extremes like tens of seconds of waiting for fsync to 
> complete.  With a higher number of data directories and a higher number of 
> maintenance threads this may become even more extreme.
> {noformat}
> W1218 12:10:04.712692 247413 env_posix.cc:889] Time spent sync call for 
> /data/6/kudu/tablet/data/data/4b1f42243784484b85a57255c88d8b93.metadata: real 
> 27.245s      user 0.000s     sys 0.000s
> W1218 12:10:04.712724 247412 env_posix.cc:889] Time spent sync call for 
> /data/6/kudu/tablet/data/data/128658789b56415b82becf42f34c4af1.metadata: real 
> 27.244s      user 0.000s     sys 0.000s
> W1218 12:11:22.690099 247411 env_posix.cc:889] Time spent sync call for 
> /data/6/kudu/tablet/data/data/ad4c53b4e230488899f55e6580c070af.data: real 
> 15.357s  user 0.000s     sys 0.000s
> {noformat}
> {noformat}
> W1218 14:17:30.151391 247412 env_posix.cc:889] Time spent sync call for 
> /data/3/kudu/tablet/data/data/165c86d614c54f9f8bfaf01361ceca16.data: real 
> 10.674s       user 0.000s     sys 0.000s
> W1218 14:17:30.151448 247413 env_posix.cc:889] Time spent sync call for 
> /data/3/kudu/tablet/data/data/820354e482be40f9858b29484c2db5c6.metadata: real 
> 11.807s   user 0.000s     sys 0.000s
> W1218 14:17:30.151460 247411 env_posix.cc:889] Time spent sync call for 
> /data/3/kudu/tablet/data/data/483a57ac212544f3b39cbe887bf16946.metadata: real 
> 23.472s   user 0.000s     sys 0.000s
> {noformat}
> It would be nice to schedule compactions and flushes so they are spread 
> across the available data directories, if possible.
> Also, it would be great to establish a limit on concurrent 
> compactions/flushes per data directory, so even with a higher number of data 
> directories it would be possible to prevent all the flushing/compacting 
> threads from hammering a single data directory.
> Another approach might be switching from the multi-directory structure to 
> some volume-based approach where the filesystem or a controller takes care 
> of fanning out the IO to the multitude of drives backing the volume.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)