[Devel] Performance numbers with IO throttling patches (Was: Re: IO scheduler based IO controller V10)

2009-10-10 Thread Vivek Goyal
On Thu, Sep 24, 2009 at 02:33:15PM -0700, Andrew Morton wrote:

[..]
  Environment
  ==
  A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem.
 
 That's a bit of a toy.
 
 Do we have testing results for more enterprisey hardware?  Big storage
 arrays?  SSD?  Infiniband?  iscsi?  nfs? (lol, gotcha)
 
 

Hi All,

A couple of days back I posted some performance numbers for the IO scheduler
controller and dm-ioband here.

http://lkml.org/lkml/2009/10/8/9

Now I have run similar tests with Andrea Righi's IO throttling approach to
max bandwidth control. This is an exercise to understand the pros/cons of
each approach and see how we can take things forward.

Environment
===========
Software
--------
- 2.6.31 kernel
- IO scheduler controller V10 on top of 2.6.31
- IO throttling patch on top of 2.6.31. The patch is available here:

http://www.develer.com/~arighi/linux/patches/io-throttle/old/cgroup-io-throttle-2.6.31.patch

Hardware
--------
A storage array of 5 striped disks of 500GB each.

Used fio jobs of 30 seconds each in various configurations. Most of the IO is
direct IO, to eliminate the effects of caching.

I have run three sets of each test. I am blindly reporting the results of
set 2 from each test; otherwise it is too much data to report.

Had a LUN of 2500GB capacity. Used a 200G partition with an ext3 file system
for my testing. For IO scheduler controller testing, I created two cgroups of
weight 100 each, so that effectively the disk can be divided half/half between
the two groups.
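
For illustration, here is a minimal Python sketch of that kind of setup: two
equal-weight cgroups, with one fio job started in each. The cgroup mount point
and the "io.weight" file name are assumptions made for the example (the actual
controller patch may use different names), and the job parameters (directory,
file size, block sizes) are only illustrative. The fio options themselves
(--direct, --runtime, --time_based, --rw, --bs, --numjobs) are standard.

#!/usr/bin/env python3
# Hypothetical sketch: create two cgroups of equal weight and start one fio
# job in each group.  Not the actual test harness used for these numbers.
import os
import subprocess

CGROUP_ROOT = "/cgroup/io"            # assumed mount point of the controller
GROUPS = {"group1": 100, "group2": 100}

def setup_groups():
    for name, weight in GROUPS.items():
        path = os.path.join(CGROUP_ROOT, name)
        os.makedirs(path, exist_ok=True)
        # "io.weight" is an assumed name for the proportional weight knob.
        with open(os.path.join(path, "io.weight"), "w") as f:
            f.write(str(weight))

def run_fio_in_group(group, jobname, extra_args):
    tasks = os.path.join(CGROUP_ROOT, group, "tasks")
    def enter_cgroup():
        # Runs in the child between fork() and exec(), so the fio process is
        # classified into the cgroup before it starts issuing IO.
        with open(tasks, "w") as f:
            f.write(str(os.getpid()))
    cmd = ["fio", "--name=" + jobname, "--directory=/mnt/test", "--size=1G",
           "--direct=1", "--runtime=30", "--time_based"] + extra_args
    return subprocess.Popen(cmd, preexec_fn=enter_cgroup)

if __name__ == "__main__":
    setup_groups()
    jobs = [run_fio_in_group("group1", "randreaders",
                             ["--rw=randread", "--bs=4k", "--numjobs=8"]),
            run_fio_in_group("group2", "seqreader",
                             ["--rw=read", "--bs=64k"])]
    for j in jobs:
        j.wait()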

For the IO throttling patches I also created two cgroups. Now the tricky part
is that it is a max bandwidth controller and not a proportional weight
controller, so dividing the disk capacity half/half between two cgroups is not
straightforward. The reason is that I just don't know the BW capacity of the
underlying storage; throughput varies so much with the type of workload. For
example, on my array, this is how throughput looks with different workloads:

8 sequential buffered readers         115 MB/s
8 direct sequential readers bs=64K     64 MB/s
8 direct sequential readers bs=4K      14 MB/s

8 buffered random readers bs=64K        3 MB/s
8 direct random readers bs=64K         15 MB/s
8 direct random readers bs=4K         1.5 MB/s

So throughput varies from 1.5 MB/s to 115 MB/s depending on the workload.
What should the BW limit per cgroup be in order to divide the disk BW
half/half between the two groups?

So I took a conservative estimate, divided the maximum bandwidth by 2, treated
the array capacity as 60MB/s, and assigned each cgroup 30MB/s. In some cases I
assigned as little as 10MB/s or 5MB/s to each cgroup to see the effects of
throttling. I am using the leaky bucket policy for all the tests.
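
To make the leaky bucket policy concrete, here is a toy model of a per-cgroup
max-BW limiter of this kind. It only illustrates the policy being configured
above (e.g. 30MB/s per group); it is not the io-throttle implementation, and
the bucket size chosen below is an arbitrary example value.

import time

class LeakyBucket:
    """Toy model of a leaky bucket bandwidth limiter: completed IO fills the
    bucket, the bucket drains at the configured rate, and once it is full the
    issuer has to sleep until enough has drained."""

    def __init__(self, rate_bytes_per_sec, bucket_size_bytes):
        self.rate = float(rate_bytes_per_sec)
        self.size = float(bucket_size_bytes)
        self.level = 0.0
        self.last = time.monotonic()

    def account(self, io_bytes):
        now = time.monotonic()
        # Drain the bucket for the time elapsed since the last accounting.
        self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        self.level += io_bytes
        if self.level > self.size:
            # Over the limit: block long enough for the excess to drain away.
            time.sleep((self.level - self.size) / self.rate)
            self.level = self.size

# Example: a group limited to 30MB/s, charging 64KB per completed read.
bucket = LeakyBucket(rate_bytes_per_sec=30 * 1024 * 1024,
                     bucket_size_bytes=4 * 1024 * 1024)
# for every 64KB read completed by the group: bucket.account(64 * 1024)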

As the theme of the two controllers is different, in some places it might
sound like an apples vs. oranges comparison. But it still does help...

Multiple Random Reader vs Sequential Reader
===========================================
Generally, random readers bring down the throughput of others in the system.
I ran a test to see the impact of an increasing number of random readers on a
single sequential reader running in a different group.
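
Concretely, each data point below runs N direct random readers in group1
concurrently with one direct sequential reader in group2 for 30 seconds. The
following sketch shows what the fio invocations might roughly look like; the
block sizes, file sizes and paths are assumptions, since the mail does not
spell them out.

# Hypothetical reconstruction of the fio command lines for one data point:
# N direct random readers (group1) vs. 1 direct sequential reader (group2).
# Per-job output is kept (no --group_reporting) so that per-reader bandwidth
# can be inspected as well as the aggregate.
def fio_cmd(name, rw, bs, numjobs):
    return ["fio", "--name=" + name, "--directory=/mnt/test", "--size=1G",
            "--direct=1", "--runtime=30", "--time_based",
            "--rw=" + rw, "--bs=" + bs, "--numjobs=" + str(numjobs)]

for nr in (1, 2, 4, 8, 16, 32):
    # The two commands are run at the same time, one in each cgroup.
    print(" ".join(fio_cmd("randreaders", "randread", "4k", nr)))
    print(" ".join(fio_cmd("seqreader", "read", "64k", 1)))

In the tables, Max-bandw/Min-bandw presumably refer to the fastest and slowest
of the N random reader jobs, and Agg-bandw to the aggregate figure.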

Vanilla CFQ
-----------
[Multiple Random Reader]                              [Sequential Reader]
nr  Max-bandw  Min-bandw  Agg-bandw  Max-latency      nr  Agg-bandw  Max-latency
1   23KB/s     23KB/s     22KB/s     691 msec         1   13519KB/s  468K usec
2   152KB/s    152KB/s    297KB/s    244K usec        1   12380KB/s  31675 usec
4   174KB/s    156KB/s    638KB/s    249K usec        1   10860KB/s  36715 usec
8   49KB/s     11KB/s     310KB/s    1856 msec        1   1292KB/s   990K usec
16  63KB/s     48KB/s     877KB/s    762K usec        1   3905KB/s   506K usec
32  35KB/s     27KB/s     951KB/s    2655 msec        1   1109KB/s   1910K usec

IO scheduler controller + CFQ
-----------------------------
[Multiple Random Reader]                              [Sequential Reader]
nr  Max-bandw  Min-bandw  Agg-bandw  Max-latency      nr  Agg-bandw  Max-latency
1   228KB/s    228KB/s    223KB/s    132K usec        1   5551KB/s   129K usec
2   97KB/s     97KB/s     190KB/s    154K usec        1   5718KB/s   122K usec
4   115KB/s    110KB/s    445KB/s    208K usec        1   5909KB/s   116K usec
8   23KB/s     12KB/s     158KB/s    2820 msec        1   5445KB/s   168K usec
16  11KB/s     3KB/s      145KB/s    5963 msec        1   5418KB/s   164K usec
32  6KB/s      2KB/s      139KB/s    12762 msec       1   5398KB/s   175K usec

Notes:
- The sequential reader in group2 seems to be well isolated from the random
  readers in group1. Throughput and latency of the sequential reader are
  stable and don't drop as the number of random readers in the system
  increases.

io-throttle + CFQ
-----------------
BW limit group1=10 MB/s                               BW limit group2=10 MB/s
[Multiple Random Reader]                              [Sequential Reader]
nr  Max-bandw  Min-bandw  Agg-bandw  Max-latency      nr  Agg-bandw  Max-latency
1   37KB/s     37KB/s     36KB/s     218K usec        1   8006KB/s   20529 usec
2   185KB/s   

[Devel] Re: Performance numbers with IO throttling patches (Was: Re: IO scheduler based IO controller V10)

2009-10-10 Thread Andrea Righi
On Sat, Oct 10, 2009 at 03:53:16PM -0400, Vivek Goyal wrote:
 On Thu, Sep 24, 2009 at 02:33:15PM -0700, Andrew Morton wrote:
 
 [..]
   Environment
   ==
   A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem.
  
  That's a bit of a toy.
  
  Do we have testing results for more enterprisey hardware?  Big storage
  arrays?  SSD?  Infiniband?  iscsi?  nfs? (lol, gotcha)
  
  
 
 Hi All,

Hi Vivek,

first of all, thanks for posting this detailed report. A few comments
below.

 