Re: [ceph-users] krbd splitting large IO's into smaller IO's

2015-06-30 Thread Ilya Dryomov
On Tue, Jun 30, 2015 at 8:30 AM, Z Zhang zhangz.da...@outlook.com wrote:
 Hi Ilya,

 Thanks for your explanation. This makes sense. Will you make max_segments
 configurable? Could you please point me to the fix you have made? We might
 help to test it.

See "[PATCH] rbd: bump queue_max_segments" on ceph-devel.
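
For reference, once a kernel with that patch is running, the bumped limit
should be visible in sysfs. A quick check might look like this (rbd0 and the
exact post-patch value are assumptions; the value depends on the kernel):

  cat /sys/block/rbd0/queue/max_segments      # 128 on an unpatched 3.10 kernel
  cat /sys/block/rbd0/queue/max_sectors_kb    # still needs raising to 4096 for 4M I/Os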

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd splitting large IO's into smaller IO's

2015-06-29 Thread Z Zhang
Hi Ilya,
Thanks for your explanation. This makes sense. Will you make max_segments
configurable? Could you please point me to the fix you have made? We might help to
test it.
Thanks.
David Zhang 

 Date: Fri, 26 Jun 2015 18:21:55 +0300
 Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's
 From: idryo...@gmail.com
 To: zhangz.da...@outlook.com
 CC: ceph-users@lists.ceph.com
 
 On Fri, Jun 26, 2015 at 3:17 PM, Z Zhang zhangz.da...@outlook.com wrote:
  Hi Ilya,
 
  I am seeing your recent email talking about krbd splitting large IO's into
  smaller IO's, see below link.
 
  https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20587.html
 
  I just tried it on my ceph cluster using kernel 3.10.0-1. I adjusted both
  max_sectors_kb and max_hw_sectors_kb of the rbd device to 4096.
 
  Use fio with 4M block size for read:
 
  Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz
  avgqu-sz   await r_await w_await  svctm  %util
  rbd3 81.00 0.00  135.000.00   108.00 0.00  1638.40
  2.72   20.15   20.150.00   7.41 100.00
 
 
  Use fio with 1M or 2M block size for read:
 
  Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz
  avgqu-sz   await r_await w_await  svctm  %util
  rbd3  0.00 0.00  213.000.00   106.50 0.00  1024.00
  2.56   12.02   12.020.00   4.69 100.00
 
 
  Use fio with 4M block size for write:
 
  Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz
  avgqu-sz   await r_await w_await  svctm  %util
  rbd3  0.0040.000.00   40.00 0.0040.00  2048.00
  2.87   70.900.00   70.90  24.90  99.60
 
 
  Use fio with 1M or 2M block size for write:
 
  Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz
  avgqu-sz   await r_await w_await  svctm  %util
  rbd3  0.00 0.000.00   80.00 0.0040.00  1024.00
  3.55   48.200.00   48.20  12.50 100.00
 
 
  So why is the IO size here far less than 4096? (If using the default value of 512,
  all the IO sizes are 1024.) Are there other parameters that need to be adjusted, or
  is it about this kernel version?
 
 It's about this kernel version.  Assuming you are doing direct I/Os
 with fio, setting max_sectors_kb to 4096 is really the only thing you
 can do, and that's enough to *sometimes* see 8192 sector (i.e. 4M) I/Os.
 The problem is the max_segments value, which in 3.10 is 128 and which
 you cannot adjust via sysfs.
 
 It all comes down to a memory allocator.  To get a 4M I/O, the total
 number of segments (physically contiguous chunks of memory) in the
 8 bios (8*512k = 4M) that need to be merged has to be <= 128.  When you
 are allocated such nice and contiguous bios, you get 4M I/Os.  In other
 cases you don't.
 
 This will be fixed in 4.2, along with a bunch of other things.  This
 particular max_segment fix is a one liner, so we will probably backport
 it to older kernels, including 3.10.
 
 Thanks,
 
 Ilya
  ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd splitting large IO's into smaller IO's

2015-06-26 Thread Ilya Dryomov
On Fri, Jun 26, 2015 at 3:17 PM, Z Zhang zhangz.da...@outlook.com wrote:
 Hi Ilya,

 I am seeing your recent email talking about krbd splitting large IO's into
 smaller IO's, see below link.

 https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20587.html

 I just tried it on my ceph cluster using kernel 3.10.0-1. I adjusted both
 max_sectors_kb and max_hw_sectors_kb of the rbd device to 4096.

 Use fio with 4M block size for read:

 Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz
 avgqu-sz   await r_await w_await  svctm  %util
 rbd3 81.00 0.00  135.000.00   108.00 0.00  1638.40
 2.72   20.15   20.150.00   7.41 100.00


 Use fio with 1M or 2M block size for read:

 Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz
 avgqu-sz   await r_await w_await  svctm  %util
 rbd3  0.00 0.00  213.000.00   106.50 0.00  1024.00
 2.56   12.02   12.020.00   4.69 100.00


 Use fio with 4M block size for write:

 Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz
 avgqu-sz   await r_await w_await  svctm  %util
 rbd3  0.0040.000.00   40.00 0.0040.00  2048.00
 2.87   70.900.00   70.90  24.90  99.60


 Use fio with 1M or 2M block size for write:

 Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz
 avgqu-sz   await r_await w_await  svctm  %util
 rbd3  0.00 0.000.00   80.00 0.0040.00  1024.00
 3.55   48.200.00   48.20  12.50 100.00


 So why is the IO size here far less than 4096? (If using the default value of 512,
 all the IO sizes are 1024.) Are there other parameters that need to be adjusted, or
 is it about this kernel version?

It's about this kernel version.  Assuming you are doing direct I/Os
with fio, setting max_sectors_kb to 4096 is really the only thing you
can do, and that's enough to *sometimes* see 8192 sector (i.e. 4M) I/Os.
The problem is the max_segments value, which in 3.10 is 128 and which
you cannot adjust via sysfs.

It all comes down to a memory allocator.  To get a 4M I/O, the total
number of segments (physically contiguous chunks of memory) in the
8 bios (8*512k = 4M) that need to be merged has to be <= 128.  When you
are allocated such nice and contiguous bios, you get 4M I/Os.  In other
cases you don't.
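
The two limits involved can be inspected from sysfs; a rough sketch, assuming
the image is mapped as rbd0 (only max_sectors_kb accepts a write here):

  cat /sys/block/rbd0/queue/max_segments             # 128 on 3.10, read-only
  echo 4096 > /sys/block/rbd0/queue/max_sectors_kb   # allow 4M requests (capped by max_hw_sectors_kb)
  cat /sys/block/rbd0/queue/max_sectors_kb           # confirm the new value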

This will be fixed in 4.2, along with a bunch of other things.  This
particular max_segment fix is a one liner, so we will probably backport
it to older kernels, including 3.10.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] krbd splitting large IO's into smaller IO's

2015-06-26 Thread Z Zhang
Hi Ilya,
I am seeing your recent email talking about krbd splitting large IO's into 
smaller IO's, see below link.
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20587.html
I just tried it on my ceph cluster using kernel 3.10.0-1. I adjusted both
max_sectors_kb and max_hw_sectors_kb of the rbd device to 4096.
Use fio with 4M block size for read:
Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
rbd3 81.00 0.00  135.000.00   108.00 0.00  1638.40
2.72   20.15   20.150.00   7.41 100.00

Use fio with 1M or 2M block size for read:
Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
rbd3  0.00 0.00  213.000.00   106.50 0.00  1024.00
2.56   12.02   12.020.00   4.69 100.00

Use fio with 4M block size for write:
Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
rbd3  0.0040.000.00   40.00 0.0040.00  2048.00
2.87   70.900.00   70.90  24.90  99.60

Use fio with 1M or 2M block size for write:
Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
rbd3  0.00 0.000.00   80.00 0.0040.00  1024.00
3.55   48.200.00   48.20  12.50 100.00

So why is the IO size here far less than 4096? (If using the default value of 512, all
the IO sizes are 1024.) Are there other parameters that need to be adjusted, or is it
about this kernel version?
Thanks!
Regards,
David Zhang
  ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd splitting large IO's into smaller IO's

2015-06-11 Thread Ilya Dryomov
On Thu, Jun 11, 2015 at 2:23 PM, Ilya Dryomov idryo...@gmail.com wrote:
 On Wed, Jun 10, 2015 at 7:07 PM, Ilya Dryomov idryo...@gmail.com wrote:
 On Wed, Jun 10, 2015 at 7:04 PM, Nick Fisk n...@fisk.me.uk wrote:
   -Original Message-
   From: Ilya Dryomov [mailto:idryo...@gmail.com]
   Sent: 10 June 2015 14:06
   To: Nick Fisk
   Cc: ceph-users
   Subject: Re: [ceph-users] krbd splitting large IO's into smaller
   IO's
  
   On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk wrote:
Hi,
   
Using Kernel RBD client with Kernel 4.03 (I have also tried some
older kernels with the same effect) and IO is being split into
smaller IO's which is having a negative impact on performance.
   
cat /sys/block/sdc/queue/max_hw_sectors_kb
4096
   
cat /sys/block/rbd0/queue/max_sectors_kb
4096
   
Using DD
dd if=/dev/rbd0 of=/dev/null bs=4M
   
Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
 avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
rbd0  0.00 0.00  201.500.00 25792.00 0.00
 256.00
1.99   10.15   10.150.00   4.96 100.00
   
   
Using FIO with 4M blocks
Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
 avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
rbd0  0.00 0.00  232.000.00 118784.00 0.00
 1024.00
11.29   48.58   48.580.00   4.31 100.00
   
Any ideas why IO sizes are limited to 128k (256 blocks) in DD's
case and 512k in Fio's case?
  
   128k vs 512k is probably buffered vs direct IO - add iflag=direct
   to your dd invocation.
  
   Yes, thanks for this, that was the case
  
  
   As for the 512k - I'm pretty sure it's a regression in our switch
   to blk-mq.  I tested it around 3.18-3.19 and saw steady 4M IOs.  I
   hope we are just missing a knob - I'll take a look.
  
   I've tested both 4.03 and 3.16 and both seem to be split into 512k.
   Let
 me
  know if you need me to test any other particular version.
 
  With 3.16 you are going to need to adjust max_hw_sectors_kb /
  max_sectors_kb as discussed in Dan's thread.  The patch that fixed
  that in the block layer went into 3.19, blk-mq into 4.0 - try 3.19.

 Sorry should have mentioned, I had adjusted both of them on the 3.16
 kernel to 4096.
 I will try 3.19 and let you know.

 Better with 3.19, but should I not be seeing around 8192, or am I getting my
 blocks and bytes mixed up?

 Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
 avgqu-sz   await r_await w_await  svctm  %util
 rbd0 72.00 0.00   24.000.00 49152.00 0.00  4096.00
 1.96   82.67   82.670.00  41.58  99.80

 I'd expect 8192.  I'm getting a box for investigation.

 OK, so this is a bug in the blk-mq part of the block layer.  There is no
 plugging going on in the single hardware queue (i.e. krbd) case - it
 never once plugs the queue, and that means no request merging is done
 for your direct sequential read test.  It gets 512k bios and those same
 512k requests are issued to krbd.

 While queue plugging may not make sense in the multi queue case, I'm
 pretty sure it's supposed to plug in the single queue case.  Looks like
 use_plug logic in blk_sq_make_request() is busted.

It turns out to be a year old regression.  Before commit 07068d5b8ed8
(blk-mq: split make request handler for multi and single queue) it
used to be (reads are considered sync)

use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);

and now it is

use_plug = !is_flush_fua && !is_sync;

in a function that is only called if q->nr_hw_queues == 1.

This is getting fixed by "blk-mq: fix plugging in blk_sq_make_request"
from Jeff Moyer - http://article.gmane.org/gmane.linux.kernel/1941750.
Looks like it's on its way to mainline along with some other blk-mq
plugging fixes.
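
Once the fix lands, the merging should be observable from userspace; a rough
way to check it (device name and fio parameters are only examples):

  # direct sequential 4M reads against the mapped image
  fio --name=seqread --filename=/dev/rbd0 --direct=1 --rw=read --bs=4M \
      --ioengine=libaio --iodepth=4 --runtime=30 --time_based

  # in another terminal, watch the rbd0 row: rrqm/s > 0 and avgrq-sz close to
  # 8192 sectors mean the 512k requests are being plugged and merged into 4M I/Os
  iostat -x 1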

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd splitting large IO's into smaller IO's

2015-06-11 Thread Ilya Dryomov
On Wed, Jun 10, 2015 at 7:07 PM, Ilya Dryomov idryo...@gmail.com wrote:
 On Wed, Jun 10, 2015 at 7:04 PM, Nick Fisk n...@fisk.me.uk wrote:
   -Original Message-
   From: Ilya Dryomov [mailto:idryo...@gmail.com]
   Sent: 10 June 2015 14:06
   To: Nick Fisk
   Cc: ceph-users
   Subject: Re: [ceph-users] krbd splitting large IO's into smaller
   IO's
  
   On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk wrote:
Hi,
   
Using Kernel RBD client with Kernel 4.03 (I have also tried some
older kernels with the same effect) and IO is being split into
smaller IO's which is having a negative impact on performance.
   
cat /sys/block/sdc/queue/max_hw_sectors_kb
4096
   
cat /sys/block/rbd0/queue/max_sectors_kb
4096
   
Using DD
dd if=/dev/rbd0 of=/dev/null bs=4M
   
Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
 avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
rbd0  0.00 0.00  201.500.00 25792.00 0.00
 256.00
1.99   10.15   10.150.00   4.96 100.00
   
   
Using FIO with 4M blocks
Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
 avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
rbd0  0.00 0.00  232.000.00 118784.00 0.00
 1024.00
11.29   48.58   48.580.00   4.31 100.00
   
Any ideas why IO sizes are limited to 128k (256 blocks) in DD's
case and 512k in Fio's case?
  
   128k vs 512k is probably buffered vs direct IO - add iflag=direct
   to your dd invocation.
  
   Yes, thanks for this, that was the case
  
  
   As for the 512k - I'm pretty sure it's a regression in our switch
   to blk-mq.  I tested it around 3.18-3.19 and saw steady 4M IOs.  I
   hope we are just missing a knob - I'll take a look.
  
   I've tested both 4.03 and 3.16 and both seem to be split into 512k.
   Let
 me
  know if you need me to test any other particular version.
 
  With 3.16 you are going to need to adjust max_hw_sectors_kb /
  max_sectors_kb as discussed in Dan's thread.  The patch that fixed
  that in the block layer went into 3.19, blk-mq into 4.0 - try 3.19.

 Sorry should have mentioned, I had adjusted both of them on the 3.16
 kernel to 4096.
 I will try 3.19 and let you know.

 Better with 3.19, but should I not be seeing around 8192, or am I getting my
 blocks and bytes mixed up?

 Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
 avgqu-sz   await r_await w_await  svctm  %util
 rbd0 72.00 0.00   24.000.00 49152.00 0.00  4096.00
 1.96   82.67   82.670.00  41.58  99.80

 I'd expect 8192.  I'm getting a box for investigation.

OK, so this is a bug in the blk-mq part of the block layer.  There is no
plugging going on in the single hardware queue (i.e. krbd) case - it
never once plugs the queue, and that means no request merging is done
for your direct sequential read test.  It gets 512k bios and those same
512k requests are issued to krbd.

While queue plugging may not make sense in the multi queue case, I'm
pretty sure it's supposed to plug in the single queue case.  Looks like
use_plug logic in blk_sq_make_request() is busted.
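
As a side note, the single hardware queue case is easy to confirm on a blk-mq
kernel; something like this (rbd0 assumed):

  ls /sys/block/rbd0/mq/    # a single "0" directory means one hardware queue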

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd splitting large IO's into smaller IO's

2015-06-11 Thread Nick Fisk
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Ilya Dryomov
 Sent: 11 June 2015 12:33
 To: Nick Fisk
 Cc: ceph-users
 Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's
 
 On Thu, Jun 11, 2015 at 2:23 PM, Ilya Dryomov idryo...@gmail.com
 wrote:
  On Wed, Jun 10, 2015 at 7:07 PM, Ilya Dryomov idryo...@gmail.com
 wrote:
  On Wed, Jun 10, 2015 at 7:04 PM, Nick Fisk n...@fisk.me.uk wrote:
-Original Message-
From: Ilya Dryomov [mailto:idryo...@gmail.com]
Sent: 10 June 2015 14:06
To: Nick Fisk
Cc: ceph-users
Subject: Re: [ceph-users] krbd splitting large IO's into
smaller IO's
   
On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk
 wrote:
 Hi,

 Using Kernel RBD client with Kernel 4.03 (I have also tried
 some older kernels with the same effect) and IO is being
 split into smaller IO's which is having a negative impact on
 performance.

 cat /sys/block/sdc/queue/max_hw_sectors_kb
 4096

 cat /sys/block/rbd0/queue/max_sectors_kb
 4096

 Using DD
 dd if=/dev/rbd0 of=/dev/null bs=4M

 Device: rrqm/s   wrqm/s r/s w/srkB/s
wkB/s
  avgrq-sz
 avgqu-sz   await r_await w_await  svctm  %util
 rbd0  0.00 0.00  201.500.00 25792.00
0.00
  256.00
 1.99   10.15   10.150.00   4.96 100.00


 Using FIO with 4M blocks
 Device: rrqm/s   wrqm/s r/s w/srkB/s
wkB/s
  avgrq-sz
 avgqu-sz   await r_await w_await  svctm  %util
 rbd0  0.00 0.00  232.000.00 118784.00
0.00
  1024.00
 11.29   48.58   48.580.00   4.31 100.00

 Any ideas why IO sizes are limited to 128k (256 blocks) in
 DD's case and 512k in Fio's case?
   
128k vs 512k is probably buffered vs direct IO - add
iflag=direct to your dd invocation.
   
Yes, thanks for this, that was the case
   
   
As for the 512k - I'm pretty sure it's a regression in our
switch to blk-mq.  I tested it around 3.18-3.19 and saw steady
4M IOs.  I hope we are just missing a knob - I'll take a look.
   
I've tested both 4.03 and 3.16 and both seem to be split into
512k.
Let
  me
   know if you need me to test any other particular version.
  
   With 3.16 you are going to need to adjust max_hw_sectors_kb /
   max_sectors_kb as discussed in Dan's thread.  The patch that
   fixed that in the block layer went into 3.19, blk-mq into 4.0 - try
3.19.
 
  Sorry should have mentioned, I had adjusted both of them on the
  3.16 kernel to 4096.
  I will try 3.19 and let you know.
 
  Better with 3.19, but should I not be seeing around 8192, or am I
  getting my blocks and bytes mixed up?
 
  Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
avgrq-sz
  avgqu-sz   await r_await w_await  svctm  %util
  rbd0 72.00 0.00   24.000.00 49152.00 0.00
4096.00
  1.96   82.67   82.670.00  41.58  99.80
 
  I'd expect 8192.  I'm getting a box for investigation.
 
  OK, so this is a bug in the blk-mq part of the block layer.  There is no
  plugging going on in the single hardware queue (i.e. krbd) case - it
  never once plugs the queue, and that means no request merging is done
  for your direct sequential read test.  It gets 512k bios and those
  same 512k requests are issued to krbd.
 
  While queue plugging may not make sense in the multi queue case, I'm
  pretty sure it's supposed to plug in the single queue case.  Looks
  like use_plug logic in blk_sq_make_request() is busted.
 
 It turns out to be a year old regression.  Before commit 07068d5b8ed8
 (blk-mq: split make request handler for multi and single queue) it used
to
 be (reads are considered sync)
 
 use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);
 
 and now it is
 
 use_plug = !is_flush_fua && !is_sync;
 
 in a function that is only called if q->nr_hw_queues == 1.
 
 This is getting fixed by blk-mq: fix plugging in blk_sq_make_request
 from Jeff Moyer - http://article.gmane.org/gmane.linux.kernel/1941750.
 Looks like it's on its way to mainline along with some other blk-mq
plugging
 fixes.

That's great, do you think it will make 4.2?


 
 Thanks,
 
 Ilya
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd splitting large IO's into smaller IO's

2015-06-11 Thread Ilya Dryomov
On Thu, Jun 11, 2015 at 5:30 PM, Nick Fisk n...@fisk.me.uk wrote:
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Ilya Dryomov
 Sent: 11 June 2015 12:33
 To: Nick Fisk
 Cc: ceph-users
 Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's

 On Thu, Jun 11, 2015 at 2:23 PM, Ilya Dryomov idryo...@gmail.com
 wrote:
  On Wed, Jun 10, 2015 at 7:07 PM, Ilya Dryomov idryo...@gmail.com
 wrote:
  On Wed, Jun 10, 2015 at 7:04 PM, Nick Fisk n...@fisk.me.uk wrote:
-Original Message-
From: Ilya Dryomov [mailto:idryo...@gmail.com]
Sent: 10 June 2015 14:06
To: Nick Fisk
Cc: ceph-users
Subject: Re: [ceph-users] krbd splitting large IO's into
smaller IO's
   
On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk
 wrote:
 Hi,

 Using Kernel RBD client with Kernel 4.03 (I have also tried
 some older kernels with the same effect) and IO is being
 split into smaller IO's which is having a negative impact on
 performance.

 cat /sys/block/sdc/queue/max_hw_sectors_kb
 4096

 cat /sys/block/rbd0/queue/max_sectors_kb
 4096

 Using DD
 dd if=/dev/rbd0 of=/dev/null bs=4M

 Device: rrqm/s   wrqm/s r/s w/srkB/s
 wkB/s
  avgrq-sz
 avgqu-sz   await r_await w_await  svctm  %util
 rbd0  0.00 0.00  201.500.00 25792.00
 0.00
  256.00
 1.99   10.15   10.150.00   4.96 100.00


 Using FIO with 4M blocks
 Device: rrqm/s   wrqm/s r/s w/srkB/s
 wkB/s
  avgrq-sz
 avgqu-sz   await r_await w_await  svctm  %util
 rbd0  0.00 0.00  232.000.00 118784.00
 0.00
  1024.00
 11.29   48.58   48.580.00   4.31 100.00

 Any ideas why IO sizes are limited to 128k (256 blocks) in
 DD's case and 512k in Fio's case?
   
128k vs 512k is probably buffered vs direct IO - add
iflag=direct to your dd invocation.
   
Yes, thanks for this, that was the case
   
   
As for the 512k - I'm pretty sure it's a regression in our
switch to blk-mq.  I tested it around 3.18-3.19 and saw steady
4M IOs.  I hope we are just missing a knob - I'll take a look.
   
I've tested both 4.03 and 3.16 and both seem to be split into
 512k.
Let
  me
   know if you need me to test any other particular version.
  
   With 3.16 you are going to need to adjust max_hw_sectors_kb /
   max_sectors_kb as discussed in Dan's thread.  The patch that
   fixed that in the block layer went into 3.19, blk-mq into 4.0 - try
 3.19.
 
  Sorry should have mentioned, I had adjusted both of them on the
  3.16 kernel to 4096.
  I will try 3.19 and let you know.
 
  Better with 3.19, but should I not be seeing around 8192, or am I
  getting my blocks and bytes mixed up?
 
  Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
 avgrq-sz
  avgqu-sz   await r_await w_await  svctm  %util
  rbd0 72.00 0.00   24.000.00 49152.00 0.00
 4096.00
  1.96   82.67   82.670.00  41.58  99.80
 
  I'd expect 8192.  I'm getting a box for investigation.
 
  OK, so this is a bug in the blk-mq part of the block layer.  There is no
  plugging going on in the single hardware queue (i.e. krbd) case - it
  never once plugs the queue, and that means no request merging is done
  for your direct sequential read test.  It gets 512k bios and those
  same 512k requests are issued to krbd.
 
  While queue plugging may not make sense in the multi queue case, I'm
  pretty sure it's supposed to plug in the single queue case.  Looks
  like use_plug logic in blk_sq_make_request() is busted.

 It turns out to be a year old regression.  Before commit 07068d5b8ed8
 (blk-mq: split make request handler for multi and single queue) it used
 to
 be (reads are considered sync)

 use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);

 and now it is

 use_plug = !is_flush_fua && !is_sync;

 in a function that is only called if q->nr_hw_queues == 1.

 This is getting fixed by blk-mq: fix plugging in blk_sq_make_request
 from Jeff Moyer - http://article.gmane.org/gmane.linux.kernel/1941750.
 Looks like it's on its way to mainline along with some other blk-mq
 plugging
 fixes.

 That's great, do you think it will make 4.2?

Depends on Jens, but I think it will.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] krbd splitting large IO's into smaller IO's

2015-06-10 Thread Nick Fisk
Hi,

Using Kernel RBD client with Kernel 4.03 (I have also tried some older
kernels with the same effect) and IO is being split into smaller IO's which
is having a negative impact on performance.

cat /sys/block/sdc/queue/max_hw_sectors_kb
4096

cat /sys/block/rbd0/queue/max_sectors_kb
4096

Using DD
dd if=/dev/rbd0 of=/dev/null bs=4M

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
rbd0  0.00 0.00  201.500.00 25792.00 0.00   256.00
1.99   10.15   10.150.00   4.96 100.00


Using FIO with 4M blocks
Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
rbd0  0.00 0.00  232.000.00 118784.00 0.00  1024.00
11.29   48.58   48.580.00   4.31 100.00

Any ideas why IO sizes are limited to 128k (256 blocks) in DD's case and
512k in Fio's case?




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd splitting large IO's into smaller IO's

2015-06-10 Thread Ilya Dryomov
On Wed, Jun 10, 2015 at 3:23 PM, Dan van der Ster d...@vanderster.com wrote:
 Hi,

 I found something similar awhile ago within a VM.
 http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-November/045034.html
 I don't know if the change suggested by Ilya ever got applied.

Yeah, it got applied.  We didn't have to do anything in krbd - that
artificial limit got nuked up in the stack right after our
conversation.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd splitting large IO's into smaller IO's

2015-06-10 Thread Dan van der Ster
Hi,

I found something similar awhile ago within a VM.
http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-November/045034.html
I don't know if the change suggested by Ilya ever got applied.

Cheers, Dan

On Wed, Jun 10, 2015 at 1:47 PM, Nick Fisk n...@fisk.me.uk wrote:
 Hi,

 Using Kernel RBD client with Kernel 4.03 (I have also tried some older
 kernels with the same effect) and IO is being split into smaller IO's which
 is having a negative impact on performance.

 cat /sys/block/sdc/queue/max_hw_sectors_kb
 4096

 cat /sys/block/rbd0/queue/max_sectors_kb
 4096

 Using DD
 dd if=/dev/rbd0 of=/dev/null bs=4M

 Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
 avgqu-sz   await r_await w_await  svctm  %util
 rbd0  0.00 0.00  201.500.00 25792.00 0.00   256.00
 1.99   10.15   10.150.00   4.96 100.00


 Using FIO with 4M blocks
 Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
 avgqu-sz   await r_await w_await  svctm  %util
 rbd0  0.00 0.00  232.000.00 118784.00 0.00  1024.00
 11.29   48.58   48.580.00   4.31 100.00

 Any ideas why IO sizes are limited to 128k (256 blocks) in DD's case and
 512k in Fio's case?




 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd splitting large IO's into smaller IO's

2015-06-10 Thread Nick Fisk
Hi Dan,

I found your post last night, it does indeed look like the default has been
set to 4096 for the Kernel RBD client in the 4.0 kernel. I also checked a
machine running 3.16 and this had 512 as the default.

However, in my case there seems to be something else affecting the
max block size. 

This originally stemmed from me trying to use flashcache as a small
writeback cache for RBDs to improve sequential write latency. My workload
submits all IO as 64kb, so sequential write speed tops out around 15MB/s. The
idea is that a small flashcache block device should be able to take these
small IO's and then spit them out as large 4MB blocks to Ceph, dramatically
increasing throughput. However, with this limitation I'm not seeing the
gains I expect.
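
For what it's worth, the workload in question is roughly this fio job
(illustrative only - it writes to the device, so point it at a scratch RBD):

  fio --name=seq64k --filename=/dev/rbd0 --direct=1 --rw=write --bs=64k \
      --ioengine=libaio --iodepth=1 --runtime=30 --time_based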

Nick

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Dan van der Ster
 Sent: 10 June 2015 13:24
 To: Nick Fisk
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's
 
 Hi,
 
 I found something similar awhile ago within a VM.
 http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-
 November/045034.html
 I don't know if the change suggested by Ilya ever got applied.
 
 Cheers, Dan
 
 On Wed, Jun 10, 2015 at 1:47 PM, Nick Fisk n...@fisk.me.uk wrote:
  Hi,
 
  Using Kernel RBD client with Kernel 4.03 (I have also tried some older
  kernels with the same effect) and IO is being split into smaller IO's
  which is having a negative impact on performance.
 
  cat /sys/block/sdc/queue/max_hw_sectors_kb
  4096
 
  cat /sys/block/rbd0/queue/max_sectors_kb
  4096
 
  Using DD
  dd if=/dev/rbd0 of=/dev/null bs=4M
 
  Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
avgrq-sz
  avgqu-sz   await r_await w_await  svctm  %util
  rbd0  0.00 0.00  201.500.00 25792.00 0.00
256.00
  1.99   10.15   10.150.00   4.96 100.00
 
 
  Using FIO with 4M blocks
  Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
avgrq-sz
  avgqu-sz   await r_await w_await  svctm  %util
  rbd0  0.00 0.00  232.000.00 118784.00 0.00
1024.00
  11.29   48.58   48.580.00   4.31 100.00
 
  Any ideas why IO sizes are limited to 128k (256 blocks) in DD's case
  and 512k in Fio's case?
 
 
 
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd splitting large IO's into smaller IO's

2015-06-10 Thread Ilya Dryomov
On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk wrote:
 Hi,

 Using Kernel RBD client with Kernel 4.03 (I have also tried some older
 kernels with the same effect) and IO is being split into smaller IO's which
 is having a negative impact on performance.

 cat /sys/block/sdc/queue/max_hw_sectors_kb
 4096

 cat /sys/block/rbd0/queue/max_sectors_kb
 4096

 Using DD
 dd if=/dev/rbd0 of=/dev/null bs=4M

 Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
 avgqu-sz   await r_await w_await  svctm  %util
 rbd0  0.00 0.00  201.500.00 25792.00 0.00   256.00
 1.99   10.15   10.150.00   4.96 100.00


 Using FIO with 4M blocks
 Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
 avgqu-sz   await r_await w_await  svctm  %util
 rbd0  0.00 0.00  232.000.00 118784.00 0.00  1024.00
 11.29   48.58   48.580.00   4.31 100.00

 Any ideas why IO sizes are limited to 128k (256 blocks) in DD's case and
 512k in Fio's case?

128k vs 512k is probably buffered vs direct IO - add iflag=direct to
your dd invocation.
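
In other words, the difference should show up between these two invocations
(reads only, so harmless to run against the device):

dd if=/dev/rbd0 of=/dev/null bs=4M               # buffered: split at the readahead size (128k here)
dd if=/dev/rbd0 of=/dev/null bs=4M iflag=direct  # direct: limited only by the queue limits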

As for the 512k - I'm pretty sure it's a regression in our switch to
blk-mq.  I tested it around 3.18-3.19 and saw steady 4M IOs.  I hope we
are just missing a knob - I'll take a look.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd splitting large IO's into smaller IO's

2015-06-10 Thread Nick Fisk
   -Original Message-
   From: Ilya Dryomov [mailto:idryo...@gmail.com]
   Sent: 10 June 2015 14:06
   To: Nick Fisk
   Cc: ceph-users
   Subject: Re: [ceph-users] krbd splitting large IO's into smaller
   IO's
  
   On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk wrote:
Hi,
   
Using Kernel RBD client with Kernel 4.03 (I have also tried some
older kernels with the same effect) and IO is being split into
smaller IO's which is having a negative impact on performance.
   
cat /sys/block/sdc/queue/max_hw_sectors_kb
4096
   
cat /sys/block/rbd0/queue/max_sectors_kb
4096
   
Using DD
dd if=/dev/rbd0 of=/dev/null bs=4M
   
Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
 avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
rbd0  0.00 0.00  201.500.00 25792.00 0.00
 256.00
1.99   10.15   10.150.00   4.96 100.00
   
   
Using FIO with 4M blocks
Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
 avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
rbd0  0.00 0.00  232.000.00 118784.00 0.00
 1024.00
11.29   48.58   48.580.00   4.31 100.00
   
Any ideas why IO sizes are limited to 128k (256 blocks) in DD's
case and 512k in Fio's case?
  
   128k vs 512k is probably buffered vs direct IO - add iflag=direct
   to your dd invocation.
  
   Yes, thanks for this, that was the case
  
  
   As for the 512k - I'm pretty sure it's a regression in our switch
   to blk-mq.  I tested it around 3.18-3.19 and saw steady 4M IOs.  I
   hope we are just missing a knob - I'll take a look.
  
   I've tested both 4.03 and 3.16 and both seem to be split into 512k.
   Let
 me
  know if you need me to test any other particular version.
 
  With 3.16 you are going to need to adjust max_hw_sectors_kb /
  max_sectors_kb as discussed in Dan's thread.  The patch that fixed
  that in the block layer went into 3.19, blk-mq into 4.0 - try 3.19.
 
 Sorry should have mentioned, I had adjusted both of them on the 3.16
 kernel to 4096.
 I will try 3.19 and let you know.

Better with 3.19, but should I not be seeing around 8192, or am I getting my
blocks and bytes mixed up?

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
rbd0 72.00 0.00   24.000.00 49152.00 0.00  4096.00
1.96   82.67   82.670.00  41.58  99.80

 
 
  Thanks,
 
  Ilya
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd splitting large IO's into smaller IO's

2015-06-10 Thread Nick Fisk




 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Ilya Dryomov
 Sent: 10 June 2015 16:23
 To: Nick Fisk
 Cc: ceph-users
 Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's
 
 On Wed, Jun 10, 2015 at 6:18 PM, Nick Fisk n...@fisk.me.uk wrote:
  -Original Message-
  From: Ilya Dryomov [mailto:idryo...@gmail.com]
  Sent: 10 June 2015 14:06
  To: Nick Fisk
  Cc: ceph-users
  Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's
 
  On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk wrote:
   Hi,
  
   Using Kernel RBD client with Kernel 4.03 (I have also tried some
   older kernels with the same effect) and IO is being split into
   smaller IO's which is having a negative impact on performance.
  
   cat /sys/block/sdc/queue/max_hw_sectors_kb
   4096
  
   cat /sys/block/rbd0/queue/max_sectors_kb
   4096
  
   Using DD
   dd if=/dev/rbd0 of=/dev/null bs=4M
  
   Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
avgrq-sz
   avgqu-sz   await r_await w_await  svctm  %util
   rbd0  0.00 0.00  201.500.00 25792.00 0.00
256.00
   1.99   10.15   10.150.00   4.96 100.00
  
  
   Using FIO with 4M blocks
   Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
avgrq-sz
   avgqu-sz   await r_await w_await  svctm  %util
   rbd0  0.00 0.00  232.000.00 118784.00 0.00
1024.00
   11.29   48.58   48.580.00   4.31 100.00
  
   Any ideas why IO sizes are limited to 128k (256 blocks) in DD's
   case and 512k in Fio's case?
 
  128k vs 512k is probably buffered vs direct IO - add iflag=direct to
  your dd invocation.
 
  Yes, thanks for this, that was the case
 
 
  As for the 512k - I'm pretty sure it's a regression in our switch to
  blk-mq.  I tested it around 3.18-3.19 and saw steady 4M IOs.  I hope
  we are just missing a knob - I'll take a look.
 
  I've tested both 4.03 and 3.16 and both seem to be split into 512k. Let
me
 know if you need me to test any other particular version.
 
 With 3.16 you are going to need to adjust max_hw_sectors_kb /
 max_sectors_kb as discussed in Dan's thread.  The patch that fixed that in
 the block layer went into 3.19, blk-mq into 4.0 - try 3.19.

Sorry should have mentioned, I had adjusted both of them on the 3.16 kernel
to 4096.
I will try 3.19 and let you know.

 
 Thanks,
 
 Ilya
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd splitting large IO's into smaller IO's

2015-06-10 Thread Ilya Dryomov
On Wed, Jun 10, 2015 at 7:04 PM, Nick Fisk n...@fisk.me.uk wrote:
   -Original Message-
   From: Ilya Dryomov [mailto:idryo...@gmail.com]
   Sent: 10 June 2015 14:06
   To: Nick Fisk
   Cc: ceph-users
   Subject: Re: [ceph-users] krbd splitting large IO's into smaller
   IO's
  
   On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk wrote:
Hi,
   
Using Kernel RBD client with Kernel 4.03 (I have also tried some
older kernels with the same effect) and IO is being split into
smaller IO's which is having a negative impact on performance.
   
cat /sys/block/sdc/queue/max_hw_sectors_kb
4096
   
cat /sys/block/rbd0/queue/max_sectors_kb
4096
   
Using DD
dd if=/dev/rbd0 of=/dev/null bs=4M
   
Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
 avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
rbd0  0.00 0.00  201.500.00 25792.00 0.00
 256.00
1.99   10.15   10.150.00   4.96 100.00
   
   
Using FIO with 4M blocks
Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
 avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
rbd0  0.00 0.00  232.000.00 118784.00 0.00
 1024.00
11.29   48.58   48.580.00   4.31 100.00
   
Any ideas why IO sizes are limited to 128k (256 blocks) in DD's
case and 512k in Fio's case?
  
   128k vs 512k is probably buffered vs direct IO - add iflag=direct
   to your dd invocation.
  
   Yes, thanks for this, that was the case
  
  
   As for the 512k - I'm pretty sure it's a regression in our switch
   to blk-mq.  I tested it around 3.18-3.19 and saw steady 4M IOs.  I
   hope we are just missing a knob - I'll take a look.
  
   I've tested both 4.03 and 3.16 and both seem to be split into 512k.
   Let
 me
  know if you need me to test any other particular version.
 
  With 3.16 you are going to need to adjust max_hw_sectors_kb /
  max_sectors_kb as discussed in Dan's thread.  The patch that fixed
  that in the block layer went into 3.19, blk-mq into 4.0 - try 3.19.

 Sorry should have mentioned, I had adjusted both of them on the 3.16
 kernel to 4096.
 I will try 3.19 and let you know.

 Better with 3.19, but should I not be seeing around 8192, or am I getting my
 blocks and bytes mixed up?

 Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
 avgqu-sz   await r_await w_await  svctm  %util
 rbd0 72.00 0.00   24.000.00 49152.00 0.00  4096.00
 1.96   82.67   82.670.00  41.58  99.80

I'd expect 8192.  I'm getting a box for investigation.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd splitting large IO's into smaller IO's

2015-06-10 Thread German Anders
hi guys, sorry to jump onto this thread. I have four OSD servers running Ubuntu
14.04.1 LTS with 9 osd daemons each (3TB drive size) and 3 ssd journal
drives (each journal holds 3 osd daemons). The kernel version that I'm
using is 3.18.3-031803-generic and the ceph version is 0.82. I would like to know
what the 'best' parameters would be in terms of IO for my 3TB devices. I have:

scheduler: deadline
max_hw_sectors_kb: 16383
max_sectors_kb: 4096
read_ahead_kb: 128
nr_requests: 128
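
For reference, these are applied per device through sysfs; e.g. for one of the
3TB drives (sdd here, and the values shown are just the current ones from the
list above, not recommendations):

  echo deadline > /sys/block/sdd/queue/scheduler
  echo 4096 > /sys/block/sdd/queue/max_sectors_kb
  echo 128 > /sys/block/sdd/queue/read_ahead_kb
  echo 128 > /sys/block/sdd/queue/nr_requests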

I'm experiencing some high IO waits on all the OSD servers:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   1.740.00   15.43   *64.80*0.00   18.03

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
sda1610.40   322.20  374.80   11.00  7940.80  1330.00
48.06 0.080.210.200.44   0.20   7.68
sdb 130.60   322.20   55.00   11.00   742.40  1330.00
62.80 0.020.230.170.51   0.19   1.28
md0   0.00 0.00 2170.80  332.40  8683.20  1329.60
8.00 0.000.000.000.00   0.00   0.00
dm-0  0.00 0.000.000.00 0.00 0.00
0.00 0.000.000.000.00   0.00   0.00
dm-1  0.00 0.00 2170.80  332.40  8683.20  1329.60
8.00 0.870.350.211.26   0.03   7.84
sdd   0.00 0.00   11.80  384.40  4217.60 33197.60
188.8775.17  189.72  130.78  191.53   1.88  74.64
sdc   0.00 0.00   18.80  313.40   581.60 33154.40
203.1178.09  235.08   66.85  245.17   2.16  71.84
sde   0.00 0.000.000.00 0.00 0.00
0.00 0.000.000.000.00   0.00   0.00
sdf   0.00 0.80   78.20  181.40 10400.80 19204.80
228.0931.75  110.93   43.09  140.18   2.99  77.52
sdg   0.00 0.001.60  304.6051.20 31647.20
207.0464.05  209.19   73.50  209.90   1.90  58.32
sdh   0.00 0.000.000.00 0.00 0.00
0.00 0.000.000.000.00   0.00   0.00
sdi   0.00 0.006.60   17.20   159.20  2784.80
247.39 0.279.14   12.128.00   3.19   7.60
sdk   0.00 0.000.000.00 0.00 0.00
0.00 0.000.000.000.00   0.00   0.00
sdj   0.00 0.00   13.40  120.00   428.80  8487.20
133.6723.91  203.37   36.18  222.04   2.64  35.28
sdl   0.00 0.80   12.40  524.20  2088.80 40842.40
160.0193.53  168.27  183.35  167.91   1.64  88.24
sdn   0.00 1.404.00  433.8092.80 35926.40
164.5588.72  196.29  299.40  195.33   1.71  74.96
sdm   0.00 0.000.60  544.6019.20 40348.00
148.08   118.31  217.00   17.33  217.22   1.67  90.80

Thanks in advance,

Best regards,


*German Anders*
Storage System Engineer Leader
*Despegar* | IT Team
*office* +54 11 4894 3500 x3408
*mobile* +54 911 3493 7262
*mail* gand...@despegar.com

2015-06-10 13:07 GMT-03:00 Ilya Dryomov idryo...@gmail.com:

 On Wed, Jun 10, 2015 at 7:04 PM, Nick Fisk n...@fisk.me.uk wrote:
-Original Message-
From: Ilya Dryomov [mailto:idryo...@gmail.com]
Sent: 10 June 2015 14:06
To: Nick Fisk
Cc: ceph-users
Subject: Re: [ceph-users] krbd splitting large IO's into smaller
IO's
   
On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk
 wrote:
 Hi,

 Using Kernel RBD client with Kernel 4.03 (I have also tried some
 older kernels with the same effect) and IO is being split into
 smaller IO's which is having a negative impact on performance.

 cat /sys/block/sdc/queue/max_hw_sectors_kb
 4096

 cat /sys/block/rbd0/queue/max_sectors_kb
 4096

 Using DD
 dd if=/dev/rbd0 of=/dev/null bs=4M

 Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
  avgrq-sz
 avgqu-sz   await r_await w_await  svctm  %util
 rbd0  0.00 0.00  201.500.00 25792.00 0.00
  256.00
 1.99   10.15   10.150.00   4.96 100.00


 Using FIO with 4M blocks
 Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
  avgrq-sz
 avgqu-sz   await r_await w_await  svctm  %util
 rbd0  0.00 0.00  232.000.00 118784.00
  0.00
  1024.00
 11.29   48.58   48.580.00   4.31 100.00

 Any ideas why IO sizes are limited to 128k (256 blocks) in DD's
 case and 512k in Fio's case?
   
128k vs 512k is probably buffered vs direct IO - add iflag=direct
to your dd invocation.
   
Yes, thanks for this, that was the case
   
   
As for the 512k - I'm pretty sure it's a regression in our switch
to blk-mq.  I tested it around 3.18-3.19 and saw steady 4M IOs.  I
hope we are just missing a knob - I'll take a look.
   
I've tested both 4.03 and 3.16 and both seem to be split into 512k.
Let
  me
   know if you need me to test any other particular

Re: [ceph-users] krbd splitting large IO's into smaller IO's

2015-06-10 Thread Nick Fisk
 -Original Message-
 From: Ilya Dryomov [mailto:idryo...@gmail.com]
 Sent: 10 June 2015 14:06
 To: Nick Fisk
 Cc: ceph-users
 Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's
 
 On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk wrote:
  Hi,
 
  Using Kernel RBD client with Kernel 4.03 (I have also tried some older
  kernels with the same effect) and IO is being split into smaller IO's
  which is having a negative impact on performance.
 
  cat /sys/block/sdc/queue/max_hw_sectors_kb
  4096
 
  cat /sys/block/rbd0/queue/max_sectors_kb
  4096
 
  Using DD
  dd if=/dev/rbd0 of=/dev/null bs=4M
 
  Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
  avgqu-sz   await r_await w_await  svctm  %util
  rbd0  0.00 0.00  201.500.00 25792.00 0.00   256.00
  1.99   10.15   10.150.00   4.96 100.00
 
 
  Using FIO with 4M blocks
  Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
  avgqu-sz   await r_await w_await  svctm  %util
  rbd0  0.00 0.00  232.000.00 118784.00 0.00  1024.00
  11.29   48.58   48.580.00   4.31 100.00
 
  Any ideas why IO sizes are limited to 128k (256 blocks) in DD's case
  and 512k in Fio's case?
 
 128k vs 512k is probably buffered vs direct IO - add iflag=direct to your dd
 invocation.

Yes, thanks for this, that was the case

 
 As for the 512k - I'm pretty sure it's a regression in our switch to blk-mq.  
 I
 tested it around 3.18-3.19 and saw steady 4M IOs.  I hope we are just missing
 a knob - I'll take a look.

I've tested both 4.03 and 3.16 and both seem to be split into 512k. Let me know 
if you need me to test any other particular version.

 
 Thanks,
 
 Ilya




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd splitting large IO's into smaller IO's

2015-06-10 Thread Ilya Dryomov
On Wed, Jun 10, 2015 at 6:18 PM, Nick Fisk n...@fisk.me.uk wrote:
 -Original Message-
 From: Ilya Dryomov [mailto:idryo...@gmail.com]
 Sent: 10 June 2015 14:06
 To: Nick Fisk
 Cc: ceph-users
 Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's

 On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk wrote:
  Hi,
 
  Using Kernel RBD client with Kernel 4.03 (I have also tried some older
  kernels with the same effect) and IO is being split into smaller IO's
  which is having a negative impact on performance.
 
  cat /sys/block/sdc/queue/max_hw_sectors_kb
  4096
 
  cat /sys/block/rbd0/queue/max_sectors_kb
  4096
 
  Using DD
  dd if=/dev/rbd0 of=/dev/null bs=4M
 
  Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
  avgqu-sz   await r_await w_await  svctm  %util
  rbd0  0.00 0.00  201.500.00 25792.00 0.00   256.00
  1.99   10.15   10.150.00   4.96 100.00
 
 
  Using FIO with 4M blocks
  Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
  avgqu-sz   await r_await w_await  svctm  %util
  rbd0  0.00 0.00  232.000.00 118784.00 0.00  1024.00
  11.29   48.58   48.580.00   4.31 100.00
 
  Any ideas why IO sizes are limited to 128k (256 blocks) in DD's case
  and 512k in Fio's case?

 128k vs 512k is probably buffered vs direct IO - add iflag=direct to your dd
 invocation.

 Yes, thanks for this, that was the case


 As for the 512k - I'm pretty sure it's a regression in our switch to blk-mq. 
  I
 tested it around 3.18-3.19 and saw steady 4M IOs.  I hope we are just missing
 a knob - I'll take a look.

 I've tested both 4.03 and 3.16 and both seem to be split into 512k. Let me 
 know if you need me to test any other particular version.

With 3.16 you are going to need to adjust max_hw_sectors_kb /
max_sectors_kb as discussed in Dan's thread.  The patch that fixed that
in the block layer went into 3.19, blk-mq into 4.0 - try 3.19.
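
A minimal sketch of that adjustment (rbd0 assumed; it has to be repeated after
each mapping, and the write is capped by whatever max_hw_sectors_kb the driver
advertises, which is part of why 3.19 is the easier target):

cat /sys/block/rbd0/queue/max_hw_sectors_kb
echo 4096 > /sys/block/rbd0/queue/max_sectors_kb
cat /sys/block/rbd0/queue/max_sectors_kb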

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com