Re: [ceph-users] krbd splitting large IO's into smaller IO's
On Tue, Jun 30, 2015 at 8:30 AM, Z Zhang zhangz.da...@outlook.com wrote:

Hi Ilya, thanks for your explanation. This makes sense. Will you make max_segments configurable? Could you please point me to the fix you have made? We might help to test it.

It's "[PATCH] rbd: bump queue_max_segments" on ceph-devel.

Thanks,
Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
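For readers who want a rough picture of what such a one-liner looks like, here is a hedged sketch of the idea, not the actual ceph-devel patch: the helper name rbd_setup_queue_limits is made up, and the exact value chosen is only an illustration. The point is simply to raise the queue's segment limit above the block-layer default of 128 so that eight 512k bios merged into one 4M request can never run out of segments.

    #include <linux/blkdev.h>

    /*
     * Illustrative sketch only -- rbd_setup_queue_limits() is a made-up
     * helper, not code from the posted patch. segment_size is the RBD
     * object size (4M by default).
     */
    static void rbd_setup_queue_limits(struct request_queue *q, u64 segment_size)
    {
        /* 8192 for a 4M object; sectors are a safe upper bound on segments */
        unsigned int segment_sectors = segment_size >> 9;

        /*
         * Without an explicit call the queue keeps the block-layer default
         * of 128 segments (BLK_MAX_SEGMENTS), which is what splits 4M I/Os
         * on 3.10 as described in this thread.
         */
        blk_queue_max_segments(q, segment_sectors);
    }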
Re: [ceph-users] krbd splitting large IO's into smaller IO's
Hi Ilya,

Thanks for your explanation. This makes sense. Will you make max_segments configurable? Could you please point me to the fix you have made? We might help to test it.

Thanks.
David Zhang

Date: Fri, 26 Jun 2015 18:21:55 +0300
Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's
From: idryo...@gmail.com
To: zhangz.da...@outlook.com
CC: ceph-users@lists.ceph.com

On Fri, Jun 26, 2015 at 3:17 PM, Z Zhang zhangz.da...@outlook.com wrote:

Hi Ilya, I am seeing your recent email talking about krbd splitting large IO's into smaller IO's, see the link below.
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20587.html

I just tried it on my ceph cluster using kernel 3.10.0-1. I adjusted both max_sectors_kb and max_hw_sectors_kb of the rbd device to 4096.

Using fio with 4M block size for read:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd3 81.00 0.00 135.00 0.00 108.00 0.00 1638.40 2.72 20.15 20.15 0.00 7.41 100.00

Using fio with 1M or 2M block size for read:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd3 0.00 0.00 213.00 0.00 106.50 0.00 1024.00 2.56 12.02 12.02 0.00 4.69 100.00

Using fio with 4M block size for write:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd3 0.00 40.00 0.00 40.00 0.00 40.00 2048.00 2.87 70.90 0.00 70.90 24.90 99.60

Using fio with 1M or 2M block size for write:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd3 0.00 0.00 0.00 80.00 0.00 40.00 1024.00 3.55 48.20 0.00 48.20 12.50 100.00

So why is the IO size here far less than 4096? (With the default value of 512, all of the IO sizes are 1024.) Are there other parameters that need adjusting, or is it about this kernel version?

It's about this kernel version. Assuming you are doing direct I/Os with fio, setting max_sectors_kb to 4096 is really the only thing you can do, and that's enough to *sometimes* see 8192-sector (i.e. 4M) I/Os. The problem is the max_segments value, which in 3.10 is 128 and which you cannot adjust via sysfs. It all comes down to the memory allocator: to get a 4M I/O, the total number of segments (physically contiguous chunks of memory) in the 8 bios (8*512k = 4M) that need to be merged has to be <= 128. When you are allocated nice, contiguous bios you get 4M I/Os; in other cases you don't.

This will be fixed in 4.2, along with a bunch of other things. This particular max_segments fix is a one-liner, so we will probably backport it to older kernels, including 3.10.

Thanks,
Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] krbd splitting large IO's into smaller IO's
On Fri, Jun 26, 2015 at 3:17 PM, Z Zhang zhangz.da...@outlook.com wrote:

Hi Ilya, I am seeing your recent email talking about krbd splitting large IO's into smaller IO's, see the link below.
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20587.html

I just tried it on my ceph cluster using kernel 3.10.0-1. I adjusted both max_sectors_kb and max_hw_sectors_kb of the rbd device to 4096.

Using fio with 4M block size for read:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd3 81.00 0.00 135.00 0.00 108.00 0.00 1638.40 2.72 20.15 20.15 0.00 7.41 100.00

Using fio with 1M or 2M block size for read:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd3 0.00 0.00 213.00 0.00 106.50 0.00 1024.00 2.56 12.02 12.02 0.00 4.69 100.00

Using fio with 4M block size for write:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd3 0.00 40.00 0.00 40.00 0.00 40.00 2048.00 2.87 70.90 0.00 70.90 24.90 99.60

Using fio with 1M or 2M block size for write:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd3 0.00 0.00 0.00 80.00 0.00 40.00 1024.00 3.55 48.20 0.00 48.20 12.50 100.00

So why is the IO size here far less than 4096? (With the default value of 512, all of the IO sizes are 1024.) Are there other parameters that need adjusting, or is it about this kernel version?

It's about this kernel version. Assuming you are doing direct I/Os with fio, setting max_sectors_kb to 4096 is really the only thing you can do, and that's enough to *sometimes* see 8192-sector (i.e. 4M) I/Os. The problem is the max_segments value, which in 3.10 is 128 and which you cannot adjust via sysfs. It all comes down to the memory allocator: to get a 4M I/O, the total number of segments (physically contiguous chunks of memory) in the 8 bios (8*512k = 4M) that need to be merged has to be <= 128. When you are allocated nice, contiguous bios you get 4M I/Os; in other cases you don't.

This will be fixed in 4.2, along with a bunch of other things. This particular max_segments fix is a one-liner, so we will probably backport it to older kernels, including 3.10.

Thanks,
Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
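To make the merge criterion Ilya describes concrete, here is a small, self-contained sketch of the arithmetic. It is illustrative only, not block-layer code: the best/worst case split and the 4k-page worst case are assumptions used for the example, while the 128-segment limit and the 8 x 512k = 4M figures come from the explanation above.

    #include <stdio.h>

    int main(void)
    {
        const int bio_count = 8;        /* 8 x 512k bios = one 4M I/O */
        const int max_segments = 128;   /* hard-coded queue limit on 3.10 */

        /* Best case: each 512k bio is one physically contiguous chunk. */
        int best_case = bio_count * 1;

        /* Worst case: every 4k page of a 512k bio is its own segment. */
        int worst_case = bio_count * (512 / 4);

        printf("best case:  %d segments -> %s\n", best_case,
               best_case <= max_segments ? "merges into a 4M request" : "stays split");
        printf("worst case: %d segments -> %s\n", worst_case,
               worst_case <= max_segments ? "merges into a 4M request" : "stays split");
        return 0;
    }

Run it and the best case (8 segments) merges while the worst case (1024 segments) does not, which is why the 4M I/Os only show up *sometimes*, depending on how contiguous the allocator's memory happened to be.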
[ceph-users] krbd splitting large IO's into smaller IO's
Hi Ilya,

I am seeing your recent email talking about krbd splitting large IO's into smaller IO's, see the link below.
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg20587.html

I just tried it on my ceph cluster using kernel 3.10.0-1. I adjusted both max_sectors_kb and max_hw_sectors_kb of the rbd device to 4096.

Using fio with 4M block size for read:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd3 81.00 0.00 135.00 0.00 108.00 0.00 1638.40 2.72 20.15 20.15 0.00 7.41 100.00

Using fio with 1M or 2M block size for read:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd3 0.00 0.00 213.00 0.00 106.50 0.00 1024.00 2.56 12.02 12.02 0.00 4.69 100.00

Using fio with 4M block size for write:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd3 0.00 40.00 0.00 40.00 0.00 40.00 2048.00 2.87 70.90 0.00 70.90 24.90 99.60

Using fio with 1M or 2M block size for write:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd3 0.00 0.00 0.00 80.00 0.00 40.00 1024.00 3.55 48.20 0.00 48.20 12.50 100.00

So why is the IO size here far less than 4096? (With the default value of 512, all of the IO sizes are 1024.) Are there other parameters that need adjusting, or is it about this kernel version?

Thanks!

Regards,
David Zhang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] krbd splitting large IO's into smaller IO's
On Thu, Jun 11, 2015 at 2:23 PM, Ilya Dryomov idryo...@gmail.com wrote:
On Wed, Jun 10, 2015 at 7:07 PM, Ilya Dryomov idryo...@gmail.com wrote:
On Wed, Jun 10, 2015 at 7:04 PM, Nick Fisk n...@fisk.me.uk wrote:

-Original Message-
From: Ilya Dryomov [mailto:idryo...@gmail.com]
Sent: 10 June 2015 14:06
To: Nick Fisk
Cc: ceph-users
Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's

On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk wrote:

Hi,

Using Kernel RBD client with Kernel 4.03 (I have also tried some older kernels with the same effect) and IO is being split into smaller IO's which is having a negative impact on performance.

cat /sys/block/sdc/queue/max_hw_sectors_kb
4096
cat /sys/block/rbd0/queue/max_sectors_kb
4096

Using DD: dd if=/dev/rbd0 of=/dev/null bs=4M
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd0 0.00 0.00 201.50 0.00 25792.00 0.00 256.00 1.99 10.15 10.15 0.00 4.96 100.00

Using FIO with 4M blocks:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd0 0.00 0.00 232.00 0.00 118784.00 0.00 1024.00 11.29 48.58 48.58 0.00 4.31 100.00

Any ideas why IO sizes are limited to 128k (256 blocks) in DD's case and 512k in Fio's case?

128k vs 512k is probably buffered vs direct IO - add iflag=direct to your dd invocation.

Yes, thanks for this, that was the case.

As for the 512k - I'm pretty sure it's a regression in our switch to blk-mq. I tested it around 3.18-3.19 and saw steady 4M IOs. I hope we are just missing a knob - I'll take a look.

I've tested both 4.03 and 3.16 and both seem to be split into 512k. Let me know if you need me to test any other particular version.

With 3.16 you are going to need to adjust max_hw_sectors_kb / max_sectors_kb as discussed in Dan's thread. The patch that fixed that in the block layer went into 3.19, blk-mq into 4.0 - try 3.19.

Sorry, should have mentioned, I had adjusted both of them on the 3.16 kernel to 4096. I will try 3.19 and let you know.

Better with 3.19, but should I not be seeing around 8192, or am I getting my blocks and bytes mixed up?

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd0 72.00 0.00 24.00 0.00 49152.00 0.00 4096.00 1.96 82.67 82.67 0.00 41.58 99.80

I'd expect 8192. I'm getting a box for investigation.

OK, so this is a bug in the blk-mq part of the block layer. There is no plugging going on in the single hardware queue (i.e. krbd) case - it never once plugs the queue, and that means no request merging is done for your direct sequential read test. It gets 512k bios and those same 512k requests are issued to krbd. While queue plugging may not make sense in the multi-queue case, I'm pretty sure it's supposed to plug in the single-queue case. Looks like the use_plug logic in blk_sq_make_request() is busted.

It turns out to be a year-old regression. Before commit 07068d5b8ed8 ("blk-mq: split make request handler for multi and single queue") it used to be (reads are considered sync)

use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);

and now it is

use_plug = !is_flush_fua && !is_sync;

in a function that is only called if q->nr_hw_queues == 1.

This is getting fixed by "blk-mq: fix plugging in blk_sq_make_request" from Jeff Moyer - http://article.gmane.org/gmane.linux.kernel/1941750. Looks like it's on its way to mainline along with some other blk-mq plugging fixes.

Thanks,
Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
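For readers following along without the kernel sources handy, here is a sketch of the two versions of the condition Ilya quotes, reconstructed from his description and lifted into standalone helpers so they can be compared and run. The helper names and the main() harness are illustrative only; the two expressions are the ones quoted above.

    #include <stdbool.h>
    #include <stdio.h>

    /* blk_sq_make_request() is the single-hardware-queue path, i.e.
     * nr_hw_queues == 1 is already implied whenever it runs. */

    /* Before commit 07068d5b8ed8 ("blk-mq: split make request handler for
     * multi and single queue"): reads count as sync, but a single-queue
     * device such as krbd still plugged, so 512k bios could merge into 4M. */
    static bool use_plug_before(bool is_flush_fua, bool is_sync, int nr_hw_queues)
    {
        return !is_flush_fua && ((nr_hw_queues == 1) || !is_sync);
    }

    /* After the split: the nr_hw_queues term is gone, so sync requests
     * (i.e. all reads) never plug and the 512k bios reach krbd unmerged. */
    static bool use_plug_after(bool is_flush_fua, bool is_sync)
    {
        return !is_flush_fua && !is_sync;
    }

    int main(void)
    {
        /* A direct sequential read: not flush/FUA, counted as sync, one hw queue. */
        printf("before 07068d5b8ed8: use_plug = %d\n", use_plug_before(false, true, 1)); /* 1 */
        printf("after  07068d5b8ed8: use_plug = %d\n", use_plug_after(false, true));     /* 0 */
        return 0;
    }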
Re: [ceph-users] krbd splitting large IO's into smaller IO's
On Wed, Jun 10, 2015 at 7:07 PM, Ilya Dryomov idryo...@gmail.com wrote: On Wed, Jun 10, 2015 at 7:04 PM, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: Ilya Dryomov [mailto:idryo...@gmail.com] Sent: 10 June 2015 14:06 To: Nick Fisk Cc: ceph-users Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk wrote: Hi, Using Kernel RBD client with Kernel 4.03 (I have also tried some older kernels with the same effect) and IO is being split into smaller IO's which is having a negative impact on performance. cat /sys/block/sdc/queue/max_hw_sectors_kb 4096 cat /sys/block/rbd0/queue/max_sectors_kb 4096 Using DD dd if=/dev/rbd0 of=/dev/null bs=4M Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 0.00 0.00 201.500.00 25792.00 0.00 256.00 1.99 10.15 10.150.00 4.96 100.00 Using FIO with 4M blocks Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 0.00 0.00 232.000.00 118784.00 0.00 1024.00 11.29 48.58 48.580.00 4.31 100.00 Any ideas why IO sizes are limited to 128k (256 blocks) in DD's case and 512k in Fio's case? 128k vs 512k is probably buffered vs direct IO - add iflag=direct to your dd invocation. Yes, thanks for this, that was the case As for the 512k - I'm pretty sure it's a regression in our switch to blk-mq. I tested it around 3.18-3.19 and saw steady 4M IOs. I hope we are just missing a knob - I'll take a look. I've tested both 4.03 and 3.16 and both seem to be split into 512k. Let me know if you need me to test any other particular version. With 3.16 you are going to need to adjust max_hw_sectors_kb / max_sectors_kb as discussed in Dan's thread. The patch that fixed that in the block layer went into 3.19, blk-mq into 4.0 - try 3.19. Sorry should have mentioned, I had adjusted both of them on the 3.16 kernel to 4096. I will try 3.19 and let you know. Better with 3.19, but should I not be seeing around 8192, or am I getting my blocks and bytes mixed up? Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 72.00 0.00 24.000.00 49152.00 0.00 4096.00 1.96 82.67 82.670.00 41.58 99.80 I'd expect 8192. I'm getting a box for investigation. OK, so this is bug in the blk-mq part of block layer. There is no plugging going on in the single hardware queue (i.e. krbd) case - it never once plugs the queue, and that means no request merging is done for your direct sequential read test. It gets 512k bios and those same 512k requests are issued to krbd. While queue plugging may not make sense in the multi queue case, I'm pretty sure it's supposed to plug in the single queue case. Looks like use_plug logic in blk_sq_make_request() is busted. Thanks, Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] krbd splitting large IO's into smaller IO's
-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ilya Dryomov
Sent: 11 June 2015 12:33
To: Nick Fisk
Cc: ceph-users
Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's

On Thu, Jun 11, 2015 at 2:23 PM, Ilya Dryomov idryo...@gmail.com wrote:
On Wed, Jun 10, 2015 at 7:07 PM, Ilya Dryomov idryo...@gmail.com wrote:
On Wed, Jun 10, 2015 at 7:04 PM, Nick Fisk n...@fisk.me.uk wrote:

-Original Message-
From: Ilya Dryomov [mailto:idryo...@gmail.com]
Sent: 10 June 2015 14:06
To: Nick Fisk
Cc: ceph-users
Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's

On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk wrote:

Hi,

Using Kernel RBD client with Kernel 4.03 (I have also tried some older kernels with the same effect) and IO is being split into smaller IO's which is having a negative impact on performance.

cat /sys/block/sdc/queue/max_hw_sectors_kb
4096
cat /sys/block/rbd0/queue/max_sectors_kb
4096

Using DD: dd if=/dev/rbd0 of=/dev/null bs=4M
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd0 0.00 0.00 201.50 0.00 25792.00 0.00 256.00 1.99 10.15 10.15 0.00 4.96 100.00

Using FIO with 4M blocks:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd0 0.00 0.00 232.00 0.00 118784.00 0.00 1024.00 11.29 48.58 48.58 0.00 4.31 100.00

Any ideas why IO sizes are limited to 128k (256 blocks) in DD's case and 512k in Fio's case?

128k vs 512k is probably buffered vs direct IO - add iflag=direct to your dd invocation.

Yes, thanks for this, that was the case.

As for the 512k - I'm pretty sure it's a regression in our switch to blk-mq. I tested it around 3.18-3.19 and saw steady 4M IOs. I hope we are just missing a knob - I'll take a look.

I've tested both 4.03 and 3.16 and both seem to be split into 512k. Let me know if you need me to test any other particular version.

With 3.16 you are going to need to adjust max_hw_sectors_kb / max_sectors_kb as discussed in Dan's thread. The patch that fixed that in the block layer went into 3.19, blk-mq into 4.0 - try 3.19.

Sorry, should have mentioned, I had adjusted both of them on the 3.16 kernel to 4096. I will try 3.19 and let you know.

Better with 3.19, but should I not be seeing around 8192, or am I getting my blocks and bytes mixed up?

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd0 72.00 0.00 24.00 0.00 49152.00 0.00 4096.00 1.96 82.67 82.67 0.00 41.58 99.80

I'd expect 8192. I'm getting a box for investigation.

OK, so this is a bug in the blk-mq part of the block layer. There is no plugging going on in the single hardware queue (i.e. krbd) case - it never once plugs the queue, and that means no request merging is done for your direct sequential read test. It gets 512k bios and those same 512k requests are issued to krbd. While queue plugging may not make sense in the multi-queue case, I'm pretty sure it's supposed to plug in the single-queue case. Looks like the use_plug logic in blk_sq_make_request() is busted.

It turns out to be a year-old regression. Before commit 07068d5b8ed8 ("blk-mq: split make request handler for multi and single queue") it used to be (reads are considered sync)

use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);

and now it is

use_plug = !is_flush_fua && !is_sync;

in a function that is only called if q->nr_hw_queues == 1.

This is getting fixed by "blk-mq: fix plugging in blk_sq_make_request" from Jeff Moyer - http://article.gmane.org/gmane.linux.kernel/1941750. Looks like it's on its way to mainline along with some other blk-mq plugging fixes.

That's great, do you think it will make 4.2?

Thanks,
Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] krbd splitting large IO's into smaller IO's
On Thu, Jun 11, 2015 at 5:30 PM, Nick Fisk n...@fisk.me.uk wrote:

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ilya Dryomov
Sent: 11 June 2015 12:33
To: Nick Fisk
Cc: ceph-users
Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's

On Thu, Jun 11, 2015 at 2:23 PM, Ilya Dryomov idryo...@gmail.com wrote:
On Wed, Jun 10, 2015 at 7:07 PM, Ilya Dryomov idryo...@gmail.com wrote:
On Wed, Jun 10, 2015 at 7:04 PM, Nick Fisk n...@fisk.me.uk wrote:

-Original Message-
From: Ilya Dryomov [mailto:idryo...@gmail.com]
Sent: 10 June 2015 14:06
To: Nick Fisk
Cc: ceph-users
Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's

On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk wrote:

Hi,

Using Kernel RBD client with Kernel 4.03 (I have also tried some older kernels with the same effect) and IO is being split into smaller IO's which is having a negative impact on performance.

cat /sys/block/sdc/queue/max_hw_sectors_kb
4096
cat /sys/block/rbd0/queue/max_sectors_kb
4096

Using DD: dd if=/dev/rbd0 of=/dev/null bs=4M
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd0 0.00 0.00 201.50 0.00 25792.00 0.00 256.00 1.99 10.15 10.15 0.00 4.96 100.00

Using FIO with 4M blocks:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd0 0.00 0.00 232.00 0.00 118784.00 0.00 1024.00 11.29 48.58 48.58 0.00 4.31 100.00

Any ideas why IO sizes are limited to 128k (256 blocks) in DD's case and 512k in Fio's case?

128k vs 512k is probably buffered vs direct IO - add iflag=direct to your dd invocation.

Yes, thanks for this, that was the case.

As for the 512k - I'm pretty sure it's a regression in our switch to blk-mq. I tested it around 3.18-3.19 and saw steady 4M IOs. I hope we are just missing a knob - I'll take a look.

I've tested both 4.03 and 3.16 and both seem to be split into 512k. Let me know if you need me to test any other particular version.

With 3.16 you are going to need to adjust max_hw_sectors_kb / max_sectors_kb as discussed in Dan's thread. The patch that fixed that in the block layer went into 3.19, blk-mq into 4.0 - try 3.19.

Sorry, should have mentioned, I had adjusted both of them on the 3.16 kernel to 4096. I will try 3.19 and let you know.

Better with 3.19, but should I not be seeing around 8192, or am I getting my blocks and bytes mixed up?

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd0 72.00 0.00 24.00 0.00 49152.00 0.00 4096.00 1.96 82.67 82.67 0.00 41.58 99.80

I'd expect 8192. I'm getting a box for investigation.

OK, so this is a bug in the blk-mq part of the block layer. There is no plugging going on in the single hardware queue (i.e. krbd) case - it never once plugs the queue, and that means no request merging is done for your direct sequential read test. It gets 512k bios and those same 512k requests are issued to krbd. While queue plugging may not make sense in the multi-queue case, I'm pretty sure it's supposed to plug in the single-queue case. Looks like the use_plug logic in blk_sq_make_request() is busted.

It turns out to be a year-old regression. Before commit 07068d5b8ed8 ("blk-mq: split make request handler for multi and single queue") it used to be (reads are considered sync)

use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);

and now it is

use_plug = !is_flush_fua && !is_sync;

in a function that is only called if q->nr_hw_queues == 1.

This is getting fixed by "blk-mq: fix plugging in blk_sq_make_request" from Jeff Moyer - http://article.gmane.org/gmane.linux.kernel/1941750. Looks like it's on its way to mainline along with some other blk-mq plugging fixes.

That's great, do you think it will make 4.2?

Depends on Jens, but I think it will.

Thanks,
Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] krbd splitting large IO's into smaller IO's
Hi,

Using Kernel RBD client with Kernel 4.03 (I have also tried some older kernels with the same effect) and IO is being split into smaller IO's which is having a negative impact on performance.

cat /sys/block/sdc/queue/max_hw_sectors_kb
4096
cat /sys/block/rbd0/queue/max_sectors_kb
4096

Using DD: dd if=/dev/rbd0 of=/dev/null bs=4M
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd0 0.00 0.00 201.50 0.00 25792.00 0.00 256.00 1.99 10.15 10.15 0.00 4.96 100.00

Using FIO with 4M blocks:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd0 0.00 0.00 232.00 0.00 118784.00 0.00 1024.00 11.29 48.58 48.58 0.00 4.31 100.00

Any ideas why IO sizes are limited to 128k (256 blocks) in DD's case and 512k in Fio's case?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] krbd splitting large IO's into smaller IO's
On Wed, Jun 10, 2015 at 3:23 PM, Dan van der Ster d...@vanderster.com wrote:

Hi, I found something similar a while ago within a VM.
http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-November/045034.html
I don't know if the change suggested by Ilya ever got applied.

Yeah, it got applied. We didn't have to do anything in krbd - that artificial limit got nuked up in the stack right after our conversation.

Thanks,
Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] krbd splitting large IO's into smaller IO's
Hi, I found something similar awhile ago within a VM. http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-November/045034.html I don't know if the change suggested by Ilya ever got applied. Cheers, Dan On Wed, Jun 10, 2015 at 1:47 PM, Nick Fisk n...@fisk.me.uk wrote: Hi, Using Kernel RBD client with Kernel 4.03 (I have also tried some older kernels with the same effect) and IO is being split into smaller IO's which is having a negative impact on performance. cat /sys/block/sdc/queue/max_hw_sectors_kb 4096 cat /sys/block/rbd0/queue/max_sectors_kb 4096 Using DD dd if=/dev/rbd0 of=/dev/null bs=4M Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 0.00 0.00 201.500.00 25792.00 0.00 256.00 1.99 10.15 10.150.00 4.96 100.00 Using FIO with 4M blocks Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 0.00 0.00 232.000.00 118784.00 0.00 1024.00 11.29 48.58 48.580.00 4.31 100.00 Any ideas why IO sizes are limited to 128k (256 blocks) in DD's case and 512k in Fio's case? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] krbd splitting large IO's into smaller IO's
Hi Dan, I found your post last night, it does indeed look like the default has been set to 4096 for the Kernel RBD client in the 4.0 kernel. I also checked a machine running 3.16 and this had 512 as the default. However in my case there seems to be something else which is affecting the max block size. This originally stemmed from me trying to use flashcache as a small writeback cache for RBD's to improve sequential write latency. My workload submits all IO as 64kb so sequential write speed tops out around 15MB/s. The idea being that a small flashcache block device should be able to take these small IO's and then spit them out as large 4MB blocks to Ceph, dramatically increasing throughput. However with this limitation, I'm not seeing the gains I expect. Nick -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Dan van der Ster Sent: 10 June 2015 13:24 To: Nick Fisk Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's Hi, I found something similar awhile ago within a VM. http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014- November/045034.html I don't know if the change suggested by Ilya ever got applied. Cheers, Dan On Wed, Jun 10, 2015 at 1:47 PM, Nick Fisk n...@fisk.me.uk wrote: Hi, Using Kernel RBD client with Kernel 4.03 (I have also tried some older kernels with the same effect) and IO is being split into smaller IO's which is having a negative impact on performance. cat /sys/block/sdc/queue/max_hw_sectors_kb 4096 cat /sys/block/rbd0/queue/max_sectors_kb 4096 Using DD dd if=/dev/rbd0 of=/dev/null bs=4M Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 0.00 0.00 201.500.00 25792.00 0.00 256.00 1.99 10.15 10.150.00 4.96 100.00 Using FIO with 4M blocks Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 0.00 0.00 232.000.00 118784.00 0.00 1024.00 11.29 48.58 48.580.00 4.31 100.00 Any ideas why IO sizes are limited to 128k (256 blocks) in DD's case and 512k in Fio's case? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] krbd splitting large IO's into smaller IO's
On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk wrote:

Hi,

Using Kernel RBD client with Kernel 4.03 (I have also tried some older kernels with the same effect) and IO is being split into smaller IO's which is having a negative impact on performance.

cat /sys/block/sdc/queue/max_hw_sectors_kb
4096
cat /sys/block/rbd0/queue/max_sectors_kb
4096

Using DD: dd if=/dev/rbd0 of=/dev/null bs=4M
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd0 0.00 0.00 201.50 0.00 25792.00 0.00 256.00 1.99 10.15 10.15 0.00 4.96 100.00

Using FIO with 4M blocks:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
rbd0 0.00 0.00 232.00 0.00 118784.00 0.00 1024.00 11.29 48.58 48.58 0.00 4.31 100.00

Any ideas why IO sizes are limited to 128k (256 blocks) in DD's case and 512k in Fio's case?

128k vs 512k is probably buffered vs direct IO - add iflag=direct to your dd invocation.

As for the 512k - I'm pretty sure it's a regression in our switch to blk-mq. I tested it around 3.18-3.19 and saw steady 4M IOs. I hope we are just missing a knob - I'll take a look.

Thanks,
Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
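Since the buffered-vs-direct distinction is what explains the 128k dd result, here is a minimal, hedged illustration of what iflag=direct asks for at the syscall level. This is a standalone sketch, not something dd or fio actually runs; /dev/rbd0 and the 4M size are just the values used in the thread, and the alignment choice is an assumption for the example.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define IO_SIZE (4 * 1024 * 1024)   /* 4M, matching bs=4M above */

    int main(void)
    {
        /* O_DIRECT bypasses the page cache, so the 4M buffer is submitted as
         * large bios instead of going through readahead, which is where the
         * 128k requests in the buffered dd run come from. */
        int fd = open("/dev/rbd0", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        /* Direct I/O buffers must be suitably aligned. */
        void *buf = NULL;
        if (posix_memalign(&buf, 4096, IO_SIZE)) { perror("posix_memalign"); close(fd); return 1; }

        ssize_t n = read(fd, buf, IO_SIZE);
        if (n < 0) perror("read");
        else printf("read %zd bytes with O_DIRECT\n", n);

        free(buf);
        close(fd);
        return 0;
    }

Watching avgrq-sz in iostat -x while something like this (or dd with iflag=direct) runs is how the request sizes quoted in this thread were observed.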
Re: [ceph-users] krbd splitting large IO's into smaller IO's
-Original Message- From: Ilya Dryomov [mailto:idryo...@gmail.com] Sent: 10 June 2015 14:06 To: Nick Fisk Cc: ceph-users Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk wrote: Hi, Using Kernel RBD client with Kernel 4.03 (I have also tried some older kernels with the same effect) and IO is being split into smaller IO's which is having a negative impact on performance. cat /sys/block/sdc/queue/max_hw_sectors_kb 4096 cat /sys/block/rbd0/queue/max_sectors_kb 4096 Using DD dd if=/dev/rbd0 of=/dev/null bs=4M Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 0.00 0.00 201.500.00 25792.00 0.00 256.00 1.99 10.15 10.150.00 4.96 100.00 Using FIO with 4M blocks Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 0.00 0.00 232.000.00 118784.00 0.00 1024.00 11.29 48.58 48.580.00 4.31 100.00 Any ideas why IO sizes are limited to 128k (256 blocks) in DD's case and 512k in Fio's case? 128k vs 512k is probably buffered vs direct IO - add iflag=direct to your dd invocation. Yes, thanks for this, that was the case As for the 512k - I'm pretty sure it's a regression in our switch to blk-mq. I tested it around 3.18-3.19 and saw steady 4M IOs. I hope we are just missing a knob - I'll take a look. I've tested both 4.03 and 3.16 and both seem to be split into 512k. Let me know if you need me to test any other particular version. With 3.16 you are going to need to adjust max_hw_sectors_kb / max_sectors_kb as discussed in Dan's thread. The patch that fixed that in the block layer went into 3.19, blk-mq into 4.0 - try 3.19. Sorry should have mentioned, I had adjusted both of them on the 3.16 kernel to 4096. I will try 3.19 and let you know. Better with 3.19, but should I not be seeing around 8192, or am I getting my blocks and bytes mixed up? Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 72.00 0.00 24.000.00 49152.00 0.00 4096.00 1.96 82.67 82.670.00 41.58 99.80 Thanks, Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] krbd splitting large IO's into smaller IO's
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ilya Dryomov Sent: 10 June 2015 16:23 To: Nick Fisk Cc: ceph-users Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's On Wed, Jun 10, 2015 at 6:18 PM, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: Ilya Dryomov [mailto:idryo...@gmail.com] Sent: 10 June 2015 14:06 To: Nick Fisk Cc: ceph-users Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk wrote: Hi, Using Kernel RBD client with Kernel 4.03 (I have also tried some older kernels with the same effect) and IO is being split into smaller IO's which is having a negative impact on performance. cat /sys/block/sdc/queue/max_hw_sectors_kb 4096 cat /sys/block/rbd0/queue/max_sectors_kb 4096 Using DD dd if=/dev/rbd0 of=/dev/null bs=4M Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 0.00 0.00 201.500.00 25792.00 0.00 256.00 1.99 10.15 10.150.00 4.96 100.00 Using FIO with 4M blocks Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 0.00 0.00 232.000.00 118784.00 0.00 1024.00 11.29 48.58 48.580.00 4.31 100.00 Any ideas why IO sizes are limited to 128k (256 blocks) in DD's case and 512k in Fio's case? 128k vs 512k is probably buffered vs direct IO - add iflag=direct to your dd invocation. Yes, thanks for this, that was the case As for the 512k - I'm pretty sure it's a regression in our switch to blk-mq. I tested it around 3.18-3.19 and saw steady 4M IOs. I hope we are just missing a knob - I'll take a look. I've tested both 4.03 and 3.16 and both seem to be split into 512k. Let me know if you need me to test any other particular version. With 3.16 you are going to need to adjust max_hw_sectors_kb / max_sectors_kb as discussed in Dan's thread. The patch that fixed that in the block layer went into 3.19, blk-mq into 4.0 - try 3.19. Sorry should have mentioned, I had adjusted both of them on the 3.16 kernel to 4096. I will try 3.19 and let you know. Thanks, Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] krbd splitting large IO's into smaller IO's
On Wed, Jun 10, 2015 at 7:04 PM, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: Ilya Dryomov [mailto:idryo...@gmail.com] Sent: 10 June 2015 14:06 To: Nick Fisk Cc: ceph-users Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk wrote: Hi, Using Kernel RBD client with Kernel 4.03 (I have also tried some older kernels with the same effect) and IO is being split into smaller IO's which is having a negative impact on performance. cat /sys/block/sdc/queue/max_hw_sectors_kb 4096 cat /sys/block/rbd0/queue/max_sectors_kb 4096 Using DD dd if=/dev/rbd0 of=/dev/null bs=4M Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 0.00 0.00 201.500.00 25792.00 0.00 256.00 1.99 10.15 10.150.00 4.96 100.00 Using FIO with 4M blocks Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 0.00 0.00 232.000.00 118784.00 0.00 1024.00 11.29 48.58 48.580.00 4.31 100.00 Any ideas why IO sizes are limited to 128k (256 blocks) in DD's case and 512k in Fio's case? 128k vs 512k is probably buffered vs direct IO - add iflag=direct to your dd invocation. Yes, thanks for this, that was the case As for the 512k - I'm pretty sure it's a regression in our switch to blk-mq. I tested it around 3.18-3.19 and saw steady 4M IOs. I hope we are just missing a knob - I'll take a look. I've tested both 4.03 and 3.16 and both seem to be split into 512k. Let me know if you need me to test any other particular version. With 3.16 you are going to need to adjust max_hw_sectors_kb / max_sectors_kb as discussed in Dan's thread. The patch that fixed that in the block layer went into 3.19, blk-mq into 4.0 - try 3.19. Sorry should have mentioned, I had adjusted both of them on the 3.16 kernel to 4096. I will try 3.19 and let you know. Better with 3.19, but should I not be seeing around 8192, or am I getting my blocks and bytes mixed up? Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 72.00 0.00 24.000.00 49152.00 0.00 4096.00 1.96 82.67 82.670.00 41.58 99.80 I'd expect 8192. I'm getting a box for investigation. Thanks, Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] krbd splitting large IO's into smaller IO's
hi guys, sorry that I hang on this email, I've four OSD servers with Ubuntu 14.04.1 LTS with 9 osd daemons each, 3TB drive size, and 3 ssd journal drives (each journal holds 3 osd daemons), the kernel version that I'm using is 3.18.3-031803-generic, and ceph version 0.82, I would like to know what would be the 'best' parameters in term of io for my 3TB devices, I've: scheduler: deadline max_hw_sectors_kb: 16383 max_sectors_kb: 4096 read_ahead_kb: 128 nr_requests: 128 I'm experience some high io waits on all the OSD servers: avg-cpu: %user %nice %system %iowait %steal %idle 1.740.00 15.43 *64.80*0.00 18.03 Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda1610.40 322.20 374.80 11.00 7940.80 1330.00 48.06 0.080.210.200.44 0.20 7.68 sdb 130.60 322.20 55.00 11.00 742.40 1330.00 62.80 0.020.230.170.51 0.19 1.28 md0 0.00 0.00 2170.80 332.40 8683.20 1329.60 8.00 0.000.000.000.00 0.00 0.00 dm-0 0.00 0.000.000.00 0.00 0.00 0.00 0.000.000.000.00 0.00 0.00 dm-1 0.00 0.00 2170.80 332.40 8683.20 1329.60 8.00 0.870.350.211.26 0.03 7.84 sdd 0.00 0.00 11.80 384.40 4217.60 33197.60 188.8775.17 189.72 130.78 191.53 1.88 74.64 sdc 0.00 0.00 18.80 313.40 581.60 33154.40 203.1178.09 235.08 66.85 245.17 2.16 71.84 sde 0.00 0.000.000.00 0.00 0.00 0.00 0.000.000.000.00 0.00 0.00 sdf 0.00 0.80 78.20 181.40 10400.80 19204.80 228.0931.75 110.93 43.09 140.18 2.99 77.52 sdg 0.00 0.001.60 304.6051.20 31647.20 207.0464.05 209.19 73.50 209.90 1.90 58.32 sdh 0.00 0.000.000.00 0.00 0.00 0.00 0.000.000.000.00 0.00 0.00 sdi 0.00 0.006.60 17.20 159.20 2784.80 247.39 0.279.14 12.128.00 3.19 7.60 sdk 0.00 0.000.000.00 0.00 0.00 0.00 0.000.000.000.00 0.00 0.00 sdj 0.00 0.00 13.40 120.00 428.80 8487.20 133.6723.91 203.37 36.18 222.04 2.64 35.28 sdl 0.00 0.80 12.40 524.20 2088.80 40842.40 160.0193.53 168.27 183.35 167.91 1.64 88.24 sdn 0.00 1.404.00 433.8092.80 35926.40 164.5588.72 196.29 299.40 195.33 1.71 74.96 sdm 0.00 0.000.60 544.6019.20 40348.00 148.08 118.31 217.00 17.33 217.22 1.67 90.80 Thanks in advance, Best regards, *German Anders* Storage System Engineer Leader *Despegar* | IT Team *office* +54 11 4894 3500 x3408 *mobile* +54 911 3493 7262 *mail* gand...@despegar.com 2015-06-10 13:07 GMT-03:00 Ilya Dryomov idryo...@gmail.com: On Wed, Jun 10, 2015 at 7:04 PM, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: Ilya Dryomov [mailto:idryo...@gmail.com] Sent: 10 June 2015 14:06 To: Nick Fisk Cc: ceph-users Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk wrote: Hi, Using Kernel RBD client with Kernel 4.03 (I have also tried some older kernels with the same effect) and IO is being split into smaller IO's which is having a negative impact on performance. cat /sys/block/sdc/queue/max_hw_sectors_kb 4096 cat /sys/block/rbd0/queue/max_sectors_kb 4096 Using DD dd if=/dev/rbd0 of=/dev/null bs=4M Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 0.00 0.00 201.500.00 25792.00 0.00 256.00 1.99 10.15 10.150.00 4.96 100.00 Using FIO with 4M blocks Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 0.00 0.00 232.000.00 118784.00 0.00 1024.00 11.29 48.58 48.580.00 4.31 100.00 Any ideas why IO sizes are limited to 128k (256 blocks) in DD's case and 512k in Fio's case? 128k vs 512k is probably buffered vs direct IO - add iflag=direct to your dd invocation. 
Yes, thanks for this, that was the case As for the 512k - I'm pretty sure it's a regression in our switch to blk-mq. I tested it around 3.18-3.19 and saw steady 4M IOs. I hope we are just missing a knob - I'll take a look. I've tested both 4.03 and 3.16 and both seem to be split into 512k. Let me know if you need me to test any other particular
Re: [ceph-users] krbd splitting large IO's into smaller IO's
-Original Message- From: Ilya Dryomov [mailto:idryo...@gmail.com] Sent: 10 June 2015 14:06 To: Nick Fisk Cc: ceph-users Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk wrote: Hi, Using Kernel RBD client with Kernel 4.03 (I have also tried some older kernels with the same effect) and IO is being split into smaller IO's which is having a negative impact on performance. cat /sys/block/sdc/queue/max_hw_sectors_kb 4096 cat /sys/block/rbd0/queue/max_sectors_kb 4096 Using DD dd if=/dev/rbd0 of=/dev/null bs=4M Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 0.00 0.00 201.500.00 25792.00 0.00 256.00 1.99 10.15 10.150.00 4.96 100.00 Using FIO with 4M blocks Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 0.00 0.00 232.000.00 118784.00 0.00 1024.00 11.29 48.58 48.580.00 4.31 100.00 Any ideas why IO sizes are limited to 128k (256 blocks) in DD's case and 512k in Fio's case? 128k vs 512k is probably buffered vs direct IO - add iflag=direct to your dd invocation. Yes, thanks for this, that was the case As for the 512k - I'm pretty sure it's a regression in our switch to blk-mq. I tested it around 3.18-3.19 and saw steady 4M IOs. I hope we are just missing a knob - I'll take a look. I've tested both 4.03 and 3.16 and both seem to be split into 512k. Let me know if you need me to test any other particular version. Thanks, Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] krbd splitting large IO's into smaller IO's
On Wed, Jun 10, 2015 at 6:18 PM, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: Ilya Dryomov [mailto:idryo...@gmail.com] Sent: 10 June 2015 14:06 To: Nick Fisk Cc: ceph-users Subject: Re: [ceph-users] krbd splitting large IO's into smaller IO's On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk wrote: Hi, Using Kernel RBD client with Kernel 4.03 (I have also tried some older kernels with the same effect) and IO is being split into smaller IO's which is having a negative impact on performance. cat /sys/block/sdc/queue/max_hw_sectors_kb 4096 cat /sys/block/rbd0/queue/max_sectors_kb 4096 Using DD dd if=/dev/rbd0 of=/dev/null bs=4M Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 0.00 0.00 201.500.00 25792.00 0.00 256.00 1.99 10.15 10.150.00 4.96 100.00 Using FIO with 4M blocks Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 0.00 0.00 232.000.00 118784.00 0.00 1024.00 11.29 48.58 48.580.00 4.31 100.00 Any ideas why IO sizes are limited to 128k (256 blocks) in DD's case and 512k in Fio's case? 128k vs 512k is probably buffered vs direct IO - add iflag=direct to your dd invocation. Yes, thanks for this, that was the case As for the 512k - I'm pretty sure it's a regression in our switch to blk-mq. I tested it around 3.18-3.19 and saw steady 4M IOs. I hope we are just missing a knob - I'll take a look. I've tested both 4.03 and 3.16 and both seem to be split into 512k. Let me know if you need me to test any other particular version. With 3.16 you are going to need to adjust max_hw_sectors_kb / max_sectors_kb as discussed in Dan's thread. The patch that fixed that in the block layer went into 3.19, blk-mq into 4.0 - try 3.19. Thanks, Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com