Re: [ceph-users] How to improve single thread sequential reads?

2015-08-18 Thread Wido den Hollander


On 18-08-15 12:25, Benedikt Fraunhofer wrote:
 Hi Nick,
 
 did you do anything fancy to get to ~90MB/s in the first place?
 I'm stuck at ~30MB/s reading cold data. single-threaded-writes are
 quite speedy, around 600MB/s.
 
 radosgw for cold data is around 90MB/s, which is imho limited by
 the speed of a single disk.
 
 Data already present in the OSDs' OS buffers arrives at around
 400-700MB/s, so I don't think the network is the culprit.
 
 (20 node cluster, 12x4TB 7.2k disks, 2 ssds for journals for 6 osds
 each, lacp 2x10g bonds)
 
 rados bench single-threaded performs equally badly, but with its default
 multithreaded settings it generates wonderful numbers, usually only
 limited by line rate and/or interrupts/s.
 
 I just gave kernel 4.0 with its rbd-blk-mq feature a shot, hoping to
 get to your wonderful numbers, but it's staying below 30 MB/s.
 
 I was thinking about using a software raid0 like you did but that's
 imho really ugly.
 When I knew I'd need something speedy, I usually just started dd-ing
 the file to /dev/null and waited about three minutes before
 starting the actual job; some sort of hand-made read-ahead for
 dummies.
 

It really depends on your situation, but you could also go for larger
objects than 4MB for specific block devices.

In a use case with a customer who reads large files single-threaded from
RBD block devices, we went with 64MB objects.

That improved our read performance in that case: we didn't have to
create a new TCP connection and talk to a new OSD every 4MB.

You could try that and see how it works out.
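
For reference, a minimal sketch of how such an image could be created. The
image name is made up, and with the rbd CLI of that era the object size is
given as a power-of-two "order" (22 = 4MB default, 26 = 64MB); some rbd
versions cap the maximum order they accept, so check yours first:

    # hypothetical staging image with 64MB objects instead of the default 4MB
    rbd create backup-staging --size 1048576 --order 26    # 1TB image, size in MB
    rbd info backup-staging                                # "order 26" confirms 64MB objects

Larger objects mean fewer object (and OSD) boundaries to cross on a long
sequential read, at the cost of coarser-grained placement and recovery.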

Wido

 Thx in advance
   Benedikt
 
 
 2015-08-17 13:29 GMT+02:00 Nick Fisk n...@fisk.me.uk:
 Thanks for the replies guys.

 The client is set to 4MB, I haven't played with the OSD side yet as I wasn't
 sure if it would make much difference, but I will give it a go. If the
 client is already passing a 4MB request down through to the OSD, will it be
 able to readahead any further? The next 4MB object in theory will be on
 another OSD and so I'm not sure if reading ahead any further on the OSD side
 would help.

 How I see the problem is that the RBD client will only read 1 OSD at a time
 as the RBD readahead can't be set any higher than max_hw_sectors_kb, which
 is the object size of the RBD. Please correct me if I'm wrong on this.

 If you could set the RBD readahead to much higher than the object size, then
 this would probably give the desired effect where the buffer could be
 populated by reading from several OSD's in advance to give much higher
 performance. That or wait for striping to appear in the Kernel client.

 I've also found that BareOS (a fork of Bacula) seems to have a direct RADOS
 feature that supports radosstriper. I might try this and see how it performs
 as well.


 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Somnath Roy
 Sent: 17 August 2015 03:36
 To: Alex Gorbachev a...@iss-integration.com; Nick Fisk n...@fisk.me.uk
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] How to improve single thread sequential reads?

 Have you tried setting read_ahead_kb to bigger number for both client/OSD
 side if you are using krbd ?
 In case of librbd, try the different config options for rbd cache..

 Thanks & Regards
 Somnath

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Alex Gorbachev
 Sent: Sunday, August 16, 2015 7:07 PM
 To: Nick Fisk
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] How to improve single thread sequential reads?

 Hi Nick,

 On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk n...@fisk.me.uk wrote:
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
 Of Nick Fisk
 Sent: 13 August 2015 18:04
 To: ceph-users@lists.ceph.com
 Subject: [ceph-users] How to improve single thread sequential reads?

 Hi,

 I'm trying to use a RBD to act as a staging area for some data before
 pushing
 it down to some LTO6 tapes. As I cannot use striping with the kernel
 client I
 tend to be maxing out at around 80MB/s reads testing with DD. Has
 anyone got any clever suggestions of giving this a bit of a boost, I
 think I need
 to get it
 up to around 200MB/s to make sure there is always a steady flow of
 data to the tape drive.

 I've just tried the testing kernel with the blk-mq fixes in it for
 full size IO's, this combined with bumping readahead up to 4MB, is now
 getting me on average 150MB/s to 200MB/s so this might suffice.

 On a personal interest, I would still like to know if anyone has ideas
 on how to really push much higher bandwidth through a RBD.

 Some settings in our ceph.conf that may help:

 osd_op_threads = 20
 osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k
 filestore_queue_max_ops = 9
 filestore_flusher = false
 filestore_max_sync_interval = 10
 filestore_sync_flush = false

 Regards,
 Alex



 Rbd-fuse seems to top out at 12MB/s, so there goes that option

Re: [ceph-users] How to improve single thread sequential reads?

2015-08-18 Thread Jan Schermer
I'm not sure if I missed that but are you testing in a VM backed by RBD device, 
or using the device directly?

I don't see how blk-mq would help if it's not a VM; it just passes the request 
to the underlying block device, and in the case of RBD there is no real block 
device from the host perspective...? Enlighten me if I'm wrong, please. I have 
some Ubuntu VMs that use blk-mq for virtio-blk devices, and it makes me cringe 
because I'm unable to tune the scheduler and it just makes no sense at all...?

Anyway, I'd try bumping up read_ahead_kb first, and max_hw_sectors_kb (to make 
sure it gets into readahead); also try (if you're not using blk-mq) switching to 
the cfq scheduler and setting rotational=1. I see you've also tried this, but I 
think blk-mq is the limiting factor here now.

If you are running a single-threaded benchmark like rados bench then what's 
limiting you is latency - it's not surprising it scales up with more threads.
It should run nicely with a real workload once readahead kicks in and the queue 
fills up. But again - I'm not sure how that works with blk-mq, and I've never 
used the RBD device directly (the kernel client). Does it show up in /sys/block? 
Can you dump the output of "find /sys/block/$rbd" in here?
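
(A rough sketch of the knobs being discussed - the device name and values are
only illustrative, assuming the image is mapped as /dev/rbd0 with krbd:)

    cat /sys/block/rbd0/queue/max_hw_sectors_kb      # ceiling reported by the driver
    echo 4096 > /sys/block/rbd0/queue/read_ahead_kb  # 4MB readahead
    echo cfq > /sys/block/rbd0/queue/scheduler       # only meaningful without blk-mq
    echo 1 > /sys/block/rbd0/queue/rotational
    find /sys/block/rbd0 -maxdepth 2                 # the dump being asked for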

Jan


 On 18 Aug 2015, at 12:25, Benedikt Fraunhofer 
 given.to.lists.ceph-users.ceph.com.toasta@traced.net wrote:
 
 Hi Nick,
 
 did you do anything fancy to get to ~90MB/s in the first place?
 I'm stuck at ~30MB/s reading cold data. single-threaded-writes are
 quite speedy, around 600MB/s.
 
 radosgw for cold data is around 90MB/s, which is imho limited by
 the speed of a single disk.
 
 Data already present in the OSDs' OS buffers arrives at around
 400-700MB/s, so I don't think the network is the culprit.
 
 (20 node cluster, 12x4TB 7.2k disks, 2 ssds for journals for 6 osds
 each, lacp 2x10g bonds)
 
 rados bench single-threaded performs equally badly, but with its default
 multithreaded settings it generates wonderful numbers, usually only
 limited by line rate and/or interrupts/s.
 
 I just gave kernel 4.0 with its rbd-blk-mq feature a shot, hoping to
 get to your wonderful numbers, but it's staying below 30 MB/s.
 
 I was thinking about using a software raid0 like you did but that's
 imho really ugly.
 When I knew I'd need something speedy, I usually just started dd-ing
 the file to /dev/null and waited about three minutes before
 starting the actual job; some sort of hand-made read-ahead for
 dummies.
 
 Thx in advance
  Benedikt
 
 
 2015-08-17 13:29 GMT+02:00 Nick Fisk n...@fisk.me.uk:
 Thanks for the replies guys.
 
 The client is set to 4MB, I haven't played with the OSD side yet as I wasn't
 sure if it would make much difference, but I will give it a go. If the
 client is already passing a 4MB request down through to the OSD, will it be
 able to readahead any further? The next 4MB object in theory will be on
 another OSD and so I'm not sure if reading ahead any further on the OSD side
 would help.
 
 How I see the problem is that the RBD client will only read 1 OSD at a time
 as the RBD readahead can't be set any higher than max_hw_sectors_kb, which
 is the object size of the RBD. Please correct me if I'm wrong on this.
 
 If you could set the RBD readahead to much higher than the object size, then
 this would probably give the desired effect where the buffer could be
 populated by reading from several OSD's in advance to give much higher
 performance. That or wait for striping to appear in the Kernel client.
 
 I've also found that BareOS (a fork of Bacula) seems to have a direct RADOS
 feature that supports radosstriper. I might try this and see how it performs
 as well.
 
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Somnath Roy
 Sent: 17 August 2015 03:36
 To: Alex Gorbachev a...@iss-integration.com; Nick Fisk n...@fisk.me.uk
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] How to improve single thread sequential reads?
 
 Have you tried setting read_ahead_kb to bigger number for both client/OSD
 side if you are using krbd ?
 In case of librbd, try the different config options for rbd cache..
 
  Thanks & Regards
 Somnath
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Alex Gorbachev
 Sent: Sunday, August 16, 2015 7:07 PM
 To: Nick Fisk
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] How to improve single thread sequential reads?
 
 Hi Nick,
 
 On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk n...@fisk.me.uk wrote:
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
 Of Nick Fisk
 Sent: 13 August 2015 18:04
 To: ceph-users@lists.ceph.com
 Subject: [ceph-users] How to improve single thread sequential reads?
 
 Hi,
 
 I'm trying to use a RBD to act as a staging area for some data before
 pushing
 it down to some LTO6 tapes. As I cannot use striping with the kernel
 client I
 tend to be maxing out at around 80MB/s reads

Re: [ceph-users] How to improve single thread sequential reads?

2015-08-18 Thread Nick Fisk


 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Jan Schermer
 Sent: 18 August 2015 11:50
 To: Benedikt Fraunhofer given.to.lists.ceph-
 users.ceph.com.toasta@traced.net
 Cc: ceph-users@lists.ceph.com; Nick Fisk n...@fisk.me.uk
 Subject: Re: [ceph-users] How to improve single thread sequential reads?
 
 I'm not sure if I missed that but are you testing in a VM backed by RBD
 device, or using the device directly?
 
 I don't see how blk-mq would help if it's not a VM, it just passes the request
 to the underlying block device, and in case of RBD there is no real block
 device from the host perspective...? Enlighten me if I'm wrong please. I have
 some Ubuntu VMs that use blk-mq for virtio-blk devices and makes me
 cringe because I'm unable to tune the scheduler and it just makes no sense
 at all...?

Since 4.0 (I think) the kernel RBD client uses the blk-mq infrastructure, but
there is a bug which limits max IO sizes to 128kb, which is why, for large
block/sequential workloads, that testing kernel is essential. I think the bug
fix should hopefully make it into 4.2.

 
 Anyway I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb (to
 make sure it gets into readahead), also try (if you're not using blk-mq) to a
 cfq scheduler and set it to rotational=1. I see you've also tried this, but I
 think blk-mq is the limiting factor here now.

I'm pretty sure you can't adjust max_hw_sectors_kb (which, from what I can
tell, equals the object size), and max_sectors_kb is already set to the
hardware max. It would sure be nice if max_hw_sectors_kb could be set higher,
but I'm not sure if there is a reason for this limit.
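
(For concreteness, these are the limits in question on a mapped image - /dev/rbd0
is assumed and the numbers are what Nick describes for a 4MB object size:)

    cat /sys/block/rbd0/queue/max_hw_sectors_kb   # 4096, i.e. the object size
    cat /sys/block/rbd0/queue/max_sectors_kb      # already equal to the line above
    blockdev --getra /dev/rbd0                    # current readahead, in 512-byte sectors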

 
 If you are running a single-threaded benchmark like rados bench then what's
 limiting you is latency - it's not surprising it scales up with more threads.

Agreed, but with sequential workloads, if you can get readahead working
properly then you can easily remove this limitation, as a single-threaded op
effectively becomes multithreaded.

 It should run nicely with a real workload once readahead kicks in and the
 queue fills up. But again - not sure how that works with blk-mq and I've
 never used the RBD device directly (the kernel client). Does it show in
 /sys/block ? Can you dump find /sys/block/$rbd in here?
 
 Jan
 
 
  On 18 Aug 2015, at 12:25, Benedikt Fraunhofer given.to.lists.ceph-
 users.ceph.com.toasta@traced.net wrote:
 
  Hi Nick,
 
  did you do anything fancy to get to ~90MB/s in the first place?
  I'm stuck at ~30MB/s reading cold data. single-threaded-writes are
  quite speedy, around 600MB/s.
 
  radosgw for cold data is around 90MB/s, which is imho limited by
  the speed of a single disk.
 
  Data already present in the OSDs' OS buffers arrives at around
  400-700MB/s, so I don't think the network is the culprit.
 
  (20 node cluster, 12x4TB 7.2k disks, 2 ssds for journals for 6 osds
  each, lacp 2x10g bonds)
 
  rados bench single-threaded performs equally badly, but with its default
  multithreaded settings it generates wonderful numbers, usually only
  limited by line rate and/or interrupts/s.
 
  I just gave kernel 4.0 with its rbd-blk-mq feature a shot, hoping to
  get to your wonderful numbers, but it's staying below 30 MB/s.
 
  I was thinking about using a software raid0 like you did but that's
  imho really ugly.
  When I knew I'd need something speedy, I usually just started dd-ing
  the file to /dev/null and waited about three minutes before
  starting the actual job; some sort of hand-made read-ahead for
  dummies.
 
  Thx in advance
   Benedikt
 
 
  2015-08-17 13:29 GMT+02:00 Nick Fisk n...@fisk.me.uk:
  Thanks for the replies guys.
 
  The client is set to 4MB, I haven't played with the OSD side yet as I
  wasn't sure if it would make much difference, but I will give it a
  go. If the client is already passing a 4MB request down through to
  the OSD, will it be able to readahead any further? The next 4MB
  object in theory will be on another OSD and so I'm not sure if
  reading ahead any further on the OSD side would help.
 
  How I see the problem is that the RBD client will only read 1 OSD at
  a time as the RBD readahead can't be set any higher than
  max_hw_sectors_kb, which is the object size of the RBD. Please correct
 me if I'm wrong on this.
 
  If you could set the RBD readahead to much higher than the object
  size, then this would probably give the desired effect where the
  buffer could be populated by reading from several OSD's in advance to
  give much higher performance. That or wait for striping to appear in
the
 Kernel client.
 
  I've also found that BareOS (a fork of Bacula) seems to have a direct
  RADOS feature that supports radosstriper. I might try this and see
  how it performs as well.
 
 
  -Original Message-
  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
  Behalf Of Somnath Roy
  Sent: 17 August 2015 03:36
  To: Alex Gorbachev a...@iss

Re: [ceph-users] How to improve single thread sequential reads?

2015-08-18 Thread Benedikt Fraunhofer
Hi Nick,

did you do anything fancy to get to ~90MB/s in the first place?
I'm stuck at ~30MB/s reading cold data. single-threaded-writes are
quite speedy, around 600MB/s.

radosgw for cold data is around 90MB/s, which is imho limited by
the speed of a single disk.

Data already present in the OSDs' OS buffers arrives at around
400-700MB/s, so I don't think the network is the culprit.

(20 node cluster, 12x4TB 7.2k disks, 2 ssds for journals for 6 osds
each, lacp 2x10g bonds)

rados bench single-threaded performs equally badly, but with its default
multithreaded settings it generates wonderful numbers, usually only
limited by line rate and/or interrupts/s.

I just gave kernel 4.0 with its rbd-blk-mq feature a shot, hoping to
get to your wonderful numbers, but it's staying below 30 MB/s.

I was thinking about using a software raid0 like you did but that's
imho really ugly.
When I knew I'd need something speedy, I usually just started dd-ing
the file to /dev/null and waited about three minutes before
starting the actual job; some sort of hand-made read-ahead for
dummies.

Thx in advance
  Benedikt


2015-08-17 13:29 GMT+02:00 Nick Fisk n...@fisk.me.uk:
 Thanks for the replies guys.

 The client is set to 4MB, I haven't played with the OSD side yet as I wasn't
 sure if it would make much difference, but I will give it a go. If the
 client is already passing a 4MB request down through to the OSD, will it be
 able to readahead any further? The next 4MB object in theory will be on
 another OSD and so I'm not sure if reading ahead any further on the OSD side
 would help.

 How I see the problem is that the RBD client will only read 1 OSD at a time
 as the RBD readahead can't be set any higher than max_hw_sectors_kb, which
 is the object size of the RBD. Please correct me if I'm wrong on this.

 If you could set the RBD readahead to much higher than the object size, then
 this would probably give the desired effect where the buffer could be
 populated by reading from several OSD's in advance to give much higher
 performance. That or wait for striping to appear in the Kernel client.

 I've also found that BareOS (a fork of Bacula) seems to have a direct RADOS
 feature that supports radosstriper. I might try this and see how it performs
 as well.


 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Somnath Roy
 Sent: 17 August 2015 03:36
 To: Alex Gorbachev a...@iss-integration.com; Nick Fisk n...@fisk.me.uk
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] How to improve single thread sequential reads?

 Have you tried setting read_ahead_kb to bigger number for both client/OSD
 side if you are using krbd ?
 In case of librbd, try the different config options for rbd cache..

 Thanks & Regards
 Somnath

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Alex Gorbachev
 Sent: Sunday, August 16, 2015 7:07 PM
 To: Nick Fisk
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] How to improve single thread sequential reads?

 Hi Nick,

 On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk n...@fisk.me.uk wrote:
  -Original Message-
  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
  Of Nick Fisk
  Sent: 13 August 2015 18:04
  To: ceph-users@lists.ceph.com
  Subject: [ceph-users] How to improve single thread sequential reads?
 
  Hi,
 
  I'm trying to use a RBD to act as a staging area for some data before
  pushing
  it down to some LTO6 tapes. As I cannot use striping with the kernel
  client I
  tend to be maxing out at around 80MB/s reads testing with DD. Has
  anyone got any clever suggestions of giving this a bit of a boost, I
  think I need
  to get it
  up to around 200MB/s to make sure there is always a steady flow of
  data to the tape drive.
 
  I've just tried the testing kernel with the blk-mq fixes in it for
  full size IO's, this combined with bumping readahead up to 4MB, is now
  getting me on average 150MB/s to 200MB/s so this might suffice.
 
  On a personal interest, I would still like to know if anyone has ideas
  on how to really push much higher bandwidth through a RBD.

 Some settings in our ceph.conf that may help:

 osd_op_threads = 20
 osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k
 filestore_queue_max_ops = 9
 filestore_flusher = false
 filestore_max_sync_interval = 10
 filestore_sync_flush = false

 Regards,
 Alex

 
 
  Rbd-fuse seems to top out at 12MB/s, so there goes that option.
 
  I'm thinking mapping multiple RBD's and then combining them into a
  mdadm
  RAID0 stripe might work, but seems a bit messy.
 
  Any suggestions?
 
  Thanks,
  Nick
 
 
 
 
 
 
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com

Re: [ceph-users] How to improve single thread sequential reads?

2015-08-18 Thread Jan Schermer
Reply in text

 On 18 Aug 2015, at 12:59, Nick Fisk n...@fisk.me.uk wrote:
 
 
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Jan Schermer
 Sent: 18 August 2015 11:50
 To: Benedikt Fraunhofer given.to.lists.ceph-
 users.ceph.com.toasta@traced.net
 Cc: ceph-users@lists.ceph.com; Nick Fisk n...@fisk.me.uk
 Subject: Re: [ceph-users] How to improve single thread sequential reads?
 
 I'm not sure if I missed that but are you testing in a VM backed by RBD
 device, or using the device directly?
 
 I don't see how blk-mq would help if it's not a VM, it just passes the
 request
 to the underlying block device, and in case of RBD there is no real block
 device from the host perspective...? Enlighten me if I'm wrong please. I
 have
 some Ubuntu VMs that use blk-mq for virtio-blk devices and makes me
 cringe because I'm unable to tune the scheduler and it just makes no sense
 at all...?
 
 Since 4.0 (I think) the Kernel RBD client now uses the blk-mq
 infrastructure, but there is a bug which limits max IO sizes to 128kb, which
 is why for large block/sequential that testing kernel is essential. I think
 this bug fix should make it to 4.2 hopefully.

blk-mq is supposed to remove redundancy of having

IO scheduler in VM -> VM block device -> host IO scheduler -> block device

it's a paravirtualized driver that just moves requests from inside the VM to 
the host queue (and this is why inside the VM you have no IO scheduler options 
- it effectively becomes noop).

But this just doesn't make sense if you're using qemu with librbd - there's no
host queue.
It would make sense if the qemu drive were a krbd device with a queue.

If there's no VM there should be no blk-mq?

So what was added to the kernel was probably the host-side infrastructure to
handle blk-mq in guest passthrough to the krbd device, but that's probably not
your case, is it?

 
 
 Anyway I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb (to
 make sure it gets into readahead), also try (if you're not using blk-mq)
 to a
 cfq scheduler and set it to rotational=1. I see you've also tried this,
 but I think
 blk-mq is the limiting factor here now.
 
 I'm pretty sure you can't adjust the max_hw_sectors_kb (which equals object
 size, from what I can tell) and the max_sectors_kb is already set at the
 hw_max. But it would sure be nice if the max_hw_sectors_kb could be set
 higher though, but I'm not sure if there is a reason for this limit.
 
 
 If you are running a single-threaded benchmark like rados bench then
 what's
 limiting you is latency - it's not surprising it scales up with more
 threads.
 
 Agreed, but with sequential workloads, if you can get readahead working
 properly then you can easily remove this limitation as a single threaded op
 effectively becomes multithreaded.

Thinking on this more - I don't know if this will help after all. It will still
be a single thread, just trying to get ahead of the client IO - and that's not
likely to happen unless the userspace consumer reads the data more slowly than
Ceph can deliver it...

I think striping across multiple devices could be the answer after all. But have
you tried creating the RBD volume as striped in Ceph?
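
(For reference, librbd "fancy" striping is chosen at image-creation time; a
sketch with made-up names and sizes - note that the kernel client of that era
could not map such images, so this only helps librbd/userspace consumers:)

    rbd create striped-staging --size 1048576 --image-format 2 \
        --stripe-unit 1048576 --stripe-count 8    # 1MB stripe unit spread over 8 objects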

 
 It should run nicely with a real workload once readahead kicks in and the
 queue fills up. But again - not sure how that works with blk-mq and I've
 never used the RBD device directly (the kernel client). Does it show in
 /sys/block ? Can you dump find /sys/block/$rbd in here?
 
 Jan
 
 
 On 18 Aug 2015, at 12:25, Benedikt Fraunhofer given.to.lists.ceph-
 users.ceph.com.toasta@traced.net wrote:
 
 Hi Nick,
 
 did you do anything fancy to get to ~90MB/s in the first place?
 I'm stuck at ~30MB/s reading cold data. single-threaded-writes are
 quite speedy, around 600MB/s.
 
 radosgw for cold data is around 90MB/s, which is imho limited by
 the speed of a single disk.
 
 Data already present in the OSDs' OS buffers arrives at around
 400-700MB/s, so I don't think the network is the culprit.
 
 (20 node cluster, 12x4TB 7.2k disks, 2 ssds for journals for 6 osds
 each, lacp 2x10g bonds)
 
 rados bench single-threaded performs equally badly, but with its default
 multithreaded settings it generates wonderful numbers, usually only
 limited by line rate and/or interrupts/s.
 
 I just gave kernel 4.0 with its rbd-blk-mq feature a shot, hoping to
 get to your wonderful numbers, but it's staying below 30 MB/s.
 
 I was thinking about using a software raid0 like you did but that's
 imho really ugly.
 When I knew I'd need something speedy, I usually just started dd-ing
 the file to /dev/null and waited about three minutes before
 starting the actual job; some sort of hand-made read-ahead for
 dummies.
 
 Thx in advance
 Benedikt
 
 
 2015-08-17 13:29 GMT+02:00 Nick Fisk n...@fisk.me.uk:
 Thanks for the replies guys.
 
 The client is set to 4MB, I haven't played with the OSD side yet as I
 wasn't sure if it would make much difference

Re: [ceph-users] How to improve single thread sequential reads?

2015-08-18 Thread Nick Fisk




 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Jan Schermer
 Sent: 18 August 2015 12:41
 To: Nick Fisk n...@fisk.me.uk
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] How to improve single thread sequential reads?
 
 Reply in text
 
  On 18 Aug 2015, at 12:59, Nick Fisk n...@fisk.me.uk wrote:
 
 
 
  -Original Message-
  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
  Of Jan Schermer
  Sent: 18 August 2015 11:50
  To: Benedikt Fraunhofer given.to.lists.ceph-
  users.ceph.com.toasta@traced.net
  Cc: ceph-users@lists.ceph.com; Nick Fisk n...@fisk.me.uk
  Subject: Re: [ceph-users] How to improve single thread sequential
reads?
 
  I'm not sure if I missed that but are you testing in a VM backed by
  RBD device, or using the device directly?
 
  I don't see how blk-mq would help if it's not a VM, it just passes
  the
  request
  to the underlying block device, and in case of RBD there is no real
  block device from the host perspective...? Enlighten me if I'm wrong
  please. I
  have
  some Ubuntu VMs that use blk-mq for virtio-blk devices and makes me
  cringe because I'm unable to tune the scheduler and it just makes no
  sense at all...?
 
  Since 4.0 (I think) the Kernel RBD client now uses the blk-mq
  infrastructure, but there is a bug which limits max IO sizes to 128kb,
  which is why for large block/sequential that testing kernel is
  essential. I think this bug fix should make it to 4.2 hopefully.
 
 blk-mq is supposed to remove redundancy of having
 
 IO scheduler in VM - VM block device - host IO scheduler - block device
 
 it's a paravirtualized driver that just moves requests from inside the VM to
 the host queue (and this is why inside the VM you have no IO scheduler
 options - it effectively becomes noop).
 
 But this just doesn't make sense if you're using qemu with librbd - there's
 no host queue.
 It would make sense if the qemu drive was a krbd device with a queue.
 
 If there's no VM there should be no blk-mq?

I think you might be thinking about the virtio-blk driver for blk-mq. Blk-mq
itself seems to be a lot more about enhancing the overall block layer
performance in Linux:

https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mechanism_(blk-mq)



 
 So what was added to the kernel was probably the host-side infrastructure
 to handle blk-mq in guest passthrough to the krbd device, but that's
 probably not your case, is it?
 
 
 
  Anyway I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb
  (to make sure it gets into readahead), also try (if you're not using
  blk-mq)
  to a
  cfq scheduler and set it to rotational=1. I see you've also tried
  this,
  but I think
  blk-mq is the limiting factor here now.
 
  I'm pretty sure you can't adjust the max_hw_sectors_kb (which equals
  object size, from what I can tell) and the max_sectors_kb is already
  set at the hw_max. But it would sure be nice if the max_hw_sectors_kb
  could be set higher though, but I'm not sure if there is a reason for
this
 limit.
 
 
  If you are running a single-threaded benchmark like rados bench then
  what's
  limiting you is latency - it's not surprising it scales up with more
  threads.
 
  Agreed, but with sequential workloads, if you can get readahead
  working properly then you can easily remove this limitation as a
  single threaded op effectively becomes multithreaded.
 
 Thinking on this more - I don't know if this will help after all, it will
 still be a single thread, just trying to get ahead of the client IO - and
 that's not likely to happen unless you can read the data in userspace slower
 than what Ceph can read...
 
 I think striping multiple devices could be the answer after all. But have you
 tried creating the RBD volume as striped in Ceph?

Yes, striping would probably give amazing performance, but the kernel client
currently doesn't support it, which leaves us in the position of trying to
find workarounds to boost performance.

Although the client read is single-threaded, the RBD/RADOS layer would split
these larger readahead IOs into 4MB requests that would then be processed in
parallel by the OSDs. This is much the same way sequential access
performance varies with a RAID array: if your IO size matches the stripe
size of the array then you get nearly the bandwidth of all the disks involved.
I think in Ceph the effective stripe size is the object size * the number of OSDs.

 
 
  It should run nicely with a real workload once readahead kicks in and
  the queue fills up. But again - not sure how that works with blk-mq
  and I've never used the RBD device directly (the kernel client). Does
  it show in /sys/block ? Can you dump find /sys/block/$rbd in here?
 
  Jan
 
 
  On 18 Aug 2015, at 12:25, Benedikt Fraunhofer given.to.lists.ceph-
  users.ceph.com.toasta@traced.net wrote:
 
  Hi Nick,
 
  did you do anything fancy to get to ~90MB/s in the first place?
  I'm stuck

Re: [ceph-users] How to improve single thread sequential reads?

2015-08-18 Thread Jan Schermer

 On 18 Aug 2015, at 13:58, Nick Fisk n...@fisk.me.uk wrote:
 
 
 
 
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Jan Schermer
 Sent: 18 August 2015 12:41
 To: Nick Fisk n...@fisk.me.uk
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] How to improve single thread sequential reads?
 
 Reply in text
 
 On 18 Aug 2015, at 12:59, Nick Fisk n...@fisk.me.uk wrote:
 
 
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
 Of Jan Schermer
 Sent: 18 August 2015 11:50
 To: Benedikt Fraunhofer given.to.lists.ceph-
 users.ceph.com.toasta@traced.net
 Cc: ceph-users@lists.ceph.com; Nick Fisk n...@fisk.me.uk
 Subject: Re: [ceph-users] How to improve single thread sequential
 reads?
 
 I'm not sure if I missed that but are you testing in a VM backed by
 RBD device, or using the device directly?
 
 I don't see how blk-mq would help if it's not a VM, it just passes
 the
 request
 to the underlying block device, and in case of RBD there is no real
 block device from the host perspective...? Enlighten me if I'm wrong
 please. I
 have
 some Ubuntu VMs that use blk-mq for virtio-blk devices and makes me
 cringe because I'm unable to tune the scheduler and it just makes no
 sense at all...?
 
 Since 4.0 (I think) the Kernel RBD client now uses the blk-mq
 infrastructure, but there is a bug which limits max IO sizes to 128kb,
 which is why for large block/sequential that testing kernel is
 essential. I think this bug fix should make it to 4.2 hopefully.
 
 blk-mq is supposed to remove redundancy of having
 
 IO scheduler in VM - VM block device - host IO scheduler - block device
 
 it's a paravirtualized driver that just moves requests from inside the VM
 to
 the host queue (and this is why inside the VM you have no IO scheduler
 options - it effectively becomes noop).
 
 But this just doesn't make sense if you're using qemu with librbd -
 there's no
 host queue.
 It would make sense if the qemu drive was krbd device with a queue.
 
 If there's no VM there should be no blk-mq?
 
 I think you might be thinking about the virtio-blk driver for blk-mq. Blk-mq
 itself seems to be a lot more about enhancing the overall block layer
 performance in Linux
 
 https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mechanism_(blk-mq)
 
 
 
 
 So what was added to the kernel was probably the host-side infrastructure
 to handle blk-mq in guest passthrough to the krdb device, but that's
 probably
 not your case, is it?
 
 
 
 Anyway I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb
 (to make sure it gets into readahead), also try (if you're not using
 blk-mq)
 to a
 cfq scheduler and set it to rotational=1. I see you've also tried
 this,
 but I think
 blk-mq is the limiting factor here now.
 
 I'm pretty sure you can't adjust the max_hw_sectors_kb (which equals
 object size, from what I can tell) and the max_sectors_kb is already
 set at the hw_max. But it would sure be nice if the max_hw_sectors_kb
 could be set higher though, but I'm not sure if there is a reason for
 this
 limit.
 
 
 If you are running a single-threaded benchmark like rados bench then
 what's
 limiting you is latency - it's not surprising it scales up with more
 threads.
 
 Agreed, but with sequential workloads, if you can get readahead
 working properly then you can easily remove this limitation as a
 single threaded op effectively becomes multithreaded.
 
 Thinking on this more - I don't know if this will help after all, it will
 still be a
 single thread, just trying to get ahead of the client IO - and that's not
 likely to
 happen unless you can read the data in userspace slower than what Ceph
 can read...
 
 I think striping multiple device could be the answer after all. But have
 you
 tried creating the RBD volume as striped in Ceph?
 
 Yes striping would probably give amazing performance, but the kernel client
 currently doesn't support it, which leaves us in the position of trying to
 find work arounds to boost performance.
 
 Although the client read is single threaded, the RBD/RADOS layer would split
 these larger readahead IOs into 4MB requests that would then be processed in
 parallel by the OSD's. This is much the same way as sequential access
 performance varies with a RAID array. If your IO size matches the stripe
 size of the array then you get nearly the bandwidth of all disks involved. I
 think in Ceph the effective stripe size is the object size * #OSDs.
 

Hmmm...

RBD -> PG -> objects

stripe_unit (more commonly called stride) bytes are put into stripe_count 
objects - not OSDs, but it's possible you'll hit all OSDs with a small enough 
stride and a large enough stripe_count... 
I have no idea how well that works in practice on current Ceph releases, my 
Dumpling experience is probably useless here.

So we're back at striping with mdraid I guess ... :)
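
(That "ugly but it works" variant would look roughly like this, assuming four
images vol0..vol3 already exist - the names, device numbers and chunk size are
made up:)

    rbd map vol0 && rbd map vol1 && rbd map vol2 && rbd map vol3
    mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=4096 \
          /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3
    blockdev --setra 65536 /dev/md0        # readahead now spans all four images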

 
 
 
 It should run nicely with a real workload

Re: [ceph-users] How to improve single thread sequential reads?

2015-08-17 Thread Nick Fisk
Thanks for the replies guys.

The client is set to 4MB, I haven't played with the OSD side yet as I wasn't
sure if it would make much difference, but I will give it a go. If the
client is already passing a 4MB request down through to the OSD, will it be
able to readahead any further? The next 4MB object in theory will be on
another OSD and so I'm not sure if reading ahead any further on the OSD side
would help.

How I see the problem is that the RBD client will only read from one OSD at a
time, as the RBD readahead can't be set any higher than max_hw_sectors_kb, which
is the object size of the RBD. Please correct me if I'm wrong on this.

If you could set the RBD readahead much higher than the object size, then
this would probably give the desired effect, where the buffer could be
populated by reading from several OSDs in advance to give much higher
performance. That, or wait for striping to appear in the kernel client.

I've also found that BareOS (a fork of Bacula) seems to have a direct RADOS
feature that supports radosstriper. I might try this and see how it performs
as well.


 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Somnath Roy
 Sent: 17 August 2015 03:36
 To: Alex Gorbachev a...@iss-integration.com; Nick Fisk n...@fisk.me.uk
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] How to improve single thread sequential reads?
 
 Have you tried setting read_ahead_kb to bigger number for both client/OSD
 side if you are using krbd ?
 In case of librbd, try the different config options for rbd cache..
 
  Thanks & Regards
 Somnath
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Alex Gorbachev
 Sent: Sunday, August 16, 2015 7:07 PM
 To: Nick Fisk
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] How to improve single thread sequential reads?
 
 Hi Nick,
 
 On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk n...@fisk.me.uk wrote:
  -Original Message-
  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
  Of Nick Fisk
  Sent: 13 August 2015 18:04
  To: ceph-users@lists.ceph.com
  Subject: [ceph-users] How to improve single thread sequential reads?
 
  Hi,
 
  I'm trying to use a RBD to act as a staging area for some data before
  pushing
  it down to some LTO6 tapes. As I cannot use striping with the kernel
  client I
  tend to be maxing out at around 80MB/s reads testing with DD. Has
  anyone got any clever suggestions of giving this a bit of a boost, I
  think I need
  to get it
  up to around 200MB/s to make sure there is always a steady flow of
  data to the tape drive.
 
  I've just tried the testing kernel with the blk-mq fixes in it for
  full size IO's, this combined with bumping readahead up to 4MB, is now
  getting me on average 150MB/s to 200MB/s so this might suffice.
 
  On a personal interest, I would still like to know if anyone has ideas
  on how to really push much higher bandwidth through a RBD.
 
 Some settings in our ceph.conf that may help:
 
 osd_op_threads = 20
 osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k
  filestore_queue_max_ops = 9
  filestore_flusher = false
  filestore_max_sync_interval = 10
  filestore_sync_flush = false
 
 Regards,
 Alex
 
 
 
  Rbd-fuse seems to top out at 12MB/s, so there goes that option.
 
  I'm thinking mapping multiple RBD's and then combining them into a
  mdadm
  RAID0 stripe might work, but seems a bit messy.
 
  Any suggestions?
 
  Thanks,
  Nick
 
 
 
 
 
 
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to improve single thread sequential reads?

2015-08-16 Thread Alex Gorbachev
Hi Nick,

On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk n...@fisk.me.uk wrote:
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Nick Fisk
 Sent: 13 August 2015 18:04
 To: ceph-users@lists.ceph.com
 Subject: [ceph-users] How to improve single thread sequential reads?

 Hi,

 I'm trying to use a RBD to act as a staging area for some data before
 pushing
 it down to some LTO6 tapes. As I cannot use striping with the kernel
 client I
 tend to be maxing out at around 80MB/s reads testing with DD. Has anyone
 got any clever suggestions of giving this a bit of a boost, I think I need
 to get it
 up to around 200MB/s to make sure there is always a steady flow of data to
 the tape drive.

 I've just tried the testing kernel with the blk-mq fixes in it for full size
 IO's, this combined with bumping readahead up to 4MB, is now getting me on
 average 150MB/s to 200MB/s so this might suffice.

 On a personal interest, I would still like to know if anyone has ideas on
 how to really push much higher bandwidth through a RBD.

Some settings in our ceph.conf that may help:

osd_op_threads = 20
osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k
filestore_queue_max_ops = 9
filestore_flusher = false
filestore_max_sync_interval = 10
filestore_sync_flush = false

Regards,
Alex



 Rbd-fuse seems to top out at 12MB/s, so there goes that option.

 I'm thinking mapping multiple RBD's and then combining them into a mdadm
 RAID0 stripe might work, but seems a bit messy.

 Any suggestions?

 Thanks,
 Nick







 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to improve single thread sequential reads?

2015-08-16 Thread Somnath Roy
Have you tried setting read_ahead_kb to a bigger number on both the client and 
OSD side if you are using krbd?
In the case of librbd, try the different config options for rbd cache.
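
(For the librbd case, the relevant ceph.conf options look roughly like this -
the option names are the librbd cache/readahead settings of that era and the
values are illustrative only:)

    [client]
        rbd cache = true
        rbd cache size = 268435456             # 256MB
        rbd readahead max bytes = 4194304      # 4MB librbd readahead
        rbd readahead disable after bytes = 0  # don't switch readahead off after the default 50MB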

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Alex 
Gorbachev
Sent: Sunday, August 16, 2015 7:07 PM
To: Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] How to improve single thread sequential reads?

Hi Nick,

On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk n...@fisk.me.uk wrote:
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
 Of Nick Fisk
 Sent: 13 August 2015 18:04
 To: ceph-users@lists.ceph.com
 Subject: [ceph-users] How to improve single thread sequential reads?

 Hi,

 I'm trying to use a RBD to act as a staging area for some data before
 pushing
 it down to some LTO6 tapes. As I cannot use striping with the kernel
 client I
 tend to be maxing out at around 80MB/s reads testing with DD. Has
 anyone got any clever suggestions of giving this a bit of a boost, I
 think I need
 to get it
 up to around 200MB/s to make sure there is always a steady flow of
 data to the tape drive.

 I've just tried the testing kernel with the blk-mq fixes in it for
 full size IO's, this combined with bumping readahead up to 4MB, is now
 getting me on average 150MB/s to 200MB/s so this might suffice.

 On a personal interest, I would still like to know if anyone has ideas
 on how to really push much higher bandwidth through a RBD.

Some settings in our ceph.conf that may help:

osd_op_threads = 20
osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k
filestore_queue_max_ops = 9
filestore_flusher = false
filestore_max_sync_interval = 10
filestore_sync_flush = false

Regards,
Alex



 Rbd-fuse seems to top out at 12MB/s, so there goes that option.

 I'm thinking mapping multiple RBD's and then combining them into a
 mdadm
 RAID0 stripe might work, but seems a bit messy.

 Any suggestions?

 Thanks,
 Nick







 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to improve single thread sequential reads?

2015-08-13 Thread Nick Fisk
Hi,

 

I'm trying to use an RBD as a staging area for some data before pushing it
down to some LTO6 tapes. As I cannot use striping with the kernel client, I
tend to max out at around 80MB/s reads when testing with dd. Has anyone got
any clever suggestions for giving this a bit of a boost? I think I need to
get it up to around 200MB/s to make sure there is always a steady flow of
data to the tape drive.

 

Rbd-fuse seems to top out at 12MB/s, so there goes that option.

 

I'm thinking that mapping multiple RBDs and then combining them into an mdadm
RAID0 stripe might work, but it seems a bit messy.

 

Any suggestions?

 

Thanks,

Nick




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to improve single thread sequential reads?

2015-08-13 Thread Nick Fisk
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Nick Fisk
 Sent: 13 August 2015 18:04
 To: ceph-users@lists.ceph.com
 Subject: [ceph-users] How to improve single thread sequential reads?
 
 Hi,
 
 I'm trying to use a RBD to act as a staging area for some data before
pushing
 it down to some LTO6 tapes. As I cannot use striping with the kernel
client I
 tend to be maxing out at around 80MB/s reads testing with DD. Has anyone
 got any clever suggestions of giving this a bit of a boost, I think I need
to get it
 up to around 200MB/s to make sure there is always a steady flow of data to
 the tape drive.

I've just tried the testing kernel with the blk-mq fixes in it for full-size
IOs; this, combined with bumping readahead up to 4MB, is now getting me
150MB/s to 200MB/s on average, so this might suffice.

Out of personal interest, I would still like to know if anyone has ideas on
how to really push much higher bandwidth through an RBD.

 
 Rbd-fuse seems to top out at 12MB/s, so there goes that option.
 
 I'm thinking mapping multiple RBD's and then combining them into a mdadm
 RAID0 stripe might work, but seems a bit messy.
 
 Any suggestions?
 
 Thanks,
 Nick
 


 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com