Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-11 Thread Nick Fisk
> -Original Message-
> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> Sent: 11 September 2016 03:17
> To: Nick Fisk <n...@fisk.me.uk>
> Cc: Wilhelm Redbrake <w...@globe.de>; Horace Ng <hor...@hkisl.net>; 
> ceph-users <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Confirming again much better performance with ESXi and NFS on RBD using the 
> XFS hint Nick uses, below.

Cool, I never experimented with different extent sizes, so I don't know if 
there is any performance/fragmentation benefit with larger/smaller values. I 
think storage vmotions might benefit from using striped RBDs with rbd-nbd, as 
this might get around the PG contention issues with 32 concurrent writes to the 
same PG. I want to test this out at some point.
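
For anyone wanting to try the same thing, a rough sketch of what I mean (image 
name and numbers are only placeholders; stripe-unit is in bytes):

rbd create rbd/vmware-nfs01 --size 2048000 --stripe-unit 1048576 --stripe-count 8
rbd-nbd map rbd/vmware-nfs01    # exposes the striped image as /dev/nbdX

With an 8-way stripe, consecutive 1MB chunks land on different objects (and so 
different PGs), instead of 32 copy threads all queuing on the same 4MB object.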

> 
> I saw high load averages on the NFS server nodes, corresponding to iowait; 
> it does not seem to cause too much trouble so far.

Yeah I get this as well, but I think this is just a side effect of having a 
storage backend that can support a high queue depth. Every IO in flight will 
increase the load by 1. However, despite what it looks like in top, it doesn't 
actually consume any CPU, so it shouldn't cause any problems.
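
A quick way to sanity-check that (just a sketch, adjust to taste) is to count 
the threads sitting in uninterruptible sleep, since every D-state nfsd/IO thread 
adds 1 to the load average without using any CPU:

ps -eo state,comm | awk '$1=="D"' | sort | uniq -c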

> 
> Here are HDtune Pro testing results from some recent runs.  The puzzling part 
> is better random IO performance with 16 mb object size
> on both iSCSI and NFS.  To my thinking this should be slower; however, this 
> has been confirmed by the timed vmotion tests and more
> random IO tests by my coworker as well:
> 
> Test_type   read MB/s  write MB/s  read iops  write iops  read multi iops  write multi iops
> NFS 1mb     460        103         8753       66          47466            1616
> NFS 4mb     441        147         8863       82          47556            764
> iSCSI 1mb   117        76          326        90          672              938
> iSCSI 4mb   275        60          205        24          2015             1212
> NFS 16mb    455        177         7761       119         36403            3175
> iSCSI 16mb  300        65          1117       237         12389            1826
> 
> ( prettier view at
> http://storcium.blogspot.com/2016/09/latest-tests-on-nfs-vs.html )

Interesting. Are you pre-conditioning the RBDs before these tests? The only 
logical thing I can think of is that if you are writing to a new area of the 
RBD, it has to create the objects as it goes; larger objects would therefore 
need fewer object creations per MB.
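
If not, it may be worth filling the image once before benchmarking so that all 
the objects already exist. Something along these lines (device name is just an 
example for a mapped RBD):

dd if=/dev/zero of=/dev/rbd0 bs=4M oflag=direct
# or, with a bit of parallelism:
fio --name=precondition --filename=/dev/rbd0 --rw=write --bs=4M --iodepth=8 \
    --ioengine=libaio --direct=1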

> 
> Alex
> 
> >
> > From: Alex Gorbachev [mailto:a...@iss-integration.com]
> > Sent: 04 September 2016 04:45
> > To: Nick Fisk <n...@fisk.me.uk>
> > Cc: Wilhelm Redbrake <w...@globe.de>; Horace Ng <hor...@hkisl.net>;
> > ceph-users <ceph-users@lists.ceph.com>
> > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >
> >
> >
> >
> >
> > On Saturday, September 3, 2016, Alex Gorbachev <a...@iss-integration.com> 
> > wrote:
> >
> > Hi Nick,
> >
> > On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk <n...@fisk.me.uk> wrote:
> >
> > From: Alex Gorbachev [mailto:a...@iss-integration.com]
> > Sent: 21 August 2016 15:27
> > To: Wilhelm Redbrake <w...@globe.de>
> > Cc: n...@fisk.me.uk; Horace Ng <hor...@hkisl.net>; ceph-users
> > <ceph-users@lists.ceph.com>
> > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >
> >
> >
> >
> >
> > On Sunday, August 21, 2016, Wilhelm Redbrake <w...@globe.de> wrote:
> >
> > Hi Nick,
> > I understand all of your technical improvements.
> > But why do you not use, for example, a simple Areca RAID controller with 8 
> > GB cache and BBU on top in every Ceph node?
> > Configure n times RAID 0 on the controller and enable write-back cache.
> > That must be a latency "killer" like in all the proprietary storage arrays, 
> > no?
> >
> > Best Regards !!
> >
> >
> >
> > What we saw specifically with Areca cards is that performance is excellent 
> > in benchmarking and for bursty loads. However, once we
> started loading with more constant workloads (we replicate databases and 
> files to our Ceph cluster), this looks to have saturated the
> relatively small Areca NVDIMM caches and we went back to pure drive based 
> performance.
> >
> >
> >
> > Yes, I think that is a valid point. Although low latency, you are still 
> > having to write to the disks twice (journal+data), so once the
> caches on the cards start filling up, you are going to hit problems.
> >
> >
> >
> >
> >
> > So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3 
> > HDDs) in hopes that it would help reduce the noisy
> neighbor impact. That worked, but now the overall latency is really high at 
> times, not always. Red Ha

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-11 Thread Alex Gorbachev
--
Alex Gorbachev
Storcium

On Sun, Sep 11, 2016 at 12:54 PM, Nick Fisk <n...@fisk.me.uk> wrote:

>
>
>
>
> *From:* Alex Gorbachev [mailto:a...@iss-integration.com]
> *Sent:* 11 September 2016 16:14
>
> *To:* Nick Fisk <n...@fisk.me.uk>
> *Cc:* Wilhelm Redbrake <w...@globe.de>; Horace Ng <hor...@hkisl.net>;
> ceph-users <ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
> On Sun, Sep 4, 2016 at 4:48 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>
>
>
>
>
> *From:* Alex Gorbachev [mailto:a...@iss-integration.com]
> *Sent:* 04 September 2016 04:45
> *To:* Nick Fisk <n...@fisk.me.uk>
> *Cc:* Wilhelm Redbrake <w...@globe.de>; Horace Ng <hor...@hkisl.net>;
> ceph-users <ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
>
> On Saturday, September 3, 2016, Alex Gorbachev <a...@iss-integration.com>
> wrote:
>
> Hi Nick,
>
> On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>
> *From:* Alex Gorbachev [mailto:a...@iss-integration.com]
> *Sent:* 21 August 2016 15:27
> *To:* Wilhelm Redbrake <w...@globe.de>
> *Cc:* n...@fisk.me.uk; Horace Ng <hor...@hkisl.net>; ceph-users <
> ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
> On Sunday, August 21, 2016, Wilhelm Redbrake <w...@globe.de> wrote:
>
> Hi Nick,
> I understand all of your technical improvements.
> But why do you not use, for example, a simple Areca RAID controller with 8
> GB cache and BBU on top in every Ceph node?
> Configure n times RAID 0 on the controller and enable write-back cache.
> That must be a latency "killer" like in all the proprietary storage arrays,
> no?
>
> Best Regards !!
>
>
>
> What we saw specifically with Areca cards is that performance is excellent
> in benchmarking and for bursty loads. However, once we started loading with
> more constant workloads (we replicate databases and files to our Ceph
> cluster), this looks to have saturated the relatively small Areca NVDIMM
> caches and we went back to pure drive based performance.
>
>
>
> Yes, I think that is a valid point. Although low latency, you are still
> having to write to the disks twice (journal+data), so once the caches on
> the cards start filling up, you are going to hit problems.
>
>
>
>
>
> So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per
> 3 HDDs) in hopes that it would help reduce the noisy neighbor impact. That
> worked, but now the overall latency is really high at times, not always.
> Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS
> drives with too many IOPS, which get their latency sky high. Overall we are
> functioning fine, but I sure would like storage vmotion and other large
> operations faster.
>
>
>
>
>
> Yeah this is the biggest pain point I think. Normal VM ops are fine, but
> if you ever have to move a multi-TB VM, it’s just too slow.
>
>
>
> If you use iscsi with vaai and are migrating a thick provisioned vmdk,
> then performance is actually quite good, as the block sizes used for the
> copy are a lot bigger.
>
>
>
> However, my use case required thin provisioned VM’s + snapshots and I
> found that using iscsi you have no control over the fragmentation of the
> vmdk’s and so the read performance is then what suffers (certainly with
> 7.2k disks)
>
>
>
> Also with thin provisioned vmdk’s I think I was seeing PG contention with
> the updating of the VMFS metadata, although I can’t be sure.
>
>
>
>
>
> I am thinking I will test a few different schedulers and readahead
> settings to see if we can improve this by parallelizing reads. Also will
> test NFS, but need to determine whether to do krbd/knfsd or something more
> interesting like CephFS/Ganesha.
>
>
>
> As you know I’m on NFS now. I’ve found it a lot easier to get going and a
> lot less sensitive to making config adjustments without suddenly everything
> dropping offline. The fact that you can specify the extent size on XFS
> helps massively with using thin vmdks/snapshots to avoid fragmentation.
> Storage v-motions are a bit faster than iscsi, but I think I am hitting PG
> contention when esxi tries to write 32 copy threads to the same object.
> Th

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-11 Thread Nick Fisk
 

 

From: Alex Gorbachev [mailto:a...@iss-integration.com] 
Sent: 11 September 2016 16:14
To: Nick Fisk <n...@fisk.me.uk>
Cc: Wilhelm Redbrake <w...@globe.de>; Horace Ng <hor...@hkisl.net>; ceph-users 
<ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

 

On Sun, Sep 4, 2016 at 4:48 PM, Nick Fisk <n...@fisk.me.uk> wrote:

 

 

From: Alex Gorbachev [mailto:a...@iss-integration.com]
Sent: 04 September 2016 04:45
To: Nick Fisk <n...@fisk.me.uk>
Cc: Wilhelm Redbrake <w...@globe.de>; Horace Ng <hor...@hkisl.net>; ceph-users 
<ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

 


On Saturday, September 3, 2016, Alex Gorbachev <a...@iss-integration.com> wrote:

Hi Nick,

On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk <n...@fisk.me.uk> wrote:

From: Alex Gorbachev [mailto:a...@iss-integration.com] 
Sent: 21 August 2016 15:27
To: Wilhelm Redbrake <w...@globe.de>
Cc: n...@fisk.me.uk; Horace Ng <hor...@hkisl.net>; ceph-users 
<ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 



On Sunday, August 21, 2016, Wilhelm Redbrake <w...@globe.de> wrote:

Hi Nick,
I understand all of your technical improvements.
But why do you not use, for example, a simple Areca RAID controller with 8 GB 
cache and BBU on top in every Ceph node?
Configure n times RAID 0 on the controller and enable write-back cache.
That must be a latency "killer" like in all the proprietary storage arrays, no?

Best Regards !!

 

What we saw specifically with Areca cards is that performance is excellent in 
benchmarking and for bursty loads. However, once we started loading with more 
constant workloads (we replicate databases and files to our Ceph cluster), this 
looks to have saturated the relatively small Areca NVDIMM caches and we went 
back to pure drive based performance. 

 

Yes, I think that is a valid point. Although low latency, you are still having 
to write to the disks twice (journal+data), so once the caches on the cards 
start filling up, you are going to hit problems.

 

 

So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3 
HDDs) in hopes that it would help reduce the noisy neighbor impact. That 
worked, but now the overall latency is really high at times, not always. Red 
Hat engineer suggested this is due to loading the 7200 rpm NL-SAS drives with 
too many IOPS, which get their latency sky high. Overall we are functioning 
fine, but I sure would like storage vmotion and other large operations faster. 

 

 

Yeah this is the biggest pain point I think. Normal VM ops are fine, but if you 
ever have to move a multi-TB VM, it’s just too slow. 

 

If you use iscsi with vaai and are migrating a thick provisioned vmdk, then 
performance is actually quite good, as the block sizes used for the copy are a 
lot bigger. 

 

However, my use case required thin provisioned VM’s + snapshots and I found 
that using iscsi you have no control over the fragmentation of the vmdk’s and 
so the read performance is then what suffers (certainly with 7.2k disks)

 

Also with thin provisioned vmdk’s I think I was seeing PG contention with the 
updating of the VMFS metadata, although I can’t be sure.

 

 

I am thinking I will test a few different schedulers and readahead settings to 
see if we can improve this by parallelizing reads. Also will test NFS, but need 
to determine whether to do krbd/knfsd or something more interesting like 
CephFS/Ganesha. 
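
For reference, the sort of knobs I have in mind (device names and values are 
only examples to experiment with, not recommendations):

cat /sys/block/sdb/queue/scheduler               # current scheduler on an OSD data disk
echo deadline > /sys/block/sdb/queue/scheduler
echo 4096 > /sys/block/rbd0/queue/read_ahead_kb  # larger readahead on the mapped RBD (KB)
blockdev --getra /dev/rbd0                       # same setting, shown in 512-byte sectors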

 

As you know I’m on NFS now. I’ve found it a lot easier to get going and a lot 
less sensitive to making config adjustments without suddenly everything 
dropping offline. The fact that you can specify the extent size on XFS helps 
massively with using thin vmdks/snapshots to avoid fragmentation. Storage 
v-motions are a bit faster than iscsi, but I think I am hitting PG contention 
when esxi tries to write 32 copy threads to the same object. There is probably 
some tuning that could be done here (RBD striping???) but this is the best it’s 
been for a long time and I’m reluctant to fiddle any further.
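
For completeness, this is roughly how the extent size hint can be applied (the 
values and paths are only illustrative, not necessarily what I run):

mkfs.xfs -f -d extszinherit=4096 /dev/rbd0       # hint in fs blocks: 4096 x 4KB = 16MB
# or later, per directory, so new vmdks created under it inherit the hint:
xfs_io -c "extsize 16m" /srv/nfs/datastore1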

 

We have moved ahead and added NFS support to Storcium, and are now able to run 
NFS servers with Pacemaker in HA mode (all agents are public at 
https://github.com/akurz/resource-agents/tree/master/heartbeat).  I can confirm 
that VM performance is definitely better and benchm

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-11 Thread Alex Gorbachev
On Sun, Sep 4, 2016 at 4:48 PM, Nick Fisk <n...@fisk.me.uk> wrote:

>
>
>
>
> *From:* Alex Gorbachev [mailto:a...@iss-integration.com]
> *Sent:* 04 September 2016 04:45
> *To:* Nick Fisk <n...@fisk.me.uk>
> *Cc:* Wilhelm Redbrake <w...@globe.de>; Horace Ng <hor...@hkisl.net>;
> ceph-users <ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
> On Saturday, September 3, 2016, Alex Gorbachev <a...@iss-integration.com>
> wrote:
>
> Hi Nick,
>
> On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>
> *From:* Alex Gorbachev [mailto:a...@iss-integration.com]
> *Sent:* 21 August 2016 15:27
> *To:* Wilhelm Redbrake <w...@globe.de>
> *Cc:* n...@fisk.me.uk; Horace Ng <hor...@hkisl.net>; ceph-users <
> ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
> On Sunday, August 21, 2016, Wilhelm Redbrake <w...@globe.de> wrote:
>
> Hi Nick,
> I understand all of your technical improvements.
> But why do you not use, for example, a simple Areca RAID controller with 8
> GB cache and BBU on top in every Ceph node?
> Configure n times RAID 0 on the controller and enable write-back cache.
> That must be a latency "killer" like in all the proprietary storage arrays,
> no?
>
> Best Regards !!
>
>
>
> What we saw specifically with Areca cards is that performance is excellent
> in benchmarking and for bursty loads. However, once we started loading with
> more constant workloads (we replicate databases and files to our Ceph
> cluster), this looks to have saturated the relatively small Areca NVDIMM
> caches and we went back to pure drive based performance.
>
>
>
> Yes, I think that is a valid point. Although low latency, you are still
> having to write to the disks twice (journal+data), so once the caches on
> the cards start filling up, you are going to hit problems.
>
>
>
>
>
> So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per
> 3 HDDs) in hopes that it would help reduce the noisy neighbor impact. That
> worked, but now the overall latency is really high at times, not always.
> Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS
> drives with too many IOPS, which get their latency sky high. Overall we are
> functioning fine, but I sure would like storage vmotion and other large
> operations faster.
>
>
>
>
>
> Yeah this is the biggest pain point I think. Normal VM ops are fine, but
> if you ever have to move a multi-TB VM, it’s just too slow.
>
>
>
> If you use iscsi with vaai and are migrating a thick provisioned vmdk,
> then performance is actually quite good, as the block sizes used for the
> copy are a lot bigger.
>
>
>
> However, my use case required thin provisioned VM’s + snapshots and I
> found that using iscsi you have no control over the fragmentation of the
> vmdk’s and so the read performance is then what suffers (certainly with
> 7.2k disks)
>
>
>
> Also with thin provisioned vmdk’s I think I was seeing PG contention with
> the updating of the VMFS metadata, although I can’t be sure.
>
>
>
>
>
> I am thinking I will test a few different schedulers and readahead
> settings to see if we can improve this by parallelizing reads. Also will
> test NFS, but need to determine whether to do krbd/knfsd or something more
> interesting like CephFS/Ganesha.
>
>
>
> As you know I’m on NFS now. I’ve found it a lot easier to get going and a
> lot less sensitive to making config adjustments without suddenly everything
> dropping offline. The fact that you can specify the extent size on XFS
> helps massively with using thin vmdks/snapshots to avoid fragmentation.
> Storage v-motions are a bit faster than iscsi, but I think I am hitting PG
> contention when esxi tries to write 32 copy threads to the same object.
> There is probably some tuning that could be done here (RBD striping???) but
> this is the best it’s been for a long time and I’m reluctant to fiddle any
> further.
>
>
>
> We have moved ahead and added NFS support to Storcium, and are now able to run
> NFS servers with Pacemaker in HA mode (all agents are public at
> https://github.com/akurz/resource-agents/tree/master/heartbeat).
> I can confirm that VM performance is definitely better and benchmarks are
> more smooth (in Windows we can see a lot of chop

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-10 Thread Alex Gorbachev
Confirming again much better performance with ESXi and NFS on RBD
using the XFS hint Nick uses, below.

I saw high load averages on the NFS server nodes, corresponding to
iowait; it does not seem to cause too much trouble so far.

Here are HDtune Pro testing results from some recent runs.  The
puzzling part is better random IO performance with 16 mb object size
on both iSCSI and NFS.  To my thinking this should be slower; however,
this has been confirmed by the timed vmotion tests and more random IO
tests by my coworker as well:

Test_type read MB/s write MB/s read iops write iops read multi iops
write multi iops
NFS 1mb 460 103 8753 66 47466 1616
NFS 4mb 441 147 8863 82 47556 764
iSCSI 1mb 117 76 326 90 672 938
iSCSI 4mb 275 60 205 24 2015 1212
NFS 16mb 455 177 7761 119 36403 3175
iSCSI 16mb 300 65 1117 237 12389 1826

( prettier view at
http://storcium.blogspot.com/2016/09/latest-tests-on-nfs-vs.html )
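
For anyone wanting to reproduce the 16mb case, the object size is picked at 
image creation time, e.g. something like this (name and size are just examples):

rbd create rbd/hdtune-test --size 102400 --order 24   # order 24 = 16MB objects (22 = 4MB default)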

Alex

>
> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> Sent: 04 September 2016 04:45
> To: Nick Fisk <n...@fisk.me.uk>
> Cc: Wilhelm Redbrake <w...@globe.de>; Horace Ng <hor...@hkisl.net>; 
> ceph-users <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
> On Saturday, September 3, 2016, Alex Gorbachev <a...@iss-integration.com> 
> wrote:
>
> Hi Nick,
>
> On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>
> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> Sent: 21 August 2016 15:27
> To: Wilhelm Redbrake <w...@globe.de>
> Cc: n...@fisk.me.uk; Horace Ng <hor...@hkisl.net>; ceph-users 
> <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
> On Sunday, August 21, 2016, Wilhelm Redbrake <w...@globe.de> wrote:
>
> Hi Nick,
> I understand all of your technical improvements.
> But why do you not use, for example, a simple Areca RAID controller with 8 GB 
> cache and BBU on top in every Ceph node?
> Configure n times RAID 0 on the controller and enable write-back cache.
> That must be a latency "killer" like in all the proprietary storage arrays, no?
>
> Best Regards !!
>
>
>
> What we saw specifically with Areca cards is that performance is excellent in 
> benchmarking and for bursty loads. However, once we started loading with more 
> constant workloads (we replicate databases and files to our Ceph cluster), 
> this looks to have saturated the relatively small Areca NVDIMM caches and we 
> went back to pure drive based performance.
>
>
>
> Yes, I think that is a valid point. Although low latency, you are still 
> having to write to the disks twice (journal+data), so once the caches on the 
> cards start filling up, you are going to hit problems.
>
>
>
>
>
> So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3 
> HDDs) in hopes that it would help reduce the noisy neighbor impact. That 
> worked, but now the overall latency is really high at times, not always. Red 
> Hat engineer suggested this is due to loading the 7200 rpm NL-SAS drives with 
> too many IOPS, which get their latency sky high. Overall we are functioning 
> fine, but I sure would like storage vmotion and other large operations faster.
>
>
>
>
>
> Yeah this is the biggest pain point I think. Normal VM ops are fine, but if 
> you ever have to move a multi-TB VM, it’s just too slow.
>
>
>
> If you use iscsi with vaai and are migrating a thick provisioned vmdk, then 
> performance is actually quite good, as the block sizes used for the copy are 
> a lot bigger.
>
>
>
> However, my use case required thin provisioned VM’s + snapshots and I found 
> that using iscsi you have no control over the fragmentation of the vmdk’s and 
> so the read performance is then what suffers (certainly with 7.2k disks)
>
>
>
> Also with thin provisioned vmdk’s I think I was seeing PG contention with the 
> updating of the VMFS metadata, although I can’t be sure.
>
>
>
>
>
> I am thinking I will test a few different schedulers and readahead settings 
> to see if we can improve this by parallelizing reads. Also will test NFS, but 
> need to determine whether to do krbd/knfsd or something more interesting like 
> CephFS/Ganesha.
>
>
>
> As you know I’m on NFS now. I’ve found it a lot easier to get going and a lot 
> less sensitive to making config adjustments without suddenly everything 
> dropping offline. The fact that you can specify the extent size on XFS helps 
> massively with using thin vmdks/snapshots to avoid fragmentation. Storage 
> v-motions are a bit faster than iscsi, but I think I am hitting PG contention 

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-04 Thread Nick Fisk
 

 

From: Alex Gorbachev [mailto:a...@iss-integration.com] 
Sent: 04 September 2016 04:45
To: Nick Fisk <n...@fisk.me.uk>
Cc: Wilhelm Redbrake <w...@globe.de>; Horace Ng <hor...@hkisl.net>; ceph-users 
<ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 



On Saturday, September 3, 2016, Alex Gorbachev <a...@iss-integration.com> wrote:

Hi Nick,

On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk <n...@fisk.me.uk> wrote:

From: Alex Gorbachev [mailto:a...@iss-integration.com]
Sent: 21 August 2016 15:27
To: Wilhelm Redbrake <w...@globe.de>
Cc: n...@fisk.me.uk; Horace Ng <hor...@hkisl.net>; ceph-users 
<ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 



On Sunday, August 21, 2016, Wilhelm Redbrake <w...@globe.de> wrote:

Hi Nick,
I understand all of your technical improvements.
But why do you not use, for example, a simple Areca RAID controller with 8 GB 
cache and BBU on top in every Ceph node?
Configure n times RAID 0 on the controller and enable write-back cache.
That must be a latency "killer" like in all the proprietary storage arrays, no?

Best Regards !!

 

What we saw specifically with Areca cards is that performance is excellent in 
benchmarking and for bursty loads. However, once we started loading with more 
constant workloads (we replicate databases and files to our Ceph cluster), this 
looks to have saturated the relatively small Areca NVDIMM caches and we went 
back to pure drive based performance. 

 

Yes, I think that is a valid point. Although low latency, you are still having 
to write to the disks twice (journal+data), so once the caches on the cards 
start filling up, you are going to hit problems.

 

 

So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3 
HDDs) in hopes that it would help reduce the noisy neighbor impact. That 
worked, but now the overall latency is really high at times, not always. Red 
Hat engineer suggested this is due to loading the 7200 rpm NL-SAS drives with 
too many IOPS, which get their latency sky high. Overall we are functioning 
fine, but I sure would like storage vmotion and other large operations faster. 

 

 

Yeah this is the biggest pain point I think. Normal VM ops are fine, but if you 
ever have to move a multi-TB VM, it’s just too slow. 

 

If you use iscsi with vaai and are migrating a thick provisioned vmdk, then 
performance is actually quite good, as the block sizes used for the copy are a 
lot bigger. 

 

However, my use case required thin provisioned VM’s + snapshots and I found 
that using iscsi you have no control over the fragmentation of the vmdk’s and 
so the read performance is then what suffers (certainly with 7.2k disks)

 

Also with thin provisioned vmdk’s I think I was seeing PG contention with the 
updating of the VMFS metadata, although I can’t be sure.

 

 

I am thinking I will test a few different schedulers and readahead settings to 
see if we can improve this by parallelizing reads. Also will test NFS, but need 
to determine whether to do krbd/knfsd or something more interesting like 
CephFS/Ganesha. 

 

As you know I’m on NFS now. I’ve found it a lot easier to get going and a lot 
less sensitive to making config adjustments without suddenly everything 
dropping offline. The fact that you can specify the extent size on XFS helps 
massively with using thin vmdks/snapshots to avoid fragmentation. Storage 
v-motions are a bit faster than iscsi, but I think I am hitting PG contention 
when esxi tries to write 32 copy threads to the same object. There is probably 
some tuning that could be done here (RBD striping???) but this is the best it’s 
been for a long time and I’m reluctant to fiddle any further.

 

We have moved ahead and added NFS support to Storcium, and are now able to run 
NFS servers with Pacemaker in HA mode (all agents are public at 
https://github.com/akurz/resource-agents/tree/master/heartbeat).  I can confirm 
that VM performance is definitely better and benchmarks are more smooth (in 
Windows we can see a lot of choppiness with iSCSI, NFS is choppy on writes, but 
smooth on reads, likely due to the bursty nature of OSD filesystems when 
dealing with that small IO size).

 

Were you using extsz=16384 at creation time for the filesystem?  I saw kernel 
memory deadlock messages during vmotion, such as:

 

 XFS: nfsd(102545) possible memory allocation deadlock size 40320 i

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-03 Thread Alex Gorbachev
On Saturday, September 3, 2016, Alex Gorbachev <a...@iss-integration.com>
wrote:

> Hi Nick,
>
> On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>
>> *From:* Alex Gorbachev [mailto:a...@iss-integration.com]
>> *Sent:* 21 August 2016 15:27
>> *To:* Wilhelm Redbrake <w...@globe.de>
>> *Cc:* n...@fisk.me.uk; Horace Ng <hor...@hkisl.net>; ceph-users <
>> ceph-users@lists.ceph.com>
>> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>
>>
>>
>>
>>
>> On Sunday, August 21, 2016, Wilhelm Redbrake <w...@globe.de> wrote:
>>
>> Hi Nick,
>> I understand all of your technical improvements.
>> But why do you not use, for example, a simple Areca RAID controller with 8
>> GB cache and BBU on top in every Ceph node?
>> Configure n times RAID 0 on the controller and enable write-back cache.
>> That must be a latency "killer" like in all the proprietary storage arrays,
>> no?
>>
>> Best Regards !!
>>
>>
>>
>> What we saw specifically with Areca cards is that performance is
>> excellent in benchmarking and for bursty loads. However, once we started
>> loading with more constant workloads (we replicate databases and files to
>> our Ceph cluster), this looks to have saturated the relatively small Areca
>> NVDIMM caches and we went back to pure drive based performance.
>>
>>
>>
>> Yes, I think that is a valid point. Although low latency, you are still
>> having to write to the disks twice (journal+data), so once the caches on
>> the cards start filling up, you are going to hit problems.
>>
>>
>>
>>
>>
>> So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per
>> 3 HDDs) in hopes that it would help reduce the noisy neighbor impact. That
>> worked, but now the overall latency is really high at times, not always.
>> Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS
>> drives with too many IOPS, which get their latency sky high. Overall we are
>> functioning fine, but I sure would like storage vmotion and other large
>> operations faster.
>>
>>
>>
>>
>>
>> Yeah this is the biggest pain point I think. Normal VM ops are fine, but
>> if you ever have to move a multi-TB VM, it’s just too slow.
>>
>>
>>
>> If you use iscsi with vaai and are migrating a thick provisioned vmdk,
>> then performance is actually quite good, as the block sizes used for the
>> copy are a lot bigger.
>>
>>
>>
>> However, my use case required thin provisioned VM’s + snapshots and I
>> found that using iscsi you have no control over the fragmentation of the
>> vmdk’s and so the read performance is then what suffers (certainly with
>> 7.2k disks)
>>
>>
>>
>> Also with thin provisioned vmdk’s I think I was seeing PG contention with
>> the updating of the VMFS metadata, although I can’t be sure.
>>
>>
>>
>>
>>
>> I am thinking I will test a few different schedulers and readahead
>> settings to see if we can improve this by parallelizing reads. Also will
>> test NFS, but need to determine whether to do krbd/knfsd or something more
>> interesting like CephFS/Ganesha.
>>
>>
>>
>> As you know I’m on NFS now. I’ve found it a lot easier to get going and a
>> lot less sensitive to making config adjustments without suddenly everything
>> dropping offline. The fact that you can specify the extent size on XFS
>> helps massively with using thin vmdks/snapshots to avoid fragmentation.
>> Storage v-motions are a bit faster than iscsi, but I think I am hitting PG
>> contention when esxi tries to write 32 copy threads to the same object.
>> There is probably some tuning that could be done here (RBD striping???) but
>> this is the best it’s been for a long time and I’m reluctant to fiddle any
>> further.
>>
>
> We have moved ahead and added NFS support to Storcium, and are now able to run
> NFS servers with Pacemaker in HA mode (all agents are public at
> https://github.com/akurz/resource-agents/tree/master/hear

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-03 Thread Alex Gorbachev
Hi Nick,

On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk <n...@fisk.me.uk> wrote:

> *From:* Alex Gorbachev [mailto:a...@iss-integration.com]
> *Sent:* 21 August 2016 15:27
> *To:* Wilhelm Redbrake <w...@globe.de>
> *Cc:* n...@fisk.me.uk; Horace Ng <hor...@hkisl.net>; ceph-users <
> ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
> On Sunday, August 21, 2016, Wilhelm Redbrake <w...@globe.de> wrote:
>
> Hi Nick,
> I understand all of your technical improvements.
> But why do you not use, for example, a simple Areca RAID controller with 8
> GB cache and BBU on top in every Ceph node?
> Configure n times RAID 0 on the controller and enable write-back cache.
> That must be a latency "killer" like in all the proprietary storage arrays,
> no?
>
> Best Regards !!
>
>
>
> What we saw specifically with Areca cards is that performance is excellent
> in benchmarking and for bursty loads. However, once we started loading with
> more constant workloads (we replicate databases and files to our Ceph
> cluster), this looks to have saturated the relatively small Areca NVDIMM
> caches and we went back to pure drive based performance.
>
>
>
> Yes, I think that is a valid point. Although low latency, you are still
> having to write to the disks twice (journal+data), so once the caches on
> the cards start filling up, you are going to hit problems.
>
>
>
>
>
> So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per
> 3 HDDs) in hopes that it would help reduce the noisy neighbor impact. That
> worked, but now the overall latency is really high at times, not always.
> Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS
> drives with too many IOPS, which get their latency sky high. Overall we are
> functioning fine, but I sure would like storage vmotion and other large
> operations faster.
>
>
>
>
>
> Yeah this is the biggest pain point I think. Normal VM ops are fine, but
> if you ever have to move a multi-TB VM, it’s just too slow.
>
>
>
> If you use iscsi with vaai and are migrating a thick provisioned vmdk,
> then performance is actually quite good, as the block sizes used for the
> copy are a lot bigger.
>
>
>
> However, my use case required thin provisioned VM’s + snapshots and I
> found that using iscsi you have no control over the fragmentation of the
> vmdk’s and so the read performance is then what suffers (certainly with
> 7.2k disks)
>
>
>
> Also with thin provisioned vmdk’s I think I was seeing PG contention with
> the updating of the VMFS metadata, although I can’t be sure.
>
>
>
>
>
> I am thinking I will test a few different schedulers and readahead
> settings to see if we can improve this by parallelizing reads. Also will
> test NFS, but need to determine whether to do krbd/knfsd or something more
> interesting like CephFS/Ganesha.
>
>
>
> As you know I’m on NFS now. I’ve found it a lot easier to get going and a
> lot less sensitive to making config adjustments without suddenly everything
> dropping offline. The fact that you can specify the extent size on XFS
> helps massively with using thin vmdks/snapshots to avoid fragmentation.
> Storage v-motions are a bit faster than iscsi, but I think I am hitting PG
> contention when esxi tries to write 32 copy threads to the same object.
> There is probably some tuning that could be done here (RBD striping???) but
> this is the best it’s been for a long time and I’m reluctant to fiddle any
> further.
>

We have moved ahead and added NFS support to Storcium, and are now able to run
NFS servers with Pacemaker in HA mode (all agents are public at
https://github.com/akurz/resource-agents/tree/master/heartbeat).  I can
confirm that VM performance is definitely better and benchmarks are smoother
(in Windows we can see a lot of choppiness with iSCSI; NFS is choppy on writes
but smooth on reads, likely due to the bursty nature of OSD filesystems when
dealing with that small IO size).
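
For anyone curious, the general shape of such a Pacemaker setup, using the stock
heartbeat agents rather than our Storcium-specific ones, and with made-up names,
paths and addresses (mapping/unmapping the RBD itself is handled by a separate
agent):

pcs resource create nfs_fs ocf:heartbeat:Filesystem \
    device=/dev/rbd/rbd/nfsvol directory=/export/nfsvol fstype=xfs
pcs resource create nfs_export ocf:heartbeat:exportfs \
    clientspec=10.0.0.0/24 directory=/export/nfsvol fsid=1 options=rw,no_root_squash
pcs resource create nfs_vip ocf:heartbeat:IPaddr2 ip=10.0.0.50 cidr_netmask=24
pcs resource group add g_nfs nfs_fs nfs_export nfs_vip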

Were you using extsz=16384 at creation time for the filesystem?  I saw
kernel memory deadlock messages during vmotion, such as:

 XFS: nfsd(102545) possible memory allocation deadlock size 40320 in
kmem_alloc (mode:0x2400240)

And analyzing fragmentation:

root@roc-5r-scd218:~# xfs_db -r /dev/rbd21
xfs_db> frag -d
actual 0, ideal 0, fragmentation factor 0.00%
xfs_db> frag -f
actual 1863960, ideal 74, fragmentation factor 100.00%

Just from two vmotions.

Are you seeing anything similar?

Thank you,
Alex


>
>
> But as mentioned above, thick vmdk’s with vaai might be a really good fit.
>
>
>
> Thanks for your very valuable info on analysis and hw build.
>
&

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-31 Thread Nick Fisk
From: w...@globe.de [mailto:w...@globe.de] 
Sent: 31 August 2016 08:56
To: n...@fisk.me.uk; 'Alex Gorbachev' <a...@iss-integration.com>; 'Horace Ng' 
<hor...@hkisl.net>
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Nick,

what do you think about Infiniband?

I have read that with Infiniband the latency is around 1.2 µs

It’s great, but I don’t believe the Ceph support for RDMA is finished yet, so 
you are stuck using IPoIB, which has similar performance to 10G Ethernet.

For now concentrate on removing latency where you easily can (3.5+ Ghz CPU’s, 
NVME journals) and then when stuff like RDMA comes along, you will be in a 
better place to take advantage of it.
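
One related easy win on the CPU side, just as a general tip and assuming a
recent Intel box, is making sure the cores are not being clocked down by the
governor:

cpupower frequency-set -g performance
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # should now read "performance"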

 

Kind Regards!

 

On 31.08.16 at 09:51, Nick Fisk wrote:

 

 

From: w...@globe.de [mailto:w...@globe.de]
Sent: 30 August 2016 18:40
To: n...@fisk.me.uk; 'Alex Gorbachev' <a...@iss-integration.com>
Cc: 'Horace Ng' <hor...@hkisl.net>
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Hi Nick,

here are my answers and questions...

 

On 30.08.16 at 19:05, Nick Fisk wrote:

 

 

From: w...@globe.de [mailto:w...@globe.de]
Sent: 30 August 2016 08:48
To: n...@fisk.me.uk; 'Alex Gorbachev' <a...@iss-integration.com>
Cc: 'Horace Ng' <hor...@hkisl.net>
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Hi Nick, Hi Alex,

Nick: I've got my 600GB SAS HP drives.

Performance is not good, so I don't paste the results here...

 

Another general thing: I've built Samsung SM863 enterprise SSDs into the Ceph 
cluster.

If I do a 4k test on the SSD directly, without a filesystem, I get

(see Sebastien Han's tests)

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

 

dd if=/dev/zero of=/dev/sdd bs=4k count=1000000 oflag=direct,dsync
1000000+0 records in
1000000+0 records out
4096000000 bytes (4,1 GB, 3,8 GiB) copied, 52,7139 s, 77,7 MB/s

77000/4 = ~19000 IOPS

 

If I format the device with XFS, I get:

mkfs.xfs -f /dev/sdd

mount /dev/sdd /mnt

cd /mnt

dd if=/dev/zero of=/mnt/test.txt bs=4k count=100000 oflag=direct,dsync
100000+0 records in
100000+0 records out
409600000 bytes (410 MB, 391 MiB) copied, 21,1856 s, 19,3 MB/s

19300/4 = ~5000 IOPS
I know once you have a FS on the device it will slow down due to the extra 
journal writes, maybe this is a little more than expected here…but still 
reasonably fast. Can you see in iostat how many IO’s the device is doing during 
this test?



watch iostat -dmx -t -y 1 1 /dev/sde

Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sde   0,00 0,000,00 9625,00 0,0025,85 5,50 
0,600,060,000,06   0,06  59,60




So there seems to be an extra delay somewhere when writing via a FS instead of 
raw device. You are still getting around 10,000 iops though, so not too bad.







If I use the SSD in the Ceph cluster and do the test again with rados bench, 
bs=4K and -t 1 (one thread), I get only 2-3 MByte/s

2500/4 = ~600 IOPS

My question is: how can it be that the raw device performance is so much 
higher than the XFS and Ceph RBD performance?

Ceph will be a lot slower as you are replacing a 30cm SAS/SATA cable with 
networking, software and also doing replication. You have at least 2 network 
hops with Ceph. For a slightly fairer test set replication to 1x.
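
For example, something along these lines on a test pool you don't care about:

ceph osd pool set rbd size 1
ceph osd pool set rbd min_size 1
rados bench -p rbd 60 write -b 4096 -t 1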


Replication 3x:
rados bench -p rbd 60 write -b 4k -t 1
Invalid value for block-size: The option value '4k' seems to be invalid
root@ceph-mon-1:~# rados bench -p rbd 60 write -b 4K -t 1
Maintaining 1 concurrent writes of 4096 bytes to objects of size 4096 for up to 
60 seconds or 0 objects
Object prefix: benchmark_data_ceph-mon-1_30407
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
0   0 0 0 0 0   -   0
1   1   402   4011.5661   1.56641  0.00226091  0.00248929
2   1   775   774   1.51142   1.45703   0.0021945  0.00258187
3   1  1110  1109   1.44374   1.30859  0.00278291  0.00

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-31 Thread Nick Fisk
 

 

From: w...@globe.de [mailto:w...@globe.de] 
Sent: 30 August 2016 18:40
To: n...@fisk.me.uk; 'Alex Gorbachev' <a...@iss-integration.com>
Cc: 'Horace Ng' <hor...@hkisl.net>
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Hi Nick,

here are my answers and questions...

 

On 30.08.16 at 19:05, Nick Fisk wrote:

 

 

From: w...@globe.de [mailto:w...@globe.de]
Sent: 30 August 2016 08:48
To: n...@fisk.me.uk; 'Alex Gorbachev' <a...@iss-integration.com>
Cc: 'Horace Ng' <hor...@hkisl.net>
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Hi Nick, Hi Alex,

Nick: I've got my 600GB SAS HP drives.

Performance is not good, so I don't paste the results here...

 

Another general thing: I've built Samsung SM863 enterprise SSDs into the Ceph 
cluster.

If I do a 4k test on the SSD directly, without a filesystem, I get

(see Sebastien Han's tests)

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

 

dd if=/dev/zero of=/dev/sdd bs=4k count=1000000 oflag=direct,dsync
1000000+0 records in
1000000+0 records out
4096000000 bytes (4,1 GB, 3,8 GiB) copied, 52,7139 s, 77,7 MB/s

77000/4 = ~19000 IOPS

 

If I format the device with XFS, I get:

mkfs.xfs -f /dev/sdd

mount /dev/sdd /mnt

cd /mnt

dd if=/dev/zero of=/mnt/test.txt bs=4k count=100000 oflag=direct,dsync
100000+0 records in
100000+0 records out
409600000 bytes (410 MB, 391 MiB) copied, 21,1856 s, 19,3 MB/s

19300/4 = ~5000 IOPS
I know once you have a FS on the device it will slow down due to the extra 
journal writes, maybe this is a little more than expected here…but still 
reasonably fast. Can you see in iostat how many IO’s the device is doing during 
this test?



watch iostat -dmx -t -y 1 1 /dev/sde

Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sde   0,00 0,000,00 9625,00 0,0025,85 5,50 
0,600,060,000,06   0,06  59,60



So there seems to be an extra delay somewhere when writing via a FS instead of 
raw device. You are still getting around 10,000 iops though, so not too bad.






If I use the SSD in the Ceph cluster and do the test again with rados bench, 
bs=4K and -t 1 (one thread), I get only 2-3 MByte/s

2500/4 = ~600 IOPS

My question is: how can it be that the raw device performance is so much 
higher than the XFS and Ceph RBD performance?

Ceph will be a lot slower as you are replacing a 30cm SAS/SATA cable with 
networking, software and also doing replication. You have at least 2 network 
hops with Ceph. For a slightly fairer test set replication to 1x.


Replication 3x:
rados bench -p rbd 60 write -b 4k -t 1
Invalid value for block-size: The option value '4k' seems to be invalid
root@ceph-mon-1:~# rados bench -p rbd 60 write -b 4K -t 1
Maintaining 1 concurrent writes of 4096 bytes to objects of size 4096 for up to 
60 seconds or 0 objects
Object prefix: benchmark_data_ceph-mon-1_30407
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
0   0 0 0 0 0   -   0
1   1   402   4011.5661   1.56641  0.00226091  0.00248929
2   1   775   774   1.51142   1.45703   0.0021945  0.00258187
3   1  1110  1109   1.44374   1.30859  0.00278291  0.00270182
4   1  1421  1420   1.38647   1.21484  0.00199578  0.00281537
5   1  1731  1730   1.35132   1.21094  0.00219136  0.00288843
6   1  2044  2043   1.32985   1.22266   0.0023981  0.00293468
7   1  2351  2350   1.31116   1.19922  0.00258856  0.00296963
8   1  2703  2702   1.31911 1.375   0.0224678  0.00295862
9   1  2955  2954   1.28191  0.984375  0.00841621  0.00304526
   10   1  3228  3227   1.26034   1.06641  0.00261023  0.00309665
   11   1  3501  35001.2427   1.06641  0.00659853  0.00313985
   12   1  3791  3790   1.23353   1.13281   0.0027244  0.00316168
   13   1  4150  4149   1.24649   1.40234  0.00262242  0.00313177
   14   1  4460  4459   1.24394   1.21094  0.00262075  0.00313735
   15   1  4721  4720   1.22897   1.01953  0.00239961  0.00317357
   16   1  4983  4982   1.21611   1.02344  0.00290526  0.00321005
   17   1  5279  5278   1.21258   1.15625  0.00252002   0.0032196
   18  

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-22 Thread Christian Balzer


Hello,

On Mon, 22 Aug 2016 20:34:54 +0100 Nick Fisk wrote:

> > -Original Message-
> > From: Christian Balzer [mailto:ch...@gol.com]
> > Sent: 22 August 2016 03:00
> > To: 'ceph-users' <ceph-users@lists.ceph.com>
> > Cc: Nick Fisk <n...@fisk.me.uk>
> > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> > 
> > 
> > Hello,
> > 
> > On Sun, 21 Aug 2016 09:57:40 +0100 Nick Fisk wrote:
> > 
> > >
> > >
> > > > -Original Message-
> > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > > Behalf Of Christian Balzer
> > > > Sent: 21 August 2016 09:32
> > > > To: ceph-users <ceph-users@lists.ceph.com>
> > > > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> > > >
> > > >
> > > > Hello,
> > > >
> > > > On Sun, 21 Aug 2016 09:16:33 +0100 Brian :: wrote:
> > > >
> > > > > Hi Nick
> > > > >
> > > > > Interested in this comment - "-Dual sockets are probably bad and
> > > > > will impact performance."
> > > > >
> > > > > Have you got real world experience of this being the case?
> > > > >
> > > > Well, Nick wrote "probably".
> > > >
> > > > Dual sockets and thus NUMA, the need for CPUs to talk to each other
> > > > and share information certainly can impact things that are
> > > very
> > > > time critical.
> > > > How much though is a question of design, both HW and SW.
> > >
> > > There was a guy from Redhat (sorry his name escapes me now) a few
> > > months ago on the performance weekly meeting. He was analysing the CPU
> > > cache miss effects with Ceph and it looked like a NUMA setup was
> > > having quite a severe impact on some things. To be honest a lot of it
> > > went over my head, but I came away from it with a general feeling that
> > > if you can get the required performance from 1 socket, then that is 
> > > probably a better bet. This includes only populating a single
> > socket in a dual socket system. There was also a Ceph tech talk at the 
> > start of the year (High perf databases on Ceph) where the guy
> > presenting was also recommending only populating 1 socket for latency 
> > reasons.
> > >
> > I wonder how complete their testing was and how much manual tuning they 
> > tried.
> > As in:
> > 
> > 1. Was irqbalance running?
> > Because it and the normal kernel strategies clash beautifully.
> > Irqbalance moves stuff around, the kernel tries to move things close to 
> > where the IRQs are, cat and mouse.
> > 
> > 2. Did they try with manual IRQ pinning?
> > I do, not that it's critical with my Ceph nodes, but on other machines it 
> > can make a LOT of difference.
> > Like keeping the cores near (or at least on the same NUMA node) as the 
> > network IRQs reserved for KVM vhost processes.
> > 
> > 3. Did they try pining Ceph OSD processes?
> > While this may certainly help (and make things more predictable when the 
> > load gets high), as I said above the kernel normally does a
> > pretty good job of NOT moving things around and keeping processes close to 
> > the resources they need.
> > 
> 
> From what I remember I think they went to pretty long lengths to tune things. 
> I think one point was that if you have a 40Gb NIC on one socket and an NVMe on 
> another, no matter where the process runs, you are going to have a lot of 
> traffic crossing between the sockets.

Traffic yes, complete process migrations hopefully not.
But anyway, yes, that's to be expected.

And also unavoidable if you want/need to utilize the whole capabilities
and PCIe lanes of a dual socket motherboard.
And in some cases (usually not with Ceph/OSDs), the IRQ load really will
benefit from more cores to play with.
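
For the record, the kind of manual pinning I mean, with interface names, IRQ and
core numbers purely as examples (one OSD shown; repeat per OSD/core set):

systemctl stop irqbalance                    # stop it fighting the manual layout
grep eth0 /proc/interrupts                   # find the NIC queue IRQ numbers
echo 4 > /proc/irq/123/smp_affinity          # hex bitmask: pin IRQ 123 to CPU2
taskset -pc 2-5 $(pidof ceph-osd | awk '{print $1}')   # pin one OSD to cores 2-5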

> 
> Here is the DB on Ceph one
> 
> http://ceph-users.ceph.narkive.com/1sj4VI4U/ceph-tech-talk-high-performance-production-databases-on-ceph

Thanks!
Yeah, basically confirms what I know/said.

> 
> I don't think the recordings are available for the performance meeting one, 
> but it was something to do with certain C++ string functions causing issue 
> with CPU cache. Honestly can't remember much else.
> 
> > > Both of those, coupled with the fact that Xeon E3's are the cheapest way 
> > > to get high clock speeds, sort of made my decision.
> > >
> > Totally agreed, my current HDD node des

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-22 Thread Nick Fisk
 

 

From: Alex Gorbachev [mailto:a...@iss-integration.com] 
Sent: 22 August 2016 20:30
To: Nick Fisk <n...@fisk.me.uk>
Cc: Wilhelm Redbrake <w...@globe.de>; Horace Ng <hor...@hkisl.net>; ceph-users 
<ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 


On Sunday, August 21, 2016, Wilhelm Redbrake <w...@globe.de> wrote:

Hi Nick,
I understand all of your technical improvements.
But why do you not use, for example, a simple Areca RAID controller with 8 GB 
cache and BBU on top in every Ceph node?
Configure n times RAID 0 on the controller and enable write-back cache.
That must be a latency "killer" like in all the proprietary storage arrays, no?

Best Regards !!

 

What we saw specifically with Areca cards is that performance is excellent in 
benchmarking and for bursty loads. However, once we started loading with more 
constant workloads (we replicate databases and files to our Ceph cluster), this 
looks to have saturated the relatively small Areca NVDIMM caches and we went 
back to pure drive based performance. 

 

Yes, I think that is a valid point. Although low latency, you are still having 
to write to the disks twice (journal+data), so once the caches on the cards 
start filling up, you are going to hit problems.

 

 

So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3 
HDDs) in hopes that it would help reduce the noisy neighbor impact. That 
worked, but now the overall latency is really high at times, not always. Red 
Hat engineer suggested this is due to loading the 7200 rpm NL-SAS drives with 
too many IOPS, which get their latency sky high. Overall we are functioning 
fine, but I sure would like storage vmotion and other large operations faster. 

 

 

Yeah this is the biggest pain point I think. Normal VM ops are fine, but if you 
ever have to move a multi-TB VM, it’s just too slow. 

 

If you use iscsi with vaai and are migrating a thick provisioned vmdk, then 
performance is actually quite good, as the block sizes used for the copy are a 
lot bigger. 

 

However, my use case required thin provisioned VM’s + snapshots and I found 
that using iscsi you have no control over the fragmentation of the vmdk’s and 
so the read performance is then what suffers (certainly with 7.2k disks)

 

Also with thin provisioned vmdk’s I think I was seeing PG contention with the 
updating of the VMFS metadata, although I can’t be sure.

 

 

I am thinking I will test a few different schedulers and readahead settings to 
see if we can improve this by parallelizing reads. Also will test NFS, but need 
to determine whether to do krbd/knfsd or something more interesting like 
CephFS/Ganesha. 

 

As you know I’m on NFS now. I’ve found it a lot easier to get going and a lot 
less sensitive to making config adjustments without suddenly everything 
dropping offline. The fact that you can specify the extent size on XFS helps 
massively with using thin vmdks/snapshots to avoid fragmentation. Storage 
v-motions are a bit faster than iscsi, but I think I am hitting PG contention 
when esxi tries to write 32 copy threads to the same object. There is probably 
some tuning that could be done here (RBD striping???) but this is the best it’s 
been for a long time and I’m reluctant to fiddle any further.

 

But as mentioned above, thick vmdk’s with vaai might be a really good fit.

 

Any chance thin vs. thick difference could be related to discards?  I saw 
zillions of them in recent testing.

 

 

I was using FILEIO and so discards weren't working for me. I know fragmentation 
was definitely the cause of the small reads. The VMFS metadata I'm less sure 
of, but it seemed the most likely cause as it only affected write performance 
the first time round.

 

 

 

Thanks for your very valuable info on analysis and hw build. 

 

Alex

 




On 21.08.2016 at 09:31, Nick Fisk <n...@fisk.me.uk> wrote:

>> -Original Message-
>> From: Alex Gorbachev [mailto:a...@iss-integration.com]
>> Sent: 21 August 2016 04:15
>> To: Nick Fisk <n...@fisk.me.uk>
>> Cc: w...@globe.de; Horace Ng <hor...@hkisl.net>; ceph-users 
>> <ceph-users@lists.ceph.com>
>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>
>> Hi Nick,
>>
>> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk <n...@fisk.me.uk> wrote:
>>>> -Original Message-
>>>> From: w...@globe.de [mailto:w...@globe.de]
>>>> Sent: 21 July 2016 13:23
>>>> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
>>>> Cc: ceph-users@lists.ceph.com
>>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>>>
>>>> Okay and what is your plan now to speed up ?
>>>
>>> Now I have come up with a lower latency 

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-22 Thread Nick Fisk
> -Original Message-
> From: Christian Balzer [mailto:ch...@gol.com]
> Sent: 22 August 2016 03:00
> To: 'ceph-users' <ceph-users@lists.ceph.com>
> Cc: Nick Fisk <n...@fisk.me.uk>
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> 
> Hello,
> 
> On Sun, 21 Aug 2016 09:57:40 +0100 Nick Fisk wrote:
> 
> >
> >
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > Behalf Of Christian Balzer
> > > Sent: 21 August 2016 09:32
> > > To: ceph-users <ceph-users@lists.ceph.com>
> > > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> > >
> > >
> > > Hello,
> > >
> > > On Sun, 21 Aug 2016 09:16:33 +0100 Brian :: wrote:
> > >
> > > > Hi Nick
> > > >
> > > > Interested in this comment - "-Dual sockets are probably bad and
> > > > will impact performance."
> > > >
> > > > Have you got real world experience of this being the case?
> > > >
> > > Well, Nick wrote "probably".
> > >
> > > Dual sockets and thus NUMA, the need for CPUs to talk to each other
> > > and share information certainly can impact things that are
> > very
> > > time critical.
> > > How much though is a question of design, both HW and SW.
> >
> > There was a guy from Redhat (sorry his name escapes me now) a few
> > months ago on the performance weekly meeting. He was analysing the CPU
> > cache miss effects with Ceph and it looked like a NUMA setup was
> > having quite a severe impact on some things. To be honest a lot of it
> > went over my head, but I came away from it with a general feeling that
> > if you can get the required performance from 1 socket, then that is 
> > probably a better bet. This includes only populating a single socket in a
> > dual socket system. There was also a Ceph tech talk at the start of the year
> > (High perf databases on Ceph) where the guy presenting was also recommending
> > only populating 1 socket for latency reasons.
> >
> I wonder how complete their testing was and how much manual tuning they tried.
> As in:
> 
> 1. Was irqbalance running?
> Because it and the normal kernel strategies clash beautifully.
> Irqbalance moves stuff around, the kernel tries to move things close to where 
> the IRQs are, cat and mouse.
> 
> 2. Did they try with manual IRQ pinning?
> I do, not that it's critical with my Ceph nodes, but on other machines it can 
> make a LOT of difference.
> Like keeping the cores near (or at least on the same NUMA node) as the 
> network IRQs reserved for KVM vhost processes.
> 
> 3. Did they try pinning Ceph OSD processes?
> While this may certainly help (and make things more predictable when the load 
> gets high), as I said above the kernel normally does a
> pretty good job of NOT moving things around and keeping processes close to 
> the resources they need.
> 

From what I remember I think they went to pretty long lengths to tune things. 
I think one point was that if you have a 40Gb NIC on one socket and an NVMe on 
another, then no matter where the process runs, you are going to have a lot of 
traffic crossing between the sockets.

Here is the DB on Ceph one

http://ceph-users.ceph.narkive.com/1sj4VI4U/ceph-tech-talk-high-performance-production-databases-on-ceph

I don't think the recordings are available for the performance meeting one, but 
it was something to do with certain C++ string functions causing issues with the 
CPU cache. Honestly I can't remember much else.

> > Both of those, coupled with the fact that Xeon E3's are the cheapest way to 
> > get high clock speeds, sort of made my decision.
> >
> Totally agreed, my current HDD node design is based on the single CPU 
> Supermicro 5028R-E1CR12L barebone, with an E5-1650 v3
> (3.50GHz) CPU.

Nice. Any ideas how they compare to the E3's?

> 
> > >
> > > We're looking here at a case where he's trying to reduce latency by
> > > all means and where the actual CPU needs for the HDDs are negligible.
> > > The idea being that a "Ceph IOPS" stays on one core which is hopefully 
> > > also not being shared at that time.
> > >
> > > If you're looking at full SSD nodes OTOH a single CPU may very well
> > > not be able to saturate a sensible amount of SSDs per node, so a
> > > slight penalty but better utilization and overall IOPS with 2 CPUs may be
> > > the way forward.
> >
> > Definitely, as always work out what your requirements are and design around 
> &

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-22 Thread Alex Gorbachev
>
>
> On Sunday, August 21, 2016, Wilhelm Redbrake <w...@globe.de> wrote:
>
> Hi Nick,
> i understand all of your technical improvements.
> But: why do you Not use a simple for example Areca Raid Controller with 8
> gb Cache and Bbu ontop in every ceph node.
> Configure n Times RAID 0 on the Controller and enable Write back Cache.
> That must be a latency "Killer" like in all the prop. Storage arrays or
> Not ??
>
> Best Regards !!
>
>
>
> What we saw specifically with Areca cards is that performance is excellent
> in benchmarking and for bursty loads. However, once we started loading with
> more constant workloads (we replicate databases and files to our Ceph
> cluster), this looks to have saturated the relatively small Areca NVDIMM
> caches and we went back to pure drive based performance.
>
>
>
> Yes, I think that is a valid point. Although low latency, you are still
> having to write to the disks twice (journal+data), so once the cache’s on
> the cards start filling up, you are going to hit problems.
>
>
>
>
>
> So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per
> 3 HDDs) in hopes that it would help reduce the noisy neighbor impact. That
> worked, but now the overall latency is really high at times, not always.
> Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS
> drives with too many IOPS, which get their latency sky high. Overall we are
> functioning fine, but I sure would like storage vmotion and other large
> operations faster.
>
>
>
>
>
> Yeah this is the biggest pain point I think. Normal VM ops are fine, but
> if you ever have to move a multi-TB VM, it’s just too slow.
>
>
>
> If you use iscsi with vaai and are migrating a thick provisioned vmdk,
> then performance is actually quite good, as the block sizes used for the
> copy are a lot bigger.
>
>
>
> However, my use case required thin provisioned VM’s + snapshots and I
> found that using iscsi you have no control over the fragmentation of the
> vmdk’s and so the read performance is then what suffers (certainly with
> 7.2k disks)
>
>
>
> Also with thin provisioned vmdk’s I think I was seeing PG contention with
> the updating of the VMFS metadata, although I can’t be sure.
>
>
>
>
>
> I am thinking I will test a few different schedulers and readahead
> settings to see if we can improve this by parallelizing reads. Also will
> test NFS, but need to determine whether to do krbd/knfsd or something more
> interesting like CephFS/Ganesha.
>
>
>
> As you know I’m on NFS now. I’ve found it a lot easier to get going and a
> lot less sensitive to making config adjustments without suddenly everything
> dropping offline. The fact that you can specify the extent size on XFS
> helps massively with using thin vmdks/snapshots to avoid fragmentation.
> Storage v-motions are a bit faster than iscsi, but I think I am hitting PG
> contention when esxi tries to write 32 copy threads to the same object.
> There is probably some tuning that could be done here (RBD striping???) but
> this is the best it’s been for a long time and I’m reluctant to fiddle any
> further.
>
>
>
> But as mentioned above, thick vmdk’s with vaai might be a really good fit.
>

Any chance thin vs. thick difference could be related to discards?  I saw
zillions of them in recent testing.


>
>
> Thanks for your very valuable info on analysis and hw build.
>
>
>
> Alex
>
>
>
>
>
>
> Am 21.08.2016 um 09:31 schrieb Nick Fisk <n...@fisk.me.uk>:
>
> >> -Original Message-
> >> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> >> Sent: 21 August 2016 04:15
> >> To: Nick Fisk <n...@fisk.me.uk>
> >> Cc: w...@globe.de; Horace Ng <hor...@hkisl.net>; ceph-users <
> ceph-users@lists.ceph.com>
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Hi Nick,
> >>
> >> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk <n...@fisk.me.uk> wrote:
> >>>> -Original Message-
> >>>> From: w...@globe.de [mailto:w...@globe.de]
> >>>> Sent: 21 July 2016 13:23
> >>>> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
> >>>> Cc: ceph-users@lists.ceph.com
> >>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>>>
> >>>> Okay and what is your plan now to speed up ?
> >>>
> >>> Now I have come up with a lower latency hardware design, there is not
> much further improvement until persistent RBD caching is
> >&

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Christian Balzer

Hello,

On Sun, 21 Aug 2016 09:57:40 +0100 Nick Fisk wrote:

> 
> 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> > Christian Balzer
> > Sent: 21 August 2016 09:32
> > To: ceph-users <ceph-users@lists.ceph.com>
> > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> > 
> > 
> > Hello,
> > 
> > On Sun, 21 Aug 2016 09:16:33 +0100 Brian :: wrote:
> > 
> > > Hi Nick
> > >
> > > Interested in this comment - "-Dual sockets are probably bad and will
> > > impact performance."
> > >
> > > Have you got real world experience of this being the case?
> > >
> > Well, Nick wrote "probably".
> > 
> > Dual sockets and thus NUMA, the need for CPUs to talk to each other and 
> > share information certainly can impact things that are very
> > time critical.
> > How much though is a question of design, both HW and SW.
> 
> There was a guy from Redhat (sorry his name escapes me now) a few months ago 
> on the performance weekly meeting. He was analysing the
> CPU cache miss effects with Ceph and it looked like a NUMA setup was having 
> quite a severe impact on some things. To be honest a lot
> of it went over my head, but I came away from it with a general feeling that 
> if you can get the required performance from 1 socket,
> then that is probably a better bet. This includes only populating a single 
> socket in a dual socket system. There was also a Ceph
> tech talk at the start of the year (High perf databases on Ceph) where the 
> guy presenting was also recommending only populating 1
> socket for latency reasons.
> 
I wonder how complete their testing was and how much manual tuning they
tried.
As in:

1. Was irqbalance running? 
Because it and the normal kernel strategies clash beautifully.
Irqbalance moves stuff around, the kernel tries to move things close to
where the IRQs are, cat and mouse.

2. Did they try with manual IRQ pinning?
I do, not that it's critical with my Ceph nodes, but on other machines it
can make a LOT of difference. 
Like keeping the cores near (or at least on the same NUMA node) as the
network IRQs reserved for KVM vhost processes. 

3. Did they try pinning Ceph OSD processes?
While this may certainly help (and make things more predictable when the
load gets high), as I said above the kernel normally does a pretty good job
of NOT moving things around and keeping processes close to the resources
they need.
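
For 2. and 3., a rough illustration of what that pinning looks like (IRQ numbers, core masks and the OSD id below are placeholders, not a tested layout):

# pin a NIC queue's IRQ to a core on the NUMA node the card hangs off
echo 2 > /proc/irq/123/smp_affinity        # IRQ 123 -> CPU1 (mask 0x2)

# start an OSD bound to socket 0's cores and memory
numactl --cpunodebind=0 --membind=0 /usr/bin/ceph-osd -i 12 --cluster ceph

taskset -pc 0-5 <osd pid> does the CPU part after the fact on an already running OSD.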

> Both of those, coupled with the fact that Xeon E3's are the cheapest way to 
> get high clock speeds, sort of made my decision.
> 
Totally agreed, my current HDD node design is based on the single CPU
Supermicro 5028R-E1CR12L barebone, with an E5-1650 v3 (3.50GHz) CPU.

> > 
> > We're looking here at a case where he's trying to reduce latency by all 
> > means and where the actual CPU needs for the HDDs are
> > negligible.
> > The idea being that a "Ceph IOPS" stays on one core which is hopefully also 
> > not being shared at that time.
> > 
> > If you're looking at full SSD nodes OTOH a single CPU may very well not be 
> > able to saturate a sensible amount of SSDs per node, so a
> > slight penalty but better utilization and overall IOPS with 2 CPUs may be 
> > the way forward.
> 
> Definitely, as always work out what your requirements are and design around 
> them.  
> 
On my cache tier nodes with 2x E5-2623 v3 (3.00GHz) and currently 4 800GB
DC S3610 SSDs I can already saturate all but 2 "cores", with the "right"
extreme test cases.
Normal load is of course just around 4 (out of 16) "cores".

And for the people who like it fast(er) but don't have to deal with VMware
or the likes, instead of forcing the c-state to 1 just setting the governor
to "performance" was enough in my case to halve latency (from about 2 to
1ms).
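
For anyone wanting to try the same, the governor change is along these lines (sysfs path shown for the usual cpufreq layout):

cpupower frequency-set -g performance
# or, without cpupower:
for c in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > $c; done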

This still does save some power at times and (as Nick speculated) indeed
allows some cores to use their turbo speeds.

So the 4-5 busy cores on my cache tier nodes tend to hover around 3.3GHz,
instead of the 3.0GHz baseline for their CPUs.
And the less loaded cores don't tend to go below 2.6GHz, as opposed to the
1.2GHz that the "powersave" governor would default to.

Christian

> > 
> > Christian
> > 
> > > Thanks - B
> > >
> > > On Sun, Aug 21, 2016 at 8:31 AM, Nick Fisk <n...@fisk.me.uk> wrote:
> > > >> -Original Message-
> > > >> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> > > >> Sent: 21 August 2016 04:15
> > > >> To: Nick Fisk <n...@fisk.me.uk>
> > > >&

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Nick Fisk
From: Alex Gorbachev [mailto:a...@iss-integration.com] 
Sent: 21 August 2016 15:27
To: Wilhelm Redbrake <w...@globe.de>
Cc: n...@fisk.me.uk; Horace Ng <hor...@hkisl.net>; ceph-users 
<ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 



On Sunday, August 21, 2016, Wilhelm Redbrake <w...@globe.de> wrote:

Hi Nick,
i understand all of your technical improvements.
But: why do you Not use a simple for example Areca Raid Controller with 8 gb 
Cache and Bbu ontop in every ceph node.
Configure n Times RAID 0 on the Controller and enable Write back Cache.
That must be a latency "Killer" like in all the prop. Storage arrays or Not ??

Best Regards !!

 

What we saw specifically with Areca cards is that performance is excellent in 
benchmarking and for bursty loads. However, once we started loading with more 
constant workloads (we replicate databases and files to our Ceph cluster), this 
looks to have saturated the relatively small Areca NVDIMM caches and we went 
back to pure drive based performance. 

 

Yes, I think that is a valid point. Although low latency, you are still having 
to write to the disks twice (journal+data), so once the cache’s on the cards 
start filling up, you are going to hit problems.

 

 

So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3 
HDDs) in hopes that it would help reduce the noisy neighbor impact. That 
worked, but now the overall latency is really high at times, not always. Red 
Hat engineer suggested this is due to loading the 7200 rpm NL-SAS drives with 
too many IOPS, which get their latency sky high. Overall we are functioning 
fine, but I sure would like storage vmotion and other large operations faster. 

 

 

Yeah this is the biggest pain point I think. Normal VM ops are fine, but if you 
ever have to move a multi-TB VM, it’s just too slow. 

 

If you use iscsi with vaai and are migrating a thick provisioned vmdk, then 
performance is actually quite good, as the block sizes used for the copy are a 
lot bigger. 

 

However, my use case required thin provisioned VM’s + snapshots and I found 
that using iscsi you have no control over the fragmentation of the vmdk’s and 
so the read performance is then what suffers (certainly with 7.2k disks)

 

Also with thin provisioned vmdk’s I think I was seeing PG contention with the 
updating of the VMFS metadata, although I can’t be sure.

 

 

I am thinking I will test a few different schedulers and readahead settings to 
see if we can improve this by parallelizing reads. Also will test NFS, but need 
to determine whether to do krbd/knfsd or something more interesting like 
CephFS/Ganesha. 

 

As you know I’m on NFS now. I’ve found it a lot easier to get going and a lot 
less sensitive to making config adjustments without suddenly everything 
dropping offline. The fact that you can specify the extent size on XFS helps 
massively with using thin vmdks/snapshots to avoid fragmentation. Storage 
v-motions are a bit faster than iscsi, but I think I am hitting PG contention 
when esxi tries to write 32 copy threads to the same object. There is probably 
some tuning that could be done here (RBD striping???) but this is the best it’s 
been for a long time and I’m reluctant to fiddle any further.

 

But as mentioned above, thick vmdk’s with vaai might be a really good fit.

 

Thanks for your very valuable info on analysis and hw build. 

 

Alex

 




Am 21.08.2016 um 09:31 schrieb Nick Fisk <n...@fisk.me.uk>:

>> -Original Message-
>> From: Alex Gorbachev [mailto:a...@iss-integration.com]
>> Sent: 21 August 2016 04:15
>> To: Nick Fisk <n...@fisk.me.uk>
>> Cc: w...@globe.de; Horace Ng <hor...@hkisl.net>; ceph-users <ceph-users@lists.ceph.com>
>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>
>> Hi Nick,
>>
>> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk <n...@fisk.me.uk> wrote:
>>>> -Original Message-
>>>> From: w...@globe.de [mailto:w...@globe.de]
>>>> Sent: 21 July 2016 13:23
>>>> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
>>>> Cc: ceph-users@lists.ceph.com
>>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>>>
>>>> Okay and what is your plan now to speed up ?
>>>
>>> Now I have come up with a lower latency hardware design, there is not much 
>>> further improvement until persistent RBD caching is
>> imple

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Alex Gorbachev
On Sunday, August 21, 2016, Wilhelm Redbrake <w...@globe.de> wrote:

> Hi Nick,
> i understand all of your technical improvements.
> But: why do you Not use a simple for example Areca Raid Controller with 8
> gb Cache and Bbu ontop in every ceph node.
> Configure n Times RAID 0 on the Controller and enable Write back Cache.
> That must be a latency "Killer" like in all the prop. Storage arrays or
> Not ??
>
> Best Regards !!


What we saw specifically with Areca cards is that performance is excellent
in benchmarking and for bursty loads. However, once we started loading with
more constant workloads (we replicate databases and files to our Ceph
cluster), this looks to have saturated the relatively small Areca NVDIMM
caches and we went back to pure drive based performance.

So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3
HDDs) in hopes that it would help reduce the noisy neighbor impact. That
worked, but now the overall latency is really high at times, not always.
A Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS
drives with too many IOPS, which sends their latency sky high. Overall we are
functioning fine, but I sure would like storage vmotion and other large
operations faster.

I am thinking I will test a few different schedulers and readahead settings
to see if we can improve this by parallelizing reads. Also will test NFS,
but need to determine whether to do krbd/knfsd or something more
interesting like CephFS/Ganesha.
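
For the krbd/knfsd option, the basic shape would be something like this; the pool, image and mount point names are made up for the example:

rbd map rbd/esxi-datastore          # krbd exposes /dev/rbd0
mkfs.xfs /dev/rbd0
mount -o noatime /dev/rbd0 /mnt/esxi-datastore
# /etc/exports - sync matters, ESXi expects stable writes
/mnt/esxi-datastore  *(rw,sync,no_root_squash,no_subtree_check)
exportfs -ra

The scheduler and readahead experiments would then be against /sys/block/rbd0/queue/ on the gateway rather than inside the guests.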

Thanks for your very valuable info on analysis and hw build.

Alex


>
>
>
> Am 21.08.2016 um 09:31 schrieb Nick Fisk <n...@fisk.me.uk>:
>
> >> -Original Message-
> >> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> >> Sent: 21 August 2016 04:15
> >> To: Nick Fisk <n...@fisk.me.uk>
> >> Cc: w...@globe.de; Horace Ng <hor...@hkisl.net>; ceph-users <ceph-users@lists.ceph.com>
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Hi Nick,
> >>
> >> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk <n...@fisk.me.uk> wrote:
> >>>> -Original Message-
> >>>> From: w...@globe.de [mailto:w...@globe.de]
> >>>> Sent: 21 July 2016 13:23
> >>>> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
> >>>> Cc: ceph-users@lists.ceph.com
> >>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>>>
> >>>> Okay and what is your plan now to speed up ?
> >>>
> >>> Now I have come up with a lower latency hardware design, there is not
> much further improvement until persistent RBD caching is
> >> implemented, as you will be moving the SSD/NVME closer to the client.
> But I'm happy with what I can achieve at the moment. You
> >> could also experiment with bcache on the RBD.
> >>
> >> Reviving this thread, would you be willing to share the details of the
> low latency hardware design?  Are you optimizing for NFS or
> >> iSCSI?
> >
> > Both really, just trying to get the write latency as low as possible, as
> you know, vmware does everything with lots of unbuffered small io's. Eg
> when you migrate a VM or as thin vmdk's grow.
> >
> > Even storage vmotions which might kick off 32 threads, as they all
> roughly fall on the same PG, there still appears to be a bottleneck with
> contention on the PG itself.
> >
> > These were the sort of things I was trying to optimise for, to make the
> time spent in Ceph as minimal as possible for each IO.
> >
> > So onto the hardware. Through reading various threads and experiments on
> my own I came to the following conclusions.
> >
> > -You need highest possible frequency on the CPU cores, which normally
> also means less of them.
> > -Dual sockets are probably bad and will impact performance.
> > -Use NVME's for journals to minimise latency
> >
> > The end result was OSD nodes based off of a 3.5Ghz Xeon E3v5 with an
> Intel P3700 for a journal. I used the SuperMicro X11SSH-CTF board which has
> 10G-T onboard as well as 8SATA and 8SAS, so no expansion cards required.
> Actually this design as well as being very performant for Ceph, also works
> out very cheap as you are using low end server parts. The whole lot +
> 12x7.2k disks all goes into a 1U case.
> >
> > During testing I noticed that by default 

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Nick Fisk


> -Original Message-
> From: Wilhelm Redbrake [mailto:w...@globe.de]
> Sent: 21 August 2016 09:34
> To: n...@fisk.me.uk
> Cc: Alex Gorbachev <a...@iss-integration.com>; Horace Ng <hor...@hkisl.net>; 
> ceph-users <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Hi Nick,
> i understand all of your technical improvements.
> But: why do you Not use a simple for example Areca Raid Controller with 8 gb 
> Cache and Bbu ontop in every ceph node.
> Configure n Times RAID 0 on the Controller and enable Write back Cache.
> That must be a latency "Killer" like in all the prop. Storage arrays or Not ??

Possibly, the latency of the NVME is very low, to the point that the "latency" 
in Ceph dwarfs it. So I'm not sure how much more improvement can be got from 
lowering journal latency further. But you are certainly correct it would help.

The other thing, if you don't use an SSD for the journal but rely on the RAID WBC, 
do you still see half the MB/s on the hard disks due to the colocated journal? Maybe 
someone can confirm?

Oh and I just looked at the price of that thing. The 16 port version is nearly 
double the price of what I paid for the 400GB NVME and that’s without adding on 
the 8GB ram and BBU. Maybe it's more suited for a full SSD cluster rather than 
spinning disks?

> 
> Best Regards !!
> 
> 
> 
> Am 21.08.2016 um 09:31 schrieb Nick Fisk <n...@fisk.me.uk>:
> 
> >> -Original Message-
> >> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> >> Sent: 21 August 2016 04:15
> >> To: Nick Fisk <n...@fisk.me.uk>
> >> Cc: w...@globe.de; Horace Ng <hor...@hkisl.net>; ceph-users
> >> <ceph-users@lists.ceph.com>
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Hi Nick,
> >>
> >> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk <n...@fisk.me.uk> wrote:
> >>>> -Original Message-----
> >>>> From: w...@globe.de [mailto:w...@globe.de]
> >>>> Sent: 21 July 2016 13:23
> >>>> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
> >>>> Cc: ceph-users@lists.ceph.com
> >>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>>>
> >>>> Okay and what is your plan now to speed up ?
> >>>
> >>> Now I have come up with a lower latency hardware design, there is
> >>> not much further improvement until persistent RBD caching is
> >> implemented, as you will be moving the SSD/NVME closer to the client.
> >> But I'm happy with what I can achieve at the moment. You could also 
> >> experiment with bcache on the RBD.
> >>
> >> Reviving this thread, would you be willing to share the details of
> >> the low latency hardware design?  Are you optimizing for NFS or iSCSI?
> >
> > Both really, just trying to get the write latency as low as possible, as 
> > you know, vmware does everything with lots of unbuffered
> small io's. Eg when you migrate a VM or as thin vmdk's grow.
> >
> > Even storage vmotions which might kick off 32 threads, as they all roughly 
> > fall on the same PG, there still appears to be a bottleneck
> with contention on the PG itself.
> >
> > These were the sort of things I was trying to optimise for, to make the 
> > time spent in Ceph as minimal as possible for each IO.
> >
> > So onto the hardware. Through reading various threads and experiments on my 
> > own I came to the following conclusions.
> >
> > -You need highest possible frequency on the CPU cores, which normally also 
> > means less of them.
> > -Dual sockets are probably bad and will impact performance.
> > -Use NVME's for journals to minimise latency
> >
> > The end result was OSD nodes based off of a 3.5Ghz Xeon E3v5 with an Intel 
> > P3700 for a journal. I used the SuperMicro X11SSH-CTF
> board which has 10G-T onboard as well as 8SATA and 8SAS, so no expansion 
> cards required. Actually this design as well as being very
> performant for Ceph, also works out very cheap as you are using low end 
> server parts. The whole lot + 12x7.2k disks all goes into a 1U
> case.
> >
> > During testing I noticed that by default c-states and p-states slaughter 
> > performance. After forcing max cstate to 1 and forcing the
> CPU frequency up to max, I was seeing 600us latency for a 4kb write to a 
> 3xreplica pool, or around 1600IOPs, this is at QD=1.
> >
> > Few other observations:
> > 1. Power usage is around 150-200W for this config

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Christian Balzer

Hello,

On Sun, 21 Aug 2016 09:16:33 +0100 Brian :: wrote:

> Hi Nick
> 
> Interested in this comment - "-Dual sockets are probably bad and will
> impact performance."
> 
> Have you got real world experience of this being the case?
> 
Well, Nick wrote "probably".

Dual sockets and thus NUMA, the need for CPUs to talk to each other and
share information certainly can impact things that are very time critical.
How much though is a question of design, both HW and SW.

We're looking here at a case where he's trying to reduce latency by all
means and where the actual CPU needs for the HDDs are negligible.
The idea being that a "Ceph IOPS" stays on one core which is hopefully
also not being shared at that time.

If you're looking at full SSD nodes OTOH a single CPU may very well not be
able to saturate a sensible amount of SSDs per node, so a slight penalty
but better utilization and overall IOPS with 2 CPUs may be the way forward.

Christian

> Thanks - B
> 
> On Sun, Aug 21, 2016 at 8:31 AM, Nick Fisk <n...@fisk.me.uk> wrote:
> >> -Original Message-
> >> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> >> Sent: 21 August 2016 04:15
> >> To: Nick Fisk <n...@fisk.me.uk>
> >> Cc: w...@globe.de; Horace Ng <hor...@hkisl.net>; ceph-users 
> >> <ceph-users@lists.ceph.com>
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Hi Nick,
> >>
> >> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk <n...@fisk.me.uk> wrote:
> >> >> -Original Message-----
> >> >> From: w...@globe.de [mailto:w...@globe.de]
> >> >> Sent: 21 July 2016 13:23
> >> >> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
> >> >> Cc: ceph-users@lists.ceph.com
> >> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >> >>
> >> >> Okay and what is your plan now to speed up ?
> >> >
> >> > Now I have come up with a lower latency hardware design, there is not 
> >> > much further improvement until persistent RBD caching is
> >> implemented, as you will be moving the SSD/NVME closer to the client. But 
> >> I'm happy with what I can achieve at the moment. You
> >> could also experiment with bcache on the RBD.
> >>
> >> Reviving this thread, would you be willing to share the details of the low 
> >> latency hardware design?  Are you optimizing for NFS or
> >> iSCSI?
> >
> > Both really, just trying to get the write latency as low as possible, as 
> > you know, vmware does everything with lots of unbuffered small io's. Eg 
> > when you migrate a VM or as thin vmdk's grow.
> >
> > Even storage vmotions which might kick off 32 threads, as they all roughly 
> > fall on the same PG, there still appears to be a bottleneck with contention 
> > on the PG itself.
> >
> > These were the sort of things I was trying to optimise for, to make the 
> > time spent in Ceph as minimal as possible for each IO.
> >
> > So onto the hardware. Through reading various threads and experiments on my 
> > own I came to the following conclusions.
> >
> > -You need highest possible frequency on the CPU cores, which normally also 
> > means less of them.
> > -Dual sockets are probably bad and will impact performance.
> > -Use NVME's for journals to minimise latency
> >
> > The end result was OSD nodes based off of a 3.5Ghz Xeon E3v5 with an Intel 
> > P3700 for a journal. I used the SuperMicro X11SSH-CTF board which has 10G-T 
> > onboard as well as 8SATA and 8SAS, so no expansion cards required. Actually 
> > this design as well as being very performant for Ceph, also works out very 
> > cheap as you are using low end server parts. The whole lot + 12x7.2k disks 
> > all goes into a 1U case.
> >
> > During testing I noticed that by default c-states and p-states slaughter 
> > performance. After forcing max cstate to 1 and forcing the CPU frequency up 
> > to max, I was seeing 600us latency for a 4kb write to a 3xreplica pool, or 
> > around 1600IOPs, this is at QD=1.
> >
> > Few other observations:
> > 1. Power usage is around 150-200W for this config with 12x7.2k disks
> > 2. CPU usage maxing out disks, is only around 10-15%, so plenty of headroom 
> > for more disks.
> > 3. NOTE FOR ABOVE: Don't include iowait when looking at CPU usage
> > 4. No idea about CPU load for pure SSD nodes, but based on the current 
> > disks, you could maybe expect ~1iops per node, before maxing out CPU

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Brian ::
Hi Nick

Interested in this comment - "-Dual sockets are probably bad and will
impact performance."

Have you got real world experience of this being the case?

Thanks - B

On Sun, Aug 21, 2016 at 8:31 AM, Nick Fisk <n...@fisk.me.uk> wrote:
>> -Original Message-
>> From: Alex Gorbachev [mailto:a...@iss-integration.com]
>> Sent: 21 August 2016 04:15
>> To: Nick Fisk <n...@fisk.me.uk>
>> Cc: w...@globe.de; Horace Ng <hor...@hkisl.net>; ceph-users 
>> <ceph-users@lists.ceph.com>
>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>
>> Hi Nick,
>>
>> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk <n...@fisk.me.uk> wrote:
>> >> -Original Message-
>> >> From: w...@globe.de [mailto:w...@globe.de]
>> >> Sent: 21 July 2016 13:23
>> >> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
>> >> Cc: ceph-users@lists.ceph.com
>> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>> >>
>> >> Okay and what is your plan now to speed up ?
>> >
>> > Now I have come up with a lower latency hardware design, there is not much 
>> > further improvement until persistent RBD caching is
>> implemented, as you will be moving the SSD/NVME closer to the client. But 
>> I'm happy with what I can achieve at the moment. You
>> could also experiment with bcache on the RBD.
>>
>> Reviving this thread, would you be willing to share the details of the low 
>> latency hardware design?  Are you optimizing for NFS or
>> iSCSI?
>
> Both really, just trying to get the write latency as low as possible, as you 
> know, vmware does everything with lots of unbuffered small io's. Eg when you 
> migrate a VM or as thin vmdk's grow.
>
> Even storage vmotions which might kick off 32 threads, as they all roughly 
> fall on the same PG, there still appears to be a bottleneck with contention 
> on the PG itself.
>
> These were the sort of things I was trying to optimise for, to make the time 
> spent in Ceph as minimal as possible for each IO.
>
> So onto the hardware. Through reading various threads and experiments on my 
> own I came to the following conclusions.
>
> -You need highest possible frequency on the CPU cores, which normally also 
> means less of them.
> -Dual sockets are probably bad and will impact performance.
> -Use NVME's for journals to minimise latency
>
> The end result was OSD nodes based off of a 3.5Ghz Xeon E3v5 with an Intel 
> P3700 for a journal. I used the SuperMicro X11SSH-CTF board which has 10G-T 
> onboard as well as 8SATA and 8SAS, so no expansion cards required. Actually 
> this design as well as being very performant for Ceph, also works out very 
> cheap as you are using low end server parts. The whole lot + 12x7.2k disks 
> all goes into a 1U case.
>
> During testing I noticed that by default c-states and p-states slaughter 
> performance. After forcing max cstate to 1 and forcing the CPU frequency up 
> to max, I was seeing 600us latency for a 4kb write to a 3xreplica pool, or 
> around 1600IOPs, this is at QD=1.
>
> Few other observations:
> 1. Power usage is around 150-200W for this config with 12x7.2k disks
> 2. CPU usage maxing out disks, is only around 10-15%, so plenty of headroom 
> for more disks.
> 3. NOTE FOR ABOVE: Don't include iowait when looking at CPU usage
> 4. No idea about CPU load for pure SSD nodes, but based on the current disks, 
> you could maybe expect ~1iops per node, before maxing out CPU's
> 5. Single NVME seems to be able to journal 12 disks with no problem during 
> normal operation, no doubt a specific benchmark could max it out though.
> 6. There are slightly faster Xeon E3's, but price/performance = diminishing 
> returns
>
> Hope that answers all your questions.
> Nick
>
>>
>> Thank you,
>> Alex
>>
>> >
>> >>
>> >> Would it help to put in multiple P3700 per OSD Node to improve 
>> >> performance for a single Thread (example Storage VMotion) ?
>> >
>> > Most likely not, it's all the other parts of the puzzle which are causing 
>> > the latency. ESXi was designed for storage arrays that service
>> IO's in 100us-1ms range, Ceph is probably about 10x slower than this, hence 
>> the problem. Disable the BBWC on a RAID controller or
>> SAN and you will see the same behaviour.
>> >
>> >>
>> >> Regards
>> >>
>> >>
>> >> Am 21.07.16 um 14:17 schrieb Nick Fisk:
>> >> >> -Original Message-
>> >> >> From: ce

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Nick Fisk
> -Original Message-
> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> Sent: 21 August 2016 04:15
> To: Nick Fisk <n...@fisk.me.uk>
> Cc: w...@globe.de; Horace Ng <hor...@hkisl.net>; ceph-users 
> <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Hi Nick,
> 
> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk <n...@fisk.me.uk> wrote:
> >> -Original Message-
> >> From: w...@globe.de [mailto:w...@globe.de]
> >> Sent: 21 July 2016 13:23
> >> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Okay and what is your plan now to speed up ?
> >
> > Now I have come up with a lower latency hardware design, there is not much 
> > further improvement until persistent RBD caching is
> implemented, as you will be moving the SSD/NVME closer to the client. But I'm 
> happy with what I can achieve at the moment. You
> could also experiment with bcache on the RBD.
> 
> Reviving this thread, would you be willing to share the details of the low 
> latency hardware design?  Are you optimizing for NFS or
> iSCSI?

Both really, just trying to get the write latency as low as possible, as you 
know, vmware does everything with lots of unbuffered small io's. Eg when you 
migrate a VM or as thin vmdk's grow.

Even storage vmotions which might kick off 32 threads, as they all roughly fall 
on the same PG, there still appears to be a bottleneck with contention on the 
PG itself. 

These were the sort of things I was trying to optimise for, to make the time 
spent in Ceph as minimal as possible for each IO.

So onto the hardware. Through reading various threads and experiments on my own 
I came to the following conclusions. 

-You need highest possible frequency on the CPU cores, which normally also 
means less of them. 
-Dual sockets are probably bad and will impact performance.
-Use NVME's for journals to minimise latency

The end result was OSD nodes based off of a 3.5Ghz Xeon E3v5 with an Intel 
P3700 for a journal. I used the SuperMicro X11SSH-CTF board which has 10G-T 
onboard as well as 8SATA and 8SAS, so no expansion cards required. Actually 
this design as well as being very performant for Ceph, also works out very 
cheap as you are using low end server parts. The whole lot + 12x7.2k disks all 
goes into a 1U case.

During testing I noticed that by default c-states and p-states slaughter 
performance. After forcing max cstate to 1 and forcing the CPU frequency up to 
max, I was seeing 600us latency for a 4kb write to a 3xreplica pool, or around 
1600IOPs, this is at QD=1.
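
For anyone wanting to reproduce that, one common way of pinning C-states and frequency (a sketch, not necessarily the exact method used here) is:

# kernel command line, e.g. in /etc/default/grub
intel_idle.max_cstate=1 processor.max_cstate=1

# then hold the frequency up with the cpufreq governor
cpupower frequency-set -g performance

Holding /dev/cpu_dma_latency open with a low value achieves the same C-state limit without a reboot, if you prefer not to touch the command line.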

Few other observations:
1. Power usage is around 150-200W for this config with 12x7.2k disks
2. CPU usage maxing out disks, is only around 10-15%, so plenty of headroom for 
more disks.
3. NOTE FOR ABOVE: Don't include iowait when looking at CPU usage
4. No idea about CPU load for pure SSD nodes, but based on the current disks, 
you could maybe expect ~1iops per node, before maxing out CPU's
5. Single NVME seems to be able to journal 12 disks with no problem during 
normal operation, no doubt a specific benchmark could max it out though.
6. There are slightly faster Xeon E3's, but price/performance = diminishing 
returns

Hope that answers all your questions.
Nick

> 
> Thank you,
> Alex
> 
> >
> >>
> >> Would it help to put in multiple P3700 per OSD Node to improve performance 
> >> for a single Thread (example Storage VMotion) ?
> >
> > Most likely not, it's all the other parts of the puzzle which are causing 
> > the latency. ESXi was designed for storage arrays that service
> IO's in 100us-1ms range, Ceph is probably about 10x slower than this, hence 
> the problem. Disable the BBWC on a RAID controller or
> SAN and you will see the same behaviour.
> >
> >>
> >> Regards
> >>
> >>
> >> Am 21.07.16 um 14:17 schrieb Nick Fisk:
> >> >> -Original Message-
> >> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> >> >> Behalf Of w...@globe.de
> >> >> Sent: 21 July 2016 13:04
> >> >> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
> >> >> Cc: ceph-users@lists.ceph.com
> >> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread
> >> >> Performance
> >> >>
> >> >> Hi,
> >> >>
> >> >> hmm i think 200 MByte/s is really bad. Is your Cluster in production 
> >> >> right now?
> >> > It's just been built, not running yet.
> >> >
> >

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-20 Thread Alex Gorbachev
Hi Nick,

On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk <n...@fisk.me.uk> wrote:
>> -Original Message-
>> From: w...@globe.de [mailto:w...@globe.de]
>> Sent: 21 July 2016 13:23
>> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>
>> Okay and what is your plan now to speed up ?
>
> Now I have come up with a lower latency hardware design, there is not much 
> further improvement until persistent RBD caching is implemented, as you will 
> be moving the SSD/NVME closer to the client. But I'm happy with what I can 
> achieve at the moment. You could also experiment with bcache on the RBD.

Reviving this thread, would you be willing to share the details of the
low latency hardware design?  Are you optimizing for NFS or iSCSI?

Thank you,
Alex

>
>>
>> Would it help to put in multiple P3700 per OSD Node to improve performance 
>> for a single Thread (example Storage VMotion) ?
>
> Most likely not, it's all the other parts of the puzzle which are causing the 
> latency. ESXi was designed for storage arrays that service IO's in 100us-1ms 
> range, Ceph is probably about 10x slower than this, hence the problem. 
> Disable the BBWC on a RAID controller or SAN and you will see the same behaviour.
>
>>
>> Regards
>>
>>
>> Am 21.07.16 um 14:17 schrieb Nick Fisk:
>> >> -Original Message-
>> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>> >> Of w...@globe.de
>> >> Sent: 21 July 2016 13:04
>> >> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
>> >> Cc: ceph-users@lists.ceph.com
>> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>> >>
>> >> Hi,
>> >>
>> >> hmm i think 200 MByte/s is really bad. Is your Cluster in production 
>> >> right now?
>> > It's just been built, not running yet.
>> >
>> >> So if you start a storage migration you get only 200 MByte/s right?
>> > I wish. My current cluster (not this new one) would storage migrate at
>> > ~10-15MB/s. Serial latency is the problem, without being able to
>> > buffer, ESXi waits on an ack for each IO before sending the next. Also it 
>> > submits the migrations in 64kb chunks, unless you get VAAI
>> working. I think esxi will try and do them in parallel, which will help as 
>> well.
>> >
>> >> I think it would be awesome if you get 1000 MByte/s
>> >>
>> >> Where is the Bottleneck?
>> > Latency serialisation, without a buffer, you can't drive the devices
>> > to 100%. With buffered IO (or high queue depths) I can max out the 
>> > journals.
>> >
>> >> A FIO Test from Sebastien Han give us 400 MByte/s raw performance from 
>> >> the P3700.
>> >>
>> >> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your
>> >> -ssd-is-suitable-as-a-journal-device/
>> >>
>> >> How could it be that the rbd client performance is 50% slower?
>> >>
>> >> Regards
>> >>
>> >>
>> >> Am 21.07.16 um 12:15 schrieb Nick Fisk:
>> >>> I've had a lot of pain with this, smaller block sizes are even worse.
>> >>> You want to try and minimize latency at every point as there is no
>> >>> buffering happening in the iSCSI stack. This means:-
>> >>>
>> >>> 1. Fast journals (NVME or NVRAM)
>> >>> 2. 10GB or better networking
>> >>> 3. Fast CPU's (Ghz)
>> >>> 4. Fix CPU c-state's to C1
>> >>> 5. Fix CPU's Freq to max
>> >>>
>> >>> Also I can't be sure, but I think there is a metadata update
>> >>> happening with VMFS, particularly if you are using thin VMDK's, this
>> >>> can also be a major bottleneck. For my use case, I've switched over to 
>> >>> NFS as it has given much more performance at scale and
>> less headache.
>> >>>
>> >>> For the RADOS Run, here you go (400GB P3700):
>> >>>
>> >>> Total time run:     60.026491
>> >>> Total writes made:  3104
>> >>> Write size: 4194304
>> >>> Object size:4194304
>> >>> Bandwidth (MB/sec): 206.842
>> >>> Stddev Bandwidth:   8.10412
>> >>> Max bandwidth (MB/sec): 224
>> >>> Min ba

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Nick Fisk
Yes awesome, as long as you fully test bcache and you are happy with it.

 

Also, if you intend to do HA, you will have to use dual port SAS SSD’s instead 
of NVME and make sure you create your resource agent scripts correctly, 
otherwise bye bye data.

 

If you enable writeback caching in TGT and you have power failure, then 
anything in the cache is lost. This will either mean holes in your data, or 
sections that are out of date. Basically that LUN will most likely be toast and 
you will have to reformat.
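
If you do test writeback, note that tgt also lets you turn the emulated write cache off per LUN so ESXi keeps issuing stable writes; if I remember the targets.conf syntax correctly it is just (target name below is made up):

<target iqn.2016-07.com.example:rbd-test>
    backing-store /dev/rbd0
    write-cache off
</target>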

 

From: w...@globe.de [mailto:w...@globe.de] 
Sent: 21 July 2016 15:04
To: n...@fisk.me.uk; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Okay that should be the answer... 

I think it would be great to use Intel P3700 1.6TB as bcache in the iscsi rbd 
client gateway nodes.

caching device: Intel P3700 1.6TB

backing device: RBD from Ceph Cluster

What do you mean? I think this setup should improve the performance 
dramatically or not?

If I enable writeback in these nodes and use tgt for VMware, what happens if 
iscsi node 1 goes offline, power loss, or a Linux kernel crash?

 

 

Am 21.07.16 um 15:57 schrieb Nick Fisk:

What you are seeing is probably averaged over 1 second or something like that. 
So yes in 1 second IO would have run on all OSD’s. But for any 1 point in time 
a single thread will only run on 1 OSD (+2 replicas) assuming the IO size isn’t 
bigger than the object size. 

 

For RBD, If data is striped in 4MB chunks, then you will have to read/write 
more than 4MB at a time to cross over to the next object. You get exactly the 
same problems with reading when you don’t set the readahead above 4MB.

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of w...@globe.de
Sent: 21 July 2016 14:05
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

That can not be correct.

Check it at your cluster with dstat as i said...

You will see at every node parallel IO on every OSD and journal

 

Am 21.07.16 um 15:02 schrieb Jake Young:

I think the answer is that with 1 thread you can only ever write to one journal 
at a time. Theoretically, you would need 10 threads to be able to write to 10 
nodes at the same time.  

 

Jake

On Thursday, July 21, 2016, w...@globe.de <w...@globe.de> wrote:

What I do not really understand is:

Lets say the Intel P3700 works with 200 MByte/s rados bench one thread... See 
Nicks results below...

If we have multiple OSD Nodes. For example 10 Nodes.

Every Node has exactly 1x P3700 NVMe built in.

Why is the single Thread performance exactly at 200 MByte/s on the rbd client 
with 10 OSD Node Cluster???

I think it must be at 10 Nodes * 200 MByte/s = 2000 MByte/s.

 

Everyone look yourself at your cluster. 

dstat -D sdb,sdc,sdd,sdX 

You will see that Ceph stripes the data over all OSD's in the cluster if you 
test at the client side with rados bench...

rados bench -p rbd 60 write -b 4M -t 1

 

 

Am 21.07.16 um 14:38 schrieb w...@globe.de:

Is there not a way to enable the Linux page cache? So do not use D_Sync... 

Then the performance would improve dramatically. 


Am 21.07.16 um 14:33 schrieb Nick Fisk: 




-Original Message- 
From: w...@globe.de [mailto:w...@globe.de] 
Sent: 21 July 2016 13:23 
To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net> 
Cc: ceph-users@lists.ceph.com 
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance 

Okay and what is your plan now to speed up ? 

Now I have come up with a lower latency hardware design, there is not much 
further improvement until persistent RBD caching is implemented, as you will be 
moving the SSD/NVME closer to the client. But I'm happy with what I can achieve 
at the moment. You could also experiment with bcache on the RBD. 





Would it help to put in multiple P3700 per OSD Node to improve performance for 
a single Thread (example Storage VMotion) ? 

Most likely not, it's all the other parts of the puzzle which are causing the 
latency. ESXi was designed for storage arrays that service IO's in 100us-1ms 
range, Ceph is probably about 10x slower than this, hence the problem. Disable 
the BBWC on a RAID controller or SAN and you will see the same behaviour. 





Regards 


Am 21.07.16 um 14:17 schrieb Nick Fisk: 




-Original Message- 
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Nick Fisk
What you are seeing is probably averaged over 1 second or something like that. 
So yes in 1 second IO would have run on all OSD’s. But for any 1 point in time 
a single thread will only run on 1 OSD (+2 replicas) assuming the IO size isn’t 
bigger than the object size. 

 

For RBD, If data is striped in 4MB chunks, then you will have to read/write 
more than 4MB at a time to cross over to the next object. You get exactly the 
same problems with reading when you don’t set the readahead above 4MB.
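
On a mapped krbd device that is just the block queue readahead, e.g. (8MB is only an illustrative value above the 4MB object size):

echo 8192 > /sys/block/rbd0/queue/read_ahead_kb

Once readahead spans more than one object, sequential reads start touching the next object before the current one finishes, which gives some of the parallelism back.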

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
w...@globe.de
Sent: 21 July 2016 14:05
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

That can not be correct.

Check it at your cluster with dstat as i said...

You will see at every node parallel IO on every OSD and journal

 

Am 21.07.16 um 15:02 schrieb Jake Young:

I think the answer is that with 1 thread you can only ever write to one journal 
at a time. Theoretically, you would need 10 threads to be able to write to 10 
nodes at the same time.  

 

Jake

On Thursday, July 21, 2016, w...@globe.de <w...@globe.de> wrote:

What I do not really understand is:

Lets say the Intel P3700 works with 200 MByte/s rados bench one thread... See 
Nicks results below...

If we have multiple OSD Nodes. For example 10 Nodes.

Every Node has exactly 1x P3700 NVMe built in.

Why is the single Thread performance exactly at 200 MByte/s on the rbd client 
with 10 OSD Node Cluster???

I think it must be at 10 Nodes * 200 MByte/s = 2000 MByte/s.

 

Everyone look yourself at your cluster. 

dstat -D sdb,sdc,sdd,sdX 

You will see that Ceph stripes the data over all OSD's in the cluster if you 
test at the client side with rados bench...

rados bench -p rbd 60 write -b 4M -t 1

 

 

Am 21.07.16 um 14:38 schrieb w...@globe.de:

Is there not a way to enable the Linux page cache? So do not use D_Sync... 

Then the performance would improve dramatically. 


Am 21.07.16 um 14:33 schrieb Nick Fisk: 



-Original Message- 
From: w...@globe.de [mailto:w...@globe.de] 
Sent: 21 July 2016 13:23 
To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net> 
Cc: ceph-users@lists.ceph.com 
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance 

Okay and what is your plan now to speed up ? 

Now I have come up with a lower latency hardware design, there is not much 
further improvement until persistent RBD caching is implemented, as you will be 
moving the SSD/NVME closer to the client. But I'm happy with what I can achieve 
at the moment. You could also experiment with bcache on the RBD. 




Would it help to put in multiple P3700 per OSD Node to improve performance for 
a single Thread (example Storage VMotion) ? 

Most likely not, it's all the other parts of the puzzle which are causing the 
latency. ESXi was designed for storage arrays that service IO's in 100us-1ms 
range, Ceph is probably about 10x slower than this, hence the problem. Disable 
the BBWC on a RAID controller or SAN and you will see the same behaviour. 




Regards 


Am 21.07.16 um 14:17 schrieb Nick Fisk: 



-Original Message- 
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of w...@globe.de 
Sent: 21 July 2016 13:04 
To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net> 
Cc: ceph-users@lists.ceph.com 
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance 

Hi, 

Hmm, I think 200 MByte/s is really bad. Is your Cluster in production right now? 

It's just been built, not running yet. 




So if you start a storage migration you get only 200 MByte/s right? 

I wish. My current cluster (not this new one) would storage migrate at 
~10-15MB/s. Serial latency is the problem, without being able to 
buffer, ESXi waits on an ack for each IO before sending the next. Also it 
submits the migrations in 64kb chunks, unless you get VAAI 
working. I think esxi will try and do them in parallel, which will help as 
well. 



I think it would be awesome if you get 1000 MByte/s 

Where is the Bottleneck? 

Latency serialisation, without a buffer, you can't drive the devices 
to 100%. With buffered IO (or high queue depths) I can max out the journals. 




A FIO Test from S

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread w...@globe.de

Okay that should be the answer...

I think it would be great to use Intel P3700 1.6TB as bcache in the 
iscsi rbd client gateway nodes.


caching device: Intel P3700 1.6TB

backing device: RBD from Ceph Cluster

What do you mean? I think this setup should improve the performance 
dramatically or not?


If I enable writeback in these nodes and use tgt for VMware, what 
happens if iscsi node 1 goes offline, power loss, or a Linux kernel crash?
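
What that would look like on a gateway node, roughly (device names are examples only, and writeback at the gateway carries exactly the data loss risk discussed around tgt elsewhere in this thread):

make-bcache -C /dev/nvme0n1 -B /dev/rbd0     # NVMe as cache, mapped RBD as backing; /dev/bcache0 appears
echo writeback > /sys/block/bcache0/bcache/cache_mode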




Am 21.07.16 um 15:57 schrieb Nick Fisk:


What you are seeing is probably averaged over 1 second or something 
like that. So yes in 1 second IO would have run on all OSD’s. But for 
any 1 point in time a single thread will only run on 1 OSD (+2 
replicas) assuming the IO size isn’t bigger than the object size.


For RBD, If data is striped in 4MB chunks, then you will have to 
read/write more than 4MB at a time to cross over to the next object. 
You get exactly the same problems with reading when you don’t set the 
readahead above 4MB.


*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On 
Behalf Of *w...@globe.de

*Sent:* 21 July 2016 14:05
*To:* ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance

That can not be correct.

Check it at your cluster with dstat as i said...

You will see at every node parallel IO on every OSD and journal

Am 21.07.16 um 15:02 schrieb Jake Young:

I think the answer is that with 1 thread you can only ever write
to one journal at a time. Theoretically, you would need 10 threads
to be able to write to 10 nodes at the same time.

Jake

On Thursday, July 21, 2016, w...@globe.de <w...@globe.de> wrote:

What I do not really understand is:

Lets say the Intel P3700 works with 200 MByte/s rados bench
one thread... See Nicks results below...

If we have multiple OSD Nodes. For example 10 Nodes.

Every Node has exactly 1x P3700 NVMe built in.

Why is the single Thread performance exactly at 200 MByte/s on
the rbd client with 10 OSD Node Cluster???

I think it must be at 10 Nodes * 200 MByte/s = 2000 MByte/s.

Everyone look yourself at your cluster.

dstat -D sdb,sdc,sdd,sdX 

You will see that Ceph stripes the data over all OSD's in the
cluster if you test at the client side with rados bench...

*rados bench -p rbd 60 write -b 4M -t 1*

Am 21.07.16 um 14:38 schrieb w...@globe.de:

Is there not a way to enable the Linux page cache? So do not
use D_Sync...

Then the performance would improve dramatically.


Am 21.07.16 um 14:33 schrieb Nick Fisk:

-Original Message-
From: w...@globe.de [mailto:w...@globe.de]
Sent: 21 July 2016 13:23
To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
Cc: ceph-users@lists.ceph.com

    Subject: Re: [ceph-users] Ceph + VMware + Single
Thread Performance

Okay and what is your plan now to speed up ?

Now I have come up with a lower latency hardware
design, there is not much further improvement until
persistent RBD caching is implemented, as you will be
moving the SSD/NVME closer to the client. But I'm
happy with what I can achieve at the moment. You could
also experiment with bcache on the RBD.


Would it help to put in multiple P3700 per OSD
Node to improve performance for a single Thread
(example Storage VMotion) ?

Most likely not, it's all the other parts of the
puzzle which are causing the latency. ESXi was
designed for storage arrays that service IO's in
100us-1ms range, Ceph is probably about 10x slower
than this, hence the problem. Disable the BBWC on a
RAID controller or SAN and you will see the same behaviour.


Regards


On 21.07.16 at 14:17, Nick Fisk wrote:

-Original Message-
From: ceph-users
[mailto:ceph-users-boun...@lists.ceph.com


Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread w...@globe.de

That cannot be correct.

Check it on your cluster with dstat as I said...

You will see parallel IO on every OSD and journal at every node


On 21.07.16 at 15:02, Jake Young wrote:
I think the answer is that with 1 thread you can only ever write to 
one journal at a time. Theoretically, you would need 10 threads to be 
able to write to 10 nodes at the same time.


Jake

On Thursday, July 21, 2016, w...@globe.de wrote:


What I don't really understand is:

Let's say the Intel P3700 works at 200 MByte/s with a one-thread
rados bench... See Nick's results below...

If we have multiple OSD nodes. For example 10 nodes.

Every node has exactly 1x P3700 NVMe built in.

Why is the single-thread performance still exactly 200 MByte/s on the
rbd client with a 10 OSD node cluster???

I think it should be 10 nodes * 200 MByte/s = 2000 MByte/s.


Everyone, look for yourself at your cluster.

dstat -D sdb,sdc,sdd,sdX 

You will see that Ceph stripes the data over all OSD's in the
cluster if you test at the client side with rados bench...

*rados bench -p rbd 60 write -b 4M -t 1*



On 21.07.16 at 14:38, w...@globe.de wrote:

Is there not a way to enable the Linux page cache? So do not use
D_Sync...

Then performance would improve dramatically.


On 21.07.16 at 14:33, Nick Fisk wrote:

-Original Message-
From: w...@globe.de [mailto:w...@globe.de]
Sent: 21 July 2016 13:23
To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Okay and what is your plan now to speed up ?

Now I have come up with a lower latency hardware design, there
is not much further improvement until persistent RBD caching is
implemented, as you will be moving the SSD/NVME closer to the
client. But I'm happy with what I can achieve at the moment. You
could also experiment with bcache on the RBD.


Would it help to put in multiple P3700 per OSD Node to improve
performance for a single Thread (example Storage VMotion) ?

Most likely not, it's all the other parts of the puzzle which
are causing the latency. ESXi was designed for storage arrays
that service IO's in 100us-1ms range, Ceph is probably about 10x
slower than this, hence the problem. Disable the BBWC on a RAID
controller or SAN and you will the same behaviour.


Regards


On 21.07.16 at 14:17, Nick Fisk wrote:

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
Of w...@globe.de
Sent: 21 July 2016 13:04
To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

Hmm, I think 200 MByte/s is really bad. Is your cluster in
production right now?

It's just been built, not running yet.


So if you start a storage migration you get only 200 MByte/s
right?

I wish. My current cluster (not this new one) would storage
migrate at
~10-15MB/s. Serial latency is the problem, without being able to
buffer, ESXi waits on an ack for each IO before sending the
next. Also it submits the migrations in 64kb chunks, unless
you get VAAI

working. I think esxi will try and do them in parallel, which
will help as well.

I think it would be awesome if you get 1000 MByte/s

Where is the Bottleneck?

Latency serialisation, without a buffer, you can't drive the
devices
to 100%. With buffered IO (or high queue depths) I can max out
the journals.


A FIO test from Sebastien Han gives us 400 MByte/s raw
performance from the P3700.

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your

-ssd-is-suitable-as-a-journal-device/

How could it be that the rbd client performance is 50% slower?

Regards


On 21.07.16 at 12:15, Nick Fisk wrote:

I've had a lot of pain with this, smaller block sizes are
even worse.
You want to try and minimize latency at every point as there
is no
buffer

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Jake Young
I think the answer is that with 1 thread you can only ever write to one
journal at a time. Theoretically, you would need 10 threads to be able to
write to 10 nodes at the same time.

Jake

On Thursday, July 21, 2016, w...@globe.de wrote:

> What I don't really understand is:
>
> Let's say the Intel P3700 works at 200 MByte/s with one-thread rados bench...
> See Nick's results below...
>
> If we have multiple OSD Nodes. For example 10 Nodes.
>
> Every Node has exactly 1x P3700 NVMe built in.
>
> Why is the single Thread performance exactly at 200 MByte/s on the rbd
> client with 10 OSD Node Cluster???
>
> I think it must be at 10 Nodes * 200 MByte/s = 2000 MByte/s.
>
>
> Everyone look yourself at your cluster.
>
> dstat -D sdb,sdc,sdd,sdX 
>
> You will see that Ceph stripes the data over all OSD's in the cluster if
> you test at the client side with rados bench...
>
> *rados bench -p rbd 60 write -b 4M -t 1*
>
>
>
> On 21.07.16 at 14:38, w...@globe.de wrote:
>
> Is there not a way to enable the Linux page cache? So do not use D_Sync...
>
> Then performance would improve dramatically.
>
>
> Am 21.07.16 um 14:33 schrieb Nick Fisk:
>
> -Original Message-
> From: w...@globe.de [mailto:w...@globe.de]
> Sent: 21 July 2016 13:23
> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
> Okay and what is your plan now to speed up ?
>
> Now I have come up with a lower latency hardware design, there is not much
> further improvement until persistent RBD caching is implemented, as you
> will be moving the SSD/NVME closer to the client. But I'm happy with what I
> can achieve at the moment. You could also experiment with bcache on the
> RBD.
>
> Would it help to put in multiple P3700 per OSD Node to improve performance
> for a single Thread (example Storage VMotion) ?
>
> Most likely not, it's all the other parts of the puzzle which are causing
> the latency. ESXi was designed for storage arrays that service IO's in
> 100us-1ms range, Ceph is probably about 10x slower than this, hence the
> problem. Disable the BBWC on a RAID controller or SAN and you will see the
> same behaviour.
>
> Regards
>
>
> Am 21.07.16 um 14:17 schrieb Nick Fisk:
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of w...@globe.de
> Sent: 21 July 2016 13:04
> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
> Hi,
>
> hmm i think 200 MByte/s is really bad. Is your Cluster in production right
> now?
>
> It's just been built, not running yet.
>
> So if you start a storage migration you get only 200 MByte/s right?
>
> I wish. My current cluster (not this new one) would storage migrate at
> ~10-15MB/s. Serial latency is the problem, without being able to
> buffer, ESXi waits on an ack for each IO before sending the next. Also it
> submits the migrations in 64kb chunks, unless you get VAAI
>
> working. I think esxi will try and do them in parallel, which will help as
> well.
>
> I think it would be awesome if you get 1000 MByte/s
>
> Where is the Bottleneck?
>
> Latency serialisation, without a buffer, you can't drive the devices
> to 100%. With buffered IO (or high queue depths) I can max out the
> journals.
>
> A FIO Test from Sebastien Han give us 400 MByte/s raw performance from the
> P3700.
>
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your
> -ssd-is-suitable-as-a-journal-device/
>
> How could it be that the rbd client performance is 50% slower?
>
> Regards
>
>
> Am 21.07.16 um 12:15 schrieb Nick Fisk:
>
> I've had a lot of pain with this, smaller block sizes are even worse.
> You want to try and minimize latency at every point as there is no
> buffering happening in the iSCSI stack. This means:-
>
> 1.

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Nick Fisk
Yes, but not if you are using iSCSI and don't want data loss. If the data is in 
a cache somewhere and you lose power or crash, it's game over. That's why you want 
to cache to a non-volatile device close to the source.

If you use something like FIO and use buffered IO, you will see that you will 
get really high numbers, unfortunately you can't do this with iSCSI though.
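
As a hedged sketch (not from the mail): the buffered-vs-synchronous gap described here can be reproduced with fio; the file path below is an assumption.

    # buffered single-threaded 4M writes (the page cache absorbs the latency)
    fio --name=buffered --filename=/mnt/test/fio.dat --size=4G --rw=write --bs=4M --iodepth=1 --numjobs=1 --direct=0
    # the same workload forced through O_DIRECT + O_SYNC shows the latency-bound figure
    fio --name=sync --filename=/mnt/test/fio.dat --size=4G --rw=write --bs=4M --iodepth=1 --numjobs=1 --direct=1 --sync=1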

> -Original Message-
> From: w...@globe.de [mailto:w...@globe.de]
> Sent: 21 July 2016 13:39
> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Is there not a way to enable the Linux page cache? So do not use D_Sync...
> 
> Then performance would improve dramatically.
> 
> 
> Am 21.07.16 um 14:33 schrieb Nick Fisk:
> >> -Original Message-
> >> From: w...@globe.de [mailto:w...@globe.de]
> >> Sent: 21 July 2016 13:23
> >> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Okay and what is your plan now to speed up ?
> > Now I have come up with a lower latency hardware design, there is not much 
> > further improvement until persistent RBD caching is
> implemented, as you will be moving the SSD/NVME closer to the client. But I'm 
> happy with what I can achieve at the moment. You
> could also experiment with bcache on the RBD.
> >
> >> Would it help to put in multiple P3700 per OSD Node to improve performance 
> >> for a single Thread (example Storage VMotion) ?
> > Most likely not, it's all the other parts of the puzzle which are causing 
> > the latency. ESXi was designed for storage arrays that service
> IO's in 100us-1ms range, Ceph is probably about 10x slower than this, hence 
> the problem. Disable the BBWC on a RAID controller or
> SAN and you will see the same behaviour.
> >
> >> Regards
> >>
> >>
> >> Am 21.07.16 um 14:17 schrieb Nick Fisk:
> >>>> -Original Message-
> >>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> >>>> Behalf Of w...@globe.de
> >>>> Sent: 21 July 2016 13:04
> >>>> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
> >>>> Cc: ceph-users@lists.ceph.com
> >>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>>>
> >>>> Hi,
> >>>>
> >>>> hmm i think 200 MByte/s is really bad. Is your Cluster in production 
> >>>> right now?
> >>> It's just been built, not running yet.
> >>>
> >>>> So if you start a storage migration you get only 200 MByte/s right?
> >>> I wish. My current cluster (not this new one) would storage migrate
> >>> at ~10-15MB/s. Serial latency is the problem, without being able to
> >>> buffer, ESXi waits on an ack for each IO before sending the next.
> >>> Also it submits the migrations in 64kb chunks, unless you get VAAI
> >> working. I think esxi will try and do them in parallel, which will help as 
> >> well.
> >>>> I think it would be awesome if you get 1000 MByte/s
> >>>>
> >>>> Where is the Bottleneck?
> >>> Latency serialisation, without a buffer, you can't drive the devices
> >>> to 100%. With buffered IO (or high queue depths) I can max out the 
> >>> journals.
> >>>
> >>>> A FIO Test from Sebastien Han give us 400 MByte/s raw performance from 
> >>>> the P3700.
> >>>>
> >>>> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-yo
> >>>> ur -ssd-is-suitable-as-a-journal-device/
> >>>>
> >>>> How could it be that the rbd client performance is 50% slower?
> >>>>
> >>>> Regards
> >>>>
> >>>>
> >>>> Am 21.07.16 um 12:15 schrieb Nick Fisk:
> >>>>> I've had a lot of pain with this, smaller block sizes are even worse.
> >>>>> You want to try and minimize latency at every point as there is no
> >>>>> buffering happening in the iSCSI stack. This means:-
> >>>>>
> >>>>> 1. Fast journals (NVME or NVRAM)
> >>>>> 2. 10GB or better networking
> >>>>> 3. Fast CPU's (Ghz)
> >>>>> 4. Fix CPU c-state's to C1
> >>>>> 5. Fix CPU's Freq to max
> >>>>>

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread w...@globe.de

What I don't really understand is:

Let's say the Intel P3700 works at 200 MByte/s with a one-thread
rados bench... See Nick's results below...

If we have multiple OSD nodes. For example 10 nodes.

Every node has exactly 1x P3700 NVMe built in.

Why is the single-thread performance still exactly 200 MByte/s on the rbd
client with a 10 OSD node cluster???

I think it should be 10 nodes * 200 MByte/s = 2000 MByte/s.


Everyone, look for yourself at your cluster.

dstat -D sdb,sdc,sdd,sdX 

You will see that Ceph stripes the data over all OSD's in the cluster if 
you test at the client side with rados bench...


*rados bench -p rbd 60 write -b 4M -t 1*



On 21.07.16 at 14:38, w...@globe.de wrote:

Is there not a way to enable the Linux page cache? So do not use D_Sync...

Then performance would improve dramatically.


On 21.07.16 at 14:33, Nick Fisk wrote:

-Original Message-
From: w...@globe.de [mailto:w...@globe.de]
Sent: 21 July 2016 13:23
To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Okay and what is your plan now to speed up ?
Now I have come up with a lower latency hardware design, there is not 
much further improvement until persistent RBD caching is implemented, 
as you will be moving the SSD/NVME closer to the client. But I'm 
happy with what I can achieve at the moment. You could also 
experiment with bcache on the RBD.


Would it help to put in multiple P3700 per OSD Node to improve 
performance for a single Thread (example Storage VMotion) ?
Most likely not, it's all the other parts of the puzzle which are 
causing the latency. ESXi was designed for storage arrays that 
service IO's in 100us-1ms range, Ceph is probably about 10x slower 
than this, hence the problem. Disable the BBWC on a RAID controller 
or SAN and you will see the same behaviour.



Regards


On 21.07.16 at 14:17, Nick Fisk wrote:

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
Of w...@globe.de
Sent: 21 July 2016 13:04
To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

Hmm, I think 200 MByte/s is really bad. Is your cluster in 
production right now?

It's just been built, not running yet.


So if you start a storage migration you get only 200 MByte/s right?

I wish. My current cluster (not this new one) would storage migrate at
~10-15MB/s. Serial latency is the problem, without being able to
buffer, ESXi waits on an ack for each IO before sending the next. 
Also it submits the migrations in 64kb chunks, unless you get VAAI
working. I think esxi will try and do them in parallel, which will 
help as well.

I think it would be awesome if you get 1000 MByte/s

Where is the Bottleneck?

Latency serialisation, without a buffer, you can't drive the devices
to 100%. With buffered IO (or high queue depths) I can max out the 
journals.


A FIO test from Sebastien Han gives us 400 MByte/s raw performance 
from the P3700.


https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your
-ssd-is-suitable-as-a-journal-device/

How could it be that the rbd client performance is 50% slower?

Regards


On 21.07.16 at 12:15, Nick Fisk wrote:
I've had a lot of pain with this, smaller block sizes are even 
worse.

You want to try and minimize latency at every point as there is no
buffering happening in the iSCSI stack. This means:-

1. Fast journals (NVME or NVRAM)
2. 10GB or better networking
3. Fast CPU's (Ghz)
4. Fix CPU c-state's to C1
5. Fix CPU's Freq to max

Also I can't be sure, but I think there is a metadata update
happening with VMFS, particularly if you are using thin VMDK's, this
can also be a major bottleneck. For my use case, I've switched 
over to NFS as it has given much more performance at scale and

less headache.

For the RADOS Run, here you go (400GB P3700):

Total time run: 60.026491
Total writes made:  3104
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 206.842
Stddev Bandwidth:   8.10412
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 180
Average IOPS:   51
Stddev IOPS:2
Max IOPS:   56
Min IOPS:   45
Average Latency(s): 0.0193366
Stddev Latency(s):  0.00148039
Max latency(s): 0.0377946
Min latency(s): 0.015909

Nick


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
Behalf Of Horace
Sent: 21 July 2016 10:26
To: w...@globe.de
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

Same here, I've read some blog saying that vmware will frequently
verify the locking on VMFS over iSCSI, hence it will have much 
slower performance than NFS (with different locking mechanism).


Regards,
Horace Ng

- Original Message -
From: w...@globe

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread w...@globe.de

Is there not a way to enable the Linux page cache? So do not use D_Sync...

Then performance would improve dramatically.


On 21.07.16 at 14:33, Nick Fisk wrote:

-Original Message-
From: w...@globe.de [mailto:w...@globe.de]
Sent: 21 July 2016 13:23
To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Okay and what is your plan now to speed up ?

Now I have come up with a lower latency hardware design, there is not much 
further improvement until persistent RBD caching is implemented, as you will be 
moving the SSD/NVME closer to the client. But I'm happy with what I can achieve 
at the moment. You could also experiment with bcache on the RBD.


Would it help to put in multiple P3700 per OSD Node to improve performance for 
a single Thread (example Storage VMotion) ?

Most likely not, it's all the other parts of the puzzle which are causing the 
latency. ESXi was designed for storage arrays that service IO's in 100us-1ms 
range, Ceph is probably about 10x slower than this, hence the problem. Disable 
the BBWC on a RAID controller or SAN and you will see the same behaviour.


Regards


On 21.07.16 at 14:17, Nick Fisk wrote:

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
Of w...@globe.de
Sent: 21 July 2016 13:04
To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

Hmm, I think 200 MByte/s is really bad. Is your cluster in production right now?

It's just been built, not running yet.


So if you start a storage migration you get only 200 MByte/s right?

I wish. My current cluster (not this new one) would storage migrate at
~10-15MB/s. Serial latency is the problem, without being able to
buffer, ESXi waits on an ack for each IO before sending the next. Also it 
submits the migrations in 64kb chunks, unless you get VAAI

working. I think esxi will try and do them in parallel, which will help as well.

I think it would be awesome if you get 1000 MByte/s

Where is the Bottleneck?

Latency serialisation, without a buffer, you can't drive the devices
to 100%. With buffered IO (or high queue depths) I can max out the journals.


A FIO test from Sebastien Han gives us 400 MByte/s raw performance from the 
P3700.

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your
-ssd-is-suitable-as-a-journal-device/

How could it be that the rbd client performance is 50% slower?

Regards


On 21.07.16 at 12:15, Nick Fisk wrote:

I've had a lot of pain with this, smaller block sizes are even worse.
You want to try and minimize latency at every point as there is no
buffering happening in the iSCSI stack. This means:-

1. Fast journals (NVME or NVRAM)
2. 10GB or better networking
3. Fast CPU's (Ghz)
4. Fix CPU c-state's to C1
5. Fix CPU's Freq to max

Also I can't be sure, but I think there is a metadata update
happening with VMFS, particularly if you are using thin VMDK's, this
can also be a major bottleneck. For my use case, I've switched over to NFS as 
it has given much more performance at scale and

less headache.

For the RADOS Run, here you go (400GB P3700):

Total time run: 60.026491
Total writes made:  3104
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 206.842
Stddev Bandwidth:   8.10412
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 180
Average IOPS:   51
Stddev IOPS:2
Max IOPS:   56
Min IOPS:   45
Average Latency(s): 0.0193366
Stddev Latency(s):  0.00148039
Max latency(s): 0.0377946
Min latency(s): 0.015909

Nick


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
Behalf Of Horace
Sent: 21 July 2016 10:26
To: w...@globe.de
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

Same here, I've read some blog saying that vmware will frequently
verify the locking on VMFS over iSCSI, hence it will have much slower 
performance than NFS (with different locking mechanism).

Regards,
Horace Ng

- Original Message -
From: w...@globe.de
To: ceph-users@lists.ceph.com
Sent: Thursday, July 21, 2016 5:11:21 PM
Subject: [ceph-users] Ceph + VMware + Single Thread Performance

Hi everyone,

we see relatively slow single-thread performance on the iSCSI nodes of our 
cluster.


Our setup:

3 Racks:

18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache off).

2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x
WD Red 1TB per Data Node as OSD.

Replication = 3

chooseleaf = 3 type Rack in the crush map


We get only ca. 90 MByte/s on the iscsi Gateway Servers with:

rados bench -p rbd 60 write -b 4M -t 1


If we test with:

rados bench -p rbd 60 write -b 4M -t 32

we get ca. 600 - 700 MByte

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Nick Fisk
 

From: Jake Young [mailto:jak3...@gmail.com] 
Sent: 21 July 2016 13:24
To: n...@fisk.me.uk; w...@globe.de
Cc: Horace Ng <hor...@hkisl.net>; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

My workaround to your single threaded performance issue was to increase the 
thread count of the tgtd process (I added --nr_iothreads=128 as an argument to 
tgtd).  This does help my workload.  

 

FWIW below are my rados bench numbers from my cluster with 1 thread:

 

This first one is a "cold" run. This is a test pool, and it's not in use.  This 
is the first time I've written to it in a week (but I have written to it 
before). 

 

Total time run: 60.049311

Total writes made:  1196

Write size: 4194304

Bandwidth (MB/sec): 79.668

 

Stddev Bandwidth:   80.3998

Max bandwidth (MB/sec): 208

Min bandwidth (MB/sec): 0

Average Latency:0.0502066

Stddev Latency: 0.47209

Max latency:12.9035

Min latency:0.013051

 

This next one is the 6th run. I honestly don't understand why there is such a 
huge performance difference. 

 

Total time run: 60.042933

Total writes made:  2980

Write size: 4194304

Bandwidth (MB/sec): 198.525

 

Stddev Bandwidth:   32.129

Max bandwidth (MB/sec): 224

Min bandwidth (MB/sec): 0

Average Latency:0.0201471

Stddev Latency: 0.0126896

Max latency:0.265931

Min latency:0.013211

 

 

75 OSDs, all 2TB SAS spinners.  There are 9 OSD servers, each with a 2GB BBU RAID 
cache.

 

I have tuned my CPU c-state and freq to max, I have 8x 2.5GHz cores, so just 
about one core per OSD. I have 40G networking.  I don't use journals, but I 
have the RAID cache enabled.

 

 

Nick,

 

What NFS server are you using?

 

The kernel one. Seems to be working really well so far after I got past the XFS 
fragmentation issues, I had to set an extent size hint of 16mb at the root.
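
A hedged illustration (not part of the original mail): setting a 16MB extent size hint on the root of the exported filesystem; the mount point is an assumption.

    xfs_io -c "extsize 16m" /srv/nfs    # new files under this directory inherit the hint
    xfs_io -c "extsize" /srv/nfs        # verify the current hint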

 

 

Jake 

 


On Thursday, July 21, 2016, Nick Fisk <n...@fisk.me.uk> wrote:

I've had a lot of pain with this, smaller block sizes are even worse. You want 
to try and minimize latency at every point as there
is no buffering happening in the iSCSI stack. This means:-

1. Fast journals (NVME or NVRAM)
2. 10GB or better networking
3. Fast CPU's (Ghz)
4. Fix CPU c-state's to C1
5. Fix CPU's Freq to max

Also I can't be sure, but I think there is a metadata update happening with 
VMFS, particularly if you are using thin VMDK's, this
can also be a major bottleneck. For my use case, I've switched over to NFS as 
it has given much more performance at scale and less
headache.

For the RADOS Run, here you go (400GB P3700):

Total time run: 60.026491
Total writes made:  3104
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 206.842
Stddev Bandwidth:   8.10412
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 180
Average IOPS:   51
Stddev IOPS:2
Max IOPS:   56
Min IOPS:   45
Average Latency(s): 0.0193366
Stddev Latency(s):  0.00148039
Max latency(s): 0.0377946
Min latency(s): 0.015909

Nick

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Horace
> Sent: 21 July 2016 10:26
> To: w...@globe.de
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
> Hi,
>
> Same here, I've read some blog saying that vmware will frequently verify the 
> locking on VMFS over iSCSI, hence it will have much
> slower performance than NFS (with different locking mechanism).
>
> Regards,
> Horace Ng
>
> - Original Message -
> From: w...@globe.de
> To: ceph-users@lists.ceph.com
> Sent: Thursday, July 21, 2016 5:11:21 PM
> Subject: [ceph-users] Ceph + VMware + Single Thread Performance
>
> Hi everyone,
>
> we see at our cluster relatively slow Single Thread Performance on the iscsi 
> Nodes.
>
>
> Our setup:
>
> 3 Racks:
>
> 18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache off).
>
> 2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD
> Red 1TB per Data Node as OSD.
>
> Replication = 3
>
> chooseleaf = 3 type Rack in the crush map
>
>
> We get only ca. 90 MByte/s on the iscsi Gateway Servers with:
>
> rados bench -p rbd 60 write -b 4M -t 1
>
>
> If we test with:
>
> rados bench -p rbd 60 write -b 4M -t 32
>
> we get ca. 600 - 700 MByte/s
>
>
> We plan to replace the Samsung SSD with Intel DC P3700 PCIe NVM'e for
> the Journal to get be

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Nick Fisk
> -Original Message-
> From: w...@globe.de [mailto:w...@globe.de]
> Sent: 21 July 2016 13:23
> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Okay and what is your plan now to speed up ?

Now I have come up with a lower latency hardware design, there is not much 
further improvement until persistent RBD caching is implemented, as you will be 
moving the SSD/NVME closer to the client. But I'm happy with what I can achieve 
at the moment. You could also experiment with bcache on the RBD.

> 
> Would it help to put in multiple P3700 per OSD Node to improve performance 
> for a single Thread (example Storage VMotion) ?

Most likely not, it's all the other parts of the puzzle which are causing the 
latency. ESXi was designed for storage arrays that service IO's in 100us-1ms 
range, Ceph is probably about 10x slower than this, hence the problem. Disable 
the BBWC on a RAID controller or SAN and you will see the same behaviour.

> 
> Regards
> 
> 
> Am 21.07.16 um 14:17 schrieb Nick Fisk:
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of w...@globe.de
> >> Sent: 21 July 2016 13:04
> >> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Hi,
> >>
> >> hmm i think 200 MByte/s is really bad. Is your Cluster in production right 
> >> now?
> > It's just been built, not running yet.
> >
> >> So if you start a storage migration you get only 200 MByte/s right?
> > I wish. My current cluster (not this new one) would storage migrate at
> > ~10-15MB/s. Serial latency is the problem, without being able to
> > buffer, ESXi waits on an ack for each IO before sending the next. Also it 
> > submits the migrations in 64kb chunks, unless you get VAAI
> working. I think esxi will try and do them in parallel, which will help as 
> well.
> >
> >> I think it would be awesome if you get 1000 MByte/s
> >>
> >> Where is the Bottleneck?
> > Latency serialisation, without a buffer, you can't drive the devices
> > to 100%. With buffered IO (or high queue depths) I can max out the journals.
> >
> >> A FIO Test from Sebastien Han give us 400 MByte/s raw performance from the 
> >> P3700.
> >>
> >> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your
> >> -ssd-is-suitable-as-a-journal-device/
> >>
> >> How could it be that the rbd client performance is 50% slower?
> >>
> >> Regards
> >>
> >>
> >> Am 21.07.16 um 12:15 schrieb Nick Fisk:
> >>> I've had a lot of pain with this, smaller block sizes are even worse.
> >>> You want to try and minimize latency at every point as there is no
> >>> buffering happening in the iSCSI stack. This means:-
> >>>
> >>> 1. Fast journals (NVME or NVRAM)
> >>> 2. 10GB or better networking
> >>> 3. Fast CPU's (Ghz)
> >>> 4. Fix CPU c-state's to C1
> >>> 5. Fix CPU's Freq to max
> >>>
> >>> Also I can't be sure, but I think there is a metadata update
> >>> happening with VMFS, particularly if you are using thin VMDK's, this
> >>> can also be a major bottleneck. For my use case, I've switched over to 
> >>> NFS as it has given much more performance at scale and
> less headache.
> >>>
> >>> For the RADOS Run, here you go (400GB P3700):
> >>>
> >>> Total time run: 60.026491
> >>> Total writes made:  3104
> >>> Write size: 4194304
> >>> Object size:4194304
> >>> Bandwidth (MB/sec): 206.842
> >>> Stddev Bandwidth:   8.10412
> >>> Max bandwidth (MB/sec): 224
> >>> Min bandwidth (MB/sec): 180
> >>> Average IOPS:   51
> >>> Stddev IOPS:2
> >>> Max IOPS:   56
> >>> Min IOPS:   45
> >>> Average Latency(s): 0.0193366
> >>> Stddev Latency(s):  0.00148039
> >>> Max latency(s): 0.0377946
> >>> Min latency(s): 0.015909
> >>>
> >>> Nick
> >>>
> >>>> -Original Message-
> >>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> >>>> Behalf Of Hora

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Jake Young
My workaround to your single threaded performance issue was to increase the
thread count of the tgtd process (I added --nr_iothreads=128 as an argument
to tgtd).  This does help my workload.
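
A hedged sketch of how that flag can be applied (the flag itself is from the mail; the sysconfig path is an assumption and varies by distro):

    # one-off, on the gateway node
    tgtd --nr_iothreads=128
    # or persistently, e.g. in /etc/sysconfig/tgtd (assumed location):
    # TGTD_OPTS="--nr_iothreads=128"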

FWIW below are my rados bench numbers from my cluster with 1 thread:

This first one is a "cold" run. This is a test pool, and it's not in use.
This is the first time I've written to it in a week (but I have written to
it before).

Total time run: 60.049311
Total writes made:  1196
Write size: 4194304
Bandwidth (MB/sec): 79.668

Stddev Bandwidth:   80.3998
Max bandwidth (MB/sec): 208
Min bandwidth (MB/sec): 0
Average Latency:0.0502066
Stddev Latency: 0.47209
Max latency:12.9035
Min latency:0.013051

This next one is the 6th run. I honestly don't understand why there is such
a huge performance difference.

Total time run: 60.042933
Total writes made:  2980
Write size: 4194304
Bandwidth (MB/sec): 198.525

Stddev Bandwidth:   32.129
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 0
Average Latency:0.0201471
Stddev Latency: 0.0126896
Max latency:0.265931
Min latency:0.013211


75 OSDs, all 2TB SAS spinners.  There are 9 OSD servers, each with a 2GB
BBU RAID cache.

I have tuned my CPU c-state and freq to max, I have 8x 2.5GHz cores, so
just about one core per OSD. I have 40G networking.  I don't use journals,
but I have the RAID cache enabled.


Nick,

What NFS server are you using?

Jake


On Thursday, July 21, 2016, Nick Fisk <n...@fisk.me.uk> wrote:

> I've had a lot of pain with this, smaller block sizes are even worse. You
> want to try and minimize latency at every point as there
> is no buffering happening in the iSCSI stack. This means:-
>
> 1. Fast journals (NVME or NVRAM)
> 2. 10GB or better networking
> 3. Fast CPU's (Ghz)
> 4. Fix CPU c-state's to C1
> 5. Fix CPU's Freq to max
>
> Also I can't be sure, but I think there is a metadata update happening
> with VMFS, particularly if you are using thin VMDK's, this
> can also be a major bottleneck. For my use case, I've switched over to NFS
> as it has given much more performance at scale and less
> headache.
>
> For the RADOS Run, here you go (400GB P3700):
>
> Total time run: 60.026491
> Total writes made:  3104
> Write size: 4194304
> Object size:4194304
> Bandwidth (MB/sec): 206.842
> Stddev Bandwidth:   8.10412
> Max bandwidth (MB/sec): 224
> Min bandwidth (MB/sec): 180
> Average IOPS:   51
> Stddev IOPS:2
> Max IOPS:   56
> Min IOPS:   45
> Average Latency(s): 0.0193366
> Stddev Latency(s):  0.00148039
> Max latency(s): 0.0377946
> Min latency(s): 0.015909
>
> Nick
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Horace
> > Sent: 21 July 2016 10:26
> > To: w...@globe.de
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >
> > Hi,
> >
> > Same here, I've read some blog saying that vmware will frequently verify
> the locking on VMFS over iSCSI, hence it will have much
> > slower performance than NFS (with different locking mechanism).
> >
> > Regards,
> > Horace Ng
> >
> > - Original Message -
> > From: w...@globe.de
> > To: ceph-users@lists.ceph.com
> > Sent: Thursday, July 21, 2016 5:11:21 PM
> > Subject: [ceph-users] Ceph + VMware + Single Thread Performance
> >
> > Hi everyone,
> >
> > we see at our cluster relatively slow Single Thread Performance on the
> iscsi Nodes.
> >
> >
> > Our setup:
> >
> > 3 Racks:
> >
> > 18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache
> off).
> >
> > 2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD
> > Red 1TB per Data Node as OSD.
> >
> > Replication = 3
> >
> > chooseleaf = 3 type Rack in the crush map
> >
> >
> > We get only ca. 90 MByte/s on the iscsi Gateway Servers with:
> >
> > rados bench -p rbd 60 write -b 4M -t 1
> >
> >
> > If we test with:
> >
> > rados bench -p rbd 60 write -b 4M -t 32
> >
> > we get ca. 600 - 700 MByte/s
> >
> >
> > We plan to replace the Samsung SSD with Intel DC P3700 PCIe NVM'e for
> > the Journal to get better Single Thread Performance.
> >
> > Is anyone of you out there who has an Intel P370

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread w...@globe.de

Okay, and what is your plan now to speed things up?

Would it help to put in multiple P3700 per OSD Node to improve 
performance for a single Thread (example Storage VMotion) ?


Regards


On 21.07.16 at 14:17, Nick Fisk wrote:

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
w...@globe.de
Sent: 21 July 2016 13:04
To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

Hmm, I think 200 MByte/s is really bad. Is your cluster in production right now?

It's just been built, not running yet.


So if you start a storage migration you get only 200 MByte/s right?

I wish. My current cluster (not this new one) would storage migrate at 
~10-15MB/s. Serial latency is the problem, without being able
to buffer, ESXi waits on an ack for each IO before sending the next. Also it 
submits the migrations in 64kb chunks, unless you get
VAAI working. I think esxi will try and do them in parallel, which will help as 
well.


I think it would be awesome if you get 1000 MByte/s

Where is the Bottleneck?

Latency serialisation, without a buffer, you can't drive the devices to 100%. 
With buffered IO (or high queue depths) I can max out
the journals.


A FIO test from Sebastien Han gives us 400 MByte/s raw performance from the 
P3700.

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

How could it be that the rbd client performance is 50% slower?

Regards


On 21.07.16 at 12:15, Nick Fisk wrote:

I've had a lot of pain with this, smaller block sizes are even worse.
You want to try and minimize latency at every point as there is no
buffering happening in the iSCSI stack. This means:-

1. Fast journals (NVME or NVRAM)
2. 10GB or better networking
3. Fast CPU's (Ghz)
4. Fix CPU c-state's to C1
5. Fix CPU's Freq to max

Also I can't be sure, but I think there is a metadata update happening
with VMFS, particularly if you are using thin VMDK's, this can also be
a major bottleneck. For my use case, I've switched over to NFS as it has given 
much more performance at scale and less headache.

For the RADOS Run, here you go (400GB P3700):

Total time run: 60.026491
Total writes made:  3104
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 206.842
Stddev Bandwidth:   8.10412
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 180
Average IOPS:   51
Stddev IOPS:2
Max IOPS:   56
Min IOPS:   45
Average Latency(s): 0.0193366
Stddev Latency(s):  0.00148039
Max latency(s): 0.0377946
Min latency(s): 0.015909

Nick


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
Of Horace
Sent: 21 July 2016 10:26
To: w...@globe.de
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

Same here, I've read some blog saying that vmware will frequently
verify the locking on VMFS over iSCSI, hence it will have much slower 
performance than NFS (with different locking mechanism).

Regards,
Horace Ng

- Original Message -
From: w...@globe.de
To: ceph-users@lists.ceph.com
Sent: Thursday, July 21, 2016 5:11:21 PM
Subject: [ceph-users] Ceph + VMware + Single Thread Performance

Hi everyone,

we see relatively slow single-thread performance on the iSCSI nodes of our 
cluster.


Our setup:

3 Racks:

18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache off).

2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD
Red 1TB per Data Node as OSD.

Replication = 3

chooseleaf = 3 type Rack in the crush map


We get only ca. 90 MByte/s on the iscsi Gateway Servers with:

rados bench -p rbd 60 write -b 4M -t 1


If we test with:

rados bench -p rbd 60 write -b 4M -t 32

we get ca. 600 - 700 MByte/s


We plan to replace the Samsung SSD with Intel DC P3700 PCIe NVMe for
the Journal to get better Single Thread Performance.

Is anyone out there who has an Intel P3700 as a journal and can
give me test results with:


rados bench -p rbd 60 write -b 4M -t 1


Thank you very much !!

Kind Regards !!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> w...@globe.de
> Sent: 21 July 2016 13:04
> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Hi,
> 
> hmm i think 200 MByte/s is really bad. Is your Cluster in production right 
> now?

It's just been built, not running yet.

> 
> So if you start a storage migration you get only 200 MByte/s right?

I wish. My current cluster (not this new one) would storage migrate at 
~10-15MB/s. Serial latency is the problem, without being able
to buffer, ESXi waits on an ack for each IO before sending the next. Also it 
submits the migrations in 64kb chunks, unless you get
VAAI working. I think esxi will try and do them in parallel, which will help as 
well.

> 
> I think it would be awesome if you get 1000 MByte/s
> 
> Where is the Bottleneck?

Latency serialisation, without a buffer, you can't drive the devices to 100%. 
With buffered IO (or high queue depths) I can max out
the journals.

> 
> A FIO Test from Sebastien Han give us 400 MByte/s raw performance from the 
> P3700.
> 
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> 
> How could it be that the rbd client performance is 50% slower?
> 
> Regards
> 
> 
> Am 21.07.16 um 12:15 schrieb Nick Fisk:
> > I've had a lot of pain with this, smaller block sizes are even worse.
> > You want to try and minimize latency at every point as there is no
> > buffering happening in the iSCSI stack. This means:-
> >
> > 1. Fast journals (NVME or NVRAM)
> > 2. 10GB or better networking
> > 3. Fast CPU's (Ghz)
> > 4. Fix CPU c-state's to C1
> > 5. Fix CPU's Freq to max
> >
> > Also I can't be sure, but I think there is a metadata update happening
> > with VMFS, particularly if you are using thin VMDK's, this can also be
> > a major bottleneck. For my use case, I've switched over to NFS as it has 
> > given much more performance at scale and less headache.
> >
> > For the RADOS Run, here you go (400GB P3700):
> >
> > Total time run: 60.026491
> > Total writes made:  3104
> > Write size: 4194304
> > Object size:4194304
> > Bandwidth (MB/sec): 206.842
> > Stddev Bandwidth:   8.10412
> > Max bandwidth (MB/sec): 224
> > Min bandwidth (MB/sec): 180
> > Average IOPS:   51
> > Stddev IOPS:2
> > Max IOPS:   56
> > Min IOPS:   45
> > Average Latency(s): 0.0193366
> > Stddev Latency(s):  0.00148039
> > Max latency(s): 0.0377946
> > Min latency(s): 0.015909
> >
> > Nick
> >
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of Horace
> >> Sent: 21 July 2016 10:26
> >> To: w...@globe.de
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Hi,
> >>
> >> Same here, I've read some blog saying that vmware will frequently
> >> verify the locking on VMFS over iSCSI, hence it will have much slower 
> >> performance than NFS (with different locking mechanism).
> >>
> >> Regards,
> >> Horace Ng
> >>
> >> - Original Message -
> >> From: w...@globe.de
> >> To: ceph-users@lists.ceph.com
> >> Sent: Thursday, July 21, 2016 5:11:21 PM
> >> Subject: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Hi everyone,
> >>
> >> we see at our cluster relatively slow Single Thread Performance on the 
> >> iscsi Nodes.
> >>
> >>
> >> Our setup:
> >>
> >> 3 Racks:
> >>
> >> 18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache 
> >> off).
> >>
> >> 2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD
> >> Red 1TB per Data Node as OSD.
> >>
> >> Replication = 3
> >>
> >> chooseleaf = 3 type Rack in the crush map
> >>
> >>
> >> We get only ca. 90 MByte/s on the iscsi Gateway Servers with:
> >>
> >> rados bench -p rbd 60 write -b 4M -t 1
> >>
> >>
> >> If we test with:
> >>
> >> rados bench -p rbd 60 write -b 4M -t 32
> >>
> >> we get ca. 600 - 700 MByte/s
> >>

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread w...@globe.de

Hi,

Hmm, I think 200 MByte/s is really bad. Is your cluster in production 
right now?


So if you start a storage migration you get only 200 MByte/s right?

I think it would be awesome if you get 1000 MByte/s

Where is the Bottleneck?

A FIO test from Sebastien Han gives us 400 MByte/s raw performance from 
the P3700.


https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
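
A hedged sketch of the kind of synchronous write test that post describes (the exact flags in the article may differ, and /dev/nvme0n1 is an assumed device; this overwrites it):

    fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4M --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=journal-test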

How could it be that the rbd client performance is 50% slower?

Regards


On 21.07.16 at 12:15, Nick Fisk wrote:

I've had a lot of pain with this, smaller block sizes are even worse. You want 
to try and minimize latency at every point as there
is no buffering happening in the iSCSI stack. This means:-

1. Fast journals (NVME or NVRAM)
2. 10GB or better networking
3. Fast CPU's (Ghz)
4. Fix CPU c-state's to C1
5. Fix CPU's Freq to max

Also I can't be sure, but I think there is a metadata update happening with 
VMFS, particularly if you are using thin VMDK's, this
can also be a major bottleneck. For my use case, I've switched over to NFS as 
it has given much more performance at scale and less
headache.

For the RADOS Run, here you go (400GB P3700):

Total time run: 60.026491
Total writes made:  3104
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 206.842
Stddev Bandwidth:   8.10412
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 180
Average IOPS:   51
Stddev IOPS:2
Max IOPS:   56
Min IOPS:   45
Average Latency(s): 0.0193366
Stddev Latency(s):  0.00148039
Max latency(s): 0.0377946
Min latency(s): 0.015909

Nick


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Horace
Sent: 21 July 2016 10:26
To: w...@globe.de
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

Same here, I've read some blog saying that vmware will frequently verify the 
locking on VMFS over iSCSI, hence it will have much
slower performance than NFS (with different locking mechanism).

Regards,
Horace Ng

- Original Message -
From: w...@globe.de
To: ceph-users@lists.ceph.com
Sent: Thursday, July 21, 2016 5:11:21 PM
Subject: [ceph-users] Ceph + VMware + Single Thread Performance

Hi everyone,

we see relatively slow single-thread performance on the iSCSI nodes of our 
cluster.


Our setup:

3 Racks:

18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache off).

2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD
Red 1TB per Data Node as OSD.

Replication = 3

chooseleaf = 3 type Rack in the crush map


We get only ca. 90 MByte/s on the iscsi Gateway Servers with:

rados bench -p rbd 60 write -b 4M -t 1


If we test with:

rados bench -p rbd 60 write -b 4M -t 32

we get ca. 600 - 700 MByte/s


We plan to replace the Samsung SSD with Intel DC P3700 PCIe NVMe for
the Journal to get better Single Thread Performance.

Is anyone out there who has an Intel P3700 as a journal and can
give me test results with:


rados bench -p rbd 60 write -b 4M -t 1


Thank you very much !!

Kind Regards !!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Nick Fisk
I've had a lot of pain with this, smaller block sizes are even worse. You want 
to try and minimize latency at every point as there
is no buffering happening in the iSCSI stack. This means:-

1. Fast journals (NVME or NVRAM)
2. 10GB or better networking
3. Fast CPU's (Ghz)
4. Fix CPU c-state's to C1
5. Fix CPU's Freq to max
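
A hedged sketch of points 4 and 5 above (none of this is from the mail; distro specifics vary, and the kernel parameters shown are one common way to pin C-states):

    cpupower frequency-set -g performance      # lock the governor to maximum frequency
    # limit deep C-states via kernel boot parameters, then update grub and reboot:
    #   intel_idle.max_cstate=1 processor.max_cstate=1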

Also I can't be sure, but I think there is a metadata update happening with 
VMFS, particularly if you are using thin VMDK's, this
can also be a major bottleneck. For my use case, I've switched over to NFS as 
it has given much more performance at scale and less
headache.
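
For illustration only (paths and network are assumptions, not from the thread), a kernel NFS export of an RBD-backed XFS mount for ESXi might look like this in /etc/exports:

    /srv/nfs  10.0.0.0/24(rw,sync,no_root_squash)

ESXi then mounts this datastore over NFS; the sync export keeps writes ordered at the cost of the latency discussed above.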

For the RADOS Run, here you go (400GB P3700):

Total time run: 60.026491
Total writes made:  3104
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 206.842
Stddev Bandwidth:   8.10412
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 180
Average IOPS:   51
Stddev IOPS:2
Max IOPS:   56
Min IOPS:   45
Average Latency(s): 0.0193366
Stddev Latency(s):  0.00148039
Max latency(s): 0.0377946
Min latency(s): 0.015909

Nick

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Horace
> Sent: 21 July 2016 10:26
> To: w...@globe.de
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Hi,
> 
> Same here, I've read some blog saying that vmware will frequently verify the 
> locking on VMFS over iSCSI, hence it will have much
> slower performance than NFS (with different locking mechanism).
> 
> Regards,
> Horace Ng
> 
> - Original Message -
> From: w...@globe.de
> To: ceph-users@lists.ceph.com
> Sent: Thursday, July 21, 2016 5:11:21 PM
> Subject: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Hi everyone,
> 
> we see at our cluster relatively slow Single Thread Performance on the iscsi 
> Nodes.
> 
> 
> Our setup:
> 
> 3 Racks:
> 
> 18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache off).
> 
> 2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD
> Red 1TB per Data Node as OSD.
> 
> Replication = 3
> 
> chooseleaf = 3 type Rack in the crush map
> 
> 
> We get only ca. 90 MByte/s on the iscsi Gateway Servers with:
> 
> rados bench -p rbd 60 write -b 4M -t 1
> 
> 
> If we test with:
> 
> rados bench -p rbd 60 write -b 4M -t 32
> 
> we get ca. 600 - 700 MByte/s
> 
> 
> We plan to replace the Samsung SSD with Intel DC P3700 PCIe NVM'e for
> the Journal to get better Single Thread Performance.
> 
> Is anyone of you out there who has an Intel P3700 for Journal an can
> give me back test results with:
> 
> 
> rados bench -p rbd 60 write -b 4M -t 1
> 
> 
> Thank you very much !!
> 
> Kind Regards !!
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Horace
Hi,

Same here, I've read some blog saying that vmware will frequently verify the 
locking on VMFS over iSCSI, hence it will have much slower performance than NFS 
(with different locking mechanism).

Regards,
Horace Ng

- Original Message -
From: w...@globe.de
To: ceph-users@lists.ceph.com
Sent: Thursday, July 21, 2016 5:11:21 PM
Subject: [ceph-users] Ceph + VMware + Single Thread Performance

Hi everyone,

we see relatively slow single-thread performance on the 
iSCSI nodes of our cluster.


Our setup:

3 Racks:

18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache off).

2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD 
Red 1TB per Data Node as OSD.

Replication = 3

chooseleaf = 3 type Rack in the crush map
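
As a hedged aside (not from the mail), a rack-level replication rule of that kind typically looks roughly like this in a decompiled CRUSH map; the rule name and numbers are assumptions:

    rule replicated_rack {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
    }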


We get only ca. 90 MByte/s on the iscsi Gateway Servers with:

rados bench -p rbd 60 write -b 4M -t 1


If we test with:

rados bench -p rbd 60 write -b 4M -t 32

we get ca. 600 - 700 MByte/s


We plan to replace the Samsung SSD with Intel DC P3700 PCIe NVMe for 
the Journal to get better Single Thread Performance.

Is anyone out there who has an Intel P3700 as a journal and can 
give me test results with:


rados bench -p rbd 60 write -b 4M -t 1


Thank you very much !!

Kind Regards !!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com