On Saturday, September 3, 2016, Alex Gorbachev <a...@iss-integration.com> wrote:
> Hi Nick,
>
> On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>
>> From: Alex Gorbachev [mailto:a...@iss-integration.com]
>> Sent: 21 August 2016 15:27
>> To: Wilhelm Redbrake <w...@globe.de>
>> Cc: n...@fisk.me.uk; Horace Ng <hor...@hkisl.net>; ceph-users <ceph-users@lists.ceph.com>
>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>
>> On Sunday, August 21, 2016, Wilhelm Redbrake <w...@globe.de> wrote:
>>
>> Hi Nick,
>> I understand all of your technical improvements. But why not use, for example, a simple Areca RAID controller with 8 GB cache and BBU on top in every Ceph node? Configure n times RAID 0 on the controller and enable write-back cache. That must be a latency "killer", like in all the proprietary storage arrays, no?
>>
>> Best Regards !!
>>
>> What we saw specifically with Areca cards is that performance is excellent in benchmarks and for bursty loads. However, once we started loading them with more constant workloads (we replicate databases and files to our Ceph cluster), this looks to have saturated the relatively small Areca NVDIMM caches and we went back to pure drive-based performance.
>>
>> Yes, I think that is a valid point. Although low latency, you are still having to write to the disks twice (journal + data), so once the caches on the cards start filling up, you are going to hit problems.
>>
>> So we built 8 new nodes with no Arecas, using M500 SSDs for journals (1 SSD per 3 HDDs), in hopes that it would help reduce the noisy-neighbor impact. That worked, but now the overall latency is really high at times, though not always. A Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS drives with too many IOPS, which sends their latency sky-high. Overall we are functioning fine, but I sure would like storage vMotion and other large operations to be faster.
>>
>> Yeah, this is the biggest pain point I think. Normal VM ops are fine, but if you ever have to move a multi-TB VM, it's just too slow.
>>
>> If you use iSCSI with VAAI and are migrating a thick-provisioned vmdk, then performance is actually quite good, as the block sizes used for the copy are a lot bigger.
>>
>> However, my use case required thin-provisioned VMs + snapshots, and I found that with iSCSI you have no control over the fragmentation of the vmdks, so it is read performance that then suffers (certainly with 7.2k disks).
>>
>> Also with thin-provisioned vmdks I think I was seeing PG contention with the updating of the VMFS metadata, although I can't be sure.
>>
>> I am thinking I will test a few different schedulers and readahead settings to see if we can improve this by parallelizing reads. Also will test NFS, but need to determine whether to do krbd/knfsd or something more interesting like CephFS/Ganesha.
>>
>> As you know, I'm on NFS now. I've found it a lot easier to get going, and a lot less sensitive to config adjustments suddenly dropping everything offline.
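The scheduler and readahead experiments Alex mentions above boil down to a couple of per-device sysfs knobs. A minimal sketch, assuming sdb is an OSD data disk and rbd0 a mapped RBD device (both names are placeholders, and the values are starting points rather than recommendations):

    # Try a different elevator on an OSD data disk (typical options: noop, deadline, cfq)
    echo deadline > /sys/block/sdb/queue/scheduler

    # Raise readahead on the spinners to help parallelize large sequential reads
    echo 4096 > /sys/block/sdb/queue/read_ahead_kb

    # The same readahead knob exists client-side for a mapped RBD device
    echo 4096 > /sys/block/rbd0/queue/read_ahead_kb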
>> The fact that you can specify the extent size on XFS helps massively with using thin vmdks/snapshots to avoid fragmentation. Storage vMotions are a bit faster than iSCSI, but I think I am hitting PG contention when ESXi tries to write 32 copy threads to the same object. There is probably some tuning that could be done here (RBD striping?), but this is the best it's been for a long time and I'm reluctant to fiddle any further.
>
> We have moved ahead and added NFS support to Storcium, and are now able to run NFS servers with Pacemaker in HA mode (all agents are public at https://github.com/akurz/resource-agents/tree/master/heartbeat). I can confirm that VM performance is definitely better and benchmarks are smoother (in Windows we can see a lot of choppiness with iSCSI; NFS is choppy on writes but smooth on reads, likely due to the bursty nature of OSD filesystems when dealing with that small IO size).
>
> Were you using extsz=16384 at creation time for the filesystem? I saw kernel memory deadlock messages during vMotion, such as:
>
> XFS: nfsd(102545) possible memory allocation deadlock size 40320 in kmem_alloc (mode:0x2400240)
>
> And analyzing fragmentation:
>
> root@roc-5r-scd218:~# xfs_db -r /dev/rbd21
> xfs_db> frag -d
> actual 0, ideal 0, fragmentation factor 0.00%
> xfs_db> frag -f
> actual 1863960, ideal 74, fragmentation factor 100.00%
>
> Just from two vMotions. Are you seeing anything similar?
>
> Found your post on setting the XFS extent size hint for sparse files:
>
> xfs_io -c "extsize 16M" /mountpoint
>
> Will test; fragmentation is definitely present without this.
>
> Thank you,
> Alex
>
>> But as mentioned above, thick vmdks with VAAI might be a really good fit.
>>
>> Thanks for your very valuable info on analysis and hw build.
>>
>> Alex
>>
>> On 21.08.2016 at 09:31, Nick Fisk <n...@fisk.me.uk> wrote:
>>
>>>> -----Original Message-----
>>>> From: Alex Gorbachev [mailto:a...@iss-integration.com]
>>>> Sent: 21 August 2016 04:15
>>>> To: Nick Fisk <n...@fisk.me.uk>
>>>> Cc: w...@globe.de; Horace Ng <hor...@hkisl.net>; ceph-users <ceph-users@lists.ceph.com>
>>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>>>
>>>> Hi Nick,
>>>>
>>>> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk <n...@fisk.me.uk> wrote:
>>>>>> -----Original Message-----
>>>>>> From: w...@globe.de [mailto:w...@globe.de]
>>>>>> Sent: 21 July 2016 13:23
>>>>>> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
>>>>>> Cc: ceph-users@lists.ceph.com
>>>>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>>>>>
>>>>>> Okay, and what is your plan now to speed up?
>>>>>
>>>>> Now that I have come up with a lower-latency hardware design, there is not much further improvement to be had until persistent RBD caching is implemented, as that will move the SSD/NVMe closer to the client. But I'm happy with what I can achieve at the moment. You could also experiment with bcache on the RBD.
>>>>
>>>> Reviving this thread, would you be willing to share the details of the low-latency hardware design? Are you optimizing for NFS or iSCSI?
>>>
>>> Both really, just trying to get the write latency as low as possible. As you know, VMware does everything with lots of unbuffered small IOs, e.g. when you migrate a VM or as thin vmdks grow.
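For anyone wanting to try the XFS extent size hint discussed above: it can be set on the directory holding the sparse vmdks, and newly created files inherit it; the hint only affects future allocations, not existing extents. A sketch, assuming the filesystem is mounted at /srv/nfs/vmware (path is a placeholder):

    # Set a 16M extent size hint on the directory; new files underneath inherit it
    xfs_io -c "extsize 16m" /srv/nfs/vmware

    # Report the current hint (prints the value in bytes)
    xfs_io -c extsize /srv/nfs/vmware

    # Re-check file fragmentation afterwards, as in the xfs_db output above
    xfs_db -r -c "frag -f" /dev/rbd21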
>>> Even with storage vMotions, which might kick off 32 threads: as they all roughly fall on the same PG, there still appears to be a bottleneck from contention on the PG itself.
>>>
>>> These were the sort of things I was trying to optimise for, to make the time spent in Ceph as minimal as possible for each IO.
>>>
>>> So, on to the hardware. Through reading various threads and experimenting on my own, I came to the following conclusions:
>>>
>>> - You need the highest possible frequency on the CPU cores, which normally also means fewer of them.
>>> - Dual sockets are probably bad and will impact performance.
>>> - Use NVMe's for journals to minimise latency.
>>>
>>> The end result was OSD nodes based on a 3.5GHz Xeon E3v5 with an Intel P3700 for a journal. I used the SuperMicro X11SSH-CTF board, which has 10G-T onboard as well as 8x SATA and 8x SAS, so no expansion cards are required. As well as being very performant for Ceph, this design also works out very cheap, as you are using low-end server parts. The whole lot plus 12x 7.2k disks all goes into a 1U case.
>>>
>>> During testing I noticed that by default, c-states and p-states slaughter performance. After forcing the max c-state to 1 and forcing the CPU frequency up to max, I was seeing 600us latency for a 4kb write to a 3x-replica pool, or around 1600 IOPS, at QD=1.
>>>
>>> A few other observations:
>>> 1. Power usage is around 150-200W for this config with 12x 7.2k disks.
>>> 2. CPU usage when maxing out the disks is only around 10-15%, so there is plenty of headroom for more disks.
>>> 3. Note for the above: don't include iowait when looking at CPU usage.
>>> 4. No idea about CPU load for pure-SSD nodes, but based on the current disks, you could maybe expect ~10000 IOPS per node before maxing out the CPUs.
>>> 5. A single NVMe seems to be able to journal 12 disks with no problem during normal operation; no doubt a specific benchmark could max it out, though.
>>> 6. There are slightly faster Xeon E3's, but price/performance = diminishing returns.
>>>
>>> Hope that answers all your questions.
>>> Nick
>>>
>>>> Thank you,
>>>> Alex
>>>>
>>>>>> Would it help to put multiple P3700s per OSD node to improve performance for a single thread (for example, storage vMotion)?
>>>>>
>>>>> Most likely not; it's all the other parts of the puzzle which are causing the latency. ESXi was designed for storage arrays that service IOs in the 100us-1ms range. Ceph is probably about 10x slower than this, hence the problem. Disable the BBWC on a RAID controller or SAN and you will see the same behaviour.
>>>>>
>>>>>> Regards
>>>>>>
>>>>>> On 21.07.16 at 14:17, Nick Fisk wrote:
>>>>>>>> -----Original Message-----
>>>>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of w...@globe.de
>>>>>>>> Sent: 21 July 2016 13:04
>>>>>>>> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
>>>>>>>> Cc: ceph-users@lists.ceph.com
>>>>>>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Hmm, I think 200 MByte/s is really bad. Is your cluster in production right now?
>>>>>>>
>>>>>>> It's just been built, not running yet.
>>>>>>>
>>>>>>>> So if you start a storage migration you get only 200 MByte/s, right?
>>>>>>>
>>>>>>> I wish. My current cluster (not this new one) would storage-migrate at ~10-15MB/s.
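On the c-state/p-state tuning mentioned further up: a sketch of one common way to pin them down, assuming an Intel box (parameter and tool names vary by distro and kernel):

    # Limit C-states via kernel boot parameters (takes effect after a reboot):
    #   intel_idle.max_cstate=1 processor.max_cstate=1

    # Force the performance governor on all cores at runtime
    cpupower frequency-set -g performance

    # Confirm the cores are actually sitting at max frequency
    grep "cpu MHz" /proc/cpuinfo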
>>>>>>> Serial latency is the problem: without being able to buffer, ESXi waits on an ack for each IO before sending the next. Also, it submits the migrations in 64kb chunks unless you get VAAI working. I think ESXi will try and do them in parallel, which will help as well.
>>>>>>>
>>>>>>>> I think it would be awesome if you got 1000 MByte/s.
>>>>>>>>
>>>>>>>> Where is the bottleneck?
>>>>>>>
>>>>>>> Latency serialisation: without a buffer, you can't drive the devices to 100%. With buffered IO (or high queue depths) I can max out the journals.
>>>>>>>
>>>>>>>> A fio test from Sebastien Han gives us 400 MByte/s raw performance from the P3700:
>>>>>>>>
>>>>>>>> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>>>>>>>
>>>>>>>> How could it be that the rbd client performance is 50% slower?
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> On 21.07.16 at 12:15, Nick Fisk wrote:
>>>>>>>>
>>>>>>>>> I've had a lot of pain with this; smaller block sizes are even worse. You want to try and minimize latency at every point, as there is no buffering happening in the iSCSI stack. This means:
>>>>>>>>>
>>>>>>>>> 1. Fast journals (NVMe or NVRAM)
>>>>>>>>> 2. 10Gb or better networking
>>>>>>>>> 3. Fast CPUs (GHz)
>>>>>>>>> 4. Fix CPU c-states to C1
>>>>>>>>> 5. Fix CPU frequency to max
>>>>>>>>>
>>>>>>>>> Also, I can't be sure, but I think there is a metadata update happening with VMFS, particularly if you are using thin VMDKs; this can also be a major bottleneck. For my use case I've switched over to NFS, as it has given much more performance at scale and less headache.
>>>>>>>>>
>>>>>>>>> For the RADOS run, here you go (400GB P3700):
>>>>>>>>>
>>>>>>>>> Total time run:         60.026491
>>>>>>>>> Total writes made:      3104
>>>>>>>>> Write size:             4194304
>>>>>>>>> Object size:            4194304
>>>>>>>>> Bandwidth (MB/sec):     206.842
>>>>>>>>> Stddev Bandwidth:       8.10412
>>>>>>>>> Max bandwidth (MB/sec): 224
>>>>>>>>> Min bandwidth (MB/sec): 180
>>>>>>>>> Average IOPS:           51
>>>>>>>>> Stddev IOPS:            2
>>>>>>>>> Max IOPS:               56
>>>>>>>>> Min IOPS:               45
>>>>>>>>> Average Latency(s):     0.0193366
>>>>>>>>> Stddev Latency(s):      0.00148039
>>>>>>>>> Max latency(s):         0.0377946
>>>>>>>>> Min latency(s):         0.015909
>>>>>>>>>
>>>>>>>>> Nick
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Horace
>>>>>>>>>> Sent: 21 July 2016 10:26
>>>>>>>>>> To: w...@globe.de
>>>>>>>>>> Cc: ceph-users@lists.ceph.com
>>>>>>>>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Same here. I've read some blogs saying that VMware will frequently verify the locking on VMFS over iSCSI, hence it has much slower performance than NFS (which uses a different locking mechanism).
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Horace Ng
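The Sebastien Han post referenced above measures raw journal performance with O_DSYNC writes at queue depth 1, which mimics what the Ceph journal does. From memory, the test is along these lines (the device name is a placeholder, and note this writes directly to the device, so only point it at an unused disk):

    fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k \
        --numjobs=1 --iodepth=1 --runtime=60 --time_based \
        --group_reporting --name=journal-test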
>>>>>>>>>> ----- Original Message -----
>>>>>>>>>> From: w...@globe.de
>>>>>>>>>> To: ceph-users@lists.ceph.com
>>>>>>>>>> Sent: Thursday, July 21, 2016 5:11:21 PM
>>>>>>>>>> Subject: [ceph-users] Ceph + VMware + Single Thread Performance
>>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> we are seeing relatively slow single-thread performance on the iSCSI nodes of our cluster.
>>>>>>>>>>
>>>>>>>>>> Our setup:
>>>>>>>>>>
>>>>>>>>>> 3 racks:
>>>>>>>>>>
>>>>>>>>>> 18x data nodes, 3 mon nodes, 3 iSCSI gateway nodes with tgt (rbd cache off).
>>>>>>>>>>
>>>>>>>>>> 2x Samsung SM863 enterprise SSDs for journals (3 OSDs per SSD) and 6x WD Red 1TB per data node as OSDs.
>>>>>>>>>>
>>>>>>>>>> Replication = 3
>>>>>>>>>>
>>>>>>>>>> chooseleaf = 3 type rack in the crush map
>>>>>>>>>>
>>>>>>>>>> We get only ca. 90 MByte/s on the iSCSI gateway servers with:
>>>>>>>>>>
>>>>>>>>>> rados bench -p rbd 60 write -b 4M -t 1
>>>>>>>>>>
>>>>>>>>>> If we test with:
>>>>>>>>>>
>>>>>>>>>> rados bench -p rbd 60 write -b 4M -t 32
>>>>>>>>>>
>>>>>>>>>> we get ca. 600-700 MByte/s.
>>>>>>>>>>
>>>>>>>>>> We plan to replace the Samsung SSDs with Intel DC P3700 PCIe NVMe for the journal to get better single-thread performance.
>>>>>>>>>>
>>>>>>>>>> Is there anyone out there who has an Intel P3700 for the journal and can give me test results for:
>>>>>>>>>>
>>>>>>>>>> rados bench -p rbd 60 write -b 4M -t 1
>>>>>>>>>>
>>>>>>>>>> Thank you very much !!
>>>>>>>>>>
>>>>>>>>>> Kind Regards !!
>>>>
>>>> --
>>>> Alex Gorbachev
>>>> Storcium
>
> --
> Alex Gorbachev
> Storcium
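For reference, the rack-level chooseleaf rule described in the original post would look roughly like this in a decompiled CRUSH map (rule name and ruleset number are made up; a sketch, not the poster's actual map):

    rule replicated_rack {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type rack
            step emit
    }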
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com