On Saturday, September 3, 2016, Alex Gorbachev <a...@iss-integration.com> wrote:
> Hi Nick,
>
> On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>
>> From: Alex Gorbachev [mailto:a...@iss-integration.com]
>> Sent: 21 August 2016 15:27
>> To: Wilhelm Redbrake <w...@globe.de>
>> Cc: n...@fisk.me.uk; Horace Ng <hor...@hkisl.net>; ceph-users <ceph-users@lists.ceph.com>
>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>
>> On Sunday, August 21, 2016, Wilhelm Redbrake <w...@globe.de> wrote:
>>
>> Hi Nick,
>> I understand all of your technical improvements. But why not use, for example, a simple Areca RAID controller with 8 GB cache and BBU on top in every Ceph node? Configure n times RAID 0 on the controller and enable write-back cache. That must be a latency "killer", like in all the proprietary storage arrays, no?
>>
>> Best Regards !!
>>
>> What we saw specifically with Areca cards is that performance is excellent in benchmarks and for bursty loads. However, once we started loading them with more constant workloads (we replicate databases and files to our Ceph cluster), this looks to have saturated the relatively small Areca NVDIMM caches and we went back to pure drive-based performance.
>>
>> Yes, I think that is a valid point. Although low latency, you are still having to write to the disks twice (journal + data), so once the caches on the cards start filling up, you are going to hit problems.
>>
>> So we built 8 new nodes with no Arecas, using M500 SSDs for journals (1 SSD per 3 HDDs), in hopes that it would help reduce the noisy-neighbor impact. That worked, but now the overall latency is really high at times, though not always. A Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS drives with too many IOPS, which sends their latency sky-high. Overall we are functioning fine, but I sure would like storage vMotion and other large operations to be faster.
>>
>> Yeah, this is the biggest pain point I think. Normal VM ops are fine, but if you ever have to move a multi-TB VM, it's just too slow.
>>
>> If you use iSCSI with VAAI and are migrating a thick-provisioned vmdk, then performance is actually quite good, as the block sizes used for the copy are a lot bigger.
>>
>> However, my use case required thin-provisioned VMs + snapshots, and I found that with iSCSI you have no control over the fragmentation of the vmdks, so it is read performance that then suffers (certainly with 7.2k disks).
>>
>> Also with thin-provisioned vmdks I think I was seeing PG contention with the updating of the VMFS metadata, although I can't be sure.
>>
>> I am thinking I will test a few different schedulers and readahead settings to see if we can improve this by parallelizing reads. Also will test NFS, but need to determine whether to do krbd/knfsd or something more interesting like CephFS/Ganesha.
>>
>> As you know, I'm on NFS now. I've found it a lot easier to get going, and a lot less sensitive to config adjustments suddenly dropping everything offline.
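The scheduler and readahead experiments Alex mentions above boil down to a couple of per-device sysfs knobs. A minimal sketch, assuming sdb is an OSD data disk and rbd0 a mapped RBD device (both names are placeholders, and the values are starting points rather than recommendations):

    # Try a different elevator on an OSD data disk (typical options: noop, deadline, cfq)
    echo deadline > /sys/block/sdb/queue/scheduler

    # Raise readahead on the spinners to help parallelize large sequential reads
    echo 4096 > /sys/block/sdb/queue/read_ahead_kb

    # The same readahead knob exists client-side for a mapped RBD device
    echo 4096 > /sys/block/rbd0/queue/read_ahead_kb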
>> The fact that you can specify the extent size on XFS helps massively with using thin vmdks/snapshots to avoid fragmentation. Storage vMotions are a bit faster than iSCSI, but I think I am hitting PG contention when ESXi tries to write 32 copy threads to the same object. There is probably some tuning that could be done here (RBD striping?), but this is the best it's been for a long time and I'm reluctant to fiddle any further.
>
> We have moved ahead and added NFS support to Storcium, and are now able to run NFS servers with Pacemaker in HA mode (all agents are public at https://github.com/akurz/resource-agents/tree/master/heartbeat). I can confirm that VM performance is definitely better and benchmarks are smoother (in Windows we can see a lot of choppiness with iSCSI; NFS is choppy on writes but smooth on reads, likely due to the bursty nature of OSD filesystems when dealing with that small IO size).
>
> Were you using extsz=16384 at creation time for the filesystem? I saw kernel memory deadlock messages during vMotion, such as:
>
> XFS: nfsd(102545) possible memory allocation deadlock size 40320 in kmem_alloc (mode:0x2400240)
>
> And analyzing fragmentation:
>
> root@roc-5r-scd218:~# xfs_db -r /dev/rbd21
> xfs_db> frag -d
> actual 0, ideal 0, fragmentation factor 0.00%
> xfs_db> frag -f
> actual 1863960, ideal 74, fragmentation factor 100.00%
>
> Just from two vMotions. Are you seeing anything similar?
>
> Found your post on setting the XFS extent size hint for sparse files:
>
> xfs_io -c "extsize 16M" /mountpoint
>
> Will test; fragmentation is definitely present without this.
>
> Thank you,
> Alex
>
>> But as mentioned above, thick vmdks with VAAI might be a really good fit.
>>
>> Thanks for your very valuable info on analysis and hw build.
>>
>> Alex
>>
>> On 21.08.2016 at 09:31, Nick Fisk <n...@fisk.me.uk> wrote:
>>
>>>> -----Original Message-----
>>>> From: Alex Gorbachev [mailto:a...@iss-integration.com]
>>>> Sent: 21 August 2016 04:15
>>>> To: Nick Fisk <n...@fisk.me.uk>
>>>> Cc: w...@globe.de; Horace Ng <hor...@hkisl.net>; ceph-users <ceph-users@lists.ceph.com>
>>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>>>
>>>> Hi Nick,
>>>>
>>>> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk <n...@fisk.me.uk> wrote:
>>>>>> -----Original Message-----
>>>>>> From: w...@globe.de [mailto:w...@globe.de]
>>>>>> Sent: 21 July 2016 13:23
>>>>>> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
>>>>>> Cc: ceph-users@lists.ceph.com
>>>>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>>>>>
>>>>>> Okay, and what is your plan now to speed up?
>>>>>
>>>>> Now that I have come up with a lower-latency hardware design, there is not much further improvement to be had until persistent RBD caching is implemented, as that will move the SSD/NVMe closer to the client. But I'm happy with what I can achieve at the moment. You could also experiment with bcache on the RBD.
>>>>
>>>> Reviving this thread, would you be willing to share the details of the low-latency hardware design? Are you optimizing for NFS or iSCSI?
>>>
>>> Both really, just trying to get the write latency as low as possible. As you know, VMware does everything with lots of unbuffered small IOs, e.g. when you migrate a VM or as thin vmdks grow.
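For anyone wanting to try the XFS extent size hint discussed above: it can be set on the directory holding the sparse vmdks, and newly created files inherit it; the hint only affects future allocations, not existing extents. A sketch, assuming the filesystem is mounted at /srv/nfs/vmware (path is a placeholder):

    # Set a 16M extent size hint on the directory; new files underneath inherit it
    xfs_io -c "extsize 16m" /srv/nfs/vmware

    # Report the current hint (prints the value in bytes)
    xfs_io -c extsize /srv/nfs/vmware

    # Re-check file fragmentation afterwards, as in the xfs_db output above
    xfs_db -r -c "frag -f" /dev/rbd21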
>>> Even with storage vMotions, which might kick off 32 threads: as they all roughly fall on the same PG, there still appears to be a bottleneck from contention on the PG itself.
>>>
>>> These were the sort of things I was trying to optimise for, to make the time spent in Ceph as minimal as possible for each IO.
>>>
>>> So, on to the hardware. Through reading various threads and experimenting on my own, I came to the following conclusions:
>>>
>>> - You need the highest possible frequency on the CPU cores, which normally also means fewer of them.
>>> - Dual sockets are probably bad and will impact performance.
>>> - Use NVMe's for journals to minimise latency.
>>>
>>> The end result was OSD nodes based on a 3.5GHz Xeon E3v5 with an Intel P3700 for a journal. I used the SuperMicro X11SSH-CTF board, which has 10G-T onboard as well as 8x SATA and 8x SAS, so no expansion cards are required. As well as being very performant for Ceph, this design also works out very cheap, as you are using low-end server parts. The whole lot plus 12x 7.2k disks all goes into a 1U case.
>>>
>>> During testing I noticed that by default, c-states and p-states slaughter performance. After forcing the max c-state to 1 and forcing the CPU frequency up to max, I was seeing 600us latency for a 4kb write to a 3x-replica pool, or around 1600 IOPS, at QD=1.
>>>
>>> A few other observations:
>>> 1. Power usage is around 150-200W for this config with 12x 7.2k disks.
>>> 2. CPU usage when maxing out the disks is only around 10-15%, so there is plenty of headroom for more disks.
>>> 3. Note for the above: don't include iowait when looking at CPU usage.
>>> 4. No idea about CPU load for pure-SSD nodes, but based on the current disks, you could maybe expect ~10000 IOPS per node before maxing out the CPUs.
>>> 5. A single NVMe seems to be able to journal 12 disks with no problem during normal operation; no doubt a specific benchmark could max it out, though.
>>> 6. There are slightly faster Xeon E3's, but price/performance = diminishing returns.
>>>
>>> Hope that answers all your questions.
>>> Nick
>>>
>>>> Thank you,
>>>> Alex
>>>>
>>>>>> Would it help to put multiple P3700s per OSD node to improve performance for a single thread (for example, storage vMotion)?
>>>>>
>>>>> Most likely not; it's all the other parts of the puzzle which are causing the latency. ESXi was designed for storage arrays that service IOs in the 100us-1ms range. Ceph is probably about 10x slower than this, hence the problem. Disable the BBWC on a RAID controller or SAN and you will see the same behaviour.
>>>>>
>>>>>> Regards
>>>>>>
>>>>>> On 21.07.16 at 14:17, Nick Fisk wrote:
>>>>>>>> -----Original Message-----
>>>>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of w...@globe.de
>>>>>>>> Sent: 21 July 2016 13:04
>>>>>>>> To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net>
>>>>>>>> Cc: ceph-users@lists.ceph.com
>>>>>>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Hmm, I think 200 MByte/s is really bad. Is your cluster in production right now?
>>>>>>>
>>>>>>> It's just been built, not running yet.
>>>>>>>
>>>>>>>> So if you start a storage migration you get only 200 MByte/s, right?
>>>>>>>
>>>>>>> I wish. My current cluster (not this new one) would storage-migrate at ~10-15MB/s.
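On the c-state/p-state tuning mentioned further up: a sketch of one common way to pin them down, assuming an Intel box (parameter and tool names vary by distro and kernel):

    # Limit C-states via kernel boot parameters (takes effect after a reboot):
    #   intel_idle.max_cstate=1 processor.max_cstate=1

    # Force the performance governor on all cores at runtime
    cpupower frequency-set -g performance

    # Confirm the cores are actually sitting at max frequency
    grep "cpu MHz" /proc/cpuinfo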
>>>>>>> Serial latency is the problem: without being able to buffer, ESXi waits on an ack for each IO before sending the next. Also, it submits the migrations in 64kb chunks unless you get VAAI working. I think ESXi will try and do them in parallel, which will help as well.
>>>>>>>
>>>>>>>> I think it would be awesome if you got 1000 MByte/s.
>>>>>>>>
>>>>>>>> Where is the bottleneck?
>>>>>>>
>>>>>>> Latency serialisation: without a buffer, you can't drive the devices to 100%. With buffered IO (or high queue depths) I can max out the journals.
>>>>>>>
>>>>>>>> A fio test from Sebastien Han gives us 400 MByte/s raw performance from the P3700:
>>>>>>>>
>>>>>>>> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>>>>>>>
>>>>>>>> How could it be that the rbd client performance is 50% slower?
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> On 21.07.16 at 12:15, Nick Fisk wrote:
>>>>>>>>
>>>>>>>>> I've had a lot of pain with this; smaller block sizes are even worse. You want to try and minimize latency at every point, as there is no buffering happening in the iSCSI stack. This means:
>>>>>>>>>
>>>>>>>>> 1. Fast journals (NVMe or NVRAM)
>>>>>>>>> 2. 10Gb or better networking
>>>>>>>>> 3. Fast CPUs (GHz)
>>>>>>>>> 4. Fix CPU c-states to C1
>>>>>>>>> 5. Fix CPU frequency to max
>>>>>>>>>
>>>>>>>>> Also, I can't be sure, but I think there is a metadata update happening with VMFS, particularly if you are using thin VMDKs; this can also be a major bottleneck. For my use case I've switched over to NFS, as it has given much more performance at scale and less headache.
>>>>>>>>>
>>>>>>>>> For the RADOS run, here you go (400GB P3700):
>>>>>>>>>
>>>>>>>>> Total time run:         60.026491
>>>>>>>>> Total writes made:      3104
>>>>>>>>> Write size:             4194304
>>>>>>>>> Object size:            4194304
>>>>>>>>> Bandwidth (MB/sec):     206.842
>>>>>>>>> Stddev Bandwidth:       8.10412
>>>>>>>>> Max bandwidth (MB/sec): 224
>>>>>>>>> Min bandwidth (MB/sec): 180
>>>>>>>>> Average IOPS:           51
>>>>>>>>> Stddev IOPS:            2
>>>>>>>>> Max IOPS:               56
>>>>>>>>> Min IOPS:               45
>>>>>>>>> Average Latency(s):     0.0193366
>>>>>>>>> Stddev Latency(s):      0.00148039
>>>>>>>>> Max latency(s):         0.0377946
>>>>>>>>> Min latency(s):         0.015909
>>>>>>>>>
>>>>>>>>> Nick
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Horace
>>>>>>>>>> Sent: 21 July 2016 10:26
>>>>>>>>>> To: w...@globe.de
>>>>>>>>>> Cc: ceph-users@lists.ceph.com
>>>>>>>>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Same here. I've read some blogs saying that VMware will frequently verify the locking on VMFS over iSCSI, hence it has much slower performance than NFS (which uses a different locking mechanism).
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Horace Ng
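The Sebastien Han post referenced above measures raw journal performance with O_DSYNC writes at queue depth 1, which mimics what the Ceph journal does. From memory, the test is along these lines (the device name is a placeholder, and note this writes directly to the device, so only point it at an unused disk):

    fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k \
        --numjobs=1 --iodepth=1 --runtime=60 --time_based \
        --group_reporting --name=journal-test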
>>>>>>>>>> ----- Original Message -----
>>>>>>>>>> From: w...@globe.de
>>>>>>>>>> To: ceph-users@lists.ceph.com
>>>>>>>>>> Sent: Thursday, July 21, 2016 5:11:21 PM
>>>>>>>>>> Subject: [ceph-users] Ceph + VMware + Single Thread Performance
>>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> we are seeing relatively slow single-thread performance on the iSCSI nodes of our cluster.
>>>>>>>>>>
>>>>>>>>>> Our setup:
>>>>>>>>>>
>>>>>>>>>> 3 racks:
>>>>>>>>>>
>>>>>>>>>> 18x data nodes, 3 mon nodes, 3 iSCSI gateway nodes with tgt (rbd cache off).
>>>>>>>>>>
>>>>>>>>>> 2x Samsung SM863 enterprise SSDs for journals (3 OSDs per SSD) and 6x WD Red 1TB per data node as OSDs.
>>>>>>>>>>
>>>>>>>>>> Replication = 3
>>>>>>>>>>
>>>>>>>>>> chooseleaf = 3 type rack in the crush map
>>>>>>>>>>
>>>>>>>>>> We get only ca. 90 MByte/s on the iSCSI gateway servers with:
>>>>>>>>>>
>>>>>>>>>> rados bench -p rbd 60 write -b 4M -t 1
>>>>>>>>>>
>>>>>>>>>> If we test with:
>>>>>>>>>>
>>>>>>>>>> rados bench -p rbd 60 write -b 4M -t 32
>>>>>>>>>>
>>>>>>>>>> we get ca. 600-700 MByte/s.
>>>>>>>>>>
>>>>>>>>>> We plan to replace the Samsung SSDs with Intel DC P3700 PCIe NVMe for the journal to get better single-thread performance.
>>>>>>>>>>
>>>>>>>>>> Is there anyone out there who has an Intel P3700 for the journal and can give me test results for:
>>>>>>>>>>
>>>>>>>>>> rados bench -p rbd 60 write -b 4M -t 1
>>>>>>>>>>
>>>>>>>>>> Thank you very much !!
>>>>>>>>>>
>>>>>>>>>> Kind Regards !!
>>>>
>>>> --
>>>> Alex Gorbachev
>>>> Storcium
>
> --
> Alex Gorbachev
> Storcium
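For reference, the rack-level chooseleaf rule described in the original post would look roughly like this in a decompiled CRUSH map (rule name and ruleset number are made up; a sketch, not the poster's actual map):

    rule replicated_rack {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type rack
            step emit
    }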
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com