On July 28, 2018 6:00:15 AM CDT, Mark Adams <m...@openvs.co.uk> wrote:
>Hi Adam,
>
>Thanks for your great round-up there - your points are excellent.
>
>What I ended up doing a few days ago (apologies, I have been too busy to respond) was to set rbd cache = true under each client in the ceph.conf - this got me from 15MB/s up to about 70MB/s. I then set the disk holding the zfs dataset to writeback cache in Proxmox (as you note below), and that has bumped it up to about 130MB/s -- which I am happy with for this setup.
>
>Regards,
>Mark
>
>On 27 July 2018 at 14:46, Adam Thompson <athom...@athompso.net> wrote:
>
>> On 2018-07-27 07:05, ronny+pve-u...@aasen.cx wrote:
>>
>>> rbd striping is a per-image setting. You may need to make a new rbd image and migrate the data.
>>>
>>> On 07/26/18 12:25, Mark Adams wrote:
>>>
>>>> Thanks for your suggestions. Do you know if it is possible to change an existing rbd pool to striping, or does this have to be done on first setup?
>>
>> Please be aware that striping will not result in any increased performance if you are using "safe" I/O modes, i.e. your VM waits for a successful flush-to-disk after every sector. In that scenario, CEPH will never give you write performance equal to a local disk, because you're limited to the bandwidth of a single remote disk [subsystem] *plus* the network round-trip latency, which, even if measured in microseconds, still adds up.
>>
>> Based on my experience with this and other distributed storage systems, I believe you will likely find that you get large write-performance gains by:
>>
>> 1. Use the largest possible block size during writes. 512B sectors are the worst-case scenario for any remote storage. Try to write in chunks of *at least* 1 MByte, and it's not unreasonable nowadays to write in chunks of 64MB or larger. The rationale here is that you're spending more time sending data and less time waiting for ACKs; the more you can tilt that balance in favor of data, the better off you are. (There are downsides to huge sector/block/chunk sizes, though - this isn't a "free lunch" scenario. See #5.)
>>
>> 2. Relax your write-consistency requirements. If you can tolerate the small risk that comes with "Write Back", you should see better performance, especially during burst writes. During large sequential writes there are not many ways to violate the laws of physics, and CEPH automatically amplifies your writes by (in your case) a factor of 2x due to replication.
>>
>> 3. Switch to storage devices with the best possible local write speed for OSDs. OSDs are limited by the performance of the underlying device or virtual device. (e.g. it's entirely possible to run OSDs on a hardware RAID6 controller.)
>>
>> 4. Avoid CoW-on-CoW. Write amplification means you'll lose around 50% of your IOPS and/or I/O bandwidth for each level of CoW nesting, depending on workload. So don't put CEPH OSDs on, say, BTRFS or ZFS filesystems. A worst-case scenario would be something like running a VM using ZFS on top of CEPH, where the OSDs are located on BTRFS filesystems, which are in turn virtual devices hosted on ZFS filesystems. Welcome to 1980's storage performance, in that case! (I did it without realizing once... seriously, 5 MBps sequential writes was a good day!) FWIW, CoW filesystems are generally awesome - just not when stacked.
>> A sufficiently fast external NAS running ZFS with VMs stored over NFS can provide decent performance, *if* tuned correctly. iX Systems, for example, spends a lot of time & effort making this work well, including some lovely HA NAS appliances.
>>
>> 5. Remember the triangle. You can optimize a distributed storage system for any TWO of: a) cost, b) resiliency/reliability/HA, or c) speed. (This is a specific case of the traditional good/fast/cheap: pick-any-2 adage.)
>>
>> I'm not sure I'm saying anything new here, I may have just summarized the discussion, but the points remain valid.
>>
>> Good luck with your performance problems.
>> -Adam
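For reference, the two changes Mark describes at the top of this thread translate into roughly the sketch below on a stock Proxmox VE + Ceph setup. It is only an illustration: the VM ID, storage name, and disk volume are placeholders, and the same writeback setting can also be applied from the GUI via the disk's cache option.

    # /etc/ceph/ceph.conf on each client (i.e. each Proxmox node):
    [client]
        rbd cache = true

    # Switch an existing virtual disk to writeback caching
    # ("100", "ceph-rbd", and the volume name are placeholders):
    qm set 100 --scsi0 ceph-rbd:vm-100-disk-1,cache=writeback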
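On the striping question raised further up: the layout is fixed when an image is created, so the usual route is to copy the data into a new image that has the stripe geometry you want. A rough sketch, assuming a reasonably recent rbd client that accepts image-creation options on copy; the pool name, image names, and stripe sizes are all illustrative:

    # Copy an existing image into a new one with a non-default stripe layout
    # (stripe-unit is in bytes here; it must evenly divide the object size):
    rbd cp rbd/vm-100-disk-1 rbd/vm-100-disk-1-striped \
        --stripe-unit 65536 --stripe-count 8
    # Point the VM at the new image, verify it, then remove the old one:
    # rbd rm rbd/vm-100-disk-1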
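And as a quick way to see point 1 in action from inside a guest: the two dd runs below write the same 10 MiB, once in 512-byte chunks and once in 1 MiB chunks, and the throughput difference on networked storage is usually dramatic. The target path is a placeholder and gets overwritten, so do not aim it at real data.

    # Worst case: many tiny direct writes
    dd if=/dev/zero of=/mnt/test/ddtest bs=512 count=20480 oflag=direct
    # The same amount of data in 1 MiB chunks: far fewer round trips
    dd if=/dev/zero of=/mnt/test/ddtest bs=1M count=10 oflag=direct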
That's a pretty good result. You now have some very small windows where recently-written data could be lost, but in most applications that risk is not unreasonable. In exchange, you get very good throughput for spinning rust. (FWIW, I gave up on CEPH because my nodes only have 2Gbps of network each, but I am seeing similar speeds with local ZFS+ZIL+L2ARC on 15k SAS drives. These are older systems, obviously.)

Thanks for sharing your solution!
-Adam

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
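(For anyone curious about the local ZFS+ZIL+L2ARC arrangement mentioned above: the broad shape is simply a pool with a dedicated log device for synchronous writes and a cache device for reads. A minimal sketch; the pool name and device paths are placeholders:)

    # Add a separate intent-log (SLOG/ZIL) device and an L2ARC cache device
    zpool add tank log /dev/disk/by-id/nvme-slog-example
    zpool add tank cache /dev/disk/by-id/ssd-l2arc-example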