[ceph-users] HDD-only performance, how far can it be sped up ?

Wladimir Mutel Wed, 20 Jun 2018 00:29:03 -0700

    Dear all,

I set up a minimal 1-node Ceph cluster to evaluate its performance. We tried to save as much as possible on the hardware, so now the box has Asus P10S-M WS motherboard, Xeon E3-1235L v5 CPU, 64 GB DDR4 ECC RAM and 8x3TB HDDs (WD30EFRX) connected to on-board SATA ports. Also we are trying to save on storage redundancy, so for most of our RBD images we use erasure-coded data-pool (default profile, jerasure 2+1) instead of 3x replication. I started with Luminous/Xenial 12.2.5 setup which initialized my OSDs as Bluestore during deploy, then updated it to Mimic/Bionic 13.2.0. Base OS is Ubuntu 18.04 with kernel updated to 4.17.2 from Ubuntu mainline PPA.
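As a rough illustration of the redundancy trade-off described above, the sketch below compares the usable capacity of 8x3TB of raw disk under a 2+1 erasure-coded pool and under 3x replication. It is only back-of-the-envelope arithmetic: it ignores BlueStore metadata, full-ratio headroom and uneven data distribution, and the function names are made up for this example.

```python
# Back-of-the-envelope usable capacity for 8 x 3 TB of raw disk.
# Ignores BlueStore metadata, full-ratio headroom and PG imbalance.
RAW_TB = 8 * 3


def usable_ec(raw_tb, k, m):
    """Usable capacity of an erasure-coded pool with k data and m coding chunks."""
    return raw_tb * k / (k + m)


def usable_replicated(raw_tb, size):
    """Usable capacity of a replicated pool keeping `size` copies."""
    return raw_tb / size


if __name__ == "__main__":
    print("EC 2+1 usable TB:", usable_ec(RAW_TB, 2, 1))
    print("3x replica usable TB:", usable_replicated(RAW_TB, 3))
```

In other words, 2+1 erasure coding stores 1.5 bytes of raw data per usable byte, against 3 bytes for 3x replication.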

With this setup, I created a number of RBD images to test iSCSI, rbd-nbd and QEMU+librbd performance (running QEMU VMs on the same box). And that worked moderately well as long as the data volume transferred within one session was limited. The fastest transfers I had were with 'rbd import', which pulled an ISO image file at up to 25 MBytes/sec from the remote CIFS share over Gigabit Ethernet and stored it into the EC data-pool. Windows 2008 R2 & 2016 setup, update installation, Win 2008 upgrade to 2012 and to 2016 within QEMU VM also went through tolerably well. I found that cache=writeback gives the best performance with librbd, unlike cache=unsafe which gave the best performance with VMs on plain local SATA drives. Also I have a subjective feeling (not confirmed by exact measurements) that providing a huge & delayed libRBD cache (like, cache size = 1GB, max dirty = 7/8GB, max dirty age = 60) improved Windows VM performance on bursty writes (like, during Windows update installations) as well as on reboots (due to cached reads).
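For reference, the cache settings mentioned above could be written out roughly as follows. This is a sketch under assumptions: it takes "1GB" to mean 1 GiB and "7/8GB" to mean seven eighths of that, it uses the librbd option names `rbd cache size`, `rbd cache max dirty` and `rbd cache max dirty age`, and placing them in a `[client]` section is a guess at how the settings were applied, not something stated in this post.

```python
# Render the librbd cache settings described above as a ceph.conf fragment.
# Assumption: "1GB" means 1 GiB and "7/8GB" means 7/8 of that value.
GIB = 1024 ** 3

settings = {
    "rbd cache": "true",
    "rbd cache size": GIB,                 # total cache size in bytes
    "rbd cache max dirty": GIB * 7 // 8,   # dirty bytes allowed before writeback
    "rbd cache max dirty age": 60,         # seconds dirty data may stay cached
}


def render(section, options):
    """Return an ini-style section with one 'key = value' line per option."""
    lines = ["[{}]".format(section)]
    lines += ["{} = {}".format(key, value) for key, value in options.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render("client", settings))
```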

Now, what discouraged me was my next attempt to clone an NTFS partition of ~2TB from a physical drive (via USB3-SATA3 converter) to a partition on an RBD image. I tried to map the RBD image with rbd-nbd either locally or remotely over Gigabit Ethernet, and the fastest speed I got with ntfsclone was about 8 MBytes/sec. Which means that it could spend up to 3 days copying these ~2TB of NTFS data. I thought about running ntfsclone /dev/sdX1 -o - | rbd import ... - , but ntfsclone needs to rewrite a part of an existing RBD image starting from a certain offset, so I decided this was not a solution in my situation. Now I am thinking about taking out one of the OSDs and using it as a 'bcache' for this operation, but I am not sure how good bcache performance is with the cache on a rotating HDD. I know that keeping OSD logs and RocksDB on the same HDD creates a seeky workload which hurts overall transfer performance.
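The "up to 3 days" estimate above can be checked with simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MByte = 10^6 bytes) and a constant transfer rate, and compares the 8 MBytes/sec seen with ntfsclone against the 25 MBytes/sec seen with 'rbd import'.

```python
# Estimate how long a ~2 TB copy takes at a constant transfer rate.
# Assumes decimal units: 1 TB = 10**12 bytes, 1 MByte = 10**6 bytes.
SECONDS_PER_DAY = 86400


def transfer_days(size_bytes, rate_bytes_per_s):
    """Days needed to move size_bytes at a steady rate_bytes_per_s."""
    return size_bytes / rate_bytes_per_s / SECONDS_PER_DAY


if __name__ == "__main__":
    size = 2 * 10**12
    for rate_mb in (8, 25):
        days = transfer_days(size, rate_mb * 10**6)
        print("{} MBytes/sec: {:.2f} days".format(rate_mb, days))
```

At 8 MBytes/sec the copy needs a little under 3 days; at 25 MBytes/sec it would take a little under 1 day.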

Also I am thinking about a number of possible next steps, and I would like to hear your opinions on the benefits and drawbacks of each of them.

1. Would iSCSI access to that RBD image improve my performance (compared to rbd-nbd) ? I did not check that yet, but I noticed that Windows transferred about 2.5 MBytes/sec while formatting NTFS volume on this RBD attached to it by iSCSI. So, for seeky/sparse workloads like NTFS formatting the performance was not great.

2. Would it help to run ntfsclone in Linux VM, with RBD image accessed through QEMU+librbd ? (also going to measure that myself)

3. Are there any performance benefits in using Ceph cache-tier pools with my setup ? I hear that use of this technique is now advised against, no?

4. We have an unused older box (Supermicro X8SIL-F mobo, Xeon X3430 CPU, 32 GB of DDR3 ECC RAM, 6 onboard SATA ports, used from 2010 to 2017, in perfectly working condition) which can be stuffed with up to 6 SATA HDDs and added to this Ceph cluster, so far with only Gigabit network interconnect. Like, move 4 OSDs out of first box into it, to have 2 boxes with 4 HDDs each. Is this going to improve Ceph performance with the setup described above ?

5. I hear that RAID controllers like Adaptec 5805, LSI 2108 provide better performance with SATA HDDs exported as JBODs than onboard SATA AHCI controllers due to more aggressive caching and reordering of requests. Is this true ?

6. On the local market we can buy a Kingston KC1000/960GB NVMe drive for a moderately reasonable price. Its specification has a rewrite limit of 1 PB and 0.58 DWPD (drive writes per day). Are there any contraindications against using it in a production Ceph setup (i.e., too low rewrite limit, look for 8+PB) ? What is the difference between using it as a 'bcache' or as specifically-designed OSD log+rocksdb storage ? Can it be used as a single shared partition for all OSD daemons, or will it require splitting into 8 separate partitions ?
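To put the endurance figures in question 6 side by side, the sketch below converts 0.58 DWPD on a 960 GB drive into a daily write budget and works out how long it would take to reach the 1 PB limit at that rate. It assumes decimal units (1 PB = 10^6 GB) and that both figures come from the same vendor rating; real write volume on a Ceph log+rocksdb device depends on the workload and is not estimated here.

```python
# Relate the quoted endurance figures: 960 GB capacity, 0.58 DWPD, 1 PB limit.
# Assumes decimal units (1 PB = 10**6 GB); workload write volume is not modelled.
CAPACITY_GB = 960
DWPD = 0.58
LIMIT_GB = 1 * 10**6


def daily_write_budget_gb(capacity_gb, dwpd):
    """GB that may be written per day at the rated drive-writes-per-day."""
    return capacity_gb * dwpd


def years_to_limit(limit_gb, gb_per_day):
    """Years until limit_gb is reached when writing gb_per_day every day."""
    return limit_gb / gb_per_day / 365


if __name__ == "__main__":
    budget = daily_write_budget_gb(CAPACITY_GB, DWPD)
    print("Daily write budget: {:.1f} GB".format(budget))
    print("Years to reach limit: {:.1f}".format(years_to_limit(LIMIT_GB, budget)))
```

Under these assumptions the two figures agree with each other: about 557 GB per day, which reaches 1 PB in just under 5 years.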


    Thank you in advance for your replies.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
