Re: [ceph-users] What's the best practice for Erasure Coding

Frank Schilder Mon, 08 Jul 2019 01:31:39 -0700

Hi David,

I'm running a cluster with bluestore on raw devices (no lvm) and all journals 
collocated on the same disk with the data. Disks are spinning NL-SAS. Our goal 
was to build storage at lowest cost, therefore all data on HDD only. I got a 
few SSDs that I'm using for FS and RBD meta data. All large pools are EC on 
spinning disk.


I spent at least one month to run detailed benchmarks (rbd bench) depending on 
EC profile, object size, write size, etc. Results were varying a lot. My advice 
would be to run benchmarks with your hardware. If there was a single perfect 
choice, there wouldn't be so many options. For example, my tests will not be 
valid when using separate fast disks for WAL and DB.

There are some results though that might be valid in general:

1) EC pools have high throughput but low IOP/s compared with replicated pools

I see single-thread write speeds of up to 1.2GB (gigabyte) per second, which is 
probably the network limit and not the disk limit. IOP/s get better with more 
disks, but are way lower than what replicated pools can provide. On a cephfs 
with EC data pool, small-file IO will be comparably slow and eat a lot of 
resources.

2) I observe massive network traffic amplification on small IO sizes, which is 
due to the way EC overwrites are handled. This is one bottleneck for IOP/s. We 
have 10G infrastructure and use 2x10G client and 4x10G OSD network. OSD 
bandwidth at least 2x client network, better 4x or more.

3) k should only have small prime factors, power of 2 if possible

I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All other 
choices were poor. The value of m seems not relevant for performance. Larger k 
will require more failure domains (more hardware).

4) object size matters

The best throughput (1M write size) I see with object sizes of 4MB or 8MB, with 
IOP/s getting somewhat better with slower object sizes but throughput dropping 
fast. I use the default of 4MB in production. Works well for us.

5) jerasure is quite good and seems most flexible

jerasure is quite CPU efficient and can handle smaller chunk sizes than other 
plugins, which is preferrable for IOP/s. However, CPU usage can become a 
problem and a plugin optimized for specific values of k and m might help here. 
Under usual circumstances I see very low load on all OSD hosts, even under 
rebalancing. However, I remember that once I needed to rebuild something on all 
OSDs (I don't remember what it was, sorry). In this situation, CPU load went up 
to 30-50% (meaning up to half the cores were at 100%), which is really high 
considering that each server has only 16 disks at the moment and is sized to 
handle up to 100. CPU power could become a bottle for us neck in the future.

These are some general observations and do not replace benchmarks for specific 
use cases. I was hunting for a specific performance pattern, which might not be 
what you want to optimize for. I would recommend to run extensive benchmarks if 
you have to live with a configuration for a long time - EC profiles cannot be 
changed.

We settled on 8+2 and 6+2 pools with jerasure and object size 4M. We also use 
bluestore compression. All meta data pools are on SSD, only very little SSD 
space is required. This choice works well for the majority of our use cases. We 
can still build small expensive pools to accommodate special performance 
requests.

Best regards,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: ceph-users <[email protected]> on behalf of David 
<[email protected]>
Sent: 07 July 2019 20:01:18
To: [email protected]
Subject: [ceph-users]  What's the best practice for Erasure Coding

Hi Ceph-Users,

I'm working with a  Ceph cluster (about 50TB, 28 OSDs, all Bluestore on lvm).
Recently, I'm trying to use the Erasure Code pool.
My question is "what's the best practice for using EC pools ?".
More specifically, which plugin (jerasure, isa, lrc, shec or  clay) should I 
adopt, and how to choose the combinations of (k,m) (e.g. (k=3,m=2), (k=6,m=3) ).

Does anyone share some experience?

Thanks for any help.

Regards,
David

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] What's the best practice for Erasure Coding

Reply via email to