I would like to add here a detail that is often overlooked: maintainability 
under degraded conditions.

For production systems I would recommend EC profiles with at least m=3. The 
reason is that with m=2, an extended problem with a down node makes it 
impossible to do any maintenance on the system without losing write access. 
Don't trust what users claim they are willing to tolerate; at the very least, 
get it in writing. Once a problem occurs they will be at your doorstep no 
matter what they said before.

Similarly, when doing a longer maintenance task with m=2, any disk failure 
during the maintenance means losing write access.

Having m=3 or larger allows 2 (or more) hosts/OSDs to be unavailable 
simultaneously while the service remains fully operational. That can be a life 
saver in many situations.
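To make the maintenance argument concrete, here is a rough sketch of the 
failure headroom per profile. It assumes the Ceph default of min_size = k+1 
for EC pools; the function name is just for illustration:

```python
def failures_tolerated(k: int, m: int) -> dict:
    """For a k+m EC pool with the default min_size = k + 1:
    data survives up to m simultaneous shard losses, but PGs stay
    active (writable) only while at least k + 1 shards are up,
    i.e. through m - 1 losses."""
    return {"data_safe": m, "still_writable": m - 1}

# m=2: taking one host down for maintenance already consumes the
# single "writable" loss; any disk failure during the work makes
# the pool inactive.
print(failures_tolerated(k=4, m=2))  # {'data_safe': 2, 'still_writable': 1}

# m=3: one host down for maintenance still leaves headroom for one
# unplanned disk failure.
print(failures_tolerated(k=4, m=3))  # {'data_safe': 3, 'still_writable': 2}
```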

An additional reason for a larger m is systematic drive failure when your 
vendor doesn't mix drives from different batches and factories. If a batch has 
a systematic production error, failures are no longer statistically 
independent: if one drive fails, the likelihood that more drives fail at the 
same time is very high. A larger number of parity shards increases the chances 
of recovering from such events.

For similar reasons I would recommend deploying 5 MONs instead of 3. My life 
got so much better after having the extra redundancy.

As some background: in our situation we experienced fairly heavy maintenance 
operations, including modifying/updating Ceph nodes (hardware, not software) 
and exchanging racks, switches, cooling, power, etc. This required longer 
downtimes and/or moving servers, and moving the Ceph hardware was the easiest 
part compared with other systems thanks to the extra redundancy built into it. 
We had no service outages during such operations.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Anthony D'Atri <anthony.da...@gmail.com>
Sent: Saturday, January 13, 2024 5:36 PM
To: Phong Tran Thanh
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Recomand number of k and m erasure code

There are nuances, but in general the higher the sum of m+k, the lower the 
performance, because *every* operation has to hit that many drives, which is 
especially impactful with HDDs.  So there’s a tradeoff between storage 
efficiency and performance.  And as you’ve seen, larger parity groups 
especially mean slower recovery/backfill.

There’s also a modest benefit to choosing values of m and k that have small 
prime factors, but I wouldn’t worry too much about that.


You can find EC efficiency tables on the net:


https://docs.netapp.com/us-en/storagegrid-116/ilm/what-erasure-coding-schemes-are.html


I should really add a table to the docs, making a note to do that.

There’s a nice calculator at the OSNEXUS site:

https://www.osnexus.com/ceph-designer


The overhead factor is (k+m) / k

So for a 4,2 profile, that’s 6 / 4 == 1.5

For 6,2, 8 / 6 = 1.33

For 10,2, 12 / 10 = 1.2

and so forth.  As k increases, the incremental efficiency gain sees diminishing 
returns, but performance continues to decrease.
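The overhead arithmetic above can be checked with a one-liner; the step from 
4,2 to 10,2 saves raw space, but each additional step in k saves less:

```python
def overhead(k: int, m: int) -> float:
    """Raw-to-usable overhead factor for a k+m EC profile."""
    return (k + m) / k

print(overhead(4, 2))    # 1.5
print(overhead(6, 2))    # ~1.333
print(overhead(10, 2))   # 1.2
# diminishing returns: 4,2 -> 6,2 saves ~0.17x raw, 6,2 -> 10,2 only ~0.13x
```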

Think of m as the number of copies you can lose without losing data, and m-1 as 
the number you can lose / have down and still have data *available*.

I also suggest that the number of failure domains — in your case this means 
OSD nodes — be *at least* k+m+1, so in your case you want k+m to be at most 9.
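That rule of thumb (one spare failure domain beyond k+m, so the cluster can 
rebuild onto it after losing a node) is easy to encode; the helper name is 
mine, not a Ceph API:

```python
def max_km(failure_domains: int) -> int:
    """Largest k+m to use given the rule of thumb that the number of
    failure domains should be at least k + m + 1, leaving one spare
    domain to recover onto after a node loss."""
    return failure_domains - 1

# 10 OSD nodes -> k+m at most 9, so e.g. 6,3 or 4,2 fit, 8,2 does not
print(max_km(10))  # 9
```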

With RBD and many CephFS implementations, we mostly have relatively large RADOS 
objects that are striped over many OSDs.

When using RGW especially, one should attend to average and median S3 object 
size.  There’s an analysis of the potential for space amplification in the docs 
so I won’t repeat it here in detail. This sheet 
https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit#gid=358760253
 visually demonstrates this.

Basically, for an RGW bucket pool — or for a CephFS data pool storing unusually 
small files — if you have a lot of S3 objects only a few KB in size, you waste 
a significant fraction of the underlying storage.  This is exacerbated by EC, 
and the larger the sum of k+m, the more waste.
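A simplified model shows why. It assumes a BlueStore allocation unit of 4 KiB 
(the default on recent releases) and ignores per-object metadata; the function 
is illustrative, not a Ceph API:

```python
import math

def raw_bytes(obj_size: int, k: int, m: int, alloc: int = 4096) -> int:
    """Approximate raw bytes one object consumes on a k+m EC pool.
    The object is split into k data shards, each shard is padded up to
    the allocation unit, and the m parity shards are shard-sized too."""
    shard = max(alloc, math.ceil(obj_size / k / alloc) * alloc)
    return shard * (k + m)

# A 4 KiB S3 object on a 10,2 pool occupies 12 x 4 KiB = 48 KiB raw --
# roughly 10x the ~4.8 KiB that the nominal 1.2x overhead would suggest.
print(raw_bytes(4096, k=10, m=2))        # 49152
# Large objects approach the nominal overhead: 40 MiB -> ~48 MiB raw.
print(raw_bytes(40 * 2**20, k=10, m=2))  # 50331648
```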

When people ask me about replication vs EC and EC profile, the first question I 
ask is what they’re storing.  When EC isn’t a non-starter, I tend to recommend 
4,2 as a profile until / unless someone has specific needs and can understand 
the tradeoffs. This lets you store ~~ 2x the data of 3x replication while not 
going overboard on the performance hit.

If you care about your data, do not set m=1.

If you need to survive the loss of many drives, say if your cluster spans 
multiple buildings or sites, choose a larger value of m.  There are people 
running profiles like 4,6 because they have unusual and specific needs.




> On Jan 13, 2024, at 10:32 AM, Phong Tran Thanh <tranphong...@gmail.com> wrote:
>
> Hi ceph user!
>
> I need to determine which erasure code values (k and m) to choose for a
> Ceph cluster with 10 nodes.
>
> I am using the reef version with rbd. Furthermore, when using a larger k,
> for example, ec6+2 and ec4+2, which erasure coding performance is better,
> and what are the criteria for choosing the appropriate erasure coding?
> Please help me
>
> Email: tranphong...@gmail.com
> Skype: tranphong079
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
