Hi,

I think I found a possible cause of my PGs being down, but I still don't understand why. As explained in a previous mail, I set up a 15-chunk EC pool (k=9, m=6), one chunk per OSD, but I have only 12 OSD servers in the cluster. To work around the problem I defined the failure domain as 'osd', reasoning that, since I was using the LRC plugin, I had the guarantee that I could lose a site without impact, and thus could afford to lose 1 OSD server. Am I wrong?
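For clarity, the profile behind this pool was created along these lines (command reconstructed from memory, the profile name is just illustrative):

    ceph osd erasure-code-profile set lrc_k9_m6_l5 \
        plugin=lrc k=9 m=6 l=5 \
        crush-locality=datacenter crush-failure-domain=osd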

Best regards,

Michel

On 24/04/2023 at 13:24, Michel Jouvin wrote:
Hi,

I'm still interested in getting feedback from those using the LRC plugin about the right way to configure it... Last week I upgraded from Pacific to Quincy (17.2.6) with cephadm, which does the upgrade host by host, checking that an OSD is OK to stop before actually upgrading it. I was surprised to see 1 or 2 PGs down at some points during the upgrade (it didn't happen for all OSDs, but it did happen for OSDs in every site/datacenter). Looking at the details with "ceph health detail", I saw that for these PGs there were 3 OSDs down, but I was expecting the pool to be resilient to 6 OSDs down (5 for R/W access), so I'm wondering if there is something wrong in our pool configuration (k=9, m=6, l=5).
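In case it helps, these are the kind of commands I used to look at an affected PG (the PG id and OSD id below are just examples taken from the health output):

    ceph health detail | grep down
    ceph pg map 52.14     # shows the up/acting OSD sets for this PG
    ceph osd find 90      # shows the crush location (datacenter/host) of an OSD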

Cheers,

Michel

On 06/04/2023 at 08:51, Michel Jouvin wrote:
Hi,

Is somebody using the LRC plugin?

I came to the conclusion that LRC k=9, m=3, l=4 is not the same as jerasure k=9, m=6 in terms of protection against failures, and that I should use k=9, m=6, l=5 to get a level of resilience >= jerasure k=9, m=6. The example in the documentation (k=4, m=2, l=3) suggests that this LRC configuration gives something better than jerasure k=4, m=2, as it is resilient to 3 drive failures (but not 4, if I understood properly). So how many drives can fail in the k=9, m=6, l=5 configuration, first without losing R/W access and second without losing data?
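For what it's worth, my current understanding of the chunk layout for k=9, m=6, l=5 (please correct me if this is wrong): the LRC plugin splits the k+m chunks into groups of l and adds one local parity chunk per group, so the total number of chunks is

    k + m + (k+m)/l = 9 + 6 + 15/5 = 18

i.e. 6 chunks per datacenter when spread over 3 datacenters.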

Another thing that I don't quite understand is that a pool created with this configuration (and failure domain=osd, locality=datacenter) has min_size=3 (max_size=18, as expected). It seems wrong to me; I'd have expected something around 10 (depending on the answer to the previous question)...
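If it turns out that min_size is simply miscomputed for LRC pools, I suppose it can be adjusted manually with something like the following (the value 10 is just a guess, pending an answer to the previous question):

    ceph osd pool get <pool> min_size
    ceph osd pool set <pool> min_size 10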

Thanks in advance to anybody who can provide some sort of authoritative answer to these 2 questions. Best regards,

Michel

On 04/04/2023 at 15:53, Michel Jouvin wrote:
Answering my own question, I found the reason for 2147483647: it's documented as a failure to find enough OSDs (missing OSDs). And it is normal, as I required a different host for each of the 15 OSDs but I have only 12 hosts!

I'm still interested in an "expert" confirming whether the LRC k=9, m=3, l=4 configuration is equivalent, in terms of redundancy, to a jerasure configuration with k=9, m=6.

Michel

On 04/04/2023 at 15:26, Michel Jouvin wrote:
Hi,

As discussed in another thread (Crushmap rule for multi-datacenter erasure coding), I'm trying to create an EC pool spanning 3 datacenters (the datacenters are present in the crushmap), with the objective of being resilient to 1 DC down, at least keeping read-only access to the pool and, if possible, read-write access, and with a storage efficiency better than 3-replica (let's say a storage overhead <= 2).
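(For instance, a plain jerasure k=9, m=6 profile stores 15 chunks for 9 data chunks, i.e. an overhead of 15/9 ≈ 1.67, well below the 3x of replication.)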

In that discussion, somebody mentioned the LRC plugin as a possible alternative to jerasure for implementing this without tweaking the crushmap rule to implement the 2-step OSD allocation. I looked at the documentation (https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/) but I have some questions, in case someone has experience/expertise with this LRC plugin.

I tried to create a rule using 5 OSDs per datacenter (15 in total), with 3 per datacenter (9 in total) being data chunks and the others being coding chunks. For this, based on my understanding of the examples, I used k=9, m=3, l=4. Is this right? Is this configuration equivalent, in terms of redundancy, to a jerasure configuration with k=9, m=6?
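For reference, the profile was created with something like this (command reconstructed, profile name illustrative):

    ceph osd erasure-code-profile set lrc_k9_m3_l4 \
        plugin=lrc k=9 m=3 l=4 \
        crush-locality=datacenter crush-failure-domain=host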

The resulting rule, which looks correct to me, is:

--------

{
    "rule_id": 6,
    "rule_name": "test_lrc_2",
    "ruleset": 6,
    "type": 3,
    "min_size": 3,
    "max_size": 15,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -4,
            "item_name": "default~hdd"
        },
        {
            "op": "choose_indep",
            "num": 3,
            "type": "datacenter"
        },
        {
            "op": "chooseleaf_indep",
            "num": 5,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

------------
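For what it's worth, the rule can also be exercised offline with crushtool to check whether it manages to map 15 OSDs (file names below are just examples):

    ceph osd getcrushmap -o crushmap.bin
    crushtool -i crushmap.bin --test --rule 6 --num-rep 15 --show-mappings --show-bad-mappings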

Unfortunately, it doesn't work as expected: a pool created with this rule ends up with its PGs active+undersized, which I did not expect. Looking at the 'ceph health detail' output, I see for each PG something like:

pg 52.14 is stuck undersized for 27m, current state active+undersized, last acting [90,113,2147483647,103,64,147,164,177,2147483647,133,58,28,8,32,2147483647]

For each PG, there are 3 '2147483647' entries, and I guess they are the reason for the problem. What are these entries about? Clearly they are not OSD entries... It looks like a negative number, -1, which in terms of crushmap IDs is the crushmap root (named "default" in our configuration). Is there any trivial mistake I may have made?

Thanks in advance for any help, or for sharing any successful configuration.

Best regards,

Michel
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io