Thank you Caspar for your corrections!
> EC requires K+1 nodes to allow writes, so every IO freezes (until all
> affected PG's are recovered to at least K+1)

I was not aware of this. This is quite important to know, many thanks.

> > -survive the loss of max 3 nodes, if the recovery has enough time to complete between failures
> I think this kind of scenario shouldn't even be considered.

OK, the cluster will also freeze in this case, as you mentioned, so it is not really surviving. (Maybe adding a new node would still make it possible to unfreeze it, from a theoretical point of view.)

Best Regards
Francois Scheurer

________________________________
From: Caspar Smit <caspars...@supernas.eu>
Sent: Friday, February 8, 2019 11:47 AM
To: Scheurer François
Cc: Alan Johnson; Eugen Block; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] best practices for EC pools

On Fri, 8 Feb 2019 at 11:31, Scheurer François <francois.scheu...@everyware.ch> wrote:

Dear Eugen Block
Dear Alan Johnson

Thank you for your answers.
So we will use EC 3+2 on 6 nodes, currently with only 4 OSDs per node, then 8 and later 20.

> Just to add, that a more general formula is that the number of nodes should be
> greater than or equal to k+m+m so N>=k+m+m for full recovery

Understood. EC k+m assumes the case of losing m nodes, and that would require m 'spare' nodes to recover, so k+m+m in total.
But the loss of a single node should allow a full recovery, shouldn't it?

Having 3+2 on 6 nodes should be able to:
-survive the loss of max 2 nodes simultaneously

Yes and no. Technically you can survive a 2-node failure, but EC requires K+1 nodes to allow writes, so every IO freezes (until all affected PG's are recovered to at least K+1) when losing the second node. So yes, you survive, but no, you can't use the cluster for a while during this. So if you want to keep using your cluster at all times, you can only tolerate 1 node failure.
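The K+1 rule above can be checked with a little arithmetic. A minimal sketch, assuming min_size = k + 1 as Caspar states (writes need at least K+1 surviving chunks); pg_state() is an illustrative helper, not part of any Ceph API:

```python
def pg_state(k: int, m: int, lost_chunks: int) -> str:
    """Classify one EC PG after losing `lost_chunks` of its k+m chunks.

    Assumes min_size = k + 1, i.e. writes block as soon as only k chunks survive.
    """
    surviving = k + m - lost_chunks
    if surviving < k:
        return "data loss"   # fewer than k chunks: the PG cannot be read
    if surviving == k:
        return "frozen"      # readable, but writes block until recovery to k+1
    return "writable"

# EC 3+2 on 6 hosts with failure-domain=host: one chunk per host.
print(pg_state(3, 2, 1))  # writable  (4 chunks >= k+1)
print(pg_state(3, 2, 2))  # frozen    (exactly k chunks; IO freezes)
print(pg_state(3, 2, 3))  # data loss (fewer than k chunks)
```

This matches the thread: a 3+2 pool tolerates one host failure while staying writable, and a second failure freezes (but does not lose) any PG that had chunks on both failed hosts.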
-survive the loss of max 3 nodes, if the recovery has enough time to complete between failures

I think this kind of scenario shouldn't even be considered.

-recover the loss of max 1 node

Only if there's enough free disk space left to hold all the data.

Kind regards,
Caspar

> If the pools are empty I also wouldn't expect that, is restarting one OSD also
> that slow or is it just when you reboot the whole cluster?

It also happens after rebooting a single node.
In the mon logs we see a lot of such messages:

2019-02-06 23:07:46.003473 7f14d8ed6700  1 mon.ewos1-osd1-prod@0(leader).osd e116 prepare_failure osd.17 10.38.66.71:6803/76983 from osd.1 10.38.67.72:6800/75206 is reporting failure:1
2019-02-06 23:07:46.003486 7f14d8ed6700  0 log_channel(cluster) log [DBG] : osd.17 10.38.66.71:6803/76983 reported failed by osd.1 10.38.67.72:6800/75206
2019-02-06 23:07:57.948959 7f14d8ed6700  1 mon.ewos1-osd1-prod@0(leader).osd e116 prepare_failure osd.17 10.38.66.71:6803/76983 from osd.1 10.38.67.72:6800/75206 is reporting failure:0
2019-02-06 23:07:57.948971 7f14d8ed6700  0 log_channel(cluster) log [DBG] : osd.17 10.38.66.71:6803/76983 failure report canceled by osd.1 10.38.67.72:6800/75206
2019-02-06 23:08:54.632356 7f14d8ed6700  1 mon.ewos1-osd1-prod@0(leader).osd e116 prepare_failure osd.0 10.38.65.72:6800/72872 from osd.17 10.38.66.71:6803/76983 is reporting failure:1
2019-02-06 23:08:54.632374 7f14d8ed6700  0 log_channel(cluster) log [DBG] : osd.0 10.38.65.72:6800/72872 reported failed by osd.17 10.38.66.71:6803/76983
2019-02-06 23:10:21.333513 7f14d8ed6700  1 mon.ewos1-osd1-prod@0(leader).osd e116 prepare_failure osd.23 10.38.66.71:6807/79639 from osd.18 10.38.67.72:6806/79121 is reporting failure:1
2019-02-06 23:10:21.333527 7f14d8ed6700  0 log_channel(cluster) log [DBG] : osd.23 10.38.66.71:6807/79639 reported failed by osd.18 10.38.67.72:6806/79121
2019-02-06 23:10:57.660468 7f14d8ed6700  1 mon.ewos1-osd1-prod@0(leader).osd e116 prepare_failure osd.23 10.38.66.71:6807/79639 from osd.18 10.38.67.72:6806/79121 is reporting failure:0
2019-02-06 23:10:57.660481 7f14d8ed6700  0 log_channel(cluster) log [DBG] : osd.23 10.38.66.71:6807/79639 failure report canceled by osd.18 10.38.67.72:6806/79121

Best Regards
Francois Scheurer

________________________________________
From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Alan Johnson <al...@supermicro.com>
Sent: Thursday, February 7, 2019 8:11 PM
To: Eugen Block; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] best practices for EC pools

Just to add, that a more general formula is that the number of nodes should be greater than or equal to k+m+m so N>=k+m+m for full recovery

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Eugen Block
Sent: Thursday, February 7, 2019 8:47 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] best practices for EC pools

Hi Francois,

> Is that correct that recovery will be forbidden by the crush rule if a
> node is down?

yes, that is correct, failure-domain=host means no two chunks of the same PG can be on the same host.
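Alan's sizing rule (N >= k+m+m for full recovery) and this failure-domain=host constraint can be put into numbers. A hedged sketch; the helper names are illustrative, not Ceph tooling:

```python
def hosts_for_full_recovery(k: int, m: int) -> int:
    """Alan's rule: survive m host failures and still re-place all k+m chunks."""
    return k + m + m

def can_recover(k: int, m: int, nodes: int, failed: int) -> bool:
    """With failure-domain=host, each PG needs k+m distinct surviving hosts."""
    return nodes - failed >= k + m

print(hosts_for_full_recovery(4, 2))         # 8 hosts for a 4+2 profile
print(can_recover(3, 2, nodes=6, failed=1))  # True:  5 hosts remain for 5 chunks
print(can_recover(4, 2, nodes=6, failed=1))  # False: 6 chunks, only 5 hosts left
```

The last line is exactly the situation in the original question below: with k+m equal to the node count (4+2 on 6 nodes), losing a single host leaves nowhere to rebuild the missing chunks.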
So if your PG is divided into 6 chunks, they're all on different hosts, and no recovery is possible at this point (for the EC pool).

> After rebooting all nodes we noticed that the recovery was slow, maybe
> half an hour, but all pools are currently empty (new install).
> This is odd...

If the pools are empty I also wouldn't expect that, is restarting one OSD also that slow or is it just when you reboot the whole cluster?

> Which k&m values are preferred on 6 nodes?

It depends on the failures you expect and how many concurrent failures you need to cover. I think I would keep failure-domain=host (with only 4 OSDs per host). As for the k and m values, 3+2 would make sense, I guess. That profile would leave one host for recovery, and two OSDs of one PG acting set could fail without data loss, so it is as resilient as the 4+2 profile.

This is one approach, so please don't read this as *the* solution for your environment.

Regards,
Eugen

Quoting Scheurer François <francois.scheu...@everyware.ch>:

> Dear All
>
> We created an erasure coded pool with k=4 m=2 with failure-domain=host
> but have only 6 osd nodes.
> Is that correct that recovery will be forbidden by the crush rule if a
> node is down?
>
> After rebooting all nodes we noticed that the recovery was slow, maybe
> half an hour, but all pools are currently empty (new install).
> This is odd...
>
> Can it be related to the k+m being equal to the number of nodes?
> (4+2=6)
> step set_choose_tries 100 was already in the EC crush rule.
>
> rule ewos1-prod_cinder_ec {
>         id 2
>         type erasure
>         min_size 3
>         max_size 6
>         step set_chooseleaf_tries 5
>         step set_choose_tries 100
>         step take default class nvme
>         step chooseleaf indep 0 type host
>         step emit
> }
>
> ceph osd erasure-code-profile set ec42 k=4 m=2 crush-root=default crush-failure-domain=host crush-device-class=nvme
> ceph osd pool create ewos1-prod_cinder_ec 256 256 erasure ec42
>
> ceph version 12.2.10-543-gfc6f0c7299 (fc6f0c7299e3442e8a0ab83260849a6249ce7b5f) luminous (stable)
>
>   cluster:
>     id:     b5e30221-a214-353c-b66b-8c37b4349123
>     health: HEALTH_WARN
>             noout flag(s) set
>             Reduced data availability: 125 pgs inactive, 32 pgs peering
>
>   services:
>     mon: 3 daemons, quorum ewos1-osd1-prod,ewos1-osd3-prod,ewos1-osd5-prod
>     mgr: ewos1-osd5-prod(active), standbys: ewos1-osd3-prod, ewos1-osd1-prod
>     osd: 24 osds: 24 up, 24 in
>          flags noout
>
>   data:
>     pools:   4 pools, 1600 pgs
>     objects: 0 objects, 0B
>     usage:   24.3GiB used, 43.6TiB / 43.7TiB avail
>     pgs:     7.812% pgs not active
>              1475 active+clean
>              93   activating
>              32   peering
>
> Which k&m values are preferred on 6 nodes?
>
> BTW, we plan to use this EC pool as a second rbd pool in Openstack,
> with the main first rbd pool being replicated size=3; it is nvme ssd
> only.
>
> Thanks for your help!
>
> Best Regards
> Francois Scheurer

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
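To close the comparison the thread raises between the EC pool and the replicated size=3 rbd pool: the raw-capacity trade-off behind the k and m choices is simple arithmetic, sketched below (plain arithmetic only, no Ceph API):

```python
def ec_usable_fraction(k: int, m: int) -> float:
    # An EC pool stores k data chunks plus m coding chunks per object.
    return k / (k + m)

def replicated_usable_fraction(size: int) -> float:
    # A replicated pool stores `size` full copies of every object.
    return 1 / size

print(f"EC 3+2:     {ec_usable_fraction(3, 2):.0%}")       # 60%
print(f"EC 4+2:     {ec_usable_fraction(4, 2):.0%}")       # 67%
print(f"replica x3: {replicated_usable_fraction(3):.0%}")  # 33%
```

So the 3+2 profile chosen in the thread trades a little capacity efficiency (60% vs 67% for 4+2) for the ability to fully recover a host failure on only 6 nodes, and either EC profile stores roughly twice as much usable data per raw TB as the replicated size=3 pool.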