OK, so if I understand correctly, for replication size 3 or 4 I would have
to use the rule:
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take root
        step choose firstn 2 type datacenter
        step chooseleaf firstn 2 type host
        step emit
}
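To illustrate the shape of the list that rule emits, here is a toy sketch (not real CRUSH: the hierarchy, host names, and deterministic first-N pick are all made up for illustration; real CRUSH selects pseudorandomly by hashing). It picks 2 datacenters, then 2 one-OSD hosts per datacenter, and truncates to the pool size, which is how a replicated pool consumes the emitted list:

```python
# Toy hierarchy: 2 datacenters, hosts with one OSD each (illustrative only).
tree = {
    "dc1": {"host1": 1, "host2": 2, "host3": 3},
    "dc2": {"host4": 4, "host5": 5, "host6": 6},
}

def emit(size):
    osds = []
    for dc in list(tree)[:2]:              # step choose firstn 2 type datacenter
        for host in list(tree[dc])[:2]:    # step chooseleaf firstn 2 type host
            osds.append(tree[dc][host])
    return osds[:size]                     # pool size truncates the emitted list

print(emit(4))  # [1, 2, 4, 5] -> 2 replicas in each DC
print(emit(3))  # [1, 2, 4]    -> 2 replicas in dc1, only 1 in dc2
```

Note the size=3 case: the 2+2 list gets truncated, so the split across datacenters is 2+1, which matters for the failure scenarios below.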
The question I have now is: how will it behave when a DC goes down?
(Assuming catastrophic failure, the thing burns down)
For example, suppose I set the replication size to 3 and min_size to 3.
Then, if a DC goes down, CRUSH will only return 2 OSDs for each PG, so
everything will hang (same for 4/4 and 4/3).
If I set the replication size to 3 and min_size to 2, it could occur that
all data of a PG ends up in one DC (degraded mode). If that DC goes down,
the PG will hang. As far as I know, degraded PGs will still accept writes,
so data loss is possible (same for 4/2).
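The availability arithmetic above can be sketched as follows (a toy model, my assumption being that the rule splits replicas 2+1 for size 3 and 2+2 for size 4; it models only whether the PG stays active, i.e. surviving replicas >= min_size, not the degraded-write data-loss risk):

```python
def active_after_dc_loss(split, min_size):
    """For each DC in `split` (replica count per DC), return whether the
    PG stays active after losing that entire DC."""
    total = sum(split)
    return [total - lost >= min_size for lost in split]

print(active_after_dc_loss((2, 1), 3))  # [False, False] -> 3/3 hangs whichever DC burns
print(active_after_dc_loss((2, 1), 2))  # [False, True]  -> 3/2 hangs if the 2-copy DC burns
print(active_after_dc_loss((2, 2), 2))  # [True, True]   -> 4/2 stays active either way
```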
I can't seem to find a way around this. What am I missing?
Wouter
On Fri, Sep 18, 2015 at 10:10 PM, Gregory Farnum <[email protected]> wrote:
> On Fri, Sep 18, 2015 at 4:57 AM, Wouter De Borger <[email protected]>
> wrote:
> > Hi all,
> >
> > I have found on the mailing list that it should be possible to have a
> multi
> > datacenter setup, if latency is low enough.
> >
> > I would like to set this up, so that each datacenter has at least two
> > replicas and each PG has a replication level of 3.
> >
> > In this mail, it is suggested that I should use the following crush map
> for
> > multi DC:
> >
> > rule dc {
> > ruleset 0
> > type replicated
> > min_size 1
> > max_size 10
> > step take default
> > step chooseleaf firstn 0 type datacenter
> > step emit
> > }
> >
> > This looks suspicious to me, as it will only generate a list of two
> > OSDs (and only one OSD if one DC is down).
> >
> > I think I should use:
> >
> > rule replicated_ruleset {
> > ruleset 0
> > type replicated
> > min_size 1
> > max_size 10
> > step take root
> > step choose firstn 2 type datacenter
> > step chooseleaf firstn 2 type host
> > step emit
> > step take root
> > step chooseleaf firstn -4 type host
> > step emit
> > }
> >
> > This correctly generates a list with 2 OSDs in one DC, then 2 OSDs in
> > the other, and then a list of filler OSDs.
> >
> > The problem is that this list contains duplicates (e.g. for 8 OSDS per
> DC)
> >
> > [13,11,1,8,13,11,16,4,3,7]
> > [9,2,13,11,9,15,12,18,3,5]
> > [3,5,17,10,3,5,7,13,18,10]
> > [7,6,11,14,7,14,3,16,4,11]
> > [6,3,15,18,6,3,12,9,16,15]
> >
> > Will this be a problem?
>
> For replicated pools, it probably will cause trouble. For EC pools I
> think it should work fine, but obviously you're losing all kinds of
> redundancy. Nothing in the system will do work to avoid colocating
> them if you use a rule like this. Rather than distributing some of the
> replicas randomly across DCs, you really just want to split them up
> evenly across datacenters (or in some ratio, if one has more space
> than the other). Given CRUSH's current abilities that does require
> building the replication size into the rule, but such is life.
>
>
> > When CRUSH is executed, will it only consider OSDs which are (up, in),
> > or all OSDs in the map, and then filter them from the list afterwards?
>
> CRUSH will consider all OSDs, but if it selects any OSDs which are out
> then it retries until it gets one that is still marked in.
> -Greg
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com