If you lose one of the hosts in a chassis, or a single drive, the PGs from
that drive/host will be redistributed to the other drives in that chassis
(because you only have 3 failure domains). That is to say, if you lose
tv-c1-al01, then all of the PGs and data that were on it will be
redistributed onto tv-c1-al02. The reason is that you only have 3 failure
domains and replica size 3.
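
If you want to sanity-check that offline, you can test the CRUSH map outside
the cluster. A rough sketch follows; the file names are just placeholders and
the crushtool options are from memory, so verify them against crushtool
--help on your version:

    # grab the cluster's compiled CRUSH map
    ceph osd getcrushmap -o crushmap.bin

    # show where rule 0 places 3 replicas with everything healthy
    crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-mappings | head

    # simulate losing tv-c1-al01 by zero-weighting its OSDs
    # (repeat for each of its 12 OSDs), then look at the mappings again
    crushtool -i crushmap.bin --reweight-item osd.5 0 -o crushmap-fail.bin
    crushtool -i crushmap-fail.bin --test --rule 0 --num-rep 3 --show-mappings | head

With only 3 chassis and replica 3, you should see the copies that used to
live on tv-c1-al01 land on tv-c1-al02 in the second run.
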
If you lost both tv-c1-al01 and tv-c1-al02, then you would run with only 2
copies of your data until you brought up a third failure domain again. Ceph
would never place 2 copies of your data inside of 1 failure domain. I
recommend not running in production with fewer than N+2 failure domains,
where N is your replica size. It allows for more efficient data redundancy
and lets you utilize a higher percentage of your total capacity. If you have
4 failure domains, the plan is to be able to survive losing 1 of them...
which means you shouldn't use more than ~55% of your total capacity, because
if you lose a node, that 55% across 4 nodes becomes ~73% across the remaining
3 (a quick back-of-the-envelope check is sketched after the quoted thread
below). Few clusters are balanced well enough to handle 73% full without
individual OSDs going above 80%. 3 failure domains can work if you replace
failed storage quickly.

On Mon, May 29, 2017, 12:07 PM Laszlo Budai <[email protected]> wrote:

> Dear all,
>
> How should ceph react in case of a host failure when from a total of 72
> OSDs 12 are out?
> Is it normal that for the remapping of the PGs it is not following the
> rule set in the crush map? (according to the rule the OSDs should be
> selected from different chassis).
>
> In the attached file you can find the crush map, and the results of:
> ceph health detail
> ceph osd dump
> ceph osd tree
> ceph -s
>
> I can send the pg dump in a separate mail on request. Its compressed size
> is exceeding the size accepted by this mailing list.
>
> Thank you for any help/directions.
>
> Kind regards,
> Laszlo
>
> On 29.05.2017 14:58, Laszlo Budai wrote:
> >
> > Hello all,
> >
> > We have a ceph cluster with 72 OSDs distributed on 6 hosts, in 3
> chassis. In our crush map we are distributing the PGs on chassis
> (complete crush map below):
> >
> > # rules
> > rule replicated_ruleset {
> >         ruleset 0
> >         type replicated
> >         min_size 1
> >         max_size 10
> >         step take default
> >         step chooseleaf firstn 0 type chassis
> >         step emit
> > }
> >
> > We had a host failure, and I can see that ceph is using 2 OSDs from the
> same chassis for a lot of the remapped PGs. Even worse, I can see that
> there are cases when a PG is using two OSDs from the same host, like here:
> >
> > 3.5f6  37  0  4  37  0  149446656  3040
> 3040  active+remapped  2017-05-26 11:29:23.122820  61820'222074
> 61820:158025  [52,39]  52  [52,39,3]  52  61488'198356
> 2017-05-23 23:51:56.210597  61488'198356  2017-05-23 23:51:56.210597
> >
> > I have this in the log:
> > 2017-05-26 11:26:53.244424 osd.52 10.12.193.69:6801/7044 1510 : cluster
> [INF] 3.5f6 restarting backfill on osd.39 from (0'0,0'0] MAX to 61488'203000
> >
> >
> > What can be wrong?
> >
> >
> > Our crush map looks like this:
> >
> > # begin crush map
> > tunable choose_local_tries 0
> > tunable choose_local_fallback_tries 0
> > tunable choose_total_tries 50
> > tunable chooseleaf_descend_once 1
> > tunable straw_calc_version 1
> >
> > # devices
> > device 0 osd.0
> > device 1 osd.1
> > device 2 osd.2
> > device 3 osd.3
> > ....
> > device 69 osd.69
> > device 70 osd.70
> > device 71 osd.71
> >
> > # types
> > type 0 osd
> > type 1 host
> > type 2 chassis
> > type 3 rack
> > type 4 row
> > type 5 pdu
> > type 6 pod
> > type 7 room
> > type 8 datacenter
> > type 9 region
> > type 10 root
> >
> > # buckets
> > host tv-c1-al01 {
> >         id -7           # do not change unnecessarily
> >         # weight 21.840
> >         alg straw
> >         hash 0  # rjenkins1
> >         item osd.5 weight 1.820
> >         item osd.11 weight 1.820
> >         item osd.17 weight 1.820
> >         item osd.23 weight 1.820
> >         item osd.29 weight 1.820
> >         item osd.35 weight 1.820
> >         item osd.41 weight 1.820
> >         item osd.47 weight 1.820
> >         item osd.53 weight 1.820
> >         item osd.59 weight 1.820
> >         item osd.65 weight 1.820
> >         item osd.71 weight 1.820
> > }
> > host tv-c1-al02 {
> >         id -3           # do not change unnecessarily
> >         # weight 21.840
> >         alg straw
> >         hash 0  # rjenkins1
> >         item osd.1 weight 1.820
> >         item osd.7 weight 1.820
> >         item osd.13 weight 1.820
> >         item osd.19 weight 1.820
> >         item osd.25 weight 1.820
> >         item osd.31 weight 1.820
> >         item osd.37 weight 1.820
> >         item osd.43 weight 1.820
> >         item osd.49 weight 1.820
> >         item osd.55 weight 1.820
> >         item osd.61 weight 1.820
> >         item osd.67 weight 1.820
> > }
> > chassis tv-c1 {
> >         id -8           # do not change unnecessarily
> >         # weight 43.680
> >         alg straw
> >         hash 0  # rjenkins1
> >         item tv-c1-al01 weight 21.840
> >         item tv-c1-al02 weight 21.840
> > }
> > host tv-c2-al01 {
> >         id -5           # do not change unnecessarily
> >         # weight 21.840
> >         alg straw
> >         hash 0  # rjenkins1
> >         item osd.3 weight 1.820
> >         item osd.9 weight 1.820
> >         item osd.15 weight 1.820
> >         item osd.21 weight 1.820
> >         item osd.27 weight 1.820
> >         item osd.33 weight 1.820
> >         item osd.39 weight 1.820
> >         item osd.45 weight 1.820
> >         item osd.51 weight 1.820
> >         item osd.57 weight 1.820
> >         item osd.63 weight 1.820
> >         item osd.70 weight 1.820
> > }
> > host tv-c2-al02 {
> >         id -2           # do not change unnecessarily
> >         # weight 21.840
> >         alg straw
> >         hash 0  # rjenkins1
> >         item osd.0 weight 1.820
> >         item osd.6 weight 1.820
> >         item osd.12 weight 1.820
> >         item osd.18 weight 1.820
> >         item osd.24 weight 1.820
> >         item osd.30 weight 1.820
> >         item osd.36 weight 1.820
> >         item osd.42 weight 1.820
> >         item osd.48 weight 1.820
> >         item osd.54 weight 1.820
> >         item osd.60 weight 1.820
> >         item osd.66 weight 1.820
> > }
> > chassis tv-c2 {
> >         id -9           # do not change unnecessarily
> >         # weight 43.680
> >         alg straw
> >         hash 0  # rjenkins1
> >         item tv-c2-al01 weight 21.840
> >         item tv-c2-al02 weight 21.840
> > }
> > host tv-c1-al03 {
> >         id -6           # do not change unnecessarily
> >         # weight 21.840
> >         alg straw
> >         hash 0  # rjenkins1
> >         item osd.4 weight 1.820
> >         item osd.10 weight 1.820
> >         item osd.16 weight 1.820
> >         item osd.22 weight 1.820
> >         item osd.28 weight 1.820
> >         item osd.34 weight 1.820
> >         item osd.40 weight 1.820
> >         item osd.46 weight 1.820
> >         item osd.52 weight 1.820
> >         item osd.58 weight 1.820
> >         item osd.64 weight 1.820
> >         item osd.69 weight 1.820
> > }
> > host tv-c2-al03 {
> >         id -4           # do not change unnecessarily
> >         # weight 21.840
> >         alg straw
> >         hash 0  # rjenkins1
> >         item osd.2 weight 1.820
> >         item osd.8 weight 1.820
> >         item osd.14 weight 1.820
> >         item osd.20 weight 1.820
> >         item osd.26 weight 1.820
> >         item osd.32 weight 1.820
> >         item osd.38 weight 1.820
> >         item osd.44 weight 1.820
> >         item osd.50 weight 1.820
> >         item osd.56 weight 1.820
> >         item osd.62 weight 1.820
> >         item osd.68 weight 1.820
> > }
> > chassis tv-c3 {
> >         id -10          # do not change unnecessarily
> >         # weight 43.680
> >         alg straw
> >         hash 0  # rjenkins1
> >         item tv-c1-al03 weight 21.840
> >         item tv-c2-al03 weight 21.840
> > }
> > root default {
> >         id -1           # do not change unnecessarily
> >         # weight 131.040
> >         alg straw
> >         hash 0  # rjenkins1
> >         item tv-c1 weight 43.680
> >         item tv-c2 weight 43.680
> >         item tv-c3 weight 43.680
> > }
> >
> > # rules
> > rule replicated_ruleset {
> >         ruleset 0
> >         type replicated
> >         min_size 1
> >         max_size 10
> >         step take default
> >         step chooseleaf firstn 0 type chassis
> >         step emit
> > }
> >
> > # end crush map
> >
> >
> > Thank you,
> > Laszlo
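
For reference, here is the quick capacity check mentioned at the top of this
mail. It is plain arithmetic, not a Ceph command, and the 55%/4-node figures
are only the example used above:

    # per-node fullness after losing 1 of D failure domains, starting at U% used
    U=55; D=4
    echo "scale=1; $U * $D / ($D - 1)" | bc    # prints 73.3
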
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
