If you lose one of the hosts in a chassis, or a single drive, the PGs from
that drive/host will be redistributed to the other drives in that chassis
(because you only have 3 failure domains). That is to say, if you lose
tv-c1-al01, then all of the PGs and data that were on it will be
redistributed onto tv-c1-al02. The reason is that you only have 3 failure
domains and replica size 3.
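
If you want to sanity-check that offline, you can test the CRUSH map outside
the cluster. A rough sketch follows; the file names are just placeholders and
the crushtool options are from memory, so verify them against crushtool
--help on your version:

    # grab the cluster's compiled CRUSH map
    ceph osd getcrushmap -o crushmap.bin

    # show where rule 0 places 3 replicas with everything healthy
    crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-mappings | head

    # simulate losing tv-c1-al01 by zero-weighting its OSDs
    # (repeat for each of its 12 OSDs), then look at the mappings again
    crushtool -i crushmap.bin --reweight-item osd.5 0 -o crushmap-fail.bin
    crushtool -i crushmap-fail.bin --test --rule 0 --num-rep 3 --show-mappings | head

With only 3 chassis and replica 3, you should see the copies that used to
live on tv-c1-al01 land on tv-c1-al02 in the second run.
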
If you lost both tv-c1-al01 and tv-c1-al02, then you would run with only 2
copies of your data until you brought up a third failure domain again. Ceph
would never place 2 copies of your data inside of 1 failure domain. I
recommend not running in production with fewer than N+2 failure domains,
where N is your replica size. It allows for more efficient data redundancy
and lets you utilize a higher percentage of your total capacity. If you have
4 failure domains, the plan is to be able to survive losing 1 of them...
which means you shouldn't use more than ~55% of your total capacity, because
if you lose a node, that 55% across 4 nodes becomes ~73% across the remaining
3 (a quick back-of-the-envelope check is sketched after the quoted thread
below). Few clusters are balanced well enough to handle 73% full without
individual OSDs going above 80%. 3 failure domains can work if you replace
failed storage quickly.

On Mon, May 29, 2017, 12:07 PM Laszlo Budai <[email protected]> wrote:

> Dear all,
>
> How should ceph react in case of a host failure when from a total of 72
> OSDs 12 are out?
> Is it normal that for the remapping of the PGs it is not following the
> rule set in the crush map? (according to the rule the OSDs should be
> selected from different chassis).
>
> In the attached file you can find the crush map, and the results of:
> ceph health detail
> ceph osd dump
> ceph osd tree
> ceph -s
>
> I can send the pg dump in a separate mail on request. Its compressed size
> is exceeding the size accepted by this mailing list.
>
> Thank you for any help/directions.
>
> Kind regards,
> Laszlo
>
> On 29.05.2017 14:58, Laszlo Budai wrote:
> >
> > Hello all,
> >
> > We have a ceph cluster with 72 OSDs distributed on 6 hosts, in 3
> chassis. In our crush map we are distributing the PGs on chassis
> (complete crush map below):
> >
> > # rules
> > rule replicated_ruleset {
> >         ruleset 0
> >         type replicated
> >         min_size 1
> >         max_size 10
> >         step take default
> >         step chooseleaf firstn 0 type chassis
> >         step emit
> > }
> >
> > We had a host failure, and I can see that ceph is using 2 OSDs from the
> same chassis for a lot of the remapped PGs. Even worse, I can see that
> there are cases when a PG is using two OSDs from the same host, like here:
> >
> > 3.5f6  37  0  4  37  0  149446656  3040
> 3040  active+remapped  2017-05-26 11:29:23.122820  61820'222074
> 61820:158025  [52,39]  52  [52,39,3]  52  61488'198356
> 2017-05-23 23:51:56.210597  61488'198356  2017-05-23 23:51:56.210597
> >
> > I have this in the log:
> > 2017-05-26 11:26:53.244424 osd.52 10.12.193.69:6801/7044 1510 : cluster
> [INF] 3.5f6 restarting backfill on osd.39 from (0'0,0'0] MAX to 61488'203000
> >
> >
> > What can be wrong?
> >
> >
> > Our crush map looks like this:
> >
> > # begin crush map
> > tunable choose_local_tries 0
> > tunable choose_local_fallback_tries 0
> > tunable choose_total_tries 50
> > tunable chooseleaf_descend_once 1
> > tunable straw_calc_version 1
> >
> > # devices
> > device 0 osd.0
> > device 1 osd.1
> > device 2 osd.2
> > device 3 osd.3
> > ....
> > device 69 osd.69
> > device 70 osd.70
> > device 71 osd.71
> >
> > # types
> > type 0 osd
> > type 1 host
> > type 2 chassis
> > type 3 rack
> > type 4 row
> > type 5 pdu
> > type 6 pod
> > type 7 room
> > type 8 datacenter
> > type 9 region
> > type 10 root
> >
> > # buckets
> > host tv-c1-al01 {
> >         id -7           # do not change unnecessarily
> >         # weight 21.840
> >         alg straw
> >         hash 0  # rjenkins1
> >         item osd.5 weight 1.820
> >         item osd.11 weight 1.820
> >         item osd.17 weight 1.820
> >         item osd.23 weight 1.820
> >         item osd.29 weight 1.820
> >         item osd.35 weight 1.820
> >         item osd.41 weight 1.820
> >         item osd.47 weight 1.820
> >         item osd.53 weight 1.820
> >         item osd.59 weight 1.820
> >         item osd.65 weight 1.820
> >         item osd.71 weight 1.820
> > }
> > host tv-c1-al02 {
> >         id -3           # do not change unnecessarily
> >         # weight 21.840
> >         alg straw
> >         hash 0  # rjenkins1
> >         item osd.1 weight 1.820
> >         item osd.7 weight 1.820
> >         item osd.13 weight 1.820
> >         item osd.19 weight 1.820
> >         item osd.25 weight 1.820
> >         item osd.31 weight 1.820
> >         item osd.37 weight 1.820
> >         item osd.43 weight 1.820
> >         item osd.49 weight 1.820
> >         item osd.55 weight 1.820
> >         item osd.61 weight 1.820
> >         item osd.67 weight 1.820
> > }
> > chassis tv-c1 {
> >         id -8           # do not change unnecessarily
> >         # weight 43.680
> >         alg straw
> >         hash 0  # rjenkins1
> >         item tv-c1-al01 weight 21.840
> >         item tv-c1-al02 weight 21.840
> > }
> > host tv-c2-al01 {
> >         id -5           # do not change unnecessarily
> >         # weight 21.840
> >         alg straw
> >         hash 0  # rjenkins1
> >         item osd.3 weight 1.820
> >         item osd.9 weight 1.820
> >         item osd.15 weight 1.820
> >         item osd.21 weight 1.820
> >         item osd.27 weight 1.820
> >         item osd.33 weight 1.820
> >         item osd.39 weight 1.820
> >         item osd.45 weight 1.820
> >         item osd.51 weight 1.820
> >         item osd.57 weight 1.820
> >         item osd.63 weight 1.820
> >         item osd.70 weight 1.820
> > }
> > host tv-c2-al02 {
> >         id -2           # do not change unnecessarily
> >         # weight 21.840
> >         alg straw
> >         hash 0  # rjenkins1
> >         item osd.0 weight 1.820
> >         item osd.6 weight 1.820
> >         item osd.12 weight 1.820
> >         item osd.18 weight 1.820
> >         item osd.24 weight 1.820
> >         item osd.30 weight 1.820
> >         item osd.36 weight 1.820
> >         item osd.42 weight 1.820
> >         item osd.48 weight 1.820
> >         item osd.54 weight 1.820
> >         item osd.60 weight 1.820
> >         item osd.66 weight 1.820
> > }
> > chassis tv-c2 {
> >         id -9           # do not change unnecessarily
> >         # weight 43.680
> >         alg straw
> >         hash 0  # rjenkins1
> >         item tv-c2-al01 weight 21.840
> >         item tv-c2-al02 weight 21.840
> > }
> > host tv-c1-al03 {
> >         id -6           # do not change unnecessarily
> >         # weight 21.840
> >         alg straw
> >         hash 0  # rjenkins1
> >         item osd.4 weight 1.820
> >         item osd.10 weight 1.820
> >         item osd.16 weight 1.820
> >         item osd.22 weight 1.820
> >         item osd.28 weight 1.820
> >         item osd.34 weight 1.820
> >         item osd.40 weight 1.820
> >         item osd.46 weight 1.820
> >         item osd.52 weight 1.820
> >         item osd.58 weight 1.820
> >         item osd.64 weight 1.820
> >         item osd.69 weight 1.820
> > }
> > host tv-c2-al03 {
> >         id -4           # do not change unnecessarily
> >         # weight 21.840
> >         alg straw
> >         hash 0  # rjenkins1
> >         item osd.2 weight 1.820
> >         item osd.8 weight 1.820
> >         item osd.14 weight 1.820
> >         item osd.20 weight 1.820
> >         item osd.26 weight 1.820
> >         item osd.32 weight 1.820
> >         item osd.38 weight 1.820
> >         item osd.44 weight 1.820
> >         item osd.50 weight 1.820
> >         item osd.56 weight 1.820
> >         item osd.62 weight 1.820
> >         item osd.68 weight 1.820
> > }
> > chassis tv-c3 {
> >         id -10          # do not change unnecessarily
> >         # weight 43.680
> >         alg straw
> >         hash 0  # rjenkins1
> >         item tv-c1-al03 weight 21.840
> >         item tv-c2-al03 weight 21.840
> > }
> > root default {
> >         id -1           # do not change unnecessarily
> >         # weight 131.040
> >         alg straw
> >         hash 0  # rjenkins1
> >         item tv-c1 weight 43.680
> >         item tv-c2 weight 43.680
> >         item tv-c3 weight 43.680
> > }
> >
> > # rules
> > rule replicated_ruleset {
> >         ruleset 0
> >         type replicated
> >         min_size 1
> >         max_size 10
> >         step take default
> >         step chooseleaf firstn 0 type chassis
> >         step emit
> > }
> >
> > # end crush map
> >
> >
> > Thank you,
> > Laszlo
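
For reference, here is the quick capacity check mentioned at the top of this
mail. It is plain arithmetic, not a Ceph command, and the 55%/4-node figures
are only the example used above:

    # per-node fullness after losing 1 of D failure domains, starting at U% used
    U=55; D=4
    echo "scale=1; $U * $D / ($D - 1)" | bc    # prints 73.3
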
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
