Re: [ceph-users] Shall host weight auto reduce on hdd failure?
On 2019-12-05 02:33, Janne Johansson wrote:
> On Thu, 5 Dec 2019 at 00:28, Milan Kupcevic
> <mailto:milan_kupce...@harvard.edu> wrote:
>
>> There is plenty of space to take more than a few failed nodes. But the
>> question was about what is going on inside a node with a few failed
>> drives. Current Ceph behavior keeps increasing the number of placement
>> groups on surviving drives inside the same node. It does not spread
>> them across the cluster. So, let's get back to the original question:
>> shall host weight auto reduce on hdd failure, or not?
>
> If the OSDs are still in the crush map, with non-zero weights, they will
> add "value" to the host, and hence the host gets as many PGs as the sum
> of the crush values (i.e., sizes) says it can bear.
>
> If some of the OSDs have zero OSD-reweight values, they will not take a
> part of the burden, but rather let the "surviving" OSDs on the host take
> more load, until the cluster decides the broken OSDs are down and out,
> at which point the cluster rebalances according to the general
> algorithm, which should(*) even it out, letting the OSD hosts with
> fewer OSDs have fewer PGs and hence less data.

Well, that is simply not happening. See the state of the WEIGHT and
REWEIGHT columns in this sample of four nodes which are part of a huge
cluster:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-December/037602.html

The failed osds have definitely been down and out for a significant
period of time. Also compare the number of placement groups (PGS) per
osd on all of the presented nodes.

Milan

--
Milan Kupcevic
Senior Cyberinfrastructure Engineer at Project NESE
Harvard University
FAS Research Computing
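The fix implied by Janne's explanation (a sketch, not commands from the
thread) is to make the host's CRUSH weight reflect its surviving
capacity by zeroing the CRUSH weight of the failed OSDs by hand. Marking
an OSD out only sets its REWEIGHT to 0; its CRUSH WEIGHT, and with it
the host bucket's total weight and PG share, stays unchanged:

  # Assuming the dead drives are osd.408, osd.552 and osd.565, as on
  # node osd051 in the exhibit below; substitute the actual ids.
  ceph osd crush reweight osd.408 0
  ceph osd crush reweight osd.552 0
  ceph osd crush reweight osd.565 0

  # The host weight should now drop by 3 x 9.20389, and the displaced
  # PGs remap across the cluster instead of onto the surviving drives.
  ceph osd df tree name osd051

Permanently removing the failed OSDs with "ceph osd purge" reduces the
host weight in the same way.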
Re: [ceph-users] Shall host weight auto reduce on hdd failure?
On 2019-12-04 04:11, Janne Johansson wrote:
> On Wed, 4 Dec 2019 at 01:37, Milan Kupcevic
> <mailto:milan_kupce...@harvard.edu> wrote:
>
>> This cluster can handle this case at this moment as it has got plenty
>> of free space. I wonder how this is going to play out when we get to
>> 90% usage on the whole cluster. A single backplane failure in a node
>> takes
>
> You should not run any file storage system to 90% full, ceph or
> otherwise.
>
> You should set a target for how full it can get before you must add new
> hardware to it, be it more drives or hosts with drives, and as noted
> below, you should probably include at least one failed node in this
> calculation, so that planned maintenance doesn't become a critical
> situation.

There is plenty of space to take more than a few failed nodes. But the
question was about what is going on inside a node with a few failed
drives. Current Ceph behavior keeps increasing the number of placement
groups on surviving drives inside the same node. It does not spread them
across the cluster. So, let's get back to the original question: shall
host weight auto reduce on hdd failure, or not?

Milan

--
Milan Kupcevic
Senior Cyberinfrastructure Engineer at Project NESE
Harvard University
FAS Research Computing
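To put numbers on Janne's sizing rule (illustrative figures, not from
the thread): with H equally sized hosts and the default backfillfull
ratio of 0.90, surviving the loss of one whole host means keeping
overall utilization below 0.90 * (H - 1) / H. At H = 20 that works out
to 0.90 * 19/20, roughly 0.85, so about 85% full is the practical
ceiling; a cluster already at 90% has no headroom left to re-replicate
even one host, never mind a failure during planned maintenance.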
[ceph-users] Shall host weight auto reduce on hdd failure?
On hdd failure, the number of placement groups on the rest of the osds
on the same host goes up. I would expect equal distribution of the
failed placement groups across the cluster, not just on the troubled
host. Shall the host weight auto reduce whenever an osd gets out?

Exhibit 1: Attached osd-df-tree file. The number of placement groups per
osd on healthy nodes across the cluster is around 160, see osd050 and
osd056. The number of placement groups per osd on nodes with hdd
failures goes noticeably up, more so as more hdd failures happen on the
same node, see osd051 and osd053.

This cluster can handle this case at this moment as it has got plenty of
free space. I wonder how this is going to play out when we get to 90%
usage on the whole cluster. A single backplane failure in a node takes
four drives out at once; that is 30% of the storage space on a node. The
whole cluster would have enough space to host the failed placement
groups, but one node would not.

This cluster is running Nautilus 14.2.0 with default settings, deployed
using ceph-ansible.

Milan

--
Milan Kupcevic
Senior Cyberinfrastructure Engineer at Project NESE
Harvard University
FAS Research Computing

> ceph osd df tree name osd050
ID   CLASS WEIGHT    REWEIGHT SIZE    RAW USE DATA     OMAP    META    AVAIL   %USE  VAR  PGS STATUS TYPE NAME
-130       110.88315        - 111 TiB 6.0 TiB  4.7 TiB 563 MiB  21 GiB 105 TiB  5.39 1.00   -        host osd050
 517   hdd   9.20389      1.0 9.2 TiB 442 GiB  329 GiB  16 KiB 1.7 GiB 8.8 TiB  4.69 0.87 157     up osd.517
 532   hdd   9.20389      1.0 9.2 TiB 465 GiB  352 GiB  32 KiB 1.8 GiB 8.7 TiB  4.94 0.92 170     up osd.532
 544   hdd   9.20389      1.0 9.2 TiB 447 GiB  334 GiB  32 KiB 1.8 GiB 8.8 TiB  4.74 0.88 153     up osd.544
 562   hdd   9.20389      1.0 9.2 TiB 440 GiB  328 GiB  64 KiB 1.5 GiB 8.8 TiB  4.67 0.87 159     up osd.562
 575   hdd   9.20389      1.0 9.2 TiB 479 GiB  366 GiB  88 KiB 1.9 GiB 8.7 TiB  5.08 0.94 175     up osd.575
 592   hdd   9.20389      1.0 9.2 TiB 434 GiB  321 GiB  24 KiB 1.4 GiB 8.8 TiB  4.60 0.85 153     up osd.592
 605   hdd   9.20389      1.0 9.2 TiB 456 GiB  343 GiB     0 B 1.5 GiB 8.8 TiB  4.84 0.90 170     up osd.605
 618   hdd   9.20389      1.0 9.2 TiB 473 GiB  360 GiB  16 KiB 1.6 GiB 8.7 TiB  5.01 0.93 172     up osd.618
 631   hdd   9.20389      1.0 9.2 TiB 461 GiB  348 GiB  44 KiB 1.5 GiB 8.8 TiB  4.89 0.91 165     up osd.631
 644   hdd   9.20389      1.0 9.2 TiB 459 GiB  346 GiB  92 KiB 1.7 GiB 8.8 TiB  4.87 0.90 163     up osd.644
 656   hdd   9.20389      1.0 9.2 TiB 433 GiB  320 GiB  68 KiB 1.4 GiB 8.8 TiB  4.59 0.85 156     up osd.656
 669   hdd   9.20389      1.0 9.2 TiB 1.1 TiB 1019 GiB  36 KiB 2.6 GiB 8.1 TiB 12.01 2.23 169     up osd.669
 682   ssd   0.43649      1.0 447 GiB 3.1 GiB  2.1 GiB 562 MiB 462 MiB 444 GiB  0.69 0.13 168     up osd.682
                    TOTAL    111 TiB 6.0 TiB  4.7 TiB 563 MiB  21 GiB 105 TiB  5.39
MIN/MAX VAR: 0.13/2.23  STDDEV: 2.32

> ceph osd df tree name osd051
ID   CLASS WEIGHT    REWEIGHT SIZE    RAW USE DATA     OMAP    META    AVAIL   %USE  VAR  PGS STATUS TYPE NAME
-148       110.88315        -  83 TiB 4.9 TiB  4.0 TiB 573 MiB  20 GiB  78 TiB  5.94 1.00   -        host osd051
 408   hdd   9.20389        0     0 B     0 B      0 B     0 B     0 B     0 B     0    0   0   down osd.408
 538   hdd   9.20389      1.0 9.2 TiB 542 GiB  429 GiB  24 KiB 2.4 GiB 8.7 TiB  5.75 0.97 212     up osd.538
 552   hdd   9.20389        0     0 B     0 B      0 B     0 B     0 B     0 B     0    0   0   down osd.552
 565   hdd   9.20389        0     0 B     0 B      0 B     0 B     0 B     0 B     0    0   0   down osd.565
 578   hdd   9.20389      1.0 9.2 TiB 557 GiB  444 GiB  56 KiB 2.0 GiB 8.7 TiB  5.91 0.99 213     up osd.578
 590   hdd   9.20389      1.0 9.2 TiB 533 GiB  420 GiB  34 KiB 2.4 GiB 8.7 TiB  5.66 0.95 212     up osd.590
 603   hdd   9.20389      1.0 9.2 TiB 562 GiB  449 GiB  76 KiB 2.2 GiB 8.7 TiB  5.96 1.00 218     up osd.603
 616   hdd   9.20389      1.0 9.2 TiB 553 GiB  440 GiB  16 KiB 2.2 GiB 8.7 TiB  5.86 0.99 217     up osd.616
 629   hdd   9.20389      1.0 9.2 TiB 579 GiB  466 GiB  40 KiB 2.0 GiB 8.6 TiB  6.14 1.03 228     up osd.629
 642   hdd   9.20389      1.0 9.2 TiB 588 GiB  475 GiB  40 KiB 2.6 GiB 8.6 TiB  6.23 1.05 228     up osd.642
 655   hdd   9.20389      1.0 9.2 TiB 583 GiB  470 GiB  32 KiB 2.3 GiB 8.6 TiB  6.18 1.04 223     up osd.655
 668   hdd   9.20389      1.0 9.2 TiB 570 GiB  457 GiB  32 KiB 1.9 GiB 8.6 TiB  6.05 1.02 229     up osd.668
 681   ssd   0.43649      1.0 447 GiB 3.1 GiB  2.1 GiB 573 MiB 451 MiB 444 GiB  0.69 0.12 167     up osd.681
                    TOTAL     83 TiB 4.9 TiB  4.0 TiB 5
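The PGS columns above make the imbalance concrete. Summing the hdd rows:
osd050's twelve healthy hdds carry 1962 PGs in total, about 164 each.
osd051 keeps the same CRUSH weight (110.88315) even with three drives
down and out, so it is still assigned essentially the same total; its
nine surviving hdds carry 1980 PGs, about 220 each, right in line with
1962 / 9 ≈ 218. The host's PG share follows its CRUSH weight, not its
live capacity, so until that weight is reduced the surviving drives in
the troubled host absorb everything.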