Re: [ceph-users] Shall host weight auto reduce on hdd failure?

2019-12-05 Thread Milan Kupcevic
On 2019-12-05 02:33, Janne Johansson wrote:
> On Thu, Dec 5, 2019 at 00:28, Milan Kupcevic
> <milan_kupce...@harvard.edu> wrote:
> 
> 
> 
> There is plenty of space to take more than a few failed nodes. But the
> question was about what is going on inside a node with a few failed
> drives. Current Ceph behavior keeps increasing the number of placement
> groups on the surviving drives inside the same node. It does not spread
> them across the cluster. So, let's get back to the original question:
> shall host weight auto reduce on hdd failure, or not?
> 
> 
> If the OSDs are still in the crush map, with non-zero weights, they will
> add "value" to the host, and hence the host gets as many PGs as the sum
> of the crush values (i.e., sizes) says it can bear.
> If some of the OSDs have zero OSD-reweight values, they will not take a
> part of the burden, but rather let the "surviving" OSDs on the host take
> more load, until the cluster decides the broken OSDs are down and out,
> at which point the cluster rebalances according to the general algorithm,
> which should(*) even it out, letting the OSD hosts with fewer OSDs have
> fewer PGs and hence less data.
> 


Well, that is simply not happening.

See the state of the WEIGHT and REWEIGHT columns in this sample of four
nodes which are part of a huge cluster:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-December/037602.html

The failed osds have definitely been down and out for a significant period
of time. Also compare the number of placement groups (PGS) per osd on all
of the presented nodes.
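
For illustration, a minimal sketch of the distinction (not a captured
session; osd.408 and the 110.88315 host weight are taken from the osd051
listing below, and the expected effect is my reading of the CRUSH behavior):

  # REWEIGHT 0 (the "out" state) leaves the CRUSH weight of the osd, and
  # therefore the weight of the host bucket, unchanged; CRUSH keeps
  # assigning the same share of PGs to the host and remaps them onto the
  # surviving osds of that host.
  ceph osd df tree name osd051       # host WEIGHT still shows 110.88315

  # Zeroing the CRUSH weight (or purging the osd) is what actually shrinks
  # the host bucket and lets those PGs move to other hosts.
  ceph osd crush reweight osd.408 0
  ceph osd df tree name osd051       # host WEIGHT should drop by ~9.2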

Milan


-- 
Milan Kupcevic
Senior Cyberinfrastructure Engineer at Project NESE
Harvard University
FAS Research Computing


Re: [ceph-users] Shall host weight auto reduce on hdd failure?

2019-12-04 Thread Milan Kupcevic
On 2019-12-04 04:11, Janne Johansson wrote:
> On Wed, Dec 4, 2019 at 01:37, Milan Kupcevic
> <milan_kupce...@harvard.edu> wrote:
> 
> This cluster can handle this case at this moment as it has got plenty of
> free space. I wonder how this is going to play out when we get to 90% of
> usage on the whole cluster. A single backplane failure in a node takes
> four drives out at once; that is 30% of the storage space on a node.
> 
> 
> You should not run any file storage system to 90% full, Ceph or otherwise.
> 
> You should set a target for how full it can get before you must add new
> hardware to it, be it more drives or hosts with drives, and as noted
> below, you should probably include at least one failed node in this
> calculation, so that planned maintenance doesn't become a critical
> situation. 
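
As a rough worked example of that calculation (hypothetical numbers,
assuming equal hosts, a replicated pool with failure domain = host, and the
default nearfull ratio of 0.85): with 10 equal hosts, losing one host means
the remaining 9 must absorb its data, so the steady-state fill level should
stay below 0.85 * 9/10, roughly 76%. A cluster already at 90% has no
headroom left for even a single failed host.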


There is plenty of space to take more than a few failed nodes. But the
question was about what is going on inside a node with a few failed
drives. Current Ceph behavior keeps increasing the number of placement
groups on the surviving drives inside the same node. It does not spread
them across the cluster. So, let's get back to the original question:
shall host weight auto reduce on hdd failure, or not?

Milan


-- 
Milan Kupcevic
Senior Cyberinfrastructure Engineer at Project NESE
Harvard University
FAS Research Computing


[ceph-users] Shall host weight auto reduce on hdd failure?

2019-12-03 Thread Milan Kupcevic


On hdd failure the number of placement groups on the rest of the osds on
the same host goes up. I would expect equal distribution of the failed
placement groups across the cluster, not just on the troubled host. Shall
the host weight auto reduce whenever an osd gets out?
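
In the absence of such behavior, a rough sketch of the workaround I have in
mind (untested; field names as printed by "ceph osd tree -f json" on this
Nautilus build, and it assumes jq is available):

  # For every osd that is down and already marked out, zero its CRUSH
  # weight so that the weight of its host bucket shrinks and the PGs get
  # remapped across the cluster instead of piling up on the surviving
  # drives of the same host.
  ceph osd tree -f json \
    | jq -r '.nodes[] | select(.type == "osd" and .status == "down" and .reweight == 0) | .name' \
    | while read osd; do
        ceph osd crush reweight "$osd" 0
      done
  # The original CRUSH weight has to be restored by hand, e.g.
  # "ceph osd crush reweight osd.408 9.20389", once the drive is replaced.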

Exhibit 1: Attached osd-df-tree file. The number of placement groups per
osd on healthy nodes across the cluster is around 160, see osd050 and
osd056. The number of placement groups per osd on nodes with hdd failures
goes up noticeably, more so as more hdd failures happen on the same node,
see osd051 and osd053.
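
The arithmetic matches what one would expect if the host weight stays
fixed: osd051 has 3 of its 12 hdd osds down and out, yet the host bucket
still weighs 110.88, so the same share of PGs gets remapped onto the 9
surviving hdds, roughly 160 * 12/9, about 213 placement groups per osd,
which is close to what the attached listing shows (212 to 229).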

This cluster can handle this case at this moment as it has got plenty of
free space. I wonder how this is going to play out when we get to 90% of
usage on the whole cluster. A single backplane failure in a node takes
four drives out at once; that is 30% of the storage space on a node. The
whole cluster would have enough space to host the failed placement groups,
but one node would not.
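
To put rough numbers on it (assuming the default ratios: nearfull 0.85,
backfillfull 0.90, full 0.95): if the hdds on a node were already about 90%
full and a backplane failure removed 4 of the 12, the surviving 8 would
have to absorb 12/8 = 1.5 times their current data, a nominal fill of about
135%. They would hit the backfillfull and full ratios long before that, so
recovery onto the same node would stall even though the cluster as a whole
still has space.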

This cluster is running Nautilus 14.2.0 with default settings deployed
using ceph-ansible.


Milan


-- 
Milan Kupcevic
Senior Cyberinfrastructure Engineer at Project NESE
Harvard University
FAS Research Computing



> ceph osd df tree name osd050
ID   CLASS WEIGHT    REWEIGHT SIZE    RAW USE DATA     OMAP    META    AVAIL    %USE  VAR  PGS STATUS TYPE NAME
-130       110.88315        - 111 TiB 6.0 TiB 4.7 TiB  563 MiB 21 GiB  105 TiB  5.39 1.00    -        host osd050
 517   hdd    9.20389  1.0    9.2 TiB 442 GiB 329 GiB   16 KiB 1.7 GiB 8.8 TiB  4.69 0.87  157     up      osd.517
 532   hdd    9.20389  1.0    9.2 TiB 465 GiB 352 GiB   32 KiB 1.8 GiB 8.7 TiB  4.94 0.92  170     up      osd.532
 544   hdd    9.20389  1.0    9.2 TiB 447 GiB 334 GiB   32 KiB 1.8 GiB 8.8 TiB  4.74 0.88  153     up      osd.544
 562   hdd    9.20389  1.0    9.2 TiB 440 GiB 328 GiB   64 KiB 1.5 GiB 8.8 TiB  4.67 0.87  159     up      osd.562
 575   hdd    9.20389  1.0    9.2 TiB 479 GiB 366 GiB   88 KiB 1.9 GiB 8.7 TiB  5.08 0.94  175     up      osd.575
 592   hdd    9.20389  1.0    9.2 TiB 434 GiB 321 GiB   24 KiB 1.4 GiB 8.8 TiB  4.60 0.85  153     up      osd.592
 605   hdd    9.20389  1.0    9.2 TiB 456 GiB 343 GiB      0 B 1.5 GiB 8.8 TiB  4.84 0.90  170     up      osd.605
 618   hdd    9.20389  1.0    9.2 TiB 473 GiB 360 GiB   16 KiB 1.6 GiB 8.7 TiB  5.01 0.93  172     up      osd.618
 631   hdd    9.20389  1.0    9.2 TiB 461 GiB 348 GiB   44 KiB 1.5 GiB 8.8 TiB  4.89 0.91  165     up      osd.631
 644   hdd    9.20389  1.0    9.2 TiB 459 GiB 346 GiB   92 KiB 1.7 GiB 8.8 TiB  4.87 0.90  163     up      osd.644
 656   hdd    9.20389  1.0    9.2 TiB 433 GiB 320 GiB   68 KiB 1.4 GiB 8.8 TiB  4.59 0.85  156     up      osd.656
 669   hdd    9.20389  1.0    9.2 TiB 1.1 TiB 1019 GiB  36 KiB 2.6 GiB 8.1 TiB 12.01 2.23  169     up      osd.669
 682   ssd    0.43649  1.0    447 GiB 3.1 GiB 2.1 GiB  562 MiB 462 MiB 444 GiB  0.69 0.13  168     up      osd.682
                         TOTAL 111 TiB 6.0 TiB 4.7 TiB  563 MiB 21 GiB  105 TiB  5.39
MIN/MAX VAR: 0.13/2.23  STDDEV: 2.32

> ceph osd df tree name osd051
ID   CLASS WEIGHT    REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE VAR  PGS STATUS TYPE NAME
-148       110.88315        -  83 TiB 4.9 TiB 4.0 TiB 573 MiB 20 GiB   78 TiB 5.94 1.00   -        host osd051
 408   hdd    9.20389        0     0 B     0 B     0 B     0 B     0 B     0 B    0    0   0   down     osd.408
 538   hdd    9.20389  1.0    9.2 TiB 542 GiB 429 GiB  24 KiB 2.4 GiB 8.7 TiB 5.75 0.97 212     up      osd.538
 552   hdd    9.20389        0     0 B     0 B     0 B     0 B     0 B     0 B    0    0   0   down     osd.552
 565   hdd    9.20389        0     0 B     0 B     0 B     0 B     0 B     0 B    0    0   0   down     osd.565
 578   hdd    9.20389  1.0    9.2 TiB 557 GiB 444 GiB  56 KiB 2.0 GiB 8.7 TiB 5.91 0.99 213     up      osd.578
 590   hdd    9.20389  1.0    9.2 TiB 533 GiB 420 GiB  34 KiB 2.4 GiB 8.7 TiB 5.66 0.95 212     up      osd.590
 603   hdd    9.20389  1.0    9.2 TiB 562 GiB 449 GiB  76 KiB 2.2 GiB 8.7 TiB 5.96 1.00 218     up      osd.603
 616   hdd    9.20389  1.0    9.2 TiB 553 GiB 440 GiB  16 KiB 2.2 GiB 8.7 TiB 5.86 0.99 217     up      osd.616
 629   hdd    9.20389  1.0    9.2 TiB 579 GiB 466 GiB  40 KiB 2.0 GiB 8.6 TiB 6.14 1.03 228     up      osd.629
 642   hdd    9.20389  1.0    9.2 TiB 588 GiB 475 GiB  40 KiB 2.6 GiB 8.6 TiB 6.23 1.05 228     up      osd.642
 655   hdd    9.20389  1.0    9.2 TiB 583 GiB 470 GiB  32 KiB 2.3 GiB 8.6 TiB 6.18 1.04 223     up      osd.655
 668   hdd    9.20389  1.0    9.2 TiB 570 GiB 457 GiB  32 KiB 1.9 GiB 8.6 TiB 6.05 1.02 229     up      osd.668
 681   ssd    0.43649  1.0    447 GiB 3.1 GiB 2.1 GiB 573 MiB 451 MiB 444 GiB 0.69 0.12 167     up      osd.681
                         TOTAL  83 TiB 4.9 TiB 4.0 TiB 5