We are running Jewel (10.2.10) on our Ceph cluster with 6 OSD hosts and 3
MONs: 144 8 TB drives (24 per host) across the 6 OSD hosts, all with uniform
CRUSH weights.
In tests simulating the failure of an entire OSD host, or even just a few
drives on an OSD host, we see that each OSD drive we redeploy comes back
holding at least twice as much data as it did before.
Here's a snippet of a "ceph osd df tree" to show what I mean:
 ID WEIGHT    REWEIGHT SIZE  USE    AVAIL %USE  VAR  PGS TYPE NAME
 -6 174.59921        - 174T  16530G 158T   9.25 1.18   0 host osd05
 96   7.27499  1.00000 7449G   520G 6929G  6.98 0.89 185     osd.96
 97   7.27499  1.00000 7449G   458G 6990G  6.16 0.78 179     osd.97
 98   7.27499  1.00000 7449G   415G 7033G  5.58 0.71 172     osd.98
 99   7.27499  1.00000 7449G   475G 6974G  6.38 0.81 168     osd.99
100   7.27499  1.00000 7449G   480G 6968G  6.45 0.82 175     osd.100
101   7.27499  1.00000 7449G   407G 7041G  5.47 0.70 174     osd.101
102   7.27499  1.00000 7449G   476G 6972G  6.40 0.81 187     osd.102
103   7.27499  1.00000 7449G   513G 6936G  6.89 0.88 170     osd.103
104   7.27499  1.00000 7449G   423G 7025G  5.69 0.72 175     osd.104
105   7.27499  1.00000 7449G   469G 6980G  6.30 0.80 170     osd.105
106   7.27499  1.00000 7449G   373G 7076G  5.01 0.64 177     osd.106
107   7.27499  1.00000 7449G   467G 6982G  6.27 0.80 180     osd.107
108   7.27499  1.00000 7449G   497G 6951G  6.68 0.85 166     osd.108
109   7.27499  1.00000 7449G   495G 6953G  6.66 0.85 174     osd.109
110   7.27499  1.00000 7449G   428G 7020G  5.75 0.73 172     osd.110
111   7.27499  1.00000 7449G   488G 6961G  6.55 0.83 191     osd.111
112   7.27499  1.00000 7449G   619G 6830G  8.31 1.06 200     osd.112
113   7.27499  1.00000 7449G   467G 6981G  6.28 0.80 174     osd.113
116   7.27489  1.00000 7449G  1324G 6124G 17.78 2.26 184     osd.116
117   7.27489  1.00000 7449G  1491G 5958G 20.02 2.55 210     osd.117
118   7.27489  1.00000 7449G  1277G 6171G 17.15 2.18 176     osd.118
119   7.27489  1.00000 7449G  1379G 6070G 18.51 2.35 191     osd.119
114   7.27489  1.00000 7449G  1358G 6090G 18.24 2.32 197     osd.114
115   7.27489  1.00000 7449G  1218G 6231G 16.35 2.08 173     osd.115
Drives 114 through 119 have been redeployed as if they had failed outright,
and they show elevated USE and %USE compared to the other 18 OSDs on the host
that were not redeployed.
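To put a number on "at least twice as much": averaging the USE column from
the df tree above (values in GB, copied from the output) shows roughly a
2.8x gap between the redeployed and untouched OSDs:

```shell
# USE values (GB) copied from the "ceph osd df tree" output above.
UNTOUCHED="520 458 415 475 480 407 476 513 423 469 373 467 497 495 428 488 619 467"
REDEPLOYED="1324 1491 1277 1379 1358 1218"

# Average a space-separated list of integers.
avg() { echo "$1" | tr ' ' '\n' | awk '{s+=$1; n++} END {printf "%.0f\n", s/n}'; }

echo "untouched avg:  $(avg "$UNTOUCHED")G"   # ~471G
echo "redeployed avg: $(avg "$REDEPLOYED")G"  # ~1341G
```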
To remove the OSDs we run the following commands, substituting the correct
OSD id, after stopping the corresponding osd service(s):

ceph osd crush reweight osd.<id> 0
ceph osd out <id>
ceph osd crush remove osd.<id>
ceph auth del osd.<id>
ceph osd rm <id>

Then we clear the partitions on the disks involved and add them back via:

ceph-deploy osd create <host>:<device>
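For a whole host the removal steps can be scripted; a rough sketch, with the
OSD ids and host/device names below as placeholders to be filled in for the
host being redeployed:

```shell
#!/bin/sh
# Drain and remove a list of OSDs, then redeploy the disks.
# The ids and host:device pairs are examples -- substitute your own.
set -e

OSD_IDS="114 115 116 117 118 119"

for id in $OSD_IDS; do
    systemctl stop "ceph-osd@${id}"        # stop the daemon first
    ceph osd crush reweight "osd.${id}" 0  # drain data off the OSD
    ceph osd out "${id}"
    ceph osd crush remove "osd.${id}"      # drop it from the CRUSH map
    ceph auth del "osd.${id}"              # remove its cephx key
    ceph osd rm "${id}"                    # delete it from the osdmap
done

# After clearing the partitions (e.g. with ceph-disk zap), redeploy:
# ceph-deploy osd create osd05:/dev/sdX
```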
After the recovery is finished the cluster is at HEALTH_OK but with the above
unbalanced drives.
cluster xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
health HEALTH_OK
monmap e1: 3 mons at
{mon01=10.0.X.X:6789/0,mon02=10.0.X.X:6789/0,mon03=10.0.X.X:6789/0}
election epoch 150, quorum 0,1,2 mon01,mon02,mon03
osdmap e50114: 144 osds: 144 up, 144 in
flags sortbitwise,require_jewel_osds
pgmap v4825786: 8704 pgs, 5 pools, 61355 GB data, 26797 kobjects
84343 GB used, 965 TB / 1047 TB avail
8693 active+clean
7 active+clean+scrubbing
4 active+clean+scrubbing+deep
client io 195 kB/s rd, 714 op/s rd, 0 op/s wr
We also noticed that the write performance of the pools that existed before
this test process is about 30% lower in fio write tests. A brand-new pool
does not show the slowdown.
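For reference, a fio run along these lines reproduces the difference (the
pool and image names are placeholders, not our exact job file):

```shell
# Illustrative fio sequential-write test against an RBD image in one
# of the pre-existing pools; <old-pool> and <test-image> are placeholders.
fio --name=seqwrite \
    --ioengine=rbd --clientname=admin \
    --pool=<old-pool> --rbdname=<test-image> \
    --rw=write --bs=4M --iodepth=16 \
    --size=10G
```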
We have no idea why these OSDs are coming back with 2 to 3 times their
previous USE values, so thanks for any help with this,
Andrew Ferris
Network & System Management
UBC Centre for Heart & Lung Innovation
St. Paul's Hospital, Vancouver
http://www.hli.ubc.ca
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com