This is expected behavior with CRUSH: placement is pseudo-random, so some imbalance is normal. You don't mention which release the cluster or the clients are running, so it's difficult to give an exact answer.
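If the cluster is Luminous or newer, something like this will show both (a quick sketch, not specific to your setup):

$ ceph versions    # which release each mon/mgr/osd daemon is running
$ ceph features    # release/feature level of currently connected clients and daemons

The `ceph features` output in particular tells you whether every client can handle pg-upmap, which matters for the balancer suggestion below.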
Don't mess with the CRUSH weights. Either adjust the reweight override values (the REWEIGHT column, not the CRUSH WEIGHT) with `ceph osd test-reweight-by-utilization` / `ceph osd reweight-by-utilization` (see https://docs.ceph.com/docs/master/rados/operations/control/), or use the balancer module in newer releases *iff* all clients are new enough to handle pg-upmap (see https://docs.ceph.com/docs/nautilus/rados/operations/balancer/).
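As a rough sketch (always look at the dry run before applying anything):

# Option 1: built-in utilization-based reweighting
$ ceph osd test-reweight-by-utilization   # dry run, shows what would change
$ ceph osd reweight-by-utilization        # actually adjusts the reweight values

# Option 2: balancer module with pg-upmap (only if *all* clients are Luminous or newer)
$ ceph osd set-require-min-compat-client luminous
$ ceph balancer mode upmap
$ ceph balancer on
$ ceph balancer status

The upmap balancer remaps individual PGs and usually gives a much more even PG count per OSD than reweighting, and `set-require-min-compat-client luminous` will refuse to proceed if pre-Luminous clients are still connected.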
> On Jul 30, 2020, at 9:21 AM, Budai Laszlo <[email protected]> wrote:
>
> Dear all,
>
> We have a ceph cluster where we have configured two SSD-only pools in order to use them as a cache tier for the spinning discs. Altogether there are 27 SSDs organized on 9 hosts distributed in 3 chassis. The hierarchy looks like this:
>
> $ ceph osd df tree | grep -E 'ssd|ID'
> ID  CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS TYPE NAME
> -40       8.26199        - 8.26TiB 5.78TiB 2.48TiB 70.02 5.77   - root ssd-root
> -50       2.75400        - 2.75TiB 1.93TiB  845GiB 70.02 5.77   -     chassis c1-ssd
> -41       0.91800        -  940GiB  651GiB  289GiB 69.23 5.71   -         host c1-h01-ssd
> 110   ssd 0.30600  1.00000  313GiB  199GiB  115GiB 63.37 5.22  77             osd.110
> 116   ssd 0.30600  1.00000  313GiB  219GiB 94.3GiB 69.91 5.76  89             osd.116
> 119   ssd 0.30600  1.00000  313GiB  233GiB 80.2GiB 74.41 6.13  87             osd.119
> -42       0.91800        -  940GiB  701GiB  239GiB 74.61 6.15   -         host c1-h02-ssd
> 112   ssd 0.30600  1.00000  313GiB  228GiB 84.9GiB 72.91 6.01  85             osd.112
> 117   ssd 0.30600  1.00000  313GiB  245GiB 67.9GiB 78.32 6.46  97             osd.117
> 122   ssd 0.30600  1.00000  313GiB  227GiB 85.8GiB 72.61 5.99  87             osd.122
> -43       0.91800        -  940GiB  622GiB  318GiB 66.21 5.46   -         host c1-h03-ssd
> 109   ssd 0.30600  1.00000  313GiB  192GiB  122GiB 61.15 5.04  77             osd.109
> 115   ssd 0.30600  1.00000  313GiB  206GiB  107GiB 65.79 5.42  79             osd.115
> 120   ssd 0.30600  1.00000  313GiB  225GiB 88.7GiB 71.70 5.91  90             osd.120
> -51       2.75400        - 2.75TiB 1.93TiB  845GiB 70.02 5.77   -     chassis c2-ssd
> -46       0.91800        -  940GiB  651GiB  288GiB 69.31 5.71   -         host c2-h01-ssd
> 125   ssd 0.30600  1.00000  313GiB  211GiB  103GiB 67.22 5.54  81             osd.125
> 130   ssd 0.30600  1.00000  313GiB  233GiB 80.4GiB 74.33 6.13  89             osd.130
> 132   ssd 0.30600  1.00000  313GiB  208GiB  105GiB 66.38 5.47  79             osd.132
> -45       0.91800        -  940GiB  672GiB  267GiB 71.54 5.90   -         host c2-h02-ssd
> 126   ssd 0.30600  1.00000  313GiB  216GiB 97.4GiB 68.90 5.68  87             osd.126
> 129   ssd 0.30600  1.00000  313GiB  207GiB  106GiB 66.12 5.45  80             osd.129
> 134   ssd 0.30600  1.00000  313GiB  249GiB 63.9GiB 79.61 6.56  99             osd.134
> -44       0.91800        -  940GiB  650GiB  289GiB 69.20 5.70   -         host c2-h03-ssd
> 123   ssd 0.30600  1.00000  313GiB  201GiB  112GiB 64.23 5.29  76             osd.123
> 127   ssd 0.30600  1.00000  313GiB  217GiB 96.1GiB 69.31 5.71  85             osd.127
> 131   ssd 0.30600  1.00000  313GiB  232GiB 81.2GiB 74.06 6.11  92             osd.131
> -52       2.75400        - 2.75TiB 1.93TiB  845GiB 70.02 5.77   -     chassis c3-ssd
> -47       0.91800        -  940GiB  628GiB  311GiB 66.86 5.51   -         host c3-h01-ssd
> 124   ssd 0.30600  1.00000  313GiB  204GiB  109GiB 65.13 5.37  78             osd.124
> 128   ssd 0.30600  1.00000  313GiB  202GiB  111GiB 64.59 5.32  76             osd.128
> 133   ssd 0.30600  1.00000  313GiB  222GiB 91.3GiB 70.86 5.84  86             osd.133
> -48       0.91800        -  940GiB  628GiB  312GiB 66.80 5.51   -         host c3-h02-ssd
> 108   ssd 0.30600  1.00000  313GiB  220GiB 92.9GiB 70.35 5.80  86             osd.108
> 114   ssd 0.30600  1.00000  313GiB  209GiB  105GiB 66.58 5.49  82             osd.114
> 121   ssd 0.30600  1.00000  313GiB  199GiB  114GiB 63.46 5.23  79             osd.121
> -49       0.91800        -  940GiB  718GiB  222GiB 76.40 6.30   -         host c3-h03-ssd
> 111   ssd 0.30600  1.00000  313GiB  219GiB 94.4GiB 69.87 5.76  84             osd.111
> 113   ssd 0.30600  1.00000  313GiB  241GiB 72.2GiB 76.95 6.34  96             osd.113
> 118   ssd 0.30600  1.00000  313GiB  258GiB 55.2GiB 82.39 6.79 101             osd.118
>
> The rule used for the two pools is the following:
>
> {
>     "rule_id": 1,
>     "rule_name": "ssd",
>     "ruleset": 1,
>     "type": 1,
>     "min_size": 1,
>     "max_size": 10,
>     "steps": [
>         {
>             "op": "take",
>             "item": -40,
>             "item_name": "ssd-root"
>         },
>         {
>             "op": "chooseleaf_firstn",
>             "num": 0,
>             "type": "chassis"
>         },
>         {
>             "op": "emit"
>         }
>     ]
> }
>
> Both pools have size 3, and the total number of PGs is 768 (256+512).
>
> As you can see from the previous table (the PGS column), there is a significant difference between the OSD with the largest number of PGs (101 PGs on osd.118) and the one with the smallest number (76 PGs on osd.123). The ratio between the two is 1.32, so OSD 118 has a higher chance of receiving data than OSD 123, and we can see that indeed osd.118 is the one storing the most data (82.39% full in the above table).
>
> I would like to rebalance the PG/OSD allocation. I know that I can play around with the OSD weights (currently 0.306 for all the OSDs), but I wonder if there is any drawback to this in the long run. Are you aware of any reason why I should NOT modify the weights (and leave those modifications permanent)?
>
> Any ideas are welcome :)
>
> Kind regards,
> Laszlo
> _______________________________________________
> ceph-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
