This is expected behavior with CRUSH: placement is pseudo-random, so some imbalance is normal. You don't mention which release the cluster or the clients are running, so it's difficult to give an exact answer.
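If the cluster is Luminous or newer, something like this will show both (a quick sketch, not specific to your setup):

$ ceph versions    # which release each mon/mgr/osd daemon is running
$ ceph features    # release/feature level of currently connected clients and daemons

The `ceph features` output in particular tells you whether every client can handle pg-upmap, which matters for the balancer suggestion below.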
Don't mess with the CRUSH weights. Either adjust the reweight override values (the REWEIGHT column, not the CRUSH WEIGHT) with `ceph osd test-reweight-by-utilization` / `ceph osd reweight-by-utilization` (see https://docs.ceph.com/docs/master/rados/operations/control/), or use the balancer module in newer releases *iff* all clients are new enough to handle pg-upmap (see https://docs.ceph.com/docs/nautilus/rados/operations/balancer/).
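As a rough sketch (always look at the dry run before applying anything):

# Option 1: built-in utilization-based reweighting
$ ceph osd test-reweight-by-utilization   # dry run, shows what would change
$ ceph osd reweight-by-utilization        # actually adjusts the reweight values

# Option 2: balancer module with pg-upmap (only if *all* clients are Luminous or newer)
$ ceph osd set-require-min-compat-client luminous
$ ceph balancer mode upmap
$ ceph balancer on
$ ceph balancer status

The upmap balancer remaps individual PGs and usually gives a much more even PG count per OSD than reweighting, and `set-require-min-compat-client luminous` will refuse to proceed if pre-Luminous clients are still connected.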
> On Jul 30, 2020, at 9:21 AM, Budai Laszlo <[email protected]> wrote:
>
> Dear all,
>
> We have a ceph cluster where we have configured two SSD-only pools in order to use them as a cache tier for the spinning discs. Altogether there are 27 SSDs organized on 9 hosts distributed in 3 chassis. The hierarchy looks like this:
>
> $ ceph osd df tree | grep -E 'ssd|ID'
> ID  CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS TYPE NAME
> -40       8.26199        - 8.26TiB 5.78TiB 2.48TiB 70.02 5.77   - root ssd-root
> -50       2.75400        - 2.75TiB 1.93TiB  845GiB 70.02 5.77   -     chassis c1-ssd
> -41       0.91800        -  940GiB  651GiB  289GiB 69.23 5.71   -         host c1-h01-ssd
> 110   ssd 0.30600  1.00000  313GiB  199GiB  115GiB 63.37 5.22  77             osd.110
> 116   ssd 0.30600  1.00000  313GiB  219GiB 94.3GiB 69.91 5.76  89             osd.116
> 119   ssd 0.30600  1.00000  313GiB  233GiB 80.2GiB 74.41 6.13  87             osd.119
> -42       0.91800        -  940GiB  701GiB  239GiB 74.61 6.15   -         host c1-h02-ssd
> 112   ssd 0.30600  1.00000  313GiB  228GiB 84.9GiB 72.91 6.01  85             osd.112
> 117   ssd 0.30600  1.00000  313GiB  245GiB 67.9GiB 78.32 6.46  97             osd.117
> 122   ssd 0.30600  1.00000  313GiB  227GiB 85.8GiB 72.61 5.99  87             osd.122
> -43       0.91800        -  940GiB  622GiB  318GiB 66.21 5.46   -         host c1-h03-ssd
> 109   ssd 0.30600  1.00000  313GiB  192GiB  122GiB 61.15 5.04  77             osd.109
> 115   ssd 0.30600  1.00000  313GiB  206GiB  107GiB 65.79 5.42  79             osd.115
> 120   ssd 0.30600  1.00000  313GiB  225GiB 88.7GiB 71.70 5.91  90             osd.120
> -51       2.75400        - 2.75TiB 1.93TiB  845GiB 70.02 5.77   -     chassis c2-ssd
> -46       0.91800        -  940GiB  651GiB  288GiB 69.31 5.71   -         host c2-h01-ssd
> 125   ssd 0.30600  1.00000  313GiB  211GiB  103GiB 67.22 5.54  81             osd.125
> 130   ssd 0.30600  1.00000  313GiB  233GiB 80.4GiB 74.33 6.13  89             osd.130
> 132   ssd 0.30600  1.00000  313GiB  208GiB  105GiB 66.38 5.47  79             osd.132
> -45       0.91800        -  940GiB  672GiB  267GiB 71.54 5.90   -         host c2-h02-ssd
> 126   ssd 0.30600  1.00000  313GiB  216GiB 97.4GiB 68.90 5.68  87             osd.126
> 129   ssd 0.30600  1.00000  313GiB  207GiB  106GiB 66.12 5.45  80             osd.129
> 134   ssd 0.30600  1.00000  313GiB  249GiB 63.9GiB 79.61 6.56  99             osd.134
> -44       0.91800        -  940GiB  650GiB  289GiB 69.20 5.70   -         host c2-h03-ssd
> 123   ssd 0.30600  1.00000  313GiB  201GiB  112GiB 64.23 5.29  76             osd.123
> 127   ssd 0.30600  1.00000  313GiB  217GiB 96.1GiB 69.31 5.71  85             osd.127
> 131   ssd 0.30600  1.00000  313GiB  232GiB 81.2GiB 74.06 6.11  92             osd.131
> -52       2.75400        - 2.75TiB 1.93TiB  845GiB 70.02 5.77   -     chassis c3-ssd
> -47       0.91800        -  940GiB  628GiB  311GiB 66.86 5.51   -         host c3-h01-ssd
> 124   ssd 0.30600  1.00000  313GiB  204GiB  109GiB 65.13 5.37  78             osd.124
> 128   ssd 0.30600  1.00000  313GiB  202GiB  111GiB 64.59 5.32  76             osd.128
> 133   ssd 0.30600  1.00000  313GiB  222GiB 91.3GiB 70.86 5.84  86             osd.133
> -48       0.91800        -  940GiB  628GiB  312GiB 66.80 5.51   -         host c3-h02-ssd
> 108   ssd 0.30600  1.00000  313GiB  220GiB 92.9GiB 70.35 5.80  86             osd.108
> 114   ssd 0.30600  1.00000  313GiB  209GiB  105GiB 66.58 5.49  82             osd.114
> 121   ssd 0.30600  1.00000  313GiB  199GiB  114GiB 63.46 5.23  79             osd.121
> -49       0.91800        -  940GiB  718GiB  222GiB 76.40 6.30   -         host c3-h03-ssd
> 111   ssd 0.30600  1.00000  313GiB  219GiB 94.4GiB 69.87 5.76  84             osd.111
> 113   ssd 0.30600  1.00000  313GiB  241GiB 72.2GiB 76.95 6.34  96             osd.113
> 118   ssd 0.30600  1.00000  313GiB  258GiB 55.2GiB 82.39 6.79 101             osd.118
>
> The rule used for the two pools is the following:
>
> {
>     "rule_id": 1,
>     "rule_name": "ssd",
>     "ruleset": 1,
>     "type": 1,
>     "min_size": 1,
>     "max_size": 10,
>     "steps": [
>         {
>             "op": "take",
>             "item": -40,
>             "item_name": "ssd-root"
>         },
>         {
>             "op": "chooseleaf_firstn",
>             "num": 0,
>             "type": "chassis"
>         },
>         {
>             "op": "emit"
>         }
>     ]
> }
>
> Both pools have size 3, and the total number of PGs is 768 (256+512).
>
> As you can see from the previous table (the PGS column), there is a significant difference between the OSD with the largest number of PGs (101 PGs on osd.118) and the one with the smallest number (76 PGs on osd.123). The ratio between the two is 1.32, so OSD 118 has a higher chance of receiving data than OSD 123, and we can see that indeed osd.118 is the one storing the most data (82.39% full in the above table).
>
> I would like to rebalance the PG/OSD allocation. I know that I can play around with the OSD weights (currently 0.306 for all the OSDs), but I wonder if there is any drawback to this in the long run. Are you aware of any reason why I should NOT modify the weights (and leave those modifications permanent)?
>
> Any ideas are welcome :)
>
> Kind regards,
> Laszlo
> _______________________________________________
> ceph-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
