Thanks for the list of parameters. I suppose I was just asking whether the docs' "planned carefully" meant something more than being ready to moderate the rebuild process, etc. That clears things up for me.
Sincerely
-Dave

On 2022-05-05 1:15 p.m., Anthony D'Atri wrote:
>> The balancer was driving all the weights to 1.00000 so I turned it off.
>
> Which weights (CRUSH or reweight)? And which balancer?
>
> Assuming the ceph-mgr balancer module in upmap mode, you'd want the reweight
> values to be 1.000, since it uses the newer pg-upmap functionality to
> distribute capacity. Lower reweight values have a way of confusing the
> balancer and preventing good uniformity. If you had a bunch of significantly
> adjusted reweight values, e.g. from prior runs of reweight-by-utilization,
> that could contribute to suboptimal balancing.
>
>> You mentioned that all solutions would cause data migration and would need
>> to be planned carefully. I've seen that language in the docs and other
>> messages, but what I can't find is what is meant by "planned carefully".
>
> There are many ways to proceed; documenting them all might be a bit of a
> rabbit-hole.
>
>> Doing any of these will cause data migration like crazy, but it's not
>> avoidable other than to change the number of max backfills etc., and the
>> systems should still be accessible during this time, just with reduced
>> bandwidth and higher latency. Is it just a warning that the system could be
>> degraded for a long period of time, or is it suggesting that users should
>> take an outage while the rebuild happens?
>
> Throttling recovery/backfill can reduce the impact of big data migrations, at
> the expense of increased elapsed time to complete:
>
> osd_max_backfills=1
> osd_recovery_max_active=1
> osd_recovery_op_priority=1
> osd_recovery_max_single_start=1
> osd_scrub_during_recovery=false
>
> Also, ensure that
>
> osd_op_queue_cut_off = high
>
> This will help ensure that recovery/backfill doesn't DoS client traffic.
> I'm not sure if this is the default in your release. If changed, I believe
> the OSDs would need to be restarted for the new value to take effect.
>
> PGs:
>
> pg_num = ( #OSDs * ratio ) / replication
> ratio = pg_num * replication / #OSDs
>
> On clusters with multiple pools this can get a bit complicated when more than
> one pool has significant numbers of PGs; the end goal is the total number of
> PGs on a given OSD, which `ceph osd df` reports.
>
> Your OSDs look to have ~190 PGs each on average, which is probably OK given
> your media. If you do have big empty pools, deleting them would show more
> indicative numbers. PG ratio targets are somewhat controversial, but
> depending on your media and RAM an aggregate around this range is reasonable;
> you can go higher with flash.
>
> This calculator can help when you have multiple pools:
>
> https://old.ceph.com/pgcalc/
>
> If you need to bump pg_num for a pool, you don't have to do it in one step.
> You can increase it by, say, 32 at a time.
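>
> For what it's worth, on a Nautilus-era cluster those throttles can be applied
> at runtime with `ceph config set` (or injectargs); a minimal sketch, assuming
> you want them cluster-wide for all OSDs (osd.0 below is just an example
> daemon, and 544 is simply the next 32-PG step up from your current 512):
>
> # throttle recovery/backfill with the values listed above
> ceph config set osd osd_max_backfills 1
> ceph config set osd osd_recovery_max_active 1
> ceph config set osd osd_recovery_op_priority 1
> ceph config set osd osd_recovery_max_single_start 1
> ceph config set osd osd_scrub_during_recovery false
>
> # keep recovery/backfill below client traffic; OSDs may need a restart
> # for this one to take effect
> ceph config set osd osd_op_queue_cut_off high
>
> # spot-check what a running OSD actually has
> ceph config show osd.0 osd_op_queue_cut_off
>
> # illustrating the formula above for your main data pool (EC size 9,
> # 144 OSDs, aiming for roughly 100 PGs per OSD): 144 * 100 / 9 is ~1600,
> # so 1024 or 2048 as a power of two; stepping pg_num up 32 at a time
> # limits how much data moves per step
> ceph osd pool set fsdatak7m2 pg_num 544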
>
>> Thanks for your guidance.
>>
>> -Dave
>>
>> On 2022-05-05 2:33 a.m., Erdem Agaoglu wrote:
>>
>> Hi David,
>>
>> I think you're right with your option 2: 512 PGs is just too few. You're
>> also right about the "inflation", but you should add your erasure bits to
>> the calculation, so 9x512=4608. With 144 OSDs, that averages 32 PGs per OSD.
>> Some old advice for that number was around 100.
>>
>> But your current PGs per OSD is around 180-190 according to the df output.
>> This is probably because of your empty pool 4 fsdata, which has 4096 PGs
>> with size 5, adding 5x4096=20480, i.e. 20480/144=142 more PGs per OSD.
>>
>> I'm not really sure how empty/unused PGs affect an OSD, but I think they
>> affect the balancer, which tries to balance the number of PGs, and that
>> might explain things getting worse. Also, your df output shows several
>> modifications in weights/reweights, but I'm not sure if they're manual or
>> balancer-adjusted.
>>
>> I would first delete that empty pool to get a clearer picture of PGs on
>> OSDs. Then I would increase the pg_num for pool 6 to 2048. And after
>> everything settles, if it's still too unbalanced, I'd go for the upmap
>> balancer. Needless to say, all of this would cause major data migration, so
>> it should be planned carefully.
>>
>> Best,
>>
>> On Thu, May 5, 2022 at 12:02 AM David Schulz <dsch...@ucalgary.ca> wrote:
>>
>> Hi Josh,
>>
>> We do have an old pool that is empty, so there are 4611 empty PGs, but the
>> rest seem fairly close:
>>
>> # ceph pg ls|awk '{print $7/1024/1024/10}'|cut -d "." -f 1|sed -e 's/$/0/'|sort -n|uniq -c
>>    4611 00
>>       1 1170
>>       8 1180
>>      10 1190
>>      28 1200
>>      51 1210
>>      54 1220
>>      52 1230
>>      32 1240
>>      13 1250
>>       7 1260
>>
>> Hmm, that's interesting: adding up the first column, except the 4611, gives
>> 256, but there are 512 PGs in the main data pool.
>>
>> Here are our pool settings:
>>
>> pool 3 'fsmeta' replicated size 3 min_size 1 crush_rule 0 object_hash
>> rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 35490
>> flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16
>> recovery_priority 5 application cephfs
>> pool 4 'fsdata' erasure size 5 min_size 4 crush_rule 1 object_hash
>> rjenkins pg_num 4096 pgp_num 4096 autoscale_mode warn last_change 35490
>> lfor 0/0/4742 flags hashpspool,ec_overwrites stripe_width 12288
>> application cephfs
>> pool 6 'fsdatak7m2' erasure size 9 min_size 8 crush_rule 3 object_hash
>> rjenkins pg_num 512 pgp_num 512 autoscale_mode warn last_change 35490
>> flags hashpspool,ec_overwrites stripe_width 28672 application cephfs
>>
>> The fsdata pool was originally created with very safe erasure coding that
>> wasted too much space, then fsdatak7m2 was created and everything was
>> migrated to it. This is why there are at least 4096 PGs with 0 bytes.
>>
>> -Dave
>>
>> On 2022-05-04 2:08 p.m., Josh Baergen wrote:
>>>
>>> Hi Dave,
>>>
>>>> This cluster was upgraded from 13.x to 14.2.9 some time ago. The entire
>>>> cluster was installed at the 13.x time and was upgraded together, so all
>>>> OSDs should have the same formatting etc.
>>>
>>> OK, thanks, that should rule out a difference in bluestore
>>> min_alloc_size, for example.
>>>
>>>> Below is pasted the ceph osd df tree output.
>>>
>>> It looks like there is some pretty significant skew in terms of the
>>> amount of bytes per active PG. If you issue "ceph pg ls", are you able
>>> to find any PGs with a significantly higher byte count?
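>>>
>>> Something along these lines should surface the outliers; a rough sketch
>>> that reuses field 7 from your awk one-liner above (worth double-checking
>>> that it really is the BYTES column on your release):
>>>
>>> # largest PGs first: pgid and bytes
>>> ceph pg ls | awk 'NR>1 {print $1, $7}' | sort -k2 -rn | head -20
>>>
>>> # or restrict it to the main data pool
>>> ceph pg ls-by-pool fsdatak7m2 | awk 'NR>1 {print $1, $7}' | sort -k2 -rn | head -20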
>>>
>>> Josh
>>
>> --
>> erdem agaoglu

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io