Hi,

I had a similar problem on my large cluster.

Here is what I found, and what helped me solve it:

Because of bad drives, and because drives were being replaced so often due to scrub errors, there were always some recovery operations going on, which held scrubs back (by default Ceph does not scrub PGs that are recovering).

I set this:

osd_scrub_during_recovery true

and it basically solved my issue.
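If you want to apply that cluster-wide at runtime, something along these lines should work (ceph config set assumes Mimic or newer; on older releases injectargs is the usual route):

ceph config set osd osd_scrub_during_recovery true

# or, to change it on already-running OSDs:
ceph tell osd.* injectargs '--osd_scrub_during_recovery=true'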

If that is not enough, you can try changing the scrub interval.

I also changed the deep scrub interval from the default of once per week to once every two weeks:

osd_deep_scrub_interval 1209600

and if you want or need to speed things up to get rid of PGs that have not been scrubbed in time, take a look at

osd_max_scrubs

The default is 1; when I need to speed things up I set it to 3, and I did not notice any performance impact.
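For example, assuming the same ceph config set mechanism:

ceph config set osd osd_deep_scrub_interval 1209600
ceph config set osd osd_max_scrubs 3

Afterwards you can watch the "pgs not deep-scrubbed in time" warnings drain out of ceph health detail.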


dp

On 3/11/22 17:32, Ray Cunningham wrote:
That's what I thought. We looked at the cluster storage nodes and found them all to be at less than 0.2 normalized load.

Our 'normal' bandwidth for client IO according to ceph -s is around 60-100 MB/s. I don't usually look at the IOPS, so I don't have that number right now. We have seen GB/s numbers during repairs, so the cluster can get up there when the workload requires it.

We discovered that this system never had the auto-repair setting configured to true, and since we turned it on we have been repairing PGs for the past 24 hours. So maybe we've been bottlenecked by those repairs?
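In case it's useful to anyone following along, this is roughly how we checked it (we're on Nautilus; exact output may differ on other releases):

ceph config get osd osd_scrub_auto_repair
ceph -s

Repairing PGs show up in the PG states in ceph -s, e.g. active+clean+scrubbing+deep+repair.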

Thank you,
Ray
-----Original Message-----
From: norman.kern <norman.k...@gmx.com>
Sent: Thursday, March 10, 2022 9:27
To: Ray Cunningham <ray.cunning...@keepertech.com>
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Scrubbing

Ray,

You can use node-exporter + Prometheus + Grafana to collect CPU load statistics, or use the uptime command to get the current load averages.
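For a quick one-off check on a storage node, something like this (plain Linux tools, nothing Ceph-specific) gives roughly the normalized value the scrub load threshold is compared against:

uptime                     # 1/5/15-minute load averages
nproc                      # number of online CPUs
awk -v n=$(nproc) '{printf "%.2f\n", $1/n}' /proc/loadavg    # 1-minute load / CPUs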

On 3/10/22 10:51 PM, Ray Cunningham wrote:
From:

osd_scrub_load_threshold
The normalized maximum load. Ceph will not scrub when the system load (as 
defined by getloadavg() / number of online CPUs) is higher than this number. 
Default is 0.5.

Does anyone know how I can run getloadavg() / number of online CPUs so I can 
see what our load is? Is that a ceph command, or an OS command?

Thank you,
Ray


-----Original Message-----
From: Ray Cunningham
Sent: Thursday, March 10, 2022 7:59 AM
To: norman.kern <norman.k...@gmx.com>
Cc: ceph-users@ceph.io
Subject: RE: [ceph-users] Scrubbing


We have 16 storage servers, each with 16 TB HDDs and 2 TB SSDs for DB/WAL, so we are using BlueStore. The system is running Nautilus 14.2.19 at the moment, with an upgrade scheduled this month. I can't give you a complete ceph config dump as this is an offline customer system, but I can get answers to specific questions.

Off the top of my head, we have set:

osd_max_scrubs 20
osd_scrub_auto_repair true
osd_scrub_load_threshold 0.6
We do not limit scrub hours.
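If it helps for comparison, the effective values can be read back from a running OSD on its host, e.g. (adjust the OSD id):

ceph daemon osd.0 config show | grep scrub

or from the centralized config database:

ceph config get osd osd_max_scrubs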

Thank you,
Ray




-----Original Message-----
From: norman.kern <norman.k...@gmx.com>
Sent: Wednesday, March 9, 2022 7:28 PM
To: Ray Cunningham <ray.cunning...@keepertech.com>
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Scrubbing

Ray,

Can you provide more information about your cluster (hardware and software configs)?

On 3/10/22 7:40 AM, Ray Cunningham wrote:
    make any difference. Do
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
