[ceph-users] Re: Increasing number of unscrubbed PGs

2022-09-14 Thread Burkhard Linke

Hi,

On 9/13/22 16:33, Wesley Dillingham wrote:
what does "ceph pg ls scrubbing" show? Do you have PGs that have been 
stuck in a scrubbing state for a long period of time (many 
hours,days,weeks etc). This will show in the "SINCE" column.



The deep scrubs have been running for a few minutes up to about two hours
each, which seems to be OK (PGs in the large data pool have a size of ~290 GB).


The only suspicious values are run times of several hours for the CephFS
metadata and primary data pools (the latter is only used for xattr entries,
no actual data stored). But those are on SSD/NVMe storage, and according to
the timestamps they have been scrubbed within the last few days.



Is it possible to get a full list of all affected PGs? 'ceph health 
detail' only displays 50 entries.
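
In the meantime I can probably pull the full list from the PG stats
myself. Roughly like this (untested sketch, assuming jq is available; it
should print every PG with its last deep-scrub timestamp, oldest first):

# dump all PG stats as JSON and sort by last deep-scrub timestamp
ceph pg dump pgs -f json 2>/dev/null \
  | jq -r '(.pg_stats // .pg_map.pg_stats)[] | [.pgid, .last_deep_scrub_stamp] | @tsv' \
  | sort -k2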



Best regards,

Burkhard Linke


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Increasing number of unscrubbed PGs

2022-09-13 Thread Wesley Dillingham
what does "ceph pg ls scrubbing" show? Do you have PGs that have been stuck
in a scrubbing state for a long period of time (many hours,days,weeks etc).
This will show in the "SINCE" column.
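
For example, something along these lines (from memory, the exact columns
can differ a bit between releases) lists the scrubbing PGs and, via grep,
just the deep scrubs:

# all PGs currently scrubbing, with the SINCE column
ceph pg ls scrubbing
# only the deep scrubs (keep the header line)
ceph pg ls scrubbing | grep -e '^PG' -e '+deep'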

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Tue, Sep 13, 2022 at 7:32 AM Burkhard Linke <
burkhard.li...@computational.bio.uni-giessen.de> wrote:

> Hi Josh,
>
>
> Thanks for the link. I'm not sure whether this is the root cause, since we
> did not use the noscrub and nodeep-scrub flags in the past. I set them
> for a short period to test whether removing the flags triggers more
> scrubbing. No OSDs were restarted or similar during that time.
>
>
> But the ticket mentioned repeering as a method for resolving the stuck
> PGs. I have repeered some of the PGs, and the number of affected PGs has
> not increased significantly since then. On the other hand, the number of
> running deep-scrubs has not increased significantly either. I'll keep an
> eye on the development and hope that 16.2.11 is released soon.
>
>
> Best regards,
>
> Burkhard
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Increasing number of unscrubbed PGs

2022-09-13 Thread Burkhard Linke

Hi Josh,


Thanks for the link. I'm not sure whether this is the root cause, since we
did not use the noscrub and nodeep-scrub flags in the past. I set them
for a short period to test whether removing the flags triggers more
scrubbing. No OSDs were restarted or similar during that time.



But the ticket mentioned repeering as a method for resolving the stuck
PGs. I have repeered some of the PGs, and the number of affected PGs has
not increased significantly since then. On the other hand, the number of
running deep-scrubs has not increased significantly either. I'll keep an
eye on the development and hope that 16.2.11 is released soon.
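
Something like the following loop can be used for the repeering (untested
sketch, assuming jq; the cutoff date is just an example):

# repeer all PGs whose last deep scrub is older than the cutoff
ceph pg dump pgs -f json 2>/dev/null \
  | jq -r '(.pg_stats // .pg_map.pg_stats)[] | select(.last_deep_scrub_stamp < "2022-08-01") | .pgid' \
  | while read -r pg; do ceph pg repeer "$pg"; done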



Best regards,

Burkhard


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Increasing number of unscrubbed PGs

2022-09-12 Thread Burkhard Linke

Hi,


On 9/12/22 11:44, Eugen Block wrote:

Hi,

I'm still not sure why increasing the interval doesn't help (maybe
there's some flag set on the PG or something), but you could just
increase osd_max_scrubs if your OSDs are not too busy. On one customer
cluster with high load during the day we configured the scrubs to run
only at night, but with osd_max_scrubs = 6. What is your
current value for osd_max_scrubs?


This is the complete OSD-related configuration:


  osd   advanced   bluefs_buffered_io        true
  osd   advanced   osd_command_max_records   1024
  osd   advanced   osd_deep_scrub_interval   4838400.00
  osd   advanced   osd_max_backfills         5
  osd   advanced   osd_max_scrubs            10
  osd   advanced   osd_op_complaint_time     10.00
  osd   advanced   osd_scrub_sleep           0.00
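
For reference, the effective values on a running daemon can be
double-checked like this (osd.0 is just an arbitrary example):

ceph config show osd.0 osd_max_scrubs
ceph tell osd.0 config get osd_deep_scrub_interval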




Regards,

Burkhard


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Increasing number of unscrubbed PGs

2022-09-12 Thread Eugen Block

Hi,

I'm still not sure why increasing the interval doesn't help (maybe
there's some flag set on the PG or something), but you could just
increase osd_max_scrubs if your OSDs are not too busy. On one customer
cluster with high load during the day we configured the scrubs to run
only at night, but with osd_max_scrubs = 6. What is your
current value for osd_max_scrubs?
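
Configuring that looks roughly like this (sketch from memory; the night
window is set via osd_scrub_begin_hour/osd_scrub_end_hour, adjust the
hours and the scrub count to your cluster):

ceph config set osd osd_max_scrubs 6
# allow (deep-)scrubs only between 22:00 and 06:00
ceph config set osd osd_scrub_begin_hour 22
ceph config set osd osd_scrub_end_hour 6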


Regards,
Eugen

Zitat von Burkhard Linke :


Hi,


Our cluster is running Pacific 16.2.10. Since the upgrade, the cluster
has started to report an increasing number of PGs without a timely
deep-scrub:



# ceph -s
  cluster:
    id:    
    health: HEALTH_WARN
    1073 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum XXX,XXX,XXX (age 10d)
    mgr: XXX(active, since 3w), standbys: XXX, XXX
    mds: 2/2 daemons up, 2 standby
    osd: 460 osds: 459 up (since 3d), 459 in (since 5d)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 2/2 healthy
    pools:   16 pools, 5073 pgs
    objects: 733.76M objects, 1.1 PiB
    usage:   1.6 PiB used, 3.3 PiB / 4.9 PiB avail
    pgs:     4941 active+clean
             105  active+clean+scrubbing
             27   active+clean+scrubbing+deep


The cluster is healthy otherwise, with the exception of one failed
OSD. It has been marked out and should not interfere with scrubbing.
Scrubbing itself is running, but there are too few deep-scrubs. If I
remember correctly, we had a larger number of deep scrubs before the
last upgrade. I tried to extend the deep-scrub interval, but to no
avail so far.
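
For reference, the interval was extended roughly like this (4838400
seconds is 8 weeks; the exact value is not the point):

ceph config set osd osd_deep_scrub_interval 4838400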


The majority of the PGs are part of a CephFS data pool (4096 of the 4941
PGs), and those also account for most of the PGs reported. The pool is
backed by 12 machines with 48 disks each, so there should be enough I/O
capacity for running deep-scrubs. Load on these machines and disks
is also pretty low.


Any hints on debugging this? The number of affected PGs has risen
from 600 to over 1000 during the weekend and continues to rise...



Best regards,

Burkhard Linke


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io