Hi Vivien,

Great to hear that all PGs are now active+clean. Just so you know, the PG export/import procedure Eugen mentioned should have worked to restore them without dropping their data.
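For future reference, that export/import procedure can be sketched roughly as follows. This is a hedged sketch, not a verbatim procedure: the OSD ids, PG id, `<fsid>` placeholder, and file paths are illustrative, and the export file must live on a path visible inside both OSD containers.

```shell
# Sketch of the PG export/import procedure. OSD ids, the PG id, <fsid>,
# and file paths are illustrative placeholders.
ceph osd set noout
systemctl stop "ceph-<fsid>@osd.1.service"   # stop the source OSD first

# Export the PG from the stopped (but intact) source OSD's container:
cephadm shell --name osd.1 -- \
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
  --pgid 11.4 --op export --file /var/log/ceph/pg.11.4.export

# Import it on a (likewise stopped) destination OSD:
cephadm shell --name osd.4 -- \
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
  --op import --file /var/log/ceph/pg.11.4.export

systemctl start "ceph-<fsid>@osd.1.service"
ceph osd unset noout
```

Make sure there is enough free disk space for the export file before starting.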
Regarding the PGs scrubbing for 5 days: it may simply be that many PGs are now being scrubbed concurrently, since they could not be scrubbed while they were not active+clean. Alternatively, you might be hitting this bug [1] with mClock. If that's the case, switching osd_op_queue to WPQ or setting osd_scrub_disable_reservation_queuing = true with mClock could help.

Regards,
Frédéric.

[1] https://tracker.ceph.com/issues/69078

--
Frédéric Nass
Ceph Ambassador France | Senior Ceph Engineer @ CLYSO
Try our Ceph Analyzer -- https://analyzer.clyso.com/
https://clyso.com | frederic.n...@clyso.com

On Mon, Aug 4, 2025 at 09:39, GLE, Vivien <vivien....@inist.fr> wrote: > Hi, > > I got 3 incomplete PGs that I marked complete (mark-complete) because they were empty > (I think I lost data from them) > > 1 was recovery_unfound; I ran mark_unfound_lost revert on this one > > > but I have between 5 and 25 deep-scrubbing PGs, and I believe this is not normal? > (it's been like this for 5 days) > > > Vivien > > ________________________________ > From: Eugen Block <ebl...@nde.ag> > Sent: Friday, August 1, 2025 15:58:22 > To: GLE, Vivien > Cc: ceph-users@ceph.io > Subject: Re: [ceph-users] Re: Pgs troubleshooting > > Don't worry, I just wanted to point out that careful reading is crucial. :-) > So you got the OSDs back up, but were you also able to recover the PG? > > Quoting "GLE, Vivien" <vivien....@inist.fr>: > > > I lost all perspective and didn't read this message carefully... > > Sorry for that > > > > > > Thanks for your help, I'm very grateful > > > > > > Vivien > > > > ________________________________ > > From: Eugen Block <ebl...@nde.ag> > > Sent: Friday, August 1, 2025 15:27:56 > > To: GLE, Vivien > > Cc: ceph-users@ceph.io > > Subject: Re: [ceph-users] Re: Pgs troubleshooting > > > > That’s why I mentioned this two days ago: > > > > cephadm shell -- ceph-objectstore-tool --op list … > > > > That’s how you can execute commands directly with cephadm shell; this > > is useful for batch operations like a for loop or similar.
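As an aside, the batch pattern Eugen describes above can be sketched like this (a hedged illustration only — the OSD ids and the PG id are placeholders, not values from this cluster):

```shell
# Sketch: run ceph-objectstore-tool non-interactively against several OSDs
# via `cephadm shell --name osd.N -- <command>`. OSD ids and the PG id are
# illustrative placeholders; each OSD must be stopped before running the tool.
for osd in 1 4 5; do
  cephadm shell --name "osd.${osd}" -- \
    ceph-objectstore-tool --data-path "/var/lib/ceph/osd/ceph-${osd}" \
    --op list --pgid 11.4 --no-mon-config
done
```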
Of course, > > first entering the shell and then executing commands works just as well. > > > > Quoting "GLE, Vivien" <vivien....@inist.fr>: > > > >> I was using ceph-objectstore-tool the wrong way, by running it on the host > >> instead of inside the container via cephadm shell --name osd.x > >> > >> > >> ________________________________ > >> From: GLE, Vivien <vivien....@inist.fr> > >> Sent: Friday, August 1, 2025 09:02:59 > >> To: Eugen Block > >> Cc: ceph-users@ceph.io > >> Subject: [ceph-users] Re: Pgs troubleshooting > >> > >> Hi, > >> > >> > >> What is the correct way of using the objectstore tool? > >> > >> > >> My OSDs are up! I purged ceph-* on my host following this thread: > >> > https://www.reddit.com/r/ceph/comments/1me3kvd/containerized_ceph_base_os_experience/ > >> > >> > >> " Make sure that the base OS does not have any ceph packages > >> installed, with Ubuntu in the past had issues with ceph-common being > >> installed on the host OS and it trying to take ownership of the > >> containerized ceph deployment. If you run into any issues check the > >> base OS for ceph-* packages and uninstall. " > >> > >> > >> I believe the only correct way to run ceph commands is inside cephadm > >> > >> > >> Thanks for your help! > >> > >> ________________________________ > >> From: Eugen Block <ebl...@nde.ag> > >> Sent: Thursday, July 31, 2025 19:42:21 > >> To: GLE, Vivien > >> Cc: ceph-users@ceph.io > >> Subject: Re: [ceph-users] Re: Pgs troubleshooting > >> > >> To use the objectstore tool within the container you don’t have to > >> specify the cluster’s FSID because it’s mapped into the container. By > >> using the objectstore tool you might have changed the ownership of the > >> directory; change it back to the previous state. Other OSDs will show > >> you which uid/user and/or gid/group that is. > >> > >> Quoting "GLE, Vivien" <vivien....@inist.fr>: > >> > >>> I'm sorry for the confusion! > >>> > >>> I pasted the wrong output.
> >>> > >>> > >>> ceph-objectstore-tool --data-path /var/lib/ceph/Id/osd.1 --op list > >>> --pgid 11.4 --no-mon-config > >>> > >>> OSD.1 log > >>> > >>> 2025-07-31T12:06:56.273+0000 7a9c2bf47680 0 set uid:gid to 167:167 > >>> (ceph:ceph) > >>> 2025-07-31T12:06:56.273+0000 7a9c2bf47680 0 ceph version 19.2.2 > >>> (0eceb0defba60152a8182f7bd87d164b639885b8) squid (stable), process > >>> ceph-osd, pid 7 > >>> 2025-07-31T12:06:56.273+0000 7a9c2bf47680 0 pidfile_write: ignore > >>> empty --pid-file > >>> 2025-07-31T12:06:56.274+0000 7a9c2bf47680 1 bdev(0x57bd64210e00 > >>> /var/lib/ceph/osd/ceph-1/block) open path > >>> /var/lib/ceph/osd/ceph-1/block > >>> 2025-07-31T12:06:56.274+0000 7a9c2bf47680 -1 bdev(0x57bd64210e00 > >>> /var/lib/ceph/osd/ceph-1/block) open open got: (13) Permission denied > >>> 2025-07-31T12:06:56.274+0000 7a9c2bf47680 -1 ** ERROR: unable to > >>> open OSD superblock on /var/lib/ceph/osd/ceph-1: (2) No such file or > >>> directory > >>> > >>> ---------------------- > >>> > >>> I retried on OSD.2 with PG 2.1, to see whether disabling (instead of just > >>> stopping) OSD.2 before the objectstore-tool operation would change > >>> anything, but the same error occurred > >>> > >>> > >>> > >>> ________________________________ > >>> From: Eugen Block <ebl...@nde.ag> > >>> Sent: Thursday, July 31, 2025 13:27:51 > >>> To: GLE, Vivien > >>> Cc: ceph-users@ceph.io > >>> Subject: Re: [ceph-users] Re: Pgs troubleshooting > >>> > >>> Why did you look at OSD.2? According to the query output you provided > >>> I would have looked at OSD.1 (acting set). And you pasted the output > >>> of PG 11.4, now you’re trying to list PG 2.1, that is quite confusing.
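A hedged note on the "Permission denied" in the log above: running ceph-objectstore-tool as root can leave the OSD's files owned by root, so the OSD (which runs as ceph:ceph, uid/gid 167:167 inside the container, as the "set uid:gid to 167:167" log line shows) can no longer open its data. Restoring ownership might look like the sketch below; the `<fsid>` placeholder and OSD ids are illustrative, and you should compare against a healthy OSD's ownership first, as Eugen suggests.

```shell
# Sketch: restore OSD data-dir ownership after a root-owned
# ceph-objectstore-tool run changed it. On a cephadm host the directory
# lives under /var/lib/ceph/<fsid>/osd.N; <fsid> and OSD ids are placeholders.
ls -ln /var/lib/ceph/<fsid>/osd.4                     # healthy OSD: note the uid/gid
chown -R 167:167 /var/lib/ceph/<fsid>/osd.1           # 167:167 = ceph:ceph in-container
chown -h 167:167 /var/lib/ceph/<fsid>/osd.1/block     # the block symlink itself
chown 167:167 "$(readlink -f /var/lib/ceph/<fsid>/osd.1/block)"  # LV device node, if needed
systemctl restart "ceph-<fsid>@osd.1.service"
```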
> >>> > >>> Quoting "GLE, Vivien" <vivien....@inist.fr>: > >>> > >>>> I don't get why it is searching in this path, because there is nothing there, > >>>> and this is the command I used to check bluestore > >>>> > >>>> > >>>> ceph-objectstore-tool --data-path /var/lib/ceph/"ID"/osd.2 --op list > >>>> --pgid 2.1 --no-mon-config > >>>> > >>>> ________________________________ > >>>> From: GLE, Vivien > >>>> Sent: Thursday, July 31, 2025 09:38:25 > >>>> To: Eugen Block > >>>> Cc: ceph-users@ceph.io > >>>> Subject: RE: [ceph-users] Re: Pgs troubleshooting > >>>> > >>>> > >>>> Hi, > >>>> > >>>> > >>>>> Or could reducing min_size to 1 help here (Thanks, Anthony)? I’m not > >>>>> entirely sure and am on vacation. 😅 It could be worth a try. But don’t > >>>>> forget to reset min_size back to 2 afterwards. > >>>> > >>>> > >>>> I did, but nothing really changed; how long should I wait to > >>>> see if it does anything? > >>>> > >>>> > >>>>> No, you use the ceph-objectstore-tool to export the PG from the intact > >>>>> OSD (you need to stop it though, and set the noout flag), make sure you have > >>>>> enough disk space. > >>>> > >>>> > >>>> I stopped my OSD and set noout to check if my PG is stored in bluestore > >>>> (it is not), but when I tried to restart my OSD, the OSD superblock was > >>>> gone > >>>> > >>>> > >>>> 2025-07-31T08:33:14.696+0000 7f0c7c889680 1 bdev(0x60945520ae00 > >>>> /var/lib/ceph/osd/ceph-2/block) open path > >>>> /var/lib/ceph/osd/ceph-2/block > >>>> 2025-07-31T08:33:14.697+0000 7f0c7c889680 -1 bdev(0x60945520ae00 > >>>> /var/lib/ceph/osd/ceph-2/block) open open got: (13) Permission denied > >>>> 2025-07-31T08:33:14.697+0000 7f0c7c889680 -1 ** ERROR: unable to > >>>> open OSD superblock on /var/lib/ceph/osd/ceph-2: (2) No such file or > >>>> directory > >>>> > >>>> Did I miss something?
> >>>> > >>>> Thanks > >>>> Vivien > >>>> > >>>> > >>>> > >>>> > >>>> ________________________________ > >>>> From: Eugen Block <ebl...@nde.ag> > >>>> Sent: Wednesday, July 30, 2025 16:56:50 > >>>> To: GLE, Vivien > >>>> Cc: ceph-users@ceph.io > >>>> Subject: [ceph-users] Re: Pgs troubleshooting > >>>> > >>>> Or could reducing min_size to 1 help here (Thanks, Anthony)? I’m not > >>>> entirely sure and am on vacation. 😅 It could be worth a try. But don’t > >>>> forget to reset min_size back to 2 afterwards. > >>>> > >>>> Quoting "GLE, Vivien" <vivien....@inist.fr>: > >>>> > >>>>> Hi, > >>>>> > >>>>> > >>>>>> did the two replaced OSDs fail at the same time (before they were > >>>>>> completely drained)? This would most likely mean that both those > >>>>>> failed OSDs contained the other two replicas of this PG > >>>>> > >>>>> > >>>>> Unfortunately yes > >>>>> > >>>>> > >>>>>> This would most likely mean that both those > >>>>>> failed OSDs contained the other two replicas of this PG. A pg query > >>>>>> should show which OSDs are missing. > >>>>> > >>>>> > >>>>> If I understand correctly, I need to move my PG onto OSD 1?
> >>>>> > >>>>> > >>>>> ceph -w > >>>>> > >>>>> > >>>>> osd.1 [ERR] 11.4 has 2 objects unfound and apparently lost > >>>>> > >>>>> > >>>>> ceph pg query 11.4 > >>>>> > >>>>> > >>>>> > >>>>> "up": [ > >>>>> 1, > >>>>> 4, > >>>>> 5 > >>>>> ], > >>>>> "acting": [ > >>>>> 1, > >>>>> 4, > >>>>> 5 > >>>>> ], > >>>>> "avail_no_missing": [], > >>>>> "object_location_counts": [ > >>>>> { > >>>>> "shards": "3,4,5", > >>>>> "objects": 2 > >>>>> } > >>>>> ], > >>>>> "blocked_by": [], > >>>>> "up_primary": 1, > >>>>> "acting_primary": 1, > >>>>> "purged_snaps": [] > >>>>> }, > >>>>> > >>>>> > >>>>> > >>>>> Thanks > >>>>> > >>>>> > >>>>> Vivien > >>>>> > >>>>> ________________________________ > >>>>> From: Eugen Block <ebl...@nde.ag> > >>>>> Sent: Tuesday, July 29, 2025 16:48:41 > >>>>> To: ceph-users@ceph.io > >>>>> Subject: [ceph-users] Re: Pgs troubleshooting > >>>>> > >>>>> Hi, > >>>>> > >>>>> did the two replaced OSDs fail at the same time (before they were > >>>>> completely drained)? This would most likely mean that both those > >>>>> failed OSDs contained the other two replicas of this PG. A pg query > >>>>> should show which OSDs are missing. > >>>>> You could try with the objectstore tool to export the PG from the > >>>>> remaining OSD and import it on different OSDs. Or you mark the data as > >>>>> lost if you don't care about the data and want a healthy state quickly. > >>>>> > >>>>> Regards, > >>>>> Eugen > >>>>> > >>>>> Quoting "GLE, Vivien" <vivien....@inist.fr>: > >>>>> > >>>>>> Thanks for your help !
This is my new pg stat, with no more peering > >>>>>> PGs (after restarting some OSDs) > >>>>>> > >>>>>> ceph pg stat -> > >>>>>> > >>>>>> 498 pgs: 1 active+recovery_unfound+degraded, 3 > >>>>>> recovery_unfound+undersized+degraded+remapped+peered, 14 > >>>>>> active+clean+scrubbing+deep, 480 active+clean; > >>>>>> > >>>>>> 36 GiB data, 169 GiB used, 6.2 TiB / 6.4 TiB avail; 8.8 KiB/s rd, 0 > >>>>>> B/s wr, 12 op/s; 715/41838 objects degraded (1.709%); 5/13946 > >>>>>> objects unfound (0.036%) > >>>>>> > >>>>>> ceph pg ls recovery_unfound -> shows that the PGs are replica 3; I tried to > >>>>>> repair them but nothing happened > >>>>>> > >>>>>> > >>>>>> ceph -w -> > >>>>>> > >>>>>> osd.1 [ERR] 11.4 has 2 objects unfound and apparently lost > >>>>>> > >>>>>> > >>>>>> > >>>>>> ________________________________ > >>>>>> From: Frédéric Nass <frederic.n...@clyso.com> > >>>>>> Sent: Tuesday, July 29, 2025 14:03:37 > >>>>>> To: GLE, Vivien > >>>>>> Cc: ceph-users@ceph.io > >>>>>> Subject: Re: [ceph-users] Pgs troubleshooting > >>>>>> > >>>>>> Hi Vivien, > >>>>>> > >>>>>> Unless you ran the 'ceph pg stat' command while peering was occurring, the > >>>>>> 37 peering PGs might indicate a temporary peering issue with one or > >>>>>> more OSDs. If that's the case, then restarting the associated OSDs could > >>>>>> help with the peering. You could list those PGs and > >>>>>> associated OSDs with 'ceph pg ls peering' and trigger peering by > >>>>>> either restarting one common OSD or by using 'ceph pg repeer <pg_id>'. > >>>>>> > >>>>>> Regarding the unfound object and its associated backfill_unfound PG, > >>>>>> you could identify this PG with 'ceph pg ls backfill_unfound' and > >>>>>> investigate this PG with 'ceph pg <pg_id> query'. Depending on the > >>>>>> output, you could try running a 'ceph pg repair <pg_id>'. Could you > >>>>>> confirm that this PG is not part of a size=2 pool? > >>>>>> > >>>>>> Best regards, > >>>>>> Frédéric.
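Frédéric's list-then-repeer suggestion could be scripted along these lines. This is a sketch only: it assumes jq is installed and that the JSON layout of `ceph pg ls` (a `.pg_stats[].pgid` field) matches your Ceph release.

```shell
# Sketch: repeer every PG currently reported as peering.
# Assumes `jq` is available; field names may vary by Ceph release.
for pg in $(ceph pg ls peering -f json | jq -r '.pg_stats[].pgid'); do
  ceph pg repeer "${pg}"
done
```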
> >>>>>> > >>>>>> -- > >>>>>> Frédéric Nass > >>>>>> Ceph Ambassador France | Senior Ceph Engineer @ CLYSO > >>>>>> Try our Ceph Analyzer -- https://analyzer.clyso.com/ > >>>>>> https://clyso.com | > >>>>>> frederic.n...@clyso.com > >>>>>> > >>>>>> > >>>>>> On Tue, Jul 29, 2025 at 14:19, GLE, Vivien > >>>>>> <vivien....@inist.fr> wrote: > >>>>>> Hi, > >>>>>> > >>>>>> After replacing 2 OSDs (data corruption), these are the stats of my > >>>>>> testing ceph cluster > >>>>>> > >>>>>> ceph pg stat > >>>>>> > >>>>>> 498 pgs: 37 peering, 1 active+remapped+backfilling, 1 > >>>>>> active+clean+remapped, 1 active+recovery_wait+undersized+remapped, 1 > >>>>>> backfill_unfound+undersized+degraded+remapped+peered, 1 > >>>>>> remapped+peering, 12 active+clean+scrubbing+deep, 1 > >>>>>> active+undersized, 442 active+clean, 1 > >>>>>> active+recovering+undersized+remapped > >>>>>> > >>>>>> 34 GiB data, 175 GiB used, 6.2 TiB / 6.4 TiB avail; 1.7 KiB/s rd, 1 > >>>>>> op/s; 31/39768 objects degraded (0.078%); 6/39768 objects misplaced > >>>>>> (0.015%); 1/13256 objects unfound (0.008%) > >>>>>> > >>>>>> ceph osd stat > >>>>>> 7 osds: 7 up (since 20h), 7 in (since 20h); epoch: e427538; 4 > >>>>>> remapped pgs > >>>>>> > >>>>>> Does anyone have an idea of where to start to get back to a healthy cluster? > >>>>>> > >>>>>> Thanks !
> >>>>>> > >>>>>> Vivien
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io