Suggestions:

1. Figure out which OSDs are unsafe to stop.
2. Slowly restart all the other OSDs, i.e. the ones that are safe to stop.
3. Figure out which PGs are degraded.
4. Use the "ceph osd pg-upmap-items" command to redirect their recovery to
already-restarted OSDs.
5. At this point, the set of OSDs that are unsafe to restart should contain
only already-restarted OSDs.
6. Restart the remaining OSDs.
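
A rough, untested sketch of how steps 1 and 4 could be scripted in Python.
It assumes that "ceph osd ok-to-stop <id>" exits with a non-zero status when
stopping that OSD is unsafe, and that "ceph osd pg-upmap-items" takes a PG id
followed by a from/to OSD pair. The helper names (ceph_ok_to_stop, split_osds,
upmap_command) are made up for illustration. Choosing the PG and the OSD pair
is left to the operator, and the target still has to satisfy the CRUSH rule.

```python
import subprocess


def ceph_ok_to_stop(osd_id):
    """Ask the cluster whether this single OSD can be stopped safely.

    Assumption: a zero exit status from `ceph osd ok-to-stop` means "safe".
    """
    result = subprocess.run(
        ["ceph", "osd", "ok-to-stop", str(osd_id)],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def split_osds(osd_ids, ok_to_stop=ceph_ok_to_stop):
    """Step 1: split OSD ids into (safe, unsafe) lists, keeping input order.

    `ok_to_stop` is injectable so the logic can be tried without a cluster.
    """
    safe, unsafe = [], []
    for osd_id in osd_ids:
        if ok_to_stop(osd_id):
            safe.append(osd_id)
        else:
            unsafe.append(osd_id)
    return safe, unsafe


def upmap_command(pgid, from_osd, to_osd):
    """Step 4: build (but do not run) the pg-upmap-items command line."""
    return ["ceph", "osd", "pg-upmap-items", pgid, str(from_osd), str(to_osd)]
```

For a dry run, pass a stand-in check such as
split_osds(range(4), ok_to_stop=lambda i: i % 2 == 0) and print the
upmap_command() output for review before running anything against the cluster.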

P.S. Not tested.


On Tue, Aug 19, 2025 at 5:31 PM Curt <light...@gmail.com> wrote:

> Hello all,
>
> I'm sure this has been discussed before, but I can't seem to find it. I
> know on older versions of Ceph there was an issue with mclock having no
> recovery and switching to wpq fixed it. Is this still an issue with
> 19.2.1?
>
> I recently ran into this bug <https://tracker.ceph.com/issues/70390> and
> various issues with it. In order to help recovery I set norebalance flag,
> so it would focus solely on undersized PGs. The issue I'm seeing though is
> sometimes recovery shows no activity despite having
> X PGs in active+undersized+remapped+backfilling. Sometimes restarting a few
> OSDs fixes the issue and recovery starts again.
>
> I'm tempted to switch to wpq, but that would mean having to slowly restart
> each OSD, which, with undersized PGs, would cause IO to stop while some OSDs
> are restarted. Wanted to get others' thoughts before making the change.
>
> Thanks,
> Curt
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Alexander Patrakov