Suggestions:

1. Figure out which OSDs are unsafe to stop.
2. Slowly restart all the other OSDs (the ones that are safe to stop).
3. Figure out which PGs are degraded.
4. Use the "ceph osd pg-upmap-items" command to redirect their recovery to already-restarted OSDs.
5. At this point, the set of OSDs that are unsafe to restart should contain only already-restarted OSDs.
6. Restart the remaining OSDs.
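A rough Python sketch of steps 1, 3 and 4 above, in case it helps to see the idea spelled out. It rests on several assumptions that are mine and need checking: that "ceph osd ok-to-stop <id>" exits with status 0 only when the OSD is safe to stop, that each degraded PG is given as a dict with "pgid", "up" and "acting" keys (the shape I would expect from the "pg_stats" entries of "ceph pg ls degraded -f json"), and that a backfill target is an OSD in "up" but not in "acting". The function names are hypothetical helpers, not part of Ceph. CRUSH failure domains are ignored entirely, so a proposed mapping may not be valid for the pool's rule.

```python
import subprocess
from typing import Callable, Iterable


def ok_to_stop(osd_id: int) -> bool:
    """Assumption: `ceph osd ok-to-stop <id>` exits 0 only if stopping is safe."""
    result = subprocess.run(
        ["ceph", "osd", "ok-to-stop", str(osd_id)],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def split_by_safety(
    osd_ids: Iterable[int],
    is_safe: Callable[[int], bool] = ok_to_stop,
) -> tuple[list[int], list[int]]:
    """Step 1: partition OSDs into (safe to stop, unsafe to stop)."""
    safe, unsafe = [], []
    for osd in osd_ids:
        (safe if is_safe(osd) else unsafe).append(osd)
    return safe, unsafe


def upmap_commands(
    degraded_pgs: Iterable[dict],
    restarted: Iterable[int],
) -> list[list[str]]:
    """Steps 3-4: propose pg-upmap-items commands for degraded PGs.

    A backfill target that has not been restarted yet is swapped for an
    already-restarted OSD the PG does not use. CRUSH rules are NOT checked.
    """
    restarted_sorted = sorted(set(restarted))
    commands = []
    for pg in degraded_pgs:
        up, acting = pg["up"], pg["acting"]
        used = set(up) | set(acting)
        pairs: list[str] = []
        for target in up:
            # Skip OSDs that already hold data or are already restarted.
            if target in acting or target in restarted_sorted:
                continue
            replacement = next(
                (o for o in restarted_sorted if o not in used), None
            )
            if replacement is None:
                continue  # no restarted OSD left for this PG
            used.add(replacement)
            pairs += [str(target), str(replacement)]
        if pairs:
            commands.append(
                ["ceph", "osd", "pg-upmap-items", pg["pgid"], *pairs]
            )
    return commands
```

The commands are only returned, never executed, so they can be reviewed by hand (and compared against the CRUSH rule) before anything is run on the cluster.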
P.S. Not tested.

On Tue, Aug 19, 2025 at 5:31 PM Curt <light...@gmail.com> wrote:
> Hello all,
>
> I'm sure this has been discussed before, but I can't seem to find it. I
> know on older versions of Ceph there was an issue with mclock having no
> recovery and switching to wpq fixed it. Is this still an issue with
> 19.2.1?
>
> I recently ran into this bug <https://tracker.ceph.com/issues/70390> and
> various issues with it. In order to help recovery I set norebalance flag,
> so it would focus solely on undersized PGs. The issue I'm seeing though is
> sometimes recovering will show nothing despite having
> X active+undersized+remapped+backfilling. Sometimes restarting a few OSD's
> will fix the issue and it will start again.
>
> I'm tempted to switch to wpq, but that would mean having to slowly restart
> each OSD, which with undersized would cause IO to stop while some OSD's are
> restarted. Wanted to get others' thoughts before making the change.
>
> Thanks,
> Curt
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

--
Alexander Patrakov