My usual workaround for that is to set the noscrub and nodeep-scrub flags and wait (sometimes as long as 3 hours) until all the scheduled scrubs finish. A manually issued scrub or repair then starts immediately. After that I unset the scrub-blocking flags.
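The workaround above can be sketched as follows (the PG id 75.302 is the one discussed later in this thread; substitute your own):

```shell
# Block new scheduled scrubs cluster-wide.
ceph osd set noscrub
ceph osd set nodeep-scrub

# Wait until already-running scrubs drain; "ceph -s" stops
# reporting PGs in the scrubbing/deep-scrubbing state.
ceph -s

# With no scrubs competing for slots, a manual repair starts immediately.
ceph pg repair 75.302

# Re-enable scheduled scrubbing afterwards.
ceph osd unset noscrub
ceph osd unset nodeep-scrub
```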
General advice regarding pg repair: do not run it without a full understanding of what kind of data error has been discovered, how many replicas you have, how many of them are affected, and so on. In some cases pg repair can be dangerous to the data.

On Thu, 11 Oct 2018 at 19:54, Gregory Farnum <[email protected]> wrote:

> Yeah, improving that workflow is in the backlog. (Or maybe it's done in
> master? I forget.) But it's complicated, so for now that's just how it
> goes. :(
>
> On Thu, Oct 11, 2018 at 10:27 AM Brett Chancellor <
> [email protected]> wrote:
>
>> This seems like a bug. If I'm kicking off a repair manually, it should
>> take place immediately and ignore flags such as max scrubs or the
>> minimum scrub window.
>>
>> -Brett
>>
>> On Thu, Oct 11, 2018 at 1:11 PM David Turner <[email protected]>
>> wrote:
>>
>>> Part of a repair is queuing a deep scrub. As soon as the repair
>>> part is over, the deep scrub continues until it is done.
>>>
>>> On Thu, Oct 11, 2018, 12:26 PM Brett Chancellor <
>>> [email protected]> wrote:
>>>
>>>> Does the "repair" function use the same rules as a deep scrub? I
>>>> couldn't get one to kick off until I temporarily increased max_scrubs
>>>> and lowered scrub_min_interval on all 3 OSDs for that placement group.
>>>> This ended up fixing the issue, so I'll leave this here in case
>>>> somebody else runs into it.
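Before deciding whether a repair is safe, the nature of the inconsistency can be inspected first. A sketch, again using PG 75.302 from this thread (`rados list-inconsistent-obj` needs a recent scrub of the PG to have recorded the errors):

```shell
# List the objects the scrub flagged, with the error type
# (read_error, data_digest_mismatch, ...) and the affected shards/OSDs.
rados list-inconsistent-obj 75.302 --format=json-pretty

# Full PG state, including which OSDs hold the replicas.
ceph pg 75.302 query
```

Knowing whether the bad copy is a read error on one OSD or a digest mismatch across replicas is exactly the "full understanding" the advice above calls for.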
>>>>
>>>> sudo ceph tell 'osd.208' injectargs '--osd_max_scrubs 3'
>>>> sudo ceph tell 'osd.120' injectargs '--osd_max_scrubs 3'
>>>> sudo ceph tell 'osd.235' injectargs '--osd_max_scrubs 3'
>>>> sudo ceph tell 'osd.208' injectargs '--osd_scrub_min_interval 1.0'
>>>> sudo ceph tell 'osd.120' injectargs '--osd_scrub_min_interval 1.0'
>>>> sudo ceph tell 'osd.235' injectargs '--osd_scrub_min_interval 1.0'
>>>> sudo ceph pg repair 75.302
>>>>
>>>> -Brett
>>>>
>>>> On Thu, Oct 11, 2018 at 8:42 AM Maks Kowalik <[email protected]>
>>>> wrote:
>>>>
>>>>> Imho moving was not the best idea (a copying attempt would have told
>>>>> us whether a read error was the cause here).
>>>>> Scrubs might not start if there are many other scrubs
>>>>> ongoing.
>>>>>
>>>>> On Thu, 11 Oct 2018 at 14:27, Brett Chancellor <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I moved the file. But the cluster won't actually start any
>>>>>> scrub/repair I manually initiate.
>>>>>>
>>>>>> On Thu, Oct 11, 2018, 7:51 AM Maks Kowalik <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Based on the log output, it looks like you have a damaged file
>>>>>>> on OSD 235, where the shard is stored.
>>>>>>> To confirm that, you should find the file (using
>>>>>>> 81d5654895863d as part of its name) and try to copy it to another
>>>>>>> directory.
>>>>>>> If you get an I/O error while copying, the next steps would be to
>>>>>>> delete the file, run a scrub on 75.302, and take a deep look at
>>>>>>> OSD 235 for any other errors.
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Maks
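The file check described above can be sketched like this, assuming a Filestore OSD at the default data path (the exact subdirectory under the PG's `_head` directory varies with hash splitting, hence the `find`; the path and filename fragment come from this thread):

```shell
# On the node hosting osd.235: locate the object's backing file by the
# hash fragment reported in the scrub log (Filestore layout assumed).
find /var/lib/ceph/osd/ceph-235/current/75.302_head -name '*81d5654895863d*'

# Try to read the file found above; an I/O error here points to a
# read error on the underlying disk rather than a logical mismatch.
cp /var/lib/ceph/osd/ceph-235/current/75.302_head/<found-file> /tmp/

# If the copy fails, check the kernel log for the disk errors behind it.
dmesg | tail
```

Copying (rather than moving) is non-destructive: the shard stays in place, so the cluster's view of the PG is unchanged while you test readability.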
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
