I ended up taking Brett's recommendation: I ran "ceph osd set noscrub" and "ceph osd set nodeep-scrub", then waited for the running scrubs to finish while watching "ceph -w" to see what it was doing.
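In case it's useful to anyone else hitting this, the whole sequence was
roughly the following ("ceph pg repair 2.798" had already been issued
earlier in the thread, so at this point it was just a matter of letting
it run; your PG id will differ, obviously):

# ceph osd set noscrub
# ceph osd set nodeep-scrub
# ceph -w        <- wait for the in-flight scrubs and the repair to finish
# ceph osd unset noscrub
# ceph osd unset nodeep-scrub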
Eventually, "ceph -w" reported the following:

2019-05-18 16:08:44.032780 mon.gi-cba-01 [ERR] Health check update: Possible data damage: 1 pg inconsistent, 1 pg repair (PG_DAMAGED)
2019-05-18 16:10:13.748132 osd.41 [ERR] 2.798s0 soid 2:19e2f773:::1000255879d.00000000:head : object info inconsistent , attr name mismatch '_layout', attr name mismatch '_parent'
2019-05-18 16:12:24.575444 osd.41 [ERR] 2.798s0 soid 2:19e736e2:::10002558362.00000000:head : object info inconsistent , attr name mismatch '_layout', attr name mismatch '_parent'
2019-05-18 16:19:57.204557 osd.41 [ERR] 2.798s0 soid 2:19f62945:::10002558ed4.00000000:head : object info inconsistent , attr name mismatch '_layout', attr name mismatch '_parent'
2019-05-18 16:23:07.316487 osd.41 [ERR] 2.798s0 soid 2:19fc6ba9:::100025581cc.00000000:head : object info inconsistent , attr name mismatch '_layout', attr name mismatch '_parent'
2019-05-18 16:24:41.494480 osd.41 [ERR] 2.798s0 soid 2:19ffaa2a:::10002555405.00000000:head : object info inconsistent , attr name mismatch '_layout', attr name mismatch '_parent'
2019-05-18 16:24:52.869234 osd.41 [ERR] 2.798s0 repair 0 missing, 5 inconsistent objects
2019-05-18 16:24:52.870018 osd.41 [ERR] 2.798 repair 5 errors, 5 fixed
2019-05-18 16:24:54.047312 mon.gi-cba-01 [WRN] Health check failed: Degraded data redundancy: 5/632305016 objects degraded (0.000%), 1 pg degraded (PG_DEGRADED)
2019-05-18 16:24:54.047359 mon.gi-cba-01 [INF] Health check cleared: OSD_SCRUB_ERRORS (was: 5 scrub errors)
2019-05-18 16:24:54.047383 mon.gi-cba-01 [INF] Health check cleared: PG_DAMAGED (was: Possible data damage: 1 pg inconsistent, 1 pg repair)
2019-05-18 16:24:59.232439 mon.gi-cba-01 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 5/632305016 objects degraded (0.000%), 1 pg degraded)
2019-05-18 17:00:00.000099 mon.gi-cba-01 [WRN] overall HEALTH_WARN noscrub,nodeep-scrub flag(s) set

After that, I ran "ceph osd unset noscrub" and "ceph osd unset
nodeep-scrub", and the system was back to HEALTH_OK. Still seems like
black magic, but I guess I'm happy now... Thanks!

On Sun, May 19, 2019 at 2:44 AM Paul Emmerich <[email protected]> wrote:

> Check out the log of the primary OSD in that PG to see what happened
> during scrubbing
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
>
> On Sun, May 19, 2019 at 12:41 AM Jorge Garcia <[email protected]> wrote:
>
>> I have tried ceph pg repair several times. It claims "instructing pg
>> 2.798s0 on osd.41 to repair" but then nothing happens as far as I can
>> tell. Any way of knowing if it's doing more?
>>
>> On Sat, May 18, 2019 at 3:33 PM Brett Chancellor <[email protected]> wrote:
>>
>>> I would try the ceph pg repair. If you see the pg go into deep
>>> scrubbing, then back to inconsistent, you probably have a bad drive.
>>> Find which of the drives in the pg are bad (pg query, or go to the
>>> host and look through dmesg). Take that osd offline and mark it out.
>>> Once backfill is complete, it should clear up.
>>>
>>> On Sat, May 18, 2019, 6:05 PM Jorge Garcia <[email protected]> wrote:
>>>
>>>> We are testing a ceph cluster, mostly using cephfs. We are using an
>>>> erasure-code pool and have been loading it up with data. Recently,
>>>> we got a HEALTH_ERR response when we were querying the ceph status.
>>>> We stopped all activity to the filesystem and waited to see if the
>>>> error would go away. It didn't.
>>>> Then we tried a couple of suggestions from the internet (ceph pg
>>>> repair, ceph pg scrub, ceph pg deep-scrub) to no avail. I'm not sure
>>>> how to find out more information about what the problem is, and how
>>>> to repair the filesystem to bring it back to normal health. Any
>>>> suggestions?
>>>>
>>>> Current status:
>>>>
>>>> # ceph -s
>>>>
>>>>   cluster:
>>>>     id:     28ef32f1-4350-491b-9003-b19b9c3a2076
>>>>     health: HEALTH_ERR
>>>>             5 scrub errors
>>>>             Possible data damage: 1 pg inconsistent
>>>>
>>>>   services:
>>>>     mon: 3 daemons, quorum gi-cba-01,gi-cba-02,gi-cba-03
>>>>     mgr: gi-cba-01(active), standbys: gi-cba-02, gi-cba-03
>>>>     mds: backups-1/1/1 up {0=gi-cbmd=up:active}
>>>>     osd: 87 osds: 87 up, 87 in
>>>>
>>>>   data:
>>>>     pools:   2 pools, 4096 pgs
>>>>     objects: 90.98 M objects, 134 TiB
>>>>     usage:   210 TiB used, 845 TiB / 1.0 PiB avail
>>>>     pgs:     4088 active+clean
>>>>              5    active+clean+scrubbing+deep
>>>>              2    active+clean+scrubbing
>>>>              1    active+clean+inconsistent
>>>>
>>>> # ceph health detail
>>>>
>>>> HEALTH_ERR 5 scrub errors; Possible data damage: 1 pg inconsistent
>>>> OSD_SCRUB_ERRORS 5 scrub errors
>>>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>>>>     pg 2.798 is active+clean+inconsistent, acting [41,50,17,2,86,70,61]
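P.S. In case someone finds this thread later: for more detail than "ceph
health detail" gives, the stock tooling can list exactly which objects a
scrub flagged and why. Something like the following (using our pg 2.798
as the example; as far as I understand, list-inconsistent-obj only
reports what the most recent scrub of that pg recorded):

# rados list-inconsistent-obj 2.798 --format=json-pretty
# ceph pg 2.798 query

The first command prints per-object, per-shard errors (e.g. the attr
mismatches osd.41 logged above), and the pg query shows the acting set
and scrub stamps, which should help when deciding between "ceph pg
repair" and pulling a bad drive, as Brett suggested.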
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
