I ended up taking Brett's recommendation and doing a "ceph osd set noscrub"
and "ceph osd set nodeep-scrub", then waiting for the running scrubs to
finish while doing a "ceph -w" to see what it was doing. Eventually, it
reported the following:

2019-05-18 16:08:44.032780 mon.gi-cba-01 [ERR] Health check update:
Possible data damage: 1 pg inconsistent, 1 pg repair (PG_DAMAGED)

2019-05-18 16:10:13.748132 osd.41 [ERR] 2.798s0 soid
2:19e2f773:::1000255879d.00000000:head : object info inconsistent , attr
name mismatch '_layout', attr name mismatch '_parent'

2019-05-18 16:12:24.575444 osd.41 [ERR] 2.798s0 soid
2:19e736e2:::10002558362.00000000:head : object info inconsistent , attr
name mismatch '_layout', attr name mismatch '_parent'

2019-05-18 16:19:57.204557 osd.41 [ERR] 2.798s0 soid
2:19f62945:::10002558ed4.00000000:head : object info inconsistent , attr
name mismatch '_layout', attr name mismatch '_parent'

2019-05-18 16:23:07.316487 osd.41 [ERR] 2.798s0 soid
2:19fc6ba9:::100025581cc.00000000:head : object info inconsistent , attr
name mismatch '_layout', attr name mismatch '_parent'

2019-05-18 16:24:41.494480 osd.41 [ERR] 2.798s0 soid
2:19ffaa2a:::10002555405.00000000:head : object info inconsistent , attr
name mismatch '_layout', attr name mismatch '_parent'

2019-05-18 16:24:52.869234 osd.41 [ERR] 2.798s0 repair 0 missing, 5
inconsistent objects

2019-05-18 16:24:52.870018 osd.41 [ERR] 2.798 repair 5 errors, 5 fixed

2019-05-18 16:24:54.047312 mon.gi-cba-01 [WRN] Health check failed:
Degraded data redundancy: 5/632305016 objects degraded (0.000%), 1 pg
degraded (PG_DEGRADED)

2019-05-18 16:24:54.047359 mon.gi-cba-01 [INF] Health check cleared:
OSD_SCRUB_ERRORS (was: 5 scrub errors)

2019-05-18 16:24:54.047383 mon.gi-cba-01 [INF] Health check cleared:
PG_DAMAGED (was: Possible data damage: 1 pg inconsistent, 1 pg repair)

2019-05-18 16:24:59.232439 mon.gi-cba-01 [INF] Health check cleared:
PG_DEGRADED (was: Degraded data redundancy: 5/632305016 objects degraded
(0.000%), 1 pg degraded)

2019-05-18 17:00:00.000099 mon.gi-cba-01 [WRN] overall HEALTH_WARN
noscrub,nodeep-scrub flag(s) set


After that, I ran "ceph osd unset noscrub" and "ceph osd unset nodeep-scrub"
and the system was back to HEALTH_OK. Still seems like black magic, but I
guess I'm happy now... Thanks!
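For anyone who hits the same thing later, the whole sequence boils down to
the commands below. This is only a sketch of what worked here, not a general
fix: it needs a live cluster, and the PG id 2.798 is specific to this one
(substitute your own from "ceph health detail").

```shell
# Pause new scrubs cluster-wide so the repair isn't interleaved
# with other scrub work.
ceph osd set noscrub
ceph osd set nodeep-scrub

# Watch the cluster log until the in-flight scrubs drain.
ceph -w

# Optionally inspect what is actually inconsistent before repairing
# (PG id 2.798 is from this thread; substitute your own).
rados list-inconsistent-obj 2.798 --format=json-pretty

# Ask the primary OSD to repair the PG, then watch "ceph -w" for the
# "repair N errors, N fixed" lines before going any further.
ceph pg repair 2.798

# Re-enable scrubbing once the PG is active+clean again.
ceph osd unset noscrub
ceph osd unset nodeep-scrub
```

Note that "ceph pg repair" only queues the repair; the actual work happens
when the primary OSD next deep-scrubs that PG, which is why clearing the
scrub backlog first made it appear to finally "do something".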

On Sun, May 19, 2019 at 2:44 AM Paul Emmerich <[email protected]>
wrote:

> Check out the log of the primary OSD in that PG to see what happened
> during scrubbing
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
>
> On Sun, May 19, 2019 at 12:41 AM Jorge Garcia <[email protected]>
> wrote:
>
>> I have tried ceph pg repair several times. It claims "instructing pg
>> 2.798s0 on osd.41 to repair" but then nothing happens as far as I can tell.
>> Any way of knowing if it's doing more?
>>
>> On Sat, May 18, 2019 at 3:33 PM Brett Chancellor <
>> [email protected]> wrote:
>>
>>> I would try the ceph pg repair. If you see the pg go into deep
>>> scrubbing, then back to inconsistent you probably have a bad drive. Find
>>> which of the drives in the pg are bad (pg query or go to the host and look
>>> through dmesg). Take that osd offline and mark it out. Once backfill is
>>> complete, it should clear up.
>>>
>>> On Sat, May 18, 2019, 6:05 PM Jorge Garcia <[email protected]> wrote:
>>>
>>>> We are testing a ceph cluster mostly using cephfs. We are using an
>>>> erasure-code pool, and have been loading it up with data. Recently, we got
>>>> a HEALTH_ERR response when we were querying the ceph status. We stopped all
>>>> activity to the filesystem, and waited to see if the error would go away.
>>>> It didn't. Then we tried a couple of suggestions from the internet (ceph pg
>>>> repair, ceph pg scrub, ceph pg deep-scrub) to no avail. I'm not sure how to
>>>> find out more information about what the problem is, and how to repair the
>>>> filesystem to bring it back to normal health. Any suggestions?
>>>>
>>>> Current status:
>>>>
>>>> # ceph -s
>>>>
>>>>   cluster:
>>>>
>>>>     id:     28ef32f1-4350-491b-9003-b19b9c3a2076
>>>>
>>>>     health: HEALTH_ERR
>>>>
>>>>             5 scrub errors
>>>>
>>>>             Possible data damage: 1 pg inconsistent
>>>>
>>>>
>>>>
>>>>   services:
>>>>
>>>>     mon: 3 daemons, quorum gi-cba-01,gi-cba-02,gi-cba-03
>>>>
>>>>     mgr: gi-cba-01(active), standbys: gi-cba-02, gi-cba-03
>>>>
>>>>     mds: backups-1/1/1 up  {0=gi-cbmd=up:active}
>>>>
>>>>     osd: 87 osds: 87 up, 87 in
>>>>
>>>>
>>>>
>>>>   data:
>>>>
>>>>     pools:   2 pools, 4096 pgs
>>>>
>>>>     objects: 90.98 M objects, 134 TiB
>>>>
>>>>     usage:   210 TiB used, 845 TiB / 1.0 PiB avail
>>>>
>>>>     pgs:     4088 active+clean
>>>>
>>>>              5    active+clean+scrubbing+deep
>>>>
>>>>              2    active+clean+scrubbing
>>>>
>>>>              1    active+clean+inconsistent
>>>>
>>>> # ceph health detail
>>>>
>>>> HEALTH_ERR 5 scrub errors; Possible data damage: 1 pg inconsistent
>>>>
>>>> OSD_SCRUB_ERRORS 5 scrub errors
>>>>
>>>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>>>>
>>>>     pg 2.798 is active+clean+inconsistent, acting [41,50,17,2,86,70,61]
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> [email protected]
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>