Neither the issue I created nor Michael's [1] ticket that it was rolled
into is getting any traction.  How are y'all faring with your clusters?
I've had 3 PGs inconsistent with 5 scrub errors for a few weeks now.  I
assumed that the third PG was just like the first 2 in that it couldn't be
scrubbed, but I just checked the last scrub timestamps of the 3 PGs and the
third one is able to run scrubs.  I'm going to increase the logging on it
after I finish a round of maintenance we're performing on some OSDs.
Hopefully that will turn up something more about these objects.
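
In case anyone wants to check the same thing on their cluster, this is
roughly what I'm doing (the pg and osd ids are from my cluster, and the
debug levels are just what I plan to try):

$ ceph pg 145.2e3 query | grep -E 'last_(deep_)?scrub_stamp'
$ ceph tell osd.234 injectargs '--debug_osd 20 --debug_ms 1'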


[1] http://tracker.ceph.com/issues/23576

On Fri, Apr 6, 2018 at 12:30 PM David Turner <[email protected]> wrote:

> I'm using filestore.  I think the root cause is something getting stuck in
> the code.  As such I went ahead and created a [1] tracker ticket for this.
> Hopefully it gets some traction, as I'm not particularly looking forward
> to messing with deleting PGs with ceph-objectstore-tool in production.
>
> [1] http://tracker.ceph.com/issues/23577
>
> On Fri, Apr 6, 2018 at 11:40 AM Michael Sudnick <[email protected]>
> wrote:
>
>> I've tried a few more things to get a deep-scrub going on my PG. I tried
>> instructing the involved OSDs to scrub all their PGs and it looks like
>> that didn't do it.
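>>
>> For reference, what I ran was along these lines, once per OSD in the
>> PG's acting set (the osd ids here are just placeholders):
>>
>> $ ceph osd deep-scrub 12
>> $ ceph osd deep-scrub 47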
>>
>> Do you have any documentation on ceph-objectstore-tool? What I've found
>> online talks about filestore and not bluestore.
>>
>> On 6 April 2018 at 09:27, David Turner <[email protected]> wrote:
>>
>>> I'm running into this exact same situation.  I'm running 12.2.2 and I
>>> have an EC PG with a scrub error.  It has the same output for [1] rados
>>> list-inconsistent-obj as mentioned before.  This is the [2] full health
>>> detail, and this is the [3] excerpt from the log of the deep-scrub that
>>> marked the PG inconsistent.  The scrub happened while the PG was starting
>>> up after using ceph-objectstore-tool to split its filestore subfolders.
>>> That was done with a script that I've used for months without any side
>>> effects.
>>>
>>> I have tried quite a few things to get this PG to deep-scrub or repair,
>>> but to no avail; it simply does nothing.  I set osd_max_scrubs to 0 on
>>> every OSD in the cluster, waited for all scrubbing and deep scrubbing to
>>> finish, then raised it to 1 on the 11 OSDs serving this PG before
>>> issuing a deep-scrub.  The PG then sat there for over an hour without
>>> deep-scrubbing.  My current test is to set osd_max_scrubs to 1 on all
>>> OSDs, raise it to 4 on the OSDs serving this PG, and then issue the
>>> repair... but similarly nothing happens.  Each time I issue the
>>> deep-scrub or repair, the output correctly says 'instructing pg 145.2e3
>>> on osd.234 to repair', but nothing shows up in the OSD's log and the PG
>>> state stays 'active+clean+inconsistent'.
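>>>
>>> For anyone who wants to reproduce this, the sequence I'm running is
>>> roughly the following (pg and osd ids are from my cluster; repeat the
>>> second command for each OSD in the acting set):
>>>
>>> $ ceph tell osd.* injectargs '--osd_max_scrubs 0'
>>> $ ceph tell osd.234 injectargs '--osd_max_scrubs 4'
>>> $ ceph pg repair 145.2e3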
>>>
>>> My next step, unless anyone has a better idea, is to find the exact copy
>>> of the PG that is missing the object, use ceph-objectstore-tool to back
>>> up that copy of the PG, and remove it.  Then starting the OSD back up
>>> should backfill a full copy of the PG and leave it healthy again.
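>>>
>>> Roughly what I have in mind, with the OSD stopped first (osd.234 here is
>>> a placeholder for whichever OSD holds the bad copy, and the exact flags
>>> may vary by version):
>>>
>>> $ systemctl stop ceph-osd@234
>>> $ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-234 \
>>>     --pgid 145.2e3s0 --op export --file /root/145.2e3s0.export
>>> $ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-234 \
>>>     --pgid 145.2e3s0 --op remove
>>> $ systemctl start ceph-osd@234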
>>>
>>>
>>>
>>> [1] $ rados list-inconsistent-obj 145.2e3
>>> No scrub information available for pg 145.2e3
>>> error 2: (2) No such file or directory
>>>
>>> [2] $ ceph health detail
>>> HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
>>> OSD_SCRUB_ERRORS 1 scrub errors
>>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>>>     pg 145.2e3 is active+clean+inconsistent, acting
>>> [234,132,33,331,278,217,55,358,79,3,24]
>>>
>>> [3] 2018-04-04 15:24:53.603380 7f54d1820700  0 log_channel(cluster) log
>>> [DBG] : 145.2e3 deep-scrub starts
>>> 2018-04-04 17:32:37.916853 7f54d1820700 -1 log_channel(cluster) log
>>> [ERR] : 145.2e3s0 deep-scrub 1 missing, 0 inconsistent objects
>>> 2018-04-04 17:32:37.916865 7f54d1820700 -1 log_channel(cluster) log
>>> [ERR] : 145.2e3 deep-scrub 1 errors
>>>
>>> On Mon, Apr 2, 2018 at 4:51 PM Michael Sudnick <
>>> [email protected]> wrote:
>>>
>>>> Hi Kjetil,
>>>>
>>>> I've tried to get the pg to scrub/deep-scrub and nothing seems to be
>>>> happening. I've tried it a few times over the last few days. My cluster
>>>> is recovering from a failed disk (which was probably the reason for the
>>>> inconsistency). Do I need to wait for the cluster to heal before
>>>> repair/deep-scrub works?
>>>>
>>>> -Michael
>>>>
>>>> On 2 April 2018 at 14:13, Kjetil Joergensen <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Scrub or deep-scrub the pg; that should in theory get
>>>>> list-inconsistent-obj back to spitting out what's wrong. Then mail
>>>>> that info to the list.
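>>>>>
>>>>> e.g. (using the pg id from your message below):
>>>>>
>>>>> $ ceph pg deep-scrub 49.11c
>>>>> $ rados list-inconsistent-obj 49.11c --format=json-pretty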
>>>>>
>>>>> -KJ
>>>>>
>>>>> On Sun, Apr 1, 2018 at 9:17 AM, Michael Sudnick <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have a small cluster with an inconsistent pg. I've tried ceph pg
>>>>>> repair multiple times with no luck. rados list-inconsistent-obj 49.11c
>>>>>> returns:
>>>>>>
>>>>>> # rados list-inconsistent-obj 49.11c
>>>>>> No scrub information available for pg 49.11c
>>>>>> error 2: (2) No such file or directory
>>>>>>
>>>>>> I'm a bit at a loss here as to what to do to recover. That pg is part
>>>>>> of a cephfs_data pool with compression set to force/snappy.
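>>>>>>
>>>>>> For reference, compression on the pool was set roughly like this:
>>>>>>
>>>>>> # ceph osd pool set cephfs_data compression_mode force
>>>>>> # ceph osd pool set cephfs_data compression_algorithm snappy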
>>>>>>
>>>>>> Does anyone have any suggestions?
>>>>>>
>>>>>> -Michael
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Kjetil Joergensen <[email protected]>
>>>>> SRE, Medallia Inc
>>>>>
>>>>
>>>
>>