Mine repaired themselves after a regular deep scrub. Weird that I couldn't
trigger one manually.
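
For what it's worth, the manual trigger I was attempting was just the usual
commands, roughly (pgid as reported by ceph health detail):

$ ceph pg deep-scrub <pgid>
$ ceph pg repair <pgid>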

On 30 April 2018 at 14:23, David Turner <[email protected]> wrote:

> My 3 inconsistent PGs finally decided to run automatic scrubs and now 2 of
> the 3 will allow me to run deep-scrubs and repairs on them.  The deep-scrub
> did not show any new information about the objects other than that they
> were missing in one of the copies.  Running a repair fixed the
> inconsistency.
>
> On Tue, Apr 24, 2018 at 4:53 PM David Turner <[email protected]>
> wrote:
>
>> Neither the issue I created nor Michael's [1] ticket that it was rolled
>> into is getting any traction.  How are y'all faring with your clusters?
>> I've had 3 PGs inconsistent with 5 scrub errors for a few weeks now.  I
>> assumed that the third PG was just like the first 2 in that it couldn't be
>> scrubbed, but I just checked the last scrub timestamp of the 3 PGs and the
>> third one is able to run scrubs.  I'm going to increase the logging on it
>> after I finish a round of maintenance we're performing on some OSDs.
>> Hopefully I'll find something more about these objects.
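>>
>> For reference, "checking the last scrub timestamp" and "increasing the
>> logging" here mean roughly the following (using pg 145.2e3 and osd.234 from
>> my earlier mail below as examples; the debug levels get reverted once the
>> scrub has been captured):
>>
>> $ ceph pg 145.2e3 query | grep -E 'last_(deep_)?scrub_stamp'
>> $ ceph tell osd.234 injectargs '--debug_osd 20 --debug_ms 1'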
>>
>>
>> [1] http://tracker.ceph.com/issues/23576
>>
>> On Fri, Apr 6, 2018 at 12:30 PM David Turner <[email protected]>
>> wrote:
>>
>>> I'm using filestore.  I think the root cause is something getting stuck
>>> in the code.  As such I went ahead and created a [1] tracker ticket for this.
>>> Hopefully it gets some traction as I'm not particularly looking forward to
>>> messing with deleting PGs with the ceph-objectstore-tool in production.
>>>
>>> [1] http://tracker.ceph.com/issues/23577
>>>
>>> On Fri, Apr 6, 2018 at 11:40 AM Michael Sudnick <
>>> [email protected]> wrote:
>>>
>>>> I've tried a few more things to get a deep-scrub going on my PG. I
>>>> tried instructing the involved osds to scrub all their PGs and it looks
>>>> like that didn't do it.
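>>>>
>>>> That is, something like the following, run once for each OSD in the PG's
>>>> acting set:
>>>>
>>>> $ ceph osd deep-scrub <osd-id>   # or ceph osd scrub <osd-id>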
>>>>
>>>> Do you have any documentation on ceph-objectstore-tool? What I've found
>>>> online talks about filestore and not bluestore.
>>>>
>>>> On 6 April 2018 at 09:27, David Turner <[email protected]> wrote:
>>>>
>>>>> I'm running into this exact same situation.  I'm running 12.2.2 and I
>>>>> have an EC PG with a scrub error.  It has the same output for [1] rados
>>>>> list-inconsistent-obj as mentioned before.  This is the [2] full health
>>>>> detail.  This is the [3] excerpt from the log from the deep-scrub that
>>>>> marked the PG inconsistent.  The scrub happened while the PG was starting up
>>>>> after I used ceph-objectstore-tool to split its filestore subfolders, via a
>>>>> script I've used for months without any side effects.
>>>>>
>>>>> I have tried quite a few things to get this PG to deep-scrub or repair,
>>>>> but to no avail; it will not do anything.  I set osd_max_scrubs to 0 on
>>>>> every OSD in the cluster, waited for all scrubbing and deep-scrubbing to
>>>>> finish, then raised it to 1 on the 11 OSDs for this PG before issuing a
>>>>> deep-scrub.  The PG will sit there for over an hour without deep-scrubbing.
>>>>> My current test is to set osd_max_scrubs to 1 on all OSDs, raise it to 4 on
>>>>> the OSDs for this PG, and then issue the repair... but similarly nothing
>>>>> happens.  Each time I issue the deep-scrub or repair, the output correctly
>>>>> says 'instructing pg 145.2e3 on osd.234 to repair', but nothing shows up in
>>>>> the OSD's log and the PG state stays 'active+clean+inconsistent'.
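>>>>>
>>>>> For completeness, the sequence I'm describing looks roughly like this
>>>>> (osd.234 stands in for each OSD in the PG's acting set, and the settings
>>>>> get reverted afterwards):
>>>>>
>>>>> $ ceph tell osd.* injectargs '--osd_max_scrubs 0'    # stop new (deep-)scrubs cluster-wide
>>>>> $ ceph tell osd.234 injectargs '--osd_max_scrubs 1'  # repeated for each of the 11 OSDs in the PG
>>>>> $ ceph pg deep-scrub 145.2e3
>>>>> $ ceph pg repair 145.2e3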
>>>>>
>>>>> My next step, unless anyone has a better idea, is to find the exact copy
>>>>> of the PG that is missing the object, use ceph-objectstore-tool to back up
>>>>> that copy and remove it.  Starting the OSD back up should then backfill a
>>>>> full copy of the PG and return it to healthy.
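>>>>>
>>>>> If it comes to that, the rough shape of it would be the following, with
>>>>> the OSD stopped first (osd.234 and shard s0 are just examples, the data
>>>>> path assumes a standard deployment, and newer versions may require
>>>>> --force on the remove):
>>>>>
>>>>> $ systemctl stop ceph-osd@234
>>>>> $ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-234 \
>>>>>     --pgid 145.2e3s0 --op export --file /root/145.2e3s0.export
>>>>> $ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-234 \
>>>>>     --pgid 145.2e3s0 --op remove
>>>>> $ systemctl start ceph-osd@234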
>>>>>
>>>>>
>>>>>
>>>>> [1] $ rados list-inconsistent-obj 145.2e3
>>>>> No scrub information available for pg 145.2e3
>>>>> error 2: (2) No such file or directory
>>>>>
>>>>> [2] $ ceph health detail
>>>>> HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
>>>>> OSD_SCRUB_ERRORS 1 scrub errors
>>>>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>>>>>     pg 145.2e3 is active+clean+inconsistent, acting
>>>>> [234,132,33,331,278,217,55,358,79,3,24]
>>>>>
>>>>> [3] 2018-04-04 15:24:53.603380 7f54d1820700  0 log_channel(cluster)
>>>>> log [DBG] : 145.2e3 deep-scrub starts
>>>>> 2018-04-04 17:32:37.916853 7f54d1820700 -1 log_channel(cluster) log
>>>>> [ERR] : 145.2e3s0 deep-scrub 1 missing, 0 inconsistent objects
>>>>> 2018-04-04 17:32:37.916865 7f54d1820700 -1 log_channel(cluster) log
>>>>> [ERR] : 145.2e3 deep-scrub 1 errors
>>>>>
>>>>> On Mon, Apr 2, 2018 at 4:51 PM Michael Sudnick <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi Kjetil,
>>>>>>
>>>>>> I've tried to get the PG scrubbing/deep-scrubbing and nothing seems to be
>>>>>> happening; I've tried a few times over the last few days. My cluster is
>>>>>> recovering from a failed disk (which was probably the reason for the
>>>>>> inconsistency). Do I need to wait for the cluster to heal before a
>>>>>> repair/deep-scrub will work?
>>>>>>
>>>>>> -Michael
>>>>>>
>>>>>> On 2 April 2018 at 14:13, Kjetil Joergensen <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Scrub or deep-scrub the PG; that should, in theory, get
>>>>>>> list-inconsistent-obj to spit out what's wrong again. Then mail that
>>>>>>> info to the list.
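>>>>>>>
>>>>>>> i.e. something along the lines of:
>>>>>>>
>>>>>>> $ ceph pg deep-scrub 49.11c
>>>>>>> $ rados list-inconsistent-obj 49.11c --format=json-pretty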
>>>>>>>
>>>>>>> -KJ
>>>>>>>
>>>>>>> On Sun, Apr 1, 2018 at 9:17 AM, Michael Sudnick <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I have a small cluster with an inconsistent PG. I've tried ceph pg
>>>>>>>> repair multiple times with no luck. rados list-inconsistent-obj 49.11c
>>>>>>>> returns:
>>>>>>>>
>>>>>>>> # rados list-inconsistent-obj 49.11c
>>>>>>>> No scrub information available for pg 49.11c
>>>>>>>> error 2: (2) No such file or directory
>>>>>>>>
>>>>>>>> I'm a bit at a loss here as to what to do to recover. That PG is part
>>>>>>>> of a cephfs_data pool with compression set to force/snappy.
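>>>>>>>>
>>>>>>>> For reference, compression on that pool was enabled roughly like this:
>>>>>>>>
>>>>>>>> # ceph osd pool set cephfs_data compression_mode force
>>>>>>>> # ceph osd pool set cephfs_data compression_algorithm snappy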
>>>>>>>>
>>>>>>>> Does anyone have any suggestions?
>>>>>>>>
>>>>>>>> -Michael
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Kjetil Joergensen <[email protected]>
>>>>>>> SRE, Medallia Inc
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
