And to come full circle: after this whole saga, I now have a scrub error on the new device_health_metrics pool/PG, in what looks to be exactly the same way. I am at a loss as to what it is that I am doing incorrectly, and a scrub error obviously makes the monitoring suite very happy.
> $ ceph health detail
> OSD_SCRUB_ERRORS 1 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
> pg 33.0 is active+clean+inconsistent, acting [12,138,15]
> $ rados list-inconsistent-pg device_health_metrics
> ["33.0"]
> $ rados list-inconsistent-obj 33.0 | jq
> {
> "epoch": 176348,
> "inconsistents": []
> }
I assume that this is the root cause:
> ceph.log.5.gz:2019-09-18 11:12:16.466118 osd.138 (osd.138) 154 : cluster
> [WRN] bad locator @33 on object @33 op osd_op(client.1769585636.0:466 33.0
> 33:b08b92bd::::head [omap-set-vals] snapc 0=[]
> ondisk+write+known_if_redirected e176327) v8
> ceph.log.1.gz:2019-09-22 20:41:44.937841 osd.12 (osd.12) 53 : cluster [DBG]
> 33.0 scrub starts
> ceph.log.1.gz:2019-09-22 20:41:45.000638 osd.12 (osd.12) 54 : cluster [ERR]
> 33.0 scrub : stat mismatch, got 237/238 objects, 0/0 clones, 237/238 dirty,
> 237/238 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes, 0/0
> manifest objects, 0/0 hit_set_archive bytes.
> ceph.log.1.gz:2019-09-22 20:41:45.000643 osd.12 (osd.12) 55 : cluster [ERR]
> 33.0 scrub 1 errors
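To make it easier to see which counters disagree in that stat mismatch line, here is a minimal Python sketch that splits the line into its counter pairs. The function name `parse_stat_mismatch` is made up for illustration, and I am assuming (not certain) that the first number of each pair is what the scrub counted and the second is what the PG stats have recorded.

```python
import re

# Matches pairs such as "237/238 objects" or "0/0 hit_set_archive bytes".
PAIR = re.compile(r"(\d+)/(\d+) ([a-z_ ]+)")

def parse_stat_mismatch(line):
    """Return {counter: (first, second)} for a 'stat mismatch' scrub line.

    Assumption: 'first' is the value found by scrub and 'second' is the
    value recorded in the PG stats.
    """
    # Only look at the text after the marker, so timestamps are ignored.
    _, _, tail = line.partition("stat mismatch, got ")
    return {
        name.strip(): (int(first), int(second))
        for first, second, name in PAIR.findall(tail)
    }

line = (
    "33.0 scrub : stat mismatch, got 237/238 objects, 0/0 clones, "
    "237/238 dirty, 237/238 omap, 0/0 pinned, 0/0 hit_set_archive, "
    "0/0 whiteouts, 0/0 bytes, 0/0 manifest objects, "
    "0/0 hit_set_archive bytes."
)

counters = parse_stat_mismatch(line)
# Keep only the counters whose two values differ.
mismatched = {k: v for k, v in counters.items() if v[0] != v[1]}
for name, (first, second) in mismatched.items():
    print(f"{name}: {first} vs {second}")
```

On the line above, only the objects, dirty, and omap counters differ, each by one. That would fit with the empty list-inconsistent-obj output, since the disagreement is in the PG-level counts rather than in any single object.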
Nothing fancy set for the plugin:
> $ ceph config dump | grep device
> global basic device_failure_prediction_mode local
> mgr advanced mgr/devicehealth/enable_monitoring true
Reed
> On Sep 18, 2019, at 11:33 AM, Reed Dier <[email protected]> wrote:
>
> And to provide some further updates,
>
> I was able to get OSDs to boot by updating from 14.2.2 to 14.2.4.
> Unclear why this would improve things, but it at least got me running again.
>
>> $ ceph versions
>> {
>> "mon": {
>> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be)
>> nautilus (stable)": 3
>> },
>> "mgr": {
>> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be)
>> nautilus (stable)": 3
>> },
>> "osd": {
>> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be)
>> nautilus (stable)": 199,
>> "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba)
>> nautilus (stable)": 5
>> },
>> "mds": {
>> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be)
>> nautilus (stable)": 1
>> },
>> "overall": {
>> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be)
>> nautilus (stable)": 206,
>> "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba)
>> nautilus (stable)": 5
>> }
>> }
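As a side note on the `ceph versions` output above, a small Python sketch for spotting which daemon types are running more than one version. The helper name `mixed_version_daemons` is made up for illustration, and the sample is the output above with the wrapped lines rejoined; in practice the JSON would presumably be read from the command's output instead.

```python
import json

def mixed_version_daemons(versions):
    """Return {daemon_type: {version: count}} for types with >1 version."""
    return {
        daemon: by_version
        for daemon, by_version in versions.items()
        # "overall" is a summary across all daemon types, so skip it.
        if daemon != "overall" and len(by_version) > 1
    }

V2 = "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)"
V4 = "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)"

# Same structure as the `ceph versions` output quoted above.
sample = json.loads(json.dumps({
    "mon": {V2: 3},
    "mgr": {V2: 3},
    "osd": {V2: 199, V4: 5},
    "mds": {V2: 1},
    "overall": {V2: 206, V4: 5},
}))

mixed = mixed_version_daemons(sample)
for daemon, by_version in mixed.items():
    for version, count in by_version.items():
        print(f"{daemon}: {count} x {version}")
```

Here only the OSDs are mixed, which matches the five OSDs that were updated to 14.2.4 to get them booting again.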
>
>
> Reed
>
>> On Sep 18, 2019, at 10:12 AM, Reed Dier <[email protected]> wrote:
>>
>> To answer the question of whether it is safe to disable the module and
>> delete the pool: the answer is no.
>>
>> After disabling the diskprediction_local module, I then proceeded to remove
>> the pool created by the module, device_health_metrics.
>>
>> This is where things went south quickly.
>>
>> Ceph health showed:
>>> Module 'devicehealth' has failed: [errno 2] Failed to operate write op for
>>> oid SAMSUNG_$MODEL_$SERIAL
>>
>> That module apparently can't be disabled:
>>> $ ceph mgr module disable devicehealth
>>> Error EINVAL: module 'devicehealth' cannot be disabled (always-on)
>>
>> Then 5 OSDs went down, crashing with:
>>> -12> 2019-09-18 10:53:00.299 7f95940ac700 5 osd.5 pg_epoch: 176304
>>> pg[17.3d1( v 176297'568491 lc 176269'568471 (175914'565388,176297'568491]
>>> local-lis/les=176302/176303 n=107092 ec=11397/11397 lis/c 176302/172990
>>> les/c/f 176303/172991/107766 176304/176304/176304) [5,81,162] r=0
>>> lpr=176304 pi=[172990,176304)/1 crt=176297'568491 lcod 0'0 mlcod 0'0
>>> peering m=17 mbc={}] enter Started/Primary/Peering/WaitUpThru
>>> -11> 2019-09-18 10:53:00.303 7f959fd6f700 2 osd.5 176304
>>> ms_handle_reset con 0x564078474d00 session 0x56407878ea00
>>> -10> 2019-09-18 10:53:00.303 7f95b10e6700 10 monclient:
>>> handle_auth_request added challenge on 0x564077ac1b00
>>> -9> 2019-09-18 10:53:00.307 7f95b10e6700 10 monclient:
>>> handle_auth_request added challenge on 0x564077ac3180
>>> -8> 2019-09-18 10:53:00.307 7f95b10e6700 10 monclient:
>>> handle_auth_request added challenge on 0x564077ac3600
>>> -7> 2019-09-18 10:53:00.307 7f95950ae700 -1
>>> bluestore(/var/lib/ceph/osd/ceph-5) _txc_add_transaction error (39)
>>> Directory not empty not handled on operation 21 (op 1, counting from 0)
>>> -6> 2019-09-18 10:53:00.307 7f95950ae700 0 _dump_transaction
>>> transaction dump:
>>> {
>>> "ops": [
>>> {
>>> "op_num": 0,
>>> "op_name": "remove",
>>> "collection": "30.0_head",
>>> "oid": "#30:00000000::::head#"
>>> },
>>> {
>>> "op_num": 1,
>>> "op_name": "rmcoll",
>>> "collection": "30.0_head"
>>> }
>>> ]
>>> }
>>> -5> 2019-09-18 10:53:00.311 7f95948ad700 5 osd.5 pg_epoch: 176304
>>> pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919]
>>> local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285
>>> les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0
>>> lpr=176304 pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0
>>> peering m=32 mbc={}] exit Started/Primary/Peering/GetLog 0.023847 2 0.000123
>>> -4> 2019-09-18 10:53:00.311 7f95948ad700 5 osd.5 pg_epoch: 176304
>>> pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919]
>>> local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285
>>> les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0
>>> lpr=176304 pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0
>>> peering m=32 mbc={}] enter Started/Primary/Peering/GetMissing
>>> -3> 2019-09-18 10:53:00.311 7f95948ad700 5 osd.5 pg_epoch: 176304
>>> pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919]
>>> local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285
>>> les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0
>>> lpr=176304 pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0
>>> peering m=32 mbc={}] exit Started/Primary/Peering/GetMissing 0.000019 0
>>> 0.000000
>>> -2> 2019-09-18 10:53:00.311 7f95948ad700 5 osd.5 pg_epoch: 176304
>>> pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919]
>>> local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285
>>> les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0
>>> lpr=176304 pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0
>>> peering m=32 mbc={}] enter Started/Primary/Peering/WaitUpThru
>>> -1> 2019-09-18 10:53:00.315 7f95950ae700 -1
>>> /build/ceph-14.2.2/src/os/bluestore/BlueStore.cc: In
>>> function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*,
>>> ObjectStore::Transaction*)' thread 7f95950ae700 time 2019-09-18
>>> 10:53:00.312755
>>> /build/ceph-14.2.2/src/os/bluestore/BlueStore.cc:
>>> 11208: ceph_abort_msg("unexpected error")
>>
>>
>> Of the 5 OSDs now down, 3 are the acting OSDs for PG 30.0 (which has now
>> been erased):
>>
>>> OSD_DOWN 5 osds down
>>> osd.5 is down
>>> osd.12 is down
>>> osd.128 is down
>>> osd.183 is down
>>> osd.190 is down
>>
>>
>> But 190 and 5 were never acting members for that PG, so I have no clue why
>> they are implicated.
>>
>>
>> I re-enabled the module, which cleared the health error about
>> devicehealth (not that it matters to me), but that did not solve the
>> issue of the down OSDs. I am hoping there is a way to mark this PG as
>> lost, or something like that, so as not to have to rebuild the entire OSD.
>>
>> Any help is appreciated.
>>
>> Reed
>>
>>> On Sep 12, 2019, at 5:22 PM, Reed Dier <[email protected]> wrote:
>>>
>>> Trying to narrow down a strange issue where the single PG for the
>>> device_health_metrics pool, which was created when I enabled the
>>> 'diskprediction_local' module in the ceph-mgr, is reported as
>>> inconsistent, but I never see any inconsistent objects in the PG.
>>>
>>>> $ ceph health detail
>>>> OSD_SCRUB_ERRORS 1 scrub errors
>>>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>>>> pg 30.0 is active+clean+inconsistent, acting [128,12,183]
>>>
>>>> $ rados list-inconsistent-pg device_health_metrics
>>>> ["30.0"]
>>>
>>>> $ rados list-inconsistent-obj 30.0 | jq
>>>> {
>>>> "epoch": 172979,
>>>> "inconsistents": []
>>>> }
>>>
>>> This is the most recent log message from osd.128, from the last deep
>>> scrub:
>>>> 2019-09-12 18:07:19.436 7f977744a700 -1 log_channel(cluster) log [ERR] :
>>>> 30.0 deep-scrub : stat mismatch, got 237/238 objects, 0/0 clones, 237/238
>>>> dirty, 237/238 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 0/0
>>>> bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
>>>
>>> Here is a pg query on the one PG:
>>> https://pastebin.com/bnzVKd6t
>>>
>>> The data I have collected hasn't been useful at all, and I don't
>>> particularly care if I lose it, so would it be feasible (i.e. no bad
>>> effects) to just disable the disk prediction module, delete the pool, and
>>> then start over, letting it create a new pool for itself?
>>>
>>> Thanks,
>>>
>>> Reed
>>
>>
>
_______________________________________________ ceph-users mailing list [email protected] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
