Re: [ceph-users] Local Device Health PG inconsistent

Reed Dier Wed, 18 Sep 2019 08:12:38 -0700

To answer my own question of whether it is safe to disable the module and 
delete the pool: the answer is no.


After disabling the diskprediction_local module, I then proceeded to remove the 
pool created by the module, device_health_metrics.

This is where things went south quickly.

Ceph health showed: 
> Module 'devicehealth' has failed: [errno 2] Failed to operate write op for 
> oid SAMSUNG_$MODEL_$SERIAL

That module apparently can't be disabled:
> $ ceph mgr module disable devicehealth
> Error EINVAL: module 'devicehealth' cannot be disabled (always-on)

Then 5 OSDs went down, crashing with:
>    -12> 2019-09-18 10:53:00.299 7f95940ac700  5 osd.5 pg_epoch: 176304 
> pg[17.3d1( v 176297'568491 lc 176269'568471 (175914'565388,176297'568491] 
> local-lis/les=176302/176303 n=107092 ec=11397/11397 lis/c 176302/172990 
> les/c/f 176303/172991/107766 176304/176304/176304) [5,81,162] r=0 lpr=176304 
> pi=[172990,176304)/1 crt=176297'568491 lcod 0'0 mlcod 0'0 peering m=17 
> mbc={}] enter Started/Primary/Peering/WaitUpThru
>    -11> 2019-09-18 10:53:00.303 7f959fd6f700  2 osd.5 176304 ms_handle_reset 
> con 0x564078474d00 session 0x56407878ea00
>    -10> 2019-09-18 10:53:00.303 7f95b10e6700 10 monclient: 
> handle_auth_request added challenge on 0x564077ac1b00
>     -9> 2019-09-18 10:53:00.307 7f95b10e6700 10 monclient: 
> handle_auth_request added challenge on 0x564077ac3180
>     -8> 2019-09-18 10:53:00.307 7f95b10e6700 10 monclient: 
> handle_auth_request added challenge on 0x564077ac3600
>     -7> 2019-09-18 10:53:00.307 7f95950ae700 -1 
> bluestore(/var/lib/ceph/osd/ceph-5) _txc_add_transaction error (39) Directory 
> not empty not handled on operation 21 (op 1, counting from 0)
>     -6> 2019-09-18 10:53:00.307 7f95950ae700  0 _dump_transaction transaction 
> dump:
> {
>     "ops": [
>         {
>             "op_num": 0,
>             "op_name": "remove",
>             "collection": "30.0_head",
>             "oid": "#30:00000000::::head#"
>         },
>         {
>             "op_num": 1,
>             "op_name": "rmcoll",
>             "collection": "30.0_head"
>         }
>     ]
> }
>     -5> 2019-09-18 10:53:00.311 7f95948ad700  5 osd.5 pg_epoch: 176304 
> pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] 
> local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 
> les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 lpr=176304 
> pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering m=32 
> mbc={}] exit Started/Primary/Peering/GetLog 0.023847 2 0.000123
>     -4> 2019-09-18 10:53:00.311 7f95948ad700  5 osd.5 pg_epoch: 176304 
> pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] 
> local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 
> les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 lpr=176304 
> pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering m=32 
> mbc={}] enter Started/Primary/Peering/GetMissing
>     -3> 2019-09-18 10:53:00.311 7f95948ad700  5 osd.5 pg_epoch: 176304 
> pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] 
> local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 
> les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 lpr=176304 
> pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering m=32 
> mbc={}] exit Started/Primary/Peering/GetMissing 0.000019 0 0.000000
>     -2> 2019-09-18 10:53:00.311 7f95948ad700  5 osd.5 pg_epoch: 176304 
> pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] 
> local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 
> les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 lpr=176304 
> pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering m=32 
> mbc={}] enter Started/Primary/Peering/WaitUpThru
>     -1> 2019-09-18 10:53:00.315 7f95950ae700 -1 
> /build/ceph-14.2.2/src/os/bluestore/BlueStore.cc: In function 'void 
> BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
> ObjectStore::Transaction*)' thread 7f95950ae700 time 2019-09-18 
> 10:53:00.312755
> /build/ceph-14.2.2/src/os/bluestore/BlueStore.cc: 11208: 
> ceph_abort_msg("unexpected error")
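
For anyone reading the dump above, here is a minimal Python sketch that pulls the PG id out of each op in the transaction. The JSON is copied from the osd.5 log; the helper names (pg_of_collection, summarize) are made up for illustration, and treating "<pgid>_head" as the collection naming scheme is an assumption based only on what this log shows.

```python
import json

# Transaction dump copied from the osd.5 crash log above.
DUMP = """
{
    "ops": [
        {"op_num": 0, "op_name": "remove",
         "collection": "30.0_head", "oid": "#30:00000000::::head#"},
        {"op_num": 1, "op_name": "rmcoll",
         "collection": "30.0_head"}
    ]
}
"""

def pg_of_collection(collection):
    """Strip the '_head' suffix (assumed naming scheme) to get the PG id."""
    suffix = "_head"
    if collection.endswith(suffix):
        return collection[:-len(suffix)]
    return collection

def summarize(dump_text):
    """Return (op_num, op_name, pgid) for every op in the dump."""
    ops = json.loads(dump_text)["ops"]
    return [(op["op_num"], op["op_name"], pg_of_collection(op["collection"]))
            for op in ops]

summary = summarize(DUMP)
for num, name, pgid in summary:
    print(f"op {num}: {name} on pg {pgid}")

# The log reports the failure on "op 1, counting from 0".
failed_op = summary[1]
pool_id = failed_op[2].split(".")[0]
```

Both ops target pg 30.0, the single PG of the device_health_metrics pool, and the op that failed with "Directory not empty" is the rmcoll.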


Of the 5 OSDs now down, 3 are the acting OSDs for pg 30.0 (which has now 
been erased):

> OSD_DOWN 5 osds down
>     osd.5 is down
>     osd.12 is down
>     osd.128 is down
>     osd.183 is down
>     osd.190 is down


But 190 and 5 were never acting members for that PG, so I have no clue why they 
are implicated.
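
The comparison behind that statement, as a small Python sketch. The OSD ids come from the OSD_DOWN list and from the acting set [128,12,183] reported by ceph health detail in the earlier message; the variable names are only illustrative.

```python
# OSDs reported down by OSD_DOWN.
down_osds = {5, 12, 128, 183, 190}

# Acting set of pg 30.0 from the earlier `ceph health detail` output.
acting_pg_30_0 = {128, 12, 183}

# Down OSDs that were serving the deleted PG.
expected = sorted(down_osds & acting_pg_30_0)

# Down OSDs that were not in the acting set at that time.
unexplained = sorted(down_osds - acting_pg_30_0)

print("in acting set:", expected)
print("not in acting set:", unexplained)
```

The osd.5 log above does show it removing collection 30.0_head, so one guess (not confirmed anywhere in this thread) is that it still held a stale copy of the PG from an earlier mapping.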


I re-enabled the module, which cleared the health error about devicehealth 
(not that it matters to me), but it did not bring the down OSDs back up. I am 
hoping there is a way to mark this PG as lost, or something like that, so that 
I don't have to rebuild the entire OSD.

Any help is appreciated.

Reed

> On Sep 12, 2019, at 5:22 PM, Reed Dier <[email protected]> wrote:
> 
> Trying to narrow down a strange issue where the single PG for the 
> device_health_metrics pool, which was created when I enabled the 
> 'diskprediction_local' module in the ceph-mgr, is flagged as inconsistent, 
> yet I never see any inconsistent objects in the PG.
> 
>> $ ceph health detail
>> OSD_SCRUB_ERRORS 1 scrub errors
>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>>     pg 30.0 is active+clean+inconsistent, acting [128,12,183]
> 
>> $ rados list-inconsistent-pg device_health_metrics
>> ["30.0"]
> 
>> $ rados list-inconsistent-obj 30.0 | jq
>> {
>>   "epoch": 172979,
>>   "inconsistents": []
>> }
> 
> This is the most recent log message from osd.128, from the last deep scrub:
>> 2019-09-12 18:07:19.436 7f977744a700 -1 log_channel(cluster) log [ERR] : 
>> 30.0 deep-scrub : stat mismatch, got 237/238 objects, 0/0 clones, 237/238 
>> dirty, 237/238 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 
>> bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
> 
> Here is a pg query on the one PG:
> https://pastebin.com/bnzVKd6t
> 
> The data I have collected hasn't been useful at all, and I don't particularly 
> care if I lose it, so would it be feasible (i.e. no bad effects) to just 
> disable the disk prediction module, delete the pool, and start over, letting 
> it create a new pool for itself?
> 
> Thanks,
> 
> Reed
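
The deep-scrub line from osd.128 quoted above can be picked apart with a short Python sketch to see which counters disagree. The regular expression and the function name are illustrative. Reading each pair as "counted by scrub / recorded in PG stats" is an assumption about the log format, not something stated in this thread.

```python
import re

# Deep-scrub message copied from the osd.128 log quoted above.
LOG = ("30.0 deep-scrub : stat mismatch, got 237/238 objects, 0/0 clones, "
       "237/238 dirty, 237/238 omap, 0/0 pinned, 0/0 hit_set_archive, "
       "0/0 whiteouts, 0/0 bytes, 0/0 manifest objects, "
       "0/0 hit_set_archive bytes.")

# Matches "<a>/<b> <counter name>" up to the next comma or period.
PAIR = re.compile(r"(\d+)/(\d+) ([a-z_ ]+?)(?=[,.])")

def mismatched_counters(line):
    """Return {counter: (first, second)} for every pair whose halves differ."""
    result = {}
    for first, second, name in PAIR.findall(line):
        if first != second:
            result[name] = (int(first), int(second))
    return result

mismatches = mismatched_counters(LOG)
for name, (first, second) in mismatches.items():
    print(f"{name}: {first} vs {second}")
```

Only the objects, dirty and omap counters differ, each by one. If the mismatch is purely in the PG-level counts rather than in any individual object, that might be why list-inconsistent-obj returns an empty list, though that is a guess.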
