Ronny,
Could you help me with this log?  I got it with debug osd=20, filestore=20, 
and ms=20 while running "ceph pg repair 2.7".  This is one of the smaller 
PGs, so the log is smaller; the others show similar errors.  I can see the 
lines with ERR, but beyond that, is there something I should be paying 
attention to?
https://drive.google.com/file/d/0By7YztAJNGUWNkpCV090dHBmOWc/view?usp=sharing
The error messages look like this:

2017-09-21 23:53:31.545510 7f51682df700 -1 log_channel(cluster) log [ERR] : 2.7 shard 2: soid 2:e17dbaf6:::rb.0.145d.2ae8944a.0000000000bb:head data_digest 0x62b74a1f != data_digest 0x43d61c5d from auth oi 2:e17dbaf6:::rb.0.145d.2ae8944a.0000000000bb:head(12962'694 osd.2.0:90545 dirty|data_digest|omap_digest s 4194304 uv 484 dd 43d61c5d od ffffffff alloc_hint [0 0])
2017-09-21 23:53:31.545520 7f51682df700 -1 log_channel(cluster) log [ERR] : 2.7 shard 7: soid 2:e17dbaf6:::rb.0.145d.2ae8944a.0000000000bb:head data_digest 0x62b74a1f != data_digest 0x43d61c5d from auth oi 2:e17dbaf6:::rb.0.145d.2ae8944a.0000000000bb:head(12962'694 osd.2.0:90545 dirty|data_digest|omap_digest s 4194304 uv 484 dd 43d61c5d od ffffffff alloc_hint [0 0])
2017-09-21 23:53:31.545531 7f51682df700 -1 log_channel(cluster) log [ERR] : 2.7 soid 2:e17dbaf6:::rb.0.145d.2ae8944a.0000000000bb:head: failed to pick suitable auth object
I did try to move that object to a different location, as suggested on this 
page: http://ceph.com/geen-categorie/ceph-manually-repair-object/

This is what I ran:

systemctl stop ceph-osd@7
ceph-osd -i 7 --flush-journal
cd /var/lib/ceph/osd/ceph-7
cd current/2.7_head/
mv rb.0.145d.2ae8944a.0000000000bb__head_6F5DBE87__2 ~/
ceph osd tree
systemctl start ceph-osd@7
ceph pg repair 2.7
Then I just get this:

2017-09-22 00:41:06.495399 7f22ac3bd700 -1 log_channel(cluster) log [ERR] : 2.7 shard 2: soid 2:e17dbaf6:::rb.0.145d.2ae8944a.0000000000bb:head data_digest 0x62b74a1f != data_digest 0x43d61c5d from auth oi 2:e17dbaf6:::rb.0.145d.2ae8944a.0000000000bb:head(12962'694 osd.2.0:90545 dirty|data_digest|omap_digest s 4194304 uv 484 dd 43d61c5d od ffffffff alloc_hint [0 0])
2017-09-22 00:41:06.495417 7f22ac3bd700 -1 log_channel(cluster) log [ERR] : 2.7 shard 7 missing 2:e17dbaf6:::rb.0.145d.2ae8944a.0000000000bb:head
2017-09-22 00:41:06.495424 7f22ac3bd700 -1 log_channel(cluster) log [ERR] : 2.7 soid 2:e17dbaf6:::rb.0.145d.2ae8944a.0000000000bb:head: failed to pick suitable auth object
Moving the object from osd.2 gives a similar error message; it just reports 
the shard as missing on that one instead. =P

I was hoping this attempt would give a different result, since I let one more 
OSD make a copy from OSD1 by taking osd.7 down with noout set.  But the repair 
doesn't appear to care about that extra copy.  Maybe that only works when size 
is 3?  Basically, since most of my surviving OSDs are on host OSD1, I was 
trying to favor the data from OSD1. =P
What can I do in this case?  According to 
http://ceph.com/geen-categorie/incomplete-pgs-oh-my/, inconsistent data can be 
expected after an export with skip-journal-replay, and I had to use that 
option because the export crashed without it. =P  But the page doesn't say 
much about what to do in that case:

"If all went well, then your cluster is now back to 100% active+clean / 
HEALTH_OK state.  Note that you may still have inconsistent or stale data 
stored inside the PG.  This is because the state of the data on the OSD that 
failed is a bit unknown, especially if you had to use the 
'--skip-journal-replay' option on the export.  For RBD data, the client which 
utilizes the RBD should run a filesystem check against the RBD."

Regards,
Hong

    On Thursday, September 21, 2017 1:46 AM, Ronny Aasen 
<[email protected]> wrote:
 

 On 21. sep. 2017 00:35, hjcho616 wrote:
> # rados list-inconsistent-pg data
> ["0.0","0.5","0.a","0.e","0.1c","0.29","0.2c"]
> # rados list-inconsistent-pg metadata
> ["1.d","1.3d"]
> # rados list-inconsistent-pg rbd
> ["2.7"]
> # rados list-inconsistent-obj 0.0 --format=json-pretty
> {
>      "epoch": 23112,
>      "inconsistents": []
> }
> # rados list-inconsistent-obj 0.5 --format=json-pretty
> {
>      "epoch": 23078,
>      "inconsistents": []
> }
> # rados list-inconsistent-obj 0.a --format=json-pretty
> {
>      "epoch": 22954,
>      "inconsistents": []
> }
> # rados list-inconsistent-obj 0.e --format=json-pretty
> {
>      "epoch": 23068,
>      "inconsistents": []
> }
> # rados list-inconsistent-obj 0.1c --format=json-pretty
> {
>      "epoch": 22954,
>      "inconsistents": []
> }
> # rados list-inconsistent-obj 0.29 --format=json-pretty
> {
>      "epoch": 22974,
>      "inconsistents": []
> }
> # rados list-inconsistent-obj 0.2c --format=json-pretty
> {
>      "epoch": 23194,
>      "inconsistents": []
> }
> # rados list-inconsistent-obj 1.d --format=json-pretty
> {
>      "epoch": 23072,
>      "inconsistents": []
> }
> # rados list-inconsistent-obj 1.3d --format=json-pretty
> {
>      "epoch": 23221,
>      "inconsistents": []
> }
> # rados list-inconsistent-obj 2.7 --format=json-pretty
> {
>      "epoch": 23032,
>      "inconsistents": []
> }
> 
> Looks like not much information is there.  Could you elaborate on the 
> items you mentioned for finding the object?  How do I check the metadata? 
> What are we looking for with md5sum?
> 
> - find the object  :: manually check the objects, check the object 
> metadata, run md5sum on them all and compare. check objects on the 
> nonrunning osd's and compare there as well. anything to try to determine 
> what object is ok and what is bad.
> 
> I tried the methods from Ceph: manually repair object 
> <http://ceph.com/geen-categorie/ceph-manually-repair-object/> on 
> PG 2.7 before.  With 3 replicas it results in a missing shard, 
> regardless of which copy I moved.  With 2 replicas, hmm... I guess I 
> don't know how long "wait a bit" is supposed to be; I just turned the 
> OSD back on after a minute or so, and it returns to the same 
> inconsistent message. =P  Are we waiting for the entire stopped OSD to 
> be remapped to a different OSD, so there are 3 replicas again before 
> starting the stopped OSD?
> 
> Regards,
> Hong


since your list-inconsistent-obj output is empty, you need to raise debugging 
on all osd's and grep the logs to find the objects with issues. this is 
explained in the link.  "ceph pg map [pg]" tells you which osd's to look 
at, and the log will have hints about the reason for the error. keep in 
mind that it may have been a while since the scrub errored out, so you may 
need to look at older logs, or trigger a new scrub and wait for it to finish 
so you can check the current log.
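for example, the debug bump and log search might look like this (a sketch: it assumes the acting set from the ERR lines above, osd.2 and osd.7, and the default log path; adjust both for your cluster):

```shell
# See which OSDs actually hold the PG:
ceph pg map 2.7

# Raise OSD debug level on the acting set at runtime
# (injectargs changes are not persistent; they revert on restart):
ceph tell osd.2 injectargs '--debug-osd 20'
ceph tell osd.7 injectargs '--debug-osd 20'

# Trigger a fresh deep scrub so the errors land in the current log:
ceph pg deep-scrub 2.7

# Then, on each OSD host, look for the failing objects around the ERR lines:
grep -B2 -A2 'ERR' /var/log/ceph/ceph-osd.2.log | less
```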

once you have the object names, you can locate the files on disk with the 
find command.
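something along these lines (the object name is taken from the ERR lines above; the filestore path is the stock layout, so adjust it if your data dirs live elsewhere — run this on each OSD host):

```shell
# Locate every on-disk copy of the suspect object:
find /var/lib/ceph/osd/ceph-*/current/2.7_head/ \
     -name 'rb.0.145d.2ae8944a.0000000000bb*' 2>/dev/null

# Checksum each copy; a replica whose md5 differs from the others is the
# suspect one.  (Note: the "dd 43d61c5d" in the log is a crc32c data_digest,
# not an md5 -- md5sum is only for comparing the replicas against each other.)
find /var/lib/ceph/osd/ceph-*/current/2.7_head/ \
     -name 'rb.0.145d.2ae8944a.0000000000bb*' -exec md5sum {} + 2>/dev/null
```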

after removing/fixing the broken object and restarting the osd, you issue 
the repair, and wait for the repair and scrub of that pg to finish. you 
can probably follow along by tailing the log.
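for instance (assuming osd.2 is the primary for the PG and the default log path — both are guesses to adapt):

```shell
# Kick off the repair, then watch the primary's log while it runs:
ceph pg repair 2.7
tail -f /var/log/ceph/ceph-osd.2.log | grep --line-buffered '2\.7'
```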

good luck


   
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
