Hi list,
I woke up this morning to two PGs reporting scrub errors, in a way that I
haven't seen before.
> $ ceph versions
> {
> "mon": {
> "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic
> (stable)": 3
> },
> "mgr": {
> "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic
> (stable)": 3
> },
> "osd": {
> "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic
> (stable)": 156
> },
> "mds": {
> "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic
> (stable)": 2
> },
> "overall": {
> "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic
> (stable)": 156,
> "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic
> (stable)": 8
> }
> }
And `ceph health detail` shows:
> OSD_SCRUB_ERRORS 8 scrub errors
> PG_DAMAGED Possible data damage: 2 pgs inconsistent
> pg 17.72 is active+clean+inconsistent, acting [3,7,153]
> pg 17.2b9 is active+clean+inconsistent, acting [19,7,16]
Here is what `rados list-inconsistent-obj 17.2b9 --format=json-pretty` yields:
> {
> "epoch": 134582,
> "inconsistents": [
> {
> "object": {
> "name": "10008536718.00000000",
> "nspace": "",
> "locator": "",
> "snap": "head",
> "version": 0
> },
> "errors": [],
> "union_shard_errors": [
> "obj_size_info_mismatch"
> ],
> "shards": [
> {
> "osd": 7,
> "primary": false,
> "errors": [
> "obj_size_info_mismatch"
> ],
> "size": 5883,
> "object_info": {
> "oid": {
> "oid": "10008536718.00000000",
> "key": "",
> "snapid": -2,
> "hash": 1752643257,
> "max": 0,
> "pool": 17,
> "namespace": ""
> },
> "version": "134599'448331",
> "prior_version": "134599'448330",
> "last_reqid": "client.1580931080.0:671854",
> "user_version": 448331,
> "size": 3505,
> "mtime": "2019-04-28 15:32:20.003519",
> "local_mtime": "2019-04-28 15:32:25.991015",
> "lost": 0,
> "flags": [
> "dirty",
> "data_digest",
> "omap_digest"
> ],
> "truncate_seq": 899,
> "truncate_size": 0,
> "data_digest": "0xf99a3bd3",
> "omap_digest": "0xffffffff",
> "expected_object_size": 0,
> "expected_write_size": 0,
> "alloc_hint_flags": 0,
> "manifest": {
> "type": 0
> },
> "watchers": {}
> }
> },
> {
> "osd": 16,
> "primary": false,
> "errors": [
> "obj_size_info_mismatch"
> ],
> "size": 5883,
> "object_info": {
> "oid": {
> "oid": "10008536718.00000000",
> "key": "",
> "snapid": -2,
> "hash": 1752643257,
> "max": 0,
> "pool": 17,
> "namespace": ""
> },
> "version": "134599'448331",
> "prior_version": "134599'448330",
> "last_reqid": "client.1580931080.0:671854",
> "user_version": 448331,
> "size": 3505,
> "mtime": "2019-04-28 15:32:20.003519",
> "local_mtime": "2019-04-28 15:32:25.991015",
> "lost": 0,
> "flags": [
> "dirty",
> "data_digest",
> "omap_digest"
> ],
> "truncate_seq": 899,
> "truncate_size": 0,
> "data_digest": "0xf99a3bd3",
> "omap_digest": "0xffffffff",
> "expected_object_size": 0,
> "expected_write_size": 0,
> "alloc_hint_flags": 0,
> "manifest": {
> "type": 0
> },
> "watchers": {}
> }
> },
> {
> "osd": 19,
> "primary": true,
> "errors": [
> "obj_size_info_mismatch"
> ],
> "size": 5883,
> "object_info": {
> "oid": {
> "oid": "10008536718.00000000",
> "key": "",
> "snapid": -2,
> "hash": 1752643257,
> "max": 0,
> "pool": 17,
> "namespace": ""
> },
> "version": "134599'448331",
> "prior_version": "134599'448330",
> "last_reqid": "client.1580931080.0:671854",
> "user_version": 448331,
> "size": 3505,
> "mtime": "2019-04-28 15:32:20.003519",
> "local_mtime": "2019-04-28 15:32:25.991015",
> "lost": 0,
> "flags": [
> "dirty",
> "data_digest",
> "omap_digest"
> ],
> "truncate_seq": 899,
> "truncate_size": 0,
> "data_digest": "0xf99a3bd3",
> "omap_digest": "0xffffffff",
> "expected_object_size": 0,
> "expected_write_size": 0,
> "alloc_hint_flags": 0,
> "manifest": {
> "type": 0
> },
> "watchers": {}
> }
> }
> ]
> }
> ]
> }
To snip that down to the parts that appear to matter:
> "errors": [],
> "union_shard_errors": [
> "obj_size_info_mismatch"
> ],
> "shards": [
> {
> "errors": [
> "obj_size_info_mismatch"
> ],
> "size": 5883,
> "object_info": {
> "size": 3505, }
It looks like the size info does, in fact, mismatch: the on-disk size (5883) differs from the object_info size (3505), and it does so on all three shards.
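(Side note: assuming jq is available, a one-liner like this pulls those same fields out of the full JSON; the field selection is my own, not anything official:)
> $ rados list-inconsistent-obj 17.2b9 --format=json | jq '.inconsistents[] | {object: .object.name, union_shard_errors, shards: [.shards[] | {osd, size, info_size: .object_info.size}]}'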
So I attempted a deep-scrub again, and the issue persists across both PGs.
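For reference, I re-triggered the scrubs by hand with:
> $ ceph pg deep-scrub 17.72
> $ ceph pg deep-scrub 17.2b9
and the log on the primary OSD for 17.2b9 then shows: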
> 2019-04-29 09:08:27.729 7fe4f5bee700 0 log_channel(cluster) log [DBG] : 17.2b9 deep-scrub starts
> 2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 shard 19 soid 17:9d6cee16:::10008536718.00000000:head : candidate size 5883 info size 3505 mismatch
> 2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 shard 7 soid 17:9d6cee16:::10008536718.00000000:head : candidate size 5883 info size 3505 mismatch
> 2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 shard 16 soid 17:9d6cee16:::10008536718.00000000:head : candidate size 5883 info size 3505 mismatch
> 2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 soid 17:9d6cee16:::10008536718.00000000:head : failed to pick suitable object info
> 2019-04-29 09:22:53.363 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : deep-scrub 17.2b9 17:9d6cee16:::10008536718.00000000:head : on disk size (5883) does not match object info size (3505) adjusted for ondisk to (3505)
> 2019-04-29 09:27:46.840 7fe4f5bee700 -1 log_channel(cluster) log [ERR] : 17.2b9 deep-scrub 4 errors
Pool 17 is a cephfs data pool, if that makes any difference.
And the two MDSs listed in versions are active/standby, not active/active.
My question is whether I should run `ceph pg repair <pgid>` to try to fix these objects, or take another approach, since the object size mismatch appears to persist across all 3 copies of each PG.
I know that `ceph pg repair` can be dangerous in certain circumstances, so I want to feel confident in the operation before undertaking it.
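Concretely, the repair I'm weighing would be nothing more than:
> $ ceph pg repair 17.72
> $ ceph pg repair 17.2b9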
I did look at all of the underlying disks for these PGs for issues or errors, and nothing bubbled to the top, so I don't believe this is a hardware issue.
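(Those checks were along the lines of SMART health and kernel logs on each OSD host, e.g.:
> $ smartctl -a /dev/sdX
> $ dmesg -T | grep -iE 'error|fail|sector'
with /dev/sdX standing in for each OSD's backing device.)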
Appreciate any help.
Thanks,
Reed