Hi Frank,

I am encountering exactly the same issue with the same disks as yours. Every day, after a batch of deep scrubbing operations, there are generally between 1 and 3 inconsistent pgs, and on different OSDs each time.

This could point to a problem with these disks, but:

- it only affects pgs of the rbd pool, not those of the cephfs pools (which use the same disk model)

- I hit this while running 12.2.5, not after upgrading to 12.2.8, but the problem reappeared after the upgrade to 12.2.10

- On my side, smartctl and dmesg do not show any media error, so I'm pretty sure the physical media is not at fault...
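
For reference, the checks I run on the node hosting the suspect OSD look roughly like this (just a sketch; /dev/sdX is a placeholder for the data disk behind the OSD):

$ sudo smartctl -a /dev/sdX | grep -i -E 'error|reallocated|pending'
$ sudo dmesg -T | grep -i -E 'medium error|blk_update_request'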

One detail: each disk is configured as RAID0 on a PERC740P; is this also the case for you, or are your disks in JBOD mode?

Another question: in your case, is the OSD involved in the inconsistent pgs always the same one, or a different one every time?

For information, running 'ceph pg repair' manually has worked fine each time so far...

Context: Luminous 12.2.10, Bluestore OSDs with data on SATA disks and WAL/DB on NVMe, rbd pool replicated 3/2

Cheers,
rv

A few outputs:

$ sudo ceph -s
  cluster:
    id:     838506b7-e0c6-4022-9e17-2d1cf9458be6
    health: HEALTH_ERR
            3 scrub errors
            Possible data damage: 3 pgs inconsistent

  services:
    mon: 3 daemons, quorum inf-ceph-mon0,inf-ceph-mon1,inf-ceph-mon2
    mgr: inf-ceph-mon0(active), standbys: inf-ceph-mon1, inf-ceph-mon2
    mds: cephfs_home-2/2/2 up {0=inf-ceph-mon1=up:active,1=inf-ceph-mon0=up:active}, 1 up:standby
    osd: 126 osds: 126 up, 126 in

  data:
    pools:   3 pools, 4224 pgs
    objects: 23.35M objects, 20.9TiB
    usage:   64.9TiB used, 136TiB / 201TiB avail
    pgs:     4221 active+clean
             3    active+clean+inconsistent

  io:
    client:   2.62KiB/s rd, 2.25MiB/s wr, 0op/s rd, 118op/s wr

$ sudo ceph health detail
HEALTH_ERR 3 scrub errors; Possible data damage: 3 pgs inconsistent
OSD_SCRUB_ERRORS 3 scrub errors
PG_DAMAGED Possible data damage: 3 pgs inconsistent
    pg 9.27 is active+clean+inconsistent, acting [78,107,96]
    pg 9.260 is active+clean+inconsistent, acting [84,113,62]
    pg 9.6b9 is active+clean+inconsistent, acting [79,107,80]
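
To check whether the same OSD/disk keeps coming back (cf. my question above), I map the reporting primary to its host and backing device with something like this (osd.78 is simply the primary of pg 9.27 above; the dev_node fields are what my Luminous/Bluestore OSDs expose, adjust the grep if yours differ):

$ sudo ceph osd find 78
$ sudo ceph osd metadata 78 | grep -E 'hostname|dev_node'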
$ sudo rados list-inconsistent-obj 9.27 --format=json-pretty | grep error
            "errors": [],
            "union_shard_errors": [
                "read_error"
                    "errors": [
                        "read_error"
                    "errors": [],
                    "errors": [],
$ sudo rados list-inconsistent-obj 9.260 --format=json-pretty | grep error
            "errors": [],
            "union_shard_errors": [
                "read_error"
                    "errors": [],
                    "errors": [],
                    "errors": [
                        "read_error"
$ sudo rados list-inconsistent-obj 9.6b9 --format=json-pretty | grep error
            "errors": [],
            "union_shard_errors": [
                "read_error"
                    "errors": [
                        "read_error"
                    "errors": [],
                    "errors": [],
$ sudo ceph pg repair 9.27
instructing pg 9.27 on osd.78 to repair
$ sudo ceph pg repair 9.260
instructing pg 9.260 on osd.84 to repair
$ sudo ceph pg repair 9.6b9
instructing pg 9.6b9 on osd.79 to repair
$ sudo ceph -s
  cluster:
    id:     838506b7-e0c6-4022-9e17-2d1cf9458be6
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum inf-ceph-mon0,inf-ceph-mon1,inf-ceph-mon2
    mgr: inf-ceph-mon0(active), standbys: inf-ceph-mon1, inf-ceph-mon2
    mds: cephfs_home-2/2/2 up {0=inf-ceph-mon1=up:active,1=inf-ceph-mon0=up:active}, 1 up:standby
    osd: 126 osds: 126 up, 126 in

  data:
    pools:   3 pools, 4224 pgs
    objects: 23.35M objects, 20.9TiB
    usage:   64.9TiB used, 136TiB / 201TiB avail
    pgs:     4224 active+clean

  io:
    client:   195KiB/s rd, 7.19MiB/s wr, 17op/s rd, 127op/s wr
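
And to double-check afterwards that a pg really came back clean, I look at its scrub stats along these lines (a sketch; 9.27 is just one of the pgs repaired above, and the field names are as they appear in the pg query JSON here):

$ sudo ceph pg 9.27 query | grep -E 'last_deep_scrub_stamp|num_scrub_errors'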



On 19/12/2018 at 04:48, Frank Ritchie wrote:
Hi all,

I have been receiving alerts for:

Possible data damage: 1 pg inconsistent

almost daily for a few weeks now. When I check:

rados list-inconsistent-obj $PG --format=json-pretty

I will always see a read_error. When I run a deep scrub on the PG I will see:

head candidate had a read error

When I check dmesg on the osd node I see:

blk_update_request: critical medium error, dev sdX, sector 123

I will also see a few uncorrected read errors in smartctl.

Info:
Ceph: ceph version 12.2.4-30.el7cp
OSD: Toshiba 1.8TB SAS 10K
120 OSDs total

Has anyone else seen these alerts occur almost daily? Can the errors possibly be due to deep scrubbing too aggressively?

I realize these errors indicate potential failing drives but I can't replace a drive daily.

thx
Frank

