Hi,

I'm running into an issue with Ceph 0.94.2/3 where, after doing a recovery
test, 9 PGs stay incomplete:

osdmap e78770: 2294 osds: 2294 up, 2294 in
pgmap v1972391: 51840 pgs, 7 pools, 220 TB data, 185 Mobjects
       755 TB used, 14468 TB / 15224 TB avail
          51831 active+clean
              9 incomplete

As you can see, all 2294 OSDs are online and almost all PGs became
active+clean again, except for these 9.
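
For reference, the status above is plain 'ceph -s' output; the stuck PGs
themselves can be listed with something like the following (the exact
stuck-state argument may differ per release, but incomplete PGs show up
as inactive):

$ ceph pg dump_stuck inactive
$ ceph health detail | grep incomplete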

I found out that these PGs are the problem:

10.3762
7.309e
7.29a2
10.2289
7.17dd
10.165a
7.1050
7.c65
10.abf

Digging further, all of these PGs map back to an OSD running on the
same host, 'ceph-stg-01' in this case.
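
The mapping per PG can be verified with, for example:

$ ceph pg map 10.3762

which prints the osdmap epoch plus the current up and acting sets for
the PG.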

$ ceph pg 10.3762 query

Looking at the recovery state, this past interval is shown:

                {
                    "first": 65286,
                    "last": 67355,
                    "maybe_went_rw": 0,
                    "up": [
                        1420,
                        854,
                        1105
                    ],
                    "acting": [
                        1420
                    ],
                    "primary": 1420,
                    "up_primary": 1420
                },

osd.1420 is online. I tried restarting it, but nothing happens; these 9
PGs stay incomplete.
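
(Restarting here means the stock upstart job on Ubuntu 14.04, e.g.:

$ sudo restart ceph-osd id=1420
)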

Under 'peer_info' I see both osd.854 and osd.1105 reporting about
the PG with identical numbers.
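
(I compared them with jq, assuming the peer_info layout of the Hammer
query output, with 'peer', 'last_update' and 'last_complete' at the top
level of each entry; adjust the fields if your output differs:

$ ceph pg 10.3762 query | jq '.peer_info[] | {peer, last_update, last_complete}'
)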

I restarted both 854 and 1105, without result.

The output of PG query can be found here: http://pastebin.com/qQL699zC

The cluster is running a mix of 0.94.2 and 0.94.3 on Ubuntu 14.04.2 with
the 3.13 kernel. XFS is being used as the backing filesystem.
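
The per-OSD versions can be confirmed with:

$ ceph tell osd.* version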

Any suggestions on how to fix this issue? There is no valuable data in
these pools, so I could remove them, but I'd rather fix the root cause.

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on