Re: [ceph-users] 9 PGs stay incomplete

2015-09-11 Thread Brad Hubbard
- Original Message -
> From: "Wido den Hollander" <w...@42on.com>
> To: "ceph-users" <ceph-us...@ceph.com>
> Sent: Friday, 11 September, 2015 6:46:11 AM
> Subject: [ceph-users] 9 PGs stay incomplete
> 
> Hi,
> 
> I'm running into an issue with Ceph 0.94.2/3 where, after a recovery
> test, 9 PGs stay incomplete:
> 
> osdmap e78770: 2294 osds: 2294 up, 2294 in
> pgmap v1972391: 51840 pgs, 7 pools, 220 TB data, 185 Mobjects
>755 TB used, 14468 TB / 15224 TB avail
>   51831 active+clean
>   9 incomplete
> 
> As you can see, all 2294 OSDs are online and all but 9 PGs became
> active+clean again.
> 
> I found out that these PGs are the problem:
> 
> 10.3762
> 7.309e
> 7.29a2
> 10.2289
> 7.17dd
> 10.165a
> 7.1050
> 7.c65
> 10.abf
> 
> Digging further, all of these PGs map back to an OSD running on the
> same host, 'ceph-stg-01' in this case.
> 
> $ ceph pg 10.3762 query
> 
> Looking at the recovery state, this is shown:
> 
> {
> "first": 65286,
> "last": 67355,
> "maybe_went_rw": 0,
> "up": [
> 1420,
> 854,
> 1105

Anything interesting in the OSD logs for these OSDs?

> ],
> "acting": [
> 1420
> ],
> "primary": 1420,
> "up_primary": 1420
> },
> 
> osd.1420 is online. I tried restarting it, but nothing happens, these 9
> PGs stay incomplete.
> 
> Under 'peer_info' I see both osd.854 and osd.1105 reporting
> identical numbers for the PG.
> 
> I restarted both 854 and 1105, without result.
> 
> The output of PG query can be found here: http://pastebin.com/qQL699zC
> 
> The cluster is running a mix of 0.94.2 and .3 on Ubuntu 14.04.2 with the
> 3.13 kernel. XFS is being used as the backing filesystem.
> 
> Any suggestions to fix this issue? There is no valuable data in these
> pools, so I can remove them, but I'd rather fix the root cause.
> 
> --
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
> 
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


Re: [ceph-users] 9 PGs stay incomplete

2015-09-11 Thread Gregory Farnum
On Thu, Sep 10, 2015 at 9:46 PM, Wido den Hollander  wrote:
> Hi,
>
> I'm running into an issue with Ceph 0.94.2/3 where, after a recovery
> test, 9 PGs stay incomplete:
>
> osdmap e78770: 2294 osds: 2294 up, 2294 in
> pgmap v1972391: 51840 pgs, 7 pools, 220 TB data, 185 Mobjects
>755 TB used, 14468 TB / 15224 TB avail
>   51831 active+clean
>   9 incomplete
>
> As you can see, all 2294 OSDs are online and all but 9 PGs became
> active+clean again.
>
> I found out that these PGs are the problem:
>
> 10.3762
> 7.309e
> 7.29a2
> 10.2289
> 7.17dd
> 10.165a
> 7.1050
> 7.c65
> 10.abf
>
> Digging further, all of these PGs map back to an OSD running on the
> same host, 'ceph-stg-01' in this case.
>
> $ ceph pg 10.3762 query
>
> Looking at the recovery state, this is shown:
>
> {
> "first": 65286,
> "last": 67355,
> "maybe_went_rw": 0,
> "up": [
> 1420,
> 854,
> 1105
> ],
> "acting": [
> 1420
> ],
> "primary": 1420,
> "up_primary": 1420
> },
>
> osd.1420 is online. I tried restarting it, but nothing happens, these 9
> PGs stay incomplete.
>
> Under 'peer_info' I see both osd.854 and osd.1105 reporting
> identical numbers for the PG.
>
> I restarted both 854 and 1105, without result.
>
> The output of PG query can be found here: http://pastebin.com/qQL699zC

Hmm. The pg query results from each peer aren't quite the same, but
they look largely consistent to me; I think somebody from the RADOS
team will need to check it out. I do see that the log tail on the
primary hasn't advanced as far as on the other peers, but I'm not sure
whether that's the OSD at fault or just evidence of the root cause...
-Greg
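
Greg's observation about the log tail can be checked directly from the
`ceph pg <pgid> query` output with jq. A minimal sketch, using a stand-in
JSON with assumed Hammer-era field names (`info.log_tail`, `peer_info[]`);
the real structure is in the pastebin above and may differ between versions:

```shell
# Stand-in for (part of) `ceph pg 10.3762 query` output; the field
# names here are assumptions modeled on Hammer, not the actual pastebin.
cat > pg-query.json <<'EOF'
{
  "info": { "log_tail": "65286'1200" },
  "peer_info": [
    { "peer": "854",  "log_tail": "65286'1350" },
    { "peer": "1105", "log_tail": "65286'1350" }
  ]
}
EOF

# Print the primary's log_tail next to each peer's so a lagging
# primary stands out at a glance.
jq -r '"primary log_tail: \(.info.log_tail)",
       (.peer_info[] | "osd.\(.peer) log_tail: \(.log_tail)")' pg-query.json
```

Run against the real query output, this makes it easy to spot which
replica's log has fallen behind.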

>
> The cluster is running a mix of 0.94.2 and .3 on Ubuntu 14.04.2 with the
> 3.13 kernel. XFS is being used as the backing filesystem.
>
> Any suggestions to fix this issue? There is no valuable data in these
> pools, so I can remove them, but I'd rather fix the root cause.
>
> --
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on


Re: [ceph-users] 9 PGs stay incomplete

2015-09-11 Thread Wido den Hollander


On 11-09-15 12:22, Gregory Farnum wrote:
> On Thu, Sep 10, 2015 at 9:46 PM, Wido den Hollander  wrote:
>> Hi,
>>
>> I'm running into an issue with Ceph 0.94.2/3 where, after a recovery
>> test, 9 PGs stay incomplete:
>>
>> osdmap e78770: 2294 osds: 2294 up, 2294 in
>> pgmap v1972391: 51840 pgs, 7 pools, 220 TB data, 185 Mobjects
>>755 TB used, 14468 TB / 15224 TB avail
>>   51831 active+clean
>>   9 incomplete
>>
>> As you can see, all 2294 OSDs are online and all but 9 PGs became
>> active+clean again.
>>
>> I found out that these PGs are the problem:
>>
>> 10.3762
>> 7.309e
>> 7.29a2
>> 10.2289
>> 7.17dd
>> 10.165a
>> 7.1050
>> 7.c65
>> 10.abf
>>
>> Digging further, all of these PGs map back to an OSD running on the
>> same host, 'ceph-stg-01' in this case.
>>
>> $ ceph pg 10.3762 query
>>
>> Looking at the recovery state, this is shown:
>>
>> {
>> "first": 65286,
>> "last": 67355,
>> "maybe_went_rw": 0,
>> "up": [
>> 1420,
>> 854,
>> 1105
>> ],
>> "acting": [
>> 1420
>> ],
>> "primary": 1420,
>> "up_primary": 1420
>> },
>>
>> osd.1420 is online. I tried restarting it, but nothing happens, these 9
>> PGs stay incomplete.
>>
>> Under 'peer_info' I see both osd.854 and osd.1105 reporting
>> identical numbers for the PG.
>>
>> I restarted both 854 and 1105, without result.
>>
>> The output of PG query can be found here: http://pastebin.com/qQL699zC
> 
> Hmm. The pg query results from each peer aren't quite the same, but
> they look largely consistent to me; I think somebody from the RADOS
> team will need to check it out. I do see that the log tail on the
> primary hasn't advanced as far as on the other peers, but I'm not sure
> whether that's the OSD at fault or just evidence of the root cause...
> -Greg
> 

That's what I noticed as well. I ran osd.1420 with debug osd/filestore =
20 and the output is here:
http://ceph.o.auroraobjects.eu/tmp/txc1-osd.1420.log.gz

I can't tell what is going on; I don't see any 'errors', but that's
probably me not being able to read the logs properly.
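
When scanning a log like that one, the peering state-machine transitions
are usually the fastest signal; an incomplete PG should show the primary
entering the Incomplete state. A hedged sketch of the kind of filter that
helps (the sample lines below are fabricated to mimic the Hammer log
format, not taken from the actual log; on a live OSD the debug level can
be raised at runtime with `ceph tell osd.1420 injectargs '--debug-osd 20'`):

```shell
# Stand-in for a few debug-osd-20 lines (the real file would be
# /var/log/ceph/ceph-osd.1420.log); the exact message text is an
# assumption modeled on Hammer's peering state machine output.
cat > osd.1420.log <<'EOF'
2015-09-11 10:00:01 osd.1420 pg[10.3762(...)] state<Started/Primary/Peering>: entering
2015-09-11 10:00:02 osd.1420 pg[10.3762(...)] state<Started/Primary/Peering/GetLog>: entering
2015-09-11 10:00:03 osd.1420 pg[10.3762(...)] state<Started/Primary/Peering/Incomplete>: entering
2015-09-11 10:00:03 osd.1420 pg[7.309e(...)] some unrelated line
EOF

# Keep only the state transitions for the stuck PG.
grep -E 'pg\[10\.3762.*state<' osd.1420.log
```

The same filter, pointed at the real log, narrows thousands of debug
lines down to the handful that show where peering stalls.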

>>
>> The cluster is running a mix of 0.94.2 and .3 on Ubuntu 14.04.2 with the
>> 3.13 kernel. XFS is being used as the backing filesystem.
>>
>> Any suggestions to fix this issue? There is no valuable data in these
>> pools, so I can remove them, but I'd rather fix the root cause.
>>
>> --
>> Wido den Hollander
>> 42on B.V.
>> Ceph trainer and consultant
>>
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on


[ceph-users] 9 PGs stay incomplete

2015-09-10 Thread Wido den Hollander
Hi,

I'm running into an issue with Ceph 0.94.2/3 where, after a recovery
test, 9 PGs stay incomplete:

osdmap e78770: 2294 osds: 2294 up, 2294 in
pgmap v1972391: 51840 pgs, 7 pools, 220 TB data, 185 Mobjects
       755 TB used, 14468 TB / 15224 TB avail
  51831 active+clean
  9 incomplete

As you can see, all 2294 OSDs are online and all but 9 PGs became
active+clean again.

I found out that these PGs are the problem:

10.3762
7.309e
7.29a2
10.2289
7.17dd
10.165a
7.1050
7.c65
10.abf

Digging further, all of these PGs map back to an OSD running on the
same host, 'ceph-stg-01' in this case.

$ ceph pg 10.3762 query

Looking at the recovery state, this is shown:

{
"first": 65286,
"last": 67355,
"maybe_went_rw": 0,
"up": [
1420,
854,
1105
],
"acting": [
1420
],
"primary": 1420,
"up_primary": 1420
},

osd.1420 is online. I tried restarting it, but nothing happens, these 9
PGs stay incomplete.

Under 'peer_info' I see both osd.854 and osd.1105 reporting
identical numbers for the PG.

I restarted both 854 and 1105, without result.

The output of PG query can be found here: http://pastebin.com/qQL699zC
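
For anyone reproducing this, the per-PG query dumps can be collected in
one go. A small sketch; the PG list is the one above, and the generated
commands assume the `ceph` CLI is on PATH with a reachable cluster, so
this only builds the collection script rather than running it:

```shell
# The nine stuck PGs listed earlier in this message.
pgs="10.3762 7.309e 7.29a2 10.2289 7.17dd 10.165a 7.1050 7.c65 10.abf"

# Emit one `ceph pg <pgid> query` command per PG into a helper script,
# saving each result to its own file for later comparison.
for pg in $pgs; do
    printf 'ceph pg %s query > pg-%s.json\n' "$pg" "$pg"
done > collect-pg-queries.sh

cat collect-pg-queries.sh
```

Running the generated script on a monitor node leaves one JSON file per
stuck PG, which keeps the peer_info comparisons reproducible.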

The cluster is running a mix of 0.94.2 and .3 on Ubuntu 14.04.2 with the
3.13 kernel. XFS is being used as the backing filesystem.

Any suggestions to fix this issue? There is no valuable data in these
pools, so I can remove them, but I'd rather fix the root cause.

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on