Hi Craig,

Recreating the missing PGs fixed it. Thanks for your help.
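For the archives, the recreation step described in the quoted thread below amounts to roughly the following. This is a sketch, not the exact commands run on this cluster: the dump_stuck listing is just one way to find the missing PG IDs, and whether "ceph osd lost" wants the --yes-i-really-mean-it confirmation depends on the release.

    # list the PGs that are stuck (the missing ones showed up as stale here)
    ceph pg dump_stuck stale

    # ask Ceph to recreate each missing PG, one at a time, e.g.
    ceph pg force_create_pg 2.33

    # if a PG stays stuck in "creating", mark the rebuilt OSDs as lost first,
    # then run force_create_pg again
    ceph osd lost <OSDID> --yes-i-really-mean-it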
But when I tried to mount the filesystem, it gave me "mount error 5". I tried to restart the MDS server, but it won't work; it tells me that it's laggy/unresponsive. BTW, all these machines are VMs.

[jshah@Lab-cephmon001 ~]$ ceph health detail
HEALTH_WARN mds cluster is degraded; mds Lab-cephmon001 is laggy
mds cluster is degraded
mds.Lab-cephmon001 at 17.147.16.111:6800/3745284 rank 0 is replaying journal
mds.Lab-cephmon001 at 17.147.16.111:6800/3745284 is laggy/unresponsive

(A sketch of checking the MDS state and retrying the mount is below, after the quoted thread.)

—Jiten

On Nov 20, 2014, at 4:20 PM, JIten Shah <[email protected]> wrote:

> Ok. Thanks.
>
> —Jiten
>
> On Nov 20, 2014, at 2:14 PM, Craig Lewis <[email protected]> wrote:
>
>> If there's no data to lose, tell Ceph to re-create all the missing PGs:
>>
>> ceph pg force_create_pg 2.33
>>
>> Repeat for each of the missing PGs. If that doesn't do anything, you might
>> need to tell Ceph that you lost the OSDs. For each OSD you moved, run
>> ceph osd lost <OSDID>, then try the force_create_pg command again.
>>
>> If that doesn't work, you can keep fighting with it, but it'll be faster to
>> rebuild the cluster.
>>
>> On Thu, Nov 20, 2014 at 1:45 PM, JIten Shah <[email protected]> wrote:
>> Thanks for your help.
>>
>> I was using Puppet to install the OSDs, and it takes a path rather than a
>> device name. Since the path specified was incorrect, it created the OSDs
>> in a path inside the root volume.
>>
>> All 3 of the OSDs were rebuilt at the same time, because the cluster was
>> unused and we had not put any data in it.
>>
>> Is there any way to recover from this, or should I rebuild the cluster
>> altogether?
>>
>> —Jiten
>>
>> On Nov 20, 2014, at 1:40 PM, Craig Lewis <[email protected]> wrote:
>>
>>> So you have your crushmap set to choose osd instead of choose host?
>>>
>>> Did you wait for the cluster to recover between each OSD rebuild? If you
>>> rebuilt all 3 OSDs at the same time (or without waiting for a complete
>>> recovery between them), that would cause this problem.
>>>
>>> On Thu, Nov 20, 2014 at 11:40 AM, JIten Shah <[email protected]> wrote:
>>> Yes, it was a healthy cluster, and I had to rebuild because the OSDs got
>>> accidentally created on the root disk. Out of 4 OSDs I had to rebuild 3 of
>>> them.
>>>
>>> [jshah@Lab-cephmon001 ~]$ ceph osd tree
>>> # id    weight    type name                 up/down  reweight
>>> -1      0.5       root default
>>> -2      0.09999       host Lab-cephosd005
>>> 4       0.09999           osd.4             up       1
>>> -3      0.09999       host Lab-cephosd001
>>> 0       0.09999           osd.0             up       1
>>> -4      0.09999       host Lab-cephosd002
>>> 1       0.09999           osd.1             up       1
>>> -5      0.09999       host Lab-cephosd003
>>> 2       0.09999           osd.2             up       1
>>> -6      0.09999       host Lab-cephosd004
>>> 3       0.09999           osd.3             up       1
>>>
>>> [jshah@Lab-cephmon001 ~]$ ceph pg 2.33 query
>>> Error ENOENT: i don't have pgid 2.33
>>>
>>> —Jiten
>>>
>>> On Nov 20, 2014, at 11:18 AM, Craig Lewis <[email protected]> wrote:
>>>
>>>> Just to be clear, this is from a cluster that was healthy, had a disk
>>>> replaced, and hasn't returned to healthy? It's not a new cluster that has
>>>> never been healthy, right?
>>>>
>>>> Assuming it's an existing cluster, how many OSDs did you replace? It
>>>> almost looks like you replaced multiple OSDs at the same time, and lost
>>>> data because of it.
>>>>
>>>> Can you give us the output of `ceph osd tree` and `ceph pg 2.33 query`?
>>>>
>>>> On Wed, Nov 19, 2014 at 2:14 PM, JIten Shah <[email protected]> wrote:
>>>> After rebuilding a few OSDs, I see that the PGs are stuck in degraded
>>>> mode. Some are in the unclean state and others are in the stale state.
>>>> Somehow the MDS is also degraded. How do I get the OSDs and the MDS back
>>>> to healthy? I have read through the documentation and searched the web,
>>>> but no luck so far.
>>>>
>>>> pg 2.33 is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3]
>>>> pg 0.30 is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3]
>>>> pg 1.31 is stuck unclean since forever, current state stale+active+degraded, last acting [2]
>>>> pg 2.32 is stuck unclean for 597129.903922, current state stale+active+degraded, last acting [2]
>>>> pg 0.2f is stuck unclean for 597129.903951, current state stale+active+degraded, last acting [2]
>>>> pg 1.2e is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3]
>>>> pg 2.2d is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [2]
>>>> pg 0.2e is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3]
>>>> pg 1.2f is stuck unclean for 597129.904015, current state stale+active+degraded, last acting [2]
>>>> pg 2.2c is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3]
>>>> pg 0.2d is stuck stale for 422844.566858, current state stale+active+degraded, last acting [2]
>>>> pg 1.2c is stuck stale for 422598.539483, current state stale+active+degraded+remapped, last acting [3]
>>>> pg 2.2f is stuck stale for 422598.539488, current state stale+active+degraded+remapped, last acting [3]
>>>> pg 0.2c is stuck stale for 422598.539487, current state stale+active+degraded+remapped, last acting [3]
>>>> pg 1.2d is stuck stale for 422598.539492, current state stale+active+degraded+remapped, last acting [3]
>>>> pg 2.2e is stuck stale for 422598.539496, current state stale+active+degraded+remapped, last acting [3]
>>>> pg 0.2b is stuck stale for 422598.539491, current state stale+active+degraded+remapped, last acting [3]
>>>> pg 1.2a is stuck stale for 422598.539496, current state stale+active+degraded+remapped, last acting [3]
>>>> pg 2.29 is stuck stale for 422598.539504, current state stale+active+degraded+remapped, last acting [3]
>>>> .
>>>> .
>>>> .
>>>> 6 ops are blocked > 2097.15 sec
>>>> 3 ops are blocked > 2097.15 sec on osd.0
>>>> 2 ops are blocked > 2097.15 sec on osd.2
>>>> 1 ops are blocked > 2097.15 sec on osd.4
>>>> 3 osds have slow requests
>>>> recovery 40/60 objects degraded (66.667%)
>>>> mds cluster is degraded
>>>> mds.Lab-cephmon001 at X.X.16.111:6800/3424727 rank 0 is replaying journal
>>>>
>>>> —Jiten
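Regarding the mount error 5 at the top of this message: error 5 is EIO, and with CephFS it usually just means there is no active MDS to serve the mount while mds.Lab-cephmon001 is still replaying its journal (the laggy flag comes from missed beacons, which slow VMs are prone to). A rough sketch of waiting for the MDS and then retrying; the monitor host, mount point and secretfile path below are placeholders, not taken from this cluster:

    # watch the MDS state; wait for up:active rather than up:replay / laggy
    ceph mds stat
    ceph -s

    # once an MDS is active, retry the mount (adjust address, path and credentials)
    mount -t ceph <mon-host>:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret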
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
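A footnote on the "choose osd instead of choose host" question in the quoted thread: the failure domain is visible in the CRUSH rules, with the pool replica counts next to it. A sketch, assuming nothing cluster-specific:

    # dump the CRUSH rules; in the chooseleaf step, "type host" spreads replicas
    # across hosts, while "type osd" only spreads them across OSDs
    ceph osd crush rule dump

    # show the pools with their replica counts ("replicated size N")
    ceph osd dump | grep 'replicated size'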
