Thanks Michael. That was a good idea.

I did:

1. sudo service ceph stop mds

2. ceph mds newfs 1 0 --yes-i-really-mean-it (where 1 and 0 are the pool IDs for 
metadata and data)

3. ceph health (It was healthy now!!!)

4. sudo service ceph start mds.$(hostname -s)

And I am back in business.
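
In case it helps anyone else, here are the same steps as a rough script. The
pool IDs 1 and 0 are specific to my cluster (check yours with "ceph osd
lspools"), and the filesystem was empty, which is the only reason the newfs
step was safe:

  #!/bin/sh
  # Stop the local MDS before resetting the filesystem.
  sudo service ceph stop mds

  # Re-create the filesystem on the metadata pool (1) and data pool (0).
  # WARNING: this discards the existing MDS map and filesystem metadata.
  ceph mds newfs 1 0 --yes-i-really-mean-it

  # Confirm the cluster reports healthy before bringing the MDS back.
  ceph health

  # Start the MDS on this host again.
  sudo service ceph start mds.$(hostname -s)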

Thanks again.

—Jiten



On Nov 20, 2014, at 5:47 PM, Michael Kuriger <[email protected]> wrote:

> Maybe delete the pool and start over?
>  
>  
> From: ceph-users [mailto:[email protected]] On Behalf Of 
> JIten Shah
> Sent: Thursday, November 20, 2014 5:46 PM
> To: Craig Lewis
> Cc: ceph-users
> Subject: Re: [ceph-users] pg's degraded
>  
> Hi Craig,
>  
> Recreating the missing PGs fixed it.  Thanks for your help.
>  
> But when I tried to mount the filesystem, it gave me “mount error 5”. I 
> tried to restart the MDS server, but it wouldn’t work. It tells me that it’s 
> laggy/unresponsive.
>  
> BTW, all these machines are VMs.
>  
> [jshah@Lab-cephmon001 ~]$ ceph health detail
> HEALTH_WARN mds cluster is degraded; mds Lab-cephmon001 is laggy
> mds cluster is degraded
> mds.Lab-cephmon001 at 17.147.16.111:6800/3745284 rank 0 is replaying journal
> mds.Lab-cephmon001 at 17.147.16.111:6800/3745284 is laggy/unresponsive
>  
>  
> —Jiten
>  
> On Nov 20, 2014, at 4:20 PM, JIten Shah <[email protected]> wrote:
> 
> 
> Ok. Thanks.
>  
> —Jiten
>  
> On Nov 20, 2014, at 2:14 PM, Craig Lewis <[email protected]> wrote:
> 
> 
> If there's no data to lose, tell Ceph to re-create all the missing PGs.
>  
> ceph pg force_create_pg 2.33
>  
> Repeat for each of the missing PGs.  If that doesn't do anything, you might 
> need to tell Ceph that you lost the OSDs.  For each OSD you moved, run ceph 
> osd lost <OSDID>, then try the force_create_pg command again.
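>  
> For example (untested sketch; the PG IDs are just the first few from your 
> health output, and the OSD IDs are placeholders for the ones you rebuilt):
>  
>   # Re-create each missing PG.
>   for pg in 2.33 0.30 1.31; do
>       ceph pg force_create_pg $pg
>   done
>  
>   # If that has no effect, mark the rebuilt OSDs as lost first,
>   # then run the force_create_pg loop again.
>   for osd in 1 2 3; do
>       ceph osd lost $osd --yes-i-really-mean-it
>   done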
>  
> If that doesn't work, you can keep fighting with it, but it'll be faster to 
> rebuild the cluster.
>  
>  
>  
> On Thu, Nov 20, 2014 at 1:45 PM, JIten Shah <[email protected]> wrote:
> Thanks for your help.
>  
> I was using puppet to install the OSDs, and it takes a path rather than a 
> device name. Since the path specified was incorrect, it created the OSDs 
> under that path on the root volume.
>  
> And all 3 of the OSDs were rebuilt at the same time, because the cluster was 
> unused and we had not put any data in it.
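>  
> In hindsight, a quick df over the OSD data directories would have caught it, 
> assuming the default /var/lib/ceph/osd layout:
>  
>   # Each OSD data directory should sit on its own disk,
>   # not on the root filesystem.
>   df -h /var/lib/ceph/osd/ceph-*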
>  
> Any way to recover from this, or should I rebuild the cluster altogether?
>  
> —Jiten
>  
> On Nov 20, 2014, at 1:40 PM, Craig Lewis <[email protected]> wrote:
> 
> 
> So you have your crushmap set to choose osd instead of choose host?
>  
> Did you wait for the cluster to recover between each OSD rebuild?  If you 
> rebuilt all 3 OSDs at the same time (or without waiting for a complete 
> recovery between them), that would cause this problem.
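>  
> For reference, a host-level replicated rule in a decompiled crushmap (ceph 
> osd getcrushmap -o crush.bin; crushtool -d crush.bin -o crush.txt) looks 
> roughly like this; the names here are the stock defaults, so yours may differ:
>  
>   rule replicated_ruleset {
>       ruleset 0
>       type replicated
>       min_size 1
>       max_size 10
>       step take default
>       # "type host" spreads replicas across hosts;
>       # "type osd" only spreads them across OSDs, which is what
>       # lets a multi-OSD rebuild lose data.
>       step chooseleaf firstn 0 type host
>       step emit
>   }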
>  
>  
>  
> On Thu, Nov 20, 2014 at 11:40 AM, JIten Shah <[email protected]> wrote:
> Yes, it was a healthy cluster, and I had to rebuild because the OSDs got 
> accidentally created on the root disk. Out of 4 OSDs I had to rebuild 3 of 
> them.
>  
>  
> [jshah@Lab-cephmon001 ~]$ ceph osd tree
> # id  weight    type name                up/down  reweight
> -1    0.5       root default
> -2    0.09999       host Lab-cephosd005
> 4     0.09999           osd.4            up       1
> -3    0.09999       host Lab-cephosd001
> 0     0.09999           osd.0            up       1
> -4    0.09999       host Lab-cephosd002
> 1     0.09999           osd.1            up       1
> -5    0.09999       host Lab-cephosd003
> 2     0.09999           osd.2            up       1
> -6    0.09999       host Lab-cephosd004
> 3     0.09999           osd.3            up       1
>  
>  
> [jshah@Lab-cephmon001 ~]$ ceph pg 2.33 query
> Error ENOENT: i don't have pgid 2.33
>  
> —Jiten
>  
>  
> On Nov 20, 2014, at 11:18 AM, Craig Lewis <[email protected]> wrote:
> 
> 
> Just to be clear, this is from a cluster that was healthy, had a disk 
> replaced, and hasn't returned to healthy?  It's not a new cluster that has 
> never been healthy, right?
>  
> Assuming it's an existing cluster, how many OSDs did you replace?  It almost 
> looks like you replaced multiple OSDs at the same time, and lost data because 
> of it.
>  
> Can you give us the output of `ceph osd tree`, and `ceph pg 2.33 query`?
>  
>  
> On Wed, Nov 19, 2014 at 2:14 PM, JIten Shah <[email protected]> wrote:
> After rebuilding a few OSDs, I see that the PGs are stuck in degraded mode. 
> Some are in the unclean state and others are in the stale state. Somehow the 
> MDS is also degraded. How do I recover the OSDs and the MDS back to healthy? 
> I have read through the documentation and searched the web, but no luck so far.
>  
> pg 2.33 is stuck unclean since forever, current state 
> stale+active+degraded+remapped, last acting [3]
> pg 0.30 is stuck unclean since forever, current state 
> stale+active+degraded+remapped, last acting [3]
> pg 1.31 is stuck unclean since forever, current state stale+active+degraded, 
> last acting [2]
> pg 2.32 is stuck unclean for 597129.903922, current state 
> stale+active+degraded, last acting [2]
> pg 0.2f is stuck unclean for 597129.903951, current state 
> stale+active+degraded, last acting [2]
> pg 1.2e is stuck unclean since forever, current state 
> stale+active+degraded+remapped, last acting [3]
> pg 2.2d is stuck unclean since forever, current state 
> stale+active+degraded+remapped, last acting [2]
> pg 0.2e is stuck unclean since forever, current state 
> stale+active+degraded+remapped, last acting [3]
> pg 1.2f is stuck unclean for 597129.904015, current state 
> stale+active+degraded, last acting [2]
> pg 2.2c is stuck unclean since forever, current state 
> stale+active+degraded+remapped, last acting [3]
> pg 0.2d is stuck stale for 422844.566858, current state 
> stale+active+degraded, last acting [2]
> pg 1.2c is stuck stale for 422598.539483, current state 
> stale+active+degraded+remapped, last acting [3]
> pg 2.2f is stuck stale for 422598.539488, current state 
> stale+active+degraded+remapped, last acting [3]
> pg 0.2c is stuck stale for 422598.539487, current state 
> stale+active+degraded+remapped, last acting [3]
> pg 1.2d is stuck stale for 422598.539492, current state 
> stale+active+degraded+remapped, last acting [3]
> pg 2.2e is stuck stale for 422598.539496, current state 
> stale+active+degraded+remapped, last acting [3]
> pg 0.2b is stuck stale for 422598.539491, current state 
> stale+active+degraded+remapped, last acting [3]
> pg 1.2a is stuck stale for 422598.539496, current state 
> stale+active+degraded+remapped, last acting [3]
> pg 2.29 is stuck stale for 422598.539504, current state 
> stale+active+degraded+remapped, last acting [3]
> .
> .
> .
> 6 ops are blocked > 2097.15 sec
> 3 ops are blocked > 2097.15 sec on osd.0
> 2 ops are blocked > 2097.15 sec on osd.2
> 1 ops are blocked > 2097.15 sec on osd.4
> 3 osds have slow requests
> recovery 40/60 objects degraded (66.667%)
> mds cluster is degraded
> mds.Lab-cephmon001 at X.X.16.111:6800/3424727 rank 0 is replaying journal
>  
> —Jiten
>  

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
