On 5/21/14 21:15, Sage Weil wrote:
On Wed, 21 May 2014, Craig Lewis wrote:
If you do this over IRC, can you please post a summary to the mailing
list?

I believe I'm having this issue as well.
In the other case, we found that some of the OSDs were behind processing
maps (by several thousand epochs).  The trick here to give them a chance
to catch up is

  ceph osd set noup
  ceph osd set nodown
  ceph osd set noout

and wait for them to stop spinning on the CPU.  You can check which map
each OSD is on with

  ceph daemon osd.NNN status

to see which epoch they are on and compare that to

  ceph osd stat

Once they are within 100 epochs or fewer,

  ceph osd unset noup

and let them all start up.
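The epoch comparison above can be sketched roughly as follows. Note this is a sketch, not verified output: the admin-socket JSON field name (newest_map), the `-f json` parsing with jq, and osd.12 are all assumptions/placeholders.

```shell
# How many epochs behind is an OSD, given the cluster's current epoch
# and the OSD's newest map epoch?
epoch_lag() {
    echo $(( $1 - $2 ))
}

# On a live cluster, something along these lines (not run here):
#   cluster_epoch=$(ceph osd stat -f json | jq .epoch)
#   osd_epoch=$(ceph daemon osd.12 status | jq .newest_map)
#   epoch_lag "$cluster_epoch" "$osd_epoch"
#
# Once the lag is 100 or less for every OSD, it should be safe to
# `ceph osd unset noup` and let them start up.
```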

We haven't determined whether the original problem was caused by this or
the other way around; we'll see once they are all caught up.

sage

I was seeing the CPU spinning too, so I think it is the same issue. Thanks for the explanation! I've been pulling my hair out for weeks.


I can give you a data point for the "how". My problems started with a kswapd problem on 12.04.4 (kernel 3.5.0-46-generic #70~precise1-Ubuntu). kswapd was consuming 100% CPU, and it was blocking the ceph-osd processes. Once I prevented kswapd from doing that, my OSDs couldn't recover. noout and nodown didn't help; the OSDs would suicide and restart.


Upgrading to Ubuntu 14.04 seems to have helped. The cluster is finally healthy after 2 weeks of incomplete and stale PGs. It's still sluggish, but it's making progress. I am still seeing OSDs consuming 100% CPU, but only the OSDs that are actively deep-scrubbing. Once the deep-scrub finishes, the OSD starts behaving again. They seem to be slowly getting better, which matches up with your explanation.


I'll go ahead and set noup. I don't think it's necessary at this point, but it's not going to hurt.

I'm running Emperor, and it looks like "ceph daemon osd.NNN status" isn't supported there. Not a big deal though. Deep-scrub has made it through half of the PGs in the last 36 hours, so I'll just watch for another day or two. This is a slave cluster, so I have that luxury.
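For keeping an eye on the deep-scrub progress without the admin socket, something along these lines should work (matching on the "scrubbing+deep" state string is an assumption about the pg dump output format):

```shell
# Count PGs currently deep-scrubbing; their state string contains
# "scrubbing+deep" (assumed format of the pg state column).
ceph pg dump 2>/dev/null | grep -c 'scrubbing+deep'

# Overall cluster state, refreshed every few seconds:
watch -n 5 ceph -s
```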


--

*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email [email protected]


_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
