Thanks Greg

After seeing some recommendations in another thread, my impatience got the 
better of me and I've started the process again, but there is some logic to 
it, I promise :-)
I've copied the process from Michael Kidd, I believe, and it goes along the 
lines of the following (I've sketched the rough commands just after the list):

setting noup, noin, noscrub, nodeep-scrub, norecover, nobackfill
stopping all OSDs
setting all OSDs down & out
setting various options in ceph.conf to limit backfill activity etc
starting all OSDs
wait until all CPU settles to 0%  <-- I am here
unset the noup flag
wait until all CPU settles to 0%
unset the noin flag
wait until all CPU settles to 0%
unset the nobackfill flag
wait until all CPU settles to 0%
unset the norecover flag
remove options from ceph.conf
unset the noscrub flag
unset the nodeep-scrub flag
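
For reference, the rough commands behind those steps look something like the 
following. This is a sketch from memory rather than a verbatim copy of 
Michael's procedure; the OSD id and the ceph.conf values are only examples:

    # flags, set before touching the daemons
    ceph osd set noup
    ceph osd set noin
    ceph osd set noscrub
    ceph osd set nodeep-scrub
    ceph osd set norecover
    ceph osd set nobackfill

    # stop the daemons, then mark each OSD down and out (example id 0)
    service ceph stop osd.0
    ceph osd down 0
    ceph osd out 0

    # [osd] section of ceph.conf, to keep recovery/backfill gentle
    # (example values only)
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 1

    # then start the daemons again and, at each quiet point, unset the
    # flags one at a time in the order listed above, e.g.
    ceph osd unset noup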


Currently, host CPU usage is approximately as follows, so something has 
changed, and I'm tempted to leave things a little longer before my next step, 
just in case the CPU does eventually stop spinning. I've read reports of 
things taking "a while" even with modern Xeons, so I suppose it's not outside 
the realms of possibility that an AMD Neo might take days to work things out. 
We're up to 24.5 hours now:

192.168.12.25           20%
192.168.12.26           1%
192.168.12.27           15%
192.168.12.28           1%
192.168.12.29           12%

Interestingly, 192.168.12.26 and .28 are also the two which stopped spinning 
before I restarted this process.
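
For anyone wanting to watch the same thing, a per-host one-liner like the 
following gives a comparable number (a rough sketch; exact ps behaviour may 
vary between procps versions):

    # total %CPU across all ceph-osd processes on this host
    ps -C ceph-osd -o %cpu= | awk '{sum += $1} END {print sum "%"}'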

The number of different states is slightly less confusing now, but not by 
much :-)

788386/2591752 objects degraded (30.419%)
                  90 stale+active+clean
                   2 stale+down+remapped+peering
                   2 stale+incomplete
                    1 stale+active+degraded+remapped+wait_backfill+backfill_toofull
                   1 stale+degraded
                1255 stale+active+degraded
                  32 stale+remapped+peering
                 773 stale+active+remapped
                   4 stale+active+degraded+remapped+backfill_toofull
                1254 stale+down+peering
                 278 stale+peering
                  33 stale+active+remapped+backfill_toofull
                 563 stale+active+degraded+remapped
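
If it helps to narrow things down, I can list and interrogate the stuck PGs 
individually with something along these lines (standard commands; the pgid is 
just an example):

    # list PGs stuck in the unhappy states
    ceph pg dump_stuck stale
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean

    # ask one of them directly what it thinks is going on (example pgid)
    ceph pg 2.1f query

    # and the overall summary
    ceph health detail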

> Well, you below indicate that osd.14's log says it crashed on an internal 
> heartbeat timeout (usually, it got stuck waiting for disk IO or the 
> kernel/btrfs hung), so that's why. The osd.12 process exists but isn't "up"; 
> osd.14 doesn't even have a socket to connect to.

Ah, that does make sense, thank you.
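
For anyone following along, the socket difference is easy to see directly, 
roughly like this (assuming the default /var/run/ceph socket paths):

    # osd.12: the socket is there and the daemon answers
    ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok status

    # osd.14: there is no .asok file at all, so nothing to connect to
    ls -l /var/run/ceph/ceph-osd.14.asok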

> That's not what I'd expect to see (it appears to have timed out and not be 
> recognizing it?) but I don't look at these things too often so maybe that's 
> the normal indication that heartbeats are failing.

I'm not sure what this means either. Googling for "heartbeat_map is_healthy 
FileStore::op_tp thread had timed out after" doesn't return much, but I did 
see this quote from Sage on what looks like a similar matter:

> - the filestore op_queue is blocked on the throttler (too much io queued)
> - the commit thread is also waiting for ops to finish
> - i see no actual thread processing the op_queue
> Usually that's because it hit a kernel bug and got killed.  Not sure what 
> else would make that thread disappear...
> sage

Oh!
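
I suppose the local equivalent of that search is to grep the OSD logs and the 
kernel log directly, something like this (assuming the default log locations):

    # how often the internal heartbeat complained, per OSD log
    grep -c 'heartbeat_map is_healthy' /var/log/ceph/ceph-osd.*.log

    # any sign of the kernel or btrfs misbehaving underneath
    dmesg | grep -iE 'btrfs|blocked for more than'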

> Also, you want to find out why they're dying. That's probably the root cause 
> of your issues

I have a sneaking suspicion it's BTRFS, but I don't have the evidence, or 
perhaps the knowledge, to prove it. If XFS did compression I'd go with that, 
but at the moment I need to rely on compression to solve the problem of 
reclaiming space *within* files which reside on Ceph. As far as I remember, 
overwriting with zeros didn't re-do the thin provisioning on XFS, if that 
makes sense.
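
To illustrate what I mean by overwriting with zeros, inside the filesystem 
that sits on the Ceph-backed device I'd do roughly the following (the mount 
point is hypothetical, and dd stopping with "no space left" is expected):

    # fill the free space with zeros, then delete the file
    dd if=/dev/zero of=/mnt/vm-disk/zerofill bs=1M
    rm /mnt/vm-disk/zerofill
    sync

With compression on the OSD filesystem underneath, those zeroed extents 
compress down to almost nothing; on uncompressed XFS they stay allocated as 
real data, which is the behaviour I was describing above.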

Thanks again,
Chris
