Thanks, Greg.
After seeing some recommendations in another thread, my impatience got the
better of me, and I've started the process again, but there is some logic, I
promise :-)
I've copied the process from Michael Kidd, I believe, and it goes along the
lines of:
setting noup, noin, noscrub, nodeep-scrub, norecover, nobackfill
stopping all OSDs
setting all OSDs down & out
setting various options in ceph.conf to limit backfill activity etc
starting all OSDs
wait until all CPU settles to 0% <-- I am here
unset the noup flag
wait until all CPU settles to 0%
unset the noin flag
wait until all CPU settles to 0%
unset the nobackfill flag
wait until all CPU settles to 0%
unset the norecover flag
remove options from ceph.conf
unset the noscrub flag
unset the nodeep-scrub flag
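For anyone following along, the flag juggling above corresponds to ceph CLI
commands along these lines (a sketch from memory; the ceph.conf throttle
values shown in the comments are illustrative guesses, not the exact ones I
used):

```shell
# Quiesce the cluster before the restart (standard ceph osd flags):
for f in noup noin noscrub nodeep-scrub norecover nobackfill; do
    ceph osd set "$f"
done

# ... stop all OSDs, mark them down/out, add throttles to ceph.conf,
# then start the OSDs again. Illustrative [osd] throttles (my guesses):
#   osd max backfills = 1
#   osd recovery max active = 1

# Unwind in stages, waiting for CPU to settle to 0% between each step:
ceph osd unset noup
ceph osd unset noin
ceph osd unset nobackfill
ceph osd unset norecover
# (remove the throttle options from ceph.conf at this point)
ceph osd unset noscrub
ceph osd unset nodeep-scrub
```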
Currently, host CPU usage is approximately the following, so something has
changed, and I'm tempted to leave things a little longer before my next step,
just in case the CPU does eventually stop spinning. I've read reports of things
taking "a while" even with modern Xeons, so I suppose it's not outside the
realms of possibility that an AMD Neo might take days to work things out. We're
up to 24.5 hours now:
192.168.12.25 20%
192.168.12.26 1%
192.168.12.27 15%
192.168.12.28 1%
192.168.12.29 12%
Interestingly, 192.168.12.26 and .28 are also the two which had stopped
spinning before I restarted this process.
The number of different states is slightly less confusing now, but not by
much :-)
788386/2591752 objects degraded (30.419%)
90 stale+active+clean
2 stale+down+remapped+peering
2 stale+incomplete
1 stale+active+degraded+remapped+wait_backfill+backfill_toofull
1 stale+degraded
1255 stale+active+degraded
32 stale+remapped+peering
773 stale+active+remapped
4 stale+active+degraded+remapped+backfill_toofull
1254 stale+down+peering
278 stale+peering
33 stale+active+remapped+backfill_toofull
563 stale+active+degraded+remapped
> Well, you below indicate that osd.14's log says it crashed on an internal
> heartbeat timeout (usually, it got stuck waiting for disk IO or the
> kernel/btrfs hung), so that's why. The osd.12 process exists but isn't "up";
> osd.14 doesn't even have a socket to connect to.
Ah, that does make sense, thank you.
> That's not what I'd expect to see (it appears to have timed out and not be
> recognizing it?) but I don't look at these things too often so maybe that's
> the normal indication that heartbeats are failing.
I'm not sure what this means either. A Google search for "heartbeat_map
is_healthy FileStore::op_tp thread had timed out after" doesn't return much,
but I did see this quote from Sage on what looks like a similar matter:
> - the filestore op_queue is blocked on the throttler (too much io queued)
> - the commit thread is also waiting for ops to finish
> - i see no actual thread processing the op_queue
> Usually that's because it hit a kernel bug and got killed. Not sure what
> else would make that thread disappear...
> sage
Oh!
> Also, you want to find out why they're dying. That's probably the root cause
> of your issues
I have a sneaking suspicion it's BTRFS, but I don't have the evidence, or
perhaps the knowledge, to prove it. If XFS supported compression, I'd go with
that, but at the moment I need to rely on compression to reclaim space *within*
files which reside on Ceph. As far as I remember, overwriting with zeros didn't
restore the thin provisioning on XFS, if that makes sense.
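(As an aside, and purely my own sketch rather than anything from this thread:
on filesystems that support hole punching, util-linux's fallocate can
deallocate zero-filled regions inside a file without needing compression. The
path and size below are arbitrary illustrations.)

```shell
# Create an 8 MiB file of zeros, then punch out the zero-filled blocks.
dd if=/dev/zero of=/tmp/sparse-test bs=1M count=8 status=none
du -k /tmp/sparse-test             # roughly 8192 KB allocated
fallocate --dig-holes /tmp/sparse-test
du -k /tmp/sparse-test             # far less allocated: zeros became holes
```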
Thanks again,
Chris
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com