*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email [email protected] <mailto:[email protected]>
On 4/8/14 18:27, Gregory Farnum wrote:
On Tue, Apr 8, 2014 at 4:57 PM, Craig Lewis <[email protected]> wrote:
pg query says the recovery state is:
    "might_have_unfound": [
          { "osd": 11,
            "status": "querying"},
          { "osd": 13,
            "status": "already probed"}],
I figured out why it wasn't probing osd.11.
When I manually replaced the disk, I added the OSD to the cluster with a
CRUSH weight of 0.
As soon as I fixed the CRUSH weight, some PGs were allocated to the
OSD, and the probing completed. My PG that was stuck in recovery mode for
24h has been remapped to be on osd.11. I believe this will allow the
recovery to complete.
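
For anyone hitting the same symptom, here is a rough sketch of how the check described above could be scripted. It is not an official tool. It assumes the JSON from `ceph pg <pgid> query` has a "recovery_state" list whose entries may carry "might_have_unfound", and that `ceph osd tree --format json` returns a "nodes" list with "id", "type" and "crush_weight" fields; verify both against your Ceph version. The function names and the sample data (including the 3.64 weight on osd.13) are made up for illustration.

```python
def pending_probes(pg_query):
    """OSD ids (as strings) in might_have_unfound that are not 'already probed'."""
    pending = []
    for state in pg_query.get("recovery_state", []):
        for entry in state.get("might_have_unfound", []):
            if entry.get("status") != "already probed":
                # The id may be an int or a string depending on version.
                pending.append(str(entry["osd"]))
    return pending


def zero_weight_osds(osd_tree):
    """OSD ids (as strings) whose CRUSH weight is exactly 0."""
    return [
        str(node["id"])
        for node in osd_tree.get("nodes", [])
        if node.get("type") == "osd" and node.get("crush_weight") == 0
    ]


def suspects(pg_query, osd_tree):
    """OSDs still waiting to be probed that also have a CRUSH weight of 0."""
    zero = set(zero_weight_osds(osd_tree))
    return [osd for osd in pending_probes(pg_query) if osd in zero]


# Illustrative data shaped like the fragment quoted above (assumed layout).
sample_query = {
    "recovery_state": [
        {"might_have_unfound": [
            {"osd": 11, "status": "querying"},
            {"osd": 13, "status": "already probed"},
        ]}
    ]
}
sample_tree = {
    "nodes": [
        {"id": 11, "type": "osd", "name": "osd.11", "crush_weight": 0.0},
        {"id": 13, "type": "osd", "name": "osd.13", "crush_weight": 3.64},
    ]
}

print(suspects(sample_query, sample_tree))
```

In practice you would feed it the parsed output of the two commands (for example via json.loads) instead of the sample dictionaries. An OSD that shows up in the result is a candidate for the zero-weight problem described here, not proof of it.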
Glad you worked it out. That sounds odd to me, though. Do you have any
logs from osd.11?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
Sure, but I don't think they'll be very helpful. I only had the default
logging levels. Here are the logs from today:
https://cd.centraldesktop.com/p/eAAAAAAADQ70AAAAAEBvDJY
At 2014-04-08 16:15, I restarted the OSD. That was to force the stalled
recovery to yield to another recovery/backfill. It seems to get hung up
every so often. Whenever I only saw this one PG in recovery state for
more than 15 minutes, I'd restart osd.11, and it would recover/backfill
other PGs for another ~12 hours. It's probably not helping that I have
max backfills set to 1.
I didn't record the exact time, but I ran a few of these, trying to zero
in on the right weight for the device. The final command was:
ceph osd crush reweight osd.11 3.64
around 17:00 PDT (the timezone used in the logs). The logs show a scrub
starting at 2014-04-08 16:50:40.682409, so I'd say it was just before that.