I had some issues with OSD flapping after 2 days of recovery. It
appears to be related to swapping, even though I have plenty of RAM for
the number of OSDs I have. The cluster was completely unusable, and I
ended up rebooting all the nodes. It's been great ever since, but I'm
assuming it will happen again.
Details are below, but does anybody have any idea what happened?
I noticed some lumpy data distribution on my OSDs. Following the advice
on the mailing list, I increased pg_num and pgp_num to the values from
the usual formula (roughly OSDs * 100 / replica count, rounded up to a
power of 2). .rgw.buckets is the only large pool, so I increased pg_num
and pgp_num from 128 to 2048 on that one pool. The cluster status
changed to HEALTH_WARN, there were 1920 PGs in state
active+remapped+wait_backfill, and 32% of the objects were degraded.
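For reference, the split was just the standard pool commands, along the
lines of:

    ceph osd pool set .rgw.buckets pg_num 2048
    ceph osd pool set .rgw.buckets pgp_num 2048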
Recovery was slow, and we were having some performance issues. I
lowered osd_max_backfills from 10 to 2, and osd_recovery_op_priority
from 10 to 2. This didn't slow the recovery down much, but made my
application much more responsive. My journals are on the OSD disks (no
SSDs). I believe the osd_max_backfills was the more important change,
but it's much slower to test than the osd_recovery_op_priority change.
Aside from those two, my notes say I changed and reverted
osd_disk_threads, osd_op_threads, and osd_recovery_threads. All changes
were pushed out through the admin sockets, e.g.:

    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config set osd_max_backfills 2
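The osd_recovery_op_priority change went out the same way, repeated for
each OSD's socket on each node, along the lines of:

    ceph --admin-daemon /var/run/ceph/ceph-osd.N.asok config set osd_recovery_op_priority 2

(with N being each OSD id).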
I watched the cluster on and off over the weekend. Ceph was steadily
recovering. It was down to ~900 PGs in active+remapped+wait_backfill,
with 17% of objects degraded. A few OSDs had been marked down and
recovered, so a few tens of PGs were in state
active+degraded+remapped+wait_backfill and
active+degraded+remapped+backfilling. While poking around, I noticed
kswapd was using between 5% and 30% CPU on all nodes. It was bursty,
peaking at 30% CPU for about 5 seconds out of every 30. Swap usage
wasn't increasing, and kswapd appeared to be doing a lot of nothing.
My machines have 8 OSDs and 36GB of RAM. top said that all machines
were caching 30GB of data. The ceph-osd daemons were each using between
0.5GB and 1.2GB of RAM; I don't have the exact numbers, but I believe
it was about 5GB total for all 8 ceph-osd daemons.
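For what it's worth, nothing Ceph-specific is needed to see this; it's
all visible from stock tools on each node, e.g.:

    vmstat 5
    free -m
    top -b -n 1 | grep -E 'kswapd|ceph-osd'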
A few hours later, the OSDs really started flapping. They were being
voted unresponsive and marked down faster than they could rejoin. At one
point, a third of the OSDs were marked down. ceph -w was complaining
about hundreds of slow requests older than 900 seconds. Most RGW
accesses were failing with HTTP timeouts. kswapd was using a consistent
33% CPU on all nodes, with no variance that I could see. To add insult
to injury, the cluster was running a scrub and a deep scrub at the same time.
I eventually rebooted all nodes in the cluster, one at a time. Once
quorum was reestablished, recovery proceeded at the original speed. The
OSDs are responding, and all my RGW requests are returning in a
reasonable amount of time. There are no complaints of slow requests in
ceph -w. kswapd is using 0% of the CPU.
I'm running Ceph 0.72.2 on Ubuntu 12.04.4, with kernel 3.5.0-37-generic
#58~precise1-Ubuntu SMP.
I monitor the running version as well as the installed version, so I
know that all daemons were restarted after the 0.72.1 -> 0.72.2
upgrade. That happened on Jan 22nd.
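For anyone who wants to do the same check, both are easy to query: the
running version from the admin socket and the installed version from
dpkg, e.g.

    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok version
    dpkg-query -W -f='${Version}\n' ceph

on each node.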
Any idea what happened? I'm assuming it will happen again if recovery
takes long enough.
--
*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email [email protected] <mailto:[email protected]>