Jan,

this is very handy to know! Thanks for sharing it with us!

Everyone, do you think it would be useful to have a place where we could gather good practices, problem resolutions, and tips from the community? We could have a voting system, and the entries with the most votes (or above some threshold) would appear there.

Regards,

George

I know a few other people here have been battling with the occasional
issue of OSDs being extremely slow to start.

I personally run OSDs mixed with KVM guests on the same nodes, and
was baffled by this issue occurring mostly on the most idle (empty)
machines.
I thought it was some kind of race condition where the OSD started
too fast and the disks couldn't catch up, and I investigated the
latency of CPUs and cards on mostly idle hardware etc. - with no
improvement.

But in the end, most of my issues were caused by the page cache using
too much memory. This doesn't cause any problems while the OSDs have
their memory allocated and are running, but when an OSD is
(re)started, the OS struggles to allocate contiguous blocks of memory
for it and its buffers.
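If you want to see the fragmentation itself, one place to look is
/proc/buddyinfo, which lists free blocks per allocation order
(column N = free runs of 2^N contiguous pages); near-zero counts in
the higher orders mean large allocations will stall:

# Free contiguous blocks per order, per zone:
cat /proc/buddyinfo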
This could also be why I'm seeing such an improvement with my NUMA
pinning script: reclaiming memory on one NUMA node is probably easier
and doesn't block allocations on the other nodes.

How can you tell if this is your case? When restarting an OSD that
has this issue, watch the CPU usage of the "kswapd" processes. If it
is >0, then you have this issue and would benefit from setting this:

# Cap the page cache share of each OSD data device (e.g. sdb1 -> sdb):
for i in $(mount | grep "ceph/osd" | cut -d' ' -f1 | cut -d'/' -f3 | tr -d '0-9'); do
  echo 1 > /sys/block/$i/bdi/max_ratio
done
(Another option is echo 1 > /proc/sys/vm/drop_caches before starting
the OSD, but that's a bit brutal.)

What this does is limit each block device's share of the page cache
(more precisely, of the dirty/writeback threshold) to 1%. I'd like to
limit it even further, but max_ratio only takes whole percentages, so
it doesn't understand "0.3"...

Let me know if it helps. I haven't been able to verify that this
cures the problem completely, but there was no regression after
setting it.

Jan

P.S. This is for the ancient 2.6.32 kernel in RHEL 6 / CentOS 6;
newer kernels have tunables to limit the overall pagecache size. You
can also set limits in cgroups, but I'm afraid that won't help in
this case, as you can only cap the whole memory footprint (pagecache
and anonymous memory together), inside which the allocations will
battle anyway.
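For reference, the cgroup knob I mean is the memory controller's
footprint limit; a rough sketch (cgroup v1; the mount point, group
name and limit value are just examples and vary by distro):

# The limit covers pagecache and anonymous memory together, so the
# OSD's allocations still fight the cache inside it:
mkdir /sys/fs/cgroup/memory/osd
echo 4G > /sys/fs/cgroup/memory/osd/memory.limit_in_bytes
echo $OSD_PID > /sys/fs/cgroup/memory/osd/tasks  # $OSD_PID = the OSD's pid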
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
