Hello fellow cephers,
I have been struggling with stability of my Jewel cluster and from what I can
see I am not the only person.
My setup is: 3 osd+mon servers, 30 osds, half a dozen client host servers
for rbd access, 40gbit/s infiniband link, all ceph servers are running on
Ubuntu 16.04, clients are on Ubuntu 14.04.
Problems that I've recently experienced after upgrading to Jewel: slow+blocked
requests, ceph-osd crashes, memory leaks in ceph-mon and ceph-osd causing
memory exhaustion and killing of ceph-osd/ceph-mon processes. (Slow/blocked
requests have been a problem for me for years.)
What I tried initially was to reboot my osd/mon servers every night. This
solved ALL of my problems. At least I've not had any slow/blocked requests
for over a week now. Before, there were somewhere between 50 and 2K slow
requests per day. Obviously rebooting servers on a daily basis is not ideal,
to say the least.
For the last 3 days I have been running the 4.9.8 kernel from the Ubuntu builds
and also running a cron script to clear the page cache twice a day on each
osd/mon server. I am not rebooting the servers. This has solved my slow/blocked
requests.
Again, I've not seen a single slow/blocked request in 3 days. However, I do see
one of my ceph-mon processes leaking memory and consuming about 3-4GB of RAM
per day. I let it grow past 8GB and restarted the ceph-mon, which seems to have
stopped the leak for now: after 24 hours or so the process consumes <1GB of RAM
on that server.
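
For anyone who wants to try the page cache clearing, here is a rough sketch of
what such a cron job could look like. This is not my exact script, just an
illustration: it assumes a Linux host, root privileges, and the standard
/proc/sys/vm/drop_caches interface. The function name and the "mode" parameter
are made up for this example.

```python
#!/usr/bin/env python3
"""Drop the Linux page cache; meant to be run as root from cron."""
import os
import sys

# Standard Linux sysctl file; writing to it requires root.
DROP_CACHES = "/proc/sys/vm/drop_caches"


def drop_page_cache(path=DROP_CACHES, mode="1"):
    """Flush dirty pages, then ask the kernel to drop clean caches.

    mode "1" = page cache only, "2" = dentries and inodes, "3" = both.
    """
    if mode not in ("1", "2", "3"):
        raise ValueError("mode must be '1', '2' or '3'")
    # Dirty pages cannot be dropped, so write them out to disk first.
    os.sync()
    with open(path, "w") as f:
        f.write(mode + "\n")


if __name__ == "__main__":
    try:
        drop_page_cache()
    except PermissionError:
        sys.exit("must be run as root")
```

Saved under a hypothetical path such as /usr/local/sbin/drop_page_cache.py, it
could be scheduled twice a day from /etc/cron.d with a line like
"0 6,18 * * * root /usr/local/sbin/drop_page_cache.py" (the times are only an
example). Dropping caches is non-destructive, but reads will be slower for a
while afterwards until the cache warms up again.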
I've not made any changes to the ceph clients. They are still running the same
I thought I'd share this in case other people are experiencing similar
troubles with 10.2.5 or other minor versions. Also, if anyone has an
idea how to improve things, please share.
ceph-users mailing list