Hello fellow cephers, 

I have been struggling with the stability of my Jewel cluster and, from what I 
can see, I am not the only one. 

My setup is: 3 osd+mon servers, 30 osds, half a dozen client host servers for 
rbd access, and a 40Gbit/s InfiniBand link. All ceph servers are running 
Ubuntu 16.04; the clients are on Ubuntu 14.04. 

Problems that I've experienced recently after upgrading to Jewel: slow+blocked 
requests, ceph-osd crashes, and memory leaks in ceph-mon and ceph-osd causing 
memory exhaustion and killing of ceph-osd/ceph-mon processes. (Slow/blocked 
requests had been a problem for me for years.) 

What I tried initially was rebooting my osd/mon servers every night. This 
solved ALL of my problems. At least, I've not had any slow/blocked requests 
for over a week now. Before, there were somewhere between 50 and 2K slow 
requests per day. Obviously, rebooting servers on a daily basis is not ideal, 
to say the least. 

For the last 3 days I have been running the 4.9.8 kernel from the ubuntu builds 
and also running a cron script to clear the page cache twice a day on each 
osd/mon server. I am not rebooting the servers. This has solved my slow/blocked 
requests. Again, I've not seen a single slow/blocked request in 3 days. 
However, I do see one of my ceph-mon processes leaking memory and consuming 
about 3-4GB of RAM per day. I let it grow past 8GB and then restarted the 
ceph-mon, which seems to have stopped the leak for now: after 24 hours or so 
the process is consuming <1GB of RAM on that server. 
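
For anyone who wants to try the page cache clearing, here is a minimal sketch 
of what such a job could look like. It is an illustration only, not the exact 
script in use. It assumes a Linux host, root privileges, and the standard 
/proc/sys/vm/drop_caches interface; the function name and the --drop flag are 
made up for this example. 

```python
#!/usr/bin/env python3
"""Drop the Linux page cache; meant to be run as root from cron."""
import os
import sys

DROP_CACHES = "/proc/sys/vm/drop_caches"


def drop_page_cache(path=DROP_CACHES, mode="1"):
    """Flush dirty pages, then ask the kernel to drop clean caches.

    mode "1" = page cache only, "2" = dentries and inodes, "3" = both.
    The path argument exists only so the function can be tried against
    an ordinary file instead of the real kernel interface.
    """
    if mode not in ("1", "2", "3"):
        raise ValueError("mode must be '1', '2' or '3'")
    os.sync()  # write dirty pages out first so they become droppable
    with open(path, "w") as f:
        f.write(mode + "\n")


# Require an explicit flag so caches are never dropped by accident.
if __name__ == "__main__" and sys.argv[1:] == ["--drop"]:
    drop_page_cache()
```

To run it twice a day, an /etc/cron.d entry along the lines of 
"0 6,18 * * * root /usr/local/sbin/drop_page_cache.py --drop" would do (the 
path and the times are placeholders). 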

I've not made any changes to the ceph clients. They are still running the same 
as before. 

I thought I'd share this in case other people are experiencing similar 
troubles with 10.2.5 or other minor versions. Also, if anyone has an idea how 
to improve things, please share. 


