Thanks for the recommendation, Bob! I'll try to get this data later today and reply with it.
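For reference, here is roughly what I expect to run, based on my reading of that memory-profiling page (using osd.0 as an example; the exact dump filenames and log locations may differ on our setup, so treat this as a sketch rather than gospel):

  # start the tcmalloc heap profiler on one OSD
  ceph tell osd.0 heap start_profiler

  # while memory climbs, grab running stats and a heap dump
  ceph tell osd.0 heap stats
  ceph tell osd.0 heap dump

  # stop profiling and ask tcmalloc to release unused memory back to the OS
  ceph tell osd.0 heap stop_profiler
  ceph tell osd.0 heap release

  # analyze the dump; I'm assuming the dumps land in /var/log/ceph
  google-pprof --text /usr/bin/ceph-osd /var/log/ceph/osd.0.profile.0001.heap

If I've misread any of those steps, corrections are welcome.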
-Aaron

On Sat, Apr 15, 2017 at 9:46 AM, Bob R <[email protected]> wrote:

> I'd recommend running through these steps and posting the output as well:
> http://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/
>
> Bob
>
> On Sat, Apr 15, 2017 at 5:39 AM, Peter Maloney <[email protected]> wrote:
>
>> How many PGs do you have? And did you change any config, like mds cache
>> size? Show your ceph.conf.
>>
>> On 04/15/17 07:34, Aaron Ten Clay wrote:
>>
>> Hi all,
>>
>> Our cluster is experiencing a very odd issue and I'm hoping for some
>> guidance on troubleshooting steps and/or suggestions to mitigate the issue.
>> tl;dr: individual ceph-osd processes try to allocate >90GiB of RAM and are
>> eventually nuked by oom_killer.
>>
>> I'll try to explain the situation in detail:
>>
>> We have 24 4TB bluestore HDD OSDs and four 600GB SSD OSDs. The SSD OSDs
>> are in a different CRUSH "root", used as a cache tier for the main storage
>> pools, which are erasure coded and used for cephfs. The OSDs are spread
>> across two identical machines with 128GiB of RAM each, and there are three
>> monitor nodes on different hardware.
>>
>> Several times we hit crippling bugs with previous Ceph releases while
>> running RCs or betas, or using non-recommended configurations, so in
>> January we abandoned all previous Ceph usage, deployed LTS Ubuntu 16.04,
>> and went with stable Kraken 11.2.0 in the configuration described above.
>> Everything was fine until the end of March, when one day we found all but
>> a couple of OSDs "down" inexplicably. Investigation revealed that
>> oom_killer had come along and nuked almost all the ceph-osd processes.
>>
>> We've gone through a bunch of iterations of restarting the OSDs: bringing
>> them up one at a time, all at once, and with various configuration
>> settings to reduce cache size, as suggested in this ticket:
>> http://tracker.ceph.com/issues/18924...
>>
>> I don't know whether that ticket really pertains to our situation, and I
>> have no experience with memory allocation debugging. I'd be willing to try
>> it if someone can point me to a guide or walk me through the process.
>>
>> Just to see whether the situation was transitory, I even added over
>> 300GiB of swap to both OSD machines. Within 5-10 minutes the OSD processes
>> allocated more than 300GiB of memory and became oom_killer victims once
>> again.
>>
>> No software or hardware changes took place around the time this problem
>> started, and no significant data changes occurred either. We added about
>> 40GiB of ~1GiB files a week or so before the problem started, and that was
>> the last time data was written.
>>
>> I can only assume we've found another crippling bug of some kind; this
>> level of memory usage is entirely unprecedented. What can we do?
>>
>> Thanks in advance for any suggestions.
>> -Aaron
>>
>

--
Aaron Ten Clay
https://aarontc.com
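P.S. Peter: in the meantime, here is roughly how I plan to pull the PG counts and the cache-related settings you asked about (assuming the default admin socket setup; osd.0 and mds.<id> are just placeholders for whichever daemons I end up querying):

  # overall cluster state and total PG count
  ceph -s

  # per-pool pg_num/pgp_num, cache tier settings, EC profiles
  ceph osd pool ls detail

  # cache-related options as a running OSD actually sees them
  ceph daemon osd.0 config show | grep -i cache

  # the mds cache size you mentioned (run on the node hosting our active MDS)
  ceph daemon mds.<id> config get mds_cache_size

  # plus the raw ceph.conf from one of the OSD nodes
  cat /etc/ceph/ceph.conf

I'll post that output along with the heap profiles.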
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
