Thanks for the recommendation, Bob! I'll try to get this data later today
and reply with it.

-Aaron

On Sat, Apr 15, 2017 at 9:46 AM, Bob R <[email protected]> wrote:

> I'd recommend running through these steps and posting the output as well
> http://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/
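>
> Roughly, the steps boil down to something like this (see the doc above
> for the authoritative version; the osd id and dump path are just examples):
>
>     ceph tell osd.0 heap start_profiler   # begin tcmalloc heap profiling
>     ceph tell osd.0 heap stats            # quick summary of heap usage
>     ceph tell osd.0 heap dump             # write a heap dump for analysis
>     ceph tell osd.0 heap stop_profiler
>
>     # then inspect the dump, e.g.:
>     google-pprof --text /usr/bin/ceph-osd /var/log/ceph/osd.0.profile.0001.heap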
>
> Bob
>
> On Sat, Apr 15, 2017 at 5:39 AM, Peter Maloney <[email protected]> wrote:
>
>> How many PGs do you have? And did you change any config, like mds cache
>> size? Show your ceph.conf.
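>>
>> Something like the following should capture what I'm asking about (adjust
>> the daemon ids to yours; I'm writing these from memory, so double-check
>> the exact command names):
>>
>>     ceph osd pool ls detail                          # pg_num / pgp_num per pool
>>     ceph daemon osd.0 config diff                    # options changed from defaults
>>     ceph daemon mds.a config show | grep mds_cache   # effective mds cache settings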
>>
>>
>> On 04/15/17 07:34, Aaron Ten Clay wrote:
>>
>> Hi all,
>>
>> Our cluster is experiencing a very odd issue and I'm hoping for some
>> guidance on troubleshooting steps and/or suggestions to mitigate the issue.
>> tl;dr: Individual ceph-osd processes try to allocate > 90GiB of RAM and are
>> eventually nuked by oom_killer.
>>
>> I'll try to explain the situation in detail:
>>
>> We have twenty-four 4TB bluestore HDD OSDs and four 600GB SSD OSDs. The SSD
>> OSDs are in a different CRUSH "root", used as a cache tier for the main storage
>> pools, which are erasure coded and used for cephfs. The OSDs are spread
>> across two identical machines with 128GiB of RAM each, and there are three
>> monitor nodes on different hardware.
>>
>> Several times we've encountered crippling bugs with previous Ceph
>> releases when we were on RCs or betas, or were using non-recommended
>> configurations, so in January we abandoned all previous Ceph usage,
>> deployed LTS Ubuntu 16.04, and went with stable Kraken 11.2.0 with the
>> configuration mentioned above. Everything was fine until the end of March,
>> when one day we found all but a couple of OSDs inexplicably "down".
>> Investigation revealed that oom_killer had come along and nuked almost all the
>> ceph-osd processes.
>>
>> We've gone through many iterations of restarting the OSDs: bringing them
>> up gradually one at a time, all at once, and with various configuration
>> settings to reduce cache sizes, as suggested in this ticket:
>> http://tracker.ceph.com/issues/18924...
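>>
>> For reference, the cache-related settings we experimented with looked
>> roughly like the following (reconstructing the option names from memory,
>> so they may not match what Kraken actually accepts):
>>
>>     [osd]
>>     bluestore_cache_size = 104857600   # try to cap the bluestore cache at ~100MiB
>>     osd_map_cache_size = 50            # keep fewer osdmaps cached per OSD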
>>
>> I don't know whether that ticket really pertains to our situation, as I
>> have no experience with memory allocation debugging. I'd be willing to try
>> if someone can point me to a guide or walk me through the process.
>>
>> I've even tried, just to see whether the situation was transitory, adding
>> over 300GiB of swap to both OSD machines. Within 5-10 minutes the OSD
>> processes managed to allocate more than 300GiB of additional memory and
>> became oom_killer victims once again.
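>>
>> If it helps to see the pattern, the growth is easy to watch with something
>> along the lines of
>>
>>     watch -n 10 'ps -eo pid,rss,vsz,comm --sort=-rss | grep ceph-osd'
>>
>> and the RSS of each ceph-osd just climbs steadily until oom_killer fires.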
>>
>> No software or hardware changes took place around the time this problem
>> started, and no significant data changes occurred either. We added about
>> 40GiB of ~1GiB files a week or so before the problem started, and that's the
>> last time any data was written.
>>
>> I can only assume we've found another crippling bug of some kind; this
>> level of memory usage is entirely unprecedented. What can we do?
>>
>> Thanks in advance for any suggestions.
>> -Aaron
>>
>>
>>
>>
>


-- 
Aaron Ten Clay
https://aarontc.com
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
