Can you provide the complete OOM message from the dmesg log?
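
Something like the following should pull the whole thing out of the kernel
ring buffer, assuming it hasn't rotated away:

  dmesg -T | egrep -i -A 20 'out of memory|oom-killer'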

On Sat, Dec 22, 2018 at 7:53 AM Pardhiv Karri <meher4in...@gmail.com> wrote:
>
>
> Thank You for the quick response Dyweni!
>
> We are using FileStore, as this cluster was upgraded from
> Hammer-->Jewel-->Luminous 12.2.8. All nodes have 16x2TB HDDs. The R730xd
> has 128GB of RAM and the R740xd has 96GB. Everything else is the same.
>
> Thanks,
> Pardhiv Karri
>
> On Fri, Dec 21, 2018 at 1:43 PM Dyweni - Ceph-Users <6exbab4fy...@dyweni.com> 
> wrote:
>>
>> Hi,
>>
>>
>> You could be running out of memory due to the default Bluestore cache sizes.
>>
>>
>> How many disks/OSDs in the R730xd versus the R740xd?  How much memory in 
>> each server type?  How many are HDD versus SSD?  Are you running Bluestore?
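>>
>> If you're unsure which backend an OSD is running, something like this
>> should show it:
>>
>>   ceph osd metadata <id> | grep osd_objectstore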
>>
>>
>> OSDs in Luminous that run Bluestore allocate memory to use as a "cache",
>> since the kernel-provided page cache is not available to Bluestore.
>> By default, Bluestore will use 1GB of memory for each HDD and 3GB of
>> memory for each SSD. OSDs do not allocate all of that memory up front,
>> but grow into it as it is used. This cache is in addition to any other
>> memory the OSD uses.
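>>
>> As a rough sketch of what that implies (assuming 16 HDD OSDs per node
>> and, say, ~1.5GB of baseline RSS per OSD on top of the cache -- both
>> figures are illustrative estimates, not measurements):
>>
>>   16 OSDs x (1GB cache + ~1.5GB baseline) =~ 40GB per node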
>>
>>
>> Check out the bluestore_cache_* values (these are specified in bytes) in
>> the manual cache sizing section of the docs
>> (http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/).
>> Note that the automatic cache sizing feature wasn't added until 12.2.9.
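>>
>> For example, a minimal ceph.conf sketch (the 512MB/1GB values here are
>> placeholders to illustrate the options, not recommendations):
>>
>>   [osd]
>>   bluestore cache size hdd = 536870912    # 512MB per HDD OSD
>>   bluestore cache size ssd = 1073741824   # 1GB per SSD OSD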
>>
>>
>>
>> As an example, I have OSDs running on 32bit/armhf nodes. These nodes have
>> 2GB of memory. I run 1 Bluestore OSD on each node. In my ceph.conf file,
>> I have 'bluestore cache size = 536870912' and 'bluestore cache kv max =
>> 268435456'. I see approximately 1.35-1.4GB used by each OSD.
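>>
>> To see where an OSD's memory is actually going, you can dump its memory
>> pools via the admin socket on the OSD host, e.g.:
>>
>>   ceph daemon osd.0 dump_mempools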
>>
>>
>>
>>
>> On 2018-12-21 15:19, Pardhiv Karri wrote:
>>
>> Hi,
>>
>> We have a Luminous cluster which was recently upgraded from Hammer -->
>> Jewel --> Luminous 12.2.8. Post-upgrade we are seeing an issue with a few
>> nodes where they run out of memory and die. In the logs we are seeing the
>> OOM killer. We didn't have this issue before the upgrade. The only
>> difference is that the nodes without any issue are R730xd and the ones
>> with the memory leak are R740xd. The hardware vendor doesn't see anything
>> wrong with the hardware. From the Ceph end we are not seeing any issue
>> running the cluster; the only problem is the memory leak. Right now we
>> are proactively rebooting the nodes to avoid crashes. On one R740xd node
>> we set all the OSD weights to 0.0, and there is no memory leak there. Any
>> pointers to fix the issue would be helpful.
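>>
>> For reference, the weights were zeroed with something like the following
>> (assuming crush weights; osd.<id> stands in for each OSD on that node):
>>
>>   ceph osd crush reweight osd.<id> 0.0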
>>
>> Thanks,
>> Pardhiv Karri
>>
>
>
>
> --
> Pardhiv Karri
> "Rise and Rise again until LAMBS become LIONS"



-- 
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
