Can you provide the complete OOM message from the dmesg log?

On Sat, Dec 22, 2018 at 7:53 AM Pardhiv Karri <meher4in...@gmail.com> wrote:
>
> Thank you for the quick response, Dyweni!
>
> We are using FileStore, as this cluster was upgraded from
> Hammer --> Jewel --> Luminous 12.2.8. 16 x 2TB HDDs per node on all
> nodes. The R730xd has 128GB of RAM and the R740xd has 96GB. Everything
> else is the same.
>
> Thanks,
> Pardhiv Karri
>
> On Fri, Dec 21, 2018 at 1:43 PM Dyweni - Ceph-Users
> <6exbab4fy...@dyweni.com> wrote:
>>
>> Hi,
>>
>> You could be running out of memory due to the default Bluestore cache
>> sizes.
>>
>> How many disks/OSDs in the R730xd versus the R740xd? How much memory
>> in each server type? How many are HDD versus SSD? Are you running
>> Bluestore?
>>
>> OSDs in Luminous that run Bluestore allocate memory to use as a
>> "cache", since the kernel-provided page cache is not available to
>> Bluestore. By default, Bluestore will use 1GB of memory for each HDD
>> and 3GB of memory for each SSD. OSDs do not allocate all of that
>> memory up front, but grow into it as it is used. This cache is in
>> addition to any other memory the OSD uses.
>>
>> Check out the bluestore_cache_* values (these are specified in bytes)
>> in the manual cache sizing section of the docs
>> (http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/).
>> Note that the automatic cache sizing feature wasn't added until
>> 12.2.9.
>>
>> As an example, I have OSDs running on 32-bit/armhf nodes. These nodes
>> have 2GB of memory, and I run one Bluestore OSD on each node. In my
>> ceph.conf file, I have 'bluestore cache size = 536870912' and
>> 'bluestore cache kv max = 268435456'. I see approx. 1.35-1.4 GB used
>> by each OSD.
>>
>> On 2018-12-21 15:19, Pardhiv Karri wrote:
>>
>> Hi,
>>
>> We have a Luminous cluster that was recently upgraded from
>> Hammer --> Jewel --> Luminous 12.2.8. Post-upgrade, a few nodes keep
>> running out of memory and dying, and we are seeing the OOM killer in
>> the logs. We did not have this issue before the upgrade. The only
>> difference is that the nodes without any issue are R730xd and the
>> ones with the memory leak are R740xd. The hardware vendor doesn't see
>> anything wrong with the hardware, and from the Ceph end we see no
>> issue running the cluster; the only problem is the memory leak. Right
>> now we are rebooting the affected nodes on a schedule to avoid
>> crashes. On one R740xd node we set all the OSDs to weight 0.0, and
>> there is no memory leak there. Any pointers to fix the issue would be
>> helpful.
>>
>> Thanks,
>> Pardhiv Karri
>
>
> --
> Pardhiv Karri
> "Rise and Rise again until LAMBS become LIONS"
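
For reference, something along these lines should pull the complete OOM
killer report out of the kernel log (a sketch only; the exact wording of
the kernel message varies by kernel version, and journalctl assumes a
systemd host):

    # OOM killer report plus surrounding context from the kernel ring buffer
    dmesg -T | grep -i -B 5 -A 30 'out of memory'

    # or, on a systemd host, from the kernel journal
    journalctl -k | grep -i -B 5 -A 30 'oom'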
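
And if any of those OSDs do turn out to be running Bluestore, capping the
cache the way Dyweni describes is just a ceph.conf change. A sketch only,
reusing Dyweni's values (512MB total cache, 256MB of it for the RocksDB
key/value portion); tune them to your own RAM budget:

    [osd]
    # total Bluestore cache per OSD, in bytes (default: 1GB per HDD OSD)
    bluestore cache size = 536870912
    # maximum share of the cache given to RocksDB key/value data
    bluestore cache kv max = 268435456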
--
Cheers,
Brad