Hello

We are experiencing an issue where our Ceph MDS gobbles up 500 GB of RAM, is 
killed by the kernel OOM killer, restarts, and then repeats the cycle. We have 
three MDS daemons on different machines, and all of them exhibit this behavior. 
We are running the following versions (from Docker):


  *   ceph/daemon:v3.2.1-stable-3.2-luminous-centos-7
  *   ceph/daemon:v3.2.1-stable-3.2-luminous-centos-7
  *   ceph/daemon:v3.1.0-stable-3.1-luminous-centos-7 (downgraded in a 
last-ditch effort to resolve the issue; it didn't help)

The machines hosting the MDS instances have 512 GB of RAM. We tried adding 
swap, and the MDS just started eating into the swap (and got really slow, 
eventually being kicked out for exceeding the mds_beacon_grace of 240 seconds). 
We have set mds_cache_memory_limit to many values, ranging from 200 GB down to 
the default of 1073741824 (1 GiB), and the result of replay is always the same: 
the MDS keeps allocating memory until the kernel OOM killer stops it (or until 
the mds_beacon_grace period expires, if swap is enabled).
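
For concreteness, this is roughly how we've been setting the limit (the 200 GB 
figure is just one of the values we tried; the byte count is 200 * 2^30):

    # ceph.conf fragment
    [mds]
    mds_cache_memory_limit = 214748364800

    # or at runtime, without restarting the daemon:
    ceph tell mds.* injectargs '--mds_cache_memory_limit=214748364800'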

Before it died, the active MDS reported 1.592 million inodes (ceph_mds_inodes) 
and 1.493 million caps (ceph_mds_caps) to Prometheus.

This appears to be the same problem as 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030872.html

At this point I feel like my best option is to destroy the MDS journal and 
hope things come back. We can probably recover from this, but I'd like to 
prevent it from happening again in the future. Any advice?
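
For the record, the steps I'm considering are roughly the following (a sketch 
based on the CephFS disaster-recovery docs, exporting a backup first so the 
reset is at least partially reversible):

    # back up the journal before touching anything
    cephfs-journal-tool journal export backup.journal.bin

    # inspect it; only reset if it is actually damaged
    cephfs-journal-tool journal inspect

    # last resort: wipe the journal (metadata ops not yet
    # flushed to the backing pool will be lost)
    cephfs-journal-tool journal reset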


Neale Pickett <ne...@lanl.gov>
A-4: Advanced Research in Cyber Systems
Los Alamos National Laboratory
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com