Hello everyone,

We are running a small cluster of 5 machines with 48 OSDs / 5 MDSs / 5 MONs, based on Luminous 12.2.10 and Debian 9.6 (Stretch). With a single-MDS configuration everything works fine; looking at the active MDS's memory, it uses ~1 GByte for its cache, as configured:
$ watch ceph tell mds.$(hostname) heap stats
mds.e tcmalloc heap stats:------------------------------------------------
MALLOC: 1172867096 ( 1118.5 MiB) Bytes in use by application
MALLOC: + 0 ( 0.0 MiB) Bytes in page heap freelist
MALLOC: + 39289912 ( 37.5 MiB) Bytes in central cache freelist
MALLOC: + 17245344 ( 16.4 MiB) Bytes in transfer cache freelist
MALLOC: + 34303760 ( 32.7 MiB) Bytes in thread cache freelists
MALLOC: + 5796032 ( 5.5 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 1269502144 ( 1210.7 MiB) Actual memory used (physical + swap)
MALLOC: + 19775488 ( 18.9 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 1289277632 ( 1229.6 MiB) Virtual address space used
MALLOC:
MALLOC: 70430 Spans in use
MALLOC: 17 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
-------------
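For context, the ~1 GByte figure above matches the MDS cache target. A way to double-check it (run on the MDS host via the admin socket; "mds.e" / the hostname are from our cluster, and the default of 1 GiB on Luminous is assumed):

```shell
# Query the configured MDS cache target on the local MDS daemon
# (defaults to 1073741824 bytes, i.e. 1 GiB, on Luminous).
ceph daemon mds.$(hostname) config get mds_cache_memory_limit

# Note: the tcmalloc heap figures are expected to sit somewhat above
# this limit, since the limit covers cached metadata objects, not the
# daemon's total allocator footprint.
```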
$ ceph versions
{
    "mon": {
        "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 5
    },
    "mgr": {
        "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 3
    },
    "osd": {
        "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 48
    },
    "mds": {
        "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 5
    },
    "overall": {
        "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 61
    }
}
-------------
$ ceph -s
  cluster:
    id:     .... c9024
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum a,b,c,d,e
    mgr: libra(active), standbys: b, a
    mds: cephfs-1/1/1 up {0=e=up:active}, 1 up:standby-replay, 3 up:standby
    osd: 48 osds: 48 up, 48 in

  data:
    pools:   2 pools, 2052 pgs
    objects: 44.44M objects, 52.3TiB
    usage:   107TiB used, 108TiB / 216TiB avail
    pgs:     2051 active+clean
             1    active+clean+scrubbing+deep

  io:
    client: 85.3KiB/s rd, 3.17MiB/s wr, 45op/s rd, 26op/s wr
-------------
However, as soon as we use "ceph fs set cephfs max_mds 2" to add a second
active MDS, things get out of hand within seconds, although in a rather
unexpected way: the standby MDS that is promoted works fine and shows normal
memory consumption. However, the two daemons that start replaying the journal
in order to become standby servers immediately accumulate dozens of GBytes of
memory, growing to about 150 GBytes, and almost immediately start using swap
space. This drives the load up to about 80 within seconds and makes all other
processes (mainly OSDs) unresponsive.
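For reference, the exact sequence we use to scale up, together with one way to watch the daemons while the journals are replayed (Luminous command syntax; "cephfs" is our filesystem name):

```shell
# Allow a second active MDS rank; one standby is promoted automatically.
ceph fs set cephfs max_mds 2

# Watch rank assignments and standby/standby-replay states live.
watch ceph mds stat

# On an affected host, try to capture heap stats before swapping makes
# the machine unresponsive.
ceph tell mds.$(hostname) heap stats
```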
As the machine becomes essentially unreachable when this happens, it is only
possible to collect memory statistics right when things start to go wrong.
After that, no memory dump can be obtained anymore, because the OS as a whole
is blocked by swapping.
$ watch ceph tell mds.$(hostname) heap stats
mds.a tcmalloc heap stats:------------------------------------------------
MALLOC: 36113137024 (34440.2 MiB) Bytes in use by application
MALLOC: + 0 ( 0.0 MiB) Bytes in page heap freelist
MALLOC: + 7723144 ( 7.4 MiB) Bytes in central cache freelist
MALLOC: + 2523264 ( 2.4 MiB) Bytes in transfer cache freelist
MALLOC: + 2460024 ( 2.3 MiB) Bytes in thread cache freelists
MALLOC: + 41185472 ( 39.3 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 36167028928 (34491.6 MiB) Actual memory used (physical + swap)
MALLOC: + 1417216 ( 1.4 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 36168446144 (34492.9 MiB) Virtual address space used
MALLOC:
MALLOC: 38476 Spans in use
MALLOC: 13 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
-------------
Please also find attached the compressed log file of one of the two new
standby MDSs while it is trying to replay the fs journal.
As soon as the number of MDSs is set back to 1 (using "ceph fs set cephfs
max_mds 1" and "ceph mds deactivate 1"), things calm down and the cluster
returns to normal. Is this a known problem with Luminous, and what can be done
about it so that the multi-MDS feature can be used?
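For completeness, the revert sequence that calms things down again (Luminous syntax; on later releases, rank deactivation happens automatically when max_mds is lowered):

```shell
# Drop back to a single active MDS rank ...
ceph fs set cephfs max_mds 1
# ... and explicitly deactivate rank 1 (required on Luminous).
ceph mds deactivate 1
# Confirm the filesystem is back to a single active rank.
ceph fs get cephfs | grep max_mds
```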
As all servers here run Debian, upgrading to Mimic is unfortunately not an
option, since it apparently cannot / will not be made available for Debian
Stretch due to the toolchain issue described elsewhere.
Thank you for any help and pointers in the right direction!
Best,
Matthias
----------------------------------------------------------------------------------------------------
dizmo - The Interface of Things
http://www.dizmo.com, Phone +41 52 267 88 50, Twitter @dizmos
dizmo inc, Universitätsstrasse 53, CH-8006 Zurich, Switzerland
Attachment: Log of mds.b replaying fs journal.tbz (binary data)
