Hello everyone,

We are running a small cluster of 5 machines with 48 OSDs / 5 MDSs / 5 MONs, based on Luminous 12.2.10 and Debian 9.6 (Stretch). With a single-MDS configuration everything works fine; looking at the active MDS's memory, it uses ~1 GByte for its cache, as configured:
$ watch ceph tell mds.$(hostname) heap stats
mds.e tcmalloc heap stats:------------------------------------------------
MALLOC: 1172867096 ( 1118.5 MiB) Bytes in use by application
MALLOC: + 0 ( 0.0 MiB) Bytes in page heap freelist
MALLOC: + 39289912 ( 37.5 MiB) Bytes in central cache freelist
MALLOC: + 17245344 ( 16.4 MiB) Bytes in transfer cache freelist
MALLOC: + 34303760 ( 32.7 MiB) Bytes in thread cache freelists
MALLOC: + 5796032 ( 5.5 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 1269502144 ( 1210.7 MiB) Actual memory used (physical + swap)
MALLOC: + 19775488 ( 18.9 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 1289277632 ( 1229.6 MiB) Virtual address space used
MALLOC:
MALLOC: 70430 Spans in use
MALLOC: 17 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
-------------
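For context, the ~1 GByte figure above matches the MDS cache target. A way to double-check it (run on the MDS host via the admin socket; "mds.e" / the hostname are from our cluster, and the default of 1 GiB on Luminous is assumed):

```shell
# Query the configured MDS cache target on the local MDS daemon
# (defaults to 1073741824 bytes, i.e. 1 GiB, on Luminous).
ceph daemon mds.$(hostname) config get mds_cache_memory_limit

# Note: the tcmalloc heap figures are expected to sit somewhat above
# this limit, since the limit covers cached metadata objects, not the
# daemon's total allocator footprint.
```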
$ ceph versions
{
    "mon": {
        "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 5
    },
    "mgr": {
        "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 3
    },
    "osd": {
        "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 48
    },
    "mds": {
        "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 5
    },
    "overall": {
        "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 61
    }
}
-------------
$ ceph -s
  cluster:
    id:     .... c9024
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum a,b,c,d,e
    mgr: libra(active), standbys: b, a
    mds: cephfs-1/1/1 up {0=e=up:active}, 1 up:standby-replay, 3 up:standby
    osd: 48 osds: 48 up, 48 in

  data:
    pools:   2 pools, 2052 pgs
    objects: 44.44M objects, 52.3TiB
    usage:   107TiB used, 108TiB / 216TiB avail
    pgs:     2051 active+clean
             1    active+clean+scrubbing+deep

  io:
    client: 85.3KiB/s rd, 3.17MiB/s wr, 45op/s rd, 26op/s wr
-------------
However, as soon as we use "ceph fs set cephfs max_mds 2" to add a second
active MDS, things get out of hand within seconds, although in a rather
unexpected way: the standby MDS that is promoted works fine and shows normal
memory consumption. However, the two daemons that start replaying the journal
in order to become standby servers immediately accumulate dozens of GBytes of
memory, growing to about 150 GBytes, and almost immediately start using swap
space. This drives the load up to about 80 within seconds and makes all other
processes (mainly OSDs) unresponsive.
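For reference, the exact sequence we use to scale up, together with one way to watch the daemons while the journals are replayed (Luminous command syntax; "cephfs" is our filesystem name):

```shell
# Allow a second active MDS rank; one standby is promoted automatically.
ceph fs set cephfs max_mds 2

# Watch rank assignments and standby/standby-replay states live.
watch ceph mds stat

# On an affected host, try to capture heap stats before swapping makes
# the machine unresponsive.
ceph tell mds.$(hostname) heap stats
```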
As the machine becomes essentially unreachable when this happens, it is only
possible to collect memory statistics right when things start to go wrong.
After that, no memory dump can be obtained anymore, because the OS as a whole
is blocked by swapping.
$ watch ceph tell mds.$(hostname) heap stats
mds.a tcmalloc heap stats:------------------------------------------------
MALLOC: 36113137024 (34440.2 MiB) Bytes in use by application
MALLOC: + 0 ( 0.0 MiB) Bytes in page heap freelist
MALLOC: + 7723144 ( 7.4 MiB) Bytes in central cache freelist
MALLOC: + 2523264 ( 2.4 MiB) Bytes in transfer cache freelist
MALLOC: + 2460024 ( 2.3 MiB) Bytes in thread cache freelists
MALLOC: + 41185472 ( 39.3 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 36167028928 (34491.6 MiB) Actual memory used (physical + swap)
MALLOC: + 1417216 ( 1.4 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 36168446144 (34492.9 MiB) Virtual address space used
MALLOC:
MALLOC: 38476 Spans in use
MALLOC: 13 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
-------------
Please also find attached the compressed log file of one of the two new
standby MDSs while it is trying to replay the fs journal.
As soon as the number of MDSs is set back to 1 (using "ceph fs set cephfs
max_mds 1" and "ceph mds deactivate 1"), things calm down and the cluster
returns to normal. Is this a known problem with Luminous, and what can be done
about it so that the multi-MDS feature can be used?
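For completeness, the revert sequence that calms things down again (Luminous syntax; on later releases, rank deactivation happens automatically when max_mds is lowered):

```shell
# Drop back to a single active MDS rank ...
ceph fs set cephfs max_mds 1
# ... and explicitly deactivate rank 1 (required on Luminous).
ceph mds deactivate 1
# Confirm the filesystem is back to a single active rank.
ceph fs get cephfs | grep max_mds
```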
As all servers here run Debian, upgrading to Mimic is unfortunately not an
option, since it apparently cannot / will not be made available for Debian
Stretch due to the toolchain issue described elsewhere.
Thank you for any help and pointers in the right direction!
Best,
Matthias
----------------------------------------------------------------------------------------------------
dizmo - The Interface of Things
http://www.dizmo.com, Phone +41 52 267 88 50, Twitter @dizmos
dizmo inc, Universitätsstrasse 53, CH-8006 Zurich, Switzerland
Attachment: Log of mds.b replaying fs journal.tbz (binary data)
