Here is the result:
root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net flush journal
{
"message": "",
"return_code": 0
}
root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net config set mds_cache_size 10000
{
"success": "mds_cache_size = '10000' (not observed, change may require restart) "
}
wait ...
root@ceph4-2:~# ceph tell mds.ceph4-2.odiso.net heap stats
2018-05-25 07:44:02.185911 7f4cad7fa700 0 client.50748489 ms_handle_reset on 10.5.0.88:6804/994206868
2018-05-25 07:44:02.196160 7f4cae7fc700 0 client.50792764 ms_handle_reset on 10.5.0.88:6804/994206868
mds.ceph4-2.odiso.net tcmalloc heap stats:------------------------------------------------
MALLOC: 13175782328 (12565.4 MiB) Bytes in use by application
MALLOC: + 0 ( 0.0 MiB) Bytes in page heap freelist
MALLOC: + 1774628488 ( 1692.4 MiB) Bytes in central cache freelist
MALLOC: + 34274608 ( 32.7 MiB) Bytes in transfer cache freelist
MALLOC: + 57260176 ( 54.6 MiB) Bytes in thread cache freelists
MALLOC: + 120582336 ( 115.0 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 15162527936 (14460.1 MiB) Actual memory used (physical + swap)
MALLOC: + 4974067712 ( 4743.6 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 20136595648 (19203.8 MiB) Virtual address space used
MALLOC:
MALLOC: 1852388 Spans in use
MALLOC: 18 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net config set mds_cache_size 0
{
"success": "mds_cache_size = '0' (not observed, change may require restart)
"
}
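
Side note: per the tcmalloc hint above, the memory sitting in the freelists (roughly 1.8 GiB here) can be handed back to the OS through the same tell interface; a minimal sketch, not run in this session:

ceph tell mds.ceph4-2.odiso.net heap release
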
----- Original Message -----
From: "Zheng Yan" <[email protected]>
To: "aderumier" <[email protected]>
Sent: Friday, May 25, 2018 05:56:31
Subject: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?
On Thu, May 24, 2018 at 11:34 PM, Alexandre DERUMIER
<[email protected]> wrote:
>>>Still can't find any clue. Does the cephfs have an idle period? If it
>>>has, could you decrease the mds's cache size and check what happens? For
>>>example, run the following commands during the idle period.
>
>>>ceph daemon mds.xx flush journal
>>>ceph daemon mds.xx config set mds_cache_size 10000;
>>>"wait a minute"
>>>ceph tell mds.xx heap stats
>>>ceph daemon mds.xx config set mds_cache_size 0
>
> OK, thanks. I'll try it tonight.
>
> I already have mds_cache_memory_limit = 5368709120;
>
> do I need to remove it first before setting mds_cache_size to 10000?
no
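
For reference, the limits actually in effect can be double-checked through the admin socket before and after the test; a minimal sketch, with mds.xx standing in for the real daemon name:

ceph daemon mds.xx config get mds_cache_memory_limit
ceph daemon mds.xx config get mds_cache_size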
>
>
>
>
> ----- Original Message -----
> From: "Zheng Yan" <[email protected]>
> To: "aderumier" <[email protected]>
> Cc: "ceph-users" <[email protected]>
> Sent: Thursday, May 24, 2018 16:27:21
> Subject: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?
>
> On Thu, May 24, 2018 at 7:22 PM, Alexandre DERUMIER <[email protected]>
> wrote:
>> Thanks!
>>
>>
>> Here is the profile.pdf.
>>
>> 10-15 min of profiling; I can't run it longer because my clients were lagging.
>>
>> But I think it should be enough to observe the RSS memory increase.
>>
>>
>
> Still can't find any clue. Does the cephfs have an idle period? If it
> has, could you decrease the mds's cache size and check what happens? For
> example, run the following commands during the idle period.
>
> ceph daemon mds.xx flush journal
> ceph daemon mds.xx config set mds_cache_size 10000;
> "wait a minute"
> ceph tell mds.xx heap stats
> ceph daemon mds.xx config set mds_cache_size 0
>
>
>>
>>
>> ----- Original Message -----
>> From: "Zheng Yan" <[email protected]>
>> To: "aderumier" <[email protected]>
>> Cc: "ceph-users" <[email protected]>
>> Sent: Thursday, May 24, 2018 11:34:20
>> Subject: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?
>>
>> On Tue, May 22, 2018 at 3:11 PM, Alexandre DERUMIER <[email protected]>
>> wrote:
>>> Hi, some new stats: mds memory is now 16G.
>>>
>>> I have almost the same number of items and bytes in cache as some weeks ago,
>>> when the mds was using 8G. (ceph 12.2.5)
>>>
>>>
>>> root@ceph4-2:~# while sleep 1; do ceph daemon mds.ceph4-2.odiso.net perf dump | jq '.mds_mem.rss'; ceph daemon mds.ceph4-2.odiso.net dump_mempools | jq -c '.mds_co'; done
>>> 16905052
>>> {"items":43350988,"bytes":5257428143}
>>> 16905052
>>> {"items":43428329,"bytes":5283850173}
>>> 16905052
>>> {"items":43209167,"bytes":5208578149}
>>> 16905052
>>> {"items":43177631,"bytes":5198833577}
>>> 16905052
>>> {"items":43312734,"bytes":5252649462}
>>> 16905052
>>> {"items":43355753,"bytes":5277197972}
>>> 16905052
>>> {"items":43700693,"bytes":5303376141}
>>> 16905052
>>> {"items":43115809,"bytes":5156628138}
>>> ^C
>>>
>>>
>>>
>>>
>>> root@ceph4-2:~# ceph status
>>> cluster:
>>> id: e22b8e83-3036-4fe5-8fd5-5ce9d539beca
>>> health: HEALTH_OK
>>>
>>> services:
>>> mon: 3 daemons, quorum ceph4-1,ceph4-2,ceph4-3
>>> mgr: ceph4-1.odiso.net(active), standbys: ceph4-2.odiso.net,
>>> ceph4-3.odiso.net
>>> mds: cephfs4-1/1/1 up {0=ceph4-2.odiso.net=up:active}, 2 up:standby
>>> osd: 18 osds: 18 up, 18 in
>>> rgw: 3 daemons active
>>>
>>> data:
>>> pools: 11 pools, 1992 pgs
>>> objects: 75677k objects, 6045 GB
>>> usage: 20579 GB used, 6246 GB / 26825 GB avail
>>> pgs: 1992 active+clean
>>>
>>> io:
>>> client: 14441 kB/s rd, 2550 kB/s wr, 371 op/s rd, 95 op/s wr
>>>
>>>
>>> root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net cache status
>>> {
>>> "pool": {
>>> "items": 44523608,
>>> "bytes": 5326049009
>>> }
>>> }
>>>
>>>
>>> root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net perf dump
>>> {
>>> "AsyncMessenger::Worker-0": {
>>> "msgr_recv_messages": 798876013,
>>> "msgr_send_messages": 825999506,
>>> "msgr_recv_bytes": 7003223097381,
>>> "msgr_send_bytes": 691501283744,
>>> "msgr_created_connections": 148,
>>> "msgr_active_connections": 146,
>>> "msgr_running_total_time": 39914.832387470,
>>> "msgr_running_send_time": 13744.704199430,
>>> "msgr_running_recv_time": 32342.160588451,
>>> "msgr_running_fast_dispatch_time": 5996.336446782
>>> },
>>> "AsyncMessenger::Worker-1": {
>>> "msgr_recv_messages": 429668771,
>>> "msgr_send_messages": 414760220,
>>> "msgr_recv_bytes": 5003149410825,
>>> "msgr_send_bytes": 396281427789,
>>> "msgr_created_connections": 132,
>>> "msgr_active_connections": 132,
>>> "msgr_running_total_time": 23644.410515392,
>>> "msgr_running_send_time": 7669.068710688,
>>> "msgr_running_recv_time": 19751.610043696,
>>> "msgr_running_fast_dispatch_time": 4331.023453385
>>> },
>>> "AsyncMessenger::Worker-2": {
>>> "msgr_recv_messages": 1312910919,
>>> "msgr_send_messages": 1260040403,
>>> "msgr_recv_bytes": 5330386980976,
>>> "msgr_send_bytes": 3341965016878,
>>> "msgr_created_connections": 143,
>>> "msgr_active_connections": 138,
>>> "msgr_running_total_time": 61696.635450100,
>>> "msgr_running_send_time": 23491.027014598,
>>> "msgr_running_recv_time": 53858.409319734,
>>> "msgr_running_fast_dispatch_time": 4312.451966809
>>> },
>>> "finisher-PurgeQueue": {
>>> "queue_len": 0,
>>> "complete_latency": {
>>> "avgcount": 1889416,
>>> "sum": 29224.227703697,
>>> "avgtime": 0.015467333
>>> }
>>> },
>>> "mds": {
>>> "request": 1822420924,
>>> "reply": 1822420886,
>>> "reply_latency": {
>>> "avgcount": 1822420886,
>>> "sum": 5258467.616943274,
>>> "avgtime": 0.002885429
>>> },
>>> "forward": 0,
>>> "dir_fetch": 116035485,
>>> "dir_commit": 1865012,
>>> "dir_split": 17,
>>> "dir_merge": 24,
>>> "inode_max": 2147483647,
>>> "inodes": 1600438,
>>> "inodes_top": 210492,
>>> "inodes_bottom": 100560,
>>> "inodes_pin_tail": 1289386,
>>> "inodes_pinned": 1299735,
>>> "inodes_expired": 22223476046,
>>> "inodes_with_caps": 1299137,
>>> "caps": 2211546,
>>> "subtrees": 2,
>>> "traverse": 1953482456,
>>> "traverse_hit": 1127647211,
>>> "traverse_forward": 0,
>>> "traverse_discover": 0,
>>> "traverse_dir_fetch": 105833969,
>>> "traverse_remote_ino": 31686,
>>> "traverse_lock": 4344,
>>> "load_cent": 182244014474,
>>> "q": 104,
>>> "exported": 0,
>>> "exported_inodes": 0,
>>> "imported": 0,
>>> "imported_inodes": 0
>>> },
>>> "mds_cache": {
>>> "num_strays": 14980,
>>> "num_strays_delayed": 7,
>>> "num_strays_enqueuing": 0,
>>> "strays_created": 1672815,
>>> "strays_enqueued": 1659514,
>>> "strays_reintegrated": 666,
>>> "strays_migrated": 0,
>>> "num_recovering_processing": 0,
>>> "num_recovering_enqueued": 0,
>>> "num_recovering_prioritized": 0,
>>> "recovery_started": 2,
>>> "recovery_completed": 2,
>>> "ireq_enqueue_scrub": 0,
>>> "ireq_exportdir": 0,
>>> "ireq_flush": 0,
>>> "ireq_fragmentdir": 41,
>>> "ireq_fragstats": 0,
>>> "ireq_inodestats": 0
>>> },
>>> "mds_log": {
>>> "evadd": 357717092,
>>> "evex": 357717106,
>>> "evtrm": 357716741,
>>> "ev": 105198,
>>> "evexg": 0,
>>> "evexd": 365,
>>> "segadd": 437124,
>>> "segex": 437124,
>>> "segtrm": 437123,
>>> "seg": 130,
>>> "segexg": 0,
>>> "segexd": 1,
>>> "expos": 6916004026339,
>>> "wrpos": 6916179441942,
>>> "rdpos": 6319502327537,
>>> "jlat": {
>>> "avgcount": 59071693,
>>> "sum": 120823.311894779,
>>> "avgtime": 0.002045367
>>> },
>>> "replayed": 104847
>>> },
>>> "mds_mem": {
>>> "ino": 1599422,
>>> "ino+": 22152405695,
>>> "ino-": 22150806273,
>>> "dir": 256943,
>>> "dir+": 18460298,
>>> "dir-": 18203355,
>>> "dn": 1600689,
>>> "dn+": 22227888283,
>>> "dn-": 22226287594,
>>> "cap": 2211546,
>>> "cap+": 1674287476,
>>> "cap-": 1672075930,
>>> "rss": 16905052,
>>> "heap": 313916,
>>> "buf": 0
>>> },
>>> "mds_server": {
>>> "dispatch_client_request": 1964131912,
>>> "dispatch_server_request": 0,
>>> "handle_client_request": 1822420924,
>>> "handle_client_session": 15557609,
>>> "handle_slave_request": 0,
>>> "req_create": 4116952,
>>> "req_getattr": 18696543,
>>> "req_getfilelock": 0,
>>> "req_link": 6625,
>>> "req_lookup": 1425824734,
>>> "req_lookuphash": 0,
>>> "req_lookupino": 0,
>>> "req_lookupname": 8703,
>>> "req_lookupparent": 0,
>>> "req_lookupsnap": 0,
>>> "req_lssnap": 0,
>>> "req_mkdir": 371878,
>>> "req_mknod": 0,
>>> "req_mksnap": 0,
>>> "req_open": 351119806,
>>> "req_readdir": 17103599,
>>> "req_rename": 2437529,
>>> "req_renamesnap": 0,
>>> "req_rmdir": 78789,
>>> "req_rmsnap": 0,
>>> "req_rmxattr": 0,
>>> "req_setattr": 4547650,
>>> "req_setdirlayout": 0,
>>> "req_setfilelock": 633219,
>>> "req_setlayout": 0,
>>> "req_setxattr": 2,
>>> "req_symlink": 2520,
>>> "req_unlink": 1589288
>>> },
>>> "mds_sessions": {
>>> "session_count": 321,
>>> "session_add": 383,
>>> "session_remove": 62
>>> },
>>> "objecter": {
>>> "op_active": 0,
>>> "op_laggy": 0,
>>> "op_send": 197932443,
>>> "op_send_bytes": 605992324653,
>>> "op_resend": 22,
>>> "op_reply": 197932421,
>>> "op": 197932421,
>>> "op_r": 116256030,
>>> "op_w": 81676391,
>>> "op_rmw": 0,
>>> "op_pg": 0,
>>> "osdop_stat": 1518341,
>>> "osdop_create": 4314348,
>>> "osdop_read": 79810,
>>> "osdop_write": 59151421,
>>> "osdop_writefull": 237358,
>>> "osdop_writesame": 0,
>>> "osdop_append": 0,
>>> "osdop_zero": 2,
>>> "osdop_truncate": 9,
>>> "osdop_delete": 2320714,
>>> "osdop_mapext": 0,
>>> "osdop_sparse_read": 0,
>>> "osdop_clonerange": 0,
>>> "osdop_getxattr": 29426577,
>>> "osdop_setxattr": 8628696,
>>> "osdop_cmpxattr": 0,
>>> "osdop_rmxattr": 0,
>>> "osdop_resetxattrs": 0,
>>> "osdop_tmap_up": 0,
>>> "osdop_tmap_put": 0,
>>> "osdop_tmap_get": 0,
>>> "osdop_call": 0,
>>> "osdop_watch": 0,
>>> "osdop_notify": 0,
>>> "osdop_src_cmpxattr": 0,
>>> "osdop_pgls": 0,
>>> "osdop_pgls_filter": 0,
>>> "osdop_other": 13551599,
>>> "linger_active": 0,
>>> "linger_send": 0,
>>> "linger_resend": 0,
>>> "linger_ping": 0,
>>> "poolop_active": 0,
>>> "poolop_send": 0,
>>> "poolop_resend": 0,
>>> "poolstat_active": 0,
>>> "poolstat_send": 0,
>>> "poolstat_resend": 0,
>>> "statfs_active": 0,
>>> "statfs_send": 0,
>>> "statfs_resend": 0,
>>> "command_active": 0,
>>> "command_send": 0,
>>> "command_resend": 0,
>>> "map_epoch": 3907,
>>> "map_full": 0,
>>> "map_inc": 601,
>>> "osd_sessions": 18,
>>> "osd_session_open": 20,
>>> "osd_session_close": 2,
>>> "osd_laggy": 0,
>>> "omap_wr": 3595801,
>>> "omap_rd": 232070972,
>>> "omap_del": 272598
>>> },
>>> "purge_queue": {
>>> "pq_executing_ops": 0,
>>> "pq_executing": 0,
>>> "pq_executed": 1659514
>>> },
>>> "throttle-msgr_dispatch_throttler-mds": {
>>> "val": 0,
>>> "max": 104857600,
>>> "get_started": 0,
>>> "get": 2541455703,
>>> "get_sum": 17148691767160,
>>> "get_or_fail_fail": 0,
>>> "get_or_fail_success": 2541455703,
>>> "take": 0,
>>> "take_sum": 0,
>>> "put": 2541455703,
>>> "put_sum": 17148691767160,
>>> "wait": {
>>> "avgcount": 0,
>>> "sum": 0.000000000,
>>> "avgtime": 0.000000000
>>> }
>>> },
>>> "throttle-objecter_bytes": {
>>> "val": 0,
>>> "max": 104857600,
>>> "get_started": 0,
>>> "get": 0,
>>> "get_sum": 0,
>>> "get_or_fail_fail": 0,
>>> "get_or_fail_success": 0,
>>> "take": 197932421,
>>> "take_sum": 606323353310,
>>> "put": 182060027,
>>> "put_sum": 606323353310,
>>> "wait": {
>>> "avgcount": 0,
>>> "sum": 0.000000000,
>>> "avgtime": 0.000000000
>>> }
>>> },
>>> "throttle-objecter_ops": {
>>> "val": 0,
>>> "max": 1024,
>>> "get_started": 0,
>>> "get": 0,
>>> "get_sum": 0,
>>> "get_or_fail_fail": 0,
>>> "get_or_fail_success": 0,
>>> "take": 197932421,
>>> "take_sum": 197932421,
>>> "put": 197932421,
>>> "put_sum": 197932421,
>>> "wait": {
>>> "avgcount": 0,
>>> "sum": 0.000000000,
>>> "avgtime": 0.000000000
>>> }
>>> },
>>> "throttle-write_buf_throttle": {
>>> "val": 0,
>>> "max": 3758096384,
>>> "get_started": 0,
>>> "get": 1659514,
>>> "get_sum": 154334946,
>>> "get_or_fail_fail": 0,
>>> "get_or_fail_success": 1659514,
>>> "take": 0,
>>> "take_sum": 0,
>>> "put": 79728,
>>> "put_sum": 154334946,
>>> "wait": {
>>> "avgcount": 0,
>>> "sum": 0.000000000,
>>> "avgtime": 0.000000000
>>> }
>>> },
>>> "throttle-write_buf_throttle-0x55decea8e140": {
>>> "val": 255839,
>>> "max": 3758096384,
>>> "get_started": 0,
>>> "get": 357717092,
>>> "get_sum": 596677113363,
>>> "get_or_fail_fail": 0,
>>> "get_or_fail_success": 357717092,
>>> "take": 0,
>>> "take_sum": 0,
>>> "put": 59071693,
>>> "put_sum": 596676857524,
>>> "wait": {
>>> "avgcount": 0,
>>> "sum": 0.000000000,
>>> "avgtime": 0.000000000
>>> }
>>> }
>>> }
>>>
>>>
>>
>> Maybe there is a memory leak. What is the output of 'ceph tell mds.xx heap
>> stats'? If the RSS size keeps increasing, please run the heap profiler for
>> a period of time.
>>
>>
>> ceph tell mds.xx heap start_profiler
>> "wait some time"
>> ceph tell mds.xx heap dump
>> ceph tell mds.xx heap stop_profiler
>> pprof --pdf <location of ceph-mds binary> /var/log/ceph/mds.xxx.profile.* > profile.pdf
>>
>> send profile.pdf to us
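>>
>> (If pprof is not installed: it is part of gperftools; on Debian-based
>> systems a likely route is "apt-get install google-perftools", which
>> installs the script as google-pprof. The package name is an assumption
>> for your distribution.)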
>>
>> Regards
>> Yan, Zheng
>>
>>>
>>> ----- Original Message -----
>>> From: "Webert de Souza Lima" <[email protected]>
>>> To: "ceph-users" <[email protected]>
>>> Sent: Monday, May 14, 2018 15:14:35
>>> Subject: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?
>>>
>>> On Sat, May 12, 2018 at 3:11 AM Alexandre DERUMIER <[email protected]> wrote:
>>>
>>>
>>> The documentation (luminous) says:
>>>
>>>>mds cache size
>>>>
>>>>Description: The number of inodes to cache. A value of 0 indicates an
>>>>unlimited number. It is recommended to use mds_cache_memory_limit to limit
>>>>the amount of memory the MDS cache uses.
>>>>Type: 32-bit Integer
>>>>Default: 0
>>>>
>>>
>>> And my mds_cache_memory_limit is currently at 5GB.
>>>
>>>
>>> Yeah, I only suggested that because the high memory usage seemed to
>>> trouble you and it might be a bug, so it's more of a workaround.
>>>
>>> Regards,
>>> Webert Lima
>>> DevOps Engineer at MAV Tecnologia
>>> Belo Horizonte - Brasil
>>> IRC NICK - WebertRLZ
>>>
>>
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com