Thanks for your advice. It may still be related to our scenario: what we did
was test whether our new OVN CNI can hold 10000 pods in a Kubernetes cluster,
so we created a large number of containers in a very short period of time,
and it stopped at about 3000 containers when we hit the memory issue.

I will try a larger server and slow down the container creation to see
whether that makes a difference.
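Roughly what I have in mind for pacing the creation is the sketch below (just
a sketch; "test-pods" and the batch sizes are placeholders for our actual
test deployment):

#!/bin/bash
# Ramp the test deployment up in batches instead of creating all 10000 pods
# at once, so the OVN databases see a steadier stream of port additions.
for replicas in $(seq 500 500 10000); do
    kubectl scale deployment test-pods --replicas="$replicas"
    # wait until the current batch is running before adding the next one
    kubectl rollout status deployment test-pods --timeout=15m
    sleep 30
done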
On Mon, Dec 16, 2019 at 2:53 PM Han Zhou <[email protected]> wrote:
> Hmm... I am not sure if it is normal. In our environment with an even
> larger scale (in terms of number of ports), the memory is usually less
> than 1GB for each ovsdb-server process, and we didn't see symptoms of a
> memory leak (after months of clustered-mode deployment in a live
> environment).
>
> Could you check which DB (NB or SB) is the major memory consumer? If it is
> the SB DB, it is normal to see a memory spike the first time you switch
> from standalone mode to clustered mode if you have a big number of compute
> nodes connected to the SB DB. After it stabilizes, the memory footprint
> should decrease. Of course it is possible that you have more complex
> scenarios that trigger more memory consumption (or a memory leak). Could
> you try a server with more memory (to avoid the OOM killer), to see if it
> stabilizes at some point or just keeps increasing day after day?
>
> Besides, I believe Ben has some better troubleshooting steps for memory
> issues of ovsdb-server. Ben, could you suggest?
>
> Thanks,
> Han
>
> On Sun, Dec 15, 2019 at 9:56 PM 刘梦馨 <[email protected]> wrote:
> >
> > After more iterations (6 in my environment) the RSS usage stabilized at
> > 759128 KB.
> >
> > This is a really simplified test; in our real environment we run about
> > 3000 containers with lots of other operations, like setting routes,
> > load balancers, all the ovn-sb operations, etc. The memory consumption
> > can quickly go up to 6GB (NB and SB together) and lead to a system OOM.
> > Is that a reasonable resource consumption in your experience? I don't
> > remember the actual numbers for standalone DB resource consumption, but
> > in the same environment it didn't lead to an OOM.
> >
> > On Mon, Dec 16, 2019 at 1:05 PM Han Zhou <[email protected]> wrote:
> >>
> >> Thanks for the details. I tried the same command with a for loop.
> >>
> >> After the first 4 iterations, the RSS of the first NB server increased
> >> to 572888 (KB). After that, it stayed the same in the next 3
> >> iterations. So it seems to just build memory buffers up and then stay
> >> at that level without further increase, and it doesn't seem to be a
> >> memory leak. Could you try more iterations and see if it still
> >> continuously increases?
> >>
> >> Thanks,
> >> Han
> >>
> >> On Sun, Dec 15, 2019 at 7:54 PM 刘梦馨 <[email protected]> wrote:
> >> >
> >> > Hi, Han
> >> >
> >> > In my test scenario, I use ovn-ctl to start a one-node OVN with
> >> > cluster-mode DBs and no chassis bound to ovn-sb, to just check the
> >> > memory usage of ovn-nb.
> >> > Then I use a script to add a logical switch, add 1000 ports, set
> >> > dynamic addresses, and then delete the logical switch.
> >> >
> >> > #!/bin/bash
> >> > ovn-nbctl ls-add ls1
> >> > for i in {1..1000}; do
> >> >   ovn-nbctl lsp-add ls1 ls1-vm$i
> >> >   ovn-nbctl lsp-set-addresses ls1-vm$i dynamic
> >> > done
> >> > ovn-nbctl ls-del ls1
> >> >
> >> > I run this script repeatedly and watch the memory change.
> >> >
> >> > After 5 runs (5000 lsp adds and deletes), the RSS of the NB server
> >> > increased to 667M.
> >> > The NB file increased to 119M and was not automatically compacted.
> >> > After a manual compaction the DB file size went back to 11K, but the
> >> > memory usage didn't change.
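For reference, the memory numbers and the manual compaction above were
gathered roughly like this; the ctl socket path is the one ovn-ctl uses on
our hosts and may differ on other installs:

# RSS of each ovsdb-server process (NB and SB)
ps -o pid,rss,cmd -C ovsdb-server

# ovsdb-server's own memory report for the NB database
ovs-appctl -t /var/run/openvswitch/ovnnb_db.ctl memory/show

# trigger a manual compaction of the NB database
ovs-appctl -t /var/run/openvswitch/ovnnb_db.ctl ovsdb-server/compact OVN_Northbound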
> >> >
> >> > On Sat, Dec 14, 2019 at 3:40 AM Han Zhou <[email protected]> wrote:
> >> >>
> >> >> On Wed, Dec 11, 2019 at 12:51 AM 刘梦馨 <[email protected]> wrote:
> >> >> >
> >> >> > We are using OVS/OVN 2.12.0 to implement our container network.
> >> >> > After switching from standalone OVN DBs to cluster-mode DBs, we
> >> >> > noticed that the memory consumption of both ovnnb and ovnsb keeps
> >> >> > increasing after each operation and never decreases.
> >> >> >
> >> >> > We did some profiling with valgrind. The leak check reported a
> >> >> > 16-byte leak in fork_and_wait_for_startup, which obviously is not
> >> >> > the main reason. Later we used massif to profile the memory
> >> >> > consumption and we put the result in the attachment.
> >> >> >
> >> >> > Most of the memory comes from two parts: ovsthread_wrapper
> >> >> > (ovs-thread.c:378), which allocates a subprogram_name, and
> >> >> > jsonrpc_send (jsonrpc.c:253), as below (I just skipped the
> >> >> > duplicated stack of jsonrpc).
> >> >> >
> >> >> > However, I found that both parts have a related free operation
> >> >> > nearby, so I don't know how to explore this memory issue further.
> >> >> > I'm not aware of the differences here between cluster mode and
> >> >> > standalone mode.
> >> >> >
> >> >> > Can anyone give some advice and hints? Thanks!
> >> >> >
> >> >> > 100.00% (357,920,768B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
> >> >> > ->78.52% (281,038,848B) 0x66FDD49: mmap (in /usr/lib64/libc-2.17.so)
> >> >> > | ->37.50% (134,217,728B) 0x66841EF: new_heap (in /usr/lib64/libc-2.17.so)
> >> >> > | | ->37.50% (134,217,728B) 0x6684C22: arena_get2.isra.3 (in /usr/lib64/libc-2.17.so)
> >> >> > | | ->37.50% (134,217,728B) 0x668AACC: malloc (in /usr/lib64/libc-2.17.so)
> >> >> > | | ->37.50% (134,217,728B) 0x4FDC613: xmalloc (util.c:138)
> >> >> > | | ->37.50% (134,217,728B) 0x4FDC78E: xvasprintf (util.c:202)
> >> >> > | | ->37.50% (134,217,728B) 0x4FDC877: xasprintf (util.c:343)
> >> >> > | | ->37.50% (134,217,728B) 0x4FA548D: ovsthread_wrapper (ovs-thread.c:378)
> >> >> > | | ->37.50% (134,217,728B) 0x5BE5E63: start_thread (in /usr/lib64/libpthread-2.17.so)
> >> >> > | | ->37.50% (134,217,728B) 0x670388B: clone (in /usr/lib64/libc-2.17.so)
> >> >> > | |
> >> >> > | ->36.33% (130,023,424B) 0x6686DF3: sysmalloc (in /usr/lib64/libc-2.17.so)
> >> >> > | | ->36.33% (130,023,424B) 0x6687CA8: _int_malloc (in /usr/lib64/libc-2.17.so)
> >> >> > | | ->28.42% (101,711,872B) 0x66890C0: _int_realloc (in /usr/lib64/libc-2.17.so)
> >> >> > | | | ->28.42% (101,711,872B) 0x668B160: realloc (in /usr/lib64/libc-2.17.so)
> >> >> > | | | ->28.42% (101,711,872B) 0x4FDC9A3: xrealloc (util.c:149)
> >> >> > | | | ->28.42% (101,711,872B) 0x4F1DEB2: ds_reserve (dynamic-string.c:63)
> >> >> > | | | ->28.42% (101,711,872B) 0x4F1DED3: ds_put_uninit (dynamic-string.c:73)
> >> >> > | | | ->28.42% (101,711,872B) 0x4F1DF0B: ds_put_char__ (dynamic-string.c:82)
> >> >> > | | | ->26.37% (94,371,840B) 0x4F2B09F: json_serialize_string (dynamic-string.h:93)
> >> >> > | | | | ->12.01% (42,991,616B) 0x4F2B3EA: json_serialize (json.c:1651)
> >> >> > | | | | | ->12.01% (42,991,616B) 0x4F2B3EA: json_serialize (json.c:1651)
> >> >> > | | | | | ->12.01% (42,991,616B) 0x4F2B3EA: json_serialize (json.c:1651)
> >> >> > | | | | | ->12.01% (42,991,616B) 0x4F2B540: json_serialize (json.c:1626)
> >> >> > | | | | | ->12.01% (42,991,616B) 0x4F2B540: json_serialize (json.c:1626)
> >> >> > | | | | | ->12.01% (42,991,616B) 0x4F2B540: json_serialize (json.c:1626)
> >> >> > | | | | | ->12.01% (42,991,616B) 0x4F2B540: json_serialize (json.c:1626)
> >> >> > | | | | | ->12.01% (42,991,616B) 0x4F2B3EA: json_serialize (json.c:1651)
> >> >> > | | | | | ->12.01% (42,991,616B) 0x4F2B540: json_serialize (json.c:1626)
> >> >> > | | | | | ->12.01% (42,991,616B) 0x4F2D82A: json_to_ds (json.c:1525)
> >> >> > | | | | | ->12.01% (42,991,616B) 0x4F2EA49: jsonrpc_send (jsonrpc.c:253)
> >> >> > | | | | | ->12.01% (42,991,616B) 0x4C3A68A: ovsdb_jsonrpc_server_run (jsonrpc-server.c:1104)
> >> >> > | | | | | ->12.01% (42,991,616B) 0x10DCC1: main (ovsdb-server.c:209)
> >> >>
> >> >> Thanks for reporting the issue. Could you describe your test
> >> >> scenario (the operations), the scale, the db file size, and the
> >> >> memory (RSS) data of the NB/SB?
> >> >> Clustered mode maintains some extra data, such as RAFT logs,
> >> >> compared to standalone, but it should not increase forever, because
> >> >> RAFT logs get compacted periodically.
> >> >>
> >> >> Thanks,
> >> >> Han

--
刘梦馨
Blog: http://oilbeater.com
Weibo: @oilbeater <http://weibo.com/oilbeater>
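P.S. In case anyone wants to reproduce the profiling: this is roughly how we
ran massif against the NB ovsdb-server (paths and options are from memory
and may need adjusting for other installs):

# stop the normal NB server first, then run it under massif;
# --pages-as-heap=yes accounts for pages from mmap/brk, which is what the
# report above shows
valgrind --tool=massif --pages-as-heap=yes \
    ovsdb-server --remote=punix:/var/run/openvswitch/ovnnb_db.sock \
    /etc/openvswitch/ovnnb_db.db

# after stopping ovsdb-server, render the allocation tree
ms_print massif.out.*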
