On 7/6/23 13:00, Vladislav Odintsov wrote:
>
>
>> On 5 Jul 2023, at 20:07, Vladislav Odintsov <odiv...@gmail.com> wrote:
>>
>> Hi Dumitru,
>>
>> thanks for the quick response!
>>
>>> On 5 Jul 2023, at 19:53, Dumitru Ceara <dce...@redhat.com> wrote:
>>>
>>> On 7/5/23 17:14, Vladislav Odintsov wrote:
>>>> Hi,
>>>>
>>>
>>> Hi Vladislav,
>>>
>>>> we've noticed a huge increase in ovn-controller memory consumption
>>>> introduced with [0] compared to a version without its ovn-controller.c
>>>> changes (just the OVS submodule bump, without the ovn-controller
>>>> changes, doesn't trigger such behaviour).
>>>>
>>>
>>> Thanks for reporting this!
>>>
>>>> On an empty host connected to a working cluster, ovn-controller
>>>> normally consumes ~175 MiB RSS, while the same host updated to a
>>>> version with commit [0] consumes 3.3 GiB RSS. The start time and CPU
>>>> consumption during process start of an affected version are also
>>>> higher than for the normal version.
>>>>
>>>> Before upgrade (normal functioning):
>>>>
>>>> # ovn-appctl -t ovn-controller memory/show; ps-mem -p $(pidof ovn-controller) | grep ovn
>>>> idl-cells-OVN_Southbound:343855 idl-cells-Open_vSwitch:14131
>>>> ofctrl_desired_flow_usage-KB:76 ofctrl_installed_flow_usage-KB:60
>>>> ofctrl_sb_flow_ref_usage-KB:18
>>>> 174.2 MiB + 327.5 KiB = 174.5 MiB ovn-controller
>>>>
>>>> After upgrade to an affected version I checked the memory report while
>>>> ovn-controller was starting, and at some point there was a much larger
>>>> number of OVN_Southbound cells than at "after start" time:
>>>>
>>>> during start:
>>>>
>>>> # ovn-appctl -t ovn-controller memory/show; ps-mem -p $(pidof ovn-controller) | grep ovn
>>>> idl-cells-OVN_Southbound:11388742 idl-cells-Open_vSwitch:14131
>>>> idl-outstanding-txns-Open_vSwitch:1
>>>> 3.3 GiB + 327.0 KiB = 3.3 GiB ovn-controller
>>>>
>>>> after start:
>>>>
>>>> # ovn-appctl -t ovn-controller memory/show; ps-mem -p $(pidof ovn-controller) | grep ovn
>>>> idl-cells-OVN_Southbound:343896 idl-cells-Open_vSwitch:14131
>>>> idl-outstanding-txns-Open_vSwitch:1
>>>> 3.3 GiB + 327.0 KiB = 3.3 GiB ovn-controller
>>>>
>>>> cells during start: 11388742
>>>> cells after start: 343896
>>>>
>>>
>>> Are you running with ovn-monitor-all=true on this host?
>>
>> No, it has the default false.
>>
>>>
>>> I guess it's unlikely but I'll try just in case: would it be possible
>>> to share the SB database?
>>
>> Unfortunately, no. But I can say it's about 450 M in size after
>> compaction. And there are about 1M mac_bindings, if it's important :).
>
I tried in a sandbox, before and after the commit in question, with a
script that adds 1M mac bindings on top of the sample topology built by
tutorial/ovn-setup.sh.  I see ovn-controller memory usage going to >3GB
before the commit you blamed and to >1.9GB after the same commit.  So it
looks different from what you reported, but it's still worrying that we
use so much memory for mac bindings.

I'm assuming, however, that quite a few of the 1M mac bindings in your
setup are stale, so would it be possible to enable mac_binding aging?
It needs to be enabled per router with something like:

$ ovn-nbctl set logical_router RTR options:mac_binding_age_threshold=60

The threshold is in seconds and is (for now) a hard limit after which a
mac binding entry is removed.  There's work in progress to only clear
arp cache entries that are really stale [1].

> But if you are interested in any specific information, let me know and
> I can check.
>

How many "local" datapaths do we have on the hosts that exhibit high
memory usage?  The quickest way I can think of to get this info is to
run this on the hypervisor:

$ ovn-appctl ct-zone-list | grep snat -c

Additional question: how many mac_bindings are "local", i.e., associated
to local datapaths?

>>
>>>
>>>> I guess it could be connected with this problem. Can anyone look at
>>>> this and comment please?
>>>>
>>>
>>> Does the memory usage persist after the SB is upgraded too?  I see the
>>> number of SB idl-cells goes down eventually, which means that
>>> eventually the periodic malloc_trim() call would free up memory.  We
>>> trim on idle since
>>> https://github.com/ovn-org/ovn/commit/b4c593d23bd959e98fcc9ada4a973ac933579ded
>>>
>>
>> In this upgrade the DB schemas were not upgraded, so they're up to date.
>>
>>> Are you using a different allocator?  E.g., jemalloc.
>>
>> No, this issue reproduces with gcc.
>>

Can we run a test and see if malloc_trim() was actually called?  I'd
first disable lflow-cache log rate limiting:

$ ovn-appctl vlog/disable-rate-limit lflow_cache

Then check if malloc_trim() was called after the lflow-cache detected
inactivity.  You'd see logs like:

"lflow_cache|INFO|Detected cache inactivity (last active 30005 ms ago): trimming cache"

The fact that the number of SB idl-cells goes down while memory doesn't
seems to indicate we might have a bug in the auto cache trimming
mechanism.  In any case, a malloc_trim() can be manually triggered by
flushing the lflow cache:

$ ovn-appctl lflow-cache/flush

Thanks,
Dumitru

>>>
>>>>
>>>> 0:
>>>> https://github.com/ovn-org/ovn/commit/1b0dbde940706e5de6e60221be78a278361fa76d

[1] https://patchwork.ozlabs.org/project/ovn/list/?series=359894&state=*

>>>>
>>>>
>>>> Regards,
>>>> Vladislav Odintsov
>>>>
>>>
>>> Regards,
>>> Dumitru
>>>
>>
>>
>> Regards,
>> Vladislav Odintsov
>>
>
>
> Regards,
> Vladislav Odintsov
>
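
For the mac_binding aging suggestion above, if the threshold should be
applied to every router rather than a single RTR, a small loop along these
lines could work.  This is only a rough sketch: it assumes ovn-nbctl on the
host can reach the NB database, that logical router names contain no
whitespace, and it simply reuses the example value of 60 seconds:

  # Sketch: set the example 60s MAC binding aging threshold on every
  # logical router (router names are assumed to contain no whitespace).
  for rtr in $(ovn-nbctl --bare --columns=name list logical_router); do
      ovn-nbctl set logical_router "$rtr" options:mac_binding_age_threshold=60
  done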
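
To get a feel for how the mac_bindings are spread across datapaths (and how
many of them could end up "local" on a given hypervisor), something like the
commands below could help.  Again only a sketch: it assumes ovn-sbctl on the
host can reach the Southbound database, and the per-datapath counts still
have to be matched manually against the datapaths that are actually local to
the hypervisor:

  # Count "local" datapaths on the hypervisor (one snat ct zone per local
  # datapath), as suggested above:
  ovn-appctl ct-zone-list | grep -c snat

  # Count MAC_Binding rows per datapath in the Southbound DB, largest first:
  ovn-sbctl --bare --columns=datapath list mac_binding \
      | sed '/^$/d' | sort | uniq -c | sort -rn | head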