Hi Dumitru,

The system on which I reproduced this issue is running a 22.09.x version. I’ve tried upgrading ovn-controller to the main branch plus your patch. Please note that it has a test error: [1]. Two minutes after the upgrade, ovn-controller still consumed 3.3G.
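In case it helps with reproducing, this is roughly how the numbers in this thread can be sampled while ovn-controller starts (a rough sketch built only from the commands already used below; the 5-second interval is arbitrary):

# sample IDL cell counts and per-process RSS every 5 seconds
while sleep 5; do
    date
    ovn-appctl -t ovn-controller memory/show
    ps-mem -p "$(pidof ovn-controller)" | grep ovn
done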
I tried to backport your patch to 22.09; it also required backporting this commit: [2], and it failed some tests: [3]. But I’ve got a general question: prior to the commit that I mentioned in the initial mail, ovn-controller didn’t even try to load such an amount of data. Now it does, and IIUC your patch just releases back the memory that was already freed once ovn-controller had fully loaded. I wonder whether it should load that excess data at all? It seems that it did.

1: https://github.com/odivlad/ovn/actions/runs/5519795488/jobs/10065618661
2: https://github.com/ovn-org/ovn/commit/b4c593d23bd959e98fcc9ada4a973ac933579ded
3: https://github.com/odivlad/ovn/actions/runs/5519892057

> On 11 Jul 2023, at 14:39, Dumitru Ceara <dce...@redhat.com> wrote:
>
> On 7/10/23 22:20, Vladislav Odintsov wrote:
>> Hi Dumitru,
>>
>> thanks for digging into this! I highly appreciate your help!
>>
>
> No worries, my pleasure! :)
>
>> Please, see my answers inline.
>>
>>> On 10 Jul 2023, at 15:28, Dumitru Ceara <dce...@redhat.com> wrote:
>>>
>>> On 7/10/23 12:57, Dumitru Ceara wrote:
>>>> On 7/6/23 13:00, Vladislav Odintsov wrote:
>>>>>
>>>>>
>>>>>> On 5 Jul 2023, at 20:07, Vladislav Odintsov <odiv...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Dumitru,
>>>>>>
>>>>>> thanks for the quick response!
>>>>>>
>>>>>>> On 5 Jul 2023, at 19:53, Dumitru Ceara <dce...@redhat.com> wrote:
>>>>>>>
>>>>>>> On 7/5/23 17:14, Vladislav Odintsov wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>
>>>>>>> Hi Vladislav,
>>>>>>>
>>>>>>>> we’ve noticed a huge ovn-controller memory consumption introduced
>>>>>>>> with [0] compared to a version without its changes in the
>>>>>>>> ovn-controller.c part (just the OVS submodule bump without the
>>>>>>>> ovn-controller changes doesn’t trigger such behaviour).
>>>>>>>>
>>>>>>>
>>>>>>> Thanks for reporting this!
>>>>>>>
>>>>>>>> On an empty host connected to a working cluster ovn-controller
>>>>>>>> normally consumes ~175 MiB RSS, while the same host updated to a
>>>>>>>> version with commit [0] consumes 3.3 GiB RSS. The start time and
>>>>>>>> CPU consumption during process start of an affected version are
>>>>>>>> higher than for the normal version.
>>>>>>>>
>>>>>>>> Before upgrade (normal functioning):
>>>>>>>>
>>>>>>>> # ovn-appctl -t ovn-controller memory/show; ps-mem -p $(pidof ovn-controller) | grep ovn
>>>>>>>> idl-cells-OVN_Southbound:343855 idl-cells-Open_vSwitch:14131
>>>>>>>> ofctrl_desired_flow_usage-KB:76 ofctrl_installed_flow_usage-KB:60
>>>>>>>> ofctrl_sb_flow_ref_usage-KB:18
>>>>>>>> 174.2 MiB + 327.5 KiB = 174.5 MiB    ovn-controller
>>>>>>>>
>>>>>>>> After the upgrade to an affected version I checked the memory report
>>>>>>>> while ovn-controller was starting, and at some point there was a much
>>>>>>>> bigger number of OVN_Southbound cells compared to "after start" time:
>>>>>>>>
>>>>>>>> during start:
>>>>>>>>
>>>>>>>> # ovn-appctl -t ovn-controller memory/show; ps-mem -p $(pidof ovn-controller) | grep ovn
>>>>>>>> idl-cells-OVN_Southbound:11388742 idl-cells-Open_vSwitch:14131
>>>>>>>> idl-outstanding-txns-Open_vSwitch:1
>>>>>>>> 3.3 GiB + 327.0 KiB = 3.3 GiB    ovn-controller
>>>>>>>>
>>>>>>>> after start:
>>>>>>>>
>>>>>>>> # ovn-appctl -t ovn-controller memory/show; ps-mem -p $(pidof ovn-controller) | grep ovn
>>>>>>>> idl-cells-OVN_Southbound:343896 idl-cells-Open_vSwitch:14131
>>>>>>>> idl-outstanding-txns-Open_vSwitch:1
>>>>>>>> 3.3 GiB + 327.0 KiB = 3.3 GiB    ovn-controller
>>>>>>>>
>>>>>>>>
>>>>>>>> cells during start:
>>>>>>>> 11388742
>>>>>>>>
>>>>>>>> cells after start:
>>>>>>>> 343896
>>>>>>>>
>>>>>>>
>>>>>>> Are you running with ovn-monitor-all=true on this host?
>>>>>>
>>>>>> No, it has the default false.
>>>>>>
>>>>>>>
>>>>>>> I guess it's unlikely but I'll try just in case: would it be possible
>>>>>>> to share the SB database?
>>>>>>
>>>>>> Unfortunately, no. But I can say it’s about 450 M in size after
>>>>>> compaction. And there are about 1M mac_bindings, if it’s important :).
>>>>>
>>>>
>>>> I tried in a sandbox, before and after the commit in question, with a
>>>> script that adds 1M mac bindings on top of the sample topology built by
>>>> tutorial/ovn-setup.sh.
>>>>
>>>> I see ovn-controller memory usage going to >3GB before the commit you
>>>> blamed and to >1.9GB after the same commit. So it looks different from
>>>> what you reported, but it's still worrying that we use so much memory
>>>> for mac bindings.
>>>>
>>>> I'm assuming, however, that quite a few of the 1M mac bindings in your
>>>> setup are stale, so would it be possible to enable mac_binding aging?
>>>> It needs to be enabled per router with something like:
>>>>
>>>> $ ovn-nbctl set logical_router RTR options:mac_binding_age_threshold=60
>>>>
>>>> The threshold is in seconds and is a hard limit (for now) after which a
>>>> mac binding entry is removed. There's work in progress to only clear
>>>> arp cache entries that are really stale [1].
>>
>> Yeah, I know about mac_binding ageing, but we currently use our own
>> mechanism to delete excess records and are waiting for the patchset you
>> mentioned.
>>
>
> Ack.
>
>>>>
>>>>> But if you are interested in any specific information, let me know and
>>>>> I can check.
>>>>>
>>>>
>>>> How many "local" datapaths do we have on the hosts that exhibit high
>>>> memory usage?
>>>>
>>>> The quickest way I can think of to get this info is to run this on the
>>>> hypervisor:
>>>> $ ovn-appctl ct-zone-list | grep snat -c
>>>>
>>>> Additional question: how many mac_bindings are "local", i.e., associated
>>>> with local datapaths?
>>
>> For both questions: 0.
>>
>
> OK, that really suggests that malloc_trim() didn't get called.
> There's really no reason (especially with ovn-monitor-all=false) to have
> such a high memory usage in ovn-controller if there's no local datapath.
>
>>>>
>>>>>>
>>>>>>>
>>>>>>>> I guess it could be connected with this problem. Can anyone look at
>>>>>>>> this and comment please?
>>>>>>>>
>>>>>>>
>>>>>>> Does the memory usage persist after the SB is upgraded too? I see the
>>>>>>> number of SB idl-cells goes down eventually, which means that
>>>>>>> eventually the periodic malloc_trim() call would free up memory. We
>>>>>>> trim on idle since
>>>>>>> https://github.com/ovn-org/ovn/commit/b4c593d23bd959e98fcc9ada4a973ac933579ded
>>>>>>>
>>>>>>
>>>>>> In this upgrade the DB schemas were not upgraded, so they're up to date.
>>>>>>
>>>>>>> Are you using a different allocator? E.g., jemalloc.
>>>>>>
>>>>>> No, this issue reproduces with the default glibc allocator.
>>>>>>
>>>>
>>>> Can we run a test and see if malloc_trim() was actually called? I'd
>>>> first disable lflow-cache log rate limiting:
>>>>
>>>> $ ovn-appctl vlog/disable-rate-limit lflow_cache
>>>>
>>>> Then check if malloc_trim() was called after the lflow-cache detected
>>>> inactivity. You'd see logs like:
>>>>
>>>> "lflow_cache|INFO|Detected cache inactivity (last active 30005 ms ago):
>>>> trimming cache"
>>>>
>>>> The fact that the number of SB idl-cells goes down while memory doesn't
>>>> seems to indicate we might have a bug in the auto cache trimming
>>>> mechanism.
>>>>
>>>> In any case, a malloc_trim() can be manually triggered by flushing the
>>>> lflow cache:
>>>>
>>>> $ ovn-appctl lflow-cache/flush
>>
>> Wow, all the excess memory was released. Memory consumption of the
>> ovn-controller process decreased from 3.3GiB to ~275MiB.
>>
>>>>
>>>> Thanks,
>>>> Dumitru
>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> 0: https://github.com/ovn-org/ovn/commit/1b0dbde940706e5de6e60221be78a278361fa76d
>>>>
>>>> [1] https://patchwork.ozlabs.org/project/ovn/list/?series=359894&state=*
>>>>
>>>
>>> I think I spotted an issue with the auto-trimming mechanism when the
>>> flows we're removing are not in the lflow cache (I think that's the case
>>> for the mac binding ones). Would it be possible to try out the following
>>> quick'n'dirty patch that should trigger trimming more often?
>>
>> Thanks for the patch, I’ll try it tomorrow!
>>
>
> Thanks, I'll wait for confirmation before preparing a formal patch.
>
> Regards,
> Dumitru
>
>>>
>>> Thanks!
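For reference, the trim check quoted above boils down to roughly the following sequence (a sketch; the log path is the usual default and may differ on your system):

$ ovn-appctl vlog/disable-rate-limit lflow_cache
$ grep 'Detected cache inactivity' /var/log/ovn/ovn-controller.log
# if no trim was logged, force one and compare memory/show output
# before and after the flush:
$ ovn-appctl -t ovn-controller memory/show
$ ovn-appctl lflow-cache/flush
$ ovn-appctl -t ovn-controller memory/show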
>>>
>>> diff --git a/controller/lflow-cache.c b/controller/lflow-cache.c
>>> index f723eab8a0..1a23e133ae 100644
>>> --- a/controller/lflow-cache.c
>>> +++ b/controller/lflow-cache.c
>>> @@ -24,7 +24,6 @@
>>>  #include "coverage.h"
>>>  #include "lflow-cache.h"
>>>  #include "lib/uuid.h"
>>> -#include "memory-trim.h"
>>>  #include "openvswitch/vlog.h"
>>>  #include "ovn/expr.h"
>>>
>>> @@ -83,14 +82,14 @@ static void lflow_cache_delete__(struct lflow_cache *lc,
>>>  static void lflow_cache_trim__(struct lflow_cache *lc, bool force);
>>>
>>>  struct lflow_cache *
>>> -lflow_cache_create(void)
>>> +lflow_cache_create(struct memory_trimmer *mt)
>>>  {
>>>      struct lflow_cache *lc = xzalloc(sizeof *lc);
>>>
>>>      for (size_t i = 0; i < LCACHE_T_MAX; i++) {
>>>          hmap_init(&lc->entries[i]);
>>>      }
>>> -    lc->mt = memory_trimmer_create();
>>> +    lc->mt = mt;
>>>
>>>      return lc;
>>>  }
>>> @@ -124,7 +123,6 @@ lflow_cache_destroy(struct lflow_cache *lc)
>>>      for (size_t i = 0; i < LCACHE_T_MAX; i++) {
>>>          hmap_destroy(&lc->entries[i]);
>>>      }
>>> -    memory_trimmer_destroy(lc->mt);
>>>      free(lc);
>>>  }
>>>
>>> diff --git a/controller/lflow-cache.h b/controller/lflow-cache.h
>>> index 300a560770..07e017e25e 100644
>>> --- a/controller/lflow-cache.h
>>> +++ b/controller/lflow-cache.h
>>> @@ -21,6 +21,7 @@
>>>  #include "openvswitch/dynamic-string.h"
>>>  #include "openvswitch/hmap.h"
>>>  #include "openvswitch/uuid.h"
>>> +#include "memory-trim.h"
>>>  #include "simap.h"
>>>
>>>  struct lflow_cache;
>>> @@ -56,7 +57,7 @@ struct lflow_cache_value {
>>>      };
>>>  };
>>>
>>> -struct lflow_cache *lflow_cache_create(void);
>>> +struct lflow_cache *lflow_cache_create(struct memory_trimmer *);
>>>  void lflow_cache_flush(struct lflow_cache *);
>>>  void lflow_cache_destroy(struct lflow_cache *);
>>>  void lflow_cache_enable(struct lflow_cache *, bool enabled, uint32_t capacity,
>>> diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
>>> index 96d01219e1..89ff37e747 100644
>>> --- a/controller/ovn-controller.c
>>> +++ b/controller/ovn-controller.c
>>> @@ -5057,8 +5057,9 @@ main(int argc, char *argv[])
>>>      unsigned int ovnsb_cond_seqno = UINT_MAX;
>>>      unsigned int ovnsb_expected_cond_seqno = UINT_MAX;
>>>
>>> +    struct memory_trimmer *mt = memory_trimmer_create();
>>>      struct controller_engine_ctx ctrl_engine_ctx = {
>>> -        .lflow_cache = lflow_cache_create(),
>>> +        .lflow_cache = lflow_cache_create(mt),
>>>          .if_mgr = if_status_mgr_create(),
>>>      };
>>>      struct if_status_mgr *if_mgr = ctrl_engine_ctx.if_mgr;
>>> @@ -5447,6 +5448,10 @@ main(int argc, char *argv[])
>>>              engine_set_force_recompute(false);
>>>          }
>>>
>>> +        if (engine_has_updated()) {
>>> +            memory_trimmer_record_activity(mt);
>>> +        }
>>> +
>>>          store_nb_cfg(ovnsb_idl_txn, ovs_idl_txn, chassis_private,
>>>                       br_int, delay_nb_cfg_report);
>>>
>>> @@ -5552,6 +5557,7 @@ loop_done:
>>>          }
>>>      }
>>>
>>> +    memory_trimmer_destroy(mt);
>>>      engine_set_context(NULL);
>>>      engine_cleanup();
>>>
>>> diff --git a/controller/test-lflow-cache.c b/controller/test-lflow-cache.c
>>> index 7ce9993605..1a73b8aad7 100644
>>> --- a/controller/test-lflow-cache.c
>>> +++ b/controller/test-lflow-cache.c
>>> @@ -108,7 +108,8 @@ test_lflow_cache_stats__(struct lflow_cache *lc)
>>>  static void
>>>  test_lflow_cache_operations(struct ovs_cmdl_context *ctx)
>>>  {
>>> -    struct lflow_cache *lc = lflow_cache_create();
>>> +    struct memory_trimmer *mt = memory_trimmer_create();
>>> +    struct lflow_cache *lc = lflow_cache_create(mt);
>>>      struct expr *e = expr_create_boolean(true);
>>>      bool enabled = !strcmp(ctx->argv[1], "true");
>>>      struct uuid *lflow_uuids = NULL;
>>> @@ -235,6 +236,7 @@ test_lflow_cache_operations(struct ovs_cmdl_context *ctx)
>>>          test_lflow_cache_stats__(lc);
>>>      }
>>>  done:
>>> +    memory_trimmer_destroy(mt);
>>>      lflow_cache_destroy(lc);
>>>      free(lflow_uuids);
>>>      expr_destroy(e);
>>> @@ -259,7 +261,7 @@ test_lflow_cache_negative(struct ovs_cmdl_context *ctx OVS_UNUSED)
>>>
>>>      struct lflow_cache *lcs[] = {
>>>          NULL,
>>> -        lflow_cache_create(),
>>> +        lflow_cache_create(memory_trimmer_create()),
>>>      };
>>>
>>>      for (size_t i = 0; i < ARRAY_SIZE(lcs); i++) {
>>
>>
>> Regards,
>> Vladislav Odintsov
>>
>>
>

Regards,
Vladislav Odintsov

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev