Hi Dumitru, thanks for digging into this! I really appreciate your help!
Please see my answers inline.

> On 10 Jul 2023, at 15:28, Dumitru Ceara <dce...@redhat.com> wrote:
>
> On 7/10/23 12:57, Dumitru Ceara wrote:
>> On 7/6/23 13:00, Vladislav Odintsov wrote:
>>>
>>>
>>>> On 5 Jul 2023, at 20:07, Vladislav Odintsov <odiv...@gmail.com> wrote:
>>>>
>>>> Hi Dumitru,
>>>>
>>>> thanks for the quick response!
>>>>
>>>>> On 5 Jul 2023, at 19:53, Dumitru Ceara <dce...@redhat.com> wrote:
>>>>>
>>>>> On 7/5/23 17:14, Vladislav Odintsov wrote:
>>>>>> Hi,
>>>>>>
>>>>>
>>>>> Hi Vladislav,
>>>>>
>>>>>> we’ve noticed a huge increase in ovn-controller memory consumption
>>>>>> introduced with [0], compared to a version without its ovn-controller.c
>>>>>> changes (just the OVS submodule bump without the ovn-controller changes
>>>>>> doesn’t trigger this behaviour).
>>>>>>
>>>>>
>>>>> Thanks for reporting this!
>>>>>
>>>>>> On an empty host connected to a working cluster, ovn-controller normally
>>>>>> consumes ~175 MiB RSS, while the same host updated to a version with
>>>>>> commit [0] consumes 3.3 GiB RSS. The start time and the CPU consumption
>>>>>> during process start are also higher on the affected version than on the
>>>>>> normal one.
>>>>>>
>>>>>> Before upgrade (normal functioning):
>>>>>>
>>>>>> # ovn-appctl -t ovn-controller memory/show; ps-mem -p $(pidof ovn-controller) | grep ovn
>>>>>> idl-cells-OVN_Southbound:343855 idl-cells-Open_vSwitch:14131
>>>>>> ofctrl_desired_flow_usage-KB:76 ofctrl_installed_flow_usage-KB:60
>>>>>> ofctrl_sb_flow_ref_usage-KB:18
>>>>>> 174.2 MiB + 327.5 KiB = 174.5 MiB  ovn-controller
>>>>>>
>>>>>> After the upgrade to an affected version I checked the memory report
>>>>>> while ovn-controller was starting, and at that point there was a much
>>>>>> larger number of OVN_Southbound cells than after start:
>>>>>>
>>>>>> during start:
>>>>>>
>>>>>> # ovn-appctl -t ovn-controller memory/show; ps-mem -p $(pidof ovn-controller) | grep ovn
>>>>>> idl-cells-OVN_Southbound:11388742 idl-cells-Open_vSwitch:14131
>>>>>> idl-outstanding-txns-Open_vSwitch:1
>>>>>> 3.3 GiB + 327.0 KiB = 3.3 GiB  ovn-controller
>>>>>>
>>>>>> after start:
>>>>>>
>>>>>> # ovn-appctl -t ovn-controller memory/show; ps-mem -p $(pidof ovn-controller) | grep ovn
>>>>>> idl-cells-OVN_Southbound:343896 idl-cells-Open_vSwitch:14131
>>>>>> idl-outstanding-txns-Open_vSwitch:1
>>>>>> 3.3 GiB + 327.0 KiB = 3.3 GiB  ovn-controller
>>>>>>
>>>>>>
>>>>>> cells during start:
>>>>>> 11388742
>>>>>>
>>>>>> cells after start:
>>>>>> 343896
>>>>>>
>>>>>
>>>>> Are you running with ovn-monitor-all=true on this host?
>>>>
>>>> No, it has the default false.
>>>>
>>>>>
>>>>> I guess it's unlikely but I'll try just in case: would it be possible to
>>>>> share the SB database?
>>>>
>>>> Unfortunately, no. But I can say it’s about 450 MB in size after
>>>> compaction. And there are about 1M mac_bindings, if that’s important :).
>>>
>>
>> I tried in a sandbox, before and after the commit in question, with a
>> script that adds 1M mac bindings on top of the sample topology built by
>> tutorial/ovn-setup.sh.
>>
>> I see ovn-controller memory usage going to >3GB before the commit you
>> blamed and to >1.9GB after the same commit. So it looks different from
>> what you reported, but it's still worrying that we use so much memory for
>> mac bindings.
>>
>> I'm assuming, however, that quite a few of the 1M mac bindings in your
>> setup are stale, so would it be possible to enable mac_binding aging?
>> It needs to be enabled per router with something like:
>>
>> $ ovn-nbctl set logical_router RTR options:mac_binding_age_threshold=60
>>
>> The threshold is in seconds and is a hard limit (for now) after which a
>> mac binding entry is removed. There's work in progress to only clear
>> arp cache entries that are really stale [1].

Yeah, I know about mac_binding ageing, but we currently use our own mechanism
to delete excess records and are waiting for the patchset you’ve mentioned
(rough sketches of both are further down, just before the quoted patch).

>>
>>> But if you are interested in any specific information, let me know and I
>>> can check.
>>>
>>
>> How many "local" datapaths do we have on the hosts that exhibit high
>> memory usage?
>>
>> The quickest way I can think of to get this info is to run this on the
>> hypervisor:
>> $ ovn-appctl ct-zone-list | grep snat -c
>>
>> Additional question: how many mac_bindings are "local", i.e., associated
>> with local datapaths?

For both questions: 0 (there is a counting sketch further down as well).

>>
>>>>
>>>>>
>>>>>> I guess it could be connected with this problem. Can anyone look at this
>>>>>> and comment, please?
>>>>>>
>>>>>
>>>>> Does the memory usage persist after the SB is upgraded too? I see the
>>>>> number of SB idl-cells goes down eventually, which means that eventually
>>>>> the periodic malloc_trim() call would free up memory. We trim on idle
>>>>> since
>>>>> https://github.com/ovn-org/ovn/commit/b4c593d23bd959e98fcc9ada4a973ac933579ded
>>>>>
>>>>
>>>> The DB schemas were not changed in this upgrade, so they’re already up to
>>>> date.
>>>>
>>>>> Are you using a different allocator? E.g., jemalloc.
>>>>
>>>> No, this issue reproduces with gcc.
>>>>
>>
>> Can we run a test and see if malloc_trim() was actually called? I'd
>> first disable lflow-cache log rate limiting:
>>
>> $ ovn-appctl vlog/disable-rate-limit lflow_cache
>>
>> Then check if malloc_trim() was called after the lflow-cache detected
>> inactivity. You'd see logs like:
>>
>> "lflow_cache|INFO|Detected cache inactivity (last active 30005 ms ago):
>> trimming cache"
>>
>> The fact that the number of SB idl-cells goes down and memory doesn't
>> seems to indicate we might have a bug in the auto cache trimming mechanism.
>>
>> In any case, a malloc_trim() can be manually triggered by flushing the
>> lflow cache:
>>
>> $ ovn-appctl lflow-cache/flush

Wow, all the excess memory was released: memory consumption of the
ovn-controller process decreased from 3.3 GiB to ~275 MiB.

>>
>> Thanks,
>> Dumitru
>>
>>>>>
>>>>>>
>>>>>> 0:
>>>>>> https://github.com/ovn-org/ovn/commit/1b0dbde940706e5de6e60221be78a278361fa76d
>>
>> [1] https://patchwork.ozlabs.org/project/ovn/list/?series=359894&state=*
>>
>
> I think I spotted an issue with the auto-trimming mechanism if the flows
> we're removing are not in the lflow cache (I think that's the case for
> the mac binding ones). Would it be possible to try out the following
> quick'n'dirty patch that should trigger trimming more often?

Thanks for the patch, I’ll try it tomorrow!

>
> Thanks!
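A few notes before the quoted patch, in case they help others on the list.

Enabling aging on every router can be scripted. This is only a sketch,
assuming every router should get the same threshold and has a non-empty name
without spaces; the 300-second value is just an example:

$ for r in $(ovn-nbctl --bare --columns=name list logical_router); do \
      ovn-nbctl set logical_router "$r" options:mac_binding_age_threshold=300; \
  done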
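As for our own cleanup mechanism: conceptually it boils down to finding and
destroying excess Mac_Binding rows. With the generic SB DB commands that could
look roughly like the line below (illustrative only; "lrp-example" is a
placeholder logical port, and a real cleanup job would batch the deletions):

$ ovn-sbctl --bare --columns=_uuid find mac_binding logical_port=lrp-example \
      | xargs -r ovn-sbctl destroy mac_binding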
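On the "local" counting: besides the ct-zone-list command above, a quick way
to see how the bindings are spread across datapaths is something like the
sketch below; the datapath UUIDs can then be cross-checked against the
datapaths that are actually local to the chassis:

$ ovn-sbctl --bare --columns=datapath list mac_binding \
      | grep -v '^$' | sort | uniq -c | sort -rn | head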
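And to keep the trim check in one place for anyone else reproducing this, the
sequence from this thread is essentially:

$ ovn-appctl -t ovn-controller vlog/disable-rate-limit lflow_cache
$ ovn-appctl -t ovn-controller memory/show; ps-mem -p $(pidof ovn-controller) | grep ovn

then wait for the "lflow_cache|INFO|Detected cache inactivity ... trimming
cache" message in the ovn-controller log and re-run the second command; if
memory still doesn't drop, force a trim with:

$ ovn-appctl -t ovn-controller lflow-cache/flush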
>
> diff --git a/controller/lflow-cache.c b/controller/lflow-cache.c
> index f723eab8a0..1a23e133ae 100644
> --- a/controller/lflow-cache.c
> +++ b/controller/lflow-cache.c
> @@ -24,7 +24,6 @@
> #include "coverage.h"
> #include "lflow-cache.h"
> #include "lib/uuid.h"
> -#include "memory-trim.h"
> #include "openvswitch/vlog.h"
> #include "ovn/expr.h"
>
> @@ -83,14 +82,14 @@ static void lflow_cache_delete__(struct lflow_cache *lc,
> static void lflow_cache_trim__(struct lflow_cache *lc, bool force);
>
> struct lflow_cache *
> -lflow_cache_create(void)
> +lflow_cache_create(struct memory_trimmer *mt)
> {
>     struct lflow_cache *lc = xzalloc(sizeof *lc);
>
>     for (size_t i = 0; i < LCACHE_T_MAX; i++) {
>         hmap_init(&lc->entries[i]);
>     }
> -    lc->mt = memory_trimmer_create();
> +    lc->mt = mt;
>
>     return lc;
> }
> @@ -124,7 +123,6 @@ lflow_cache_destroy(struct lflow_cache *lc)
>     for (size_t i = 0; i < LCACHE_T_MAX; i++) {
>         hmap_destroy(&lc->entries[i]);
>     }
> -    memory_trimmer_destroy(lc->mt);
>     free(lc);
> }
>
> diff --git a/controller/lflow-cache.h b/controller/lflow-cache.h
> index 300a560770..07e017e25e 100644
> --- a/controller/lflow-cache.h
> +++ b/controller/lflow-cache.h
> @@ -21,6 +21,7 @@
> #include "openvswitch/dynamic-string.h"
> #include "openvswitch/hmap.h"
> #include "openvswitch/uuid.h"
> +#include "memory-trim.h"
> #include "simap.h"
>
> struct lflow_cache;
> @@ -56,7 +57,7 @@ struct lflow_cache_value {
>     };
> };
>
> -struct lflow_cache *lflow_cache_create(void);
> +struct lflow_cache *lflow_cache_create(struct memory_trimmer *);
> void lflow_cache_flush(struct lflow_cache *);
> void lflow_cache_destroy(struct lflow_cache *);
> void lflow_cache_enable(struct lflow_cache *, bool enabled, uint32_t capacity,
> diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
> index 96d01219e1..89ff37e747 100644
> --- a/controller/ovn-controller.c
> +++ b/controller/ovn-controller.c
> @@ -5057,8 +5057,9 @@ main(int argc, char *argv[])
>     unsigned int ovnsb_cond_seqno = UINT_MAX;
>     unsigned int ovnsb_expected_cond_seqno = UINT_MAX;
>
> +    struct memory_trimmer *mt = memory_trimmer_create();
>     struct controller_engine_ctx ctrl_engine_ctx = {
> -        .lflow_cache = lflow_cache_create(),
> +        .lflow_cache = lflow_cache_create(mt),
>         .if_mgr = if_status_mgr_create(),
>     };
>     struct if_status_mgr *if_mgr = ctrl_engine_ctx.if_mgr;
> @@ -5447,6 +5448,10 @@
>             engine_set_force_recompute(false);
>         }
>
> +        if (engine_has_updated()) {
> +            memory_trimmer_record_activity(mt);
> +        }
> +
>         store_nb_cfg(ovnsb_idl_txn, ovs_idl_txn, chassis_private,
>                      br_int, delay_nb_cfg_report);
>
> @@ -5552,6 +5557,7 @@ loop_done:
>         }
>     }
>
> +    memory_trimmer_destroy(mt);
>     engine_set_context(NULL);
>     engine_cleanup();
>
> diff --git a/controller/test-lflow-cache.c b/controller/test-lflow-cache.c
> index 7ce9993605..1a73b8aad7 100644
> --- a/controller/test-lflow-cache.c
> +++ b/controller/test-lflow-cache.c
> @@ -108,7 +108,8 @@ test_lflow_cache_stats__(struct lflow_cache *lc)
> static void
> test_lflow_cache_operations(struct ovs_cmdl_context *ctx)
> {
> -    struct lflow_cache *lc = lflow_cache_create();
> +    struct memory_trimmer *mt = memory_trimmer_create();
> +    struct lflow_cache *lc = lflow_cache_create(mt);
>     struct expr *e = expr_create_boolean(true);
>     bool enabled = !strcmp(ctx->argv[1], "true");
>     struct uuid *lflow_uuids = NULL;
> @@ -235,6 +236,7 @@ test_lflow_cache_operations(struct ovs_cmdl_context *ctx)
>         test_lflow_cache_stats__(lc);
>     }
> done:
> +    memory_trimmer_destroy(mt);
>     lflow_cache_destroy(lc);
>     free(lflow_uuids);
>     expr_destroy(e);
> @@ -259,7 +261,7 @@ test_lflow_cache_negative(struct ovs_cmdl_context *ctx OVS_UNUSED)
>
>     struct lflow_cache *lcs[] = {
>         NULL,
> -        lflow_cache_create(),
> +        lflow_cache_create(memory_trimmer_create()),
>     };
>
>     for (size_t i = 0; i < ARRAY_SIZE(lcs); i++) {

Regards,
Vladislav Odintsov