On 7/10/23 12:57, Dumitru Ceara wrote:
> On 7/6/23 13:00, Vladislav Odintsov wrote:
>>
>>
>>> On 5 Jul 2023, at 20:07, Vladislav Odintsov <odiv...@gmail.com> wrote:
>>>
>>> Hi Dumitru,
>>>
>>> thanks for the quick response!
>>>
>>>> On 5 Jul 2023, at 19:53, Dumitru Ceara <dce...@redhat.com> wrote:
>>>>
>>>> On 7/5/23 17:14, Vladislav Odintsov wrote:
>>>>> Hi,
>>>>>
>>>>
>>>> Hi Vladislav,
>>>>
>>>>> we’ve noticed a huge increase in ovn-controller memory consumption
>>>>> introduced with [0] compared to a version without its ovn-controller.c
>>>>> changes (just the OVS submodule bump without the ovn-controller changes
>>>>> doesn’t trigger such behaviour).
>>>>>
>>>>
>>>> Thanks for reporting this!
>>>>
>>>>> On an empty host connected to a working cluster ovn-controller normally
>>>>> consumes ~175 MiB RSS, while the same host updated to a version with
>>>>> commit [0] consumes 3.3 GiB RSS. The start time and CPU consumption
>>>>> during process start are also higher for the affected version than for
>>>>> the normal one.
>>>>>
>>>>> Before upgrade (normal functioning):
>>>>>
>>>>> #ovn-appctl -t ovn-controller memory/show ;ps-mem -p $(pidof 
>>>>> ovn-controller)| grep ovn
>>>>> idl-cells-OVN_Southbound:343855 idl-cells-Open_vSwitch:14131 
>>>>> ofctrl_desired_flow_usage-KB:76 ofctrl_installed_flow_usage-KB:60 
>>>>> ofctrl_sb_flow_ref_usage-KB:18
>>>>> 174.2 MiB + 327.5 KiB = 174.5 MiB ovn-controller
>>>>>
>>>>> After upgrading to an affected version I checked the memory report while
>>>>> ovn-controller was starting, and at some point there was a much larger
>>>>> number of OVN_Southbound cells compared to the "after start" time:
>>>>>
>>>>> during start:
>>>>>
>>>>> # ovn-appctl -t ovn-controller memory/show ;ps-mem -p $(pidof 
>>>>> ovn-controller)| grep ovn
>>>>> idl-cells-OVN_Southbound:11388742 idl-cells-Open_vSwitch:14131 
>>>>> idl-outstanding-txns-Open_vSwitch:1
>>>>> 3.3 GiB + 327.0 KiB =   3.3 GiB   ovn-controller
>>>>>
>>>>> after start:
>>>>>
>>>>> # ovn-appctl -t ovn-controller memory/show ;ps-mem -p $(pidof 
>>>>> ovn-controller)| grep ovn
>>>>> idl-cells-OVN_Southbound:343896 idl-cells-Open_vSwitch:14131 
>>>>> idl-outstanding-txns-Open_vSwitch:1
>>>>> 3.3 GiB + 327.0 KiB =   3.3 GiB   ovn-controller
>>>>>
>>>>>
>>>>> cells during start:
>>>>> 11388742
>>>>>
>>>>> cells after start:
>>>>> 343896
>>>>>
>>>>
>>>> Are you running with ovn-monitor-all=true on this host?
>>>
>>> No, it has default false.
>>>
>>>>
>>>> I guess it's unlikely but I'll try just in case: would it be possible to
>>>> share the SB database?
>>>
>>> Unfortunately, no. But I can say it’s about 450 MB in size after
>>> compaction, and there are about 1M mac_bindings, if that’s relevant :).
>>
> 
> I tried in a sandbox, before and after the commit in question, with a
> script that adds 1M mac bindings on top of the sample topology built by
> tutorial/ovn-setup.sh.
> 
> I see ovn-controller memory usage going to >3GB before the commit you
> blamed and to >1.9GB after the same commit.  So it looks different from
> what you reported, but it's still worrying that we use so much memory for
> mac bindings.
> 
> I'm assuming, however, that quite a few of the 1M mac bindings in your
> setup are stale, so would it be possible to enable mac_binding aging?  It
> needs to be enabled per router, with something like:
> 
>  $ ovn-nbctl set logical_router RTR options:mac_binding_age_threshold=60
> 
> The threshold is in seconds and is, for now, a hard limit after which a
> mac binding entry is removed.  There's work in progress to only clear
> ARP cache entries that are really stale [1].
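> 
> If you want to enable it on every router in one go, something along these
> lines should work (untested sketch; it assumes router names contain no
> whitespace):
> 
>  $ for rtr in $(ovn-nbctl --bare --columns=name list logical_router); do
>        ovn-nbctl set logical_router "$rtr" options:mac_binding_age_threshold=60
>    done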
> 
>> But if you are interested in any specific information, let me know and I
>> can check.
>>
> 
> How many "local" datapaths do we have on the hosts that exhibit high
> memory usage?
> 
> The quickest way I can think of to get this info is to run this on the
> hypervisor:
>  $ ovn-appctl ct-zone-list | grep snat -c
> 
> Additional question: how many mac_bindings are "local", i.e., associated
> with local datapaths?
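> 
> To get a rough per-datapath breakdown of the mac_bindings (from which the
> ones on local datapaths can then be picked out), something like this should
> do (untested):
> 
>  $ ovn-sbctl --bare --columns=datapath list mac_binding \
>        | sort | uniq -c | sort -rn | head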
> 
>>>
>>>>
>>>>> I guess it could be connected to this problem. Could anyone take a look
>>>>> and comment, please?
>>>>>
>>>>
>>>> Does the memory usage persist after the SB is upgraded too?  I see the
>>>> number of SB idl-cells eventually goes down, which means the periodic
>>>> malloc_trim() call should eventually free up memory.  We trim on idle
>>>> since
>>>> https://github.com/ovn-org/ovn/commit/b4c593d23bd959e98fcc9ada4a973ac933579ded
>>>>
>>>
>>> The DB schemas were not changed in this upgrade, so they’re already up to date.
>>>
>>>> Are you using a different allocator?  E.g., jemalloc.
>>>
>>> No, this issue reproduces with the default allocator (glibc malloc).
>>>
> 
> Can we run a test and see if malloc_trim() was actually called?  I'd
> first disable lflow-cache log rate limiting:
> 
>  $ ovn-appctl vlog/disable-rate-limit lflow_cache
> 
> Then check if malloc_trim() was called after the lflow-cache detected
> inactivity.  You'd see logs like:
> 
> "lflow_cache|INFO|Detected cache inactivity (last active 30005 ms ago):
> trimming cache"
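> 
> One quick way to check is to grep the ovn-controller log (the log path
> below is an assumption; adjust it for your setup):
> 
>  $ grep 'trimming cache' /var/log/ovn/ovn-controller.log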
> 
> The fact that the number of SB idl-cells goes down while memory usage
> doesn't seems to indicate we might have a bug in the auto cache-trimming
> mechanism.
> 
> In any case, a malloc_trim() can be manually triggered by flushing the
> lflow cache:
> 
>  $ ovn-appctl lflow-cache/flush
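> 
> and then re-checking whether RSS actually drops, with the same commands you
> used above, i.e. something like:
> 
>  $ ovn-appctl -t ovn-controller memory/show
>  $ ps-mem -p $(pidof ovn-controller) | grep ovn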
> 
> Thanks,
> Dumitru
> 
>>>>
>>>>>
>>>>> [0] 
>>>>> https://github.com/ovn-org/ovn/commit/1b0dbde940706e5de6e60221be78a278361fa76d
> 
> [1] https://patchwork.ozlabs.org/project/ovn/list/?series=359894&state=*
> 

I think I spotted an issue with the auto-trimming mechanism when the flows
we're removing are not in the lflow cache (I think that's the case for
the mac binding ones).  Would it be possible to try out the following
quick'n'dirty patch, which should trigger trimming more often?

Thanks!

diff --git a/controller/lflow-cache.c b/controller/lflow-cache.c
index f723eab8a0..1a23e133ae 100644
--- a/controller/lflow-cache.c
+++ b/controller/lflow-cache.c
@@ -24,7 +24,6 @@
 #include "coverage.h"
 #include "lflow-cache.h"
 #include "lib/uuid.h"
-#include "memory-trim.h"
 #include "openvswitch/vlog.h"
 #include "ovn/expr.h"
 
@@ -83,14 +82,14 @@ static void lflow_cache_delete__(struct lflow_cache *lc,
 static void lflow_cache_trim__(struct lflow_cache *lc, bool force);
 
 struct lflow_cache *
-lflow_cache_create(void)
+lflow_cache_create(struct memory_trimmer *mt)
 {
     struct lflow_cache *lc = xzalloc(sizeof *lc);
 
     for (size_t i = 0; i < LCACHE_T_MAX; i++) {
         hmap_init(&lc->entries[i]);
     }
-    lc->mt = memory_trimmer_create();
+    lc->mt = mt;
 
     return lc;
 }
@@ -124,7 +123,6 @@ lflow_cache_destroy(struct lflow_cache *lc)
     for (size_t i = 0; i < LCACHE_T_MAX; i++) {
         hmap_destroy(&lc->entries[i]);
     }
-    memory_trimmer_destroy(lc->mt);
     free(lc);
 }
 
diff --git a/controller/lflow-cache.h b/controller/lflow-cache.h
index 300a560770..07e017e25e 100644
--- a/controller/lflow-cache.h
+++ b/controller/lflow-cache.h
@@ -21,6 +21,7 @@
 #include "openvswitch/dynamic-string.h"
 #include "openvswitch/hmap.h"
 #include "openvswitch/uuid.h"
+#include "memory-trim.h"
 #include "simap.h"
 
 struct lflow_cache;
@@ -56,7 +57,7 @@ struct lflow_cache_value {
     };
 };
 
-struct lflow_cache *lflow_cache_create(void);
+struct lflow_cache *lflow_cache_create(struct memory_trimmer *);
 void lflow_cache_flush(struct lflow_cache *);
 void lflow_cache_destroy(struct lflow_cache *);
 void lflow_cache_enable(struct lflow_cache *, bool enabled, uint32_t capacity,
diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
index 96d01219e1..89ff37e747 100644
--- a/controller/ovn-controller.c
+++ b/controller/ovn-controller.c
@@ -5057,8 +5057,9 @@ main(int argc, char *argv[])
     unsigned int ovnsb_cond_seqno = UINT_MAX;
     unsigned int ovnsb_expected_cond_seqno = UINT_MAX;
 
+    struct memory_trimmer *mt = memory_trimmer_create();
     struct controller_engine_ctx ctrl_engine_ctx = {
-        .lflow_cache = lflow_cache_create(),
+        .lflow_cache = lflow_cache_create(mt),
         .if_mgr = if_status_mgr_create(),
     };
     struct if_status_mgr *if_mgr = ctrl_engine_ctx.if_mgr;
@@ -5447,6 +5448,10 @@ main(int argc, char *argv[])
                 engine_set_force_recompute(false);
             }
 
+            if (engine_has_updated()) {
+                memory_trimmer_record_activity(mt);
+            }
+
             store_nb_cfg(ovnsb_idl_txn, ovs_idl_txn, chassis_private,
                          br_int, delay_nb_cfg_report);
 
@@ -5552,6 +5557,7 @@ loop_done:
         }
     }
 
+    memory_trimmer_destroy(mt);
     engine_set_context(NULL);
     engine_cleanup();
 
diff --git a/controller/test-lflow-cache.c b/controller/test-lflow-cache.c
index 7ce9993605..1a73b8aad7 100644
--- a/controller/test-lflow-cache.c
+++ b/controller/test-lflow-cache.c
@@ -108,7 +108,8 @@ test_lflow_cache_stats__(struct lflow_cache *lc)
 static void
 test_lflow_cache_operations(struct ovs_cmdl_context *ctx)
 {
-    struct lflow_cache *lc = lflow_cache_create();
+    struct memory_trimmer *mt = memory_trimmer_create();
+    struct lflow_cache *lc = lflow_cache_create(mt);
     struct expr *e = expr_create_boolean(true);
     bool enabled = !strcmp(ctx->argv[1], "true");
     struct uuid *lflow_uuids = NULL;
@@ -235,6 +236,7 @@ test_lflow_cache_operations(struct ovs_cmdl_context *ctx)
         test_lflow_cache_stats__(lc);
     }
 done:
+    memory_trimmer_destroy(mt);
     lflow_cache_destroy(lc);
     free(lflow_uuids);
     expr_destroy(e);
@@ -259,7 +261,7 @@ test_lflow_cache_negative(struct ovs_cmdl_context *ctx OVS_UNUSED)
 
     struct lflow_cache *lcs[] = {
         NULL,
-        lflow_cache_create(),
+        lflow_cache_create(memory_trimmer_create()),
     };
 
     for (size_t i = 0; i < ARRAY_SIZE(lcs); i++) {


