On 2/13/24 08:19, Han Zhou wrote:
> On Thu, Feb 8, 2024 at 1:50 PM <[email protected]> wrote:
>>
>> From: Ilya Maximets <[email protected]>
>>
>> Currently, whenever a new logical flow is created, northd allocates
>> a reference counter for every datapath in the datapath group for that
>> logical flow.  Those reference counters are necessary in order to not
>> remove the datapath from a logical flow during incremental processing
>> if it was created from two different objects for the same datapath and
>> now one of them is removed.
>>
>> However, that creates a serious scalability problem.  In large-scale
>> setups with tens of thousands of logical flows applied to large
>> datapath groups we may have hundreds of millions of these reference
>> counters allocated, which is many GB of RSS just for that purpose.
>>
>> For example, in ovn-heater's cluster-density 500-node test, ovn-northd
>> started to consume up to 8.5 GB of RAM.  In the same test before the
>> reference counting, ovn-northd consumed only 2.5 GB.  All those memory
>> allocations also increased CPU usage.  Re-compute time went up from
>> 1.5 seconds to 6 seconds in the same test.  In the end we have about
>> a 4x increase in both CPU and memory usage.
>>
>> Running under valgrind --tool=massif shows:
>>
>>   -------------------------------------------------------------
>>       total(B)   useful-heap(B)   extra-heap(B)     stacks(B)
>>   -------------------------------------------------------------
>>   9,416,438,200   7,401,971,556   2,014,466,644             0
>>
>> 78.61% (7,401 MB) (heap allocation functions) malloc
>> ->45.78% (4,311 MB) xcalloc__ (util.c:124)
>> | ->45.68% (4,301 MB) xzalloc__ (util.c:134)
>> | | ->45.68% (4,301 MB) xzalloc (util.c:168)
>> | | ->40.97% (3,857 MB) dp_refcnt_use (lflow-mgr.c:1348)
>> | | | ->40.89% (3,850 MB) lflow_table_add_lflow (lflow-mgr.c:696)
>> | | | | ->12.27% (1,155 MB) build_lb_rules_pre_stateful (northd.c:7180)
>> | | | | ->12.27% (1,155 MB) build_lb_rules (northd.c:7658)
>> | | | | ->12.27% (1,155 MB) build_gw_lrouter_nat_flows_for_lb (northd.c:11054)
>> ->28.62% (2,694 MB) xmalloc__ (util.c:140)
>> | ->28.62% (2,694 MB) xmalloc (util.c:175)
>> | ->06.71% (631 MB) resize (hmap.c:100)
>> | | ->06.01% (565 MB) hmap_expand_at (hmap.c:175)
>> | | | ->05.24% (492 MB) hmap_insert_at (hmap.h:309)
>> | | | | ->05.24% (492 MB) dp_refcnt_use (lflow-mgr.c:1351)
>>
>> 45% of all the memory is allocated for the reference counters themselves
>> and another 7% is taken by the hash maps that hold them.  Also, there is
>> more than 20% of total memory allocation overhead (extra-heap) since
>> all these allocated objects are very small (32 B).
>>
>> This test allocates 120 M reference counters in total.
>>
>> However, the vast majority of all the reference counters always have
>> a value of 1, i.e. these datapaths are not used more than once.
>>
>> Defer allocation of reference counters until the datapath is used
>> for the same logical flow for the second time.  We can do that by
>> checking the current datapath group bitmap.
>>
>> With this change, the number of allocated reference counters goes
>> down from 120 M to just 12 M.  Memory consumption is reduced from
>> 8.5 GB to 2.67 GB and the northd recompute time is reduced from 6 to
>> 2.1 seconds.
>>
>> It is still a little higher than the resource usage before the
>> introduction of incremental processing for logical flows, but it is
>> fairly manageable.
>> Also, the resource usage and memory consumption may be further improved
>> by reducing the number of cases where northd attempts to create the
>> logical flows for the same datapaths multiple times.
>>
>> Note: the cluster-density test in ovn-heater creates new port groups
>> on every iteration and ovn-northd doesn't handle this incrementally,
>> so it always re-computes.  That's why there is no benefit from
>> northd I-P for CPU usage in this scenario.
>>
>> Fixes: a623606052ea ("northd: Refactor lflow management into a separate module.")
>> Signed-off-by: Ilya Maximets <[email protected]>
>> ---
>>  northd/lflow-mgr.c | 39 ++++++++++++++++++++-------------------
>>  1 file changed, 20 insertions(+), 19 deletions(-)
>>
>> diff --git a/northd/lflow-mgr.c b/northd/lflow-mgr.c
>> index df62cd6ab4..f70896cce7 100644
>> --- a/northd/lflow-mgr.c
>> +++ b/northd/lflow-mgr.c
>> @@ -47,8 +47,7 @@ static void ovn_lflow_destroy(struct lflow_table *lflow_table,
>>  static char *ovn_lflow_hint(const struct ovsdb_idl_row *row);
>>
>>  static struct ovn_lflow *do_ovn_lflow_add(
>> -    struct lflow_table *, const struct ovn_datapath *,
>> -    const unsigned long *dp_bitmap, size_t dp_bitmap_len, uint32_t hash,
>> +    struct lflow_table *, size_t dp_bitmap_len, uint32_t hash,
>>      enum ovn_stage stage, uint16_t priority, const char *match,
>>      const char *actions, const char *io_port,
>>      const char *ctrl_meter,
>> @@ -674,9 +673,9 @@ lflow_table_add_lflow(struct lflow_table *lflow_table,
>>
>>      hash_lock = lflow_hash_lock(&lflow_table->entries, hash);
>>      struct ovn_lflow *lflow =
>> -        do_ovn_lflow_add(lflow_table, od, dp_bitmap,
>> -                         dp_bitmap_len, hash, stage,
>> -                         priority, match, actions,
>> +        do_ovn_lflow_add(lflow_table,
>> +                         od ? ods_size(od->datapaths) : dp_bitmap_len,
>> +                         hash, stage, priority, match, actions,
>>                           io_port, ctrl_meter, stage_hint, where);
>>
>>      if (lflow_ref) {
>> @@ -702,17 +701,24 @@ lflow_table_add_lflow(struct lflow_table *lflow_table,
>>                  ovs_assert(lrn->dpgrp_bitmap_len == dp_bitmap_len);
>>                  size_t index;
>>                  BITMAP_FOR_EACH_1 (index, dp_bitmap_len, dp_bitmap) {
>> -                    dp_refcnt_use(&lflow->dp_refcnts_map, index);
>> +                    /* Allocate a reference counter only if already used. */
>> +                    if (bitmap_is_set(lflow->dpg_bitmap, index)) {
>> +                        dp_refcnt_use(&lflow->dp_refcnts_map, index);
>> +                    }
>>                  }
>>              } else {
>> -                dp_refcnt_use(&lflow->dp_refcnts_map, lrn->dp_index);
>> +                /* Allocate a reference counter only if already used. */
>> +                if (bitmap_is_set(lflow->dpg_bitmap, lrn->dp_index)) {
>> +                    dp_refcnt_use(&lflow->dp_refcnts_map, lrn->dp_index);
>> +                }
>>              }
>>          }
>>          lrn->linked = true;
>>      }
>>
>> -    lflow_hash_unlock(hash_lock);
>> +    ovn_dp_group_add_with_reference(lflow, od, dp_bitmap, dp_bitmap_len);
>>
>> +    lflow_hash_unlock(hash_lock);
>>  }
>>
>>  void
>> @@ -946,9 +952,7 @@ ovn_lflow_destroy(struct lflow_table *lflow_table, struct ovn_lflow *lflow)
>>  }
>>
>>  static struct ovn_lflow *
>> -do_ovn_lflow_add(struct lflow_table *lflow_table,
>> -                 const struct ovn_datapath *od,
>> -                 const unsigned long *dp_bitmap, size_t dp_bitmap_len,
>> +do_ovn_lflow_add(struct lflow_table *lflow_table, size_t dp_bitmap_len,
>>                   uint32_t hash, enum ovn_stage stage, uint16_t priority,
>>                   const char *match, const char *actions,
>>                   const char *io_port, const char *ctrl_meter,
>> @@ -959,14 +963,11 @@ do_ovn_lflow_add(struct lflow_table *lflow_table,
>>      struct ovn_lflow *old_lflow;
>>      struct ovn_lflow *lflow;
>>
>> -    size_t bitmap_len = od ? ods_size(od->datapaths) : dp_bitmap_len;
>> -    ovs_assert(bitmap_len);
>> +    ovs_assert(dp_bitmap_len);
>>
>>      old_lflow = ovn_lflow_find(&lflow_table->entries, stage,
>>                                 priority, match, actions, ctrl_meter, hash);
>>      if (old_lflow) {
>> -        ovn_dp_group_add_with_reference(old_lflow, od, dp_bitmap,
>> -                                        bitmap_len);
>>          return old_lflow;
>>      }
>>
>> @@ -974,14 +975,12 @@ do_ovn_lflow_add(struct lflow_table *lflow_table,
>>      /* While adding new logical flows we're not setting single datapath, but
>>       * collecting a group.  'od' will be updated later for all flows with only
>>       * one datapath in a group, so it could be hashed correctly. */
>> -    ovn_lflow_init(lflow, NULL, bitmap_len, stage, priority,
>> +    ovn_lflow_init(lflow, NULL, dp_bitmap_len, stage, priority,
>>                     xstrdup(match), xstrdup(actions),
>>                     io_port ? xstrdup(io_port) : NULL,
>>                     nullable_xstrdup(ctrl_meter),
>>                     ovn_lflow_hint(stage_hint), where);
>>
>> -    ovn_dp_group_add_with_reference(lflow, od, dp_bitmap, bitmap_len);
>> -
>>      if (parallelization_state != STATE_USE_PARALLELIZATION) {
>>          hmap_insert(&lflow_table->entries, &lflow->hmap_node, hash);
>>      } else {
>> @@ -1350,8 +1349,10 @@ dp_refcnt_use(struct hmap *dp_refcnts_map, size_t dp_index)
>>      struct dp_refcnt *dp_refcnt = dp_refcnt_find(dp_refcnts_map, dp_index);
>>
>>      if (!dp_refcnt) {
>> -        dp_refcnt = xzalloc(sizeof *dp_refcnt);
>> +        dp_refcnt = xmalloc(sizeof *dp_refcnt);
>>          dp_refcnt->dp_index = dp_index;
>> +        /* Allocation is happening on the second (!) use. */
>> +        dp_refcnt->refcnt = 1;
>>
>>          hmap_insert(dp_refcnts_map, &dp_refcnt->key_node, dp_index);
>>      }
>> --
>> 2.43.0
>>
>> _______________________________________________
>> dev mailing list
>> [email protected]
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
> Thanks Ilya and Numan.
> Acked-by: Han Zhou <[email protected]>
Looks good to me too:

Acked-by: Dumitru Ceara <[email protected]>

Thanks!

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
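
[Editor's note] For readers following the idea rather than the exact diff, below is a
minimal standalone sketch, in plain C, of the deferred reference-counter allocation
described in the commit message above.  It is not the OVN code: the names
(struct lflow_sketch, dp_ref, lflow_use_dp, lflow_unuse_dp, N_DPS) are invented for
illustration, and the real implementation is the northd/lflow-mgr.c change in the
quoted patch.  The first reference to a datapath index is represented only by its
bit in the group bitmap; an explicit counter is allocated on the second use and
starts at 1, mirroring the dp_refcnt_use() change.

/* Hypothetical illustration only -- not northd/lflow-mgr.c. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define N_DPS 4

struct dp_ref {
    size_t refcnt;                 /* References beyond the implicit first one. */
};

struct lflow_sketch {
    bool dpg_bitmap[N_DPS];        /* Datapaths in this flow's group.           */
    struct dp_ref *refs[N_DPS];    /* Lazily allocated counters, NULL if unused. */
};

/* Records one more use of datapath 'idx' for this flow. */
static void
lflow_use_dp(struct lflow_sketch *lf, size_t idx)
{
    if (!lf->dpg_bitmap[idx]) {
        lf->dpg_bitmap[idx] = true;     /* First use: bit only, no allocation.  */
    } else if (!lf->refs[idx]) {
        lf->refs[idx] = calloc(1, sizeof *lf->refs[idx]);
        lf->refs[idx]->refcnt = 1;      /* Allocated on the second (!) use.     */
    } else {
        lf->refs[idx]->refcnt++;
    }
}

/* Drops one use; returns true if the datapath leaves the group. */
static bool
lflow_unuse_dp(struct lflow_sketch *lf, size_t idx)
{
    if (lf->refs[idx]) {
        if (--lf->refs[idx]->refcnt == 0) {
            free(lf->refs[idx]);
            lf->refs[idx] = NULL;
        }
        return false;                   /* Still referenced at least once.      */
    }
    lf->dpg_bitmap[idx] = false;        /* Last (implicit) reference dropped.   */
    return true;
}

int
main(void)
{
    struct lflow_sketch lf = { { false }, { NULL } };

    lflow_use_dp(&lf, 2);                      /* no counter allocated         */
    lflow_use_dp(&lf, 2);                      /* counter allocated, value 1   */
    printf("%d\n", lflow_unuse_dp(&lf, 2));    /* 0: datapath stays in group   */
    printf("%d\n", lflow_unuse_dp(&lf, 2));    /* 1: datapath leaves the group */
    return 0;
}

The payoff is that a counter value of exactly 1, which the commit message says is
the overwhelmingly common case, costs no allocation at all.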
