On Thu, Feb 8, 2024 at 1:50 PM <[email protected]> wrote:
>
> From: Ilya Maximets <[email protected]>
>
> Currently, whenever a new logical flow is created, northd allocates
> a reference counter for every datapath in the datapath group for that
> logical flow.  Those reference counters are necessary in order to not
> remove the datapath from a logical flow during incremental processing
> if it was created from two different objects for the same datapath and
> now one of them is removed.
>
> However, that creates a serious scalability problem.  In large-scale
> setups with tens of thousands of logical flows applied to large
> datapath groups we may have hundreds of millions of these reference
> counters allocated, which is many GB of RSS just for that purpose.
>
> For example, in ovn-heater's cluster-density 500-node test, ovn-northd
> started to consume up to 8.5 GB of RAM.  In the same test before the
> reference counting, ovn-northd consumed only 2.5 GB.  All those memory
> allocations also increased CPU usage.  Re-compute time went up from
> 1.5 seconds to 6 seconds in the same test.  In the end we have about
> a 4x increase in both CPU and memory usage.
>
> Running under valgrind --tool=massif shows:
>
> -------------------------------------------------------------
>     total(B)   useful-heap(B)  extra-heap(B)    stacks(B)
> -------------------------------------------------------------
> 9,416,438,200   7,401,971,556  2,014,466,644            0
>
> 78.61% (7,401 MB) (heap allocation functions) malloc
> ->45.78% (4,311 MB) xcalloc__ (util.c:124)
> | ->45.68% (4,301 MB) xzalloc__ (util.c:134)
> | | ->45.68% (4,301 MB) xzalloc (util.c:168)
> | |   ->40.97% (3,857 MB) dp_refcnt_use (lflow-mgr.c:1348)
> | |   | ->40.89% (3,850 MB) lflow_table_add_lflow (lflow-mgr.c:696)
> | |   | | ->12.27% (1,155 MB) build_lb_rules_pre_stateful (northd.c:7180)
> | |   | | ->12.27% (1,155 MB) build_lb_rules (northd.c:7658)
> | |   | | ->12.27% (1,155 MB) build_gw_lrouter_nat_flows_for_lb (northd.c:11054)
> ->28.62% (2,694 MB) xmalloc__ (util.c:140)
> | ->28.62% (2,694 MB) xmalloc (util.c:175)
> |   ->06.71% (631 MB) resize (hmap.c:100)
> |   | ->06.01% (565 MB) hmap_expand_at (hmap.c:175)
> |   | | ->05.24% (492 MB) hmap_insert_at (hmap.h:309)
> |   | |   ->05.24% (492 MB) dp_refcnt_use (lflow-mgr.c:1351)
>
> 45% of all the memory is allocated for the reference counters
> themselves and another 7% is taken by the hash maps that hold them.
> Also, more than 20% of the total is memory allocation overhead
> (extra-heap), since all these allocated objects are very small (32 B).
>
> This test allocates 120 M reference counters in total.
>
> However, the vast majority of the reference counters always have
> a value of 1, i.e. these datapaths are not used more than once.
>
> Defer allocation of reference counters until the datapath is used
> for the same logical flow for the second time.  We can do that by
> checking the current datapath group bitmap.
>
> With this change, the number of allocated reference counters goes
> down from 120 M to just 12 M.  Memory consumption is reduced from
> 8.5 GB to 2.67 GB and the northd re-compute time is reduced from 6 to
> 2.1 seconds.
>
> That is still a little higher than the resource usage before the
> introduction of incremental processing for logical flows, but it is
> fairly manageable.  Also, the resource usage and memory consumption
> may be further improved by reducing the number of cases where northd
> attempts to create logical flows for the same datapaths multiple
> times.
>
> Note: the cluster-density test in ovn-heater creates new port groups
> on every iteration and ovn-northd doesn't handle this incrementally,
> so it always re-computes.  That's why there is no benefit from
> northd I-P for CPU usage in this scenario.
>
> Fixes: a623606052ea ("northd: Refactor lflow management into a separate module.")
> Signed-off-by: Ilya Maximets <[email protected]>
> ---
>  northd/lflow-mgr.c | 39 ++++++++++++++++++++-------------------
>  1 file changed, 20 insertions(+), 19 deletions(-)
>
> diff --git a/northd/lflow-mgr.c b/northd/lflow-mgr.c
> index df62cd6ab4..f70896cce7 100644
> --- a/northd/lflow-mgr.c
> +++ b/northd/lflow-mgr.c
> @@ -47,8 +47,7 @@ static void ovn_lflow_destroy(struct lflow_table *lflow_table,
>  static char *ovn_lflow_hint(const struct ovsdb_idl_row *row);
>
>  static struct ovn_lflow *do_ovn_lflow_add(
> -    struct lflow_table *, const struct ovn_datapath *,
> -    const unsigned long *dp_bitmap, size_t dp_bitmap_len, uint32_t hash,
> +    struct lflow_table *, size_t dp_bitmap_len, uint32_t hash,
>      enum ovn_stage stage, uint16_t priority, const char *match,
>      const char *actions, const char *io_port,
>      const char *ctrl_meter,
> @@ -674,9 +673,9 @@ lflow_table_add_lflow(struct lflow_table *lflow_table,
>
>      hash_lock = lflow_hash_lock(&lflow_table->entries, hash);
>      struct ovn_lflow *lflow =
> -        do_ovn_lflow_add(lflow_table, od, dp_bitmap,
> -                         dp_bitmap_len, hash, stage,
> -                         priority, match, actions,
> +        do_ovn_lflow_add(lflow_table,
> +                         od ? ods_size(od->datapaths) : dp_bitmap_len,
> +                         hash, stage, priority, match, actions,
>                           io_port, ctrl_meter, stage_hint, where);
>
>      if (lflow_ref) {
> @@ -702,17 +701,24 @@ lflow_table_add_lflow(struct lflow_table *lflow_table,
>              ovs_assert(lrn->dpgrp_bitmap_len == dp_bitmap_len);
>              size_t index;
>              BITMAP_FOR_EACH_1 (index, dp_bitmap_len, dp_bitmap) {
> -                dp_refcnt_use(&lflow->dp_refcnts_map, index);
> +                /* Allocate a reference counter only if already used. */
> +                if (bitmap_is_set(lflow->dpg_bitmap, index)) {
> +                    dp_refcnt_use(&lflow->dp_refcnts_map, index);
> +                }
>              }
>          } else {
> -            dp_refcnt_use(&lflow->dp_refcnts_map, lrn->dp_index);
> +            /* Allocate a reference counter only if already used. */
> +            if (bitmap_is_set(lflow->dpg_bitmap, lrn->dp_index)) {
> +                dp_refcnt_use(&lflow->dp_refcnts_map, lrn->dp_index);
> +            }
>          }
>      }
>      lrn->linked = true;
>  }
>
> -    lflow_hash_unlock(hash_lock);
> +    ovn_dp_group_add_with_reference(lflow, od, dp_bitmap, dp_bitmap_len);
>
> +    lflow_hash_unlock(hash_lock);
>  }
>
>  void
> @@ -946,9 +952,7 @@ ovn_lflow_destroy(struct lflow_table *lflow_table, struct ovn_lflow *lflow)
>  }
>
>  static struct ovn_lflow *
> -do_ovn_lflow_add(struct lflow_table *lflow_table,
> -                 const struct ovn_datapath *od,
> -                 const unsigned long *dp_bitmap, size_t dp_bitmap_len,
> +do_ovn_lflow_add(struct lflow_table *lflow_table, size_t dp_bitmap_len,
>                   uint32_t hash, enum ovn_stage stage, uint16_t priority,
>                   const char *match, const char *actions,
>                   const char *io_port, const char *ctrl_meter,
> @@ -959,14 +963,11 @@ do_ovn_lflow_add(struct lflow_table *lflow_table,
>      struct ovn_lflow *old_lflow;
>      struct ovn_lflow *lflow;
>
> -    size_t bitmap_len = od ? ods_size(od->datapaths) : dp_bitmap_len;
> -    ovs_assert(bitmap_len);
> +    ovs_assert(dp_bitmap_len);
>
>      old_lflow = ovn_lflow_find(&lflow_table->entries, stage,
>                                 priority, match, actions, ctrl_meter, hash);
>      if (old_lflow) {
> -        ovn_dp_group_add_with_reference(old_lflow, od, dp_bitmap,
> -                                        bitmap_len);
>          return old_lflow;
>      }
>
> @@ -974,14 +975,12 @@ do_ovn_lflow_add(struct lflow_table *lflow_table,
>      /* While adding new logical flows we're not setting single datapath, but
>       * collecting a group.  'od' will be updated later for all flows with only
>       * one datapath in a group, so it could be hashed correctly. */
> -    ovn_lflow_init(lflow, NULL, bitmap_len, stage, priority,
> +    ovn_lflow_init(lflow, NULL, dp_bitmap_len, stage, priority,
>                     xstrdup(match), xstrdup(actions),
>                     io_port ? xstrdup(io_port) : NULL,
>                     nullable_xstrdup(ctrl_meter),
>                     ovn_lflow_hint(stage_hint), where);
>
> -    ovn_dp_group_add_with_reference(lflow, od, dp_bitmap, bitmap_len);
> -
>      if (parallelization_state != STATE_USE_PARALLELIZATION) {
>          hmap_insert(&lflow_table->entries, &lflow->hmap_node, hash);
>      } else {
> @@ -1350,8 +1349,10 @@ dp_refcnt_use(struct hmap *dp_refcnts_map, size_t dp_index)
>      struct dp_refcnt *dp_refcnt = dp_refcnt_find(dp_refcnts_map, dp_index);
>
>      if (!dp_refcnt) {
> -        dp_refcnt = xzalloc(sizeof *dp_refcnt);
> +        dp_refcnt = xmalloc(sizeof *dp_refcnt);
>          dp_refcnt->dp_index = dp_index;
> +        /* Allocation is happening on the second (!) use. */
> +        dp_refcnt->refcnt = 1;
>
>          hmap_insert(dp_refcnts_map, &dp_refcnt->key_node, dp_index);
>      }
> --
> 2.43.0
>
> _______________________________________________
> dev mailing list
> [email protected]
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Thanks Ilya and Numan.

Acked-by: Han Zhou <[email protected]>
