One day people forget about e-mail formatting problems. Not today. Second attempt to send mail with correct formatting. Sorry for the noise.
On 29.03.2025 01:41, Ilya Maximets wrote: > On 3/28/25 11:00, Vladislav Odintsov wrote: >> On 26.03.2025 15:30, Ilya Maximets wrote: >>> On 3/25/25 00:21, Vladislav Odintsov wrote: >>>> With this commit the rebalance of hash entries between bonding members >>>> becomes less frequent. If we know all bond members' speed, we do not >>>> move hash entries between them if load difference is less than 1.5% of >>>> minimum link speed of bond members. >>>> >>>> Reported-at: >>>> https://mail.openvswitch.org/pipermail/ovs-dev/2025-March/422028.html >>>> Suggested-by: Ilya Maximets <i.maxim...@ovn.org> >>>> Signed-off-by: Vladislav Odintsov <vlodintsov@k2.cloud> >>>> --- >>>> v1 -> v2: >>>> - Addressed Ilya's, Eelco's, Mike's review comments. >>>> - Docs updated. >>>> - NEWS entry added. >>>> --- >>>> Documentation/topics/bonding.rst | 3 ++- >>>> NEWS | 4 ++++ >>>> ofproto/bond.c | 25 +++++++++++++++++++------ >>>> 3 files changed, 25 insertions(+), 7 deletions(-) >>> Hi, Vladislav. Thanks for v2! >>> >>> Have you seen my reply for your observation about comparing speed to raw >>> byte counters here: >>> https://mail.openvswitch.org/pipermail/ovs-dev/2025-March/422255.html >>> >>> Tl;DR; I think you're right and we need to change the way we count the load >>> in order to be able to compare the load to a link speed. >>> >>> Best regards, Ilya Maximets. >> Hi Ilya, >> >> Thanks for review! >> >> I changed the condition before calling a bond_shift_load() function, so >> now Bps with Bps instead of B with Bps should be compared. It seems to >> me that this is good enough. >> >> Do I understand you correctly, that you proposed to change the movement >> logic inside bond_shift_load() function? If yes, I'm wondering why do >> we care, whether we compare total carried bytes during last rebalance >> interval or an average bytes per second during the same interval? In >> both cases we seem to fairly compare same units: B vs B or Bps vs Bps, >> so the current logic of this function looks correct for me. And in both >> cases the result should be the same. Am I missing something? >> > Hmm. It's complicated. Yes, you're comparing apples to apples here, i.e. > this version compares speed to speed. The problem, I think, is that our > total amount of bytes divided by the rebalance interval doesn't give a > correct or accurate enough approximation of the average speed. > > Let's say we have a constant 100Bps traffic on the port. And let's say > we have a standard 10sec rebalance interval. > > After the first 10 seconds, our tx_bytes on this port will be 1000B. > Then we'll divide it by 2 for our ad-hoc exponentially weighted moving > average implementation. So, it becomes 500B. After the next interval > we'll have another 1000B plus our 500B from the previous interval, so > 1500B in total. That, divided by the rebalancing interval, gives us > 150Bps "average speed", which is 50% higher than the actual average speed > on this interface. Next time it will be 175, and so on. In a long run > this is 100 * (1 + 1/2 + 1/4 + ...) = 200, which 2x of our actual average > speed. So, on a constant packet rate, this way of measuring is not > accurate at all. > > I'm not doing the math for the non-constant speed, but I'd guess it will > not be accurate enough either, even if it may average better in some > scenarios it might be much worse in others. And I'm not sure if we can > make this style of calculation accurate enough in all cases by just > choosing a different divisor. > > All in all, I think, we need to change the way we calculate the load in > order to have a more or less accurate idea about the average speed. > > The Exponential Moving Average from lib/mov-avg.h should give us much > more accurate result. The problem, however, is that we can't easily > shift the load if we calculate the average on a sum. So, we'll need to > calculate an average speed per hash and then compare sums of averages > with the link speed. > > Does that make sense? > > Best regards, Ilya Maximets. Hi, Ilya. I need to clarify, do we divide each hash by 2 to achieve next points: 1. have a monotonically decreasing tx_bytes to not to overload uint64_t; 2. at the same time "save" the knowledge of previous load in order to avoid frequent unnecessary shifts. Is that correct? Also, I'm not sure I fully understood your proposed approach... Do you want to switch from tx_bytes as a cumulative metric to a new "avg tx per second" metric? I guess this can be done approximately in the next order: 1. calculate the avg speed per second per each hash (tx_bytes / rebalance_interval) on rebalance run; 2. store this average speed inside hash_entry; 3. summarize hash_entries speed in "member average speed"; 4. do all calculations in bond_rebalance() in "avg bps" manner instead of "total bytes": here we can use same algorithm in bond_shift_load just re-orienting it to bps variable. Something like this (note: diff is based on current patch; not tested, only as an example of idea): diff --git a/ofproto/bond.c b/ofproto/bond.c index e2cbdd5ec..119baef4e 100644 --- a/ofproto/bond.c +++ b/ofproto/bond.c @@ -63,6 +63,8 @@ struct bond_entry { struct bond_member *member; /* Assigned member, NULL if unassigned. */ uint64_t tx_bytes /* Count of bytes recently transmitted. */ OVS_GUARDED_BY(rwlock); + /* Average bytes per second recently transmitted. */ + uint64_t tx_bps_avg OVS_GUARDED_BY(rwlock); struct ovs_list list_node; /* In bond_member's 'entries' list. */ /* Recirculation. @@ -95,7 +97,7 @@ struct bond_member { /* Rebalancing info. Used only by bond_rebalance(). */ struct ovs_list bal_node; /* In bond_rebalance()'s 'bals' list. */ struct ovs_list entries; /* 'struct bond_entry's assigned here. */ - uint64_t tx_bytes; /* Sum across 'tx_bytes' of entries. */ + uint64_t tx_bps_avg; /* Sum across 'tx_bps_avg' of entries. */ }; /* A bond, that is, a set of network devices grouped to improve performance or @@ -1163,8 +1165,8 @@ log_bals(struct bond *bond, const struct ovs_list *bals) if (ds.length) { ds_put_char(&ds, ','); } - ds_put_format(&ds, " %s %"PRIu64"kB", - member->name, member->tx_bytes / 1024); + ds_put_format(&ds, " %s %"PRIu64"kB/s", + member->name, member->tx_bps_avg / 1024); if (!member->enabled) { ds_put_cstr(&ds, " (disabled)"); @@ -1195,19 +1197,19 @@ bond_shift_load(struct bond_entry *hash, struct bond_member *to) { struct bond_member *from = hash->member; struct bond *bond = from->bond; - uint64_t delta = hash->tx_bytes; + uint64_t delta = hash->tx_bps_avg; VLOG_DBG("bond %s: shift %"PRIu64"kB of load (with hash %"PRIdPTR") " "from %s to %s (now carrying %"PRIu64"kB and " "%"PRIu64"kB load, respectively)", bond->name, delta / 1024, hash - bond->hash, from->name, to->name, - (from->tx_bytes - delta) / 1024, - (to->tx_bytes + delta) / 1024); + (from->tx_bps_avg - delta) / 1024, + (to->tx_bps_avg + delta) / 1024); /* Shift load away from 'from' to 'to'. */ - from->tx_bytes -= delta; - to->tx_bytes += delta; + from->tx_bps_avg -= delta; + to->tx_bps_avg += delta; /* Arrange for flows to be revalidated. */ hash->member = to; @@ -1222,15 +1224,16 @@ bond_shift_load(struct bond_entry *hash, struct bond_member *to) * The list of entries is sorted in descending order of load. This allows us * to collect subset of entries with accumulated load close to ideal. */ static size_t -choose_entries_to_migrate(const struct bond_member *from, uint64_t to_tx_bytes, +choose_entries_to_migrate(const struct bond_member *from, + uint64_t to_tx_bps_avg, struct bond_entry **to_migrate) OVS_REQ_WRLOCK(rwlock) { struct bond_entry *e; /* Note, the ideal traffic is the mid point between 'from' and 'to'. * This value does not change by rebalancing. */ - uint64_t ideal_tx_bytes = (from->tx_bytes + to_tx_bytes) / 2; - uint64_t ideal_delta = ideal_tx_bytes - to_tx_bytes; + uint64_t ideal_tx_bps_avg = (from->tx_bps_avg + to_tx_bps_avg) / 2; + uint64_t ideal_delta = ideal_tx_bps_avg - to_tx_bps_avg; uint64_t delta = 0; /* The amount to rebalance. */ uint64_t new_low; /* The lower bandwidth between 'to' and 'from' * after rebalancing. */ @@ -1244,10 +1247,10 @@ choose_entries_to_migrate(const struct bond_member *from, uint64_t to_tx_bytes, } LIST_FOR_EACH (e, list_node, &from->entries) { - if (delta + e->tx_bytes <= ideal_delta) { + if (delta + e->tx_bps_avg <= ideal_delta) { /* Take next entry if amount to rebalance will not exceed ideal. */ to_migrate[cnt++] = e; - delta += e->tx_bytes; + delta += e->tx_bps_avg; } if (ideal_delta - delta < migration_threshold) { /* Stop collecting hashes if we're close enough to the ideal value @@ -1263,13 +1266,13 @@ choose_entries_to_migrate(const struct bond_member *from, uint64_t to_tx_bytes, ASSIGN_CONTAINER(closest, ovs_list_back(&from->entries), list_node); - delta = closest->tx_bytes; + delta = closest->tx_bps_avg; to_migrate[cnt++] = closest; } - new_low = MIN(from->tx_bytes - delta, to_tx_bytes + delta); - if ((new_low > to_tx_bytes) && - (new_low - to_tx_bytes >= migration_threshold)) { + new_low = MIN(from->tx_bps_avg - delta, to_tx_bps_avg + delta); + if ((new_low > to_tx_bps_avg) && + (new_low - to_tx_bps_avg >= migration_threshold)) { /* Only rebalance if the new 'low' is closer to to the mid point and the * improvement of traffic deviation from the ideal split exceeds 10% * (migration threshold). @@ -1290,7 +1293,7 @@ insert_bal(struct ovs_list *bals, struct bond_member *member) struct bond_member *pos; LIST_FOR_EACH (pos, bal_node, bals) { - if (member->tx_bytes > pos->tx_bytes) { + if (member->tx_bps_avg > pos->tx_bps_avg) { break; } } @@ -1356,7 +1359,7 @@ bond_rebalance(struct bond *bond) /* Add each bond_entry to its member's 'entries' list. * Compute each member's tx_bytes as the sum of its entries' tx_bytes. */ HMAP_FOR_EACH (member, hmap_node, &bond->members) { - member->tx_bytes = 0; + member->tx_bps_avg = 0; ovs_list_init(&member->entries); } @@ -1368,10 +1371,12 @@ bond_rebalance(struct bond *bond) /* Iteration over sorted bond hashes will give us sorted 'entries'. */ for (int i = 0; i < BOND_BUCKETS; i++) { e = hashes[i]; + uint64_t tx_bps_avg = e->tx_bytes / rebalance_interval_sec; if (e->member && e->tx_bytes) { - e->member->tx_bytes += e->tx_bytes; + e->member->tx_bps_avg += tx_bps_avg + e->tx_bps_avg; ovs_list_push_back(&e->member->entries, &e->list_node); } + e->tx_bps_avg = tx_bps_avg; } /* Add enabled members to 'bals' in descending order of tx_bytes. @@ -1402,8 +1407,8 @@ bond_rebalance(struct bond *bond) struct bond_member *to = bond_member_from_bal_node(ovs_list_back(&bals)); - overload = (from->tx_bytes - to->tx_bytes) / rebalance_interval_sec; - if (overload < (to->tx_bytes / rebalance_interval_sec) >> 5 || + overload = from->tx_bps_avg - to->tx_bps_avg; + if (overload < to->tx_bps_avg >> 5 || overload * 8 / 1000000 < overload_threshold) { /* The extra average load per second on 'from' (and all less-loaded * members), compared to that of 'to' (the least-loaded member), is @@ -1414,7 +1419,7 @@ bond_rebalance(struct bond *bond) /* 'from' is carrying significantly more load than 'to'. Pick hashes * to move from 'from' to 'to'. */ - size_t cnt = choose_entries_to_migrate(from, to->tx_bytes, hashes); + size_t cnt = choose_entries_to_migrate(from, to->tx_bps_avg, hashes); if (!cnt) { /* Can't usefully migrate anything away from 'from'. * Don't reconsider it. */ @@ -1440,11 +1445,9 @@ bond_rebalance(struct bond *bond) rebalanced = true; } - /* Implement exponentially weighted moving average. A weight of 1/2 causes - * historical data to decay to <1% in 7 rebalancing runs. 1,000,000 bytes - * take 20 rebalancing runs to decay to 0 and get deleted entirely. */ + /* Zeroize tx_bytes so hash AVG speed can be calculated accurately. */ for (e = &bond->hash[0]; e <= &bond->hash[BOND_MASK]; e++) { - e->tx_bytes /= 2; + e->tx_bytes = 0; } if (rebalanced) { @@ -1455,8 +1458,8 @@ bond_rebalance(struct bond *bond) if (++i > 1) { ds_put_cstr(&member_stats, " and "); } - ds_put_format(&member_stats, "%s:%"PRIu64"kB", member->name, - member->tx_bytes / 1024); + ds_put_format(&member_stats, "%s:%"PRIu64"kB/s", member->name, + member->tx_bps_avg / 1024); } VLOG_INFO("bond %s: rebalanced (now carrying: %s)", bond->name, ds_cstr(&member_stats)); -- regards, Vladislav Odintsov _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev