My previous mail was formatted incorrectly, so I'm re-sending it with
correct formatting.

On 29.03.2025 01:41, Ilya Maximets wrote:
> On 3/28/25 11:00, Vladislav Odintsov wrote:
>> On 26.03.2025 15:30, Ilya Maximets wrote:
>>> On 3/25/25 00:21, Vladislav Odintsov wrote:
>>>> With this commit the rebalance of hash entries between bonding members
>>>> becomes less frequent.  If we know all bond members' speed, we do not
>>>> move hash entries between them if load difference is less than 1.5% of
>>>> minimum link speed of bond members.
>>>>
>>>> Reported-at: 
>>>> https://mail.openvswitch.org/pipermail/ovs-dev/2025-March/422028.html
>>>> Suggested-by: Ilya Maximets <i.maxim...@ovn.org>
>>>> Signed-off-by: Vladislav Odintsov <vlodintsov@k2.cloud>
>>>> ---
>>>> v1 -> v2:
>>>>     - Addressed Ilya's, Eelco's, Mike's review comments.
>>>>     - Docs updated.
>>>>     - NEWS entry added.
>>>> ---
>>>>    Documentation/topics/bonding.rst |  3 ++-
>>>>    NEWS                             |  4 ++++
>>>>    ofproto/bond.c                   | 25 +++++++++++++++++++------
>>>>    3 files changed, 25 insertions(+), 7 deletions(-)
>>> Hi, Vladislav.  Thanks for v2!
>>>
>>> Have you seen my reply for your observation about comparing speed to raw
>>> byte counters here:
>>>     https://mail.openvswitch.org/pipermail/ovs-dev/2025-March/422255.html
>>>
>>> TL;DR: I think you're right, and we need to change the way we count the
>>> load in order to be able to compare the load to a link speed.
>>>
>>> Best regards, Ilya Maximets.
>> Hi Ilya,
>>
>> Thanks for review!
>>
>> I changed the condition before calling the bond_shift_load() function,
>> so now Bps is compared with Bps instead of B with Bps.  That seems good
>> enough to me.
>>
>> Do I understand you correctly that you propose to change the movement
>> logic inside the bond_shift_load() function?  If so, I'm wondering why
>> we care whether we compare total bytes carried during the last rebalance
>> interval or average bytes per second over the same interval.  In both
>> cases we seem to fairly compare the same units: B vs B or Bps vs Bps,
>> so the current logic of this function looks correct to me.  And in both
>> cases the result should be the same.  Am I missing something?
>>
> Hmm.  It's complicated.  Yes, you're comparing apples to apples here, i.e.
> this version compares speed to speed.  The problem, I think, is that our
> total amount of bytes divided by the rebalance interval doesn't give a
> correct or accurate enough approximation of the average speed.
>
> Let's say we have a constant 100Bps traffic on the port.  And let's say
> we have a standard 10sec rebalance interval.
>
> After the first 10 seconds, our tx_bytes on this port will be 1000B.
> Then we'll divide it by 2 for our ad-hoc exponentially weighted moving
> average implementation.  So, it becomes 500B.  After the next interval
> we'll have another 1000B plus our 500B from the previous interval, so
> 1500B in total.  That, divided by the rebalancing interval, gives us
> 150Bps "average speed", which is 50% higher than the actual average speed
> on this interface.  Next time it will be 175, and so on.  In the long run
> this is 100 * (1 + 1/2 + 1/4 + ...) = 200, which is 2x of our actual average
> speed.  So, on a constant packet rate, this way of measuring is not
> accurate at all.
>
> I'm not doing the math for the non-constant speed, but I'd guess it will
> not be accurate enough either, even if it may average better in some
> scenarios it might be much worse in others.  And I'm not sure if we can
> make this style of calculation accurate enough in all cases by just
> choosing a different divisor.
>
> All in all, I think, we need to change the way we calculate the load in
> order to have a more or less accurate idea about the average speed.
>
> The Exponential Moving Average from lib/mov-avg.h should give us much
> more accurate result.  The problem, however, is that we can't easily
> shift the load if we calculate the average on a sum.  So, we'll need to
> calculate an average speed per hash and then compare sums of averages
> with the link speed.
>
> Does that make sense?
>
> Best regards, Ilya Maximets.

Hi, Ilya.

I need to clarify: do we divide each hash's tx_bytes by 2 to achieve the
following:

1. keep tx_bytes decaying over time so it does not overflow uint64_t;
2. at the same time "save" the knowledge of the previous load in order to
   avoid frequent unnecessary shifts.

Is that correct?

Also, I'm not sure I fully understood your proposed approach... Do you
want to switch from tx_bytes as a cumulative metric to a new "avg tx per
second" metric?  I guess this can be done roughly in the following order:

1. on each rebalance run, calculate the average speed per second for each
   hash (tx_bytes / rebalance_interval);
2. store this average speed inside hash_entry;
3. sum the hash_entries' speeds into a "member average speed";
4. do all calculations in bond_rebalance() in "avg bps" terms instead of
   "total bytes": here we can use the same algorithm in bond_shift_load(),
   just reorienting it to the bps variable.
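As a quick sanity check of step 1 (again plain Python, not OVS code): if
tx_bytes is zeroed after every rebalance run, then tx_bytes /
rebalance_interval is the exact average speed of the last interval, with
no inflation from history:

```python
RATE_BPS = 100      # constant transmit rate, bytes per second
INTERVAL_S = 10     # rebalance interval, seconds

tx_bytes = 0
for _ in range(5):
    tx_bytes += RATE_BPS * INTERVAL_S    # bytes seen during this interval
    tx_bps_avg = tx_bytes // INTERVAL_S  # per-interval average speed
    tx_bytes = 0                         # zeroize after each run
    assert tx_bps_avg == RATE_BPS        # always the true rate, never 2x
```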

Something like this (note: the diff is based on the current patch; it is
not tested and serves only as an illustration of the idea):

diff --git a/ofproto/bond.c b/ofproto/bond.c
index e2cbdd5ec..119baef4e 100644
--- a/ofproto/bond.c
+++ b/ofproto/bond.c
@@ -63,6 +63,8 @@ struct bond_entry {
     struct bond_member *member; /* Assigned member, NULL if unassigned. */
     uint64_t tx_bytes           /* Count of bytes recently transmitted. */
         OVS_GUARDED_BY(rwlock);
+    /* Average bytes per second recently transmitted. */
+    uint64_t tx_bps_avg OVS_GUARDED_BY(rwlock);
     struct ovs_list list_node;  /* In bond_member's 'entries' list. */
 
     /* Recirculation.
@@ -95,7 +97,7 @@ struct bond_member {
     /* Rebalancing info.  Used only by bond_rebalance(). */
     struct ovs_list bal_node;   /* In bond_rebalance()'s 'bals' list. */
     struct ovs_list entries;    /* 'struct bond_entry's assigned here. */
-    uint64_t tx_bytes;          /* Sum across 'tx_bytes' of entries. */
+    uint64_t tx_bps_avg;        /* Sum across 'tx_bps_avg' of entries. */
 };
 
 /* A bond, that is, a set of network devices grouped to improve performance or
@@ -1163,8 +1165,8 @@ log_bals(struct bond *bond, const struct ovs_list *bals)
             if (ds.length) {
                 ds_put_char(&ds, ',');
             }
-            ds_put_format(&ds, " %s %"PRIu64"kB",
-                          member->name, member->tx_bytes / 1024);
+            ds_put_format(&ds, " %s %"PRIu64"kB/s",
+                          member->name, member->tx_bps_avg / 1024);
             if (!member->enabled) {
                 ds_put_cstr(&ds, " (disabled)");
@@ -1195,19 +1197,19 @@ bond_shift_load(struct bond_entry *hash, struct bond_member *to)
 {
     struct bond_member *from = hash->member;
     struct bond *bond = from->bond;
-    uint64_t delta = hash->tx_bytes;
+    uint64_t delta = hash->tx_bps_avg;
 
     VLOG_DBG("bond %s: shift %"PRIu64"kB of load (with hash %"PRIdPTR") "
              "from %s to %s (now carrying %"PRIu64"kB and "
              "%"PRIu64"kB load, respectively)",
              bond->name, delta / 1024, hash - bond->hash,
              from->name, to->name,
-             (from->tx_bytes - delta) / 1024,
-             (to->tx_bytes + delta) / 1024);
+             (from->tx_bps_avg - delta) / 1024,
+             (to->tx_bps_avg + delta) / 1024);
 
     /* Shift load away from 'from' to 'to'. */
-    from->tx_bytes -= delta;
-    to->tx_bytes += delta;
+    from->tx_bps_avg -= delta;
+    to->tx_bps_avg += delta;
 
     /* Arrange for flows to be revalidated. */
     hash->member = to;
@@ -1222,15 +1224,16 @@ bond_shift_load(struct bond_entry *hash, struct bond_member *to)
  * The list of entries is sorted in descending order of load.  This allows us
  * to collect subset of entries with accumulated load close to ideal. */
 static size_t
-choose_entries_to_migrate(const struct bond_member *from, uint64_t to_tx_bytes,
+choose_entries_to_migrate(const struct bond_member *from,
+                          uint64_t to_tx_bps_avg,
                           struct bond_entry **to_migrate)
     OVS_REQ_WRLOCK(rwlock)
 {
     struct bond_entry *e;
     /* Note, the ideal traffic is the mid point between 'from' and 'to'.
      * This value does not change by rebalancing. */
-    uint64_t ideal_tx_bytes = (from->tx_bytes + to_tx_bytes) / 2;
-    uint64_t ideal_delta = ideal_tx_bytes - to_tx_bytes;
+    uint64_t ideal_tx_bps_avg = (from->tx_bps_avg + to_tx_bps_avg) / 2;
+    uint64_t ideal_delta = ideal_tx_bps_avg - to_tx_bps_avg;
     uint64_t delta = 0;     /* The amount to rebalance. */
     uint64_t new_low;       /* The lower bandwidth between 'to' and 'from'
                              * after rebalancing. */
@@ -1244,10 +1247,10 @@ choose_entries_to_migrate(const struct bond_member *from, uint64_t to_tx_bytes,
     }
 
     LIST_FOR_EACH (e, list_node, &from->entries) {
-        if (delta + e->tx_bytes <= ideal_delta) {
+        if (delta + e->tx_bps_avg <= ideal_delta) {
             /* Take next entry if amount to rebalance will not exceed ideal. */
             to_migrate[cnt++] = e;
-            delta += e->tx_bytes;
+            delta += e->tx_bps_avg;
         }
         if (ideal_delta - delta < migration_threshold) {
             /* Stop collecting hashes if we're close enough to the ideal value
@@ -1263,13 +1266,13 @@ choose_entries_to_migrate(const struct bond_member *from, uint64_t to_tx_bytes,
         ASSIGN_CONTAINER(closest, ovs_list_back(&from->entries),
                          list_node);
 
-        delta = closest->tx_bytes;
+        delta = closest->tx_bps_avg;
         to_migrate[cnt++] = closest;
     }
 
-    new_low = MIN(from->tx_bytes - delta, to_tx_bytes + delta);
-    if ((new_low > to_tx_bytes) &&
-        (new_low - to_tx_bytes >= migration_threshold)) {
+    new_low = MIN(from->tx_bps_avg - delta, to_tx_bps_avg + delta);
+    if ((new_low > to_tx_bps_avg) &&
+        (new_low - to_tx_bps_avg >= migration_threshold)) {
         /* Only rebalance if the new 'low' is closer to to the mid point and the
          * improvement of traffic deviation from the ideal split exceeds 10%
          * (migration threshold).
@@ -1290,7 +1293,7 @@ insert_bal(struct ovs_list *bals, struct bond_member *member)
     struct bond_member *pos;
 
     LIST_FOR_EACH (pos, bal_node, bals) {
-        if (member->tx_bytes > pos->tx_bytes) {
+        if (member->tx_bps_avg > pos->tx_bps_avg) {
             break;
         }
     }
@@ -1356,7 +1359,7 @@ bond_rebalance(struct bond *bond)
     /* Add each bond_entry to its member's 'entries' list.
      * Compute each member's tx_bytes as the sum of its entries' tx_bytes. */
     HMAP_FOR_EACH (member, hmap_node, &bond->members) {
-        member->tx_bytes = 0;
+        member->tx_bps_avg = 0;
         ovs_list_init(&member->entries);
     }
 
@@ -1368,10 +1371,12 @@ bond_rebalance(struct bond *bond)
     /* Iteration over sorted bond hashes will give us sorted 'entries'. */
     for (int i = 0; i < BOND_BUCKETS; i++) {
         e = hashes[i];
+        uint64_t tx_bps_avg = e->tx_bytes / rebalance_interval_sec;
         if (e->member && e->tx_bytes) {
-            e->member->tx_bytes += e->tx_bytes;
+            e->member->tx_bps_avg += tx_bps_avg + e->tx_bps_avg;
             ovs_list_push_back(&e->member->entries, &e->list_node);
         }
+        e->tx_bps_avg = tx_bps_avg;
     }
 
     /* Add enabled members to 'bals' in descending order of tx_bytes.
@@ -1402,8 +1407,8 @@ bond_rebalance(struct bond *bond)
         struct bond_member *to
             = bond_member_from_bal_node(ovs_list_back(&bals));
 
-        overload = (from->tx_bytes - to->tx_bytes) / rebalance_interval_sec;
-        if (overload < (to->tx_bytes / rebalance_interval_sec) >> 5 ||
+        overload = from->tx_bps_avg - to->tx_bps_avg;
+        if (overload < to->tx_bps_avg >> 5 ||
             overload * 8 / 1000000 < overload_threshold) {
             /* The extra average load per second on 'from' (and all less-loaded
              * members), compared to that of 'to' (the least-loaded member), is
@@ -1414,7 +1419,7 @@ bond_rebalance(struct bond *bond)
 
             /* 'from' is carrying significantly more load than 'to'.  Pick hashes
              * to move from 'from' to 'to'. */
-            size_t cnt = choose_entries_to_migrate(from, to->tx_bytes, hashes);
+            size_t cnt = choose_entries_to_migrate(from, to->tx_bps_avg, hashes);
             if (!cnt) {
                 /* Can't usefully migrate anything away from 'from'.
                  * Don't reconsider it. */
@@ -1440,11 +1445,9 @@ bond_rebalance(struct bond *bond)
         rebalanced = true;
     }
 
-    /* Implement exponentially weighted moving average.  A weight of 1/2 causes
-     * historical data to decay to <1% in 7 rebalancing runs.  1,000,000 bytes
-     * take 20 rebalancing runs to decay to 0 and get deleted entirely. */
+    /* Zeroize tx_bytes so hash AVG speed can be calculated accurately. */
     for (e = &bond->hash[0]; e <= &bond->hash[BOND_MASK]; e++) {
-        e->tx_bytes /= 2;
+        e->tx_bytes = 0;
     }
 
     if (rebalanced) {
@@ -1455,8 +1458,8 @@ bond_rebalance(struct bond *bond)
             if (++i > 1) {
                 ds_put_cstr(&member_stats, " and ");
             }
-            ds_put_format(&member_stats, "%s:%"PRIu64"kB", member->name,
-                          member->tx_bytes / 1024);
+            ds_put_format(&member_stats, "%s:%"PRIu64"kB/s", member->name,
+                          member->tx_bps_avg / 1024);
         }
         VLOG_INFO("bond %s: rebalanced (now carrying: %s)", bond->name,
                   ds_cstr(&member_stats));

-- 
regards,
Vladislav Odintsov

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
