One day people will forget about e-mail formatting problems.  Not today.

Second attempt to send this mail with correct formatting.  Sorry for the noise.

On 29.03.2025 01:41, Ilya Maximets wrote:
> On 3/28/25 11:00, Vladislav Odintsov wrote:
>> On 26.03.2025 15:30, Ilya Maximets wrote:
>>> On 3/25/25 00:21, Vladislav Odintsov wrote:
>>>> With this commit, rebalancing of hash entries between bond members
>>>> becomes less frequent.  If we know all bond members' speeds, we do
>>>> not move hash entries between them if the load difference is less
>>>> than 1.5% of the minimum link speed among the bond members.
>>>>
>>>> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2025-March/422028.html
>>>> Suggested-by: Ilya Maximets <i.maxim...@ovn.org>
>>>> Signed-off-by: Vladislav Odintsov <vlodintsov@k2.cloud>
>>>> ---
>>>> v1 -> v2:
>>>>     - Addressed Ilya's, Eelco's, Mike's review comments.
>>>>     - Docs updated.
>>>>     - NEWS entry added.
>>>> ---
>>>>    Documentation/topics/bonding.rst |  3 ++-
>>>>    NEWS                             |  4 ++++
>>>>    ofproto/bond.c                   | 25 +++++++++++++++++++------
>>>>    3 files changed, 25 insertions(+), 7 deletions(-)
>>> Hi, Vladislav.  Thanks for v2!
>>>
>>> Have you seen my reply to your observation about comparing speed to raw
>>> byte counters here:
>>>     https://mail.openvswitch.org/pipermail/ovs-dev/2025-March/422255.html
>>>
>>> TL;DR: I think you're right and we need to change the way we count the load
>>> in order to be able to compare the load to a link speed.
>>>
>>> Best regards, Ilya Maximets.
>> Hi Ilya,
>>
>> Thanks for the review!
>>
>> I changed the condition before calling the bond_shift_load() function, so
>> now Bps is compared with Bps instead of B with Bps.  It seems to me that
>> this is good enough.
>>
>> Do I understand you correctly that you proposed to change the movement
>> logic inside the bond_shift_load() function?  If yes, I'm wondering why
>> we care whether we compare the total bytes carried during the last
>> rebalance interval or the average bytes per second during the same
>> interval.  In both cases we seem to fairly compare the same units:
>> B vs B or Bps vs Bps, so the current logic of this function looks
>> correct to me.  And in both cases the result should be the same.
>> Am I missing something?
>>
> Hmm.  It's complicated.  Yes, you're comparing apples to apples here, i.e.
> this version compares speed to speed.  The problem, I think, is that our
> total amount of bytes divided by the rebalance interval doesn't give a
> correct or accurate enough approximation of the average speed.
>
> Let's say we have a constant 100Bps traffic on the port.  And let's say
> we have a standard 10sec rebalance interval.
>
> After the first 10 seconds, our tx_bytes on this port will be 1000B.
> Then we'll divide it by 2 for our ad-hoc exponentially weighted moving
> average implementation.  So, it becomes 500B.  After the next interval
> we'll have another 1000B plus our 500B from the previous interval, so
> 1500B in total.  That, divided by the rebalancing interval, gives us
> 150Bps "average speed", which is 50% higher than the actual average speed
> on this interface.  Next time it will be 175, and so on.  In the long run
> this is 100 * (1 + 1/2 + 1/4 + ...) = 200, which is 2x our actual average
> speed.  So, on a constant packet rate, this way of measuring is not
> accurate at all.
>
> I'm not doing the math for the non-constant speed, but I'd guess it will
> not be accurate enough either; even if it averages better in some
> scenarios, it might be much worse in others.  And I'm not sure if we can
> make this style of calculation accurate enough in all cases by just
> choosing a different divisor.
>
> All in all, I think, we need to change the way we calculate the load in
> order to have a more or less accurate idea about the average speed.
>
> The Exponential Moving Average from lib/mov-avg.h should give us much
> more accurate result.  The problem, however, is that we can't easily
> shift the load if we calculate the average on a sum.  So, we'll need to
> calculate an average speed per hash and then compare sums of averages
> with the link speed.
>
> Does that make sense?
>
> Best regards, Ilya Maximets.

Hi, Ilya.

I need to clarify: do we divide each hash's tx_bytes by 2 to achieve the
following points:

1. have a decaying tx_bytes so that it does not overflow uint64_t;
2. at the same time "save" the knowledge of the previous load in order
to avoid frequent unnecessary shifts.

Is that correct?
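
By the way, if I got your math right, a tiny standalone simulation (my
own sketch, not code from bond.c; the constants are just your example
numbers) reproduces the drift you describe for a constant 100 Bps load:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    const uint64_t rate_bps = 100;     /* Constant traffic rate, in Bps. */
    const uint64_t interval_sec = 10;  /* Rebalance interval, in seconds. */
    uint64_t tx_bytes = 0;

    for (int run = 1; run <= 10; run++) {
        /* Bytes accumulated since the previous rebalance. */
        tx_bytes += rate_bps * interval_sec;
        printf("run %2d: tx_bytes=%4"PRIu64"B -> %3"PRIu64"Bps\n",
               run, tx_bytes, tx_bytes / interval_sec);
        tx_bytes /= 2;  /* The current ad-hoc EWMA step. */
    }
    return 0;
}

It prints 100, 150, 175, ... Bps and settles just under 200 Bps, i.e. 2x
the real rate, exactly as you calculated.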

Also, I'm not sure I fully understood your proposed approach...  Do you
want to switch from tx_bytes as a cumulative metric to a new "avg tx per
second" metric?  I guess this can be done roughly in the following order:

1. on each rebalance run, calculate the average speed per hash
(tx_bytes / rebalance_interval);
2. store this average speed inside the bond_entry;
3. sum the entries' speeds into a "member average speed";
4. do all the calculations in bond_rebalance() in "avg Bps" terms instead
of "total bytes": here we can use the same algorithm in bond_shift_load(),
just reorienting it to the Bps variable.

Something like this (note: the diff is based on the current patch; not
tested, only an example of the idea):

diff --git a/ofproto/bond.c b/ofproto/bond.c
index e2cbdd5ec..119baef4e 100644
--- a/ofproto/bond.c
+++ b/ofproto/bond.c
@@ -63,6 +63,8 @@ struct bond_entry {
      struct bond_member *member; /* Assigned member, NULL if unassigned. */
      uint64_t tx_bytes           /* Count of bytes recently transmitted. */
          OVS_GUARDED_BY(rwlock);
+    /* Average bytes per second recently transmitted. */
+    uint64_t tx_bps_avg OVS_GUARDED_BY(rwlock);
      struct ovs_list list_node;  /* In bond_member's 'entries' list. */

      /* Recirculation.
@@ -95,7 +97,7 @@ struct bond_member {
      /* Rebalancing info.  Used only by bond_rebalance(). */
      struct ovs_list bal_node;   /* In bond_rebalance()'s 'bals' list. */
      struct ovs_list entries;    /* 'struct bond_entry's assigned here. */
-    uint64_t tx_bytes;          /* Sum across 'tx_bytes' of entries. */
+    uint64_t tx_bps_avg;        /* Sum across 'tx_bps_avg' of entries. */
  };

  /* A bond, that is, a set of network devices grouped to improve performance or
@@ -1163,8 +1165,8 @@ log_bals(struct bond *bond, const struct ovs_list *bals)
              if (ds.length) {
                  ds_put_char(&ds, ',');
              }
-            ds_put_format(&ds, " %s %"PRIu64"kB",
-                          member->name, member->tx_bytes / 1024);
+            ds_put_format(&ds, " %s %"PRIu64"kB/s",
+                          member->name, member->tx_bps_avg / 1024);

              if (!member->enabled) {
                  ds_put_cstr(&ds, " (disabled)");
@@ -1195,19 +1197,19 @@ bond_shift_load(struct bond_entry *hash, struct bond_member *to)
  {
      struct bond_member *from = hash->member;
      struct bond *bond = from->bond;
-    uint64_t delta = hash->tx_bytes;
+    uint64_t delta = hash->tx_bps_avg;

      VLOG_DBG("bond %s: shift %"PRIu64"kB of load (with hash %"PRIdPTR") "
               "from %s to %s (now carrying %"PRIu64"kB and "
               "%"PRIu64"kB load, respectively)",
               bond->name, delta / 1024, hash - bond->hash,
               from->name, to->name,
-             (from->tx_bytes - delta) / 1024,
-             (to->tx_bytes + delta) / 1024);
+             (from->tx_bps_avg - delta) / 1024,
+             (to->tx_bps_avg + delta) / 1024);

      /* Shift load away from 'from' to 'to'. */
-    from->tx_bytes -= delta;
-    to->tx_bytes += delta;
+    from->tx_bps_avg -= delta;
+    to->tx_bps_avg += delta;

      /* Arrange for flows to be revalidated. */
      hash->member = to;
@@ -1222,15 +1224,16 @@ bond_shift_load(struct bond_entry *hash, struct bond_member *to)
  * The list of entries is sorted in descending order of load. This allows us
   * to collect subset of entries with accumulated load close to ideal.  */
  static size_t
-choose_entries_to_migrate(const struct bond_member *from, uint64_t to_tx_bytes,
+choose_entries_to_migrate(const struct bond_member *from,
+                          uint64_t to_tx_bps_avg,
                            struct bond_entry **to_migrate)
      OVS_REQ_WRLOCK(rwlock)
  {
      struct bond_entry *e;
      /* Note, the ideal traffic is the mid point between 'from' and 'to'.
       * This value does not change by rebalancing.  */
-    uint64_t ideal_tx_bytes = (from->tx_bytes + to_tx_bytes) / 2;
-    uint64_t ideal_delta = ideal_tx_bytes - to_tx_bytes;
+    uint64_t ideal_tx_bps_avg = (from->tx_bps_avg + to_tx_bps_avg) / 2;
+    uint64_t ideal_delta = ideal_tx_bps_avg - to_tx_bps_avg;
      uint64_t delta = 0;         /* The amount to rebalance. */
     uint64_t new_low;           /* The lower bandwidth between 'to' and 'from'
                                  * after rebalancing. */
@@ -1244,10 +1247,10 @@ choose_entries_to_migrate(const struct bond_member *from, uint64_t to_tx_bytes,
      }

      LIST_FOR_EACH (e, list_node, &from->entries) {
-        if (delta + e->tx_bytes <= ideal_delta) {
+        if (delta + e->tx_bps_avg <= ideal_delta) {
             /* Take next entry if amount to rebalance will not exceed ideal. */
              to_migrate[cnt++] = e;
-            delta += e->tx_bytes;
+            delta += e->tx_bps_avg;
          }
          if (ideal_delta - delta < migration_threshold) {
             /* Stop collecting hashes if we're close enough to the ideal value
@@ -1263,13 +1266,13 @@ choose_entries_to_migrate(const struct bond_member *from, uint64_t to_tx_bytes,

         ASSIGN_CONTAINER(closest, ovs_list_back(&from->entries), list_node);

-        delta = closest->tx_bytes;
+        delta = closest->tx_bps_avg;
          to_migrate[cnt++] = closest;
      }

-    new_low = MIN(from->tx_bytes - delta, to_tx_bytes + delta);
-    if ((new_low > to_tx_bytes) &&
-        (new_low - to_tx_bytes >= migration_threshold)) {
+    new_low = MIN(from->tx_bps_avg - delta, to_tx_bps_avg + delta);
+    if ((new_low > to_tx_bps_avg) &&
+        (new_low - to_tx_bps_avg >= migration_threshold)) {
       /* Only rebalance if the new 'low' is closer to the mid point and the
        * improvement of traffic deviation from the ideal split exceeds 10%
        * (migration threshold).
@@ -1290,7 +1293,7 @@ insert_bal(struct ovs_list *bals, struct bond_member *member)
      struct bond_member *pos;

      LIST_FOR_EACH (pos, bal_node, bals) {
-        if (member->tx_bytes > pos->tx_bytes) {
+        if (member->tx_bps_avg > pos->tx_bps_avg) {
              break;
          }
      }
@@ -1356,7 +1359,7 @@ bond_rebalance(struct bond *bond)
      /* Add each bond_entry to its member's 'entries' list.
      * Compute each member's tx_bytes as the sum of its entries' tx_bytes. */
      HMAP_FOR_EACH (member, hmap_node, &bond->members) {
-        member->tx_bytes = 0;
+        member->tx_bps_avg = 0;
          ovs_list_init(&member->entries);
      }

@@ -1368,10 +1371,12 @@ bond_rebalance(struct bond *bond)
      /* Iteration over sorted bond hashes will give us sorted 'entries'. */
      for (int i = 0; i < BOND_BUCKETS; i++) {
          e = hashes[i];
+        e->tx_bps_avg = e->tx_bytes / rebalance_interval_sec;
         if (e->member && e->tx_bytes) {
-            e->member->tx_bytes += e->tx_bytes;
+            e->member->tx_bps_avg += e->tx_bps_avg;
             ovs_list_push_back(&e->member->entries, &e->list_node);
         }
      }

      /* Add enabled members to 'bals' in descending order of tx_bytes.
@@ -1402,8 +1407,8 @@ bond_rebalance(struct bond *bond)
          struct bond_member *to
              = bond_member_from_bal_node(ovs_list_back(&bals));

-        overload = (from->tx_bytes - to->tx_bytes) / rebalance_interval_sec;
-        if (overload < (to->tx_bytes / rebalance_interval_sec) >> 5 ||
+        overload = from->tx_bps_avg - to->tx_bps_avg;
+        if (overload < to->tx_bps_avg >> 5 ||
              overload * 8 / 1000000 < overload_threshold) {
             /* The extra average load per second on 'from' (and all less-loaded
              * members), compared to that of 'to' (the least-loaded member), is
@@ -1414,7 +1419,7 @@ bond_rebalance(struct bond *bond)

         /* 'from' is carrying significantly more load than 'to'. Pick hashes
          * to move from 'from' to 'to'. */
-        size_t cnt = choose_entries_to_migrate(from, to->tx_bytes, hashes);
+        size_t cnt = choose_entries_to_migrate(from, to->tx_bps_avg, hashes);
          if (!cnt) {
              /* Can't usefully migrate anything away from 'from'.
               * Don't reconsider it. */
@@ -1440,11 +1445,9 @@ bond_rebalance(struct bond *bond)
          rebalanced = true;
      }

-    /* Implement exponentially weighted moving average.  A weight of 1/2 causes
-     * historical data to decay to <1% in 7 rebalancing runs. 1,000,000 bytes
-     * take 20 rebalancing runs to decay to 0 and get deleted entirely. */
+    /* Zero out tx_bytes so the per-hash average speed is calculated freshly. */
      for (e = &bond->hash[0]; e <= &bond->hash[BOND_MASK]; e++) {
-        e->tx_bytes /= 2;
+        e->tx_bytes = 0;
      }

      if (rebalanced) {
@@ -1455,8 +1458,8 @@ bond_rebalance(struct bond *bond)
              if (++i > 1) {
                  ds_put_cstr(&member_stats, " and ");
              }
-            ds_put_format(&member_stats, "%s:%"PRIu64"kB", member->name,
-                          member->tx_bytes / 1024);
+            ds_put_format(&member_stats, "%s:%"PRIu64"kB/s", member->name,
+                          member->tx_bps_avg / 1024);
          }
          VLOG_INFO("bond %s: rebalanced (now carrying: %s)",
                    bond->name, ds_cstr(&member_stats));
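
And if we later switch to the EMA from lib/mov-avg.h as you suggested,
I'd expect the per-hash update to replace the plain division above with
something roughly like this (a hand-rolled EMA purely as illustration;
EMA_ALPHA and the helper name are my assumptions, not the lib/mov-avg.h
interface):

#include <stdint.h>

/* Illustrative EMA step for one hash entry; not the lib/mov-avg.h API.
 * EMA_ALPHA is the weight given to the newest sample (assumed value). */
#define EMA_ALPHA 0.25

static uint64_t
ema_update_bps(uint64_t old_avg_bps, uint64_t tx_bytes,
               uint64_t interval_sec)
{
    /* Speed observed during the last rebalance interval. */
    uint64_t sample_bps = tx_bytes / interval_sec;

    /* new_avg = alpha * sample + (1 - alpha) * old_avg. */
    return (uint64_t) (EMA_ALPHA * sample_bps
                       + (1 - EMA_ALPHA) * old_avg_bps);
}

Then bond_rebalance() would call something like this per entry instead of
the plain division, and the member totals would still be sums of the
per-entry averages, so the bond_shift_load() arithmetic keeps working
unchanged.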

-- 
regards,
Vladislav Odintsov
