[ovs-dev] [PATCH v2] bond: Fix broken rebalancing after link state changes.

Ilya Maximets Tue, 13 Jul 2021 08:32:25 -0700

There are 3 constraints for moving hashes from one member to another:

  1. The load difference exceeds ~ 3% of the load of one member.
  2. The difference in load between members exceeds 100,000 bytes.
  3. Moving the hash reduces the load difference by more than 10%.


In the current implementation, if one of the members transitions to
the DOWN state, all hashes assigned to it will be moved to the other
members.  After that, if the member goes UP, it will wait for
rebalancing to get hashes.  But in case we have more than 10 equally
loaded hashes, it will never meet constraint # 3, because each hash
will handle less than 10% of the load.  The situation gets worse when
the number of flows grows and it is almost impossible to transfer any
hash when all 256 hash records are used, which is very likely when we
have few hundred/thousand flows.

As a result, if one of the members goes down and back up while traffic
flows, it will never be used to transmit packets again.  This will not
be fixed even if we completely stop the traffic and start it again,
because the first two constraints will block rebalancing in the
earlier stages, while we have low traffic volume.

Moving a single hash if the destination does not have any hashes,
as it was before commit c460a6a7bc75 ("ofproto/bond: simplifying the
rebalancing logic"), will not help, because a single hash is not
enough to make the difference in load less than 10% of the total load,
and this member will handle only that one hash forever.

To fix this, let's try to move multiple hashes at the same time to
meet constraint # 3.

The implementation includes sorting the "records" to be able to
collect records with a cumulative load close enough to the ideal value.

Signed-off-by: Ilya Maximets <[email protected]>
---

Version 2:
  * Rebased.
  * Few small style adjustments.
  * Combined the test with the main change in a single patch.
    I don't remember why I sent them separately 4 years ago. :)

  * I tried to play around with migration_threshold, but different
    values leads to exctly the same end result in my testing but with
    larger number of calls to choose_entries_to_migrate(). For example,
    in the unit test in this patch, with 10% threshold, it migrates 48
    hashes at the first call and 5 more on a second, 53 hashes in total.
    If the threshold is 50% of ideal, it moves same 53 hashes but in
    smaller chunks: 22, 16, 8, 4, 2, 1 (hashes per call).  This just
    takes a bit more of processing time and doesn't change the result.
    Note that all these migrations done within a single bond_rebalance()
    run, so there is no difference for actual mogration of packet flows.

    Therefore, I kept threshold as it was in the first versin of a patch,
    i.e. 10% difference with the ideal delta.

 ofproto/bond.c        | 127 +++++++++++++++++++++++++++++-------------
 tests/ofproto-dpif.at | 104 +++++++++++++++++++++++++++++-----
 2 files changed, 179 insertions(+), 52 deletions(-)

diff --git a/ofproto/bond.c b/ofproto/bond.c
index 35b9caac0..b9bfa4549 100644
--- a/ofproto/bond.c
+++ b/ofproto/bond.c
@@ -1173,49 +1173,72 @@ bond_shift_load(struct bond_entry *hash, struct 
bond_member *to)
     bond->bond_revalidate = true;
 }
 
-/* Picks and returns a bond_entry to migrate from 'from' (the most heavily
+/* Picks and returns 'bond_entry's to migrate from 'from' (the most heavily
  * loaded bond member) to a bond member that has 'to_tx_bytes' bytes of load,
  * given that doing so must decrease the ratio of the load on the two members
- * by at least 0.1.  Returns NULL if there is no appropriate entry.
+ * by at least 0.1.  Returns number of entries filled in 'to_migrate'.
  *
- * The list of entries isn't sorted.  I don't know of a reason to prefer to
- * shift away small hashes or large hashes. */
-static struct bond_entry *
-choose_entry_to_migrate(const struct bond_member *from, uint64_t to_tx_bytes)
+ * The list of entries is sorted in descending order of load.  This allows us
+ * to collect subset of entries with accumulated load close to ideal.  */
+static size_t
+choose_entries_to_migrate(const struct bond_member *from, uint64_t to_tx_bytes,
+                          struct bond_entry **to_migrate)
     OVS_REQ_WRLOCK(rwlock)
 {
     struct bond_entry *e;
+    /* Note, the ideal traffic is the mid point between 'from' and 'to'.
+     * This value does not change by rebalancing.  */
+    uint64_t ideal_tx_bytes = (from->tx_bytes + to_tx_bytes) / 2;
+    uint64_t ideal_delta = ideal_tx_bytes - to_tx_bytes;
+    uint64_t delta = 0;         /* The amount to rebalance. */
+    uint64_t new_low;           /* The lower bandwidth between 'to' and 'from'
+                                 * after rebalancing. */
+    uint64_t migration_threshold = ideal_delta / 10; /* 10% */
+    size_t cnt = 0;
 
     if (ovs_list_is_short(&from->entries)) {
         /* 'from' carries no more than one MAC hash, so shifting load away from
          * it would be pointless. */
-        return NULL;
+        return 0;
     }
 
     LIST_FOR_EACH (e, list_node, &from->entries) {
-        uint64_t delta = e->tx_bytes;  /* The amount to rebalance.  */
-        uint64_t ideal_tx_bytes = (from->tx_bytes + to_tx_bytes)/2;
-                             /* Note, the ideal traffic is the mid point
-                              * between 'from' and 'to'. This value does
-                              * not change by rebalancing.  */
-        uint64_t new_low;    /* The lower bandwidth between 'to' and 'from'
-                                after rebalancing. */
-
-        new_low = MIN(from->tx_bytes - delta, to_tx_bytes + delta);
-
-        if ((new_low > to_tx_bytes) &&
-            (new_low - to_tx_bytes >= (ideal_tx_bytes - to_tx_bytes) / 10)) {
-            /* Only rebalance if the new 'low' is closer to to the mid point,
-             * and the improvement exceeds 10% of current traffic
-             * deviation from the ideal split.
-             *
-             * The improvement on the 'high' side is always the same as the
-             * 'low' side. Thus consider 'low' side is sufficient.  */
-            return e;
+        if (delta + e->tx_bytes <= ideal_delta) {
+            /* Take next entry if amount to rebalance will not exceed ideal. */
+            to_migrate[cnt++] = e;
+            delta += e->tx_bytes;
+        }
+        if (ideal_delta - delta < migration_threshold) {
+            /* Stop collecting hashes if we're close enough to ideal value
+             * to avoid frequent moving of light ones.  */
+            break;
         }
     }
 
-    return NULL;
+    if (!cnt) {
+        /* There is no entry which load less than or equal to 'ideal_delta'.
+         * Lets try closest one. The closest is the last in sorted list. */
+        struct bond_entry *closest;
+
+        ASSIGN_CONTAINER(closest, ovs_list_back(&from->entries), list_node);
+
+        delta = closest->tx_bytes;
+        to_migrate[cnt++] = closest;
+    }
+
+    new_low = MIN(from->tx_bytes - delta, to_tx_bytes + delta);
+    if ((new_low > to_tx_bytes) &&
+        (new_low - to_tx_bytes >= migration_threshold)) {
+       /* Only rebalance if the new 'low' is closer to to the mid point and the
+        * improvement of traffic deviation from the ideal split exceeds 10%
+        * (migration threshold).
+        *
+        * The improvement on the 'high' side is always the same as the 'low'
+        * side.  Thus consider 'low' side is sufficient. */
+        return cnt;
+    }
+
+    return 0;
 }
 
 /* Inserts 'member' into 'bals' so that descending order of 'tx_bytes' is
@@ -1242,6 +1265,22 @@ reinsert_bal(struct ovs_list *bals, struct bond_member 
*member)
     insert_bal(bals, member);
 }
 
+static int
+compare_bond_entries(const void *a_, const void *b_)
+    OVS_REQ_RDLOCK(rwlock)
+{
+     const struct bond_entry *const *ap = a_;
+     const struct bond_entry *const *bp = b_;
+     const struct bond_entry *a = *ap;
+     const struct bond_entry *b = *bp;
+
+     if (a->tx_bytes != b->tx_bytes) {
+         return a->tx_bytes > b->tx_bytes ? -1 : 1;
+     } else {
+         return 0;
+     }
+}
+
 /* If 'bond' needs rebalancing, does so.
  *
  * The caller should have called bond_account() for each active flow, or in 
case
@@ -1251,8 +1290,8 @@ reinsert_bal(struct ovs_list *bals, struct bond_member 
*member)
 void
 bond_rebalance(struct bond *bond)
 {
+    struct bond_entry *e, *hashes[BOND_BUCKETS];
     struct bond_member *member;
-    struct bond_entry *e;
     struct ovs_list bals;
     bool rebalanced = false;
     bool use_recirc;
@@ -1276,7 +1315,15 @@ bond_rebalance(struct bond *bond)
         member->tx_bytes = 0;
         ovs_list_init(&member->entries);
     }
-    for (e = &bond->hash[0]; e <= &bond->hash[BOND_MASK]; e++) {
+
+    for (int i = 0; i < BOND_BUCKETS; i++) {
+        hashes[i] = &bond->hash[i];
+    }
+    qsort(hashes, BOND_BUCKETS, sizeof *hashes, compare_bond_entries);
+
+    /* Iteration over sorted bond hashes will give us sorted 'entries'. */
+    for (int i = 0; i < BOND_BUCKETS; i++) {
+        e = hashes[i];
         if (e->member && e->tx_bytes) {
             e->member->tx_bytes += e->tx_bytes;
             ovs_list_push_back(&e->member->entries, &e->list_node);
@@ -1311,15 +1358,23 @@ bond_rebalance(struct bond *bond)
             break;
         }
 
-        /* 'from' is carrying significantly more load than 'to'.  Pick a hash
+        /* 'from' is carrying significantly more load than 'to'.  Pick hashes
          * to move from 'from' to 'to'. */
-        e = choose_entry_to_migrate(from, to->tx_bytes);
-        if (e) {
+        size_t cnt = choose_entries_to_migrate(from, to->tx_bytes, hashes);
+        if (!cnt) {
+            /* Can't usefully migrate anything away from 'from'.
+             * Don't reconsider it. */
+            ovs_list_remove(&from->bal_node);
+            continue;
+        }
+
+        for (size_t i = 0; i < cnt; i++) {
+            e = hashes[i];
             bond_shift_load(e, to);
 
             /* Delete element from from->entries.
              *
-             * We don't add the element to to->hashes.  That would only allow
+             * We don't add the element to to->entries.  That would only allow
              * 'e' to be migrated to another member in this rebalancing run, 
and
              * there is no point in doing that. */
             ovs_list_remove(&e->list_node);
@@ -1327,12 +1382,8 @@ bond_rebalance(struct bond *bond)
             /* Re-sort 'bals'. */
             reinsert_bal(&bals, from);
             reinsert_bal(&bals, to);
-            rebalanced = true;
-        } else {
-            /* Can't usefully migrate anything away from 'from'.
-             * Don't reconsider it. */
-            ovs_list_remove(&from->bal_node);
         }
+        rebalanced = true;
     }
 
     /* Implement exponentially weighted moving average.  A weight of 1/2 causes
diff --git a/tests/ofproto-dpif.at b/tests/ofproto-dpif.at
index cfa0cb4c7..413987f79 100644
--- a/tests/ofproto-dpif.at
+++ b/tests/ofproto-dpif.at
@@ -342,6 +342,22 @@ AT_CHECK([test `egrep 'in_port\(6\)' br1_flows.txt |wc -l` 
-gt 3])
 OVS_VSWITCHD_STOP
 AT_CLEANUP
 
+# SEND_TCP_BOND_PKTS([p_name], [p_ofport], [packet_len])
+#
+# Sends 256 packets to port 'p_name' with different TCP destination ports.
+m4_define([SEND_TCP_BOND_PKTS],
+   [
+    len_cmd=""
+    if test -n "$3"; then
+        len_cmd=" --len $3"
+    fi
+    for i in `seq 0 255`; do
+        
pkt="in_port($2),eth(src=50:54:00:00:00:05,dst=50:54:00:00:01:00),eth_type(0x0800),ipv4(src=10.0.0.2,dst=10.0.0.1,proto=6,tos=0,ttl=64,frag=no),tcp(src=8,dst=$i),tcp_flags(ack)"
+        ovs-appctl netdev-dummy/receive $1 $pkt$len_cmd
+    done
+   ]
+)
+
 AT_SETUP([ofproto-dpif - balance-tcp bonding])
 # Create br0 with members bond0(p1, p2, p3) and p7,
 #    and br1 with members bond1(p4, p5, p6) and p8.
@@ -377,13 +393,7 @@ ovs-appctl lacp/show > lacp.txt
 ovs-appctl bond/show > bond.txt
 # Check that lb_output is not enabled by default.
 AT_CHECK([grep -q '^lb_output action: disabled' bond.txt])
-(
-for i in `seq 0 255` ;
-    do
-    
pkt="in_port(7),eth(src=50:54:00:00:00:05,dst=50:54:00:00:01:00),eth_type(0x0800),ipv4(src=10.0.0.2,dst=10.0.0.1,proto=6,tos=0,ttl=64,frag=no),tcp(src=8,dst=$i),tcp_flags(ack)"
-    AT_CHECK([ovs-appctl netdev-dummy/receive p7 $pkt])
-    done
-)
+AT_CHECK([SEND_TCP_BOND_PKTS([p7], [7])])
 ovs-appctl time/warp 300 100
 AT_CHECK([ovs-appctl dpif/dump-flows br0 |grep tcp > br0_flows.txt])
 AT_CHECK([ovs-appctl dpif/dump-flows br1 |grep tcp > br1_flows.txt])
@@ -400,13 +410,7 @@ OVS_WAIT_UNTIL([ovs-appctl bond/show | grep -q '^lb_output 
action: enabled'])
 ovs-appctl time/warp 10000 500
 ovs-appctl revalidator/wait
 OVS_WAIT_WHILE([ovs-appctl dpif/dump-flows br1 | grep -q tcp])
-(
-for i in $(seq 256) ;
-    do
-    
pkt="in_port(7),eth(src=50:54:00:00:00:05,dst=50:54:00:00:01:00),eth_type(0x0800),ipv4(src=10.0.0.2,dst=10.0.0.1,proto=6,tos=0,ttl=64,frag=no),tcp(src=8,dst=$i),tcp_flags(ack)"
-    AT_CHECK([ovs-appctl netdev-dummy/receive p7 $pkt])
-    done
-)
+AT_CHECK([SEND_TCP_BOND_PKTS([p7], [7])])
 ovs-appctl time/warp 300 100
 AT_CHECK([ovs-appctl dpif/dump-flows br0 | grep tcp > br0_flows.txt])
 AT_CHECK([ovs-appctl dpif/dump-flows br1 | grep tcp > br1_flows.txt])
@@ -423,6 +427,78 @@ OVS_WAIT_UNTIL([test -z "$(ovs-appctl 
dpif-netdev/bond-show)"])
 OVS_VSWITCHD_STOP()
 AT_CLEANUP
 
+# Make sure that rebalancing works after link state changes.
+AT_SETUP([ofproto-dpif - balance-tcp bonding rebalance after link state 
changes])
+# Create br0 with interfaces bond0(p1, p2) and p5,
+#    and br1 with interfaces bond1(p3, p4) and p6.
+#    bond0 <-> bond1
+# Send some traffic, set link state down and up for p2,
+# send big amount of traffic to trigger rebalancing and
+# make sure that some hashes rebalanced.
+OVS_VSWITCHD_START(
+  [add-bond br0 bond0 p1 p2 bond_mode=balance-tcp lacp=active \
+        other-config:lacp-time=fast other-config:bond-rebalance-interval=1000 
--\
+   set interface p1 type=dummy options:pstream=punix:$OVS_RUNDIR/p1.sock 
ofport_request=1 mtu_request=65535 -- \
+   set interface p2 type=dummy options:pstream=punix:$OVS_RUNDIR/p2.sock 
ofport_request=2 mtu_request=65535 -- \
+   add-port br0 p5 -- set interface p5 ofport_request=5 type=dummy 
mtu_request=65535 -- \
+   add-br br1 -- \
+   set bridge br1 other-config:hwaddr=aa:66:aa:66:00:00 -- \
+   set bridge br1 datapath-type=dummy other-config:datapath-id=1234 \
+                  fail-mode=secure -- \
+   add-bond br1 bond1 p3 p4 bond_mode=balance-tcp lacp=active \
+        other-config:lacp-time=fast other-config:bond-rebalance-interval=1000 
--\
+   set interface p3 type=dummy options:stream=unix:$OVS_RUNDIR/p1.sock 
ofport_request=3 mtu_request=65535 -- \
+   set interface p4 type=dummy options:stream=unix:$OVS_RUNDIR/p2.sock 
ofport_request=4 mtu_request=65535 -- \
+   add-port br1 p6 -- set interface p6 ofport_request=6 type=dummy 
mtu_request=65535 --])
+AT_CHECK([ovs-appctl vlog/set bond:dbg])
+AT_CHECK([ovs-appctl netdev-dummy/set-admin-state up], 0, [OK
+])
+AT_CHECK([ovs-ofctl add-flow br0 action=normal])
+AT_CHECK([ovs-ofctl add-flow br1 action=normal])
+AT_CHECK([ovs-appctl upcall/disable-megaflows], [0], [megaflows disabled
+], [])
+OVS_WAIT_WHILE([ovs-appctl bond/show | grep "may_enable: false"])
+
+ovs-appctl time/stop
+ovs-appctl time/warp 2000 200
+
+# Send some traffic to distribute all the hashes between ports.
+AT_CHECK([SEND_TCP_BOND_PKTS([p5], [5], [65500])])
+
+# Wait for rebalancing for per-hash stats accounting.
+ovs-appctl time/warp 1000 100
+
+# Check that p2 handles some hashes.
+ovs-appctl bond/show > bond1.txt
+AT_CHECK([sed -n '/member p2/,/^$/p' bond1.txt | grep 'hash'], [0], [ignore])
+
+# Move p2 down to force all hashes move to p1
+AT_CHECK([ovs-appctl netdev-dummy/set-admin-state p2 down], 0, [OK
+])
+
+ovs-appctl time/warp 200 100
+# Check that all hashes moved form p2
+ovs-appctl bond/show > bond2.txt
+AT_CHECK([sed -n '/member p2/,/^$/p' bond2.txt | grep 'hash'], [1], [ignore])
+
+# Move p2 up
+AT_CHECK([ovs-appctl netdev-dummy/set-admin-state p2 up], 0, [OK
+])
+
+# Send some packets to trigger rebalancing.
+AT_CHECK([SEND_TCP_BOND_PKTS([p5], [5], [65500])])
+
+# Wait for rebalancing
+ovs-appctl time/warp 1000 100
+
+# Check that some hashes was shifted to p2
+ovs-appctl bond/show > bond3.txt
+AT_CHECK([sed -n '/member p2/,/^$/p' bond3.txt | grep 'hash'], [0], [ignore])
+
+OVS_VSWITCHD_STOP()
+AT_CLEANUP
+
+
 # Makes sure recirculation does not change the way packet is handled.
 AT_SETUP([ofproto-dpif - balance-tcp bonding, different recirc flow ])
 OVS_VSWITCHD_START(
-- 
2.31.1

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

[ovs-dev] [PATCH v2] bond: Fix broken rebalancing after link state changes.

Reply via email to