> On Oct 10, 2018, at 2:35 PM, Ben Pfaff <[email protected]> wrote:
>
> On Tue, Oct 09, 2018 at 08:33:32AM -0600, Daniel Leaberry wrote:
>>
>>> On Oct 8, 2018, at 5:36 PM, Ethan J. Jackson <[email protected]> wrote:
>>>
>>> No memory unfortunately.
>>>
>>> Ethan
>>>
>>> Ethan J. Jackson
>>> ejj.sh
>>>
>>>
>>> On Mon, Oct 08, 2018 at 1:45 PM, Ben Pfaff <[email protected]> wrote:
>>> On Tue, Oct 02, 2018 at 10:28:52AM -0600, Daniel Leaberry via discuss wrote:
>>>
>>> I have Centos 7 with openvswitch 2.9.0. The server has 4 ports in an lacp
>>> bond (called allbond) connected to a set of mlagged arista switches. Here's
>>> the config
>>>
>>> ovs-vsctl list port allbond
>>> _uuid : 9f224f2d-8bb1-4cfd-84e2-d60c6d973a7a
>>> bond_active_slave : "90:e2:ba:d6:1c:44"
>>> bond_downdelay : 0
>>> bond_fake_iface : false
>>> bond_mode : balance-tcp
>>> bond_updelay : 40000
>>> cvlans : []
>>> external_ids : {}
>>> fake_bridge : false
>>> interfaces : [61b9a345-2f3d-4127-b9cd-eaca8a749574,
>>> 89ce3480-d62d-4291-9a84-bdf711016793, 941c9393-1021-490c-84ac-311250ba0343,
>>> dc49ffd3-c259-43b6-8072-2ce12c52d1b1]
>>> lacp : active
>>> mac : []
>>> name : allbond
>>> other_config : {}
>>> protected : false
>>> qos : []
>>> rstp_statistics : {}
>>> rstp_status : {}
>>> statistics : {}
>>> status : {}
>>> tag : []
>>> trunks : []
>>> vlan_mode : []
>>>
>>> ---- allbond ----
>>> bond_mode: balance-tcp
>>> bond may use recirculation: yes, Recirc-ID : 3
>>> bond-hash-basis: 0
>>> updelay: 40000 ms
>>> downdelay: 0 ms
>>> next rebalance: 3229 ms
>>> lacp_status: negotiated
>>> lacp_fallback_ab: false
>>> active slave mac: 90:e2:ba:d6:1c:44(eth5)
>>>
>>> slave eth3: enabled
>>> may_enable: true
>>> hash 50: 1 kB load
>>> hash 162: 1 kB load
>>> hash 170: 1 kB load
>>>
>>> slave eth4: enabled
>>> may_enable: true
>>> hash 123: 4 kB load
>>> hash 221: 12 kB load
>>>
>>> slave eth5: enabled
>>> active slave
>>> may_enable: true
>>> hash 94: 1 kB load
>>> hash 177: 1 kB load
>>> hash 245: 1 kB load
>>>
>>> slave eth6: enabled
>>> may_enable: true
>>> hash 97: 46 kB load
>>>
>>> As you can see updelay is set to 40 seconds. I go to the switch and
>>> shutdown the port for eth6. It's immediately pulled from the bond. I then
>>> clear the switch counters and wait a few minutes. I would expect when the
>>> port is "no shutdown" that 40 seconds will go by before openvswitch brings
>>> it back into the bond. But that doesn't happen.
>>>
>>> 2018-10-02T15:31:32.885Z|00349|bond|INFO|interface eth6: link state down
>>> 2018-10-02T15:31:32.885Z|00350|bond|INFO|interface eth6: disabled
>>> 2018-10-02T15:35:45.861Z|00352|bond|INFO|interface eth6: link state up
>>> 2018-10-02T15:35:45.861Z|00353|bond|INFO|interface eth6: enabled
>>> 2018-10-02T15:35:51.286Z|00354|bond|INFO|bond allbond: shift 93kB of load
>>> (with hash 97) from eth3 to eth6 (now carrying 6kB and 93kB load,
>>> respectively)
>>>
>>> Immediately after link is re-established the port (eth6) is enabled again
>>> and traffic as shown in the switch counters begins to flow again. It feels
>>> like I'm doing something wrong but I've googled for hours and can't find
>>> anything that explains why the bond_updelay is being ignored.
>>>
>>> I spent some time looking through the history here. Ethan (CCed) added LACP
>>> support to OVS in January 2011. From that point forward, OVS has always
>>> ignored updelay and downdelay for a bond when LACP is enabled. I don't know
>>> why, exactly. Maybe Ethan remembers.
>>>
>>> It would be easy to enable updelay and downdelay for LACP bonds:
>>>
>>> diff --git a/ofproto/bond.c b/ofproto/bond.c
>>> index f87cdba7908f..8a90ba2686af 100644
>>> --- a/ofproto/bond.c
>>> +++ b/ofproto/bond.c
>>> @@ -1717,8 +1717,7 @@ bond_link_status_update(struct bond_slave *slave)
>>>              VLOG_INFO_RL(&rl, "interface %s: will not be %s",
>>>                           slave->name, up ? "disabled" : "enabled");
>>>          } else {
>>> -            int delay = (bond->lacp_status != LACP_DISABLED ? 0
>>> -                         : up ? bond->updelay : bond->downdelay);
>>> +            int delay = up ? bond->updelay : bond->downdelay;
>>>              slave->delay_expires = time_msec() + delay;
>>>              if (delay) {
>>>                  VLOG_INFO_RL(&rl, "interface %s: will be %s if it stays %s "
>>>
>>>
>>
>> I *greatly* appreciate you looking into this Ben, it's rare in opensource
>> that I find an actual bug so generally I just figure I'm doing something
>> wrong. The documentation is pretty clear about calling out the bond_updelay
>> and downdelay parameters so at the very least those should be
>> clarified/removed.
>>
>> What next steps should I take? Is there a bug report I should file? This is
>> fairly critical to me because we run a ton of these 4 port bonds to 2 Arista
>> switches (they're redundant). When we upgrade the switch firmware the switch
>> comes back online, the ports all light up at the same time but it takes a
>> few seconds for spanning tree to sort everything out. During those seconds
>> we have packet loss because ovs thinks the ports are totally back in action
>> when they aren't.
>
> Since we don't have a known reason not to honor these settings for LACP
> bonds, I propose that we just change OVS behavior.
>
> I sent a formal patch:
> https://patchwork.ozlabs.org/patch/982091/

Thank you! Glad it was an easy patch.
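
For anyone reading this thread later, the effect of the diff quoted above can be sketched as a small standalone C fragment. It is not the OVS source: `struct sketch_bond`, `enum sketch_lacp_status`, and both function names are hypothetical stand-ins, and only the two ternary expressions mirror the removed and added lines of the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the bond's LACP state; not the OVS enum. */
enum sketch_lacp_status {
    SKETCH_LACP_DISABLED,
    SKETCH_LACP_CONFIGURED,
    SKETCH_LACP_NEGOTIATED
};

/* Hypothetical stand-in carrying only the fields the diff touches. */
struct sketch_bond {
    enum sketch_lacp_status lacp_status;
    int updelay;    /* ms to wait before enabling a link that came up */
    int downdelay;  /* ms to wait before disabling a link that went down */
};

/* Mirrors the removed lines: any bond with LACP enabled gets a delay of 0,
 * so the configured updelay and downdelay are ignored. */
static int delay_before_patch(const struct sketch_bond *bond, bool up)
{
    return (bond->lacp_status != SKETCH_LACP_DISABLED ? 0
            : up ? bond->updelay : bond->downdelay);
}

/* Mirrors the added line: the configured delay applies whether or not
 * LACP is enabled. */
static int delay_after_patch(const struct sketch_bond *bond, bool up)
{
    return up ? bond->updelay : bond->downdelay;
}
```

With the values from this thread (updelay 40000 ms, downdelay 0 ms, LACP negotiated), `delay_before_patch(&b, true)` yields 0, which is consistent with eth6 being logged as enabled in the same millisecond as "link state up", while `delay_after_patch(&b, true)` yields 40000.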
