> On Oct 8, 2018, at 5:36 PM, Ethan J. Jackson <[email protected]> wrote:
>
> No memory unfortunately.
>
> Ethan
>
> Ethan J. Jackson
> ejj.sh
>
> On Mon, Oct 08, 2018 at 1:45 PM, Ben Pfaff <[email protected]> wrote:
> > On Tue, Oct 02, 2018 at 10:28:52AM -0600, Daniel Leaberry via discuss wrote:
> > > I have Centos 7 with openvswitch 2.9.0. The server has 4 ports in an lacp
> > > bond (called allbond) connected to a set of mlagged arista switches. Here's
> > > the config
> > >
> > > ovs-vsctl list port allbond
> > > _uuid               : 9f224f2d-8bb1-4cfd-84e2-d60c6d973a7a
> > > bond_active_slave   : "90:e2:ba:d6:1c:44"
> > > bond_downdelay      : 0
> > > bond_fake_iface     : false
> > > bond_mode           : balance-tcp
> > > bond_updelay        : 40000
> > > cvlans              : []
> > > external_ids        : {}
> > > fake_bridge         : false
> > > interfaces          : [61b9a345-2f3d-4127-b9cd-eaca8a749574,
> > >                        89ce3480-d62d-4291-9a84-bdf711016793,
> > >                        941c9393-1021-490c-84ac-311250ba0343,
> > >                        dc49ffd3-c259-43b6-8072-2ce12c52d1b1]
> > > lacp                : active
> > > mac                 : []
> > > name                : allbond
> > > other_config        : {}
> > > protected           : false
> > > qos                 : []
> > > rstp_statistics     : {}
> > > rstp_status         : {}
> > > statistics          : {}
> > > status              : {}
> > > tag                 : []
> > > trunks              : []
> > > vlan_mode           : []
> > >
> > > ---- allbond ----
> > > bond_mode: balance-tcp
> > > bond may use recirculation: yes, Recirc-ID : 3
> > > bond-hash-basis: 0
> > > updelay: 40000 ms
> > > downdelay: 0 ms
> > > next rebalance: 3229 ms
> > > lacp_status: negotiated
> > > lacp_fallback_ab: false
> > > active slave mac: 90:e2:ba:d6:1c:44(eth5)
> > >
> > > slave eth3: enabled
> > >   may_enable: true
> > >   hash 50: 1 kB load
> > >   hash 162: 1 kB load
> > >   hash 170: 1 kB load
> > >
> > > slave eth4: enabled
> > >   may_enable: true
> > >   hash 123: 4 kB load
> > >   hash 221: 12 kB load
> > >
> > > slave eth5: enabled
> > >   active slave
> > >   may_enable: true
> > >   hash 94: 1 kB load
> > >   hash 177: 1 kB load
> > >   hash 245: 1 kB load
> > >
> > > slave eth6: enabled
> > >   may_enable: true
> > >   hash 97: 46 kB load
> > >
> > > As you can see updelay is set to 40 seconds. I go to the switch and shutdown
> > > the port for eth6. It's immediately pulled from the bond. I then clear the
> > > switch counters and wait a few minutes. I would expect when the port is "no
> > > shutdown" that 40 seconds will go by before openvswitch brings it back into
> > > the bond. But that doesn't happen.
> > >
> > > 2018-10-02T15:31:32.885Z|00349|bond|INFO|interface eth6: link state down
> > > 2018-10-02T15:31:32.885Z|00350|bond|INFO|interface eth6: disabled
> > > 2018-10-02T15:35:45.861Z|00352|bond|INFO|interface eth6: link state up
> > > 2018-10-02T15:35:45.861Z|00353|bond|INFO|interface eth6: enabled
> > > 2018-10-02T15:35:51.286Z|00354|bond|INFO|bond allbond: shift 93kB of load
> > > (with hash 97) from eth3 to eth6 (now carrying 6kB and 93kB load,
> > > respectively)
> > >
> > > Immediately after link is re-established the port (eth6) is enabled again and
> > > traffic as shown in the switch counters begins to flow again. It feels like
> > > I'm doing something wrong but I've googled for hours and can't find anything
> > > that explains why the bond_updelay is being ignored.
> >
> > I spent some time looking through the history here. Ethan (CCed) added LACP
> > support to OVS in January 2011. From that point forward, OVS has always
> > ignored updelay and downdelay for a bond when LACP is enabled. I don't know
> > why, exactly. Maybe Ethan remembers.
> >
> > It would be easy to enable updelay and downdelay for LACP bonds:
> >
> > diff --git a/ofproto/bond.c b/ofproto/bond.c
> > index f87cdba7908f..8a90ba2686af 100644
> > --- a/ofproto/bond.c
> > +++ b/ofproto/bond.c
> > @@ -1717,8 +1717,7 @@ bond_link_status_update(struct bond_slave *slave)
> >              VLOG_INFO_RL(&rl, "interface %s: will not be %s",
> >                           slave->name, up ? "disabled" : "enabled");
> >          } else {
> > -            int delay = (bond->lacp_status != LACP_DISABLED ? 0
> > -                         : up ? bond->updelay : bond->downdelay);
> > +            int delay = up ? bond->updelay : bond->downdelay;
> >              slave->delay_expires = time_msec() + delay;
> >              if (delay) {
> >                  VLOG_INFO_RL(&rl, "interface %s: will be %s if it stays %s "
> >
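
To make the effect of the quoted patch concrete, here is a minimal standalone sketch of the delay selection before and after the change. It is not the OVS source: the enum is a simplified stand-in for the LACP state the patch tests, and the function and parameter names (delay_before_patch, delay_after_patch, updelay_ms, downdelay_ms) are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the LACP state tested in the quoted patch.
 * Only LACP_DISABLED matters for this sketch. */
enum lacp_status { LACP_NEGOTIATED, LACP_CONFIGURED, LACP_DISABLED };

/* Before the patch: the configured delays apply only when LACP is
 * disabled.  Any bond with LACP enabled gets a delay of 0. */
static int
delay_before_patch(enum lacp_status lacp, bool up,
                   int updelay_ms, int downdelay_ms)
{
    assert(updelay_ms >= 0 && downdelay_ms >= 0);
    return lacp != LACP_DISABLED ? 0 : up ? updelay_ms : downdelay_ms;
}

/* After the patch: the configured delays apply regardless of LACP. */
static int
delay_after_patch(bool up, int updelay_ms, int downdelay_ms)
{
    assert(updelay_ms >= 0 && downdelay_ms >= 0);
    return up ? updelay_ms : downdelay_ms;
}
```

With the settings from the report (updelay 40000 ms, downdelay 0 ms, LACP negotiated), delay_before_patch(LACP_NEGOTIATED, true, 40000, 0) evaluates to 0, which is consistent with eth6 being logged as enabled in the same millisecond as "link state up". delay_after_patch(true, 40000, 0) evaluates to 40000, the wait the configuration asks for.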
I *greatly* appreciate you looking into this, Ben. It's rare in open source that I find an actual bug, so I generally just assume I'm doing something wrong. The documentation is pretty clear about calling out the bond_updelay and bond_downdelay parameters, so at the very least those entries should be clarified or removed.

What next steps should I take? Is there a bug report I should file?

This is fairly critical to me because we run a ton of these 4-port bonds to 2 redundant Arista switches. When we upgrade the switch firmware, the switch comes back online and its ports all light up at the same time, but it takes a few seconds for spanning tree to sort everything out. During those seconds we have packet loss because OVS treats the ports as fully back in service when they aren't.

Thanks
