On Tue, Oct 09, 2018 at 08:33:32AM -0600, Daniel Leaberry wrote:
>
> > On Oct 8, 2018, at 5:36 PM, Ethan J. Jackson <[email protected]> wrote:
> >
> > No memory unfortunately.
> >
> > Ethan
> >
> > Ethan J. Jackson
> > ejj.sh
> >
> >
> > On Mon, Oct 08, 2018 at 1:45 PM, Ben Pfaff <[email protected]> wrote:
> > On Tue, Oct 02, 2018 at 10:28:52AM -0600, Daniel Leaberry via discuss wrote:
> >
> > I have CentOS 7 with Open vSwitch 2.9.0. The server has 4 ports in an LACP
> > bond (called allbond) connected to a set of MLAG'd Arista switches. Here's
> > the config:
> >
> > ovs-vsctl list port allbond
> > _uuid : 9f224f2d-8bb1-4cfd-84e2-d60c6d973a7a
> > bond_active_slave : "90:e2:ba:d6:1c:44"
> > bond_downdelay : 0
> > bond_fake_iface : false
> > bond_mode : balance-tcp
> > bond_updelay : 40000
> > cvlans : []
> > external_ids : {}
> > fake_bridge : false
> > interfaces : [61b9a345-2f3d-4127-b9cd-eaca8a749574,
> > 89ce3480-d62d-4291-9a84-bdf711016793, 941c9393-1021-490c-84ac-311250ba0343,
> > dc49ffd3-c259-43b6-8072-2ce12c52d1b1]
> > lacp : active
> > mac : []
> > name : allbond
> > other_config : {}
> > protected : false
> > qos : []
> > rstp_statistics : {}
> > rstp_status : {}
> > statistics : {}
> > status : {}
> > tag : []
> > trunks : []
> > vlan_mode : []
> >
> > ---- allbond ----
> > bond_mode: balance-tcp
> > bond may use recirculation: yes, Recirc-ID : 3
> > bond-hash-basis: 0
> > updelay: 40000 ms
> > downdelay: 0 ms
> > next rebalance: 3229 ms
> > lacp_status: negotiated
> > lacp_fallback_ab: false
> > active slave mac: 90:e2:ba:d6:1c:44(eth5)
> >
> > slave eth3: enabled
> > may_enable: true
> > hash 50: 1 kB load
> > hash 162: 1 kB load
> > hash 170: 1 kB load
> >
> > slave eth4: enabled
> > may_enable: true
> > hash 123: 4 kB load
> > hash 221: 12 kB load
> >
> > slave eth5: enabled
> > active slave
> > may_enable: true
> > hash 94: 1 kB load
> > hash 177: 1 kB load
> > hash 245: 1 kB load
> >
> > slave eth6: enabled
> > may_enable: true
> > hash 97: 46 kB load
> >
> > As you can see, updelay is set to 40 seconds. I go to the switch and
> > shut down the port for eth6. It's immediately pulled from the bond. I then
> > clear the switch counters and wait a few minutes. I would expect that, when
> > the port is "no shutdown", 40 seconds will go by before Open vSwitch brings
> > it back into the bond. But that doesn't happen.
> >
> > 2018-10-02T15:31:32.885Z|00349|bond|INFO|interface eth6: link state down
> > 2018-10-02T15:31:32.885Z|00350|bond|INFO|interface eth6: disabled
> > 2018-10-02T15:35:45.861Z|00352|bond|INFO|interface eth6: link state up
> > 2018-10-02T15:35:45.861Z|00353|bond|INFO|interface eth6: enabled
> > 2018-10-02T15:35:51.286Z|00354|bond|INFO|bond allbond: shift 93kB of load
> > (with hash 97) from eth3 to eth6 (now carrying 6kB and 93kB load,
> > respectively)
> >
> > Immediately after the link is re-established, the port (eth6) is enabled
> > again and traffic, as shown in the switch counters, begins to flow again. It
> > feels like I'm doing something wrong, but I've googled for hours and can't
> > find anything that explains why bond_updelay is being ignored.
> >
> > I spent some time looking through the history here. Ethan (CCed) added LACP
> > support to OVS in January 2011. From that point forward, OVS has always
> > ignored updelay and downdelay for a bond when LACP is enabled. I don't know
> > why, exactly. Maybe Ethan remembers.
> >
> > It would be easy to enable updelay and downdelay for LACP bonds:
> >
> > diff --git a/ofproto/bond.c b/ofproto/bond.c
> > index f87cdba7908f..8a90ba2686af 100644
> > --- a/ofproto/bond.c
> > +++ b/ofproto/bond.c
> > @@ -1717,8 +1717,7 @@ bond_link_status_update(struct bond_slave *slave)
> > VLOG_INFO_RL(&rl, "interface %s: will not be %s", slave->name, up ?
> > "disabled" : "enabled");
> > } else {
> > - int delay = (bond->lacp_status != LACP_DISABLED ? 0
> > - : up ? bond->updelay : bond->downdelay);
> > + int delay = up ? bond->updelay : bond->downdelay;
> > slave->delay_expires = time_msec() + delay;
> > if (delay) {
> > VLOG_INFO_RL(&rl, "interface %s: will be %s if it stays %s "
> >
> >
>
> I *greatly* appreciate you looking into this, Ben. It's rare in open source
> that I find an actual bug, so generally I just figure I'm doing something
> wrong. The documentation is pretty clear about calling out the bond_updelay
> and bond_downdelay parameters, so at the very least those should be
> clarified/removed.
>
> What next steps should I take? Is there a bug report I should file? This is
> fairly critical to me because we run a ton of these 4-port bonds to 2 Arista
> switches (they're redundant). When we upgrade the switch firmware, the switch
> comes back online and the ports all light up at the same time, but it takes a
> few seconds for spanning tree to sort everything out. During those seconds we
> have packet loss because OVS thinks the ports are fully back in action when
> they aren't.

Since we don't have a known reason not to honor these settings for LACP
bonds, I propose that we just change OVS behavior.

I sent a formal patch:
https://patchwork.ozlabs.org/patch/982091/