I will look at what could be done without breaking the protocol and will report back in a couple of days.
On Jan 18, 2017 2:56 PM, "Chad Norgan" <[email protected]> wrote: > Shu, > > Thanks so much for the deeper understanding. I think from here I'm > going to get a mirror port setup on the switch so I can confirm that > the switch is indeed sending that pdu on the restored link. I can then > take that to the switch vendor to address. > > I love the idea of patching OVS to handle this case, but I will also > work the switch vendor side. I would be willing to test a build if you > guys have an idea on how you could make OVS cope with it. I'm curious > if it couldn't just be discarded. > > Thanks, > Chad > > On Wed, Jan 18, 2017 at 12:49 PM, Ben Pfaff <[email protected]> wrote: > > Shu, thanks for all the debugging! > > > > If this is a correct interpretation of the standard, and the peer switch > > is misbehaving, and it's common behavior across a product line, then > > possibly we'll need to make OVS cope with it. > > > > Chad, do you have thoughts on Shu's discoveries? > > > > Thanks, > > > > Ben. > > > > On Wed, Jan 18, 2017 at 07:05:10AM -0800, Shu Shen wrote: > >> Hi Chad, > >> > >> I now have a theory on what's happening in your case. > >> > >> I realized that the first LACPDU packet the peer switch sent for > >> re-negotiation contains all zeros in Actor Informaiton TLV, line > >> 81-84 from your gist: > >> > >> 20:38:17.109650 00:00:00:00:00:00 > 01:80:c2:00:00:02, ethertype Slow > Protocols (0x8809), length 124: LACPv1, length 110 > >> Actor Information TLV (0x01), length 20 > >> System 00:00:00:00:00:00, System Priority 0, Key 0, Port > 0,Port Priority 0 > >> State Flags [Timeout] > >> > >> I'd assume the switch is doing the same on eth1 when the link is back > >> up. > >> > >> When OVS receives the packet with all-zero Actor Information TLV on > >> eth1, it compares the TLV with the partener information it has on record > >> for eth1, and find they are different. Therefore, it considers the peer > >> has changed and will trigger lacp_update_attach() by setting > >> lacp->update to true. See > >> https://github.com/openvswitch/ovs/blob/master/lib/lacp.c#L355 > >> > >> In lacp_update_attach(), the function tries to determine a "lead" slave, > >> which would be the slave with the highest priority. Because the eth1 > >> slave has all-zero partner actor TLV, it will always be selected as the > >> "lead" slave. See line 627-637 in > >> https://github.com/openvswitch/ovs/blob/master/lib/lacp.c#L627 > >> > >> Next in lacp_update_attach(), all other slaves that have different > >> sys_id and key from the lead slave will be deattached. See line 642-651: > >> https://github.com/openvswitch/ovs/blob/master/lib/lacp.c#L642 > >> This is because all links attached to the same LACP aggregator would > >> need be talking to the same aggregator on peer system as identified by > >> sys_id and key. > >> > >> Once a slave is deattached, in your case the eth0 slave, it will no > >> longer have (Synchronization, Collecting, Distributing) flags set and > >> thus the "rogue" packet you are seeing on eth0, which is indeed to > >> indicate that the slave is out-of-sync and a re-negotiation is required. > >> > >> My opinion is that OVS appears to be behaving correctly to restart LACP > >> negotiation on both links upon receiving the all-zero actor TLV from > >> peer switch. The issue is likely on the peer switch, it probably should > >> have sent a proper actor TLV about itself with at least unchanged system > >> id, key, after the link comes back on eth1. > >> > >> /Shu > >> > >> > >> On Tue, Jan 17, 2017 at 09:30:39PM -0800, Shu Shen wrote: > >> > On Tue, Jan 17, 2017 at 04:54:51PM -0600, Chad Norgan wrote: > >> > > Given that the partner port_id on the rogue packet matches the slave > >> > > it's sent out. I lean towards #1, that the LACP implementation is > >> > > somehow mixing up the status for the slave's pdu, rather than > leaking > >> > > eth1's pdu out the eth0 interface. > >> > > > >> > > -Chad > >> > > >> > Hi Chad, > >> > > >> > A few observations and questions as below: > >> > > >> > 1) I wrote an additional testcase for the slave down and back up case, > >> > which appears to be working fine. I put additional debug messages (not > >> > in the commit referred below thought) to trace the lacpdu being sent > by > >> > all slaves and did not see any rogue package. Of course, the testcase > >> > uses two ovs switches and patch ports, so it may well be far away from > >> > reproducing the problem you are having. You may find the test case > >> > here: > >> > > >> > https://github.com/shushen/ovs/commit/ > 72aa0afc6b61d5135ea9253b8aaf31a57c7c4734 > >> > > >> > And travis-ci builds with the above test case included are passing: > >> > https://travis-ci.org/shushen/ovs/builds/192922935 > >> > > >> > 2) Could you please elaborate a bit more about how you "manually down > >> > the eth1 interface" and "bring eth1 back up"? Did you unplug a > physical > >> > link or did you use any ovs/Linux CLI to do so? This may help me > refine > >> > the test case to reproduce what you are doing. > >> > > >> > 3) I find it interesting in the packet trace from the gist you posted, > >> > where the source mac address from the peer switch is all zeros, see > >> > > >> > https://gist.github.com/beardymcbeards/ > 7bd9feca87c0574e996a397d90d5ff98#file-2_tcpdump-L81 > >> > > >> > If I read correctly, in Section 6.2.11.1 of 802.1AX-2014, it says: > >> > > >> > Protocol entities sourcing frames from within the Link Aggregation > >> > sublayer (e.g., LACP and the Marker protocol) use the MAC address > of > >> > the MAC within an underlying Aggregation Port as the SA in frames > >> > transmitted through that Aggregation Port. > >> > > >> > I'm not sure why the peer switch is using the all-zero MAC address but > >> > it probably shouldn't. I don't know how ovs datapath handles such > >> > packets. If when eth1 is coming back up and the source MAC address is > >> > also all zeros, could this affect how the LACPDU from eth1 being > >> > handled? I welcome comments from you and the list. > >> > > >> > I'd appreciate if you could provide a bit more information on 2) or > any > >> > other thoughts. My intention is to investigate a bit more on this > >> > problem. > >> > > >> > /Shu > >> > > >> > > _______________________________________________ > >> > > discuss mailing list > >> > > [email protected] > >> > > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss > >> _______________________________________________ > >> discuss mailing list > >> [email protected] > >> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss >
_______________________________________________ discuss mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
