Hi Rasmus, On Mon, 22 Jul 2019 23:37:26 +0000, Rasmus Villemoes <rasmus.villem...@prevas.dk> wrote: > We have an ERPS (Ethernet Ring Protection Switching) setup involving > mv88e6250 switches which we're in the process of switching to a BSP > based on the mainline driver. Breaking any link in the ring works as > expected, with the ring reconfiguring itself quickly and traffic > continuing with almost no noticable drops. However, when plugging back > the cable, we see 5+ second stalls. > > This has been tracked down to the userspace application in charge of > the protocol missing a few CCM messages on the good link (the one that > was not unplugged), causing it to broadcast a "signal fail". That > message eventually reaches its link partner, which responds by > blocking the port. Meanwhile, the first node has continued to block > the port with the just plugged-in cable, breaking the network. And the > reason for those missing CCM messages has in turn been tracked down to > the VTU apparently being too busy servicing load/purge operations that > the normal lookups are delayed. > > Initial state, the link between C and D is blocked in software. > > _____________________ > / \ > | | > A ----- B ----- C *---- D > > Unplug the cable between C and D. > > _____________________ > / \ > | | > A ----- B ----- C * * D > > Reestablish the link between C and D. > _____________________ > / \ > | | > A ----- B ----- C *---- D > > Somehow, enough VTU/ATU operations happen inside C that prevents > the application from receving the CCM messages from B in a timely > manner, so a Signal Fail message is sent by C. When B receives > that, it responds by blocking its port. > > _____________________ > / \ > | | > A ----- B *---* C *---- D > > Very shortly after this, the signal fail condition clears on the > BC link (some CCM messages finally make it through), so C > unblocks the port. However, a guard timer inside B prevents it > from removing the blocking before 5 seconds have elapsed. > > It is not unlikely that our userspace ERPS implementation could be > smarter and/or is simply buggy. However, this patch fixes the symptoms > we see, and is a small optimization that should not break anything > (knock wood). The idea is simply to avoid doing an VTU load of an > entry identical to the one already present. To do that, we need to > know whether mv88e6xxx_vtu_get() actually found an existing entry, or > has just prepared a struct mv88e6xxx_vtu_entry for us to load. To that > end, let vlan->valid be an output parameter. The other two callers of > mv88e6xxx_vtu_get() are not affected by this patch since they pass > new=false. > > Signed-off-by: Rasmus Villemoes <rasmus.villem...@prevas.dk> > --- > drivers/net/dsa/mv88e6xxx/chip.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/dsa/mv88e6xxx/chip.c > b/drivers/net/dsa/mv88e6xxx/chip.c > index 6b17cd961d06..2e500428670c 100644 > --- a/drivers/net/dsa/mv88e6xxx/chip.c > +++ b/drivers/net/dsa/mv88e6xxx/chip.c > @@ -1423,7 +1423,6 @@ static int mv88e6xxx_vtu_get(struct mv88e6xxx_chip > *chip, u16 vid, > > /* Initialize a fresh VLAN entry */ > memset(entry, 0, sizeof(*entry)); > - entry->valid = true; > entry->vid = vid; > > /* Exclude all ports */ > @@ -1618,6 +1617,9 @@ static int _mv88e6xxx_port_vlan_add(struct > mv88e6xxx_chip *chip, int port, > if (err) > return err; > > + if (vlan.valid && vlan.member[port] == member) > + return 0; > + vlan.valid = true; > vlan.member[port] = member; > > err = mv88e6xxx_vtu_loadpurge(chip, &vlan);
I'd prefer not to use the mv88e6xxx_vtu_entry structure for output parameters, this can be confusing. As you correctly mentioned, this initialization is only done for _mv88e6xxx_port_vlan_add, so I'll prepare a patch which gets rid of the boolean parameter and move that logic up. Thanks, Vivien