Hi Alessandro,

Did you bisect the whole kernel or just msk(4) changes?

Cheers,
dlg

> On 8 Dec 2021, at 04:39, Alessandro De Laurenzis <[email protected]>
> wrote:
>
> Greetings,
>
> I recently installed OpenBSD 7.0 on an old CoreDuo2 machine (Compaq 610,
> complete dmesg in attach), which was powered by 5.5/5.6/5.7 some years ago,
> without any relevant issues (after that, it has been used as home server with
> Debian).
>
>> mskc0 at pci4 dev 0 function 0 "Marvell Yukon 88E8042" rev 0x10, Yukon-2 FE+
>> rev. A0 (0x0): msi
>> msk0 at mskc0 port A: address 18:a9:05:94:ab:19
>> eephy0 at msk0 phy 0: 88E3016 10/100 PHY, rev. 0
>
> I noticed that the trunk(4) failover protocol is broken when the Ethernet
> cable is plugged in (starting in this configuration, no lease is acquired
> from DHCP server, switching to Ethernet from wifi breaks the connection; in
> both cases, trunk and msk0 status is: no carrier).
>
> It's worth noting that when msk0 is configured as "stand-alone" (i.e.,
> without trunk(4) failover), the connection is pretty functional and stable.
>
> Since I didn't remember any similar problems showing up with 5.x, I made a
> bit of bisecting, and my conclusion is that the functionality got broken b/w
> 6.2 and 6.3 and, specifically, after the following commit:
>
>> RCS file: /cvs/src/sys/dev/pci/if_msk.c,v
>> ----------------------------
>> revision 1.131
>> date: 2018/01/06 03:11:04; author: dlg; state: Exp; lines: +251 -311;
>> commitid: BhB8LisF92o4xfOK;
>> rework the transmit and receive paths to address reliability issues.
>> phessler@ has been having trouble with msk on overdrive 1000s. some
>> of the issues relate to the driver not coping with exhaustion of
>> mbufs for the rx ring, the other issues are corruption of the mcl9k
>> pool that msk uses.
>> this diff adds a timeout that the rx refill code uses when the rx
>> ring is empty and cannot be filled. it'll periodically retry the
>> ring refill until it can get some mbufs in the air again.
>> the current code made hunting for the mcl9k issue too hard, so this
>> rewrites it to be simpler and more like other drivers. there's now
>> just arrays of mbuf pointers and dmamaps to shadow the hardware
>> ring entries, and producer and consumer indexes. what was there
>> before had linkes lists of something to hold mbuf pointers and
>> dmamaps, and some way to go from the ring to go back to that. i
>> think, it was hard to tell what was happening.
>> this also copies the ADDR64 handling on the tx ring to the rx ring.
>> this potentially makes more rx descriptors available, but that can
>> happen later.
>> in hindsight the mcl9k problem could have been from letting if_rxr
>> allocate the entier ring. if every descriptor was filled, the chip
>> may have run around the ring when it shouldnt have. giving rxr one
>> less descriptor than there is on the ring may have fixed the problem
>> too.
>> this work also makes it easier to make msk mpsafe.
>> tested by an ok phessler@
>> ok kettenis@ deraadt@
>> =============================================================================
>
> and the corresponding one for sys/dev/pci/if_mskvar.h (revision 1.14, same
> log).
>
> On a fresh 6.3 install, which was showing the issue, I reverted the 2 files
> to the revisions 1.130 and 1.13 respectively, observing a functional trunk(4)
> failover again.
>
> The diff is too long and complex, so I cannot say where the problem lies
> exactly, but I hope this report contains enough information to start an
> analysis (I'm copying the involved developers, just in case they are not
> reading this list); of course, I'm available to test any patches (on 7.0 or
> -current) and add further details if needed.
>
> Please note that the dmesg is from OBSD 6.3, since that is the version
> currently installed on the laptop; in case you're interested in the
> 7.0/current's dmesg, just let me know.
>
> All the best
>
> --
> Alessandro De Laurenzis
> [mailto:[email protected]]
> Web: http://www.atlantide.mooo.com
> LinkedIn: http://it.linkedin.com/in/delaurenzis
> <dmesg.txt>
