On Mon, 2015-01-26 at 22:57 +0200, Or Gerlitz wrote:
> On Mon, Jan 26, 2015 at 9:38 PM, Doug Ledford <[email protected]> wrote:
> > On Mon, 2015-01-26 at 15:16 +0200, Or Gerlitz wrote:
> >> On Mon, Jan 26, 2015 at 3:00 PM, Erez Shitrit <[email protected]> wrote:
> >> > Following commit 016d9fb25cd9 "IPoIB: fix MCAST_FLAG_BUSY usage" both
> >> > IPv6 traffic and for the most cases all IPv4 multicast traffic aren't
> >> > working.
> >>
> >>
> >> Hi Doug + Roland
> >>
> >> Erez was very patiently reviewing and testing all the six (V0...V5)
> >> patch series you sent to fix the 3.19-rc1 regression.
> >
> > Yes he has.
> >
> >> Can you also give this patch a try?
>
> > I can test it.  But I need to know how it's supposed to be applied.
>
> just apply it on latest upstream and run whatever tests you have, simple.

I used the same base kernel that I used for my patchset.

> > It might fix the regression, it might also reintroduce a race on
> > ifup/ifdown.  I'll test and see.
>
> Let's see it in action @ your env

It passed the initial IPv6 after a failed join issue that my own
patchset just finally passes.  However, I didn't get more than 5 minutes
into testing before I was able to livelock the system.

In this case, from machine A running my patchset, I did

ping6 -I mlx4_ib0 -i .25 <machine B address>

On machine B running Erez's patch, I did:

rmmod ib_ipoib; modprobe ib_ipoib mcast_debug_level=1; sleep 2; ping6 -i .25 -c 10 -I mlx4_ib0 <machine A address>

And on the machine rdma-master, where the opensm runs, I did just a few:

systemctl restart opensm

The livelock is in the mcast flushing code.  On the machine that
livelocked, here's the dmesg tail:

[ 423.189514] mlx4_ib0.8002: multicast join failed for ff12:401b:8002:0000:0000:0000:ffff:ffff, status -110
[ 423.189541] mlx4_ib0.8002: deleting multicast group ff12:401b:8002:0000:0000:0000:0000:0001
[ 423.189545] mlx4_ib0.8002: deleting multicast group ff12:601b:8002:0000:0000:0000:0000:0001
[ 423.189547] mlx4_ib0.8002: deleting multicast group ff12:601b:8002:0000:0000:0001:ff7b:e1b1
[ 423.189549] mlx4_ib0.8002: deleting multicast group ff12:401b:8002:0000:0000:0000:0000:00fb
[ 423.189551] mlx4_ib0.8002: deleting multicast group ff12:401b:8002:0000:0000:0000:ffff:ffff
[ 423.204570] mlx4_ib0.8002: stopping multicast thread
[ 423.204573] mlx4_ib0.8002: flushing multicast list
[ 423.213567] mlx4_ib0: stopping multicast thread
[ 423.213571] mlx4_ib0: flushing multicast list

The rmmod operation is stuck in ib_sa_unregister_client (one of the
specific fixes my patchset resolves BTW).

On another machine I started another one of my tests:

On machine A:

ping6 -I mlx4_ib0 -i .25 <machine C address>

On rdma-master:

while true; do sleep 4; systemctl restart opensm; done

On machine C:

passes=0; while true; do ifdown qib_ib0; ifup qib_ib0; echo "Passes $passes..."; let passes++; done

In this test Erez's patch made it through about 5 down/up cycles before
the machine oopsed.

Do I need to keep going?  I was able to crash two different machines on
two different brands of hardware within only a few test cycles.  My
patchset, while large and intrusive, now survives all of this with
flying colors, and now that I've replicated Erez's specific multicast
join failure, I've taken care of that corner case too (and will be
adding that to my long term QE setup so it doesn't regress in the
future).

-- 
Doug Ledford <[email protected]>
              GPG KeyID: 0E572FDD
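
For anyone wanting to rerun the ifdown/ifup loop from the machine C test without leaving it unbounded, here is a minimal sketch. It is not the script used for the results above. The names `run` and `cycle_iface`, the `DRY_RUN` variable, and the pass limit are all hypothetical additions; by default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the ifdown/ifup stress loop, bounded and with a dry-run mode.
# run, cycle_iface, DRY_RUN and the pass limit are hypothetical names added
# for illustration; they are not part of the original test.

run() {
    # In dry-run mode (the default) print the command; otherwise execute it.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

cycle_iface() {
    # $1 = interface name, $2 = number of down/up cycles to attempt
    iface=$1
    limit=$2
    passes=0
    while [ "$passes" -lt "$limit" ]; do
        run ifdown "$iface" || return 1
        run ifup "$iface" || return 1
        echo "Passes $passes..."
        passes=$((passes + 1))
    done
}
```

Calling `cycle_iface qib_ib0 10` prints the ten down/up cycles it would perform. Setting `DRY_RUN=0` makes it run the real commands, which needs root and should only be done on a test machine, since the original loop was able to oops the box.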
