Re: [PATCH V3 FIX for-3.19] IB/ipoib: Fix sendonly traffic and multicast traffic

Doug Ledford Mon, 26 Jan 2015 14:00:45 -0800

On Mon, 2015-01-26 at 22:57 +0200, Or Gerlitz wrote:
> On Mon, Jan 26, 2015 at 9:38 PM, Doug Ledford <[email protected]> wrote:
> > On Mon, 2015-01-26 at 15:16 +0200, Or Gerlitz wrote:
> >> On Mon, Jan 26, 2015 at 3:00 PM, Erez Shitrit <[email protected]> wrote:
> >> > Following commit 016d9fb25cd9 "IPoIB: fix MCAST_FLAG_BUSY usage" both
> >> > IPv6 traffic and for the most cases all IPv4 multicast traffic aren't
> >> > working.
> >>
> >>
> >> Hi Doug + Roland
> >>
> >> Erez was very patiently reviewing and testing all the six (V0...V5)
> >> patch series you sent to fix the 3.19-rc1 regression.
> >
> > Yes he has.
> 
> 
> >>  Can you also give this patch a try?
> 
> > I can test it.  But I need to know how it's supposed to be applied.
> 
> just apply it on latest upstream and run whatever tests you have, simple.


I used the same base kernel that I used for my patchset.

> > It might fix the regression, it might also reintroduce a race on
> > ifup/ifdown.  I'll test and see.
> 
> Let's see it in action @ your env

It passed the initial test, IPv6 after a failed join, which my own
patchset has only just started passing.

However, I didn't get more than 5 minutes into testing before I was able
to livelock the system.  In this case, from machine A running my
patchset, I did

ping6 -I mlx4_ib0 -i .25 <machine B address>

On machine B running Erez's patch, I did:

rmmod ib_ipoib; modprobe ib_ipoib mcast_debug_level=1; sleep 2; ping6 -i .25 -c 10 -I mlx4_ib0 <machine A address>

And on the machine rdma-master, where opensm runs, I ran this a few times:

systemctl restart opensm

The livelock is in the mcast flushing code.  On the machine that
livelocked, here's the dmesg tail:

[  423.189514] mlx4_ib0.8002: multicast join failed for ff12:401b:8002:0000:0000:0000:ffff:ffff, status -110
[  423.189541] mlx4_ib0.8002: deleting multicast group ff12:401b:8002:0000:0000:0000:0000:0001
[  423.189545] mlx4_ib0.8002: deleting multicast group ff12:601b:8002:0000:0000:0000:0000:0001
[  423.189547] mlx4_ib0.8002: deleting multicast group ff12:601b:8002:0000:0000:0001:ff7b:e1b1
[  423.189549] mlx4_ib0.8002: deleting multicast group ff12:401b:8002:0000:0000:0000:0000:00fb
[  423.189551] mlx4_ib0.8002: deleting multicast group ff12:401b:8002:0000:0000:0000:ffff:ffff
[  423.204570] mlx4_ib0.8002: stopping multicast thread
[  423.204573] mlx4_ib0.8002: flushing multicast list
[  423.213567] mlx4_ib0: stopping multicast thread
[  423.213571] mlx4_ib0: flushing multicast list
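
If you want to compare runs, a quick way to boil down this kind of
mcast_debug_level output is to count the join failures, group deletions
and list flushes.  The following is only a rough sketch: the function
name summarize_mcast_log is made up for illustration, and it assumes the
message wording shown in the dmesg tail above.

```shell
# Hypothetical helper (not part of any patch): summarize IPoIB multicast
# debug messages read from stdin.  The patterns are taken from the
# message text in the dmesg tail above.
summarize_mcast_log() {
    awk '
        /multicast join failed for/ { failed++ }
        /deleting multicast group/  { deleted++ }
        /flushing multicast list/   { flushed++ }
        END {
            # Counters that never matched print as 0.
            printf "join_failed=%d deleted=%d flushes=%d\n", failed, deleted, flushed
        }
    '
}
```

It would be used as "dmesg | summarize_mcast_log".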

The rmmod operation is stuck in ib_sa_unregister_client (one of the
specific problems my patchset fixes, BTW).

On another machine I started another one of my tests:

On machine A:

ping6 -I mlx4_ib0 -i .25 <machine C address>

On rdma-master:

while true; do sleep 4; systemctl restart opensm; done

On machine C:

passes=0; while true; do ifdown qib_ib0; ifup qib_ib0; echo "Passes $passes..."; let passes++; done

In this test Erez's patch made it through about 5 down/up cycles before
the machine oopsed.
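
For anyone who wants to repeat the down/up test without an endless loop,
a bounded variant might look like the sketch below.  It is an
illustration only: the function name cycle_iface and the IFDOWN_CMD and
IFUP_CMD overrides are invented here so the loop can be dry-run, and
unlike the one-liner above it prints the count of completed passes and
stops at the first command that fails.

```shell
# Hypothetical helper: run a fixed number of ifdown/ifup cycles.
# IFDOWN_CMD and IFUP_CMD are assumptions of this sketch; they default
# to the commands used in the test above.
IFDOWN_CMD=${IFDOWN_CMD:-ifdown}
IFUP_CMD=${IFUP_CMD:-ifup}

cycle_iface() {
    # $1 = interface name, $2 = number of down/up cycles to run
    iface=$1
    max=$2
    passes=0
    while [ "$passes" -lt "$max" ]; do
        "$IFDOWN_CMD" "$iface" || return 1
        "$IFUP_CMD" "$iface" || return 1
        passes=$((passes + 1))
        echo "Passes $passes..."
    done
}
```

For example, "cycle_iface qib_ib0 100" would run 100 cycles on machine C
while opensm is being restarted on rdma-master.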

Do I need to keep going?  I was able to crash two different machines on
two different brands of hardware within only a few test cycles.  My
patchset, while large and intrusive, now survives all of this with
flying colors, and now that I've replicated Erez's specific multicast
join failure, I've taken care of that corner case too (and will be
adding that to my long term QE setup so it doesn't regress in the
future).


-- 
Doug Ledford <[email protected]>
              GPG KeyID: 0E572FDD
