On Tue, 2015-01-27 at 15:05 +0200, Or Gerlitz wrote:
> On 1/27/2015 12:00 AM, Doug Ledford wrote:
> > However, I didn't get more than 5 minutes into testing before I was able
> > to livelock the system.  In this case, from machine A running my
> > patchset, I did
> >
> > ping6 -I mlx4_ib0 -i .25 <machine B address>
> >
> > On machine B running Erez's patch, I did:
> >
> > rmmod ib_ipoib; modprobe ib_ipoib mcast_debug_level=1; sleep 2; ping6
> > -i .25 -c 10 -I mlx4_ib0 <machine A address>
> >
> > And on the machine rdma-master, where the opensm runs, I did just a few:
> >
> > systemctl restart opensm
> >
> > The livelock is in the mcast flushing code.  On the machine that livelocked
> 
> Doug,
> 
> The tests you are running and the issues you are seeing fall well into a 
> to-be-fixed-in-some-kernel-rc1 category but by NO means as something 
> which should be an rc6 fix.
> 
> You must do the distinction between Erez's patch that fixes the 
> regressions introduced on 3.19-rc1 to your attempts to fix many more 
> instabilities in the IPoIB driver, which are seen under whatever nasty 
> test you are running (and it's good we want to reach there).
> 
> Roland, the V3 patch solves the rc1 regression and I think we should 
> pick it up, by no way we can allow to pick eleven patches @ this point.
> 
> Thoughts?

As I said in my other email to Erez, and as Erez points out, not all 11
patches of mine are needed to resolve the specific regression you are
talking about.  However, my fix resolves the regression without
reverting to splitting the multicast joins down two separate code paths,
which I think is the wrong thing to do and something that actually makes
hardening the driver harder.  If you *really* don't want my patchset
because it's 11 patches (something I couldn't care less about, and I
don't think you should either...the content of the patches is much more
important than the count), I could certainly do some squashing.  And I
could split out just the regression fix from all the rest too.

But in a situation like this, what I'm *really* concerned about is the
final result.  And here's how it breaks down under the various options:

v3.18 plain - ifconfig down/ifconfig up on ib0 can easily lock machine

v3.18 + 8 patches for above issue - initial multicast bringup works, but
additional joins attempted later (after the multicast task had decided
it was done with the initial join set) do not.  There were multiple
symptoms of the multicast join issue, one of which was failure of ipv6
or ipv4 multicast, but another was hangs in ib_sa_unregister_client on
shutdown, which could just as easily be classified as a regression as
the ipv6/ipv4 multicast failure

v3.18 + 8 patches + Erez patch - subsequent multicast joins now work
again, but the other symptoms of the 8 patch series, including other
regressions, are not addressed at all.  Adding this patch also reverts
part of the changes made in the original 8 patch series and quite likely
reintroduces instability on ifconfig down/ifconfig up cycles (making one
wonder if this fix is better or worse than just reverting the original 8
patch series)

v3.18 + 8 patches + 11 fix patches - multicast joins now work again,
ifconfig down/ifconfig up fix continues to work, other regressions such
as hangs in ib_sa_unregister_client on shutdown are fixed, and overall
it is considerably harder to make the kernel behave badly than with any
of the above alternatives.  I don't claim that it's perfect or that
there isn't additional hardening to be done, but I believe this kernel
is considerably less likely to be tripped up than any of the rest above

If there hadn't been a flurry of testing around my patches, then I
wouldn't suggest them at all.  But they have been getting testing.  Lots
of it.  And so have the alternatives.  And out of the bunch, regardless
of patch count, my patchset has fared best under testing.  But if we
don't want to take it now, then I would probably recommend reverting
the original 8 patches and then queuing the whole bunch early for 3.20.

-- 
Doug Ledford <[email protected]>
              GPG KeyID: 0E572FDD


