On Tue, 2015-01-27 at 10:33 +0200, Erez Shitrit wrote:
> The patch I sent, only claims to fix the regression in multicast and 
> sendonly issues, it can replace the first 3 patches and the one you sent 
> me off list from your last patchset.

OK.  But my patchset resolve those issues too, and as you point out,
it's not really my entire patchset that resolves just this regression.
It would be easy enough to squash just those patches in my patchset
required to solve just this regression down to just one patch if patch
count bothers people (I find that to be a silly thing to be concerned
with, but oh well).

> (probably there are more bugs, that the rest of your patchset solved, 
> hence it should be tested with the rest of your patchset)

Yes, there are.  But Or has been arguing that we should take your patch
to resolve the regression and leave the rest out.  So, testing that
specific situation seemed to make sense in terms of finding out if that
left us with a kernel that was in reasonable shape.

If we want to look at your patch combined with my patchset in lieu of
the individual patches in mine that perform the same fix, then it would
take some work to merge them as there are differences in assumptions
made and they would not simply work together.

> And probably there are more bugs that your patcheset didn't fix yet -:)
> For example, I run your last patchset+ the last fix you sent me with:
> modprobe -r ib_ipoib; modprobe  ib_ipoib;
> ping6
> some adding/deleting  one child interface and got the next panic:
> 
> [81209.348259] ib0: join completion for 
> ff12:601b:ffff:0000:0000:0001:ff43:3bf1 (status -102)
> [81209.408787] BUG: unable to handle kernel NULL pointer dereference at 
> 0000000000000020
> [81209.416750] IP: [<ffffffffa096b399>] ipoib_mcast_join+0xa9/0x1b0 
> [ib_ipoib]

This looks exactly like the issue solved by 3ee9876fbdb (IB/ipoib: fix
race between mcast_dev_flush and mcast_join).  But without the full logs
I wouldn't know.  However, if it happened while creating/deleting
children then there might be something lurking in the child delete logic
that is not properly locked.

> Doug, there are bugs that probably will be found all the time with any 
> patchset, my point was only according to the way sendonly need to be 
> handled.

And I proved with my final patchset that sendonly does *not* have to be
handled that way.  I proved that the regression could be solved without
basically reverting the logic to the old split way of handling multicast
joins.

So you've argued several times that "sendonly conceptually should be in
the tx path because at the Ethernet layer we are emulating there is no
join for sendonly traffic, it just goes straight out".  To whit, your
patch restores that behavior.

My argument against is that:

1) We are emulating Ethernet, we are not Ethernet, and we have to have a
sendonly join that Ethernet does not.  Emulating Ethernet does not mean
we should do conceptually silly things in our lower layer driver.  We
must do a sendonly join, which is still a join, and splitting it off
from other joins makes no sense for us at the InfiniBand layer.  And in
fact, once it leaves the ipoib layer and enters the ib_sa layer, the two
code paths completely merge and are absolutely no different.  And at
that layer, they *always* get queued to a work queue and handled later.
So there really is no advantage to doing the send directly in the tx
path once you look beyond the ipoib_multicast.c file itself.

2) Having two code paths that only vary in 2 simple ways
  a) one path uses 1 for the membership type, one uses 4
  b) one attaches a QP at the end, the other does not
but which are otherwise almost identical (and could probably be joined
into a single code path, which is something I'm planning for 3.20) does
not make sense.

3) Fixing oopses and races and other issues with (needlessly) multiple
racing code paths is considerably harder than with fewer code paths.
For the sake of hardening the IPoIB driver against various situations, a
single code path is preferable to multiple code paths.

I haven't heard an argument from you yet that I believe beats the points
I've made above.  So I believe a solution that does not revert back to
having two separate code paths to be maintained is preferable to your
patch.

-- 
Doug Ledford <[email protected]>
              GPG KeyID: 0E572FDD


Attachment: signature.asc
Description: This is a digitally signed message part

Reply via email to