On Thu, 2015-01-22 at 09:31 -0500, Doug Ledford wrote:
> My 8-patch set taken into 3.19 caused some regressions.  This patch
> set resolves those issues.
> 
> These patches are to resolve issues created by my previous patch set.
> While that set worked fine in my testing, there were problems with
> multicast joins after the initial set of joins had completed.  Since my
> testing relied upon the normal set of multicast joins that happen
> when the interface is first brought up, I missed those problems.
> 
> Symptoms vary from failure to send packets due to a failed join, to
> loss of connectivity after a subnet manager restart, to failure
> to properly release multicast groups on shutdown resulting in hangs
> when the mlx4 driver attempts to unload itself via its reboot
> notifier handler.
> 
> This set of patches has passed a number of tests above and beyond my
> original tests.  As suggested by Or Gerlitz I added IPv6 and IPv4
> multicast tests.  I also added both subnet manager restarts and
> manual shutdown/restart of individual ports at the switch in order to
> ensure that the ENETRESET path was properly tested.  The test
> sequence was: run the tests, restart the subnet manager, wait through
> a quiescent period for caches to expire, then resume testing to make
> sure that ARP and neighbor discovery work after the subnet manager
> restart.
> 
> All in all, I have not been able to trip the multicast joins up any
> longer.
> 
> Additionally, the original impetus for my first 8-patch set was that
> it was simply too easy to break the IPoIB subsystem with this simple
> loop:
> 
> while true; do
>     ifconfig ib0 up
>     ifconfig ib0 down
> done
> 
> Just to be safe, I made sure this problem did not resurface.
> 
> v5: - fix an oversight in mcast_restart_task that leaked mcast joins
>     - fix a failure to flush the ipoib_workqueue on deregister that
>       meant we could end up running our code after our device had been
>       removed, resulting in an oops
>     - remove a debug message that could be triggered so fast that the
>       kernel printk mechanism would starve out the mcast join task
>       thread, resulting in what looked like a mcast failure that was
>       really just delayed action
> 
> 
> Doug Ledford (10):
>   IB/ipoib: fix IPOIB_MCAST_RUN flag usage
>   IB/ipoib: Add a helper to restart the multicast task
>   IB/ipoib: make delayed tasks not hold up everything
>   IB/ipoib: Handle -ENETRESET properly in our callback
>   IB/ipoib: don't restart our thread on ENETRESET
>   IB/ipoib: remove unneeded locks
>   IB/ipoib: fix race between mcast_dev_flush and mcast_join
>   IB/ipoib: fix ipoib_mcast_restart_task
>   IB/ipoib: flush the ipoib_workqueue on unregister
>   IB/ipoib: cleanup a couple debug messages
> 
>  drivers/infiniband/ulp/ipoib/ipoib.h           |   1 +
>  drivers/infiniband/ulp/ipoib/ipoib_main.c      |   2 +
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 234 ++++++++++++++-----------
>  3 files changed, 131 insertions(+), 106 deletions(-)
> 

FWIW, a couple of different customers have tried a test kernel I built
internally with my patches and I've had multiple reports that all
previously observed issues have been resolved.
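
For anyone who wants to repeat the up/down check quoted above without an
endless loop, a bounded variant might look like the sketch below.  The
function name toggle_iface, the DRY_RUN switch, the default count, and the
use of "ip link" in place of ifconfig are all assumptions made for
illustration; none of them come from the patch set.

```shell
#!/bin/sh
# Bounded variant of the ib0 up/down loop from the cover letter.
# toggle_iface, DRY_RUN and the default count are hypothetical names and
# values chosen for this sketch.
toggle_iface() {
    iface=${1:-ib0}      # interface to cycle
    count=${2:-100}      # number of up/down cycles
    i=0
    while [ "$i" -lt "$count" ]; do
        for state in up down; do
            if [ -n "$DRY_RUN" ]; then
                # Print the command instead of running it (no root needed).
                echo "ip link set dev $iface $state"
            else
                # Stop at the first failure so a broken state is not masked.
                ip link set dev "$iface" "$state" || return 1
            fi
        done
        i=$((i + 1))
    done
}
```

Run as root with something like "toggle_iface ib0 1000"; setting DRY_RUN=1
first only prints the commands that would be issued.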

-- 
Doug Ledford <[email protected]>
              GPG KeyID: 0E572FDD

