On Thu, 2015-01-22 at 09:31 -0500, Doug Ledford wrote:
> My 8 patch set taken into 3.19 caused some regressions.  This patch
> set resolves those issues.
>
> These patches are to resolve issues created by my previous patch set.
> While that set worked fine in my testing, there were problems with
> multicast joins after the initial set of joins had completed.  Since my
> testing relied upon the normal set of multicast joins that happen
> when the interface is first brought up, I missed those problems.
>
> Symptoms vary from failure to send packets due to a failed join, to
> loss of connectivity after a subnet manager restart, to failure
> to properly release multicast groups on shutdown resulting in hangs
> when the mlx4 driver attempts to unload itself via its reboot
> notifier handler.
>
> This set of patches has passed a number of tests above and beyond my
> original tests.  As suggested by Or Gerlitz I added IPv6 and IPv4
> multicast tests.  I also added both subnet manager restarts and
> manual shutdown/restart of individual ports at the switch in order to
> ensure that the ENETRESET path was properly tested.  I included
> testing, then a subnet manager restart, then a quiescent period for
> caches to expire, then restarting testing to make sure that arp and
> neighbor discovery work after the subnet manager restart.
>
> All in all, I have not been able to trip the multicast joins up any
> longer.
>
> Additionally, the original impetus for my first 8 patch set was that
> it was simply too easy to break the IPoIB subsystem with this simple
> loop:
>
> while true; do
>     ifconfig ib0 up
>     ifconfig ib0 down
> done
>
> Just to be safe, I made sure this problem did not resurface.
>
> v5: fix an oversight in mcast_restart_task that leaked mcast joins
>     fix a failure to flush the ipoib_workqueue on deregister that
>     meant we could end up running our code after our device had been
>     removed, resulting in an oops
>     remove a debug message that could be triggered so fast that the
>     kernel printk mechanism would starve out the mcast join task thread
>     resulting in what looked like a mcast failure that was really just
>     delayed action
>
>
> Doug Ledford (10):
>   IB/ipoib: fix IPOIB_MCAST_RUN flag usage
>   IB/ipoib: Add a helper to restart the multicast task
>   IB/ipoib: make delayed tasks not hold up everything
>   IB/ipoib: Handle -ENETRESET properly in our callback
>   IB/ipoib: don't restart our thread on ENETRESET
>   IB/ipoib: remove unneeded locks
>   IB/ipoib: fix race between mcast_dev_flush and mcast_join
>   IB/ipoib: fix ipoib_mcast_restart_task
>   IB/ipoib: flush the ipoib_workqueue on unregister
>   IB/ipoib: cleanup a couple debug messages
>
>  drivers/infiniband/ulp/ipoib/ipoib.h           |   1 +
>  drivers/infiniband/ulp/ipoib/ipoib_main.c      |   2 +
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 234 ++++++++++++++-------------
>  3 files changed, 131 insertions(+), 106 deletions(-)
>
FWIW, a couple different customers have tried a test kernel I built
internally with my patches and I've had multiple reports that all
previously observed issues have been resolved.

--
Doug Ledford <[email protected]>
              GPG KeyID: 0E572FDD
