On Mon, 2015-01-26 at 15:24 +0200, Erez Shitrit wrote:
> On 1/26/2015 2:51 PM, Doug Ledford wrote:
> > On Mon, 2015-01-26 at 12:27 +0200, Erez Shitrit wrote:
> >
> >> New (and full) dmesg attached, (after modprobe ib_ipoib, with all debug
> >> flags set) it is all there.
> > Thank you, I know what's going on here now.  Will correct shortly.
>
> welcome -:)

I munged my opensm configuration so that I could forcibly replicate the
situation here (I intentionally took several well-known multicast groups
and forbade their creation).  I was able to first replicate Erez's
problem.  Then I installed a new ib_ipoib module with my proposed fix
for Erez's problem and it worked exactly as expected.

It was a mistake in one of my earlier patches (the third in the series).
When I added a delayed queue for the multicast task thread, I didn't
have a separate work struct and instead tried to queue the same work
struct twice.  I reworked it so that the work struct is only ever queued
once, and if the multicast task gets to the end of its run and there are
still delayed entries waiting, it will queue itself to run again when
the shortest delay has expired.  I'll send that through.

Here's the log of the attempt:

[root@rdma-master linus (firewall/for-rc)]$ dmesg | tail -10
[337072.429488] mlx4_ib0: successfully joined all multicast groups
[337073.856932] mlx4_ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0002, starting sendonly join
[337073.869686] mlx4_ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0002, status -22
[337073.882754] mlx4_ib0: successfully joined all multicast groups
[337088.480082] mlx4_ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[337088.492789] mlx4_ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[337088.505819] mlx4_ib0: successfully joined all multicast groups
[337089.897041] mlx4_ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0002, starting sendonly join
[337089.909870] mlx4_ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0002, status -22
[337089.922893] mlx4_ib0: successfully joined all multicast groups
[root@rdma-master linus (firewall/for-rc)]$ ping6 -I mlx4_ib0 fe80::211:7500:77:d3cc
PING fe80::211:7500:77:d3cc(fe80::211:7500:77:d3cc) from fe80::f652:1403:7b:cba1 mlx4_ib0: 56 data bytes
64 bytes from fe80::211:7500:77:d3cc: icmp_seq=1 ttl=64 time=77.6 ms
64 bytes from fe80::211:7500:77:d3cc: icmp_seq=2 ttl=64 time=0.159 ms
64 bytes from fe80::211:7500:77:d3cc: icmp_seq=3 ttl=64 time=0.125 ms
64 bytes from fe80::211:7500:77:d3cc: icmp_seq=4 ttl=64 time=0.128 ms
^C
--- fe80::211:7500:77:d3cc ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3001ms
rtt min/avg/max/mdev = 0.125/19.503/77.600/33.542 ms
[root@rdma-master linus (firewall/for-rc)]$ dmesg | tail -10
[337120.632427] mlx4_ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
[337120.645166] mlx4_ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
[337120.658292] mlx4_ib0: successfully joined all multicast groups
[337121.977733] mlx4_ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0002, starting sendonly join
[337121.990478] mlx4_ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0002, status -22
[337122.003589] mlx4_ib0: successfully joined all multicast groups
[337130.410559] mlx4_ib0: setting up send only multicast group for ff12:601b:ffff:0000:0000:0001:ff77:d3cc
[337130.423203] mlx4_ib0: no multicast record for ff12:601b:ffff:0000:0000:0001:ff77:d3cc, starting sendonly join
[337130.436327] mlx4_ib0: MGID ff12:601b:ffff:0000:0000:0001:ff77:d3cc AV ffff882027235f00, LID 0xc01e, SL 0
[337130.448970] mlx4_ib0: successfully joined all multicast groups
[root@rdma-master linus (firewall/for-rc)]$

-- 
Doug Ledford <[email protected]>
              GPG KeyID: 0E572FDD
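
For illustration only, here is a small userspace C sketch of the requeue
scheme described above: a single work item that is never queued twice,
and that re-arms itself for the shortest remaining delay when it finishes
a run with entries still waiting.  This is not the ib_ipoib patch.  Every
name in it (struct group, struct task, queue_task, run_task, simulate) is
hypothetical, time is modelled as plain integer ticks, and a "join" is
assumed to succeed as soon as its retry time has passed.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a multicast group waiting to be (re)joined. */
struct group {
    unsigned long retry_at;  /* earliest tick at which a join may be tried */
    bool joined;
};

/* Hypothetical stand-in for the single work item.  It is either pending
 * with exactly one expiry time, or idle; it cannot be queued twice. */
struct task {
    bool pending;
    unsigned long run_at;
    int times_queued;
};

/* Queue the task unless it is already pending.  Returns true if queued. */
static bool queue_task(struct task *t, unsigned long when)
{
    if (t->pending)
        return false;
    t->pending = true;
    t->run_at = when;
    t->times_queued++;
    return true;
}

/* One run of the task: join every group that is due, then, if any group
 * is still waiting, re-arm the task once for the shortest remaining delay. */
static void run_task(struct task *t, struct group *g, int n, unsigned long now)
{
    unsigned long next = 0;
    bool waiting = false;

    t->pending = false;
    for (int i = 0; i < n; i++) {
        if (g[i].joined)
            continue;
        if (g[i].retry_at <= now) {
            g[i].joined = true;          /* assumed: a due join succeeds */
        } else if (!waiting || g[i].retry_at < next) {
            next = g[i].retry_at;        /* track the earliest retry time */
            waiting = true;
        }
    }
    if (waiting)
        queue_task(t, next);
}

/* Drive the task until nothing is pending; returns the number of runs. */
static int simulate(struct group *g, int n, struct task *t)
{
    int runs = 0;

    queue_task(t, 0);
    while (t->pending) {
        run_task(t, g, n, t->run_at);    /* jump straight to the expiry */
        runs++;
    }
    return runs;
}
```

Calling simulate() on three groups with retry times 0, 5 and 2 runs the
task at ticks 0, 2 and 5, and the task is never queued while already
pending.  In the kernel the same idea would be built on the workqueue
API rather than on a flag like the one above.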
