I ran across a problem today when I went to run some tests on my for-4.2 tree. For a second there, I was seriously about to have a conniption fit. But after about 6 hours of bisecting and debugging, I've found that I wasn't so crazy after all.

When I went to install my for-4.2 tree, IPoIB was totally busted, as in DOA. I knew I had checked the 4.1 code I submitted to Linus, but I wanted a good starting point for a bisection, so I compiled a kernel from my for-4.1-rc branch. And it was DOA too. That seriously unnerved me because I knew I had tested that code. I did a number of manual checkouts at likely suspect points in the history, and none of them resolved the problem.

Then I started debugging on both the afflicted machine and the opensm server. I finally saw that the afflicted machine claimed it was attempting to join the multicast group, but was reporting error 110 (ETIMEDOUT). The opensm server was not seeing the requests at all.

Long story short, I did my testing during the 4.1 merge window and rc phase on machines without SRIOV enabled. When you enable SRIOV in the mlx4 driver, the current driver seems to have broken QP0/QP1 multiplexing support, because the host becomes unable to join the IPoIB multicast groups. In addition, with SRIOV enabled, mlx4_en throws corruption errors on reboot and requires that the machine be power cycled instead of rebooting cleanly.

From what I can tell, the 4.0 release kernel has this problem too, and it still exists at least as far as 4.1-rc7 + all of my queued up -next patches.

From my /etc/modprobe.d/mlx4.conf file, if you want to try to duplicate this:

    options mlx4_core probe_vf=0 num_vfs=7 port_type_array=1,2
    options mlx4_en pfctx=0x28 pfcrx=0x28

I'm guessing that your internal regression tests must not have a machine in IB/Eth SRIOV mode as a standard config. I would consider adding it to the mix. I have it myself, but only on a few machines, and I don't always use them for initial testing.

-- 
Doug Ledford <[email protected]>
              GPG KeyID: 0E572FDD
