> On Jun 14, 2015, at 10:31 AM, Or Gerlitz <[email protected]> wrote:
> 
> On Sat, Jun 13, 2015 at 8:35 AM, Doug Ledford <[email protected]> wrote:
>> I ran across a problem today when I went to do some run tests of my
>> for-4.2 tree.  For a second there, I was about to seriously have a
>> conniption fit.  But, after about 6 hours of work bisecting and
>> debugging, I've come to find that I wasn't so crazy after all.
>> 
>> When I went to install my for-4.2 tree, IPoIB was totally busted, as in
>> DOA.  I knew the 4.1 code I submitted to Linus I had checked, but I
>> wanted to have a good starting point for a bisection so I compiled a
>> kernel from my for-4.1-rc branch.  And it was DOA too.  That seriously
>> unnerved me because I knew I tested that code.  I did a number of manual
>> checkouts at possible suspicious code points, and none of them showed
>> that the problem was resolved.  Then I started doing some debugging on
>> both the afflicted machine and on the opensm server.  I finally saw that
>> the afflicted machine was claiming that it was attempting to join the
>> multicast group, but was reporting error 110 (ETIMEDOUT).  The opensm
>> server was not seeing the requests at all.
>> 
>> Long story short, I did my testing in the 4.1 merge window and rc phase
>> on machines without SRIOV enabled, but when you enable SRIOV in the mlx4
>> driver, the current driver seems to have broken QP0/QP1 multiplexing
>> support because the host becomes unable to join the IPoIB multicast
>> groups.  In addition, with SRIOV enabled, mlx4_en throws corruption
>> errors on reboot and requires that the machine be power cycled as
>> opposed to rebooting cleanly.  From what I can tell, the 4.0 release
>> kernel has this problem too, and it still exists at least as far as
>> 4.1-rc7 + all of my queued up -next patches.
>> 
>> From my /etc/modprobe.d/mlx4.conf file if you want to try and duplicate:
>> 
>> options mlx4_core probe_vf=0 num_vfs=7 port_type_array=1,2
> 
> Doug,
> 
> You were 100% right: due to a recent FW bug, SRIOV QP0/QP1 PV is broken
> with a VPI config of IB/Eth (port_type_array=1,2). Personally, I didn't
> step on it, since I moved my working environment to Eth/IB (2,1) a
> couple of weeks ago. Oh well.
> 
> The fix is easy: disable Granular VF QoS in that VPI config. I tested
> it and have just sent it to net [1].
> 
> We should check how come the upstream regression environment didn't
> catch that.

Did this fix the mlx4_en shutdown issue too, or is there another patch needed 
for that?

> Or.
> 
> [1] http://patchwork.ozlabs.org/patch/483991/
> 
>> options mlx4_en pfctx=0x28 pfcrx=0x28
>> 
>> And I'm guessing that your internal regression tests must not have a
>> machine in IB/Eth SRIOV mode as a standard config.  I would consider
>> adding it to the mix.  I have it myself, but only on a few machines and
>> I don't always use them for initial testing.

—
Doug Ledford <[email protected]>
        GPG Key ID: 0E572FDD




