On Sat, Jun 13, 2015 at 3:50 PM, Doug Ledford <[email protected]> wrote:
> On 06/13/2015 03:18 AM, Or Gerlitz wrote:
>> On Sat, Jun 13, 2015 at 8:35 AM, Doug Ledford <[email protected]> wrote:
>>> I ran across a problem today when I went to do some run tests of my
>>> for-4.2 tree. For a second there, I was about to seriously have a
>>> conniption fit. But, after about 6 hours of work bisecting and
>>> debugging, I've come to find that I wasn't so crazy after all.
>>>
>>> When I went to install my for-4.2 tree, IPoIB was totally busted, as in
>>> DOA. I knew the 4.1 code I submitted to Linus I had checked, but I
>>> wanted to have a good starting point for a bisection so I compiled a
>>> kernel from my for-4.1-rc branch. And it was DOA too. That seriously
>>> unnerved me because I knew I tested that code. I did a number of manual
>>> checkouts at possible suspicious code points, and none of them showed
>>> that the problem was resolved. Then I started doing some debugging on
>>> both the afflicted machine and on the opensm server. I finally saw that
>>> the afflicted machine was claiming that it was attempting to join the
>>> multicast group, but was reporting error 110 (ETIMEDOUT). The opensm
>>> server was not seeing the requests at all.
>>>
>>> Long story short, I did my testing in the 4.1 merge window and rc phase
>>> on machines without SRIOV enabled, but when you enable SRIOV in the mlx4
>>> driver, the current driver seems to have broken QP0/QP1 multiplexing
>>> support because the host becomes unable to join the IPoIB multicast
>>> groups. In addition, with SRIOV enabled, mlx4_en throws corruption
>>> errors on reboot and requires that the machine be power cycled as
>>> opposed to rebooting cleanly. From what I can tell, the 4.0 release
>>
>> Doug,
>>
>> So now my weekend mood is busted too, probably was a mistake to pick
>> into my gmail account (but I can't promise to never do it again) -- NM
>> I'll manage.
>
> Trust me, I understand :-(
>
>> AFAIK, our regression systems for upstream do run SRIOV tests, and as
>> I know very well, I've been working on upstream code with mlx4 SRIOV
>> over the last couple of weeks on daily manner (I've been using net and
>> net-next but it should be good coverage)... I wonder what fell between
>> the cracks here... let's see -- but **please** be more precise when
>> you report on breakage (here and elsewhere):
>
> Sorry, it was 1:30am my time when I wrote that. I wanted to alert you
> to the issue, but I was also tired and needed sleep.
>
>> 1. set ipoib debug flags (both of them) and do ifdown/ifup to the NIC
>> instance that fails joining groups (which? you didn't say if this is
>> total failure e.g of the IPv4 multicast group or of some other
>> group/s) and send the resulted log
>
> Total failure of multicast group. The logs are unhelpful. They are
> basically a bunch of "joining mgid blah" "failed to join mgid blah,
> error -110". And it's all instances that do this. No joins complete.

You mentioned mlx4_core options of "probe_vf=0 num_vfs=7
port_type_array=1,2", so you're in SRIOV mode. Do the join failures
happen on the PF IPoIB instance, or are you probing VF/s in a VM and
seeing the failure there?

>> 2. re "broken QP0/QP1 multiplexing support" that this means you see
>> address resolution failure with rdma-cm apps? indeed, no ping, no
>> rping... but this doesn't mean QP0/QP1 multiplexing broken. If you
>> have ping (ICMP) working, try
>>
>> $ rping -dvs
>> $ rping -dvca SERVER
>>
>> and send us the output
>
> This is unhelpful too. For any of the afflicted interfaces (all IPoIB
> interfaces), rping fails to resolve the address to the server. Other
> interfaces (RoCE) work fine.

This makes sense: no functioning IPoIB --> no address resolution.

>> 3. Re "with SRIOV enabled, mlx4_en throws corruption errors on reboot"
>> - show them to us
>>
>> To sum up: Send us some boring dmesg outputs from your problematic
>> setup that go beyond your analysis, as well as your "lspci | grep nox"
>> output + the dmesg of mlx4_core when loaded with debug_level=1 (I'd
>> like to see the FW version and various other outputs).
>
> It will take me a while to capture all you requested. When it goes
> down, it's 10 minutes per reboot cycle to collect more information.
> However, as I just learned, the mlx4_en problem happens with or without
> SRIOV. So, it's probably related to my vlan setup. I managed to catch
> some dmesg output. Here it is:
>
> [24891.205735] ------------[ cut here ]------------
> [24891.227362] WARNING: CPU: 1 PID: 36115 at lib/list_debug.c:59
> __list_del_entry+0xa1/0xd0()
> [24891.265535] list_del corruption.
> prev->next should be
> ffff8800bd845028, but was (null)
> [24891.308640] Modules linked in: sch_mqprio 8021q garp mrp bridge stp
> llc xprtrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi
> scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp
> rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm
> x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul
> crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ipmi_ssif lrw
> gf128mul glue_helper ablk_helper cryptd iTCO_wdt hpilo
> iTCO_vendor_support ipmi_si 8250_fintek lpc_ich microcode wmi
> ipmi_msghandler hpwdt pcspkr serio_raw ioatdma acpi_power_meter mfd_core
> sb_edac dca pcc_cpufreq edac_core shpchp acpi_cpufreq sch_fq_codel xfs
> libcrc32c mlx4_en(-) vxlan ip6_udp_tunnel udp_tunnel ib_sa ib_mad
> ib_core ib_addr ata_generic pata_acpi mgag200 syscopyarea sysfillrect
> sysimgblt i2c_algo_bit drm_kms_helper ttm sd_mod drm tg3 ata_piix ptp
> libata mlx4_core hpsa i2c_core pps_core dm_mirror dm_region_hash dm_log
> dm_mod [last unloaded: mlx4_ib]
> ** 3035 printk messages dropped ** [24891.640030] drm_kms_helper ttm
> sd_mod drm tg3 ata_piix ptp libata mlx4_core hpsa i2c_core pps_core
> dm_mirror dm_region_hash dm_log dm_mod [last unloaded: mlx4_ib]
> [24891.640031] CPU: 1 PID: 36115 Comm: rmmod Tainted: G W
> 4.0.0+ #53
> [24891.640032] Hardware name: HP ProLiant DL360p Gen8, BIOS P71 12/20/2013
> [24891.640033] 0000000000000000 0000000062127936 ffff880809743c28
> ffffffff816a3be5
> [24891.640035] 0000000000000000 ffff880809743c80 ffff880809743c68
> ffffffff8107bdaa
> [24891.640036] 0000000000000004 ffff8808041eee28 ffffffffc0138000
> ffff880809743dc0
> [24891.640036] Call Trace:
> [24891.640038] [<ffffffff816a3be5>] dump_stack+0x45/0x57
> [24891.640039] [<ffffffff8107bdaa>] warn_slowpath_common+0x8a/0xc0
> [24891.640041] [<ffffffff8107be35>] warn_slowpath_fmt+0x55/0x70
> [24891.640043] [<ffffffff8119faca>] ? kvfree+0x2a/0x40
> [24891.640044] [<ffffffff813353f1>] __list_del_entry+0xa1/0xd0
> [24891.640045] [<ffffffff81335431>] list_del+0x11/0x40
> [24891.640048] [<ffffffff815b4cf5>] qdisc_list_del+0x25/0x30
> [24891.640049] [<ffffffff815b2fa5>] qdisc_destroy+0x35/0xb0
> [24891.640050] [<ffffffff815b4150>] dev_shutdown+0x50/0xd0
> [24891.640052] [<ffffffff815894f0>] rollback_registered_many+0x160/0x300
> [24891.640053] [<ffffffff815896d0>] rollback_registered+0x40/0x70
> [24891.640055] [<ffffffff8158ac48>] unregister_netdevice_queue+0x48/0x80
> [24891.640056] [<ffffffff8158aca0>] unregister_netdev+0x20/0x30
> [24891.640059] [<ffffffffc01a21c8>] mlx4_en_destroy_netdev+0xf8/0x130
> [mlx4_en]
> [24891.640062] [<ffffffffc019311b>] mlx4_en_remove+0xfb/0x110 [mlx4_en]
> [24891.640067] [<ffffffffc00de638>] mlx4_remove_device+0x88/0xd0
> [mlx4_core]
> [24891.640073] [<ffffffffc00de6c3>] mlx4_unregister_interface+0x43/0x80
> [mlx4_core]
> [24891.640077] [<ffffffffc01a4b28>] mlx4_en_cleanup+0x10/0x12 [mlx4_en]
> [24891.640078] [<ffffffff811021ac>] SyS_delete_module+0x1ac/0x230
> [24891.640079] [<ffffffff81099c4c>] ? task_work_run+0xbc/0xf0
> [24891.640081] [<ffffffff816aaeee>] system_call_fastpath+0x12/0x71
> [24891.640081] ---[ end trace 66118cded2cc8f95 ]---
> [24891.640083] ------------[ cut here ]------------
> [24891.640084] WARNING: CPU: 1 PID: 36115 at lib/list_debug.c:59
> __list_del_entry+0xa1/0xd0()
> [24891.640084] list_del corruption. prev->next should be
> ffff8808041ef028, but was dead000000100100
> [24891.640103] Modules linked in: sch_mqprio 8021q garp mrp bridge stp
> llc xprtrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi
> scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp
> rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm
> x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul
> crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ipmi_ssif lrw
> gf128mul glue_helper ablk_helper cryptd iTCO_wdt hpilo
> iTCO_vendor_support ipmi_si 8250_fintek lpc_ich microcode wmi
> ipmi_msghandler hpwdt pcspkr serio_raw ioatdma acpi_power_meter mfd_core
> sb_edac dca pcc_cpufreq edac_core shpchp acpi_cpufreq sch_fq_codel xfs
> libcrc32c mlx4_en(-) vxlan ip6_udp_tunnel udp_tunnel ib_sa ib_mad
> ib_core ib_addr ata_generic pata_acpi mgag200 syscopyarea sysfillrect
> sysimgblt i2c_algo_bit drm_kms_helper ttm sd_mod drm tg3 ata_piix ptp
> libata mlx4_core hpsa i2c_core pps_core dm_mirror dm_region_hash dm_log
> dm_mod [last unloaded: mlx4_ib]
> [24891.640108] CPU: 1 PID: 36115 Comm: rmmod Tainted: G W
> 4.0.0+ #53
> [24891.640108] Hardware name: HP ProLiant DL360p Gen8, BIOS P71 12/20/2013
> [24891.640109] 0000000000000000 0000000062127936 ffff880809743c28
> ffffffff816a3be5
> [24891.640110] 0000000000000000 ffff880809743c80 ffff880809743c68
> ffffffff8107bdaa
> [24891.640111] 0000000000000004 ffff8808041ef028 ffffffffc0138000
> ffff880809743dc0
> [24891.640112] Call Trace:
> [24891.640114] [<ffffffff816a3be5>] dump_stack+0x45/0x57
> [24891.640116] [<ffffffff8107bdaa>] warn_slowpath_common+0x8a/0xc0
> [24891.640118] [<ffffffff8107be35>] warn_slowpath_fmt+0x55/0x70
> [24891.640119] [<ffffffff8119faca>] ? kvfree+0x2a/0x40
> [24891.640121] [<ffffffff813353f1>] __list_del_entry+0xa1/0xd0
> [24891.640121] [<ffffffff81335431>] list_del+0x11/0x40
> [24891.640123] [<ffffffff815b4cf5>] qdisc_list_del+0x25/0x30
> [24891.640124] [<ffffffff815b2fa5>] qdisc_destroy+0x35/0xb0
> [24891.640126] [<ffffffff815b4150>] dev_shutdown+0x50/0xd0
> [24891.640127] [<ffffffff815894f0>] rollback_registered_many+0x160/0x300
> [24891.640129] [<ffffffff815896d0>] rollback_registered+0x40/0x70
> [24891.640130] [<ffffffff8158ac48>] unregister_netdevice_queue+0x48/0x80
> [24891.640132] [<ffffffff8158aca0>] unregister_netdev+0x20/0x30
> [24891.640135] [<ffffffffc01a21c8>] mlx4_en_destroy_netdev+0xf8/0x130
> [mlx4_en]
> [24891.640138] [<ffffffffc019311b>] mlx4_en_remove+0xfb/0x110 [mlx4_en]
> [24891.640143] [<ffffffffc00de638>] mlx4_remove_device+0x88/0xd0
> [mlx4_core]
> [24891.640149] [<ffffffffc00de6c3>] mlx4_unregister_interface+0x43/0x80
> [mlx4_core]
> [24891.640153] [<ffffffffc01a4b28>] mlx4_en_cleanup+0x10/0x12 [mlx4_en]
> [24891.640154] [<ffffffff811021ac>] SyS_delete_module+0x1ac/0x230
> [24891.640155] [<ffffffff81099c4c>] ? task_work_run+0xbc/0xf0
> [24891.640156] [<ffffffff816aaeee>] system_call_fastpath+0x12/0x71
> [24891.640157] ---[ end trace 66118cded2cc8f96 ]---
> [24891.640159] ------------[ cut here ]------------
>
> You'll note in the very beginning of that is a comment from the kernel
> that some printk messages were lost. So, I can't guarantee how accurate
> the trace is.
>
> The setup I have that seems to reliably reproduce this is base eth
> device using dhcp/mtu9000, vlan1 w/dhcp, vlan2 w/dhcp, all RoCE capable
> interface. However, if I manually ifdown all of the interfaces before
> trying to remove the mlx4_en module, the call trace doesn't happen, so
> it's definitely related to the removal of multiple devices in the reboot
> handler (or on the command line if you rmmod mlx4_en when multiple
> interfaces are live on a single port).

OK, removal of multiple devices in the reboot sequence is a concrete
scenario which we will try out to see what's broken.

Or.

> After rebooting, I got a dmesg output (attached). The process to get
> this output was as follows:
>
> Boot with SRIOV disabled (everything works fine):
>
> Port 1 - IPoIB w/dhcp, connected mode, mtu 65520
> Port 1 - IPoIB pkey 0002 w/dhcp, connected mode, mtu 65520
> Port 1 - IPoIB pkey 0004 w/dhcp - mgid 0016, connected mode, mtu 65520
> Port 1 - IPoIB pkey 0006 w/dhcp, connected mode, mtu 65520
> Port 2 - RoCE w/dhcp, mtu 9000
> Port 2 - RoCE vlan 43 w/dhcp, mtu 9000
> Port 2 - RoCE vlan 45 w/dhcp, mtu 9000
>
> Manually down all IPoIB and RoCE interfaces using ifdown
> rmmod mlx4_* ib_ipoib
> enable SRIOV and debug messages
> modprobe mlx4_core
> modprobe ib_ipoib
> dmesg > dmesg.out
>
>>> kernel has this problem too, and it still exists at least as far as
>>> 4.1-rc7 + all of my queued up -next patches.
>>>
>>> From my /etc/modprobe.d/mlx4.conf file if you want to try and duplicate:
>>>
>>> options mlx4_core probe_vf=0 num_vfs=7 port_type_array=1,2
>>> options mlx4_en pfctx=0x28 pfcrx=0x28
>>
>> AFAIK pfctx/pfcrx are eight bits, not sure what's the 0x20 is for
>
> They are 8 bits. I have both priority 3 and priority 5 as no drop.
> Hence 0x28.

Sorry, my stupid mistake.

>>> And I'm guessing that your internal regression tests must not have a
>>> machine in IB/Eth SRIOV mode as a standard config. I would consider
>>> adding it to the mix. I have it myself, but only on a few machines and
>>> I don't always use them for initial testing.
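
For reference, the pfctx/pfcrx value discussed above can be checked with a few lines of Python. This is only a sketch, assuming (as the quoted reply implies) that bit N of the 8-bit mask marks priority N as no-drop. The helper name pfc_mask is hypothetical and is not part of any mlx4 tooling.

```python
def pfc_mask(priorities):
    """Build an 8-bit mask with one bit set per no-drop priority (0-7).

    Assumption: bit N of the mask corresponds to priority N.
    """
    mask = 0
    for prio in priorities:
        if not 0 <= prio <= 7:
            raise ValueError(f"priority out of range: {prio}")
        mask |= 1 << prio
    return mask


# Priorities 3 and 5, as in the quoted setup: (1 << 3) | (1 << 5)
print(hex(pfc_mask([3, 5])))  # 0x28
```

Under that assumption, 0x20 alone would cover only priority 5, and the extra 0x08 is priority 3.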
