On Fri, Nov 03, 2017 at 03:23:14PM +0000, Bart Van Assche wrote:
> On Fri, 2017-11-03 at 11:50 +0800, Ming Lei wrote:
> > On Fri, Nov 03, 2017 at 02:42:50AM +0000, Bart Van Assche wrote:
> > > On Fri, 2017-11-03 at 10:12 +0800, Ming Lei wrote:
> > > > [root@ibclient srp-test]# ./run_tests
> > > > modprobe: FATAL: Module target_core_mod is in use.
> > >
> > > LIO must be unloaded before srp-test software is started.
> >
> > Yeah, I can make this test kick off after running
> > 'modprobe -fr ib_isert' first, but looks all tests failed without
> > any hang, could you check if that is the expected result?
> >
> > https://pastebin.com/VXe66Jpg
>
> I see plenty of "... >/sys/.../add_target failed: Connection timed out"
> messages. The srp-test scripts use the loopback functionality of IB ports
> so one or more IB ports must be active. Have you already checked the IB
> ports state, e.g. by running ibv_devinfo?
[root@ibclient ~]# ibv_devinfo
hca_id: mlx5_1
transport: InfiniBand (0)
fw_ver: 12.20.1010
node_guid: 7cfe:9003:0072:6ed3
sys_image_guid: 7cfe:9003:0072:6ed2
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: MT_2190110032
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 1
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.20.1010
node_guid: 7cfe:9003:0072:6ed2
sys_image_guid: 7cfe:9003:0072:6ed2
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: MT_2190110032
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 4
port_lid: 4
port_lmc: 0x00
link_layer: InfiniBand
Forget to mention, there is failure when running 'make' under srp-test
because shellcheck package is missed in RHEL7. Can that be the issue
of test failure? If yes, could you provide a special version of srp-test
which doesn't depend on shellcheck?
>
> > Also could you provide some output from debugfs when this hang happens?
> > such as:
> >
> > dispatch/busy/.. under hctxN
>
> This is the output that appears on my test setup if the two patches from
> the patch series "block: remove unnecessary RESTART" are applied:
>
> # ~bart/software/infiniband/srp-test/run_tests -f xfs -d -e deadline -r 60
>
> Unloaded the ib_srpt kernel module
> Zero-initializing /dev/ram0 ... done
> Zero-initializing /dev/ram1 ... done
> Zero-initializing /dev/disk/by-id/scsi-33333333000004650 ... done
> brw-rw---- 1 root disk 1, 0 Nov 3 08:10 /dev/ram0
> /dev/ram0
> brw-rw---- 1 root disk 1, 1 Nov 3 08:10 /dev/ram1
> /dev/ram1
> lrwxrwxrwx 1 root root 9 Nov 3 08:11 /dev/disk/by-id/scsi-33333333000004650
> -> ../../sdi
> /dev/sdi
> Configured SRP target driver
> Running test /home/bart/software/infiniband/srp-test/tests/01 ...
> id_ext=0x0002c90300a02d50,ioc_guid=0x0002c90300a02d50,dgid=fe800000000000000002c90300a02d51,pkey=7fff,service_id=0x0002c90300a02d50,
> >/sys/class/infiniband_srp/
> srp-mlx4_0-2/add_target failed: Invalid argument
> id_ext=0x0002c90300a02d50,ioc_guid=0x0002c90300a02d50,dgid=fe800000000000000002c90300a02d52,pkey=7fff,service_id=0x0002c90300a02d50,
> >/sys/class/infiniband_srp/
> srp-mlx4_0-2/add_target failed: Invalid argument
> count_luns(): 3 <> 3
> Error: unloading kernel module ib_srp failed
> Test /home/bart/software/infiniband/srp-test/tests/01 failed
> # cat /sys/module/scsi_mod/parameters/use_blk_mq
> Y
> # find /sys/kernel/debug/block -name busy | xargs cat
Can you check 'dispatch' file, hctx state, tags and sched_tags?
And see where is the IO not dispatched?
> # dmesg -c >/dev/null; echo w > /proc/sysrq-trigger; dmesg
> [ 709.321750] sysrq: SysRq : Show Blocked State
> [ 709.327906] task PC stack pid father
...
> [ 709.456602] kworker/u66:6 D 0 341 2 0x80000000
> [ 709.463371] Workqueue: events_unbound async_run_entry_fn
> [ 709.469919] Call Trace:
> [ 709.473325] __schedule+0x2fa/0xbb0
> [ 709.477850] ? trace_hardirqs_on_caller+0xfb/0x190
> [ 709.483847] schedule+0x36/0x90
> [ 709.487982] schedule_timeout+0x22c/0x570
> [ 709.493115] ? call_timer_fn+0x330/0x330
> [ 709.498133] ? wait_for_completion_io_timeout+0xf7/0x180
> [ 709.504732] io_schedule_timeout+0x1e/0x50
> [ 709.509938] ? io_schedule_timeout+0x1e/0x50
> [ 709.515413] wait_for_completion_io_timeout+0x11f/0x180
> [ 709.521920] ? wake_up_q+0x80/0x80
> [ 709.526376] blk_execute_rq+0x86/0xc0
> [ 709.531135] scsi_execute+0xdb/0x1f0
> [ 709.536336] sd_revalidate_disk+0xed/0x1c70 [sd_mod]
> [ 709.543281] sd_probe_async+0xc3/0x1d0 [sd_mod]
> [ 709.549524] async_run_entry_fn+0x38/0x160
> [ 709.555503] process_one_work+0x20a/0x660
> [ 709.561299] worker_thread+0x3d/0x3b0
> [ 709.566688] kthread+0x13a/0x150
> [ 709.571613] ? process_one_work+0x660/0x660
> [ 709.577561] ? kthread_create_on_node+0x40/0x40
> [ 709.583910] ret_from_fork+0x27/0x40
Looks hang in scsi_execute().
> [ 709.589174] kworker/5:2 D 0 386 2 0x80000000
> [ 709.596616] Workqueue: kaluad alua_rtpg_work [scsi_dh_alua]
> [ 709.604130] Call Trace:
> [ 709.608096] __schedule+0x2fa/0xbb0
> [ 709.613265] ? trace_hardirqs_on_caller+0xfb/0x190
> [ 709.619879] schedule+0x36/0x90
> [ 709.624684] schedule_timeout+0x22c/0x570
> [ 709.630437] ? call_timer_fn+0x330/0x330
> [ 709.636059] ? wait_for_completion_io_timeout+0xf7/0x180
> [ 709.643270] io_schedule_timeout+0x1e/0x50
> [ 709.649083] ? io_schedule_timeout+0x1e/0x50
> [ 709.655159] wait_for_completion_io_timeout+0x11f/0x180
> [ 709.662284] ? wake_up_q+0x80/0x80
> [ 709.667334] blk_execute_rq+0x86/0xc0
> [ 709.672705] scsi_execute+0xdb/0x1f0
> [ 709.677938] alua_rtpg_work+0x2b3/0xe8a [scsi_dh_alua]
> [ 709.684983] process_one_work+0x20a/0x660
> [ 709.690759] worker_thread+0x3d/0x3b0
> [ 709.696107] kthread+0x13a/0x150
> [ 709.700986] ? process_one_work+0x660/0x660
> [ 709.706920] ? kthread_create_on_node+0x40/0x40
> [ 709.713316] ret_from_fork+0x27/0x40
Same with above.
--
Ming