Copying ofiwg and the verbs maintainer.
> My name is Stefan Oesterreich and I am the Systems Administrator of
> the UNH-IOL OFA cluster. The OFIWG would like to include running
> fabtest as part of our OFED and vendor device/firmware validation
> testing. I have very limited knowledge of fabtest, so I am looking
> for some guidance on a comprehensive test command. We test
> Infiniband, iWARP, and RoCE, and we are looking to test the verbs
> provider. The command I have thus far is as follows:
>
> runfabtests.sh -t all -g $server_transport_ip_addr -s
> $server_transport_hostname -c $client_transport_hostname verbs
> $server_mgmt_hostname $client_mgmt_hostname
>
>
> Here is a filled in example:
> runfabtests.sh -t all -g 10.1.0.3 -s titan-ib.ofa -c phoebe-ib.ofa
> verbs titan.ofa phoebe.ofa
>
>
> When I run the above command on one of my Infiniband nodes I get the
> following output:
>
> # Test Result
> # --------------------------------------------------------------
> fi_getinfo_test -p "verbs": Pass
> fi_av_test -g 10.1.0.3 -n 1 -s titan-ib.ofa -p "verbs": Pass
> fi_dom_test -n 2 -p "verbs": Pass
> fi_eq_test -p "verbs": Pass
> fi_cq_test -p "verbs": Pass
> fi_mr_test -p "verbs": Pass
> fi_cntr_test -p "verbs": Pass
> fi_dgram g00n13s -p "verbs": Pass
> fi_rdm g00n13s -p "verbs": Pass
> fi_msg g00n13s -p "verbs": Pass
> fi_cm_data -p "verbs": Pass
> fi_cq_data -p "verbs": Fail
> fi_dgram -p "verbs": Notrun
> fi_dgram_waitset -p "verbs": Notrun
> fi_msg -p "verbs": Pass
> fi_msg_epoll -p "verbs": Pass
> fi_msg_sockets -p "verbs": Pass
> fi_poll -t queue -p "verbs": Notrun
> fi_poll -t counter -p "verbs": Notrun
> fi_rdm -p "verbs": Pass
> fi_rdm_rma_simple -p "verbs": Notrun
> fi_rdm_rma_trigger -p "verbs": Notrun
> fi_shared_ctx -p "verbs": Notrun
> fi_shared_ctx --no-tx-shared-ctx -p "verbs": Notrun
> fi_shared_ctx --no-rx-shared-ctx -p "verbs": Notrun
> fi_shared_ctx -e msg -p "verbs": Notrun
> fi_shared_ctx -e msg --no-tx-shared-ctx -p "verbs": Pass
> fi_shared_ctx -e msg --no-rx-shared-ctx -p "verbs": Notrun
> fi_shared_ctx -e dgram -p "verbs": Notrun
> fi_shared_ctx -e dgram --no-tx-shared-ctx -p "verbs": Notrun
> fi_shared_ctx -e dgram --no-rx-shared-ctx -p "verbs": Notrun
> fi_rdm_tagged_peek -p "verbs": Pass
> fi_scalable_ep -p "verbs": Notrun
> fi_cmatose -p "verbs": Pass
> fi_rdm_shared_av -p "verbs": Notrun
> fi_multi_mr -e msg -V -p "verbs": Notrun
> fi_multi_mr -e rdm -V -p "verbs": Notrun
> fi_recv_cancel -e rdm -V -p "verbs": Notrun
> fi_unexpected_msg -e msg -i 10 -p "verbs": Notrun
> fi_unexpected_msg -e rdm -i 10 -p "verbs": Notrun
> fi_unexpected_msg -e dgram -i 10 -p "verbs": Notrun
> fi_unexpected_msg -e msg -S -i 10 -p "verbs": Notrun
> fi_unexpected_msg -e rdm -S -i 10 -p "verbs": Notrun
> fi_unexpected_msg -e dgram -S -i 10 -p "verbs": Notrun
> fi_msg_pingpong -p "verbs": Pass
> fi_msg_pingpong -v -p "verbs": Pass
> fi_msg_pingpong -k -p "verbs": Notrun
> fi_msg_pingpong -k -v -p "verbs": Notrun
> fi_msg_bw -p "verbs": Pass
> fi_msg_bw -v -p "verbs": Pass
> fi_rma_bw -e msg -o write -p "verbs": Pass
> fi_rma_bw -e msg -o read -p "verbs": Pass
> fi_rma_bw -e msg -o writedata -p "verbs": Pass
> fi_rma_bw -e rdm -o write -p "verbs": Pass
> fi_rma_bw -e rdm -o read -p "verbs": Pass
> fi_rma_bw -e rdm -o writedata -p "verbs": Fail
> fi_msg_rma -o write -p "verbs": Pass
> fi_msg_rma -o read -p "verbs": Pass
> fi_msg_rma -o writedata -p "verbs": Pass
> fi_msg_stream -p "verbs": Pass
> fi_rdm_atomic -o all -I 1000 -p "verbs": Notrun
> fi_rdm_cntr_pingpong -p "verbs": Notrun
> fi_rdm_multi_recv -p "verbs": Fail
> fi_rdm_pingpong -p "verbs": Pass
> fi_rdm_pingpong -v -p "verbs": Pass
> fi_rdm_pingpong -k -p "verbs": Notrun
> fi_rdm_pingpong -k -v -p "verbs": Notrun
> fi_rdm_rma -o write -p "verbs": Fail
> fi_rdm_rma -o read -p "verbs": Fail
> fi_rdm_rma -o writedata -p "verbs": Fail
> fi_rdm_tagged_pingpong -p "verbs": Pass
> fi_rdm_tagged_pingpong -v -p "verbs": Pass
> fi_rdm_tagged_bw -p "verbs": Pass
> fi_rdm_tagged_bw -v -p "verbs": Pass
> fi_dgram_pingpong -p "verbs": Notrun
> fi_dgram_pingpong -k -p "verbs": Notrun
> fi_rc_pingpong -p "verbs": Pass
> fi_ubertest: Server returns
> 124, client returns 124
> fi_ubertest: Fail [/]
> # --------------------------------------------------------------
> # Total Pass 38
> # Total Notrun 33
> # Total Fail 7
> # Percentage of Pass 84
> # --------------------------------------------------------------
>
>
>
> My questions are:
>
>
> * Is the above command comprehensive enough for all 3 transports
> (IB, IW, RoCE)?
All transports should be testable using the same configuration.
> * What test mode should I be using
> (all,quick,unit,simple,standard,short,complex)? This is the first
> time running through this testing, so I don't know if "all" is
> appropriate here. Time is also a consideration here, It seems to
> take about 13 minutes to complete one server-client pair, and we
> have 6 nodes, so there are quite a few permutations.
Using 'all' instead of 'quick' adds fi_ubertest. That test is fairly
comprehensive: it is capable of testing thousands of permutations and can take
a really long time to run. If time is a concern, I would use the quick option,
which is the default.
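To put the time concern in numbers, here is a rough sketch that enumerates
every ordered server/client pair and multiplies by the per-pair run time. The
node names n1..n6 are placeholders, and 13 minutes is the figure quoted above
for a '-t all' run; a quick run should take less.

```shell
#!/bin/sh
# Print every ordered (server, client) pair from a list of node names.
ordered_pairs() {
    for server in "$@"; do
        for client in "$@"; do
            [ "$server" != "$client" ] && echo "$server $client"
        done
    done
    return 0
}

# Estimate total minutes for running all pairs back to back.
# $1 = minutes per pair, remaining arguments = node names.
estimate_minutes() {
    per_pair=$1
    shift
    count=$(ordered_pairs "$@" | wc -l)
    echo $(( count * per_pair ))
}

# Hypothetical node names; 13 minutes per pair as reported above.
estimate_minutes 13 n1 n2 n3 n4 n5 n6
```

With 6 nodes there are 30 ordered pairs, so this prints 390 (minutes). If each
pair only needs to be tested in one direction, the count halves to 15 pairs.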
You can also speed up testing by providing an 'exclude' file. This will allow
skipping the Notrun tests, which add a couple of seconds per test. See e.g.
test_configs/verbs/verbs.exclude.
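As a starting point for a site-specific exclude list, the sketch below reads
saved result lines and prints the test binaries for which every variant
reported Notrun. It assumes result lines have the "command: Status" shape shown
in the output above. How the script matches exclude patterns against test names
is not covered here, so review the list by hand before using it.

```shell
#!/bin/sh
# Read result lines on stdin and print the test binaries whose every
# variant reported Notrun (never Pass or Fail), one per line, sorted.
notrun_only() {
    awk '
        $NF == "Pass" || $NF == "Fail" { ran[$1] = 1 }
        $NF == "Notrun"                { skipped[$1] = 1 }
        END {
            for (t in skipped)
                if (!(t in ran))
                    print t
        }
    ' | sort
}

# Sample lines taken from the results table above.
notrun_only <<'EOF'
fi_cq_test -p "verbs": Pass
fi_poll -t queue -p "verbs": Notrun
fi_poll -t counter -p "verbs": Notrun
fi_msg_pingpong -p "verbs": Pass
fi_msg_pingpong -k -p "verbs": Notrun
fi_scalable_ep -p "verbs": Notrun
EOF
```

For the sample input this prints fi_poll and fi_scalable_ep. fi_msg_pingpong is
left out because one of its variants passed, so excluding it by name would also
drop a test that runs.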
> * What makes a test result "Notrun" vs "Fail"? When I use -vv to
> see output, I am seeing a lot of "fi_getinfo(): common/shared.c:540,
> ret=-61 (No data available)" and "fi_poll_open(): simple/poll.c:55,
> ret=-38 (Function not implemented)", is this normal?
Notrun indicates that the selected provider does not support the options
required by the test. For example, the verbs provider does not implement
counters, so any test that uses counters will not work. There are specific
failure codes that the script checks for in these cases, and it reports
'Notrun' when it detects them.
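A minimal sketch of that classification follows. It is not the script's actual
code. It assumes the "not supported" codes are 61 (ENODATA) and 38 (ENOSYS),
which match the ret=-61 and ret=-38 values in the -vv output quoted above, and
that 124 is the exit status of a test killed by the timeout, as in the
fi_ubertest line of the results.

```shell
#!/bin/sh
# Map a test's exit status to the label a results table would show.
# Assumption: 61 (ENODATA) and 38 (ENOSYS) mean "provider lacks the feature".
classify_status() {
    case "$1" in
        0)     echo "Pass" ;;
        61|38) echo "Notrun" ;;
        124)   echo "Fail (timed out)" ;;
        *)     echo "Fail" ;;
    esac
}

for rc in 0 61 38 124 1; do
    printf '%3d -> %s\n' "$rc" "$(classify_status "$rc")"
done
```

Under these assumptions the fi_getinfo() and fi_poll_open() messages are the
normal Notrun cases, while any other non-zero status counts as a failure.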
Fail indicates that the test detected some other sort of error from the
provider that was not expected. From the output above:
fi_cq_data -p "verbs": Fail
This is a test I would expect to pass. With failures, it's usually best to
re-run the test that failed directly and see if we can get more information as
to why a failure occurred.
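One way to prepare those direct re-runs is sketched below: it pulls the Fail
lines out of a saved results table and prints a server and a client command for
each. It assumes the usual fabtests convention that '-s' gives the local source
address and that the client takes the server address as its last argument.
Check a test's help output before relying on this.

```shell
#!/bin/sh
# Print the server and client commands for each test reported as Fail.
# $1 = server address, $2 = client address; result lines come from stdin.
rerun_commands() {
    server=$1
    client=$2
    sed -n 's/: Fail$//p' | while IFS= read -r cmd; do
        echo "server: $cmd -s $server"
        echo "client: $cmd -s $client $server"
    done
}

# Addresses from the filled-in example above; sample lines from the results.
rerun_commands titan-ib.ofa phoebe-ib.ofa <<'EOF'
fi_cq_test -p "verbs": Pass
fi_cq_data -p "verbs": Fail
fi_rdm_multi_recv -p "verbs": Fail
EOF
```

Start the server command on the server node first, then the client command on
the client node, and compare the output of both sides.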
> * I am also seeing a lot of "Killed by signal 15", which I
> believe means that the timeout was hit and the run was killed.
> Should I be increasing my timeout? I would expect the default
> timeout to be good enough, but I am unsure.
The default timeout should be sufficient in most cases.
> * As you can see from the output above, there are a few fails.
> Does this indicate a bug in fabtests or OFED/vendors drivers or
> simply that I am not running the correct fabtest command?
Yes. :) Any failures need to be investigated. For verbs, we focus testing on
the 'msg' endpoints. I would not expect to see any failures there. The
'dgram' endpoint support is limited in its implementation. 'Rdm' endpoints are
being removed in favor of using the 'rxm' provider over verbs. Using the
exclude file may remove these from testing.
- Sean