Craig Prescott wrote:
Steve Wise wrote:

First make sure the SDP kernel module uses the RDMA CMA. Then I'd add printk hooks in cma.c, addr.c, and iwcm.c to see what's going on and where things are failing. A wire trace is also good if we're getting that far (i.e. at least doing ARP resolution).


Small update - a little progress.  printk's sprinkled liberally and
ib_sdp debug options turned on.  The initial problem was on the
listener during an IW_CM_EVENT_CONNECT_REQUEST event; the SDP hello
header was rejected in sdp_cma.c:sdp_connect_handler() because its
max_adverts field was zero, which is not permissible.  In fact, all
of the sdp_hh fields were zero.
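
To illustrate why an all-zero hello header gets rejected, here is a minimal userspace sketch of that kind of check. It is not the code in sdp_cma.c: `struct hello_sketch`, `check_hello()`, the `local_rcv_size` field, and the choice of `-EINVAL` are all assumptions made for illustration; only the `max_adverts` name comes from the description above.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, reduced stand-in for the SDP hello header. The real
 * struct sdp_hh has more fields and a different layout. */
struct hello_sketch {
    uint8_t  max_adverts;     /* field name taken from the thread */
    uint32_t local_rcv_size;  /* illustrative only */
};

/* Validate the private data that arrives with a connect request.
 * Returns 0 if acceptable, -EINVAL otherwise (error code is assumed). */
int check_hello(const void *private_data, size_t len)
{
    struct hello_sketch hh;

    if (private_data == NULL || len < sizeof(hh))
        return -EINVAL;

    /* Copy out to avoid alignment assumptions about the buffer. */
    memcpy(&hh, private_data, sizeof(hh));

    /* A peer advertising zero buffers cannot make progress, so an
     * all-zero header (as seen when nobody filled it in) fails here. */
    if (hh.max_adverts == 0)
        return -EINVAL;

    return 0;
}
```

A header that was never formatted on the connecting side is all zeros, so it fails the `max_adverts` test regardless of what the other fields would have held.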

Comparing with the RDMA_TRANSPORT_IB case, I saw that cma.c:cma_connect_ib() does some work to create the SDP header
via cma_format_hdr().  But cma_connect_iw() did not.

Why is this SDP protocol stuff done in the CMA?? That seems like a layer violation...

I patched cma_connect_iw() to create the SDP header as
cma_connect_ib() does.  This gets us farther - examining the
SDP header on the listener side looks right now, and the
listener at least enters rdma_accept(), but iw_cm_accept()
fails due to cm_id->device->iwcm->accept(cm_id, iw_param)
returning -104.

104 == ECONNRESET, so the client side must have reset the connection. Did this happen after 10 seconds? (There's a 10 second MPA negotiation timeout in the Chelsio CM.) Also, a wire trace might be useful. If this reset happens immediately, then you should look on the client side and see why it reset the connection.
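
For turning a raw return value like -104 into a name while reading printk output, a small userspace helper along these lines can help. This is only a sketch: `errname()` is a hypothetical helper, it covers just a few codes, and the numeric values are those of Linux's <errno.h>.

```c
#include <errno.h>
#include <string.h>

/* Map a kernel-style negative return code (or its positive errno)
 * to a symbolic name. Only a handful of codes are listed. */
const char *errname(int ret)
{
    int e = ret < 0 ? -ret : ret;

    switch (e) {
    case ECONNRESET:   return "ECONNRESET";
    case ECONNREFUSED: return "ECONNREFUSED";
    case ETIMEDOUT:    return "ETIMEDOUT";
    case EINVAL:       return "EINVAL";
    default:           return "unknown";
    }
}
```

Calling `errname(-104)` on Linux yields "ECONNRESET"; `strerror(104)` from <string.h> gives the human-readable message for the same code.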

The accept() call above also emits a couple of messages
into the listener's syslog now:

Jan 9 21:53:54 tebow2 kernel: iwch_ev_dispatch - CQE Err qpid 0x20 opcode 14 status 0x6 type 1 wrid.hi 0x0 wrid.lo 0x80000000
Jan 9 21:53:54 tebow2 kernel: post_qp_event - AE qpid 0x20 opcode 14 status 0x6 type 1 wrid.hi 0x0 wrid.lo 0x80000000

This is an async event generated due to a failure processing an SQ WR, I think. Opcodes and status codes for iw_cxgb3 are in cxio_wr.h.
Type 1 means it was an egress (SQ) failure,
status 0x6 is a base/bounds violation,
but opcode 14 seems incorrect. That's not a valid T3 opcode. ????
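
When there are many of these lines to go through, the fields can be pulled out mechanically. The sketch below is a userspace helper, not driver code: `struct cqe_fields`, `parse_cqe_line()`, `type_str()`, and `status_str()` are hypothetical names, and it assumes the opcode and type are printed in decimal (no 0x prefix in the log) while qpid and status are hex. It only knows the two meanings stated above; everything else should be looked up in cxio_wr.h.

```c
#include <stdio.h>
#include <string.h>

struct cqe_fields {
    unsigned int qpid, opcode, status, type;
};

/* Extract qpid/opcode/status/type from an iw_cxgb3 "CQE Err" or "AE"
 * syslog line. Returns 0 on success, -1 if the fields are missing. */
int parse_cqe_line(const char *line, struct cqe_fields *out)
{
    const char *p;

    if (line == NULL || out == NULL)
        return -1;
    p = strstr(line, "qpid ");
    if (p == NULL)
        return -1;
    /* %x accepts an optional 0x prefix; opcode/type assumed decimal. */
    if (sscanf(p, "qpid %x opcode %u status %x type %u",
               &out->qpid, &out->opcode, &out->status, &out->type) != 4)
        return -1;
    return 0;
}

/* Only the value discussed in the thread is decoded. */
const char *type_str(unsigned int type)
{
    return type == 1 ? "egress (SQ) failure" : "see cxio_wr.h";
}

const char *status_str(unsigned int status)
{
    return status == 0x6 ? "base/bounds violation" : "see cxio_wr.h";
}
```

Feeding it the first syslog line above gives qpid 0x20, opcode 14, status 0x6, and type 1, which `type_str()` and `status_str()` render as an egress (SQ) failure with a base/bounds violation.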


In the end, we still end up in rdma_reject().  Will keep digging.

Cheers,
Craig