Craig Prescott wrote:
Steve Wise wrote:
Craig Prescott wrote:
I patched cma_connect_iw() to create the SDP header as
cma_connect_ib() does. This gets us farther - examining the
SDP header on the listener side looks right now, and the
listener at least enters rdma_accept(), but iw_cm_accept()
fails due to cm_id->device->iwcm->accept(cm_id, iw_param)
returning -104.
104 == ECONNRESET, so the client side must have reset the connection.
Did this happen after 10 seconds? (There's a 10-second MPA negotiation
timeout in the Chelsio CM.) Also, a wire trace might be useful. If
the reset happens immediately, then you should look on the client
side and see why it reset the connection.
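As a quick sanity check on the errno mapping, a minimal Python sketch (not part of the original exchange) can turn a negative kernel-style return value into its symbolic name. The helper name errno_name is hypothetical, and errno numbering is platform-specific; 104 corresponds to ECONNRESET on Linux.

```python
import errno
import os


def errno_name(ret):
    """Map a kernel-style return code (e.g. -104) to its errno name.

    Hypothetical helper for illustration only; the numbering depends on
    the platform this script runs on.
    """
    code = -ret if ret < 0 else ret
    # errno.errorcode maps numeric values to symbolic names on this platform.
    return errno.errorcode.get(code, "UNKNOWN")


if __name__ == "__main__":
    ret = -104  # value returned by the iwcm accept() call in this thread
    print(ret, errno_name(ret), os.strerror(-ret))
```

Run on the Linux host where the failure was seen, this should report the same name as the errno headers for that kernel.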
The reset happens after 10 seconds.
Here is tcpdump output from the netperf client host (tebow1):
12:00:17.156120 arp who-has tebow2.hpc.ufl.edu tell tebow1.hpc.ufl.edu
12:00:17.156178 arp reply tebow2.hpc.ufl.edu is-at 00:07:43:05:11:8a (oui Unknown)
12:00:27.180401 IP tebow1.hpc.ufl.edu.41353 > tebow2.hpc.ufl.edu.12865: S 697245480:697245480(0) win 17920 <mss 8960,nop,wscale 9>
12:00:30.180571 IP tebow1.hpc.ufl.edu.41353 > tebow2.hpc.ufl.edu.12865: S 697245480:697245480(0) win 17920 <mss 8960,nop,wscale 9>
12:00:30.180616 IP tebow2.hpc.ufl.edu.12865 > tebow1.hpc.ufl.edu.41353: S 1878582380:1878582380(0) ack 697245481 win 65535 <mss 8960,nop,wscale 3>
12:00:30.180630 IP tebow1.hpc.ufl.edu.41353 > tebow2.hpc.ufl.edu.12865: . ack 1 win 35
12:00:30.255717 IP tebow1.hpc.ufl.edu.41353 > tebow2.hpc.ufl.edu.12865: P 1:257(256) ack 1 win 35
The above packet is the mpa-start with the SDP hello as private data, I
think.
12:00:30.255753 IP tebow2.hpc.ufl.edu.12865 > tebow1.hpc.ufl.edu.41353: . ack 257 win 32736
12:00:30.255763 IP tebow2.hpc.ufl.edu.12865 > tebow1.hpc.ufl.edu.41353: R 1:1(0) ack 257 win 0
And then nothing happens from the listening side, so the mpa-start reply
never comes out.
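One way to check the guess that the 256-byte segment is the mpa-start would be to decode its payload against the MPA request framing in RFC 5044 (16-byte key, one flags byte, one revision byte, a 2-byte big-endian private data length, then the private data). The following is only a sketch: the function name parse_mpa_request is hypothetical, and it assumes the TCP payload bytes have already been pulled out of a capture (for example one saved with tcpdump -w), which is not shown here.

```python
import struct

MPA_REQ_KEY = b"MPA ID Req Frame"  # 16-byte request key from RFC 5044
MPA_HDR_LEN = 20  # key (16) + flags (1) + revision (1) + PD_Length (2)


def parse_mpa_request(payload):
    """Decode an MPA request header; return None if the key does not match.

    Hypothetical helper for illustration; 'payload' is the raw TCP payload.
    """
    if len(payload) < MPA_HDR_LEN or payload[:16] != MPA_REQ_KEY:
        return None
    flags, rev, pd_len = struct.unpack("!BBH", payload[16:MPA_HDR_LEN])
    return {
        "markers": bool(flags & 0x80),  # M bit
        "crc": bool(flags & 0x40),      # C bit
        "reject": bool(flags & 0x20),   # R bit
        "revision": rev,
        "private_data_len": pd_len,
        # If the guess in this thread is right, the SDP hello would be here.
        "private_data": payload[MPA_HDR_LEN:MPA_HDR_LEN + pd_len],
    }


if __name__ == "__main__":
    # Synthetic frame for demonstration only, not data from the trace above.
    frame = MPA_REQ_KEY + struct.pack("!BBH", 0x40, 1, 4) + b"\x00\x01\x02\x03"
    print(parse_mpa_request(frame))
```

If the key matches, comparing private_data_len with the expected size of the SDP hello would show whether the header built in cma_connect_iw() made it onto the wire intact.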
On the netserver host (tebow2), we see only the initial arp.
The accept call above now also emits a couple of messages
in the listener's syslog:
Jan 9 21:53:54 tebow2 kernel: iwch_ev_dispatch - CQE Err qpid 0x20 opcode 14 status 0x6 type 1 wrid.hi 0x0 wrid.lo 0x80000000
Jan 9 21:53:54 tebow2 kernel: post_qp_event - AE qpid 0x20 opcode 14 status 0x6 type 1 wrid.hi 0x0 wrid.lo 0x80000000
This is an async event generated due to a failure processing an SQ WR,
I think. Opcodes and status codes for iw_cxgb3 are in cxio_wr.h.
Type 1 means it was an egress (SQ) failure, and status 0x6 is a
base/bounds violation, but opcode 14 seems incorrect: that's not a
valid T3 opcode.
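For comparing several of these syslog lines, the numeric fields can be pulled out mechanically. This is a rough sketch only: the function name parse_cqe_fields is hypothetical, it assumes values without a 0x prefix are decimal, and it does not translate the numbers into names (that mapping lives in cxio_wr.h and is not reproduced here).

```python
import re

# Field names exactly as they appear in the log lines quoted in this thread.
FIELD_RE = re.compile(
    r"(qpid|opcode|status|type|wrid\.hi|wrid\.lo)\s+(0x[0-9a-fA-F]+|\d+)"
)


def parse_cqe_fields(line):
    """Extract numeric fields from an iwch_ev_dispatch / post_qp_event line.

    Hypothetical helper; assumes unprefixed values are decimal.
    """
    # int(value, 0) accepts both "0x6" and "14".
    return {name: int(value, 0) for name, value in FIELD_RE.findall(line)}


if __name__ == "__main__":
    line = ("Jan 9 21:53:54 tebow2 kernel: iwch_ev_dispatch - CQE Err "
            "qpid 0x20 opcode 14 status 0x6 type 1 wrid.hi 0x0 "
            "wrid.lo 0x80000000")
    print(parse_cqe_fields(line))
```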
Ok, thanks! I guess I'm not sure what to make of that yet, though.
See where in iwch_accept_cr() the failure is happening. It doesn't look
like send_mpa_reply() is being called.