Hi Steve;

The SDP socket gets an associated MR when sdp_init_qp() calls
ib_get_dma_mr().  It looks to me like this drills down into
the provider layer, which will ultimately end up calling
build_phys_page_list() from iwch_register_phys_mem().

Unfortunately, when I try to look at the ib_mr_attrs via
ib_query_mr(), the call fails.

When sdp_post_recv() calls ib_post_recv(), it looks to me
like a DMA mapping has been set up between the SDP private
receive buffers and the card.  The receive buffers are kmalloc'd
in sdp_init_qp().

I hope I have this right.  But it sounds like it is possible
I am hitting both issues you describe.

I guess one way to check is to drop my test nodes down to 4GB
or less, right?  They currently have 16GB.

Thanks again,
Craig

Steve Wise wrote:
Are these recv buffers user memory or kernel memory? I just submitted a fix for a bug in build_phys_page_list(). Perhaps you're hitting this? It would hit it if these are buffers allocated by the sdp kernel module and registered via ib_reg_phys_mr().

Also: If sdp is using ib_get_dma_mr() to access all of memory, then it won't work with the chelsio driver, which has a 4GB limit on MRs. So cxgb3 creates dma_mrs that map only addresses 0..4GB-1. This just doesn't work at all if there is an iommu mapping bus addresses above 4GB.
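
The 4GB limit described above comes down to a range check. The following is a minimal sketch in userspace C, not code from iw_cxgb3; the helper name range_within_4gb() and the DMA_MR_LIMIT macro are hypothetical. It reports whether a buffer of a given length lies entirely inside 0..4GB-1. The receive WQE dumps quoted further down in this thread contain values such as 0x000000044eac3000; the thread never decodes that field, so treating it as a buffer bus address is an assumption.

```c
#include <stdint.h>

/* Highest address covered by a mapping of 0..4GB-1. */
#define DMA_MR_LIMIT 0xffffffffULL

/*
 * Hypothetical helper (not part of iw_cxgb3): returns 1 if the byte
 * range [addr, addr + len - 1] lies entirely within 0..4GB-1, else 0.
 * The comparison is arranged so that addr + len cannot overflow.
 */
static int range_within_4gb(uint64_t addr, uint64_t len)
{
    if (addr > DMA_MR_LIMIT)
        return 0;
    if (len == 0)
        return 1;
    return (len - 1) <= (DMA_MR_LIMIT - addr);
}
```

If 0x000000044eac3000 really is a bus address, range_within_4gb(0x000000044eac3000ULL, 4096) returns 0 for any length (4096 is an arbitrary example length), which would be consistent with the second issue Steve describes.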

Steve.



Craig Prescott wrote:

Hi Felix;

Here are the last 4 WRs:

...
Entering iwch_post_receive
iwch_post_receive: Dumping built work request before ring_doorbell:
iwch_post_receive: WQE ffff810241d59e00: 17c001008000000d
iwch_post_receive: WQE ffff810241d59e08: 0000000000000000
iwch_post_receive: WQE ffff810241d59e10: 0000000000000001
iwch_post_receive: WQE ffff810241d59e18: 000002ff00000810
iwch_post_receive: WQE ffff810241d59e20: 000000044eac3000
iwch_post_receive: WQE ffff810241d59e28: 0000000000000000
iwch_post_receive: WQE ffff810241d59e30: 0000000000000000
iwch_post_receive: WQE ffff810241d59e38: 0000000000000000
iwch_post_receive: WQE ffff810241d59e40: 0000000000000000
iwch_post_receive: WQE ffff810241d59e48: 0000000000000000
iwch_post_receive: WQE ffff810241d59e50: 0000000000000000
iwch_post_receive: WQE ffff810241d59e58: 0000000000000000
iwch_post_receive: WQE ffff810241d59e60: 0000000000000000
iwch_post_receive: returning 0
Entering iwch_post_receive
iwch_post_receive: Dumping built work request before ring_doorbell:
iwch_post_receive: WQE ffff810241d59e80: 17c001008000000d
iwch_post_receive: WQE ffff810241d59e88: 0000000000000000
iwch_post_receive: WQE ffff810241d59e90: 0000000000000001
iwch_post_receive: WQE ffff810241d59e98: 000002ff00000810
iwch_post_receive: WQE ffff810241d59ea0: 000000044eac4000
iwch_post_receive: WQE ffff810241d59ea8: 0000000000000000
iwch_post_receive: WQE ffff810241d59eb0: 0000000000000000
iwch_post_receive: WQE ffff810241d59eb8: 0000000000000000
iwch_post_receive: WQE ffff810241d59ec0: 0000000000000000
iwch_post_receive: WQE ffff810241d59ec8: 0000000000000000
iwch_post_receive: WQE ffff810241d59ed0: 0000000000000000
iwch_post_receive: WQE ffff810241d59ed8: 0000000000000000
iwch_post_receive: WQE ffff810241d59ee0: 0000000000000000
iwch_post_receive: returning 0
Entering iwch_post_receive
iwch_post_receive: Dumping built work request before ring_doorbell:
iwch_post_receive: WQE ffff810241d59f00: 17c001008000000d
iwch_post_receive: WQE ffff810241d59f08: 0000000000000000
iwch_post_receive: WQE ffff810241d59f10: 0000000000000001
iwch_post_receive: WQE ffff810241d59f18: 000002ff00000810
iwch_post_receive: WQE ffff810241d59f20: 000000044eac5000
iwch_post_receive: WQE ffff810241d59f28: 0000000000000000
iwch_post_receive: WQE ffff810241d59f30: 0000000000000000
iwch_post_receive: WQE ffff810241d59f38: 0000000000000000
iwch_post_receive: WQE ffff810241d59f40: 0000000000000000
iwch_post_receive: WQE ffff810241d59f48: 0000000000000000
iwch_post_receive: WQE ffff810241d59f50: 0000000000000000
iwch_post_receive: WQE ffff810241d59f58: 0000000000000000
iwch_post_receive: WQE ffff810241d59f60: 0000000000000000
iwch_post_receive: returning 0
Entering iwch_post_receive
iwch_post_receive: Dumping built work request before ring_doorbell:
iwch_post_receive: WQE ffff810241d59f80: 17c001008000000d
iwch_post_receive: WQE ffff810241d59f88: 0000000000000000
iwch_post_receive: WQE ffff810241d59f90: 0000000000000001
iwch_post_receive: WQE ffff810241d59f98: 000002ff00000810
iwch_post_receive: WQE ffff810241d59fa0: 000000044eac6000
iwch_post_receive: WQE ffff810241d59fa8: 0000000000000000
iwch_post_receive: WQE ffff810241d59fb0: 0000000000000000
iwch_post_receive: WQE ffff810241d59fb8: 0000000000000000
iwch_post_receive: WQE ffff810241d59fc0: 0000000000000000
iwch_post_receive: WQE ffff810241d59fc8: 0000000000000000
iwch_post_receive: WQE ffff810241d59fd0: 0000000000000000
iwch_post_receive: WQE ffff810241d59fd8: 0000000000000000
iwch_post_receive: WQE ffff810241d59fe0: 0000000000000000
iwch_post_receive: returning 0

Thanks,
Craig


Felix Marti wrote:
Hi Craig,

Can you please dump not only the last, but the last 4 WRs?

Thanks,
felix

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:general-
[EMAIL PROTECTED] On Behalf Of Craig Prescott
Sent: Wednesday, January 23, 2008 8:05 AM
To: Steve Wise
Cc: [email protected]
Subject: Re: [ofa-general] SDP and iWARP

Steve Wise wrote:
Craig Prescott wrote:
Steve Wise wrote:
Craig Prescott wrote:
Steve Wise wrote:
Craig Prescott wrote:
The above call also emits a couple of messages
into the listener's syslog now :

Jan  9 21:53:54 tebow2 kernel: iwch_ev_dispatch - CQE Err qpid 0x20 opcode 14 status 0x6 type 1 wrid.hi 0x0 wrid.lo 0x80000000
Jan  9 21:53:54 tebow2 kernel: post_qp_event - AE qpid 0x20 opcode 14 status 0x6 type 1 wrid.hi 0x0 wrid.lo 0x80000000

This is an async event generated due to a failure processing an
SQ WR, I think.  Opcodes and status codes for iw_cxgb3 are in
cxio_wr.h.
type 1 means it was an egress (SQ) failure
status 0x6 is a base/bounds violation,
but opcode 14 seems incorrect.  That's not a valid T3 opcode. ????
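
For anyone sifting through many of these syslog lines, the fields can be pulled apart mechanically. This is a minimal sketch in userspace C; struct cqe_err, parse_cqe_err() and the two describe helpers are hypothetical names, not driver code. The only decodings it knows are the two given above (type 1 and status 0x6); everything else is reported as unknown, since the full tables live in cxio_wr.h.

```c
#include <stdio.h>
#include <string.h>

/* Fields printed by the "CQE Err" / "AE" messages quoted above. */
struct cqe_err {
    unsigned int qpid;
    unsigned int opcode;
    unsigned int status;
    unsigned int type;
    unsigned int wrid_hi;
    unsigned int wrid_lo;
};

/* Hypothetical parser: returns 0 on success, -1 if the line does not match. */
static int parse_cqe_err(const char *line, struct cqe_err *e)
{
    const char *p = strstr(line, "qpid ");
    if (p == NULL)
        return -1;
    /* %x accepts an optional 0x prefix; opcode and type are printed in decimal. */
    int n = sscanf(p, "qpid %x opcode %u status %x type %u wrid.hi %x wrid.lo %x",
                   &e->qpid, &e->opcode, &e->status, &e->type,
                   &e->wrid_hi, &e->wrid_lo);
    return n == 6 ? 0 : -1;
}

/* Only the meanings stated in this thread are filled in. */
static const char *describe_type(unsigned int type)
{
    return type == 1 ? "egress (SQ) failure" : "unknown (see cxio_wr.h)";
}

static const char *describe_status(unsigned int status)
{
    return status == 0x6 ? "base/bounds violation" : "unknown (see cxio_wr.h)";
}
```

Because whitespace in the sscanf format matches any run of whitespace, the parser also accepts the message when a mail client has wrapped it across lines.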

Ok, thanks!  I guess I'm not sure what to make of that yet,
though.
See where in iwch_accept_cr() the failure is happening.  It
doesn't look like send_mpa_reply() is being called.

The ECONNRESET is coming from here in iwch_accept_cr():

...
        /* wait for wr_ack */
        wait_event(ep->com.waitq, ep->com.rpl_done);
        err = ep->com.rpl_err;
...

Is that what you thought was happening?
I don't know exactly what is going on!  But the code above means
that the firmware never successfully sent the last streaming
message (the mpa-start reply) and never transitioned the
connection into rdma mode.
And the async error might indicate that some WR was posted prior to
doing the rdma_accept() and that WR had problems.
Ok.  I'm sorry for such a slow response.
Ok.  I'm sorry for such a slow response.

a few questions:

What firmware are you running?  ethtool -i will tell you.
[EMAIL PROTECTED] ~]# ethtool -i eth4
driver: cxgb3
version: 1.0-ko
firmware-version: T 5.0.0 TP 1.1.0
bus-info: 0000:86:00.0

What ofed version exactly?
OFED 1.3 daily from a few weeks back now: OFED-1.3-20080107-0942

Does sdp post a SQ or RQ WR prior to doing the rdma_accept()?  Can
you dump that work request?  Maybe in iwch_post_send and
iwch_post_recv, dump the work request after it is built and before
the code rings the doorbell.  You can dump it as 8B flits, and be
sure to put the flits in host byte order.  See cxio_dump_wqe() in
cxio_dbg.c...
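
The flit dump suggested above can be sketched in userspace C as follows. This is not the cxio_dump_wqe() implementation; flit_to_host(), format_flit() and dump_wqe() are hypothetical names, and the sketch assumes the flits sit in queue memory in big-endian order. In the kernel the conversion would normally be done with be64_to_cpu() and the output with printk().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assemble one 8-byte flit, assumed big-endian in memory, into host order. */
static uint64_t flit_to_host(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Format flit number idx of the work request as "WQE <address>: <value>". */
static int format_flit(char *buf, size_t buflen, const void *wqe, size_t idx)
{
    const unsigned char *p = (const unsigned char *)wqe + idx * 8;
    return snprintf(buf, buflen, "WQE %p: %016llx",
                    (const void *)p, (unsigned long long)flit_to_host(p));
}

/* Print nflits flits, one per line, prefixed with the caller's name. */
static void dump_wqe(FILE *out, const char *who, const void *wqe, size_t nflits)
{
    char line[64];
    for (size_t i = 0; i < nflits; i++) {
        format_flit(line, sizeof(line), wqe, i);
        fprintf(out, "%s: %s\n", who, line);
    }
}
```

Calling dump_wqe(stderr, "iwch_post_receive", wqe, 13) right after the work request is built and before the doorbell is rung gives one line per flit; wqe is a placeholder for the driver's pointer to the built request, and 13 is simply the number of flits shown per request in the dumps in this thread.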
The following is the last work request seen before rdma_accept():

iwch_post_receive: Dumping built work request before ring_doorbell:
iwch_post_receive: WQE ffff810241d59f80: 17c001008000000d
iwch_post_receive: WQE ffff810241d59f88: 0000000000000000
iwch_post_receive: WQE ffff810241d59f90: 0000000000000001
iwch_post_receive: WQE ffff810241d59f98: 000002ff00000810
iwch_post_receive: WQE ffff810241d59fa0: 000000044eac6000
iwch_post_receive: WQE ffff810241d59fa8: 0000000000000000
iwch_post_receive: WQE ffff810241d59fb0: 0000000000000000
iwch_post_receive: WQE ffff810241d59fb8: 0000000000000000
iwch_post_receive: WQE ffff810241d59fc0: 0000000000000000
iwch_post_receive: WQE ffff810241d59fc8: 0000000000000000
iwch_post_receive: WQE ffff810241d59fd0: 0000000000000000
iwch_post_receive: WQE ffff810241d59fd8: 0000000000000000
iwch_post_receive: WQE ffff810241d59fe0: 0000000000000000
iwch_post_receive: returning 0

This comes from sdp_init_qp(), via sdp_connect_handler().
There are a total of 64 work requests (all from
iwch_post_receive()) generated while the netserver is
trying to handle the RDMA_CM_EVENT_CONNECT_REQUEST.

Can you help me decode the above work request?

Thanks,
Craig



_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

