On Wed, Nov 11, 2009 at 4:52 AM, Dotan Barak <[email protected]> wrote:
> Hi.
>
> how do you connect the QPs?
> via CM/CMA or by sockets (and you actually call the ibv_modify_qp)?
>

I exchange the initial QP information (lid, qpn, psn) via sockets.  No
CM is used; I manually take care of everything.

Thanks!

> Dotan
>
> neutron wrote:
>>
>> Hi Paul, thanks a lot for your quick reply!
>>
>> In my test, the client informs the server of its local memory (rkey,
>> addr, size) by sending 4 back-to-back messages; each message elicits
>> an RDMA read request (RR) from the server.
>>
>> In other words, client exposes its memory to the server, and server
>> RDMA reads it.
>>
>> As far as RDMA read is concerned, server is a requester, and client is
>> a responder, right?
>>
>> The error I encountered happens in the initial phase, when the client
>> sends 4 back-to-back messages to the server (using ibv_post_send()),
>> each containing the (rkey, addr, size) of the client's local memory.
>>
>> Of these 4 ibv_post_send() calls, the client sees one failure.  On the
>> server side, enough receive WRs have already been posted to the RQ.
>> The failures are included in my first email.
>>
>> Looking at the program output, it appears that the server gets message
>> 1, issues RR 1, gets message 2, issues RR 2.  But somehow the client
>> reports that "send message 2" fails.
>>
>> The server, on the other hand, reports that "receive message 3" fails.
>>
>> As a result, the server gets messages 1, 2, 4, and succeeds with RR 1,
>> 2, 4.  But the client sees message 2 fail and messages 1, 3, 4 succeed.
>> This inconsistency is what puzzles me.
>>
>> ------------
>> By the way, how should these RDMA parameters be interpreted, and which
>> parameters control RDMA read behavior?  Below is what I could find;
>> there must be more....
>>
>>   max_qp_rd_atom:                 4
>>   max_res_rd_atom:                258048
>>   max_qp_init_rd_atom:            128
>>
>>   qp_attr.max_dest_rd_atomic
>>   qp_attr.max_rd_atomic
>>
>>
>>
>> -neutron
>>
>>
>>
>> On Tue, Nov 10, 2009 at 2:04 AM, Paul Grun <[email protected]>
>> wrote:
>>
>>>
>>> Is it possible that you exceeded the number of available RDMA Read
>>> Resources on the server?  There is an expectation that the client
>>> knows how many outstanding RDMA Read Requests the responder (server)
>>> is capable of handling; if the requester (client) exceeds that number,
>>> the responder will indeed return a NAK-Invalid Request.  It sounds
>>> like your server is configured to accept three outstanding RDMA Read
>>> Requests.
>>>
>>> This also explains why it works when you pause the program
>>> periodically... it gives the responder time to generate the RDMA Read
>>> Responses and therefore free up some resources to be used in receiving
>>> the next incoming RDMA Read Request.
>>>
>>> -Paul
>>>
>>> -----Original Message-----
>>> From: [email protected]
>>> [mailto:[email protected]] On Behalf Of neutron
>>> Sent: Monday, November 09, 2009 9:04 PM
>>> To: [email protected]
>>> Subject: back to back RDMA read fail?
>>>
>>> Hi all,
>>>
>>> I have a simple program that tests back-to-back RDMA read performance.
>>> However, I encountered errors for unknown reasons.
>>>
>>> The basic flow of my program is:
>>>
>>> client:
>>> ibv_post_send() to send 4 back-to-back messages to the server (no
>>> delay in between). Each message contains the (rkey, addr, size) of a
>>> local buffer. The buffer is registered with remote read/write
>>> permissions. After that, ibv_poll_cq() is called to wait for
>>> completions.
>>>
>>> server:
>>> First, enough receive WRs are posted to the RQ.  Upon receipt of each
>>> message, immediately post an RDMA read request, using the (rkey, addr,
>>> size) information contained in the originating message.
>>>
>>> --------------
>>> Both client and server use RC QPs.  Some errors are observed.
>>>
>>> On the client side, ibv_poll_cq() gets 4 CQEs; one of the 4 is an
>>> error:
>>> CQ::  wr_id=0x0, wc_opcode=IBV_WC_SEND, wc_status=remote invalid RD
>>> request, wc_flag=0x3b
>>>     byte_len=11338758, immdata=1110104528, qp_num=0x0, src_qp=2290530758
>>>
>>> The other 3 CQEs are successes.
>>>
>>> On server side,
>>> 3 of the 4 messages are successfully received. One message produces an
>>> error CQE:
>>> CQ::  wr_id=0x8000000000, wc_opcode=Unknow-wc-opcode,
>>> wc_status=unknown, wc_flag=0x0
>>>     byte_len=9569287, immdata=0, qp_num=0x0, src_qp=265551872
>>>
>>> The 3 RDMA reads corresponding to the successful receives all succeed.
>>>
>>> But, if I pause the client program for a short while (usleep(100),
>>> for example) after calling ibv_post_send(), then no error occurs.
>>> Can anyone point out the pitfall here? Thanks!
>>>
>>>
>>> -----------
>>> On both client and server, I'm using an 'mthca0' HCA, type MT25208.
>>> The QPs are initialized with qp_attr.max_dest_rd_atomic = 4 and
>>> qp_attr.max_rd_atomic = 4.  "ibv_devinfo -v" reports:
>>>
>>> hca_id: mthca0
>>>       fw_ver:                         5.1.400
>>>       node_guid:                      0002:c902:0023:c04c
>>>       sys_image_guid:                 0002:c902:0023:c04f
>>>       vendor_id:                      0x02c9
>>>       vendor_part_id:                 25218
>>>       hw_ver:                         0xA0
>>>       board_id:                       MT_0370130002
>>>       phys_port_cnt:                  2
>>>       max_mr_size:                    0xffffffffffffffff
>>>       page_size_cap:                  0xfffff000
>>>       max_qp:                         64512
>>>       max_qp_wr:                      16384
>>>       device_cap_flags:               0x00001c76
>>>       max_sge:                        27
>>>       max_sge_rd:                     0
>>>       max_cq:                         65408
>>>       max_cqe:                        131071
>>>       max_mr:                         131056
>>>       max_pd:                         32764
>>>       max_qp_rd_atom:                 4
>>>       max_ee_rd_atom:                 0
>>>       max_res_rd_atom:                258048
>>>       max_qp_init_rd_atom:            128
>>>       max_ee_init_rd_atom:            0
>>>       atomic_cap:                     ATOMIC_HCA (1)
>>>       max_ee:                         0
>>>       max_rdd:                        0
>>>       max_mw:                         0
>>>       max_raw_ipv6_qp:                0
>>>       max_raw_ethy_qp:                0
>>>       max_mcast_grp:                  8192
>>>       max_mcast_qp_attach:            56
>>>       max_total_mcast_qp_attach:      458752
>>>       max_ah:                         0
>>>       max_fmr:                        0
>>>       max_srq:                        960
>>>       max_srq_wr:                     16384
>>>       max_srq_sge:                    27
>>>       max_pkeys:                      64
>>>       local_ca_ack_delay:             15
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>> the body of a message to [email protected]
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>>
>>
>>
>
>
