Quoting r. Roland Dreier <[EMAIL PROTECTED]>:
> Subject: problem with SDP/AIO on mem-free HCA
>
> [err, resending with a correct openib to: line]
>
> I'm hitting a strange problem with SDP/AIO on a mem-free Arbel. My
> test is the following: I run Libor's ttcp.aio program with default
> parameters (which I think just leaves one AIO in flight at a time) as
> follows:
>
>     ttcp.aio.x -r -s &
>     ttcp.aio.x -t -s 127.0.0.1
>
> This always fails with a remote access error exactly 256K into the
> test. I see the following in my log (with some extra tracing added to
> SDP to get info on the RDMAs being posted):
>
>     WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <5d> at <1d94e000>/<1000>
>     WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <5e> at <1d94f000>/<1000>
>     WARN: <2> <050e:11b1> Posting SEND, wrid <5f>
>     WARN: <1> <050e:11b1> Posting SEND, wrid <20>
>     CRTL: <2> <050e:11b1> GETNAME: src <0d000002:1389> dst <0d000002:8001>
>     CRTL: <2> <050e:11b1> GETNAME: src <0d000002:1389> dst <0d000002:8001>
>     WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <60> at <1d94e000>/<0>
>     WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <61> at <1d94e000>/<1000>
>     WARN: <2> <050e:11b1> Posting RDMA READ, rkey <2002c00> wrid <62> at <1d94f000>/<1000>
>     ib_mthca 0000:07:00.0: 86/66: error CQE -> QPN 000407, WQE @ 00001803
>       [ 0] 00000407
>       [ 4] b3000000
>       [ 8] fd000003
>       [ c] 110000c0
>       [10] 13880000
>       [14] 00000010
>       [18] 00001803
>       [1c] ff100000
>     WARN: : Unhandled status <10> unknown event <-1> wrid <60>
>
> As you can see, the failed work request is an RDMA with length 0. The
> previous work request with wrid 5d with the same R_Key and remote
> address but a length of 0x1000 appears to complete successfully so the
> FMR seems to be OK.
>
> So I guess there are two questions:
>  - why is SDP doing a zero-length RDMA read?
>  - is it correct for this to fail with a remote access error?
>
> I have not had a chance to test zero-length RDMA without involving
> FMRs but I don't think the FMR code is to blame.
I don't think so. I found this:

C9-88: For an HCA responder using Reliable Connection service, for
each zero-length RDMA READ or WRITE request, the R_Key shall not be
validated, even if the request includes Immediate data.

Could it be that you generate a non-zero-length RDMA in mthca?

> Also BTW, the code in sdp_cq_event_locked() is somewhat bogus: it
> switches on comp->opcode even when comp->status is not success.
> However, if the comp->status is not success, then per the IB spec,
> mthca does not set the comp->opcode field.
>
>  - R.

-- 
MST - Michael S. Tsirkin
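
To illustrate the point about sdp_cq_event_locked() quoted above, here is a
minimal sketch of a completion handler that looks at the status first and
only switches on the opcode for successful completions. The types and names
below (struct wc, WC_*, ACT_*, handle_completion) are hypothetical stand-ins
made up for this example; they are not the real ib_verbs.h definitions and
this is not the actual SDP code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the verbs completion types (assumption). */
enum wc_status { WC_SUCCESS, WC_REM_ACCESS_ERR };
enum wc_opcode { WC_SEND, WC_RDMA_READ, WC_RECV };

struct wc {
    unsigned long long wr_id;
    enum wc_status     status;
    enum wc_opcode     opcode;  /* only meaningful when status is WC_SUCCESS */
};

/* Hypothetical result codes, used here just to make the control flow visible. */
enum action { ACT_ERROR, ACT_SEND_DONE, ACT_READ_DONE, ACT_RECV_DONE, ACT_UNKNOWN };

static enum action handle_completion(const struct wc *comp)
{
    assert(comp != NULL);

    /* Check status before anything else: on a failed completion the
     * opcode field may not have been set, so it must not be examined.
     * The error path can still use wr_id to find the work request. */
    if (comp->status != WC_SUCCESS)
        return ACT_ERROR;

    /* Status is success, so the opcode can be trusted here. */
    switch (comp->opcode) {
    case WC_SEND:
        return ACT_SEND_DONE;
    case WC_RDMA_READ:
        return ACT_READ_DONE;
    case WC_RECV:
        return ACT_RECV_DONE;
    default:
        return ACT_UNKNOWN;
    }
}
```

With this ordering, a failed completion such as the one for wrid 60 in the
log would go straight to the error path, identified by its wr_id, instead of
falling through an opcode switch on an unset field.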
