How does the requester (in IB speak) know that an RDMA Write operation has completed on the responder?
We have a software iSER target, available at git.osc.edu/tgt or browse at http://git.osc.edu/?p=tgt.git . Using the existing in-kernel iSER initiator code, data corruption occurs very rarely, in that the data received from SCSI read operations does not match what was expected. Sometimes it appears as if random kernel memory has been scribbled on by an errant RDMA write from the target.

My current working theory is that the RDMA write has not completed by the time the initiator looks at its incoming data buffer.

Single RC QP, single CQ, no SRQ. Only Send, Receive, and RDMA Write work requests are used. After everything is connected up, a SCSI read sequence looks like:

    initiator: register pages with FMR, write test pattern
    initiator: Send request to target
    target:    Recv request
    target:    RDMA Write response to initiator
    target:    Wait for CQ entry for local RDMA Write completion
    target:    Send response to initiator
    initiator: Recv response, access buffer

On very rare occasions, this buffer will have the test pattern, not the data that the target just sent.

Machines are Opteron, Fedora 7 up-to-date with its openfab libs, kernel 2.6.23-rc6 on target. Either 2.6.23-rc6 or 2.6.22 or 2.6.18-rhel5 on initiator. For some reason, it is much easier to produce with the rhel5 kernel. One site with fast disks can see similar corruption with 2.6.23-rc6, however.

Target is pure userspace. Initiator is in kernel and is poked by "lmdd" (like normal dd) through an iSCSI block device (/dev/sdb).

The IB spec seems to indicate that the contents of the RDMA Write buffer should be stable after completion of a subsequent send message (o9-20). In fact, the "Wait for CQ entry" step on the target should be unnecessary, no?

Could there be some caching issues that the initiator is missing?

I've added print[fk]s to the initiator and target to verify that the sequence of events is truly as above, and that the virtual addresses are as expected on both sides.

Any suggestions or advice would help.
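The per-tag ordering on the target (RDMA Write posted, local completion seen, response sent) can also be checked mechanically from debug output rather than by eye. What follows is only a minimal sketch in C, not code from the tgt tree. It assumes log lines shaped like "tag NN ... rdmaw", "tag NN rdmaw completion" and "tag NN resp", that tags are printed in hex and stay below 256, and all names (parse_line, check_order, the stage enum) are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-tag progress states, named after the assumed log lines. */
enum stage { ST_NONE = 0, ST_RDMAW, ST_COMPLETION, ST_RESP };

#define MAX_TAG 256   /* assumption: tags fit in one byte */

/*
 * Classify one target log line and store its tag in *tag.
 * Returns the stage the line represents, or ST_NONE if unrecognized.
 */
static enum stage parse_line(const char *line, unsigned *tag)
{
    if (sscanf(line, "tag %x", tag) != 1 || *tag >= MAX_TAG)
        return ST_NONE;
    if (strstr(line, "rdmaw completion"))
        return ST_COMPLETION;          /* must be tested before plain rdmaw */
    if (strstr(line, " resp"))
        return ST_RESP;
    if (strstr(line, " rdmaw"))
        return ST_RDMAW;
    return ST_NONE;
}

/*
 * Walk the log and require, for every tag, the order
 * rdmaw -> completion -> resp. Different tags may interleave.
 * Returns the index of the first out-of-order line, or -1 if none.
 */
static int check_order(const char **lines, int n)
{
    enum stage seen[MAX_TAG] = { ST_NONE };

    assert(lines != NULL);
    for (int i = 0; i < n; i++) {
        unsigned tag;
        enum stage st = parse_line(lines[i], &tag);
        enum stage prev;

        if (st == ST_NONE)
            continue;                  /* ignore unrelated lines */
        prev = seen[tag];
        if (prev == ST_RESP)
            prev = ST_NONE;            /* tag is being reused for a new request */
        if ((int)st != (int)prev + 1)
            return i;
        seen[tag] = st;
    }
    return -1;
}
```

Feeding it the target's lines in the order they were printed gives -1 when every tag follows the expected sequence. A clean result only says the target issued its work requests in order; it does not show when the data became visible in the initiator's memory.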
Thanks,

		-- Pete

P.S. Here are some debugging printfs not in the git.

Userspace code does 200 read()s of length 8000, but complains about the result somewhere in the 14th read, from 112000 to 120000, and exits early. Expected pattern is a series of 400000 4-byte words, incrementing by 4, starting from 0. So 0x00000000, 0x00000004, ..., 0x001869fc:

    % lmdd of=internal ipat=1 if=/dev/sdb bs=8000 count=200 mismatch=10
    off=112000 want=1c000 got=3b3b3b3b

Initiator generates a series of SCSI operations, as driven by readahead and the block queue scheduler. You can see that it starts reading 4 pages, then 1 page, then 23 pages, then 1 page and so on, in order. These sizes and offsets vary from run to run. Each line here is printed after the SCSI read response has been received. It prints the first word in the buffer, and you can see the test pattern where data should be:

    tag 02 va 36061000 len 4000 word0 00000000 ref 1
    tag 03 va 36065000 len 1000 word0 00004000 ref 1
    tag 04 va 36066000 len 17000 word0 00005000 ref 1
    tag 05 va 7b6bc000 len 1000 word0 3b3b3b3b ref 1
    tag 06 va 7b6bd000 len 1f000 word0 0001d000 ref 1
    tag 07 va 7bdc2000 len 20000 word0 0003c000 ref 1

The userspace target code prints a line when it starts the RDMA write, then a line when the RDMA write completes locally, then a line when it sends the response. The tags are what the initiator assigned to each request.
The target thinks it is sending a 4096-byte buffer that has 0x1c000 as its first word, but the initiator did not see it:

    tag 02 va 36061000 len 4000 word0 00000000 rdmaw
    tag 02 rdmaw completion
    tag 02 resp
    tag 03 va 36065000 len 1000 word0 00004000 rdmaw
    tag 03 rdmaw completion
    tag 03 resp
    tag 04 va 36066000 len 17000 word0 00005000 rdmaw
    tag 04 rdmaw completion
    tag 04 resp
    tag 05 va 7b6bc000 len 1000 word0 0001c000 rdmaw
    tag 05 rdmaw completion
    tag 05 resp
    tag 06 va 7b6bd000 len 1f000 word0 0001d000 rdmaw
    tag 06 rdmaw completion
    tag 07 va 7bdc2000 len 20000 word0 0003c000 rdmaw
    tag 07 rdmaw completion
    tag 06 resp
    tag 07 resp
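Since each 4-byte word of the expected pattern holds its own byte offset, a check on the initiator side can tell "never written" (still the 0x3b fill) apart from "written with wrong data". The following is a minimal sketch in C of such a check, assuming the fill byte is 0x3b as in the output above and that words are compared in host byte order. The function and enum names are hypothetical and not taken from lmdd or the initiator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Pre-write test pattern as it shows up in the log (got=3b3b3b3b). */
#define FILL_WORD 0x3b3b3b3bu

enum word_state { WORD_OK, WORD_STALE, WORD_CORRUPT };

/* Expected pattern: the word stored at a byte offset is that offset. */
static uint32_t expected_word(uint64_t byte_off)
{
    return (uint32_t)byte_off;
}

/*
 * Classify one received word:
 *   WORD_OK      matches the expected pattern
 *   WORD_STALE   still holds the fill, i.e. the buffer was not written
 *   WORD_CORRUPT holds something else entirely
 */
static enum word_state classify_word(uint32_t got, uint64_t byte_off)
{
    if (got == expected_word(byte_off))
        return WORD_OK;
    if (got == FILL_WORD)
        return WORD_STALE;
    return WORD_CORRUPT;
}

/*
 * Scan a buffer of len bytes that starts at device byte offset base.
 * Returns the device byte offset of the first mismatching word,
 * or -1 if every complete word matches.
 */
static int64_t first_mismatch(const void *buf, size_t len, uint64_t base)
{
    const unsigned char *p = buf;

    assert(buf != NULL);
    for (size_t i = 0; i + 4 <= len; i += 4) {
        uint32_t w;

        memcpy(&w, p + i, sizeof w);   /* avoids alignment assumptions */
        if (w != expected_word(base + i))
            return (int64_t)(base + i);
    }
    return -1;
}
```

Applied to the tag 05 buffer above (base 0x1c000, len 0x1000), first_mismatch would report 0x1c000 and classify_word would call the first word stale, which fits the picture of a buffer the RDMA Write had not yet reached rather than one filled with wrong data.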
