Re: [ofa-general] iSER data corruption issues
At 11:09 PM 10/3/2007, Roland Dreier wrote:
> ... It just keeps a list of FMRs that are available to remap, and batches up the unregistration. It is true that an R_Key may remain valid after an FMR is unmapped, but that's the whole point of FMRs: if you don't batch up the real flushing to amortize the cost, they're no better than regular MRs really.

This is an aside, but in my experience the FMR is actually a win even if it's invalidated after each use. In testing with NFS/RDMA, I believe that direct FMR manipulation via ib_map_phys_fmr()/ib_unmap_fmr() was worth somewhere on the order of 35% over straight ib_reg_phys_mr()/ib_dereg_mr(). I can only assume this was because the TPT-entry setup (ib_alloc_fmr()) is avoided on a per-I/O basis.

As for the pools not invalidating the R_Key/handle: speaking as a storage provider, we take data integrity darn seriously. In my opinion, a dynamic registration scheme that doesn't include per-I/O protection misses the point of dynamic registration. In many environments, however, the performance tradeoff is important - this is why I prefer an all-physical scheme to FMRs, even though it requires additional RDMA ops to handle the resulting extra scatter/gather. Additionally, FMRs don't provide byte-range protection granularity, and they're not supported by iWARP hardware (plus they're buggy as heck on early Tavors, etc.). So I didn't make them a default.

Tom.
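Roughly, the two per-I/O registration paths being compared look like the following kernel-side sketch. This is not the NFS/RDMA code itself; the PD, page list, I/O virtual address, sizes, and the helper names are placeholders, and error unwinding is abbreviated.

#include <linux/err.h>
#include <rdma/ib_verbs.h>

/* Path 1: allocate the FMR once (TPT entry set up here), then only
 * map/unmap it per I/O. */
static struct ib_fmr *setup_fmr(struct ib_pd *pd)
{
	struct ib_fmr_attr attr = {
		.max_pages  = 64,
		.max_maps   = 32,
		.page_shift = PAGE_SHIFT,
	};

	return ib_alloc_fmr(pd, IB_ACCESS_LOCAL_WRITE |
				IB_ACCESS_REMOTE_WRITE, &attr);
}

static int fmr_per_io(struct ib_fmr *fmr, u64 *pages, int npages, u64 iova)
{
	LIST_HEAD(fmr_list);
	int ret;

	ret = ib_map_phys_fmr(fmr, pages, npages, iova);	/* cheap remap */
	if (ret)
		return ret;

	/* ... post RDMA work requests using fmr->rkey ... */

	list_add_tail(&fmr->list, &fmr_list);
	return ib_unmap_fmr(&fmr_list);				/* invalidate per I/O */
}

/* Path 2: full register/deregister on every I/O. */
static int reg_mr_per_io(struct ib_pd *pd, struct ib_phys_buf *bufs,
			 int nbufs, u64 *iova)
{
	struct ib_mr *mr;

	mr = ib_reg_phys_mr(pd, bufs, nbufs,
			    IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE,
			    iova);				/* full TPT setup */
	if (IS_ERR(mr))
		return PTR_ERR(mr);

	/* ... post RDMA work requests using mr->rkey ... */

	return ib_dereg_mr(mr);
}

The difference being described is that the second path rebuilds the translation-table entry on every I/O, while the first only rewrites the mapping of an already-allocated FMR.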
Re: [ofa-general] iSER data corruption issues
[EMAIL PROTECTED] wrote on Wed, 03 Oct 2007 15:01 -0700:
> > Machines are opteron, fedora 7 up-to-date with its openfab libs, kernel 2.6.23-rc6 on target. Either 2.6.23-rc6 or 2.6.22 or 2.6.18-rhel5 on initiator. For some reason, it is much easier to produce with the rhel5 kernel.
>
> There was a bug in mthca that caused data corruption with FMRs on Sinai (1-port PCIe) HCAs. It was fixed in commit 608d8268 ("IB/mthca: Fix data corruption after FMR unmap on Sinai"), which went in shortly before 2.6.21 was released. I don't know if the RHEL5 2.6.18 kernel has this fix or not -- but if you still see the problem on 2.6.22 and later kernels then this isn't the fix anyway.

This is definitely it. Same test setup runs for an hour with this patch, but fails in tens of seconds without it. Thanks for pointing it out.

This rhel5 kernel is 2.6.18-8.1.6. Perhaps there are newer ones about that have this critical patch included. I'm going to add a Big Fat Warning on the iser distribution about pre-2.6.21 kernels. It also crashes if the iSER connection drops in a certain easy-to-reproduce way, another reason to avoid it.

Regarding the larger test I talked about that fails even on modern kernels, I'm still not able to reproduce that on my setup. I ran it literally all night with a hacked target that calculated the return buffer rather than accessing the disk. For now I'm calling that a separate bug and will investigate it further.

Thanks to Tom and Tom for helping debug this.

		-- Pete
Re: [ofa-general] iSER data corruption issues
[EMAIL PROTECTED] wrote on Thu, 04 Oct 2007 07:55 -0400:
> This is an aside, but in my experience the FMR is actually a win even if it's invalidated after each use. In testing with NFS/RDMA, I believe that direct FMR manipulation via ib_map_phys_fmr()/ib_unmap_fmr() was worth somewhere on the order of 35% over straight ib_reg_phys_mr()/ib_dereg_mr(). I can only assume this was because the TPT-entry setup (ib_alloc_fmr()) is avoided on a per-I/O basis.
>
> As for the pools not invalidating the R_Key/handle: speaking as a storage provider, we take data integrity darn seriously. In my opinion, a dynamic registration scheme that doesn't include per-I/O protection misses the point of dynamic registration. In many environments, however, the performance tradeoff is important - this is why I prefer an all-physical scheme to FMRs, even though it requires additional RDMA ops to handle the resulting extra scatter/gather.

Ack. Unfortunately in the iSER case, we are limited to a single virtual address per command. Page-size fragmentation may destroy performance, even with heavy pipelining.

		-- Pete

> Additionally, FMRs don't provide byte-range protection granularity, and they're not supported by iWARP hardware (plus they're buggy as heck on early Tavors, etc.). So I didn't make them a default.
Re: [ofa-general] iSER data corruption issues
On Thu, 2007-10-04 at 12:14 -0400, Pete Wyckoff wrote:
> [EMAIL PROTECTED] wrote on Wed, 03 Oct 2007 15:01 -0700:
> > > Machines are opteron, fedora 7 up-to-date with its openfab libs, kernel 2.6.23-rc6 on target. Either 2.6.23-rc6 or 2.6.22 or 2.6.18-rhel5 on initiator. For some reason, it is much easier to produce with the rhel5 kernel.
> >
> > There was a bug in mthca that caused data corruption with FMRs on Sinai (1-port PCIe) HCAs. It was fixed in commit 608d8268 ("IB/mthca: Fix data corruption after FMR unmap on Sinai"), which went in shortly before 2.6.21 was released. I don't know if the RHEL5 2.6.18 kernel has this fix or not -- but if you still see the problem on 2.6.22 and later kernels then this isn't the fix anyway.
>
> This is definitely it. Same test setup runs for an hour with this patch, but fails in tens of seconds without it. Thanks for pointing it out.
>
> This rhel5 kernel is 2.6.18-8.1.6. Perhaps there are newer ones about that have this critical patch included. I'm going to add a Big Fat Warning on the iser distribution about pre-2.6.21 kernels. It also crashes if the iSER connection drops in a certain easy-to-reproduce way, another reason to avoid it.
>
> Regarding the larger test I talked about that fails even on modern kernels, I'm still not able to reproduce that on my setup. I ran it literally all night with a hacked target that calculated the return buffer rather than accessing the disk. For now I'm calling that a separate bug and will investigate it further.
>
> Thanks to Tom and Tom for helping debug this.

Thanks to Roland who actually knew what it was ... ;-)

> 		-- Pete
[ofa-general] iSER data corruption issues
How does the requester (in IB speak) know that an RDMA Write operation has completed on the responder?

We have a software iSER target, available at git.osc.edu/tgt or browse at http://git.osc.edu/?p=tgt.git . Using the existing in-kernel iSER initiator code, very rarely data corruption occurs, in that the received data from SCSI read operations does not match what was expected. Sometimes it appears as if random kernel memory has been scribbled on by an errant RDMA write from the target. My current working theory is that the RDMA write has not completed by the time the initiator looks at its incoming data buffer.

Single RC QP, single CQ, no SRQ. Only Send, Receive, and RDMA Write work requests are used. After everything is connected up, a SCSI read sequence looks like:

    initiator:  register pages with FMR, write test pattern
    initiator:  Send request to target
    target:     Recv request
    target:     RDMA Write response to initiator
    target:     Wait for CQ entry for local RDMA Write completion
    target:     Send response to initiator
    initiator:  Recv response, access buffer

On very rare occasions, this buffer will have the test pattern, not the data that the target just sent.

Machines are opteron, fedora 7 up-to-date with its openfab libs, kernel 2.6.23-rc6 on target. Either 2.6.23-rc6 or 2.6.22 or 2.6.18-rhel5 on initiator. For some reason, it is much easier to produce with the rhel5 kernel. One site with fast disks can see similar corruption with 2.6.23-rc6, however.

Target is pure userspace. Initiator is in kernel and is poked by lmdd (like normal dd) through an iSCSI block device (/dev/sdb).

The IB spec seems to indicate that the contents of the RDMA Write buffer should be stable after completion of a subsequent send message (o9-20). In fact, the "Wait for CQ entry" step on the target should be unnecessary, no? Could there be some caching issues that the initiator is missing? I've added print[fk]s to the initiator and target to verify that the sequence of events is truly as above, and that the virtual addresses are as expected on both sides.

Any suggestions or advice would help. Thanks,

		-- Pete

P.S. Here are some debugging printfs not in the git. Userspace code does 200 read()s of length 8000, but complains about the result somewhere in the 14th read, from 112000 to 120000, and exits early. Expected pattern is a series of 400000 4-byte words, incrementing by 4, starting from 0. So 0x00000000, 0x00000004, ..., 0x001869fc:

    % lmdd of=internal ipat=1 if=/dev/sdb bs=8000 count=200
    mismatch=10 off=112000 want=1c000 got=3b3b3b3b

Initiator generates a series of SCSI operations, as driven by readahead and the block queue scheduler. You can see that it starts reading 4 pages, then 1 page, then 23 pages, then 1 page and so on, in order. These sizes and offsets vary from run to run. Each line here is printed after the SCSI read response has been received. It prints the first word in the buffer, and you can see the test pattern where data should be:

    tag 02 va 36061000 len 4000 word0 00000000 ref 1
    tag 03 va 36065000 len 1000 word0 00004000 ref 1
    tag 04 va 36066000 len 17000 word0 00005000 ref 1
    tag 05 va 7b6bc000 len 1000 word0 3b3b3b3b ref 1
    tag 06 va 7b6bd000 len 1f000 word0 0001d000 ref 1
    tag 07 va 7bdc2000 len 20000 word0 0003c000 ref 1

The userspace target code prints a line when it starts the RDMA write, then a line when the RDMA write completes locally, then a line when it sends the response. The tags are what the initiator assigned to each request.

The target thinks it is sending a 4096-byte buffer that has 0x1c000 as its first word, but the initiator did not see it:

    tag 02 va 36061000 len 4000 word0 00000000 rdmaw
    tag 02 rdmaw completion
    tag 02 resp
    tag 03 va 36065000 len 1000 word0 00004000 rdmaw
    tag 03 rdmaw completion
    tag 03 resp
    tag 04 va 36066000 len 17000 word0 00005000 rdmaw
    tag 04 rdmaw completion
    tag 04 resp
    tag 05 va 7b6bc000 len 1000 word0 0001c000 rdmaw
    tag 05 rdmaw completion
    tag 05 resp
    tag 06 va 7b6bd000 len 1f000 word0 0001d000 rdmaw
    tag 06 rdmaw completion
    tag 07 va 7bdc2000 len 20000 word0 0003c000 rdmaw
    tag 07 rdmaw completion
    tag 06 resp
    tag 07 resp
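For reference, a stripped-down userspace sketch of the target-side sequence above, using libibverbs. The connection setup, tag bookkeeping, and the remote address/rkey taken from the iSER request are assumed, and the function name post_read_response() is made up for illustration; this is not the tgt code.

#include <infiniband/verbs.h>
#include <stdint.h>

static int post_read_response(struct ibv_qp *qp,
			      void *data, uint32_t data_len, uint32_t lkey,
			      uint64_t remote_addr, uint32_t rkey,
			      void *rsp, uint32_t rsp_len, uint32_t rsp_lkey,
			      uint64_t tag)
{
	struct ibv_sge data_sge = {
		.addr   = (uintptr_t) data,
		.length = data_len,
		.lkey   = lkey,
	};
	struct ibv_send_wr write_wr = {
		.wr_id      = tag,
		.sg_list    = &data_sge,
		.num_sge    = 1,
		.opcode     = IBV_WR_RDMA_WRITE,
		/* signaled so the debug "wait for local completion" step can see it */
		.send_flags = IBV_SEND_SIGNALED,
		.wr.rdma.remote_addr = remote_addr,	/* initiator buffer from FMR registration */
		.wr.rdma.rkey        = rkey,
	};
	struct ibv_sge rsp_sge = {
		.addr   = (uintptr_t) rsp,
		.length = rsp_len,
		.lkey   = rsp_lkey,
	};
	struct ibv_send_wr send_wr = {
		.wr_id      = tag,
		.sg_list    = &rsp_sge,
		.num_sge    = 1,
		.opcode     = IBV_WR_SEND,
		.send_flags = IBV_SEND_SIGNALED,
	};
	struct ibv_send_wr *bad_wr;
	int ret;

	/* Posting both on the same send queue is what provides ordering:
	 * per the spec point cited above (o9-20), the initiator's receive
	 * completion for the Send implies the RDMA Write data is placed. */
	ret = ibv_post_send(qp, &write_wr, &bad_wr);
	if (ret)
		return ret;
	return ibv_post_send(qp, &send_wr, &bad_wr);
}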
Re: [ofa-general] iSER data corruption issues
On Wed, 2007-10-03 at 13:42 -0400, Pete Wyckoff wrote:
> How does the requester (in IB speak) know that an RDMA Write operation has completed on the responder?
>
> We have a software iSER target, available at git.osc.edu/tgt or browse at http://git.osc.edu/?p=tgt.git . Using the existing in-kernel iSER initiator code, very rarely data corruption occurs, in that the received data from SCSI read operations does not match what was expected. Sometimes it appears as if random kernel memory has been scribbled on by an errant RDMA write from the target. My current working theory is that the RDMA write has not completed by the time the initiator looks at its incoming data buffer.
>
> Single RC QP, single CQ, no SRQ. Only Send, Receive, and RDMA Write work requests are used. After everything is connected up, a SCSI read sequence looks like:
>
>     initiator:  register pages with FMR, write test pattern
>     initiator:  Send request to target
>     target:     Recv request
>     target:     RDMA Write response to initiator
>     target:     Wait for CQ entry for local RDMA Write completion

Pete: I don't think this should be necessary...

>     target:     Send response to initiator

...as long as the send is posted on the same SQ as the write.

>     initiator:  Recv response, access buffer
>
> On very rare occasions, this buffer will have the test pattern, not the data that the target just sent.
>
> Machines are opteron, fedora 7 up-to-date with its openfab libs, kernel 2.6.23-rc6 on target. Either 2.6.23-rc6 or 2.6.22 or 2.6.18-rhel5 on initiator. For some reason, it is much easier to produce with the rhel5 kernel. One site with fast disks can see similar corruption with 2.6.23-rc6, however.
>
> Target is pure userspace. Initiator is in kernel and is poked by lmdd (like normal dd) through an iSCSI block device (/dev/sdb).
>
> The IB spec seems to indicate that the contents of the RDMA Write buffer should be stable after completion of a subsequent send message (o9-20). In fact, the "Wait for CQ entry" step on the target should be unnecessary, no?

I think so too.

> Could there be some caching issues that the initiator is missing? I've added print[fk]s to the initiator and target to verify that the sequence of events is truly as above, and that the virtual addresses are as expected on both sides.
>
> Any suggestions or advice would help. Thanks,

If your theory is correct, the data should eventually show up. Does it?

Does your code check for errors on dma_map_single/page?

> 		-- Pete
>
> P.S. Here are some debugging printfs not in the git. Userspace code does 200 read()s of length 8000, but complains about the result somewhere in the 14th read, from 112000 to 120000, and exits early. Expected pattern is a series of 400000 4-byte words, incrementing by 4, starting from 0. So 0x00000000, 0x00000004, ..., 0x001869fc:
>
>     % lmdd of=internal ipat=1 if=/dev/sdb bs=8000 count=200
>     mismatch=10 off=112000 want=1c000 got=3b3b3b3b
>
> Initiator generates a series of SCSI operations, as driven by readahead and the block queue scheduler. You can see that it starts reading 4 pages, then 1 page, then 23 pages, then 1 page and so on, in order. These sizes and offsets vary from run to run. Each line here is printed after the SCSI read response has been received. It prints the first word in the buffer, and you can see the test pattern where data should be:
>
>     tag 02 va 36061000 len 4000 word0 00000000 ref 1
>     tag 03 va 36065000 len 1000 word0 00004000 ref 1
>     tag 04 va 36066000 len 17000 word0 00005000 ref 1
>     tag 05 va 7b6bc000 len 1000 word0 3b3b3b3b ref 1

Is it interesting that the bad word occurs on the first page of the new map?

>     tag 06 va 7b6bd000 len 1f000 word0 0001d000 ref 1
>     tag 07 va 7bdc2000 len 20000 word0 0003c000 ref 1
>
> The userspace target code prints a line when it starts the RDMA write, then a line when the RDMA write completes locally, then a line when it sends the response. The tags are what the initiator assigned to each request.
>
> The target thinks it is sending a 4096-byte buffer that has 0x1c000 as its first word, but the initiator did not see it:
>
>     tag 02 va 36061000 len 4000 word0 00000000 rdmaw
>     tag 02 rdmaw completion
>     tag 02 resp
>     tag 03 va 36065000 len 1000 word0 00004000 rdmaw
>     tag 03 rdmaw completion
>     tag 03 resp
>     tag 04 va 36066000 len 17000 word0 00005000 rdmaw
>     tag 04 rdmaw completion
>     tag 04 resp
>     tag 05 va 7b6bc000 len 1000 word0 0001c000 rdmaw
>     tag 05 rdmaw completion
>     tag 05 resp
>     tag 06 va 7b6bd000 len 1f000 word0 0001d000 rdmaw
>     tag 06 rdmaw completion
>     tag 07 va 7bdc2000 len 20000 word0 0003c000 rdmaw
>     tag 07 rdmaw completion
>     tag 06 resp
>     tag 07 resp
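As an illustration of the kind of check being asked about, here is a minimal kernel-side sketch using the ib_dma_* wrappers; the device, buffer, direction, and the helper name are placeholders, not code from the iSER initiator.

#include <linux/dma-mapping.h>
#include <rdma/ib_verbs.h>

static int map_data_buffer(struct ib_device *dev, void *buf, size_t len,
			   u64 *dma_addr)
{
	u64 addr;

	addr = ib_dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
	if (ib_dma_mapping_error(dev, addr))
		return -ENOMEM;		/* never post a WR with a bad address */

	*dma_addr = addr;
	return 0;
}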
Re: [ofa-general] iSER data corruption issues
[EMAIL PROTECTED] wrote on Wed, 03 Oct 2007 13:02 -0500:
> On Wed, 2007-10-03 at 13:42 -0400, Pete Wyckoff wrote:
> > My current working theory is that the RDMA write has not completed by the time the initiator looks at its incoming data buffer.
> [..]
> If your theory is correct, the data should eventually show up. Does it?

Good point. It does not eventually show up. I added 5 1-second busy loop delays, checking to see if the values ever change. They don't.

> Does your code check for errors on dma_map_single/page?

This is drivers/infiniband/ulp/iser/iser_verbs.c, in iser_reg_page_vec, as called from iser_reg_rdma_mem. It uses ib_fmr_pool_map_phys, and would complain if it saw an error. These are page cache pages, and the FMR calls seem to take physical pages, but never map them into DMA addresses. Should be no mapping required for opteron and arbel, though. I could be misunderstanding something here.

I don't see any major differences between this old 2.6.18-rhel5 and 2.6.23-rc6, except for a call to dma_sync_single() in mthca_arbel_map_phys_fmr(), which I'm guessing is a noop on this platform (swiotlb). Unfortunately 2.6.23-rc6 does not break at my site. At the other site with fast disks, adding any sort of kernel debugging apparently causes the problem to go away. Frustrating.

> >     tag 02 va 36061000 len 4000 word0 00000000 ref 1
> >     tag 03 va 36065000 len 1000 word0 00004000 ref 1
> >     tag 04 va 36066000 len 17000 word0 00005000 ref 1
> >     tag 05 va 7b6bc000 len 1000 word0 3b3b3b3b ref 1
>
> Is it interesting that the bad word occurs on the first page of the new map?

One would think so, but it is not always the first page. Sometimes, less often, it is the first word of a page in the middle of a map.

I'll keep digging. Thanks,

		-- Pete
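For readers not familiar with the pool API named above, the per-command usage pattern that iser_reg_page_vec is built around looks roughly like this sketch. The pool, page list, and helper names are placeholders; this is not the iSER code itself.

#include <linux/err.h>
#include <rdma/ib_fmr_pool.h>

static int register_command_buffer(struct ib_fmr_pool *pool,
				   u64 *page_list, int npages, u64 io_addr,
				   u32 *rkey, struct ib_pool_fmr **pfmr)
{
	struct ib_pool_fmr *fmr;

	fmr = ib_fmr_pool_map_phys(pool, page_list, npages, io_addr);
	if (IS_ERR(fmr))
		return PTR_ERR(fmr);	/* the "would complain" case above */

	*rkey = fmr->fmr->rkey;		/* advertised to the target for the RDMA Write */
	*pfmr = fmr;
	return 0;
}

static void unregister_command_buffer(struct ib_pool_fmr *fmr)
{
	/* Returns the FMR to the pool; the real unmap (and R_Key
	 * invalidation) may be deferred and batched by the pool. */
	ib_fmr_pool_unmap(fmr);
}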
Re: [ofa-general] iSER data corruption issues
At 01:42 PM 10/3/2007, Pete Wyckoff wrote:
> Single RC QP, single CQ, no SRQ. Only Send, Receive, and RDMA Write work requests are used. After everything is connected up, a SCSI read sequence looks like:
>
>     initiator:  register pages with FMR, write test pattern
>     initiator:  Send request to target
>     target:     Recv request
>     target:     RDMA Write response to initiator
>     target:     Wait for CQ entry for local RDMA Write completion
>     target:     Send response to initiator
>     initiator:  Recv response, access buffer
> ...
> The IB spec seems to indicate that the contents of the RDMA Write buffer should be stable after completion of a subsequent send message (o9-20). In fact, the "Wait for CQ entry" step on the target should be unnecessary, no?

Not only unnecessary, on some hardware it may even be meaningless. A local completion means only that the hardware has accepted the RDMA Write, not that it has been sent - and certainly not that it has been received and placed in remote memory.

I would look into the dma_sync behavior on the receiver. Especially on an Opteron, it's critical to synchronize the iommu and cachelines to the right memory locations. Since the FMR code hides some of this, it may be a challenge to trace.

Can you try another memory registration strategy? NFS/RDMA can do that, for example.

Tom.
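The kind of sync being suggested would look roughly like the sketch below on the initiator, before the CPU inspects the buffer. This is purely illustrative (the function name is made up), and, as discussed later in the thread, such a sync is typically a no-op on x86/Opteron without an active IOMMU.

#include <linux/dma-mapping.h>
#include <rdma/ib_verbs.h>

static void data_ready(struct ib_device *dev, u64 dma_addr, size_t len)
{
	/* Hand ownership of the DMA'd region back to the CPU before
	 * looking at it. */
	ib_dma_sync_single_for_cpu(dev, dma_addr, len, DMA_FROM_DEVICE);

	/* ... CPU may now read the received data ... */
}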
Re: [ofa-general] iSER data corruption issues
> Machines are opteron, fedora 7 up-to-date with its openfab libs, kernel 2.6.23-rc6 on target. Either 2.6.23-rc6 or 2.6.22 or 2.6.18-rhel5 on initiator. For some reason, it is much easier to produce with the rhel5 kernel.

There was a bug in mthca that caused data corruption with FMRs on Sinai (1-port PCIe) HCAs. It was fixed in commit 608d8268 ("IB/mthca: Fix data corruption after FMR unmap on Sinai"), which went in shortly before 2.6.21 was released. I don't know if the RHEL5 2.6.18 kernel has this fix or not -- but if you still see the problem on 2.6.22 and later kernels then this isn't the fix anyway.

 - R.
Re: [ofa-general] iSER data corruption issues
> I would look into the dma_sync behavior on the receiver. Especially on an Opteron, it's critical to synchronize the iommu and cachelines to the right memory locations. Since the FMR code hides some of this, it may be a challenge to trace.
>
> Can you try another memory registration strategy? NFS/RDMA can do that, for example.

I think this is a red herring. Every IB HCA does 64-bit DMA, which means it bypasses all the Opteron iommu/swiotlb stuff.

Also, FMR doesn't hide any DMA mapping stuff; it is completely up to the consumer to handle all the DMA mapping, because FMRs operate completely at the level of bus (HCA DMA) addresses.

 - R.
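In practice, "the consumer handles the DMA mapping" means each page is mapped to a bus address first, and only those addresses go into the FMR page list. A hypothetical sketch, not taken from iSER; the names and fixed array size are illustrative, and unmapping on the error path is omitted for brevity.

#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <rdma/ib_verbs.h>

#define MAX_FMR_PAGES 64

static int build_fmr_page_list(struct ib_device *dev,
			       struct page **pages, int npages,
			       u64 *page_list /* room for MAX_FMR_PAGES */)
{
	int i;

	if (npages > MAX_FMR_PAGES)
		return -EINVAL;

	for (i = 0; i < npages; i++) {
		u64 dma = ib_dma_map_page(dev, pages[i], 0, PAGE_SIZE,
					  DMA_FROM_DEVICE);
		if (ib_dma_mapping_error(dev, dma))
			return -ENOMEM;
		page_list[i] = dma;	/* bus address, which is what the FMR expects */
	}
	return 0;
}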
Re: [ofa-general] iSER data corruption issues
At 06:04 PM 10/3/2007, Roland Dreier wrote:
> > I would look into the dma_sync behavior on the receiver. Especially on an Opteron, it's critical to synchronize the iommu and cachelines to the right memory locations. Since the FMR code hides some of this, it may be a challenge to trace.
> >
> > Can you try another memory registration strategy? NFS/RDMA can do that, for example.
>
> I think this is a red herring. Every IB HCA does 64-bit DMA, which means it bypasses all the Opteron iommu/swiotlb stuff.
>
> Also, FMR doesn't hide any DMA mapping stuff; it is completely up to the consumer to handle all the DMA mapping, because FMRs operate completely at the level of bus (HCA DMA) addresses.

Fair enough, but the FMR *pools* still worry me, because they manage internal registrations and defer their manipulation. Depending on lots of things beyond the consumer's control, they sometimes don't even close the handles advertised to the RDMA peer. Bypassing the pools and going directly to the FMRs themselves avoids this (which is what NFS/RDMA does), but iSER and SRP both use the pool API, don't they?

So, what else sends an RDMA write into the weeds? Short of writing to the wrong address, it sure sounds like a dma consistency thing to me. The connection wasn't lost, so it's not an error.

Tom.
Re: [ofa-general] iSER data corruption issues
> Fair enough, but the FMR *pools* still worry me, because they manage internal registrations and defer their manipulation. Depending on lots of things beyond the consumer's control, they sometimes don't even close the handles advertised to the RDMA peer.

The FMR pool stuff (especially with caching turned off, as the iSER initiator uses the API) isn't really doing anything particularly fancy. It just keeps a list of FMRs that are available to remap, and batches up the unregistration. It is true that an R_Key may remain valid after an FMR is unmapped, but that's the whole point of FMRs: if you don't batch up the real flushing to amortize the cost, they're no better than regular MRs really.

> So, what else sends an RDMA write into the weeds? Short of writing to the wrong address, it sure sounds like a dma consistency thing to me. The connection wasn't lost, so it's not an error.

I don't have that feeling. x86 systems are really pretty strongly consistent with respect to DMA when you're not using any of the GART/IOMMU stuff, so I think it's more likely that either the wrong address is being given to the HCA somehow, or the mthca FMR implementation is making the HCA write to the wrong address. Especially since the correct data never shows up even after a long time, it seems that the data must just be going to the wrong place.

Given that there was an FMR bug with 1-port Mellanox HCAs that caused iSER corruption, I would like to make sure that the same thing isn't hitting here as well. Reproducing on 2.6.22 or 2.6.23-rcX (which have the bug fixed) would rule that out, as would seeing the bug on anything but a 1-port Mellanox HCA.

 - R.
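To illustrate the "batches up the unregistration" behavior, here is a hypothetical sketch of creating an FMR pool with remap caching turned off. The pool size, watermark, access flags, and helper name are illustrative, not the values iSER actually uses.

#include <rdma/ib_fmr_pool.h>

static struct ib_fmr_pool *create_pool(struct ib_pd *pd)
{
	struct ib_fmr_pool_param params = {
		.max_pages_per_fmr = 64,
		.page_shift        = PAGE_SHIFT,
		.access            = IB_ACCESS_LOCAL_WRITE |
				     IB_ACCESS_REMOTE_WRITE,
		.pool_size         = 128,
		/* Unmapped FMRs are only really flushed (R_Keys invalidated)
		 * once this many are dirty, amortizing the flush cost. */
		.dirty_watermark   = 32,
		.cache             = 0,	/* no remap caching, as iSER uses the API */
	};

	return ib_create_fmr_pool(pd, &params);
}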