Hi Lars,

I've found the bug.

A fix patch is attached.
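
The root cause: sdp_sock_queue_rcv_skb() may return a different skb than
the one it was given, but the BSDH header pointer h in
sdp_handle_recv_comp() still pointed into the old buffer, so the
SDP_OOB_PRES check read a stale header. Roughly, the broken flow looked
like this (a simplified sketch; the earlier assignment of h is assumed,
names as in sdp_bcopy.c):

    h = (struct sdp_bsdh *)skb_transport_header(skb);
    ...
    skb = sdp_sock_queue_rcv_skb(sk, skb);  /* may return a different skb */
    if (unlikely(h->flags & SDP_OOB_PRES))  /* h still points into the old skb */
            sdp_urg(ssk, skb);

The patch re-reads h from the current skb right after the queueing call.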

I will commit it after the weekend (in Israel, the weekend is Friday-Saturday).

- Amir

On 07/08/2009 01:17 PM, Lars Ellenberg wrote:
> On Wed, Jul 08, 2009 at 11:12:15AM +0300, Amir Vadai wrote:
>   
>> Hi Lars,
>>
>> I opened a bug in our Bugzilla
>> (https://bugs.openfabrics.org/show_bug.cgi?id=1672).
>>
>> I couldn't reproduce it on my setup: SLES 10 SP2, stock kernel, same OFED
>> git version. I will now try to install a 2.6.27 kernel and check again.
>>     
> With a "normal" kernel config, I needed full-load bi-directional network
> traffic on IPoIB as well as SDP, over multiple stream sockets, to
> eventually trigger it after a few minutes (several hundred megabytes
> per second).
>
> With the "debug" kernel config, I was able to reproduce it with only one
> socket, within milliseconds.
>
> My .config is attached.
>
>   
>> BTW, what type of servers do you use? Are they low-end or high-end servers?
>>     
> This is the second cluster that shows this bug.  I first experienced it
> when using SDP sockets from within kernel space.
> I was able to reproduce it in userland,
> which I thought might make it easier for you to reproduce.
>
> The current test cluster is a slightly aged 2U Supermicro dual quad-core
> with 4 GB RAM, and it has proven to be very reliable hardware in all
> tests up to now. It may be a little slow on interrupts.
>
> The tail of /proc/cpuinfo:
> processor       : 7
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 15
> model name      : Intel(R) Xeon(R) CPU           E5310  @ 1.60GHz
> stepping        : 7
> cpu MHz         : 1599.984
> cache size      : 4096 KB
> physical id     : 1
> siblings        : 4
> core id         : 3
> cpu cores       : 4
> apicid          : 7
> initial apicid  : 7
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 10
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
> syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl pni
> monitor ds_cpl vmx tm2 ssse3 cx16 xtpr dca lahf_lm
> bogomips        : 3201.35
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 36 bits physical, 48 bits virtual
> power management:
>
> The IB setup is a direct link; lspci says:
> 09:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX IB QDR, PCIe 2.0 
> 5GT/s] (rev a0)
>
>
> Because IPoIB works just fine, I don't think we have issues with the
> IB setup or the hardware in general. Only when using SDP is it broken:
> it "forgets" bytes or corrupts data.
>
> What I do "differently" from the (presumably) typical SDP user is
> send large-ish messages at once (up to ~32 kB), possibly unaligned.
>
> Apparently this is a mode that SDP has not been exercised in much yet;
> otherwise the recently fixed page leak would have been noticed by
> someone much earlier.
>
>   
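
For reference, here is a hypothetical stress sender along the lines Lars
describes (large-ish, oddly sized writes on a stream socket). It is a
sketch, not the exact test he ran: it is written as a plain TCP client,
to be run under libsdp (LD_PRELOAD=libsdp.so) so the traffic goes over
SDP, and it needs a receiver on the other side that verifies the byte
pattern to catch lost or corrupted bytes.

    /* hypothetical SDP stress sender: ~32 kB unaligned writes.
     * plain AF_INET client; run under libsdp's LD_PRELOAD to map
     * the socket onto SDP.  not the exact test from this thread. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int main(int argc, char **argv)
    {
            /* 32 kB plus an odd remainder, so send boundaries fall
             * at unaligned offsets in the stream */
            static char buf[32 * 1024 + 7];
            struct sockaddr_in sa;
            size_t i;
            int fd;

            if (argc != 3) {
                    fprintf(stderr, "usage: %s <ip> <port>\n", argv[0]);
                    return 1;
            }

            /* known pattern so the receiver can detect dropped or
             * corrupted bytes */
            for (i = 0; i < sizeof(buf); i++)
                    buf[i] = (char)(i & 0xff);

            fd = socket(AF_INET, SOCK_STREAM, 0);
            if (fd < 0) {
                    perror("socket");
                    return 1;
            }

            memset(&sa, 0, sizeof(sa));
            sa.sin_family = AF_INET;
            sa.sin_port = htons((unsigned short)atoi(argv[2]));
            if (inet_pton(AF_INET, argv[1], &sa.sin_addr) != 1) {
                    fprintf(stderr, "bad address: %s\n", argv[1]);
                    return 1;
            }

            if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
                    perror("connect");
                    return 1;
            }

            for (;;) {
                    if (write(fd, buf, sizeof(buf)) < 0) {
                            perror("write");
                            return 1;
                    }
            }
    }
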
diff --git a/drivers/infiniband/ulp/sdp/sdp_bcopy.c b/drivers/infiniband/ulp/sdp/sdp_bcopy.c
index a090868..a3587ce 100644
--- a/drivers/infiniband/ulp/sdp/sdp_bcopy.c
+++ b/drivers/infiniband/ulp/sdp/sdp_bcopy.c
@@ -730,6 +730,10 @@ static int sdp_handle_recv_comp(struct sdp_sock *ssk, struct ib_wc *wc)
 			break;
 		}
 		skb = sdp_sock_queue_rcv_skb(sk, skb);
+
+		/* sdp_sock_queue_rcv_skb() may have replaced the skb; re-read the header */
+		h = (struct sdp_bsdh *)skb_transport_header(skb);
+
 		if (unlikely(h->flags & SDP_OOB_PRES))
 			sdp_urg(ssk, skb);
 		break;