OK - the bug is now reproduced on your kernel + config. I will check it now.
On 07/08/2009 01:33 PM, Amir Vadai wrote:
> see below
>
> On 07/08/2009 01:17 PM, Lars Ellenberg wrote:
>> On Wed, Jul 08, 2009 at 11:12:15AM +0300, Amir Vadai wrote:
>>> Hi Lars,
>>>
>>> I opened a bug in our bugzilla
>>> (https://bugs.openfabrics.org/show_bug.cgi?id=1672).
>>>
>>> I couldn't reproduce it on my setup: SLES 10 SP2, stock kernel, same OFED
>>> git version.
>>> I will now try to install a 2.6.27 kernel and check again.
>>
>> With a "normal" kernel config, I needed full-load bi-directional
>> network traffic on IPoIB as well as SDP, over multiple stream sockets,
>> to eventually trigger it after a few minutes
>> (several hundred megabytes per second).
>>
>> With the "debug" kernel config,
>> I was able to reproduce it with only one socket,
>> within milliseconds.
>>
>> My .config is attached.
>
> I will test it with your config and kernel version.
>
>>> BTW, what type of servers do you use? Are they low-end or high-end servers?
>>
>> This is the second cluster that shows this bug. I first experienced it
>> when using SDP sockets from within kernel space.
>> I was able to reproduce it in userland,
>> which I thought might make it easier for you to reproduce.
>>
>> The current test cluster is a slightly aged 2U Supermicro dual quad-core
>> with 4 GB RAM, and it has proved to be very reliable hardware in all tests
>> up to now. It may be a little slow on interrupts.
>>
>> Tail of /proc/cpuinfo:
>> processor       : 7
>> vendor_id       : GenuineIntel
>> cpu family      : 6
>> model           : 15
>> model name      : Intel(R) Xeon(R) CPU E5310 @ 1.60GHz
>> stepping        : 7
>> cpu MHz         : 1599.984
>> cache size      : 4096 KB
>> physical id     : 1
>> siblings        : 4
>> core id         : 3
>> cpu cores       : 4
>> apicid          : 7
>> initial apicid  : 7
>> fpu             : yes
>> fpu_exception   : yes
>> cpuid level     : 10
>> wp              : yes
>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
>> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
>> syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl pni
>> monitor ds_cpl vmx tm2 ssse3 cx16 xtpr dca lahf_lm
>> bogomips        : 3201.35
>> clflush size    : 64
>> cache_alignment : 64
>> address sizes   : 36 bits physical, 48 bits virtual
>> power management:
>>
>> The IB setup is a direct link; lspci says:
>> 09:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX IB QDR, PCIe 2.0
>> 5GT/s] (rev a0)
>>
>> Because IPoIB works just fine, I don't think we have issues
>> with the IB setup, or with the hardware in general.
>> It is broken only when using SDP: it "forgets" bytes, or corrupts data.
>
> I'm testing on low-end SDR machines, and sometimes we have bugs that we
> only see on high-bandwidth setups.
> And you have such a setup.
>
>> What I do "differently" than the (assumed) typical SDP user is:
>> send large-ish messages at once (up to ~32 kB), possibly unaligned.
>>
>> That is apparently a mode that SDP has not exercised much yet;
>> otherwise the recently fixed page leak would have been noticed by
>> someone much earlier.
>
> - Amir
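For anyone trying to reproduce this, a minimal userland sender along the lines Lars describes might look like the sketch below. This is an illustrative assumption, not the actual test program from this thread: it streams large-ish (up to ~32 kB), deliberately unaligned writes over an SDP socket, filled with a position-dependent byte pattern so the peer can detect missing or corrupted data. AF_INET_SDP = 27 matches OFED's libsdp headers, but verify it, and the sockaddr family, against your own stack.

/*
 * Hypothetical reproducer sketch (not the actual test program from this
 * thread): stream large, deliberately unaligned writes over an SDP
 * socket, filled with a position-dependent pattern so the peer can
 * detect lost or corrupted bytes.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

#ifndef AF_INET_SDP
#define AF_INET_SDP 27   /* value used by OFED's libsdp; verify locally */
#endif

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <server-ip> <port>\n", argv[0]);
        return 1;
    }

    int fd = socket(AF_INET_SDP, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket(AF_INET_SDP)"); return 1; }

    /* SDP sockets take ordinary IPv4 addresses; some stacks want
     * sin_family = AF_INET_SDP instead of AF_INET - adjust as needed. */
    struct sockaddr_in sa = { .sin_family = AF_INET };
    sa.sin_port = htons(atoi(argv[2]));
    inet_pton(AF_INET, argv[1], &sa.sin_addr);
    if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        perror("connect"); return 1;
    }

    char buf[3 + 32 * 1024];        /* slack bytes for misalignment */
    unsigned long long pos = 0;     /* absolute stream offset */

    for (;;) {
        size_t off = pos % 3 + 1;               /* start unaligned */
        size_t len = 1024 + pos % (31 * 1024);  /* up to ~32 kB per send */
        char *p = buf + off;

        for (size_t i = 0; i < len; i++)        /* offset-derived bytes */
            p[i] = (char)((pos + i) & 0xff);

        ssize_t n = write(fd, p, len);          /* one large unaligned write */
        if (n < 0) { perror("write"); return 1; }
        pos += n;   /* a short write just restarts the pattern correctly */
    }
}

Running several instances of this in both directions, alongside IPoIB load, should approximate the multi-stream, full-load setup Lars describes above.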

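A companion verifier sketch for the receiving side, under the same assumptions: because the pattern is derived from the absolute stream offset, a dropped ("forgotten") byte shifts everything after it and shows up the same way as in-place corruption, as a mismatch at a specific offset.

/*
 * Companion verifier sketch, same assumptions as the sender above:
 * accept one connection and check every received byte against the
 * offset-derived pattern, reporting the first mismatching offset.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

#ifndef AF_INET_SDP
#define AF_INET_SDP 27   /* value used by OFED's libsdp; verify locally */
#endif

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <port>\n", argv[0]);
        return 1;
    }

    int lfd = socket(AF_INET_SDP, SOCK_STREAM, 0);
    if (lfd < 0) { perror("socket(AF_INET_SDP)"); return 1; }

    struct sockaddr_in sa = { .sin_family = AF_INET };
    sa.sin_port = htons(atoi(argv[1]));
    sa.sin_addr.s_addr = htonl(INADDR_ANY);
    if (bind(lfd, (struct sockaddr *)&sa, sizeof(sa)) < 0 ||
        listen(lfd, 1) < 0) {
        perror("bind/listen"); return 1;
    }

    int fd = accept(lfd, NULL, NULL);
    if (fd < 0) { perror("accept"); return 1; }

    char buf[64 * 1024];
    unsigned long long pos = 0;     /* expected absolute stream offset */

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n == 0) { puts("EOF, no corruption seen"); return 0; }
        if (n < 0)  { perror("read"); return 1; }
        for (ssize_t i = 0; i < n; i++, pos++) {
            /* a dropped byte shifts the stream, so it mismatches here too */
            if (buf[i] != (char)(pos & 0xff)) {
                fprintf(stderr,
                        "mismatch at offset %llu: got 0x%02x, want 0x%02llx\n",
                        pos, (unsigned char)buf[i], pos & 0xff);
                return 1;
            }
        }
    }
}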