On Wed, Feb 07, 2007 at 07:05:17AM -0500, Robin Humble wrote:
>Are the InfiniBand kernel modules that come with the Lustre 1.5.97
>RHEL4 rpms supposed to work out of the box?
>
>I'm trying the stock Lustre 1.5.97 rpms on x86_64 centos4.4 with IB.
>IB works fine for MPI communication, just not well with Lustre when
>there's more than a few OSTs.
>
>The problem can be seen easily by running 'bonnie++ -s 0 -n 2' (2048
>zero-sized files) in a striped dir. bonnie completes as expected without
>Lustre striping, but hangs indefinitely with piles of Lustre timeouts
>when striping is set and there's say, more than 5 OSTs (see the lovely
>messages below).
>The identical Lustre setup over tcp instead of o2ib works just fine.
>
>If anyone has 1.5.97 + IB and standard Lustre rpms working reliably
>in a non-trivial setup, then I'll happily stop trying to wrangle newer
>OFED into the kernels, and will instead perhaps blame our hardware.

we updated the firmware on our IB DDR cards to 1.2.000 (was 1.0.700)
and now the stock Lustre 1.5.97 kernel rpms happily pass the bonnie++
test with 10 OSTs. whoo! :)

before we updated the firmware, I tried OFED 1.1 kernel modules but
that didn't help.

thanks to everyone for their help.

cheers,
robin


>
>setup:
> on x19:
>   mkfs.lustre --fsname=testfs --mdt --mgs --reformat /dev/sdb3
>   mount -t lustre /dev/sdb3 /mnt/mdt
> on 10 other nodes:
>   mkfs.lustre --fsname=testfs --ost [EMAIL PROTECTED] --reformat /dev/sdb3
>   mount -t lustre /dev/sdb3 /mnt/ost1
> on x18 (another node):
>   mount -t lustre [EMAIL PROTECTED]:/testfs /mnt/testfs
>
> x18 % bonnie++ -s 0 -n 2
>  (takes about 20 seconds to complete happily)
> x18 % lfs setstripe . 1048576 -1 -1
> x18 % bonnie++ -s 0 -n 2
> Create files in sequential order...done.
> Stat files in sequential order...
>  (hangs indefinitely pumping out timeouts)
>
>everything can still ping everything over IPoIB at this stage.
>errors (see below) spool out indefinitely...
>
>cheers,
>robin
>
>
>Feb  7 22:19:16 x18 kernel: LustreError: 3210:0:(o2iblnd_cb.c:2793:kiblnd_check_conns()) Timed out RDMA with [EMAIL PROTECTED]
>Feb  7 22:19:16 x18 kernel: Lustre: 6:0:(linux-debug.c:98:libcfs_run_upcall()) Invoked LNET upcall /usr/lib/lustre/lnet_upcall ROUTER_NOTIFY,[EMAIL PROTECTED],down,1170847104
>Feb  7 22:19:17 x18 kernel: LustreError: 3210:0:(o2iblnd_cb.c:2793:kiblnd_check_conns()) Timed out RDMA with [EMAIL PROTECTED]
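
for anyone chasing the same symptom, here is a minimal sketch (not part of
Lustre, just an illustration) that tallies the o2iblnd "Timed out RDMA"
messages per peer, which may help show whether the timeouts cluster on a
few nodes. it assumes each syslog message sits on a single line, as it
would in /var/log/messages rather than in this wrapped mail, and the
function name count_rdma_timeouts is made up for this example.

```python
import re
from collections import Counter

# Matches the kiblnd_check_conns() timeout message in the format quoted in
# this thread. The peer is taken to be whatever follows
# "Timed out RDMA with" up to the end of the line (an assumption about
# the log format, based only on the excerpt above).
TIMEOUT_RE = re.compile(
    r"kiblnd_check_conns\(\)\) Timed out RDMA with (?P<peer>.+?)\s*$"
)


def count_rdma_timeouts(lines):
    """Return a Counter mapping each peer to its number of RDMA timeouts."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group("peer")] += 1
    return counts
```

usage would be something like count_rdma_timeouts(open("/var/log/messages")),
then printing the result of .most_common() to see the worst peers first.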

_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
