Howdy,

Lustre 1.8.5 using the EL5-provided RPMs on both clients and servers:
lustre-client-modules-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5
lustre-client-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5

The servers and clients are all running CentOS 5.5 x86_64 with kernel 
2.6.18-194.17.1.el5 (the servers run the Lustre-patched kernel).

We have two InfiniBand networks, o2ib0 and o2ib1, as well as Ethernet. Here's 
the lnet modprobe line used on the MDS and OSSes:

options lnet networks="o2ib0(ib0),o2ib1(ib1),tcp0(eth0)"
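For comparison, a minimal client-side lnet line for the o2ib0 compute nodes might look like the fragment below. This is an assumption about your client setup (the actual file path and interface name may differ on your nodes), shown only to make the two-sided configuration concrete:

```
# /etc/modprobe.d/lustre.conf on an o2ib0 client (hypothetical)
options lnet networks="o2ib0(ib0)"
```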

The compute nodes that mount via tcp0 don't have any problems.
The compute nodes that mount via o2ib1 do not have any problems.

The compute nodes attached to o2ib0 fail to mount the Lustre file system at 
boot (output of dmesg is at the end).

The compute nodes are Dell M610 blades. There are 3 Dell M1000e chassis 
switches (Mellanox InfiniScale IV M3601Q 32 port 40Gb/s switches), each 
attached to a QLogic 12300 36 port QDR switch via 8 cables. The compute nodes 
are directly attached to the M3601Q switches internally (blades). The Lustre 
servers are attached directly to the QLogic 12300 switch.

All of our InfiniBand tests have checked out, and the switches do not report 
any errors.

Here's the sequence that lets me mount the Lustre file system after a compute 
node has booted:
1. ssh to the node

2. Check ibstat to ensure the card reports the port as Active: success

3. Run ibswitches to confirm the node can see the switches: success

4. Ping another IPoIB address using regular ping:
# ping 192.168.2.20
PING 192.168.2.20 (192.168.2.20) 56(84) bytes of data.
64 bytes from 192.168.2.20: icmp_seq=1 ttl=64 time=1.93 ms

5. Try to ping the MDS using lctl ping:
# lctl ping 192.168.2.20@o2ib
failed to ping 192.168.2.20@o2ib: Input/output error

6. Try it again (this step isn't strictly necessary; after the single failed 
ping, I can mount directly):
# lctl ping 192.168.2.20@o2ib
12345-0@lo
12345-192.168.2.20@o2ib
12345-192.168.3.20@o2ib1
12345-172.20.0.20@tcp

7. Now mount Lustre:
# mount /lustre
# mount | grep lustre
192.168.2.20@o2ib:/lustre on /lustre type lustre (rw,_netdev)

Instead of the "lctl ping" I can also attempt a mount, which fails, followed 
by a second mount, which succeeds.
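As a stopgap until the root cause is found, the workaround above can be scripted for boot time. This is only a sketch: it assumes the MGS NID and /lustre fstab entry shown in the steps above, and the retry counts are arbitrary:

```shell
#!/bin/sh
# Sketch of the boot-time workaround: retry "lctl ping" until the
# o2ib0 connection comes up, then mount. MGS_NID and MOUNT_POINT
# match the values shown above; adjust for your site.
MGS_NID="192.168.2.20@o2ib"
MOUNT_POINT="/lustre"

retry() {
    # retry ATTEMPTS DELAY COMMAND...: rerun COMMAND up to ATTEMPTS
    # times, sleeping DELAY seconds between failed tries.
    attempts=$1; delay=$2; shift 2
    n=0
    while [ "$n" -lt "$attempts" ]; do
        "$@" && return 0
        n=$((n + 1))
        sleep "$delay"
    done
    return 1
}

# Only attempt this on a node with the Lustre utilities installed.
if command -v lctl >/dev/null 2>&1; then
    # The first ping typically fails with an I/O error; a later
    # attempt succeeds, after which the mount goes through.
    retry 5 2 lctl ping "$MGS_NID" && mount "$MOUNT_POINT"
fi
```

Run from rc.local (or hooked after the netfs init script) this papers over the first-contact failure, though it obviously doesn't explain why the first o2ib0 ping fails in the first place.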

Here are the messages logged during boot. Does anyone have any suggestions? 
Thanks, Mike

Lustre: Listener bound to ib0:192.168.2.229:987:mlx4_0
Lustre: Register global MR array, MR size: 0xffffffffffffffff, array size: 1
Lustre: Added LNI 192.168.2.229@o2ib [8/64/0/180]
Lustre: Lustre Client File System; http://www.lustre.org/
Lustre: 4989:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request 
x1363390903615489 sent from MGC192.168.2.20@o2ib to NID 192.168.2.20@o2ib 5s 
ago has timed out (5s prior to deadline).
  req@ffff81062b455c00 x1363390903615489/t0 
o250->[email protected]@o2ib_0:26/25 lens 368/584 e 0 to 1 dl 1300230903 ref 
2 fl Rpc:N/0/0 rc 0/0
LustreError: 5076:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  
req@ffff81062b455000 x1363390903615491/t0 
o501->[email protected]@o2ib_0:26/25 lens 264/432 e 0 to 1 dl 0 ref 1 fl 
Rpc:/0/0 rc 0/0
LustreError: 15c-8: MGC192.168.2.20@o2ib: The configuration from log 
'lustre-client' failed (-108). This may be the result of communication errors 
between this node and the MGS, a bad configuration, or other errors. See the 
syslog for more information.
LustreError: 5076:0:(llite_lib.c:1079:ll_fill_super()) Unable to process log: 
-108
Lustre: client ffff81062dddf400 umount complete
LustreError: 5076:0:(obd_mount.c:2050:lustre_fill_super()) Unable to mount  
(-108)
ib_srp: ASYNC event= 11 on device= mlx4_0
ib_srp: ASYNC event= 17 on device= mlx4_0
ib_srp: ASYNC event= 9 on device= mlx4_0
ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
