Hi,

I have a Lustre 1.8.1.1 system (MDS and OSS, all CentOS 5.3) with Lustre 1.6.4.3 clients (Debian etch) running without problems.
I now have 4 additional OSS nodes, which I set up using the new Lustre 1.8.2. But I can't lctl ping between 1.8.1.1 nodes and 1.8.2 nodes over InfiniBand. To be more precise:

OSS node 1:

[r...@oss01 ~]# ifconfig | grep -C1 ib0
ib0 Link encap:InfiniBand HWaddr ...
inet addr:172.16.30.134 Bcast:172.16.30.255 Mask:255.255.255.0
[r...@oss01 ~]# uname -a
Linux oss01 2.6.18-164.11.1.el5_lustre.1.8.2 #1 SMP Fri Jan 22 19:11:17 MST 2010 x86_64 x86_64 x86_64 GNU/Linux

OSS node 5:

[r...@oss05 ~]# ifconfig | grep -C1 ib0
ib0 Link encap:InfiniBand HWaddr ...
inet addr:172.16.30.138 Bcast:172.16.30.255 Mask:255.255.255.0
[r...@oss05 ~]# uname -a
Linux oss05 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1 SMP Tue Oct 6 05:48:57 MDT 2009 x86_64 x86_64 x86_64 GNU/Linux

The InfiniBand network is up and running; I can ping oss01 from oss05 and vice versa:

[r...@oss01 ~]# ping 172.16.30.138
PING 172.16.30.138 (172.16.30.138) 56(84) bytes of data.
64 bytes from 172.16.30.138: icmp_seq=1 ttl=64 time=0.125 ms
64 bytes from 172.16.30.138: icmp_seq=2 ttl=64 time=0.083 ms

[r...@oss05 ~]# ping 172.16.30.134
PING 172.16.30.134 (172.16.30.134) 56(84) bytes of data.
64 bytes from 172.16.30.134: icmp_seq=1 ttl=64 time=2.19 ms
64 bytes from 172.16.30.134: icmp_seq=2 ttl=64 time=0.076 ms

And I am able to lctl ping each machine on its own address:

[r...@oss01 ~]# lctl ping 172.16.30....@o2ib
1234...@lo
12345-172.16.30....@o2ib

[r...@oss05 ~]# lctl ping 172.16.30....@o2ib
1234...@lo
12345-172.16.30....@o2ib

But I can't lctl ping the other machine:

[r...@oss01 ~]# lctl ping 172.16.30....@o2ib
failed to ping 172.16.30....@o2ib: Protocol error

[r...@oss05 ~]# lctl ping 172.16.30....@o2ib
failed to ping 172.16.30....@o2ib: Protocol error

The dmesg/messages output is somewhat longer, but no other errors are logged apart from this line:

[r...@oss01 ~]# dmesg |tail -1
LustreError: 8855:0:(api-ni.c:1781:lnet_ping()) 12345-172.16.30....@o2ib: Unexpected version 0x1

[r...@oss05 ~]# dmesg |tail -1
LustreError: 19249:0:(api-ni.c:1735:lnet_ping()) 12345-172.16.30....@o2ib: Unexpected version 0x2

I did not find anything about "Unexpected version 0x?" using Google ...

So I can't mix 1.8.1.1 nodes and 1.8.2 nodes. That in itself would not be a major problem, because I could upgrade the "older" MDS and OSS nodes to 1.8.2 as well, but I currently can't upgrade the 1.6.4.3 Lustre clients. And the client nodes can't be lctl ping'ed from Lustre 1.8.2 either (172.16.30.70 being one client IP):

[r...@oss01 ~]# lctl ping 172.16.30...@o2ib
failed to ping 172.16.30...@o2ib: Protocol error

I have almost no InfiniBand know-how (I inherited this system), so sorry if my question is a stupid one: What is going on here, and is there a simple way to solve this problem of no LNET connectivity between Lustre 1.8.2 and the older 1.8.1.1/1.6.4.3 nodes?

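
In case it helps anyone reproduce this on more nodes: a minimal sketch for pulling the peer NID and the reported version out of the kernel log, assuming the "Unexpected version" message format quoted above. The function name summarize_unexpected_versions and the suggested pipeline in the comments are my own placeholders, not anything shipped with Lustre, and I have only reasoned about it against the two log lines shown here.

```shell
#!/bin/sh
# Read kernel log lines on stdin and print "NID VERSION" for every
# lnet_ping() "Unexpected version" message. Assumed line format:
#   LustreError: PID:0:(api-ni.c:LINE:lnet_ping()) 12345-NID: Unexpected version 0xN
# Lines that do not match are silently dropped.
summarize_unexpected_versions() {
    sed -n 's/.*lnet_ping()) [0-9]*-\([^:]*\): Unexpected version \(0x[0-9a-fA-F]*\).*/\1 \2/p'
}

# Possible use on a node (hypothetical pipeline, not run for this post):
#   dmesg | summarize_unexpected_versions | sort | uniq -c
```

Run on each server, this would show at a glance which peers report which version value, which is the only hint the logs give about the mismatch.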
With regards,
Alex

-- 
Alexander Bugl, Central IT Services, ZMAW
Max Planck Institute for Meteorology
Bundesstrasse 53, D-20146 Hamburg, Germany
tel +49-40-41173-351, fax -356, room PE048
