Liang,

Right, you reproduced the exact problem. But as you can see in my previous mail, I think I have solved that problem by manually assigning an IP to ib0 (check the line "# ifconfig ib0 172.24.198.111" and the *"Added LNI"* lines).
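A rough sketch of the ordering this implies, with the address put on ib0 before ko2iblnd is loaded. It assumes ib_ipoib is not loaded yet and that the 255.255.0.0 netmask from the ifconfig output applies. The variable names and the bringup_cmds helper are made up for illustration, and the script only prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch only: prints the bring-up order instead of executing it, so it is
# safe to run on a machine without IB hardware.
# Assumptions (not confirmed in this thread): ib_ipoib is not yet loaded,
# and the netmask matches the Mask:255.255.0.0 shown by ifconfig.
IB_IF=ib0
IB_ADDR=172.24.198.111
IB_MASK=255.255.0.0

# Hypothetical helper: emits one command per line, in the intended order.
bringup_cmds() {
    echo "modprobe ib_ipoib"
    echo "ifconfig $IB_IF $IB_ADDR netmask $IB_MASK up"   # address BEFORE ko2iblnd
    echo "modprobe ko2iblnd"
    echo "lctl network up"
    echo "lctl list_nids"
}

bringup_cmds
```

Running it as-is only lists the steps. On a real node each printed line would be run as root, and the last step would be expected to list the o2ib NID if the LNI came up.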
We are back to square one now, I guess! LNET is up with manually assigned IPs. Normal ping succeeds between the machines, but lctl ping does not. So my current problem is this:

# lctl ping 172.24.198....@o2ib
failed to ping 172.24.198....@o2ib: Input/output error

/var/log/messages:
Jan 16 10:24:14 p128 kernel: Lustre: 2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback()) 172.24.198....@o2ib: ROUTE ERROR -22
Jan 16 10:24:14 p128 kernel: Lustre: 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting messages for 172.24.198....@o2ib: connection failed

How can I get rid of this connection problem?

~subbu

On Fri, Jan 16, 2009 at 2:11 PM, Liang Zhen <[email protected]> wrote:

> Subbu,
>
> We don't have any tips for setting up IPoIB. It looks like Linux can't find
> the ifaddr of ib0 on the MDS (-99 is EADDRNOTAVAIL), so I think it's because
> you didn't assign any address to ib0 (or failed to assign an address to ib0)
> before loading o2iblnd on the first try.
> I can reproduce exactly the same error by:
> 1. modprobe ib_ipoib
> 2. ifconfig ib0 up // without assigning any address
> 3. modprobe ko2iblnd
> 4. lctl network up
>
> Regards
> Liang
>
> subbu kl:
>
>> Liang,
>> after executing the following echo:
>> echo +neterror > /proc/sys/lnet/printk
>>
>> now lctl ping shows the following error
>>
>> # lctl ping 172.24.198....@o2ib
>> failed to ping 172.24.198....@o2ib: Input/output error
>>
>> Jan 16 10:24:14 p128 kernel: Lustre: 2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback()) 172.24.198....@o2ib: ROUTE ERROR -22
>> Jan 16 10:24:14 p128 kernel: Lustre: 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting messages for 172.24.198....@o2ib: connection failed
>>
>> Looks like some problem with the "IB connection manager"!
>>
>> 1. Do we have any help docs for setting up IPoIB and Lustre? The Lustre
>> operations manual has very minimal info about this. I think I am missing
>> some IPoIB setup part here.
>> 2. Or is the manual assignment of IP addresses to "ib0" creating some
>> problem?
>>
>>
>> *Some more supporting info:*
>> A subnet manager of the following version is also running: OpenSM 3.1.8
>>
>> Initially I got this error for the MDS mount
>>
>> Jan 16 09:45:20 p128 kernel: LustreError: 4991:0:(linux-tcpip.c:124:libcfs_ipif_query()) Can't get IP address for interface ib0
>> Jan 16 09:45:20 p128 kernel: LustreError: 4991:0:(o2iblnd.c:1563:kiblnd_startup()) Can't query IPoIB interface ib0: -99
>> Jan 16 09:45:21 p128 kernel: LustreError: 105-4: Error -100 starting up LNI o2ib
>> Jan 16 09:45:21 p128 kernel: LustreError: 4991:0:(events.c:707:ptlrpc_init_portals()) network initialisation failed
>> Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting ptlrpc (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/ptlrpc.ko): Input/output error
>> Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting osc (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/osc.ko): Unknown symbol in module, or unknown parameter (see dmesg)
>> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_prep_enqueue_req
>> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_resource_get
>> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ptlrpc_lprocfs_register_obd
>> .
>> .
>> .
>>
>> then I manually set the IP address for ib0 as follows:
>> # ifconfig ib0 172.24.198.111
>>
>> [r...@p186 ~]# ifconfig ib0
>> ib0       Link encap:InfiniBand  HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>>           inet addr:172.24.198.112  Bcast:172.24.255.255  Mask:255.255.0.0
>>           UP BROADCAST MULTICAST  MTU:65520  Metric:1
>>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:256
>>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>>
>> then it mounted successfully
>>
>> *Jan 16 09:47:09 p128 kernel: Lustre: Added LNI 172.24.198....@o2ib[8/64]
>> Jan 16 09:47:09 p128 kernel: Lustre: MGS MGS started*
>> Jan 16 09:47:09 p128 kernel: Lustre: Setting parameter lustre-MDT0000.mdt.group_upcall in log lustre-MDT0000
>> Jan 16 09:47:09 p128 kernel: Lustre: Enabling user_xattr
>> Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000: new disk, initializing
>> Jan 16 09:47:09 p128 kernel: Lustre: MDT lustre-MDT0000 now serving dev (lustre-MDT0000/64db1fc7-03ba-9803-4d20-ab0d2aa66116) with recovery enabled
>> Jan 16 09:47:09 p128 kernel: Lustre: 5274:0:(lproc_mds.c:262:lprocfs_wr_group_upcall()) lustre-MDT0000: group upcall set to /usr/sbin/l_getgroups
>> Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000.mdt: set parameter group_upcall=/usr/sbin/l_getgroups
>> Jan 16 09:47:09 p128 kernel: Lustre: Server lustre-MDT0000 on device /dev/loop0 has started
>> .
>> .
>> .
>>
>>
>> ~subbu
>>
>>
>> On Thu, Jan 15, 2009 at 8:37 PM, Liang Zhen <[email protected]> wrote:
>>
>>     Subbu,
>>
>>     I'd suggest:
>>     1) make sure ko2iblnd has been brought up (please check if there
>>     is any error message when starting up ko2iblnd)
>>     2) echo +neterror > /proc/sys/lnet/printk, then try lctl ping
>>     again; if it still doesn't work, please post the error messages
>>
>>     Regards
>>     Liang
>>
>>     subbu kl:
>>
>>         The problem is similar to
>>         http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html
>>         but from looking at that thread I could not really get the
>>         solution for the problem.
>>
>>         I have two RHEL5 Linux servers installed with the following packages -
>>
>>         kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1
>>         kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
>>         lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
>>         lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
>>         lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
>>         e2fsprogs-1.40.7.sun3-0redhat
>>
>>
>>         machine 1: with ib0 IP address : 172.24.198.111
>>         machine 2: with ib0 IP address : 172.24.198.112
>>
>>         /etc/modprobe.conf contains
>>         options lnet networks=o2ib
>>
>>         TCP networking worked fine. Now I am trying the InfiniBand
>>         network and finding it difficult to communicate with the IB
>>         nodes. The mount attempt throws the following error:
>>
>>         [r...@p186 ~]# mount -t lustre -o loop /tmp/lustre-ost1 /mnt/ost1
>>         mount.lustre: mount /dev/loop0 at /mnt/ost1 failed: Input/output error
>>         Is the MGS running?
>>
>>         /var/log/messages :
>>         Jan 15 16:55:25 p186 kernel: kjournald starting. Commit interval 5 seconds
>>         Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal
>>         Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
>>         Jan 15 16:55:25 p186 kernel: kjournald starting. Commit interval 5 seconds
>>         Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal
>>         Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
>>         Jan 15 16:55:25 p186 kernel: LDISKFS-fs: file extents enabled
>>         Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mballoc enabled
>>         Jan 15 16:55:30 p186 kernel: Lustre: Request x7 sent from mgc172.24.198....@o2ib to NID 172.24.198....@o2ib 5s ago has timed out (limit 5s).
>>         Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:1062:server_start_targets()) Required registration failed for lustre-OSTffff: -5
>>         Jan 15 16:55:30 p186 kernel: LustreError: 15f-b: Communication error with the MGS. Is the MGS running?
>>         Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:1597:server_fill_super()) Unable to start targets: -5
>>         Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:1382:server_put_super()) no obd lustre-OSTffff
>>         Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:119:server_deregister_mount()) lustre-OSTffff not registered
>>         Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 success)
>>         Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 extents scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost
>>         Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 generated and it took 0
>>         Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 preallocated, 0 discarded
>>         Jan 15 16:55:30 p186 kernel: Lustre: server umount lustre-OSTffff complete
>>         Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:1951:lustre_fill_super()) Unable to mount (-5)
>>
>>         All pinging efforts to the IB NIDs, local and remote, also
>>         failed. I can ping the IP addresses:
>>         [r...@p186 ~]# ping 172.24.198.112
>>         PING 172.24.198.112 (172.24.198.112) 56(84) bytes of data.
>>         64 bytes from 172.24.198.112: icmp_seq=1 ttl=64 time=0.052 ms
>>         64 bytes from 172.24.198.112: icmp_seq=2 ttl=64 time=0.024 ms
>>
>>         --- 172.24.198.112 ping statistics ---
>>         2 packets transmitted, 2 received, 0% packet loss, time 1000ms
>>         rtt min/avg/max/mdev = 0.024/0.038/0.052/0.014 ms
>>         [r...@p186 ~]# ping 172.24.198.111
>>         PING 172.24.198.111 (172.24.198.111) 56(84) bytes of data.
>>         64 bytes from 172.24.198.111: icmp_seq=1 ttl=64 time=2.16 ms
>>         64 bytes from 172.24.198.111: icmp_seq=2 ttl=64 time=0.296 ms
>>
>>         --- 172.24.198.111 ping statistics ---
>>         2 packets transmitted, 2 received, 0% packet loss, time 1000ms
>>         rtt min/avg/max/mdev = 0.296/1.231/2.166/0.935 ms
>>
>>         but I can't ping the NIDs:
>>         [r...@p186 ~]# lctl ping 172.24.198....@o2ib
>>         failed to ping 172.24.198....@o2ib: Input/output error
>>         [r...@p186 ~]# lctl ping 172.24.198....@o2ib
>>         failed to ping 172.24.198....@o2ib: Input/output error
>>
>>         Any idea why LNET can't ping the NIDs?
>>
>>         Some more configuration:
>>         [r...@p186 ~]# ibstat
>>         CA 'mthca0'
>>             CA type: MT23108
>>             Number of ports: 2
>>             Firmware version: 3.5.0
>>             Hardware version: a1
>>             Node GUID: 0x0002c9020021550c
>>
>>         The machines are connected via an IB switch.
>>
>>         Looking forward to your help.
>>
>>         ~subbu
>>
>>     ------------------------------------------------------------------------
>>
>>     _______________________________________________
>>     Lustre-discuss mailing list
>>     [email protected]
>>     http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>>
>> --
>> . . . s u b b u
>> "You've got to be original, because if you're like someone else, what do
>> they need you for?"
>>
>

--
. . . s u b b u
"You've got to be original, because if you're like someone else, what do they need you for?"
