Subbu,

I don't think we would see anything in tcpdump even if the ping runs
successfully, because we only need IPoIB for establishing the connection
(not for the data transfer itself). I think we need the following
information to diagnose this:

1. modprobe.conf of the two nodes with IB
2. ifconfig output on these two nodes
3. routing table on these two nodes
4. try lctl ping to the node's own NID on both nodes and see if there is
   any error (with +neterror)
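As a rough sketch, the four items above could be gathered on each node with a small shell script along these lines. The function names `section` and `collect_diag` are made up for illustration, `netstat -rn` is just one way to print the routing table, and the script assumes it is run as root with the LNET modules already loaded.

```shell
#!/bin/sh
# Sketch only: collect the requested diagnostics on one IB node.
# Assumptions (not from the thread): helper names, use of netstat -rn,
# and running as root with lnet/ko2iblnd already loaded.

# Print a header, then run the given command if its binary is available.
section() {
    title=$1
    shift
    printf '===== %s =====\n' "$title"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        printf '(skipped: %s not found)\n' "$1"
    fi
}

collect_diag() {
    section "modprobe.conf" cat /etc/modprobe.conf
    section "ifconfig" ifconfig -a
    section "routing table" netstat -rn

    # Turn on network error messages before pinging, as suggested earlier
    # in the thread; only attempted if the proc file is writable.
    if [ -w /proc/sys/lnet/printk ]; then
        echo +neterror > /proc/sys/lnet/printk
    fi

    # Ping each of this node's own NIDs (empty loop if lctl is missing).
    for nid in $(lctl list_nids 2>/dev/null); do
        section "lctl ping $nid" lctl ping "$nid"
    done
}
```

Running `collect_diag > diag-$(hostname).txt` on both nodes would give one file per node to post; any errors triggered by the lctl ping should then show up in /var/log/messages.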
Regards
Liang

subbu kl:
> problem remained same, when I run lctl ping with tcpdump 4.0.0 I dont
> see any activity on ib0 !
>
> another exhaustive Lustre debug log I took with lctl ping do you see
> any problem with it ?
>
> Jan 23 17:23:39 p186 kernel: Lustre:
> 14294:0:(module.c:160:libcfs_psdev_open()) Process entered
> Jan 23 17:23:39 p186 kernel: Lustre:
> 14294:0:(module.c:164:libcfs_psdev_open()) kmalloced 'ldu': 8 at
> f5bc6620 (tot 7258558).
> Jan 23 17:23:39 p186 kernel: Lustre:
> 14294:0:(module.c:171:libcfs_psdev_open()) Process leaving (rc=0 : 0 : 0)
> Jan 23 17:23:39 p186 kernel: Lustre:
> 14294:0:(module.c:228:libcfs_ioctl()) Process entered
> Jan 23 17:23:39 p186 kernel: Lustre:
> 14294:0:(linux-module.c:49:libcfs_ioctl_getdata()) Process entered
> Jan 23 17:23:39 p186 kernel: Lustre:
> 14294:0:(linux-module.c:90:libcfs_ioctl_getdata()) Process leaving
> (rc=0 : 0 : 0)
> Jan 23 17:23:39 p186 kernel: Lustre:
> 14294:0:(api-ni.c:1223:LNetNIInit()) refs 1
> Jan 23 17:23:39 p186 kernel: Lustre:
> 14294:0:(api-ni.c:1614:lnet_ping()) kmalloced 'info': 144 at f0b95880
> (tot 7258702).
> Jan 23 17:23:39 p186 kernel: Lustre:
> 14294:0:(lib-lnet.h:251:lnet_eq_alloc()) kmalloced 'eq': 48 at
> efda1a00 (tot 7258750).
> Jan 23 17:23:39 p186 kernel: Lustre:
> 14294:0:(lib-eq.c:72:LNetEQAlloc()) kmalloced 'eq->eq_events': 240 at
> f0b95c80 (tot 7258990).
> Jan 23 17:23:39 p186 kernel: Lustre:
> 14294:0:(lib-lnet.h:279:lnet_md_alloc()) kmalloced 'md': 84 at
> ed16acc0 (tot 7259074).
> Jan 23 17:23:39 p186 kernel: Lustre:
> 14294:0:(lib-lnet.h:327:lnet_msg_alloc()) kmalloced 'msg': 268 at
> f205a400 (tot 7259342).
> Jan 23 17:23:39 p186 kernel: Lustre:
> 14294:0:(lib-move.c:2395:LNetGet()) LNetGet -> 12345-172.24.198....@o2ib
> Jan 23 17:23:39 p186 kernel: Lustre:
> 14294:0:(o2iblnd_cb.c:1531:kiblnd_send()) sending 0 bytes in 0 frags
> to 12345-172.24.198....@o2ib
> Jan 23 17:23:39 p186 kernel: Lustre:
> 14294:0:(o2iblnd.c:312:kiblnd_create_peer()) kmalloced 'peer': 56 at
> efda18c0 (tot 7259398).
> Jan 23 17:23:39 p186 kernel: Lustre:
> 14294:0:(o2iblnd_cb.c:1501:kiblnd_launch_tx()) peer[efda18c0] ->
> 172.24.198....@o2ib (1)++
> Jan 23 17:23:39 p186 kernel: Lustre:
> 14294:0:(o2iblnd_cb.c:1380:kiblnd_connect_peer()) peer[efda18c0] ->
> 172.24.198....@o2ib (2)++
> Jan 23 17:23:39 p186 kernel: Lustre:
> 14294:0:(o2iblnd_cb.c:1507:kiblnd_launch_tx()) peer[efda18c0] ->
> 172.24.198....@o2ib (3)--
> Jan 23 17:23:39 p186 kernel: Lustre:
> 14294:0:(lib-eq.c:209:LNetEQPoll()) Process entered
> Jan 23 17:23:39 p186 kernel: Lustre:
> 14294:0:(lib-eq.c:146:lib_get_event()) Process entered
> Jan 23 17:23:39 p186 kernel: Lustre:
> 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1,
> eq->size: 2
> Jan 23 17:23:39 p186 kernel: Lustre:
> 14294:0:(lib-eq.c:152:lib_get_event()) Process leaving (rc=0 : 0 : 0)
> Jan 23 17:23:39 p186 kernel: Lustre:
> 2782:0:(o2iblnd_cb.c:2682:kiblnd_cm_callback()) 172.24.198....@o2ib
> Addr resolved: 0
> Jan 23 17:23:40 p186 kernel: Lustre:
> 14294:0:(lib-eq.c:146:lib_get_event()) Process entered
> Jan 23 17:23:40 p186 kernel: Lustre:
> 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1,
> eq->size: 2
> Jan 23 17:23:40 p186 kernel: Lustre:
> 14294:0:(lib-eq.c:152:lib_get_event()) Process leaving (rc=0 : 0 : 0)
> Jan 23 17:23:40 p186 kernel: Lustre:
> 14294:0:(lib-eq.c:239:LNetEQPoll()) Process leaving (rc=0 : 0 : 0)
> Jan 23 17:23:40 p186 kernel: Lustre:
> 14294:0:(api-ni.c:1665:lnet_ping()) poll 0(-1 -1)
> Jan 23 17:23:40 p186 kernel: Lustre:
> 14294:0:(lib-md.c:69:lnet_md_unlink()) Queueing unlink of md ed16acc0
> Jan 23 17:23:40 p186 kernel: Lustre:
> 14294:0:(lib-eq.c:209:LNetEQPoll()) Process entered
> Jan 23 17:23:40 p186 kernel: Lustre:
> 14294:0:(lib-eq.c:146:lib_get_event()) Process entered
> Jan 23 17:23:40 p186 kernel: Lustre:
> 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1,
> eq->size: 2
> Jan 23 17:23:40 p186 kernel: Lustre:
> 14294:0:(lib-eq.c:152:lib_get_event()) Process leaving (rc=0 : 0 : 0)
> Jan 23 17:23:56 p186 kernel: Lustre:
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving
> (rc=4294962944 : -4352 : ffffef00)
> Jan 23 17:23:56 p186 kernel: Lustre:
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving
> (rc=4294966784 : -512 : fffffe00)
> Jan 23 17:23:56 p186 kernel: Lustre:
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2817
> : 2817 : b01)
> Jan 23 17:23:56 p186 kernel: Lustre:
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2047
> : 2047 : 7ff)
> Jan 23 17:23:56 p186 kernel: Lustre:
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving
> (rc=4294740832 : -226464 : fffc8b60)
> Jan 23 17:23:56 p186 kernel: Lustre:
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving
> (rc=4286216485 : -8750811 : ff7a7925)
> Jan 23 17:23:56 p186 kernel: Lustre:
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving
> (rc=5821091 : 5821091 : 58d2a3)
> Jan 23 17:23:56 p186 kernel: Lustre:
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving
> (rc=3356952 : 3356952 : 333918)
> Jan 23 17:23:56 p186 kernel: Lustre:
> 8276:0:(pinger.c:193:ptlrpc_pinger_main()) next ping in 25000 (8510847)
> Jan 23 17:24:21 p186 kernel: Lustre:
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving
> (rc=4294962944 : -4352 : ffffef00)
> Jan 23 17:24:21 p186 kernel: Lustre:
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving
> (rc=4294966784 : -512 : fffffe00)
> Jan 23 17:24:21 p186 kernel: Lustre:
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2817
> : 2817 : b01)
> Jan 23 17:24:21 p186 kernel: Lustre:
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2047
> : 2047 : 7ff)
> Jan 23 17:24:21 p186 kernel: Lustre:
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving
> (rc=4294740832 : -226464 : fffc8b60)
> Jan 23 17:24:21 p186 kernel: Lustre:
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving
> (rc=4286216485 : -8750811 : ff7a7925)
> Jan 23 17:24:21 p186 kernel: Lustre:
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving
> (rc=5821091 : 5821091 : 58d2a3)
> Jan 23 17:24:21 p186 kernel: Lustre:
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving
> (rc=3356952 : 3356952 : 333918)
> Jan 23 17:24:21 p186 kernel: Lustre:
> 8276:0:(pinger.c:193:ptlrpc_pinger_main()) next ping in 25000 (8535847)
> Jan 23 17:24:29 p186 kernel: Lustre:
> 2794:0:(o2iblnd_cb.c:2704:kiblnd_cm_callback()) 172.24.198....@o2ib:
> ROUTE ERROR -110
> Jan 23 17:24:29 p186 kernel: Lustre:
> 2794:0:(o2iblnd.c:422:kiblnd_unlink_peer_locked()) peer[efda18c0] ->
> 172.24.198....@o2ib (2)--
> Jan 23 17:24:29 p186 kernel: Lustre:
> 2794:0:(router.c:151:lnet_notify()) 172.24.198....@o2ib notifying
> 172.24.198....@o2ib: down
> Jan 23 17:24:29 p186 kernel: Lustre:
> 2794:0:(router.c:82:lnet_notify_locked()) Old news
> Jan 23 17:24:29 p186 kernel: Lustre:
> 2794:0:(o2iblnd_cb.c:2118:kiblnd_peer_connect_failed()) Deleting
> messages for 172.24.198....@o2ib: connection failed
> Jan 23 17:24:29 p186 kernel: Lustre:
> 2794:0:(lib-md.c:73:lnet_md_unlink()) Unlinking md ed16acc0
> Jan 23 17:24:29 p186 kernel: Lustre:
> 2794:0:(lib-lnet.h:301:lnet_md_free()) kfreed 'md': 84 at ed16acc0
> (tot 7259314).
> Jan 23 17:24:29 p186 kernel: Lustre:
> 2794:0:(lib-lnet.h:344:lnet_msg_free()) kfreed 'msg': 268 at f205a400
> (tot 7259046).
> Jan 23 17:24:29 p186 kernel: Lustre:
> 2794:0:(o2iblnd_cb.c:2706:kiblnd_cm_callback()) peer[efda18c0] ->
> 172.24.198....@o2ib (1)--
> Jan 23 17:24:29 p186 kernel: Lustre:
> 2794:0:(o2iblnd.c:357:kiblnd_destroy_peer()) kfreed 'peer': 56 at
> efda18c0 (tot 7258990).
> Jan 23 17:24:29 p186 kernel: Lustre:
> 14294:0:(lib-eq.c:146:lib_get_event()) Process entered
> Jan 23 17:24:29 p186 kernel: Lustre:
> 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1,
> eq->size: 2
> Jan 23 17:24:29 p186 kernel: Lustre:
> 14294:0:(lib-eq.c:170:lib_get_event()) Process leaving (rc=1 : 1 : 1)
> Jan 23 17:24:29 p186 kernel: Lustre:
> 14294:0:(lib-eq.c:232:LNetEQPoll()) Process leaving (rc=1 : 1 : 1)
> Jan 23 17:24:29 p186 kernel: Lustre:
> 14294:0:(api-ni.c:1665:lnet_ping()) poll 1(4 -113) unlinked
> Jan 23 17:24:29 p186 kernel: Lustre:
> 14294:0:(lib-lnet.h:259:lnet_eq_free()) kfreed 'eq': 48 at efda1a00
> (tot 7258942).
> Jan 23 17:24:29 p186 kernel: Lustre:
> 14294:0:(lib-eq.c:135:LNetEQFree()) kfreed 'events': 240 at f0b95c80
> (tot 7258702).
> Jan 23 17:24:29 p186 kernel: Lustre:
> 14294:0:(api-ni.c:1772:lnet_ping()) kfreed 'info': 144 at f0b95880
> (tot 7258558).
> Jan 23 17:24:29 p186 kernel: Lustre:
> 14294:0:(module.c:336:libcfs_ioctl()) Process leaving (rc=4294967291 :
> -5 : fffffffb)
> Jan 23 17:24:29 p186 kernel: Lustre:
> 14294:0:(module.c:178:libcfs_psdev_release()) Process entered
> Jan 23 17:24:29 p186 kernel: Lustre:
> 14294:0:(module.c:183:libcfs_psdev_release()) kfreed 'ldu': 8 at
> f5bc6620 (tot 7258550).
> Jan 23 17:24:29 p186 kernel: Lustre:
> 14294:0:(module.c:187:libcfs_psdev_release()) Process leaving (rc=0 :
> 0 : 0)
>
> ~subbu
>
> On Fri, Jan 16, 2009 at 3:38 PM, subbu kl <[email protected]> wrote:
>
> > Liang,
> >
> > Right; you reproduced the exact problem.
> > But as you can see in my
> > previous mail I think I have solved that problem by manually
> > assigning IP to ib0 (check this line # ifconfig ib0 172.24.198.111
> > and *"Added LNI" lines *)
> >
> > we are back to square one now I guess ! LNET is up with manually
> > assigned IPs. normal ping succeeds between machines but not lctl ping.
> >
> > so my current problem is this :
> >
> > # lctl ping 172.24.198....@o2ib
> > failed to ping 172.24.198....@o2ib: Input/output error
> >
> > /var/log/messages:
> >
> > Jan 16 10:24:14 p128 kernel: Lustre:
> > 2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback())
> > 172.24.198....@o2ib: ROUTE ERROR -22
> > Jan 16 10:24:14 p128 kernel: Lustre:
> > 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting
> > messages for 172.24.198....@o2ib: connection failed
> >
> > how can I get rid of this connection problem?
> >
> > ~subbu
> >
> > On Fri, Jan 16, 2009 at 2:11 PM, Liang Zhen <[email protected]> wrote:
> >
> > > Subbu,
> > >
> > > We don't have any tip for setup IPoIB, looks like linux can't
> > > find the ifaddr of ib0 on MDS(-99 is EADDRNOTAVAIL), so I
> > > think it's because you didn't assign any address to ib0 (or
> > > failed to assign address to ib0) before loading o2iblnd in
> > > the first try.
> > > I can reproduce exactly same error by:
> > > 1. modprobe ib_ipoib
> > > 2. ifconfig ib0 up // without assign any address
> > > 3. modprobe ko2iblnd
> > > 4. lctl network up
> > >
> > > Regards
> > > Liang
> > >
> > > subbu kl:
> > >
> > > > Liang,
> > > > after executing following echo :
> > > > echo +neterror > /proc/sys/lnet/printk
> > > >
> > > > now lctl ping shows the following error
> > > >
> > > > # lctl ping 172.24.198....@o2ib
> > > > failed to ping 172.24.198....@o2ib: Input/output error
> > > >
> > > > Jan 16 10:24:14 p128 kernel: Lustre:
> > > > 2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback())
> > > > 172.24.198....@o2ib: ROUTE ERROR -22
> > > > Jan 16 10:24:14 p128 kernel: Lustre:
> > > > 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed())
> > > > Deleting messages for 172.24.198....@o2ib: connection failed
> > > >
> > > > Looks like some problem with "IB connection manager" !
> > > >
> > > > 1. do we have any help docs to setup IPoIB and Lustre,
> > > >    lustre operation manual has very minimal info about this .
> > > >    I think I am missing some IPoIB setup part here.
> > > > 2. or is it manual assignment of IP addresses to "ib0"
> > > >    is creating some problem
> > > >
> > > > *Some more supporting info :*
> > > > subnet manager of following version is also running :
> > > > OpenSM 3.1.8
> > > >
> > > > Initially I got this error for MDS mount
> > > >
> > > > Jan 16 09:45:20 p128 kernel: LustreError:
> > > > 4991:0:(linux-tcpip.c:124:libcfs_ipif_query()) Can't get
> > > > IP address for interface ib0
> > > > Jan 16 09:45:20 p128 kernel: LustreError:
> > > > 4991:0:(o2iblnd.c:1563:kiblnd_startup()) Can't query IPoIB
> > > > interface ib0: -99
> > > > Jan 16 09:45:21 p128 kernel: LustreError: 105-4: Error
> > > > -100 starting up LNI o2ib
> > > > Jan 16 09:45:21 p128 kernel: LustreError:
> > > > 4991:0:(events.c:707:ptlrpc_init_portals()) network
> > > > initialisation failed
> > > > Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting ptlrpc
> > > > (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/ptlrpc.ko):
> > > > Input/output error
> > > > Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting osc
> > > > (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/osc.ko):
> > > > Unknown symbol in module, or unknown parameter (see dmesg)
> > > > Jan 16 09:45:21 p128 kernel: osc: Unknown symbol
> > > > ldlm_prep_enqueue_req
> > > > Jan 16 09:45:21 p128 kernel: osc: Unknown symbol
> > > > ldlm_resource_get
> > > > Jan 16 09:45:21 p128 kernel: osc: Unknown symbol
> > > > ptlrpc_lprocfs_register_obd
> > > > .
> > > > .
> > > > .
> > > >
> > > > then I manually set the IP address for ib0 as follows :
> > > > # ifconfig ib0 172.24.198.111
> > > >
> > > > [r...@p186 ~]# ifconfig ib0
> > > > ib0 Link encap:InfiniBand HWaddr
> > > > 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
> > > > inet addr:172.24.198.112 Bcast:172.24.255.255
> > > > Mask:255.255.0.0
> > > > UP BROADCAST MULTICAST MTU:65520 Metric:1
> > > > RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> > > > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> > > > collisions:0 txqueuelen:256
> > > > RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
> > > >
> > > > then it mounted successfully
> > > >
> > > > *Jan 16 09:47:09 p128 kernel: Lustre: Added LNI
> > > > 172.24.198....@o2ib [8/64]
> > > > Jan 16 09:47:09 p128 kernel: Lustre: MGS MGS started*
> > > > Jan 16 09:47:09 p128 kernel: Lustre: Setting parameter
> > > > lustre-MDT0000.mdt.group_upcall in log lustre-MDT0000
> > > > Jan 16 09:47:09 p128 kernel: Lustre: Enabling user_xattr
> > > > Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000: new
> > > > disk, initializing
> > > > Jan 16 09:47:09 p128 kernel: Lustre: MDT lustre-MDT0000
> > > > now serving dev
> > > > (lustre-MDT0000/64db1fc7-03ba-9803-4d20-ab0d2aa66116) with
> > > > recovery enabled
> > > > Jan 16 09:47:09 p128 kernel: Lustre:
> > > > 5274:0:(lproc_mds.c:262:lprocfs_wr_group_upcall())
> > > > lustre-MDT0000: group upcall set to /usr/sbin/l_getgroups
> > > > Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000.mdt:
> > > > set parameter group_upcall=/usr/sbin/l_getgroups
> > > > Jan 16 09:47:09 p128 kernel: Lustre: Server lustre-MDT0000
> > > > on device /dev/loop0 has started
> > > > .
> > > > .
> > > > .
> > > >
> > > > ~subbu
> > > >
> > > > On Thu, Jan 15, 2009 at 8:37 PM, Liang Zhen <[email protected]> wrote:
> > > >
> > > > > Subbu,
> > > > >
> > > > > I'd suggest:
> > > > > 1) make sure ko2iblnd has been brought up (please check if there
> > > > >    is any error message when startup ko2iblnd)
> > > > > 2) echo +neterror > /proc/sys/lnet/printk, then try with lctl
> > > > >    ping, if it still can't work please post error messages
> > > > >
> > > > > Regards
> > > > > Liang
> > > > >
> > > > > subbu kl:
> > > > >
> > > > > > Problem is similar to
> > > > > > http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html
> > > > > > But by looking at the thread could not really get the solution
> > > > > > for the problem.
> > > > > >
> > > > > > I have two RHEL5 Linux servers installed with following packages -
> > > > > >
> > > > > > kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1
> > > > > > kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
> > > > > > lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
> > > > > > lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
> > > > > > lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
> > > > > > e2fsprogs-1.40.7.sun3-0redhat
> > > > > >
> > > > > > machine 1: with ib0 IP address : 172.24.198.111
> > > > > > machine 2: with ib0 IP address : 172.24.198.112
> > > > > >
> > > > > > /etc/modprobe.conf contains
> > > > > > options lnet networks=o2ib
> > > > > >
> > > > > > TCP networking worked fine and now I am trying with Infiniband
> > > > > > network finding it difficult in communicating with IB nodes
> > > > > > mounting effort throws me the following error
> > > > > >
> > > > > > [r...@p186 ~]# mount -t lustre -o loop /tmp/lustre-ost1 /mnt/ost1
> > > > > > mount.lustre: mount /dev/loop0 at /mnt/ost1 failed:
> > > > > > Input/output error
> > > > > > Is the MGS running?
> > > > > >
> > > > > > /var/log/messages :
> > > > > > Jan 15 16:55:25 p186 kernel: kjournald starting. Commit
> > > > > > interval 5 seconds
> > > > > > Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal
> > > > > > Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem
> > > > > > with ordered data mode.
> > > > > > Jan 15 16:55:25 p186 kernel: kjournald starting. Commit
> > > > > > interval 5 seconds
> > > > > > Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal
> > > > > > Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem
> > > > > > with ordered data mode.
> > > > > > Jan 15 16:55:25 p186 kernel: LDISKFS-fs: file extents enabled
> > > > > > Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mballoc enabled
> > > > > > Jan 15 16:55:30 p186 kernel: Lustre: Request x7 sent from
> > > > > > mgc172.24.198....@o2ib to NID 172.24.198....@o2ib 5s ago has
> > > > > > timed out (limit 5s).
> > > > > > Jan 15 16:55:30 p186 kernel: LustreError:
> > > > > > 7193:0:(obd_mount.c:1062:server_start_targets()) Required
> > > > > > registration failed for lustre-OSTffff: -5
> > > > > > Jan 15 16:55:30 p186 kernel: LustreError: 15f-b: Communication
> > > > > > error with the MGS. Is the MGS running?
> > > > > > Jan 15 16:55:30 p186 kernel: LustreError:
> > > > > > 7193:0:(obd_mount.c:1597:server_fill_super()) Unable to start
> > > > > > targets: -5
> > > > > > Jan 15 16:55:30 p186 kernel: LustreError:
> > > > > > 7193:0:(obd_mount.c:1382:server_put_super()) no obd
> > > > > > lustre-OSTffff
> > > > > > Jan 15 16:55:30 p186 kernel: LustreError:
> > > > > > 7193:0:(obd_mount.c:119:server_deregister_mount())
> > > > > > lustre-OSTffff not registered
> > > > > > Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 blocks 0
> > > > > > reqs (0 success)
> > > > > > Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 extents
> > > > > > scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost
> > > > > > Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 generated
> > > > > > and it took 0
> > > > > > Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0
> > > > > > preallocated, 0 discarded
> > > > > > Jan 15 16:55:30 p186 kernel: Lustre: server umount
> > > > > > lustre-OSTffff complete
> > > > > > Jan 15 16:55:30 p186 kernel: LustreError:
> > > > > > 7193:0:(obd_mount.c:1951:lustre_fill_super()) Unable to mount
> > > > > > (-5)
> > > > > >
> > > > > > All pinging efforts also failed to the IB NIDS local/remote
> > > > > > can ping the ip address :
> > > > > > [r...@p186 ~]# ping 172.24.198.112
> > > > > > PING 172.24.198.112 (172.24.198.112) 56(84) bytes of data.
> > > > > > 64 bytes from 172.24.198.112: icmp_seq=1 ttl=64 time=0.052 ms
> > > > > > 64 bytes from 172.24.198.112: icmp_seq=2 ttl=64 time=0.024 ms
> > > > > >
> > > > > > --- 172.24.198.112 ping statistics ---
> > > > > > 2 packets transmitted, 2 received, 0% packet loss, time 1000ms
> > > > > > rtt min/avg/max/mdev = 0.024/0.038/0.052/0.014 ms
> > > > > > [r...@p186 ~]# ping 172.24.198.111
> > > > > > PING 172.24.198.111 (172.24.198.111) 56(84) bytes of data.
> > > > > > 64 bytes from 172.24.198.111: icmp_seq=1 ttl=64 time=2.16 ms
> > > > > > 64 bytes from 172.24.198.111: icmp_seq=2 ttl=64 time=0.296 ms
> > > > > >
> > > > > > --- 172.24.198.111 ping statistics ---
> > > > > > 2 packets transmitted, 2 received, 0% packet loss, time 1000ms
> > > > > > rtt min/avg/max/mdev = 0.296/1.231/2.166/0.935 ms
> > > > > >
> > > > > > but cant ping the NIDS :
> > > > > > [r...@p186 ~]# lctl ping 172.24.198....@o2ib
> > > > > > failed to ping 172.24.198....@o2ib: Input/output error
> > > > > > [r...@p186 ~]# lctl ping 172.24.198....@o2ib
> > > > > > failed to ping 172.24.198....@o2ib: Input/output error
> > > > > >
> > > > > > Any idea why lnet cant ping NIDS ?
> > > > > >
> > > > > > some more configurations:
> > > > > > [r...@p186 ~]# ibstat
> > > > > > CA 'mthca0'
> > > > > > CA type: MT23108
> > > > > > Number of ports: 2
> > > > > > Firmware version: 3.5.0
> > > > > > Hardware version: a1
> > > > > > Node GUID: 0x0002c9020021550c
> > > > > >
> > > > > > Machines are connected via IB switch.
> > > > > >
> > > > > > Looking forward for help.
> > > > > >
> > > > > > ~subbu
> > > > > >
> > > > > > ------------------------------------------------------------------------
> > > > > >
> > > > > > _______________________________________________
> > > > > > Lustre-discuss mailing list
> > > > > > [email protected]
> > > > > > http://lists.lustre.org/mailman/listinfo/lustre-discuss
> > > >
> > > > --
> > > > . . . s u b b u
> > > > "You've got to be original, because if you're like someone
> > > > else, what do they need you for?"
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
