2009/3/17 jeffrey Lang <[email protected]>:
> First let me say, I hope this is the right list for this email; if not,
> please forgive me.
>
> I have a small 16-node compute cluster. The university where I work
> recently opened a new datacenter, and my cluster was moved there from the
> old datacenter. Before the move the InfiniBand was working properly; after
> the move IPoIB has stopped working.
>
> The cluster runs CentOS 4 with all the latest updates and the OFED code
> distributed with CentOS. My plan was to update the OFED code once things
> had restabilized.
>
> For the move, I shut down the cluster, removed the InfiniBand cables, and
> the cluster was moved. I then reinstalled the InfiniBand cables (not in the
> same order as before the move) and brought everything back up.
>
> When I brought the cluster back up, IPoIB would not work. The only
> message in the log file is "Mar 15 04:04:32 h2o01 kernel: ib0: multicast
> join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22".

I think there may be a rate issue with this node relative to the IPoIB
broadcast group, which by default is 10 Gbps (4x SDR). What does this
node's portinfo (smpquery portinfo -D 0) show for link width and speed?

-- Hal

> The master node can see all the systems:
>
> [r...@h2o01 log]# ibnodes
> Ca : 0x00066a0098007e99 ports 1 "h2o17 HCA-1"
> Ca : 0x00066a0098007e9b ports 1 "h2o18 HCA-1"
> Ca : 0x00066a0098007e97 ports 1 "h2o16 HCA-1"
> Ca : 0x00066a0098007e8c ports 1 "h2o15 HCA-1"
> Ca : 0x00066a0098007e94 ports 1 "h2o14 HCA-1"
> Ca : 0x00066a0098007e93 ports 1 "h2o13 HCA-1"
> Ca : 0x00066a0098007e8e ports 1 "h2o12 HCA-1"
> Ca : 0x00066a0098007e90 ports 1 "h2o11 HCA-1"
> Ca : 0x00066a0098007e98 ports 1 "h2o10 HCA-1"
> Ca : 0x00066a0098007e95 ports 1 "h2o09 HCA-1"
> Ca : 0x00066a0098007e8f ports 1 "h2o08 HCA-1"
> Ca : 0x00066a0098007e92 ports 1 "h2o07 HCA-1"
> Ca : 0x00066a0098007e8d ports 1 "h2o06 HCA-1"
> Ca : 0x00066a0098007e91 ports 1 "h2o05 HCA-1"
> Ca : 0x00066a0098007e96 ports 1 "h2ocfs HCA-1"
> Ca : 0x00066a0098007e9c ports 1 "h2o01 HCA-1"
> Switch : 0x00066a00d8000593 ports 24 "SilverStorm 9024
> GUID=0x00066a00d8000593" enhanced port 0 lid 1 lmc 0
>
> I've reset the SM on the switch, but nothing seems to work.
>
> Any ideas on where to look for what's causing the problem?
>
> jeff

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit
http://openib.org/mailman/listinfo/openib-general
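
To make the rate question above concrete, here is a rough sketch of the arithmetic behind "10 Gbps (4x SDR)". It assumes the usual per-lane signaling rates (SDR 2.5, DDR 5, QDR 10 Gbps) and assumes that a port has to run at least as fast as the group rate in order to join. The function names are made up for illustration and are not part of OFED or any InfiniBand tool; the width and speed values would be read by hand from the portinfo output.

```python
# Per-lane signaling rate in Gbps for each link speed (assumed standard values).
LANE_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}


def link_rate_gbps(width, speed):
    """Return the link rate in Gbps.

    width is the active lane count (1, 4, 8 or 12);
    speed is the active link speed ("SDR", "DDR" or "QDR").
    """
    return width * LANE_GBPS[speed]


def can_join(width, speed, group_rate_gbps=10.0):
    """Hypothetical check: True if the port is at least as fast as the group.

    The default group rate of 10 Gbps is the 4x SDR broadcast group
    default mentioned in the reply above.
    """
    return link_rate_gbps(width, speed) >= group_rate_gbps


if __name__ == "__main__":
    # A port that trained at 1x after recabling versus a healthy 4x port.
    for width, speed in [(1, "SDR"), (4, "SDR")]:
        print(width, speed, link_rate_gbps(width, speed), can_join(width, speed))
```

If a port came up at 1x after the recabling, for example, its rate would be 2.5 Gbps and this check would fail, which would be consistent with the join being rejected. That is only a guess until the actual portinfo values are known.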
