I can't speak for all IB networks, but I do know that on our network the SM on our switch wouldn't last longer than a week. That network had 288 nodes. We cured all the SM problems by switching to a dedicated server running only OpenSM. We also tried running OpenSM on a server that was running other tasks, but still had occasional problems. Currently we have over 450 nodes and no SM problems at all using a dedicated OpenSM server. We do match the version of OpenSM to the version of OFED we are using.

Thanks,

Don Meyer
Senior Network/System Engineer/Programmer
US+ (253) 371-9532 iNet 8-371-9532
*Other names and brands may be claimed as the property of others

-----Original Message-----
From: Michael Robbert [mailto:[email protected]]
Sent: Wednesday, March 24, 2010 12:43 PM
To: Ira Weiny
Cc: Meyer, Donald J; [email protected]
Subject: Re: ibstat stuck in state initialized after reboot

I've got good news. I was able to get opensm to take control. I gave it a priority of 15 and rebooted the 7000D. Unfortunately I'm not sure I can leave it like this forever. The only host I had with opensm installed is my test front end for an OS upgrade I'm testing. We're moving from Rocks 4.3 to Rocks 5.3 (RHEL 4.5 to RHEL 5.4). I may need to reboot this node from time to time over the next couple of weeks, but at least I'm working right now.

So you say that a 288 node system will work "out of the box", what happens when you hit 289? Is that a magic number or just an estimate. We have 268 compute nodes plus a few auxiliary nodes so we're pretty close to that number.

Thanks,
Mike

On Mar 24, 2010, at 12:25 PM, Ira Weiny wrote:

> On Wed, 24 Mar 2010 11:34:02 -0600
> Michael Robbert <[email protected]> wrote:
>
>> Interesting note! The 7024 is our large switch where all the hosts are
>> connected, but I was told that we were sold the 7000D because the 7024
>> didn't have a subnet manager. Unfortunately the 7000D has a different CLI
>> and that command is not available and I don't have the password for our 7024
>> so I can't log onto it.
>>
>> On another note I just noticed the uptime on the 7000D is just over 1 day so
>> that must have been the start of the problem, but I have no idea why it
>> rebooted nor why it didn't come up working. I'm pretty sure we tested a
>> reboot of the device during acceptance testing.
>>
>> Oh, I just got your second note:
>> ==================================
>> BTW, I highly recommend running the opensm on a server instead of using the
>> sm on the switch. We found running the sm on the switch was much less
>> reliable. I also recommend using a server dedicated to opensm only.
>> ==================================
>
> I will second this. OpenSM has come a long way since the time Cisco was
> selling IB switches. If I understand your situation you don't even need the
> 7000D you could just remove it and run OpenSM on a "management" node. If you
> can afford it adding a node for OpenSM would be nice but I am not sure you
> _need_ it.
>
> OpenSM is now managing many of the largest IB networks out there, on a 288
> node system it will have no problems at all "out of the box".
>
> :D
>
> Ira
>
>> I will take that into consideration, but we bought this as a "turn-key"
>> solution from Dell. They designed it and we had no experience with IB so we
>> trusted their knowledge.
>
> <snip>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
