>> How do we start and stop all nodes using a remote computer.
>
> IPMI is an excellent, portable, well-scriptable interface for control and
> monitoring. there are some vendor-specific alternatives, as well as cruder
> mechanisms (controllable PDU's).

IPMI is sometimes OK, sometimes not that good: be careful about your exact needs. IPMI is just a standard that can be implemented quite well, or so poorly that it does not work most of the time (and at a 500-node scale, that is a nightmare!).
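For what "scriptable" looks like in practice, here is a minimal Python sketch that sends one power action to every node's BMC through ipmitool and reports which nodes did not accept it. It is only an illustration under assumptions: the BMC host names, the "admin" user, the lanplus interface and the worker count are all made up for the example, and the password is expected in the IPMI_PASSWORD environment variable (which is what ipmitool's -E option reads).

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

POWER_ACTIONS = {"on", "off", "cycle", "reset", "status"}


def ipmi_command(bmc_host, action, user="admin"):
    """Build the ipmitool argument list for one node's BMC.

    The user name and the lanplus interface are assumptions for this sketch.
    """
    if action not in POWER_ACTIONS:
        raise ValueError(f"unsupported power action: {action}")
    # -E tells ipmitool to take the password from IPMI_PASSWORD.
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E",
            "chassis", "power", action]


def power_all(bmc_hosts, action, workers=32, timeout=30):
    """Run one power action on every BMC in parallel.

    Returns {host: True/False}; False means ipmitool failed, timed out,
    or could not be started for that host.
    """
    def run(host):
        try:
            result = subprocess.run(ipmi_command(host, action),
                                    capture_output=True, timeout=timeout)
            return host, result.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            return host, False

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(run, bmc_hosts))
```

Something like `power_all(["node001-bmc", "node002-bmc"], "cycle")` would then return a per-host success map. A zero exit code only means the BMC accepted the command, not that the node actually came back up, so the failed list is the part worth watching at this scale.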
I take care of a cluster that is similar in size to the one you want to build, and that requires a lot of reboots (more than 460 000 node reboots over a 9-month period, an average of about 5 reboots per node per day). In my experience, some IPMI hardware implementations are not sufficient to ensure reliable reboots: for example, we had issues rebooting nodes when they were in the PXE boot stage, or stuck in GRUB with a missing kernel, or worse, when running a FreeBSD system.

Controllable PDUs are not a good idea, because hard power cycling will burn out your hard drives and other node components pretty quickly, and with so many nodes you will lose many of them even if your reboot rate is low.

Many other solutions are OK: they tend to be scriptable through a telnet + expect script, so they are fine as long as they can reboot all your nodes in any situation.

Regards,
Julien Leduc

_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
