Hi all,

I would guess this problem is fairly common and that many solutions are already in place, so I would like to enquire about your solutions to the problem:
In our large cluster, certain nodes are going down with hard disk I/O errors. We have some suspicions about the causes but would like to investigate this further. However, the log files show little, if anything at all, which is understandable given that the log files reside on disk and we are hitting disk I/O errors. The console does show some interesting messages, but we cannot scroll back far enough.

My question now: is there a cute little way to gather all the console output of > 1000 nodes? The nodes don't have physical serial cables attached to them, nor do we want to use many concentrators to achieve this, but the off-the-shelf Supermicro boxes all have an IPMI card installed and SoL (Serial over LAN) works quite OK.

Initially, conserver.com looked nice and we also found an IPMI interface for it, but that comes with two downsides:

(1) it blocks IPMI access (I have yet to find out whether a second user can use SoL while another user is already using it, but I doubt it), and
(2) it simply does not catch messages appearing in dmesg (simple ones like plugging in a USB keyboard), but that may be a configuration problem on our side.

We also tried (r)syslog, but somehow this does not get all the messages either, even when using something like *.* @loghost.

For the time being we are experimenting with running "script" in many "screen" sessions, which should be able to monitor ipmitool's SoL output, but somehow that strikes me as inefficient as well.

So, my question boils down to: how do people solve this problem?

Thanks a lot

Cheers

Carsten

--
Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
Callinstrasse 38, 30167 Hannover, Germany
Phone/Fax: +49 511 762-17185 / -17193
http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31
