Hello all Scientific Linux users and experts,

About a month ago we started seeing a large number of nodes going into a state where they would use 100% system CPU, the load would climb to about 100, and no useful work was getting done. Nodes would not recover from this state without a reboot. The log files showed many messages like:
uct2-c185/kern20100511:May 11 10:04:34 uct2-c185 kernel 03 [kern.err] kernel: BUG: soft lockup - CPU#0 stuck for 10s! [events/0:26]
uct2-c185/kern20100511:May 11 12:06:36 uct2-c185 kernel 03 [kern.err] kernel: BUG: soft lockup - CPU#0 stuck for 10s! [events/0:26]

A little research led us to believe that we were seeing this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=547530

According to that page, the fix has been backported to kernel-2.6.18-164.11.1.el5. We upgraded all of our cluster hosts to this kernel version, but the error is still occurring.

Any ideas or suggestions?

Thanks,
- Charles
