Hello all Scientific Linux users and experts:

About a month ago we started seeing a large number of nodes going
into a state where they would use 100% system CPU, the load would
go to about 100, and no useful work was getting done.  Nodes would
not recover from this state without a reboot.  The log files showed
many messages like

uct2-c185/kern20100511:May 11 10:04:34 uct2-c185 kernel 03 [kern.err] kernel: BUG: soft lockup - CPU#0 stuck for 10s! [events/0:26]
uct2-c185/kern20100511:May 11 12:06:36 uct2-c185 kernel 03 [kern.err] kernel: BUG: soft lockup - CPU#0 stuck for 10s! [events/0:26]
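
For anyone who wants to tally these messages across a cluster, here is a
rough sketch in Python. It assumes the syslog layout matches the two lines
quoted above. The function name count_lockups and the regular expression
are illustrative only and are not part of any existing tool.

```python
import re
from collections import Counter

# Matches the soft-lockup message format quoted above; the exact syslog
# layout is an assumption based on those two sample lines.
LOCKUP_RE = re.compile(
    r"(?P<host>\S+) kernel .*BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<task>[^\]]+)\]"
)

def count_lockups(lines):
    """Count soft-lockup messages per (host, stuck task) pair."""
    counts = Counter()
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            counts[(m.group("host"), m.group("task"))] += 1
    return counts

# The two messages from the log excerpt, each joined onto one line.
sample = [
    "uct2-c185/kern20100511:May 11 10:04:34 uct2-c185 kernel 03 [kern.err] "
    "kernel: BUG: soft lockup - CPU#0 stuck for 10s! [events/0:26]",
    "uct2-c185/kern20100511:May 11 12:06:36 uct2-c185 kernel 03 [kern.err] "
    "kernel: BUG: soft lockup - CPU#0 stuck for 10s! [events/0:26]",
]

if __name__ == "__main__":
    for (host, task), n in sorted(count_lockups(sample).items()):
        print(host, task, n)
```

Grouping by the stuck task (events/0 here) may help show whether every
affected node is hanging in the same kernel thread.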

Doing a little research led us to believe that we were seeing this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=547530

and according to that page, the fix has been backported to
kernel-2.6.18-164.11.1.el5.

We upgraded all of our cluster hosts to this kernel version, but the error
is still occurring.  Any ideas or suggestions?
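
In case it helps others rule out a node that was upgraded but never rebooted
into the new kernel, below is a minimal sketch that compares the running
kernel release against the fixed version. The helper names are hypothetical,
and the numeric comparison is a simplification that only looks at the digits
before ".el"; it is not RPM's real version-comparison algorithm.

```python
import platform
import re

# Kernel version that the bug report says carries the backported fix.
FIXED_RELEASE = "2.6.18-164.11.1.el5"

def kernel_tuple(release):
    """Turn '2.6.18-164.11.1.el5' into (2, 6, 18, 164, 11, 1).

    Simplified: everything from '.el' onward is ignored, and only the
    numeric fields are compared. This is not rpmvercmp.
    """
    base = release.split(".el")[0]
    return tuple(int(n) for n in re.findall(r"\d+", base))

def has_fix(release, fixed=FIXED_RELEASE):
    """True if 'release' is at or beyond the fixed kernel version."""
    return kernel_tuple(release) >= kernel_tuple(fixed)

if __name__ == "__main__":
    running = platform.release()  # same string that `uname -r` reports
    print(running, "has fix:", has_fix(running))
```

Run on each node, this reports the kernel that is actually booted, which can
differ from the newest kernel package installed.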

   Thanks,

          - Charles
