A few additions: I would highly recommend checking that the thumbscrews on the heat sinks are tight if the CPUs are legitimately hot. Poor thermal paste distribution on the CPUs may cause issues too.

Also, "rdmsr -f 23:16 -d 0x1a2" will return the temperature threshold in
degrees C. If a core hits that temperature, it will be throttled. I haven't
tried this with hyperthreading, so I don't know whether you'll get "extra"
results when querying all the threads. I'm guessing both threads will
return the temperature of the core they share.

Just so you know, the kernel is merely responding to interrupts from the
processor cores themselves saying they are over temperature. The cores
have their thresholds set, and the kernel can't and doesn't change them.
If the kernel reports that the processors are hot, the processors are
actually hot.

Ryan Cox

On 12/08/2010 02:09 PM, Ryan Cox wrote:
> Try running the following code. Load the "msr" kernel module and be sure
> rdmsr is installed. It's available from
> http://www.kernel.org/pub/linux/utils/cpu/msr-tools/ and is simple to
> compile.
>
> for a in /dev/cpu/[0-9]*
> do
>     cpu=$(basename "$a")
>     printf "%2d: " "$cpu"
>     echo $(($(rdmsr -f 23:16 -p$cpu -d 0x1a2) - $(rdmsr -f 22:16 -p$cpu -u 0x19c)))
> done
>
> That should return the core temperatures in Celsius by reading the
> values from the CPU MSRs. I may have some other ideas for you if what
> that reveals doesn't help.
>
> Ryan
>
> On 12/08/2010 02:00 PM, Erich Weiler wrote:
>> Hi All,
>>
>> We're running CentOS 5.5 (kernel 2.6.18-194.3.1.el5) on two Dell R910
>> servers. We're periodically getting CPU overheating messages spit out
>> from syslogd:
>>
>> Message from syslogd@ at Fri Dec 3 12:06:56 2010 ...
>> server kernel: CPU60: Temperature above threshold, cpu clock throttled
>>
>> Message from syslogd@ at Fri Dec 3 12:06:56 2010 ...
>> server kernel: CPU28: Temperature above threshold, cpu clock throttled
>>
>> Message from syslogd@ at Fri Dec 3 12:06:56 2010 ...
>> server kernel: CPU24: Temperature/speed normal
>>
>> Message from syslogd@ at Fri Dec 3 12:06:56 2010 ...
>> server kernel: CPU32: Temperature/speed normal
>>
>> The servers are well ventilated in a datacenter, and they both exhibit
>> the same problem when under load. I think the fans are working OK, but
>> maybe these CPUs just run a little hotter than others, which may be
>> triggering the threshold in the kernel? Anyone else seen this before?
>>
>> lm_sensors doesn't work on these boxes. Any info on why it's happening,
>> or a good way to query the CPU temps, would be much appreciated!
>>
>> TIA!
>>
>> _______________________________________________
>> Linux-PowerEdge mailing list
>> [email protected]
>> https://lists.us.dell.com/mailman/listinfo/linux-poweredge
>> Please read the FAQ at http://lists.us.dell.com/faq

-- 
Ryan Cox
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
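
For anyone who wants to see the threshold and the remaining headroom next to each temperature, here is a rough sketch built on the same two MSR reads quoted above (bits 23:16 of 0x1a2 for the threshold, bits 22:16 of 0x19c for the number of degrees below it). It has not been verified on the R910s. The function names core_temp and report are made up for illustration, and it assumes the "msr" module is loaded, rdmsr from msr-tools is in the PATH, and it is run as root.

```shell
#!/bin/sh
# Sketch only: per logical CPU, print the temperature, the throttle
# threshold, and the headroom left before throttling.
# Assumptions: "msr" kernel module loaded, rdmsr installed, run as root.

# core_temp THRESHOLD READOUT  (hypothetical helper)
#   THRESHOLD: bits 23:16 of MSR 0x1a2, in degrees C
#   READOUT:   bits 22:16 of MSR 0x19c, degrees below the threshold
# Prints the core temperature in degrees C.
core_temp() {
    echo $(( $1 - $2 ))
}

# report  (hypothetical helper): walk every /dev/cpu/N entry.
report() {
    for dev in /dev/cpu/[0-9]*; do
        # Skip entries without an msr device (also covers an unmatched glob).
        [ -e "$dev/msr" ] || continue
        cpu=$(basename "$dev")
        threshold=$(rdmsr -f 23:16 -p"$cpu" -d 0x1a2) || continue
        readout=$(rdmsr -f 22:16 -p"$cpu" -u 0x19c) || continue
        printf '%3d: %3d C (threshold %d C, headroom %d C)\n' \
            "$cpu" "$(core_temp "$threshold" "$readout")" \
            "$threshold" "$readout"
    done
}

# Only touch the MSRs when rdmsr is actually available.
if command -v rdmsr >/dev/null 2>&1; then
    report
fi
```

The headroom column is simply the raw readout from 0x19c, so a core showing a headroom of 0 is sitting at its threshold and is about to be throttled.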
