P.S. Also look at the output of: watch cat /proc/interrupts
You might get a qualitative idea of whether there is a huge rate of interrupts.
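If eyeballing the watch output is too fuzzy, the same idea can be sketched in Python: take two snapshots of /proc/interrupts and print the sources with the highest rate. This is only a rough sketch. The function names, the 5 second interval and the top-10 cut-off are my own assumptions, and it assumes the usual layout of a header row of CPU columns followed by one row per interrupt source.

```python
import time

def parse_interrupts(text):
    """Return {label: count summed over all CPUs} for one /proc/interrupts snapshot."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())              # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        counts = []
        for tok in rest.split()[:ncpu]:       # per-CPU counters come first
            if not tok.isdigit():
                break                         # reached the description columns
            counts.append(int(tok))
        if counts:
            totals[label.strip()] = sum(counts)
    return totals

def interrupt_rates(before, after, seconds):
    """Interrupts per second for each source, highest rate first."""
    rates = {k: (after[k] - before.get(k, 0)) / seconds for k in after}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)

def sample(path="/proc/interrupts", seconds=5.0, top=10):
    """Read the file twice, `seconds` apart, and return the busiest sources."""
    with open(path) as f:
        before = parse_interrupts(f.read())
    time.sleep(seconds)
    with open(path) as f:
        after = parse_interrupts(f.read())
    return interrupt_rates(before, after, seconds)[:top]
```

Calling sample() on the slow node and on a healthy node, then comparing the top entries, gives a rough side-by-side view.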
On 10 August 2017 at 16:59, John Hearns <hear...@googlemail.com> wrote:

> Faraz,
> I think you might have to buy me a virtual coffee. Or a beer!
> Please look at the hardware health of that machine. Specifically the
> DIMMs. I have seen this before!
> If you have some DIMMs which are faulty and are generating ECC errors,
> then, if the mcelog service is enabled,
> an interrupt is generated for every ECC event. So the system is spending
> time servicing these interrupts.
>
> So: look in your /var/log/mcelog for hardware errors
> Look in your /var/log/messages for hardware errors also
> Look in the IPMI event logs for ECC errors: ipmitool sel elist
>
> I would also bring that node down and boot it with memtester.
> If there is a DIMM which is that badly faulty then memtester will discover
> it within minutes.
>
> Or it could be something else - in which case I get no coffee.
>
> Also, Intel Cluster Checker is intended to deal with exactly these
> situations.
> What is your cluster manager, and is Intel Cluster Checker available to
> you?
> I would seriously look at getting this installed.
>
> On 10 August 2017 at 16:39, Faraz Hussain <i...@feacluster.com> wrote:
>
>> One of our compute nodes runs ~30% slower than others. It has the exact
>> same image so I am baffled why it is running slow. I have tested OMP and
>> MPI benchmarks. Everything runs slower. The cpu usage goes to 2000%, so all
>> looks normal there.
>>
>> I thought it may have to do with cpu scaling, i.e. when the kernel changes
>> the cpu speed depending on the workload. But we do not have that enabled on
>> these machines.
>>
>> Here is a snippet from "cat /proc/cpuinfo". Everything is identical to
>> our other nodes. Any suggestions on what else to check? I have tried
>> rebooting it.
>>
>> processor       : 19
>> vendor_id       : GenuineIntel
>> cpu family      : 6
>> model           : 62
>> model name      : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
>> stepping        : 4
>> cpu MHz         : 2500.098
>> cache size      : 25600 KB
>> physical id     : 1
>> siblings        : 10
>> core id         : 12
>> cpu cores       : 10
>> apicid          : 56
>> initial apicid  : 56
>> fpu             : yes
>> fpu_exception   : yes
>> cpuid level     : 13
>> wp              : yes
>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
>> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
>> nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology
>> nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2
>> ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt
>> tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat xsaveopt pln
>> pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
>> bogomips        : 5004.97
>> clflush size    : 64
>> cache_alignment : 64
>> address sizes   : 46 bits physical, 48 bits virtual
>> power management:
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
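The three log checks suggested in the quoted message (/var/log/mcelog, /var/log/messages and ipmitool sel elist) could be wrapped in one small Python helper so they can be repeated on every node. This is a rough sketch only. The keyword list is an assumption on my part, since the actual wording of ECC and machine-check messages differs between kernels, mcelog versions and BMC vendors, and the function names are made up for illustration.

```python
import re
import subprocess

# Assumed keywords: real wording varies by kernel, mcelog version and BMC vendor.
HW_ERROR = re.compile(r"\b(ecc|mce)\b|machine check|hardware error", re.IGNORECASE)

def find_hardware_errors(lines):
    """Return the lines that mention one of the assumed hardware-error keywords."""
    return [line.rstrip("\n") for line in lines if HW_ERROR.search(line)]

def scan_file(path):
    """Scan a log file such as /var/log/mcelog or /var/log/messages."""
    with open(path, errors="replace") as f:
        return find_hardware_errors(f)

def scan_sel():
    """Scan the IPMI event log via 'ipmitool sel elist' (usually needs root)."""
    out = subprocess.run(["ipmitool", "sel", "elist"],
                         capture_output=True, text=True, check=True).stdout
    return find_hardware_errors(out.splitlines())
```

A node where scan_file() or scan_sel() returns many lines while the other nodes return none would be the first candidate for a memory test.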