Dear All,

During the last couple of weeks the syslog of a PowerEdge C6145 has been showing messages like the following, at a rate of 4 to 20 per day:

    Mar 9 12:09:48 machinename kernel: [19255328.088523] [Hardware Error]: Corrected error, no action required.
    Mar 9 12:09:48 machinename kernel: [19255328.090215] [Hardware Error]: CPU:48 (15:1:2) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c33c00020080813
    Mar 9 12:09:48 machinename kernel: [19255328.091818] [Hardware Error]: MC4 Error Address: 0x00000053592a4510
    Mar 9 12:09:48 machinename kernel: [19255328.093410] [Hardware Error]: MC4 Error (node 6): DRAM ECC error detected on the NB.
    Mar 9 12:09:48 machinename kernel: [19255328.094999] EDAC MC6: 1 CE on mc#6csrow#1channel#0 (csrow:1 channel:0 page:0x53592a4 offset:0x510 grain:0 syndrome:0x2067)
    Mar 9 12:09:48 machinename kernel: [19255328.095001] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)

I have swapped the suspected DIMM twice, and the error location changes with it (i.e., the line "EDAC MC6: 1 CE on mc#6csrow#1channel#0" changes; for instance, the MC has also been 5 and 3). This suggests to me that one DIMM is faulty. Another reason why I think the kernel is not producing false positives is that another machine in the same chassis, running similar jobs under similar load, never gives me any kernel messages like these.

None of those errors, however, are shown by ipmitool (except for one lonely entry that reads "Memory #0x60 | Correctable ECC | Asserted").

After finding

    http://lists.us.dell.com/pipermail/linux-poweredge/2010-October/043461.html
    http://lists.us.dell.com/pipermail/linux-poweredge/2010-October/043457.html

and the next two HP recommendations

    http://h20566.www2.hpe.com/hpsc/doc/public/display?sp4ts.oid=3890172&docId=emr_na-c03519543&docLocale=en_US
    http://h20565.www2.hpe.com/hpsc/doc/public/display?sp4ts.oid=5379860&docLocale=en_US&docId=emr_na-c04183538

I rebooted after:

- disabling the EDAC modules
- setting the boot parameter "mce=ignore_ce".
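
For reference, the "EDAC MCn: ... CE on ..." lines quoted above can be tallied per location, which makes it easier to see whether the errors follow a DIMM after a swap. The following is only a minimal sketch: it assumes the exact message format shown in this post (other kernel versions may format EDAC lines differently), and the function name tally_corrected_errors is hypothetical, not part of any Dell or kernel tool.

```python
import re
from collections import Counter

# Matches the EDAC corrected-error line format quoted in this post, e.g.
# "EDAC MC6: 1 CE on mc#6csrow#1channel#0 (csrow:1 channel:0 page:...".
# Assumption: other kernels may word this line differently.
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE on .*?"
    r"\(csrow:(?P<csrow>\d+) channel:(?P<channel>\d+)"
)


def tally_corrected_errors(lines):
    """Sum corrected-error counts per (mc, csrow, channel) location."""
    totals = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match is None:
            continue  # not an EDAC CE line; skip it
        location = (
            int(match["mc"]),
            int(match["csrow"]),
            int(match["channel"]),
        )
        totals[location] += int(match["count"])
    return totals
```

Any iterable of lines will do, for example an open syslog file; calling .most_common() on the result lists the locations with the most corrected errors first.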

After that reboot, however, nothing new is being shown in any of the logs (ipmitool, or the reports obtained from pec-logs.sh or OMSA/DSET, including the ESM log). Given the very high number of events per day the kernel was reporting, I find this very surprising. I suspect that with the EDAC modules loaded and MCE corrected-error reporting enabled, the kernel reports ECC errors that otherwise go unnoticed.

Questions:

a) What should I do, and where should I look, to make sure the ECC errors are logged properly, so I can provide the report to DELL's technical support?

b) How can I use DELL's tools to identify exactly the affected DIMM? (So far I've used swapping of modules plus guesswork from the EDAC messages, but this is cumbersome and I still have not been able to narrow the issue down to a single DIMM; I'd need another one or two swaps.)

c) Should I re-enable mce (i.e., remove "mce=ignore_ce") and load the EDAC modules again?

Thanks,

R.

--
Ramon Diaz-Uriarte
Department of Biochemistry, Lab B-25
Facultad de Medicina
Universidad Autónoma de Madrid
Arzobispo Morcillo, 4
28029 Madrid
Spain

Phone: +34-91-497-2412
Email: [email protected]
       [email protected]
http://ligarto.org/rdiaz
