Dear All,

During the last couple of weeks the syslog of a PowerEdge C6145 has been
showing messages like

Mar  9 12:09:48 machinename kernel: [19255328.088523] [Hardware Error]: Corrected error, no action required.
Mar  9 12:09:48 machinename kernel: [19255328.090215] [Hardware Error]: CPU:48 (15:1:2) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c33c00020080813
Mar  9 12:09:48 machinename kernel: [19255328.091818] [Hardware Error]: MC4 Error Address: 0x00000053592a4510
Mar  9 12:09:48 machinename kernel: [19255328.093410] [Hardware Error]: MC4 Error (node 6): DRAM ECC error detected on the NB.
Mar  9 12:09:48 machinename kernel: [19255328.094999] EDAC MC6: 1 CE on mc#6csrow#1channel#0 (csrow:1 channel:0 page:0x53592a4 offset:0x510 grain:0 syndrome:0x2067)
Mar  9 12:09:48 machinename kernel: [19255328.095001] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)

with a frequency of 4 to 20 per day.
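
As a cross-check on how the EDAC line relates to the MC4 error address: the
page and offset it reports (page:0x53592a4 offset:0x510) are what one gets by
splitting the address 0x00000053592a4510 at a 4 KiB boundary. A minimal Python
sketch of that arithmetic, assuming 4 KiB pages (the function name is only for
illustration):

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages, i.e. the low 12 bits are the offset


def split_error_address(addr):
    """Split a physical address into (page frame number, offset within page)."""
    page = addr >> PAGE_SHIFT
    offset = addr & ((1 << PAGE_SHIFT) - 1)
    return page, offset


page, offset = split_error_address(0x00000053592A4510)
print(hex(page), hex(offset))  # 0x53592a4 0x510
```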

I have swapped the suspected DIMM twice, and the error location changes
(i.e., the line "EDAC MC6: 1 CE on mc#6csrow#1channel#0" changes; for
instance, the MC has also been 5 and 3). This suggests to me that one DIMM is
faulty. (Another reason I think the kernel is not producing false
positives is that another machine in the same chassis, running similar jobs
under similar load, never gives me any kernel messages like these.)
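
To make the "location changes after a swap" comparison less manual, the EDAC
lines can be tallied per location. The following is only a sketch: it assumes
each entry sits on a single line in the real syslog (they are wrapped only in
this email) and that the format matches the sample above; the regular
expression and the function name are mine.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... EDAC MC6: 1 CE on mc#6csrow#1channel#0 (csrow:1 channel:0 page:...
CE_RE = re.compile(r"EDAC MC(\d+): (\d+) CE on .*?\(csrow:(\d+) channel:(\d+)")


def tally_ce(lines):
    """Count corrected errors per (mc, csrow, channel) location."""
    counts = Counter()
    for line in lines:
        m = CE_RE.search(line)
        if m:
            mc, n, csrow, channel = map(int, m.groups())
            counts[(mc, csrow, channel)] += n
    return counts


sample = [
    "Mar  9 12:09:48 machinename kernel: [19255328.094999] EDAC MC6: 1 CE on "
    "mc#6csrow#1channel#0 (csrow:1 channel:0 page:0x53592a4 offset:0x510 "
    "grain:0 syndrome:0x2067)",
]
print(tally_ce(sample))  # Counter({(6, 1, 0): 1})
```

Running tally_ce() over the syslog from before and after each swap (for
example by passing an open file object) would show whether the errors follow
the module or stay with the slot.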


None of those errors, however, is shown by ipmitool (except for a
single entry that reads "Memory #0x60 | Correctable ECC |
Asserted"). After finding
http://lists.us.dell.com/pipermail/linux-poweredge/2010-October/043461.html
http://lists.us.dell.com/pipermail/linux-poweredge/2010-October/043457.html

and the following two HP recommendations
http://h20566.www2.hpe.com/hpsc/doc/public/display?sp4ts.oid=3890172&docId=emr_na-c03519543&docLocale=en_US
http://h20565.www2.hpe.com/hpsc/doc/public/display?sp4ts.oid=5379860&docLocale=en_US&docId=emr_na-c04183538
     

I rebooted after:

- disabling the edac modules
- setting the boot parameter "mce=ignore_ce".
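
To confirm after the reboot that both changes actually took effect, something
like the sketch below can be used. It assumes the usual Linux locations
/proc/cmdline and /proc/modules; the helper names are hypothetical, and
"any module with edac in its name" is just a heuristic of mine.

```python
def ignore_ce_set(cmdline):
    """True if mce=ignore_ce appears as a parameter on the kernel command line."""
    return "mce=ignore_ce" in cmdline.split()


def loaded_edac_modules(proc_modules):
    """Return the names of loaded modules whose name contains 'edac'."""
    names = [line.split()[0] for line in proc_modules.splitlines() if line.strip()]
    return [name for name in names if "edac" in name.lower()]


def report():
    """Read the live system state (Linux only) and print a short summary."""
    with open("/proc/cmdline") as fh:
        print("mce=ignore_ce set:", ignore_ce_set(fh.read()))
    with open("/proc/modules") as fh:
        print("EDAC modules loaded:", loaded_edac_modules(fh.read()) or "none")
```

Calling report() on the affected machine prints both facts; the two helpers
take plain strings so they can also be run against saved copies of those files.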



Since then, however, nothing new has appeared in any of the logs (ipmitool, or
the reports obtained from pec-logs.sh or OMSA/DSET, including the ESM
log). Given how many events per day the kernel had been
reporting, I find this very surprising. I suspect that with the EDAC
modules loaded and mce enabled, the kernel reports ECC errors that
would otherwise go unnoticed.
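
If the EDAC modules are loaded again, the kernel also keeps running counters in
sysfs, which could be compared against what the BMC logs. A minimal sketch,
assuming the csrow-based layout /sys/devices/system/edac/mc/mcN/csrowM/ce_count
(this tree is only present while an EDAC driver is loaded, and the exact layout
may differ between kernel versions):

```python
import glob
import os


def read_ce_counts(root="/sys/devices/system/edac/mc"):
    """Return {(mc, csrow): corrected-error count} from the EDAC sysfs tree."""
    counts = {}
    for path in glob.glob(os.path.join(root, "mc*", "csrow*", "ce_count")):
        csrow_dir = os.path.dirname(path)
        csrow = os.path.basename(csrow_dir)
        mc = os.path.basename(os.path.dirname(csrow_dir))
        with open(path) as fh:
            counts[(mc, csrow)] = int(fh.read().strip())
    return counts
```

Printing read_ce_counts() once a day would give a per-csrow count that does not
depend on the syslog being retained.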


Questions:

a) What should I do and where should I look to make sure the ECC errors
are logged properly, so I can provide the report to DELL's technical
support?


b) How can I use DELL's tools to identify exactly which DIMM is affected? (So
far, I've swapped modules and guessed from the EDAC messages, but
this is cumbersome and I still have not been able to narrow the issue down
to a single DIMM; I'd need another one or two swaps.)


c) Should I re-enable mce and load the EDAC modules again?



Thanks,


R.




-- 
Ramon Diaz-Uriarte
Department of Biochemistry, Lab B-25
Facultad de Medicina
Universidad Autónoma de Madrid 
Arzobispo Morcillo, 4
28029 Madrid
Spain

Phone: +34-91-497-2412

Email: [email protected]
       [email protected]

http://ligarto.org/rdiaz

_______________________________________________
Linux-PowerEdge mailing list
[email protected]
https://lists.us.dell.com/mailman/listinfo/linux-poweredge
