[EXTERNAL EMAIL]

Hey,

Dell does not support using EDAC on your servers and recommends turning it off ( https://www.dell.com/support/article/us/en/19/sln283389/edac-errors-in-messages-log-in-redhat-enterprise-linux-rhel-and-poweredge?lang=en ). So if you want your warranty to cover replacing the DIMM, you need to disable EDAC and let iDRAC discover the errors.

I use Ansible to disable EDAC: https://github.com/Klaas-/ansible-role-disable-edac

Greetings
Klaas

On Wed, Feb 13, 2019, 05:39 Mauricio Tavares <[email protected]> wrote:

> On Tue, Feb 12, 2019 at 4:50 PM Tru Huynh <[email protected]> wrote:
> >
> > Hello
> >
> > One of our T7820s, running CentOS-7 x86_64 (3.10.0-957.5.1.el7.x86_64)
> > with the latest BIOS 1.9.2 (01/24/2019), is logging:
> >
> > dmesg:
> > [15108.602969] mce: [Hardware Error]: Machine check events logged
> > [15108.603012] EDAC skx MC3: HANDLING MCE MEMORY ERROR
> > [15108.603015] EDAC skx MC3: CPU 8: Machine Check Event: 0 Bank 18: 8c000040000800c2
> > [15108.603016] EDAC skx MC3: TSC 0
> > [15108.603018] EDAC skx MC3: ADDR 134fef57c0
> > [15108.603019] EDAC skx MC3: MISC 900040004000086
> > [15108.603021] EDAC skx MC3: PROCESSOR 0:50654 TIME 1549992804 SOCKET 1 APIC 10
> > [15108.603030] EDAC MC3: 1 CE memory scrubbing error on CPU_SrcID#1_MC#1_Chan#2_DIMM#0 (channel:2 slot:0 page:0x134fef5 offset:0x7c0 grain:32 syndrome:0x0 - err_code:0008:00c2 socket:1 imc:1 rank:0 bg:1 ba:3 row:29ff col:358)
> >
> > /var/log/messages:
> > Feb 12 18:33:24 ibet kernel: mce: [Hardware Error]: Machine check events logged
> > Feb 12 18:33:24 ibet kernel: EDAC MC3: 1 CE memory scrubbing error on CPU_SrcID#1_MC#1_Chan#2_DIMM#0 (channel:2 slot:0 page:0x134fef5 offset:0x7c0 grain:32 syndrome:0x0 - err_code:0008:00c2 socket:1 imc:1 rank:0 bg:1 ba:3 row:29ff col:358)
> > Feb 12 18:33:24 ibet mcelog: Hardware event. This is not a software error.
> > Feb 12 18:33:24 ibet mcelog: MCE 0
> > Feb 12 18:33:24 ibet mcelog: CPU 8 BANK 18
> > Feb 12 18:33:24 ibet mcelog: MISC 900040004000086 ADDR 134fef57c0
> > Feb 12 18:33:24 ibet mcelog: TIME 1549992804 Tue Feb 12 18:33:24 2019
> > Feb 12 18:33:24 ibet mcelog: MCG status:
> > Feb 12 18:33:24 ibet mcelog: MCi status:
> > Feb 12 18:33:24 ibet mcelog: Corrected error
> > Feb 12 18:33:24 ibet mcelog: MCi_MISC register valid
> > Feb 12 18:33:24 ibet mcelog: MCi_ADDR register valid
> > Feb 12 18:33:24 ibet mcelog: MCA: MEMORY CONTROLLER MS_CHANNEL2_ERR
> > Feb 12 18:33:24 ibet mcelog: Transaction: Memory scrubbing error
> > Feb 12 18:33:24 ibet mcelog: MemCtrl: Corrected patrol scrub error
> > Feb 12 18:33:24 ibet mcelog: STATUS 8c000040000800c2 MCGSTATUS 0
> > Feb 12 18:33:24 ibet mcelog: MCGCAP 7000c14 APICID 10 SOCKETID 1
> > Feb 12 18:33:24 ibet mcelog: PPIN fdf60614f277367e
> > Feb 12 18:33:24 ibet mcelog: MICROCODE 200004d
> > Feb 12 18:33:24 ibet mcelog: CPUID Vendor Intel Family 6 Model 85
> >
> > The Dell embedded basic diagnostic tests (F12 on boot) do not show any
> > errors, but that is expected since the error is corrected, as stated by
> > "MemCtrl: Corrected patrol scrub error".
> >
> > The error doesn't show immediately after boot; this time it occurred ~2h
> > after a cold boot.
> >
> > There are 8x 16GB DIMMs in that machine, and Dell support is only
> > willing to ship one stick and let me find which one is unhealthy...
> > Is there a tool available to identify the bad one?
> >
> > dmidecode lets me identify DIMM[1-6]_CPU[0-1], but which one is
> > "CPU 8 BANK 18"?
> >
> > Cheers
> >
> > Tru
> >
> > --
> > Dr Tru Huynh | mailto:[email protected] | tel +33 1 45 68 87 37
> > https://research.pasteur.fr/en/team/structural-bioinformatics/
> > Institut Pasteur, 25-28 rue du Docteur Roux, 75724 Paris CEDEX 15 France
> >
> It keeps saying "channel:2 slot:0" or "channel:2 dimm:0", so you
> can associate that with the dmidecode output.
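
The channel/slot reading described above can be pulled out of the log line mechanically. Here is a minimal sketch in Python, assuming the label always has the CPU_SrcID#N_MC#N_Chan#N_DIMM#N shape seen in the quoted dmesg output. The function name parse_edac_label is made up for illustration. Mapping the result to a physical slot name, such as the DIMM[1-6]_CPU[0-1] labels from dmidecode, still depends on the board layout, which this does not attempt.

```python
import re

# Sample line copied from the dmesg output quoted in this thread.
LINE = (
    "EDAC MC3: 1 CE memory scrubbing error on "
    "CPU_SrcID#1_MC#1_Chan#2_DIMM#0 (channel:2 slot:0 page:0x134fef5 "
    "offset:0x7c0 grain:32 syndrome:0x0 - err_code:0008:00c2 socket:1 "
    "imc:1 rank:0 bg:1 ba:3 row:29ff col:358)"
)

# Assumption: the DIMM label follows the pattern seen in the sample above.
LABEL_RE = re.compile(r"CPU_SrcID#(\d+)_MC#(\d+)_Chan#(\d+)_DIMM#(\d+)")


def parse_edac_label(line):
    """Return socket, imc, channel and dimm as ints, or None if no label is found."""
    match = LABEL_RE.search(line)
    if match is None:
        return None
    socket, imc, channel, dimm = (int(group) for group in match.groups())
    return {"socket": socket, "imc": imc, "channel": channel, "dimm": dimm}


if __name__ == "__main__":
    print(parse_edac_label(LINE))
```

Feeding it lines from /var/log/messages and counting the results per (socket, imc, channel, dimm) tuple would show whether the corrected errors keep landing on the same slot.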
>
> Pop the cover open and stick your head in there. Chances are you will see
> a table listing all the memory slots and what they are called; it might
> be on a sticker under the cover. What are they called? I mean,
> channel:2 dimm:0 sounds like the first memory slot in the 3rd bank.
>
> Have you asked the DRAC what's up?
>
> Also, I have seen (older) machines which had an LED for each memory
> slot. Bad memory would cause the LED to be sad.
> >
> > _______________________________________________
> > Linux-PowerEdge mailing list
> > [email protected]
> > https://lists.us.dell.com/mailman/listinfo/linux-poweredge
