On Tue, Feb 12, 2019 at 4:50 PM Tru Huynh <[email protected]> wrote:
>
> Hello
>
> One of our T7820 running CentOS-7 x86_64 3.10.0-957.5.1.el7.x86_64
> latest bios 1.9.2 (01/24/2019) is logging:
>
> dmesg:
> [15108.602969] mce: [Hardware Error]: Machine check events logged
> [15108.603012] EDAC skx MC3: HANDLING MCE MEMORY ERROR
> [15108.603015] EDAC skx MC3: CPU 8: Machine Check Event: 0 Bank 18: 8c000040000800c2
> [15108.603016] EDAC skx MC3: TSC 0
> [15108.603018] EDAC skx MC3: ADDR 134fef57c0
> [15108.603019] EDAC skx MC3: MISC 900040004000086
> [15108.603021] EDAC skx MC3: PROCESSOR 0:50654 TIME 1549992804 SOCKET 1 APIC 10
> [15108.603030] EDAC MC3: 1 CE memory scrubbing error on CPU_SrcID#1_MC#1_Chan#2_DIMM#0 (channel:2 slot:0 page:0x134fef5 offset:0x7c0 grain:32 syndrome:0x0 - err_code:0008:00c2 socket:1 imc:1 rank:0 bg:1 ba:3 row:29ff col:358)
>
> /var/log/messages:
> Feb 12 18:33:24 ibet kernel: mce: [Hardware Error]: Machine check events logged
> Feb 12 18:33:24 ibet kernel: EDAC MC3: 1 CE memory scrubbing error on CPU_SrcID#1_MC#1_Chan#2_DIMM#0 (channel:2 slot:0 page:0x134fef5 offset:0x7c0 grain:32 syndrome:0x0 - err_code:0008:00c2 socket:1 imc:1 rank:0 bg:1 ba:3 row:29ff col:358)
> Feb 12 18:33:24 ibet mcelog: Hardware event. This is not a software error.
> Feb 12 18:33:24 ibet mcelog: MCE 0
> Feb 12 18:33:24 ibet mcelog: CPU 8 BANK 18
> Feb 12 18:33:24 ibet mcelog: MISC 900040004000086 ADDR 134fef57c0
> Feb 12 18:33:24 ibet mcelog: TIME 1549992804 Tue Feb 12 18:33:24 2019
> Feb 12 18:33:24 ibet mcelog: MCG status:
> Feb 12 18:33:24 ibet mcelog: MCi status:
> Feb 12 18:33:24 ibet mcelog: Corrected error
> Feb 12 18:33:24 ibet mcelog: MCi_MISC register valid
> Feb 12 18:33:24 ibet mcelog: MCi_ADDR register valid
> Feb 12 18:33:24 ibet mcelog: MCA: MEMORY CONTROLLER MS_CHANNEL2_ERR
> Feb 12 18:33:24 ibet mcelog: Transaction: Memory scrubbing error
> Feb 12 18:33:24 ibet mcelog: MemCtrl: Corrected patrol scrub error
> Feb 12 18:33:24 ibet mcelog: STATUS 8c000040000800c2 MCGSTATUS 0
> Feb 12 18:33:24 ibet mcelog: MCGCAP 7000c14 APICID 10 SOCKETID 1
> Feb 12 18:33:24 ibet mcelog: PPIN fdf60614f277367e
> Feb 12 18:33:24 ibet mcelog: MICROCODE 200004d
> Feb 12 18:33:24 ibet mcelog: CPUID Vendor Intel Family 6 Model 85
>
> The Dell embedded basic diagnostic tests (F12 on boot) do not show any
> errors, but that is expected since the issue is corrected, as stated by
> "MemCtrl: Corrected patrol scrub error".
>
> The error doesn't show immediately after boot; this time it occurred ~2h
> after a cold boot.
>
> There are 8x 16GB DIMMs on that machine, and Dell support is only willing
> to ship one stick and let me find which one is unhealthy... Is there a
> tool available to identify the bad one?
>
> dmidecode can let me identify DIMM[1-6]_CPU[0-1], but which one is
> "CPU 8 BANK 18"?
>
> Cheers
>
> Tru
>
> --
> Dr Tru Huynh | mailto:[email protected] | tel +33 1 45 68 87 37
> https://research.pasteur.fr/en/team/structural-bioinformatics/
> Institut Pasteur, 25-28 rue du Docteur Roux, 75724 Paris CEDEX 15 France

It keeps saying "channel:2 slot:0" or "channel:2 dimm:0", so you can
associate that with the dmidecode output. Pop the cover open and stick
your head in there. Chances are you will see a table listing all the
memory slots and what they are called; it might be on a sticker under
the cover. What are they called? I mean, channel:2 dimm:0 sounds like
the first memory slot on the 3rd channel.

Have you asked the DRAC what's up?

Also, I have seen (older) machines which had an LED for each memory
slot. Bad memory would cause the LED to be sad.

_______________________________________________
Linux-PowerEdge mailing list
[email protected]
https://lists.us.dell.com/mailman/listinfo/linux-poweredge
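
P.S. In case it helps to narrow things down before opening the case, here is a rough Python sketch that pulls the location fields out of the EDAC line quoted above and decodes the STATUS value. The function names are made up for illustration, and the bit layout is my reading of the architectural IA32_MCi_STATUS format in the Intel SDM, so treat it as an assumption to check rather than a reference. Note that "CPU 8 BANK 18" is the machine-check register bank that reported the event, not a DIMM position; the socket/imc/channel/slot fields are the ones to match against the slot labels on the board.

```python
import re

# The EDAC line from the dmesg output quoted in this thread.
EDAC_LINE = (
    "EDAC MC3: 1 CE memory scrubbing error on "
    "CPU_SrcID#1_MC#1_Chan#2_DIMM#0 (channel:2 slot:0 page:0x134fef5 "
    "offset:0x7c0 grain:32 syndrome:0x0 - err_code:0008:00c2 socket:1 "
    "imc:1 rank:0 bg:1 ba:3 row:29ff col:358)"
)


def parse_edac_location(line):
    """Extract the key:value location fields (hypothetical helper)."""
    pairs = re.findall(r"\b(socket|imc|channel|slot|rank):(\d+)", line)
    return {key: int(value) for key, value in pairs}


def decode_mci_status(status):
    """Decode selected MCi_STATUS fields (assumed architectural layout)."""
    mca_code = status & 0xFFFF
    decoded = {
        "valid": bool((status >> 63) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
    }
    # Memory controller errors use the pattern 0000 0000 1MMM CCCC,
    # where MMM is the transaction type and CCCC the channel.
    if (mca_code & 0xFF80) == 0x0080:
        decoded["scrub_error"] = ((mca_code >> 4) & 0x7) == 0b100
        decoded["channel"] = mca_code & 0xF
    return decoded


if __name__ == "__main__":
    print(parse_edac_location(EDAC_LINE))
    print(decode_mci_status(0x8C000040000800C2))
```

The script only restates what the kernel already logged (socket 1, memory controller 1, channel 2, slot 0); which physical slot that corresponds to still has to come from the board labelling or the service manual.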
