Re: [Linux-PowerEdge] T7820 one bad DIMM among 8: which one?

Klaas Demter Tue, 12 Feb 2019 23:47:15 -0800

Hey,
Dell does not support using EDAC on their servers and recommends turning
it off (
https://www.dell.com/support/article/us/en/19/sln283389/edac-errors-in-messages-log-in-redhat-enterprise-linux-rhel-and-poweredge?lang=en
). So if you want your warranty to cover replacing the DIMM, you
need to disable EDAC and let iDRAC detect the errors. I use Ansible
to disable EDAC: https://github.com/Klaas-/ansible-role-disable-edac
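
For readers who do not use Ansible, here is a rough sketch of one common
way to keep EDAC drivers from loading: blacklisting them through a file
under /etc/modprobe.d/. This is an assumption about the general approach,
not a description of what the role linked above does. The sample lsmod
text and the function name are hypothetical; feed it real `lsmod` output
from the affected machine.

```python
def edac_blacklist_lines(lsmod_text):
    """Return modprobe.d 'blacklist' lines for loaded modules named *edac*."""
    modules = set()
    # Skip the "Module  Size  Used by" header line of lsmod output.
    for line in lsmod_text.splitlines()[1:]:
        fields = line.split()
        if fields and "edac" in fields[0]:
            modules.add(fields[0])
    return ["blacklist " + name for name in sorted(modules)]


# Hypothetical lsmod excerpt, for illustration only.
SAMPLE_LSMOD = """\
Module                  Size  Used by
skx_edac               12345  0
edac_core              67890  1 skx_edac
nfit                   55555  1 skx_edac
"""

if __name__ == "__main__":
    # The printed lines could be saved as a file in /etc/modprobe.d/.
    print("\n".join(edac_blacklist_lines(SAMPLE_LSMOD)))
```

A reboot (or unloading the modules) would still be needed before the
change takes effect.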


Greetings

Klaas


On Wed, Feb 13, 2019, 05:39 Mauricio Tavares <[email protected]> wrote:

>
> On Tue, Feb 12, 2019 at 4:50 PM Tru Huynh <[email protected]> wrote:
> >
> > Hello
> >
> > One of our T7820 running CentOS-7 x86_64 3.10.0-957.5.1.el7.x86_64
> > latest bios 1.9.2 (01/24/2019) is logging:
> >
> > dmesg:
> > [15108.602969] mce: [Hardware Error]: Machine check events logged
> > [15108.603012] EDAC skx MC3: HANDLING MCE MEMORY ERROR
> > [15108.603015] EDAC skx MC3: CPU 8: Machine Check Event: 0 Bank 18: 8c000040000800c2
> > [15108.603016] EDAC skx MC3: TSC 0
> > [15108.603018] EDAC skx MC3: ADDR 134fef57c0
> > [15108.603019] EDAC skx MC3: MISC 900040004000086
> > [15108.603021] EDAC skx MC3: PROCESSOR 0:50654 TIME 1549992804 SOCKET 1 APIC 10
> > [15108.603030] EDAC MC3: 1 CE memory scrubbing error on CPU_SrcID#1_MC#1_Chan#2_DIMM#0 (channel:2 slot:0 page:0x134fef5 offset:0x7c0 grain:32 syndrome:0x0 -  err_code:0008:00c2 socket:1 imc:1 rank:0 bg:1 ba:3 row:29ff col:358)
> >
> > /var/log/messages:
> > Feb 12 18:33:24 ibet kernel: mce: [Hardware Error]: Machine check events logged
> > Feb 12 18:33:24 ibet kernel: EDAC MC3: 1 CE memory scrubbing error on CPU_SrcID#1_MC#1_Chan#2_DIMM#0 (channel:2 slot:0 page:0x134fef5 offset:0x7c0 grain:32 syndrome:0x0 -  err_code:0008:00c2 socket:1 imc:1 rank:0 bg:1 ba:3 row:29ff col:358)
> > Feb 12 18:33:24 ibet mcelog: Hardware event. This is not a software error.
> > Feb 12 18:33:24 ibet mcelog: MCE 0
> > Feb 12 18:33:24 ibet mcelog: CPU 8 BANK 18
> > Feb 12 18:33:24 ibet mcelog: MISC 900040004000086 ADDR 134fef57c0
> > Feb 12 18:33:24 ibet mcelog: TIME 1549992804 Tue Feb 12 18:33:24 2019
> > Feb 12 18:33:24 ibet mcelog: MCG status:
> > Feb 12 18:33:24 ibet mcelog: MCi status:
> > Feb 12 18:33:24 ibet mcelog: Corrected error
> > Feb 12 18:33:24 ibet mcelog: MCi_MISC register valid
> > Feb 12 18:33:24 ibet mcelog: MCi_ADDR register valid
> > Feb 12 18:33:24 ibet mcelog: MCA: MEMORY CONTROLLER MS_CHANNEL2_ERR
> > Feb 12 18:33:24 ibet mcelog: Transaction: Memory scrubbing error
> > Feb 12 18:33:24 ibet mcelog: MemCtrl: Corrected patrol scrub error
> > Feb 12 18:33:24 ibet mcelog: STATUS 8c000040000800c2 MCGSTATUS 0
> > Feb 12 18:33:24 ibet mcelog: MCGCAP 7000c14 APICID 10 SOCKETID 1
> > Feb 12 18:33:24 ibet mcelog: PPIN fdf60614f277367e
> > Feb 12 18:33:24 ibet mcelog: MICROCODE 200004d
> > Feb 12 18:33:24 ibet mcelog: CPUID Vendor Intel Family 6 Model 85
> >
> > The Dell embedded basic diagnostic tests (F12 on boot) do not show any
> > errors, but that is expected since the issue is corrected, as stated by
> > "MemCtrl: Corrected patrol scrub error".
> >
> > The error doesn't show immediately after boot; this time it occurred ~2h
> > after a cold boot.
> >
> > There are 8x 16GB DIMMs in that machine, and Dell support is only
> > willing to ship one stick and let me find which one is unhealthy...
> > Is there a tool available to identify the bad one?
> >
> > dmidecode lets me identify DIMM[1-6]_CPU[0-1], but which one is
> > "CPU 8 BANK 18"?
> >
> > Cheers
> >
> > Tru
> >
> > --
> > Dr Tru Huynh | mailto:[email protected] | tel +33 1 45 68 87 37
> > https://research.pasteur.fr/en/team/structural-bioinformatics/
> > Institut Pasteur, 25-28 rue du Docteur Roux, 75724 Paris CEDEX 15 France
> >
>       It keeps saying "channel:2 slot:0" or "channel:2 dimm:0", so you
> can associate that with the dmidecode output.
>
> Pop the cover open and stick your head in there. Chances are you will see
> a table listing all the memory slots and what they are called; it might
> be on a sticker under the cover. What are they called? I mean,
> channel:2 dimm:0 sounds like the first memory slot in the 3rd bank.
>
> Have you asked the DRAC what's up?
>
> Also, I have seen (older) machines which had an LED for each memory
> slot. Bad memory would cause the LED to be sad.
>
>
> > _______________________________________________
> > Linux-PowerEdge mailing list
> > [email protected]
> > https://lists.us.dell.com/mailman/listinfo/linux-poweredge
>
>
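
The EDAC line in the quoted log already names the socket, memory
controller, channel and slot, while "CPU 8 BANK 18" appears to identify
only the reporting logical CPU and machine-check bank rather than a DIMM.
Below is a minimal sketch that pulls the location out of that line and
decodes a few fields of the STATUS value for cross-checking. The function
names are made up for illustration, and the bit positions follow my
reading of the machine-check architecture chapter of the Intel SDM, so
treat them as assumptions to verify. Mapping the result to a physical
slot name still needs the board's slot table or the dmidecode output.

```python
import re

# EDAC line copied from the log quoted in this thread.
EDAC_LINE = (
    "EDAC MC3: 1 CE memory scrubbing error on "
    "CPU_SrcID#1_MC#1_Chan#2_DIMM#0 (channel:2 slot:0 page:0x134fef5 "
    "offset:0x7c0 grain:32 syndrome:0x0 -  err_code:0008:00c2 socket:1 "
    "imc:1 rank:0 bg:1 ba:3 row:29ff col:358)"
)

LABEL_RE = re.compile(r"CPU_SrcID#(\d+)_MC#(\d+)_Chan#(\d+)_DIMM#(\d+)")


def dimm_location(line):
    """Extract socket, memory controller, channel and DIMM from an EDAC label."""
    match = LABEL_RE.search(line)
    if match is None:
        return None
    socket, imc, channel, dimm = (int(group) for group in match.groups())
    return {"socket": socket, "imc": imc, "channel": channel, "dimm": dimm}


def decode_status(status):
    """Decode selected MCi_STATUS fields (bit layout assumed from the SDM)."""
    mca_code = status & 0xFFFF
    return {
        "valid": bool((status >> 63) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
        # Memory controller errors use the form 0000 0000 1MMM CCCC,
        # where CCCC is the channel number.
        "channel": mca_code & 0xF if mca_code & 0x80 else None,
    }


if __name__ == "__main__":
    print(dimm_location(EDAC_LINE))
    print(decode_status(0x8C000040000800C2))
```

Both views agree on channel 2 for the logged event, which matches the
"MS_CHANNEL2_ERR" line from mcelog.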

