Re: machine checks on Dell R815 under jessie
On Wed, 2016-08-10 at 10:20 -0400, Jeffrey Mark Siskind wrote:
> > From: Ritesh Raj Sarraf
> > I (still) have MCE errors on my new laptop [1]. But so far, hasn't created
> > any problem.
>
> It causes my servers to halt.
>
> Jeff (http://engineering.purdue.edu/~qobi)

Were you able to progress on this issue?
You mentioned initially that, with older kernel, this issue was not seen. Is
that still the case?

The manpage says that uncorrected machine check errors will lead to a kernel
panic.

-- 
Ritesh Raj Sarraf | http://people.debian.org/~rrs
Debian - The Universal Operating System
Re: machine checks on Dell R815 under jessie
On Thu, 2016-08-18 at 23:16 +0530, Ritesh Raj Sarraf wrote:
> On Wed, 2016-08-10 at 10:20 -0400, Jeffrey Mark Siskind wrote:
> > > From: Ritesh Raj Sarraf
> > > I (still) have MCE errors on my new laptop [1]. But so far, hasn't
> > > created any problem.
> >
> > It causes my servers to halt.
> >
> > Jeff (http://engineering.purdue.edu/~qobi)
>
> Were you able to progress on this issue?
> You mentioned initially that, with older kernel, this issue was not seen. Is
> that still the case?
>
> The manpage says that uncorrected machine check errors will lead to a kernel
> panic.

I just extracted the logs for my machine. Contrary to what the manpage says, I
get Uncorrected errors, but no kernel panic.

Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6
MISC 38a086 ADDR fef81780
TIME 1471502624 Thu Aug 18 12:13:44 2016
MCG status:
MCi status:
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ae40110a MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 69

Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6
MISC 38a086 ADDR fef81780
TIME 1471506843 Thu Aug 18 13:24:03 2016
MCG status:
MCi status:
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ae40110a MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 69

Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6
MISC 38a086 ADDR fef81780
TIME 1471511695 Thu Aug 18 14:44:55 2016
MCG status:
MCi status:
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ae40110a MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 69

Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6
MISC 38a086 ADDR fef81780
TIME 1471520827 Thu Aug 18 17:17:07 2016
MCG status:
MCi status:
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ae40110a MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 69

Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6
MISC 38a086 ADDR fef81780
TIME 1471531931 Thu Aug 18 20:22:11 2016
MCG status:
MCi status:
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ae40110a MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 69

Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6
MISC 38a086 ADDR fef81780
TIME 1471534964 Thu Aug 18 21:12:44 2016
MCG status:
MCi status:
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ae40110a MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 69

Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6
MISC 78a086 ADDR fef87300
TIME 1471539248 Thu Aug 18 22:24:08 2016
MCG status:
MCi status:
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ae40110a MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 69

-- 
Ritesh Raj Sarraf | http://people.debian.org/~rrs
Debian - The Universal Operating System
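
For anyone who wants to summarize a log like the one above, here is a minimal Python sketch that pulls the CPU, BANK, MISC, ADDR and TIME fields out of each record and counts how often each location repeats. It is not part of mcelog. The names `parse_records` and `count_by_location` are made up for illustration, and the UTC+05:30 offset is an assumption taken from the "+0530" in the mail header.

```python
import re
from collections import Counter
from datetime import datetime, timedelta, timezone

# Two records copied from the log above (first and last), on one line each.
SAMPLE = (
    "MCE 0 CPU 0 BANK 6 MISC 38a086 ADDR fef81780 "
    "TIME 1471502624 Thu Aug 18 12:13:44 2016 "
    "MCE 0 CPU 0 BANK 6 MISC 78a086 ADDR fef87300 "
    "TIME 1471539248 Thu Aug 18 22:24:08 2016"
)

# Matches the fixed field order seen in the records above.
RECORD = re.compile(
    r"CPU (?P<cpu>\d+) BANK (?P<bank>\d+) MISC (?P<misc>[0-9a-f]+) "
    r"ADDR (?P<addr>[0-9a-f]+) TIME (?P<time>\d+)"
)

# Assumption: the reporting machine's local time is UTC+05:30.
LOCAL_TZ = timezone(timedelta(hours=5, minutes=30))


def parse_records(text):
    """Return one dict per MCE record found in text."""
    records = []
    for m in RECORD.finditer(text):
        records.append({
            "cpu": int(m["cpu"]),
            "bank": int(m["bank"]),
            "misc": m["misc"],
            "addr": m["addr"],
            # TIME is seconds since the Unix epoch.
            "when": datetime.fromtimestamp(int(m["time"]), tz=LOCAL_TZ),
        })
    return records


def count_by_location(records):
    """Count records per (cpu, bank, addr) to show repeated locations."""
    return Counter((r["cpu"], r["bank"], r["addr"]) for r in records)


if __name__ == "__main__":
    recs = parse_records(SAMPLE)
    for r in recs:
        print(r["when"].isoformat(), r["cpu"], r["bank"], r["addr"])
    print(count_by_location(recs))
```

Run over the full log, this would show that six of the seven records share ADDR fef81780 on CPU 0, bank 6, and only the last one differs.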
Re: machine checks on Dell R815 under jessie
> From: Ritesh Raj Sarraf
> I (still) have MCE errors on my new laptop [1]. But so far, hasn't created
> any problem.

It causes my servers to halt.

Jeff (http://engineering.purdue.edu/~qobi)
Re: machine checks on Dell R815 under jessie
On Tue, 2016-08-09 at 20:43 -0400, Jeffrey Mark Siskind wrote:
> I upgraded four Dell R815s from wheezy to jessie a few weeks ago. Prior to the
> upgrade, they were running reliably for about 5 years. Since the upgrade, two
> machines have been getting periodic machine checks. The machines boot fine and
> run for a day or more. The machine checks appear to happen sporadically. I
> can't determine a correlation with anything in particular.
>
> The front panel on the first machine says the machine check was on CPU #4. The
> front panel on the second machine said the first machine check was on CPU #1
> and the second machine check was on CPU #2.
>
> I am suspicious that this is really a hardware problem. Three CPUs begin
> exhibiting machine checks within a few weeks of each other, all immediately
> after upgrading wheezy to jessie, after working reliably for five years.
>
> Has anybody else encountered this issue? Any suggestions on how to debug and
> fix?

I (still) have MCE errors on my new laptop [1]. But so far, hasn't created any
problem.

[1] https://www.researchut.com/blog/lenovo-yoga-2-13-debian

-- 
Given the large number of mailing lists I follow, I request you to CC me in
replies for quicker response
machine checks on Dell R815 under jessie
I upgraded four Dell R815s from wheezy to jessie a few weeks ago. Prior to the
upgrade, they were running reliably for about 5 years. Since the upgrade, two
machines have been getting periodic machine checks. The machines boot fine and
run for a day or more. The machine checks appear to happen sporadically. I
can't determine a correlation with anything in particular.

The front panel on the first machine says the machine check was on CPU #4. The
front panel on the second machine said the first machine check was on CPU #1
and the second machine check was on CPU #2.

I am suspicious that this is really a hardware problem. Three CPUs begin
exhibiting machine checks within a few weeks of each other, all immediately
after upgrading wheezy to jessie, after working reliably for five years.

Has anybody else encountered this issue? Any suggestions on how to debug and
fix?

Thanks,
Jeff (http://engineering.purdue.edu/~qobi)
---
root@arivu:~# ipmitool sel elist
   1 | 08/05/2016 | 00:12:47 | Event Logging Disabled SEL | Log area reset/cleared | Asserted
   2 | 08/06/2016 | 11:35:17 | Processor CPU Machine Chk | Transition to Non-recoverable | Asserted
   3 | 08/06/2016 | 11:35:17 | Unknown #0x28 | | Asserted
   4 | 08/06/2016 | 11:35:18 | Unknown #0x28 | | Asserted
   5 | 08/06/2016 | 11:35:18 | Unknown #0x28 | | Asserted
   6 | 08/06/2016 | 11:35:18 | Unknown #0x28 | | Asserted
   7 | 08/06/2016 | 11:35:18 | Unknown #0x28 | | Asserted
   8 | 08/06/2016 | 11:35:19 | Unknown #0x28 | | Asserted
   9 | 08/06/2016 | 11:35:19 | Unknown #0x28 | | Asserted
   a | 08/06/2016 | 11:35:19 | Unknown #0x28 | | Asserted
root@arivu:~#

root@perisikan:~# ipmitool sel elist
[...]
  1c | 08/08/2016 | 12:23:02 | Processor CPU Machine Chk | Transition to Non-recoverable | Asserted
  1d | 08/08/2016 | 12:23:03 | Unknown #0x28 | | Asserted
  1e | 08/08/2016 | 12:23:03 | Unknown #0x28 | | Asserted
  1f | 08/08/2016 | 12:23:03 | Unknown #0x28 | | Asserted
  20 | 08/08/2016 | 12:23:03 | Unknown #0x28 | | Asserted
  21 | 08/08/2016 | 12:23:03 | Unknown #0x28 | | Asserted
  22 | 08/08/2016 | 12:23:04 | Unknown #0x28 | | Asserted
  23 | 08/08/2016 | 12:23:04 | Unknown #0x28 | | Asserted
  24 | 08/08/2016 | 12:23:04 | Unknown #0x28 | | Asserted
  25 | 08/09/2016 | 18:37:46 | Processor CPU Machine Chk | Transition to Non-recoverable | Asserted
  26 | 08/09/2016 | 18:37:46 | Unknown #0x28 | | Asserted
  27 | 08/09/2016 | 18:37:47 | Unknown #0x28 | | Asserted
  28 | 08/09/2016 | 18:37:47 | Unknown #0x28 | | Asserted
  29 | 08/09/2016 | 18:37:47 | Unknown #0x28 | | Asserted
  2a | 08/09/2016 | 18:37:47 | Unknown #0x28 | | Asserted
  2b | 08/09/2016 | 18:37:48 | Unknown #0x28 | | Asserted
  2c | 08/09/2016 | 18:37:48 | Unknown #0x28 | | Asserted
  2d | 08/09/2016 | 18:37:48 | Unknown #0x28 | | Asserted
root@perisikan:~#
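
When comparing SEL dumps across several machines, it may help to extract just the machine-check entries. The following is a rough Python sketch, assuming the pipe-separated six-column layout shown in the `ipmitool sel elist` output above. The function names `parse_sel` and `machine_checks` are hypothetical helpers, not part of ipmitool, and the sample is a four-line excerpt of the perisikan log.

```python
from datetime import datetime

# Excerpt of the perisikan output above, one SEL entry per line.
SAMPLE = """\
1c | 08/08/2016 | 12:23:02 | Processor CPU Machine Chk | Transition to Non-recoverable | Asserted
1d | 08/08/2016 | 12:23:03 | Unknown #0x28 | | Asserted
25 | 08/09/2016 | 18:37:46 | Processor CPU Machine Chk | Transition to Non-recoverable | Asserted
26 | 08/09/2016 | 18:37:46 | Unknown #0x28 | | Asserted
"""


def parse_sel(text):
    """Parse pipe-separated SEL lines; skip prompts and other lines."""
    entries = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 6:
            continue  # shell prompt, "[...]", or blank line
        rec_id, date, time, sensor, event, state = fields
        entries.append({
            "id": int(rec_id, 16),  # record IDs are printed in hex
            "when": datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S"),
            "sensor": sensor,
            "event": event,
            "state": state,
        })
    return entries


def machine_checks(entries):
    """Keep only the 'Processor CPU Machine Chk' entries."""
    return [e for e in entries if e["sensor"] == "Processor CPU Machine Chk"]


if __name__ == "__main__":
    for e in machine_checks(parse_sel(SAMPLE)):
        print(f"{e['id']:#x} {e['when'].isoformat()} {e['event']}")
```

Lines without exactly six fields, such as the shell prompts and the `[...]` elision, are skipped rather than treated as errors.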