Re: machine checks on Dell R815 under jessie

2016-08-18 Thread Ritesh Raj Sarraf

On Wed, 2016-08-10 at 10:20 -0400, Jeffrey Mark Siskind wrote:
>    From: Ritesh Raj Sarraf 
> 
>    I (still) have MCE errors on my new laptop [1]. But so far, hasn't created
>    any problem.
> 
> It causes my servers to halt.
> 
>     Jeff (http://engineering.purdue.edu/~qobi)

Were you able to make any progress on this issue?
You mentioned initially that the issue was not seen with the older kernel. Is
that still the case?

The manpage says that uncorrected machine check errors will lead to a kernel
panic.
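
For reference, the "Uncorrected error" and "Processor context corrupt" labels
that mcelog prints come from flag bits in the reporting bank's status register.
The sketch below is only an illustration, assuming the architectural
IA32_MCi_STATUS layout and the compound cache-hierarchy error code format
described in the Intel SDM. The function names and EXAMPLE_STATUS are
hypothetical: the value is built from named bits for demonstration and is not
a register dump from any machine in this thread.

```python
# Flag bits of an x86 machine-check bank status register (IA32_MCi_STATUS),
# assuming the architectural layout documented in the Intel SDM.
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled for this error
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}

# Field names for a compound cache-hierarchy code: 000F 0001 RRRR TTLL
REQUESTS = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
            5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TYPES = ["instruction", "data", "generic", "reserved"]
LEVELS = ["L0", "L1", "L2", "generic"]


def decode_status(status):
    """Return the set flags plus the two 16-bit error code fields."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_code": status & 0xFFFF,            # architectural error code
        "model_code": (status >> 16) & 0xFFFF,  # model-specific error code
    }


def decode_cache_error(mca_code):
    """Decode a cache-hierarchy error code, or return None for other codes."""
    if (mca_code & 0xEF00) != 0x0100:  # bit 12 (F) is ignored by the mask
        return None
    return {
        "filtered": bool(mca_code & 0x1000),  # "corrected filtering" bit
        "request": REQUESTS.get((mca_code >> 4) & 0xF, "reserved"),
        "type": TYPES[(mca_code >> 2) & 0x3],
        "level": LEVELS[mca_code & 0x3],
    }


# Hypothetical value for illustration only (not from this thread):
# valid, uncorrected, MISC/ADDR valid, context corrupt, EN clear.
EXAMPLE_STATUS = (1 << 63) | (1 << 61) | (1 << 59) | (1 << 58) | (1 << 57) | 0x110A

if __name__ == "__main__":
    decoded = decode_status(EXAMPLE_STATUS)
    print(decoded["flags"])
    print(decode_cache_error(decoded["mca_code"]))
```

Checking whether EN appears among the decoded flags may be worth doing when an
event is logged as uncorrected without a panic, though how the kernel reacts
also depends on its own severity grading, which this sketch does not model.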


-- 
Ritesh Raj Sarraf | http://people.debian.org/~rrs
Debian - The Universal Operating System



Re: machine checks on Dell R815 under jessie

2016-08-18 Thread Ritesh Raj Sarraf

On Thu, 2016-08-18 at 23:16 +0530, Ritesh Raj Sarraf wrote:
> On Wed, 2016-08-10 at 10:20 -0400, Jeffrey Mark Siskind wrote:
> >    From: Ritesh Raj Sarraf 
> > 
> >    I (still) have MCE errors on my new laptop [1]. But so far, hasn't
> >    created any problem.
> > 
> > It causes my servers to halt.
> > 
> >     Jeff (http://engineering.purdue.edu/~qobi)
> 
> Were you able to progress on this issue?
> You mentioned initially that, with older kernel, this issue was not seen. Is
> that still the case ?
> 
> The manpage says that uncorrected machine check errors will lead to a kernel
> panic.

I just extracted the logs from my machine. Contrary to what the manpage says, I
get uncorrected errors but no kernel panic.


Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6 
MISC 38a086 ADDR fef81780 
TIME 1471502624 Thu Aug 18 12:13:44 2016
MCG status:
MCi status:
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ae40110a MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 69
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6 
MISC 38a086 ADDR fef81780 
TIME 1471506843 Thu Aug 18 13:24:03 2016
MCG status:
MCi status:
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ae40110a MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 69
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6 
MISC 38a086 ADDR fef81780 
TIME 1471511695 Thu Aug 18 14:44:55 2016
MCG status:
MCi status:
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ae40110a MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 69
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6 
MISC 38a086 ADDR fef81780 
TIME 1471520827 Thu Aug 18 17:17:07 2016
MCG status:
MCi status:
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ae40110a MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 69
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6 
MISC 38a086 ADDR fef81780 
TIME 1471531931 Thu Aug 18 20:22:11 2016
MCG status:
MCi status:
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ae40110a MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 69
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6 
MISC 38a086 ADDR fef81780 
TIME 1471534964 Thu Aug 18 21:12:44 2016
MCG status:
MCi status:
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ae40110a MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 69
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6 
MISC 78a086 ADDR fef87300 
TIME 1471539248 Thu Aug 18 22:24:08 2016
MCG status:
MCi status:
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ae40110a MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 69
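
Since the records above differ mainly in TIME (and, in the last one, MISC and
ADDR), a small script can make the pattern easier to see. This is a rough
sketch that assumes the decoded text format shown above; parse_mcelog and
SAMPLE are hypothetical names, and SAMPLE holds only trimmed copies of the
first and last records. Times are converted to UTC, so they differ from the
local times printed in the log.

```python
import re
from collections import Counter
from datetime import datetime, timezone

RECORD_START = "Hardware event. This is not a software error."

# Trimmed copies of the first and last records from the log above.
SAMPLE = """\
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6
MISC 38a086 ADDR fef81780
TIME 1471502624 Thu Aug 18 12:13:44 2016
Uncorrected error
MCi_ADDR register valid
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6
MISC 78a086 ADDR fef87300
TIME 1471539248 Thu Aug 18 22:24:08 2016
Uncorrected error
MCi_ADDR register valid
"""


def parse_mcelog(text):
    """Split decoded mcelog text into one dict per record."""
    records = []
    for chunk in text.split(RECORD_START)[1:]:
        rec = {"uncorrected": "Uncorrected error" in chunk}
        m = re.search(r"\bCPU (\d+) BANK (\d+)", chunk)
        if m:
            rec["cpu"], rec["bank"] = int(m.group(1)), int(m.group(2))
        # \b keeps "MCi_ADDR register valid" from matching here
        m = re.search(r"\bADDR ([0-9a-f]+)\b", chunk)
        if m:
            rec["addr"] = int(m.group(1), 16)
        m = re.search(r"\bTIME (\d+)", chunk)
        if m:
            rec["time"] = datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc)
        records.append(rec)
    return records


if __name__ == "__main__":
    records = parse_mcelog(SAMPLE)
    # Count events per (cpu, bank, address) to spot repeat offenders.
    tally = Counter((r.get("cpu"), r.get("bank"), r.get("addr")) for r in records)
    for (cpu, bank, addr), count in tally.items():
        print(f"cpu={cpu} bank={bank} addr={addr:#x} events={count}")
```

Fed the complete log, the same tally would group the events by reporting bank
and address, which may help when comparing against another machine.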


-- 
Ritesh Raj Sarraf | http://people.debian.org/~rrs
Debian - The Universal Operating System



Re: machine checks on Dell R815 under jessie

2016-08-10 Thread Jeffrey Mark Siskind
   From: Ritesh Raj Sarraf 

   I (still) have MCE errors on my new laptop [1]. But so far, hasn't created
   any problem.

It causes my servers to halt.

Jeff (http://engineering.purdue.edu/~qobi)



Re: machine checks on Dell R815 under jessie

2016-08-10 Thread Ritesh Raj Sarraf
On Tue, 2016-08-09 at 20:43 -0400, Jeffrey Mark Siskind wrote:
> I upgraded four Dell R815s from wheezy to jessie a few weeks ago. Prior to the
> upgrade, they were running reliably for about 5 years. Since the upgrade, two
> machines have been getting periodic machine checks. The machines boot fine and
> run for a day or more. The machine checks appear to happen sporadically. I
> can't determine a correlation with anything in particular.
> 
> The front panel on the first machine says the machine check was on CPU #4. The
> front panel on the second machine said the first machine check was on CPU #1
> and the second machine check was on CPU #2.
> 
> I am suspicious that this is really a hardware problem. Three CPUs begin
> exhibiting machine checks within a few weeks of each other, all immediately
> after upgrading wheezy to jessie, after working reliably for five years.
> 
> Has anybody else encountered this issue? Any suggestions on how to debug and
> fix?

I (still) have MCE errors on my new laptop [1], but so far they haven't caused
any problems.

[1] https://www.researchut.com/blog/lenovo-yoga-2-13-debian


-- 
Given the large number of mailing lists I follow, I request you to CC
me in replies for quicker response



machine checks on Dell R815 under jessie

2016-08-09 Thread Jeffrey Mark Siskind
I upgraded four Dell R815s from wheezy to jessie a few weeks ago. Prior to the
upgrade, they had been running reliably for about 5 years. Since the upgrade,
two machines have been getting recurring machine checks. The machines boot fine
and run for a day or more. The machine checks appear to happen sporadically. I
can't determine a correlation with anything in particular.

The front panel on the first machine says the machine check was on CPU #4. The
front panel on the second machine says the first machine check was on CPU #1
and the second machine check was on CPU #2.

I am skeptical that this is really a hardware problem. Three CPUs began
exhibiting machine checks within a few weeks of each other, all immediately
after upgrading from wheezy to jessie, after working reliably for five years.

Has anybody else encountered this issue? Any suggestions on how to debug and
fix?

Thanks,
Jeff (http://engineering.purdue.edu/~qobi)
---
root@arivu:~# ipmitool sel elist
   1 | 08/05/2016 | 00:12:47 | Event Logging Disabled SEL | Log area reset/cleared | Asserted
   2 | 08/06/2016 | 11:35:17 | Processor CPU Machine Chk | Transition to Non-recoverable | Asserted
   3 | 08/06/2016 | 11:35:17 | Unknown #0x28 |  | Asserted
   4 | 08/06/2016 | 11:35:18 | Unknown #0x28 |  | Asserted
   5 | 08/06/2016 | 11:35:18 | Unknown #0x28 |  | Asserted
   6 | 08/06/2016 | 11:35:18 | Unknown #0x28 |  | Asserted
   7 | 08/06/2016 | 11:35:18 | Unknown #0x28 |  | Asserted
   8 | 08/06/2016 | 11:35:19 | Unknown #0x28 |  | Asserted
   9 | 08/06/2016 | 11:35:19 | Unknown #0x28 |  | Asserted
   a | 08/06/2016 | 11:35:19 | Unknown #0x28 |  | Asserted
root@arivu:~# 

root@perisikan:~# ipmitool sel elist
[...]
  1c | 08/08/2016 | 12:23:02 | Processor CPU Machine Chk | Transition to Non-recoverable | Asserted
  1d | 08/08/2016 | 12:23:03 | Unknown #0x28 |  | Asserted
  1e | 08/08/2016 | 12:23:03 | Unknown #0x28 |  | Asserted
  1f | 08/08/2016 | 12:23:03 | Unknown #0x28 |  | Asserted
  20 | 08/08/2016 | 12:23:03 | Unknown #0x28 |  | Asserted
  21 | 08/08/2016 | 12:23:03 | Unknown #0x28 |  | Asserted
  22 | 08/08/2016 | 12:23:04 | Unknown #0x28 |  | Asserted
  23 | 08/08/2016 | 12:23:04 | Unknown #0x28 |  | Asserted
  24 | 08/08/2016 | 12:23:04 | Unknown #0x28 |  | Asserted
  25 | 08/09/2016 | 18:37:46 | Processor CPU Machine Chk | Transition to Non-recoverable | Asserted
  26 | 08/09/2016 | 18:37:46 | Unknown #0x28 |  | Asserted
  27 | 08/09/2016 | 18:37:47 | Unknown #0x28 |  | Asserted
  28 | 08/09/2016 | 18:37:47 | Unknown #0x28 |  | Asserted
  29 | 08/09/2016 | 18:37:47 | Unknown #0x28 |  | Asserted
  2a | 08/09/2016 | 18:37:47 | Unknown #0x28 |  | Asserted
  2b | 08/09/2016 | 18:37:48 | Unknown #0x28 |  | Asserted
  2c | 08/09/2016 | 18:37:48 | Unknown #0x28 |  | Asserted
  2d | 08/09/2016 | 18:37:48 | Unknown #0x28 |  | Asserted
root@perisikan:~#
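
To line these SEL entries up against kernel or mcelog timestamps, the listing
can be turned into structured records. The following is a minimal sketch, not
a general ipmitool parser: it assumes the pipe-separated `sel elist` layout
shown above with each entry on a single line (rows that a mail client wrapped
would need to be rejoined first), hexadecimal record IDs, and MM/DD/YYYY
dates. parse_sel and SAMPLE are hypothetical names; SAMPLE copies four rows
from the perisikan listing.

```python
from datetime import datetime

# Four rows copied from the perisikan listing, each on one line.
SAMPLE = """\
root@perisikan:~# ipmitool sel elist
[...]
  1c | 08/08/2016 | 12:23:02 | Processor CPU Machine Chk | Transition to Non-recoverable | Asserted
  1d | 08/08/2016 | 12:23:03 | Unknown #0x28 |  | Asserted
  25 | 08/09/2016 | 18:37:46 | Processor CPU Machine Chk | Transition to Non-recoverable | Asserted
  26 | 08/09/2016 | 18:37:46 | Unknown #0x28 |  | Asserted
"""


def parse_sel(text):
    """Parse `ipmitool sel elist` rows into dicts, skipping other lines."""
    entries = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 6:
            continue  # shell prompts, "[...]", blank lines
        rec_id, date, time, sensor, event, direction = fields
        entries.append({
            "id": int(rec_id, 16),  # record IDs are shown in hex
            "when": datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S"),
            "sensor": sensor,
            "event": event,
            "direction": direction,
        })
    return entries


if __name__ == "__main__":
    entries = parse_sel(SAMPLE)
    machine_checks = [e for e in entries if "Machine Chk" in e["sensor"]]
    for e in machine_checks:
        print(f"{e['id']:#x} {e['when'].isoformat()} {e['event']}")
```

Filtering on the sensor name keeps only the machine-check rows, so the bursts
of "Unknown #0x28" entries that follow each one do not inflate the count.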