Re: L2 cache errors???

2015-07-31 Thread Erich Dollansky
Hi,

On Fri, 31 Jul 2015 09:31:53 +0200
Willem Jan Withagen w...@digiware.nl wrote:

> On 31/07/2015 07:22, Erich Dollansky wrote:
>> On Tue, 28 Jul 2015 21:45:03 +0200
>> Willem Jan Withagen w...@digiware.nl wrote:
>>
>>> On 28/07/2015 21:04, Josh Paetzel wrote:
>>>
>>> Offlining CPUs, cool.
>>
>> and bringing them back online when the problem is fixed. The
>> hardware there supports changing components while the system is
>> running. A PC normally costs less than the extra hardware required
>> to do this.
>
> Yes, I can imagine things being expensive.
>
> Probably like high-end routers... There you can also swap just about
> anything. Expensive got a completely new meaning when I saw a set of 3
> core routers being delivered at a friend's ISP with a pricetag
> 1.000.000 euros... Fortunately it was list price, but even still: A
> lot of money.

Yes, these are common price tags in this area.

> Although I've grown to look at HA as:
>   don't put it all in one (expensive) box,
>   but get more (cheaper) boxes,
> But I guess there are places where this doesn't work.

It is also a matter of effort. The moment the application software has
to be adapted, it might be pointless to do it on cheap machines. Google
& Co. are a typical case for cheap hardware. It does not matter to the
normal user if things even get lost or delayed. It is a different story
for banks, manufacturing or people like UPS.

Erich
___
freebsd-hardware@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hardware
To unsubscribe, send any mail to freebsd-hardware-unsubscr...@freebsd.org


Re: L2 cache errors???

2015-07-31 Thread Erich Dollansky
Hi,

On Tue, 28 Jul 2015 21:45:03 +0200
Willem Jan Withagen w...@digiware.nl wrote:

> On 28/07/2015 21:04, Josh Paetzel wrote:
>
> Offlining CPUs, cool.

and bringing them back online when the problem is fixed. The hardware
there supports changing components while the system is running. A
PC normally costs less than the extra hardware required to do this.

Erich


Re: L2 cache errors???

2015-07-31 Thread Willem Jan Withagen
On 31/07/2015 07:22, Erich Dollansky wrote:
> Hi,
>
> On Tue, 28 Jul 2015 21:45:03 +0200
> Willem Jan Withagen w...@digiware.nl wrote:
>
>> On 28/07/2015 21:04, Josh Paetzel wrote:
>>
>> Offlining CPUs, cool.
>
> and bringing them back online when the problem is fixed. The hardware
> there supports changing components while the system is running. A
> PC normally costs less than the extra hardware required to do this.

Yes, I can imagine things being expensive.

Probably like high-end routers... There you can also swap just about
anything. Expensive got a completely new meaning when I saw a set of 3
core routers being delivered at a friend's ISP with a pricetag
1.000.000 euros... Fortunately it was list price, but even still: A lot
of money.
But then still you'd swap a processor board, running on the spare.
And I guess there we offline the whole board.

Last time I heard about things like this in computers, we were talking
IBM mainframes, or Tandem. Both a long time ago. Obviously I haven't done
much in high availability lately.

Although I've grown to look at HA as:
don't put it all in one (expensive) box,
but get more (cheaper) boxes,
But I guess there are places where this doesn't work.

--WjW



L2 cache errors???

2015-07-28 Thread Willem Jan Withagen

Hi,

Are these what I think they are?

Errors in the CPU L2 cache?

/var/log/messages:Jul 24 13:14:40 box kernel: MCA: Bank 3, Status 0x902000120120100e
/var/log/messages:Jul 24 13:14:40 box kernel: MCA: Global Cap 0x0806, Status 0x
/var/log/messages:Jul 24 13:14:40 box kernel: MCA: Vendor GenuineIntel, ID 0x10676, APIC ID 2
/var/log/messages:Jul 24 13:14:40 box kernel: MCA: CPU 2 COR L2 memory error
/var/log/messages:Jul 28 19:12:42 box kernel: MCA: Bank 3, Status 0x90270220100e
/var/log/messages:Jul 28 19:12:42 box kernel: MCA: Global Cap 0x0806, Status 0x
/var/log/messages:Jul 28 19:12:42 box kernel: MCA: Vendor GenuineIntel, ID 0x10676, APIC ID 0
/var/log/messages:Jul 28 19:12:42 box kernel: MCA: CPU 0 COR L2 memory error

Are they ECC corrected?
Or is the data really kaput?

--WjW


Re: L2 cache errors???

2015-07-28 Thread Josh Paetzel


On 07/28/2015 13:40, Willem Jan Withagen wrote:
> On 28/07/2015 19:48, Mike Tancsa wrote:
>> On 7/28/2015 1:16 PM, Willem Jan Withagen wrote:
>>> Hi,
>>>
>>> Are these what I think they are?
>>> Errors in the CPU L2 cache?
>>>
>>> Are they ECC corrected?
>>> Or is the data really kaput?
>>
>> Could be. There is also an erratum issue that triggers these errors on
>> certain CPUs when running software like VirtualBox.  It was fixed in
>> RELENG_10 some time ago. What are you running?
>>
>> https://svnweb.freebsd.org/base?view=revision&revision=269052
>>
>> has some details.
>
> 'mmm,
> Not running Haswell stuff, but rather older hardware.
>
> Looked in older logfiles, and there are a few more...
> All with the same data, except that it is detected on different CPUs
>
> And it occurs when running:
>   mbuffer -4 -m 1000M -I  | \
>  zfs receive -F -d -v zfs
> to receive a full backup from my fileserver.
>
> --WjW
 

You can tell ECC corrected the error because on FreeBSD if ECC can't fix
the error the system will panic.  Other systems (Solaris and HP-UX being
the two I have direct experience with) can detach subsystems that have
sustained uncorrectable errors in some cases. (Yes, even CPUs!)
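
As a cross-check, the same conclusion can be read from the logged status
word itself. Below is a minimal sketch in Python, assuming the Bank 3 value
follows the architectural IA32_MCi_STATUS layout from the Intel SDM (VAL in
bit 63, UC in bit 61, MCA error code in the low 16 bits). The function and
table names are made up for illustration, and the LL cache-level field is
only meaningful if the error code is one of the cache-hierarchy forms.

```python
# Flag bit positions, assuming the architectural IA32_MCi_STATUS layout.
STATUS_FLAGS = {
    "VAL": 63,    # register holds a valid error record
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was NOT corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MCi_MISC holds additional information
    "ADDRV": 58,  # MCi_ADDR holds the error address
    "PCC": 57,    # processor context may be corrupt
}

# LL field (two lowest bits) of cache-hierarchy style error codes.
CACHE_LEVELS = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}


def decode_mca_status(status):
    """Split a 64-bit status word into flags and the MCA error code."""
    decoded = {name: bool((status >> bit) & 1)
               for name, bit in STATUS_FLAGS.items()}
    decoded["mca_error_code"] = status & 0xFFFF
    # Assumption: the code is a cache-hierarchy form, so LL names the level.
    decoded["level"] = CACHE_LEVELS[status & 0x3]
    # A valid record with UC clear means the hardware corrected the error.
    decoded["corrected"] = decoded["VAL"] and not decoded["UC"]
    return decoded


if __name__ == "__main__":
    # Status value copied from the Jul 24 entry in the original post.
    for key, value in decode_mca_status(0x902000120120100E).items():
        print(key, value)
```

For the Jul 24 value this reports UC clear, that is a corrected error, with
the level field pointing at L2, which agrees with the kernel's own
"COR L2 memory error" summary line.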

If a system is generating hundreds or thousands of MCAs a minute you are
dealing with a hardware issue.

If you are getting spurious MCAs to the tune of a few a day, there's
nothing abnormal or broken there; it's just the system doing what it's
supposed to.

Given the amount of data that flies around inside modern computers I'm
surprised there aren't more MCAs than there are in most systems.
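
To see which side of that line a box is on, one can count the kernel's MCA
summary lines per day. Here is a rough sketch in Python, assuming syslog
lines shaped like the ones in the original post ("... kernel: MCA: CPU 2
COR L2 memory error") and that the token after the CPU number is COR or
UNCOR. The function name and the regular expression are illustrative only.

```python
import re
from collections import Counter

# Matches only the per-event summary line, for example:
#   Jul 24 13:14:40 box kernel: MCA: CPU 2 COR L2 memory error
# Assumption: the token after the CPU number is the severity (COR/UNCOR).
MCA_LINE = re.compile(
    r"(\w{3}\s+\d+) \d\d:\d\d:\d\d \S+ kernel: MCA: CPU (\d+) (\S+)"
)


def count_mca_per_day(lines):
    """Count MCA summary lines per (day, severity) pair."""
    counts = Counter()
    for line in lines:
        match = MCA_LINE.search(line)
        if match:
            day, _cpu, severity = match.groups()
            counts[(day, severity)] += 1
    return counts


if __name__ == "__main__":
    # Sample lines taken from the original post; in practice the input
    # would be the lines of the system log file.
    sample = [
        "Jul 24 13:14:40 box kernel: MCA: Bank 3, Status 0x902000120120100e",
        "Jul 24 13:14:40 box kernel: MCA: CPU 2 COR L2 memory error",
        "Jul 28 19:12:42 box kernel: MCA: CPU 0 COR L2 memory error",
    ]
    for (day, severity), n in sorted(count_mca_per_day(sample).items()):
        print(day, severity, n)
```

A handful of COR entries per day falls under the "few a day" case above;
hundreds per minute, or any UNCOR entry, would point at the hardware.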


-- 
FreeBSD - The Power To Serve.


Re: L2 cache errors???

2015-07-28 Thread Mike Tancsa
On 7/28/2015 1:16 PM, Willem Jan Withagen wrote:
> Hi,
>
> Are these what I think they are?
> Errors in the CPU L2 cache?
>
> Are they ECC corrected?
> Or is the data really kaput?
 


Could be. There is also an erratum issue that triggers these errors on
certain CPUs when running software like VirtualBox.  It was fixed in
RELENG_10 some time ago. What are you running?


https://svnweb.freebsd.org/base?view=revision&revision=269052

has some details.

---Mike


-- 
---
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, m...@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada   http://www.tancsa.com/


Re: L2 cache errors???

2015-07-28 Thread Willem Jan Withagen
On 28/07/2015 19:48, Mike Tancsa wrote:
> On 7/28/2015 1:16 PM, Willem Jan Withagen wrote:
>> Hi,
>>
>> Are these what I think they are?
>> Errors in the CPU L2 cache?
>>
>> Are they ECC corrected?
>> Or is the data really kaput?
>
> Could be. There is also an erratum issue that triggers these errors on
> certain CPUs when running software like VirtualBox.  It was fixed in
> RELENG_10 some time ago. What are you running?
>
> https://svnweb.freebsd.org/base?view=revision&revision=269052
>
> has some details.

'mmm,
Not running Haswell stuff, but rather older hardware.

Looked in older logfiles, and there are a few more...
All with the same data, except that it is detected on different CPUs

And it occurs when running:
mbuffer -4 -m 1000M -I  | \
zfs receive -F -d -v zfs
to receive a full backup from my fileserver.

--WjW

No tweaked settings, nor is the CPU overheating.
The system consumes about 200W and has a Supermicro 450W power supply.

Running 10.2-BETA2 on a
CPU: Intel(R) Core(TM)2 Extreme CPU X9650  @ 3.00GHz (3005.62-MHz K8-class CPU)
  Origin=GenuineIntel  Id=0x10676  Family=0x6  Model=0x17  Stepping=6
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x8e3bd<SSE3,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1>
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x1<LAHF>
  VT-x: Basic Features=0x5a0800<SMM,INS/OUTS>
        Pin-Based Controls=0x3f<ExtINT,NMI,VNMI>
        Primary Processor Controls=0xf7f9fffe<INTWIN,TSCOff,HLT,INVLPG,MWAIT,RDPMC,RDTSC,CR3-LD,CR3-ST,CR8-LD,CR8-ST,TPR,NMIWIN,MOV-DR,IO,IOmap,MSRmap,MONITOR,PAUSE>
        Secondary Processor Controls=0x41<APIC,WBINVD>
        Exit Controls=0x5a0800<PAT-LD,EFER-SV,PTMR-SV>
        Entry Controls=0x5a0800
  TSC: P-state invariant, performance statistics
Instruction TLB: 2M pages, 4-way, 8 entries or 4M pages, 4-way, 4 entries
Instruction TLB: 4 KB Pages, 4-way set associative, 128 entries
64-Byte prefetching
Data TLB0: 4 KByte pages, 4-way associative, 16 entries
Data TLB0: 4 MByte pages, 4-way set associative, 16 entries
2nd-level cache: 6MByte, 24-way set associative, 64 byte line size
1st-level instruction cache: 32 KB, 8-way set associative, 64 byte line size
Data TLB1: 4 KByte pages, 4-way associative, 256 entries
1st-level data cache: 32 KB, 8-way set associative, 64 byte line size
L2 cache: 6144 kbytes, 16-way associative, 64 bytes/line
real memory  = 7516192768 (7168 MB)

Motherboard:
Base Board Information
Manufacturer: ASUSTeK Computer INC.
Product Name: P5Q-E
Version: Rev 1.xx
Serial Number: MS1C87B16302305




Re: L2 cache errors???

2015-07-28 Thread Willem Jan Withagen
On 28/07/2015 21:04, Josh Paetzel wrote:
 
 
> On 07/28/2015 13:40, Willem Jan Withagen wrote:
>> On 28/07/2015 19:48, Mike Tancsa wrote:
>>> On 7/28/2015 1:16 PM, Willem Jan Withagen wrote:
>>>> Hi,
>>>>
>>>> Are these what I think they are?
>>>> Errors in the CPU L2 cache?
>>>>
>>>> Are they ECC corrected?
>>>> Or is the data really kaput?
>>>
>>> Could be. There is also an erratum issue that triggers these errors on
>>> certain CPUs when running software like VirtualBox.  It was fixed in
>>> RELENG_10 some time ago. What are you running?
>>>
>>> https://svnweb.freebsd.org/base?view=revision&revision=269052
>>>
>>> has some details.
>>
>> 'mmm,
>> Not running Haswell stuff, but rather older hardware.
>>
>> Looked in older logfiles, and there are a few more...
>> All with the same data, except that it is detected on different CPUs
>>
>> And it occurs when running:
>>   mbuffer -4 -m 1000M -I  | \
>>  zfs receive -F -d -v zfs
>> to receive a full backup from my fileserver.
>>
>> --WjW
>
> You can tell ECC corrected the error because on FreeBSD if ECC can't fix
> the error the system will panic.  Other systems (Solaris and HP-UX being
> the two I have direct experience with) can detach subsystems that have
> sustained uncorrectable errors in some cases. (Yes, even CPUs!)

Offlining CPUs, cool.
No, the system does not panic, but I do get reports from 'zfs receive'
that the datastream is invalid. And it then aborts.
So I'll have to do more digging to see what is up.

> If a system is generating hundreds or thousands of MCAs a minute you are
> dealing with a hardware issue.
>
> If you are getting spurious MCAs to the tune of a few a day, there's
> nothing abnormal or broken there; it's just the system doing what it's
> supposed to.

Never had them before, and now about 6 this week.
Let alone in L2 cache.
So it got me worried.

> Given the amount of data that flies around inside modern computers I'm
> surprised there aren't more MCAs than there are in most systems.

Perhaps not enough alpha particles hitting the cells. :)

Thanx,
--WjW
