Re: L2 cache errors???
Hi,

On Fri, 31 Jul 2015 09:31:53 +0200 Willem Jan Withagen w...@digiware.nl wrote:

> On 31/07/2015 07:22, Erich Dollansky wrote:
>> On Tue, 28 Jul 2015 21:45:03 +0200 Willem Jan Withagen w...@digiware.nl wrote:
>>> On 28/07/2015 21:04, Josh Paetzel wrote:
>>> Offlining CPUs, cool.
>>
>> and bringing them back online when the problem is fixed. The hardware there supports changing things while the system is running. A PC normally costs less than the extra hardware required to do this.
>
> Yes, I can imagine things being expensive. Probably like high-end routers... There you can also swap just about anything. Expensive got a completely new meaning when I saw a set of 3 core routers being delivered at a friend's ISP with a price tag of 1.000.000 euros... Fortunately it was list price, but even still: a lot of money.

Yes, these are common price tags in this area.

> Although I've grown to look at HA as: don't put it all in one (expensive) box, but get more (cheaper) boxes. But I guess there are places where this doesn't work.

It is also a matter of effort. The moment the application software has to be adapted, it might be pointless to do it on cheap machines. Google & Co. are a typical case for cheap hardware. It does not matter to the normal user if things get delayed or even lost. It is a different story for banks, manufacturing or people like UPS.

Erich
___
freebsd-hardware@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hardware
To unsubscribe, send any mail to freebsd-hardware-unsubscr...@freebsd.org
Re: L2 cache errors???
Hi,

On Tue, 28 Jul 2015 21:45:03 +0200 Willem Jan Withagen w...@digiware.nl wrote:

> On 28/07/2015 21:04, Josh Paetzel wrote:
> Offlining CPUs, cool.

and bringing them back online when the problem is fixed. The hardware there supports changing things while the system is running. A PC normally costs less than the extra hardware required to do this.

Erich
Re: L2 cache errors???
On 31/07/2015 07:22, Erich Dollansky wrote:
> Hi,
>
> On Tue, 28 Jul 2015 21:45:03 +0200 Willem Jan Withagen w...@digiware.nl wrote:
>> On 28/07/2015 21:04, Josh Paetzel wrote:
>> Offlining CPUs, cool.
>
> and bringing them back online when the problem is fixed. The hardware there supports that things get changed while the system is running. A PC costs normally less than the extra hardware required to do this.

Yes, I can imagine things being expensive. Probably like high-end routers... There you can also swap just about anything. Expensive got a completely new meaning when I saw a set of 3 core routers being delivered at a friend's ISP with a price tag of 1.000.000 euros... Fortunately it was list price, but even still: a lot of money.

But then still you'd swap a processor board, running on the spare. And I guess there we offline the whole board.

Last time I heard about things like this in computers, we were talking IBM mainframes, or Tandem. Both a long time ago. Obviously I haven't done much in high availability lately.

Although I've grown to look at HA as: don't put it all in one (expensive) box, but get more (cheaper) boxes. But I guess there are places where this doesn't work.

--WjW
L2 cache errors???
Hi,

Are these what I think they are? Errors in the CPU L2 cache?

/var/log/messages:Jul 24 13:14:40 box kernel: MCA: Bank 3, Status 0x902000120120100e
/var/log/messages:Jul 24 13:14:40 box kernel: MCA: Global Cap 0x0806, Status 0x
/var/log/messages:Jul 24 13:14:40 box kernel: MCA: Vendor GenuineIntel, ID 0x10676, APIC ID 2
/var/log/messages:Jul 24 13:14:40 box kernel: MCA: CPU 2 COR L2 memory error
/var/log/messages:Jul 28 19:12:42 box kernel: MCA: Bank 3, Status 0x90270220100e
/var/log/messages:Jul 28 19:12:42 box kernel: MCA: Global Cap 0x0806, Status 0x
/var/log/messages:Jul 28 19:12:42 box kernel: MCA: Vendor GenuineIntel, ID 0x10676, APIC ID 0
/var/log/messages:Jul 28 19:12:42 box kernel: MCA: CPU 0 COR L2 memory error

Are they ECC corrected? Or is the data really kaput?

--WjW
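
The corrected-or-not question can, in principle, be read off the Bank 3 status word itself. Below is a minimal sketch in Python, assuming the architectural IA32_MCi_STATUS layout as described in the Intel SDM (bit 63 VAL, bit 61 UC, low 16 bits holding the MCA error code, and the generic cache hierarchy encoding 000F 0000 0000 11LL). The function names decode_mci_status and cache_level are made up for illustration; this is not the FreeBSD kernel's decoder.

```python
# Sketch only: decode the architectural bits of an IA32_MCi_STATUS value.
# Bit positions are assumed from the Intel SDM machine-check chapter.
FLAGS = {
    63: "VAL",    # register holds a valid error record
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was NOT corrected by hardware
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # MISC register holds extra information
    58: "ADDRV",  # ADDR register holds the faulting address
    57: "PCC",    # processor context may be corrupted
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit status word into flags and error codes (hypothetical helper)."""
    flags = {name: bool((status >> bit) & 1) for bit, name in FLAGS.items()}
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        # A valid record without UC set means the hardware corrected the error.
        "corrected": flags["VAL"] and not flags["UC"],
    }


def cache_level(mca_code: int):
    """Return the cache level for a generic cache hierarchy error, else None."""
    # Mask out the filtering bit (bit 12) and the two level bits (LL).
    if mca_code & 0xEFFC == 0x000C:
        return ("L0", "L1", "L2", "LG")[mca_code & 0x3]
    return None


if __name__ == "__main__":
    info = decode_mci_status(0x902000120120100E)  # first status from the log
    print(info["corrected"], cache_level(info["mca_error_code"]))  # True L2
```

Under these assumptions the first status value has VAL set and UC clear, which agrees with the "COR L2 memory error" line the kernel printed. The second status value in the log appears truncated, so it is not decoded here.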
Re: L2 cache errors???
On 07/28/2015 13:40, Willem Jan Withagen wrote:
> On 28/07/2015 19:48, Mike Tancsa wrote:
>> On 7/28/2015 1:16 PM, Willem Jan Withagen wrote:
>>> Hi,
>>>
>>> Are these what I think they are? Errors in the CPU L2 cache?
>>>
>>> Are they ECC corrected? Or is the data really kaput?
>>
>> Could be. There is also an erratum that triggers these errors on certain CPUs when running software like VirtualBox. It was fixed in RELENG_10 some time ago. What are you running?
>>
>> https://svnweb.freebsd.org/base?view=revision&revision=269052 has some details.
>
> 'mmm, Not running Haswell stuff, but rather older hardware.
>
> Looked in older logfiles, and there are a few more... All with the same data, except that it is detected on different CPUs.
>
> And it occurs when running:
>
>     mbuffer -4 -m 1000M -I | \
>         zfs receive -F -d -v zfs
>
> to receive a full backup from my fileserver.
>
> --WjW

You can tell ECC corrected the error because on FreeBSD, if ECC can't fix the error, the system will panic. Other systems (Solaris and HP-UX being the two I have direct experience with) can detach subsystems that have sustained uncorrectable errors in some cases. (Yes, even CPUs!)

If a system is generating hundreds or thousands of MCAs a minute, you are dealing with a hardware issue. If you are getting spurious MCAs to the tune of a few a day, there's nothing abnormal or broken there; it's just the system doing what it's supposed to.

Given the amount of data that flies around inside modern computers, I'm surprised there aren't more MCAs than there are in most systems.

--
FreeBSD - The Power To Serve.
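
To put a number on "a few a day" versus "hundreds a minute", one could tally the MCA summary lines per day and per CPU. The following is a rough sketch in Python, assuming the syslog line format shown in the original post; the regular expression and the summarize function are illustrative assumptions and not part of any FreeBSD tool.

```python
import re
from collections import Counter

# Assumed format, taken from the log excerpt in this thread:
#   Jul 24 13:14:40 box kernel: MCA: CPU 2 COR L2 memory error
MCA_LINE = re.compile(
    r"([A-Z][a-z]{2}\s+\d+) \d\d:\d\d:\d\d \S+ kernel: MCA: CPU (\d+) "
)


def summarize(lines):
    """Count MCA summary lines per calendar day and per CPU (hypothetical helper)."""
    per_day = Counter()
    per_cpu = Counter()
    for line in lines:
        match = MCA_LINE.search(line)
        if match is None:
            continue  # Bank/Global Cap/Vendor lines and unrelated messages
        day = " ".join(match.group(1).split())  # normalize "Jul  4" to "Jul 4"
        per_day[day] += 1
        per_cpu[int(match.group(2))] += 1
    return per_day, per_cpu


if __name__ == "__main__":
    sample = [
        "/var/log/messages:Jul 24 13:14:40 box kernel: MCA: Bank 3, Status 0x902000120120100e",
        "/var/log/messages:Jul 24 13:14:40 box kernel: MCA: CPU 2 COR L2 memory error",
        "/var/log/messages:Jul 28 19:12:42 box kernel: MCA: CPU 0 COR L2 memory error",
    ]
    days, cpus = summarize(sample)
    print(dict(days), dict(cpus))
```

Only the "CPU n ..." summary line is counted, so each logged event contributes once even though the kernel prints four lines per event.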
Re: L2 cache errors???
On 7/28/2015 1:16 PM, Willem Jan Withagen wrote:
> Hi,
>
> Are these what I think they are? Errors in the CPU L2 cache?
>
> Are they ECC corrected? Or is the data really kaput?

Could be. There is also an erratum that triggers these errors on certain CPUs when running software like VirtualBox. It was fixed in RELENG_10 some time ago. What are you running?

https://svnweb.freebsd.org/base?view=revision&revision=269052 has some details.

---Mike

--
---
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, m...@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada http://www.tancsa.com/
Re: L2 cache errors???
On 28/07/2015 19:48, Mike Tancsa wrote:
> On 7/28/2015 1:16 PM, Willem Jan Withagen wrote:
>> Hi,
>>
>> Are these what I think they are? Errors in the CPU L2 cache?
>>
>> Are they ECC corrected? Or is the data really kaput?
>
> Could be. There is also an erratum that triggers these errors on certain CPUs when running software like VirtualBox. It was fixed in RELENG_10 some time ago. What are you running?
>
> https://svnweb.freebsd.org/base?view=revision&revision=269052 has some details.

'mmm, Not running Haswell stuff, but rather older hardware.

Looked in older logfiles, and there are a few more... All with the same data, except that it is detected on different CPUs.

And it occurs when running:

    mbuffer -4 -m 1000M -I | \
        zfs receive -F -d -v zfs

to receive a full backup from my fileserver.

--WjW

No tweaked settings, nor is the CPU overheated. The system consumes about 200 W and has a Supermicro 450 W supply.

Running 10.2-BETA2 on a

CPU: Intel(R) Core(TM)2 Extreme CPU X9650 @ 3.00GHz (3005.62-MHz K8-class CPU)
  Origin=GenuineIntel  Id=0x10676  Family=0x6  Model=0x17  Stepping=6
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x8e3bd<SSE3,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1>
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x1<LAHF>
  VT-x: Basic Features=0x5a0800<SMM,INS/OUTS>
        Pin-Based Controls=0x3f<ExtINT,NMI,VNMI>
        Primary Processor Controls=0xf7f9fffe<INTWIN,TSCOff,HLT,INVLPG,MWAIT,RDPMC,RDTSC,CR3-LD,CR3-ST,CR8-LD,CR8-ST,TPR,NMIWIN,MOV-DR,IO,IOmap,MSRmap,MONITOR,PAUSE>
        Secondary Processor Controls=0x41<APIC,WBINVD>
        Exit Controls=0x5a0800<PAT-LD,EFER-SV,PTMR-SV>
        Entry Controls=0x5a0800
  TSC: P-state invariant, performance statistics
Instruction TLB: 2M pages, 4-way, 8 entries or 4M pages, 4-way, 4 entries
Instruction TLB: 4 KB Pages, 4-way set associative, 128 entries
64-Byte prefetching
Data TLB0: 4 KByte pages, 4-way associative, 16 entries
Data TLB0: 4 MByte pages, 4-way set associative, 16 entries
2nd-level cache: 6MByte, 24-way set associative, 64 byte line size
1st-level instruction cache: 32 KB, 8-way set associative, 64 byte line size
Data TLB1: 4 KByte pages, 4-way associative, 256 entries
1st-level data cache: 32 KB, 8-way set associative, 64 byte line size
L2 cache: 6144 kbytes, 16-way associative, 64 bytes/line
real memory = 7516192768 (7168 MB)

Motherboard:
Base Board Information
  Manufacturer: ASUSTeK Computer INC.
  Product Name: P5Q-E
  Version: Rev 1.xx
  Serial Number: MS1C87B16302305
Re: L2 cache errors???
On 28/07/2015 21:04, Josh Paetzel wrote:
> On 07/28/2015 13:40, Willem Jan Withagen wrote:
>> On 28/07/2015 19:48, Mike Tancsa wrote:
>>> On 7/28/2015 1:16 PM, Willem Jan Withagen wrote:
>>>> Hi,
>>>>
>>>> Are these what I think they are? Errors in the CPU L2 cache?
>>>>
>>>> Are they ECC corrected? Or is the data really kaput?
>>>
>>> Could be. There is also an erratum that triggers these errors on certain CPUs when running software like VirtualBox. It was fixed in RELENG_10 some time ago. What are you running?
>>>
>>> https://svnweb.freebsd.org/base?view=revision&revision=269052 has some details.
>>
>> 'mmm, Not running Haswell stuff, but rather older hardware.
>>
>> Looked in older logfiles, and there are a few more... All with the same data, except that it is detected on different CPUs.
>>
>> And it occurs when running:
>>
>>     mbuffer -4 -m 1000M -I | \
>>         zfs receive -F -d -v zfs
>>
>> to receive a full backup from my fileserver.
>>
>> --WjW
>
> You can tell ECC corrected the error because on FreeBSD if ECC can't fix the error the system will panic. Other systems (Solaris and HP-UX being the two I have direct experience with) can detach subsystems that have sustained uncorrectable errors in some cases. (Yes, even CPUs!)

Offlining CPUs, cool.

No, the system does not panic, but I do get reports from 'zfs receive' that the datastream is invalid, and it then aborts. So I'll have to do more digging to see what is up.

> If a system is generating hundreds or thousands of MCAs a minute you are dealing with a hardware issue. If you are getting spurious MCAs to the tune of a few a day there's nothing abnormal or broken there it's just the system doing what it's supposed to.

Never had them before, and now about 6 this week. Let alone in the L2 cache. So it got me worried.

> Given the amount of data that flies around inside modern computers I'm surprised there aren't more MCAs than there are in most systems.

Perhaps not enough alpha particles hitting the cells. :)

Thanx,
--WjW