RE: FreeBsd MCA Panic Crash !!
> Are those the only MCA errors you're seeing? The reason I ask is that there's an errata in the X5600 series which can cause an "internal timer error" MCA to be logged after another uncorrectable MCA occurs.

About 90% of the crashes are these MCA errors. For the remaining 10% there is no log at all. For example, one of the Supermicro servers rebooted two days ago but did not generate a crash dump under /var/crash, even though dumps are enabled in rc.conf:

dumpdev="AUTO"
dumpdir="/var/crash"

> This seems to me like it would be a CPU failure. Can you try replacing the CPU itself? I've seen this exact message on a different board, and the cause was a failing CPU.

We're thinking of replacing the X5690 CPUs with X5675s.

> Well, mcelog has this hardcoded and prints this for every MCA just as a matter of course. It isn't selective but assumes every machine check is a hardware error (which they are, though some are warnings for corrected events that you can ignore as the hardware hasn't degraded enough to warrant replacement. However, corrected events don't generate panics, just messages in the logs, and only a subset of corrected events include the "yellow / green" indicators for which you can ignore "green" events. Even corrected ECC errors I would ignore if you get a few events with a count of 1 that don't recur).

Each time the MCA error occurs, the server goes down. Please advise how we should tackle this issue.

> Depending on the CPU model, you can determine more info about the error using the CPU manuals (for Intel the SDM).

The CPU is an X5690. Is there a link where we can get the manual for the X5690?

--
View this message in context: http://freebsd.1045724.n5.nabble.com/FreeBsd-MCA-Panic-Crash-tp6064691p6065043.html
Sent from the freebsd-current mailing list archive at Nabble.com.
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: FreeBsd MCA Panic Crash !!
We're thinking of replacing the X5690 CPUs with X5675s.

--
View this message in context: http://freebsd.1045724.n5.nabble.com/FreeBsd-MCA-Panic-Crash-tp6064691p6065039.html
Re: FreeBsd MCA Panic Crash !!
Each time the MCA error occurs, the server goes down. Please advise how we should tackle this issue.

--
View this message in context: http://freebsd.1045724.n5.nabble.com/FreeBsd-MCA-Panic-Crash-tp6064691p6065041.html
RE: FreeBsd MCA Panic Crash !!
About 90% of the crashes are these MCA errors. For the remaining 10% there is no log at all. For example, one of the Supermicro servers rebooted two days ago but did not generate a crash dump under /var/crash, even though dumps are enabled in rc.conf, and we have no idea what went wrong:

dumpdev="AUTO"
dumpdir="/var/crash"

--
View this message in context: http://freebsd.1045724.n5.nabble.com/FreeBsd-MCA-Panic-Crash-tp6064691p6065042.html
Re: FreeBsd MCA Panic Crash !!
On Mon, Jan 04, 2016 at 03:34:09AM -0700, shahzaibcb wrote:
> [...]
> We've tried following solutions so far :
>
> - Updated FreeBSD OS
> - Replaced 800W PS with 900W
> - We've reduced CMOS from MAX(26x) to 18x as suggested in this post

Have you tried replacing the CPU?

> http://unix.stackexchange.com/questions/60574/determining-cause-of-linux-kernel-panic
> [...]
RE: FreeBsd MCA Panic Crash !!
> We've switched to FreeBSD recently to accommodate large video storage as we are
> running video streaming website.
> So the job of the FreeBSD is to transcode the uploaded videos using ffmpeg and
> serve them to users via nginx webserver
> but so far our experience is not very good with it. It crashes every 2-3 days
> and we're unable to track down the problem.
> The server specs are pretty high :
>
> Supermicro X5690 (12 cores, 24 threads - 2u) 96GB RAM 12x3TB RAID-10
> (HBA-LSI9211)
[...]
> CPU 3 BANK 5
> MCA: Internal Timer error
> STATUS be800400 MCGSTATUS 4

Are those the only MCA errors you're seeing? The reason I ask is that there's an errata in the X5600 series which can cause an "internal timer error" MCA to be logged after another uncorrectable MCA occurs.

If not, these do point at a hardware problem *or* errata, though software can also trigger this in some cases (for instance, reading from malfunctioning or non-existent hardware).

If your BIOS can be updated, that's a good first step, as it will generally update the CPU microcode and add workarounds for many known issues.

Replacing the CPU and/or voltage regulator is more drastic, but if the problem is hardware, it's likely in one of those components.

Anton
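One way to answer the question of whether "Internal Timer error" is the only MCA being logged is to tally the decoded records by bank and description. The following is a minimal sketch, not part of the original thread: the `tally_mca` helper is hypothetical, and the sample text is an abridged copy of two records from the mcelog output posted in this thread.

```python
import re
from collections import Counter

# Two records, abridged from the mcelog output in the original post.
SAMPLE = """\
CPU 3 BANK 5
MISC 0 ADDR 802bf6a69
MCA: Internal Timer error
STATUS be800400 MCGSTATUS 4
CPU 2 BANK 5
MISC 0 ADDR 802bf6a69
MCA: Internal Timer error
STATUS be800400 MCGSTATUS 4
"""


def tally_mca(text):
    """Count (bank, description) pairs across all records in decoded mcelog text."""
    counts = Counter()
    bank = None
    for line in text.splitlines():
        line = line.strip()
        header = re.match(r"CPU (\d+) BANK (\d+)$", line)
        if header:
            # Remember the bank until the matching "MCA:" line of this record.
            bank = int(header.group(2))
        elif line.startswith("MCA:"):
            counts[(bank, line[4:].strip())] += 1
    return counts


if __name__ == "__main__":
    for (bank, desc), n in tally_mca(SAMPLE).items():
        print(f"bank {bank}: {desc} (seen {n} times)")
```

If the tally over all saved crash reports shows anything other than bank 5 internal timer errors, those other entries are the ones worth looking at first, since the timer error may only be a follow-on event.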
FreeBsd MCA Panic Crash !!
Hi,

We've switched to FreeBSD recently to accommodate large video storage, as we are running a video streaming website. The job of the FreeBSD server is to transcode the uploaded videos using ffmpeg and serve them to users via the nginx webserver, but so far our experience with it has not been very good. It crashes every 2-3 days and we're unable to track down the problem. The server specs are pretty high:

Supermicro X5690 (12 cores, 24 threads - 2u)
96GB RAM
12x3TB RAID-10 (HBA-LSI9211)

Here is a screenshot of a recent crash:

http://prntscr.com/9er3pk

One thing worth mentioning is that before the server goes down there is no load on it, and free RAM is usually around 12GB. We've tried the following solutions so far:

- Updated FreeBSD OS
- Replaced 800W PS with 900W
- Reduced CMOS from MAX (26x) to 18x as suggested in this post:
  http://unix.stackexchange.com/questions/60574/determining-cause-of-linux-kernel-panic

The solution we have not tried so far is:

- Disable MCA (hw.mca.enabled: 0), as we're getting MCA panics.

Here is the crash dump:

[root@cw001 /var/crash]# mcelog --no-dmi --ascii --file core.txt.1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 3 BANK 5
MISC 0 ADDR 802bf6a69
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be800400 MCGSTATUS 4
MCGCAP 1c09 APICID 3 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 BANK 5
MISC 0 ADDR 802bf6a69
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be800400 MCGSTATUS 4
MCGCAP 1c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 3 BANK 5
MISC 0 ADDR 802bf6a69
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be800400 MCGSTATUS 4
MCGCAP 1c09 APICID 3 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 BANK 5
MISC 0 ADDR 802bf6a69
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be800400 MCGSTATUS 4
MCGCAP 1c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44

---

I showed those hardware errors to the vendor from whom we purchased the Supermicro servers. This is what he had to say:

---
Why do you not made one test environment with CentOS or one other Linux that you know to use, and see if you have same errors ??? if not than you know that the errors come from OS not from hardware. ( CentOS, RedHead….work diferend like FreeBSD – work direct on hardware if you don’t have the right kernel settings can the server crashed. CentOS , RedHead…. don’t work direct on hardware and distribute the resource load better and you have better control and you can better debug one situation)
---

Now we're stuck and unable to determine whether the issue is with FreeBSD or with the hardware. We're thinking of disabling MCA in loader.conf, but people advise against it. If you guys can help us, it'd be very kind.

--
View this message in context: http://freebsd.1045724.n5.nabble.com/FreeBsd-MCA-Panic-Crash-tp6064691.html
Re: FreeBsd MCA Panic Crash !!
On 04/01/16 04:34, shahzaibcb wrote:
> [...]
> If you guys can help us, it'd be very kind.

Hello there,

This seems to me like it would be a CPU failure. Can you try replacing the CPU itself? I've seen this exact message on a different board, and the cause was a failing CPU.

Please do note that, as the message says, this is not a software error. It is a failure of the hardware. Your vendor can try to blame FreeBSD all they want, but it is so improbable as to be almost impossible that FreeBSD is the problem.

You might also note to your vendor that it is "Red Hat" Linux, not Red Head.

Hope this helps.

--arw
Re: FreeBsd MCA Panic Crash !!
Bank 5 seems to be common to all the crashes, which may suggest you have some dodgy RAM or possibly a problem with the driving CPU's memory controller.

As the error says, this is a hardware issue.

One thing we've used in the past to narrow issues like this down is to remove as much RAM as possible and to disable all but one CPU core using /boot/loader.conf hints, where X is the number of the CPU core to disable as reported by the boot process:

hint.lapic.X.disabled="1"

Regards
Steve

On 04/01/2016 10:34, shahzaibcb wrote:
> [...]
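For a machine with many cores, writing one hint per core by hand is tedious. The following is a minimal sketch of generating the /boot/loader.conf lines, assuming X in the hint is the APIC ID printed in the boot messages. The `lapic_disable_hints` helper and the example IDs are hypothetical and should be replaced with the IDs your own system reports.

```python
def lapic_disable_hints(apic_ids, keep):
    """Return loader.conf lines that disable every listed APIC ID except `keep`."""
    return [
        f'hint.lapic.{apic_id}.disabled="1"'
        for apic_id in sorted(apic_ids)
        if apic_id != keep
    ]


if __name__ == "__main__":
    # Hypothetical APIC IDs for illustration only.
    for line in lapic_disable_hints([0, 1, 2, 3], keep=0):
        print(line)
```

The printed lines can be appended to /boot/loader.conf and removed again once the faulty component has been narrowed down.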
Re: FreeBsd MCA Panic Crash !!
On Monday, January 04, 2016 02:17:51 PM Steven Hartland wrote:
> Bank 5 seems to be common to all the crashes, which may suggest you have
> some dodgy ram or possibly the driving CPU's memory controller.

No, this has nothing to do with that. Bank 5 means that it is bank 5 of the Machine check registers in the processor that are triggering the errors (MC5_*). Different "banks" of the MC registers handle errors for different parts of the hardware (and this varies by CPU). For example, on Nehalem CPUs, the memory controller logs errors (e.g. ECC errors) in bank 8, but that has no correlation to the "bank" of DIMMs that the error occurred in. Later Intel CPUs can log the same errors in register banks 8 through 12 (IIRC).

Depending on the CPU model, you can determine more info about the error using the CPU manuals (for Intel the SDM).

> As the error says this is a Hardware issue.

Well, mcelog has this hardcoded and prints this for every MCA just as a matter of course. It isn't selective but assumes every machine check is a hardware error (which they are, though some are warnings for corrected events that you can ignore as the hardware hasn't degraded enough to warrant replacement. However, corrected events don't generate panics, just messages in the logs, and only a subset of corrected events include the "yellow / green" indicators for which you can ignore "green" events. Even corrected ECC errors I would ignore if you get a few events with a count of 1 that don't recur).

--
John Baldwin
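To illustrate how the status value relates to the lines mcelog printed, here is a minimal sketch that decodes the architectural flag bits of an IA32_MCi_STATUS value, using the bit positions documented in the machine-check chapter of the Intel SDM. It rests on an assumption: the "STATUS be800400" shown in this thread is taken to be the 64-bit value 0xbe00000000800400 with its run of zeros lost in the archive. The `decode_status` helper is hypothetical.

```python
# Architectural flag bits of IA32_MCi_STATUS (Intel SDM, machine-check architecture).
FLAGS = [
    (63, "VAL"),    # register contents valid
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MCi_MISC register valid
    (58, "ADDRV"),  # MCi_ADDR register valid
    (57, "PCC"),    # processor context corrupt
]

INTERNAL_TIMER_ERROR = 0x0400  # simple MCA error code for "internal timer error"


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into its flags and error-code fields."""
    return {
        "flags": [name for bit, name in FLAGS if (status >> bit) & 1],
        "mca_code": status & 0xFFFF,             # bits 15:0
        "model_code": (status >> 16) & 0xFFFF,   # bits 31:16, model specific
    }


if __name__ == "__main__":
    # ASSUMPTION: full-width form of the "be800400" value quoted in the thread.
    decoded = decode_status(0xBE00000000800400)
    print(decoded["flags"])
    print(hex(decoded["mca_code"]), hex(decoded["model_code"]))
    print("internal timer error:", decoded["mca_code"] == INTERNAL_TIMER_ERROR)
```

Under that assumption the set flags are VAL, UC, EN, MISCV, ADDRV and PCC, which lines up with the "Uncorrected error", "Error enabled", "MCi_MISC register valid", "MCi_ADDR register valid" and "Processor context corrupt" lines in the posted output. The bank number itself only says which MC register bank reported the event; what that bank covers is model specific and has to be looked up in the SDM.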