RE: FreeBsd MCA Panic Crash !!

2016-01-05 Thread shahzaibcb
> Are those the only MCA errors you're seeing? The reason I ask is that
> there's an erratum in the X5600 series which can cause an "internal timer
> error" MCA to be logged after another uncorrectable MCA occurs.

About 90% of the crashes are these MCA errors. For the remaining 10% there
is no log at all. For example, one of the Supermicro servers rebooted two
days ago but did not generate a crash dump under /var/crash, even though
dumps are enabled in rc.conf:

dumpdev="AUTO"
dumpdir="/var/crash"

> This seems to me like it would be a CPU failure.  Can you try replacing
> the CPU itself?  I've seen this exact message on a different board, and
> the cause was a failing CPU.

We're thinking of replacing the X5690 CPUs with X5675s.

> Well, mcelog has this hardcoded and prints this for every MCA just as a
> matter of course.  It isn't selective but assumes every machine check is
> a hardware error (which they are, though some are warnings for corrected
> events that you can ignore as the hardware hasn't degraded enough to
> warrant replacement.  However, corrected events don't generate panics,
> just messages in the logs, and only a subset of corrected events include
> the "yellow / green" indicators for which you can ignore "green" events.
> Even corrected ECC errors I would ignore if you get a few events with
> a count of 1 that don't recur).

Each time the MCA error occurs, the server goes down. Please advise how we
should tackle this issue.

> Depending on the CPU model, you can determine more info about the
> error using the CPU manuals (for Intel the SDM).

The CPU is an X5690. Is there a link where we can get the manual for the
X5690 CPU?



--
View this message in context: 
http://freebsd.1045724.n5.nabble.com/FreeBsd-MCA-Panic-Crash-tp6064691p6065043.html
Sent from the freebsd-current mailing list archive at Nabble.com.
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: FreeBsd MCA Panic Crash !!

2016-01-04 Thread Slawa Olhovchenkov
On Mon, Jan 04, 2016 at 03:34:09AM -0700, shahzaibcb wrote:

> [...]
> - Updated FreeBSD OS
> - Replaced 800W PS with 900W
> - We've reduced CMOS from MAX(26x) to 18x as suggested in this post
> http://unix.stackexchange.com/questions/60574/determining-cause-of-linux-kernel-panic

Did you try replacing the CPU?

RE: FreeBsd MCA Panic Crash !!

2016-01-04 Thread Rang, Anton
>We've switched to FreeBSD recently to accommodate large video storage as we are
>running a video streaming website.
>So the job of the FreeBSD is to transcode the uploaded videos using ffmpeg and 
>serve them to users via nginx webserver
>but so far our experience is not very good with it. It crashes every 2-3 days 
>and we're unable to track down the problem.
>The server specs are pretty high :
>
> Supermicro X5690 (12 cores, 24 threads - 2u) 96GB RAM 12x3TB RAID-10 
> (HBA-LSI9211)

[...]

>CPU 3 BANK 5
>MCA: Internal Timer error
>STATUS be800400 MCGSTATUS 4

Are those the only MCA errors you're seeing? The reason I ask is that there's 
an erratum in the X5600 series which can cause an "internal timer error" MCA to 
be logged after another uncorrectable MCA occurs.

If not, these do point at a hardware problem *or* an erratum, though software can 
also trigger this in some cases (for instance, reading from malfunctioning or 
non-existent hardware). If your BIOS can be updated, that's a good first step 
as it will generally update the CPU microcode and add workarounds for many 
known issues. Replacing the CPU and/or voltage regulator is more drastic, but 
if the problem is hardware, it's likely in one of those components.

Anton





FreeBsd MCA Panic Crash !!

2016-01-04 Thread shahzaibcb
Hi,

We recently switched to FreeBSD to accommodate large video storage, as we
run a video streaming website. The FreeBSD servers transcode the uploaded
videos using ffmpeg and serve them to users via the nginx web server, but so
far our experience has not been very good. The server crashes every 2-3 days
and we are unable to track down the problem. The server specs are pretty
high:


Supermicro X5690 (12 cores, 24 threads - 2u)
96GB RAM
12x3TB RAID-10 (HBA-LSI9211)

Here is a screenshot of a recent crash:

http://prntscr.com/9er3pk

One thing worth mentioning is that there is no load on the server before it
goes down, and free RAM is usually around 12GB. We have tried the following
solutions so far:


- Updated the FreeBSD OS
- Replaced the 800W PSU with a 900W one
- Reduced the CMOS setting from MAX (26x) to 18x, as suggested in this post:
http://unix.stackexchange.com/questions/60574/determining-cause-of-linux-kernel-panic

The solution we have not tried so far is:

- Disabling MCA (hw.mca.enabled: 0), since we are getting MCA panics.

Here is the crash dump:

[root@cw001 /var/crash]# mcelog --no-dmi --ascii --file core.txt.1 
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 3 BANK 5 
MISC 0 ADDR 802bf6a69 
MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be800400 MCGSTATUS 4
MCGCAP 1c09 APICID 3 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 BANK 5 
MISC 0 ADDR 802bf6a69 
MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be800400 MCGSTATUS 4
MCGCAP 1c09 APICID 2 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 3 BANK 5 
MISC 0 ADDR 802bf6a69 
MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be800400 MCGSTATUS 4
MCGCAP 1c09 APICID 3 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 BANK 5 
MISC 0 ADDR 802bf6a69 
MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be800400 MCGSTATUS 4
MCGCAP 1c09 APICID 2 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 44

---

I showed these hardware errors to the vendor from whom we purchased the
Supermicro servers. This is what he had to say:

---
Why do you not made one test environment with CentOS or one other Linux that
you know to use, and see if you have same errors ??? if not than you know
that the errors come from OS not from hardware. ( CentOS, RedHead….work
diferend like FreeBSD – work direct on hardware if you don’t have the right
kernel settings can the server crashed. CentOS , RedHead…. don’t work direct
on hardware and distribute the resource load better and you have better
control and you can better debug one situation)
---

Now we're in a black hole and unable to determine whether the issue lies with
FreeBSD or the hardware. We're thinking of disabling MCA in loader.conf, but
people advise against it. If you can help us, we'd be very grateful.




Re: FreeBsd MCA Panic Crash !!

2016-01-04 Thread Anna Wilcox


On 04/01/16 04:34, shahzaibcb wrote:
> [...]

Hello there,

This seems to me like it would be a CPU failure.  Can you try replacing
the CPU itself?  I've seen this exact message on a different board, and
the cause was a failing CPU.

Please do note that, as the message says, this is not a software error.
It is a failure of the hardware.  Your vendor can try to blame FreeBSD
all they want, but that is so improbable as to be almost impossible.
You might also note to your vendor that it is "Red Hat" Linux, not
Red Head.

Hope this helps.
--arw

Re: FreeBsd MCA Panic Crash !!

2016-01-04 Thread Steven Hartland
Bank 5 seems to be common to all the crashes, which may suggest you have 
some dodgy RAM or possibly a problem with the driving CPU's memory controller.
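
One way to confirm which CPU and bank each record names is to tally the records mechanically. The following is a minimal Python sketch, assuming the mcelog text layout shown in the original post (a "CPU n BANK m" line followed later by an "MCA: ..." line). The function name summarize is made up for this example.

```python
import re
from collections import Counter

def summarize(mcelog_text):
    """Count mcelog records by (cpu, bank, MCA message)."""
    counts = Counter()
    cpu_bank = None
    for raw in mcelog_text.splitlines():
        # Drop mail quoting ("> ") and surrounding whitespace.
        line = raw.lstrip("> ").strip()
        m = re.match(r"CPU (\d+) BANK (\d+)", line)
        if m:
            cpu_bank = (int(m.group(1)), int(m.group(2)))
        elif line.startswith("MCA:") and cpu_bank is not None:
            counts[cpu_bank + (line[4:].strip(),)] += 1
            cpu_bank = None
    return counts
```

Feeding it the contents of core.txt.1 as decoded above should give one entry per distinct CPU/bank/message combination, which makes it easy to see whether anything other than bank 5 ever shows up.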


As the error says this is a Hardware issue.

One thing we've used in the past to narrow issues like this down is to 
remove as much RAM as possible and to disable all but one CPU core using 
/boot/loader.conf hints, where X is the number of the CPU core to 
disable as reported by the boot process.

hint.lapic.X.disabled="1"
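
To disable several cores at once, the hint line has to be repeated once per core. A minimal Python sketch that generates those lines follows. It assumes X is the numeric ID printed for each core at boot. The function name and the example IDs are hypothetical.

```python
def lapic_disable_hints(core_ids):
    """Return /boot/loader.conf lines disabling the given core IDs."""
    # De-duplicate and sort so the output is stable.
    return ['hint.lapic.%d.disabled="1"' % i for i in sorted(set(core_ids))]

if __name__ == "__main__":
    # Hypothetical example: leave ID 0 enabled, disable IDs 1 through 3.
    print("\n".join(lapic_disable_hints(range(1, 4))))
```

The printed lines can be appended to /boot/loader.conf and removed again once the test run is over.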

Regards
Steve


Re: FreeBsd MCA Panic Crash !!

2016-01-04 Thread John Baldwin
On Monday, January 04, 2016 02:17:51 PM Steven Hartland wrote:
> Bank 5 seems to be common to all the crashes, which may suggest you have 
> some dodgy ram or possibly the driving CPU's memory controller.

No, this has nothing to do with that.  Bank 5 means that it is bank 5 of the
machine check registers in the processor (MC5_*) that is reporting the
errors.  Different "banks" of the MC registers handle errors for different
parts of the hardware (and this varies by CPU).  For example, on Nehalem
CPUs, the memory controller logs errors (e.g. ECC errors) in bank 8, but
that has no correlation to the "bank" of DIMMs that the error occurred in.
Later Intel CPUs can log the same errors in register banks 8 through 12
(IIRC).  Depending on the CPU model, you can determine more info about the
error using the CPU manuals (for Intel the SDM).
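
As a rough illustration of how the status word in the report maps onto the decoded lines, here is a minimal Python sketch. It assumes the architectural IA32_MCi_STATUS layout from the Intel SDM (flag bits 63 down to 57, model-specific code in bits 31:16, MCA error code in bits 15:0). It also assumes that the "STATUS be800400" printed by mcelog stands for the 64-bit value 0xbe00000000800400; that expansion is a guess chosen to match the decoded flags, not something the log states. The names FLAGS and decode_status are made up for this example.

```python
# Flag bits in IA32_MCi_STATUS (architectural layout, Intel SDM).
FLAGS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

INTERNAL_TIMER = 0x0400  # simple MCA error code for "internal timer error"

def decode_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error codes."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_code": status & 0xFFFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }

if __name__ == "__main__":
    # Assumed full-width form of the value shown in the report.
    info = decode_status(0xBE00000000800400)
    for flag in info["flags"]:
        print(flag)
    print("internal timer error:", info["mca_code"] == INTERNAL_TIMER)
```

Under that assumption the set flags line up with the "Uncorrected error", "Error enabled", "MCi_MISC register valid", "MCi_ADDR register valid" and "Processor context corrupt" lines in the mcelog output.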

> As the error says this is a Hardware issue.

Well, mcelog has this hardcoded and prints this for every MCA just as a
matter of course.  It isn't selective but assumes every machine check is
a hardware error.  They are, though some are warnings for corrected
events that you can ignore as the hardware hasn't degraded enough to
warrant replacement.  However, corrected events don't generate panics,
just messages in the logs, and only a subset of corrected events include
the "yellow / green" indicators, of which you can ignore "green" events.
Even corrected ECC errors I would ignore if you get a few events with
a count of 1 that don't recur.

-- 
John Baldwin