Re: HP DL 585 / ACPI ID / ECC Memory / Panic

2016-05-12 Thread Nikolaj Hansen

Hi,

On 2016-05-12 21:03, Steven Hartland wrote:

> I wouldn't rule out a bad cpu as we had a very similar issue and that's
> what it was.
>
> Quick way to confirm is to move all the dram from the disabled CPU to
> one of the other CPUs and see if the issue stays away with the current
> CPU still disabled.


One core is still running, seemingly without problems. I disabled only
one core, not the entire CPU. APIC 1 and 2 are, I believe, on the same
chip. I am no CPU design expert, but if the two cores are on the same
chip, do they not share the same memory bus on this model of AMD CPU?




> If that's the case it's likely the on chip memory controller has
> developed a fault


Or you could just swap two CPU cards and see if the error jumps from
APIC 1+2 (err) to APIC 3+4 (err). That is, if FreeBSD assigns these IDs
in order. Or is the ordering random?


I suppose I could move all of the boards one step to the right and test 
it that way regardless.


If it does, it is probably a DIMM or, as you say, the memory bus. If not,
it is probably the CPU board slot on the mainboard itself.


I will try this and post my findings.
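
The two tests discussed in this message can be written down as a small decision sketch. This is only an illustration of the reasoning, not a diagnostic tool: the function names and return strings are made up for this example, and the "panics return" branch of the DIMM test is an inference that nobody stated outright in the thread.

```python
def interpret_dimm_move(panics_after_move: bool) -> str:
    """Steven's test: move all DIMMs from the disabled CPU to another CPU
    and keep the suspect CPU disabled.

    If the panics stay away, the DIMMs work fine elsewhere, so the CPU
    (likely its on-chip memory controller) is the suspect. If the panics
    return, the DIMMs themselves are the suspect (inferred converse).
    """
    if panics_after_move:
        return "dimm"
    return "cpu or on-chip memory controller"


def interpret_card_swap(error_followed_card: bool) -> str:
    """Card swap test: move a CPU card to a different slot.

    If the error follows the card, the fault is on the card (a DIMM or
    the memory bus). If it stays with the slot, the slot on the
    mainboard is the suspect.
    """
    if error_followed_card:
        return "dimm or memory bus on the moved card"
    return "cpu board slot on the mainboard"


if __name__ == "__main__":
    print(interpret_dimm_move(panics_after_move=False))
    print(interpret_card_swap(error_followed_card=True))
```

Note that the card swap result is only meaningful if the APIC IDs stay tied to the physical slots across reboots, which is exactly the open question above.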

Offtopic:

I cannot believe how poor the onboard BIOS diagnostics are on this server
compared to my old IBM Netfinity 5000.


rgrds

Nikolaj Hansen





Re: HP DL 585 / ACPI ID / ECC Memory / Panic

2016-05-12 Thread Rainer Duffner

> On 12.05.2016 at 21:03, Steven Hartland wrote:
> 
> I wouldn't rule out a bad cpu as we had a very similar issue and that's
> what it was.




IIRC, HP's AMD servers had numerous problems for the first few generations.
Some worked well (I think we have a handful of 385 G1/G2/G5 still running), but
others would just hang or crash from time to time.
My boss was never too keen on them anyway, so we never had that many to begin
with.

Plus, HP servers had and have a way of popping when you remove the power from a 
long-running one (that’s probably servers in general).
Most times, it’s only the PSU or a disk, but we’ve also fried NICs by simply 
powering the damn thing off…

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"

Re: HP DL 585 / ACPI ID / ECC Memory / Panic

2016-05-12 Thread Steven Hartland
I wouldn't rule out a bad cpu as we had a very similar issue and that's
what it was.

Quick way to confirm is to move all the dram from the disabled CPU to one
of the other CPUs and see if the issue stays away with the current CPU
still disabled.

If that's the case, it's likely the on-chip memory controller has developed
a fault.

On Thursday, 12 May 2016, Nikolaj Hansen wrote:

> Hi,
>
> I recently added a ZFS disk array to my old HP 585 G1 server.
> Immediately there were kernel panics, and I have spent quite a bit of time
> figuring out what was really wrong.
>
> The system has 4 CPU cards with dual-core Opteron processors. Each
> card has 4 x 2 GB of memory, so 4 cards x 4 DIMMs x 2 GB = 32 GB of
> total system memory. The memory is DDR400 ECC.
>
> The panic was very easily reproducible. I just had to issue enough reads
> to the system until the faulty memory was accessed.
>
> Strangely, I can run memtest86+ with the DDR setting on and find no
> errors whatsoever.
>
> Adding the following line to /boot/loader.conf
>
> hint.lapic.2.disabled="1"
>
> Immediately mitigates the error for FreeBSD. So here is my conclusion:
>
> If you can make the system stable by disabling one core on one cpu card:
>
> 1) The other cards / mem must be ok.
> 2) The mainboard must be ok since one of the cores on the cpu is still
> running / not barfing panics.
> 3) The CPU core with APIC ID 2 is probably also OK, since it is on the
> same chip as a core that is not disabled.
> 4) It is likely down to a rotten DIMM.
>
> Instead of mindlessly trying to find the culprit by swapping DIMMs, I
> would really like to identify the CPU, card and memory module from the OS.
>
> Info here:
>
> http://pastebin.com/jqufNKck
>
> Thank you for your time and help.
>
> --
>
>
> Med venlig hilsen / with regards
>
> Nikolaj Hansen