Jenkins build is back to normal : FreeBSD_stable_10 #261

2016-05-12 Thread jenkins-admin
See 

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: HP DL 585 / ACPI ID / ECC Memory / Panic

2016-05-12 Thread Nikolaj Hansen

Hi,

On 2016-05-12 21:03, Steven Hartland wrote:

> I wouldn't rule out a bad cpu as we had a very similar issue and that's
> what it was.
>
> Quick way to confirm is to move all the dram from the disabled CPU to
> one of the other CPUs and see if the issue stays away with the current
> CPU still disabled.


One core is still running, seemingly without problems; I disabled only one
core, not the entire CPU. APIC 1 and 2, I believe, are on the same chip. I
am not much of a CPU design expert, but if the two cores are on the same
CPU chip, do they not share the same memory bus on this model of AMD CPU?

> If that's the case it's likely the on chip memory controller has
> developed a fault


Or you could just swap two CPU cards and see if the error jumps from
APIC 1+2 to APIC 3+4. Are these IDs assigned in order by FreeBSD, or is
the ordering random?

I suppose I could move all of the boards one step to the right and test
it that way regardless.

If it does, it is probably a DIMM or, as you say, the memory bus; if not,
it is probably the CPU board slot on the mainboard itself.
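The card-swap test described above boils down to a small decision rule. The Python sketch below is only an illustration of that reasoning: it assumes, hypothetically, that APIC IDs belong to the mainboard slot rather than to the card (which is exactly the open question about ordering), and `locate_fault` and `slot_to_apic` are made-up names, not values read from this server.

```python
from typing import Dict, Set


def locate_fault(slot_to_apic: Dict[int, Set[int]],
                 slot_before: int,
                 slot_after: int,
                 failing_after: Set[int]) -> str:
    """Say whether a fault followed a moved CPU card or stayed in its old slot.

    slot_to_apic: hypothetical map from mainboard slot to the APIC IDs of the
        cores seated there; assumes the IDs belong to the slot, not the card.
    slot_before / slot_after: where the suspect card sat before and after the swap.
    failing_after: APIC IDs implicated by the error after the swap.
    """
    moved_to = slot_to_apic[slot_after]
    left_behind = slot_to_apic[slot_before]
    if failing_after & moved_to:
        # The error followed the card: suspect a DIMM or the card's memory path.
        return "card: DIMM or memory bus"
    if failing_after & left_behind:
        # The error stayed with the slot although a different card sits there now.
        return "mainboard: CPU board slot"
    return "inconclusive"


# Illustrative numbering taken from the message (APIC 1+2 vs. 3+4).
slots = {1: {1, 2}, 2: {3, 4}}
print(locate_fault(slots, slot_before=1, slot_after=2, failing_after={3}))
# card: DIMM or memory bus
```

If the assumption about slot-bound APIC IDs does not hold on this board, the same swap still works as a physical test, but the result has to be read from which card the panics follow rather than from the IDs.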


I will try this and post my findings.

Off-topic:

I cannot believe how poor the onboard BIOS diagnostics are on this server
compared to my old IBM Netfinity 5000.


rgrds

Nikolaj Hansen


Re: HP DL 585 / ACPI ID / ECC Memory / Panic

2016-05-12 Thread Rainer Duffner

> On 12.05.2016 at 21:03, Steven Hartland wrote:
> 
> I wouldn't rule out a bad cpu as we had a very similar issue and that's
> what it was.

IIRC, HP's AMD servers had numerous problems in the first few generations.
Some worked well (I think we have a handful of 385 G1/G2/G5 still running), but
others would just hang or crash from time to time.
My boss was never too keen on them anyway, so we never had that many to begin
with.

Plus, HP servers had and have a way of popping when you remove the power from a 
long-running one (that’s probably servers in general).
Most times, it’s only the PSU or a disk, but we’ve also fried NICs by simply 
powering the damn thing off…


Re: HP DL 585 / ACPI ID / ECC Memory / Panic

2016-05-12 Thread Steven Hartland
I wouldn't rule out a bad CPU, as we had a very similar issue and that's
what it was.

A quick way to confirm is to move all the DRAM from the disabled CPU to one
of the other CPUs and see if the issue stays away with the current CPU
still disabled.

If that's the case, it's likely the on-chip memory controller has developed
a fault.



Build failed in Jenkins: FreeBSD_stable_10 #260

2016-05-12 Thread jenkins-admin
8.10.2] out: usr.bin/cpio/functional_test:test_option_l  ->  passed  [0.205s]
[192.168.10.2] out: usr.bin/cpio/functional_test:test_option_lrzip  ->  passed  [0.107s]
[192.168.10.2] out: usr.bin/cpio/functional_test:test_option_lzma  ->  passed  [0.116s]
[192.168.10.2] out: usr.bin/cpio/functional_test:test_option_lzop  ->  passed  [0.160s]
[192.168.10.2] out: usr.bin/cpio/functional_test:test_option_m  ->  passed  [0.168s]
[192.168.10.2] out: usr.bin/cpio/functional_test:test_option_t  ->  passed  [0.160s]
[192.168.10.2] out: usr.bin/cpio/functional_test:test_option_u  ->  passed  [0.157s]
[192.168.10.2] out: usr.bin/cpio/functional_test:test_option_uuencode  ->  passed  [0.121s]
[192.168.10.2] out: usr.bin/cpio/functional_test:test_option_version  ->  passed  [0.096s]
[192.168.10.2] out: usr.bin/cpio/functional_test:test_option_xz  ->  passed  [0.087s]
[192.168.10.2] out: usr.bin/cpio/functional_test:test_option_y  ->  passed  [0.072s]
[192.168.10.2] out: usr.bin/cpio/functional_test:test_option_z  ->  passed  [0.111s]
[192.168.10.2] out: usr.bin/cpio/functional_test:test_owner_parse  ->  passed  [0.081s]
[192.168.10.2] out: usr.bin/cpio/functional_test:test_passthrough_dotdot  ->  passed  [0.113s]
[192.168.10.2] out: usr.bin/cpio/functional_test:test_passthrough_reverse  ->  passed  [0.426s]
[192.168.10.2] out: usr.bin/comm/legacy_test:main  ->  passed  [0.020s]
[192.168.10.2] out: usr.bin/timeout/timeout:exit_numbers  ->  passed  [0.547s]
[192.168.10.2] out: usr.bin/timeout/timeout:invalid_command  ->  passed  [0.035s]
[192.168.10.2] out: usr.bin/timeout/timeout:invalid_signal  ->  passed  [0.034s]
[192.168.10.2] out: usr.bin/timeout/timeout:invalid_timeout  ->  passed  [0.073s]
[192.168.10.2] out: usr.bin/timeout/timeout:no_such_command  ->  passed  [0.034s]
[192.168.10.2] out: usr.bin/timeout/timeout:no_timeout  ->  passed  [0.036s]
[192.168.10.2] out: usr.bin/timeout/timeout:nominal  ->  passed  [0.034s]
[192.168.10.2] out: usr.bin/timeout/timeout:time_unit  ->  passed  [0.057s]
[192.168.10.2] out: usr.bin/timeout/timeout:with_a_child  ->  passed  [0.533s]
[192.168.10.2] out: 
[192.168.10.2] out: Results file id is usr_tests.20160512-171142-266651
[192.168.10.2] out: Results saved to /root/.kyua/store/results.usr_tests.20160512-171142-266651.db
[192.168.10.2] out: 
[192.168.10.2] out: 5049/5049 passed (0 failed)
[192.168.10.2] out: 

[192.168.10.2] run: kyua report --verbose --results-filter passed,skipped,xfail,broken,failed  --output test-report.txt
[192.168.10.2] run: kyua report-junit --output=test-report.xml
[192.168.10.2] run: shutdown -p now
[192.168.10.2] out: Shutdown NOW!
[192.168.10.2] out: shutdown: [pid 62468]
[192.168.10.2] out: 

adcast 192.168.10.255 
kyuatestprompt # May 12 17:25:43  h_fgets: stack overflow detected; terminated
May 12 17:25:43  h_gets: stack overflow detected; terminated
May 12 17:25:43  h_memcpy: stack overflow detected; terminated
May 12 17:25:43  h_memmove: stack overflow detected; terminated
May 12 17:25:43  h_memset: stack overflow detected; terminated
May 12 17:25:43  h_read: stack overflow detected; terminated
May 12 17:25:43  h_readlink: stack overflow detected; terminated
May 12 17:25:43  h_snprintf: stack overflow detected; terminated
May 12 17:25:44  h_sprintf: stack overflow detected; terminated
May 12 17:25:44  h_stpcpy: stack overflow detected; terminated
May 12 17:25:44  h_stpncpy: stack overflow detected; terminated
May 12 17:25:44  h_strcat: stack overflow detected; terminated
May 12 17:25:44  h_strcpy: stack overflow detected; terminated
May 12 17:25:44  h_strncat: stack overflow detected; terminated
May 12 17:25:44  h_strncpy: stack overflow detected; terminated
May 12 17:25:44  h_vsnprintf: stack overflow detected; terminated
May 12 17:25:44  h_vsprintf: stack overflow detected; terminated

GEOM_CONCAT: Device concat.DwXTSf created (id=4163115584).
GEOM_CONCAT: Disk md0 attached to concat.DwXTSf.
GEOM_CONCAT: Disk md1 attached to concat.DwXTSf.
GEOM_CONCAT: Disk md2 attached to concat.DwXTSf.
GEOM_CONCAT: Device concat/concat.DwXTSf activated.
GEOM_CONCAT: Disk md2 removed from concat.DwXTSf.
GEOM_CONCAT: Device concat/concat.DwXTSf deactivated.
GEOM_CONCAT: Disk md1 removed from concat.DwXTSf.
GEOM_CONCAT: Disk md0 removed from concat.DwXTSf.
GEOM_CONCAT: Device concat.DwXTSf destroyed.
GEOM_CONCAT: Device concat.9H0MkY created (id=977263104).
GEOM_CONCAT: Disk md0 attached to concat.9H0MkY.
GEOM_CONCAT: Disk md1 attached to concat.9H0MkY.
GEOM_CONCAT: Disk md2 attached to concat.9H0MkY.
GEOM_CONCAT: Device concat/concat.9H0MkY activated.
GEOM_CONCAT: Disk md2 removed from concat.9H0MkY.
Traceback (most recent call last):
  File "freebsd-ci/scripts/test/run-tests.py", line 207, in <module>
    main(sys.argv)
  File "freebsd-ci/scripts/test/run-tests.py", line 79, in main
    runTest()
  File "freebsd-ci/scripts/test/run-te

FreeBSD_STABLE_10-i386 - Build #1160 - Fixed

2016-05-12 Thread jenkins-admin
FreeBSD_STABLE_10-i386 - Build #1160 - Fixed:

Build information: https://jenkins.FreeBSD.org/job/FreeBSD_STABLE_10-i386/1160/
Full change log: 
https://jenkins.FreeBSD.org/job/FreeBSD_STABLE_10-i386/1160/changes
Full build log: 
https://jenkins.FreeBSD.org/job/FreeBSD_STABLE_10-i386/1160/console

Change summaries:

No changes


HP DL 585 / ACPI ID / ECC Memory / Panic

2016-05-12 Thread Nikolaj Hansen

Hi,

I recently added a ZFS disk array to my old HP DL585 G1 server.
Kernel panics started immediately, and I have spent quite a bit of time
figuring out what was really wrong.

The system has 4 CPU cards with dual-core Opteron processors. Each
card holds 4 x 2 GB of memory, for 4 x 2 x 4 = 32 GB of total system
memory. The memory is DDR400 ECC.

The panic was very easily reproducible. I just had to issue enough reads
to the system until the faulty memory was accessed.

Strangely, I can run memtest86+ with the DDR setting on and find no
errors whatsoever.

Adding

hint.lapic.2.disabled=1

to /boot/loader.conf immediately mitigates the error under FreeBSD. So here is my conclusion:

If you can make the system stable by disabling one core on one CPU card:

1) The other cards and their memory must be OK.
2) The mainboard must be OK, since one of the cores on that CPU is still
running and not barfing panics.
3) The CPU core with APIC ID 2 is probably also OK; it is on the same chip
as a non-disabled core.
4) It is likely down to a rotten DIMM.
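When adding a loader hint such as `hint.lapic.2.disabled=1` from a shell, a single `>` redirect truncates /boot/loader.conf instead of appending to it (`>>` appends). The Python sketch below is one hedged way to set such a hint idempotently: it replaces the first uncommented assignment of the same key or appends one, leaving every other line alone. `with_hint` and `set_loader_hint` are hypothetical helper names, and the line is written in the unquoted `key=value` form used in the message.

```python
import re
from pathlib import Path


def with_hint(conf_text: str, key: str, value: str) -> str:
    """Return conf_text with `key` set to `value`.

    The first uncommented assignment of `key` is replaced; if there is none,
    the line is appended. All other lines are kept unchanged.
    """
    new_line = f"{key}={value}"
    # Match "key=" or "key =" at line start, but not "# key=" or "key2=".
    assignment = re.compile(rf"^\s*{re.escape(key)}\s*=")
    lines = conf_text.splitlines()
    for i, line in enumerate(lines):
        if assignment.match(line):
            lines[i] = new_line
            break
    else:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


def set_loader_hint(path: Path, key: str, value: str) -> None:
    """Rewrite a loader.conf-style file in place (needs root for /boot)."""
    old = path.read_text() if path.exists() else ""
    path.write_text(with_hint(old, key, value))


# Usage (as root):
# set_loader_hint(Path("/boot/loader.conf"), "hint.lapic.2.disabled", "1")
```

Splitting the pure text transformation from the file I/O keeps the risky part (rewriting a boot file) as small as possible and makes the logic easy to check before touching /boot.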

Rather than mindlessly trying to find the culprit by swapping DIMMs, I
would really like to identify the CPU, card and memory module from the OS.
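A first step toward tying a core to a physical CPU is grouping APIC IDs by package. The sketch below assumes the encoding commonly used by AMD multi-core Opterons, where the low bit(s) of the initial APIC ID give the core index and the remaining bits the package; a BIOS may offset or renumber the IDs, so treat the result as a hypothesis to check against the APIC IDs FreeBSD reports at boot. `split_apic_id` and `group_by_package` are hypothetical helpers, and mapping a package number to a physical CPU card still needs HP's documentation for the board.

```python
from typing import Dict, List, Tuple


def core_bits(cores_per_package: int) -> int:
    """Low APIC-ID bits assumed to hold the core index: ceil(log2(n))."""
    if cores_per_package < 1:
        raise ValueError("cores_per_package must be >= 1")
    return (cores_per_package - 1).bit_length()


def split_apic_id(apic_id: int, cores_per_package: int) -> Tuple[int, int]:
    """Return (package, core) for an initial APIC ID.

    Assumption: IDs start at 0, the core index sits in the low bits and the
    package index in the bits above; a BIOS offset would break this.
    """
    bits = core_bits(cores_per_package)
    return apic_id >> bits, apic_id & ((1 << bits) - 1)


def group_by_package(apic_ids: List[int], cores_per_package: int) -> Dict[int, List[int]]:
    """Group APIC IDs that, under the same assumption, share one physical CPU."""
    groups: Dict[int, List[int]] = {}
    for apic_id in sorted(apic_ids):
        package, _core = split_apic_id(apic_id, cores_per_package)
        groups.setdefault(package, []).append(apic_id)
    return groups


# Eight cores on four dual-core packages, numbered 0..7 (hypothetical input).
print(group_by_package(list(range(8)), cores_per_package=2))
# {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
```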

Info here:

http://pastebin.com/jqufNKck

Thank you for your time and help.

--


Med venlig hilsen / with regards

Nikolaj Hansen


Jenkins build is back to stable : FreeBSD_stable_10 #259

2016-05-12 Thread jenkins-admin
See 
