Re: (2) 5.1-R-p2 crashes on SMP with AMI RAID and Intel 1000/Pro

2003-08-25 Thread Colin Faber
After many hours of fighting with the machine, I finally managed to get a debugging kernel built.
Given that I can panic this machine on command, what would any of you very smart developer
people like out of the panic message? Are there commands I should run to get a kernel.debug, etc.?


Re: (2) 5.1-R-p2 crashes on SMP with AMI RAID and Intel 1000/Pro

2003-08-21 Thread Hartmann, O.
On Wed, 20 Aug 2003, Colin Faber wrote:

Hi.

I first moved the Intel 1000/PRO server NIC into the next slot, and since then the
machine seems to be 'stable'. Then, two days later, I replaced the PSUs with 400W
units.

I think it's an IRQ routing problem, since we have had this problem (spontaneous reboots)
from FreeBSD 4.0 on. Changing the slot for the NIC helped, but this is a very
unsatisfactory state.

I cannot remember the error message I got when the system crashed, but it looked like
yours, and I always saw the amr0 text in that message. ACPI is not working on the old
TYAN Thunder 2500 (S1867) mainboard.

I also changed machdep.cpu_idle_hlt = 0, but with no effect.

At the moment, I do not dare to swap the NIC again, because the machine is in
a preliminary production state.

I also noticed some weird behaviour when creating and deleting files around the time
the system crashed.
Crashes could always be forced by accessing Samba services from a PC. Crashes always
occurred when heavy I/O was done, but this could also be evidence of an IRQ problem,
I think. I do not know. The machine was 'stable' for a whole night (that is, while the
NIC was in the crash-causing slot), but whenever our department 'got started' in the
morning and heavy I/O was done, the machine froze. This changed when I moved the NIC
to another slot. And now I also have two 400W PSUs.

FreeBSD 5.1-p2 on the TYAN S1867 seems to have many more problems, but I do not know
whether I should mention them here. truss, for instance, crashes. We use afbackup for
backups, but afbackup dumps core on this machine, while it does not on a UP machine
also running FreeBSD 5.1-p2.
It also crashes under a UP kernel on this machine.

I tried to 'truss' an afrestore call, but I had to start the tracing three or four
times because I got this error the first time:

truss: PIOCWAIT: Input/output error

or something like this

root: /usr/local/samba/lib: truss -fae -o /tmp/afrestore afrestore -v -p /usr/homes/kurs* -C /
truss: PIOCWAIT top of loop: Input/output error
truss: PIOCWAIT top of loop: Input/output error
truss: PIOCWAIT top of loop: Input/output error
truss: PIOCWAIT top of loop: Input/output error

or sometimes truss stops for lack of a /proc/PID-XXX/mem file.

But running it a few more times 'solves' the problem.


While writing this, I crashed the system with the command shown above. This is the
error message from the kernel when the system froze (I wrote it down from the screen):


Fatal trap 12 : page fault while in kernel mode
cpuid = 1; lapic.id = 
fault virtual address   = 0x24
fault code  = supervisor read, page not present
instruction pointer = 0x8:0xc01b29db
stack pointer   = 0x10:0xe8ff3b70
frame pointer   = 0x10:0xe8ff3b84
code segment= base 0x0, limit 0xf, type 0x1b
= DPL 0, pres 1, def 32, gran 1
processor eflags= interrupt enabled, resume, IOPL = 0
current process = 27510 (bunzip2)
trap number = 12
panic: page fault
cpuid = 1, lapic.id = 
boot() called on cpu#1
syncing disks, buffers remaining ... panic: absolutely cannot call
smp_ipi_shutdown with interrupts already disabled

cpuid = 1; lapic.id = 
boot() called on cpu#1
Uptime: 1d20h18m55s
pfs_vncache_unload(): 6 entries remaining

Fatal double fault:
eip = 0xc03134ic
esp = 0xe8ff1ff8
ebp = 0xe8ff2014

cpuid = 1, lapic.id = 
panic: double fault
cpuid = 1, lapic.id = 
boot() called on cpu#1
Uptime: 1d20h18m55s
pfs_vncache_unload(): 6 entries remaining

After this, the machine was dead.
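
As a reading aid for the trap report above, here is a minimal sketch in Python that pulls
the "name = value" fields out of such a message and checks whether the fault address lies
in the first page of memory. A supervisor read of a not-present page at a tiny address
such as 0x24 is usually consistent with the kernel dereferencing a NULL structure pointer
at a small member offset, but that is an interpretation, not something the report states.
The helper names and the 4096-byte threshold are assumptions made for this illustration.

```python
# Excerpt of the hand-copied trap report from the message above.
TRAP_TEXT = """\
Fatal trap 12 : page fault while in kernel mode
fault virtual address   = 0x24
fault code  = supervisor read, page not present
instruction pointer = 0x8:0xc01b29db
current process = 27510 (bunzip2)
"""

PAGE_SIZE = 4096  # assumption: i386 page size, used as the "near NULL" threshold


def parse_trap(text):
    """Return a dict of the 'name = value' lines in a trap report."""
    fields = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key:  # skip continuation lines that start with '='
            fields[key] = value.strip()
    return fields


def looks_like_null_deref(fields):
    """True if the faulting address falls inside the first page."""
    return int(fields["fault virtual address"], 16) < PAGE_SIZE


def instruction_offset(fields):
    """Return the offset part of the 'selector:offset' instruction pointer."""
    _selector, _, offset = fields["instruction pointer"].partition(":")
    return int(offset, 16)


fields = parse_trap(TRAP_TEXT)
print("near-NULL fault:", looks_like_null_deref(fields))
print("instruction offset: %#x" % instruction_offset(fields))
```

The instruction offset is the value one would look up against the symbols of a debugging
kernel to find the faulting function.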

:Hi,
:
:I've got nearly the same setup in a Dell 1600SC with a gig of RAM and a PERC4/Sc
:(LSI MegaRAID) card.
:
:Dual 2.4GHz Xeon P4 HT CPUs, and I've discovered I can lock up FreeBSD
:5.1-RELEASE-p2 on command
:simply by running something to quickly create and remove a directory, e.g.:
:
:  perl -e 'for(my $i = 0 ; $i  ; $i++){ mkdir(abc); rmdir(abc); }'
:
:
:Having machdep.cpu_idle_hlt = 0 makes no difference.
:
:
:Kernel:
:  FreeBSD 5.1-RELEASE-p2 FreeBSD 5.1-RELEASE-p2 #0: Mon Aug 11 21:40:47 MDT 2003 i386
:
:Raid:
:  amr0: LSILogic MegaRAID mem 0xfcd0-0xfcd0 irq 3 at device 2.0 on pci1
:  amrd0: LSILogic MegaRAID logical drive on amr0
:  amrd0: 34556MB (70770688 sectors) RAID 5 (optimal)
:
:
:I suspect that your problems and mine are related to the amr driver and may be
:exposing some other problem within the kernel's fs locking. I don't think (as others
:have suggested) that your issue is power related, or related to the combination of
:hardware you're using (other than the fact that you've got a MegaRAID card).
:
:The exact crash message I'm seeing is:
:
:panic: lockmgr: locking against myself
:cpuid = 0; lapic.id 
:boot() called on cpu#0
:
:syncing disks, buffers remaining... panic: ffs_copyonwrite: recursive call
:cpuid = 0; lapic.id 
:boot() 
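
The directory create/remove loop quoted above can be sketched in Python as follows. The
loop bound of the original Perl one-liner did not survive in the archive, so ITERATIONS
here is a hypothetical value, and the function name and the use of a scratch directory
are likewise assumptions made for this illustration. On a machine affected by the
problem described in this thread, running such a loop may lock it up.

```python
import os
import tempfile

ITERATIONS = 1000  # hypothetical; the original loop bound is not preserved


def churn_directory(parent, iterations):
    """Create and remove the directory 'abc' under parent, iterations times.

    Returns the number of completed create/remove cycles.
    """
    target = os.path.join(parent, "abc")
    completed = 0
    for _ in range(iterations):
        os.mkdir(target)
        os.rmdir(target)
        completed += 1
    return completed


if __name__ == "__main__":
    # Work in a scratch directory so nothing else on the file system is touched.
    with tempfile.TemporaryDirectory() as scratch:
        cycles = churn_directory(scratch, ITERATIONS)
        print(cycles, "create/remove cycles completed")
```

Pointing the scratch directory at a file system on the RAID volume (for example via the
dir argument of tempfile.TemporaryDirectory) would be needed to exercise the amr-backed
disk rather than wherever the default temporary directory lives.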

Re: (2) 5.1-R-p2 crashes on SMP with AMI RAID and Intel 1000/Pro

2003-08-21 Thread Petri Helenius
Related to the em driver: the 82540M has not worked since sometime around 5.1-BETA.
I filed a PR on that a few months ago, but it seems the fault might be with PCI IRQ
routing, not the em driver itself.

Pete
