Re: (2) 5.1-R-p2 crashes on SMP with AMI RAID and Intel 1000/Pro
After many hours of fighting with the machine, I finally managed to get a debugging kernel built. Given that I can panic this machine on command, what would any of you very smart developer people like out of the panic message? Are there commands I should run to get a kernel.debug, etc.?

Petri Helenius wrote:

:Related to the em driver: the 82540M has not worked since sometime around 5.1-BETA. I filed a PR on
:that a few months ago, but it seems the fault might be with PCI IRQ routing, not the em driver itself.
:
:Pete
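For the question of what to collect from a panic, here is a rough sketch of the usual FreeBSD 5.x crash-dump setup. The kernel config name MYKERNEL and the dump device /dev/ad0s1b are hypothetical placeholders, not taken from this thread, and nothing here has been checked against the hardware discussed; treat it as a checklist rather than a verified procedure.

```sh
# Kernel configuration file (hypothetical name MYKERNEL): build with debug
# symbols and the in-kernel debugger; the build then also produces kernel.debug.
#   makeoptions DEBUG=-g
#   options     DDB

# /etc/rc.conf: write a crash dump to a swap partition on panic.
# The device name is an assumption; the partition must hold all of physical RAM.
dumpdev="/dev/ad0s1b"
dumpdir="/var/crash"

# At the db> prompt after the panic:
#   trace     (stack backtrace of the faulting thread)
#   panic     (force the dump if it did not start by itself)
# After the reboot, savecore(8) should leave /var/crash/vmcore.N; then:
#   gdb -k kernel.debug /var/crash/vmcore.0
#   (kgdb) where
```

The backtrace from `trace` or `where`, together with the exact panic string, is usually the most useful part to post.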
Re: (2) 5.1-R-p2 crashes on SMP with AMI RAID and Intel 1000/Pro
On Wed, 20 Aug 2003, Colin Faber wrote:

Hi.

I first swapped the Intel 1000/PRO server NIC into the next slot, and since then the machine seems to be 'stable'. Then, two days later, I changed the PSUs to 400W units. I think it's an IRQ routing problem, since we have had this problem (spontaneous reboots) from FreeBSD 4.0 on. Changing the slot for the NIC helped, but this is a very bad state.

I cannot remember the error message I got when the system crashed, but it looked like yours, and I always saw the amr0 text in that message. ACPI is not working on the old TYAN Thunder 2500 (S1867) main board. I also set machdep.cpu_idle_hlt = 0, but with no effect. At the moment I do not dare swap the NIC again, because the machine is in a preliminary production state.

I also noticed some weird things when creating and deleting files at the time the system crashed. Crashes could always be forced by accessing Samba services from a PC. Crashes always occurred under heavy I/O, but I think this could also be evidence of an IRQ problem. I do not know. With the NIC in the crash-causing slot, the machine was 'stable' for a whole night, but whenever our department 'got started' in the morning and heavy I/O was done, the machine froze. This changed when I moved the NIC to another slot, and now I also have two 400W PSUs.

FreeBSD 5.1-p2 on the TYAN S1867 seems to have many more problems, but I do not know whether I should mention them here. truss, for instance, crashes. We use afbackup for backups, but afbackup dumps core on this machine, while it does not on a UP machine that also runs FreeBSD 5.1-p2. It also crashes with a UP kernel on this machine.

I tried to 'truss' an afrestore call, but I had to start the tracing three or four times because I got this error the first time:

    truss: PIOCWAIT: Input/output error

or something like this:

    root: /usr/local/samba/lib: truss -fae -o /tmp/afrestore afrestore -v -p /usr/homes/kurs* -C /
    truss: PIOCWAIT top of loop: Input/output error
    truss: PIOCWAIT top of loop: Input/output error
    truss: PIOCWAIT top of loop: Input/output error
    truss: PIOCWAIT top of loop: Input/output error

Sometimes truss stops because a /proc/PID-XXX/mem file is missing. Calling it several more times 'solves' the problem.

While writing this, I crashed the system with the command shown above. This is the error message from the kernel when the system froze (I wrote it down from the screen):

    Fatal trap 12: page fault while in kernel mode
    cpuid = 1; lapic.id =
    fault virtual address   = 0x24
    fault code              = supervisor read, page not present
    instruction pointer     = 0x8:0xc01b29db
    stack pointer           = 0x10:0xe8ff3b70
    frame pointer           = 0x10:0xe8ff3b84
    code segment            = base 0x0, limit 0xf, type 0x1b
                            = DPL 0, pres 1, def 32, gran 1
    processor eflags        = interrupt enabled, resume, IOPL = 0
    current process         = 27510 (bunzip2)
    trap number             = 12
    panic: page fault
    cpuid = 1, lapic.id =
    boot() called on cpu#1

    syncing disks, buffers remaining ...
    panic: absolutely cannot call smp_ipi_shutdown with interrupts already disabled
    cpuid = 1; lapic.id =
    boot() called on cpu#1
    Uptime 1d20h18m55s
    pfs_vncache_unload(): 6 entries remaining

    Fatal double fault:
    eip = 0xc03134ic
    esp = 0xe8ff1ff8
    ebp = 0xe8ff2014
    cpuid = 1, lapic.id =
    panic: double fault
    cpuid = 1, lapic.id =
    boot() called on cpu#1
    Uptime: 1d20h18m55s
    pfs_vncache_unload(): 6 entries remaining

After this, the machine was dead.

:Hi,
:
:I've got nearly the same setup in a Dell 1600SC with a gig of RAM and a PERC4/Sc (LSI MegaRAID) card.
:
:Dual 2.4GHz Xeon P4 HT CPUs, and I've discovered I can lock up FreeBSD 5.1-RELEASE-p2 on command
:simply by running something to quickly create and remove a directory, i.e.:
:
:    perl -e 'for(my $i = 0 ; $i ; $i++){ mkdir(abc); rmdir(abc); }'
:
:Having machdep.cpu_idle_hlt = 0 makes no difference.
:
:Kernel:
:    FreeBSD 5.1-RELEASE-p2 FreeBSD 5.1-RELEASE-p2 #0: Mon Aug 11 21:40:47 MDT 2003 i386
:
:Raid:
:    amr0: LSILogic MegaRAID mem 0xfcd0-0xfcd0 irq 3 at device 2.0 on pci1
:    amrd0: LSILogic MegaRAID logical drive on amr0
:    amrd0: 34556MB (70770688 sectors) RAID 5 (optimal)
:
:I suspect that your problems and mine are more related to the amr driver and may be exposing
:some other problem within the kernel's fs locking. I don't think (as others have suggested) that
:your issue is power related, or related to the combination of hardware you're using. (Other than
:the fact that you've got a MegaRAID card).
:
:The exact crash message I'm seeing is:
:
:panic: lockmgr: locking against myself
:cpuid = 0; lapic.id
:boot() called on cpu#0
:
:syncing disks, buffers remaining... panic: ffs_copyonwrite: recursive call
:cpuid = 0; lapic.id
:boot()
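The quoted Perl one-liner appears to have lost its loop bound in transit (as shown, the condition `$i` is false on the first pass). As a minimal sketch of the same create-and-remove stress loop, the shell function below takes an explicit iteration count. The function name `churn_dir`, the directory name, and the count of 1000 are all assumptions for illustration, not values from the original report.

```sh
#!/bin/sh
# Repeatedly create and remove one directory, in the spirit of the quoted
# Perl one-liner. The iteration count is an explicit argument because the
# original loop bound is unknown.
churn_dir() {
    dir=$1      # directory to create and remove (hypothetical name)
    count=$2    # number of mkdir/rmdir rounds (assumed value)
    i=0
    while [ "$i" -lt "$count" ]; do
        mkdir "$dir" || return 1
        rmdir "$dir" || return 1
        i=$((i + 1))
    done
    echo "$i"   # report how many rounds completed
}

churn_dir "${TMPDIR:-/tmp}/churn.$$" 1000
```

To exercise the RAID volume rather than a memory-backed or different file system, the directory argument would need to point at a path on the amrd0 volume.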
Re: (2) 5.1-R-p2 crashes on SMP with AMI RAID and Intel 1000/Pro
Related to the em driver: the 82540M has not worked since sometime around 5.1-BETA. I filed a PR on that a few months ago, but it seems the fault might be with PCI IRQ routing, not the em driver itself.

Pete
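The trap message reported in this thread gives an instruction pointer (0x8:0xc01b29db) and a small fault address (0x24), which often points to a NULL structure pointer plus a field offset. One way to see which kernel function that instruction pointer falls in is to search a sorted symbol table. The sketch below assumes `nm -n` style input with fixed-width lowercase hex addresses; the function name `nearest_symbol` and the symbol names in the demo table are made up for illustration.

```sh
#!/bin/sh
# Print the name of the last symbol whose address is <= the target address.
# stdin: "address type name" lines sorted by address, as `nm -n` prints them.
# Assumes all addresses have the same width, so string order equals numeric order.
nearest_symbol() {
    # Normalise the target: drop a leading 0x and lowercase the hex digits.
    target=$(printf '%s' "$1" | sed 's/^0[xX]//' | tr 'A-F' 'a-f')
    awk -v t="$target" '
        # Appending "" forces a string comparison even for all-digit addresses.
        NF == 3 && length($1) == length(t) && ($1 "") <= (t "") { name = $3 }
        END { if (name != "") print name }
    '
}

# Demo with a made-up symbol table; on the real machine the input would be
# something like: nm -n kernel.debug | nearest_symbol 0xc01b29db
nearest_symbol 0xc01b29db <<'EOF'
c01b2000 T example_func_a
c01b2900 T example_func_b
c01b3000 T example_func_c
EOF
```

With the made-up table this selects `example_func_b`, the last entry at or below the target. A debugger session against kernel.debug gives the same answer with source lines, so this is only a quick first look.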