Re: OS bug in taskq

2007-12-17 Thread Robert Watson


On Sat, 15 Dec 2007, Elliot Finley wrote:

in the kernel and I'm still unable to obtain a crash dump.  Hopefully there 
is enough info in this email for a hacker to point me in the right direction 
to debug this.


If you're unable to obtain a crash dump, you should still be able to use 
interactive console-based debugging with DDB.  I find this easiest to do 
with a serial console from an adjacent machine, so that I can copy and paste 
the results into an e-mail rather than transcribe them by hand.  You can also 
use FireWire consoles to the same effect, although I've never done that.
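Here is a minimal sketch of the serial-console setup described above. It assumes both machines run FreeBSD 6.x and the null-modem cable is on the first serial port. The device name /dev/cuad0 and the 9600 bps speed are assumptions. Note that the first command replaces any existing /boot.config.

```sh
# On the machine being debugged: route the console to the first serial port.
# Note: this overwrites any existing /boot.config.
echo '-h' > /boot.config                          # boot blocks use the serial console
echo 'console="comconsole"' >> /boot/loader.conf  # loader (and kernel) use it too

# On the adjacent machine, connected with a null-modem cable.
# /dev/cuad0 and 9600 bps are assumptions; adjust to your port and speed.
cu -l /dev/cuad0 -s 9600
```

After a reboot, a panic that drops into DDB should then appear in the cu session, and the output can be copied from there straight into a message.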


Once the system panics, it will drop into DDB.  I usually start debugging 
with a backtrace ("bt"), then show the status of the current processor and 
then of all processors ("show pcpu", "show allpcpu").  Depending on the type 
of bug, I also find the output of "ps", "alltrace", "show lockedvnods", 
"show alllocks", "show uma", and "show malloc" quite useful.  The panic 
below is a NULL pointer dereference in the taskqueue code, but it's likely 
triggered by a bug in a consumer of the task queue service rather than in 
the task queue code itself.  That means we'll need to identify which 
consumer that is.  That information should become visible by looking at 
the arguments in the DDB stack trace.  If not, we may need to work a little 
harder to get a dump, or set up serial or FireWire kgdb to inspect the live 
running system with a full debugger.


On the swap/dump question: to capture a saved kernel dump, you need enough 
room for the full dump on whatever partition /var/crash is on, and that 
partition must be writable.  Because dumps are normally written to swap 
partitions, running fsck before the dump is captured can overwrite parts 
of the dump if fsck uses a lot of memory (and so spills into swap).  Many 
systems have a separate /var, and /var is often small, so you may well be 
able to capture the dump by booting to single-user mode, manually fscking 
/var, mounting /var, and running savecore to write the dump into 
/var/crash.  You can also configure additional partitions purely as dump 
partitions rather than swap partitions.  One trick I've used previously is 
to add a disk temporarily just for dumping to, and to run dumpon manually 
on a partition on that disk (but not swapon).
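The two approaches above can be sketched as the following commands. This sketch assumes FreeBSD 6.x and an /etc/fstab entry for /var. The device names are hypothetical: /dev/amrd0s1b stands for the existing swap partition on the amr(4) RAID volume, and /dev/da1s1b for a partition on a temporarily added disk. Substitute your own.

```sh
# (a) Recover a dump already sitting on swap, before swap is re-enabled.
#     Boot to single-user mode ("boot -s" at the loader prompt); swap has
#     not been enabled yet at that point, so nothing overwrites the dump.
fsck -y /var                        # check only /var (looked up in /etc/fstab)
mount /var
savecore /var/crash /dev/amrd0s1b   # hypothetical swap/dump partition

# (b) Dump to a partition that is never used as swap.
#     /dev/da1s1b is hypothetical; leave it out of /etc/fstab so that
#     swapon never touches it, and enable it only as a dump device.
dumpon /dev/da1s1b
```

To keep (b) across reboots, you can point dumpdev in /etc/rc.conf at that partition rather than setting it to AUTO. That should have the partition enabled as the dump device at boot.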


Robert N M Watson
Computer Laboratory
University of Cambridge




Re: OS bug in taskq

2007-12-16 Thread Clifton Royston
On Sat, Dec 15, 2007 at 03:58:10PM -0800, Jeremy Chadwick wrote:
 On Sat, Dec 15, 2007 at 01:03:14PM -0700, Elliot Finley wrote:
  I have:
  dumpdev=AUTO
  in /etc/rc.conf and:
  ... 
  in the kernel and I'm still unable to obtain a crash dump.  Hopefully
  there is enough info in this email for a hacker to point me in the
  right direction to debug this.
 
 I can't help with the panic itself, but the reason for the inability to
 obtain a crash dump is mentioned in a thread I started in November:
 
 http://lists.freebsd.org/pipermail/freebsd-stable/2007-November/038069.html
 
 The explanation of the problem was documented best by Doug Barton in
 this thread (over at freebsd-rc@):
 
 http://lists.freebsd.org/pipermail/freebsd-rc/2007-November/001263.html
 
 Open PR:
 
 http://www.freebsd.org/cgi/query-pr.cgi?pr=118255

  Why does it work *sometimes* then?  Or was this particular problem
introduced more recently than the 6.2 branch?

  I have two apparently similarly configured systems running 6.2p8,
with identical hardware, identical apps, and identical load, and I have
at least attempted to configure them the same way.  Both have
/var/crash set up and dumpon enabled in rc.conf.  Both crashed in the
last week.  I got a dump on one, which I now need to analyze, but have
twice failed to get a dump on the other.  (Once this past week, once
the previous month.)

  -- Clifton

-- 
Clifton Royston  --  [EMAIL PROTECTED] / [EMAIL PROTECTED]
   President  - I and I Computing * http://www.iandicomputing.com/
 Custom programming, network design, systems and network consulting services
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]

___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]


OS bug in taskq

2007-12-15 Thread Elliot Finley
Hello,

After turning on TLS/SSL in Exim and installing Dovecot (with POP3S
and IMAPS), I've been getting a panic:

kernel trap 12 with interrupts disabled

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 06
fault virtual address   = 0x104
fault code              = supervisor read, page not present
instruction pointer     = 0x20:0xc06730cd
stack pointer           = 0x28:0xea1ddc90
frame pointer           = 0x28:0xea1ddc9c
code segment            = base 0x0, limit 0xf, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = resume, IOPL = 0
current process         = 5 (thread taskq)

It's always the same. Same cpuid, same pointers, etc...

I have:

dumpdev=AUTO

in /etc/rc.conf and:

options KDB
options DDB # debugging kernel

in the kernel and I'm still unable to obtain a crash dump.  Hopefully
there is enough info in this email for a hacker to point me in the
right direction to debug this.

dmesg:

Copyright (c) 1992-2007 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 6.2-RELEASE-p5 #1: Mon Nov 19 11:16:44 MST 2007
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/DDB-SMP
Timecounter i8254 frequency 1193182 Hz quality 0
CPU: Intel(R) Xeon(TM) CPU 2.80GHz (2793.20-MHz 686-class CPU)
  Origin = GenuineIntel  Id = 0xf4a  Stepping = 10
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x641d<SSE3,RSVD2,MON,DS_CPL,CNTX-ID,CX16,b14>
  AMD Features=0x2010<NX,LM>
  AMD Features2=0x1<LAHF>
  Logical CPUs per core: 2
real memory  = 3220963328 (3071 MB)
avail memory = 3150856192 (3004 MB)
ACPI APIC Table: DELL   PE BKC  
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
 cpu2 (AP): APIC ID:  6
 cpu3 (AP): APIC ID:  7
ioapic0: Changing APIC ID to 8
ioapic1: Changing APIC ID to 9
ioapic1: WARNING: intbase 32 != expected base 24
ioapic2: Changing APIC ID to 10
ioapic2: WARNING: intbase 64 != expected base 56
ioapic0 Version 2.0 irqs 0-23 on motherboard
ioapic1 Version 2.0 irqs 32-55 on motherboard
ioapic2 Version 2.0 irqs 64-87 on motherboard
kbd1 at kbdmux0
ath_hal: 0.9.17.2 (AR5210, AR5211, AR5212, RF5111, RF5112, RF2413, RF5413)
acpi0: DELL PE BKC on motherboard
acpi0: Power Button (fixed)
Timecounter ACPI-fast frequency 3579545 Hz quality 1000
acpi_timer0: 24-bit timer at 3.579545MHz port 0x808-0x80b on acpi0
cpu0: ACPI CPU on acpi0
cpu1: ACPI CPU on acpi0
cpu2: ACPI CPU on acpi0
cpu3: ACPI CPU on acpi0
pcib0: ACPI Host-PCI bridge port 0xcf8-0xcff on acpi0
pci0: ACPI PCI bus on pcib0
pcib1: ACPI PCI-PCI bridge at device 2.0 on pci0
pci1: ACPI PCI bus on pcib1
pcib2: ACPI PCI-PCI bridge at device 0.0 on pci1
pci2: ACPI PCI bus on pcib2
amr0: LSILogic MegaRAID 1.53 mem 0xd80f-0xd80f,0xdfdc-0xdfdf irq 46 at device 14.0 on pci2
amr0: delete logical drives supported by controller
amr0: LSILogic PERC 4e/Di Firmware 522A, BIOS H430, 256MB RAM
pcib3: ACPI PCI-PCI bridge at device 0.2 on pci1
pci3: ACPI PCI bus on pcib3
pcib4: ACPI PCI-PCI bridge at device 4.0 on pci0
pci4: ACPI PCI bus on pcib4
pcib5: ACPI PCI-PCI bridge at device 5.0 on pci0
pci5: ACPI PCI bus on pcib5
pcib6: ACPI PCI-PCI bridge at device 0.0 on pci5
pci6: ACPI PCI bus on pcib6
em0: Intel(R) PRO/1000 Network Connection Version - 6.2.9 port 0xecc0-0xecff mem 0xdfae-0xdfaf irq 64 at device 7.0 on pci6
em0: Ethernet address: 00:18:8b:34:70:50
pcib7: ACPI PCI-PCI bridge at device 0.2 on pci5
pci7: ACPI PCI bus on pcib7
em1: Intel(R) PRO/1000 Network Connection Version - 6.2.9 port 0xdcc0-0xdcff mem 0xdf8e-0xdf8f irq 65 at device 8.0 on pci7
em1: Ethernet address: 00:18:8b:34:70:51
pcib8: ACPI PCI-PCI bridge at device 6.0 on pci0
pci8: ACPI PCI bus on pcib8
uhci0: Intel 82801EB (ICH5) USB controller USB-A port 0xbce0-0xbcff irq 16 at device 29.0 on pci0
uhci0: [GIANT-LOCKED]
usb0: Intel 82801EB (ICH5) USB controller USB-A on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: Intel 82801EB (ICH5) USB controller USB-B port 0xbcc0-0xbcdf irq 19 at device 29.1 on pci0
uhci1: [GIANT-LOCKED]
usb1: Intel 82801EB (ICH5) USB controller USB-B on uhci1
usb1: USB revision 1.0
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhci2: Intel 82801EB (ICH5) USB controller USB-C port 0xbca0-0xbcbf irq 18 at device 29.2 on pci0
uhci2: [GIANT-LOCKED]
usb2: Intel 82801EB (ICH5) USB controller USB-C on uhci2
usb2: USB revision 1.0
uhub2: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 2 ports with 2 removable, self powered
ehci0: Intel 82801EB/R (ICH5) USB 2.0 controller mem

Re: OS bug in taskq

2007-12-15 Thread Jeremy Chadwick
On Sat, Dec 15, 2007 at 01:03:14PM -0700, Elliot Finley wrote:
 I have:
 dumpdev=AUTO
 in /etc/rc.conf and:
 ... 
 in the kernel and I'm still unable to obtain a crash dump.  Hopefully
 there is enough info in this email for a hacker to point me in the
 right direction to debug this.

I can't help with the panic itself, but the reason for the inability to
obtain a crash dump is mentioned in a thread I started in November:

http://lists.freebsd.org/pipermail/freebsd-stable/2007-November/038069.html

The explanation of the problem was documented best by Doug Barton in
this thread (over at freebsd-rc@):

http://lists.freebsd.org/pipermail/freebsd-rc/2007-November/001263.html

Open PR:

http://www.freebsd.org/cgi/query-pr.cgi?pr=118255

-- 
| Jeremy Chadwick                       jdc at parodius.com |
| Parodius Networking   http://www.parodius.com/ |
| UNIX Systems Administrator  Mountain View, CA, USA |
| Making life hard for others since 1977.  PGP: 4BD6C0CB |



Re: OS bug in taskq

2007-12-15 Thread Elliot Finley
On Sat, 15 Dec 2007 15:58:10 -0800, you wrote:

On Sat, Dec 15, 2007 at 01:03:14PM -0700, Elliot Finley wrote:
 I have:
 dumpdev=AUTO
 in /etc/rc.conf and:
 ... 
 in the kernel and I'm still unable to obtain a crash dump.  Hopefully
 there is enough info in this email for a hacker to point me in the
 right direction to debug this.

I can't help with the panic itself, but the reason for the inability to
obtain a crash dump is mentioned in a thread I started in November:

http://lists.freebsd.org/pipermail/freebsd-stable/2007-November/038069.html

The explanation of the problem was documented best by Doug Barton in
this thread (over at freebsd-rc@):

http://lists.freebsd.org/pipermail/freebsd-rc/2007-November/001263.html

In this thread it states:

Short term fix is to disable swapping on the system long enough to get
the dump, then reboot with swapping turned back on.

How do I turn swapping off?  I don't think I can just not mount it,
because then it wouldn't exist for the dump.


Open PR:

http://www.freebsd.org/cgi/query-pr.cgi?pr=118255
