On Fri, 2006-08-25 at 19:54 -0700, Piet Delaney wrote:
> I just noticed kgdb for 2.6.12 not stopping CPU's. This could be
> aggravated by the fact that I disabled optimization for the complete
> kernel. I've been getting NMI problems for a while and I suspect that
> compiling -O0 just aggravated the problem:
Looks like I found the land mine. It seems kgdb_handle_exception() has
a count for the number of times the trapped CPU should loop waiting for
the other CPU's to stop. Apparently this constant is too small for our
system, resulting in the gdb message about CPU not being stopped.
// #define ROUNDUP_WAIT 64000
// #define ROUNDUP_WAIT 640000
#define ROUNDUP_WAIT 64000000
I'm currently using a value 1000 times larger so that printk's in the
other threads can complete. 64000 appears to be a marginal
value, and compiling the kernel -O0 seems to push us over the
edge. I see little downside in making it larger. It looks like
the 2.6.16 patch has 640000, so it's only a bug in the older
patches like the one we use in 2.6.12.
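To make the failure mode concrete, here is a minimal userspace sketch of a
bounded roundup wait. It is not the kgdb source: the names
sketch_procindebug, sketch_wait_for_cpus, SKETCH_NR_CPUS and
SKETCH_ROUNDUP_WAIT are made up for illustration, and only the constant's
value comes from the defines above. The point is that the trapped CPU
gives up after a fixed number of passes, so a count that is too small
reports CPUs as not stopped even though they are about to check in.

```c
#include <stdatomic.h>

/* Hypothetical names; only the constant's value is taken from this mail. */
#define SKETCH_NR_CPUS      2
#define SKETCH_ROUNDUP_WAIT 64000000UL

/* One flag per CPU, set by that CPU once it has parked in the debugger. */
atomic_int sketch_procindebug[SKETCH_NR_CPUS];

/*
 * Poll the flags of all CPUs other than 'self'. At least one pass is
 * always made; polling stops as soon as every other CPU has checked in
 * or after max_spins passes. Returns how many CPUs were still missing
 * on the last pass (0 means all of them stopped).
 */
int sketch_wait_for_cpus(int self, unsigned long max_spins)
{
    unsigned long spin = 0;
    int missing;

    do {
        missing = 0;
        for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
            if (cpu != self && !atomic_load(&sketch_procindebug[cpu]))
                missing++;
        }
    } while (missing != 0 && ++spin < max_spins);

    return missing;
}
```

A caller would pass SKETCH_ROUNDUP_WAIT as max_spins and treat a nonzero
return as "some CPU was not stopped". Slower code, such as a -O0 build
with extra printk's, needs more passes before the other CPU sets its flag.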
I got rid of the "NMI received for unknown reason" messages.
I think the cause was either my ACPI configuration or not having
cmd DIE_NMI covered in kgdb_notify(). I noticed that
the 2.6.16 version handles cmd DIE_NMI together with DIE_NMI_IPI,
so I tried adding it to our 2.6.13 environment since
DIE_NMI exists there.
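The shape of that change can be sketched as follows. This is a standalone
illustration, not the kgdb patch: the sketch_ enums stand in for the
kernel's die-notifier commands and notifier return codes, their values are
invented, and the debugger_active parameter is my assumption about what
the handler checks before claiming the NMI.

```c
/* Hypothetical stand-ins for the kernel's die-notifier commands and
 * notifier return codes; names and values are illustrative only. */
enum sketch_die_cmd {
    SKETCH_DIE_OOPS = 1,
    SKETCH_DIE_INT3,
    SKETCH_DIE_NMI,
    SKETCH_DIE_NMI_IPI
};

enum sketch_notify_ret {
    SKETCH_NOTIFY_DONE = 0,   /* not ours, let other handlers look at it */
    SKETCH_NOTIFY_STOP = 1    /* handled, stop walking the notifier chain */
};

/*
 * Treat DIE_NMI the same way as DIE_NMI_IPI: if the debugger is rounding
 * up CPUs, the NMI belongs to it and is consumed here, so it never
 * reaches the "unknown reason" path. Otherwise it is passed along.
 */
int sketch_kgdb_notify(enum sketch_die_cmd cmd, int debugger_active)
{
    switch (cmd) {
    case SKETCH_DIE_NMI:        /* the additional case */
    case SKETCH_DIE_NMI_IPI:
        if (debugger_active)
            return SKETCH_NOTIFY_STOP;
        return SKETCH_NOTIFY_DONE;
    default:
        return SKETCH_NOTIFY_DONE;
    }
}
```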
In the kgdb_SMP_bug attachment is /var/log/messages with a
few kgdb printk's displaying the internals and its current,
correct behavior while running with two CPU's: first
hitting a breakpoint in tcp_sendmsg() a few times, followed
by a kernel panic caused by dereferencing a bogus pointer.
It's interesting how we hang at the end after detaching
from the panic'd kernel. It seems that this allows us to
attach again to look at the panic again with another
gdb session; seems reasonable. I tried attaching a few
times and it seems to work great.
-piet
> ------------------------------------------------------------------------
> Aug 26 02:02:46 localhost kernel: [ 721.299293] Uhhuh. NMI received for unknown reason 31 on CPU 0.
> Aug 26 02:02:46 localhost kernel: [ 721.299299] Uhhuh. NMI received for unknown reason 00 on CPU 1.
> Aug 26 02:02:46 localhost kernel: [ 721.299306] Dazed and confused, but trying to continue
> Aug 26 02:02:46 localhost kernel: [ 721.299310] Do you have a strange power saving mode enabled?
> --------------------------------------------------------------------------
>
> Anyone know what I have configured wrong that would cause this?
> I'm attaching my .config file.
>
> I noticed:
>
> "Using IPI Shortcut mode"
>
> in /var/log/messages but can't find it with cscope; likely
> from a different kernel.
>
> I think I'm running with the BIOS disabling Hyper-Threading but
> have it configured in the kernel.
>
> -piet
>
>
>
> > Dave, Stephen, et al.:
> >
> > Shouldn't the tcp timers have LIST_POISON1 and LIST_POISON2
> > in their list heads when we drop the last reference to a sk?
> >
> > We noticed our tcp timers have unexpected values in them
> > and are wondering how to explain it.
> >
> > Attached is a copy of a tcp_sock just prior to our freeing it,
> > as seen by kgdb on our 2.6.12 with tcp modified to support being
> > a proxy.
> >
> > sk->sk_timer looks as expected with LIST_POISON in both list
> > head pointers.
> >
> > The retransmit_timer on the other hand appears to have valid
> > pointers in it, so I'm wondering if we have a timer reference
> > count problem.
> >
> > The tcp_sock is zero'd out on allocation, so I doubt it's
> > just stale pointers from a previous incarnation of a tcp_sock.
> >
> > The delack_timer also has absurd values in it.
> >
> > I'll add some debug code to try to understand this; if
> > you have some thoughts on this, it might save me some time
> > trying to understand it.
> >
> > I was wondering if kgdb can pick up stale data by not
> > flushing the cpu caches when you hit a panic.
> >
> > My vger mailing list stopped on 11-Aug-2006 at 9:30pm;
> > so far I haven't found a reason. Heard of anything new
> > like ECN being required? I suspect our mailstreet feed,
> > but they say nothing changed on their systems. I'll try adding
> > myself back to a list and see what happens.
> >
> > -piet
> >
> >
--
Piet Delaney
BlueLane Teck
W: (408) 200-5256; [EMAIL PROTECTED]
H: (408) 243-8872; [EMAIL PROTECTED]
[ 725.757818] kgdb_notify: Calling kgdb_handle_exception();
[ 725.793329] kgdb_handle_exception: debugger_step:0, kgdb_contthread:00000000
[ 725.839648] kgdb_roundup_cpus(flags:2): Calling send_IPI_allbutself(APIC_DM_NMI); /* linux/arch/i386/kernel/kgdb.c */
[ 725.909419] kgdb_handle_exception: procindebug[smp_processor_id():0] = 1; /* Wait for other CPU's
[ 725.909425] kgdb_nmihook(cpu:1, regs:c218ff18) {
[ 725.909430] kgdb_wait(regs:c218ff18): procindebug[processor:1]:0 = 1;
[ 731.138353] kgdb_wait: procindebug[processor:1]:1 = 0; /* We are done */
[ 731.182407] kgdb_nmihook: cpu:1, return; }
[ 731.184381] kgdb_notify: return NOTIFY_STOP
[ 731.184384]
[ 731.184385]
[ 743.508286] kgdb_notify: Calling kgdb_handle_exception();
[ 743.543779] kgdb_handle_exception: debugger_step:1, kgdb_contthread:00000000
[ 743.590099] kgdb_roundup_cpus(flags:2): Calling send_IPI_allbutself(APIC_DM_NMI); /* linux/arch/i386/kernel/kgdb.c */
[ 743.659870] kgdb_handle_exception: procindebug[smp_processor_id():0] = 1; /* Wait for other CPU's
[ 743.659878] kgdb_nmihook(cpu:1, regs:c218ff18) {
[ 743.659884] kgdb_wait(regs:c218ff18): procindebug[processor:1]:0 = 1;
[ 761.131598] kgdb_wait: procindebug[processor:1]:1 = 0; /* We are done */
[ 761.175638] kgdb_nmihook: cpu:1, return; }
[ 761.177611] kgdb_notify: return NOTIFY_STOP
[ 761.177614]
[ 761.177615]
[ 764.306001] kgdb_notify: Calling kgdb_handle_exception();
[ 764.348913] kgdb_handle_exception: debugger_step:1, kgdb_contthread:00000000
[ 764.403287] kgdb_roundup_cpus(flags:2): Calling send_IPI_allbutself(APIC_DM_NMI); /* linux/arch/i386/kernel/kgdb.c */
[ 764.481690] kgdb_handle_exception: procindebug[smp_processor_id():1] = 1; /* Wait for other CPU's
[ 764.481696] kgdb_nmihook(cpu:0, regs:c08a3f40) {
[ 764.481701] kgdb_wait(regs:c08a3f40): procindebug[processor:0]:0 = 1;
[ 769.508423] kgdb_notify: return NOTIFY_STOP
[ 769.543335]
[ 769.560493]
[ 769.577706] kgdb_notify: Calling kgdb_handle_exception();
[ 769.620596] kgdb_handle_exception: debugger_step:1, kgdb_contthread:c2558540
[ 769.674923] kgdb_handle_exception: procindebug[smp_processor_id():1] = 1; /* Wait for other CPU's
[ 769.743774] kgdb_wait: procindebug[processor:0]:1 = 0; /* We are done */
[ 769.787846] kgdb_nmihook: cpu:0, return; }
[ 769.789845] kgdb_notify: return NOTIFY_STOP
[ 769.789849]
[ 769.789850]
[ 769.884367] kgdb_notify: Calling kgdb_handle_exception();
[ 769.927286] kgdb_handle_exception: debugger_step:1, kgdb_contthread:00000000
[ 769.981613] kgdb_roundup_cpus(flags:2): Calling send_IPI_allbutself(APIC_DM_NMI); /* linux/arch/i386/kernel/kgdb.c */
[ 770.059964] kgdb_handle_exception: procindebug[smp_processor_id():0] = 1; /* Wait for other CPU's
[ 770.059971] kgdb_nmihook(cpu:1, regs:c218ff18) {
[ 770.059978] kgdb_wait(regs:c218ff18): procindebug[processor:1]:0 = 1;
[ 771.774212] kgdb_notify: return NOTIFY_STOP
[ 771.809116]
[ 771.826272]
[ 771.843500] kgdb_notify: Calling kgdb_handle_exception();
[ 771.886426] kgdb_handle_exception: debugger_step:1, kgdb_contthread:e7267a60
[ 771.940755] kgdb_handle_exception: procindebug[smp_processor_id():0] = 1; /* Wait for other CPU's
[ 772.008590] kgdb_wait: procindebug[processor:1]:1 = 0; /* We are done */
[ 772.052638] kgdb_nmihook: cpu:1, return; }
[ 772.054611] kgdb_notify: return NOTIFY_STOP
[ 772.054613]
[ 772.054615]
[ 772.150359] kgdb_notify: Calling kgdb_handle_exception();
[ 772.193281] kgdb_handle_exception: debugger_step:1, kgdb_contthread:00000000
[ 772.247654] kgdb_roundup_cpus(flags:2): Calling send_IPI_allbutself(APIC_DM_NMI); /* linux/arch/i386/kernel/kgdb.c */
[ 772.326004] kgdb_handle_exception: procindebug[smp_processor_id():1] = 1; /* Wait for other CPU's
[ 772.326011] kgdb_nmihook(cpu:0, regs:c08a3f40) {
[ 772.326016] kgdb_wait(regs:c08a3f40): procindebug[processor:0]:0 = 1;
[ 778.039648] kgdb_wait: procindebug[processor:0]:1 = 0; /* We are done */
[ 778.083694] kgdb_nmihook: cpu:0, return; }
[ 778.085693] kgdb_notify: return NOTIFY_STOP
[ 778.085696]
[ 778.085697]
[ 819.767910] kgdb_notify: Calling kgdb_handle_exception();
[ 819.803401] kgdb_handle_exception: debugger_step:1, kgdb_contthread:00000000
[ 819.849770] kgdb_roundup_cpus(flags:202): Calling send_IPI_allbutself(APIC_DM_NMI); /* linux/arch/i386/kernel/kgdb.c */
[ 819.920685] kgdb_handle_exception: procindebug[smp_processor_id():1] = 1; /* Wait for other CPU's
[ 819.920689] kgdb_nmihook(cpu:0, regs:c08a3f40) {
[ 819.920694] kgdb_wait(regs:c08a3f40): procindebug[processor:0]:0 = 1;
[ 862.254780] kgdb_wait: procindebug[processor:0]:1 = 0; /* We are done */
[ 862.298833] kgdb_nmihook: cpu:0, return; }
[ 862.300836] kgdb_notify: return NOTIFY_STOP
[ 862.300839]
[ 862.300840]
[ 862.300844] kgdb_notify: Calling kgdb_handle_exception();
[ 862.300848] kgdb_handle_exception: debugger_step:1, kgdb_contthread:00000000
[ 862.300851] kgdb_roundup_cpus(flags:202): Calling send_IPI_allbutself(APIC_DM_NMI); /* linux/arch/i386/kernel/kgdb.c */
[ 862.300855] kgdb_handle_exception: procindebug[smp_processor_id():1] = 1; /* Wait for other CPU's
[ 862.583640] kgdb_nmihook(cpu:0, regs:c08a3f40) {
[ 862.613944] kgdb_wait(regs:c08a3f40): procindebug[processor:0]:0 = 1;
_______________________________________________
Kgdb-bugreport mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport