Re: [Kgdb-bugreport] kgdb on 2.6.12 - NMI received for unknown reason and CPUs not stopping at panic/breakpoint.

Amit S. Kale Tue, 29 Aug 2006 04:37:23 -0700

Piet,

Your ROUNDUP_WAIT value is a good find. The log is also interesting. Plan to 
write an "Undocumented KGDB"? :-D
-Amit


On Tuesday 29 August 2006 09:08, Piet Delaney wrote:
> On Fri, 2006-08-25 at 19:54 -0700, Piet Delaney wrote:
> > I just noticed kgdb for 2.6.12 not stopping CPU's. This could be
> > aggravated by the fact that I disabled optimization for the complete
> > kernel. I've been getting NMI problems for a while and I suspect that
> > compiling -O0 just aggravated the problem:
>
> Looks like I found the land mine. It seems kgdb_handle_exception() has
> a count for the number of times the trapped CPU should loop waiting for
> the other CPU's to stop. Apparently this constant is too small for our
> system, resulting in the gdb message about CPU not being stopped.
>
>       // #define ROUNDUP_WAIT     64000
>       // #define ROUNDUP_WAIT     640000
>       #define ROUNDUP_WAIT        64000000
>
> I'm currently using a value 1000 times larger (64000000) so that
> printk's in the other threads can complete. 64000 appears to be a
> marginal value and compiling the kernel -O0 seems to push us over
> the edge. I see little downside to making it larger. Looks like
> the 2.6.16 patch has 640000; so it's only a bug for the older
> patches like the one we use in 2.6.12.
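
[Archive note, not part of the original message.] The bounded wait described above can be sketched in user-space C. This is only an illustration of the idea, not the kgdb source: `cpus_parked`, `cpu_checks_in()` and `wait_for_roundup()` are hypothetical names, and only the ROUNDUP_WAIT value comes from the message.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Value quoted in the message; the surrounding code is an assumption. */
#define ROUNDUP_WAIT 64000000UL

/* Hypothetical counter: how many other CPUs have acknowledged the stop. */
static atomic_int cpus_parked;

/* Called by each secondary CPU once it has entered its holding loop. */
void cpu_checks_in(void)
{
    atomic_fetch_add(&cpus_parked, 1);
}

/*
 * Spin for at most max_spins iterations waiting for `expected` CPUs.
 * Returns true if they all parked in time, false if the budget ran out
 * (the case that would produce the "CPU not stopped" report).
 */
bool wait_for_roundup(int expected, unsigned long max_spins)
{
    for (unsigned long i = 0; i < max_spins; i++) {
        if (atomic_load(&cpus_parked) >= expected)
            return true;
    }
    return atomic_load(&cpus_parked) >= expected;
}
```

A caller would pass ROUNDUP_WAIT as `max_spins`. Because the budget is an iteration count rather than a time, anything that changes how long the other CPUs need relative to one spin (such as building with -O0) changes how much margin a given constant provides.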
>
> I got rid of the "NMI received for unknown reason" messages.
> I think the cause was either my ACPI configuration or not
> having cmd DIE_NMI covered in kgdb_notify(). I noticed that
> the 2.6.16 version handles cmd DIE_NMI together with DIE_NMI_IPI,
> so I tried adding it to our 2.6.13 environment since
> DIE_NMI exists there.
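
[Archive note, not part of the original message.] A rough user-space model of the change described above: a notifier that claims both NMI event types instead of only one. All `sketch_` names are stand-ins invented for this illustration, and the rule that the handler claims an NMI only while the debugger is active is an assumption, not something the message states.

```c
#include <stdbool.h>

/* Stand-in event codes; the kernel defines its own die event values. */
enum sketch_die_cmd {
    SKETCH_DIE_NMI,
    SKETCH_DIE_NMI_IPI,
    SKETCH_DIE_OTHER
};

/* Stand-in results modelled on the notifier-chain convention. */
enum sketch_notify {
    SKETCH_NOTIFY_DONE, /* not handled, let the next handler look at it */
    SKETCH_NOTIFY_STOP  /* handled, stop walking the chain */
};

enum sketch_notify sketch_kgdb_notify(enum sketch_die_cmd cmd,
                                      bool debugger_active)
{
    switch (cmd) {
    case SKETCH_DIE_NMI:     /* the case reported as added */
    case SKETCH_DIE_NMI_IPI:
        /* Assumption: claim the NMI only during a CPU roundup. */
        return debugger_active ? SKETCH_NOTIFY_STOP : SKETCH_NOTIFY_DONE;
    default:
        return SKETCH_NOTIFY_DONE;
    }
}
```

In this model an NMI that no handler claims keeps propagating, which is the situation in which a kernel would report an NMI with an unknown reason.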
>
> In the kgdb_SMP_bug attachment is /var/log/messages with a
> few kgdb printk's displaying the internals and its now
> correct behavior while running with two CPUs: first
> hitting a breakpoint in tcp_sendmsg() a few times, followed
> by a kernel panic caused by dereferencing a bogus pointer.
>
> It's interesting how we hang at the end after detaching
> from the panic'd kernel. It seems that this allows us to
> attach again to look at the panic again with another
> gdb session; seems reasonable. I tried attaching a few
> times and it seems to work great.
>
> -piet
>
> > ------------------------------------------------------------------------
> > Aug 26 02:02:46 localhost kernel: [  721.299293] Uhhuh. NMI received for unknown reason 31 on CPU 0.
> > Aug 26 02:02:46 localhost kernel: [  721.299299] Uhhuh. NMI received for unknown reason 00 on CPU 1.
> > Aug 26 02:02:46 localhost kernel: [  721.299306] Dazed and confused, but trying to continue
> > Aug 26 02:02:46 localhost kernel: [  721.299310] Do you have a strange power saving mode enabled?
> > ------------------------------------------------------------------------
> >
> > Anyone know what I have configured wrong that would cause this?
> > I'm attaching my .config file.
> >
> > I noticed:
> >
> >     "Using IPI Shortcut mode"
> >
> > in /var/log/messages but can't find it with cscope; likely
> > from a different kernel.
> >
> > I think I'm running with Hyper-Threading disabled in the BIOS
> > but configured in the kernel.
> >
> > -piet
> >
> > > Dave, Stephen, et al.:
> > >
> > > Shouldn't the tcp timers have LIST_POISON1 and LIST_POISON2
> > > in their list heads when we drop the last reference to a sk?
> > >
> > > We noticed our tcp timers have unexpected values in them
> > > and are wondering how to explain it.
> > >
> > > Attached is a copy of a tcp_sock just prior to our freeing it,
> > > as seen by kgdb on our 2.6.12 with tcp modified to support being
> > > a proxy.
> > >
> > > sk->sk_timer looks as expected with LIST_POISON in both list
> > > head pointers.
> > >
> > > The retransmit_timer on the other hand appears to have valid
> > > pointers in it, so I'm wondering if we have a timer reference
> > > count problem.
> > >
> > > The tcp_sock is zero'd out on allocation, so I doubt it's
> > > just stale pointers from a previous incarnation of a tcp_sock.
> > >
> > > The delack_timer also has absurd values in it.
> > >
> > > I'll add some debug code to try to understand this; if
> > > you have some thoughts on this, it might save me some time
> > > trying to understand it.
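
[Archive note, not part of the original message.] The kind of debug check mentioned above could look roughly like the following user-space sketch. It assumes the timer is unlinked with a list_del()-style operation that writes the poison values into the entry; `sketch_list_del()` and `list_entry_is_poisoned()` are hypothetical helpers, and the poison constants are the ones I believe 2.6-era kernels used, so verify them against the tree in question.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal user-space copy of the kernel's doubly linked list node. */
struct list_head {
    struct list_head *next, *prev;
};

/* Poison values as I recall them from 2.6-era kernels (assumption). */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0x00100100)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0x00200200)

/* Unlink an entry and poison its pointers, in the style of list_del(). */
void sketch_list_del(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    entry->next = LIST_POISON1;
    entry->prev = LIST_POISON2;
}

/* True if the entry looks like it was removed with a poisoning delete. */
bool list_entry_is_poisoned(const struct list_head *entry)
{
    return entry->next == LIST_POISON1 && entry->prev == LIST_POISON2;
}
```

Running such a check on each timer's list entry just before the socket is freed would separate timers that were properly deleted from ones that still appear linked.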
> > >
> > > I was wondering if kgdb can pick up stale data by not
> > > flushing the cpu caches when you hit a panic.
> > >
> > > My vger mailing list stopped on 11-Aug-2006 at 9:30pm;
> > > so far I haven't found a reason. Heard of anything new
> > > like ECN being required? I suspect our mailstreet feed
> > > but they say nothing changed on their systems. I'll try adding
> > > myself back to a list and see what happens.
> > >
> > > -piet

