Piet,

Your ROUNDUP_WAIT value is a good find. The log is also interesting.
Plan to write an "Undocumented KGDB"? :-D

-Amit

On Tuesday 29 August 2006 09:08, Piet Delaney wrote:
> On Fri, 2006-08-25 at 19:54 -0700, Piet Delaney wrote:
> > I just noticed kgdb for 2.6.12 not stopping CPU's. This could be
> > aggravated by the fact that I disabled optimization for the complete
> > kernel. I've been getting NMI problems for a while and I suspect that
> > compiling -O0 just aggravated the problem:
>
> Looks like I found the land mine. It seems kgdb_handle_exception() has
> a count for the number of times the trapped CPU should loop waiting for
> the other CPU's to stop. Apparently this constant is too small for our
> system, resulting in the gdb message about a CPU not being stopped.
>
> // #define ROUNDUP_WAIT 64000
> // #define ROUNDUP_WAIT 640000
> #define ROUNDUP_WAIT 64000000
>
> I'm currently using a value 1000 times larger so that printk's in the
> other threads can complete. 64000 appears to be a marginal
> value and compiling the kernel -O0 seems to push us over the
> edge. I see little downside to making it larger. Looks like
> the 2.6.16 patch has 640000; so it's only a bug for the older
> patches like the one we use in 2.6.12.
>
> I got rid of the "NMI received for unknown reason" messages.
> I think it was either my configuring ACPI or not having
> a few cmd DIE_NMI cases covered in kgdb_notify(). I noticed that
> the 2.6.16 version includes cmd DIE_NMI with DIE_NMI_IPI;
> so I tried adding it to our 2.6.13 environment since
> DIE_NMI exists.
>
> In the kgdb_SMP_bug attachment is /var/log/messages with a
> few kgdb printk's displaying the internals and its current
> correct behavior while running with two CPU's: first
> hitting a breakpoint in tcp_sendmsg() a few times, followed
> by a kernel panic caused by dereferencing a bogus pointer.
>
> It's interesting how we hang at the end after detaching
> from the panic'd kernel. It seems that this allows us to
> attach again to look at the panic again with another
> gdb session; seems reasonable. I tried attaching a few
> times and it seems to work great.
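
For anyone following along, the roundup wait described above boils down to
a bounded spin loop. What follows is only a user-space sketch of that idea,
not the kgdb source: wait_for_roundup(), struct fake_cpu and fake_stopped()
are hypothetical names made up for illustration, and the "slow CPU" is
simulated by a poll counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: reports how many other CPUs have stopped so far. */
typedef unsigned (*stopped_count_fn)(void *ctx);

/*
 * Poll until at least `others` CPUs report stopped, or until `budget`
 * iterations have been spent. Returns true if the roundup completed.
 * `budget` plays the role that ROUNDUP_WAIT plays in the mail above.
 */
static bool wait_for_roundup(stopped_count_fn stopped, void *ctx,
                             unsigned others, unsigned long budget)
{
    assert(stopped != NULL);
    for (unsigned long i = 0; i < budget; i++) {
        if (stopped(ctx) >= others)
            return true;
    }
    /* One last look before giving up. */
    return stopped(ctx) >= others;
}

/* Simulated CPU that only parks after a fixed number of polls. */
struct fake_cpu {
    unsigned long polls_until_stop;
    unsigned long polls;
};

static unsigned fake_stopped(void *ctx)
{
    struct fake_cpu *cpu = ctx;
    cpu->polls++;
    return cpu->polls > cpu->polls_until_stop ? 1u : 0u;
}
```

With a simulated CPU that needs 100000 polls to park, a budget of 64000
gives up too early while 640000 succeeds. Anything that slows the other
CPU down (an -O0 build, printk's) has the same effect as raising
polls_until_stop, which would be consistent with a marginal budget
tipping over.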
>
> -piet
>
> > ------------------------------------------------------------------------
> > Aug 26 02:02:46 localhost kernel: [ 721.299293] Uhhuh. NMI received for unknown reason 31 on CPU 0.
> > Aug 26 02:02:46 localhost kernel: [ 721.299299] Uhhuh. NMI received for unknown reason 00 on CPU 1.
> > Aug 26 02:02:46 localhost kernel: [ 721.299306] Dazed and confused, but trying to continue
> > Aug 26 02:02:46 localhost kernel: [ 721.299310] Do you have a strange power saving mode enabled?
> > ------------------------------------------------------------------------
> >
> > Anyone know what I have configured wrong that would cause this?
> > I'm attaching my .config file.
> >
> > I noticed:
> >
> >     "Using IPI Shortcut mode"
> >
> > in /var/log/messages but can't find it with cscope; likely
> > from a different kernel.
> >
> > I think I'm running with the BIOS disabling Hyper-Threading but
> > have it configured in the kernel.
> >
> > -piet
> >
> > > Dave, Stephen, et al.:
> > >
> > > Shouldn't the tcp timers have LIST_POISON1 and LIST_POISON2
> > > in their list heads when we drop the last reference to a sk?
> > >
> > > We noticed our tcp timers have unexpected values in them
> > > and are wondering how to explain it.
> > >
> > > Attached is a copy of a tcp_sock just prior to our freeing it,
> > > as seen by kgdb on our 2.6.12 with tcp modified to support being
> > > a proxy.
> > >
> > > sk->sk_timer looks as expected with LIST_POISON in both list
> > > head pointers.
> > >
> > > The retransmit_timer on the other hand appears to have valid
> > > pointers in it, so I'm wondering if we have a timer reference
> > > count problem.
> > >
> > > The tcp_sock is zero'd out on allocation, so I doubt it's
> > > just stale pointers from a previous incarnation of a tcp_sock.
> > >
> > > The delack_timer also has absurd values in it.
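
As background for the LIST_POISON question, here is a small user-space
sketch of what list poisoning looks like. The two poison constants are the
values from include/linux/list.h in 2.6-era kernels as far as I recall;
list_entry_is_poisoned() is a hypothetical helper, not a kernel function.
The sketch only shows what a plain list_del() leaves behind. It makes no
claim about which delete path the tcp timers actually go through in 2.6.12.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Poison values as found in 2.6-era list.h (assumed from memory). */
#define LIST_POISON1 ((void *)0x00100100)
#define LIST_POISON2 ((void *)0x00200200)

struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert `entry` right after `head`. */
static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Unlink `entry` and poison its pointers so stale use faults loudly. */
static void list_del(struct list_head *entry)
{
    assert(entry->next != NULL && entry->prev != NULL);
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    entry->next = LIST_POISON1;
    entry->prev = LIST_POISON2;
}

/* Hypothetical debug helper: was this entry removed with list_del()? */
static bool list_entry_is_poisoned(const struct list_head *entry)
{
    return entry->next == LIST_POISON1 && entry->prev == LIST_POISON2;
}
```

A check along the lines of list_entry_is_poisoned() on each timer's list
entry, placed just before the sock is freed, is one way the debug code
mentioned in this thread could flag an entry that was never unlinked.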
> > >
> > > I'll add some debug code to try to understand this; if
> > > you have some thoughts on this, it might save me some time
> > > trying to understand it.
> > >
> > > I was wondering if kgdb can pick up stale data by not
> > > flushing the cpu caches when you hit a panic.
> > >
> > > My vger mailing list stopped on 11-Aug-2006 at 9:30pm;
> > > so far I haven't found a reason. Heard of anything new
> > > like ECN being required? I suspect our mailstreet feed,
> > > but they say nothing changed on their systems. I'll try adding
> > > myself back to a list and see what happens.
> > >
> > > -piet

_______________________________________________
Kgdb-bugreport mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport
