Hi!
I _also_ found the thread/float/NaN bug in GDB, but Kevin Buettner beat me to supplying the patch... I saw Kevin's message while I was preparing this one. How funny/strange that two people find the same bug at the same moment when it has been there for half a year...

But I am not writing only to tell you this; the bug gets fixed, and that's what matters. During my bug hunt I discovered more. I am pretty sure there is one more bug: maybe in GDB, maybe in GCC, or in the Linux kernel... Read on for what I found out. The addendum lists some pointers to problems which I think are related (found through Google). I hope my explanation can contribute to solving some "old mysteries", including some where GDB was never a suspect...

During 2001 I occasionally tried to use GDB snapshots to debug the Linux port of our real-time DSP software. The sole purpose of that port is to let us debug in a more "friendly environment". Several times I ran into strange problems with NaNs popping up all over the place when debugging code with floating-point operations. I did not bother to look into it. It behaved quite oddly: sometimes it worked, sometimes it didn't. It didn't seem easy to reproduce in a small test program, and I hoped it would be fixed by the final 5.1 release.

But now I tried 5.1, and it was still there. I also saw some posts on the bug-gdb mailing list describing the problem with a very small test program. I searched Google and found a lot of related stuff...

In the end, I set out to debug GDB _with_ GDB (compiled from source, with optimisation level -O0). Quite a daunting task requiring a certain degree of schizophrenia ;-) BTW, I did not succeed in attaching GDB to a running GDB (it gets stuck in poll()); I really had to run GDB inside GDB, which confused DDD a bit... Is not being able to attach a known problem?

To make a long story short, I found the same problem Kevin did, in the conversion of the ftag register (in i387-nat.c).
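
For readers who have not looked at the ftag register before, here is a rough sketch of what such a conversion involves. This is not the code from i387-nat.c and not the patch; the function names are made up for illustration. It assumes the FXSAVE layout, where the tag byte holds one bit per physical register (1 = in use, 0 = empty), while the classic tag word uses two bits per register (00 valid, 01 zero, 10 special, 11 empty). As a deliberate simplification, every in-use register is reported as "valid". If a conversion like this goes wrong and an in-use register ends up tagged as empty, that is one plausible way for NaNs to show up.

```c
#include <stdint.h>

/* Tag values in the full 16-bit x87 tag word (2 bits per register). */
enum { TAG_VALID = 0, TAG_ZERO = 1, TAG_SPECIAL = 2, TAG_EMPTY = 3 };

/* Hypothetical helper (name invented for this sketch): expand the 8-bit
   abridged tag byte into a full 16-bit tag word.
   Simplification: every in-use register becomes TAG_VALID; a faithful
   conversion would inspect the register contents to choose between
   valid, zero and special. */
uint16_t expand_ftag(uint8_t abridged)
{
    uint16_t full = 0;
    for (int i = 0; i < 8; i++) {
        unsigned tag = ((abridged >> i) & 1u) ? TAG_VALID : TAG_EMPTY;
        full |= (uint16_t)(tag << (2 * i));
    }
    return full;
}

/* Hypothetical helper: compress a full 16-bit tag word into the 8-bit
   abridged form. Any register whose tag is not "empty" counts as in use. */
uint8_t compress_ftag(uint16_t full)
{
    uint8_t abridged = 0;
    for (int i = 0; i < 8; i++) {
        unsigned tag = (full >> (2 * i)) & 3u;
        if (tag != TAG_EMPTY)
            abridged |= (uint8_t)(1u << i);
    }
    return abridged;
}
```

With these helpers, an all-empty FPU is 0x00 in the abridged form and 0xFFFF in the full form, and compress_ftag(expand_ftag(x)) gives back x for any 8-bit x.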

But when I recompiled GDB with the default optimisation level -O2 and ran it on another "production" machine (vanilla Mandrake 8.1, Pentium II), I got NaNs again! The same GDB on an identically configured Mandrake 8.1 machine (Pentium III): no problem... So it is CPU related. But how?!

The default Mandrake 8.1 compiler is the "controversial" gcc 2.96 (20000731). I decided to try gcc 3.0.1 (also supplied by MDK) to compile GDB with -O2. Now the problem was gone. This looks very much like a CPU-dependent gcc 2.96 optimisation issue. I can't think what that could be, but it is certainly what I see... Or is it some incorrect assumption in the GDB sources about volatile variables or aliased pointers that gets broken by GCC's optimisation? (I vaguely remember some issues with the -fstrict-aliasing switch.)

I am also thinking of the last two lines in the test matrix supplied by Emmanuel Blindauer at http://manu.agat.net/bug.html: two identical Debian Unstable setups, except for the kernel (2.4.13 vs. 2.4.14) and the CPU (K6-2 vs. Athlon), but only one (the Athlon) shows the problem... Does this mean that gcc 2.95.4 (the default compiler for Debian Unstable) has the same problem as gcc 2.96? Can someone with Debian Unstable check this out? And I am still wondering how one can explain that the bug disappears when booting Linux 2.2 instead of 2.4.

Anyway, there seem to be at least 2 bugs: one just found and fixed in GDB, and another that shows up when using gcc 2.96 but seems related to the CPU type and the Linux kernel version...

For me, the NaN problems are completely fixed by applying the GDB patch and compiling GDB with gcc 3.0.1. I suggest people having the same problems try the same. I will report this to Mandrake too. I haven't been using RedHat recently, but there is a good chance they're bitten too...

Kudos to the makers of GDB (and Linux, GCC and the whole GNU/Linux/FreeSoftware/OpenSource community for that matter).
Stepping with GDB through GDB gives one an enormous sense of "standing on the shoulders of giants" :-)

Kind regards,
Bart

---
Addendum:

ONE
***
From: Blindauer ([EMAIL PROTECTED])
Subject: thread in linux 2.2 and 2.4 where is the difference?
Newsgroups: gnu.gdb.bug
Date: 2001-12-02 06:12:45 PST

Provides the 10-line C test program with the strtod() call that I used to reproduce and debug the problem.

TWO
***
From: Alexander Enchevich ([EMAIL PROTECTED])
Subject: gdb+pthreads+strtod=nan
Newsgroups: gnu.gdb.bug
Date: 2001-07-04 17:39:27 PST

Describes the same problem with a somewhat less minimal test program, using 2 threads and strtod.

THREE
*****
From: Ken Whaley ([EMAIL PROTECTED])
Subject: CVS: stepping over function returning float returns NaN with shared pthreads
Newsgroups: gnu.gdb.bug
Date: 2001-07-12 16:56:07 PST

GNATS GDB bug number 175 (see sources.redhat.com/gdb/bugs)
Synopsis: smp + pthreads + breakpoints in FP code = corrupt FP state on x86
Arrival-Date: Fri Jul 13 15:38:00 PDT 2001
Originator: [EMAIL PROTECTED]
Release: snapshot 2001-07-13

The problem was only seen on an SMP machine, not on a single-processor one. Maybe the problem is not SMP but the different type of processor (PII vs. PIII?!)...

FOUR
****
GNATS GDB bug number 178
Synopsis: Floating Point NaN on first fp operation when compiling with -lpthread
Arrival-Date: Mon Jul 23 12:48:01 PDT 2001
Originator: Marius VLAD (Digital Media Institute, Tampere Univ. of Technology, FINLAND)
From: [EMAIL PROTECTED]

Problem with a small program using printf("%f", ...)...

FIVE
****
From: Brendan Doherty ([EMAIL PROTECTED])
Subject: gdb 5.1 with shared libraries
Newsgroups: gnu.gdb.bug
Date: 2001-12-05 19:36:51 PST

3 problems. Problem 1 (Linux 2.4/libc 2.2.4/Debian Woody): the value of a double (in the static data section of the exe) gets changed when run through 5.1.

SIX
***
From: Rychard Bouwens ([EMAIL PROTECTED])
Subject: Possible bug in my microprocessor/memory
Newsgroups: comp.lang.asm.x86
Date: 2000/07/07

A thread about trying to find a memory overwrite. The overwrite is found, and the discussion ends with no answer to the question "Where was the overwrite? I would love to know how it related to the problem you saw." ([EMAIL PROTECTED]) The GDB bug might be the answer...

SEVEN
*****
From: Mikael Djurfeldt <[EMAIL PROTECTED]>
Subject: Really weird things happening in Guile/GDB
To: guile-devel mailing-list
Date: Wed, 19 Sep 2001 22:10:33 +0200

Also NaN stuff. The discussion focuses on possibly wrongly generated assembly, down to the level of looking up the hex codes of the fildll instruction in the Intel manual. It may be GDB again that makes people think an instruction is not working...

_______________________________________________
Bug-gdb mailing list
[EMAIL PROTECTED]
http://mail.gnu.org/mailman/listinfo/bug-gdb