I am not an expert at all at finding these bugs, but I can tell you an
anecdote about some code that I maintain.
This is a large application that would fail seemingly at random under a
fair amount of stress. The odd thing is that this app, which runs on
multiple OSes, seemed to work on RHEL 3, AS 2.1, Windows, etc., but on
RHEL 4 it pooped itself. Since this bug was hard to reproduce at first,
it was mostly ignored until someone found a reliable way of reproducing
it. Weeks went by trying to find this bug, which had all the symptoms of
either a memory overflow or a synchronization issue. Eventually I found
it while disassembling a 3rd party library. Here is what these
Einsteins had done:
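#include <sys/types.h>	/* u_int32_t */
#include <stdio.h>
#include <stdlib.h>
struct moo {
	u_int32_t a, b, c, d;
	char e[44];
};
int
main(int argc, char *argv[])
{
	struct moo m[2];
	struct moo *p;
	p = m;
	p->a = 0;
	p->b = 1;
	/* go to next element */
	/* BUG: pointer arithmetic is already scaled by the element size, so
	 * this skips sizeof(struct moo) elements instead of 1 (want p + 1) */
	p = p + sizeof(struct moo);
	p->a = 0;	/* writes far outside m[] */
	p->b = 1;
	return (0);
}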
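To spell out the arithmetic: p + n on a struct moo * moves n elements, not n bytes, so p + sizeof(struct moo) lands sizeof(struct moo) * sizeof(struct moo) bytes past the start of the array. The sketch below only computes the two offsets without touching any memory. The helper names are made up for illustration, and it uses uint32_t from <stdint.h> in place of u_int32_t.

```c
#include <stddef.h>
#include <stdint.h>

struct moo {
	uint32_t a, b, c, d;
	char e[44];
};

/* Bytes that the buggy "p + sizeof(struct moo)" moves from the array start. */
size_t
buggy_offset(void)
{
	return sizeof(struct moo) * sizeof(struct moo);
}

/* Bytes that the intended "p + 1" moves from the array start. */
size_t
correct_offset(void)
{
	return sizeof(struct moo);
}

/* Returns 1 if a whole element starting at byte offset off fits in an array of n. */
int
offset_in_bounds(size_t off, size_t n)
{
	return off + sizeof(struct moo) <= n * sizeof(struct moo);
}
```

For the two-element array in the example, offset_in_bounds(correct_offset(), 2) holds and offset_in_bounds(buggy_offset(), 2) does not. A write that far out can also jump clean over anything placed directly next to the buffer.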
So any tool that I threw at this died mysteriously with the same
failure. Valgrind was particularly funny: since it pretends to be an
OS, it pooped itself with the exact same failure as the program would.
The only value add was that it took longer for the code to fail,
because it slowed the app down by something like 1000%.
Eventually I found this gem after I wrote a custom efence-like app that
put guard pages, marked PROT_NONE, in front and back of each chunk of
allocated memory. I was simply lucky that it hit just right and gave
me proximity to where the code was failing. Interestingly enough, this
code also had 32-bit canaries (among an array of other features) in
front and back of the allocated memory chunk; however, those were never
overwritten.
The moral is that I used several tools, including expensive commercial
ones, to track down this bug. It was only with a custom rig that I was
able to get a proximity reading. Do not rely on tools to say: "it
passed because tool x didn't complain". Bugs can be very subtle or,
like this one, have symptoms that are very subtle and hard to track
down. There is no Swiss army knife when it comes to memory bugs.
After this was fixed, all kinds of other mysterious bugs that seemed
unrelated disappeared, including on all the other OSes. Magic :-)
FWIW,
/marco
Edd Barrett wrote:
Hello people,
I wish to query the usefulness (if that's not a made up word) of
electricfence on OpenBSD. I have a program which works great when not linked
against -lefence, but gives a bus error otherwise (not as a result of my
code, but in libpq according to a stack trace :O ).
A google search later, and I find this page (http://kerneltrap.org/node/5584)
in which Theo explains that the new malloc() does exactly what electric
fence does by default. So my questions are:
a) Why do we have a port of electric fence?
b) If my program runs fine on OpenBSD without -lefence can I assume that no
buffers have been over-run?
c) (off-topic) How are people checking for memory leaks these days on
OpenBSD? I took a quick look at gc-boehm, but haven't got it working as of
yet. How well does it work for you and what alternatives exist? It seems
most are using valgrind, but that's very linuxcentric if I understand
correctly.
Thanks for your time guys
Best Regards
Edd