First, a little history on this problem.
We have an in-house (proprietary) product that does a lot of
threaded floating-point work. Since moving to RH6.0+ we have been
getting complete machine lockups (the box can't even be pinged) on our
systems running the processor in question. All the boxes have IDE disk
drives (they also have SCSI devices), so I upgraded to 2.2.14pre3, which
has some IDE SMP fixes, but the problem persists. Two of the three
affected systems have the following hardware:
Asus P2BD-S
256MB of RAM
Dual 350MHz PII
Dec Tulip
8GB Primary IDE master (boot disk)
2x4GB UW SCSI data disks. (Connected to on-board AIC 7880)
SCSI CDROM
And the third system has:
Dell Dual 266MHz PII
128MB of RAM
4GB Primary IDE master (boot disk)
2x4GB SCSI
One of our scientists has reduced the crashing part to a simple
threaded application (attached) that is built as follows:
gcc -o crash crash.c -lpthread
The application is run as ./crash N,
where N is the number of threads and buffers to allocate; each
buffer is roughly 15MB in size. He's identified three scenarios for
different values of N:
1) Pick N so that it exhausts virtual memory. This causes the kernel to
become completely unresponsive (I think this is a known issue w/ OOM on
Linux), so it can be ignored.
2) Pick N so that physical memory is exceeded but swap space is not.
The program will run, but slowly (on his machine, crash 10).
3) Pick N so that physical memory is not exceeded. The program will
run much more quickly (on his machine, crash 6 leads to a 94 megabyte
program which takes about 1/2 hour to run).
To quote directly from his message to me:
In cases 2 and 3, linux will sometimes halt. However, I have
managed to run case 3 to the end (on another run it stopped at
the second iteration, at another run somewhere around 100).
Putting more load on the system while case 2 or 3 is running
seems to lead to a crash. I can reliably get a crash by
starting one %./crash 6 and then starting %./crash 2 in
another window. Starting %./crash 6 then using numerical
python to allocate a 30 megabyte buffer also reliably leads
to a crash.
There are a couple of weird coding practices in the program, but I'm
forwarding it as I got it from him. Any help would be greatly
appreciated.
- Paul
crash.c