Sweet!  I'm impressed that you found such an obscure bug.  Could you go
through some of the steps that you went through to do this debugging?  I
would suggest the following mentionables:
1) enabling kdb
Not that hard in the Mandrake kernel: just switch the kdb option in the .spec file from 0 to 1.

2) interpreting the data
I did a full backtrace of every process on the machine after it locked up. Once you're in the kdb console, you just type "ps" for the process list and then "bta" to backtrace all processes.

The problem was that there were NO xfs backtraces. However, one process was screwed up and could not be backtraced. That meant something in that process got totally hosed. It's also the process that hard locked the machine.

I pointed out that the load on the machine can be very low when the bug is triggered, and that it doesn't get triggered when there is no high memory available (even when running the highmem kernel).

In Linux, when HIGHMEM is turned on there is still "regular" (LOWMEM) memory. As long as xfs was only playing with the LOWMEM segment of memory, there were no lockups. But as soon as xfs used HIGHMEM (no matter the load on the machine), it locked hard.
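
For anyone who hasn't run into the LOWMEM/HIGHMEM split before, here is a rough userspace toy model of it in C. It is not the XFS code and not the patch SGI sent; every name in it (toy_page, toy_kmap, toy_kunmap, toy_zero_range) is made up for illustration. The only thing it tries to show is the general rule: a lowmem page is always addressable by the kernel, while a highmem page has to be mapped (kmap() in the real kernel) before anything like a memset may touch it. The real kernel does not hand back a polite error code in the unmapped case, so the -1 below just marks where things would go wrong.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Hypothetical stand-in for a kernel page; not the real struct page. */
struct toy_page {
    int is_highmem;                     /* 1 = page lives in high memory */
    int mapped;                         /* 1 = currently addressable     */
    unsigned char data[TOY_PAGE_SIZE];  /* page contents                 */
};

/* Fill the page with a 0xAA pattern; lowmem pages start out mapped. */
static void toy_page_init(struct toy_page *p, int is_highmem)
{
    assert(p != NULL);
    p->is_highmem = is_highmem;
    p->mapped = !is_highmem;
    memset(p->data, 0xAA, sizeof p->data);
}

/* Toy analogue of kmap(): make the page addressable, return its address. */
static unsigned char *toy_kmap(struct toy_page *p)
{
    assert(p != NULL);
    p->mapped = 1;
    return p->data;
}

/* Toy analogue of kunmap(): only highmem pages lose their mapping. */
static void toy_kunmap(struct toy_page *p)
{
    assert(p != NULL);
    if (p->is_highmem)
        p->mapped = 0;
}

/*
 * Zero len bytes starting at off. Returns 0 on success, -1 if the page
 * is not mapped or the range does not fit inside the page.
 */
static int toy_zero_range(struct toy_page *p, size_t off, size_t len)
{
    assert(p != NULL);
    if (!p->mapped)
        return -1;
    if (off > TOY_PAGE_SIZE || len > TOY_PAGE_SIZE - off)
        return -1;
    memset(p->data + off, 0, len);
    return 0;
}
```

In this model, zeroing a lowmem page always works, while zeroing a highmem page only works between toy_kmap() and toy_kunmap(). That matches what I saw: nothing went wrong until a highmem page was involved.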

3) What made you zero in on that particular memset (ties back to #2)
I ran SGI's pre3 of XFS 1.2 kernels and had no problems. After that I used their own brew of 1.1 with highmem and didn't have a problem either. After I told SGI, they sent me a one-line patch.

4) Documentation that says the __pb_block_prepare_write already kmap'ed
the page.  If the answer is "use the source Luke" then so be it
SGI read the source... and knows it... ;)

Mandrake must have pulled the xfs patches from SGI at exactly the wrong time.... :(


--
Bryan Whitehead
SysAdmin - JPL - Interferometry Systems and Technology
Phone: 818 354 2903
[EMAIL PROTECTED]



