Re: Memory corruption in CURRENT
On Thu, Aug 22, 2002 at 09:43:45AM +0200, Martin Blapp wrote:
> I suspect all the SIG4 and SIG11 problems we see are due to memory
> corruption in CURRENT. In the latter case, the affected file looks like:
>
> case HASH('^', 'e'):
> case HASH('^', 'i'):
> case HASH('^' 'o'): \xc0
> case HASH('^', 'u'):
> %case HADH('`', \xc0A'):
> ^@ase HASH('`', 'E'):
> case HASH('`', 'I'):
> case HASH('`', 'O'):
> case HASH('`', 'U'):
>
> The file is correct after a reboot, so the corruption was limited to
> the copy cached in RAM. That's memory corruption. I'm also no longer
> able to make 10 buildworlds (without -j; using -j triggers panics in
> the pmap code). By the way, I've been experiencing this for about 4-5
> months. All hackers, please help to track this down.

Is it P4 specific or not?

Mark

--
Mark Santcroos    RIPE Network Coordination Centre
http://www.ripe.net/home/mark/    New Projects Group/TTM

To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
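The check Martin describes (the cached copy differs from what is on disk) can be made mechanical by comparing a known-good copy of a file against the suspect one byte by byte. The following is only a minimal sketch under that assumption; `report_corruption` is a hypothetical helper written for illustration, not something from the FreeBSD tree.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Compare a known-good buffer with a suspect one of the same length and
 * print every byte that differs. Returns the number of differing bytes.
 * Hypothetical helper for illustration only.
 */
static size_t
report_corruption(const unsigned char *good, const unsigned char *suspect,
    size_t len)
{
	size_t i, ndiff = 0;

	for (i = 0; i < len; i++) {
		if (good[i] != suspect[i]) {
			/* The XOR mask shows which bits changed in this byte. */
			printf("offset %zu: expected 0x%02x, got 0x%02x (xor 0x%02x)\n",
			    i, (unsigned)good[i], (unsigned)suspect[i],
			    (unsigned)(good[i] ^ suspect[i]));
			ndiff++;
		}
	}
	return (ndiff);
}
```

Run against the "HASH" versus "HADH" line quoted above, this would flag a single byte. The XOR column helps separate single-bit flips, which tend to point at RAM, from multi-bit damage like the one seen here.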
Re: Memory corruption in CURRENT
It seems Martin Blapp wrote:
> Hi all,
>
> I suspect all the SIG4 and SIG11 problems we see are due to memory
> corruption in CURRENT. The file is correct after a reboot, so the
> corruption was limited to the copy cached in RAM. That's memory
> corruption. I'm also no longer able to make 10 buildworlds (without -j;
> using -j triggers panics in the pmap code). By the way, I've been
> experiencing this for about 4-5 months. All hackers, please help to
> track this down.

Hmm, I haven't seen this at all. I've just started buildworld loops on two machines here, but I normally do at least a couple of buildworlds a day and I haven't noticed problems like the above (but plenty of bad commits etc).

However, this kind of problem in most cases spells bad HW to me, i.e. subspec RAM, poor power supply, badly cooled CPU, overclocking etc etc...

-Søren
Re: Memory corruption in CURRENT
Hi Soeren,

> However, this kind of problem in most cases spells bad HW to me, i.e.
> subspec RAM, poor power supply, badly cooled CPU, overclocking etc etc...

That's what I thought too. I now have three different systems which all show this:

1) PIV 1.6 GHz, Intel B845DG Board, 1GB Kingston RAM
2) PIV 2 GHz, Intel B845DG Board, 1GB Kingston ECC RAM
3) PIV 2.26 GHz, Asus P4B533 Board with I845 chipset, 1GB noname RAM

All running CURRENT. In 1) and 2) I also replaced the CPU and RAM. It happens on both SCSI and ATA disks. The power supply has been changed in all 3 systems. The problem is still the same.

The problem sometimes appears just after startup, when the CPU is still cold. Other times it builds 6 buildworlds successfully, and then suddenly I see a SIG4.

Martin
Re: Memory corruption in CURRENT
It seems Martin Blapp wrote:
> Hi Soeren,
>
>> However, this kind of problem in most cases spells bad HW to me, i.e.
>> subspec RAM, poor power supply, badly cooled CPU, overclocking etc etc...
>
> That's what I thought too. I now have three different systems which
> all show this:
>
> 1) PIV 1.6 GHz, Intel B845DG Board, 1GB Kingston RAM
> 2) PIV 2 GHz, Intel B845DG Board, 1GB Kingston ECC RAM
> 3) PIV 2.26 GHz, Asus P4B533 Board with I845 chipset, 1GB noname RAM
>
> All running CURRENT. In 1) and 2) I also replaced the CPU and RAM. It
> happens on both SCSI and ATA disks. The power supply has been changed
> in all 3 systems. The problem is still the same.
>
> The problem sometimes appears just after startup, when the CPU is still
> cold. Other times it builds 6 buildworlds successfully, and then
> suddenly I see a SIG4.

Hmm, that's probably a P4 problem then; I don't see it on any of my systems (P3 + K7).

I don't have a P4 here (and never will unless one is left at my doorstep), and I have no immediate ideas other than trying a MB that doesn't have an i845 chipset (the fewer Intel parts the better :) )

-Søren
Re: Memory corruption in CURRENT
On Thu, Aug 22, 2002 at 11:23:46 +0200, Martin Blapp wrote:
> That's what I thought too. I now have three different systems which
> all show this:
>
> 1) PIV 1.6 GHz, Intel B845DG Board, 1GB Kingston RAM
> 2) PIV 2 GHz, Intel B845DG Board, 1GB Kingston ECC RAM
> 3) PIV 2.26 GHz, Asus P4B533 Board with I845 chipset, 1GB noname RAM
>
> All running CURRENT. In 1) and 2) I also replaced the CPU and RAM. It
> happens on both SCSI and ATA disks. The power supply has been changed
> in all 3 systems. The problem is still the same.
>
> The problem sometimes appears just after startup, when the CPU is still
> cold. Other times it builds 6 buildworlds successfully, and then
> suddenly I see a SIG4.

Only a little addition from me: I had the same problems on -stable and they only disappeared after compiling the kernel without debugging. I had the impression that it has to do with the size of the kernel (but this of course may be wrong). After dropping -g from the kernel compile I have not had a problem again on -stable. (At the moment I do not have -current on a P-IV; the motherboard is a Fujitsu-Siemens.)

Best regards
--
Udo Schweigert, Siemens AG | Voice : +49 89 636 42170
CT IC CERT, Siemens CERT | Fax: +49 89 636 41166
D-81730 Muenchen / Germany | email : [EMAIL PROTECTED]
Re: Memory corruption in CURRENT
On 22 Aug, Mark Santcroos wrote:
> On Thu, Aug 22, 2002 at 09:43:45AM +0200, Martin Blapp wrote:
>> That's memory corruption. I'm also no longer able to make 10
>> buildworlds (without -j; using -j triggers panics in the pmap code).
>> By the way, I've been experiencing this for about 4-5 months. All
>> hackers, please help to track this down.
>
> Is it P4 specific or not?

Nope. I'm seeing it on an AMD Athlon XP 1900+.
Re: Memory corruption in CURRENT
Soeren Schmidt wrote:
> It seems Martin Blapp wrote:
>> I suspect all the SIG4 and SIG11 problems we see are due to memory
>> corruption in CURRENT. The file is correct after a reboot, so the
>> corruption was limited to the copy cached in RAM. That's memory
>> corruption. I'm also no longer able to make 10 buildworlds (without
>> -j; using -j triggers panics in the pmap code). By the way, I've been
>> experiencing this for about 4-5 months. All hackers, please help to
>> track this down.
>
> Hmm, I haven't seen this at all. I've just started buildworld loops on
> two machines here, but I normally do at least a couple of buildworlds
> a day and I haven't noticed problems like the above (but plenty of bad
> commits etc).
>
> However, this kind of problem in most cases spells bad HW to me, i.e.
> subspec RAM, poor power supply, badly cooled CPU, overclocking etc etc...

Try:

options DISABLE_PSE
options DISABLE_PG_G

-- Terry
Re: Memory corruption in CURRENT
On Thu, Aug 22, 2002 at 11:30:50AM +0200, Udo Schweigert wrote:
> Only a little addition from me: I had the same problems on -stable and
> they only disappeared after compiling the kernel without debugging. I
> had the impression that it has to do with the size of the kernel (but
> this of course may be wrong). After dropping -g from the kernel compile
> I have not had a problem again on -stable. (At the moment I do not have
> -current on a P-IV; the motherboard is a Fujitsu-Siemens.)

I will try that asap.

Mark

--
Mark Santcroos    RIPE Network Coordination Centre
http://www.ripe.net/home/mark/    New Projects Group/TTM
Re: Memory corruption in CURRENT
Martin Blapp wrote:
> I now have three different systems which all show this:
>
> 1) PIV 1.6 GHz, Intel B845DG Board, 1GB Kingston RAM
> 2) PIV 2 GHz, Intel B845DG Board, 1GB Kingston ECC RAM
> 3) PIV 2.26 GHz, Asus P4B533 Board with I845 chipset, 1GB noname RAM
>
> All running CURRENT. In 1) and 2) I also replaced the CPU and RAM. It
> happens on both SCSI and ATA disks. The power supply has been changed
> in all 3 systems. The problem is still the same.
>
> The problem sometimes appears just after startup, when the CPU is still
> cold. Other times it builds 6 buildworlds successfully, and then
> suddenly I see a SIG4.

Alternatively, rather than those options, try losing 512M of the RAM... I note they are all 1G boxes.

-- Terry
Re: Memory corruption in CURRENT
Hi Martin,

Since you have known about this problem for longer, have you already tried to make it more reproducible or to narrow it down? If not, we really should; that will be the first step in fixing it.

Mark

--
Mark Santcroos    RIPE Network Coordination Centre
http://www.ripe.net/home/mark/    New Projects Group/TTM
Re: Memory corruption in CURRENT
On Thu, Aug 22, 2002 at 02:38:25AM -0700, Terry Lambert wrote:
> Alternatively, rather than those options, try losing 512M of the RAM...
> I note they are all 1G boxes.

No, mine is 256MB.

Mark

--
Mark Santcroos    RIPE Network Coordination Centre
http://www.ripe.net/home/mark/    New Projects Group/TTM
Re: Memory corruption in CURRENT
On 22 Aug, Soeren Schmidt wrote:
> However, this kind of problem in most cases spells bad HW to me, i.e.
> subspec RAM, poor power supply, badly cooled CPU, overclocking etc etc...

My motherboard chipset supports ECC RAM and I have ECC RAM installed.

I upgraded to an expensive Antec power supply that has better specs than most of the other supplies I looked at. The system is plugged into a surge suppressor. I don't currently have a UPS.

I added an extra case fan and drive cooler fans, and the two failures happened in the evening after the room cooled off. For some reason xmbmon doesn't seem to be working at the moment, but when I looked at the temperatures previously they seemed to be acceptable.

I don't believe in overclocking.
Re: Memory corruption in CURRENT
Mark Santcroos wrote:
> On Thu, Aug 22, 2002 at 02:38:25AM -0700, Terry Lambert wrote:
>> Alternatively, rather than those options, try losing 512M of the
>> RAM... I note they are all 1G boxes.
>
> No, mine is 256MB.

Correction: all of his were 1G, and should be halved. *You*, on the other hand, should try doubling the RAM size. 8-).

Another potential winner is:

options maxfiles=5

Note that I believe this one, the RAM size change, and the compile without debug all merely mask, rather than fix, the problem (i.e. it's not the code that's generated, it's a side effect of a more subtle and amusing problem). I don't believe anyone actually loads kernels with debug symbols anyway (if you config -g GENERIC, unless you intentionally copy up kernel.debug, you will get the stripped one).

The suggested options, if they work, actually *fix* the problem, but it's a really pessimal way to go about it (they work by making the requisite preconditions impossible to trigger); only an idiot would run with those options, unless they had no other choice in the matter (e.g. they had local kernel hacks that broke all of the other workarounds, with no hope of repair).

If it's what I think it is, it's more fixable than that.

-- Terry
Re: Memory corruption in CURRENT
On Thu, Aug 22, 2002 at 02:33:57AM -0700, Terry Lambert wrote:
> options DISABLE_PSE
> options DISABLE_PG_G

Coming up next in this theater :-)

btw, how does the report that using the other compiler fixed everything for KT fit in?

--
Mark Santcroos    RIPE Network Coordination Centre
http://www.ripe.net/home/mark/    New Projects Group/TTM
Re: Memory corruption in CURRENT
Hi

This is what I did to the system's cc/gcc. I built the released version of gcc 3.1.1 from ports (with much pain, of course).

passion:/usr/bin[514]# ls -l cc* gcc*
lrwxr-xr-x 1 root wheel 20 Aug 12 21:54 cc -> /usr/local/bin/gcc31
-r-xr-xr-x 2 root wheel 135616 Aug 12 21:52 cc.sav
lrwxr-xr-x 1 root wheel 20 Aug 12 21:54 gcc -> /usr/local/bin/gcc31
-r-xr-xr-x 2 root wheel 135616 Aug 12 21:52 gcc.sav

The hardware = Asus P4S533 + P4 1.6A (now 2.4 GHz) + Kingston DDR333 value RAM.

I haven't encountered any random signal or stability issue since I swapped cc/gcc with the ports version. I will try to run 10 buildworlds in a row later tonight.

kt

On Thu, Aug 22, 2002 at 12:00:14PM +0200, Mark Santcroos wrote:
> On Thu, Aug 22, 2002 at 02:33:57AM -0700, Terry Lambert wrote:
>> options DISABLE_PSE
>> options DISABLE_PG_G
>
> Coming up next in this theater :-)
>
> btw, how does the report that using the other compiler fixed
> everything for KT fit in?
>
> --
> Mark Santcroos    RIPE Network Coordination Centre
> http://www.ripe.net/home/mark/    New Projects Group/TTM
Re: Memory corruption in CURRENT
Mark Santcroos wrote:
> On Thu, Aug 22, 2002 at 02:33:57AM -0700, Terry Lambert wrote:
>> options DISABLE_PSE
>> options DISABLE_PG_G
>
> Coming up next in this theater :-)
>
> btw, how does the report that using the other compiler fixed
> everything for KT fit in?

Coincidentally.

It's hard to trigger the bug, so it's easy to work around it accidentally. Most people who run into these types of bugs only really care about getting things working, so if they accidentally work around the thing, they stop there, without digging down to discover the root cause.

It's much better to find out the root cause than to submerge the bug again; making it go away for unknown reasons means it will probably come back for unknown reasons at a later point, and bite you on the butt.

No one in their right mind could believe that recompiling a user space application to avoid kernel-based faults actually fixes the underlying problem.

If I had to voice a theory, I would say that the change in memory access patterns resulted in the problem being submerged again. Most likely, a different set of system load characteristics could cause it to resurface, everything else being the same (in fact, that's what I would say is happening with the original compiler; workarounds are commutative).

It's not really predictable at what future point that could/would happen, so you basically end up being left with a live landmine somewhere in your back yard, and you think it's safe because poodles no longer explode 15 minutes after they are tied to the apple tree. 8-) 8-).

-- Terry
Re: Memory corruption in CURRENT
On 22 Aug, Terry Lambert wrote:
> Alternatively, rather than those options, try losing 512M of the RAM...
> I note they are all 1G boxes.

When I first put this system together several months ago, I only installed the first 512M of RAM and the problem was much worse. I only had about a 50% chance of getting a successful buildworld. The problem seemed to go away shortly thereafter, and though I wasn't sure whether the problem was caused by hardware or software, I attributed the improvement to an upgrade to a newer version of -current.

Since then I've replaced the motherboard (the old one didn't support ECC), the disk and controller (I needed more space, and the new disk consumes a lot less power than the two that it replaced, plus it doesn't sound like a dental drill), and the power supply (because I was concerned that the original might be marginal). I also added an extra intake fan on the front of the case.

At the moment I'm running a set of buildworlds with an August 6th kernel, just to verify the problem that I'm seeing isn't something new. When I'm done with that, I'll reduce the RAM from 1G to 512M and try again. I'll also try the DISABLE_PSE and DISABLE_PG_G options.
Re: Memory corruption in CURRENT
On Thu, Aug 22, 2002 at 03:17:03AM -0700, Terry Lambert wrote:
> Mark Santcroos wrote:
>> On Thu, Aug 22, 2002 at 02:33:57AM -0700, Terry Lambert wrote:
>>> options DISABLE_PSE
>>> options DISABLE_PG_G
>>
>> Coming up next in this theater :-)
>>
>> btw, how does the report that using the other compiler fixed
>> everything for KT fit in?

It does indeed look like a 'winner'. The buildworld is still running but has already got further than the previous 10.

> Coincidentally.
>
> It's hard to trigger the bug, so it's easy to work around it
> accidentally.

That's very true indeed. I can take that as a good 'explanation'.

I remember you talking about these PSE problems before, more than once. Is it fixable? I assume we would like to turn these options back on, as they improve performance, don't they?

Mark

--
Mark Santcroos    RIPE Network Coordination Centre
http://www.ripe.net/home/mark/    New Projects Group/TTM
Re: Memory corruption in CURRENT
Hi,

> options DISABLE_PSE
> options DISABLE_PG_G

Just added them. I'll now build 20 buildworlds with those enabled.

Martin
Re: Memory corruption in CURRENT
Don Lewis wrote:
> At the moment I'm running a set of buildworlds with an August 6th
> kernel, just to verify the problem that I'm seeing isn't something new.
> When I'm done with that, I'll reduce the RAM from 1G to 512M and try
> again. I'll also try the DISABLE_PSE and DISABLE_PG_G options.

Please do these separately.

Changing the amount of RAM is only indicative, not diagnostic, so it's just additional information, if it does anything, instead of nothing.

With Matt Dillon's changes to machdep.c to do auto-tuning, changing the amount of RAM is much less likely to work around the problem, unless you hit the stair function at just the right point.

-- Terry
Re: Memory corruption in CURRENT
Mark Santcroos wrote:
> On Thu, Aug 22, 2002 at 03:17:03AM -0700, Terry Lambert wrote:
>> Mark Santcroos wrote:
>>> On Thu, Aug 22, 2002 at 02:33:57AM -0700, Terry Lambert wrote:
>>>> options DISABLE_PSE
>>>> options DISABLE_PG_G
>>>
>>> Coming up next in this theater :-)
>>>
>>> btw, how does the report that using the other compiler fixed
>>> everything for KT fit in?
>
> It does indeed look like a 'winner'. The buildworld is still running
> but has already got further than the previous 10.

Ugh! Wait until it seems to work for a statistically significant sample size, and for more than one person, before calling it happy!

Also, I'm not sure from looking at the code whether or not the PG_G is truly significant, or just perturbs the workaround. The problem I've referred to in my hunch here is actually related solely to the PSE, but with the recent code reorganization in locore.s, etc., it could have become more significant.

>> Coincidentally.
>>
>> It's hard to trigger the bug, so it's easy to work around it
>> accidentally.
>
> That's very true indeed. I can take that as a good 'explanation'.
>
> I remember you talking about these PSE problems before, more than once.
> Is it fixable? I assume we would like to turn these options back on,
> as they improve performance, don't they?

Yes and yes, but it could be pretty ugly. It would be better to get more data from people who are seeing the problem. It may be that it's just similar symptoms and more than one root cause, etc., so I'm pretty loath to make any assumptions.

-- Terry
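Terry's caution about sample size can be put in numbers with a simple model. The sketch below assumes every buildworld fails independently with the same probability. That is an assumption for illustration only, since the reports in this thread suggest the bug can hide for a while and then resurface. `chance_of_clean_run` is a hypothetical name.

```c
/*
 * Probability that n builds in a row all succeed when each build fails
 * independently with probability p_fail. Illustrative model only; real
 * failures in this thread may well not be independent.
 */
static double
chance_of_clean_run(double p_fail, int n)
{
	double p = 1.0;
	int i;

	for (i = 0; i < n; i++)
		p *= 1.0 - p_fail;	/* one more build must also succeed */
	return (p);
}
```

With the roughly 50% failure rate Don Lewis saw on his 512M configuration, ten clean builds in a row would happen by chance about once in 1024 tries (`chance_of_clean_run(0.5, 10)`), so ten clean builds there would be strong evidence. With a failure rate nearer 10%, ten clean builds happen by chance about 35% of the time, which is why one or two successful runs say very little.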
Re: Memory corruption in CURRENT
Martin Blapp wrote:
>> options DISABLE_PSE
>> options DISABLE_PG_G
>
> Just added them. I'll now build 20 buildworlds with those enabled.

Let the list know if it does anything. If Soren could also test, that would give a sample size.

If it's a 3-for-3 workaround, then I probably need to take the discussion offline with Peter Wemm, and come up with a permanent fix.

If it's not 3-for-3, then more testing and thinking is needed.

-- Terry
Re: Memory corruption in CURRENT
On Thu, Aug 22, 2002 at 04:23:46AM -0700, Terry Lambert wrote:
> Ugh! Wait until it seems to work for a statistically significant
> sample size, and for more than one person, before calling it happy!
>
> Also, I'm not sure from looking at the code whether or not the PG_G is
> truly significant, or just perturbs the workaround. The problem I've
> referred to in my hunch here is actually related solely to the PSE,
> but with the recent code reorganization in locore.s, etc., it could
> have become more significant.

I was just giving a slight report, not yelling hallelujah yet ;-)

It's doing the 2nd buildworld now.

Do you also want me to try to split up the disabling of the two options?

Mark

--
Mark Santcroos    RIPE Network Coordination Centre
http://www.ripe.net/home/mark/    New Projects Group/TTM
Re: Memory corruption in CURRENT
On Thu, Aug 22, 2002 at 04:31:02AM -0700, Terry Lambert wrote:
> If it's a 3-for-3 workaround, then I probably need to take the
> discussion offline with Peter Wemm, and come up with a permanent fix.

There was something with non-disclosure, am I right?

--
Mark Santcroos    RIPE Network Coordination Centre
http://www.ripe.net/home/mark/    New Projects Group/TTM
Re: Memory corruption in CURRENT
It seems Terry Lambert wrote:
> Martin Blapp wrote:
>>> options DISABLE_PSE
>>> options DISABLE_PG_G
>>
>> Just added them. I'll now build 20 buildworlds with those enabled.
>
> Let the list know if it does anything. If Soren could also test, that
> would give a sample size.

Sure, but I don't have the problem :) I can buildworld for days on my (heavily overclocked btw) Athlon with no problems at all...

-Søren
Re: Memory corruption in CURRENT
Hi,

Can you revert to the system compiler, also compile your kernel with these options, and do some buildworlds again?

Thanks

Mark

On Thu, Aug 22, 2002 at 01:41:13PM +0200, Soeren Schmidt wrote:
> It seems Terry Lambert wrote:
>> Martin Blapp wrote:
>>>> options DISABLE_PSE
>>>> options DISABLE_PG_G
>>>
>>> Just added them. I'll now build 20 buildworlds with those enabled.
>>
>> Let the list know if it does anything. If Soren could also test, that
>> would give a sample size.
>
> Sure, but I don't have the problem :) I can buildworld for days on my
> (heavily overclocked btw) Athlon with no problems at all...
>
> -Søren

--
Mark Santcroos    RIPE Network Coordination Centre
http://www.ripe.net/home/mark/    New Projects Group/TTM
Re: Memory corruption in CURRENT
It seems Mark Santcroos wrote:
> Hi,
>
> Can you revert to the system compiler, also compile your kernel with
> these options, and do some buildworlds again?

I already use the system compiler...

> On Thu, Aug 22, 2002 at 01:41:13PM +0200, Soeren Schmidt wrote:
>> Sure, but I don't have the problem :) I can buildworld for days on my
>> (heavily overclocked btw) Athlon with no problems at all...
>>
>> -Søren

-Søren
Re: Memory corruption in CURRENT
Hi,

As far as I can tell it is really rather easy to hide the bug. All of these changes hid the bug for some time:

- Use -g to compile the segfaulting binaries
- Remove -g to compile the segfaulting binaries
- Use a kernel compiled on another machine
- New hardware

The problems disappeared after each such change, but showed up again later after some buildworlds.

Martin

Martin Blapp, [EMAIL PROTECTED] [EMAIL PROTECTED]
--
ImproWare AG, UNIXSP ISP, Zurlindenstrasse 29, 4133 Pratteln, CH
Phone: +41 061 826 93 00: +41 61 826 93 01
PGP: finger -l [EMAIL PROTECTED]
PGP Fingerprint: B434 53FC C87C FE7B 0A18 B84C 8686 EF22 D300 551E
--
Re: Memory corruption in CURRENT
On Thu, Aug 22, 2002 at 01:55:42PM +0200, Soeren Schmidt wrote:
>> Can you revert to the system compiler, also compile your kernel with
>> these options, and do some buildworlds again?
>
> I already use the system compiler...

That's why the message was addressed to kt ;-)

--
Mark Santcroos    RIPE Network Coordination Centre
http://www.ripe.net/home/mark/    New Projects Group/TTM
Re: Memory corruption in CURRENT
Mark Santcroos wrote:
> On Thu, Aug 22, 2002 at 04:31:02AM -0700, Terry Lambert wrote:
>> If it's a 3-for-3 workaround, then I probably need to take the
>> discussion offline with Peter Wemm, and come up with a permanent fix.
>
> There was something with non-disclosure, am I right?

No.

If it's even the same problem, I figured out what was going on by myself, not for an employer, by being obsessive about details, and spending well over a week on my need to know "why?".

I had *seen* the problem at two previous employers, but neither of them was interested in me wasting time on the "why?", once there was a workaround available. No reason to give the information away to competitors, if there's nothing in it but that they become more competitive...

-- Terry
Re: Memory corruption in CURRENT
Mark Santcroos wrote:
> On Thu, Aug 22, 2002 at 01:41:13PM +0200, Soeren Schmidt wrote:
>> Sure, but I don't have the problem :) I can buildworld for days on my
>> (heavily overclocked btw) Athlon with no problems at all...
>
> Can you revert to the system compiler, also compile your kernel with
> these options, and do some buildworlds again?

You are making the same mistake I did. We should be asking Udo Schweigert and KT Sin. Soeren doesn't have any P4's or AMD's with the problem.

Hmmm... P4... AMD... have to wonder if someone looked at someone else's paper during the test... ;^).

Here's the list of people I have, from a casual look at the archives for this week:

Mark Santcroos
Martin Blapp (P4)
Udo Schweigert (P4, new compiler workaround)
KT Sin (P4, new compiler workaround)
Don Lewis (AMD Athlon)
Alexander Leidinger (P4)
David O'Brien (AMD)
Alfred Perlstein (?)

If we could get most or all of them to try the workaround, it would provide useful information...

-- Terry
Re: Memory corruption in CURRENT
On Thu, Aug 22, 2002 at 05:21:54AM -0700, Terry Lambert wrote:
> Mark Santcroos wrote:
>> On Thu, Aug 22, 2002 at 01:41:13PM +0200, Soeren Schmidt wrote:
>>> Sure, but I don't have the problem :) I can buildworld for days on
>>> my (heavily overclocked btw) Athlon with no problems at all...
>>
>> Can you revert to the system compiler, also compile your kernel with
>> these options, and do some buildworlds again?
>
> You are making the same mistake I did.

No, you are making the same mistake as Soren ;-)

I did ask KT, look at the 'To:' field.

Mark

--
Mark Santcroos    RIPE Network Coordination Centre
http://www.ripe.net/home/mark/    New Projects Group/TTM
Re: Memory corruption in CURRENT
Mark Santcroos wrote:
> I was just giving a slight report, not yelling hallelujah yet ;-)
>
> It's doing the 2nd buildworld now.
>
> Do you also want me to try to split up the disabling of the two
> options?

No. Me saying to use both options was just me being lazy about spending 2 days re-documenting the code from boot2 through mi_startup(), to be sure nothing had crept in there.

-- Terry
Re: Memory corruption in CURRENT
We have seen weird problems regarding the pmap PG_G related stuff (well, sort of; it has to do with PSE and PG_G) on ppro and pII chips (apparently, this is not the case with at least Xeons). For the record, what happened was this:

We would enable PSE and switch the pde corresponding to the first 4M to the new entry describing a 4M page, instead of the one describing the location of the ptes covering those 4M. Then we would walk all the ptes, including the old, stale and useless ones that previously described those first 4M, and set the PG_G bit there (note: we had already set PG_G on our 4M page). Normally, we don't really need to touch the old ptes, but we did it just because it was more convenient (i.e. a few lines less code).

Oddly enough, on the ppro and pII we would page fault on the page where we kept the old ptes covering those first 4M, and only on that page! The other ptes, the ones that actually mattered, were all fine. The ptes are mapped above the 4M, so I don't see how changing the pde for those first 4M would have done anything.

To fix the problem, we (actually Peter) committed code that basically just jumps beyond that first page of stale ptes when setting the PG_G bit for the 4K pages, and since then the problem seems to have gone away. Although we are not sure, this seems like a silicon bug.

Since then, Peter had some work planned to load the kernel above the first 4M to see if that fixed the problems. I'm wondering if this problem on the PIVs could be related. Please let us know if disabling PSE and PG_G with those two options really makes 5-10 buildworlds in a row work out for you.

Regards,
Bosko

On Thu, Aug 22, 2002 at 01:34:11PM +0200, Mark Santcroos wrote:
> On Thu, Aug 22, 2002 at 04:23:46AM -0700, Terry Lambert wrote:
>> Ugh! Wait until it seems to work for a statistically significant
>> sample size, and for more than one person, before calling it happy!
>>
>> Also, I'm not sure from looking at the code whether or not the PG_G is
>> truly significant, or just perturbs the workaround. The problem I've
>> referred to in my hunch here is actually related solely to the PSE,
>> but with the recent code reorganization in locore.s, etc., it could
>> have become more significant.
>
> I was just giving a slight report, not yelling hallelujah yet ;-)
>
> It's doing the 2nd buildworld now.
>
> Do you also want me to try to split up the disabling of the two
> options?
>
> Mark
>
> --
> Mark Santcroos    RIPE Network Coordination Centre
> http://www.ripe.net/home/mark/    New Projects Group/TTM

--
Bosko Milekic * [EMAIL PROTECTED] * [EMAIL PROTECTED]
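The PG_G walk Bosko describes, and the fix of jumping past the first page of stale ptes, can be pictured with a small user-space simulation. This is only a sketch under stated assumptions and not the code Peter committed. The bit values are the standard x86 page-table flags and the macro names follow FreeBSD's i386 pmap headers, but `mark_global` and its `skip_pages` parameter are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define PG_V	0x001u	/* entry is valid (present) */
#define PG_G	0x100u	/* global: not flushed from the TLB on a CR3 reload */
#define NPTEPG	1024	/* 4-byte ptes held by one 4K page-table page */

/*
 * Set PG_G on every valid pte, skipping the first skip_pages page-table
 * pages. With skip_pages == 1 the stale ptes that used to map the first
 * 4M (now covered by a single 4M page) are never touched.
 * Returns the number of entries marked. Simulation only: ptes is an
 * ordinary array, not a live page table.
 */
static size_t
mark_global(uint32_t *ptes, size_t nptes, size_t skip_pages)
{
	size_t i, marked = 0;

	for (i = skip_pages * NPTEPG; i < nptes; i++) {
		if (ptes[i] & PG_V) {
			ptes[i] |= PG_G;
			marked++;
		}
	}
	return (marked);
}
```

Calling `mark_global(ptes, nptes, 0)` corresponds to the original behaviour of walking everything, and `mark_global(ptes, nptes, 1)` to the workaround. The sketch shows only the control flow. It cannot reproduce the fault itself, which Bosko suspects is a silicon bug.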
Re: Memory corruption in CURRENT
Hi Terry,

> options DISABLE_PSE
> options DISABLE_PG_G

I'm now at buildworld IV; since I compiled these options in, the bug has not shown up again.

Another side effect: before that I could not even make -j 10 buildworld; it ended with a page fault somewhere in pmap... This has gone away too.

But let's see. I'll build 20 worlds in a row.

Martin