Re: Memory corruption in CURRENT

2002-08-22 Thread Mark Santcroos

On Thu, Aug 22, 2002 at 09:43:45AM +0200, Martin Blapp wrote:
 I suspect all the SIG4 and SIG11 problems we see are due to
 memory corruption in CURRENT.
 
  In the latter case, the affected file looks like:
 
case HASH('^', 'e'):
case HASH('^', 'i'):
case HASH('^'  'o'):
  \xc0 case HASH('^', 'u'):
   %case HADH('`', \xc0A'):
^@ase HASH('`', 'E'):
case HASH('`', 'I'):
case HASH('`', 'O'):
case HASH('`', 'U'):
 
  The file is correct after a reboot, so the corruption was limited to the
  copy cached in RAM.
 
 That's memory corruption. I'm also no longer able
 to make 10 buildworlds (without -j; using -j triggers
 panics in pmap code).
 
 By the way, I've been experiencing this for about 4-5 months.
 
 All hackers, please help to track this down.
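
As a rough aid for reading the damaged listing quoted above, a byte-wise XOR of an intended line against its damaged copy shows how many bits actually changed. This is only a minimal sketch in C: the helper names `popcount8` and `bits_flipped` are made up for illustration, and the comparison assumes both strings have the same length.

```c
#include <stddef.h>

/* Count the bits that are set in one byte. */
static int popcount8(unsigned char b)
{
    int n = 0;

    while (b != 0) {
        n += b & 1;
        b >>= 1;
    }
    return n;
}

/*
 * Hypothetical helper: number of bits that differ between the intended
 * bytes (want) and the bytes actually seen (got), over len bytes.
 */
static int bits_flipped(const char *want, const char *got, size_t len)
{
    int total = 0;

    for (size_t i = 0; i < len; i++)
        total += popcount8((unsigned char)want[i] ^ (unsigned char)got[i]);
    return total;
}
```

For example, `bits_flipped("HASH", "HADH", 4)` counts 4 differing bits ('S' is 0x53, 'D' is 0x44), and the comma (0x2c) that became a space (0x20) accounts for 2 more, so at least these bytes are not simple single-bit flips.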

Is it P4 specific or not?

Mark

-- 
Mark Santcroos  RIPE Network Coordination Centre
http://www.ripe.net/home/mark/  New Projects Group/TTM

To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-current in the body of the message



Re: Memory corruption in CURRENT

2002-08-22 Thread Soeren Schmidt

It seems Martin Blapp wrote:
 
 Hi all,
 
 I suspect all the SIG4 and SIG11 problems we see are due to
 memory corruption in CURRENT.

  The file is correct after a reboot, so the corruption was limited to the
  copy cached in RAM.
 
 That's memory corruption. I'm also no longer able
 to make 10 buildworlds (without -j; using -j triggers
 panics in pmap code).
 
 By the way, I've been experiencing this for about 4-5 months.
 
 All hackers, please help to track this down.

Hmm, I haven't seen this at all. I've just started buildworld loops on
two machines here, but I normally do at least a couple of buildworlds a day
and I haven't noticed problems like the above (but plenty of bad commits etc).

However, this kind of problem in most cases spells bad HW to me,
i.e. subspec RAM, poor power supply, badly cooled CPU, overclocking etc etc...

-Søren




Re: Memory corruption in CURRENT

2002-08-22 Thread Martin Blapp


Hi Soeren,

 However, this kind of problem in most cases spells bad HW to me,
 i.e. subspec RAM, poor power supply, badly cooled CPU, overclocking etc etc...

That's what I thought too. I have now three different systems which show
all this:

1) PIV 1,6Ghz, Intel B845DG Board, 1GB Kingston Ram,
2) PIV 2Ghz Intel B845DG Board, 1GB Kingston ECC Ram
3) PIV 2,26 Ghz Asus P4B533 Board with I845 chipset, 1GB noname Ram

All running CURRENT. I also replaced the CPU and RAM in 1) and 2).
It happens on both SCSI and ATA disks. The power supply has been changed
on all 3 systems. The problem is still the same.

The problem sometimes appears just after startup, when the CPU is still cold.
Other times it builds 6 buildworlds successfully, and then suddenly I see
a SIG4.

Martin





Re: Memory corruption in CURRENT

2002-08-22 Thread Soeren Schmidt

It seems Martin Blapp wrote:
 Hi Soeren,
 
  However, this kind of problem in most cases spells bad HW to me,
  i.e. subspec RAM, poor power supply, badly cooled CPU, overclocking etc etc...
 
 That's what I thought too. I have now three different systems which show
 all this:
 
 1) PIV 1,6Ghz, Intel B845DG Board, 1GB Kingston Ram,
 2) PIV 2Ghz Intel B845DG Board, 1GB Kingston ECC Ram
 3) PIV 2,26 Ghz Asus P4B533 Board with I845 chipset, 1GB noname Ram
 
 All running CURRENT. I also replaced in 1) and 2) the CPU, RAM.
 It happens both on SCSI and ATA disks. Powersupply has been changed
 for all 3 systems. Problem is still the same.
 
 The problem sometimes appears just after startup. CPU is still cold then.
 Other times it builds 6 buildworlds successfully, and then suddenly I see
 a SIG4.

Hmm, that's probably a P4 problem then. I don't see it on any of my
systems (P3 + K7). I don't have a P4 here (and never will unless one is left
at my doorstep), and I have no immediate ideas other than trying a
MB that doesn't have an i845 chipset (the fewer Intel parts the better :) )

-Søren




Re: Memory corruption in CURRENT

2002-08-22 Thread Udo Schweigert

On Thu, Aug 22, 2002 at 11:23:46 +0200, Martin Blapp wrote:
 That's what I thought too. I have now three different systems which show
 all this:
 
 1) PIV 1,6Ghz, Intel B845DG Board, 1GB Kingston Ram,
 2) PIV 2Ghz Intel B845DG Board, 1GB Kingston ECC Ram
 3) PIV 2,26 Ghz Asus P4B533 Board with I845 chipset, 1GB noname Ram
 
 All running CURRENT. I also replaced in 1) and 2) the CPU, RAM.
 It happens both on SCSI and ATA disks. Powersupply has been changed
 for all 3 systems. Problem is still the same.
 
 The problem sometimes appears just after startup. CPU is still cold then.
 Other times it builds 6 buildworlds successfully, and then suddenly I see
 a SIG4.
 

Only a little addition from me: I had the same problems on -stable and they
only disappeared after compiling the kernel without debugging. I had the
impression that it has to do with the size of the kernel (but this of course
may be wrong). After dropping -g from the kernel compile I haven't had a problem
again on -stable. (At the moment I do not have -current on a P-IV; the
motherboard is a Fujitsu-Siemens.)

Best regards

--
Udo Schweigert, Siemens AG   | Voice  : +49 89 636 42170
CT IC CERT, Siemens CERT | Fax: +49 89 636 41166
D-81730 Muenchen / Germany   | email  : [EMAIL PROTECTED]




Re: Memory corruption in CURRENT

2002-08-22 Thread Don Lewis

On 22 Aug, Mark Santcroos wrote:
 On Thu, Aug 22, 2002 at 09:43:45AM +0200, Martin Blapp wrote:

 That's memory corruption. I'm also no longer able
 to make 10 buildworlds (without -j; using -j triggers
 panics in pmap code).
 
 By the way, I've been experiencing this for about 4-5 months.
 
 All hackers, please help to track this down.
 
 Is it P4 specific or not?

Nope.  I'm seeing it on an AMD Athlon XP 1900+.





Re: Memory corruption in CURRENT

2002-08-22 Thread Terry Lambert

Soeren Schmidt wrote:
 It seems Martin Blapp wrote:
  I suspect all the SIG4 and SIG11 problems we see are due to
  memory corruption in CURRENT.
 
   The file is correct after a reboot, so the corruption was limited to the
   copy cached in RAM.
 
  That's memory corruption. I'm also no longer able
  to make 10 buildworlds (without -j; using -j triggers
  panics in pmap code).
 
  By the way, I've been experiencing this for about 4-5 months.
 
  All hackers, please help to track this down.
 
 Hmm, I haven't seen this at all. I've just started buildworld loops on
 two machines here, but I normally do at least a couple of buildworlds a day
 and I haven't noticed problems like the above (but plenty of bad commits etc).
 
 However, this kind of problem in most cases spells bad HW to me,
 i.e. subspec RAM, poor power supply, badly cooled CPU, overclocking etc etc...

Try:

options DISABLE_PSE
options DISABLE_PG_G
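
For readers unfamiliar with these two options: both are compile-time switches that keep the kernel from using optional i386 paging features, namely 4M pages (PSE) and global pages (PGE). The sketch below is a user-space model of that kind of gating, not the kernel's actual code. `select_page_flags` is a hypothetical name; the CPUID bit positions and page-table bit values are the architectural ones.

```c
#include <stdint.h>

/* CPUID.1:EDX feature bits (architectural). */
#define CPUID_PSE 0x00000008u  /* bit 3: 4M page size extension */
#define CPUID_PGE 0x00002000u  /* bit 13: global pages */

/* Page-table entry bits (architectural). */
#define PG_PS 0x080u  /* in a PDE: entry maps a 4M page */
#define PG_G  0x100u  /* global: translation survives a %cr3 reload */

/*
 * Hypothetical helper: return the optional flags the pmap code would be
 * allowed to use.  Building with -DDISABLE_PSE or -DDISABLE_PG_G removes
 * the corresponding feature no matter what the CPU reports.
 */
static uint32_t select_page_flags(uint32_t cpu_feature)
{
    uint32_t flags = 0;

#ifndef DISABLE_PSE
    if (cpu_feature & CPUID_PSE)
        flags |= PG_PS;
#endif
#ifndef DISABLE_PG_G
    if (cpu_feature & CPUID_PGE)
        flags |= PG_G;
#endif
    (void)cpu_feature;  /* avoids an unused-parameter warning if both are disabled */
    return flags;
}
```

With neither macro defined, a CPU reporting both features yields `PG_PS | PG_G`; with both defined the function always returns 0, which models why the options cost performance but remove the preconditions for the suspected bug.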

-- Terry




Re: Memory corruption in CURRENT

2002-08-22 Thread Mark Santcroos

On Thu, Aug 22, 2002 at 11:30:50AM +0200, Udo Schweigert wrote:
 Only a little addition from me: I had the same problems on -stable and they
 only disappeared after compiling the kernel without debugging. I had the
 impression that it has to do with the size of the kernel (but this of course
 may be wrong). After dropping -g from the kernel compile I haven't had a problem
 again on -stable. (At the moment I do not have -current on a P-IV; the
 motherboard is a Fujitsu-Siemens.)

I will try that asap.

Mark

-- 
Mark Santcroos  RIPE Network Coordination Centre
http://www.ripe.net/home/mark/  New Projects Group/TTM




Re: Memory corruption in CURRENT

2002-08-22 Thread Terry Lambert

Martin Blapp wrote:
 I have now three different systems which show all this:
 
 1) PIV 1,6Ghz, Intel B845DG Board, 1GB Kingston Ram,
 2) PIV 2Ghz Intel B845DG Board, 1GB Kingston ECC Ram
 3) PIV 2,26 Ghz Asus P4B533 Board with I845 chipset, 1GB noname Ram
 
 All running CURRENT. I also replaced in 1) and 2) the CPU, RAM.
 It happens both on SCSI and ATA disks. Powersupply has been changed
 for all 3 systems. Problem is still the same.
 
 The problem sometimes appears just after startup. CPU is still cold then.
 Other times it builds 6 buildworlds successfully, and then suddenly I see
 a SIG4.

Alternatively, rather than those options, try losing 512M of
the RAM... I note they are all 1G boxes.

-- Terry




Re: Memory corruption in CURRENT

2002-08-22 Thread Mark Santcroos

Hi Martin,

As you have known about this problem for longer, have you already tried to make
it a bit more reproducible or to narrow it down?

If not, we really should try to; that will be the first step in fixing it.

Mark

-- 
Mark Santcroos  RIPE Network Coordination Centre
http://www.ripe.net/home/mark/  New Projects Group/TTM




Re: Memory corruption in CURRENT

2002-08-22 Thread Mark Santcroos

On Thu, Aug 22, 2002 at 02:38:25AM -0700, Terry Lambert wrote:
 Alternatively, rather than those options, try losing 512M of
 the RAM... I note they are all 1G boxes.

No, mine is 256MB.

Mark

-- 
Mark Santcroos  RIPE Network Coordination Centre
http://www.ripe.net/home/mark/  New Projects Group/TTM




Re: Memory corruption in CURRENT

2002-08-22 Thread Don Lewis

On 22 Aug, Soeren Schmidt wrote:

 However, this kind of problem in most cases spells bad HW to me,
 i.e. subspec RAM, poor power supply, badly cooled CPU, overclocking etc etc...

My motherboard chipset supports ECC RAM and I have ECC RAM installed.  I
upgraded to an expensive Antec power supply that has better specs than
most of the other supplies I looked at.  The system is plugged into a
surge suppressor.  I don't currently have a UPS.  I added an extra case
fan and drive cooler fans, and the two failures happened in the evening
after the room cooled off.  For some reason xmbmon doesn't seem to be
working at the moment, but when I looked at the temperatures previously
they seemed to be acceptable.  I don't believe in overclocking.






Re: Memory corruption in CURRENT

2002-08-22 Thread Terry Lambert

Mark Santcroos wrote:
 On Thu, Aug 22, 2002 at 02:38:25AM -0700, Terry Lambert wrote:
  Alternatively, rather than those options, try losing 512M of
  the RAM... I note they are all 1G boxes.
 
 No, mine is 256MB.

Correction: all of his were 1G, and should be halved.  *You*, on
the other hand, should try doubling the RAM size.  8-).

Another potential winner is:

options maxfiles=5

Note that I believe this one, the RAM size change, and the compile
without debug all merely mask, rather than fix, the problem (i.e.
it's not the code that's generated, it's a side effect of a more
subtle and amusing problem).  I don't believe anyone actually
loads kernels with debug symbols anyway (if you config -g GENERIC,
unless you intentionally copy up kernel.debug, you will get the
stripped one).

The suggested options, if they work, actually *fix* the problem,
but it's a really pessimal way to go about it (they work by making
the requisite preconditions impossible to trigger); only an idiot
would run with those options, unless they had no other choice in
the matter (e.g. they had local kernel hacks that broke all of
the other workarounds, with no hope of repair).  If it's what I
think it is, it's more fixable than that.

-- Terry




Re: Memory corruption in CURRENT

2002-08-22 Thread Mark Santcroos

On Thu, Aug 22, 2002 at 02:33:57AM -0700, Terry Lambert wrote:
   options DISABLE_PSE
   options DISABLE_PG_G

Coming up next in this theater :-)

btw, how does the report that using the other compiler fixed everything
for KT fit in?

-- 
Mark Santcroos  RIPE Network Coordination Centre
http://www.ripe.net/home/mark/  New Projects Group/TTM




Re: Memory corruption in CURRENT

2002-08-22 Thread KT Sin

Hi

This is what I did to the system's cc/gcc. I built the gcc 3.1.1 release version
from ports (with much pain, of course).

passion:/usr/bin[514]# ls -l cc* gcc*
lrwxr-xr-x  1 root  wheel  20 Aug 12 21:54 cc -> /usr/local/bin/gcc31
-r-xr-xr-x  2 root  wheel  135616 Aug 12 21:52 cc.sav
lrwxr-xr-x  1 root  wheel  20 Aug 12 21:54 gcc -> /usr/local/bin/gcc31
-r-xr-xr-x  2 root  wheel  135616 Aug 12 21:52 gcc.sav

The hardware = Asus P4S533 + P4 1.6A (now 2.4Ghz) + Kingston DDR333 value ram.

Haven't encountered any random signal or stability issue since I swapped
cc/gcc with ports' version.

Will try to run 10 buildworlds in a row later tonight.

kt

On Thu, Aug 22, 2002 at 12:00:14PM +0200, Mark Santcroos wrote:
 On Thu, Aug 22, 2002 at 02:33:57AM -0700, Terry Lambert wrote:
  options DISABLE_PSE
  options DISABLE_PG_G
 
 Coming up next in this theater :-)
 
 btw, how does the report that using the other compiler fixed everything
 for KT fit in?
 
 -- 
 Mark Santcroos  RIPE Network Coordination Centre
 http://www.ripe.net/home/mark/  New Projects Group/TTM
 




Re: Memory corruption in CURRENT

2002-08-22 Thread Terry Lambert

Mark Santcroos wrote:
 On Thu, Aug 22, 2002 at 02:33:57AM -0700, Terry Lambert wrote:
options DISABLE_PSE
options DISABLE_PG_G
 
 Coming up next in this theater :-)
 
 btw, how does the report that using the other compiler fixed everything
 for KT fit in?

Coincidentally.  It's hard to trigger the bug, so it's easy to
work around it accidentally.

Most people who run into this type of bug only really care about
getting things working, so if they accidentally work around the thing,
they stop there, without digging down to discover the root cause.

It's much better to find out the root cause than to submerge the bug
again; making it go away for unknown reasons means it will probably
come back for unknown reasons at a later point, and bite you on the
butt.

No one in their right mind could believe that recompiling a user
space application to avoid kernel-based faults actually fixes the
underlying problem.  If I had to voice a theory, I would say that
the change in memory access patterns resulted in the problem being
submerged again.  Most likely, a different set of system load
characteristics could cause it to resurface, everything else being
the same (in fact, that's what I would say is happening with the
original compiler; workarounds are commutative).

It's not really predictable at what future point that could/would
happen, so you basically end up being left with a live landmine
somewhere in your back yard, and you think it's safe because poodles
no longer explode 15 minutes after they are tied to the apple tree.
8-) 8-).

-- Terry




Re: Memory corruption in CURRENT

2002-08-22 Thread Don Lewis

On 22 Aug, Terry Lambert wrote:

 Alternatively, rather than those options, try losing 512M of
 the RAM... I note they are all 1G boxes.

When I first put this system together several months ago, I only
installed the first 512M of RAM and the problem was much worse.  I only
had about a 50% chance of getting a successful buildworld.  The problem
seemed to go away shortly thereafter, and though I wasn't sure whether
the problem was caused by hardware or software, I attributed the
improvement to an upgrade to a newer version of -current.  Since then
I've replaced the motherboard (the old one didn't support ECC), the disk
and controller (I needed more space, and the new disk consumes a lot
less power than the two that it replaced, plus it doesn't sound like a
dental drill), and the power supply (because I was concerned that the
original might be marginal).  I also added an extra intake fan on the
front of the case.

At the moment I'm running a set of buildworlds with an August 6th
kernel, just to verify the problem that I'm seeing isn't something new.
When I'm done with that, I'll reduce the RAM from 1G to 512M and try
again.  I'll also try the DISABLE_PSE and DISABLE_PG_G options.





Re: Memory corruption in CURRENT

2002-08-22 Thread Mark Santcroos

On Thu, Aug 22, 2002 at 03:17:03AM -0700, Terry Lambert wrote:
 Mark Santcroos wrote:
  On Thu, Aug 22, 2002 at 02:33:57AM -0700, Terry Lambert wrote:
 options DISABLE_PSE
 options DISABLE_PG_G
  
  Coming up next in this theater :-)
  
  btw, how does the report that using the other compiler fixed everything
  for KT fit in?

It looks indeed like it is a 'winner'. The buildworld is still running but
getting further already than the previous 10.

 Coincidentally.  It's hard to trigger the bug, so it's easy to
 work around it accidentally.

That's very true indeed. I can take that as a good 'explanation'.

I remember you talking about this PSE problem earlier and more often. Is
it fixable? I assume we would like to turn these options back on as they
improve performance, don't they?

Mark

-- 
Mark Santcroos  RIPE Network Coordination Centre
http://www.ripe.net/home/mark/  New Projects Group/TTM




Re: Memory corruption in CURRENT

2002-08-22 Thread Martin Blapp


Hi,

   options DISABLE_PSE
   options DISABLE_PG_G

Just added them. I'll now build 20 buildworlds with those enabled.

Martin





Re: Memory corruption in CURRENT

2002-08-22 Thread Terry Lambert

Don Lewis wrote:
 At the moment I'm running a set of buildworlds with an August 6th
 kernel, just to verify the problem that I'm seeing isn't something new.
 When I'm done with that, I'll reduce the RAM from 1G to 512M and try
 again.  I'll also try the DISABLE_PSE and DISABLE_PG_G options.

Please do these separately.  Changing the amount of RAM is only
indicative, not diagnostic, so it's just additional information,
if it does anything, instead of nothing.

With Matt Dillon's changes to machdep.c to do auto-tuning, changing
the amount of RAM is much less likely to work around the problem,
unless you hit the stair function at just the right point.

-- Terry




Re: Memory corruption in CURRENT

2002-08-22 Thread Terry Lambert

Mark Santcroos wrote:
 On Thu, Aug 22, 2002 at 03:17:03AM -0700, Terry Lambert wrote:
  Mark Santcroos wrote:
   On Thu, Aug 22, 2002 at 02:33:57AM -0700, Terry Lambert wrote:
  options DISABLE_PSE
  options DISABLE_PG_G
  
   Coming up next in this theater :-)
  
   btw, how does the report that using the other compiler fixed everything
   for KT fit in?
 
 It looks indeed like it is a 'winner'. The buildworld is still running but
 getting further already than the previous 10.

Ugh!  Wait until it seems to work for a statistically significant
sample size, and for more than one person before calling it happy!
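
To put a rough number on "statistically significant": if each buildworld fails independently with probability p (an assumption; nothing in this thread establishes that the failures are independent or that p is constant), the chance of n clean runs in a row by luck alone is (1-p)^n. A small sketch in C, with hypothetical helper names:

```c
/*
 * Probability that n independent runs all pass when each run fails
 * with probability p_fail.
 */
static double clean_streak_probability(double p_fail, int n)
{
    double p = 1.0;

    for (int i = 0; i < n; i++)
        p *= 1.0 - p_fail;
    return p;
}

/*
 * Smallest number of consecutive clean runs for which "it passed by
 * luck" is less likely than alpha.  Returns -1 for arguments outside
 * 0 < p_fail <= 1 or 0 < alpha < 1.
 */
static int runs_needed(double p_fail, double alpha)
{
    double p = 1.0;
    int n = 0;

    if (p_fail <= 0.0 || p_fail > 1.0 || alpha <= 0.0 || alpha >= 1.0)
        return -1;
    while (p >= alpha) {
        p *= 1.0 - p_fail;
        n++;
    }
    return n;
}
```

With the roughly 50% failure rate Don Lewis reported, `runs_needed(0.5, 0.05)` gives 5; for a machine that fails only one run in ten, it takes 29 clean runs to get below the same 5% threshold, which is why a couple of good buildworlds say very little.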

Also, I'm not sure looking at the code whether or not the PG_G is
truly significant, or just perturbs the workaround.  The problem
I've referred to in my hunch here is actually related solely to
the PSE, but with the recent code reorganization in locore.s, etc.,
it could have become more significant.


  Coincidentally.  It's hard to trigger the bug, so it's easy to
  work around it accidentally.
 
 That's very true indeed. I can take that as a good 'explanation'.
 
 I remember you talking about this PSE problem earlier and more often. Is
 it fixable? I assume we would like to turn these options back on as they
 improve performance, don't they?

Yes and yes, but it could be pretty ugly.  It would be better to
get more data from people who are seeing the problem.  It may be
that it's just similar symptoms and more than one root cause,
etc., so I'm pretty loath to make any assumptions.

-- Terry




Re: Memory corruption in CURRENT

2002-08-22 Thread Terry Lambert

Martin Blapp wrote:
options DISABLE_PSE
options DISABLE_PG_G
 
 Just added them. I'll now build 20 buildworlds with those enabled.

Let the list know if it does anything.  If Soren could also test,
that would give a sample size.

If it's a 3-for-3 workaround, then I probably need to take the
discussion offline with Peter Wemm, and come up with a permanent
fix.

If it's not 3-for-3, then more testing and thinking is needed.

-- Terry




Re: Memory corruption in CURRENT

2002-08-22 Thread Mark Santcroos

On Thu, Aug 22, 2002 at 04:23:46AM -0700, Terry Lambert wrote:
 Ugh!  Wait until it seems to work for a statistically significant
 sample size, and for more than one person before calling it happy!
 
 Also, I'm not sure looking at the code whether or not the PG_G is
 truly significant, or just perturbs the workaround.  The problem
 I've referred to in my hunch here is actually related solely to
 the PSE, but with the recent code reorganization in locore.s, etc.,
 it could have become more significant.

I was just giving a slight report, not yelling halleluja yet ;-)

It's doing the 2nd buildworld now.

Do you also want me to try to split up the disabling of the two options?

Mark

-- 
Mark Santcroos  RIPE Network Coordination Centre
http://www.ripe.net/home/mark/  New Projects Group/TTM




Re: Memory corruption in CURRENT

2002-08-22 Thread Mark Santcroos

On Thu, Aug 22, 2002 at 04:31:02AM -0700, Terry Lambert wrote:
 If it's a 3-for-3 workaround, then I probably need to take the
 discussion offline with Peter Wemm, and come up with a permanent
 fix.

There was something with non-disclosure, am I right?

-- 
Mark Santcroos  RIPE Network Coordination Centre
http://www.ripe.net/home/mark/  New Projects Group/TTM




Re: Memory corruption in CURRENT

2002-08-22 Thread Soeren Schmidt

It seems Terry Lambert wrote:
 Martin Blapp wrote:
 options DISABLE_PSE
 options DISABLE_PG_G
  
  Just added them. I'll now build 20 buildworlds with those enabled.
 
 Let the list know if it does anything.  If Soren could also test,
 that would give a sample size.

Sure, but I don't have the problem :) I can buildworld for days on my
(heavily overclocked btw) Athlon with no problems at all...

-Søren




Re: Memory corruption in CURRENT

2002-08-22 Thread Mark Santcroos

Hi,

Can you revert back to the system compiler and also compile your kernel
with these options and do some buildworlds again?

Thanks

Mark

On Thu, Aug 22, 2002 at 01:41:13PM +0200, Soeren Schmidt wrote:
 It seems Terry Lambert wrote:
  Martin Blapp wrote:
  options DISABLE_PSE
  options DISABLE_PG_G
   
   Just added them. I'll now build 20 buildworlds with those enabled.
  
  Let the list know if it does anything.  If Soren could also test,
  that would give a sample size.
 
 Sure, but I don't have the problem :) I can buildworld for days on my
 (heavily overclocked btw) Athlon with no problems at all...
 
 -Søren
 

-- 
Mark Santcroos  RIPE Network Coordination Centre
http://www.ripe.net/home/mark/  New Projects Group/TTM




Re: Memory corruption in CURRENT

2002-08-22 Thread Soeren Schmidt

It seems Mark Santcroos wrote:
 Hi,
 
 Can you revert back to the system compiler and also compile your kernel
 with these options and do some buildworlds again?

I already use the system compiler...

 On Thu, Aug 22, 2002 at 01:41:13PM +0200, Soeren Schmidt wrote:
  
  Sure, but I don't have the problem :) I can buildworld for days on my
  (heavily overclocked btw) Athlon with no problems at all...
  
  -Søren

-Søren




Re: Memory corruption in CURRENT

2002-08-22 Thread Martin Blapp


Hi,

As far as I can tell it is really rather easy to hide the bug.

All these options did hide the bug for some time:

- Use -g to compile the segfaulting binaries
- Remove -g to compile the segfaulting binaries
- Use a kernel compiler on another machine
- New hardware

The failures just disappeared after such an action, but showed
up again later after some buildworlds.

Martin

Martin Blapp, [EMAIL PROTECTED] [EMAIL PROTECTED]
--
ImproWare AG, UNIXSP  ISP, Zurlindenstrasse 29, 4133 Pratteln, CH
Phone: +41 061 826 93 00: +41 61 826 93 01
PGP: finger -l [EMAIL PROTECTED]
PGP Fingerprint: B434 53FC C87C FE7B 0A18 B84C 8686 EF22 D300 551E
--






Re: Memory corruption in CURRENT

2002-08-22 Thread Mark Santcroos

On Thu, Aug 22, 2002 at 01:55:42PM +0200, Soeren Schmidt wrote:
  Can you revert back to the system compiler and also compile your kernel
  with these options and do some buildworlds again?
 
 I already use the system compiler...

That's why the message was addressed to kt ;-)

-- 
Mark Santcroos  RIPE Network Coordination Centre
http://www.ripe.net/home/mark/  New Projects Group/TTM




Re: Memory corruption in CURRENT

2002-08-22 Thread Terry Lambert

Mark Santcroos wrote:
 On Thu, Aug 22, 2002 at 04:31:02AM -0700, Terry Lambert wrote:
  If it's a 3-for-3 workaround, then I probably need to take the
  discussion offline with Peter Wemm, and come up with a permanent
  fix.
 
 There was something with non-disclosure, am I right?

No.  If it's even the same problem, I figured out what was going
on by myself, not for an employer, by being obsessive about details,
and spending well over a week on my need to know "why?".  I had
*seen* the problem at two previous employers, but neither of them
was interested in me wasting time on the "why?", once there was
a workaround available.

No reason to give the information away to competitors, if there's
nothing in it but that they become more competitive...

-- Terry




Re: Memory corruption in CURRENT

2002-08-22 Thread Terry Lambert

Mark Santcroos wrote:
 On Thu, Aug 22, 2002 at 01:41:13PM +0200, Soeren Schmidt wrote:
  Sure, but I don't have the problem :) I can buildworld for days on my
  (heavily overclocked btw) Athlon with no problems at all...

 Can you revert back to the system compiler and also compile your kernel
 with these options and do some buildworlds again?

You are making the same mistake I did.

We should be asking Udo Schweigert and KT Sin.  Soeren doesn't have
any P4's or AMD's with the problem.

Hmmm... P4... AMD... have to wonder if someone looked at someone
else's paper during the test... ;^).

Here's the list of people I have, from a casual look at the archives
for this week:

Mark Santcroos
Martin Blapp (P4)
Udo Schweigert (P4, new compiler workaround)
KT Sin (P4, new compiler workaround)
Don Lewis (AMD Athlon)
Alexander Leidinger (P4)
David O'Brien (AMD)
Alfred Perlstein (?)

If we could get most or all of them to try the workaround, it would
provide useful information...   

-- Terry




Re: Memory corruption in CURRENT

2002-08-22 Thread Mark Santcroos

On Thu, Aug 22, 2002 at 05:21:54AM -0700, Terry Lambert wrote:
 Mark Santcroos wrote:
  On Thu, Aug 22, 2002 at 01:41:13PM +0200, Soeren Schmidt wrote:
   Sure, but I don't have the problem :) I can buildworld for days on my
   (heavily overclocked btw) Athlon with no problems at all...
 
  Can you revert back to the system compiler and also compile your kernel
  with these options and do some buildworlds again?
 
 You are making the same mistake I did.

No, you are making the same mistake as Soren ;-)
I did ask KT, look at the 'To:' field.

Mark


-- 
Mark Santcroos  RIPE Network Coordination Centre
http://www.ripe.net/home/mark/  New Projects Group/TTM




Re: Memory corruption in CURRENT

2002-08-22 Thread Terry Lambert

Mark Santcroos wrote:
 I was just giving a slight report, not yelling halleluja yet ;-)
 
 It's doing the 2nd buildworld now.
 
 Do you also want me to try to split up the disabling of the two options?

No.

Me saying to use both options was just me being lazy about
spending 2 days re-documenting the code from boot2 through
mi_startup(), to be sure nothing had crept in there.

-- Terry




Re: Memory corruption in CURRENT

2002-08-22 Thread Bosko Milekic


We have seen weird problems regarding the pmap PG_G related stuff (well
sort of, it has to do with PSE and PG_G) on ppro and pII chips
(apparently, this is not the case with at least Xeons) but what
happened, for the record, was this:

We would enable PSE and switch the pde corresponding to the first 4M
to the new entry describing a 4M page, instead of the one describing the
location of the ptes covering those 4M.  Then, what we would do is walk
all the ptes, including those old stale and useless ones that previously
described those first 4M and set the PG_G bit there (Note: we've already
set PG_G on our 4M page).  Normally, we don't really need to touch the
old ptes but we did it just because it was more convenient (i.e. a few
lines less code).  Oddly enough, on the ppro and pII what would happen
is that we would page fault on that page where we kept the old ptes
covering those first 4M, and only on that page!  The other ptes - the
ones that actually mattered - were all fine.  The ptes are mapped above
the 4M so I don't see how changing the pde for those first 4M would have
done anything.  To fix the problem, we (actually Peter) committed code
that basically just jumps beyond that first page of stale ptes when
setting the PG_G bit for the 4K pages, and since then, the problem seems
to have gone away.  Although we are not sure, this seems like a silicon
bug.
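
The workaround described above, skipping the first page of stale PTEs when setting PG_G, can be modelled in user space roughly as follows. This is a sketch under stated assumptions, not the committed kernel code: `make_4m_pde` and `mark_global` are hypothetical names, and the page tables are a plain array rather than real hardware tables. The bit values are the architectural i386 ones.

```c
#include <stddef.h>
#include <stdint.h>

#define PG_V   0x001u  /* entry is valid */
#define PG_RW  0x002u  /* writable */
#define PG_PS  0x080u  /* in a PDE: entry maps a 4M page */
#define PG_G   0x100u  /* global translation */
#define NPTEPG 1024u   /* 4-byte PTEs per 4K page-table page (covers 4M) */

/* Build a PDE that maps one global 4M page at physical address pa. */
static uint32_t make_4m_pde(uint32_t pa)
{
    return (pa & 0xffc00000u) | PG_PS | PG_G | PG_RW | PG_V;
}

/*
 * Set PG_G on every valid PTE, but leave the first skip_pages
 * page-table pages untouched.  With skip_pages == 1 the stale PTEs
 * that used to describe the first 4M are never written.
 * Returns the number of entries changed.
 */
static size_t mark_global(uint32_t *ptes, size_t npte, size_t skip_pages)
{
    size_t changed = 0;

    for (size_t i = skip_pages * NPTEPG; i < npte; i++) {
        if ((ptes[i] & PG_V) != 0 && (ptes[i] & PG_G) == 0) {
            ptes[i] |= PG_G;
            changed++;
        }
    }
    return changed;
}
```

In this model the only difference between the old and new behaviour is the starting index of the loop, which matches the description of the fix as "a few lines" of code.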

Since then, Peter had some work planned to load the kernel above the
first 4M to see if that fixed the problems.  I'm wondering if this
problem on the PIVs could be related.  Please let us know if the removal
of those two options really makes 5-10 buildworlds in a row work out for
you.

Regards,
Bosko

On Thu, Aug 22, 2002 at 01:34:11PM +0200, Mark Santcroos wrote:
 On Thu, Aug 22, 2002 at 04:23:46AM -0700, Terry Lambert wrote:
  Ugh!  Wait until it seems to work for a statistically significant
  sample size, and for more than one person before calling it happy!
  
  Also, I'm not sure looking at the code whether or not the PG_G is
  truly significant, or just perturbs the workaround.  The problem
  I've referred to in my hunch here is actually related solely to
  the PSE, but with the recent code reorganization in locore.s, etc.,
  it could have become more significant.
 
 I was just giving a slight report, not yelling halleluja yet ;-)
 
 It's doing the 2nd buildworld now.
 
 Do you also want me to try to split up the disabling of the two options?
 
 Mark
 
 -- 
 Mark Santcroos  RIPE Network Coordination Centre
 http://www.ripe.net/home/mark/  New Projects Group/TTM
 
 

-- 
Bosko Milekic * [EMAIL PROTECTED] * [EMAIL PROTECTED]





Re: Memory corruption in CURRENT

2002-08-22 Thread Martin Blapp


Hi Terry,

  options DISABLE_PSE
  options DISABLE_PG_G

I'm now at buildworld IV; since I have these options compiled in,
the bug has not shown up again.

Another side effect: before that I could not even make -j 10 buildworld;
that ended with a page fault somewhere in pmap... This has gone away too.

But let's see. I'll build 20 worlds in a row.

Martin

