Re: [Patches avail?] Re: MMAP() in STABLE/CURRENT ...

1999-10-12 Thread Warner Losh

In message [EMAIL PROTECTED] Matthew Dillon writes:
: Forget the drilling!  Blood conducts electricity... simply *installing*
: a motherboard in those fraggin sharp-edged sheet metal chassis is enough!

I've had one or two cheapo mo-bos that haven't worked at 100MHz after
spattering human blood on them from said sharp-edged sheet metal
corners...

Warner


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message



Re: [Patches avail?] Re: MMAP() in STABLE/CURRENT ...

1999-10-08 Thread David Malone

On Thu, Oct 07, 1999 at 10:09:23AM -0700, Matthew Dillon wrote:

 Intel's ECC implementation is not perfect (1), but it's good enough to 
 catch these sorts of problems.

Just as an interesting side note, we had a motherboard which
supported ECC RAM and had ECC RAM in it and which was crashing.
Eventually we discovered that every 8th byte in page-aligned 4KB
chunks was becoming corrupted.

We replaced the RAM and saw no improvement, and then got a replacement
motherboard. As far as I could see the only significant difference
between the new and old motherboard was the addition of a heat sink
to the memory controller chip. The machine is now perfectly happy.

So it seems that ECC isn't enough if your memory controller is too
hot!

David.





Re: [Patches avail?] Re: MMAP() in STABLE/CURRENT ...

1999-10-08 Thread Rodney W. Grimes

 On Thu, Oct 07, 1999 at 10:09:23AM -0700, Matthew Dillon wrote:
 
  Intel's ECC implementation is not perfect (1), but it's good enough to 
  catch these sorts of problems.
 
 Just as an interesting side note, we had a motherboard which
 supported ECC RAM and had ECC RAM in it and which was crashing.
 Eventually we discovered that every 8th byte in page-aligned 4KB
 chunks was becoming corrupted.
 
 We replaced the RAM and saw no improvement, and then got a replacement
 motherboard. As far as I could see the only significant difference
 between the new and old motherboard was the addition of a heat sink
 to the memory controller chip. The machine is now perfectly happy.
 
 So it seems that ECC isn't enough if your memory controller is too
 hot!
 
 ECC doesn't protect against certain types of motherboard address line
 errors (since although the ECC is correct, the selected address is wrong, so
 the data is wrong). There's parity protection on parts of the CPU
 address bus, but I don't believe there is any protection between the memory
 controller and the DIMMs for this type of problem. A handful of metal
 filings is also known to cause problems when it is dispersed properly. :-)

You're supposed to remove the motherboard before drilling holes in your
chassis!!!  :-).  And be careful when you strip them there screws out,
that little bit of metal filings is enough to throw one for some
real loops.  A good blast of 60psi dry air does wonders for ``fixing''
some of these really strange problems :-)

Now if I could just find something that would get sheetrock sanding
dust out of tape drive mechanisms.  A dunk in the freon tank often
works, but that also cleans out all the lubrication :-).

-- 
Rod Grimes - KD7CAX - (RWG25)[EMAIL PROTECTED]





Re: [Patches avail?] Re: MMAP() in STABLE/CURRENT ...

1999-10-08 Thread David Greenman

On Thu, Oct 07, 1999 at 10:09:23AM -0700, Matthew Dillon wrote:

 Intel's ECC implementation is not perfect (1), but it's good enough to 
 catch these sorts of problems.

Just as an interesting side note, we had a motherboard which
supported ECC RAM and had ECC RAM in it and which was crashing.
Eventually we discovered that every 8th byte in page-aligned 4KB
chunks was becoming corrupted.

We replaced the RAM and saw no improvement, and then got a replacement
motherboard. As far as I could see the only significant difference
between the new and old motherboard was the addition of a heat sink
to the memory controller chip. The machine is now perfectly happy.

So it seems that ECC isn't enough if your memory controller is too
hot!

   ECC doesn't protect against certain types of motherboard address line
errors (since although the ECC is correct, the selected address is wrong, so
the data is wrong). There's parity protection on parts of the CPU
address bus, but I don't believe there is any protection between the memory
controller and the DIMMs for this type of problem. A handful of metal
filings is also known to cause problems when it is dispersed properly. :-)
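
The address line failure described above can be sketched with a toy
model. This is only an illustration under stated assumptions: a single
even-parity bit stands in for real ECC, the "memory" has just two cells,
and the names parity, read_cell and demo are made up for this example.

```sh
#!/bin/sh
# parity WORD: print the even-parity bit (0 or 1) of a non-negative integer.
# A single parity bit is a stand-in for real ECC here (assumption).
parity() {
    w=$1 p=0
    while [ "$w" -gt 0 ]; do
        p=$(( p ^ (w & 1) ))
        w=$(( w >> 1 ))
    done
    echo "$p"
}

# read_cell ADDR: print "DATA CHECKBIT" for a two-cell toy memory.
# Each cell carries a check bit that is correct for its own data.
read_cell() {
    case $1 in
        0) echo "165 $(parity 165)" ;;
        1) echo "60 $(parity 60)" ;;
    esac
}

# demo: the CPU wants cell 0, but a faulty address line selects cell 1.
demo() {
    wanted=0
    selected=1
    set -- $(read_cell "$selected")
    # The check covers only the data that came back, not the address used.
    if [ "$(parity "$1")" -eq "$2" ]; then
        echo "check passed, but cell $selected was read instead of cell $wanted"
    fi
}

demo
```

The check bit travels with the data, so a word fetched from the wrong
cell still verifies cleanly. Nothing in the stored check encodes which
address the word belongs to.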

-DG

David Greenman
Co-founder/Principal Architect, The FreeBSD Project - http://www.freebsd.org
Creator of high-performance Internet servers - http://www.terasolutions.com
Pave the road of life with opportunities.





Re: [Patches avail?] Re: MMAP() in STABLE/CURRENT ...

1999-10-08 Thread Matthew Dillon

:ECC doesn't protect against certain types of motherboard address line
: errors (since although the ECC is correct, the selected address is wrong, so
: the data is wrong). There's parity protection on parts of the CPU
: address bus, but I don't believe there is any protection between the memory
: controller and the DIMMs for this type of problem. A handful of metal
: filings is also known to cause problems when it is dispersed properly. :-)
:
:You're supposed to remove the motherboard before drilling holes in your
:chassis!!!  :-).  And be careful when you strip them there screws out,
:that little bit of metal filings is enough to throw one for some
:
:Rod Grimes - KD7CAX - (RWG25)[EMAIL PROTECTED]

Forget the drilling!  Blood conducts electricity... simply *installing*
a motherboard in those fraggin sharp-edged sheet metal chassis is enough!

-Matt
Matthew Dillon 
[EMAIL PROTECTED]





Re: [Patches avail?] Re: MMAP() in STABLE/CURRENT ...

1999-10-07 Thread Matthew Dillon

:Hi again,
:
: Whoops: a few hours after downgrading to 3.1-STABLE I had a double fault
:error (strange, it didn't look like a normal panic screen, just the
:message and the content of three registers, then the syncing disks
:message). It seems that I might be wrong about hardware not being the
:problem.
:
: I've changed the motherboard, CPU, memory and the video card and I'm
:waiting to see how long it's going to stay up (I have 1 day 1 hour uptime so
:far)...
:
: Thanks,
: Ady (@warpnet.ro)

One thing I do on all 'server' class machines that I buy (and this is
also something that BEST instituted as policy in 1998) is to only buy
motherboards with ECC support and only buy ECC memory to go along with
that support.  If you are using a non-ECC motherboard or non-ECC memory
I would heartily recommend that you adopt the same policy.  Not that your
problem is necessarily memory related, but I've found that memory-related
problems account for at least 80% of the 'difficult to locate' hardware 
problems that normally occur with PC technology.

ECC gives you protection not only against hardware faults, but it also
protects you against re-marked dynamic RAM chips and processors by
catching the timing errors that usually occur with such chips relatively
soon after purchase rather than weeks or months down the line.  Being
the commodity it is, memory is the most likely item on the motherboard
to be out of spec.

Intel's ECC implementation is not perfect (1), but it's good enough to 
catch these sorts of problems.

note 1: Intel doesn't implement memory scrubbing properly outside of the
Xeon line and FreeBSD does not scrub memory either.  Scrubbing is a
method of preventing bit errors from building up in memory by regenerating
the ECC bits with a memory read followed by a memory write of the same
data.  Outside of the Xeon chipsets the OS must issue a read followed by
a write.  With the Xeon chipsets the OS need only issue a read and hardware
will automatically rewrite a correction if it finds a bit error.  This
information is 6 months old so the situation may have changed.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]





Re: [Patches avail?] Re: MMAP() in STABLE/CURRENT ...

1999-10-07 Thread Adrian Penisoara

Hi again,

On Wed, 6 Oct 1999, Adrian Penisoara wrote:

 hi again,
 
 On Tue, 5 Oct 1999, Matthew Dillon wrote:
 
  : The problem is that the machine is completely locked, I can't get into
  :the debugger with CTRL-ALT-ESC; no panics so there are no coredumps
  :caught. Any advice? Could you escape to the debugger when you were hit
  :by these bugs?
  
  If it's completely locked up and ctl-alt-esc doesn't work (and normally
  does work - try it on a working system to make sure that you've compiled
  in the appropriate DDB options), and you aren't in an X display
  (ctl-alt-esc isn't useful when done from an X display)... then your
  lockup problem is unrelated to mmap.
 
  No X on the machine, but CTRL-ALT-ESC doesn't work.
  And another thing: I tried the MMAP "exploit"/test that was floating
 around at the time on another machine (3.2-STABLE, SMP with 2 Pentiums) and
 it does lock the machine but you can switch consoles and escape to the
 debugger; on the production server (K6-2 300) everything goes dead when 
 it happens (I haven't tried the MMAP test)...
 
  You're probably right, it's not the MMAP bug; but it's not faulty
 hardware -- I'll have undeniable proof in a few days, I have downgraded
 to 3.1-STABLE as of 20th April...
 

 Whoops: a few hours after downgrading to 3.1-STABLE I had a double fault
error (strange, it didn't look like a normal panic screen, just the
message and the content of three registers, then the syncing disks
message). It seems that I might be wrong about hardware not being the
problem.

 I've changed the motherboard, CPU, memory and the video card and I'm
waiting to see how long it's going to stay up (I have 1 day 1 hour uptime so
far)...

 Thanks,
 Ady (@warpnet.ro)






Re: [Patches avail?] Re: MMAP() in STABLE/CURRENT ...

1999-10-07 Thread Rodney W. Grimes

 :Hi again,
 :
 : Whoops: a few hours after downgrading to 3.1-STABLE I had a double fault
 :error (strange, it didn't look like a normal panic screen, just the
 :message and the content of three registers, then the syncing disks
 :message). It seems that I might be wrong about hardware not being the
 :problem.
 :
 : I've changed the motherboard, CPU, memory and the video card and I'm
 :waiting to see how long it's going to stay up (I have 1 day 1 hour uptime so
 :far)...
 :
 : Thanks,
 : Ady (@warpnet.ro)
 
 One thing I do on all 'server' class machines that I buy (and this is
 also something that BEST instituted as policy in 1998) is to only buy
 motherboards with ECC support and only buy ECC memory to go along with
 that support.  If you are using a non-ECC motherboard or non-ECC memory
 I would heartily recommend that you adopt the same policy.  Not that your
 problem is necessarily memory related, but I've found that memory-related
 problems account for at least 80% of the 'difficult to locate' hardware 
 problems that normally occur with PC technology.

And to add support to this, AAI, the oldest vendor of FreeBSD-specific
systems, implemented a similar policy on all systems sold sometime in 1992.
But at that time ECC was not available, so it was ``parity memory is required,
and the chipset must support it''.  As soon as ECC chipsets hit the market
the policy was changed to reflect this.  We also require that all memory
we purchase be backed by a no-fuss lifetime warranty, which we pass on to
the end user.

I strongly recommend that anyone running Unix on a PC do the same; it
will save you in the long run.  Since implementing the policies we have
seen nearly zero memory-related problems after burn-in with our systems.

-- 
Rod Grimes - KD7CAX - (RWG25)[EMAIL PROTECTED]





Re: [Patches avail?] Re: MMAP() in STABLE/CURRENT ...

1999-10-06 Thread Adrian Penisoara

hi again,

On Tue, 5 Oct 1999, Matthew Dillon wrote:

 : The problem is that the machine is completely locked, I can't get into
 :the debugger with CTRL-ALT-ESC; no panics so there are no coredumps
 :caught. Any advice? Could you escape to the debugger when you were hit
 :by these bugs?
 
 If it's completely locked up and ctl-alt-esc doesn't work (and normally
 does work - try it on a working system to make sure that you've compiled
 in the appropriate DDB options), and you aren't in an X display
 (ctl-alt-esc isn't useful when done from an X display)... then your
 lockup problem is unrelated to mmap.

 No X on the machine, but CTRL-ALT-ESC doesn't work.
 And another thing: I tried the MMAP "exploit"/test that was floating
around at the time on another machine (3.2-STABLE, SMP with 2 Pentiums) and
it does lock the machine but you can switch consoles and escape to the
debugger; on the production server (K6-2 300) everything goes dead when 
it happens (I haven't tried the MMAP test)...

 You're probably right, it's not the MMAP bug; but it's not faulty
hardware -- I'll have undeniable proof in a few days, I have downgraded
to 3.1-STABLE as of 20th April...

 
 If you are running an X display on this box, you may be able to get
 more information in regards to the crash if you turn off X.
 
 :
 : I have: squid (20Mb), nntpcached (17Mb SIZE, 1Mb RES), apache, named,
 :MFS, a few PPP processes and the rest of the standard menu.
 
 The only programs known to cause the swap problem are innd and innxmit,
 both part of the inn news system.

 No such thing (yet); and I heard that innd-stable is OK (I have
INND-stable running on that SMP box and had no problems with it) ?...

 
 : OK, how about some workarounds? I can't wait any longer for this to be
 :fixed; my situation has become critical. Should I downgrade to 3.1-RELEASE
 :(which didn't behave this way) or dare I try 4.0-CURRENT (are these
 :problems present in -current too?)?
 :
 : Thanks,
 : Ady (@warpnet.ro)
 
 If the machine is locking up to the point where you cannot even drop
 into DDB, this bug is not related to the known mmap() bugs.
 
 At this point I have no idea what might be causing your lockup problem.

 Neither do I, dammit... :-(

 
   -Matt
   Matthew Dillon 
   [EMAIL PROTECTED]
 

 I'll get back to you in a few days.
 Thanks a lot,
 Ady (@warpnet.ro)






Re: [Patches avail?] Re: MMAP() in STABLE/CURRENT ...

1999-10-05 Thread Adrian Penisoara

Hi,

On Mon, 4 Oct 1999, Adrian Penisoara wrote:

 
  I have a -stable production server that keeps (solidly) locking up pretty
 often (I don't get over 3 days of uptime). If you need details just let me
 know.

 Just to let you know: syncing every second in a loop like this:

   while true
   do sync ; sleep 1
   done

turns out not to be a workaround -- the system still locks up. I tried
this as per Matthew's suggestion in an e-mail on the list.

BTW: I'll downgrade to 3.1-STABLE as of approx. the end of April; I'll let
you know if it's stable for me.

 Thanks,
 Ady (@warpnet.ro)






Re: [Patches avail?] Re: MMAP() in STABLE/CURRENT ...

1999-10-05 Thread Matthew Dillon

: The problem is that the machine is completely locked, I can't get into
:the debugger with CTRL-ALT-ESC; no panics so there are no coredumps
:caught. Any advice? Could you escape to the debugger when you were hit
:by these bugs?

If it's completely locked up and ctl-alt-esc doesn't work (and normally
does work - try it on a working system to make sure that you've compiled
in the appropriate DDB options), and you aren't in an X display
(ctl-alt-esc isn't useful when done from an X display)... then your
lockup problem is unrelated to mmap.
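
One way to check the "appropriate DDB options" part is to look for an
uncommented "options DDB" line in the kernel config file the running
kernel was built from. A minimal sketch follows. The has_option helper
and the MYKERNEL config name are hypothetical, and the match assumes the
usual "options NAME" layout of a kernel config file.

```sh
#!/bin/sh
# has_option FILE NAME: succeed if kernel config FILE contains an
# uncommented "options NAME" line. Hypothetical helper for illustration.
has_option() {
    grep -Eq "^[[:space:]]*options[[:space:]]+$2([[:space:]]|#|\$)" "$1"
}

# Example use (MYKERNEL is an assumed config name, not from this thread):
#   if has_option /usr/src/sys/i386/conf/MYKERNEL DDB; then
#       echo "DDB is compiled in"
#   else
#       echo "no DDB option found; ctl-alt-esc will do nothing"
#   fi
```

A commented-out line such as "#options DDB" and a longer name such as
"DDB_UNATTENDED" are both rejected, so the check only passes when the
debugger option itself is enabled.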

If you are running an X display on this box, you may be able to get
more information in regards to the crash if you turn off X.

:
: I have: squid (20Mb), nntpcached (17Mb SIZE, 1Mb RES), apache, named,
:MFS, a few PPP processes and the rest of the standard menu.

The only programs known to cause the swap problem are innd and innxmit,
both part of the inn news system.

: OK, how about some workarounds? I can't wait any longer for this to be
:fixed; my situation has become critical. Should I downgrade to 3.1-RELEASE
:(which didn't behave this way) or dare I try 4.0-CURRENT (are these
:problems present in -current too?)?
:
: Thanks,
: Ady (@warpnet.ro)

If the machine is locking up to the point where you cannot even drop
into DDB, this bug is not related to the known mmap() bugs.

At this point I have no idea what might be causing your lockup problem.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]






Re: [Patches avail?] Re: MMAP() in STABLE/CURRENT ...

1999-10-05 Thread Adrian Penisoara

Hi,

On Mon, 4 Oct 1999, Matthew Dillon wrote:

 : Excuse my intrusion, but could you be so kind as to tell me whether you
 :had the time to build patches for these MMAP-related freezes? If not,
 :could you recommend some workarounds?
 :
 : I have a -stable production server that keeps (solidly) locking up pretty
 :often (I don't get over 3 days of uptime). If you need details just let me
 :know.
 :
 :-Matt
 :Matthew Dillon 
 :[EMAIL PROTECTED]
 : Thanks,
 : Ady (@warpnet.ro)
 
 Well, your lockups may or may not be related to the remaining mmap
 problems.  They could be related to the swap fragmentation problems
 in stable, or they could be related to something else entirely.  In
 order to determine the cause of your lockup problems, some additional
 information is necessary.  The easiest way to get the information is
 to enable DDB and kernel core dumps so you can panic the machine from

 The problem is that the machine is completely locked, I can't get into
the debugger with CTRL-ALT-ESC; no panics so there are no coredumps
caught. Any advice? Could you escape to the debugger when you were hit
by these bugs?

 the console and get a core.  Once you have the core
 'cd /var/crash; ps -M vmcore.X -N kernel.X' (where X is the latest
 dump number) can be used to determine what the processes were doing
 when they locked up.
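
Picking "the latest dump number" can be scripted. The sketch below
assumes the dumps are saved as vmcore.N with a matching kernel.N in the
same directory, as in the command above. The helper names latest_dump
and ps_command are hypothetical, not FreeBSD tools.

```sh
#!/bin/sh
# latest_dump DIR: print the highest N for which DIR/vmcore.N exists.
# Prints nothing and returns non-zero if no numbered dump is found.
latest_dump() {
    best=
    for f in "$1"/vmcore.*; do
        [ -e "$f" ] || continue          # the glob matched nothing
        n=${f##*/vmcore.}                # keep only the suffix after "vmcore."
        case $n in
            ''|*[!0-9]*) continue ;;     # skip non-numeric suffixes
        esac
        if [ -z "$best" ] || [ "$n" -gt "$best" ]; then
            best=$n
        fi
    done
    [ -n "$best" ] && echo "$best"
}

# ps_command DIR: print the ps invocation for the newest dump in DIR.
ps_command() {
    n=$(latest_dump "$1") || return 1
    echo "cd $1 && ps -M vmcore.$n -N kernel.$n"
}

# Example use:
#   ps_command /var/crash
```

The function only prints the command rather than running it, so it can
be reviewed before being pasted into a shell on the crashed machine.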
 
 The two most common VM-related lockups in -stable are:
 
 (1) swap metadata fragmentation due to paging in the face of large 
   running processes (system runs out of KVM), and

 I have: squid (20Mb), nntpcached (17Mb SIZE, 1Mb RES), apache, named,
MFS, a few PPP processes and the rest of the standard menu.

 
 (2) write()ing the mmap'd area of one file descriptor to another.
 

 OK, how about some workarounds? I can't wait any longer for this to be
fixed; my situation has become critical. Should I downgrade to 3.1-RELEASE
(which didn't behave this way) or dare I try 4.0-CURRENT (are these
problems present in -current too?)?

   -Matt
   Matthew Dillon 
   [EMAIL PROTECTED]
 

 Thanks,
 Ady (@warpnet.ro)






Re: [Patches avail?] Re: MMAP() in STABLE/CURRENT ...

1999-10-04 Thread The Hermit Hacker

On Mon, 4 Oct 1999, Adrian Penisoara wrote:

 
  Excuse my intrusion, but could you be so kind as to tell me whether you
 had the time to build patches for these MMAP-related freezes? If not,
 could you recommend some workarounds?

Doubling the RAM from 384 to 768 MB appears to have fixed it for me...

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 


