Re: analyzing freebsd core dumps

2008-10-08 Thread Mister Olli
hi...

thanks for the feedback on this topic.
the first step, cleaning the machine and checking all connectors, was
done yesterday. I hope that this will fix the problem, and that it's not
some kind of hardware failure.

running tests with memtest is quite a problem, since the machine has high
availability requirements. taking it offline for nearly one hour for
cleaning and checking during our company's working day was already a pain.
6 hours or more of RAM tests is not possible.

is there some other way to detect a hardware failure, with a less
time-consuming tool or process?

greetz
olli

On Monday, 06.10.2008, at 13:45 -0400, Jerry McAllister wrote:
 On Mon, Oct 06, 2008 at 10:18:09AM -0700, Jeremy Chadwick wrote:
 
  On Mon, Oct 06, 2008 at 08:04:07AM +0200, Mister Olli wrote:
   hi list...
   
   I have a freebsd machine that has been running for more than 6 months
   without any problems.
   the machine's only service is to be an openvpn gateway for a handful of
   users.
   
   2 weeks ago the first problems started. the openvpn exited with signal
   11 and 4 and core dumps were written.
   
   the same happened yesterday with the postfix/cleanup process, and then
   suddenly the machine rebooted without any further log messages.
   
   what is the best way to troubleshoot the cause of this problem?
  
  Signal 11 happening out of nowhere on machines which have been
  running fine, most of the time, is a sign of hardware failure (usually
  RAM, but sometimes motherboard or PSU).  The fact you got a reboot is
  also further evidence of this.
  
  http://www.freebsd.org/doc/en/books/faq/troubleshoot.html#SIGNAL11
  
  I would recommend taking the machine offline and running something like
  memtest86+ on it for 6-7 hours.  Any errors seen are a pretty good sign
  that you should replace the memory or the motherboard.  You can
  download ISO or floppy disk images here:
  
  http://www.memtest.org/
  
  Bottom line is that this is probably a hardware issue.
 
 Could also be the contacts if it is not the actual memory or board.
 A marginal contact where something is plugged in can over time
 build up deposits that make it fail.   Of course, this is still
 a hardware problem, but can often be cured by reseating everything.
 If it is bad enough, it could also be exacerbated by reseating 
 everything.
 
 jerry
 
  
  -- 
  | Jeremy Chadwick   jdc at parodius.com |
  | Parodius Networking   http://www.parodius.com/ |
  | UNIX Systems Administrator  Mountain View, CA, USA |
  | Making life hard for others since 1977.  PGP: 4BD6C0CB |
  
  ___
  freebsd-questions@freebsd.org mailing list
  http://lists.freebsd.org/mailman/listinfo/freebsd-questions
  To unsubscribe, send any mail to [EMAIL PROTECTED]



Re: analyzing freebsd core dumps

2008-10-08 Thread Jeremy Chadwick
On Wed, Oct 08, 2008 at 08:30:12AM +0200, Mister Olli wrote:
 hi...
 
 thanks for the feedback on this topic.
 the first step, cleaning the machine and checking all connectors, was
 done yesterday. I hope that this will fix the problem, and that it's not
 some kind of hardware failure.
 
 running tests with memtest is quite a problem, since the machine has high
 availability requirements. taking it offline for nearly one hour for
 cleaning and checking during our company's working day was already a pain.
 6 hours or more of RAM tests is not possible.
 
 is there some other way to detect a hardware failure, with a less
 time-consuming tool or process?

Yes -- you start replacing hardware one piece at a time until the
problem goes away.  That will also require downtime, quite regularly,
and waste money.

So to answer your question: no, there is no way to easily track down the
source of a hardware failure, or determine what piece has failed (if
any).  This is completely 100% normal when it comes to computers,
especially x86 PCs.  Anyone who has worked in the IT field for many
years knows this.  :-)

I'm amazed that in this day and age, any company would have a single
host as a single-point-of-failure.  You can't take this machine down
for troubleshooting, but you have no failover available.  The company
has put themselves into this situation.

-- 
| Jeremy Chadwick   jdc at parodius.com |
| Parodius Networking   http://www.parodius.com/ |
| UNIX Systems Administrator  Mountain View, CA, USA |
| Making life hard for others since 1977.  PGP: 4BD6C0CB |



analyzing freebsd core dumps

2008-10-06 Thread Mister Olli
hi list...

I have a freebsd machine that has been running for more than 6 months
without any problems.
the machine's only service is to be an openvpn gateway for a handful of
users.

2 weeks ago the first problems started. the openvpn exited with signal
11 and 4 and core dumps were written.

the same happened yesterday with the postfix/cleanup process, and then
suddenly the machine rebooted without any further log messages.

what is the best way to troubleshoot the cause of this problem?

greetz
olli



Re: analyzing freebsd core dumps

2008-10-06 Thread Jeremy Chadwick
On Mon, Oct 06, 2008 at 08:04:07AM +0200, Mister Olli wrote:
 hi list...
 
 I have a freebsd machine that has been running for more than 6 months
 without any problems.
 the machine's only service is to be an openvpn gateway for a handful of
 users.
 
 2 weeks ago the first problems started. the openvpn exited with signal
 11 and 4 and core dumps were written.
 
 the same happened yesterday with the postfix/cleanup process, and then
 suddenly the machine rebooted without any further log messages.
 
 what is the best way to troubleshoot the cause of this problem?

Signal 11 happening out of nowhere on machines which have been
running fine, most of the time, is a sign of hardware failure (usually
RAM, but sometimes motherboard or PSU).  The fact you got a reboot is
also further evidence of this.
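
One quick check that needs no downtime is to see whether the crashes are
spread across unrelated programs, which points at hardware rather than a bug
in one program. The following is only a rough sketch, not a diagnostic tool.
It assumes the kernel's usual "pid N (name), uid N: exited on signal N"
lines end up in /var/log/messages. The sample lines, the host name "gw", and
the helper name tally_crashes are made up for illustration.

```python
import re
from collections import Counter

# Line format the FreeBSD kernel logs when a process dies on a signal
# (assumed here; check it against your own /var/log/messages).
CRASH_RE = re.compile(r"pid (\d+) \(([^)]+)\), uid (\d+): exited on signal (\d+)")

# Names for the two signals mentioned in the original post.
SIGNAL_NAMES = {4: "SIGILL", 11: "SIGSEGV"}


def tally_crashes(lines):
    """Count crashes per (process name, signal number)."""
    counts = Counter()
    for line in lines:
        match = CRASH_RE.search(line)
        if match:
            counts[(match.group(2), int(match.group(4)))] += 1
    return counts


# Made-up sample lines for illustration only; not from the real machine.
sample = [
    "Oct  5 03:12:44 gw kernel: pid 812 (openvpn), uid 0: exited on signal 11 (core dumped)",
    "Oct  5 09:40:02 gw kernel: pid 915 (openvpn), uid 0: exited on signal 4 (core dumped)",
    "Oct  5 21:07:31 gw kernel: pid 1290 (cleanup), uid 125: exited on signal 11 (core dumped)",
]

for (proc, sig), n in sorted(tally_crashes(sample).items()):
    print(f"{proc}: {n} x signal {sig} ({SIGNAL_NAMES.get(sig, 'unknown')})")
```

To run it against the real log, pass an open file instead of the sample
list, e.g. tally_crashes(open("/var/log/messages")).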

http://www.freebsd.org/doc/en/books/faq/troubleshoot.html#SIGNAL11

I would recommend taking the machine offline and running something like
memtest86+ on it for 6-7 hours.  Any errors seen are a pretty good sign
that you should replace the memory or the motherboard.  You can
download ISO or floppy disk images here:

http://www.memtest.org/

Bottom line is that this is probably a hardware issue.

-- 
| Jeremy Chadwick   jdc at parodius.com |
| Parodius Networking   http://www.parodius.com/ |
| UNIX Systems Administrator  Mountain View, CA, USA |
| Making life hard for others since 1977.  PGP: 4BD6C0CB |



Re: analyzing freebsd core dumps

2008-10-06 Thread Jerry McAllister
On Mon, Oct 06, 2008 at 10:18:09AM -0700, Jeremy Chadwick wrote:

 On Mon, Oct 06, 2008 at 08:04:07AM +0200, Mister Olli wrote:
  hi list...
  
  I have a freebsd machine that has been running for more than 6 months
  without any problems.
  the machine's only service is to be an openvpn gateway for a handful of
  users.
  
  2 weeks ago the first problems started. the openvpn exited with signal
  11 and 4 and core dumps were written.
  
  the same happened yesterday with the postfix/cleanup process, and then
  suddenly the machine rebooted without any further log messages.
  
  what is the best way to troubleshoot the cause of this problem?
 
 Signal 11 happening out of nowhere on machines which have been
 running fine, most of the time, is a sign of hardware failure (usually
 RAM, but sometimes motherboard or PSU).  The fact you got a reboot is
 also further evidence of this.
 
 http://www.freebsd.org/doc/en/books/faq/troubleshoot.html#SIGNAL11
 
 I would recommend taking the machine offline and running something like
 memtest86+ on it for 6-7 hours.  Any errors seen are a pretty good sign
 that you should replace the memory or the motherboard.  You can
 download ISO or floppy disk images here:
 
 http://www.memtest.org/
 
 Bottom line is that this is probably a hardware issue.

Could also be the contacts if it is not the actual memory or board.
A marginal contact where something is plugged in can over time
build up deposits that make it fail.   Of course, this is still
a hardware problem, but can often be cured by reseating everything.
If it is bad enough, it could also be exacerbated by reseating 
everything.

jerry

 