A small comment: stress testing is cumulative only if the underlying
system has no recovery mechanism. (An understanding of this in detail
requires non-equilibrium statistical mechanics, but it can be summarized
with non-equilibrium "thermodynamics".) My experience with failing
electronics and magnetics -- depending upon the exact failure mode -- is
that non-interrupted stress testing is better than interrupted in terms
of finding failures. A simple example: suppose a failure mode is
temperature dependent, and temperature depends upon the amount of work
being done. An interrupted but cumulative stress test might never reach
the "critical" temperature, because the hardware cools down during each
pause, whereas a continuous stress test might.
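
To make the temperature example concrete, here is a minimal sketch of a
toy thermal model. Everything in it is an assumption chosen for
illustration, not a measurement of any real machine: the ambient
temperature, the heating and cooling coefficients, the "critical"
threshold, and the function name peak_temperature.

```python
def peak_temperature(schedule, ambient=25.0, heat_per_min=1.0, cool_coeff=0.02):
    """Return the highest temperature reached over a test schedule.

    schedule is a list of (minutes, under_load) pairs. The model is a
    simple first-order one (all constants are hypothetical): each minute
    the part gains heat_per_min degrees while under load and loses
    cool_coeff times its excess over ambient.
    """
    temp = ambient
    peak = temp
    for minutes, under_load in schedule:
        for _ in range(minutes):
            heating = heat_per_min if under_load else 0.0
            temp += heating - cool_coeff * (temp - ambient)
            peak = max(peak, temp)
    return peak


CRITICAL = 70.0  # hypothetical temperature at which the fault appears

# Both schedules put the part under load for 480 minutes in total.
continuous = [(480, True)]
interrupted = [(60, True), (60, False)] * 8

print("continuous reaches critical: ", peak_temperature(continuous) >= CRITICAL)
print("interrupted reaches critical:", peak_temperature(interrupted) >= CRITICAL)
```

With these made-up numbers the continuous run settles near its
steady-state temperature, above the threshold, while the interrupted run
cools off during every pause and stays below it, even though the
cumulative time under load is identical.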
Yasha Karant
On 04/24/2013 08:03 AM, Joseph Areeda wrote:
Thanks for the tips, Konstantin,
I assume that your recommendation for 24 hrs of memtest is cumulative
and I can probably see the same results starting it each night when I
quit for the day.
When I mentioned SMART, I was talking about the self-tests, not the status
that comes up. I've also copied large files around and checked their
md5sums.
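
For reference, a copy-and-compare check of this kind can be sketched as
follows. This is only an illustration, not the exact procedure used
above; the helper names file_md5 and copy_and_verify are hypothetical.
One caveat: reading a file back right after writing it may be served
from the operating system's cache rather than from the disk itself, so a
match here is weaker evidence than it looks.

```python
import hashlib
import shutil


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src, dst):
    """Copy src to dst, then report whether the two checksums match."""
    shutil.copyfile(src, dst)
    return file_md5(src) == file_md5(dst)
```

A False result from copy_and_verify points at corruption somewhere
between the read and the write (memory, controller, or disk), but it
does not say which.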
I played with LiveCD for 4 or 5 hours today; much of that was spent trying
to install it on a different spinning hard drive.
I did see one time when the SSD was shown in the disk utility but all
the partitions were zero length. That's where my root directory used to be.
I also found that the nvidia drivers in ELREPO don't seem to work with
6.4. I seem to be able to run fine (at least for a while) unless I
install kmod-nvidia; then I get a kernel panic on the next reboot (3
times until I tracked it down). It says something like "not syncing
attempt xxx (can't read my writing) PID 1 comm init not tainted
2.6.32.258.2.1". That's another problem, I think.
Right now I suspect, not necessarily in order:
* Bad SSD. Run time is reported as 1.8 years. I did have /usr,
  /usr/local, /tmp, swap and /home on spinning media but...
* Bad memory: still a good possibility.
* Some insidious incompatibility with all packages from multiple
  repos. I really hope it's not that; I don't load much I don't need.
And as for finding a real computer repairman, let me know if you have
one in Los Angeles. This is similar to a problem I had with an iMac.
It took three trips to convince the geniuses at the store that something
was wrong, and that was after about an hour each time with the phone
support people. That one turned out to be a flaky memory DIMM that
passed all the quick diagnostics.
Oh well, the saga continues. It's nice to have a group to go to for ideas.
Thank you all.
Joe
On 04/23/2013 04:20 PM, Konstantin Olchanski wrote:
On Tue, Apr 23, 2013 at 11:44:22AM -0700, Joseph Areeda wrote:
I'm having this strange behavior that I think is a hardware problem ...
* System freezes, mouse and keyboard dead, sshd unresponsive sometimes
First action is to run memtest86 (Q: which one? google finds several.
A: all of them).
Run memtest86 for 24 hours at least - if it reports memory errors, hangs,
freezes or machine turns off, you definitely have a hardware problem.
Suspect parts are in this order: RAM, power supply, CPU socket (bent
pins), mobo, CPU.
If memtest86 runs fine for 24 hours and more, there *still* could be a
hardware problem. (memtest86 does not test the video, the disk, the
network and the usb interfaces).
disk utility show ... SMART [is] fine.
SMART "health report" is useless. I had dead disks report "SMART OK" and perfectly
functional disks report "SMART Failure, replace your disk now".
This is free advice. For advice that would actually get your computer
working again, you would want to hire a proper computer repairman.