Mark Knecht writes:
> Do I just watch the logs looking for problems? I have no way of
> knowing right now whether this was a disk problem that's going to come
> back, a 1 time deal due to power, or something else entirely.
>
> As these cheap machines that don't use RAID what's the right way to
> go? emerge -e @world and then wait for the next event? Do nothing and
> wait?
Emerge smartmontools, then:
smartctl -H /dev/sda # overall health: what the drive thinks about itself
smartctl -t short /dev/sda # start short self test
Wait a few minutes (smartctl prints an estimate when it starts the test)
smartctl -l selftest /dev/sda # see results
smartctl -t long /dev/sda # start long self test
Wait a lot longer (often an hour or more on a large disk)
smartctl -l selftest /dev/sda # see results
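The steps above can be wrapped in a small shell function. This is only a sketch: `run_selftest` is a hypothetical helper name, and the wait time is a guess you pass in yourself (smartctl prints its own estimate when the test starts). Only `smartctl -t` and `smartctl -l selftest`, as used above, are called.

```shell
# Hypothetical helper: start a SMART self-test, wait, then print the
# newest entry of the drive's self-test log.
run_selftest() {
    dev=$1        # disk to test, e.g. /dev/sda
    kind=$2       # "short" or "long"
    wait_secs=$3  # seconds to wait before reading the log (your guess)

    # Start the test; it runs inside the drive, in the background.
    smartctl -t "$kind" "$dev" || return 1
    sleep "$wait_secs"
    # Entry "# 1" is the most recent test in the log.
    smartctl -l selftest "$dev" | grep '^# 1 '
}
```

For example, `run_selftest /dev/sda short 180` as root. If the long test is still running when the wait ends, the newest entry will say so; just check the log again later.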
You can keep working in the meantime; the test runs inside the drive, so
there should be little or no performance impact. The self-test log will
look something like this:
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      2275         -
# 2  Extended offline    Completed without error       00%      2270         -
# 3  Extended offline    Completed without error       00%      1799         -
# 4  Extended offline    Completed without error       00%       197         -
# 5  Extended offline    Completed without error       00%        26         -
If the rightmost column (LBA_of_first_error) shows '-', the disk found no
errors. If there is a number, then it's the position (LBA) of the first error.
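Checking that column by hand gets tedious, so here is a rough awk filter. It is a sketch: `check_selftest_log` is a hypothetical name, and it assumes LBA_of_first_error is always the last whitespace-separated field of each "# N" line, as in the log above.

```shell
# Hypothetical helper: read `smartctl -l selftest` output on stdin,
# report every test whose LBA_of_first_error is not "-", and exit 1
# if any such test was found (0 otherwise).
check_selftest_log() {
    awk '
        /^# *[0-9]+ / {
            line = $0
            sub(/^# */, "", line)   # "# 1  Short..." or "#10 ..." -> "1  Short..."
            split(line, f, " ")
            num = f[1]              # test number
            lba = $NF               # assumed: last column is LBA_of_first_error
            if (lba != "-") {
                printf "test #%s: first error at LBA %s\n", num, lba
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

Usage: `smartctl -l selftest /dev/sda | check_selftest_log && echo "no errors logged"`.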
There's also badblocks, which checks every block and prints the bad ones:
badblocks -sv /dev/sda
badblocks -svn /dev/sda does a non-destructive read-write test (the disk
must not be mounted for this). If it hits a bad block, the drive should
swap it for a spare one. Maybe this already happens in read-only mode; I am
not sure.
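If you want the read-only scan's result as a simple count, badblocks can write the bad block numbers to a file with `-o`. A minimal sketch, assuming a hypothetical helper name `scan_badblocks` and one block number per line in the output file:

```shell
# Hypothetical helper: run a read-only badblocks scan, count the bad
# blocks it listed, and exit 0 only if none were found.
scan_badblocks() {
    dev=$1                          # e.g. /dev/sda
    out=$(mktemp) || return 1       # temporary file for the bad block list
    badblocks -sv -o "$out" "$dev" || { rm -f "$out"; return 1; }
    count=$(grep -c . "$out")       # one bad block number per line
    rm -f "$out"
    echo "$dev: $count bad block(s)"
    [ "$count" -eq 0 ]
}
```

This only wraps the read-only mode; the `-n` read-write test is left as a deliberate manual step.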
Also watch syslog or dmesg for errors; the kernel should log some when bad
blocks are being accessed.
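A rough filter for those kernel messages could look like this. The patterns are assumptions: the exact wording differs between kernel versions and drivers, so treat any match as a hint to look closer, not a diagnosis. `disk_errors` is a hypothetical helper name.

```shell
# Hypothetical helper: read kernel log lines on stdin and print those that
# look like disk errors for the given device (default: sda).
# Assumed patterns:
#   "I/O error, dev sda, sector ..."        block layer read/write failures
#   "ata1.00: error: ..." / "ata1: SError"  libata error reports
#   "[sda] ... Unrecovered read error"      SCSI sense data for the disk
disk_errors() {
    dev=${1:-sda}
    grep -iE "I/O error, dev $dev,|ata[0-9]+.*(error|failed)|$dev.*(medium error|unrecovered read error)"
}
```

Usage: `dmesg | disk_errors sda`, or feed it your syslog file (its path depends on which syslog daemon you run).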
Wonko