In the last episode (May 04), Grant Peel said:
>  A few weeks back, I turned on mod_gzip in apache and as a result,
>  the /tmp directory filled up with .wrk files causing the root
>  filesystem to fill to capacity. When we noticed what was happening,
>  on May 1 we had no choice but to cold boot the machine as it was,
>  for all purposes locked up.
> In the security run, for May 1 and May 3 I am seeing the SCSI errors below.
> FreeBSD 4.7 (yes we are going to upgrade soon (migrating to a newly setup 
> machine)),
> Apache 1.3.26
> We do have complete dumps (from May 1),
> The machine is a vintage 2003 Dell SC1400 
> HD = 1 Fujitsu SCSI that has never had problems before.
> Questions:
> Do the errors below TRULY indicate pending doom?
> Can camcontrol be used to squash the errors?
> Should FSCK be used to fix?
> Are these errors (the text below), formatted from the FreeBSD kernel
> or are they shown as reported by the HD subsystem? i.e. where can I
> go to read what the errors actually mean?

Those are errors reported by the drive:
> May  3 03:59:13 excelsior /kernel: (da0:ahc0:0:1:0): READ(10). CDB: 28 0 4 21 7a df 0 0 80 0
> May  3 03:59:13 excelsior /kernel: (da0:ahc0:0:1:0): MEDIUM ERROR info:4217b55 asc:11,1
> May  3 03:59:14 excelsior /kernel: (da0:ahc0:0:1:0): Read retries exhausted sks:80,3f
> May  1 03:29:28 excelsior /kernel: (da0:ahc0:0:1:0): READ(10). CDB: 28 0 3 ab d5 c1 0 0 e 0
> May  1 03:29:31 excelsior /kernel: (da0:ahc0:0:1:0): MEDIUM ERROR info:3abd5c1 asc:11,1
> May  1 03:29:31 excelsior /kernel: (da0:ahc0:0:1:0): Read retries exhausted sks:80,3f

The drive tried to read the indicated blocks (0x4217b55 and 0x3abd5c1,
taken from the "info:" field) and couldn't, even after multiple
retries; asc:11,1 is the SCSI additional sense code for "read retries
exhausted".  If it had been able to recover the data after retrying, it
would have reallocated the block to a spare sector.
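
If you want to check the numbers yourself, here is a rough Python
sketch that decodes the READ(10) CDB bytes from the log and confirms
that the failing block in "info:" falls inside the requested range.
The function names are made up for illustration, and the 512-byte
sector size is an assumption about your drive, not something the log
states.

```python
def decode_read10(cdb_hex):
    """Decode a READ(10) CDB given as space-separated hex bytes.

    Returns (starting LBA, number of blocks requested).
    """
    cdb = [int(b, 16) for b in cdb_hex.split()]
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(bytes(cdb[2:6]), "big")     # bytes 2-5: LBA
    length = int.from_bytes(bytes(cdb[7:9]), "big")  # bytes 7-8: block count
    return lba, length


def locate(cdb_hex, info_hex, sector_size=512):
    """Relate the failing block (info field) to the request in the CDB.

    sector_size=512 is an assumption; adjust it if the drive differs.
    """
    lba, length = decode_read10(cdb_hex)
    bad = int(info_hex, 16)
    return {
        "first_lba": lba,
        "blocks": length,
        "bad_lba": bad,
        "offset_in_request": bad - lba,
        "in_range": lba <= bad < lba + length,
        "byte_offset": bad * sector_size,
    }


if __name__ == "__main__":
    # CDB and info values copied from the two log entries above.
    for cdb, info in [("28 0 4 21 7a df 0 0 80 0", "4217b55"),
                      ("28 0 3 ab d5 c1 0 0 e 0", "3abd5c1")]:
        print(locate(cdb, info))
```

For the May 3 entry the request was 128 blocks starting at 0x4217adf,
and the bad block is the 119th of them; for May 1 the very first block
of a 14-block request failed.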

There isn't an easy way to map a raw block number to a filename, but if
you can determine that the files belonging to the blocks were old, your
drive is probably still okay, and you happened to trip over some weak
spots on the disk that lost their data over time.  If they were
recently-generated files, then I'd start worrying about getting that
new system up as soon as possible.

One thing to try would be "dd if=/dev/da0 of=/dev/null bs=64k
conv=noerror" (without conv=noerror, dd stops at the first read error),
and see how many more errors get generated.  Installing smartmontools
and comparing the output of "smartctl -a /dev/da0" before and after
will also tell you how many ECC recoveries and rereads were done.
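
If you would rather have a list of where the read errors land than
watch dd's output, something along these lines does the same
sequential read.  It is only a sketch: scan() and its max_errors cutoff
are my own invention, it assumes a Unix-like system with os.pread, and
on a raw disk it needs to run as root.

```python
import os


def scan(path, block_size=64 * 1024, max_errors=1000):
    """Read path front to back in block_size chunks.

    Returns (bytes successfully read, list of offsets whose read failed).
    Gives up once max_errors failures have been recorded, so a device
    that has gone away does not cause an endless loop.
    """
    bad_offsets = []
    total = 0
    offset = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while len(bad_offsets) < max_errors:
            try:
                chunk = os.pread(fd, block_size, offset)
            except OSError:
                # Record the failing chunk and skip past it.
                bad_offsets.append(offset)
                offset += block_size
                continue
            if not chunk:
                break  # end of file or device
            total += len(chunk)
            offset += len(chunk)
    finally:
        os.close(fd)
    return total, bad_offsets
```

Calling scan("/dev/da0") and printing the offsets in hex gives you
something to compare against the block numbers in the kernel log
(divide each offset by the sector size first).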

        Dan Nelson