I have a NetBSD box running 6.0.1 i386. It has four 3TB HDs with two
raidframe raid arrays configured.
The first array (raid0) is a RAID 1 mirror for / (currently over wd0 and
wd3, using 5GB on each disk); the second (raid1) is a RAID 5 for a data
partition (over wd0, wd1, wd2 and wd3, using all remaining space and
reporting 8TB in total).

A week ago the system became unresponsive with many errors like the
following in /var/log/messages:

Jun  2 14:50:31 ex-fl-sr-03 /netbsd: wd1d: error reading fsbn 5804545856 of
5804545856-5804545983 (wd1 bn 5804545856; cn 5758478 tn 0 sn 32), retrying
Jun  2 14:50:31 ex-fl-sr-03 /netbsd: wd1: (uncorrectable data error)
Jun  2 14:50:31 ex-fl-sr-03 /netbsd: ahcisata0 port 1: device present,
speed: 3.0Gb/s

At that point the / mirror was running on wd0 and wd1, with the component
on wd1 listed as failed. I added a pre-prepared partition on wd3 to that
mirror and rebuilt it. At present the components on both wd0 and wd3 are
reporting as optimal.

The odd part was that raidframe had listed the part of the raid5 data
partition on wd0 as failed (the errors in /var/log/messages only ever
referred to wd1) and the part on wd1 as optimal.
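
One way to double-check that the kernel errors really only mention wd1 is
to tally the disk error lines per drive. The following is a minimal sketch,
assuming the log lines look like the excerpt above; the function name
count_disk_errors and the regular expression are illustrative assumptions,
not part of any NetBSD tool.

```python
import re
from collections import Counter

# Assumed message shapes, taken from the excerpt above:
#   "... /netbsd: wd1d: error reading fsbn ..."
#   "... /netbsd: wd1: (uncorrectable data error)"
# The optional letter after the unit number is the partition.
DISK_ERROR = re.compile(r"/netbsd: (wd\d+)[a-p]?: .*error")


def count_disk_errors(lines):
    """Return a Counter mapping disk name (e.g. 'wd1') to error-line count."""
    counts = Counter()
    for line in lines:
        match = DISK_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing it the open log file, for example
count_disk_errors(open("/var/log/messages", errors="replace")), gives one
count per drive; a drive with no matching lines does not appear at all.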

I reseated the drives, rebooted the system and all the drives seemed OK. As
there were no errors reported for wd0, and raidframe seemed happy with the
part of the raid5 on wd1 I set the array rebuilding on wd0.

Today (5 days later; these are 3TB drives) the rebuild failed at 99%. Again
there are errors in /var/log/messages about wd1 (see above). Again the
raid5 component on wd0 is marked as failed (although in this case it never
completed rebuilding). The rebuild failed 17 seconds after these errors
started being printed to the log:

Jun  2 14:50:48 ex-fl-sr-03 /netbsd: raid1: Recon read failed: 5
Jun  2 14:50:48 ex-fl-sr-03 /netbsd: raid1: reconstruction failed.
Jun  2 14:50:48 ex-fl-sr-03 /netbsd: ahcisata0 port 1: device present,
speed: 3.0Gb/s

My reading of the situation is that raidframe is incorrectly failing the
raid5 component on wd0 because of read errors on wd1. As there are read
errors on the raid5 component on wd1 (with no redundancy left, since one
member of the array has already been failed), I need to get as much of the
data off the array as possible and rebuild from scratch, probably after
replacing wd1 as a failed drive.

Do you agree?

Any idea why raidframe seems to be failing the wrong member of the raid5
thus invalidating the whole thing?

Thanks in advance,

Will
