Quoting Jeremy Chadwick <[EMAIL PROTECTED]>:

On Tue, Jan 08, 2008 at 05:28:46PM -0500, Stephen M. Rumble wrote:
I'm having a bit of trouble with a new machine running the latest RELENG_7
code. I have two 500GB WD Caviar GP disks on a Mini-ITX GM965-based board
(MSI "fuzzy") running amd64 with 4GB of RAM. The disks are:

Could be related to a PR that I submitted long ago, but was not specific to
ZFS -- instead, it appeared to be specific to the motherboard I was
using.  There are also some tidbits posted by others which appeared to
help them, although performance was impacted:

http://www.freebsd.org/cgi/query-pr.cgi?pr=103435

Another related PR, which seems to indicate motherboard problems:

http://www.freebsd.org/cgi/query-pr.cgi?pr=93885

Thanks. I'm not sure they apply, but I'll keep them in mind. The Intel chipsets seem to be rather bug-free; at least, I didn't see any mention of quirks or workarounds when glancing over code. The problems I'm seeing also seem to occur during low utilisation, not high (remarkably, keeping the system active seems to postpone issues!). I'm not sure PCI bus issues would be a likely culprit and I don't see any obviously relevant BIOS settings.

ad4: 476940MB <WDC WD5000AACS-00ZUB0 01.01B01> at ata2-master SATA150
ad6: 476940MB <WDC WD5000AACS-00ZUB0 01.01B01> at ata3-master SATA150

I've tried different power supplies and cables. I've enabled and disabled
spread spectrum clocking and tried both SATA300 and SATA150 rates. I've
also tried switching drives between ports so that what was ad4 is ad6 and
what was ad6 is ad4. The problems persist, but seem to follow the same
drive (ad6 originally, then ad4 when swapped). This seems to indicate a
drive problem, but it works great on its own, even when exercising both
disks simultaneously. SMART reports no problems and ZFS reports no issues
when ad6 is used on its own outside of a zfs mirror. It seems like it's the
drive, but it works fine when not in a mirror. I'm stumped. Any ideas?

Have you tried running long SMART tests (smartctl -t long) on both of
these drives, ditto with an offline test (smartctl -t offline)?
Statistics that are labelled "Offline" as their type won't get updated
until an offline test is performed.  It's possible those statistics may
provide some answers, but no guarantees.
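For anyone wanting to script the tests suggested above, here is a minimal sketch in Python. It only builds the `smartctl -t <test> <device>` command lines and, optionally, runs them. The device paths `/dev/ad4` and `/dev/ad6` are an assumption based on the disk names in this thread, and the helper names are made up for illustration. As far as I know a drive runs only one self-test at a time, so the sketch handles one test type per call rather than queueing long and offline tests back to back.

```python
import subprocess


def smart_test_commands(devices, test):
    """Build one `smartctl -t <test> <device>` argv list per device.

    `test` is the test type passed to -t, e.g. "long" or "offline".
    """
    return [["smartctl", "-t", test, dev] for dev in devices]


def start_tests(devices, test, dry_run=True):
    """Print the commands (dry run) or actually start the tests."""
    cmds = smart_test_commands(devices, test)
    for argv in cmds:
        if dry_run:
            print(" ".join(argv))
        else:
            # Needs root and smartmontools installed; the test itself
            # runs inside the drive after smartctl returns.
            subprocess.run(argv, check=False)
    return cmds


if __name__ == "__main__":
    # Assumed device nodes for the two disks discussed in this thread.
    start_tests(["/dev/ad4", "/dev/ad6"], "long")
```

Once a test has had time to finish, `smartctl -l selftest <device>` shows the self-test log and `smartctl -A <device>` shows the attribute table, including the ones marked "Offline".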

Nope, but I'm going to do that right now!

The only interesting bit of evidence I could find is that when these errors
do occur, smartctl reports an increase in the Start_Stop_Count field on
ad6. ad4, which appears to work fine, doesn't demonstrate this and has a
much lower value.

Start_Stop_Count indicates the drive is actually stopping then spinning
back up (usually caused by a reset of some kind; equivalent of powering
down then back up but without the loss of power).  It's possible that
your drive has actual problems -- this is supported by the fact that the
problem follows the disk (when moving the disk to another SATA port).
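To put a number on the Start_Stop_Count observation, one option is to save the output of `smartctl -A` before and after an error and compare the raw values. The sketch below is a rough take on that in Python. It assumes the usual ten-column attribute table that `smartctl -A` prints, with the raw value in the last column. The sample text and its numbers are invented for illustration and are not taken from the drives in this thread.

```python
def parse_smart_attributes(text):
    """Map attribute name -> raw value (int) from `smartctl -A` table text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            if raw.isdigit():
                attrs[fields[1]] = int(raw)
    return attrs


def attribute_delta(before, after, name="Start_Stop_Count"):
    """Return how much the raw value of `name` grew between two snapshots."""
    return parse_smart_attributes(after)[name] - parse_smart_attributes(before)[name]


# Made-up snapshots, only to show the expected input shape.
SAMPLE_BEFORE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       120
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       300
"""
SAMPLE_AFTER = SAMPLE_BEFORE.replace("       120", "       127")

if __name__ == "__main__":
    print(attribute_delta(SAMPLE_BEFORE, SAMPLE_AFTER))
```

A count that climbs on the suspect disk while the system stays powered on, and stays flat on the other disk, would line up with the drive resetting itself.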

I'm leaning ever closer to blaming the disk. I still can't explain why I couldn't make it misbehave with it on its own ZFS pool and UFS filesystems. However, shortly after setting the dubious disk offline using zpool, I poked at it with 'atacontrol cap' and managed to wedge it. Upon issuing the command it sounded like it was spinning up (it should never spin down, although these GP drives are supposed to lower their RPM while idle) and atacontrol hung. I couldn't kill it and top listed the state as 'ata re'. The rest of the system was responsive, but the machine wouldn't shut down properly, presumably on account of that stuck channel.

Tracking down the source of this problem usually requires a lot of time,
money, and trial-and-error techniques.  This is what I'd go with:

1) See if there's a BIOS update.  I know at least in the case of Intel
manufactured boards BIOS updates have solved weird problems like this in
the past.

None. BIOS version 1.0 doesn't leave me convinced it's bug-free, though ;)

2) Try an Advanced RMA with Western Digital (which guarantees you get a
brand new drive rather than chancing that they repair the one you send
them) and see if a new drive helps.

I'll definitely look into that.

3) Try replacing the motherboard with a different brand (non-MSI).  I
have nothing against MSI, but switching vendors usually means that you
rule out a cross-model h/w bug (e.g. something the vendor does in the
BIOS or engineering which is suspect).  Try Asus or Gigabyte.  Obviously
this will cost money to do and will very likely set you back the cost of
the motherboard you have currently, but it's a viable option since
you've already tried replacing SATA cables.

I suppose I could always stick the disks in another box, boot it up, and see what happens. Actually, I may just do that next.

I'm not sure why ZFS would cause something like this to happen vs. UFS.
I happen to run ZFS at home (same machine as what's mentioned in PR
103435, with the replaced motherboard of course) doing very heavy disk
I/O across two disks, and I have never seen problems of this sort.  That
doesn't mean there isn't a problem, just that I haven't encountered it
with ZFS.

I'm not convinced it's any issue with ZFS or FreeBSD. Rather, it seems that using a ZFS mirror just makes that drive unhappy posthaste. If I didn't want to avoid rebuilding the dataset, I'd try gmirror. I probably haven't been patient enough to let the problem exhibit itself outside of the mirrored configuration.

[snip working system tease]

Thanks for all your input,
Steve

_______________________________________________
[email protected] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
