Stephen Hurd wrote:
This shows you've had 4 reallocated sectors, meaning your disk does in
fact have bad blocks. In 90% of cases, bad blocks continue to "grow"
over time, for one reason or another (I remember reading an article
explaining it, but I can't for the life of me find the URL).
This is unusual now? I've always "known" that a small number of bad
blocks is normal. Time to readjust my knowledge again?
Modern drives hide bad sectors by keeping a pool of spare sectors and
automatically remapping bad ones to that pool. The problem arises
when the drive has aged enough that it has run out of spares.
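As a sketch of how you might keep an eye on that remapping, here's a minimal Python snippet that pulls the raw Reallocated_Sector_Ct value out of `smartctl -A`-style output. The sample lines and the raw count of 4 are taken from this thread; the 10-column layout is smartmontools' usual ATA attribute table, but verify against your smartctl version:

```python
# Hypothetical sample of `smartctl -A` output. The column layout
# (ID, name, flag, value, worst, thresh, type, updated, when_failed, raw)
# is the usual smartmontools ATA attribute table.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always       -       4
194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       48
"""

def reallocated_sectors(smart_output):
    """Return the raw Reallocated_Sector_Ct value, or None if absent."""
    for line in smart_output.splitlines():
        fields = line.split()
        # Attribute name is the second whitespace-separated field,
        # raw value is the tenth.
        if len(fields) >= 10 and fields[1] == "Reallocated_Sector_Ct":
            return int(fields[9])
    return None

print(reallocated_sectors(SAMPLE))  # -> 4
```

Run periodically (e.g. from cron) and alert when the count rises; a growing count is the signal to worry about, not the absolute number.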
194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       48
This is excessive, and may be contributing to your problems. A hard
disk running at 48C is not a good sign. It should really be somewhere
between the high 20s and mid 30s.
Yeah, this is a known problem with this drive... it's been running hot
for years. I always figured it was due to the rotational speed increase
in commodity drives.
48C is high, but I wouldn't consider it excessive. Drives that start
generating "excessive" heat tend to fail shortly thereafter. I do agree
that the heat is probably shortening the lifespan on the drive.
Error 2 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
  When the command that caused the error occurred, the device was in an unknown state.
Error 1 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
  When the command that caused the error occurred, the device was in an unknown state.
These are automated SMART log entries confirming the DMA failures. The
fact that SMART saw them means that the disk is also aware of those
issues. They may have been caused by the reallocated sectors. It's
also interesting that the LBAs are different from the ones FreeBSD
reported issues with.
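The lifetime figures in those entries can be pulled out mechanically and sanity-checked. A small Python sketch — the regex is an assumption based on the `smartctl -l error` excerpt quoted above, not a guaranteed smartmontools format:

```python
import re

# Excerpt of `smartctl -l error` output as quoted in this thread.
ERROR_LOG = """\
Error 2 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
Error 1 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
"""

PATTERN = re.compile(r"Error (\d+) occurred at disk power-on lifetime: (\d+) hours")

for m in PATTERN.finditer(ERROR_LOG):
    num, hours = int(m.group(1)), int(m.group(2))
    days, rem = divmod(hours, 24)
    # Recompute the parenthesized "days + hours" breakdown.
    print(f"error {num}: {hours}h = {days} days + {rem} hours")
```

The breakdown checks out: 5171 = 215 * 24 + 11, matching the "(215 days + 11 hours)" smartctl printed.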
If that power-on lifetime is accurate, that was at least a year ago...
but I can't find any documentation on when the power-on lifetime
wraps or what it actually indicates. I'm assuming it is total
power-on time since the drive was manufactured. If it's total hours as
a 16-bit integer, it shouldn't wrap. Is there a way you're aware of to
get the "current" power-on lifetime value? The Power_On_Minutes
attribute is interesting, but its current value is lower than the value
at the error (though higher than the uptime of the system):
  9 Power_On_Minutes        0x0032   219   219   000    Old_age   Always       -       1061h+40m
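One possible framing of the wrap question: if attribute 9 were stored as a 16-bit count of minutes rather than hours, it would wrap roughly every 1092 hours, which would make a current raw value of 1061h+40m consistent with 5171 total power-on hours. This is purely illustrative arithmetic; whether this drive's firmware actually does this is exactly the open question:

```python
# Hypothesis only: attribute 9 reported in minutes, stored in 16 bits.
WRAP_MINUTES = 2 ** 16                      # 65536 minutes per wrap
print(WRAP_MINUTES / 60)                    # ~1092.3 hours per wrap

error_minutes = 5171 * 60                   # error timestamp, in minutes
current_minutes = 1061 * 60 + 40            # raw value 1061h+40m from the thread

# After four full wraps, 5171 hours of lifetime would read back as the
# remainder below -- well under one wrap period, just like 1061h+40m is.
print(divmod(error_minutes, WRAP_MINUTES))  # -> (4, 48116)
```

The remainder doesn't have to match the current raw value (more power-on time has elapsed since the error), but the math shows why a sub-1092h reading is not inconsistent with a much older drive under this hypothesis.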
Also interesting is that after getting more errors from FreeBSD, I did
not get more errors in smartctl.
The errors you're getting from FreeBSD have nothing directly to do with
SMART. The driver thinks that commands are timing out and that the
drive is becoming unresponsive. Whether they actually are is another
question. Given that this problem changes behavior with the version of
FreeBSD you're running (and even happens in completely virtual
environments like VMware), I'm betting it's a driver problem and not
a hardware problem, though you should probably think about migrating
your data off to a new drive sometime soon.
I'd like to attack these driver problems. What I need is to spend a
couple of days with an affected system that can reliably reproduce the
problem, instrumenting and testing the driver. I have a number of
theories about what might be going wrong, but nothing that I'm
definitely sure about. If you are willing to set up your system with
remote power and remote serial, and if we knew a reliable way to
reproduce the problem, I could probably have the problem identified and
fixed pretty quickly.
Scott
_______________________________________________
[email protected] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable