Re: Too many uncorrectable read errors with atang

2003-11-10 Thread Kris Kennaway
On Fri, Nov 07, 2003 at 08:36:28PM -0800, Andrew P. Lentvorski, Jr. wrote:
 On Fri, 7 Nov 2003, John Baldwin wrote:
 
  On 07-Nov-2003 Kris Kennaway wrote:
   So far this has happened (well, the panic above was new) on 5 separate
   machines that were all working on older -current.  Now, these are all
   IBM DeathStar drives, but previously I was only experiencing ata
   errors every month or two, and they were correctable for another month
   or two by /dev/zero'ing the drive.
 
 IBM Deathstars have this annoying tendency to perform thermal
 recalibration cycles that cause them to delay returning data for somewhere
 between 30 and 90 seconds until the calibration finishes.  Unfortunately,
 these seem to show up as uncorrectable errors.  It's a true pain with RAID
 cards as the RAID array will take the drive offline when it could retry
 the data.
 
 If you can, try to reduce the temperature of the drives.  This generally
 helped my Deathstars before I got rid of them all.
 
 Also, given the touchiness of PRML detectors, it is entirely possible that
 the drive is treating increased read errors from the solar flares as a need
 to thermally recalibrate more often.
 
 Other than tossing the drives, ATAng, like Windows, would have to be more
 aggressive about retrying even uncorrectable errors for up to a minute or
 so before giving up.

It looks like my drives are indeed dying: reverting to 5.1-RELEASE
still gives lots of errors on 2 of the machines.  I guess ATAng is
more sensitive to errors on the others.

Kris


pgp0.pgp
Description: PGP signature


Re: Too many uncorrectable read errors with atang

2003-11-07 Thread Soren Schmidt
It seems Kris Kennaway wrote:
-- Start of PGP signed section.
 Since upgrading the bento package machines to -current I am getting
 a lot of the following errors:
 
 ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE>

That does look like a valid error condition from the drive...
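
For readers decoding the log line, here is a minimal Python sketch of how
the hex values map to the flag names. The flag names are the ones the driver
prints; the bit positions are assumed from the standard ATA status and error
register layout, and the tables are deliberately partial. The names
STATUS_BITS, ERROR_BITS and decode are illustrative only, not driver code.

```python
# Partial flag tables (assumed from the ATA register layout; illustrative).
STATUS_BITS = {
    0x80: "BUSY",
    0x40: "READY",
    0x10: "DSC",
    0x08: "DRQ",
    0x01: "ERROR",
}
ERROR_BITS = {
    0x80: "ICRC",
    0x40: "UNCORRECTABLE",
    0x10: "ID_NOT_FOUND",
    0x04: "ABORTED",
}


def decode(value, names):
    """Return the names of the bits set in value, highest bit first."""
    flags = []
    for shift in range(7, -1, -1):
        bit = 1 << shift
        if value & bit:
            # Bits missing from the partial table are shown numerically.
            flags.append(names.get(bit, "bit_0x%02x" % bit))
    return flags


print(decode(0x51, STATUS_BITS))  # status=51 from the log line
print(decode(0x40, ERROR_BITS))   # error=40 from the log line
```

Status 0x51 is 0x40 + 0x10 + 0x01, which is why the driver reports READY,
DSC and ERROR together; error 0x40 is the single UNCORRECTABLE bit.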

 1) All my drives have performed mass suicide at once

You know, with Deathstars you can't really rule that out :)

 2) ATAng is detecting errors that the ATAog did not

That is true, the error detection is better in ATAng.

 3) ATAng is not trying as hard as ATAog to recover from the errors
 from the crappy drives

Neither ATAog nor ATAng retried uncorrectable errors...
 
 4) ATAng has a bug on this hardware.

That we can't rule out, and it's probably likely..

 Furthermore, I'd like to know why the panic occurred above.

Is this on a brand new -current? Lots of things that could
cause this have been fixed...

-Søren
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to [EMAIL PROTECTED]


RE: Too many uncorrectable read errors with atang

2003-11-07 Thread John Baldwin

On 07-Nov-2003 Kris Kennaway wrote:
 So far this has happened (well, the panic above was new) on 5 separate
 machines that were all working on older -current.  Now, these are all
 IBM DeathStar drives, but previously I was only experiencing ata
 errors every month or two, and they were correctable for another month
 or two by /dev/zero'ing the drive.
 
 To suddenly start receiving errors on 5 out of 7 drives in the past
 few weeks is a significant anomaly.  Perhaps one of the following is
 happening:
 
 1) All my drives have performed mass suicide at once
 
 2) ATAng is detecting errors that the ATAog did not
 
 3) ATAng is not trying as hard as ATAog to recover from the errors
 from the crappy drives
 
 4) ATAng has a bug on this hardware.

5) Interference from abnormally high solar activity.  It is known
to cause an increase in NMI's from ECC errors, so it could be a
possible explanation here even if it's a bit far-fetched.

-- 

John Baldwin [EMAIL PROTECTED]  http://www.FreeBSD.org/~jhb/
Power Users Use the Power to Serve!  -  http://www.FreeBSD.org/


Re: Too many uncorrectable read errors with atang

2003-11-07 Thread Eduard Martinescu
If you are running -CURRENT, you can check the SMART status of the
drives with the port sysutils/smartmontools.  If the drive supports 
ATA-3 commands, you should be able to see if there are errors being
reported by the drive itself.
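
A minimal sketch of scripting that check from Python, assuming the
smartmontools port is installed and that the drive is /dev/ad0 as in the
log lines quoted in this thread. The helper names smartctl_command and
smart_report are made up for illustration; "smartctl -a" asks the tool to
print all SMART information it can get from the device.

```python
import subprocess


def smartctl_command(device):
    """Build the smartctl invocation that prints all SMART info."""
    return ["smartctl", "-a", device]


def smart_report(device="/dev/ad0"):
    """Run smartctl (usually needs root) and return its text output.

    check=False because smartctl uses a non-zero exit status to flag
    drive problems, so a failing drive must not raise here.
    """
    result = subprocess.run(
        smartctl_command(device),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Calling smart_report() on each affected machine and comparing the drive's
own error log against the kernel messages would show whether the drive
itself is recording the failures.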

Ed

On Fri, 2003-11-07 at 13:33, Soren Schmidt wrote:

 It seems Kris Kennaway wrote:
 -- Start of PGP signed section.
  Since upgrading the bento package machines to -current I am getting
  a lot of the following errors:
  
  ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE>
 
 That does look like a valid error condition from the drive...
 
  1) All my drives have performed mass suicide at once
 
 You know, with Deathstars you can't really rule that out :)
 
  2) ATAng is detecting errors that the ATAog did not
 
 That is true, the error detection is better in ATAng.
 
  3) ATAng is not trying as hard as ATAog to recover from the errors
  from the crappy drives
 
 Neither ATAog nor ATAng retried uncorrectable errors...
  
  4) ATAng has a bug on this hardware.
 
 That we can't rule out, and it's probably likely..
 
  Furthermore, I'd like to know why the panic occurred above.
 
 Is this on a brand new -current? Lots of things that could
 cause this have been fixed...
 
 -Søren

-- 
Eduard Martinescu [EMAIL PROTECTED]


Re: Too many uncorrectable read errors with atang

2003-11-07 Thread Kris Kennaway
On Fri, Nov 07, 2003 at 07:33:41PM +0100, Soren Schmidt wrote:

  1) All my drives have performed mass suicide at once
 
 You know, with Deathstars you can't really rule that out :)

:-)

  Furthermore, I'd like to know why the panic occurred above.
 
 Is this on a brand new -current? Lots of things that could
 cause this have been fixed...

Yes, it was updated last night.

Kris




Re: Too many uncorrectable read errors with atang

2003-11-07 Thread Kris Kennaway
On Fri, Nov 07, 2003 at 10:10:07AM -0800, Kris Kennaway wrote:

 ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE>
 ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE>
 ad0: TIMEOUT - READ_DMA retrying (2 retries left)
 ata0: resetting devices ..
 ad0: FAILURE - already active DMA on this device
 ad0: setting up DMA failed
 panic: initiate_write_inodeblock_ufs2: already started
 Debugger(panic)
 Stopped at  Debugger+0x54:  xchgl   %ebx,in_Debugger.0
 db> trace

I just had another machine panic in the same failure mode.

kris




RE: Too many uncorrectable read errors with atang

2003-11-07 Thread Andrew P. Lentvorski, Jr.
On Fri, 7 Nov 2003, John Baldwin wrote:

 On 07-Nov-2003 Kris Kennaway wrote:
  So far this has happened (well, the panic above was new) on 5 separate
  machines that were all working on older -current.  Now, these are all
  IBM DeathStar drives, but previously I was only experiencing ata
  errors every month or two, and they were correctable for another month
  or two by /dev/zero'ing the drive.

IBM Deathstars have this annoying tendency to perform thermal
recalibration cycles that cause them to delay returning data for somewhere
between 30 and 90 seconds until the calibration finishes.  Unfortunately,
these seem to show up as uncorrectable errors.  It's a true pain with RAID
cards as the RAID array will take the drive offline when it could retry
the data.

If you can, try to reduce the temperature of the drives.  This generally
helped my Deathstars before I got rid of them all.

Also, given the touchiness of PRML detectors, it is entirely possible that
the drive is treating increased read errors from the solar flares as a need
to thermally recalibrate more often.

Other than tossing the drives, ATAng, like Windows, would have to be more
aggressive about retrying even uncorrectable errors for up to a minute or
so before giving up.
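
The retry policy described above can be sketched in Python as a
time-bounded retry loop. This is a userland illustration under stated
assumptions, not ATAng code: UncorrectableError, read_with_retry and the
read_once callback are hypothetical names, and the 60-second budget and
1-second pause are just the "up to a minute or so" from this message turned
into defaults.

```python
import time


class UncorrectableError(Exception):
    """Hypothetical stand-in for a read the drive reports as failed."""


def read_with_retry(read_once, deadline_s=60.0, pause_s=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Call read_once() until it succeeds or the time budget runs out.

    clock and sleep are injectable so the loop can be exercised
    without real waiting.
    """
    give_up_at = clock() + deadline_s
    while True:
        try:
            return read_once()
        except UncorrectableError:
            # Give up once another pause would overrun the budget;
            # the last error is re-raised to the caller.
            if clock() + pause_s > give_up_at:
                raise
            sleep(pause_s)
```

The point of bounding by elapsed time rather than by retry count is that a
recalibrating drive may reject many requests in quick succession; a fixed
count of two or three retries can be used up long before a 30 to 90 second
calibration finishes.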

-a


Re: Too many uncorrectable read errors with atang

2003-11-07 Thread Kris Kennaway
On Fri, Nov 07, 2003 at 08:36:28PM -0800, Andrew P. Lentvorski, Jr. wrote:
 On Fri, 7 Nov 2003, John Baldwin wrote:
 
  On 07-Nov-2003 Kris Kennaway wrote:
   So far this has happened (well, the panic above was new) on 5 separate
   machines that were all working on older -current.  Now, these are all
   IBM DeathStar drives, but previously I was only experiencing ata
   errors every month or two, and they were correctable for another month
   or two by /dev/zero'ing the drive.
 
 IBM Deathstars have this annoying tendency to perform thermal
 recalibration cycles that cause them to delay returning data for somewhere
 between 30 and 90 seconds until the calibration finishes.  Unfortunately,
 these seem to show up as uncorrectable errors.  It's a true pain with RAID
 cards as the RAID array will take the drive offline when it could retry
 the data.
 
 If you can, try to reduce the temperature of the drives.  This generally
 helped my Deathstars before I got rid of them all.
 
 Also, given the touchiness of PRML detectors, it is entirely possible that
 the drive is treating increased read errors from the solar flares as a need
 to thermally recalibrate more often.
 
 Other than tossing the drives, ATAng, like Windows, would have to be more
 aggressive about retrying even uncorrectable errors for up to a minute or
 so before giving up.

Thanks, that's interesting; perhaps there's something sos can do here.
Unfortunately the drives in question are in Yahoo's datacenter, so I
do not have physical access.

Kris

