Re: Too many uncorrectable read errors with atang
On Fri, Nov 07, 2003 at 08:36:28PM -0800, Andrew P. Lentvorski, Jr. wrote:
> On Fri, 7 Nov 2003, John Baldwin wrote:
> > On 07-Nov-2003 Kris Kennaway wrote:
> > > So far this has happened (well, the panic above was new) on 5 separate machines that were all working on older -current. Now, these are all IBM DeathStar drives, but previously I was only experiencing ata errors every month or two, and they were correctable for another month or two by /dev/zero'ing the drive.
>
> IBM Deathstars have this annoying tendency to perform thermal recalibration cycles that cause them to delay returning data for somewhere between 30-90 seconds until the calibration finishes. Unfortunately, these seem to show up as uncorrectable errors. It's a true pain with RAID cards as the RAID array will take the drive offline when it could retry the data.
>
> If you can, try to reduce the temperature of the drives. This generally helped my Deathstars before I got rid of them all.
>
> Also, given the touchiness of PRML detectors, it is entirely possible that the drive is reading increased errors due to the solar flares as a need to thermally recalibrate more often.
>
> Other than tossing the drives, ATAng, like Windows, would have to be more aggressive about retrying even uncorrectable errors for up to a minute or so before giving up.

It looks like my drives are indeed dying... reverting to 5.1-RELEASE still gives lots of errors on 2 of the machines. I guess ATAng is more sensitive to errors on the others.

Kris
Re: Too many uncorrectable read errors with atang
It seems Kris Kennaway wrote:
> Since upgrading the bento package machines to -current I am getting a lot of the following errors:
>
> ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE>

That does look like a valid error condition from the drive...

> 1) All my drives have performed mass suicide at once

You know, with Deathstars you can't really rule that out :)

> 2) ATAng is detecting errors that the ATAog did not

That is true, the error detection is better in ATAng.

> 3) ATAng is not trying as hard as ATAog to recover from the errors from the crappy drives

Neither ATAog nor ATAng retried uncorrectable errors...

> 4) ATAng has a bug on this hardware.

That we can't rule out, and it's probably likely..

> Furthermore, I'd like to know why the panic occurred above.

Is this on a brand new -current? Lots of things that could cause this have been fixed...

-Søren
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to [EMAIL PROTECTED]
RE: Too many uncorrectable read errors with atang
On 07-Nov-2003 Kris Kennaway wrote:
> So far this has happened (well, the panic above was new) on 5 separate machines that were all working on older -current. Now, these are all IBM DeathStar drives, but previously I was only experiencing ata errors every month or two, and they were correctable for another month or two by /dev/zero'ing the drive.
>
> To suddenly start receiving errors on 5 out of 7 drives in the past few weeks is a significant anomaly. Perhaps one of the following is happening:
>
> 1) All my drives have performed mass suicide at once
> 2) ATAng is detecting errors that the ATAog did not
> 3) ATAng is not trying as hard as ATAog to recover from the errors from the crappy drives
> 4) ATAng has a bug on this hardware.

5) Interference from abnormally high solar activity. It is known to cause an increase in NMIs from ECC errors, so it could be a possible explanation here even if it's a bit far-fetched.

--
John Baldwin [EMAIL PROTECTED]  http://www.FreeBSD.org/~jhb/
Power Users Use the Power to Serve! - http://www.FreeBSD.org/
Re: Too many uncorrectable read errors with atang
If you are running -CURRENT, you can check the SMART status of the drives with the port sysutils/smartmontools. If the drive supports ATA-3 commands, you should be able to see if there are errors being reported by the drive itself.

Ed

On Fri, 2003-11-07 at 13:33, Soren Schmidt wrote:
> It seems Kris Kennaway wrote:
> > Since upgrading the bento package machines to -current I am getting a lot of the following errors:
> >
> > ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE>
>
> That does look like a valid error condition from the drive...
>
> > 1) All my drives have performed mass suicide at once
>
> You know, with Deathstars you can't really rule that out :)
>
> > 2) ATAng is detecting errors that the ATAog did not
>
> That is true, the error detection is better in ATAng.
>
> > 3) ATAng is not trying as hard as ATAog to recover from the errors from the crappy drives
>
> Neither ATAog nor ATAng retried uncorrectable errors...
>
> > 4) ATAng has a bug on this hardware.
>
> That we can't rule out, and it's probably likely..
>
> > Furthermore, I'd like to know why the panic occurred above.
>
> Is this on a brand new -current? Lots of things that could cause this have been fixed...
>
> -Søren

--
Eduard Martinescu [EMAIL PROTECTED]
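The SMART check suggested in this message could be scripted roughly as follows. This is only a sketch: it assumes `smartctl` from sysutils/smartmontools is installed and on the PATH, that the drive shows up as `/dev/ad0` (taken from the `ad0` in the error log), and that the script runs with enough privilege to talk to the drive. The helper names are made up for illustration.

```python
import shutil
import subprocess


def smartctl_command(device):
    """Build a smartctl invocation for one drive.

    -H       asks for the drive's overall health self-assessment
    -l error prints the error log kept by the drive itself
    """
    return ["smartctl", "-H", "-l", "error", device]


def smart_report(device):
    """Return smartctl's text output, or None if smartctl is not installed."""
    if shutil.which("smartctl") is None:
        return None
    # smartctl uses a non-zero exit status to flag problems, so do not
    # treat that as a failure of the command itself.
    result = subprocess.run(
        smartctl_command(device),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Calling `smart_report("/dev/ad0")` on each affected machine would show whether the drive has logged the errors itself, which helps separate a dying disk from a driver problem.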
Re: Too many uncorrectable read errors with atang
On Fri, Nov 07, 2003 at 07:33:41PM +0100, Soren Schmidt wrote:
> > 1) All my drives have performed mass suicide at once
>
> You know, with Deathstars you can't really rule that out :)

:-)

> > Furthermore, I'd like to know why the panic occurred above.
>
> Is this on a brand new -current? Lots of things that could cause this have been fixed...

Yes, it was updated last night.

Kris
Re: Too many uncorrectable read errors with atang
On Fri, Nov 07, 2003 at 10:10:07AM -0800, Kris Kennaway wrote:
> ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE>
> ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE>
> ad0: TIMEOUT - READ_DMA retrying (2 retries left)
> ata0: resetting devices ..
> ad0: FAILURE - already active DMA on this device
> ad0: setting up DMA failed
> panic: initiate_write_inodeblock_ufs2: already started
> Debugger(panic)
> Stopped at Debugger+0x54: xchgl %ebx,in_Debugger.0
> db> trace

I just had another machine panic in the same failure mode.

kris
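For anyone reading the log lines above, the `status=51` and `error=40` values are hex dumps of the drive's status and error registers, and the names after them are the bits that are set. A minimal sketch of the decoding is below. The bit positions are the standard ATA register layout; READY, DSC, ERROR and UNCORRECTABLE are the labels seen in the log, while the names used for the other bits are descriptive labels chosen for this sketch and may not match what the driver prints.

```python
# ATA status register bits (bit value -> label).
STATUS_BITS = {
    0x80: "BUSY",
    0x40: "READY",
    0x20: "DEVICE_FAULT",
    0x10: "DSC",
    0x08: "DRQ",
    0x04: "CORRECTABLE",
    0x02: "INDEX",
    0x01: "ERROR",
}

# ATA error register bits (bit value -> label).
ERROR_BITS = {
    0x80: "ICRC",
    0x40: "UNCORRECTABLE",
    0x20: "MEDIA_CHANGED",
    0x10: "ID_NOT_FOUND",
    0x08: "MEDIA_CHANGE_REQUEST",
    0x04: "ABORTED",
    0x02: "NO_MEDIA",
    0x01: "ILLEGAL_LENGTH",
}


def decode_register(value, table):
    """Return the labels of all bits set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True) if value & bit]


if __name__ == "__main__":
    print(decode_register(0x51, STATUS_BITS))  # ['READY', 'DSC', 'ERROR']
    print(decode_register(0x40, ERROR_BITS))   # ['UNCORRECTABLE']
```

So 0x51 is 0x40 + 0x10 + 0x01: the drive is ready, seek complete, and flagging an error. The error register then says the data could not be corrected.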
RE: Too many uncorrectable read errors with atang
On Fri, 7 Nov 2003, John Baldwin wrote:
> On 07-Nov-2003 Kris Kennaway wrote:
> > So far this has happened (well, the panic above was new) on 5 separate machines that were all working on older -current. Now, these are all IBM DeathStar drives, but previously I was only experiencing ata errors every month or two, and they were correctable for another month or two by /dev/zero'ing the drive.

IBM Deathstars have this annoying tendency to perform thermal recalibration cycles that cause them to delay returning data for somewhere between 30-90 seconds until the calibration finishes. Unfortunately, these seem to show up as uncorrectable errors. It's a true pain with RAID cards as the RAID array will take the drive offline when it could retry the data.

If you can, try to reduce the temperature of the drives. This generally helped my Deathstars before I got rid of them all.

Also, given the touchiness of PRML detectors, it is entirely possible that the drive is reading increased errors due to the solar flares as a need to thermally recalibrate more often.

Other than tossing the drives, ATAng, like Windows, would have to be more aggressive about retrying even uncorrectable errors for up to a minute or so before giving up.

-a
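The "retry for up to a minute before giving up" idea from this message can be sketched as a deadline-bounded retry loop. This is an illustration only, not how ATAng or any real driver is written: `UncorrectableError`, `read_with_retry`, `FakeClock` and `make_flaky_read` are hypothetical names, and the 60 second deadline and 1 second delay are assumptions picked to match the "minute or so" in the text.

```python
import time


class UncorrectableError(Exception):
    """Stand-in for a drive reporting an uncorrectable read error."""


def read_with_retry(read_fn, deadline_s=60.0, delay_s=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Call read_fn until it succeeds or the deadline would be exceeded.

    Returns (data, attempts). Re-raises the last error once another
    retry would start after deadline_s seconds have elapsed.
    """
    start = clock()
    attempts = 0
    while True:
        attempts += 1
        try:
            return read_fn(), attempts
        except UncorrectableError:
            if clock() - start + delay_s > deadline_s:
                raise
            sleep(delay_s)


class FakeClock:
    """Simulated clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def make_flaky_read(clock, busy_until):
    """Simulate a drive that fails reads until its recalibration ends."""
    def read():
        if clock.time() < busy_until:
            raise UncorrectableError("drive still recalibrating")
        return b"sector data"
    return read
```

With a simulated 45 second recalibration, the read succeeds once the drive comes back, whereas a drive that stays unresponsive for longer than the deadline still ends in an error, which is the point at which a RAID card would be justified in taking it offline.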
Re: Too many uncorrectable read errors with atang
On Fri, Nov 07, 2003 at 08:36:28PM -0800, Andrew P. Lentvorski, Jr. wrote:
> On Fri, 7 Nov 2003, John Baldwin wrote:
> > On 07-Nov-2003 Kris Kennaway wrote:
> > > So far this has happened (well, the panic above was new) on 5 separate machines that were all working on older -current. Now, these are all IBM DeathStar drives, but previously I was only experiencing ata errors every month or two, and they were correctable for another month or two by /dev/zero'ing the drive.
>
> IBM Deathstars have this annoying tendency to perform thermal recalibration cycles that cause them to delay returning data for somewhere between 30-90 seconds until the calibration finishes. Unfortunately, these seem to show up as uncorrectable errors. It's a true pain with RAID cards as the RAID array will take the drive offline when it could retry the data.
>
> If you can, try to reduce the temperature of the drives. This generally helped my Deathstars before I got rid of them all.
>
> Also, given the touchiness of PRML detectors, it is entirely possible that the drive is reading increased errors due to the solar flares as a need to thermally recalibrate more often.
>
> Other than tossing the drives, ATAng, like Windows, would have to be more aggressive about retrying even uncorrectable errors for up to a minute or so before giving up.

Thanks, that's interesting; perhaps there's something sos can do here. Unfortunately the drives in question are in Yahoo's datacenter, so I do not have physical access.

Kris