RE: Adaptec 3210S, 4.9-STABLE, corruption when disk fails
From: Uwe Doering [mailto:[EMAIL PROTECTED]]
> ...
> As far as I understand this family of controllers the OS drivers aren't involved at all in case of a disk drive failure. It's strictly the controller's business to deal with it internally. The OS just sits there and waits until the controller is done with the retries and either drops into degraded mode or recovers from the disk error.
>
> That's why I initially speculated that there might be a timeout somewhere in PostgreSQL or FreeBSD that leads to data loss if the controller is busy for too long.
>
> A somewhat radical way to at least make these failures as rare an event as possible would be to deliberately fail all remaining old disk drives, one after the other of course, in order to get rid of them. And if you are lucky the problem won't happen with newer drives anyway, in case the root cause is an incompatibility between the controller and the old drives.

Started that yesterday. I've got one 'old' one left. Sadly, the one that failed night before last was not one of the 'old' ones, so this is no guarantee :)

From the raidutil -e log, I see this type of info. I'm not sure what the 'unknown' events are. The 'CRC Failure' is probably the problem? There's also Bad SCSI Status, unit attention, etc. Perhaps the driver doesn't deal with these properly?

$ raidutil -e d0
03/31/2005 23:37:59 Level 1 Lock for Channel 0 : Started
03/31/2005 23:37:59 Level 1 Lock for Channel 1 : Started
03/31/2005 23:38:09 Level 1 Lock for Channel 0 : Stopped
03/31/2005 23:38:22 Level 1 Lock for Channel 1 : Stopped
03/31/2005 23:38:22 Level 4 HBA=0 BUS=0 ID=0 LUN=0 Status Change Optimal = Degraded - Drive Failed
03/31/2005 23:38:22 Level 1 Unknown Event : 56 10 00 08 EE 89 4C 42 00 00 00 00
03/31/2005 23:38:22 Level 1 CRC Failure Number of dirty blocks = -1 D30A1F2A
03/31/2005 23:38:24 Level 3 HBA=0 BUS=0 ID=0 LUN=0 Bad SCSI Status - Check Condition 28 00 00 00 00 00 00 00 01 00 00 00
03/31/2005 23:38:24 Level 3 HBA=0 BUS=0 ID=0 LUN=0 Request Sense 70 00 06 00 00 00 00 0A 00 00 00 00 29 02 02 00 00 00 Unit Attention

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]
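For anyone who wants to sift a longer event log, the `raidutil -e` output above can be split into one record per event. The following is only a sketch: it assumes every event starts with `MM/DD/YYYY HH:MM:SS Level N`, which is what the excerpt shows but is not a documented format, and the names `parse_events` and `count_by_level` are made up for this illustration. The thread does not say what the level numbers mean.

```python
import re
from collections import Counter

# A few events copied from the log excerpt above, in flattened form.
RAW = (
    "03/31/2005 23:37:59 Level 1 Lock for Channel 0 : Started "
    "03/31/2005 23:38:22 Level 4 HBA=0 BUS=0 ID=0 LUN=0 Status Change "
    "Optimal = Degraded - Drive Failed "
    "03/31/2005 23:38:22 Level 1 CRC Failure Number of dirty blocks = -1 D30A1F2A "
    "03/31/2005 23:38:24 Level 3 HBA=0 BUS=0 ID=0 LUN=0 Request Sense "
    "70 00 06 00 00 00 00 0A 00 00 00 00 29 02 02 00 00 00 Unit Attention"
)

# Assumed layout: "MM/DD/YYYY HH:MM:SS Level N <free text>".
# The message runs until the next timestamp header or the end of the input.
EVENT = re.compile(
    r"(\d{2}/\d{2}/\d{4}) (\d{2}:\d{2}:\d{2}) Level (\d+) "
    r"(.*?)(?=\s*\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} Level |\s*$)",
    re.S,
)


def parse_events(text):
    """Split raw event-log text into a list of dicts, one per event."""
    return [
        {"date": d, "time": t, "level": int(lvl), "message": msg.strip()}
        for d, t, lvl, msg in EVENT.findall(text)
    ]


def count_by_level(events):
    """Count events per 'Level' value."""
    return dict(Counter(e["level"] for e in events))


if __name__ == "__main__":
    for event in parse_events(RAW):
        print(event["time"], event["level"], event["message"])
```

Because the pattern keys on the timestamp header rather than on line breaks, it works on both the one-event-per-line form and a log that has been flattened into a single line.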
Re: Adaptec 3210S, 4.9-STABLE, corruption when disk fails
Don Bowman wrote:
> From: Uwe Doering [mailto:[EMAIL PROTECTED]]
>> ...
>> As far as I understand this family of controllers the OS drivers aren't involved at all in case of a disk drive failure. It's strictly the controller's business to deal with it internally. The OS just sits there and waits until the controller is done with the retries and either drops into degraded mode or recovers from the disk error.
>>
>> That's why I initially speculated that there might be a timeout somewhere in PostgreSQL or FreeBSD that leads to data loss if the controller is busy for too long.
>>
>> A somewhat radical way to at least make these failures as rare an event as possible would be to deliberately fail all remaining old disk drives, one after the other of course, in order to get rid of them. And if you are lucky the problem won't happen with newer drives anyway, in case the root cause is an incompatibility between the controller and the old drives.
>
> Started that yesterday. I've got one 'old' one left. Sadly, the one that failed night before last was not one of the 'old' ones, so this is no guarantee :)
>
> From the raidutil -e log, I see this type of info. I'm not sure what the 'unknown' events are. The 'CRC Failure' is probably the problem? There's also Bad SCSI Status, unit attention, etc. Perhaps the driver doesn't deal with these properly?

In my opinion what the log shows in this case is internal communication between the controller and the disk drives. The OS driver is not involved.

In the past I've seen CRC errors like these as a result of bad cabling or contact problems. You may want to check the SCSI cables. They have to be properly terminated and there must not be any sharp kinks given the signal frequencies involved these days. Also, pluggable drive bays can cause this. Every electrical contact is a potential source of trouble. Finally, faulty or overloaded power supplies can cause glitches like these. This can be especially hard to debug.

When these hardware issues have been taken care of you may want to start a RAID verification/correction run. If it shows any inconsistencies this may be an indication of former hardware glitches. I'm not sure whether you can trigger that process through 'raidutil'. I've always used the X11 'dptmgr' program. You can terminate it after having started the verification. It continues to run in the background (inside the controller).

Uwe
--
Uwe Doering | EscapeBox - Managed On-Demand UNIX Servers
[EMAIL PROTECTED] | http://www.escapebox.net
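The hex bytes in the two Level 3 entries of the posted event log can be decoded by hand, which may help in judging what the controller and the drive were saying to each other. This is a rough sketch resting on two assumptions the thread does not confirm: that the bytes after "Check Condition" are the failed command's CDB (padded to 12 bytes), and that the bytes after "Request Sense" are standard SCSI fixed-format sense data. The function names and the deliberately partial lookup tables are illustrative only.

```python
# Partial tables taken from the SCSI standard; only a few entries are listed.
SENSE_KEYS = {
    0x0: "No Sense",
    0x1: "Recovered Error",
    0x2: "Not Ready",
    0x3: "Medium Error",
    0x4: "Hardware Error",
    0x5: "Illegal Request",
    0x6: "Unit Attention",
    0xB: "Aborted Command",
}
ASC_ASCQ = {
    (0x29, 0x00): "Power on, reset, or bus device reset occurred",
    (0x29, 0x02): "SCSI bus reset occurred",
}


def parse_hex(text):
    """Turn a string like '70 00 06' into bytes."""
    return bytes(int(token, 16) for token in text.split())


def decode_fixed_sense(data):
    """Decode the sense key and ASC/ASCQ from fixed-format sense data."""
    if len(data) < 14 or (data[0] & 0x7F) not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = data[2] & 0x0F          # low nibble of byte 2
    asc, ascq = data[12], data[13]
    return {
        "sense_key": SENSE_KEYS.get(key, "unlisted"),
        "asc": asc,
        "ascq": ascq,
        "description": ASC_ASCQ.get((asc, ascq), "unlisted"),
    }


def decode_read10(cdb):
    """Return (lba, block_count) for a READ(10) CDB (opcode 0x28)."""
    if cdb[0] != 0x28:
        raise ValueError("not a READ(10) command")
    lba = int.from_bytes(cdb[2:6], "big")
    blocks = int.from_bytes(cdb[7:9], "big")
    return lba, blocks


if __name__ == "__main__":
    sense = parse_hex("70 00 06 00 00 00 00 0A 00 00 00 00 29 02 02 00 00 00")
    cdb = parse_hex("28 00 00 00 00 00 00 00 01 00 00 00")
    print(decode_fixed_sense(sense))
    print(decode_read10(cdb))
```

Under those assumptions the entries would read as a one-block READ(10) at LBA 0 that came back with Unit Attention, ASC/ASCQ 29h/02h, meaning the drive saw a SCSI bus reset. The log alone does not show who issued that reset.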
RE: Adaptec 3210S, 4.9-STABLE, corruption when disk fails
From: [EMAIL PROTECTED]
> From: Uwe Doering [mailto:[EMAIL PROTECTED]]
>> ...
>> Did you merge 1.3.2.3 as well? This actually should have been one MFC
>
> Yes, merged from RELENG_4. I will post later if this happens again, but it will be quite a long time. The machine has 7 drives in it, there are only 3 ones left old enough they might fail before I take it out of service (it originally had 7 1999-era IBM drives, now it has 4 2004-era Seagate drives and 3 of the old IBMs. The drives have been in continuous service, so they've led a pretty good life!)
>
> Thanks for the suggestion on the cam timeout, I've set that value.

Another drive failed and the same thing happened. After the failure, the raid worked in degrade mode just fine, but many files had been corrupted during the failure.

So I would suggest that this merge did not help, and the cam timeout did not help either.

This is very frustrating, again I rebuild my postgresql install from backup :(

--don
Re: Adaptec 3210S, 4.9-STABLE, corruption when disk fails
Don Bowman wrote:
> From: [EMAIL PROTECTED]
>> From: Uwe Doering [mailto:[EMAIL PROTECTED]]
>>> ...
>>> Did you merge 1.3.2.3 as well? This actually should have been one MFC
>>
>> Yes, merged from RELENG_4. I will post later if this happens again, but it will be quite a long time. The machine has 7 drives in it, there are only 3 ones left old enough they might fail before I take it out of service (it originally had 7 1999-era IBM drives, now it has 4 2004-era Seagate drives and 3 of the old IBMs. The drives have been in continuous service, so they've led a pretty good life!)
>>
>> Thanks for the suggestion on the cam timeout, I've set that value.
>
> Another drive failed and the same thing happened. After the failure, the raid worked in degrade mode just fine, but many files had been corrupted during the failure.
>
> So I would suggest that this merge did not help, and the cam timeout did not help either.
>
> This is very frustrating, again I rebuild my postgresql install from backup :(

This is indeed unfortunate. Maybe the problem is in fact located neither in PostgreSQL nor in FreeBSD but in the controller itself. Does it have the latest firmware? The necessary files should be available on Adaptec's website, and you can use the 'raidutil' program under FreeBSD to upload the firmware to the controller. I have to concede, however, that I never did this under FreeBSD myself. If I recall correctly I did the upload via a DOS diskette the last time.

If this doesn't help either you could ask Adaptec's support for help. You need to register the controller first, if memory serves.

Uwe
--
Uwe Doering | EscapeBox - Managed On-Demand UNIX Servers
[EMAIL PROTECTED] | http://www.escapebox.net
RE: Adaptec 3210S, 4.9-STABLE, corruption when disk fails
From: Uwe Doering [mailto:[EMAIL PROTECTED]]
> Don Bowman wrote:
>> From: [EMAIL PROTECTED]
>>> From: Uwe Doering [mailto:[EMAIL PROTECTED]]
>>>> ...
>>>> Did you merge 1.3.2.3 as well? This actually should have been one MFC
>>>
>>> Yes, merged from RELENG_4. I will post later if this happens again, but it will be quite a long time. The machine has 7 drives in it, there are only 3 ones left old enough they might fail before I take it out of service (it originally had 7 1999-era IBM drives, now it has 4 2004-era Seagate drives and 3 of the old IBMs. The drives have been in continuous service, so they've led a pretty good life!)
>>>
>>> Thanks for the suggestion on the cam timeout, I've set that value.
>>
>> Another drive failed and the same thing happened. After the failure, the raid worked in degrade mode just fine, but many files had been corrupted during the failure.
>>
>> So I would suggest that this merge did not help, and the cam timeout did not help either.
>>
>> This is very frustrating, again I rebuild my postgresql install from backup :(
>
> This is indeed unfortunate. Maybe the problem is in fact located neither in PostgreSQL nor in FreeBSD but in the controller itself. Does it have the latest firmware? The necessary files should be available on Adaptec's website, and you can use the 'raidutil' program under FreeBSD to upload the firmware to the controller. I have to concede, however, that I never did this under FreeBSD myself. If I recall correctly I did the upload via a DOS diskette the last time.
>
> If this doesn't help either you could ask Adaptec's support for help. You need to register the controller first, if memory serves.

The latest firmware bios is in the controller (upgraded the last time I had problems).

Tried Adaptec support, controller is registered.

The problem is definitely not in postgresql. Files go missing in directories that are having new entries added (e.g. I lost a 'PG_VERSION' file). Data within the postgresql files becomes corrupt.

Since the only application running is postgresql, and it reads/writes/fsyncs the data, it's not unexpected that it's the one that reaps the 'rewards' of the failure.

I have to believe this is either a bug in the controller, or a problem in cam or asr.

--don
Re: Adaptec 3210S, 4.9-STABLE, corruption when disk fails
Don Bowman wrote:
> From: Uwe Doering [mailto:[EMAIL PROTECTED]]
>> Don Bowman wrote:
>> [...]
>>> Another drive failed and the same thing happened. After the failure, the raid worked in degrade mode just fine, but many files had been corrupted during the failure.
>>>
>>> So I would suggest that this merge did not help, and the cam timeout did not help either.
>>>
>>> This is very frustrating, again I rebuild my postgresql install from backup :(
>>
>> This is indeed unfortunate. Maybe the problem is in fact located neither in PostgreSQL nor in FreeBSD but in the controller itself. Does it have the latest firmware? The necessary files should be available on Adaptec's website, and you can use the 'raidutil' program under FreeBSD to upload the firmware to the controller. I have to concede, however, that I never did this under FreeBSD myself. If I recall correctly I did the upload via a DOS diskette the last time.
>>
>> If this doesn't help either you could ask Adaptec's support for help. You need to register the controller first, if memory serves.
>
> The latest firmware bios is in the controller (upgraded the last time I had problems).
>
> Tried Adaptec support, controller is registered.
>
> The problem is definitely not in postgresql. Files go missing in directories that are having new entries added (e.g. I lost a 'PG_VERSION' file). Data within the postgresql files becomes corrupt.
>
> Since the only application running is postgresql, and it reads/writes/fsyncs the data, it's not unexpected that it's the one that reaps the 'rewards' of the failure.
>
> I have to believe this is either a bug in the controller, or a problem in cam or asr.

As far as I understand this family of controllers the OS drivers aren't involved at all in case of a disk drive failure. It's strictly the controller's business to deal with it internally. The OS just sits there and waits until the controller is done with the retries and either drops into degraded mode or recovers from the disk error.

That's why I initially speculated that there might be a timeout somewhere in PostgreSQL or FreeBSD that leads to data loss if the controller is busy for too long.

A somewhat radical way to at least make these failures as rare an event as possible would be to deliberately fail all remaining old disk drives, one after the other of course, in order to get rid of them. And if you are lucky the problem won't happen with newer drives anyway, in case the root cause is an incompatibility between the controller and the old drives.

Uwe
--
Uwe Doering | EscapeBox - Managed On-Demand UNIX Servers
[EMAIL PROTECTED] | http://www.escapebox.net
Re: Adaptec 3210S, 4.9-STABLE, corruption when disk fails
Don Bowman wrote:
> From: Uwe Doering [mailto:[EMAIL PROTECTED]]
>> Don Bowman wrote:
>>> I have a machine running:
>>> $ uname -a
>>> FreeBSD machine.phaedrus.sandvine.com 4.9-STABLE FreeBSD 4.9-STABLE #0: Fri Mar 19 10:39:07 EST 2004 [EMAIL PROTECTED]:/usr/src/sys/compile/LABDB i386
>> ...
>
> I have merged asr.c from RELENG_4 to get this fix:
>
>   Fix a mis-merge in the MFC of rev. 1.64 in rev. 1.3.2.3; the following change wasn't included:
>   - Set the CAM status to CAM_SCSI_STATUS_ERROR rather than CAM_REQ_CMP in case of a CHECK CONDITION.
>
> since I guess it's conceivable this could cause my problem.

I have to admit that I didn't think of this right away, even though I was kind of involved. Did you merge 1.3.2.3 as well? This actually should have been one MFC but it was done in two steps due to an oversight.

Please let us know whether the fix makes any difference in your case. Its author made it for CD burners and wasn't sure whether it has any effect on other devices, like da(4).

Uwe
--
Uwe Doering | EscapeBox - Managed On-Demand UNIX Servers
[EMAIL PROTECTED] | http://www.escapebox.net
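To make the quoted commit message a little more concrete, here is a toy model of why reporting CAM_REQ_CMP for a command that ended in CHECK CONDITION could matter in principle. It is not the asr.c code: only the two CAM status names and the CHECK CONDITION term come from the commit message, while `map_status`, `consumer_accepts` and the `fixed` flag are hypothetical names invented for this sketch. Whether this path is ever taken for da(4) I/O on a failing array is exactly what the thread leaves open.

```python
from enum import Enum


class ScsiStatus(Enum):
    GOOD = "GOOD"
    CHECK_CONDITION = "CHECK CONDITION"


class CamStatus(Enum):
    CAM_REQ_CMP = "request completed without error"
    CAM_SCSI_STATUS_ERROR = "request completed with a SCSI error status"


def map_status(scsi_status, fixed):
    """Toy stand-in for a driver's completion path (NOT the real asr.c).

    fixed=False models the mis-merged behaviour described in the commit
    message, fixed=True the corrected one.
    """
    if scsi_status is ScsiStatus.CHECK_CONDITION and fixed:
        return CamStatus.CAM_SCSI_STATUS_ERROR
    return CamStatus.CAM_REQ_CMP


def consumer_accepts(cam_status):
    """A caller that trusts the data buffer only on clean completion."""
    return cam_status is CamStatus.CAM_REQ_CMP


if __name__ == "__main__":
    for fixed in (False, True):
        status = map_status(ScsiStatus.CHECK_CONDITION, fixed)
        print(f"fixed={fixed}: {status.name}, accepted={consumer_accepts(status)}")
```

In this model the unfixed mapping lets a failed command look successful, so the caller would accept whatever is in the buffer. With the fix the caller sees an error and can retry or fail the request instead.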
Re: Adaptec 3210S, 4.9-STABLE, corruption when disk fails
Uwe Doering wrote:
> Don Bowman wrote:
>> I have merged asr.c from RELENG_4 to get this fix:
>>
>>   Fix a mis-merge in the MFC of rev. 1.64 in rev. 1.3.2.3; the following change wasn't included:
>>   - Set the CAM status to CAM_SCSI_STATUS_ERROR rather than CAM_REQ_CMP in case of a CHECK CONDITION.
>>
>> since I guess it's conceivable this could cause my problem.
>
> I have to admit that I didn't think of this right away, even though I was kind of involved. Did you merge 1.3.2.3 as well? This actually should have been one MFC but it was done in two steps due to an oversight.
>
> Please let us know whether the fix makes any difference in your case. Its author made it for CD burners and wasn't sure whether it has any effect on other devices, like da(4).

Memory's coming back piecemeal. ;-) There's another thing you could try. The 'asr' driver's original timeout is 360 seconds, because its author knew that this type of controller can be busy for quite some time. FreeBSD's SCSI driver, however, sets it to its default of 60 seconds, which can be way too short.

What happens when the controller is busy trying to deal with a failed disk is that the 'asr' driver sends a bus reset to the controller as a whole, due to the short timeout. You should be able to see this clash in the controller's event log. My feeling is that this collision of events may have ill effects, like the data corruption you've observed.

On our machines we've set the SCSI timeout and thereby also the 'asr' driver's timeout back to the original 360 seconds, in order to leave the controller alone while it is busy. There is a 'sysctl' variable for this:

  kern.cam.da.default_timeout=360

Maybe that's the actual fix for your problem.

Uwe
--
Uwe Doering | EscapeBox - Managed On-Demand UNIX Servers
[EMAIL PROTECTED] | http://www.escapebox.net
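As a sketch of how the timeout setting above might be kept across reboots, the helper below adds or updates the `kern.cam.da.default_timeout` line in the text of a sysctl configuration file (on FreeBSD that would usually be /etc/sysctl.conf). The helper names are made up for this example, and `would_time_out` is a deliberate simplification that ignores retries. It only illustrates why a controller that stalls for a couple of minutes trips a 60-second timeout but not a 360-second one.

```python
def set_sysctl_line(conf_text, name, value):
    """Return conf_text with `name=value` set exactly once.

    An existing assignment of `name` is replaced in place, duplicate
    assignments are dropped, and the setting is appended if it is missing.
    """
    wanted = f"{name}={value}"
    out, seen = [], False
    for line in conf_text.splitlines():
        # Ignore trailing comments when looking for the variable name.
        key = line.split("#", 1)[0].split("=", 1)[0].strip()
        if key == name:
            if not seen:
                out.append(wanted)
                seen = True
            continue
        out.append(line)
    if not seen:
        out.append(wanted)
    return "\n".join(out) + "\n"


def would_time_out(busy_seconds, timeout_seconds):
    """Simplified model: a command times out if the controller stays busy
    longer than the configured timeout (retries are ignored)."""
    return busy_seconds > timeout_seconds


if __name__ == "__main__":
    old = "# local tuning\nkern.cam.da.default_timeout=60\n"
    print(set_sysctl_line(old, "kern.cam.da.default_timeout", 360), end="")
    print(would_time_out(120, 60), would_time_out(120, 360))
```

The function works on a string rather than on the file itself, so the result can be inspected before anything is written back.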
RE: Adaptec 3210S, 4.9-STABLE, corruption when disk fails
From: Uwe Doering [mailto:[EMAIL PROTECTED]]
> ...
> Did you merge 1.3.2.3 as well? This actually should have been one MFC

Yes, merged from RELENG_4. I will post later if this happens again, but it will be quite a long time. The machine has 7 drives in it, there are only 3 ones left old enough they might fail before I take it out of service (it originally had 7 1999-era IBM drives, now it has 4 2004-era Seagate drives and 3 of the old IBMs. The drives have been in continuous service, so they've led a pretty good life!)

Thanks for the suggestion on the cam timeout, I've set that value.

--don
Re: Adaptec 3210S, 4.9-STABLE, corruption when disk fails
Don Bowman wrote:
> I have a machine running:
> $ uname -a
> FreeBSD machine.phaedrus.sandvine.com 4.9-STABLE FreeBSD 4.9-STABLE #0: Fri Mar 19 10:39:07 EST 2004 [EMAIL PROTECTED]:/usr/src/sys/compile/LABDB i386
>
> It has an adaptec 3210S raid controller running a single raid-5, and runs postgresql 7.4.6 as its primary application.
>
> 3 times now I have had a drive fail, and have had corrupted files in the postgresql cluster @ the same time.
>
> The time is too closely correlated to be a coincidence. It passes fsck @ the time that I got to it a couple of hours later, and the filesystem seems to be ok (with a failed drive, the raid in 'degrade' mode).
>
> It appears that the drive failure and the postgresql failure occur @ exactly the same time (monitoring with nagios, within 1hr accuracy).
>
> It would appear that for some file(s) bad data was returned.
>
> Does anyone have any suggestions?

In my experience, in a situation like this RAID controllers can block the system for up to a couple of minutes, trying to revive a failed disk drive by sending it bus reset commands and the like, until they eventually give up and drop into degraded mode.

With sufficiently patient applications this is no problem, but if a program runs into internal timeouts during this period of time bad things can happen. My point is that while the disk controller may trigger the problem the instance that actually corrupts the database might be PostgreSQL itself.

Of course, I'm aware that it's going to be quite hard to tell for sure who the culprit is.

Uwe
--
Uwe Doering | EscapeBox - Managed On-Demand UNIX Servers
[EMAIL PROTECTED] | http://www.escapebox.net
RE: Adaptec 3210S, 4.9-STABLE, corruption when disk fails
From: Uwe Doering [mailto:[EMAIL PROTECTED]]
> Don Bowman wrote:
>> I have a machine running:
>> $ uname -a
>> FreeBSD machine.phaedrus.sandvine.com 4.9-STABLE FreeBSD 4.9-STABLE #0: Fri Mar 19 10:39:07 EST 2004 [EMAIL PROTECTED]:/usr/src/sys/compile/LABDB i386
> ...

I have merged asr.c from RELENG_4 to get this fix:

  Fix a mis-merge in the MFC of rev. 1.64 in rev. 1.3.2.3; the following change wasn't included:
  - Set the CAM status to CAM_SCSI_STATUS_ERROR rather than CAM_REQ_CMP in case of a CHECK CONDITION.

since I guess it's conceivable this could cause my problem.

--don