RE: Adaptec 3210S, 4.9-STABLE, corruption when disk fails

2005-04-01 Thread Don Bowman
From: Uwe Doering [mailto:[EMAIL PROTECTED]
> ...
> As far as I understand this family of controllers the OS
> drivers aren't involved at all in case of a disk drive
> failure.  It's strictly the controller's business to deal
> with it internally.  The OS just sits there and waits until
> the controller is done with the retries and either drops into
> degraded mode or recovers from the disk error.
>
> That's why I initially speculated that there might be a
> timeout somewhere in PostgreSQL or FreeBSD that leads to data
> loss if the controller is busy for too long.
>
> A somewhat radical way to at least make these failures as
> rare an event as possible would be to deliberately fail all
> remaining old disk drives, one after the other of course, in
> order to get rid of them.  And if you are lucky the problem
> won't happen with newer drives anyway, in case the root cause
> is an incompatibility between the controller and the old drives.

Started that yesterday. I've got one 'old' one left.
Sadly, the one that failed the night before last was not one of the
'old' ones, so this is no guarantee :)

From the raidutil -e log, I see this type of info. I'm not sure 
what the 'unknown' events are. The 'CRC Failure' is probably the
problem? There's also Bad SCSI Status, unit attention, etc.
Perhaps the driver doesn't deal with these properly?

$ raidutil -e d0
03/31/2005  23:37:59   Level 1
Lock for Channel 0 : Started


03/31/2005  23:37:59   Level 1
Lock for Channel 1 : Started


03/31/2005  23:38:09   Level 1
Lock for Channel 0 : Stopped


03/31/2005  23:38:22   Level 1
Lock for Channel 1 : Stopped


03/31/2005  23:38:22   Level 4
HBA=0 BUS=0 ID=0 LUN=0
Status Change
Optimal   = Degraded - Drive Failed


03/31/2005  23:38:22   Level 1
Unknown Event : 56 10 00 08 EE 89 4C 42 00 00 00 00 


03/31/2005  23:38:22   Level 1
CRC Failure
Number of dirty blocks = -1
 D30A1F2A      
        


03/31/2005  23:38:24   Level 3
HBA=0 BUS=0 ID=0 LUN=0
Bad SCSI Status - Check Condition
28 00 00 00 00 00 00 00 01 00 00 00 


03/31/2005  23:38:24   Level 3
HBA=0 BUS=0 ID=0 LUN=0
Request Sense
70 00 06 00 00 00 00 0A 00 00 00 00 29 02 02 00 00 00 
Unit Attention
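
As a reading aid for the last event in the log above, here is a minimal sketch that decodes the Request Sense bytes by hand. It assumes the 18 bytes are standard fixed-format SCSI sense data (suggested by the leading 70h response code); the function name decode_sense is made up for illustration, and only the values that occur in this log are given names.

```shell
#!/bin/sh
# Decode the key fields of fixed-format SCSI sense data.
# Usage: decode_sense "70 00 06 ..."   (space-separated hex bytes)
# decode_sense is a hypothetical helper, not part of raidutil.
decode_sense() {
    set -- $1                       # split the string into $1..$N
    if [ "$#" -lt 14 ]; then
        echo "need at least 14 sense bytes" >&2
        return 1
    fi
    resp=$(( 0x$1 & 0x7F ))         # byte 0: response code
    if [ "$resp" -ne 112 ] && [ "$resp" -ne 113 ]; then   # 70h / 71h
        echo "not fixed-format sense data" >&2
        return 1
    fi
    key=$(( 0x$3 & 0x0F ))          # byte 2, low nibble: sense key
    asc=$(( 0x${13} ))              # byte 12: additional sense code
    ascq=$(( 0x${14} ))             # byte 13: additional sense code qualifier

    # Only the values that appear in the log above are named here.
    case $key in
        6) keyname="UNIT ATTENTION" ;;
        *) keyname="other" ;;
    esac
    case "$asc/$ascq" in
        41/2) ascname="SCSI BUS RESET OCCURRED" ;;   # 29h/02h
        *)    ascname="not decoded in this sketch" ;;
    esac
    printf 'sense key %Xh (%s), ASC/ASCQ %02Xh/%02Xh (%s)\n' \
        "$key" "$keyname" "$asc" "$ascq" "$ascname"
}

decode_sense "70 00 06 00 00 00 00 0A 00 00 00 00 29 02 02 00 00 00"
```

For the logged bytes this reports sense key 6h (Unit Attention) with ASC/ASCQ 29h/02h, which the SCSI standard assigns to "SCSI bus reset occurred". That says a bus reset took place around the time of the failure, but not who issued it.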

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Adaptec 3210S, 4.9-STABLE, corruption when disk fails

2005-04-01 Thread Uwe Doering
Don Bowman wrote:
> From: Uwe Doering [mailto:[EMAIL PROTECTED]
>> ...
>> As far as I understand this family of controllers the OS
>> drivers aren't involved at all in case of a disk drive
>> failure.  It's strictly the controller's business to deal
>> with it internally.  The OS just sits there and waits until
>> the controller is done with the retries and either drops into
>> degraded mode or recovers from the disk error.
>>
>> That's why I initially speculated that there might be a
>> timeout somewhere in PostgreSQL or FreeBSD that leads to data
>> loss if the controller is busy for too long.
>>
>> A somewhat radical way to at least make these failures as
>> rare an event as possible would be to deliberately fail all
>> remaining old disk drives, one after the other of course, in
>> order to get rid of them.  And if you are lucky the problem
>> won't happen with newer drives anyway, in case the root cause
>> is an incompatibility between the controller and the old drives.
>
> Started that yesterday. I've got one 'old' one left.
> Sadly, the one that failed the night before last was not one of the
> 'old' ones, so this is no guarantee :)
>
> From the raidutil -e log, I see this type of info. I'm not sure
> what the 'unknown' events are. The 'CRC Failure' is probably the
> problem? There's also Bad SCSI Status, unit attention, etc.
> Perhaps the driver doesn't deal with these properly?

In my opinion what the log shows in this case is internal communication
between the controller and the disk drives.  The OS driver is not
involved.  In the past I've seen CRC errors like these as a result of
bad cabling or contact problems.  You may want to check the SCSI cables.
They have to be properly terminated and there must not be any sharp
kinks given the signal frequencies involved these days.  Also, pluggable
drive bays can cause this.  Every electrical contact is a potential
source of trouble.  Finally, faulty or overloaded power supplies can
cause glitches like these.  This can be especially hard to debug.

When these hardware issues have been taken care of you may want to start 
a RAID verification/correction run.  If it shows any inconsistencies 
this may be an indication of former hardware glitches.  I'm not sure 
whether you can trigger that process through 'raidutil'.  I've always 
used the X11 'dptmgr' program.  You can terminate it after having 
started the verification.  It continues to run in the background (inside 
the controller).

   Uwe
--
Uwe Doering |  EscapeBox - Managed On-Demand UNIX Servers
[EMAIL PROTECTED]  |  http://www.escapebox.net


RE: Adaptec 3210S, 4.9-STABLE, corruption when disk fails

2005-03-31 Thread Don Bowman
From: [EMAIL PROTECTED]
> From: Uwe Doering [mailto:[EMAIL PROTECTED]
>> ...
>> Did you merge 1.3.2.3 as well?  This actually should have
>> been one MFC
>
> Yes, merged from RELENG_4.
>
> I will post later if this happens again, but it will be quite
> a long time. The machine has 7 drives in it, and there are only
> 3 left old enough that they might fail before I take it out
> of service (it originally had 7 1999-era IBM drives; now
> it has 4 2004-era Seagate drives and 3 of the old IBMs.
> The drives have been in continuous service, so they've led
> a pretty good life!)
>
> Thanks for the suggestion on the cam timeout, I've set that
> value.

Another drive failed and the same thing happened.
After the failure, the RAID worked in degraded mode just
fine, but many files had been corrupted during the failure.

So I would suggest that this merge did not help, and the
cam timeout did not help either.

This is very frustrating; once again I have to rebuild my PostgreSQL
install from backup :(

--don


Re: Adaptec 3210S, 4.9-STABLE, corruption when disk fails

2005-03-31 Thread Uwe Doering
Don Bowman wrote:
> From: [EMAIL PROTECTED]
>> From: Uwe Doering [mailto:[EMAIL PROTECTED]
>>> ...
>>> Did you merge 1.3.2.3 as well?  This actually should have
>>> been one MFC
>>
>> Yes, merged from RELENG_4.
>>
>> I will post later if this happens again, but it will be quite
>> a long time. The machine has 7 drives in it, and there are only
>> 3 left old enough that they might fail before I take it out
>> of service (it originally had 7 1999-era IBM drives; now
>> it has 4 2004-era Seagate drives and 3 of the old IBMs.
>> The drives have been in continuous service, so they've led
>> a pretty good life!)
>>
>> Thanks for the suggestion on the cam timeout, I've set that
>> value.
>
> Another drive failed and the same thing happened.
> After the failure, the RAID worked in degraded mode just
> fine, but many files had been corrupted during the failure.
>
> So I would suggest that this merge did not help, and the
> cam timeout did not help either.
>
> This is very frustrating; once again I have to rebuild my PostgreSQL
> install from backup :(

This is indeed unfortunate.  Maybe the problem is in fact located
neither in PostgreSQL nor in FreeBSD but in the controller itself.  Does
it have the latest firmware?  The necessary files should be available on
Adaptec's website, and you can use the 'raidutil' program under FreeBSD
to upload the firmware to the controller.  I have to concede, however,
that I never did this under FreeBSD myself.  If I recall correctly I did
the upload via a DOS diskette the last time.

If this doesn't help either you could ask Adaptec's support for help. 
You need to register the controller first, if memory serves.

   Uwe
--
Uwe Doering |  EscapeBox - Managed On-Demand UNIX Servers
[EMAIL PROTECTED]  |  http://www.escapebox.net


RE: Adaptec 3210S, 4.9-STABLE, corruption when disk fails

2005-03-31 Thread Don Bowman
From: Uwe Doering [mailto:[EMAIL PROTECTED]
> Don Bowman wrote:
>> From: [EMAIL PROTECTED]
>>> From: Uwe Doering [mailto:[EMAIL PROTECTED]
>>>> ...
>>>> Did you merge 1.3.2.3 as well?  This actually should have
>>>> been one MFC
>>>
>>> Yes, merged from RELENG_4.
>>>
>>> I will post later if this happens again, but it will be quite
>>> a long time. The machine has 7 drives in it, and there are only
>>> 3 left old enough that they might fail before I take it out
>>> of service (it originally had 7 1999-era IBM drives; now
>>> it has 4 2004-era Seagate drives and 3 of the old IBMs.
>>> The drives have been in continuous service, so they've led
>>> a pretty good life!)
>>>
>>> Thanks for the suggestion on the cam timeout, I've set that value.
>>
>> Another drive failed and the same thing happened.
>> After the failure, the RAID worked in degraded mode just
>> fine, but many files had been corrupted during the failure.
>>
>> So I would suggest that this merge did not help, and the
>> cam timeout did not help either.
>>
>> This is very frustrating; once again I have to rebuild my PostgreSQL
>> install from backup :(
>
> This is indeed unfortunate.  Maybe the problem is in fact
> located neither in PostgreSQL nor in FreeBSD but in the
> controller itself.  Does it have the latest firmware?  The
> necessary files should be available on Adaptec's website, and
> you can use the 'raidutil' program under FreeBSD to upload
> the firmware to the controller.  I have to concede, however,
> that I never did this under FreeBSD myself.  If I recall
> correctly I did the upload via a DOS diskette the last time.
>
> If this doesn't help either you could ask Adaptec's support for help.
> You need to register the controller first, if memory serves.

The latest firmware and BIOS are in the controller (upgraded the
last time I had problems).

Tried Adaptec support; the controller is registered.

The problem is definitely not in PostgreSQL. Files go missing
in directories that are having new entries added (e.g. I lost
a 'PG_VERSION' file). Data within the PostgreSQL files becomes
corrupt. Since the only application running is PostgreSQL,
and it reads/writes/fsyncs the data, it's not unexpected that
it's the one that reaps the 'rewards' of the failure.

I have to believe this is either a bug in the controller,
or a problem in cam or asr.

--don


Re: Adaptec 3210S, 4.9-STABLE, corruption when disk fails

2005-03-31 Thread Uwe Doering
Don Bowman wrote:
> From: Uwe Doering [mailto:[EMAIL PROTECTED]
>> Don Bowman wrote:
>>> [...]
>>> Another drive failed and the same thing happened.
>>> After the failure, the RAID worked in degraded mode just
>>> fine, but many files had been corrupted during the failure.
>>>
>>> So I would suggest that this merge did not help, and the
>>> cam timeout did not help either.
>>>
>>> This is very frustrating; once again I have to rebuild my PostgreSQL
>>> install from backup :(
>>
>> This is indeed unfortunate.  Maybe the problem is in fact
>> located neither in PostgreSQL nor in FreeBSD but in the
>> controller itself.  Does it have the latest firmware?  The
>> necessary files should be available on Adaptec's website, and
>> you can use the 'raidutil' program under FreeBSD to upload
>> the firmware to the controller.  I have to concede, however,
>> that I never did this under FreeBSD myself.  If I recall
>> correctly I did the upload via a DOS diskette the last time.
>>
>> If this doesn't help either you could ask Adaptec's support for help.
>> You need to register the controller first, if memory serves.
>
> The latest firmware and BIOS are in the controller (upgraded the
> last time I had problems).
>
> Tried Adaptec support; the controller is registered.
>
> The problem is definitely not in PostgreSQL. Files go missing
> in directories that are having new entries added (e.g. I lost
> a 'PG_VERSION' file). Data within the PostgreSQL files becomes
> corrupt. Since the only application running is PostgreSQL,
> and it reads/writes/fsyncs the data, it's not unexpected that
> it's the one that reaps the 'rewards' of the failure.
>
> I have to believe this is either a bug in the controller,
> or a problem in cam or asr.

As far as I understand this family of controllers the OS drivers aren't
involved at all in case of a disk drive failure.  It's strictly the
controller's business to deal with it internally.  The OS just sits
there and waits until the controller is done with the retries and either
drops into degraded mode or recovers from the disk error.

That's why I initially speculated that there might be a timeout 
somewhere in PostgreSQL or FreeBSD that leads to data loss if the 
controller is busy for too long.

A somewhat radical way to at least make these failures as rare an event 
as possible would be to deliberately fail all remaining old disk drives, 
one after the other of course, in order to get rid of them.  And if you 
are lucky the problem won't happen with newer drives anyway, in case the 
root cause is an incompatibility between the controller and the old drives.

   Uwe
--
Uwe Doering |  EscapeBox - Managed On-Demand UNIX Servers
[EMAIL PROTECTED]  |  http://www.escapebox.net


Re: Adaptec 3210S, 4.9-STABLE, corruption when disk fails

2005-03-01 Thread Uwe Doering
Don Bowman wrote:
> From: Uwe Doering [mailto:[EMAIL PROTECTED]
>> Don Bowman wrote:
>>> I have a machine running:
>>> $ uname -a
>>> FreeBSD machine.phaedrus.sandvine.com 4.9-STABLE FreeBSD 4.9-STABLE #0:
>>> Fri Mar 19 10:39:07 EST 2004
>>> [EMAIL PROTECTED]:/usr/src/sys/compile/LABDB  i386
>
> ...
>
> I have merged asr.c from RELENG_4 to get this fix:
>
> Fix a mis-merge in the MFC of rev. 1.64 in rev. 1.3.2.3; the following
> change wasn't included:
> - Set the CAM status to CAM_SCSI_STATUS_ERROR rather than CAM_REQ_CMP
>   in case of a CHECK CONDITION.
>
> since I guess it's conceivable this could cause my problem.

I have to admit that I didn't think of this right away, even though I
was kind of involved.

Did you merge 1.3.2.3 as well?  This actually should have been one MFC 
but it was done in two steps due to an oversight.  Please let us know 
whether the fix makes any difference in your case.  Its author made it 
for CD burners and wasn't sure whether it has any effect on other 
devices, like da(4).

   Uwe
--
Uwe Doering |  EscapeBox - Managed On-Demand UNIX Servers
[EMAIL PROTECTED]  |  http://www.escapebox.net


Re: Adaptec 3210S, 4.9-STABLE, corruption when disk fails

2005-03-01 Thread Uwe Doering
Uwe Doering wrote:
> Don Bowman wrote:
>> I have merged asr.c from RELENG_4 to get this fix:
>>
>> Fix a mis-merge in the MFC of rev. 1.64 in rev. 1.3.2.3; the following
>> change wasn't included:
>> - Set the CAM status to CAM_SCSI_STATUS_ERROR rather than CAM_REQ_CMP
>>   in case of a CHECK CONDITION.
>>
>> since I guess it's conceivable this could cause my problem.
>
> I have to admit that I didn't think of this right away, even though I
> was kind of involved.
>
> Did you merge 1.3.2.3 as well?  This actually should have been one MFC
> but it was done in two steps due to an oversight.  Please let us know
> whether the fix makes any difference in your case.  Its author made it
> for CD burners and wasn't sure whether it has any effect on other
> devices, like da(4).

Memory's coming back piecemeal. ;-)  There's another thing you could
try.  The 'asr' driver's original timeout is 360 seconds, because its
author knew that this type of controller can be busy for quite some
time.  FreeBSD's SCSI driver, however, sets it to its default of 60
seconds, which can be way too short.

What happens when the controller is busy trying to deal with a failed 
disk is that the 'asr' driver sends a bus reset to the controller as a 
whole, due to the short timeout.  You should be able to see this clash 
in the controller's event log.  My feeling is that this collision of 
events may have ill effects, like the data corruption you've observed.

On our machines we've set the SCSI timeout and thereby also the 'asr' 
driver's timeout back to the original 360 seconds, in order to leave the 
controller alone while it is busy.  There is a 'sysctl' variable for this:

  kern.cam.da.default_timeout=360

Maybe that's the actual fix for your problem.
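
The timeout setting described above could be checked along these lines. This is only a sketch: the check_timeout helper and the variable names are made up for illustration, and the commands in the trailing comments assume the usual sysctl(8) and /etc/sysctl.conf behaviour, so verify them against the manual pages on your release.

```shell
#!/bin/sh
# Compare the da(4) default timeout with the 360 seconds suggested above.
OID=kern.cam.da.default_timeout
WANT=360

# check_timeout is a hypothetical helper.
# $1 = current timeout in seconds, as printed by 'sysctl -n'
check_timeout() {
    if [ "$1" -ge "$WANT" ]; then
        echo "ok: $OID=$1"
    else
        echo "short: $OID=$1, wanted $WANT"
    fi
}

# Query the kernel only where the OID exists (a FreeBSD system with CAM).
if cur=$(sysctl -n "$OID" 2>/dev/null) && [ -n "$cur" ]; then
    check_timeout "$cur"
fi

# To change the running value (as root):
#   sysctl kern.cam.da.default_timeout=360
# To keep it across reboots, add this line to /etc/sysctl.conf:
#   kern.cam.da.default_timeout=360
```

The script only reads the value; changing it is left as a manual step because it needs root and affects every da(4) device on the system.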
   Uwe
--
Uwe Doering |  EscapeBox - Managed On-Demand UNIX Servers
[EMAIL PROTECTED]  |  http://www.escapebox.net


RE: Adaptec 3210S, 4.9-STABLE, corruption when disk fails

2005-03-01 Thread Don Bowman
From: Uwe Doering [mailto:[EMAIL PROTECTED]
> ...
>
> Did you merge 1.3.2.3 as well?  This actually should have
> been one MFC

Yes, merged from RELENG_4.

I will post later if this happens again, but it will be quite
a long time. The machine has 7 drives in it, and there are only
3 left old enough that they might fail before I take it out
of service (it originally had 7 1999-era IBM drives; now
it has 4 2004-era Seagate drives and 3 of the old IBMs.
The drives have been in continuous service, so they've led
a pretty good life!)

Thanks for the suggestion on the cam timeout, I've set that
value.

--don


Re: Adaptec 3210S, 4.9-STABLE, corruption when disk fails

2005-02-28 Thread Uwe Doering
Don Bowman wrote:
> I have a machine running:
>
> $ uname -a
> FreeBSD machine.phaedrus.sandvine.com 4.9-STABLE FreeBSD 4.9-STABLE #0:
> Fri Mar 19 10:39:07 EST 2004
> [EMAIL PROTECTED]:/usr/src/sys/compile/LABDB  i386
>
> It has an adaptec 3210S raid controller running a single raid-5, and
> runs postgresql 7.4.6 as its primary application.
>
> 3 times now I have had a drive fail, and have had corrupted files in the
> postgresql cluster @ the same time.
>
> The time is too closely correlated to be a coincidence. It passes fsck @
> the time that I got to it a couple of hours later, and the filesystem
> seems to be ok (with a failed drive, the raid in 'degrade' mode).
>
> It appears that the drive failure and the postgresql failure occur @
> exactly the same time (monitoring with nagios, within 1hr accuracy). It
> would appear that for some file(s) bad data was returned.
>
> Does anyone have any suggestions?

In my experience, in a situation like this RAID controllers can block
the system for up to a couple of minutes, trying to revive a failed disk
drive by sending it bus reset commands and the like, until they
eventually give up and drop into degraded mode.  With sufficiently
patient applications this is no problem, but if a program runs into
internal timeouts during this period of time bad things can happen.

My point is that while the disk controller may trigger the problem the
instance that actually corrupts the database might be PostgreSQL itself.
Of course, I'm aware that it's going to be quite hard to tell for sure
who the culprit is.

   Uwe
--
Uwe Doering |  EscapeBox - Managed On-Demand UNIX Servers
[EMAIL PROTECTED]  |  http://www.escapebox.net


RE: Adaptec 3210S, 4.9-STABLE, corruption when disk fails

2005-02-28 Thread Don Bowman
From: Uwe Doering [mailto:[EMAIL PROTECTED]
> Don Bowman wrote:
>> I have a machine running:
>>
>> $ uname -a
>> FreeBSD machine.phaedrus.sandvine.com 4.9-STABLE FreeBSD 4.9-STABLE #0:
>> Fri Mar 19 10:39:07 EST 2004
>> [EMAIL PROTECTED]:/usr/src/sys/compile/LABDB  i386
  
 
...

I have merged asr.c from RELENG_4 to get this fix:

Fix a mis-merge in the MFC of rev. 1.64 in rev. 1.3.2.3; the following
change wasn't included:
- Set the CAM status to CAM_SCSI_STATUS_ERROR rather than CAM_REQ_CMP
  in case of a CHECK CONDITION.

since I guess it's conceivable this could cause my problem.

--don