Harald Dunkel wrote: > Stuart Henderson wrote: >> >> With IDE (Integrated Drive Electronics), the controller is *on the >> drive*. A failing drive/controller can do all sorts of nasty things >> to the host system. >> > > So you mean I should not use IDE disks (PATA or SATA), because > Raidframe cannot support a failsafe operation with these disks?
<rant> Your basic assumption that "RAID = minimal down time" is flawed. The most horrible down times and "events" you ever see will often involve RAID. That goes for hardware RAID, software RAID, whatever.

Almost every RAID system out there handles the sudden removal of a disk from the system pretty well. Why? Because it's EASY to create that "failure mode". Problem is, in 25 years in this business, I don't recall having seen a hard disk fall out of a computer as a mode of actual failure (I did see a SCSI HBA fall out of a machine once, but that's a different story). My preferred way to test RAID systems involves a powder-actuated nail gun driving a nail through the platter. Not overly realistic either, but arguably more so than having the drive suddenly being pulled out. The disks get expensive, though.

Back to your situation... The drive reports a failure, but not one so horrible that the OS doesn't attempt a retry. So, at what point does the OS just shut down the drive and say, "not worth the trouble"? If you are running a single drive, you generally want to keep trying as long as there is the slightest hope (another digression: back in the MSDOS v2 days, I had a machine blow a disk such that if I kept hitting "Retry" enough times, each sector would ultimately be read successfully. Wedged a pen in between the 'R' key and the monitor, went to dinner, and when I came back, I had all my data successfully copied to another drive).

In your case, however, you have a drive saying, "I'm getting better" when you are saying, "It'll be stone dead in a moment". You want the OS to whack the drive and toss it on the cart..er..remove it from the RAID set at the first sign of trouble, but that's not a universal answer.

Curiously, I've had servers that caused problems BOTH ways. One kept a drive on-line even though it was having serious problems and should have been declared dead.
In several other cases, the drives reported minor errors and were popped off-line and out of the array when there was really nothing significant wrong with the disks, but the local staff didn't recognize that...and if the right two popped off-line, down went the array.

Oh, btw: those were both HW RAID. You can run into these problems no matter what you are using. The "try too long" ones were SATA, the "give up too early" ones were SCSI. We had 20 servers with the SATA HW mirroring; not a single one lost data, though one got Really Slow until we figured out it was a drive problem. We had 15 SCSI systems which cost about four times as much as the SATA systems...three of those lost data.

Complexity kills. Proper operation can be neat. Failures rarely are. There are usually more ways a system can fail than there are ways it can work. It is also really hard to have the drives fail in realistic ways when the designers are watching, and it is really hard to fail something the same exact way again to work out every bug.

In your case, you have firewalls, which can be made completely redundant, rather than just making the disk system more complex. Why run RAID in the firewalls, when you can just run CARP and have much more redundancy?

Of course, you can have similar problems with CARP, too..we managed to install a proxy package in a CARP'd pair of FWs and didn't notice how fast it filled the disk. One box quit passing packets when the proxy couldn't log any longer, but CARP didn't see the box or interfaces or links as actually failing, so it didn't switch over to the standby system. Happened when both our administrators were out that morning (of course), so when they called me at home, I asked a few questions and had them hit the power switch on the primary firewall, which instantly got data moving again through the secondary.
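For anyone who hasn't set up a CARP pair before, the basic shape is just one virtual interface per firewall sharing a virtual IP. A minimal sketch (the address, vhid, password, and physical interface em0 here are placeholders, not from the original post; advskew is the real knob that picks the master):

```
# Primary firewall (lowest advskew wins the master election):
ifconfig carp0 create
ifconfig carp0 vhid 1 pass examplepw carpdev em0 advskew 0 \
    192.0.2.1 netmask 255.255.255.0

# Backup firewall (higher advskew; takes over when the
# primary's CARP advertisements stop arriving):
ifconfig carp0 create
ifconfig carp0 vhid 1 pass examplepw carpdev em0 advskew 100 \
    192.0.2.1 netmask 255.255.255.0
```

Clients point their default route at 192.0.2.1 and never need to know which physical box currently owns it.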
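The disk-full story is exactly the kind of failure CARP can't see on its own: the box is up and the links are up, it just can't do its job. One mitigation is a cron job that demotes the box's CARP group when the log filesystem fills, so the peer takes over before things wedge. A rough sketch, assuming /var is where the proxy logs and 95% is an acceptable threshold (carpdemote is a real ifconfig(8) option on OpenBSD; the threshold, filesystem, and helper function are illustrative):

```shell
#!/bin/sh
# check_disk: given a df-style usage figure like "96%",
# print "demote" if it meets the threshold, else "ok".
THRESHOLD=95
check_disk() {
    used=$(echo "$1" | tr -d '%')          # strip the % sign
    [ "$used" -ge "$THRESHOLD" ] && echo demote || echo ok
}

# In a real cron job on the firewall you might then do (untested sketch):
#   used=$(df -P /var | awk 'NR==2 {print $5}')
#   [ "$(check_disk "$used")" = demote ] && ifconfig -g carp carpdemote 50

check_disk 96%
```

Raising the demote counter makes every CARP interface on the box advertise at a worse priority, which is what actually pushes traffic to the standby; hitting the power switch, as in the story above, is the manual version of the same thing.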
Consider RAID to be a "rapid repair" tool; don't expect it to mean you never go down (and that's assuming you know how to recover when it actually does fail...most people I've seen just assume magic happens, or hope they'll have a job elsewhere by that point). And don't expect to get less down time out of a very complex system compared to a simple system.

In particular, when an IDE disk fails, it often seems to the computer that an entire controller fell out of the system, so don't expect an IDE system to stay up after a drive failure. On the other hand, if you haven't seen a SCSI disk take out an entire SCSI bus, just wait; do this long enough and you will. Don't expect them to stay up, either.

SATA? Ask me again in about ten years, but so far, I've seen a SATA drive toss a dead short across the power supply, killing the RAID box, the PS in the computer, and the PS on a second computer I plugged the drive into while wondering "is this the bad drive?". Don't fool yourself into thinking "a SCSI drive wouldn't do that". A later Stupid Administrator Trick (i.e., I screwed up) resulted in my rebuilding the entire array from scratch and restoring from backup when the data was just a config setting away from being recovered.

Stuff happens. Be ready for it. Personally, I've found that being ready for the unexpected problem on a simple system beats the heck out of thinking you have eliminated problems by adding complexity. </rant>

Nick.

