On 2011-01-27 14.11, Ted Unangst wrote:
> On Thu, Jan 27, 2011 at 7:28 AM, Benny Lofgren <bl-li...@lofgren.biz> wrote:
>> It's a matter of uptime.
>>
>> The indicated behaviour, that the system more or less freezes when
>> encountering a simple sector read error is indeed disturbing. For
>> example, my own reasons for using mirroring are exclusively so that a
>> system can remain online and operational in case of a disk failure.
> 
> If that's why you're investigating, I'll save you some time.  The disk
> retry code will basically lock the system up while it's retrying.  If
> you don't like it, send a patch.

Well, fwiw I wasn't the one investigating this particular problem, but I
have no problem submitting patches in cases where I'm able to do
meaningful work. (The problem I mentioned investigating is in all
likelihood either driver-related or a hardware problem.)

I absolutely didn't mean to imply that "hey, this is broken, 'someone'
needs to spend time to fix it" - I fully realize that that someone may
very well be me. I apologize if I came across that way.

I was merely pointing out that the standard response of "disks break,
live with it", while ever true, is sometimes irrelevant to the problem.

Yes, disks break (I currently have approximately two dozen broken ones
in a box at the office waiting for an appointment with a sledgehammer),
and yes, we diligently keep backups (or are sorry we didn't) but that
doesn't solve the situation where you have a critical system that causes
pain if it goes offline.

I have never in almost thirty years in this business lost a single byte
of customer data to disk failure. I have however had cases of unplanned
downtime, and every time that happens is also a failure.

Designing redundancy into our systems helps only as far as the
nearest single point of failure, and if that point is the OS then I'd
say that is a problem (since it's not always feasible to build
redundancy using multiple servers).

I know I'm preaching to the choir, and my only interest here is to
improve the robustness of an already incredibly robust system. I'll
certainly contribute to the best of my ability whenever I find fixable
problems.


Best regards,

/Benny

-- 
internetlabbet.se     / work:   +46 8 551 124 80      / "Words must
Benny Lofgren        /  mobile: +46 70 718 11 90     /   be weighed,
                    /   fax:    +46 8 551 124 89    /    not counted."
                   /    email:  benny -at- internetlabbet.se
