Thank you for the interest so far in my post.  

I never meant to imply "someone fix this now".  If that's how it came across, 
then I do apologize - that's not what I intended.

I am looking for more than the standard "disks break, live with it" answer.

I am surprised that the disk retry code doesn't time out after 5 minutes or 
100 retries or something like that.  It also seems odd that the system is 
still responsive while the first few error messages are written to the 
console, but stops responding a few messages later.

I also expected the unresponsiveness when the failing disk was mounted as 
part of the root filesystem - not when it was mounted as an auxiliary 
filesystem, or not mounted at all but simply accessed as a raw device.

I have trouble believing that I'm the first one to run into this, or at 
least the first to need to go back and forth between filesystem blocks and 
filenames.  But maybe I am.

Thanks Lee, for the dd_rescue suggestion.
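
In case it helps anyone reading the archives later, my plan is to image 
whatever is still readable off the raw device with something along these 
lines (the device name and output path are just placeholders for my setup; 
I haven't actually run it yet):

    dd_rescue /dev/rsd1c /altroot/sd1c.img

As I understand it, dd_rescue keeps going past unreadable sectors instead 
of aborting, which is exactly what I need here.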

Thanks David, for the sleuthkit suggestion.
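
If I understand The Sleuth Kit's tools right, they should let me map the 
block numbers from the kernel's error messages back to filenames, roughly 
like this (the block and inode numbers below are made up, and I haven't 
verified the details against the OpenBSD port yet):

    ifind -d 123456 /dev/rsd1c     # which inode allocated that data block?
    ffind /dev/rsd1c 56789         # which filename points at that inode?

I suspect I'll also need to convert the kernel's sector numbers into 
filesystem block addresses (and account for the partition offset) first, 
but that looks like straightforward arithmetic.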

Sincerely,
Gordon

On Thu, Jan 27, 2011 at 03:01:40PM +0100, Benny Lofgren wrote:
> On 2011-01-27 14.11, Ted Unangst wrote:
> > On Thu, Jan 27, 2011 at 7:28 AM, Benny Lofgren <bl-li...@lofgren.biz> wrote:
> >> It's a matter of uptime.
> >>
> >> The indicated behaviour, that the system more or less freezes when
> >> encountering a simple sector read error is indeed disturbing. For
> >> example, my own reasons for using mirroring are exclusively so that a
> >> system can remain online and operational in case of a disk failure.
> > 
> > If that's why you're investigating, I'll save you some time.  The disk
> > retry code will basically lock the system up while it's retrying.  If
> > you don't like it, send a patch.
> 
> Well, fwiw I wasn't the one investigating this particular problem, but I
> have no problem submitting patches in cases where I'm able to do
> meaningful work. (The problem I mentioned investigating is in all
> likelihood either driver-related or a hardware problem.)
> 
> I absolutely didn't mean to imply that "hey this is broken, 'someone'
> needs to spend time to fix it" - I fully realize that that someone may
> very well be me. I apologize if I came across that way.
> 
> I was merely pointing out that the standard response of "disks break,
> live with it", while ever true, is sometimes irrelevant to the problem.
> 
> Yes, disks break (I currently have approximately two dozen broken ones
> in a box at the office waiting for an appointment with a sledgehammer),
> and yes, we diligently keep backups (or are sorry we didn't) but that
> doesn't solve the situation where you have a critical system that causes
> pain if it goes offline.
> 
> I have never in almost thirty years in this business lost a single byte
> of customer data to disk failure. I have however had cases of unplanned
> downtime, and every time that happens is also a failure.
> 
> Designing redundancy into our systems helps only as far as the nearest
> single point of failure, and if that point is the OS then I'd say that
> is a problem (since it's not always feasible to build redundancy using
> multiple servers).
> 
> I know I'm preaching to the choir here, and my only interest here is to
> improve the robustness of an already incredibly robust system. I'll
> certainly contribute to the best of my ability whenever I find fixable
> problems.
> 
> 
> Best regards,
> 
> /Benny
> 
> -- 
> internetlabbet.se     / work:   +46 8 551 124 80      / "Words must
> Benny Lofgren        /  mobile: +46 70 718 11 90     /   be weighed,
>                     /   fax:    +46 8 551 124 89    /    not counted."
>                    /    email:  benny -at- internetlabbet.se

----- End forwarded message -----

-- 
Gordon Ferris
W.F. Engineering
Phone: +1 801-455-6108
