Re: "scsidisk I/O error" from scsi.h

Eric Youngdale Tue, 09 May 2000 19:52:00 -0700

----- Original Message -----
From: <[EMAIL PROTECTED]>
To: "Eric Youngdale" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Tuesday, May 09, 2000 12:54 PM
Subject: Re: "scsidisk I/O error" from scsi.h


>
>
>
> [EMAIL PROTECTED] wrote:
> >    What kernel version are you using?  Very important.
>
> Oops.  Kernel=2.2.14-5.0 (RH 6.2).  Adaptec aic7xxx version 5.1.28.
> Recent attempts with aic7xxx 5.1.29 appear to recreate the problem faster.
>
>
> >> If "!uptodate" is an error worthy of a printk, shouldn't some sort of
> >> error be returned back?
> >
> >    There is an error being returned back, but the mechanism isn't
> >obvious.  Essentially we are taking the blocks for the command and marking
> >them to indicate that there is no longer I/O pending.  The uptodate flag
> >for the buffer indicates whether the I/O was completed with success or not -
> >this is the flag that can be used by the process that initiated the I/O to
> >make sure that everything went OK.
> >
> >    If you think about it, the call to end_scsi_request can take place
> >from the context of an interrupt handler (a bottom half handler in this
> >case).  The return value from the function isn't going to be significant
> >once you return from SCSI into the general purpose kernel code.
>
> I was hoping to get the application to see an I/O error instead of
> hanging.  The real goal is to find out why this occurs at all, as the error
> inject is meant to be survivable (and _is_ survivable on the other
> supported OS platforms).
> The cmd might have to be retried, but it should not be fatal.

    Yes, if you want to see all of the error information in all of the gory
detail, then perhaps the generic (sg) driver would be a good choice, but this
would involve bypassing the filesystem.  On the other hand, if the goal is
just to figure out what is actually wrong here, then you can continue to use
the disk driver and we will eventually get to the bottom of it.

    There is one other thing which is definitely worth checking.  Make sure
that you aren't trying to read past the end of the disk.  I know that in
theory there are a number of checks in the kernel to prevent this, but the
sector number being reported is a large number after all (23801000).  There
might not be a problem at all here, but it is worth a quick check.  In
particular, if the disk has something like 1024 byte sectors, the checks
that are in place in the kernel might be off by a factor of two...
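
    A rough sketch of the check suggested above, written as plain userland C
rather than kernel code.  The function name and its parameters are
hypothetical.  It assumes that the sector number in the error message is
counted in 512-byte units, while the capacity of the drive is known in
hardware sectors of some other size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of the kernel.
 * Returns 1 if a sector number reported in 512-byte units lies at or past
 * the end of a disk that has hw_sectors hardware sectors of
 * hw_sector_size bytes each, and 0 otherwise. */
int past_end_of_disk(uint64_t reported_sector,
                     uint64_t hw_sectors,
                     uint32_t hw_sector_size)
{
    /* Assumption: the hardware sector size is a multiple of 512. */
    assert(hw_sector_size >= 512 && hw_sector_size % 512 == 0);

    /* Express the capacity in the same 512-byte units as the report. */
    uint64_t capacity_512 = hw_sectors * (hw_sector_size / 512);

    return reported_sector >= capacity_512;
}
```

    The point of the conversion is that the same sector number can be in range
under one interpretation of the sector size and out of range under the
other, which is the factor-of-two mismatch worth ruling out.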

    Ultimately the underlying cause may be bad sectors on the disk.  In some
instances the disk itself keeps retrying for an excessive period of time, in
other instances something else goes screwy.  The disk driver should be
allowing about 30 seconds or so for the thing to complete before it decides
that the command has "timed out", which under nearly every circumstance
should be more than enough time.  One way to decide whether bad sectors are
the issue is to check whether the sector number being reported is the same
(or nearly the same) each time.
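
    If the sector numbers have been collected from several error messages, the
comparison can be sketched as below.  The helper name and the tolerance
parameter are assumptions made for illustration; how close counts as "nearly
the same" is a judgment call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if every reported sector lies within
 * `tolerance` sectors of the first one, 0 otherwise. */
int errors_cluster(const unsigned long *sectors, size_t count,
                   unsigned long tolerance)
{
    assert(sectors != NULL && count > 0);

    for (size_t i = 1; i < count; i++) {
        unsigned long a = sectors[0], b = sectors[i];
        /* Absolute difference without relying on signed arithmetic. */
        unsigned long distance = (a > b) ? a - b : b - a;
        if (distance > tolerance)
            return 0;
    }
    return 1;
}
```

    Errors that cluster around one spot point toward the media; errors
scattered across the disk suggest looking elsewhere.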

    Alternatively, if you can copy the entire file to /dev/null from a
command line, then it potentially indicates a different issue.

    It could be a problem with the aic7xxx driver as well - if it is, then I
won't be of as much help on this one.

    You didn't tell us the make, model, and firmware version number of the
disk in question.  Some models or versions are known to be flaky - and some
people on the list here seem to have that list memorized (I don't).

    There is a scsiinfo tool floating around that I first wrote and that has
since been maintained by others.  This allows you to look at some of the
details for the device, including the defect lists.  I believe that RedHat
ships an RPM that contains this, and you might play with it to see if you
learn anything.

> Since the only difference between a survivable error and a fatal one is
> the printk generated by !uptodate in end_scsi_request, I can only guess
> that the caller isn't handling it right.

    Ultimately if an error gets passed up to this level, the error is
properly propagated back to the user.  The user level call to read the
blocks will be woken up after end_scsi_request() completes; it should see
that the sectors are not uptodate, and this should cause it to return a read
error.  It is the instance where the error doesn't get propagated to
end_scsi_request() that things really go rotten in a hurry.
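
    The flow described above can be modelled in a few lines of userland C.
This is only a simplified sketch and not the kernel code: the struct, its
fields, and both function names are made up for illustration.  It shows why
the result travels through the buffer's flag rather than through a return
value.

```c
#include <assert.h>
#include <errno.h>

/* Simplified stand-in for a kernel buffer; field names are illustrative. */
struct model_buffer {
    int locked;    /* 1 while I/O is pending */
    int uptodate;  /* 1 if the last I/O completed successfully */
};

/* Models the completion side: record the result, then release the buffer
 * so that a waiting reader can proceed.  Nothing is returned to a caller,
 * since this may run from interrupt context. */
void model_end_request(struct model_buffer *bh, int uptodate)
{
    bh->uptodate = uptodate;
    bh->locked = 0;
}

/* Models the reader side once it has been woken up: the failure shows up
 * here, as a failed read, by inspecting the flag. */
int model_read_result(const struct model_buffer *bh)
{
    assert(!bh->locked);   /* the reader only looks after I/O has finished */
    return bh->uptodate ? 0 : -EIO;
}
```

    If the completion side is never reached, the buffer in this model stays
locked and the reader never gets to look at the flag, which corresponds to
the hang rather than a read error.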

> Is there a suggested set of steps to follow for untying this knot?  The
> only loose thread I saw to pick at was in the form of the
> "scsidisk I/O error" msg.
>
> I should add that while the exerciser is hanging outright, the rest of the
> box starts to petrify as well.  Eventually the box has to be hw reset and
> all the filesystems the exerciser had open have to be cleaned up.

    I am afraid this can be a common result - especially if the error
recovery code gets involved.  This is why I have been trying to get driver
authors to switch to using the new error recovery code - things tend to be
better in this instance, but I cannot guarantee that this issue would be
solved.  At some point very early in the 2.5 series, I am going to rip the
old error recovery code out, which will force all driver authors to finish
the conversion.

> >    The minor numbers above are reported in hexadecimal :-).
> Ah.  You assume all readers of the error msgs are properly caffeinated to
> catch that?  :)  How about displaying the minor numbers in the same format
> as they appear in /dev?

    Never thought of it :-).  Usually when there is any ambiguity, I have
added a leading "0x", which helps to ensure that there is no confusion.
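
    As a small illustration of the "0x" convention, here is a userland sketch
using snprintf.  The helper name is hypothetical, and a kernel message would
use printk with a similar format string.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: format a minor number unambiguously as hex.
 * Returns the number of characters snprintf would have written. */
int format_minor(char *buf, size_t len, unsigned int minor)
{
    assert(buf != NULL && len > 0);

    /* Explicit "0x" prefix; "%#x" would print a bare "0" for minor 0. */
    return snprintf(buf, len, "0x%02x", minor);
}
```

    With this format a minor of 16 comes out as 0x10 rather than a bare 10,
so it cannot be mistaken for decimal ten.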

-Eric


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]