Hello,
On Wed, 18 Aug 1999, D. Lance Robinson wrote:
> Hi,
>
> I am seeing stack overflows when doing rigorous error testing while
> using the sym53c8xx driver. This would not happen if the driver was
> using the newer scsi error handling code. The newer code uses a queue
> with a bottom half driver and this will prevent the error cases that I'm
> seeing.
I know what the supposedly non-obsolete Linux SCSI code is doing. I just
haven't had time to try it yet.
Note that the driver only calls scsi_done() from its entry points, so a
stack overflow can only occur if the caller does something wrong, such as
recursively calling driver entry points from scsi_done() or friends.
> I have tried to switch the driver over to the newer scsi error handling
> code but have data corruption problems. I have modified the SYM53C8XX
> #define (in sym53c8xx.h) to include use_new_eh_code:1, and modified
> sym53c8xx_queue_command() so it always returns 0. Things appear to work
> fine for a while, but in 1-2 minutes of testing, my test code detects
> data corruption in one of its files. The cache gets corrupted. Every
> corruption I have looked at seems to be file based... that is, the
> information that should go at the beginning of a file is seen at the
> beginning of a different file.
I have questions myself about the status of use_new_eh_code. It seems that
only eata, u14-34f, aha1542, gdth and qlogicfc are using it. Btw, eata and
u14-34f seem to allow disabling this option from the boot command line.
> So, my real question is...
> Does anybody know the history of the scsi error handling code as to know
> what is needed to convert older scsi drivers to utilize the new code
> (without suffering data corruption problems)?
You may ask Eric Youngdale, or the authors of the drivers that use
this option, about their experience.
> <>< Lance.
>
> ------------------------
> Notes on what I am doing to generate the stack overflow errors...
>
> I have modified the driver slightly to handle missing devices as will
> happen if someone yanks a drive from the bus (while in an appropriate
> scsi backplane). The driver was changed to set a flag (bad_select) when
> a select timeout happens. All new commands to that device, other than
> TEST UNIT READY, will be rejected for that drive. When a TEST UNIT READY
> command is seen, the bad_select bit is cleared.
Hmmm...
Normally the driver should call scsi_done() from queue_command() only
for success or for error conditions that should never be retried. If a
command completed that way is retried anyway, then it is a bug (either in
the scsi code or in the driver).
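For reference, the bad_select scheme described in the quoted text could be sketched roughly as follows. This is a stand-alone illustration, not code from sym53c8xx: struct target, check_cmd() and on_select_timeout() are hypothetical names, and only the two SCSI opcodes are real.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TEST_UNIT_READY 0x00    /* SCSI opcode */
#define READ_10         0x28    /* SCSI opcode, used here as "any other command" */

/* Hypothetical per-target state; the real driver keeps far more. */
struct target {
    bool bad_select;            /* set after a selection timeout */
};

enum verdict { CMD_ISSUE, CMD_REJECT };

/* Called when selecting the target timed out (drive pulled from the bus). */
static void on_select_timeout(struct target *t)
{
    assert(t != NULL);
    t->bad_select = true;
}

/* Decide whether a new command may be sent to the target.
 * TEST UNIT READY always goes through and clears the flag;
 * everything else is rejected while the flag is set. */
static enum verdict check_cmd(struct target *t, unsigned char opcode)
{
    assert(t != NULL);
    if (opcode == TEST_UNIT_READY) {
        t->bad_select = false;
        return CMD_ISSUE;
    }
    return t->bad_select ? CMD_REJECT : CMD_ISSUE;
}
```

The point of interest is the CMD_REJECT path: if the driver completes a rejected command immediately from inside queue_command(), the completion runs on the caller's stack.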
> This all works for our situation, except when there is a backlog of
> commands that are queued in the sd layer. When the command is rejected
> in the sym53c8xx driver because of a previous bad_select, that command
> is sent to the scsi done code which gets its way up to rw_intr, then
> requeue_sd_request, then do_sd_request, back to requeue_sd_request, to
> scsi_do_cmd, back to sym53c8xx_queue_command. If this command is also
> rejected, the cycle continues for about 12-16 times in which the stack
> overflows and the system freezes.
I see the problem here. The command is not retried, but the SCSI code
just recursively queues numerous commands that fail.
Damned uncontrolled recursions!!!!!!!!!
Note that the recursion between do_sd_request() and requeue_sd_request()
seems way stupid to me, with or without new_eh_code. If this highly
stupid recursion were fixed, it should then be possible to detect the
offending recursion you described and handle it properly.
So, the right fix is:
1) Remove the do_sd_request()/requeue_sd_request() recursion which, in my
opinion, should never have existed.
2) Make the new stuff aware of the recursion from
queue_command()/rw_intr() and just do nothing in that situation.
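The two points above could be sketched in user space roughly like this. It is only a sketch under assumptions, not kernel code: do_request(), queue_cmd(), complete_cmd() and the in_request flag are hypothetical stand-ins for do_sd_request(), queue_command() and the scsi_done()/rw_intr() path. The request function is a plain loop, and the completion path does nothing when it is entered from inside that loop.

```c
#include <assert.h>

#define NCMDS 64                /* backlog of queued commands */

static int in_request;          /* nonzero while the request loop runs */
static int pending;             /* commands still waiting in the queue */
static int depth, max_depth;    /* nesting depth of queue_cmd() calls */

static void do_request(void);

/* Stand-in for the completion path (scsi_done() up to rw_intr()). */
static void complete_cmd(void)
{
    /* Point 2: if we were entered from inside the request loop,
     * do nothing; that loop picks up the next command itself. */
    if (in_request)
        return;
    do_request();
}

/* Stand-in for queue_command() on a device flagged as missing:
 * every command is rejected and completed immediately. */
static void queue_cmd(void)
{
    depth++;
    if (depth > max_depth)
        max_depth = depth;
    complete_cmd();
    depth--;
}

/* Point 1: a loop instead of mutual recursion. */
static void do_request(void)
{
    in_request = 1;
    while (pending > 0) {
        pending--;
        queue_cmd();
    }
    in_request = 0;
}

/* Drain a backlog of rejected commands; return the deepest nesting seen. */
int run_demo(void)
{
    pending = NCMDS;
    depth = max_depth = 0;
    do_request();
    assert(pending == 0);       /* the whole backlog was drained */
    return max_depth;
}
```

In this sketch run_demo() returns 1 whatever the value of NCMDS, since each rejected command is completed and the loop moves on without nesting. Without the in_request check, every rejected command would add another set of frames to the stack.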
Gérard.