Re: [openib-general][patch review] srp: fmr implementation,

Vu Pham Fri, 05 May 2006 11:54:19 -0700

Roland Dreier wrote:

 > 1. srp_unmap_data() and srp_remove_req() for .eh_abort_handler(scmnd)
 >     a. abort get timeout or
 >     b. req->cmd_done or
 >     c. !req->tsk_status
 > 2. we should do step (1) for .eh_abort_handler(scmnd) only and don't
 > do step 1 for .eh_device_reset_handler(scmnd) since same scsi command
 > is used for all .eh_handler()
 > 3. scsi command is used in all .eh_handler() will be freed by scsi
 > midlayer at the end of error handling sequences
 > 4. If we don't do step 1, scsi command which is used in all
 > .eh_handler() and freed is still in our pending queue and is
 > referenced in srp_reconnect_target() / reinit request ring


So I finally got a chance to look at this in detail.  It does look
like we should remove the request in (1) if the command finishes or
the abort succeeds.  However if the abort times out then then command
is still out there -- shouldn't we wait for the
eh_device_reset_handler and then flush all matching commands there?

We should remove the request in (1) if the command finishesor the abort succeeds for eh_abort_handler only

For abort times out case, we can do either remove thecommand in eh_abort_handler or wait foreh_device_reset_handler (this may be better since it gives acommand more time with a chance to complete.

And I don't understand (4) -- isn't srp_reconnect_target() being
called from srp_reset_host() as part of the error handling sequence?
Unless I'm misreading the code in scsi_error.c, commands don't get
freed (assuming all aborts and device resets fail) before then.

What am I missing?  In your case, where the abort and device reset
fail and then the host reset gets called, where was the command
getting freed?

reading scsi_error.c again, I find this logic for our case(please correct me if I'm wrong)1. eh_abort_handler and eh_device_reset_handler fail withtimeout; eh_host_reset_handler successes

2. scsi_eh_host_reset goes on with scsi_eh_try_stu & scsi_eh_tur

3. either scsi_eh_try_stu or scsi_eh_tur will reuse the scsicommand and call scsi_send_eh_cmnd to send STU or TUR command4. scsi_send_eh_cmnd calls srp_queuecommand which will getnew req, reformat scsi_done pointer to scsi_eh_done, and addreq to req_queue for this same scsi command with differentopcode (ie. STU or TUR)5. In my case I got QP event 1 - so scsi_send_eh_cmnd willget to timeout case and call eh_abort_handler for this scsicommand with opcode STU or TUR6. scsi_eh_try_stu & scsi_eh_tur will retrieve the old scsicommand back with scsi_set_cmd_retry; however, srp alreadychange and can not retrieve the old scsi_done andhost_scribble pointer

8. scsi_eh_host_reset fail and scsi_eh_offline_sdevs is called

9. scsi_eh_offline_sdevs calls scsi_eh_finish_cmd whichmoves the scsi command to done_q and scsi command is freedin done_q10. However the srp req carries this scsi command still inour req_queue. The next eh_host_reset_handler will re-initthe req_queue and use the scsi command pointer (this is thecrash use-after-freed that we see)

Bottom line my previous patch still does not address thelogic above - I'll rework the patch and send to you laterfor review


Vu


_______________________________________________
openib-general mailing list
[email protected]
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general][patch review] srp: fmr implementation,

Reply via email to