Roland Dreier wrote:
 > 1. srp_unmap_data() and srp_remove_req() for .eh_abort_handler(scmnd)
 >     a. abort get timeout or
 >     b. req->cmd_done or
 >     c. !req->tsk_status
 > 2. we should do step (1) for .eh_abort_handler(scmnd) only and don't
 > do step 1 for .eh_device_reset_handler(scmnd) since same scsi command
 > is used for all .eh_handler()
 > 3. scsi command is used in all .eh_handler() will be freed by scsi
 > midlayer at the end of error handling sequences
 > 4. If we don't do step 1, scsi command which is used in all
 > .eh_handler() and freed is still in our pending queue and is
 > referenced in srp_reconnect_target() / reinit request ring

So I finally got a chance to look at this in detail.  It does look
like we should remove the request in (1) if the command finishes or
the abort succeeds.  However if the abort times out then then command
is still out there -- shouldn't we wait for the
eh_device_reset_handler and then flush all matching commands there?


We should remove the request in (1) if the command finishes or the abort succeeds for eh_abort_handler only

For abort times out case, we can do either remove the command in eh_abort_handler or wait for eh_device_reset_handler (this may be better since it gives a command more time with a chance to complete.

And I don't understand (4) -- isn't srp_reconnect_target() being
called from srp_reset_host() as part of the error handling sequence?
Unless I'm misreading the code in scsi_error.c, commands don't get
freed (assuming all aborts and device resets fail) before then.

What am I missing?  In your case, where the abort and device reset
fail and then the host reset gets called, where was the command
getting freed?

reading scsi_error.c again, I find this logic for our case (please correct me if I'm wrong) 1. eh_abort_handler and eh_device_reset_handler fail with timeout; eh_host_reset_handler successes
2. scsi_eh_host_reset goes on with scsi_eh_try_stu & scsi_eh_tur
3. either scsi_eh_try_stu or scsi_eh_tur will reuse the scsi command and call scsi_send_eh_cmnd to send STU or TUR command 4. scsi_send_eh_cmnd calls srp_queuecommand which will get new req, reformat scsi_done pointer to scsi_eh_done, and add req to req_queue for this same scsi command with different opcode (ie. STU or TUR) 5. In my case I got QP event 1 - so scsi_send_eh_cmnd will get to timeout case and call eh_abort_handler for this scsi command with opcode STU or TUR 6. scsi_eh_try_stu & scsi_eh_tur will retrieve the old scsi command back with scsi_set_cmd_retry; however, srp already change and can not retrieve the old scsi_done and host_scribble pointer
8. scsi_eh_host_reset fail and scsi_eh_offline_sdevs is called
9. scsi_eh_offline_sdevs calls scsi_eh_finish_cmd which moves the scsi command to done_q and scsi command is freed in done_q 10. However the srp req carries this scsi command still in our req_queue. The next eh_host_reset_handler will re-init the req_queue and use the scsi command pointer (this is the crash use-after-freed that we see)

Bottom line my previous patch still does not address the logic above - I'll rework the patch and send to you later for review

Vu


_______________________________________________
openib-general mailing list
[email protected]
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Reply via email to