Roland Dreier wrote:
> 1. srp_unmap_data() and srp_remove_req() for .eh_abort_handler(scmnd)
> a. abort get timeout or
> b. req->cmd_done or
> c. !req->tsk_status
> 2. we should do step (1) for .eh_abort_handler(scmnd) only and don't
> do step 1 for .eh_device_reset_handler(scmnd) since same scsi command
> is used for all .eh_handler()
> 3. scsi command is used in all .eh_handler() will be freed by scsi
> midlayer at the end of error handling sequences
> 4. If we don't do step 1, scsi command which is used in all
> .eh_handler() and freed is still in our pending queue and is
> referenced in srp_reconnect_target() / reinit request ring
So I finally got a chance to look at this in detail. It does look
like we should remove the request in (1) if the command finishes or
the abort succeeds. However if the abort times out then then command
is still out there -- shouldn't we wait for the
eh_device_reset_handler and then flush all matching commands there?
We should remove the request in (1) if the command finishes
or the abort succeeds for eh_abort_handler only
For abort times out case, we can do either remove the
command in eh_abort_handler or wait for
eh_device_reset_handler (this may be better since it gives a
command more time with a chance to complete.
And I don't understand (4) -- isn't srp_reconnect_target() being
called from srp_reset_host() as part of the error handling sequence?
Unless I'm misreading the code in scsi_error.c, commands don't get
freed (assuming all aborts and device resets fail) before then.
What am I missing? In your case, where the abort and device reset
fail and then the host reset gets called, where was the command
getting freed?
reading scsi_error.c again, I find this logic for our case
(please correct me if I'm wrong)
1. eh_abort_handler and eh_device_reset_handler fail with
timeout; eh_host_reset_handler successes
2. scsi_eh_host_reset goes on with scsi_eh_try_stu & scsi_eh_tur
3. either scsi_eh_try_stu or scsi_eh_tur will reuse the scsi
command and call scsi_send_eh_cmnd to send STU or TUR command
4. scsi_send_eh_cmnd calls srp_queuecommand which will get
new req, reformat scsi_done pointer to scsi_eh_done, and add
req to req_queue for this same scsi command with different
opcode (ie. STU or TUR)
5. In my case I got QP event 1 - so scsi_send_eh_cmnd will
get to timeout case and call eh_abort_handler for this scsi
command with opcode STU or TUR
6. scsi_eh_try_stu & scsi_eh_tur will retrieve the old scsi
command back with scsi_set_cmd_retry; however, srp already
change and can not retrieve the old scsi_done and
host_scribble pointer
8. scsi_eh_host_reset fail and scsi_eh_offline_sdevs is called
9. scsi_eh_offline_sdevs calls scsi_eh_finish_cmd which
moves the scsi command to done_q and scsi command is freed
in done_q
10. However the srp req carries this scsi command still in
our req_queue. The next eh_host_reset_handler will re-init
the req_queue and use the scsi command pointer (this is the
crash use-after-freed that we see)
Bottom line my previous patch still does not address the
logic above - I'll rework the patch and send to you later
for review
Vu
_______________________________________________
openib-general mailing list
[email protected]
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general