Reading scsi_error.c again, I find the following logic for our case
(please correct me if I'm wrong):
1. eh_abort_handler and eh_device_reset_handler fail with a timeout;
eh_host_reset_handler succeeds
2. scsi_eh_host_reset goes on with scsi_eh_try_stu & scsi_eh_tur
3. either scsi_eh_try_stu or scsi_eh_tur reuses the scsi command and
calls scsi_send_eh_cmnd to send the STU or TUR command
4. scsi_send_eh_cmnd calls srp_queuecommand, which gets a new req,
resets the scsi_done pointer to scsi_eh_done, and adds the req to
req_queue for this same scsi command with a different opcode (i.e. STU
or TUR)
5. In my case I got QP event 1, so scsi_send_eh_cmnd hits the timeout
case and calls eh_abort_handler for this scsi command with opcode STU
or TUR
6. scsi_eh_try_stu & scsi_eh_tur restore the old scsi command with
scsi_setup_cmd_retry; however, srp has already changed the scsi_done
and host_scribble pointers, and the old values cannot be restored
8. scsi_eh_host_reset fails and scsi_eh_offline_sdevs is called
9. scsi_eh_offline_sdevs calls scsi_eh_finish_cmd, which moves the scsi
command to done_q, and the scsi command is freed when done_q is
processed
10. However, the srp req that carries this scsi command is still in our
req_queue. The next eh_host_reset_handler will re-init the req_queue and
use the scsi command pointer (this is the use-after-free crash that we
see)
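
To make steps 8 to 10 concrete, here is a toy model in plain C. It is
only a sketch, not kernel or ib_srp code: every toy_* name is made up
for illustration, and a "freed" flag stands in for the real free so the
stale reference can be counted instead of dereferenced.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_REQ_COUNT 4

/* Hypothetical stand-in for struct scsi_cmnd. "freed" marks a command
 * that the error handler has already finished and released. */
struct toy_cmd {
    unsigned char opcode;
    int freed;
};

/* Hypothetical stand-in for an srp request slot. */
struct toy_req {
    struct toy_cmd *scmnd;   /* command carried by this req, NULL if free */
};

static struct toy_req toy_req_queue[TOY_REQ_COUNT];

/* Models step 4: take the first free slot and remember the command. */
static struct toy_req *toy_queue_req(struct toy_cmd *cmd)
{
    assert(cmd != NULL);
    for (size_t i = 0; i < TOY_REQ_COUNT; i++) {
        if (toy_req_queue[i].scmnd == NULL) {
            toy_req_queue[i].scmnd = cmd;
            return &toy_req_queue[i];
        }
    }
    return NULL;   /* queue full */
}

/* Models steps 8-9: the error handler gives up and finishes the command
 * without the driver dropping its req. */
static void toy_eh_finish_cmd(struct toy_cmd *cmd)
{
    assert(cmd != NULL);
    cmd->freed = 1;
}

/* Models step 10: a later host reset walks the queue. Returns how many
 * reqs still point at a command that was already finished; each one
 * would be a use-after-free in the real driver. */
static int toy_count_stale_reqs(void)
{
    int stale = 0;

    for (size_t i = 0; i < TOY_REQ_COUNT; i++) {
        struct toy_cmd *cmd = toy_req_queue[i].scmnd;

        if (cmd != NULL && cmd->freed)
            stale++;
    }
    return stale;
}
```

Queuing one command, calling toy_eh_finish_cmd on it, and then calling
toy_count_stale_reqs reports one stale req, which is the situation
described in step 10.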
Bottom line: my previous patch still does not address the logic above.
I'll rework the patch and send it to you later for review.
One correction: my previous patch does address the issue, since the
abort of the TUR or STU command times out and I remove the req;
therefore the req is no longer in req_queue and the subsequent
eh_host_reset_handler does not run into the use-after-free
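
As a rough illustration of why removing the req on abort timeout is
enough, here is a toy model in plain C. It is a sketch under my own
assumptions, not the actual patch: all demo_* names are hypothetical and
the queue is just a small array of pointers.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_REQ_COUNT 4

/* Hypothetical stand-ins for struct scsi_cmnd and an srp request slot. */
struct demo_cmd {
    unsigned char opcode;
};

struct demo_req {
    struct demo_cmd *scmnd;   /* NULL means the slot is free */
};

static struct demo_req demo_req_queue[DEMO_REQ_COUNT];

/* Queue a command in the first free slot. */
static struct demo_req *demo_queue_req(struct demo_cmd *cmd)
{
    assert(cmd != NULL);
    for (size_t i = 0; i < DEMO_REQ_COUNT; i++) {
        if (demo_req_queue[i].scmnd == NULL) {
            demo_req_queue[i].scmnd = cmd;
            return &demo_req_queue[i];
        }
    }
    return NULL;   /* queue full */
}

/* Models the abort path described above: when the abort times out, drop
 * the req so the queue no longer references the command. Returns 1 if a
 * req was removed, 0 if the command was not queued. */
static int demo_abort_timed_out(struct demo_cmd *cmd)
{
    assert(cmd != NULL);
    for (size_t i = 0; i < DEMO_REQ_COUNT; i++) {
        if (demo_req_queue[i].scmnd == cmd) {
            demo_req_queue[i].scmnd = NULL;
            return 1;
        }
    }
    return 0;
}

/* Models the later host reset: counts the reqs it would still touch. */
static int demo_reqs_seen_by_reset(void)
{
    int seen = 0;

    for (size_t i = 0; i < DEMO_REQ_COUNT; i++) {
        if (demo_req_queue[i].scmnd != NULL)
            seen++;
    }
    return seen;
}
```

Once demo_abort_timed_out has removed the req, demo_reqs_seen_by_reset
finds nothing left for the command, so a later reset has no stale
pointer to follow.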
Vu