Reading scsi_error.c again, I find this logic for our case (please correct me if I'm wrong):

1. eh_abort_handler and eh_device_reset_handler fail with timeout; eh_host_reset_handler succeeds.
2. scsi_eh_host_reset goes on with scsi_eh_try_stu and scsi_eh_tur.
3. Either scsi_eh_try_stu or scsi_eh_tur reuses the scsi command and calls scsi_send_eh_cmnd to send the STU or TUR command.
4. scsi_send_eh_cmnd calls srp_queuecommand, which gets a new req, sets the scsi_done pointer to scsi_eh_done, and adds the req to req_queue for this same scsi command with a different opcode (i.e. STU or TUR).
5. In my case I got QP event 1, so scsi_send_eh_cmnd hits the timeout case and calls eh_abort_handler for this scsi command with opcode STU or TUR.
6. scsi_eh_try_stu and scsi_eh_tur restore the old scsi command with scsi_setup_cmd_retry; however, srp has already changed the scsi_done and host_scribble pointers, and the old values cannot be recovered.
8. scsi_eh_host_reset fails and scsi_eh_offline_sdevs is called.
9. scsi_eh_offline_sdevs calls scsi_eh_finish_cmd, which moves the scsi command to done_q, and the scsi command is freed from done_q.
10. However, the srp req that carries this scsi command is still in our req_queue. The next eh_host_reset_handler re-initializes req_queue and uses the scsi command pointer (this is the use-after-free crash that we see).
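
To make steps 3 to 6 concrete, here is a rough user-space sketch of the pointer overwrite. It is not the kernel or ib_srp code: toy_cmnd, toy_req, toy_queuecommand and toy_eh_send_tur are made-up names, and I am assuming the retry setup restores only the CDB-related fields (modelled here by the opcode alone).

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct scsi_cmnd; only the fields discussed above. */
struct toy_cmnd {
    unsigned char opcode;                  /* first CDB byte */
    void (*scsi_done)(struct toy_cmnd *);  /* completion callback */
    void *host_scribble;                   /* driver-private pointer (the req) */
};

/* Toy stand-in for the driver's per-command request. */
struct toy_req {
    struct toy_cmnd *scmnd;
};

static void normal_done(struct toy_cmnd *c) { (void)c; }
static void eh_done(struct toy_cmnd *c) { (void)c; }

/* Models queuecommand: binds a req to the command and rewrites two fields. */
static void toy_queuecommand(struct toy_cmnd *c, struct toy_req *req,
                             void (*done)(struct toy_cmnd *))
{
    req->scmnd = c;
    c->scsi_done = done;
    c->host_scribble = req;
}

/*
 * Models the error handler reusing the same command for TEST UNIT READY.
 * Assumption: only the opcode is saved and restored around the EH command.
 */
static void toy_eh_send_tur(struct toy_cmnd *c, struct toy_req *eh_req)
{
    unsigned char saved_opcode;

    assert(c != NULL && eh_req != NULL);
    saved_opcode = c->opcode;
    c->opcode = 0x00;                      /* TEST UNIT READY */
    toy_queuecommand(c, eh_req, eh_done);
    /* ... the EH command times out here ... */
    c->opcode = saved_opcode;              /* opcode comes back, nothing else */
}
```

After toy_eh_send_tur returns, the opcode is the original one again, but scsi_done and host_scribble still point at the EH callback and the EH req, which is the situation described in step 6.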

Bottom line: my previous patch still does not address the logic above. I'll rework the patch and send it to you later for review.



One correction: my previous patch does address the issue. The abort of the TUR or STU command times out and I remove the req; therefore the req is no longer in req_queue and the subsequent eh_host_reset_handler does not run into the use-after-free.
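
As a rough illustration of why removing the req on abort timeout is enough, here is a small user-space model. It is only a sketch, not the actual patch: toy_target, toy_abort_remove_req and toy_host_reset are hypothetical names, and a "freed" flag stands in for the command really being freed so the model itself stays safe to run.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_REQ_MAX 4

/* Toy command; "freed" marks a command the midlayer has already released. */
struct toy_cmnd {
    int freed;
};

/* Toy request slot; scmnd == NULL means the slot is not in use. */
struct toy_req {
    struct toy_cmnd *scmnd;
};

struct toy_target {
    struct toy_req req_queue[TOY_REQ_MAX];
};

/* Abort path: drop the req for this command even when the abort timed out. */
static void toy_abort_remove_req(struct toy_target *t, struct toy_cmnd *c)
{
    size_t i;

    assert(t != NULL);
    for (i = 0; i < TOY_REQ_MAX; i++)
        if (t->req_queue[i].scmnd == c)
            t->req_queue[i].scmnd = NULL;
}

/*
 * Host reset: walks req_queue and clears it. Returns how many slots still
 * pointed at an already-freed command; in a real driver each of those
 * accesses would be a use-after-free.
 */
static int toy_host_reset(struct toy_target *t)
{
    size_t i;
    int stale = 0;

    assert(t != NULL);
    for (i = 0; i < TOY_REQ_MAX; i++) {
        struct toy_cmnd *c = t->req_queue[i].scmnd;

        if (c == NULL)
            continue;
        if (c->freed)
            stale++;
        t->req_queue[i].scmnd = NULL;
    }
    return stale;
}
```

Without the removal, a command freed after the failed recovery is still referenced when the next host reset walks req_queue; with the removal in the abort path, the reset finds nothing stale.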

Vu