On Tue, 2018-10-16 at 11:23 -0400, Bill Kuzeja wrote:
> When doing a surprise removal of an adapter, some in flight I/Os can
> get 
> stuck and take a while to complete (they actually timeout and are 
> retried). We are not handling an early error exit from 
> qla2xxx_eh_abort properly.
> 
> Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting
> down chip") 
> Signed-off-by: Bill Kuzeja <william.kuz...@stratus.com>
> ---
> 
> (Originally sent on September 26th, no reply so resending.)
> 
> After a hot remove of a Qlogic adapter, the driver's remove function
> gets 
> called and we end up aborting all in progress I/Os. Here is the code
> flow:
> 
> qla2x00_remove_one
>   qla2x00_abort_isp_cleanup
>     qla2x00_abort_all_cmds
>       __qla2x00_abort_all_cmds
>         qla2xxx_eh_abort
> 
> At the start of qla2xxx_eh_abort, some sanity checks are done before 
> actually sending the abort. One of these checks is a call to 
> fc_block_scsi_eh. In the case of a hot remove, it turns out that
> this 
> routine can exit with FAST_IO_FAIL.
> 
> When this occurs, we return back to __qla2x00_abort_all_cmds with an 
> extra reference on sp (because the abort never gets sent).
> Originally, 
> this was addressed with another fix:
> 
> commit 4cd3b6ebff85 scsi: qla2xxx: Fix extraneous ref on sp's after
> adapter break
> 
> But this later this added change complicated matters:
> 
> commit 45235022da99 scsi: qla2xxx: Fix driver unload by shutting down
> chip
> 
> Because the abort is now being done earlier in the teardown (through 
> qla2x00_abort_isp_cleanup), in qla2xxx_eh_abort we make it past 
> the first check because qla2x00_isp_reg_stat(ha) returns zero. When
> we
> fail a few lines later in fc_block_scsi_eh, this error is not handled
> properly in __qla2x00_abort_all_cmds and the I/O ends up hanging and 
> timing out because of the extra reference.
> 
> For this fix, a check for FAST_IO_FAIL is added to 
> __qla2x00_abort_all_cmds where we check to see if qla2xxx_eh_abort 
> succeeded or not. 
> 
> This removes the extra reference in this additional early exit case.
> In 
> my testing (hw surprise removals and also adapter remove via sysfs),
> this 
> eliminates the timeouts and delays and the remove proceeds smoothly.
> 
> ---
>  drivers/scsi/qla2xxx/qla_os.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/qla2xxx/qla_os.c
> b/drivers/scsi/qla2xxx/qla_os.c
> index 42b8f0d..3ba3765 100644
> --- a/drivers/scsi/qla2xxx/qla_os.c
> +++ b/drivers/scsi/qla2xxx/qla_os.c
> @@ -1771,8 +1771,9 @@ uint32_t qla2x00_isp_reg_stat(struct
> qla_hw_data *ha)
>                                        * if immediate exit from
>                                        * ql2xxx_eh_abort
>                                        */
> -                                     if (status == FAILED &&
> -                                         (qla2x00_isp_reg_stat(ha
> )))
> +                                     if (((status == FAILED) &&
> +                                         (qla2x00_isp_reg_stat(ha
> ))) ||
> +                                          (status ==
> FAST_IO_FAIL))
>                                               atomic_dec(
>                                                   &sp->ref_count);
>                               }

I recently bumped into this issue. The fix looks correct to me.

Reviewed-by Laurence Oberman <lober...@redhat.com>

Reply via email to