On Mon, 2018-11-05 at 11:23 -0500, Bill Kuzeja wrote:
> When doing a surprise removal of an adapter, some in flight I/Os can
> get
> stuck and take a while to complete (they actually timeout and are
> retried). We are not handling an early error exit from
> qla2xxx_eh_abort properly.
>
> Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting
> down chip")
> Signed-off-by: Bill Kuzeja <[email protected]>
>
> ---
> Resending...can somebody review this (pretty please)??
>
> After a hot remove of a Qlogic adapter, the driver's remove function
> gets
> called and we end up aborting all in progress I/Os. Here is the code
> flow:
>
> qla2x00_remove_one
> qla2x00_abort_isp_cleanup
> qla2x00_abort_all_cmds
> __qla2x00_abort_all_cmds
> qla2xxx_eh_abort
>
> At the start of qla2xxx_eh_abort, some sanity checks are done before
> actually sending the abort. One of these checks is a call to
> fc_block_scsi_eh. In the case of a hot remove, it turns out that this
> routine can exit with FAST_IO_FAIL.
>
> When this occurs, we return back to __qla2x00_abort_all_cmds with an
> extra reference on sp (because the abort never gets sent).
> Originally,
> this was addressed with another fix:
>
> commit 4cd3b6ebff85 scsi: qla2xxx: Fix extraneous ref on sp's after
> adapter break
>
> But this later this added change complicated matters:
>
> commit 45235022da99 scsi: qla2xxx: Fix driver unload by shutting down
> chip
>
> Because the abort is now being done earlier in the teardown (through
> qla2x00_abort_isp_cleanup), in qla2xxx_eh_abort we make it past
> the first check because qla2x00_isp_reg_stat(ha) returns zero. When
> we
> fail a few lines later in fc_block_scsi_eh, this error is not handled
> properly in __qla2x00_abort_all_cmds and the I/O ends up hanging and
> timing out because of the extra reference.
>
> For this fix, a check for FAST_IO_FAIL is added to
> __qla2x00_abort_all_cmds where we check to see if qla2xxx_eh_abort
> succeeded or not.
>
> This removes the extra reference in this additional early exit case.
> In
> my testing (hw surprise removals and also adapter remove via sysfs),
> this
> eliminates the timeouts and delays and the remove proceeds smoothly.
>
> ---
> drivers/scsi/qla2xxx/qla_os.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/scsi/qla2xxx/qla_os.c
> b/drivers/scsi/qla2xxx/qla_os.c
> index 518f151..5261adf 100644
> --- a/drivers/scsi/qla2xxx/qla_os.c
> +++ b/drivers/scsi/qla2xxx/qla_os.c
> @@ -1749,7 +1749,7 @@ uint32_t qla2x00_isp_reg_stat(struct
> qla_hw_data *ha)
> static void
> __qla2x00_abort_all_cmds(struct qla_qpair *qp, int res)
> {
> - int cnt;
> + int cnt, status;
> unsigned long flags;
> srb_t *sp;
> scsi_qla_host_t *vha = qp->vha;
> @@ -1799,10 +1799,16 @@ uint32_t qla2x00_isp_reg_stat(struct
> qla_hw_data *ha)
> if (!sp_get(sp)) {
> spin_unlock_irqresto
> re
> (qp-
> >qp_lock_ptr, flags);
> - qla2xxx_eh_abort(
> + status =
> qla2xxx_eh_abort(
> GET_CMD_SP(s
> p));
> spin_lock_irqsave
> (qp-
> >qp_lock_ptr, flags);
> + /*
> + * Get rid of extra
> reference caused
> + * by early exit
> from qla2xxx_eh_abort
> + */
> + if (status ==
> FAST_IO_FAIL)
> + atomic_dec(&
> sp->ref_count);
> }
> }
> sp->done(sp, res);
Bill
The patch looks fine to me but I still cant apply it
I tried against linux-next which is 4.20
What ree are you applying it against now
Regards
Laurence