On 03/15/2017 06:55 PM, Benjamin Block wrote:
> On Wed, Mar 15, 2017 at 02:54:09PM +0100, Hannes Reinecke wrote:
>> On 03/14/2017 06:33 PM, Benjamin Block wrote:
>>> Hello Hannes,
>>>
>>> On Wed, Mar 01, 2017 at 10:15:20AM +0100, Hannes Reinecke wrote:
>>>> There haven't been any reports for HBAs where asynchronous abort
>>>> would not work, so we should make it mandatory and remove
>>>> the fallback.
>>>>
>>>> Signed-off-by: Hannes Reinecke <h...@suse.de>
>>>> Reviewed-by: Johannes Thumshirn <jthumsh...@suse.de>
>>>> Reviewed-by: Bart Van Assche <bart.vanass...@sandisk.com>
>>>> ---
>>>>  Documentation/scsi/scsi_eh.txt | 18 ++++------
>>>>  drivers/scsi/scsi_error.c      | 81 ++++--------------------------------------
>>>>  drivers/scsi/scsi_lib.c        |  2 +-
>>>>  drivers/scsi/scsi_priv.h       |  3 +-
>>>>  include/scsi/scsi_host.h       |  5 ---
>>>>  5 files changed, 15 insertions(+), 94 deletions(-)
>>>>
>>>> diff --git a/Documentation/scsi/scsi_eh.txt b/Documentation/scsi/scsi_eh.txt
>>>> index 4edb9c1c..11e447b 100644
>>>> --- a/Documentation/scsi/scsi_eh.txt
>>>> +++ b/Documentation/scsi/scsi_eh.txt
>>>> @@ -70,7 +70,7 @@ with the command.
>>>>    scmd is requeued to blk queue.
>>>>
>>>>   - otherwise
>>>> -  scsi_eh_scmd_add(scmd, 0) is invoked for the command.  See
>>>> +  scsi_eh_scmd_add(scmd) is invoked for the command.  See
>>>>    [1-3] for details of this function.
>>>>
>>>>
>>>> @@ -103,9 +103,7 @@ function
>>>>          eh_timed_out() callback did not handle the command.
>>>>    Step #2 is taken.
>>>>
>>>> - 2. If the host supports asynchronous completion (as indicated by the
>>>> -    no_async_abort setting in the host template) scsi_abort_command()
>>>> -    is invoked to schedule an asynchrous abort.
>>>> + 2. scsi_abort_command() is invoked to schedule an asynchrous abort.
>>>>      Asynchronous abort are not invoked for commands which the
>>>>      SCSI_EH_ABORT_SCHEDULED flag is set (this indicates that the command
>>>>      already had been aborted once, and this is a retry which failed),
>>>> @@ -127,16 +125,13 @@ function
>>>>
>>>>   scmds enter EH via scsi_eh_scmd_add(), which does the following.
>>>>
>>>> - 1. Turns on scmd->eh_eflags as requested.  It's 0 for error
>>>> -    completions and SCSI_EH_CANCEL_CMD for timeouts.
>>>> + 1. Links scmd->eh_entry to shost->eh_cmd_q
>>>>
>>>> - 2. Links scmd->eh_entry to shost->eh_cmd_q
>>>> + 2. Sets SHOST_RECOVERY bit in shost->shost_state
>>>>
>>>> - 3. Sets SHOST_RECOVERY bit in shost->shost_state
>>>> + 3. Increments shost->host_failed
>>>>
>>>> - 4. Increments shost->host_failed
>>>> -
>>>> - 5. Wakes up SCSI EH thread if shost->host_busy == shost->host_failed
>>>> + 4. Wakes up SCSI EH thread if shost->host_busy == shost->host_failed
>>>>
>>>>   As can be seen above, once any scmd is added to shost->eh_cmd_q,
>>>>  SHOST_RECOVERY shost_state bit is turned on.  This prevents any new
>>>> @@ -252,7 +247,6 @@ scmd->allowed.
>>>>
>>>>   1. Error completion / time out
>>>>      ACTION: scsi_eh_scmd_add() is invoked for scmd
>>>> -  - set scmd->eh_eflags
>>>>    - add scmd to shost->eh_cmd_q
>>>>    - set SHOST_RECOVERY
>>>>    - shost->host_failed++
>>>> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
>>>> index 802a081..7b70ee9 100644
>>>> --- a/drivers/scsi/scsi_error.c
>>>> +++ b/drivers/scsi/scsi_error.c
>>>> @@ -163,7 +163,7 @@ static int scsi_host_eh_past_deadline(struct Scsi_Host *shost)
>>>>            }
>>>>    }
>>>>
>>>> -  scsi_eh_scmd_add(scmd, 0);
>>>> +  scsi_eh_scmd_add(scmd);
>>>>  }
>>>>
>>>>  /**
>>>> @@ -217,9 +217,8 @@ static int scsi_host_eh_past_deadline(struct Scsi_Host *shost)
>>>>  /**
>>>>   * scsi_eh_scmd_add - add scsi cmd to error handling.
>>>>   * @scmd: scmd to run eh on.
>>>> - * @eh_flag:      optional SCSI_EH flag.
>>>>   */
>>>> -void scsi_eh_scmd_add(struct scsi_cmnd *scmd, int eh_flag)
>>>> +void scsi_eh_scmd_add(struct scsi_cmnd *scmd)
>>>>  {
>>>>    struct Scsi_Host *shost = scmd->device->host;
>>>>    unsigned long flags;
>>>> @@ -235,9 +234,6 @@ void scsi_eh_scmd_add(struct scsi_cmnd *scmd, int eh_flag)
>>>>    if (shost->eh_deadline != -1 && !shost->last_reset)
>>>>            shost->last_reset = jiffies;
>>>>
>>>> -  if (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED)
>>>> -          eh_flag &= ~SCSI_EH_CANCEL_CMD;
>>>> -  scmd->eh_eflags |= eh_flag;
>>>>    scsi_eh_reset(scmd);
>>>>    list_add_tail(&scmd->eh_entry, &shost->eh_cmd_q);
>>>>    shost->host_failed++;
>>>> @@ -271,10 +267,9 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
>>>>            rtn = host->hostt->eh_timed_out(scmd);
>>>>
>>>>    if (rtn == BLK_EH_NOT_HANDLED) {
>>>> -          if (host->hostt->no_async_abort ||
>>>> -              scsi_abort_command(scmd) != SUCCESS) {
>>>> +          if (scsi_abort_command(scmd) != SUCCESS) {
>>>>                    set_host_byte(scmd, DID_TIME_OUT);
>>>> -                  scsi_eh_scmd_add(scmd, SCSI_EH_CANCEL_CMD);
>>>> +                  scsi_eh_scmd_add(scmd);
>>>>            }
>>>>    }
>>>>
>>>> @@ -327,7 +322,7 @@ static inline void scsi_eh_prt_fail_stats(struct Scsi_Host *shost,
>>>>            list_for_each_entry(scmd, work_q, eh_entry) {
>>>>                    if (scmd->device == sdev) {
>>>>                            ++total_failures;
>>>> -                          if (scmd->eh_eflags & SCSI_EH_CANCEL_CMD)
>>>> +                          if (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED)
>>>>                                    ++cmd_cancel;
>>>>                            else
>>>>                                    ++cmd_failed;
>>>> @@ -1153,8 +1148,7 @@ int scsi_eh_get_sense(struct list_head *work_q,
>>>>     * should not get sense.
>>>>     */
>>>>    list_for_each_entry_safe(scmd, next, work_q, eh_entry) {
>>>> -          if ((scmd->eh_eflags & SCSI_EH_CANCEL_CMD) ||
>>>> -              (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED) ||
>>>> +          if ((scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED) ||
>>>>                SCSI_SENSE_VALID(scmd))
>>>>                    continue;
>>>>
>>>> @@ -1294,61 +1288,6 @@ static int scsi_eh_test_devices(struct list_head *cmd_list,
>>>>    return list_empty(work_q);
>>>>  }
>>>>
>>>> -
>>>> -/**
>>>> - * scsi_eh_abort_cmds - abort pending commands.
>>>> - * @work_q:       &list_head for pending commands.
>>>> - * @done_q:       &list_head for processed commands.
>>>> - *
>>>> - * Decription:
>>>> - *    Try and see whether or not it makes sense to try and abort the
>>>> - *    running command.  This only works out to be the case if we have one
>>>> - *    command that has timed out.  If the command simply failed, it makes
>>>> - *    no sense to try and abort the command, since as far as the shost
>>>> - *    adapter is concerned, it isn't running.
>>>> - */
>>>> -static int scsi_eh_abort_cmds(struct list_head *work_q,
>>>> -                        struct list_head *done_q)
>>>> -{
>>>> -  struct scsi_cmnd *scmd, *next;
>>>> -  LIST_HEAD(check_list);
>>>> -  int rtn;
>>>> -  struct Scsi_Host *shost;
>>>> -
>>>> -  list_for_each_entry_safe(scmd, next, work_q, eh_entry) {
>>>> -          if (!(scmd->eh_eflags & SCSI_EH_CANCEL_CMD))
>>>> -                  continue;
>>>> -          shost = scmd->device->host;
>>>> -          if (scsi_host_eh_past_deadline(shost)) {
>>>> -                  list_splice_init(&check_list, work_q);
>>>> -                  SCSI_LOG_ERROR_RECOVERY(3,
>>>> -                          scmd_printk(KERN_INFO, scmd,
>>>> -                                      "%s: skip aborting cmd, past eh deadline\n",
>>>> -                                      current->comm));
>>>> -                  return list_empty(work_q);
>>>> -          }
>>>> -          SCSI_LOG_ERROR_RECOVERY(3,
>>>> -                  scmd_printk(KERN_INFO, scmd,
>>>> -                               "%s: aborting cmd\n", current->comm));
>>>> -          rtn = scsi_try_to_abort_cmd(shost->hostt, scmd);
>>>> -          if (rtn == FAILED) {
>>>> -                  SCSI_LOG_ERROR_RECOVERY(3,
>>>> -                          scmd_printk(KERN_INFO, scmd,
>>>> -                                      "%s: aborting cmd failed\n",
>>>> -                                       current->comm));
>>>> -                  list_splice_init(&check_list, work_q);
>>>> -                  return list_empty(work_q);
>>>> -          }
>>>> -          scmd->eh_eflags &= ~SCSI_EH_CANCEL_CMD;
>>>> -          if (rtn == FAST_IO_FAIL)
>>>> -                  scsi_eh_finish_cmd(scmd, done_q);
>>>> -          else
>>>> -                  list_move_tail(&scmd->eh_entry, &check_list);
>>>> -  }
>>>> -
>>>> -  return scsi_eh_test_devices(&check_list, work_q, done_q, 0);
>>>> -}
>>>> -
>>>>  /**
>>>>   * scsi_eh_try_stu - Send START_UNIT to device.
>>>>   * @scmd: &scsi_cmnd to send START_UNIT
>>>> @@ -1691,11 +1630,6 @@ static void scsi_eh_offline_sdevs(struct list_head *work_q,
>>>>            sdev_printk(KERN_INFO, scmd->device, "Device offlined - "
>>>>                        "not ready after error recovery\n");
>>>>            scsi_device_set_state(scmd->device, SDEV_OFFLINE);
>>>> -          if (scmd->eh_eflags & SCSI_EH_CANCEL_CMD) {
>>>> -                  /*
>>>> -                   * FIXME: Handle lost cmds.
>>>> -                   */
>>>> -          }
>>>>            scsi_eh_finish_cmd(scmd, done_q);
>>>>    }
>>>>    return;
>>>> @@ -2139,8 +2073,7 @@ static void scsi_unjam_host(struct Scsi_Host *shost)
>>>>    SCSI_LOG_ERROR_RECOVERY(1, scsi_eh_prt_fail_stats(shost, &eh_work_q));
>>>>
>>>>    if (!scsi_eh_get_sense(&eh_work_q, &eh_done_q))
>>>> -          if (!scsi_eh_abort_cmds(&eh_work_q, &eh_done_q))
>>>> -                  scsi_eh_ready_devs(shost, &eh_work_q, &eh_done_q);
>>>> +          scsi_eh_ready_devs(shost, &eh_work_q, &eh_done_q);
>>>>
>>>>    spin_lock_irqsave(shost->host_lock, flags);
>>>>    if (shost->eh_deadline != -1)
>>>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>>>> index 0735a46..98b2df8 100644
>>>> --- a/drivers/scsi/scsi_lib.c
>>>> +++ b/drivers/scsi/scsi_lib.c
>>>> @@ -1593,7 +1593,7 @@ static void scsi_softirq_done(struct request *rq)
>>>>                    scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY);
>>>>                    break;
>>>>            default:
>>>> -                  scsi_eh_scmd_add(cmd, 0);
>>>> +                  scsi_eh_scmd_add(cmd);
>>>>                    break;
>>>>    }
>>>>  }
>>>> diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
>>>> index 5be6cbf6..e20ab10 100644
>>>> --- a/drivers/scsi/scsi_priv.h
>>>> +++ b/drivers/scsi/scsi_priv.h
>>>> @@ -18,7 +18,6 @@
>>>>  /*
>>>>   * Scsi Error Handler Flags
>>>>   */
>>>> -#define SCSI_EH_CANCEL_CMD        0x0001  /* Cancel this cmd */
>>>>  #define SCSI_EH_ABORT_SCHEDULED   0x0002  /* Abort has been scheduled */
>>>>
>>>>  #define SCSI_SENSE_VALID(scmd) \
>>>> @@ -72,7 +71,7 @@ extern int scsi_dev_info_list_add_keyed(int compatible, char *vendor,
>>>>  extern int scsi_error_handler(void *host);
>>>>  extern int scsi_decide_disposition(struct scsi_cmnd *cmd);
>>>>  extern void scsi_eh_wakeup(struct Scsi_Host *shost);
>>>> -extern void scsi_eh_scmd_add(struct scsi_cmnd *, int);
>>>> +extern void scsi_eh_scmd_add(struct scsi_cmnd *);
>>>>  void scsi_eh_ready_devs(struct Scsi_Host *shost,
>>>>                    struct list_head *work_q,
>>>>                    struct list_head *done_q);
>>>> diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
>>>> index 3cd8c3b..afb0481 100644
>>>> --- a/include/scsi/scsi_host.h
>>>> +++ b/include/scsi/scsi_host.h
>>>> @@ -452,11 +452,6 @@ struct scsi_host_template {
>>>>    unsigned no_write_same:1;
>>>>
>>>>    /*
>>>> -   * True if asynchronous aborts are not supported
>>>> -   */
>>>> -  unsigned no_async_abort:1;
>>>> -
>>>> -  /*
>>>>     * Countdown for host blocking with no commands outstanding.
>>>>     */
>>>>    unsigned int max_host_blocked;
>>>> --
>>>> 1.8.5.6
>>>>
>>>
>>> Hmm, so I guess we compromise on how granular we want recovery to
>>> be?
>>>
>>> When, say, an abort for command A on LUN 1 behind Port α fails for some
>>> reason, do we then also skip all possible aborts for command B on LUN 2
>>> behind Port α and command C on LUN 1 behind Port β? (The host might
>>> already be in recovery by the time commands B and C fail.)
>>>
>> No.
>> Recovery will only be started once all commands have been completed or
>> aborted.
>> Hence we will be aborting all commands before entering SCSI EH.
>>
> 
> When you call scsi_abort_command() after a timeout happened, it will
> check with scsi_host_in_recovery() whether the host is already in
> recovery. Only if it is not will it schedule the abort.
> 
Ah. Umm.

Yes, you are right...

> When the first command A times out, it will schedule an async abort, and
> if that fails it will set the state of the host in scsi_eh_scmd_add() to
> SHOST_RECOVERY before the EH thread is even kicked. So if commands B and
> C time out later than that, there won't be any abort scheduled.
> 
> That is with the state of tag v4.10 plus your patchset. And I have actually
> observed this behavior in real life already (granted, the timeouts
> happened after injects), with the difference that previously the abort
> would still happen in the EH handling.
> 
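To spell out the sequence described above, here is a small userspace
model. It is only a sketch, not kernel code: the model_* names, the
abort_fails knob and the struct layouts are made up for illustration,
and the asynchronous abort is collapsed into a synchronous call. It only
mirrors the ordering discussed here, i.e. that scsi_eh_scmd_add() puts
the host into recovery and that scsi_abort_command() refuses to schedule
an abort once that has happened.

```c
/* Userspace model only: names echo the kernel's, the logic is simplified. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum host_state { HOST_RUNNING, HOST_RECOVERY };

struct model_host {
	enum host_state state;
	int host_failed;      /* commands handed over to EH */
};

struct model_cmd {
	const char *name;
	bool abort_scheduled; /* an async abort was queued for this command */
	bool abort_fails;     /* hypothetical knob: the abort will not succeed */
	bool in_eh;           /* command was added to the EH queue */
};

/* Stand-in for scsi_eh_scmd_add(): the host enters recovery right away. */
static void model_eh_scmd_add(struct model_host *h, struct model_cmd *c)
{
	c->in_eh = true;
	h->state = HOST_RECOVERY;
	h->host_failed++;
}

/* Stand-in for scsi_abort_command(): refuses while the host is in recovery. */
static bool model_abort_command(struct model_host *h, struct model_cmd *c)
{
	if (h->state == HOST_RECOVERY)
		return false;
	c->abort_scheduled = true;
	return true;
}

/* Stand-in for the timeout path: try an abort, else escalate to EH. */
static void model_times_out(struct model_host *h, struct model_cmd *c)
{
	if (!model_abort_command(h, c)) {
		model_eh_scmd_add(h, c);
		return;
	}
	/* The scheduled abort ran and failed: escalate to EH. */
	if (c->abort_fails)
		model_eh_scmd_add(h, c);
}

/* Commands A, B, C time out in order; returns how many aborts were scheduled. */
static int model_run(bool a_abort_fails)
{
	struct model_host h = { HOST_RUNNING, 0 };
	struct model_cmd cmds[3] = {
		{ "A", false, a_abort_fails, false },
		{ "B", false, false, false },
		{ "C", false, false, false },
	};
	int scheduled = 0;

	for (size_t i = 0; i < 3; i++) {
		model_times_out(&h, &cmds[i]);
		scheduled += cmds[i].abort_scheduled;
	}
	assert(scheduled >= 1); /* A is always tried while the host is running */
	return scheduled;
}
```

In this model, model_run(true) schedules a single abort (for A only),
because B and C time out after the host has already been put into
recovery, while model_run(false) schedules one abort per command.
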
Hmm. Okay. But then we should _not_ be checking if the host is in
recovery when sending aborts, but rather trying to do an abort anyway.
We only need to terminate/not send aborts if eh_deadline has triggered,
as then we're doing a host reset and aborts are meaningless anyway.
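
As a rough sketch of the difference (a userspace model, not the actual
patch): the struct and the model_* helpers below are hypothetical, the
deadline check only loosely follows the idea of
scsi_host_eh_past_deadline(), and jiffies wraparound is ignored.

```c
/* Userspace model only; field names echo the kernel's but are simplified. */
#include <assert.h>
#include <stdbool.h>

struct model_host {
	bool in_recovery;         /* models SHOST_RECOVERY being set */
	long eh_deadline;         /* -1 means no deadline configured */
	unsigned long last_reset; /* 0 means the deadline clock has not started */
};

/* Has the EH deadline expired at time 'now'? (no wraparound handling) */
static bool model_past_deadline(const struct model_host *h, unsigned long now)
{
	assert(h);
	if (h->eh_deadline == -1 || h->last_reset == 0)
		return false;
	return now >= h->last_reset + (unsigned long)h->eh_deadline;
}

/* Behaviour discussed above: no abort once the host is in recovery. */
static bool model_may_abort_current(const struct model_host *h,
				    unsigned long now)
{
	(void)now; /* the deadline plays no role in this variant */
	return !h->in_recovery;
}

/* Proposed: keep sending aborts unless the EH deadline has triggered. */
static bool model_may_abort_proposed(const struct model_host *h,
				     unsigned long now)
{
	return !model_past_deadline(h, now);
}
```

With a host that is already in recovery and whose deadline has not yet
expired, model_may_abort_current() returns false while
model_may_abort_proposed() still returns true; once the deadline has
passed, both return false.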

Will be updating the patchset.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Teamlead Storage & Networking
h...@suse.de                                   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
