Hi Don,

some comments inline.

On 04/16/2015 03:46 PM, Don Brace wrote:
> From: Webb Scales <[email protected]>
> 
> Allow driver initiated commands to have a timeout.  It does not
> yet try to do anything with timeouts on such commands.
> 
> We are sending a reset in order to get rid of a command we want to abort.
> If we make it return on the same reply queue as the command we want to abort,
> the completion of the aborted command will not race with the completion of
> the reset command.
> 
> Rename hpsa_scsi_do_simple_cmd_core() to hpsa_scsi_do_simple_cmd(), since
> this function is the interface for issuing commands to the controller and
> not the "core" of that implementation.  Add a parameter to it which allows
> the caller to specify the reply queue to be used.  Modify existing callers
> to specify the default reply queue.
> 
> Rename __hpsa_scsi_do_simple_cmd_core() to hpsa_scsi_do_simple_cmd_core(),
> since this routine is the "core" implementation of the "do simple command"
> function and there is no longer any other function with a similar name.
> Modify the existing callers of this routine (other than
> hpsa_scsi_do_simple_cmd()) to instead call hpsa_scsi_do_simple_cmd(), since
> it will now accept the reply_queue paramenter, and it provides a controller
> lock-up check.  (Also, tweak two related message strings to make them
> distinct from each other.)
> 
> Submitting a command to a locked up controller always results in a timeout,
> so check for controller lock-up before submitting.
> 
> This is to enable fixing a race between command completions and
> abort completions on different reply queues in a subsequent patch.
> We want to be able to specify which reply queue an abort completion
> should occur on so that it cannot race the completion of the command
> it is trying to abort.
> 
> The following race was possible in theory:
> 
>   1. Abort command is sent to hardware.
>   2. Command to be aborted simultaneously completes on another
>      reply queue.
>   3. Hardware receives abort command, decides command has already
>      completed and indicates this to the driver via another different
>      reply queue.
>   4. driver processes abort completion finds that the hardware does not know
>      about the command, concludes that therefore the command cannot complete,
>      returns SUCCESS indicating to the mid-layer that the scsi_cmnd may be
>      re-used.
>   5. Command from step 2 is processed and completed back to scsi mid
>      layer (after we already promised that would never happen.)
> 
> Fix by forcing aborts to complete on the same reply queue as the command
> they are aborting.
> 
> Piggybacking device rescanning functionality onto the lockup
> detection thread is not a good idea because if the controller
> locks up during device rescanning, then the thread could get
> stuck, then the lockup isn't detected.  Use separate work
> queues for device rescanning and lockup detection.
> 
> Detect controller lockup in abort handler.
> 
> After a lockup is detected, return DO_NO_CONNECT which results in immediate
> termination of commands rather than DID_ERR which results in retries.
> 
> Modify detect_controller_lockup() to return the result, to remove the need 
> for a separate check.
> 
> Reviewed-by: Scott Teel <[email protected]>
> Reviewed-by: Kevin Barnett <[email protected]>
> Signed-off-by: Webb Scales <[email protected]>
> Signed-off-by: Don Brace <[email protected]>
> ---
>  drivers/scsi/hpsa.c     |  329 
> ++++++++++++++++++++++++++++++++++++-----------
>  drivers/scsi/hpsa_cmd.h |    5 +
>  2 files changed, 257 insertions(+), 77 deletions(-)
> 
[ .. ]
> @@ -4375,13 +4469,46 @@ static int hpsa_eh_device_reset_handler(struct 
> scsi_cmnd *scsicmd)
>                       "device lookup failed.\n");
>               return FAILED;
>       }
> -     dev_warn(&h->pdev->dev, "resetting device %d:%d:%d:%d\n",
> -             h->scsi_host->host_no, dev->bus, dev->target, dev->lun);
> +
> +     /* if controller locked up, we can guarantee command won't complete */
> +     if (lockup_detected(h)) {
> +             dev_warn(&h->pdev->dev,
> +                     "scsi %d:%d:%d:%d RESET FAILED, lockup detected\n",
> +                     h->scsi_host->host_no, dev->bus, dev->target,
> +                     dev->lun);
> +             return FAILED;
> +     }
> +
> +     /* this reset request might be the result of a lockup; check */
> +     if (detect_controller_lockup(h)) {
> +             dev_warn(&h->pdev->dev,
> +                      "scsi %d:%d:%d:%d RESET FAILED, new lockup detected\n",
> +                      h->scsi_host->host_no, dev->bus, dev->target,
> +                      dev->lun);
> +             return FAILED;
> +     }
> +
> +     dev_warn(&h->pdev->dev,
> +             "scsi %d:%d:%d:%d: %s %.8s %.16s resetting RAID-%s 
> SSDSmartPathCap%c En%c Exp=%d\n",
> +             h->scsi_host->host_no, dev->bus, dev->target, dev->lun,
> +             scsi_device_type(dev->devtype),
> +             dev->vendor,
> +             dev->model,
> +             dev->raid_level > RAID_UNKNOWN ?
> +                     "RAID-?" : raid_label[dev->raid_level],
> +             dev->offload_config ? '+' : '-',
> +             dev->offload_enabled ? '+' : '-',
> +             dev->expose_state);
> +
Didn't you just reworked the message logging in the previous patch?
Why can't you use it here?

>       /* send a reset to the SCSI LUN which the command was sent to */
> -     rc = hpsa_send_reset(h, dev->scsi3addr, HPSA_RESET_TYPE_LUN);
> +     rc = hpsa_send_reset(h, dev->scsi3addr, HPSA_RESET_TYPE_LUN,
> +                          DEFAULT_REPLY_QUEUE);
>       if (rc == 0 && wait_for_device_to_become_ready(h, dev->scsi3addr) == 0)
>               return SUCCESS;
>  
> +     dev_warn(&h->pdev->dev,
> +             "scsi %d:%d:%d:%d reset failed\n",
> +             h->scsi_host->host_no, dev->bus, dev->target, dev->lun);
>       return FAILED;
>  }
>  
> @@ -4426,7 +4553,7 @@ static void hpsa_get_tag(struct ctlr_info *h,
>  }
>  
>  static int hpsa_send_abort(struct ctlr_info *h, unsigned char *scsi3addr,
> -     struct CommandList *abort, int swizzle)
> +     struct CommandList *abort, int swizzle, int reply_queue)
>  {
>       int rc = IO_OK;
>       struct CommandList *c;
> @@ -4444,9 +4571,9 @@ static int hpsa_send_abort(struct ctlr_info *h, 
> unsigned char *scsi3addr,
>               0, 0, scsi3addr, TYPE_MSG);
>       if (swizzle)
>               swizzle_abort_tag(&c->Request.CDB[4]);
> -     hpsa_scsi_do_simple_cmd_core(h, c);
> +     (void) hpsa_scsi_do_simple_cmd(h, c, reply_queue, NO_TIMEOUT);
>       hpsa_get_tag(h, abort, &taglower, &tagupper);
> -     dev_dbg(&h->pdev->dev, "%s: Tag:0x%08x:%08x: do_simple_cmd_core 
> completed.\n",
> +     dev_dbg(&h->pdev->dev, "%s: Tag:0x%08x:%08x: do_simple_cmd(abort) 
> completed.\n",
>               __func__, tagupper, taglower);
>       /* no unmap needed here because no data xfer. */
>  
> @@ -4478,7 +4605,7 @@ static int hpsa_send_abort(struct ctlr_info *h, 
> unsigned char *scsi3addr,
>   */
>  
>  static int hpsa_send_reset_as_abort_ioaccel2(struct ctlr_info *h,
> -     unsigned char *scsi3addr, struct CommandList *abort)
> +     unsigned char *scsi3addr, struct CommandList *abort, int reply_queue)
>  {
>       int rc = IO_OK;
>       struct scsi_cmnd *scmd; /* scsi command within request being aborted */
> @@ -4521,7 +4648,7 @@ static int hpsa_send_reset_as_abort_ioaccel2(struct 
> ctlr_info *h,
>                       "Reset as abort: Resetting physical device at scsi3addr 
> 0x%02x%02x%02x%02x%02x%02x%02x%02x\n",
>                       psa[0], psa[1], psa[2], psa[3],
>                       psa[4], psa[5], psa[6], psa[7]);
> -     rc = hpsa_send_reset(h, psa, HPSA_RESET_TYPE_TARGET);
> +     rc = hpsa_send_reset(h, psa, HPSA_RESET_TYPE_TARGET, reply_queue);
>       if (rc != 0) {
>               dev_warn(&h->pdev->dev,
>                       "Reset as abort: Failed on physical device at scsi3addr 
> 0x%02x%02x%02x%02x%02x%02x%02x%02x\n",
> @@ -4555,7 +4682,7 @@ static int hpsa_send_reset_as_abort_ioaccel2(struct 
> ctlr_info *h,
>   * make this true someday become false.
>   */
>  static int hpsa_send_abort_both_ways(struct ctlr_info *h,
> -     unsigned char *scsi3addr, struct CommandList *abort)
> +     unsigned char *scsi3addr, struct CommandList *abort, int reply_queue)
>  {
>       /* ioccelerator mode 2 commands should be aborted via the
>        * accelerated path, since RAID path is unaware of these commands,
> @@ -4563,10 +4690,20 @@ static int hpsa_send_abort_both_ways(struct ctlr_info 
> *h,
>        * Change abort to physical device reset.
>        */
>       if (abort->cmd_type == CMD_IOACCEL2)
> -             return hpsa_send_reset_as_abort_ioaccel2(h, scsi3addr, abort);
> +             return hpsa_send_reset_as_abort_ioaccel2(h, scsi3addr,
> +                                                     abort, reply_queue);
> +
> +     return hpsa_send_abort(h, scsi3addr, abort, 0, reply_queue) &&
> +                     hpsa_send_abort(h, scsi3addr, abort, 1, reply_queue);
> +}
>  
> -     return hpsa_send_abort(h, scsi3addr, abort, 0) &&
> -                     hpsa_send_abort(h, scsi3addr, abort, 1);
> +/* Find out which reply queue a command was meant to return on */
> +static int hpsa_extract_reply_queue(struct ctlr_info *h,
> +                                     struct CommandList *c)
> +{
> +     if (c->cmd_type == CMD_IOACCEL2)
> +             return h->ioaccel2_cmd_pool[c->cmdindex].reply_queue;
> +     return c->Header.ReplyQueue;
>  }
>  
>  /* Send an abort for the specified command.
> @@ -4584,7 +4721,7 @@ static int hpsa_eh_abort_handler(struct scsi_cmnd *sc)
>       char msg[256];          /* For debug messaging. */
>       int ml = 0;
>       __le32 tagupper, taglower;
> -     int refcount;
> +     int refcount, reply_queue;
>  
>       /* Find the controller of the command to be aborted */
>       h = sdev_to_hba(sc->device);
> @@ -4592,8 +4729,23 @@ static int hpsa_eh_abort_handler(struct scsi_cmnd *sc)
>                       "ABORT REQUEST FAILED, Controller lookup failed.\n"))
>               return FAILED;
>  
> -     if (lockup_detected(h))
> +     /* If controller locked up, we can guarantee command won't complete */
> +     if (lockup_detected(h)) {
> +             dev_warn(&h->pdev->dev,
> +                     "scsi %d:%d:%d:%llu scmd %p ABORT FAILED, lockup 
> detected\n",
> +                     h->scsi_host->host_no, sc->device->channel,
> +                     sc->device->id, sc->device->lun, sc);
>               return FAILED;
> +     }
> +
> +     /* This is a good time to check if controller lockup has occurred */
> +     if (detect_controller_lockup(h)) {
> +             dev_warn(&h->pdev->dev,
> +                      "scsi %d:%d:%d:%llu scmd %p ABORT FAILED, new lockup 
> detected\n",
> +                      h->scsi_host->host_no, sc->device->channel,
> +                      sc->device->id, sc->device->lun, sc);
> +             return FAILED;
> +     }
>  
Same here.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                            zSeries & Storage
[email protected]                                   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to