[Kernel-packages] [Bug 1775235] Re: Ubuntu 16.04 (4.4.0-127) hangs on boot with virtio-scsi MQ enabled

Felipe Franciosi Thu, 07 Jun 2018 06:50:56 -0700

Hi Joseph,

Thanks for looking at this so promptly. I have downloaded and tested
your kernel. It appears to work well both with SCSI MQ enabled and
disabled (over virtio-scsi). I also ran some basic I/O integrity tests
and didn't spot any problems.


Before you commit this, may I propose a v2 of my own patch, which is
functionally identical but slightly more elegant? It does the
atomic_inc() within virtscsi_pick_vq_mq(). That's probably preferable,
given that the other uses of atomic_*() for non-MQ are done within
virtscsi_pick_vq().

Have a look:
-------------8<-------------
--- old/linux-4.4.0/drivers/scsi/virtio_scsi.c        2018-06-04 10:23:07.000000000 -0700
+++ new/linux-4.4.0/drivers/scsi/virtio_scsi.c        2018-06-07 06:20:58.596764040 -0700
@@ -588,11 +588,13 @@
 }

 static struct virtio_scsi_vq *virtscsi_pick_vq_mq(struct virtio_scsi *vscsi,
+                                                 struct virtio_scsi_target_state *tgt,
                                                  struct scsi_cmnd *sc)
 {
        u32 tag = blk_mq_unique_tag(sc->request);
        u16 hwq = blk_mq_unique_tag_to_hwq(tag);

+       atomic_inc(&tgt->reqs);
        return &vscsi->req_vqs[hwq];
 }

@@ -642,7 +644,7 @@
        struct virtio_scsi_vq *req_vq;

        if (shost_use_blk_mq(sh))
-               req_vq = virtscsi_pick_vq_mq(vscsi, sc);
+               req_vq = virtscsi_pick_vq_mq(vscsi, tgt, sc);
        else
                req_vq = virtscsi_pick_vq(vscsi, tgt);
 

Signed-off-by: Felipe Franciosi <[email protected]>

-------------8<-------------
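To make the v2 shape concrete, here is a minimal userspace sketch (C11 atomics, not kernel code) in which the MQ picker itself accounts the request against the target. The names `struct host`, `struct vq`, `NR_VQS`, `pick_vq_mq()` and `complete_cmd()` are hypothetical stand-ins for the driver's structures and functions, and the hardware queue index is passed in directly instead of being derived from blk_mq_unique_tag().

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_VQS 4 /* hypothetical number of request virtqueues */

/* Hypothetical stand-ins for the driver's structures. */
struct vq {
    int id;
};

struct tgt_state {
    atomic_int reqs; /* in-flight requests for this target */
};

struct host {
    struct vq req_vqs[NR_VQS];
};

/* Analogue of the v2 virtscsi_pick_vq_mq(): select the queue for hardware
 * queue 'hwq' and count the request against the target in the same place. */
static struct vq *pick_vq_mq(struct host *h, struct tgt_state *tgt,
                             unsigned int hwq)
{
    assert(hwq < NR_VQS);
    atomic_fetch_add(&tgt->reqs, 1);
    return &h->req_vqs[hwq];
}

/* Analogue of the completion path, which drops the count again. */
static void complete_cmd(struct tgt_state *tgt)
{
    atomic_fetch_sub(&tgt->reqs, 1);
}
```

Keeping the increment next to the queue selection makes both pickers, MQ and non-MQ, responsible for the accounting, which is the symmetry the v2 patch is aiming for.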

I'm happy to test it again if you'd like, but it should be functionally
identical.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1775235

Title:
  Ubuntu 16.04 (4.4.0-127) hangs on boot with virtio-scsi MQ enabled

Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Xenial:
  In Progress

Bug description:
  We noticed that Ubuntu 16.04 guests running on Nutanix AHV stopped
  booting after they were upgraded to the latest kernel (4.4.0-127).
  Only guests with SCSI MQ enabled suffered from this problem. AHV is
  one of the few hypervisor products to offer multiqueue for virtio-scsi
  devices.

  Upon further investigation, we could see that the kernel would hang
  during the scanning of SCSI targets, immediately after coming across
  a target without any LUNs present. That is the first time the kernel
  destroys a target (because the target has no LUNs).
  This could be confirmed with gdb (attached to qemu's gdbserver):

  #0  0xffffffffc0045039 in ?? ()
  #1  0xffff88022c753c98 in ?? ()
  #2  0xffffffff815d1de6 in scsi_target_destroy (starget=0xffff88022ad62400)
      at /build/linux-E14mqW/linux-4.4.0/drivers/scsi/scsi_scan.c:322

  This shows the guest vCPU stuck on virtio-scsi's implementation of
  target_destroy. Despite lacking symbols, we managed to examine the
  virtio_scsi_target_state to see that the 'reqs' counter was invalid:

  (gdb) p *(struct virtio_scsi_target_state  *)starget->hostdata
  $6 = {tgt_seq = {sequence = 0}, reqs = {counter = -1}, req_vq = 0xffff88022cbdd9e8}
  (gdb)

  This drew our attention to the following patch which is exclusive to the Ubuntu kernel:
  commit f1f609d8015e1d34d39458924dcd9524fccd4307
  Author: Jay Vosburgh <[email protected]>
  Date:   Thu Apr 19 21:40:00 2018 +0200

  In a nutshell, the patch spins on the target's 'reqs' counter waiting for the target to quiesce:
  --- a/drivers/scsi/virtio_scsi.c
  +++ b/drivers/scsi/virtio_scsi.c
  @@ -785,6 +785,10 @@ static int virtscsi_target_alloc(struct scsi_target *starget)
   static void virtscsi_target_destroy(struct scsi_target *starget)
   {
          struct virtio_scsi_target_state *tgt = starget->hostdata;
  +
  +       /* we can race with concurrent virtscsi_complete_cmd */
  +       while (atomic_read(&tgt->reqs))
  +               cpu_relax();
          kfree(tgt);
   }
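The quoted loop exits only when the counter reads exactly zero. A minimal userspace sketch of the same condition, using C11 atomics, makes that visible. The `max_spins` bound and the name `wait_for_quiesce()` are additions for illustration only (the kernel loop has no bound), and cpu_relax() is omitted because it has no portable userspace equivalent.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace analogue of the wait added to virtscsi_target_destroy():
 *     while (atomic_read(&tgt->reqs)) cpu_relax();
 * The hypothetical max_spins bound lets the caller observe a wait that,
 * unbounded, would never finish. Returns true once reqs reads 0. */
static bool wait_for_quiesce(atomic_int *reqs, long max_spins)
{
    assert(max_spins > 0);
    for (long i = 0; i < max_spins; i++) {
        /* Any nonzero value, positive or negative, keeps the loop going. */
        if (atomic_load(reqs) == 0)
            return true;
    }
    return false;
}
```

With the counter at 0 the wait returns immediately; with the counter at -1, as seen in gdb, no number of spins satisfies the condition, which in the unbounded kernel loop means a hang.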

  Personally, I think this is a catastrophic way of waiting for a target
  to quiesce since virtscsi_target_destroy() is called with IRQs
  disabled from scsi_scan.c:scsi_target_destroy(). Devices which take a
  long time to quiesce during a target_destroy() could hog the CPU for
  relatively long periods of time.

  Nevertheless, further study revealed that virtio-scsi itself is broken
  in that it doesn't increment the 'reqs' counter when submitting
  requests on the MQ path (i.e. when shost_use_blk_mq() is true). That
  caused the counter to go to -1 on the completion of the first request,
  and since the wait loop only exits when the counter is exactly zero,
  the CPU hung indefinitely.
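The accounting imbalance can be modelled in a few lines of userspace C. This is a sketch under stated assumptions: `struct tgt_state`, `submit()`, `complete()` and `run_requests()` are hypothetical names, the non-MQ increment that the real driver performs inside virtscsi_pick_vq() is inlined into `submit()`, and completion is assumed to decrement 'reqs' unconditionally, as the observed -1 implies.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct virtio_scsi_target_state: only 'reqs'. */
struct tgt_state {
    atomic_int reqs;
};

/* Submission: the non-MQ branch increments reqs (inside virtscsi_pick_vq()
 * in the real driver); the MQ branch, as reported, does not. */
static void submit(struct tgt_state *tgt, bool use_mq)
{
    if (!use_mq)
        atomic_fetch_add(&tgt->reqs, 1);
    /* use_mq: queue picked by hw queue index, no accounting (the bug). */
}

/* Completion: assumed to decrement reqs regardless of the submit path. */
static void complete(struct tgt_state *tgt)
{
    atomic_fetch_sub(&tgt->reqs, 1);
}

/* Submit and complete n requests in turn, then return the final counter. */
static int run_requests(bool use_mq, int n)
{
    struct tgt_state tgt;
    assert(n >= 0);
    atomic_init(&tgt.reqs, 0);
    for (int i = 0; i < n; i++) {
        submit(&tgt, use_mq);
        complete(&tgt);
    }
    return atomic_load(&tgt.reqs);
}
```

Calling `run_requests(false, 1)` leaves the counter at 0, whereas `run_requests(true, 1)` leaves it at -1, the same value seen in the gdb session; every further MQ request pushes it one lower.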

  The following patch fixes the issue:

  --- old/linux-4.4.0/drivers/scsi/virtio_scsi.c        2018-06-04 10:23:07.000000000 -0700
  +++ new/linux-4.4.0/drivers/scsi/virtio_scsi.c        2018-06-05 10:03:29.083428545 -0700
  @@ -641,9 +641,10 @@
                                  scsi_target(sc->device)->hostdata;
          struct virtio_scsi_vq *req_vq;

  -       if (shost_use_blk_mq(sh))
  +       if (shost_use_blk_mq(sh)) {
                  req_vq = virtscsi_pick_vq_mq(vscsi, sc);
  -       else
  +               atomic_inc(&tgt->reqs);
  +       } else
                  req_vq = virtscsi_pick_vq(vscsi, tgt);

          return virtscsi_queuecommand(vscsi, req_vq, sc);

  Signed-off-by: Felipe Franciosi <[email protected]>

  Please consider this an urgent fix, as all of our customers who use
  Ubuntu 16.04 and have MQ enabled for better performance will be
  affected by your latest update. Our workaround is to recommend that
  they disable SCSI MQ while you work on the issue.

  Best regards,
  Felipe

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1775235/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
