from:"Omar Sandoval"

Re: [PATCH 2/2] tests/scsi/0001: Regression test for SCSI device blacklisting

2018-05-21 Thread Omar Sandoval

On Wed, Dec 13, 2017 at 01:29:39AM -0800, Omar Sandoval wrote:
> On Tue, Aug 15, 2017 at 11:00:49PM -0700, Omar Sandoval wrote:
> > On Wed, Aug 09, 2017 at 12:50:06PM +0200, Hannes Reinecke wrote:
> > > SCSI device blacklisting seems to be a tricky subject, with
> > > lots of potential for messing up the selection algorithm.
> > > This adds a test for catching regressions here.
> > 
> > I'm waiting to see how the patches end up before applying this, but I'm
> > glad to see this test :) A few comments below. I've addressed most of
> > them and pushed it to 
> > https://github.com/osandov/blktests/tree/scsi-blacklist,
> > but the golden output needs to be updated for v3 of your patches. Could
> > you modify the patch in my branch and resend? Thanks for adding tests!
> 
> Hannes, can you resend these tests now that your patches are in? Let me
> know if you need any help. Thanks!

Hannes, this test would still be nice to get in.

Re: [PATCH blktests] scsi/004: add regression test for false BLK_STS_OK with non good SAM status

2018-04-24 Thread Omar Sandoval

On Mon, Apr 23, 2018 at 02:25:03PM +0200, Steffen Maier wrote:
> 
> On 04/19/2018 10:18 PM, Omar Sandoval wrote:
> > On Thu, Apr 19, 2018 at 01:44:41PM -0600, Jens Axboe wrote:
> >> On 4/19/18 1:41 PM, Bart Van Assche wrote:
> >>> On Thu, 2018-04-19 at 12:13 -0700, Omar Sandoval wrote:
> >>>> On Thu, Apr 19, 2018 at 11:53:30AM -0700, Omar Sandoval wrote:
> >>>>> Thanks for the test! Applied.
> >>>>
> >>>> Side note, it's unfortunate that this test takes 180 seconds to run only
> >>>> because we have to wait for the command timeout. We should be able to
> >>>> export request_queue->rq_timeout writeable in sysfs. Would you be
> >>>> interested in doing that?
> >>>
> >>> Hello Omar,
> >>>
> >>> Is this perhaps what you are looking for?
> >>> # ls -l /sys/class/scsi_device/*/*/timeout
> >>> -rw-r--r-- 1 root root 4096 Apr 19 08:52 
> >>> /sys/class/scsi_device/2:0:0:0/device/timeout
> >>> -rw-r--r-- 1 root root 4096 Apr 19 12:39 
> >>> /sys/class/scsi_device/8:0:0:1/device/timeout
> >>
> >> We should have it generically available though, not just for SCSI. In
> >> retrospect, it should have been under queue/ from the start, now we'll
> >> end up with duplicate entries for SCSI.
> > 
> > For the sake of this test, I just decreased the timeout through SCSI.
> 
> Great idea.
> 
> > echo 5 > "/sys/block/${SCSI_DEBUG_DEVICES[0]}/device/timeout"
> 
> However, the timeout should be sufficiently larger than scsi_debug/delay,
> in order not to run into the command timeout.
> It may be unfortunate that scsi_debug/delay uses jiffies as unit and
> can thus differ in a range of an order of magnitude for different kernel 
> configs.
> 
> > # delay to reduce response repetition: around 1..10sec depending on HZ
> > echo 1000 > /sys/bus/pseudo/drivers/scsi_debug/delay
> 
> On s390, we typically have HZ=100, so 1000 jiffies are 10 seconds.

Good catch, I just switched this to use ndelay in nanoseconds instead of
delay.

> We can increase the sdev cmd timeout or decrease the scsi_debug/delay.
> 100 instead of 1000 for scsi_debug/delay worked for me;
> but for some reason the loop checking for busy did not work (any more?)
> causing an unexpected test case error:
> 
> > # ./check scsi/004
> > scsi/004 (ensure repeated TASK SET FULL results in EIO on timing out 
> > command) [failed]
> > runtime  31.892s  ...  31.720s
> > --- tests/scsi/004.out  2018-04-16 11:47:19.105931872 +0200
> > +++ results/nodev/scsi/004.out.bad  2018-04-23 14:07:33.615445253 
> > +0200
> > @@ -1,3 +1,3 @@
> >  Running scsi/004
> > -Input/output error
> > +modprobe: FATAL: Module scsi_debug is in use.
> >  Test complete
> 
> so I added another sleep hack:
> 
>  # dd closing SCSI disk causes implicit TUR also being delayed once
> +# sleep over time window where READ was done and TUR not yet queued
> +sleep 2
>  while grep -q -F "in_use_bm BUSY:" 
> "/proc/scsi/scsi_debug/${SCSI_DEBUG_HOSTS[0]}"; do
> 
> What do you think?

I've been hitting this on and off on all of the scsi-debug tests for
awhile, and I can't figure out where the lingering reference comes from.
I don't think it's related, but I'll look into it.

Re: [PATCH blktests] scsi/004: add regression test for false BLK_STS_OK with non good SAM status

2018-04-19 Thread Omar Sandoval

On Thu, Apr 19, 2018 at 01:44:41PM -0600, Jens Axboe wrote:
> On 4/19/18 1:41 PM, Bart Van Assche wrote:
> > On Thu, 2018-04-19 at 12:13 -0700, Omar Sandoval wrote:
> >> On Thu, Apr 19, 2018 at 11:53:30AM -0700, Omar Sandoval wrote:
> >>> Thanks for the test! Applied.
> >>
> >> Side note, it's unfortunate that this test takes 180 seconds to run only
> >> because we have to wait for the command timeout. We should be able to
> >> export request_queue->rq_timeout writeable in sysfs. Would you be
> >> interested in doing that?
> > 
> > Hello Omar,
> > 
> > Is this perhaps what you are looking for?
> > # ls -l /sys/class/scsi_device/*/*/timeout 
> > -rw-r--r-- 1 root root 4096 Apr 19 08:52 
> > /sys/class/scsi_device/2:0:0:0/device/timeout
> > -rw-r--r-- 1 root root 4096 Apr 19 12:39 
> > /sys/class/scsi_device/8:0:0:1/device/timeout
> 
> We should have it generically available though, not just for SCSI. In
> retrospect, it should have been under queue/ from the start, now we'll
> end up with duplicate entries for SCSI.

For the sake of this test, I just decreased the timeout through SCSI.

Re: [PATCH blktests] scsi/004: add regression test for false BLK_STS_OK with non good SAM status

2018-04-19 Thread Omar Sandoval

On Thu, Apr 19, 2018 at 11:53:30AM -0700, Omar Sandoval wrote:
> Thanks for the test! Applied.

Side note, it's unfortunate that this test takes 180 seconds to run only
because we have to wait for the command timeout. We should be able to
export request_queue->rq_timeout writeable in sysfs. Would you be
interested in doing that?

Re: [PATCH blktests] scsi/004: add regression test for false BLK_STS_OK with non good SAM status

2018-04-19 Thread Omar Sandoval

On Tue, Apr 17, 2018 at 11:03:37AM +0200, Steffen Maier wrote:
> Signed-off-by: Steffen Maier 
> ---
>  tests/scsi/004 |   59 
> 
>  tests/scsi/004.out |3 ++
>  2 files changed, 62 insertions(+), 0 deletions(-)
>  create mode 100755 tests/scsi/004
>  create mode 100644 tests/scsi/004.out
> 
> diff --git a/tests/scsi/004 b/tests/scsi/004
> new file mode 100755
> index 000..4852efc
> --- /dev/null
> +++ b/tests/scsi/004
> @@ -0,0 +1,59 @@
> +#!/bin/bash
> +#
> +# Ensure repeated SAM_STAT_TASK_SET_FULL results in EIO on timing out 
> command.
> +#
> +# Regression test for commit cbe095e2b584 ("Revert "scsi: core: return
> +# BLK_STS_OK for DID_OK in __scsi_error_from_host_byte()"")
> +#
> +# Found independently of corresponding commit mail threads while
> +# experimenting with storage mirroring. This test is a storage-independent
> +# reproducer for the error case I ran into.
> +#
> +# Copyright IBM Corp. 2018
> +# Author: Steffen Maier 
> +#
> +# This program is free software: you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation, either version 3 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see .
> +
> +. common/scsi_debug
> +
> +DESCRIPTION="ensure repeated TASK SET FULL results in EIO on timing out 
> command"
> +
> +requires() {
> + _have_scsi_debug
> +}
> +
> +test() {
> + echo "Running ${TEST_NAME}"
> +
> + if ! _init_scsi_debug add_host=1 max_luns=1 statistics=1 every_nth=1; 
> then
> + return 1
> + fi
> + # every_nth RW with full queue gets SAM_STAT_TASK_SET_FULL
> + echo 0x800 > /sys/bus/pseudo/drivers/scsi_debug/opts
> + # delay to reduce response repetition: around 1..10sec depending on HZ
> + echo 1000 > /sys/bus/pseudo/drivers/scsi_debug/delay
> + # a single command fills device queue to satisfy 0x800 opts condition
> + echo 1 > "/sys/block/${SCSI_DEBUG_DEVICES[0]}/device/queue_depth"
> + dd if="/dev/${SCSI_DEBUG_DEVICES[0]}" iflag=direct of=/dev/null bs=512 
> count=1 |& grep -o "Input/output error"
> + # stop injection
> + echo 0 > /sys/bus/pseudo/drivers/scsi_debug/opts
> + # dd closing SCSI disk causes implicit TUR also being delayed once
> + while grep -q -F "in_use_bm BUSY:" 
> "/proc/scsi/scsi_debug/${SCSI_DEBUG_HOSTS[0]}"; do
> + sleep 1
> + done
> + echo 1 > /sys/bus/pseudo/drivers/scsi_debug/delay
> + _exit_scsi_debug
> +
> + echo "Test complete"
> +}
> diff --git a/tests/scsi/004.out b/tests/scsi/004.out
> new file mode 100644
> index 000..b1126fb
> --- /dev/null
> +++ b/tests/scsi/004.out
> @@ -0,0 +1,3 @@
> +Running scsi/004
> +Input/output error
> +Test complete
> -- 
> 1.7.1
> 

Thanks for the test! Applied.

Re: [PATCH V2 4/8] block: null_blk: introduce module parameter of 'g_global_tags'

2018-02-06 Thread Omar Sandoval

On Mon, Feb 05, 2018 at 11:20:31PM +0800, Ming Lei wrote:
> This patch introduces the parameter of 'g_global_tags' so that we can
> test this feature by null_blk easiy.
> 
> Not see obvious performance drop with global_tags when the whole hw
> depth is kept as same:
> 
> 1) no 'global_tags', each hw queue depth is 1, and 4 hw queues
> modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 global_tags=0 
> submit_queues=4 hw_queue_depth=1
> 
> 2) 'global_tags', global hw queue depth is 4, and 4 hw queues
> modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 global_tags=1 
> submit_queues=4 hw_queue_depth=4
> 
> 3) fio test done in above two settings:
>fio --bs=4k --size=512G  --rw=randread --norandommap --direct=1 
> --ioengine=libaio --iodepth=4 --runtime=$RUNTIME --group_reporting=1  
> --name=nullb0 --filename=/dev/nullb0 --name=nullb1 --filename=/dev/nullb1 
> --name=nullb2 --filename=/dev/nullb2 --name=nullb3 --filename=/dev/nullb3
> 
> 1M IOPS can be reached in both above tests which is done in one VM.
> 
> Cc: Arun Easi <arun.e...@cavium.com>
> Cc: Omar Sandoval <osan...@fb.com>,
> Cc: "Martin K. Petersen" <martin.peter...@oracle.com>,
> Cc: James Bottomley <james.bottom...@hansenpartnership.com>,
> Cc: Christoph Hellwig <h...@lst.de>,
> Cc: Don Brace <don.br...@microsemi.com>
> Cc: Kashyap Desai <kashyap.de...@broadcom.com>
> Cc: Peter Rivera <peter.riv...@broadcom.com>
> Cc: Mike Snitzer <snit...@redhat.com>
> Tested-by: Laurence Oberman <lober...@redhat.com>
> Reviewed-by: Hannes Reinecke <h...@suse.de>

The module parameter is just called "global_tags", not "g_global_tags",
right? The subject should say the former.

Otherwise,

Reviewed-by: Omar Sandoval <osan...@fb.com>

> Signed-off-by: Ming Lei <ming@redhat.com>
> ---
>  drivers/block/null_blk.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
> index 287a09611c0f..ad0834efad42 100644
> --- a/drivers/block/null_blk.c
> +++ b/drivers/block/null_blk.c
> @@ -163,6 +163,10 @@ static int g_submit_queues = 1;
>  module_param_named(submit_queues, g_submit_queues, int, S_IRUGO);
>  MODULE_PARM_DESC(submit_queues, "Number of submission queues");
>  
> +static int g_global_tags = 0;
> +module_param_named(global_tags, g_global_tags, int, S_IRUGO);
> +MODULE_PARM_DESC(global_tags, "All submission queues share one tags");
> +
>  static int g_home_node = NUMA_NO_NODE;
>  module_param_named(home_node, g_home_node, int, S_IRUGO);
>  MODULE_PARM_DESC(home_node, "Home node for the device");
> @@ -1622,6 +1626,8 @@ static int null_init_tag_set(struct nullb *nullb, 
> struct blk_mq_tag_set *set)
>   set->flags = BLK_MQ_F_SHOULD_MERGE;
>   if (g_no_sched)
>   set->flags |= BLK_MQ_F_NO_SCHED;
> + if (g_global_tags)
> + set->flags |= BLK_MQ_F_GLOBAL_TAGS;
>   set->driver_data = NULL;
>  
>   if ((nullb && nullb->dev->blocking) || g_blocking)
> -- 
> 2.9.5
>

Re: [PATCH V2 1/8] blk-mq: tags: define several fields of tags as pointer

2018-02-06 Thread Omar Sandoval

On Mon, Feb 05, 2018 at 11:20:28PM +0800, Ming Lei wrote:
> This patch changes tags->breserved_tags, tags->bitmap_tags and
> tags->active_queues as pointer, and prepares for supporting global tags.
> 
> No functional change.
> 
> Tested-by: Laurence Oberman <lober...@redhat.com>
> Reviewed-by: Hannes Reinecke <h...@suse.com>

Assuming it builds :)

Reviewed-by: Omar Sandoval <osan...@fb.com>

> Cc: Mike Snitzer <snit...@redhat.com>
> Cc: Christoph Hellwig <h...@infradead.org>
> Signed-off-by: Ming Lei <ming@redhat.com>
> ---
>  block/bfq-iosched.c|  4 ++--
>  block/blk-mq-debugfs.c | 10 +-
>  block/blk-mq-tag.c | 48 ++--
>  block/blk-mq-tag.h | 10 +++---
>  block/blk-mq.c |  2 +-
>  block/kyber-iosched.c  |  2 +-
>  6 files changed, 42 insertions(+), 34 deletions(-)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 47e6ec7427c4..1e1211814a57 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -534,9 +534,9 @@ static void bfq_limit_depth(unsigned int op, struct 
> blk_mq_alloc_data *data)
>   WARN_ON_ONCE(1);
>   return;
>   }
> - bt = >breserved_tags;
> + bt = tags->breserved_tags;
>   } else
> - bt = >bitmap_tags;
> + bt = tags->bitmap_tags;
>  
>   if (unlikely(bfqd->sb_shift != bt->sb.shift))
>   bfq_update_depths(bfqd, bt);
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 21cbc1f071c6..0dfafa4b655a 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -433,14 +433,14 @@ static void blk_mq_debugfs_tags_show(struct seq_file *m,
>   seq_printf(m, "nr_tags=%u\n", tags->nr_tags);
>   seq_printf(m, "nr_reserved_tags=%u\n", tags->nr_reserved_tags);
>   seq_printf(m, "active_queues=%d\n",
> -atomic_read(>active_queues));
> +atomic_read(tags->active_queues));
>  
>   seq_puts(m, "\nbitmap_tags:\n");
> - sbitmap_queue_show(>bitmap_tags, m);
> + sbitmap_queue_show(tags->bitmap_tags, m);
>  
>   if (tags->nr_reserved_tags) {
>   seq_puts(m, "\nbreserved_tags:\n");
> - sbitmap_queue_show(>breserved_tags, m);
> + sbitmap_queue_show(tags->breserved_tags, m);
>   }
>  }
>  
> @@ -471,7 +471,7 @@ static int hctx_tags_bitmap_show(void *data, struct 
> seq_file *m)
>   if (res)
>   goto out;
>   if (hctx->tags)
> - sbitmap_bitmap_show(>tags->bitmap_tags.sb, m);
> + sbitmap_bitmap_show(>tags->bitmap_tags->sb, m);
>   mutex_unlock(>sysfs_lock);
>  
>  out:
> @@ -505,7 +505,7 @@ static int hctx_sched_tags_bitmap_show(void *data, struct 
> seq_file *m)
>   if (res)
>   goto out;
>   if (hctx->sched_tags)
> - sbitmap_bitmap_show(>sched_tags->bitmap_tags.sb, m);
> + sbitmap_bitmap_show(>sched_tags->bitmap_tags->sb, m);
>   mutex_unlock(>sysfs_lock);
>  
>  out:
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 336dde07b230..571797dc36cb 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -18,7 +18,7 @@ bool blk_mq_has_free_tags(struct blk_mq_tags *tags)
>   if (!tags)
>   return true;
>  
> - return sbitmap_any_bit_clear(>bitmap_tags.sb);
> + return sbitmap_any_bit_clear(>bitmap_tags->sb);
>  }
>  
>  /*
> @@ -28,7 +28,7 @@ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
>  {
>   if (!test_bit(BLK_MQ_S_TAG_ACTIVE, >state) &&
>   !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, >state))
> - atomic_inc(>tags->active_queues);
> + atomic_inc(hctx->tags->active_queues);
>  
>   return true;
>  }
> @@ -38,9 +38,9 @@ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
>   */
>  void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool include_reserve)
>  {
> - sbitmap_queue_wake_all(>bitmap_tags);
> + sbitmap_queue_wake_all(tags->bitmap_tags);
>   if (include_reserve)
> - sbitmap_queue_wake_all(>breserved_tags);
> + sbitmap_queue_wake_all(tags->breserved_tags);
>  }
>  
>  /*
> @@ -54,7 +54,7 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>   if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, >state))
>   return;
>  
> - atomic_dec(>active_

Re: [PATCH V2 2/8] blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS

2018-02-06 Thread Omar Sandoval

On Mon, Feb 05, 2018 at 11:20:29PM +0800, Ming Lei wrote:
> Quite a few HBAs(such as HPSA, megaraid, mpt3sas, ..) support multiple
> reply queues, but tags is often HBA wide.
> 
> These HBAs have switched to use pci_alloc_irq_vectors(PCI_IRQ_AFFINITY)
> for automatic affinity assignment.
> 
> Now 84676c1f21e8ff5(genirq/affinity: assign vectors to all possible CPUs)
> has been merged to V4.16-rc, and it is easy to allocate all offline CPUs
> for some irq vectors, this can't be avoided even though the allocation
> is improved.
> 
> So all these drivers have to avoid to ask HBA to complete request in
> reply queue which hasn't online CPUs assigned, and HPSA has been broken
> with v4.15+:
> 
>   https://marc.info/?l=linux-kernel=151748144730409=2
> 
> This issue can be solved generically and easily via blk_mq(scsi_mq) multiple
> hw queue by mapping each reply queue into hctx, but one tricky thing is
> the HBA wide(instead of hw queue wide) tags.
> 
> This patch is based on the following Hannes's patch:
> 
>   https://marc.info/?l=linux-block=149132580511346=2
> 
> One big difference with Hannes's is that this patch only makes the tags 
> sbitmap
> and active_queues data structure HBA wide, and others are kept as NUMA 
> locality,
> such as request, hctx, tags, ...
> 
> The following patch will support global tags on null_blk, also the performance
> data is provided, no obvious performance loss is observed when the whole
> hw queue depth is same.
> 
> Cc: Hannes Reinecke <h...@suse.de>
> Cc: Arun Easi <arun.e...@cavium.com>
> Cc: Omar Sandoval <osan...@fb.com>,
> Cc: "Martin K. Petersen" <martin.peter...@oracle.com>,
> Cc: James Bottomley <james.bottom...@hansenpartnership.com>,
> Cc: Christoph Hellwig <h...@lst.de>,
> Cc: Don Brace <don.br...@microsemi.com>
> Cc: Kashyap Desai <kashyap.de...@broadcom.com>
> Cc: Peter Rivera <peter.riv...@broadcom.com>
> Cc: Mike Snitzer <snit...@redhat.com>
> Tested-by: Laurence Oberman <lober...@redhat.com>
> Signed-off-by: Ming Lei <ming@redhat.com>
> ---
>  block/blk-mq-debugfs.c |  1 +
>  block/blk-mq-sched.c   | 13 -
>  block/blk-mq-tag.c | 23 ++-
>  block/blk-mq-tag.h |  5 -
>  block/blk-mq.c | 29 -
>  block/blk-mq.h |  3 ++-
>  include/linux/blk-mq.h |  2 ++
>  7 files changed, 63 insertions(+), 13 deletions(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 0dfafa4b655a..0f0fafe03f5d 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -206,6 +206,7 @@ static const char *const hctx_flag_name[] = {
>   HCTX_FLAG_NAME(SHOULD_MERGE),
>   HCTX_FLAG_NAME(TAG_SHARED),
>   HCTX_FLAG_NAME(SG_MERGE),
> + HCTX_FLAG_NAME(GLOBAL_TAGS),
>   HCTX_FLAG_NAME(BLOCKING),
>   HCTX_FLAG_NAME(NO_SCHED),
>  };
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 55c0a745b427..385bbec73804 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -81,6 +81,17 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx 
> *hctx)
>   } else
>   clear_bit(BLK_MQ_S_SCHED_RESTART, >state);
>  
> + /* need to restart all hw queues for global tags */
> + if (hctx->flags & BLK_MQ_F_GLOBAL_TAGS) {
> + struct blk_mq_hw_ctx *hctx2;
> + int i;
> +
> + queue_for_each_hw_ctx(hctx->queue, hctx2, i)
> + if (blk_mq_run_hw_queue(hctx2, true))
> + return true;

Is it intentional that we stop after the first hw queue does work? That
seems fine but it's a little confusing because the comment claims we
restart everything.

> + return false;
> + }
> +

Re: [PATCH V2 5/8] scsi: introduce force_blk_mq

2018-02-06 Thread Omar Sandoval

On Mon, Feb 05, 2018 at 11:20:32PM +0800, Ming Lei wrote:
> From scsi driver view, it is a bit troublesome to support both blk-mq
> and non-blk-mq at the same time, especially when drivers need to support
> multi hw-queue.
> 
> This patch introduces 'force_blk_mq' to scsi_host_template so that drivers
> can provide blk-mq only support, so driver code can avoid the trouble
> for supporting both.
> 
> This patch may clean up driver a lot by providing blk-mq only support, 
> espeically
> it is easier to convert multiple reply queues into blk_mq's MQ for the 
> following
> purposes:
> 
> 1) use blk_mq multiple hw queue to deal with allocated irq vectors of all 
> offline
> CPU affinity[1]:
> 
>   [1] https://marc.info/?l=linux-kernel=151748144730409=2
> 
> Now 84676c1f21e8ff5(genirq/affinity: assign vectors to all possible CPUs)
> has been merged to V4.16-rc, and it is easy to allocate all offline CPUs
> for some irq vectors, this can't be avoided even though the allocation
> is improved.
> 
> So all these drivers have to avoid to ask HBA to complete request in
> reply queue which hasn't online CPUs assigned.
> 
> This issue can be solved generically and easily via blk_mq(scsi_mq) multiple
> hw queue by mapping each reply queue into hctx.
> 
> 2) some drivers[1] require to complete request in the submission CPU for
> avoiding hard/soft lockup, which is easily done with blk_mq, so not necessary
> to reinvent wheels for solving the problem.
> 
>   [2] https://marc.info/?t=15160185141=1=2
> 
> Sovling the above issues for non-MQ path may not be easy, or introduce
> unnecessary work, especially we plan to enable SCSI_MQ soon as discussed
> recently[3]:
> 
>   [3] https://marc.info/?l=linux-scsi=151727684915589=2
> 
> Cc: Arun Easi <arun.e...@cavium.com>
> Cc: Omar Sandoval <osan...@fb.com>,
> Cc: "Martin K. Petersen" <martin.peter...@oracle.com>,
> Cc: James Bottomley <james.bottom...@hansenpartnership.com>,
> Cc: Christoph Hellwig <h...@lst.de>,
> Cc: Don Brace <don.br...@microsemi.com>
> Cc: Kashyap Desai <kashyap.de...@broadcom.com>
> Cc: Peter Rivera <peter.riv...@broadcom.com>
> Cc: Mike Snitzer <snit...@redhat.com>
> Reviewed-by: Hannes Reinecke <h...@suse.de>
> Tested-by: Laurence Oberman <lober...@redhat.com>
> Signed-off-by: Ming Lei <ming@redhat.com>
> ---
>  drivers/scsi/hosts.c | 1 +
>  include/scsi/scsi_host.h | 3 +++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
> index fe3a0da3ec97..c75cebd7911d 100644
> --- a/drivers/scsi/hosts.c
> +++ b/drivers/scsi/hosts.c
> @@ -471,6 +471,7 @@ struct Scsi_Host *scsi_host_alloc(struct 
> scsi_host_template *sht, int privsize)
>   shost->dma_boundary = 0x;
>  
>   shost->use_blk_mq = scsi_use_blk_mq;

Not sure if this is a patch formatting issue, but this old line wasn't
deleted.

> + shost->use_blk_mq = scsi_use_blk_mq || !!shost->hostt->force_blk_mq;

Re: [PATCH 2/2] tests/scsi/0001: Regression test for SCSI device blacklisting

2017-12-13 Thread Omar Sandoval

On Tue, Aug 15, 2017 at 11:00:49PM -0700, Omar Sandoval wrote:
> On Wed, Aug 09, 2017 at 12:50:06PM +0200, Hannes Reinecke wrote:
> > SCSI device blacklisting seems to be a tricky subject, with
> > lots of potential for messing up the selection algorithm.
> > This adds a test for catching regressions here.
> 
> I'm waiting to see how the patches end up before applying this, but I'm
> glad to see this test :) A few comments below. I've addressed most of
> them and pushed it to https://github.com/osandov/blktests/tree/scsi-blacklist,
> but the golden output needs to be updated for v3 of your patches. Could
> you modify the patch in my branch and resend? Thanks for adding tests!

Hannes, can you resend these tests now that your patches are in? Let me
know if you need any help. Thanks!

Re: [PATCH V6 5/5] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed

2017-10-10 Thread Omar Sandoval

On Mon, Oct 09, 2017 at 07:24:24PM +0800, Ming Lei wrote:
> During dispatching, we moved all requests from hctx->dispatch to
> one temporary list, then dispatch them one by one from this list.
> Unfortunately during this period, run queue from other contexts
> may think the queue is idle, then start to dequeue from sw/scheduler
> queue and still try to dispatch because ->dispatch is empty. This way
> hurts sequential I/O performance because requests are dequeued when
> lld queue is busy.
> 
> This patch introduces the state of BLK_MQ_S_DISPATCH_BUSY to
> make sure that request isn't dequeued until ->dispatch is
> flushed.
> 
> Reviewed-by: Bart Van Assche <bart.vanass...@wdc.com>
> Reviewed-by: Christoph Hellwig <h...@lst.de>

I think this will do for now.

Reviewed-by: Omar Sandoval <osan...@fb.com>

> Signed-off-by: Ming Lei <ming@redhat.com>
> ---
>  block/blk-mq-debugfs.c |  1 +
>  block/blk-mq-sched.c   | 38 --
>  block/blk-mq.c |  5 +
>  include/linux/blk-mq.h |  1 +
>  4 files changed, 39 insertions(+), 6 deletions(-)

Re: [PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue

2017-10-10 Thread Omar Sandoval

On Mon, Oct 09, 2017 at 07:24:23PM +0800, Ming Lei wrote:
> SCSI devices use host-wide tagset, and the shared driver tag space is
> often quite big. Meantime there is also queue depth for each lun(
> .cmd_per_lun), which is often small, for example, on both lpfc and
> qla2xxx, .cmd_per_lun is just 3.
> 
> So lots of requests may stay in sw queue, and we always flush all
> belonging to same hw queue and dispatch them all to driver, unfortunately
> it is easy to cause queue busy because of the small .cmd_per_lun.
> Once these requests are flushed out, they have to stay in hctx->dispatch,
> and no bio merge can participate into these requests, and sequential IO
> performance is hurt a lot.
> 
> This patch introduces blk_mq_dequeue_from_ctx for dequeuing request from
> sw queue so that we can dispatch them in scheduler's way, then we can
> avoid to dequeue too many requests from sw queue when ->dispatch isn't
> flushed completely.
> 
> This patch improves dispatching from sw queue when there is per-request-queue
> queue depth by taking request one by one from sw queue, just like the way
> of IO scheduler.

This still didn't address Jens' concern about using q->queue_depth as
the heuristic for whether to do the full sw queue flush or one-by-one
dispatch. The EWMA approach is a bit too complex for now, can you please
try the heuristic of whether the driver ever returned BLK_STS_RESOURCE?

> Reviewed-by: Omar Sandoval <osan...@fb.com>
> Reviewed-by: Bart Van Assche <bart.vanass...@wdc.com>
> Signed-off-by: Ming Lei <ming@redhat.com>
> ---
>  block/blk-mq-sched.c   | 50 
> --
>  block/blk-mq.c | 39 +++
>  block/blk-mq.h |  2 ++
>  include/linux/blk-mq.h |  2 ++
>  4 files changed, 91 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index be29ba849408..14b354f617e5 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -104,6 +104,39 @@ static void blk_mq_do_dispatch_sched(struct 
> blk_mq_hw_ctx *hctx)
>   } while (blk_mq_dispatch_rq_list(q, _list));
>  }
>  
> +static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
> +   struct blk_mq_ctx *ctx)
> +{
> + unsigned idx = ctx->index_hw;
> +
> + if (++idx == hctx->nr_ctx)
> + idx = 0;
> +
> + return hctx->ctxs[idx];
> +}
> +
> +static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
> +{
> + struct request_queue *q = hctx->queue;
> + LIST_HEAD(rq_list);
> + struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
> +
> + do {
> + struct request *rq;
> +
> + rq = blk_mq_dequeue_from_ctx(hctx, ctx);
> + if (!rq)
> + break;
> + list_add(>queuelist, _list);
> +
> + /* round robin for fair dispatch */
> + ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
> +
> + } while (blk_mq_dispatch_rq_list(q, _list));
> +
> + WRITE_ONCE(hctx->dispatch_from, ctx);
> +}
> +
>  void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  {
>   struct request_queue *q = hctx->queue;
> @@ -143,10 +176,23 @@ void blk_mq_sched_dispatch_requests(struct 
> blk_mq_hw_ctx *hctx)
>*/
>   if (!list_empty(_list)) {
>   blk_mq_sched_mark_restart_hctx(hctx);
> - if (blk_mq_dispatch_rq_list(q, _list) && has_sched_dispatch)
> - blk_mq_do_dispatch_sched(hctx);
> + if (blk_mq_dispatch_rq_list(q, _list)) {
> + if (has_sched_dispatch)
> + blk_mq_do_dispatch_sched(hctx);
> + else
> + blk_mq_do_dispatch_ctx(hctx);
> + }
>   } else if (has_sched_dispatch) {
>   blk_mq_do_dispatch_sched(hctx);
> + } else if (q->queue_depth) {
> + /*
> +  * If there is per-request_queue depth, we dequeue
> +  * request one by one from sw queue for avoiding to mess
> +  * up I/O merge when dispatch runs out of resource, which
> +  * can be triggered easily when there is per-request_queue
> +  * queue depth or .cmd_per_lun, such as SCSI device.
> +  */
> + blk_mq_do_dispatch_ctx(hctx);
>   } else {
>   blk_mq_flush_busy_ctxs(hctx, _list);
>   blk_mq_dispatch_rq_list(q, _list);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 076cbab9c3e0..394cb75d66fa 100644
>

Re: [PATCH V6 3/5] sbitmap: introduce __sbitmap_for_each_set()

2017-10-10 Thread Omar Sandoval

On Mon, Oct 09, 2017 at 07:24:22PM +0800, Ming Lei wrote:
> We need to iterate ctx starting from any ctx in round robin
> way, so introduce this helper.
> 
> Cc: Omar Sandoval <osan...@fb.com>

Reviewed-by: Omar Sandoval <osan...@fb.com>

> Signed-off-by: Ming Lei <ming@redhat.com>
> ---
>  include/linux/sbitmap.h | 64 
> -
>  1 file changed, 47 insertions(+), 17 deletions(-)

Re: [PATCH V6 1/5] blk-mq-sched: fix scheduler bad performance

2017-10-10 Thread Omar Sandoval

On Mon, Oct 09, 2017 at 07:24:20PM +0800, Ming Lei wrote:
> When hw queue is busy, we shouldn't take requests from
> scheduler queue any more, otherwise it is difficult to do
> IO merge.
> 
> This patch fixes the awful IO performance on some
> SCSI devices(lpfc, qla2xxx, ...) when mq-deadline/kyber
> is used by not taking requests if hw queue is busy.
> 
> Reviewed-by: Bart Van Assche <bart.vanass...@wdc.com>
> Reviewed-by: Christoph Hellwig <h...@lst.de>

Reviewed-by: Omar Sandoval <osan...@fb.com>

> Signed-off-by: Ming Lei <ming@redhat.com>
> ---
>  block/blk-mq-sched.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 4ab69435708c..eca011fdfa0e 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -94,7 +94,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
> *hctx)
>   struct request_queue *q = hctx->queue;
>   struct elevator_queue *e = q->elevator;
>   const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
> - bool did_work = false;
> + bool do_sched_dispatch = true;
>   LIST_HEAD(rq_list);
>  
>   /* RCU or SRCU read lock is needed before checking quiesced flag */
> @@ -125,18 +125,18 @@ void blk_mq_sched_dispatch_requests(struct 
> blk_mq_hw_ctx *hctx)
>*/
>   if (!list_empty(_list)) {
>   blk_mq_sched_mark_restart_hctx(hctx);
> - did_work = blk_mq_dispatch_rq_list(q, _list);
> + do_sched_dispatch = blk_mq_dispatch_rq_list(q, _list);
>   } else if (!has_sched_dispatch) {
>   blk_mq_flush_busy_ctxs(hctx, _list);
>   blk_mq_dispatch_rq_list(q, _list);
>   }
>  
>   /*
> -  * We want to dispatch from the scheduler if we had no work left
> -  * on the dispatch list, OR if we did have work but weren't able
> -  * to make progress.
> +  * We want to dispatch from the scheduler if there was nothing
> +  * on the dispatch list or we were able to dispatch from the
> +  * dispatch list.
>*/
> - if (!did_work && has_sched_dispatch) {
> + if (do_sched_dispatch && has_sched_dispatch) {
>   do {
>   struct request *rq;
>  
> -- 
> 2.9.5
>

Re: [PATCH 2/2] tests/scsi/0001: Regression test for SCSI device blacklisting

2017-08-16 Thread Omar Sandoval

On Wed, Aug 09, 2017 at 12:50:06PM +0200, Hannes Reinecke wrote:
> SCSI device blacklisting seems to be a tricky subject, with
> lots of potential for messing up the selection algorithm.
> This adds a test for catching regressions here.

I'm waiting to see how the patches end up before applying this, but I'm
glad to see this test :) A few comments below. I've addressed most of
them and pushed it to https://github.com/osandov/blktests/tree/scsi-blacklist,
but the golden output needs to be updated for v3 of your patches. Could
you modify the patch in my branch and resend? Thanks for adding tests!

> Signed-off-by: Hannes Reinecke 
> ---
>  tests/scsi/001 | 69 
> ++
>  tests/scsi/001.out | 10 
>  2 files changed, 79 insertions(+)
>  create mode 100644 tests/scsi/001
>  create mode 100644 tests/scsi/001.out
> 
> diff --git a/tests/scsi/001 b/tests/scsi/001
> new file mode 100644
> index 000..374a458
> --- /dev/null
> +++ b/tests/scsi/001
> @@ -0,0 +1,69 @@
> +#!/bin/bash
> +#
> +# Regression test for scsi device blacklisting

In my experience with xfstests, the single most useful piece of
information to include for a regression test is the commit or patch it
tests. I fixed this to say 'Regression test for patch "scsi_devinfo:
fixup string compare".'

> +# Copyright (C) 2017 Hannes Reinecke, SUSE Linux GmbH
> +#
> +# This program is free software: you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation, either version 3 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see .
> +
> +DESCRIPTION="SCSI device blacklisting"
> +
> +QUICK=1
> +
> +CHECK_DMESG=0

I realize that the documentation wasn't clear about what we check for in
dmesg. It's just oopses and warnings and such, so we don't need/want to
disable checking dmesg here. I removed it and improved the
documentation.

> +requires() {
> +if modinfo scsi_debug | grep -q inq_vendor ; then
> + return 0
> +fi
> +return 1
> +}

There's a helper for this, _have_module_param.

> +test() {
> +local inq vendor model host dev blacklist
> +
> +echo "Running ${TEST_NAME}"
> +
> +for inq in \
> + "" \
> + "" \
> + "HITACHI OPEN-V  " \
> + "Scanner " \
> + "Inateck " \
> + "Promise STEX" \
> + "HITAOPEN-V  " \
> + "ABCDScanner " ; do
> + vendor=${inq:0:8}
> + model=${inq:8:16}
> + modprobe scsi_debug inq_vendor="$vendor" inq_product="$model"
> + host=$(lsscsi -H | sed -n 's/.\([0-9]*\).*scsi_debug/\1/p')
> + if [ -z "$host" ] ; then
> + echo "Test failed, scsi_debug could not be loaded"
> + return 1
> + fi
> + dev=$(lsscsi | grep $host | sed -n 's/.*\/dev\/\(sd[a-z]*\).*/\1/p')
> + if [ -z "$dev" ] ; then
> + echo "Test failed, SCSI device not found"
> + rmmod scsi_debug
> + return 1
> + fi

The biggest change I made was improving the existing scsi_debug helpers
so you don't have to hardcode this.

> + vendor=$(cat /sys/block/$dev/device/vendor)
> + model=$(cat /sys/block/$dev/device/model)
> + blacklist=$(cat /sys/block/$dev/device/blacklist)
> + echo "$vendor $model $blacklist"
> + rmmod scsi_debug
> +done
> +echo "Test complete"
> +return 0
> +}
> diff --git a/tests/scsi/001.out b/tests/scsi/001.out
> new file mode 100644
> index 000..64db97c
> --- /dev/null
> +++ b/tests/scsi/001.out
> @@ -0,0 +1,10 @@
> +Running scsi/001
> +  0x0
> +  0x0
> +HITACHI  OPEN-V   0x2
> + Scanner  0x1
> +Inateck   0x0
> +Promise  STEX 0x40
> +HITA OPEN-V   0x0
> +ABCD Scanner  0x0
> +Test complete

As mentioned above, this needs updating for the symbolic constants.

Re: [PATCH 1/2] tests/scsi: add SCSI midlayer test group

2017-08-15 Thread Omar Sandoval

On Wed, Aug 09, 2017 at 12:50:05PM +0200, Hannes Reinecke wrote:
> Add a test group for tests of the SCSI midlayer.
> 
> Signed-off-by: Hannes Reinecke 
> ---
>  tests/scsi/group | 26 ++
>  1 file changed, 26 insertions(+)
>  create mode 100644 tests/scsi/group
> 
> diff --git a/tests/scsi/group b/tests/scsi/group
> new file mode 100644
> index 000..73459a6
> --- /dev/null
> +++ b/tests/scsi/group
> @@ -0,0 +1,26 @@
> +#!/bin/bash
> +#
> +# SCSI regression tests
> +#
> +# Copyright (C) 2017 Hannes Reinecke, SUSE Linux GmbH
> +#
> +# This program is free software: you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation, either version 3 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see .
> +
> +#
> +# Usually, group_requires() just needs to check that any necessary programs 
> and
> +# kernel features are available using the _have_foo helpers. If
> +# group_requires() returns non-zero, all tests in this group will be skipped.

This comment should have been removed, looks fine otherwise.

> +group_requires() {
> + _have_root
> +}
> -- 
> 1.8.5.6
>

Re: [PATCH blktests] sg: add regression test for patch scsi: sg: fix SG_DXFER_FROM_DEV transfers

2017-07-14 Thread Omar Sandoval

On Thu, Jul 06, 2017 at 02:09:21PM +0200, Johannes Thumshirn wrote:
> Add a regression test for the patch titled "scsi: sg: fix
> SG_DXFER_FROM_DEV transfers" which reassembles the syscalls done by Nero
> Burning ROM to discover CD and DVD burners.

Fixed up a few things below and applied, thanks!

> Signed-off-by: Johannes Thumshirn 
> ---
>  common/sg   |  2 +-
>  src/.gitignore  |  1 +
>  src/Makefile|  2 +-
>  src/sg/dxfer-from-dev.c | 57 
> +
>  src/sg/sg-dxfer.c   | 57 
> +
>  tests/sg/002| 38 +
>  tests/sg/002.out|  3 +++
>  7 files changed, 158 insertions(+), 2 deletions(-)
>  create mode 100644 src/sg/dxfer-from-dev.c
>  create mode 100644 src/sg/sg-dxfer.c
>  create mode 100755 tests/sg/002
>  create mode 100644 tests/sg/002.out
> 
> diff --git a/common/sg b/common/sg
> index c306af500350..19732ec6a541 100644
> --- a/common/sg
> +++ b/common/sg
> @@ -30,5 +30,5 @@ _test_dev_is_scsi() {
>  }
>  
>  _get_sg_from_blockdev() {
> - echo /sys/block/"$1"/device/scsi_generic/sg+([0-9])
> + echo ${TEST_DEV_SYSFS}/device/scsi_generic/sg* | grep -Eo "sg[0-9]+"
>  }

I renamed this to _get_test_dev_sg to reflect the new behavior.

> diff --git a/src/.gitignore b/src/.gitignore
> index 722c137c7fca..0fca08a74fb1 100644
> --- a/src/.gitignore
> +++ b/src/.gitignore
> @@ -1 +1,2 @@
>  /sg/syzkaller1
> +/sg/dxfer-from-dev
> diff --git a/src/Makefile b/src/Makefile
> index 0e8d74688db4..57b6bb25793a 100644
> --- a/src/Makefile
> +++ b/src/Makefile
> @@ -1,4 +1,4 @@
> -TARGETS := sg/syzkaller1
> +TARGETS := sg/syzkaller1 sg/dxfer-from-dev
>  
>  CFLAGS := -O2
>  
> diff --git a/src/sg/dxfer-from-dev.c b/src/sg/dxfer-from-dev.c
> new file mode 100644
> index ..ca52f30e23af
> --- /dev/null
> +++ b/src/sg/dxfer-from-dev.c
> @@ -0,0 +1,57 @@
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +
> +int main(int argc, char **argv)
> +{
> + int fd;
> + int rc;
> + int rsz = 131072;
> + int tout = 1080;
> + char buf[42] = { 0 };
> +
> + if (argc != 2) {
> + printf("usage: %s /dev/sgX\n", argv[0]);
> + return 1;
> + }
> +
> + fd = open(argv[1], O_RDWR);
> + if (fd < 0) {
> + perror("open");
> + return 1;
> + }
> +
> + rc = ioctl(fd, SG_SET_RESERVED_SIZE, );
> + if (rc < 0) {
> + perror("ioctl SG_SET_RESERVED_SIZE");
> + goto out_close;
> + }
> +
> + rc = ioctl(fd, SG_SET_TIMEOUT, );
> + if (rc < 0) {
> + perror("ioctl SG_SET_TIMEOUT");
> + goto out_close;
> + }
> +
> + buf[4] = 'H';
> + rc = write(fd, , sizeof(buf));
> + if (rc < 0) {
> + perror("write");
> + if (errno == EINVAL)
> + printf("FAIL\n");
> + goto out_close;
> + }
> +
> + printf("PASS\n");
> +
> +out_close:
> + close(fd);
> +}
> diff --git a/src/sg/sg-dxfer.c b/src/sg/sg-dxfer.c
> new file mode 100644
> index ..ca52f30e23af
> --- /dev/null
> +++ b/src/sg/sg-dxfer.c

This file looks like a copy of sg-dxfer-dev.c, removed it.

> diff --git a/tests/sg/002 b/tests/sg/002
> new file mode 100755
> index ..c8b39091f82a
> --- /dev/null
> +++ b/tests/sg/002
> @@ -0,0 +1,38 @@
> +#!/bin/bash
> +#
> +# TODO: provide a description of the test here, i.e., what it tests and how. 
> If
> +# this is a regression test for a patch, reference the patch title:

You forgot to delete this part :)

Re: [PATCH blktests v2 0/3] Add SCSI generic test group

2017-06-07 Thread Omar Sandoval

On Fri, May 19, 2017 at 03:55:28PM +0200, Johannes Thumshirn wrote:
> Add a test group for the SCSI generic driver and one syzcaller
> reproducer for this group.
> 
> The reprodcuer is distributed as a C program, so the makefile is
> amended to build C files to be used in the test.
> 
> Changes to v1:
> * Stripped left over TODO comment
> * Modified reproducer to accept a device name
> * Fixed Makefile so it can build more than one target
> 
> Johannes Thumshirn (3):
>   Add ability to build test-cases
>   tests/sg: add SCSI generic test grouop
>   sg/001: add regression test for syzcaller generated GPF in sg_read
> path

Hey, Johannes,

Sorry for the delay, I was chasing down a few Btrfs bugs for the last
couple of weeks. I applied these with some small changes. I stuck with
running syzkaller on the scsi-debug device like you had here, we can
revisit that for future tests if it makes sense to run them on a real
device. Thanks!

Re: [PATCH blktests v2 3/3] sg/001: add regression test for syzcaller generated GPF in sg_read path

2017-05-22 Thread Omar Sandoval

On Fri, May 19, 2017 at 03:55:31PM +0200, Johannes Thumshirn wrote:
> Add a regression test for commit 48ae8484e9fc ("scsi: sg: don't return
> bogus Sg_requests"). This is a general protection fault triggered by
> syzcaller via issuing bogus read(2)s on the /dev/sg devices.
> 
> Signed-off-by: Johannes Thumshirn 
> ---
>  tests/sg/001 | 47 +++
>  tests/sg/001.out |  2 ++
>  2 files changed, 49 insertions(+)
>  create mode 100755 tests/sg/001
>  create mode 100644 tests/sg/001.out
> 
> diff --git a/tests/sg/001 b/tests/sg/001
> new file mode 100755
> index ..86430409b6a3
> --- /dev/null
> +++ b/tests/sg/001
> @@ -0,0 +1,47 @@
> +#!/bin/bash
> +#
> +# Regression test for commit 48ae8484e9fc ("scsi: sg: don't return bogus
> +# Sg_requests")
> +#
> +# Copyright (C) 2017 Johannes Thumshirn 
> +#
> +# This program is free software: you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation, either version 3 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see .
> +
> +. common/sg
> +. common/scsi_debug
> +
> +DESCRIPTION="try triggering a kernel GPF with 0 byte SG reads"
> +QUICK=1
> +
> +requires() {
> + _have_program src/sg-001 \
> + && _have_scsi_debug \
> + && _have_scsi_generic
> +}
> +
> +
> +test() {
> + echo "Running ${TEST_NAME}"
> +
> + if ! _get_scsi_debug_dev; then
> + return 1
> + fi
> +
> + SG_DEV=$(_get_sg_from_blockdev "$SCSI_DEBUG_NAME")
> + timeout -s INT 10s ./src/sg-001 "$SG_DEV"
> +
> + _put_scsi_debug_dev
> +
> + echo "Test complete"
> +}

This looks much better, thanks! One question for you: is there any value
in running this on specific test devices (i.e., changing test() to
test_device() and using "$TEST_DEV" instead of a scsi-debug device), or
would it be a waste of time since it's just exercising generic code?

Re: [PATCH blktests 0/3] Add SCSI generic test group

2017-05-18 Thread Omar Sandoval

On Thu, May 18, 2017 at 03:29:45PM +0200, Johannes Thumshirn wrote:
> On 05/18/2017 03:19 PM, Christoph Hellwig wrote:
> > All SG_IO test should also apply to block device nodes that support
> > the ioctl..
> > 
> 
> But these are not necessarily SG_IO tests, are they?
> 
> The test included is doesn't hit the SG_IO path in the sg driver, but
> the sg_read path.
> 
> Of cause we can make a generic sg_io tool but IMHO it's quite convenient
> to just throw in the reproducers we get.
> 
> And as it exercises the /dev/sg character device I decided to make a new
> group.
> 
>   Johannes

Looking at this some more, it seems like the syzkaller reproducer always
bangs on /dev/sg0. How hard would it be to adapt it to run on the sg
device for every test device instead?

Re: [PATCH blktests 2/3] tests/sg: add SCSI generic test grouop

2017-05-18 Thread Omar Sandoval

On Thu, May 18, 2017 at 02:06:20PM -0700, Omar Sandoval wrote:
> On Thu, May 18, 2017 at 02:13:07PM +0200, Johannes Thumshirn wrote:
> > Add a test group for tests of the SCSI generic driver and and
> > functions common to the SCSI generic driver and it's test cases.
> > 
> > Signed-off-by: Johannes Thumshirn <jthumsh...@suse.de>
> > ---
> >  common/sg  | 22 ++
> >  tests/sg/group | 40 
> >  2 files changed, 62 insertions(+)
> >  create mode 100644 common/sg
> >  create mode 100644 tests/sg/group
> > 
> > diff --git a/common/sg b/common/sg
> > new file mode 100644
> > index ..30b5089c68f7
> > --- /dev/null
> > +++ b/common/sg
> > +# TODO: if this test group has extra requirements for what devices it can 
> > be
> > +# run on, it should define a group_device_requires() function. If tests in 
> > this
> > +# group can be run on the test device, it should return zero. Otherwise, it
> > +# should return non-zero and set the $SKIP_REASON variable. $TEST_DEV is 
> > the
> > +# full path of the block device (e.g., /dev/nvme0n1 or /dev/sda1), and
> > +# $TEST_DEV_SYSFS is the sysfs path of the disk (not the partition, e.g.,
> > +# /sys/block/nvme0n1 or /sys/block/sda).
> > +#
> > +# Usually, group_device_requires() just needs to check that the test 
> > device is
> > +# the right type of hardware or supports any necessary features using the
> > +# _test_dev_foo helpers. If group_device_requires() returns non-zero, all 
> > tests
> > +# in this group will be skipped on that device.
> > +# group_device_requires() {
> > +#  _test_dev_is_foo && _test_dev_supports_bar
> > +# }
> 
> Leftover TODO, I'll remove it when applying. If we add an sg test that
> runs on an actual device, we can define group_device_requires().

On second though, since I had some comments for patch 3, just fix this
up when you resend.

Re: [PATCH blktests 3/3] sg/001: add regression test for syzcaller generated GPF

2017-05-18 Thread Omar Sandoval

On Thu, May 18, 2017 at 02:13:08PM +0200, Johannes Thumshirn wrote:
> Add a regression test for commit 48ae8484e9fc ("scsi: sg: don't return
> bogus Sg_requests"). This is a general protection fault triggered by
> syzcaller.
> 
> Signed-off-by: Johannes Thumshirn 
> ---
>  tests/sg/001 | 48 
>  tests/sg/001.out |  2 ++
>  2 files changed, 50 insertions(+)
>  create mode 100755 tests/sg/001
>  create mode 100644 tests/sg/001.out
> 
> diff --git a/tests/sg/001 b/tests/sg/001
> new file mode 100755
> index ..3a72931d5748
> --- /dev/null
> +++ b/tests/sg/001
> @@ -0,0 +1,48 @@
> +#!/bin/bash
> +#
> +# Regression test for commit 48ae8484e9fc ("scsi: sg: don't return bogus 
> Sg_requests")
> +#
> +# Copyright (C) 2017 Johannes Thumshirn 
> +#
> +# This program is free software: you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation, either version 3 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see .
> +
> +. common/sg
> +. common/scsi_debug
> +
> +DESCRIPTION="Regression test for commit 48ae8484e9fc (\"scsi: sg: don't 
> return bogus Sg_requests\")"

This description doesn't really say what the test does. If this test
fails, the person running it should be able to know what the test is
doing without having to decode the syzkaller reproducer. I'd prefer
something like:

DESCRIPTION="do bogus sg reads and writes"

Or something like that, that's my best attempt at understanding the
reproducer :)

> +TIMED=1
> +QUICK=1
> +
> +requires() {
> + _have_program src/sg-001 \
> + _have_program timeout \

timeout is part of coreutils so this isn't necessary.

> + && _have_scsi_debug \
> + && _have_scsi_generic
> +}
> +
> +
> +test() {
> + echo "Running ${TEST_NAME}"
> +
> + if ! _get_scsi_debug_dev; then
> + return 1
> + fi
> +
> + _divide_timeout 2

You don't use $TIMEOUT in this test, so remove this and the TIMED=1. The
arbitrary timeout is fine for this kind of test, I think.

> + timeout -s INT 10s ./src/sg-001
> +
> + _put_scsi_debug_dev
> + 
> + echo "Test complete"
> +}
> diff --git a/tests/sg/001.out b/tests/sg/001.out
> new file mode 100644
> index ..beb4c437dd28
> --- /dev/null
> +++ b/tests/sg/001.out
> @@ -0,0 +1,2 @@
> +Running sg/001
> +Test complete
> -- 
> 2.12.0
>

Re: [PATCH blktests 2/3] tests/sg: add SCSI generic test grouop

2017-05-18 Thread Omar Sandoval

On Thu, May 18, 2017 at 02:13:07PM +0200, Johannes Thumshirn wrote:
> Add a test group for tests of the SCSI generic driver and and
> functions common to the SCSI generic driver and it's test cases.
> 
> Signed-off-by: Johannes Thumshirn 
> ---
>  common/sg  | 22 ++
>  tests/sg/group | 40 
>  2 files changed, 62 insertions(+)
>  create mode 100644 common/sg
>  create mode 100644 tests/sg/group
> 
> diff --git a/common/sg b/common/sg
> new file mode 100644
> index ..30b5089c68f7
> --- /dev/null
> +++ b/common/sg
> +# TODO: if this test group has extra requirements for what devices it can be
> +# run on, it should define a group_device_requires() function. If tests in 
> this
> +# group can be run on the test device, it should return zero. Otherwise, it
> +# should return non-zero and set the $SKIP_REASON variable. $TEST_DEV is the
> +# full path of the block device (e.g., /dev/nvme0n1 or /dev/sda1), and
> +# $TEST_DEV_SYSFS is the sysfs path of the disk (not the partition, e.g.,
> +# /sys/block/nvme0n1 or /sys/block/sda).
> +#
> +# Usually, group_device_requires() just needs to check that the test device 
> is
> +# the right type of hardware or supports any necessary features using the
> +# _test_dev_foo helpers. If group_device_requires() returns non-zero, all 
> tests
> +# in this group will be skipped on that device.
> +# group_device_requires() {
> +#_test_dev_is_foo && _test_dev_supports_bar
> +# }

Leftover TODO, I'll remove it when applying. If we add an sg test that
runs on an actual device, we can define group_device_requires().

Re: [PATCH v5 10/10] scsi: Implement blk_mq_ops.show_rq()

2017-04-25 Thread Omar Sandoval

On Tue, Apr 25, 2017 at 01:37:45PM -0700, Bart Van Assche wrote:
> Show the SCSI CDB, .eh_eflags and .result for pending SCSI commands
> in /sys/kernel/debug/block/*/mq/*/dispatch and */rq_list.

Only thing I noticed was that the only other caller I see has buf[70].
No idea if that's a meaningful number. For the sake of this not getting
bike-shedded to death,

Reviewed-by: Omar Sandoval <osan...@fb.com>

> Signed-off-by: Bart Van Assche <bart.vanass...@sandisk.com>
> Cc: Martin K. Petersen <martin.peter...@oracle.com>
> Cc: James Bottomley <james.bottom...@hansenpartnership.com>
> Cc: Omar Sandoval <osan...@fb.com>
> Cc: Hannes Reinecke <h...@suse.com>
> Cc: <linux-scsi@vger.kernel.org>
> ---
>  drivers/scsi/Makefile   |  1 +
>  drivers/scsi/scsi_debugfs.c | 13 +
>  drivers/scsi/scsi_debugfs.h |  4 
>  drivers/scsi/scsi_lib.c |  4 
>  4 files changed, 22 insertions(+)
>  create mode 100644 drivers/scsi/scsi_debugfs.c
>  create mode 100644 drivers/scsi/scsi_debugfs.h
> 
> diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
> index fc2855565a51..93dbe58c47c8 100644
> --- a/drivers/scsi/Makefile
> +++ b/drivers/scsi/Makefile
> @@ -166,6 +166,7 @@ scsi_mod-y+= scsi_scan.o 
> scsi_sysfs.o scsi_devinfo.o
>  scsi_mod-$(CONFIG_SCSI_NETLINK)  += scsi_netlink.o
>  scsi_mod-$(CONFIG_SYSCTL)+= scsi_sysctl.o
>  scsi_mod-$(CONFIG_SCSI_PROC_FS)  += scsi_proc.o
> +scsi_mod-$(CONFIG_BLK_DEBUG_FS)  += scsi_debugfs.o
>  scsi_mod-y   += scsi_trace.o scsi_logging.o
>  scsi_mod-$(CONFIG_PM)+= scsi_pm.o
>  scsi_mod-$(CONFIG_SCSI_DH)   += scsi_dh.o
> diff --git a/drivers/scsi/scsi_debugfs.c b/drivers/scsi/scsi_debugfs.c
> new file mode 100644
> index ..f831c23fdee3
> --- /dev/null
> +++ b/drivers/scsi/scsi_debugfs.c
> @@ -0,0 +1,13 @@
> +#include 
> +#include 
> +#include 
> +#include "scsi_debugfs.h"
> +
> +void scsi_show_rq(struct seq_file *m, struct request *rq)
> +{
> + struct scsi_cmnd *cmd = container_of(scsi_req(rq), typeof(*cmd), req);
> + char buf[64];
> +
> + __scsi_format_command(buf, sizeof(buf), cmd->cmnd, cmd->cmd_len);
> + seq_printf(m, ", .cmd=%s", buf);
> +}
> diff --git a/drivers/scsi/scsi_debugfs.h b/drivers/scsi/scsi_debugfs.h
> new file mode 100644
> index ..951b043e82d0
> --- /dev/null
> +++ b/drivers/scsi/scsi_debugfs.h
> @@ -0,0 +1,4 @@
> +struct request;
> +struct seq_file;
> +
> +void scsi_show_rq(struct seq_file *m, struct request *rq);
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index abc391e00f7d..1c3e87d6c48f 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -34,6 +34,7 @@
>  
>  #include 
>  
> +#include "scsi_debugfs.h"
>  #include "scsi_priv.h"
>  #include "scsi_logging.h"
>  
> @@ -2157,6 +2158,9 @@ static const struct blk_mq_ops scsi_mq_ops = {
>   .queue_rq   = scsi_queue_rq,
>   .complete   = scsi_softirq_done,
>   .timeout= scsi_timeout,
> +#ifdef CONFIG_BLK_DEBUG_FS
> + .show_rq= scsi_show_rq,
> +#endif
>   .init_request   = scsi_init_request,
>   .exit_request   = scsi_exit_request,
>   .map_queues = scsi_map_queues,
> -- 
> 2.12.2
>

Re: [PATCH v4 10/10] scsi: Implement blk_mq_ops.show_rq()

2017-04-24 Thread Omar Sandoval

On Mon, Apr 24, 2017 at 07:19:50PM -0400, Martin K. Petersen wrote:
> 
> Bart,
> 
> > SCSI tracing has to be enabled before a test is started, produces a
> > huge amount of data, and deriving state information from a huge trace
> > is far from easy. The information in debugfs provides an easy to read
> > overview of the current state without having to analyze megabytes of
> > traces, without introducing any slowdown and without having to enable
> > any tracing mechanism from beforehand.
> 
> Fair enough. Just seems like there's an obvious overlap in plumbing.
> Don't know if that can be leveraged instead of introducing something
> completely new?

The debugfs infrastructure is already there, and it already supports
showing requests. Bart is just exposing the SCSI information.

Re: [RFC PATCH 0/2] block,scsi: support host-wide tagset

2017-04-04 Thread Omar Sandoval

On Tue, Apr 04, 2017 at 02:07:43PM +0200, Hannes Reinecke wrote:
> Hi all,
> 
> as discussed recently most existing HBAs have a host-wide tagset which
> does not map easily onto the per-queue tagset model of block mq.
> This patchset implements a flag BLK_MQ_F_GLOBAL_TAGS for block-mq, which
> enables the use of a shared tagset for all hardware queues.
> The second patch adds a flag 'host_tagset' to the SCSI host template,
> which allows drivers to enable the use of the global tagset.
> 
> This patchset probably has some performance implications as
> there is a quite high probability of cache-bouncing when allocating
> tags. Also I'm not quite sure if the implemented tagset sharing
> is the correct way to handle things.
> So this can be considered an RFC.
> 
> As usual, comments and reviews are welcome.

Hi, Hannes,

blk-mq already supports a shared tagset, and scsi-mq already uses that.
When we initialize a request queue, we add it to a tagset with
blk_mq_add_queue_set(), where we automatically mark the tagset as shared
if there is more than one queue using it. What does this do that
BLK_MQ_F_TAG_SHARED doesn't cover?

Re: [PATCH 0/4] block: Fixes for bdi handling

2017-03-08 Thread Omar Sandoval

On Wed, Mar 08, 2017 at 08:50:51AM -0800, Omar Sandoval wrote:
> On Wed, Mar 08, 2017 at 05:48:30PM +0100, Jan Kara wrote:
> > Hi!
> > 
> > patches in this series fix the most urgent bugs that were introduced by 
> > commit
> > 165a5e22fafb "block: Move bdi_unregister() to del_gendisk()" and by
> > 0dba1314d4f8 "scsi, block: fix duplicate bdi name registration crashes".
> > In fact before these commits we had a different set of problems in the
> > code but they were less visible :).
> > 
> > I'm still waiting for test confirmation from Omar and Arthur Marsh who 
> > reported
> > issues but I'm not able to hit any problem anymore in my testing.  I think 
> > it
> > would be nice to get the patches to rc2 so to speed up things I'm posting 
> > the
> > patches now so that review can happen in parallel with the testing.
> > 
> > Other BDI handling fixes I have in my queue can wait a bit more since they 
> > are
> > either theoretical or long-standing issues. So I'll repost them once these 
> > four
> > are sorted out.
> > 
> > Honza
> 
> Applying patches 1 and 2 indeed fixed my virtio-scsi problem. I'll apply
> the whole series and test that my sd/sr test cases still work.

Yup, everything is working here. For the series:

Tested-by: Omar Sandoval <osan...@fb.com>

Re: [PATCH 0/4] block: Fixes for bdi handling

2017-03-08 Thread Omar Sandoval

On Wed, Mar 08, 2017 at 05:48:30PM +0100, Jan Kara wrote:
> Hi!
> 
> patches in this series fix the most urgent bugs that were introduced by commit
> 165a5e22fafb "block: Move bdi_unregister() to del_gendisk()" and by
> 0dba1314d4f8 "scsi, block: fix duplicate bdi name registration crashes".
> In fact before these commits we had a different set of problems in the
> code but they were less visible :).
> 
> I'm still waiting for test confirmation from Omar and Arthur Marsh who 
> reported
> issues but I'm not able to hit any problem anymore in my testing.  I think it
> would be nice to get the patches to rc2 so to speed up things I'm posting the
> patches now so that review can happen in parallel with the testing.
> 
> Other BDI handling fixes I have in my queue can wait a bit more since they are
> either theoretical or long-standing issues. So I'll repost them once these 
> four
> are sorted out.
> 
>   Honza

Applying patches 1 and 2 indeed fixed my virtio-scsi problem. I'll apply
the whole series and test that my sd/sr test cases still work.

Re: [PATCH] mpt3sas: Avoid sleeping in interrupt context

2017-03-01 Thread Omar Sandoval

On Wed, Mar 01, 2017 at 05:09:48PM +, Bart Van Assche wrote:
> On Wed, 2017-03-01 at 09:04 -0800, Omar Sandoval wrote:
> > Thanks, Bart, you can add
> > 
> > Tested-by: Omar Sandoval <osan...@fb.com>
> 
> Hello Omar,
> 
> Have you been able to test both code paths - scsi-mq enabled and scsi-mq
> disabled?
> 
> Thanks,
> 
> Bart.

Just tried on blk-mq and I didn't get the splat, seems to be fine.

Re: [PATCH] mpt3sas: Avoid sleeping in interrupt context

2017-03-01 Thread Omar Sandoval

On Wed, Mar 01, 2017 at 09:00:36AM -0800, Bart Van Assche wrote:
> Commit 669f044170d8 ("scsi: srp_transport: Move queuecommand() wait
> code to SCSI core") can make scsi_internal_device_block() sleep.
> However, the mpt3sas driver can call this function from an interrupt
> handler. Hence add a second argument to scsi_internal_device_block()
> that restores the old behavior of this function for the mpt3sas
> handler.
> 
> The call chain that triggered an "IRQ handler enabled interrupts"
> complaint is as follows:
> 
> _base_interrupt()
> -> _base_async_event()
>-> mpt3sas_scsih_event_callback()
>   -> _scsih_check_topo_delete_events()
>  -> _scsih_block_io_to_children_attached_directly()
> -> _scsih_block_io_device()
>-> _scsih_internal_device_block()
>   -> scsi_internal_device_block()
> 
> Reported-by: Omar Sandoval <osan...@osandov.com>
> Signed-off-by: Bart Van Assche <bart.vanass...@sandisk.com>
> Cc: Omar Sandoval <osan...@osandov.com>
> Cc: Hannes Reinecke <h...@suse.com>
> Cc: Sagi Grimberg <s...@grimberg.me>
> Cc: Christoph Hellwig <h...@lst.de>
> Cc: Sathya Prakash <sathya.prak...@broadcom.com>
> Cc: Chaitra P B <chaitra.basa...@broadcom.com>
> Cc: Suganath Prabu Subramani <suganath-prabu.subram...@broadcom.com>
> Cc: Sreekanth Reddy <sreekanth.re...@broadcom.com>
> Cc: <sta...@vger.kernel.org> # v4.10+

Thanks, Bart, you can add

Tested-by: Omar Sandoval <osan...@fb.com>

Re: mpt3sas sleep from atomic context on v4.10

2017-02-28 Thread Omar Sandoval

On Wed, Mar 01, 2017 at 01:07:12AM +, Bart Van Assche wrote:
> On Tue, 2017-02-28 at 16:25 -0800, Omar Sandoval wrote:
> > I'm seeing this while testing on Linus' current master:
> > 
> > [  427.814466] WARNING: CPU: 0 PID: 0 at kernel/irq/handle.c:149 
> > __handle_irq_event_percpu+0x187/0x190
> > [  427.832552] irq 116 handler _base_interrupt+0x0/0x9e0 [mpt3sas] enabled 
> > interrupts
> > 
> > I tracked it down to commit 669f044170d8 ("scsi: srp_transport: Move
> > queuecommand() wait code to SCSI core"). That commit made it so
> > scsi_internal_device_block() can sleep, but mpt3sas calls this from an
> > interrupt handler:
> > 
> > _base_interrupt
> > -> _base_async_event
> >-> mpt3sas_scsih_event_callback
> >   -> _scsih_check_topo_delete_events
> >  -> _scsih_block_io_to_children_attached_directly
> > -> _scsih_block_io_device
> >-> _scsih_internal_device_block
> >   -> scsi_internal_device_block
> > 
> > This change was made in 4.10. Bart, can you take a look?
> 
> How about the (entirely untested) patch below?
> 
> ---
>  drivers/scsi/mpt3sas/mpt3sas_base.h  |  3 ---
>  drivers/scsi/mpt3sas/mpt3sas_scsih.c |  4 ++--
>  drivers/scsi/scsi_lib.c  | 32 ++--
>  drivers/scsi/scsi_priv.h |  3 ---
>  include/scsi/scsi_device.h   |  4 
>  5 files changed, 24 insertions(+), 22 deletions(-)

[snip]

> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 3e32dc954c3c..77851697f130 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -2945,6 +2945,8 @@ EXPORT_SYMBOL(scsi_target_resume);
>  /**
>   * scsi_internal_device_block - internal function to put a device 
> temporarily into the SDEV_BLOCK state
>   * @sdev:device to block
> + * @wait:   Whether or not to wait until ongoing .queuecommand() /
> + *   .queue_rq() calls have finished.
>   *
>   * Block request made by scsi lld's to temporarily stop all
>   * scsi commands on the specified device. May sleep.
> @@ -2962,7 +2964,7 @@ EXPORT_SYMBOL(scsi_target_resume);
>   * remove the rport mutex lock and unlock calls from srp_queuecommand().
>   */
>  int
> -scsi_internal_device_block(struct scsi_device *sdev)
> +scsi_internal_device_block(struct scsi_device *sdev, bool wait)
>  {
>   struct request_queue *q = sdev->request_queue;
>   unsigned long flags;
> @@ -2976,18 +2978,20 @@ scsi_internal_device_block(struct scsi_device *sdev)
>   return err;
>   }
>  
> - /* 
> -  * The device has transitioned to SDEV_BLOCK.  Stop the
> -  * block layer from calling the midlayer with this device's
> -  * request queue. 
> -  */
> - if (q->mq_ops) {
> - blk_mq_quiesce_queue(q);
> - } else {
> - spin_lock_irqsave(q->queue_lock, flags);
> - blk_stop_queue(q);
> - spin_unlock_irqrestore(q->queue_lock, flags);
> - scsi_wait_for_queuecommand(sdev);
> + if (wait) {
> + /*
> +  * The device has transitioned to SDEV_BLOCK.  Stop the
> +  * block layer from calling the midlayer with this device's
> +  * request queue.
> +  */
> + if (q->mq_ops) {
> + blk_mq_quiesce_queue(q);
> + } else {
> + spin_lock_irqsave(q->queue_lock, flags);
> + blk_stop_queue(q);
> + spin_unlock_irqrestore(q->queue_lock, flags);
> + scsi_wait_for_queuecommand(sdev);
> + }
>   }

I think here, we want this instead:

@@ -2987,7 +2989,8 @@ scsi_internal_device_block(struct scsi_device *sdev)
spin_lock_irqsave(q->queue_lock, flags);
blk_stop_queue(q);
spin_unlock_irqrestore(q->queue_lock, flags);
-   scsi_wait_for_queuecommand(sdev);
+   if (wait)
+   scsi_wait_for_queuecommand(sdev);
}

return 0;

That fixes the warnings for me.

mpt3sas sleep from atomic context on v4.10

2017-02-28 Thread Omar Sandoval

I'm seeing this while testing on Linus' current master:

[  427.814466] WARNING: CPU: 0 PID: 0 at kernel/irq/handle.c:149 
__handle_irq_event_percpu+0x187/0x190
[  427.832552] irq 116 handler _base_interrupt+0x0/0x9e0 [mpt3sas] enabled 
interrupts

I tracked it down to commit 669f044170d8 ("scsi: srp_transport: Move
queuecommand() wait code to SCSI core"). That commit made it so
scsi_internal_device_block() can sleep, but mpt3sas calls this from an
interrupt handler:

_base_interrupt
-> _base_async_event
   -> mpt3sas_scsih_event_callback
  -> _scsih_check_topo_delete_events
 -> _scsih_block_io_to_children_attached_directly
-> _scsih_block_io_device
   -> _scsih_internal_device_block
  -> scsi_internal_device_block

This change was made in 4.10. Bart, can you take a look?

Thanks.

Re: Manual driver binding and unbinding broken for SCSI

2017-02-19 Thread Omar Sandoval

On Fri, Feb 17, 2017 at 04:43:56PM -0800, James Bottomley wrote:
> This seems to be related to a 0day test we got on the block tree,
> details here:
> 
> http://marc.info/?t=14862406881
> 
> I root caused the above to something not being released when it should
> be, so it looks like you have the same problem.  It seems to be a
> recent commit in the block tree, so could you bisect it since you have
> a nice reproducer?

These appear to actually be two separate issues.

The unbind followed by bind crash only happens with scsi-mq. It reproes
since at least 4.0.

The unbind followed by a new device coming up crash happens both with
and without scsi-mq. The earliest version I was able to check for that
was 4.6, which did reproduce.

I'll see if I can get some more info on these two issues separately.

Manual driver binding and unbinding broken for SCSI

2017-02-17 Thread Omar Sandoval

Hi, everyone,

As per $SUBJECT, I can cause a crash on v4.10-rc8, Jens' block/for-next,
and Jan's bdi branch [1] by doing this:

# lsscsi
[0:0:0:0]diskQEMU QEMU HARDDISK2.5+  /dev/sda
# echo 0:0:0:0 > /sys/bus/scsi/drivers/sd/unbind
# echo 0:0:0:0 > /sys/bus/scsi/drivers/sd/bind

The resulting trace looks like this:

[   19.347924] kobject (8800791ea0b8): tried to init an initialized object, 
something is seriously wrong.
[   19.349781] CPU: 1 PID: 84 Comm: kworker/u8:1 Not tainted 
4.10.0-rc7-00210-g53f39eeaa263 #34
[   19.350686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.1-20161122_114906-anatol 04/01/2014
[   19.350920] Workqueue: events_unbound async_run_entry_fn
[   19.350920] Call Trace:
[   19.350920]  dump_stack+0x63/0x83
[   19.350920]  kobject_init+0x77/0x90
[   19.350920]  blk_mq_register_dev+0x40/0x130
[   19.350920]  blk_register_queue+0xb6/0x190
[   19.350920]  device_add_disk+0x1ec/0x4b0
[   19.350920]  sd_probe_async+0x10d/0x1c0 [sd_mod]
[   19.350920]  async_run_entry_fn+0x48/0x150
[   19.350920]  process_one_work+0x1d0/0x480
[   19.350920]  worker_thread+0x48/0x4e0
[   19.350920]  kthread+0x101/0x140
[   19.350920]  ? process_one_work+0x480/0x480
[   19.350920]  ? kthread_create_on_node+0x60/0x60
[   19.350920]  ret_from_fork+0x2c/0x40

Additionally, on v4.10-rc8, but not on block/for-next or Jan's branch,
doing this:

# echo 0:0:0:0 > /sys/bus/scsi/drivers/sd/unbind
# modprobe scsi_debug

Causes this trace:

[   18.876096] [ cut here ]
[   18.877057] WARNING: CPU: 1 PID: 90 at fs/sysfs/dir.c:31 
sysfs_warn_dup+0x62/0x80
[   18.878270] sysfs: cannot create duplicate filename 
'/devices/virtual/bdi/8:0'
[   18.879435] Modules linked in: scsi_debug btrfs xor raid6_pq sd_mod 
virtio_scsi scsi_mod nvme nvme_core virtio_net
[   18.881118] CPU: 1 PID: 90 Comm: kworker/u8:2 Not tainted 4.10.0-rc8 #34
[   18.882114] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.1-20161122_114906-anatol 04/01/2014
[   18.883872] Workqueue: events_unbound async_run_entry_fn
[   18.884408] Call Trace:
[   18.884408]  dump_stack+0x63/0x83
[   18.884408]  __warn+0xcb/0xf0
[   18.884408]  warn_slowpath_fmt+0x5f/0x80
[   18.884408]  ? kernfs_path_from_node+0x4f/0x60
[   18.884408]  sysfs_warn_dup+0x62/0x80
[   18.884408]  sysfs_create_dir_ns+0x77/0x90
[   18.884408]  kobject_add_internal+0xbe/0x350
[   18.884408]  kobject_add+0x75/0xd0
[   18.884408]  device_add+0x121/0x680
[   18.884408]  device_create_groups_vargs+0xe0/0xf0
[   18.884408]  device_create_vargs+0x1c/0x20
[   18.884408]  bdi_register+0x90/0x1b0
[   18.884408]  ? sd_revalidate_disk+0x34a/0x1d00 [sd_mod]
[   18.884408]  bdi_register_owner+0x36/0x60
[   18.884408]  device_add_disk+0x165/0x4a0
[   18.884408]  ? update_autosuspend+0x51/0x60
[   18.884408]  ? __pm_runtime_use_autosuspend+0x5c/0x70
[   18.884408]  sd_probe_async+0x10d/0x1c0 [sd_mod]
[   18.884408]  async_run_entry_fn+0x4a/0x170
[   18.884408]  process_one_work+0x165/0x430
[   18.884408]  worker_thread+0x4e/0x490
[   18.884408]  kthread+0x101/0x140
[   18.884408]  ? process_one_work+0x430/0x430
[   18.884408]  ? kthread_create_on_node+0x60/0x60
[   18.884408]  ret_from_fork+0x2c/0x40
[   18.913090] ---[ end trace f43b051485c2a749 ]---

On all three kernels, it looks like the bdi sysfs entry hangs around
after the block device has already been removed:

┌[root@silver ~]
└# lsblk /dev/sda
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda8:00  16G  0 disk
┌[root@silver ~]
└# ls -al /sys/devices/virtual/bdi
total 0
drwxr-xr-x  6 root root 0 Feb 17 16:19 .
drwxr-xr-x 13 root root 0 Feb 17 16:19 ..
drwxr-xr-x  3 root root 0 Feb 17 16:19 254:0
drwxr-xr-x  3 root root 0 Feb 17 16:19 259:0
drwxr-xr-x  3 root root 0 Feb 17 16:19 8:0
drwxr-xr-x  3 root root 0 Feb 17 16:19 9p-1
┌[root@silver ~]
└# echo 0:0:0:0 > /sys/bus/scsi/drivers/sd/unbind
┌[root@silver ~]
└# ls -al /sys/devices/virtual/bdi
total 0
drwxr-xr-x  6 root root 0 Feb 17 16:19 .
drwxr-xr-x 13 root root 0 Feb 17 16:19 ..
drwxr-xr-x  3 root root 0 Feb 17 16:19 254:0
drwxr-xr-x  3 root root 0 Feb 17 16:19 259:0
drwxr-xr-x  3 root root 0 Feb 17 16:19 8:0
drwxr-xr-x  3 root root 0 Feb 17 16:19 9p-1
┌[root@silver ~]
└# lsblk /dev/sda
lsblk: /dev/sda: not a block device

Any ideas here?

1: https://git.kernel.org/cgit/linux/kernel/git/jack/linux-fs.git/tree/?h=bdi

[PATCH 2/2] scsi_lib: untangle 0 and BLK_MQ_RQ_QUEUE_OK

2016-11-15 Thread Omar Sandoval

From: Omar Sandoval <osan...@fb.com>

Let's not depend on any of the BLK_MQ_RQ_QUEUE_* constants having
specific values. No functional change.

Signed-off-by: Omar Sandoval <osan...@fb.com>
---
 drivers/scsi/scsi_lib.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 2e35132..47a5c87 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1810,7 +1810,7 @@ static inline int prep_to_mq(int ret)
 {
switch (ret) {
case BLKPREP_OK:
-   return 0;
+   return BLK_MQ_RQ_QUEUE_OK;
case BLKPREP_DEFER:
return BLK_MQ_RQ_QUEUE_BUSY;
default:
@@ -1897,7 +1897,7 @@ static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
int reason;
 
ret = prep_to_mq(scsi_prep_state_check(sdev, req));
-   if (ret)
+   if (ret != BLK_MQ_RQ_QUEUE_OK)
goto out;
 
ret = BLK_MQ_RQ_QUEUE_BUSY;
@@ -1914,7 +1914,7 @@ static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 
if (!(req->rq_flags & RQF_DONTPREP)) {
ret = prep_to_mq(scsi_mq_prep_fn(req));
-   if (ret)
+   if (ret != BLK_MQ_RQ_QUEUE_OK)
goto out_dec_host_busy;
req->rq_flags |= RQF_DONTPREP;
} else {
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/2] nvme: untangle 0 and BLK_MQ_RQ_QUEUE_OK

2016-11-15 Thread Omar Sandoval

From: Omar Sandoval <osan...@fb.com>

Let's not depend on any of the BLK_MQ_RQ_QUEUE_* constants having
specific values. No functional change.

Signed-off-by: Omar Sandoval <osan...@fb.com>
---
 drivers/nvme/host/core.c   | 4 ++--
 drivers/nvme/host/pci.c| 8 
 drivers/nvme/host/rdma.c   | 2 +-
 drivers/nvme/target/loop.c | 6 +++---
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 53584d2..e54bb10 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -269,7 +269,7 @@ static inline int nvme_setup_discard(struct nvme_ns *ns, 
struct request *req,
 */
req->__data_len = nr_bytes;
 
-   return 0;
+   return BLK_MQ_RQ_QUEUE_OK;
 }
 
 static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
@@ -317,7 +317,7 @@ static inline void nvme_setup_rw(struct nvme_ns *ns, struct 
request *req,
 int nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmd)
 {
-   int ret = 0;
+   int ret = BLK_MQ_RQ_QUEUE_OK;
 
if (req->cmd_type == REQ_TYPE_DRV_PRIV)
memcpy(cmd, nvme_req(req)->cmd, sizeof(*cmd));
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 51d13d5..d58f8e4 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -328,7 +328,7 @@ static int nvme_init_iod(struct request *rq, unsigned size,
rq->retries = 0;
rq->rq_flags |= RQF_DONTPREP;
}
-   return 0;
+   return BLK_MQ_RQ_QUEUE_OK;
 }
 
 static void nvme_free_iod(struct nvme_dev *dev, struct request *req)
@@ -598,17 +598,17 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 
map_len = nvme_map_len(req);
ret = nvme_init_iod(req, map_len, dev);
-   if (ret)
+   if (ret != BLK_MQ_RQ_QUEUE_OK)
return ret;
 
ret = nvme_setup_cmd(ns, req, );
-   if (ret)
+   if (ret != BLK_MQ_RQ_QUEUE_OK)
goto out;
 
if (req->nr_phys_segments)
ret = nvme_map_data(dev, req, map_len, );
 
-   if (ret)
+   if (ret != BLK_MQ_RQ_QUEUE_OK)
goto out;
 
cmnd.common.command_id = req->tag;
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index c4700ef..ff1b34060 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1395,7 +1395,7 @@ static int nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
sizeof(struct nvme_command), DMA_TO_DEVICE);
 
ret = nvme_setup_cmd(ns, rq, c);
-   if (ret)
+   if (ret != BLK_MQ_RQ_QUEUE_OK)
return ret;
 
c->common.command_id = rq->tag;
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 26aa3a5..be56d05 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -169,7 +169,7 @@ static int nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
int ret;
 
ret = nvme_setup_cmd(ns, req, >cmd);
-   if (ret)
+   if (ret != BLK_MQ_RQ_QUEUE_OK)
return ret;
 
iod->cmd.common.flags |= NVME_CMD_SGL_METABUF;
@@ -179,7 +179,7 @@ static int nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
nvme_cleanup_cmd(req);
blk_mq_start_request(req);
nvme_loop_queue_response(>req);
-   return 0;
+   return BLK_MQ_RQ_QUEUE_OK;
}
 
if (blk_rq_bytes(req)) {
@@ -198,7 +198,7 @@ static int nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
blk_mq_start_request(req);
 
schedule_work(>work);
-   return 0;
+   return BLK_MQ_RQ_QUEUE_OK;
 }
 
 static void nvme_loop_submit_async_event(struct nvme_ctrl *arg, int aer_idx)
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] blk-mq drivers: untangle 0 and BLK_MQ_RQ_QUEUE_OK

2016-11-14 Thread Omar Sandoval

From: Omar Sandoval <osan...@fb.com>

Let's not depend on any of the BLK_MQ_RQ_QUEUE_* constants having
specific values. No functional change.

Signed-off-by: Omar Sandoval <osan...@fb.com>
---
Hi, Jens,

Some more trivial cleanup, feel free to apply or not if it's too intrusive.

 drivers/nvme/host/core.c   | 4 ++--
 drivers/nvme/host/pci.c| 8 
 drivers/nvme/host/rdma.c   | 2 +-
 drivers/nvme/target/loop.c | 6 +++---
 drivers/scsi/scsi_lib.c| 6 +++---
 5 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 79e679d..e7f9ef5 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -276,7 +276,7 @@ static inline int nvme_setup_discard(struct nvme_ns *ns, 
struct request *req,
 */
req->__data_len = nr_bytes;
 
-   return 0;
+   return BLK_MQ_RQ_QUEUE_OK;
 }
 
 static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
@@ -324,7 +324,7 @@ static inline void nvme_setup_rw(struct nvme_ns *ns, struct 
request *req,
 int nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmd)
 {
-   int ret = 0;
+   int ret = BLK_MQ_RQ_QUEUE_OK;
 
if (req->cmd_type == REQ_TYPE_DRV_PRIV)
memcpy(cmd, req->cmd, sizeof(*cmd));
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 0248d0e..4afa2f7 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -328,7 +328,7 @@ static int nvme_init_iod(struct request *rq, unsigned size,
rq->retries = 0;
rq->cmd_flags |= REQ_DONTPREP;
}
-   return 0;
+   return BLK_MQ_RQ_QUEUE_OK;
 }
 
 static void nvme_free_iod(struct nvme_dev *dev, struct request *req)
@@ -598,17 +598,17 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 
map_len = nvme_map_len(req);
ret = nvme_init_iod(req, map_len, dev);
-   if (ret)
+   if (ret != BLK_MQ_RQ_QUEUE_OK)
return ret;
 
ret = nvme_setup_cmd(ns, req, );
-   if (ret)
+   if (ret != BLK_MQ_RQ_QUEUE_OK)
goto out;
 
if (req->nr_phys_segments)
ret = nvme_map_data(dev, req, map_len, );
 
-   if (ret)
+   if (ret != BLK_MQ_RQ_QUEUE_OK)
goto out;
 
cmnd.common.command_id = req->tag;
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 5a83881..125c43b 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1399,7 +1399,7 @@ static int nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
sizeof(struct nvme_command), DMA_TO_DEVICE);
 
ret = nvme_setup_cmd(ns, rq, c);
-   if (ret)
+   if (ret != BLK_MQ_RQ_QUEUE_OK)
return ret;
 
c->common.command_id = rq->tag;
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index d5df77d..01035fb 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -168,7 +168,7 @@ static int nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
int ret;
 
ret = nvme_setup_cmd(ns, req, >cmd);
-   if (ret)
+   if (ret != BLK_MQ_RQ_QUEUE_OK)
return ret;
 
iod->cmd.common.flags |= NVME_CMD_SGL_METABUF;
@@ -178,7 +178,7 @@ static int nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
nvme_cleanup_cmd(req);
blk_mq_start_request(req);
nvme_loop_queue_response(>req);
-   return 0;
+   return BLK_MQ_RQ_QUEUE_OK;
}
 
if (blk_rq_bytes(req)) {
@@ -197,7 +197,7 @@ static int nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
blk_mq_start_request(req);
 
schedule_work(>work);
-   return 0;
+   return BLK_MQ_RQ_QUEUE_OK;
 }
 
 static void nvme_loop_submit_async_event(struct nvme_ctrl *arg, int aer_idx)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 2cca9cf..13887c0 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1801,7 +1801,7 @@ static inline int prep_to_mq(int ret)
 {
switch (ret) {
case BLKPREP_OK:
-   return 0;
+   return BLK_MQ_RQ_QUEUE_OK;
case BLKPREP_DEFER:
return BLK_MQ_RQ_QUEUE_BUSY;
default:
@@ -1888,7 +1888,7 @@ static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
int reason;
 
ret = prep_to_mq(scsi_prep_state_check(sdev, req));
-   if (ret)
+   if (ret != BLK_MQ_RQ_QUEUE_OK)
goto out;
 
ret = BLK_MQ_RQ_QUEUE_BUSY;
@@ -1905,7 +1905,7 @@ static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 
if (!(req->cmd_flags & REQ_DONTPREP)) {
ret = prep_to_mq(scsi_mq_prep_fn(req));
-   if (ret)
+   if (ret != BLK_MQ_RQ_QUEUE_OK)
goto out_dec_host_busy;
req->cmd_flags |=

Re: Device or HBA level QD throttling creates randomness in sequetial workload

2016-10-26 Thread Omar Sandoval

On Tue, Oct 25, 2016 at 12:24:24AM +0530, Kashyap Desai wrote:
> > -Original Message-
> > From: Omar Sandoval [mailto:osan...@osandov.com]
> > Sent: Monday, October 24, 2016 9:11 PM
> > To: Kashyap Desai
> > Cc: linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org; linux-
> > bl...@vger.kernel.org; ax...@kernel.dk; Christoph Hellwig;
> > paolo.vale...@linaro.org
> > Subject: Re: Device or HBA level QD throttling creates randomness in
> sequetial
> > workload
> >
> > On Mon, Oct 24, 2016 at 06:35:01PM +0530, Kashyap Desai wrote:
> > > >
> > > > On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote:
> > > > > Hi -
> > > > >
> > > > > I found below conversation and it is on the same line as I wanted
> > > > > some input from mailing list.
> > > > >
> > > > > http://marc.info/?l=linux-kernel=147569860526197=2
> > > > >
> > > > > I can do testing on any WIP item as Omar mentioned in above
> > > discussion.
> > > > > https://github.com/osandov/linux/tree/blk-mq-iosched
> > >
> > > I tried build kernel using this repo, but looks like it is not allowed
> > > to reboot due to some changes in  layer.
> >
> > Did you build the most up-to-date version of that branch? I've been
> force
> > pushing to it, so the commit id that you built would be useful.
> > What boot failure are you seeing?
> 
> Below  is latest commit on repo.
> commit b077a9a5149f17ccdaa86bc6346fa256e3c1feda
> Author: Omar Sandoval <osan...@fb.com>
> Date:   Tue Sep 20 11:20:03 2016 -0700
> 
> [WIP] blk-mq: limit bio queue depth
> 
> I have latest repo from 4.9/scsi-next maintained by Martin which boots
> fine.  Only Delta is  " CONFIG_SBITMAP" is enabled in WIP blk-mq-iosched
> branch. I could not see any meaningful data on boot hang, so going to try
> one more time tomorrow.

The blk-mq-bio-queueing branch has the latest work there separated out.
Not sure that it'll help in this case.

> >
> > > >
> > > > Are you using blk-mq for this disk? If not, then the work there
> > > > won't
> > > affect you.
> > >
> > > YES. I am using blk-mq for my test. I also confirm if use_blk_mq is
> > > disable, Sequential work load issue is not seen and  scheduling
> > > works well.
> >
> > Ah, okay, perfect. Can you send the fio job file you're using? Hard to
> tell exactly
> > what's going on without the details. A sequential workload with just one
> > submitter is about as easy as it gets, so this _should_ be behaving
> nicely.
> 
> 
> 
> ; setup numa policy for each thread
> ; 'numactl --show' to determine the maximum numa nodes
> [global]
> ioengine=libaio
> buffered=0
> rw=write
> bssplit=4K/100
> iodepth=256
> numjobs=1
> direct=1
> runtime=60s
> allow_mounted_write=0
> 
> [job1]
> filename=/dev/sdd
> ..
> [job24]
> filename=/dev/sdaa

Okay, so you have one high-iodepth job per disk, got it.

> When I tune /sys/module/scsi_mod/parameters/use_blk_mq = 1, below is a
> ioscheduler detail. (It is in blk-mq mode. )
> /sys/devices/pci:00/:00:02.0/:02:00.0/host10/target10:2:13/10:
> 2:13:0/block/sdq/queue/scheduler:none
> 
> When I have set /sys/module/scsi_mod/parameters/use_blk_mq = 0,
> ioscheduler picked by SML is .
> /sys/devices/pci:00/:00:02.0/:02:00.0/host10/target10:2:13/10:
> 2:13:0/block/sdq/queue/scheduler:noop deadline [cfq]
> 
> I see in blk-mq performance is very low for Sequential Write work load and
> I confirm that blk-mq convert Sequential work load into random stream due
> to  io-scheduler change in blk-mq vs legacy block layer.

Since this happens when the fio iodepth exceeds the per-device QD, my
best guess is that this is that requests are getting requeued and
scrambled when that happens. Do you have the blktrace lying around?

> > > > > Is there any workaround/alternative in latest upstream kernel, if
> > > > > user wants to see limited penalty  for Sequential Work load on HDD
> ?
> > > > >
> > > > > ` Kashyap
> > > > >
> >
> > P.S., your emails are being marked as spam by Gmail. Actually, Gmail
> seems to
> > mark just about everything I get from Broadcom as spam due to failed
> DMARC.
> >
> > --
> > Omar

-- 
Omar
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Device or HBA level QD throttling creates randomness in sequetial workload

2016-10-24 Thread Omar Sandoval

On Mon, Oct 24, 2016 at 06:35:01PM +0530, Kashyap Desai wrote:
> >
> > On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote:
> > > Hi -
> > >
> > > I found below conversation and it is on the same line as I wanted some
> > > input from mailing list.
> > >
> > > http://marc.info/?l=linux-kernel=147569860526197=2
> > >
> > > I can do testing on any WIP item as Omar mentioned in above
> discussion.
> > > https://github.com/osandov/linux/tree/blk-mq-iosched
> 
> I tried build kernel using this repo, but looks like it is not allowed to
> reboot due to some changes in  layer.

Did you build the most up-to-date version of that branch? I've been
force pushing to it, so the commit id that you built would be useful.
What boot failure are you seeing?

> >
> > Are you using blk-mq for this disk? If not, then the work there won't
> affect you.
> 
> YES. I am using blk-mq for my test. I also confirm if use_blk_mq is
> disable, Sequential work load issue is not seen and  scheduling works
> well.

Ah, okay, perfect. Can you send the fio job file you're using? Hard to
tell exactly what's going on without the details. A sequential workload
with just one submitter is about as easy as it gets, so this _should_ be
behaving nicely.

> >
> > > Is there any workaround/alternative in latest upstream kernel, if user
> > > wants to see limited penalty  for Sequential Work load on HDD ?
> > >
> > > ` Kashyap
> > >

P.S., your emails are being marked as spam by Gmail. Actually, Gmail
seems to mark just about everything I get from Broadcom as spam due to
failed DMARC.

-- 
Omar
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Device or HBA level QD throttling creates randomness in sequetial workload

2016-10-21 Thread Omar Sandoval

On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote:
> Hi -
> 
> I found below conversation and it is on the same line as I wanted some
> input from mailing list.
> 
> http://marc.info/?l=linux-kernel=147569860526197=2
> 
> I can do testing on any WIP item as Omar mentioned in above discussion.
> https://github.com/osandov/linux/tree/blk-mq-iosched

Are you using blk-mq for this disk? If not, then the work there won't
affect you.

> Is there any workaround/alternative in latest upstream kernel, if user
> wants to see limited penalty  for Sequential Work load on HDD ?
> 
> ` Kashyap
> 
> > -Original Message-
> > From: Kashyap Desai [mailto:kashyap.de...@broadcom.com]
> > Sent: Thursday, October 20, 2016 3:39 PM
> > To: linux-scsi@vger.kernel.org
> > Subject: Device or HBA level QD throttling creates randomness in
> sequetial
> > workload
> >
> > [ Apologize, if you find more than one instance of my email.
> > Web based email client has some issue, so now trying git send mail.]
> >
> > Hi,
> >
> > I am doing some performance tuning in MR driver to understand how sdev
> > queue depth and hba queue depth play role in IO submission from above
> layer.
> > I have 24 JBOD connected to MR 12GB controller and I can see performance
> for
> > 4K Sequential work load as below.
> >
> > HBA QD for MR controller is 4065 and Per device QD is set to 32
> >
> > queue depth from  256 reports 300K IOPS queue depth from  128
> > reports 330K IOPS queue depth from  64 reports 360K IOPS queue
> depth
> > from  32 reports 510K IOPS
> >
> > In MR driver I added debug print and confirm that more IO come to driver
> as
> > random IO whenever I have  queue depth more than 32.
> >
> > I have debug using scsi logging level and blktrace as well. Below is
> snippet of
> > logs using scsi logging level.  In summary, if SML do flow control of IO
> due to
> > Device QD or HBA QD, IO coming to LLD is more random pattern.
> >
> > I see IO coming to driver is not sequential.
> >
> > [79546.912041] sd 18:2:21:0: [sdy] tag#854 CDB: Write(10) 2a 00 00 03 c0
> 3b 00
> > 00 01 00 [79546.912049] sd 18:2:21:0: [sdy] tag#855 CDB: Write(10) 2a 00
> 00 03
> > c0 3c 00 00 01 00 [79546.912053] sd 18:2:21:0: [sdy] tag#886 CDB:
> Write(10) 2a
> > 00 00 03 c0 5b 00 00 01 00
> >
> >  After LBA "00 03 c0 3c" next command is with LBA "00 03 c0 5b".
> > Two Sequence are overlapped due to sdev QD throttling.
> >
> > [79546.912056] sd 18:2:21:0: [sdy] tag#887 CDB: Write(10) 2a 00 00 03 c0
> 5c 00
> > 00 01 00 [79546.912250] sd 18:2:21:0: [sdy] tag#856 CDB: Write(10) 2a 00
> 00 03
> > c0 3d 00 00 01 00 [79546.912257] sd 18:2:21:0: [sdy] tag#888 CDB:
> Write(10) 2a
> > 00 00 03 c0 5d 00 00 01 00 [79546.912259] sd 18:2:21:0: [sdy] tag#857
> CDB:
> > Write(10) 2a 00 00 03 c0 3e 00 00 01 00 [79546.912268] sd 18:2:21:0:
> [sdy]
> > tag#858 CDB: Write(10) 2a 00 00 03 c0 3f 00 00 01 00
> >
> >  If scsi_request_fn() breaks due to unavailability of device queue (due
> to below
> > check), will there be any side defect as I observe ?
> > if (!scsi_dev_queue_ready(q, sdev))
> >  break;
> >
> > If I reduce HBA QD and make sure IO from above layer is throttled due to
> HBA
> > QD, there is a same impact.
> > MR driver use host wide shared tag map.
> >
> > Can someone help me if this can be tunable in LLD providing additional
> settings
> > or it is expected behavior ? Problem I am facing is, I am not able to
> figure out
> > optimal device queue depth for different configuration and work load.
> >
> > Thanks, Kashyap

-- 
Omar
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

40 matches

Mail list logo