Re: [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)

2017-10-09 Thread Ming Lei
On Mon, Oct 09, 2017 at 11:04:39PM +0800, Ming Lei wrote:
> Hi John,
> 
> On Mon, Oct 09, 2017 at 01:09:22PM +0100, John Garry wrote:
> > On 30/09/2017 11:27, Ming Lei wrote:
> > > Hi Jens,
> > > 
> > > In Red Hat internal storage tests wrt. the blk-mq scheduler, we
> > > found that I/O performance is much worse with mq-deadline, especially
> > > for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
> > > SRP...)
> > > 
> > > It turns out one big issue causes the performance regression: requests
> > > are still dequeued from the sw queue/scheduler queue even when the
> > > LLD's queue is busy, so I/O merging becomes quite difficult, and
> > > sequential I/O degrades a lot.
> > > 
> > > This issue became one of the main reasons for reverting default SCSI_MQ
> > > in V4.13.
> > > 
> > > The 1st patch uses direct issue in blk_mq_request_bypass_insert(),
> > > so that we can improve dm-mpath's performance in part 2, which will
> > > be posted out soon.
> > > 
> > > The next six patches improve this situation, and bring back
> > > some of the lost performance.
> > > 
> > > With this change, SCSI-MQ sequential I/O performance is
> > > improved greatly: Paolo reported that mq-deadline performance
> > > improved a lot[2] in his dbench test wrt. V2, and performance
> > > improvement on lpfc/qla2xxx was observed with V1.[1]
> > > 
> > > Please consider it for V4.15.
> > > 
> > > [1] http://marc.info/?l=linux-block&m=150151989915776&w=2
> > > [2] https://marc.info/?l=linux-block&m=150217980602843&w=2
> > > 
> > 
> > I tested this series for the SAS controller on HiSilicon hip07 platform as I
> > am interested in enabling MQ for this driver. Driver is
> > ./drivers/scsi/hisi_sas/.
> > 
> > So I found that performance is improved when enabling default SCSI_MQ
> > with this series vs baseline. However, it is still not as good as when
> > default SCSI_MQ is disabled.
> > 
> > Here are some figures I got with fio:
> > 4.14-rc2 without default SCSI_MQ
> > read, rw, write IOPS
> > 952K, 133K/133K, 800K
> > 
> > 4.14-rc2 with default SCSI_MQ
> > read, rw, write IOPS
> > 311K, 117K/117K, 320K
> > 
> > This series* without default SCSI_MQ
> > read, rw, write IOPS
> > 975K, 132K/132K, 790K
> > 
> > This series* with default SCSI_MQ
> > read, rw, write IOPS
> > 770K, 164K/164K, 594K
> 
> Thanks for testing this patchset!
> 
> Looks like there is a big improvement, but the gap compared with
> the legacy block path is not small either.
> 
> > 
> > Please note that hisi_sas driver does not enable mq by exposing multiple
> > queues to upper layer (even though it has multiple queues). I have been
> > playing with enabling it, but my performance is always worse...
> > 
> > * I'm using
> > https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V5.1,
> > as advised by Ming Lei.
> 
> Could you test on the following branch and see if it makes a
> difference?
> 
>   
> https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V6.1_test

Hi John,

Please test the following branch directly:

https://github.com/ming1/linux/tree/blk_mq_improve_scsi_mpath_perf_V6.2_test

The code is simplified and cleaned up a lot in V6.2, so only two extra
patches (the top 2) are needed against V6, which was posted yesterday.

Please test SCSI_MQ with mq-deadline, which should be the default
mq scheduler on your HiSilicon SAS.

-- 
Ming


[PATCH] scsi/aic7xxx: Convert timers to use timer_setup()

2017-10-09 Thread Kees Cook
stat_timer only ever assigns the same function and data, so consolidate
to using timer_setup(), adjust callback, drop everything else used
to pass things around, and remove needless typedefs.

reset_timer is unused; remove it.
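
For reference, the conversion pattern applied here is roughly the
following sketch (struct and function names are made up for
illustration; this is not the actual aic7xxx code):

#include <linux/timer.h>

struct my_softc {
	struct timer_list stat_timer;
	unsigned long stat_count;
};

static void my_stat_timer(struct timer_list *t)
{
	/* Recover the containing object from the timer_list pointer. */
	struct my_softc *sc = from_timer(sc, t, stat_timer);

	sc->stat_count++;
	/* Re-arm; callback and data no longer need to be passed again. */
	mod_timer(&sc->stat_timer, jiffies + HZ);
}

static void my_init(struct my_softc *sc)
{
	/* Replaces init_timer() plus manual ->function/->data assignment. */
	timer_setup(&sc->stat_timer, my_stat_timer, 0);
	mod_timer(&sc->stat_timer, jiffies + HZ);
}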

Cc: Hannes Reinecke 
Cc: "James E.J. Bottomley" 
Cc: "Martin K. Petersen" 
Cc: Thomas Gleixner 
Signed-off-by: Kees Cook 
---
This requires commit 686fef928bba ("timer: Prepare to change timer
callback argument type") in v4.14-rc3, but should be otherwise
stand-alone.
---
 drivers/scsi/aic7xxx/aic79xx.h  |  5 +
 drivers/scsi/aic7xxx/aic79xx_core.c | 29 -
 drivers/scsi/aic7xxx/aic79xx_osm.h  |  7 ---
 3 files changed, 9 insertions(+), 32 deletions(-)

diff --git a/drivers/scsi/aic7xxx/aic79xx.h b/drivers/scsi/aic7xxx/aic79xx.h
index d47b527b25dd..31f2bb9d7146 100644
--- a/drivers/scsi/aic7xxx/aic79xx.h
+++ b/drivers/scsi/aic7xxx/aic79xx.h
@@ -1046,8 +1046,6 @@ typedef enum {
 
 typedef uint8_t ahd_mode_state;
 
-typedef void ahd_callback_t (void *);
-
 struct ahd_completion
 {
	uint16_t	tag;
@@ -1122,8 +1120,7 @@ struct ahd_softc {
/*
 * Timer handles for timer driven callbacks.
 */
-   ahd_timer_t   reset_timer;
-   ahd_timer_t   stat_timer;
+   struct timer_list   stat_timer;
 
/*
 * Statistics.
diff --git a/drivers/scsi/aic7xxx/aic79xx_core.c 
b/drivers/scsi/aic7xxx/aic79xx_core.c
index 95d8f25cbcca..b560f396ee99 100644
--- a/drivers/scsi/aic7xxx/aic79xx_core.c
+++ b/drivers/scsi/aic7xxx/aic79xx_core.c
@@ -207,7 +207,7 @@ static void ahd_add_scb_to_free_list(struct 
ahd_softc *ahd,
 static u_int   ahd_rem_wscb(struct ahd_softc *ahd, u_int scbid,
 u_int prev, u_int next, u_int tid);
static void		ahd_reset_current_bus(struct ahd_softc *ahd);
-static ahd_callback_t	ahd_stat_timer;
+static void		ahd_stat_timer(struct timer_list *t);
 #ifdef AHD_DUMP_SEQ
 static voidahd_dumpseq(struct ahd_softc *ahd);
 #endif
@@ -6104,8 +6104,7 @@ ahd_alloc(void *platform_arg, char *name)
ahd->bugs = AHD_BUGNONE;
ahd->flags = AHD_SPCHK_ENB_A|AHD_RESET_BUS_A|AHD_TERM_ENB_A
   | AHD_EXTENDED_TRANS_A|AHD_STPWLEVEL_A;
-	ahd_timer_init(&ahd->reset_timer);
-	ahd_timer_init(&ahd->stat_timer);
+	timer_setup(&ahd->stat_timer, ahd_stat_timer, 0);
ahd->int_coalescing_timer = AHD_INT_COALESCING_TIMER_DEFAULT;
ahd->int_coalescing_maxcmds = AHD_INT_COALESCING_MAXCMDS_DEFAULT;
ahd->int_coalescing_mincmds = AHD_INT_COALESCING_MINCMDS_DEFAULT;
@@ -6235,8 +6234,7 @@ ahd_shutdown(void *arg)
/*
 * Stop periodic timer callbacks.
 */
-	ahd_timer_stop(&ahd->reset_timer);
-	ahd_timer_stop(&ahd->stat_timer);
+	del_timer_sync(&ahd->stat_timer);
 
/* This will reset most registers to 0, but not all */
ahd_reset(ahd, /*reinit*/FALSE);
@@ -7039,20 +7037,11 @@ static const char *termstat_strings[] = {
 };
 
/***************************** Timer Facilities *****************************/
-#define ahd_timer_init init_timer
-#define ahd_timer_stop del_timer_sync
-typedef void ahd_linux_callback_t (u_long);
-
 static void
-ahd_timer_reset(ahd_timer_t *timer, int usec, ahd_callback_t *func, void *arg)
+ahd_timer_reset(struct timer_list *timer, int usec)
 {
-   struct ahd_softc *ahd;
-
-   ahd = (struct ahd_softc *)arg;
del_timer(timer);
-   timer->data = (u_long)arg;
	timer->expires = jiffies + (usec * HZ)/1000000;
-   timer->function = (ahd_linux_callback_t*)func;
add_timer(timer);
 }
 
@@ -7279,8 +7268,7 @@ ahd_init(struct ahd_softc *ahd)
}
 init_done:
ahd_restart(ahd);
-	ahd_timer_reset(&ahd->stat_timer, AHD_STAT_UPDATE_US,
-			ahd_stat_timer, ahd);
+	ahd_timer_reset(&ahd->stat_timer, AHD_STAT_UPDATE_US);
return (0);
 }
 
@@ -8878,9 +8866,9 @@ ahd_reset_channel(struct ahd_softc *ahd, char channel, 
int initiate_reset)
 
/*************************** Statistics Processing ***************************/
 static void
-ahd_stat_timer(void *arg)
+ahd_stat_timer(struct timer_list *t)
 {
-   struct  ahd_softc *ahd = arg;
+   struct  ahd_softc *ahd = from_timer(ahd, t, stat_timer);
u_long  s;
int enint_coal;

@@ -8907,8 +8895,7 @@ ahd_stat_timer(void *arg)
ahd->cmdcmplt_bucket = (ahd->cmdcmplt_bucket+1) & (AHD_STAT_BUCKETS-1);
ahd->cmdcmplt_total -= ahd->cmdcmplt_counts[ahd->cmdcmplt_bucket];
ahd->cmdcmplt_counts[ahd->cmdcmplt_bucket] = 0;
-	ahd_timer_reset(&ahd->stat_timer, AHD_STAT_UPDATE_US,
-			ahd_stat_timer, ahd);
+	ahd_timer_reset(&ahd->stat_timer, AHD_STAT_UPDATE_US);
	ahd_unlock(ahd, &s);
 }
 
diff --git 

Re: [PATCH v7 0/9] Hello Jens,

2017-10-09 Thread Bart Van Assche
On Mon, 2017-10-09 at 16:13 -0700, Bart Van Assche wrote:
> [ ... ]

Just like for v6, the title of this series should have been:
"block, scsi, md: Improve suspend and resume"

[PATCH v7 4/9] block: Make q_usage_counter also track legacy requests

2017-10-09 Thread Bart Van Assche
From: Ming Lei 

This patch makes it possible to pause request allocation for
the legacy block layer by calling blk_mq_freeze_queue() and
blk_mq_unfreeze_queue().
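
As a rough illustration (not part of the patch), a queue freeze can then
pause legacy request allocation too, since blk_old_get_request() now
holds a q_usage_counter reference obtained via blk_queue_enter():

#include <linux/blk-mq.h>

static void pause_and_resume(struct request_queue *q)
{
	/* Waits until q_usage_counter drops to zero. */
	blk_mq_freeze_queue(q);

	/* No new requests, legacy or blk-mq, can be allocated here. */

	/* Re-enable request allocation. */
	blk_mq_unfreeze_queue(q);
}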

Signed-off-by: Ming Lei 
[ bvanassche: Combined two patches into one, edited a comment and made sure
  REQ_NOWAIT is handled properly in blk_old_get_request() ]
Signed-off-by: Bart Van Assche 
Reviewed-by: Christoph Hellwig 
Cc: Ming Lei 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-core.c | 12 
 block/blk-mq.c   | 10 ++
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 14f7674fa0b1..265a69511c47 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -610,6 +610,9 @@ void blk_set_queue_dying(struct request_queue *q)
}
spin_unlock_irq(q->queue_lock);
}
+
+   /* Make blk_queue_enter() reexamine the DYING flag. */
+	wake_up_all(&q->mq_freeze_wq);
 }
 EXPORT_SYMBOL_GPL(blk_set_queue_dying);
 
@@ -1395,16 +1398,22 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
   unsigned int op, gfp_t gfp_mask)
 {
struct request *rq;
+   int ret = 0;
 
WARN_ON_ONCE(q->mq_ops);
 
/* create ioc upfront */
create_io_context(gfp_mask, q->node);
 
+   ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) ||
+ (op & REQ_NOWAIT));
+   if (ret)
+   return ERR_PTR(ret);
spin_lock_irq(q->queue_lock);
rq = get_request(q, op, NULL, gfp_mask);
if (IS_ERR(rq)) {
spin_unlock_irq(q->queue_lock);
+   blk_queue_exit(q);
return rq;
}
 
@@ -1576,6 +1585,7 @@ void __blk_put_request(struct request_queue *q, struct 
request *req)
blk_free_request(rl, req);
freed_request(rl, sync, rq_flags);
blk_put_rl(rl);
+   blk_queue_exit(q);
}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
@@ -1857,8 +1867,10 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, 
struct bio *bio)
 * Grab a free request. This is might sleep but can not fail.
 * Returns with the queue unlocked.
 */
+   blk_queue_enter_live(q);
req = get_request(q, bio->bi_opf, bio, GFP_NOIO);
if (IS_ERR(req)) {
+   blk_queue_exit(q);
__wbt_done(q->rq_wb, wb_acct);
if (PTR_ERR(req) == -ENOMEM)
bio->bi_status = BLK_STS_RESOURCE;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 076cbab9c3e0..8766587ae8b8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -125,7 +125,8 @@ void blk_freeze_queue_start(struct request_queue *q)
	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
if (freeze_depth == 1) {
		percpu_ref_kill(&q->q_usage_counter);
-   blk_mq_run_hw_queues(q, false);
+   if (q->mq_ops)
+   blk_mq_run_hw_queues(q, false);
}
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
@@ -255,13 +256,6 @@ void blk_mq_wake_waiters(struct request_queue *q)
queue_for_each_hw_ctx(q, hctx, i)
if (blk_mq_hw_queue_mapped(hctx))
blk_mq_tag_wakeup_all(hctx->tags, true);
-
-   /*
-* If we are called because the queue has now been marked as
-* dying, we need to ensure that processes currently waiting on
-* the queue are notified as well.
-*/
-	wake_up_all(&q->mq_freeze_wq);
 }
 
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
-- 
2.14.2



[PATCH v7 2/9] md: Introduce md_stop_all_writes()

2017-10-09 Thread Bart Van Assche
Introduce md_stop_all_writes() because the next patch will add
a second caller for this function. This patch does not change
any functionality.

Signed-off-by: Bart Van Assche 
Cc: Shaohua Li 
Cc: linux-r...@vger.kernel.org
Cc: Ming Lei 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 drivers/md/md.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8933cafc212d..b99584e5d6b1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8937,8 +8937,7 @@ int rdev_clear_badblocks(struct md_rdev *rdev, sector_t 
s, int sectors,
 }
 EXPORT_SYMBOL_GPL(rdev_clear_badblocks);
 
-static int md_notify_reboot(struct notifier_block *this,
-   unsigned long code, void *x)
+static void md_stop_all_writes(void)
 {
struct list_head *tmp;
struct mddev *mddev;
@@ -8962,6 +8961,12 @@ static int md_notify_reboot(struct notifier_block *this,
 */
if (need_delay)
mdelay(1000*1);
+}
+
+static int md_notify_reboot(struct notifier_block *this,
+   unsigned long code, void *x)
+{
+   md_stop_all_writes();
 
return NOTIFY_DONE;
 }
-- 
2.14.2



[PATCH v7 5/9] block: Introduce blk_get_request_flags()

2017-10-09 Thread Bart Van Assche
A side effect of this patch is that the GFP mask that is passed to
several allocation functions in the legacy block layer is changed
from GFP_KERNEL into __GFP_DIRECT_RECLAIM.
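
As a rough usage sketch (the caller and error handling are illustrative,
not taken from the patch), a context that must not sleep would pass
BLK_MQ_REQ_NOWAIT instead of a GFP mask:

#include <linux/blkdev.h>
#include <linux/blk-mq.h>

static int try_alloc_nowait(struct request_queue *q)
{
	struct request *rq;

	rq = blk_get_request_flags(q, REQ_OP_DRV_IN, BLK_MQ_REQ_NOWAIT);
	if (IS_ERR(rq))
		return PTR_ERR(rq);	/* fails fast instead of sleeping */

	/* ... set up and issue the request ... */

	blk_put_request(rq);
	return 0;
}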

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-core.c   | 50 +++---
 include/linux/blkdev.h |  3 +++
 2 files changed, 38 insertions(+), 15 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 265a69511c47..03156d9ede94 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1157,7 +1157,7 @@ int blk_update_nr_requests(struct request_queue *q, 
unsigned int nr)
  * @rl: request list to allocate from
  * @op: operation and flags
  * @bio: bio to allocate request for (can be %NULL)
- * @gfp_mask: allocation mask
+ * @flags: BLK_MQ_REQ_* flags
  *
  * Get a free request from @q.  This function may fail under memory
  * pressure or if @q is dead.
@@ -1167,7 +1167,7 @@ int blk_update_nr_requests(struct request_queue *q, 
unsigned int nr)
  * Returns request pointer on success, with @q->queue_lock *not held*.
  */
 static struct request *__get_request(struct request_list *rl, unsigned int op,
-   struct bio *bio, gfp_t gfp_mask)
+struct bio *bio, unsigned int flags)
 {
struct request_queue *q = rl->q;
struct request *rq;
@@ -1176,6 +1176,8 @@ static struct request *__get_request(struct request_list 
*rl, unsigned int op,
struct io_cq *icq = NULL;
const bool is_sync = op_is_sync(op);
int may_queue;
+   gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? GFP_ATOMIC :
+__GFP_DIRECT_RECLAIM;
req_flags_t rq_flags = RQF_ALLOCED;
 
lockdep_assert_held(q->queue_lock);
@@ -1336,7 +1338,7 @@ static struct request *__get_request(struct request_list 
*rl, unsigned int op,
  * @q: request_queue to allocate request from
  * @op: operation and flags
  * @bio: bio to allocate request for (can be %NULL)
- * @gfp_mask: allocation mask
+ * @flags: BLK_MQ_REQ_* flags.
  *
  * Get a free request from @q.  If %__GFP_DIRECT_RECLAIM is set in @gfp_mask,
  * this function keeps retrying under memory pressure and fails iff @q is dead.
@@ -1346,7 +1348,7 @@ static struct request *__get_request(struct request_list 
*rl, unsigned int op,
  * Returns request pointer on success, with @q->queue_lock *not held*.
  */
 static struct request *get_request(struct request_queue *q, unsigned int op,
-   struct bio *bio, gfp_t gfp_mask)
+  struct bio *bio, unsigned int flags)
 {
const bool is_sync = op_is_sync(op);
DEFINE_WAIT(wait);
@@ -1358,7 +1360,7 @@ static struct request *get_request(struct request_queue 
*q, unsigned int op,
 
rl = blk_get_rl(q, bio);/* transferred to @rq on success */
 retry:
-   rq = __get_request(rl, op, bio, gfp_mask);
+   rq = __get_request(rl, op, bio, flags);
if (!IS_ERR(rq))
return rq;
 
@@ -1367,7 +1369,7 @@ static struct request *get_request(struct request_queue 
*q, unsigned int op,
return ERR_PTR(-EAGAIN);
}
 
-   if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) 
{
+   if ((flags & BLK_MQ_REQ_NOWAIT) || unlikely(blk_queue_dying(q))) {
blk_put_rl(rl);
return rq;
}
@@ -1394,10 +1396,13 @@ static struct request *get_request(struct request_queue 
*q, unsigned int op,
goto retry;
 }
 
+/* flags: BLK_MQ_REQ_PREEMPT and/or BLK_MQ_REQ_NOWAIT. */
 static struct request *blk_old_get_request(struct request_queue *q,
-  unsigned int op, gfp_t gfp_mask)
+  unsigned int op, unsigned int flags)
 {
struct request *rq;
+   gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? GFP_ATOMIC :
+__GFP_DIRECT_RECLAIM;
int ret = 0;
 
WARN_ON_ONCE(q->mq_ops);
@@ -1410,7 +1415,7 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
if (ret)
return ERR_PTR(ret);
spin_lock_irq(q->queue_lock);
-   rq = get_request(q, op, NULL, gfp_mask);
+   rq = get_request(q, op, NULL, flags);
if (IS_ERR(rq)) {
spin_unlock_irq(q->queue_lock);
blk_queue_exit(q);
@@ -1424,25 +1429,40 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
return rq;
 }
 
-struct request *blk_get_request(struct request_queue *q, unsigned int op,
-   gfp_t gfp_mask)
+/**
+ * blk_get_request_flags - allocate a request
+ * @q: request queue to allocate a request for
+ * @op: operation (REQ_OP_*) and REQ_* flags, e.g. REQ_SYNC.
+ * @flags: BLK_MQ_REQ_* flags, e.g. 

[PATCH v7 9/9] block, scsi: Make SCSI quiesce and resume work reliably

2017-10-09 Thread Bart Van Assche
The contexts from which a SCSI device can be quiesced or resumed are:
* Writing into /sys/class/scsi_device/*/device/state.
* SCSI parallel (SPI) domain validation.
* The SCSI device power management methods. See also scsi_bus_pm_ops.

It is essential during suspend and resume that neither the filesystem
state nor the filesystem metadata in RAM changes. This is why SCSI
devices are quiesced while the hibernation image is being written or
restored. The SCSI core quiesces devices through scsi_device_quiesce()
and scsi_device_resume(). In the SDEV_QUIESCE state execution of
non-preempt requests is deferred. This is realized by returning
BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI
devices. To avoid that a full queue prevents power management requests
from being submitted, defer allocation of non-preempt requests for
devices in the quiesced state. This patch has been tested by running
the following commands and by verifying that after resume the fio job
is still running:

for d in /sys/class/block/sd*[a-z]; do
  hcil=$(readlink "$d/device")
  hcil=${hcil#../../../}
  echo 4 > "$d/queue/nr_requests"
  echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth"
done
bdev=$(readlink /dev/disk/by-uuid/5217d83f-213e-4b42-b86e-20013325ba6c)
bdev=${bdev#../../}
hcil=$(readlink "/sys/block/$bdev/device")
hcil=${hcil#../../../}
fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 --rw=randread \
  --ioengine=libaio --numjobs=4 --iodepth=16 --iodepth_batch=1 --thread \
  --loops=$((2**31)) &
pid=$!
sleep 1
systemctl hibernate
sleep 10
kill $pid

Reported-by: Oleksandr Natalenko 
References: "I/O hangs after resuming from suspend-to-ram" 
(https://marc.info/?l=linux-block&m=150340235201348).
Signed-off-by: Bart Van Assche 
Cc: Martin K. Petersen 
Cc: Ming Lei 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-core.c| 42 +++---
 block/blk-mq.c  |  4 ++--
 block/blk-timeout.c |  2 +-
 drivers/scsi/scsi_lib.c | 28 ++--
 fs/block_dev.c  |  4 ++--
 include/linux/blkdev.h  |  2 +-
 6 files changed, 59 insertions(+), 23 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ed992cbd107f..3847ea42e341 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -372,6 +372,7 @@ void blk_clear_preempt_only(struct request_queue *q)
 
spin_lock_irqsave(q->queue_lock, flags);
queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
+	wake_up_all(&q->mq_freeze_wq);
spin_unlock_irqrestore(q->queue_lock, flags);
 }
 EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
@@ -793,15 +794,40 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
-int blk_queue_enter(struct request_queue *q, bool nowait)
+/**
+ * blk_queue_enter() - try to increase q->q_usage_counter
+ * @q: request queue pointer
+ * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
+ */
+int blk_queue_enter(struct request_queue *q, unsigned int flags)
 {
+   const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
+
while (true) {
+   bool success = false;
int ret;
 
-	if (percpu_ref_tryget_live(&q->q_usage_counter))
+	rcu_read_lock_sched();
+	if (percpu_ref_tryget_live(&q->q_usage_counter)) {
+   /*
+* The code that sets the PREEMPT_ONLY flag is
+* responsible for ensuring that that flag is globally
+* visible before the queue is unfrozen.
+*/
+   if (preempt || !blk_queue_preempt_only(q)) {
+   success = true;
+   } else {
+				percpu_ref_put(&q->q_usage_counter);
+   WARN_ONCE("%s: Attempt to allocate non-preempt 
request in preempt-only mode.\n",
+ kobject_name(q->kobj.parent));
+   }
+   }
+   rcu_read_unlock_sched();
+
+   if (success)
return 0;
 
-   if (nowait)
+   if (flags & BLK_MQ_REQ_NOWAIT)
return -EBUSY;
 
/*
@@ -814,7 +840,8 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
smp_rmb();
 
ret = wait_event_interruptible(q->mq_freeze_wq,
-			!atomic_read(&q->mq_freeze_depth) ||
+			(atomic_read(&q->mq_freeze_depth) == 0 &&
+(preempt || !blk_queue_preempt_only(q))) ||
blk_queue_dying(q));
if (blk_queue_dying(q))
return -ENODEV;
@@ -1442,8 

[PATCH v7 1/9] md: Rename md_notifier into md_reboot_notifier

2017-10-09 Thread Bart Van Assche
This avoids confusion with the pm notifier that will be added
through a later patch.

Signed-off-by: Bart Van Assche 
Cc: Shaohua Li 
Cc: linux-r...@vger.kernel.org
Cc: Ming Lei 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 drivers/md/md.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5d61049e7417..8933cafc212d 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8966,7 +8966,7 @@ static int md_notify_reboot(struct notifier_block *this,
return NOTIFY_DONE;
 }
 
-static struct notifier_block md_notifier = {
+static struct notifier_block md_reboot_notifier = {
.notifier_call  = md_notify_reboot,
.next   = NULL,
.priority   = INT_MAX, /* before any real devices */
@@ -9003,7 +9003,7 @@ static int __init md_init(void)
blk_register_region(MKDEV(mdp_major, 0), 1UL<

[PATCH v7 7/9] ide, scsi: Tell the block layer at request allocation time about preempt requests

2017-10-09 Thread Bart Van Assche
Convert blk_get_request(q, op, __GFP_RECLAIM) into
blk_get_request_flags(q, op, BLK_MQ_REQ_PREEMPT). This patch does not
change any functionality.

Signed-off-by: Bart Van Assche 
Cc: Martin K. Petersen 
Acked-by: David S. Miller  [ for IDE ]
Cc: Ming Lei 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 drivers/ide/ide-pm.c| 4 ++--
 drivers/scsi/scsi_lib.c | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c
index 544f02d673ca..f56d742908df 100644
--- a/drivers/ide/ide-pm.c
+++ b/drivers/ide/ide-pm.c
@@ -89,9 +89,9 @@ int generic_ide_resume(struct device *dev)
}
 
	memset(&rqpm, 0, sizeof(rqpm));
-   rq = blk_get_request(drive->queue, REQ_OP_DRV_IN, __GFP_RECLAIM);
+   rq = blk_get_request_flags(drive->queue, REQ_OP_DRV_IN,
+  BLK_MQ_REQ_PREEMPT);
ide_req(rq)->type = ATA_PRIV_PM_RESUME;
-   rq->rq_flags |= RQF_PREEMPT;
	rq->special = &rqpm;
rqpm.pm_step = IDE_PM_START_RESUME;
rqpm.pm_state = PM_EVENT_ON;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 9cf6a80fe297..1c16a247fae6 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -252,9 +252,9 @@ int scsi_execute(struct scsi_device *sdev, const unsigned 
char *cmd,
struct scsi_request *rq;
int ret = DRIVER_ERROR << 24;
 
-   req = blk_get_request(sdev->request_queue,
+   req = blk_get_request_flags(sdev->request_queue,
data_direction == DMA_TO_DEVICE ?
-   REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM);
+   REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, BLK_MQ_REQ_PREEMPT);
if (IS_ERR(req))
return ret;
rq = scsi_req(req);
@@ -268,7 +268,7 @@ int scsi_execute(struct scsi_device *sdev, const unsigned 
char *cmd,
rq->retries = retries;
req->timeout = timeout;
req->cmd_flags |= flags;
-   req->rq_flags |= rq_flags | RQF_QUIET | RQF_PREEMPT;
+   req->rq_flags |= rq_flags | RQF_QUIET;
 
/*
 * head injection *required* here otherwise quiesce won't work
-- 
2.14.2



[PATCH v7 3/9] md: Neither resync nor reshape while the system is frozen

2017-10-09 Thread Bart Van Assche
Some people use the md driver on laptops and use the suspend and
resume functionality. Since it is essential that submitting of
new I/O requests stops before a hibernation image is created,
interrupt the md resync and reshape actions if the system is
being frozen. Note: the resync and reshape will restart after
the system is resumed and a message similar to the following
will appear in the system log:

md: md0: data-check interrupted.

Signed-off-by: Bart Van Assche 
Cc: Shaohua Li 
Cc: linux-r...@vger.kernel.org
Cc: Ming Lei 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 drivers/md/md.c | 30 +-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index b99584e5d6b1..d712d3320c1d 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -66,6 +66,8 @@
 #include 
 #include 
 #include 
+#include <linux/freezer.h>
+#include <linux/suspend.h>
 
 #include 
 #include "md.h"
@@ -7439,6 +7441,7 @@ static int md_thread(void *arg)
 */
 
allow_signal(SIGKILL);
+   set_freezable();
while (!kthread_should_stop()) {
 
/* We need to wait INTERRUPTIBLE so that
@@ -7449,7 +7452,7 @@ static int md_thread(void *arg)
if (signal_pending(current))
flush_signals(current);
 
-   wait_event_interruptible_timeout
+   wait_event_freezable_timeout
(thread->wqueue,
		 test_bit(THREAD_WAKEUP, &thread->flags)
 || kthread_should_stop() || kthread_should_park(),
@@ -8963,6 +8966,29 @@ static void md_stop_all_writes(void)
mdelay(1000*1);
 }
 
+/*
+ * Ensure that neither resyncing nor reshaping occurs while the system is
+ * frozen.
+ */
+static int md_notify_pm(struct notifier_block *bl, unsigned long state,
+   void *unused)
+{
+   pr_debug("%s: state = %ld\n", __func__, state);
+
+   switch (state) {
+   case PM_HIBERNATION_PREPARE:
+   case PM_SUSPEND_PREPARE:
+   case PM_RESTORE_PREPARE:
+   md_stop_all_writes();
+   break;
+   }
+   return NOTIFY_DONE;
+}
+
+static struct notifier_block md_pm_notifier = {
+   .notifier_call  = md_notify_pm,
+};
+
 static int md_notify_reboot(struct notifier_block *this,
unsigned long code, void *x)
 {
@@ -9009,6 +9035,7 @@ static int __init md_init(void)
md_probe, NULL, NULL);
 
	register_reboot_notifier(&md_reboot_notifier);
+	register_pm_notifier(&md_pm_notifier);
raid_table_header = register_sysctl_table(raid_root_table);
 
md_geninit();
@@ -9248,6 +9275,7 @@ static __exit void md_exit(void)
 
unregister_blkdev(MD_MAJOR,"md");
unregister_blkdev(mdp_major, "mdp");
+	unregister_pm_notifier(&md_pm_notifier);
	unregister_reboot_notifier(&md_reboot_notifier);
unregister_sysctl_table(raid_table_header);
 
-- 
2.14.2



[PATCH v7 0/9] Hello Jens,

2017-10-09 Thread Bart Van Assche
It is known that during the resume following a hibernate, especially when
using an md RAID1 array created on top of SCSI devices, sometimes the
system hangs instead of coming up properly. This patch series fixes this
problem. This patch series is an alternative for Ming Lei's "block/scsi:
safe SCSI quiescing" patch series. The advantage of this patch series is
that a fix for the md driver has been included.

These patches have been tested on top of the block layer for-next branch.

Please consider these changes for kernel v4.15.

Thanks,

Bart.

Changes between v6 and v7:
- Added support for the PREEMPT_ONLY flag in blk-mq-debugfs.c.
- Fixed kerneldoc header of blk_queue_enter().
- Added a rcu_read_lock_sched() / rcu_read_unlock_sched() pair in
  blk_set_preempt_only().
- Removed a synchronize_rcu() call from scsi_device_quiesce().
- Modified the description of patch 9/9 in this series.
- Removed scsi_run_queue() call from scsi_device_resume().

Changes between v5 and v6:
- Split an md patch into two patches to make it easier to review the changes.
- For the md patch that suspends I/O while the system is frozen, switched back
  to the freezer mechanism because this makes the code shorter and easier to
  review.
- Changed blk_set/clear_preempt_only() from EXPORT_SYMBOL() into
  EXPORT_SYMBOL_GPL().
- Made blk_set_preempt_only() behave as a test-and-set operation.
- Introduced blk_get_request_flags() and BLK_MQ_REQ_PREEMPT as requested by
  Christoph and reduced the number of arguments of blk_queue_enter() back from
  three to two.
- In scsi_device_quiesce(), moved the blk_mq_freeze_queue() call out of a
  critical section. Made the explanation of why the synchronize_rcu() call
  is necessary more detailed.

Changes between v4 and v5:
- Split blk_set_preempt_only() into two functions as requested by Christoph.
- Made blk_get_request() trigger WARN_ONCE() if an attempt is made to allocate
  a request while the system is frozen. This is a big help in identifying
  drivers that submit I/O while the system is frozen.
- Since kernel thread freezing is on its way out, reworked the approach for
  preventing the md driver from submitting I/O while the system is frozen such
  that the approach no longer depends on the kernel thread freeze mechanism.
- Fixed the (theoretical) deadlock in scsi_device_quiesce() that was identified
  by Ming.

Changes between v3 and v4:
- Made sure that this patch series not only works for scsi-mq but also for
  the legacy SCSI stack.
- Removed an smp_rmb()/smp_wmb() pair from the hot path and added a comment
  that explains why that is safe.
- Reordered the patches in this patch series.

Changes between v2 and v3:
- Made md kernel threads freezable.
- Changed the approach for quiescing SCSI devices again.
- Addressed Ming's review comments.

Changes compared to v1 of this patch series:
- Changed the approach and rewrote the patch series.

Bart Van Assche (8):
  md: Rename md_notifier into md_reboot_notifier
  md: Introduce md_stop_all_writes()
  md: Neither resync nor reshape while the system is frozen
  block: Introduce blk_get_request_flags()
  block: Introduce BLK_MQ_REQ_PREEMPT
  ide, scsi: Tell the block layer at request allocation time about
preempt requests
  block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
  block, scsi: Make SCSI quiesce and resume work reliably

Ming Lei (1):
  block: Make q_usage_counter also track legacy requests

 block/blk-core.c| 132 
 block/blk-mq-debugfs.c  |   1 +
 block/blk-mq.c  |  16 +++---
 block/blk-timeout.c |   2 +-
 drivers/ide/ide-pm.c|   4 +-
 drivers/md/md.c |  45 ++---
 drivers/scsi/scsi_lib.c |  34 -
 fs/block_dev.c  |   4 +-
 include/linux/blk-mq.h  |   1 +
 include/linux/blkdev.h  |  11 +++-
 10 files changed, 195 insertions(+), 55 deletions(-)

-- 
2.14.2



[PATCH v7 6/9] block: Introduce BLK_MQ_REQ_PREEMPT

2017-10-09 Thread Bart Van Assche
Set RQF_PREEMPT if BLK_MQ_REQ_PREEMPT is passed to
blk_get_request_flags().

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-core.c   | 4 +++-
 block/blk-mq.c | 2 ++
 include/linux/blk-mq.h | 1 +
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 03156d9ede94..842e67e0b7ab 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1260,6 +1260,8 @@ static struct request *__get_request(struct request_list 
*rl, unsigned int op,
blk_rq_set_rl(rq, rl);
rq->cmd_flags = op;
rq->rq_flags = rq_flags;
+   if (flags & BLK_MQ_REQ_PREEMPT)
+   rq->rq_flags |= RQF_PREEMPT;
 
/* init elvpriv */
if (rq_flags & RQF_ELVPRIV) {
@@ -1441,7 +1443,7 @@ struct request *blk_get_request_flags(struct 
request_queue *q, unsigned int op,
struct request *req;
 
WARN_ON_ONCE(op & REQ_NOWAIT);
-   WARN_ON_ONCE(flags & ~BLK_MQ_REQ_NOWAIT);
+   WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT));
 
if (q->mq_ops) {
req = blk_mq_alloc_request(q, op, flags);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8766587ae8b8..bdbfe760bda0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -290,6 +290,8 @@ static struct request *blk_mq_rq_ctx_init(struct 
blk_mq_alloc_data *data,
rq->q = data->q;
rq->mq_ctx = data->ctx;
rq->cmd_flags = op;
+   if (data->flags & BLK_MQ_REQ_PREEMPT)
+   rq->rq_flags |= RQF_PREEMPT;
if (blk_queue_io_stat(data->q))
rq->rq_flags |= RQF_IO_STAT;
/* do not touch atomic flags, it needs atomic ops against the timer */
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 50c6485cb04f..ee9fc3fb7fca 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -200,6 +200,7 @@ enum {
BLK_MQ_REQ_NOWAIT   = (1 << 0), /* return when out of requests */
BLK_MQ_REQ_RESERVED = (1 << 1), /* allocate from reserved pool */
BLK_MQ_REQ_INTERNAL = (1 << 2), /* allocate internal/sched tag */
+   BLK_MQ_REQ_PREEMPT  = (1 << 3), /* set RQF_PREEMPT */
 };
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
-- 
2.14.2



[PATCH v7 8/9] block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag

2017-10-09 Thread Bart Van Assche
This flag will be used in the next patch to let the block layer
core know whether or not a SCSI request queue has been quiesced.
This is because a quiesced SCSI queue only processes RQF_PREEMPT requests.

Signed-off-by: Bart Van Assche 
Cc: Ming Lei 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-core.c   | 30 ++
 block/blk-mq-debugfs.c |  1 +
 include/linux/blkdev.h |  6 ++
 3 files changed, 37 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 842e67e0b7ab..ed992cbd107f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -346,6 +346,36 @@ void blk_sync_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
+/**
+ * blk_set_preempt_only - set QUEUE_FLAG_PREEMPT_ONLY
+ * @q: request queue pointer
+ *
+ * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not
+ * set and 1 if the flag was already set.
+ */
+int blk_set_preempt_only(struct request_queue *q)
+{
+   unsigned long flags;
+   int res;
+
+   spin_lock_irqsave(q->queue_lock, flags);
+   res = queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q);
+   spin_unlock_irqrestore(q->queue_lock, flags);
+
+   return res;
+}
+EXPORT_SYMBOL_GPL(blk_set_preempt_only);
+
+void blk_clear_preempt_only(struct request_queue *q)
+{
+   unsigned long flags;
+
+   spin_lock_irqsave(q->queue_lock, flags);
+   queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
+   spin_unlock_irqrestore(q->queue_lock, flags);
+}
+EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
+
 /**
  * __blk_run_queue_uncond - run a queue whether or not it has been stopped
  * @q: The queue to run
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 7f4a1ba532af..75f31535f280 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -74,6 +74,7 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(REGISTERED),
QUEUE_FLAG_NAME(SCSI_PASSTHROUGH),
QUEUE_FLAG_NAME(QUIESCED),
+   QUEUE_FLAG_NAME(PREEMPT_ONLY),
 };
 #undef QUEUE_FLAG_NAME
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3f7b08a9b9ae..89555eea742b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -630,6 +630,7 @@ struct request_queue {
 #define QUEUE_FLAG_REGISTERED  26  /* queue has been registered to a disk 
*/
 #define QUEUE_FLAG_SCSI_PASSTHROUGH 27 /* queue supports SCSI commands */
 #define QUEUE_FLAG_QUIESCED28  /* queue has been quiesced */
+#define QUEUE_FLAG_PREEMPT_ONLY	29	/* only process REQ_PREEMPT requests */
 
 #define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) |\
 (1 << QUEUE_FLAG_SAME_COMP)|   \
@@ -730,6 +731,11 @@ static inline void queue_flag_clear(unsigned int flag, 
struct request_queue *q)
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
 REQ_FAILFAST_DRIVER))
 #define blk_queue_quiesced(q)  test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags)
+#define blk_queue_preempt_only(q)  \
+   test_bit(QUEUE_FLAG_PREEMPT_ONLY, &(q)->queue_flags)
+
+extern int blk_set_preempt_only(struct request_queue *q);
+extern void blk_clear_preempt_only(struct request_queue *q);
 
 static inline bool blk_account_rq(struct request *rq)
 {
-- 
2.14.2



Re: [PATCH] scsi: libiscsi: fix shifting of DID_REQUEUE host byte

2017-10-09 Thread Lee Duncan


On 10/09/2017 04:33 AM, Johannes Thumshirn wrote:
> The SCSI host byte should be shifted left by 16 in order to have
> scsi_decide_disposition() do the right thing (i.e. requeue the command).
> 
> Signed-off-by: Johannes Thumshirn 
> Fixes: 661134ad3765 ("[SCSI] libiscsi, bnx2i: make bound ep check common")
> Cc: Lee Duncan 
> Cc: Hannes Reinecke 
> Cc: Bart Van Assche 
> Cc: Chris Leech 
> ---
>  drivers/scsi/libiscsi.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
> index bd4605a34f54..9cba4913b43c 100644
> --- a/drivers/scsi/libiscsi.c
> +++ b/drivers/scsi/libiscsi.c
> @@ -1728,7 +1728,7 @@ int iscsi_queuecommand(struct Scsi_Host *host, struct 
> scsi_cmnd *sc)
>  
> 	if (test_bit(ISCSI_SUSPEND_BIT, &conn->suspend_tx)) {
>   reason = FAILURE_SESSION_IN_RECOVERY;
> - sc->result = DID_REQUEUE;
> + sc->result = DID_REQUEUE << 16;
>   goto fault;
>   }
>  
> 
Good catch.

I know you're working on fixing all drivers to use the correct macros
rather than rolling their own.

Acked-by: Lee Duncan 
-- 
Lee Duncan
SUSE Labs


[PATCH] scsi: aic7xxx: explain a mysterious build failure message

2017-10-09 Thread Adam Borowski
"*** Install db development libraries" doesn't ring a bell that it's talking
about libdb (Berkeley DB).  Because of BDB's relicensing to Affero,
incompatible with many users including Linux, most if not all distributions
can be expected to keep version 5.3 for the foreseeable future, thus we can
point the user directly to libdb5.3-dev as the package to install.

Signed-off-by: Adam Borowski 
---
 drivers/scsi/aic7xxx/aicasm/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/aic7xxx/aicasm/Makefile 
b/drivers/scsi/aic7xxx/aicasm/Makefile
index 45e2d49c1fff..e69ceb71beb7 100644
--- a/drivers/scsi/aic7xxx/aicasm/Makefile
+++ b/drivers/scsi/aic7xxx/aicasm/Makefile
@@ -54,7 +54,7 @@ $(OUTDIR)/aicdb.h:
 elif [ -e "/usr/include/db_185.h" ]; then  \
echo "#include " > $@;\
 else   \
-   echo "*** Install db development libraries";\
+   echo "*** Install libdb development libraries (usually 
libdb5.3-dev)";\
 fi
 
 clean:
-- 
2.14.2



Re: [PATCH linux-firmware 1/1] qed: Add firmware 8.30.16.0

2017-10-09 Thread Ben Hutchings
On Wed, 2017-09-13 at 03:46 -0700, Rahul Verma wrote:
> The new qed firmware contains fixes to firmware and added
> support for new features,
> -Add UFP support.
> -DCQCN support for unlimited number of QP
> -Add IP type to GFT filter profile.
> -Added new TCP function counters.
> -Support flow ID in aRFS flow.
> 
> Signed-off-by: Rahul Verma 
> ---
>  qed/qed_init_values_zipped-8.30.16.0.bin | Bin 0 -> 837008 bytes
>  1 file changed, 0 insertions(+), 0 deletions(-)
>  create mode 100755 qed/qed_init_values_zipped-8.30.16.0.bin
[...]

The new file needs an entry in WHENCE.

Ben.

-- 
Ben Hutchings
Humour is the best antidote to reality.





Re: [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)

2017-10-09 Thread Ming Lei
Hi John,

On Mon, Oct 09, 2017 at 01:09:22PM +0100, John Garry wrote:
> On 30/09/2017 11:27, Ming Lei wrote:
> > Hi Jens,
> > 
> > In Red Hat internal storage tests wrt. the blk-mq scheduler, we
> > found that I/O performance is much worse with mq-deadline, especially
> > for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
> > SRP...)
> > 
> > It turns out one big issue causes the performance regression: requests
> > are still dequeued from the sw queue/scheduler queue even when the
> > LLD's queue is busy, so I/O merging becomes quite difficult, and
> > sequential I/O degrades a lot.
> > 
> > This issue became one of the main reasons for reverting default SCSI_MQ
> > in V4.13.
> > 
> > The 1st patch uses direct issue in blk_mq_request_bypass_insert(),
> > so that we can improve dm-mpath's performance in part 2, which will
> > be posted out soon.
> > 
> > The next six patches improve this situation, and bring back
> > some of the lost performance.
> > 
> > With this change, SCSI-MQ sequential I/O performance is
> > improved greatly: Paolo reported that mq-deadline performance
> > improved a lot[2] in his dbench test wrt. V2, and performance
> > improvement on lpfc/qla2xxx was observed with V1.[1]
> > 
> > Please consider it for V4.15.
> > 
> > [1] http://marc.info/?l=linux-block&m=150151989915776&w=2
> > [2] https://marc.info/?l=linux-block&m=150217980602843&w=2
> > 
> 
> I tested this series for the SAS controller on HiSilicon hip07 platform as I
> am interested in enabling MQ for this driver. Driver is
> ./drivers/scsi/hisi_sas/.
> 
> So I found that performance is improved when enabling default SCSI_MQ
> with this series vs baseline. However, it is still not as good as when
> default SCSI_MQ is disabled.
> 
> Here are some figures I got with fio:
> 4.14-rc2 without default SCSI_MQ
> read, rw, write IOPS  
> 952K, 133K/133K, 800K
> 
> 4.14-rc2 with default SCSI_MQ
> read, rw, write IOPS  
> 311K, 117K/117K, 320K
> 
> This series* without default SCSI_MQ
> read, rw, write IOPS  
> 975K, 132K/132K, 790K
> 
> This series* with default SCSI_MQ
> read, rw, write IOPS  
> 770K, 164K/164K, 594K

Thanks for testing this patchset!

Looks like there is a big improvement, but the gap compared with
the legacy block path is not small either.

> 
> Please note that hisi_sas driver does not enable mq by exposing multiple
> queues to upper layer (even though it has multiple queues). I have been
> playing with enabling it, but my performance is always worse...
> 
> * I'm using
> https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V5.1,
> as advised by Ming Lei.

Could you test on the following branch and see if it makes a
difference?


https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V6.1_test

BTW, one big change is in the following commit, which just takes the
legacy block layer's policy for dequeuing requests. I can observe some
improvement on virtio-scsi too. This commit is just for verification/debug
purposes and has never been posted out before.

https://github.com/ming1/linux/commit/94a117fdd9cfc1291445e5a35f04464c89c9ce70


Thanks,
Ming


Re: [PATCH] scsi: libiscsi: fix shifting of DID_REQUEUE host byte

2017-10-09 Thread Johannes Thumshirn
On Mon, Oct 09, 2017 at 02:35:06PM +0200, Steffen Maier wrote:
> Use wrapper functions to advertise their use in an attempt to avoid wrong
> shifting in the future?

After a second thought and a bit of coccinelle magic I converted all drivers
under drivers/scsi to use set_host_byte(), msg and driver byte will follow as
well.

But all this is on top of this patch, which IMHO is stable material (we had an
actual customer bug with this).

Byte,
Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 10/13] timer: Remove expires and data arguments from DEFINE_TIMER

2017-10-09 Thread Ralf Baechle
On Wed, Oct 04, 2017 at 04:27:04PM -0700, Kees Cook wrote:

> Subject: [PATCH 10/13] timer: Remove expires and data arguments from
>  DEFINE_TIMER
> 
> Drop the arguments from the macro and adjust all callers with the
> following script:
> 
>   perl -pi -e 's/DEFINE_TIMER\((.*), 0, 0\);/DEFINE_TIMER($1);/g;' \
> $(git grep DEFINE_TIMER | cut -d: -f1 | sort -u | grep -v timer.h)
> 
> Signed-off-by: Kees Cook 
> Acked-by: Geert Uytterhoeven  # for m68k parts
> ---
>  arch/arm/mach-ixp4xx/dsmg600-setup.c  | 2 +-
>  arch/arm/mach-ixp4xx/nas100d-setup.c  | 2 +-
>  arch/m68k/amiga/amisound.c| 2 +-
>  arch/m68k/mac/macboing.c  | 2 +-
>  arch/mips/mti-malta/malta-display.c   | 2 +-
>  arch/parisc/kernel/pdc_cons.c | 2 +-
>  arch/s390/mm/cmm.c| 2 +-
>  drivers/atm/idt77105.c| 4 ++--
>  drivers/atm/iphase.c  | 2 +-
>  drivers/block/ataflop.c   | 8 
>  drivers/char/dtlk.c   | 2 +-
>  drivers/char/hangcheck-timer.c| 2 +-
>  drivers/char/nwbutton.c   | 2 +-
>  drivers/char/rtc.c| 2 +-
>  drivers/input/touchscreen/s3c2410_ts.c| 2 +-
>  drivers/net/cris/eth_v10.c| 6 +++---
>  drivers/net/hamradio/yam.c| 2 +-
>  drivers/net/wireless/atmel/at76c50x-usb.c | 2 +-
>  drivers/staging/speakup/main.c| 2 +-
>  drivers/staging/speakup/synth.c   | 2 +-
>  drivers/tty/cyclades.c| 2 +-
>  drivers/tty/isicom.c  | 2 +-
>  drivers/tty/moxa.c| 2 +-
>  drivers/tty/rocket.c  | 2 +-
>  drivers/tty/vt/keyboard.c | 2 +-
>  drivers/tty/vt/vt.c   | 2 +-
>  drivers/watchdog/alim7101_wdt.c   | 2 +-
>  drivers/watchdog/machzwd.c| 2 +-
>  drivers/watchdog/mixcomwd.c   | 2 +-
>  drivers/watchdog/sbc60xxwdt.c | 2 +-
>  drivers/watchdog/sc520_wdt.c  | 2 +-
>  drivers/watchdog/via_wdt.c| 2 +-
>  drivers/watchdog/w83877f_wdt.c| 2 +-
>  drivers/xen/grant-table.c | 2 +-
>  fs/pstore/platform.c  | 2 +-
>  include/linux/timer.h | 4 ++--
>  kernel/irq/spurious.c | 2 +-
>  lib/random32.c| 2 +-
>  net/atm/mpc.c | 2 +-
>  net/decnet/dn_route.c | 2 +-
>  net/ipv6/ip6_flowlabel.c  | 2 +-
>  net/netrom/nr_loopback.c  | 2 +-
>  security/keys/gc.c| 2 +-
>  sound/oss/midibuf.c   | 2 +-
>  sound/oss/soundcard.c | 2 +-
>  sound/oss/sys_timer.c | 2 +-
>  sound/oss/uart6850.c  | 2 +-
>  47 files changed, 54 insertions(+), 54 deletions(-)

Acked-by: Ralf Baechle 

Thanks,

  Ralf


[PATCH 3.16 051/192] scsi: virtio_scsi: let host do exception handling

2017-10-09 Thread Ben Hutchings
3.16.49-rc1 review patch.  If anyone has any objections, please let me know.

--

From: Paolo Bonzini 

commit e72c9a2a67a6400c8ef3d01d4c461dbbbfa0e1f0 upstream.

virtio_scsi tries to do exception handling after the default 30 seconds
timeout expires.  However, it's better to let the host control the
timeout, otherwise with a heavy I/O load it is likely that an abort will
also timeout.  This leads to fatal errors like filesystems going
offline.

Disable the 'sd' timeout and allow the host to do exception handling,
following the precedent of the storvsc driver.

Hannes has a proposal to introduce timeouts in virtio, but this provides
an immediate solution for stable kernels too.

[mkp: fixed typo]

Reported-by: Douglas Miller 
Cc: "James E.J. Bottomley" 
Cc: "Martin K. Petersen" 
Cc: Hannes Reinecke 
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Paolo Bonzini 
Signed-off-by: Martin K. Petersen 
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings 
---
 drivers/scsi/virtio_scsi.c | 12 
 1 file changed, 12 insertions(+)

--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -686,6 +686,16 @@ static void virtscsi_target_destroy(stru
kfree(tgt);
 }
 
+/*
+ * The host guarantees to respond to each command, although I/O
+ * latencies might be higher than on bare metal.  Reset the timer
+ * unconditionally to give the host a chance to perform EH.
+ */
+static enum blk_eh_timer_return virtscsi_eh_timed_out(struct scsi_cmnd *scmnd)
+{
+   return BLK_EH_RESET_TIMER;
+}
+
 static struct scsi_host_template virtscsi_host_template_single = {
.module = THIS_MODULE,
.name = "Virtio SCSI HBA",
@@ -695,6 +705,7 @@ static struct scsi_host_template virtscs
.queuecommand = virtscsi_queuecommand_single,
.eh_abort_handler = virtscsi_abort,
.eh_device_reset_handler = virtscsi_device_reset,
+   .eh_timed_out = virtscsi_eh_timed_out,
 
.can_queue = 1024,
.dma_boundary = UINT_MAX,
@@ -712,6 +723,7 @@ static struct scsi_host_template virtscs
.queuecommand = virtscsi_queuecommand_multi,
.eh_abort_handler = virtscsi_abort,
.eh_device_reset_handler = virtscsi_device_reset,
+   .eh_timed_out = virtscsi_eh_timed_out,
 
.can_queue = 1024,
.dma_boundary = UINT_MAX,



Re: [PATCH 09/13] timer: Remove users of expire and data arguments to DEFINE_TIMER

2017-10-09 Thread Ralf Baechle
On Wed, Oct 04, 2017 at 04:27:03PM -0700, Kees Cook wrote:

> Subject: [PATCH 09/13] timer: Remove users of expire and data arguments to
>  DEFINE_TIMER
> 
> The expire and data arguments of DEFINE_TIMER are only used in two places
> and are ignored by the code (malta-display.c only uses mod_timer(),
> never add_timer(), so the preset expires value is ignored). Set both
> sets of arguments to zero.
> 
> Cc: Ralf Baechle 
> Cc: Wim Van Sebroeck 
> Cc: Guenter Roeck 
> Cc: Geert Uytterhoeven 
> Cc: linux-m...@linux-mips.org
> Cc: linux-watch...@vger.kernel.org
> Signed-off-by: Kees Cook 
> ---
>  arch/mips/mti-malta/malta-display.c | 6 +++---
>  drivers/watchdog/alim7101_wdt.c | 4 ++--
>  2 files changed, 5 insertions(+), 5 deletions(-)

For malta-display:

Acked-by: Ralf Baechle 

  Ralf


Re: [PATCH] scsi: libiscsi: fix shifting of DID_REQUEUE host byte

2017-10-09 Thread Johannes Thumshirn
On Mon, Oct 09, 2017 at 02:35:06PM +0200, Steffen Maier wrote:
> Use wrapper functions to advertise their use in an attempt to avoid wrong
> shifting in the future?

Not sure, converting all the users would be a lot of churn for relatively low
benefit:

linux (master)$ git grep "result = DID_" drivers/scsi/ | wc -l
419

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH] scsi: libiscsi: fix shifting of DID_REQUEUE host byte

2017-10-09 Thread Steffen Maier
Use wrapper functions to advertise their use in an attempt to avoid
wrong shifting in the future?


On 10/09/2017 01:33 PM, Johannes Thumshirn wrote:

The SCSI host byte should be shifted left by 16 in order to have
scsi_decide_disposition() do the right thing (i.e. requeue the command).

Signed-off-by: Johannes Thumshirn 
Fixes: 661134ad3765 ("[SCSI] libiscsi, bnx2i: make bound ep check common")
Cc: Lee Duncan 
Cc: Hannes Reinecke 
Cc: Bart Van Assche 
Cc: Chris Leech 
---
  drivers/scsi/libiscsi.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index bd4605a34f54..9cba4913b43c 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -1728,7 +1728,7 @@ int iscsi_queuecommand(struct Scsi_Host *host, struct 
scsi_cmnd *sc)

	if (test_bit(ISCSI_SUSPEND_BIT, &conn->suspend_tx)) {
reason = FAILURE_SESSION_IN_RECOVERY;
-   sc->result = DID_REQUEUE;
+   sc->result = DID_REQUEUE << 16;


not sure if this really wants to reset the other parts of result, but if
so (and they are not 0 already anyway), precede the set_host_byte() by:

sc->result = 0;

set_host_byte(sc, DID_REQUEUE);


goto fault;
}



--
Mit freundlichen Grüßen / Kind regards
Steffen Maier

Linux on z Systems Development

IBM Deutschland Research & Development GmbH
Vorsitzende des Aufsichtsrats: Martina Koederitz
Geschaeftsfuehrung: Dirk Wittkopp
Sitz der Gesellschaft: Boeblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294



Re: [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)

2017-10-09 Thread John Garry

On 30/09/2017 11:27, Ming Lei wrote:

Hi Jens,

In Red Hat internal storage tests wrt. the blk-mq scheduler, we
found that I/O performance is much worse with mq-deadline, especially
for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
SRP...)

It turns out one big issue causes the performance regression: requests
are still dequeued from the sw queue/scheduler queue even when the
LLD's queue is busy, so I/O merging becomes quite difficult, and
sequential I/O degrades a lot.

This issue became one of the main reasons for reverting default SCSI_MQ
in V4.13.

The 1st patch uses direct issue in blk_mq_request_bypass_insert(),
so that we can improve dm-mpath's performance in part 2, which will
be posted out soon.

The next six patches improve this situation, and bring back
some of the lost performance.

With this change, SCSI-MQ sequential I/O performance is
improved greatly: Paolo reported that mq-deadline performance
improved a lot[2] in his dbench test wrt. V2, and performance
improvement on lpfc/qla2xxx was observed with V1.[1]

Please consider it for V4.15.

[1] http://marc.info/?l=linux-block&m=150151989915776&w=2
[2] https://marc.info/?l=linux-block&m=150217980602843&w=2



I tested this series for the SAS controller on HiSilicon hip07 platform 
as I am interested in enabling MQ for this driver. Driver is 
./drivers/scsi/hisi_sas/.


So I found that performance is improved when enabling default
SCSI_MQ with this series vs baseline. However, it is still not as good
as when default SCSI_MQ is disabled.


Here are some figures I got with fio:
4.14-rc2 without default SCSI_MQ
read, rw, write IOPS
952K, 133K/133K, 800K

4.14-rc2 with default SCSI_MQ
read, rw, write IOPS
311K, 117K/117K, 320K

This series* without default SCSI_MQ
read, rw, write IOPS
975K, 132K/132K, 790K

This series* with default SCSI_MQ
read, rw, write IOPS
770K, 164K/164K, 594K

Please note that hisi_sas driver does not enable mq by exposing multiple 
queues to upper layer (even though it has multiple queues). I have been 
playing with enabling it, but my performance is always worse...


* I'm using 
https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V5.1, 
as advised by Ming Lei.


Thanks,
John


V5:
- address some comments from Omar
- add Tested-by & Reviewed-by tag
- use direct issue for blk_mq_request_bypass_insert(), and
start to consider to improve sequential I/O for dm-mpath
- only include part 1(the original patch 1 ~ 6), as suggested
by Omar

V4:
- add Reviewed-by tag
- some trivial change: typo fix in commit log or comment,
variable name, no actual functional change

V3:
- totally round robin for picking req from ctx, as suggested
by Bart
- remove one local variable in __sbitmap_for_each_set()
- drop patches of single dispatch list, which can improve
performance on mq-deadline, but cause a bit of degradation on
none because all hctxs need to be checked after ->dispatch
is flushed. Will post it again once it is mature.
- rebase on v4.13-rc6 with block for-next

V2:
- dequeue requests from sw queues in round-robin style
as suggested by Bart, and introduces one helper in sbitmap
for this purpose
- improve bio merge via hash table from sw queue
- add comments about using DISPATCH_BUSY state in lockless way,
simplifying handling on busy state,
- hold ctx->lock when clearing ctx busy bit as suggested
by Bart


Ming Lei (7):
  blk-mq: issue rq directly in blk_mq_request_bypass_insert()
  blk-mq-sched: fix scheduler bad performance
  sbitmap: introduce __sbitmap_for_each_set()
  blk-mq: introduce blk_mq_dequeue_from_ctx()
  blk-mq-sched: move actual dispatching into one helper
  blk-mq-sched: improve dispatching from sw queue
  blk-mq-sched: don't dequeue request until all in ->dispatch are
flushed

 block/blk-core.c|   3 +-
 block/blk-mq-debugfs.c  |   1 +
 block/blk-mq-sched.c| 104 ---
 block/blk-mq.c  | 114 +++-
 block/blk-mq.h  |   4 +-
 drivers/md/dm-rq.c  |   2 +-
 include/linux/blk-mq.h  |   3 ++
 include/linux/sbitmap.h |  64 +++
 8 files changed, 238 insertions(+), 57 deletions(-)






[PATCH] scsi: libiscsi: fix shifting of DID_REQUEUE host byte

2017-10-09 Thread Johannes Thumshirn
The SCSI host byte should be shifted left by 16 in order to have
scsi_decide_disposition() do the right thing (i.e. requeue the command).
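
For context, the SCSI result word packs four bytes, roughly:

	driver_byte << 24 | host_byte << 16 | msg_byte << 8 | status_byte

so assigning DID_REQUEUE without the shift lands in the status byte
instead of the host byte. The set_host_byte() helper from
include/scsi/scsi_cmnd.h encodes the shift once:

static inline void set_host_byte(struct scsi_cmnd *cmd, char status)
{
	cmd->result = (cmd->result & 0xff00ffff) | (status << 16);
}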

Signed-off-by: Johannes Thumshirn 
Fixes: 661134ad3765 ("[SCSI] libiscsi, bnx2i: make bound ep check common")
Cc: Lee Duncan 
Cc: Hannes Reinecke 
Cc: Bart Van Assche 
Cc: Chris Leech 
---
 drivers/scsi/libiscsi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index bd4605a34f54..9cba4913b43c 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -1728,7 +1728,7 @@ int iscsi_queuecommand(struct Scsi_Host *host, struct 
scsi_cmnd *sc)
 
	if (test_bit(ISCSI_SUSPEND_BIT, &conn->suspend_tx)) {
reason = FAILURE_SESSION_IN_RECOVERY;
-   sc->result = DID_REQUEUE;
+   sc->result = DID_REQUEUE << 16;
goto fault;
}
 
-- 
2.13.6



[PATCH V6 1/5] blk-mq-sched: fix scheduler bad performance

2017-10-09 Thread Ming Lei
When the hw queue is busy, we shouldn't take requests from the
scheduler queue any more, otherwise it is difficult to do
I/O merging.

This patch fixes the awful I/O performance on some
SCSI devices (lpfc, qla2xxx, ...) when mq-deadline/kyber
is used, by not taking requests if the hw queue is busy.

Reviewed-by: Bart Van Assche 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 4ab69435708c..eca011fdfa0e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -94,7 +94,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
struct request_queue *q = hctx->queue;
struct elevator_queue *e = q->elevator;
const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
-   bool did_work = false;
+   bool do_sched_dispatch = true;
LIST_HEAD(rq_list);
 
/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -125,18 +125,18 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 */
if (!list_empty(&rq_list)) {
blk_mq_sched_mark_restart_hctx(hctx);
-   did_work = blk_mq_dispatch_rq_list(q, &rq_list);
+   do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
} else if (!has_sched_dispatch) {
blk_mq_flush_busy_ctxs(hctx, &rq_list);
blk_mq_dispatch_rq_list(q, &rq_list);
}
 
/*
-* We want to dispatch from the scheduler if we had no work left
-* on the dispatch list, OR if we did have work but weren't able
-* to make progress.
+* We want to dispatch from the scheduler if there was nothing
+* on the dispatch list or we were able to dispatch from the
+* dispatch list.
 */
-   if (!did_work && has_sched_dispatch) {
+   if (do_sched_dispatch && has_sched_dispatch) {
do {
struct request *rq;
 
-- 
2.9.5



[PATCH V6 0/5] blk-mq-sched: improve sequential I/O performance

2017-10-09 Thread Ming Lei
Hi Jens,

In Red Hat internal storage tests wrt. the blk-mq scheduler, we
found that I/O performance is quite bad with mq-deadline, especially
for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
SRP...)

Turns out one big issue causes the performance regression: requests
are still dequeued from the sw queue/scheduler queue even when the
LLD's queue is busy, so I/O merging becomes quite difficult, and
sequential I/O degrades a lot.
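
As a rough sketch of the idea behind the fix (hypothetical pseudo-C,
not code from this series; lld_queue_busy() and dispatch_one_request()
are made-up stand-ins):

	/* Hold back dequeue while the low-level driver is busy. */
	if (lld_queue_busy(hctx)) {
		/*
		 * Requests left in the sw/scheduler queue can still
		 * accept merges of later sequential bios; requests
		 * parked on hctx->dispatch cannot.
		 */
		return;
	}
	dispatch_one_request(hctx);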

This issue became one of the main reasons for reverting default SCSI_MQ
in V4.13.

The 1st patch issues requests directly in blk_mq_request_bypass_insert(),
so that we can improve dm-mpath's performance in part 2, which will
be posted out soon.

The following patches improve this situation and bring back
some of the lost performance.

With this change, SCSI-MQ sequential I/O performance is
much improved. Paolo reported that mq-deadline performance
improved much[2] in his dbench test wrt. V2. Also, a performance
improvement on lpfc/qla2xxx was observed with V1.[1]

Please consider it for V4.15.

[1] http://marc.info/?l=linux-block=150151989915776=2
[2] https://marc.info/?l=linux-block=150217980602843=2

gitweb:

https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V6_part1  

git & branch:
https://github.com/ming1/linux.git  
#blk_mq_improve_scsi_mpath_perf_V6_part1

V6:
- address comments from Christoph
- drop the 1st patch, which changes blk_mq_request_bypass_insert()
and should belong to dm-mpath's improvement
- move 'blk-mq-sched: move actual dispatching into one helper' to be
the 2nd patch, and use the introduced helper to simplify the
dispatch logic
- merge two previous patches into one for improving dispatch from
the sw queue
- wrap comment/commit log lines at ~70 columns, as suggested by
Christoph

V5:
- address some comments from Omar
- add Tested-by & Reviewed-by tags
- use direct issue for blk_mq_request_bypass_insert(), and
start to consider improving sequential I/O for dm-mpath
- only include part 1 (the original patches 1 ~ 6), as suggested
by Omar

V4:
- add Reviewed-by tag
- some trivial changes: typo fixes in commit logs or comments,
variable renames, no actual functional change

V3:
- pick requests from ctxs in a fully round-robin fashion, as suggested
by Bart
- remove one local variable in __sbitmap_for_each_set()
- drop the patches for the single dispatch list, which can improve
performance on mq-deadline but cause a slight degradation on
none, because all hctxs need to be checked after ->dispatch
is flushed. Will post it again once it is mature.
- rebase on v4.13-rc6 with block for-next

V2:
- dequeue requests from sw queues in a round-robin style,
as suggested by Bart, and introduce one helper in sbitmap
for this purpose
- improve bio merging via a hash table in the sw queue
- add comments about using the DISPATCH_BUSY state in a lockless
way, and simplify handling of the busy state
- hold ctx->lock when clearing ctx busy bit as suggested
by Bart

Ming Lei (5):
  blk-mq-sched: fix scheduler bad performance
  blk-mq-sched: move actual dispatching into one helper
  sbitmap: introduce __sbitmap_for_each_set()
  blk-mq-sched: improve dispatching from sw queue
  blk-mq-sched: don't dequeue request until all in ->dispatch are
flushed

 block/blk-mq-debugfs.c  |   1 +
 block/blk-mq-sched.c| 115 
 block/blk-mq.c  |  44 ++
 block/blk-mq.h  |   2 +
 include/linux/blk-mq.h  |   3 ++
 include/linux/sbitmap.h |  64 ---
 6 files changed, 193 insertions(+), 36 deletions(-)

-- 
2.9.5



[PATCH V6 2/5] blk-mq-sched: move actual dispatching into one helper

2017-10-09 Thread Ming Lei
So that it becomes easy to support dispatching from the sw queue in
the following patch.

No functional change.

Reviewed-by: Bart Van Assche 
Reviewed-by: Omar Sandoval 
Suggested-by: Christoph Hellwig  # for simplifying dispatch logic
Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c | 43 ++++++++++++++++++++++++-------------------
 1 file changed, 24 insertions(+), 19 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index eca011fdfa0e..be29ba849408 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -89,12 +89,26 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
return false;
 }
 
+static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
+{
+   struct request_queue *q = hctx->queue;
+   struct elevator_queue *e = q->elevator;
+   LIST_HEAD(rq_list);
+
+   do {
+   struct request *rq = e->type->ops.mq.dispatch_request(hctx);
+
+   if (!rq)
+   break;
+   list_add(&rq->queuelist, &rq_list);
+   } while (blk_mq_dispatch_rq_list(q, &rq_list));
+}
+
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
struct request_queue *q = hctx->queue;
struct elevator_queue *e = q->elevator;
const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
-   bool do_sched_dispatch = true;
LIST_HEAD(rq_list);
 
/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -122,30 +136,21 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 * scheduler, we can no longer merge or sort them. So it's best to
 * leave them there for as long as we can. Mark the hw queue as
 * needing a restart in that case.
+*
+* We want to dispatch from the scheduler if there was nothing
+* on the dispatch list or we were able to dispatch from the
+* dispatch list.
 */
if (!list_empty(&rq_list)) {
blk_mq_sched_mark_restart_hctx(hctx);
-   do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
-   } else if (!has_sched_dispatch) {
+   if (blk_mq_dispatch_rq_list(q, &rq_list) && has_sched_dispatch)
+   blk_mq_do_dispatch_sched(hctx);
+   } else if (has_sched_dispatch) {
+   blk_mq_do_dispatch_sched(hctx);
+   } else {
blk_mq_flush_busy_ctxs(hctx, &rq_list);
blk_mq_dispatch_rq_list(q, &rq_list);
}
-
-   /*
-* We want to dispatch from the scheduler if there was nothing
-* on the dispatch list or we were able to dispatch from the
-* dispatch list.
-*/
-   if (do_sched_dispatch && has_sched_dispatch) {
-   do {
-   struct request *rq;
-
-   rq = e->type->ops.mq.dispatch_request(hctx);
-   if (!rq)
-   break;
-   list_add(&rq->queuelist, &rq_list);
-   } while (blk_mq_dispatch_rq_list(q, &rq_list));
-   }
 }
 
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
-- 
2.9.5



[PATCH V6 5/5] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed

2017-10-09 Thread Ming Lei
During dispatch, we move all requests from hctx->dispatch to
one temporary list, then dispatch them one by one from this list.
Unfortunately, during this period, a queue run from another context
may think the queue is idle, start to dequeue from the sw/scheduler
queue, and still try to dispatch because ->dispatch is empty. This
hurts sequential I/O performance because requests are dequeued while
the LLD queue is busy.

This patch introduces the BLK_MQ_S_DISPATCH_BUSY state to
make sure that requests aren't dequeued until ->dispatch is
flushed.
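
A condensed sketch of the protocol this patch establishes, collected
from the diff below for readability (illustrative, not a separate
implementation):

	/* 1) Whoever parks requests on hctx->dispatch marks the hctx busy: */
	spin_lock(&hctx->lock);
	list_splice_init(list, &hctx->dispatch);
	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
	spin_unlock(&hctx->lock);

	/* 2) Dequeue paths back off while the flag is set: */
	if (test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state))
		return;

	/* 3) The flag is cleared only once ->dispatch has been drained: */
	if (list_empty_careful(&hctx->dispatch))
		clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);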

Reviewed-by: Bart Van Assche 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 block/blk-mq-debugfs.c |  1 +
 block/blk-mq-sched.c   | 38 --
 block/blk-mq.c |  5 +
 include/linux/blk-mq.h |  1 +
 4 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index e5dccc9f6f1d..6c15487bc3ff 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -181,6 +181,7 @@ static const char *const hctx_state_name[] = {
HCTX_STATE_NAME(SCHED_RESTART),
HCTX_STATE_NAME(TAG_WAITING),
HCTX_STATE_NAME(START_ON_RUN),
+   HCTX_STATE_NAME(DISPATCH_BUSY),
 };
 #undef HCTX_STATE_NAME
 
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 14b354f617e5..9f549711da84 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -95,6 +95,18 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
struct elevator_queue *e = q->elevator;
LIST_HEAD(rq_list);
 
+   /*
+* If DISPATCH_BUSY is set, that means hw queue is busy
+* and requests in the list of hctx->dispatch need to
+* be flushed first, so return early.
+*
+* Wherever DISPATCH_BUSY is set, blk_mq_run_hw_queue()
+* will be run to try to make progress, so it is always
+* safe to check the state here.
+*/
+   if (test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state))
+   return;
+
do {
struct request *rq = e->type->ops.mq.dispatch_request(hctx);
 
@@ -121,6 +133,10 @@ static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
LIST_HEAD(rq_list);
struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
 
+   /* See same comment in blk_mq_do_dispatch_sched() */
+   if (test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state))
+   return;
+
do {
struct request *rq;
 
@@ -176,12 +192,22 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 */
if (!list_empty(&rq_list)) {
blk_mq_sched_mark_restart_hctx(hctx);
-   if (blk_mq_dispatch_rq_list(q, &rq_list)) {
-   if (has_sched_dispatch)
-   blk_mq_do_dispatch_sched(hctx);
-   else
-   blk_mq_do_dispatch_ctx(hctx);
-   }
+   blk_mq_dispatch_rq_list(q, &rq_list);
+
+   /*
+* We may clear DISPATCH_BUSY just after it is set from
+* another context; the only cost is that one request is
+* dequeued a bit early, and we can survive that. Given the
+* window is small enough, no need to worry about the
+* performance effect.
+*/
+   if (list_empty_careful(&hctx->dispatch))
+   clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
+
+   if (has_sched_dispatch)
+   blk_mq_do_dispatch_sched(hctx);
+   else
+   blk_mq_do_dispatch_ctx(hctx);
} else if (has_sched_dispatch) {
blk_mq_do_dispatch_sched(hctx);
} else if (q->queue_depth) {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 394cb75d66fa..06dda6182b7a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1172,6 +1172,11 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 
spin_lock(&hctx->lock);
list_splice_init(list, &hctx->dispatch);
+   /*
+* DISPATCH_BUSY won't be cleared until all requests
+* in hctx->dispatch are dispatched successfully
+*/
+   set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
spin_unlock(&hctx->lock);
 
/*
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 7b7a366a97f3..13f6c25fa461 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -172,6 +172,7 @@ enum {
BLK_MQ_S_SCHED_RESTART  = 2,
BLK_MQ_S_TAG_WAITING= 3,
BLK_MQ_S_START_ON_RUN   = 4,
+   BLK_MQ_S_DISPATCH_BUSY  = 5,
 
BLK_MQ_MAX_DEPTH= 10240,
 
-- 
2.9.5



[PATCH V6 3/5] sbitmap: introduce __sbitmap_for_each_set()

2017-10-09 Thread Ming Lei
We need to iterate over the ctxs starting from any ctx in a
round-robin fashion, so introduce this helper.
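
As a usage sketch (not part of this patch; count_set() and count_from()
are hypothetical), a caller can resume a wrap-around scan from any
saved position:

	/* Hypothetical callback: count set bits, never break early. */
	static bool count_set(struct sbitmap *sb, unsigned int bitnr, void *data)
	{
		unsigned int *cnt = data;

		(*cnt)++;
		return true;
	}

	static unsigned int count_from(struct sbitmap *sb, unsigned int start)
	{
		unsigned int cnt = 0;

		/* Visits every set bit exactly once, wrapping at sb->depth. */
		__sbitmap_for_each_set(sb, start, count_set, &cnt);
		return cnt;
	}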

Cc: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 include/linux/sbitmap.h | 64 -
 1 file changed, 47 insertions(+), 17 deletions(-)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index a1904aadbc45..0dcc60e820de 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -211,10 +211,14 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb);
  */
 bool sbitmap_any_bit_clear(const struct sbitmap *sb);
 
+#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
+#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
+
 typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
 
 /**
- * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
+ * __sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
+ * @start: Where to start the iteration.
  * @sb: Bitmap to iterate over.
  * @fn: Callback. Should return true to continue or false to break early.
  * @data: Pointer to pass to callback.
@@ -222,35 +226,61 @@ typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
 * This is inline even though it's non-trivial so that the function calls to the
 * callback will hopefully get optimized away.
  */
-static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
-   void *data)
+static inline void __sbitmap_for_each_set(struct sbitmap *sb,
+ unsigned int start,
+ sb_for_each_fn fn, void *data)
 {
-   unsigned int i;
+   unsigned int index;
+   unsigned int nr;
+   unsigned int scanned = 0;
 
-   for (i = 0; i < sb->map_nr; i++) {
-   struct sbitmap_word *word = &sb->map[i];
-   unsigned int off, nr;
+   if (start >= sb->depth)
+   start = 0;
+   index = SB_NR_TO_INDEX(sb, start);
+   nr = SB_NR_TO_BIT(sb, start);
 
-   if (!word->word)
-   continue;
+   while (scanned < sb->depth) {
+   struct sbitmap_word *word = &sb->map[index];
+   unsigned int depth = min_t(unsigned int, word->depth - nr,
+  sb->depth - scanned);
 
-   nr = 0;
-   off = i << sb->shift;
+   scanned += depth;
+   if (!word->word)
+   goto next;
+
+   /*
+* On the first iteration of the outer loop, we need to add the
+* bit offset back to the size of the word for find_next_bit().
+* On all other iterations, nr is zero, so this is a noop.
+*/
+   depth += nr;
while (1) {
-   nr = find_next_bit(&word->word, word->depth, nr);
-   if (nr >= word->depth)
+   nr = find_next_bit(&word->word, depth, nr);
+   if (nr >= depth)
break;
-
-   if (!fn(sb, off + nr, data))
+   if (!fn(sb, (index << sb->shift) + nr, data))
return;
 
nr++;
}
+next:
+   nr = 0;
+   if (++index >= sb->map_nr)
+   index = 0;
}
 }
 
-#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
-#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
+/**
+ * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
+ * @sb: Bitmap to iterate over.
+ * @fn: Callback. Should return true to continue or false to break early.
+ * @data: Pointer to pass to callback.
+ */
+static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
+   void *data)
+{
+   __sbitmap_for_each_set(sb, 0, fn, data);
+}
 
 static inline unsigned long *__sbitmap_word(struct sbitmap *sb,
unsigned int bitnr)
-- 
2.9.5



[PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue

2017-10-09 Thread Ming Lei
SCSI devices use a host-wide tagset, and the shared driver tag space is
often quite big. Meanwhile there is also a queue depth for each LUN
(.cmd_per_lun), which is often small; for example, on both lpfc and
qla2xxx, .cmd_per_lun is just 3.
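
For a sense of that imbalance, a SCSI host template along these lines
is typical (illustrative values, not taken from any specific driver):

	/*
	 * Illustrative only: a large shared driver tag space, but only
	 * a handful of outstanding commands allowed per LUN.
	 */
	static struct scsi_host_template example_sht = {
		.can_queue   = 1024,	/* host-wide tag space */
		.cmd_per_lun = 3,	/* per-LUN queue depth */
		/* ... */
	};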

So lots of requests may stay in the sw queue, and we always flush all
of those belonging to the same hw queue and dispatch them all to the
driver. Unfortunately, it is easy to hit the queue-busy condition
because of the small .cmd_per_lun. Once these requests are flushed out,
they have to stay in hctx->dispatch, no bio can be merged into these
requests any more, and sequential I/O performance is hurt a lot.

This patch introduces blk_mq_dequeue_from_ctx() for dequeuing requests
from the sw queue, so that we can dispatch them in the scheduler's way
and avoid dequeuing too many requests from the sw queue while
->dispatch isn't flushed completely.

In other words, when there is a per-request-queue queue depth, dispatch
from the sw queue now takes requests one by one, just like an I/O
scheduler does.

Reviewed-by: Omar Sandoval 
Reviewed-by: Bart Van Assche 
Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c   | 50 --
 block/blk-mq.c | 39 +++
 block/blk-mq.h |  2 ++
 include/linux/blk-mq.h |  2 ++
 4 files changed, 91 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index be29ba849408..14b354f617e5 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -104,6 +104,39 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
} while (blk_mq_dispatch_rq_list(q, &rq_list));
 }
 
+static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
+ struct blk_mq_ctx *ctx)
+{
+   unsigned idx = ctx->index_hw;
+
+   if (++idx == hctx->nr_ctx)
+   idx = 0;
+
+   return hctx->ctxs[idx];
+}
+
+static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
+{
+   struct request_queue *q = hctx->queue;
+   LIST_HEAD(rq_list);
+   struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
+
+   do {
+   struct request *rq;
+
+   rq = blk_mq_dequeue_from_ctx(hctx, ctx);
+   if (!rq)
+   break;
+   list_add(&rq->queuelist, &rq_list);
+
+   /* round robin for fair dispatch */
+   ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
+
+   } while (blk_mq_dispatch_rq_list(q, &rq_list));
+
+   WRITE_ONCE(hctx->dispatch_from, ctx);
+}
+
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
struct request_queue *q = hctx->queue;
@@ -143,10 +176,23 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 */
if (!list_empty(&rq_list)) {
blk_mq_sched_mark_restart_hctx(hctx);
-   if (blk_mq_dispatch_rq_list(q, &rq_list) && has_sched_dispatch)
-   blk_mq_do_dispatch_sched(hctx);
+   if (blk_mq_dispatch_rq_list(q, &rq_list)) {
+   if (has_sched_dispatch)
+   blk_mq_do_dispatch_sched(hctx);
+   else
+   blk_mq_do_dispatch_ctx(hctx);
+   }
} else if (has_sched_dispatch) {
blk_mq_do_dispatch_sched(hctx);
+   } else if (q->queue_depth) {
+   /*
+* If there is a per-request_queue depth, dequeue
+* requests one by one from the sw queue to avoid messing
+* up I/O merging when dispatch runs out of resource, which
+* can be triggered easily when there is a per-request_queue
+* queue depth or .cmd_per_lun, such as on SCSI devices.
+*/
+   blk_mq_do_dispatch_ctx(hctx);
} else {
blk_mq_flush_busy_ctxs(hctx, &rq_list);
blk_mq_dispatch_rq_list(q, &rq_list);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 076cbab9c3e0..394cb75d66fa 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -911,6 +911,45 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 }
 EXPORT_SYMBOL_GPL(blk_mq_flush_busy_ctxs);
 
+struct dispatch_rq_data {
+   struct blk_mq_hw_ctx *hctx;
+   struct request *rq;
+};
+
+static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr,
+   void *data)
+{
+   struct dispatch_rq_data *dispatch_data = data;
+   struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
+   struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
+
+   spin_lock(&ctx->lock);
+   if (unlikely(!list_empty(&ctx->rq_list))) {
+   dispatch_data->rq = list_entry_rq(ctx->rq_list.next);
+   list_del_init(&dispatch_data->rq->queuelist);
+   if (list_empty(&ctx->rq_list))
+   sbitmap_clear_bit(sb, bitnr);
+   }
+   sbitmap_clear_bit(sb, bitnr);
+   }
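
The message is truncated here. For context, the rest of this hunk in
the upstream version of the patch defines blk_mq_dequeue_from_ctx()
roughly as follows; this is a reconstruction for readability, not
verbatim from this mail:

+   spin_unlock(&ctx->lock);
+
+   return !dispatch_data->rq;
+}
+
+struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
+   struct blk_mq_ctx *start)
+{
+   unsigned off = start ? start->index_hw : 0;
+   struct dispatch_rq_data data = {
+   .hctx = hctx,
+   .rq   = NULL,
+   };
+
+   __sbitmap_for_each_set(&hctx->ctx_map, off,
+  dispatch_rq_from_ctx, &data);
+
+   return data.rq;
+}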

Re: [PATCH V5 7/7] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed

2017-10-09 Thread Ming Lei
On Tue, Oct 03, 2017 at 02:11:28AM -0700, Christoph Hellwig wrote:
> This looks good in general:
> 
> Reviewed-by: Christoph Hellwig 
> 
> Minor nitpicks below:
> 
> > const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
> 
> This is now only tested once, so you can remove the local variable
> for it.

There are still two users of the local variable, so I suggest
keeping it.

> 
> > +   /*
> > +* We may clear DISPATCH_BUSY just after it
> > +* is set from another context, the only cost
> > +* is that one request is dequeued a bit early,
> > +* we can survive that. Given the window is
> > +* small enough, no need to worry about performance
> > +* effect.
> > +*/
> 
> Use your 80 line real estate for comments please.

OK.

> 
> > if (!has_sched_dispatch)
> > +   if (!q->queue_depth) {
> > +   blk_mq_flush_busy_ctxs(hctx, &rq_list);
> > +   blk_mq_dispatch_rq_list(q, &rq_list);
> > +   } else {
> > +   blk_mq_do_dispatch_ctx(q, hctx);
> > +   }
> > +   } else {
> > blk_mq_do_dispatch_sched(q, e, hctx);
> > +   }
> 
> Maybe flatten this out to:
> 
>   if (e && e->type->ops.mq.dispatch_request) {
>   blk_mq_do_dispatch_sched(q, e, hctx);
>   } else if (q->queue_depth) {
>   blk_mq_do_dispatch_ctx(q, hctx);
>   } else {
>   blk_mq_flush_busy_ctxs(hctx, &rq_list);
>   blk_mq_dispatch_rq_list(q, &rq_list);
>   }
> 

OK.


-- 
Ming


Re: [PATCH V5 6/7] blk-mq-sched: improve dispatching from sw queue

2017-10-09 Thread Ming Lei
On Tue, Oct 03, 2017 at 02:05:27AM -0700, Christoph Hellwig wrote:
> On Sat, Sep 30, 2017 at 06:27:19PM +0800, Ming Lei wrote:
> > SCSI devices use host-wide tagset, and the shared
> > driver tag space is often quite big. Meantime
> > there is also queue depth for each lun(.cmd_per_lun),
> > which is often small.
> > 
> > So lots of requests may stay in sw queue, and we
> > always flush all belonging to same hw queue and
> > dispatch them all to driver, unfortunately it is
> > easy to cause queue busy because of the small
> > per-lun queue depth. Once these requests are flushed
> > out, they have to stay in hctx->dispatch, and no bio
> > merge can participate into these requests, and
> > sequential IO performance is hurt.
> > 
> > This patch improves dispatching from sw queue when
> > there is per-request-queue queue depth by taking
> > request one by one from sw queue, just like the way
> > of IO scheduler.
> 
> Once again, use your line length real estate for your change logs..

OK.

> 
> > +static void blk_mq_do_dispatch_ctx(struct request_queue *q,
> > +  struct blk_mq_hw_ctx *hctx)
> > +{
> > +   LIST_HEAD(rq_list);
> > +   struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
> > +   bool dispatched;
> > +
> > +   do {
> > +   struct request *rq;
> > +
> > +   rq = blk_mq_dequeue_from_ctx(hctx, ctx);
> 
> This probably should be merged with the patch that introduces
> blk_mq_dequeue_from_ctx.

OK.

> 
> > +   if (!rq)
> > +   break;
> > +   list_add(&rq->queuelist, &rq_list);
> 
> Btw, do we really need to return a request from blk_mq_dequeue_from_ctx,
> or would it be easier to just add it directly to the list passed as
> an argument? I did wonder that about the existing code already.

I suggest to keep them as current way in this patch, but we can clean
them up all in another patch.

> 
> > +
> > +   /* round robin for fair dispatch */
> > +   ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
> > +
> > +   dispatched = blk_mq_dispatch_rq_list(q, &rq_list);
> > +   } while (dispatched);
> > +
> > +   if (!dispatched)
> > +   WRITE_ONCE(hctx->dispatch_from, ctx);
> 
> No need for the dispatched argument, just write this as:
> 
>    } while (blk_mq_dispatch_rq_list(q, &rq_list));
>   
>   WRITE_ONCE(hctx->dispatch_from, ctx);
> 
> and do an early return instead of a break from inside the loop.

OK.

> 
> > +   } else if (!has_sched_dispatch && !q->queue_depth) {
> > +   /*
> > +* If there is no per-request_queue depth, we
> > +* flush all requests in this hw queue, otherwise
> > +* pick up request one by one from sw queue for
> > +* avoiding to mess up I/O merge when dispatch
> > +* run out of resource, which can be triggered
> > +* easily by per-request_queue queue depth
> > +*/
> 
> Use your 80 line real estate for comments..

OK.

-- 
Ming


Re: [PATCH V5 5/7] blk-mq-sched: move actual dispatching into one helper

2017-10-09 Thread Ming Lei
On Mon, Oct 02, 2017 at 07:19:56AM -0700, Christoph Hellwig wrote:
> Can you move this to the beginning of your series, just after
> the other edits to blk_mq_sched_dispatch_requests?

OK.

> 
> > +static void blk_mq_do_dispatch_sched(struct request_queue *q,
> > +struct elevator_queue *e,
> > +struct blk_mq_hw_ctx *hctx)
> 
> No need to pass anything but the hctx here, the other two can
> be trivially derived from it.

OK.

> 
> > +{
> > +   LIST_HEAD(rq_list);
> > +
> > +   do {
> > +   struct request *rq;
> > +
> > +   rq = e->type->ops.mq.dispatch_request(hctx);
> 
> how about shortening this to:
> 
>   struct request *rq = e->type->ops.mq.dispatch_request(hctx);

OK

> 
> >  void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> >  {
> > struct request_queue *q = hctx->queue;
> > @@ -136,16 +152,8 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> >  * on the dispatch list or we were able to dispatch from the
> >  * dispatch list.
> >  */
> > -   if (do_sched_dispatch && has_sched_dispatch) {
> > -   do {
> > -   struct request *rq;
> > -
> > -   rq = e->type->ops.mq.dispatch_request(hctx);
> > -   if (!rq)
> > -   break;
> > -   list_add(&rq->queuelist, &rq_list);
> > -   } while (blk_mq_dispatch_rq_list(q, &rq_list));
> > -   }
> > +   if (do_sched_dispatch && has_sched_dispatch)
> > +   blk_mq_do_dispatch_sched(q, e, hctx);
> >  }
> 
> Please use this new helper to simplify the logic. E.g.:
> 
>   if (!list_empty(&rq_list)) {
>   blk_mq_sched_mark_restart_hctx(hctx);
>   if (blk_mq_dispatch_rq_list(q, &rq_list) && has_sched_dispatch)
>   blk_mq_do_dispatch_sched(hctx);
>   } else if (has_sched_dispatch) {
>   blk_mq_do_dispatch_sched(hctx);
>   } else {
>   blk_mq_flush_busy_ctxs(hctx, &rq_list);
>   blk_mq_dispatch_rq_list(q, &rq_list);
>   }

OK

-- 
Ming