[PATCH 5/5] scsi: virtio_scsi: fix IO hang by irq vector automatic affinity

2018-02-02 Thread Ming Lei
Now that 84676c1f21e8ff5 ("genirq/affinity: assign vectors to all possible CPUs")
has been merged for v4.16-rc, it can easily happen that some irq vectors are
assigned only offline CPUs; this can't be avoided even if the allocation
is improved.

For example, on an 8-core VM where CPUs 4-7 are not present/offline and
virtio-scsi has 4 queues, the assigned irq affinity can end up in the following shape:

irq 36, cpu list 0-7
irq 37, cpu list 0-7
irq 38, cpu list 0-7
irq 39, cpu list 0-1
irq 40, cpu list 4,6
irq 41, cpu list 2-3
irq 42, cpu list 5,7

An IO hang is then triggered in the non-SCSI_MQ (legacy) case.

Given storage IO always follows a client/server model, there is no such issue
with SCSI_MQ (blk-mq), because no IO can be submitted to a hw queue that has
no online CPUs.
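The hang condition can be reasoned about with a simple mask check. The sketch below (a hypothetical illustration with made-up helper names, not kernel code) encodes the affinity table above as bitmasks and shows that irq 40 (CPUs 4,6) and irq 42 (CPUs 5,7) have no online CPU left to service completions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical illustration, not kernel code: a vector can deliver
 * completions only if its affinity mask intersects the online-CPU mask.
 * With CPUs 0-3 online (mask 0x0f), irq 40 (CPUs 4,6 -> 0x50) and
 * irq 42 (CPUs 5,7 -> 0xa0) are serviced by no online CPU, so in the
 * legacy path any request whose completion is routed there never
 * completes. */
static bool vector_can_fire(uint32_t affinity_mask, uint32_t online_mask)
{
	return (affinity_mask & online_mask) != 0;
}
```

blk-mq avoids the problem structurally: a hw queue mapped only to offline CPUs never receives a submission in the first place.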

Fix this issue by forcing virtio-scsi to use blk-mq.

BTW, I have been using virtio-scsi (scsi_mq) for several years and it has
been quite stable, so this shouldn't add extra risk.

Cc: Hannes Reinecke 
Cc: Arun Easi 
Cc: Omar Sandoval ,
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Paolo Bonzini 
Cc: Laurence Oberman 
Cc: Mike Snitzer 
Signed-off-by: Ming Lei 
---
 drivers/scsi/virtio_scsi.c | 59 +++---
 1 file changed, 3 insertions(+), 56 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 7c28e8d4955a..54e3a0f6844c 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -91,9 +91,6 @@ struct virtio_scsi_vq {
 struct virtio_scsi_target_state {
seqcount_t tgt_seq;
 
-   /* Count of outstanding requests. */
-   atomic_t reqs;
-
/* Currently active virtqueue for requests sent to this target. */
struct virtio_scsi_vq *req_vq;
 };
@@ -152,8 +149,6 @@ static void virtscsi_complete_cmd(struct virtio_scsi 
*vscsi, void *buf)
struct virtio_scsi_cmd *cmd = buf;
struct scsi_cmnd *sc = cmd->sc;
struct virtio_scsi_cmd_resp *resp = &cmd->resp.cmd;
-   struct virtio_scsi_target_state *tgt =
-   scsi_target(sc->device)->hostdata;
 
dev_dbg(&sc->device->sdev_gendev,
"cmd %p response %u status %#02x sense_len %u\n",
@@ -210,8 +205,6 @@ static void virtscsi_complete_cmd(struct virtio_scsi 
*vscsi, void *buf)
}
 
sc->scsi_done(sc);
-
-   atomic_dec(&tgt->reqs);
 }
 
 static void virtscsi_vq_done(struct virtio_scsi *vscsi,
@@ -580,10 +573,7 @@ static int virtscsi_queuecommand_single(struct Scsi_Host 
*sh,
struct scsi_cmnd *sc)
 {
struct virtio_scsi *vscsi = shost_priv(sh);
-   struct virtio_scsi_target_state *tgt =
-   scsi_target(sc->device)->hostdata;
 
-   atomic_inc(&tgt->reqs);
	return virtscsi_queuecommand(vscsi, &vscsi->req_vqs[0], sc);
 }
 
@@ -596,55 +586,11 @@ static struct virtio_scsi_vq *virtscsi_pick_vq_mq(struct 
virtio_scsi *vscsi,
return &vscsi->req_vqs[hwq];
 }
 
-static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
-  struct virtio_scsi_target_state 
*tgt)
-{
-   struct virtio_scsi_vq *vq;
-   unsigned long flags;
-   u32 queue_num;
-
-   local_irq_save(flags);
-   if (atomic_inc_return(&tgt->reqs) > 1) {
-   unsigned long seq;
-
-   do {
-   seq = read_seqcount_begin(&tgt->tgt_seq);
-   vq = tgt->req_vq;
-   } while (read_seqcount_retry(&tgt->tgt_seq, seq));
-   } else {
-   /* no writes can be concurrent because of atomic_t */
-   write_seqcount_begin(&tgt->tgt_seq);
-
-   /* keep previous req_vq if a reader just arrived */
-   if (unlikely(atomic_read(&tgt->reqs) > 1)) {
-   vq = tgt->req_vq;
-   goto unlock;
-   }
-
-   queue_num = smp_processor_id();
-   while (unlikely(queue_num >= vscsi->num_queues))
-   queue_num -= vscsi->num_queues;
-   tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
- unlock:
-   write_seqcount_end(&tgt->tgt_seq);
-   }
-   local_irq_restore(flags);
-
-   return vq;
-}
-
 static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
   struct scsi_cmnd *sc)
 {
struct virtio_scsi *vscsi = shost_priv(sh);
-   struct virtio_scsi_target_state *tgt =
-   scsi_target(sc->device)->hostdata;
-   struct virtio_scsi_vq *req_vq;
-
-   if (shost_use_blk_mq(sh))
-   

[PATCH 4/5] scsi: introduce force_blk_mq

2018-02-02 Thread Ming Lei
From a SCSI driver's view, it is a bit troublesome to support both blk-mq
and non-blk-mq at the same time, especially when the driver needs to support
multiple hw queues.

This patch introduces 'force_blk_mq' in scsi_host_template so that a driver
can offer blk-mq-only support and avoid the trouble of supporting both paths.

Providing blk-mq-only support may clean up drivers a lot; especially, it
makes it easier to convert multiple reply queues into blk-mq hw queues for
the following purposes:

1) use blk-mq's multiple hw queues to deal with irq vectors whose assigned
CPUs are all offline[1]:

[1] https://marc.info/?l=linux-kernel&m=151748144730409&w=2

Now that 84676c1f21e8ff5 ("genirq/affinity: assign vectors to all possible CPUs")
has been merged for v4.16-rc, it can easily happen that some irq vectors are
assigned only offline CPUs; this can't be avoided even if the allocation
is improved.

So all these drivers have to avoid asking the HBA to complete requests on a
reply queue that has no online CPUs assigned.

This issue can be solved generically and easily via blk-mq (scsi_mq) multiple
hw queues by mapping each reply queue to a hctx.

2) some drivers[2] need to complete requests on the submission CPU to avoid
hard/soft lockups, which is easily done with blk-mq, so there is no need
to reinvent the wheel.

[2] https://marc.info/?t=15160185141&r=1&w=2

Solving the above issues for the non-MQ path may not be easy, or would
introduce unnecessary work, especially as we plan to enable SCSI_MQ by
default soon, as discussed recently[3]:

[3] https://marc.info/?l=linux-scsi&m=151727684915589&w=2
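The actual change is the one-line OR in scsi_host_alloc() shown in the diff below; condensed into a plain function for illustration (hypothetical name, not the real kernel API), the decision is simply:

```c
#include <assert.h>
#include <stdbool.h>

/* Condensed illustration of the scsi_host_alloc() change: the host runs
 * blk-mq if the global scsi_mod.use_blk_mq setting is on, OR if the
 * driver's scsi_host_template sets force_blk_mq. The function name here
 * is made up for the sketch. */
static bool host_use_blk_mq(bool scsi_use_blk_mq, bool force_blk_mq)
{
	return scsi_use_blk_mq || force_blk_mq;
}
```

So a driver that sets force_blk_mq opts its hosts into blk-mq regardless of the module-level default.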

Cc: Hannes Reinecke 
Cc: Arun Easi 
Cc: Omar Sandoval ,
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Laurence Oberman 
Cc: Mike Snitzer 
Signed-off-by: Ming Lei 
---
 drivers/scsi/hosts.c | 1 +
 include/scsi/scsi_host.h | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index fe3a0da3ec97..c75cebd7911d 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -471,6 +471,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template 
*sht, int privsize)
shost->dma_boundary = 0xffffffff;
 
shost->use_blk_mq = scsi_use_blk_mq;
+   shost->use_blk_mq = scsi_use_blk_mq || !!shost->hostt->force_blk_mq;
 
device_initialize(&shost->shost_gendev);
	dev_set_name(&shost->shost_gendev, "host%d", shost->host_no);
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index a8b7bf879ced..4118760e5c32 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -452,6 +452,9 @@ struct scsi_host_template {
/* True if the controller does not support WRITE SAME */
unsigned no_write_same:1;
 
+   /* tell scsi core we support blk-mq only */
+   unsigned force_blk_mq:1;
+
/*
 * Countdown for host blocking with no commands outstanding.
 */
-- 
2.9.5



[PATCH 2/5] blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS

2018-02-02 Thread Ming Lei
Quite a few HBAs (such as HPSA, megaraid, mpt3sas, ...) support multiple
reply queues, but the tag space is often HBA-wide.

These HBAs have switched to use pci_alloc_irq_vectors(PCI_IRQ_AFFINITY)
for automatic affinity assignment.

Now that 84676c1f21e8ff5 ("genirq/affinity: assign vectors to all possible CPUs")
has been merged for v4.16-rc, it can easily happen that some irq vectors are
assigned only offline CPUs; this can't be avoided even if the allocation
is improved.

So all these drivers have to avoid asking the HBA to complete requests on a
reply queue that has no online CPUs assigned, and HPSA has been broken
with v4.15+:

https://marc.info/?l=linux-kernel&m=151748144730409&w=2

This issue can be solved generically and easily via blk-mq (scsi_mq) multiple
hw queues by mapping each reply queue to a hctx, but one tricky thing is
the HBA-wide (instead of hw-queue-wide) tags.

This patch is based on the following Hannes's patch:

https://marc.info/?l=linux-block&m=149132580511346&w=2

One big difference from Hannes's patch is that this one only makes the tags
sbitmap and the active_queues data structure HBA-wide; everything else
(request, hctx, tags, ...) keeps its NUMA locality.

The following patch adds global-tags support to null_blk and provides
performance data; no obvious performance loss is observed when the whole
hw queue depth is kept the same.
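The sharing model can be sketched in userspace C (hypothetical structures and names; the real code uses sbitmap and per-hctx blk_mq_tags): the tag bitmap lives once per HBA while each hw queue merely points at it, so a tag taken through one queue is unavailable through every other queue.

```c
#include <assert.h>

#define HBA_DEPTH 4

/* HBA-wide state: one tag bitmap shared by all hw queues (the sketch's
 * stand-in for set->global_tags' bitmap_tags). */
struct global_tags { unsigned long bitmap; };

/* Per-hctx state stays NUMA-local; only the pointer is shared. */
struct hw_queue { struct global_tags *tags; };

static int tag_get(struct hw_queue *hq)
{
	for (int i = 0; i < HBA_DEPTH; i++)
		if (!(hq->tags->bitmap & (1UL << i))) {
			hq->tags->bitmap |= 1UL << i;
			return i;
		}
	return -1;	/* HBA-wide depth exhausted */
}

static void tag_put(struct hw_queue *hq, int tag)
{
	hq->tags->bitmap &= ~(1UL << tag);
}
```

This is why blk_mq_free_tags() below must skip freeing the bitmaps when tags->global_tags is set: the storage is owned by the set, not by any single hw queue.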

Cc: Hannes Reinecke 
Cc: Arun Easi 
Cc: Omar Sandoval ,
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Laurence Oberman 
Cc: Mike Snitzer 
Signed-off-by: Ming Lei 
---
 block/blk-mq-debugfs.c |  1 +
 block/blk-mq-sched.c   |  2 +-
 block/blk-mq-tag.c | 23 ++-
 block/blk-mq-tag.h |  5 -
 block/blk-mq.c | 29 -
 block/blk-mq.h |  3 ++-
 include/linux/blk-mq.h |  2 ++
 7 files changed, 52 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 0dfafa4b655a..0f0fafe03f5d 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -206,6 +206,7 @@ static const char *const hctx_flag_name[] = {
HCTX_FLAG_NAME(SHOULD_MERGE),
HCTX_FLAG_NAME(TAG_SHARED),
HCTX_FLAG_NAME(SG_MERGE),
+   HCTX_FLAG_NAME(GLOBAL_TAGS),
HCTX_FLAG_NAME(BLOCKING),
HCTX_FLAG_NAME(NO_SCHED),
 };
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 55c0a745b427..191d4bc95f0e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -495,7 +495,7 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
int ret;
 
hctx->sched_tags = blk_mq_alloc_rq_map(set, hctx_idx, q->nr_requests,
-  set->reserved_tags);
+  set->reserved_tags, false);
if (!hctx->sched_tags)
return -ENOMEM;
 
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 571797dc36cb..66377d09eaeb 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -379,9 +379,11 @@ static struct blk_mq_tags *blk_mq_init_bitmap_tags(struct 
blk_mq_tags *tags,
return NULL;
 }
 
-struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
+struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *set,
+unsigned int total_tags,
 unsigned int reserved_tags,
-int node, int alloc_policy)
+int node, int alloc_policy,
+bool global_tag)
 {
struct blk_mq_tags *tags;
 
@@ -397,6 +399,14 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int 
total_tags,
tags->nr_tags = total_tags;
tags->nr_reserved_tags = reserved_tags;
 
+   WARN_ON(global_tag && !set->global_tags);
+   if (global_tag && set->global_tags) {
+   tags->bitmap_tags = set->global_tags->bitmap_tags;
+   tags->breserved_tags = set->global_tags->breserved_tags;
+   tags->active_queues = set->global_tags->active_queues;
+   tags->global_tags = true;
+   return tags;
+   }
tags->bitmap_tags = &tags->__bitmap_tags;
	tags->breserved_tags = &tags->__breserved_tags;
	tags->active_queues = &tags->__active_queues;
@@ -406,8 +416,10 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int 
total_tags,
 
 void blk_mq_free_tags(struct blk_mq_tags *tags)
 {
-   sbitmap_queue_free(tags->bitmap_tags);
-   sbitmap_queue_free(tags->breserved_tags);
+   if (!tags->global_tags) {
+   sbitmap_queue_free(tags->bitmap_tags);
+   

[PATCH 3/5] block: null_blk: introduce module parameter of 'g_global_tags'

2018-02-02 Thread Ming Lei
This patch introduces the 'g_global_tags' module parameter so that this
feature can be tested easily via null_blk.

No obvious performance drop is seen with global_tags when the whole hw
queue depth is kept the same:

1) no 'global_tags', each hw queue depth is 1, and 4 hw queues
modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 global_tags=0 
submit_queues=4 hw_queue_depth=1

2) 'global_tags', global hw queue depth is 4, and 4 hw queues
modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 global_tags=1 
submit_queues=4 hw_queue_depth=4

3) fio test done in above two settings:
   fio --bs=4k --size=512G  --rw=randread --norandommap --direct=1 
--ioengine=libaio --iodepth=4 --runtime=$RUNTIME --group_reporting=1  
--name=nullb0 --filename=/dev/nullb0 --name=nullb1 --filename=/dev/nullb1 
--name=nullb2 --filename=/dev/nullb2 --name=nullb3 --filename=/dev/nullb3

1M IOPS can be reached in both of the above settings; the tests were run in one VM.

Cc: Hannes Reinecke 
Cc: Arun Easi 
Cc: Omar Sandoval ,
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Laurence Oberman 
Cc: Mike Snitzer 
Signed-off-by: Ming Lei 
---
 drivers/block/null_blk.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 287a09611c0f..ad0834efad42 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -163,6 +163,10 @@ static int g_submit_queues = 1;
 module_param_named(submit_queues, g_submit_queues, int, S_IRUGO);
 MODULE_PARM_DESC(submit_queues, "Number of submission queues");
 
+static int g_global_tags = 0;
+module_param_named(global_tags, g_global_tags, int, S_IRUGO);
+MODULE_PARM_DESC(global_tags, "All submission queues share one tag set");
+
 static int g_home_node = NUMA_NO_NODE;
 module_param_named(home_node, g_home_node, int, S_IRUGO);
 MODULE_PARM_DESC(home_node, "Home node for the device");
@@ -1622,6 +1626,8 @@ static int null_init_tag_set(struct nullb *nullb, struct 
blk_mq_tag_set *set)
set->flags = BLK_MQ_F_SHOULD_MERGE;
if (g_no_sched)
set->flags |= BLK_MQ_F_NO_SCHED;
+   if (g_global_tags)
+   set->flags |= BLK_MQ_F_GLOBAL_TAGS;
set->driver_data = NULL;
 
if ((nullb && nullb->dev->blocking) || g_blocking)
-- 
2.9.5



[PATCH 1/5] blk-mq: tags: define several fields of tags as pointer

2018-02-02 Thread Ming Lei
This patch changes tags->breserved_tags, tags->bitmap_tags and
tags->active_queues to pointers, preparing for global tags support.

No functional change.

Cc: Laurence Oberman 
Cc: Mike Snitzer 
Cc: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 block/bfq-iosched.c|  4 ++--
 block/blk-mq-debugfs.c | 10 +-
 block/blk-mq-tag.c | 48 ++--
 block/blk-mq-tag.h | 10 +++---
 block/blk-mq.c |  2 +-
 block/kyber-iosched.c  |  2 +-
 6 files changed, 42 insertions(+), 34 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 47e6ec7427c4..1e1211814a57 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -534,9 +534,9 @@ static void bfq_limit_depth(unsigned int op, struct 
blk_mq_alloc_data *data)
WARN_ON_ONCE(1);
return;
}
-   bt = &tags->breserved_tags;
+   bt = tags->breserved_tags;
} else
-   bt = &tags->bitmap_tags;
+   bt = tags->bitmap_tags;
 
if (unlikely(bfqd->sb_shift != bt->sb.shift))
bfq_update_depths(bfqd, bt);
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 21cbc1f071c6..0dfafa4b655a 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -433,14 +433,14 @@ static void blk_mq_debugfs_tags_show(struct seq_file *m,
seq_printf(m, "nr_tags=%u\n", tags->nr_tags);
seq_printf(m, "nr_reserved_tags=%u\n", tags->nr_reserved_tags);
seq_printf(m, "active_queues=%d\n",
-  atomic_read(&tags->active_queues));
+  atomic_read(tags->active_queues));
 
seq_puts(m, "\nbitmap_tags:\n");
-   sbitmap_queue_show(&tags->bitmap_tags, m);
+   sbitmap_queue_show(tags->bitmap_tags, m);
 
if (tags->nr_reserved_tags) {
seq_puts(m, "\nbreserved_tags:\n");
-   sbitmap_queue_show(&tags->breserved_tags, m);
+   sbitmap_queue_show(tags->breserved_tags, m);
}
 }
 
@@ -471,7 +471,7 @@ static int hctx_tags_bitmap_show(void *data, struct 
seq_file *m)
if (res)
goto out;
if (hctx->tags)
-   sbitmap_bitmap_show(&hctx->tags->bitmap_tags.sb, m);
+   sbitmap_bitmap_show(&hctx->tags->bitmap_tags->sb, m);
mutex_unlock(>sysfs_lock);
 
 out:
@@ -505,7 +505,7 @@ static int hctx_sched_tags_bitmap_show(void *data, struct 
seq_file *m)
if (res)
goto out;
if (hctx->sched_tags)
-   sbitmap_bitmap_show(&hctx->sched_tags->bitmap_tags.sb, m);
+   sbitmap_bitmap_show(&hctx->sched_tags->bitmap_tags->sb, m);
mutex_unlock(>sysfs_lock);
 
 out:
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 336dde07b230..571797dc36cb 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -18,7 +18,7 @@ bool blk_mq_has_free_tags(struct blk_mq_tags *tags)
if (!tags)
return true;
 
-   return sbitmap_any_bit_clear(&tags->bitmap_tags.sb);
+   return sbitmap_any_bit_clear(&tags->bitmap_tags->sb);
 }
 
 /*
@@ -28,7 +28,7 @@ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
	    !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
-   atomic_inc(&hctx->tags->active_queues);
+   atomic_inc(hctx->tags->active_queues);
 
return true;
 }
@@ -38,9 +38,9 @@ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
  */
 void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool include_reserve)
 {
-   sbitmap_queue_wake_all(&tags->bitmap_tags);
+   sbitmap_queue_wake_all(tags->bitmap_tags);
if (include_reserve)
-   sbitmap_queue_wake_all(&tags->breserved_tags);
+   sbitmap_queue_wake_all(tags->breserved_tags);
 }
 
 /*
@@ -54,7 +54,7 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
return;
 
-   atomic_dec(&tags->active_queues);
+   atomic_dec(tags->active_queues);
 
blk_mq_tag_wakeup_all(tags, false);
 }
@@ -79,7 +79,7 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
if (bt->sb.depth == 1)
return true;
 
-   users = atomic_read(&hctx->tags->active_queues);
+   users = atomic_read(hctx->tags->active_queues);
if (!users)
return true;
 
@@ -117,10 +117,10 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data 
*data)
WARN_ON_ONCE(1);
return BLK_MQ_TAG_FAIL;
}
-   bt = &tags->breserved_tags;
+   bt = tags->breserved_tags;
tag_offset = 0;
} else {
-   bt = &tags->bitmap_tags;
+   bt = tags->bitmap_tags;
tag_offset = tags->nr_reserved_tags;
}
 
@@ -165,9 +165,9 @@ unsigned int blk_mq_get_tag(struct 

[PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-02 Thread Ming Lei
Hi All,

This patchset supports global tags which was started by Hannes originally:

https://marc.info/?l=linux-block=149132580511346=2

Also introduce 'force_blk_mq' in 'struct scsi_host_template', so that
drivers can avoid supporting two IO paths (legacy and blk-mq), especially
since recent discussion mentioned that SCSI_MQ will be enabled by default soon.

https://marc.info/?l=linux-scsi&m=151727684915589&w=2

With the above two changes, it should be easier to convert SCSI drivers'
reply queues into blk-mq hctxs; then the automatic irq affinity issue can
be solved easily, please see the detailed description in the commit logs.

Also, drivers may need to complete requests on the submission CPU to
avoid hard/soft lockups, which can be done easily with blk-mq too.

https://marc.info/?t=15160185141&r=1&w=2

The final patch uses the introduced 'force_blk_mq' to fix virtio_scsi
so that the IO hang inside the legacy IO path can be avoided; this issue is
a bit generic, and at least HPSA and virtio-scsi are found broken with v4.15+.

Thanks
Ming

Ming Lei (5):
  blk-mq: tags: define several fields of tags as pointer
  blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS
  block: null_blk: introduce module parameter of 'g_global_tags'
  scsi: introduce force_blk_mq
  scsi: virtio_scsi: fix IO hang by irq vector automatic affinity

 block/bfq-iosched.c|  4 +--
 block/blk-mq-debugfs.c | 11 
 block/blk-mq-sched.c   |  2 +-
 block/blk-mq-tag.c | 67 +-
 block/blk-mq-tag.h | 15 ---
 block/blk-mq.c | 31 -
 block/blk-mq.h |  3 ++-
 block/kyber-iosched.c  |  2 +-
 drivers/block/null_blk.c   |  6 +
 drivers/scsi/hosts.c   |  1 +
 drivers/scsi/virtio_scsi.c | 59 +++-
 include/linux/blk-mq.h |  2 ++
 include/scsi/scsi_host.h   |  3 +++
 13 files changed, 105 insertions(+), 101 deletions(-)

-- 
2.9.5




[GIT PULL] SCSI postmerge updates for the 4.15+ merge window

2018-02-02 Thread James Bottomley
This is a set of three patches that depended on mq and zone changes in
the block tree (now upstream).

The patch is available here:

git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git scsi-postmerge

The short changelog is:

Bart Van Assche (1):
  scsi: scsi-mq-debugfs: Show more information

Damien Le Moal (2):
  scsi: sd: Remove zone write locking
  scsi: sd_zbc: Initialize device request queue zoned data

And the diffstat:

 drivers/scsi/scsi_debugfs.c |  40 +++-
 drivers/scsi/sd.c   |  41 +---
 drivers/scsi/sd.h   |  11 ---
 drivers/scsi/sd_zbc.c   | 235 +++-
 include/scsi/scsi_cmnd.h|   3 +-
 5 files changed, 187 insertions(+), 143 deletions(-)

James



Re: [PATCH 2/3] virtio-scsi: Add FC transport class

2018-02-02 Thread Steffen Maier


On 02/02/2018 05:00 PM, Hannes Reinecke wrote:

On 01/26/2018 05:54 PM, Steffen Maier wrote:

On 12/18/2017 09:31 AM, Hannes Reinecke wrote:

On 12/15/2017 07:08 PM, Steffen Maier wrote:

On 12/14/2017 11:11 AM, Hannes Reinecke wrote:



To me, this raises the question which properties of the host's FC
(driver core) objects should be mirrored to the guest. Ideally all (and
that's a lot).
This in turn makes me wonder if mirroring is really desirable (e.g.
considering the effort) or if only the guest should have its own FC
object hierarchy which does _not_ exist on the KVM host in case an
fc_host is passed through with virtio-(v)fc.



A few more thoughts on your presentation [1]:

"Devices on the vport will not be visible on the host"
I could not agree more to the design point that devices (or at least
their descendant object subtree) passed through to a guest should not
appear on the host!
With virtio-blk or virtio-scsi, we have SCSI devices and thus disks
visible in the host, which needlessly scans partitions, or even worse
automatically scans for LVM and maybe even activates PVs/VGs/LVs. It's
hard for a KVM host admin to suppress this (and not break the devices
the host needs itself).
If we mirror the host's scsi_transport_fc tree including fc_rports and
thus SCSI devices etc., we would have the same problems?
Even more so, dev_loss_tmo and fast_io_fail_tmo would run independently
on the host and in the guest on the same mirrored scsi_transport_fc
object tree. I can envision user confusion having configured timeouts on
the "wrong" side (host vs. guest). Also we would still need a mechanism
to mirror fc_rport (un)block from host to guest for proper transport
recovery. In zfcp we try to recover on transport rather than scsi_eh
whenever possible because it is so much smoother.


As similar thing can be achieved event today, by setting the
'no_uld_attach' parameter when scanning the scsi device
(that's what some RAID HBAs do).
However, there currently is no way of modifying it from user-space, and
certainly not to change the behaviour for existing devices.
It should be relatively simple to set this flag whenever the host is
exposed to a VM; we would still see the scsi devices, but the 'sd'
driver won't be attached so nothing will scan the device on the host.


Ah, nice, didn't know that. It would solve the undesired I/O problem in 
the host.
But it would not solve the so far somewhat unsynchronized state 
transitions of fc_rports on the host and their mirrors in the guest?


I would be very interested in how you intend to do transport recovery.


"Towards virtio-fc?"
Using the FCP_CMND_IU (instead of just a plain SCB as with virtio-scsi)
sounds promising to me as starting point.
A listener from the audience asked if you would also do ELS/CT in the
guest and you replied that this would not be good. Why is that?
Based on above starting point, doing ELS/CT (and basic aborts and maybe
a few other functions such as open/close ports or metadata transfer
commands) in the guest is exactly what I would have expected. An HBA
LLDD on the KVM host would implement such API and for all fc_hosts,
passed through this way, it would *not* establish any scsi_transport_fc
tree on the host. Instead the one virtio-vfc implementation in the guest
would do this indendently of which HBA LLDD provides the passed through
fc_host in the KVM host.
ELS/CT pass through is maybe even for free via FC_BSG for those LLDDs
that already implement it.
Rport open/close is just the analogon of slave_alloc()/slave_destroy().


I'm not convinced that moving to full virtio-fc is something we want or
even can do.
Neither qla2xxx nor lpfc allow for direct FC frame access; so one would
need to reformat the FC frames into something the driver understands,
just so that the hardware can transform it back into FC frames.


I thought of a more high level para-virtualized FCP HBA interface, than 
FC frames (which did exist in kernel v2.4 under drivers/fc4/ but no 
longer as it seems). Just like large parts of today's FCP LLDDs handle 
scatter gather lists and framing is done by the hardware.



Another thing is xid management; some drivers have to do their own xid
management, based on hardware capabilities etc.
So the FC frames would need to re-write the xids, making it hard if not
impossible to match things up when the response comes in.


For such things, where the hardware exposes more details (than, say, 
zfcp sees) I thought the LLDD on the KVM host would handle such details 
internally and only expose the higher level interface to virtio-fc.


Maybe something roughly like the basic transport protocol part of 
ibmvfc/ibmvscsi (not the other end in the firmware and not the cross 
partition DMA part), if I understood its overall design correctly by 
quickly looking at the code.
I somewhat had the impression that zfcp isn't too far from the overall 
operations style. As seem qla2xxx or lpfc to me, they just see and need 
to handle some more low-level FC 

Re: [PATCH v4 02/10] ufs: sysfs: device descriptor

2018-02-02 Thread Bart Van Assche
On Fri, 2018-02-02 at 08:17 +0100, gre...@linuxfoundation.org wrote:
> On Fri, Feb 02, 2018 at 12:25:46AM +, Bart Van Assche wrote:
> > On Thu, 2018-02-01 at 18:15 +0200, Stanislav Nijnikov wrote:
> > > +enum ufs_desc_param_size {
> > > + UFS_PARAM_BYTE_SIZE = 1,
> > > + UFS_PARAM_WORD_SIZE = 2,
> > > + UFS_PARAM_DWORD_SIZE= 4,
> > > + UFS_PARAM_QWORD_SIZE= 8,
> > > +};
> > 
> > Please do not copy bad naming choices from the Windows kernel into the Linux
> > kernel. Using names like WORD / DWORD / QWORD is much less readable than 
> > using
> > the numeric constants 2, 4, 8. Hence my proposal to leave out the above enum
> > completely.
> 
> Are you sure those do not come from the spec itself?  It's been a while
> since I last read it, but for some reason I remember those types of
> names being in there.  But I might be confusing specs here.

Hello Greg,

That's a good question. However, a quick search on the Internet for the search
phrase "Universal Flash Storage" "qword" did not yield any results about UFS in
the first ten search hits. And I haven't found any references to the DWORD /
QWORD terminology in the "UNIVERSAL FLASH STORAGE HOST CONTROLLER INTERFACE
(UFSHCI), UNIFIED MEMORY EXTENSION, Version 1.1" document either. Maybe that
means that I was looking at the wrong document?

Thanks,

Bart.





Re: [PATCH 2/3] virtio-scsi: Add FC transport class

2018-02-02 Thread Hannes Reinecke
On 01/26/2018 05:54 PM, Steffen Maier wrote:
> On 12/18/2017 09:31 AM, Hannes Reinecke wrote:
>> On 12/15/2017 07:08 PM, Steffen Maier wrote:
>>> On 12/14/2017 11:11 AM, Hannes Reinecke wrote:
 When a device announces an 'FC' protocol we should be pulling
 in the FC transport class to have the rports etc setup correctly.
>>>
>>> It took some time for me to understand what this does.
>>> It seems to mirror the topology of rports and sdevs that exist under the
>>> fc_host on the kvm host side to the virtio-scsi-fc side in the guest.
>>>
>>> I like the idea. This is also what I've been suggesting users to do if
>>> they back virtio-scsi with zfcp on the kvm host side. Primarily to not
>>> stall all virtio-scsi I/O on all paths if the guest ever gets into
>>> scsi_eh. But also to make it look like an HBA pass through so one can
>>> more easily migrate to this once we have FCP pass through.
> 
> On second thought, I like the idea for virtio-scsi.
> 
> For the future virtio-(v)fc case, see below.
> 
 @@ -755,19 +823,34 @@ static int virtscsi_abort(struct scsi_cmnd *sc)
>>>
 +    if (vscsi->protocol == SCSI_PROTOCOL_FCP) {
 +    struct fc_rport *rport =
 +    starget_to_rport(scsi_target(sc->device));
 +    if (rport && rport->dd_data ) {
 +    tgt = rport->dd_data;
 +    target_id = tgt->target_id;
 +    } else
 +    return FAST_IO_FAIL;
 +    } else {
 +    tgt = scsi_target(sc->device)->hostdata;
 +    if (!tgt || tgt->removed)
 +    return FAST_IO_FAIL;
 +    }
>>>
>>> dito
>>>
 @@ -857,27 +970,67 @@ static void virtscsi_rescan_work(struct
 work_struct *work)

    wait_for_completion();
>>>
>>> Waiting in work item .vs. having the response (IRQ) path trigger
>>> subsequent processing async ?
>>> Or do we need the call chain(s) getting here to be in our own process
>>> context via the workqueue anyway?
>>>
>> Can't see I can parse this sentence, but I'll be looking at the code
>> trying to come up with a clever explanation :-)
> 
> Sorry, meanwhile I have a hard time understanding my own words, too.
> 
> I think I wondered if the effort of a work item is really necessary,
> especially considering that it does block on the completion and thus
> could delay other queued work items (even though Concurrency Managed
> Workqueues can often hide this delay).
> 
> Couldn't we just return asynchronously after having sent the request.
> And then later on, simply have the response (IRQ) path trigger whatever
> processing is necessary (after the work item variant woke up from the
> wait_for_completion) in some asynchronuous fashion? Of course, this
> could also be a work item which just does necessary remaining processing
> after we got a response.
> Just a wild guess, without knowing the environmental requirements.
> 
The main point here was that we get off the irq-completion handler; if
we were to trigger this directly we would be running in an interrupt
context, which then will make things like spin_lock, mutex et al tricky.
Plus it's not really time critical; during rescanning all I/O should be
blocked, so we shouldn't have anything else going on.

 +    if (transport == SCSI_PROTOCOL_FCP) {
 +    struct fc_rport_identifiers rport_ids;
 +    struct fc_rport *rport;
 +
 +    rport_ids.node_name = wwn_to_u64(cmd->resp.rescan.node_wwn);
 +    rport_ids.port_name = wwn_to_u64(cmd->resp.rescan.port_wwn);
 +    rport_ids.port_id = (target_id >> 8);
>>>
>>> Why do you shift target_id by one byte to the right?
>>>
>> Because with the original setup virtio_scsi guest would pass in the
>> target_id, and the host would be selecting the device based on that
>> information.
>> With virtio-vfc we pass in the wwpn, but still require the target ID to
>> be compliant with things like event notification etc.
> 
> Don't we need the true N_Port-ID, then? That's what an fc_rport.port_id
> usually contains. It's also a simple way to lookup resources on a SAN
> switch for problem determination. Or did I misunderstand the
> content/semantics of the variable target_id, assuming it's a SCSI target
> ID, i.e. the 3rd part of a HCTL 4-tuple?
> 
Yes, that was the idea.
I've now modified the code so that we can pick up the port id for both,
target and host port. That should satisfy the needs here.

>> So I've shifted the target id onto the port ID (which is 24 bit anyway).
>> I could've used a bitfield here, but then I wasn't quite sure about the
>> endianness of which.
> 
 +    rport = fc_remote_port_add(sh, 0, &rport_ids);
 +    if (rport) {
 +    tgt->rport = rport;
 +    rport->dd_data = tgt;
 +    fc_remote_port_rolechg(rport, FC_RPORT_ROLE_FCP_TARGET);
>>>
>>> Is the rolechg to get some event? Otherwise we could have
>>> rport_ids.roles = FC_RPORT_ROLE_FCP_TARGET before fc_remote_port_add().
>>>
>> That's 

[PATCH 6/6] scsi: qedf: use correct strncpy() size

2018-02-02 Thread Arnd Bergmann
gcc-8 warns during link-time optimization that the strncpy() call
passes the size of the source buffer rather than the destination:

drivers/scsi/qedf/qedf_dbg.c: In function 'qedf_uevent_emit':
include/linux/string.h:253: error: 'strncpy' specified bound depends on the 
length of the source argument [-Werror=stringop-overflow=]

This changes it to strscpy() with the correct length, guaranteeing
a properly nul-terminated string of the right size.

Signed-off-by: Arnd Bergmann 
---
 drivers/scsi/qedf/qedf_dbg.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/qedf/qedf_dbg.c b/drivers/scsi/qedf/qedf_dbg.c
index e023f5d0dc12..bd1cef25a900 100644
--- a/drivers/scsi/qedf/qedf_dbg.c
+++ b/drivers/scsi/qedf/qedf_dbg.c
@@ -160,7 +160,7 @@ qedf_uevent_emit(struct Scsi_Host *shost, u32 code, char 
*msg)
switch (code) {
case QEDF_UEVENT_CODE_GRCDUMP:
if (msg)
-   strncpy(event_string, msg, strlen(msg));
+   strscpy(event_string, msg, sizeof(event_string));
else
sprintf(event_string, "GRCDUMP=%u", shost->host_no);
break;
-- 
2.9.0



[PATCH 3/6] scsi: sym53c416: avoid section mismatch with LTO

2018-02-02 Thread Arnd Bergmann
Building with link time optimizations produces a false-positive section
mismatch warning:

WARNING: vmlinux.o(.data+0xf8c8): Section mismatch in reference from the 
variable driver_template.lto_priv.6915 to the function 
.init.text:sym53c416_detect()
The variable driver_template.lto_priv.6915 references
the function __init sym53c416_detect()
If the reference is valid then annotate the
variable with __init* or __refdata (see linux/init.h) or name the variable:
*_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console

The ->detect callback is always entered from the init_this_scsi_driver()
init function, but apparently LTO turns the optimized direct function
call into an indirect call through a non-__initdata pointer.

All drivers using init_this_scsi_driver() are for ancient hardware,
and most don't mark the detect() callback as __init(), so I'm
just removing the annotation here to kill off the warning instead
of doing a larger rework.

Signed-off-by: Arnd Bergmann 
---
 drivers/scsi/sym53c416.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/sym53c416.c b/drivers/scsi/sym53c416.c
index 5bdcbe8fa958..e68bcdc75bc3 100644
--- a/drivers/scsi/sym53c416.c
+++ b/drivers/scsi/sym53c416.c
@@ -608,7 +608,7 @@ static void sym53c416_probe(void)
}
 }
 
-int __init sym53c416_detect(struct scsi_host_template *tpnt)
+int sym53c416_detect(struct scsi_host_template *tpnt)
 {
unsigned long flags;
struct Scsi_Host * shpnt = NULL;
-- 
2.9.0



[PATCH 5/6] scsi: qedi: fix building with LTO

2018-02-02 Thread Arnd Bergmann
When link-time optimizations are enabled, qedi fails to build because
of mismatched prototypes:

drivers/scsi/qedi/qedi_gbl.h:27:37: error: type of 'qedi_dbg_fops' does not 
match original declaration [-Werror=lto-type-mismatch]
 extern const struct file_operations qedi_dbg_fops;
 ^
drivers/scsi/qedi/qedi_debugfs.c:239:30: note: 'qedi_dbg_fops' was previously 
declared here
 const struct file_operations qedi_dbg_fops[] = {
  ^
drivers/scsi/qedi/qedi_gbl.h:26:32: error: type of 'qedi_debugfs_ops' does not 
match original declaration [-Werror=lto-type-mismatch]
 extern struct qedi_debugfs_ops qedi_debugfs_ops;
^
drivers/scsi/qedi/qedi_debugfs.c:102:25: note: 'qedi_debugfs_ops' was 
previously declared here
 struct qedi_debugfs_ops qedi_debugfs_ops[] = {

This changes the declaration to match the definition, and adapts the
users as necessary. Since both arrays can be constant here, I'm adding
the 'const' everywhere for consistency.

Signed-off-by: Arnd Bergmann 
---
 drivers/scsi/qedi/qedi_dbg.h | 2 +-
 drivers/scsi/qedi/qedi_debugfs.c | 4 ++--
 drivers/scsi/qedi/qedi_gbl.h | 4 ++--
 drivers/scsi/qedi/qedi_main.c| 4 ++--
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/scsi/qedi/qedi_dbg.h b/drivers/scsi/qedi/qedi_dbg.h
index c55572badfb0..358f40567849 100644
--- a/drivers/scsi/qedi/qedi_dbg.h
+++ b/drivers/scsi/qedi/qedi_dbg.h
@@ -134,7 +134,7 @@ struct qedi_debugfs_ops {
 }
 
 void qedi_dbg_host_init(struct qedi_dbg_ctx *qedi,
-   struct qedi_debugfs_ops *dops,
+   const struct qedi_debugfs_ops *dops,
const struct file_operations *fops);
 void qedi_dbg_host_exit(struct qedi_dbg_ctx *qedi);
 void qedi_dbg_init(char *drv_name);
diff --git a/drivers/scsi/qedi/qedi_debugfs.c b/drivers/scsi/qedi/qedi_debugfs.c
index fd8a1eea3163..fd914ca4149a 100644
--- a/drivers/scsi/qedi/qedi_debugfs.c
+++ b/drivers/scsi/qedi/qedi_debugfs.c
@@ -19,7 +19,7 @@ static struct dentry *qedi_dbg_root;
 
 void
 qedi_dbg_host_init(struct qedi_dbg_ctx *qedi,
-  struct qedi_debugfs_ops *dops,
+  const struct qedi_debugfs_ops *dops,
   const struct file_operations *fops)
 {
char host_dirname[32];
@@ -99,7 +99,7 @@ static struct qedi_list_of_funcs 
qedi_dbg_do_not_recover_ops[] = {
{ NULL, NULL }
 };
 
-struct qedi_debugfs_ops qedi_debugfs_ops[] = {
+const struct qedi_debugfs_ops qedi_debugfs_ops[] = {
{ "gbl_ctx", NULL },
{ "do_not_recover", qedi_dbg_do_not_recover_ops},
{ "io_trace", NULL },
diff --git a/drivers/scsi/qedi/qedi_gbl.h b/drivers/scsi/qedi/qedi_gbl.h
index f5b5a31999aa..a2aa06ed1620 100644
--- a/drivers/scsi/qedi/qedi_gbl.h
+++ b/drivers/scsi/qedi/qedi_gbl.h
@@ -23,8 +23,8 @@ extern uint qedi_io_tracing;
 extern struct scsi_host_template qedi_host_template;
 extern struct iscsi_transport qedi_iscsi_transport;
 extern const struct qed_iscsi_ops *qedi_ops;
-extern struct qedi_debugfs_ops qedi_debugfs_ops;
-extern const struct file_operations qedi_dbg_fops;
+extern const struct qedi_debugfs_ops qedi_debugfs_ops[];
+extern const struct file_operations qedi_dbg_fops[];
 extern struct device_attribute *qedi_shost_attrs[];
 
 int qedi_alloc_sq(struct qedi_ctx *qedi, struct qedi_endpoint *ep);
diff --git a/drivers/scsi/qedi/qedi_main.c b/drivers/scsi/qedi/qedi_main.c
index 029e2e69b29f..e992f9d3ef00 100644
--- a/drivers/scsi/qedi/qedi_main.c
+++ b/drivers/scsi/qedi/qedi_main.c
@@ -2303,8 +2303,8 @@ static int __qedi_probe(struct pci_dev *pdev, int mode)
}
 
 #ifdef CONFIG_DEBUG_FS
-   qedi_dbg_host_init(&qedi->dbg_ctx, &qedi_debugfs_ops,
-  &qedi_dbg_fops);
+   qedi_dbg_host_init(&qedi->dbg_ctx, qedi_debugfs_ops,
+  qedi_dbg_fops);
 #endif
QEDI_INFO(>dbg_ctx, QEDI_LOG_INFO,
  "QLogic FastLinQ iSCSI Module qedi %s, FW %d.%d.%d.%d\n",
-- 
2.9.0



[PATCH 4/6] scsi: qedf: fix LTO-enabled build

2018-02-02 Thread Arnd Bergmann
The prototype for qedf_dbg_fops/qedf_debugfs_ops doesn't match the definition,
which causes the final link to fail with link-time optimizations:

drivers/scsi/qedf/qedf_main.c:34: error: type of 'qedf_dbg_fops' does not match 
original declaration [-Werror=lto-type-mismatch]
 extern struct file_operations qedf_dbg_fops;

drivers/scsi/qedf/qedf_debugfs.c:443: note: 'qedf_dbg_fops' was previously 
declared here
 const struct file_operations qedf_dbg_fops[] = {

drivers/scsi/qedf/qedf_main.c:33: error: type of 'qedf_debugfs_ops' does not 
match original declaration [-Werror=lto-type-mismatch]
 extern struct qedf_debugfs_ops qedf_debugfs_ops;

drivers/scsi/qedf/qedf_debugfs.c:102: note: 'qedf_debugfs_ops' was previously 
declared here
 struct qedf_debugfs_ops qedf_debugfs_ops[] = {

This corrects the prototype and moves it into a shared header file where it
belongs. The file operations can also be marked 'const' like the
qedf_debugfs_ops.

Signed-off-by: Arnd Bergmann 
---
 drivers/scsi/qedf/qedf_dbg.h | 17 ++---
 drivers/scsi/qedf/qedf_debugfs.c |  6 +++---
 drivers/scsi/qedf/qedf_main.c|  8 +++-
 3 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/drivers/scsi/qedf/qedf_dbg.h b/drivers/scsi/qedf/qedf_dbg.h
index 50083cae84c3..77c27e888969 100644
--- a/drivers/scsi/qedf/qedf_dbg.h
+++ b/drivers/scsi/qedf/qedf_dbg.h
@@ -116,6 +116,14 @@ extern int qedf_create_sysfs_attr(struct Scsi_Host *shost,
 extern void qedf_remove_sysfs_attr(struct Scsi_Host *shost,
struct sysfs_bin_attrs *iter);
 
+struct qedf_debugfs_ops {
+   char *name;
+   struct qedf_list_of_funcs *qedf_funcs;
+};
+
+extern const struct qedf_debugfs_ops qedf_debugfs_ops[];
+extern const struct file_operations qedf_dbg_fops[];
+
 #ifdef CONFIG_DEBUG_FS
 /* DebugFS related code */
 struct qedf_list_of_funcs {
@@ -123,11 +131,6 @@ struct qedf_list_of_funcs {
ssize_t (*oper_func)(struct qedf_dbg_ctx *qedf);
 };
 
-struct qedf_debugfs_ops {
-   char *name;
-   struct qedf_list_of_funcs *qedf_funcs;
-};
-
 #define qedf_dbg_fileops(drv, ops) \
 { \
.owner  = THIS_MODULE, \
@@ -147,8 +150,8 @@ struct qedf_debugfs_ops {
 }
 
 extern void qedf_dbg_host_init(struct qedf_dbg_ctx *qedf,
-   struct qedf_debugfs_ops *dops,
-   struct file_operations *fops);
+   const struct qedf_debugfs_ops *dops,
+   const struct file_operations *fops);
 extern void qedf_dbg_host_exit(struct qedf_dbg_ctx *qedf);
 extern void qedf_dbg_init(char *drv_name);
 extern void qedf_dbg_exit(void);
diff --git a/drivers/scsi/qedf/qedf_debugfs.c b/drivers/scsi/qedf/qedf_debugfs.c
index 2b1ef3075e93..c539a7ae3a7e 100644
--- a/drivers/scsi/qedf/qedf_debugfs.c
+++ b/drivers/scsi/qedf/qedf_debugfs.c
@@ -23,8 +23,8 @@ static struct dentry *qedf_dbg_root;
  **/
 void
 qedf_dbg_host_init(struct qedf_dbg_ctx *qedf,
-   struct qedf_debugfs_ops *dops,
-   struct file_operations *fops)
+   const struct qedf_debugfs_ops *dops,
+   const struct file_operations *fops)
 {
char host_dirname[32];
struct dentry *file_dentry = NULL;
@@ -99,7 +99,7 @@ qedf_dbg_exit(void)
qedf_dbg_root = NULL;
 }
 
-struct qedf_debugfs_ops qedf_debugfs_ops[] = {
+const struct qedf_debugfs_ops qedf_debugfs_ops[] = {
{ "fp_int", NULL },
{ "io_trace", NULL },
{ "debug", NULL },
diff --git a/drivers/scsi/qedf/qedf_main.c b/drivers/scsi/qedf/qedf_main.c
index ccd9a08ea030..284ccb566b19 100644
--- a/drivers/scsi/qedf/qedf_main.c
+++ b/drivers/scsi/qedf/qedf_main.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include "qedf.h"
+#include "qedf_dbg.h"
 #include 
 
 const struct qed_fcoe_ops *qed_ops;
@@ -30,9 +31,6 @@ const struct qed_fcoe_ops *qed_ops;
 static int qedf_probe(struct pci_dev *pdev, const struct pci_device_id *id);
 static void qedf_remove(struct pci_dev *pdev);
 
-extern struct qedf_debugfs_ops qedf_debugfs_ops;
-extern struct file_operations qedf_dbg_fops;
-
 /*
  * Driver module parameters.
  */
@@ -3155,8 +3153,8 @@ static int __qedf_probe(struct pci_dev *pdev, int mode)
}
 
 #ifdef CONFIG_DEBUG_FS
-   qedf_dbg_host_init(&(qedf->dbg_ctx), &qedf_debugfs_ops,
-   &qedf_dbg_fops);
+   qedf_dbg_host_init(&(qedf->dbg_ctx), qedf_debugfs_ops,
+   qedf_dbg_fops);
 #endif
 
/* Start LL2 */
-- 
2.9.0



[PATCH 2/6] scsi: NCR53c406a: avoid section mismatch with LTO

2018-02-02 Thread Arnd Bergmann
Building with link time optimizations produces a false-positive section
mismatch warning:

WARNING: vmlinux.o(.data+0xf7e8): Section mismatch in reference from the 
variable driver_template.lto_priv.6914 to the function 
.init.text:NCR53c406a_detect()
The variable driver_template.lto_priv.6914 references
the function __init NCR53c406a_detect()
If the reference is valid then annotate the
variable with __init* or __refdata (see linux/init.h) or name the variable:
*_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console

The ->detect callback is always entered from the init_this_scsi_driver()
init function, but apparently LTO turns the optimized direct function
call into an indirect call through a non-__initdata pointer.

All drivers using init_this_scsi_driver() are for ancient hardware,
and most don't mark the detect() callback as __init(), so I'm
just removing the annotation here to kill off the warning instead
of doing a larger rework.

Signed-off-by: Arnd Bergmann 
---
 drivers/scsi/NCR53c406a.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/NCR53c406a.c b/drivers/scsi/NCR53c406a.c
index 6e110c630d2c..44b09870bf51 100644
--- a/drivers/scsi/NCR53c406a.c
+++ b/drivers/scsi/NCR53c406a.c
@@ -448,7 +448,7 @@ static __inline__ int NCR53c406a_pio_write(unsigned char 
*request, unsigned int
 }
 #endif /* USE_PIO */
 
-static int __init NCR53c406a_detect(struct scsi_host_template * tpnt)
+static int NCR53c406a_detect(struct scsi_host_template * tpnt)
 {
int present = 0;
struct Scsi_Host *shpnt = NULL;
-- 
2.9.0



[PATCH 1/6] scsi: fc_encode: work around strncpy size warnings

2018-02-02 Thread Arnd Bergmann
struct fc_fdmi_attr_entry contains a variable-length string at
the end, which is encoded as a one-byte array.  gcc-8 notices that
we copy strings into it that obviously go beyond that one byte:

In function 'fc_ct_ms_fill',
inlined from 'fc_elsct_send' at include/scsi/fc_encode.h:518:8:
include/scsi/fc_encode.h:275:3: error: 'strncpy' writing 64 bytes into a region 
of size 1 overflows the destination [-Werror=stringop-overflow=]
   strncpy((char *)&entry->value,
   ^
include/scsi/fc_encode.h:287:3: error: 'strncpy' writing 64 bytes into a region 
of size 1 overflows the destination [-Werror=stringop-overflow=]
   strncpy((char *)&entry->value,
   ^

No idea what the right fix is.

Signed-off-by: Arnd Bergmann 
---
 include/scsi/fc/fc_ms.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/scsi/fc/fc_ms.h b/include/scsi/fc/fc_ms.h
index f52b921b5c70..c5614e725a0e 100644
--- a/include/scsi/fc/fc_ms.h
+++ b/include/scsi/fc/fc_ms.h
@@ -128,7 +128,7 @@ struct fc_fdmi_port_name {
 struct fc_fdmi_attr_entry {
__be16  type;
__be16  len;
-   __u8value[1];
+   __u8value[];
 } __attribute__((__packed__));
 
 /*
-- 
2.9.0



[PATCH 0/6] scsi: fixes for building with LTO

2018-02-02 Thread Arnd Bergmann
I experimented with link time optimization after Nico's
article at https://lwn.net/Articles/744507/

Here is a set of patches that came out of it for the scsi
subsystem.

Arnd Bergmann (6):
  scsi: fc_encode: work around strncpy size warnings
  scsi: NCR53c406a: avoid section mismatch with LTO
  scsi: sym53c416: avoid section mismatch with LTO
  scsi: qedf: fix LTO-enabled build
  scsi: qedi: fix building with LTO
  scsi: qedf: use correct strncpy() size

 drivers/scsi/NCR53c406a.c|  2 +-
 drivers/scsi/qedf/qedf_dbg.c |  2 +-
 drivers/scsi/qedf/qedf_dbg.h | 17 ++---
 drivers/scsi/qedf/qedf_debugfs.c |  6 +++---
 drivers/scsi/qedf/qedf_main.c|  8 +++-
 drivers/scsi/qedi/qedi_dbg.h |  2 +-
 drivers/scsi/qedi/qedi_debugfs.c |  4 ++--
 drivers/scsi/qedi/qedi_gbl.h |  4 ++--
 drivers/scsi/qedi/qedi_main.c|  4 ++--
 drivers/scsi/sym53c416.c |  2 +-
 include/scsi/fc/fc_ms.h  |  2 +-
 11 files changed, 27 insertions(+), 26 deletions(-)

-- 
2.9.0



Re: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue

2018-02-02 Thread Ming Lei
Hi Kashyap,

On Fri, Feb 02, 2018 at 05:08:12PM +0530, Kashyap Desai wrote:
> > -Original Message-
> > From: Ming Lei [mailto:ming@redhat.com]
> > Sent: Friday, February 2, 2018 3:44 PM
> > To: Kashyap Desai
> > Cc: linux-scsi@vger.kernel.org; Peter Rivera
> > Subject: Re: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load
> balancing of
> > reply queue
> >
> > Hi Kashyap,
> >
> > On Mon, Jan 15, 2018 at 05:42:05PM +0530, Kashyap Desai wrote:
> > > Hi All -
> > >
> > > We have seen cpu lock up issue from fields if system has greater (more
> > > than 96) logical cpu count.
> > > SAS3.0 controller (Invader series) supports at max 96 msix vector and
> > > SAS3.5 product (Ventura) supports at max 128 msix vectors.
> > >
> > > This may be a generic issue (if PCI device support  completion on
> > > multiple reply queues). Let me explain it w.r.t to mpt3sas supported
> > > h/w just to simplify the problem and possible changes to handle such
> > > issues. IT HBA
> > > (mpt3sas) supports multiple reply queues in completion path. Driver
> > > creates MSI-x vectors for controller as "min of ( FW supported Reply
> > > queue, Logical CPUs)". If submitter is not interrupted via completion
> > > on same CPU, there is a loop in the IO path. This behavior can cause
> > > hard/soft CPU lockups, IO timeout, system sluggish etc.
> >
> > As I mentioned in another thread, this issue may be solved by SCSI_MQ
> via
> > mapping reply queue into hctx of blk_mq, together with
> > QUEUE_FLAG_SAME_FORCE, especially you have set 'smp_affinity_enable' as
> > 1 at default already, then pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) can
> do IRQ
> > vectors spread on CPUs perfectly for you.
> >
> > But the following Hannes's patch is required for the conversion.
> >
> > https://marc.info/?l=linux-block&m=149130770004507&w=2
> >
> 
> Hi Ming -
> 
> I went through the thread discussing "support host-wide tagset". The link
> below has the latest reply on that thread:
> https://marc.info/?l=linux-block&m=149132580511346&w=2
> 
> I think, there is a confusion over mpt3sas and megaraid_sas h/w behavior.
> Broadcom/LSI HBA and MR h/w has only one h/w queue for submission but
> there are multiple reply queue.

That shouldn't be a problem, you still can submit to same hw queue in
all submission paths(all hw queues) just like the current implementation.

> Even though I include Hannes' patch for host-side tagset, the problem
> described in this RFC will not be resolved.  In fact, a tagset can also
> produce the same results if the completion queue count is less than the
> online CPU count.  Don't you think?  Or am I missing anything?

If the reply queue count is less than the online CPU count, more than one
CPU may be mapped to some (or all) of the hw queues, but the completion is
only handled on one of the mapped CPUs, and can be done on the request's
submission CPU via the queue flag QUEUE_FLAG_SAME_FORCE; please see
__blk_mq_complete_request(). Or do you have any requirement other than
completing a request on its submission CPU?

> 
> We don't have a problem in the submission path.  The current problem is
> that an MSI-x vector mapped to more than one CPU can cause an I/O loop.
> This is visible if we have a higher number of online CPUs.

Yeah, I know, as I mentioned above, your requirement of completing
request on its submission CPU can be met with current SCSI_MQ without
much difficulty.

Thanks,
Ming


RE: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue

2018-02-02 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Friday, February 2, 2018 3:44 PM
> To: Kashyap Desai
> Cc: linux-scsi@vger.kernel.org; Peter Rivera
> Subject: Re: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load
balancing of
> reply queue
>
> Hi Kashyap,
>
> On Mon, Jan 15, 2018 at 05:42:05PM +0530, Kashyap Desai wrote:
> > Hi All -
> >
> > We have seen cpu lock up issue from fields if system has greater (more
> > than 96) logical cpu count.
> > SAS3.0 controller (Invader series) supports at max 96 msix vector and
> > SAS3.5 product (Ventura) supports at max 128 msix vectors.
> >
> > This may be a generic issue (if PCI device support  completion on
> > multiple reply queues). Let me explain it w.r.t to mpt3sas supported
> > h/w just to simplify the problem and possible changes to handle such
> > issues. IT HBA
> > (mpt3sas) supports multiple reply queues in completion path. Driver
> > creates MSI-x vectors for controller as "min of ( FW supported Reply
> > queue, Logical CPUs)". If submitter is not interrupted via completion
> > on same CPU, there is a loop in the IO path. This behavior can cause
> > hard/soft CPU lockups, IO timeout, system sluggish etc.
>
> As I mentioned in another thread, this issue may be solved by SCSI_MQ
via
> mapping reply queue into hctx of blk_mq, together with
> QUEUE_FLAG_SAME_FORCE, especially you have set 'smp_affinity_enable' as
> 1 at default already, then pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) can
do IRQ
> vectors spread on CPUs perfectly for you.
>
> But the following Hannes's patch is required for the conversion.
>
>   https://marc.info/?l=linux-block&m=149130770004507&w=2
>

Hi Ming -

I went through the thread discussing "support host-wide tagset". The link
below has the latest reply on that thread:
https://marc.info/?l=linux-block&m=149132580511346&w=2

I think, there is a confusion over mpt3sas and megaraid_sas h/w behavior.
Broadcom/LSI HBA and MR h/w has only one h/w queue for submission but
there are multiple reply queue.
Even though I include Hannes' patch for host-side tagset, the problem
described in this RFC will not be resolved.  In fact, a tagset can also
produce the same results if the completion queue count is less than the
online CPU count.  Don't you think?  Or am I missing anything?

We don't have a problem in the submission path.  The current problem is
that an MSI-x vector mapped to more than one CPU can cause an I/O loop.
This is visible if we have a higher number of online CPUs.

> >
> > Example - one CPU (e.g. CPU A) is busy submitting the IOs and another
> > CPU (e.g. CPU B) is busy with processing the corresponding IO's reply
> > descriptors from reply descriptor queue upon receiving the interrupts
> > from HBA. If the CPU A is continuously pumping the IOs then always CPU
> > B (which is executing the ISR) will see the valid reply descriptors in
> > the reply descriptor queue and it will be continuously processing
> > those reply descriptor in a loop without quitting the ISR handler.
> > Mpt3sas driver will exit ISR handler if it finds unused reply
> > descriptor in the reply descriptor queue. Since CPU A will be
> > continuously sending the IOs, CPU B may always see a valid reply
> > descriptor (posted by HBA Firmware after processing the IO) in the
> > reply descriptor queue. In worst case, driver will not quit from this
> > loop in the ISR handler. Eventually, CPU lockup will be detected by
> watchdog.
> >
> > Above mentioned behavior is not common if "rq_affinity" set to 2 or
> > affinity_hint is honored by irqbalance as "exact".
> > If rq_affinity is set to 2, submitter will be always interrupted via
> > completion on same CPU.
> > If irqbalance is using "exact" policy, interrupt will be delivered to
> > submitter CPU.
>
> Now you have used pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) to get msix
> vector number, the irq affinity can't be changed by userspace any more.
>
> >
> > Problem statement -
> > If CPU counts to MSI-X vectors (reply descriptor Queues) count ratio
> > is not 1:1, we still have  exposure of issue explained above and for
> > that we don't have any solution.
> >
> > Exposure of soft/hard lockup if CPU count is more than MSI-x supported
> > by device.
> >
> > If CPUs count to MSI-x vectors count ratio is not 1:1, (Other way, if
> > CPU counts to MSI-x vector count ratio is something like X:1, where X
> > > 1) then 'exact' irqbalance policy OR rq_affinity = 2 won't help to
> > avoid CPU hard/soft lockups. There won't be any one to one mapping
> > between CPU to MSI-x vector instead one MSI-x interrupt (or reply
> > descriptor queue) is shared with group/set of CPUs and there is a
> > possibility of having a loop in the IO path within that CPU group and
may
> observe lockups.
> >
> > For example: Consider a system having two NUMA nodes and each node
> > having four logical CPUs and also consider that number of MSI-x
> > vectors enabled on the HBA is two, then CPUs count to MSI-x vector
count
> ratio as 4:1.
> > e.g.
> > MSIx vector 0 is affinity to  CPU 0, CPU 1, CPU 2 & 

Re: [LSF/MM TOPIC] irq affinity handling for high CPU count machines

2018-02-02 Thread Ming Lei
Hi Kashyap,

On Fri, Feb 02, 2018 at 02:19:01PM +0530, Kashyap Desai wrote:
> > > > > Today I am looking at one megaraid_sas related issue, and found
> > > > > pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) is used in the driver, so
> > > > > looks each reply queue has been handled by more than one CPU if
> > > > > there are more CPUs than MSIx vectors in the system, which is done
> > > > > by generic irq affinity code, please see kernel/irq/affinity.c.
> > >
> > > Yes. That is a problematic area. If CPU and MSI-x(reply queue) is 1:1
> > > mapped, we don't have any issue.
> >
> > I guess the problematic area is similar with the following link:
> >
> > https://marc.info/?l=linux-kernel&m=151748144730409&w=2
> 
> Hi Ming,
> 
> The above-mentioned link is a different discussion and looks like a generic
> issue. megaraid_sas/mpt3sas will have the same symptoms if irq affinity has
> only offline CPUs.

If you convert to SCSI_MQ/MQ, it is a generic issue, which is solved
by a generic solution; otherwise it is the driver's responsibility to make
sure not to use a reply queue to which no online CPU is mapped.

> Just for info - in such a condition, we can ask users to disable the
> affinity hint via the module parameter "smp_affinity_enable".

Yeah, that is exactly what I suggested to our QE friend.

> 
> >
> > otherwise could you explain a bit about the area?
> 
> Please check below post for more details.
> 
> https://marc.info/?l=linux-scsi&m=151601833418346&w=2

Seems SCSI_MQ/MQ can solve this issue, and I have replied on the above link,
we can discuss on that thread further.


thanks, 
Ming


Re: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue

2018-02-02 Thread Ming Lei
Hi Kashyap,

On Mon, Jan 15, 2018 at 05:42:05PM +0530, Kashyap Desai wrote:
> Hi All -
> 
> We have seen cpu lock up issue from fields if system has greater (more
> than 96) logical cpu count.
> SAS3.0 controller (Invader series) supports at max 96 msix vector and
> SAS3.5 product (Ventura) supports at max 128 msix vectors.
> 
> This may be a generic issue (if PCI device support  completion on multiple
> reply queues). Let me explain it w.r.t to mpt3sas supported h/w just to
> simplify the problem and possible changes to handle such issues. IT HBA
> (mpt3sas) supports multiple reply queues in completion path. Driver
> creates MSI-x vectors for controller as "min of ( FW supported Reply
> queue, Logical CPUs)". If submitter is not interrupted via completion on
> same CPU, there is a loop in the IO path. This behavior can cause
> hard/soft CPU lockups, IO timeout, system sluggish etc.

As I mentioned in another thread, this issue may be solved by SCSI_MQ
via mapping reply queue into hctx of blk_mq, together with 
QUEUE_FLAG_SAME_FORCE,
especially you have set 'smp_affinity_enable' as 1 at default already,
then pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) can do IRQ vectors spread on
CPUs perfectly for you.

But the following Hannes's patch is required for the conversion.

https://marc.info/?l=linux-block&m=149130770004507&w=2

> 
> Example - one CPU (e.g. CPU A) is busy submitting the IOs and another CPU
> (e.g. CPU B) is busy with processing the corresponding IO's reply
> descriptors from reply descriptor queue upon receiving the interrupts from
> HBA. If the CPU A is continuously pumping the IOs then always CPU B (which
> is executing the ISR) will see the valid reply descriptors in the reply
> descriptor queue and it will be continuously processing those reply
> descriptor in a loop without quitting the ISR handler.  Mpt3sas driver
> will exit ISR handler if it finds unused reply descriptor in the reply
> descriptor queue. Since CPU A will be continuously sending the IOs, CPU B
> may always see a valid reply descriptor (posted by HBA Firmware after
> processing the IO) in the reply descriptor queue. In worst case, driver
> will not quit from this loop in the ISR handler. Eventually, CPU lockup
> will be detected by watchdog.
> 
> Above mentioned behavior is not common if "rq_affinity" set to 2 or
> affinity_hint is honored by irqbalance as "exact".
> If rq_affinity is set to 2, submitter will be always interrupted via
> completion on same CPU.
> If irqbalance is using "exact" policy, interrupt will be delivered to
> submitter CPU.

Now you have used pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) to get msix
vector number, the irq affinity can't be changed by userspace any more.

> 
> Problem statement -
> If CPU counts to MSI-X vectors (reply descriptor Queues) count ratio is
> not 1:1, we still have  exposure of issue explained above and for that we
> don't have any solution.
> 
> Exposure of soft/hard lockup if CPU count is more than MSI-x supported by
> device.
> 
> If CPUs count to MSI-x vectors count ratio is not 1:1, (Other way, if CPU
> counts to MSI-x vector count ratio is something like X:1, where X > 1)
> then 'exact' irqbalance policy OR rq_affinity = 2 won't help to avoid CPU
> hard/soft lockups. There won't be any one to one mapping between CPU to
> MSI-x vector instead one MSI-x interrupt (or reply descriptor queue) is
> shared with group/set of CPUs and there is a possibility of having a loop
> in the IO path within that CPU group and may observe lockups.
> 
> For example: Consider a system having two NUMA nodes and each node having
> four logical CPUs and also consider that number of MSI-x vectors enabled
> on the HBA is two, then CPUs count to MSI-x vector count ratio as 4:1.
> e.g.
> MSIx vector 0 is affinity to  CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA node 0
> and MSI-x vector 1 is affinity to CPU 4, CPU 5, CPU 6 & CPU 7 of NUMA node
> 1.
> 
> numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3          --> MSI-x 0
> node 0 size: 65536 MB
> node 0 free: 63176 MB
> node 1 cpus: 4 5 6 7          --> MSI-x 1
> node 1 size: 65536 MB
> node 1 free: 63176 MB
> 
> Assume that user started an application which uses all the CPUs of NUMA
> node 0 for issuing the IOs.
> Only one CPU from affinity list (it can be any cpu since this behavior
> depends upon irqbalance) CPU0 will receive the interrupts from MSIx vector
> 0 for all the IOs. Eventually, CPU 0 IO submission percentage will be
> decreasing and ISR processing percentage will be increasing as it is more
> busy with processing the interrupts. Gradually IO submission percentage on
> CPU 0 will be zero and it's ISR processing percentage will be 100
> percentage as IO loop has already formed within the NUMA node 0, i.e. CPU
> 1, CPU 2 & CPU 3 will be continuously busy with submitting the heavy IOs
> and only CPU 0 is busy in the ISR path as it always find the valid reply
> descriptor in the reply descriptor 

RE: [LSF/MM TOPIC] irq affinity handling for high CPU count machines

2018-02-02 Thread Kashyap Desai
> > > > Today I am looking at one megaraid_sas related issue, and found
> > > > pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) is used in the driver, so
> > > > looks each reply queue has been handled by more than one CPU if
> > > > there are more CPUs than MSIx vectors in the system, which is done
> > > > by generic irq affinity code, please see kernel/irq/affinity.c.
> >
> > Yes. That is a problematic area. If CPU and MSI-x(reply queue) is 1:1
> > mapped, we don't have any issue.
>
> I guess the problematic area is similar with the following link:
>
>   https://marc.info/?l=linux-kernel&m=151748144730409&w=2

Hi Ming,

The above-mentioned link is a different discussion and looks like a generic
issue. megaraid_sas/mpt3sas will have the same symptoms if irq affinity has
only offline CPUs.
Just for info - in such a condition, we can ask users to disable the
affinity hint via the module parameter "smp_affinity_enable".

>
> otherwise could you explain a bit about the area?

Please check below post for more details.

https://marc.info/?l=linux-scsi&m=151601833418346&w=2