Re: [PATCH 1/1] scsi: ufs-qcom: remove broken hci version quirk

2018-01-29 Thread Asutosh Das (asd)

On 1/30/2018 11:33 AM, Vivek Gautam wrote:

Hi Asutosh,


On 1/30/2018 10:11 AM, Asutosh Das wrote:

From: Subhash Jadavani 

UFSHCD_QUIRK_BROKEN_UFS_HCI_VERSION is only applicable to the QCOM UFS host
controller version 2.x.y. The underlying issue has been fixed in version
3.x.y onwards, hence this change removes the quirk for those versions.

Signed-off-by: Subhash Jadavani 
Signed-off-by: Asutosh Das 
---


Do this patch and all the other ufs patches that you have posted recently
fall under one 'ufs-qcom fixes' series of fixes that we would want to take?
If so, then please club them together into a series, so that
it's easy for reviewers and maintainers to review and keep track
of all the patches that can get in after review.
If they belong to two or more separate patch series, then please
create such series.
It's difficult to read through a lot of standalone [PATCH 1/1] patches.

Sure Vivek - Makes sense. Will resend it.



Regards
Vivek


  drivers/scsi/ufs/ufs-qcom.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/ufs/ufs-qcom.c b/drivers/scsi/ufs/ufs-qcom.c
index 2b38db2..221820a 100644
--- a/drivers/scsi/ufs/ufs-qcom.c
+++ b/drivers/scsi/ufs/ufs-qcom.c
@@ -1098,7 +1098,7 @@ static void ufs_qcom_advertise_quirks(struct ufs_hba *hba)

  hba->quirks |= UFSHCD_QUIRK_BROKEN_LCC;
  }
-    if (host->hw_ver.major >= 0x2) {
+    if (host->hw_ver.major == 0x2) {
  hba->quirks |= UFSHCD_QUIRK_BROKEN_UFS_HCI_VERSION;
  if (!ufs_qcom_cap_qunipro(host))





--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project


Re: [PATCH 1/1] scsi: ufs-qcom: remove broken hci version quirk

2018-01-29 Thread Vivek Gautam

Hi Asutosh,


On 1/30/2018 10:11 AM, Asutosh Das wrote:

From: Subhash Jadavani 

UFSHCD_QUIRK_BROKEN_UFS_HCI_VERSION is only applicable to the QCOM UFS host
controller version 2.x.y. The underlying issue has been fixed in version
3.x.y onwards, hence this change removes the quirk for those versions.

Signed-off-by: Subhash Jadavani 
Signed-off-by: Asutosh Das 
---


Do this patch and all the other ufs patches that you have posted recently
fall under one 'ufs-qcom fixes' series of fixes that we would want to take?
If so, then please club them together into a series, so that
it's easy for reviewers and maintainers to review and keep track
of all the patches that can get in after review.
If they belong to two or more separate patch series, then please
create such series.
It's difficult to read through a lot of standalone [PATCH 1/1] patches.

Regards
Vivek


  drivers/scsi/ufs/ufs-qcom.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/ufs/ufs-qcom.c b/drivers/scsi/ufs/ufs-qcom.c
index 2b38db2..221820a 100644
--- a/drivers/scsi/ufs/ufs-qcom.c
+++ b/drivers/scsi/ufs/ufs-qcom.c
@@ -1098,7 +1098,7 @@ static void ufs_qcom_advertise_quirks(struct ufs_hba *hba)
hba->quirks |= UFSHCD_QUIRK_BROKEN_LCC;
}
  
-	if (host->hw_ver.major >= 0x2) {
+	if (host->hw_ver.major == 0x2) {
hba->quirks |= UFSHCD_QUIRK_BROKEN_UFS_HCI_VERSION;
  
  		if (!ufs_qcom_cap_qunipro(host))




Re: [dm-devel] [PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Ming Lei
On Tue, Jan 30, 2018 at 6:14 AM, Mike Snitzer  wrote:
> On Mon, Jan 29 2018 at  4:51pm -0500,
> Bart Van Assche  wrote:
>
>> On Mon, 2018-01-29 at 16:44 -0500, Mike Snitzer wrote:
>> > But regardless of which wins the race, the queue will have been run.
>> > Which is all we care about right?
>>
>> Running the queue is not sufficient. With this patch applied it can happen
>> that the block driver returns BLK_STS_DEV_RESOURCE, that the two or more
>> concurrent queue runs finish before sufficient device resources are available
>> to execute the request and that blk_mq_delay_run_hw_queue() does not get
>> called at all. If no other activity triggers a queue run, e.g. request
>> completion, this will result in a queue stall.
>
> If BLK_STS_DEV_RESOURCE is returned then the driver doesn't need to rely
> on a future queue run.  IIUC, that is the entire premise of
> BLK_STS_DEV_RESOURCE.  If the driver had doubt about whether the
> resource were going to be available in the future it should return
> BLK_STS_RESOURCE.
>
> That may seem like putting a lot on a driver developer (to decide
> between the 2) but I'll again defer to Jens here.  This was the approach
> he was advocating be pursued.

Thinking about this further, maybe you can add the following description in
V5; it should make it much easier for driver developers to follow:

When any resource allocation fails, if the driver can make sure that there
is in-flight IO, it is safe to return BLK_STS_DEV_RESOURCE to blk-mq; that
is exactly what scsi_queue_rq() is doing.

The theory is as follows:

1) on a resource allocation failure, the driver returns BLK_STS_DEV_RESOURCE
only if it figures out that there is in-flight IO

2) if all of that in-flight IO completes before blk_mq_dispatch_rq_list()
examines SCHED_RESTART, SCHED_RESTART must have been cleared, so the queue
is run immediately by blk_mq_dispatch_rq_list()

3) if there is still in-flight IO when blk_mq_dispatch_rq_list() examines
SCHED_RESTART:
- if SCHED_RESTART isn't set, the queue is run immediately as in 2)
- otherwise, this request will be dispatched after the in-flight IO
completes, via blk_mq_sched_restart(), since the request has already been
added to hctx->dispatch

And there are two invariants when the driver returns BLK_STS_DEV_RESOURCE
iff there is in-flight IO:

1) SCHED_RESTART must be zero if there is no in-flight IO
2) there has to be IO in flight if SCHED_RESTART is read as 1
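
A minimal sketch of this decision rule, for illustration only (the foo_*
device and helpers are hypothetical assumptions, not real kernel APIs):

struct foo_dev;
static bool foo_alloc_resource(struct foo_dev *dev, struct request *rq);
static int foo_inflight_count(struct foo_dev *dev);
static void foo_issue(struct foo_dev *dev, struct request *rq);

static blk_status_t foo_queue_rq(struct blk_mq_hw_ctx *hctx,
				 const struct blk_mq_queue_data *bd)
{
	struct foo_dev *dev = hctx->queue->queuedata;

	if (!foo_alloc_resource(dev, bd->rq)) {
		if (foo_inflight_count(dev) > 0)
			/* invariant 2): a completion will rerun the queue
			 * via SCHED_RESTART, so no delayed run is needed */
			return BLK_STS_DEV_RESOURCE;
		/* no in-flight IO, hence no guaranteed rerun: blk-mq has
		 * to arm blk_mq_delay_run_hw_queue() itself */
		return BLK_STS_RESOURCE;
	}
	foo_issue(dev, bd->rq);
	return BLK_STS_OK;
}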


Thanks,
Ming


[PATCH] scsi_debug: implement IMMED bit

2018-01-29 Thread Douglas Gilbert
The start stop unit command takes in the order of a second to complete
on some SAS SSDs and longer on hard disks. Synchronize cache can also
take some time. Both commands have an IMMED bit for those apps that don't
want to wait. This patch introduces a long delay for those commands
when the IMMED bit is clear.

Changes:
  - add the SYNCHRONIZE CACHE(16) command
  - process the IMMED bit in the cdbs of the new command and of the
existing START STOP UNIT and SYNCHRONIZE CACHE(10) commands
  - if the IMMED bit is set, return immediately
  - if the IMMED bit is clear, treat the delay parameter as having
a unit of one second
  - in the SYNCHRONIZE CACHE processing, do a bounds check

Signed-off-by: Douglas Gilbert 
---
 drivers/scsi/scsi_debug.c | 76 ---
 1 file changed, 65 insertions(+), 11 deletions(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index a5986dae9020..0996fcb8a8d2 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -6,7 +6,7 @@
  *  anything out of the ordinary is seen.
  * ^^^ Original ^^^
  *
- * Copyright (C) 2001 - 2017 Douglas Gilbert
+ * Copyright (C) 2001 - 2018 Douglas Gilbert
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -61,8 +61,8 @@
 #include "scsi_logging.h"
 
 /* make sure inq_product_rev string corresponds to this version */
-#define SDEBUG_VERSION "0187"  /* format to fit INQUIRY revision field */
-static const char *sdebug_version_date = "20171202";
+#define SDEBUG_VERSION "0188"  /* format to fit INQUIRY revision field */
+static const char *sdebug_version_date = "20180128";
 
 #define MY_NAME "scsi_debug"
 
@@ -234,6 +234,7 @@ static const char *sdebug_version_date = "20171202";
 #define F_INV_OP   0x200
 #define F_FAKE_RW  0x400
 #define F_M_ACCESS 0x800   /* media access */
+#define F_LONG_DELAY   0x1000
 
 #define FF_RESPOND (F_RL_WLUN_OK | F_SKIP_UA | F_DELAY_OVERR)
 #define FF_MEDIA_IO (F_M_ACCESS | F_FAKE_RW)
@@ -349,7 +350,7 @@ enum sdeb_opcode_index {
SDEB_I_XDWRITEREAD = 25,/* 10 only */
SDEB_I_WRITE_BUFFER = 26,
SDEB_I_WRITE_SAME = 27, /* 10, 16 */
-   SDEB_I_SYNC_CACHE = 28, /* 10 only */
+   SDEB_I_SYNC_CACHE = 28, /* 10, 16 */
SDEB_I_COMP_WRITE = 29,
SDEB_I_LAST_ELEMENT = 30,   /* keep this last (previous + 1) */
 };
@@ -382,7 +383,7 @@ static const unsigned char opcode_ind_arr[256] = {
 /* 0x80; 0x80->0x9f: 16 byte cdbs */
0, 0, 0, 0, 0, SDEB_I_ATA_PT, 0, 0,
SDEB_I_READ, SDEB_I_COMP_WRITE, SDEB_I_WRITE, 0, 0, 0, 0, 0,
-   0, 0, 0, SDEB_I_WRITE_SAME, 0, 0, 0, 0,
+   0, SDEB_I_SYNC_CACHE, 0, SDEB_I_WRITE_SAME, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, SDEB_I_SERV_ACT_IN_16, SDEB_I_SERV_ACT_OUT_16,
 /* 0xa0; 0xa0->0xbf: 12 byte cdbs */
SDEB_I_REPORT_LUNS, SDEB_I_ATA_PT, 0, SDEB_I_MAINT_IN,
@@ -398,6 +399,14 @@ static const unsigned char opcode_ind_arr[256] = {
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 };
 
+/*
+ * The following "response" functions return the SCSI mid-level's 4 byte
+ * tuple-in-an-int. To handle commands with an IMMED bit, for a faster
+ * command completion, they can mask their return value with
+ * SDEG_RES_IMMED_MASK .
+ */
+#define SDEG_RES_IMMED_MASK 0x4000
+
 static int resp_inquiry(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_report_luns(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_requests(struct scsi_cmnd *, struct sdebug_dev_info *);
@@ -420,6 +429,7 @@ static int resp_write_same_16(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_xdwriteread_10(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_comp_write(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_write_buffer(struct scsi_cmnd *, struct sdebug_dev_info *);
+static int resp_sync_cache(struct scsi_cmnd *, struct sdebug_dev_info *);
 
 /*
  * The following are overflow arrays for cdbs that "hit" the same index in
@@ -499,6 +509,12 @@ static const struct opcode_info_t release_iarr[] = {
{6,  0x1f, 0xff, 0, 0, 0xc7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} },
 };
 
+static const struct opcode_info_t sync_cache_iarr[] = {
+   {0, 0x91, 0, F_LONG_DELAY | F_M_ACCESS, resp_sync_cache, NULL,
+   {16,  0x7, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+0xff, 0xff, 0xff, 0xff, 0x3f, 0xc7} }, /* SYNC_CACHE (16) */
+};
+
 
 /* This array is accessed via SDEB_I_* values. Make sure all are mapped,
  * plus the terminating elements for logic that scans this table such as
@@ -536,8 +552,8 @@ static const struct opcode_info_t opcode_info_arr[SDEB_I_LAST_ELEMENT + 1] = {
{ARRAY_SIZE(write_iarr), 0x8a, 0, F_D_OUT | FF_MEDIA_IO,
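
As a hedged illustration of the IMMED-bit convention described in the
comment above: the helper below is a sketch only, not part of the posted
patch, and assumes the SBC cdb layout in which START STOP UNIT carries
IMMED in byte 1, bit 0.

static int resp_start_stop_sketch(struct scsi_cmnd *scp,
				  struct sdebug_dev_info *devip)
{
	bool immed = !!(scp->cmnd[1] & 0x1);	/* IMMED: byte 1, bit 0 */
	int res = 0;				/* GOOD status */

	/* ... cdb validation and power-condition handling elided ... */

	if (immed)
		res |= SDEG_RES_IMMED_MASK;	/* complete without the long delay */
	return res;
}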

[PATCH 1/1] scsi: ufs: make sure all interrupts are processed

2018-01-29 Thread Asutosh Das
From: Venkat Gopalakrishnan 

As multiple requests are submitted to the UFS host controller in
parallel, there could be instances where a command completion
interrupt arrives for a request that was already processed
earlier, because the corresponding doorbell was cleared while
handling a previous interrupt. Read the interrupt status in a loop
after processing the received interrupt, to catch and handle any
such interrupts.

Signed-off-by: Venkat Gopalakrishnan 
Signed-off-by: Asutosh Das 
---
 drivers/scsi/ufs/ufshcd.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index 8af2af3..58d81de 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -5357,19 +5357,30 @@ static irqreturn_t ufshcd_intr(int irq, void *__hba)
u32 intr_status, enabled_intr_status;
irqreturn_t retval = IRQ_NONE;
struct ufs_hba *hba = __hba;
+   int retries = hba->nutrs;
 
spin_lock(hba->host->host_lock);
intr_status = ufshcd_readl(hba, REG_INTERRUPT_STATUS);
-   enabled_intr_status =
-   intr_status & ufshcd_readl(hba, REG_INTERRUPT_ENABLE);
 
-   if (intr_status)
-   ufshcd_writel(hba, intr_status, REG_INTERRUPT_STATUS);
+   /*
+* There could be max of hba->nutrs reqs in flight and in worst case
+* if the reqs get finished 1 by 1 after the interrupt status is
+* read, make sure we handle them by checking the interrupt status
+* again in a loop until we process all of the reqs before returning.
+*/
+   do {
+   enabled_intr_status =
+   intr_status & ufshcd_readl(hba, REG_INTERRUPT_ENABLE);
+   if (intr_status)
+   ufshcd_writel(hba, intr_status, REG_INTERRUPT_STATUS);
+   if (enabled_intr_status) {
+   ufshcd_sl_intr(hba, enabled_intr_status);
+   retval = IRQ_HANDLED;
+   }
+
+   intr_status = ufshcd_readl(hba, REG_INTERRUPT_STATUS);
+   } while (intr_status && --retries);
 
-   if (enabled_intr_status) {
-   ufshcd_sl_intr(hba, enabled_intr_status);
-   retval = IRQ_HANDLED;
-   }
spin_unlock(hba->host->host_lock);
return retval;
 }
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc. 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



[PATCH 1/1] scsi: ufs-qcom: remove broken hci version quirk

2018-01-29 Thread Asutosh Das
From: Subhash Jadavani 

UFSHCD_QUIRK_BROKEN_UFS_HCI_VERSION is only applicable to the QCOM UFS host
controller version 2.x.y. The underlying issue has been fixed in version
3.x.y onwards, hence this change removes the quirk for those versions.

Signed-off-by: Subhash Jadavani 
Signed-off-by: Asutosh Das 
---
 drivers/scsi/ufs/ufs-qcom.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/ufs/ufs-qcom.c b/drivers/scsi/ufs/ufs-qcom.c
index 2b38db2..221820a 100644
--- a/drivers/scsi/ufs/ufs-qcom.c
+++ b/drivers/scsi/ufs/ufs-qcom.c
@@ -1098,7 +1098,7 @@ static void ufs_qcom_advertise_quirks(struct ufs_hba *hba)
hba->quirks |= UFSHCD_QUIRK_BROKEN_LCC;
}
 
-   if (host->hw_ver.major >= 0x2) {
+   if (host->hw_ver.major == 0x2) {
hba->quirks |= UFSHCD_QUIRK_BROKEN_UFS_HCI_VERSION;
 
if (!ufs_qcom_cap_qunipro(host))
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc. 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



[PATCH 1/1] scsi: ufs: add reference counting for scsi block requests

2018-01-29 Thread Asutosh Das
From: Subhash Jadavani 

Currently we call scsi_block_requests()/scsi_unblock_requests()
whenever we want to block/unblock scsi requests, but as there is no
reference counting, nesting these calls could sometimes leave us in an
undesired state. Consider the following call flow sequence:
1. func1() calls scsi_block_requests() but calls func2() before
   calling scsi_unblock_requests()
2. func2() calls scsi_block_requests()
3. func2() calls scsi_unblock_requests()
4. func1() calls scsi_unblock_requests()

As there is no reference counting, scsi requests end up unblocked after
#3 instead of only after #4. Though no failures have been observed from
this yet, we might run into some in the future. The better solution is
to fix this by adding reference counting.

Signed-off-by: Subhash Jadavani 
Signed-off-by: Can Guo 
Signed-off-by: Asutosh Das 
---
 drivers/scsi/ufs/ufshcd.c | 44 +++++++++++++++++++++++++++++++++++++-------
 drivers/scsi/ufs/ufshcd.h |  5 +++++
 2 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index 8af2af3..a9ebc8d 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -264,6 +264,36 @@ static inline void ufshcd_disable_irq(struct ufs_hba *hba)
}
 }
 
+void ufshcd_scsi_unblock_requests(struct ufs_hba *hba)
+{
+   unsigned long flags;
+   bool unblock = false;
+
+   spin_lock_irqsave(hba->host->host_lock, flags);
+   hba->scsi_block_reqs_cnt--;
+   unblock = !hba->scsi_block_reqs_cnt;
+   spin_unlock_irqrestore(hba->host->host_lock, flags);
+   if (unblock)
+   scsi_unblock_requests(hba->host);
+}
+EXPORT_SYMBOL(ufshcd_scsi_unblock_requests);
+
+static inline void __ufshcd_scsi_block_requests(struct ufs_hba *hba)
+{
+   if (!hba->scsi_block_reqs_cnt++)
+   scsi_block_requests(hba->host);
+}
+
+void ufshcd_scsi_block_requests(struct ufs_hba *hba)
+{
+   unsigned long flags;
+
+   spin_lock_irqsave(hba->host->host_lock, flags);
+   __ufshcd_scsi_block_requests(hba);
+   spin_unlock_irqrestore(hba->host->host_lock, flags);
+}
+EXPORT_SYMBOL(ufshcd_scsi_block_requests);
+
 /* replace non-printable or non-ASCII characters with spaces */
 static inline void ufshcd_remove_non_printable(char *val)
 {
@@ -1079,12 +1109,12 @@ static int ufshcd_clock_scaling_prepare(struct ufs_hba *hba)
	 * make sure that there are no outstanding requests when
	 * clock scaling is in progress
	 */
-	scsi_block_requests(hba->host);
+	ufshcd_scsi_block_requests(hba);
	down_write(&hba->clk_scaling_lock);
	if (ufshcd_wait_for_doorbell_clr(hba, DOORBELL_CLR_TOUT_US)) {
		ret = -EBUSY;
		up_write(&hba->clk_scaling_lock);
-		scsi_unblock_requests(hba->host);
+		ufshcd_scsi_unblock_requests(hba);
	}

	return ret;
@@ -1093,7 +1123,7 @@ static int ufshcd_clock_scaling_prepare(struct ufs_hba *hba)
 static void ufshcd_clock_scaling_unprepare(struct ufs_hba *hba)
 {
	up_write(&hba->clk_scaling_lock);
-	scsi_unblock_requests(hba->host);
+	ufshcd_scsi_unblock_requests(hba);
 }
 
 /**
@@ -1413,7 +1443,7 @@ static void ufshcd_ungate_work(struct work_struct *work)
hba->clk_gating.is_suspended = false;
}
 unblock_reqs:
-   scsi_unblock_requests(hba->host);
+   ufshcd_scsi_unblock_requests(hba);
 }
 
 /**
@@ -1469,7 +1499,7 @@ int ufshcd_hold(struct ufs_hba *hba, bool async)
 * work and to enable clocks.
 */
case CLKS_OFF:
-   scsi_block_requests(hba->host);
+   __ufshcd_scsi_block_requests(hba);
hba->clk_gating.state = REQ_CLKS_ON;
trace_ufshcd_clk_gating(dev_name(hba->dev),
hba->clk_gating.state);
@@ -5178,7 +5208,7 @@ static void ufshcd_err_handler(struct work_struct *work)
 
 out:
spin_unlock_irqrestore(hba->host->host_lock, flags);
-   scsi_unblock_requests(hba->host);
+   ufshcd_scsi_unblock_requests(hba);
ufshcd_release(hba);
pm_runtime_put_sync(hba->dev);
 }
@@ -5280,7 +5310,7 @@ static void ufshcd_check_errors(struct ufs_hba *hba)
/* handle fatal errors only when link is functional */
if (hba->ufshcd_state == UFSHCD_STATE_OPERATIONAL) {
/* block commands from scsi mid-layer */
-   scsi_block_requests(hba->host);
+   __ufshcd_scsi_block_requests(hba);
 
hba->ufshcd_state = UFSHCD_STATE_EH_SCHEDULED;
 
diff --git a/drivers/scsi/ufs/ufshcd.h b/drivers/scsi/ufs/ufshcd.h
index 1332e54..84f4175 100644
--- a/drivers/scsi/ufs/ufshcd.h
+++ b/drivers/scsi/ufs/ufshcd.h
@@ -498,6 +498,7 @@ struct ufs_stats {
  * 
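
A hedged usage sketch of the new helpers (func1()/func2() are the
hypothetical callers from the commit message above; the comments trace
the scsi_block_reqs_cnt counter added by this patch):

static void func2(struct ufs_hba *hba)
{
	ufshcd_scsi_block_requests(hba);	/* cnt 1 -> 2: already blocked */
	/* ... work that must run with requests blocked ... */
	ufshcd_scsi_unblock_requests(hba);	/* cnt 2 -> 1: stays blocked */
}

static void func1(struct ufs_hba *hba)
{
	ufshcd_scsi_block_requests(hba);	/* cnt 0 -> 1: blocks */
	func2(hba);
	ufshcd_scsi_unblock_requests(hba);	/* cnt 1 -> 0: unblocks only here */
}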

Re: [PATCH V3] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Ming Lei
On Tue, Jan 30, 2018 at 03:37:38AM +, Bart Van Assche wrote:
> On Tue, 2018-01-30 at 11:31 +0800, Ming Lei wrote:
> > Please take a look at drivers, when BLK_STS_RESOURCE is returned, who
> > will call blk_mq_delay_run_hw_queue() for drivers?
> 
> As you know the SCSI and dm drivers in kernel v4.15 already call that function
> whenever necessary.

We have concluded that it is a generic issue which needs a generic solution [1].

[1] https://marc.info/?l=linux-kernel&m=151638176727612&w=2

[ming@ming linux]$ git grep -n BLK_STS_RESOURCE ./drivers/ | wc -l
43

Almost every driver needs this kind of change if BLK_STS_RESOURCE is returned
from the IO path. And the failure can be caused by kmalloc(GFP_ATOMIC), DMA
mapping, or whatever.

Thanks,
Ming


Re: [PATCH V3] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Bart Van Assche
On Tue, 2018-01-30 at 11:31 +0800, Ming Lei wrote:
> Please take a look at drivers, when BLK_STS_RESOURCE is returned, who
> will call blk_mq_delay_run_hw_queue() for drivers?

As you know the SCSI and dm drivers in kernel v4.15 already call that function
whenever necessary.

> > 
> > > [ ... ] Later, you admitted you understood the patch wrong. [ ... ]
> > 
> > That's nonsense. I never wrote that.
> 
> Believe it or not, follows the link and your reply:
> 
>   https://marc.info/?l=dm-devel&m=151672694508389&w=2
> 
> > So what is wrong with this way?
> > Sorry, I swapped BLK_STS_DEV_RESOURCE and BLK_STS_RESOURCE accidentally in 
> > my
> > reply. What I meant is that changing a blk_mq_delay_run_hw_queue() call 
> > followed
> > by return BLK_STS_RESOURCE into BLK_STS_DEV_RESOURCE is wrong and 
> > introduces a
> > race condition in code where there was no race condition.

That meant that I made a mistake (typo) while typing the reply and nothing else.

Bart.

Re: [PATCH V3] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Ming Lei
On Tue, Jan 30, 2018 at 01:11:22AM +, Bart Van Assche wrote:
> On Tue, 2018-01-30 at 09:07 +0800, Ming Lei wrote:
> > On Mon, Jan 29, 2018 at 04:48:31PM +, Bart Van Assche wrote:
> > > - It is easy to fix this race inside the block layer, namely by using
> > >   call_rcu() inside the blk_mq_delay_run_hw_queue() implementation to
> > >   postpone the queue rerunning until after the request has been added 
> > > back to
> > >   the dispatch list.
> > 
> > It is just easy to say, can you cook a patch and fix all drivers first?
> 
> Please reread what I wrote. I proposed to change the blk_mq_delay_run_hw_queue()
> IMPLEMENTATION such that the callers do not have to be modified.

Please take a look at drivers, when BLK_STS_RESOURCE is returned, who
will call blk_mq_delay_run_hw_queue() for drivers?

> 
> > [ ... ] Later, you admitted you understood the patch wrong. [ ... ]
> 
> That's nonsense. I never wrote that.

Believe it or not, follows the link and your reply:

https://marc.info/?l=dm-devel&m=151672694508389&w=2

> So what is wrong with this way?

>Sorry, I swapped BLK_STS_DEV_RESOURCE and BLK_STS_RESOURCE accidentally in my
>reply. What I meant is that changing a blk_mq_delay_run_hw_queue() call 
>followed
>by return BLK_STS_RESOURCE into BLK_STS_DEV_RESOURCE is wrong and introduces a
>race condition in code where there was no race condition.
>
>Bart.


-- 
Ming


[PATCH 1/1] scsi: ufshcd: fix possible unclocked register access

2018-01-29 Thread Asutosh Das
From: Subhash Jadavani 

vendor specific setup_clocks ops may depend on clocks managed by the ufshcd
driver, so if the vendor specific setup_clocks callback is called when the
required clocks are turned off, it results in unclocked register access.

This change makes sure that the required clocks are enabled before the
vendor specific setup_clocks callback is called.

Signed-off-by: Subhash Jadavani 
Signed-off-by: Venkat Gopalakrishnan 
Signed-off-by: Can Guo 
Signed-off-by: Asutosh Das 
---
 drivers/scsi/ufs/ufshcd.c | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index 8af2af3..c5ee686 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -6763,9 +6763,16 @@ static int __ufshcd_setup_clocks(struct ufs_hba *hba, bool on,
if (list_empty(head))
goto out;
 
-   ret = ufshcd_vops_setup_clocks(hba, on, PRE_CHANGE);
-   if (ret)
-   return ret;
+   /*
+* vendor specific setup_clocks ops may depend on clocks managed by
+* this standard driver hence call the vendor specific setup_clocks
+* before disabling the clocks managed here.
+*/
+   if (!on) {
+   ret = ufshcd_vops_setup_clocks(hba, on, PRE_CHANGE);
+   if (ret)
+   return ret;
+   }
 
list_for_each_entry(clki, head, list) {
if (!IS_ERR_OR_NULL(clki->clk)) {
@@ -6789,9 +6796,16 @@ static int __ufshcd_setup_clocks(struct ufs_hba *hba, bool on,
}
}
 
-   ret = ufshcd_vops_setup_clocks(hba, on, POST_CHANGE);
-   if (ret)
-   return ret;
+   /*
+* vendor specific setup_clocks ops may depend on clocks managed by
+* this standard driver hence call the vendor specific setup_clocks
+* after enabling the clocks managed here.
+*/
+   if (on) {
+   ret = ufshcd_vops_setup_clocks(hba, on, POST_CHANGE);
+   if (ret)
+   return ret;
+   }
 
 out:
if (ret) {
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc. 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.
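
In short, after this patch the ordering is:

	/* on == false (gating):  vendor PRE_CHANGE runs first, while the
	 *                        ufshcd-managed clocks are still enabled */
	/* on == true (ungating): vendor POST_CHANGE runs last, after the
	 *                        ufshcd-managed clocks have been enabled */

so the vendor specific setup_clocks callback only ever runs while the
clocks it may depend on are turned on.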



Re: [PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Ming Lei
On Tue, Jan 30, 2018 at 5:51 AM, Bart Van Assche  wrote:
> On Mon, 2018-01-29 at 16:44 -0500, Mike Snitzer wrote:
>> But regardless of which wins the race, the queue will have been run.
>> Which is all we care about right?
>
> Running the queue is not sufficient. With this patch applied it can happen
> that the block driver returns BLK_STS_DEV_RESOURCE, that the two or more
> concurrent queue runs finish before sufficient device resources are available
> to execute the request and that blk_mq_delay_run_hw_queue() does not get
> called at all. If no other activity triggers a queue run, e.g. request
> completion, this will result in a queue stall.

Please see document of BLK_STS_DEV_RESOURCE:

+ * BLK_STS_DEV_RESOURCE is returned from driver to block layer if device
+ * related resource is unavailable, but driver can guarantee that queue
+ * will be rerun in future once the resource is available (whereby
+ * dispatching requests).

I have explained the SCSI's BLK_STS_DEV_RESOURCE conversion in
another thread, let's know if you have further concern:

https://marc.info/?l=linux-block&m=151727454815018&w=2

-- 
Ming Lei


Re: [PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Ming Lei
On Tue, Jan 30, 2018 at 5:22 AM, Bart Van Assche  wrote:
> On Mon, 2018-01-29 at 15:33 -0500, Mike Snitzer wrote:
>> +  * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
>> +  * bit is set, run queue after 10ms to avoid IO stalls
>> +  * that could otherwise occur if the queue is idle.
>>*/
>> - if (!blk_mq_sched_needs_restart(hctx) ||
>> + needs_restart = blk_mq_sched_needs_restart(hctx);
>> + if (!needs_restart ||
>>   (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>>   blk_mq_run_hw_queue(hctx, true);
>> + else if (needs_restart && (ret == BLK_STS_RESOURCE))
>> + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
>>   }
>
> The above code only calls blk_mq_delay_run_hw_queue() if the following condition
> is met: !(!needs_restart || (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> && (needs_restart && (ret == BLK_STS_RESOURCE)). That boolean expression can be
> simplified into the following: needs_restart && ret == BLK_STS_RESOURCE &&
> !(no_tag && list_empty_careful(&hctx->dispatch_wait.entry)). From blk-mq-sched.h:

No, the expression (needs_restart && (ret == BLK_STS_RESOURCE)) clearly
expresses what we want to do.

When RESTART is in effect and meanwhile BLK_STS_RESOURCE is returned
from the driver, we need to avoid an IO hang via blk_mq_delay_run_hw_queue().

>
> static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
> {
> return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> }
>
> In other words, the above code will not call blk_mq_delay_run_hw_queue() if
> BLK_MQ_S_SCHED_RESTART is cleared after it got set and before the above code 
> is
> reached. The BLK_MQ_S_SCHED_RESTART bit gets cleared if a request completes
> concurrently with the above code. From blk-mq-sched.c:

That won't be an issue: any time SCHED_RESTART is cleared, the queue
will be run again, so we needn't worry about that.

>
> static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
> {
> if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
> return false;
>
> if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
> struct request_queue *q = hctx->queue;
>
> if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
> atomic_dec(&q->shared_hctx_restart);
> } else
> clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
>
> return blk_mq_run_hw_queue(hctx, true);
> }
>
> The above blk_mq_run_hw_queue() call may finish either before or after

If the above blk_mq_run_hw_queue() finishes before blk_mq_dispatch_rq_list()
examines the flag, blk_mq_dispatch_rq_list() will see the flag cleared and run
the queue again via the following code branch:

if (!blk_mq_sched_needs_restart(hctx) ||
    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
        blk_mq_run_hw_queue(hctx, true);

If blk_mq_run_hw_queue() finishes after blk_mq_dispatch_rq_list() examines
the flag, that blk_mq_run_hw_queue() will see the newly added request, and
still everything is fine. Even blk_mq_delay_run_hw_queue() can be started
too.

> blk_mq_dispatch_rq_list() examines the BLK_MQ_S_SCHED_RESTART flag.
>
> That's why I wrote in previous e-mails that this patch and also the previous
> versions of this patch change a systematic call of blk_mq_delay_run_hw_queue()
> into a call that may or may not happen, depending on whether or not a request
> completes concurrently with request queueing. Sorry but I think that means
> that the above change combined with changing BLK_STS_RESOURCE into
> BLK_STS_DEV_RESOURCE is wrong and why I expect that this will result in
> sporadic queue stalls. As you know sporadic queue stalls are annoying and hard
> to debug.

Now that there is debugfs, it isn't difficult to deal with that, if the
reporter is willing to cooperate.

>
> Plenty of e-mails have already been exchanged about versions one to four of 
> this
> patch but so far nobody has even tried to contradict the above.

No, I don't see the issue, let's revisit the main change again:

+   needs_restart = blk_mq_sched_needs_restart(hctx);
+   if (!needs_restart ||
(no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
blk_mq_run_hw_queue(hctx, true);
+   else if (needs_restart && (ret == BLK_STS_RESOURCE))
+   blk_mq_delay_run_hw_queue(hctx, 10);

If SCHED_RESTART isn't set, the queue is run immediately; otherwise, when
BLK_STS_RESOURCE is returned, we avoid an IO hang via
blk_mq_delay_run_hw_queue(hctx, XXX).

And we don't cover BLK_STS_DEV_RESOURCE above because it is documented
clearly that BLK_STS_DEV_RESOURCE is only returned by the driver iff the
queue will be rerun in the future once the resource is available.

Is there any hole in the above code?

Thanks,
Ming Lei


Re: [LSF/MM TOPIC] Two blk-mq related topics

2018-01-29 Thread Jens Axboe
On 1/29/18 4:46 PM, James Bottomley wrote:
> On Mon, 2018-01-29 at 14:00 -0700, Jens Axboe wrote:
>> On 1/29/18 1:56 PM, James Bottomley wrote:
>>>
>>> On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote:
>>> [...]

 2. When to enable SCSI_MQ at default again?
>>>
>>> I'm not sure there's much to discuss ... I think the basic answer
>>> is as soon as Christoph wants to try it again.
>>
>> FWIW, internally I've been running various IO intensive workloads on
>> what is essentially 4.12 upstream with scsi-mq the default (with
>> mq-deadline as the scheduler) and comparing IO workloads with a
>> previous 4.6 kernel (without scsi-mq), and things are looking
>> great.
>>
>> We're never going to iron out the last kinks with it being off
>> by default, I think we should attempt to flip the switch again
>> for 4.16.
> 
> Absolutely, I agree we turn it on ASAP.  I just don't want to be on the
> receiving end of Linus' flamethrower because a bug we already had
> reported against scsi-mq caused problems.  Get confirmation from the
> original reporters (or as close to it as you can) that their problems
> are fixed and we're good to go; he won't kick us nearly as hard for new
> bugs that turn up.

I agree, the functional issues definitely have to be verified to be
resolved. Various performance hitches we can dive into if they
crop up, but reintroducing some random suspend regression is not
acceptable.

-- 
Jens Axboe



Re: [LSF/MM TOPIC] Two blk-mq related topics

2018-01-29 Thread Ming Lei
On Mon, Jan 29, 2018 at 03:40:31PM -0500, Mike Snitzer wrote:
> On Mon, Jan 29 2018 at 10:46am -0500,
> Ming Lei  wrote:
>  
> > 2. When to enable SCSI_MQ at default again?
> > 
> > SCSI_MQ was first enabled in v3.17, but disabled by default. In v4.13-rc1
> > it was enabled by default, but that patch was later reverted in v4.13-rc7,
> > making it disabled by default again.
> > 
> > Now both the originally reported PM issue (actually SCSI quiesce) and the
> > sequential IO performance issue have been addressed, and the MQ IO
> > schedulers are ready for traditional disks too. Are there other issues to
> > be addressed before enabling SCSI_MQ by default? When can we do that again?
> > 
> > Last time, the two issues were reported during the v4.13 dev cycle just
> > when it was enabled by default; it seems that if SCSI_MQ isn't enabled by
> > default, it will never be exercised and tested completely and fully.
> > 
> > So if we continue to disable it by default, maybe it can never be exposed
> > to full test/production environments.
> 
> I was going to propose revisiting this as well.
> 
> I'd really like to see all the old .request_fn block core code removed.

Yeah, that should be the final goal, but it may take a while.

> 
> But maybe we take a first step of enabling:
> CONFIG_SCSI_MQ_DEFAULT=Y
> CONFIG_DM_MQ_DEFAULT=Y

Maybe you can remove the legacy path from DM_RQ first, and take your
original approach to allow DM/MQ over a legacy underlying driver;
it seems we discussed this topic before. :-)

Thanks,
Ming


Re: [LSF/MM TOPIC] Two blk-mq related topics

2018-01-29 Thread Ming Lei
On Mon, Jan 29, 2018 at 12:56:30PM -0800, James Bottomley wrote:
> On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote:
> [...]
> > 2. When to enable SCSI_MQ at default again?
> 
> I'm not sure there's much to discuss ... I think the basic answer is as
> soon as Christoph wants to try it again.

I guess Christoph still needs to evaluate whether there are existing issues
or blockers before trying it again. And more input may be gathered from F2F
discussion, IMHO.

> 
> > SCSI_MQ was first enabled in v3.17, but disabled by default. In
> > v4.13-rc1 it was enabled by default, but that patch was later reverted
> > in v4.13-rc7, making it disabled by default again.
> > 
> > Now both the originally reported PM issue (actually SCSI quiesce) and
> > the sequential IO performance issue have been addressed.
> 
> Is the blocker bug just not closed because no-one thought to do it:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=178381
> 
> (we have confirmed that this issue is now fixed with the original
> reporter?)

From a developer's view, this issue is fixed by the following commit:
3a0a52997 ("block, scsi: Make SCSI quiesce and resume work reliably"),
and it has been verified by the reporter on the kernel list.

> 
> And did the Huawei guy (Jonathan Cameron) confirm his performance issue
> was fixed (I don't think I saw email that he did)?

Last time I talked with John Garry about the issue; the merged .get_budget
based patch improves the IO performance a lot, but there is still a small
gap compared with the legacy path. It seems to be a driver specific issue;
I remember that removing one of the driver's locks improved performance
considerably.

Garry, could you provide a further update on this issue?

Thanks,
Ming


Re: [PATCH V3] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Bart Van Assche
On Tue, 2018-01-30 at 09:07 +0800, Ming Lei wrote:
> On Mon, Jan 29, 2018 at 04:48:31PM +, Bart Van Assche wrote:
> > - It is easy to fix this race inside the block layer, namely by using
> >   call_rcu() inside the blk_mq_delay_run_hw_queue() implementation to
> >   postpone the queue rerunning until after the request has been added back 
> > to
> >   the dispatch list.
> 
> It is just easy to say, can you cook a patch and fix all drivers first?

Please reread what I wrote. I proposed to change the blk_mq_delay_run_hw_queue()
IMPLEMENTATION such that the callers do not have to be modified.

> [ ... ] Later, you admitted you understood the patch wrong. [ ... ]

That's nonsense. I never wrote that.

Bart.

Re: [PATCH V3] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Ming Lei
On Mon, Jan 29, 2018 at 04:48:31PM +, Bart Van Assche wrote:
> On Sun, 2018-01-28 at 07:41 +0800, Ming Lei wrote:
> > Not to mention, the request isn't added to the dispatch list yet in .queue_rq();
> > strictly speaking, it is not correct to call blk_mq_delay_run_hw_queue() in
> > .queue_rq(), so the current block layer API can't handle it well enough.
> 
> I disagree that it is not correct to call blk_mq_delay_run_hw_queue() from
> inside .queue_rq(). Additionally, I have already explained to you in
> previous e-mails why it's fine to call that function from inside .queue_rq():
> - Nobody has ever observed the race you described because the time after

'Nobody observed it' does not mean the race doesn't exist, since I can
explain it clearly.

>   which a queue is rerun by blk_mq_delay_run_hw_queue() is much larger than
>   the time during which that race exists.

Again, it is fragile to depend on the delay (10ms, 3ms, or whatever) to make
everything correct; why can't we make this robust, like the approach I took?

You cannot guarantee that the request handled in .queue_rq() has been added
to hctx->dispatch when you call blk_mq_delay_run_hw_queue(), because the
request is added to hctx->dispatch only after .queue_rq() returns to
blk_mq_dispatch_rq_list(), i.e. after blk_mq_delay_run_hw_queue() has
already been called. Right?

Especially now that kblockd_mod_delayed_work_on() is changed to use
mod_delayed_work_on(), which may run the queue from another context before
the timer expires, an IO hang can still be triggered, since that queue run
may miss the request that is about to be added to hctx->dispatch.

> - It is easy to fix this race inside the block layer, namely by using
>   call_rcu() inside the blk_mq_delay_run_hw_queue() implementation to
>   postpone the queue rerunning until after the request has been added back to
>   the dispatch list.

That is easy to say; can you cook a patch and fix all the drivers first?

Then let's compare which patch is simpler and better.

> 
> > > - The patch at the start of this thread introduces a regression in the
> > >   SCSI core, namely a queue stall if a request completion occurs 
> > > concurrently
> > >   with the newly added BLK_MQ_S_SCHED_RESTART test in the blk-mq core.

> > 
> > This patch only moves the blk_mq_delay_run_hw_queue() from scsi_queue_rq()
> > to blk-mq, again, please explain it in detail how this patch V3 introduces 
> > this
> > regression on SCSI.
> 
> Please reread the following message:
> https://marc.info/?l=dm-devel&m=151672480107560

OK, the following message is copied from the above link:

> My opinion about this patch is as follows:
> * Changing a blk_mq_delay_run_hw_queue() call followed by return
>   BLK_STS_DEV_RESOURCE into return BLK_STS_RESOURCE is wrong because it 
> changes
>   a guaranteed queue rerun into a queue rerun that may or may not happen
>   depending on whether or not multiple queue runs happen simultaneously.
> * This change makes block drivers less readable because anyone who encounters
>   BLK_STS_DEV_RESOURCE will have to look up its definition to figure out what
>   it's meaning is.
> * We don't need the new status code BLK_STS_DEV_RESOURCE because a delayed
>   queue run can be implemented easily with the existing block layer API.

Later, you admitted you had understood the patch wrongly; here follows your
reply again, from https://marc.info/?l=dm-devel&m=151672694508389&w=2

> On Wed, 2018-01-24 at 00:37 +0800, Ming Lei wrote:
> > On Tue, Jan 23, 2018 at 04:24:20PM +, Bart Van Assche wrote:
> > > My opinion about this patch is as follows:
> > > * Changing a blk_mq_delay_run_hw_queue() call followed by return
> > >   BLK_STS_DEV_RESOURCE into return BLK_STS_RESOURCE is wrong because it 
> > > changes
> > >   a guaranteed queue rerun into a queue rerun that may or may not happen
> > >   depending on whether or not multiple queue runs happen simultaneously.
> > 
> > You may not understand the two:
> > 
> > 1) it is always safe to return BLK_STS_RESOURCE, which will make sure to
> > avoid IO hang by blk_mq_delay_run_hw_queue() or blk_mq_run_hw_queue(),
> > and using which one depends on SCHED_RESTART.
> > 
> > 2) if driver can make sure the queue will be rerun after some resource
> > is available, either by itself or by blk-mq, it will return 
> > BLK_STS_DEV_RESOURCE
> > 
> > So what is wrong with this way?
> 
> Sorry, I swapped BLK_STS_DEV_RESOURCE and BLK_STS_RESOURCE accidentally in my
> reply. What I meant is that changing a blk_mq_delay_run_hw_queue() call 
> followed
> by return BLK_STS_RESOURCE into BLK_STS_DEV_RESOURCE is wrong and introduces a
> race condition in code where there was no race condition.

You still haven't explained the race condition here, right?

I can explain it again, but I am losing patience with you if you continue
to refuse to answer my question, just like you refused to provide the
debugfs log before, when you reported the issue.

When turning BLK_STS_RESOURCE into BLK_STS_DEV_RESOURCE, it is 

Re: [PATCH] scsi: handle special return codes for ABORTED COMMAND

2018-01-29 Thread Martin Wilck
On Mon, 2018-01-29 at 23:39 +, Bart Van Assche wrote:
> On Tue, 2018-01-30 at 00:30 +0100, Martin Wilck wrote:
> > +   static struct aborted_cmd_blist blist[] = {
> 
> Please consider to declare this array const.

That doesn't work because I want to store the "warned" flag in it. And
I think it'd be good to print one message per asc/ascq combination,
rather than just having a single static "warned" variable.

> > +   for (i = 0; i < sizeof(blist)/sizeof(struct
> > aborted_cmd_blist); i++) {
> 
> Have you considered to use ARRAY_SIZE()?

No. Thanks for the hint. I'll re-post.

Martin

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
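
One possible shape of the compromise discussed above, as a sketch only
(not the posted patch): keep the table const, as Bart suggested, and move
the per-entry "warned" flag into a separate bitmap so one message per
asc/ascq combination is still printed.

static int scsi_aborted_cmd_quirk(const struct scsi_device *sdev,
				  const struct scsi_sense_hdr *sshdr)
{
	static const struct aborted_cmd_blist blist[] = {
		{ 0x44, 0x00, ADD_TO_MLQUEUE, "EMC", "SYMMETRIX" },
		{ 0xc1, 0x01, ADD_TO_MLQUEUE, "FUJITSU", "ETERNUS_DXM" },
	};
	static DECLARE_BITMAP(warned, ARRAY_SIZE(blist));
	int i;

	for (i = 0; i < ARRAY_SIZE(blist); i++) {
		if (sshdr->asc != blist[i].asc || sshdr->ascq != blist[i].ascq)
			continue;
		/* still warn once per asc/ascq combination */
		if (!test_and_set_bit(i, warned))
			sdev_printk(KERN_WARNING, sdev,
				    "quirky ABORTED COMMAND (asc/ascq 0x%02x/0x%02x)\n",
				    blist[i].asc, blist[i].ascq);
		return blist[i].retval;
	}
	return NEEDS_RETRY;
}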



Re: [LSF/MM TOPIC] Two blk-mq related topics

2018-01-29 Thread James Bottomley
On Mon, 2018-01-29 at 14:00 -0700, Jens Axboe wrote:
> On 1/29/18 1:56 PM, James Bottomley wrote:
> > 
> > On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote:
> > [...]
> > > 
> > > 2. When to enable SCSI_MQ at default again?
> > 
> > I'm not sure there's much to discuss ... I think the basic answer
> > is as soon as Christoph wants to try it again.
> 
> FWIW, internally I've been running various IO intensive workloads on
> what is essentially 4.12 upstream with scsi-mq the default (with
> mq-deadline as the scheduler) and comparing IO workloads with a
> previous 4.6 kernel (without scsi-mq), and things are looking
> great.
> 
> We're never going to iron out the last kinks with it being off
> by default, I think we should attempt to flip the switch again
> for 4.16.

Absolutely, I agree we turn it on ASAP.  I just don't want to be on the
receiving end of Linus' flamethrower because a bug we already had
reported against scsi-mq caused problems.  Get confirmation from the
original reporters (or as close to it as you can) that their problems
are fixed and we're good to go; he won't kick us nearly as hard for new
bugs that turn up.

James



Re: A qla2xxx commit cause Linux no response, has not fixed in lastest version 4.15-rc6

2018-01-29 Thread Madhani, Himanshu
Hi Chang, 

> On Jan 18, 2018, at 4:51 AM, Changlimin  wrote:
> 
> Hi Himanshu,
>  Today I reproduced the issue on my server.
>  First, I compiled kernel 4.15-rc6 (make localmodconfig; make; make 
> modules_install; make install), then started the kernel with the parameter 
> modprobe.blacklist=qla2xxx.
>  Second, I ran tail -f /var/log/syslog.
>  Third, I ran modprobe qla2xxx ql2xextended_error_logging=0x1e40; the log is 
> syslog-1e40.txt.
>  The syslog-7fff log was captured with modprobe qla2xxx 
> ql2xextended_error_logging=0x7fff.
> 
>  BTW, I haven't loaded the 4.9.x driver into kernel 4.15-rc6. 
>  When I check out kernel commit 726b85487067d7f5b23495bc33c484b8517c4074, all 
> the kernel code is 4.9.x.
> 

Sorry for the extended delay in responding. From the syslog that you sent me, 
I do see driver version 10.00.00.02-k, which is from 4.15.0-rc6, so at least 
you are using the correct driver. (In your earlier email you mentioned 
8.07.xx, which was confusing.)

Jan 18 20:30:23 cvknode25 kernel: [  100.991309] qla2xxx [:00:00.0]-0005: : 
QLogic Fibre Channel HBA Driver: 10.00.00.02-k-debug.
Jan 18 20:30:23 cvknode25 kernel: [  100.991486] qla2xxx [:0a:00.0]-001d: : 
Found an ISP2532 irq 16 iobase 0x67aad9fd.
Jan 18 20:30:23 cvknode25 kernel: [  101.651676] qla2xxx [:0a:00.0]-4800:1: 
DPC handler sleeping.
Jan 18 20:30:23 cvknode25 kernel: [  101.651677] scsi host1: qla2xxx

Also I do see:

Jan 18 20:30:24 cvknode25 kernel: [  102.624987] qla2xxx [:0a:00.0]-500a:1: 
LOOP UP detected (8 Gbps).

i.e. the driver was able to bring up an 8G link.

Having said that, I still do not have a clear picture from the provided logs 
of why you are encountering the issue.

Can you please share your configuration details? I would like to see how your 
system is set up, and whether I can replicate it in our lab here.

> Regards
> Chang Limin
> 
> -Original Message-
> From: Madhani, Himanshu [mailto:himanshu.madh...@cavium.com]
> Sent: Thursday, January 18, 2018 2:26 AM
> To: changlimin (Cloud)
> Cc: Nicholas A. Bellinger; Tran, Quinn; jifuliang (Cloud); zhangguanghui 
> (Cloud); zhangzijian (Cloud); target-devel; linux-scsi
> Subject: Re: A qla2xxx commit cause Linux no response, has not fixed in 
> lastest version 4.15-rc6
> 
> Hi Chang, 
> 
>> On Jan 15, 2018, at 10:49 PM, Changlimin  wrote:
>> 
>> Hi Himanshu,
>>  This is my progress.
>>  First, I compiled 4.15-rc6 and found Linux hung when booting; the stack 
>> showed something wrong in the qla2xxx driver.
> 
> Can you provide me detailed steps for how you compiled 4.15-rc6? Also provide 
> details of how you are loading the driver, and provide the complete log file.
> 
> I do not see how you would be able to load a driver which is from 4.9.x when 
> you compile a fresh 4.15.0-rc6. 
> 
> Just FYI, I built a test system with 8G/16G/32G adapters on a 4.15.0-rc6 
> kernel and I am not able to see the hang that you are describing. 
> 
> # uname -r
> 4.15.0-rc6+
> 
> # modprobe qla2xxx
> 
> # fcc.sh
> FC HBAs:
> HBA   Port NamePort ID   State Device
> host3 21:00:00:24:ff:7e:f5:80  01:0d:00  OnlineQLE2742 FW:v8.05.63 
> DVR:v10.00.00.04-k
> host4 21:00:00:24:ff:7e:f5:81  01:0e:00  OnlineQLE2742 FW:v8.05.63 
> DVR:v10.00.00.04-k
> host5 21:00:00:0e:1e:12:e9:a0  01:06:00  OnlineQLE8362 FW:v8.03.06 
> DVR:v10.00.00.04-k
> host6 21:00:00:0e:1e:12:e9:a1  01:14:00  OnlineQLE8362 FW:v8.03.06 
> DVR:v10.00.00.04-k
> host7 21:00:00:24:ff:46:0a:5c  01:0d:00  OnlineQLE2562 FW:v8.03.00 
> DVR:v10.00.00.04-k
> host8 21:00:00:24:ff:46:0a:5d  01:15:00  OnlineQLE2562 FW:v8.03.00 
> DVR:v10.00.00.04-k
> 
> # modinfo qla2xxx | more
> 
> filename:   
> /lib/modules/4.15.0-rc6+/kernel/drivers/scsi/qla2xxx/qla2xxx.ko
> firmware:   ql2500_fw.bin
> firmware:   ql2400_fw.bin
> firmware:   ql2322_fw.bin
> firmware:   ql2300_fw.bin
> firmware:   ql2200_fw.bin
> firmware:   ql2100_fw.bin
> version:10.00.00.04-k
> license:GPL
> description:QLogic Fibre Channel HBA Driver
> author: QLogic Corporation
> srcversion: 6CBCF1372A7756690E83CC3
> 
> 
>>  Second, I wanted to find which commit introduced the issue, so I ran git 
>> bisect on the Linux kernel many times.
>>  Finally, I found that commit 726b85487067d7f5b23495bc33c484b8517c4074 
>> introduced the issue. The attached log is related to this commit.
>>  The Ubuntu kernel also has this issue: 
>> 
>> https://launchpad.net/ubuntu/+archive/primary/+files/linux-image-4.13.
>> 0-25-generic_4.13.0-25.29_amd64.deb
>> 
>> https://launchpad.net/ubuntu/+archive/primary/+files/linux-image-extra
>> -4.13.0-25-generic_4.13.0-25.29_amd64.deb
>> 
>> Regards
>> Chang Limin
>> 
>> -Original Message-
>> From: Madhani, Himanshu [mailto:himanshu.madh...@cavium.com]
>> Sent: Tuesday, January 16, 2018 12:59 PM
>> To: changlimin (Cloud)
>> Cc: Nicholas A. Bellinger; Tran, Quinn; jifuliang (Cloud); 
>> zhangguanghui (Cloud); 

Re: [PATCH] scsi: handle special return codes for ABORTED COMMAND

2018-01-29 Thread Bart Van Assche
On Tue, 2018-01-30 at 00:30 +0100, Martin Wilck wrote:
> + static struct aborted_cmd_blist blist[] = {

Please consider to declare this array const.

> + for (i = 0; i < sizeof(blist)/sizeof(struct aborted_cmd_blist); i++) {

Have you considered to use ARRAY_SIZE()?

Thanks,

Bart.

[PATCH 2/2] scsi_debug: reset injection flags for every_nth > 0

2018-01-29 Thread Martin Wilck
If every_nth > 0, the injection flags must be reset for commands
that aren't supposed to fail (i.e. that aren't "nth"). Otherwise,
commands will continue to fail, like in the every_nth < 0 case.

Signed-off-by: Martin Wilck 
---
 drivers/scsi/scsi_debug.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 86c48797381e..820b98853e69 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -3988,8 +3988,13 @@ static void clear_queue_stats(void)
 static void setup_inject(struct sdebug_queue *sqp,
 struct sdebug_queued_cmd *sqcp)
 {
-	if ((atomic_read(&sdebug_cmnd_count) % abs(sdebug_every_nth)) > 0)
+	if ((atomic_read(&sdebug_cmnd_count) % abs(sdebug_every_nth)) > 0) {
+   if (sdebug_every_nth > 0)
+   sqcp->inj_recovered = sqcp->inj_transport
+   = sqcp->inj_dif
+   = sqcp->inj_dix = sqcp->inj_short = 0;
return;
+   }
sqcp->inj_recovered = !!(SDEBUG_OPT_RECOVERED_ERR & sdebug_opts);
sqcp->inj_transport = !!(SDEBUG_OPT_TRANSPORT_ERR & sdebug_opts);
sqcp->inj_dif = !!(SDEBUG_OPT_DIF_ERR & sdebug_opts);
-- 
2.16.1



[PATCH] scsi: handle special return codes for ABORTED COMMAND

2018-01-29 Thread Martin Wilck
(Resending, because I forgot to cc linux-scsi. BIG SORRY!)

Introduce a new blist flag that indicates the device may return certain
sense code/ASC/ASCQ combinations that indicate different treatment than
normal. In particular, some devices need unconditional retry (aka
ADD_TO_MLQUEUE) under certain conditions; otherwise path failures may be
falsely detected.

Because different devices use different sense codes to indicate this
condition, a single bit is not enough. But we also can't use a bit for
every possible status code. Therefore the new flag BLIST_ABORTED_CMD_QUIRK
just indicates that this is a device for which the return code should be
checked. An extra "blacklist" in scsi_aborted_cmd_quirk() then checks for
known ASC/ASCQ combinations, and handles them.

When such a combination is encountered for the first time, a message is
printed. In systems that have several different peripherals using this
flag, that might lead to a wrong match without a warning. This small risk
is a compromise between exactness and the excessive resources that would be
required to check for matching device vendor and model name every time.

I chose to recycle devinfo flag (1<<11) (former BLIST_INQUIRY_58, free
since 496c91bbc910) for this purpose rather than extending blist_flags_t to
64 bit. This could be easily changed of course.

This patch replaces the previously submitted patches
 "scsi: handle ABORTED_COMMAND on Fujitsu ETERNUS" and
 "scsi: Always retry internal target error" (Hannes Reinecke)

Signed-off-by: Martin Wilck 
---
 drivers/scsi/scsi_devinfo.c |  4 ++-
 drivers/scsi/scsi_error.c   | 82 +
 include/scsi/scsi_devinfo.h |  6 
 3 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_devinfo.c b/drivers/scsi/scsi_devinfo.c
index dfb8da83fa50..39455734d934 100644
--- a/drivers/scsi/scsi_devinfo.c
+++ b/drivers/scsi/scsi_devinfo.c
@@ -161,12 +161,14 @@ static struct {
	{"DGC", "RAID", NULL, BLIST_SPARSELUN},	/* Dell PV 650F, storage on LUN 0 */
	{"DGC", "DISK", NULL, BLIST_SPARSELUN},	/* Dell PV 650F, no storage on LUN 0 */
	{"EMC",  "Invista", "*", BLIST_SPARSELUN | BLIST_LARGELUN},
-	{"EMC", "SYMMETRIX", NULL, BLIST_SPARSELUN | BLIST_LARGELUN | BLIST_REPORTLUN2},
+	{"EMC", "SYMMETRIX", NULL, BLIST_SPARSELUN | BLIST_LARGELUN
+				 | BLIST_REPORTLUN2 | BLIST_ABORTED_CMD_QUIRK},
	{"EMULEX", "MD21/S2 ESDI", NULL, BLIST_SINGLELUN},
	{"easyRAID", "16P", NULL, BLIST_NOREPORTLUN},
	{"easyRAID", "X6P", NULL, BLIST_NOREPORTLUN},
	{"easyRAID", "F8", NULL, BLIST_NOREPORTLUN},
	{"FSC", "CentricStor", "*", BLIST_SPARSELUN | BLIST_LARGELUN},
+	{"FUJITSU", "ETERNUS_DXM", "*", BLIST_ABORTED_CMD_QUIRK},
	{"Generic", "USB SD Reader", "1.00", BLIST_FORCELUN | BLIST_INQUIRY_36},
	{"Generic", "USB Storage-SMC", "0180", BLIST_FORCELUN | BLIST_INQUIRY_36},
	{"Generic", "USB Storage-SMC", "0207", BLIST_FORCELUN | BLIST_INQUIRY_36},
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 62b56de38ae8..91f5cdc85dfa 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "scsi_priv.h"
 #include "scsi_logging.h"
@@ -432,6 +433,84 @@ static void scsi_report_sense(struct scsi_device *sdev,
}
 }
 
+struct aborted_cmd_blist {
+   u8 asc;
+   u8 ascq;
+   int retval;
+   const char *vendor;
+   const char *model;
+   bool warned;
+};
+
+/**
+ * scsi_aborted_cmd_quirk - Handle special return codes for ABORTED COMMAND
+ * @sdev:  SCSI device that returned ABORTED COMMAND.
+ * @sshdr: Sense data
+ *
+ * Return value:
+ * SUCCESS or FAILED or NEEDS_RETRY or ADD_TO_MLQUEUE
+ *
+ * Notes:
+ * This is only called for devices that have the blist flag
+ *  BLIST_ABORTED_CMD_QUIRK set.
+ */
+static int scsi_aborted_cmd_quirk(const struct scsi_device *sdev,
+ const struct scsi_sense_hdr *sshdr)
+{
+   static struct aborted_cmd_blist blist[] = {
+   /*
+* 44/00: SYMMETRIX uses this code for a variety of internal
+* issues, all of which can be recovered by retry
+*/
+   { 0x44, 0x00, ADD_TO_MLQUEUE, "EMC", "SYMMETRIX", false },
+   /*
+* c1/01: This is used by ETERNUS to indicate the
+* command should be retried unconditionally
+*/
+   { 0xc1, 0x01, ADD_TO_MLQUEUE, "FUJITSU", "ETERNUS_DXM", false }
+   };
+   struct aborted_cmd_blist *found = NULL;
+   int ret = NEEDS_RETRY, i;
+
+   for (i = 0; i < sizeof(blist)/sizeof(struct aborted_cmd_blist); i++) {
+   if (sshdr->asc == blist[i].asc &&
+   sshdr->ascq == blist[i].ascq) {
+			found = &blist[i];
+   ret = 

Re: [PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Bart Van Assche
On Mon, 2018-01-29 at 17:14 -0500, Mike Snitzer wrote:
> On Mon, Jan 29 2018 at  4:51pm -0500,
> Bart Van Assche  wrote:
> 
> > On Mon, 2018-01-29 at 16:44 -0500, Mike Snitzer wrote:
> > > But regardless of which wins the race, the queue will have been run.
> > > Which is all we care about right?
> > 
> > Running the queue is not sufficient. With this patch applied it can happen
> > that the block driver returns BLK_STS_DEV_RESOURCE, that the two or more
> > concurrent queue runs finish before sufficient device resources are 
> > available
> > to execute the request and that blk_mq_delay_run_hw_queue() does not get
> > called at all. If no other activity triggers a queue run, e.g. request
> > completion, this will result in a queue stall.
> 
> If BLK_STS_DEV_RESOURCE is returned then the driver doesn't need to rely
> on a future queue run.  IIUC, that is the entire premise of
> BLK_STS_DEV_RESOURCE.  If the driver had doubt about whether the
> resource were going to be available in the future it should return
> BLK_STS_RESOURCE.
> 
> That may seem like putting a lot on a driver developer (to decide
> between the 2) but I'll again defer to Jens here.  This was the approach
> he was advocating be pursued.

Hello Mike,

There was a typo in my reply: I should have referred to BLK_STS_RESOURCE
instead of BLK_STS_DEV_RESOURCE. The convention for BLK_STS_DEV_RESOURCE is
namely that if .queue_rq() returns that status code, the block driver
guarantees that the queue will be rerun.

Bart.

Re: [PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Mike Snitzer
On Mon, Jan 29 2018 at  4:51pm -0500,
Bart Van Assche  wrote:

> On Mon, 2018-01-29 at 16:44 -0500, Mike Snitzer wrote:
> > But regardless of which wins the race, the queue will have been run.
> > Which is all we care about right?
> 
> Running the queue is not sufficient. With this patch applied it can happen
> that the block driver returns BLK_STS_DEV_RESOURCE, that the two or more
> concurrent queue runs finish before sufficient device resources are available
> to execute the request and that blk_mq_delay_run_hw_queue() does not get
> called at all. If no other activity triggers a queue run, e.g. request
> completion, this will result in a queue stall.

If BLK_STS_DEV_RESOURCE is returned then the driver doesn't need to rely
on a future queue run.  IIUC, that is the entire premise of
BLK_STS_DEV_RESOURCE.  If the driver had doubt about whether the
resource were going to be available in the future it should return
BLK_STS_RESOURCE.

That may seem like putting a lot on a driver developer (to decide
between the 2) but I'll again defer to Jens here.  This was the approach
he was advocating be pursued.

In practice it is working.  I would welcome you testing this V4.

Thanks,
Mike


Re: [PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Bart Van Assche
On Mon, 2018-01-29 at 16:44 -0500, Mike Snitzer wrote:
> But regardless of which wins the race, the queue will have been run.
> Which is all we care about right?

Running the queue is not sufficient. With this patch applied it can happen
that the block driver returns BLK_STS_DEV_RESOURCE, that the two or more
concurrent queue runs finish before sufficient device resources are available
to execute the request and that blk_mq_delay_run_hw_queue() does not get
called at all. If no other activity triggers a queue run, e.g. request
completion, this will result in a queue stall.

Bart.

Re: [PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Mike Snitzer
On Mon, Jan 29 2018 at  4:22pm -0500,
Bart Van Assche  wrote:

> On Mon, 2018-01-29 at 15:33 -0500, Mike Snitzer wrote:
> > +* If driver returns BLK_STS_RESOURCE and SCHED_RESTART
> > +* bit is set, run queue after a delay to avoid IO stalls
> > +* that could otherwise occur if the queue is idle.
> >  */
> > -   if (!blk_mq_sched_needs_restart(hctx) ||
> > +   needs_restart = blk_mq_sched_needs_restart(hctx);
> > +   if (!needs_restart ||
> > (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> > blk_mq_run_hw_queue(hctx, true);
> > +   else if (needs_restart && (ret == BLK_STS_RESOURCE))
> > +   blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
> > }
> 
> The above code only calls blk_mq_delay_run_hw_queue() if the following 
> condition
> is met: !(!needs_restart || (no_tag && 
> list_empty_careful(&hctx->dispatch_wait.entry)))
> && (needs_restart && (ret == BLK_STS_RESOURCE)). That boolean expression can 
> be
> simplified into the following: needs_restart && ret == BLK_STS_RESOURCE && 
> !(no_tag && list_empty_careful(&hctx->dispatch_wait.entry)). From 
> blk-mq-sched.h:
> 
> static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
> {
>   return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> }
> 
> In other words, the above code will not call blk_mq_delay_run_hw_queue() if
> BLK_MQ_S_SCHED_RESTART is cleared after it got set and before the above code 
> is
> reached. The BLK_MQ_S_SCHED_RESTART bit gets cleared if a request completes
> concurrently with the above code. From blk-mq-sched.c:
> 
> static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
> {
>   if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
>   return false;
> 
>   if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
>   struct request_queue *q = hctx->queue;
> 
>   if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
>   atomic_dec(&q->shared_hctx_restart);
>   } else
>   clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> 
>   return blk_mq_run_hw_queue(hctx, true);
> }
> 
> The above blk_mq_run_hw_queue() call may finish either before or after
> blk_mq_dispatch_rq_list() examines the BLK_MQ_S_SCHED_RESTART flag.

But regardless of which wins the race, the queue will have been run.
Which is all we care about right?

> That's why I wrote in previous e-mails that this patch and also the previous
> versions of this patch change a systematic call of blk_mq_delay_run_hw_queue()
> into a call that may or may not happen, depending on whether or not a request
> completes concurrently with request queueing. Sorry but I think that means
> that the above change combined with changing BLK_STS_RESOURCE into
> BLK_STS_DEV_RESOURCE is wrong and why I expect that this will result in
> sporadic queue stalls. As you know sporadic queue stalls are annoying and hard
> to debug.
> 
> Plenty of e-mails have already been exchanged about versions one to four of 
> this
> patch but so far nobody has even tried to contradict the above.

Sure, but we aren't mind-readers.  We're all very open to the idea that
you have noticed something we haven't.  But until your very helpful
mail I'm replying to, you hadn't been specific about the race that
concerned you.  Thanks for sharing.

I'll defer to Jens and/or Ming on whether the race you've raised is of
concern.  As I said above, I'm inclined to think it doesn't matter who
wins the race given the queue will be run again.  And when it does it
may get STS_RESOURCE again, and in that case it may then resort to
blk_mq_delay_run_hw_queue() in blk_mq_dispatch_rq_list().

Mike


Re: [PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Bart Van Assche
On Mon, 2018-01-29 at 15:33 -0500, Mike Snitzer wrote:
> +  * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
> +  * bit is set, run queue after a delay to avoid IO stalls
> +  * that could otherwise occur if the queue is idle.
>*/
> - if (!blk_mq_sched_needs_restart(hctx) ||
> + needs_restart = blk_mq_sched_needs_restart(hctx);
> + if (!needs_restart ||
>   (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>   blk_mq_run_hw_queue(hctx, true);
> + else if (needs_restart && (ret == BLK_STS_RESOURCE))
> + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
>   }

The above code only calls blk_mq_delay_run_hw_queue() if the following condition
is met: !(!needs_restart || (no_tag && 
list_empty_careful(&hctx->dispatch_wait.entry)))
&& (needs_restart && (ret == BLK_STS_RESOURCE)). That boolean expression can be
simplified into the following: needs_restart && ret == BLK_STS_RESOURCE && 
!(no_tag && list_empty_careful(&hctx->dispatch_wait.entry)). From 
blk-mq-sched.h:

static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
{
return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
}

In other words, the above code will not call blk_mq_delay_run_hw_queue() if
BLK_MQ_S_SCHED_RESTART is cleared after it got set and before the above code is
reached. The BLK_MQ_S_SCHED_RESTART bit gets cleared if a request completes
concurrently with the above code. From blk-mq-sched.c:

static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
{
if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
return false;

if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
struct request_queue *q = hctx->queue;

if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
atomic_dec(&q->shared_hctx_restart);
} else
clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);

return blk_mq_run_hw_queue(hctx, true);
}

The above blk_mq_run_hw_queue() call may finish either before or after
blk_mq_dispatch_rq_list() examines the BLK_MQ_S_SCHED_RESTART flag.

That's why I wrote in previous e-mails that this patch and also the previous
versions of this patch change a systematic call of blk_mq_delay_run_hw_queue()
into a call that may or may not happen, depending on whether or not a request
completes concurrently with request queueing. Sorry but I think that means
that the above change combined with changing BLK_STS_RESOURCE into
BLK_STS_DEV_RESOURCE is wrong and why I expect that this will result in
sporadic queue stalls. As you know sporadic queue stalls are annoying and hard
to debug.

Plenty of e-mails have already been exchanged about versions one to four of this
patch but so far nobody has even tried to contradict the above.

Bart.

Re: [LSF/MM TOPIC] Two blk-mq related topics

2018-01-29 Thread Jens Axboe
On 1/29/18 1:56 PM, James Bottomley wrote:
> On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote:
> [...]
>> 2. When to enable SCSI_MQ by default again?
> 
> I'm not sure there's much to discuss ... I think the basic answer is as
> soon as Christoph wants to try it again.

FWIW, internally I've been running various IO intensive workloads on
what is essentially 4.12 upstream with scsi-mq the default (with
mq-deadline as the scheduler) and comparing IO workloads with a
previous 4.6 kernel (without scsi-mq), and things are looking
great.

We're never going to iron out the last kinks with it being off
by default. I think we should attempt to flip the switch again
for 4.16.

-- 
Jens Axboe



Re: [LSF/MM TOPIC] Two blk-mq related topics

2018-01-29 Thread James Bottomley
On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote:
[...]
> 2. When to enable SCSI_MQ by default again?

I'm not sure there's much to discuss ... I think the basic answer is as
soon as Christoph wants to try it again.

> SCSI_MQ was first introduced in V3.17, but disabled by default. In
> V4.13-rc1 it was enabled by default, but the patch was reverted in
> V4.13-rc7, so it became disabled by default again.
> 
> Now both the originally reported PM issue (actually SCSI quiesce) and
> the sequential IO performance issue have been addressed.

Is the blocker bug just not closed because no-one thought to do it:

https://bugzilla.kernel.org/show_bug.cgi?id=178381

(have we confirmed with the original reporter that this issue is now
fixed?)

And did the Huawei guy (Jonathan Cameron) confirm his performance issue
was fixed (I don't think I saw an email saying he did)?

James




Re: [LSF/MM TOPIC] Two blk-mq related topics

2018-01-29 Thread Mike Snitzer
On Mon, Jan 29 2018 at 10:46am -0500,
Ming Lei  wrote:
 
> 2. When to enable SCSI_MQ by default again?
> 
> SCSI_MQ was first introduced in V3.17, but disabled by default. In V4.13-rc1
> it was enabled by default, but the patch was reverted in V4.13-rc7, so it
> became disabled by default again.
> 
> Now both the originally reported PM issue (actually SCSI quiesce) and the
> sequential IO performance issue have been addressed. And MQ IO schedulers
> are ready too for traditional disks. Are there other issues to be addressed
> before enabling SCSI_MQ by default? When can we do that again?
> 
> Last time, the two issues were reported during the V4.13 dev cycle just when
> it was enabled by default; it seems that unless SCSI_MQ is enabled by default,
> it never gets run and tested completely and fully.
> 
> So if we keep disabling it by default, it may never be exposed to full test
> and production environments.

I was going to propose revisiting this as well.

I'd really like to see all the old .request_fn block core code removed.

But maybe we take a first step of enabling:
CONFIG_SCSI_MQ_DEFAULT=Y
CONFIG_DM_MQ_DEFAULT=Y

Thanks,
Mike
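
For testing without a rebuild, the same defaults can also be selected at boot
time through the corresponding module parameters, e.g. on the kernel command
line (assuming the stock parameter names):

	scsi_mod.use_blk_mq=Y dm_mod.use_blk_mq=Y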


[PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Mike Snitzer
From: Ming Lei 

This status is returned from the driver to the block layer if a device
related resource is unavailable, but the driver can guarantee that IO
dispatch will be triggered in the future when the resource becomes
available.

Convert some drivers to return BLK_STS_DEV_RESOURCE.  Also, if the driver
returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun the queue after
a delay (BLK_MQ_QUEUE_DELAY) to avoid IO stalls.  BLK_MQ_QUEUE_DELAY is
3 ms because both scsi-mq and nvmefc are using that magic value.

Suggested-by: Jens Axboe 
Tested-by: Laurence Oberman 
Signed-off-by: Ming Lei 
Signed-off-by: Mike Snitzer 
---

V4:
- cleanup header and code comments
- rerun queue after BLK_MQ_QUEUE_DELAY (3ms) instead of 10ms
- eliminate nvmefc's queue rerun now that blk-mq does it
V3:
- fix typo, and improvement document
- add tested-by tag
V2:
- add comments on the new introduced status
- patch style fix
- both are suggested by Christoph

 block/blk-core.c |  1 +
 block/blk-mq.c   | 20 
 drivers/block/null_blk.c |  2 +-
 drivers/block/virtio_blk.c   |  2 +-
 drivers/block/xen-blkfront.c |  2 +-
 drivers/md/dm-rq.c   |  5 ++---
 drivers/nvme/host/fc.c   | 12 ++--
 drivers/scsi/scsi_lib.c  |  6 +++---
 include/linux/blk_types.h| 14 ++
 9 files changed, 41 insertions(+), 23 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index cdae69be68e9..38279d4ae08b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -145,6 +145,7 @@ static const struct {
[BLK_STS_MEDIUM]= { -ENODATA,   "critical medium" },
[BLK_STS_PROTECTION]= { -EILSEQ,"protection" },
[BLK_STS_RESOURCE]  = { -ENOMEM,"kernel resource" },
+   [BLK_STS_DEV_RESOURCE]  = { -ENOMEM,"device resource" },
[BLK_STS_AGAIN] = { -EAGAIN,"nonblocking retry" },
 
/* device mapper special case, should not leak out: */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 43e7449723e0..dd097ca5f1e9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1160,6 +1160,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
**hctx,
return true;
 }
 
+#define BLK_MQ_QUEUE_DELAY 3   /* ms units */
+
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 bool got_budget)
 {
@@ -1167,6 +1169,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list,
struct request *rq, *nxt;
bool no_tag = false;
int errors, queued;
+   blk_status_t ret = BLK_STS_OK;
 
if (list_empty(list))
return false;
@@ -1179,7 +1182,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list,
errors = queued = 0;
do {
struct blk_mq_queue_data bd;
-   blk_status_t ret;
 
rq = list_first_entry(list, struct request, queuelist);
if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
@@ -1224,7 +1226,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list,
}
 
ret = q->mq_ops->queue_rq(hctx, &bd);
-   if (ret == BLK_STS_RESOURCE) {
+   if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
/*
 * If an I/O scheduler has been configured and we got a
 * driver tag for the next request already, free it
@@ -1255,6 +1257,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list,
 * that is where we will continue on next queue run.
 */
if (!list_empty(list)) {
+   bool needs_restart;
+
spin_lock(&hctx->lock);
list_splice_init(list, &hctx->dispatch);
spin_unlock(&hctx->lock);
@@ -1278,10 +1282,17 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list,
 * - Some but not all block drivers stop a queue before
 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
 *   and dm-rq.
+*
+* If driver returns BLK_STS_RESOURCE and SCHED_RESTART
+* bit is set, run queue after a delay to avoid IO stalls
+* that could otherwise occur if the queue is idle.
 */
-   if (!blk_mq_sched_needs_restart(hctx) ||
+   needs_restart = blk_mq_sched_needs_restart(hctx);
+   if (!needs_restart ||
(no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
blk_mq_run_hw_queue(hctx, true);
+   else if (needs_restart && (ret == BLK_STS_RESOURCE))
+   blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);

Re: [PATCH 06/13] lpfc: Add 64G link speed support

2018-01-29 Thread James Smart

On 1/29/2018 3:45 AM, Steffen Maier wrote:

@@ -2966,6 +2975,9 @@ struct lpfc_mbx_read_top {

  #define LPFC_LINK_SPEED_10GHZ    0x40
  #define LPFC_LINK_SPEED_16GHZ    0x80
  #define LPFC_LINK_SPEED_32GHZ    0x90
+#define LPFC_LINK_SPEED_64GHZ    0xA0
+#define LPFC_LINK_SPEED_128GHZ    0xB0
+#define LPFC_LINK_SPEED_2568GHZ    0xC0

  ^
typo? 2568 => 256

The new 128 and 256 definitions do not seem to be used in this patch set?



Yep. And not used yet.  I'll fix in a repost.  For the repost, I'll wait 
until the other patches have a review.


-- james



Re: [PATCH] tcmu: Allow reconfig to handle multiple attributes

2018-01-29 Thread Bryant G. Ly

On 1/10/18 12:26 PM, Mike Christie wrote:

> On 01/04/2018 10:11 AM, Bryant G. Ly wrote:
>> This patch allows for multiple attributes to be reconfigured
>> and handled all in one call as compared to multiple netlinks.
>>
>> Example:
>> set attribute dev_reconfig=dev_config=fbo//home/path:dev_size=2147483648
>>
> I know I suggested this, but I think I was wrong. If we have to support
> other apps that work in reverse to targetcli+tcmu-runner where the
> tcmu-runner equivalent app sets things up then contacts the kernel,
> let's just not do passthrough operations like this for reconfig. There
> is no need to bring in the kernel.
>
> For the initial config we can still do it since we have to maintain
> compat, but for major reconfigs like this let's just have targetcli
> contact tcmu-runner, then runner can update the kernel if needed.
>
> So in targetcli and runner copy the check_config stuff, and add a
> reconfig callout that works like it. We then do not have this weird
> kernel passthrough and do not have to worry about the non
> targetcl-tcmu-runner type of model.
>
>
>
So you basically want me to use D-Bus, correct? Connect a GCallback function
with the reconfig function and use D-Bus to pass info back to the kernel if
necessary?

-Bryant



Re: [PATCH 1/3] devicetree: bindings: scsi: hisi_sas: add LED feature for v2 hw

2018-01-29 Thread Rob Herring
On Thu, Jan 18, 2018 at 12:46:52AM +0800, John Garry wrote:
> From: Xiaofei Tan 

"dt-bindings: ..." is the preferred subject prefix.

> 
> Add directly attached disk LED feature for v2 hw.
> 
> Signed-off-by: Xiaofei Tan 
> Signed-off-by: John Garry 
> ---
>  Documentation/devicetree/bindings/scsi/hisilicon-sas.txt | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/devicetree/bindings/scsi/hisilicon-sas.txt 
> b/Documentation/devicetree/bindings/scsi/hisilicon-sas.txt
> index b6a869f..df3bef7 100644
> --- a/Documentation/devicetree/bindings/scsi/hisilicon-sas.txt
> +++ b/Documentation/devicetree/bindings/scsi/hisilicon-sas.txt
> @@ -8,7 +8,10 @@ Main node required properties:
>   (b) "hisilicon,hip06-sas-v2" for v2 hw in hip06 chipset
>   (c) "hisilicon,hip07-sas-v2" for v2 hw in hip07 chipset
>- sas-addr : array of 8 bytes for host SAS address
> -  - reg : Address and length of the SAS register
> +  - reg : Contains two regions. The first is the address and length of the 
> SAS
> +  register. The second is the address and length of CPLD register for
> +  SGPIO control. The second is optional, and should be set only when
> +  we use a CPLD for directly attached disk LED control.

Ah SGPIO, what nice memories I have of trying to support that after the 
chip was done. Seems you have the same fortune. :)

What happens if you have a different CPLD or SGPIO control? Really this 
should probably be a separate node, but the kernel isn't set up for 
separate SGPIO drivers either. 

Reviewed-by: Rob Herring 

Rob


Re: [PATCH resend 6/6] cdrom: wait for drive to become ready

2018-01-29 Thread Bart Van Assche
On Fri, 2018-01-26 at 17:58 +0100, Michal Suchanek wrote:
> When the drive closes, it can take tens of seconds until the disc is
> analyzed. Wait for the drive to become ready or report an error.
> 
> Signed-off-by: Michal Suchanek 
> ---
>  drivers/cdrom/cdrom.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/cdrom/cdrom.c b/drivers/cdrom/cdrom.c
> index 69e85c902373..9994441f5041 100644
> --- a/drivers/cdrom/cdrom.c
> +++ b/drivers/cdrom/cdrom.c
> @@ -1087,6 +1087,15 @@ int open_for_common(struct cdrom_device_info *cdi, 
> tracktype *tracks)
>   }
>   cd_dbg(CD_OPEN, "the tray is now closed\n");
>   }
> + /* the door should be closed now, check for the disc */
> + if (ret == CDS_DRIVE_NOT_READY) {
> + int poll_res = poll_event_interruptible(
> + CDS_DRIVE_NOT_READY !=
> + (ret = cdo->drive_status(cdi, CDSL_CURRENT)),
> + 500);
> + if (poll_res == -ERESTARTSYS)
> + return poll_res;
> + }
>   if (ret != CDS_DISC_OK)
>   return -ENOMEDIUM;
>   }

Same comment here as for a previous patch: although interruptible by a signal,
I'm not sure potentially infinite loops inside the kernel are really welcome.

Thanks,

Bart.

Re: [PATCH resend 3/6] cdrom: wait for tray to close

2018-01-29 Thread Bart Van Assche
On Fri, 2018-01-26 at 17:58 +0100, Michal Suchanek wrote:
> +static int cdrom_tray_close(struct cdrom_device_info *cdi)
> +{
> + int ret;
> +
> + ret = cdi->ops->tray_move(cdi, 0);
> + if (ret || !cdi->ops->drive_status)
> + return ret;
> +
> + return poll_event_interruptible(CDS_TRAY_OPEN !=
> + cdi->ops->drive_status(cdi, CDSL_CURRENT), 500);
> +}
> +
>  static
>  int open_for_common(struct cdrom_device_info *cdi, tracktype *tracks)
>  {
> @@ -1048,7 +1062,9 @@ int open_for_common(struct cdrom_device_info *cdi, 
> tracktype *tracks)
>   if (CDROM_CAN(CDC_CLOSE_TRAY) &&
>   cdi->options & CDO_AUTO_CLOSE) {
>   cd_dbg(CD_OPEN, "trying to close the tray\n");
> - ret = cdo->tray_move(cdi, 0);
> + ret = cdrom_tray_close(cdi);
> + if (ret == -ERESTARTSYS)
> + return ret;
>   if (ret) {
>   cd_dbg(CD_OPEN, "bummer. tried to close 
> the tray but failed.\n");
>   /* Ignore the error from the low
> @@ -2312,7 +2328,8 @@ static int cdrom_ioctl_closetray(struct 
> cdrom_device_info *cdi)
>  
>   if (!CDROM_CAN(CDC_CLOSE_TRAY))
>   return -ENOSYS;
> - return cdi->ops->tray_move(cdi, 0);
> +
> + return cdrom_tray_close(cdi);
>  }

So this patch changes code that does not wait into code that potentially waits
forever? Sorry but I don't think that's ideal. Please make sure that after a
certain time (a few seconds?) the loop finishes.

Thanks,

Bart.

Re: [PATCH resend 2/6] cdrom: factor out common open_for_* code

2018-01-29 Thread Bart Van Assche
On Fri, 2018-01-26 at 17:58 +0100, Michal Suchanek wrote:
> - ret=cdo->tray_move(cdi,0);
> + ret = cdo->tray_move(cdi, 0);

Please separate whitespace-only changes from functional changes such that
this patch series becomes easier to review.

Thanks,

Bart.




Re: [PATCH resend 1/6] delay: add poll_event_interruptible

2018-01-29 Thread Bart Van Assche
On Fri, 2018-01-26 at 17:58 +0100, Michal Suchanek wrote:
> Add convenience macro for polling an event that does not have a
> waitqueue.
> 
> Signed-off-by: Michal Suchanek 
> ---
>  include/linux/delay.h | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/include/linux/delay.h b/include/linux/delay.h
> index b78bab4395d8..3ae9fa395628 100644
> --- a/include/linux/delay.h
> +++ b/include/linux/delay.h
> @@ -64,4 +64,16 @@ static inline void ssleep(unsigned int seconds)
>   msleep(seconds * 1000);
>  }
>  
> +#define poll_event_interruptible(event, interval) ({ \
> + int ret = 0; \
> + while (!(event)) { \
> + if (signal_pending(current)) { \
> + ret = -ERESTARTSYS; \
> + break; \
> + } \
> + msleep_interruptible(interval); \
> + } \
> + ret; \
> +})
> +
>  #endif /* defined(_LINUX_DELAY_H) */

Sorry but I'm not sure we should encourage other kernel developers to use
busy-waiting by adding the poll_event_interruptible() macro to a system-wide
header file. Can that macro be moved into a CDROM-specific .c or .h file?

Thanks,

Bart.
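
A timeout-bounded variant, kept in a CDROM-local file as suggested above,
would address both concerns. A minimal sketch, reusing the msleep-based
polling of the original patch; the name and the -ETIMEDOUT convention here
are made up for illustration:

/* e.g. in drivers/cdrom/cdrom.c; needs <linux/delay.h>, <linux/jiffies.h>
 * and <linux/sched/signal.h> */
#define poll_event_interruptible_timeout(event, interval, timeout_ms) ({ \
	unsigned long __end = jiffies + msecs_to_jiffies(timeout_ms); \
	int __ret = 0; \
	while (!(event)) { \
		if (signal_pending(current)) { \
			__ret = -ERESTARTSYS; \
			break; \
		} \
		if (time_after(jiffies, __end)) { \
			__ret = -ETIMEDOUT; \
			break; \
		} \
		msleep_interruptible(interval); \
	} \
	__ret; \
})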

RE: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue

2018-01-29 Thread Kashyap Desai
> -Original Message-
> From: Hannes Reinecke [mailto:h...@suse.de]
> Sent: Monday, January 29, 2018 2:29 PM
> To: Kashyap Desai; linux-scsi@vger.kernel.org; Peter Rivera
> Subject: Re: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing
> of
> reply queue
>
> On 01/15/2018 01:12 PM, Kashyap Desai wrote:
> > Hi All -
> >
> > We have seen CPU lockup issues in the field on systems with a large
> > (more than 96) logical CPU count.
> > SAS3.0 controller (Invader series) supports at max 96 msix vector and
> > SAS3.5 product (Ventura) supports at max 128 msix vectors.
> >
> > This may be a generic issue (if PCI device support  completion on
> > multiple reply queues). Let me explain it w.r.t to mpt3sas supported
> > h/w just to simplify the problem and possible changes to handle such
> > issues. IT HBA
> > (mpt3sas) supports multiple reply queues in completion path. Driver
> > creates MSI-x vectors for controller as "min of ( FW supported Reply
> > queue, Logical CPUs)". If submitter is not interrupted via completion
> > on same CPU, there is a loop in the IO path. This behavior can cause
> > hard/soft CPU lockups, IO timeout, system sluggish etc.
> >
> > Example - one CPU (e.g. CPU A) is busy submitting the IOs and another
> > CPU (e.g. CPU B) is busy with processing the corresponding IO's reply
> > descriptors from reply descriptor queue upon receiving the interrupts
> > from HBA. If the CPU A is continuously pumping the IOs then always CPU
> > B (which is executing the ISR) will see the valid reply descriptors in
> > the reply descriptor queue and it will be continuously processing
> > those reply descriptor in a loop without quitting the ISR handler.
> > Mpt3sas driver will exit ISR handler if it finds unused reply
> > descriptor in the reply descriptor queue. Since CPU A will be
> > continuously sending the IOs, CPU B may always see a valid reply
> > descriptor (posted by HBA Firmware after processing the IO) in the
> > reply descriptor queue. In worst case, driver will not quit from this
> > loop in the ISR handler. Eventually, CPU lockup will be detected by
> watchdog.
> >
> > Above mentioned behavior is not common if "rq_affinity" set to 2 or
> > affinity_hint is honored by irqbalance as "exact".
> > If rq_affinity is set to 2, submitter will be always interrupted via
> > completion on same CPU.
> > If irqbalance is using "exact" policy, interrupt will be delivered to
> > submitter CPU.
> >
> > Problem statement -
> > If CPU counts to MSI-X vectors (reply descriptor Queues) count ratio
> > is not 1:1, we still have  exposure of issue explained above and for
> > that we don't have any solution.
> >
> > Exposure of soft/hard lockup if CPU count is more than MSI-x supported
> > by device.
> >
> > If CPUs count to MSI-x vectors count ratio is not 1:1, (Other way, if
> > CPU counts to MSI-x vector count ratio is something like X:1, where X
> > > 1) then 'exact' irqbalance policy OR rq_affinity = 2 won't help to
> > avoid CPU hard/soft lockups. There won't be any one to one mapping
> > between CPU to MSI-x vector instead one MSI-x interrupt (or reply
> > descriptor queue) is shared with group/set of CPUs and there is a
> > possibility of having a loop in the IO path within that CPU group and
> > may
> observe lockups.
> >
> > For example: Consider a system having two NUMA nodes and each node
> > having four logical CPUs and also consider that number of MSI-x
> > vectors enabled on the HBA is two, then CPUs count to MSI-x vector count
> ratio as 4:1.
> > e.g.
> > MSIx vector 0 is affinity to  CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA node
> > 0 and MSI-x vector 1 is affinity to CPU 4, CPU 5, CPU 6 & CPU 7 of
> > NUMA node 1.
> >
> > numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3   --> MSI-x 0
> > node 0 size: 65536 MB
> > node 0 free: 63176 MB
> > node 1 cpus: 4 5 6 7   --> MSI-x 1
> > node 1 size: 65536 MB
> > node 1 free: 63176 MB
> >
> > Assume that user started an application which uses all the CPUs of
> > NUMA node 0 for issuing the IOs.
> > Only one CPU from affinity list (it can be any cpu since this behavior
> > depends upon irqbalance) CPU0 will receive the interrupts from MSIx
> > vector
> > 0 for all the IOs. Eventually, CPU 0 IO submission percentage will be
> > decreasing and ISR processing percentage will be increasing as it is
> > more busy with processing the interrupts. Gradually IO submission
> > percentage on CPU 0 will be zero and it's ISR processing percentage
> > will be 100 percentage as IO loop has already formed within the NUMA
> > node 0, i.e. CPU 1, CPU 2 & CPU 3 will be continuously busy with
> > submitting the heavy IOs and only CPU 0 is busy in the ISR path as it
> > always find the valid reply descriptor in the reply descriptor queue.
> > Eventually, we will observe the hard lockup here.
> >
> > Chances of occurring of hard/soft lockups are directly proportional to
> > value of X. If value of X is 

Re: [PATCH V3] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Bart Van Assche
On Sun, 2018-01-28 at 07:41 +0800, Ming Lei wrote:
> Not to mention, the request isn't added to the dispatch list yet in
> .queue_rq(); strictly speaking, it is not correct to call
> blk_mq_delay_run_hw_queue() in .queue_rq(), so the current block layer
> API can't handle it well enough.

I disagree that it is not correct to call blk_mq_delay_run_hw_queue() from
inside .queue_rq(). Additionally, I have already explained to you in
previous e-mails why it's fine to call that function from inside .queue_rq():
- Nobody has ever observed the race you described because the time after
  which a queue is rerun by blk_mq_delay_run_hw_queue() is much larger than
  the time during which that race exists.
- It is easy to fix this race inside the block layer, namely by using
  call_rcu() inside the blk_mq_delay_run_hw_queue() implementation to
  postpone the queue rerunning until after the request has been added back to
  the dispatch list.

> > - The patch at the start of this thread introduces a regression in the
> >   SCSI core, namely a queue stall if a request completion occurs 
> > concurrently
> >   with the newly added BLK_MQ_S_SCHED_RESTART test in the blk-mq core.
> 
> This patch only moves the blk_mq_delay_run_hw_queue() from scsi_queue_rq()
> to blk-mq, again, please explain it in detail how this patch V3 introduces 
> this
> regression on SCSI.

Please reread the following message: 
https://marc.info/?l=dm-devel&m=151672480107560.

Thanks,

Bart.

RE: [LSF/MM TOPIC] irq affinity handling for high CPU count machines

2018-01-29 Thread Kashyap Desai
> -Original Message-
> From: Bart Van Assche [mailto:bart.vanass...@wdc.com]
> Sent: Monday, January 29, 2018 10:08 PM
> To: Elliott, Robert (Persistent Memory); Hannes Reinecke;
> lsf-pc@lists.linux-
> foundation.org
> Cc: linux-scsi@vger.kernel.org; linux-n...@lists.infradead.org; Kashyap
> Desai
> Subject: Re: [LSF/MM TOPIC] irq affinity handling for high CPU count
> machines
>
> On 01/29/18 07:41, Elliott, Robert (Persistent Memory) wrote:
> >> -Original Message-
> >> From: Linux-nvme [mailto:linux-nvme-boun...@lists.infradead.org] On
> >> Behalf Of Hannes Reinecke
> >> Sent: Monday, January 29, 2018 3:09 AM
> >> To: lsf...@lists.linux-foundation.org
> >> Cc: linux-n...@lists.infradead.org; linux-scsi@vger.kernel.org;
> >> Kashyap Desai 
> >> Subject: [LSF/MM TOPIC] irq affinity handling for high CPU count
> >> machines
> >>
> >> Hi all,
> >>
> >> here's a topic which came up on the SCSI ML (cf thread '[RFC 0/2]
> >> mpt3sas/megaraid_sas: irq poll and load balancing of reply queue').
> >>
> >> When doing I/O tests on a machine with more CPUs than MSIx vectors
> >> provided by the HBA, we can easily set up a scenario where one CPU is
> >> submitting I/O and the other one is completing I/O, which will result
> >> in the latter CPU being stuck in the interrupt completion routine
> >> basically forever, resulting in the lockup detector kicking in.
> >>
> >> How should these situations be handled?
> >> Should it be made the responsibility of the drivers, ensuring that
> >> the interrupt completion routine is terminated after a certain time?
> >> Should it be made the responsibility of the upper layers?
> >> Should it be the responsibility of the interrupt mapping code?
> >> Can/should interrupt polling be used in these situations?
> >
> > Back when we introduced scsi-mq with hpsa, the best approach was to
> > route interrupts and completion handling so each CPU core handles its
> > own submissions; this way, they are self-throttling.


The ideal scenario is to make sure the submitter is interrupted for the
completion.  That is not possible to arrange via tuning like rq_affinity=2
(and the "exact" irqbalance policy) if we have more CPUs than MSI-x vectors
supported by the controller. If we use the irq poll interface with a
reasonable weight in the irq poll API, we will no longer see CPU lockups,
because the low level driver will quit the ISR routine after each weighted
batch of completions. There will always be a chance of back-to-back
completion pressure on the same CPU, but the irq poll design still lets the
watchdog task run and its timestamp be updated. Using irq poll we may see
close to 100% CPU consumption, but there will be no lockup detection.
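
For reference, a minimal sketch of the irq poll usage described above; the
mydrv_* names and the reply-processing helper are hypothetical stand-ins for
the driver's reply-queue machinery:

#include <linux/interrupt.h>
#include <linux/irq_poll.h>

#define MYDRV_IRQPOLL_WEIGHT	64	/* hypothetical per-round budget */

struct mydrv_reply_queue {
	struct irq_poll iop;
	/* ... reply descriptor ring state ... */
};

/* Hypothetical: consume up to 'budget' reply descriptors, return count. */
int mydrv_process_replies(struct mydrv_reply_queue *rq, int budget);

/* Hard IRQ handler: just hand the queue over to the irq_poll softirq. */
static irqreturn_t mydrv_isr(int irq, void *data)
{
	struct mydrv_reply_queue *rq = data;

	irq_poll_sched(&rq->iop);
	return IRQ_HANDLED;
}

/* Poll callback: bounded work per invocation, so the watchdog can run. */
static int mydrv_irqpoll(struct irq_poll *iop, int budget)
{
	struct mydrv_reply_queue *rq =
		container_of(iop, struct mydrv_reply_queue, iop);
	int done = mydrv_process_replies(rq, budget);

	if (done < budget)
		irq_poll_complete(iop);	/* drained: leave polling mode */
	return done;
}

/* At init: irq_poll_init(&rq->iop, MYDRV_IRQPOLL_WEIGHT, mydrv_irqpoll); */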

>
> That approach may work for the hpsa adapter but I'm not sure whether it
> works for all adapter types. It has already been observed with the SRP
> initiator
> driver running inside a VM that a single core spent all its time
> processing IB
> interrupts.
>
> Additionally, only initiator workloads are self-throttling. Target style
> workloads are not self-throttling.
>
> In other words, I think it's worth discussing this topic further.
>
> Bart.
>


RE: [PATCH 3/3] scsi: aacraid: Auto detect INTx or MSIx mode during sync cmd processing

2018-01-29 Thread Dave Carroll
> -Original Message-
> From: Raghava Aditya Renukunta
> [mailto:raghavaaditya.renuku...@microsemi.com]
> Sent: Friday, January 19, 2018 6:02 PM
> To: j...@linux.vnet.ibm.com; martin.peter...@oracle.com; linux-
> s...@vger.kernel.org
> Cc: Scott Benesh ; Tom White
> ; dl-esc-Aacraid Linux Driver
> ; Guilherme G . Piccoli
> ; Bart Van Assche 
> Subject: [PATCH 3/3] scsi: aacraid: Auto detect INTx or MSIx mode during sync
> cmd processing
> 
> During sync command processing, if the legacy INTx status indicates
> the command is not completed, sample the MSIx register and check if
> it indicates command completion; if so, set the controller MSIx enabled flag.
> 
> Signed-off-by: Prasad B Munirathnam 
> Signed-off-by: Raghava Aditya Renukunta
> 
> ---
Reviewed-by: Dave Carroll 


Re: [LSF/MM TOPIC] irq affinity handling for high CPU count machines

2018-01-29 Thread Bart Van Assche

On 01/29/18 07:41, Elliott, Robert (Persistent Memory) wrote:

-Original Message-
From: Linux-nvme [mailto:linux-nvme-boun...@lists.infradead.org] On Behalf
Of Hannes Reinecke
Sent: Monday, January 29, 2018 3:09 AM
To: lsf...@lists.linux-foundation.org
Cc: linux-n...@lists.infradead.org; linux-scsi@vger.kernel.org; Kashyap
Desai 
Subject: [LSF/MM TOPIC] irq affinity handling for high CPU count machines

Hi all,

here's a topic which came up on the SCSI ML (cf thread '[RFC 0/2]
mpt3sas/megaraid_sas: irq poll and load balancing of reply queue').

When doing I/O tests on a machine with more CPUs than MSIx vectors
provided by the HBA, we can easily set up a scenario where one CPU is
submitting I/O and the other one is completing I/O, which will result in
the latter CPU being stuck in the interrupt completion routine basically
forever, resulting in the lockup detector kicking in.

How should these situations be handled?
Should it be made the responsibility of the drivers, ensuring that the
interrupt completion routine is terminated after a certain time?
Should it be made the responsibility of the upper layers?
Should it be the responsibility of the interrupt mapping code?
Can/should interrupt polling be used in these situations?


Back when we introduced scsi-mq with hpsa, the best approach was to
route interrupts and completion handling so each CPU core handles its
own submissions; this way, they are self-throttling.


That approach may work for the hpsa adapter but I'm not sure whether it 
works for all adapter types. It has already been observed with the SRP 
initiator driver running inside a VM that a single core spent all its 
time processing IB interrupts.


Additionally, only initiator workloads are self-throttling. Target style 
workloads are not self-throttling.


In other words, I think it's worth discussing this topic further.

Bart.




RE: [PATCH 2/3] scsi: aacraid: Preserve MSIX mode in the OMR register

2018-01-29 Thread Dave Carroll
> -Original Message-
> From: Raghava Aditya Renukunta
> [mailto:raghavaaditya.renuku...@microsemi.com]
> Sent: Friday, January 19, 2018 6:02 PM
> To: j...@linux.vnet.ibm.com; martin.peter...@oracle.com; linux-
> s...@vger.kernel.org
> Cc: Scott Benesh ; Tom White
> ; dl-esc-Aacraid Linux Driver
> ; Guilherme G . Piccoli
> ; Bart Van Assche 
> Subject: [PATCH 2/3] scsi: aacraid: Preserve MSIX mode in the OMR register
> 
> Preserve the current MSIX mode value in the OMR before rewriting the OMR to
> initiate the IOP or Soft Reset.
> 
> Signed-off-by: Prasad B Munirathnam 
> Signed-off-by: Raghava Aditya Renukunta
> 
> ---
>  drivers/scsi/aacraid/src.c | 22 ++
>  1 file changed, 22 insertions(+)
> 
Reviewed-by: Dave Carroll 


RE: [PATCH 1/3] scsi: aacraid: Implement DropIO sync command

2018-01-29 Thread Dave Carroll
> -Original Message-
> From: Raghava Aditya Renukunta
> [mailto:raghavaaditya.renuku...@microsemi.com]
> Sent: Friday, January 19, 2018 6:02 PM
> To: j...@linux.vnet.ibm.com; martin.peter...@oracle.com; linux-
> s...@vger.kernel.org
> Cc: Scott Benesh ; Tom White
> ; dl-esc-Aacraid Linux Driver
> ; Guilherme G . Piccoli
> ; Bart Van Assche 
> Subject: [PATCH 1/3] scsi: aacraid: Implement DropIO sync command
> 
> IOP_RESET takes longer time to complete, if controller is in a state where we
> can bring it back with init struct, controller DropIO sync command is
> implemented.
> 
>  - If controller is faulted perform standard IOP_RESET in aac_srcv_init.
>  - If controller is not faulted get adapter properties and extended
>properties.
>  - Update the sa_firmware variable and determine if DropIO request is
>supported.
>  - Issue DropIO request, and get the number of outstanding commands.
>  - If all commands are complete with success (CT_OK), consider IOP_RESET
>is complete.
>  - If any commands timeout, Perform the IOP_RESET.
> 
> Signed-off-by: Prasad B Munirathnam 
> Signed-off-by: Raghava Aditya Renukunta
> 
> ---
>  drivers/scsi/aacraid/aacraid.h |   4 +
>  drivers/scsi/aacraid/src.c | 167
> +++--
>  2 files changed, 163 insertions(+), 8 deletions(-)

Reviewed-by: Dave Carroll 


[LSF/MM TOPIC] Two blk-mq related topics

2018-01-29 Thread Ming Lei
Hi guys,

Two blk-mq related topics

1. blk-mq vs. CPU hotplug & IRQ vectors spread on CPUs

We have made three big changes in this area before; each time some issues
were fixed, but new ones were introduced:

1) freeze all queues during the CPU hotplug handler
- issues: queue dependencies such as loop-mq/dm vs. underlying queues and
NVMe admin queue vs. namespace queues; IO hangs may be caused by freezing
all these queues in the CPU hotplug handler.

2) IRQ vectors spread on all present CPUs
- fixes the issue in 1)
- new issues introduced: physical CPU hotplug isn't supported, and blk-mq
warnings are triggered during dispatch

3) IRQ vectors spread on all possible CPUs
- can support physical CPU hotplug
- the warning in __blk_mq_run_hw_queue() may still be triggered if a CPU
goes offline/online between blk_mq_hctx_next_cpu() and running
__blk_mq_run_hw_queue()
- new issues introduced: queue mapping may be distorted completely; a patch
has been sent out (https://marc.info/?t=15160323092&r=1&w=2), but the
approach may need further discussion; drivers (such as NVMe) may need to
pass 'num_possible_cpus()' as the max vectors when allocating irq vectors;
some drivers (NVMe) use hard-coded hw queue indexes directly, which becomes
very fragile, since the hw queue may be inactive from the beginning.

Also, starting from 2), another issue is that IO completions may not be
delivered to any CPU: for example, IO may be dispatched to a hw queue just
before (or after) all CPUs mapped to the hctx go offline, and then the IRQ
vector of the hw queue can be shut down. For now we seem to depend on the
timeout handler to deal with this situation; is there a better way to solve
this issue?

2. When to enable SCSI_MQ by default again?

SCSI_MQ was first introduced in V3.17, but disabled by default. In V4.13-rc1
it was enabled by default, but the patch was reverted in V4.13-rc7, so it
became disabled by default again.

Now both the originally reported PM issue (actually SCSI quiesce) and the
sequential IO performance issue have been addressed. And MQ IO schedulers
are ready too for traditional disks. Are there other issues to be addressed
before enabling SCSI_MQ by default? When can we do that again?

Last time, the two issues were reported during the V4.13 dev cycle just when
it was enabled by default; it seems that unless SCSI_MQ is enabled by default,
it never gets run and tested completely and fully.

So if we keep disabling it by default, it may never be exposed to full test
and production environments.


Thanks,
Ming


RE: [LSF/MM TOPIC] irq affinity handling for high CPU count machines

2018-01-29 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: Linux-nvme [mailto:linux-nvme-boun...@lists.infradead.org] On Behalf
> Of Hannes Reinecke
> Sent: Monday, January 29, 2018 3:09 AM
> To: lsf...@lists.linux-foundation.org
> Cc: linux-n...@lists.infradead.org; linux-scsi@vger.kernel.org; Kashyap
> Desai 
> Subject: [LSF/MM TOPIC] irq affinity handling for high CPU count machines
> 
> Hi all,
> 
> here's a topic which came up on the SCSI ML (cf thread '[RFC 0/2]
> mpt3sas/megaraid_sas: irq poll and load balancing of reply queue').
> 
> When doing I/O tests on a machine with more CPUs than MSIx vectors
> provided by the HBA, we can easily set up a scenario where one CPU is
> submitting I/O and the other one is completing I/O, which will result in
> the latter CPU being stuck in the interrupt completion routine basically
> forever, resulting in the lockup detector kicking in.
> 
> How should these situations be handled?
> Should it be made the responsibility of the drivers, ensuring that the
> interrupt completion routine is terminated after a certain time?
> Should it be made the responsibility of the upper layers?
> Should it be the responsibility of the interrupt mapping code?
> Can/should interrupt polling be used in these situations?

Back when we introduced scsi-mq with hpsa, the best approach was to
route interrupts and completion handling so each CPU core handles its
own submissions; this way, they are self-throttling.

Every other arrangement was subject to soft lockups and other problems
when the completion CPUs become overwhelmed with work.

See https://lkml.org/lkml/2014/9/9/931.

---
Robert Elliott, HPE Persistent Memory
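
A minimal sketch of that routing idea, assuming a simple 1:1 wrap-around of
MSI-x vectors onto online CPUs; the mydrv_* names are hypothetical, and a
real driver would still have to handle the CPU count exceeding the vector
count, which is exactly the problem case discussed above:

#include <linux/interrupt.h>
#include <linux/pci.h>

struct mydrv_adapter {			/* hypothetical adapter state */
	struct pci_dev *pdev;
	int num_msix;
};

/* Hint each reply-queue vector onto one CPU so completions land on the
 * core that submitted the I/O (self-throttling when the mapping is 1:1). */
static void mydrv_set_affinity_hints(struct mydrv_adapter *a)
{
	int i;

	for (i = 0; i < a->num_msix; i++)
		irq_set_affinity_hint(pci_irq_vector(a->pdev, i),
				      cpumask_of(i % num_online_cpus()));
}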




Re: [PATCH] scsi: csiostor: remove redundant assignment to pointer 'ln'

2018-01-29 Thread Varun Prakash
On Wed, Jan 24, 2018 at 02:58:01PM +, Colin King wrote:
> From: Colin Ian King 
> 
> The pointer ln is assigned a value that is never read, it is re-assigned
> a new value in the list_for_each loop hence the initialization is
> redundant and can be removed.
> 
> Cleans up clang warning:
> drivers/scsi/csiostor/csio_lnode.c:117:21: warning: Value stored to 'ln'
> during its initialization is never read
> 
> Signed-off-by: Colin Ian King 
> ---
>  drivers/scsi/csiostor/csio_lnode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/csiostor/csio_lnode.c 
> b/drivers/scsi/csiostor/csio_lnode.c
> index be5ee2d37815..7d81a1e7 100644
> --- a/drivers/scsi/csiostor/csio_lnode.c
> +++ b/drivers/scsi/csiostor/csio_lnode.c
> @@ -114,7 +114,7 @@ static enum csio_ln_ev fwevt_to_lnevt[] = {
>  static struct csio_lnode *
>  csio_ln_lookup_by_portid(struct csio_hw *hw, uint8_t portid)
>  {
> - struct csio_lnode *ln = hw->rln;
> + struct csio_lnode *ln;
>   struct list_head *tmp;
>  
>   /* Match siblings lnode with portid */

Acked-by: Varun Prakash  


Re: [PATCH] mptfusion: Add bounds check in mptctl_hp_targetinfo()

2018-01-29 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH] [SCSI] sym53c8xx_2: iterator underflow in sym_getsync()

2018-01-29 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH] scsi: remove dead makefile about oktagon files

2018-01-29 Thread Geert Uytterhoeven
On Mon, Jan 29, 2018 at 1:30 PM, Corentin Labbe  wrote:
> Remove a line referring to nonexistent files, which were removed in
> commit 642978beb483 ("[SCSI] remove m68k NCR53C9x based drivers")
>
> Signed-off-by: Corentin Labbe 

Acked-by: Geert Uytterhoeven 

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


[PATCH] scsi: remove dead makefile about oktagon files

2018-01-29 Thread Corentin Labbe
Remove a line referring to nonexistent files, which were removed in
commit 642978beb483 ("[SCSI] remove m68k NCR53C9x based drivers")

Signed-off-by: Corentin Labbe 
---
 drivers/scsi/Makefile | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
index fcfd28d2884c..de1b3fce936d 100644
--- a/drivers/scsi/Makefile
+++ b/drivers/scsi/Makefile
@@ -185,7 +185,6 @@ ncr53c8xx-flags-$(CONFIG_SCSI_ZALON) \
 CFLAGS_ncr53c8xx.o := $(ncr53c8xx-flags-y) $(ncr53c8xx-flags-m)
 zalon7xx-objs  := zalon.o ncr53c8xx.o
 NCR_Q720_mod-objs  := NCR_Q720.o ncr53c8xx.o
-oktagon_esp_mod-objs   := oktagon_esp.o oktagon_io.o
 
 # Files generated that shall be removed upon make clean
 clean-files := 53c700_d.h 53c700_u.h
-- 
2.13.6



Re: [PATCH 06/13] lpfc: Add 64G link speed support

2018-01-29 Thread Steffen Maier

On 01/26/2018 08:31 PM, James Smart wrote:

The G7 adapter supports 64G link speeds. Add support to the driver.

In addition, a small cleanup to replace the odd bitmap logic with
a switch case.

Signed-off-by: Dick Kennedy 
Signed-off-by: James Smart 
---



  9 files changed, 92 insertions(+), 30 deletions(-)



diff --git a/drivers/scsi/lpfc/lpfc_hw.h b/drivers/scsi/lpfc/lpfc_hw.h
index d07d2fcbea34..b91b429fe2df 100644
--- a/drivers/scsi/lpfc/lpfc_hw.h
+++ b/drivers/scsi/lpfc/lpfc_hw.h



@@ -2966,6 +2975,9 @@ struct lpfc_mbx_read_top {
  #define LPFC_LINK_SPEED_10GHZ 0x40
  #define LPFC_LINK_SPEED_16GHZ 0x80
  #define LPFC_LINK_SPEED_32GHZ 0x90
+#define LPFC_LINK_SPEED_64GHZ  0xA0
+#define LPFC_LINK_SPEED_128GHZ 0xB0
+#define LPFC_LINK_SPEED_2568GHZ0xC0

  ^
typo? 2568 => 256

The new 128 and 256 definitions do not seem to be used in this patch set?

--
Mit freundlichen Grüßen / Kind regards
Steffen Maier

Linux on z Systems Development

IBM Deutschland Research & Development GmbH
Vorsitzende des Aufsichtsrats: Martina Koederitz
Geschaeftsfuehrung: Dirk Wittkopp
Sitz der Gesellschaft: Boeblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294



Re: [PATCH 12/19] lpfc: Indicate CONF support in NVMe PRLI

2018-01-29 Thread Hannes Reinecke
On 01/24/2018 11:45 PM, James Smart wrote:
> Revise the NVME PRLI to indicate CONF support.
> 
> Signed-off-by: Dick Kennedy 
> Signed-off-by: James Smart 
> ---
>  drivers/scsi/lpfc/lpfc_els.c   | 3 ++-
>  drivers/scsi/lpfc/lpfc_hw4.h   | 6 +++---
>  drivers/scsi/lpfc/lpfc_nportdisc.c | 3 ---
>  3 files changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/scsi/lpfc/lpfc_els.c b/drivers/scsi/lpfc/lpfc_els.c
> index 234c7c015982..404e1af5e2ab 100644
> --- a/drivers/scsi/lpfc/lpfc_els.c
> +++ b/drivers/scsi/lpfc/lpfc_els.c
> @@ -2293,10 +2293,11 @@ lpfc_issue_els_prli(struct lpfc_vport *vport, struct 
> lpfc_nodelist *ndlp,
>   if (phba->nvmet_support) {
>   bf_set(prli_tgt, npr_nvme, 1);
>   bf_set(prli_disc, npr_nvme, 1);
> -
>   } else {
>   bf_set(prli_init, npr_nvme, 1);
> + bf_set(prli_conf, npr_nvme, 1);
>   }
> +
>   npr_nvme->word1 = cpu_to_be32(npr_nvme->word1);
>   npr_nvme->word4 = cpu_to_be32(npr_nvme->word4);
>   elsiocb->iocb_flag |= LPFC_PRLI_NVME_REQ;
> diff --git a/drivers/scsi/lpfc/lpfc_hw4.h b/drivers/scsi/lpfc/lpfc_hw4.h
> index ef469129fb71..7c3afc3d3121 100644
> --- a/drivers/scsi/lpfc/lpfc_hw4.h
> +++ b/drivers/scsi/lpfc/lpfc_hw4.h
> @@ -4346,9 +4346,9 @@ struct lpfc_nvme_prli {
>  #define prli_init_SHIFT 5
>  #define prli_init_MASK  0x0001
>  #define prli_init_WORD  word4
> -#define prli_recov_SHIFT8
> -#define prli_recov_MASK 0x0001
> -#define prli_recov_WORD word4
> +#define prli_conf_SHIFT 7
> +#define prli_conf_MASK  0x0001
> +#define prli_conf_WORD  word4
>   uint32_t word5;
>  #define prli_fb_sz_SHIFT0
>  #define prli_fb_sz_MASK 0x
> diff --git a/drivers/scsi/lpfc/lpfc_nportdisc.c 
> b/drivers/scsi/lpfc/lpfc_nportdisc.c
> index d841aa42f607..bbf1e1342b09 100644
> --- a/drivers/scsi/lpfc/lpfc_nportdisc.c
> +++ b/drivers/scsi/lpfc/lpfc_nportdisc.c
> @@ -2011,9 +2011,6 @@ lpfc_cmpl_prli_prli_issue(struct lpfc_vport *vport, 
> struct lpfc_nodelist *ndlp,
>   }
>   }
>  
> - if (bf_get_be32(prli_recov, nvpr))
> - ndlp->nlp_fcp_info |= NLP_FCP_2_DEVICE;
> -
>   lpfc_printf_vlog(vport, KERN_INFO, LOG_NVME_DISC,
>"6029 NVME PRLI Cmpl w1 x%08x "
>"w4 x%08x w5 x%08x flag x%x, "
> 
If you say so :-)

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
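
For readers following the lpfc_hw4.h hunks: each field is described by a
_SHIFT/_MASK/_WORD triple, and bf_set()/bf_get() expand to masked shifts on
the named word, roughly as follows (a simplified sketch of the macros in
lpfc_hw4.h):

#define bf_set(name, ptr, value) \
	((ptr)->name##_WORD = ((((value) & name##_MASK) << name##_SHIFT) | \
	 ((ptr)->name##_WORD & ~(name##_MASK << name##_SHIFT))))

#define bf_get(name, ptr) \
	(((ptr)->name##_WORD >> name##_SHIFT) & name##_MASK)

So the patch above replaces the prli_recov field (bit 8 of word4) with
prli_conf (bit 7 of word4) and sets it in the initiator-side PRLI.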


Re: [PATCH 10/19] lpfc: Fix soft lockup in lpfc worker thread during LIP testing

2018-01-29 Thread Hannes Reinecke
On 01/24/2018 11:45 PM, James Smart wrote:
> During link bounce testing in a point-to-point topology, the
> host may enter a soft lockup on the lpfc_worker thread:
> Call Trace:
>  lpfc_work_done+0x1f3/0x1390 [lpfc]
>  lpfc_do_work+0x16f/0x180 [lpfc]
>  kthread+0xc7/0xe0
>  ret_from_fork+0x3f/0x70
> 
> The driver was simultaneously setting a combination of flags
> that caused lpfc_do_work() to effectively spin between slow path
> work and new event data, causing the lockup.
> 
> Ensure in the typical wq completions, that new event data flags
> are set if the slow path flag is running. The slow path will
> eventually reschedule the wq handling.
> 
> Signed-off-by: Dick Kennedy 
> Signed-off-by: James Smart 
> ---
>  drivers/scsi/lpfc/lpfc_hbadisc.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/lpfc/lpfc_hbadisc.c 
> b/drivers/scsi/lpfc/lpfc_hbadisc.c
> index b159a5c4e388..9265906d956e 100644
> --- a/drivers/scsi/lpfc/lpfc_hbadisc.c
> +++ b/drivers/scsi/lpfc/lpfc_hbadisc.c
> @@ -696,8 +696,9 @@ lpfc_work_done(struct lpfc_hba *phba)
> phba->hba_flag & HBA_SP_QUEUE_EVT)) {
>   if (pring->flag & LPFC_STOP_IOCB_EVENT) {
>   pring->flag |= LPFC_DEFERRED_RING_EVENT;
> - /* Set the lpfc data pending flag */
> - set_bit(LPFC_DATA_READY, &phba->data_flags);
> + /* Preserve legacy behavior. */
> + if (!(phba->hba_flag & HBA_SP_QUEUE_EVT))
> + set_bit(LPFC_DATA_READY, &phba->data_flags);
>   } else {
>   if (phba->link_state >= LPFC_LINK_UP ||
>   phba->link_flag & LS_MDS_LOOPBACK) {
> 
_Actually_ lpfc_do_work() and friends could be replace with a workqueue ...
But anyway.

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 11/19] lpfc: Fix issue_lip if link is disabled

2018-01-29 Thread Hannes Reinecke
On 01/24/2018 11:45 PM, James Smart wrote:
> The driver ignored checks on whether the link should be
> kept administratively down after a link bounce. Correct the
> checks.
> 
> Signed-off-by: Dick Kennedy 
> Signed-off-by: James Smart 
> ---
>  drivers/scsi/lpfc/lpfc_attr.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/scsi/lpfc/lpfc_attr.c b/drivers/scsi/lpfc/lpfc_attr.c
> index d8064e3ea0ba..b79dad7b8278 100644
> --- a/drivers/scsi/lpfc/lpfc_attr.c
> +++ b/drivers/scsi/lpfc/lpfc_attr.c
> @@ -911,7 +911,12 @@ lpfc_issue_lip(struct Scsi_Host *shost)
>   LPFC_MBOXQ_t *pmboxq;
>   int mbxstatus = MBXERR_ERROR;
>  
> + /*
> +  * If the link is offline, disabled or BLOCK_MGMT_IO
> +  * it doesn't make any sense to allow issue_lip
> +  */
>   if ((vport->fc_flag & FC_OFFLINE_MODE) ||
> + (phba->hba_flag & LINK_DISABLED) ||
>   (phba->sli.sli_flag & LPFC_BLOCK_MGMT_IO))
>   return -EPERM;
>  
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 14/19] lpfc: Validate adapter support for SRIU option

2018-01-29 Thread Hannes Reinecke
On 01/24/2018 11:45 PM, James Smart wrote:
> When using the special option to suppress the response iu, ensure
> the adapter fully supports the feature by checking feature flags
> from the adapter.
> 
> Signed-off-by: Dick Kennedy 
> Signed-off-by: James Smart 
> ---
>  drivers/scsi/lpfc/lpfc_hw4.h  |  3 +++
>  drivers/scsi/lpfc/lpfc_init.c | 13 -
>  2 files changed, 15 insertions(+), 1 deletion(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 16/19] lpfc: Treat SCSI Write operation Underruns as an error

2018-01-29 Thread Hannes Reinecke
On 01/24/2018 11:45 PM, James Smart wrote:
> Currently, write underruns (mismatch of amount transferred vs scsi
> status and its residual) detected by the adapter are not being
> flagged as an error. It's expected that the target controls the data
> transfer and would appropriately set the RSP values.  Only read
> underruns are treated as errors.
> 
> Revise the SCSI error handling to treat write underruns as an
> error as well.
> 
> Signed-off-by: Dick Kennedy 
> Signed-off-by: James Smart 
> ---
>  drivers/scsi/lpfc/lpfc_scsi.c | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/scsi/lpfc/lpfc_scsi.c b/drivers/scsi/lpfc/lpfc_scsi.c
> index dcc86936e6fc..10c2dc0cf1fa 100644
> --- a/drivers/scsi/lpfc/lpfc_scsi.c
> +++ b/drivers/scsi/lpfc/lpfc_scsi.c
> @@ -3772,20 +3772,18 @@ lpfc_handle_fcp_err(struct lpfc_vport *vport, struct 
> lpfc_scsi_buf *lpfc_cmd,
>   scsi_set_resid(cmnd, be32_to_cpu(fcprsp->rspResId));
>  
>   lpfc_printf_vlog(vport, KERN_INFO, LOG_FCP_UNDER,
> -  "9025 FCP Read Underrun, expected %d, "
> +  "9025 FCP Underrun, expected %d, "
>"residual %d Data: x%x x%x x%x\n",
>fcpDl,
>scsi_get_resid(cmnd), fcpi_parm, cmnd->cmnd[0],
>cmnd->underflow);
>  
>   /*
> -  * If there is an under run check if under run reported by
> +  * If there is an under run, check if under run reported by
>* storage array is same as the under run reported by HBA.
>* If this is not same, there is a dropped frame.
>*/
> - if ((cmnd->sc_data_direction == DMA_FROM_DEVICE) &&
> - fcpi_parm &&
> - (scsi_get_resid(cmnd) != fcpi_parm)) {
> + if (fcpi_parm && (scsi_get_resid(cmnd) != fcpi_parm)) {
>   lpfc_printf_vlog(vport, KERN_WARNING,
>LOG_FCP | LOG_FCP_ERROR,
>"9026 FCP Read Check Error "
> 
I so hope you have this tested ...

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 02/19] lpfc: Increase CQ and WQ sizes for SCSI

2018-01-29 Thread Hannes Reinecke
On 01/24/2018 11:45 PM, James Smart wrote:
> Increased CQ and WQ sizes for SCSI FCP, matching those used
> for NVMe development.
> 
> Signed-off-by: Dick Kennedy 
> Signed-off-by: James Smart 
> ---
>  drivers/scsi/lpfc/lpfc.h  |  1 +
>  drivers/scsi/lpfc/lpfc_hw4.h  |  3 +++
>  drivers/scsi/lpfc/lpfc_init.c | 30 ++
>  drivers/scsi/lpfc/lpfc_sli.c  |  3 ++-
>  drivers/scsi/lpfc/lpfc_sli4.h |  5 +
>  5 files changed, 33 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/scsi/lpfc/lpfc.h b/drivers/scsi/lpfc/lpfc.h
> index 61fb46da05d4..d042f9118e3b 100644
> --- a/drivers/scsi/lpfc/lpfc.h
> +++ b/drivers/scsi/lpfc/lpfc.h
> @@ -760,6 +760,7 @@ struct lpfc_hba {
>   uint8_t  mds_diags_support;
>   uint32_t initial_imax;
>   uint8_t  bbcredit_support;
> + uint8_t  enab_exp_wqcq_pages;
>  
>   /* HBA Config Parameters */
>   uint32_t cfg_ack0;
> diff --git a/drivers/scsi/lpfc/lpfc_hw4.h b/drivers/scsi/lpfc/lpfc_hw4.h
> index 73c2f6971d2b..ef469129fb71 100644
> --- a/drivers/scsi/lpfc/lpfc_hw4.h
> +++ b/drivers/scsi/lpfc/lpfc_hw4.h
> @@ -3212,6 +3212,9 @@ struct lpfc_sli4_parameters {
>  #define cfg_cqv_SHIFT14
>  #define cfg_cqv_MASK 0x0003
>  #define cfg_cqv_WORD word4
> +#define cfg_cqpsize_SHIFT16
> +#define cfg_cqpsize_MASK 0x00ff
> +#define cfg_cqpsize_WORD word4
>   uint32_t word5;
>   uint32_t word6;
>  #define cfg_mqv_SHIFT14
> diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
> index f539c554588c..5b96bf0d3331 100644
> --- a/drivers/scsi/lpfc/lpfc_init.c
> +++ b/drivers/scsi/lpfc/lpfc_init.c
> @@ -8011,9 +8011,10 @@ static int
>  lpfc_alloc_fcp_wq_cq(struct lpfc_hba *phba, int wqidx)
>  {
>   struct lpfc_queue *qdesc;
> + uint32_t wqesize;
>  
>   /* Create Fast Path FCP CQs */
> - if (phba->fcp_embed_io)
> + if (phba->enab_exp_wqcq_pages)
>   /* Increase the CQ size when WQEs contain an embedded cdb */
>   qdesc = lpfc_sli4_queue_alloc(phba, LPFC_EXPANDED_PAGE_SIZE,
> phba->sli4_hba.cq_esize,
> @@ -8031,15 +8032,18 @@ lpfc_alloc_fcp_wq_cq(struct lpfc_hba *phba, int wqidx)
>   phba->sli4_hba.fcp_cq[wqidx] = qdesc;
>  
>   /* Create Fast Path FCP WQs */
> - if (phba->fcp_embed_io)
>   /* Increase the WQ size when WQEs contain an embedded cdb */

Wrong indentation after patching.

> + if (phba->enab_exp_wqcq_pages) {
> + wqesize = (phba->fcp_embed_io) ?
> + LPFC_WQE128_SIZE : phba->sli4_hba.wq_esize;
>   qdesc = lpfc_sli4_queue_alloc(phba, LPFC_EXPANDED_PAGE_SIZE,
> -   LPFC_WQE128_SIZE,
> +   wqesize,
> LPFC_WQE_EXP_COUNT);
> - else
> + } else {
>   qdesc = lpfc_sli4_queue_alloc(phba, LPFC_DEFAULT_PAGE_SIZE,
> phba->sli4_hba.wq_esize,
> phba->sli4_hba.wq_ecount);
> + }

No need for braces here.

>   if (!qdesc) {
>   lpfc_printf_log(phba, KERN_ERR, LOG_INIT,
>   "0503 Failed allocate fast-path FCP WQ (%d)\n",
> @@ -10485,6 +10489,12 @@ lpfc_get_sli4_parameters(struct lpfc_hba *phba, 
> LPFC_MBOXQ_t *mboxq)
>   else
>   phba->fcp_embed_io = 0;
>  
> + if ((bf_get(cfg_cqpsize, mbx_sli4_parameters) & LPFC_CQ_16K_PAGE_SZ) &&
> + (bf_get(cfg_wqpsize, mbx_sli4_parameters) & LPFC_WQ_16K_PAGE_SZ) &&
> + (sli4_params->wqsize & LPFC_WQ_SZ128_SUPPORT))
> + phba->enab_exp_wqcq_pages = 1;
> + else
> + phba->enab_exp_wqcq_pages = 0;
>   /*
>* Check if the SLI port supports MDS Diagnostics
>*/
> @@ -12227,6 +12237,7 @@ int
>  lpfc_fof_queue_create(struct lpfc_hba *phba)
>  {
>   struct lpfc_queue *qdesc;
> + uint32_t wqesize;
>  
>   /* Create FOF EQ */
>   qdesc = lpfc_sli4_queue_alloc(phba, LPFC_DEFAULT_PAGE_SIZE,
> @@ -12240,7 +12251,7 @@ lpfc_fof_queue_create(struct lpfc_hba *phba)
>   if (phba->cfg_fof) {
>  
>   /* Create OAS CQ */
> - if (phba->fcp_embed_io)
> + if (phba->enab_exp_wqcq_pages)
>   qdesc = lpfc_sli4_queue_alloc(phba,
> LPFC_EXPANDED_PAGE_SIZE,
> phba->sli4_hba.cq_esize,
> @@ -12256,16 +12267,19 @@ lpfc_fof_queue_create(struct lpfc_hba *phba)
>   phba->sli4_hba.oas_cq = qdesc;
>  
>   /* Create OAS WQ */
> - if (phba->fcp_embed_io)
> + 

Re: [PATCH 19/19] lpfc: Update modified files for 2018 Copyright

2018-01-29 Thread Hannes Reinecke
On 01/24/2018 11:45 PM, James Smart wrote:
> Update the copyright notices in the files modified for 11.4.0.7.
> 
> Signed-off-by: Dick Kennedy 
> Signed-off-by: James Smart 
> ---
>  drivers/scsi/lpfc/lpfc.h   | 2 +-
>  drivers/scsi/lpfc/lpfc_attr.c  | 2 +-
>  drivers/scsi/lpfc/lpfc_crtn.h  | 2 +-
>  drivers/scsi/lpfc/lpfc_els.c   | 2 +-
>  drivers/scsi/lpfc/lpfc_hbadisc.c   | 2 +-
>  drivers/scsi/lpfc/lpfc_hw4.h   | 2 +-
>  drivers/scsi/lpfc/lpfc_init.c  | 2 +-
>  drivers/scsi/lpfc/lpfc_mbox.c  | 2 +-
>  drivers/scsi/lpfc/lpfc_mem.c   | 2 +-
>  drivers/scsi/lpfc/lpfc_nportdisc.c | 4 ++--
>  drivers/scsi/lpfc/lpfc_nvme.c  | 2 +-
>  drivers/scsi/lpfc/lpfc_nvmet.c | 2 +-
>  drivers/scsi/lpfc/lpfc_nvmet.h | 2 +-
>  drivers/scsi/lpfc/lpfc_scsi.c  | 2 +-
>  drivers/scsi/lpfc/lpfc_sli.c   | 3 +--
>  drivers/scsi/lpfc/lpfc_sli4.h  | 2 +-
>  drivers/scsi/lpfc/lpfc_version.h   | 2 +-
>  17 files changed, 18 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/scsi/lpfc/lpfc.h b/drivers/scsi/lpfc/lpfc.h
> index d042f9118e3b..9698b9635058 100644
> --- a/drivers/scsi/lpfc/lpfc.h
> +++ b/drivers/scsi/lpfc/lpfc.h
> @@ -1,7 +1,7 @@
>  /***
>   * This file is part of the Emulex Linux Device Driver for *
>   * Fibre Channel Host Bus Adapters.*
> - * Copyright (C) 2017 Broadcom. All Rights Reserved. The term  *
> + * Copyright (C) 2017-2018 Broadcom. All Rights Reserved. The term *
>   * “Broadcom” refers to Broadcom Limited and/or its subsidiaries.  *
>   * Copyright (C) 2004-2016 Emulex.  All rights reserved.   *
>   * EMULEX and SLI are trademarks of Emulex.*
> diff --git a/drivers/scsi/lpfc/lpfc_attr.c b/drivers/scsi/lpfc/lpfc_attr.c
> index b79dad7b8278..70bd25666243 100644
> --- a/drivers/scsi/lpfc/lpfc_attr.c
> +++ b/drivers/scsi/lpfc/lpfc_attr.c
> @@ -1,7 +1,7 @@
>  /***
>   * This file is part of the Emulex Linux Device Driver for *
>   * Fibre Channel Host Bus Adapters.*
> - * Copyright (C) 2017 Broadcom. All Rights Reserved. The term  *
> + * Copyright (C) 2017-2018 Broadcom. All Rights Reserved. The term *
>   * “Broadcom” refers to Broadcom Limited and/or its subsidiaries.  *
>   * Copyright (C) 2004-2016 Emulex.  All rights reserved.   *
>   * EMULEX and SLI are trademarks of Emulex.*
> diff --git a/drivers/scsi/lpfc/lpfc_crtn.h b/drivers/scsi/lpfc/lpfc_crtn.h
> index 3ecf50df93f4..14a86b5b51e4 100644
> --- a/drivers/scsi/lpfc/lpfc_crtn.h
> +++ b/drivers/scsi/lpfc/lpfc_crtn.h
> @@ -1,7 +1,7 @@
>  /***
>   * This file is part of the Emulex Linux Device Driver for *
>   * Fibre Channel Host Bus Adapters.*
> - * Copyright (C) 2017 Broadcom. All Rights Reserved. The term  *
> + * Copyright (C) 2017-2018 Broadcom. All Rights Reserved. The term *
>   * “Broadcom” refers to Broadcom Limited and/or its subsidiaries.  *
>   * Copyright (C) 2004-2016 Emulex.  All rights reserved.   *
>   * EMULEX and SLI are trademarks of Emulex.*
> diff --git a/drivers/scsi/lpfc/lpfc_els.c b/drivers/scsi/lpfc/lpfc_els.c
> index 404e1af5e2ab..ba896554a14f 100644
> --- a/drivers/scsi/lpfc/lpfc_els.c
> +++ b/drivers/scsi/lpfc/lpfc_els.c
> @@ -1,7 +1,7 @@
>  /***
>   * This file is part of the Emulex Linux Device Driver for *
>   * Fibre Channel Host Bus Adapters.*
> - * Copyright (C) 2017 Broadcom. All Rights Reserved. The term  *
> + * Copyright (C) 2017-2018 Broadcom. All Rights Reserved. The term *
>   * “Broadcom” refers to Broadcom Limited and/or its subsidiaries.  *
>   * Copyright (C) 2004-2016 Emulex.  All rights reserved.   *
>   * EMULEX and SLI are trademarks of Emulex.*
> diff --git a/drivers/scsi/lpfc/lpfc_hbadisc.c 
> b/drivers/scsi/lpfc/lpfc_hbadisc.c
> index 9265906d956e..f5bbac3cadbb 100644
> --- a/drivers/scsi/lpfc/lpfc_hbadisc.c
> +++ b/drivers/scsi/lpfc/lpfc_hbadisc.c
> @@ -1,7 +1,7 @@
>  /***
>   * This file is part of the Emulex Linux Device Driver for *
>   * Fibre Channel Host Bus Adapters.*
> - * Copyright (C) 2017 Broadcom. All Rights Reserved. The term  *
> + * Copyright (C) 2017-2018 Broadcom. All Rights Reserved. The term *
>   * “Broadcom” refers to Broadcom Limited and/or its subsidiaries.  *
>   * Copyright (C) 2004-2016 Emulex.  All rights reserved.   *
>   * EMULEX and SLI are trademarks of Emulex.*
> diff --git 

Re: [PATCH 05/19] lpfc: Add WQ Full Logic for NVME Target

2018-01-29 Thread Hannes Reinecke
On 01/24/2018 11:45 PM, James Smart wrote:
> I/O conditions on the nvme target may have the driver submitting
> to a full hardware WQ. The hardware WQ is a shared resource among
> all nvme controllers. When the driver hit a full WQ, it failed the
> io, posting it back to the nvme-fc transport, which then escalated
> the failure into errors.
> 
> Correct by maintaining a sideband queue within the driver that is
> added to when the WQ full condition is hit, and drained from as soon
> as new WQ space opens up.
> 
> Signed-off-by: Dick Kennedy 
> Signed-off-by: James Smart 
> ---
>  drivers/scsi/lpfc/lpfc_crtn.h  |   1 +
>  drivers/scsi/lpfc/lpfc_nvmet.c | 116 
> +
>  drivers/scsi/lpfc/lpfc_nvmet.h |   1 +
>  drivers/scsi/lpfc/lpfc_sli.c   |   3 ++
>  drivers/scsi/lpfc/lpfc_sli4.h  |   5 +-
>  5 files changed, 125 insertions(+), 1 deletion(-)
> 
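For readers skimming the diffstat, the pattern being added looks roughly
like the following sketch - the structure and helper names here are
hypothetical, not lpfc's actual ones:

#include <linux/list.h>
#include <linux/spinlock.h>

struct io_req {
	struct list_head list;
	/* ... WQE payload ... */
};

struct wq_ctx {
	spinlock_t lock;
	struct list_head defer_list;	/* IOs parked while the WQ is full */
};

/* Hypothetical hardware post; returns non-zero when the WQ is full. */
static int post_wqe(struct wq_ctx *wq, struct io_req *io);

/* Submission path: on WQ full, park the IO instead of failing it
 * back to the transport. */
static void submit_or_defer(struct wq_ctx *wq, struct io_req *io)
{
	unsigned long flags;

	spin_lock_irqsave(&wq->lock, flags);
	if (list_empty(&wq->defer_list) && !post_wqe(wq, io)) {
		spin_unlock_irqrestore(&wq->lock, flags);
		return;			/* posted directly */
	}
	list_add_tail(&io->list, &wq->defer_list);
	spin_unlock_irqrestore(&wq->lock, flags);
}

/* Completion path: as WQ slots free up, drain the parked IOs in order. */
static void drain_deferred(struct wq_ctx *wq)
{
	struct io_req *io, *next;
	unsigned long flags;

	spin_lock_irqsave(&wq->lock, flags);
	list_for_each_entry_safe(io, next, &wq->defer_list, list) {
		if (post_wqe(wq, io))
			break;		/* still full; try again later */
		list_del(&io->list);
	}
	spin_unlock_irqrestore(&wq->lock, flags);
}
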
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 13/19] lpfc: Fix SCSI io host reset causing kernel crash

2018-01-29 Thread Hannes Reinecke
On 01/24/2018 11:45 PM, James Smart wrote:
> During SCSI error handling escalation to host reset, the SCSI ios
> were moved off the txcmplq, but the individual io's ON_CMPLQ flag
> wasn't cleared.  Thus, a background thread saw the io and attempted
> to access it as if it were still on the txcmplq.
> 
> Clear the flag upon removal.
> 
> Signed-off-by: Dick Kennedy 
> Signed-off-by: James Smart 
> ---
>  drivers/scsi/lpfc/lpfc_init.c |  4 
>  drivers/scsi/lpfc/lpfc_sli.c  | 13 -
>  2 files changed, 16 insertions(+), 1 deletion(-)
> 
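The essence of the fix, sketched below - the flag and iocb field names
follow lpfc (lpfc_sli.h), while the wrapper function is hypothetical:

/* When pulling an IO off the txcmplq during host-reset cleanup, the
 * on-completion-queue state flag must be cleared along with the list
 * removal, so background threads stop treating the IO as queued. */
static void remove_from_txcmplq(struct lpfc_iocbq *piocb)
{
	list_del_init(&piocb->list);
	piocb->iocb_flag &= ~LPFC_IO_ON_TXCMPLQ;	/* the missing step */
}
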
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 06/19] lpfc: Fix PRLI handling when topology type changes

2018-01-29 Thread Hannes Reinecke
On 01/24/2018 11:45 PM, James Smart wrote:
> The lpfc driver does not discover a target when the topology
> changes from switched-fabric to direct-connect. The target
> rejects the PRLI from the initiator in direct-connect as the
> driver is using the old S_ID from the switched topology.
> 
> The driver was inappropriately clearing the VP bit to register the
> VPI, which is what is associated with the S_ID.
> 
> Fix by leaving the VP bit set (it was set earlier) and as the VFI
> is being re-registered, set the UPDT bit.
> 
> Signed-off-by: Dick Kennedy 
> Signed-off-by: James Smart 
> ---
>  drivers/scsi/lpfc/lpfc_mbox.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/scsi/lpfc/lpfc_mbox.c b/drivers/scsi/lpfc/lpfc_mbox.c
> index 81fb92967b11..c32d4a323db2 100644
> --- a/drivers/scsi/lpfc/lpfc_mbox.c
> +++ b/drivers/scsi/lpfc/lpfc_mbox.c
> @@ -2170,10 +2170,8 @@ lpfc_reg_vfi(struct lpfcMboxq *mbox, struct lpfc_vport 
> *vport, dma_addr_t phys)
>   /* Only FC supports upd bit */
>   if ((phba->sli4_hba.lnk_info.lnk_tp == LPFC_LNK_TYPE_FC) &&
>   (vport->fc_flag & FC_VFI_REGISTERED) &&
> - (!phba->fc_topology_changed)) {
> - bf_set(lpfc_reg_vfi_vp, reg_vfi, 0);
> + (!phba->fc_topology_changed))
>   bf_set(lpfc_reg_vfi_upd, reg_vfi, 1);
> - }
>  
>   bf_set(lpfc_reg_vfi_bbcr, reg_vfi, 0);
>   bf_set(lpfc_reg_vfi_bbscn, reg_vfi, 0);
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 18/19] lpfc: update driver version to 11.4.0.7

2018-01-29 Thread Hannes Reinecke
On 01/24/2018 11:45 PM, James Smart wrote:
> Update the driver version to 11.4.0.7
> 
> Signed-off-by: Dick Kennedy 
> Signed-off-by: James Smart 
> ---
>  drivers/scsi/lpfc/lpfc_version.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/lpfc/lpfc_version.h 
> b/drivers/scsi/lpfc/lpfc_version.h
> index c232bf0e8998..6f4092cb903e 100644
> --- a/drivers/scsi/lpfc/lpfc_version.h
> +++ b/drivers/scsi/lpfc/lpfc_version.h
> @@ -20,7 +20,7 @@
>   * included with this package. *
>   ***/
>  
> -#define LPFC_DRIVER_VERSION "11.4.0.6"
> +#define LPFC_DRIVER_VERSION "11.4.0.7"
>  #define LPFC_DRIVER_NAME "lpfc"
>  
>  /* Used for SLI 2/3 */
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue

2018-01-29 Thread Hannes Reinecke
On 01/15/2018 01:12 PM, Kashyap Desai wrote:
> Hi All -
> 
> We have seen CPU lockup issues in the field when a system has a large
> (more than 96) logical CPU count.
> SAS3.0 controllers (Invader series) support at most 96 MSI-x vectors and
> SAS3.5 products (Ventura) support at most 128 MSI-x vectors.
> 
> This may be a generic issue (it applies whenever a PCI device supports
> completion on multiple reply queues). Let me explain it w.r.t. the
> hardware supported by mpt3sas, just to simplify the problem and the
> possible changes to handle such issues. The IT HBA (mpt3sas) supports
> multiple reply queues in the completion path. The driver creates MSI-x
> vectors for the controller as "min of (FW-supported reply queues,
> logical CPUs)". If the submitter is not interrupted via a completion on
> the same CPU, there is a loop in the IO path. This behavior can cause
> hard/soft CPU lockups, IO timeouts, system sluggishness, etc.
> 
> Example - one CPU (e.g. CPU A) is busy submitting IOs and another CPU
> (e.g. CPU B) is busy processing the corresponding reply descriptors
> from the reply descriptor queue upon receiving interrupts from the
> HBA. If CPU A is continuously pumping IOs, then CPU B (which is
> executing the ISR) will always see valid reply descriptors in the
> reply descriptor queue and will keep processing them in a loop without
> quitting the ISR handler.  The mpt3sas driver only exits the ISR
> handler when it finds an unused reply descriptor in the reply
> descriptor queue. Since CPU A keeps sending IOs, CPU B may always see
> a valid reply descriptor (posted by HBA firmware after processing an
> IO) in the reply descriptor queue. In the worst case, the driver never
> quits this loop in the ISR handler, and eventually a CPU lockup is
> detected by the watchdog.
> 
> The above behavior is not common if "rq_affinity" is set to 2 or the
> affinity_hint is honored by irqbalance as "exact".
> If rq_affinity is set to 2, the submitter will always be interrupted
> via a completion on the same CPU.
> If irqbalance is using the "exact" policy, the interrupt will be
> delivered to the submitting CPU.
> 
> Problem statement -
> If the ratio of CPU count to MSI-x vector (reply descriptor queue)
> count is not 1:1, we are still exposed to the issue explained above,
> and for that we don't have any solution.
> 
> In other words: soft/hard lockups are possible whenever the CPU count
> exceeds the number of MSI-x vectors supported by the device.
> 
> If the CPU-to-MSI-x-vector ratio is not 1:1 (put differently, if the
> ratio is X:1, where X > 1), then neither the 'exact' irqbalance policy
> nor rq_affinity = 2 will help avoid CPU hard/soft lockups. There is no
> longer a one-to-one mapping between CPU and MSI-x vector; instead, one
> MSI-x interrupt (or reply descriptor queue) is shared by a group/set
> of CPUs, so a loop in the IO path can form within that CPU group and
> lockups may be observed.
> 
> For example: consider a system with two NUMA nodes, each node having
> four logical CPUs, and with two MSI-x vectors enabled on the HBA; the
> CPU-to-MSI-x-vector ratio is then 4:1.
> e.g.
> MSI-x vector 0 has affinity to CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA
> node 0, and MSI-x vector 1 has affinity to CPU 4, CPU 5, CPU 6 &
> CPU 7 of NUMA node 1.
> 
> numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3   --> MSI-x 0
> node 0 size: 65536 MB
> node 0 free: 63176 MB
> node 1 cpus: 4 5 6 7   --> MSI-x 1
> node 1 size: 65536 MB
> node 1 free: 63176 MB
> 
> Assume the user starts an application which uses all the CPUs of NUMA
> node 0 for issuing IOs.
> Only one CPU from the affinity list (it can be any CPU, since this
> behavior depends upon irqbalance), say CPU 0, will receive the
> interrupts from MSI-x vector 0 for all the IOs. Over time, CPU 0's IO
> submission share decreases and its ISR processing share increases, as
> it becomes ever busier processing interrupts. Gradually the IO
> submission share on CPU 0 drops to zero and its ISR processing share
> reaches 100 percent, because an IO loop has formed within NUMA node 0:
> CPU 1, CPU 2 & CPU 3 are continuously busy submitting heavy IOs while
> CPU 0 alone is busy in the ISR path, as it always finds a valid reply
> descriptor in the reply descriptor queue. Eventually we observe a hard
> lockup here.
> 
> The chance of hard/soft lockups is directly proportional to the value
> of X. If the value of X is high, the chance of observing CPU lockups
> is high.
> 
> Solution -
> Fix 1 - Use the IRQ poll interface defined in "irq_poll.c". The
> mpt3sas driver will execute its ISR routine in softirq context and
> will always quit the loop based on the budget provided by the IRQ
> poll interface.
> 
> In these scenarios (i.e. where the CPU-to-MSI-x-vector ratio is X:1,
> with X > 1), the IRQ poll interface will avoid CPU hard lockups due
> to voluntary exit from the 
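
To make Fix 1 concrete, here is a minimal sketch of the irq_poll pattern -
the reply-queue context and descriptor helpers are hypothetical, while the
irq_poll_* calls follow the kernel's lib/irq_poll.c interface:

#include <linux/interrupt.h>
#include <linux/irq_poll.h>
#include <linux/kernel.h>

struct reply_q {
	struct irq_poll iop;
	/* ... reply descriptor ring state ... */
};

static bool reply_descriptor_valid(struct reply_q *rq);	/* hypothetical */
static void process_one_reply(struct reply_q *rq);	/* hypothetical */

/* Hard-IRQ handler: do no ring processing here; hand off to the
 * irq_poll softirq so the work is budgeted. */
static irqreturn_t reply_q_isr(int irq, void *data)
{
	struct reply_q *rq = data;

	irq_poll_sched(&rq->iop);
	return IRQ_HANDLED;
}

/* Softirq poll callback: process at most 'budget' descriptors and
 * then yield, so no CPU is monopolized by completion processing. */
static int reply_q_irqpoll(struct irq_poll *iop, int budget)
{
	struct reply_q *rq = container_of(iop, struct reply_q, iop);
	int done = 0;

	while (done < budget && reply_descriptor_valid(rq)) {
		process_one_reply(rq);
		done++;
	}

	if (done < budget)
		irq_poll_complete(iop);	/* ring drained; a real driver
					 * would also re-enable the hw
					 * interrupt here */
	return done;
}

/* Setup, once per reply queue / MSI-x vector:
 *	irq_poll_init(&rq->iop, BUDGET, reply_q_irqpoll);
 */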

[LSF/MM TOPIC] irq affinity handling for high CPU count machines

2018-01-29 Thread Hannes Reinecke
Hi all,

here's a topic which came up on the SCSI ML (cf thread '[RFC 0/2]
mpt3sas/megaraid_sas: irq poll and load balancing of reply queue').

When doing I/O tests on a machine with more CPUs than MSI-x vectors
provided by the HBA, we can easily set up a scenario where one CPU is
submitting I/O and another one is completing I/O. This results in the
latter CPU being stuck in the interrupt completion routine practically
forever, with the lockup detector eventually kicking in.

How should these situations be handled?
Should it be made the responsibility of the drivers, ensuring that the
interrupt completion routine is terminated after a certain time?
Should it be made the responsibility of the upper layers?
Should it be the responsibility of the interrupt mapping code?
Can/should interrupt polling be used in these situations?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 15/19] lpfc: Fix header inclusion in lpfc_nvmet

2018-01-29 Thread Hannes Reinecke
On 01/24/2018 11:45 PM, James Smart wrote:
> The driver was inappropriately pulling in the nvme host's
> nvme.h header. What it really needed was the standard
> <linux/nvme.h> header.
> 
> Signed-off-by: Dick Kennedy 
> Signed-off-by: James Smart 
> ---
>  drivers/scsi/lpfc/lpfc_nvmet.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/lpfc/lpfc_nvmet.c b/drivers/scsi/lpfc/lpfc_nvmet.c
> index 0539585d32d4..f8863c51ba20 100644
> --- a/drivers/scsi/lpfc/lpfc_nvmet.c
> +++ b/drivers/scsi/lpfc/lpfc_nvmet.c
> @@ -36,7 +36,7 @@
>  #include 
>  #include 
>  
> -#include <../drivers/nvme/host/nvme.h>
> +#include <linux/nvme.h>
>  #include 
>  #include 
>  
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 03/19] lpfc: move placement of target destroy on driver detach

2018-01-29 Thread Hannes Reinecke
On 01/24/2018 11:45 PM, James Smart wrote:
> Ensure nvme localports/targetports are torn down before
> dismantling the adapter SLI interface on driver detachment.
> This keeps the interfaces live while nvme may still be making
> callbacks to abort outstanding IO.
> 
> Signed-off-by: Dick Kennedy 
> Signed-off-by: James Smart 
> ---
>  drivers/scsi/lpfc/lpfc_init.c | 14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
> index 5b96bf0d3331..851d4e889042 100644
> --- a/drivers/scsi/lpfc/lpfc_init.c
> +++ b/drivers/scsi/lpfc/lpfc_init.c
> @@ -11503,13 +11503,6 @@ lpfc_pci_remove_one_s4(struct pci_dev *pdev)
>   /* Remove FC host and then SCSI host with the physical port */
>   fc_remove_host(shost);
>   scsi_remove_host(shost);
> - /*
> -  * Bring down the SLI Layer. This step disables all interrupts,
> -  * clears the rings, discards all mailbox commands, and resets
> -  * the HBA FCoE function.
> -  */
> - lpfc_debugfs_terminate(vport);
> - lpfc_sli4_hba_unset(phba);
>  
>   /* Perform ndlp cleanup on the physical port.  The nvme and nvmet
>* localports are destroyed after to cleanup all transport memory.
> @@ -11518,6 +11511,13 @@ lpfc_pci_remove_one_s4(struct pci_dev *pdev)
>   lpfc_nvmet_destroy_targetport(phba);
>   lpfc_nvme_destroy_localport(vport);
>  
> + /*
> +  * Bring down the SLI Layer. This step disables all interrupts,
> +  * clears the rings, discards all mailbox commands, and resets
> +  * the HBA FCoE function.
> +  */
> + lpfc_debugfs_terminate(vport);
> + lpfc_sli4_hba_unset(phba);
>  
>   lpfc_stop_hba_timers(phba);
>   spin_lock_irq(>hbalock);
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 09/19] lpfc: Allow set of maximum outstanding SCSI cmd limit for a target parameter

2018-01-29 Thread Hannes Reinecke
On 01/24/2018 11:45 PM, James Smart wrote:
> Make the attribute writeable.
> Remove the ramp-up logic as it's unnecessary; simply set the depth.
> Add a debug message if the depth changed (possibly reducing the
> limit) while our outstanding count has yet to catch up with it.
> 
> Signed-off-by: Dick Kennedy 
> Signed-off-by: James Smart 
> ---
>  drivers/scsi/lpfc/lpfc_attr.c |  4 ++--
>  drivers/scsi/lpfc/lpfc_scsi.c | 39 ---
>  2 files changed, 30 insertions(+), 13 deletions(-)
> 
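A generic sketch of what the simplification amounts to - this is not
lpfc's actual code, and the context structure is hypothetical:

#include <linux/atomic.h>
#include <linux/printk.h>
#include <linux/types.h>

struct tgt_ctx {
	u32 cmd_limit;			/* max outstanding cmds allowed */
	atomic_t outstanding;		/* cmds currently in flight */
};

static void set_tgt_cmd_limit(struct tgt_ctx *tgt, u32 new_depth)
{
	u32 old = tgt->cmd_limit;

	tgt->cmd_limit = new_depth;	/* no ramp-up: just set it */

	if (new_depth < old &&
	    atomic_read(&tgt->outstanding) > (int)new_depth)
		pr_debug("depth %u -> %u while %d cmds outstanding\n",
			 old, new_depth, atomic_read(&tgt->outstanding));
}
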
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 08/19] lpfc: Fix RQ empty firmware trap

2018-01-29 Thread Hannes Reinecke
On 01/24/2018 11:45 PM, James Smart wrote:
> When the nvme target's deferred receive logic waits for exchange
> resources, the corresponding receive buffer is not replenished
> to the hardware. This can result in a lack of asynchronous
> receive buffer resources in the hardware, resulting in a
> "2885 Port Status Event: ... error 1=0x52004a01 ..." message.
> 
> Correct by replenishing the buffer whenever the deferred
> logic kicks in.  Update the corresponding debug messages and
> statistics as well.
> 
> Signed-off-by: Dick Kennedy 
> Signed-off-by: James Smart 
> ---
>  drivers/scsi/lpfc/lpfc_attr.c  |  6 ++
>  drivers/scsi/lpfc/lpfc_mem.c   |  8 ++--
>  drivers/scsi/lpfc/lpfc_nvmet.c | 31 +--
>  drivers/scsi/lpfc/lpfc_nvmet.h |  7 +--
>  drivers/scsi/lpfc/lpfc_sli.c   | 12 
>  5 files changed, 50 insertions(+), 14 deletions(-)
> 
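The gist of the fix as a sketch - all structure and helper names below
are hypothetical stand-ins for the lpfc nvmet internals:

#include <linux/atomic.h>

struct rqb_buf;				/* hypothetical received buffer */
struct rq_pair;				/* hypothetical hdr/data RQ pair */

struct nvmet_ctx {
	struct rq_pair *hdr_rq, *data_rq;
	atomic_t defer_cnt;		/* statistic updated by the fix */
};

static void hold_for_deferred_processing(struct rqb_buf *buf);
static void repost_rq_buffer(struct rq_pair *hrq, struct rq_pair *drq);

/* Deferral path: hold the buffer for later, but replenish the
 * hardware RQ immediately so asynchronous receives never starve
 * (the step that was missing). */
static void nvmet_defer_rcv(struct nvmet_ctx *tgt, struct rqb_buf *buf)
{
	hold_for_deferred_processing(buf);
	repost_rq_buffer(tgt->hdr_rq, tgt->data_rq);
	atomic_inc(&tgt->defer_cnt);
}
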
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 17/19] lpfc: Fix nonrecovery of NVME controller after cable swap.

2018-01-29 Thread Hannes Reinecke
On 01/24/2018 11:45 PM, James Smart wrote:
> In a test that is doing large numbers of cable swaps on the target,
> the nvme controllers wouldn't reconnect.
> 
> During the cable swaps, the target's n_port_id would change. This
> information was passed to the nvme-fc transport, in the new remoteport
> registration. However, the nvme-fc transport didn't update the n_port_id
> value in the remoteport struct when it reused an existing structure.
> Later, when a new association was attempted on the remoteport, the
> driver's NVME LS routine would use the stale n_port_id from the remoteport
> struct to address the LS. As the device is no longer at that address,
> the LS would go into never never land.
> 
> Separately, the nvme-fc transport will be corrected to update the
> n_port_id value on a re-registration.
> 
> However, for now, there's no reason to use the transport's values.
> The private pointer points to the driver's node structure, and the
> node structure is up to date. Therefore, revise the LS routine to
> use the driver's data structures for the LS. Augment the debug
> message for better debugging in the future.
> 
> Also remove a duplicate if check that seems to have slipped in.
> 
> Signed-off-by: Dick Kennedy 
> Signed-off-by: James Smart 
> ---
>  drivers/scsi/lpfc/lpfc_nvme.c | 24 +---
>  1 file changed, 13 insertions(+), 11 deletions(-)
> 
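A sketch of the addressing change - lpfc field names are used where known
(rport->ndlp, nlp_DID), while the wrapper function is hypothetical:

#include <linux/types.h>

/* Before (conceptually): did = pnvme_rport->port_id - a value the
 * transport may not have refreshed after a cable swap.
 * After: take the address from the driver's node list entry, which
 * tracks the current n_port_id. */
static u32 ls_destination_did(struct lpfc_nvme_rport *rport)
{
	struct lpfc_nodelist *ndlp = rport->ndlp;

	return ndlp->nlp_DID;		/* current n_port_id */
}
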
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 07/19] lpfc: Fix IO failure during hba reset testing with nvme io.

2018-01-29 Thread Hannes Reinecke
On 01/24/2018 11:45 PM, James Smart wrote:
> A stress test repeatedly resetting the adapter while performing
> io would eventually report I/O failures and missing nvme namespaces.
> 
> The driver was setting the nvmefc_fcp_req->private pointer to NULL
> during the IO completion routine before upcalling done().
> If the transport was also running an abort for that IO, the driver
> would fail the abort with message 6140. Failing the abort is not
> allowed by the nvme-fc transport, as it mandates that the io must be
> returned to the transport. As that does not happen, the transport's
> controller delete has an outstanding reference and can't complete
> the teardown.
> 
> Remove the NULL'ing of the private pointer in the nvmefc request.
> The driver simply overwrites this value on each IO start.
> 
> Signed-off-by: Dick Kennedy 
> Signed-off-by: James Smart 
> ---
>  drivers/scsi/lpfc/lpfc_nvme.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/scsi/lpfc/lpfc_nvme.c b/drivers/scsi/lpfc/lpfc_nvme.c
> index 81e3a4f10c3c..92643ffa79c3 100644
> --- a/drivers/scsi/lpfc/lpfc_nvme.c
> +++ b/drivers/scsi/lpfc/lpfc_nvme.c
> @@ -804,7 +804,6 @@ lpfc_nvme_io_cmd_wqe_cmpl(struct lpfc_hba *phba, struct 
> lpfc_iocbq *pwqeIn,
>   struct nvme_fc_cmd_iu *cp;
>   struct lpfc_nvme_rport *rport;
>   struct lpfc_nodelist *ndlp;
> - struct lpfc_nvme_fcpreq_priv *freqpriv;
>   struct lpfc_nvme_lport *lport;
>   unsigned long flags;
>   uint32_t code, status;
> @@ -980,8 +979,6 @@ lpfc_nvme_io_cmd_wqe_cmpl(struct lpfc_hba *phba, struct 
> lpfc_iocbq *pwqeIn,
>   phba->cpucheck_cmpl_io[lpfc_ncmd->cpu]++;
>   }
>  #endif
> - freqpriv = nCmd->private;
> - freqpriv->nvme_buf = NULL;
>  
>   /* NVME targets need completion held off until the abort exchange
>* completes unless the NVME Rport is getting unregistered.
> 
I would avoid that if possible.
By not zeroing the pointers we run the risk of executing the wrong
callback on stale commands.
Can't you just modify the abort handling to always return 'true' if this
condition is hit?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 04/19] lpfc: correct debug counters for abort

2018-01-29 Thread Hannes Reinecke
On 01/24/2018 11:45 PM, James Smart wrote:
> The existing code was using the wrong field for the completion status
> when deciding whether to increment the abort statistics.
> 
> Signed-off-by: Dick Kennedy 
> Signed-off-by: James Smart 
> ---
>  drivers/scsi/lpfc/lpfc_nvmet.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/lpfc/lpfc_nvmet.c b/drivers/scsi/lpfc/lpfc_nvmet.c
> index 8dbf5c9d51aa..7927ac46d345 100644
> --- a/drivers/scsi/lpfc/lpfc_nvmet.c
> +++ b/drivers/scsi/lpfc/lpfc_nvmet.c
> @@ -130,7 +130,7 @@ lpfc_nvmet_xmt_ls_rsp_cmp(struct lpfc_hba *phba, struct 
> lpfc_iocbq *cmdwqe,
>   if (tgtp) {
>   if (status) {
>   atomic_inc(>xmt_ls_rsp_error);
> - if (status == IOERR_ABORT_REQUESTED)
> + if (result == IOERR_ABORT_REQUESTED)
>   atomic_inc(>xmt_ls_rsp_aborted);
>   if (bf_get(lpfc_wcqe_c_xb, wcqe))
>   atomic_inc(>xmt_ls_rsp_xb_set);
> @@ -541,7 +541,7 @@ lpfc_nvmet_xmt_fcp_op_cmp(struct lpfc_hba *phba, struct 
> lpfc_iocbq *cmdwqe,
>   rsp->transferred_length = 0;
>   if (tgtp) {
>   atomic_inc(>xmt_fcp_rsp_error);
> - if (status == IOERR_ABORT_REQUESTED)
> + if (result == IOERR_ABORT_REQUESTED)
>   atomic_inc(>xmt_fcp_rsp_aborted);
>   }
>  
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 01/19] lpfc: Fix frequency of Release WQE CQEs

2018-01-29 Thread Hannes Reinecke
On 01/24/2018 11:45 PM, James Smart wrote:
> The driver controls when the hardware sends completions that
> communicate consumption of elements from the WQ. This is done by
> setting a WQEC bit on a WQE.
> 
> The current driver sets it on every Nth WQE posting. However, the
> driver isn't clearing the bit if the WQE is reused. Thus, if the
> queue depth isn't evenly divisible by N, with enough time, it can
> be set on every element, creating a lot of overhead and risking
> CQ full conditions.
> 
> Correct by clearing the bit when not setting it on an Nth element.
> 
> Signed-off-by: Dick Kennedy 
> Signed-off-by: James Smart 
> ---
>  drivers/scsi/lpfc/lpfc_sli.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/scsi/lpfc/lpfc_sli.c b/drivers/scsi/lpfc/lpfc_sli.c
> index 5f5528a12308..149f21f53b13 100644
> --- a/drivers/scsi/lpfc/lpfc_sli.c
> +++ b/drivers/scsi/lpfc/lpfc_sli.c
> @@ -129,6 +129,8 @@ lpfc_sli4_wq_put(struct lpfc_queue *q, union lpfc_wqe 
> *wqe)
>   /* set consumption flag every once in a while */
>   if (!((q->host_index + 1) % q->entry_repost))
>   bf_set(wqe_wqec, >generic.wqe_com, 1);
> + else
> + bf_set(wqe_wqec, >generic.wqe_com, 0);
>   if (q->phba->sli3_options & LPFC_SLI4_PHWQ_ENABLED)
>   bf_set(wqe_wqid, >generic.wqe_com, q->queue_id);
>   lpfc_sli_pcimem_bcopy(wqe, temp_wqe, q->entry_size);
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)