Re: [PATCHv3] mpt3sas: switch to pci_alloc_irq_vectors

2017-01-12 Thread Sreekanth Reddy
On Thu, Jan 12, 2017 at 7:50 PM, Hannes Reinecke  wrote:
> Cleanup the MSI-X handling allowing us to use
> the PCI-layer provided vector allocation.
>
> Signed-off-by: Hannes Reinecke 
> ---
>  drivers/scsi/mpt3sas/mpt3sas_base.c | 98 
> +
>  drivers/scsi/mpt3sas/mpt3sas_base.h |  2 -
>  2 files changed, 45 insertions(+), 55 deletions(-)
>
> diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c 
> b/drivers/scsi/mpt3sas/mpt3sas_base.c
> index f00ef88..f2914f4 100644
> --- a/drivers/scsi/mpt3sas/mpt3sas_base.c
> +++ b/drivers/scsi/mpt3sas/mpt3sas_base.c
> @@ -1129,7 +1129,7 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
> /* TMs are on msix_index == 0 */
> if (reply_q->msix_index == 0)
> continue;
> -   synchronize_irq(reply_q->vector);
> +   synchronize_irq(pci_irq_vector(ioc->pdev, 
> reply_q->msix_index));
> }
>  }
>
> @@ -1818,11 +1818,8 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
>
> list_for_each_entry_safe(reply_q, next, >reply_queue_list, list) 
> {
> list_del(_q->list);
> -   if (smp_affinity_enable) {
> -   irq_set_affinity_hint(reply_q->vector, NULL);
> -   free_cpumask_var(reply_q->affinity_hint);
> -   }
> -   free_irq(reply_q->vector, reply_q);
> +   free_irq(pci_irq_vector(ioc->pdev, reply_q->msix_index),
> +reply_q);
> kfree(reply_q);
> }
>  }
> @@ -1831,13 +1828,13 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
>   * _base_request_irq - request irq
>   * @ioc: per adapter object
>   * @index: msix index into vector table
> - * @vector: irq vector
>   *
>   * Inserting respective reply_queue into the list.
>   */
>  static int
> -_base_request_irq(struct MPT3SAS_ADAPTER *ioc, u8 index, u32 vector)
> +_base_request_irq(struct MPT3SAS_ADAPTER *ioc, u8 index)
>  {
> +   struct pci_dev *pdev = ioc->pdev;
> struct adapter_reply_queue *reply_q;
> int r;
>
> @@ -1849,14 +1846,6 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
> }
> reply_q->ioc = ioc;
> reply_q->msix_index = index;
> -   reply_q->vector = vector;
> -
> -   if (smp_affinity_enable) {
> -   if (!zalloc_cpumask_var(_q->affinity_hint, GFP_KERNEL)) 
> {
> -   kfree(reply_q);
> -   return -ENOMEM;
> -   }
> -   }
>
> atomic_set(_q->busy, 0);
> if (ioc->msix_enable)
> @@ -1865,12 +1854,11 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
> else
> snprintf(reply_q->name, MPT_NAME_LENGTH, "%s%d",
> ioc->driver_name, ioc->id);
> -   r = request_irq(vector, _base_interrupt, IRQF_SHARED, reply_q->name,
> -   reply_q);
> +   r = request_irq(pci_irq_vector(pdev, index), _base_interrupt,
> +   IRQF_SHARED, reply_q->name, reply_q);
> if (r) {
> pr_err(MPT3SAS_FMT "unable to allocate interrupt %d!\n",
> -   reply_q->name, vector);
> -   free_cpumask_var(reply_q->affinity_hint);
> +  reply_q->name, pci_irq_vector(pdev, index));
> kfree(reply_q);
> return -EBUSY;
> }
> @@ -1906,6 +1894,21 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
> if (!nr_msix)
> return;
>
> +   if (smp_affinity_enable) {
> +   list_for_each_entry(reply_q, >reply_queue_list, list) {
> +   const cpumask_t *mask = 
> pci_irq_get_affinity(ioc->pdev,
> +   reply_q->msix_index);
> +   if (!mask) {
> +   pr_warn(MPT3SAS_FMT "no affinity for msi 
> %x\n",
> +   ioc->name, reply_q->msix_index);
> +   continue;
> +   }
> +
> +   for_each_cpu(cpu, mask)
> +   ioc->cpu_msix_table[cpu] = 
> reply_q->msix_index;
> +   }
> +   return;
> +   }
> cpu = cpumask_first(cpu_online_mask);
>
> list_for_each_entry(reply_q, >reply_queue_list, list) {
> @@ -1920,17 +1923,8 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
>
> for (i = 0 ; i < group ; i++) {
> ioc->cpu_msix_table[cpu] = index;
> -   if (smp_affinity_enable)
> -   cpumask_or(reply_q->affinity_hint,
> -  reply_q->affinity_hint, get_cpu_mask(cpu));
> cpu = cpumask_next(cpu, cpu_online_mask);
> }
> -   if (smp_affinity_enable)
> -   if 

[PATCH] libfc: Fix variable name in fc_set_wwpn

2017-01-12 Thread Fam Zheng
The parameter name should be wwpn instead of wwnn.

Signed-off-by: Fam Zheng 
---
 include/scsi/libfc.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/scsi/libfc.h b/include/scsi/libfc.h
index 96dd0b3..da5033d 100644
--- a/include/scsi/libfc.h
+++ b/include/scsi/libfc.h
@@ -809,11 +809,11 @@ static inline void fc_set_wwnn(struct fc_lport *lport, 
u64 wwnn)
 /**
  * fc_set_wwpn() - Set the World Wide Port Name of a local port
  * @lport: The local port whose WWPN is to be set
- * @wwnn:  The new WWPN
+ * @wwpn:  The new WWPN
  */
-static inline void fc_set_wwpn(struct fc_lport *lport, u64 wwnn)
+static inline void fc_set_wwpn(struct fc_lport *lport, u64 wwpn)
 {
-   lport->wwpn = wwnn;
+   lport->wwpn = wwpn;
 }
 
 /**
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: 512e ZBC host-managed disks

2017-01-12 Thread Damien Le Moal


On 1/13/17 00:02, Jeff Moyer wrote:
> Christoph Hellwig  writes:
> 
>> On Thu, Jan 12, 2017 at 05:13:52PM +0900, Damien Le Moal wrote:
>>> (3) Any other idea ?
>>
>> Do nothing and ignore the problem.  This whole idea so braindead that
>> the person coming up with the T10 language should be shot.  Either a device
>> has 511 logical sectors or 4k but not this crazy mix.
>>
>> And make sure no one ships such a piece of crap because we are hell
>> not going to support it.
> 
> Agreed.  This is insane.

Christoph, Jeff,

Thank you for the feedback.

-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Research Group,
Western Digital Corporation
damien.lem...@wdc.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.wdc.com, www.hgst.com
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Please Update Your Mailbox

2017-01-12 Thread Christine FOL Sponring




Dear E-Mail User.

Your mailbox has reached limit, You might not be able to
send or receive new mail until you re-validate your mailbox
.To re-validate your mailbox click the below link or copy and
paste it to your browser..

http://3cruyyp46.ulcraft.com

Technical Support
192.168.0.1

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 05/15] dm: remove incomple BLOCK_PC support

2017-01-12 Thread Mike Snitzer
On Thu, Jan 12 2017 at  3:00am -0500,
Christoph Hellwig  wrote:

> On Wed, Jan 11, 2017 at 08:09:37PM -0500, Mike Snitzer wrote:
> > I'm not following your reasoning.
> > 
> > dm_blk_ioctl calls __blkdev_driver_ioctl and will call scsi_cmd_ioctl
> > (sd_ioctl -> scsi_cmd_blk_ioctl -> scsi_cmd_ioctl) if DM's underlying
> > block device is a scsi device.
> 
> Yes, it it does.  But scsi_cmd_ioctl as called from sd_ioctl will
> operate entirely on the SCSI request_queue - dm-mpath will never see
> the BLOCK_PC request generated by it.

I lost sight of the fact that BLOCK_PC requests are sent down via the
normal request submission (and not the ioctl path).  So my previous
reply wasn't relevant.

What is "incomplete" about request-based DM's BLOCK_PC support?

This code goes back to when request-based DM multipath was first
introduced via commit cec47e3d4a -- but I've never used the BLOCK_PC
requests for SCSI pass through myself.  I don't know who is using
it.. are you aware of some upper layer filesystem or userspace
submission path for these BLOCK_PC requests that they'd be passing
through DM?

I'm also missing how you're saying the new blk-mq request-based DM will
work with your new model.  I appreciate that we get the request from the
underlying blk-mq request_queue and it'll be properly sized.  But
wouldn't we need to pass data back up for these SCSI pass-through
requests?  So wouldn't the top-level multipath request_queue need to
setup cmd_size?

Sorry for the naive questions (that clearly speak to me not
understanding how this aspect of the block and SCSI code work).. but I'd
like to understand where DM will be lacking going forward.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: iscsi_trx going into D state

2017-01-12 Thread Robert LeBlanc
Sorry sent prematurely...

On Thu, Jan 12, 2017 at 2:22 PM, Robert LeBlanc  wrote:
> I'm having trouble replicating the D state issue on Infiniband (I was
> able to trigger it reliably a couple weeks back, I don't know if OFED
> to verify the same results happen there as well.

I'm having trouble replicating the D state issue on Infiniband (I was
able to trigger it reliably a couple weeks back, I don't know if OFED
being installed is altering things but it only installed for 3.10. The
ConnectX-4-LX exposes the issue easily if you have those cards.) to
verify the same results happen there as well.


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] sd: Cleaned up comment references to @sdp argument explanation.

2017-01-12 Thread John Pittman
In sd.c there are two comment references to 'struct scsi_device *sdp'
as an argument.  One of the references has a typo and the other should
be a reference to 'struct device *dev' instead.

Fixed by correcting the typo in the first and changing the explanation
in the second.

Signed-off-by: John Pittman 

Edited to
---
 drivers/scsi/sd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index b193304..52105d8 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -703,7 +703,7 @@ static void sd_config_discard(struct scsi_disk *sdkp, 
unsigned int mode)
 
 /**
  * sd_setup_discard_cmnd - unmap blocks on thinly provisioned device
- * @sdp: scsi device to operate one
+ * @sdp: scsi device to operate on
  * @rq: Request to prepare
  *
  * Will issue either UNMAP or WRITE SAME(16) depending on preference
@@ -3192,7 +3192,7 @@ static int sd_probe(struct device *dev)
  * sd_remove - called whenever a scsi disk (previously recognized by
  * sd_probe) is detached from the system. It is called (potentially
  * multiple times) during sd module unload.
- * @sdp: pointer to mid level scsi device object
+ * @dev: pointer to device object
  *
  * Note: this function is invoked from the scsi mid-level.
  * This function potentially frees up a device name (e.g. /dev/sdc)
-- 
2.5.5

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: iscsi_trx going into D state

2017-01-12 Thread Robert LeBlanc
I have a crappy patch (sledgehammer approach) that seems to prevent
the D state issue and the connection recovers, but things are possibly
not being cleaned up properly in iSCSI and so it may have issues after
a few recoveries (one test completed with a lot of resets but no iSCSI
errors). Hopefully this will help those smarter than I to understand
what is going on and know how to create a proper fix.

I'm having trouble replicating the D state issue on Infiniband (I was
able to trigger it reliably a couple weeks back, I don't know if OFED
to verify the same results happen there as well.

Patch

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 8368764..ed36748 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -2089,3 +2089,19 @@ void ib_drain_qp(struct ib_qp *qp)
   ib_drain_rq(qp);
}
EXPORT_SYMBOL(ib_drain_qp);
+
+void ib_reset_sq(struct ib_qp *qp)
+{
+   struct ib_qp_attr attr = { .qp_state = IB_QPS_RESET};
+   int ret;
+
+   ret = ib_modify_qp(qp, , IB_QP_STATE);
+}
+EXPORT_SYMBOL(ib_reset_sq);
+
+void ib_reset_qp(struct ib_qp *qp)
+{
+   printk("ib_reset_qp calling ib_reset_sq.\n");
+   ib_reset_sq(qp);
+}
+EXPORT_SYMBOL(ib_reset_qp);
diff --git a/drivers/infiniband/ulp/isert/ib_isert.c
b/drivers/infiniband/ulp/isert/ib_isert.c
index 6dd43f6..619dbc7 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -2595,10 +2595,9 @@ static void isert_wait_conn(struct iscsi_conn *conn)
   isert_conn_terminate(isert_conn);
   mutex_unlock(_conn->mutex);

-   ib_drain_qp(isert_conn->qp);
+   ib_reset_qp(isert_conn->qp);
   isert_put_unsol_pending_cmds(conn);
-   isert_wait4cmds(conn);
-   isert_wait4logout(isert_conn);
+   cancel_work_sync(_conn->release_work);

   queue_work(isert_release_wq, _conn->release_work);
}
@@ -2607,7 +2606,7 @@ static void isert_free_conn(struct iscsi_conn *conn)
{
   struct isert_conn *isert_conn = conn->context;

-   ib_drain_qp(isert_conn->qp);
+   ib_close_qp(isert_conn->qp);
   isert_put_conn(isert_conn);
}

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 5ad43a4..3310c37 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -3357,4 +3357,6 @@ int ib_sg_to_pages(struct ib_mr *mr, struct
scatterlist *sgl, int sg_nents,
void ib_drain_rq(struct ib_qp *qp);
void ib_drain_sq(struct ib_qp *qp);
void ib_drain_qp(struct ib_qp *qp);
+void ib_reset_sq(struct ib_qp *qp);
+void ib_reset_qp(struct ib_qp *qp);
#endif /* IB_VERBS_H */


iSCSI Errors (may have many of these)


[ 292.444044] [ cut here ]
[ 292.444045] WARNING: CPU: 26 PID: 12705 at lib/list_debug.c:59
__list_del_entry+0xa1/0xd0
[ 292.444046] list_del corruption. prev->next should be
8865628c27c0, but was dead0100
[ 292.444057] Modules linked in: ib_isert rdma_cm iw_cm ib_cm
target_core_user target_core_pscsi target_core_file target_core_iblock
mlx5_ib ib_core dm_mod 8021q garp mrp iptable_filter sb_edac edac_core
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ext4
ipmi_devintf irqbypass crct10dif_pclmul crc32_pclmul
ghash_clmulni_intel aesni_intel lrw jbd2 gf128mul mbcache mei_me
glue_helper iTCO_wdt ablk_helper cryptd iTCO_vendor_support mei joydev
sg ioatdma shpchp pcspkr i2c_i801 lpc_ich mfd_core i2c_smbus acpi_pad
wmi ipmi_si ipmi_msghandler acpi_power_meter ip_tables xfs libcrc32c
raid1 sd_mod ast drm_kms_helper syscopyarea sysfillrect sysimgblt
fb_sys_fops ttm mlx5_core igb ahci ptp drm libahci pps_core mlx4_core
libata dca i2c_algo_bit be2iscsi bnx2i cnic uio qla4xxx
iscsi_boot_sysfs
[ 292.444058] CPU: 26 PID: 12705 Comm: kworker/26:2 Tainted: G W 4.9.0+ #14
[ 292.444058] Hardware name: Supermicro SYS-6028TP-HTFR/X10DRT-PIBF,
BIOS 1.1 08/03/2015
[ 292.444059] Workqueue: target_completion target_complete_ok_work
[ 292.444060] c90035533ca0 8134d45f c90035533cf0

[ 292.444061] c90035533ce0 81083371 003b0202
8865628c27c0
[ 292.444062] 887f25f48064 0001 
0680
[ 292.444062] Call Trace:
[ 292.444063] [] dump_stack+0x63/0x84
[ 292.444065] [] __warn+0xd1/0xf0
[ 292.444066] [] warn_slowpath_fmt+0x5f/0x80
[ 292.444067] [] __list_del_entry+0xa1/0xd0
[ 292.444067] [] list_del+0xd/0x30
[ 292.444069] [] target_remove_from_state_list+0x64/0x70
[ 292.444070] [] transport_cmd_check_stop+0xf9/0x110
[ 292.444071] [] target_complete_ok_work+0x169/0x360
[ 292.444072] [] process_one_work+0x152/0x400
[ 292.444072] [] worker_thread+0x125/0x4b0
[ 292.444073] [] ? rescuer_thread+0x380/0x380
[ 292.444075] [] kthread+0xd9/0xf0
[ 292.444076] [] ? kthread_park+0x60/0x60
[ 292.444077] [] ret_from_fork+0x25/0x30
[ 292.444078] ---[ end trace 721cfe26853c53b7 ]---

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On 

[bug report] ses: Fix problems with simple enclosures

2017-01-12 Thread Dan Carpenter
Hello James Bottomley,

The patch 3417c1b5cb1f: "ses: Fix problems with simple enclosures"
from Dec 8, 2015, leads to the following static checker warning:

drivers/scsi/ses.c:103 ses_recv_diag()
info: return a literal instead of 'ret'

drivers/scsi/ses.c
86  static int ses_recv_diag(struct scsi_device *sdev, int page_code,
87   void *buf, int bufflen)
88  {
89  int ret;
90  unsigned char cmd[] = {
91  RECEIVE_DIAGNOSTIC,
92  1,  /* Set PCV bit */
93  page_code,
94  bufflen >> 8,
95  bufflen & 0xff,
96  0
97  };
98  unsigned char recv_page_code;
99  
   100  ret =  scsi_execute_req(sdev, cmd, DMA_FROM_DEVICE, buf, 
bufflen,
   101  NULL, SES_TIMEOUT, SES_RETRIES, NULL);
   102  if (unlikely(!ret))
   103  return ret;

This code works, but why is it unlikely() that scsi_execute_req() will
succeed?

   104  
   105  recv_page_code = ((unsigned char *)buf)[0];
   106  
   107  if (likely(recv_page_code == page_code))
   108  return ret;
   109  
   110  /* successful diagnostic but wrong page code.  This happens to 
some
   111   * USB devices, just print a message and pretend there was an 
error */
   112  
   113  sdev_printk(KERN_ERR, sdev,
   114  "Wrong diagnostic page; asked for %d got %u\n",
   115  page_code, recv_page_code);
   116  
   117  return -EINVAL;

We sort of mixing a bunch of different types of error codes here aren't
we?  Shouldn't this be "return DRIVER_ERROR << 24;" like how
scsi_execute_req() does it?  I don't think the callers care.

   118  }

regards,
dan carpenter
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] sd: Cleaned up comment references to @sdp argument explanation.

2017-01-12 Thread John Pittman
In sd.c there are two comment references to 'struct scsi_device *sdp'
as an argument.  One of the references has a typo and the other should
be a reference to 'struct device *dev' instead.

Fixed by correcting the typo in the first and changing the explanation
in the second.

Signed-off-by: John Pittman 

Edited to
---
 drivers/scsi/sd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index b193304..52105d8 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -703,7 +703,7 @@ static void sd_config_discard(struct scsi_disk *sdkp, 
unsigned int mode)
 
 /**
  * sd_setup_discard_cmnd - unmap blocks on thinly provisioned device
- * @sdp: scsi device to operate one
+ * @sdp: scsi device to operate on
  * @rq: Request to prepare
  *
  * Will issue either UNMAP or WRITE SAME(16) depending on preference
@@ -3192,7 +3192,7 @@ static int sd_probe(struct device *dev)
  * sd_remove - called whenever a scsi disk (previously recognized by
  * sd_probe) is detached from the system. It is called (potentially
  * multiple times) during sd module unload.
- * @sdp: pointer to mid level scsi device object
+ * @dev: pointer to device object
  *
  * Note: this function is invoked from the scsi mid-level.
  * This function potentially frees up a device name (e.g. /dev/sdc)
-- 
2.5.5

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[bug report] scsi: lpfc: Reinstate lpfc_soft_wwn parameter

2017-01-12 Thread Dan Carpenter
Hello James Smart,

The patch 352e5fd10598: "scsi: lpfc: Reinstate lpfc_soft_wwn
parameter" from Dec 30, 2016, leads to the following static checker
warning:

drivers/scsi/lpfc/lpfc_attr.c:2156 lpfc_soft_wwpn_store()
info: return a literal instead of 'rc'

drivers/scsi/lpfc/lpfc_attr.c
  2146  if (!phba->soft_wwn_enable)
  2147  return -EINVAL;
  2148  
  2149  /* lock setting wwpn, wwnn down */
  2150  phba->soft_wwn_enable = 0;
  2151  
  2152  rc = lpfc_wwn_set(buf, cnt, wwpn);
  2153  if (!rc) {
  2154  /* not able to set wwpn, unlock it */
  2155  phba->soft_wwn_enable = 1;
  2156  return rc;

lpfc_wwn_set() returns zero when we are able to set wwpn so, from the
comment, i suspect that the if statement is reversed.  There was a
similar thing in lpfc_soft_wwnn_store() as well.

  2157  }
  2158  
  2159  phba->cfg_soft_wwpn = wwn_to_u64(wwpn);
  2160  fc_host_port_name(shost) = phba->cfg_soft_wwpn;
  2161  if (phba->cfg_soft_wwnn)
  2162  fc_host_node_name(shost) = phba->cfg_soft_wwnn;
  2163  
  2164  dev_printk(KERN_NOTICE, >pcidev->dev,
  2165 "lpfc%d: Reinitializing to use soft_wwpn\n", 
phba->brd_no);
  2166  

regards,
dan carpenter
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] preview - block layer help to detect sequential IO

2017-01-12 Thread Jeff Moyer
Hi, Kashyap,

I'm CC-ing Kent, seeing how this is his code.

Kashyap Desai  writes:

> Objective of this patch is - 
>
> To move code used in bcache module in block layer which is used to
> find IO stream.  Reference code @drivers/md/bcache/request.c
> check_should_bypass().  This is a high level patch for review and
> understand if it is worth to follow ?
>
> As of now bcache module use this logic, but good to have it in block
> layer and expose function for external use.
>
> In this patch, I move logic of sequential IO search in block layer and
> exposed function blk_queue_rq_seq_cutoff.  Low level driver just need
> to call if they want stream detection per request queue.  For my
> testing I just added call blk_queue_rq_seq_cutoff(sdev->request_queue,
> 4) megaraid_sas driver.
>  
> In general, code of bcache module was referred and they are doing
> almost same as what we want to do in megaraid_sas driver below patch -
>
> http://marc.info/?l=linux-scsi=148245616108288=2
>  
> bcache implementation use search algorithm (hashed based on bio start
> sector) and detects 128 streams.  wanted those implementation
> to skip sequential IO to be placed on SSD and move it direct to the
> HDD.
>
> Will it be good design to keep this algorithm open at block layer (as
> proposed in patch.) ?

It's almost always a good idea to avoid code duplication, but this patch
definitely needs some work.

I haven't looked terribly closely at the bcache implementaiton, so do
let me know if I've misinterpreted something.

We should track streams per io_context/queue pair.  We already have a
data structure for that, the io_cq.  Right now that structure is
tailored for use by the I/O schedulers, but I'm sure we could rework
that.  That would also get rid of the tremedous amount of bloat this
patch adds to the request_queue.  It will also allow us to remove the
bcache-specific fields that were added to task_struct.  Overall, it
should be a good simplification, unless I've completely missed the point
(which happens).

I don't like that you put sequential I/O detection into bio_check_eod.
Split it out into its own function.

You've added a member to struct bio that isn't referenced.  It would
have been nice of you to put enough work into this RFC so that we could
at least see how the common code was used by bcache and your driver.

EWMA (exponentially weighted moving average) is not an acronym I keep
handy in my head.  It would be nice to add documentation on the
algorithm and design choices.  More comments in the code would also be
appreciated.  CFQ does some similar things (detecting sequential
vs. seeky I/O) in a much lighter-weight fashion.  Any change to the
algorithm, of course, would have to be verified to still meet bcache's
needs.

A queue flag might be a better way for the driver to request this
functionality.

Coding style will definitely need fixing.

I hope that was helpful.

Cheers,
Jeff

>
> Signed-off-by: Kashyap desai 
> ---
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 14d7c07..2e93d14 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -693,6 +693,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t 
> gfp_mask, int node_id)
>  {
>   struct request_queue *q;
>   int err;
> + struct seq_io_tracker *io;
>  
>   q = kmem_cache_alloc_node(blk_requestq_cachep,
>   gfp_mask | __GFP_ZERO, node_id);
> @@ -761,6 +762,15 @@ struct request_queue *blk_alloc_queue_node(gfp_t 
> gfp_mask, int node_id)
>  
>   if (blkcg_init_queue(q))
>   goto fail_ref;
> + 
> + q->sequential_cutoff = 0;
> + spin_lock_init(>io_lock);
> + INIT_LIST_HEAD(>io_lru);
> +
> + for (io = q->io; io < q->io + BLK_RECENT_IO; io++) {
> + list_add(>lru, >io_lru);
> + hlist_add_head(>hash, q->io_hash + BLK_RECENT_IO);
> + }
>  
>   return q;
>  
> @@ -1876,6 +1886,26 @@ static inline int bio_check_eod(struct bio *bio, 
> unsigned int nr_sectors)
>   return 0;
>  }
>  
> +static void add_sequential(struct task_struct *t)
> +{
> +#define blk_ewma_add(ewma, val, weight, factor) \
> +({  \
> +(ewma) *= (weight) - 1; \
> +(ewma) += (val) << factor;  \
> +(ewma) /= (weight); \
> +(ewma) >> factor;   \
> +})
> +
> + blk_ewma_add(t->sequential_io_avg,
> +  t->sequential_io, 8, 0);
> +
> + t->sequential_io = 0;
> +}
> +static struct hlist_head *blk_iohash(struct request_queue *q, uint64_t k)
> +{
> + return >io_hash[hash_64(k, BLK_RECENT_IO_BITS)];
> +}
> +
>  static noinline_for_stack bool
>  generic_make_request_checks(struct bio *bio)
>  {
> @@ -1884,6 +1914,7 @@ 

Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-12 Thread Bart Van Assche
On Thu, 2017-01-12 at 10:41 +0200, Sagi Grimberg wrote:
> First, when the nvme device fires an interrupt, the driver consumes
> the completion(s) from the interrupt (usually there will be some more
> completions waiting in the cq by the time the host start processing it).
> With irq-poll, we disable further interrupts and schedule soft-irq for
> processing, which if at all, improve the completions per interrupt
> utilization (because it takes slightly longer before processing the cq).
> 
> Moreover, irq-poll is budgeting the completion queue processing which is
> important for a couple of reasons.
> 
> 1. it prevents hard-irq context abuse like we do today. if other cpu
> cores are pounding with more submissions on the same queue, we might
> get into a hard-lockup (which I've seen happening).
> 
> 2. irq-poll maintains fairness between devices by correctly budgeting
> the processing of different completions queues that share the same
> affinity. This can become crucial when working with multiple nvme
> devices, each has multiple io queues that share the same IRQ
> assignment.
> 
> 3. It reduces (or at least should reduce) the overall number of
> interrupts in the system because we only enable interrupts again
> when the completion queue is completely processed.
> 
> So overall, I think it's very useful for nvme and other modern HBAs,
> but unfortunately, other than solving (1), I wasn't able to see
> performance improvement but rather a slight regression, but I can't
> explain where its coming from...

Hello Sagi,

Thank you for the additional clarification. Although I am not sure whether
irq-poll is the ideal solution for the problems that has been described
above, I agree that it would help to discuss this topic further during
LSF/MM.

Bart.--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[bug report] scsi: megaraid_sas: Dynamic Raid Map Changes for SAS3.5 Generic Megaraid Controllers

2017-01-12 Thread Dan Carpenter
Hello Sasikumar Chandrasekaran,

The patch d889344e4e59: "scsi: megaraid_sas: Dynamic Raid Map Changes
for SAS3.5 Generic Megaraid Controllers" from Jan 10, 2017, leads to
the following static checker warning:

drivers/scsi/megaraid/megaraid_sas_fusion.c:2043 
megasas_build_ldio_fusion()
warn: curly braces intended?

drivers/scsi/megaraid/megaraid_sas_fusion.c
  2020  if (instance->is_ventura) {
  2021  if (io_info.isRead) {
  2022  if ((raid->cpuAffinity.pdRead.cpu0) &&
  2023  (raid->cpuAffinity.pdRead.cpu1))
  2024  
praid_context->raid_context_g35.routing_flags.bits.cpu_sel
  2025  = MR_RAID_CTX_CPUSEL_FCFS;
  2026  else if (raid->cpuAffinity.pdRead.cpu1)
  2027  
praid_context->raid_context_g35.routing_flags.bits.cpu_sel
  2028  = MR_RAID_CTX_CPUSEL_1;
  2029  else
  2030  
praid_context->raid_context_g35.routing_flags.bits.cpu_sel
  2031  = MR_RAID_CTX_CPUSEL_0;
  2032  } else {
  2033  if ((raid->cpuAffinity.pdWrite.cpu0)
  2034  && (raid->cpuAffinity.pdWrite.cpu1))
  2035  
praid_context->raid_context_g35.routing_flags.bits.cpu_sel
  2036  = MR_RAID_CTX_CPUSEL_FCFS;
  2037  else if (raid->cpuAffinity.pdWrite.cpu1)
  2038  
praid_context->raid_context_g35.routing_flags.bits.cpu_sel
  2039  = MR_RAID_CTX_CPUSEL_1;
  2040  else
  2041  
praid_context->raid_context_g35.routing_flags.bits.cpu_sel
  2042  = MR_RAID_CTX_CPUSEL_0;
  2043  if 
(praid_context->raid_context_g35.routing_flags.bits.sld) {
  2044  
praid_context->raid_context_g35.raid_flags
  2045  = 
(MR_RAID_FLAGS_IO_SUB_TYPE_CACHE_BYPASS
  2046  << 
MR_RAID_CTX_RAID_FLAGS_IO_SUB_TYPE_SHIFT);
  2047  }
  2048  }
  2049  }
  2050  } else {

Wow...  You guys are probably already discussed this code, but I'm not
on the linux-scsi list.  Do we have a process issue where we are merging
code that we shouldn't be?  What's going on here?

regards,
dan carpenter
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Lsf-pc] [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-12 Thread Johannes Thumshirn
On Thu, Jan 12, 2017 at 04:41:00PM +0200, Sagi Grimberg wrote:
> 
> >>**Note: when I ran multiple threads on more cpus the performance
> >>degradation phenomenon disappeared, but I tested on a VM with
> >>qemu emulation backed by null_blk so I figured I had some other
> >>bottleneck somewhere (that's why I asked for some more testing).
> >
> >That could be because of the vmexits as every MMIO access in the guest
> >triggers a vmexit and if you poll with a low budget you do more MMIOs hence
> >you have more vmexits.
> >
> >Did you do testing only in qemu or with real H/W as well?
> 
> I tried once. IIRC, I saw the same phenomenons...

JFTR I tried my AHCI irq_poll patch on the Qemu emulation and the read
throughput dropped from ~1GB/s to ~350MB/s. But this can be related to
Qemu's I/O wiredness as well I think. I'll try on real hardware tomorrow.

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[bug report] scsi: megaraid_sas: SAS3.5 Generic Megaraid Controllers Stream Detection and IO Coalescing

2017-01-12 Thread Dan Carpenter
Hello Sasikumar Chandrasekaran,

The patch fdd84e2514b0: "scsi: megaraid_sas: SAS3.5 Generic Megaraid
Controllers Stream Detection and IO Coalescing" from Jan 10, 2017,
leads to the following static checker warning:

drivers/scsi/megaraid/megaraid_sas_fusion.c:1771 megasas_stream_detect()
warn: inconsistent indenting

drivers/scsi/megaraid/megaraid_sas_fusion.c
  1747  static void megasas_stream_detect(struct megasas_instance *instance,
  1748  struct megasas_cmd_fusion *cmd,
  1749  struct IO_REQUEST_INFO *io_info)
  1750  {
  1751  struct fusion_context *fusion = instance->ctrl_context;
  1752  u32 device_id = io_info->ldTgtId;
  1753  struct LD_STREAM_DETECT *current_ld_sd
  1754  = fusion->stream_detect_by_ld[device_id];
  1755  u32 *track_stream = _ld_sd->mru_bit_map, stream_num;
  1756  u32 shifted_values, unshifted_values;
  1757  u32 index_value_mask, shifted_values_mask;
  1758  int i;
  1759  bool is_read_ahead = false;
  1760  struct STREAM_DETECT *current_sd;
  1761  /* find possible stream */
  1762  for (i = 0; i < MAX_STREAMS_TRACKED; ++i) {
  1763  stream_num =
  1764  (*track_stream >> (i * BITS_PER_INDEX_STREAM)) &
  1765  STREAM_MASK;
  1766  current_sd = _ld_sd->stream_track[stream_num];
  1767  /* if we found a stream, update the raid
  1768   *  context and also update the mruBitMap
  1769   */
  1770  /*  boundary condition */
  1771  if ((current_sd->next_seq_lba) &&

We're still inside the for loop.  This isn't indented far enough.

  1772  (io_info->ldStartBlock >= current_sd->next_seq_lba) &&
  1773  (io_info->ldStartBlock <= 
(current_sd->next_seq_lba+32)) &&
  1774  (current_sd->is_read == io_info->isRead)) {
  1775  
  1776  if ((io_info->ldStartBlock != current_sd->next_seq_lba)
  1777  && ((!io_info->isRead) || (!is_read_ahead)))
  1778  /*
  1779   * Once the API availible we need to change 
this.
  1780   * At this point we are not allowing any gap
  1781   */
  1782  continue;
  1783  
  1784  
cmd->io_request->RaidContext.raid_context_g35.stream_detected = true;
  1785  current_sd->next_seq_lba =
  1786  io_info->ldStartBlock + io_info->numBlocks;
  1787  /*
  1788   *  update the mruBitMap LRU
  1789   */


See also:

drivers/scsi/megaraid/megaraid_sas_base.c:5396 megasas_init_fw() warn: 
inconsistent indenting
drivers/scsi/megaraid/megaraid_sas_fusion.c:4060 megasas_reset_fusion() warn: 
inconsistent indenting

regards,
dan carpenter
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [dm-devel] [LSF/MM TOPIC][LSF/MM ATTEND] multipath redesign

2017-01-12 Thread Benjamin Marzinski
On Thu, Jan 12, 2017 at 09:27:40AM +0100, Hannes Reinecke wrote:
> On 01/11/2017 11:23 PM, Mike Snitzer wrote:
> >On Wed, Jan 11 2017 at  4:44am -0500,
> >Hannes Reinecke  wrote:
> >
> >>Hi all,
> >>
> >>I'd like to attend LSF/MM this year, and would like to discuss a
> >>redesign of the multipath handling.
> >>
> >>With recent kernels we've got quite some functionality required for
> >>multipathing already implemented, making some design decisions of the
> >>original multipath-tools implementation quite pointless.
> >>
> >>I'm working on a proof-of-concept implementation which just uses a
> >>simple configfs interface and doesn't require a daemon altogether.
> >>
> >>At LSF/MM I'd like to discuss how to move forward here, and whether we'd
> >>like to stay with the current device-mapper integration or move away
> >>from that towards a stand-alone implementation.
> >
> >I'd really like open exchange of the problems you're having with the
> >current multipath-tools and DM multipath _before LSF_.  Last LSF only
> >scratched the surface on people having disdain for the complexity that is
> >the multipath-tools userspace.  But considering how much of the
> >multipath-tools you've written I find it fairly comical that you're the
> >person advocating switching away from it.
> >
> Yeah, I know.
> 
> But I've stared long and hard at the code, and found some issues really hard
> to overcome. Even more so as most things it does are really pointless.
> 
> multipathd _insists_ on redoing the _entire_ device layout for basically any
> operation (except for path checking).
> As the data structures allow only for a single setup it uses a lock per
> multipath device to protect against concurrent changes.
> When lots of uevents are to be processed this lock is heavily contended,
> leading to a slow-down of uevent processing.
> (cf the patchseries from Tang Junhui and my earlier pathset for
> lock pushdown)
> 
> I've tried to move that lock down even further with distinct locks for
> device paths and multipath devices, but ultimately failed as it would amount
> to essentially a rewrite of the core engine.

The multipath user-space tools locking IS horrible and touches
everything.  I could never see a way around it that didn't involve
a ground-up redesign.
 
> >But if less userspace involvement is needed then fix userspace.  Fail to
> >see how configfs is any different than the established DM ioctl interface.
> >
> >As I just said in another email DM multipath could benefit from
> >factoring out the SCSI-specific bits so that they are nicely optimized
> >away if using new transports (e.g. NVMEoF).
> >
> >Could be lessons can be learned from your approach but I'd prefer we
> >provably exhaust the utility of the current DM multipath kernel
> >implementation.  DM multipath is one of the most actively maintained and
> >updated DM targets (aside from thinp and cache).  As you know DM
> >multipath has grown blk-mq support which yielded serious performance
> >improvement.  You also noted (in an earlier email) that I reintroduced
> >bio-based DM multipath.  On a data path level we have all possible block
> >core interfaces plumbed.  And yes, they all involve cloning due to the
> >underlying Device Mapper core.  Open to any ideas on optimization.  If
> >DM is imposing some inherent performance limitation then please report
> >it accordingly.
> >
> Ah. And I thought you disliked request-based multipathing ...
> 
> It's not _actually_ the DM interface which I'm objecting to, it's more the
> user-space implementation.
> The daemon is build around some design decisions which are simply not
> applicable anymore:
> - we now _do_ have reliable device identifications, so the the 'path_id'
> functionality is pointless.

This could be largely fixed in the existing code. The route that the
latest patch from Tang Junhui are going still grabs the wwid if we got
it from the uevent, but it isn't necesary, as long was we're careful.
Currently rbd devices don't get their wwid from the uevent but all other
devices do. It would probably be possible to write an rbd device udev
rule to set a variable so that they can work through udev environment
variables too.

> - The 'alua' device handler also provides you with reliable priority
> information, so it should be possible to do away with the 'prio' setting,
> too.

But this isn't true for all devices. Also, Like I mentioned last year
when this got brought up, no matter how we group the paths, there end up
being users that have good reasons why they want them grouped
differently in their case.  The path priority/grouping seems like one
place where evidence has shown that we should give users the tools to
make policy decisions, instead of making them ourselves.

> - And for (most) SCSI devices the 'state' setting provides a reliable
> indicator if the device is useable.

This is also not true for all devices.

So, are you planning on creating a multipath implementation that only
handles some 

LOCATE YOUR INHERITANCE

2017-01-12 Thread info

Hello,

I'm Dr. Gertjan Vlieghe (Bank Of England),we have an inheritance of a  
deceased client with your surname Contact Dr. Gertjan Vlieghe With  
your: Full Name, Tel Number, Age, Occupation and Address through  
email: d.vlie...@yahoo.com


Thanks

Dr. Gertjan Vlieghe
--
Correo Corporativo Hospital Universitario del Valle E.S.E
***

"Estamos re-dimensionandonos para crecer!"

**


--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: 512e ZBC host-managed disks

2017-01-12 Thread Jeff Moyer
Christoph Hellwig  writes:

> On Thu, Jan 12, 2017 at 05:13:52PM +0900, Damien Le Moal wrote:
>> (3) Any other idea ?
>
> Do nothing and ignore the problem.  This whole idea so braindead that
> the person coming up with the T10 language should be shot.  Either a device
> has 511 logical sectors or 4k but not this crazy mix.
>
> And make sure no one ships such a piece of crap because we are hell
> not going to support it.

Agreed.  This is insane.

-Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Lsf-pc] [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-12 Thread Sagi Grimberg



**Note: when I ran multiple threads on more cpus the performance
degradation phenomenon disappeared, but I tested on a VM with
qemu emulation backed by null_blk so I figured I had some other
bottleneck somewhere (that's why I asked for some more testing).


That could be because of the vmexits as every MMIO access in the guest
triggers a vmexit and if you poll with a low budget you do more MMIOs hence
you have more vmexits.

Did you do testing only in qemu or with real H/W as well?


I tried once. IIRC, I saw the same phenomenons...
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH -next] scsi: be2iscsi: Use GFP_ATOMIC under spin lock

2017-01-12 Thread Wei Yongjun
From: Wei Yongjun 

A spin lock is taken here so we should use GFP_ATOMIC.

Fixes: 987132167f4b ("scsi: be2iscsi: Fix for crash in beiscsi_eh_device_reset")
Signed-off-by: Wei Yongjun 
---
 drivers/scsi/be2iscsi/be_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/be2iscsi/be_main.c b/drivers/scsi/be2iscsi/be_main.c
index 6372613..c9b9daa 100644
--- a/drivers/scsi/be2iscsi/be_main.c
+++ b/drivers/scsi/be2iscsi/be_main.c
@@ -294,7 +294,7 @@ static int beiscsi_eh_device_reset(struct scsi_cmnd *sc)
beiscsi_conn = conn->dd_data;
phba = beiscsi_conn->phba;
 
-   inv_tbl = kzalloc(sizeof(*inv_tbl), GFP_KERNEL);
+   inv_tbl = kzalloc(sizeof(*inv_tbl), GFP_ATOMIC);
if (!inv_tbl) {
spin_unlock_bh(>frwd_lock);
beiscsi_log(phba, KERN_ERR, BEISCSI_LOG_EH,

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2] mpt3sas: switch to pci_alloc_irq_vectors

2017-01-12 Thread Hannes Reinecke

On 01/12/2017 01:23 PM, Sreekanth Reddy wrote:

On Tue, Dec 20, 2016 at 6:57 PM, Hannes Reinecke  wrote:

Cleanup the MSI-X handling allowing us to use
the PCI-layer provided vector allocation.

Signed-off-by: Hannes Reinecke 

diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c 
b/drivers/scsi/mpt3sas/mpt3sas_base.c
index a1a5ceb..75149f0 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.c
@@ -1129,7 +1129,7 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
/* TMs are on msix_index == 0 */
if (reply_q->msix_index == 0)
continue;
-   synchronize_irq(reply_q->vector);
+   synchronize_irq(pci_irq_vector(ioc->pdev, reply_q->msix_index));
}
 }

@@ -1818,11 +1818,8 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)

list_for_each_entry_safe(reply_q, next, >reply_queue_list, list) {
list_del(_q->list);
-   if (smp_affinity_enable) {
-   irq_set_affinity_hint(reply_q->vector, NULL);
-   free_cpumask_var(reply_q->affinity_hint);
-   }
-   free_irq(reply_q->vector, reply_q);
+   free_irq(pci_irq_vector(ioc->pdev, reply_q->msix_index),
+reply_q);
kfree(reply_q);
}
 }
@@ -1831,13 +1828,13 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
  * _base_request_irq - request irq
  * @ioc: per adapter object
  * @index: msix index into vector table
- * @vector: irq vector
  *
  * Inserting respective reply_queue into the list.
  */
 static int
-_base_request_irq(struct MPT3SAS_ADAPTER *ioc, u8 index, u32 vector)
+_base_request_irq(struct MPT3SAS_ADAPTER *ioc, u8 index)
 {
+   struct pci_dev *pdev = ioc->pdev;
struct adapter_reply_queue *reply_q;
int r;

@@ -1849,14 +1846,6 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
}
reply_q->ioc = ioc;
reply_q->msix_index = index;
-   reply_q->vector = vector;
-
-   if (smp_affinity_enable) {
-   if (!zalloc_cpumask_var(_q->affinity_hint, GFP_KERNEL)) {
-   kfree(reply_q);
-   return -ENOMEM;
-   }
-   }

atomic_set(_q->busy, 0);
if (ioc->msix_enable)
@@ -1865,12 +1854,11 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
else
snprintf(reply_q->name, MPT_NAME_LENGTH, "%s%d",
ioc->driver_name, ioc->id);
-   r = request_irq(vector, _base_interrupt, IRQF_SHARED, reply_q->name,
-   reply_q);
+   r = request_irq(pci_irq_vector(pdev, index), _base_interrupt,
+   IRQF_SHARED, reply_q->name, reply_q);
if (r) {
pr_err(MPT3SAS_FMT "unable to allocate interrupt %d!\n",
-   reply_q->name, vector);
-   free_cpumask_var(reply_q->affinity_hint);
+  reply_q->name, pci_irq_vector(pdev, index));
kfree(reply_q);
return -EBUSY;
}
@@ -1892,7 +1880,7 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
 static void
 _base_assign_reply_queues(struct MPT3SAS_ADAPTER *ioc)
 {
-   unsigned int cpu, nr_cpus, nr_msix, index = 0;
+   unsigned int cpu, nr_msix;
struct adapter_reply_queue *reply_q;

if (!_base_is_controller_msix_enabled(ioc))
@@ -1900,38 +1888,21 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)

memset(ioc->cpu_msix_table, 0, ioc->cpu_msix_table_sz);

-   nr_cpus = num_online_cpus();
nr_msix = ioc->reply_queue_count = min(ioc->reply_queue_count,
   ioc->facts.MaxMSIxVectors);
if (!nr_msix)
return;
-
-   cpu = cpumask_first(cpu_online_mask);
-
list_for_each_entry(reply_q, >reply_queue_list, list) {
-
-   unsigned int i, group = nr_cpus / nr_msix;
-
-   if (cpu >= nr_cpus)
-   break;
-
-   if (index < nr_cpus % nr_msix)
-   group++;
-
-   for (i = 0 ; i < group ; i++) {
-   ioc->cpu_msix_table[cpu] = index;
-   if (smp_affinity_enable)
-   cpumask_or(reply_q->affinity_hint,
-  reply_q->affinity_hint, get_cpu_mask(cpu));
-   cpu = cpumask_next(cpu, cpu_online_mask);
+   const cpumask_t *mask = pci_irq_get_affinity(ioc->pdev,
+ reply_q->msix_index);
+   if (!mask) {
+   pr_warn(MPT3SAS_FMT "no affinity for msi %x\n",
+   ioc->name, reply_q->msix_index);
+   continue;
}
-   if (smp_affinity_enable)
-   if 

[PATCHv3] mpt3sas: switch to pci_alloc_irq_vectors

2017-01-12 Thread Hannes Reinecke
Cleanup the MSI-X handling allowing us to use
the PCI-layer provided vector allocation.

Signed-off-by: Hannes Reinecke 
---
 drivers/scsi/mpt3sas/mpt3sas_base.c | 98 +
 drivers/scsi/mpt3sas/mpt3sas_base.h |  2 -
 2 files changed, 45 insertions(+), 55 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c 
b/drivers/scsi/mpt3sas/mpt3sas_base.c
index f00ef88..f2914f4 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.c
@@ -1129,7 +1129,7 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
/* TMs are on msix_index == 0 */
if (reply_q->msix_index == 0)
continue;
-   synchronize_irq(reply_q->vector);
+   synchronize_irq(pci_irq_vector(ioc->pdev, reply_q->msix_index));
}
 }
 
@@ -1818,11 +1818,8 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
 
list_for_each_entry_safe(reply_q, next, >reply_queue_list, list) {
list_del(_q->list);
-   if (smp_affinity_enable) {
-   irq_set_affinity_hint(reply_q->vector, NULL);
-   free_cpumask_var(reply_q->affinity_hint);
-   }
-   free_irq(reply_q->vector, reply_q);
+   free_irq(pci_irq_vector(ioc->pdev, reply_q->msix_index),
+reply_q);
kfree(reply_q);
}
 }
@@ -1831,13 +1828,13 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
  * _base_request_irq - request irq
  * @ioc: per adapter object
  * @index: msix index into vector table
- * @vector: irq vector
  *
  * Inserting respective reply_queue into the list.
  */
 static int
-_base_request_irq(struct MPT3SAS_ADAPTER *ioc, u8 index, u32 vector)
+_base_request_irq(struct MPT3SAS_ADAPTER *ioc, u8 index)
 {
+   struct pci_dev *pdev = ioc->pdev;
struct adapter_reply_queue *reply_q;
int r;
 
@@ -1849,14 +1846,6 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
}
reply_q->ioc = ioc;
reply_q->msix_index = index;
-   reply_q->vector = vector;
-
-   if (smp_affinity_enable) {
-   if (!zalloc_cpumask_var(_q->affinity_hint, GFP_KERNEL)) {
-   kfree(reply_q);
-   return -ENOMEM;
-   }
-   }
 
atomic_set(_q->busy, 0);
if (ioc->msix_enable)
@@ -1865,12 +1854,11 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
else
snprintf(reply_q->name, MPT_NAME_LENGTH, "%s%d",
ioc->driver_name, ioc->id);
-   r = request_irq(vector, _base_interrupt, IRQF_SHARED, reply_q->name,
-   reply_q);
+   r = request_irq(pci_irq_vector(pdev, index), _base_interrupt,
+   IRQF_SHARED, reply_q->name, reply_q);
if (r) {
pr_err(MPT3SAS_FMT "unable to allocate interrupt %d!\n",
-   reply_q->name, vector);
-   free_cpumask_var(reply_q->affinity_hint);
+  reply_q->name, pci_irq_vector(pdev, index));
kfree(reply_q);
return -EBUSY;
}
@@ -1906,6 +1894,21 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
if (!nr_msix)
return;
 
+   if (smp_affinity_enable) {
+   list_for_each_entry(reply_q, >reply_queue_list, list) {
+   const cpumask_t *mask = pci_irq_get_affinity(ioc->pdev,
+   reply_q->msix_index);
+   if (!mask) {
+   pr_warn(MPT3SAS_FMT "no affinity for msi %x\n",
+   ioc->name, reply_q->msix_index);
+   continue;
+   }
+
+   for_each_cpu(cpu, mask)
+   ioc->cpu_msix_table[cpu] = reply_q->msix_index;
+   }
+   return;
+   }
cpu = cpumask_first(cpu_online_mask);
 
list_for_each_entry(reply_q, >reply_queue_list, list) {
@@ -1920,17 +1923,8 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
 
for (i = 0 ; i < group ; i++) {
ioc->cpu_msix_table[cpu] = index;
-   if (smp_affinity_enable)
-   cpumask_or(reply_q->affinity_hint,
-  reply_q->affinity_hint, get_cpu_mask(cpu));
cpu = cpumask_next(cpu, cpu_online_mask);
}
-   if (smp_affinity_enable)
-   if (irq_set_affinity_hint(reply_q->vector,
-  reply_q->affinity_hint))
-   dinitprintk(ioc, pr_info(MPT3SAS_FMT
-"Err setting affinity hint to irq vector %d\n",
-ioc->name, reply_q->vector));
 

Re: [PATCH] virtio_scsi: Reject commands when virtqueue is broken

2017-01-12 Thread Eric Farman



On 01/12/2017 08:45 AM, Fam Zheng wrote:

On Thu, 01/12 08:28, Eric Farman wrote:

-   if (virtscsi_kick_cmd(req_vq, cmd, req_size, sizeof(cmd->resp.cmd)) != 
0)
+   ret = virtscsi_kick_cmd(req_vq, cmd, req_size, sizeof(cmd->resp.cmd));
+   if (ret == -EIO) {
+   cmd->resp.cmd.response = VIRTIO_SCSI_S_BAD_TARGET;
+   virtscsi_complete_cmd(vscsi, cmd);


Is this safe? Calling virtscsi_complete_cmd requires vq_lock but we don't seem
to have it here.


Hrm...  Didn't notice that, and can't speak to its safety.  I had a bit of
an I/O workload going to other disks, and things seemed okay, but it was by
no means an exhaustive test.

I can't use virtscsi_vq_done, which normally handles that acquire/release.
It calls virtqueue_get_buf prior to calling virtscsi_complete_cmd, which
returns NULL because the virtqueue is broken.Thus, no call to
virtscsi_complete_cmd.

Can I mock up a wrapping routine that only handles the lock and complete_cmd
call, and ignore the virtqueue components that virtscsi_vq_done does?


That sounds good to me, taking the vq_lock here around the call to
virtscsi_complete_cmd, just like virtscsi_kick_cmd().


Okay, working up a v2 today.  Thanks!

Eric



Fam



Eric



Fam


+   } else if (ret != 0) {
return SCSI_MLQUEUE_HOST_BUSY;
+   }
return 0;
 }








--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio_scsi: Reject commands when virtqueue is broken

2017-01-12 Thread Fam Zheng
On Thu, 01/12 08:28, Eric Farman wrote:
> > > - if (virtscsi_kick_cmd(req_vq, cmd, req_size, sizeof(cmd->resp.cmd)) != 
> > > 0)
> > > + ret = virtscsi_kick_cmd(req_vq, cmd, req_size, sizeof(cmd->resp.cmd));
> > > + if (ret == -EIO) {
> > > + cmd->resp.cmd.response = VIRTIO_SCSI_S_BAD_TARGET;
> > > + virtscsi_complete_cmd(vscsi, cmd);
> > 
> > Is this safe? Calling virtscsi_complete_cmd requires vq_lock but we don't 
> > seem
> > to have it here.
> 
> Hrm...  Didn't notice that, and can't speak to its safety.  I had a bit of
> an I/O workload going to other disks, and things seemed okay, but it was by
> no means an exhaustive test.
> 
> I can't use virtscsi_vq_done, which normally handles that acquire/release.
> It calls virtqueue_get_buf prior to calling virtscsi_complete_cmd, which
> returns NULL because the virtqueue is broken.Thus, no call to
> virtscsi_complete_cmd.
> 
> Can I mock up a wrapping routine that only handles the lock and complete_cmd
> call, and ignore the virtqueue components that virtscsi_vq_done does?

That sounds good to me, taking the vq_lock here around the call to
virtscsi_complete_cmd, just like virtscsi_kick_cmd().

Fam

> 
> Eric
> 
> > 
> > Fam
> > 
> > > + } else if (ret != 0) {
> > >   return SCSI_MLQUEUE_HOST_BUSY;
> > > + }
> > >   return 0;
> > >  }
> > 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio_scsi: Reject commands when virtqueue is broken

2017-01-12 Thread Eric Farman



On 01/11/2017 10:11 PM, Fam Zheng wrote:

On Wed, 01/11 17:02, Eric Farman wrote:

In the case of a graceful set of detaches, where the virtio-scsi-ccw
disk is removed from the guest prior to the controller, the guest
behaves quite normally.  Specifically, the detach gets us into
sd_sync_cache to issue a Synchronize Cache(10) command, which
immediately fails (and is retried a couple of times) because the
device has been removed.  Later, the removal of the controller
sees two CRWs presented, but there's no further indication of the
removal from the guest viewpoint.

 [   17.217458] sd 0:0:0:0: [sda] Synchronizing SCSI cache
 [   17.219257] sd 0:0:0:0: [sda] Synchronize Cache(10) failed: Result: 
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
 [   21.449400] crw_info : CRW reports slct=0, oflw=0, chn=1, rsc=3, anc=0, 
erc=4, rsid=2
 [   21.449406] crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=3, anc=0, 
erc=4, rsid=0

However, on s390, the SCSI disks can be removed "by surprise" when
an entire controller (host) is removed and all associated disks
are removed via the loop in scsi_forget_host.  The same call to
sd_sync_cache is made, but because the controller has already
been removed, the Synchronize Cache(10) command is neither issued
(and then failed) nor rejected.

That the I/O isn't returned means the guest cannot have other devices
added nor removed, and other tasks (such as shutdown or reboot) issued
by the guest will not complete either.  The virtio ring has already
been marked as broken (via virtio_break_device in virtio_ccw_remove),
but we still attempt to queue the command only to have it remain there.
The calling sequence provides a bit of distinction for us:

  virtscsi_queuecommand()
   -> virtscsi_kick_cmd()
-> virtscsi_add_cmd()
 -> virtqueue_add_sgs()
  -> virtqueue_add()
 if success
   return 0
 elseif vq->broken or vring_mapping_error()
   return -EIO
 else
   return -ENOSPC

A return of ENOSPC is generally a temporary condition, so returning
"host busy" from virtscsi_queuecommand makes sense here, to have it
redriven in a moment or two.  But the EIO return code is more of a
permanent error and so it would be wise to return the I/O itself and
allow the calling thread to finish gracefully.  The result is these
four kernel messages in the guest (the fourth one does not occur
prior to this patch):

 [   22.921562] crw_info : CRW reports slct=0, oflw=0, chn=1, rsc=3, anc=0, 
erc=4, rsid=2
 [   22.921580] crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=3, anc=0, 
erc=4, rsid=0
 [   22.921978] sd 0:0:0:0: [sda] Synchronizing SCSI cache
 [   22.921993] sd 0:0:0:0: [sda] Synchronize Cache(10) failed: Result: 
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK

I opted to fill in the same response data that is returned from the
more graceful device detach, where the disk device is removed prior
to the controller device.

Signed-off-by: Eric Farman 
---
 drivers/scsi/virtio_scsi.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index ec91bd0..78d50ca 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -535,6 +535,7 @@ static int virtscsi_queuecommand(struct virtio_scsi *vscsi,
struct Scsi_Host *shost = virtio_scsi_host(vscsi->vdev);
struct virtio_scsi_cmd *cmd = scsi_cmd_priv(sc);
int req_size;
+   int ret;

BUG_ON(scsi_sg_count(sc) > shost->sg_tablesize);

@@ -562,8 +563,13 @@ static int virtscsi_queuecommand(struct virtio_scsi *vscsi,
req_size = sizeof(cmd->req.cmd);
}

-   if (virtscsi_kick_cmd(req_vq, cmd, req_size, sizeof(cmd->resp.cmd)) != 
0)
+   ret = virtscsi_kick_cmd(req_vq, cmd, req_size, sizeof(cmd->resp.cmd));
+   if (ret == -EIO) {
+   cmd->resp.cmd.response = VIRTIO_SCSI_S_BAD_TARGET;
+   virtscsi_complete_cmd(vscsi, cmd);


Is this safe? Calling virtscsi_complete_cmd requires vq_lock but we don't seem
to have it here.


Hrm...  Didn't notice that, and can't speak to its safety.  I had a bit 
of an I/O workload going to other disks, and things seemed okay, but it 
was by no means an exhaustive test.


I can't use virtscsi_vq_done, which normally handles that 
acquire/release.  It calls virtqueue_get_buf prior to calling 
virtscsi_complete_cmd, which returns NULL because the virtqueue is 
broken.Thus, no call to virtscsi_complete_cmd.


Can I mock up a wrapping routine that only handles the lock and 
complete_cmd call, and ignore the virtqueue components that 
virtscsi_vq_done does?  Or do I need to consider that somehow despite it 
all being broken?


Eric



Fam


+   } else if (ret != 0) {
return SCSI_MLQUEUE_HOST_BUSY;
+   }
return 0;
 }




--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to 

Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-12 Thread Johannes Thumshirn
On Thu, Jan 12, 2017 at 01:44:05PM +0200, Sagi Grimberg wrote:
[...]
> Its pretty basic:
> --
> [global]
> group_reporting
> cpus_allowed=0
> cpus_allowed_policy=split
> rw=randrw
> bs=4k
> numjobs=4
> iodepth=32
> runtime=60
> time_based
> loops=1
> ioengine=libaio
> direct=1
> invalidate=1
> randrepeat=1
> norandommap
> exitall
> 
> [job]
> --
> 
> **Note: when I ran multiple threads on more cpus the performance
> degradation phenomenon disappeared, but I tested on a VM with
> qemu emulation backed by null_blk so I figured I had some other
> bottleneck somewhere (that's why I asked for some more testing).

That could be because of the vmexits as every MMIO access in the guest
triggers a vmexit and if you poll with a low budget you do more MMIOs hence
you have more vmexits.

Did you do testing only in qemu or with real H/W as well?

> 
> Note that I ran randrw because I was backed with null_blk, testing
> with a real nvme device, you should either run randread or write, and
> if you do a write, you can't run it multi-threaded (well you can, but
> you'll get unpredictable performance...).

Noted, thanks.

Byte,
Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ata mapping mvsas driver question

2017-01-12 Thread Jelle de Jong

Anyone that know how to figure out the mapping?

Kind regards,

Jelle de Jong

On 08/12/16 13:57, Jelle de Jong wrote:

Hello everybody,

Can anyone help me out with my question?

Thank you in advance.

Kind regards,

Jelle de Jong

On 26/11/16 21:43, Jelle de Jong wrote:

Hello everybody,

I am trying to find the mapping for ata22.00. due to some repeated
issues:

# 3.16.36-1+deb8u2~bpo70+1 crazing down
http://paste.debian.net/hidden/54c12fae/

# smart status 16 disks WDC 1TB RE SATA
http://paste.debian.net/hidden/8bfe8d2f/

"[473315.445703] ata22.00: exception Emask 0x0 SAct 0x7fe7 SErr 0x0
action 0x6 frozen"

lsscsi --verbose doesn’t provide me with the ata details?
http://paste.debian.net/hidden/430599a0/

root@shirley:~# lsscsi --verbose
[0:0:0:0]diskATA  WDC WD1003FBYX-0 1V01  /dev/sdc
  dir: /sys/bus/scsi/devices/0:0:0:0
[/sys/devices/pci:00/:00:1c.0/:08:00.0/host0/port-0:0/expander-0:0/port-0:0:0/end_device-0:0:0/target0:0:0/0:0:0:0]



I can not find any useful unique_id with the mvsas driver...
find /sys/devices/pci\:00/\:00\:1c.0/\:08\:00.0/ -iname
unique_id

grep ata22 /var/log/messages indicates it is really mapped somewhere...

root@shirley:~# readlink /sys/block/sdm
../devices/pci:00/:00:1c.0/:08:00.0/host0/port-0:0/expander-0:0/port-0:0:10/end_device-0:0:10/target0:0:10/0:0:10:0/block/sdm



Best I came up with with no results:

root@shirley:~# find
/sys/devices/pci\:00/\:00\:1c.0/\:08\:00.0/host0/port-0\:0/expander-0\:0/port-0\:0\:10/end_device-0\:0\:10/target0\:0\:10/0\:0\:10\:0/

-type f  -exec grep -i ata {} \;

Linux shirley 3.16.0-0.bpo.4-amd64 #1 SMP Debian
3.16.36-1+deb8u2~bpo70+1 (2016-10-19) x86_64 GNU/Linux
root@shirley:~# lsmod mvsas
Usage: lsmod

root@shirley:~# modinfo mvsas
filename:
/lib/modules/3.16.0-0.bpo.4-amd64/kernel/drivers/scsi/mvsas/mvsas.ko
license:GPL
version:0.8.16

Thank you in advance!

Kind regards,

Jelle de Jong
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2] mpt3sas: switch to pci_alloc_irq_vectors

2017-01-12 Thread Sreekanth Reddy
On Tue, Dec 20, 2016 at 6:57 PM, Hannes Reinecke  wrote:
> Cleanup the MSI-X handling allowing us to use
> the PCI-layer provided vector allocation.
>
> Signed-off-by: Hannes Reinecke 
>
> diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c 
> b/drivers/scsi/mpt3sas/mpt3sas_base.c
> index a1a5ceb..75149f0 100644
> --- a/drivers/scsi/mpt3sas/mpt3sas_base.c
> +++ b/drivers/scsi/mpt3sas/mpt3sas_base.c
> @@ -1129,7 +1129,7 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
> /* TMs are on msix_index == 0 */
> if (reply_q->msix_index == 0)
> continue;
> -   synchronize_irq(reply_q->vector);
> +   synchronize_irq(pci_irq_vector(ioc->pdev, 
> reply_q->msix_index));
> }
>  }
>
> @@ -1818,11 +1818,8 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
>
> list_for_each_entry_safe(reply_q, next, >reply_queue_list, list) 
> {
> list_del(_q->list);
> -   if (smp_affinity_enable) {
> -   irq_set_affinity_hint(reply_q->vector, NULL);
> -   free_cpumask_var(reply_q->affinity_hint);
> -   }
> -   free_irq(reply_q->vector, reply_q);
> +   free_irq(pci_irq_vector(ioc->pdev, reply_q->msix_index),
> +reply_q);
> kfree(reply_q);
> }
>  }
> @@ -1831,13 +1828,13 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
>   * _base_request_irq - request irq
>   * @ioc: per adapter object
>   * @index: msix index into vector table
> - * @vector: irq vector
>   *
>   * Inserting respective reply_queue into the list.
>   */
>  static int
> -_base_request_irq(struct MPT3SAS_ADAPTER *ioc, u8 index, u32 vector)
> +_base_request_irq(struct MPT3SAS_ADAPTER *ioc, u8 index)
>  {
> +   struct pci_dev *pdev = ioc->pdev;
> struct adapter_reply_queue *reply_q;
> int r;
>
> @@ -1849,14 +1846,6 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
> }
> reply_q->ioc = ioc;
> reply_q->msix_index = index;
> -   reply_q->vector = vector;
> -
> -   if (smp_affinity_enable) {
> -   if (!zalloc_cpumask_var(_q->affinity_hint, GFP_KERNEL)) 
> {
> -   kfree(reply_q);
> -   return -ENOMEM;
> -   }
> -   }
>
> atomic_set(_q->busy, 0);
> if (ioc->msix_enable)
> @@ -1865,12 +1854,11 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
> else
> snprintf(reply_q->name, MPT_NAME_LENGTH, "%s%d",
> ioc->driver_name, ioc->id);
> -   r = request_irq(vector, _base_interrupt, IRQF_SHARED, reply_q->name,
> -   reply_q);
> +   r = request_irq(pci_irq_vector(pdev, index), _base_interrupt,
> +   IRQF_SHARED, reply_q->name, reply_q);
> if (r) {
> pr_err(MPT3SAS_FMT "unable to allocate interrupt %d!\n",
> -   reply_q->name, vector);
> -   free_cpumask_var(reply_q->affinity_hint);
> +  reply_q->name, pci_irq_vector(pdev, index));
> kfree(reply_q);
> return -EBUSY;
> }
> @@ -1892,7 +1880,7 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
>  static void
>  _base_assign_reply_queues(struct MPT3SAS_ADAPTER *ioc)
>  {
> -   unsigned int cpu, nr_cpus, nr_msix, index = 0;
> +   unsigned int cpu, nr_msix;
> struct adapter_reply_queue *reply_q;
>
> if (!_base_is_controller_msix_enabled(ioc))
> @@ -1900,38 +1888,21 @@ static int mpt3sas_remove_dead_ioc_func(void *arg)
>
> memset(ioc->cpu_msix_table, 0, ioc->cpu_msix_table_sz);
>
> -   nr_cpus = num_online_cpus();
> nr_msix = ioc->reply_queue_count = min(ioc->reply_queue_count,
>ioc->facts.MaxMSIxVectors);
> if (!nr_msix)
> return;
> -
> -   cpu = cpumask_first(cpu_online_mask);
> -
> list_for_each_entry(reply_q, >reply_queue_list, list) {
> -
> -   unsigned int i, group = nr_cpus / nr_msix;
> -
> -   if (cpu >= nr_cpus)
> -   break;
> -
> -   if (index < nr_cpus % nr_msix)
> -   group++;
> -
> -   for (i = 0 ; i < group ; i++) {
> -   ioc->cpu_msix_table[cpu] = index;
> -   if (smp_affinity_enable)
> -   cpumask_or(reply_q->affinity_hint,
> -  reply_q->affinity_hint, get_cpu_mask(cpu));
> -   cpu = cpumask_next(cpu, cpu_online_mask);
> +   const cpumask_t *mask = pci_irq_get_affinity(ioc->pdev,
> + reply_q->msix_index);
> +   if (!mask) {
> +   pr_warn(MPT3SAS_FMT "no affinity for msi %x\n",
> +   

Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-12 Thread Sagi Grimberg



I agree with Jens that we'll need some analysis if we want the
discussion to be affective, and I can spend some time this if I
can find volunteers with high-end nvme devices (I only have access
to client nvme devices.


I have a P3700 but somehow burned the FW. Let me see if I can bring it back to
live.

I also have converted AHCI to the irq_poll interface and will run some tests.
I do also have some hpsa devices on which I could run tests once the driver is
adopted.

But can we come to a common testing methology not to compare apples with
oranges? Sagi do you still have the fio job file from your last tests laying
somewhere and if yes could you share it?


Its pretty basic:
--
[global]
group_reporting
cpus_allowed=0
cpus_allowed_policy=split
rw=randrw
bs=4k
numjobs=4
iodepth=32
runtime=60
time_based
loops=1
ioengine=libaio
direct=1
invalidate=1
randrepeat=1
norandommap
exitall

[job]
--

**Note: when I ran multiple threads on more cpus the performance
degradation phenomenon disappeared, but I tested on a VM with
qemu emulation backed by null_blk so I figured I had some other
bottleneck somewhere (that's why I asked for some more testing).

Note that I ran randrw because I was backed with null_blk, testing
with a real nvme device, you should either run randread or write, and
if you do a write, you can't run it multi-threaded (well you can, but
you'll get unpredictable performance...).
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Lsf-pc] [LFS/MM TOPIC][LFS/MM ATTEND]: - Storage Stack and Driver Testing methodology.

2017-01-12 Thread Sagi Grimberg



Hi Folks,

I would like to propose a general discussion on Storage stack and device driver 
testing.


I think its very useful and needed.


Purpose:-
-
The main objective of this discussion is to address the need for
a Unified Test Automation Framework which can be used by different subsystems
in the kernel in order to improve the overall development and stability
of the storage stack.

For Example:-
From my previous experience, I've worked on the NVMe driver testing last year 
and we
have developed simple unit test framework
 (https://github.com/linux-nvme/nvme-cli/tree/master/tests).
In current implementation Upstream NVMe Driver supports following subsystems:-
1. PCI Host.
2. RDMA Target.
3. Fiber Channel Target (in progress).
Today due to lack of centralized automated test framework NVMe Driver testing is
scattered and performed using the combination of various utilities like 
nvme-cli/tests,
nvmet-cli, shell scripts (git://git.infradead.org/nvme-fabrics.git 
nvmf-selftests) etc.

In order to improve overall driver stability with various subsystems, it will 
be beneficial
to have a Unified Test Automation Framework (UTAF) which will centralize overall
testing.

This topic will allow developers from various subsystem engage in the 
discussion about
how to collaborate efficiently instead of having discussions on lengthy email 
threads.


While a unified test framework for all sounds great, I suspect that the
difference might be too large. So I think that for this framework to be
maintainable, it needs to be carefully designed such that we don't have
too much code churn.

For example we should start by classifying tests and then see where
sharing is feasible:

1. basic management - I think not a lot can be shared
2. spec compliance - again, not much sharing here
3. data-verification - probably everything can be shared
4. basic performance - probably a lot can be shared
5. vectored-io - probably everything can be shared
6. error handling - I can think of some sharing that can be used.

This repository can also store some useful tracing scripts (ebpf and
friends) that are useful for performance analysis.

So I think that for this to happen, we can start with the shared
test under block/, then migrate proto specific tests into
scsi/, nvme/, and then add transport specific tests so
we can have something like:

├── block
├── lib
├── nvme
│   ├── fabrics
│   │   ├── loop
│   │   └── rdma
│   └── pci
└── scsi
├── fc
└── iscsi

Thoughts?
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-12 Thread Johannes Thumshirn
On Thu, Jan 12, 2017 at 10:23:47AM +0200, Sagi Grimberg wrote:
> 
> >>>Hi all,
> >>>
> >>>I'd like to attend LSF/MM and would like to discuss polling for block 
> >>>drivers.
> >>>
> >>>Currently there is blk-iopoll but it is neither as widely used as NAPI in 
> >>>the
> >>>networking field and accoring to Sagi's findings in [1] performance with
> >>>polling is not on par with IRQ usage.
> >>>
> >>>On LSF/MM I'd like to whether it is desirable to have NAPI like polling in
> >>>more block drivers and how to overcome the currently seen performance 
> >>>issues.
> >>
> >>It would be an interesting topic to discuss, as it is a shame that 
> >>blk-iopoll
> >>isn't used more widely.
> >
> >Forgot to mention - it should only be a topic, if experimentation has
> >been done and results gathered to pin point what the issues are, so we
> >have something concrete to discus. I'm not at all interested in a hand
> >wavy discussion on the topic.
> >
> 
> Hey all,
> 
> Indeed I attempted to convert nvme to use irq-poll (let's use its
> new name) but experienced some unexplained performance degradations.
> 
> Keith reported a 700ns degradation for QD=1 with his Xpoint devices,
> this sort of degradation are acceptable I guess because we do schedule
> a soft-irq before consuming the completion, but I noticed ~10% IOPs
> degradation fr QD=32 which is not acceptable.
> 
> I agree with Jens that we'll need some analysis if we want the
> discussion to be affective, and I can spend some time this if I
> can find volunteers with high-end nvme devices (I only have access
> to client nvme devices.

I have a P3700 but somehow burned the FW. Let me see if I can bring it back to
live. 

I also have converted AHCI to the irq_poll interface and will run some tests.
I do also have some hpsa devices on which I could run tests once the driver is
adopted.

But can we come to a common testing methology not to compare apples with
oranges? Sagi do you still have the fio job file from your last tests laying
somewhere and if yes could you share it?

Byte,
Johannes

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-12 Thread sagi grimberg
 A typical Ethernet network adapter delays the generation of an
 interrupt
 after it has received a packet. A typical block device or HBA does not
 delay
 the generation of an interrupt that reports an I/O completion.
>>>
>>> NVMe allows for configurable interrupt coalescing, as do a few modern
>>> SCSI HBAs.
>>
>> Essentially every modern SCSI HBA does interrupt coalescing; otherwise
>> the queuing interface won't work efficiently.
>
> Hello Hannes,
>
> The first e-mail in this e-mail thread referred to measurements against a
> block device for which interrupt coalescing was not enabled. I think that
> the measurements have to be repeated against a block device for which
> interrupt coalescing is enabled.

Hey Bart,

I see how interrupt coalescing can help, but even without it, I think it
should be better.

Moreover, I don't think that strict moderation is something that can
work. The only way interrupt moderation can be effective, is if it's
adaptive and adjusts itself to the workload. Note that this feature
is on by default in most of the modern Ethernet devices (adaptive-rx).

IMHO, irq-poll vs. interrupt polling should be compared without relying
on the underlying device capabilities.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-12 Thread Sagi Grimberg



I'd like to attend LSF/MM and would like to discuss polling for block
drivers.

Currently there is blk-iopoll but it is neither as widely used as NAPI in
the networking field and accoring to Sagi's findings in [1] performance
with polling is not on par with IRQ usage.

On LSF/MM I'd like to whether it is desirable to have NAPI like polling in
more block drivers and how to overcome the currently seen performance
issues.

[1] http://lists.infradead.org/pipermail/linux-nvme/2016-October/006975.ht
ml


A typical Ethernet network adapter delays the generation of an interrupt
after it has received a packet. A typical block device or HBA does not delay
the generation of an interrupt that reports an I/O completion. I think that
is why polling is more effective for network adapters than for block
devices. I'm not sure whether it is possible to achieve benefits similar to
NAPI for block devices without implementing interrupt coalescing in the
block device firmware. Note: for block device implementations that use the
RDMA API, the RDMA API supports interrupt coalescing (see also
ib_modify_cq()).


Hey Bart,

I don't agree that interrupt coalescing is the reason why irq-poll is
not suitable for nvme or storage devices.

First, when the nvme device fires an interrupt, the driver consumes
the completion(s) from the interrupt (usually there will be some more
completions waiting in the cq by the time the host start processing it).
With irq-poll, we disable further interrupts and schedule soft-irq for
processing, which if at all, improve the completions per interrupt
utilization (because it takes slightly longer before processing the cq).

Moreover, irq-poll is budgeting the completion queue processing which is
important for a couple of reasons.

1. it prevents hard-irq context abuse like we do today. if other cpu
   cores are pounding with more submissions on the same queue, we might
   get into a hard-lockup (which I've seen happening).

2. irq-poll maintains fairness between devices by correctly budgeting
   the processing of different completions queues that share the same
   affinity. This can become crucial when working with multiple nvme
   devices, each has multiple io queues that share the same IRQ
   assignment.

3. It reduces (or at least should reduce) the overall number of
   interrupts in the system because we only enable interrupts again
   when the completion queue is completely processed.

So overall, I think it's very useful for nvme and other modern HBAs,
but unfortunately, other than solving (1), I wasn't able to see
performance improvement but rather a slight regression, but I can't
explain where its coming from...
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] preview - block layer help to detect sequential IO

2017-01-12 Thread Kashyap Desai
> -Original Message-
> From: kbuild test robot [mailto:l...@intel.com]
> Sent: Thursday, January 12, 2017 1:18 AM
> To: Kashyap Desai
> Cc: kbuild-...@01.org; linux-scsi@vger.kernel.org;
linux-bl...@vger.kernel.org;
> ax...@kernel.dk; martin.peter...@oracle.com; j...@linux.vnet.ibm.com;
> sumit.sax...@broadcom.com; Kashyap desai
> Subject: Re: [PATCH] preview - block layer help to detect sequential IO
>
> Hi Kashyap,
>
> [auto build test ERROR on v4.9-rc8]
> [cannot apply to block/for-next linus/master linux/master next-20170111]
[if
> your patch is applied to the wrong git tree, please drop us a note to
help
> improve the system]
>
> url:
https://github.com/0day-ci/linux/commits/Kashyap-Desai/preview-block-
> layer-help-to-detect-sequential-IO/20170112-024228
> config: i386-randconfig-a0-201702 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386
>
> All errors (new ones prefixed by >>):
>
>block/blk-core.c: In function 'add_sequential':
> >> block/blk-core.c:1899:16: error: 'struct task_struct' has no member
named
> 'sequential_io_avg'
>  blk_ewma_add(t->sequential_io_avg,


This error fixable. For now, I just wanted to get high level review of the
idea.
Below defines are required to use sequential_io and sequential_io_avg. I
have enable BCACHE for my testing in .config.

#if defined(CONFIG_BCACHE) || defined(CONFIG_BCACHE_MODULE)
unsigned intsequential_io;
unsigned intsequential_io_avg;
#endif

Looking for high level review comment.

` Kashyap


>^
>block/blk-core.c:1893:10: note: in definition of macro 'blk_ewma_add'
> (ewma) *= (weight) - 1;
\
>  ^~~~
> >> block/blk-core.c:1899:16: error: 'struct task_struct' has no member
named
> 'sequential_io_avg'
>  blk_ewma_add(t->sequential_io_avg,
>^
>block/blk-core.c:1894:10: note: in definition of macro 'blk_ewma_add'
> (ewma) += (val) << factor;
\
>  ^~~~
> >> block/blk-core.c:1900:5: error: 'struct task_struct' has no member
named
> 'sequential_io'
>t->sequential_io, 8, 0);
> ^
>block/blk-core.c:1894:20: note: in definition of macro 'blk_ewma_add'
> (ewma) += (val) << factor;
\
>^~~
> >> block/blk-core.c:1899:16: error: 'struct task_struct' has no member
named
> 'sequential_io_avg'
>  blk_ewma_add(t->sequential_io_avg,
>^
>block/blk-core.c:1895:10: note: in definition of macro 'blk_ewma_add'
> (ewma) /= (weight);
\
>  ^~~~
> >> block/blk-core.c:1899:16: error: 'struct task_struct' has no member
named
> 'sequential_io_avg'
>  blk_ewma_add(t->sequential_io_avg,
>^
>block/blk-core.c:1896:10: note: in definition of macro 'blk_ewma_add'
> (ewma) >> factor;
\
>  ^~~~
>block/blk-core.c:1902:3: error: 'struct task_struct' has no member
named
> 'sequential_io'
>  t->sequential_io = 0;
>   ^~
>block/blk-core.c: In function 'generic_make_request_checks':
>block/blk-core.c:2012:7: error: 'struct task_struct' has no member
named
> 'sequential_io'
>   task->sequential_io  = i->sequential;
>   ^~
>In file included from block/blk-core.c:14:0:
>block/blk-core.c:2020:21: error: 'struct task_struct' has no member
named
> 'sequential_io'
>   sectors = max(task->sequential_io,
> ^
>include/linux/kernel.h:747:2: note: in definition of macro '__max'
>  t1 max1 = (x); \
>  ^~
>block/blk-core.c:2020:13: note: in expansion of macro 'max'
>   sectors = max(task->sequential_io,
> ^~~
>block/blk-core.c:2020:21: error: 'struct task_struct' has no member
named
> 'sequential_io'
>   sectors = max(task->sequential_io,
> ^
>include/linux/kernel.h:747:13: note: in definition of macro '__max'
>  t1 max1 = (x); \
> ^
>block/blk-core.c:2020:13: note: in expansion of macro 'max'
>   sectors = max(task->sequential_io,
> ^~~
>block/blk-core.c:2021:14: error: 'struct task_struct' has no member
named
> 'sequential_io_avg'
>  task->sequential_io_avg) >> 9;
>  ^
>include/linux/kernel.h:748:2: note: in definition of macro '__max'
>  t2 max2 = (y); \
>  ^~
>block/blk-core.c:2020:13: note: in expansion of macro 'max'
>   sectors = max(task->sequential_i

Re: [LSF/MM TOPIC][LSF/MM ATTEND] multipath redesign

2017-01-12 Thread Hannes Reinecke

On 01/11/2017 11:23 PM, Mike Snitzer wrote:

On Wed, Jan 11 2017 at  4:44am -0500,
Hannes Reinecke  wrote:


Hi all,

I'd like to attend LSF/MM this year, and would like to discuss a
redesign of the multipath handling.

With recent kernels we've got quite some functionality required for
multipathing already implemented, making some design decisions of the
original multipath-tools implementation quite pointless.

I'm working on a proof-of-concept implementation which just uses a
simple configfs interface and doesn't require a daemon altogether.

At LSF/MM I'd like to discuss how to move forward here, and whether we'd
like to stay with the current device-mapper integration or move away
from that towards a stand-alone implementation.


I'd really like open exchange of the problems you're having with the
current multipath-tools and DM multipath _before LSF_.  Last LSF only
scratched the surface on people having disdain for the complexity that is
the multipath-tools userspace.  But considering how much of the
multipath-tools you've written I find it fairly comical that you're the
person advocating switching away from it.


Yeah, I know.

But I've stared long and hard at the code, and found some issues really 
hard to overcome. Even more so as most things it does are really pointless.


multipathd _insists_ on redoing the _entire_ device layout for basically 
any operation (except for path checking).
As the data structures allow only for a single setup it uses a lock per 
multipath device to protect against concurrent changes.
When lots of uevents are to be processed this lock is heavily contended, 
leading to a slow-down of uevent processing.

(cf the patchseries from Tang Junhui and my earlier pathset for
lock pushdown)

I've tried to move that lock down even further with distinct locks for 
device paths and multipath devices, but ultimately failed as it would 
amount to essentially a rewrite of the core engine.



But if less userspace involvement is needed then fix userspace.  Fail to
see how configfs is any different than the established DM ioctl interface.

As I just said in another email DM multipath could benefit from
factoring out the SCSI-specific bits so that they are nicely optimized
away if using new transports (e.g. NVMEoF).

Could be lessons can be learned from your approach but I'd prefer we
provably exhaust the utility of the current DM multipath kernel
implementation.  DM multipath is one of the most actively maintained and
updated DM targets (aside from thinp and cache).  As you know DM
multipath has grown blk-mq support which yielded serious performance
improvement.  You also noted (in an earlier email) that I reintroduced
bio-based DM multipath.  On a data path level we have all possible block
core interfaces plumbed.  And yes, they all involve cloning due to the
underlying Device Mapper core.  Open to any ideas on optimization.  If
DM is imposing some inherent performance limitation then please report
it accordingly.


Ah. And I thought you disliked request-based multipathing ...

It's not _actually_ the DM interface which I'm objecting to, it's more 
the user-space implementation.
The daemon is build around some design decisions which are simply not 
applicable anymore:
- we now _do_ have reliable device identifications, so the the 'path_id' 
functionality is pointless.
- The 'alua' device handler also provides you with reliable priority 
information, so it should be possible to do away with the 'prio' 
setting, too.
- And for (most) SCSI devices the 'state' setting provides a reliable 
indicator if the device is useable.


Hence I've implemented a notifier chain (hooked onto 'struct gendisk') 
which provides events for path up/path down etc.

With that it's possible to automatically fail and reinstate paths.
However, what's missing is an automatic pathgroup switch once all paths 
in a group are down.
In the current implementation the device-mapper target doesn't have any 
inkling about path priorities; it just sees path groups as such.
As it stands should reasonably trivial to switch to the next available 
pathgroup, but fallback will become ... interesting.
So we would need to update the interface here to allow for path group 
priorities and also for transmitting the fallback information.


Nothing insurmountable, agreed.
But once we do this most of the current functionality of the 
multipath-tools daemon will become obsolete.


Plus I wasn't quite sure about the direction device-mapper itself will 
be going, so I decided to implement a stand-alone version as a testbed.
I'm not trying to push that at all costs; I'm perfectly happy with 
updating device-mapper.

As long as no-one insists we're having to use the bio-based interface ...

Cheers,

Hannes
--
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. 

Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-12 Thread Sagi Grimberg



Hi all,

I'd like to attend LSF/MM and would like to discuss polling for block drivers.

Currently there is blk-iopoll but it is neither as widely used as NAPI in the
networking field and accoring to Sagi's findings in [1] performance with
polling is not on par with IRQ usage.

On LSF/MM I'd like to whether it is desirable to have NAPI like polling in
more block drivers and how to overcome the currently seen performance issues.


It would be an interesting topic to discuss, as it is a shame that blk-iopoll
isn't used more widely.


Forgot to mention - it should only be a topic, if experimentation has
been done and results gathered to pin point what the issues are, so we
have something concrete to discus. I'm not at all interested in a hand
wavy discussion on the topic.



Hey all,

Indeed I attempted to convert nvme to use irq-poll (let's use its
new name) but experienced some unexplained performance degradations.

Keith reported a 700ns degradation for QD=1 with his Xpoint devices,
this sort of degradation are acceptable I guess because we do schedule
a soft-irq before consuming the completion, but I noticed ~10% IOPs
degradation fr QD=32 which is not acceptable.

I agree with Jens that we'll need some analysis if we want the
discussion to be affective, and I can spend some time this if I
can find volunteers with high-end nvme devices (I only have access
to client nvme devices.

I can add debugfs statistics on average the number of completions I
consume per intererupt, I can also trace the interrupt and the soft-irq
start,end. Any other interesting stats I can add?

I also tried a hybrid mode where the first 4 completions were handled
in the interrupt and the rest in soft-irq but that didn't make much
of a difference.

Any other thoughts?
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: 512e ZBC host-managed disks

2017-01-12 Thread Christoph Hellwig
On Thu, Jan 12, 2017 at 05:13:52PM +0900, Damien Le Moal wrote:
> (3) Any other idea ?

Do nothing and ignore the problem.  This whole idea so braindead that
the person coming up with the T10 language should be shot.  Either a device
has 511 logical sectors or 4k but not this crazy mix.

And make sure no one ships such a piece of crap because we are hell
not going to support it.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RFC: 512e ZBC host-managed disks

2017-01-12 Thread Damien Le Moal

Regular block devices are always accessible in units of logical block
sizes, regardless of the actual physical block size that the device has.
For hard disks, the common cases are:

512n: 512 B logical and physical blocks
512e: 512B logical blocks and 4096B physical blocks
4Kn: 4096B logical and physical blocks

and the sd.c in the kernel checks requests 512B "sectors" position and
size alignment against the disk declared logical block size. All is fine
with this, nothing new.

However, for host-managed zoned block devices (ZBC), the 512e case
breaks this model: the standard allows for 512B logical block reads,
*but* writes MUST be aligned on 4KB boundaries within sequential zones
(still using the 512B logical block size addressing). This is a problem
for users of the disk, e.g. an FS, who may wrongly believe that writing
512B units is possible (and so that it can use 512B FS block size).
Host-aware devices do not have this restriction. Nor does the
restriction apply to writes in conventional zones of host-managed devices.

Summary: for HM 512e block devices, reads are 512e compliant, but writes
in sequential zones are 4Kn compliant.

I would like an opinion on if we should do something about this. I see
the following possible options:

(1) Do nothing and let the disk user deal with the write alignment
problem. It already has to do so anyway as writes must be sequential.
But this would force in-kernel users to go and look at the device
physical block size, which is not something usually done by layers above
the block layer (FS, device mappers etc).

(2) For 512e host-managed devices, always report to the block layer
(device queue) a larger logical block size of 4096B to allow for disk
users to seamlessly adjust to the disk type without having to deal with
the physical sector size. I do not think that this would actually not
require changing the scsi_disk->sector_size field to that incorrect
value so that command addressing does not break. But I wonder if this
may not break a lot of things because of the difference introduced.

(3) Any other idea ?

Best regards.

-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Research Group,
Western Digital Corporation
damien.lem...@wdc.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.wdc.com, www.hgst.com
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 05/15] dm: remove incomple BLOCK_PC support

2017-01-12 Thread Christoph Hellwig
On Wed, Jan 11, 2017 at 08:09:37PM -0500, Mike Snitzer wrote:
> I'm not following your reasoning.
> 
> dm_blk_ioctl calls __blkdev_driver_ioctl and will call scsi_cmd_ioctl
> (sd_ioctl -> scsi_cmd_blk_ioctl -> scsi_cmd_ioctl) if DM's underlying
> block device is a scsi device.

Yes, it it does.  But scsi_cmd_ioctl as called from sd_ioctl will
operate entirely on the SCSI request_queue - dm-mpath will never see
the BLOCK_PC request generated by it.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: split scsi passthrough fields out of struct request

2017-01-12 Thread Christoph Hellwig
On Wed, Jan 11, 2017 at 05:41:42PM -0500, Mike Snitzer wrote:
> I removed blk-mq on request_fn paths support because it was one of the
> permutations that I felt least useful/stable (see commit c5248f79f3 "dm:
> remove support for stacking dm-mq on .request_fn device(s)")
> 
> As for all of the different IO paths.  I've always liked the idea of
> blk-mq ruling the world.  With Jens' blk-mq IO scheduling advances maybe
> we're closer!

That removed blk-mq on top of request_fn code would do the right thing
for this series if it entirely replaced the old request_fn code in dm-mpath.
(as would the new bio code for that matter)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html