Re: [PATCH v2] sd: Consider max_xfer_blocks if opt_xfer_blocks is unusable

2017-03-30 Thread Fam Zheng
On Thu, 03/30 11:30, Martin K. Petersen wrote:
> Fam Zheng  writes:
> 
> >>rw_max = min_not_zero(logical_to_sectors(sdp, dev_max),
> >>   BLK_DEF_MAX_SECTORS);
> >
> > Yes, it is better. Is it okay to make the change when you apply?
> 
> Sure. Applied to 4.11/scsi-fixes.

Cool. Thanks!

Fam


Re: [PATCH 1/6] virtio: wrap find_vqs

2017-03-30 Thread Jason Wang



On 2017年03月30日 22:32, Michael S. Tsirkin wrote:

On Thu, Mar 30, 2017 at 02:00:08PM +0800, Jason Wang wrote:


On 2017年03月30日 04:48, Michael S. Tsirkin wrote:

We are going to add more parameters to find_vqs, let's wrap the call so
we don't need to tweak all drivers every time.

Signed-off-by: Michael S. Tsirkin
---

A quick glance and it looks ok, but what the benefit of this series, is it
required by other changes?

Thanks

Yes - to avoid touching all devices when doing the rest of
the patchset.


Maybe I'm not clear. I mean the benefit of this series not this single 
patch. I guess it may be used by you proposal that avoid reset when set 
XDP? If yes, do we really want to drop some packets after XDP is set?


Thanks


[PATCH] tcmu: Skip Data-Out blocks before gathering Data-In buffer for BIDI case

2017-03-30 Thread lixiubo
From: Xiubo Li 

For the bidirectional case, the Data-Out buffer blocks will always at
the head of the tcmu_cmd's bitmap, and before gathering the Data-In
buffer, first of all it should skip the Data-Out ones, or the device
supporting BIDI commands won't work.

Fixed: 26418649eead ("target/user: Introduce data_bitmap, replace
data_length/data_head/data_tail")
Reported-by: Ilias Tsitsimpis 
Signed-off-by: Xiubo Li 
---
 drivers/target/target_core_user.c | 48 +++
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/drivers/target/target_core_user.c 
b/drivers/target/target_core_user.c
index ede815c..63e3a1b 100644
--- a/drivers/target/target_core_user.c
+++ b/drivers/target/target_core_user.c
@@ -311,24 +311,50 @@ static void free_data_area(struct tcmu_dev *udev, struct 
tcmu_cmd *cmd)
   DATA_BLOCK_BITS);
 }
 
-static void gather_data_area(struct tcmu_dev *udev, unsigned long *cmd_bitmap,
-   struct scatterlist *data_sg, unsigned int data_nents)
+static void gather_data_area(struct tcmu_dev *udev, struct tcmu_cmd *cmd,
+bool bidi)
 {
+   struct se_cmd *se_cmd = cmd->se_cmd;
int i, block;
int block_remaining = 0;
void *from, *to;
size_t copy_bytes, from_offset;
-   struct scatterlist *sg;
+   struct scatterlist *sg, *data_sg;
+   unsigned int data_nents;
+   DECLARE_BITMAP(bitmap, DATA_BLOCK_BITS);
+
+   bitmap_copy(bitmap, cmd->data_bitmap, DATA_BLOCK_BITS);
+
+   if (!bidi) {
+   data_sg = se_cmd->t_data_sg;
+   data_nents = se_cmd->t_data_nents;
+   } else {
+   uint32_t count;
+
+   /*
+* For bidi case, the first count blocks are for Data-Out
+* buffer blocks, and before gathering the Data-In buffer
+* the Data-Out buffer blocks should be discarded.
+*/
+   count = DIV_ROUND_UP(se_cmd->data_length, DATA_BLOCK_SIZE);
+   while (count--) {
+   block = find_first_bit(bitmap, DATA_BLOCK_BITS);
+   clear_bit(block, bitmap);
+   }
+
+   data_sg = se_cmd->t_bidi_data_sg;
+   data_nents = se_cmd->t_bidi_data_nents;
+   }
 
for_each_sg(data_sg, sg, data_nents, i) {
int sg_remaining = sg->length;
to = kmap_atomic(sg_page(sg)) + sg->offset;
while (sg_remaining > 0) {
if (block_remaining == 0) {
-   block = find_first_bit(cmd_bitmap,
+   block = find_first_bit(bitmap,
DATA_BLOCK_BITS);
block_remaining = DATA_BLOCK_SIZE;
-   clear_bit(block, cmd_bitmap);
+   clear_bit(block, bitmap);
}
copy_bytes = min_t(size_t, sg_remaining,
block_remaining);
@@ -610,19 +636,11 @@ static void tcmu_handle_completion(struct tcmu_cmd *cmd, 
struct tcmu_cmd_entry *
   se_cmd->scsi_sense_length);
free_data_area(udev, cmd);
} else if (se_cmd->se_cmd_flags & SCF_BIDI) {
-   DECLARE_BITMAP(bitmap, DATA_BLOCK_BITS);
-
/* Get Data-In buffer before clean up */
-   bitmap_copy(bitmap, cmd->data_bitmap, DATA_BLOCK_BITS);
-   gather_data_area(udev, bitmap,
-   se_cmd->t_bidi_data_sg, se_cmd->t_bidi_data_nents);
+   gather_data_area(udev, cmd, true);
free_data_area(udev, cmd);
} else if (se_cmd->data_direction == DMA_FROM_DEVICE) {
-   DECLARE_BITMAP(bitmap, DATA_BLOCK_BITS);
-
-   bitmap_copy(bitmap, cmd->data_bitmap, DATA_BLOCK_BITS);
-   gather_data_area(udev, bitmap,
-   se_cmd->t_data_sg, se_cmd->t_data_nents);
+   gather_data_area(udev, cmd, false);
free_data_area(udev, cmd);
} else if (se_cmd->data_direction == DMA_TO_DEVICE) {
free_data_area(udev, cmd);
-- 
1.8.3.1





Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES

2017-03-30 Thread Martin K. Petersen
Mike Snitzer  writes:

Mike,

> But while discussing this effort with Jeff Moyer he asked: shouldn't the
> zeroed blocks be provisioned?  This is a fairly embarassing question not
> to be able to answer in the moment.  So I clearly need to learn what the
> overall intent of WRITE_ZEROES actually is.

The intent is to guarantee all future reads to these blocks will return
zeroes. Whether to allocate or deallocate or do a combination to achieve
that goal is up to the device.

> If it is meant as a replacement for WRITE_SAME (as hch switched dm-io
> over from WRITE_SAME with a payload of 0 to WRITE_ZEROES) and for the
> backing mechanism for blkdev_issue_zeroout() then I guess I have my
> answer.

Yes. I was hoping MD would use WRITE SAME to initialize data and parity
drives. That's why I went with SAME nomenclature rather than ZEROES
(which had just appeared in NVMe at that time).

Christoph's renaming is mostly to emphasize the purpose and the
semantics. People keep getting confused because both REQ_DISCARD and
REQ_WRITE_SAME use the WRITE SAME SCSI command. But the two uses are
completely orthogonal and governed by different heuristics in sd.c.

> Unless DM thinp can guarantee that the discarded blocks will
> always return zeroes (by zeroing before all partial block writes) my
> discard based dm-thinp implementation of WRITE_ZEROES is a complete
> throw-away (unless block zeroing is enabled.. which it never is because
> performance sucks with it).  So if an upper-level of the IO stack
> (e.g. ext4) were to assume that a block will _definitely_ have zeroes
> then DM thinp would fall short.

You don't have a way to mark those blocks as being full of zeroes
without actually writing them?

Note that the fallback to a zeroout command is to do a regular write. So
if DM doesn't zero the blocks, the block layer is going to it.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 12/23] sd: handle REQ_UNMAP

2017-03-30 Thread Martin K. Petersen
"h...@lst.de"  writes:

> If you manually change the provisioning mode to WS10 on a device that
> must use WRITE SAME (16) to be able to address all blocks you're already
> screwed right now, and with this patch you can screw yourself through
> the WRITE_ZEROES path in addition to the DISCARD path.

Oh, I see. We only had the LBA sanity check in place for write same, not
for discard.

-- 
Martin K. Petersen  Oracle Linux Engineering


commit 1b6ac5e3f "fnic: Using rport->dd_data to check rport online instead of rport_lookup."

2017-03-30 Thread Joe Jin
Hi Satish,

My customer hit below error when issue LIP to fnic controller:

[94702.898408] sd 2:0:4:1: [sdx] tag#1 FAILED Result: hostbyte=DID_NO_CONNECT 
driverbyte=DRIVER_OK
[94702.898416] sd 2:0:4:1: [sdx] tag#1 CDB: Write(10) 2a 00 04 56 c0 08 00 00 
08 00
[94702.898420] blk_update_request: I/O error, dev sdx, sector 72794120
[94702.898455] sd 2:0:4:10: [sdy] tag#2 FAILED Result: hostbyte=DID_NO_CONNECT 
driverbyte=DRIVER_OK
[94702.898459] sd 2:0:4:10: [sdy] tag#2 CDB: Read(10) 28 00 00 00 08 a8 00 00 
02 00
[94702.898462] blk_update_request: I/O error, dev sdy, sector 2216

Looked at all changes of fnic I found it caused by commit 1b6ac5e3f "fnic: 
Using rport->dd_data to check rport online instead of rport_lookup.", a 
question is why changed state check from RPORT_ST_DELETE to RPORT_ST_READY
here?

-   if (!rdata || (rdata->rp_state == RPORT_ST_DELETE)) { 
-   FNIC_SCSI_DBG(KERN_DEBUG, fnic->lport->host,
-   "returning IO as rport is removed\n");
-   atomic64_inc(_stats->misc_stats.rport_not_ready);
-   sc->result = DID_NO_CONNECT;
-   done(sc);
-   return 0;
+   if (rport) {
+   struct fc_rport_libfc_priv *rp = rport->dd_data;
+
+   if (!rp || rp->rp_state != RPORT_ST_READY) {
+   FNIC_SCSI_DBG(KERN_DEBUG, fnic->lport->host,
+   "returning DID_NO_CONNECT for IO as rport is 
removed\n");


Thanks,
Joe


Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES

2017-03-30 Thread Mike Snitzer
On Thu, Mar 30 2017 at 11:20am -0400,
Martin K. Petersen  wrote:

> Mike Snitzer  writes:
> 
> > I can work on this now.  Only question I have is: should DM thinp take
> > care to zero any misaligned head and tail?  (I assume so but with all
> > the back and forth between Bart, Paolo and Martin I figured I'd ask
> > explicitly).
> 
> Yep, let's make sure our semantics match the hardware ditto.
> 
>  - So write zeroes should behave deterministically and explicitly handle
>any blocks that can't be cleared via deprovisioning.
> 
>  - And discard can work at the discard granularity in a
>non-deterministic fashion.

I got pretty far along with implementing the DM thinp support for
WRITE_ZEROES in terms of thinp's DISCARD support (more of an
implementation detail.. or so I thought).

But while discussing this effort with Jeff Moyer he asked: shouldn't the
zeroed blocks be provisioned?  This is a fairly embarassing question not
to be able to answer in the moment.  So I clearly need to learn what the
overall intent of WRITE_ZEROES actually is.

If it is meant as a replacement for WRITE_SAME (as hch switched dm-io
over from WRITE_SAME with a payload of 0 to WRITE_ZEROES) and for the
backing mechanism for blkdev_issue_zeroout() then I guess I have my
answer.  Unless DM thinp can guarantee that the discarded blocks will
always return zeroes (by zeroing before all partial block writes) my
discard based dm-thinp implementation of WRITE_ZEROES is a complete
throw-away (unless block zeroing is enabled.. which it never is because
performance sucks with it).  So if an upper-level of the IO stack
(e.g. ext4) were to assume that a block will _definitely_ have zeroes
then DM thinp would fall short.

This is all to say: I don't see a quick way forward on implementing
performant WRITE_ZEROES support for DM thinp.


[RFC 1/8] Introduce Peer-to-Peer memory (p2pmem) device

2017-03-30 Thread Logan Gunthorpe
A p2pmem device is simply a PCI card with a BAR space that points to
regular memory. This may be an independent PCI card or part of another
completely unrelated device (like an IB card or a NVMe card). The
p2pmem device is designed such that other drivers may register p2pmem
memory for use by the system.

p2pmem devices then provide a kernel interface so that other subsystems
can allocate chunks of this memory as necessary to facilitate transfers
between two PCI peers. Depending on hardware, this may reduce the
bandwidth of the transfer but could significantly reduce presure
on system memory. This may be desirable in many cases: for example a
system could be designed with a small CPU connected to a PCI switch by a
small number of lanes which would maximize the number of lanes available
to connect to NVME devices.

Seeing using p2p memory can often have negative effects, especially
with older PCI root complexes. The code is designed to only utilize the
p2pmem device if all the devices involved in a transfer are behind the
same PCI switch. Other cases may still work or be desirable for some
end users but it was decided this would be the best course of action
to prevent users enabling it and wondering why their performance
dropped.

Signed-off-by: Logan Gunthorpe 
Signed-off-by: Stephen Bates 
Signed-off-by: Steve Wise 
---
 drivers/memory/Kconfig  |   5 +
 drivers/memory/Makefile |   2 +
 drivers/memory/p2pmem.c | 403 
 include/linux/p2pmem.h  | 103 +
 4 files changed, 513 insertions(+)
 create mode 100644 drivers/memory/p2pmem.c
 create mode 100644 include/linux/p2pmem.h

diff --git a/drivers/memory/Kconfig b/drivers/memory/Kconfig
index ec80e35..4a02cd3 100644
--- a/drivers/memory/Kconfig
+++ b/drivers/memory/Kconfig
@@ -146,3 +146,8 @@ source "drivers/memory/samsung/Kconfig"
 source "drivers/memory/tegra/Kconfig"
 
 endif
+
+config P2PMEM
+   bool "Peer 2 Peer Memory Device Support"
+   help
+ This driver is for peer 2 peer memory device managers.
diff --git a/drivers/memory/Makefile b/drivers/memory/Makefile
index e88097fb..260bfe9 100644
--- a/drivers/memory/Makefile
+++ b/drivers/memory/Makefile
@@ -21,3 +21,5 @@ obj-$(CONFIG_DA8XX_DDRCTL)+= da8xx-ddrctl.o
 
 obj-$(CONFIG_SAMSUNG_MC)   += samsung/
 obj-$(CONFIG_TEGRA_MC) += tegra/
+
+obj-$(CONFIG_P2PMEM)+= p2pmem.o
diff --git a/drivers/memory/p2pmem.c b/drivers/memory/p2pmem.c
new file mode 100644
index 000..c4ea311
--- /dev/null
+++ b/drivers/memory/p2pmem.c
@@ -0,0 +1,403 @@
+/*
+ * Peer 2 Peer Memory Device
+ * Copyright (c) 2016, Microsemi Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+MODULE_DESCRIPTION("Peer 2 Peer Memory Device");
+MODULE_VERSION("0.1");
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Microsemi Corporation");
+
+static struct class *p2pmem_class;
+static DEFINE_IDA(p2pmem_ida);
+
+static struct p2pmem_dev *to_p2pmem(struct device *dev)
+{
+   return container_of(dev, struct p2pmem_dev, dev);
+}
+
+static void p2pmem_percpu_release(struct percpu_ref *ref)
+{
+   struct p2pmem_dev *p = container_of(ref, struct p2pmem_dev, ref);
+
+   complete_all(>cmp);
+}
+
+static void p2pmem_percpu_exit(void *data)
+{
+   struct percpu_ref *ref = data;
+
+   percpu_ref_exit(ref);
+}
+
+static void p2pmem_percpu_kill(void *data)
+{
+   struct percpu_ref *ref = data;
+   struct p2pmem_dev *p = container_of(ref, struct p2pmem_dev, ref);
+
+   if (percpu_ref_is_dying(ref))
+   return;
+
+   percpu_ref_kill(ref);
+   wait_for_completion(>cmp);
+}
+
+static void p2pmem_release(struct device *dev)
+{
+   struct p2pmem_dev *p = to_p2pmem(dev);
+
+   if (p->pool)
+   gen_pool_destroy(p->pool);
+
+   kfree(p);
+}
+
+/**
+ * p2pmem_create() - create a new p2pmem device
+ * @parent: the parent device to create it under
+ *
+ * Return value is a pointer to the new device or an ERR_PTR
+ * on failure.
+ */
+struct p2pmem_dev *p2pmem_create(struct device *parent)
+{
+   struct p2pmem_dev *p;
+   int nid = dev_to_node(parent);
+   int rc;
+
+   p = kzalloc_node(sizeof(*p), GFP_KERNEL, nid);
+   if (!p)
+   return ERR_PTR(-ENOMEM);
+
+   init_completion(>cmp);
+   device_initialize(>dev);
+   p->dev.class = p2pmem_class;
+   p->dev.parent = parent;
+   p->dev.release = p2pmem_release;
+

[RFC 5/8] scatterlist: Modify SG copy functions to support io memory.

2017-03-30 Thread Logan Gunthorpe
Now that we are using p2pmem SG buffers we occasionally have to copy
to and from this memory. For this, we add an iomem flag to
sg_copy_buffer for copying with iomemcpy. We also add the sg_iocopy_
variants to use this more easily.

Signed-off-by: Logan Gunthorpe 
Signed-off-by: Stephen Bates 
Signed-off-by: Steve Wise 
---
 drivers/scsi/scsi_debug.c   |  7 ++---
 include/linux/scatterlist.h |  7 -
 lib/scatterlist.c   | 64 ++---
 3 files changed, 65 insertions(+), 13 deletions(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 17249c3..70c0d9f 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -1309,7 +1309,7 @@ static int resp_inquiry(struct scsi_cmnd *scp, struct 
sdebug_dev_info *devip)
int lu_id_num, port_group_id, target_dev_id, len;
char lu_id_str[6];
int host_no = devip->sdbg_host->shost->host_no;
-   
+
port_group_id = (((host_no + 1) & 0x7f) << 8) +
(devip->channel & 0x7f);
if (sdebug_vpd_use_hostno == 0)
@@ -2381,14 +2381,15 @@ static int do_device_access(struct scsi_cmnd *scmd, u64 
lba, u32 num,
 
ret = sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
   fake_storep + (block * sdebug_sector_size),
-  (num - rest) * sdebug_sector_size, 0, do_write);
+  (num - rest) * sdebug_sector_size, 0, do_write, false);
if (ret != (num - rest) * sdebug_sector_size)
return ret;
 
if (rest) {
ret += sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
fake_storep, rest * sdebug_sector_size,
-   (num - rest) * sdebug_sector_size, do_write);
+   (num - rest) * sdebug_sector_size, do_write,
+ false);
}
 
return ret;
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index cb3c8fe..030b92b 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -267,7 +267,7 @@ int sg_alloc_table_from_pages(struct sg_table *sgt,
gfp_t gfp_mask);
 
 size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,
- size_t buflen, off_t skip, bool to_buffer);
+ size_t buflen, off_t skip, bool to_buffer, bool iomem);
 
 size_t sg_copy_from_buffer(struct scatterlist *sgl, unsigned int nents,
   const void *buf, size_t buflen);
@@ -279,6 +279,11 @@ size_t sg_pcopy_from_buffer(struct scatterlist *sgl, 
unsigned int nents,
 size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
  void *buf, size_t buflen, off_t skip);
 
+size_t sg_iocopy_from_buffer(struct scatterlist *sgl, unsigned int nents,
+const void *buf, size_t buflen);
+size_t sg_iocopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
+  void *buf, size_t buflen);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index c6cf822..22abd94 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -647,7 +647,7 @@ EXPORT_SYMBOL(sg_miter_stop);
  *
  **/
 size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,
- size_t buflen, off_t skip, bool to_buffer)
+ size_t buflen, off_t skip, bool to_buffer, bool iomem)
 {
unsigned int offset = 0;
struct sg_mapping_iter miter;
@@ -668,10 +668,17 @@ size_t sg_copy_buffer(struct scatterlist *sgl, unsigned 
int nents, void *buf,
 
len = min(miter.length, buflen - offset);
 
-   if (to_buffer)
-   memcpy(buf + offset, miter.addr, len);
-   else
-   memcpy(miter.addr, buf + offset, len);
+   if (iomem) {
+   if (to_buffer)
+   memcpy_fromio(buf + offset,  miter.addr, len);
+   else
+   memcpy_toio(miter.addr, buf + offset, len);
+   } else {
+   if (to_buffer)
+   memcpy(buf + offset, miter.addr, len);
+   else
+   memcpy(miter.addr, buf + offset, len);
+   }
 
offset += len;
}
@@ -695,7 +702,8 @@ EXPORT_SYMBOL(sg_copy_buffer);
 size_t sg_copy_from_buffer(struct scatterlist *sgl, unsigned int nents,
   const void *buf, size_t buflen)
 {
-   return sg_copy_buffer(sgl, nents, (void *)buf, buflen, 0, false);
+   return sg_copy_buffer(sgl, nents, 

[RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem

2017-03-30 Thread Logan Gunthorpe
p2pmem will always be iomem so if we ever access it, we should be using
the correct methods to read and write to it.

Signed-off-by: Logan Gunthorpe 
Signed-off-by: Stephen Bates 
Signed-off-by: Steve Wise 
---
 drivers/nvme/target/core.c| 18 --
 drivers/nvme/target/fabrics-cmd.c | 28 +++-
 drivers/nvme/target/nvmet.h   |  1 +
 drivers/nvme/target/rdma.c| 13 ++---
 4 files changed, 38 insertions(+), 22 deletions(-)

diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 798653b..a1524d5 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -45,15 +45,29 @@ static struct nvmet_subsys *nvmet_find_get_subsys(struct 
nvmet_port *port,
 u16 nvmet_copy_to_sgl(struct nvmet_req *req, off_t off, const void *buf,
size_t len)
 {
-   if (sg_pcopy_from_buffer(req->sg, req->sg_cnt, buf, len, off) != len)
+   bool iomem = req->p2pmem;
+   size_t ret;
+
+   ret = sg_copy_buffer(req->sg, req->sg_cnt, (void *)buf, len, off,
+false, iomem);
+
+   if (ret != len)
return NVME_SC_SGL_INVALID_DATA | NVME_SC_DNR;
+
return 0;
 }
 
 u16 nvmet_copy_from_sgl(struct nvmet_req *req, off_t off, void *buf, size_t 
len)
 {
-   if (sg_pcopy_to_buffer(req->sg, req->sg_cnt, buf, len, off) != len)
+   bool iomem = req->p2pmem;
+   size_t ret;
+
+   ret = sg_copy_buffer(req->sg, req->sg_cnt, buf, len, off, true,
+iomem);
+
+   if (ret != len)
return NVME_SC_SGL_INVALID_DATA | NVME_SC_DNR;
+
return 0;
 }
 
diff --git a/drivers/nvme/target/fabrics-cmd.c 
b/drivers/nvme/target/fabrics-cmd.c
index 8bd022af..9d966f0 100644
--- a/drivers/nvme/target/fabrics-cmd.c
+++ b/drivers/nvme/target/fabrics-cmd.c
@@ -118,11 +118,13 @@ static u16 nvmet_install_queue(struct nvmet_ctrl *ctrl, 
struct nvmet_req *req)
 static void nvmet_execute_admin_connect(struct nvmet_req *req)
 {
struct nvmf_connect_command *c = >cmd->connect;
-   struct nvmf_connect_data *d;
+   struct nvmf_connect_data d;
struct nvmet_ctrl *ctrl = NULL;
u16 status = 0;
 
-   d = kmap(sg_page(req->sg)) + req->sg->offset;
+   status = nvmet_copy_from_sgl(req, 0, , sizeof(d));
+   if (status)
+   goto out;
 
/* zero out initial completion result, assign values as needed */
req->rsp->result.u32 = 0;
@@ -134,16 +136,16 @@ static void nvmet_execute_admin_connect(struct nvmet_req 
*req)
goto out;
}
 
-   if (unlikely(d->cntlid != cpu_to_le16(0x))) {
+   if (unlikely(d.cntlid != cpu_to_le16(0x))) {
pr_warn("connect attempt for invalid controller ID %#x\n",
-   d->cntlid);
+   d.cntlid);
status = NVME_SC_CONNECT_INVALID_PARAM | NVME_SC_DNR;
req->rsp->result.u32 = IPO_IATTR_CONNECT_DATA(cntlid);
goto out;
}
 
-   status = nvmet_alloc_ctrl(d->subsysnqn, d->hostnqn, req,
-   le32_to_cpu(c->kato), );
+   status = nvmet_alloc_ctrl(d.subsysnqn, d.hostnqn, req,
+ le32_to_cpu(c->kato), );
if (status)
goto out;
 
@@ -158,19 +160,20 @@ static void nvmet_execute_admin_connect(struct nvmet_req 
*req)
req->rsp->result.u16 = cpu_to_le16(ctrl->cntlid);
 
 out:
-   kunmap(sg_page(req->sg));
nvmet_req_complete(req, status);
 }
 
 static void nvmet_execute_io_connect(struct nvmet_req *req)
 {
struct nvmf_connect_command *c = >cmd->connect;
-   struct nvmf_connect_data *d;
+   struct nvmf_connect_data d;
struct nvmet_ctrl *ctrl = NULL;
u16 qid = le16_to_cpu(c->qid);
u16 status = 0;
 
-   d = kmap(sg_page(req->sg)) + req->sg->offset;
+   status = nvmet_copy_from_sgl(req, 0, , sizeof(d));
+   if (status)
+   goto out;
 
/* zero out initial completion result, assign values as needed */
req->rsp->result.u32 = 0;
@@ -182,9 +185,9 @@ static void nvmet_execute_io_connect(struct nvmet_req *req)
goto out;
}
 
-   status = nvmet_ctrl_find_get(d->subsysnqn, d->hostnqn,
-   le16_to_cpu(d->cntlid),
-   req, );
+   status = nvmet_ctrl_find_get(d.subsysnqn, d.hostnqn,
+le16_to_cpu(d.cntlid),
+req, );
if (status)
goto out;
 
@@ -205,7 +208,6 @@ static void nvmet_execute_io_connect(struct nvmet_req *req)
pr_info("adding queue %d to ctrl %d.\n", qid, ctrl->cntlid);
 
 out:
-   kunmap(sg_page(req->sg));
nvmet_req_complete(req, status);
return;
 
diff --git a/drivers/nvme/target/nvmet.h 

[RFC 8/8] p2pmem: Added char device user interface

2017-03-30 Thread Logan Gunthorpe
This creates a userspace interface to use p2pmemory. A user can use
mmap on the p2pmem char device to get buffers from the corresponding
device. This allows a user to use p2p memory with existing
interfaces like RDMA and O_DIRECT.

This patch is a bit more controversial because people don't want to
expose these interfaces to userspace without more consideration.
However, this patch is _very_ useful for expirementing with p2p memory.

For example, with this patch, you can test with commands like:

ib_write_bw -R --mmap=/dev/p2pmem0 -D 30

or use an fio script like:

[rdma-server]
rw=read
mem=mmapshared:/dev/p2pmem0
ioengine=rdma
port=14242
bs=64k
size=10G
iodepth=2

which would test the bandwidth of RDMA to/from the specified p2p memory.

Signed-off-by: Logan Gunthorpe 
Signed-off-by: Stephen Bates 
Signed-off-by: Steve Wise 
---
 drivers/memory/p2pmem.c | 184 +++-
 include/linux/p2pmem.h  |   4 ++
 2 files changed, 186 insertions(+), 2 deletions(-)

diff --git a/drivers/memory/p2pmem.c b/drivers/memory/p2pmem.c
index 499d42c..129c49c 100644
--- a/drivers/memory/p2pmem.c
+++ b/drivers/memory/p2pmem.c
@@ -19,14 +19,20 @@
 #include 
 #include 
 #include 
+#include 
 
 MODULE_DESCRIPTION("Peer 2 Peer Memory Device");
 MODULE_VERSION("0.1");
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Microsemi Corporation");
 
+static int max_devices = 16;
+module_param(max_devices, int, 0444);
+MODULE_PARM_DESC(max_devices, "Maximum number of char devices");
+
 static struct class *p2pmem_class;
 static DEFINE_IDA(p2pmem_ida);
+static dev_t p2pmem_devt;
 
 static struct dentry *p2pmem_debugfs_root;
 
@@ -67,6 +73,144 @@ static struct p2pmem_dev *to_p2pmem(struct device *dev)
return container_of(dev, struct p2pmem_dev, dev);
 }
 
+struct p2pmem_vma {
+   struct p2pmem_dev *p2pmem_dev;
+   atomic_t mmap_count;
+   size_t nr_pages;
+
+   /* Protects the used_pages array */
+   struct mutex mutex;
+   struct page *used_pages[];
+};
+
+static void p2pmem_vma_open(struct vm_area_struct *vma)
+{
+   struct p2pmem_vma *pv = vma->vm_private_data;
+
+   atomic_inc(>mmap_count);
+}
+
+static void p2pmem_vma_free_pages(struct vm_area_struct *vma)
+{
+   int i;
+   struct p2pmem_vma *pv = vma->vm_private_data;
+
+   mutex_lock(>mutex);
+
+   for (i = 0; i < pv->nr_pages; i++) {
+   if (pv->used_pages[i]) {
+   p2pmem_free_page(pv->p2pmem_dev, pv->used_pages[i]);
+   pv->used_pages[i] = NULL;
+   }
+   }
+
+   mutex_unlock(>mutex);
+}
+
+static void p2pmem_vma_close(struct vm_area_struct *vma)
+{
+   struct p2pmem_vma *pv = vma->vm_private_data;
+
+   if (!atomic_dec_and_test(>mmap_count))
+   return;
+
+   p2pmem_vma_free_pages(vma);
+
+   dev_dbg(>p2pmem_dev->dev, "vma close");
+   kfree(pv);
+}
+
+static int p2pmem_vma_fault(struct vm_fault *vmf)
+{
+   struct p2pmem_vma *pv = vmf->vma->vm_private_data;
+   unsigned int pg_idx;
+   struct page *pg;
+   pfn_t pfn;
+   int rc;
+
+   if (!pv->p2pmem_dev->alive)
+   return VM_FAULT_SIGBUS;
+
+   pg_idx = (vmf->address - vmf->vma->vm_start) / PAGE_SIZE;
+
+   mutex_lock(>mutex);
+
+   if (pv->used_pages[pg_idx])
+   pg = pv->used_pages[pg_idx];
+   else
+   pg = p2pmem_alloc_page(pv->p2pmem_dev);
+
+   if (!pg)
+   return VM_FAULT_OOM;
+
+   pv->used_pages[pg_idx] = pg;
+
+   pfn = phys_to_pfn_t(page_to_phys(pg), PFN_DEV | PFN_MAP);
+   rc = vm_insert_mixed(vmf->vma, vmf->address, pfn);
+
+   mutex_unlock(>mutex);
+
+   if (rc == -ENOMEM)
+   return VM_FAULT_OOM;
+   if (rc < 0 && rc != -EBUSY)
+   return VM_FAULT_SIGBUS;
+
+   return VM_FAULT_NOPAGE;
+}
+
+const struct vm_operations_struct p2pmem_vmops = {
+   .open = p2pmem_vma_open,
+   .close = p2pmem_vma_close,
+   .fault = p2pmem_vma_fault,
+};
+
+static int p2pmem_open(struct inode *inode, struct file *filp)
+{
+   struct p2pmem_dev *p;
+
+   p = container_of(inode->i_cdev, struct p2pmem_dev, cdev);
+   filp->private_data = p;
+   p->inode = inode;
+
+   return 0;
+}
+
+static int p2pmem_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+   struct p2pmem_dev *p = filp->private_data;
+   struct p2pmem_vma *pv;
+   size_t nr_pages = (vma->vm_end - vma->vm_start) / PAGE_SIZE;
+
+   if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
+   dev_warn(>dev, "mmap failed: can't create private 
mapping\n");
+   return -EINVAL;
+   }
+
+   dev_dbg(>dev, "Allocating mmap with %zd pages.\n", nr_pages);
+
+   pv = kzalloc(sizeof(*pv) + sizeof(pv->used_pages[0]) * nr_pages,
+GFP_KERNEL);
+   if (!pv)
+   return -ENOMEM;

[RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-03-30 Thread Logan Gunthorpe
Hello,

As discussed at LSF/MM we'd like to present our work to enable
copy offload support in NVMe fabrics RDMA targets. We'd appreciate
some review and feedback from the community on our direction.
This series is not intended to go upstream at this point.

The concept here is to use memory that's exposed on a PCI BAR as
data buffers in the NVME target code such that data can be transferred
from an RDMA NIC to the special memory and then directly to an NVMe
device avoiding system memory entirely. The upside of this is better
QoS for applications running on the CPU utilizing memory and lower
PCI bandwidth required to the CPU (such that systems could be designed
with fewer lanes connected to the CPU). However, presently, the trade-off
is currently a reduction in overall throughput. (Largely due to hardware
issues that would certainly improve in the future).

Due to these trade-offs we've designed the system to only enable using
the PCI memory in cases where the NIC, NVMe devices and memory are all
behind the same PCI switch. This will mean many setups that could likely
work well will not be supported so that we can be more confident it
will work and not place any responsibility on the user to understand
their topology. (We've chosen to go this route based on feedback we
received at LSF).

In order to enable this functionality we introduce a new p2pmem device
which can be instantiated by PCI drivers. The device will register some
PCI memory as ZONE_DEVICE and provide an genalloc based allocator for
users of these devices to get buffers. We give an example of enabling
p2p memory with the cxgb4 driver, however currently these devices have
some hardware issues that prevent their use so we will likely be
dropping this patch in the future. Ideally, we'd want to enable this
functionality with NVME CMB buffers, however we don't have any hardware
with this feature at this time.

In nvmet-rdma, we attempt to get an appropriate p2pmem device at
queue creation time and if a suitable one is found we will use it for
all the (non-inlined) memory in the queue. An 'allow_p2pmem' configfs
attribute is also created which is required to be set before any p2pmem
is attempted.

This patchset also includes a more controversial patch which provides an
interface for userspace to obtain p2pmem buffers through an mmap call on
a cdev. This enables userspace to fairly easily use p2pmem with RDMA and
O_DIRECT interfaces. However, the user would be entirely responsible for
knowing what their doing and inspecting sysfs to understand the pci
topology and only using it in sane situations.

Thanks,

Logan


Logan Gunthorpe (6):
  Introduce Peer-to-Peer memory (p2pmem) device
  nvmet: Use p2pmem in nvme target
  scatterlist: Modify SG copy functions to support io memory.
  nvmet: Be careful about using iomem accesses when dealing with p2pmem
  p2pmem: Support device removal
  p2pmem: Added char device user interface

Steve Wise (2):
  cxgb4: setup pcie memory window 4 and create p2pmem region
  p2pmem: Add debugfs "stats" file

 drivers/memory/Kconfig  |   5 +
 drivers/memory/Makefile |   2 +
 drivers/memory/p2pmem.c | 697 
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h  |   3 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  97 +++-
 drivers/net/ethernet/chelsio/cxgb4/t4_regs.h|   5 +
 drivers/nvme/target/configfs.c  |  31 ++
 drivers/nvme/target/core.c  |  18 +-
 drivers/nvme/target/fabrics-cmd.c   |  28 +-
 drivers/nvme/target/nvmet.h |   2 +
 drivers/nvme/target/rdma.c  | 183 +--
 drivers/scsi/scsi_debug.c   |   7 +-
 include/linux/p2pmem.h  | 120 
 include/linux/scatterlist.h |   7 +-
 lib/scatterlist.c   |  64 ++-
 15 files changed, 1189 insertions(+), 80 deletions(-)
 create mode 100644 drivers/memory/p2pmem.c
 create mode 100644 include/linux/p2pmem.h

--
2.1.4


[RFC 4/8] p2pmem: Add debugfs "stats" file

2017-03-30 Thread Logan Gunthorpe
From: Steve Wise 

For each p2pmem instance, add a "stats" file to show
the gen_pool statistics.

Signed-off-by: Steve Wise 
Signed-off-by: Logan Gunthorpe 
Signed-off-by: Stephen Bates 
---
 drivers/memory/p2pmem.c | 49 +
 include/linux/p2pmem.h  |  2 ++
 2 files changed, 51 insertions(+)

diff --git a/drivers/memory/p2pmem.c b/drivers/memory/p2pmem.c
index c4ea311..71741c2 100644
--- a/drivers/memory/p2pmem.c
+++ b/drivers/memory/p2pmem.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 MODULE_DESCRIPTION("Peer 2 Peer Memory Device");
 MODULE_VERSION("0.1");
@@ -27,6 +28,40 @@ MODULE_AUTHOR("Microsemi Corporation");
 static struct class *p2pmem_class;
 static DEFINE_IDA(p2pmem_ida);
 
+static struct dentry *p2pmem_debugfs_root;
+
+static int stats_show(struct seq_file *seq, void *v)
+{
+   struct p2pmem_dev *p = seq->private;
+
+   if (p->pool) {
+   seq_printf(seq, "total size: %lu\n", gen_pool_size(p->pool));
+   seq_printf(seq, "available:  %lu\n", gen_pool_avail(p->pool));
+   }
+   return 0;
+}
+
+static int stats_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, stats_show, inode->i_private);
+}
+
+static const struct file_operations stats_debugfs_fops = {
+   .owner   = THIS_MODULE,
+   .open= stats_open,
+   .release = single_release,
+   .read= seq_read,
+   .llseek  = seq_lseek,
+};
+
+static void setup_debugfs(struct p2pmem_dev *p)
+{
+   struct dentry *de;
+
+   de = debugfs_create_file("stats", 0400, p->debugfs_root,
+(void *)p, _debugfs_fops);
+}
+
 static struct p2pmem_dev *to_p2pmem(struct device *dev)
 {
return container_of(dev, struct p2pmem_dev, dev);
@@ -62,6 +97,8 @@ static void p2pmem_release(struct device *dev)
 {
struct p2pmem_dev *p = to_p2pmem(dev);
 
+   debugfs_remove_recursive(p->debugfs_root);
+
if (p->pool)
gen_pool_destroy(p->pool);
 
@@ -114,6 +151,13 @@ struct p2pmem_dev *p2pmem_create(struct device *parent)
if (rc)
goto err_id;
 
+   if (p2pmem_debugfs_root) {
+   p->debugfs_root = debugfs_create_dir(dev_name(>dev),
+p2pmem_debugfs_root);
+   if (p->debugfs_root)
+   setup_debugfs(p);
+   }
+
rc = device_add(>dev);
if (rc)
goto err_id;
@@ -390,12 +434,17 @@ static int __init p2pmem_init(void)
if (IS_ERR(p2pmem_class))
return PTR_ERR(p2pmem_class);
 
+   p2pmem_debugfs_root = debugfs_create_dir("p2pmem", NULL);
+   if (!p2pmem_debugfs_root)
+   pr_info("could not create debugfs entry, continuing\n");
+
return 0;
 }
 module_init(p2pmem_init);
 
 static void __exit p2pmem_exit(void)
 {
+   debugfs_remove_recursive(p2pmem_debugfs_root);
class_destroy(p2pmem_class);
 
pr_info(KBUILD_MODNAME ": unloaded.\n");
diff --git a/include/linux/p2pmem.h b/include/linux/p2pmem.h
index 71dc1e1..4cd6f35 100644
--- a/include/linux/p2pmem.h
+++ b/include/linux/p2pmem.h
@@ -26,6 +26,8 @@ struct p2pmem_dev {
struct percpu_ref ref;
struct completion cmp;
struct gen_pool *pool;
+
+   struct dentry *debugfs_root;
 };
 
 #ifdef CONFIG_P2PMEM
-- 
2.1.4



[RFC 2/8] cxgb4: setup pcie memory window 4 and create p2pmem region

2017-03-30 Thread Logan Gunthorpe
From: Steve Wise 

Some cxgb4 cards expose memory as part of BAR4. This patch registers
this memory as a p2pmem device.

Signed-off-by: Steve Wise 
Signed-off-by: Logan Gunthorpe 
Signed-off-by: Stephen Bates 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h  |  3 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 97 -
 drivers/net/ethernet/chelsio/cxgb4/t4_regs.h|  5 ++
 3 files changed, 102 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 163543b..e92443b 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -48,6 +48,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "t4_chip_type.h"
 #include "cxgb4_uld.h"
@@ -859,6 +860,8 @@ struct adapter {
 
/* TC u32 offload */
struct cxgb4_tc_u32_table *tc_u32;
+
+   struct p2pmem_dev *p2pmem;
 };
 
 /* Support for "sched-class" command to allow a TX Scheduling Class to be
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index afb0967..a33bcd1 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -172,6 +172,11 @@ module_param(select_queue, int, 0644);
 MODULE_PARM_DESC(select_queue,
 "Select between kernel provided method of selecting or driver 
method of selecting TX queue. Default is kernel method.");
 
+static bool use_p2pmem;
+module_param(use_p2pmem, bool, 0644);
+MODULE_PARM_DESC(use_p2pmem,
+"Enable registering a p2pmem device with bar space (if 
available)");
+
 static struct dentry *cxgb4_debugfs_root;
 
 LIST_HEAD(adapter_list);
@@ -2835,6 +2840,54 @@ static void setup_memwin_rdma(struct adapter *adap)
}
 }
 
+static void setup_memwin_p2pmem(struct adapter *adap)
+{
+   unsigned int mem_base = t4_read_reg(adap, CIM_EXTMEM2_BASE_ADDR_A);
+   unsigned int mem_size = t4_read_reg(adap, CIM_EXTMEM2_ADDR_SIZE_A);
+
+   if (!use_p2pmem)
+   return;
+
+   if (mem_base != 0 && mem_size != 0) {
+   unsigned int sz_kb, pcieofst;
+
+   sz_kb = roundup_pow_of_two(mem_size) >> 10;
+
+   /*
+* The start offset must be aligned to the window size.
+* Also, BAR4 has MSIX vectors using the first 8KB.
+* Further, the min allowed p2pmem region size is 1MB,
+* so set the start offset to the memory size and we're aligned
+* as well as past the 8KB vector table.
+*/
+   pcieofst = sz_kb << 10;
+
+   dev_info(adap->pdev_dev,
+"p2pmem base 0x%x, size %uB, ilog2(sk_kb) 0x%x, "
+"pcieofst 0x%X\n", mem_base, mem_size, ilog2(sz_kb),
+pcieofst);
+
+   /* Write the window offset and size */
+   t4_write_reg(adap,
+   PCIE_MEM_ACCESS_REG(PCIE_MEM_ACCESS_BASE_WIN_A,
+   MEMWIN_RSVD4),
+   pcieofst | BIR_V(2) | WINDOW_V(ilog2(sz_kb)));
+
+   /* Write the adapter memory base/start */
+   t4_write_reg(adap,
+   PCIE_MEM_ACCESS_REG(PCIE_MEM_ACCESS_OFFSET_A,
+   MEMWIN_RSVD4),
+   MEMOFST_V((mem_base >> MEMOFST_S)) | PFNUM_V(adap->pf));
+
+   /* Read it back to flush it */
+   t4_read_reg(adap,
+   PCIE_MEM_ACCESS_REG(PCIE_MEM_ACCESS_OFFSET_A,
+   MEMWIN_RSVD4));
+   } else
+   dev_info(adap->pdev_dev, "p2pmem memory not reserved, "
+"base 0x%x size %uB\n", mem_base, mem_size);
+}
+
 static int adap_init1(struct adapter *adap, struct fw_caps_config_cmd *c)
 {
u32 v;
@@ -4622,6 +4675,42 @@ static int cxgb4_iov_configure(struct pci_dev *pdev, int 
num_vfs)
 }
 #endif
 
+static int init_p2pmem(struct adapter *adapter)
+{
+   unsigned int mem_size = t4_read_reg(adapter, CIM_EXTMEM2_ADDR_SIZE_A);
+   struct p2pmem_dev *p;
+   int rc;
+   struct resource res;
+
+   if (!mem_size || !use_p2pmem)
+   return 0;
+
+   mem_size = roundup_pow_of_two(mem_size);
+
+   /*
+* Create a subset of BAR4 for the p2pmem region based on the
+* exported memory size.
+*/
+   memcpy(, >pdev->resource[4], sizeof(res));
+   res.start += mem_size;
+   res.end = res.start + mem_size - 1;
+   dev_info(adapter->pdev_dev, "p2pmem resource start 0x%llx end 0x%llx 
size %lluB\n",
+res.start, res.end, resource_size());
+
+   p = p2pmem_create(>pdev->dev);
+   if 

[RFC 3/8] nvmet: Use p2pmem in nvme target

2017-03-30 Thread Logan Gunthorpe
We create a configfs attribute in each nvme-fabrics target port to
enable p2p memory use. When enabled, the port will only then use the
p2p memory if a p2p memory device can be found which is behind the
same switch as the RDMA port and all the block devices in use. If
the user enabled it an no devices are found, then the system will
silently fall back on using regular memory.

If appropriate, that port will allocate memory for the RDMA buffers
for queues from the p2pmem device falling back to system memory should
anything fail.

Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
save an extra PCI transfer as the NVME card could just take the data
out of it's own memory. However, at this time, cards with CMB buffers
don't seem to be available.

Signed-off-by: Logan Gunthorpe 
Signed-off-by: Stephen Bates 
Signed-off-by: Steve Wise 
---
 drivers/nvme/target/configfs.c | 31 +++
 drivers/nvme/target/nvmet.h|  1 +
 drivers/nvme/target/rdma.c | 90 ++
 3 files changed, 114 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index be8c800..e61a7f4 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -777,12 +777,43 @@ static void nvmet_port_release(struct config_item *item)
kfree(port);
 }
 
+#ifdef CONFIG_P2PMEM
+static ssize_t nvmet_allow_p2pmem_show(struct config_item *item, char *page)
+{
+   return sprintf(page, "%d\n", to_nvmet_port(item)->allow_p2pmem);
+}
+
+static ssize_t nvmet_allow_p2pmem_store(struct config_item *item,
+   const char *page, size_t count)
+{
+   struct nvmet_port *port = to_nvmet_port(item);
+   bool allow;
+   int ret;
+
+   ret = strtobool(page, );
+   if (ret)
+   return ret;
+
+   down_write(_config_sem);
+   port->allow_p2pmem = allow;
+   up_write(_config_sem);
+
+   return count;
+}
+CONFIGFS_ATTR(nvmet_, allow_p2pmem);
+#endif
+
 static struct configfs_attribute *nvmet_port_attrs[] = {
_attr_addr_adrfam,
_attr_addr_treq,
_attr_addr_traddr,
_attr_addr_trsvcid,
_attr_addr_trtype,
+
+   #ifdef CONFIG_P2PMEM
+   _attr_allow_p2pmem,
+   #endif
+
NULL,
 };
 
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index f7ff15f..ab67175 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -95,6 +95,7 @@ struct nvmet_port {
struct list_headreferrals;
void*priv;
boolenabled;
+   boolallow_p2pmem;
 };
 
 static inline struct nvmet_port *to_nvmet_port(struct config_item *item)
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index ecc4fe8..7fd4840 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -64,6 +65,7 @@ struct nvmet_rdma_rsp {
struct rdma_rw_ctx  rw;
 
struct nvmet_reqreq;
+   struct p2pmem_dev   *p2pmem;
 
u8  n_rdma;
u32 flags;
@@ -107,6 +109,8 @@ struct nvmet_rdma_queue {
int send_queue_size;
 
struct list_headqueue_list;
+
+   struct p2pmem_dev   *p2pmem;
 };
 
 struct nvmet_rdma_device {
@@ -185,7 +189,8 @@ nvmet_rdma_put_rsp(struct nvmet_rdma_rsp *rsp)
spin_unlock_irqrestore(>queue->rsps_lock, flags);
 }
 
-static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int nents)
+static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int nents,
+   struct p2pmem_dev *p2pmem)
 {
struct scatterlist *sg;
int count;
@@ -193,13 +198,17 @@ static void nvmet_rdma_free_sgl(struct scatterlist *sgl, 
unsigned int nents)
if (!sgl || !nents)
return;
 
-   for_each_sg(sgl, sg, nents, count)
-   __free_page(sg_page(sg));
+   for_each_sg(sgl, sg, nents, count) {
+   if (p2pmem)
+   p2pmem_free_page(p2pmem, sg_page(sg));
+   else
+   __free_page(sg_page(sg));
+   }
kfree(sgl);
 }
 
 static int nvmet_rdma_alloc_sgl(struct scatterlist **sgl, unsigned int *nents,
-   u32 length)
+   u32 length, struct p2pmem_dev *p2pmem)
 {
struct scatterlist *sg;
struct page *page;
@@ -216,7 +225,11 @@ static int nvmet_rdma_alloc_sgl(struct scatterlist **sgl, 
unsigned int *nents,
while (length) {
u32 page_len = min_t(u32, length, PAGE_SIZE);
 
-   page = alloc_page(GFP_KERNEL);
+   if (p2pmem)
+   

[RFC 7/8] p2pmem: Support device removal

2017-03-30 Thread Logan Gunthorpe
This patch creates a list of callbacks to notify users of this memory
that the p2pmem device is going away or gone.

In nvmet-rdma, we disconnect any queue using p2p memory.
The remote side will then automatically reconnect in a
couple seconds and regular system memory (or a different p2pmem device)
will be used.

Signed-off-by: Logan Gunthorpe 
Signed-off-by: Stephen Bates 
Signed-off-by: Steve Wise 
---
 drivers/memory/p2pmem.c| 75 ---
 drivers/nvme/target/rdma.c | 98 ++
 include/linux/p2pmem.h | 19 +++--
 3 files changed, 140 insertions(+), 52 deletions(-)

diff --git a/drivers/memory/p2pmem.c b/drivers/memory/p2pmem.c
index 71741c2..499d42c 100644
--- a/drivers/memory/p2pmem.c
+++ b/drivers/memory/p2pmem.c
@@ -105,6 +105,21 @@ static void p2pmem_release(struct device *dev)
kfree(p);
 }
 
+struct remove_callback {
+   struct list_head list;
+   void (*callback)(void *context);
+   void *context;
+};
+
+static void p2pmem_remove(struct p2pmem_dev *p)
+{
+   struct remove_callback *remove_call, *tmp;
+
+   p->alive = false;
+   list_for_each_entry_safe(remove_call, tmp, >remove_list, list)
+   remove_call->callback(remove_call->context);
+}
+
 /**
  * p2pmem_create() - create a new p2pmem device
  * @parent: the parent device to create it under
@@ -123,6 +138,10 @@ struct p2pmem_dev *p2pmem_create(struct device *parent)
return ERR_PTR(-ENOMEM);
 
init_completion(>cmp);
+   mutex_init(>remove_mutex);
+   INIT_LIST_HEAD(>remove_list);
+   p->alive = true;
+
device_initialize(>dev);
p->dev.class = p2pmem_class;
p->dev.parent = parent;
@@ -187,6 +206,7 @@ void p2pmem_unregister(struct p2pmem_dev *p)
 
dev_info(>dev, "unregistered");
device_del(>dev);
+   p2pmem_remove(p);
ida_simple_remove(_ida, p->id);
put_device(>dev);
 }
@@ -291,6 +311,9 @@ EXPORT_SYMBOL(p2pmem_add_pci_region);
  */
 void *p2pmem_alloc(struct p2pmem_dev *p, size_t size)
 {
+   if (!p->alive)
+   return NULL;
+
return (void *)gen_pool_alloc(p->pool, size);
 }
 EXPORT_SYMBOL(p2pmem_alloc);
@@ -349,6 +372,9 @@ static int upstream_bridges_match(struct device *p2pmem,
struct pci_dev *p2p_up;
struct pci_dev *dma_up;
 
+   if (!to_p2pmem(p2pmem)->alive)
+   return false;
+
p2p_up = get_upstream_switch_port(p2pmem);
if (!p2p_up) {
dev_warn(p2pmem, "p2pmem is not behind a pci switch");
@@ -383,6 +409,8 @@ static int upstream_bridges_match(struct device *p2pmem,
  * specified devices
  * @dma_devices: a null terminated array of device pointers which
  * all must be compatible with the returned p2pmem device
+ * @remove_callback: this callback will be called if the p2pmem
+ * device is removed.
  *
  * For now, we only support cases where all the devices that
  * will transfer to the p2pmem device are on the same switch.
@@ -400,9 +428,13 @@ static int upstream_bridges_match(struct device *p2pmem,
  * (use p2pmem_put to return the reference) or NULL if no compatible
  * p2pmem device is found.
  */
-struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devices)
+struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devices,
+ void (*remove_callback)(void *context),
+ void *context)
 {
struct device *dev;
+   struct p2pmem_dev *p;
+   struct remove_callback *remove_call;
 
dev = class_find_device(p2pmem_class, NULL, dma_devices,
upstream_bridges_match);
@@ -410,21 +442,54 @@ struct p2pmem_dev *p2pmem_find_compat(struct device 
**dma_devices)
if (!dev)
return NULL;
 
-   return to_p2pmem(dev);
+   p = to_p2pmem(dev);
+   mutex_lock(>remove_mutex);
+
+   if (!p->alive) {
+   p = NULL;
+   goto out;
+   }
+
+   remove_call = kzalloc(sizeof(*remove_call), GFP_KERNEL);
+   remove_call->callback = remove_callback;
+   remove_call->context = context;
+   INIT_LIST_HEAD(_call->list);
+   list_add(_call->list, >remove_list);
+
+out:
+   mutex_unlock(>remove_mutex);
+   return p;
 }
 EXPORT_SYMBOL(p2pmem_find_compat);
 
 /**
  * p2pmem_put() - decrement a p2pmem device reference
  * @p: p2pmem device to return
+ * @data: data pointer that was passed to p2pmem_find_compat
  *
  * Dereference and free (if last) the device's reference counter.
  * It's safe to pass a NULL pointer to this function.
  */
-void p2pmem_put(struct p2pmem_dev *p)
+void p2pmem_put(struct p2pmem_dev *p, void *context)
 {
-   if (p)
-   put_device(>dev);
+   struct remove_callback *remove_call;
+
+   if (!p)
+   return;
+
+   

Re: [PATCH 12/23] sd: handle REQ_UNMAP

2017-03-30 Thread h...@lst.de
On Thu, Mar 30, 2017 at 11:28:32AM -0400, Martin K. Petersen wrote:
> "h...@lst.de"  writes:
> 
> Christoph,
> 
> > On Tue, Mar 28, 2017 at 04:48:55PM +, Bart Van Assche wrote:
> >> >  if (sdp->no_write_same)
> >> >  return BLKPREP_INVALID;
> >> >  if (sdkp->ws16 || sector > 0x || nr_sectors > 0x)
> >> 
> >> Users can change the provisioning mode from user space from SD_LBP_WS16 
> >> into
> >> SD_LBP_WS10 so I'm not sure it's safe to skip the (sdkp->ws16 || sector >
> >> 0x || nr_sectors > 0x) check if REQ_UNMAP is set.
> >
> > They can, and if the device has too many sectors that will already cause
> > discard to fail,
> 
> I'm not sure I understand what you mean by that?

If you manually change the provisioning mode to WS10 on a device that
must use WRITE SAME (16) to be able to address all blocks you're already
screwed right now, and with this patch you can screw yourself through
the WRITE_ZEROES path in addition to the DISCARD path.


Re: [PATCH 23/23] block: remove the discard_zeroes_data flag

2017-03-30 Thread h...@lst.de
On Thu, Mar 30, 2017 at 11:29:29AM -0400, Martin K. Petersen wrote:
> "h...@lst.de"  writes:
> 
> > Jens, any opinion?  I'd like to remove it too, but I fear it might
> > break things.  We could deprecate it first with a warning when read
> > and then remove it a few releases down the road.
> 
> I know of several apps that check this variable (as opposed to the
> ioctl).

The above was in reference to both methods..


[PATCH] osd_uld: Check scsi_device_get() return value

2017-03-30 Thread Bart Van Assche
scsi_device_get() can fail. Hence check its return value.

Signed-off-by: Bart Van Assche 
Cc: Boaz Harrosh 
---
 drivers/scsi/osd/osd_uld.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/osd/osd_uld.c b/drivers/scsi/osd/osd_uld.c
index e0ce5d2fd14d..1dea4244dd0c 100644
--- a/drivers/scsi/osd/osd_uld.c
+++ b/drivers/scsi/osd/osd_uld.c
@@ -464,14 +464,15 @@ static int osd_probe(struct device *dev)
/* hold one more reference to the scsi_device that will get released
 * in __release, in case a logout is happening while fs is mounted
 */
-   scsi_device_get(scsi_device);
+   if (scsi_device_get(scsi_device))
+   goto err_put_disk;
osd_dev_init(>od, scsi_device);
 
/* Detect the OSD Version */
error = __detect_osd(oud);
if (error) {
OSD_ERR("osd detection failed, non-compatible OSD device\n");
-   goto err_put_disk;
+   goto err_put_sdev;
}
 
/* init the char-device for communication with user-mode */
@@ -508,8 +509,9 @@ static int osd_probe(struct device *dev)
 
 err_put_cdev:
cdev_del(>cdev);
-err_put_disk:
+err_put_sdev:
scsi_device_put(scsi_device);
+err_put_disk:
put_disk(disk);
 err_free_osd:
dev_set_drvdata(dev, NULL);
-- 
2.12.0



Re: [PATCH 1/2] iscsi-target: Fix TMR reference leak during session shutdown

2017-03-30 Thread Bart Van Assche
On Thu, 2017-03-30 at 08:29 +, Nicholas A. Bellinger wrote:
> diff --git a/drivers/target/iscsi/iscsi_target_util.c 
> b/drivers/target/iscsi/iscsi_target_util.c
> index 5041a9c..b464033 100644
> --- a/drivers/target/iscsi/iscsi_target_util.c
> +++ b/drivers/target/iscsi/iscsi_target_util.c
> @@ -737,21 +737,23 @@ void iscsit_free_cmd(struct iscsi_cmd *cmd, bool 
> shutdown)
>  {
>   struct se_cmd *se_cmd = NULL;
>   int rc;
> + bool op_scsi = false;
>   /*
>* Determine if a struct se_cmd is associated with
>* this struct iscsi_cmd.
>*/
>   switch (cmd->iscsi_opcode) {
>   case ISCSI_OP_SCSI_CMD:
> - se_cmd = >se_cmd;
> - __iscsit_free_cmd(cmd, true, shutdown);
> + op_scsi = true;
>   /*
>* Fallthrough
>*/
>   case ISCSI_OP_SCSI_TMFUNC:
> - rc = transport_generic_free_cmd(>se_cmd, shutdown);
> - if (!rc && shutdown && se_cmd && se_cmd->se_sess) {
> - __iscsit_free_cmd(cmd, true, shutdown);
> + se_cmd = >se_cmd;
> + __iscsit_free_cmd(cmd, op_scsi, shutdown);
> + rc = transport_generic_free_cmd(se_cmd, shutdown);
> + if (!rc && shutdown && se_cmd->se_sess) {
> + __iscsit_free_cmd(cmd, op_scsi, shutdown);
>   target_put_sess_cmd(se_cmd);
>   }
>   break;

Hello Nic,

I agree that this patch fixes a leak. However, an existing bug in
iscsit_free_cmd() is not addressed by this patch. Before the TMF code started
using kref_get() / kref_put() it was possible for transport_generic_free_cmd()
to determine whether or not iscsit_free_cmd() should call __iscsit_free_cmd()
by checking the command reference count. I think that since the TMF code
manipulates the command reference count it is no longer possible for
transport_generic_free_cmd() to determine this. If iscsit_free_cmd() is called
while a LUN RESET is in progress then the return value of
transport_generic_free_cmd() will be wrong. I will post a few patches that not
only address what I just described but also the leak fixed by this patch.

Bart.

Re: RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload

2017-03-30 Thread Mike Snitzer
On Thu, Mar 30 2017 at 11:22am -0400,
Martin K. Petersen  wrote:

> Mike Snitzer  writes:
> 
> > Would be very useful, particularly for testing, if
> > drivers/scsi/scsi_debug.c were updated to support WRITE ZEROES.
> 
> There is no WRITE ZEROES in SCSI. You should be able to get the right
> behavior with lbpws=1 lbprz=1.

OK, thanks.


Re: [PATCH] be2iscsi: switch to pci_alloc_irq_vectors

2017-03-30 Thread Martin K. Petersen
Christoph Hellwig  writes:

> Any progress on that? I'd like to get be2iscsi off the old MSI-X API
> for 4.12.

Applied to 4.12/scsi-queue.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v2] sd: Consider max_xfer_blocks if opt_xfer_blocks is unusable

2017-03-30 Thread Martin K. Petersen
Fam Zheng  writes:

>>  rw_max = min_not_zero(logical_to_sectors(sdp, dev_max),
>>   BLK_DEF_MAX_SECTORS);
>
> Yes, it is better. Is it okay to make the change when you apply?

Sure. Applied to 4.11/scsi-fixes.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 23/23] block: remove the discard_zeroes_data flag

2017-03-30 Thread Martin K. Petersen
"h...@lst.de"  writes:

> Jens, any opinion?  I'd like to remove it too, but I fear it might
> break things.  We could deprecate it first with a warning when read
> and then remove it a few releases down the road.

I know of several apps that check this variable (as opposed to the
ioctl).

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 12/23] sd: handle REQ_UNMAP

2017-03-30 Thread Martin K. Petersen
"h...@lst.de"  writes:

Christoph,

> On Tue, Mar 28, 2017 at 04:48:55PM +, Bart Van Assche wrote:
>> >if (sdp->no_write_same)
>> >return BLKPREP_INVALID;
>> >if (sdkp->ws16 || sector > 0x || nr_sectors > 0x)
>> 
>> Users can change the provisioning mode from user space from SD_LBP_WS16 into
>> SD_LBP_WS10 so I'm not sure it's safe to skip the (sdkp->ws16 || sector >
>> 0x || nr_sectors > 0x) check if REQ_UNMAP is set.
>
> They can, and if the device has too many sectors that will already cause
> discard to fail,

I'm not sure I understand what you mean by that?

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload

2017-03-30 Thread Martin K. Petersen
Mike Snitzer  writes:

> Would be very useful, particularly for testing, if
> drivers/scsi/scsi_debug.c were updated to support WRITE ZEROES.

There is no WRITE ZEROES in SCSI. You should be able to get the right
behavior with lbpws=1 lbprz=1.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES

2017-03-30 Thread Martin K. Petersen
Mike Snitzer  writes:

> I can work on this now.  Only question I have is: should DM thinp take
> care to zero any misaligned head and tail?  (I assume so but with all
> the back and forth between Bart, Paolo and Martin I figured I'd ask
> explicitly).

Yep, let's make sure our semantics match the hardware ditto.

 - So write zeroes should behave deterministically and explicitly handle
   any blocks that can't be cleared via deprovisioning.

 - And discard can work at the discard granularity in a
   non-deterministic fashion.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload

2017-03-30 Thread Mike Snitzer
Would be very useful, particularly for testing, if
drivers/scsi/scsi_debug.c were updated to support WRITE ZEROES.


Re: [PATCHv5 0/2] tcmu: For bugs fix only

2017-03-30 Thread Bart Van Assche
On Thu, 2017-03-30 at 01:48 -0700, Nicholas A. Bellinger wrote:
> Just for future reference, the flow of these tags should reflect the
> history of the patch.  Eg:
> 
> Reviewed-by: First reviewer 
> Tested-by: First tester 
> Reviewed-by: Second reviewer 
> Signed-off-by: Patch Author 
> 
> and then once the subsystem maintainer merges it into his tree, they add
> their own:
> 
> Signed-off-by: Subsystem Maintainer 

Hi Nic,

I agree that these tags should reflect the history of the patch. I think
that means that the patch author should be mentioned first, Reviewed-by /
Tested-by tags next and the subsystem maintainer sign-off last. At least,
that's how most other maintainers do it.

Bart.

[PATCH] target/user: PGR Support

2017-03-30 Thread Bryant G. Ly
This adds PGR support for just TCMU, since tcmu doesn't
have the necessary IT_NEXUS info to process PGR in userspace,
so have those commands be processed in kernel.

Signed-off-by: Bryant G. Ly 
---
 drivers/target/target_core_configfs.c | 10 +-
 drivers/target/target_core_device.c   | 27 +++
 drivers/target/target_core_pr.c   |  2 +-
 drivers/target/target_core_pscsi.c|  3 ++-
 include/target/target_core_backend.h  |  1 +
 5 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/drivers/target/target_core_configfs.c 
b/drivers/target/target_core_configfs.c
index 38b5025..edfb098 100644
--- a/drivers/target/target_core_configfs.c
+++ b/drivers/target/target_core_configfs.c
@@ -1366,7 +1366,7 @@ static ssize_t target_pr_res_holder_show(struct 
config_item *item, char *page)
struct se_device *dev = pr_to_dev(item);
int ret;
 
-   if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH)
+   if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH_PGR)
return sprintf(page, "Passthrough\n");
 
spin_lock(>dev_reservation_lock);
@@ -1506,7 +1506,7 @@ static ssize_t target_pr_res_type_show(struct config_item 
*item, char *page)
 {
struct se_device *dev = pr_to_dev(item);
 
-   if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH)
+   if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH_PGR)
return sprintf(page, "SPC_PASSTHROUGH\n");
else if (dev->dev_reservation_flags & DRF_SPC2_RESERVATIONS)
return sprintf(page, "SPC2_RESERVATIONS\n");
@@ -1519,7 +1519,7 @@ static ssize_t target_pr_res_aptpl_active_show(struct 
config_item *item,
 {
struct se_device *dev = pr_to_dev(item);
 
-   if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH)
+   if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH_PGR)
return 0;
 
return sprintf(page, "APTPL Bit Status: %s\n",
@@ -1531,7 +1531,7 @@ static ssize_t target_pr_res_aptpl_metadata_show(struct 
config_item *item,
 {
struct se_device *dev = pr_to_dev(item);
 
-   if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH)
+   if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH_PGR)
return 0;
 
return sprintf(page, "Ready to process PR APTPL metadata..\n");
@@ -1577,7 +1577,7 @@ static ssize_t target_pr_res_aptpl_metadata_store(struct 
config_item *item,
u16 tpgt = 0;
u8 type = 0;
 
-   if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH)
+   if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH_PGR)
return count;
if (dev->dev_reservation_flags & DRF_SPC2_RESERVATIONS)
return count;
diff --git a/drivers/target/target_core_device.c 
b/drivers/target/target_core_device.c
index c754ae3..83c0d77 100644
--- a/drivers/target/target_core_device.c
+++ b/drivers/target/target_core_device.c
@@ -1045,6 +1045,7 @@ passthrough_parse_cdb(struct se_cmd *cmd,
sense_reason_t (*exec_cmd)(struct se_cmd *cmd))
 {
unsigned char *cdb = cmd->t_task_cdb;
+   struct se_device *dev = cmd->se_dev;
 
/*
 * Clear a lun set in the cdb if the initiator talking to use spoke
@@ -1076,6 +1077,32 @@ passthrough_parse_cdb(struct se_cmd *cmd,
return TCM_NO_SENSE;
}
 
+   /*
+* For PERSISTENT RESERVE IN/OUT, RELEASE, and RESERVE we need to
+* emulate the response, since tcmu does not have the information
+* required to process these commands.
+*/
+   if (!(dev->transport->transport_flags &
+ TRANSPORT_FLAG_PASSTHROUGH_PGR)) {
+   if (cdb[0] == PERSISTENT_RESERVE_IN) {
+   cmd->execute_cmd = target_scsi3_emulate_pr_in;
+   return TCM_NO_SENSE;
+   }
+   if (cdb[0] == PERSISTENT_RESERVE_OUT) {
+   cmd->execute_cmd = target_scsi3_emulate_pr_out;
+   return TCM_NO_SENSE;
+   }
+
+   if (cdb[0] == RELEASE || cdb[0] == RELEASE_10) {
+   cmd->execute_cmd = target_scsi2_reservation_release;
+   return TCM_NO_SENSE;
+   }
+   if (cdb[0] == RESERVE || cdb[0] == RESERVE_10) {
+   cmd->execute_cmd = target_scsi2_reservation_reserve;
+   return TCM_NO_SENSE;
+   }
+   }
+
/* Set DATA_CDB flag for ops that should have it */
switch (cdb[0]) {
case READ_6:
diff --git a/drivers/target/target_core_pr.c b/drivers/target/target_core_pr.c
index e180511..129ca57 100644
--- a/drivers/target/target_core_pr.c
+++ b/drivers/target/target_core_pr.c
@@ -4147,7 +4147,7 @@ target_check_reservation(struct se_cmd *cmd)

Re: [PATCH 1/6] virtio: wrap find_vqs

2017-03-30 Thread Michael S. Tsirkin
On Thu, Mar 30, 2017 at 02:00:08PM +0800, Jason Wang wrote:
> 
> 
> On 2017年03月30日 04:48, Michael S. Tsirkin wrote:
> > We are going to add more parameters to find_vqs, let's wrap the call so
> > we don't need to tweak all drivers every time.
> > 
> > Signed-off-by: Michael S. Tsirkin
> > ---
> 
> A quick glance and it looks ok, but what the benefit of this series, is it
> required by other changes?
> 
> Thanks

Yes - to avoid touching all devices when doing the rest of
the patchset.


Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES

2017-03-30 Thread Mike Snitzer
On Thu, Mar 30 2017 at  7:44am -0400,
Christoph Hellwig  wrote:

> On Thu, Mar 30, 2017 at 12:06:41PM +0200, Lars Ellenberg wrote:
> 
> > Will it make an fstrim cause thinly provisioned
> > devices to suddenly be fully allocated?
> 
> Not for SCSI devices.  Yes for dm-thinp until it implements
> REQ_OP_WRITE_ZEROES, which will hopefully be soon.

I can work on this now.  Only question I have is: should DM thinp take
care to zero any misaligned head and tail?  (I assume so but with all
the back and forth between Bart, Paolo and Martin I figured I'd ask
explicitly).


[PATCH 2/3] add test: generic/420 check information lead for lio-fileio

2017-03-30 Thread Dmitry Monakhov
Signed-off-by: Dmitry Monakhov 
---
 tests/generic/420 | 77 +++
 tests/generic/420.out |  2 ++
 tests/generic/group   |  1 +
 3 files changed, 80 insertions(+)
 create mode 100755 tests/generic/420
 create mode 100644 tests/generic/420.out

diff --git a/tests/generic/420 b/tests/generic/420
new file mode 100755
index 000..592703d
--- /dev/null
+++ b/tests/generic/420
@@ -0,0 +1,77 @@
+#! /bin/bash
+# FS QA Test 420
+#
+# Regression test for  information leak for lio-fileio target
+# BUG: If image  file is less than virtual blockdev then read() may return
+#  unitilized data.
+#
+#---
+# Copyright (c) 2017 Dmitry Monakhov 
+# All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+   _liotgt_cleanup
+   rm -rf $TEST_DIR/$$
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/liotarget
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs generic
+_supported_os Linux
+_require_test
+_require_liotarget
+
+mkdir -p $TEST_DIR/$$ || _fail "Can not make test dir"
+
+cfg_path=$(_liotgt_create_fileio  420-test.img $TEST_DIR/$$/420-test.img 128M)
+dev=$(_liotgt_attach_target /backstores/fileio/420-test.img)
+
+$XFS_IO_PROG -f -c "truncate 1M" $TEST_DIR/$$/420-test.img >> $seqres.full
+
+$XFS_IO_PROG -c "pwrite -S 0xa0 -b 512 512 1k" -d $dev >>$seqres.full 2>&1 || \
+_fail "pwrite failed"
+
+$XFS_IO_PROG -c "pwrite -S 0xab -b 512 2M 1k" -d $dev >>$seqres.full 2>&1 || \
+_fail "pwrite failed"
+
+md5sum $dev | sed -e "s,$dev,loopdev,g"
+# success, all done
+status=0
+exit
diff --git a/tests/generic/420.out b/tests/generic/420.out
new file mode 100644
index 000..deed5da
--- /dev/null
+++ b/tests/generic/420.out
@@ -0,0 +1,2 @@
+QA output created by 420
+f5cfb0d8d7bfadbc2127f4f6ca4a0eba  loopdev
diff --git a/tests/generic/group b/tests/generic/group
index 0781f35..1261547 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -422,3 +422,4 @@
 417 auto quick shutdown log
 418 auto rw
 419 auto quick encrypt
+420 auto blockdev liotarget
-- 
2.9.3



[PATCH 1/3] add lio-target helpers

2017-03-30 Thread Dmitry Monakhov
Linux-IO Target is very good framework for testing block backend.
It is more flexible than scsi_debug.

http://linux-iscsi.org/wiki/LIO
Signed-off-by: Dmitry Monakhov 
---
 common/config|   2 +
 common/liotarget | 111 +++
 2 files changed, 113 insertions(+)
 create mode 100644 common/liotarget

diff --git a/common/config b/common/config
index 59041a3..cfe7913 100644
--- a/common/config
+++ b/common/config
@@ -212,6 +212,8 @@ export XZ_PROG="`set_prog_path xz`"
 export FLOCK_PROG="`set_prog_path flock`"
 export LDD_PROG="`set_prog_path ldd`"
 export TIMEOUT_PROG="`set_prog_path timeout`"
+export TARGETCLI_PROG="`set_prog_path targetcli`"
+export TARGETCTL_PROG="`set_prog_path targetctl`"
 
 # use 'udevadm settle' or 'udevsettle' to wait for lv to be settled.
 # newer systems have udevadm command but older systems like RHEL5 don't.
diff --git a/common/liotarget b/common/liotarget
new file mode 100644
index 000..f821692
--- /dev/null
+++ b/common/liotarget
@@ -0,0 +1,111 @@
+#!/bin/bash
+#
+# Copyright (c) 2017 Virtuozzo Inc
+# All Rights Reserved.
+#
+# Written by Dmitry Monakhov 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#
+#
+# Functions for Linux-IO Target manipulation
+#
+
+_require_liotarget()
+{
+   which $TARGETCLI_PROG >>$seqres.full 2>&1  || \
+   _notrun "this test requires 'targetcli' tool"
+   which $TARGETCLI_PROG >>$seqres.full 2>&1  || \
+   _notrun "this test requires 'targetcli' tool"
+
+   $TARGETCLI_PROG ls /backstores/ramdisk >>$seqres.full 2>&1 ||\
+   _notrun "kernel compiled w/o CONFIG_TARGET_CORE"
+   $TARGETCLI_PROG ls /backstores/fileio >>$seqres.full 2>&1 ||\
+   _notrun "kernel compiled w/o CONFIG_TCM_FILEIO"
+   $TARGETCLI_PROG ls /loopback >>$seqres.full 2>&1 ||\
+   _notrun "kernel compiled w/o CONFIG_LOOPBACK_TARGET"
+}
+
+_liotgt_create_fileio()
+{
+   local name=$1
+   local img=$2
+   local sz=$3
+
+   $TARGETCLI_PROG /backstores/fileio create \
+   name=$name file_or_dev=$img  size=$sz >>$seqres.full ||\
+   _fail "Can not create /backstores/fileio/$name"
+
+   local cfg_path=`ls -d /sys/kernel/config/target/core/fileio_*/$name`
+   [ -d "$cfg_path" ] || _fail "Bad config path"
+   echo $cfg_path
+}
+
+_liotgt_set_attribute()
+{
+   local path=$1
+   local attr_name=$2
+   local attr_val=$3
+
+   [ -f $path/attrib/$attr_name ] || _fail "Bad attribute $attr_name"
+   echo "echo $attr_val >  $path/attrib/$attr_name " >>$seqres.full
+   echo $attr_val >  $path/attrib/$attr_name
+}
+
+_liotgt_create_loopback()
+{
+   local out=`$TARGETCLI_PROG /loopback/ create 2>&1`
+   [ $? -eq 0 ] || _fail "Can not create loopback target"
+   echo $out >>$seqres.full
+
+   local naa=`echo $out | gawk '{ print $3 }'`
+   echo ${naa:0:20} >>$seqres.full
+
+   echo ${naa:0:20}
+}
+
+_liotgt_find_dev()
+{
+   local found=""
+   local name=$1
+   local drives=`find /sys/devices -type f -name model`
+   for d in $drives;do
+   local dir=`dirname $d`
+   local vendor=`cat $dir/vendor`
+   local model=`cat $dir/model`
+   if [ "${vendor//[[:space:]]/}" == "LIO-ORG" ] && \
+  [ "${model//[[:space:]]/}" == "$name" ]; then
+   found=/dev/$(ls $dir/block)
+   break
+   fi
+   done
+   [ -z "$found" ] && _fail "Can not find device with backend $name"
+   echo "$found"
+}
+
+_liotgt_attach_target()
+{
+   local path=$1
+   local name=`basename $path`
+   local naa=$(_liotgt_create_loopback)
+
+   $TARGETCLI_PROG /loopback/$naa/luns create $path >>$seqres.full 2>&1 ||\
+   _fail "Can not attach $path to tcm_loop"
+   _liotgt_find_dev $name
+}
+
+_liotgt_cleanup()
+{
+   $TARGETCTL_PROG clear
+}
-- 
2.9.3



[PATCH 3/3] add test: generic/421 basic blockdev T10-DIF-TYPE1 integrity checks

2017-03-30 Thread Dmitry Monakhov
Test create virtual block device via lio-targed infastructure and
perform basic IO operations with data corruption detection.

Temprorally mark is as dangerous, because currently it trigger BUG_ON
inside blkdev_issue_flush

BTW: I use 'dd' to test read from corrupted image instead of xfs_io
because even if pread failed, xfs_io still exit with success, BUG?

Signed-off-by: Dmitry Monakhov 
---
 tests/generic/421 | 136 ++
 tests/generic/421.out |  16 ++
 tests/generic/group   |   1 +
 3 files changed, 153 insertions(+)
 create mode 100644 tests/generic/421
 create mode 100644 tests/generic/421.out

diff --git a/tests/generic/421 b/tests/generic/421
new file mode 100644
index 000..de1ba73
--- /dev/null
+++ b/tests/generic/421
@@ -0,0 +1,136 @@
+#! /bin/bash
+# FS QA Test 421
+#
+# Check basic T10-DIF integrity features for a block device
+#
+# DIF/DIX TYPE: T10-DIF-TYPE1-CRC
+# Kernel docs: Documentation/block/data-integrity.txt
+#---
+# Copyright (c) 2017 Dmitry Monakhov 
+# All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+   _liotgt_cleanup
+   rm -rf $TEST_DIR/$$
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/liotarget
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs generic
+_supported_os Linux
+_require_test
+_require_liotarget
+
+mkdir -p $TEST_DIR/$$ || _fail "Can not make test dir"
+
+# Create virtual block device
+cfg_path=$(_liotgt_create_fileio  t10-dif-t1 $TEST_DIR/$$/t10-dif-t1 32M)
+
+_liotgt_set_attribute $cfg_path  pi_prot_type 1
+_liotgt_set_attribute $cfg_path  pi_prot_format 1
+
+dev=$(_liotgt_attach_target /backstores/fileio/t10-dif-t1)
+
+echo "T0: Test basic IO"
+$XFS_IO_PROG -c "pwrite -S 0xa0 -b 4M 0 16M" -d $dev >>$seqres.full 2>&1 || \
+_fail "pwrite failed"
+
+$XFS_IO_PROG -c "pwrite -S 0xa1 -b 1k -V8  20M 32k" -d $dev >>$seqres.full 
2>&1 || \
+_fail "pwrite failed"
+
+$XFS_IO_PROG -c "pwrite -S 0xa2 -b 1k -V8  2M 32k" -f $dev >>$seqres.full 2>&1 
|| \
+_fail "pwrite failed"
+
+$XFS_IO_PROG -c "pwrite -S 0xa3 -b 4k  1536000 8k" -f $dev >>$seqres.full 2>&1 
|| \
+_fail "pwrite failed"
+
+$XFS_IO_PROG -c "fsync" -d $dev >>$seqres.full 2>&1 || _fail "fsync failed"
+
+echo "Check that buffered and direct read works"
+dd if=$dev bs=4k  2>$seqres.full | md5sum
+dd if=$dev bs=4M iflag=direct 2>$seqres.full | md5sum
+
+echo "Check csum corruption detection"
+# LIO-fileio store t10 DIF data in separate file ${IMG}.protection
+# struct t10_pi_tuple {
+#  __be16 guard_tag;   /* Checksum */
+#   __be16 app_tag; /* Opaque storage */
+#   __be32 ref_tag; /* Target LBA or indirect LBA */
+#}
+# Play with 3000'th sector -> t10_pi_tuple offset == 3000 * 8 == 24000
+#
+echo "T1: Corrupt guard_tag, next read should fail"
+$XFS_IO_PROG -c "pwrite -S 0xde -b2 24000 2 -w" \
+-f $TEST_DIR/$$/t10-dif-t1.protection >>$seqres.full 2>&1
+dd if=$dev bs=1M count=2 iflag=direct >>$seqres.full 2>&1 &&
+_fail "read should fail on 3000'th sector"
+
+echo "T2: Check that unaffected blocks are still readable"
+dd if=$dev bs=1M count=1 iflag=direct >>$seqres.full 2>&1 || _fail "read 
failed"
+
+echo "T3: Rewrite corrupted sector and check that read works"
+$XFS_IO_PROG -c "pwrite -S 0xa3 -b 4k 1536000 4k" -d $dev >>$seqres.full 2>&1 
|| \
+_fail "pwrite failed"
+dd if=$dev bs=2M count=1 iflag=direct >>$seqres.full 2>&1 || _fail "read 
failed"
+
+echo "T4: Corrupt app_tag, should not affect read"
+$XFS_IO_PROG -c "pwrite -S 0xde -b2 24002 2 -w" \
+-f $TEST_DIR/$$/t10-dif-t1.protection >>$seqres.full 2>&1
+dd if=$dev bs=2M count=1 iflag=direct >>$seqres.full 2>&1 || _fail "read 
failed"
+
+echo "T5: Corrupt ref_tag, next read should fail"
+$XFS_IO_PROG -c "pwrite -S 0xde -b4 24004 4 -w"  \
+

[PATCH 0/3] Improve block device testing coverage

2017-03-30 Thread Dmitry Monakhov
During LSFMM we have discussed how to test lower-backend of linux IO-stack.
Common opinion was that xfstests is the most obvious solution which cover
most of use cases filesystem care about.

I'm working on integration T10-DIF/DIF data integrity features to ext4,
for that reason we need to be shure that linux integrity framework is
in working state, which is currently broken in several places.

In fact, it is relatively simple to add basic coverage tests for basic
IO operations over virtual device with integrity support. All we need
is to add lio target support.

TOC:
add lio target helpers
add test: generic/420 check information lead for lio-fileio
add test: generic/421 basic blockdev T10-DIF-TYPE1 IO


Re: [Drbd-dev] [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES

2017-03-30 Thread Lars Ellenberg
On Thu, Mar 30, 2017 at 01:44:09PM +0200, Christoph Hellwig wrote:
> On Thu, Mar 30, 2017 at 12:06:41PM +0200, Lars Ellenberg wrote:
> > On Thu, Mar 23, 2017 at 10:33:40AM -0400, Christoph Hellwig wrote:
> > > It seems like DRBD assumes its on the wire TRIM request always zeroes 
> > > data.
> > > Use that fact to implement REQ_OP_WRITE_ZEROES.
> > > 
> > > XXX: will need a careful audit from the drbd team!
> > 
> > Thanks, this one looks ok to me.
> 
> So the DRBD protocol requires the TRIM operation to always zero?

"users" (both as in submitting entities, and people using DRBD)
expect that DRBD guarantees replicas to be identical after whatever
operations have been completed by all replicas.

Which means that for trim/discard/unmap, we can only expose that to
upper layers (or use it for internal purposes) if the operation has
a well defined, and on all backends identical, result.

Short answer: Yes.

> > The real question for me is, will the previous one (21/23)
> > return != 0 (some EOPNOTSUPP or else) to DRBD in more situations than
> > what we have now?
> 
> No, blkdev_issue_zeroout should never return -EOPNOTSUPP.
> 
> > Will it make an fstrim cause thinly provisioned
> > devices to suddenly be fully allocated?
> 
> Not for SCSI devices.  Yes for dm-thinp until it implements
> REQ_OP_WRITE_ZEROES, which will hopefully be soon.

"good enough for me" ;-)

Thanks,

Lars



Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES

2017-03-30 Thread Christoph Hellwig
On Thu, Mar 30, 2017 at 12:06:41PM +0200, Lars Ellenberg wrote:
> On Thu, Mar 23, 2017 at 10:33:40AM -0400, Christoph Hellwig wrote:
> > It seems like DRBD assumes its on the wire TRIM request always zeroes data.
> > Use that fact to implement REQ_OP_WRITE_ZEROES.
> > 
> > XXX: will need a careful audit from the drbd team!
> 
> Thanks, this one looks ok to me.

So the DRBD protocol requires the TRIM operation to always zero?

> The real question for me is, will the previous one (21/23)
> return != 0 (some EOPNOTSUPP or else) to DRBD in more situations than
> what we have now?

No, blkdev_issue_zeroout should never return -EOPNOTSUPP.

> Will it make an fstrim cause thinly provisioned
> devices to suddenly be fully allocated?

Not for SCSI devices.  Yes for dm-thinp until it implements
REQ_OP_WRITE_ZEROES, which will hopefully be soon.


Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES

2017-03-30 Thread Lars Ellenberg
On Thu, Mar 23, 2017 at 10:33:40AM -0400, Christoph Hellwig wrote:
> It seems like DRBD assumes its on the wire TRIM request always zeroes data.
> Use that fact to implement REQ_OP_WRITE_ZEROES.
> 
> XXX: will need a careful audit from the drbd team!

Thanks, this one looks ok to me.

The real question for me is, will the previous one (21/23)
return != 0 (some EOPNOTSUPP or else) to DRBD in more situations than
what we have now?  Will it make an fstrim cause thinly provisioned
devices to suddenly be fully allocated?
Or does it unmap "the same" as what we have now?
Especially on top of dm-thin, but also on top of any other device.
That's something that is not really "obvious" to me yet.

Cheers,
Lars




Re: [PATCHv5 0/2] tcmu: For bugs fix only

2017-03-30 Thread Xiubo Li

On 2017年03月30日 16:48, Nicholas A. Bellinger wrote:

Hi Xiubo & Co,

On Mon, 2017-03-27 at 17:07 +0800, lixi...@cmss.chinamobile.com wrote:

From: Xiubo Li 

Changed for V5:
- This only includes #1 and #2. And for old #3, #4 are still reviewing.
- #1, since the issue reported by Ilias is a separate new one, and will
   create a new patch later.
- #2, address the issue pointed out by Mike, thanks.
- #1 and #2 have been tested by Ilias in BIDI case, thanks.

Changed for V4:
- re-order the #3, #4 at the head.
- merge most of the #5 to others.

Xiubo Li (2):
   tcmu: Fix possible overwrite of t_data_sg's last iov[]
   tcmu: Fix wrongly calculating of the base_command_size

  drivers/target/target_core_user.c | 44 +++
  1 file changed, 31 insertions(+), 13 deletions(-)


Applied to target-pending/master, along with CC's for 3.19.y stable.

Thanks Xiubo !

Btw, I fixed-up the ordering of the Reviewed-by + Tested-by +
Signed-off-by tags on the patches here:

https://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git/commit/?id=ab22d2604c86ceb01bb2725c9860b88a7dd383bb
https://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git/commit/?id=abe342a5b4b5aa579f6bf40ba73447c699e6b579

Just for future reference, the flow of these tags should reflect the
history of the patch.  Eg:

Reviewed-by: First reviewer 
Tested-by: First tester 
Reviewed-by: Second reviewer 
Signed-off-by: Patch Author 

and then once the subsystem maintainer merges it into his tree, they add
their own:

Signed-off-by: Subsystem Maintainer 


Sure, thanks very much.


Beyond that, what is the status of the other two feature improvement
patches..?

Do you intend to post those (again) as v4.12-rc1 material..?

Since the #2 has changed in V5, and the old #3 will have a conflict with 
it.


The #3 and #4 are still in reviewing process, so I will post it again if 
they are all okay later.


Thanks,

BRs
Xiubo






Re: RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload

2017-03-30 Thread Christoph Hellwig
Lars, can you please take a look a patch 22 and check if it's safe?

That's the big thing I want to know before posting the next version
of the series.  If it's not safe I'd like to drop that patch.


RE: [PATCH] be2iscsi: switch to pci_alloc_irq_vectors

2017-03-30 Thread Jitendra Bhivare
> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Friday, January 13, 2017 10:00 PM
> To: subbu.seethara...@broadcom.com; ketan.muka...@broadcom.com;
> jitendra.bhiv...@broadcom.com; linux-scsi@vger.kernel.org
> Subject: [PATCH] be2iscsi: switch to pci_alloc_irq_vectors
>
> And get automatic MSI-X affinity for free.
>
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/scsi/be2iscsi/be_main.c | 127
+---
>  drivers/scsi/be2iscsi/be_main.h |   2 -
>  2 files changed, 42 insertions(+), 87 deletions(-)
>
> diff --git a/drivers/scsi/be2iscsi/be_main.c
b/drivers/scsi/be2iscsi/be_main.c
> index 6372613..03faca8 100644
> --- a/drivers/scsi/be2iscsi/be_main.c
> +++ b/drivers/scsi/be2iscsi/be_main.c
> @@ -801,12 +801,12 @@ static int beiscsi_init_irqs(struct beiscsi_hba
> *phba)
>   struct pci_dev *pcidev = phba->pcidev;
>   struct hwi_controller *phwi_ctrlr;
>   struct hwi_context_memory *phwi_context;
> - int ret, msix_vec, i, j;
> + int ret, i, j;
>
>   phwi_ctrlr = phba->phwi_ctrlr;
>   phwi_context = phwi_ctrlr->phwi_ctxt;
>
> - if (phba->msix_enabled) {
> + if (pcidev->msix_enabled) {
>   for (i = 0; i < phba->num_cpus; i++) {
>   phba->msi_name[i] = kzalloc(BEISCSI_MSI_NAME,
>   GFP_KERNEL);
> @@ -817,9 +817,8 @@ static int beiscsi_init_irqs(struct beiscsi_hba
*phba)
>
>   sprintf(phba->msi_name[i], "beiscsi_%02x_%02x",
>   phba->shost->host_no, i);
> - msix_vec = phba->msix_entries[i].vector;
> - ret = request_irq(msix_vec, be_isr_msix, 0,
> -   phba->msi_name[i],
> + ret = request_irq(pci_irq_vector(pcidev, i),
> +   be_isr_msix, 0,
phba->msi_name[i],
> _context->be_eq[i]);
>   if (ret) {
>   beiscsi_log(phba, KERN_ERR,
> BEISCSI_LOG_INIT, @@ -837,9 +836,8 @@ static int
beiscsi_init_irqs(struct
> beiscsi_hba *phba)
>   }
>   sprintf(phba->msi_name[i], "beiscsi_mcc_%02x",
>   phba->shost->host_no);
> - msix_vec = phba->msix_entries[i].vector;
> - ret = request_irq(msix_vec, be_isr_mcc, 0, phba-
> >msi_name[i],
> -   _context->be_eq[i]);
> + ret = request_irq(pci_irq_vector(pcidev, i), be_isr_mcc,
0,
> +   phba->msi_name[i], _context-
> >be_eq[i]);
>   if (ret) {
>   beiscsi_log(phba, KERN_ERR, BEISCSI_LOG_INIT ,
>   "BM_%d : beiscsi_init_irqs-"
> @@ -861,9 +859,8 @@ static int beiscsi_init_irqs(struct beiscsi_hba
*phba)
>   return 0;
>  free_msix_irqs:
>   for (j = i - 1; j >= 0; j--) {
> + free_irq(pci_irq_vector(pcidev, i),
_context->be_eq[j]);
>   kfree(phba->msi_name[j]);
> - msix_vec = phba->msix_entries[j].vector;
> - free_irq(msix_vec, _context->be_eq[j]);
>   }
>   return ret;
>  }
> @@ -3039,7 +3036,7 @@ static int beiscsi_create_eqs(struct beiscsi_hba
> *phba,
>   num_eq_pages = PAGES_REQUIRED(phba->params.num_eq_entries *
> \
> sizeof(struct be_eq_entry));
>
> - if (phba->msix_enabled)
> + if (phba->pcidev->msix_enabled)
>   eq_for_mcc = 1;
>   else
>   eq_for_mcc = 0;
> @@ -3549,7 +3546,7 @@ static int be_mcc_queues_create(struct
> beiscsi_hba *phba,
>   sizeof(struct be_mcc_compl)))
>   goto err;
>   /* Ask BE to create MCC compl queue; */
> - if (phba->msix_enabled) {
> + if (phba->pcidev->msix_enabled) {
>   if (beiscsi_cmd_cq_create(ctrl, cq, _context->be_eq
>[phba->num_cpus].q, false, true,
0))
>   goto mcc_cq_free;
> @@ -3580,42 +3577,35 @@ static int be_mcc_queues_create(struct
> beiscsi_hba *phba,
>   return -ENOMEM;
>  }
>
> -/**
> - * find_num_cpus()- Get the CPU online count
> - * @phba: ptr to priv structure
> - *
> - * CPU count is used for creating EQ.
> - **/
> -static void find_num_cpus(struct beiscsi_hba *phba)
> +static void be2iscsi_enable_msix(struct beiscsi_hba *phba)
>  {
> - int  num_cpus = 0;
> -
> - num_cpus = num_online_cpus();
> + int nvec = 1;
>
>   switch (phba->generation) {
>   case BE_GEN2:
>   case BE_GEN3:
> - phba->num_cpus = (num_cpus > BEISCSI_MAX_NUM_CPUS) ?
> -   BEISCSI_MAX_NUM_CPUS : num_cpus;
> + nvec = BEISCSI_MAX_NUM_CPUS + 1;
>   break;
>   case BE_GEN4:
> - /*
> -  * If eqid_count == 1 fall back to
> -  * INTX mechanism
> 

Re: [PATCH 23/23] block: remove the discard_zeroes_data flag

2017-03-30 Thread h...@lst.de
On Tue, Mar 28, 2017 at 05:00:48PM +, Bart Van Assche wrote:
> 
> It seems to me like the documentation in Documentation/ABI/testing/sysfs-block
> and the above code are not in sync. I think the above code will cause reading
> from the discard_zeroes_data attribute to return an empty string ("") instead
> of "0\n".

Thanks, fine with me.

> 
> BTW, my personal preference is to remove this attribute entirely because 
> keeping
> it will cause confusion, no matter how well we document the behavior of this
> attribute.

Jens, any opinion?  I'd like to remove it too, but I fear it might
break things.  We could deprecate it first with a warning when read
and then remove it a few releases down the road.


Re: [PATCH 11/23] block_dev: use blkdev_issue_zerout for hole punches

2017-03-30 Thread h...@lst.de
On Tue, Mar 28, 2017 at 04:50:47PM +, Bart Van Assche wrote:
> On Thu, 2017-03-23 at 10:33 -0400, Christoph Hellwig wrote:
> > This gets us support for non-discard efficient write of zeroes (e.g. NVMe)
> > and preparse for removing the discard_zeroes_data flag.
> 
> Hello Christoph,
> 
> "preparse" probably should have been "prepare"?

Yes, fixed.


Re: [PATCH 12/23] sd: handle REQ_UNMAP

2017-03-30 Thread h...@lst.de
On Tue, Mar 28, 2017 at 04:48:55PM +, Bart Van Assche wrote:
> > if (sdp->no_write_same)
> > return BLKPREP_INVALID;
> > if (sdkp->ws16 || sector > 0x || nr_sectors > 0x)
> 
> Users can change the provisioning mode from user space from SD_LBP_WS16 into
> SD_LBP_WS10 so I'm not sure it's safe to skip the (sdkp->ws16 || sector >
> 0x || nr_sectors > 0x) check if REQ_UNMAP is set.

They can, and if the device has too many sectors that will already cause
discard to fail, and in this case it will cause write zeroes to fail as
well.  The intent behind this patch is to keep the behavior the same
as the old path that uses discards for zeroing.  The logic looks a bit
clumsy, but I'd rather keep it as-is.


Re: [PATCH 01/23] block: renumber REQ_OP_WRITE_ZEROES

2017-03-30 Thread h...@lst.de
On Tue, Mar 28, 2017 at 04:12:46PM +, Bart Van Assche wrote:
> Since REQ_OP_WRITE_ZEROES was introduced in kernel v4.10, do we need
> "Cc: stable" and "Fixes: a6f0788ec2881" tags for this patch?

No.  This just works around the way scsi_setup_cmnd sets up the data
direction.  Before this series it's not an issue because no one used
the req_op data direction for setting up the dma direction.


Re: [PATCHv5 0/2] tcmu: For bugs fix only

2017-03-30 Thread Nicholas A. Bellinger
Hi Xiubo & Co,

On Mon, 2017-03-27 at 17:07 +0800, lixi...@cmss.chinamobile.com wrote:
> From: Xiubo Li 
> 
> Changed for V5:
> - This only includes #1 and #2. And for old #3, #4 are still reviewing.
> - #1, since the issue reported by Ilias is a separate new one, and will
>   create a new patch later.
> - #2, address the issue pointed out by Mike, thanks.
> - #1 and #2 have been tested by Ilias in BIDI case, thanks.
> 
> Changed for V4:
> - re-order the #3, #4 at the head.
> - merge most of the #5 to others.
> 
> Xiubo Li (2):
>   tcmu: Fix possible overwrite of t_data_sg's last iov[]
>   tcmu: Fix wrongly calculating of the base_command_size
> 
>  drivers/target/target_core_user.c | 44 
> +++
>  1 file changed, 31 insertions(+), 13 deletions(-)
> 

Applied to target-pending/master, along with CC's for 3.19.y stable.

Thanks Xiubo !

Btw, I fixed-up the ordering of the Reviewed-by + Tested-by +
Signed-off-by tags on the patches here:

https://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git/commit/?id=ab22d2604c86ceb01bb2725c9860b88a7dd383bb
https://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git/commit/?id=abe342a5b4b5aa579f6bf40ba73447c699e6b579

Just for future reference, the flow of these tags should reflect the
history of the patch.  Eg:

Reviewed-by: First reviewer 
Tested-by: First tester 
Reviewed-by: Second reviewer 
Signed-off-by: Patch Author 

and then once the subsystem maintainer merges it into his tree, they add
their own:

Signed-off-by: Subsystem Maintainer 

Beyond that, what is the status of the other two feature improvement
patches..?

Do you intend to post those (again) as v4.12-rc1 material..?



Re: [PATCH 1/7] ѕd: split sd_setup_discard_cmnd

2017-03-30 Thread h...@lst.de
On Tue, Mar 28, 2017 at 08:05:09AM -0600, ax...@kernel.dk wrote:
> > Although I know this is an issue in the existing code and not something
> > introduced by you: please consider using logical_to_sectors() instead of
> > open-coding this function. Otherwise this patch looks fine to me.
> 
> The downside of doing that is that we still call ilog2() twice, which
> sucks. Would be faster to cache ilog2(sector_size) and use that in the
> shift calculation.

I suspect that gcc is smart enough to optimize it away.  That beeing said
while this looks like a nice cleanup this patch is just supposed to move
code, so I'd rather not add the change here and leave it for a separate
submission.


Re: [PATCH] be2iscsi: switch to pci_alloc_irq_vectors

2017-03-30 Thread Christoph Hellwig
On Mon, Jan 23, 2017 at 09:41:45AM +0530, Jitendra Bhivare wrote:
> We will be taking this up along with some other changes in the same area.

Any progress on that? I'd like to get be2iscsi off the old MSI-X API
for 4.12.


[PATCH 2/2] target: Avoid mappedlun symlink creation during lun shutdown

2017-03-30 Thread Nicholas A. Bellinger
From: Nicholas Bellinger 

This patch closes a race between se_lun deletion during configfs
unlink in target_fabric_port_unlink() -> core_dev_del_lun()
-> core_tpg_remove_lun(), when transport_clear_lun_ref() blocks
waiting for percpu_ref RCU grace period to finish, but a new
NodeACL mappedlun is added before the RCU grace period has
completed.

This can happen in target_fabric_mappedlun_link() because it
only checks for se_lun->lun_se_dev, which is not cleared until
after transport_clear_lun_ref() percpu_ref RCU grace period
finishes.

This bug originally manifested as NULL pointer dereference
OOPsen in target_stat_scsi_att_intr_port_show_attr_dev() on
v4.1.y code, because it dereferences lun->lun_se_dev without
a explicit NULL pointer check.

In post v4.1 code with target-core RCU conversion, the code
in target_stat_scsi_att_intr_port_show_attr_dev() no longer
uses se_lun->lun_se_dev, but the same race still exists.

To address the bug, go ahead and set se_lun>lun_shutdown as
early as possible in core_tpg_remove_lun(), and ensure new
NodeACL mappedlun creation in target_fabric_mappedlun_link()
fails during se_lun shutdown.

Reported-by: James Shen 
Cc: James Shen 
Tested-by: James Shen 
Signed-off-by: Nicholas Bellinger 
---
 drivers/target/target_core_fabric_configfs.c | 5 +
 drivers/target/target_core_tpg.c | 4 
 include/target/target_core_base.h| 1 +
 3 files changed, 10 insertions(+)

diff --git a/drivers/target/target_core_fabric_configfs.c 
b/drivers/target/target_core_fabric_configfs.c
index d8a16ca..d1e6cab 100644
--- a/drivers/target/target_core_fabric_configfs.c
+++ b/drivers/target/target_core_fabric_configfs.c
@@ -92,6 +92,11 @@ static int target_fabric_mappedlun_link(
pr_err("Source se_lun->lun_se_dev does not exist\n");
return -EINVAL;
}
+   if (lun->lun_shutdown) {
+   pr_err("Unable to create mappedlun symlink because"
+   " lun->lun_shutdown=true\n");
+   return -EINVAL;
+   }
se_tpg = lun->lun_tpg;
 
nacl_ci = _acl_ci->ci_parent->ci_group->cg_item;
diff --git a/drivers/target/target_core_tpg.c b/drivers/target/target_core_tpg.c
index 6fb1919..dfaef4d 100644
--- a/drivers/target/target_core_tpg.c
+++ b/drivers/target/target_core_tpg.c
@@ -642,6 +642,8 @@ void core_tpg_remove_lun(
 */
struct se_device *dev = rcu_dereference_raw(lun->lun_se_dev);
 
+   lun->lun_shutdown = true;
+
core_clear_lun_from_tpg(lun, tpg);
/*
 * Wait for any active I/O references to percpu se_lun->lun_ref to
@@ -663,6 +665,8 @@ void core_tpg_remove_lun(
}
if (!(dev->se_hba->hba_flags & HBA_FLAGS_INTERNAL_USE))
hlist_del_rcu(>link);
+
+   lun->lun_shutdown = false;
mutex_unlock(>tpg_lun_mutex);
 
percpu_ref_exit(>lun_ref);
diff --git a/include/target/target_core_base.h 
b/include/target/target_core_base.h
index 4b784b6..2e28246 100644
--- a/include/target/target_core_base.h
+++ b/include/target/target_core_base.h
@@ -705,6 +705,7 @@ struct se_lun {
u64 unpacked_lun;
 #define SE_LUN_LINK_MAGIC  0x7771
u32 lun_link_magic;
+   boollun_shutdown;
boollun_access_ro;
u32 lun_index;
 
-- 
1.9.1



[PATCH 1/2] iscsi-target: Fix TMR reference leak during session shutdown

2017-03-30 Thread Nicholas A. Bellinger
From: Nicholas Bellinger 

This patch fixes a iscsi-target specific TMR reference leak
during session shutdown, that could occur when a TMR was
quiesced before the hand-off back to iscsi-target code
via transport_cmd_check_stop_to_fabric().

The reference leak happens because iscsit_free_cmd() was
incorrectly skipping the final target_put_sess_cmd() for
TMRs when transport_generic_free_cmd() returned zero because
the se_cmd->cmd_kref did not reach zero, due to the missing
se_cmd assignment in original code.

The result was iscsi_cmd and it's associated se_cmd memory
would be freed once se_sess->sess_cmd_map where released,
but the associated se_tmr_req was leaked and remained part
of se_device->dev_tmr_list.

This bug would manfiest itself as kernel paging request
OOPsen in core_tmr_lun_reset(), when a left-over se_tmr_req
attempted to dereference it's se_cmd pointer that had
already been released during normal session shutdown.

To address this bug, go ahead and treat ISCSI_OP_SCSI_CMD
and ISCSI_OP_SCSI_TMFUNC the same when there is an extra
se_cmd->cmd_kref to drop in iscsit_free_cmd(), and use
op_scsi to signal __iscsit_free_cmd() when the former
needs to clear any further iscsi related I/O state.

Reported-by: Rob Millner 
Cc: Rob Millner 
Reported-by: Chu Yuan Lin 
Cc: Chu Yuan Lin 
Tested-by: Chu Yuan Lin 
Signed-off-by: Nicholas Bellinger 
---
 drivers/target/iscsi/iscsi_target_util.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/target/iscsi/iscsi_target_util.c 
b/drivers/target/iscsi/iscsi_target_util.c
index 5041a9c..b464033 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -737,21 +737,23 @@ void iscsit_free_cmd(struct iscsi_cmd *cmd, bool shutdown)
 {
struct se_cmd *se_cmd = NULL;
int rc;
+   bool op_scsi = false;
/*
 * Determine if a struct se_cmd is associated with
 * this struct iscsi_cmd.
 */
switch (cmd->iscsi_opcode) {
case ISCSI_OP_SCSI_CMD:
-   se_cmd = >se_cmd;
-   __iscsit_free_cmd(cmd, true, shutdown);
+   op_scsi = true;
/*
 * Fallthrough
 */
case ISCSI_OP_SCSI_TMFUNC:
-   rc = transport_generic_free_cmd(>se_cmd, shutdown);
-   if (!rc && shutdown && se_cmd && se_cmd->se_sess) {
-   __iscsit_free_cmd(cmd, true, shutdown);
+   se_cmd = >se_cmd;
+   __iscsit_free_cmd(cmd, op_scsi, shutdown);
+   rc = transport_generic_free_cmd(se_cmd, shutdown);
+   if (!rc && shutdown && se_cmd->se_sess) {
+   __iscsit_free_cmd(cmd, op_scsi, shutdown);
target_put_sess_cmd(se_cmd);
}
break;
-- 
1.9.1



[PATCH 0/2] target: Bug-fixes for v4.11-rc

2017-03-30 Thread Nicholas A. Bellinger
From: Nicholas Bellinger 

Hi all,

Here are two additional target bug-fixes that have been found
by the DATERA Q/A + automation team during extended longevity
and stress testing atop v4.1.y stable code.

The first is a iscsi-target specific TMR reference leak during
session shutdown when se_cmd quiesce occured before hand-off
back to iscsi-target, that results in se_tmr_req descriptors
being left after se_cmd was freed.

This would manifest as kernel paging OOPsen during LUN_RESET
after the leak occured, because the associated se_tmr_req had
not been released even though the pre-allocated parent se_cmd
descriptor had already been freed during session shutdown.

The second is a race between se_lun configfs unlink shutdown
when se_lun->lun_ref percpu_ref RCU grace period is happening,
and user-space attempts to add a new NodeACL mappedlun configfs
symlink while se_lun shutdown is occuring.

This would manifest as NULL pointer dereference OOPsen within
target_stat_scsi_att_intr_port_show_attr_dev() after a new
NodeACL mappedlun was added to a se_lun that had already
been shutdown.

Both have been verified on v4.1.y stable code, and forward
ported to v4.11-rc.

Please review.

--nab

Nicholas Bellinger (2):
  iscsi-target: Fix TMR reference leak during session shutdown
  target: Avoid mappedlun symlink creation during lun shutdown

 drivers/target/iscsi/iscsi_target_util.c | 12 +++-
 drivers/target/target_core_fabric_configfs.c |  5 +
 drivers/target/target_core_tpg.c |  4 
 include/target/target_core_base.h|  1 +
 4 files changed, 17 insertions(+), 5 deletions(-)

-- 
1.9.1



Re: [PATCH v2] scsi/bfa: use designated initializers

2017-03-30 Thread Christoph Hellwig
On Wed, Mar 29, 2017 at 01:55:09PM -0700, Kees Cook wrote:
> Prepare to mark sensitive kernel structures for randomization by making
> sure they're using designated initializers. These were identified during
> allyesconfig builds of x86, arm, and arm64, with most initializer fixes
> extracted from grsecurity.
> 
> Signed-off-by: Kees Cook 
> ---
> This has been updated now that struct bfa_fcs_mod_s was dropped.

Can you also add designated array initializers for __port_action while
you're at it?


Re: [REGRESSION][Stable][v3.12.y][v4.4.y][v4.9.y][v4.10.y][v4.11-rc1] scsi: storvsc: properly set residual data length on errors

2017-03-30 Thread Greg KH
On Tue, Mar 28, 2017 at 04:14:09PM +, Stephen Hemminger wrote:
> I decided not to send it to stable since problem was only observed on
> 4.11 but it is probably endemic to all GEN2 VM's

So, what does this mean?  What should stable@ do?  Nothing?  Ok, now
dropped this from my patch queue :)

thanks,

greg k-h


Re: [PATCH 1/6] virtio: wrap find_vqs

2017-03-30 Thread Cornelia Huck
On Wed, 29 Mar 2017 23:48:44 +0300
"Michael S. Tsirkin"  wrote:

> We are going to add more parameters to find_vqs, let's wrap the call so
> we don't need to tweak all drivers every time.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  drivers/block/virtio_blk.c | 3 +--
>  drivers/char/virtio_console.c  | 6 +++---
>  drivers/crypto/virtio/virtio_crypto_core.c | 3 +--
>  drivers/gpu/drm/virtio/virtgpu_kms.c   | 3 +--
>  drivers/net/caif/caif_virtio.c | 3 +--
>  drivers/net/virtio_net.c   | 3 +--
>  drivers/rpmsg/virtio_rpmsg_bus.c   | 2 +-
>  drivers/scsi/virtio_scsi.c | 3 +--
>  drivers/virtio/virtio_balloon.c| 3 +--
>  drivers/virtio/virtio_input.c  | 3 +--
>  include/linux/virtio_config.h  | 9 +
>  net/vmw_vsock/virtio_transport.c   | 6 +++---
>  12 files changed, 24 insertions(+), 23 deletions(-)

Regardless whether that context thing is the right thing to do, this
looks like a sensible cleanup.



Re: [PATCH 1/6] virtio: wrap find_vqs

2017-03-30 Thread Jason Wang



On 2017年03月30日 04:48, Michael S. Tsirkin wrote:

We are going to add more parameters to find_vqs, let's wrap the call so
we don't need to tweak all drivers every time.

Signed-off-by: Michael S. Tsirkin
---


A quick glance and it looks ok, but what the benefit of this series, is 
it required by other changes?


Thanks