[PATCH] iser: set sector for ambiguous mr status errors

2018-11-14 Thread Sagi Grimberg
If for some reason we failed to query the mr status, we need to make sure
to provide sufficient information for an ambiguous error (guard error on
sector 0).

Fixes: 0a7a08ad6f5f ("IB/iser: Implement check_protection")
Cc: 
Reported-by: Dan Carpenter 
Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/ulp/iser/iser_verbs.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c 
b/drivers/infiniband/ulp/iser/iser_verbs.c
index 946b623ba5eb..ab137bafa8a8 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -1124,7 +1124,9 @@ u8 iser_check_task_pi_status(struct iscsi_iser_task 
*iser_task,
IB_MR_CHECK_SIG_STATUS, &mr_status);
if (ret) {
pr_err("ib_check_mr_status failed, ret %d\n", ret);
-   goto err;
+   /* Not alot we can do, return ambiguous guard error */
+   *sector = 0;
+   return 0x1;
}
 
if (mr_status.fail_status & IB_MR_CHECK_SIG_STATUS) {
@@ -1152,9 +1154,6 @@ u8 iser_check_task_pi_status(struct iscsi_iser_task 
*iser_task,
}
 
return 0;
-err:
-   /* Not alot we can do here, return ambiguous guard error */
-   return 0x1;
 }
 
 void iser_err_comp(struct ib_wc *wc, const char *type)
-- 
2.17.1



Re: [PATCH v2 08/15] nvmet: make config_item_type const

2017-10-17 Thread Sagi Grimberg

Acked-by: Sagi Grimberg <s...@grimberg.me>


Re: tgtd CPU 100% problem

2017-07-11 Thread Sagi Grimberg



Thanks very much for the reply.
I think it may be better to assert and exit rather than run an
endless loop after receiving the DEVICE_REMOVAL event.
Or we could sleep 5 ms and then check whether conn->h.state is STATE_FULL.


Note that neither of the CC'd lists is the correct list
for stgt; you should correspond with s...@lists.wpkg.org

Regarding your observation, I'm not sure why you think this is
happening; the work is queued once every 5 seconds and the loop
is finite.


Re: tgtd CPU 100% problem

2017-07-11 Thread Sagi Grimberg



On 11/07/17 10:51, 李春 wrote:

We have met a problem of tgtd using 100% CPU.

The InfiniBand network card was negotiated into eth mode by mistake.
After we changed it to ib mode and restarted opensmd to get the correct
state (Active), tgtd used 100% of a CPU, and when we connected to it
using tgtadm, tgtadm hung forever.

# how to repeat

* tgtd exports a disk through port 3260 using iSER
* iscsiadm logs in to a target from tgt over InfiniBand

* connectx_port_config sets the Mellanox port to eth mode
* connectx_port_config sets the Mellanox port back to ib mode
* /etc/init.d/opensmd restart
* tgtadm connects to tgt and hangs

# error message

```
Jul  1 21:32:37 shadow tgtd: iser_handle_rdmacm(1628) Unsupported
event:11, RDMA_CM_EVENT_DEVICE_REMOVAL - ignored
Jul  1 21:32:37 shadow tgtd: iser_handle_rdmacm(1628) Unsupported
event:11, RDMA_CM_EVENT_DEVICE_REMOVAL - ignored

Jul  1 21:32:39 shadow tgtd: iser_handle_async_event(3174) dev:mlx4_0
HCA evt: local catastrophic error


The iser code in tgtd does not know how to correctly handle RDMA device
removal events (and it never did).

The event is generated from the port configuration step while
tgt-iser is bound to it. Once the device is removed the device
handle tgt-iser has is essentially unusable, which explains
the qp creation failures below.

Handling the DEVICE_REMOVAL event is a new feature request.


Jul  1 21:46:56 shadow tgtd: iser_cm_connect_request(1471)
conn:0x1380bf0 cm_id:0x1380950 rdma_create_qp failed, Cannot allocate
memory
Jul  1 21:46:56 shadow tgtd: iser_cm_connect_request(1520)
cm_id:0x1380950 rdma_reject failed, Bad file descriptor
Jul  1 21:46:56 shadow tgtd: iser_cm_connect_request(1471)
conn:0x1380bf0 cm_id:0x1380950 rdma_create_qp failed, Cannot allocate
memory


And also tgt-iser cannot even reject the (re)connect request.


Jul  1 21:46:56 shadow tgtd: iser_cm_connect_request(1520)
cm_id:0x1380950 rdma_reject failed, Bad file descriptor
```


Re: ["PATCH-v2" 00/22] lpfc updates for 11.2.0.12

2017-04-20 Thread Sagi Grimberg



The patches are dependent on the FC nvme/nvmet patches from the following 2
series:
http://lists.infradead.org/pipermail/linux-nvme/2017-April/009250.html
http://lists.infradead.org/pipermail/linux-nvme/2017-April/009256.html


Hmm,

So it seems that we have conflicts here

A local merge attempt on Jens's current for-4.12/block
from nvme-4.12 (with this patchset) is generating:

--
diff --cc drivers/nvme/host/fc.c
index 450733c8cd24,b6d2ca8559f6..
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@@ -1147,8 -1266,9 +1266,14 @@@ nvme_fc_fcpio_done(struct nvmefc_fcp_re
struct nvme_fc_ctrl *ctrl = op->ctrl;
struct nvme_fc_queue *queue = op->queue;
struct nvme_completion *cqe = &op->rsp_iu.cqe;
++<<<<<<< HEAD
 +  __le16 status = cpu_to_le16(NVME_SC_SUCCESS << 1);
 +  union nvme_result result;
++=======
+   struct nvme_command *sqe = &op->cmd_iu.sqe;
+   u16 status = NVME_SC_SUCCESS;
+   bool complete_rq;
++>>>>>>> nvme-4.12

/*
 * WARNING:
@@@ -1229,12 -1349,12 +1354,17 @@@
 be32_to_cpu(op->rsp_iu.xfrd_len) !=
freq->transferred_length ||
 op->rsp_iu.status_code ||
++<<<<<<< HEAD
 +   op->rqno != le16_to_cpu(cqe->command_id))) {
 +  status = cpu_to_le16(NVME_SC_FC_TRANSPORT_ERROR 
<< 1);

++=======
+sqe->common.command_id != cqe->command_id)) {
+   status = NVME_SC_FC_TRANSPORT_ERROR;
++>>>>>>> nvme-4.12
goto done;
}
 -  op->nreq.result = cqe->result;
 -  status = le16_to_cpu(cqe->status) >> 1;
 +  result = cqe->result;
 +  status = cqe->status;
break;

default:
@@@ -1243,13 -1363,26 +1373,35 @@@
}

  done:
++<<<<<<< HEAD
 +  if (!queue->qnum && op->rqno >= AEN_CMDID_BASE) {
 +  nvme_complete_async_event(&queue->ctrl->ctrl, status, &result);

++=======
+   if (op->flags & FCOP_FLAGS_AEN) {
+   nvme_complete_async_event(&queue->ctrl->ctrl, status,
+   &op->nreq.result);
+   complete_rq = __nvme_fc_fcpop_chk_teardowns(ctrl, op);
+   atomic_set(&op->state, FCPOP_STATE_IDLE);
+   op->flags = FCOP_FLAGS_AEN; /* clear other flags */
++>>>>>>> nvme-4.12
nvme_fc_ctrl_put(ctrl);
return;
}

++<<<<<<< HEAD
 +  nvme_end_request(rq, status, result);
++=======
+   complete_rq = __nvme_fc_fcpop_chk_teardowns(ctrl, op);
+   if (!complete_rq) {
+   if (unlikely(op->flags & FCOP_FLAGS_TERMIO)) {
+   status = NVME_SC_ABORT_REQ;
+   if (blk_queue_dying(rq->q))
+   status |= NVME_SC_DNR;
+   }
+   blk_mq_complete_request(rq, status);
+   } else
+   __nvme_fc_final_op_cleanup(rq);
++>>>>>>> nvme-4.12
  }

  static int
--


Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem

2017-04-10 Thread Sagi Grimberg



Sagi

As long as legA, legB and the RC are all connected to the same switch then 
ordering will be preserved (I think many other topologies also work). Here is 
how it would work for the problem case you are concerned about (which is a read 
from the NVMe drive).

1. Disk device DMAs out the data to the p2pmem device via a string of PCIe 
MemWr TLPs.
2. Disk device writes to the completion queue (in system memory) via a MemWr 
TLP.
3. The last of the MemWrs from step 1 might have got stalled in the PCIe switch 
due to congestion but if so they are stalled in the egress path of the switch 
for the p2pmem port.
4. The RC determines the IO is complete when the TLP associated with step 2 
updates the memory associated with the CQ. It issues some operation to read the 
p2pmem.
5. Regardless of whether the MemRd TLP comes from the RC or another device 
connected to the switch it is queued in the egress queue for the p2pmem FIO 
behind the last DMA TLP (from step 1).
PCIe ordering ensures that this MemRd cannot overtake the MemWr (Reads can 
never pass writes).
Therefore the MemRd can never get to the p2pmem device until after the last DMA 
MemWr has.


What you are saying is surprising to me. The switch needs to preserve
ordering across different switch ports ??

You are suggesting that there is a *switch-wide* state that tracks
MemRds never pass MemWrs across all the switch ports? That is a very
non-trivial statement...


Re: [RFC 3/8] nvmet: Use p2pmem in nvme target

2017-04-05 Thread Sagi Grimberg



I hadn't done this yet but I think a simple closest device in the tree
would solve the issue sufficiently. However, I originally had it so the
user has to pick the device and I prefer that approach. But if the user
picks the device, then why bother restricting what he picks?


Because the user can get it wrong, and it's our job to do what we can
to prevent users from shooting themselves in the foot.


Per the
thread with Sinan, I'd prefer to use what the user picks. You were one
of the biggest opponents to that so I'd like to hear your opinion on
removing the restrictions.


I wasn't against it that much, I'm all for making things "just work"
with minimal configuration steps, but I'm not sure we can get it
right without it.


Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
save an extra PCI transfer as the NVME card could just take the data
out of its own memory. However, at this time, cards with CMB buffers
don't seem to be available.


Even if it was available, it would be hard to make real use of this
given that we wouldn't know how to pre-post recv buffers (for in-capsule
data). But let's leave this out of the scope entirely...


I don't understand what you're referring to. We'd simply use the CMB
buffer as a p2pmem device, why does that change anything?


I'm referring to the in-capsule data buffers pre-posts that we do.
Because we prepare a buffer that would contain in-capsule data, we have
no knowledge to which device the incoming I/O is directed to, which
means we can (and will) have I/O where the data lies in CMB of device
A but it's really targeted to device B - which sorta defeats the purpose
of what we're trying to optimize here...


Why do you need this? you have a reference to the
queue itself.


This keeps track of whether the response was actually allocated with
p2pmem or not. It's needed for when we free the SGL because the queue
may have a p2pmem device assigned to it but, if the alloc failed and it
fell back on system memory then we need to know how to free it. I'm
currently looking at having SGLs having an iomem flag. In which case,
this would no longer be needed as the flag in the SGL could be used.


That would be better, maybe...

[...]


This is a problem. Namespaces can be added at any point in time. Nothing
guarantees that dma_devs covers all the namespaces we'll ever see.


Yeah, restricting p2pmem based on all the devices in use is hard.
So we'd need a call into the transport every time an ns is added, and
we'd have to drop the p2pmem if they add one that isn't supported. This
complexity is just one of the reasons I prefer just letting the user choose.


Still, the user can get it wrong. Not sure we can get away without
keeping track of this as new devices join the subsystem.


+
+if (queue->p2pmem)
+pr_debug("using %s for rdma nvme target queue",
+ dev_name(&queue->p2pmem->dev));
+
+kfree(dma_devs);
+}
+
 static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
 struct rdma_cm_event *event)
 {
@@ -1199,6 +1271,8 @@ static int nvmet_rdma_queue_connect(struct
rdma_cm_id *cm_id,
 }
 queue->port = cm_id->context;

+nvmet_rdma_queue_setup_p2pmem(queue);
+


Why is all this done for each queue? looks completely redundant to me.


A little bit. Where would you put it?


I think we'll need a representation of a controller in nvmet-rdma for
that. We've sort of gotten away without it so far, but I don't think we
can anymore with this.


 ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
 if (ret)
 goto release_queue;


You seemed to skip the in-capsule buffers for p2pmem (inline_page), I'm
curious why?


Yes, the thinking was that these transfers were small anyway so there
would not be significant benefit to pushing them through p2pmem. There's
really no reason why we couldn't do that if it made sense to though.


I don't see an urgent reason for it either. I was just curious...


Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem

2017-04-05 Thread Sagi Grimberg



Note that the nvme completion queues are still on the host memory, so
this means we have lost the ordering between data and completions as
they go to different pcie targets.


Hmm, in this simple up/down case with a switch, I think it might
actually be OK.

Transactions might not complete at the NVMe device before the CPU
processes the RDMA completion, however due to the PCI-E ordering rules
new TLPs directed to the NVMe will complete after the RMDA TLPs and
thus observe the new data. (eg order preserving)

It would be very hard to use P2P if fabric ordering is not preserved..


I think it still can race if the p2p device is connected with more than
a single port to the switch.

Say it's connected via 2 legs, the bar is accessed from leg A and the
data from the disk comes via leg B. In this case, the data is heading
towards the p2p device via leg B (might be congested), the completion
goes directly to the RC, and then the host issues a read from the
bar via leg A. I don't understand what can guarantee ordering here.

Stephen told me that this still guarantees ordering, but I honestly
can't understand how, perhaps someone can explain to me in a simple
way that I can understand.


Re: [PATCH 2/5] nvme: cleanup nvme_req_needs_retry

2017-04-05 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH 4/5] nvme: move the retries count to struct nvme_request

2017-04-05 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH 3/5] nvme: mark nvme_max_retries static

2017-04-05 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH 1/5] nvme: move ->retries setup to nvme_setup_cmd

2017-04-05 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH 5/5] block, scsi: move the retries field to struct scsi_request

2017-04-05 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem

2017-04-04 Thread Sagi Grimberg



 u16 nvmet_copy_to_sgl(struct nvmet_req *req, off_t off, const void *buf,
size_t len)
 {
-   if (sg_pcopy_from_buffer(req->sg, req->sg_cnt, buf, len, off) != len)
+   bool iomem = req->p2pmem;
+   size_t ret;
+
+   ret = sg_copy_buffer(req->sg, req->sg_cnt, (void *)buf, len, off,
+false, iomem);
+
+   if (ret != len)
return NVME_SC_SGL_INVALID_DATA | NVME_SC_DNR;
+
return 0;
 }


We can never ever get here from an IO command, and that is a good thing
because it would have been broken if we did, regardless of what copy
method we use...

Note that the nvme completion queues are still on the host memory, so
this means we have lost the ordering between data and completions as
they go to different pcie targets.

If at all, this is the place to *emphasize* we must never get here
with p2pmem, and immediately fail if we do.

I'm not sure what will happen with copy_from_sgl; I guess we
have the same race because the nvme submission queues are also
in host memory (which is on a different pci target). Maybe
more likely to happen with write-combine enabled?

Anyway I don't think we have a real issue here *currently*, because
we use copy_to_sgl only for admin/fabrics commands emulation and
copy_from_sgl to setup dsm ranges...


Re: [RFC 4/8] p2pmem: Add debugfs "stats" file

2017-04-04 Thread Sagi Grimberg



+   p2pmem_debugfs_root = debugfs_create_dir("p2pmem", NULL);
+   if (!p2pmem_debugfs_root)
+   pr_info("could not create debugfs entry, continuing\n");
+


Why continue? I think it'd be better to just fail it.

Besides, this can be safely squashed into patch 1.


Re: [RFC 2/8] cxgb4: setup pcie memory window 4 and create p2pmem region

2017-04-04 Thread Sagi Grimberg



+static void setup_memwin_p2pmem(struct adapter *adap)
+{
+   unsigned int mem_base = t4_read_reg(adap, CIM_EXTMEM2_BASE_ADDR_A);
+   unsigned int mem_size = t4_read_reg(adap, CIM_EXTMEM2_ADDR_SIZE_A);
+
+   if (!use_p2pmem)
+   return;


This is weird, why even call this if !use_p2pmem?


+static int init_p2pmem(struct adapter *adapter)
+{
+   unsigned int mem_size = t4_read_reg(adapter, CIM_EXTMEM2_ADDR_SIZE_A);
+   struct p2pmem_dev *p;
+   int rc;
+   struct resource res;
+
+   if (!mem_size || !use_p2pmem)
+   return 0;


Again, weird...


Re: [RFC 3/8] nvmet: Use p2pmem in nvme target

2017-04-04 Thread Sagi Grimberg

Hey Logan,


We create a configfs attribute in each nvme-fabrics target port to
enable p2p memory use. When enabled, the port will only then use the
p2p memory if a p2p memory device can be found which is behind the
same switch as the RDMA port and all the block devices in use. If
the user enabled it an no devices are found, then the system will
silently fall back on using regular memory.


What should we do if we have more than a single device that satisfies
this? I'd say that it would be better to have the user ask for a
specific device and fail it if it doesn't meet the above conditions...


If appropriate, that port will allocate memory for the RDMA buffers
for queues from the p2pmem device falling back to system memory should
anything fail.


That's good :)


Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
save an extra PCI transfer as the NVME card could just take the data
out of its own memory. However, at this time, cards with CMB buffers
don't seem to be available.


Even if it was available, it would be hard to make real use of this
given that we wouldn't know how to pre-post recv buffers (for in-capsule
data). But let's leave this out of the scope entirely...


diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index ecc4fe8..7fd4840 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include <linux/p2pmem.h>
 #include 

 #include 
@@ -64,6 +65,7 @@ struct nvmet_rdma_rsp {
struct rdma_rw_ctx  rw;

struct nvmet_reqreq;
+   struct p2pmem_dev   *p2pmem;


Why do you need this? you have a reference to the
queue itself.


@@ -107,6 +109,8 @@ struct nvmet_rdma_queue {
int send_queue_size;

struct list_headqueue_list;
+
+   struct p2pmem_dev   *p2pmem;
 };

 struct nvmet_rdma_device {
@@ -185,7 +189,8 @@ nvmet_rdma_put_rsp(struct nvmet_rdma_rsp *rsp)
spin_unlock_irqrestore(&rsp->queue->rsps_lock, flags);
 }

-static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int nents)
+static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int nents,
+   struct p2pmem_dev *p2pmem)
 {
struct scatterlist *sg;
int count;
@@ -193,13 +198,17 @@ static void nvmet_rdma_free_sgl(struct scatterlist *sgl, 
unsigned int nents)
if (!sgl || !nents)
return;

-   for_each_sg(sgl, sg, nents, count)
-   __free_page(sg_page(sg));
+   for_each_sg(sgl, sg, nents, count) {
+   if (p2pmem)
+   p2pmem_free_page(p2pmem, sg_page(sg));
+   else
+   __free_page(sg_page(sg));
+   }
kfree(sgl);
 }

 static int nvmet_rdma_alloc_sgl(struct scatterlist **sgl, unsigned int *nents,
-   u32 length)
+   u32 length, struct p2pmem_dev *p2pmem)
 {
struct scatterlist *sg;
struct page *page;
@@ -216,7 +225,11 @@ static int nvmet_rdma_alloc_sgl(struct scatterlist **sgl, 
unsigned int *nents,
while (length) {
u32 page_len = min_t(u32, length, PAGE_SIZE);

-   page = alloc_page(GFP_KERNEL);
+   if (p2pmem)
+   page = p2pmem_alloc_page(p2pmem);
+   else
+   page = alloc_page(GFP_KERNEL);
+
if (!page)
goto out_free_pages;

@@ -231,7 +244,10 @@ static int nvmet_rdma_alloc_sgl(struct scatterlist **sgl, 
unsigned int *nents,
 out_free_pages:
while (i > 0) {
i--;
-   __free_page(sg_page(&sg[i]));
+   if (p2pmem)
+   p2pmem_free_page(p2pmem, sg_page(&sg[i]));
+   else
+   __free_page(sg_page(&sg[i]));
}
kfree(sg);
 out:
@@ -484,7 +500,8 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp 
*rsp)
}

if (rsp->req.sg != &rsp->cmd->inline_sg)
-   nvmet_rdma_free_sgl(rsp->req.sg, rsp->req.sg_cnt);
+   nvmet_rdma_free_sgl(rsp->req.sg, rsp->req.sg_cnt,
+   rsp->p2pmem);

if (unlikely(!list_empty_careful(&queue->rsp_wr_wait_list)))
nvmet_rdma_process_wr_wait_list(queue);
@@ -625,8 +642,16 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp 
*rsp,
if (!len)
return 0;

+   rsp->p2pmem = rsp->queue->p2pmem;
status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
-   len);
+   len, rsp->p2pmem);
+
+   if (status && rsp->p2pmem) {
+   rsp->p2pmem = NULL;
+   status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
+ len, rsp->p2pmem);
+   }
+


Not sure it's a good practice to rely on rsp->p2pmem not being NULL...
Would be nice if the allocation routines could hide it from us...



Re: [PATCH] lpfc: add missing Kconfig NVME dependencies

2017-02-22 Thread Sagi Grimberg



add missing Kconfig NVME dependencies

Can't believe I missed posting this

-- james


Heh, this sort of comment should come after
the '---' separator (below) unless you want it to live forever
in the git log...



Signed-off-by: James Smart 
---


[here]


 drivers/scsi/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
index d4023bf..2558434 100644
--- a/drivers/scsi/Kconfig
+++ b/drivers/scsi/Kconfig
@@ -1240,6 +1240,7 @@ config SCSI_LPFC
tristate "Emulex LightPulse Fibre Channel Support"
depends on PCI && SCSI
depends on SCSI_FC_ATTRS
+   depends on NVME_FC && NVME_TARGET_FC
select CRC_T10DIF
help
   This lpfc driver supports the Emulex LightPulse



Re: hch's native NVMe multipathing [was: Re: [PATCH 1/2] Don't blacklist nvme]

2017-02-16 Thread Sagi Grimberg



I'm fine with the path selectors getting moved out; maybe it'll
encourage new path selectors to be developed.

But there will need to be some userspace interface stood up to support
your native NVMe multipathing (you may not think it needed but think in
time there will be a need to configure _something_).  That is the
fragmentation I'm referring to.


I guess one config option that we'd need is multibus vs. failover
which are used per use-case.

I'd have to say that having something much simpler than multipath-tools
does sound appealing.


Re: [PATCH v3 00/16] lpfc: Add NVME Fabrics support

2017-02-15 Thread Sagi Grimberg

Hi James,


This patch set adds support for NVME over Fabrics FC transport
to lpfc

The internals of the driver are reworked to support being either:
 a SCSI initiator;
 an NVME initiator;
 both a SCSI initiator and an NVME initiator;
 or an NVME target.

The driver effectively has parallel NVME and SCSI stacks that
utilize their own set of resources. They intersect only at the
hardware level, mainly in queue create layers and interrupt handling.

A few new files are added to support the interfaces of the
FC transport LLDD api for NVME fabrics.

The patches were cut against 1/30 scsi.git tree, misc branch.
** THEY ARE INTENDED FOR THE SCSI.GIT TREE, MISC BRANCH **

The lpfc version in the linux-block.git tree is rather old. I have a
recipe for how to get it to a version that syncs with the
scsi.git/misc tree so that these patches can apply there as well.
Contact me if you would like it.


This set does not seem to apply cleanly on nvme-4.11 tree, looks
like patch 6 is failing.

Also, can you send your patchset threaded? git format-patch
usually generates threaded patches by default; not sure
how that is not the case with your set... It would make my
life a bit easier.

Thanks.


Re: [PATCH 1/1] iscsi: fix regression caused by session lock patch

2017-02-06 Thread Sagi Grimberg

Hey Chris and Guilherme,

I'm indeed not responsive under this email address.


Thanks for the testing, looks like you have the magic target to reproduce this.

I think this verifies Mike's idea of what was going wrong, and we're way
overdue to get this fixed upstream. Thanks to IBM for pushing this; I don't
think any major distro is shipping this patch, and we don't want to keep
having to back it out.

The options look like
1) back out the session lock changes that split it into two locks
2) add in the additional locking from this test patch
3) some other fix for the issue of targets that complete tasks oddly

I'm leaning to #1, as I don't want to keep adding more locks for this.


Thanks Chris! IIRC, the lock changes from Shlomo/Or are no longer in
RHEL, SLES, and Ubuntu, as you mentioned. We requested that they revert
the patch, and it was accepted.

On the other hand, your patch is great and a cool fix for this. If we
have any good numbers and/or reasons to keep their patch, I guess
alternative #2 is cool too. I can perform more testing if you plan to
send this (or a similar) patch to the iscsi list.



Sagi, Or, Shlomo?  You pushed to keep this from being backed out before.  
Here's your cause, any better ideas on fixing it?  I also tried to go back in 
the mailing list archives, but I don't see any real numbers for the performance 
gains.


I'll loop Sagi here based on the email I see he's using on NVMe list
currently - seems it's different from the one showed in the header of
this message.


IIRC, this was brought up more than two years ago? it's been
a while now.

The motivation for the fine-grained locking from Shlomo was
to address the submission/completion inter-locking
scheme that was not needed for iser.

In iser, task completions are triggered from soft-irq only for
task responses, the data-transfer is driven in HW, so we don't need
the inter-locking between submissions and task management or error
handling.

My recollection is that this scheme solved a contention point we had
back then, if I'm not mistaken it was as much as 50% improvement in
IOPs scalability in some scenarios.

Now, this was all pre block-mq. So I think the correct solution for
iscsi (iser, tcp and offloads) is to use block-mq facilities for
task pre-allocations (scsi host tagset) and have iscsi tcp take care
of its own locking instead of imposing it inherently in libiscsi.

We can have LOGIN, LOGOUT, NOOP_OUT, TEXT, TMR as reserved tags,
and queue_depth with max session cmds. I had a prototype for that
back when I experimented with the scsi-mq conversion (way back...),
but kinda got stuck trying to figure out how to convert the
offload drivers qla4xxx, bnx2i and cxgbi, which seemed to rely heavily
on the task pools.

If people are more interested in improving iscsi locking schemes we
can discuss approaches for it.


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-19 Thread Sagi Grimberg

Christoph suggest to me once that we can take a hybrid
approach where we consume a small amount of completions (say 4)
right away from the interrupt handler and if we have more
we schedule irq-poll to reap the rest. But back then it
didn't work better which is not aligned with my observations
that we consume only 1 completion per interrupt...

I can give it another go... What do people think about it?
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-19 Thread Sagi Grimberg



I think you missed:
http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007


I indeed did, thanks.


But it doesn't help.

We're still having to wait for the first interrupt, and if we're really
fast that's the only completion we have to process.

Try this:


diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b4b32e6..e2dd9e2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -623,6 +623,8 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
}
__nvme_submit_cmd(nvmeq, &cmnd);
spin_unlock(&nvmeq->sq_lock);
+   disable_irq_nosync(nvmeq_irq(irq));
+   irq_poll_sched(&nvmeq->iop);


a. This would trigger a condition where we disable the irq twice, which
is wrong, at least because it will generate a warning.

b. This would cause way too many ksoftirqd triggers. In order for
it to be effective we need it to run only when it should, and optimally
when the completion queue has a batch of completions waiting.

After a deeper analysis, I agree with Bart that interrupt coalescing is
needed for it to work. The problem with nvme coalescing, as Jens said, is
the death penalty of 100us granularity. Hannes, Johannes, how does it look
with the devices you are testing with?

Also, I think adaptive moderation is needed in order for it to
work well. I know that some networking drivers implemented adaptive
moderation in SW before having HW support for it. It can be done by
maintaining stats and having periodic work that looks at them and
changes the moderation parameters.

Does anyone think that this is something we should consider?


Re: [PATCH 3/4] nvme: use blk_rq_payload_bytes

2017-01-18 Thread Sagi Grimberg



@@ -1014,9 +1013,9 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue 
*queue,
}



Christoph, a little above here we still look at blk_rq_bytes(),
shouldn't that look at blk_rq_payload_bytes() too?


The check is ok for now as it's just zero vs non-zero.  It's somewhat
broken for Write Zeroes, though.  I've fixed it in this series
which I need to submit:


http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/write-zeroes-take-2



I see, thanks for the clarification.


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-18 Thread Sagi Grimberg



Hannes just spotted this:
static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 const struct blk_mq_queue_data *bd)
{
[...]
__nvme_submit_cmd(nvmeq, &cmnd);
nvme_process_cq(nvmeq);
spin_unlock_irq(&nvmeq->q_lock);
return BLK_MQ_RQ_QUEUE_OK;
out_cleanup_iod:
nvme_free_iod(dev, req);
out_free_cmd:
nvme_cleanup_cmd(req);
return ret;
}

So we're draining the CQ on submit. This of course makes polling for
completions in the IRQ handler rather pointless, as we already did it in
the submission path.


I think you missed:
http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-18 Thread Sagi Grimberg



Your report provided this stats with one-completion dominance for the
single-threaded case. Does it also hold if you run multiple fio
threads per core?


It's useless to run more threads on that core, it's already fully
utilized. That single thread is already posting a fair amount of
submissions, so I don't see how adding more fio jobs can help in any
way.


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-18 Thread Sagi Grimberg



So what you are saying is you saw consumed == 1 [1] most of the time?

[1] from 
http://git.infradead.org/nvme.git/commitdiff/eed5a9d925c59e43980047059fde29e3aa0b7836


Exactly. By processing 1 completion per interrupt it makes perfect sense
why this performs poorly; it's not worth paying the soft-irq schedule
cost for only a single completion.

What I'm curious is how consistent is this with different devices (wish
I had some...)


Re: [PATCH 3/4] nvme: use blk_rq_payload_bytes

2017-01-17 Thread Sagi Grimberg



@@ -1014,9 +1013,9 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue 
*queue,
}



Christoph, a little above here we still look at blk_rq_bytes(),
shouldn't that look at blk_rq_payload_bytes() too?


if (count == 1) {
-   if (rq_data_dir(rq) == WRITE &&
-   map_len <= nvme_rdma_inline_data_size(queue) &&
-   nvme_rdma_queue_idx(queue))
+   if (rq_data_dir(rq) == WRITE && nvme_rdma_queue_idx(queue) &&
+   blk_rq_payload_bytes(rq) <=
+   nvme_rdma_inline_data_size(queue))
return nvme_rdma_map_sg_inline(queue, req, c);

if (dev->pd->flags & IB_PD_UNSAFE_GLOBAL_RKEY)



Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-17 Thread Sagi Grimberg



So it looks like we are super not efficient because most of the
times we catch 1
completion per interrupt and the whole point is that we need to find
more! This fio
is single threaded with QD=32 so I'd expect that we be somewhere in
8-31 almost all
the time... I also tried QD=1024, histogram is still the same.


It looks like it takes you longer to submit an I/O than to service an
interrupt,


Well, with irq-poll we do practically nothing in the interrupt handler,
only schedule irq-poll. Note that the latency measures are only from
the point the interrupt arrives and the point we actually service it
by polling for completions.


so increasing queue depth in the singe-threaded case doesn't
make much difference. You might want to try multiple threads per core
with QD, say, 32


This is how I ran, QD=32.


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-17 Thread Sagi Grimberg



Just for the record, all tests you've run are with the upper irq_poll_budget of
256 [1]?


Yes, but that's the point, I never ever reach this budget because
I'm only processing 1-2 completions per interrupt.


We (Hannes and I) recently stumbled across this when trying to poll for more
than 256 queue entries in the drivers we've been testing.


What do you mean by stumbled? irq-poll should be agnostic to the fact
that drivers can poll more than their given budget?


Did your system load reduce with irq polling? In theory it should but I have
seen increases with AHCI at least according to fio. IIRC Hannes saw decreases
with his SAS HBA tests, as expected.


I didn't see any reduction. When I tested on a single cpu core (to
simplify for a single queue) the cpu was at 100% but I got fewer iops
(which makes sense, a single cpu core is not enough to max out the nvme
device, at least not the core I'm using). Before irqpoll I got
~230 KIOPs on a single cpu core and after irqpoll I got ~205 KIOPs,
which is consistent with the ~10% iops decrease I've reported in the
original submission.


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-17 Thread Sagi Grimberg

Oh, and the current code that was tested can be found at:

git://git.infradead.org/nvme.git nvme-irqpoll


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-17 Thread Sagi Grimberg

Hey, so I made some initial analysis of what's going on with
irq-poll.

First, I sampled how much time it takes from the interrupt
firing in nvme_irq to the initial visit
to nvme_irqpoll_handler. I ran a single-threaded fio
with QD=32 of 4K reads. Below are two samples of a
histogram of the latency (ns):
--
[1]
queue = b'nvme0q1'
 usecs   : count distribution
 0 -> 1  : 7310 ||
 2 -> 3  : 11   |  |
 4 -> 7  : 10   |  |
 8 -> 15 : 20   |  |
16 -> 31 : 0    |  |
32 -> 63 : 0    |  |
64 -> 127: 1    |  |


[2]
queue = b'nvme0q1'
 usecs   : count distribution
 0 -> 1  : 7309 ||
 2 -> 3  : 14   |  |
 4 -> 7  : 7    |  |
 8 -> 15 : 17   |  |


We can see that most of the time our latency is pretty good (<1ns) but with
huge tail latencies (some 8-15 ns and even one in 32-63 ns).
**NOTE, in order to reduce the tracing impact on performance I sampled
for every 100 interrupts.

I also sampled for a multiple threads/queues with QD=32 of 4K reads.
This is a collection of histograms for 5 queues (5 fio threads):
queue = b'nvme0q1'
 usecs   : count distribution
 0 -> 1  : 701  ||
 2 -> 3  : 177  |** |
 4 -> 7  : 56   |*** |
 8 -> 15 : 24   |* |
16 -> 31 : 6    |  |
32 -> 63 : 1    |  |

queue = b'nvme0q2'
 usecs   : count distribution
 0 -> 1  : 412  ||
 2 -> 3  : 52   |* |
 4 -> 7  : 19   |* |
 8 -> 15 : 13   |* |
16 -> 31 : 5    |  |

queue = b'nvme0q3'
 usecs   : count distribution
 0 -> 1  : 381  ||
 2 -> 3  : 74   |*** |
 4 -> 7  : 26   |** |
 8 -> 15 : 12   |* |
16 -> 31 : 3    |  |
32 -> 63 : 0    |  |
64 -> 127: 0    |  |
128 -> 255: 1   |  |

queue = b'nvme0q4'
 usecs   : count distribution
 0 -> 1  : 386  ||
 2 -> 3  : 63   |** |
 4 -> 7  : 30   |*** |
 8 -> 15 : 11   |* |
16 -> 31 : 7    |  |
32 -> 63 : 1    |  |

queue = b'nvme0q5'
 usecs   : count distribution
 0 -> 1  : 384  ||
 2 -> 3  : 69   |*** |
 4 -> 7  : 25   |** |
 8 -> 15 : 15   |* |
16 -> 31 : 3    |  |


Overall looks pretty much the same but some more samples with tails...

Next, I sampled how many completions we are able to consume per interrupt.
Here are two examples of histograms of how many completions we take per interrupt.
--
queue = b'nvme0q1'
 completed : count distribution
0  : 0||
1  : 11690||
2  : 46   ||
3  : 1||

queue = b'nvme0q1'
 completed : count distribution
0  : 0||
1  : 944  ||
2  : 8||
--

So it looks like we are super not efficient because most of the time we catch 1
completion per interrupt, and the whole point is that we need to find more! This
fio is single-threaded with QD=32 so I'd expect that we'd be somewhere in 8-31
almost all the time... I also tried QD=1024, the histogram is still the same.
**NOTE: Here I also sampled for every 100 interrupts.


I'll try to run the counter on the current nvme driver and see what I get.



I attached the bpf scripts I wrote (nvme-trace-irq, nvme-count-comps)
with hope that someone is interested enough to try and reproduce these
numbers on his/hers setup and maybe suggest some other useful tracing
we can do.

Prerequisites:
1. iovisor is needed for python bpf support.
  $ echo "deb [trusted=yes] https://repo.iovisor.org/apt/xenial 

Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-17 Thread Sagi Grimberg



--
[1]
queue = b'nvme0q1'
 usecs   : count distribution
 0 -> 1  : 7310 ||
 2 -> 3  : 11   |  |
 4 -> 7  : 10   |  |
 8 -> 15 : 20   |  |
16 -> 31 : 0|  |
32 -> 63 : 0|  |
64 -> 127: 1|  |

[2]
queue = b'nvme0q1'
 usecs   : count distribution
 0 -> 1  : 7309 ||
 2 -> 3  : 14   |  |
 4 -> 7  : 7|  |
 8 -> 15 : 17   |  |



Rrr, email made the histograms look funky (tabs vs. spaces...)
The count is what's important anyways...

Just adding that I used an Intel P3500 nvme device.


We can see that most of the time our latency is pretty good (<1ns) but with
huge tail latencies (some 8-15 ns and even one in 32-63 ns).


Obviously is micro-seconds and not nano-seconds (I wish...)


Re: [PATCH 4/4] sd: remove __data_len hack for WRITE SAME

2017-01-13 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH 3/4] nvme: use blk_rq_payload_bytes

2017-01-13 Thread Sagi Grimberg

This looks good,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH 2/4] scsi: use blk_rq_payload_bytes

2017-01-13 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH 1/4] block: add blk_rq_payload_bytes

2017-01-13 Thread Sagi Grimberg



Add a helper to calculate the actual data transfer size for special
payload requests.

Signed-off-by: Christoph Hellwig 
---
 include/linux/blkdev.h | 13 +
 1 file changed, 13 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ff3d774..1ca8e8f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1000,6 +1000,19 @@ static inline unsigned int blk_rq_cur_sectors(const 
struct request *rq)
return blk_rq_cur_bytes(rq) >> 9;
 }

+/*
+ * Some commands like WRITE SAME have a payload or data transfer size which


Don't you mean here DISCARD?


+ * is different from the size of the request.  Any driver that supports such
+ * commands using the RQF_SPECIAL_PAYLOAD flag needs to use this helper to
+ * calculate the data transfer size.
+ */
+static inline unsigned int blk_rq_payload_bytes(struct request *rq)
+{
+   if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
+   return rq->special_vec.bv_len;
+   return blk_rq_bytes(rq);
+}
+
 static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
 int op)
 {




Re: [Lsf-pc] [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-12 Thread Sagi Grimberg



**Note: when I ran multiple threads on more cpus the performance
degradation phenomenon disappeared, but I tested on a VM with
qemu emulation backed by null_blk so I figured I had some other
bottleneck somewhere (that's why I asked for some more testing).


That could be because of the vmexits as every MMIO access in the guest
triggers a vmexit and if you poll with a low budget you do more MMIOs hence
you have more vmexits.

Did you do testing only in qemu or with real H/W as well?


I tried once. IIRC, I saw the same phenomenons...


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-12 Thread Sagi Grimberg



I agree with Jens that we'll need some analysis if we want the
discussion to be effective, and I can spend some time on this if I
can find volunteers with high-end nvme devices (I only have access
to client nvme devices).


I have a P3700 but somehow burned the FW. Let me see if I can bring it back to
life.

I also have converted AHCI to the irq_poll interface and will run some tests.
I do also have some hpsa devices on which I could run tests once the driver is
adopted.

But can we come to a common testing methodology so we don't compare apples with
oranges? Sagi, do you still have the fio job file from your last tests lying
around somewhere, and if yes could you share it?


It's pretty basic:
--
[global]
group_reporting
cpus_allowed=0
cpus_allowed_policy=split
rw=randrw
bs=4k
numjobs=4
iodepth=32
runtime=60
time_based
loops=1
ioengine=libaio
direct=1
invalidate=1
randrepeat=1
norandommap
exitall

[job]
--

**Note: when I ran multiple threads on more cpus the performance
degradation phenomenon disappeared, but I tested on a VM with
qemu emulation backed by null_blk so I figured I had some other
bottleneck somewhere (that's why I asked for some more testing).

Note that I ran randrw because I was backed with null_blk, testing
with a real nvme device, you should either run randread or write, and
if you do a write, you can't run it multi-threaded (well you can, but
you'll get unpredictable performance...).


Re: [Lsf-pc] [LFS/MM TOPIC][LFS/MM ATTEND]: - Storage Stack and Driver Testing methodology.

2017-01-12 Thread Sagi Grimberg



Hi Folks,

I would like to propose a general discussion on Storage stack and device driver 
testing.


I think its very useful and needed.


Purpose:-
-
The main objective of this discussion is to address the need for
a Unified Test Automation Framework which can be used by different subsystems
in the kernel in order to improve the overall development and stability
of the storage stack.

For Example:-
From my previous experience: I worked on NVMe driver testing last year and we
developed a simple unit test framework
(https://github.com/linux-nvme/nvme-cli/tree/master/tests).
In the current implementation the upstream NVMe driver supports the following
subsystems:
1. PCI Host.
2. RDMA Target.
3. Fibre Channel Target (in progress).
Today, due to the lack of a centralized automated test framework, NVMe driver
testing is scattered and performed using a combination of various utilities
like nvme-cli/tests, nvmet-cli, and shell scripts
(git://git.infradead.org/nvme-fabrics.git nvmf-selftests).

In order to improve overall driver stability across these subsystems, it would
be beneficial to have a Unified Test Automation Framework (UTAF) which will
centralize overall testing.

This topic will allow developers from various subsystems to engage in a
discussion about how to collaborate efficiently instead of having discussions
on lengthy email threads.


While a unified test framework for all sounds great, I suspect that the
differences might be too large. So I think that for this framework to be
maintainable, it needs to be carefully designed such that we don't have
too much code churn.

For example we should start by classifying tests and then see where
sharing is feasible:

1. basic management - I think not a lot can be shared
2. spec compliance - again, not much sharing here
3. data-verification - probably everything can be shared
4. basic performance - probably a lot can be shared
5. vectored-io - probably everything can be shared
6. error handling - I can think of some sharing that can be used.

This repository can also store some useful tracing scripts (ebpf and
friends) that are useful for performance analysis.

So I think that for this to happen, we can start with the shared
test under block/, then migrate proto specific tests into
scsi/, nvme/, and then add transport specific tests so
we can have something like:

├── block
├── lib
├── nvme
│   ├── fabrics
│   │   ├── loop
│   │   └── rdma
│   └── pci
└── scsi
├── fc
└── iscsi

Thoughts?


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-12 Thread sagi grimberg
 A typical Ethernet network adapter delays the generation of an
 interrupt
 after it has received a packet. A typical block device or HBA does not
 delay
 the generation of an interrupt that reports an I/O completion.
>>>
>>> NVMe allows for configurable interrupt coalescing, as do a few modern
>>> SCSI HBAs.
>>
>> Essentially every modern SCSI HBA does interrupt coalescing; otherwise
>> the queuing interface won't work efficiently.
>
> Hello Hannes,
>
> The first e-mail in this e-mail thread referred to measurements against a
> block device for which interrupt coalescing was not enabled. I think that
> the measurements have to be repeated against a block device for which
> interrupt coalescing is enabled.

Hey Bart,

I see how interrupt coalescing can help, but even without it, I think it
should be better.

Moreover, I don't think that strict moderation is something that can
work. The only way interrupt moderation can be effective, is if it's
adaptive and adjusts itself to the workload. Note that this feature
is on by default in most of the modern Ethernet devices (adaptive-rx).

IMHO, irq-poll vs. interrupt polling should be compared without relying
on the underlying device capabilities.


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-12 Thread Sagi Grimberg



I'd like to attend LSF/MM and would like to discuss polling for block
drivers.

Currently there is blk-iopoll but it is neither as widely used as NAPI in
the networking field, and according to Sagi's findings in [1] performance
with polling is not on par with IRQ usage.

On LSF/MM I'd like to discuss whether it is desirable to have NAPI-like
polling in more block drivers and how to overcome the currently seen
performance issues.

[1] http://lists.infradead.org/pipermail/linux-nvme/2016-October/006975.html


A typical Ethernet network adapter delays the generation of an interrupt
after it has received a packet. A typical block device or HBA does not delay
the generation of an interrupt that reports an I/O completion. I think that
is why polling is more effective for network adapters than for block
devices. I'm not sure whether it is possible to achieve benefits similar to
NAPI for block devices without implementing interrupt coalescing in the
block device firmware. Note: for block device implementations that use the
RDMA API, the RDMA API supports interrupt coalescing (see also
ib_modify_cq()).


Hey Bart,

I don't agree that interrupt coalescing is the reason why irq-poll is
not suitable for nvme or storage devices.

First, when the nvme device fires an interrupt, the driver consumes
the completion(s) from the interrupt context (usually there will be some more
completions waiting in the cq by the time the host starts processing it).
With irq-poll, we disable further interrupts and schedule a soft-irq for
processing, which, if anything, improves the completions-per-interrupt
utilization (because it takes slightly longer before we process the cq).

Moreover, irq-poll is budgeting the completion queue processing which is
important for a couple of reasons.

1. it prevents hard-irq context abuse like we do today. If other cpu
   cores are pounding the same queue with more submissions, we might
   get into a hard-lockup (which I've seen happening).

2. irq-poll maintains fairness between devices by correctly budgeting
   the processing of different completions queues that share the same
   affinity. This can become crucial when working with multiple nvme
   devices, each has multiple io queues that share the same IRQ
   assignment.

3. It reduces (or at least should reduce) the overall number of
   interrupts in the system because we only enable interrupts again
   when the completion queue is completely processed.

So overall, I think it's very useful for nvme and other modern HBAs,
but unfortunately, other than solving (1), I wasn't able to see a
performance improvement but rather a slight regression, and I can't
explain where it's coming from...


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-12 Thread Sagi Grimberg



Hi all,

I'd like to attend LSF/MM and would like to discuss polling for block drivers.

Currently there is blk-iopoll but it is neither as widely used as NAPI in the
networking field, and according to Sagi's findings in [1] performance with
polling is not on par with IRQ usage.

On LSF/MM I'd like to discuss whether it is desirable to have NAPI-like polling
in more block drivers and how to overcome the currently seen performance issues.


It would be an interesting topic to discuss, as it is a shame that blk-iopoll
isn't used more widely.


Forgot to mention - it should only be a topic if experimentation has
been done and results gathered to pinpoint what the issues are, so we
have something concrete to discuss. I'm not at all interested in a hand
wavy discussion on the topic.



Hey all,

Indeed I attempted to convert nvme to use irq-poll (let's use its
new name) but experienced some unexplained performance degradations.

Keith reported a 700ns degradation for QD=1 with his Xpoint devices;
this sort of degradation is acceptable I guess because we do schedule
a soft-irq before consuming the completion, but I noticed ~10% IOPs
degradation for QD=32 which is not acceptable.

I agree with Jens that we'll need some analysis if we want the
discussion to be effective, and I can spend some time on this if I
can find volunteers with high-end nvme devices (I only have access
to client nvme devices).

I can add debugfs statistics on the average number of completions I
consume per interrupt, and I can also trace the interrupt and the
soft-irq start/end. Any other interesting stats I can add?

I also tried a hybrid mode where the first 4 completions were handled
in the interrupt and the rest in soft-irq but that didn't make much
of a difference.

Any other thoughts?


Re: [PATCH 04/12] target: avoid to access .bi_vcnt directly

2016-11-12 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH v5 12/14] SRP transport, scsi-mq: Wait for .queue_rq() if necessary

2016-11-01 Thread Sagi Grimberg

and again,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH v5 13/14] nvme: Fix a race condition related to stopping queues

2016-11-01 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH v5 11/14] SRP transport: Move queuecommand() wait code to SCSI core

2016-11-01 Thread Sagi Grimberg

Again,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH v5 08/14] blk-mq: Add a kick_requeue_list argument to blk_mq_requeue_request()

2016-11-01 Thread Sagi Grimberg

Looks useful,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH v5 07/14] blk-mq: Introduce blk_mq_quiesce_queue()

2016-11-01 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH v5 06/14] blk-mq: Remove blk_mq_cancel_requeue_work()

2016-11-01 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH v5 05/14] blk-mq: Avoid that requeueing starts stopped queues

2016-11-01 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH 2/3] scsi: allow LLDDs to expose the queue mapping to blk-mq

2016-11-01 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH] Avoid that SCSI device removal through sysfs triggers a deadlock

2016-10-27 Thread Sagi Grimberg

Hey Bart,


The solution I prefer is to modify the SCSI scanning code such that
the scan_mutex is only held while performing the actual LUN scanning
and while ensuring that no SCSI device has been created yet for a
certain LUN number but not while the Linux device and its sysfs
attributes are created. Since that approach would require extensive
changes in the SCSI scanning code, another approach has been chosen,
namely to make self-removal asynchronous. This patch avoids that
self-removal triggers the following deadlock:


Is this a real deadlock, or just a lockdep complaint?

Wouldn't making scsi_remove_device() take a single-depth
mutex_lock_nested suffice here?


Re: [PATCH 11/12] nvme: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code

2016-10-27 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH 01/12] blk-mq: Do not invoke .queue_rq() for a stopped queue

2016-10-27 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: A question regarding "multiple SGL"

2016-10-27 Thread Sagi Grimberg



Hi Robert,


Hey Robert, Christoph,


please explain your use cases that isn't handled.  The one and only
reason to set MSDBD to 1 is to make the code a lot simpler given that
there is no real use case for supporting more.

RDMA uses memory registrations to register large and possibly
discontiguous data regions for a single rkey, aka single SGL descriptor
in NVMe terms.  There would be two reasons to support multiple SGL
descriptors:  a) to support a larger I/O size than supported by a single
MR, or b) to support a data region format not mappable by a single
MR.

iSER only supports a single rkey (or stag in IETF terminology) and has
been doing fine on a) and mostly fine on b).   There are a few possible
data layouts not supported by the traditional IB/iWarp FR WRs, but the
limit is in fact exactly the same as imposed by the NVMe PRPs used for
PCIe NVMe devices, so the Linux block layer has support to not generate
them.  Also with modern Mellanox IB/RoCE hardware we can actually
register completely arbitrary SGLs.  iSER supports using this registration
mode already with a trivial code addition, but for NVMe we didn't have a
pressing need yet.


Good explanation :)

The IO transfer size is a bit more pressing on some devices (e.g.
cxgb3/4) where the number of pages per MR can indeed be lower than
a reasonable transfer size (Steve can correct me if I'm wrong).

However, if there is a real demand for this we'll happily accept
patches :)

Just a note, having this feature in place can bring unexpected behavior
depending on how we implement it:
- If we use multiple MRs per IO (for multiple SGLs) we can either
prepare for the worst case and allocate enough MRs to satisfy the
various IO patterns. This will be much heavier in terms of resource
allocation and can limit the scalability of the host driver.
- Or we can implement a shared MR pool with a reasonable number of MRs.
In this case each IO can consume one or more MRs at the expense of
other IOs, so we may need to requeue an IO until enough MRs are
available to satisfy it. This can yield some unexpected performance
gaps for some workloads.

Cheers,
Sagi.


Re: [PATCH 02/12] blk-mq: Introduce blk_mq_hctx_stopped()

2016-10-27 Thread Sagi Grimberg

Looks fine,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH 04/12] blk-mq: Move more code into blk_mq_direct_issue_request()

2016-10-27 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH 10/12] SRP transport, scsi-mq: Wait for .queue_rq() if necessary

2016-10-27 Thread Sagi Grimberg

Thanks for moving it,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH 09/12] SRP transport: Move queuecommand() wait code to SCSI core

2016-10-27 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH v2 4/7] blk-mq: Introduce blk_quiesce_queue() and blk_resume_queue()

2016-10-05 Thread Sagi Grimberg



Hello Ming,

Can you have a look at the attached patch? That patch uses an srcu read
lock for all queue types, whether or not the BLK_MQ_F_BLOCKING flag has
been set. Additionally, I have dropped the QUEUE_FLAG_QUIESCING flag.
Just like previous versions, this patch has been tested.


Hey Bart,

Do we care about the synchronization of queue_rq and/or
blk_mq_run_hw_queue if the hctx is not stopped?

I'm wondering if we can avoid introducing new barriers in the
submission path if it's not absolutely needed.


Hello Sagi,


Hey Bart,



I'm not sure whether the new blk_quiesce_queue() function is useful
without stopping all hardware contexts first. In other words, in my view
setting BLK_MQ_F_BLOCKING flag before calling blk_quiesce_queue() is
sufficient and I don't think that a new QUEUE_FLAG_QUIESCING flag is
necessary.


I was referring to whether we can take the srcu read lock in the
submission path conditionally on the hctx being STOPPED?


Re: [PATCH v2 4/7] blk-mq: Introduce blk_quiesce_queue() and blk_resume_queue()

2016-10-05 Thread Sagi Grimberg



Hello Ming,

Can you have a look at the attached patch? That patch uses an srcu read
lock for all queue types, whether or not the BLK_MQ_F_BLOCKING flag has
been set. Additionally, I have dropped the QUEUE_FLAG_QUIESCING flag.
Just like previous versions, this patch has been tested.


Hey Bart,

Do we care about the synchronization of queue_rq and/or
blk_mq_run_hw_queue if the hctx is not stopped?

I'm wondering if we can avoid introducing new barriers in the
submission path if it's not absolutely needed.


Re: [PATCH v2 1/7] blk-mq: Introduce blk_mq_queue_stopped()

2016-10-05 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH v2 3/7] [RFC] nvme: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code

2016-10-05 Thread Sagi Grimberg

Make nvme_requeue_req() check BLK_MQ_S_STOPPED instead of
QUEUE_FLAG_STOPPED. Remove the QUEUE_FLAG_STOPPED manipulations
that became superfluous because of this change. This patch fixes
a race condition: using queue_flag_clear_unlocked() is not safe
if any other function that manipulates the queue flags can be
called concurrently, e.g. blk_cleanup_queue(). Untested.


This looks good to me, but I know keith had all sort of
creative ways to challenge this are so I'd wait for his
input...


Re: [PATCH v2 7/7] [RFC] nvme: Fix a race condition

2016-10-05 Thread Sagi Grimberg



Avoid that nvme_queue_rq() is still running when nvme_stop_queues()
returns. Untested.

Signed-off-by: Bart Van Assche <bart.vanass...@sandisk.com>
Cc: Keith Busch <keith.bu...@intel.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Sagi Grimberg <s...@grimberg.me>


Bart this looks really good! and possibly fixes an issue
I've been chasing with fabrics a while ago. I'll take it
for testing but you can add my:

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH v2 6/7] SRP transport: Port srp_wait_for_queuecommand() to scsi-mq

2016-10-05 Thread Sagi Grimberg



+static void srp_mq_wait_for_queuecommand(struct Scsi_Host *shost)
+{
+   struct scsi_device *sdev;
+   struct request_queue *q;
+
+   shost_for_each_device(sdev, shost) {
+   q = sdev->request_queue;
+
+   blk_mq_quiesce_queue(q);
+   blk_mq_resume_queue(q);
+   }
+}
+


This *should* live in scsi_lib.c. I suspect that
various drivers would really want this functionality.

Thoughts?


Re: [PATCH 1/5] nvme-fabrics: Add FC transport FC-NVME definitions

2016-08-16 Thread Sagi Grimberg



Looks fine,

Reviewed-by: Christoph Hellwig 


Hey James,

Can you collect review tags, address CR comments and resend the series?
I'd like to stage these for 0-day testing and try to get it into 4.9.

Thanks,
Sagi.


Re: IB/isert: Return value of iser target transport handlers ignored by iscsi target

2016-08-07 Thread Sagi Grimberg

Hi,


Hi Baharat,


In iSER target during iwarp connection tear-down due to ping timeouts, the rdma
queues are set to error state and subsequent queued iscsi session commands posted
shall fail with the corresponding errno returned by ib_post_send/recv.
At this stage iser target handlers (Ex: isert_put_datain())
return the error to the iscsi target, but these errors are
not being handled by the iscsi target handlers(Ex: lio_queue_status()).


Indeed looks like lio_queue_data_in() assumes that
->iscsit_queue_data_in() cannot fail. This either means
that isert_put_data_in() should take care of
the extra put in this case, or the iscsi layer should correctly
handle a queue_full condition (because we should not be hitting
this in the normal IO flow).

Does this (untested) patch help?
--
diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
index a990c04208c9..ff5dfc0f7c50 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -1810,6 +1810,7 @@ isert_put_response(struct iscsi_conn *conn, struct iscsi_cmd *cmd)
 	struct ib_send_wr *send_wr = &isert_cmd->tx_desc.send_wr;
 	struct iscsi_scsi_rsp *hdr = (struct iscsi_scsi_rsp *)
 				&isert_cmd->tx_desc.iscsi_header;
+	int ret;
 
 	isert_create_send_desc(isert_conn, isert_cmd, &isert_cmd->tx_desc);
 	iscsit_build_rsp_pdu(cmd, conn, true, hdr);
@@ -1848,7 +1849,10 @@ isert_put_response(struct iscsi_conn *conn, struct iscsi_cmd *cmd)
 
 	isert_dbg("Posting SCSI Response\n");
 
-	return isert_post_response(isert_conn, isert_cmd);
+	ret = isert_post_response(isert_conn, isert_cmd);
+	if (ret)
+		target_put_sess_cmd(&cmd->se_cmd);
+	return 0;
 }
 
 static void
@@ -2128,7 +2132,7 @@ isert_put_datain(struct iscsi_conn *conn, struct iscsi_cmd *cmd)
 	struct isert_conn *isert_conn = conn->context;
 	struct ib_cqe *cqe = NULL;
 	struct ib_send_wr *chain_wr = NULL;
-	int rc;
+	int ret;
 
 	isert_dbg("Cmd: %p RDMA_WRITE data_length: %u\n",
 		  isert_cmd, se_cmd->data_length);
@@ -2148,34 +2152,41 @@ isert_put_datain(struct iscsi_conn *conn, struct iscsi_cmd *cmd)
 		isert_init_send_wr(isert_conn, isert_cmd,
 				   &isert_cmd->tx_desc.send_wr);
 
-		rc = isert_post_recv(isert_conn, isert_cmd->rx_desc);
-		if (rc) {
-			isert_err("ib_post_recv failed with %d\n", rc);
-			return rc;
+		ret = isert_post_recv(isert_conn, isert_cmd->rx_desc);
+		if (ret) {
+			isert_err("ib_post_recv failed with %d\n", ret);
+			goto out;
 		}
 
 		chain_wr = &isert_cmd->tx_desc.send_wr;
 	}
 
-	isert_rdma_rw_ctx_post(isert_cmd, isert_conn, cqe, chain_wr);
+	ret = isert_rdma_rw_ctx_post(isert_cmd, isert_conn, cqe, chain_wr);
 	isert_dbg("Cmd: %p posted RDMA_WRITE for iSER Data READ\n", isert_cmd);
 
-	return 1;
+out:
+	if (ret)
+		target_put_sess_cmd(&cmd->se_cmd);
+	return 0;
 }
 
 static int
 isert_get_dataout(struct iscsi_conn *conn, struct iscsi_cmd *cmd, bool recovery)
 {
 	struct isert_cmd *isert_cmd = iscsit_priv_cmd(cmd);
+	int ret;
 
 	isert_dbg("Cmd: %p RDMA_READ data_length: %u write_data_done: %u\n",
 		  isert_cmd, cmd->se_cmd.data_length, cmd->write_data_done);
 
 	isert_cmd->tx_desc.tx_cqe.done = isert_rdma_read_done;
-	isert_rdma_rw_ctx_post(isert_cmd, conn->context,
+	ret = isert_rdma_rw_ctx_post(isert_cmd, conn->context,
 			       &isert_cmd->tx_desc.tx_cqe, NULL);
 
 	isert_dbg("Cmd: %p posted RDMA_READ memory for ISER Data WRITE\n", isert_cmd);
+out:
+	if (ret)
+		target_put_sess_cmd(&cmd->se_cmd);
 	return 0;
 }
--



-> While closing the session in case of ping timeouts, 
isert_close_connection()->
isert_wait_conn()->isert_wait4cmds() checks for the queued session commands
and waits infinitely for command completion 'cmd_wait_comp' in 
target_wait_for_sess_cmds().
'cmd_wait_comp' will be never complete as the kref on the session command is
not derefed, due to which target_release_cmd_kref() is not called by kref_put().
Due to this, the older session is not cleared causing the next login 
negotiation to fail
as the older session is still active(Older SID exists).


Makes sense...


[Query 1] If the return value of ib_post_send/recv() are handled to deref the
corresponding queued session commands, the wait on cmd_wait_comp will be
complete and clears off the session successfully. Is this the rightway to
do it here?


I think that given that ->iscsit_queue_data_in() and
->iscsit_queue_status() are never expected to fail we probably
should take care of it in isert...

Nic, any input on the iscsit side?



[Query 2] An extra deref is done in case of 

Re: [PATCH v2 1/3] block: provide helpers for reading block count

2016-06-23 Thread Sagi Grimberg

Looks good, for the series:

Reviewed-by: Sagi Grimberg


Re: Connect-IB not performing as well as ConnectX-3 with iSER

2016-06-22 Thread Sagi Grimberg

Let me see if I get this correct:


4.5.0_rc3_1aaa57f5_00399

sdc;10.218.128.17;4627942;1156985;18126
sdf;10.218.202.17;4590963;1147740;18272
sdk;10.218.203.17;4564980;1141245;18376
sdn;10.218.204.17;4571946;1142986;18348
sdd;10.219.128.17;4591717;1147929;18269
sdi;10.219.202.17;4505644;1126411;18618
sdg;10.219.203.17;4562001;1140500;18388
sdl;10.219.204.17;4583187;1145796;18303
sde;10.220.128.17;5511568;1377892;15220
sdh;10.220.202.17;551;137;15209
sdj;10.220.203.17;5609983;1402495;14953
sdm;10.220.204.17;5509035;1377258;15227


In 1aaa57f5 you get on CIB ~115K IOPs per sd device
and on CX3 you get around 140K IOPs per sd device.



Mlx5_0;sde;3593013;898253;23347 100% CPU kworker/u69:2
Mlx5_0;sdd;3588555;897138;23376 100% CPU kworker/u69:2
Mlx4_0;sdc;3525662;881415;23793 100% CPU kworker/u68:0


Is this on the host or the target?


4.5.0_rc5_7861728d_1
sdc;10.218.128.17;3747591;936897;22384
sdf;10.218.202.17;3750607;937651;22366
sdh;10.218.203.17;3750439;937609;22367
sdn;10.218.204.17;3771008;942752;22245
sde;10.219.128.17;3867678;966919;21689
sdg;10.219.202.17;3781889;945472;22181
sdk;10.219.203.17;3791804;947951;22123
sdl;10.219.204.17;3795406;948851;22102
sdd;10.220.128.17;5039110;1259777;16647
sdi;10.220.202.17;4992921;1248230;16801
sdj;10.220.203.17;5015610;1253902;16725
Sdm;10.220.204.17;5087087;1271771;16490


In 7861728d you get on CIB ~95K IOPs per sd device
and on CX3 you get around 125K IOPs per sd device.

I don't see any difference in the code around iser/isert,
in fact, I don't see any commit in drivers/infiniband




Mlx5_0;sde;2930722;732680;28623 ~98% CPU kworker/u69:0
Mlx5_0;sdd;2910891;727722;28818 ~98% CPU kworker/u69:0
Mlx4_0;sdc;3263668;815917;25703 ~98% CPU kworker/u68:0


Again, host or target?


4.5.0_rc5_f81bf458_00018
sdb;10.218.128.17;5023720;1255930;16698
sde;10.218.202.17;5016809;1254202;16721
sdj;10.218.203.17;5021915;1255478;16704
sdk;10.218.204.17;5021314;1255328;16706
sdc;10.219.128.17;4984318;1246079;16830
sdf;10.219.202.17;4986096;1246524;16824
sdh;10.219.203.17;5043958;1260989;16631
sdm;10.219.204.17;5032460;1258115;16669
sdd;10.220.128.17;3736740;934185;22449
sdg;10.220.202.17;3728767;932191;22497
sdi;10.220.203.17;3752117;938029;22357
Sdl;10.220.204.17;3763901;940975;22287


In f81bf458 you get on CIB ~125K IOPs per sd device
and on CX3 you get around 93K IOPs per sd device which
is the other way around? CIB is better than CX3?

The commits in this gap are:
f81bf458208e iser-target: Separate flows for np listeners and connections cma events
aea92980601f iser-target: Add new state ISER_CONN_BOUND to isert_conn
b89a7c25462b iser-target: Fix identification of login rx descriptor type

None of those should affect the data-path.



Srpt keeps crashing couldn't test

4.5.0_rc5_5adabdd1_00023
Sdc;10.218.128.17;3726448;931612;22511 ~97% CPU kworker/u69:4
sdf;10.218.202.17;3750271;937567;22368
sdi;10.218.203.17;3749266;937316;22374
sdj;10.218.204.17;3798844;949711;22082
sde;10.219.128.17;3759852;939963;22311 ~97% CPU kworker/u69:4
sdg;10.219.202.17;3772534;943133;22236
sdl;10.219.203.17;3769483;942370;22254
sdn;10.219.204.17;3790604;947651;22130
sdd;10.220.128.17;5171130;1292782;16222 ~96% CPU kworker/u68:3
sdh;10.220.202.17;5105354;1276338;16431
sdk;10.220.203.17;4995300;1248825;16793
sdm;10.220.204.17;4959564;1239891;16914


In 5adabdd1 you get on CIB ~94K IOPs per sd device
and on CX3 you get around 130K IOPs per sd device
which means you flipped again (very strange).

The commits in this gap are:
5adabdd122e4 iser-target: Split and properly type the login buffer
ed1083b251f0 iser-target: Remove ISER_RECV_DATA_SEG_LEN
26c7b673db57 iser-target: Remove impossible condition from isert_wait_conn
69c48846f1c7 iser-target: Remove redundant wait in release_conn
6d1fba0c2cc7 iser-target: Rework connection termination

Again, none are suspected to implicate the data-plane.


Srpt crashes

4.5.0_rc5_07b63196_00027
sdb;10.218.128.17;3606142;901535;23262
sdg;10.218.202.17;3570988;892747;23491
sdf;10.218.203.17;3576011;894002;23458
sdk;10.218.204.17;3558113;889528;23576
sdc;10.219.128.17;3577384;894346;23449
sde;10.219.202.17;3575401;893850;23462
sdj;10.219.203.17;3567798;891949;23512
sdl;10.219.204.17;3584262;896065;23404
sdd;10.220.128.17;4430680;1107670;18933
sdh;10.220.202.17;4488286;1122071;18690
sdi;10.220.203.17;4487326;1121831;18694
sdm;10.220.204.17;4441236;1110309;1


In 07b63196 you get on CIB ~89K IOPs per sd device
and on CX3 you get around 112K IOPs per sd device

The commits in this gap are:
e3416ab2d156 iser-target: Kill the ->isert_cmd back pointer in struct iser_tx_desc
d1ca2ed7dcf8 iser-target: Kill struct isert_rdma_wr
9679cc51eb13 iser-target: Convert to new CQ API

Which do effect the data-path, but nothing that can explain
a specific CIB issue. Moreover, the perf drop happened before that.


Srpt crashes

4.5.0_rc5_5e47f198_00036
sdb;10.218.128.17;3519597;879899;23834
sdi;10.218.202.17;3512229;878057;23884
sdh;10.218.203.17;3518563;879640;23841
sdk;10.218.204.17;3582119;895529;23418
sdd;10.219.128.17;3550883;887720;23624
sdj;10.219.202.17;3558415;889603;23574
sde;10.219.203.17;3552086;888021;23616
sdl;10.219.204.17;3579521;894880;23435
sdc;10.220.128.17;4532912;1133228;18506
sdf;10.220.202.17;4558035;1139508;18404
sdg;10.220.203.17;4601035;1150258;18232
sdm;10.220.204.17;4548150;1137037;18444

srpt crashes

4.6.2 vanilla default config
sde;10.218.128.17;3431063;857765;24449
sdf;10.218.202.17;3360685;840171;24961
sdi;10.218.203.17;3355174;838793;25002
sdm;10.218.204.17;3360955;840238;24959
sdd;10.219.128.17;3337288;834322;25136
sdh;10.219.202.17;3327492;831873;25210
sdj;10.219.203.17;3380867;845216;24812
sdk;10.219.204.17;3418340;854585;24540
sdc;10.220.128.17;4668377;1167094;17969
sdg;10.220.202.17;4716675;1179168;17785
sdl;10.220.203.17;4675663;1168915;17941
sdn;10.220.204.17;4631519;1157879;18112

Mlx5_0;sde;3390021;847505;24745 ~98% CPU kworker/u69:3
Mlx5_0;sdd;3207512;801878;26153 ~98% CPU kworker/u69:3
Mlx4_0;sdc;2998072;749518;27980 ~98% CPU kworker/u68:0

4.7.0_rc3_5edb5649
sdc;10.218.128.17;3260244;815061;25730
sdg;10.218.202.17;3405988;851497;24629
sdh;10.218.203.17;3307419;826854;25363
sdm;10.218.204.17;3430502;857625;24453
sdi;10.219.128.17;3544282;886070;23668
sdj;10.219.202.17;3412083;853020;24585
sdk;10.219.203.17;3422385;855596;24511
sdl;10.219.204.17;3444164;861041;24356
sdb;10.220.128.17;4803646;1200911;17463
sdd;10.220.202.17;4832982;1208245;17357
sde;10.220.203.17;4809430;1202357;17442
sdf;10.220.204.17;4808878;1202219;17444

mlx5_0;sdd;2986864;746716;28085
mlx5_0;sdc;2963648;740912;28305
mlx4_0;sdb;3317228;829307;25288

Thanks,

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Jun 21, 2016 at 8:50 AM, Robert LeBlanc <rob...@leblancnet.us> wrote:

Sagi,

I'm working to implement SRP (I think I got it all working) to test
some of the commits. I can try TGT afterwards and the commit you
mention. I haven't been watching the CPU lately, but before when I was
doing a lot of testing, there wasn't any one thread that was at 100%.
There are several threads that have high utilization, but none 100%
and there is plenty of CPU capacity available (32 cores). I can
capture some of that data if it is helpful. I did test 4.7_rc3 on
Friday, but it didn't change much, is that "new" enough?

4.7.0_rc3_5edb5649
sdc;10.218.128.17;3260244;815061;25730
sdg;10.218.202.17;3405988;851497;24629
sdh;10.218.203.17;3307419;826854;25363
sdm;10.218.204.17;3430502;857625;24453
sdi;10.219.128.17;3544282;886070;23668
sdj;10.219.202.17;3412083;853020;24585
sdk;10.219.203.17;3422385;855596;24511
sdl;10.219.204.17;3444164;861041;24356
sdb;10.220.128.17;4803646;1200911;17463
sdd;10.220.202.17;4832982;1208245;17357
sde;10.220.203.17;4809430;1202357;17442
sdf;10.220.204.17;4808878;1202219;17444

Thanks for the suggestions, I'll work to get some of the requested
data back to you guys quickly.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Jun 21, 2016 at 7:08 AM, Sagi Grimberg <sagig...@gmail.com> wrote:

Hey Robert,


I narrowed the performance degradation to this series
7861728..5e47f19, but while trying to bisect it, the changes were
erratic between each commit that I could not figure out exactly which
introduced the issue. If someone could give me some pointers on what
to do, I can keep trying to dig through this.



This bisection brings suspects:

e3416ab2d156 iser-target: Kill the ->isert_cmd back pointer in struct iser_tx_desc
d1ca2ed7dcf8 iser-target: Kill struct isert_rdma_wr
9679cc51eb13 iser-target: Convert to new CQ API
5adabdd122e4 iser-target: Split and properly type the login buffer
ed1083b251f0 iser-target: Remove ISER_RECV_DATA_SEG_LEN
26c7b673db57 iser-target: Remove impossible condition from isert_wait_conn
69c48846f1c7 iser-target: Remove redundant wait in release_conn
6d1fba0c2cc7 iser-target: Rework connection termination
f81bf458208e iser-target: Separate flows for np listeners and connections cma events
aea92980601f iser-target: Add new state ISER_CONN_BOUND to isert_conn
b89a7c25462b iser-target: Fix identification of login rx descriptor type

However I don't really see performance implications in these patches,
not to mention something that would affect on ConnectIB...

Given that your bisection brings up target side patches, I have
a couple questions:

1. Are the

Re: Connect-IB not performing as well as ConnectX-3 with iSER

2016-06-21 Thread Sagi Grimberg

Hey Robert,


I narrowed the performance degradation to this series
7861728..5e47f19, but while trying to bisect it, the changes were
erratic between each commit that I could not figure out exactly which
introduced the issue. If someone could give me some pointers on what
to do, I can keep trying to dig through this.


This bisection brings suspects:

e3416ab2d156 iser-target: Kill the ->isert_cmd back pointer in struct iser_tx_desc
d1ca2ed7dcf8 iser-target: Kill struct isert_rdma_wr
9679cc51eb13 iser-target: Convert to new CQ API
5adabdd122e4 iser-target: Split and properly type the login buffer
ed1083b251f0 iser-target: Remove ISER_RECV_DATA_SEG_LEN
26c7b673db57 iser-target: Remove impossible condition from isert_wait_conn
69c48846f1c7 iser-target: Remove redundant wait in release_conn
6d1fba0c2cc7 iser-target: Rework connection termination
f81bf458208e iser-target: Separate flows for np listeners and connections cma events
aea92980601f iser-target: Add new state ISER_CONN_BOUND to isert_conn
b89a7c25462b iser-target: Fix identification of login rx descriptor type

However I don't really see performance implications in these patches,
not to mention something that would affect on ConnectIB...

Given that your bisection brings up target side patches, I have
a couple questions:

1. Are the CPU usage in the target side at 100%, or the initiator side
is the bottleneck?

2. Would it be possible to use another target implementation? TGT maybe?

3. Can you try testing right before 9679cc51eb13? This is a patch that
involves data-plane.

4. Can you try the latest upstream kernel? The iser target code uses
a generic data-transfer library and I'm interested in knowing what is
the status there.

Cheers,
Sagi.


Re: NVMe over Fabrics target implementation

2016-06-08 Thread Sagi Grimberg



*) Extensible to multiple types of backend drivers.

nvme-target needs a way to absorb new backend drivers, that
does not effect existing configfs group layout or attributes.

Looking at the nvmet/configfs layout as-is, there are no multiple
backend types defined, nor a way to control backend feature bits
exposed to nvme namespaces at runtime.


Hey Nic,

As for different type of backends, I still don't see a big justification
for adding the LIO backends pscsi (as it doesn't make sense),
ramdisk (we have brd), or file (losetup).

What kind of feature bits would you want to expose at runtime?


And that's very much intentional.  We have a very well working block
layer which we're going to use, no need to reivent it.  The block
layer supports NVMe pass through just fine in case we'll need it,
as I spent the last year preparing it for that.


Why does it ever make sense for $SUBSYSTEM_NQN_0 with $PORT_DRIVER_FOO
to block operation of $SUBSYSTEM_NQN_1 with $PORT_DRIVER_BAR..?


Because it keeps the code simple.  If you had actually participated
on our development list you might have seen that until not too long
ago we had very fine grained locks here. In the end Armen convinced
me that it's easier to maintain if we don't bother with fine grained
locking outside the fast path, especially as it significantly simplifies
the discovery implementation.  If it ever turns out to be an
issue we can change it easily as the implementation is well encapsulated.


We did change that, and Nic is raising a valid point in terms of having
a global mutex around all the ports. If the requirement of nvme
subsystems and ports configuration is that it should happen fast enough
and scale to the numbers that Nic is referring to, we'll need to change
that back.

Having said that, I'm not sure this is a real hard requirement for RDMA
and FC in the mid-term, because from what I've seen, the workloads Nic
is referring to are more typical for iscsi/tcp where connections are
cheaper and you need more to saturate a high-speed interconnects, so
we'll probably see this when we have nvme over tcp working.


Re: libiscsi: Use scsi helper to set information descriptor

2016-04-13 Thread Sagi Grimberg

Hey Dan,


Hello Sagi Grimberg,

The patch a73c2a2f9123: "libiscsi: Use scsi helper to set information
descriptor" from Jul 15, 2015, leads to the following static checker
warning:

drivers/scsi/libiscsi.c:858 iscsi_scsi_cmd_rsp()
error: XXX uninitialized symbol 'sector'.

drivers/scsi/libiscsi.c
850  ascq = session->tt->check_protection(task, &sector);

If "ascq" is 0x1 then there sector might not be initialized.  The
documentation is not clear on how that works.  Har dee har har.  The
oldest jokes are still the best...  :P


iscsi transports that implement this callout are expected
to set the sector which is passed by reference.

would it make the checker happy if we set sector to 0 before
calling check_protection (although it's not needed by any means)?


Re: [PATCH v2 01/16] iscsi-target: add callback to alloc and free PDU

2016-04-13 Thread Sagi Grimberg



On 09/04/16 16:11, Varun Prakash wrote:

Add two callbacks to struct iscsit_transport -

1. void *(*iscsit_alloc_pdu)()
iscsi-target uses this callback for
iSCSI PDU allocation.

2. void (*iscsit_free_pdu)
iscsi-target uses this callback
to free an iSCSI PDU which was
allocated by iscsit_alloc_pdu().


Can you please explain why you are adding two different
callouts? Who (Chelsio T5) will need it, and why they can't
use the in-cmd pdu?


I am adding these to avoid a per-PDU 48-byte (BHS) memcpy from cmd->pdu to the
transport driver tx buffer; iscsi-target can form the iSCSI header directly
in the transport driver tx buffer.


Is that what it's for? Looks meaningless to me, do
you have an indication that this is some sort of bottleneck?

I see you memset in your pdu_alloc, where is the gain
here anyway?


Re: [PATCH v2 08/16] iscsi-target: add void (*iscsit_get_r2t_ttt)()

2016-04-13 Thread Sagi Grimberg



Add void (*iscsit_get_r2t_ttt)() to
struct iscsit_transport, iscsi-target
uses this callback to get
r2t->targ_xfer_tag.


Your driver allocates ttt's? That looks like bad
layering to me. This definitely deserves an explanation...


cxgbit.ko allocates ttt only for r2t pdus to do Direct Data
Placement of Data Out pdus; the adapter uses the ttt value in
Data Out pdus to place data directly in the host buffers.


How do you guarantee that the core doesn't conflict with
your internal ttt's?


Re: [PATCH 1/2] scsi: add a max_segment_size limitation to struct Scsi_Host

2016-04-13 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH 2/2] IB/iser: set max_segment_size

2016-04-13 Thread Sagi Grimberg

Acked-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH 2/2] IB/iser: set max_segment_size

2016-04-13 Thread Sagi Grimberg



In iser we sorta rely on 4k pages so we avoid
PAGE_SIZE but rather set SIZE_4K for these sort
of things (like we did in the virt_boundary).


So you still want only 4k segments even on PPC where the PAGE_SIZE is
16k?


Yes, iSER has the "no-gaps" constraint (like nvme) and some
applications in the past were known to issue vectored IO that doesn't
necessarily align with the system PAGE_SIZE. These sort of workloads
resulted in suboptimal performance on ppc, ia64 etc (almost every I/O
had gaps).

Before the block layer was able to enforce scatterlists without gaps,
iSER used bounce buffering in order to service "gappy" I/O which was
probably a lot worse than bio splitting like the block layer is doing
today, but still we felt that having 4k segments even on larger
PAGE_SIZE systems was a decent compromise.


Re: [PATCH 2/2] IB/iser: set max_segment_size

2016-04-12 Thread Sagi Grimberg



So that we don't overflow the number of MR segments allocated because
we have to split one SGL segment into multiple MR segments.

Signed-off-by: Christoph Hellwig 
---
  drivers/infiniband/ulp/iser/iscsi_iser.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c 
b/drivers/infiniband/ulp/iser/iscsi_iser.c
index 80b6bed..784504a 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.c
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.c
@@ -623,6 +623,7 @@ iscsi_iser_session_create(struct iscsi_endpoint *ep,
shost->max_id = 0;
shost->max_channel = 0;
shost->max_cmd_len = 16;
+   shost->max_segment_size = PAGE_SIZE;


In iser we sorta rely on 4k pages so we avoid
PAGE_SIZE but rather set SIZE_4K for these sort
of things (like we did in the virt_boundary).


Re: [PATCH v3 5/5] lib: scatterlist: move SG pool code from SCSI driver to lib/sg_pool.c

2016-04-12 Thread Sagi Grimberg

From: Ming Lin <min...@ssi.samsung.com>

Now it's ready to move the mempool based SG chained allocator code from
SCSI driver to lib/sg_pool.c, which will be compiled only based on a Kconfig
symbol CONFIG_SG_POOL.

SCSI selects CONFIG_SG_POOL.

Reviewed-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Ming Lin <min...@ssi.samsung.com>


Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH v3 4/5] scsi: rename SCSI_MAX_{SG, SG_CHAIN}_SEGMENTS

2016-04-12 Thread Sagi Grimberg



<martin.peter...@oracle.com> wrote:

"Ming" == Ming Lin <m...@kernel.org> writes:


Ming> Are we ready to merge it?

We're still missing an ack from Sagi.


Thought we already had an ack from Bart.
OK, let's get one more from Sagi.


Hmm ... this patch doesn't touch any code for which Sagi is the
maintainer so I think my ack for the ib_srp changes is sufficient.


Indeed no need for my ack, but a review tag can't hurt ;)

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH v3 3/5] scsi: rename SG related struct and functions

2016-04-12 Thread Sagi Grimberg

From: Ming Lin <min...@ssi.samsung.com>

Rename SCSI specific struct and functions to more genenic names.

Reviewed-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Ming Lin <min...@ssi.samsung.com>


Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH v3 2/5] scsi: replace "mq" with "first_chunk" in SG functions

2016-04-12 Thread Sagi Grimberg



From: Ming Lin <min...@ssi.samsung.com>

Parameter "bool mq" is block driver specific.
Change it to "first_chunk" to make it more generic.

Reviewed-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Ming Lin <min...@ssi.samsung.com>


Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH v3 1/5] scsi: replace "scsi_data_buffer" with "sg_table" in SG functions

2016-04-12 Thread Sagi Grimberg

From: Ming Lin <min...@ssi.samsung.com>

Replace parameter "struct scsi_data_buffer" with "struct sg_table" in
SG alloc/free functions to make them generic.

Reviewed-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Ming Lin <min...@ssi.samsung.com>


Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH v3 4/5] scsi: rename SCSI_MAX_{SG, SG_CHAIN}_SEGMENTS

2016-04-12 Thread Sagi Grimberg



We're still missing an ack from Sagi.


Oops, wasn't aware that mine was needed.
I reviewed these on the nvme-fabrics project.

I'll add my review tags too.


Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC

2016-04-10 Thread Sagi Grimberg

Hey Willy,


  - Interrupt steering needs to be controlled by block-mq instead of
the driver.  It's pointless to have each driver implement its own
policies on interrupt steering, irqbalanced remains a source of
end-user frustration, and block-mq can change the queue<->cpu mapping
without the driver's knowledge.


I honestly don't think that block-mq is the right place to
*assign* interrupt steering. Not all HW devices are dedicated
to storage; take RDMA for example: an RNIC is shared by block
storage, networking and even user-space workloads, so obviously
block-mq can't understand how a user wants to steer interrupts.

I think that block-mq needs to ask the device driver:
"what is the optimal queue index for cpu X?" and use it
while *someone* will be responsible for optimum interrupt
steering (can be the driver itself or user-space).

From some discussions I had with HCH I think he intends to
use the cpu reverse-mapping API to try and do what's described
above (if I'm not mistaken).


Re: [PATCH v2 15/16] iscsi-target: fix seq_end_offset calculation

2016-04-10 Thread Sagi Grimberg

Fixes should go in the front of the series (probably even better
detached from the set), and this also looks like stable material...


Re: [PATCH v2 11/16] iscsi-target: add new offload transport type

2016-04-10 Thread Sagi Grimberg



+static ssize_t lio_target_np_hw_offload_show(struct config_item *item,
+char *page)
+{
+   struct iscsi_tpg_np *tpg_np = to_iscsi_tpg_np(item);
+   struct iscsi_tpg_np *tpg_np_hw_offload;
+   ssize_t rb;
+
+   tpg_np_hw_offload = iscsit_tpg_locate_child_np(tpg_np,
+  ISCSI_HW_OFFLOAD);
+   if (tpg_np_hw_offload)
+   rb = sprintf(page, "1\n");
+   else
+   rb = sprintf(page, "0\n");
+
+   return rb;
+}
+
+static ssize_t lio_target_np_hw_offload_store(struct config_item *item,
+ const char *page, size_t count)
+{
+   struct iscsi_tpg_np *tpg_np = to_iscsi_tpg_np(item);
+   struct iscsi_np *np;
+   struct iscsi_portal_group *tpg;
+   struct iscsi_tpg_np *tpg_np_hw_offload = NULL;
+   u32 op;
+   int rc = 0;
+
+   rc = kstrtou32(page, 0, &op);
+   if (rc)
+   return rc;
+
+   if ((op != 1) && (op != 0)) {
+   pr_err("Illegal value for tpg_enable: %u\n", op);
+   return -EINVAL;
+   }
+
+   np = tpg_np->tpg_np;
+   if (!np) {
+   pr_err("Unable to locate struct iscsi_np from"
+  " struct iscsi_tpg_np\n");
+   return -EINVAL;
+   }
+
+   tpg = tpg_np->tpg;
+   if (iscsit_get_tpg(tpg) < 0)
+   return -EINVAL;
+
+   if (op) {
+   tpg_np_hw_offload = iscsit_tpg_add_network_portal(tpg,
+   &np->np_sockaddr, tpg_np, ISCSI_HW_OFFLOAD);
+
+   if (IS_ERR(tpg_np_hw_offload)) {
+   rc = PTR_ERR(tpg_np_hw_offload);
+   goto out;
+   }
+   } else {
+   tpg_np_hw_offload = iscsit_tpg_locate_child_np(tpg_np,
+   ISCSI_HW_OFFLOAD);
+
+   if (tpg_np_hw_offload) {
+   rc = iscsit_tpg_del_network_portal(tpg,
+  tpg_np_hw_offload);
+   if (rc < 0)
+   goto out;
+   }
+   }
+
+   iscsit_put_tpg(tpg);
+   return count;
+out:
+   iscsit_put_tpg(tpg);
+   return rc;
+}
+
  CONFIGFS_ATTR(lio_target_np_, sctp);
  CONFIGFS_ATTR(lio_target_np_, iser);
+CONFIGFS_ATTR(lio_target_np_, hw_offload);


I'd be happy to see new transports being added with some
initiative to reduce the code duplication here (pretty please :))


Re: [PATCH v2 10/16] iscsi-target: use conn->network_transport in text rsp

2016-04-10 Thread Sagi Grimberg

Looks fine,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH v2 08/16] iscsi-target: add void (*iscsit_get_r2t_ttt)()

2016-04-10 Thread Sagi Grimberg



Add void (*iscsit_get_r2t_ttt)() to
struct iscsit_transport, iscsi-target
uses this callback to get
r2t->targ_xfer_tag.


Your driver allocates ttt's? That looks like bad
layering to me. This definitely deserves an explanation...


Re: [PATCH v2 07/16] iscsi-target: add int (*iscsit_validate_params)()

2016-04-10 Thread Sagi Grimberg



Add int (*iscsit_validate_params)() to
struct iscsit_transport, iscsi-target
uses this callback for validating
conn operational parameters.


Again, why is this needed?


Re: [PATCH v2 05/16] iscsi-target: add void (*iscsit_get_rx_pdu)()

2016-04-10 Thread Sagi Grimberg



Add void (*iscsit_get_rx_pdu)() to
struct iscsit_transport, iscsi-target
uses this callback to receive and
process Rx iSCSI PDUs.


Same comment on change logs.

The iser bit looks harmless

Acked-by: Sagi Grimberg <s...@grimberg.me>

I agree with hch that we don't really need
rx threads in iSER, but that's another story...


Re: [PATCH v2 04/16] iscsi-target: add void (*iscsit_release_cmd)()

2016-04-10 Thread Sagi Grimberg



Add void (*iscsit_release_cmd)() to
struct iscsit_transport, iscsi-target
uses this callback to release transport
driver resources associated with an iSCSI cmd.


I'd really like to see some reasoning on why you add
abstraction callouts. It may have a valid reason but
it needs to be documented in the change log...


diff --git a/drivers/target/iscsi/iscsi_target_util.c 
b/drivers/target/iscsi/iscsi_target_util.c
index 428b0d9..a533017 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -725,6 +725,9 @@ void __iscsit_free_cmd(struct iscsi_cmd *cmd, bool scsi_cmd,
iscsit_remove_cmd_from_immediate_queue(cmd, conn);
iscsit_remove_cmd_from_response_queue(cmd, conn);
}
+
+   if (conn && conn->conn_transport->iscsit_release_cmd)
+   conn->conn_transport->iscsit_release_cmd(conn, cmd);
  }


Did you verify that you get here with conn = NULL (given that you test
it)? If so, then can you please document why it is expected for this
function to be called twice, such that we need to make it safe?

If not, then I'd move this check to be a WARN_ON/BUG_ON to hunt
down when this is happening.


Re: [PATCH v2 03/16] iscsi-target: add int (*iscsit_xmit_datain_pdu)()

2016-04-10 Thread Sagi Grimberg



On 09/04/16 16:11, Varun Prakash wrote:

Add int (*iscsit_xmit_datain_pdu)() to
struct iscsit_transport, iscsi-target
uses this callback to transmit a DATAIN
iSCSI PDU.

Signed-off-by: Varun Prakash 
---
  drivers/target/iscsi/iscsi_target.c| 143 +++--
  include/target/iscsi/iscsi_transport.h |   3 +
  2 files changed, 86 insertions(+), 60 deletions(-)

diff --git a/drivers/target/iscsi/iscsi_target.c 
b/drivers/target/iscsi/iscsi_target.c
index 0e7a481..9e65e5d 100644
--- a/drivers/target/iscsi/iscsi_target.c
+++ b/drivers/target/iscsi/iscsi_target.c
@@ -577,6 +577,84 @@ static int iscsit_xmit_pdu(struct iscsi_conn *conn, struct 
iscsi_cmd *cmd,
return 0;
  }

+static int iscsit_map_iovec(struct iscsi_cmd *, struct kvec *, u32, u32);
+static void iscsit_unmap_iovec(struct iscsi_cmd *);
+static u32 iscsit_do_crypto_hash_sg(struct ahash_request *, struct iscsi_cmd *,
+   u32, u32, u32, u8 *);
+static int
+iscsit_xmit_datain_pdu(struct iscsi_conn *conn, struct iscsi_cmd *cmd,
+  struct iscsi_datain_req *dr, struct iscsi_datain *datain)


Looks very similar to xmit_pdu(). If we add a datain pointer that
can be NULL for normal PDUs, would that reduce code duplication?


Re: [PATCH v2 02/16] iscsi-target: add int (*iscsit_xmit_pdu)()

2016-04-10 Thread Sagi Grimberg

Nice!

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH v2 01/16] iscsi-target: add callback to alloc and free PDU

2016-04-10 Thread Sagi Grimberg



On 09/04/16 16:11, Varun Prakash wrote:

Add two callbacks to struct iscsit_transport -

1. void *(*iscsit_alloc_pdu)()
iscsi-target uses this callback for
iSCSI PDU allocation.

2. void (*iscsit_free_pdu)
iscsi-target uses this callback
to free an iSCSI PDU which was
allocated by iscsit_alloc_pdu().


Can you please explain why you are adding two different
callouts? Who (Chelsio T5) will need them, and why can't
they use the in-cmd pdu?



Signed-off-by: Varun Prakash 
---
  drivers/target/iscsi/iscsi_target.c| 76 --
  include/target/iscsi/iscsi_transport.h |  2 +
  2 files changed, 65 insertions(+), 13 deletions(-)

diff --git a/drivers/target/iscsi/iscsi_target.c 
b/drivers/target/iscsi/iscsi_target.c
index 961202f..fdb33ba 100644
--- a/drivers/target/iscsi/iscsi_target.c
+++ b/drivers/target/iscsi/iscsi_target.c
@@ -499,6 +499,11 @@ static void iscsit_aborted_task(struct iscsi_conn *conn, 
struct iscsi_cmd *cmd)
__iscsit_free_cmd(cmd, scsi_cmd, true);
  }

+static void *iscsit_alloc_pdu(struct iscsi_conn *conn, struct iscsi_cmd *cmd)
+{
+   return cmd->pdu;
+}
+
  static enum target_prot_op iscsit_get_sup_prot_ops(struct iscsi_conn *conn)
  {
return TARGET_PROT_NORMAL;
@@ -519,6 +524,7 @@ static struct iscsit_transport iscsi_target_transport = {
.iscsit_queue_data_in   = iscsit_queue_rsp,
.iscsit_queue_status= iscsit_queue_rsp,
.iscsit_aborted_task= iscsit_aborted_task,
+   .iscsit_alloc_pdu   = iscsit_alloc_pdu,
.iscsit_get_sup_prot_ops = iscsit_get_sup_prot_ops,
  };

@@ -2537,7 +2543,10 @@ static int iscsit_send_conn_drop_async_message(
cmd->tx_size = ISCSI_HDR_LEN;
cmd->iscsi_opcode = ISCSI_OP_ASYNC_EVENT;

-   hdr = (struct iscsi_async *) cmd->pdu;
+   hdr = conn->conn_transport->iscsit_alloc_pdu(conn, cmd);
+   if (unlikely(!hdr))
+   return -ENOMEM;
+
hdr->opcode  = ISCSI_OP_ASYNC_EVENT;
hdr->flags   = ISCSI_FLAG_CMD_FINAL;
cmd->init_task_tag   = RESERVED_ITT;
@@ -2630,7 +2639,7 @@ iscsit_build_datain_pdu(struct iscsi_cmd *cmd, struct 
iscsi_conn *conn,

  static int iscsit_send_datain(struct iscsi_cmd *cmd, struct iscsi_conn *conn)
  {
-   struct iscsi_data_rsp *hdr = (struct iscsi_data_rsp *)&cmd->pdu[0];
+   struct iscsi_data_rsp *hdr;
struct iscsi_datain datain;
struct iscsi_datain_req *dr;
struct kvec *iov;
@@ -2675,6 +2684,10 @@ static int iscsit_send_datain(struct iscsi_cmd *cmd, 
struct iscsi_conn *conn)
set_statsn = true;
}

+   hdr = conn->conn_transport->iscsit_alloc_pdu(conn, cmd);
+   if (unlikely(!hdr))
+   return -ENOMEM;
+
+   iscsit_build_datain_pdu(cmd, conn, &datain, hdr, set_statsn);

iov = &cmd->iov_data[0];
@@ -2843,13 +2856,20 @@ EXPORT_SYMBOL(iscsit_build_logout_rsp);
  static int
  iscsit_send_logout(struct iscsi_cmd *cmd, struct iscsi_conn *conn)
  {
+   struct iscsi_logout_rsp *hdr;
struct kvec *iov;
int niov = 0, tx_size, rc;

-   rc = iscsit_build_logout_rsp(cmd, conn,
-   (struct iscsi_logout_rsp *)&cmd->pdu[0]);
-   if (rc < 0)
+   hdr = conn->conn_transport->iscsit_alloc_pdu(conn, cmd);
+   if (unlikely(!hdr))
+   return -ENOMEM;
+
+   rc = iscsit_build_logout_rsp(cmd, conn, hdr);
+   if (rc < 0) {
+   if (conn->conn_transport->iscsit_free_pdu)
+   conn->conn_transport->iscsit_free_pdu(conn, cmd);


This creates a weird asymmetry where alloc is called unconditionally
while free is conditional. I'd say implement an empty free for iscsit.
Same for the rest of the code...

P.S. I didn't see any non-error path call to free_pdu. Is that a
possible leak (for drivers that actually allocate a PDU)?

On another unrelated note, I'd be very happy if we lost the iscsit_
prefix from all the callouts; it's clear to everyone that this is
iSCSI, no need for an explicit reminder...


Re: [RFC 14/34] iscsi-target: export symbols

2016-04-10 Thread Sagi Grimberg



Great.  Just curious how -v2 is coming along..?

I've got a few cycles over the weekend, and plan to start reviewing as
the series hits the list.

Btw, I asked Sagi to help out with review as well.


If you did, it's lost in my old email address :)

I'll have a look this week.

Cheers,
Sagi.


Re: [RFC 22/34] iscsi-target: call Rx thread function

2016-02-15 Thread Sagi Grimberg



call Rx thread function if registered
by transport driver, so that transport
drivers can use iscsi-target Rx thread
for Rx processing.

update iSER target driver to use this
interface.

Signed-off-by: Varun Prakash <va...@chelsio.com>


Other than the handler name, looks harmless.

Acked-by: Sagi Grimberg <sa...@mellanox.com>


Re: [RFC 20/34] iscsi-target: update struct iscsit_transport definition

2016-02-15 Thread Sagi Grimberg



1. void (*iscsit_rx_pdu)(struct iscsi_conn *);
Rx thread uses this for receiving and processing
iSCSI PDU in full feature phase.


Is iscsit_rx_pdu the best name for this? It sounds like
a function that would handle a single PDU, but it's actually the
thread function dequeuing PDUs, correct?

