[PATCH] iser: set sector for ambiguous mr status errors
If for some reason we failed to query the mr status, we need to make
sure to provide sufficient information for an ambiguous error (guard
error on sector 0).

Fixes: 0a7a08ad6f5f ("IB/iser: Implement check_protection")
Cc:
Reported-by: Dan Carpenter
Signed-off-by: Sagi Grimberg
---
 drivers/infiniband/ulp/iser/iser_verbs.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c
index 946b623ba5eb..ab137bafa8a8 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -1124,7 +1124,9 @@ u8 iser_check_task_pi_status(struct iscsi_iser_task *iser_task,
 				   IB_MR_CHECK_SIG_STATUS, &mr_status);
 	if (ret) {
 		pr_err("ib_check_mr_status failed, ret %d\n", ret);
-		goto err;
+		/* Not alot we can do, return ambiguous guard error */
+		*sector = 0;
+		return 0x1;
 	}

 	if (mr_status.fail_status & IB_MR_CHECK_SIG_STATUS) {
@@ -1152,9 +1154,6 @@ u8 iser_check_task_pi_status(struct iscsi_iser_task *iser_task,
 	}

 	return 0;
-err:
-	/* Not alot we can do here, return ambiguous guard error */
-	return 0x1;
 }

 void iser_err_comp(struct ib_wc *wc, const char *type)
--
2.17.1
Re: [PATCH v2 08/15] nvmet: make config_item_type const
Acked-by: Sagi Grimberg <s...@grimberg.me>
Re: tgtd CPU 100% problem
> Thanks very much for the reply. I think it may be better to assert and
> exit than to run the endless loop after receiving the DEVICE_REMOVAL
> event. Or we can sleep 5 ms to check if the conn->h.state is
> STATE_FULL.

Note that neither of the CC'd lists is the correct list for stgt, you
should correspond with s...@lists.wpkg.org

Regarding your observation, I'm not sure why you think this is
happening; the work is queued once every 5 seconds and the loop is
finite.
Re: tgtd CPU 100% problem
On 11/07/17 10:51, 李春 wrote:
> We have met a problem of tgtd at 100% CPU. The infiniband network card
> was negotiated into eth mode by mistake; after we changed it to ib
> mode and restarted opensmd to get the correct State(Active), tgtd was
> using 100% of CPU, and when we connected to it using tgtadm, tgtadm
> hung forever.
>
> # how to repeat
> * tgtd exports a disk through port 3260 of iser
> * iscsiadm logs in to a target from tgt through infiniband
> * connectx_port_config sets the mellanox infiniband to eth mode
> * connectx_port_config sets the mellanox infiniband to ib mode
> * /etc/init.d/opensmd restart
> * tgtadm connect to tgt will hang
>
> # error message
> ```
> Jul 1 21:32:37 shadow tgtd: iser_handle_rdmacm(1628) Unsupported event:11, RDMA_CM_EVENT_DEVICE_REMOVAL - ignored
> Jul 1 21:32:37 shadow tgtd: iser_handle_rdmacm(1628) Unsupported event:11, RDMA_CM_EVENT_DEVICE_REMOVAL - ignored
> Jul 1 21:32:39 shadow tgtd: iser_handle_async_event(3174) dev:mlx4_0 HCA evt: local catastrophic error

The iser code in tgtd does not know how to correctly handle RDMA device
removal events (and it never did). The event is generated from the port
configuration step while tgt-iser is bound to it. Once the device is
removed, the device handle tgt-iser holds is essentially unusable, which
explains the qp creation failures below. Handling the DEVICE_REMOVAL
event is a new feature request.

> Jul 1 21:46:56 shadow tgtd: iser_cm_connect_request(1471) conn:0x1380bf0 cm_id:0x1380950 rdma_create_qp failed, Cannot allocate memory
> Jul 1 21:46:56 shadow tgtd: iser_cm_connect_request(1520) cm_id:0x1380950 rdma_reject failed, Bad file descriptor
> Jul 1 21:46:56 shadow tgtd: iser_cm_connect_request(1471) conn:0x1380bf0 cm_id:0x1380950 rdma_create_qp failed, Cannot allocate memory

And also tgt-iser cannot even reject the (re)connect request:

> Jul 1 21:46:56 shadow tgtd: iser_cm_connect_request(1520) cm_id:0x1380950 rdma_reject failed, Bad file descriptor
> ```
Re: ["PATCH-v2" 00/22] lpfc updates for 11.2.0.12
> The patches are dependent on the FC nvme/nvmet patches from the
> following 2 series:
> http://lists.infradead.org/pipermail/linux-nvme/2017-April/009250.html
> http://lists.infradead.org/pipermail/linux-nvme/2017-April/009256.html

Hmm, so it seems that we have conflicts here.

A local merge attempt on Jens's current for-4.12/block from nvme-4.12
(with this patchset) is generating:
--
diff --cc drivers/nvme/host/fc.c
index 450733c8cd24,b6d2ca8559f6..
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@@ -1147,8 -1266,9 +1266,14 @@@ nvme_fc_fcpio_done(struct nvmefc_fcp_re
  	struct nvme_fc_ctrl *ctrl = op->ctrl;
  	struct nvme_fc_queue *queue = op->queue;
  	struct nvme_completion *cqe = &op->rsp_iu.cqe;
++<<<<<<< HEAD
 +	__le16 status = cpu_to_le16(NVME_SC_SUCCESS << 1);
 +	union nvme_result result;
++=======
+ 	struct nvme_command *sqe = &op->cmd_iu.sqe;
+ 	u16 status = NVME_SC_SUCCESS;
+ 	bool complete_rq;
++>>>>>>> nvme-4.12

  	/*
  	 * WARNING:
@@@ -1229,12 -1349,12 +1354,17 @@@
  			be32_to_cpu(op->rsp_iu.xfrd_len) !=
  					freq->transferred_length ||
  		     op->rsp_iu.status_code ||
++<<<<<<< HEAD
 +		     op->rqno != le16_to_cpu(cqe->command_id))) {
 +			status = cpu_to_le16(NVME_SC_FC_TRANSPORT_ERROR << 1);
++=======
+ 		     sqe->common.command_id != cqe->command_id)) {
+ 			status = NVME_SC_FC_TRANSPORT_ERROR;
++>>>>>>> nvme-4.12
  			goto done;
  		}
 -		op->nreq.result = cqe->result;
 -		status = le16_to_cpu(cqe->status) >> 1;
 +		result = cqe->result;
 +		status = cqe->status;
  		break;
  	default:
@@@ -1243,13 -1363,26 +1373,35 @@@
  	}

 done:
++<<<<<<< HEAD
 +	if (!queue->qnum && op->rqno >= AEN_CMDID_BASE) {
 +		nvme_complete_async_event(&ctrl->ctrl, status, &result);
++=======
+ 	if (op->flags & FCOP_FLAGS_AEN) {
+ 		nvme_complete_async_event(&ctrl->ctrl, status,
+ 					  &op->nreq.result);
+ 		complete_rq = __nvme_fc_fcpop_chk_teardowns(ctrl, op);
+ 		atomic_set(&op->state, FCPOP_STATE_IDLE);
+ 		op->flags = FCOP_FLAGS_AEN;	/* clear other flags */
++>>>>>>> nvme-4.12
  		nvme_fc_ctrl_put(ctrl);
  		return;
  	}

++<<<<<<< HEAD
 +	nvme_end_request(rq, status, result);
++=======
+ 	complete_rq = __nvme_fc_fcpop_chk_teardowns(ctrl, op);
+ 	if (!complete_rq) {
+ 		if (unlikely(op->flags & FCOP_FLAGS_TERMIO)) {
+ 			status = NVME_SC_ABORT_REQ;
+ 			if (blk_queue_dying(rq->q))
+ 				status |= NVME_SC_DNR;
+ 		}
+ 		blk_mq_complete_request(rq, status);
+ 	} else
+ 		__nvme_fc_final_op_cleanup(rq);
++>>>>>>> nvme-4.12
 }

 static int
--
Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem
Sagi,

> As long as legA, legB and the RC are all connected to the same switch
> then ordering will be preserved (I think many other topologies also
> work). Here is how it would work for the problem case you are
> concerned about (which is a read from the NVMe drive).
>
> 1. Disk device DMAs out the data to the p2pmem device via a string of
>    PCIe MemWr TLPs.
> 2. Disk device writes to the completion queue (in system memory) via a
>    MemWr TLP.
> 3. The last of the MemWrs from step 1 might have got stalled in the
>    PCIe switch due to congestion, but if so they are stalled in the
>    egress path of the switch for the p2pmem port.
> 4. The RC determines the IO is complete when the TLP associated with
>    step 2 updates the memory associated with the CQ. It issues some
>    operation to read the p2pmem.
> 5. Regardless of whether the MemRd TLP comes from the RC or another
>    device connected to the switch, it is queued in the egress queue
>    for the p2pmem port behind the last DMA TLP (from step 1). PCIe
>    ordering ensures that this MemRd cannot overtake the MemWr (reads
>    can never pass writes). Therefore the MemRd can never get to the
>    p2pmem device until after the last DMA MemWr has.

What you are saying is surprising to me. The switch needs to preserve
ordering across different switch ports?? You are suggesting that there
is a *switch-wide* state that tracks that MemRds never pass MemWrs
across all the switch ports? That is a very non-trivial statement...
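For what it's worth, the ordering argument above can be reduced to a toy model (my own illustration, not PCIe-spec code): each egress port is a FIFO, and a non-posted MemRd queued behind posted MemWrs on the same egress port can never pass them, so the read triggered by the CQ update necessarily drains the stalled data writes first.

```python
from collections import deque

class EgressPort:
    """Toy model of a PCIe switch egress queue: posted MemWr TLPs may
    stall, but a MemRd queued behind them on the same egress port can
    never pass them (PCIe ordering: reads never pass writes)."""
    def __init__(self):
        self.fifo = deque()

    def queue_tlp(self, tlp):
        self.fifo.append(tlp)

    def drain(self):
        # TLPs leave the egress port strictly in queue order.
        order = list(self.fifo)
        self.fifo.clear()
        return order

# Problem case from the thread: the disk DMAs data to p2pmem (step 1),
# the last MemWr stalls in the p2pmem egress port (step 3), the RC sees
# the CQ update and issues a MemRd to the p2pmem (steps 4-5).
p2pmem_port = EgressPort()
p2pmem_port.queue_tlp("MemWr(data, stalled)")
p2pmem_port.queue_tlp("MemRd(from RC)")

# The read drains only after the write, so it observes the new data.
assert p2pmem_port.drain() == ["MemWr(data, stalled)", "MemRd(from RC)"]
```

This only models a single shared egress queue; the reply above questions exactly whether the same guarantee holds across *different* switch ports.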
Re: [RFC 3/8] nvmet: Use p2pmem in nvme target
I hadn't done this yet but I think a simple closest device in the tree
would solve the issue sufficiently. However, I originally had it so the
user has to pick the device and I prefer that approach. But if the user
picks the device, then why bother restricting what he picks?

Because the user can get it wrong, and it's our job to do what we can
in order to prevent the user from screwing themselves.

Per the thread with Sinan, I'd prefer to use what the user picks. You
were one of the biggest opponents to that, so I'd like to hear your
opinion on removing the restrictions.

I wasn't against it that much. I'm all for making things "just work"
with minimal configuration steps, but I'm not sure we can get it right
without them.

Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
save an extra PCI transfer as the NVME card could just take the data
out of its own memory. However, at this time, cards with CMB buffers
don't seem to be available. Even if one was available, it would be hard
to make real use of it given that we wouldn't know how to pre-post recv
buffers (for in-capsule data). But let's leave this out of the scope
entirely...

I don't understand what you're referring to. We'd simply use the CMB
buffer as a p2pmem device, why does that change anything?

I'm referring to the in-capsule data buffer pre-posts that we do.
Because we prepare a buffer that would contain in-capsule data, we have
no knowledge of which device the incoming I/O is directed to, which
means we can (and will) have I/O where the data lies in the CMB of
device A but is really targeted at device B - which sorta defeats the
purpose of what we're trying to optimize here...

Why do you need this? You have a reference to the queue itself.

This keeps track of whether the response was actually allocated with
p2pmem or not.
It's needed for when we free the SGL because the queue may have a
p2pmem device assigned to it but, if the alloc failed and it fell back
on system memory, then we need to know how to free it. I'm currently
looking at having SGLs carry an iomem flag. In that case, this would no
longer be needed as the flag in the SGL could be used.

That would be better, maybe...

[...]

This is a problem. Namespaces can be added at any point in time. No one
guarantees that dma_devs covers all the namespaces we'll ever see.

Yeah, well restricting p2pmem based on all the devices in use is hard.
So we'd need a call into the transport every time an ns is added and
we'd have to drop the p2pmem if they add one that isn't supported. This
complexity is just one of the reasons I prefer just letting the user
choose.

Still, the user can get it wrong. Not sure we can get away without
keeping track of this as new devices join the subsystem.

+
+	if (queue->p2pmem)
+		pr_debug("using %s for rdma nvme target queue",
+			 dev_name(&queue->p2pmem->dev));
+
+	kfree(dma_devs);
+}
+
 static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
 		struct rdma_cm_event *event)
 {
@@ -1199,6 +1271,8 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
 	}

 	queue->port = cm_id->context;
+	nvmet_rdma_queue_setup_p2pmem(queue);
+

Why is all this done for each queue? It looks completely redundant to
me.

A little bit. Where would you put it?

I think we'll need a representation of a controller in nvmet-rdma for
that. We sort of got away without it so far, but I don't think we can
anymore with this.

 	ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
 	if (ret)
 		goto release_queue;

You seemed to skip the in-capsule buffers for p2pmem (inline_page), I'm
curious why?

Yes, the thinking was that these transfers were small anyway, so there
would not be significant benefit to pushing them through p2pmem.
There's really no reason why we couldn't do that if it made sense to,
though.

I don't see an urgent reason for it either. I was just curious...
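As a side note on the "closest device in the tree" heuristic mentioned in this thread, one way to sketch it is to score each candidate provider by how deep an upstream path it shares with the client port; everything here (device names, path layout, helpers) is hypothetical, not the p2pmem code:

```python
def shared_upstream(a, b):
    """Number of leading path components two PCI device paths share,
    i.e. how deep their common upstream (switch) sits in the tree."""
    n = 0
    for x, y in zip(a.split("/"), b.split("/")):
        if x != y:
            break
        n += 1
    return n

def pick_provider(client_path, provider_paths):
    """Pick the provider with the deepest common ancestor (the
    'closest device in the tree') relative to the client device."""
    return max(provider_paths,
               key=lambda p: shared_upstream(client_path, p))

# Hypothetical topology: two providers, one behind the same switch as
# the RDMA port and one behind a different switch.
rdma_port = "root0/switchA/port1/nic"
providers = ["root0/switchB/port0/p2p0", "root0/switchA/port2/p2p1"]
assert pick_provider(rdma_port, providers) == "root0/switchA/port2/p2p1"
```

A tie between two equally close providers is exactly the "more than a single device satisfies this" case raised elsewhere in the thread; this sketch just returns one of them arbitrarily.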
Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem
>> Note that the nvme completion queues are still on the host memory,
>> so this means we have lost the ordering between data and completions
>> as they go to different pcie targets.
>
> Hmm, in this simple up/down case with a switch, I think it might
> actually be OK. Transactions might not complete at the NVMe device
> before the CPU processes the RDMA completion, however due to the
> PCI-E ordering rules new TLPs directed to the NVMe will complete
> after the RDMA TLPs and thus observe the new data. (eg order
> preserving)
>
> It would be very hard to use P2P if fabric ordering is not preserved..

I think it still can race if the p2p device is connected with more than
a single port to the switch. Say it's connected via 2 legs, the bar is
accessed from leg A and the data from the disk comes via leg B. In this
case, the data is heading towards the p2p device via leg B (which might
be congested), the completion goes directly to the RC, and then the
host issues a read from the bar via leg A. I don't understand what can
guarantee ordering here.

Stephen told me that this still guarantees ordering, but I honestly
can't understand how; perhaps someone can explain it to me in a simple
way that I can understand.
Re: [PATCH 2/5] nvme: cleanup nvme_req_needs_retry
Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH 4/5] nvme: move the retries count to struct nvme_request
Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH 3/5] nvme: mark nvme_max_retries static
Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH 1/5] nvme: move ->retries setup to nvme_setup_cmd
Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH 5/5] block, scsi: move the retries field to struct scsi_request
Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem
 u16 nvmet_copy_to_sgl(struct nvmet_req *req, off_t off, const void *buf,
 		size_t len)
 {
-	if (sg_pcopy_from_buffer(req->sg, req->sg_cnt, buf, len, off) != len)
+	bool iomem = req->p2pmem;
+	size_t ret;
+
+	ret = sg_copy_buffer(req->sg, req->sg_cnt, (void *)buf, len, off,
+			     false, iomem);
+
+	if (ret != len)
 		return NVME_SC_SGL_INVALID_DATA | NVME_SC_DNR;
+
 	return 0;
 }

We can never ever get here from an IO command, and that is a good thing
because it would have been broken if we did, regardless of what copy
method we use... Note that the nvme completion queues are still on the
host memory, so this means we have lost the ordering between data and
completions as they go to different pcie targets. If anything, this is
the place to *emphasize* that we must never get here with p2pmem, and
immediately fail if we do.

I'm not sure what will happen with copy_from_sgl; I guess we have the
same race because the nvme submission queues are also on the host
memory (which is a different pci target). Maybe more likely to happen
with write-combine enabled?

Anyway, I don't think we have a real issue here *currently*, because we
use copy_to_sgl only for admin/fabrics command emulation and
copy_from_sgl to set up dsm ranges...
Re: [RFC 4/8] p2pmem: Add debugfs "stats" file
+	p2pmem_debugfs_root = debugfs_create_dir("p2pmem", NULL);
+	if (!p2pmem_debugfs_root)
+		pr_info("could not create debugfs entry, continuing\n");
+

Why continue? I think it'd be better to just fail it.

Besides, this can be safely squashed into patch 1.
Re: [RFC 2/8] cxgb4: setup pcie memory window 4 and create p2pmem region
+static void setup_memwin_p2pmem(struct adapter *adap)
+{
+	unsigned int mem_base = t4_read_reg(adap, CIM_EXTMEM2_BASE_ADDR_A);
+	unsigned int mem_size = t4_read_reg(adap, CIM_EXTMEM2_ADDR_SIZE_A);
+
+	if (!use_p2pmem)
+		return;

This is weird, why even call this if !use_p2pmem?

+static int init_p2pmem(struct adapter *adapter)
+{
+	unsigned int mem_size = t4_read_reg(adapter, CIM_EXTMEM2_ADDR_SIZE_A);
+	struct p2pmem_dev *p;
+	int rc;
+	struct resource res;
+
+	if (!mem_size || !use_p2pmem)
+		return 0;

Again, weird...
Re: [RFC 3/8] nvmet: Use p2pmem in nvme target
Hey Logan,

> We create a configfs attribute in each nvme-fabrics target port to
> enable p2p memory use. When enabled, the port will only then use the
> p2p memory if a p2p memory device can be found which is behind the
> same switch as the RDMA port and all the block devices in use. If the
> user enables it and no devices are found, then the system will
> silently fall back on using regular memory.

What should we do if we have more than a single device that satisfies
this? I'd say that it would be better to have the user ask for a
specific device and fail it if it doesn't meet the above conditions...

> If appropriate, that port will allocate memory for the RDMA buffers
> for queues from the p2pmem device, falling back to system memory
> should anything fail.

That's good :)

> Ideally, we'd want to use an NVME CMB buffer as p2p memory. This
> would save an extra PCI transfer as the NVME card could just take the
> data out of its own memory. However, at this time, cards with CMB
> buffers don't seem to be available.

Even if one was available, it would be hard to make real use of it
given that we wouldn't know how to pre-post recv buffers (for
in-capsule data). But let's leave this out of the scope entirely...

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index ecc4fe8..7fd4840 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -23,6 +23,7 @@
 #include
 #include
 #include
+#include
 #include
 #include

@@ -64,6 +65,7 @@ struct nvmet_rdma_rsp {
 	struct rdma_rw_ctx	rw;

 	struct nvmet_req	req;
+	struct p2pmem_dev	*p2pmem;

Why do you need this? You have a reference to the queue itself.
@@ -107,6 +109,8 @@ struct nvmet_rdma_queue {
 	int			send_queue_size;

 	struct list_head	queue_list;
+
+	struct p2pmem_dev	*p2pmem;
 };

 struct nvmet_rdma_device {
@@ -185,7 +189,8 @@ nvmet_rdma_put_rsp(struct nvmet_rdma_rsp *rsp)
 	spin_unlock_irqrestore(&rsp->queue->rsps_lock, flags);
 }

-static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int nents)
+static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int nents,
+		struct p2pmem_dev *p2pmem)
 {
 	struct scatterlist *sg;
 	int count;
@@ -193,13 +198,17 @@ static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int nents)
 	if (!sgl || !nents)
 		return;

-	for_each_sg(sgl, sg, nents, count)
-		__free_page(sg_page(sg));
+	for_each_sg(sgl, sg, nents, count) {
+		if (p2pmem)
+			p2pmem_free_page(p2pmem, sg_page(sg));
+		else
+			__free_page(sg_page(sg));
+	}
 	kfree(sgl);
 }

 static int nvmet_rdma_alloc_sgl(struct scatterlist **sgl, unsigned int *nents,
-		u32 length)
+		u32 length, struct p2pmem_dev *p2pmem)
 {
 	struct scatterlist *sg;
 	struct page *page;
@@ -216,7 +225,11 @@ static int nvmet_rdma_alloc_sgl(struct scatterlist **sgl, unsigned int *nents,
 	while (length) {
 		u32 page_len = min_t(u32, length, PAGE_SIZE);

-		page = alloc_page(GFP_KERNEL);
+		if (p2pmem)
+			page = p2pmem_alloc_page(p2pmem);
+		else
+			page = alloc_page(GFP_KERNEL);
+
 		if (!page)
 			goto out_free_pages;

@@ -231,7 +244,10 @@ static int nvmet_rdma_alloc_sgl(struct scatterlist **sgl, unsigned int *nents,
 out_free_pages:
 	while (i > 0) {
 		i--;
-		__free_page(sg_page(&sg[i]));
+		if (p2pmem)
+			p2pmem_free_page(p2pmem, sg_page(&sg[i]));
+		else
+			__free_page(sg_page(&sg[i]));
 	}
 	kfree(sg);
 out:
@@ -484,7 +500,8 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
 	}

 	if (rsp->req.sg != &rsp->cmd->inline_sg)
-		nvmet_rdma_free_sgl(rsp->req.sg, rsp->req.sg_cnt);
+		nvmet_rdma_free_sgl(rsp->req.sg, rsp->req.sg_cnt,
+				rsp->p2pmem);

 	if (unlikely(!list_empty_careful(&queue->rsp_wr_wait_list)))
 		nvmet_rdma_process_wr_wait_list(queue);

@@ -625,8 +642,16 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
 	if (!len)
 		return 0;

+	rsp->p2pmem = rsp->queue->p2pmem;
 	status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
-			len);
+			len, rsp->p2pmem);
+
+	if (status && rsp->p2pmem) {
+		rsp->p2pmem = NULL;
+		status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
+				len, rsp->p2pmem);
+	}
+

Not sure it's good practice to rely on rsp->p2pmem not being NULL...
Would be nice if the allocation routines could hide it from us...
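The try-p2pmem-then-fall-back dance under review here can be sketched in miniature (a toy model of the control flow, not the p2pmem API; `Pool` and the helpers are invented stand-ins):

```python
class Pool:
    """Stand-in for a p2pmem pool with a fixed number of pages."""
    def __init__(self, npages):
        self.free = npages

    def alloc_page(self):
        if self.free == 0:
            return None
        self.free -= 1
        return object()

def alloc_sgl(npages, pool=None):
    """Allocate npages, all from `pool` if given, else from the system
    allocator (modelled as unlimited). On failure, release the partial
    allocation and return None, like the kernel code's unwind path."""
    pages = []
    for _ in range(npages):
        page = pool.alloc_page() if pool else object()
        if page is None:
            if pool:
                pool.free += len(pages)  # give partial pages back
            return None
        pages.append(page)
    return pages

def map_sgl(npages, queue_pool):
    """Mirror of the reviewed logic: prefer the queue's p2pmem pool,
    fall back to system memory, and record which pool was used so the
    free path knows how to release the pages."""
    sgl = alloc_sgl(npages, queue_pool)
    if sgl is not None:
        return sgl, queue_pool
    return alloc_sgl(npages, None), None

small_pool = Pool(npages=2)
sgl, used = map_sgl(8, small_pool)   # pool too small -> clean fallback
assert sgl is not None and used is None
```

Returning "which pool was used" from the allocator is one way to address the review comment above about not relying on `rsp->p2pmem` being NULL.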
Re: [PATCH] lpfc: add missing Kconfig NVME dependencies
> add missing Kconfig NVME dependencies
>
> Can't believe I missed posting this
>
> -- james

Heh, this sort of comment should come after the '---' separator (below)
unless you want it to live forever in the git log...

> Signed-off-by: James Smart
> ---

[here]

> drivers/scsi/Kconfig | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
> index d4023bf..2558434 100644
> --- a/drivers/scsi/Kconfig
> +++ b/drivers/scsi/Kconfig
> @@ -1240,6 +1240,7 @@ config SCSI_LPFC
>  	tristate "Emulex LightPulse Fibre Channel Support"
>  	depends on PCI && SCSI
>  	depends on SCSI_FC_ATTRS
> +	depends on NVME_FC && NVME_TARGET_FC
>  	select CRC_T10DIF
>  	help
>  	  This lpfc driver supports the Emulex LightPulse
Re: hch's native NVMe multipathing [was: Re: [PATCH 1/2] Don't blacklist nvme]
> I'm fine with the path selectors getting moved out; maybe it'll
> encourage new path selectors to be developed. But there will need to
> be some userspace interface stood up to support your native NVMe
> multipathing (you may not think it needed but I think in time there
> will be a need to configure _something_). That is the fragmentation
> I'm referring to.

I guess one config option that we'd need is multibus vs. failover,
which is chosen per use-case. I'd have to say that having something
much simpler than multipath-tools does sound appealing.
Re: [PATCH v3 00/16] lpfc: Add NVME Fabrics support
Hi James,

> This patch set adds support for the NVME over Fabrics FC transport to
> lpfc. The internals of the driver are reworked to support being
> either: a SCSI initiator; an NVME initiator; both a SCSI initiator
> and an NVME initiator; or an NVME target. The driver effectively has
> parallel NVME and SCSI stacks that utilize their own set of
> resources. They intersect only at the hardware level, mainly in queue
> create layers and interrupt handling. A few new files are added to
> support the interfaces of the FC transport LLDD api for NVME fabrics.
>
> The patches were cut against the 1/30 scsi.git tree, misc branch.
> ** THEY ARE INTENDED FOR THE SCSI.GIT TREE, MISC BRANCH **
>
> The lpfc version in the linux-block.git tree is rather old. I have a
> recipe for how to get it to a version that syncs with the
> scsi.git/misc tree so that these patches can apply there as well.
> Contact me if you would like it.

This set does not seem to apply cleanly on the nvme-4.11 tree; it looks
like patch 6 is failing.

Also, can you send your patchset threaded? git format-patch usually
does this when generating the patches; not sure how this is not the
case with your set... It would make my life a bit easier.

Thanks.
Re: [PATCH 1/1] iscsi: fix regression caused by session lock patch
Hey Chris and Guilherme,

I'm indeed not responsive under this email address.

>> Thanks for the testing, looks like you have the magic target to
>> reproduce this. I think this verifies Mike's idea of what was going
>> wrong, and we're way overdue to get this fixed upstream. Thanks to
>> IBM for pushing this, I don't think any major distro is shipping
>> this patch and we don't want to keep having to back it out.
>>
>> The options look like:
>> 1) back out the session lock changes that split it into two locks
>> 2) add in the additional locking from this test patch
>> 3) some other fix for the issue of targets that complete tasks oddly
>>
>> I'm leaning to #1, as I don't want to keep adding more locks for
>> this.
>
> Thanks Chris! IIRC, the lock changes from Shlomo/Or are not in RHEL,
> SLES and Ubuntu anymore, as you mentioned. We requested them to
> revert the patch, and it was accepted. On the other hand, your patch
> is great and a cool fix for this. If we have any good numbers and/or
> reasons to keep their patch, I guess alternative #2 is cool too. I
> can perform more testing if you plan to send this (or a similar)
> patch to the iscsi list.
>
> Sagi, Or, Shlomo? You pushed to keep this from being backed out
> before. Here's your cause, any better ideas on fixing it? I also
> tried to go back in the mailing list archives, but I don't see any
> real numbers for the performance gains. I'll loop Sagi here based on
> the email I see he's using on the NVMe list currently - seems it's
> different from the one shown in the header of this message.

IIRC, this was brought up more than two years ago? It's been a while
now. The motivation for the fine-grained locking from Shlomo was to
address the submission/completion inter-locking scheme that was not
needed for iser. In iser, task completions are triggered from soft-irq
only for task responses; the data transfer is driven in HW, so we don't
need the inter-locking between submissions and task management or error
handling.
My recollection is that this scheme solved a contention point we had
back then; if I'm not mistaken it was as much as a 50% improvement in
IOPs scalability in some scenarios. Now, this was all pre block-mq.

So I think the correct solution for iscsi (iser, tcp and offloads) is
to use block-mq facilities for task pre-allocations (scsi host tagset)
and have iscsi tcp take care of its own locking instead of imposing it
inherently in libiscsi. We can have LOGIN, LOGOUT, NOOP_OUT, TEXT, TMR
as reserved tags, and queue_depth with max session cmds. I had a
prototype for that back when I experimented with the scsi-mq conversion
(way back...), but I kinda got stuck trying to figure out how to
convert the offload drivers qla4xxx, bnx2i and cxgbi, which seemed to
rely heavily on the task pools.

If people are more interested in improving iscsi locking schemes we can
discuss approaches for it.
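The reserved-tags idea can be illustrated with a toy tag set (a sketch of the scheme only, not the actual blk-mq interface): the first few tags are set aside for session commands so a full I/O queue can never starve LOGIN/LOGOUT/NOOP_OUT/TEXT/TMR.

```python
class TagSet:
    """Toy blk-mq-style tag set: the first nr_reserved tags are handed
    out only to callers that explicitly ask for a reserved tag."""
    RESERVED = ["LOGIN", "LOGOUT", "NOOP_OUT", "TEXT", "TMR"]

    def __init__(self, queue_depth):
        self.nr_reserved = len(self.RESERVED)
        self.free_reserved = set(range(self.nr_reserved))
        self.free_io = set(range(self.nr_reserved, queue_depth))

    def get_tag(self, reserved=False):
        pool = self.free_reserved if reserved else self.free_io
        return pool.pop() if pool else None

    def put_tag(self, tag):
        (self.free_reserved if tag < self.nr_reserved
         else self.free_io).add(tag)

# max session cmds = 32, plus the 5 reserved session-command tags.
tags = TagSet(queue_depth=32 + 5)
io = [tags.get_tag() for _ in range(32)]
assert all(t is not None and t >= 5 for t in io)  # I/O never takes reserved tags
assert tags.get_tag() is None                     # I/O tags exhausted...
assert tags.get_tag(reserved=True) is not None    # ...but a TMR still gets a tag
```

This is the same guarantee the blk-mq reserved-tag pool provides; the per-task locking questions for the offload drivers are a separate problem.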
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
Christoph suggested to me once that we can take a hybrid approach where
we consume a small amount of completions (say 4) right away from the
interrupt handler and, if we have more, we schedule irq-poll to reap
the rest. But back then it didn't work better, which is not aligned
with my observation that we consume only 1 completion per interrupt...
I can give it another go... What do people think about it?
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
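The hybrid scheme described above can be modeled as follows (a toy sketch, not driver code; the budget of 4 is the number from the paragraph, and `irq_poll_sched` is modeled as appending to a softirq queue):

```python
from collections import deque

IRQ_BUDGET = 4  # completions we are willing to reap in hard-irq context

def hard_irq(cq, softirq_queue):
    """Reap up to IRQ_BUDGET completions inline; if more remain,
    hand the rest to irq-poll (modelled as a softirq queue) instead
    of paying the softirq cost for every interrupt."""
    reaped = 0
    while cq and reaped < IRQ_BUDGET:
        cq.popleft()
        reaped += 1
    scheduled = bool(cq)
    if scheduled:
        softirq_queue.append(cq)   # irq_poll_sched() stand-in
    return reaped, scheduled

# One completion pending (the common case observed in the thread):
# handled entirely in the interrupt, no softirq scheduling cost.
soft = []
assert hard_irq(deque(["c0"]), soft) == (1, False) and not soft

# A burst of 10: 4 reaped inline, the remaining 6 deferred to irq-poll.
cq = deque(f"c{i}" for i in range(10))
assert hard_irq(cq, soft) == (4, True) and len(cq) == 6
```

The point of the hybrid is visible in the two cases: the 1-completion interrupt never touches the softirq path, while genuine bursts still get batched.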
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
>> I think you missed:
>> http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007
>
> I indeed did, thanks. But it doesn't help. We're still having to wait
> for the first interrupt, and if we're really fast that's the only
> completion we have to process. Try this:
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index b4b32e6..e2dd9e2 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -623,6 +623,8 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>  	}
>  	__nvme_submit_cmd(nvmeq, &cmnd);
>  	spin_unlock(&nvmeq->sq_lock);
> +	disable_irq_nosync(nvmeq_irq(irq));
> +	irq_poll_sched(&nvmeq->iop);

a. This would trigger a condition where we disable the irq twice, which
is wrong, at least because it will generate a warning.

b. This would cause way too many triggers of ksoftirqd. For it to be
effective we need it to run only when it should, optimally when the
completion queue has a batch of completions waiting.

After a deeper analysis, I agree with Bart that interrupt coalescing is
needed for this to work. The problem with nvme coalescing, as Jens
said, is the death penalty of 100us granularity. Hannes, Johannes, how
does it look with the devices you are testing with?

Also, I think that adaptive moderation is needed in order for it to
work well. I know that some networking drivers implemented adaptive
moderation in SW before having HW support for it. It can be done by
maintaining stats and having a periodic work that looks at them and
changes the moderation parameters. Does anyone think that this is
something we should consider?
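The software adaptive moderation idea (maintain stats, have a periodic work look at them and retune the moderation parameters) might look like this sketch; the thresholds, step sizes and bounds are all invented for illustration:

```python
class AdaptiveModeration:
    """Periodically retune an interrupt-coalescing threshold from the
    observed completions-per-interrupt rate, in the spirit of the
    software adaptive moderation some network drivers carried before
    hardware support existed. All tuning constants here are made up."""
    def __init__(self):
        self.coalesce_thresh = 1   # target completions per interrupt
        self.completions = 0
        self.interrupts = 0

    def record(self, ncomplete):
        """Called from the completion path: accumulate stats."""
        self.interrupts += 1
        self.completions += ncomplete

    def periodic_work(self):
        """Called from a periodic worker: inspect, adjust, reset."""
        if not self.interrupts:
            return
        rate = self.completions / self.interrupts
        if rate > self.coalesce_thresh * 2:
            # Heavy batching observed: coalesce more aggressively.
            self.coalesce_thresh = min(self.coalesce_thresh * 2, 64)
        elif rate < self.coalesce_thresh:
            # Light load: back off so latency doesn't suffer.
            self.coalesce_thresh = max(self.coalesce_thresh // 2, 1)
        self.completions = self.interrupts = 0

mod = AdaptiveModeration()
for _ in range(100):        # heavy load: 8 completions per interrupt
    mod.record(8)
mod.periodic_work()
assert mod.coalesce_thresh == 2   # threshold ramps up under load
```

A real implementation would feed the new threshold back into the device's coalescing registers; this sketch only shows the stats-plus-periodic-work control loop.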
Re: [PATCH 3/4] nvme: use blk_rq_payload_bytes
>> @@ -1014,9 +1013,9 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
>>  	}
>>
>> Christoph, a little above here we still look at blk_rq_bytes(),
>> shouldn't that look at blk_rq_payload_bytes() too?
>
> The check is ok for now as it's just zero vs non-zero. It's somewhat
> broken for Write Zeroes, though. I've fixed it in this series which I
> need to submit:
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/write-zeroes-take-2

I see, thanks for the clarification.
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
> Hannes just spotted this:
>
> static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
> 		const struct blk_mq_queue_data *bd)
> {
> [...]
> 	__nvme_submit_cmd(nvmeq, &cmnd);
> 	nvme_process_cq(nvmeq);
> 	spin_unlock_irq(&nvmeq->q_lock);
> 	return BLK_MQ_RQ_QUEUE_OK;
> out_cleanup_iod:
> 	nvme_free_iod(dev, req);
> out_free_cmd:
> 	nvme_cleanup_cmd(req);
> 	return ret;
> }
>
> So we're draining the CQ on submit. This of course makes polling for
> completions in the IRQ handler rather pointless as we already did it
> in the submission path.

I think you missed:
http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
> Your report provided these stats with one-completion dominance for
> the single-threaded case. Does it also hold if you run multiple fio
> threads per core?

It's useless to run more threads on that core, it's already fully
utilized. That single thread is already posting a fair amount of
submissions, so I don't see how adding more fio jobs can help in any
way.
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
> So what you say is you saw a consumed == 1 [1] most of the time?
>
> [1] from http://git.infradead.org/nvme.git/commitdiff/eed5a9d925c59e43980047059fde29e3aa0b7836

Exactly. By processing 1 completion per interrupt it makes perfect
sense why this performs poorly; it's not worth paying the soft-irq
schedule cost for only a single completion. What I'm curious about is
how consistent this is with different devices (wish I had some...)
Re: [PATCH 3/4] nvme: use blk_rq_payload_bytes
@@ -1014,9 +1013,9 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
 	}

Christoph, a little above here we still look at blk_rq_bytes(),
shouldn't that look at blk_rq_payload_bytes() too?

 	if (count == 1) {
-		if (rq_data_dir(rq) == WRITE &&
-		    map_len <= nvme_rdma_inline_data_size(queue) &&
-		    nvme_rdma_queue_idx(queue))
+		if (rq_data_dir(rq) == WRITE && nvme_rdma_queue_idx(queue) &&
+		    blk_rq_payload_bytes(rq) <=
+				nvme_rdma_inline_data_size(queue))
 			return nvme_rdma_map_sg_inline(queue, req, c);

 		if (dev->pd->flags & IB_PD_UNSAFE_GLOBAL_RKEY)
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
>> So it looks like we are super not efficient because most of the time
>> we catch 1 completion per interrupt and the whole point is that we
>> need to find more! This fio is single threaded with QD=32 so I'd
>> expect us to be somewhere in 8-31 almost all the time... I also
>> tried QD=1024, histogram is still the same.
>
> It looks like it takes you longer to submit an I/O than to service an
> interrupt,

Well, with irq-poll we do practically nothing in the interrupt handler,
we only schedule irq-poll. Note that the latency measures are only from
the point the interrupt arrives to the point we actually service it by
polling for completions.

> so increasing queue depth in the single-threaded case doesn't make
> much difference. You might want to try multiple threads per core with
> QD, say, 32

This is how I ran, QD=32.
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
Just for the record, all tests you've run are with the upper irq_poll_budget of 256 [1]? Yes, but that's the point, I never ever reach this budget because I'm only processing 1-2 completions per interrupt. We (Hannes and I) recently stumbled across this when trying to poll for more than 256 queue entries in the drivers we've been testing. What do you mean by stumbled? irq-poll should be agnostic to the fact that drivers can poll more than their given budget? Did your system load reduce with irq polling? In theory it should, but I have seen increases with AHCI at least according to fio. IIRC Hannes saw decreases with his SAS HBA tests, as expected. I didn't see any reduction. When I tested on a single cpu core (to simplify for a single queue) the cpu was at 100% but got fewer iops (which makes sense, a single cpu-core is not enough to max out the nvme device, at least not the core I'm using). Before irqpoll I got ~230 KIOPs on a single cpu-core and after irqpoll I got ~205 KIOPs, which is consistent with the ~10% iops decrease I've reported in the original submission.
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
Oh, and the current code that was tested can be found at: git://git.infradead.org/nvme.git nvme-irqpoll
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
Hey, so I made some initial analysis of what's going on with irq-poll.

First, I sampled how much time passes between the interrupt arriving in nvme_irq and the initial visit to nvme_irqpoll_handler. I ran a single-threaded fio with QD=32 of 4K reads. These are two displays of a histogram of the latency (ns):

--
[1] queue = b'nvme0q1'
    usecs     : count
    0 -> 1    : 7310
    2 -> 3    : 11
    4 -> 7    : 10
    8 -> 15   : 20
    16 -> 31  : 0
    32 -> 63  : 0
    64 -> 127 : 1

[2] queue = b'nvme0q1'
    usecs     : count
    0 -> 1    : 7309
    2 -> 3    : 14
    4 -> 7    : 7
    8 -> 15   : 17
--

We can see that most of the time our latency is pretty good (<1ns) but with huge tail latencies (some 8-15 ns and even one in 32-63 ns).

**NOTE: in order to reduce the tracing impact on performance I sampled once every 100 interrupts.

I also sampled for multiple threads/queues with QD=32 of 4K reads. This is a collection of histograms for 5 queues (5 fio threads):

queue = b'nvme0q1'
    usecs     : count
    0 -> 1    : 701
    2 -> 3    : 177
    4 -> 7    : 56
    8 -> 15   : 24
    16 -> 31  : 6
    32 -> 63  : 1

queue = b'nvme0q2'
    usecs     : count
    0 -> 1    : 412
    2 -> 3    : 52
    4 -> 7    : 19
    8 -> 15   : 13
    16 -> 31  : 5

queue = b'nvme0q3'
    usecs      : count
    0 -> 1     : 381
    2 -> 3     : 74
    4 -> 7     : 26
    8 -> 15    : 12
    16 -> 31   : 3
    32 -> 63   : 0
    64 -> 127  : 0
    128 -> 255 : 1

queue = b'nvme0q4'
    usecs     : count
    0 -> 1    : 386
    2 -> 3    : 63
    4 -> 7    : 30
    8 -> 15   : 11
    16 -> 31  : 7
    32 -> 63  : 1

queue = b'nvme0q5'
    usecs     : count
    0 -> 1    : 384
    2 -> 3    : 69
    4 -> 7    : 25
    8 -> 15   : 15
    16 -> 31  : 3

Overall looks pretty much the same but some more samples with tails...

Next, I sampled how many completions we are able to consume per interrupt. Two examples of histograms of how many completions we take per interrupt.
--
queue = b'nvme0q1'
    completed : count
    0         : 0
    1         : 11690
    2         : 46
    3         : 1

queue = b'nvme0q1'
    completed : count
    0         : 0
    1         : 944
    2         : 8
--

So it looks like we are super inefficient because most of the time we catch 1 completion per interrupt, and the whole point is that we need to find more! This fio is single-threaded with QD=32 so I'd expect that we'd be somewhere in 8-31 almost all the time... I also tried QD=1024, the histogram is still the same.

**NOTE: here I also sampled once every 100 interrupts.

I'll try to run the counter on the current nvme driver and see what I get.

I attached the bpf scripts I wrote (nvme-trace-irq, nvme-count-comps) with hope that someone is interested enough to try and reproduce these numbers on his/her setup and maybe suggest some other useful tracing we can do. Prerequisites: 1. iovisor is needed for python bpf support. $ echo "deb [trusted=yes] https://repo.iovisor.org/apt/xenial
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
--
[1] queue = b'nvme0q1'
    usecs     : count
    0 -> 1    : 7310
    2 -> 3    : 11
    4 -> 7    : 10
    8 -> 15   : 20
    16 -> 31  : 0
    32 -> 63  : 0
    64 -> 127 : 1

[2] queue = b'nvme0q1'
    usecs     : count
    0 -> 1    : 7309
    2 -> 3    : 14
    4 -> 7    : 7
    8 -> 15   : 17
--

Rrr, email made the histograms look funky (tabs vs. spaces...) The count is what's important anyways... Just adding that I used an Intel P3500 nvme device.

We can see that most of the time our latency is pretty good (<1ns) but with huge tail latencies (some 8-15 ns and even one in 32-63 ns).

Obviously this is micro-seconds and not nano-seconds (I wish...)
Re: [PATCH 4/4] sd: remove __data_len hack for WRITE SAME
Looks good, Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH 3/4] nvme: use blk_rq_payload_bytes
This looks good, Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH 2/4] scsi: use blk_rq_payload_bytes
Looks good, Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH 1/4] block: add blk_rq_payload_bytes
Add a helper to calculate the actual data transfer size for special payload requests.

Signed-off-by: Christoph Hellwig

---
 include/linux/blkdev.h | 13 +
 1 file changed, 13 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ff3d774..1ca8e8f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1000,6 +1000,19 @@ static inline unsigned int blk_rq_cur_sectors(const struct request *rq)
 	return blk_rq_cur_bytes(rq) >> 9;
 }

+/*
+ * Some commands like WRITE SAME have a payload or data transfer size which

Don't you mean here DISCARD?

+ * is different from the size of the request. Any driver that supports such
+ * commands using the RQF_SPECIAL_PAYLOAD flag needs to use this helper to
+ * calculate the data transfer size.
+ */
+static inline unsigned int blk_rq_payload_bytes(struct request *rq)
+{
+	if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
+		return rq->special_vec.bv_len;
+	return blk_rq_bytes(rq);
+}
+
 static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q, int op)
 {
Re: [Lsf-pc] [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
**Note: when I ran multiple threads on more cpus the performance degradation phenomenon disappeared, but I tested on a VM with qemu emulation backed by null_blk so I figured I had some other bottleneck somewhere (that's why I asked for some more testing). That could be because of the vmexits, as every MMIO access in the guest triggers a vmexit, and if you poll with a low budget you do more MMIOs, hence you have more vmexits. Did you do testing only in qemu or with real H/W as well? I tried once. IIRC, I saw the same phenomena...
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
I agree with Jens that we'll need some analysis if we want the discussion to be effective, and I can spend some time on this if I can find volunteers with high-end nvme devices (I only have access to client nvme devices). I have a P3700 but somehow burned the FW. Let me see if I can bring it back to life. I also have converted AHCI to the irq_poll interface and will run some tests. I do also have some hpsa devices on which I could run tests once the driver is adopted. But can we agree on a common testing methodology so as not to compare apples with oranges? Sagi, do you still have the fio job file from your last tests lying somewhere, and if yes could you share it? It's pretty basic:

--
[global]
group_reporting
cpus_allowed=0
cpus_allowed_policy=split
rw=randrw
bs=4k
numjobs=4
iodepth=32
runtime=60
time_based
loops=1
ioengine=libaio
direct=1
invalidate=1
randrepeat=1
norandommap
exitall

[job]
--

**Note: when I ran multiple threads on more cpus the performance degradation phenomenon disappeared, but I tested on a VM with qemu emulation backed by null_blk so I figured I had some other bottleneck somewhere (that's why I asked for some more testing). Note that I ran randrw because I was backed with null_blk; testing with a real nvme device, you should either run randread or write, and if you do a write, you can't run it multi-threaded (well you can, but you'll get unpredictable performance...).
Re: [Lsf-pc] [LFS/MM TOPIC][LFS/MM ATTEND]: - Storage Stack and Driver Testing methodology.
Hi Folks, I would like to propose a general discussion on storage stack and device driver testing.

I think it's very useful and needed.

Purpose:- The main objective of this discussion is to address the need for a unified test automation framework which can be used by different subsystems in the kernel in order to improve the overall development and stability of the storage stack.

For example:- from my previous experience, I've worked on NVMe driver testing last year and we have developed a simple unit test framework (https://github.com/linux-nvme/nvme-cli/tree/master/tests).

The current implementation of the upstream NVMe driver supports the following subsystems:-
1. PCI Host.
2. RDMA Target.
3. Fibre Channel Target (in progress).

Today, due to the lack of a centralized automated test framework, NVMe driver testing is scattered and performed using a combination of various utilities like nvme-cli/tests, nvmet-cli, shell scripts (git://git.infradead.org/nvme-fabrics.git nvmf-selftests) etc.

In order to improve overall driver stability with various subsystems, it will be beneficial to have a Unified Test Automation Framework (UTAF) which will centralize overall testing. This topic will allow developers from various subsystems to engage in the discussion about how to collaborate efficiently instead of having discussions on lengthy email threads.

While a unified test framework for all sounds great, I suspect that the differences might be too large. So I think that for this framework to be maintainable, it needs to be carefully designed such that we don't have too much code churn. For example, we should start by classifying tests and then see where sharing is feasible:

1. basic management - I think not a lot can be shared
2. spec compliance - again, not much sharing here
3. data-verification - probably everything can be shared
4. basic performance - probably a lot can be shared
5. vectored-io - probably everything can be shared
6. error handling - I can think of some sharing that can be used.

This repository can also store some useful tracing scripts (ebpf and friends) that are useful for performance analysis.

So I think that for this to happen, we can start with the shared tests under block/, then migrate proto-specific tests into scsi/, nvme/, and then add transport-specific tests so we can have something like:

├── block
├── lib
├── nvme
│   ├── fabrics
│   │   ├── loop
│   │   └── rdma
│   └── pci
└── scsi
    ├── fc
    └── iscsi

Thoughts?
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
A typical Ethernet network adapter delays the generation of an interrupt after it has received a packet. A typical block device or HBA does not delay the generation of an interrupt that reports an I/O completion.

>>> NVMe allows for configurable interrupt coalescing, as do a few modern
>>> SCSI HBAs.
>>
>> Essentially every modern SCSI HBA does interrupt coalescing; otherwise
>> the queuing interface won't work efficiently.
>
> Hello Hannes,
>
> The first e-mail in this e-mail thread referred to measurements against a
> block device for which interrupt coalescing was not enabled. I think that
> the measurements have to be repeated against a block device for which
> interrupt coalescing is enabled.

Hey Bart, I see how interrupt coalescing can help, but even without it, I think it should be better. Moreover, I don't think that strict moderation is something that can work. The only way interrupt moderation can be effective is if it's adaptive and adjusts itself to the workload. Note that this feature is on by default in most modern Ethernet devices (adaptive-rx). IMHO, irq-poll vs. interrupt polling should be compared without relying on the underlying device capabilities.
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
I'd like to attend LSF/MM and would like to discuss polling for block drivers. Currently there is blk-iopoll but it is not as widely used as NAPI in the networking field, and according to Sagi's findings in [1] performance with polling is not on par with IRQ usage. At LSF/MM I'd like to discuss whether it is desirable to have NAPI-like polling in more block drivers and how to overcome the currently seen performance issues. [1] http://lists.infradead.org/pipermail/linux-nvme/2016-October/006975.html

A typical Ethernet network adapter delays the generation of an interrupt after it has received a packet. A typical block device or HBA does not delay the generation of an interrupt that reports an I/O completion. I think that is why polling is more effective for network adapters than for block devices. I'm not sure whether it is possible to achieve benefits similar to NAPI for block devices without implementing interrupt coalescing in the block device firmware. Note: for block device implementations that use the RDMA API, the RDMA API supports interrupt coalescing (see also ib_modify_cq()).

Hey Bart, I don't agree that interrupt coalescing is the reason why irq-poll is not suitable for nvme or storage devices. First, when the nvme device fires an interrupt, the driver consumes the completion(s) from the interrupt (usually there will be some more completions waiting in the cq by the time the host starts processing it). With irq-poll, we disable further interrupts and schedule soft-irq for processing, which, if anything, improves the completions-per-interrupt utilization (because it takes slightly longer before processing the cq). Moreover, irq-poll is budgeting the completion queue processing, which is important for a couple of reasons:

1. it prevents hard-irq context abuse like we do today. if other cpu cores are pounding with more submissions on the same queue, we might get into a hard-lockup (which I've seen happening).

2. 
irq-poll maintains fairness between devices by correctly budgeting the processing of different completion queues that share the same affinity. This can become crucial when working with multiple nvme devices, each having multiple io queues that share the same IRQ assignment.

3. It reduces (or at least should reduce) the overall number of interrupts in the system because we only enable interrupts again when the completion queue is completely processed.

So overall, I think it's very useful for nvme and other modern HBAs, but unfortunately, other than solving (1), I wasn't able to see a performance improvement but rather a slight regression, and I can't explain where it's coming from...
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
Hi all, I'd like to attend LSF/MM and would like to discuss polling for block drivers. Currently there is blk-iopoll but it is not as widely used as NAPI in the networking field, and according to Sagi's findings in [1] performance with polling is not on par with IRQ usage. At LSF/MM I'd like to discuss whether it is desirable to have NAPI-like polling in more block drivers and how to overcome the currently seen performance issues.

It would be an interesting topic to discuss, as it is a shame that blk-iopoll isn't used more widely.

Forgot to mention - it should only be a topic if experimentation has been done and results gathered to pinpoint what the issues are, so we have something concrete to discuss. I'm not at all interested in a hand-wavy discussion on the topic.

Hey all, Indeed I attempted to convert nvme to use irq-poll (let's use its new name) but experienced some unexplained performance degradations. Keith reported a 700ns degradation for QD=1 with his Xpoint devices; this sort of degradation is acceptable I guess because we do schedule a soft-irq before consuming the completion, but I noticed ~10% IOPs degradation for QD=32 which is not acceptable.

I agree with Jens that we'll need some analysis if we want the discussion to be effective, and I can spend some time on this if I can find volunteers with high-end nvme devices (I only have access to client nvme devices).

I can add debugfs statistics on the average number of completions I consume per interrupt, and I can also trace the interrupt and the soft-irq start/end. Any other interesting stats I can add?

I also tried a hybrid mode where the first 4 completions were handled in the interrupt and the rest in soft-irq, but that didn't make much of a difference. Any other thoughts?
Re: [PATCH 04/12] target: avoid to access .bi_vcnt directly
Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH v5 12/14] SRP transport, scsi-mq: Wait for .queue_rq() if necessary
and again, Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH v5 13/14] nvme: Fix a race condition related to stopping queues
Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH v5 11/14] SRP transport: Move queuecommand() wait code to SCSI core
Again, Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH v5 08/14] blk-mq: Add a kick_requeue_list argument to blk_mq_requeue_request()
Looks useful, Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH v5 07/14] blk-mq: Introduce blk_mq_quiesce_queue()
Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH v5 06/14] blk-mq: Remove blk_mq_cancel_requeue_work()
Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH v5 05/14] blk-mq: Avoid that requeueing starts stopped queues
Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH 2/3] scsi: allow LLDDs to expose the queue mapping to blk-mq
Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH] Avoid that SCSI device removal through sysfs triggers a deadlock
Hey Bart, The solution I prefer is to modify the SCSI scanning code such that the scan_mutex is only held while performing the actual LUN scanning and while ensuring that no SCSI device has been created yet for a certain LUN number, but not while the Linux device and its sysfs attributes are created. Since that approach would require extensive changes in the SCSI scanning code, another approach has been chosen, namely to make self-removal asynchronous. This patch avoids that self-removal triggers the following deadlock: Is this a real deadlock, or just a lockdep complaint? Wouldn't making scsi_remove_device() take a single-depth mutex_lock_nested suffice here?
Re: [PATCH 11/12] nvme: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code
Looks good, Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH 01/12] blk-mq: Do not invoke .queue_rq() for a stopped queue
Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: A question regarding "multiple SGL"
Hi Robert, Hey Robert, Christoph, please explain your use cases that isn't handled. The one and only reason to set MSDBD to 1 is to make the code a lot simpler given that there is no real use case for supporting more. RDMA uses memory registrations to register large and possibly discontiguous data regions for a single rkey, aka single SGL descriptor in NVMe terms. There would be two reasons to support multiple SGL descriptors: a) to support a larger I/O size than supported by a single MR, or b) to support a data region format not mappable by a single MR. iSER only supports a single rkey (or stag in IETF terminology) and has been doing fine on a) and mostly fine on b). There are a few possible data layouts not supported by the traditional IB/iWarp FR WRs, but the limit is in fact exactly the same as imposed by the NVMe PRPs used for PCIe NVMe devices, so the Linux block layer has support to not generate them. Also with modern Mellanox IB/RoCE hardware we can actually register completely arbitrary SGLs. iSER supports using this registration mode already with a trivial code addition, but for NVMe we didn't have a pressing need yet. Good explanation :) The IO transfer size is a bit more pressing on some devices (e.g. cxgb3/4) where the number of pages per-MR can be indeed lower than a reasonable transfer size (Steve can correct me if I'm wrong). However, if there is a real demand for this we'll happily accept patches :) Just a note, having this feature in-place can bring unexpected behavior depending on how we implement it: - If we can use multiple MRs per IO (for multiple SGLs) we can either prepare for the worst-case and allocate enough MRs to satisfy the various IO patterns. This will be much heavier in terms of resource allocation and can limit the scalability of the host driver. - Or we can implement a shared MR pool with a reasonable number of MRs. In this case each IO can consume one or more MRs on the expense of other IOs. 
In this case we may need to requeue the IO later when we have enough available MRs to satisfy the IO. This can yield some unexpected performance gaps for some workloads. Cheers, Sagi.
Re: [PATCH 02/12] blk-mq: Introduce blk_mq_hctx_stopped()
Looks fine, Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH 04/12] blk-mq: Move more code into blk_mq_direct_issue_request()
Looks good, Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH 10/12] SRP transport, scsi-mq: Wait for .queue_rq() if necessary
Thanks for moving it, Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH 09/12] SRP transport: Move queuecommand() wait code to SCSI core
Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH v2 4/7] blk-mq: Introduce blk_quiesce_queue() and blk_resume_queue()
Hello Ming, Can you have a look at the attached patch? That patch uses an srcu read lock for all queue types, whether or not the BLK_MQ_F_BLOCKING flag has been set. Additionally, I have dropped the QUEUE_FLAG_QUIESCING flag. Just like previous versions, this patch has been tested.

Hey Bart, Do we care about the synchronization of queue_rq and/or blk_mq_run_hw_queue if the hctx is not stopped? I'm wondering if we can avoid introducing new barriers in the submission path if it's not absolutely needed.

Hello Sagi, Hey Bart, I'm not sure whether the new blk_quiesce_queue() function is useful without stopping all hardware contexts first. In other words, in my view setting the BLK_MQ_F_BLOCKING flag before calling blk_quiesce_queue() is sufficient and I don't think that a new QUEUE_FLAG_QUIESCING flag is necessary.

I was referring to whether we can take srcu in the submission path conditional on the hctx being STOPPED?
Re: [PATCH v2 4/7] blk-mq: Introduce blk_quiesce_queue() and blk_resume_queue()
Hello Ming, Can you have a look at the attached patch? That patch uses an srcu read lock for all queue types, whether or not the BLK_MQ_F_BLOCKING flag has been set. Additionally, I have dropped the QUEUE_FLAG_QUIESCING flag. Just like previous versions, this patch has been tested.

Hey Bart, Do we care about the synchronization of queue_rq and/or blk_mq_run_hw_queue if the hctx is not stopped? I'm wondering if we can avoid introducing new barriers in the submission path if it's not absolutely needed.
Re: [PATCH v2 1/7] blk-mq: Introduce blk_mq_queue_stopped()
Looks good, Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH v2 3/7] [RFC] nvme: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code
Make nvme_requeue_req() check BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED. Remove the QUEUE_FLAG_STOPPED manipulations that became superfluous because of this change. This patch fixes a race condition: using queue_flag_clear_unlocked() is not safe if any other function that manipulates the queue flags can be called concurrently, e.g. blk_cleanup_queue(). Untested. This looks good to me, but I know Keith had all sorts of creative ways to challenge this area, so I'd wait for his input...
Re: [PATCH v2 7/7] [RFC] nvme: Fix a race condition
Avoid that nvme_queue_rq() is still running when nvme_stop_queues() returns. Untested. Signed-off-by: Bart Van Assche <bart.vanass...@sandisk.com> Cc: Keith Busch <keith.bu...@intel.com> Cc: Christoph Hellwig <h...@lst.de> Cc: Sagi Grimberg <s...@grimberg.me> Bart, this looks really good, and possibly fixes an issue I've been chasing with fabrics a while ago. I'll take it for testing, but you can add my: Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH v2 6/7] SRP transport: Port srp_wait_for_queuecommand() to scsi-mq
+static void srp_mq_wait_for_queuecommand(struct Scsi_Host *shost)
+{
+	struct scsi_device *sdev;
+	struct request_queue *q;
+
+	shost_for_each_device(sdev, shost) {
+		q = sdev->request_queue;
+
+		blk_mq_quiesce_queue(q);
+		blk_mq_resume_queue(q);
+	}
+}
+

This *should* live in scsi_lib.c. I suspect that various drivers would really want this functionality. Thoughts?
Re: [PATCH 1/5] nvme-fabrics: Add FC transport FC-NVME definitions
Looks fine, Reviewed-by: Christoph Hellwig

Hey James, Can you collect review tags, address CR comments and resend the series? I'd like to stage these for 0-day testing and try to get it into 4.9. Thanks, Sagi.
Re: IB/isert: Return value of iser target transport handlers ignored by iscsi target
Hi, Hi Baharat, In iSER target during iwarp connection tear-down due to ping timeouts, the rdma queues are set to error state and subsequent queued iscsi session commands posted shall fail with corresponding errno returned by ib_post_send/recv. At this stage iser target handlers (Ex: isert_put_datain()) return the error to iscsi target, but these errors are not being handled by the iscsi target handlers (Ex: lio_queue_status()). Indeed looks like lio_queue_data_in() assumes that ->iscsit_queue_data_in() cannot fail. This either means that isert_put_datain() should take care of the extra put in this case, or the iscsi should correctly handle a queue_full condition (because we should not be hitting this in the normal IO flow). Does this (untested) patch help? -- diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c index a990c04208c9..ff5dfc0f7c50 100644 --- a/drivers/infiniband/ulp/isert/ib_isert.c +++ b/drivers/infiniband/ulp/isert/ib_isert.c @@ -1810,6 +1810,7 @@ isert_put_response(struct iscsi_conn *conn, struct iscsi_cmd *cmd) struct ib_send_wr *send_wr = &isert_cmd->tx_desc.send_wr; struct iscsi_scsi_rsp *hdr = (struct iscsi_scsi_rsp *) &isert_cmd->tx_desc.iscsi_header; + int ret; isert_create_send_desc(isert_conn, isert_cmd, &isert_cmd->tx_desc); iscsit_build_rsp_pdu(cmd, conn, true, hdr); @@ -1848,7 +1849,10 @@ isert_put_response(struct iscsi_conn *conn, struct iscsi_cmd *cmd) isert_dbg("Posting SCSI Response\n"); - return isert_post_response(isert_conn, isert_cmd); + ret = isert_post_response(isert_conn, isert_cmd); + if (ret) + target_put_sess_cmd(&cmd->se_cmd); + return 0; } static void @@ -2128,7 +2132,7 @@ isert_put_datain(struct iscsi_conn *conn, struct iscsi_cmd *cmd) struct isert_conn *isert_conn = conn->context; struct ib_cqe *cqe = NULL; struct ib_send_wr *chain_wr = NULL; - int rc; + int ret; isert_dbg("Cmd: %p RDMA_WRITE data_length: %u\n", isert_cmd, se_cmd->data_length); @@ -2148,34 +2152,41 @@ isert_put_datain(struct 
iscsi_conn *conn, struct iscsi_cmd *cmd) isert_init_send_wr(isert_conn, isert_cmd, &isert_cmd->tx_desc.send_wr); - rc = isert_post_recv(isert_conn, isert_cmd->rx_desc); - if (rc) { - isert_err("ib_post_recv failed with %d\n", rc); - return rc; + ret = isert_post_recv(isert_conn, isert_cmd->rx_desc); + if (ret) { + isert_err("ib_post_recv failed with %d\n", ret); + goto out; } chain_wr = &isert_cmd->tx_desc.send_wr; } - isert_rdma_rw_ctx_post(isert_cmd, isert_conn, cqe, chain_wr); + ret = isert_rdma_rw_ctx_post(isert_cmd, isert_conn, cqe, chain_wr); isert_dbg("Cmd: %p posted RDMA_WRITE for iSER Data READ\n", isert_cmd); - return 1; +out: + if (ret) + target_put_sess_cmd(&cmd->se_cmd); + return 0; } static int isert_get_dataout(struct iscsi_conn *conn, struct iscsi_cmd *cmd, bool recovery) { struct isert_cmd *isert_cmd = iscsit_priv_cmd(cmd); + int ret; isert_dbg("Cmd: %p RDMA_READ data_length: %u write_data_done: %u\n", isert_cmd, cmd->se_cmd.data_length, cmd->write_data_done); isert_cmd->tx_desc.tx_cqe.done = isert_rdma_read_done; - isert_rdma_rw_ctx_post(isert_cmd, conn->context, + ret = isert_rdma_rw_ctx_post(isert_cmd, conn->context, &isert_cmd->tx_desc.tx_cqe, NULL); isert_dbg("Cmd: %p posted RDMA_READ memory for ISER Data WRITE\n", isert_cmd); +out: + if (ret) + target_put_sess_cmd(&cmd->se_cmd); return 0; } -- -> While closing the session in case of ping timeouts, isert_close_connection()-> isert_wait_conn()->isert_wait4cmds() checks for the queued session commands and waits infinitely for command completion 'cmd_wait_comp' in target_wait_for_sess_cmds(). 'cmd_wait_comp' will never complete as the kref on the session command is not derefed, due to which target_release_cmd_kref() is not called by kref_put(). Due to this, the older session is not cleared, causing the next login negotiation to fail as the older session is still active (older SID exists). Makes sense... 
[Query 1] If the return values of ib_post_send/recv() are handled to deref the corresponding queued session commands, the wait on cmd_wait_comp will complete and the session will be cleared off successfully. Is this the right way to do it here? I think that given that ->iscsit_queue_data_in() and ->iscsit_queue_status() are never expected to fail we probably should take care of it in isert... Nic, any input on the iscsit side? [Query 2] An extra deref is done in case of
Re: [PATCH v2 1/3] block: provide helpers for reading block count
Looks good, for the series: Reviewed-by: Sagi Grimberg
Re: Connect-IB not performing as well as ConnectX-3 with iSER
Let me see if I get this correct: 4.5.0_rc3_1aaa57f5_00399 sdc;10.218.128.17;4627942;1156985;18126 sdf;10.218.202.17;4590963;1147740;18272 sdk;10.218.203.17;4564980;1141245;18376 sdn;10.218.204.17;4571946;1142986;18348 sdd;10.219.128.17;4591717;1147929;18269 sdi;10.219.202.17;4505644;1126411;18618 sdg;10.219.203.17;4562001;1140500;18388 sdl;10.219.204.17;4583187;1145796;18303 sde;10.220.128.17;5511568;1377892;15220 sdh;10.220.202.17;551;137;15209 sdj;10.220.203.17;5609983;1402495;14953 sdm;10.220.204.17;5509035;1377258;15227 In 1aaa57f5 you get on CIB ~115K IOPs per sd device and on CX3 you get around 140K IOPs per sd device. Mlx5_0;sde;3593013;898253;23347 100% CPU kworker/u69:2 Mlx5_0;sdd;3588555;897138;23376 100% CPU kworker/u69:2 Mlx4_0;sdc;3525662;881415;23793 100% CPU kworker/u68:0 Is this on the host or the target? 4.5.0_rc5_7861728d_1 sdc;10.218.128.17;3747591;936897;22384 sdf;10.218.202.17;3750607;937651;22366 sdh;10.218.203.17;3750439;937609;22367 sdn;10.218.204.17;3771008;942752;22245 sde;10.219.128.17;3867678;966919;21689 sdg;10.219.202.17;3781889;945472;22181 sdk;10.219.203.17;3791804;947951;22123 sdl;10.219.204.17;3795406;948851;22102 sdd;10.220.128.17;5039110;1259777;16647 sdi;10.220.202.17;4992921;1248230;16801 sdj;10.220.203.17;5015610;1253902;16725 Sdm;10.220.204.17;5087087;1271771;16490 In 7861728d you get on CIB ~95K IOPs per sd device and on CX3 you get around 125K IOPs per sd device. I don't see any difference in the code around iser/isert, in fact, I don't see any commit in drivers/infiniband Mlx5_0;sde;2930722;732680;28623 ~98% CPU kworker/u69:0 Mlx5_0;sdd;2910891;727722;28818 ~98% CPU kworker/u69:0 Mlx4_0;sdc;3263668;815917;25703 ~98% CPU kworker/u68:0 Again, host or target? 
4.5.0_rc5_f81bf458_00018 sdb;10.218.128.17;5023720;1255930;16698 sde;10.218.202.17;5016809;1254202;16721 sdj;10.218.203.17;5021915;1255478;16704 sdk;10.218.204.17;5021314;1255328;16706 sdc;10.219.128.17;4984318;1246079;16830 sdf;10.219.202.17;4986096;1246524;16824 sdh;10.219.203.17;5043958;1260989;16631 sdm;10.219.204.17;5032460;1258115;16669 sdd;10.220.128.17;3736740;934185;22449 sdg;10.220.202.17;3728767;932191;22497 sdi;10.220.203.17;3752117;938029;22357 Sdl;10.220.204.17;3763901;940975;22287 In f81bf458 you get on CIB ~125K IOPs per sd device and on CX3 you get around 93K IOPs per sd device which is the other way around? CIB is better than CX3? The commits in this gap are: f81bf458208e iser-target: Separate flows for np listeners and connections cma events aea92980601f iser-target: Add new state ISER_CONN_BOUND to isert_conn b89a7c25462b iser-target: Fix identification of login rx descriptor type None of those should affect the data-path. Srpt keeps crashing couldn't test 4.5.0_rc5_5adabdd1_00023 Sdc;10.218.128.17;3726448;931612;22511 ~97% CPU kworker/u69:4 sdf;10.218.202.17;3750271;937567;22368 sdi;10.218.203.17;3749266;937316;22374 sdj;10.218.204.17;3798844;949711;22082 sde;10.219.128.17;3759852;939963;22311 ~97% CPU kworker/u69:4 sdg;10.219.202.17;3772534;943133;22236 sdl;10.219.203.17;3769483;942370;22254 sdn;10.219.204.17;3790604;947651;22130 sdd;10.220.128.17;5171130;1292782;16222 ~96% CPU kworker/u68:3 sdh;10.220.202.17;5105354;1276338;16431 sdk;10.220.203.17;4995300;1248825;16793 sdm;10.220.204.17;4959564;1239891;16914 In 5adabdd1 you get on CIB ~94K IOPs per sd device and on CX3 you get around 130K IOPs per sd device which means you flipped again (very strange). 
The commits in this gap are: 5adabdd122e4 iser-target: Split and properly type the login buffer ed1083b251f0 iser-target: Remove ISER_RECV_DATA_SEG_LEN 26c7b673db57 iser-target: Remove impossible condition from isert_wait_conn 69c48846f1c7 iser-target: Remove redundant wait in release_conn 6d1fba0c2cc7 iser-target: Rework connection termination Again, none are suspected to implicate the data-plane. Srpt crashes 4.5.0_rc5_07b63196_00027 sdb;10.218.128.17;3606142;901535;23262 sdg;10.218.202.17;3570988;892747;23491 sdf;10.218.203.17;3576011;894002;23458 sdk;10.218.204.17;3558113;889528;23576 sdc;10.219.128.17;3577384;894346;23449 sde;10.219.202.17;3575401;893850;23462 sdj;10.219.203.17;3567798;891949;23512 sdl;10.219.204.17;3584262;896065;23404 sdd;10.220.128.17;4430680;1107670;18933 sdh;10.220.202.17;4488286;1122071;18690 sdi;10.220.203.17;4487326;1121831;18694 sdm;10.220.204.17;4441236;1110309;1 In 5adabdd1 you get on CIB ~89K IOPs per sd device and on CX3 you get around 112K IOPs per sd device The commits in this gap are: e3416ab2d156 iser-target: Kill the ->isert_cmd back pointer in struct iser_tx_desc d1ca2ed7dcf8 iser-target: Kill struct isert_rdma_wr 9679cc51eb13 iser-target: Convert to new CQ API Which do effect the data-path, but nothing that can explain a specific CIB issue. Moreover, the perf drop happened before that. Srpt crashes 4.5.0_rc5_5e47f198_00036 sdb;10.218.128.17;3519597;879899;23834 sdi;10.218.202.17;3512229;878057;23884
Re: Connect-IB not performing as well as ConnectX-3 with iSER
sdf;10.218.203.17;3576011;894002;23458 sdk;10.218.204.17;3558113;889528;23576 sdc;10.219.128.17;3577384;894346;23449 sde;10.219.202.17;3575401;893850;23462 sdj;10.219.203.17;3567798;891949;23512 sdl;10.219.204.17;3584262;896065;23404 sdd;10.220.128.17;4430680;1107670;18933 sdh;10.220.202.17;4488286;1122071;18690 sdi;10.220.203.17;4487326;1121831;18694 sdm;10.220.204.17;4441236;1110309;1 Srpt crashes 4.5.0_rc5_5e47f198_00036 sdb;10.218.128.17;3519597;879899;23834 sdi;10.218.202.17;3512229;878057;23884 sdh;10.218.203.17;3518563;879640;23841 sdk;10.218.204.17;3582119;895529;23418 sdd;10.219.128.17;3550883;887720;23624 sdj;10.219.202.17;3558415;889603;23574 sde;10.219.203.17;3552086;888021;23616 sdl;10.219.204.17;3579521;894880;23435 sdc;10.220.128.17;4532912;1133228;18506 sdf;10.220.202.17;4558035;1139508;18404 sdg;10.220.203.17;4601035;1150258;18232 sdm;10.220.204.17;4548150;1137037;18444 srpt crashes 4.6.2 vanilla default config sde;10.218.128.17;3431063;857765;24449 sdf;10.218.202.17;3360685;840171;24961 sdi;10.218.203.17;3355174;838793;25002 sdm;10.218.204.17;3360955;840238;24959 sdd;10.219.128.17;3337288;834322;25136 sdh;10.219.202.17;3327492;831873;25210 sdj;10.219.203.17;3380867;845216;24812 sdk;10.219.204.17;3418340;854585;24540 sdc;10.220.128.17;4668377;1167094;17969 sdg;10.220.202.17;4716675;1179168;17785 sdl;10.220.203.17;4675663;1168915;17941 sdn;10.220.204.17;4631519;1157879;18112 Mlx5_0;sde;3390021;847505;24745 ~98% CPU kworker/u69:3 Mlx5_0;sdd;3207512;801878;26153 ~98% CPU kworker/u69:3 Mlx4_0;sdc;2998072;749518;27980 ~98% CPU kworker/u68:0 4.7.0_rc3_5edb5649 sdc;10.218.128.17;3260244;815061;25730 sdg;10.218.202.17;3405988;851497;24629 sdh;10.218.203.17;3307419;826854;25363 sdm;10.218.204.17;3430502;857625;24453 sdi;10.219.128.17;3544282;886070;23668 sdj;10.219.202.17;3412083;853020;24585 sdk;10.219.203.17;3422385;855596;24511 sdl;10.219.204.17;3444164;861041;24356 sdb;10.220.128.17;4803646;1200911;17463 sdd;10.220.202.17;4832982;1208245;17357 
sde;10.220.203.17;4809430;1202357;17442 sdf;10.220.204.17;4808878;1202219;17444 mlx5_0;sdd;2986864;746716;28085 mlx5_0;sdc;2963648;740912;28305 mlx4_0;sdb;3317228;829307;25288 Thanks, Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Tue, Jun 21, 2016 at 8:50 AM, Robert LeBlanc <rob...@leblancnet.us> wrote: Sagi, I'm working to implement SRP (I think I got it all working) to test some of the commits. I can try TGT afterwards and the commit you mention. I haven't been watching the CPU lately, but before when I was doing a lot of testing, there wasn't any one thread that was at 100%. There are several threads that have high utilization, but none 100% and there is plenty of CPU capacity available (32 cores). I can capture some of that data if it is helpful. I did test 4.7_rc3 on Friday, but it didn't change much, is that "new" enough? 4.7.0_rc3_5edb5649 sdc;10.218.128.17;3260244;815061;25730 sdg;10.218.202.17;3405988;851497;24629 sdh;10.218.203.17;3307419;826854;25363 sdm;10.218.204.17;3430502;857625;24453 sdi;10.219.128.17;3544282;886070;23668 sdj;10.219.202.17;3412083;853020;24585 sdk;10.219.203.17;3422385;855596;24511 sdl;10.219.204.17;3444164;861041;24356 sdb;10.220.128.17;4803646;1200911;17463 sdd;10.220.202.17;4832982;1208245;17357 sde;10.220.203.17;4809430;1202357;17442 sdf;10.220.204.17;4808878;1202219;17444 Thanks for the suggestions, I'll work to get some of the requested data back to you guys quickly. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Tue, Jun 21, 2016 at 7:08 AM, Sagi Grimberg <sagig...@gmail.com> wrote: Hey Robert, I narrowed the performance degradation to this series 7861728..5e47f19, but while trying to bisect it, the changes were erratic between each commit that I could not figure out exactly which introduced the issue. If someone could give me some pointers on what to do, I can keep trying to dig through this. 
This bisection brings suspects: e3416ab2d156 iser-target: Kill the ->isert_cmd back pointer in struct iser_tx_desc d1ca2ed7dcf8 iser-target: Kill struct isert_rdma_wr 9679cc51eb13 iser-target: Convert to new CQ API 5adabdd122e4 iser-target: Split and properly type the login buffer ed1083b251f0 iser-target: Remove ISER_RECV_DATA_SEG_LEN 26c7b673db57 iser-target: Remove impossible condition from isert_wait_conn 69c48846f1c7 iser-target: Remove redundant wait in release_conn 6d1fba0c2cc7 iser-target: Rework connection termination f81bf458208e iser-target: Separate flows for np listeners and connections cma events aea92980601f iser-target: Add new state ISER_CONN_BOUND to isert_conn b89a7c25462b iser-target: Fix identification of login rx descriptor type However I don't really see performance implications in these patches, not to mention something that would affect on ConnectIB... Given that your bisection brings up target side patches, I have a couple questions: 1. Are the
Re: Connect-IB not performing as well as ConnectX-3 with iSER
Hey Robert, I narrowed the performance degradation to this series 7861728..5e47f19, but while trying to bisect it, the changes were erratic between each commit that I could not figure out exactly which introduced the issue. If someone could give me some pointers on what to do, I can keep trying to dig through this. This bisection brings suspects: e3416ab2d156 iser-target: Kill the ->isert_cmd back pointer in struct iser_tx_desc d1ca2ed7dcf8 iser-target: Kill struct isert_rdma_wr 9679cc51eb13 iser-target: Convert to new CQ API 5adabdd122e4 iser-target: Split and properly type the login buffer ed1083b251f0 iser-target: Remove ISER_RECV_DATA_SEG_LEN 26c7b673db57 iser-target: Remove impossible condition from isert_wait_conn 69c48846f1c7 iser-target: Remove redundant wait in release_conn 6d1fba0c2cc7 iser-target: Rework connection termination f81bf458208e iser-target: Separate flows for np listeners and connections cma events aea92980601f iser-target: Add new state ISER_CONN_BOUND to isert_conn b89a7c25462b iser-target: Fix identification of login rx descriptor type However I don't really see performance implications in these patches, not to mention something that would affect ConnectIB... Given that your bisection brings up target side patches, I have a couple questions: 1. Is the CPU usage on the target side at 100%, or is the initiator side the bottleneck? 2. Would it be possible to use another target implementation? TGT maybe? 3. Can you try testing right before 9679cc51eb13? This is a patch that involves the data-plane. 4. Can you try the latest upstream kernel? The iser target code uses a generic data-transfer library and I'm interested in knowing what is the status there. Cheers, Sagi.
Re: NVMe over Fabrics target implementation
*) Extensible to multiple types of backend drivers. nvme-target needs a way to absorb new backend drivers, that does not affect existing configfs group layout or attributes. Looking at the nvmet/configfs layout as-is, there are no multiple backend types defined, nor a way to control backend feature bits exposed to nvme namespaces at runtime. Hey Nic, As for different types of backends, I still don't see a big justification for adding the LIO backends pscsi (as it doesn't make sense), ramdisk (we have brd), or file (losetup). What kind of feature bits would you want to expose at runtime? And that's very much intentional. We have a very well working block layer which we're going to use, no need to reinvent it. The block layer supports NVMe pass through just fine in case we'll need it, as I spent the last year preparing it for that. Why does it ever make sense for $SUBSYSTEM_NQN_0 with $PORT_DRIVER_FOO to block operation of $SUBSYSTEM_NQN_1 with $PORT_DRIVER_BAR..? Because it keeps the code simple. If you had actually participated on our development list you might have seen that until not too long ago we had very fine-grained locks here. In the end Armen convinced me that it's easier to maintain if we don't bother with fine-grained locking outside the fast path, especially as it significantly simplifies the discovery implementation. If it ever turns out to be an issue we can change it easily as the implementation is well encapsulated. We did change that, and Nic is raising a valid point in terms of having a global mutex around all the ports. If the requirement of nvme subsystems and ports configuration is that it should happen fast enough and scale to the numbers that Nic is referring to, we'll need to change that back. 
Having said that, I'm not sure this is a real hard requirement for RDMA and FC in the mid-term, because from what I've seen, the workloads Nic is referring to are more typical for iscsi/tcp where connections are cheaper and you need more to saturate a high-speed interconnect, so we'll probably see this when we have nvme over tcp working.
Re: libiscsi: Use scsi helper to set information descriptor
Hey Dan, Hello Sagi Grimberg, The patch a73c2a2f9123: "libiscsi: Use scsi helper to set information descriptor" from Jul 15, 2015, leads to the following static checker warning: drivers/scsi/libiscsi.c:858 iscsi_scsi_cmd_rsp() error: XXX uninitialized symbol 'sector'. drivers/scsi/libiscsi.c 850 ascq = session->tt->check_protection(task, &sector); If "ascq" is 0x1 then the sector might not be initialized. The documentation is not clear on how that works. Har dee har har. The oldest jokes are still the best... :P iscsi transports that implement this callout are expected to set the sector which is passed by reference. Would it make the checker happy if we set sector to 0 before calling check_protection (although it's not needed by any means)?
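To make the calling convention under discussion concrete, here is a toy userspace sketch (all names and the sector value 42 are hypothetical, not the libiscsi code): the callout fills *sector only when it returns a nonzero ascq, so the caller only reads the sector on the error path, and pre-initializing it to 0 costs nothing while silencing the static checker.

```c
typedef unsigned long long toy_sector_t;

/* check_protection() analogue: fills *sector only on a protection error. */
static unsigned char toy_check_protection(toy_sector_t *sector, int guard_error)
{
    if (guard_error) {
        *sector = 42;   /* transport reports the failing sector */
        return 0x1;     /* ascq: guard check failed */
    }
    return 0;           /* success: *sector intentionally untouched */
}

/* iscsi_scsi_cmd_rsp() analogue: pre-initialize so the value is defined
 * on every path, even though it is only meaningful when ascq != 0. */
static toy_sector_t toy_cmd_rsp(int guard_error)
{
    toy_sector_t sector = 0;
    unsigned char ascq = toy_check_protection(&sector, guard_error);

    return ascq ? sector : 0;
}
```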
Re: [PATCH v2 01/16] iscsi-target: add callback to alloc and free PDU
On 09/04/16 16:11, Varun Prakash wrote: Add two callbacks to struct iscsit_transport - 1. void *(*iscsit_alloc_pdu)() iscsi-target uses this callback for iSCSI PDU allocation. 2. void (*iscsit_free_pdu) iscsi-target uses this callback to free an iSCSI PDU which was allocated by iscsit_alloc_pdu(). Can you please explain why you are adding two different callouts? Who (Chelsio T5) will need it, and why they can't use the in-cmd pdu? I am adding these to avoid a per-PDU 48-byte (BHS) memcpy from cmd->pdu to the transport driver tx buffer; iscsi-target can directly form the iscsi hdr in the transport driver tx buffer. Is that what it's for? Looks meaningless to me, do you have an indication that this is some sort of bottleneck? I see you memset in your pdu_alloc, where is the gain here anyway?
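The shape of the callout being debated can be sketched like this (a toy userspace model with hypothetical names, not the proposed patch): if the transport provides an alloc_pdu() callout, the core builds the header directly in the transport-owned tx buffer and skips the 48-byte BHS memcpy; otherwise it falls back to the in-cmd buffer as today.

```c
#include <stddef.h>

#define TOY_BHS_LEN 48  /* iSCSI Basic Header Segment size */

struct toy_cmd {
    unsigned char pdu[TOY_BHS_LEN]; /* default in-cmd header buffer */
};

struct toy_transport {
    /* optional callout: return a transport-owned tx buffer for the BHS */
    void *(*alloc_pdu)(struct toy_cmd *cmd);
};

/* Core side: pick where to build the header. */
static void *toy_get_pdu(struct toy_transport *t, struct toy_cmd *cmd)
{
    if (t->alloc_pdu)
        return t->alloc_pdu(cmd);   /* header formed in transport memory */
    return cmd->pdu;                /* legacy path: in-cmd buffer */
}
```

The counter-argument in the thread is that the saved 48-byte copy is unlikely to be measurable, which is why a justification with numbers was requested.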
Re: [PATCH v2 08/16] iscsi-target: add void (*iscsit_get_r2t_ttt)()
Add void (*iscsit_get_r2t_ttt)() to struct iscsit_transport, iscsi-target uses this callback to get r2t->targ_xfer_tag. Your driver allocates ttt's? That looks like bad layering to me. This definitely deserves an explanation... cxgbit.ko allocates ttt only for r2t pdus to do Direct Data Placement of Data Out pdus, adapter uses the ttt value in Data Out pdus to place data directly in the host buffers. How do you guarantee that the core doesn't conflict with your internal ttt's?
Re: [PATCH 1/2] scsi: add a max_segment_size limitation to struct Scsi_Host
Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH 2/2] IB/iser: set max_segment_size
Acked-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH 2/2] IB/iser: set max_segment_size
In iser we sorta rely on 4k pages so we avoid PAGE_SIZE but rather set SIZE_4K for these sort of things (like we did in the virt_boundary). So you still want only 4k segments even on PPC where the PAGE_SIZE is 16k? Yes, iSER has the "no-gaps" constraint (like nvme) and some applications in the past were known to issue vectored IO that doesn't necessarily align with the system PAGE_SIZE. These sort of workloads resulted in suboptimal performance on ppc, ia64 etc (almost every I/O had gaps). Before the block layer was able to enforce scatterlists without gaps, iSER used bounce buffering in order to service "gappy" I/O which was probably a lot worse than bio splitting like the block layer is doing today, but still we felt that having 4k segments even on larger PAGE_SIZE systems was a decent compromise.
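The "no-gaps" constraint mentioned above can be sketched as follows (an illustrative userspace model, not the block layer code): a multi-element I/O is "gappy" at 4K granularity when any element other than the first starts off a 4K boundary, or any element other than the last ends off one — the same shape of check the virt_boundary mask enforces.

```c
#define TOY_SIZE_4K 4096UL

struct toy_seg {
    unsigned long addr;
    unsigned long len;
};

/* Return 1 if the segment list has internal gaps at 4K granularity. */
static int toy_has_gaps(const struct toy_seg *sg, int nents)
{
    int i;

    for (i = 0; i < nents; i++) {
        /* every element but the first must start 4K-aligned */
        if (i > 0 && (sg[i].addr & (TOY_SIZE_4K - 1)))
            return 1;
        /* every element but the last must end 4K-aligned */
        if (i < nents - 1 && ((sg[i].addr + sg[i].len) & (TOY_SIZE_4K - 1)))
            return 1;
    }
    return 0;
}
```

A vectored I/O whose middle elements are not 4K-sized is exactly the workload that used to force bounce buffering, and today forces bio splitting.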
Re: [PATCH 2/2] IB/iser: set max_segment_size
So that we don't overflow the number of MR segments allocated because we have to split one SGL segment into multiple MR segments. Signed-off-by: Christoph Hellwig --- drivers/infiniband/ulp/iser/iscsi_iser.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c b/drivers/infiniband/ulp/iser/iscsi_iser.c index 80b6bed..784504a 100644 --- a/drivers/infiniband/ulp/iser/iscsi_iser.c +++ b/drivers/infiniband/ulp/iser/iscsi_iser.c @@ -623,6 +623,7 @@ iscsi_iser_session_create(struct iscsi_endpoint *ep, shost->max_id = 0; shost->max_channel = 0; shost->max_cmd_len = 16; + shost->max_segment_size = PAGE_SIZE; In iser we sorta rely on 4k pages so we avoid PAGE_SIZE but rather set SIZE_4K for these sort of things (like we did in the virt_boundary).
Re: [PATCH v3 5/5] lib: scatterlist: move SG pool code from SCSI driver to lib/sg_pool.c
From: Ming Lin <min...@ssi.samsung.com> Now it's ready to move the mempool based SG chained allocator code from SCSI driver to lib/sg_pool.c, which will be compiled only based on a Kconfig symbol CONFIG_SG_POOL. SCSI selects CONFIG_SG_POOL. Reviewed-by: Christoph Hellwig <h...@lst.de> Signed-off-by: Ming Lin <min...@ssi.samsung.com> Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH v3 4/5] scsi: rename SCSI_MAX_{SG, SG_CHAIN}_SEGMENTS
<martin.peter...@oracle.com> wrote: "Ming" == Ming Lin <m...@kernel.org> writes: Ming> Are we ready to merge it? We're still missing an ack from Sagi. Thought we already had an ack from Bart. OK, let's get one more from Sagi. Hmm ... this patch doesn't touch any code for which Sagi is the maintainer so I think my ack for the ib_srp changes is sufficient. Indeed no need for my ack, but a review tag can't hurt ;) Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH v3 3/5] scsi: rename SG related struct and functions
From: Ming Lin <min...@ssi.samsung.com> Rename SCSI specific struct and functions to more generic names. Reviewed-by: Christoph Hellwig <h...@lst.de> Signed-off-by: Ming Lin <min...@ssi.samsung.com> Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH v3 2/5] scsi: replace "mq" with "first_chunk" in SG functions
From: Ming Lin <min...@ssi.samsung.com> Parameter "bool mq" is block driver specific. Change it to "first_chunk" to make it more generic. Reviewed-by: Christoph Hellwig <h...@lst.de> Signed-off-by: Ming Lin <min...@ssi.samsung.com> Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH v3 1/5] scsi: replace "scsi_data_buffer" with "sg_table" in SG functions
From: Ming Lin <min...@ssi.samsung.com> Replace parameter "struct scsi_data_buffer" with "struct sg_table" in SG alloc/free functions to make them generic. Reviewed-by: Christoph Hellwig <h...@lst.de> Signed-off-by: Ming Lin <min...@ssi.samsung.com> Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH v3 4/5] scsi: rename SCSI_MAX_{SG, SG_CHAIN}_SEGMENTS
We're still missing an ack from Sagi. Oops, wasn't aware that mine was needed. I reviewed these on the nvme-fabrics project. I'll add my review tags too.
Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC
Hey Willy, - Interrupt steering needs to be controlled by block-mq instead of the driver. It's pointless to have each driver implement its own policies on interrupt steering, irqbalanced remains a source of end-user frustration, and block-mq can change the queue<->cpu mapping without the driver's knowledge. I honestly don't think that block-mq is the right place to *assign* interrupt steering. Not all HW devices are dedicated to storage; take RDMA for example, an RNIC is shared by block storage, networking and even user-space workloads so obviously block-mq can't understand how a user wants to steer interrupts. I think that block-mq needs to ask the device driver: "what is the optimal queue index for cpu X?" and use it while *someone* will be responsible for optimum interrupt steering (can be the driver itself or user-space). From some discussions I had with HCH I think he intends to use the cpu reverse-mapping API to try and do what's described above (if I'm not mistaken).
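The "ask the driver" callout described above could look roughly like this toy sketch (the name, signature, and spreading policy are all hypothetical illustrations, not a proposed kernel API): block-mq asks the driver which of its nr_queues completion vectors best serves a given CPU, instead of steering interrupts itself.

```c
/* Hypothetical driver callout: "what is the optimal queue index for cpu X?"
 * The policy here is a plain block distribution of CPUs over queues,
 * purely to illustrate the shape of the answer. */
static unsigned int toy_optimal_queue(unsigned int cpu,
                                      unsigned int nr_queues,
                                      unsigned int nr_cpus)
{
    if (nr_queues >= nr_cpus)
        return cpu;                     /* one queue per cpu */
    return cpu * nr_queues / nr_cpus;   /* contiguous cpu blocks per queue */
}
```

Whoever owns interrupt steering (driver or user-space) would then be responsible for making the queue's completion vector actually fire on the CPUs it was mapped to.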
Re: [PATCH v2 15/16] iscsi-target: fix seq_end_offset calculation
Fixes should go in the front of the series (probably even better detached from the set), and this also looks like stable material...
Re: [PATCH v2 11/16] iscsi-target: add new offload transport type
+static ssize_t lio_target_np_hw_offload_show(struct config_item *item, +char *page) +{ + struct iscsi_tpg_np *tpg_np = to_iscsi_tpg_np(item); + struct iscsi_tpg_np *tpg_np_hw_offload; + ssize_t rb; + + tpg_np_hw_offload = iscsit_tpg_locate_child_np(tpg_np, + ISCSI_HW_OFFLOAD); + if (tpg_np_hw_offload) + rb = sprintf(page, "1\n"); + else + rb = sprintf(page, "0\n"); + + return rb; +} + +static ssize_t lio_target_np_hw_offload_store(struct config_item *item, + const char *page, size_t count) +{ + struct iscsi_tpg_np *tpg_np = to_iscsi_tpg_np(item); + struct iscsi_np *np; + struct iscsi_portal_group *tpg; + struct iscsi_tpg_np *tpg_np_hw_offload = NULL; + u32 op; + int rc = 0; + + rc = kstrtou32(page, 0, &op); + if (rc) + return rc; + + if ((op != 1) && (op != 0)) { + pr_err("Illegal value for tpg_enable: %u\n", op); + return -EINVAL; + } + + np = tpg_np->tpg_np; + if (!np) { + pr_err("Unable to locate struct iscsi_np from" + " struct iscsi_tpg_np\n"); + return -EINVAL; + } + + tpg = tpg_np->tpg; + if (iscsit_get_tpg(tpg) < 0) + return -EINVAL; + + if (op) { + tpg_np_hw_offload = iscsit_tpg_add_network_portal(tpg, + &np->np_sockaddr, tpg_np, ISCSI_HW_OFFLOAD); + + if (IS_ERR(tpg_np_hw_offload)) { + rc = PTR_ERR(tpg_np_hw_offload); + goto out; + } + } else { + tpg_np_hw_offload = iscsit_tpg_locate_child_np(tpg_np, + ISCSI_HW_OFFLOAD); + + if (tpg_np_hw_offload) { + rc = iscsit_tpg_del_network_portal(tpg, + tpg_np_hw_offload); + if (rc < 0) + goto out; + } + } + + iscsit_put_tpg(tpg); + return count; +out: + iscsit_put_tpg(tpg); + return rc; +} + CONFIGFS_ATTR(lio_target_np_, sctp); CONFIGFS_ATTR(lio_target_np_, iser); +CONFIGFS_ATTR(lio_target_np_, hw_offload); I'd be happy to see new transports being added with some initiative to reduce the code duplication here (pretty please :))
Re: [PATCH v2 10/16] iscsi-target: use conn->network_transport in text rsp
Looks fine, Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH v2 08/16] iscsi-target: add void (*iscsit_get_r2t_ttt)()
Add void (*iscsit_get_r2t_ttt)() to struct iscsit_transport, iscsi-target uses this callback to get r2t->targ_xfer_tag. Your driver allocates ttt's? That looks like bad layering to me. This definitely deserves an explanation...
Re: [PATCH v2 07/16] iscsi-target: add int (*iscsit_validate_params)()
> Add int (*iscsit_validate_params)() to struct iscsit_transport,
> iscsi-target uses this callback for validating conn operational
> parameters.

Again, why is this needed?
Re: [PATCH v2 05/16] iscsi-target: add void (*iscsit_get_rx_pdu)()
> Add void (*iscsit_get_rx_pdu)() to struct iscsit_transport,
> iscsi-target uses this callback to receive and process Rx iSCSI PDUs.

Same comment on change logs. The iser bit looks harmless

Acked-by: Sagi Grimberg <s...@grimber.me>

Though I agree with hch that we don't really need rx threads in iSER,
but that's another story...
Re: [PATCH v2 04/16] iscsi-target: add void (*iscsit_release_cmd)()
> Add void (*iscsit_release_cmd)() to struct iscsit_transport,
> iscsi-target uses this callback to release transport driver resources
> associated with an iSCSI cmd.

I'd really like to see some reasoning on why you add abstraction
callouts. It may have a valid reason but it needs to be documented in
the change log...

> diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
> index 428b0d9..a533017 100644
> --- a/drivers/target/iscsi/iscsi_target_util.c
> +++ b/drivers/target/iscsi/iscsi_target_util.c
> @@ -725,6 +725,9 @@ void __iscsit_free_cmd(struct iscsi_cmd *cmd, bool scsi_cmd,
>  		iscsit_remove_cmd_from_immediate_queue(cmd, conn);
>  		iscsit_remove_cmd_from_response_queue(cmd, conn);
>  	}
> +
> +	if (conn && conn->conn_transport->iscsit_release_cmd)
> +		conn->conn_transport->iscsit_release_cmd(conn, cmd);
>  }

Did you verify that you get here with conn = NULL (given that you test
it)? If so, then can you please document why is it expected for this
function to be called twice that we need to make it safe? If not, then
I'd move this check to be a WARN_ON/BUG_ON to hunt down when is this
happening.
Re: [PATCH v2 03/16] iscsi-target: add int (*iscsit_xmit_datain_pdu)()
On 09/04/16 16:11, Varun Prakash wrote:
> Add int (*iscsit_xmit_datain_pdu)() to struct iscsit_transport,
> iscsi-target uses this callback to transmit a DATAIN iSCSI PDU.
>
> Signed-off-by: Varun Prakash
> ---
>  drivers/target/iscsi/iscsi_target.c    | 143 +++--
>  include/target/iscsi/iscsi_transport.h |   3 +
>  2 files changed, 86 insertions(+), 60 deletions(-)
>
> diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
> index 0e7a481..9e65e5d 100644
> --- a/drivers/target/iscsi/iscsi_target.c
> +++ b/drivers/target/iscsi/iscsi_target.c
> @@ -577,6 +577,84 @@ static int iscsit_xmit_pdu(struct iscsi_conn *conn, struct iscsi_cmd *cmd,
>  	return 0;
>  }
>
> +static int iscsit_map_iovec(struct iscsi_cmd *, struct kvec *, u32, u32);
> +static void iscsit_unmap_iovec(struct iscsi_cmd *);
> +static u32 iscsit_do_crypto_hash_sg(struct ahash_request *, struct iscsi_cmd *,
> +			u32, u32, u32, u8 *);
> +static int
> +iscsit_xmit_datain_pdu(struct iscsi_conn *conn, struct iscsi_cmd *cmd,
> +		       struct iscsi_datain_req *dr, struct iscsi_datain *datain)

Looks very similar to xmit_pdu(), if we add a datain pointer that can
be null for normal pdus would that reduce code duplication?
Re: [PATCH v2 02/16] iscsi-target: add int (*iscsit_xmit_pdu)()
Nice!

Reviewed-by: Sagi Grimberg <s...@grimberg.me>
Re: [PATCH v2 01/16] iscsi-target: add callback to alloc and free PDU
On 09/04/16 16:11, Varun Prakash wrote:
> Add two callbacks to struct iscsit_transport -
>
> 1. void *(*iscsit_alloc_pdu)()
>    iscsi-target uses this callback for iSCSI PDU allocation.
>
> 2. void (*iscsit_free_pdu)
>    iscsi-target uses this callback to free an iSCSI PDU
>    which was allocated by iscsit_alloc_pdu().

Can you please explain why are you adding two different callouts?
Who (Chelsio T5) will need it, and why they can't use the in-cmd pdu?

> Signed-off-by: Varun Prakash
> ---
>  drivers/target/iscsi/iscsi_target.c    | 76 --
>  include/target/iscsi/iscsi_transport.h |  2 +
>  2 files changed, 65 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
> index 961202f..fdb33ba 100644
> --- a/drivers/target/iscsi/iscsi_target.c
> +++ b/drivers/target/iscsi/iscsi_target.c
> @@ -499,6 +499,11 @@ static void iscsit_aborted_task(struct iscsi_conn *conn, struct iscsi_cmd *cmd)
>  	__iscsit_free_cmd(cmd, scsi_cmd, true);
>  }
>
> +static void *iscsit_alloc_pdu(struct iscsi_conn *conn, struct iscsi_cmd *cmd)
> +{
> +	return cmd->pdu;
> +}
> +
>  static enum target_prot_op iscsit_get_sup_prot_ops(struct iscsi_conn *conn)
>  {
>  	return TARGET_PROT_NORMAL;
> @@ -519,6 +524,7 @@ static struct iscsit_transport iscsi_target_transport = {
>  	.iscsit_queue_data_in	= iscsit_queue_rsp,
>  	.iscsit_queue_status	= iscsit_queue_rsp,
>  	.iscsit_aborted_task	= iscsit_aborted_task,
> +	.iscsit_alloc_pdu	= iscsit_alloc_pdu,
>  	.iscsit_get_sup_prot_ops = iscsit_get_sup_prot_ops,
>  };
>
> @@ -2537,7 +2543,10 @@ static int iscsit_send_conn_drop_async_message(
>  	cmd->tx_size = ISCSI_HDR_LEN;
>  	cmd->iscsi_opcode = ISCSI_OP_ASYNC_EVENT;
>
> -	hdr = (struct iscsi_async *) cmd->pdu;
> +	hdr = conn->conn_transport->iscsit_alloc_pdu(conn, cmd);
> +	if (unlikely(!hdr))
> +		return -ENOMEM;
> +
>  	hdr->opcode = ISCSI_OP_ASYNC_EVENT;
>  	hdr->flags = ISCSI_FLAG_CMD_FINAL;
>  	cmd->init_task_tag = RESERVED_ITT;
> @@ -2630,7 +2639,7 @@ iscsit_build_datain_pdu(struct iscsi_cmd *cmd, struct iscsi_conn *conn,
>  static int iscsit_send_datain(struct iscsi_cmd *cmd, struct iscsi_conn *conn)
>  {
> -	struct iscsi_data_rsp *hdr = (struct iscsi_data_rsp *)&cmd->pdu[0];
> +	struct iscsi_data_rsp *hdr;
>  	struct iscsi_datain datain;
>  	struct iscsi_datain_req *dr;
>  	struct kvec *iov;
> @@ -2675,6 +2684,10 @@ static int iscsit_send_datain(struct iscsi_cmd *cmd, struct iscsi_conn *conn)
>  		set_statsn = true;
>  	}
>
> +	hdr = conn->conn_transport->iscsit_alloc_pdu(conn, cmd);
> +	if (unlikely(!hdr))
> +		return -ENOMEM;
> +
>  	iscsit_build_datain_pdu(cmd, conn, &datain, hdr, set_statsn);
>
>  	iov = &cmd->iov_data[0];
> @@ -2843,13 +2856,20 @@ EXPORT_SYMBOL(iscsit_build_logout_rsp);
>  static int iscsit_send_logout(struct iscsi_cmd *cmd, struct iscsi_conn *conn)
>  {
> +	struct iscsi_logout_rsp *hdr;
>  	struct kvec *iov;
>  	int niov = 0, tx_size, rc;
>
> -	rc = iscsit_build_logout_rsp(cmd, conn,
> -			(struct iscsi_logout_rsp *)&cmd->pdu[0]);
> -	if (rc < 0)
> +	hdr = conn->conn_transport->iscsit_alloc_pdu(conn, cmd);
> +	if (unlikely(!hdr))
> +		return -ENOMEM;
> +
> +	rc = iscsit_build_logout_rsp(cmd, conn, hdr);
> +	if (rc < 0) {
> +		if (conn->conn_transport->iscsit_free_pdu)
> +			conn->conn_transport->iscsit_free_pdu(conn, cmd);

This creates a weird asymmetry where alloc is called unconditionally
while free is conditional, I'd say implement an empty free for iscsit.
Same for the rest of the code...

P.S. I didn't see any non-error path call to free_pdu, is that a
possible leak (for drivers that actually allocate a PDU)?

On another unrelated note, I'd be very happy if we lose the iscsit_
prefix from all the callouts, it's clear to everyone that it's iscsi,
no need for an explicit reminder...
Re: [RFC 14/34] iscsi-target: export symbols
> Great. Just curious how -v2 is coming along..? I've got a few cycles
> over the weekend, and plan to start reviewing as the series hits the
> list. Btw, I asked Sagi to help out with review as well.

If you did, it's lost in my old email address :)

I'll have a look this week.

Cheers,
Sagi.
Re: [RFC 22/34] iscsi-target: call Rx thread function
> call Rx thread function if registered by transport driver, so that
> transport drivers can use iscsi-target Rx thread for Rx processing.
>
> update iSER target driver to use this interface.
>
> Signed-off-by: Varun Prakash <va...@chelsio.com>

Other than the handler name, looks harmless.

Acked-by: Sagi Grimberg <sa...@mellanox.com>
Re: [RFC 20/34] iscsi-target: update struct iscsit_transport definition
> 1. void (*iscsit_rx_pdu)(struct iscsi_conn *);
>    Rx thread uses this for receiving and processing iSCSI PDU in full
>    feature phase.

Is iscsit_rx_pdu the best name for this? It sounds like a function that
would handle A pdu, but it's actually the thread function dequeuing
pdus, correct?