Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
>> Yes, that's why I used 'significant'. One good thing is that given resources
>> it can easily be done in parallel with other development, and will give
>> additional insight of some form.
>
> Yup, well if someone wants to start working on an emulated RDMA device
> that actually simulates proper DMA transfers that would be great!

Given that each RDMA vendor's devices expose a different MMIO interface, I don't expect this to happen anytime soon.

> Yes, the nvme device in qemu has a CMB buffer which is a good choice to
> test with but we don't have code to use it for p2p transfers in the
> kernel so it is a bit awkward.

Note the CMB code is not in upstream QEMU; it's in Keith's fork [1]. I will see if I can push this upstream.

Stephen

[1] git://git.infradead.org/users/kbusch/qemu-nvme.git
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
> My first reflex when reading this thread was to think that this whole domain
> lends itself excellently to testing via Qemu. Could it be that doing this in
> the opposite direction might be a safer approach in the long run, even though
> it is significantly more work up-front?

While the idea of using QEMU for this work is attractive, it will be a long time before QEMU is in a position to support this development.

Another approach is to propose a common development platform for p2pmem work using a platform we know is going to work. This is an extreme version of the whitelisting approach that was discussed on this thread. We can list a very specific set of hardware (motherboard, PCIe end-points and (possibly) PCIe switch enclosure) that has been shown to work, which others can copy for their development purposes.

p2pmem.io perhaps ;-)?

Stephen
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
>> Yes, this makes sense I think we really just want to distinguish host
>> memory or not in terms of the dev_pagemap type.
>
>> I would like to see mutually exclusive flags for host memory (or not) and
>> persistence (or not).
>
> Why persistence? It has zero meaning to the mm.

I like the idea of having the properties of the memory in one place. While the mm might not use persistence today, it may in the future make use of certain things that persistence implies (like finite endurance and/or higher write latency).

Also, the persistence of the memory may have implications for mm security? Again, not addressed today but useful in the future.

In addition, I am not sure where else would be an appropriate place to put something like a persistence property flag. I know the NVDIMM section of the kernel uses things like NFIT to describe properties of the memory, but we don't yet (to my knowledge) have something similar for IO memory.

Stephen
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
> Yes, this makes sense I think we really just want to distinguish host
> memory or not in terms of the dev_pagemap type.

I would like to see mutually exclusive flags for host memory (or not) and persistence (or not).

Stephen
Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem
On 2017-04-06, 6:33 AM, "Sagi Grimberg" wrote:

> Say it's connected via 2 legs, the bar is accessed from leg A and the
> data from the disk comes via leg B. In this case, the data is heading
> towards the p2p device via leg B (might be congested), the completion
> goes directly to the RC, and then the host issues a read from the
> bar via leg A. I don't understand what can guarantee ordering here.
>
> Stephen told me that this still guarantees ordering, but I honestly
> can't understand how, perhaps someone can explain to me in a simple
> way that I can understand.

Sagi

As long as legA, legB and the RC are all connected to the same switch, ordering will be preserved (I think many other topologies also work). Here is how it would work for the problem case you are concerned about (a read from the NVMe drive):

1. The disk device DMAs out the data to the p2pmem device via a string of PCIe MemWr TLPs.

2. The disk device writes to the completion queue (in system memory) via a MemWr TLP.

3. The last of the MemWrs from step 1 might have been stalled in the PCIe switch due to congestion, but if so they are stalled in the egress path of the switch for the p2pmem port.

4. The RC determines the IO is complete when the TLP associated with step 2 updates the memory associated with the CQ. It issues some operation to read the p2pmem.

5. Regardless of whether the MemRd TLP comes from the RC or another device connected to the switch, it is queued in the egress queue for the p2pmem port behind the last DMA TLP (from step 1). PCIe ordering ensures that this MemRd cannot overtake the MemWr (reads can never pass writes). Therefore the MemRd can never get to the p2pmem device until after the last DMA MemWr has.

I hope this helps!

Stephen
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
> This is a separate topic. The initial proposal is for polling for
> interrupt mitigation, you are talking about polling in the context of
> polling for completion of an IO.
>
> We can definitely talk about this form of polling as well, but it should
> be a separate topic and probably proposed independently.
>
> --
> Jens Axboe

Jens

Oh, thanks for the clarification. I will propose this as a separate topic.

Thanks

Stephen
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[LSF/MM TOPIC][LSF/MM ATTEND] IO completion polling for block drivers
Hi

I'd like to discuss the ongoing work in the kernel to enable high priority IO via polling for completion in the blk-mq subsystem.

Given that iopoll only really makes sense for low-latency, low queue depth environments (i.e. down below 10-20us) I'd like to discuss which drivers we think will need/want to be upgraded (aside from NVMe ;-)).

I'd also be interested in discussing how best to enable and disable polling. In the past some of us have pushed for a "big hammer" to turn polling on for a given device or HW queue [1]. I'd like to discuss this again, as well as looking at other methods above and beyond the preadv2 system call and the HIPRI flag.

Finally I'd like to discuss some of the recent work to improve the heuristics around when to poll and when not to poll. I'd like to see if we can come up with a more optimal balance between CPU load and average completion times [2].

Stephen Bates

[1] http://marc.info/?l=linux-block&m=146307410101827&w=2
[2] http://marc.info/?l=linux-block&m=147803441801858&w=2
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
>> I'd like to attend LSF/MM and would like to discuss polling for block
>> drivers.
>>
>> Currently there is blk-iopoll, but it is not as widely used as NAPI is
>> in the networking field, and according to Sagi's findings in [1]
>> performance with polling is not on par with IRQ usage.
>>
>> At LSF/MM I'd like to discuss whether it is desirable to have NAPI-like
>> polling in more block drivers and how to overcome the currently seen
>> performance issues.
>
> It would be an interesting topic to discuss, as it is a shame that
> blk-iopoll isn't used more widely.
>
> --
> Jens Axboe

I'd also be interested in this topic.

Given that iopoll only really makes sense for low-latency, low queue depth environments (i.e. down below 10-20us) I'd like to discuss which drivers we think will need/want to be upgraded (aside from NVMe ;-)).

I'd also be interested in discussing how best to enable and disable polling. In the past some of us have pushed for a "big hammer" to turn polling on for a given device or HW queue [1]. I'd like to discuss this again, as well as looking at other methods above and beyond the preadv2 system call and the HIPRI flag.

Stephen

[1] http://marc.info/?l=linux-block&m=146307410101827&w=2
RE: [PATCH v4] cxlflash: Base support for IBM CXL Flash Adapter
Hi

I just wanted to add support for this patchset. There are others considering implementing block IO devices that use the CAPI (CXL) interface, and having this patch-set upstream will be very useful in the future.

Supported-by: Stephen Bates <stephen.ba...@pmcs.com>

Cheers

Stephen

-----Original Message-----
From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-ow...@vger.kernel.org] On Behalf Of Matthew R. Ochs
Sent: Friday, June 5, 2015 3:46 PM
To: linux-scsi@vger.kernel.org; james.bottom...@hansenpartnership.com; n...@linux-iscsi.org; brk...@linux.vnet.ibm.com; h...@infradead.org
Cc: mi...@neuling.org; imun...@au1.ibm.com; Manoj N. Kumar
Subject: [PATCH v4] cxlflash: Base support for IBM CXL Flash Adapter

SCSI device driver to support filesystem access on the IBM CXL Flash adapter.

Signed-off-by: Matthew R. Ochs <mro...@linux.vnet.ibm.com>
Signed-off-by: Manoj N. Kumar <ma...@linux.vnet.ibm.com>
---
 drivers/scsi/Kconfig            |    1 +
 drivers/scsi/Makefile           |    1 +
 drivers/scsi/cxlflash/Kconfig   |   11 +
 drivers/scsi/cxlflash/Makefile  |    2 +
 drivers/scsi/cxlflash/common.h  |  180 +
 drivers/scsi/cxlflash/main.c    | 2263 +++
 drivers/scsi/cxlflash/main.h    |  104 ++
 drivers/scsi/cxlflash/sislite.h |  465 +
 8 files changed, 3027 insertions(+)
 create mode 100644 drivers/scsi/cxlflash/Kconfig
 create mode 100644 drivers/scsi/cxlflash/Makefile
 create mode 100644 drivers/scsi/cxlflash/common.h
 create mode 100644 drivers/scsi/cxlflash/main.c
 create mode 100644 drivers/scsi/cxlflash/main.h
 create mode 100755 drivers/scsi/cxlflash/sislite.h

diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
index b021bcb..ebb12a7 100644
--- a/drivers/scsi/Kconfig
+++ b/drivers/scsi/Kconfig
@@ -345,6 +345,7 @@ source "drivers/scsi/cxgbi/Kconfig"
 source "drivers/scsi/bnx2i/Kconfig"
 source "drivers/scsi/bnx2fc/Kconfig"
 source "drivers/scsi/be2iscsi/Kconfig"
+source "drivers/scsi/cxlflash/Kconfig"

 config SGIWD93_SCSI
 	tristate "SGI WD93C93 SCSI Driver"
diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
index dee160a..619f8fb 100644
--- a/drivers/scsi/Makefile
+++ b/drivers/scsi/Makefile
@@ -101,6 +101,7 @@ obj-$(CONFIG_SCSI_7000FASST)	+= wd7000.o
 obj-$(CONFIG_SCSI_EATA)		+= eata.o
 obj-$(CONFIG_SCSI_DC395x)	+= dc395x.o
 obj-$(CONFIG_SCSI_AM53C974)	+= esp_scsi.o	am53c974.o
+obj-$(CONFIG_CXLFLASH)		+= cxlflash/
 obj-$(CONFIG_MEGARAID_LEGACY)	+= megaraid.o
 obj-$(CONFIG_MEGARAID_NEWGEN)	+= megaraid/
 obj-$(CONFIG_MEGARAID_SAS)	+= megaraid/
diff --git a/drivers/scsi/cxlflash/Kconfig b/drivers/scsi/cxlflash/Kconfig
new file mode 100644
index 000..e98c3f6
--- /dev/null
+++ b/drivers/scsi/cxlflash/Kconfig
@@ -0,0 +1,11 @@
+#
+# IBM CXL-attached Flash Accelerator SCSI Driver
+#
+
+config CXLFLASH
+	tristate "Support for IBM CAPI Flash"
+	depends on CXL
+	default m
+	help
+	  Allows CAPI Accelerated IO to Flash
+	  If unsure, say N.
diff --git a/drivers/scsi/cxlflash/Makefile b/drivers/scsi/cxlflash/Makefile
new file mode 100644
index 000..dc95e20
--- /dev/null
+++ b/drivers/scsi/cxlflash/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_CXLFLASH) += cxlflash.o
+cxlflash-y += main.o
diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
new file mode 100644
index 000..2bd842b
--- /dev/null
+++ b/drivers/scsi/cxlflash/common.h
@@ -0,0 +1,180 @@
+/*
+ * CXL Flash Device Driver
+ *
+ * Written by: Manoj N. Kumar <ma...@linux.vnet.ibm.com>, IBM Corporation
+ *             Matthew R. Ochs <mro...@linux.vnet.ibm.com>, IBM Corporation
+ *
+ * Copyright (C) 2015 IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _CXLFLASH_COMMON_H
+#define _CXLFLASH_COMMON_H
+
+#include <linux/list.h>
+#include <linux/types.h>
+#include <scsi/scsi.h>
+#include <scsi/scsi_device.h>
+
+
+#define MAX_CONTEXT	CXLFLASH_MAX_CONTEXT	/* num contexts per afu */
+
+#define CXLFLASH_BLOCK_SIZE	4096		/* 4K blocks */
+#define CXLFLASH_MAX_XFER_SIZE	16777216	/* 16MB transfer */
+#define CXLFLASH_MAX_SECTORS	(CXLFLASH_MAX_XFER_SIZE/512)	/* SCSI wants
+								   max_sectors
+								   in units of
+								   512 byte
+								   sectors
+								*/
+
+#define NUM_RRQ_ENTRY	16	/* for master issued cmds */
+#define MAX_RHT_PER_CONTEXT	(PAGE_SIZE / sizeof(struct sisl_rht_entry))
+
+/* AFU command retry limit */
+#define MC_RETRY_CNT	5