Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-25 Thread Stephen Bates
>> Yes, that's why I used 'significant'. One good thing is that given resources 
>> it can easily be done in parallel with other development, and will give 
>> additional
>> insight of some form.
>
>Yup, well if someone wants to start working on an emulated RDMA device
>that actually simulates proper DMA transfers that would be great!

Given that each RDMA vendor's devices expose a different MMIO interface, I don't expect 
this to happen anytime soon.

> Yes, the nvme device in qemu has a CMB buffer which is a good choice to
> test with but we don't have code to use it for p2p transfers in the
>kernel so it is a bit awkward.

Note that the CMB code is not in upstream QEMU; it's in Keith's fork [1]. I will see 
if I can push this upstream.

Stephen

[1] git://git.infradead.org/users/kbusch/qemu-nvme.git




Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-25 Thread Stephen Bates

> My first reflex when reading this thread was to think that this whole domain
> lends it self excellently to testing via Qemu. Could it be that doing this in 
> the opposite direction might be a safer approach in the long run even though 
> (significant) more work up-front?

While the idea of using QEMU for this work is attractive, it will be a long time 
before QEMU is in a position to support this development.

Another approach is to propose a common development platform for p2pmem work 
based on hardware we know works. This is an extreme version of the 
whitelisting approach that was discussed on this thread: we can list a very 
specific set of hardware (motherboard, PCIe end-points and (possibly) PCIe 
switch enclosure) that has been shown to work and that others can copy for 
their development purposes.

p2pmem.io perhaps ;-)?

Stephen




Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-20 Thread Stephen Bates
>> Yes, this makes sense I think we really just want to distinguish host
>> memory or not in terms of the dev_pagemap type.
>
>> I would like to see mutually exclusive flags for host memory (or not) and 
>> persistence (or not).
>>
>
> Why persistence? It has zero meaning to the mm.

I like the idea of having the properties of the memory in one place. While the mm 
might not use persistence today, it may in the future want to account for things 
that persistence implies (like finite endurance and/or higher write latency). Also, 
the persistence of the memory surely has security implications for the mm? Again, 
not addressed today, but useful in the future.

In addition, I am not sure where else an appropriate place to put something like a 
persistence property flag would be. I know the NVDIMM section of the kernel uses 
things like NFIT to describe properties of the memory, but we don't yet (to my 
knowledge) have something similar for IO memory.
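
As a purely hypothetical illustration (the flag names below are invented and
are not the existing dev_pagemap interface), mutually exclusive host/IO and
volatile/persistent properties could look something like this:

/* Hypothetical sketch only -- not the in-kernel dev_pagemap API. */
enum pgmap_mem_props {
	PGMAP_HOST_MEM		= 1 << 0,	/* backed by system (host) memory */
	PGMAP_IO_MEM		= 1 << 1,	/* backed by device (IO) memory */
	PGMAP_VOLATILE		= 1 << 2,	/* contents lost on power loss */
	PGMAP_PERSISTENT	= 1 << 3,	/* contents survive power loss */
};

/* A given mapping would set exactly one flag from each pair, e.g. a
 * persistent-memory-backed PCIe BAR: PGMAP_IO_MEM | PGMAP_PERSISTENT. */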

Stephen



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-20 Thread Stephen Bates

> Yes, this makes sense I think we really just want to distinguish host
> memory or not in terms of the dev_pagemap type.

I would like to see mutually exclusive flags for host memory (or not) and 
persistence (or not).

Stephen



Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem

2017-04-07 Thread Stephen Bates
On 2017-04-06, 6:33 AM, "Sagi Grimberg" wrote:

> Say it's connected via 2 legs, the bar is accessed from leg A and the
> data from the disk comes via leg B. In this case, the data is heading
> towards the p2p device via leg B (might be congested), the completion
> goes directly to the RC, and then the host issues a read from the
> bar via leg A. I don't understand what can guarantee ordering here.

> Stephen told me that this still guarantees ordering, but I honestly
> can't understand how, perhaps someone can explain to me in a simple
> way that I can understand.

Sagi

As long as legA, legB and the RC are all connected to the same switch, ordering 
will be preserved (I think many other topologies also work). Here is how it would 
work for the problem case you are concerned about (a read from the NVMe drive).

1. Disk device DMAs out the data to the p2pmem device via a string of PCIe 
MemWr TLPs.
2. Disk device writes to the completion queue (in system memory) via a MemWr 
TLP.
3. The last of the MemWrs from step 1 might have got stalled in the PCIe switch 
due to congestion but if so they are stalled in the egress path of the switch 
for the p2pmem port.
4. The RC determines the IO is complete when the TLP associated with step 2 
updates the memory associated with the CQ. It issues some operation to read the 
p2pmem.
5. Regardless of whether the MemRd TLP comes from the RC or another device 
connected to the switch, it is queued in the egress queue for the p2pmem port 
behind the last DMA TLP (from step 1). PCIe ordering ensures that this MemRd 
cannot overtake the MemWr (reads can never pass writes). Therefore the MemRd 
can never get to the p2pmem device until after the last DMA MemWr has.
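
To make the host-side view of steps 4 and 5 concrete, here is a minimal,
purely illustrative C sketch (the structure and names are assumptions for
clarity, not the actual nvmet or p2pmem code):

/* Hypothetical host-side illustration of steps 4-5; not real driver code. */
#include <stdint.h>

struct cq_entry {
	volatile uint16_t status;	/* written by the disk's MemWr in step 2 */
};

static uint8_t consume_read(struct cq_entry *cqe,
			    const volatile uint8_t *p2pmem_buf)
{
	/* Step 4: the completion lands in system memory, so the host now
	 * believes the DMA into p2pmem has finished. */
	while (cqe->status == 0)
		;	/* in practice an interrupt, not a spin loop */

	/* Step 5: this load becomes a MemRd TLP towards the p2pmem BAR. It
	 * queues behind the last MemWr from step 1 at the switch egress, so
	 * it can never observe stale data. */
	return p2pmem_buf[0];
}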

I hope this helps!

Stephen




Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-11 Thread Stephen Bates
>
> This is a separate topic. The initial proposal is for polling for
> interrupt mitigation, you are talking about polling in the context of
> polling for completion of an IO.
>
> We can definitely talk about this form of polling as well, but it should
> be a separate topic and probably proposed independently.
>
> --
> Jens Axboe
>
>

Jens

Oh thanks for the clarification. I will propose this as a separate topic.

Thanks

Stephen


[LSF/MM TOPIC][LSF/MM ATTEND] IO completion polling for block drivers

2017-01-11 Thread Stephen Bates
Hi

I'd like to discuss the ongoing work in the kernel to enable high priority
IO via polling for completion in the blk-mq subsystem.

Given that iopoll only really makes sense for low-latency, low queue-depth
environments (i.e. down below 10-20us), I'd like to discuss which drivers
we think will need/want to be upgraded (aside from NVMe ;-)).

I'd also be interested in discussing how best to enable and disable
polling. In the past some of us have pushed for a "big hammer" to turn
polling on for a given device or HW queue [1]. I'd like to discuss this
again as well as looking at other methods above and beyond the preadv2
system call and the HIPRI flag.
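
As a concrete (if simplified) example, a polled read can be issued today with
preadv2() and the RWF_HIPRI flag; the sketch below assumes a kernel and glibc
recent enough to provide both, and a driver/queue with polling enabled:

/* Minimal sketch of a polled (HIPRI) read; error handling kept brief. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s <blockdev>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	void *buf;
	if (posix_memalign(&buf, 4096, 4096))
		return 1;

	struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

	/* RWF_HIPRI asks blk-mq to poll for this IO's completion rather than
	 * sleeping until an interrupt arrives. */
	ssize_t ret = preadv2(fd, &iov, 1, 0, RWF_HIPRI);
	if (ret < 0)
		perror("preadv2");
	else
		printf("read %zd bytes with RWF_HIPRI\n", ret);

	free(buf);
	close(fd);
	return 0;
}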

Finally, I'd like to discuss some of the recent work to improve the
heuristics around when to poll and when not to poll. I'd like to see if we
can come up with a better balance between CPU load and average
completion times [2].

Stephen Bates

[1] http://marc.info/?l=linux-block&m=146307410101827&w=2
[2] http://marc.info/?l=linux-block&m=147803441801858&w=2



Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-11 Thread Stephen Bates
>>
>> I'd like to attend LSF/MM and would like to discuss polling for block
>> drivers.
>>
>> Currently there is blk-iopoll but it is neither as widely used as NAPI
>> in the networking field and accoring to Sagi's findings in [1]
>> performance with polling is not on par with IRQ usage.
>>
>> On LSF/MM I'd like to whether it is desirable to have NAPI like polling
>> in more block drivers and how to overcome the currently seen performance
>> issues.
>
> It would be an interesting topic to discuss, as it is a shame that
> blk-iopoll isn't used more widely.
>
> --
> Jens Axboe
>

I'd also be interested in this topic. Given that iopoll only really makes
sense for low-latency, low queue-depth environments (i.e. down below
10-20us), I'd like to discuss which drivers we think will need/want to be
upgraded (aside from NVMe ;-)).

I'd also be interested in discussing how best to enable and disable
polling. In the past some of us have pushed for a "big hammer" to turn
polling on for a given device or HW queue [1]. I'd like to discuss this
again as well as looking at other methods above and beyond the preadv2
system call and the HIPRI flag.

Stephen

[1] http://marc.info/?l=linux-block&m=146307410101827&w=2



RE: [PATCH v4] cxlflash: Base support for IBM CXL Flash Adapter

2015-06-08 Thread Stephen Bates
Hi

I just wanted to add my support for this patchset. There are others considering 
implementing block IO devices that use the CAPI (CXL) interface, and having this 
patch-set upstream will be very useful in the future.

Supported-by: Stephen Bates stephen.ba...@pmcs.com

Cheers

Stephen 

-Original Message-
From: linux-scsi-ow...@vger.kernel.org 
[mailto:linux-scsi-ow...@vger.kernel.org] On Behalf Of Matthew R. Ochs
Sent: Friday, June 5, 2015 3:46 PM
To: linux-scsi@vger.kernel.org; james.bottom...@hansenpartnership.com; 
n...@linux-iscsi.org; brk...@linux.vnet.ibm.com; h...@infradead.org
Cc: mi...@neuling.org; imun...@au1.ibm.com; Manoj N. Kumar
Subject: [PATCH v4] cxlflash: Base support for IBM CXL Flash Adapter

SCSI device driver to support filesystem access on the IBM CXL Flash adapter.

Signed-off-by: Matthew R. Ochs mro...@linux.vnet.ibm.com
Signed-off-by: Manoj N. Kumar ma...@linux.vnet.ibm.com
---
 drivers/scsi/Kconfig|1 +
 drivers/scsi/Makefile   |1 +
 drivers/scsi/cxlflash/Kconfig   |   11 +
 drivers/scsi/cxlflash/Makefile  |2 +
 drivers/scsi/cxlflash/common.h  |  180 
 drivers/scsi/cxlflash/main.c| 2263 +++
 drivers/scsi/cxlflash/main.h|  104 ++
 drivers/scsi/cxlflash/sislite.h |  465 
 8 files changed, 3027 insertions(+)
 create mode 100644 drivers/scsi/cxlflash/Kconfig
 create mode 100644 drivers/scsi/cxlflash/Makefile
 create mode 100644 drivers/scsi/cxlflash/common.h
 create mode 100644 drivers/scsi/cxlflash/main.c
 create mode 100644 drivers/scsi/cxlflash/main.h
 create mode 100755 drivers/scsi/cxlflash/sislite.h

diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
index b021bcb..ebb12a7 100644
--- a/drivers/scsi/Kconfig
+++ b/drivers/scsi/Kconfig
@@ -345,6 +345,7 @@ source "drivers/scsi/cxgbi/Kconfig"
 source "drivers/scsi/bnx2i/Kconfig"
 source "drivers/scsi/bnx2fc/Kconfig"
 source "drivers/scsi/be2iscsi/Kconfig"
+source "drivers/scsi/cxlflash/Kconfig"
 
 config SGIWD93_SCSI
	tristate "SGI WD93C93 SCSI Driver"
diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
index dee160a..619f8fb 100644
--- a/drivers/scsi/Makefile
+++ b/drivers/scsi/Makefile
@@ -101,6 +101,7 @@ obj-$(CONFIG_SCSI_7000FASST)	+= wd7000.o
 obj-$(CONFIG_SCSI_EATA)		+= eata.o
 obj-$(CONFIG_SCSI_DC395x)	+= dc395x.o
 obj-$(CONFIG_SCSI_AM53C974)	+= esp_scsi.o am53c974.o
+obj-$(CONFIG_CXLFLASH)		+= cxlflash/
 obj-$(CONFIG_MEGARAID_LEGACY)	+= megaraid.o
 obj-$(CONFIG_MEGARAID_NEWGEN)	+= megaraid/
 obj-$(CONFIG_MEGARAID_SAS)	+= megaraid/
diff --git a/drivers/scsi/cxlflash/Kconfig b/drivers/scsi/cxlflash/Kconfig
new file mode 100644
index 000..e98c3f6
--- /dev/null
+++ b/drivers/scsi/cxlflash/Kconfig
@@ -0,0 +1,11 @@
+#
+# IBM CXL-attached Flash Accelerator SCSI Driver
+#
+
+config CXLFLASH
+	tristate "Support for IBM CAPI Flash"
+   depends on CXL
+   default m
+   help
+ Allows CAPI Accelerated IO to Flash
+ If unsure, say N.
diff --git a/drivers/scsi/cxlflash/Makefile b/drivers/scsi/cxlflash/Makefile
new file mode 100644
index 000..dc95e20
--- /dev/null
+++ b/drivers/scsi/cxlflash/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_CXLFLASH) += cxlflash.o
+cxlflash-y += main.o
diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
new file mode 100644
index 000..2bd842b
--- /dev/null
+++ b/drivers/scsi/cxlflash/common.h
@@ -0,0 +1,180 @@
+/*
+ * CXL Flash Device Driver
+ *
+ * Written by: Manoj N. Kumar ma...@linux.vnet.ibm.com, IBM Corporation
+ * Matthew R. Ochs mro...@linux.vnet.ibm.com, IBM Corporation
+ *
+ * Copyright (C) 2015 IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _CXLFLASH_COMMON_H
+#define _CXLFLASH_COMMON_H
+
+#include <linux/list.h>
+#include <linux/types.h>
+#include <scsi/scsi.h>
+#include <scsi/scsi_device.h>
+
+
+#define MAX_CONTEXT  CXLFLASH_MAX_CONTEXT   /* num contexts per afu */
+
+#define CXLFLASH_BLOCK_SIZE	4096		/* 4K blocks */
+#define CXLFLASH_MAX_XFER_SIZE	16777216	/* 16MB transfer */
+#define CXLFLASH_MAX_SECTORS	(CXLFLASH_MAX_XFER_SIZE/512)	/* SCSI wants
+								   max_sectors
+								   in units of
+								   512 byte
+								   sectors
+								*/
+
+#define NUM_RRQ_ENTRY	16	/* for master issued cmds */
+#define MAX_RHT_PER_CONTEXT	(PAGE_SIZE / sizeof(struct sisl_rht_entry))
+
+/* AFU command retry limit */
+#define MC_RETRY_CNT	5