Re: Enabling peer to peer device transactions for PCIe devices

2017-10-26 Thread Petrosyan, Ludwig


- Original Message -
> From: "David Laight" <david.lai...@aculab.com>
> To: "Petrosyan, Ludwig" <ludwig.petros...@desy.de>, "Logan Gunthorpe" 
> <log...@deltatee.com>
> Cc: "Alexander Deucher" <alexander.deuc...@amd.com>, "linux-kernel" 
> <linux-kernel@vger.kernel.org>, "linux-rdma"
> <linux-r...@vger.kernel.org>, "linux-nvdimm" <linux-nvd...@lists.01.org>, 
> "Linux-media" <linux-me...@vger.kernel.org>,
> "dri-devel" <dri-de...@lists.freedesktop.org>, "linux-pci" 
> <linux-...@vger.kernel.org>, "John Bridgman"
> <john.bridg...@amd.com>, "Felix Kuehling" <felix.kuehl...@amd.com>, "Serguei 
> Sagalovitch"
> <serguei.sagalovi...@amd.com>, "Paul Blinzer" <paul.blin...@amd.com>, 
> "Christian Koenig" <christian.koe...@amd.com>,
> "Suravee Suthikulpanit" <suravee.suthikulpa...@amd.com>, "Ben Sander" 
> <ben.san...@amd.com>
> Sent: Tuesday, 24 October, 2017 16:58:24
> Subject: RE: Enabling peer to peer device transactions for PCIe devices

> Please don't top post, write shorter lines, and add the odd blank line.
> Big blocks of text are hard to read quickly.
> 

OK, this time I will keep it very short:
peer-to-peer works.

Ludwig


RE: Enabling peer to peer device transactions for PCIe devices

2017-10-24 Thread David Laight
Please don't top post, write shorter lines, and add the odd blank line.
Big blocks of text are hard to read quickly.

> From: Petrosyan, Ludwig [mailto:ludwig.petros...@desy.de]
> Yes, I agree it has to be started with a write transaction; according to the
> PCIe standard all write transactions are address-routed, and I agree with Logan:
> if the endpoint address is written in the header of the write-transaction TLP,
> the TLP should not touch the CPU; the PCIe switch has to route it to the
> endpoint.

That depends; IIRC there is a feature for PCIe switches that forces them
to send all transactions up to the root complex.
It is there so that the host can enforce rules to stop p2p transfers.
It might be enabled on the switch you have.
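
A minimal user-space sketch of that check, assuming the downstream port's
extended config space is readable as root via sysfs (the device path below is
only a placeholder): walk the extended capability list for the ACS capability
(ID 0x000d) and look at the P2P redirect bits.

/*
 * Hedged sketch: see whether a switch downstream port redirects
 * peer-to-peer TLPs upstream via ACS.  The sysfs path is a placeholder;
 * reading extended config space this way needs root.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define PCI_EXT_CAP_START   0x100
#define PCI_EXT_CAP_ID_ACS  0x000d
#define ACS_CTRL            6          /* ACS Control register offset in the capability */
#define ACS_P2P_REQ_REDIR   (1 << 2)
#define ACS_P2P_CMPL_REDIR  (1 << 3)

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:02:01.0/config", O_RDONLY);
    uint32_t hdr;
    uint16_t ctrl;
    off_t pos = PCI_EXT_CAP_START;

    if (fd < 0)
        return 1;

    while (pread(fd, &hdr, 4, pos) == 4 && hdr && (hdr & 0xffff) != 0xffff) {
        if ((hdr & 0xffff) == PCI_EXT_CAP_ID_ACS) {
            pread(fd, &ctrl, 2, pos + ACS_CTRL);
            printf("ACS control 0x%04x: p2p %sredirected to the root\n", ctrl,
                   (ctrl & (ACS_P2P_REQ_REDIR | ACS_P2P_CMPL_REDIR)) ? "" : "not ");
            break;
        }
        pos = (hdr >> 20) & 0xffc;     /* next extended capability offset */
        if (!pos)
            break;
    }
    close(fd);
    return 0;
}

If the P2P redirect bits are set, peer writes are forced up to the root first
no matter how the DMA engine is programmed.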

> The idea was: in an MTCA system there is a PCIe switch on the MCH (MTCA crate
> hub); this switch connects the CPU to the other crate slots, so one port is
> upstream and the others are downstream ports. A DMA read, seen from the CPU,
> is a usual write on the endpoint side. The Xilinx DMA core has two registers,
> Destination Address and Source Address, and the device driver has to set up
> these registers to perform a DMA.
> Usually, to start a DMA read from the board, the driver sets the Source
> Address to an FPGA memory address and the Destination Address to a
> DMA-prepared system address;
> to test p2p I set the Destination Address to the physical address of the
> other endpoint.

Unnecessary detail...

> More detail:
> we have a so-called universal PCIe driver; the idea behind it is:
> 1. all PCIe configuration stuff (finding enabled BARs, mapping BARs, usual
> read/write and common ioctls such as get slot number, get driver version ...)
> is implemented in the universal driver and EXPORTed.
> 2. if some system function changes in a new kernel we change it only in the
> universal part (easy to support a big number of drivers),
> so the universal driver is something like a PCIe driver API.
> 3. the universal driver provides read/write functions, so we have the same
> device access API for any PCIe device and can use the same user application
> with any PCIe device.

More crap...

> Now, during BAR discovery and mapping the universal driver keeps the PCIe
> endpoint physical addresses in some internal structures, and any top driver
> may get the physical address of another PCIe endpoint by slot number.
> In my case the physical address reported by get_resource is 0xB200; I checked
> with lspci -H1 -s <PCIe switch port bus address> (the endpoint connected to
> this port, verified with lspci -H1 -t) and the same address (0xB20) is the
> memory behind the bridge.

Overly verbose...

> I want to make p2p writes to offset 0x4, so I set the DMA destination address
> to 0xB240.
> Is something wrong?

Possibly.

You almost certainly need the address that is written into the BAR of the
target endpoint.
This could well be different from the physical address that the CPU uses
to write to the endpoint (as well as from the CPU virtual address).

lspci lies [1]; run lspci -x (or hexdump the config space through /sys/devices)
to see what is actually in the BAR.

[1] The addresses come from somewhere other than reading the BAR.
If the endpoint resets the BAR, lspci will still report the old
addresses.
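
A minimal sketch of doing that by hand, reading the raw BAR0 register out of
config space through sysfs (the device path is a placeholder):

/*
 * Hedged sketch: read BAR0 directly from PCI config space so we see the
 * address the endpoint will actually decode, rather than lspci's cached
 * view.  The sysfs path is a placeholder.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define PCI_BASE_ADDRESS_0 0x10    /* first BAR (or lower half of a 64-bit BAR) */

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:03:00.0/config", O_RDONLY);
    uint32_t bar0;

    if (fd < 0 || pread(fd, &bar0, 4, PCI_BASE_ADDRESS_0) != 4)
        return 1;

    if (bar0 & 0x1)                               /* bit 0: I/O space BAR */
        printf("BAR0 (I/O): 0x%08x\n", bar0 & ~0x3u);
    else                                          /* memory BAR: low 4 bits are flags */
        printf("BAR0 (mem): 0x%08x, flag bits 0x%x\n", bar0 & ~0xfu, bar0 & 0xf);

    close(fd);
    return 0;
}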

David



Re: Enabling peer to peer device transactions for PCIe devices

2017-10-23 Thread Petrosyan, Ludwig
Yes, I agree it has to be started with a write transaction; according to the
PCIe standard all write transactions are address-routed, and I agree with Logan:
if the endpoint address is written in the header of the write-transaction TLP,
the TLP should not touch the CPU; the PCIe switch has to route it to the endpoint.

The idea was: in an MTCA system there is a PCIe switch on the MCH (MTCA crate
hub); this switch connects the CPU to the other crate slots, so one port is
upstream and the others are downstream ports. A DMA read, seen from the CPU, is
a usual write on the endpoint side. The Xilinx DMA core has two registers,
Destination Address and Source Address, and the device driver has to set up
these registers to perform a DMA.
Usually, to start a DMA read from the board, the driver sets the Source Address
to an FPGA memory address and the Destination Address to a DMA-prepared system
address; to test p2p I set the Destination Address to the physical address of
the other endpoint.

More detail:
we have a so-called universal PCIe driver; the idea behind it is:
1. all PCIe configuration stuff (finding enabled BARs, mapping BARs, usual
read/write and common ioctls such as get slot number, get driver version ...)
is implemented in the universal driver and EXPORTed.
2. if some system function changes in a new kernel we change it only in the
universal part (easy to support a big number of drivers),
so the universal driver is something like a PCIe driver API.
3. the universal driver provides read/write functions, so we have the same
device access API for any PCIe device and can use the same user application
with any PCIe device.

Now, during BAR discovery and mapping the universal driver keeps the PCIe
endpoint physical addresses in some internal structures, and any top driver may
get the physical address of another PCIe endpoint by slot number.
In my case the physical address reported by get_resource is 0xB200; I checked
with lspci -H1 -s <PCIe switch port bus address> (the endpoint connected to this
port, verified with lspci -H1 -t) and the same address (0xB20) is the memory
behind the bridge.
I want to make p2p writes to offset 0x4, so I set the DMA destination address
to 0xB240.
Is something wrong?

Thanks for the help.
Regards,

Ludwig
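
On the address question above, one thing worth double-checking in the driver
is whether the DMA engine is given the peer's bus address rather than the CPU
physical address; in the kernel the former comes from pci_bus_address(), and
the two can differ. A hedged sketch only, with hypothetical register offsets
(XDMA_DST_ADDR_LO/HI are placeholders, not the real Xilinx layout):

/*
 * Hedged kernel-side sketch: program a DMA engine's destination with the
 * peer endpoint's *bus* address, i.e. the address other PCIe devices must
 * put in the TLP.  XDMA_DST_ADDR_LO/HI are hypothetical offsets.
 */
#include <linux/pci.h>
#include <linux/io.h>
#include <linux/kernel.h>

#define XDMA_DST_ADDR_LO 0x00    /* hypothetical */
#define XDMA_DST_ADDR_HI 0x04    /* hypothetical */

static void program_p2p_destination(void __iomem *dma_regs,
                                    struct pci_dev *peer, int bar, u32 offset)
{
    /* Bus address of the peer BAR as seen from the PCIe fabric. */
    dma_addr_t dst = pci_bus_address(peer, bar) + offset;

    iowrite32(lower_32_bits(dst), dma_regs + XDMA_DST_ADDR_LO);
    iowrite32(upper_32_bits(dst), dma_regs + XDMA_DST_ADDR_HI);
}

Whether the switch, ACS settings or an IOMMU then allow the transfer is a
separate question, but at least the TLP carries the address the target BAR
actually decodes.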

- Original Message -
From: "Logan Gunthorpe" <log...@deltatee.com>
To: "David Laight" <david.lai...@aculab.com>, "Petrosyan, Ludwig" 
<ludwig.petros...@desy.de>
Cc: "Alexander Deucher" <alexander.deuc...@amd.com>, "linux-kernel" 
<linux-kernel@vger.kernel.org>, "linux-rdma" <linux-r...@vger.kernel.org>, 
"linux-nvdimm" <linux-nvd...@lists.01.org>, "Linux-media" 
<linux-me...@vger.kernel.org>, "dri-devel" <dri-de...@lists.freedesktop.org>, 
"linux-pci" <linux-...@vger.kernel.org>, "John Bridgman" 
<john.bridg...@amd.com>, "Felix Kuehling" <felix.kuehl...@amd.com>, "Serguei 
Sagalovitch" <serguei.sagalovi...@amd.com>, "Paul Blinzer" 
<paul.blin...@amd.com>, "Christian Koenig" <christian.koe...@amd.com>, "Suravee 
Suthikulpanit" <suravee.suthikulpa...@amd.com>, "Ben Sander" 
<ben.san...@amd.com>
Sent: Tuesday, 24 October, 2017 00:04:26
Subject: Re: Enabling peer to peer device transactions for PCIe devices

On 23/10/17 10:08 AM, David Laight wrote:
> It is also worth checking that the hardware actually supports p2p transfers.
> Writes are more likely to be supported than reads.
> ISTR that some Intel CPUs support some p2p writes, but there could easily
> be errata against them.

Ludwig mentioned a PCIe switch. The few switches I'm aware of support 
P2P transfers. So if everything is set up correctly, the TLPs shouldn't 
even touch the CPU.

But, yes, generally it's a good idea to start with writes and see if 
they work first.

Logan


Re: Enabling peer to peer device transactions for PCIe devices

2017-10-23 Thread Logan Gunthorpe



On 23/10/17 10:08 AM, David Laight wrote:

It is also worth checking that the hardware actually supports p2p transfers.
Writes are more likely to be supported than reads.
ISTR that some Intel CPUs support some p2p writes, but there could easily
be errata against them.


Ludwig mentioned a PCIe switch. The few switches I'm aware of support 
P2P transfers. So if everything is set up correctly, the TLPs shouldn't 
even touch the CPU.


But, yes, generally it's a good idea to start with writes and see if 
they work first.


Logan


RE: Enabling peer to peer device transactions for PCIe devices

2017-10-23 Thread David Laight
From: Petrosyan Ludwig
> Sent: 22 October 2017 07:14
> Could be I have done something stupid...
> But at first sight it should be simple:
> PCIe write transactions are address-routed, so if the other endpoint's address
> is written in the packet header, the TLP has to be routed (by the PCIe switch)
> to that endpoint. A DMA read, seen from the endpoint, is really a write
> transaction issued by the endpoint; usually (with the Xilinx core) to start a
> DMA one has to write the destination address into the endpoint's DMA control
> register. So I have changed the device driver to set in this register the
> physical address of the other endpoint (the get_resource start of the other
> endpoint, which is the same address I can see in lspci -s <bus address of the
> switch port> as the memory behind the bridge), so now the endpoint should
> start sending write TLPs with the other endpoint's address in the TLP header.
> But this is not working (I want to understand why ...), though I can see that
> the first address of the destination endpoint is changed (to the wrong value
> 0xFF).
> Now I want to try to prepare the DMA buffer in the driver of one endpoint, but
> using the physical address of the other endpoint.
> It may never work, but I want to understand why and where my error is ...

It is also worth checking that the hardware actually supports p2p transfers.
Writes are more likely to be supported than reads.
ISTR that some Intel CPUs support some p2p writes, but there could easily
be errata against them.

I'd certainly test a single-word write to a read/write memory location.
First verify against kernel memory, then against a 'slave' board.
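
A minimal kernel sketch of a CPU-side sanity check on the target window,
before letting the FPGA issue the same single-word write (the target device,
BAR index and offset are placeholders):

/*
 * Hedged sketch: map the target BAR, write one 32-bit word from the CPU
 * and read it back.  Only once this works does it make sense to point
 * the FPGA's DMA engine at the same window.
 */
#include <linux/pci.h>
#include <linux/io.h>

static int p2p_smoke_test(struct pci_dev *target, int bar, u32 offset)
{
    void __iomem *regs = pci_iomap(target, bar, 0);
    u32 val;

    if (!regs)
        return -ENOMEM;

    iowrite32(0xdeadbeef, regs + offset);
    val = ioread32(regs + offset);    /* readback also flushes the posted write */
    pci_iounmap(target, regs);

    return val == 0xdeadbeef ? 0 : -EIO;
}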

I don't know about Xilinx FPGAs, but we've had 'fun' getting Altera FPGAs
to do sensible PCIe cycles (I ended up writing a simple DMA controller
that would generate long TLPs).
We also found a bug in the Altera logic that processed interleaved
read completions.

David



Re: Enabling peer to peer device transactions for PCIe devices

2017-10-22 Thread Logan Gunthorpe

On 22/10/17 12:13 AM, Petrosyan, Ludwig wrote:
> But at first sight it should be simple:
> PCIe write transactions are address-routed, so if the other endpoint's address
> is written in the packet header, the TLP has to be routed (by the PCIe switch)
> to that endpoint. A DMA read, seen from the endpoint, is really a write
> transaction issued by the endpoint; usually (with the Xilinx core) to start a
> DMA one has to write the destination address into the endpoint's DMA control
> register. So I have changed the device driver to set in this register the
> physical address of the other endpoint (the get_resource start of the other
> endpoint, which is the same address I can see in lspci -s <bus address of the
> switch port> as the memory behind the bridge), so now the endpoint should
> start sending write TLPs with the other endpoint's address in the TLP header.
> But this is not working (I want to understand why ...), though I can see that
> the first address of the destination endpoint is changed (to the wrong value
> 0xFF).
> Now I want to try to prepare the DMA buffer in the driver of one endpoint, but
> using the physical address of the other endpoint.
> It may never work, but I want to understand why and where my error is ...

Hmm, well if I understand you correctly it sounds like, in theory, it
should work. But there could be any number of reasons why it does not.
You may need to get a hold of a PCIe analyzer to figure out what's
actually going on.

Logan


Re: Enabling peer to peer device transactions for PCIe devices

2017-10-22 Thread Petrosyan, Ludwig
Hello Logan

Thank you very much for responding.
Could be I have done something stupid...
But at first sight it should be simple:
PCIe write transactions are address-routed, so if the other endpoint's address
is written in the packet header, the TLP has to be routed (by the PCIe switch)
to that endpoint. A DMA read, seen from the endpoint, is really a write
transaction issued by the endpoint; usually (with the Xilinx core) to start a
DMA one has to write the destination address into the endpoint's DMA control
register. So I have changed the device driver to set in this register the
physical address of the other endpoint (the get_resource start of the other
endpoint, which is the same address I can see in lspci -s <bus address of the
switch port> as the memory behind the bridge), so now the endpoint should start
sending write TLPs with the other endpoint's address in the TLP header.
But this is not working (I want to understand why ...), though I can see that
the first address of the destination endpoint is changed (to the wrong value
0xFF).
Now I want to try to prepare the DMA buffer in the driver of one endpoint, but
using the physical address of the other endpoint.
It may never work, but I want to understand why and where my error is ...

with best regards

Ludwig

- Original Message -
From: "Logan Gunthorpe" <log...@deltatee.com>
To: "Ludwig Petrosyan" <ludwig.petros...@desy.de>, "Deucher, Alexander" 
<alexander.deuc...@amd.com>, "linux-kernel@vger.kernel.org" 
<linux-kernel@vger.kernel.org>, "linux-r...@vger.kernel.org" 
<linux-r...@vger.kernel.org>, "linux-nvd...@lists.01.org" 
<linux-nvd...@lists.01.org>, "linux-me...@vger.kernel.org" 
<linux-me...@vger.kernel.org>, "dri-de...@lists.freedesktop.org" 
<dri-de...@lists.freedesktop.org>, "linux-...@vger.kernel.org" 
<linux-...@vger.kernel.org>
Cc: "Bridgman, John" <john.bridg...@amd.com>, "Kuehling, Felix" 
<felix.kuehl...@amd.com>, "Sagalovitch, Serguei" <serguei.sagalovi...@amd.com>, 
"Blinzer, Paul" <paul.blin...@amd.com>, "Koenig, Christian" 
<christian.koe...@amd.com>, "Suthikulpanit, Suravee" 
<suravee.suthikulpa...@amd.com>, "Sander, Ben" <ben.san...@amd.com>
Sent: Friday, 20 October, 2017 17:48:58
Subject: Re: Enabling peer to peer device transactions for PCIe devices

Hi Ludwig,

P2P transactions are still *very* experimental at the moment and take a 
lot of expertise to get working in a general setup. It will definitely 
require changes to the kernel, including the drivers of all the devices 
you are trying to make talk to each other. If you're up for it you can 
take a look at:

https://github.com/sbates130272/linux-p2pmem/

Which has our current rough work making NVMe fabrics use p2p transactions.

Logan

On 10/20/2017 6:36 AM, Ludwig Petrosyan wrote:
> Dear Linux kernel group
> 
> my name is Ludwig Petrosyan I am working in DESY (Germany)
> 
> we are responsible for the control system of  all accelerators in DESY.
> 
> For 7-8 years now we have been using MTCA.4 systems, with PCIe as the
> central bus.
> 
> I am mostly responsible for the Linux drivers of the AMC Cards (PCIe 
> endpoints).
> 
> The idea is to start using peer-to-peer transactions between PCIe endpoints
> (DMA and/or plain read/write).
> 
> Could you please advise me where to start; is there some documentation on
> how to do it?
> 
> 
> with best regards
> 
> 
> Ludwig
> 
> 
> On 11/21/2016 09:36 PM, Deucher, Alexander wrote:
>> This is certainly not the first time this has been brought up, but I'd 
>> like to try and get some consensus on the best way to move this 
>> forward.  Allowing devices to talk directly improves performance and 
>> reduces latency by avoiding the use of staging buffers in system 
>> memory.  Also in cases where both devices are behind a switch, it 
>> avoids the CPU entirely.  Most current APIs (DirectGMA, PeerDirect, 
>> CUDA, HSA) that deal with this are pointer based.  Ideally we'd be 
>> able to take a CPU virtual address and be able to get to a physical 
>> address taking into account IOMMUs, etc.  Having struct pages for the 
>> memory would allow it to work more generally and wouldn't require as 
>> much explicit support in drivers that wanted to use it.
>> Some use cases:
>> 1. Storage devices streaming directly to GPU device memory
>> 2. GPU device memory to GPU device memory streaming
>> 3. DVB/V4L/SDI devices streaming directly to GPU device memory
>> 4. DVB/V4L/SDI devices streaming directly to storage devices
>> Here is a relatively simple example of how this could work for 
>> testing.  This is obviously not a complete solution.

Re: Enabling peer to peer device transactions for PCIe devices

2017-10-20 Thread Logan Gunthorpe

Hi Ludwig,

P2P transactions are still *very* experimental at the moment and take a 
lot of expertise to get working in a general setup. It will definitely 
require changes to the kernel, including the drivers of all the devices 
you are trying to make talk to each other. If you're up for it you can 
take a look at:


https://github.com/sbates130272/linux-p2pmem/

Which has our current rough work making NVMe fabrics use p2p transactions.

Logan

On 10/20/2017 6:36 AM, Ludwig Petrosyan wrote:

Dear Linux kernel group

my name is Ludwig Petrosyan I am working in DESY (Germany)

we are responsible for the control system of  all accelerators in DESY.

For 7-8 years now we have been using MTCA.4 systems, with PCIe as the
central bus.


I am mostly responsible for the Linux drivers of the AMC Cards (PCIe 
endpoints).


The idea is to start using peer-to-peer transactions between PCIe endpoints
(DMA and/or plain read/write).


Could you please advise me where to start; is there some documentation on
how to do it?



with best regards


Ludwig


On 11/21/2016 09:36 PM, Deucher, Alexander wrote:
This is certainly not the first time this has been brought up, but I'd 
like to try and get some consensus on the best way to move this 
forward.  Allowing devices to talk directly improves performance and 
reduces latency by avoiding the use of staging buffers in system 
memory.  Also in cases where both devices are behind a switch, it 
avoids the CPU entirely.  Most current APIs (DirectGMA, PeerDirect, 
CUDA, HSA) that deal with this are pointer based.  Ideally we'd be 
able to take a CPU virtual address and be able to get to a physical 
address taking into account IOMMUs, etc.  Having struct pages for the 
memory would allow it to work more generally and wouldn't require as 
much explicit support in drivers that wanted to use it.

Some use cases:
1. Storage devices streaming directly to GPU device memory
2. GPU device memory to GPU device memory streaming
3. DVB/V4L/SDI devices streaming directly to GPU device memory
4. DVB/V4L/SDI devices streaming directly to storage devices
Here is a relatively simple example of how this could work for 
testing.  This is obviously not a complete solution.
- Device memory will be registered with the Linux memory sub-system by
creating corresponding struct page structures for device memory
- get_user_pages_fast() will return corresponding struct pages when a
CPU address points to the device memory

- put_page() will deal with struct pages for device memory
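
A hedged sketch of that pinning flow using get_user_pages_fast()/put_page();
the helper name is illustrative only, and the (start, nr_pages, write, pages)
signature shown is the one from kernels of roughly this era.

/*
 * Hedged sketch of the pinning flow described above.  Assumes the
 * get_user_pages_fast(start, nr_pages, write, pages) signature of this
 * era; later kernels take gup_flags instead of the write flag.
 */
#include <linux/mm.h>
#include <linux/slab.h>

static struct page **pin_user_range(unsigned long uaddr, int nr_pages)
{
    struct page **pages;
    int pinned;

    pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
    if (!pages)
        return NULL;

    /* Once device memory is backed by struct pages (ZONE_DEVICE), this
     * pins it exactly like ordinary anonymous or file-backed memory. */
    pinned = get_user_pages_fast(uaddr, nr_pages, 1, pages);
    if (pinned != nr_pages) {
        while (pinned > 0)
            put_page(pages[--pinned]);
        kfree(pages);
        return NULL;
    }
    return pages;
}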
Previously proposed solutions and related proposals:
1.P2P DMA
DMA-API/PCI map_peer_resource support for peer-to-peer 
(http://www.spinics.net/lists/linux-pci/msg44560.html)

Pros: Low impact, already largely reviewed.
Cons: requires explicit support in all drivers that want to support 
it, doesn't handle S/G in device memory.

2. ZONE_DEVICE IO
Direct I/O and DMA for persistent memory 
(https://lwn.net/Articles/672457/)
Add support for ZONE_DEVICE IO memory with struct pages. 
(https://patchwork.kernel.org/patch/8583221/)

Pro: Doesn't waste system memory for ZONE metadata
Cons: CPU access to ZONE metadata slow, may be lost, corrupted on 
device reset.

3. DMA-BUF
RDMA subsystem DMA-BUF support 
(http://www.spinics.net/lists/linux-rdma/msg38748.html)

Pros: uses existing dma-buf interface
Cons: dma-buf is handle based, requires explicit dma-buf support in 
drivers.


4. iopmem
iopmem : A block device for PCIe memory 
(https://lwn.net/Articles/703895/)

5. HMM
Heterogeneous Memory Management 
(http://lkml.iu.edu/hypermail/linux/kernel/1611.2/02473.html)


6. Some new mmap-like interface that takes a userptr and a length and 
returns a dma-buf and offset?

Alex





Re: Enabling peer to peer device transactions for PCIe devices

2017-10-20 Thread Ludwig Petrosyan

Dear Linux kernel group

my name is Ludwig Petrosyan I am working in DESY (Germany)

we are responsible for the control system of  all accelerators in DESY.

For 7-8 years now we have been using MTCA.4 systems, with PCIe as the
central bus.


I am mostly responsible for the Linux drivers of the AMC Cards (PCIe 
endpoints).


The idea is to start using peer-to-peer transactions between PCIe endpoints
(DMA and/or plain read/write).


Could you please advise me where to start; is there some documentation on
how to do it?



with best regards


Ludwig


On 11/21/2016 09:36 PM, Deucher, Alexander wrote:

This is certainly not the first time this has been brought up, but I'd like to 
try and get some consensus on the best way to move this forward.  Allowing 
devices to talk directly improves performance and reduces latency by avoiding 
the use of staging buffers in system memory.  Also in cases where both devices 
are behind a switch, it avoids the CPU entirely.  Most current APIs (DirectGMA, 
PeerDirect, CUDA, HSA) that deal with this are pointer based.  Ideally we'd be 
able to take a CPU virtual address and be able to get to a physical address 
taking into account IOMMUs, etc.  Having struct pages for the memory would 
allow it to work more generally and wouldn't require as much explicit support 
in drivers that wanted to use it.
  
Some use cases:

1. Storage devices streaming directly to GPU device memory
2. GPU device memory to GPU device memory streaming
3. DVB/V4L/SDI devices streaming directly to GPU device memory
4. DVB/V4L/SDI devices streaming directly to storage devices
  
Here is a relatively simple example of how this could work for testing.  This is obviously not a complete solution.

- Device memory will be registered with the Linux memory sub-system by creating
corresponding struct page structures for device memory
- get_user_pages_fast() will return corresponding struct pages when a CPU
address points to the device memory
- put_page() will deal with struct pages for device memory
  
Previously proposed solutions and related proposals:

1.P2P DMA
DMA-API/PCI map_peer_resource support for peer-to-peer 
(http://www.spinics.net/lists/linux-pci/msg44560.html)
Pros: Low impact, already largely reviewed.
Cons: requires explicit support in all drivers that want to support it, doesn't 
handle S/G in device memory.
  
2. ZONE_DEVICE IO

Direct I/O and DMA for persistent memory (https://lwn.net/Articles/672457/)
Add support for ZONE_DEVICE IO memory with struct pages. 
(https://patchwork.kernel.org/patch/8583221/)
Pro: Doesn't waste system memory for ZONE metadata
Cons: CPU access to ZONE metadata slow, may be lost, corrupted on device reset.
  
3. DMA-BUF

RDMA subsystem DMA-BUF support 
(http://www.spinics.net/lists/linux-rdma/msg38748.html)
Pros: uses existing dma-buf interface
Cons: dma-buf is handle based, requires explicit dma-buf support in drivers.

4. iopmem
iopmem : A block device for PCIe memory (https://lwn.net/Articles/703895/)
  
5. HMM

Heterogeneous Memory Management 
(http://lkml.iu.edu/hypermail/linux/kernel/1611.2/02473.html)

6. Some new mmap-like interface that takes a userptr and a length and returns a 
dma-buf and offset?
  
Alex






Re: Enabling peer to peer device transactions for PCIe devices

2017-01-13 Thread Christian König

Am 12.01.2017 um 16:11 schrieb Jerome Glisse:

On Wed, Jan 11, 2017 at 10:54:39PM -0600, Stephen Bates wrote:

On Fri, January 6, 2017 4:10 pm, Logan Gunthorpe wrote:


On 06/01/17 11:26 AM, Jason Gunthorpe wrote:



Make a generic API for all of this and you'd have my vote..


IMHO, you must support basic pinning semantics - that is necessary to
support generic short lived DMA (eg filesystem, etc). That hardware can
clearly do that if it can support ODP.

I agree completely.


What we want is for RDMA, O_DIRECT, etc to just work with special VMAs
(ie. at least those backed with ZONE_DEVICE memory). Then
GPU/NVME/DAX/whatever drivers can just hand these VMAs to userspace
(using whatever interface is most appropriate) and userspace can do what
it pleases with them. This makes _so_ much sense and actually largely
already works today (as demonstrated by iopmem).

+1 for iopmem ;-)

I feel like we are going around and around on this topic. I would like to
see something that is upstream that enables P2P even if it is only the
minimum viable useful functionality to begin. I think aiming for the moon
(which is what HMM and things like it are) is simply going to take more
time, if it ever gets there.

There is a use case for in-kernel P2P PCIe transfers between two NVMe
devices and between an NVMe device and an RDMA NIC (using NVMe CMBs or
BARs on the NIC). I am even seeing users who now want to move data P2P
between FPGAs and NVMe SSDs and the upstream kernel should be able to
support these users or they will look elsewhere.

The iopmem patchset addressed all the use cases above and while it is not
an in kernel API it could have been modified to be one reasonably easily.
As Logan states the driver can then choose to pass the VMAs to user-space
in a manner that makes sense.

Earlier in the thread someone mentioned LSF/MM. There is already a
proposal to discuss this topic so if you are interested please respond to
the email letting the committee know this topic is of interest to you [1].

Also earlier in the thread someone discussed the issues around the IOMMU.
Given the known issues around P2P transfers in certain CPU root complexes
[2] it might just be a case of only allowing P2P when a PCIe switch
connects the two EPs. Another option is just to use CONFIG_EXPERT and make
sure people are aware of the pitfalls if they invoke the P2P option.


iopmem is not applicable to GPUs. What I propose is to split the issue in two,
so that everyone can reuse the part that needs to be common, namely the DMA
API part, where you have to create an IOMMU mapping for one device to point
at the other device's memory.

We can have a DMA API that is agnostic to how the device memory is managed
(so it does not matter whether the device memory has struct pages or not).
This is what I have been arguing in this thread. To make progress on this
issue we need to stop conflating different use cases.

So I say let's solve the IOMMU issue first and let everyone use it in their
own way with their device. I do not think we can share much more than
that.


Yeah, exactly what I said from the very beginning as well. Just hacking 
together quick solutions doesn't really solve the problem in the long term.


What we need is to properly adjust the DMA API towards handling P2P
and then build solutions for the different use cases on top of that.
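
One existing building block that points in that direction, shown here only as
a hedged sketch: dma_map_resource() maps a peer BAR's physical address through
the initiating device's IOMMU and returns the DMA address to program into that
device (the helper below is illustrative, not an agreed-upon P2P API).

/*
 * Hedged sketch: map the target device's BAR through the initiating
 * device's IOMMU with dma_map_resource().  Check the result with
 * dma_mapping_error(); an IOMMU (or arch) may refuse to map MMIO,
 * which is exactly the policy question in this thread.
 */
#include <linux/pci.h>
#include <linux/dma-mapping.h>

static dma_addr_t map_peer_bar(struct pci_dev *initiator,
                               struct pci_dev *target, int bar)
{
    phys_addr_t phys = pci_resource_start(target, bar);
    size_t len = pci_resource_len(target, bar);

    return dma_map_resource(&initiator->dev, phys, len,
                            DMA_BIDIRECTIONAL, 0);
}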


We should also avoid falling into the trap of trying to just handle the 
existing get_user_pages and co interfaces so that the existing code 
doesn't need to change. P2P needs to be validated for each use case 
individually and not implemented in workarounds with fingers crossed,
hoping for the best.


Regards,
Christian.



Cheers,
Jérôme





Re: Enabling peer to peer device transactions for PCIe devices

2017-01-12 Thread Logan Gunthorpe


On 11/01/17 09:54 PM, Stephen Bates wrote:
> The iopmem patchset addressed all the use cases above and while it is not
> an in kernel API it could have been modified to be one reasonably easily.
> As Logan states the driver can then choose to pass the VMAs to user-space
> in a manner that makes sense.

Just to clarify: the iopmem patchset had one patch that allowed for
slightly more flexible zone device mappings which ought to be useful for
everyone.

The other patch (which was iopmem proper) was more of an example of how
the zone_device memory _could_ be exposed to userspace with "iopmem"
hardware that looks similar to nvdimm hardware. Iopmem was not really
useful, in itself, for NVMe devices and it was never expected to be
useful for GPUs.
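
For concreteness, a hedged user-space sketch of what exposing such memory to
userspace looks like from the application side; the /dev/iopmem0 node and the
1 MiB length are purely hypothetical.

/*
 * Hedged sketch: mmap() a device node whose driver backs the VMA with
 * ZONE_DEVICE pages and touch the mapping with plain loads/stores.
 * /dev/iopmem0 is a hypothetical name.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 1 << 20;    /* 1 MiB of device memory, hypothetical */
    int fd = open("/dev/iopmem0", O_RDWR);
    volatile uint32_t *p;
    void *map;

    if (fd < 0)
        return 1;

    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return 1;
    p = map;

    p[0] = 0x12345678;                   /* ordinary CPU store into device memory */
    printf("readback: 0x%08x\n", p[0]);

    munmap(map, len);
    close(fd);
    return 0;
}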

Logan


Re: Enabling peer to peer device transactions for PCIe devices

2017-01-12 Thread Jason Gunthorpe
On Thu, Jan 12, 2017 at 10:11:29AM -0500, Jerome Glisse wrote:
> On Wed, Jan 11, 2017 at 10:54:39PM -0600, Stephen Bates wrote:
> > > What we want is for RDMA, O_DIRECT, etc to just work with special VMAs
> > > (ie. at least those backed with ZONE_DEVICE memory). Then
> > > GPU/NVME/DAX/whatever drivers can just hand these VMAs to userspace
> > > (using whatever interface is most appropriate) and userspace can do what
> > > it pleases with them. This makes _so_ much sense and actually largely
> > > already works today (as demonstrated by iopmem).

> So i say let solve the IOMMU issue first and let everyone use it in their
> own way with their device. I do not think we can share much more than
> that.

Solve it for the easy ZONE_DIRECT/etc case then.

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2017-01-12 Thread Jerome Glisse
On Wed, Jan 11, 2017 at 10:54:39PM -0600, Stephen Bates wrote:
> On Fri, January 6, 2017 4:10 pm, Logan Gunthorpe wrote:
> >
> >
> > On 06/01/17 11:26 AM, Jason Gunthorpe wrote:
> >
> >
> >> Make a generic API for all of this and you'd have my vote..
> >>
> >>
> >> IMHO, you must support basic pinning semantics - that is necessary to
> >> support generic short lived DMA (eg filesystem, etc). That hardware can
> >> clearly do that if it can support ODP.
> >
> > I agree completely.
> >
> >
> > What we want is for RDMA, O_DIRECT, etc to just work with special VMAs
> > (ie. at least those backed with ZONE_DEVICE memory). Then
> > GPU/NVME/DAX/whatever drivers can just hand these VMAs to userspace
> > (using whatever interface is most appropriate) and userspace can do what
> > it pleases with them. This makes _so_ much sense and actually largely
> > already works today (as demonstrated by iopmem).
> 
> +1 for iopmem ;-)
> 
> I feel like we are going around and around on this topic. I would like to
> see something that is upstream that enables P2P even if it is only the
> minimum viable useful functionality to begin. I think aiming for the moon
> (which is what HMM and things like it are) are simply going to take more
> time if they ever get there.
> 
> There is a use case for in-kernel P2P PCIe transfers between two NVMe
> devices and between an NVMe device and an RDMA NIC (using NVMe CMBs or
> BARs on the NIC). I am even seeing users who now want to move data P2P
> between FPGAs and NVMe SSDs and the upstream kernel should be able to
> support these users or they will look elsewhere.
> 
> The iopmem patchset addressed all the use cases above and while it is not
> an in kernel API it could have been modified to be one reasonably easily.
> As Logan states the driver can then choose to pass the VMAs to user-space
> in a manner that makes sense.
> 
> Earlier in the thread someone mentioned LSF/MM. There is already a
> proposal to discuss this topic so if you are interested please respond to
> the email letting the committee know this topic is of interest to you [1].
> 
> Also earlier in the thread someone discussed the issues around the IOMMU.
> Given the known issues around P2P transfers in certain CPU root complexes
> [2] it might just be a case of only allowing P2P when a PCIe switch
> connects the two EPs. Another option is just to use CONFIG_EXPERT and make
> sure people are aware of the pitfalls if they invoke the P2P option.


iopmem is not applicable to GPUs. What I propose is to split the issue in two
so that everyone can reuse the part that needs to be common, namely the DMA
API part where you have to create an IOMMU mapping for one device to point
to the other device's memory.

We can have a DMA API that is agnostic to how the device memory is managed
(so it does not matter whether the device memory has struct pages or not).
This is what I have been arguing in this thread. To make progress on this
issue we need to stop conflating the different use cases.

So I say let's solve the IOMMU issue first and let everyone use it in their
own way with their device. I do not think we can share much more than
that.

Cheers,
Jérôme


Re: Enabling peer to peer device transactions for PCIe devices

2017-01-11 Thread Stephen Bates
On Fri, January 6, 2017 4:10 pm, Logan Gunthorpe wrote:
>
>
> On 06/01/17 11:26 AM, Jason Gunthorpe wrote:
>
>
>> Make a generic API for all of this and you'd have my vote..
>>
>>
>> IMHO, you must support basic pinning semantics - that is necessary to
>> support generic short lived DMA (eg filesystem, etc). That hardware can
>> clearly do that if it can support ODP.
>
> I agree completely.
>
>
> What we want is for RDMA, O_DIRECT, etc to just work with special VMAs
> (ie. at least those backed with ZONE_DEVICE memory). Then
> GPU/NVME/DAX/whatever drivers can just hand these VMAs to userspace
> (using whatever interface is most appropriate) and userspace can do what
> it pleases with them. This makes _so_ much sense and actually largely
> already works today (as demonstrated by iopmem).

+1 for iopmem ;-)

I feel like we are going around and around on this topic. I would like to
see something that is upstream that enables P2P even if it is only the
minimum viable useful functionality to begin. I think aiming for the moon
(which is what HMM and things like it are) is simply going to take more
time, if it ever gets there.

There is a use case for in-kernel P2P PCIe transfers between two NVMe
devices and between an NVMe device and an RDMA NIC (using NVMe CMBs or
BARs on the NIC). I am even seeing users who now want to move data P2P
between FPGAs and NVMe SSDs and the upstream kernel should be able to
support these users or they will look elsewhere.

The iopmem patchset addressed all the use cases above and while it is not
an in kernel API it could have been modified to be one reasonably easily.
As Logan states the driver can then choose to pass the VMAs to user-space
in a manner that makes sense.

Earlier in the thread someone mentioned LSF/MM. There is already a
proposal to discuss this topic so if you are interested please respond to
the email letting the committee know this topic is of interest to you [1].

Also earlier in the thread someone discussed the issues around the IOMMU.
Given the known issues around P2P transfers in certain CPU root complexes
[2] it might just be a case of only allowing P2P when a PCIe switch
connects the two EPs. Another option is just to use CONFIG_EXPERT and make
sure people are aware of the pitfalls if they invoke the P2P option.
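
For what it is worth, the "only when a PCIe switch connects the two EPs"
check needs nothing more than a walk of pci_upstream_bridge(); a rough
sketch (function name invented, policy details omitted) is below.

#include <linux/pci.h>

/* Return true if 'a' and 'b' share an upstream bridge below the root
 * port, i.e. a switch connects them.  Illustrative only. */
static bool p2p_under_same_switch(struct pci_dev *a, struct pci_dev *b)
{
        struct pci_dev *up_a, *up_b;

        for (up_a = pci_upstream_bridge(a); up_a;
             up_a = pci_upstream_bridge(up_a)) {
                if (pci_pcie_type(up_a) == PCI_EXP_TYPE_ROOT_PORT)
                        break;          /* reached the root complex */
                for (up_b = pci_upstream_bridge(b); up_b;
                     up_b = pci_upstream_bridge(up_b)) {
                        if (up_a == up_b)
                                return true;    /* common switch port */
                }
        }
        return false;
}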

Finally, as Jason noted, we could all just wait until
CAPI/OpenCAPI/CCIX/GenZ comes along. However given that these interfaces
are the remit of the CPU vendors I think it behooves us to solve this
problem before then. Also some of the above mentioned protocols are not
even switchable and may not be amenable to a P2P topology...

Stephen

[1] http://marc.info/?l=linux-mm&m=148156541804940&w=2
[2] https://community.mellanox.com/docs/DOC-1119



Re: Enabling peer to peer device transactions for PCIe devices

2017-01-06 Thread Logan Gunthorpe


On 06/01/17 11:26 AM, Jason Gunthorpe wrote:

> Make a generic API for all of this and you'd have my vote..
> 
> IMHO, you must support basic pinning semantics - that is necessary to
> support generic short lived DMA (eg filesystem, etc). That hardware
> can clearly do that if it can support ODP.

I agree completely.

What we want is for RDMA, O_DIRECT, etc to just work with special VMAs
(ie. at least those backed with ZONE_DEVICE memory). Then
GPU/NVME/DAX/whatever drivers can just hand these VMAs to userspace
(using whatever interface is most appropriate) and userspace can do what
it pleases with them. This makes _so_ much sense and actually largely
already works today (as demonstrated by iopmem).

Though, of course, there are many aspects that could still be improved
like denying CPU access to special VMAs and having get_user_pages avoid
pinning device memory, etc, etc. But all this would just be enhancements
to how VMAs work and not be affected by the basic design described above.
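
As one small, concrete example of the kind of enhancement meant here, a DMA
consumer could at least detect ZONE_DEVICE pages after pinning and treat them
specially. The sketch below assumes the get_user_pages() signature of the
~4.9 kernels and an invented helper name.

#include <linux/mm.h>

/* Pin a user buffer and report whether any of it is ZONE_DEVICE backed.
 * The caller must put_page() every returned page when it is done. */
static long pin_user_buffer(unsigned long uaddr, unsigned long nr_pages,
                            struct page **pages, bool *has_device_pages)
{
        long i, pinned;

        pinned = get_user_pages(uaddr, nr_pages, FOLL_WRITE, pages, NULL);
        if (pinned <= 0)
                return pinned;

        *has_device_pages = false;
        for (i = 0; i < pinned; i++)
                if (is_zone_device_page(pages[i]))
                        *has_device_pages = true;

        return pinned;
}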

We experimented with GPU Direct and the peer memory patchset for IB and
they were broken by design. They were just a very specific hack into the
IB core and thus didn't help to support O_DIRECT or any other possible
DMA user. And the invalidation thing was completely nuts. We had to pray
an invalidation would never occur because, if it did, our application
would just break.

Logan



RE: Enabling peer to peer device transactions for PCIe devices

2017-01-06 Thread Deucher, Alexander
> -Original Message-
> From: Jason Gunthorpe [mailto:jguntho...@obsidianresearch.com]
> Sent: Friday, January 06, 2017 1:26 PM
> To: Jerome Glisse
> Cc: Sagalovitch, Serguei; Jerome Glisse; Deucher, Alexander; 'linux-
> ker...@vger.kernel.org'; 'linux-r...@vger.kernel.org'; 'linux-
> nvd...@lists.01.org'; 'linux-me...@vger.kernel.org'; 'dri-
> de...@lists.freedesktop.org'; 'linux-...@vger.kernel.org'; Kuehling, Felix;
> Blinzer, Paul; Koenig, Christian; Suthikulpanit, Suravee; Sander, Ben;
> h...@infradead.org; Zhou, David(ChunMing); Yu, Qiang
> Subject: Re: Enabling peer to peer device transactions for PCIe devices
> 
> On Fri, Jan 06, 2017 at 12:37:22PM -0500, Jerome Glisse wrote:
> > On Fri, Jan 06, 2017 at 11:56:30AM -0500, Serguei Sagalovitch wrote:
> > > On 2017-01-05 08:58 PM, Jerome Glisse wrote:
> > > > On Thu, Jan 05, 2017 at 05:30:34PM -0700, Jason Gunthorpe wrote:
> > > > > On Thu, Jan 05, 2017 at 06:23:52PM -0500, Jerome Glisse wrote:
> > > > >
> > > > > > > I still don't understand what you driving at - you've said in both
> > > > > > > cases a user VMA exists.
> > > > > > In the former case no, there is no VMA directly but if you want one
> than
> > > > > > a device can provide one. But such VMA is useless as CPU access is
> not
> > > > > > expected.
> > > > > I disagree it is useless, the VMA is going to be necessary to support
> > > > > upcoming things like CAPI, you need it to support O_DIRECT from the
> > > > > filesystem, DPDK, etc. This is why I am opposed to any model that is
> > > > > not VMA based for setting up RDMA - that is shorted sighted and
> does
> > > > > not seem to reflect where the industry is going.
> > > > >
> > > > > So focus on having VMA backed by actual physical memory that
> covers
> > > > > your GPU objects and ask how do we wire up the '__user *' to the
> DMA
> > > > > API in the best way so the DMA API still has enough information to
> > > > > setup IOMMUs and whatnot.
> > > > I am talking about 2 different thing. Existing hardware and API where
> you
> > > > _do not_ have a vma and you do not need one. This is just
> > > > > existing stuff.
> 
> > > I do not understand why you assume that existing API doesn't  need one.
> > > I would say that a lot of __existing__ user level API and their support in
> > > kernel (especially outside of graphics domain) assumes that we have vma
> and
> > > deal with __user * pointers.
> 
> +1
> 
> > Well i am thinking to GPUDirect here. Some of GPUDirect use case do not
> have
> > vma (struct vm_area_struct) associated with them they directly apply to
> GPU
> > object that aren't expose to CPU. Yes some use case have vma for share
> buffer.
> 
> Lets stop talkind about GPU direct. Today we can't even make VMA
> pointing at a PCI bar work properly in the kernel - lets start there
> please. People can argue over other options once that is done.
> 
> > For HMM plan is to restrict to ODP and either to replace ODP with HMM or
> change
> > ODP to not use get_user_pages_remote() but directly fetch informations
> from
> > CPU page table. Everything else stay as it is. I posted patchset to replace
> > ODP with HMM in the past.
> 
> Make a generic API for all of this and you'd have my vote..
> 
> IMHO, you must support basic pinning semantics - that is necessary to
> support generic short lived DMA (eg filesystem, etc). That hardware
> can clearly do that if it can support ODP.

We would definitely like to have support for hardware that can't handle page 
faults gracefully.

Alex



Re: Enabling peer to peer device transactions for PCIe devices

2017-01-06 Thread Jason Gunthorpe
On Fri, Jan 06, 2017 at 12:37:22PM -0500, Jerome Glisse wrote:
> On Fri, Jan 06, 2017 at 11:56:30AM -0500, Serguei Sagalovitch wrote:
> > On 2017-01-05 08:58 PM, Jerome Glisse wrote:
> > > On Thu, Jan 05, 2017 at 05:30:34PM -0700, Jason Gunthorpe wrote:
> > > > On Thu, Jan 05, 2017 at 06:23:52PM -0500, Jerome Glisse wrote:
> > > > 
> > > > > > I still don't understand what you driving at - you've said in both
> > > > > > cases a user VMA exists.
> > > > > In the former case no, there is no VMA directly but if you want one 
> > > > > than
> > > > > a device can provide one. But such VMA is useless as CPU access is not
> > > > > expected.
> > > > I disagree it is useless, the VMA is going to be necessary to support
> > > > upcoming things like CAPI, you need it to support O_DIRECT from the
> > > > filesystem, DPDK, etc. This is why I am opposed to any model that is
> > > > not VMA based for setting up RDMA - that is shorted sighted and does
> > > > not seem to reflect where the industry is going.
> > > > 
> > > > So focus on having VMA backed by actual physical memory that covers
> > > > your GPU objects and ask how do we wire up the '__user *' to the DMA
> > > > API in the best way so the DMA API still has enough information to
> > > > setup IOMMUs and whatnot.
> > > I am talking about 2 different thing. Existing hardware and API where you
> > > _do not_ have a vma and you do not need one. This is just
> > > > existing stuff.

> > I do not understand why you assume that existing API doesn't  need one.
> > I would say that a lot of __existing__ user level API and their support in
> > kernel (especially outside of graphics domain) assumes that we have vma and
> > deal with __user * pointers.

+1

> Well i am thinking to GPUDirect here. Some of GPUDirect use case do not have
> vma (struct vm_area_struct) associated with them they directly apply to GPU
> object that aren't expose to CPU. Yes some use case have vma for share buffer.

Let's stop talking about GPU Direct. Today we can't even make a VMA
pointing at a PCI BAR work properly in the kernel - let's start there,
please. People can argue over other options once that is done.

> For HMM plan is to restrict to ODP and either to replace ODP with HMM or 
> change
> ODP to not use get_user_pages_remote() but directly fetch informations from
> CPU page table. Everything else stay as it is. I posted patchset to replace
> ODP with HMM in the past.

Make a generic API for all of this and you'd have my vote..

IMHO, you must support basic pinning semantics - that is necessary to
support generic short lived DMA (eg filesystem, etc). That hardware
can clearly do that if it can support ODP.
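
For context, the contrast with ODP is essentially pinning versus
invalidation: an ODP-style driver keeps its device mappings coherent through
mmu notifiers instead of holding page references forever. A minimal
registration sketch (callback body omitted, names invented, signatures as of
roughly the 4.9-era kernels) looks like this.

#include <linux/mmu_notifier.h>
#include <linux/sched.h>

static void my_invalidate_range_start(struct mmu_notifier *mn,
                                      struct mm_struct *mm,
                                      unsigned long start, unsigned long end)
{
        /* tear down / refresh any device mapping covering [start, end) */
}

static const struct mmu_notifier_ops my_mn_ops = {
        .invalidate_range_start = my_invalidate_range_start,
};

static struct mmu_notifier my_mn = { .ops = &my_mn_ops };

/* Register against the current process so the driver hears about any
 * change to the CPU page tables it mirrors. */
static int my_track_current_mm(void)
{
        return mmu_notifier_register(&my_mn, current->mm);
}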

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2017-01-06 Thread Jerome Glisse
On Fri, Jan 06, 2017 at 11:56:30AM -0500, Serguei Sagalovitch wrote:
> On 2017-01-05 08:58 PM, Jerome Glisse wrote:
> > On Thu, Jan 05, 2017 at 05:30:34PM -0700, Jason Gunthorpe wrote:
> > > On Thu, Jan 05, 2017 at 06:23:52PM -0500, Jerome Glisse wrote:
> > > 
> > > > > I still don't understand what you driving at - you've said in both
> > > > > cases a user VMA exists.
> > > > In the former case no, there is no VMA directly but if you want one than
> > > > a device can provide one. But such VMA is useless as CPU access is not
> > > > expected.
> > > I disagree it is useless, the VMA is going to be necessary to support
> > > upcoming things like CAPI, you need it to support O_DIRECT from the
> > > filesystem, DPDK, etc. This is why I am opposed to any model that is
> > > not VMA based for setting up RDMA - that is shorted sighted and does
> > > not seem to reflect where the industry is going.
> > > 
> > > So focus on having VMA backed by actual physical memory that covers
> > > your GPU objects and ask how do we wire up the '__user *' to the DMA
> > > API in the best way so the DMA API still has enough information to
> > > setup IOMMUs and whatnot.
> > I am talking about 2 different thing. Existing hardware and API where you
> > _do not_ have a vma and you do not need one. This is just existing stuff.
> I do not understand why you assume that existing API doesn't  need one.
> I would say that a lot of __existing__ user level API and their support in
> kernel (especially outside of graphics domain) assumes that we have vma and
> deal with __user * pointers.

Well, I am thinking of GPUDirect here. Some GPUDirect use cases do not have a
vma (struct vm_area_struct) associated with them; they directly apply to GPU
objects that aren't exposed to the CPU. Yes, some use cases have a vma for a
shared buffer.

In the open source driver it is true that we have a vma more often than not.

> > Some close driver provide a functionality on top of this design. Question
> > is do we want to do the same ? If yes and you insist on having a vma we
> > could provide one but this is does not apply and is useless for where we
> > are going with new hardware.
> > 
> > With new hardware you just use malloc or mmap to allocate memory and then
> > you use it directly with the device. Device driver can migrate any part of
> > the process address space to device memory. In this scheme you have your
> > usual VMAs but there is nothing special about them.
>
> Assuming that the whole device memory is CPU accessible and it looks
> like the direction where we are going:
> - You forgot about use case when we want or need to allocate memory
> directly on device (why we need to migrate anything if not needed?).
> - We may want to use CPU to access such memory on device to avoid
> any unnecessary migration back.
> - We may have more device memory than the system one.
> E.g. if you have 12 GPUs w/64GB each it will already give us ~0.7 TB
> not mentioning NVDIMM cards which could also be used as memory
> storage for other device access.
> - We also may want/need to share GPU memory between different
> processes.

Here I am talking about platforms where GPU memory is not accessible at
all by the CPU (because of PCIe restrictions; think CPU atomic operations
on I/O memory).

So I really distinguish between CAPI/CCIX and PCIe. Some platforms will have
CAPI/CCIX, others won't. HMM applies mostly to the latter. Some HMM
functionality is still useful with CAPI/CCIX.

Note that HMM does support allocating on the GPU first. In the current design
this can happen when the GPU is the first to access an unpopulated virtual
address.


For platforms where GPU memory is accessible, the plan is either something
like CDM (Coherent Device Memory) or relying on ZONE_DEVICE. So all GPU
memory has struct pages and those behave like ordinary pages. CDM still
wants some restrictions, like preventing CPU allocations from landing on the
GPU when there is memory pressure... For all intents and purposes this
will work transparently with respect to RDMA, because we assume on those
systems that the RDMA device is CAPI/CCIX and that it can peer to other
devices.
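
To make the ZONE_DEVICE part less abstract: giving device memory struct
pages is, mechanically, a devm_memremap_pages() call on the BAR resource. The
sketch below assumes the four-argument form used around the 4.9/4.10 kernels
(the signature has changed since) and an invented wrapper name; a real driver
also has to size the BAR window and wire the percpu_ref into its teardown
path.

#include <linux/err.h>
#include <linux/memremap.h>
#include <linux/pci.h>
#include <linux/percpu-refcount.h>

/* Hand BAR 'bar' of 'pdev' to the ZONE_DEVICE machinery so it gets
 * struct pages.  'ref' must be tied to the driver's teardown. */
static void *gpu_bar_to_zone_device(struct pci_dev *pdev, int bar,
                                    struct percpu_ref *ref)
{
        struct resource *res = &pdev->resource[bar];

        if (resource_size(res) == 0)
                return ERR_PTR(-ENXIO);

        /* NULL altmap: page structs come from regular system memory */
        return devm_memremap_pages(&pdev->dev, res, ref, NULL);
}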


> > Now when you try to do get_user_page() on any page that is inside the
> > device it will fails because we do not allow any device memory to be pin.
> > There is various reasons for that and they are not going away in any hw
> > in the planing (so for next few years).
> > 
> > Still we do want to support peer to peer mapping. Plan is to only do so
> > with ODP capable hardware. Still we need to solve the IOMMU issue and
> > it needs special handling inside the RDMA device. The way it works is
> > that RDMA ask for a GPU page, GPU check if it has place inside its PCI
> > bar to map this page for the device, this can fail. If it succeed then
> > you need the IOMMU to let the RDMA device access the GPU PCI bar.
> > 
> > So here we have 2 orthogonal problem. First one is how to make 2 drivers
> > talks to each other to setup mapping to allow peer to peer and second is
> > about IOMMU.

Re: Enabling peer to peer device transactions for PCIe devices

2017-01-06 Thread Serguei Sagalovitch

On 2017-01-05 08:58 PM, Jerome Glisse wrote:

On Thu, Jan 05, 2017 at 05:30:34PM -0700, Jason Gunthorpe wrote:

On Thu, Jan 05, 2017 at 06:23:52PM -0500, Jerome Glisse wrote:


I still don't understand what you driving at - you've said in both
cases a user VMA exists.

In the former case no, there is no VMA directly but if you want one than
a device can provide one. But such VMA is useless as CPU access is not
expected.

I disagree it is useless, the VMA is going to be necessary to support
upcoming things like CAPI, you need it to support O_DIRECT from the
filesystem, DPDK, etc. This is why I am opposed to any model that is
not VMA based for setting up RDMA - that is shorted sighted and does
not seem to reflect where the industry is going.

So focus on having VMA backed by actual physical memory that covers
your GPU objects and ask how do we wire up the '__user *' to the DMA
API in the best way so the DMA API still has enough information to
setup IOMMUs and whatnot.

I am talking about 2 different thing. Existing hardware and API where you
_do not_ have a vma and you do not need one. This is just existing stuff.

I do not understand why you assume that existing API doesn't need one.
I would say that a lot of __existing__ user level API and their support
in kernel (especially outside of graphics domain) assumes that we have vma and
deal with __user * pointers.

Some close driver provide a functionality on top of this design. Question
is do we want to do the same ? If yes and you insist on having a vma we
could provide one but this is does not apply and is useless for where we
are going with new hardware.

With new hardware you just use malloc or mmap to allocate memory and then
you use it directly with the device. Device driver can migrate any part of
the process address space to device memory. In this scheme you have your
usual VMAs but there is nothing special about them.

Assuming that the whole device memory is CPU accessible and it looks
like the direction where we are going:
- You forgot about use case when we want or need to allocate memory
directly on device (why we need to migrate anything if not needed?).
- We may want to use CPU to access such memory on device to avoid
any unnecessary migration back.
- We may have more device memory than the system one.
E.g. if you have 12 GPUs w/64GB each it will already give us ~0.7 TB
not mentioning NVDIMM cards which could also be used as memory
storage for other device access.
- We also may want/need to share GPU memory between different
processes.

Now when you try to do get_user_page() on any page that is inside the
device it will fails because we do not allow any device memory to be pin.
There is various reasons for that and they are not going away in any hw
in the planing (so for next few years).

Still we do want to support peer to peer mapping. Plan is to only do so
with ODP capable hardware. Still we need to solve the IOMMU issue and
it needs special handling inside the RDMA device. The way it works is
that RDMA ask for a GPU page, GPU check if it has place inside its PCI
bar to map this page for the device, this can fail. If it succeed then
you need the IOMMU to let the RDMA device access the GPU PCI bar.

So here we have 2 orthogonal problem. First one is how to make 2 drivers
talks to each other to setup mapping to allow peer to peer and second is
about IOMMU.


I think that there is a third problem: a lot of existing user level APIs
(MPI, IB Verbs, file I/O, etc.) deal with pointers to the buffers.
Potentially it would be ideal to support use cases where those buffers are
located in device memory, avoiding any unnecessary migration /
double-buffering.

Currently a lot of infrastructure in the kernel assumes that this is a user
pointer and calls "get_user_pages" to get an s/g list. What is your opinion
on how it should be changed to deal with cases when the "buffer" is in
device memory?





Re: Enabling peer to peer device transactions for PCIe devices

2017-01-06 Thread Henrique Almeida
 Hello, I've been watching this thread not as a kernel developer, but
as a user interested in doing peer-to-peer access between a network
card and a GPU. I believe that merging raw direct access with vma
overcomplicates things for our use case. We'll have a very large
camera streaming data at high throughput (up to 100 Gbps) to the GPU,
which will operate in soft real time mode and write back the results
to RDMA enabled network storage. The CPU will only arrange the
connection between GPU and network card. Having things like paging or
memory overcommit is possible, but they are not required and they
might consistently decrease the quality of the data acquisition.

 I see my use case as something likely to exist for others, and a strong
reason to split the implementation in two.


2017-01-05 16:01 GMT-03:00 Jason Gunthorpe :
> On Thu, Jan 05, 2017 at 01:39:29PM -0500, Jerome Glisse wrote:
>
>>   1) peer-to-peer because of userspace specific API like NVidia GPU
>> direct (AMD is pushing its own similar API i just can't remember
>> marketing name). This does not happen through a vma, this happens
>> through specific device driver call going through device specific
>> ioctl on both side (GPU and RDMA). So both kernel driver are aware
>> of each others.
>
> Today you can only do user-initiated RDMA operations in conjection
> with a VMA.
>
> We'd need a really big and strong reason to create an entirely new
> non-VMA based memory handle scheme for RDMA.
>
> So my inclination is to just completely push back on this idea. You
> need a VMA to do RMA.
>
> GPUs need to create VMAs for the memory they want to RDMA from, even
> if the VMA handle just causes SIGBUS for any CPU access.
>
> Jason


Re: Enabling peer to peer device transactions for PCIe devices

2017-01-05 Thread Jerome Glisse
On Thu, Jan 05, 2017 at 05:30:34PM -0700, Jason Gunthorpe wrote:
> On Thu, Jan 05, 2017 at 06:23:52PM -0500, Jerome Glisse wrote:
> 
> > > I still don't understand what you driving at - you've said in both
> > > cases a user VMA exists.
> > 
> > In the former case no, there is no VMA directly but if you want one than
> > a device can provide one. But such VMA is useless as CPU access is not
> > expected.
> 
> I disagree it is useless, the VMA is going to be necessary to support
> upcoming things like CAPI, you need it to support O_DIRECT from the
> filesystem, DPDK, etc. This is why I am opposed to any model that is
> not VMA based for setting up RDMA - that is shorted sighted and does
> not seem to reflect where the industry is going.
> 
> So focus on having VMA backed by actual physical memory that covers
> your GPU objects and ask how do we wire up the '__user *' to the DMA
> API in the best way so the DMA API still has enough information to
> setup IOMMUs and whatnot.

I am talking about 2 different things. Existing hardware and APIs where you
_do not_ have a vma and you do not need one - this is just existing stuff.
Some closed drivers provide functionality on top of this design. The question
is: do we want to do the same? If yes, and you insist on having a vma, we
could provide one, but this does not apply and is useless for where we
are going with new hardware.

With new hardware you just use malloc or mmap to allocate memory and then
you use it directly with the device. Device driver can migrate any part of
the process address space to device memory. In this scheme you have your
usual VMAs but there is nothing special about them.

Now when you try to do get_user_pages() on any page that is inside the
device it will fail, because we do not allow any device memory to be pinned.
There are various reasons for that and they are not going away in any
hardware in the planning (so for the next few years).

Still, we do want to support peer to peer mapping. The plan is to only do so
with ODP capable hardware. We still need to solve the IOMMU issue, and
it needs special handling inside the RDMA device. The way it works is
that the RDMA driver asks for a GPU page, the GPU checks whether it has room
inside its PCI BAR to map this page for the device, and this can fail. If it
succeeds, then you need the IOMMU to let the RDMA device access the GPU PCI
BAR.

So here we have 2 orthogonal problems. The first one is how to make 2 drivers
talk to each other to set up a mapping to allow peer to peer, and the second
is about the IOMMU.
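
To make those two steps concrete, here is a hand-wavy sketch of the handshake
described above. Everything named gpu_peer_* is invented for the example (no
such kernel API exists); the only real call is dma_map_resource(), standing
in for the "let the IOMMU allow it" step. If either step fails, the fallback
is the usual migrate-back-to-system-memory path.

#include <linux/dma-mapping.h>
#include <linux/mm.h>

/* Invented interface: the GPU driver tries to expose one of its pages
 * through a BAR window and reports the resulting physical address. */
struct gpu_peer_ops {
        int (*map_to_bar)(void *gpu_obj, unsigned long offset,
                          phys_addr_t *bar_phys);      /* may fail: BAR full */
        void (*unmap_from_bar)(void *gpu_obj, unsigned long offset);
};

static int rdma_map_gpu_page(struct device *rdma_dev,
                             const struct gpu_peer_ops *ops, void *gpu_obj,
                             unsigned long offset, dma_addr_t *dma_addr)
{
        phys_addr_t bar_phys;
        int ret;

        /* Step 1: GPU driver places the page in its BAR, if it can. */
        ret = ops->map_to_bar(gpu_obj, offset, &bar_phys);
        if (ret)
                return ret;     /* caller falls back to system memory */

        /* Step 2: IOMMU mapping so the RDMA device may reach the BAR. */
        *dma_addr = dma_map_resource(rdma_dev, bar_phys, PAGE_SIZE,
                                     DMA_BIDIRECTIONAL, 0);
        if (dma_mapping_error(rdma_dev, *dma_addr)) {
                ops->unmap_from_bar(gpu_obj, offset);
                return -EIO;
        }
        return 0;
}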


> > What i was trying to get accross is that no matter what level you
> > consider in the end you still need something at the DMA API level.
> > And that the 2 different use case (device vma or regular vma) means
> > 2 differents API for the device driver.
> 
> I agree we need new stuff at the DMA API level, but I am opposed to
> the idea we need two API paths that the *driver* has to figure out.
> That is fundamentally not what I want as a driver developer.
> 
> Give me a common API to convert '__user *' to a scatter list and pin
> the pages. This needs to figure out your two cases. And Huge
> Pages. And ZONE_DIRECT.. (a better get_user_pages)

Pinning is not going to happen; like I said, it would hinder the GPU to the
point it would become useless.


> Give me an API to take the scatter list and DMA map it, handling all
> the stuff associated with peer-peer. (a better dma_map_sg)
> 
> Give me a notifier scheme to rework my scatter list when physical
> pages need to change (mmu notifiers)
> 
> Use the scatter list memory to convey needed information from the
> first step to the second.
> 
> Do not bother the driver with distinctions on what kind of memory is
> behind that VMA. Don't ask me to use get_user_pages or
> gpu_get_user_pages, do not ask me to use dma_map_sg or
> dma_map_sg_peer_direct. The Driver Doesn't Need To Know.

I understand you want it easy, but there must be a part that is aware, at the
very least the ODP logic. Creating a peer to peer mapping is a multi-step
process and some of those steps can fail. The fallback is always to migrate
back to system memory as a default path that cannot fail, except if we are
out of memory.


> IMHO this is why GPU direct is not mergable - it creates a crazy
> parallel mini-mm subsystem inside RDMA and uses that to connect to a
> GPU driver, everything is expected to have parallel paths for GPU
> direct and normal MM. No good at all.

Existing hardware and new hardware work differently. I am trying to
explain the two different designs needed for each one. You understandably
dislike the existing hardware, which has more stringent requirements,
cannot be supported transparently, and needs dedicated communication
between the two drivers.

New hardware has a completely different API in userspace. We can
decide to only support the latter and forget about the former.


> > > So, how do you identify these GPU objects? How do you expect RDMA
> > > convert them to scatter lists? How will ODP work?
> > 
> > No ODP on 


Re: Enabling peer to peer device transactions for PCIe devices

2017-01-05 Thread Serguei Sagalovitch

On 2017-01-05 07:30 PM, Jason Gunthorpe wrote:

> but I am opposed to
> the idea we need two API paths that the *driver* has to figure out.
> That is fundamentally not what I want as a driver developer.
> 
> Give me a common API to convert '__user *' to a scatter list and pin
> the pages.

Completely agreed. IMHO there is no sense in duplicating the same logic
everywhere, or in trying to find the places where it is missing.

Sincerely yours,
Serguei Sagalovitch



Re: Enabling peer to peer device transactions for PCIe devices

2017-01-05 Thread Jason Gunthorpe
On Thu, Jan 05, 2017 at 06:23:52PM -0500, Jerome Glisse wrote:

> > I still don't understand what you driving at - you've said in both
> > cases a user VMA exists.
> 
> In the former case no, there is no VMA directly but if you want one than
> a device can provide one. But such VMA is useless as CPU access is not
> expected.

I disagree that it is useless: the VMA is going to be necessary to support
upcoming things like CAPI, and you need it to support O_DIRECT from the
filesystem, DPDK, etc. This is why I am opposed to any model that is
not VMA based for setting up RDMA - that is short sighted and does
not seem to reflect where the industry is going.

So focus on having VMA backed by actual physical memory that covers
your GPU objects and ask how do we wire up the '__user *' to the DMA
API in the best way so the DMA API still has enough information to
setup IOMMUs and whatnot.

> What i was trying to get accross is that no matter what level you
> consider in the end you still need something at the DMA API level.
> And that the 2 different use case (device vma or regular vma) means
> 2 differents API for the device driver.

I agree we need new stuff at the DMA API level, but I am opposed to
the idea we need two API paths that the *driver* has to figure out.
That is fundamentally not what I want as a driver developer.

Give me a common API to convert '__user *' to a scatter list and pin
the pages. This needs to figure out your two cases. And Huge
Pages. And ZONE_DEVICE.. (a better get_user_pages)

Give me an API to take the scatter list and DMA map it, handling all
the stuff associated with peer-peer. (a better dma_map_sg)

Give me a notifier scheme to rework my scatter list when physical
pages need to change (mmu notifiers)

Use the scatter list memory to convey needed information from the
first step to the second.

Do not bother the driver with distinctions on what kind of memory is
behind that VMA. Don't ask me to use get_user_pages or
gpu_get_user_pages, do not ask me to use dma_map_sg or
dma_map_sg_peer_direct. The Driver Doesn't Need To Know.
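
To make that request concrete, here is a rough sketch of the flow a driver
has to open-code today and that such a unified helper would subsume. It is
built from real interfaces (get_user_pages_fast(), sg_alloc_table_from_pages(),
dma_map_sg()), but their exact signatures vary across kernel versions, so
treat it as an illustrative sketch rather than a drop-in implementation:

    /* assumes <linux/mm.h>, <linux/scatterlist.h>, <linux/dma-mapping.h>,
     * <linux/slab.h>; error handling trimmed to the essentials */
    static int map_user_buffer(struct device *dev, unsigned long uaddr,
                               size_t len, struct sg_table *sgt)
    {
            unsigned int off = offset_in_page(uaddr);
            unsigned int npages = DIV_ROUND_UP(off + len, PAGE_SIZE);
            struct page **pages;
            int got, ret;

            pages = kcalloc(npages, sizeof(*pages), GFP_KERNEL);
            if (!pages)
                    return -ENOMEM;

            /* 1) '__user *' -> pinned struct pages (this is the step that
             *    fails on device memory and needs huge page awareness) */
            got = get_user_pages_fast(uaddr, npages, 1, pages);
            if (got < 0) {
                    ret = got;
                    goto out_free;
            }
            if (got != npages) {
                    ret = -EFAULT;
                    goto out_put;
            }

            /* 2) pinned pages -> scatterlist */
            ret = sg_alloc_table_from_pages(sgt, pages, got, off, len,
                                            GFP_KERNEL);
            if (ret)
                    goto out_put;

            /* 3) scatterlist -> bus addresses; a peer-aware dma_map_sg()
             *    would hide the p2p and IOMMU special cases right here */
            if (!dma_map_sg(dev, sgt->sgl, sgt->orig_nents,
                            DMA_BIDIRECTIONAL)) {
                    ret = -EIO;
                    sg_free_table(sgt);
                    goto out_put;
            }

            kfree(pages);
            return 0;

    out_put:
            while (got > 0)
                    put_page(pages[--got]);
    out_free:
            kfree(pages);
            return ret;
    }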

IMHO this is why GPU direct is not mergeable - it creates a crazy
parallel mini-mm subsystem inside RDMA and uses that to connect to a
GPU driver, and everything is expected to have parallel paths for GPU
direct and normal MM. No good at all.

> > So, how do you identify these GPU objects? How do you expect RDMA
> > convert them to scatter lists? How will ODP work?
> 
> No ODP on those. If you want vma, the GPU device driver can provide

You said you needed invalidate, that has to be done via ODP.

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2017-01-05 Thread Jerome Glisse
On Thu, Jan 05, 2017 at 03:42:15PM -0700, Jason Gunthorpe wrote:
> On Thu, Jan 05, 2017 at 03:19:36PM -0500, Jerome Glisse wrote:
> 
> > > Always having a VMA changes the discussion - the question is how to
> > > create a VMA that reprensents IO device memory, and how do DMA
> > > consumers extract the correct information from that VMA to pass to the
> > > kernel DMA API so it can setup peer-peer DMA.
> > 
> > Well my point is that it can't be. In HMM case inside a single VMA
> > you
> [..]
> 
> > In the GPUDirect case the idea is that you have a specific device vma
> > that you map for peer to peer.
> 
> [..]
> 
> I still don't understand what you driving at - you've said in both
> cases a user VMA exists.

In the former case, no, there is no VMA directly, but if you want one then
a device can provide one. Such a VMA is useless though, as CPU access is not
expected.

> 
> From my perspective in RDMA, all I want is a core kernel flow to
> convert a '__user *' into a scatter list of DMA addresses, that works no
> matter what is backing that VMA, be it HMM, a 'hidden' GPU object, or
> struct page memory.
> 
> A '__user *' pointer is the only way to setup a RDMA MR, and I see no
> reason to have another API at this time.
> 
> The details of how to translate to a scatter list are a MM subject,
> and the MM folks need to get 
> 
> I just don't care if that routine works at a page level, or a whole
> VMA level, or some combination of both, that is up to the MM team to
> figure out :)

And that's what I am trying to get across. There are 2 cases here.
What exists on today's hardware, things like GPU direct, works at the
VMA level. Versus where some new hardware is going, where we want to do
things at the page level. Both require different APIs at different levels.

What I was trying to get across is that no matter what level you
consider, in the end you still need something at the DMA API level.
And the 2 different use cases (device vma or regular vma) mean
2 different APIs for the device driver.

> 
> > a page level. Expectation here is that the GPU userspace expose a special
> > API to allow RDMA to directly happen on GPU object allocated through
> > GPU specific API (ie it is not regular memory and it is not accessible
> > by CPU).
> 
> So, how do you identify these GPU objects? How do you expect RDMA
> convert them to scatter lists? How will ODP work?

No ODP on those. If you want a vma, the GPU device driver can provide
one. GPU objects are disjoint from regular memory (which comes from some
form of mmap). They are created through ioctl and in many cases are
never exposed to the CPU. They only exist inside the GPU driver realm.

Nonetheless there are use cases where exchanging those objects across
computers over a network makes sense. I am not an end user here :)


> > > We have MMU notifiers to handle this today in RDMA. Async RDMA MR
> > > Invalidate like you see in the above out of tree patches is totally
> > > crazy and shouldn't be in mainline. Use ODP capable RDMA hardware.
> > 
> > Well there is still a large base of hardware that do not have such
> > feature and some people would like to be able to keep using those.
> 
> Hopefully someone will figure out how to do that without the crazy
> async MR invalidation.

Personally I don't care too much about this old hardware and thus I am
fine without supporting it. The open source userspace is playing
catch-up, and doing features for old hardware probably does not make sense.

Cheers,
Jérôme


Re: Enabling peer to peer device transactions for PCIe devices

2017-01-05 Thread Jason Gunthorpe
On Thu, Jan 05, 2017 at 03:19:36PM -0500, Jerome Glisse wrote:

> > Always having a VMA changes the discussion - the question is how to
> > create a VMA that reprensents IO device memory, and how do DMA
> > consumers extract the correct information from that VMA to pass to the
> > kernel DMA API so it can setup peer-peer DMA.
> 
> Well my point is that it can't be. In HMM case inside a single VMA
> you
[..]

> In the GPUDirect case the idea is that you have a specific device vma
> that you map for peer to peer.

[..]

I still don't understand what you driving at - you've said in both
cases a user VMA exists.

From my perspective in RDMA, all I want is a core kernel flow to
convert a '__user *' into a scatter list of DMA addresses, that works no
matter what is backing that VMA, be it HMM, a 'hidden' GPU object, or
struct page memory.

A '__user *' pointer is the only way to setup a RDMA MR, and I see no
reason to have another API at this time.

The details of how to translate to a scatter list are a MM subject,
and the MM folks need to get 

I just don't care if that routine works at a page level, or a whole
VMA level, or some combination of both, that is up to the MM team to
figure out :)

> a page level. Expectation here is that the GPU userspace expose a special
> API to allow RDMA to directly happen on GPU object allocated through
> GPU specific API (ie it is not regular memory and it is not accessible
> by CPU).

So, how do you identify these GPU objects? How do you expect RDMA
convert them to scatter lists? How will ODP work?

> > We have MMU notifiers to handle this today in RDMA. Async RDMA MR
> > Invalidate like you see in the above out of tree patches is totally
> > crazy and shouldn't be in mainline. Use ODP capable RDMA hardware.
> 
> Well there is still a large base of hardware that do not have such
> feature and some people would like to be able to keep using those.

Hopefully someone will figure out how to do that without the crazy
async MR invalidation.

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2017-01-05 Thread Jerome Glisse
On Thu, Jan 05, 2017 at 01:07:19PM -0700, Jason Gunthorpe wrote:
> On Thu, Jan 05, 2017 at 02:54:24PM -0500, Jerome Glisse wrote:
> 
> > Mellanox and NVidia support peer to peer with what they market a
> > GPUDirect. It only works without IOMMU. It is probably not upstream :
> > 
> > https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg21402.html
> > 
> > I thought it was but it seems it require an out of tree driver to work.
> 
> Right, it is out of tree and not under consideration for mainline.
> 
> > Wether there is a vma or not isn't important to the issue anyway. If
> > you want to enforce VMA rule for RDMA it is an RDMA specific discussion
> > in which i don't want to be involve, it is not my turf :)
> 
> Always having a VMA changes the discussion - the question is how to
> create a VMA that reprensents IO device memory, and how do DMA
> consumers extract the correct information from that VMA to pass to the
> kernel DMA API so it can setup peer-peer DMA.

Well, my point is that it can't be. In the HMM case, inside a single VMA you
can have one page inside GPU memory at address A but the next page inside
regular memory at A+4k. So handling this at the VMA level does not make
sense. In this case you would get the device from the struct page
and you would query, through a common API, whether you can do peer
to peer. If not, it would trigger migration back to regular memory.
If yes, then you still have to solve the IOMMU issue, and hence the DMA
API changes that were proposed.
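
A sketch of that per-page decision, assuming is_zone_device_page() and
page->pgmap (which are real) plus placeholder helpers and return codes
(peer_map_allowed(), migrate_back_to_ram(), the *_PATH constants) standing
in for the common API being discussed:

    static int decide_peer_or_migrate(struct device *importer,
                                      struct page *page)
    {
            if (!is_zone_device_page(page))
                    return USE_NORMAL_DMA_PATH;     /* placeholder constant */

            /* the owning device is reachable via the page's dev_pagemap */
            if (peer_map_allowed(importer, page->pgmap))  /* placeholder */
                    return SETUP_PEER_MAPPING;  /* then IOMMU/DMA API step */

            return migrate_back_to_ram(page);   /* placeholder fallback */
    }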

In the GPUDirect case the idea is that you have a specific device vma
that you map for peer to peer. Here things can be at the vma level and not at
a page level. The expectation here is that the GPU userspace exposes a special
API to allow RDMA to happen directly on GPU objects allocated through a
GPU specific API (ie it is not regular memory and it is not accessible
by the CPU).


Both cases are disjoint. Both cases need to solve the IOMMU issue, which
seems best solved at the DMA API level.


> > What matter is the back channel API between peer-to-peer device. Like
> > the above patchset points out for GPU we need to be able to invalidate
> > a mapping at any point in time. Pining is not something we want to
> > live with.
> 
> We have MMU notifiers to handle this today in RDMA. Async RDMA MR
> Invalidate like you see in the above out of tree patches is totally
> crazy and shouldn't be in mainline. Use ODP capable RDMA hardware.

Well, there is still a large base of hardware that does not have such a
feature, and some people would like to be able to keep using it.
I believe allowing direct access to GPU objects that are otherwise
hidden from regular kernel memory management is still meaningful.

Cheers,
Jérôme



Re: Enabling peer to peer device transactions for PCIe devices

2017-01-05 Thread Jason Gunthorpe
On Thu, Jan 05, 2017 at 02:54:24PM -0500, Jerome Glisse wrote:

> Mellanox and NVidia support peer to peer with what they market a
> GPUDirect. It only works without IOMMU. It is probably not upstream :
> 
> https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg21402.html
> 
> I thought it was but it seems it require an out of tree driver to work.

Right, it is out of tree and not under consideration for mainline.

> Wether there is a vma or not isn't important to the issue anyway. If
> you want to enforce VMA rule for RDMA it is an RDMA specific discussion
> in which i don't want to be involve, it is not my turf :)

Always having a VMA changes the discussion - the question is how to
create a VMA that represents IO device memory, and how DMA
consumers extract the correct information from that VMA to pass to the
kernel DMA API so it can set up peer-peer DMA.

> What matter is the back channel API between peer-to-peer device. Like
> the above patchset points out for GPU we need to be able to invalidate
> a mapping at any point in time. Pining is not something we want to
> live with.

We have MMU notifiers to handle this today in RDMA. Async RDMA MR
Invalidate like you see in the above out of tree patches is totally
crazy and shouldn't be in mainline. Use ODP capable RDMA hardware.
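
For reference, a minimal sketch of how a driver wires up those MMU
notifiers. The callback prototype shown matches kernels of that era and has
changed since; struct my_mr and my_mr_invalidate() are assumed driver-side
names, not existing code:

    static void my_invalidate_range_start(struct mmu_notifier *mn,
                                          struct mm_struct *mm,
                                          unsigned long start,
                                          unsigned long end)
    {
            struct my_mr *mr = container_of(mn, struct my_mr, mn);

            /* quiesce device access and rework the scatterlist covering
             * [start, end) before the pages are allowed to change */
            my_mr_invalidate(mr, start, end);
    }

    static const struct mmu_notifier_ops my_mn_ops = {
            .invalidate_range_start = my_invalidate_range_start,
    };

    /* at MR registration time */
    mr->mn.ops = &my_mn_ops;
    mmu_notifier_register(&mr->mn, current->mm);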

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2017-01-05 Thread Jerome Glisse
On Thu, Jan 05, 2017 at 12:01:13PM -0700, Jason Gunthorpe wrote:
> On Thu, Jan 05, 2017 at 01:39:29PM -0500, Jerome Glisse wrote:
> 
> >   1) peer-to-peer because of userspace specific API like NVidia GPU
> > direct (AMD is pushing its own similar API i just can't remember
> > marketing name). This does not happen through a vma, this happens
> > through specific device driver call going through device specific
> > ioctl on both side (GPU and RDMA). So both kernel driver are aware
> > of each others.
> 
> Today you can only do user-initiated RDMA operations in conjection
> with a VMA.
> 
> We'd need a really big and strong reason to create an entirely new
> non-VMA based memory handle scheme for RDMA.
> 
> So my inclination is to just completely push back on this idea. You
> need a VMA to do RMA.
> 
> GPUs need to create VMAs for the memory they want to RDMA from, even
> if the VMA handle just causes SIGBUS for any CPU access.

Mellanox and NVidia support peer to peer with what they market as
GPUDirect. It only works without an IOMMU. It is probably not upstream:

https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg21402.html

I thought it was, but it seems it requires an out of tree driver to work.

Whether there is a vma or not isn't important to the issue anyway. If
you want to enforce a VMA rule for RDMA, that is an RDMA specific discussion
in which I don't want to be involved; it is not my turf :)

What matters is the back channel API between peer-to-peer devices. Like
the above patchset points out, for GPUs we need to be able to invalidate
a mapping at any point in time. Pinning is not something we want to
live with.

So the VMA consideration does not change what I was saying; there are
2 cases:
  1) device vma (might be restricted to specific userspace API)
  2) regular vma (!VM_MIXED and no special pte entry)

For 1) you need a back channel; it can be per device driver, or we can agree
on some common API that could be added to vm_operations_struct (a rough
sketch follows below).

For 2) the expectation is that you will have a valid struct page but you
still need special handling at the DMA API level.

In 1) the peer-to-peer mapping is tracked at the vma level and mediated
there. For 2) it is per page and it is mediated at that level.

In both cases, once you have set up the mapping, you need to handle the IOMMU
and any PCI bridge restrictions that might apply, and I believe that the DMA
API is the place where we want to solve that second side of the problem.
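
A very rough sketch of what such a vm_operations_struct hook could look
like. The ->peer_map()/->peer_unmap() members below do not exist today;
they are only meant to show the shape of a common back-channel for case 1:

    struct peer_mapping;        /* opaque, owned by the exporting driver */

    /* hypothetical additions to vm_operations_struct */
    struct vm_operations_struct {
            /* ... existing callbacks: open, close, fault, ... */

            /* importer (e.g. RDMA) asks the exporting driver behind this
             * vma for a peer mapping of [addr, addr + len) */
            struct peer_mapping *(*peer_map)(struct vm_area_struct *vma,
                                             unsigned long addr, size_t len,
                                             struct device *importer);

            /* exporter revokes it, e.g. before migrating the backing
             * memory; the importer must stop DMA before this returns */
            void (*peer_unmap)(struct vm_area_struct *vma,
                               struct peer_mapping *map);
    };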

Cheers,
Jérôme


Re: Enabling peer to peer device transactions for PCIe devices

2017-01-05 Thread Jason Gunthorpe
On Thu, Jan 05, 2017 at 01:39:29PM -0500, Jerome Glisse wrote:

>   1) peer-to-peer because of userspace specific API like NVidia GPU
> direct (AMD is pushing its own similar API i just can't remember
> marketing name). This does not happen through a vma, this happens
> through specific device driver call going through device specific
> ioctl on both side (GPU and RDMA). So both kernel driver are aware
> of each others.

Today you can only do user-initiated RDMA operations in conjunction
with a VMA.

We'd need a really big and strong reason to create an entirely new
non-VMA based memory handle scheme for RDMA.

So my inclination is to just completely push back on this idea. You
need a VMA to do RMA.

GPUs need to create VMAs for the memory they want to RDMA from, even
if the VMA handle just causes SIGBUS for any CPU access.
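
For what it's worth, that "SIGBUS on CPU access" VMA is only a few lines in
the GPU driver. A minimal sketch (the fault handler prototype below matches
pre-4.11 kernels; newer ones take only a struct vm_fault *):

    static int gpu_obj_cpu_fault(struct vm_area_struct *vma,
                                 struct vm_fault *vmf)
    {
            /* the object lives in GPU memory; CPU access is not supported */
            return VM_FAULT_SIGBUS;
    }

    static const struct vm_operations_struct gpu_obj_vm_ops = {
            .fault = gpu_obj_cpu_fault,
    };

    static int gpu_obj_mmap(struct file *file, struct vm_area_struct *vma)
    {
            vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
            vma->vm_ops = &gpu_obj_vm_ops;
            return 0;   /* RDMA can still identify the object via this VMA */
    }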

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2017-01-05 Thread Jerome Glisse
Sorry to revive this thread, but it fell through my filters and I
missed it. I have been going through it and I think the discussion
has been hindered by the fact that distinct problems were merged while
they should be addressed separately.

First, for peer-to-peer we need to be clear on how this happens. Two
cases here :
  1) peer-to-peer because of a userspace specific API like NVidia GPU
direct (AMD is pushing its own similar API, I just can't remember the
marketing name). This does not happen through a vma; it happens
through specific device driver calls going through device specific
ioctls on both sides (GPU and RDMA). So both kernel drivers are aware
of each other.
  2) peer-to-peer because the RDMA device is trying to access a regular
vma (ie nothing special: either private anonymous or shared memory, or
an mmap of a regular file, not a device file).

For 1) there is no need to over complicate things. The device drivers must
have a back-channel between them and must be able to invalidate their
respective mappings (ie the GPU must be able to ask the RDMA device to
kill/stop its MR).

So the remaining issue for 1) is how to enable effective peer-to-peer
mappings, given that it might not work reliably on all platforms. Here
Alex was listing the existing proposals:
  A P2P DMA - DMA-API/PCI map_peer_resource support for peer-to-peer
http://www.spinics.net/lists/linux-pci/msg44560.html
  B ZONE_DEVICE IO - Direct I/O and DMA for persistent memory
https://lwn.net/Articles/672457/
  C DMA-BUF - RDMA subsystem DMA-BUF support
http://www.spinics.net/lists/linux-rdma/msg38748.html
  D iopmem - iopmem: A block device for PCIe memory
https://lwn.net/Articles/703895/
  E HMM (not interesting for case 1)
  F Something new

Of the above, D is ill suited for GPUs, as we do not want to pin
GPU memory and D is designed around long lived objects that do not move.
Also I do not think that exposing a device PCIe BAR through a new
/dev/somefilename is a good idea for GPUs. So I think this should
be discarded.

HMM should be discarded with respect to case 1 too. It is useful for
case 2. I don't think dma-buf is the right path either.

So I think only A and B make sense. Now for use
case 1 I think A is the best solution. There is no need to have struct page,
and it requires explicit knowledge in the device driver that it is
mapping another device's memory, which is a given in use case 1.


If we look at case 2 the situation is a bit more complex. Here RDMA
is just trying to access a regular VMA, but it might happen that
some memory inside that VMA resides inside device memory. When
that happens we would like to avoid moving that memory back to
system memory, assuming that a peer mapping is doable.

Use case 2 assumes that the GPU is on a platform with CAPI or
CCTX (or something similar), in which case it is easy, as device
memory will have struct page, is always accessible by the CPU, and
device to device access is transparent (AFAICT).

So we are left with platforms that do not have proper support for
device memory (ie the CPU cannot access it the same as DDR, or only has
limited access). Which applies to x86 for the foreseeable future.

This is the problem HMM addresses: allowing device memory to be used
transparently inside a process even if direct CPU access is not
permitted. I plan to support peer-to-peer with HMM because
it is an important use case. The idea is to have the device driver
fault against the ZONE_DEVICE page and communicate through a common API
to establish the mapping. HMM will only handle keeping track of device
to device mappings and allowing such mappings to be invalidated at any
time so that memory can be migrated.

I do not intend to solve the IOMMU side of the problem, or even
the PCI hierarchy issue where you can't do peer-to-peer between devices
across some PCI bridges. I believe this is an orthogonal problem
and that it is best solved inside the DMA API, ie with solution A.


I do not think we should try to solve all the problems with a
common solution. They are too disparate in capabilities (what
the hardware can and can't do).

From my point of view there are a few take aways:
  - a device should only access regular vmas
  - a device should never try to access a vma that points to another
device (mmap of any file in /dev)
  - peer to peer access through a dedicated userspace API must
involve a dedicated API between the kernel drivers taking part in
the peer to peer access
  - peer to peer on a regular vma must involve a common API for
drivers to interact, so no driver can block the other


So I think the DMA-API proposal is the one to pursue, and the other
problems relating to handling GPU memory and how to use it are a
different kind of problem. One with either a hardware solution
(CAPI, CCTX, ...) or a software solution (HMM so far).

I don't think we should conflate the 2 problems into one. Anyway,
I think this is something worth discussing face to face
with the interested parties to flesh out a solution (can be at LSF/MM
or in another forum).

Cheers,
Jérôme


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-06 Thread Dan Williams
On Tue, Dec 6, 2016 at 1:47 PM, Logan Gunthorpe  wrote:
> Hey,
>
>> Okay, so clearly this needs a kernel side NVMe specific allocator
>> and locking so users don't step on each other..
>
> Yup, ideally. That's why device dax isn't ideal for this application: it
> doesn't provide any way to prevent users from stepping on each other.

On this particular point I'm in the process of posting patches that
allow device-dax sub-division, so you could carve up a bar into
multiple devices of various sizes.


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-06 Thread Logan Gunthorpe
Hey,

> Okay, so clearly this needs a kernel side NVMe specific allocator
> and locking so users don't step on each other..

Yup, ideally. That's why device dax isn't ideal for this application: it
doesn't provide any way to prevent users from stepping on each other.

> Or as Christoph says some kind of general mechanism to get these
> bounce buffers..

Yeah, I imagine a general allocate from BAR/region system would be very
useful.

> Ah, I see.
> 
> As a first draft I'd stick with some kind of API built into the
> /dev/nvmeX that backs the filesystem. The user app would fstat the
> target file, open /dev/block/MAJOR(st_dev):MINOR(st_dev), do some
> ioctl to get a CMB mmap, and then proceed from there..
> 
> When that is all working kernel-side, it would make sense to look at a
> more general mechanism that could be used unprivileged??

That makes a lot of sense to me. I suggested mmapping the char device
because it's really easy, but I can see that an ioctl on the block
device does seem more general and device agnostic.

> This is similar to the GPU issues too.. On NVMe you don't need to pin
> the pages, you just need to lock that VMA so it doesn't get freed from
> the NVMe CMB allocator while the IO is running...
> Probably in the long run the get_user_pages is going to have to be
> pushed down into drivers.. Future MMU coherent IO hardware also does
> not need the pinning or other overheads.

Yup. Yup.

Logan


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-06 Thread Jason Gunthorpe
On Tue, Dec 06, 2016 at 09:51:15AM -0700, Logan Gunthorpe wrote:
> Hey,
> 
> On 06/12/16 09:38 AM, Jason Gunthorpe wrote:
> >>> I'm not opposed to mapping /dev/nvmeX.  However, the lookup is trivial
> >>> to accomplish in sysfs through /sys/dev/char to find the sysfs path of the
> >>> device-dax instance under the nvme device, or if you already have the nvme
> >>> sysfs path the dax instance(s) will appear under the "dax" sub-directory.
> >>
> >> Personally I think mapping the dax resource in the sysfs tree is a nice
> >> way to do this and a bit more intuitive than mapping a /dev/nvmeX.
> > 
> > It is still not at all clear to me what userpsace is supposed to do
> > with this on nvme.. How is the CMB usable from userspace?
> 
> The flow is pretty simple. For example to write to NVMe from an RDMA device:
> 
> 1) Obtain a chunk of the CMB to use as a buffer(either by mmaping
> /dev/nvmx, the device dax char device or through a block layer interface
> (which sounds like a good suggestion from Christoph, but I'm not really
> sure how it would look).

Okay, so clearly this needs a kernel side NVMe specific allocator
and locking so users don't step on each other..

Or as Christoph says some kind of general mechanism to get these
bounce buffers..

> 2) Create an MR with the buffer and use an RDMA function to fill it with
> data from a remote host. This will cause the RDMA hardware to write
> directly to the memory in the NVMe card.
> 
> 3) Using O_DIRECT, write the buffer to a file on the NVMe filesystem.
> When the address reaches hardware the NVMe will recognize it as local
> memory and copy it directly there.

Ah, I see.

As a first draft I'd stick with some kind of API built into the
/dev/nvmeX that backs the filesystem. The user app would fstat the
target file, open /dev/block/MAJOR(st_dev):MINOR(st_dev), do some
ioctl to get a CMB mmap, and then proceed from there..

When that is all working kernel-side, it would make sense to look at a
more general mechanism that could be used unprivileged??
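
Roughly, the userspace side of that flow could look like the sketch below.
NVME_IOCTL_MAP_CMB and the fixed 64 KiB chunk are invented for the example
(no such ioctl exists today); fstat(), the /dev/block/major:minor node,
mmap(), ibv_reg_mr() and the O_DIRECT write are the standard pieces, with
error handling trimmed:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/sysmacros.h>
    #include <unistd.h>
    #include <infiniband/verbs.h>

    #define CMB_LEN (64 * 1024)

    int write_via_cmb(const char *path, struct ibv_pd *pd)
    {
            struct stat st;
            char blkdev[64];
            int fd = open(path, O_WRONLY | O_DIRECT);

            if (fd < 0 || fstat(fd, &st) < 0)
                    return -1;

            /* 1) find the backing block device and get a CMB chunk */
            snprintf(blkdev, sizeof(blkdev), "/dev/block/%u:%u",
                     major(st.st_dev), minor(st.st_dev));
            int nvme_fd = open(blkdev, O_RDWR);
            /* hypothetical ioctl returning an mmap-able CMB offset */
            off_t off = ioctl(nvme_fd, NVME_IOCTL_MAP_CMB, CMB_LEN);
            void *buf = mmap(NULL, CMB_LEN, PROT_READ | PROT_WRITE,
                             MAP_SHARED, nvme_fd, off);

            /* 2) register the CMB chunk as an RDMA MR so remote writes
             *    land directly in the NVMe card's memory */
            struct ibv_mr *mr = ibv_reg_mr(pd, buf, CMB_LEN,
                                           IBV_ACCESS_LOCAL_WRITE |
                                           IBV_ACCESS_REMOTE_WRITE);

            /* ... RDMA transfer into buf happens here ... */

            /* 3) O_DIRECT write: the NVMe recognizes its own CMB address,
             *    so the data never bounces through system memory */
            ssize_t n = write(fd, buf, CMB_LEN);

            ibv_dereg_mr(mr);
            munmap(buf, CMB_LEN);
            close(nvme_fd);
            close(fd);
            return n == CMB_LEN ? 0 : -1;
    }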

> Thus we are able to transfer data to any file on an NVMe device without
> going through system memory. This has benefits on systems with lots of
> activity in system memory but step 3 is likely to be slowish due to the
> need to pin/unpin the memory for every transaction.

This is similar to the GPU issues too.. On NVMe you don't need to pin
the pages, you just need to lock that VMA so it doesn't get freed from
the NVMe CMB allocator while the IO is running...

Probably in the long run the get_user_pages is going to have to be
pushed down into drivers.. Future MMU coherent IO hardware also does
not need the pinning or other overheads.

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-06 Thread Christoph Hellwig
On Tue, Dec 06, 2016 at 09:38:50AM -0700, Jason Gunthorpe wrote:
> > > I'm not opposed to mapping /dev/nvmeX.  However, the lookup is trivial
> > > to accomplish in sysfs through /sys/dev/char to find the sysfs path of the
> > > device-dax instance under the nvme device, or if you already have the nvme
> > > sysfs path the dax instance(s) will appear under the "dax" sub-directory.
> > 
> > Personally I think mapping the dax resource in the sysfs tree is a nice
> > way to do this and a bit more intuitive than mapping a /dev/nvmeX.
> 
> It is still not at all clear to me what userspace is supposed to do
> with this on nvme.. How is the CMB usable from userspace?

I don't think trying to expose it to userspace makes any sense.
Exposing it to in-kernel storage targets on the other hand makes a lot
of sense.


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-06 Thread Logan Gunthorpe
Hey,

On 06/12/16 09:38 AM, Jason Gunthorpe wrote:
>>> I'm not opposed to mapping /dev/nvmeX.  However, the lookup is trivial
>>> to accomplish in sysfs through /sys/dev/char to find the sysfs path of the
>>> device-dax instance under the nvme device, or if you already have the nvme
>>> sysfs path the dax instance(s) will appear under the "dax" sub-directory.
>>
>> Personally I think mapping the dax resource in the sysfs tree is a nice
>> way to do this and a bit more intuitive than mapping a /dev/nvmeX.
> 
> It is still not at all clear to me what userspace is supposed to do
> with this on nvme.. How is the CMB usable from userspace?

The flow is pretty simple. For example to write to NVMe from an RDMA device:

1) Obtain a chunk of the CMB to use as a buffer (either by mmapping
/dev/nvmeX, the device dax char device or through a block layer interface,
which sounds like a good suggestion from Christoph, but I'm not really
sure how it would look).

2) Create an MR with the buffer and use an RDMA function to fill it with
data from a remote host. This will cause the RDMA hardware to write
directly to the memory in the NVMe card.

3) Using O_DIRECT, write the buffer to a file on the NVMe filesystem.
When the address reaches hardware the NVMe will recognize it as local
memory and copy it directly there.

Thus we are able to transfer data to any file on an NVMe device without
going through system memory. This has benefits on systems with lots of
activity in system memory but step 3 is likely to be slowish due to the
need to pin/unpin the memory for every transaction.
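
A rough userspace sketch of those three steps, assuming the proposed mmap
support on /dev/nvme0 for CMB allocation (only a suggestion at this point),
with the RDMA work request itself elided:

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <infiniband/verbs.h>
#include <sys/mman.h>
#include <unistd.h>

int cmb_to_file(struct ibv_pd *pd, const char *path, size_t len)
{
	/* 1) obtain a chunk of the CMB to use as a buffer */
	int nvme_fd = open("/dev/nvme0", O_RDWR);
	void *cmb = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			 nvme_fd, 0);
	if (cmb == MAP_FAILED)
		return -1;

	/* 2) register the CMB buffer as an MR; an RDMA operation against this
	 *    MR makes the HCA write directly into the NVMe BAR memory */
	struct ibv_mr *mr = ibv_reg_mr(pd, cmb, len,
				       IBV_ACCESS_LOCAL_WRITE |
				       IBV_ACCESS_REMOTE_WRITE);
	if (!mr)
		return -1;
	/* ... post the RDMA work request and wait for its completion ... */

	/* 3) O_DIRECT write of the buffer to a file on the NVMe namespace;
	 *    len is assumed block-aligned here as O_DIRECT requires */
	int file_fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
	ssize_t ret = pwrite(file_fd, cmb, len, 0);

	ibv_dereg_mr(mr);
	munmap(cmb, len);
	close(file_fd);
	close(nvme_fd);
	return ret == (ssize_t)len ? 0 : -1;
}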

Logan



Re: Enabling peer to peer device transactions for PCIe devices

2016-12-06 Thread Jason Gunthorpe
> > I'm not opposed to mapping /dev/nvmeX.  However, the lookup is trivial
> > to accomplish in sysfs through /sys/dev/char to find the sysfs path of the
> > device-dax instance under the nvme device, or if you already have the nvme
> > sysfs path the dax instance(s) will appear under the "dax" sub-directory.
> 
> Personally I think mapping the dax resource in the sysfs tree is a nice
> way to do this and a bit more intuitive than mapping a /dev/nvmeX.

It is still not at all clear to me what userspace is supposed to do
with this on nvme.. How is the CMB usable from userspace?

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-06 Thread Stephen Bates
>>> I've already recommended that iopmem not be a block device and
>>> instead be a device-dax instance. I also don't think it should claim
>>> the PCI ID, rather the driver that wants to map one of its bars this
>>> way can register the memory region with the device-dax core.
>>>
>>> I'm not sure there are enough device drivers that want to do this to
>>> have it be a generic /sys/.../resource_dmableX capability. It still
>>> seems to be an exotic one-off type of configuration.
>>
>>
>> Yes, this is essentially my thinking. Except I think the userspace
>> interface should really depend on the device itself. Device dax is a
>> good  choice for many and I agree the block device approach wouldn't be
>> ideal.

I tend to agree here. The block device interface has seen quite a bit of
resistance and /dev/dax looks like a better approach for most. We can look
at doing it that way in v2.

>>
>> Specifically for NVME CMB: I think it would make a lot of sense to just
>> hand out these mappings with an mmap call on /dev/nvmeX. I expect CMB
>> buffers would be volatile and thus you wouldn't need to keep track of
>> where in the BAR the region came from. Thus, the mmap call would just be
>> an allocator from BAR memory. If device-dax were used, userspace would
>> need to lookup which device-dax instance corresponds to which nvme
>> drive.
>>
>
> I'm not opposed to mapping /dev/nvmeX.  However, the lookup is trivial
> to accomplish in sysfs through /sys/dev/char to find the sysfs path of the
> device-dax instance under the nvme device, or if you already have the nvme
> sysfs path the dax instance(s) will appear under the "dax" sub-directory.
>

Personally I think mapping the dax resource in the sysfs tree is a nice
way to do this and a bit more intuitive than mapping a /dev/nvmeX.




Re: Enabling peer to peer device transactions for PCIe devices

2016-12-05 Thread Christoph Hellwig
On Mon, Dec 05, 2016 at 12:46:14PM -0700, Jason Gunthorpe wrote:
> In any event the allocator still needs to track which regions are in
> use and be able to hook 'free' from userspace. That does suggest it
> should be integrated into the nvme driver and not a bolt on driver..

Two totally different use cases:

 - a card that directly exposes byte addressable storage as a PCI-e
   bar.  Think of it as an nvdimm on a PCI-e card.  That's the iopmem
   case.
 - the NVMe CMB which exposes a byte addressable indirection buffer for
   I/O, but does not actually provide byte addressable persistent
   storage.  This is something that needs to be added to the NVMe driver
   (and the block layer for the abstraction probably).


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-05 Thread Logan Gunthorpe



On 05/12/16 12:46 PM, Jason Gunthorpe wrote:

> NVMe might have to deal with pci-e hot-unplug, which is a similar
> problem-class to the GPU case..


Sure, but if the NVMe device gets hot-unplugged it means that all the 
CMB mappings are useless and need to be torn down. This probably means 
killing any process that has mappings open.



> In any event the allocator still needs to track which regions are in
> use and be able to hook 'free' from userspace. That does suggest it
> should be integrated into the nvme driver and not a bolt on driver..


Yup, that's correct. And yes, I've never suggested this to be a bolt on 
driver -- I always expected for it to get integrated into the nvme 
driver. (iopmem was not meant for this.)


Logan


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-05 Thread Jason Gunthorpe
On Mon, Dec 05, 2016 at 12:27:20PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 05/12/16 12:14 PM, Jason Gunthorpe wrote:
> >But CMB sounds much more like the GPU case where there is a
> >specialized allocator handing out the BAR to consumers, so I'm not
> >sure a general purpose chardev makes a lot of sense?
> 
> I don't think it will ever need to be as complicated as the GPU case. There
> will probably only ever be a relatively small amount of memory behind the
> CMB and really the only users are those doing P2P work. Thus the specialized
> allocator could be pretty simple and I expect it would be fine to just
> return -ENOMEM if there is not enough memory.

NVMe might have to deal with pci-e hot-unplug, which is a similar
problem-class to the GPU case..

In any event the allocator still needs to track which regions are in
use and be able to hook 'free' from userspace. That does suggest it
should be integrated into the nvme driver and not a bolt on driver..
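
The usual way to hook 'free' from userspace would be a close callback on the
VMA; a sketch with made-up names (real code would also have to cope with VMA
splits):

#include <linux/mm.h>

struct cmb_region {
	unsigned long offset;
	size_t size;
};

/* hypothetical hook back into the CMB allocator */
extern void nvme_cmb_free(struct cmb_region *r);

static void nvme_cmb_vma_close(struct vm_area_struct *vma)
{
	struct cmb_region *r = vma->vm_private_data;

	nvme_cmb_free(r);	/* return the region to the allocator */
}

static const struct vm_operations_struct nvme_cmb_vm_ops = {
	.close	= nvme_cmb_vma_close,
};

/* in the driver's ->mmap handler, after allocating a region r:
 *	vma->vm_private_data = r;
 *	vma->vm_ops = &nvme_cmb_vm_ops;
 */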

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-05 Thread Logan Gunthorpe



On 05/12/16 12:14 PM, Jason Gunthorpe wrote:

> But CMB sounds much more like the GPU case where there is a
> specialized allocator handing out the BAR to consumers, so I'm not
> sure a general purpose chardev makes a lot of sense?


I don't think it will ever need to be as complicated as the GPU case. 
There will probably only ever be a relatively small amount of memory 
behind the CMB and really the only users are those doing P2P work. Thus 
the specialized allocator could be pretty simple and I expect it would 
be fine to just return -ENOMEM if there is not enough memory.


Also, if it was implemented this way, if there was a need to make the 
allocator more complicated it could easily be added later as the 
userspace interface is just mmap to obtain a buffer.
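
For instance, a gen_pool over the CMB BAR would probably be enough; a minimal
sketch (all names are illustrative, not from the nvme driver):

#include <linux/errno.h>
#include <linux/genalloc.h>
#include <linux/mm.h>

static struct gen_pool *cmb_pool;

static int cmb_pool_init(unsigned long cmb_bar, size_t cmb_size)
{
	cmb_pool = gen_pool_create(PAGE_SHIFT, -1);	/* page-sized granularity */
	if (!cmb_pool)
		return -ENOMEM;
	return gen_pool_add(cmb_pool, cmb_bar, cmb_size, -1);
}

/* called from the chardev ->mmap handler: hand out a chunk or fail */
static unsigned long cmb_alloc(size_t size)
{
	return gen_pool_alloc(cmb_pool, size);	/* 0 here means -ENOMEM */
}

static void cmb_free(unsigned long addr, size_t size)
{
	gen_pool_free(cmb_pool, addr, size);
}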


Logan


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-05 Thread Jason Gunthorpe
On Mon, Dec 05, 2016 at 10:48:58AM -0800, Dan Williams wrote:
> On Mon, Dec 5, 2016 at 10:39 AM, Logan Gunthorpe  wrote:
> > On 05/12/16 11:08 AM, Dan Williams wrote:
> >>
> >> I've already recommended that iopmem not be a block device and instead
> >> be a device-dax instance. I also don't think it should claim the PCI
> >> ID, rather the driver that wants to map one of its bars this way can
> >> register the memory region with the device-dax core.
> >>
> >> I'm not sure there are enough device drivers that want to do this to
> >> have it be a generic /sys/.../resource_dmableX capability. It still
> >> seems to be an exotic one-off type of configuration.
> >
> >
> > Yes, this is essentially my thinking. Except I think the userspace interface
> > should really depend on the device itself. Device dax is a good  choice for
> > many and I agree the block device approach wouldn't be ideal.
> >
> > Specifically for NVME CMB: I think it would make a lot of sense to just hand
> > out these mappings with an mmap call on /dev/nvmeX. I expect CMB buffers
> > would be volatile and thus you wouldn't need to keep track of where in the
> > BAR the region came from. Thus, the mmap call would just be an allocator
> > from BAR memory. If device-dax were used, userspace would need to lookup
> > which device-dax instance corresponds to which nvme drive.
> 
> I'm not opposed to mapping /dev/nvmeX.  However, the lookup is trivial
> to accomplish in sysfs through /sys/dev/char to find the sysfs path
> of

But CMB sounds much more like the GPU case where there is a
specialized allocator handing out the BAR to consumers, so I'm not
sure a general purpose chardev makes a lot of sense?

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-05 Thread Dan Williams
On Mon, Dec 5, 2016 at 10:39 AM, Logan Gunthorpe  wrote:
> On 05/12/16 11:08 AM, Dan Williams wrote:
>>
>> I've already recommended that iopmem not be a block device and instead
>> be a device-dax instance. I also don't think it should claim the PCI
>> ID, rather the driver that wants to map one of its bars this way can
>> register the memory region with the device-dax core.
>>
>> I'm not sure there are enough device drivers that want to do this to
>> have it be a generic /sys/.../resource_dmableX capability. It still
>> seems to be an exotic one-off type of configuration.
>
>
> Yes, this is essentially my thinking. Except I think the userspace interface
> should really depend on the device itself. Device dax is a good  choice for
> many and I agree the block device approach wouldn't be ideal.
>
> Specifically for NVME CMB: I think it would make a lot of sense to just hand
> out these mappings with an mmap call on /dev/nvmeX. I expect CMB buffers
> would be volatile and thus you wouldn't need to keep track of where in the
> BAR the region came from. Thus, the mmap call would just be an allocator
> from BAR memory. If device-dax were used, userspace would need to lookup
> which device-dax instance corresponds to which nvme drive.
>

I'm not opposed to mapping /dev/nvmeX.  However, the lookup is trivial
to accomplish in sysfs through /sys/dev/char to find the sysfs path of
the device-dax instance under the nvme device, or if you already have
the nvme sysfs path the dax instance(s) will appear under the "dax"
sub-directory.
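
For reference, that lookup is just a stat() on the dax character device plus
a readlink() under /sys/dev/char; a sketch (the /dev/dax0.0 name is only an
example):

#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <unistd.h>

int dax_sysfs_path(const char *dax_dev, char *buf, size_t buflen)
{
	struct stat st;
	char link[64];
	ssize_t n;

	if (stat(dax_dev, &st) < 0)
		return -1;

	/* /sys/dev/char/MAJOR:MINOR is a symlink into the device's sysfs dir */
	snprintf(link, sizeof(link), "/sys/dev/char/%u:%u",
		 major(st.st_rdev), minor(st.st_rdev));

	n = readlink(link, buf, buflen - 1);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	return 0;
}

/* e.g. dax_sysfs_path("/dev/dax0.0", path, sizeof(path)) yields a path under
 * the owning device's sysfs directory; when the instance was registered under
 * an nvme device it shows up in that device's "dax" sub-directory. */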


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-05 Thread Logan Gunthorpe

On 05/12/16 11:08 AM, Dan Williams wrote:

> I've already recommended that iopmem not be a block device and instead
> be a device-dax instance. I also don't think it should claim the PCI
> ID, rather the driver that wants to map one of its bars this way can
> register the memory region with the device-dax core.
> 
> I'm not sure there are enough device drivers that want to do this to
> have it be a generic /sys/.../resource_dmableX capability. It still
> seems to be an exotic one-off type of configuration.


Yes, this is essentially my thinking. Except I think the userspace 
interface should really depend on the device itself. Device dax is a 
good  choice for many and I agree the block device approach wouldn't be 
ideal.


Specifically for NVME CMB: I think it would make a lot of sense to just 
hand out these mappings with an mmap call on /dev/nvmeX. I expect CMB 
buffers would be volatile and thus you wouldn't need to keep track of 
where in the BAR the region came from. Thus, the mmap call would just be 
an allocator from BAR memory. If device-dax were used, userspace would 
need to lookup which device-dax instance corresponds to which nvme drive.


Logan




Re: Enabling peer to peer device transactions for PCIe devices

2016-12-05 Thread Dan Williams
On Mon, Dec 5, 2016 at 10:02 AM, Jason Gunthorpe wrote:
> On Mon, Dec 05, 2016 at 09:40:38AM -0800, Dan Williams wrote:
>
>> > If it is kernel only with physical addresses we don't need a uAPI for
>> > it, so I'm not sure #1 is at all related to iopmem.
>> >
>> > Most people who want #1 probably can just mmap
>> > /sys/../pci/../resourceX to get a user handle to it, or pass around
>> > __iomem pointers in the kernel. This has been asked for before with
>> > RDMA.
>> >
>> > I'm still not really clear what iopmem is for, or why DAX should ever
>> > be involved in this..
>>
>> Right, by default remap_pfn_range() does not establish DMA capable
>> mappings. You can think of iopmem as remap_pfn_range() converted to
>> use devm_memremap_pages(). Given the extra constraints of
>> devm_memremap_pages() it seems reasonable to have those DMA capable
>> mappings be optionally established via a separate driver.
>
> Except the iopmem driver claims the PCI ID, and presents a block
> interface which is really *NOT* what people who have asked for this in
> the past have wanted. IIRC it was embedded stuff eg RDMA video
> directly out of a capture card or a similar kind of thinking.
>
> It is a good point about devm_memremap_pages limitations, but maybe
> that just says to create a /sys/.../resource_dmableX ?
>
> Or is there some reason why people want a filesystem on top of BAR
> memory? That does not seem to have been covered yet..
>

I've already recommended that iopmem not be a block device and instead
be a device-dax instance. I also don't think it should claim the PCI
ID, rather the driver that wants to map one of its bars this way can
register the memory region with the device-dax core.

I'm not sure there are enough device drivers that want to do this to
have it be a generic /sys/.../resource_dmableX capability. It still
seems to be an exotic one-off type of configuration.


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-05 Thread Jason Gunthorpe
On Mon, Dec 05, 2016 at 09:40:38AM -0800, Dan Williams wrote:

> > If it is kernel only with physical addresses we don't need a uAPI for
> > it, so I'm not sure #1 is at all related to iopmem.
> >
> > Most people who want #1 probably can just mmap
> > /sys/../pci/../resourceX to get a user handle to it, or pass around
> > __iomem pointers in the kernel. This has been asked for before with
> > RDMA.
> >
> > I'm still not really clear what iopmem is for, or why DAX should ever
> > be involved in this..
> 
> Right, by default remap_pfn_range() does not establish DMA capable
> mappings. You can think of iopmem as remap_pfn_range() converted to
> use devm_memremap_pages(). Given the extra constraints of
> devm_memremap_pages() it seems reasonable to have those DMA capable
> mappings be optionally established via a separate driver.

Except the iopmem driver claims the PCI ID, and presents a block
interface which is really *NOT* what people who have asked for this in
the past have wanted. IIRC it was embedded stuff eg RDMA video
directly out of a capture card or a similar kind of thinking.

It is a good point about devm_memremap_pages limitations, but maybe
that just says to create a /sys/.../resource_dmableX ?

Or is there some reason why people want a filesystem on top of BAR
memory? That does not seem to have been covered yet..

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-05 Thread Dan Williams
On Mon, Dec 5, 2016 at 9:18 AM, Jason Gunthorpe wrote:
> On Sun, Dec 04, 2016 at 07:23:00AM -0600, Stephen Bates wrote:
>> Hi All
>>
>> This has been a great thread (thanks to Alex for kicking it off) and I
>> wanted to jump in and maybe try and put some summary around the
>> discussion. I also wanted to propose we include this as a topic for LFS/MM
>> because I think we need more discussion on the best way to add this
>> functionality to the kernel.
>>
>> As far as I can tell the people looking for P2P support in the kernel fall
>> into two main camps:
>>
>> 1. Those who simply want to expose static BARs on PCIe devices that can be
>> used as the source/destination for DMAs from another PCIe device. This
>> group has no need for memory invalidation and are happy to use
>> physical/bus addresses and not virtual addresses.
>
> I didn't think there was much on this topic except for the CMB
> thing.. Even that is really a mapped kernel address..
>
>> I think something like the iopmem patches Logan and I submitted recently
>> come close to addressing use case 1. There are some issues around
>> routability but based on feedback to date that does not seem to be a
>> show-stopper for an initial inclusion.
>
> If it is kernel only with physical addresses we don't need a uAPI for
> it, so I'm not sure #1 is at all related to iopmem.
>
> Most people who want #1 probably can just mmap
> /sys/../pci/../resourceX to get a user handle to it, or pass around
> __iomem pointers in the kernel. This has been asked for before with
> RDMA.
>
> I'm still not really clear what iopmem is for, or why DAX should ever
> be involved in this..

Right, by default remap_pfn_range() does not establish DMA capable
mappings. You can think of iopmem as remap_pfn_range() converted to
use devm_memremap_pages(). Given the extra constraints of
devm_memremap_pages() it seems reasonable to have those DMA capable
mappings be optionally established via a separate driver.
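
To make the contrast concrete, the "default" case is a driver mmap handler
doing little more than a remap_pfn_range() over the BAR, which leaves the
mapping without struct page backing (so no get_user_pages(), no DMA API); a
sketch:

#include <linux/mm.h>
#include <linux/pci.h>

/* Plain BAR mmap: CPU access works, but there are no struct pages behind
 * the PTEs, so the mapping cannot be used as a DMA/RDMA target.  iopmem
 * instead registers the BAR with devm_memremap_pages() so that ZONE_DEVICE
 * struct pages exist for it. */
static int bar_mmap(struct vm_area_struct *vma, struct pci_dev *pdev, int bar)
{
	unsigned long pfn = pci_resource_start(pdev, bar) >> PAGE_SHIFT;
	unsigned long size = vma->vm_end - vma->vm_start;

	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
	return remap_pfn_range(vma, vma->vm_start, pfn, size,
			       vma->vm_page_prot);
}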


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-05 Thread Jason Gunthorpe
On Sun, Dec 04, 2016 at 07:23:00AM -0600, Stephen Bates wrote:
> Hi All
> 
> This has been a great thread (thanks to Alex for kicking it off) and I
> wanted to jump in and maybe try and put some summary around the
> discussion. I also wanted to propose we include this as a topic for LFS/MM
> because I think we need more discussion on the best way to add this
> functionality to the kernel.
> 
> As far as I can tell the people looking for P2P support in the kernel fall
> into two main camps:
> 
> 1. Those who simply want to expose static BARs on PCIe devices that can be
> used as the source/destination for DMAs from another PCIe device. This
> group has no need for memory invalidation and are happy to use
> physical/bus addresses and not virtual addresses.

I didn't think there was much on this topic except for the CMB
thing.. Even that is really a mapped kernel address..

> I think something like the iopmem patches Logan and I submitted recently
> come close to addressing use case 1. There are some issues around
> routability but based on feedback to date that does not seem to be a
> show-stopper for an initial inclusion.

If it is kernel only with physical addresses we don't need a uAPI for
it, so I'm not sure #1 is at all related to iopmem.

Most people who want #1 probably can just mmap
/sys/../pci/../resourceX to get a user handle to it, or pass around
__iomem pointers in the kernel. This has been asked for before with
RDMA.
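
i.e. with nothing new in the kernel, roughly (sketch; the BDF is made up):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

void *map_bar0(size_t len)
{
	int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR);

	if (fd < 0)
		return NULL;
	/* CPU access only: no struct pages behind this mapping, so it cannot
	 * be handed to RDMA MRs or O_DIRECT without further kernel work. */
	return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}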

I'm still not really clear what iopmem is for, or why DAX should ever
be involved in this..

> For use-case 2 it looks like there are several options and some of them
> (like HMM) have been around for quite some time without gaining
> acceptance. I think there needs to be more discussion on this usecase and
> it could be some time before we get something upstreamable.

AFAIK, hmm makes parts easier, but isn't directly addressing this
need..

I think you need to get ZONE_DEVICE accepted for non-cachable PCI BARs
as the first step.

From there it is pretty clear the DMA API needs to be updated to
support that use, and work can be done to solve the various problems
there on the basis of using ZONE_DEVICE pages to figure out the
PCI-E end points.

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-04 Thread Stephen Bates
Hi All

This has been a great thread (thanks to Alex for kicking it off) and I
wanted to jump in and maybe try and put some summary around the
discussion. I also wanted to propose we include this as a topic for LFS/MM
because I think we need more discussion on the best way to add this
functionality to the kernel.

As far as I can tell the people looking for P2P support in the kernel fall
into two main camps:

1. Those who simply want to expose static BARs on PCIe devices that can be
used as the source/destination for DMAs from another PCIe device. This
group has no need for memory invalidation and are happy to use
physical/bus addresses and not virtual addresses.

2. Those who want to support devices that suffer from occasional memory
pressure and need to invalidate memory regions from time to time. This
camp also would like to use virtual addresses rather than physical ones to
allow for things like migration.

I am wondering if people agree with this assessment?

I think something like the iopmem patches Logan and I submitted recently
come close to addressing use case 1. There are some issues around
routability but based on feedback to date that does not seem to be a
show-stopper for an initial inclusion.

For use-case 2 it looks like there are several options and some of them
(like HMM) have been around for quite some time without gaining
acceptance. I think there needs to be more discussion on this usecase and
it could be some time before we get something upstreamable.

I for one, would really like to see use case 1 get addressed soon because
we have consumers for it coming soon in the form of CMBs for NVMe devices.

Long term I think Jason summed it up really well. CPU vendors will put
high-speed, open, switchable, coherent buses on their processors and all
these problems will vanish. But I ain't holding my breath for that to
happen ;-).

Cheers

Stephen


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-04 Thread Haggai Eran
On 11/30/2016 6:23 PM, Jason Gunthorpe wrote:
>> and O_DIRECT operations that access GPU memory.
> This goes through user space so there is still a VMA..
> 
>> Also, HMM's migration between two GPUs could use peer to peer in the
>> kernel, although that is intended to be handled by the GPU driver if
>> I understand correctly.
> Hum, presumably these migrations are VMA backed as well...
I guess so.

>>> Presumably in-kernel could use a vmap or something and the same basic
>>> flow?
>> I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API 
>> support
>> for peer to peer. I'm not sure we need vmap. We need a way to have a 
>> scatterlist
>> of MMIO pfns, and ZONE_DEVICE allows that.
> Well, if there is no virtual map then we are back to how do you do
> migrations and other things people seem to want to do on these
> pages. Maybe the loose 'struct page' flow is not for those users.
I was thinking that kernel use cases would disallow migration, similar to how 
non-ODP MRs would work. Either they are short-lived (like an O_DIRECT transfer)
or they can be long-lived but non-migratable (like perhaps a CMB staging
buffer).

> But I think if you want kGPU or similar then you probably need vmaps
> or something similar to represent the GPU pages in kernel memory.
Right, although sometimes the GPU pages are simply inaccessible to the CPU.
In any case, I haven't thought about kGPU as a use-case.
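
On the scatterlist-of-MMIO-pfns point above: once a BAR has ZONE_DEVICE pages
(from devm_memremap_pages()), the sketch is just ordinary scatterlist
construction; names here are illustrative:

#include <linux/mm.h>
#include <linux/scatterlist.h>

/* Build a scatterlist over ZONE_DEVICE pages that cover an MMIO region
 * previously registered with devm_memremap_pages(); bar_pfn/npages are
 * assumed to describe that region. */
static void mmio_to_sgl(struct scatterlist *sgl, unsigned long bar_pfn,
			unsigned int npages)
{
	unsigned int i;

	sg_init_table(sgl, npages);
	for (i = 0; i < npages; i++)
		sg_set_page(&sgl[i], pfn_to_page(bar_pfn + i), PAGE_SIZE, 0);
	/* the DMA API still has to learn how to map these peer-to-peer pages */
}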

