Re: New NTB API Issue

2017-06-23 Thread Logan Gunthorpe


On 23/06/17 04:06 PM, Allen Hubbe wrote:
> From: Logan Gunthorpe
>> But any translation can be
>> programmed by any peer.
> 
> That doesn't seem safe.  Even though it can be done as you say, would it not 
> be better to have each specific translation under the control of exactly one 
> driver?
> 
> If drivers can reach across and set the translation of any peer bar, they 
> would still need to negotiate among N peers which one sets which other's 
> translation.

Yup. In a dual-host setup it's not a problem, since only the one peer
will ever set any of the memory windows. In a multi-host setup, there
has to be implicit or explicit negotiation as to which memory window
connects to which peer.

The hardware also implements locking, so if two peers try to mess with
the same translation, the worst that can happen is that the last peer's
settings take effect or one peer sees an error.

Logan



RE: New NTB API Issue

2017-06-23 Thread Allen Hubbe
From: Logan Gunthorpe
> But any translation can be
> programmed by any peer.

That doesn't seem safe.  Even though it can be done as you say, would it not be 
better to have each specific translation under the control of exactly one 
driver?

If drivers can reach across and set the translation of any peer bar, they would 
still need to negotiate among N peers which one sets which other's translation.



RE: New NTB API Issue

2017-06-23 Thread Allen Hubbe
From: Logan Gunthorpe
> Hey,
> 
> Thanks Serge for the detailed explanation. This is pretty much exactly
> as I had thought it should be interpreted. My only problem remains that
> my hardware can't provide ntb_mw_get_align until the port it is asking
> about has configured itself. The easiest way to solve this is to only
> allow access when the link to that port is up. It's not a complicated
> change and would actually simplify ntb_transport and ntb_perf a little.
> 
> Is this ok with you Allen? If it is, I can include a patch in my next
> switchtec submission.

Ok.

> 
> Thanks,
> 
> Logan



Re: New NTB API Issue

2017-06-23 Thread Logan Gunthorpe
Hey,

Thanks Serge for the detailed explanation. This is pretty much exactly
as I had thought it should be interpreted. My only problem remains that
my hardware can't provide ntb_mw_get_align until the port it is asking
about has configured itself. The easiest way to solve this is to only
allow access when the link to that port is up. It's not a complicated
change and would actually simplify ntb_transport and ntb_perf a little.

Is this ok with you Allen? If it is, I can include a patch in my next
switchtec submission.
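
For illustration, a minimal sketch of what that restriction could look like at
the top of the driver's mw_get_align callback (the switchtec_ntb struct,
ntb_sndev() helper and link_is_up field here are placeholders, not necessarily
the driver's real names):

static int switchtec_ntb_mw_get_align(struct ntb_dev *ntb, int pidx, int widx,
				      resource_size_t *addr_align,
				      resource_size_t *size_align,
				      resource_size_t *size_max)
{
	struct switchtec_ntb *sndev = ntb_sndev(ntb);

	/* The peer port's BAR setup is only meaningful once that port has
	 * configured itself, so refuse the call until the link is up. */
	if (!sndev->link_is_up)
		return -ENODEV;

	/* ... read the peer-programmed alignment and size from hardware ... */
	return 0;
}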

Thanks,

Logan


Re: New NTB API Issue

2017-06-23 Thread Serge Semin
On Fri, Jun 23, 2017 at 03:07:19PM -0400, Allen Hubbe  
wrote:
Hello Allen,

> From: Logan Gunthorpe
> > On 23/06/17 07:18 AM, Allen Hubbe wrote:
> > > By "remote" do you mean the source or destination of a write?
> > 
> > Look at how these calls are used in ntb_transport and ntb_perf:
> > 
> > They both call ntb_peer_mw_get_addr to get the size of the BAR. The size
> > is sent via spads to the other side. The other side then uses
> > ntb_mw_get_align and applies align_size to the received size.
> > 
> > > Yes, clients should transfer the address and size information to the peer.
> > 
> > But then they also need to technically transfer the alignment
> > information as well. Which neither of the clients do.
> 
> The client's haven't been fully ported to the multi-port api yet.  They were 
> only minimally changed to call the new api, but so far other than that they 
> have only been made to work as they had before.
> 
> > > Maybe this is the confusion.  None of these api calls are to reach across 
> > > to the peer port, as to
> > get the size of the peer's bar.  They are to get information from the local 
> > port, or to configure the
> > local port.
> > 
> > I like the rule that these api calls are not to reach across the port.
> > But then API looks very wrong. Why are we calling one peer_mw_get addr
> > and the other mw_get_align? And why does mw_get_align have a peer index?
> 
> I regret that the term "peer" is used to distinguish the mw api.  Better 
> names perhaps should be ntb_outbound_mw_foo, ntb_inbound_mw_foo; or 
> ntb_src_mw_foo, ntb_dest_mw_foo.  I like outbound/inbound, although the names 
> are longer, maybe out/in would be ok.
> 

As you remember, we discussed this matter when the new NTB API was being
developed, and we decided on the following designation:
ntb_mw_*      - NTB MW API concerned with inbound memory window configuration,
ntb_peer_mw_* - NTB MW API concerned with outbound memory window configuration.
Yes, suffixes like "in/out" might give a better notion of each method's
purpose, but the current naming is consistent with the rest of the API, and
once a user is familiar with the NTB API it shouldn't cause much difficulty.
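
For reference, the corresponding prototypes in the proposed multi-port API look
roughly like this (paraphrased, not a verbatim excerpt from ntb.h):

/* Inbound (local) memory window configuration: ntb_mw_* */
int ntb_mw_get_align(struct ntb_dev *ntb, int pidx, int widx,
		     resource_size_t *addr_align,
		     resource_size_t *size_align,
		     resource_size_t *size_max);
int ntb_mw_set_trans(struct ntb_dev *ntb, int pidx, int widx,
		     dma_addr_t addr, resource_size_t size);

/* Outbound (peer) memory window configuration: ntb_peer_mw_* */
int ntb_peer_mw_get_addr(struct ntb_dev *ntb, int widx,
			 phys_addr_t *base, resource_size_t *size);
int ntb_peer_mw_set_trans(struct ntb_dev *ntb, int pidx, int widx,
			  u64 addr, resource_size_t size);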

> > And why does mw_get_align have a peer index?
> 
> Good question.  Serge?
> 
> For that matter, why do we not also have peer mw idx in the set of 
> parameters.  Working through the example below, it looks like we are lacking 
> a way to say Outbound MW1 on A corresponds with Inbound MW0 on B.  It looks 
> like we can only indicate that Port A (not which Outbound MW of Port A) 
> corresponds with Inbound MW0 on B.
> 

Before I give any explanation, could you please study the new NTB API 
documentation:
https://github.com/jonmason/ntb/blob/ntb-next/Documentation/ntb.txt
Lines 31 - 111 are of particular interest; they give a better description of
the way all the MW-related methods work, for a start. A detailed multi-port
example is given at the end of this email.

> > > Some snippets of code would help me understand your interpretation of the 
> > > api semantics more
> > exactly.
> > 
> > I'm not sure the code to best show this in code, but let me try
> > describing an example situation:
> > 
> > Lets say we have the following mws on each side (this is something that
> > is possible with Switchtec hardware):
> > 
> > Host A BARs:
> > mwA0: 2MB size, aligned to 4k, size aligned to 4k
> > mwA1: 4MB size, aligned to 4k, size aligned to 4k
> > mwA2: 64k size, aligned to 64k, size aligned to 64k
> > 
> > Host B BARs:
> > mwB0: 2MB size, aligned to 4k, size aligned to 4k
> > mwB1: 64k size, aligned to 64k, size aligned to 64k
> 
> If those are BARs, that corresponds to "outbound", writing something to the 
> BAR at mwA0.
> 
> A more complete picture might be:
> 
> Host A BARs (aka "outbound" or "peer" memory windows):
> peer_mwA0: resource at 0xA - 0xA0020 (2MB)
> peer_mwA1: resource at 0xA1000 - 0xA1040 (4MB)
> peer_mwA2: resource at 0xA2000 - 0xa2001 (64k)
> 
> Host A MWs (aka "inbound" memory windows):
> mwA0: 64k max size, aligned to 64k, size aligned to 64k
> mwA1: 2MB max size, aligned to 4k, size aligned to 4k
> 
> Host A sees Host B on port index 1
> 
> 
> Host B BARs (aka "outbound" or "peer" memory windows):
> peer_mwB0: resource at 0xB - 0xB0020 (2MB)
> peer_mwB1: resource at 0xB1000 - 0xB1001 (64k)
> 
> Host B MWs (aka "inbound" memory windows):
> mwB0: 1MB size, aligned to 4k, size aligned to 4k
> mwB1: 2MB size, aligned to 4k, size aligned to 4k
> 
> Host B sees Host A on port index 4
> 
> 
> Outbound memory (aka "peer mw") windows come with a pci resource.  We can get 
> the size of the resource, it's physical address, and set up outbound 
> translation if the hardware has that (IDT).
> 
> Inbound memory windows (aka "mw") are only used to set up inbound 
> translation, if the hardware has that (Intel, AMD).
> 
> To set up end-to-end memory window so that A can write to B, 

Re: New NTB API Issue

2017-06-23 Thread Logan Gunthorpe


On 23/06/17 01:07 PM, Allen Hubbe wrote:
> The client's haven't been fully ported to the multi-port api yet.  They were 
> only minimally changed to call the new api, but so far other than that they 
> have only been made to work as they had before.

So is it intended to eventually send the align parameters via spads?
This seems like it would take a lot of spads, or multiplexing the spads
with a few doorbells, which gets a bit nasty.
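
Just to illustrate the cost, the per-window exchange might look roughly like
this (a sketch; SPAD_* and SPADS_PER_MW are made-up scratchpad assignments,
and it assumes the alignment values fit in 32 bits):

/* Hypothetical layout: three scratchpads per memory window. */
enum { SPAD_ADDR_ALIGN, SPAD_SIZE_ALIGN, SPAD_SIZE_MAX, SPADS_PER_MW };

static int send_mw_align(struct ntb_dev *ntb, int pidx, int widx)
{
	resource_size_t addr_align, size_align, size_max;
	int base = widx * SPADS_PER_MW;
	int rc;

	rc = ntb_mw_get_align(ntb, pidx, widx, &addr_align,
			      &size_align, &size_max);
	if (rc)
		return rc;

	/* Three spad writes per window; with several windows (plus the
	 * address/size exchange) the spad space runs out quickly. */
	ntb_peer_spad_write(ntb, pidx, base + SPAD_ADDR_ALIGN, addr_align);
	ntb_peer_spad_write(ntb, pidx, base + SPAD_SIZE_ALIGN, size_align);
	ntb_peer_spad_write(ntb, pidx, base + SPAD_SIZE_MAX, size_max);
	return 0;
}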


> If those are BARs, that corresponds to "outbound", writing something to the 
> BAR at mwA0.
> A more complete picture might be:
> 
> Host A BARs (aka "outbound" or "peer" memory windows):
> peer_mwA0: resource at 0xA - 0xA0020 (2MB)
> peer_mwA1: resource at 0xA1000 - 0xA1040 (4MB)
> peer_mwA2: resource at 0xA2000 - 0xa2001 (64k)
> 
> Host A MWs (aka "inbound" memory windows):
> mwA0: 64k max size, aligned to 64k, size aligned to 64k
> mwA1: 2MB max size, aligned to 4k, size aligned to 4k

I don't really like the separation of inbound and outbound as you describe
it. It doesn't really match my hardware. In Switchtec, each partition
has some number of BARs and each BAR has a single translation which sets
the peer and destination address. The translation really exists inside
the switch hardware, not on either side. But any translation can be
programmed by any peer. Saying that there's an opposite inbound window
to every outbound window is not an accurate abstraction for us.

I _suspect_ the IDT hardware is similar but, based on Serge's driver, I
think the translation can only be programmed by the peer that the BAR is
resident in (as opposed to from any side like in the Switchtec hardware).
(This poses some problems for getting the IDT code to actually work with
existing clients.)

> Outbound memory (aka "peer mw") windows come with a pci resource.  We can get 
> the size of the resource, it's physical address, and set up outbound 
> translation if the hardware has that (IDT).
> 
> Inbound memory windows (aka "mw") are only used to set up inbound 
> translation, if the hardware has that (Intel, AMD).
> 
> To set up end-to-end memory window so that A can write to B, let's use 
> peer_mwA1 and mwB0.
> 
> A: ntb_peer_mw_get_addr(peer_mwA1) -> base 0xA1000, size 4MB
> B: ntb_mw_get_align(port4**, mwB0) -> aligned 4k, aligned 4k, max size 1MB
> ** Serge: do we need port info here, why?
> 
> Side A has a resource size of 4MB, but B only supports inbound translation up 
> to 1MB.  Side A can only use the first quarter of the 4MB resource.
> 
> Side B needs to allocate memory aligned to 4k (the dma address must be 
> aligned to 4k after dma mapping), and a multiple of 4k in size.  B may need 
> to set inbound translation so that incoming writes go into this memory.  A 
> may also need to set outbound translation.
> 
> A: ntb_peer_mw_set_trans(port1**, peer_mwA1, dma_mem_addr, dma_mem_size)
> B: ntb_mw_set_trans(port4**, mwB0, dma_mem_addr, dma_mem_size)
> ** Serge: do we also need the opposing side MW index here?
> 
> ** Logan: would those changes to the api suit your needs?

Not really, no. Except for the confusion with the mw_get_align issue, the
new API, as it is, suits my hardware well. What you're proposing doesn't
fix my issue and doesn't match my hardware. Though, I interpreted
ntb_peer_mw_set_trans somewhat differently from what you describe: I did
not expect the client would need to call both functions, but some clients
could optionally use ntb_peer_mw_set_trans to set the translation from
the opposite side (thus needing to send the DMA address over spads or
msgs). Though, without an actual in-kernel user, it's hard to know what
is actually intended.

It's worth noting that the IDT driver only provides peer_mw_set_trans
and not mw_set_trans. I assumed it's because the hardware's memory
windows can only be configured from the opposite side.

Pragmatically, the only change I need for everything to work as I expect
is for mw_get_align to be called only after link up. However, given all
the confusion I'm wondering if these changes are even ready for
upstream. Without actual in-kernel client code it's hard to know if the
API is correct or that everyone is even interpreting it in the same way.

Logan


RE: New NTB API Issue

2017-06-23 Thread Allen Hubbe
From: Logan Gunthorpe
> On 23/06/17 07:18 AM, Allen Hubbe wrote:
> > By "remote" do you mean the source or destination of a write?
> 
> Look at how these calls are used in ntb_transport and ntb_perf:
> 
> They both call ntb_peer_mw_get_addr to get the size of the BAR. The size
> is sent via spads to the other side. The other side then uses
> ntb_mw_get_align and applies align_size to the received size.
> 
> > Yes, clients should transfer the address and size information to the peer.
> 
> But then they also need to technically transfer the alignment
> information as well. Which neither of the clients do.

The clients haven't been fully ported to the multi-port api yet.  They were
only minimally changed to call the new api, but so far other than that they 
have only been made to work as they had before.

> > Maybe this is the confusion.  None of these api calls are to reach across 
> > to the peer port, as to
> get the size of the peer's bar.  They are to get information from the local 
> port, or to configure the
> local port.
> 
> I like the rule that these api calls are not to reach across the port.
> But then API looks very wrong. Why are we calling one peer_mw_get addr
> and the other mw_get_align? And why does mw_get_align have a peer index?

I regret that the term "peer" is used to distinguish the mw api.  Better names 
perhaps should be ntb_outbound_mw_foo, ntb_inbound_mw_foo; or ntb_src_mw_foo, 
ntb_dest_mw_foo.  I like outbound/inbound, although the names are longer, maybe 
out/in would be ok.

> And why does mw_get_align have a peer index?

Good question.  Serge?

For that matter, why do we not also have a peer mw idx in the set of parameters?
Working through the example below, it looks like we are lacking a way to say 
Outbound MW1 on A corresponds with Inbound MW0 on B.  It looks like we can only 
indicate that Port A (not which Outbound MW of Port A) corresponds with Inbound 
MW0 on B.

> > Some snippets of code would help me understand your interpretation of the 
> > api semantics more
> exactly.
> 
> I'm not sure the code to best show this in code, but let me try
> describing an example situation:
> 
> Lets say we have the following mws on each side (this is something that
> is possible with Switchtec hardware):
> 
> Host A BARs:
> mwA0: 2MB size, aligned to 4k, size aligned to 4k
> mwA1: 4MB size, aligned to 4k, size aligned to 4k
> mwA2: 64k size, aligned to 64k, size aligned to 64k
> 
> Host B BARs:
> mwB0: 2MB size, aligned to 4k, size aligned to 4k
> mwB1: 64k size, aligned to 64k, size aligned to 64k

If those are BARs, that corresponds to "outbound", writing something to the BAR 
at mwA0.

A more complete picture might be:

Host A BARs (aka "outbound" or "peer" memory windows):
peer_mwA0: resource at 0xA - 0xA0020 (2MB)
peer_mwA1: resource at 0xA1000 - 0xA1040 (4MB)
peer_mwA2: resource at 0xA2000 - 0xa2001 (64k)

Host A MWs (aka "inbound" memory windows):
mwA0: 64k max size, aligned to 64k, size aligned to 64k
mwA1: 2MB max size, aligned to 4k, size aligned to 4k

Host A sees Host B on port index 1


Host B BARs (aka "outbound" or "peer" memory windows):
peer_mwB0: resource at 0xB - 0xB0020 (2MB)
peer_mwB1: resource at 0xB1000 - 0xB1001 (64k)

Host B MWs (aka "inbound" memory windows):
mwB0: 1MB size, aligned to 4k, size aligned to 4k
mwB1: 2MB size, aligned to 4k, size aligned to 4k

Host B sees Host A on port index 4


Outbound memory (aka "peer mw") windows come with a pci resource.  We can get 
the size of the resource, its physical address, and set up outbound
translation if the hardware has that (IDT).

Inbound memory windows (aka "mw") are only used to set up inbound translation, 
if the hardware has that (Intel, AMD).

To set up end-to-end memory window so that A can write to B, let's use 
peer_mwA1 and mwB0.

A: ntb_peer_mw_get_addr(peer_mwA1) -> base 0xA1000, size 4MB
B: ntb_mw_get_align(port4**, mwB0) -> aligned 4k, aligned 4k, max size 1MB
** Serge: do we need port info here, why?

Side A has a resource size of 4MB, but B only supports inbound translation up 
to 1MB.  Side A can only use the first quarter of the 4MB resource.

Side B needs to allocate memory aligned to 4k (the dma address must be aligned 
to 4k after dma mapping), and a multiple of 4k in size.  B may need to set 
inbound translation so that incoming writes go into this memory.  A may also 
need to set outbound translation.

A: ntb_peer_mw_set_trans(port1**, peer_mwA1, dma_mem_addr, dma_mem_size)
B: ntb_mw_set_trans(port4**, mwB0, dma_mem_addr, dma_mem_size)
** Serge: do we also need the opposing side MW index here?

** Logan: would those changes to the api suit your needs?
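
To make the sequence above concrete, side B's part might look roughly like the
sketch below (minimal error handling, the spad exchange with side A is left
out, and dma_alloc_coherent() is just one convenient way to get suitably
aligned memory):

#include <linux/kernel.h>
#include <linux/sizes.h>
#include <linux/dma-mapping.h>
#include <linux/ntb.h>

/* Side B: prepare inbound window mwB0 for writes arriving from port 4. */
static int setup_inbound_mw(struct ntb_dev *ntb, struct device *dma_dev)
{
	const int pidx = 4, widx = 0;
	resource_size_t addr_align, size_align, size_max, size;
	dma_addr_t dma_addr;
	void *buf;
	int rc;

	rc = ntb_mw_get_align(ntb, pidx, widx, &addr_align,
			      &size_align, &size_max);
	if (rc)
		return rc;

	/* The 4MB resource size learned from A gets clamped to what B can
	 * translate (1MB here), then rounded down to the size alignment. */
	size = min_t(resource_size_t, size_max, SZ_4M);
	size = rounddown(size, size_align);

	buf = dma_alloc_coherent(dma_dev, size, &dma_addr, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;

	/* dma_addr must also satisfy addr_align; page-aligned coherent
	 * memory covers the 4k requirement in this example. */
	return ntb_mw_set_trans(ntb, pidx, widx, dma_addr, size);
}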

> So on Host A: peer_mw_get_addr(idx=1) should return size=4M (from mwA1),
> but it's not clear what mw_get_align(widx=1) should do. I see two
> possibilities:
> 
> 1) Given that it has the opposite sense of peer_mw_get_addr (ie. there
> is no peer in it's name) and that this new 

Re: New NTB API Issue

2017-06-23 Thread Logan Gunthorpe


On 23/06/17 07:18 AM, Allen Hubbe wrote:
> By "remote" do you mean the source or destination of a write?

Look at how these calls are used in ntb_transport and ntb_perf:

They both call ntb_peer_mw_get_addr to get the size of the BAR. The size
is sent via spads to the other side. The other side then uses
ntb_mw_get_align and applies align_size to the received size.
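
Paraphrased (a sketch of the pattern, not a verbatim excerpt from either
client; MW_SZ_SPAD is a made-up scratchpad index and declarations are elided):

/* Local side: report the outbound BAR size to the peer via a spad. */
ntb_peer_mw_get_addr(ntb, widx, &mw_base, &mw_size);
ntb_peer_spad_write(ntb, pidx, MW_SZ_SPAD, mw_size);

/* Peer side: take the received size and round it down to the *local*
 * size alignment returned by ntb_mw_get_align(). */
rx_size = ntb_spad_read(ntb, MW_SZ_SPAD);
ntb_mw_get_align(ntb, pidx, widx, &addr_align, &size_align, &size_max);
rx_size = rounddown(rx_size, size_align);
ntb_mw_set_trans(ntb, pidx, widx, dma_addr, rx_size);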

> Yes, clients should transfer the address and size information to the peer.

But then they technically need to transfer the alignment information as
well, which neither of the clients does.

> Maybe this is the confusion.  None of these api calls are to reach across to 
> the peer port, as to get the size of the peer's bar.  They are to get 
> information from the local port, or to configure the local port.

I like the rule that these api calls are not to reach across the port.
But then the API looks very wrong. Why are we calling one peer_mw_get_addr
and the other mw_get_align? And why does mw_get_align have a peer index?


> Some snippets of code would help me understand your interpretation of the api 
> semantics more exactly.

I'm not sure how to best show this in code, but let me try
describing an example situation:

Let's say we have the following mws on each side (this is something that
is possible with Switchtec hardware):

Host A BARs:
mwA0: 2MB size, aligned to 4k, size aligned to 4k
mwA1: 4MB size, aligned to 4k, size aligned to 4k
mwA2: 64k size, aligned to 64k, size aligned to 64k

Host B BARs:
mwB0: 2MB size, aligned to 4k, size aligned to 4k
mwB1: 64k size, aligned to 64k, size aligned to 64k

So on Host A: peer_mw_get_addr(idx=1) should return size=4M (from mwA1),
but it's not clear what mw_get_align(widx=1) should do. I see two
possibilities:

1) Given that it has the opposite sense of peer_mw_get_addr (ie. there
is no peer in its name) and that this new api also has a peer index, it
should return align_size=64k (from mwB1). However, in order to do this,
Host B must be fully configured (technically the link doesn't have to be
fully up, but having a link up is the only way for a client to tell if
Host B is configured or not).

2) Given your assertion that these APIs should never reach across the
link, one could say it should return align_size=4k. However, in
this situation the name is really confusing. And the fact that it has a
pidx argument makes no sense. And the way ntb_transport and ntb_perf use
it is wrong because they will end up masking the 64K size of mwB1 with
the 4k align_size from mwA1.

Does that make more sense?

Thanks,

Logan




RE: New NTB API Issue

2017-06-23 Thread Allen Hubbe
From: Logan Gunthorpe
> On 6/22/2017 4:42 PM, Allen Hubbe wrote:
> > From: Logan Gunthorpe
> >> Any thoughts on changing the semantics of mw_get_align so it must be
> >> called with the link up?
> >
> > The intention of these is that these calls return information from the 
> > local port.  The calls
> themselves don't reach across the link to the peer, but the information 
> returned from the local port
> needs to be communicated for setting up the translation end-to-end.
> 
> Ok, well if it's from the local port, then splitting up mw_get_range
> into peer_mw_get_addr and mw_get_align is confusing because one has the
> peer designation and the other doesn't. And all the clients apply the
> alignments to the remote bar so they'd technically need to transfer them
> across the link somehow.

By "remote" do you mean the source or destination of a write?

Yes, clients should transfer the address and size information to the peer.

> > I would like to understand why this hardware needs link up.  Are there 
> > registers on the local port
> that are only valid after link up?
> 
> We only need the link up if we are trying to find the alignment
> requirements (and max_size) of the peer's bar. In theory, switchtec

Maybe this is the confusion.  None of these api calls are to reach across to 
the peer port, as to get the size of the peer's bar.  They are to get 
information from the local port, or to configure the local port.

Some mw configuration is done at the destination, as for Intel and AMD, and 
some configuration is done on the source side, for IDT.  The local 
configuration of the port on one side could depend on information from the 
remote port on the other side.  For example in IDT, the mw translation 
configured on the source side needs to know destination address information.  
Likewise, if there is any difference in the size of the range that can be 
translated by ports on opposing sides, that needs to be negotiated.
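
For the IDT-style case, that dependency might look roughly like this (a sketch;
recv_peer_dma_addr()/recv_peer_mw_size() stand in for however the destination
side actually communicates its information, e.g. spads or message registers):

/* Source side: the outbound translation can only be programmed once the
 * destination side has told us where incoming writes should land. */
u64 dst_addr = recv_peer_dma_addr(ntb);
u64 dst_size = recv_peer_mw_size(ntb);
resource_size_t src_size;
phys_addr_t src_base;

ntb_peer_mw_get_addr(ntb, peer_widx, &src_base, &src_size);

/* Negotiate: the usable range is limited by both sides. */
dst_size = min_t(u64, dst_size, src_size);

ntb_peer_mw_set_trans(ntb, pidx, peer_widx, dst_addr, dst_size);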

> could have different sizes of bars on both sides of the link and
> different alignment requirements. Though, in practice, this is probably
> unlikely.
> 
> > Can you post snippets of how ntb_mw_get_align and ntb_peer_mw_get_addr 
> > might be implemented for the
> Switchtec?
> 
> Hmm, yes, but lets sort out my confusion on the semantics per above first.

Some snippets of code would help me understand your interpretation of the api 
semantics more exactly.

> Logan



Re: New NTB API Issue

2017-06-22 Thread Logan Gunthorpe

On 6/22/2017 4:42 PM, Allen Hubbe wrote:
> From: Logan Gunthorpe
>> Any thoughts on changing the semantics of mw_get_align so it must be
>> called with the link up?
>
> The intention of these is that these calls return information from the local
> port.  The calls themselves don't reach across the link to the peer, but the
> information returned from the local port needs to be communicated for setting
> up the translation end-to-end.

Ok, well if it's from the local port, then splitting up mw_get_range
into peer_mw_get_addr and mw_get_align is confusing because one has the
peer designation and the other doesn't. And all the clients apply the
alignments to the remote bar, so they'd technically need to transfer them
across the link somehow.

> I would like to understand why this hardware needs link up.  Are there
> registers on the local port that are only valid after link up?

We only need the link up if we are trying to find the alignment
requirements (and max_size) of the peer's bar. In theory, switchtec
could have different sizes of bars on both sides of the link and
different alignment requirements. Though, in practice, this is probably
unlikely.

> Can you post snippets of how ntb_mw_get_align and ntb_peer_mw_get_addr might be
> implemented for the Switchtec?

Hmm, yes, but let's sort out my confusion on the semantics per above first.

Logan



RE: New NTB API Issue

2017-06-22 Thread Allen Hubbe
From: Logan Gunthorpe
> Any thoughts on changing the semantics of mw_get_align so it must be
> called with the link up?

The intention of these is that these calls return information from the local 
port.  The calls themselves don't reach across the link to the peer, but the 
information returned from the local port needs to be communicated for setting 
up the translation end-to-end.

I would like to understand why this hardware needs link up.  Are there 
registers on the local port that are only valid after link up?

Can you post snippets of how ntb_mw_get_align and ntb_peer_mw_get_addr might be 
implemented for the Switchtec?



Re: New NTB API Issue

2017-06-22 Thread Logan Gunthorpe

On 6/22/2017 4:12 PM, Allen Hubbe wrote:
> The resource size given by peer_mw_get_addr might be different than the
> max_size given by ntb_mw_get_align.
>
> I am most familiar with the ntb_hw_intel driver and that type of ntb hardware.
> The peer_mw_get_addr size is of the primary bar on the side to be the source
> of the translated writes (or reads).  In b2b topology, at least, the first
> translation of that write lands it on the secondary bar of the peer ntb.  The
> size of that bar could be different than the first.  The second translation
> lands the write in memory (eg).  So, the end-to-end translation is limited by
> the first AND second sizes.
>
> The first point is, the *max_size returned by intel_ntb_mw_get_align looks
> wrong.  That should be the size of the secondary bar, not the resource size
> of the primary bar, of that device.
>
> The second point is, because the sizes returned by peer_mw_get_addr, and
> ntb_mw_get_align, may be different, the two sides should communicate and
> reconcile the address and size information when setting up the translations.

Ok, that makes some sense. So drivers that don't need this should return
-1 or something like that? Though, I'd still suggest that until a driver
actually needs this, and it's implemented correctly in the clients, we
should leave it out.

Any thoughts on changing the semantics of mw_get_align so it must be
called with the link up?


Logan






RE: New NTB API Issue

2017-06-22 Thread Allen Hubbe
From: Logan Gunthorpe
> On 6/22/2017 12:32 PM, Allen Hubbe wrote:
> > From: Logan Gunthorpe
> >> 2) The changes to the Intel and AMD driver for mw_get_align sets
> >> *max_size to the local pci resource size. (Thus making the assumption
> >> that the local is the same as the peer, which is wrong). max_size isn't
> >> actually used for anything so it's not _really_ an issue, but I do think
> >> it's confusing and incorrect. I'd suggest we remove max_size until
> >> something actually needs it, or at least set it to zero in cases where
> >> the hardware doesn't support returning the size of the peer's memory
> >> window (ie. in the Intel and AMD drivers).
> >
> > You're right, and the b2b_split in the Intel driver even makes use of 
> > different primary/secondary
> bar sizes. For Intel and AMD, it would make more sense to use the secondary 
> bar size here.  The size
> of the secondary bar still not necessarily valid end-to-end, because in b2b 
> the peer's primary bar
> size could be even smaller.
> >
> > I'm not entirely convinced that this should represent the end-to-end size 
> > of local and peer memory
> window configurations.  I think it should represent the largest size that 
> would be valid to pass to
> ntb_mw_set_trans().  Then, the peers should communicate their respective max 
> sizes (along with
> translation addresses, etc) before setting up the translations, and that 
> exchange will ensure that the
> size finally used is valid end-to-end.
> 
> But why would the client ever need to use the max_size instead of the
> actual size of the bar as retrieved and exchanged from peer_mw_get_addr?

The resource size given by peer_mw_get_addr might be different than the 
max_size given by ntb_mw_get_align.

I am most familiar with the ntb_hw_intel driver and that type of ntb hardware.  
The peer_mw_get_addr size is of the primary bar on the side to be the source of 
the translated writes (or reads).  In b2b topology, at least, the first 
translation of that write lands it on the secondary bar of the peer ntb.  The
size of that bar could be different than the first.  The second translation lands
the write in memory (eg).  So, the end-to-end translation is limited by the
first AND second sizes.

The first point is, the *max_size returned by intel_ntb_mw_get_align looks 
wrong.  That should be the size of the secondary bar, not the resource size of 
the primary bar, of that device.

The second point is, because the sizes returned by peer_mw_get_addr, and 
ntb_mw_get_align, may be different, the two sides should communicate and 
reconcile the address and size information when setting up the translations.



RE: New NTB API Issue

2017-06-22 Thread Allen Hubbe
From: Logan Gunthorpe
> Hey Guys,
> 
> I've run into some subtle issues with the new API:
> 
> It has to do with splitting mw_get_range into mw_get_align and
> peer_mw_get_addr.
> 
> The original mw_get_range returned the /local/ memory window's size,
> address and alignment requirements. The ntb clients then take the local
> size and transmit it via spads to the peer, which would use it in
> setting up the memory window. However, it made the assumption that the
> alignment restrictions were symmetric on both hosts, since they were
> not sent across the link.
> 
> The new API makes a sensible change for this in that mw_get_align
> appears to be intended to return the alignment restrictions (and now
> size) of the peer. This helps a bit for the Switchtec driver but appears
> to be a semantic change that wasn't really reflected in the changes to
> the other NTB code. So, I see a couple of issues:
> 
> 1) With our hardware, we can't actually know anything about the peer's
> memory windows until the peer has finished its setup (ie. the link is
> up). However, all the clients call the function during probe, before the
> link is ready. There's really no good reason for this, so I think we
> should change the clients so that mw_get_align is called only when the
> link is up.
> 
> 2) The changes to the Intel and AMD drivers for mw_get_align set
> *max_size to the local pci resource size. (Thus making the assumption
> that the local bar is the same as the peer's, which is wrong.) max_size
> isn't actually used for anything so it's not _really_ an issue, but I do
> think it's confusing and incorrect. I'd suggest we remove max_size until
> something actually needs it, or at least set it to zero in cases where
> the hardware doesn't support returning the size of the peer's memory
> window (ie. in the Intel and AMD drivers).

You're right, and the b2b_split in the Intel driver even makes use of different 
primary/secondary bar sizes. For Intel and AMD, it would make more sense to use 
the secondary bar size here.  The size of the secondary bar is still not 
necessarily valid end-to-end, because in b2b the peer's primary bar size could 
be even smaller.

I'm not entirely convinced that this should represent the end-to-end size of 
local and peer memory window configurations.  I think it should represent the 
largest size that would be valid to pass to ntb_mw_set_trans().  Then, the 
peers should communicate their respective max sizes (along with translation 
addresses, etc) before setting up the translations, and that exchange will 
ensure that the size finally used is valid end-to-end.

> 
> Thoughts?
> 
> Logan
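
A matching sketch of the receiving half, again assuming the multi-port
signatures discussed in this thread and the same illustrative SPAD_MW0_SZ_HI/LO
scratchpad layout as in the earlier sketch; DMA-address alignment checks and
error unwinding are elided:

#include <linux/dma-mapping.h>
#include <linux/kernel.h>
#include <linux/ntb.h>

#define SPAD_MW0_SZ_HI	0	/* illustrative, as in the earlier sketch */
#define SPAD_MW0_SZ_LO	1

/*
 * Sketch: clamp the size the peer advertised to our own inbound limit,
 * allocate a buffer, and program the inbound translation.
 */
static int sketch_setup_inbound_mw(struct ntb_dev *ntb, int pidx,
				   struct device *dma_dev)
{
	resource_size_t addr_align, size_align, size_max, xlat_size;
	dma_addr_t dma_addr;
	u64 peer_size;
	void *buf;
	int rc;

	rc = ntb_mw_get_align(ntb, pidx, 0, &addr_align, &size_align,
			      &size_max);
	if (rc)
		return rc;

	peer_size = (u64)ntb_spad_read(ntb, SPAD_MW0_SZ_HI) << 32 |
		    ntb_spad_read(ntb, SPAD_MW0_SZ_LO);

	/* The end-to-end usable size is limited by both sides of the link. */
	xlat_size = min_t(resource_size_t, peer_size, size_max);
	if (size_align)
		xlat_size = rounddown(xlat_size, size_align);

	buf = dma_alloc_coherent(dma_dev, xlat_size, &dma_addr, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;

	/* dma_addr should also satisfy addr_align; that check is elided. */
	return ntb_mw_set_trans(ntb, pidx, 0, dma_addr, xlat_size);
}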



Re: New NTB API Issue

2017-06-22 Thread Logan Gunthorpe

On 6/22/2017 12:32 PM, Allen Hubbe wrote:
> From: Logan Gunthorpe
>> Hey Guys,
>>
>> I've run into some subtle issues with the new API:
>>
>> It has to do with splitting mw_get_range into mw_get_align and
>> peer_mw_get_addr.
>>
>> The original mw_get_range returned the /local/ memory window's size,
>> address and alignment requirements. The ntb clients then take the local
>> size and transmit it via spads to the peer, which would use it in
>> setting up the memory window. However, it made the assumption that the
>> alignment restrictions were symmetric on both hosts, since they were
>> not sent across the link.
>>
>> The new API makes a sensible change for this in that mw_get_align
>> appears to be intended to return the alignment restrictions (and now
>> size) of the peer. This helps a bit for the Switchtec driver but appears
>> to be a semantic change that wasn't really reflected in the changes to
>> the other NTB code. So, I see a couple of issues:
>>
>> 1) With our hardware, we can't actually know anything about the peer's
>> memory windows until the peer has finished its setup (ie. the link is
>> up). However, all the clients call the function during probe, before the
>> link is ready. There's really no good reason for this, so I think we
>> should change the clients so that mw_get_align is called only when the
>> link is up.
>>
>> 2) The changes to the Intel and AMD drivers for mw_get_align set
>> *max_size to the local pci resource size. (Thus making the assumption
>> that the local bar is the same as the peer's, which is wrong.) max_size
>> isn't actually used for anything so it's not _really_ an issue, but I do
>> think it's confusing and incorrect. I'd suggest we remove max_size until
>> something actually needs it, or at least set it to zero in cases where
>> the hardware doesn't support returning the size of the peer's memory
>> window (ie. in the Intel and AMD drivers).
>
> You're right, and the b2b_split in the Intel driver even makes use of
> different primary/secondary bar sizes. For Intel and AMD, it would make
> more sense to use the secondary bar size here.  The size of the secondary
> bar is still not necessarily valid end-to-end, because in b2b the peer's
> primary bar size could be even smaller.
>
> I'm not entirely convinced that this should represent the end-to-end size
> of local and peer memory window configurations.  I think it should
> represent the largest size that would be valid to pass to
> ntb_mw_set_trans().  Then, the peers should communicate their respective
> max sizes (along with translation addresses, etc) before setting up the
> translations, and that exchange will ensure that the size finally used is
> valid end-to-end.

But why would the client ever need to use the max_size instead of the
actual size of the bar as retrieved and exchanged from peer_mw_get_addr?


Logan



New NTB API Issue

2017-06-22 Thread Logan Gunthorpe
Hey Guys,

I've run into some subtle issues with the new API:

It has to do with splitting mw_get_range into mw_get_align and
peer_mw_get_addr.

The original mw_get_range returned the /local/ memory window's size,
address and alignment requirements. The ntb clients then take the local
size and transmit it via spads to the peer, which would use it in
setting up the memory window. However, it made the assumption that the
alignment restrictions were symmetric on both hosts, since they were
not sent across the link.

The new API makes a sensible change for this in that mw_get_align
appears to be intended to return the alignment restrictions (and now
size) of the peer. This helps a bit for the Switchtec driver but appears
to be a semantic change that wasn't really reflected in the changes to
the other NTB code. So, I see a couple of issues:

1) With our hardware, we can't actually know anything about the peer's
memory windows until the peer has finished its setup (ie. the link is
up). However, all the clients call the function during probe, before the
link is ready. There's really no good reason for this, so I think we
should change the clients so that mw_get_align is called only when the
link is up.

2) The changes to the Intel and AMD drivers for mw_get_align set
*max_size to the local pci resource size. (Thus making the assumption
that the local bar is the same as the peer's, which is wrong.) max_size
isn't actually used for anything so it's not _really_ an issue, but I do
think it's confusing and incorrect. I'd suggest we remove max_size until
something actually needs it, or at least set it to zero in cases where
the hardware doesn't support returning the size of the peer's memory
window (ie. in the Intel and AMD drivers).

Thoughts?

Logan
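
For point 1, the kind of client change being suggested might look roughly like
the sketch below: defer the ntb_mw_get_align() call from probe to the
link-event callback, once ntb_link_is_up() reports the link. This is only an
illustration of the idea, not the actual ntb_transport/ntb_perf change;
passing the ntb device itself as the context and assuming a single peer
(pidx 0) are simplifications:

#include <linux/ntb.h>

/*
 * Sketch: query window parameters from the link-up path instead of from
 * probe, since some hardware (e.g. Switchtec) cannot report peer window
 * info before the peer has finished its own setup.
 */
static void sketch_link_event(void *ctx)
{
	struct ntb_dev *ntb = ctx;	/* simplification: ctx is the ntb dev */
	resource_size_t addr_align, size_align, size_max;
	int pidx = 0;			/* single peer assumed */

	if (!ntb_link_is_up(ntb, NULL, NULL))
		return;			/* link-down teardown elided */

	if (ntb_mw_get_align(ntb, pidx, 0, &addr_align, &size_align,
			     &size_max))
		return;

	/* ...continue with the scratchpad exchange and ntb_mw_set_trans()... */
}

static const struct ntb_ctx_ops sketch_ctx_ops = {
	.link_event	= sketch_link_event,
};

/*
 * In probe: ntb_set_ctx(ntb, ntb, &sketch_ctx_ops); then
 * ntb_link_enable(ntb, NTB_SPEED_AUTO, NTB_WIDTH_AUTO);
 */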

