Re: bypass support for iommu on sparc64

2018-10-20 Thread Bryan Steele
This is OpenBSD tech@

On Sat, Oct 20, 2018 at 08:36:33PM +0100, Andrew Grillet wrote:
> So, substitute opening and closing the connection to the network?
> 
> Is the IOMMU not used for disk (and all SCSI) access also?
> 
> 
> 
> On Sat, 20 Oct 2018 at 20:32, Theo de Raadt  wrote:
> 
> > Andrew Grillet  wrote:
> >
> > > Ok, what I am proposing is that the IOMMU is set up when a file is opened
> > > to provide the address space required for that file's IO.
> >
> > Wow, you keep saying file as if it means something.
> >
> > packets off the network are not associated with any specific "file"
> > activity
> >
> > it isn't how the kernel works.
> >
> > You are ... way off target.
> >
> 



Re: bypass support for iommu on sparc64

2018-10-20 Thread Andrew Grillet
So, substitute opening and closing the connection to the network?

Is the IOMMU not used for disk (and all SCSI) access also?



On Sat, 20 Oct 2018 at 20:32, Theo de Raadt  wrote:

> Andrew Grillet  wrote:
>
> > Ok, what I am proposing is that the IOMMU is set up when a file is opened
> > to provide the address space required for that file's IO.
>
> Wow, you keep saying file as if it means something.
>
> packets off the network are not associated with any specific "file"
> activity
>
> it isn't how the kernel works.
>
> You are ... way off target.
>


Re: bypass support for iommu on sparc64

2018-10-20 Thread Andrew Grillet
Ok, what I am proposing is that the IOMMU is set up when a file is opened
to provide the address space required for that file's IO.
This remains set up until the file is closed, avoiding frequent set-up and
tear-down for each IO transfer.

I assume that there is sufficient IOMMU address space to handle any
plausible number of files open, and that it is possible to keep
the knowledge of address spaces private to the Primary Ldom, and guests
would only be aware of the mbufs visible to them, and
this is acceptable. (If you can't trust the Primary, I rather suspect you
are stuffed anyway). Clearly, depending on the IOMMU architecture,
which I do not claim to understand, this could exhaust IO address space
before it exhausts physical memory; I don't know.
But I cannot see any other reason why this would not avoid frequent set-up
and tear-downs.

I get the impression that disk access is not great on my T machines. I
expect a 1GHz T1000 to totally piss on a 4GHz Intel
machine at web serving, and it doesn't. (Solaris annoys me too much to even
try it, but I assume it's better than OpenBSD on
Sparc64 at this time, or Larry Ellison would have to sell his yacht).





On Sat, 20 Oct 2018 at 20:04, Theo de Raadt  wrote:

> In this case, what do mbufs have to do with files?
>
> I am very confused.
>
> > I was assuming that the main objection to allocating mbufs for duration
> of
> > file open,
> > rather than allocating per transfer, this could result in a much higher
> > number of mbufs
> > being in use concurrently. I cannot see any other downside (which may be
> > due to my
> > not understanding a lot of stuff - I last wrote this level of stuff for
> > Unix in the 1980's).
> >
> > On Sat, 20 Oct 2018 at 14:41, Theo de Raadt  wrote:
> >
> > > Andrew Grillet  wrote:
> > >
> > > > These days we are not so short of memory - would it not be possible
> to
> > > > allocate an mbuf (or two for double-buffered) for each file
> > > > when opened, and free when closed?
> > >
> > > What does this have to do with files??
> > >
>


Re: bypass support for iommu on sparc64

2018-10-20 Thread Theo de Raadt
In this case, what do mbufs have to do with files?

I am very confused.

> I was assuming that the main objection to allocating mbufs for duration of
> file open,
> rather than allocating per transfer, this could result in a much higher
> number of mbufs
> being in use concurrently. I cannot see any other downside (which may be
> due to my
> not understanding a lot of stuff - I last wrote this level of stuff for
> Unix in the 1980's).
> 
> On Sat, 20 Oct 2018 at 14:41, Theo de Raadt  wrote:
> 
> > Andrew Grillet  wrote:
> >
> > > These days we are not so short of memory - would it not be possible to
> > > allocate an mbuf (or two for double-buffered) for each file
> > > when opened, and free when closed?
> >
> > What does this have to do with files??
> >



Re: bypass support for iommu on sparc64

2018-10-20 Thread Andrew Grillet
I was assuming that the main objection to allocating mbufs for the duration
of file open, rather than allocating per transfer, is that this could result
in a much higher number of mbufs being in use concurrently. I cannot see any
other downside (which may be due to my not understanding a lot of stuff - I
last wrote this level of stuff for Unix in the 1980s).

On Sat, 20 Oct 2018 at 14:41, Theo de Raadt  wrote:

> Andrew Grillet  wrote:
>
> > These days we are not so short of memory - would it not be possible to
> > allocate an mbuf (or two for double-buffered) for each file
> > when opened, and free when closed?
>
> What does this have to do with files??
>


Re: bypass support for iommu on sparc64

2018-10-20 Thread Theo de Raadt
Andrew Grillet  wrote:

> These days we are not so short of memory - would it not be possible to
> allocate an mbuf (or two for double-buffered) for each file
> when opened, and free when closed?

What does this have to do with files??  



Re: bypass support for iommu on sparc64

2018-10-20 Thread Andrew Grillet
These days we are not so short of memory - would it not be possible to
allocate an mbuf (or two for double-buffered) for each file
when opened, and free when closed?

I can see the management might be more complex, but the performance
benefits might be considerable.
Also, for VM disk access (ldom on T) does this mean the process happens
twice: once for disk-to-host
and again for host-to-guest? In which case, allocating mbufs for the entire
vdisk file to the host once
at (VM) boot time (ldomctl start guest), and deallocating when it is shut
down would save huge amounts
of work. Unless the host is not involved in guest file access at all (don't
know how you could safely do
that).

I can't see that making all of memory visible (even to kernel processes) in a
guest is acceptable. Too much to
go wrong.

Andrew

On Sat, 20 Oct 2018 at 01:59, David Gwynne  wrote:

>
>
> > On 19 Oct 2018, at 9:59 pm, Andrew Grillet  wrote:
> >
> > Is the setup and teardown per transfer or when file is opened and closed?
> > Or is it set up once per context switch of task?
> >
> > I am partly interested cos I would like to improve mt one day (as user of
> > tape
> > and Sparc64 Txxx) if I get the time.
> >
> > Andrew
>
> The overhead is per transfer. You might not get better performance out of
> a tx000 because of the PCIe bridges involved, but you may also be lucky and
> not have that bridge in the way.
>
> >
> >
> >
> > On Fri, 19 Oct 2018 at 10:22, Mark Kettenis 
> wrote:
> >
> >>> Date: Fri, 19 Oct 2018 10:22:30 +1000
> >>> From: David Gwynne 
> >>>
> >>> On Wed, May 10, 2017 at 10:09:59PM +1000, David Gwynne wrote:
> >>>> On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote:
> >>>>> on modern sparc64s (think fire or sparc enterprise Mx000 boxes),
> >>>>> setting up and tearing down the translation table entries (TTEs)
> >>>>> is very expensive. so expensive that the cost of doing it for disk
> >>>>> io has a noticeable impact on compile times.
> >>>>>
> >>>>> now that there's a BUS_DMA_64BIT flag, we can use that to decide
> >>>>> to bypass the iommu for devices that set that flag, therefore
> >>>>> avoiding the cost of handling the TTEs.
> >>>>>
> >>>>> the following diff adds support for bypass mappings to the iommu
> >>>>> code on sparc64. it's based on a diff from kettenis@ back in 2009.
> >>>>> the main changes are around coping with the differences between
> >>>>> schizo/psycho and fire/oberon.
> >>>>>
> >>>>> the differences between the chips are now represented by a iommu_hw
> >>>>> struct. these differences include how to enable the iommu (now via
> >>>>> a function pointer), and masks for bypass addresses.
> >>>>>
> >>>>> ive tested this on oberon (on an m4000) and schizo (on a v880).
> >>>>> however, the bypass code isnt working on fire (v245s). to cope with
> >>>>> that for now, the iommu_hw struct lets drivers mask flag bits that
> >>>>> are handled when creating a dmamap. this means fire boards will
> >>>>> ignore BUS_DMA_64BIT until i can figure out whats wrong with them.
> >>>>
> >>>> i figured it out. it turns out Fire was working fine. however,
> >>>> enabling 64bit dva on the onboard devices didnt work because the
> >>>> serverworks/broadcom pcie to pcix bridge can only handle dma addresses
> >>>> in the low 40 bits. because the fire bypass window is higher than
> >>>> this, the bridge would choke and things stopped working.
> >>>>
> >>>> the updated diff attempts to handle this. basically when probing
> >>>> the bridge, the platform creates a custom dma tag for it. this tag
> >>>> intercepts bus_dmamap_create and clears the BUS_DMA_64BIT flag before
> >>>> handing it up to the parent bridge, which is pyro in my situation.
> >>>> it looks like early sun4v boxes could make use of this too.
> >>>>
> >>>>> i have not tested this on psycho yet. if anyone has such a machine
> >>>>> and is willing to work with me to figure it out, please talk to me.
> >>>>
> >>>> i still dont have psycho reports.
> >>>
> >>> Would anyone object if I committed this? I've been running it for the
> >>> last release or two without issues, but with significant improvements
> in
> >>> performance on the machines involved.
> >>
> >> At the price of giving all PCI devices unrestricted access to memory.
> >>
> >> So I'm not eager to do this, especially since on sun4v hardware bypassing
> >> the iommu isn't possible as soon as multiple domains are enabled.  And
> >> we lose a useful diagnostic when developing drivers.  Are you sure the
> >> iommu overhead can't be reduced some other way?  At some point we
> >> probably want to add iommu support on amd64 and arm64, but if that
> >> comes with a similar overhead as on sparc64 that's going to be a bit
> >> of an issue.
> >>
> >>>> Index: dev/iommu.c
> >>>> ===
> >>>> RCS file: /cvs/src/sys/arch/sparc64/dev/iommu.c,v
> >>>> retrieving revision 1.74
> >>>> diff -u -p -r1.74 iommu.c
> >>>> --- dev/iommu.c 30 Apr 2017 

Re: bypass support for iommu on sparc64

2018-10-19 Thread David Gwynne
On Sat, Oct 20, 2018 at 02:44:29AM +, Joseph Mayer wrote:
> 
> Last iteration from me on this one.
> 
> Why is this not a problem on some other architectures?

It is a problem, it's just that other archs don't have an iommu like
sparc64.

> I'd have thought DMA and hardware being assigned transitory addresses
> (from memory allocator or other OS subsystem or driver) mostly is a
> lower level phenomenon and memcpy normally applies on higher levels,
> isn't it so - for networking for instance, mbuf's take over soon above
> the driver level. Does OpenBSD have a pool of to-be-mbufs and it asks
> network drivers to write received ethernet frames directly to them, and
> similarly transmit ethernet frames directly from mbuf:s?

Hrm. There are three views of memory you need to keep in mind here.
Memory has a physical address which gets mapped to virtual addresses
that the kernel and programs see. Finally, there's the DMA address,
which is the address devices use to access physical memory.

On most archs the physical and dma addresses are the same thing. On
archs with an IOMMU or similar, the dma address can be virtual, just
like the kernel addresses are virtual.

When you allocate an mbuf, you're getting a chunk of physical memory
that is mapped into the kernel virtual address space. For a device
to do something with it, the kernel has the bus_dma api that figures
out the dma address of the physical memory behind the kernel virtual
address.

On sparc64, that figuring out involves finding the physical address of
the memory, then allocating and filling TTEs. On amd64, it just has to
get the physical address of the kva and the device can use it directly.
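
To make the per-transfer work concrete, here is a toy model in C of loading and unloading one mapping. It is only a sketch and not the sparc64 iommu code: the `toy_` names, the 8 KB page size, the eight-entry table and the valid bit are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 13                        /* 8 KB pages, assumed for the toy */
#define TOY_PAGE_MASK  ((1ULL << TOY_PAGE_SHIFT) - 1)
#define TOY_NTTE       8                         /* tiny table, illustration only */
#define TOY_TTE_VALID  (1ULL << 63)              /* made-up valid bit */

struct toy_iommu {
	uint64_t tte[TOY_NTTE];   /* one entry per DVA page */
	uint64_t dva_base;        /* first DVA covered by the table, must be non-zero */
};

/* "Load": claim a free entry for one physical page. Returns the DVA, 0 if full. */
uint64_t
toy_iommu_load(struct toy_iommu *im, uint64_t pa)
{
	uint64_t i;

	assert((pa & TOY_PAGE_MASK) == 0);       /* page aligned */
	assert((pa & TOY_TTE_VALID) == 0);
	for (i = 0; i < TOY_NTTE; i++) {
		if ((im->tte[i] & TOY_TTE_VALID) == 0) {
			im->tte[i] = TOY_TTE_VALID | pa;
			return im->dva_base + (i << TOY_PAGE_SHIFT);
		}
	}
	return 0;
}

/* "Unload": invalidate the entry once the transfer has completed. */
void
toy_iommu_unload(struct toy_iommu *im, uint64_t dva)
{
	uint64_t i = (dva - im->dva_base) >> TOY_PAGE_SHIFT;

	assert(dva >= im->dva_base && i < TOY_NTTE);
	im->tte[i] = 0;
}

/* What the hardware does per device access. Returns 0 on success, -1 on fault. */
int
toy_iommu_translate(const struct toy_iommu *im, uint64_t dva, uint64_t *pa)
{
	uint64_t i;

	if (dva < im->dva_base)
		return -1;
	i = (dva - im->dva_base) >> TOY_PAGE_SHIFT;
	if (i >= TOY_NTTE || (im->tte[i] & TOY_TTE_VALID) == 0)
		return -1;
	*pa = (im->tte[i] & ~TOY_TTE_VALID) | (dva & TOY_PAGE_MASK);
	return 0;
}

/* Without an IOMMU the DMA address is simply the physical address. */
uint64_t
toy_direct_dva(uint64_t pa)
{
	return pa;
}
```

In this model every transfer pays for one `toy_iommu_load` and one `toy_iommu_unload`, while the direct case pays for neither. It also shows a physical page above 4 GB being reachable through a low DVA.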

> What potentially or clearly sensitive memory would passthru expose,
> driver-owned structures only or all memory?

Passthru means a device can access all the physical memory in a
computer. So everything.



Re: bypass support for iommu on sparc64

2018-10-19 Thread Joseph Mayer
On Saturday, October 20, 2018 10:14 AM, David Gwynne  wrote:

> > On 20 Oct 2018, at 11:56 am, Joseph Mayer joseph.ma...@protonmail.com wrote:
> > ‐‐‐ Original Message ‐‐‐
> > On Friday, October 19, 2018 5:15 PM, Mark Kettenis mark.kette...@xs4all.nl 
> > wrote:
> >
> > > > Date: Fri, 19 Oct 2018 10:22:30 +1000
> > > > From: David Gwynne da...@gwynne.id.au
> > > > On Wed, May 10, 2017 at 10:09:59PM +1000, David Gwynne wrote:
> > > >
> > > > > On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote:
> > > > >
> > > > > > on modern sparc64s (think fire or sparc enterprise Mx000 boxes),
> > > > > > setting up and tearing down the translation table entries (TTEs)
> > > > > > is very expensive. so expensive that the cost of doing it for disk
> > > > > > > io has a noticeable impact on compile times.
> > > > > > now that there's a BUS_DMA_64BIT flag, we can use that to decide
> > > > > > to bypass the iommu for devices that set that flag, therefore
> > > > > > avoiding the cost of handling the TTEs.
> >
> > Question for the unintroduced, what's the scope here, TTE is Sparc's
> > page table and reconfiguring them at (process) context switch is
> > expensive and this suggestion removes the need for TTE:s for hardware
> > device access, but those don't change at context switch?
>
> We're talking about an IOMMU here, not a traditional MMU providing virtual 
> addresses for programs. An IOMMU sits between physical memory and the devices 
> in a machine. It allows DMA addresses to be mapped to different parts of 
> physical memory. Mapping physical memory to a DMA virtual address (or dva) is 
> how a device that only understands 32bit addresses can work in a 64bit 
> machine. Memory at high addresses gets mapped to a low dva.
>
> This is done at runtime on OpenBSD when DMA mappings are loaded or unloaded 
> by populating Translation Table Entries (TTEs). A TTE is effectively a table 
> or array mapping DVA pages to physical addresses. Generally device drivers 
> load and unload dma memory for every I/O or packet or so on.
>
> IOMMUs in sparc64s have some more features than this. Because they really are 
> between memory and the devices they can act as a gatekeeper for all memory 
> accesses. They also have a toggle that can allow a device to have direct or 
> passthru access to physical memory. If passthru is enabled, there's a special 
> address range that effectively maps all physical memory into a DVA range. 
> Devices can be pointed at it without having to manage TTEs. When passthru is 
> disabled, all accesses must go through TTEs.
>
> Currently OpenBSD disables passthru. The benefit is devices can't blindly 
> access sensitive memory unless it is explicitly shared. Note that this is how 
> it is on most architectures anyway. However, the consequence of managing the 
> TTEs is that it is expensive, and extremely so in some cases.
>
> dlg

Last iteration from me on this one.

Why is this not a problem on some other architectures?

I'd have thought DMA and hardware being assigned transitory addresses
(from memory allocator or other OS subsystem or driver) mostly is a
lower level phenomenon and memcpy normally applies on higher levels,
isn't it so - for networking for instance, mbuf's take over soon above
the driver level. Does OpenBSD have a pool of to-be-mbufs and it asks
network drivers to write received ethernet frames directly to them, and
similarly transmit ethernet frames directly from mbuf:s?

What potentially or clearly sensitive memory would passthru expose,
driver-owned structures only or all memory?



Re: bypass support for iommu on sparc64

2018-10-19 Thread David Gwynne



> On 20 Oct 2018, at 11:56 am, Joseph Mayer  wrote:
> 
> ‐‐‐ Original Message ‐‐‐
> On Friday, October 19, 2018 5:15 PM, Mark Kettenis  
> wrote:
> 
>>> Date: Fri, 19 Oct 2018 10:22:30 +1000
>>> From: David Gwynne da...@gwynne.id.au
>>> On Wed, May 10, 2017 at 10:09:59PM +1000, David Gwynne wrote:
>>> 
>>>> On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote:
>>>>
>>>>> on modern sparc64s (think fire or sparc enterprise Mx000 boxes),
>>>>> setting up and tearing down the translation table entries (TTEs)
>>>>> is very expensive. so expensive that the cost of doing it for disk
>>>>> io has a noticeable impact on compile times.
>>>>> now that there's a BUS_DMA_64BIT flag, we can use that to decide
>>>>> to bypass the iommu for devices that set that flag, therefore
>>>>> avoiding the cost of handling the TTEs.
> 
> Question for the unintroduced, what's the scope here, TTE is Sparc's
> page table and reconfiguring them at (process) context switch is
> expensive and this suggestion removes the need for TTE:s for hardware
> device access, but those don't change at context switch?

We're talking about an IOMMU here, not a traditional MMU providing virtual 
addresses for programs. An IOMMU sits between physical memory and the devices 
in a machine. It allows DMA addresses to be mapped to different parts of physical 
memory. Mapping physical memory to a DMA virtual address (or dva) is how a 
device that only understands 32bit addresses can work in a 64bit machine. 
Memory at high addresses gets mapped to a low dva.

This is done at runtime on OpenBSD when DMA mappings are loaded or unloaded by 
populating Translation Table Entries (TTEs). The TTEs effectively form a table or 
array mapping DVA pages to physical addresses. Generally device drivers load 
and unload dma memory for every I/O or packet or so on.

IOMMUs in sparc64s have some more features than this. Because they really are 
between memory and the devices they can act as a gatekeeper for all memory 
accesses. They also have a toggle that can allow a device to have direct or 
passthru access to physical memory. If passthru is enabled, there's a special 
address range that effectively maps all physical memory into a DVA range. 
Devices can be pointed at it without having to manage TTEs. When passthru is 
disabled, all accesses must go through TTEs.
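
A rough sketch of the passthru window in C, assuming the window is formed by OR-ing a fixed high-bit mask into the physical address. The mask is the `ihw_bypass` value from the diff quoted in this thread; the `toy_` helpers and the address-width check are hypothetical and only there to show the idea.

```c
#include <assert.h>
#include <stdint.h>

/* Bypass window mask, taken from ihw_bypass in the quoted diff. */
#define TOY_BYPASS (0x3fffULL << 50)

/* Form a bypass DVA; assumed here to be the physical address OR'd into the window. */
uint64_t
toy_bypass_dva(uint64_t pa)
{
	assert((pa & TOY_BYPASS) == 0);   /* pa must fit below the window bits */
	return TOY_BYPASS | pa;
}

/* Does this DVA fall inside the bypass window, so that no TTE is consulted? */
int
toy_is_bypass(uint64_t dva)
{
	return (dva & TOY_BYPASS) == TOY_BYPASS;
}

/* Can a device or bridge limited to `bits` address bits emit this DVA at all? */
int
toy_fits_in(uint64_t dva, unsigned int bits)
{
	assert(bits > 0 && bits < 64);
	return (dva >> bits) == 0;
}
```

With a limit of 40 bits, no address inside the window fits, which is one way to picture why a bridge with a narrower address path cannot use the bypass range.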

Currently OpenBSD disables passthru. The benefit is devices can't blindly 
access sensitive memory unless it is explicitly shared. Note that unrestricted 
device access is how it is on most architectures anyway. However, the consequence of managing the 
TTEs is that it is expensive, and extremely so in some cases.

dlg



Re: bypass support for iommu on sparc64

2018-10-19 Thread Joseph Mayer
‐‐‐ Original Message ‐‐‐
On Friday, October 19, 2018 5:15 PM, Mark Kettenis  
wrote:

> > Date: Fri, 19 Oct 2018 10:22:30 +1000
> > From: David Gwynne da...@gwynne.id.au
> > On Wed, May 10, 2017 at 10:09:59PM +1000, David Gwynne wrote:
> >
> > > On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote:
> > >
> > > > on modern sparc64s (think fire or sparc enterprise Mx000 boxes),
> > > > setting up and tearing down the translation table entries (TTEs)
> > > > is very expensive. so expensive that the cost of doing it for disk
> > > > io has a noticeable impact on compile times.
> > > > now that there's a BUS_DMA_64BIT flag, we can use that to decide
> > > > to bypass the iommu for devices that set that flag, therefore
> > > > avoiding the cost of handling the TTEs.

Question for the unintroduced, what's the scope here, TTE is Sparc's
page table and reconfiguring them at (process) context switch is
expensive and this suggestion removes the need for TTE:s for hardware
device access, but those don't change at context switch?



Re: bypass support for iommu on sparc64

2018-10-19 Thread David Gwynne



> On 19 Oct 2018, at 9:59 pm, Andrew Grillet  wrote:
> 
> Is the setup and teardown per transfer or when file is opened and closed?
> Or is it set up once per context switch of task?
> 
> I am partly interested cos I would like to improve mt one day (as user of
> tape
> and Sparc64 Txxx) if I get the time.
> 
> Andrew

The overhead is per transfer. You might not get better performance out of a 
tx000 because of the PCIe bridges involved, but you may also be lucky and not 
have that bridge in the way.

> 
> 
> 
> On Fri, 19 Oct 2018 at 10:22, Mark Kettenis  wrote:
> 
>>> Date: Fri, 19 Oct 2018 10:22:30 +1000
>>> From: David Gwynne 
>>> 
>>> On Wed, May 10, 2017 at 10:09:59PM +1000, David Gwynne wrote:
>>>> On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote:
>>>>> on modern sparc64s (think fire or sparc enterprise Mx000 boxes),
>>>>> setting up and tearing down the translation table entries (TTEs)
>>>>> is very expensive. so expensive that the cost of doing it for disk
>>>>> io has a noticeable impact on compile times.
>>>>>
>>>>> now that there's a BUS_DMA_64BIT flag, we can use that to decide
>>>>> to bypass the iommu for devices that set that flag, therefore
>>>>> avoiding the cost of handling the TTEs.
>>>>>
>>>>> the following diff adds support for bypass mappings to the iommu
>>>>> code on sparc64. it's based on a diff from kettenis@ back in 2009.
>>>>> the main changes are around coping with the differences between
>>>>> schizo/psycho and fire/oberon.
>>>>>
>>>>> the differences between the chips are now represented by a iommu_hw
>>>>> struct. these differences include how to enable the iommu (now via
>>>>> a function pointer), and masks for bypass addresses.
>>>>>
>>>>> ive tested this on oberon (on an m4000) and schizo (on a v880).
>>>>> however, the bypass code isnt working on fire (v245s). to cope with
>>>>> that for now, the iommu_hw struct lets drivers mask flag bits that
>>>>> are handled when creating a dmamap. this means fire boards will
>>>>> ignore BUS_DMA_64BIT until i can figure out whats wrong with them.
>>>>
>>>> i figured it out. it turns out Fire was working fine. however,
>>>> enabling 64bit dva on the onboard devices didnt work because the
>>>> serverworks/broadcom pcie to pcix bridge can only handle dma addresses
>>>> in the low 40 bits. because the fire bypass window is higher than
>>>> this, the bridge would choke and things stopped working.
>>>>
>>>> the updated diff attempts to handle this. basically when probing
>>>> the bridge, the platform creates a custom dma tag for it. this tag
>>>> intercepts bus_dmamap_create and clears the BUS_DMA_64BIT flag before
>>>> handing it up to the parent bridge, which is pyro in my situation.
>>>> it looks like early sun4v boxes could make use of this too.
>>>>
>>>>> i have not tested this on psycho yet. if anyone has such a machine
>>>>> and is willing to work with me to figure it out, please talk to me.
>>>>
>>>> i still dont have psycho reports.
>>> 
>>> Would anyone object if I committed this? I've been running it for the
>>> last release or two without issues, but with significant improvements in
>>> performance on the machines involved.
>> 
>> At the price of giving all PCI devices unrestricted access to memory.
>> 
>> So I'm not eager to do this, especially since on sun4v hardware bypassing
>> the iommu isn't possible as soon as multiple domains are enabled.  And
>> we lose a useful diagnostic when developing drivers.  Are you sure the
>> iommu overhead can't be reduced some other way?  At some point we
>> probably want to add iommu support on amd64 and arm64, but if that
>> comes with a similar overhead as on sparc64 that's going to be a bit
>> of an issue.
>> 
>>>> Index: dev/iommu.c
>>>> ===
>>>> RCS file: /cvs/src/sys/arch/sparc64/dev/iommu.c,v
>>>> retrieving revision 1.74
>>>> diff -u -p -r1.74 iommu.c
>>>> --- dev/iommu.c 30 Apr 2017 16:45:45 -  1.74
>>>> +++ dev/iommu.c 10 May 2017 12:00:09 -
>>>> @@ -100,6 +100,25 @@ void iommu_iomap_clear_pages(struct iomm
>>>>  void _iommu_dvmamap_sync(bus_dma_tag_t, bus_dma_tag_t, bus_dmamap_t,
>>>>  bus_addr_t, bus_size_t, int);
>>>>
>>>> +void iommu_hw_enable(struct iommu_state *);
>>>> +
>>>> +const struct iommu_hw iommu_hw_default = {
>>>> +   .ihw_enable = iommu_hw_enable,
>>>> +
>>>> +   .ihw_dvma_pa= IOTTE_PAMASK,
>>>> +
>>>> +   .ihw_bypass = 0x3fffUL << 50,
>>>> +   .ihw_bypass_nc  = 0,
>>>> +   .ihw_bypass_ro  = 0,
>>>> +};
>>>> +
>>>> +void
>>>> +iommu_hw_enable(struct iommu_state *is)
>>>> +{
>>>> +   IOMMUREG_WRITE(is, iommu_tsb, is->is_ptsb);
>>>> +   IOMMUREG_WRITE(is, iommu_cr, IOMMUCR_EN | (is->is_tsbsize << 16));
>>>> +}
>>>> +
>>>>  /*
>>>>   * Initiate an STC entry flush.
>>>>   */
>>>> @@ -125,7 +144,8 @@ iommu_strbuf_flush(struct strbuf_ctl *sb
>>>>   * - create a private DVMA map.
>>>>   */
>>>>  void
>>>> -iommu_init(char *name, struct 

Re: bypass support for iommu on sparc64

2018-10-19 Thread David Gwynne



> On 19 Oct 2018, at 7:15 pm, Mark Kettenis  wrote:
> 
>> Date: Fri, 19 Oct 2018 10:22:30 +1000
>> From: David Gwynne 
>> 
>> On Wed, May 10, 2017 at 10:09:59PM +1000, David Gwynne wrote:
>>> On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote:
>>>> on modern sparc64s (think fire or sparc enterprise Mx000 boxes),
>>>> setting up and tearing down the translation table entries (TTEs)
>>>> is very expensive. so expensive that the cost of doing it for disk
>>>> io has a noticeable impact on compile times.
>>>>
>>>> now that there's a BUS_DMA_64BIT flag, we can use that to decide
>>>> to bypass the iommu for devices that set that flag, therefore
>>>> avoiding the cost of handling the TTEs.
>>>>
>>>> the following diff adds support for bypass mappings to the iommu
>>>> code on sparc64. it's based on a diff from kettenis@ back in 2009.
>>>> the main changes are around coping with the differences between
>>>> schizo/psycho and fire/oberon.
>>>>
>>>> the differences between the chips are now represented by a iommu_hw
>>>> struct. these differences include how to enable the iommu (now via
>>>> a function pointer), and masks for bypass addresses.
>>>>
>>>> ive tested this on oberon (on an m4000) and schizo (on a v880).
>>>> however, the bypass code isnt working on fire (v245s). to cope with
>>>> that for now, the iommu_hw struct lets drivers mask flag bits that
>>>> are handled when creating a dmamap. this means fire boards will
>>>> ignore BUS_DMA_64BIT until i can figure out whats wrong with them.
>>> 
>>> i figured it out. it turns out Fire was working fine. however,
>>> enabling 64bit dva on the onboard devices didnt work because the
>>> serverworks/broadcom pcie to pcix bridge can only handle dma addresses
>>> in the low 40 bits. because the fire bypass window is higher than
>>> this, the bridge would choke and things stopped working.
>>> 
>>> the updated diff attempts to handle this. basically when probing
>>> the bridge, the platform creates a custom dma tag for it. this tag
>>> intercepts bus_dmamap_create and clears the BUS_DMA_64BIT flag before
>>> handing it up to the parent bridge, which is pyro in my situation.
>>> it looks like early sun4v boxes could make use of this too.
>>> 
>>>> i have not tested this on psycho yet. if anyone has such a machine
>>>> and is willing to work with me to figure it out, please talk to me.
>>> 
>>> i still dont have psycho reports.
>> 
>> Would anyone object if I committed this? I've been running it for the
>> last release or two without issues, but with significant improvements in
>> performance on the machines involved.
> 
> At the price of giving all PCI devices unrestricted access to memory.
> 
> So I'm not eager to do this, especially since on sun4v hardware bypassing
> the iommu isn't possible as soon as multiple domains are enabled.  And
> we lose a useful diagnostic when developing drivers.  Are you sure the
> iommu overhead can't be reduced some other way?  At some point we
> probably want to add iommu support on amd64 and arm64, but if that
> comes with a similar overhead as on sparc64 that's going to be a bit
> of an issue.

First, note that it doesn't turn the iommu off. By default drivers still go 
through it unless they opt out with BUS_DMA_64BIT. This is because the iommu is 
still in between the device and ram, and it provides the passthru window up at 
0xfffc. 

As an aside, and as hinted at in my previous mails, it means that machines with 
ppb6 at pci6 dev 0 function 0 "ServerWorks PCIE-PCIX" rev 0xb5 in them cannot 
really use BUS_DMA_64BIT cos those bridges are buggy and don't handle DVAs 
above 48 or 56 bits or something. That bridge is used in v215s, v245s, v445s, 
t1000s, and so on.
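
The workaround described earlier in the thread (a custom dma tag for the bridge that clears BUS_DMA_64BIT before passing the request to its parent) could look roughly like this. It is a self-contained sketch with made-up types: `toy_dma_tag`, `TOY_DMA_64BIT` and its value are stand-ins for illustration, not the real bus_dma(9) interface.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_DMA_64BIT 0x2000   /* stand-in for BUS_DMA_64BIT; the value is made up */

struct toy_dma_tag {
	struct toy_dma_tag *parent;
	/* Create a map; *mapflags receives the flags the map ends up with. */
	int (*create)(struct toy_dma_tag *, int flags, int *mapflags);
};

/* Host bridge: honours whatever flags reach it. */
int
toy_host_create(struct toy_dma_tag *t, int flags, int *mapflags)
{
	(void)t;
	*mapflags = flags;
	return 0;
}

/* Limited bridge: strip the 64-bit flag, then hand the request to the parent. */
int
toy_bridge_create(struct toy_dma_tag *t, int flags, int *mapflags)
{
	assert(t->parent != NULL);
	flags &= ~TOY_DMA_64BIT;
	return t->parent->create(t->parent, flags, mapflags);
}
```

A driver attached below the bridge tag can keep asking for 64-bit DMA; the request silently degrades to a normal TTE-backed mapping, while drivers attached directly to the host tag keep the flag.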

I have a theory that because of that bridge, there was a meme going around Sun 
at the time that it was cheaper to memcpy in and out of preallocated DMA memory 
than it was to do DMA for every packet or disk I/O or whatever.

Which leads me to the conclusion that an alternative to using the passthru 
window would be to have bus_dma preallocate the dmaable memory and bounce in 
and out of it. The performance hit I'm trying to avoid is with setting up and 
tearing down the translation table entries. If they already exist, you avoid 
that hit.

Bouncing is complicated though, both in the bus_dma layer, and especially by 
pushing it into drivers.
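
As a minimal sketch of the bounce idea in C: the buffer below stands in for DMA memory that was mapped once, so its device address never changes and each I/O costs a memcpy instead of a map and unmap. The `toy_bounce` type, its size and its functions are hypothetical, and none of the real complications (segments, alignment, sync) are modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_BOUNCE_SIZE 2048

struct toy_bounce {
	unsigned char buf[TOY_BOUNCE_SIZE]; /* stands in for memory mapped once at attach */
	uint64_t dva;                       /* device address of buf, fixed for its lifetime */
};

/* Transmit: copy the payload in and return the fixed DVA, or 0 if it is too large. */
uint64_t
toy_bounce_tx(struct toy_bounce *b, const void *data, size_t len)
{
	assert(b->dva != 0);
	if (len > TOY_BOUNCE_SIZE)
		return 0;
	memcpy(b->buf, data, len);
	return b->dva;
}

/* Receive: copy what the device wrote back out. Returns 0 on success, -1 if too large. */
int
toy_bounce_rx(const struct toy_bounce *b, void *out, size_t len)
{
	if (len > TOY_BOUNCE_SIZE)
		return -1;
	memcpy(out, b->buf, len);
	return 0;
}
```

The trade is one copy per transfer against one TTE setup and teardown per transfer; which side wins depends on the machine, as the numbers in this thread suggest.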

The amount of overhead varies between machines. It seems less of a difference 
with nvme(4) in a slot that is not behind the dodgy bridge on a v245. It was 
about 20 or 30 percent of a difference with gem(4) and tcpbench in a v880 
(schizo). It is particularly bad on the M4000 I have, which is why I looked into 
this. There are orders of magnitude of difference between tcpbench results with 
a tweaked ix(4) and this diff on or off. We've not enabled mitigations before 
because of performance hits less than this.

dlg

> 
>>> Index: dev/iommu.c
>>> 

Re: bypass support for iommu on sparc64

2018-10-19 Thread Andrew Grillet
Is the setup and teardown per transfer or when file is opened and closed?
Or is it set up once per context switch of task?

I am partly interested cos I would like to improve mt one day (as user of
tape
and Sparc64 Txxx) if I get the time.

Andrew



On Fri, 19 Oct 2018 at 10:22, Mark Kettenis  wrote:

> > Date: Fri, 19 Oct 2018 10:22:30 +1000
> > From: David Gwynne 
> >
> > On Wed, May 10, 2017 at 10:09:59PM +1000, David Gwynne wrote:
> > > On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote:
> > > > on modern sparc64s (think fire or sparc enterprise Mx000 boxes),
> > > > setting up and tearing down the translation table entries (TTEs)
> > > > is very expensive. so expensive that the cost of doing it for disk
> > > > io has a noticeable impact on compile times.
> > > >
> > > > now that there's a BUS_DMA_64BIT flag, we can use that to decide
> > > > to bypass the iommu for devices that set that flag, therefore
> > > > avoiding the cost of handling the TTEs.
> > > >
> > > > the following diff adds support for bypass mappings to the iommu
> > > > code on sparc64. it's based on a diff from kettenis@ back in 2009.
> > > > the main changes are around coping with the differences between
> > > > schizo/psycho and fire/oberon.
> > > >
> > > > the differences between the chips are now represented by a iommu_hw
> > > > struct. these differences include how to enable the iommu (now via
> > > > a function pointer), and masks for bypass addresses.
> > > >
> > > > ive tested this on oberon (on an m4000) and schizo (on a v880).
> > > > however, the bypass code isnt working on fire (v245s). to cope with
> > > > that for now, the iommu_hw struct lets drivers mask flag bits that
> > > > are handled when creating a dmamap. this means fire boards will
> > > > ignore BUS_DMA_64BIT until i can figure out whats wrong with them.
> > >
> > > i figured it out. it turns out Fire was working fine. however,
> > > enabling 64bit dva on the onboard devices didnt work because the
> > > serverworks/broadcom pcie to pcix bridge can only handle dma addresses
> > > in the low 40 bits. because the fire bypass window is higher than
> > > this, the bridge would choke and things stopped working.
> > >
> > > the updated diff attempts to handle this. basically when probing
> > > the bridge, the platform creates a custom dma tag for it. this tag
> > > intercepts bus_dmamap_create and clears the BUS_DMA_64BIT flag before
> > > handing it up to the parent bridge, which is pyro in my situation.
> > > it looks like early sun4v boxes could make use of this too.
> > >
> > > > i have not tested this on psycho yet. if anyone has such a machine
> > > > and is willing to work with me to figure it out, please talk to me.
> > >
> > > i still dont have psycho reports.
> >
> > Would anyone object if I committed this? I've been running it for the
> > last release or two without issues, but with significant improvements in
> > performance on the machines involved.
>
> At the price of giving all PCI devices unrestricted access to memory.
>
> So I'm not eager to do this, especially since on sun4v hardware bypassing
> the iommu isn't possible as soon as multiple domains are enabled.  And
> we lose a useful diagnostic when developing drivers.  Are you sure the
> iommu overhead can't be reduced some other way?  At some point we
> probably want to add iommu support on amd64 and arm64, but if that
> comes with a similar overhead as on sparc64 that's going to be a bit
> of an issue.
>
> > > Index: dev/iommu.c
> > > ===
> > > RCS file: /cvs/src/sys/arch/sparc64/dev/iommu.c,v
> > > retrieving revision 1.74
> > > diff -u -p -r1.74 iommu.c
> > > --- dev/iommu.c 30 Apr 2017 16:45:45 -  1.74
> > > +++ dev/iommu.c 10 May 2017 12:00:09 -
> > > @@ -100,6 +100,25 @@ void iommu_iomap_clear_pages(struct iomm
> > >  void _iommu_dvmamap_sync(bus_dma_tag_t, bus_dma_tag_t, bus_dmamap_t,
> > >  bus_addr_t, bus_size_t, int);
> > >
> > > +void iommu_hw_enable(struct iommu_state *);
> > > +
> > > +const struct iommu_hw iommu_hw_default = {
> > > +   .ihw_enable = iommu_hw_enable,
> > > +
> > > +   .ihw_dvma_pa= IOTTE_PAMASK,
> > > +
> > > +   .ihw_bypass = 0x3fffUL << 50,
> > > +   .ihw_bypass_nc  = 0,
> > > +   .ihw_bypass_ro  = 0,
> > > +};
> > > +
> > > +void
> > > +iommu_hw_enable(struct iommu_state *is)
> > > +{
> > > +   IOMMUREG_WRITE(is, iommu_tsb, is->is_ptsb);
> > > +   IOMMUREG_WRITE(is, iommu_cr, IOMMUCR_EN | (is->is_tsbsize << 16));
> > > +}
> > > +
> > >  /*
> > >   * Initiate an STC entry flush.
> > >   */
> > > @@ -125,7 +144,8 @@ iommu_strbuf_flush(struct strbuf_ctl *sb
> > >   * - create a private DVMA map.
> > >   */
> > >  void
> > > -iommu_init(char *name, struct iommu_state *is, int tsbsize, u_int32_t iovabase)
> > > +iommu_init(char *name, const struct iommu_hw *ihw, struct iommu_state *is,
> > > +int tsbsize, u_int32_t iovabase)
> > >  {
> > > psize_t 

Re: bypass support for iommu on sparc64

2018-10-19 Thread Mark Kettenis
> Date: Fri, 19 Oct 2018 10:22:30 +1000
> From: David Gwynne 
> 
> On Wed, May 10, 2017 at 10:09:59PM +1000, David Gwynne wrote:
> > On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote:
> > > on modern sparc64s (think fire or sparc enterprise Mx000 boxes),
> > > setting up and tearing down the translation table entries (TTEs)
> > > is very expensive. so expensive that the cost of doing it for disk
> > > io has a noticeable impact on compile times.
> > > 
> > > now that there's a BUS_DMA_64BIT flag, we can use that to decide
> > > to bypass the iommu for devices that set that flag, therefore
> > > avoiding the cost of handling the TTEs.
> > > 
> > > the following diff adds support for bypass mappings to the iommu
> > > code on sparc64. it's based on a diff from kettenis@ back in 2009.
> > > the main changes are around coping with the differences between
> > > schizo/psycho and fire/oberon.
> > > 
> > > the differences between the chips are now represented by an iommu_hw
> > > struct. these differences include how to enable the iommu (now via
> > > a function pointer), and masks for bypass addresses.
> > > 
> > > i've tested this on oberon (on an m4000) and schizo (on a v880).
> > > however, the bypass code isn't working on fire (v245s). to cope with
> > > that for now, the iommu_hw struct lets drivers mask flag bits that
> > > are handled when creating a dmamap. this means fire boards will
> > > ignore BUS_DMA_64BIT until i can figure out what's wrong with them.
> > 
> > i figured it out. it turns out Fire was working fine. however,
> > enabling 64bit dva on the onboard devices didn't work because the
> > serverworks/broadcom pcie to pcix bridge can only handle dma addresses
> > in the low 40 bits. because the fire bypass window is higher than
> > this, the bridge would choke and things stopped working.
> > 
> > the updated diff attempts to handle this. basically when probing
> > the bridge, the platform creates a custom dma tag for it. this tag
> > intercepts bus_dmamap_create and clears the BUS_DMA_64BIT flag before
> > handing it up to the parent bridge, which is pyro in my situation.
> > it looks like early sun4v boxes could make use of this too.
> > 
> > > i have not tested this on psycho yet. if anyone has such a machine
> > > and is willing to work with me to figure it out, please talk to me.
> > 
> > i still don't have psycho reports.
> 
> Would anyone object if I committed this? I've been running it for the
> last release or two without issues, but with significant improvements in
> performance on the machines involved.

At the price of giving all PCI devices unrestricted access to memory.

So I'm not eager to do this, especially since on sun4v hardware bypassing
the iommu isn't possible as soon as multiple domains are enabled.  And
we lose a useful diagnostic when developing drivers.  Are you sure the
iommu overhead can't be reduced some other way?  At some point we
probably want to add iommu support on amd64 and arm64, but if that
comes with a similar overhead as on sparc64 that's going to be a bit
of an issue.

> > Index: dev/iommu.c
> > ===
> > RCS file: /cvs/src/sys/arch/sparc64/dev/iommu.c,v
> > retrieving revision 1.74
> > diff -u -p -r1.74 iommu.c
> > --- dev/iommu.c 30 Apr 2017 16:45:45 -  1.74
> > +++ dev/iommu.c 10 May 2017 12:00:09 -
> > @@ -100,6 +100,25 @@ void iommu_iomap_clear_pages(struct iomm
> >  void _iommu_dvmamap_sync(bus_dma_tag_t, bus_dma_tag_t, bus_dmamap_t,
> >  bus_addr_t, bus_size_t, int);
> >  
> > +void iommu_hw_enable(struct iommu_state *);
> > +
> > +const struct iommu_hw iommu_hw_default = {
> > +   .ihw_enable = iommu_hw_enable,
> > +
> > +   .ihw_dvma_pa= IOTTE_PAMASK,
> > +
> > +   .ihw_bypass = 0x3fffUL << 50,
> > +   .ihw_bypass_nc  = 0,
> > +   .ihw_bypass_ro  = 0,
> > +};
> > +
> > +void
> > +iommu_hw_enable(struct iommu_state *is)
> > +{
> > +   IOMMUREG_WRITE(is, iommu_tsb, is->is_ptsb);
> > +   IOMMUREG_WRITE(is, iommu_cr, IOMMUCR_EN | (is->is_tsbsize << 16));
> > +}
> > +
> >  /*
> >   * Initiate an STC entry flush.
> >   */
> > @@ -125,7 +144,8 @@ iommu_strbuf_flush(struct strbuf_ctl *sb
> >   * - create a private DVMA map.
> >   */
> >  void
> > -iommu_init(char *name, struct iommu_state *is, int tsbsize, u_int32_t iovabase)
> > +iommu_init(char *name, const struct iommu_hw *ihw, struct iommu_state *is,
> > +int tsbsize, u_int32_t iovabase)
> >  {
> > psize_t size;
> > vaddr_t va;
> > @@ -149,13 +169,9 @@ iommu_init(char *name, struct iommu_stat
> >  * be hard-wired, so we read the start and size from the PROM and
> >  * just use those values.
> >  */
> > -   if (strncmp(name, "pyro", 4) == 0) {
> > -   is->is_cr = IOMMUREG_READ(is, iommu_cr);
> > -   is->is_cr &= ~IOMMUCR_FIRE_BE;
> > -   is->is_cr |= (IOMMUCR_FIRE_SE | IOMMUCR_FIRE_CM_EN |
> > -   IOMMUCR_FIRE_TE);
> > -   } else 
> > - 

Re: bypass support for iommu on sparc64

2018-10-18 Thread David Gwynne
On Wed, May 10, 2017 at 10:09:59PM +1000, David Gwynne wrote:
> On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote:
> > on modern sparc64s (think fire or sparc enterprise Mx000 boxes),
> > setting up and tearing down the translation table entries (TTEs)
> > is very expensive. so expensive that the cost of doing it for disk
> > io has a noticeable impact on compile times.
> > 
> > now that there's a BUS_DMA_64BIT flag, we can use that to decide
> > to bypass the iommu for devices that set that flag, therefore
> > avoiding the cost of handling the TTEs.
> > 
> > the following diff adds support for bypass mappings to the iommu
> > code on sparc64. it's based on a diff from kettenis@ back in 2009.
> > the main changes are around coping with the differences between
> > schizo/psycho and fire/oberon.
> > 
> > the differences between the chips are now represented by an iommu_hw
> > struct. these differences include how to enable the iommu (now via
> > a function pointer), and masks for bypass addresses.
> > 
> > i've tested this on oberon (on an m4000) and schizo (on a v880).
> > however, the bypass code isn't working on fire (v245s). to cope with
> > that for now, the iommu_hw struct lets drivers mask flag bits that
> > are handled when creating a dmamap. this means fire boards will
> > ignore BUS_DMA_64BIT until i can figure out what's wrong with them.
> 
> i figured it out. it turns out Fire was working fine. however,
> enabling 64bit dva on the onboard devices didn't work because the
> serverworks/broadcom pcie to pcix bridge can only handle dma addresses
> in the low 40 bits. because the fire bypass window is higher than
> this, the bridge would choke and things stopped working.
> 
> the updated diff attempts to handle this. basically when probing
> the bridge, the platform creates a custom dma tag for it. this tag
> intercepts bus_dmamap_create and clears the BUS_DMA_64BIT flag before
> handing it up to the parent bridge, which is pyro in my situation.
> it looks like early sun4v boxes could make use of this too.
> 
> > i have not tested this on psycho yet. if anyone has such a machine
> > and is willing to work with me to figure it out, please talk to me.
> 
> i still don't have psycho reports.

Would anyone object if I committed this? I've been running it for the
last release or two without issues, but with significant improvements in
performance on the machines involved.

> Index: dev/iommu.c
> ===
> RCS file: /cvs/src/sys/arch/sparc64/dev/iommu.c,v
> retrieving revision 1.74
> diff -u -p -r1.74 iommu.c
> --- dev/iommu.c   30 Apr 2017 16:45:45 -  1.74
> +++ dev/iommu.c   10 May 2017 12:00:09 -
> @@ -100,6 +100,25 @@ void iommu_iomap_clear_pages(struct iomm
>  void _iommu_dvmamap_sync(bus_dma_tag_t, bus_dma_tag_t, bus_dmamap_t,
>  bus_addr_t, bus_size_t, int);
>  
> +void iommu_hw_enable(struct iommu_state *);
> +
> +const struct iommu_hw iommu_hw_default = {
> + .ihw_enable = iommu_hw_enable,
> +
> + .ihw_dvma_pa= IOTTE_PAMASK,
> +
> + .ihw_bypass = 0x3fffUL << 50,
> + .ihw_bypass_nc  = 0,
> + .ihw_bypass_ro  = 0,
> +};
> +
> +void
> +iommu_hw_enable(struct iommu_state *is)
> +{
> + IOMMUREG_WRITE(is, iommu_tsb, is->is_ptsb);
> + IOMMUREG_WRITE(is, iommu_cr, IOMMUCR_EN | (is->is_tsbsize << 16));
> +}
> +
>  /*
>   * Initiate an STC entry flush.
>   */
> @@ -125,7 +144,8 @@ iommu_strbuf_flush(struct strbuf_ctl *sb
>   *   - create a private DVMA map.
>   */
>  void
> -iommu_init(char *name, struct iommu_state *is, int tsbsize, u_int32_t iovabase)
> +iommu_init(char *name, const struct iommu_hw *ihw, struct iommu_state *is,
> +int tsbsize, u_int32_t iovabase)
>  {
>   psize_t size;
>   vaddr_t va;
> @@ -149,13 +169,9 @@ iommu_init(char *name, struct iommu_stat
>* be hard-wired, so we read the start and size from the PROM and
>* just use those values.
>*/
> - if (strncmp(name, "pyro", 4) == 0) {
> - is->is_cr = IOMMUREG_READ(is, iommu_cr);
> - is->is_cr &= ~IOMMUCR_FIRE_BE;
> - is->is_cr |= (IOMMUCR_FIRE_SE | IOMMUCR_FIRE_CM_EN |
> - IOMMUCR_FIRE_TE);
> - } else 
> - is->is_cr = IOMMUCR_EN;
> +
> + is->is_hw = ihw;
> +
>   is->is_tsbsize = tsbsize;
>   if (iovabase == (u_int32_t)-1) {
>   is->is_dvmabase = IOTSB_VSTART(is->is_tsbsize);
> @@ -237,15 +253,6 @@ iommu_init(char *name, struct iommu_stat
>   mtx_init(&is->is_mtx, IPL_HIGH);
>  
>   /*
> -  * Set the TSB size.  The relevant bits were moved to the TSB
> -  * base register in the PCIe host bridges.
> -  */
> - if (strncmp(name, "pyro", 4) == 0)
> - is->is_ptsb |= is->is_tsbsize;
> - else
> - is->is_cr |= (is->is_tsbsize << 16);
> -
> - /*
>* Now actually start up the IOMMU.
>*/
>   iommu_reset(is);
> @@ -262,10 +269,7 

Re: bypass support for iommu on sparc64

2017-05-10 Thread David Gwynne
On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote:
> on modern sparc64s (think fire or sparc enterprise Mx000 boxes),
> setting up and tearing down the translation table entries (TTEs)
> is very expensive. so expensive that the cost of doing it for disk
> io has a noticeable impact on compile times.
> 
> now that there's a BUS_DMA_64BIT flag, we can use that to decide
> to bypass the iommu for devices that set that flag, therefore
> avoiding the cost of handling the TTEs.
> 
> the following diff adds support for bypass mappings to the iommu
> code on sparc64. it's based on a diff from kettenis@ back in 2009.
> the main changes are around coping with the differences between
> schizo/psycho and fire/oberon.
> 
> the differences between the chips are now represented by an iommu_hw
> struct. these differences include how to enable the iommu (now via
> a function pointer), and masks for bypass addresses.
> 
> i've tested this on oberon (on an m4000) and schizo (on a v880).
> however, the bypass code isn't working on fire (v245s). to cope with
> that for now, the iommu_hw struct lets drivers mask flag bits that
> are handled when creating a dmamap. this means fire boards will
> ignore BUS_DMA_64BIT until i can figure out what's wrong with them.

i figured it out. it turns out Fire was working fine. however,
enabling 64bit dva on the onboard devices didn't work because the
serverworks/broadcom pcie to pcix bridge can only handle dma addresses
in the low 40 bits. because the fire bypass window is higher than
this, the bridge would choke and things stopped working.

the updated diff attempts to handle this. basically when probing
the bridge, the platform creates a custom dma tag for it. this tag
intercepts bus_dmamap_create and clears the BUS_DMA_64BIT flag before
handing it up to the parent bridge, which is pyro in my situation.
it looks like early sun4v boxes could make use of this too.
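
The tag interception described above can be sketched in plain C, outside the
kernel. This is only an illustration under stated assumptions: every name
prefixed with sketch_ is hypothetical, the flag value is made up, and the real
bus_dma tag carries many more methods than the single one shown here. The
point is just the shape: the bridge's tag overrides one method, strips the
64-bit flag, and forwards the call to its parent.

```c
#include <stddef.h>

#define SKETCH_DMA_64BIT	0x1000	/* hypothetical flag value */

/* A cut-down DMA tag: a parent link plus one overridable method. */
struct sketch_dma_tag {
	struct sketch_dma_tag *parent;
	int	(*dmamap_create)(struct sketch_dma_tag *, int, int *);
};

/* Host bridge: the end of the chain.  Reports the flags it received. */
int
sketch_host_create(struct sketch_dma_tag *t, int flags, int *seen)
{
	(void)t;
	*seen = flags;
	return (0);
}

/* Bridge tag: clear the 64-bit flag, then hand the request upwards. */
int
sketch_bridge_create(struct sketch_dma_tag *t, int flags, int *seen)
{
	flags &= ~SKETCH_DMA_64BIT;
	return (t->parent->dmamap_create(t->parent, flags, seen));
}
```

A device behind the bridge that asks for a 64-bit map therefore ends up with
an ordinary translated map, while devices attached directly to the host
bridge keep the flag.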

> i have not tested this on psycho yet. if anyone has such a machine
> and is willing to work with me to figure it out, please talk to me.

i still don't have psycho reports.

Index: dev/iommu.c
===
RCS file: /cvs/src/sys/arch/sparc64/dev/iommu.c,v
retrieving revision 1.74
diff -u -p -r1.74 iommu.c
--- dev/iommu.c 30 Apr 2017 16:45:45 -  1.74
+++ dev/iommu.c 10 May 2017 12:00:09 -
@@ -100,6 +100,25 @@ void iommu_iomap_clear_pages(struct iomm
 void _iommu_dvmamap_sync(bus_dma_tag_t, bus_dma_tag_t, bus_dmamap_t,
 bus_addr_t, bus_size_t, int);
 
+void iommu_hw_enable(struct iommu_state *);
+
+const struct iommu_hw iommu_hw_default = {
+   .ihw_enable = iommu_hw_enable,
+
+   .ihw_dvma_pa= IOTTE_PAMASK,
+
+   .ihw_bypass = 0x3fffUL << 50,
+   .ihw_bypass_nc  = 0,
+   .ihw_bypass_ro  = 0,
+};
+
+void
+iommu_hw_enable(struct iommu_state *is)
+{
+   IOMMUREG_WRITE(is, iommu_tsb, is->is_ptsb);
+   IOMMUREG_WRITE(is, iommu_cr, IOMMUCR_EN | (is->is_tsbsize << 16));
+}
+
 /*
  * Initiate an STC entry flush.
  */
@@ -125,7 +144,8 @@ iommu_strbuf_flush(struct strbuf_ctl *sb
  * - create a private DVMA map.
  */
 void
-iommu_init(char *name, struct iommu_state *is, int tsbsize, u_int32_t iovabase)
+iommu_init(char *name, const struct iommu_hw *ihw, struct iommu_state *is,
+int tsbsize, u_int32_t iovabase)
 {
psize_t size;
vaddr_t va;
@@ -149,13 +169,9 @@ iommu_init(char *name, struct iommu_stat
 * be hard-wired, so we read the start and size from the PROM and
 * just use those values.
 */
-   if (strncmp(name, "pyro", 4) == 0) {
-   is->is_cr = IOMMUREG_READ(is, iommu_cr);
-   is->is_cr &= ~IOMMUCR_FIRE_BE;
-   is->is_cr |= (IOMMUCR_FIRE_SE | IOMMUCR_FIRE_CM_EN |
-   IOMMUCR_FIRE_TE);
-   } else 
-   is->is_cr = IOMMUCR_EN;
+
+   is->is_hw = ihw;
+
is->is_tsbsize = tsbsize;
if (iovabase == (u_int32_t)-1) {
is->is_dvmabase = IOTSB_VSTART(is->is_tsbsize);
@@ -237,15 +253,6 @@ iommu_init(char *name, struct iommu_stat
mtx_init(&is->is_mtx, IPL_HIGH);
 
/*
-* Set the TSB size.  The relevant bits were moved to the TSB
-* base register in the PCIe host bridges.
-*/
-   if (strncmp(name, "pyro", 4) == 0)
-   is->is_ptsb |= is->is_tsbsize;
-   else
-   is->is_cr |= (is->is_tsbsize << 16);
-
-   /*
 * Now actually start up the IOMMU.
 */
iommu_reset(is);
@@ -262,10 +269,7 @@ iommu_reset(struct iommu_state *is)
 {
int i;
 
-   IOMMUREG_WRITE(is, iommu_tsb, is->is_ptsb);
-
-   /* Enable IOMMU */
-   IOMMUREG_WRITE(is, iommu_cr, is->is_cr);
+   (*is->is_hw->ihw_enable)(is);
 
for (i = 0; i < 2; ++i) {
struct strbuf_ctl *sb = is->is_sb[i];
@@ -280,7 +284,7 @@ iommu_reset(struct iommu_state *is)
printf(", STC%d 

bypass support for iommu on sparc64

2017-05-07 Thread David Gwynne
on modern sparc64s (think fire or sparc enterprise Mx000 boxes),
setting up and tearing down the translation table entries (TTEs)
is very expensive. so expensive that the cost of doing it for disk
io has a noticeable impact on compile times.

now that there's a BUS_DMA_64BIT flag, we can use that to decide
to bypass the iommu for devices that set that flag, therefore
avoiding the cost of handling the TTEs.
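
As a rough illustration of that decision, here is a standalone C sketch. It is
not the code from the diff: the sketch_ names and the flag value are
hypothetical, and the assumption that a bypass address is the physical address
with the bypass window ORed in is mine. Only the window value (0x3fff shifted
left by 50) is taken from iommu_hw_default below; it is written with ULL here
so it stays 64 bits wide on any host.

```c
#include <stdint.h>

/* Hypothetical stand-in for BUS_DMA_64BIT; the value is made up. */
#define SKETCH_DMA_64BIT	0x1000

/* Bypass window base, the value the diff gives iommu_hw_default. */
#define SKETCH_BYPASS		(0x3fffULL << 50)

/*
 * Pick the bus address for physical address pa.  With the 64-bit flag
 * set, the address is pa with the bypass window ORed in and no TTE is
 * written.  Without it, *translated is set so the caller knows it has
 * to take the usual TTE path instead.
 */
uint64_t
sketch_dma_addr(uint64_t pa, int flags, int *translated)
{
	if (flags & SKETCH_DMA_64BIT) {
		*translated = 0;
		return (pa | SKETCH_BYPASS);
	}
	*translated = 1;
	return (pa);	/* placeholder; a real DVA would come from the TSB */
}
```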

the following diff adds support for bypass mappings to the iommu
code on sparc64. it's based on a diff from kettenis@ back in 2009.
the main changes are around coping with the differences between
schizo/psycho and fire/oberon.

the differences between the chips are now represented by an iommu_hw
struct. these differences include how to enable the iommu (now via
a function pointer), and masks for bypass addresses.
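
A self-contained sketch of that layout, for readers who want to see the
pattern without the kernel around it. The sketch_ names, the enable bit and
the second chip's bypass value are hypothetical placeholders. The one detail
borrowed from the diff is where the TSB size goes: shifted into the control
register on the older bridges, ORed into the TSB base register on the PCIe
ones.

```c
#include <stdint.h>

#define SKETCH_CR_EN	0x1ULL	/* hypothetical enable bit */

struct sketch_iommu;

/* Per-chip description: how to enable, and the bypass mask. */
struct sketch_iommu_hw {
	void		(*ihw_enable)(struct sketch_iommu *);
	uint64_t	ihw_bypass;
};

/* Software state; tsb and cr stand in for the hardware registers. */
struct sketch_iommu {
	const struct sketch_iommu_hw *hw;
	uint64_t	ptsb;
	uint64_t	tsbsize;
	uint64_t	tsb;
	uint64_t	cr;
};

/* Older bridges: the TSB size goes into the control register. */
static void
sketch_enable_default(struct sketch_iommu *is)
{
	is->tsb = is->ptsb;
	is->cr = SKETCH_CR_EN | (is->tsbsize << 16);
}

/* PCIe bridges: the TSB size lives in the TSB base register instead. */
static void
sketch_enable_pcie(struct sketch_iommu *is)
{
	is->tsb = is->ptsb | is->tsbsize;
	is->cr = SKETCH_CR_EN;
}

const struct sketch_iommu_hw sketch_hw_default = {
	.ihw_enable = sketch_enable_default,
	.ihw_bypass = 0x3fffULL << 50,
};

const struct sketch_iommu_hw sketch_hw_pcie = {
	.ihw_enable = sketch_enable_pcie,
	.ihw_bypass = 0,	/* made-up placeholder, not the real mask */
};

/* Common code no longer compares chip names; it just calls the hook. */
void
sketch_reset(struct sketch_iommu *is)
{
	(*is->hw->ihw_enable)(is);
}
```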

i've tested this on oberon (on an m4000) and schizo (on a v880).
however, the bypass code isn't working on fire (v245s). to cope with
that for now, the iommu_hw struct lets drivers mask flag bits that
are handled when creating a dmamap. this means fire boards will
ignore BUS_DMA_64BIT until i can figure out what's wrong with them.
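
The flag masking can be pictured with a few lines of standalone C. This is a
sketch under assumptions: the sketch_ names and the flag value are
hypothetical, and only the "flags &= ~mask" step mirrors what the diff does
with ihw_dma_flags in iommu_dvmamap_create.

```c
#define SKETCH_DMA_64BIT	0x1000	/* hypothetical flag value */

struct sketch_hw {
	int	ihw_dma_flags;	/* flag bits this chip must not be offered */
};

/* A chip with working bypass masks nothing. */
const struct sketch_hw sketch_hw_ok = { .ihw_dma_flags = 0 };

/* A chip with bypass switched off for now masks the 64-bit flag. */
const struct sketch_hw sketch_hw_nobypass = {
	.ihw_dma_flags = SKETCH_DMA_64BIT,
};

/* Drop whatever flag bits the chip description says to ignore. */
int
sketch_filter_flags(const struct sketch_hw *hw, int flags)
{
	return (flags & ~hw->ihw_dma_flags);
}
```

Because the mask lives in the per-chip struct, turning bypass back on for a
chip later would only mean clearing one field.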

i have not tested this on psycho yet. if anyone has such a machine
and is willing to work with me to figure it out, please talk to me.

Index: dev/iommu.c
===
RCS file: /cvs/src/sys/arch/sparc64/dev/iommu.c,v
retrieving revision 1.74
diff -u -p -r1.74 iommu.c
--- dev/iommu.c 30 Apr 2017 16:45:45 -  1.74
+++ dev/iommu.c 8 May 2017 00:45:05 -
@@ -100,6 +100,25 @@ void iommu_iomap_clear_pages(struct iomm
 void _iommu_dvmamap_sync(bus_dma_tag_t, bus_dma_tag_t, bus_dmamap_t,
 bus_addr_t, bus_size_t, int);
 
+void iommu_hw_enable(struct iommu_state *);
+
+const struct iommu_hw iommu_hw_default = {
+   .ihw_enable = iommu_hw_enable,
+
+   .ihw_dvma_pa= IOTTE_PAMASK,
+
+   .ihw_bypass = 0x3fffUL << 50,
+   .ihw_bypass_nc  = 0,
+   .ihw_bypass_ro  = 0,
+};
+
+void
+iommu_hw_enable(struct iommu_state *is)
+{
+   IOMMUREG_WRITE(is, iommu_tsb, is->is_ptsb);
+   IOMMUREG_WRITE(is, iommu_cr, IOMMUCR_EN | (is->is_tsbsize << 16));
+}
+
 /*
  * Initiate an STC entry flush.
  */
@@ -125,7 +144,8 @@ iommu_strbuf_flush(struct strbuf_ctl *sb
  * - create a private DVMA map.
  */
 void
-iommu_init(char *name, struct iommu_state *is, int tsbsize, u_int32_t iovabase)
+iommu_init(char *name, const struct iommu_hw *ihw, struct iommu_state *is,
+int tsbsize, u_int32_t iovabase)
 {
psize_t size;
vaddr_t va;
@@ -149,13 +169,9 @@ iommu_init(char *name, struct iommu_stat
 * be hard-wired, so we read the start and size from the PROM and
 * just use those values.
 */
-   if (strncmp(name, "pyro", 4) == 0) {
-   is->is_cr = IOMMUREG_READ(is, iommu_cr);
-   is->is_cr &= ~IOMMUCR_FIRE_BE;
-   is->is_cr |= (IOMMUCR_FIRE_SE | IOMMUCR_FIRE_CM_EN |
-   IOMMUCR_FIRE_TE);
-   } else 
-   is->is_cr = IOMMUCR_EN;
+
+   is->is_hw = ihw;
+
is->is_tsbsize = tsbsize;
if (iovabase == (u_int32_t)-1) {
is->is_dvmabase = IOTSB_VSTART(is->is_tsbsize);
@@ -237,15 +253,6 @@ iommu_init(char *name, struct iommu_stat
mtx_init(&is->is_mtx, IPL_HIGH);
 
/*
-* Set the TSB size.  The relevant bits were moved to the TSB
-* base register in the PCIe host bridges.
-*/
-   if (strncmp(name, "pyro", 4) == 0)
-   is->is_ptsb |= is->is_tsbsize;
-   else
-   is->is_cr |= (is->is_tsbsize << 16);
-
-   /*
 * Now actually start up the IOMMU.
 */
iommu_reset(is);
@@ -262,10 +269,7 @@ iommu_reset(struct iommu_state *is)
 {
int i;
 
-   IOMMUREG_WRITE(is, iommu_tsb, is->is_ptsb);
-
-   /* Enable IOMMU */
-   IOMMUREG_WRITE(is, iommu_cr, is->is_cr);
+   (*is->is_hw->ihw_enable)(is);
 
for (i = 0; i < 2; ++i) {
struct strbuf_ctl *sb = is->is_sb[i];
@@ -280,7 +284,7 @@ iommu_reset(struct iommu_state *is)
printf(", STC%d enabled", i);
}
 
-   if (is->is_flags & IOMMU_FLUSH_CACHE)
+   if (ISSET(is->is_hw->ihw_flags, IOMMU_HW_FLUSH_CACHE))
IOMMUREG_WRITE(is, iommu_cache_invalidate, -1ULL);
 }
 
@@ -433,7 +437,7 @@ iommu_extract(struct iommu_state *is, bu
if (dva >= is->is_dvmabase && dva <= is->is_dvmaend)
tte = is->is_tsb[IOTSBSLOT(dva, is->is_tsbsize)];
 
-   return (tte & IOTTE_PAMASK);
+   return (tte & is->is_hw->ihw_dvma_pa);
 }
 
 /*
@@ -601,8 +605,11 @@ iommu_dvmamap_create(bus_dma_tag_t t, bu
 {
int ret;
bus_dmamap_t map;
+   struct iommu_state *is = sb->sb_iommu;
struct iommu_map_state *ims;
 
+   flags &= ~is->is_hw->ihw_dma_flags;
+
BUS_DMA_FIND_PARENT(t, _dmamap_create);
ret =