Re: [Qemu-devel] [PULL, 08/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)

2017-11-09 Thread David Gibson
On Wed, Nov 08, 2017 at 09:52:12AM +1100, Alexey Kardashevskiy wrote:
> On 07/11/17 19:42, Laurent Vivier wrote:
> > On 05/07/2016 07:31, David Gibson wrote:
> >> From: Alexey Kardashevskiy 
> >>
> >> This adds support for Dynamic DMA Windows (DDW) option defined by
> >> the SPAPR specification which allows to have additional DMA window(s)
> >>
> >> The "ddw" property is enabled by default on a PHB but for compatibility
> >> the pseries-2.6 machine and older disable it.
> >> This also creates a single DMA window for the older machines to
> >> maintain backward migration.
> >>
> >> This implements DDW for PHB with emulated and VFIO devices. The host
> >> kernel support is required. The advertised IOMMU page sizes are 4K and
> >> 64K; 16M pages are supported but not advertised by default, in order to
> >> enable them, the user has to specify "pgsz" property for PHB and
> >> enable huge pages for RAM.
> > 
> > Why is it not advirtised by default?
> 
> I do not remember clearly but this kind of automation is usually less
> manageable. What if we do not want huge IOMMU pages for some reason?

The reason is basically for migration.  Because of the compatibility
requirements for that, we generally try to avoid making any guest
visible changes to the virtual machine based on host capabilities or
configuration because it makes migration a total nightmare.


That said, in this case the difference which we'd need here -
availability of hugepages in the guest, is *already* guest visible for
the normal (non IOMMU) page mapping.  So maybe automatically changing
the IOMMU mask as well would be ok.  Or at least no worse than we have
already.

> > When we start qemu with hugepage memory ("mount -t hugetlbfs none
> > /mnt/kvm_hugepage" and ".. -mem-path /mnt/kvm_hugepage .."), we have an
> > ugly message:
> > 
> > "qemu-kvm: System page size 0x100 is not enabled in page_size_mask
> > (0x11000). Performance may be slow"
> > 
> > I understand if we want to use this with VFIO, we need something like
> > "-global spapr-pci-host-bridge.pgsz=0x1011000".
> > 
> > But is it needed if we don't use VFIO?
> 
> Yes, TCE tables are still created in KVM so the size matters.
> 
> > Is it a way QEMU adds automatically the 0x100 mask to page_size_mask?
> 
> No. That thing which decides about -mem-path should also add the pgsz mask.
> 
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [Qemu-devel] [PULL, 08/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)

2017-11-07 Thread Alexey Kardashevskiy
On 07/11/17 19:42, Laurent Vivier wrote:
> On 05/07/2016 07:31, David Gibson wrote:
>> From: Alexey Kardashevskiy 
>>
>> This adds support for Dynamic DMA Windows (DDW) option defined by
>> the SPAPR specification which allows to have additional DMA window(s)
>>
>> The "ddw" property is enabled by default on a PHB but for compatibility
>> the pseries-2.6 machine and older disable it.
>> This also creates a single DMA window for the older machines to
>> maintain backward migration.
>>
>> This implements DDW for PHB with emulated and VFIO devices. The host
>> kernel support is required. The advertised IOMMU page sizes are 4K and
>> 64K; 16M pages are supported but not advertised by default, in order to
>> enable them, the user has to specify "pgsz" property for PHB and
>> enable huge pages for RAM.
> 
> Why is it not advirtised by default?

I do not remember clearly but this kind of automation is usually less
manageable. What if we do not want huge IOMMU pages for some reason?


> When we start qemu with hugepage memory ("mount -t hugetlbfs none
> /mnt/kvm_hugepage" and ".. -mem-path /mnt/kvm_hugepage .."), we have an
> ugly message:
> 
> "qemu-kvm: System page size 0x100 is not enabled in page_size_mask
> (0x11000). Performance may be slow"
> 
> I understand if we want to use this with VFIO, we need something like
> "-global spapr-pci-host-bridge.pgsz=0x1011000".
> 
> But is it needed if we don't use VFIO?

Yes, TCE tables are still created in KVM so the size matters.

> Is it a way QEMU adds automatically the 0x100 mask to page_size_mask?

No. That thing which decides about -mem-path should also add the pgsz mask.


-- 
Alexey



Re: [Qemu-devel] [PULL, 08/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)

2017-11-07 Thread Laurent Vivier
On 05/07/2016 07:31, David Gibson wrote:
> From: Alexey Kardashevskiy 
> 
> This adds support for Dynamic DMA Windows (DDW) option defined by
> the SPAPR specification which allows to have additional DMA window(s)
> 
> The "ddw" property is enabled by default on a PHB but for compatibility
> the pseries-2.6 machine and older disable it.
> This also creates a single DMA window for the older machines to
> maintain backward migration.
> 
> This implements DDW for PHB with emulated and VFIO devices. The host
> kernel support is required. The advertised IOMMU page sizes are 4K and
> 64K; 16M pages are supported but not advertised by default, in order to
> enable them, the user has to specify "pgsz" property for PHB and
> enable huge pages for RAM.

Why is it not advirtised by default?

When we start qemu with hugepage memory ("mount -t hugetlbfs none
/mnt/kvm_hugepage" and ".. -mem-path /mnt/kvm_hugepage .."), we have an
ugly message:

"qemu-kvm: System page size 0x100 is not enabled in page_size_mask
(0x11000). Performance may be slow"

I understand if we want to use this with VFIO, we need something like
"-global spapr-pci-host-bridge.pgsz=0x1011000".

But is it needed if we don't use VFIO?
Is it a way QEMU adds automatically the 0x100 mask to page_size_mask?

Thanks,
Laurent



Re: [Qemu-devel] [PULL 08/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)

2016-07-11 Thread David Gibson
On Mon, Jul 11, 2016 at 04:21:29PM +0200, Paolo Bonzini wrote:
> 
> 
> On 05/07/2016 07:31, David Gibson wrote:
> >  
> > -if (tcet && tcet->nb_table) {
> > -spapr_tce_table_disable(tcet);
> > +if (tcet && tcet->nb_table) {
> > +spapr_tce_table_disable(tcet);
> > +}
> >  }
> >  
> >  /* Register default 32bit DMA window */
> > +tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[0]);
> 
> Should the statement below be wrapped by "if (tcet)"?

Not really.  That lookup should never fail (if we didn't allocate the
tcet before we would have failed initialization).  An assert(tcet)
might be appropriate, but I don't know it's worth adding at this
stage.
 
> Paolo
> 
> >  spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
> > sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT)
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [Qemu-devel] [PULL 08/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)

2016-07-11 Thread Paolo Bonzini


On 05/07/2016 07:31, David Gibson wrote:
>  
> -if (tcet && tcet->nb_table) {
> -spapr_tce_table_disable(tcet);
> +if (tcet && tcet->nb_table) {
> +spapr_tce_table_disable(tcet);
> +}
>  }
>  
>  /* Register default 32bit DMA window */
> +tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[0]);

Should the statement below be wrapped by "if (tcet)"?

Paolo

>  spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
> sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT)



[Qemu-devel] [PULL 08/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)

2016-07-04 Thread David Gibson
From: Alexey Kardashevskiy 

This adds support for Dynamic DMA Windows (DDW) option defined by
the SPAPR specification which allows to have additional DMA window(s)

The "ddw" property is enabled by default on a PHB but for compatibility
the pseries-2.6 machine and older disable it.
This also creates a single DMA window for the older machines to
maintain backward migration.

This implements DDW for PHB with emulated and VFIO devices. The host
kernel support is required. The advertised IOMMU page sizes are 4K and
64K; 16M pages are supported but not advertised by default, in order to
enable them, the user has to specify "pgsz" property for PHB and
enable huge pages for RAM.

The existing linux guests try creating one additional huge DMA window
with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
the guest switches to dma_direct_ops and never calls TCE hypercalls
(H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
and not waste time on map/unmap later. This adds a "dma64_win_addr"
property which is a bus address for the 64bit window and by default
set to 0x800... as this is what the modern POWER8 hardware
uses and this allows having emulated and VFIO devices on the same bus.

This adds 4 RTAS handlers:
* ibm,query-pe-dma-window
* ibm,create-pe-dma-window
* ibm,remove-pe-dma-window
* ibm,reset-pe-dma-window
These are registered from type_init() callback.

These RTAS handlers are implemented in a separate file to avoid polluting
spapr_iommu.c with PCI.

This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs
and updates all references to dma_liobn. However this does not add
64bit LIOBN to the migration stream as in fact even 32bit LIOBN is
rather pointless there (as it is a PHB property and the management
software can/should pass LIOBNs via CLI) but we keep it for the backward
migration support.

Signed-off-by: Alexey Kardashevskiy 
Signed-off-by: David Gibson 
---
 hw/ppc/Makefile.objs|   1 +
 hw/ppc/spapr.c  |   7 +-
 hw/ppc/spapr_pci.c  |  75 ---
 hw/ppc/spapr_rtas_ddw.c | 295 
 hw/ppc/trace-events |   4 +
 include/hw/pci-host/spapr.h |   8 +-
 include/hw/ppc/spapr.h  |  16 ++-
 7 files changed, 385 insertions(+), 21 deletions(-)
 create mode 100644 hw/ppc/spapr_rtas_ddw.c

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index 5cc6608..91a3420 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -8,6 +8,7 @@ obj-$(CONFIG_PSERIES) += spapr_cpu_core.o
 ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
 obj-y += spapr_pci_vfio.o
 endif
+obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
 # PowerPC 4xx boards
 obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
 obj-y += ppc4xx_pci.o
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 78ebd9e..9c1c2c1 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2489,7 +2489,12 @@ DEFINE_SPAPR_MACHINE(2_7, "2.7", true);
  * pseries-2.6
  */
 #define SPAPR_COMPAT_2_6 \
-HW_COMPAT_2_6
+HW_COMPAT_2_6 \
+{ \
+.driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
+.property = "ddw",\
+.value= stringify(off),\
+},
 
 static void spapr_machine_2_6_instance_options(MachineState *machine)
 {
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index cbb7cdd..949c44f 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -35,6 +35,7 @@
 #include "hw/ppc/spapr.h"
 #include "hw/pci-host/spapr.h"
 #include "exec/address-spaces.h"
+#include "exec/ram_addr.h"
 #include 
 #include "trace.h"
 #include "qemu/error-report.h"
@@ -45,6 +46,7 @@
 #include "hw/ppc/spapr_drc.h"
 #include "sysemu/device_tree.h"
 #include "sysemu/kvm.h"
+#include "sysemu/hostmem.h"
 
 #include "hw/vfio/vfio.h"
 
@@ -1304,11 +1306,14 @@ static void spapr_phb_realize(DeviceState *dev, Error 
**errp)
 PCIBus *bus;
 uint64_t msi_window_size = 4096;
 sPAPRTCETable *tcet;
+const unsigned windows_supported =
+sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
 
 if (sphb->index != (uint32_t)-1) {
 hwaddr windows_base;
 
-if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn != (uint32_t)-1)
+if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != 
(uint32_t)-1)
+|| (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2)
 || (sphb->mem_win_addr != (hwaddr)-1)
 || (sphb->io_win_addr != (hwaddr)-1)) {
 error_setg(errp, "Either \"index\" or other parameters must"
@@ -1323,7 +1328,9 @@ static void spapr_phb_realize(DeviceState *dev, Error 
**errp)
 }
 
 sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
-sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
+for (i = 0; i < windows_supported; ++i) {
+sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
+}
 
 windows_base = 

[Qemu-devel] [PULL 08/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)

2016-07-04 Thread David Gibson
From: Alexey Kardashevskiy 

This adds support for Dynamic DMA Windows (DDW) option defined by
the SPAPR specification which allows to have additional DMA window(s)

The "ddw" property is enabled by default on a PHB but for compatibility
the pseries-2.6 machine and older disable it.
This also creates a single DMA window for the older machines to
maintain backward migration.

This implements DDW for PHB with emulated and VFIO devices. The host
kernel support is required. The advertised IOMMU page sizes are 4K and
64K; 16M pages are supported but not advertised by default, in order to
enable them, the user has to specify "pgsz" property for PHB and
enable huge pages for RAM.

The existing linux guests try creating one additional huge DMA window
with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
the guest switches to dma_direct_ops and never calls TCE hypercalls
(H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
and not waste time on map/unmap later. This adds a "dma64_win_addr"
property which is a bus address for the 64bit window and by default
set to 0x800... as this is what the modern POWER8 hardware
uses and this allows having emulated and VFIO devices on the same bus.

This adds 4 RTAS handlers:
* ibm,query-pe-dma-window
* ibm,create-pe-dma-window
* ibm,remove-pe-dma-window
* ibm,reset-pe-dma-window
These are registered from type_init() callback.

These RTAS handlers are implemented in a separate file to avoid polluting
spapr_iommu.c with PCI.

This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs
and updates all references to dma_liobn. However this does not add
64bit LIOBN to the migration stream as in fact even 32bit LIOBN is
rather pointless there (as it is a PHB property and the management
software can/should pass LIOBNs via CLI) but we keep it for the backward
migration support.

Signed-off-by: Alexey Kardashevskiy 
Signed-off-by: David Gibson 
---
 hw/ppc/Makefile.objs|   1 +
 hw/ppc/spapr.c  |   7 +-
 hw/ppc/spapr_pci.c  |  75 ---
 hw/ppc/spapr_rtas_ddw.c | 295 
 hw/ppc/trace-events |   4 +
 include/hw/pci-host/spapr.h |   8 +-
 include/hw/ppc/spapr.h  |  16 ++-
 7 files changed, 385 insertions(+), 21 deletions(-)
 create mode 100644 hw/ppc/spapr_rtas_ddw.c

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index 5cc6608..91a3420 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -8,6 +8,7 @@ obj-$(CONFIG_PSERIES) += spapr_cpu_core.o
 ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
 obj-y += spapr_pci_vfio.o
 endif
+obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
 # PowerPC 4xx boards
 obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
 obj-y += ppc4xx_pci.o
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 78ebd9e..9c1c2c1 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2489,7 +2489,12 @@ DEFINE_SPAPR_MACHINE(2_7, "2.7", true);
  * pseries-2.6
  */
 #define SPAPR_COMPAT_2_6 \
-HW_COMPAT_2_6
+HW_COMPAT_2_6 \
+{ \
+.driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
+.property = "ddw",\
+.value= stringify(off),\
+},
 
 static void spapr_machine_2_6_instance_options(MachineState *machine)
 {
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index cbb7cdd..949c44f 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -35,6 +35,7 @@
 #include "hw/ppc/spapr.h"
 #include "hw/pci-host/spapr.h"
 #include "exec/address-spaces.h"
+#include "exec/ram_addr.h"
 #include 
 #include "trace.h"
 #include "qemu/error-report.h"
@@ -45,6 +46,7 @@
 #include "hw/ppc/spapr_drc.h"
 #include "sysemu/device_tree.h"
 #include "sysemu/kvm.h"
+#include "sysemu/hostmem.h"
 
 #include "hw/vfio/vfio.h"
 
@@ -1304,11 +1306,14 @@ static void spapr_phb_realize(DeviceState *dev, Error 
**errp)
 PCIBus *bus;
 uint64_t msi_window_size = 4096;
 sPAPRTCETable *tcet;
+const unsigned windows_supported =
+sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
 
 if (sphb->index != (uint32_t)-1) {
 hwaddr windows_base;
 
-if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn != (uint32_t)-1)
+if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != 
(uint32_t)-1)
+|| (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2)
 || (sphb->mem_win_addr != (hwaddr)-1)
 || (sphb->io_win_addr != (hwaddr)-1)) {
 error_setg(errp, "Either \"index\" or other parameters must"
@@ -1323,7 +1328,9 @@ static void spapr_phb_realize(DeviceState *dev, Error 
**errp)
 }
 
 sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
-sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
+for (i = 0; i < windows_supported; ++i) {
+sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
+}
 
 windows_base =