Re: [Qemu-devel] [PULL, 08/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
On Wed, Nov 08, 2017 at 09:52:12AM +1100, Alexey Kardashevskiy wrote: > On 07/11/17 19:42, Laurent Vivier wrote: > > On 05/07/2016 07:31, David Gibson wrote: > >> From: Alexey Kardashevskiy> >> > >> This adds support for Dynamic DMA Windows (DDW) option defined by > >> the SPAPR specification which allows to have additional DMA window(s) > >> > >> The "ddw" property is enabled by default on a PHB but for compatibility > >> the pseries-2.6 machine and older disable it. > >> This also creates a single DMA window for the older machines to > >> maintain backward migration. > >> > >> This implements DDW for PHB with emulated and VFIO devices. The host > >> kernel support is required. The advertised IOMMU page sizes are 4K and > >> 64K; 16M pages are supported but not advertised by default, in order to > >> enable them, the user has to specify "pgsz" property for PHB and > >> enable huge pages for RAM. > > > > Why is it not advirtised by default? > > I do not remember clearly but this kind of automation is usually less > manageable. What if we do not want huge IOMMU pages for some reason? The reason is basically for migration. Because of the compatibility requirements for that, we generally try to avoid making any guest visible changes to the virtual machine based on host capabilities or configuration because it makes migration a total nightmare. That said, in this case the difference which we'd need here - availability of hugepages in the guest, is *already* guest visible for the normal (non IOMMU) page mapping. So maybe automatically changing the IOMMU mask as well would be ok. Or at least no worse than we have already. > > When we start qemu with hugepage memory ("mount -t hugetlbfs none > > /mnt/kvm_hugepage" and ".. -mem-path /mnt/kvm_hugepage .."), we have an > > ugly message: > > > > "qemu-kvm: System page size 0x100 is not enabled in page_size_mask > > (0x11000). Performance may be slow" > > > > I understand if we want to use this with VFIO, we need something like > > "-global spapr-pci-host-bridge.pgsz=0x1011000". > > > > But is it needed if we don't use VFIO? > > Yes, TCE tables are still created in KVM so the size matters. > > > Is it a way QEMU adds automatically the 0x100 mask to page_size_mask? > > No. That thing which decides about -mem-path should also add the pgsz mask. > > -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson signature.asc Description: PGP signature
Re: [Qemu-devel] [PULL, 08/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
On 07/11/17 19:42, Laurent Vivier wrote: > On 05/07/2016 07:31, David Gibson wrote: >> From: Alexey Kardashevskiy>> >> This adds support for Dynamic DMA Windows (DDW) option defined by >> the SPAPR specification which allows to have additional DMA window(s) >> >> The "ddw" property is enabled by default on a PHB but for compatibility >> the pseries-2.6 machine and older disable it. >> This also creates a single DMA window for the older machines to >> maintain backward migration. >> >> This implements DDW for PHB with emulated and VFIO devices. The host >> kernel support is required. The advertised IOMMU page sizes are 4K and >> 64K; 16M pages are supported but not advertised by default, in order to >> enable them, the user has to specify "pgsz" property for PHB and >> enable huge pages for RAM. > > Why is it not advirtised by default? I do not remember clearly but this kind of automation is usually less manageable. What if we do not want huge IOMMU pages for some reason? > When we start qemu with hugepage memory ("mount -t hugetlbfs none > /mnt/kvm_hugepage" and ".. -mem-path /mnt/kvm_hugepage .."), we have an > ugly message: > > "qemu-kvm: System page size 0x100 is not enabled in page_size_mask > (0x11000). Performance may be slow" > > I understand if we want to use this with VFIO, we need something like > "-global spapr-pci-host-bridge.pgsz=0x1011000". > > But is it needed if we don't use VFIO? Yes, TCE tables are still created in KVM so the size matters. > Is it a way QEMU adds automatically the 0x100 mask to page_size_mask? No. That thing which decides about -mem-path should also add the pgsz mask. -- Alexey
Re: [Qemu-devel] [PULL, 08/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
On 05/07/2016 07:31, David Gibson wrote: > From: Alexey Kardashevskiy> > This adds support for Dynamic DMA Windows (DDW) option defined by > the SPAPR specification which allows to have additional DMA window(s) > > The "ddw" property is enabled by default on a PHB but for compatibility > the pseries-2.6 machine and older disable it. > This also creates a single DMA window for the older machines to > maintain backward migration. > > This implements DDW for PHB with emulated and VFIO devices. The host > kernel support is required. The advertised IOMMU page sizes are 4K and > 64K; 16M pages are supported but not advertised by default, in order to > enable them, the user has to specify "pgsz" property for PHB and > enable huge pages for RAM. Why is it not advirtised by default? When we start qemu with hugepage memory ("mount -t hugetlbfs none /mnt/kvm_hugepage" and ".. -mem-path /mnt/kvm_hugepage .."), we have an ugly message: "qemu-kvm: System page size 0x100 is not enabled in page_size_mask (0x11000). Performance may be slow" I understand if we want to use this with VFIO, we need something like "-global spapr-pci-host-bridge.pgsz=0x1011000". But is it needed if we don't use VFIO? Is it a way QEMU adds automatically the 0x100 mask to page_size_mask? Thanks, Laurent
Re: [Qemu-devel] [PULL 08/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
On Mon, Jul 11, 2016 at 04:21:29PM +0200, Paolo Bonzini wrote: > > > On 05/07/2016 07:31, David Gibson wrote: > > > > -if (tcet && tcet->nb_table) { > > -spapr_tce_table_disable(tcet); > > +if (tcet && tcet->nb_table) { > > +spapr_tce_table_disable(tcet); > > +} > > } > > > > /* Register default 32bit DMA window */ > > +tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[0]); > > Should the statement below be wrapped by "if (tcet)"? Not really. That lookup should never fail (if we didn't allocate the tcet before we would have failed initialization). An assert(tcet) might be appropriate, but I don't know it's worth adding at this stage. > Paolo > > > spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr, > > sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT) > -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson signature.asc Description: PGP signature
Re: [Qemu-devel] [PULL 08/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
On 05/07/2016 07:31, David Gibson wrote: > > -if (tcet && tcet->nb_table) { > -spapr_tce_table_disable(tcet); > +if (tcet && tcet->nb_table) { > +spapr_tce_table_disable(tcet); > +} > } > > /* Register default 32bit DMA window */ > +tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[0]); Should the statement below be wrapped by "if (tcet)"? Paolo > spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr, > sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT)
[Qemu-devel] [PULL 08/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
From: Alexey KardashevskiyThis adds support for Dynamic DMA Windows (DDW) option defined by the SPAPR specification which allows to have additional DMA window(s) The "ddw" property is enabled by default on a PHB but for compatibility the pseries-2.6 machine and older disable it. This also creates a single DMA window for the older machines to maintain backward migration. This implements DDW for PHB with emulated and VFIO devices. The host kernel support is required. The advertised IOMMU page sizes are 4K and 64K; 16M pages are supported but not advertised by default, in order to enable them, the user has to specify "pgsz" property for PHB and enable huge pages for RAM. The existing linux guests try creating one additional huge DMA window with 64K or 16MB pages and map the entire guest RAM to. If succeeded, the guest switches to dma_direct_ops and never calls TCE hypercalls (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM and not waste time on map/unmap later. This adds a "dma64_win_addr" property which is a bus address for the 64bit window and by default set to 0x800... as this is what the modern POWER8 hardware uses and this allows having emulated and VFIO devices on the same bus. This adds 4 RTAS handlers: * ibm,query-pe-dma-window * ibm,create-pe-dma-window * ibm,remove-pe-dma-window * ibm,reset-pe-dma-window These are registered from type_init() callback. These RTAS handlers are implemented in a separate file to avoid polluting spapr_iommu.c with PCI. This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs and updates all references to dma_liobn. However this does not add 64bit LIOBN to the migration stream as in fact even 32bit LIOBN is rather pointless there (as it is a PHB property and the management software can/should pass LIOBNs via CLI) but we keep it for the backward migration support. Signed-off-by: Alexey Kardashevskiy Signed-off-by: David Gibson --- hw/ppc/Makefile.objs| 1 + hw/ppc/spapr.c | 7 +- hw/ppc/spapr_pci.c | 75 --- hw/ppc/spapr_rtas_ddw.c | 295 hw/ppc/trace-events | 4 + include/hw/pci-host/spapr.h | 8 +- include/hw/ppc/spapr.h | 16 ++- 7 files changed, 385 insertions(+), 21 deletions(-) create mode 100644 hw/ppc/spapr_rtas_ddw.c diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs index 5cc6608..91a3420 100644 --- a/hw/ppc/Makefile.objs +++ b/hw/ppc/Makefile.objs @@ -8,6 +8,7 @@ obj-$(CONFIG_PSERIES) += spapr_cpu_core.o ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy) obj-y += spapr_pci_vfio.o endif +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o # PowerPC 4xx boards obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o obj-y += ppc4xx_pci.o diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c index 78ebd9e..9c1c2c1 100644 --- a/hw/ppc/spapr.c +++ b/hw/ppc/spapr.c @@ -2489,7 +2489,12 @@ DEFINE_SPAPR_MACHINE(2_7, "2.7", true); * pseries-2.6 */ #define SPAPR_COMPAT_2_6 \ -HW_COMPAT_2_6 +HW_COMPAT_2_6 \ +{ \ +.driver = TYPE_SPAPR_PCI_HOST_BRIDGE,\ +.property = "ddw",\ +.value= stringify(off),\ +}, static void spapr_machine_2_6_instance_options(MachineState *machine) { diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c index cbb7cdd..949c44f 100644 --- a/hw/ppc/spapr_pci.c +++ b/hw/ppc/spapr_pci.c @@ -35,6 +35,7 @@ #include "hw/ppc/spapr.h" #include "hw/pci-host/spapr.h" #include "exec/address-spaces.h" +#include "exec/ram_addr.h" #include #include "trace.h" #include "qemu/error-report.h" @@ -45,6 +46,7 @@ #include "hw/ppc/spapr_drc.h" #include "sysemu/device_tree.h" #include "sysemu/kvm.h" +#include "sysemu/hostmem.h" #include "hw/vfio/vfio.h" @@ -1304,11 +1306,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp) PCIBus *bus; uint64_t msi_window_size = 4096; sPAPRTCETable *tcet; +const unsigned windows_supported = +sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1; if (sphb->index != (uint32_t)-1) { hwaddr windows_base; -if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn != (uint32_t)-1) +if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1) +|| (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2) || (sphb->mem_win_addr != (hwaddr)-1) || (sphb->io_win_addr != (hwaddr)-1)) { error_setg(errp, "Either \"index\" or other parameters must" @@ -1323,7 +1328,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp) } sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index; -sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0); +for (i = 0; i < windows_supported; ++i) { +sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i); +} windows_base =
[Qemu-devel] [PULL 08/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
From: Alexey KardashevskiyThis adds support for Dynamic DMA Windows (DDW) option defined by the SPAPR specification which allows to have additional DMA window(s) The "ddw" property is enabled by default on a PHB but for compatibility the pseries-2.6 machine and older disable it. This also creates a single DMA window for the older machines to maintain backward migration. This implements DDW for PHB with emulated and VFIO devices. The host kernel support is required. The advertised IOMMU page sizes are 4K and 64K; 16M pages are supported but not advertised by default, in order to enable them, the user has to specify "pgsz" property for PHB and enable huge pages for RAM. The existing linux guests try creating one additional huge DMA window with 64K or 16MB pages and map the entire guest RAM to. If succeeded, the guest switches to dma_direct_ops and never calls TCE hypercalls (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM and not waste time on map/unmap later. This adds a "dma64_win_addr" property which is a bus address for the 64bit window and by default set to 0x800... as this is what the modern POWER8 hardware uses and this allows having emulated and VFIO devices on the same bus. This adds 4 RTAS handlers: * ibm,query-pe-dma-window * ibm,create-pe-dma-window * ibm,remove-pe-dma-window * ibm,reset-pe-dma-window These are registered from type_init() callback. These RTAS handlers are implemented in a separate file to avoid polluting spapr_iommu.c with PCI. This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs and updates all references to dma_liobn. However this does not add 64bit LIOBN to the migration stream as in fact even 32bit LIOBN is rather pointless there (as it is a PHB property and the management software can/should pass LIOBNs via CLI) but we keep it for the backward migration support. Signed-off-by: Alexey Kardashevskiy Signed-off-by: David Gibson --- hw/ppc/Makefile.objs| 1 + hw/ppc/spapr.c | 7 +- hw/ppc/spapr_pci.c | 75 --- hw/ppc/spapr_rtas_ddw.c | 295 hw/ppc/trace-events | 4 + include/hw/pci-host/spapr.h | 8 +- include/hw/ppc/spapr.h | 16 ++- 7 files changed, 385 insertions(+), 21 deletions(-) create mode 100644 hw/ppc/spapr_rtas_ddw.c diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs index 5cc6608..91a3420 100644 --- a/hw/ppc/Makefile.objs +++ b/hw/ppc/Makefile.objs @@ -8,6 +8,7 @@ obj-$(CONFIG_PSERIES) += spapr_cpu_core.o ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy) obj-y += spapr_pci_vfio.o endif +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o # PowerPC 4xx boards obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o obj-y += ppc4xx_pci.o diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c index 78ebd9e..9c1c2c1 100644 --- a/hw/ppc/spapr.c +++ b/hw/ppc/spapr.c @@ -2489,7 +2489,12 @@ DEFINE_SPAPR_MACHINE(2_7, "2.7", true); * pseries-2.6 */ #define SPAPR_COMPAT_2_6 \ -HW_COMPAT_2_6 +HW_COMPAT_2_6 \ +{ \ +.driver = TYPE_SPAPR_PCI_HOST_BRIDGE,\ +.property = "ddw",\ +.value= stringify(off),\ +}, static void spapr_machine_2_6_instance_options(MachineState *machine) { diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c index cbb7cdd..949c44f 100644 --- a/hw/ppc/spapr_pci.c +++ b/hw/ppc/spapr_pci.c @@ -35,6 +35,7 @@ #include "hw/ppc/spapr.h" #include "hw/pci-host/spapr.h" #include "exec/address-spaces.h" +#include "exec/ram_addr.h" #include #include "trace.h" #include "qemu/error-report.h" @@ -45,6 +46,7 @@ #include "hw/ppc/spapr_drc.h" #include "sysemu/device_tree.h" #include "sysemu/kvm.h" +#include "sysemu/hostmem.h" #include "hw/vfio/vfio.h" @@ -1304,11 +1306,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp) PCIBus *bus; uint64_t msi_window_size = 4096; sPAPRTCETable *tcet; +const unsigned windows_supported = +sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1; if (sphb->index != (uint32_t)-1) { hwaddr windows_base; -if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn != (uint32_t)-1) +if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1) +|| (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2) || (sphb->mem_win_addr != (hwaddr)-1) || (sphb->io_win_addr != (hwaddr)-1)) { error_setg(errp, "Either \"index\" or other parameters must" @@ -1323,7 +1328,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp) } sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index; -sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0); +for (i = 0; i < windows_supported; ++i) { +sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i); +} windows_base =