On 08/13/2014 01:28 AM, Alexander Graf wrote: > > On 12.08.14 17:10, Alexey Kardashevskiy wrote: >> On 08/12/2014 07:37 PM, Alexander Graf wrote: >>> On 12.08.14 02:03, Alexey Kardashevskiy wrote: >>>> On 08/12/2014 03:30 AM, Alexander Graf wrote: >>>>> On 11.08.14 17:01, Alexey Kardashevskiy wrote: >>>>>> On 08/11/2014 10:02 PM, Alexander Graf wrote: >>>>>>> On 31.07.14 11:34, Alexey Kardashevskiy wrote: >>>>>>>> This implements DDW for VFIO. Host kernel support is required for >>>>>>>> this. >>>>>>>> >>>>>>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> >>>>>>>> --- >>>>>>>> hw/ppc/spapr_pci_vfio.c | 75 >>>>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++ >>>>>>>> 1 file changed, 75 insertions(+) >>>>>>>> >>>>>>>> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c >>>>>>>> index d3bddf2..dc443e2 100644 >>>>>>>> --- a/hw/ppc/spapr_pci_vfio.c >>>>>>>> +++ b/hw/ppc/spapr_pci_vfio.c >>>>>>>> @@ -69,6 +69,77 @@ static void >>>>>>>> spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp) >>>>>>>> /* Register default 32bit DMA window */ >>>>>>>> memory_region_add_subregion(&sphb->iommu_root, >>>>>>>> tcet->bus_offset, >>>>>>>> spapr_tce_get_iommu(tcet)); >>>>>>>> + >>>>>>>> + sphb->ddw_supported = !!(info.flags & >>>>>>>> VFIO_IOMMU_SPAPR_TCE_FLAG_DDW); >>>>>>>> +} >>>>>>>> + >>>>>>>> +static int spapr_pci_vfio_ddw_query(sPAPRPHBState *sphb, >>>>>>>> + uint32_t *windows_available, >>>>>>>> + uint32_t *page_size_mask) >>>>>>>> +{ >>>>>>>> + sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb); >>>>>>>> + struct vfio_iommu_spapr_tce_query query = { .argsz = >>>>>>>> sizeof(query) }; >>>>>>>> + int ret; >>>>>>>> + >>>>>>>> + ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid, >>>>>>>> + VFIO_IOMMU_SPAPR_TCE_QUERY, &query); >>>>>>>> + if (ret) { >>>>>>>> + return ret; >>>>>>>> + } >>>>>>>> + >>>>>>>> + *windows_available = query.windows_available; >>>>>>>> + *page_size_mask = query.page_size_mask; >>>>>>>> + >>>>>>>> + return ret; >>>>>>>> +} >>>>>>>> + >>>>>>>> +static int spapr_pci_vfio_ddw_create(sPAPRPHBState *sphb, uint32_t >>>>>>>> page_shift, >>>>>>>> + uint32_t window_shift, uint32_t >>>>>>>> liobn, >>>>>>>> + sPAPRTCETable **ptcet) >>>>>>>> +{ >>>>>>>> + sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb); >>>>>>>> + struct vfio_iommu_spapr_tce_create create = { >>>>>>>> + .argsz = sizeof(create), >>>>>>>> + .page_shift = page_shift, >>>>>>>> + .window_shift = window_shift, >>>>>>>> + .start_addr = 0 >>>>>>>> + }; >>>>>>>> + int ret; >>>>>>>> + >>>>>>>> + ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid, >>>>>>>> + VFIO_IOMMU_SPAPR_TCE_CREATE, &create); >>>>>>>> + if (ret) { >>>>>>>> + return ret; >>>>>>>> + } >>>>>>>> + >>>>>>>> + *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn, >>>>>>>> create.start_addr, >>>>>>>> + page_shift, 1 << (window_shift - >>>>>>>> page_shift), >>>>>>> I spot a 1 without ULL again - this time it might work out ok, but >>>>>>> please >>>>>>> just always use ULL when you pass around addresses. >>>>>> My bad. I keep forgetting this, I'll adjust my own checkpatch.py :) >>>>>> >>>>>> >>>>>>> Please walk me though the abstraction levels on what each page size >>>>>>> honoration means. If I use THP, what page size granularity can I use >>>>>>> for >>>>>>> TCE entries? >>>>>> [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls >>>>>> support >>>>>> >>>>>> + const struct { int shift; uint32_t mask; } masks[] = { >>>>>> + { 12, DDW_PGSIZE_4K }, >>>>>> + { 16, DDW_PGSIZE_64K }, >>>>>> + { 24, DDW_PGSIZE_16M }, >>>>>> + { 25, DDW_PGSIZE_32M }, >>>>>> + { 26, DDW_PGSIZE_64M }, >>>>>> + { 27, DDW_PGSIZE_128M }, >>>>>> + { 28, DDW_PGSIZE_256M }, >>>>>> + { 34, DDW_PGSIZE_16G }, >>>>>> + }; >>>>>> >>>>>> >>>>>> Supported page sizes are returned by the host kernel via "query". For >>>>>> 16MB >>>>>> pages, page shift will return >>>>>> DDW_PGSIZE_4K|DDW_PGSIZE_64K|DDW_PGSIZE_16M. >>>>>> Or I did not understand the question... >>>>> Why do we care about the sizes? Anything bigger than what we support >>>>> should >>>>> always work, no? What happens if the guest creates a 16MB map but my >>>>> pages >>>>> are 4kb mapped? Wouldn't the same logic be able to deal with 16G pages? >>>> It is DMA memory, if I split "virtual" 16M page to a bunch of real 4K >>>> pages, I have to make sure these 16M are continuous - there will be one >>>> TCE >>>> entry for it and no more translations besides IOMMU. What do I miss now? >>> Who does the shadow translation where? Does it exist at all? >> IOMMU? I am not sure I am following you... This IOMMU will look as direct >> DMA for the guest but the real IOMMU table is sparse and it is populated >> via a bunch of H_PUT_TCE calls as the default small window. >> >> There is a direct mapping in the host called "bypass window" but it is not >> used here as sPAPR does not define that for paravirtualization. > > Ok, imagine I have 16MB of guest physical memory that is in reality backed > by 256 64k pages on the host. The guest wants to create a 16M TCE entry for > this (from its point of view contiguous) chunk of memory. > > Do we allow this?
No, we do not. We tell the guest what it can use. > Or do we force the guest to create 64k TCE entries? 16MB TCE pages are only allowed if qemu is running with hugepages. > If we allow it, why would we ever put any restriction at the upper end of > TCE entry sizes? If we already implement enough logic to map things lazily > around, we could as well have the guest create a 256M TCE entry and just > split it on the host view to 64k TCE entries. Oh, thiiiiiis is what you meant... Well, we could, just for now current linux guests support 4K/64K/16M only and they choose depending on what hypervisor supports - look at enable_ddw() in the guest. What you suggest seems to be an unnecessary code duplication for 16MB pages case. For bigger page sizes - for example, for 64GB guest, a TCE table with 16MB TCEs will be 32KB which is already awesome enough, no? -- Alexey