Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW

Alexey Kardashevskiy Tue, 12 Aug 2014 17:19:49 -0700

On 08/13/2014 01:28 AM, Alexander Graf wrote:
> 
> On 12.08.14 17:10, Alexey Kardashevskiy wrote:
>> On 08/12/2014 07:37 PM, Alexander Graf wrote:
>>> On 12.08.14 02:03, Alexey Kardashevskiy wrote:
>>>> On 08/12/2014 03:30 AM, Alexander Graf wrote:
>>>>> On 11.08.14 17:01, Alexey Kardashevskiy wrote:
>>>>>> On 08/11/2014 10:02 PM, Alexander Graf wrote:
>>>>>>> On 31.07.14 11:34, Alexey Kardashevskiy wrote:
>>>>>>>> This implements DDW for VFIO. Host kernel support is required for
>>>>>>>> this.
>>>>>>>>
>>>>>>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>>>>>>>> ---
>>>>>>>>      hw/ppc/spapr_pci_vfio.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>      1 file changed, 75 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
>>>>>>>> index d3bddf2..dc443e2 100644
>>>>>>>> --- a/hw/ppc/spapr_pci_vfio.c
>>>>>>>> +++ b/hw/ppc/spapr_pci_vfio.c
>>>>>>>> @@ -69,6 +69,77 @@ static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
>>>>>>>>          /* Register default 32bit DMA window */
>>>>>>>>          memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
>>>>>>>>                                      spapr_tce_get_iommu(tcet));
>>>>>>>> +
>>>>>>>> +    sphb->ddw_supported = !!(info.flags & VFIO_IOMMU_SPAPR_TCE_FLAG_DDW);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int spapr_pci_vfio_ddw_query(sPAPRPHBState *sphb,
>>>>>>>> +                                    uint32_t *windows_available,
>>>>>>>> +                                    uint32_t *page_size_mask)
>>>>>>>> +{
>>>>>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>>>>>> +    struct vfio_iommu_spapr_tce_query query = { .argsz = sizeof(query) };
>>>>>>>> +    int ret;
>>>>>>>> +
>>>>>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>>>>>>>> +                               VFIO_IOMMU_SPAPR_TCE_QUERY, &query);
>>>>>>>> +    if (ret) {
>>>>>>>> +        return ret;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    *windows_available = query.windows_available;
>>>>>>>> +    *page_size_mask = query.page_size_mask;
>>>>>>>> +
>>>>>>>> +    return ret;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int spapr_pci_vfio_ddw_create(sPAPRPHBState *sphb, uint32_t page_shift,
>>>>>>>> +                                     uint32_t window_shift, uint32_t liobn,
>>>>>>>> +                                     sPAPRTCETable **ptcet)
>>>>>>>> +{
>>>>>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>>>>>> +    struct vfio_iommu_spapr_tce_create create = {
>>>>>>>> +        .argsz = sizeof(create),
>>>>>>>> +        .page_shift = page_shift,
>>>>>>>> +        .window_shift = window_shift,
>>>>>>>> +        .start_addr = 0
>>>>>>>> +    };
>>>>>>>> +    int ret;
>>>>>>>> +
>>>>>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>>>>>>>> +                               VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>>>>>>>> +    if (ret) {
>>>>>>>> +        return ret;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn, create.start_addr,
>>>>>>>> +                                 page_shift, 1 << (window_shift - page_shift),
>>>>>>> I spot a 1 without ULL again - this time it might work out ok, but please
>>>>>>> just always use ULL when you pass around addresses.
>>>>>> My bad. I keep forgetting this, I'll adjust my own checkpatch.py :)
>>>>>>
>>>>>>
>>>>>>> Please walk me through the abstraction levels on what each page size
>>>>>>> honoration means. If I use THP, what page size granularity can I use for
>>>>>>> TCE entries?
>>>>>> [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls
>>>>>> support
>>>>>>
>>>>>> +        const struct { int shift; uint32_t mask; } masks[] = {
>>>>>> +            { 12, DDW_PGSIZE_4K },
>>>>>> +            { 16, DDW_PGSIZE_64K },
>>>>>> +            { 24, DDW_PGSIZE_16M },
>>>>>> +            { 25, DDW_PGSIZE_32M },
>>>>>> +            { 26, DDW_PGSIZE_64M },
>>>>>> +            { 27, DDW_PGSIZE_128M },
>>>>>> +            { 28, DDW_PGSIZE_256M },
>>>>>> +            { 34, DDW_PGSIZE_16G },
>>>>>> +        };
>>>>>>
>>>>>>
>>>>>> Supported page sizes are returned by the host kernel via "query". For 16MB
>>>>>> pages, page shift will return DDW_PGSIZE_4K|DDW_PGSIZE_64K|DDW_PGSIZE_16M.
>>>>>> Or I did not understand the question...
>>>>> Why do we care about the sizes? Anything bigger than what we support should
>>>>> always work, no? What happens if the guest creates a 16MB map but my pages
>>>>> are 4kb mapped? Wouldn't the same logic be able to deal with 16G pages?
>>>> It is DMA memory, if I split "virtual" 16M page to a bunch of real 4K
>>>> pages, I have to make sure these 16M are continuous - there will be one TCE
>>>> entry for it and no more translations besides IOMMU. What do I miss now?
>>> Who does the shadow translation where? Does it exist at all?
>> IOMMU? I am not sure I am following you... This IOMMU will look as direct
>> DMA for the guest but the real IOMMU table is sparse and it is populated
>> via a bunch of H_PUT_TCE calls as the default small window.
>>
>> There is a direct mapping in the host called "bypass window" but it is not
>> used here as sPAPR does not define that for paravirtualization.
> 
> Ok, imagine I have 16MB of guest physical memory that is in reality backed
> by 256 64k pages on the host. The guest wants to create a 16M TCE entry for
> this (from its point of view contiguous) chunk of memory.
> 
> Do we allow this?


No, we do not. We tell the guest what it can use.

> Or do we force the guest to create 64k TCE entries?

16MB TCE pages are only allowed if qemu is running with hugepages.
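
To make the rule concrete: the guest is only offered TCE page sizes that
the host backing pages can satisfy. Below is a minimal sketch in C, not the
actual code. The DDW_PGSIZE_* bit values and the helper name
ddw_page_mask_for_host_shift() are assumptions for illustration; the real
definitions are in patch 06/10, and the real mask comes from the host kernel
via the "query" ioctl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit values, for illustration only (see patch 06/10). */
#define DDW_PGSIZE_4K    0x01
#define DDW_PGSIZE_64K   0x02
#define DDW_PGSIZE_16M   0x04
#define DDW_PGSIZE_32M   0x08
#define DDW_PGSIZE_64M   0x10
#define DDW_PGSIZE_128M  0x20
#define DDW_PGSIZE_256M  0x40
#define DDW_PGSIZE_16G   0x80

/*
 * Hypothetical helper: build the mask of TCE page sizes that can be offered
 * to the guest when guest RAM is backed by host pages of
 * (1ULL << host_page_shift) bytes. A TCE page larger than the backing page
 * is not guaranteed to be contiguous in host memory, so it is left out.
 */
static uint32_t ddw_page_mask_for_host_shift(int host_page_shift)
{
    static const struct { int shift; uint32_t mask; } masks[] = {
        { 12, DDW_PGSIZE_4K },
        { 16, DDW_PGSIZE_64K },
        { 24, DDW_PGSIZE_16M },
        { 25, DDW_PGSIZE_32M },
        { 26, DDW_PGSIZE_64M },
        { 27, DDW_PGSIZE_128M },
        { 28, DDW_PGSIZE_256M },
        { 34, DDW_PGSIZE_16G },
    };
    uint32_t mask = 0;
    size_t i;

    assert(host_page_shift >= 12); /* nothing smaller than 4K is expected */

    for (i = 0; i < sizeof(masks) / sizeof(masks[0]); i++) {
        if (masks[i].shift <= host_page_shift) {
            mask |= masks[i].mask;
        }
    }
    return mask;
}
```

With 64K backing pages (shift 16) this gives DDW_PGSIZE_4K|DDW_PGSIZE_64K,
so a 16M TCE is not on offer; with 16MB hugepages (shift 24) it gives
DDW_PGSIZE_4K|DDW_PGSIZE_64K|DDW_PGSIZE_16M.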


> If we allow it, why would we ever put any restriction at the upper end of
> TCE entry sizes? If we already implement enough logic to map things lazily
> around, we could as well have the guest create a 256M TCE entry and just
> split it on the host view to 64k TCE entries.

Oh, thiiiiiis is what you meant...

Well, we could, but for now Linux guests support only 4K/64K/16M pages and
they choose among these depending on what the hypervisor supports - look at
enable_ddw() in the guest. What you suggest seems to be unnecessary code
duplication for the 16MB page case. As for bigger page sizes: for a 64GB
guest, for example, a TCE table with 16MB TCEs will be 32KB (4096 entries
of 8 bytes each), which is already awesome enough, no?
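
The arithmetic behind that number, as a small sketch in C. The helper name
tce_table_bytes() and the TCE_ENTRY_SIZE macro are made up for illustration;
the only assumption about the hardware is that each TCE is one 64-bit entry.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one TCE is a 64-bit (8-byte) entry. */
#define TCE_ENTRY_SIZE 8ULL

/*
 * Hypothetical helper: size in bytes of a TCE table that covers
 * window_size bytes of DMA space with pages of (1ULL << page_shift) bytes.
 * Note the ULL suffix: a plain "1 << page_shift" is an int shift and
 * overflows for large shifts.
 */
static uint64_t tce_table_bytes(uint64_t window_size, unsigned page_shift)
{
    uint64_t page_size, nb_entries;

    assert(page_shift < 64);

    page_size = 1ULL << page_shift;
    /* Round up so that a partially used last page still gets an entry. */
    nb_entries = (window_size + page_size - 1) >> page_shift;

    return nb_entries * TCE_ENTRY_SIZE;
}
```

For a 64GB window, tce_table_bytes(64ULL << 30, 24) is 32KB with 16MB TCEs,
while the same window needs 8MB with 64K TCEs (shift 16) and 128MB with 4K
TCEs (shift 12).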




-- 
Alexey
