On Thu, Oct 21, 2021 at 10:34 PM Eugenio Perez Martin
<epere...@redhat.com> wrote:
>
> On Thu, Oct 21, 2021 at 10:12 AM Jason Wang <jasow...@redhat.com> wrote:
> >
> > On Thu, Oct 21, 2021 at 3:03 PM Eugenio Perez Martin
> > <epere...@redhat.com> wrote:
> > >
> > > On Thu, Oct 21, 2021 at 4:34 AM Jason Wang <jasow...@redhat.com> wrote:
> > > >
> > > > On Wed, Oct 20, 2021 at 8:07 PM Eugenio Perez Martin
> > > > <epere...@redhat.com> wrote:
> > > > >
> > > > > On Wed, Oct 20, 2021 at 11:01 AM Jason Wang <jasow...@redhat.com>
> > > > > wrote:
> > > > > >
> > > > > > On Wed, Oct 20, 2021 at 3:54 PM Eugenio Perez Martin
> > > > > > <epere...@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, Oct 19, 2021 at 11:23 AM Jason Wang <jasow...@redhat.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On Tue, Oct 19, 2021 at 4:32 PM Jason Wang
> > > > > > > > <jasow...@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On 2021/10/1 3:06 PM, Eugenio Pérez wrote:
> > > > > > > > > > This tree is able to look for a translated address from an
> > > > > > > > > > IOVA address.
> > > > > > > > > >
> > > > > > > > > > At first glance it is similar to util/iova-tree. However,
> > > > > > > > > > SVQ working on devices with limited IOVA space needs more
> > > > > > > > > > capabilities, like allocating IOVA chunks or performing
> > > > > > > > > > reverse translations (qemu addresses to iova).
> > > > > > > > >
> > > > > > > > > I don't see any reverse translation used in the shadow code.
> > > > > > > > > Or is there anything I missed?
> > > > > > > >
> > > > > > > > Ok, it looks to me that it is used in the iova allocator. But
> > > > > > > > I think it's better to decouple it to an independent allocator
> > > > > > > > instead of the vhost iova tree.
> > > > > > >
> > > > > > > Reverse translation is used every time a buffer is made
> > > > > > > available, since buffer contents are not copied, only the
> > > > > > > descriptors to the SVQ vring.
> > > > > >
> > > > > > I may miss something, but I didn't see the code? Qemu knows the
> > > > > > VA of the virtqueue, and the VA of the VQ is stored in the
> > > > > > VirtQueueElem?
> > > > >
> > > > > It's used in patch 20/20, could that be the misunderstanding? The
> > > > > function calling it is vhost_svq_translate_addr.
> > > > >
> > > > > Qemu knows the VA address of the buffer, but it must offer a valid
> > > > > SVQ iova to the device. That is the translation I mean.
> > > >
> > > > Ok, I get you. So if I understand correctly, what you did is:
> > > >
> > > > 1) allocate IOVA during region_add
> > > > 2) perform VA->IOVA reverse lookup in handle_kick
> > > >
> > > > This should be fine, but here are some suggestions:
> > > >
> > > > 1) remove the assert(map) in vhost_svq_translate_addr() since the
> > > > guest can add e.g. a BAR address
> > >
> > > Wouldn't VirtQueue block them in the virtqueue_pop /
> > > address_space_read_* functions? I'm fine with removing it, but I
> > > would say it adds value against coding errors.
> >
> > I think not. Though these addresses were excluded in
> > vhost_vdpa_listener_skipped_section(), for the Qemu memory core they
> > are valid addresses. Qemu emulates how hardware works (e.g. PCI p2p),
> > so DMA to a BAR is allowed.
>
> Ok, I will treat them as errors.
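For reference, a minimal sketch of the failing-instead-of-asserting
translation discussed above. vhost_svq_translate_addr() is the function
named in patch 20/20; vhost_iova_tree_find_iova(), the iova_tree field
and the DMAMap layout are assumptions from this thread, not the posted
code:

/* Sketch only, written against qemu's headers; not the actual patch. */
static bool vhost_svq_translate_addr(const VhostShadowVirtqueue *svq,
                                     hwaddr *addrs,
                                     const struct iovec *iov, size_t num)
{
    for (size_t i = 0; i < num; ++i) {
        DMAMap needle = {
            .translated_addr = (hwaddr)(uintptr_t)iov[i].iov_base,
            .size = iov[i].iov_len,
        };
        /* Reverse lookup: qemu VA -> SVQ IOVA chunk */
        const DMAMap *map = vhost_iova_tree_find_iova(svq->iova_tree,
                                                      &needle);

        if (!map) {
            /*
             * The guest handed us an address qemu never exposed to the
             * device (e.g. another device's BAR). Valid for the memory
             * core (pci p2p), but an error for SVQ: fail the descriptor
             * instead of asserting.
             */
            return false;
        }
        /* Keep the offset of the VA inside the mapped chunk */
        addrs[i] = map->iova +
                   (needle.translated_addr - map->translated_addr);
    }
    return true;
}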
> > > > 2) we probably need a better name than vhost_iova_tree_alloc(),
> > > > maybe "vhost_iova_tree_map_alloc()"
> > >
> > > Ok, I will change it for the next version.
> > >
> > > > There's actually another method.
> > > >
> > > > 1) don't do IOVA/map allocation in region_add()
> > > > 2) do the allocation in handle_kick(), then we know the IOVA so no
> > > > reverse lookup
> > > >
> > > > The advantage is that this can work for the case of vIOMMU. And
> > > > they should perform the same:
> > > >
> > > > 1) your method avoids the iova allocation per sg
> > > > 2) my method avoids the reverse lookup per sg
> > >
> > > It's somewhat doable, but we are replacing a tree search with a
> > > linear insertion at this moment.
> > >
> > > I would say that the guest's IOVA -> qemu vaddr part works with no
> > > change for vIOMMU, since VirtQueue's virtqueue_pop already gives us
> > > the vaddr even in the case of vIOMMU.
> >
> > So in this case:
> >
> > 1) the listener gives us GPA -> host IOVA (host IOVA is allocated per
> > GPA)
>
> Right, that was a miss from my side, I think I get your point way
> better now.
>
> So now vhost-iova-tree translates GPA -> host IOVA in the vIOMMU case,
> and it is updated at the same frequency as guest physical memory
> hotplug / unplug (little during migration, I guess). There are special
> entries for the SVQ vrings, which the tree does not map with a GPA for
> obvious reasons, and you cannot locate them when looking up by GPA.
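To illustrate the two kinds of entries just described, a hypothetical
fragment using the vhost_iova_tree_map_alloc() name suggested above;
apart from util/iova-tree's DMAMap, every name here is made up for the
example:

/* Illustrative only; not part of the series. */
static void example_add_entries(VhostIOVATree *iova_tree,
                                hwaddr region_gpa, hwaddr region_size,
                                void *svq_vring_va, size_t vring_bytes)
{
    /* Guest memory (vIOMMU case): keyed by GPA, found again by GPA. */
    DMAMap guest_map = {
        .translated_addr = region_gpa,
        .size = region_size - 1,
    };
    vhost_iova_tree_map_alloc(iova_tree, &guest_map);
    /* guest_map.iova now holds the host IOVA picked for this GPA range */

    /*
     * SVQ vring: it only exists in qemu's VA space, so its entry carries
     * a VA, not a GPA, and a GPA lookup can never return it.
     */
    DMAMap vring_map = {
        .translated_addr = (hwaddr)(uintptr_t)svq_vring_va,
        .size = vring_bytes - 1,
    };
    vhost_iova_tree_map_alloc(iova_tree, &vring_map);
}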
Yes.

> Let's assume too that only the SVQ vrings have been sent as the IOMMU
> / IOTLB map, with the relation host iova -> qemu's VA.
>
> > 2) virtqueue_pop gives us guest IOVA -> VA
> >
> > We still need extra logic to look up the vIOMMU to get the GPA of the
> > guest IOVA, then we can know the host IOVA.
>
> That's somewhat right, but I think this does not need to be *another*
> search, insertion, etc. Please see below.
>
> > If we allocate after virtqueue_pop(), we can follow the same logic as
> > without vIOMMU. Just allocate a host IOVA, then all is done.
> >
> > > The only change I would add for that case is the SVQ -> device
> > > map/unmapping part, so the device cannot access random addresses
> > > but only the exposed ones. I'm assuming that part is O(1).
> > >
> > > This way, we already have a tree with all the possible guest
> > > addresses, and we only need to look for its SVQ iova -> vaddr
> > > translation. This is an O(log(N)) operation,
> >
> > Yes, but it requires traversing the vIOMMU page table, which should
> > be slower than our own iova tree?
>
> The lookup over the vIOMMU is not needed (a second time), since
> virtqueue_pop already does it. We already have that data here, we just
> need to extract it.

By 'extract', do you mean fetching it from the IOMMU's IOTLB via
address_space_get_iotlb_entry()? Yes, it would be faster and probably
O(1).

> Not saying that it is complicated, just saying that I didn't dedicate
> a lot of time to figuring out how. The calltrace of it is:
>
> #0  address_space_translate_iommu
>     (iommu_mr, xlat, plen_out, page_mask_out, is_write, is_mmio,
>     target_as, attrs) at ../softmmu/physmem.c:418
> #1  flatview_do_translate
>     (fv, addr, xlat, plen_out, page_mask_out, is_write, is_mmio,
>     target_as, attrs) at ../softmmu/physmem.c:505
> #2  flatview_translate
>     (fv, addr, xlat, plen, is_write, attrs) at ../softmmu/physmem.c:565
> #3  address_space_map (as, addr, plen, is_write, attrs)
>     at ../softmmu/physmem.c:3183
> #4  dma_memory_map (as, addr, len, dir)
>     at /home/qemu/svq/include/sysemu/dma.h:202
> #5  virtqueue_map_desc
>     (vdev, p_num_sg, addr, iov, max_num_sg, is_write, pa, sz)
>     at ../hw/virtio/virtio.c:1314
> #6  virtqueue_split_pop (vq, sz) at ../hw/virtio/virtio.c:1488
>
> So with that GPA we can locate its corresponding entry in the
> vhost-iova-tree, in a read-only operation, O(log(N)). And the element
> address in qemu's VA is not going to change until we mark it as used.
>
> This process (all of the stack call trace) needs to be serialized
> somehow in qemu's memory system internals. I'm just assuming that it
> will be faster than the one we can do in SVQ with little effort, and
> it will help to reduce duplication. If that is not the case, I think
> it is even more beneficial to improve it than to reinvent it in SVQ.

I think so.

Thanks

> After that, an iommu map needs to be sent to the device, as (the host
> iova obtained from the tree, qemu's VA, length, ...). We may even
> batch them. Another option is to wait for the miss(), but I think that
> would be a waste of resources.
>
> The reverse is also true with the unmapping: when we see a used
> descriptor, IOTLB unmap(s) will be sent before sending the descriptor
> to the guest as used.
>
> > > and read only, so it's easily parallelizable when we make each SVQ
> > > run in its own thread (if needed).
> >
> > Yes, this is because the host IOVA was allocated before by the memory
> > listener.
>
> Right.
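For reference, a sketch of that 'extract' step.
address_space_get_iotlb_entry() and the IOMMUTLBEntry fields are the
existing qemu API; the wrapper itself is hypothetical:

/* Sketch: guest IOVA -> GPA via the vIOMMU's IOTLB, resolved once more
 * after virtqueue_pop() already walked it. */
static bool svq_gpa_from_guest_iova(VirtIODevice *vdev, hwaddr giova,
                                    hwaddr *gpa)
{
    IOMMUTLBEntry iotlb = address_space_get_iotlb_entry(
        vdev->dma_as, giova, false, MEMTXATTRS_UNSPECIFIED);

    if (iotlb.target_as == NULL) {
        /* No vIOMMU mapping for this guest IOVA */
        return false;
    }

    /* translated_addr is aligned to the entry; put back the offset */
    *gpa = iotlb.translated_addr | (giova & iotlb.addr_mask);
    return true;
}

With that GPA, the vhost-iova-tree lookup described above then yields
the host IOVA in O(log(N)).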
>
> > > The only thing left is to expose that with an iommu miss (O(1)) and
> > > unmap it on used buffer processing (also O(1)). The dominant
> > > operation remains VirtQueue's own code lookup for guest's IOVA ->
> > > GPA, which I'm assuming is already well optimized and will benefit
> > > from future optimizations, since qemu's memory system is frequently
> > > used.
> > >
> > > To optimize your use case we would need to add a custom (and
> > > smarter than the one currently used) allocator to SVQ. I've been
> > > looking for ways to reuse glibc or similar in our own arenas, but
> > > with no luck. It will be code that SVQ needs to maintain by and for
> > > itself anyway.
> >
> > The benefit is to have the iova allocation separated from the tree.
> >
> > > In either case it should not be hard to switch to your method, just
> > > a few call changes in the future, if we achieve a faster allocator.
> > >
> > > Would that make sense?
> >
> > Yes, feel free to choose any method you wish or feel is simpler in
> > the next series.
> >
> > > > > > > At this point all the limits are copied to the vhost iova
> > > > > > > tree in the next revision I will send, defined at its
> > > > > > > creation at vhost_iova_tree_new(). They are outside of
> > > > > > > util/iova-tree, only sent to the latter at allocation time.
> > > > > > >
> > > > > > > Since vhost_iova_tree has its own vhost_iova_tree_alloc(),
> > > > > > > which wraps the iova_tree_alloc() [1], the limits could be
> > > > > > > kept in vhost-vdpa and made an argument of
> > > > > > > vhost_iova_tree_alloc. But I'm not sure if that's what you
> > > > > > > are proposing or if I'm missing something.
> > > > > >
> > > > > > If the reverse translation is only used in iova allocation, I
> > > > > > meant to split out the logic of the IOVA allocation itself.
> > > > >
> > > > > Still don't understand it, sorry :). In SVQ setup we allocate an
> > > > > iova address for every guest GPA address its driver can use.
> > > > > After that there should be no allocation unless memory is
> > > > > hotplugged.
> > > > >
> > > > > So the limits are only needed precisely at allocation time. Not
> > > > > sure if that is what you mean here, but to first allocate and
> > > > > then check if it is within the range could lead to false
> > > > > negatives, since there could be a valid free range *within* the
> > > > > limits but the iova allocator returned us another range that
> > > > > fell outside of them. How could we know the cause if it is not
> > > > > using the range itself?
> > > >
> > > > See my above reply. And we can teach the iova allocator to return
> > > > an IOVA in the range that vhost-vDPA supports.
> > >
> > > Ok,
> > >
> > > For the next series it will be that way. I'm pretty sure we are
> > > aligned on this part, but the lack of code in this series makes it
> > > very hard to discuss :).
> >
> > Fine. Let's see.
> >
> > Thanks
> >
> > > Thanks!
> > > >
> > > > Thanks
> > > > > > >
> > > > > > > Either way, I think it is harder to talk about this specific
> > > > > > > case without code, since this one still does not address the
> > > > > > > limits. Would you prefer me to send another RFC in WIP
> > > > > > > quality, with *not* all comments addressed? I would say that
> > > > > > > there is not a lot of pending work to send the next one, but
> > > > > > > it might be easier for all of us.
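Following the 'teach the allocator the range' idea above, a sketch of
how the wrapper could store the limits at vhost_iova_tree_new() time and
only hand them to util/iova-tree when allocating. iova_tree_alloc_map()
and the field names are assumptions from this thread, not the next
revision:

/* Sketch only: allocate a free IOVA range inside the device limits. */
int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
{
    /* Reject spans that would wrap around the address space */
    if (map->translated_addr + map->size < map->translated_addr) {
        return IOVA_ERR_INVALID;
    }

    /*
     * The allocator searches for a hole of map->size only inside
     * [iova_first, iova_last], so a successful result is within the
     * range vhost-vDPA supports by construction: no allocate-then-check,
     * and no false negatives.
     */
    return iova_tree_alloc_map(tree->iova_taddr_map, map,
                               tree->iova_first, tree->iova_last);
}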
> > > > > >
> > > > > > I'd prefer to try to address them all, otherwise it's not easy
> > > > > > to see what is missing.
> > > > >
> > > > > Got it, I will do it that way then!
> > > > >
> > > > > Thanks!
> > > > >
> > > > > > Thanks
> > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > [1] This util/iova-tree method will be proposed in the next
> > > > > > > series, and vhost_iova_tree wraps it since it needs to keep
> > > > > > > both trees in sync: iova -> qemu vaddr for iova allocation,
> > > > > > > and the reverse one to translate available buffers.
> > > > > > >
> > > > > > > > Thanks
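As a reference for footnote [1], one possible shape of the wrapper
keeping the two trees; the struct and its field names are guesses from
this thread, not the proposed code:

/* Sketch: one util/iova-tree per direction, kept in sync by the wrapper. */
typedef struct VhostIOVATree {
    hwaddr iova_first;         /* device-supported IOVA window start */
    hwaddr iova_last;          /* device-supported IOVA window end */
    IOVATree *iova_taddr_map;  /* iova -> qemu vaddr, for allocation */
    IOVATree *taddr_iova_map;  /* qemu vaddr -> iova, to translate
                                * available buffers */
} VhostIOVATree;

Every map insertion or removal then has to update both trees, which is
the synchronization cost the footnote mentions.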