On Wed, Jan 03, 2024 at 12:11:19PM +0100, Eugenio Perez Martin wrote:
> On Wed, Jan 3, 2024 at 7:16 AM Peter Xu <pet...@redhat.com> wrote:
> >
> > On Tue, Jan 02, 2024 at 12:28:48PM +0100, Eugenio Perez Martin wrote:
> > > On Tue, Jan 2, 2024 at 6:33 AM Peter Xu <pet...@redhat.com> wrote:
> > > >
> > > > Jason, Eugenio,
> > > >
> > > > Apologies for a late reply; just back from the long holiday.
> > > >
> > > > On Thu, Dec 21, 2023 at 09:20:40AM +0100, Eugenio Perez Martin wrote:
> > > > > Si-Wei did the actual profiling as he is the one with the 128G guests,
> > > > > but most of the time was spent in the memory pinning. Si-Wei, please
> > > > > correct me if I'm wrong.
> > > >
> > > > IIUC we're talking about no-vIOMMU use case.  The pinning should indeed
> > > > take a lot of time if it's similar to what VFIO does.
> > > >
> > > > >
> > > > > I didn't check VFIO, but I think it just maps at realize phase with
> > > > > vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
> > > > > previous testings, this delayed the VM initialization by a lot, as
> > > > > we're moving that 20s of blocking to every VM start.
> > > > >
> > > > > Investigating a way to do it only in the case of being the destination
> > > > > of a live migration, I think the right place is .load_setup migration
> > > > > handler. But I'm ok to move it for sure.
> > > >
> > > > If it's destined to map the 128G, it does sound sensible to me to do it
> > > > when VM starts, rather than anytime afterwards.
> > > >
> > >
> > > Just for completion, it is not 100% sure the driver will start the
> > > device. But it is likely for sure.
> >
> > My understanding is that vDPA is still a quite special device, assuming
> > only targeting advanced users, and should not appear in a default config
> > for anyone.  It means the user should hopefully remove the device if the
> > guest is not using it, instead of worrying on a slow boot.
> >
> > >
> > > > Could anyone help to explain what's the problem if vDPA maps 128G at VM
> > > > init just like what VFIO does?
> > > >
> > >
> > > The main problem was the delay of VM start. In the master branch, the
> > > pinning is done when the driver starts the device. While it takes the
> > > BQL, the rest of the vCPUs can move work forward while the host is
> > > pinning. So the impact of it is not so evident.
> > >
> > > To move it to initialization time made it very noticeable. To make
> > > things worse, QEMU did not respond to QMP commands and similar. That's
> > > why it was done only if the VM was the destination of a LM.
> >
> > Is that a major issue for us?
> 
> To me it is a regression but I'm ok with it for sure.
> 
> >  IIUC then VFIO shares the same condition.
> > If it's a real problem, do we want to have a solution that works for both
> > (or, is it possible)?
> >
> 
> I would not consider a regression for VFIO since I think it has
> behaved that way from the beginning. But yes, I'm all in to find a
> common solution.
> 
> > >
> > > However, we've added the memory map thread in this version, so this
> > > might not be a problem anymore. We could move the spawn of the thread
> > > to initialization time.
> > >
> > > But how to undo this pinning in the case the guest does not start the
> > > device? In this series, this is done at the destination with
> > > vhost_vdpa_load_cleanup. Or is it ok to just keep the memory mapped as
> > > long as QEMU has the vDPA device?
> >
> > I think even if vDPA decides to use a thread, we should keep the same
> > behavior before/after the migration.  Having assymetric behavior over DMA
> > from the assigned HWs might have unpredictable implications.
> >
> > What I worry is we may over-optimize / over-engineer the case where the
> > user will specify the vDPA device but not use it, as I mentioned above.
> >
> 
> I agree with all of the above. If it is ok to keep memory mapped while
> the guest has not started I think we can move the spawn of the thread,
> or even just the map write itself, to the vdpa init.
> 
> > For the long term, maybe there's chance to optimize DMA pinning for both
> > vdpa/vfio use cases, then we can always pin them during VM starts? Assuming
> > that issue only exists for large VMs, while they should normally be good
> > candidates for huge pages already.  Then, it means maybe one folio/page can
> > cover a large range (e.g. 1G on x86_64) in one pin, and physical continuity
> > also provides possibility of IOMMU large page mappings.  I didn't check at
> > which stage we are for VFIO on this, Alex may know better.
> 
> Sounds interesting, and I think it should be implemented. Thanks for
> the pointer!

I didn't have an exact pointer previously, but to provide a pointer, I
think it can be something like this:

  physr discussion - Jason Gunthorpe
  
https://www.youtube.com/watch?v=QftOTtks-pI&list=PLbzoR-pLrL6rlmdpJ3-oMgU_zxc1wAhjS&index=36

Since I have zero knowledge on vDPA side, I can only provide the exmaple
from VFIO and even if so that may not be fully accurate.  Basically afaiu
currently vfio is already smart enough to recognize continuous ranges (via
vfio_pin_pages_remote()):

                /* Pin a contiguous chunk of memory */
                npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
                                              size >> PAGE_SHIFT, &pfn, limit,
                                              &batch);

But the pin can still be slow, due to e.g. we need a page* for each
PAGE_SIZE range.

I think something like physr can improve it.  That should correspond to the
1st slide of the video of the "performance" entry on pin_user_pages().

Thanks,

> 
> > I'm copying Alex
> > anyway since the problem seems to be a common one already, so maybe he has
> > some thoughts.
> >
> 
> Appreciated :).
> 
> Thanks!
> 

-- 
Peter Xu


Reply via email to