Hi Michael,

On Wed, 20 May 2026 19:26:24 +0000
Michael Kelley <[email protected]> wrote:

> To: Yu Zhang, Jason Gunthorpe
> Subject: RE: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB
> flush support
> 
> From: Yu Zhang <[email protected]> Sent: Wednesday, May
> 20, 2026 10:15 AM
> > 
> > On Fri, May 15, 2026 at 07:35:45PM -0300, Jason Gunthorpe wrote:  
> > > On Tue, May 12, 2026 at 12:24:08AM +0800, Yu Zhang wrote:  
> > > > +static inline u16 hv_iommu_fill_iova_list(union hv_iommu_flush_va *iova_list,
> > > > +                                         unsigned long start,
> > > > +                                         unsigned long end)
> > > > +{
> > > > +       unsigned long start_pfn = start >> PAGE_SHIFT;
> > > > +       unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;
> > > > +       unsigned long nr_pages = end_pfn - start_pfn;
> > > > +       u16 count = 0;
> > > > +
> > > > +       while (nr_pages > 0) {
> > > > +               unsigned long flush_pages;
> > > > +               int order;
> > > > +               unsigned long pfn_align;
> > > > +               unsigned long size_align;
> > > > +
> > > > +               if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > > > +                       count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > > > +                       break;
> > > > +               }
> > > > +
> > > > +               if (start_pfn)
> > > > +                       pfn_align = __ffs(start_pfn);
> > > > +               else
> > > > +                       pfn_align = BITS_PER_LONG - 1;
> > > > +
> > > > +               size_align = __fls(nr_pages);
> > > > +               order = min(pfn_align, size_align);
> > > > +               iova_list[count].page_mask_shift = order;
> > > > +               iova_list[count].page_number = start_pfn;
> > > > +
> > > > +               flush_pages = 1UL << order;
> > > > +               start_pfn += flush_pages;
> > > > +               nr_pages -= flush_pages;
> > > > +               count++;
> > > > +       }  
> > >
> > > This seems like a really silly hypervisor interface. Why doesn't
> > > it just accept a normal range? Splitting it into power of two
> > > aligned ranges is very inefficient.  
> > 
> > Fair point. I'm not sure how much flexibility we have to change
> > this hypercall interface at the moment - it predates the pvIOMMU
> > work and may have other consumers beyond Linux guest. On the other
> > hand, having the guest specify 2^N-aligned blocks does save the
> > hypervisor from having to decompose ranges itself before issuing
> > hardware invalidation commands - the guest-provided entries can be
> > fed to the HW more or less directly.
> > 
> > That said, the way I'm currently using this interface may be
> > more precise than necessary. Maybe we have 2 options:
> > 
> > 1) Current approach: decompose the range into multiple exact
> >    2^N-aligned blocks with no over-flush, but at the cost of
> >    more complex calculations and more entries.
> > 
> > 2) Follow what Intel/AMD drivers do: find a single minimal
> >    2^N-aligned block that covers the entire range, but may
> >    over-flush.
> > 
> > Any preference?
> > 
> > @Michael, since you've also been reviewing this patch, I'd
> > appreciate your thoughts on the above as well. :)
> >   
> 
> I'm just guessing, but perhaps flushing an aligned power-of-2
> range can be processed by the hypervisor at a relatively fixed
> cost, regardless of the size. Having the guest do the decomposing
> of an arbitrary range allows the hypervisor to make use of the
> existing "rep" hypercall mechanism if the hypercall is taking
> "too long". The hypervisor can pause its processing, return to
> the guest temporarily, and then continue the hypercall. If the
> arbitrary range were passed into the hypercall for the hypervisor
> to do the decomposing, that pause-and-restart mechanism
> wouldn't be available.
> 
> Of course, Linux doesn't really take advantage of the pause to
> reduce guest interrupt latency because the Hyper-V code in
> Linux typically disables interrupts around a hypercall due to the
> way the hypercall input page is allocated. But other guest
> operating systems might benefit from such a pause. And we could
> probably fix the Hyper-V code in Linux to allow interrupts during a
> hypercall pause/restart if long-running hypercalls turn out to be
> a problem.

I am not sure this pause feature is suitable for an IOTLB flush at all.
The flush is inherently synchronous: the caller must block until all
invalidations complete. Pausing mid-flush to return to the guest
doesn't help if the guest can't make forward progress anyway.
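
To make the trade-off between the two options Yu listed a bit more
concrete, here is a rough user-space sketch, not the patch code. It
assumes 4 KiB pages and GCC/Clang builtins in place of the kernel's
__ffs()/__fls(). Every sketch_* name and the entry struct are
hypothetical stand-ins for illustration; they are not part of the
Hyper-V interface or of the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

/* Hypothetical stand-in for union hv_iommu_flush_va. */
struct sketch_flush_entry {
	uint64_t page_number;
	unsigned int page_mask_shift;
};

/* Index of the lowest set bit; v must be non-zero (like __ffs()). */
static unsigned int sketch_ffs(uint64_t v)
{
	return (unsigned int)__builtin_ctzll(v);
}

/* Index of the highest set bit; v must be non-zero (like __fls()). */
static unsigned int sketch_fls(uint64_t v)
{
	return 63u - (unsigned int)__builtin_clzll(v);
}

/*
 * Option 1: split [start, end) into exact 2^N-aligned blocks.
 * Returns the number of entries written, or max + 1 if the range
 * needs more than max entries (the caller would then fall back to
 * a coarser flush).
 */
static unsigned int sketch_fill_exact(struct sketch_flush_entry *list,
				      unsigned int max,
				      uint64_t start, uint64_t end)
{
	uint64_t pfn = start >> SKETCH_PAGE_SHIFT;
	uint64_t end_pfn = (end + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
	uint64_t nr_pages = end_pfn - pfn;
	unsigned int count = 0;

	assert(end > start);

	while (nr_pages > 0) {
		unsigned int pfn_align, size_align, order;

		if (count >= max)
			return max + 1;

		/* Largest block that is aligned at pfn and still fits. */
		pfn_align = pfn ? sketch_ffs(pfn) : 63u;
		size_align = sketch_fls(nr_pages);
		order = pfn_align < size_align ? pfn_align : size_align;

		list[count].page_number = pfn;
		list[count].page_mask_shift = order;
		count++;

		pfn += 1ULL << order;
		nr_pages -= 1ULL << order;
	}
	return count;
}

/*
 * Option 2: one minimal 2^N-aligned block covering [start, end).
 * The order is one more than the highest bit in which the first and
 * last PFN differ. PFNs have at most 52 bits here, so order < 64.
 */
static struct sketch_flush_entry sketch_fill_cover(uint64_t start,
						   uint64_t end)
{
	uint64_t first = start >> SKETCH_PAGE_SHIFT;
	uint64_t last = (end - 1) >> SKETCH_PAGE_SHIFT;
	uint64_t diff = first ^ last;
	struct sketch_flush_entry e;

	assert(end > start);

	e.page_mask_shift = diff ? sketch_fls(diff) + 1 : 0;
	e.page_number = (first >> e.page_mask_shift) << e.page_mask_shift;
	return e;
}
```

With start = 0x3000 and end = 0x8000 (pages 3 through 7),
sketch_fill_exact() produces two entries, (pfn 3, order 0) and
(pfn 4, order 2), with no over-flush. sketch_fill_cover() produces a
single order-3 block at pfn 0, which also invalidates pages 0 through 2.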
