RE: [RFC 0/7] drm/virtio: Import scanout buffers from other devices

2024-05-30 Thread Kasireddy, Vivek
Hi Gurchetan,

> 
> On Fri, May 24, 2024 at 11:33 AM Kasireddy, Vivek
> <vivek.kasire...@intel.com> wrote:
> 
> 
>   Hi,
> 
>   Sorry, my previous reply got messed up as a result of HTML
> formatting. This is
>   a plain text version of the same reply.
> 
>   >
>   >
>   >   Having virtio-gpu import scanout buffers (via prime) from other
>   >   devices means that we'd be adding a head to headless GPUs assigned
>   >   to a Guest VM or additional heads to regular GPU devices that are
>   >   passthrough'd to the Guest. In these cases, the Guest compositor
>   >   can render into the scanout buffer using a primary GPU and has the
>   >   secondary GPU (virtio-gpu) import it for display purposes.
>   >
>   >   The main advantage with this is that the imported scanout buffer can
>   >   either be displayed locally on the Host (e.g, using Qemu + GTK UI)
>   >   or encoded and streamed to a remote client (e.g, Qemu + Spice UI).
>   >   Note that since Qemu uses udmabuf driver, there would be no copies
>   >   made of the scanout buffer as it is displayed. This should be
>   >   possible even when it might reside in device memory such as VRAM.
>   >
>   >   The specific use-case that can be supported with this series is when
>   >   running Weston or other guest compositors with "additional-devices"
>   >   feature (./weston --drm-device=card1 --additional-devices=card0).
>   >   More info about this feature can be found at:
>   >   https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/736
>   >
>   >   In the above scenario, card1 could be a dGPU or an iGPU and card0
>   >   would be virtio-gpu in KMS only mode. However, the case where this
>   >   patch series could be particularly useful is when card1 is a GPU VF
>   >   that needs to share its scanout buffer (in a zero-copy way) with the
>   >   GPU PF on the Host. Or, it can also be useful when the scanout buffer
>   >   needs to be shared between any two GPU devices (assuming one of them
>   >   is assigned to a Guest VM) as long as they are P2P DMA compatible.
>   >
>   >
>   >
>   > Is passthrough iGPU-only or passthrough dGPU-only something you intend
>   > to use?
>   Our main use-case involves passthrough'ing a headless dGPU VF device and
>   sharing the Guest compositor's scanout buffer with the dGPU PF device on
>   the Host. Same goal for headless iGPU VF to iGPU PF device as well.
> 
> 
> 
> Just to check my understanding: the same physical {i, d}GPU is partitioned
> into the VF and PF, but the PF handles host-side display integration and
> rendering?
Yes, that is mostly right. In a nutshell, the same physical GPU is partitioned
into one PF device and multiple VF devices. Only the PF device has access to
the display hardware and can do KMS (on the Host). The VF devices are
headless with no access to display hardware (cannot do KMS but can do render/
encode/decode) and are generally assigned (or passthrough'd) to the Guest VMs.
Some more details about this model can be found here:
https://lore.kernel.org/dri-devel/20231110182231.1730-1-michal.wajdec...@intel.com/

> 
> 
>   However, using a combination of iGPU and dGPU where either of them can be
>   passthrough'd to the Guest is something I think can be supported with this
>   patch series as well.
> 
>   >
>   > If it's a dGPU + iGPU setup, then the way other people seem to do it is
>   > a "virtualized" iGPU (via virgl/gfxstream/take your pick) and
>   > pass-through the dGPU.
>   >
>   > For example, AMD seems to use virgl to allocate and import into the
>   > dGPU.
>   >
>   > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23896
>   >
>   > https://lore.kernel.org/all/20231221100016.4022353-1-julia.zh...@amd.com/
>   >
>   >
>   > ChromeOS also uses that method (see crrev.com/c/3764931)
>   > [cc: dGPU architect +Dominik Behr]
>   >
>   > So if iGPU + dGPU is 

RE: [PATCH] udmabuf: add CONFIG_MMU dependency

2024-05-28 Thread Kasireddy, Vivek
> From: Arnd Bergmann 
> 
> There is no !CONFIG_MMU version of vmf_insert_pfn():
> 
> arm-linux-gnueabi-ld: drivers/dma-buf/udmabuf.o: in function
> `udmabuf_vm_fault':
> udmabuf.c:(.text+0xaa): undefined reference to `vmf_insert_pfn'
> 
> Fixes: f7254e043ff1 ("udmabuf: use vmf_insert_pfn and VM_PFNMAP for
> handling mmap")
> Signed-off-by: Arnd Bergmann 
> ---
>  drivers/dma-buf/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/dma-buf/Kconfig b/drivers/dma-buf/Kconfig
> index e4dc53a36428..b46eb8a552d7 100644
> --- a/drivers/dma-buf/Kconfig
> +++ b/drivers/dma-buf/Kconfig
> @@ -35,6 +35,7 @@ config UDMABUF
>   default n
>   depends on DMA_SHARED_BUFFER
>   depends on MEMFD_CREATE || COMPILE_TEST
> + depends on MMU
Thank you for the fix!
Acked-by: Vivek Kasireddy 

>   help
> A driver to let userspace turn memfd regions into dma-bufs.
> Qemu can use this to create host dmabufs for guest framebuffers.
> --
> 2.39.2
> 



RE: [RFC 0/7] drm/virtio: Import scanout buffers from other devices

2024-05-24 Thread Kasireddy, Vivek
Hi,

Sorry, my previous reply got messed up as a result of HTML formatting. This is
a plain text version of the same reply.

> 
> 
>   Having virtio-gpu import scanout buffers (via prime) from other
>   devices means that we'd be adding a head to headless GPUs assigned
>   to a Guest VM or additional heads to regular GPU devices that are
>   passthrough'd to the Guest. In these cases, the Guest compositor
>   can render into the scanout buffer using a primary GPU and has the
>   secondary GPU (virtio-gpu) import it for display purposes.
> 
>   The main advantage with this is that the imported scanout buffer can
>   either be displayed locally on the Host (e.g, using Qemu + GTK UI)
>   or encoded and streamed to a remote client (e.g, Qemu + Spice UI).
>   Note that since Qemu uses udmabuf driver, there would be no copies
>   made of the scanout buffer as it is displayed. This should be
>   possible even when it might reside in device memory such as VRAM.
> 
>   The specific use-case that can be supported with this series is when
>   running Weston or other guest compositors with "additional-devices"
>   feature (./weston --drm-device=card1 --additional-devices=card0).
>   More info about this feature can be found at:
>   https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/736
> 
>   In the above scenario, card1 could be a dGPU or an iGPU and card0
>   would be virtio-gpu in KMS only mode. However, the case where this
>   patch series could be particularly useful is when card1 is a GPU VF
>   that needs to share its scanout buffer (in a zero-copy way) with the
>   GPU PF on the Host. Or, it can also be useful when the scanout buffer
>   needs to be shared between any two GPU devices (assuming one of them
>   is assigned to a Guest VM) as long as they are P2P DMA compatible.
> 
> 
> 
> Is passthrough iGPU-only or passthrough dGPU-only something you intend to
> use?
Our main use-case involves passthrough'ing a headless dGPU VF device and
sharing the Guest compositor's scanout buffer with the dGPU PF device on the
Host. Same goal for headless iGPU VF to iGPU PF device as well.

However, using a combination of iGPU and dGPU where either of them can be
passthrough'd to the Guest is something I think can be supported with this
patch series as well.

> 
> If it's a dGPU + iGPU setup, then the way other people seem to do it is a
> "virtualized" iGPU (via virgl/gfxstream/take your pick) and pass-through the
> dGPU.
> 
> For example, AMD seems to use virgl to allocate and import into the dGPU.
> 
> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23896
> 
> https://lore.kernel.org/all/20231221100016.4022353-1-julia.zh...@amd.com/
> 
> 
> ChromeOS also uses that method (see crrev.com/c/3764931)
> [cc: dGPU architect +Dominik Behr]
> 
> So if iGPU + dGPU is the primary use case, you should be able to use these
> methods as well.  The model would be "virtualized iGPU" + passthrough dGPU,
> not split SoCs.
In our use-case, the goal is to have only one primary GPU (passthrough'd
iGPU/dGPU) do all the rendering (using native DRI drivers) for
clients/compositor and all the outputs and share the scanout buffers with the
secondary GPU (virtio-gpu). Since this is mostly how Mutter (and also Weston)
work in a multi-GPU setup, I am not sure if virgl is needed.

And, doing it this way means that no other userspace components need to be
modified on both the Guest and the Host.

> 
> 
> 
>   As part of the import, the virtio-gpu driver shares the dma
>   addresses and lengths with Qemu which then determines whether the
>   memory region they belong to is owned by a PCI device or whether it
>   is part of the Guest's system ram. If it is the former, it identifies
>   the devid (or bdf) and bar and provides this info (along with offsets
>   and sizes) to the udmabuf driver. In the latter case, instead of the
>   devid and bar it provides the memfd. The udmabuf driver then
>   creates a dmabuf using this info that Qemu shares with Spice for
>   encode via Gstreamer.
> 
>   Note that the virtio-gpu driver registers a move_notify() callback
>   to track location changes associated with the scanout buffer and
>   sends attach/detach backing cmds to Qemu when appropriate. And,
>   synchronization (that is, ensuring that Guest and Host are not
>   using the scanout buffer at the same time) is ensured by pinning/
>   unpinning the dmabuf as part of plane update and using a fence
>   in resource_flush cmd.
> 
> 
> I'm not sure how QEMU's display paths work, but with crosvm if you share
> the guest-created dmabuf with the display, and the guest moves the backing
> pages, the only recourse is to destroy the surface and show a black screen
> to the user: not the best thing experience wise.

RE: [RFC 0/7] drm/virtio: Import scanout buffers from other devices

2024-05-24 Thread Kasireddy, Vivek
Hi Gurchetan,

Thank you for taking a look at this patch series!



On Thu, Mar 28, 2024 at 2:01 AM Vivek Kasireddy
<vivek.kasire...@intel.com> wrote:
Having virtio-gpu import scanout buffers (via prime) from other
devices means that we'd be adding a head to headless GPUs assigned
to a Guest VM or additional heads to regular GPU devices that are
passthrough'd to the Guest. In these cases, the Guest compositor
can render into the scanout buffer using a primary GPU and has the
secondary GPU (virtio-gpu) import it for display purposes.

The main advantage with this is that the imported scanout buffer can
either be displayed locally on the Host (e.g, using Qemu + GTK UI)
or encoded and streamed to a remote client (e.g, Qemu + Spice UI).
Note that since Qemu uses udmabuf driver, there would be no copies
made of the scanout buffer as it is displayed. This should be
possible even when it might reside in device memory such as VRAM.

The specific use-case that can be supported with this series is when
running Weston or other guest compositors with "additional-devices"
feature (./weston --drm-device=card1 --additional-devices=card0).
More info about this feature can be found at:
https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/736

In the above scenario, card1 could be a dGPU or an iGPU and card0
would be virtio-gpu in KMS only mode. However, the case where this
patch series could be particularly useful is when card1 is a GPU VF
that needs to share its scanout buffer (in a zero-copy way) with the
GPU PF on the Host. Or, it can also be useful when the scanout buffer
needs to be shared between any two GPU devices (assuming one of them
is assigned to a Guest VM) as long as they are P2P DMA compatible.

Is passthrough iGPU-only or passthrough dGPU-only something you intend to use?
Our main use-case involves passthrough'ing a headless dGPU VF device and
sharing the Guest compositor's scanout buffer with the dGPU PF device on the
Host. Same goal for headless iGPU VF to iGPU PF device as well.

However, using a combination of iGPU and dGPU where either of them can be
passthrough'd to the Guest is something I think can be supported with this
patch series as well.

If it's a dGPU + iGPU setup, then the way other people seem to do it is a 
"virtualized" iGPU (via virgl/gfxstream/take your pick) and pass-through the 
dGPU.

For example, AMD seems to use virgl to allocate and import into the dGPU.

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23896
https://lore.kernel.org/all/20231221100016.4022353-1-julia.zh...@amd.com/

ChromeOS also uses that method (see crrev.com/c/3764931)
[cc: dGPU architect +Dominik Behr]

So if iGPU + dGPU is the primary use case, you should be able to use these 
methods as well.  The model would be "virtualized iGPU" + passthrough dGPU, not
split SoCs.
In our use-case, the goal is to have only one primary GPU (passthrough'd
iGPU/dGPU) do all the rendering (using native DRI drivers) for
clients/compositor and all the outputs and share the scanout buffers with the
secondary GPU (virtio-gpu). Since this is mostly how Mutter (and also Weston)
work in a multi-GPU setup, I am not sure if virgl is needed.

As part of the import, the virtio-gpu driver shares the dma
addresses and lengths with Qemu which then determines whether the
memory region they belong to is owned by a PCI device or whether it
is part of the Guest's system ram. If it is the former, it identifies
the devid (or bdf) and bar and provides this info (along with offsets
and sizes) to the udmabuf driver. In the latter case, instead of the
devid and bar it provides the memfd. The udmabuf driver then
creates a dmabuf using this info that Qemu shares with Spice for
encode via Gstreamer.

Note that the virtio-gpu driver registers a move_notify() callback
to track location changes associated with the scanout buffer and
sends attach/detach backing cmds to Qemu when appropriate. And,
synchronization (that is, ensuring that Guest and Host are not
using the scanout buffer at the same time) is ensured by pinning/
unpinning the dmabuf as part of plane update and using a fence
in resource_flush cmd.
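
A minimal sketch of how such a move_notify() registration can look with the
dynamic dma-buf importer API (dma_buf_dynamic_attach() and struct
dma_buf_attach_ops are the real kernel interfaces; the virtio_gpu_* names and
the detach-backing helper below are hypothetical placeholders, not the actual
patch):

  #include <linux/dma-buf.h>

  /* Sketch only: a dynamic importer learns about backing-storage moves
   * through .move_notify; the helper names here are hypothetical. */
  static void virtio_gpu_dmabuf_move_notify(struct dma_buf_attachment *attach)
  {
          struct virtio_gpu_object *bo = attach->importer_priv;

          /* Called with the dma-buf's reservation lock held; this is
           * where a detach_backing cmd would be sent to the host. */
          virtio_gpu_detach_backing(bo);
  }

  static const struct dma_buf_attach_ops virtio_gpu_attach_ops = {
          .allow_peer2peer = true,
          .move_notify     = virtio_gpu_dmabuf_move_notify,
  };

  /* At import time (dev is the importing device's struct device): */
  attach = dma_buf_dynamic_attach(dmabuf, dev, &virtio_gpu_attach_ops, bo);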

I'm not sure how QEMU's display paths work, but with crosvm if you share the 
guest-created dmabuf with the display, and the guest moves the backing pages, 
the only recourse is to destroy the surface and show a black screen to the 
user: not the best thing experience wise.
Since Qemu GTK UI uses EGL, there is a blit done from the guest's scanout
buffer onto an EGL-backed buffer on the Host. So, this problem would not
happen as of now.

Only amdgpu calls dma_buf_move_notify(..), and you're probably testing on
Intel only, so you may not be hitting that code path anyways.
I have tested with the Xe driver in the Guest which also calls
dma_buf_move_notify(). However, note that for dGPUs, both Xe and amdgpu
migrate the scanout buffer from vram to system memory as part 

RE: [PATCH v14 0/8] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios

2024-05-23 Thread Kasireddy, Vivek
Hi Gerd, Dave,

> 
> On Thu, May 23, 2024 at 01:13:11PM GMT, Dave Airlie wrote:
> > Hey
> >
> > Gerd, do you have any time to look at this series again, I think at
> > v14 we should probably consider landing it.
> 
> Phew.  Didn't follow recent MM changes closely, don't know much about
> folios beyond LWN coverage.  The changes look sane to my untrained eye,
> I wouldn't rate that a 'review' though.
> 
> The patch series structure looks a bit odd, with patch #5 adding hugetlb
> support, with the functions added being removed again in patch #7 after
> switching to folios.  But maybe regression testing the series is easier
> that way ...
Yes, regression testing is one reason. The other reason is to make it possible
for patches #4 and #5 to be backported to older stable kernels in order to add
back support for mapping hugetlbfs files without depending on folio-related
changes/patches.

> 
> Acked-by: Gerd Hoffmann 
Thank you. Andrew has merged this series to his mm tree.

Thanks,
Vivek

> 
> take care,
>   Gerd



RE: dma-buf sg mangling

2024-05-14 Thread Kasireddy, Vivek
Hi Rob,

> 
> On Mon, May 13, 2024 at 11:27 AM Christian König
>  wrote:
> >
> > Am 10.05.24 um 18:34 schrieb Zack Rusin:
> > > Hey,
> > >
> > > so this is a bit of a silly problem but I'd still like to solve it
> > > properly. The tldr is that virtualized drivers abuse
> > > drm_driver::gem_prime_import_sg_table (at least vmwgfx and xen do,
> > > virtgpu and xen punt on it) because there doesn't seem to be a
> > > universally supported way of converting the sg_table back to a list of
> > > pages without some form of gart to do it.
> >
> > Well the whole point is that you should never touch the pages in the
> > sg_table in the first place.
> >
> > The long term plan is actually to completely remove the pages from that
> > interface.
> >
> > > drm_prime_sg_to_page_array is deprecated (for all the right reasons on
> > > actual hardware) but in our cooky virtualized world we don't have
> > > gart's so what are we supposed to do with the dma_addr_t from the
> > > imported sg_table? What makes it worse (and definitely breaks xen) is
> > > that with CONFIG_DMABUF_DEBUG the sg page_link is mangled via
> > > mangle_sg_table so drm_prime_sg_to_page_array won't even work.
> >
> > XEN and KVM were actually adjusted to not touch the struct pages any
> more.
> >
> > I'm not sure if that work is already upstream or not but I had to
> > explain it over and over again why their approach doesn't work.
> >
> > > The reason why I'm saying it's a bit of a silly problem is that afaik
> > > currently it only affects IGT testing with vgem (because the rest of
> > > external gem objects will be from the virtualized gpu itself which is
> > > different). But do you have any ideas on what we'd like to do with
> > > this long term? i.e. we have a virtualized gpus without iommu, we have
> > > sg_table with some memory and we'd like to import it. Do we just
> > > assume that the sg_table on those configs will always reference cpu
> > > accessible memory (i.e. if it's external it only comes through
> > > drm_gem_shmem_object) and just do some horrific abomination like:
> > > for (i = 0; i < bo->ttm->num_pages; ++i) {
> > >  phys_addr_t pa = dma_to_phys(vmw->drm.dev, bo->ttm->dma_address[i]);
> > >  pages[i] = pfn_to_page(PHYS_PFN(pa));
> > > }
> > > or add a "i know this is cpu accessible, please demangle" flag to
> > > drm_prime_sg_to_page_array or try to have some kind of more
> permanent
> > > solution?
> >
> > Well there is no solution for that. Accessing the underlying struct page
> > through the sg_table is illegal in the first place.
> >
> > So the question is not how to access the struct page, but rather why do
> > you want to do this?
> 
> I _think_ Zack is trying to map guest page-backed buffers to the host
> GPU?  Which would require sending the pfn's in some form to the host
> vmm..
> 
> virtgpu goes the other direction with mapping host page backed GEM
> buffers to guest as "vram" (although for various reasons I kinda want
> to go in the other direction)
I just want to mention that I proposed a way for virtio-gpu to import buffers
from other GPU drivers here:
https://lore.kernel.org/dri-devel/20240328083615.2662516-1-vivek.kasire...@intel.com/

For now, this is only being used for importing scanout buffers, considering the
Mutter and Weston (additional_devices feature) use-cases.

Thanks,
Vivek

> 
> BR,
> -R
> 
> > Regards,
> > Christian.
> >
> > >
> > > z
> >


RE: [PATCH v1 2/2] vfio/pci: Allow MMIO regions to be exported through dma-buf

2024-05-02 Thread Kasireddy, Vivek
Hi Jason,

> 
> On Tue, Apr 30, 2024 at 04:24:50PM -0600, Alex Williamson wrote:
> > > +static vm_fault_t vfio_pci_dma_buf_fault(struct vm_fault *vmf)
> > > +{
> > > + struct vm_area_struct *vma = vmf->vma;
> > > + struct vfio_pci_dma_buf *priv = vma->vm_private_data;
> > > + pgoff_t pgoff = vmf->pgoff;
> > > +
> > > + if (pgoff >= priv->nr_pages)
> > > + return VM_FAULT_SIGBUS;
> > > +
> > > + return vmf_insert_pfn(vma, vmf->address,
> > > +   page_to_pfn(priv->pages[pgoff]));
> > > +}
> >
> > How does this prevent the MMIO space from being mmap'd when disabled at
> > the device?  How is the mmap revoked when the MMIO becomes disabled?
> > Is it part of the move protocol?
In this case, I think the importers that mmap'd the dmabuf need to be tracked
separately and their VMA PTEs need to be zapped when MMIO access is revoked.
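
A rough sketch of what that revoke could look like (unmap_mapping_range() is
the real kernel helper; the zap function name is hypothetical and this is not
the actual fix):

  /* Sketch: drop every CPU PTE established through the dmabuf's file
   * mapping so that subsequent faults can check priv->revoked and
   * return VM_FAULT_SIGBUS instead of touching disabled MMIO. */
  static void vfio_pci_dma_buf_zap_mmaps(struct vfio_pci_dma_buf *priv)
  {
          /* holelen == 0 means "to the end of the file" */
          unmap_mapping_range(priv->dmabuf->file->f_mapping, 0, 0, 1);
  }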

> 
> Yes, we should not have a mmap handler for dmabuf. vfio memory must be
> mmapped in the normal way.
Although optional, I think most dmabuf exporters (drm ones) provide a mmap
handler. Otherwise, there is no easy way to provide CPU access (backup slow
path) to the dmabuf for the importer.
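
For reference, a minimal exporter-side mmap handler (a sketch that assumes the
fault handler quoted above; the vm_ops name is hypothetical) only needs to
wire up the VMA:

  static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf,
                                   struct vm_area_struct *vma)
  {
          struct vfio_pci_dma_buf *priv = dmabuf->priv;

          /* PFNMAP semantics to match vmf_insert_pfn() in the fault path */
          vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
          /* hypothetical vm_ops pointing at the fault handler quoted above */
          vma->vm_ops = &vfio_pci_dma_buf_vm_ops;
          vma->vm_private_data = priv;
          return 0;
  }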

> 
> > > +static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
> > > +{
> > > + struct vfio_pci_dma_buf *priv = dmabuf->priv;
> > > +
> > > + /*
> > > +  * Either this or vfio_pci_dma_buf_cleanup() will remove from the list.
> > > +  * The refcount prevents both.
> > > +  */
> > > + if (priv->vdev) {
> > > + release_p2p_pages(priv, priv->nr_pages);
> > > + kfree(priv->pages);
> > > + down_write(&priv->vdev->memory_lock);
> > > + list_del_init(&priv->dmabufs_elm);
> > > + up_write(&priv->vdev->memory_lock);
> >
> > Why are we acquiring and releasing the memory_lock write lock
> > throughout when we're not modifying the device memory enable state?
> > Ugh, we're using it to implicitly lock dmabufs_elm/dmabufs aren't we...
> 
> Not really implicitly, but yes the dmabufs list is locked by the
> memory_lock.
> 
> > > +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev,
> > > +   u32 flags,
> > > +   struct vfio_device_feature_dma_buf __user *arg,
> > > +   size_t argsz)
> > > +{
> > > + struct vfio_device_feature_dma_buf get_dma_buf;
> > > + struct vfio_region_p2p_area *p2p_areas;
> > > + DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
> > > + struct vfio_pci_dma_buf *priv;
> > > + int i, ret;
> > > +
> > > + ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
> > > +  sizeof(get_dma_buf));
> > > + if (ret != 1)
> > > + return ret;
> > > +
> > > + if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf)))
> > > + return -EFAULT;
> > > +
> > > + p2p_areas = memdup_array_user(&arg->p2p_areas,
> > > +   get_dma_buf.nr_areas,
> > > +   sizeof(*p2p_areas));
> > > + if (IS_ERR(p2p_areas))
> > > + return PTR_ERR(p2p_areas);
> > > +
> > > + priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> > > + if (!priv)
> > > + return -ENOMEM;
> >
> > p2p_areas is leaked.
> 
> What is this new p2p_areas thing? It wasn't in my patch..
As noted in the commit message, this is one of the things I added to
your original patch.

> 
> > > + exp_info.ops = &vfio_pci_dmabuf_ops;
> > > + exp_info.size = priv->nr_pages << PAGE_SHIFT;
> > > + exp_info.flags = get_dma_buf.open_flags;
> >
> > open_flags from userspace are unchecked.
> 
> Huh. That seems to be a dmabuf pattern. :\
> 
> > > + exp_info.priv = priv;
> > > +
> > > + priv->dmabuf = dma_buf_export(&exp_info);
> > > + if (IS_ERR(priv->dmabuf)) {
> > > + ret = PTR_ERR(priv->dmabuf);
> > > + goto err_free_pages;
> > > + }
> > > +
> > > + /* dma_buf_put() now frees priv */
> > > + INIT_LIST_HEAD(&priv->dmabufs_elm);
> > > + down_write(&vdev->memory_lock);
> > > + dma_resv_lock(priv->dmabuf->resv, NULL);
> > > + priv->revoked = !__vfio_pci_memory_enabled(vdev);
> > > + vfio_device_try_get_registration(&vdev->vdev);
> >
> > I guess we're assuming this can't fail in the ioctl path of an open
> > device?
> 
> Seems like a bug added here.. My version had this as
> vfio_device_get(). This stuff has probably changed since I wrote it.
vfio_device_try_get_registration() is essentially doing the same thing as
vfio_device_get() except that we need to check the return value of
vfio_device_try_get_registration(), which I plan to do in v2.
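
Roughly, that v2 change would amount to something like this (a sketch; the
unwind label and error code are assumptions, the rest mirrors the patch
above):

  down_write(&vdev->memory_lock);
  dma_resv_lock(priv->dmabuf->resv, NULL);
  priv->revoked = !__vfio_pci_memory_enabled(vdev);
  if (!vfio_device_try_get_registration(&vdev->vdev)) {
          /* device is going away; don't publish the dmabuf */
          ret = -ENODEV;
          goto err_unlock;        /* hypothetical unwind path */
  }
  list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);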

> 
> > > + list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
> > > + dma_resv_unlock(priv->dmabuf->resv);
> >
> > What was the purpose of locking this?
> 
> ?
> 
> > > +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
> > > +   bool revoked)
> > > +{
> > > + struct vfio_pci_dma_buf *priv;
> > > + struct vfio_pci_dma_buf *tmp;
> > > +
> > > + lockdep_assert_held_write(&vdev->memory_lock);
> > > +
> > > + list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
> > > + if (!get_file_rcu(&priv->dmabuf->file))
> > > + continue;
> >
> > Does 

RE: [PATCH v12 0/8] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios

2024-03-28 Thread Kasireddy, Vivek
Hi David,

> 
> On 25.02.24 08:56, Vivek Kasireddy wrote:
> > Currently, some drivers (e.g, Udmabuf) that want to longterm-pin
> > the pages/folios associated with a memfd, do so by simply taking a
> > reference on them. This is not desirable because the pages/folios
> > may reside in Movable zone or CMA block.
> >
> > Therefore, having drivers use memfd_pin_folios() API ensures that
> > the folios are appropriately pinned via FOLL_PIN for longterm DMA.
> >
> > This patchset also introduces a few helpers and converts the Udmabuf
> > driver to use folios and memfd_pin_folios() API to longterm-pin
> > the folios for DMA. Two new Udmabuf selftests are also included to
> > test the driver and the new API.
> >
> > ---
> 
> Sorry Vivek, I got distracted. What's the state of this? I assume it's
> not in an mm tree yet.
No problem. Right, they are not in any tree yet. The first two mm patches that
add the unpin_folios() and check_and_migrate_movable_folios() helpers still
need to be reviewed.

> 
> I try to get this reviewed this week. If I fail to do that, please ping me.
Ok, sounds good!

Thanks,
Vivek
> 
> --
> Cheers,
> 
> David / dhildenb



RE: [PATCH 2/3] udmabuf: Sync buffer mappings for attached devices

2024-01-29 Thread Kasireddy, Vivek
Hi Andrew,

> 
> On 1/26/24 1:25 AM, Kasireddy, Vivek wrote:
> >>>> Currently this driver creates a SGT table using the CPU as the
> >>>> target device, then performs the dma_sync operations against
> >>>> that SGT. This is backwards to how DMA-BUFs are supposed to behave.
> >>>> This may have worked for the case where these buffers were given
> >>>> only back to the same CPU that produced them as in the QEMU case.
> >>>> And only then because the original author had the dma_sync
> >>>> operations also backwards, syncing for the "device" on begin_cpu.
> >>>> This was noticed and "fixed" in this patch[0].
> >>>>
> >>>> That then meant we were sync'ing from the CPU to the CPU using
> >>>> a pseudo-device "miscdevice". Which then caused another issue
> >>>> due to the miscdevice not having a proper DMA mask (and why should
> >>>> it, the CPU is not a DMA device). The fix for that was an even
> >>>> more egregious hack[1] that declares the CPU is coherent with
> >>>> itself and can access its own memory space..
> >>>>
> >>>> Unwind all this and perform the correct action by doing the dma_sync
> >>>> operations for each device currently attached to the backing buffer.
> >>> Makes sense.
> >>>
> >>>>
> >>>> [0] commit 1ffe09590121 ("udmabuf: fix dma-buf cpu access")
> >>>> [1] commit 9e9fa6a9198b ("udmabuf: Set the DMA mask for the
> udmabuf
> >>>> device (v2)")
> >>>>
> >>>> Signed-off-by: Andrew Davis 
> >>>> ---
> >>>>drivers/dma-buf/udmabuf.c | 41 +++
> >>>>1 file changed, 16 insertions(+), 25 deletions(-)
> >>>>
> >>>> diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> >>>> index 3a23f0a7d112a..ab6764322523c 100644
> >>>> --- a/drivers/dma-buf/udmabuf.c
> >>>> +++ b/drivers/dma-buf/udmabuf.c
> >>>> @@ -26,8 +26,6 @@ MODULE_PARM_DESC(size_limit_mb, "Max size
> of a
> >>>> dmabuf, in megabytes. Default is
> >>>>struct udmabuf {
> >>>>  pgoff_t pagecount;
> >>>>  struct page **pages;
> >>>> -struct sg_table *sg;
> >>>> -struct miscdevice *device;
> >>>>  struct list_head attachments;
> >>>>  struct mutex lock;
> >>>>};
> >>>> @@ -169,12 +167,8 @@ static void unmap_udmabuf(struct
> >>>> dma_buf_attachment *at,
> >>>>static void release_udmabuf(struct dma_buf *buf)
> >>>>{
> >>>>  struct udmabuf *ubuf = buf->priv;
> >>>> -struct device *dev = ubuf->device->this_device;
> >>>>  pgoff_t pg;
> >>>>
> >>>> -if (ubuf->sg)
> >>>> -put_sg_table(dev, ubuf->sg, DMA_BIDIRECTIONAL);
> >>> What happens if the last importer maps the dmabuf but erroneously
> >>> closes it immediately? Would unmap somehow get called in this case?
> >>>
> >>
> >> Good question, had to scan the framework code a bit here. I thought
> >> closing a DMABUF handle would automatically unwind any current
> >> attachments/mappings, but it seems nothing in the framework does that.
> >>
> >> Looks like that is up to the importing drivers[0]:
> >>
> >>> Once a driver is done with a shared buffer it needs to call
> >>> dma_buf_detach() (after cleaning up any mappings) and then
> >>> release the reference acquired with dma_buf_get() by
> >>> calling dma_buf_put().
> >>
> >> So closing a DMABUF after mapping without first unmapping it would
> >> be a bug in the importer, it is not the exporters problem to check
> > It may be a bug in the importer but wouldn't the memory associated
> > with the sg table and attachment get leaked if unmap doesn't get called
> > in this scenario?
> >
> 
> Yes the attachment data would be leaked if unattach was not called,
> but that is true for all DMABUF exporters. The .release() callback
> is meant to be the mirror of the export function and it only cleans
> up that. Same for attach/unattach, map/unmap, etc.. If these calls
> are not balanced then yes they can leak memory.
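
For reference, the balanced importer-side sequence that the dma-buf
documentation expects looks roughly like this (illustrative sketch using the
standard dma-buf API):

  /* import */
  attach = dma_buf_attach(dmabuf, dev);
  sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);

  /* ... DMA to/from the buffer ... */

  /* teardown must mirror the import; skipping any of these steps
   * leaks the sg_table/attachment on the exporter side */
  dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
  dma_buf_detach(dmabuf, attach);
  dma_buf_put(dmabuf);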

RE: [PATCH 2/3] udmabuf: Sync buffer mappings for attached devices

2024-01-25 Thread Kasireddy, Vivek
> >> Currently this driver creates a SGT table using the CPU as the
> >> target device, then performs the dma_sync operations against
> >> that SGT. This is backwards to how DMA-BUFs are supposed to behave.
> >> This may have worked for the case where these buffers were given
> >> only back to the same CPU that produced them as in the QEMU case.
> >> And only then because the original author had the dma_sync
> >> operations also backwards, syncing for the "device" on begin_cpu.
> >> This was noticed and "fixed" in this patch[0].
> >>
> >> That then meant we were sync'ing from the CPU to the CPU using
> >> a pseudo-device "miscdevice". Which then caused another issue
> >> due to the miscdevice not having a proper DMA mask (and why should
> >> it, the CPU is not a DMA device). The fix for that was an even
> >> more egregious hack[1] that declares the CPU is coherent with
> >> itself and can access its own memory space..
> >>
> >> Unwind all this and perform the correct action by doing the dma_sync
> >> operations for each device currently attached to the backing buffer.
> > Makes sense.
> >
> >>
> >> [0] commit 1ffe09590121 ("udmabuf: fix dma-buf cpu access")
> >> [1] commit 9e9fa6a9198b ("udmabuf: Set the DMA mask for the udmabuf
> >> device (v2)")
> >>
> >> Signed-off-by: Andrew Davis 
> >> ---
> >>   drivers/dma-buf/udmabuf.c | 41 +++
> >>   1 file changed, 16 insertions(+), 25 deletions(-)
> >>
> >> diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> >> index 3a23f0a7d112a..ab6764322523c 100644
> >> --- a/drivers/dma-buf/udmabuf.c
> >> +++ b/drivers/dma-buf/udmabuf.c
> >> @@ -26,8 +26,6 @@ MODULE_PARM_DESC(size_limit_mb, "Max size of a
> >> dmabuf, in megabytes. Default is
> >>   struct udmabuf {
> >>pgoff_t pagecount;
> >>struct page **pages;
> >> -  struct sg_table *sg;
> >> -  struct miscdevice *device;
> >>struct list_head attachments;
> >>struct mutex lock;
> >>   };
> >> @@ -169,12 +167,8 @@ static void unmap_udmabuf(struct
> >> dma_buf_attachment *at,
> >>   static void release_udmabuf(struct dma_buf *buf)
> >>   {
> >>struct udmabuf *ubuf = buf->priv;
> >> -  struct device *dev = ubuf->device->this_device;
> >>pgoff_t pg;
> >>
> >> -  if (ubuf->sg)
> >> -  put_sg_table(dev, ubuf->sg, DMA_BIDIRECTIONAL);
> > What happens if the last importer maps the dmabuf but erroneously
> > closes it immediately? Would unmap somehow get called in this case?
> >
> 
> Good question, had to scan the framework code a bit here. I thought
> closing a DMABUF handle would automatically unwind any current
> attachments/mappings, but it seems nothing in the framework does that.
> 
> Looks like that is up to the importing drivers[0]:
> 
> > Once a driver is done with a shared buffer it needs to call
> > dma_buf_detach() (after cleaning up any mappings) and then
> > release the reference acquired with dma_buf_get() by
> > calling dma_buf_put().
> 
> So closing a DMABUF after mapping without first unmapping it would
> be a bug in the importer, it is not the exporters problem to check
It may be a bug in the importer but wouldn't the memory associated
with the sg table and attachment get leaked if unmap doesn't get called
in this scenario?

Thanks,
Vivek

> for (although some more warnings in the framework checking for that
> might not be a bad idea..).
> 
> Andrew
> 
> [0] https://www.kernel.org/doc/html/v6.7/driver-api/dma-buf.html
> 
> > Thanks,
> > Vivek
> >
> >> -
> >>for (pg = 0; pg < ubuf->pagecount; pg++)
> >>put_page(ubuf->pages[pg]);
> >>kfree(ubuf->pages);
> >> @@ -185,33 +179,31 @@ static int begin_cpu_udmabuf(struct dma_buf
> >> *buf,
> >> enum dma_data_direction direction)
> >>   {
> >>struct udmabuf *ubuf = buf->priv;
> >> -  struct device *dev = ubuf->device->this_device;
> >> -  int ret = 0;
> >> -
> >> -  if (!ubuf->sg) {
> >> -  ubuf->sg = get_sg_table(dev, buf, direction);
> >> -  if (IS_ERR(ubuf->sg)) {
> >> -  ret = PTR_ERR(ubuf->sg);
> >> -  ubuf->sg = NULL;
> >> -  }
> >> -  } else {
> >> -  dma_sync_sg_for_cpu(dev, ubuf->sg->sgl, ubuf->sg->nents,
> >> -  direction);
> >> -  }
> >> +  struct udmabuf_attachment *a;
> >>
> >> -  return ret;
> >> +  mutex_lock(&ubuf->lock);
> >> +
> >> +  list_for_each_entry(a, &ubuf->attachments, list)
> >> +  dma_sync_sgtable_for_cpu(a->dev, a->table, direction);
> >> +
> >> +  mutex_unlock(&ubuf->lock);
> >> +
> >> +  return 0;
> >>   }
> >>
> >>   static int end_cpu_udmabuf(struct dma_buf *buf,
> >>   enum dma_data_direction direction)
> >>   {
> >>struct udmabuf *ubuf = buf->priv;
> >> -  struct device *dev = ubuf->device->this_device;
> >> +  struct udmabuf_attachment *a;
> >>
> >> -  if (!ubuf->sg)
> >> -  return -EINVAL;
> >> +  mutex_lock(&ubuf->lock);
> >> +
> >> +  list_for_each_entry(a, &ubuf->attachments, list)
> >> +  

RE: [PATCH 2/3] udmabuf: Sync buffer mappings for attached devices

2024-01-24 Thread Kasireddy, Vivek
Hi Andrew,

> Currently this driver creates a SGT table using the CPU as the
> target device, then performs the dma_sync operations against
> that SGT. This is backwards to how DMA-BUFs are supposed to behave.
> This may have worked for the case where these buffers were given
> only back to the same CPU that produced them as in the QEMU case.
> And only then because the original author had the dma_sync
> operations also backwards, syncing for the "device" on begin_cpu.
> This was noticed and "fixed" in this patch[0].
> 
> That then meant we were sync'ing from the CPU to the CPU using
> a pseudo-device "miscdevice". Which then caused another issue
> due to the miscdevice not having a proper DMA mask (and why should
> it, the CPU is not a DMA device). The fix for that was an even
> more egregious hack[1] that declares the CPU is coherent with
> itself and can access its own memory space..
> 
> Unwind all this and perform the correct action by doing the dma_sync
> operations for each device currently attached to the backing buffer.
Makes sense.

> 
> [0] commit 1ffe09590121 ("udmabuf: fix dma-buf cpu access")
> [1] commit 9e9fa6a9198b ("udmabuf: Set the DMA mask for the udmabuf
> device (v2)")
> 
> Signed-off-by: Andrew Davis 
> ---
>  drivers/dma-buf/udmabuf.c | 41 +++
>  1 file changed, 16 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> index 3a23f0a7d112a..ab6764322523c 100644
> --- a/drivers/dma-buf/udmabuf.c
> +++ b/drivers/dma-buf/udmabuf.c
> @@ -26,8 +26,6 @@ MODULE_PARM_DESC(size_limit_mb, "Max size of a
> dmabuf, in megabytes. Default is
>  struct udmabuf {
>   pgoff_t pagecount;
>   struct page **pages;
> - struct sg_table *sg;
> - struct miscdevice *device;
>   struct list_head attachments;
>   struct mutex lock;
>  };
> @@ -169,12 +167,8 @@ static void unmap_udmabuf(struct
> dma_buf_attachment *at,
>  static void release_udmabuf(struct dma_buf *buf)
>  {
>   struct udmabuf *ubuf = buf->priv;
> - struct device *dev = ubuf->device->this_device;
>   pgoff_t pg;
> 
> - if (ubuf->sg)
> - put_sg_table(dev, ubuf->sg, DMA_BIDIRECTIONAL);
What happens if the last importer maps the dmabuf but erroneously
closes it immediately? Would unmap somehow get called in this case?

Thanks,
Vivek

> -
>   for (pg = 0; pg < ubuf->pagecount; pg++)
>   put_page(ubuf->pages[pg]);
>   kfree(ubuf->pages);
> @@ -185,33 +179,31 @@ static int begin_cpu_udmabuf(struct dma_buf
> *buf,
>enum dma_data_direction direction)
>  {
>   struct udmabuf *ubuf = buf->priv;
> - struct device *dev = ubuf->device->this_device;
> - int ret = 0;
> -
> - if (!ubuf->sg) {
> - ubuf->sg = get_sg_table(dev, buf, direction);
> - if (IS_ERR(ubuf->sg)) {
> - ret = PTR_ERR(ubuf->sg);
> - ubuf->sg = NULL;
> - }
> - } else {
> - dma_sync_sg_for_cpu(dev, ubuf->sg->sgl, ubuf->sg->nents,
> - direction);
> - }
> + struct udmabuf_attachment *a;
> 
> - return ret;
> + mutex_lock(&ubuf->lock);
> +
> + list_for_each_entry(a, &ubuf->attachments, list)
> + dma_sync_sgtable_for_cpu(a->dev, a->table, direction);
> +
> + mutex_unlock(&ubuf->lock);
> +
> + return 0;
>  }
> 
>  static int end_cpu_udmabuf(struct dma_buf *buf,
>  enum dma_data_direction direction)
>  {
>   struct udmabuf *ubuf = buf->priv;
> - struct device *dev = ubuf->device->this_device;
> + struct udmabuf_attachment *a;
> 
> - if (!ubuf->sg)
> - return -EINVAL;
> + mutex_lock(&ubuf->lock);
> +
> + list_for_each_entry(a, &ubuf->attachments, list)
> + dma_sync_sgtable_for_device(a->dev, a->table, direction);
> +
> + mutex_unlock(&ubuf->lock);
> 
> - dma_sync_sg_for_device(dev, ubuf->sg->sgl, ubuf->sg->nents,
> direction);
>   return 0;
>  }
> 
> @@ -307,7 +299,6 @@ static long udmabuf_create(struct miscdevice
> *device,
>   exp_info.priv = ubuf;
>   exp_info.flags = O_RDWR;
> 
> - ubuf->device = device;
> + buf = dma_buf_export(&exp_info);
>   if (IS_ERR(buf)) {
>   ret = PTR_ERR(buf);
> --
> 2.39.2



RE: [PATCH 3/3] udmabuf: Use module_misc_device() to register this device

2024-01-24 Thread Kasireddy, Vivek
Acked-by: Vivek Kasireddy 

> 
> Now that we do not need to call dma_coerce_mask_and_coherent() on our
> miscdevice device, use the module_misc_device() helper for registering and
> module init/exit.
> 
> Signed-off-by: Andrew Davis 
> ---
>  drivers/dma-buf/udmabuf.c | 30 +-
>  1 file changed, 1 insertion(+), 29 deletions(-)
> 
> diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> index ab6764322523c..3028ac3fd9f6a 100644
> --- a/drivers/dma-buf/udmabuf.c
> +++ b/drivers/dma-buf/udmabuf.c
> @@ -392,34 +392,6 @@ static struct miscdevice udmabuf_misc = {
>   .name   = "udmabuf",
>   .fops   = &udmabuf_fops,
>  };
> -
> -static int __init udmabuf_dev_init(void)
> -{
> - int ret;
> -
> - ret = misc_register(&udmabuf_misc);
> - if (ret < 0) {
> - pr_err("Could not initialize udmabuf device\n");
> - return ret;
> - }
> -
> - ret = dma_coerce_mask_and_coherent(udmabuf_misc.this_device,
> -DMA_BIT_MASK(64));
> - if (ret < 0) {
> - pr_err("Could not setup DMA mask for udmabuf device\n");
> - misc_deregister(&udmabuf_misc);
> - return ret;
> - }
> -
> - return 0;
> -}
> -
> -static void __exit udmabuf_dev_exit(void)
> -{
> - misc_deregister(&udmabuf_misc);
> -}
> -
> -module_init(udmabuf_dev_init)
> -module_exit(udmabuf_dev_exit)
> +module_misc_device(udmabuf_misc);
> 
>  MODULE_AUTHOR("Gerd Hoffmann ");
> --
> 2.39.2



RE: [PATCH 1/3] udmabuf: Keep track current device mappings

2024-01-24 Thread Kasireddy, Vivek
Hi Andrew,

> When a device attaches to and maps our buffer we need to keep track
> of this mapping/device. This is needed for synchronization with these
> devices when beginning and ending CPU access for instance. Add a list
> that tracks device mappings as part of {map,unmap}_udmabuf().
> 
> Signed-off-by: Andrew Davis 
> ---
>  drivers/dma-buf/udmabuf.c | 43
> +--
>  1 file changed, 41 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> index c406459996489..3a23f0a7d112a 100644
> --- a/drivers/dma-buf/udmabuf.c
> +++ b/drivers/dma-buf/udmabuf.c
> @@ -28,6 +28,14 @@ struct udmabuf {
>   struct page **pages;
>   struct sg_table *sg;
>   struct miscdevice *device;
> + struct list_head attachments;
> + struct mutex lock;
> +};
> +
> +struct udmabuf_attachment {
> + struct device *dev;
> + struct sg_table *table;
> + struct list_head list;
>  };
> 
>  static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
> @@ -120,14 +128,42 @@ static void put_sg_table(struct device *dev, struct
> sg_table *sg,
>  static struct sg_table *map_udmabuf(struct dma_buf_attachment *at,
>   enum dma_data_direction direction)
>  {
> - return get_sg_table(at->dev, at->dmabuf, direction);
> + struct udmabuf *ubuf = at->dmabuf->priv;
> + struct udmabuf_attachment *a;
> +
> + a = kzalloc(sizeof(*a), GFP_KERNEL);
> + if (!a)
> + return ERR_PTR(-ENOMEM);
> +
> + a->table = get_sg_table(at->dev, at->dmabuf, direction);
> + if (IS_ERR(a->table)) {
> + kfree(a);
> + return a->table;
Isn't that a use-after-free bug?
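One way to avoid it, for illustration (a sketch, not necessarily how this
will end up being fixed), is to grab the error pointer before freeing the
attachment:

  a->table = get_sg_table(at->dev, at->dmabuf, direction);
  if (IS_ERR(a->table)) {
          struct sg_table *table = a->table;

          kfree(a);
          return table;
  }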
Rest of the patch lgtm.

Thanks,
Vivek

> + }
> +
> + a->dev = at->dev;
> +
> + mutex_lock(&ubuf->lock);
> + list_add(&a->list, &ubuf->attachments);
> + mutex_unlock(&ubuf->lock);
> +
> + return a->table;
>  }
> 
>  static void unmap_udmabuf(struct dma_buf_attachment *at,
> struct sg_table *sg,
> enum dma_data_direction direction)
>  {
> - return put_sg_table(at->dev, sg, direction);
> + struct udmabuf_attachment *a = at->priv;
> + struct udmabuf *ubuf = at->dmabuf->priv;
> +
> + mutex_lock(&ubuf->lock);
> + list_del(&a->list);
> + mutex_unlock(&ubuf->lock);
> +
> + put_sg_table(at->dev, sg, direction);
> +
> + kfree(a);
>  }
> 
>  static void release_udmabuf(struct dma_buf *buf)
> @@ -263,6 +299,9 @@ static long udmabuf_create(struct miscdevice
> *device,
>   memfd = NULL;
>   }
> 
> + INIT_LIST_HEAD(&ubuf->attachments);
> + mutex_init(&ubuf->lock);
> +
>   exp_info.ops  = &udmabuf_ops;
>   exp_info.size = ubuf->pagecount << PAGE_SHIFT;
>   exp_info.priv = ubuf;
> --
> 2.39.2



RE: [PATCH RESEND] drm/virtio: set segment size for virtio_gpu device

2024-01-24 Thread Kasireddy, Vivek
> Hej,
> 
> debug dma code is not happy with virtio gpu (arm64 VM):
> 
> [  305.881733] [ cut here ]
> [  305.883117] DMA-API: virtio-pci :07:00.0: mapping sg segment longer
> than device claims to support [len=262144] [max=65536]
> [  305.885976] WARNING: CPU: 8 PID: 2002 at kernel/dma/debug.c:1177
> check_sg_segment+0x2d0/0x420
> [  305.888038] Modules linked in: crct10dif_ce(+) polyval_ce polyval_generic
> ghash_ce virtio_gpu(+) virtio_net net_failover virtio_blk(+) virtio_dma_buf
> virtio_console failover virtio_mmio scsi_dh_r dac scsi_dh_emc scsi_dh_alua
> dm_multipath qemu_fw_cfg
> [  305.893496] CPU: 8 PID: 2002 Comm: (udev-worker) Not tainted 6.7.0 #1
> [  305.895070] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-
> 20230524-3.fc37 05/24/2023
> [  305.897112] pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
> BTYPE=--)
> [  305.897129] pc : check_sg_segment+0x2d0/0x420
> [  305.897139] lr : check_sg_segment+0x2d0/0x420
> [  305.897145] sp : 80008ffc69d0
> [  305.897149] x29: 80008ffc69d0 x28: dfff8000 x27:
> b0232879e578
> [  305.897167] x26:  x25: b0232778c060 x24:
> 19ee9b2060c0
> [  305.897181] x23:  x22: b0232ab9ce10 x21:
> 19eece5c64ac
> [  305.906942] x20: 0001 x19: 19eece5c64a0 x18:
> 19eec36fc304
> [  305.908633] x17: 6e61687420726567 x16: 6e6f6c20746e656d x15:
> 6765732067732067
> [  305.910352] x14: f1f1f1f1 x13: 0001 x12:
> 700011ff8cc3
> [  305.912044] x11: 100011ff8cc2 x10: 700011ff8cc2 x9 :
> b02324a70e54
> [  305.913751] x8 : 8fffee00733e x7 : 80008ffc6617 x6 :
> 0001
> [  305.915451] x5 : 80008ffc6610 x4 : 1fffe33e70564622 x3 :
> dfff8000
> [  305.917158] x2 :  x1 :  x0 :
> 19f382b23100
> [  305.918864] Call trace:
> [  305.919474]  check_sg_segment+0x2d0/0x420
> [  305.920443]  debug_dma_map_sg+0x2a0/0x428
> [  305.921402]  __dma_map_sg_attrs+0xf4/0x1a8
> [  305.922388]  dma_map_sgtable+0x7c/0x100
> [  305.923318]  drm_gem_shmem_get_pages_sgt+0x15c/0x328
> [  305.924500]
> virtio_gpu_object_shmem_init.constprop.0.isra.0+0x50/0x628 [virtio_gpu]
> [  305.926390]  virtio_gpu_object_create+0x198/0x478 [virtio_gpu]
> [  305.927802]  virtio_gpu_mode_dumb_create+0x2a0/0x4c8 [virtio_gpu]
> [  305.929272]  drm_mode_create_dumb+0x1c0/0x280
> [  305.930327]  drm_client_framebuffer_create+0x140/0x328
> [  305.931555]  drm_fbdev_generic_helper_fb_probe+0x1bc/0x5c0
> [  305.932871]  __drm_fb_helper_initial_config_and_unlock+0x1e0/0x630
> [  305.934372]  drm_fb_helper_initial_config+0x50/0x68
> [  305.935540]  drm_fbdev_generic_client_hotplug+0x148/0x200
> [  305.936819]  drm_client_register+0x130/0x200
> [  305.937856]  drm_fbdev_generic_setup+0xe8/0x320
> [  305.938932]  virtio_gpu_probe+0x13c/0x2d0 [virtio_gpu]
> [  305.940190]  virtio_dev_probe+0x38c/0x600
> [  305.941153]  really_probe+0x334/0x9c8
> [  305.942047]  __driver_probe_device+0x164/0x3d8
> [  305.943102]  driver_probe_device+0x64/0x180
> [  305.944094]  __driver_attach+0x1d4/0x488
> [  305.945045]  bus_for_each_dev+0x104/0x198
> [  305.946008]  driver_attach+0x44/0x68
> [  305.946892]  bus_add_driver+0x23c/0x4a8
> [  305.947838]  driver_register+0xf8/0x3d0
> [  305.948770]  register_virtio_driver+0x74/0xc8
> [  305.949836]  virtio_gpu_driver_init+0x20/0xff8 [virtio_gpu]
> [  305.951237]  do_one_initcall+0x17c/0x8c0
> [  305.952182]  do_init_module+0x1dc/0x630
> [  305.953106]  load_module+0x10c0/0x1638
> [  305.954012]  init_module_from_file+0xe0/0x140
> [  305.955058]  idempotent_init_module+0x2c0/0x590
> [  305.956174]  __arm64_sys_finit_module+0xb4/0x140
> [  305.957282]  invoke_syscall+0xd8/0x258
> [  305.958187]  el0_svc_common.constprop.0+0x16c/0x240
> [  305.959526]  do_el0_svc+0x48/0x68
> [  305.960456]  el0_svc+0x58/0x118
> [  305.961310]  el0t_64_sync_handler+0x120/0x130
> [  305.962510]  el0t_64_sync+0x194/0x198
> [  305.963509] irq event stamp: 37944
> [  305.964412] hardirqs last  enabled at (37943): []
> console_unlock+0x1a4/0x1c8
> [  305.966602] hardirqs last disabled at (37944): []
> el1_dbg+0x24/0xa0
> [  305.968535] softirqs last  enabled at (37930): []
> __do_softirq+0x8e4/0xe1c
> [  305.970781] softirqs last disabled at (37925): []
> do_softirq+0x18/0x30
> [  305.972937] ---[ end trace  ]---
> 
> The 64K max_segment size of the device seems to be inherited by PCI's
> default.
> The sg list is created via this drm helper:
> 
> struct sg_table *drm_prime_pages_to_sg(struct drm_device *dev,
>  struct page **pages, unsigned int nr_pages)
> {
> ...
>   if (dev)
>   max_segment = dma_max_mapping_size(dev->dev);
>   if (max_segment == 0)
>   max_segment = UINT_MAX;
>   err = sg_alloc_table_from_pages_segment(sg, pages, nr_pages, 0,
>   nr_pages << 

RE: [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)

2023-12-13 Thread Kasireddy, Vivek
Hi David,

> 
> On 12.12.23 08:38, Vivek Kasireddy wrote:
> > For drivers that would like to longterm-pin the folios associated
> > with a memfd, the memfd_pin_folios() API provides an option to
> > not only pin the folios via FOLL_PIN but also to check and migrate
> > them if they reside in movable zone or CMA block. This API
> > currently works with memfds but it should work with any files
> > that belong to either shmemfs or hugetlbfs. Files belonging to
> > other filesystems are rejected for now.
> >
> > The folios need to be located first before pinning them via FOLL_PIN.
> > If they are found in the page cache, they can be immediately pinned.
> > Otherwise, they need to be allocated using the filesystem specific
> > APIs and then pinned.
> >
> > v2:
> > - Drop gup_flags and improve comments and commit message (David)
> > - Allocate a page if we cannot find in page cache for the hugetlbfs
> >case as well (David)
> > - Don't unpin pages if there is a migration related failure (David)
> > - Drop the unnecessary nr_pages <= 0 check (Jason)
> > - Have the caller of the API pass in file * instead of fd (Jason)
> >
> > v3: (David)
> > - Enclose the huge page allocation code with #ifdef
> CONFIG_HUGETLB_PAGE
> >(Build error reported by kernel test robot )
> > - Don't forget memalloc_pin_restore() on non-migration related errors
> > - Improve the readability of the cleanup code associated with
> >non-migration related errors
> > - Augment the comments by describing FOLL_LONGTERM like behavior
> > - Include the R-b tag from Jason
> >
> > v4:
> > - Remove the local variable "page" and instead use 3 return statements
> >in alloc_file_page() (David)
> > - Add the R-b tag from David
> >
> > v5: (David)
> > - For hugetlb case, ensure that we only obtain head pages from the
> >mapping by using __filemap_get_folio() instead of find_get_page_flags()
> > - Handle -EEXIST when two or more potential users try to simultaneously
> >add a huge page to the mapping by forcing them to retry on failure
> >
> > v6: (Christoph)
> > - Rename this API to memfd_pin_user_pages() to make it clear that it
> >is intended for memfds
> > - Move the memfd page allocation helper from gup.c to memfd.c
> > - Fix indentation errors in memfd_pin_user_pages()
> > - For contiguous ranges of folios, use a helper such as
> >filemap_get_folios_contig() to lookup the page cache in batches
> >
> > v7:
> > - Rename this API to memfd_pin_folios() and make it return folios
> >and offsets instead of pages (David)
> > - Don't continue processing the folios in the batch returned by
> >filemap_get_folios_contig() if they do not have correct next_idx
> > - Add the R-b tag from Christoph
> >
> 
> Sorry, I'm still not happy about the current state, because (1) the
> folio vs. pages handling is still mixed (2) we're returning+pinning a
> large folio multiple times.
I can address (1) in a follow-up series and as far as (2) is concerned, my
understanding is that we need to increase the folio's refcount as and
when the folio's tail pages are used. Is this not the case? It appears
this is what unpin_user_pages() expects as well. Do you see any
concern with this?

> 
> See below if there is an easy way to clean this up.
> 
> >> @@ -5,6 +5,7 @@
> >   #include 
> >
> >   #include 
> > +#include 
> >   #include 
> >   #include 
> >   #include 
> > @@ -17,6 +18,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> >   #include 
> >   #include 
> >
> > @@ -3410,3 +3412,156 @@ long pin_user_pages_unlocked(unsigned long
> start, unsigned long nr_pages,
> >  , gup_flags);
> >   }
> >   EXPORT_SYMBOL(pin_user_pages_unlocked);
> > +
> > +/**
> > + * memfd_pin_folios() - pin folios associated with a memfd
> > + * @memfd:  the memfd whose folios are to be pinned
> > + * @start:  starting memfd offset
> > + * @nr_pages:   number of pages from start to pin
> 
> We're not pinning pages. An inclusive range [start, end] would be clearer.
Ok, I'll make this change in the next version.

> 
> > + * @folios: array that receives pointers to the folios pinned.
> > + *  Should be at-least nr_pages long.
> > + * @offsets:array that receives offsets of pages in their folios.
> > + *  Should be at-least nr_pages long.
> 
> See below, I'm wondering if this is really required once we return each folio
> only once.
The offsets can be calculated by the caller (udmabuf) as well but doing so
in this interface would prevent special handling in the caller for the hugetlb
case. Please look at patch 5 in this series (udmabuf: Pin the pages using
memfd_pin_folios() API (v5)) for more details as to what I mean.
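
As a rough caller-side sketch based on the v7 signature documented above (the
exact prototype and parameter types may still change; the variable names here
are hypothetical):

  struct folio **folios;
  pgoff_t *offsets;
  long ret;

  folios  = kvmalloc_array(nr_pages, sizeof(*folios), GFP_KERNEL);
  offsets = kvmalloc_array(nr_pages, sizeof(*offsets), GFP_KERNEL);
  if (!folios || !offsets)
          return -ENOMEM;

  /* pin nr_pages pages of the memfd starting at 'start'; on success
   * folios[i] + offsets[i] together identify page i for later DMA/unpin */
  ret = memfd_pin_folios(memfd, start, nr_pages, folios, offsets);
  if (ret < 0)
          return ret;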

> 
> > + *
> > + * Attempt to pin folios associated with a memfd; given that a memfd is
> > + * either backed by shmem or hugetlb, the folios can either be found in
> > + * the page cache or need to be allocated if necessary. Once the folios
> > + * are located, they are 

RE: [PATCH v6 3/5] mm/gup: Introduce memfd_pin_user_pages() for pinning memfd pages (v6)

2023-12-07 Thread Kasireddy, Vivek
Hi David,

> >
> >> On 05.12.23 06:35, Vivek Kasireddy wrote:
> >>> For drivers that would like to longterm-pin the pages associated
> >>> with a memfd, the pin_user_pages_fd() API provides an option to
> >>> not only pin the pages via FOLL_PIN but also to check and migrate
> >>> them if they reside in movable zone or CMA block. This API
> >>> currently works with memfds but it should work with any files
> >>> that belong to either shmemfs or hugetlbfs. Files belonging to
> >>> other filesystems are rejected for now.
> >>>
> >>> The pages need to be located first before pinning them via FOLL_PIN.
> >>> If they are found in the page cache, they can be immediately pinned.
> >>> Otherwise, they need to be allocated using the filesystem specific
> >>> APIs and then pinned.
> >>>
> >>> v2:
> >>> - Drop gup_flags and improve comments and commit message (David)
> >>> - Allocate a page if we cannot find in page cache for the hugetlbfs
> >>> case as well (David)
> >>> - Don't unpin pages if there is a migration related failure (David)
> >>> - Drop the unnecessary nr_pages <= 0 check (Jason)
> >>> - Have the caller of the API pass in file * instead of fd (Jason)
> >>>
> >>> v3: (David)
> >>> - Enclose the huge page allocation code with #ifdef
> >> CONFIG_HUGETLB_PAGE
> >>> (Build error reported by kernel test robot )
> >>> - Don't forget memalloc_pin_restore() on non-migration related errors
> >>> - Improve the readability of the cleanup code associated with
> >>> non-migration related errors
> >>> - Augment the comments by describing FOLL_LONGTERM like behavior
> >>> - Include the R-b tag from Jason
> >>>
> >>> v4:
> >>> - Remove the local variable "page" and instead use 3 return statements
> >>> in alloc_file_page() (David)
> >>> - Add the R-b tag from David
> >>>
> >>> v5: (David)
> >>> - For hugetlb case, ensure that we only obtain head pages from the
> >>> mapping by using __filemap_get_folio() instead of
> find_get_page_flags()
> >>> - Handle -EEXIST when two or more potential users try to simultaneously
> >>> add a huge page to the mapping by forcing them to retry on failure
> >>>
> >>> v6: (Christoph)
> >>> - Rename this API to memfd_pin_user_pages() to make it clear that it
> >>> is intended for memfds
> >>> - Move the memfd page allocation helper from gup.c to memfd.c
> >>> - Fix indentation errors in memfd_pin_user_pages()
> >>> - For contiguous ranges of folios, use a helper such as
> >>> filemap_get_folios_contig() to lookup the page cache in batches
> >>>
> >>> Cc: David Hildenbrand 
> >>> Cc: Christoph Hellwig 
> >>> Cc: Daniel Vetter 
> >>> Cc: Mike Kravetz 
> >>> Cc: Hugh Dickins 
> >>> Cc: Peter Xu 
> >>> Cc: Gerd Hoffmann 
> >>> Cc: Dongwon Kim 
> >>> Cc: Junxiao Chang 
> >>> Suggested-by: Jason Gunthorpe 
> >>> Reviewed-by: Jason Gunthorpe  (v2)
> >>> Reviewed-by: David Hildenbrand  (v3)
> >>> Signed-off-by: Vivek Kasireddy 
> >>> ---
> >>>include/linux/memfd.h |   5 +++
> >>>include/linux/mm.h|   2 +
> >>>mm/gup.c  | 102
> ++
> >>>mm/memfd.c|  34 ++
> >>>4 files changed, 143 insertions(+)
> >>>
> >>> diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> >>> index e7abf6fa4c52..6fc0d1282151 100644
> >>> --- a/include/linux/memfd.h
> >>> +++ b/include/linux/memfd.h
> >>> @@ -6,11 +6,16 @@
> >>>
> >>>#ifdef CONFIG_MEMFD_CREATE
> >>>extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned 
> >>> int
> >> arg);
> >>> +extern struct page *memfd_alloc_page(struct file *memfd, pgoff_t idx);
> >>>#else
> >>>static inline long memfd_fcntl(struct file *f, unsigned int c, 
> >>> unsigned int
> a)
> >>>{
> >>>   return -EINVAL;
> >>>}
> >>> +static inline struct page *memfd_alloc_page(struct file *memfd, pgoff_t
> >> idx)
> >>> +{
> >>> + return ERR_PTR(-EINVAL);
> >>> +}
> >>>#endif
> >>>
> >>>#endif /* __LINUX_MEMFD_H */
> >>> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >>> index 418d26608ece..ac69db45509f 100644
> >>> --- a/include/linux/mm.h
> >>> +++ b/include/linux/mm.h
> >>> @@ -2472,6 +2472,8 @@ long get_user_pages_unlocked(unsigned long
> >> start, unsigned long nr_pages,
> >>>   struct page **pages, unsigned int gup_flags);
> >>>long pin_user_pages_unlocked(unsigned long start, unsigned long
> >> nr_pages,
> >>>   struct page **pages, unsigned int gup_flags);
> >>> +long memfd_pin_user_pages(struct file *file, pgoff_t start,
> >>> +   unsigned long nr_pages, struct page **pages);
> >>>
> >>>int get_user_pages_fast(unsigned long start, int nr_pages,
> >>>   unsigned int gup_flags, struct page **pages);
> >>> diff --git a/mm/gup.c b/mm/gup.c
> >>> index 231711efa390..eb93d1ec9dc6 100644
> >>> --- a/mm/gup.c
> >>> +++ b/mm/gup.c
> >>> @@ -5,6 +5,7 @@
> >>>#include 
> >>>
> >>>#include 
> >>> +#include 
> 

RE: [PATCH v6 3/5] mm/gup: Introduce memfd_pin_user_pages() for pinning memfd pages (v6)

2023-12-06 Thread Kasireddy, Vivek
Hi David,

> On 05.12.23 06:35, Vivek Kasireddy wrote:
> > For drivers that would like to longterm-pin the pages associated
> > with a memfd, the pin_user_pages_fd() API provides an option to
> > not only pin the pages via FOLL_PIN but also to check and migrate
> > them if they reside in movable zone or CMA block. This API
> > currently works with memfds but it should work with any files
> > that belong to either shmemfs or hugetlbfs. Files belonging to
> > other filesystems are rejected for now.
> >
> > The pages need to be located first before pinning them via FOLL_PIN.
> > If they are found in the page cache, they can be immediately pinned.
> > Otherwise, they need to be allocated using the filesystem specific
> > APIs and then pinned.
> >
> > v2:
> > - Drop gup_flags and improve comments and commit message (David)
> > - Allocate a page if we cannot find in page cache for the hugetlbfs
> >case as well (David)
> > - Don't unpin pages if there is a migration related failure (David)
> > - Drop the unnecessary nr_pages <= 0 check (Jason)
> > - Have the caller of the API pass in file * instead of fd (Jason)
> >
> > v3: (David)
> > - Enclose the huge page allocation code with #ifdef
> CONFIG_HUGETLB_PAGE
> >(Build error reported by kernel test robot )
> > - Don't forget memalloc_pin_restore() on non-migration related errors
> > - Improve the readability of the cleanup code associated with
> >non-migration related errors
> > - Augment the comments by describing FOLL_LONGTERM like behavior
> > - Include the R-b tag from Jason
> >
> > v4:
> > - Remove the local variable "page" and instead use 3 return statements
> >in alloc_file_page() (David)
> > - Add the R-b tag from David
> >
> > v5: (David)
> > - For hugetlb case, ensure that we only obtain head pages from the
> >mapping by using __filemap_get_folio() instead of find_get_page_flags()
> > - Handle -EEXIST when two or more potential users try to simultaneously
> >add a huge page to the mapping by forcing them to retry on failure
> >
> > v6: (Christoph)
> > - Rename this API to memfd_pin_user_pages() to make it clear that it
> >is intended for memfds
> > - Move the memfd page allocation helper from gup.c to memfd.c
> > - Fix indentation errors in memfd_pin_user_pages()
> > - For contiguous ranges of folios, use a helper such as
> >filemap_get_folios_contig() to lookup the page cache in batches
> >
> > Cc: David Hildenbrand 
> > Cc: Christoph Hellwig 
> > Cc: Daniel Vetter 
> > Cc: Mike Kravetz 
> > Cc: Hugh Dickins 
> > Cc: Peter Xu 
> > Cc: Gerd Hoffmann 
> > Cc: Dongwon Kim 
> > Cc: Junxiao Chang 
> > Suggested-by: Jason Gunthorpe 
> > Reviewed-by: Jason Gunthorpe  (v2)
> > Reviewed-by: David Hildenbrand  (v3)
> > Signed-off-by: Vivek Kasireddy 
> > ---
> >   include/linux/memfd.h |   5 +++
> >   include/linux/mm.h|   2 +
> >   mm/gup.c  | 102 ++
> >   mm/memfd.c|  34 ++
> >   4 files changed, 143 insertions(+)
> >
> > diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> > index e7abf6fa4c52..6fc0d1282151 100644
> > --- a/include/linux/memfd.h
> > +++ b/include/linux/memfd.h
> > @@ -6,11 +6,16 @@
> >
> >   #ifdef CONFIG_MEMFD_CREATE
> >   extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int
> arg);
> > +extern struct page *memfd_alloc_page(struct file *memfd, pgoff_t idx);
> >   #else
> >   static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned 
> > int a)
> >   {
> > return -EINVAL;
> >   }
> > +static inline struct page *memfd_alloc_page(struct file *memfd, pgoff_t
> idx)
> > +{
> > +   return ERR_PTR(-EINVAL);
> > +}
> >   #endif
> >
> >   #endif /* __LINUX_MEMFD_H */
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 418d26608ece..ac69db45509f 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2472,6 +2472,8 @@ long get_user_pages_unlocked(unsigned long
> start, unsigned long nr_pages,
> > struct page **pages, unsigned int gup_flags);
> >   long pin_user_pages_unlocked(unsigned long start, unsigned long
> nr_pages,
> > struct page **pages, unsigned int gup_flags);
> > +long memfd_pin_user_pages(struct file *file, pgoff_t start,
> > + unsigned long nr_pages, struct page **pages);
> >
> >   int get_user_pages_fast(unsigned long start, int nr_pages,
> > unsigned int gup_flags, struct page **pages);
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 231711efa390..eb93d1ec9dc6 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -5,6 +5,7 @@
> >   #include 
> >
> >   #include 
> > +#include 
> >   #include 
> >   #include 
> >   #include 
> > @@ -17,6 +18,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> >   #include 
> >   #include 
> >
> > @@ -3410,3 +3412,103 @@ long pin_user_pages_unlocked(unsigned long
> start, unsigned long nr_pages,
> >  

RE: [PATCH v6 3/5] mm/gup: Introduce memfd_pin_user_pages() for pinning memfd pages (v6)

2023-12-06 Thread Kasireddy, Vivek
Hi,

> > +struct page *memfd_alloc_page(struct file *memfd, pgoff_t idx)
> > +{
> > +#ifdef CONFIG_HUGETLB_PAGE
> > +   struct folio *folio;
> > +   int err;
> > +
> > +   if (is_file_hugepages(memfd)) {
> > +   folio = alloc_hugetlb_folio_nodemask(hstate_file(memfd),
> > +NUMA_NO_NODE,
> > +NULL,
> > +GFP_USER);
> > +   if (folio && folio_try_get(folio)) {
> > +   err = hugetlb_add_to_page_cache(folio,
> 
> If alloc_hugetlb_folio_nodemask moved out of the CONFIG_HUGETLB_PAGE
> ifdef, the ifdef here could go away.
Unlike alloc_hugetlb_folio_nodemask(), hugetlb_add_to_page_cache() does not
get exposed without enabling CONFIG_HUGETLB_PAGE.

> 
> Either way, this looks good:
> 
> Reviewed-by: Christoph Hellwig 
Thank you for the review.

Thanks,
Vivek
> 
> 



RE: [PATCH v5 3/5] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v5)

2023-11-29 Thread Kasireddy, Vivek
Hi Christoph,

> 
> > +static struct page *alloc_file_page(struct file *file, pgoff_t idx)
> 
> alloc_file_pages seems like a weird name for something that assumes
> it is called either on a hugetlbfs or shmemfs file (without any
I see your concern. The word "file" does make it look like this API works
with all kinds of files although it is meant to specifically work with files 
that
belong to shmemfs or hugetlbfs. Since it is intended to work with memfds
in particular, I'll rename this helper to alloc_memfd_page(). I think it also
makes sense to do s/file/memfd in this whole patch. Does this sound ok?

> asserts that this is true).  gup.c also seems like a very odd place
> for such a helper.
I only created this helper to cleanly separate lookup and creation and to
reduce the level of indentation in pin_user_pages_fd(). Anyway, would
mm/memfd.c be a more appropriate location?

> 
> > + * Attempt to pin pages associated with a file that belongs to either
> shmem
> > + * or hugetlb.
> 
> Why do we need a special case for hugetlb or shmemfs?
As mentioned above, this API is mainly intended for memfds and, from what I can
see, memfds are backed by files belonging to either shmemfs or hugetlbfs.

> 
> > +   if (!file)
> > +   return -EINVAL;
> > +
> > +   if (!shmem_file(file) && !is_file_hugepages(file))
> > +   return -EINVAL;
> 
> Indentation is messed up here.
Ok, will fix it in next version.

> 
> > +   for (i = 0; i < nr_pages; i++) {
> > +   /*
> > +* In most cases, we should be able to find the page
> > +* in the page cache. If we cannot find it, we try to
> > +* allocate one and add it to the page cache.
> > +*/
> > +retry:
> > +   folio = __filemap_get_folio(file->f_mapping,
> > +   start + i,
> > +   FGP_ACCESSED, 0);
> 
> __filemap_get_folio is a very inefficient way to find a
> contiguous range of folios, I'd suggest to look into something that
> batches instead.
Ok, I will try to explore using filemap_get_folios_contig() or other
related APIs to make the lookup more efficient.
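
Something like the following (untested) sketch is what I have in mind; the
pinning and the allocation fallback for holes are omitted, and the names are
placeholders rather than the final patch:

        struct folio_batch fbatch;
        pgoff_t index = start;
        long nr_found = 0;

        folio_batch_init(&fbatch);
        while (nr_found < nr_pages) {
                unsigned int i, nr;

                nr = filemap_get_folios_contig(memfd->f_mapping, &index,
                                               start + nr_pages - 1, &fbatch);
                if (!nr)
                        break;  /* hole: fall back to the allocation path */

                for (i = 0; i < nr && nr_found < nr_pages; i++)
                        pages[nr_found++] = folio_page(fbatch.folios[i], 0);

                /* refcount/pin handling of the found folios is omitted;
                 * large folios would also need per-page handling here */
                folio_batch_release(&fbatch);
        }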

> 
> > +   page = IS_ERR(folio) ? NULL : &folio->page;
> > +   if (!page) {
> > +   page = alloc_file_page(file, start + i);
> > +   if (IS_ERR(page)) {
> > +   ret = PTR_ERR(page);
> > +   if (ret == -EEXIST)
> > +   goto retry;
> > +   goto err;
> > +   }
> 
> This mix of folios and pages is odd.  Especially as hugetlbfs by
> definitions uses large folios.
Yeah, it does look odd but I ultimately need a list of pages to call
check_and_migrate_movable_pages() and also to populate a scatterlist.

Thanks,
Vivek



RE: [PATCH v4 3/5] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v4)

2023-11-20 Thread Kasireddy, Vivek
Hi David,

> 
> On 18.11.23 07:32, Vivek Kasireddy wrote:
> > For drivers that would like to longterm-pin the pages associated
> > with a file, the pin_user_pages_fd() API provides an option to
> > not only pin the pages via FOLL_PIN but also to check and migrate
> > them if they reside in movable zone or CMA block. This API
> > currently works with files that belong to either shmem or hugetlbfs.
> > Files belonging to other filesystems are rejected for now.
> >
> > The pages need to be located first before pinning them via FOLL_PIN.
> > If they are found in the page cache, they can be immediately pinned.
> > Otherwise, they need to be allocated using the filesystem specific
> > APIs and then pinned.
> >
> > v2:
> > - Drop gup_flags and improve comments and commit message (David)
> > - Allocate a page if we cannot find in page cache for the hugetlbfs
> >case as well (David)
> > - Don't unpin pages if there is a migration related failure (David)
> > - Drop the unnecessary nr_pages <= 0 check (Jason)
> > - Have the caller of the API pass in file * instead of fd (Jason)
> >
> > v3: (David)
> > - Enclose the huge page allocation code with #ifdef
> CONFIG_HUGETLB_PAGE
> >(Build error reported by kernel test robot )
> > - Don't forget memalloc_pin_restore() on non-migration related errors
> > - Improve the readability of the cleanup code associated with
> >non-migration related errors
> > - Augment the comments by describing FOLL_LONGTERM like behavior
> > - Include the R-b tag from Jason
> >
> > v4:
> > - Remove the local variable "page" and instead use 3 return statements
> >in alloc_file_page() (David)
> > - Add the R-b tag from David
> >
> > Cc: David Hildenbrand 
> > Cc: Daniel Vetter 
> > Cc: Mike Kravetz 
> > Cc: Hugh Dickins 
> > Cc: Peter Xu 
> > Cc: Gerd Hoffmann 
> > Cc: Dongwon Kim 
> > Cc: Junxiao Chang 
> > Suggested-by: Jason Gunthorpe 
> > Reviewed-by: Jason Gunthorpe  (v2)
> > Reviewed-by: David Hildenbrand  (v3)
> > Signed-off-by: Vivek Kasireddy 
> > ---
> 
> 
> [...]
> 
> 
> > +static struct page *alloc_file_page(struct file *file, pgoff_t idx)
> > +{
> > +#ifdef CONFIG_HUGETLB_PAGE
> > +   struct folio *folio;
> > +   int err;
> > +
> > +   if (is_file_hugepages(file)) {
> > +   folio = alloc_hugetlb_folio_nodemask(hstate_file(file),
> > +NUMA_NO_NODE,
> > +NULL,
> > +GFP_USER);
> > +   if (folio && folio_try_get(folio)) {
> > +   err = hugetlb_add_to_page_cache(folio,
> > +   file->f_mapping,
> > +   idx);
> > +   if (err) {
> > +   folio_put(folio);
> > +   free_huge_folio(folio);
> > +   return ERR_PTR(err);
> > +   }
> > +   return &folio->page;
> 
> While looking at the user of pin_user_pages_fd(), I realized something:
> 
> Assume idx is not aligned to the hugetlb page size.
> find_get_page_flags() would always return a tail page in that case, but
> you'd be returning the head page here.
> 
> See pagecache_get_page()->folio_file_page(folio, index);
Thank you for catching this. Looking at how udambuf uses this API for hugetlb 
case:
hpstate = hstate_file(memfd);
mapidx = list[i].offset >> huge_page_shift(hpstate);
do {
        nr_pages = shmem_file(memfd) ? pgcnt : 1;
        ret = pin_user_pages_fd(memfd, mapidx, nr_pages,
                                ubuf->pages + pgbuf);
As the raw file offset is translated into huge-page-size units (mapidx), I was
expecting find_get_page_flags() to return a head page, but I did not realize that
it now returns tail pages, given that it had returned head pages in the previous
kernel versions I had tested, IIRC. As my goal is to grab only the head pages,
__filemap_get_folio() seems like the right API to use instead of
find_get_page_flags(). With this change, the hugetlb subtest (which I had not
tested with kernels >= 6.7) that fails with 6.7-rc1 now works as expected.
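
For reference, the lookup then looks roughly like the sketch below (simplified
and untested; pinning and the shmem path are left out, and mapidx/pgbuf are the
variables from the snippet above):

        folio = __filemap_get_folio(memfd->f_mapping, mapidx, FGP_ACCESSED, 0);
        if (IS_ERR(folio)) {
                /* not in the page cache yet -> take the allocation path */
        } else {
                /* always the head page, unlike find_get_page_flags() */
                ubuf->pages[pgbuf] = folio_page(folio, 0);
        }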

> 
> > +   }
> > +   return ERR_PTR(-ENOMEM);
> > +   }
> > +#endif
> > +   return shmem_read_mapping_page(file->f_mapping, idx);
> > +}
> > +
> > +/**
> > + * pin_user_pages_fd() - pin user pages associated with a file
> > + * @file:   the file whose pages are to be pinned
> > + * @start:  starting file offset
> > + * @nr_pages:   number of pages from start to pin
> > + * @pages:  array that receives pointers to the pages pinned.
> > + *  Should be at-least nr_pages long.
> > + *
> > + * Attempt to pin pages associated with a file that belongs to either
> shmem
> > + * or hugetlb. The pages are either found in the page cache or allocated if
> > + * necessary. Once the 

RE: [PATCH v1 1/3] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages

2023-10-17 Thread Kasireddy, Vivek
Hi David,

> > For drivers that would like to longterm-pin the pages associated
> > with a file, the pin_user_pages_fd() API provides an option to
> > not only FOLL_PIN the pages but also to check and migrate them
> > if they reside in movable zone or CMA block. For now, this API
> > can only work with files belonging to shmem or hugetlbfs given
> > that the udmabuf driver is the only user.
> 
> Maybe add "Other files are rejected.". Wasn't clear to me before I
> looked into the code.
Ok, will add it in v2.

> 
> >
> > It must be noted that the pages associated with hugetlbfs files
> > are expected to be found in the page cache. An error is returned
> > if they are not found. However, shmem pages can be swapped in or
> > allocated if they are not present in the page cache.
> >
> > Cc: David Hildenbrand 
> > Cc: Daniel Vetter 
> > Cc: Mike Kravetz 
> > Cc: Hugh Dickins 
> > Cc: Peter Xu 
> > Cc: Gerd Hoffmann 
> > Cc: Dongwon Kim 
> > Cc: Junxiao Chang 
> > Suggested-by: Jason Gunthorpe 
> > Signed-off-by: Vivek Kasireddy 
> > ---
> >   include/linux/mm.h |  2 ++
> >   mm/gup.c   | 87
> ++
> >   2 files changed, 89 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index bf5d0b1b16f4..af2121fb8101 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2457,6 +2457,8 @@ long get_user_pages_unlocked(unsigned long
> start, unsigned long nr_pages,
> > struct page **pages, unsigned int gup_flags);
> >   long pin_user_pages_unlocked(unsigned long start, unsigned long
> nr_pages,
> > struct page **pages, unsigned int gup_flags);
> > +long pin_user_pages_fd(int fd, pgoff_t start, unsigned long nr_pages,
> > +  unsigned int gup_flags, struct page **pages);
> >
> >   int get_user_pages_fast(unsigned long start, int nr_pages,
> > unsigned int gup_flags, struct page **pages);
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 2f8a2d89fde1..e34b77a15fa8 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -3400,3 +3400,90 @@ long pin_user_pages_unlocked(unsigned long
> start, unsigned long nr_pages,
> >  , gup_flags);
> >   }
> >   EXPORT_SYMBOL(pin_user_pages_unlocked);
> > +
> 
> This does look quite neat, nice! Let's take a closer look ...
> 
> > +/**
> > + * pin_user_pages_fd() - pin user pages associated with a file
> > + * @fd: the fd whose pages are to be pinned
> > + * @start:  starting file offset
> > + * @nr_pages:   number of pages from start to pin
> > + * @gup_flags:  flags modifying pin behaviour
> 
> ^ I assume we should drop that. At least for now the flags are
> completely unused. And most likely we would want a different set of
> flags later (GUPFD_ ...).
Right now, FOLL_LONGTERM is the only accepted value for gup_flags but
yes, as you suggest, this can be made implicit by dropping gup_flags.

> 
> > + * @pages:  array that receives pointers to the pages pinned.
> > + *  Should be at least nr_pages long.
> > + *
> > + * Attempt to pin (and migrate) pages associated with a file belonging to
> 
> I'd drop the "and migrate" part, it's more of an implementation detail.
> 
> > + * either shmem or hugetlbfs. An error is returned if pages associated with
> > + * hugetlbfs files are not present in the page cache. However, shmem
> pages
> > + * are swapped in or allocated if they are not present in the page cache.
> 
> Why don't we do the same for hugetlbfs? Would make the interface more
> streamlined.
I am going off of what Mike has stated previously:
"It may not matter to your users, but the semantics for hugetlb and shmem
pages is different.  hugetlb requires the pages exist in the page cache
while shmem will create/add pages to the cache if necessary."

However, if we were to allocate a hugepage (assuming one is not present in the
page cache at a given index), what needs to be done in addition to calling 
these APIs?
folio = alloc_hugetlb_folio_nodemask(h, NUMA_NO_NODE, NULL, GFP_USER)
hugetlb_add_to_page_cache(folio, mapping, idx)
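
Presumably something along these lines (untested sketch with abbreviated error
handling; the memfd/idx names are illustrative):

        folio = alloc_hugetlb_folio_nodemask(hstate_file(memfd), NUMA_NO_NODE,
                                             NULL, GFP_USER);
        if (!folio)
                return ERR_PTR(-ENOMEM);

        err = hugetlb_add_to_page_cache(folio, memfd->f_mapping, idx);
        if (err) {
                folio_put(folio);
                return ERR_PTR(err);
        }
        return &folio->page;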

> 
> Certainly add that pinned pages have to be released using
> unpin_user_pages().
Sure, will include that in v2.
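
And just to spell out the expected usage from a caller's point of view (rough
sketch, based on the current signature):

        ret = pin_user_pages_fd(fd, start, nr_pages, FOLL_LONGTERM, pages);
        if (ret > 0) {
                /* ... the pages can now be used for longterm DMA ... */
                unpin_user_pages(pages, ret);
        }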

> 
> > + *
> > + * Returns number of pages pinned. This would be equal to the number of
> > + * pages requested.
> > + * If nr_pages is 0 or negative, returns 0. If no pages were pinned, 
> > returns
> > + * -errno.
> > + */
> > +long pin_user_pages_fd(int fd, pgoff_t start, unsigned long nr_pages,
> > +  unsigned int gup_flags, struct page **pages)
> > +{
> > +   struct page *page;
> > +   struct file *filep;
> > +   unsigned int flags, i;
> > +   long ret;
> > +
> > +   if (nr_pages <= 0)
> > +   return 0;
> 
> I think we should just forbid that and use a WARN_ON_ONCE() here /
> return -EINVAL. So we'll never end up returning 0.
I think I'll drop this check in v2 as Jason suggested.

> 
> > +   if 

RE: [PATCH v1 0/3] udmabuf: Add support for page migration out of movable zone or CMA

2023-09-16 Thread Kasireddy, Vivek
Hi David,

> >> I think it makes sense to have a generic (non-GUP) version of
> >> check_and_migrate_movable_pages() available in migration.h that
> >> drivers can use to ensure that they don't break memory hotunplug
> >> accidentally.
> >
> > Definately not.
> >
> > Either use the VMA and pin_user_pages(), or implement
> > pin_user_pages_fd() in core code.
> >
> > Do not open code something wonky in drivers.
> 
> Agreed. pin_user_pages_fd() might become relevant in the context of
> vfio/mdev + KVM gmem -- don't mmap guest memory but instead provide it
> via a special memfd to the kernel.
> 
> So there might be value in having such a core infrastructure.
Ok, I'll work on adding pin_user_pages_fd() soon.

Thanks,
Vivek
> 
> --
> Cheers,
> 
> David / dhildenb



RE: [syzbot] [mm?] kernel BUG in filemap_unaccount_folio

2023-09-10 Thread Kasireddy, Vivek
Hi Fengwei,

> 
> Add udmabuf maintainers.
> 
> On 9/7/2023 2:51 AM, syzbot wrote:
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit:db906f0ca6bb Merge tag 'phy-for-6.6' of git://git.kernel.o..
> > git tree:   upstream
> > console+strace: https://syzkaller.appspot.com/x/log.txt?x=16cbb32fa8
> > kernel config:
> https://syzkaller.appspot.com/x/.config?x=3bd57a1ac08277b0
> > dashboard link:
> https://syzkaller.appspot.com/bug?extid=17a207d226b8a5fb0fd9
> > compiler:   gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for
> Debian) 2.40
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=11609f3868
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=14c1fc0068
> >
> > Downloadable assets:
> > disk image: https://storage.googleapis.com/syzbot-
> assets/46394f3ca3eb/disk-db906f0c.raw.xz
> > vmlinux: https://storage.googleapis.com/syzbot-
> assets/eeaa594bfd1f/vmlinux-db906f0c.xz
> > kernel image: https://storage.googleapis.com/syzbot-
> assets/5c8df8de79ec/bzImage-db906f0c.xz
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the
> commit:
> > Reported-by: syzbot+17a207d226b8a5fb0...@syzkaller.appspotmail.com
> 
> Operations from user space before kernel BUG hit:
> 
> [pid  5043]
> memfd_create("\x79\x10\x35\x25\xfa\x2c\x1f\x99\xa2\xc9\x8e\xcd\x5c\xfa
> \xf6\x12\x95\x5e\xdf\x54\xe2\x3d\x0e\x7e\x46\xcd\x73\xa3\xff\x89\x3e\x
> 84\xa9\x86\x86\xa2\x46\x90\x93\x98\x4e\x05\x65\x92\x4a\x77\xce\x63\xc
> e\x9f\x32\xc8\x02\x66\x03\x07\x6d\x08\xb4\x48\x8f\x9e\xa5\x16\x8f\x61\
> xff\xb2\x22\x8a\x15\x13\xa2\x17\x25\x21\x54\x8b\xa1\xb9\x2d\x13\xf9\x
> 6f\x67\x95\x9d\x54\xef\xca\x68\x77\xf5\xff\x75\x7f\x75\xb8\x2a\xd3"...,
> MFD_ALLOW_SEALING) = 3
> [pid  5043] ftruncate(3, 65535) = 0
> [pid  5043] fcntl(3, F_ADD_SEALS,
> F_SEAL_SEAL|F_SEAL_SHRINK|F_SEAL_GROW) = 0
> [pid  5043] openat(AT_FDCWD, "/dev/udmabuf", O_RDWR) = 4
> [pid  5043] ioctl(4, UDMABUF_CREATE, 0x2000) = 5
> [pid  5043] mmap(0x20667000, 16384,
> PROT_WRITE|PROT_EXEC|PROT_SEM|PROT_GROWSDOWN,
> MAP_SHARED|MAP_FIXED|MAP_POPULATE|MAP_STACK, 5, 0) = 0x20667000
> 
> The crash happens when test app tried to close the memfd.
> 
> 
> It's like test app created udmabuf above memfd. But didn't boost memfd
> refcount.
> And mmap with MAP_POPULATE make the underneath folios mapped.
> 
> When memfd is closed without munmap 0x20667000, the memfd refcount
> hit zero and
> trigger evict() and hit
>   VM_BUG_ON_FOLIO(folio_mapped(folio), folio);
> 
> 
> Related test code:
> 
>   res = syscall(__NR_memfd_create, /*name=*/0x2040ul, /*flags=*/2ul);
>   if (res != -1)
> r[0] = res;
>   syscall(__NR_ftruncate, /*fd=*/r[0], /*len=*/0xul);
>   syscall(__NR_fcntl, /*fd=*/r[0], /*cmd=*/0x409ul, /*seals=*/7ul);
>   memcpy((void*)0x21c0, "/dev/udmabuf\000", 13);
>   res = syscall(__NR_openat, /*fd=*/0xff9cul, 
> /*file=*/0x21c0ul,
> /*flags=*/2ul, 0);
>   if (res != -1)
> r[1] = res;
>   *(uint32_t*)0x2000 = r[0];
>   *(uint32_t*)0x2004 = 0;
>   *(uint64_t*)0x2008 = 0;
>   *(uint64_t*)0x2010 = 0x8000;
>   res = syscall(__NR_ioctl, /*fd=*/r[1], /*cmd=*/0x40187542,
> /*arg=*/0x2000ul);
>   if (res != -1)
> r[2] = res;
>   syscall(__NR_mmap, /*addr=*/0x20667000ul, /*len=*/0x4000ul,
>   /*prot=*/0x10eul, /*flags=*/0x28011ul, /*fd=*/r[2],
>   /*offset=*/0ul);
>   close_fds();
> 
> 
> Should memfd refcount increased when create udmabuf above it? Thanks.
I think the following patch should fix this crash:
https://lists.freedesktop.org/archives/dri-devel/2023-August/418952.html

Thanks,
Vivek
> 
> Regards
> Yin, Fengwei
> 
> >
> >  search_binary_handler fs/exec.c:1739 [inline]
> >  exec_binprm fs/exec.c:1781 [inline]
> >  bprm_execve fs/exec.c:1856 [inline]
> >  bprm_execve+0x80a/0x1a50 fs/exec.c:1812
> >  do_execveat_common.isra.0+0x5d3/0x740 fs/exec.c:1964
> >  do_execve fs/exec.c:2038 [inline]
> >  __do_sys_execve fs/exec.c:2114 [inline]
> >  __se_sys_execve fs/exec.c:2109 [inline]
> >  __x64_sys_execve+0x8c/0xb0 fs/exec.c:2109
> >  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> >  do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80
> >  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > [ cut here ]
> > kernel BUG at mm/filemap.c:158!
> > invalid opcode:  [#1] PREEMPT SMP KASAN
> > CPU: 0 PID: 5043 Comm: syz-executor729 Not tainted 6.5.0-syzkaller-11275-
> gdb906f0ca6bb #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine,
> BIOS Google 07/26/2023
> > RIP: 0010:filemap_unaccount_folio+0x62e/0x870 mm/filemap.c:158
> > Code: 0f 85 68 01 00 00 8b 6b 5c 31 ff 89 ee e8 6a 3e d2 ff 85 ed 7e 16 e8 
> > f1
> 42 d2 ff 48 c7 c6 c0 3b 97 8a 48 89 df e8 a2 58 10 00 <0f> 0b e8 db 42 d2 ff 
> 48
> 8d 6b 58 be 04 00 00 00 48 89 ef e8 0a 0d
> > RSP: 0018:c900039ef828 EFLAGS: 00010093
> > RAX:  RBX: ea0001cfe400 RCX: 

RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-08-27 Thread Kasireddy, Vivek
Hi Alistair,

> 
> > >
> > >> >> > > > No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the
> > >> issue.
> > >> >> > > > Although, I do not have THP enabled (or built-in), shmem does
> > not
> > >> evict
> > >> >> > > > the pages after hole punch as noted in the comment in
> > >> >> shmem_fallocate():
> > >> >> > >
> > >> >> > > This is the source of all your problems.
> > >> >> > >
> > >> >> > > Things that are mm-centric are supposed to track the VMAs and
> > >> changes
> > >> >> to
> > >> >> > > the PTEs. If you do something in userspace and it doesn't cause
> the
> > >> >> > > CPU page tables to change then it certainly shouldn't cause any
> > mmu
> > >> >> > > notifiers or hmm_range_fault changes.
> > >> >> > I am not doing anything out of the blue in the userspace. I think 
> > >> >> > the
> > >> >> behavior
> > >> >> > I am seeing with shmem (where an invalidation event
> > >> >> (MMU_NOTIFY_CLEAR)
> > >> >> > does occur because of a hole punch but the PTEs don't really get
> > >> updated)
> > >> >> > can arguably be considered an optimization.
> > >> >>
> > >> >> Your explanations don't make sense.
> > >> >>
> > >> >> If MMU_NOTIFER_CLEAR was sent but the PTEs were left present
> then:
> > >> >>
> > >> >> > > There should still be an invalidation notifier at some point when
> the
> > >> >> > > CPU tables do eventually change, whenever that is. Missing that
> > >> >> > > notification would be a bug.
> > >> >> > I clearly do not see any notification getting triggered (from both
> > >> >> shmem_fault()
> > >> >> > and hugetlb_fault()) when the PTEs do get updated as the hole is
> > refilled
> > >> >> > due to writes. Are you saying that there needs to be an invalidation
> > >> event
> > >> >> > (MMU_NOTIFY_CLEAR?) dispatched at this point?
> > >> >>
> > >> >> You don't get to get shmem_fault in the first place.
> > >> > What I am observing is that even after MMU_NOTIFY_CLEAR (hole
> > punch)
> > >> is sent,
> > >> > hmm_range_fault() finds that the PTEs associated with the hole are 
> > >> > still
> > >> pte_present().
> > >> > I think it remains this way as long as there are reads on the hole. 
> > >> > Once
> > >> there are
> > >> > writes, it triggers shmem_fault() which results in PTEs getting updated
> > but
> > >> without
> > >> > any notification.
> > >>
> > >> Oh wait, this is shmem. The read from hmm_range_fault() (assuming
> you
> > >> specified HMM_PFN_REQ_FAULT) will trigger shmem_fault() due to the
> > >> missing PTE.
> > > When running one of the udmabuf subtests (introduced in the third patch
> > of
> > > this series), I see that MMU_NOTIFY_CLEAR is sent when a hole is
> punched.
> > > As a response, hmm_range_fault() is called from the udmabuf invalidate
> > callback,
> >
> > Actually I'm suprised that works. If you've setup an interval notifier
> > and are updating the notifier sequence numbers correctly I would expect
> > hmm_range_fault() to return -EBUSY until
> > mmu_notifier_invalidate_range_end() is called.
> >
> > It might be helpful to post the code you're testing with somewhere but
> > are you calling mmu_interval_read_begin() to start the critical section
> > and mmu_interval_set_seq() to update the sequence in another notifier?
> > I'm not at all convinced calling hmm_range_fault() from a notifier can
> > be made to work though.
Turns out, calling hmm_range_fault() from the invalidate callback was indeed a
problem and the reason why new pages were not faulted in. In other words, the
invalidate callback is not the right place to invoke hmm_range_fault(), since
the PTEs may not have been cleared yet at that point.
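
For the record, the usual pattern (per Documentation/mm/hmm.rst) is to run a
retry loop like the rough sketch below from the ioctl/worker path, and have the
invalidate callback only update the sequence number (mmu_interval_set_seq())
under the driver lock; driver_lock below is a placeholder for whatever lock
serializes against the notifier, not something in the current patch:

again:
        hrange.notifier_seq = mmu_interval_read_begin(&range->range_mn);

        mmap_read_lock(ubuf->vmm_mm);
        ret = hmm_range_fault(&hrange);
        mmap_read_unlock(ubuf->vmm_mm);
        if (ret) {
                if (ret == -EBUSY)
                        goto again;
                return ret;
        }

        mutex_lock(&driver_lock);
        if (mmu_interval_read_retry(&range->range_mn, hrange.notifier_seq)) {
                mutex_unlock(&driver_lock);
                goto again;
        }
        /* hrange.hmm_pfns[] is now stable and can be used to update ubuf->pages */
        mutex_unlock(&driver_lock);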

> That could be part of the problem. I mean the way hmm_range_fault()
> is invoked from the invalidate callback is probably incorrect as you are
> suggesting. Anyway, here is the code I am testing with:
> static bool invalidate_udmabuf(struct mmu_interval_notifier *mn,
>const struct mmu_notifier_range *range_mn,
>unsigned long cur_seq)
> {
> struct udmabuf_vma_range *range =
> container_of(mn, struct udmabuf_vma_range, range_mn);
> struct udmabuf *ubuf = range->ubuf;
> struct hmm_range hrange = {0};
> unsigned long *pfns, num_pages, timeout;
> int i, ret;
> 
> printk("invalidate; start = %lu, end = %lu\n",
>range->start, range->end);
> 
> hrange.notifier = mn;
> hrange.default_flags = HMM_PFN_REQ_FAULT;
> hrange.start = max(range_mn->start, range->start);
> hrange.end = min(range_mn->end, range->end);
> num_pages = (hrange.end - hrange.start) >> PAGE_SHIFT;
> 
> pfns = kmalloc_array(num_pages, sizeof(*pfns), GFP_KERNEL);
> if (!pfns)
> return true;
> 
> printk("invalidate; num pages = %lu\n", num_pages);
> 
> hrange.hmm_pfns = pfns;
> timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> do {
>

RE: [PATCH v1 0/3] udmabuf: Add support for page migration out of movable zone or CMA

2023-08-27 Thread Kasireddy, Vivek
Hi Jason, David,

> > > Sure, we can simply always fail when we detect ZONE_MOVABLE or
> > MIGRATE_CMA.
> > > Maybe that keeps at least some use cases working.
> >
> > That seems fairly reasonable
> AFAICS, failing udmabuf_create() if we detect one or more pages are in
> ZONE_MOVABLE or MIGRATE_CMA would not be a recoverable failure --
> as it would result in the failure of Guest GUI (or compositor).
> 
> I think it makes sense to have a generic version of
> And, since check_and_migrate_movable_pages() is GUP-specific, would
> it be ok to create a generic version of that (in mm/migrate.c) which can be
> used by udmabuf and/or other drivers in the future?
Sorry, I accidentally sent the earlier email before finishing it.
What I meant to say is that since the same situation (inadvertently pinning pages
in the movable zone) may well arise with another driver in the future, I think it
makes sense to have a generic (non-GUP) version of check_and_migrate_movable_pages()
available in migration.h that drivers can use to ensure that they don't break
memory hotunplug accidentally.

Thanks,
Vivek

> 
> Thanks,
> Vivek
> 
> >
> > Jason
> 



RE: [PATCH v1 0/3] udmabuf: Add support for page migration out of movable zone or CMA

2023-08-27 Thread Kasireddy, Vivek
Hi Jason, David,

> 
> > Sure, we can simply always fail when we detect ZONE_MOVABLE or
> MIGRATE_CMA.
> > Maybe that keeps at least some use cases working.
> 
> That seems fairly reasonable
AFAICS, failing udmabuf_create() if we detect one or more pages are in
ZONE_MOVABLE or MIGRATE_CMA would not be a recoverable failure --
as it would result in the failure of Guest GUI (or compositor).

I think it makes sense to have a generic version of 
And, since check_and_migrate_movable_pages() is GUP-specific, would
it be ok to create a generic version of that (in mm/migrate.c) which can be
used by udmabuf and/or other drivers in the future?

Thanks,
Vivek

> 
> Jason



RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-08-24 Thread Kasireddy, Vivek
Hi Alistair,

> >
> >> >> > > > No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the
> >> issue.
> >> >> > > > Although, I do not have THP enabled (or built-in), shmem does
> not
> >> evict
> >> >> > > > the pages after hole punch as noted in the comment in
> >> >> shmem_fallocate():
> >> >> > >
> >> >> > > This is the source of all your problems.
> >> >> > >
> >> >> > > Things that are mm-centric are supposed to track the VMAs and
> >> changes
> >> >> to
> >> >> > > the PTEs. If you do something in userspace and it doesn't cause the
> >> >> > > CPU page tables to change then it certainly shouldn't cause any
> mmu
> >> >> > > notifiers or hmm_range_fault changes.
> >> >> > I am not doing anything out of the blue in the userspace. I think the
> >> >> behavior
> >> >> > I am seeing with shmem (where an invalidation event
> >> >> (MMU_NOTIFY_CLEAR)
> >> >> > does occur because of a hole punch but the PTEs don't really get
> >> updated)
> >> >> > can arguably be considered an optimization.
> >> >>
> >> >> Your explanations don't make sense.
> >> >>
> >> >> If MMU_NOTIFER_CLEAR was sent but the PTEs were left present then:
> >> >>
> >> >> > > There should still be an invalidation notifier at some point when 
> >> >> > > the
> >> >> > > CPU tables do eventually change, whenever that is. Missing that
> >> >> > > notification would be a bug.
> >> >> > I clearly do not see any notification getting triggered (from both
> >> >> shmem_fault()
> >> >> > and hugetlb_fault()) when the PTEs do get updated as the hole is
> refilled
> >> >> > due to writes. Are you saying that there needs to be an invalidation
> >> event
> >> >> > (MMU_NOTIFY_CLEAR?) dispatched at this point?
> >> >>
> >> >> You don't get to get shmem_fault in the first place.
> >> > What I am observing is that even after MMU_NOTIFY_CLEAR (hole
> punch)
> >> is sent,
> >> > hmm_range_fault() finds that the PTEs associated with the hole are still
> >> pte_present().
> >> > I think it remains this way as long as there are reads on the hole. Once
> >> there are
> >> > writes, it triggers shmem_fault() which results in PTEs getting updated
> but
> >> without
> >> > any notification.
> >>
> >> Oh wait, this is shmem. The read from hmm_range_fault() (assuming you
> >> specified HMM_PFN_REQ_FAULT) will trigger shmem_fault() due to the
> >> missing PTE.
> > When running one of the udmabuf subtests (introduced in the third patch
> of
> > this series), I see that MMU_NOTIFY_CLEAR is sent when a hole is punched.
> > As a response, hmm_range_fault() is called from the udmabuf invalidate
> callback,
> 
> Actually I'm suprised that works. If you've setup an interval notifier
> and are updating the notifier sequence numbers correctly I would expect
> hmm_range_fault() to return -EBUSY until
> mmu_notifier_invalidate_range_end() is called.
> 
> It might be helpful to post the code you're testing with somewhere but
> are you calling mmu_interval_read_begin() to start the critical section
> and mmu_interval_set_seq() to update the sequence in another notifier?
> I'm not at all convinced calling hmm_range_fault() from a notifier can
> be made to work though.
That could be part of the problem. I mean the way hmm_range_fault()
is invoked from the invalidate callback is probably incorrect as you are
suggesting. Anyway, here is the code I am testing with:
static bool invalidate_udmabuf(struct mmu_interval_notifier *mn,
                               const struct mmu_notifier_range *range_mn,
                               unsigned long cur_seq)
{
        struct udmabuf_vma_range *range =
                container_of(mn, struct udmabuf_vma_range, range_mn);
        struct udmabuf *ubuf = range->ubuf;
        struct hmm_range hrange = {0};
        unsigned long *pfns, num_pages, timeout;
        int i, ret;

        printk("invalidate; start = %lu, end = %lu\n",
               range->start, range->end);

        hrange.notifier = mn;
        hrange.default_flags = HMM_PFN_REQ_FAULT;
        hrange.start = max(range_mn->start, range->start);
        hrange.end = min(range_mn->end, range->end);
        num_pages = (hrange.end - hrange.start) >> PAGE_SHIFT;

        pfns = kmalloc_array(num_pages, sizeof(*pfns), GFP_KERNEL);
        if (!pfns)
                return true;

        printk("invalidate; num pages = %lu\n", num_pages);

        hrange.hmm_pfns = pfns;
        timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
        do {
                hrange.notifier_seq = mmu_interval_read_begin(mn);

                mmap_read_lock(ubuf->vmm_mm);
                ret = hmm_range_fault(&hrange);
                mmap_read_unlock(ubuf->vmm_mm);
                if (ret) {
                        if (ret == -EBUSY && !time_after(jiffies, timeout))
                                continue;
                        break;
                }

                if (mmu_interval_read_retry(mn, hrange.notifier_seq))
                        continue;
        } while (ret);

   

RE: [PATCH v1 0/3] udmabuf: Add support for page migration out of movable zone or CMA

2023-08-24 Thread Kasireddy, Vivek
Hi David,

> 
> >> - Add a new API to the backing store/allocator to longterm-pin the page.
> >>For example, something along the lines of
> shmem_pin_mapping_page_longterm()
> >>for shmem as suggested by Daniel. A similar one needs to be added for
> >>hugetlbfs as well.
> >
> > This may also be reasonable.
> 
> Sounds reasonable to keep the old API (that we unfortunately have) working.
I agree; I'd like to avoid adding new APIs unless absolutely necessary. Given this,
and considering the options I mentioned earlier, what would be your
recommendation for how page migration should be done in the udmabuf driver?

Thanks,
Vivek

> 
> --
> Cheers,
> 
> David / dhildenb



RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-08-22 Thread Kasireddy, Vivek
Hi Alistair,

> >> > > > No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the
> issue.
> >> > > > Although, I do not have THP enabled (or built-in), shmem does not
> evict
> >> > > > the pages after hole punch as noted in the comment in
> >> shmem_fallocate():
> >> > >
> >> > > This is the source of all your problems.
> >> > >
> >> > > Things that are mm-centric are supposed to track the VMAs and
> changes
> >> to
> >> > > the PTEs. If you do something in userspace and it doesn't cause the
> >> > > CPU page tables to change then it certainly shouldn't cause any mmu
> >> > > notifiers or hmm_range_fault changes.
> >> > I am not doing anything out of the blue in the userspace. I think the
> >> behavior
> >> > I am seeing with shmem (where an invalidation event
> >> (MMU_NOTIFY_CLEAR)
> >> > does occur because of a hole punch but the PTEs don't really get
> updated)
> >> > can arguably be considered an optimization.
> >>
> >> Your explanations don't make sense.
> >>
> >> If MMU_NOTIFER_CLEAR was sent but the PTEs were left present then:
> >>
> >> > > There should still be an invalidation notifier at some point when the
> >> > > CPU tables do eventually change, whenever that is. Missing that
> >> > > notification would be a bug.
> >> > I clearly do not see any notification getting triggered (from both
> >> shmem_fault()
> >> > and hugetlb_fault()) when the PTEs do get updated as the hole is refilled
> >> > due to writes. Are you saying that there needs to be an invalidation
> event
> >> > (MMU_NOTIFY_CLEAR?) dispatched at this point?
> >>
> >> You don't get to get shmem_fault in the first place.
> > What I am observing is that even after MMU_NOTIFY_CLEAR (hole punch)
> is sent,
> > hmm_range_fault() finds that the PTEs associated with the hole are still
> pte_present().
> > I think it remains this way as long as there are reads on the hole. Once
> there are
> > writes, it triggers shmem_fault() which results in PTEs getting updated but
> without
> > any notification.
> 
> Oh wait, this is shmem. The read from hmm_range_fault() (assuming you
> specified HMM_PFN_REQ_FAULT) will trigger shmem_fault() due to the
> missing PTE. 
When running one of the udmabuf subtests (introduced in the third patch of
this series), I see that MMU_NOTIFY_CLEAR is sent when a hole is punched.
As a response, hmm_range_fault() is called from the udmabuf invalidate callback,
to walk over the PTEs associated with the hole. When this happens, I noticed that
the below function returns HMM_PFN_VALID | HMM_PFN_WRITE for all the
PTEs associated with the hole:
static inline unsigned long pte_to_hmm_pfn_flags(struct hmm_range *range,
                                                 pte_t pte)
{
        if (pte_none(pte) || !pte_present(pte) || pte_protnone(pte))
                return 0;

        return pte_write(pte) ? (HMM_PFN_VALID | HMM_PFN_WRITE) : HMM_PFN_VALID;
}

As a result, hmm_pte_need_fault() always returns 0 and shmem_fault()
never gets triggered despite specifying HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE.
And, the set of PFNs returned by hmm_range_fault() are the same ones
that existed before the hole was punched.

> Subsequent writes will just upgrade PTE permissions
> assuming the read didn't map them RW to begin with. If you want to
> actually see the hole with hmm_range_fault() don't specify
> HMM_PFN_REQ_FAULT (or _WRITE).
> 
> >>
> >> If they were marked non-prsent during the CLEAR then the shadow side
> >> remains non-present until it gets its own fault.
> >>
> >> If they were made non-present without an invalidation then that is a
> >> bug.
> >>
> >> > > hmm_range_fault() is the correct API to use if you are working with
> >> > > notifiers. Do not hack something together using pin_user_pages.
> >>
> >> > I noticed that hmm_range_fault() does not seem to be working as
> expected
> >> > given that it gets stuck(hangs) while walking hugetlb pages.
> >>
> >> You are the first to report that, it sounds like a serious bug. Please
> >> try to fix it.
> >>
> >> > Regardless, as I mentioned above, the lack of notification when PTEs
> >> > do get updated due to writes is the crux of the issue
> >> > here. Therefore, AFAIU, triggering an invalidation event or some
> >> > other kind of notification would help in fixing this issue.
> >>
> >> You seem to be facing some kind of bug in the mm, it sounds pretty
> >> serious, and it almost certainly is a missing invalidation.
> >>
> >> Basically, anything that changes a PTE must eventually trigger an
> >> invalidation. It is illegal to change a PTE from one present value to
> >> another present value without invalidation notification.
> >>
> >> It is not surprising something would be missed here.
> > As you suggest, it looks like the root-cause of this issue is the missing
> > invalidation notification when the PTEs are changed from one present
> 
> I don't think there's a missing invalidation here. You say you're seeing
> the MMU_NOTIFY_CLEAR when hole punching which is when the 

RE: [PATCH v1 0/3] udmabuf: Add support for page migration out of movable zone or CMA

2023-08-21 Thread Kasireddy, Vivek
Hi Jason,

> > This patch series adds support for migrating pages associated with
> > a udmabuf out of the movable zone or CMA to avoid breaking features
> > such as memory hotunplug.
> >
> > The first patch exports check_and_migrate_movable_pages() function
> > out of GUP so that the udmabuf driver can leverage it for page
> > migration that is done as part of the second patch. The last patch
> > adds two new udmabuf selftests to verify data coherency after
> > page migration.
> 
> Please don't do this. If you want to do what GUP does then call
> GUP. udmabuf is not so special that it needs to open code its own
> weird version of it.
We can't call GUP directly as explained in the first patch of this series:
"For drivers that would like to migrate pages out of the movable
zone (or CMA) in order to pin them (longterm) for DMA, using
check_and_migrate_movable_pages() directly provides a convenient
option instead of duplicating similar checks (e.g, checking
the folios for zone, hugetlb, etc) and calling migrate_pages()
directly.

Ideally, a driver is expected to call pin_user_pages(FOLL_LONGTERM)
to migrate and pin the pages for longterm DMA but there are
situations where the GUP APIs cannot be used directly for
various reasons (e.g, when the VMA or start addr cannot be
easily determined but the relevant pages are available)."

Given the current (model and) UAPI (udmabuf_create), the userspace
only shares (memfd, offset, size) values that we use to find the 
relevant pages and pin them (by doing get_page()). Since the goal
is to also migrate these pages, I think we have the following options:
- Leverage check_and_migrate_movable_pages(); but this function
  needs to be exported from GUP.

- Iterate over all the pages (in udmabuf) to check for 
folio_is_longterm_pinnable()
  and call migrate_pages() eventually. This requires changes only to
  the udmabuf driver but we'd be duplicating much of the functionality
  provided by check_and_migrate_movable_pages().

- Call pin_user_pages_fast(FOLL_LONGTERM) from udmabuf driver. In
  order to do this, we have to first unpin all pages and iterate over all
  the VMAs of the VMM to identify the Guest RAM VMA and then use
  page_address_in_vma() to find the start addr of the ranges and then
  call GUP. Although this approach is feasible, it feels a bit convoluted.

- Add a new udmabuf UAPI to have userspace share (start addr, len) values
  so that the udmabuf driver can directly call GUP APIs. But this means all
  the current users of udmabuf such as Qemu, CrosVM, etc, need to be
  updated to use the new UAPI. 

- Add a new API to the backing store/allocator to longterm-pin the page.
  For example, something along the lines of shmem_pin_mapping_page_longterm()
  for shmem as suggested by Daniel. A similar one needs to be added for
  hugetlbfs as well.

Among these options, the first one seems very reasonable to me.
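
To illustrate the first option, a rough sketch (assuming the first patch exports
the helper with its current GUP signature; udmabuf_get_pages() is a hypothetical
stand-in for the existing lookup that fills ubuf->pages). Since the helper
releases any pages it ends up migrating and returns -EAGAIN, the lookup has to
be redone before retrying:

        do {
                ret = udmabuf_get_pages(ubuf);  /* hypothetical re-lookup */
                if (ret)
                        break;

                ret = check_and_migrate_movable_pages(ubuf->pagecount,
                                                      ubuf->pages);
        } while (ret == -EAGAIN);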

Thanks,
Vivek

> 
> Jason


RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-08-16 Thread Kasireddy, Vivek
Hi Jason,

> > >
> > > > No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the issue.
> > > > Although, I do not have THP enabled (or built-in), shmem does not evict
> > > > the pages after hole punch as noted in the comment in
> shmem_fallocate():
> > >
> > > This is the source of all your problems.
> > >
> > > Things that are mm-centric are supposed to track the VMAs and changes
> to
> > > the PTEs. If you do something in userspace and it doesn't cause the
> > > CPU page tables to change then it certainly shouldn't cause any mmu
> > > notifiers or hmm_range_fault changes.
> > I am not doing anything out of the blue in the userspace. I think the
> behavior
> > I am seeing with shmem (where an invalidation event
> (MMU_NOTIFY_CLEAR)
> > does occur because of a hole punch but the PTEs don't really get updated)
> > can arguably be considered an optimization.
> 
> Your explanations don't make sense.
> 
> If MMU_NOTIFER_CLEAR was sent but the PTEs were left present then:
> 
> > > There should still be an invalidation notifier at some point when the
> > > CPU tables do eventually change, whenever that is. Missing that
> > > notification would be a bug.
> > I clearly do not see any notification getting triggered (from both
> shmem_fault()
> > and hugetlb_fault()) when the PTEs do get updated as the hole is refilled
> > due to writes. Are you saying that there needs to be an invalidation event
> > (MMU_NOTIFY_CLEAR?) dispatched at this point?
> 
> You don't get to get shmem_fault in the first place.
What I am observing is that even after MMU_NOTIFY_CLEAR (hole punch) is sent,
hmm_range_fault() finds that the PTEs associated with the hole are still 
pte_present().
I think it remains this way as long as there are reads on the hole. Once there 
are
writes, it triggers shmem_fault() which results in PTEs getting updated but 
without
any notification.

> 
> If they were marked non-prsent during the CLEAR then the shadow side
> remains non-present until it gets its own fault.
> 
> If they were made non-present without an invalidation then that is a
> bug.
> 
> > > hmm_range_fault() is the correct API to use if you are working with
> > > notifiers. Do not hack something together using pin_user_pages.
> 
> > I noticed that hmm_range_fault() does not seem to be working as expected
> > given that it gets stuck(hangs) while walking hugetlb pages.
> 
> You are the first to report that, it sounds like a serious bug. Please
> try to fix it.
> 
> > Regardless, as I mentioned above, the lack of notification when PTEs
> > do get updated due to writes is the crux of the issue
> > here. Therefore, AFAIU, triggering an invalidation event or some
> > other kind of notification would help in fixing this issue.
> 
> You seem to be facing some kind of bug in the mm, it sounds pretty
> serious, and it almost certainly is a missing invalidation.
> 
> Basically, anything that changes a PTE must eventually trigger an
> invalidation. It is illegal to change a PTE from one present value to
> another present value without invalidation notification.
> 
> It is not surprising something would be missed here.
As you suggest, it looks like the root-cause of this issue is the missing
invalidation notification when the PTEs are changed from one present
value to another. I'd like to fix this issue eventually but I first need to
focus on addressing udmabuf page migration (out of movable zone)
and also look into the locking concerns Daniel mentioned about pairing
static and dynamic dmabuf exporters and importers.

Thanks,
Vivek



RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-08-08 Thread Kasireddy, Vivek
Hi Jason,

> 
> > No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the issue.
> > Although, I do not have THP enabled (or built-in), shmem does not evict
> > the pages after hole punch as noted in the comment in shmem_fallocate():
> 
> This is the source of all your problems.
> 
> Things that are mm-centric are supposed to track the VMAs and changes to
> the PTEs. If you do something in userspace and it doesn't cause the
> CPU page tables to change then it certainly shouldn't cause any mmu
> notifiers or hmm_range_fault changes.
I am not doing anything out of the blue in the userspace. I think the behavior
I am seeing with shmem (where an invalidation event (MMU_NOTIFY_CLEAR)
does occur because of a hole punch but the PTEs don't really get updated)
can arguably be considered an optimization. 

> 
> There should still be an invalidation notifier at some point when the
> CPU tables do eventually change, whenever that is. Missing that
> notification would be a bug.
I clearly do not see any notification getting triggered (from both shmem_fault()
and hugetlb_fault()) when the PTEs do get updated as the hole is refilled
due to writes. Are you saying that there needs to be an invalidation event
(MMU_NOTIFY_CLEAR?) dispatched at this point?

> 
> > If I force it to read-fault or write-fault (by hacking 
> > hmm_pte_need_fault()),
> > it gets indefinitely stuck in the do while loop in hmm_range_fault().
> > AFAIU, unless there is a way to fault-in zero pages (or any scratch pages)
> > after hole punch that get invalidated because of writes, I do not see how
> > using hmm_range_fault() can help with my use-case.
> 
> hmm_range_fault() is the correct API to use if you are working with
> notifiers. Do not hack something together using pin_user_pages.
I noticed that hmm_range_fault() does not seem to be working as expected
given that it gets stuck(hangs) while walking hugetlb pages. Regardless,
as I mentioned above, the lack of notification when PTEs do get updated due
to writes is the crux of the issue here. Therefore, AFAIU, triggering an
invalidation event or some other kind of notification would help in fixing
this issue.

Thanks,
Vivek

> 
> Jason



RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-08-04 Thread Kasireddy, Vivek
Hi David,

> >
>  Right, the "the zero pages are changed into writable pages" in your
>  above comment just might not apply, because there won't be any
> >> page
>  replacement (hopefully :) ).
> >>
> >>> If the page replacement does not happen when there are new
> writes
> >> to the
> >>> area where the hole previously existed, then would we still get an
> >> invalidate
> >>> when this happens? Is there any other way to get notified when the
> >> zeroed
> >>> page is written to if the invalidate does not get triggered?
> >>
> >> What David is saying is that memfd does not use the zero page
> >> optimization for hole punches. Any access to the memory, including
> >> read-only access through hmm_range_fault() will allocate unique
> >> pages. Since there is no zero page and no zero-page replacement
> there
> >> is no issue with invalidations.
> 
> > It looks like even with hmm_range_fault(), the invalidate does not get
> > triggered when the hole is refilled with new pages because of writes.
> > This is probably because hmm_range_fault() does not fault in any
> pages
> > that get invalidated later when writes occur.
>  hmm_range_fault() returns the current content of the VMAs, or it
>  faults. If it returns pages then it came from one of these two places.
>  If your VMA is incoherent with what you are doing then you have
>  bigger
>  problems, or maybe you found a bug.
> >>
> >> Note it will only fault in pages if HMM_PFN_REQ_FAULT is specified. You
> >> are setting that however you aren't setting HMM_PFN_REQ_WRITE which
> is
> >> what would trigger a fault to bring in the new pages. Does setting that
> >> fix the issue you are seeing?
> > No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the issue.
> > Although, I do not have THP enabled (or built-in), shmem does not evict
> > the pages after hole punch as noted in the comment in shmem_fallocate():
> >  if ((u64)unmap_end > (u64)unmap_start)
> >  unmap_mapping_range(mapping, unmap_start,
> >  1 + unmap_end - unmap_start, 
> > 0);
> >  shmem_truncate_range(inode, offset, offset + len - 1);
> >  /* No need to unmap again: hole-punching leaves COWed pages
> */
> >
> > As a result, the pfn is still valid and the pte is pte_present() and 
> > pte_write().
> > This is the reason why adding in HMM_PFN_REQ_WRITE does not help;
> 
> Just to understand your setup: you are definitely using a MAP_SHARED
> shmem mapping, and not accidentally a MAP_PRIVATE mapping?
In terms of setup, I am just running the udmabuf selftest (shmem-based)
introduced in patch #3 of this series:
https://lore.kernel.org/all/20230718082858.1570809-4-vivek.kasire...@intel.com/

And, it indeed uses a MAP_SHARED mapping.

Thanks,
Vivek

> 
> --
> Cheers,
> 
> David / dhildenb



RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-08-04 Thread Kasireddy, Vivek
Hi Alistair, David, Jason,

> >> Right, the "the zero pages are changed into writable pages" in your
> >> above comment just might not apply, because there won't be any
> page
> >> replacement (hopefully :) ).
> 
> > If the page replacement does not happen when there are new writes
> to the
> > area where the hole previously existed, then would we still get an
>  invalidate
> > when this happens? Is there any other way to get notified when the
> zeroed
> > page is written to if the invalidate does not get triggered?
> 
>  What David is saying is that memfd does not use the zero page
>  optimization for hole punches. Any access to the memory, including
>  read-only access through hmm_range_fault() will allocate unique
>  pages. Since there is no zero page and no zero-page replacement there
>  is no issue with invalidations.
> >>
> >>> It looks like even with hmm_range_fault(), the invalidate does not get
> >>> triggered when the hole is refilled with new pages because of writes.
> >>> This is probably because hmm_range_fault() does not fault in any pages
> >>> that get invalidated later when writes occur.
> >> hmm_range_fault() returns the current content of the VMAs, or it
> >> faults. If it returns pages then it came from one of these two places.
> >> If your VMA is incoherent with what you are doing then you have
> >> bigger
> >> problems, or maybe you found a bug.
> 
> Note it will only fault in pages if HMM_PFN_REQ_FAULT is specified. You
> are setting that however you aren't setting HMM_PFN_REQ_WRITE which is
> what would trigger a fault to bring in the new pages. Does setting that
> fix the issue you are seeing?
No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the issue.
Although, I do not have THP enabled (or built-in), shmem does not evict
the pages after hole punch as noted in the comment in shmem_fallocate():
        if ((u64)unmap_end > (u64)unmap_start)
                unmap_mapping_range(mapping, unmap_start,
                                    1 + unmap_end - unmap_start, 0);
        shmem_truncate_range(inode, offset, offset + len - 1);
        /* No need to unmap again: hole-punching leaves COWed pages */

As a result, the pfn is still valid and the pte is pte_present() and 
pte_write().
This is the reason why adding HMM_PFN_REQ_WRITE does not help: it fails the
below condition in hmm_pte_need_fault():
        if ((pfn_req_flags & HMM_PFN_REQ_WRITE) &&
            !(cpu_flags & HMM_PFN_WRITE))
                return HMM_NEED_FAULT | HMM_NEED_WRITE_FAULT;

If I force it to read-fault or write-fault (by hacking hmm_pte_need_fault()),
it gets indefinitely stuck in the do while loop in hmm_range_fault().
AFAIU, unless there is a way to fault-in zero pages (or any scratch pages)
after hole punch that get invalidated because of writes, I do not see how
using hmm_range_fault() can help with my use-case. 

Thanks,
Vivek

> 
> >>> The above log messages are seen immediately after the hole is punched.
> As
> >>> you can see, hmm_range_fault() returns the pfns of old pages and not
> zero
> >>> pages. And, I see the below messages (with patch #2 in this series
> applied)
> >>> as the hole is refilled after writes:
> >> I don't know what you are doing, but it is something wrong or you've
> >> found a bug in the memfds.
> >
> >
> > Maybe THP is involved? I recently had to dig that out for an internal
> > discussion:
> >
> > "Currently when truncating shmem file, if the range is partial of THP
> > (start or end is in the middle of THP), the pages actually will just get
> > cleared rather than being freed unless the range cover the whole THP.
> > Even though all the subpages are truncated (randomly or sequentially),
> > the THP may still be kept in page cache.  This might be fine for some
> > usecases which prefer preserving THP."
> >
> > My recollection is that this behavior was never changed.
> >
> > https://lore.kernel.org/all/1575420174-19171-1-git-send-email-
> yang@linux.alibaba.com/



RE: [RFC v1 2/3] udmabuf: Replace pages when there is FALLOC_FL_PUNCH_HOLE in memfd

2023-08-03 Thread Kasireddy, Vivek
Hi Daniel,

> 
> On Tue, Jul 18, 2023 at 01:28:57AM -0700, Vivek Kasireddy wrote:
> > When a hole is punched in the memfd or when a page is replaced for
> > any reason, the udmabuf driver needs to get notified in order to
> > update its list of pages with the new page. To accomplish this, we
> > first identify the vma ranges where pages associated with a given
> > udmabuf are mapped to and then register a handler for update_mapping
> > mmu notifier for receiving mapping updates.
> >
> > Once we get notified about a new page faulted in at a given offset
> > in the mapping (backed by shmem or hugetlbfs), the list of pages
> > is updated and we also zap the relevant PTEs associated with the
> > vmas that have mmap'd the udmabuf fd.
> >
> > Cc: David Hildenbrand 
> > Cc: Mike Kravetz 
> > Cc: Hugh Dickins 
> > Cc: Peter Xu 
> > Cc: Jason Gunthorpe 
> > Cc: Gerd Hoffmann 
> > Cc: Dongwon Kim 
> > Cc: Junxiao Chang 
> > Signed-off-by: Vivek Kasireddy 
> 
> I think the long thread made it clear already, so just for the record:
> This won't work. udmabuf is very intentionally about pin_user_page
> semantics, if you change the underlying mapping, you get to keep all the
> pieces.
> 
> The _only_ way to make this work is by implementing the dma_buf move
> notification infrastructure, and most importers can't cope with such a
> dynamic dma-buf. And so most likely will not solve your use-case.
Right, we do have to call move_notify() at some point to let the importers
know about the backing memory changes but as you suggest, unfortunately,
most importers don't handle moves. However, I guess I could try implementing
it in i915 and also add a helper in GEM.

> 
> Everything else races in a fundamental and unfixable way.
I think there might still be some options to address this use-case in a safe
and race-free way, particularly given that with the udmabuf driver the writes
and reads do not occur simultaneously. We use DMA fences in both the Host
and Guest to ensure this synchronization.
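
If I do go down that route, the exporter-side call is the easy part. A minimal
sketch (udmabuf_move_notify() is a made-up helper and not part of this series;
the real work is making the importers cope with the move):

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

/* Minimal sketch: tell the importers that the backing pages have changed. */
static void udmabuf_move_notify(struct dma_buf *dmabuf)
{
	dma_resv_lock(dmabuf->resv, NULL);
	dma_buf_move_notify(dmabuf);	/* dynamic importers drop their mappings */
	dma_resv_unlock(dmabuf->resv);
}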

Thanks,
Vivek

> -Daniel
> 
> > ---
> >  drivers/dma-buf/udmabuf.c | 172
> ++
> >  1 file changed, 172 insertions(+)
> >
> > diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> > index 10c47bf77fb5..189a36c41906 100644
> > --- a/drivers/dma-buf/udmabuf.c
> > +++ b/drivers/dma-buf/udmabuf.c
> > @@ -4,6 +4,8 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -30,6 +32,23 @@ struct udmabuf {
> > struct sg_table *sg;
> > struct miscdevice *device;
> > pgoff_t *offsets;
> > +   struct udmabuf_vma_range *ranges;
> > +   unsigned int num_ranges;
> > +   struct mmu_notifier notifier;
> > +   struct mutex mn_lock;
> > +   struct list_head mmap_vmas;
> > +};
> > +
> > +struct udmabuf_vma_range {
> > +   struct file *memfd;
> > +   pgoff_t ubufindex;
> > +   unsigned long start;
> > +   unsigned long end;
> > +};
> > +
> > +struct udmabuf_mmap_vma {
> > +   struct list_head vma_link;
> > +   struct vm_area_struct *vma;
> >  };
> >
> >  static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
> > @@ -42,28 +61,54 @@ static vm_fault_t udmabuf_vm_fault(struct
> vm_fault *vmf)
> > if (pgoff >= ubuf->pagecount)
> > return VM_FAULT_SIGBUS;
> >
> > +   mutex_lock(&ubuf->mn_lock);
> > pfn = page_to_pfn(ubuf->pages[pgoff]);
> > if (ubuf->offsets) {
> > pfn += ubuf->offsets[pgoff] >> PAGE_SHIFT;
> > }
> > +   mutex_unlock(&ubuf->mn_lock);
> >
> > return vmf_insert_pfn(vma, vmf->address, pfn);
> >  }
> >
> > +static void udmabuf_vm_close(struct vm_area_struct *vma)
> > +{
> > +   struct udmabuf *ubuf = vma->vm_private_data;
> > +   struct udmabuf_mmap_vma *mmap_vma;
> > +
> > +   list_for_each_entry(mmap_vma, &ubuf->mmap_vmas, vma_link) {
> > +   if (mmap_vma->vma == vma) {
> > +   list_del(&mmap_vma->vma_link);
> > +   kfree(mmap_vma);
> > +   break;
> > +   }
> > +   }
> > +}
> > +
> >  static const struct vm_operations_struct udmabuf_vm_ops = {
> > .fault = udmabuf_vm_fault,
> > +   .close = udmabuf_vm_close,
> >  };
> >
> >  static int mmap_udmabuf(struct dma_buf *buf, struct vm_area_struct
> *vma)
> >  {
> > struct udmabuf *ubuf = buf->priv;
> > +   struct udmabuf_mmap_vma *mmap_vma;
> >
> > if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
> > return -EINVAL;
> >
> > +   mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
> > +   if (!mmap_vma)
> > +   return -ENOMEM;
> > +
> > vma->vm_ops = &udmabuf_vm_ops;
> > vma->vm_private_data = ubuf;
> > vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND |
> VM_DONTDUMP);
> > +
> > +   mmap_vma->vma = vma;
> > +   list_add(&mmap_vma->vma_link, &ubuf->mmap_vmas);
> > +
> > return 0;
> >  }
> >
> > @@ -109,6 +154,7 @@ static struct sg_table *get_sg_table(struct device
> *dev, struct dma_buf *buf,
> > if (ret < 0)
> > goto err_alloc;
> >
> > +   

RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-08-03 Thread Kasireddy, Vivek
Hi Peter,

> > Ok, I'll keep your use-case in mind but AFAICS, the process that creates
> > the udmabuf can be considered the owner. So, I think it makes sense that
> > the owner's VMA range can be registered (via mmu_notifiers) for updates.
> 
> No need to have your special attention on this; my use case is not anything
> useful with details, just wanted to show the idea that virtual address
> range based notification might not work.
> 
> [...]
> 
> > What limitation do you see with the usage of mmu notifiers for this use-
> case?
> > And, if using mmu notifiers is not the right approach, how do you suggest
> we
> > can solve this problem?
> 
> AFAIU, even if there'll be a notification chanism, it needs to be at least
> in per-file address space (probably in file offsets) rather than per-mm for
> a shmem backend, so that any mapping of the file should notify that.
Yes, it makes sense that the notification in this case is a combination of
(mapping, offset). Not sure how challenging it'd be to add such a notification
mechanism that would be per-file address space. However, as discussed
earlier with Alistair, it appears there is some value in having something
similar with mmu notifiers:
mmu_notifier_update_mapping(struct mm_struct *mm, unsigned long address)
And, in the callback, we could get the new page either using hmm_range_fault()
or through the page cache as we get notified after the PTE gets updated:
mapoff = linear_page_index(vma, address);
new_page = find_get_page(vma->vm_file->f_mapping, mapoff);
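
To make that concrete, the callback body could look roughly like the sketch
below. This is hypothetical and not from the posted patches;
udmabuf_update_mapping(), the vma argument and the ubufidx bookkeeping are
made up purely for illustration:

/*
 * Hypothetical sketch: once notified that the PTE at 'address' was updated,
 * pick up the new page from the page cache and swap out the stale one.
 */
static void udmabuf_update_mapping(struct udmabuf *ubuf,
				   struct vm_area_struct *vma,
				   unsigned long address, pgoff_t ubufidx)
{
	pgoff_t mapoff = linear_page_index(vma, address);
	struct page *new_page;

	new_page = find_get_page(vma->vm_file->f_mapping, mapoff);
	if (!new_page)
		return;

	/* keep the reference taken by find_get_page() for the new page */
	put_page(ubuf->pages[ubufidx]);
	ubuf->pages[ubufidx] = new_page;
}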

> 
> Isn't it already too late though to wait that notification until page is
> installed?  Because here you pinned the page for DMA, I think it means
> before a new page installed (but after the page is invalidated) the device
> can DMA to an invalid buffer.
The page is only invalidated in the memfd. Until the hole is written to,
we (udmabuf) can choose to handle any reads (or DMA) using old pages
if needed.

> 
> To come back to the original question: I don't know how that could work at
> all, the userapp should just never do that invalidation, because right
> after it does, the dma buffer will be invalid, and the device can update
> data into trash.  So.. I don't have an easy way to do this right.. besides
> disabling ram discard just like what vfio does already.
Yeah, disabling ram discard is the last option if we cannot find a way to
safely get notified about mapping updates.

Thanks,
Vivek

> 
> Thanks,
> 
> --
> Peter Xu
> 



RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-08-03 Thread Kasireddy, Vivek
Hi Jason,

> > > Right, the "the zero pages are changed into writable pages" in your
> > > above comment just might not apply, because there won't be any page
> > > replacement (hopefully :) ).
> 
> > If the page replacement does not happen when there are new writes to the
> > area where the hole previously existed, then would we still get an
> invalidate
> > when this happens? Is there any other way to get notified when the zeroed
> > page is written to if the invalidate does not get triggered?
> 
> What David is saying is that memfd does not use the zero page
> optimization for hole punches. Any access to the memory, including
> read-only access through hmm_range_fault() will allocate unique
> pages. Since there is no zero page and no zero-page replacement there
> is no issue with invalidations.
It looks like even with hmm_range_fault(), the invalidate does not get
triggered when the hole is refilled with new pages because of writes.
This is probably because hmm_range_fault() does not fault in any pages
that get invalidated later when writes occur. Not sure if there is a way to
request it to fill a hole with zero pages. Here is what I have in the
invalidate callback (added on top of this series):
static bool invalidate_udmabuf(struct mmu_interval_notifier *mn,
			       const struct mmu_notifier_range *range_mn,
			       unsigned long cur_seq)
{
	struct udmabuf_vma_range *range =
			container_of(mn, struct udmabuf_vma_range, range_mn);
	struct udmabuf *ubuf = range->ubuf;
	struct hmm_range hrange = {0};
	unsigned long *pfns, num_pages, timeout;
	int i, ret;

	printk("invalidate; start = %lu, end = %lu\n",
	       range->start, range->end);

	hrange.notifier = mn;
	hrange.default_flags = HMM_PFN_REQ_FAULT;
	hrange.start = max(range_mn->start, range->start);
	hrange.end = min(range_mn->end, range->end);
	num_pages = (hrange.end - hrange.start) >> PAGE_SHIFT;

	pfns = kmalloc_array(num_pages, sizeof(*pfns), GFP_KERNEL);
	if (!pfns)
		return true;

	printk("invalidate; num pages = %lu\n", num_pages);

	hrange.hmm_pfns = pfns;
	timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
	do {
		hrange.notifier_seq = mmu_interval_read_begin(mn);

		mmap_read_lock(ubuf->vmm_mm);
		ret = hmm_range_fault(&hrange);
		mmap_read_unlock(ubuf->vmm_mm);
		if (ret) {
			if (ret == -EBUSY && !time_after(jiffies, timeout))
				continue;
			break;
		}

		if (mmu_interval_read_retry(mn, hrange.notifier_seq))
			continue;
	} while (ret);

	if (!ret) {
		for (i = 0; i < num_pages; i++) {
			printk("hmm returned page = %p; pfn = %lu\n",
			       hmm_pfn_to_page(pfns[i]),
			       pfns[i] & ~HMM_PFN_FLAGS);
		}
	}
	return true;
}

static const struct mmu_interval_notifier_ops udmabuf_invalidate_ops = {
	.invalidate = invalidate_udmabuf,
};
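
(The registration of the above interval notifier is not shown; it would be
roughly the sketch below, assuming range_mn and vmm_mm are the struct members
used in the callback.)

static int udmabuf_register_range(struct udmabuf *ubuf,
				  struct udmabuf_vma_range *range)
{
	return mmu_interval_notifier_insert(&range->range_mn, ubuf->vmm_mm,
					    range->start,
					    range->end - range->start,
					    &udmabuf_invalidate_ops);
}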

Here are the log messages I see when I run the udmabuf (shmem-based) selftest:
[  132.662863] invalidate; start = 140737347612672, end = 140737347629056
[  132.672953] invalidate; num pages = 4
[  132.676690] hmm returned page = 0483755d; pfn = 2595360
[  132.682676] hmm returned page = d5a87cc6; pfn = 2588133
[  132.688651] hmm returned page = f9eb8d20; pfn = 2673429
[  132.694629] hmm returned page = 5b44da27; pfn = 2588481
[  132.700605] invalidate; start = 140737348661248, end = 140737348677632
[  132.710672] invalidate; num pages = 4
[  132.714412] hmm returned page = 02867206; pfn = 2680737
[  132.720394] hmm returned page = 778a48f0; pfn = 2680738
[  132.726366] hmm returned page = d8adf162; pfn = 2680739
[  132.732350] hmm returned page = 671769ff; pfn = 2680740

The above log messages are seen immediately after the hole is punched. As
you can see, hmm_range_fault() returns the pfns of old pages and not zero
pages. And, I see the below messages (with patch #2 in this series applied)
as the hole is refilled after writes:
[  160.279227] udpate mapping; old page = 0483755d; pfn = 2595360
[  160.285809] update mapping; new page = 080e9595; pfn = 2680991
[  160.292402] udpate mapping; old page = d5a87cc6; pfn = 2588133
[  160.298979] update mapping; new page = 0483755d; pfn = 2595360
[  160.305574] udpate mapping; old page = f9eb8d20; pfn = 2673429
[  160.312154] update mapping; new page = d5a87cc6; pfn = 2588133
[  160.318744] udpate mapping; old page = 5b44da27; pfn = 2588481
[  160.325320] update mapping; new page = f9eb8d20; pfn = 2673429
[  

RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-08-01 Thread Kasireddy, Vivek
Hi David,

> 
> On 01.08.23 14:26, Jason Gunthorpe wrote:
> > On Tue, Aug 01, 2023 at 02:26:03PM +0200, David Hildenbrand wrote:
> >> On 01.08.23 14:23, Jason Gunthorpe wrote:
> >>> On Tue, Aug 01, 2023 at 02:22:12PM +0200, David Hildenbrand wrote:
> >>>> On 01.08.23 14:19, Jason Gunthorpe wrote:
> >>>>> On Tue, Aug 01, 2023 at 05:32:38AM +, Kasireddy, Vivek wrote:
> >>>>>
> >>>>>>> You get another invalidate because the memfd removes the zero
> pages
> >>>>>>> that hmm_range_fault installed in the PTEs before replacing them
> with
> >>>>>>> actual writable pages. Then you do the move, and another
> >>>>>>> hmm_range_fault, and basically the whole thing over again. Except
> this
> >>>>>>> time instead of returning zero pages it returns actual writable
> >>>>>>> page.
> >>>>>
> >>>>>> Ok, when I tested earlier (by registering an invalidate callback) but
> without
> >>>>>> hmm_range_fault(), I did not find this additional invalidate getting
> triggered.
> >>>>>> Let me try with hmm_range_fault() and see if everything works as
> expected.
> >>>>>> Thank you for your help.
> >>>>>
> >>>>> If you do not get an invalidate then there is a pretty serious bug in
> >>>>> the mm that needs fixing.
> >>>>>
> >>>>> Anything hmm_range_fault() returns must be invalidated if the
> >>>>> underying CPU mapping changes for any reasons. Since
> hmm_range_fault()
> >>>>> will populate zero pages when reading from a hole in a memfd, it must
> >>>>> also get an invalidation when the zero pages are changed into writable
> >>>>> pages.
> >>>>
> >>>> Can you point me at the code that returns that (shared) zero page?
> >>>
> >>> It calls handle_mm_fault() - shouldn't that do it? Same as if the CPU
> >>> read faulted the page?
> >>
> >> To the best of my knowledge, the shared zeropage is only used in
> >> MAP_PRIVATE|MAP_ANON mappings and in weird DAX mappings.
> >>
> >> If that changed, we have to fix FOLL_PIN|FOLL_LONGTERM for
> MAP_SHARED VMAs.
> >>
> >> If you read-fault on a memfd hole, you should get a proper "zeroed"
> >> pagecache page that effectively "filled that hole" -- so there is no file
> >> hole anymore.
> >
> > Sounds fine then :)
> 
> Right, the "the zero pages are changed into writable pages" in your
> above comment just might not apply, because there won't be any page
> replacement (hopefully :) ).
If the page replacement does not happen when there are new writes to the
area where the hole previously existed, then would we still get an invalidate
when this happens? Is there any other way to get notified when the zeroed
page is written to if the invalidate does not get triggered?

Thanks,
Vivek

> 
> --
> Cheers,
> 
> David / dhildenb



RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-08-01 Thread Kasireddy, Vivek
Hi Peter,

> >
> > > > > > > > > I'm not at all familiar with the udmabuf use case but that
> sounds
> > > > > > > > > brittle and effectively makes this notifier udmabuf specific
> right?
> > > > > > > > Oh, Qemu uses the udmabuf driver to provide Host Graphics
> > > > > components
> > > > > > > > (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest
> created
> > > > > > > > buffers. In other words, from a core mm standpoint, udmabuf
> just
> > > > > > > > collects a bunch of pages (associated with buffers) scattered
> inside
> > > > > > > > the memfd (Guest ram backed by shmem or hugetlbfs) and
> wraps
> > > > > > > > them in a dmabuf fd. And, since we provide zero-copy access,
> we
> > > > > > > > use DMA fences to ensure that the components on the Host and
> > > > > > > > Guest do not access the buffer simultaneously.
> > > > > > >
> > > > > > > So why do you need to track updates proactively like this?
> > > > > > As David noted in the earlier series, if Qemu punches a hole in its
> > > memfd
> > > > > > that goes through pages that are registered against a udmabuf fd,
> then
> > > > > > udmabuf needs to update its list with new pages when the hole gets
> > > > > > filled after (guest) writes. Otherwise, we'd run into the coherency
> > > > > > problem (between udmabuf and memfd) as demonstrated in the
> > > selftest
> > > > > > (patch #3 in this series).
> > > > >
> > > > > Wouldn't this all be very much better if Qemu stopped punching holes
> > > there?
> > > > I think holes can be punched anywhere in the memfd for various
> reasons.
> > > Some
> > >
> > > I just start to read this thread, even haven't finished all of them.. but
> > > so far I'm not sure whether this is right at all..
> > >
> > > udmabuf is a file, it means it should follow the file semantics. Mmu
> > Right, it is a file but a special type of file given that it is a dmabuf. 
> > So, AFAIK,
> > operations such as truncate, FALLOC_FL_PUNCH_HOLE, etc cannot be done
> > on it. And, in our use-case, since udmabuf driver is sharing (or exporting)
> its
> > buffer (via the fd), consumers (or importers) of the dmabuf fd are expected
> > to only read from it.
> >
> > > notifier is per-mm, otoh.
> > >
> > > Imagine for some reason QEMU mapped the guest pages twice, udmabuf
> is
> > > created with vma1, so udmabuf registers the mm changes over vma1
> only.
> > Udmabufs are created with pages obtained from the mapping using offsets
> > provided by Qemu.
> >
> > >
> > > However the shmem/hugetlb page cache can be populated in either
> vma1, or
> > > vma2.  It means when populating on vma2 udmabuf won't get update
> notify
> > > at
> > > all, udmabuf pages can still be obsolete.  Same thing to when multi-
> process
> > In this (unlikely) scenario you described above,
> 
> IMHO it's very legal for qemu to do that, we won't want this to break so
> easily and silently simply because qemu mapped it twice.  I would hope
> it'll not be myself to debug something like that. :)
> 
> I actually personally have a tree that does exactly that:
> 
> https://github.com/xzpeter/qemu/commit/62050626d6e511d022953165cc0f
> 604bf90c5324
> 
> But that's definitely not in main line.. it shouldn't need special
> attention, either.  Just want to say that it can always happen for various
> reasons especially in an relatively involved software piece like QEMU.
Ok, I'll keep your use-case in mind but AFAICS, the process that creates
the udmabuf can be considered the owner. So, I think it makes sense that
the owner's VMA range can be registered (via mmu_notifiers) for updates.

> 
> > I think we could still find all the
> > VMAs (and ranges) where the guest buffer pages are mapped (and register
> > for PTE updates) using Qemu's mm_struct. The below code can be
> modified
> > to create a list of VMAs where the guest buffer pages are mapped.
> > static struct vm_area_struct *find_guest_ram_vma(struct udmabuf *ubuf,
> >  struct mm_struct *vmm_mm)
> > {
> > struct vm_area_struct *vma = NULL;
> > MA_STATE(mas, &vmm_mm->mm_mt, 0, 0);
> > unsigned long addr;
> > pgoff_t pg;
> >
> > mas_set(&mas, 0);
> > mmap_read_lock(vmm_mm);
> > mas_for_each(&mas, vma, ULONG_MAX) {
> > for (pg = 0; pg < ubuf->pagecount; pg++) {
> > addr = page_address_in_vma(ubuf->pages[pg], vma);
> > if (addr == -EFAULT)
> > break;
> > }
> > if (addr != -EFAULT)
> > break;
> > }
> > mmap_read_unlock(vmm_mm);
> >
> > return vma;
> > }
> 
> This is hackish to me, and not working when across mm (multi-proc qemu).
Udmabuf backend is still considered experimental for multi-proc qemu (i.e.,
Qemu + vhost-user-gpu given our use-case). And, it looks like the usage of
the udmabuf driver in both cases is different.

> 
> >
> > > QEMU is used, where 

RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-07-31 Thread Kasireddy, Vivek
Hi Jason,

> 
> > > Later the importer decides it needs the memory again so it again asks
> > > for the dmabuf to be present, which does hmm_range_fault and gets
> > > whatever is appropriate at the time.
> > Unless I am missing something, I think just doing the above still won't 
> > solve
> > the problem. Consider this sequence:
> >  write_to_memfd(addr1, size, 'a');
> >  buf = create_udmabuf_list(devfd, memfd, size);
> >  addr2 = mmap_fd(buf, NUM_PAGES * NUM_ENTRIES * getpagesize());
> >  read(addr2);
> >  write_to_memfd(addr1, size, 'b');
> >  punch_hole(memfd, MEMFD_SIZE / 2);
> > -> Since we can process the invalidate at this point, as per your 
> > suggestion,
> >  we can trigger dmabuf move to let the importers know that the
> dmabuf's
> >  backing memory has changed (or moved).
> >
> >  read(addr2);
> > -> Because there is a hole, we can handle the read by either providing the
> >  old pages or zero pages (if using hmm_range_fault()) to the
> > importers.
> 
> You never provide the old pages. After trunctate the only correct
> value to read is zero.
> 
> >  Maybe it is against convention, but I think it makes sense to provide 
> > old
> >  pages (that were mapped before the hole punch) because the importers
> >  have not read the data in these pages ('b' above) yet.
> 
> Nope.
> 
> >  And, another reason to provide old pages is because the data in
> >  these pages is shown in a window on the Host's screen so it
> >  doesn't make sense to show zero page data.
> 
> So why did you trucate it if you want to keep the data?
> 
> 
> > -> write_to_memfd(addr1, size, 'c');
> >  As the hole gets refilled (with new pages) after the above write, 
> > AFAIU,
> we
> >  have to tell the importers again that since the backing memory has
> changed,
> >  (new pages) they need to recreate their mappings. But herein lies the
> problem:
> >  from inside the udmabuf driver, we cannot know when this write occurs,
> so we
> >  would not be able to notify the importers of the dmabuf move.
> 
> You get another invalidate because the memfd removes the zero pages
> that hmm_range_fault installed in the PTEs before replacing them with
> actual writable pages. Then you do the move, and another
> hmm_range_fault, and basically the whole thing over again. Except this
> time instead of returning zero pages it returns actual writable page.
Ok, when I tested earlier (by registering an invalidate callback) but without
hmm_range_fault(), I did not find this additional invalidate getting triggered.
Let me try with hmm_range_fault() and see if everything works as expected.
Thank you for your help.


Thanks,
Vivek

> 
> Jason


RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-07-28 Thread Kasireddy, Vivek
Hi Jason,

> > > > > If you still need the memory mapped then you re-call
> hmm_range_fault
> > > > > and re-obtain it. hmm_range_fault will resolve all the races and you
> > > > > get new pages.
> > >
> > > > IIUC, for my udmabuf use-case, it looks like calling hmm_range_fault
> > > > immediately after an invalidate (range notification) would preemptively
> > > fault in
> > > > new pages before a write. The problem with that is if a read occurs on
> > > those
> > > > new pages, then the data is incorrect as a write may not have
> > > > happened yet.
> > >
> > > It cannot be, if you use hmm_range_fault correctly you cannot get
> > > corruption no matter what is done to the mmap'd memfd. If there is
> > > otherwise it is a hmm_range_fault bug plain and simple.
> > >
> > > > Ideally, what I am looking for is for getting new pages at the time of 
> > > > or
> after
> > > > a write; until then, it is ok to use the old pages given my use-case.
> > >
> > > It is wrong, if you are synchronizing the vma then you must use the
> > > latest copy. If your use case can tolerate it then keep a 'not
> > > present' indication for the missing pages until you actually need
> > > them, but dmabuf doesn't really provide an API for that.
> > >
> > > > I think the difference comes down to whether we (udmabuf driver)
> want to
> > > > grab the new pages after getting notified about a PTE update because
> > > > of a fault
> > >
> > > Why? You still haven't explained why you want this.
> > Ok, let me explain using one of the udmabuf selftests (added in patch #3)
> > to describe the problem (sorry, I'd have to use the terms memfd, hole, etc)
> > I am trying to solve:
> > size = MEMFD_SIZE * page_size;
> > memfd = create_memfd_with_seals(size, false);
> > addr1 = mmap_fd(memfd, size);
> > write_to_memfd(addr1, size, 'a');
> > buf = create_udmabuf_list(devfd, memfd, size);
> > addr2 = mmap_fd(buf, NUM_PAGES * NUM_ENTRIES * getpagesize());
> > punch_hole(memfd, MEMFD_SIZE / 2);
> >-> At this point, if I were to read addr1, it'd still have "a" in 
> > relevant areas
> > because a new write hasn't happened yet. And, since this results in 
> > an
> > invalidation (notification) of the associated VMA range, I could 
> > register
> > a callback in udmabuf driver and get notified but I am not sure how 
> > or
> > why that would be useful.
> 
> When you get an invalidation you trigger dmabuf move, which revokes
> the importes use of the dmabuf because the underlying memory has
> changed. This is exactly the same as a GPU driver migrating memory
> to/fro CPU memory.
> 
> >
> > write_to_memfd(addr1, size, 'b');
> >-> Here, the hole gets refilled as a result of the above writes which 
> > trigger
> > faults and the PTEs are updated to point to new pages. When this
> happens,
> > the udmabuf driver needs to be made aware of the new pages that
> were
> > faulted in because of the new writes.
> 
> You only need this because you are not processing the invalidate.
> 
> > a way to get notified when the hole is written to, the solution I 
> > came
> up
> > with is to either add a new notifier or add calls to change_pte() 
> > when
> the
> > PTEs do get updated. However, considering your suggestion to use
> > hmm_range_fault(), it is not clear to me how it would help while the
> hole
> > is being written to as the writes occur outside of the
> > udmabuf driver.
> 
> You have the design backwards.
> 
> When a dmabuf importer asks for the dmabuf to be present you call
> hmm_range_fault() and you get back whatever memory is appropriate. The
> importer can then use it.
> 
> If the underlying memory changes then you get the invalidation and you
> trigger move. The importer stops using the memory and the underlying
> pages change.
> 
> Later the importer decides it needs the memory again so it again asks
> for the dmabuf to be present, which does hmm_range_fault and gets
> whatever is appropriate at the time.
Unless I am missing something, I think just doing the above still won't solve
the problem. Consider this sequence:
 write_to_memfd(addr1, size, 'a');
 buf = create_udmabuf_list(devfd, memfd, size);
 addr2 = mmap_fd(buf, NUM_PAGES * NUM_ENTRIES * getpagesize());
 read(addr2);
 write_to_memfd(addr1, size, 'b');
 punch_hole(memfd, MEMFD_SIZE / 2);
-> Since we can process the invalidate at this point, as per your suggestion,
 we can trigger dmabuf move to let the importers know that the dmabuf's
 backing memory has changed (or moved).

 read(addr2);
-> Because there is a hole, we can handle the read by either providing the
 old pages or zero pages (if using hmm_range_fault()) to the importers.
 Maybe it is against convention, but I think it makes sense to provide old
 pages (that were mapped before the hole punch) because the importers
 

RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-07-28 Thread Kasireddy, Vivek
Hi Peter,

> > > > > > > I'm not at all familiar with the udmabuf use case but that sounds
> > > > > > > brittle and effectively makes this notifier udmabuf specific 
> > > > > > > right?
> > > > > > Oh, Qemu uses the udmabuf driver to provide Host Graphics
> > > components
> > > > > > (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest 
> > > > > > created
> > > > > > buffers. In other words, from a core mm standpoint, udmabuf just
> > > > > > collects a bunch of pages (associated with buffers) scattered inside
> > > > > > the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> > > > > > them in a dmabuf fd. And, since we provide zero-copy access, we
> > > > > > use DMA fences to ensure that the components on the Host and
> > > > > > Guest do not access the buffer simultaneously.
> > > > >
> > > > > So why do you need to track updates proactively like this?
> > > > As David noted in the earlier series, if Qemu punches a hole in its
> memfd
> > > > that goes through pages that are registered against a udmabuf fd, then
> > > > udmabuf needs to update its list with new pages when the hole gets
> > > > filled after (guest) writes. Otherwise, we'd run into the coherency
> > > > problem (between udmabuf and memfd) as demonstrated in the
> selftest
> > > > (patch #3 in this series).
> > >
> > > Wouldn't this all be very much better if Qemu stopped punching holes
> there?
> > I think holes can be punched anywhere in the memfd for various reasons.
> Some
> 
> I just start to read this thread, even haven't finished all of them.. but
> so far I'm not sure whether this is right at all..
> 
> udmabuf is a file, it means it should follow the file semantics. Mmu
Right, it is a file but a special type of file given that it is a dmabuf. So,
AFAIK, operations such as truncate, FALLOC_FL_PUNCH_HOLE, etc cannot be done
on it. And, in our use-case, since udmabuf driver is sharing (or exporting) its
buffer (via the fd), consumers (or importers) of the dmabuf fd are expected
to only read from it.

> notifier is per-mm, otoh.
> 
> Imagine for some reason QEMU mapped the guest pages twice, udmabuf is
> created with vma1, so udmabuf registers the mm changes over vma1 only.
Udmabufs are created with pages obtained from the mapping using offsets
provided by Qemu. 

> 
> However the shmem/hugetlb page cache can be populated in either vma1, or
> vma2.  It means when populating on vma2 udmabuf won't get update notify
> at
> all, udmabuf pages can still be obsolete.  Same thing to when multi-process
In this (unlikely) scenario you described above, I think we could still find
all the VMAs (and ranges) where the guest buffer pages are mapped (and register
for PTE updates) using Qemu's mm_struct. The below code can be modified
to create a list of VMAs where the guest buffer pages are mapped.
static struct vm_area_struct *find_guest_ram_vma(struct udmabuf *ubuf,
						 struct mm_struct *vmm_mm)
{
	struct vm_area_struct *vma = NULL;
	MA_STATE(mas, &vmm_mm->mm_mt, 0, 0);
	unsigned long addr;
	pgoff_t pg;

	mas_set(&mas, 0);
	mmap_read_lock(vmm_mm);
	mas_for_each(&mas, vma, ULONG_MAX) {
		for (pg = 0; pg < ubuf->pagecount; pg++) {
			addr = page_address_in_vma(ubuf->pages[pg], vma);
			if (addr == -EFAULT)
				break;
		}
		if (addr != -EFAULT)
			break;
	}
	mmap_read_unlock(vmm_mm);

	return vma;
}

> QEMU is used, where we can have vma1 in QEMU while vma2 in the other
> process like vhost-user.
> 
> I think the trick here is we tried to "hide" the fact that these are
> actually normal file pages, but we're doing PFNMAP on them... then we want
> the file features back, like hole punching..
> 
> If we used normal file operations, everything will just work fine; TRUNCATE
> will unmap the host mapped frame buffers when needed, and when
> accessed
> it'll fault on demand from the page cache.  We seem to be trying to
> reinvent "truncation" for pfnmap but mmu notifier doesn't sound right to
> this at least..
If we can figure out the VMA ranges where the guest buffer pages are mapped,
we should be able to register mmu notifiers for those ranges right?
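Something along the lines of the below rough sketch; udmabuf_notifier_ops is
assumed to be an ops struct carrying the update_mapping callback from patch #2:

/*
 * Rough sketch: once the VMA ranges containing the guest buffer pages are
 * found, register a single per-mm notifier; the callback can then filter
 * out notifications that fall outside of the recorded ranges.
 */
static int udmabuf_register_notifier(struct udmabuf *ubuf,
				     struct mm_struct *vmm_mm)
{
	ubuf->notifier.ops = &udmabuf_notifier_ops;
	return mmu_notifier_register(&ubuf->notifier, vmm_mm);
}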

> 
> > of the use-cases where this would be done were identified by David. Here
> is what
> > he said in an earlier discussion:
> > "There are *probably* more issues on the QEMU side when udmabuf is
> paired
> > with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for
> > virtio-balloon, virtio-mem, postcopy live migration, ... for example, in"
> 
> Now after seething this, I'm truly wondering whether we can still simply
> use the file semantics we already have (for either shmem/hugetlb/...), or
> is it a must we need to use a single fd to represent all?
> 
> Say, can we just use a tuple (fd, page_array) rather than the udmabuf
> itself to do host 

RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-07-27 Thread Kasireddy, Vivek
Hi Jason,

> 
> On Tue, Jul 25, 2023 at 10:44:09PM +, Kasireddy, Vivek wrote:
> > > If you still need the memory mapped then you re-call hmm_range_fault
> > > and re-obtain it. hmm_range_fault will resolve all the races and you
> > > get new pages.
> 
> > IIUC, for my udmabuf use-case, it looks like calling hmm_range_fault
> > immediately after an invalidate (range notification) would preemptively
> fault in
> > new pages before a write. The problem with that is if a read occurs on
> those
> > new pages, then the data is incorrect as a write may not have
> > happened yet.
> 
> It cannot be, if you use hmm_range_fault correctly you cannot get
> corruption no matter what is done to the mmap'd memfd. If there is
> otherwise it is a hmm_range_fault bug plain and simple.
> 
> > Ideally, what I am looking for is for getting new pages at the time of or 
> > after
> > a write; until then, it is ok to use the old pages given my use-case.
> 
> It is wrong, if you are synchronizing the vma then you must use the
> latest copy. If your use case can tolerate it then keep a 'not
> present' indication for the missing pages until you actually need
> them, but dmabuf doesn't really provide an API for that.
> 
> > I think the difference comes down to whether we (udmabuf driver) want to
> > grab the new pages after getting notified about a PTE update because
> > of a fault
> 
> Why? You still haven't explained why you want this.
Ok, let me explain using one of the udmabuf selftests (added in patch #3)
to describe the problem (sorry, I'd have to use the terms memfd, hole, etc)
I am trying to solve:
size = MEMFD_SIZE * page_size;
memfd = create_memfd_with_seals(size, false);
addr1 = mmap_fd(memfd, size);
write_to_memfd(addr1, size, 'a');
buf = create_udmabuf_list(devfd, memfd, size);
addr2 = mmap_fd(buf, NUM_PAGES * NUM_ENTRIES * getpagesize());
punch_hole(memfd, MEMFD_SIZE / 2);
   -> At this point, if I were to read addr1, it'd still have "a" in relevant
      areas because a new write hasn't happened yet. And, since this results
      in an invalidation (notification) of the associated VMA range, I could
      register a callback in the udmabuf driver and get notified, but I am
      not sure how or why that would be useful.

write_to_memfd(addr1, size, 'b');
   -> Here, the hole gets refilled as a result of the above writes, which
      trigger faults, and the PTEs are updated to point to new pages. When
      this happens, the udmabuf driver needs to be made aware of the new
      pages that were faulted in because of the new writes. Since there does
      not appear to be a way to get notified when the hole is written to,
      the solution I came up with is to either add a new notifier or add
      calls to change_pte() when the PTEs do get updated. However,
      considering your suggestion to use hmm_range_fault(), it is not clear
      to me how it would help while the hole is being written to, as the
      writes occur outside of the udmabuf driver. And, there is no way to
      get notified or track them either, AFAICS, from inside the udmabuf
      driver.

Thanks,
Vivek

> 
> If you are writing to the pages then you have to do this
> 
> If you are reading from the pages then hmm_range_fault should return
> the zero page for a hole until it is written too
> 
> Jason


RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-07-25 Thread Kasireddy, Vivek
Hi Jason,

> > >
> > > > > I'm not at all familiar with the udmabuf use case but that sounds
> > > > > brittle and effectively makes this notifier udmabuf specific right?
> > > > Oh, Qemu uses the udmabuf driver to provide Host Graphics
> components
> > > > (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
> > > > buffers. In other words, from a core mm standpoint, udmabuf just
> > > > collects a bunch of pages (associated with buffers) scattered inside
> > > > the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> > > > them in a dmabuf fd. And, since we provide zero-copy access, we
> > > > use DMA fences to ensure that the components on the Host and
> > > > Guest do not access the buffer simultaneously.
> > >
> > > So why do you need to track updates proactively like this?
> > As David noted in the earlier series, if Qemu punches a hole in its memfd
> > that goes through pages that are registered against a udmabuf fd, then
> > udmabuf needs to update its list with new pages when the hole gets
> > filled after (guest) writes. Otherwise, we'd run into the coherency
> > problem (between udmabuf and memfd) as demonstrated in the selftest
> > (patch #3 in this series).
> 
> Holes created in VMA are tracked by invalidation, you haven't
> explained why this needs to also see change.
Oh, the invalidation part is ok and does not need any changes. My concern
(and the reason for this new notifier patch) is only about the lack of a
notification when a PTE is updated because of a fault (new page). In other
words, if something like change_pte() were called after handle_pte_fault()
or hugetlb_fault(), this patch would not be needed.

> 
> BTW it is very jarring to hear you talk about files when working with
> mmu notifiers. MMU notifiers do not track hole punches or memfds, they
> track VMAs and PTEs. Punching a hole in a mmapped memfd will
> invalidate the convering PTEs.
I figured describing the problem in terms of memfds or hole punches would
provide more context; but, ok, I'll refrain from mentioning memfds or holes
and limit the discussion of this patch to VMAs and PTEs. 

> 
> > > Trigger a move when the backing memory changes and re-acquire it with
> > AFAICS, without this patch or adding new change_pte calls, there is no way
> to
> > get notified when a new page is mapped into the backing memory of a
> memfd
> > (backed by shmem or hugetlbfs) which happens after a hole punch
> followed
> > by writes.
> 
> Yes, we have never wanted to do this because is it racy.
> 
> If you still need the memory mapped then you re-call hmm_range_fault
> and re-obtain it. hmm_range_fault will resolve all the races and you
> get new pages.
IIUC, for my udmabuf use-case, it looks like calling hmm_range_fault
immediately after an invalidate (range notification) would preemptively fault in
new pages before a write. The problem with that is if a read occurs on those
new pages, then the data is incorrect as a write may not have happened yet.
Ideally, what I am looking for is for getting new pages at the time of or after
a write; until then, it is ok to use the old pages given my use-case.

> 
> > We can definitely get notified when a hole is punched via the
> > invalidate notifiers though, but as I described earlier this is not very 
> > helpful
> > for the udmabuf use-case.
> 
> I still don't understand why, or what makes udmabuf so special
> compared to all the other places tracking VMA changes and using
> hmm_range_fault.
I think the difference comes down to whether we (udmabuf driver) want to
grab the new pages after getting notified about a PTE update because of a fault
triggered by a write vs proactively obtaining the new pages by triggering the
fault (since hmm_range_fault() seems to call handle_mm_fault()) before a
potential write.

Thanks,
Vivek

> 
> Jason


RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-07-25 Thread Kasireddy, Vivek
Hi Hugh,

> 
> On Mon, 24 Jul 2023, Kasireddy, Vivek wrote:
> > Hi Jason,
> > > On Mon, Jul 24, 2023 at 07:54:38AM +0000, Kasireddy, Vivek wrote:
> > >
> > > > > I'm not at all familiar with the udmabuf use case but that sounds
> > > > > brittle and effectively makes this notifier udmabuf specific right?
> > > > Oh, Qemu uses the udmabuf driver to provide Host Graphics
> components
> > > > (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
> > > > buffers. In other words, from a core mm standpoint, udmabuf just
> > > > collects a bunch of pages (associated with buffers) scattered inside
> > > > the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> > > > them in a dmabuf fd. And, since we provide zero-copy access, we
> > > > use DMA fences to ensure that the components on the Host and
> > > > Guest do not access the buffer simultaneously.
> > >
> > > So why do you need to track updates proactively like this?
> > As David noted in the earlier series, if Qemu punches a hole in its memfd
> > that goes through pages that are registered against a udmabuf fd, then
> > udmabuf needs to update its list with new pages when the hole gets
> > filled after (guest) writes. Otherwise, we'd run into the coherency
> > problem (between udmabuf and memfd) as demonstrated in the selftest
> > (patch #3 in this series).
> 
> Wouldn't this all be very much better if Qemu stopped punching holes there?
I think holes can be punched anywhere in the memfd for various reasons. Some
of the use-cases where this would be done were identified by David. Here is what
he said in an earlier discussion:
"There are *probably* more issues on the QEMU side when udmabuf is paired 
with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for 
virtio-balloon, virtio-mem, postcopy live migration, ... for example, in"

Thanks,
Vivek

> 
> Hugh


RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-07-24 Thread Kasireddy, Vivek
Hi Alistair,

> >>
> >> Yes, although obviously as I think you point out below you wouldn't be
> >> able to take any sleeping locks in mmu_notifier_update_mapping().
> > Yes, I understand that, but I am not sure how we can prevent any potential
> > notifier callback from taking sleeping locks other than adding clear
> comments.
> 
> Oh of course not, but is such a restriction on not taking sleeping locks
> acceptable for your implementation of the notifier callback? I notice in
> patch 2 update_udmabuf() takes a mutex so I assumed not being able to
> sleep in the callback would be an issue.
I plan to drop the mutex in v2 as it is not really needed; as I described in
my previous reply, we ensure Guest and Host synchronization via other means.

> 
> >>
> >> > In which case I'd need to make a similar change in the shmem path as
> well.
> >> > And, also redo (or eliminate) the locking in udmabuf (patch) which
> seems a
> >> > bit excessive on a second look given our use-case (where reads and
> writes
> >> do
> >> > not happen simultaneously due to fence synchronization in the guest
> >> driver).
> >>
> >> I'm not at all familiar with the udmabuf use case but that sounds
> >> brittle and effectively makes this notifier udmabuf specific right?
> > Oh, Qemu uses the udmabuf driver to provide Host Graphics components
> > (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
> > buffers. In other words, from a core mm standpoint, udmabuf just
> > collects a bunch of pages (associated with buffers) scattered inside
> > the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> > them in a dmabuf fd. And, since we provide zero-copy access, we
> > use DMA fences to ensure that the components on the Host and
> > Guest do not access the buffer simultaneously.
> 
> Thanks for the background!
> 
> >> contemplated adding a notifier for PTE updates for drivers using
> >> hmm_range_fault() as it would save some expensive device faults and it
> >> this could be useful for that.
> >>
> >> So if we're adding a notifier for PTE updates I think it would be good
> >> if it covered all cases and was robust enough to allow mirroring of the
> >> correct PTE value (ie. by being called under PTL or via some other
> >> synchronisation like hmm_range_fault()).
> > Ok; in order to make it clear that the notifier is associated with PTE
> updates,
> > I think it needs to have a more suitable name such as
> mmu_notifier_update_pte()
> > or mmu_notifier_new_pte(). But we already have
> mmu_notifier_change_pte,
> > which IIUC is used mainly for PTE updates triggered by KSM. So, I am
> inclining
> > towards dropping this new notifier and instead adding a new flag to
> change_pte
> > to distinguish between KSM triggered notifications and others. Something
> along
> > the lines of:
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 218ddc3b4bc7..6afce2287143 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -129,7 +129,8 @@ struct mmu_notifier_ops {
> > void (*change_pte)(struct mmu_notifier *subscription,
> >struct mm_struct *mm,
> >unsigned long address,
> > -  pte_t pte);
> > +  pte_t pte,
> > +  bool ksm_update);
> > @@ -658,7 +659,7 @@ static inline void mmu_notifier_range_init_owner(
> > unsigned long ___address = __address;   \
> > pte_t ___pte = __pte;   \
> > \
> > -   mmu_notifier_change_pte(___mm, ___address, ___pte); \
> > +   mmu_notifier_change_pte(___mm, ___address, ___pte, true);   \
> >
> > And replace mmu_notifier_update_mapping(vma->vm_mm, address,
> pte_pfn(*ptep))
> > in the current patch with
> > mmu_notifier_change_pte(vma->vm_mm, address, ptep, false));
> 
> I wonder if we actually need the flag? IIUC it is already used for more
> than just KSM. For example it can be called as part of fault handling by
> set_pte_at_notify() in in wp_page_copy().
Yes, I noticed that, but what I really meant is that I'd put all these prior
instances of change_pte in one category using the flag. Without the flag,
KVM, the only user that currently has a callback for change_pte, would get
notified, which may not be appropriate. Note that the change_pte callback
for KVM was added (based on the Git log) for KSM updates, and it is not
clear to me if that is still the case.

> 
> > Would that work for your HMM use-case -- assuming we call change_pte
> after
> > taking PTL?
> 
> I suspect being called under the PTL could be an issue. For HMM we use a
> combination of sequence numbers and a mutex to synchronise PTEs. To
> avoid calling the notifier while holding PTL we might be able to record
> the sequence number (subscriptions->invalidate_seq) while 

RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-07-24 Thread Kasireddy, Vivek
Hi Jason,

> 
> On Mon, Jul 24, 2023 at 07:54:38AM +, Kasireddy, Vivek wrote:
> 
> > > I'm not at all familiar with the udmabuf use case but that sounds
> > > brittle and effectively makes this notifier udmabuf specific right?
> > Oh, Qemu uses the udmabuf driver to provide Host Graphics components
> > (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
> > buffers. In other words, from a core mm standpoint, udmabuf just
> > collects a bunch of pages (associated with buffers) scattered inside
> > the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> > them in a dmabuf fd. And, since we provide zero-copy access, we
> > use DMA fences to ensure that the components on the Host and
> > Guest do not access the buffer simultaneously.
> 
> So why do you need to track updates proactively like this?
As David noted in the earlier series, if Qemu punches a hole in its memfd
that goes through pages that are registered against a udmabuf fd, then
udmabuf needs to update its list with new pages when the hole gets
filled after (guest) writes. Otherwise, we'd run into the coherency 
problem (between udmabuf and memfd) as demonstrated in the selftest
(patch #3 in this series).

> 
> Trigger a move when the backing memory changes and re-acquire it with
AFAICS, without this patch or adding new change_pte calls, there is no way to
get notified when a new page is mapped into the backing memory of a memfd
(backed by shmem or hugetlbfs) which happens after a hole punch followed
by writes. We can definitely get notified when a hole is punched via the
invalidate notifiers though, but as I described earlier this is not very helpful
for the udmabuf use-case.

> hmm_range_fault like everything else does.
> 
> > And replace mmu_notifier_update_mapping(vma->vm_mm, address,
> pte_pfn(*ptep))
> > in the current patch with
> > mmu_notifier_change_pte(vma->vm_mm, address, ptep, false));
> 
> It isn't very useful because nothing can do anything meaningful under
> the PTLs. Can't allocate memory for instance. Which makes me wonder
> what it is udmabuf plans to actually do here.
It is useful for udmabuf because it helps ensure coherency with the memfd.
If you look at patch #2 in this series, particularly the notifier callback
(update_udmabuf), it just updates its list of pages and does not allocate any
memory or do anything that would cause it to go to sleep, other than taking a
mutex which I plan to drop in v2 as it is not really needed. With that removed,
I think it seems ok to call the notifier callback under the PTL.
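
For illustration, what I have in mind for v2 is roughly the below sketch (not
the actual patch; udmabuf_find_index() and pages_lock are made up here, and
the pfn is the one passed in by the update_mapping notifier):

static void update_udmabuf(struct mmu_notifier *mn, struct mm_struct *mm,
			   unsigned long address, unsigned long pfn)
{
	struct udmabuf *ubuf = container_of(mn, struct udmabuf, notifier);
	pgoff_t pgidx = udmabuf_find_index(ubuf, address);	/* made-up helper */

	if (pgidx == (pgoff_t)-1)
		return;

	/* only non-sleeping work, so this is safe to run under the PTL */
	spin_lock(&ubuf->pages_lock);
	ubuf->pages[pgidx] = pfn_to_page(pfn);
	spin_unlock(&ubuf->pages_lock);
}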

Thanks,
Vivek

> 
> JAson


RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-07-24 Thread Kasireddy, Vivek
Hi Alistair,

> 
> 
> "Kasireddy, Vivek"  writes:
> 
> > Hi Alistair,
> >
> >>
> >> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> > index 64a3239b6407..1f2f0209101a 100644
> >> > --- a/mm/hugetlb.c
> >> > +++ b/mm/hugetlb.c
> >> > @@ -6096,8 +6096,12 @@ vm_fault_t hugetlb_fault(struct mm_struct
> >> *mm, struct vm_area_struct *vma,
> >> >   * hugetlb_no_page will drop vma lock and hugetlb fault
> >> >   * mutex internally, which make us return immediately.
> >> >   */
> >> > -return hugetlb_no_page(mm, vma, mapping, idx, address,
> >> ptep,
> >> > +ret = hugetlb_no_page(mm, vma, mapping, idx, address,
> >> ptep,
> >> >entry, flags);
> >> > +if (!ret)
> >> > +mmu_notifier_update_mapping(vma->vm_mm,
> >> address,
> >> > +pte_pfn(*ptep));
> >>
> >> The next patch ends up calling pfn_to_page() on the result of
> >> pte_pfn(*ptep). I don't think that's safe because couldn't the PTE have
> >> already changed and/or the new page have been freed?
> > Yeah, that might be possible; I believe the right thing to do would be:
> > -   return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
> > +   ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
> >   entry, flags);
> > +   if (!ret) {
> > +   ptl = huge_pte_lock(h, mm, ptep);
> > +   mmu_notifier_update_mapping(vma->vm_mm, address,
> > +pte_pfn(*ptep));
> > +   spin_unlock(ptl);
> > +   }
> 
> Yes, although obviously as I think you point out below you wouldn't be
> able to take any sleeping locks in mmu_notifier_update_mapping().
Yes, I understand that, but I am not sure how we can prevent any potential
notifier callback from taking sleeping locks other than adding clear comments.

> 
> > In which case I'd need to make a similar change in the shmem path as well.
> > And, also redo (or eliminate) the locking in udmabuf (patch) which seems a
> > bit excessive on a second look given our use-case (where reads and writes
> do
> > not happen simultaneously due to fence synchronization in the guest
> driver).
> 
> I'm not at all familiar with the udmabuf use case but that sounds
> brittle and effectively makes this notifier udmabuf specific right?
Oh, Qemu uses the udmabuf driver to provide Host Graphics components
(such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
buffers. In other words, from a core mm standpoint, udmabuf just
collects a bunch of pages (associated with buffers) scattered inside
the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
them in a dmabuf fd. And, since we provide zero-copy access, we
use DMA fences to ensure that the components on the Host and
Guest do not access the buffer simultaneously.

> 
> The name gives the impression it is more general though. I have
I'd like to make it suitable for general usage.

> contemplated adding a notifier for PTE updates for drivers using
> hmm_range_fault() as it would save some expensive device faults and it
> this could be useful for that.
> 
> So if we're adding a notifier for PTE updates I think it would be good
> if it covered all cases and was robust enough to allow mirroring of the
> correct PTE value (ie. by being called under PTL or via some other
> synchronisation like hmm_range_fault()).
Ok; in order to make it clear that the notifier is associated with PTE updates,
I think it needs to have a more suitable name such as mmu_notifier_update_pte()
or mmu_notifier_new_pte(). But we already have mmu_notifier_change_pte, which
IIUC is used mainly for PTE updates triggered by KSM. So, I am inclined to drop
this new notifier and instead add a new flag to change_pte to distinguish
between KSM-triggered notifications and others. Something along the lines of:
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 218ddc3b4bc7..6afce2287143 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -129,7 +129,8 @@ struct mmu_notifier_ops {
 	void (*change_pte)(struct mmu_notifier *subscription,
 			   struct mm_struct *mm,
 			   unsigned long address,
-			   pte_t pte);
+			   pte_t pte,
+			   bool ksm_update);

RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-07-20 Thread Kasireddy, Vivek
Hi Alistair,

> 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 64a3239b6407..1f2f0209101a 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6096,8 +6096,12 @@ vm_fault_t hugetlb_fault(struct mm_struct
> *mm, struct vm_area_struct *vma,
> >  * hugetlb_no_page will drop vma lock and hugetlb fault
> >  * mutex internally, which make us return immediately.
> >  */
> > -   return hugetlb_no_page(mm, vma, mapping, idx, address,
> ptep,
> > +   ret = hugetlb_no_page(mm, vma, mapping, idx, address,
> ptep,
> >   entry, flags);
> > +   if (!ret)
> > +   mmu_notifier_update_mapping(vma->vm_mm,
> address,
> > +   pte_pfn(*ptep));
> 
> The next patch ends up calling pfn_to_page() on the result of
> pte_pfn(*ptep). I don't think that's safe because couldn't the PTE have
> already changed and/or the new page have been freed?
Yeah, that might be possible; I believe the right thing to do would be:
-	return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
+	ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
 			      entry, flags);
+	if (!ret) {
+		ptl = huge_pte_lock(h, mm, ptep);
+		mmu_notifier_update_mapping(vma->vm_mm, address,
+					    pte_pfn(*ptep));
+		spin_unlock(ptl);
+	}

In which case I'd need to make a similar change in the shmem path as well.
And, also redo (or eliminate) the locking in udmabuf (patch) which seems a
bit excessive on a second look given our use-case (where reads and writes do
not happen simultaneously due to fence synchronization in the guest driver). 

Thanks,
Vivek

> 
> > +   return ret;
> >
> > ret = 0;
> >
> > @@ -6223,6 +6227,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm,
> struct vm_area_struct *vma,
> >  */
> > if (need_wait_lock)
> > folio_wait_locked(folio);
> > +   if (!ret)
> > +   mmu_notifier_update_mapping(vma->vm_mm, address,
> > +   pte_pfn(*ptep));
> > return ret;
> >  }
> >
> > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> > index 50c0dde1354f..6421405334b9 100644
> > --- a/mm/mmu_notifier.c
> > +++ b/mm/mmu_notifier.c
> > @@ -441,6 +441,23 @@ void __mmu_notifier_change_pte(struct
> mm_struct *mm, unsigned long address,
> > srcu_read_unlock(&srcu, id);
> >  }
> >
> > +void __mmu_notifier_update_mapping(struct mm_struct *mm, unsigned
> long address,
> > +  unsigned long pfn)
> > +{
> > +   struct mmu_notifier *subscription;
> > +   int id;
> > +
> > +   id = srcu_read_lock(&srcu);
> > +   hlist_for_each_entry_rcu(subscription,
> > +&mm->notifier_subscriptions->list, hlist,
> > +srcu_read_lock_held()) {
> > +   if (subscription->ops->update_mapping)
> > +   subscription->ops->update_mapping(subscription,
> mm,
> > + address, pfn);
> > +   }
> > +   srcu_read_unlock(&srcu, id);
> > +}
> > +
> >  static int mn_itree_invalidate(struct mmu_notifier_subscriptions
> *subscriptions,
> >const struct mmu_notifier_range *range)
> >  {
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 2f2e0e618072..e59eb5fafadb 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -77,6 +77,7 @@ static struct vfsmount *shm_mnt;
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >
> >  #include 
> > @@ -2164,8 +2165,12 @@ static vm_fault_t shmem_fault(struct vm_fault
> *vmf)
> >   gfp, vma, vmf, );
> > if (err)
> > return vmf_error(err);
> > -   if (folio)
> > +   if (folio) {
> > vmf->page = folio_file_page(folio, vmf->pgoff);
> > +   if (ret == VM_FAULT_LOCKED)
> > +   mmu_notifier_update_mapping(vma->vm_mm, vmf-
> >address,
> > +   page_to_pfn(vmf->page));
> > +   }
> > return ret;
> >  }



RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-07-19 Thread Kasireddy, Vivek
Hi Jason,

> 
> On Wed, Jul 19, 2023 at 12:05:29AM +, Kasireddy, Vivek wrote:
> 
> > > If there is no change to the PTEs then it is hard to see why this
> > > would be part of a mmu_notifier.
> > IIUC, the PTEs do get changed but only when a new page is faulted in.
> > For shmem, it looks like the PTEs are updated in handle_pte_fault()
> > after shmem_fault() gets called and for hugetlbfs, this seems to
> > happen in hugetlb_fault().
> 
> That sounds about right
> 
> > Instead of introducing a new notifier, I did think about reusing
> > (or overloading) .change_pte() but I did not fully understand the impact
> > it would have on KVM, the only user of .change_pte().
> 
> Yes, change_pte will be called, I think, but under various locks. 
AFAICT, change_pte does not seem to get called in my use-case either
during invalidate or when a new page is faulted in.

> Why would you need to change it?
What I meant to say is instead of adding a new notifier for mapping updates,
I initially considered just calling change_pte() when a new page is faulted in
but I was concerned that doing so might adversely impact existing users (of
change_pte) such as KVM.

> 
> What you are doing here doesn't make any sense within the design of
> mmu_notifiers, eg:
> 
> > @ -2164,8 +2165,12 @@ static vm_fault_t shmem_fault(struct vm_fault
> *vmf)
> >   gfp, vma, vmf, &ret);
> > if (err)
> > return vmf_error(err);
> > -   if (folio)
> > +   if (folio) {
> > vmf->page = folio_file_page(folio, vmf->pgoff);
> > +   if (ret == VM_FAULT_LOCKED)
> > +   mmu_notifier_update_mapping(vma->vm_mm, vmf-
> >address,
> > +   page_to_pfn(vmf->page));
> > +   }
> > return ret;
> 
> Hasn't even updated the PTEs yet, but it is invoking a callback??
I was counting on the fragile assumption that once we have a valid page,
the PTE would be eventually updated after shmem_fault(), which doesn't
make sense in retrospect. Instead, would something like below be OK?
@@ -5234,6 +5237,14 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, 
unsigned long address,

lru_gen_exit_fault();

+   if (vma_is_shmem(vma) || is_vm_hugetlb_page(vma)) {
+   if (!follow_pte(vma->vm_mm, address, &ptep, &ptl)) {
+   pfn = pte_pfn(ptep_get(ptep));
+   pte_unmap_unlock(ptep, ptl);
+   mmu_notifier_update_mapping(vma->vm_mm, address, pfn);
+   }
+   }


Thanks,
Vivek

> 
> Jason


RE: [PATCH v2 2/2] udmabuf: Add back support for mapping hugetlb pages (v2)

2023-07-18 Thread Kasireddy, Vivek
Hi Mike,

> 
> On 07/18/23 01:26, Vivek Kasireddy wrote:
> > A user or admin can configure a VMM (Qemu) Guest's memory to be
> > backed by hugetlb pages for various reasons. However, a Guest OS
> > would still allocate (and pin) buffers that are backed by regular
> > 4k sized pages. In order to map these buffers and create dma-bufs
> > for them on the Host, we first need to find the hugetlb pages where
> > the buffer allocations are located and then determine the offsets
> > of individual chunks (within those pages) and use this information
> > to eventually populate a scatterlist.
> >
> > Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500
> options
> > were passed to the Host kernel and Qemu was launched with these
> > relevant options: qemu-system-x86_64 -m 4096m
> > -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
> > -display gtk,gl=on
> > -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
> > -machine memory-backend=mem1
> >
> > Replacing -display gtk,gl=on with -display gtk,gl=off above would
> > exercise the mmap handler.
> >
> > v2: Updated get_sg_table() to manually populate the scatterlist for
> > both huge page and non-huge-page cases.
> >
> > Cc: David Hildenbrand 
> > Cc: Mike Kravetz 
> > Cc: Hugh Dickins 
> > Cc: Peter Xu 
> > Cc: Jason Gunthorpe 
> > Cc: Gerd Hoffmann 
> > Cc: Dongwon Kim 
> > Cc: Junxiao Chang 
> > Signed-off-by: Vivek Kasireddy 
> > ---
> >  drivers/dma-buf/udmabuf.c | 84 +--
> 
> >  1 file changed, 71 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> > index 820c993c8659..10c47bf77fb5 100644
> > --- a/drivers/dma-buf/udmabuf.c
> > +++ b/drivers/dma-buf/udmabuf.c
> > @@ -10,6 +10,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -28,6 +29,7 @@ struct udmabuf {
> > struct page **pages;
> > struct sg_table *sg;
> > struct miscdevice *device;
> > +   pgoff_t *offsets;
> >  };
> >
> >  static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
> > @@ -41,6 +43,10 @@ static vm_fault_t udmabuf_vm_fault(struct vm_fault
> *vmf)
> > return VM_FAULT_SIGBUS;
> >
> > pfn = page_to_pfn(ubuf->pages[pgoff]);
> > +   if (ubuf->offsets) {
> > +   pfn += ubuf->offsets[pgoff] >> PAGE_SHIFT;
> > +   }
> > +
> > return vmf_insert_pfn(vma, vmf->address, pfn);
> >  }
> >
> > @@ -90,23 +96,31 @@ static struct sg_table *get_sg_table(struct device
> *dev, struct dma_buf *buf,
> >  {
> > struct udmabuf *ubuf = buf->priv;
> > struct sg_table *sg;
> > +   struct scatterlist *sgl;
> > +   pgoff_t offset;
> > +   unsigned long i = 0;
> > int ret;
> >
> > sg = kzalloc(sizeof(*sg), GFP_KERNEL);
> > if (!sg)
> > return ERR_PTR(-ENOMEM);
> > -   ret = sg_alloc_table_from_pages(sg, ubuf->pages, ubuf->pagecount,
> > -   0, ubuf->pagecount << PAGE_SHIFT,
> > -   GFP_KERNEL);
> > +
> > +   ret = sg_alloc_table(sg, ubuf->pagecount, GFP_KERNEL);
> > if (ret < 0)
> > -   goto err;
> > +   goto err_alloc;
> > +
> > +   for_each_sg(sg->sgl, sgl, ubuf->pagecount, i) {
> > +   offset = ubuf->offsets ? ubuf->offsets[i] : 0;
> > +   sg_set_page(sgl, ubuf->pages[i], PAGE_SIZE, offset);
> > +   }
> > ret = dma_map_sgtable(dev, sg, direction, 0);
> > if (ret < 0)
> > -   goto err;
> > +   goto err_map;
> > return sg;
> >
> > -err:
> > +err_map:
> > sg_free_table(sg);
> > +err_alloc:
> > kfree(sg);
> > return ERR_PTR(ret);
> >  }
> > @@ -143,6 +157,7 @@ static void release_udmabuf(struct dma_buf *buf)
> >
> > for (pg = 0; pg < ubuf->pagecount; pg++)
> > put_page(ubuf->pages[pg]);
> > +   kfree(ubuf->offsets);
> > kfree(ubuf->pages);
> > kfree(ubuf);
> >  }
> > @@ -206,7 +221,9 @@ static long udmabuf_create(struct miscdevice
> *device,
> > struct udmabuf *ubuf;
> > struct dma_buf *buf;
> > pgoff_t pgoff, pgcnt, pgidx, pgbuf = 0, pglimit;
> > -   struct page *page;
> > +   struct page *page, *hpage = NULL;
> > +   pgoff_t hpoff, chunkoff, maxchunks;
> > +   struct hstate *hpstate;
> > int seals, ret = -EINVAL;
> > u32 i, flags;
> >
> > @@ -242,7 +259,7 @@ static long udmabuf_create(struct miscdevice
> *device,
> > if (!memfd)
> > goto err;
> > mapping = memfd->f_mapping;
> > -   if (!shmem_mapping(mapping))
> > +   if (!shmem_mapping(mapping) &&
> !is_file_hugepages(memfd))
> > goto err;
> > seals = memfd_fcntl(memfd, F_GET_SEALS, 0);
> > if (seals == -EINVAL)
> > @@ -253,16 +270,56 @@ static long udmabuf_create(struct miscdevice
> *device,
> > goto err;
> > pgoff = list[i].offset >> PAGE_SHIFT;
> > pgcnt = list[i].size   >> 

RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-07-18 Thread Kasireddy, Vivek
Hi Jason,

> 
> On Tue, Jul 18, 2023 at 01:28:56AM -0700, Vivek Kasireddy wrote:
> > Currently, there does not appear to be any mechanism for letting
> > drivers or other kernel entities know about updates made in a
> > mapping particularly when a new page is faulted in. Providing
> > notifications for such situations is really useful when using
> > memfds backed by ram-based filesystems such as shmem or hugetlbfs
> > that also allow FALLOC_FL_PUNCH_HOLE.
> 
> Huh? You get an invalidate when this happens and the address becomes
> non-present.
Yes, we do get an invalidate (range) but it is not going to help given my
use-case. This is because the invalidate only indicates that the old pages
are gone (and not about the availability of new pages). IIUC, after a hole
gets punched, it appears the new pages are faulted in only when there
are writes made to that region where the hole was punched. So, I think
what would really help is to get notified when a new page becomes part
of the mapping at a given offset. 
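For illustration, this is roughly how a driver would consume such a notification;
the .update_mapping op is the new hook proposed in this RFC (not an upstream
mmu_notifier op), the registration call is the existing API, and the lookup
helper is hypothetical:

static void importer_update_mapping(struct mmu_notifier *mn, struct mm_struct *mm,
                                    unsigned long address, unsigned long pfn)
{
        /* Hypothetical helper: find the exported buffer entry that covers
         * 'address' and point it at the newly faulted-in page (pfn), so the
         * dma-buf no longer references the page that was punched out. */
        importer_refresh_entry(mn, address, pfn);
}

static const struct mmu_notifier_ops importer_mn_ops = {
        .update_mapping = importer_update_mapping,      /* proposed in this RFC */
};

static int importer_register_notifier(struct mmu_notifier *mn, struct mm_struct *mm)
{
        mn->ops = &importer_mn_ops;
        return mmu_notifier_register(mn, mm);           /* existing API */
}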

> 
> > More specifically, when a hole is punched in a memfd (that is
> > backed by shmem or hugetlbfs), a driver can register for
> > notifications associated with range invalidations. However, it
> > would also be useful to have notifications when new pages are
> > faulted in as a result of writes made to the mapping region that
> > overlaps with a previously punched hole.
> 
> If there is no change to the PTEs then it is hard to see why this
> would be part of a mmu_notifier.
IIUC, the PTEs do get changed but only when a new page is faulted in.
For shmem, it looks like the PTEs are updated in handle_pte_fault()
after shmem_fault() gets called and for hugetlbfs, this seems to
happen in hugetlb_fault().

Instead of introducing a new notifier, I did think about reusing
(or overloading) .change_pte() but I did not fully understand the impact
it would have on KVM, the only user of .change_pte(). 

Thanks,
Vivek

> 
> Jason


RE: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages

2023-06-28 Thread Kasireddy, Vivek
Hi David,

> 
> On 27.06.23 08:37, Kasireddy, Vivek wrote:
> > Hi David,
> >
> 
> Hi!
> 
> sorry for taking a bit longer to reply lately.
No problem.

> 
> [...]
> 
> >>> Sounds right, maybe it needs to go back to the old GUP solution, though,
> as
> >>> mmu notifiers are also mm-based not fd-based. Or to be explicit, I think
> >>> it'll be pin_user_pages(FOLL_LONGTERM) with the new API.  It'll also
> solve
> >>> the movable pages issue on pinning.
> >>
> >> It better should be pin_user_pages(FOLL_LONGTERM). But I'm afraid we
> >> cannot achieve that without breaking the existing kernel interface ...
> > Yeah, as you suggest, we unfortunately cannot go back to using GUP
> > without breaking udmabuf_create UAPI that expects memfds and file
> > offsets.
> >
> >>
> >> So we might have to implement the same page migration as gup does on
> >> FOLL_LONGTERM here ... maybe there are more such cases/drivers that
> >> actually require that handling when simply taking pages out of the
> >> memfd, believing they can hold on to them forever.
> > IIUC, I don't think just handling the page migration in udmabuf is going to
> > cut it. It might require active cooperation of the Guest GPU driver as well
> > if this is even feasible.
> 
> The idea is, that once you extract the page from the memfd and it
> resides somewhere bad (MIGRATE_CMA, ZONE_MOVABLE), you trigger page
> migration. Essentially what migrate_longterm_unpinnable_pages() does:
So, IIUC, it looks like calling check_and_migrate_movable_pages() at the time
of creation (udmabuf_create) and when we get notified about something like
FALLOC_FL_PUNCH_HOLE will be all that needs to be done in udmabuf?

> 
> Why would the guest driver have to be involved? It shouldn't care about
> page migration in the hypervisor.
Yeah, it appears that the page migration would be transparent to the Guest
driver.

> 
> [...]
> 
> >> balloon, and then using that memory for communicating with the device]
> >>
> >> Maybe it's all fine with udmabuf because of the way it is setup/torn
> >> down by the guest driver. Unfortunately I can't tell.
> > Here are the functions used by virtio-gpu (Guest GPU driver) to allocate
> > pages for its resources:
> > __drm_gem_shmem_create:
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_gem_sh
> mem_helper.c#L97
> > Interestingly, the comment in the above function says that the pages
> > should not be allocated from the MOVABLE zone.
> 
> It doesn't add GFP_MOVABLE, so pages don't end up in
> ZONE_MOVABLE/MIGRATE_CMA *in the guest*. But we care about the
> ZONE_MOVABLE /MIGRATE_CMA *in the host*. (what the guest does is
> right,
> though)
> 
> IOW, what udmabuf does with guest memory on the hypervisor side, not the
> guest driver on the guest side.
Ok, got it.

> 
> > The pages along with their dma addresses are then extracted and shared
> > with Qemu using these two functions:
> > drm_gem_get_pages:
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_gem.c#
> L534
> > virtio_gpu_object_shmem_init:
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/virtio/virtgpu
> _object.c#L135
> 
> ^ so these two target the guest driver as well, right? IOW, there is a
> memfd (shmem) in the guest that the guest driver uses to allocate pages
> from and there is the memfd in the hypervisor to back guest RAM.
> 
> The latter gets registered with udmabuf.
Yes, that's exactly what happens.

> 
> > Qemu then translates the dma addresses into file offsets and creates
> > udmabufs -- as an optimization to avoid data copies only if blob is set
> > to true.
> 
> If the guest OS doesn't end up freeing/reallocating that memory while
> it's registered with udmabuf in the hypervisor, then we should be fine.
IIUC, udmabuf does get notified when something like that happens.

Thanks,
Vivek

> 
> Because that way, the guest won't end up trigger MADV_DONTNEED by
> "accident".
> 
> --
> Cheers,
> 
> David / dhildenb



RE: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages

2023-06-27 Thread Kasireddy, Vivek
Hi David,

> On 26.06.23 19:52, Peter Xu wrote:
> > On Mon, Jun 26, 2023 at 07:45:37AM +, Kasireddy, Vivek wrote:
> >> Hi Peter,
> >>
> >>>
> >>> On Fri, Jun 23, 2023 at 06:13:02AM +, Kasireddy, Vivek wrote:
> >>>> Hi David,
> >>>>
> >>>>>> The first patch ensures that the mappings needed for handling mmap
> >>>>>> operation would be managed by using the pfn instead of struct page.
> >>>>>> The second patch restores support for mapping hugetlb pages where
> >>>>>> subpages of a hugepage are not directly used anymore (main reason
> >>>>>> for revert) and instead the hugetlb pages and the relevant offsets
> >>>>>> are used to populate the scatterlist for dma-buf export and for
> >>>>>> mmap operation.
> >>>>>>
> >>>>>> Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500
> >>>>> options
> >>>>>> were passed to the Host kernel and Qemu was launched with these
> >>>>>> relevant options: qemu-system-x86_64 -m 4096m
> >>>>>> -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
> >>>>>> -display gtk,gl=on
> >>>>>> -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
> >>>>>> -machine memory-backend=mem1
> >>>>>>
> >>>>>> Replacing -display gtk,gl=on with -display gtk,gl=off above would
> >>>>>> exercise the mmap handler.
> >>>>>>
> >>>>>
> >>>>> While I think the VM_PFNMAP approach is much better and should fix
> >>> that
> >>>>> issue at hand, I thought more about missing memlock support and
> >>> realized
> >>>>> that we might have to fix something else. SO I'm going to raise the
> >>>>> issue here.
> >>>>>
> >>>>> I think udmabuf chose the wrong interface to do what it's doing, that
> >>>>> makes it harder to fix it eventually.
> >>>>>
> >>>>> Instead of accepting a range in a memfd, it should just have accepted a
> >>>>> user space address range and then used
> >>>>> pin_user_pages(FOLL_WRITE|FOLL_LONGTERM) to longterm-pin the
> >>> pages
> >>>>> "officially".
> >>>> Udmabuf indeed started off by using user space address range and GUP
> >>> but
> >>>> the dma-buf subsystem maintainer had concerns with that approach in
> v2.
> >>>> It also had support for mlock in that version. Here is v2 and the 
> >>>> relevant
> >>>> conversation:
> >>>>
> https://patchwork.freedesktop.org/patch/210992/?series=39879=2
> >>>>
> >>>>>
> >>>>> So what's the issue? Udma effectively pins pages longterm ("possibly
> >>>>> forever") simply by grabbing a reference on them. These pages might
> >>>>> easily reside in ZONE_MOVABLE or in MIGRATE_CMA pageblocks.
> >>>>>
> >>>>> So what udmabuf does is break memory hotunplug and CMA, because
> it
> >>>>> turns
> >>>>> pages that have to remain movable unmovable.
> >>>>>
> >>>>> In the pin_user_pages(FOLL_LONGTERM) case we make sure to
> migrate
> >>>>> these
> >>>>> pages. See mm/gup.c:check_and_migrate_movable_pages() and
> >>> especially
> >>>>> folio_is_longterm_pinnable(). We'd probably have to implement
> >>> something
> >>>>> similar for udmabuf, where we detect such unpinnable pages and
> >>> migrate
> >>>>> them.
> >>>> The pages udmabuf pins are only those associated with Guest (GPU
> >>> driver/virtio-gpu)
> >>>> resources (or buffers allocated and pinned from shmem via drm GEM).
> >>> Some
> >>>> resources are short-lived, and some are long-lived and whenever a
> >>> resource
> >>>> gets destroyed, the pages are unpinned. And, not all resources have
> their
> >>> pages
> >>>> pinned. The resource that is pinned for the longest duration is the FB
> and
> >>> that's
> >>>> because it is updated every ~16ms (assuming 1920x1080@60) by the
> Guest
> >>>> GPU driver. We c

RE: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages

2023-06-26 Thread Kasireddy, Vivek
Hi Peter,

> 
> On Fri, Jun 23, 2023 at 06:13:02AM +, Kasireddy, Vivek wrote:
> > Hi David,
> >
> > > > The first patch ensures that the mappings needed for handling mmap
> > > > operation would be managed by using the pfn instead of struct page.
> > > > The second patch restores support for mapping hugetlb pages where
> > > > subpages of a hugepage are not directly used anymore (main reason
> > > > for revert) and instead the hugetlb pages and the relevant offsets
> > > > are used to populate the scatterlist for dma-buf export and for
> > > > mmap operation.
> > > >
> > > > Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500
> > > options
> > > > were passed to the Host kernel and Qemu was launched with these
> > > > relevant options: qemu-system-x86_64 -m 4096m
> > > > -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
> > > > -display gtk,gl=on
> > > > -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
> > > > -machine memory-backend=mem1
> > > >
> > > > Replacing -display gtk,gl=on with -display gtk,gl=off above would
> > > > exercise the mmap handler.
> > > >
> > >
> > > While I think the VM_PFNMAP approach is much better and should fix
> that
> > > issue at hand, I thought more about missing memlock support and
> realized
> > > that we might have to fix something else. SO I'm going to raise the
> > > issue here.
> > >
> > > I think udmabuf chose the wrong interface to do what it's doing, that
> > > makes it harder to fix it eventually.
> > >
> > > Instead of accepting a range in a memfd, it should just have accepted a
> > > user space address range and then used
> > > pin_user_pages(FOLL_WRITE|FOLL_LONGTERM) to longterm-pin the
> pages
> > > "officially".
> > Udmabuf indeed started off by using user space address range and GUP
> but
> > the dma-buf subsystem maintainer had concerns with that approach in v2.
> > It also had support for mlock in that version. Here is v2 and the relevant
> > conversation:
> > https://patchwork.freedesktop.org/patch/210992/?series=39879=2
> >
> > >
> > > So what's the issue? Udma effectively pins pages longterm ("possibly
> > > forever") simply by grabbing a reference on them. These pages might
> > > easily reside in ZONE_MOVABLE or in MIGRATE_CMA pageblocks.
> > >
> > > So what udmabuf does is break memory hotunplug and CMA, because it
> > > turns
> > > pages that have to remain movable unmovable.
> > >
> > > In the pin_user_pages(FOLL_LONGTERM) case we make sure to migrate
> > > these
> > > pages. See mm/gup.c:check_and_migrate_movable_pages() and
> especially
> > > folio_is_longterm_pinnable(). We'd probably have to implement
> something
> > > similar for udmabuf, where we detect such unpinnable pages and
> migrate
> > > them.
> > The pages udmabuf pins are only those associated with Guest (GPU
> driver/virtio-gpu)
> > resources (or buffers allocated and pinned from shmem via drm GEM).
> Some
> > resources are short-lived, and some are long-lived and whenever a
> resource
> > gets destroyed, the pages are unpinned. And, not all resources have their
> pages
> > pinned. The resource that is pinned for the longest duration is the FB and
> that's
> > because it is updated every ~16ms (assuming 1920x1080@60) by the Guest
> > GPU driver. We can certainly pin/unpin the FB after it is accessed on the
> Host
> > as a workaround, but I guess that may not be very efficient given the
> amount
> > of churn it would create.
> >
> > Also, as far as migration or S3/S4 is concerned, my understanding is that 
> > all
> > the Guest resources are destroyed and recreated again. So, wouldn't
> something
> > similar happen during memory hotunplug?
> >
> > >
> > >
> > > For example, pairing udmabuf with vfio (which pins pages using
> > > pin_user_pages(FOLL_LONGTERM)) in QEMU will most probably not work
> in
> > > all cases: if udmabuf longterm pinned the pages "the wrong way", vfio
> > > will fail to migrate them during FOLL_LONGTERM and consequently fail
> > > pin_user_pages(). As long as udmabuf holds a reference on these pages,
> > > that will never succeed.
> > Dma-buf rules (for exporters) indicate that the pages only need to be
> pinned
> > during the map_att

RE: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages

2023-06-23 Thread Kasireddy, Vivek
Hi David,

> > The first patch ensures that the mappings needed for handling mmap
> > operation would be managed by using the pfn instead of struct page.
> > The second patch restores support for mapping hugetlb pages where
> > subpages of a hugepage are not directly used anymore (main reason
> > for revert) and instead the hugetlb pages and the relevant offsets
> > are used to populate the scatterlist for dma-buf export and for
> > mmap operation.
> >
> > Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500
> options
> > were passed to the Host kernel and Qemu was launched with these
> > relevant options: qemu-system-x86_64 -m 4096m
> > -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
> > -display gtk,gl=on
> > -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
> > -machine memory-backend=mem1
> >
> > Replacing -display gtk,gl=on with -display gtk,gl=off above would
> > exercise the mmap handler.
> >
> 
> While I think the VM_PFNMAP approach is much better and should fix that
> issue at hand, I thought more about missing memlock support and realized
> that we might have to fix something else. SO I'm going to raise the
> issue here.
> 
> I think udmabuf chose the wrong interface to do what it's doing, that
> makes it harder to fix it eventually.
> 
> Instead of accepting a range in a memfd, it should just have accepted a
> user space address range and then used
> pin_user_pages(FOLL_WRITE|FOLL_LONGTERM) to longterm-pin the pages
> "officially".
Udmabuf indeed started off by using user space address range and GUP but
the dma-buf subsystem maintainer had concerns with that approach in v2.
It also had support for mlock in that version. Here is v2 and the relevant
conversation:
https://patchwork.freedesktop.org/patch/210992/?series=39879=2

> 
> So what's the issue? Udma effectively pins pages longterm ("possibly
> forever") simply by grabbing a reference on them. These pages might
> easily reside in ZONE_MOVABLE or in MIGRATE_CMA pageblocks.
> 
> So what udmabuf does is break memory hotunplug and CMA, because it
> turns
> pages that have to remain movable unmovable.
> 
> In the pin_user_pages(FOLL_LONGTERM) case we make sure to migrate
> these
> pages. See mm/gup.c:check_and_migrate_movable_pages() and especially
> folio_is_longterm_pinnable(). We'd probably have to implement something
> similar for udmabuf, where we detect such unpinnable pages and migrate
> them.
The pages udmabuf pins are only those associated with Guest (GPU 
driver/virtio-gpu)
resources (or buffers allocated and pinned from shmem via drm GEM). Some
resources are short-lived, and some are long-lived and whenever a resource
gets destroyed, the pages are unpinned. And, not all resources have their pages
pinned. The resource that is pinned for the longest duration is the FB and 
that's
because it is updated every ~16ms (assuming 1920x1080@60) by the Guest
GPU driver. We can certainly pin/unpin the FB after it is accessed on the Host
as a workaround, but I guess that may not be very efficient given the amount
of churn it would create.

Also, as far as migration or S3/S4 is concerned, my understanding is that all
the Guest resources are destroyed and recreated again. So, wouldn't something
similar happen during memory hotunplug?

> 
> 
> For example, pairing udmabuf with vfio (which pins pages using
> pin_user_pages(FOLL_LONGTERM)) in QEMU will most probably not work in
> all cases: if udmabuf longterm pinned the pages "the wrong way", vfio
> will fail to migrate them during FOLL_LONGTERM and consequently fail
> pin_user_pages(). As long as udmabuf holds a reference on these pages,
> that will never succeed.
Dma-buf rules (for exporters) indicate that the pages only need to be pinned
during the map_attachment phase (and until unmap attachment happens).
In other words, only when the sg_table is created by udmabuf. I guess one
option would be to not hold any references during UDMABUF_CREATE and
only grab references to the pages (as and when it gets used) during this step.
Would this help?
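To illustrate that rule, a minimal exporter sketch that only holds page
references while an attachment mapping exists; the my_* helpers are
placeholders and error handling is trimmed, so this is not udmabuf's actual
code:

static struct sg_table *my_map_dma_buf(struct dma_buf_attachment *attach,
                                       enum dma_data_direction dir)
{
        struct my_buffer *buf = attach->dmabuf->priv;
        struct sg_table *sgt;

        my_grab_pages(buf);                      /* take page refs only now */
        sgt = my_build_sg_table(buf);            /* placeholder: fill the scatterlist */
        if (dma_map_sgtable(attach->dev, sgt, dir, 0)) {
                my_put_pages(buf);
                return ERR_PTR(-ENOMEM);
        }
        return sgt;
}

static void my_unmap_dma_buf(struct dma_buf_attachment *attach,
                             struct sg_table *sgt, enum dma_data_direction dir)
{
        dma_unmap_sgtable(attach->dev, sgt, dir, 0);
        sg_free_table(sgt);
        kfree(sgt);
        my_put_pages(attach->dmabuf->priv);      /* drop the refs again */
}

static const struct dma_buf_ops my_dmabuf_ops = {
        .map_dma_buf   = my_map_dma_buf,
        .unmap_dma_buf = my_unmap_dma_buf,
};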

> 
> 
> There are *probably* more issues on the QEMU side when udmabuf is
> paired
> with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for
> virtio-balloon, virtio-mem, postcopy live migration, ... for example, in
> the vfio/vdpa case we make sure that we disallow most of these, because
> otherwise there can be an accidental "disconnect" between the pages
> mapped into the VM (guest view) and the pages mapped into the IOMMU
> (device view), for example, after a reboot.
Ok; I am not sure if I can figure out if there is any acceptable way to address
these issues but given the current constraints associated with udmabuf, what
do you suggest is the most reasonable way to deal with these problems you
have identified?

Thanks,
Vivek

> 
> --
> Cheers,
> 
> David / dhildenb



RE: [PATCH] mm: fix hugetlb page unmap count balance issue

2023-06-20 Thread Kasireddy, Vivek
Hi Gerd,

> 
> On Mon, May 15, 2023 at 10:04:42AM -0700, Mike Kravetz wrote:
> > On 05/12/23 16:29, Mike Kravetz wrote:
> > > On 05/12/23 14:26, James Houghton wrote:
> > > > On Fri, May 12, 2023 at 12:20 AM Junxiao Chang
>  wrote:
> > > >
> > > > This alone doesn't fix mapcounting for PTE-mapped HugeTLB pages.
> You
> > > > need something like [1]. I can resend it if that's what we should be
> > > > doing, but this mapcounting scheme doesn't work when the page
> structs
> > > > have been freed.
> > > >
> > > > It seems like it was a mistake to include support for hugetlb memfds in
> udmabuf.
> > >
> > > IIUC, it was added with commit 16c243e99d33 udmabuf: Add support for
> mapping
> > > hugepages (v4).  Looks like it was never sent to linux-mm?  That is
> unfortunate
> > > as hugetlb vmemmap freeing went in at about the same time.  And, as
> you have
> > > noted udmabuf will not work if hugetlb vmemmap freeing is enabled.
> > >
> > > Sigh!
> > >
> > > Trying to think of a way forward.
> > > --
> > > Mike Kravetz
> > >
> > > >
> > > > [1]: https://lore.kernel.org/linux-mm/20230306230004.1387007-2-
> jthough...@google.com/
> > > >
> > > > - James
> >
> > Adding people and list on Cc: involved with commit 16c243e99d33.
> >
> > There are several issues with trying to map tail pages of hugetllb pages
> > not taken into account with udmabuf.  James spent quite a bit of time
> trying
> > to understand and address all the issues with the HGM code.  While using
> > the scheme proposed by James, may be an approach to the mapcount
> issue there
> > are also other issues that need attention.  For example, I do not see how
> > the fault code checks the state of the hugetlb page (such as poison) as none
> > of that state is carried in tail pages.
> >
> > The more I think about it, the more I think udmabuf should treat hugetlb
> > pages as hugetlb pages.  They should be mapped at the appropriate level
> > in the page table.  Of course, this would impose new restrictions on the
> > API (mmap and ioctl) that may break existing users.  I have no idea how
> > extensively udmabuf is being used with hugetlb mappings.
> 
> User of this is qemu.  It can use the udmabuf driver to create host
> dma-bufs for guest resources (virtio-gpu buffers), to avoid copying
> data when showing the guest display in a host window.
> 
> hugetlb support is needed in case qemu guest memory is backed by
> hugetlbfs.  That does not imply the virtio-gpu buffers are hugepage
> aligned though, udmabuf would still need to operate on smaller chunks
> of memory.  So with additional restrictions this will not work any
> more for qemu.  I'd suggest to just revert hugetlb support instead
> and go back to the drawing board.
> 
> Also not sure why hugetlbfs is used for guest memory in the first place.
> It used to be a thing years ago, but with the arrival of transparent
> hugepages there is as far I know little reason to still use hugetlbfs.
The main reason why we are interested in using hugetlbfs for guest memory
is because we observed non-trivial performance improvement while running
certain 3D heavy workloads in the guest. And, we noticed this by only
switching the Guest memory backend to include hugepages (i.e, hugetlb=on)
and with no other changes.

To address the current situation, I am readying a patch for udmabuf driver that
would add back support for mapping hugepages but without making use of
the subpages directly.

Thanks,
Vivek

> 
> Vivek? Dongwon?
> 
> take care,
>   Gerd



RE: [PATCH] udmabuf: revert 'Add support for mapping hugepages (v4)'

2023-06-14 Thread Kasireddy, Vivek
Hi David,

> 
> On 13.06.23 10:26, Kasireddy, Vivek wrote:
> > Hi David,
> >
> >>
> >> On 12.06.23 09:10, Kasireddy, Vivek wrote:
> >>> Hi Mike,
> >>
> >> Hi Vivek,
> >>
> >>>
> >>> Sorry for the late reply; I just got back from vacation.
> >>> If it is unsafe to directly use the subpages of a hugetlb page, then
> reverting
> >>> this patch seems like the only option for addressing this issue
> immediately.
> >>> So, this patch is
> >>> Acked-by: Vivek Kasireddy 
> >>>
> >>> As far as the use-case is concerned, there are two main users of the
> >> udmabuf
> >>> driver: Qemu and CrosVM VMMs. However, it appears Qemu is the only
> >> one
> >>> that uses hugetlb pages (when hugetlb=on is set) as the backing store for
> >>> Guest (Linux, Android and Windows) system memory. The main goal is
> to
> >>> share the pages associated with the Guest allocated framebuffer (FB)
> with
> >>> the Host GPU driver and other components in a zero-copy way. To that
> >> end,
> >>> the guest GPU driver (virtio-gpu) allocates 4k size pages (associated with
> >>> the FB) and pins them before sharing the (guest) physical (or dma)
> >> addresses
> >>> (and lengths) with Qemu. Qemu then translates the addresses into file
> >>> offsets and shares these offsets with udmabuf.
> >>
> >> Is my understanding correct, that we can effectively long-term pin
> >> (worse than mlock) 64 MiB per UDMABUF_CREATE, allowing eventually
> !root
> > The 64 MiB limit is the theoretical upper bound that we have not seen hit in
> > practice. Typically, for a 1920x1080 resolution (commonly used in Guests),
> > the size of the FB is ~8 MB (1920x1080x4). And, most modern Graphics
> > compositors flip between two FBs.
> >
> 
> Okay, but users with privileges to open that file can just create as
> many as they want? I think I'll have to play with it.
Yeah, unfortunately, we are not restricting the total number of FBs or other
buffers that are mapped by udmabuf per user. We'll definitely try to add a
patch to align it with mlock limits.
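A rough sketch of the kind of accounting such a patch might add;
account_locked_vm() is the existing helper (used by e.g. vfio) that charges
pages against RLIMIT_MEMLOCK, and this is only an illustration, not the
actual fix:

/* Called from udmabuf_create() before taking page references; 'pgcnt' is
 * the total number of PAGE_SIZE pages being pinned for this buffer.
 * A real patch would remember the creating mm instead of relying on
 * current->mm again at release time. */
static int udmabuf_account_pages(unsigned long pgcnt)
{
        return account_locked_vm(current->mm, pgcnt, true);   /* -ENOMEM if over limit */
}

static void udmabuf_unaccount_pages(unsigned long pgcnt)
{
        account_locked_vm(current->mm, pgcnt, false);
}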

> 
> >> users
> >>
> >> ll /dev/udmabuf
> >> crw-rw 1 root kvm 10, 125 12. Jun 08:12 /dev/udmabuf
> >>
> >> to bypass there effective MEMLOCK limit, fragmenting physical memory
> and
> >> breaking swap?
> > Right, it does not look like the mlock limits are honored.
> >
> 
> That should be added.
> 
> >>
> >> Regarding the udmabuf_vm_fault(), I assume we're mapping pages we
> >> obtained from the memfd ourselves into a special VMA (mmap() of the
> > mmap operation is really needed only if any component on the Host needs
> > CPU access to the buffer. But in most scenarios, we try to ensure direct GPU
> > access (h/w acceleration via gl) to these pages.
> >
> >> udmabuf). I'm not sure how well shmem pages are prepared for getting
> >> mapped by someone else into an arbitrary VMA (page->index?).
> > Most drm/gpu drivers use shmem pages as the backing store for FBs and
> > other buffers and also provide mmap capability. What concerns do you see
> > with this approach?
> 
> Are these mmaping the pages the way udmabuf maps these pages (IOW,
> on-demand fault where we core-mm will adjust the mapcount etc)?
> 
> Skimming over at shmem_read_mapping_page() users, I assume most of
> them
> use a VM_PFNMAP mapping (or don't mmap them at all), where we won't be
> messing with the struct page at all.
> 
> (That might even allow you to mmap hugetlb sub-pages, because the struct
> page -- and mapcount -- will be ignored completely and not touched.)
Oh, are you suggesting that if we do vma->vm_flags |= VM_PFNMAP
in the mmap handler (mmap_udmabuf) and also do
vmf_insert_pfn(vma, vmf->address, page_to_pfn(page))
instead of
vmf->page = ubuf->pages[pgoff];
get_page(vmf->page);

in the vma fault handler (udmabuf_vm_fault), we can avoid most of the
pitfalls you have identified -- including with the usage of hugetlb subpages? 
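Putting the above suggestion into code, a minimal sketch of what the udmabuf
mmap/fault path would look like with VM_PFNMAP and vmf_insert_pfn() (sketch
only, not the current driver code):

static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        struct udmabuf *ubuf = vma->vm_private_data;

        if (vmf->pgoff >= ubuf->pagecount)
                return VM_FAULT_SIGBUS;

        /* Insert the raw pfn; with VM_PFNMAP the core mm never looks at the
         * struct page, so mapcount/refcount of hugetlb subpages is untouched. */
        return vmf_insert_pfn(vma, vmf->address,
                              page_to_pfn(ubuf->pages[vmf->pgoff]));
}

static const struct vm_operations_struct udmabuf_vm_ops = {
        .fault = udmabuf_vm_fault,
};

static int mmap_udmabuf(struct dma_buf *buf, struct vm_area_struct *vma)
{
        struct udmabuf *ubuf = buf->priv;

        vma->vm_ops = &udmabuf_vm_ops;
        vma->vm_private_data = ubuf;
        vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
        return 0;
}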

> 
> >
> >>
> >> ... also, just imagine someone doing FALLOC_FL_PUNCH_HOLE /
> ftruncate()
> >> on the memfd. What's mapped into the memfd no longer corresponds to
> >> what's pinned / mapped into the VMA.
> > IIUC, making use of the DMA_BUF_IOCTL_SYNC ioctl would help with any
> > coherency issues:
> > https://www.kernel.org/doc/html/v6.2/driver-api/dma-
> buf.html#c.dma_buf_sync
> >
> 
> Would it as of now? udmabuf_create() pul

RE: [PATCH] udmabuf: revert 'Add support for mapping hugepages (v4)'

2023-06-13 Thread Kasireddy, Vivek
Hi David,

> 
> On 12.06.23 09:10, Kasireddy, Vivek wrote:
> > Hi Mike,
> 
> Hi Vivek,
> 
> >
> > Sorry for the late reply; I just got back from vacation.
> > If it is unsafe to directly use the subpages of a hugetlb page, then 
> > reverting
> > this patch seems like the only option for addressing this issue immediately.
> > So, this patch is
> > Acked-by: Vivek Kasireddy 
> >
> > As far as the use-case is concerned, there are two main users of the
> udmabuf
> > driver: Qemu and CrosVM VMMs. However, it appears Qemu is the only
> one
> > that uses hugetlb pages (when hugetlb=on is set) as the backing store for
> > Guest (Linux, Android and Windows) system memory. The main goal is to
> > share the pages associated with the Guest allocated framebuffer (FB) with
> > the Host GPU driver and other components in a zero-copy way. To that
> end,
> > the guest GPU driver (virtio-gpu) allocates 4k size pages (associated with
> > the FB) and pins them before sharing the (guest) physical (or dma)
> addresses
> > (and lengths) with Qemu. Qemu then translates the addresses into file
> > offsets and shares these offsets with udmabuf.
> 
> Is my understanding correct, that we can effectively long-term pin
> (worse than mlock) 64 MiB per UDMABUF_CREATE, allowing eventually !root
The 64 MiB limit is the theoretical upper bound that we have not seen hit in 
practice. Typically, for a 1920x1080 resolution (commonly used in Guests),
the size of the FB is ~8 MB (1920x1080x4). And, most modern Graphics
compositors flip between two FBs.

> users
> 
> ll /dev/udmabuf
> crw-rw 1 root kvm 10, 125 12. Jun 08:12 /dev/udmabuf
> 
> to bypass there effective MEMLOCK limit, fragmenting physical memory and
> breaking swap?
Right, it does not look like the mlock limits are honored.

> 
> Regarding the udmabuf_vm_fault(), I assume we're mapping pages we
> obtained from the memfd ourselves into a special VMA (mmap() of the
mmap operation is really needed only if any component on the Host needs
CPU access to the buffer. But in most scenarios, we try to ensure direct GPU
access (h/w acceleration via gl) to these pages.

> udmabuf). I'm not sure how well shmem pages are prepared for getting
> mapped by someone else into an arbitrary VMA (page->index?).
Most drm/gpu drivers use shmem pages as the backing store for FBs and
other buffers and also provide mmap capability. What concerns do you see
with this approach?

> 
> ... also, just imagine someone doing FALLOC_FL_PUNCH_HOLE / ftruncate()
> on the memfd. What's mapped into the memfd no longer corresponds to
> what's pinned / mapped into the VMA.
IIUC, making use of the DMA_BUF_IOCTL_SYNC ioctl would help with any
coherency issues:
https://www.kernel.org/doc/html/v6.2/driver-api/dma-buf.html#c.dma_buf_sync
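For completeness, this is how a CPU user of the exported dma-buf would
bracket its accesses with that ioctl (a minimal userspace sketch):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/* dmabuf_fd is the fd exported by udmabuf; ptr is its mmap()ed view. */
static void cpu_fill_buffer(int dmabuf_fd, void *ptr, size_t len)
{
        struct dma_buf_sync sync = {
                .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE,
        };

        ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);    /* begin CPU access */
        memset(ptr, 0, len);                            /* CPU writes */
        sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
        ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);    /* end CPU access */
}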

> 
> 
> Was linux-mm (and especially shmem maintainers, ccing Hugh) involved in
> the upstreaming of udmabuf?
It does not appear so from the link below although other key lists were cc'd:
https://patchwork.freedesktop.org/patch/246100/?series=39879=7

Thanks,
Vivek
> 
> --
> Cheers,
> 
> David / dhildenb



RE: [PATCH] udmabuf: revert 'Add support for mapping hugepages (v4)'

2023-06-12 Thread Kasireddy, Vivek
Hi Mike,

Sorry for the late reply; I just got back from vacation.
If it is unsafe to directly use the subpages of a hugetlb page, then reverting
this patch seems like the only option for addressing this issue immediately.
So, this patch is
Acked-by: Vivek Kasireddy 

As far as the use-case is concerned, there are two main users of the udmabuf
driver: Qemu and CrosVM VMMs. However, it appears Qemu is the only one
that uses hugetlb pages (when hugetlb=on is set) as the backing store for
Guest (Linux, Android and Windows) system memory. The main goal is to
share the pages associated with the Guest allocated framebuffer (FB) with
the Host GPU driver and other components in a zero-copy way. To that end,
the guest GPU driver (virtio-gpu) allocates 4k size pages (associated with
the FB) and pins them before sharing the (guest) physical (or dma) addresses
(and lengths) with Qemu. Qemu then translates the addresses into file
offsets and shares these offsets with udmabuf. 
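For reference, the UAPI used for that last step looks roughly like this
(userspace sketch; offset/size would be the translated file offsets, and
error handling is omitted):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/udmabuf.h>

/* memfd backs guest RAM; offset/size describe one pinned guest buffer. */
static int create_dmabuf_from_memfd(int memfd, __u64 offset, __u64 size)
{
        struct udmabuf_create create = {
                .memfd  = memfd,
                .flags  = UDMABUF_FLAGS_CLOEXEC,
                .offset = offset,       /* must be PAGE_SIZE aligned */
                .size   = size,         /* must be a multiple of PAGE_SIZE */
        };
        int devfd = open("/dev/udmabuf", O_RDWR);
        int dmabuf_fd = ioctl(devfd, UDMABUF_CREATE, &create);

        close(devfd);
        return dmabuf_fd;               /* dma-buf fd on success, -1 on error */
}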

The udmabuf driver obtains the pages associated with the file offsets and
uses these pages to eventually populate a scatterlist. It also creates a 
dmabuf fd and acts as the exporter. AFAIK, it should be possible to populate
the scatterlist with physical/dma addresses (of huge pages) instead of using
subpages but this might limit the capabilities of some (dmabuf) importers.
I'll try to figure out a solution using physical/dma addresses and see if it
works as expected, and will share the patch on linux-mm to request
feedback once it is ready.
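The rough shape of that solution would be to put the hugetlb page itself
(plus an in-page offset) into each scatterlist entry instead of a subpage,
along these lines -- essentially what the later "mapping hugetlb pages (v2)"
patch in this thread ends up doing (sketch only):

static int populate_sg_from_hugepages(struct device *dev, struct udmabuf *ubuf,
                                      struct sg_table *sg,
                                      enum dma_data_direction direction)
{
        struct scatterlist *sgl;
        unsigned long i;

        /* ubuf->pages[i] holds the hugetlb (head) page and ubuf->offsets[i]
         * the offset of the 4k chunk within it. */
        for_each_sg(sg->sgl, sgl, ubuf->pagecount, i)
                sg_set_page(sgl, ubuf->pages[i], PAGE_SIZE,
                            ubuf->offsets ? ubuf->offsets[i] : 0);

        return dma_map_sgtable(dev, sg, direction, 0);
}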

Thanks,
Vivek

> 
> This effectively reverts commit 16c243e99d33 ("udmabuf: Add support
> for mapping hugepages (v4)").  Recently, Junxiao Chang found a BUG
> with page map counting as described here [1].  This issue pointed out
> that the udmabuf driver was making direct use of subpages of hugetlb
> pages.  This is not a good idea, and no other mm code attempts such use.
> In addition to the mapcount issue, this also causes issues with hugetlb
> vmemmap optimization and page poisoning.
> 
> For now, remove hugetlb support.
> 
> If udmabuf wants to be used on hugetlb mappings, it should be changed to
> only use complete hugetlb pages.  This will require different alignment
> and size requirements on the UDMABUF_CREATE API.
> 
> [1] https://lore.kernel.org/linux-mm/20230512072036.1027784-1-
> junxiao.ch...@intel.com/
> 
> Fixes: 16c243e99d33 ("udmabuf: Add support for mapping hugepages (v4)")
> Cc: 
> Signed-off-by: Mike Kravetz 
> ---
>  drivers/dma-buf/udmabuf.c | 47 +--
>  1 file changed, 6 insertions(+), 41 deletions(-)
> 
> diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> index 01f2e86f3f7c..12cf6bb2e3ce 100644
> --- a/drivers/dma-buf/udmabuf.c
> +++ b/drivers/dma-buf/udmabuf.c
> @@ -12,7 +12,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
> 
> @@ -207,9 +206,7 @@ static long udmabuf_create(struct miscdevice
> *device,
>   struct udmabuf *ubuf;
>   struct dma_buf *buf;
>   pgoff_t pgoff, pgcnt, pgidx, pgbuf = 0, pglimit;
> - struct page *page, *hpage = NULL;
> - pgoff_t subpgoff, maxsubpgs;
> - struct hstate *hpstate;
> + struct page *page;
>   int seals, ret = -EINVAL;
>   u32 i, flags;
> 
> @@ -245,7 +242,7 @@ static long udmabuf_create(struct miscdevice
> *device,
>   if (!memfd)
>   goto err;
>   mapping = memfd->f_mapping;
> - if (!shmem_mapping(mapping) &&
> !is_file_hugepages(memfd))
> + if (!shmem_mapping(mapping))
>   goto err;
>   seals = memfd_fcntl(memfd, F_GET_SEALS, 0);
>   if (seals == -EINVAL)
> @@ -256,48 +253,16 @@ static long udmabuf_create(struct miscdevice
> *device,
>   goto err;
>   pgoff = list[i].offset >> PAGE_SHIFT;
>   pgcnt = list[i].size   >> PAGE_SHIFT;
> - if (is_file_hugepages(memfd)) {
> - hpstate = hstate_file(memfd);
> - pgoff = list[i].offset >> huge_page_shift(hpstate);
> - subpgoff = (list[i].offset &
> - ~huge_page_mask(hpstate)) >>
> PAGE_SHIFT;
> - maxsubpgs = huge_page_size(hpstate) >>
> PAGE_SHIFT;
> - }
>   for (pgidx = 0; pgidx < pgcnt; pgidx++) {
> - if (is_file_hugepages(memfd)) {
> - if (!hpage) {
> - hpage =
> find_get_page_flags(mapping, pgoff,
> -
> FGP_ACCESSED);
> - if (!hpage) {
> - ret = -EINVAL;
> - goto err;
> - }
> - }
> - page = hpage + subpgoff;
> - get_page(page);
> - 

RE: [PATCH v1 2/2] drm/virtio: Add the hotplug_mode_update property for rescanning of modes

2023-01-09 Thread Kasireddy, Vivek
Hi Daniel,

> 
> On Fri, Jan 06, 2023 at 09:56:40AM +0100, Gerd Hoffmann wrote:
> > On Thu, Nov 17, 2022 at 05:30:54PM -0800, Vivek Kasireddy wrote:
> > > Setting this property will allow the userspace to look for new modes or
> > > position info when a hotplug event occurs.
> >
> > This works just fine for modes today.
> >
> > I assume this is this need to have userspace also check for position
> > info updates added by patch #1)?
> 
> What does this thing even do? Quick grep says qxl and vmwgfx also use
> this, but it's not documented anywhere, and it's also not done with any
> piece of common code. Which all looks really fishy.
[Kasireddy, Vivek] AFAIU, this property appears to be useful only for virtual
GPU drivers to share the Host output(s) layout with the Guest compositor. The
suggested_x/y properties are specifically used for this purpose but it looks 
like
the hotplug_mode_update property also needs to be set in order to have Guest
compositors (Mutter cares but Weston does not) look at suggested_x/y.

> 
> I think we need to do a bit of refactoring/documenting here first.
[Kasireddy, Vivek] Just for reference, here is Dave's commit that added this
property for qxl:
commit 4695b03970df378dcb93fe3e7158381f1e980fa2
Author: Dave Airlie 
Date:   Fri Oct 11 11:05:00 2013 +1000

qxl: add a connector property to denote hotplug should rescan modes.

So GNOME userspace has an issue with when it rescans for modes on hotplug
events, if the monitor has no EDID it assumes that nothing has changed on
EDID as with real hw we'd never have new modes without a new EDID, and they
kind off rely on the behaviour now, however with virtual GPUs we would
like to rescan the modes and get a new preferred mode on hotplug events
to handle dynamic guest resizing (where you resize the host window and the
guest resizes with it).

This is a simple property we can make userspace watch for to trigger new
behaviour based on it, and can be used to replaced EDID hacks in virtual
drivers.

Are you suggesting that this property needs to be part of drm_mode_config
just like suggested_x/y properties?
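(For reference, the per-driver piece is tiny -- roughly the following sketch,
with qxl/vmwgfx differing only in where they store the property pointer and
in the initial value they attach:)

static void attach_hotplug_mode_update(struct drm_device *dev,
                                       struct drm_connector *connector)
{
        struct drm_property *prop;

        prop = drm_property_create_range(dev, DRM_MODE_PROP_IMMUTABLE,
                                         "hotplug_mode_update", 0, 1);
        if (prop)
                drm_object_attach_property(&connector->base, prop, 1);
}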

> 
> Also in principle, userspace needs to look at everything in the connector
> again when it gets a hotplug event. We do have hotplug events for specific
> properties nowadays, but those are fairly new.
[Kasireddy, Vivek] From what I understand, Mutter does probe all the
connector properties during hotplug but it still needs this property to be set 
in
order to consider suggested_x/y values. And, it appears, some customers and
users have relied on this behavior from when these properties were first
introduced for virtual GPU drivers.

Thanks,
Vivek

> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch


RE: [PATCH v1 2/2] drm/virtio: Add the hotplug_mode_update property for rescanning of modes

2023-01-09 Thread Kasireddy, Vivek
Hi Gerd,

> 
> On Thu, Nov 17, 2022 at 05:30:54PM -0800, Vivek Kasireddy wrote:
> > Setting this property will allow the userspace to look for new modes or
> > position info when a hotplug event occurs.
> 
> This works just fine for modes today.
> 
> I assume this is this need to have userspace also check for position
> info updates added by patch #1)?
[Kasireddy, Vivek] Yes, that is exactly the reason why this property is needed. 
In 
other words, Mutter does not seem to look at suggested_x/y values (or position 
info)
if hotplug_mode_property is not there. Here is the relevant piece of code in 
Mutter:

static gboolean
meta_monitor_normal_get_suggested_position (MetaMonitor *monitor,
int *x,
int *y)
{
  const MetaOutputInfo *output_info =
meta_monitor_get_main_output_info (monitor);

  if (!output_info->hotplug_mode_update)
return FALSE;

  if (output_info->suggested_x < 0 && output_info->suggested_y < 0)
return FALSE;

  if (x)
*x = output_info->suggested_x;

  if (y)
*y = output_info->suggested_y;


Thanks,
Vivek

> 
> take care,
>   Gerd



RE: [PATCH v2 2/2] drm/virtio: fence created per cursor/plane update

2022-06-15 Thread Kasireddy, Vivek
Hi DW,

> 
> On Thu, Jun 09, 2022 at 06:24:43AM +0200, Gerd Hoffmann wrote:
> > On Fri, Jun 03, 2022 at 02:18:49PM -0700, Dongwon Kim wrote:
> > > Having one fence for a vgfb would cause conflict in case there are
> > > multiple planes referencing the same vgfb (e.g. Xorg screen covering
> > > two displays in extended mode) being flushed simultaneously. So it makes
> > > sence to use a separated fence for each plane update to prevent this.
> > >
> > > vgfb->fence is not required anymore with the suggested code change so
> > > both prepare_fb and cleanup_fb are removed since only fence creation/
> > > freeing are done in there.
> >
> > The fences are allocated and released in prepare_fb + cleanup_fb for a
> > reason: atomic_update must not fail.
> 
> In case fence allocation fails, it falls back to non-fence path so it
> won't fail for primary-plane-update.
> 
> For cursor plane update, it returns if fence is NULL but we could change
> it to just proceed and just make it skip waiting like,
[Kasireddy, Vivek] But cursor plane update is always tied to a fence based on 
the
way it works now and we have to fail if there is no fence.

> 
> if (fence) {
> dma_fence_wait(&fence->f, true);
> dma_fence_put(&fence->f);
> }
> 
> Or maybe I can limit my suggested changes to primary-plane-update only.
> 
> What do you think about these?
> 
> >
> > I guess virtio-gpu must be fixed to use drm_plane_state->fence
> > correctly ...
> 
> I was thinking about this too but current functions (e.g.
> virtio_gpu_cmd_transfer_to_host_2d) takes "struct virtio_gpu_fence".
> Not sure what is the best way to connect drm_plane_state->fence to
> virtio_gpu_fence without changing major function interfaces.
[Kasireddy, Vivek] FWIU, we cannot use drm_plane_state->fence as it is used
by drm core to handle explicit fences. So, I think a cleaner way is to subclass
base drm_plane_state and move the fence from virtio_gpu_framebuffer to a
new struct virtio_gpu_plane_state. This way, we can create the fence in
prepare_fb() and use it for synchronization in resource_flush.
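A rough sketch of that subclassing (illustrative only; the destroy hook and
the prepare_fb()/resource_flush() changes that allocate and wait on the fence
are elided):

struct virtio_gpu_plane_state {
        struct drm_plane_state base;
        struct virtio_gpu_fence *fence;
};

#define to_virtio_gpu_plane_state(x) \
        container_of(x, struct virtio_gpu_plane_state, base)

static struct drm_plane_state *
virtio_gpu_plane_duplicate_state(struct drm_plane *plane)
{
        struct virtio_gpu_plane_state *new;

        new = kzalloc(sizeof(*new), GFP_KERNEL);
        if (!new)
                return NULL;

        __drm_atomic_helper_plane_duplicate_state(plane, &new->base);
        return &new->base;
}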

Thanks,
Vivek

> 
> >
> > take care,
> >   Gerd
> >


RE: [PATCH v2 2/2] drm/virtio: fence created per cursor/plane update

2022-06-06 Thread Kasireddy, Vivek
Hi DW,

> Subject: [PATCH v2 2/2] drm/virtio: fence created per cursor/plane update
> 
> Having one fence for a vgfb would cause conflict in case there are
> multiple planes referencing the same vgfb (e.g. Xorg screen covering
> two displays in extended mode) being flushed simultaneously. So it makes
> sence to use a separated fence for each plane update to prevent this.
> 
> vgfb->fence is not required anymore with the suggested code change so
> both prepare_fb and cleanup_fb are removed since only fence creation/
> freeing are done in there.
> 
> v2: - use the fence always as long as guest_blob is enabled on the
>   scanout object
> - obj and fence initialized as NULL ptrs to avoid uninitialzed
>   ptr problem (Reported by Dan Carpenter/kernel-test-robot)
> 
> Reported-by: kernel test robot 
> Reported-by: Dan Carpenter 
> Cc: Gurchetan Singh 
> Cc: Gerd Hoffmann 
> Cc: Vivek Kasireddy 
> Signed-off-by: Dongwon Kim 
> ---
>  drivers/gpu/drm/virtio/virtgpu_drv.h   |   1 -
>  drivers/gpu/drm/virtio/virtgpu_plane.c | 103 ++---
>  2 files changed, 39 insertions(+), 65 deletions(-)
> 
> diff --git a/drivers/gpu/drm/virtio/virtgpu_drv.h 
> b/drivers/gpu/drm/virtio/virtgpu_drv.h
> index 0a194aaad419..4c59c1e67ca5 100644
> --- a/drivers/gpu/drm/virtio/virtgpu_drv.h
> +++ b/drivers/gpu/drm/virtio/virtgpu_drv.h
> @@ -186,7 +186,6 @@ struct virtio_gpu_output {
> 
>  struct virtio_gpu_framebuffer {
>   struct drm_framebuffer base;
> - struct virtio_gpu_fence *fence;
>  };
>  #define to_virtio_gpu_framebuffer(x) \
>   container_of(x, struct virtio_gpu_framebuffer, base)
> diff --git a/drivers/gpu/drm/virtio/virtgpu_plane.c 
> b/drivers/gpu/drm/virtio/virtgpu_plane.c
> index 6d3cc9e238a4..821023b7d57d 100644
> --- a/drivers/gpu/drm/virtio/virtgpu_plane.c
> +++ b/drivers/gpu/drm/virtio/virtgpu_plane.c
> @@ -137,29 +137,37 @@ static void virtio_gpu_resource_flush(struct drm_plane 
> *plane,
>   struct virtio_gpu_device *vgdev = dev->dev_private;
>   struct virtio_gpu_framebuffer *vgfb;
>   struct virtio_gpu_object *bo;
> + struct virtio_gpu_object_array *objs = NULL;
> + struct virtio_gpu_fence *fence = NULL;
> 
>   vgfb = to_virtio_gpu_framebuffer(plane->state->fb);
>   bo = gem_to_virtio_gpu_obj(vgfb->base.obj[0]);
> - if (vgfb->fence) {
> - struct virtio_gpu_object_array *objs;
> 
> + if (!bo)
> + return;
[Kasireddy, Vivek] I think you can drop the above check as bo is guaranteed
to be valid in resource_flush as the necessary checks are already done early
in virtio_gpu_primary_plane_update().

> +
> + if (bo->dumb && bo->guest_blob)
> + fence = virtio_gpu_fence_alloc(vgdev, vgdev->fence_drv.context,
> +0);
> +
> + if (fence) {
>   objs = virtio_gpu_array_alloc(1);
> - if (!objs)
> + if (!objs) {
> + kfree(fence);
>   return;
> + }
>   virtio_gpu_array_add_obj(objs, vgfb->base.obj[0]);
>   virtio_gpu_array_lock_resv(objs);
> - virtio_gpu_cmd_resource_flush(vgdev, bo->hw_res_handle, x, y,
> -   width, height, objs, vgfb->fence);
> - virtio_gpu_notify(vgdev);
> + }
> +
> + virtio_gpu_cmd_resource_flush(vgdev, bo->hw_res_handle, x, y,
> +   width, height, objs, fence);
> + virtio_gpu_notify(vgdev);
[Kasireddy, Vivek] I think it is OK to retain the existing style where all the
statements relevant for if (fence) would be lumped together. I do understand 
that
the above two statements would be redundant in that case but it looks a bit 
cleaner.

> 
> - dma_fence_wait_timeout(&vgfb->fence->f, true,
> + if (fence) {
> + dma_fence_wait_timeout(&fence->f, true,
>  msecs_to_jiffies(50));
> - dma_fence_put(&vgfb->fence->f);
> - vgfb->fence = NULL;
> - } else {
> - virtio_gpu_cmd_resource_flush(vgdev, bo->hw_res_handle, x, y,
> -   width, height, NULL, NULL);
> - virtio_gpu_notify(vgdev);
> + dma_fence_put(&fence->f);
>   }
>  }
> 
> @@ -239,47 +247,6 @@ static void virtio_gpu_primary_plane_update(struct 
> drm_plane
> *plane,
> rect.y2 - rect.y1);
>  }
> 
> -static int virtio_gpu_plane_prepare_fb(struct drm_plane *plane,
> -struct drm_plane_state *new_sta

RE: [PATCH v2 1/2] drm/virtio: .release ops for virtgpu fence release

2022-06-06 Thread Kasireddy, Vivek
> virtio_gpu_fence_release is added to free virtio-gpu-fence
> upon release of dma_fence.
> 
> Cc: Gurchetan Singh 
> Cc: Gerd Hoffmann 
> Cc: Vivek Kasireddy 
> Signed-off-by: Dongwon Kim 
> ---
>  drivers/gpu/drm/virtio/virtgpu_fence.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/virtio/virtgpu_fence.c 
> b/drivers/gpu/drm/virtio/virtgpu_fence.c
> index f28357dbde35..ba659ac2a51d 100644
> --- a/drivers/gpu/drm/virtio/virtgpu_fence.c
> +++ b/drivers/gpu/drm/virtio/virtgpu_fence.c
> @@ -63,12 +63,20 @@ static void virtio_gpu_timeline_value_str(struct 
> dma_fence *f,
> char *str,
>(u64)atomic64_read(&fence->drv->last_fence_id));
>  }
> 
> +static void virtio_gpu_fence_release(struct dma_fence *f)
> +{
> + struct virtio_gpu_fence *fence = to_virtio_gpu_fence(f);
> +
> + kfree(fence);
> +}
> +
>  static const struct dma_fence_ops virtio_gpu_fence_ops = {
>   .get_driver_name = virtio_gpu_get_driver_name,
>   .get_timeline_name   = virtio_gpu_get_timeline_name,
>   .signaled= virtio_gpu_fence_signaled,
>   .fence_value_str = virtio_gpu_fence_value_str,
>   .timeline_value_str  = virtio_gpu_timeline_value_str,
> + .release = virtio_gpu_fence_release,

Acked-by: Vivek Kasireddy 

>  };
> 
>  struct virtio_gpu_fence *virtio_gpu_fence_alloc(struct virtio_gpu_device 
> *vgdev,
> --
> 2.20.1



RE: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)

2022-03-17 Thread Kasireddy, Vivek
Hi Tvrtko,

> 
> On 16/03/2022 07:37, Kasireddy, Vivek wrote:
> > Hi Tvrtko,
> >
> >>
> >> On 15/03/2022 07:28, Kasireddy, Vivek wrote:
> >>> Hi Tvrtko, Daniel,
> >>>
> >>>>
> >>>> On 11/03/2022 09:39, Daniel Vetter wrote:
> >>>>> On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy 
> wrote:
> >>>>>>
> >>>>>> On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
> >>>>>> more framebuffers/scanout buffers results in only one that is mappable/
> >>>>>> fenceable. Therefore, pageflipping between these 2 FBs where only one
> >>>>>> is mappable/fenceable creates latencies large enough to miss alternate
> >>>>>> vblanks thereby producing less optimal framerate.
> >>>>>>
> >>>>>> This mainly happens because when i915_gem_object_pin_to_display_plane()
> >>>>>> is called to pin one of the FB objs, the associated vma is identified
> >>>>>> as misplaced and therefore i915_vma_unbind() is called which unbinds 
> >>>>>> and
> >>>>>> evicts it. This misplaced vma gets subseqently pinned only when
> >>>>>> i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
> >>>>>> results in a latency of ~10ms and happens every other vblank/repaint 
> >>>>>> cycle.
> >>>>>> Therefore, to fix this issue, we try to see if there is space to map
> >>>>>> at-least two objects of a given size and return early if there isn't. 
> >>>>>> This
> >>>>>> would ensure that we do not try with PIN_MAPPABLE for any objects that
> >>>>>> are too big to map thereby preventing unncessary unbind.
> >>>>>>
> >>>>>> Testcase:
> >>>>>> Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
> >>>>>> with a 8K@60 mode results in only ~40 FPS. Since upstream Weston 
> >>>>>> submits
> >>>>>> a frame ~7ms before the next vblank, the latencies seen between atomic
> >>>>>> commit and flip event are 7, 24 (7 + 16.66), 7, 24. suggesting that
> >>>>>> it misses the vblank every other frame.
> >>>>>>
> >>>>>> Here is the ftrace snippet that shows the source of the ~10ms latency:
> >>>>>>  i915_gem_object_pin_to_display_plane() {
> >>>>>> 0.102 us   |i915_gem_object_set_cache_level();
> >>>>>>i915_gem_object_ggtt_pin_ww() {
> >>>>>> 0.390 us   |  i915_vma_instance();
> >>>>>> 0.178 us   |  i915_vma_misplaced();
> >>>>>>  i915_vma_unbind() {
> >>>>>>  __i915_active_wait() {
> >>>>>> 0.082 us   |i915_active_acquire_if_busy();
> >>>>>> 0.475 us   |  }
> >>>>>>  intel_runtime_pm_get() {
> >>>>>> 0.087 us   |intel_runtime_pm_acquire();
> >>>>>> 0.259 us   |  }
> >>>>>>  __i915_active_wait() {
> >>>>>> 0.085 us   |i915_active_acquire_if_busy();
> >>>>>> 0.240 us   |  }
> >>>>>>  __i915_vma_evict() {
> >>>>>>ggtt_unbind_vma() {
> >>>>>>  gen8_ggtt_clear_range() {
> >>>>>> 10507.255 us |}
> >>>>>> 10507.689 us |  }
> >>>>>> 10508.516 us |   }
> >>>>>>
> >>>>>> v2: Instead of using bigjoiner checks, determine whether a scanout
> >>>>>>buffer is too big by checking to see if it is possible to map
> >>>>>>two of them into the ggtt.
> >>>>>>
> >>>>>> v3 (Ville):
> >>>>>> - Count how many fb objects can be fit into the available holes
> >>>>>>  instead of checking for a hole twice the object size.
> >>>>>> - Take alignment constraints into account.
> >>>>>> - Limit this large scanout buffer check to >= Gen 11 platforms.
> >>>>>>
> >>>>>> v4:
> >>>>>&

RE: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)

2022-03-16 Thread Kasireddy, Vivek
Hi Tvrtko,

> 
> On 15/03/2022 07:28, Kasireddy, Vivek wrote:
> > Hi Tvrtko, Daniel,
> >
> >>
> >> On 11/03/2022 09:39, Daniel Vetter wrote:
> >>> On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy  
> >>> wrote:
> >>>>
> >>>> On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
> >>>> more framebuffers/scanout buffers results in only one that is mappable/
> >>>> fenceable. Therefore, pageflipping between these 2 FBs where only one
> >>>> is mappable/fenceable creates latencies large enough to miss alternate
> >>>> vblanks thereby producing less optimal framerate.
> >>>>
> >>>> This mainly happens because when i915_gem_object_pin_to_display_plane()
> >>>> is called to pin one of the FB objs, the associated vma is identified
> >>>> as misplaced and therefore i915_vma_unbind() is called which unbinds and
> >>>> evicts it. This misplaced vma gets subsequently pinned only when
> >>>> i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
> >>>> results in a latency of ~10ms and happens every other vblank/repaint 
> >>>> cycle.
> >>>> Therefore, to fix this issue, we try to see if there is space to map
> >>>> at-least two objects of a given size and return early if there isn't. 
> >>>> This
> >>>> would ensure that we do not try with PIN_MAPPABLE for any objects that
> >>>> are too big to map thereby preventing unnecessary unbind.
> >>>>
> >>>> Testcase:
> >>>> Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
> >>>> with a 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
> >>>> a frame ~7ms before the next vblank, the latencies seen between atomic
> >>>> commit and flip event are 7, 24 (7 + 16.66), 7, 24. suggesting that
> >>>> it misses the vblank every other frame.
> >>>>
> >>>> Here is the ftrace snippet that shows the source of the ~10ms latency:
> >>>> i915_gem_object_pin_to_display_plane() {
> >>>> 0.102 us   |i915_gem_object_set_cache_level();
> >>>>   i915_gem_object_ggtt_pin_ww() {
> >>>> 0.390 us   |  i915_vma_instance();
> >>>> 0.178 us   |  i915_vma_misplaced();
> >>>> i915_vma_unbind() {
> >>>> __i915_active_wait() {
> >>>> 0.082 us   |i915_active_acquire_if_busy();
> >>>> 0.475 us   |  }
> >>>> intel_runtime_pm_get() {
> >>>> 0.087 us   |intel_runtime_pm_acquire();
> >>>> 0.259 us   |  }
> >>>> __i915_active_wait() {
> >>>> 0.085 us   |i915_active_acquire_if_busy();
> >>>> 0.240 us   |  }
> >>>> __i915_vma_evict() {
> >>>>   ggtt_unbind_vma() {
> >>>> gen8_ggtt_clear_range() {
> >>>> 10507.255 us |}
> >>>> 10507.689 us |  }
> >>>> 10508.516 us |   }
> >>>>
> >>>> v2: Instead of using bigjoiner checks, determine whether a scanout
> >>>>   buffer is too big by checking to see if it is possible to map
> >>>>   two of them into the ggtt.
> >>>>
> >>>> v3 (Ville):
> >>>> - Count how many fb objects can be fit into the available holes
> >>>> instead of checking for a hole twice the object size.
> >>>> - Take alignment constraints into account.
> >>>> - Limit this large scanout buffer check to >= Gen 11 platforms.
> >>>>
> >>>> v4:
> >>>> - Remove existing heuristic that checks just for size. (Ville)
> >>>> - Return early if we find space to map at-least two objects. (Tvrtko)
> >>>> - Slightly update the commit message.
> >>>>
> >>>> v5: (Tvrtko)
> >>>> - Rename the function to indicate that the object may be too big to
> >>>> map into the aperture.
> >>>> - Account for guard pages while calculating the total size required
> >>>> for the object.
> >>>> - Do not subject all objects to the heuristic check and instead
> >>>> consider objects only of a certain

RE: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)

2022-03-15 Thread Kasireddy, Vivek
 traverse the hole nodes.
> >>
> >> v9: (Tvrtko)
> >> - Use mutex_lock_interruptible_nested() instead of mutex_lock().
> >>
> >> Cc: Ville Syrjälä 
> >> Cc: Maarten Lankhorst 
> >> Cc: Tvrtko Ursulin 
> >> Cc: Manasi Navare 
> >> Reviewed-by: Tvrtko Ursulin 
> >> Signed-off-by: Vivek Kasireddy 
> >> ---
> >>   drivers/gpu/drm/i915/i915_gem.c | 128 +++-
> >>   1 file changed, 94 insertions(+), 34 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/i915/i915_gem.c 
> >> b/drivers/gpu/drm/i915/i915_gem.c
> >> index 9747924cc57b..e0d731b3f215 100644
> >> --- a/drivers/gpu/drm/i915/i915_gem.c
> >> +++ b/drivers/gpu/drm/i915/i915_gem.c
> >> @@ -49,6 +49,7 @@
> >>   #include "gem/i915_gem_pm.h"
> >>   #include "gem/i915_gem_region.h"
> >>   #include "gem/i915_gem_userptr.h"
> >> +#include "gem/i915_gem_tiling.h"
> >>   #include "gt/intel_engine_user.h"
> >>   #include "gt/intel_gt.h"
> >>   #include "gt/intel_gt_pm.h"
> >> @@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma)
> >>  spin_unlock(&obj->vma.lock);
> >>   }
> >>
> >> +static int
> >> +i915_gem_object_fits_in_aperture(struct drm_i915_gem_object *obj,
> >> +u64 alignment, u64 flags)
> >
> > Tvrtko asked me to ack the first patch, but then I looked at this and
> > started wondering.
> >
> > Conceptually this doesn't pass the smell test. What if we have
> > multiple per-crtc buffers? Multiple planes on the same crtc? What if
> > the app does triple buffer? You'll be forever busy tuning this
> > heuristics, which can't fundamentally be fixed I think. The old "half
> > of mappable" heuristic isn't really better, but at least it was dead
> > simple.
> >
> > Imo what we need here is a change in approach:
> > 1. Check whether the useable view for scanout exists already. If yes,
> > use that. This should avoid the constant unbinding stalls.
> > 2. Try to in buffer to mappabley, but without evicting anything (so
> > not the non-blocking thing)
> > 3. Pin the buffer with the most lenient approach
> >
> > Even the non-blocking interim stage is dangerous, since it'll just
> > result in other buffers (e.g. when triple-buffering) getting unbound
> > and we're back to the same stall. Note that this could have an impact
> > on cpu rendering compositors, where we might end up relying a lot more
> > partial views. But as long as we are a tad more aggressive (i.e. the
> > non-blocking binding) in the mmap path that should work out to keep
> > everything balanced, since usually you render first before you display
> > anything. And so the buffer should end up in the ideal place.
> >
> > I'd try to first skip the 2. step since I think it'll require a bit of
> > work, and frankly I don't think we care about the potential fallout.
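[Kasireddy, Vivek] (Just to make sure I follow steps 2 and 3 above -- here is a rough
sketch of what I think is being proposed, loosely modeled on the existing mappable
fallback in i915; this is illustrative only and not the patch under review:

static struct i915_vma *
pin_for_display_sketch(struct drm_i915_gem_object *obj,
                       struct i915_gem_ww_ctx *ww,
                       const struct i915_ggtt_view *view,
                       u64 alignment, u64 flags)
{
        struct i915_vma *vma = ERR_PTR(-ENOSPC);

        /* 2. try the mappable aperture, but do not evict anything */
        if (!(flags & PIN_MAPPABLE))
                vma = i915_gem_object_ggtt_pin_ww(obj, ww, view, 0, alignment,
                                                  flags | PIN_MAPPABLE |
                                                  PIN_NONBLOCK);

        /* 3. most lenient: accept any placement in the GGTT */
        if (IS_ERR(vma))
                vma = i915_gem_object_ggtt_pin_ww(obj, ww, view, 0, alignment,
                                                  flags);

        return vma;
}

with step 1 being a lookup of an already-usable view before either of these calls.)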
> 
> To be sure I understand, you propose to stop trying to pin mappable by 
> default. Ie. stop
> respecting this comment from i915_gem_object_pin_to_display_plane:
> 
>   /*
>* As the user may map the buffer once pinned in the display plane
>* (e.g. libkms for the bootup splash), we have to ensure that we
>* always use map_and_fenceable for all scanout buffers. However,
>* it may simply be too big to fit into mappable, in which case
>* put it anyway and hope that userspace can cope (but always first
>* try to preserve the existing ABI).
>*/
[Kasireddy, Vivek] Digging further, this is what the commit message that added
the above comment says:
commit 2efb813d5388e18255c54afac77bd91acd586908
Author: Chris Wilson 
Date:   Thu Aug 18 17:17:06 2016 +0100

drm/i915: Fallback to using unmappable memory for scanout

The existing ABI says that scanouts are pinned into the mappable region
so that legacy clients (e.g. old Xorg or plymouthd) can write directly
into the scanout through a GTT mapping. However if the surface does not
fit into the mappable region, we are better off just trying to fit it
anywhere and hoping for the best. (Any userspace that is capable of
using ginormous scanouts is also likely not to rely on pure GTT
updates.) With the partial vma fault support, we are no longer
restricted to only using scanouts that we can pin (though it is still
preferred for performance reasons and for powersaving features like
FBC).

> 
> By a quick look, for this case it appears we would end

RE: [Intel-gfx] [CI 1/2] drm/mm: Add an iterator to optimally walk over holes for an allocation (v4)

2022-02-28 Thread Kasireddy, Vivek
Hi Tvrtko,

> 
> Hi Vivek,
> 
> On 27/02/2022 17:29, Vivek Kasireddy wrote:
> > This iterator relies on drm_mm_first_hole() and drm_mm_next_hole()
> > functions to identify suitable holes for an allocation of a given
> > size by efficiently traversing the rbtree associated with the given
> > allocator.
> >
> > It replaces the for loop in drm_mm_insert_node_in_range() and can
> > also be used by drm drivers to quickly identify holes of a certain
> > size within a given range.
> >
> > v2: (Tvrtko)
> > - Prepend a double underscore for the newly exported first/next_hole
> > - s/each_best_hole/each_suitable_hole/g
> > - Mask out DRM_MM_INSERT_ONCE from the mode before calling
> >first/next_hole and elsewhere.
> >
> > v3: (Tvrtko)
> > - Reduce the number of hunks by retaining the "mode" variable name
> >
> > v4:
> > - Typo: s/__drm_mm_next_hole(.., hole/__drm_mm_next_hole(.., pos
> >
> > Reviewed-by: Tvrtko Ursulin 
> > Acked-by: Christian König 
> > Suggested-by: Tvrtko Ursulin 
> > Signed-off-by: Vivek Kasireddy 
> > ---
> >   drivers/gpu/drm/drm_mm.c | 32 +++-
> >   include/drm/drm_mm.h | 36 
> >   2 files changed, 51 insertions(+), 17 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/drm_mm.c b/drivers/gpu/drm/drm_mm.c
> > index 8257f9d4f619..8efea548ae9f 100644
> > --- a/drivers/gpu/drm/drm_mm.c
> > +++ b/drivers/gpu/drm/drm_mm.c
> > @@ -352,10 +352,10 @@ static struct drm_mm_node *find_hole_addr(struct 
> > drm_mm
> *mm, u64 addr, u64 size)
> > return node;
> >   }
> >
> > -static struct drm_mm_node *
> > -first_hole(struct drm_mm *mm,
> > -  u64 start, u64 end, u64 size,
> > -  enum drm_mm_insert_mode mode)
> > +struct drm_mm_node *
> > +__drm_mm_first_hole(struct drm_mm *mm,
> > +   u64 start, u64 end, u64 size,
> > +   enum drm_mm_insert_mode mode)
> >   {
> > switch (mode) {
> > default:
> > @@ -374,6 +374,7 @@ first_hole(struct drm_mm *mm,
> > hole_stack);
> > }
> >   }
> > +EXPORT_SYMBOL(__drm_mm_first_hole);
> >
> >   /**
> >* DECLARE_NEXT_HOLE_ADDR - macro to declare next hole functions
> > @@ -410,11 +411,11 @@ static struct drm_mm_node *name(struct drm_mm_node
> *entry, u64 size) \
> >   DECLARE_NEXT_HOLE_ADDR(next_hole_high_addr, rb_left, rb_right)
> >   DECLARE_NEXT_HOLE_ADDR(next_hole_low_addr, rb_right, rb_left)
> >
> > -static struct drm_mm_node *
> > -next_hole(struct drm_mm *mm,
> > - struct drm_mm_node *node,
> > - u64 size,
> > - enum drm_mm_insert_mode mode)
> > +struct drm_mm_node *
> > +__drm_mm_next_hole(struct drm_mm *mm,
> > +  struct drm_mm_node *node,
> > +  u64 size,
> > +  enum drm_mm_insert_mode mode)
> >   {
> > switch (mode) {
> > default:
> > @@ -432,6 +433,7 @@ next_hole(struct drm_mm *mm,
> > return &node->hole_stack == &mm->hole_stack ? NULL : node;
> > }
> >   }
> > +EXPORT_SYMBOL(__drm_mm_next_hole);
> >
> >   /**
> >* drm_mm_reserve_node - insert an pre-initialized node
> > @@ -516,11 +518,11 @@ int drm_mm_insert_node_in_range(struct drm_mm * const
> mm,
> > u64 size, u64 alignment,
> > unsigned long color,
> > u64 range_start, u64 range_end,
> > -   enum drm_mm_insert_mode mode)
> > +   enum drm_mm_insert_mode caller_mode)
> >   {
> > struct drm_mm_node *hole;
> > u64 remainder_mask;
> > -   bool once;
> > +   enum drm_mm_insert_mode mode = caller_mode &
> ~DRM_MM_INSERT_ONCE;
> >
> > DRM_MM_BUG_ON(range_start > range_end);
> >
> > @@ -533,13 +535,9 @@ int drm_mm_insert_node_in_range(struct drm_mm * const
> mm,
> > if (alignment <= 1)
> > alignment = 0;
> >
> > -   once = mode & DRM_MM_INSERT_ONCE;
> > -   mode &= ~DRM_MM_INSERT_ONCE;
> > -
> > remainder_mask = is_power_of_2(alignment) ? alignment - 1 : 0;
> > -   for (hole = first_hole(mm, range_start, range_end, size, mode);
> > -hole;
> > -hole = once ? NULL : next_hole(mm, hole, size, mode)) {
> > +   drm_mm_for_each_suitable_hole(hole, mm, range_start, range_end,
> > + 

RE: [PATCH v2 1/3] drm/mm: Ensure that the entry is not NULL before extracting rb_node

2022-02-22 Thread Kasireddy, Vivek
Hi Tvrtko,

> 
> On 18/02/2022 03:47, Kasireddy, Vivek wrote:
> > Hi Tvrtko,
> >
> >>
> >> On 17/02/2022 07:50, Vivek Kasireddy wrote:
> >>> While looking for next holes suitable for an allocation, although,
> >>> it is highly unlikely, make sure that the DECLARE_NEXT_HOLE_ADDR
> >>> macro is using a valid node before it extracts the rb_node from it.
> >>
> >> Was the need for this just a consequence of insufficient locking in the
> >> i915 patch?
> > [Kasireddy, Vivek] Partly, yes; but I figured since we are anyway doing
> > if (!entry || ..), it makes sense to dereference entry and extract the 
> > rb_node
> > after this check.
> 
> Unless I am blind I don't see that it makes a difference.
> ">rb_hole_addr" is taking an address of, which works "fine" is
[Kasireddy, Vivek] Ah, didn't realize it was the same thing as offsetof(). 

> entry is NULL. And does not get past the !entry check for the actual
> de-reference via RB_EMPTY_NODE. With your patch you move that after the
> !entry check but still have it in the RB_EMPTY_NODE macro. Again, unless
> I am blind, I think just drop this patch.
[Kasireddy, Vivek] Sure; do you want me to send another version with this
patch dropped? Or, would you be able to just merge the other two from the
latest version of this series?

Thanks,
Vivek

> 
> Regards,
> 
> Tvrtko
> 
> 
> > Thanks,
> > Vivek
> >
> >>
> >> Regards,
> >>
> >> Tvrtko
> >>
> >>>
> >>> Cc: Tvrtko Ursulin 
> >>> Cc: Christian König 
> >>> Signed-off-by: Vivek Kasireddy 
> >>> ---
> >>>drivers/gpu/drm/drm_mm.c | 5 +++--
> >>>1 file changed, 3 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/drm_mm.c b/drivers/gpu/drm/drm_mm.c
> >>> index 8257f9d4f619..499d8874e4ed 100644
> >>> --- a/drivers/gpu/drm/drm_mm.c
> >>> +++ b/drivers/gpu/drm/drm_mm.c
> >>> @@ -389,11 +389,12 @@ first_hole(struct drm_mm *mm,
> >>>#define DECLARE_NEXT_HOLE_ADDR(name, first, last)  
> >>> \
> >>>static struct drm_mm_node *name(struct drm_mm_node *entry, u64 size)   
> >>> \
> >>>{  
> >>> \
> >>> - struct rb_node *parent, *node = &entry->rb_hole_addr;   \
> >>> + struct rb_node *parent, *node;  \
> >>>   
> >>> \
> >>> - if (!entry || RB_EMPTY_NODE(node))  \
> >>> + if (!entry || RB_EMPTY_NODE(&entry->rb_hole_addr))  \
> >>>   return NULL;
> >>> \
> >>>   
> >>> \
> >>> + node = &entry->rb_hole_addr;\
> >>>   if (usable_hole_addr(node->first, size)) {  
> >>> \
> >>>   node = node->first; 
> >>> \
> >>>   while (usable_hole_addr(node->last, size))  
> >>> \


RE: [PATCH v2 1/3] drm/mm: Ensure that the entry is not NULL before extracting rb_node

2022-02-17 Thread Kasireddy, Vivek
Hi Tvrtko,

> 
> On 17/02/2022 07:50, Vivek Kasireddy wrote:
> > While looking for next holes suitable for an allocation, although,
> > it is highly unlikely, make sure that the DECLARE_NEXT_HOLE_ADDR
> > macro is using a valid node before it extracts the rb_node from it.
> 
> Was the need for this just a consequence of insufficient locking in the
> i915 patch?
[Kasireddy, Vivek] Partly, yes; but I figured since we are anyway doing
if (!entry || ..), it makes sense to dereference entry and extract the rb_node
after this check.

Thanks,
Vivek

> 
> Regards,
> 
> Tvrtko
> 
> >
> > Cc: Tvrtko Ursulin 
> > Cc: Christian König 
> > Signed-off-by: Vivek Kasireddy 
> > ---
> >   drivers/gpu/drm/drm_mm.c | 5 +++--
> >   1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/drm_mm.c b/drivers/gpu/drm/drm_mm.c
> > index 8257f9d4f619..499d8874e4ed 100644
> > --- a/drivers/gpu/drm/drm_mm.c
> > +++ b/drivers/gpu/drm/drm_mm.c
> > @@ -389,11 +389,12 @@ first_hole(struct drm_mm *mm,
> >   #define DECLARE_NEXT_HOLE_ADDR(name, first, last) \
> >   static struct drm_mm_node *name(struct drm_mm_node *entry, u64 size)  
> > \
> >   { \
> > -   struct rb_node *parent, *node = &entry->rb_hole_addr;   \
> > +   struct rb_node *parent, *node;  \
> > \
> > -   if (!entry || RB_EMPTY_NODE(node))  \
> > +   if (!entry || RB_EMPTY_NODE(&entry->rb_hole_addr))  \
> > return NULL;\
> > \
> > +   node = &entry->rb_hole_addr;\
> > if (usable_hole_addr(node->first, size)) {  \
> > node = node->first; \
> > while (usable_hole_addr(node->last, size))  \


RE: [PATCH 1/2] drm/mm: Add an iterator to optimally walk over holes for an allocation

2022-02-03 Thread Kasireddy, Vivek
Hi Tvrtko,

> -Original Message-
> From: Tvrtko Ursulin 
> Sent: Wednesday, February 02, 2022 5:04 AM
> To: Kasireddy, Vivek ; 
> dri-devel@lists.freedesktop.org
> Subject: Re: [PATCH 1/2] drm/mm: Add an iterator to optimally walk over holes 
> for an
> allocation
> 
> 
> On 02/02/2022 01:13, Vivek Kasireddy wrote:
> > This iterator relies on drm_mm_first_hole() and drm_mm_next_hole()
> > functions to identify suitable holes for an allocation of a given
> > size by efficiently traversing the rbtree associated with the given
> > allocator.
> >
> > It replaces the for loop in drm_mm_insert_node_in_range() and can
> > also be used by drm drivers to quickly identify holes of a certain
> > size within a given range.
> >
> > Suggested-by: Tvrtko Ursulin 
> > Signed-off-by: Vivek Kasireddy 
> > ---
> >   drivers/gpu/drm/drm_mm.c | 28 
> >   include/drm/drm_mm.h | 32 
> >   2 files changed, 44 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/drm_mm.c b/drivers/gpu/drm/drm_mm.c
> > index 8257f9d4f619..416c849c10e5 100644
> > --- a/drivers/gpu/drm/drm_mm.c
> > +++ b/drivers/gpu/drm/drm_mm.c
> > @@ -352,10 +352,10 @@ static struct drm_mm_node *find_hole_addr(struct 
> > drm_mm
> *mm, u64 addr, u64 size)
> > return node;
> >   }
> >
> > -static struct drm_mm_node *
> > -first_hole(struct drm_mm *mm,
> > -  u64 start, u64 end, u64 size,
> > -  enum drm_mm_insert_mode mode)
> > +struct drm_mm_node *
> > +drm_mm_first_hole(struct drm_mm *mm,
> > + u64 start, u64 end, u64 size,
> > + enum drm_mm_insert_mode mode)
> >   {
> > switch (mode) {
> > default:
> > @@ -374,6 +374,7 @@ first_hole(struct drm_mm *mm,
> > hole_stack);
> > }
> >   }
> > +EXPORT_SYMBOL(drm_mm_first_hole);
> >
> >   /**
> >* DECLARE_NEXT_HOLE_ADDR - macro to declare next hole functions
> > @@ -410,11 +411,11 @@ static struct drm_mm_node *name(struct drm_mm_node
> *entry, u64 size) \
> >   DECLARE_NEXT_HOLE_ADDR(next_hole_high_addr, rb_left, rb_right)
> >   DECLARE_NEXT_HOLE_ADDR(next_hole_low_addr, rb_right, rb_left)
> >
> > -static struct drm_mm_node *
> > -next_hole(struct drm_mm *mm,
> > - struct drm_mm_node *node,
> > - u64 size,
> > - enum drm_mm_insert_mode mode)
> > +struct drm_mm_node *
> > +drm_mm_next_hole(struct drm_mm *mm,
> > +struct drm_mm_node *node,
> > +u64 size,
> > +enum drm_mm_insert_mode mode)
> >   {
> > switch (mode) {
> > default:
> > @@ -432,6 +433,7 @@ next_hole(struct drm_mm *mm,
> > > return &node->hole_stack == &mm->hole_stack ? NULL : node;
> > }
> >   }
> > +EXPORT_SYMBOL(drm_mm_next_hole);
> 
> May need to add kerneldoc since first/next_hole are now exported, or
> perhaps double underscore them if DRM core allows that approach to kind
> of signify "it is exported by shouldn't really be used"? Question for
> dri-devel I guess.
[Kasireddy, Vivek] Ok, will wait until Daniel or others on dri-devel weigh in.
However, it looks like double underscore is the way to go given how the
exported symbol __drm_mm_interval_first() is used.
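(For what it's worth, a rough usage sketch of the iterator from a driver's point of
view -- not part of this series, and using the drm_mm_for_each_suitable_hole() name
that the later revisions settle on. It mirrors how the i915 patch counts the holes
that can hold a framebuffer of a given size:

static bool two_objects_fit(struct drm_mm *mm, u64 end, u64 size)
{
        struct drm_mm_node *hole;
        unsigned int count = 0;

        /* walk only the holes that are at least @size bytes, low to high */
        drm_mm_for_each_suitable_hole(hole, mm, 0, end, size,
                                      DRM_MM_INSERT_LOW) {
                if (++count >= 2)
                        return true;
        }

        return false;
}
)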

> 
> >
> >   /**
> >* drm_mm_reserve_node - insert an pre-initialized node
> > @@ -520,7 +522,6 @@ int drm_mm_insert_node_in_range(struct drm_mm * const 
> > mm,
> >   {
> > struct drm_mm_node *hole;
> > u64 remainder_mask;
> > -   bool once;
> >
> > DRM_MM_BUG_ON(range_start > range_end);
> >
> > @@ -533,13 +534,8 @@ int drm_mm_insert_node_in_range(struct drm_mm * const
> mm,
> > if (alignment <= 1)
> > alignment = 0;
> >
> > -   once = mode & DRM_MM_INSERT_ONCE;
> > -   mode &= ~DRM_MM_INSERT_ONCE;
> > -
> > remainder_mask = is_power_of_2(alignment) ? alignment - 1 : 0;
> > -   for (hole = first_hole(mm, range_start, range_end, size, mode);
> > -hole;
> > -hole = once ? NULL : next_hole(mm, hole, size, mode)) {
> > +   drm_mm_for_each_best_hole(hole, mm, range_start, range_end, size, mode) 
> > {
> > u64 hole_start = __drm_mm_hole_node_start(hole);
> > u64 hole_end = hole_start + hole->hole_size;
> > u64 adj_start, adj_end;
> > diff --git a/include/drm/drm_mm.h b/include/drm/drm_mm.h
> > index ac33ba1b18bc..505

RE: [PATCH v3 11/12] drm/virtio: implement context init: add virtio_gpu_fence_event

2021-11-15 Thread Kasireddy, Vivek
Hi Daniel, Greg,

If it is the same or a similar crash reported here:
https://lists.freedesktop.org/archives/dri-devel/2021-November/330018.html
and here: 
https://lists.freedesktop.org/archives/dri-devel/2021-November/330212.html
then the fix is already merged:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d89c0c8322ecdc9a2ec84b959b6f766be082da76

Thanks,
Vivek

> On Sat, Nov 13, 2021 at 03:51:48PM +0100, Greg KH wrote:
> > On Tue, Sep 21, 2021 at 04:20:23PM -0700, Gurchetan Singh wrote:
> > > Similar to DRM_VMW_EVENT_FENCE_SIGNALED.  Sends a pollable event
> > > to the DRM file descriptor when a fence on a specific ring is
> > > signaled.
> > >
> > > One difference is the event is not exposed via the UAPI -- this is
> > > because host responses are on a shared memory buffer of type
> > > BLOB_MEM_GUEST [this is the common way to receive responses with
> > > virtgpu].  As such, there is no context specific read(..)
> > > implementation either -- just a poll(..) implementation.
> > >
> > > Signed-off-by: Gurchetan Singh 
> > > Acked-by: Nicholas Verne 
> > > ---
> > >  drivers/gpu/drm/virtio/virtgpu_drv.c   | 43 +-
> > >  drivers/gpu/drm/virtio/virtgpu_drv.h   |  7 +
> > >  drivers/gpu/drm/virtio/virtgpu_fence.c | 10 ++
> > >  drivers/gpu/drm/virtio/virtgpu_ioctl.c | 34 
> > >  4 files changed, 93 insertions(+), 1 deletion(-)
> >
> > This commit seems to cause a crash in a virtual drm gpu driver for
> > Android.  I have reverted this, and the next commit in the series from
> > Linus's tree and all is good again.
> >
> > Any ideas?
> 
> Well no, but also this patch looks very questionable of hand-rolling
> drm_poll. Yes you can do driver private events like
> DRM_VMW_EVENT_FENCE_SIGNALED, that's fine. But you really should not need
> to hand-roll the poll callback. vmwgfx (which generally is a very old
> driver which has lots of custom stuff, so not a great example) doesn't do
> that either.
> 
> So that part should go no matter what I think.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch


RE: [RFC v1 3/6] drm: Add a capability flag to support additional flip completion signalling

2021-10-17 Thread Kasireddy, Vivek
Hi Pekka,

> 
> Hi Vivek!
> 
> > > On Mon, 13 Sep 2021 16:35:26 -0700
> > > Vivek Kasireddy  wrote:
> > >
> > > > If a driver supports this capability, it means that there would be an
> > > > additional signalling mechanism for a page flip completion in addition
> > > > to out_fence or DRM_MODE_PAGE_FLIP_EVENT.
> > > >
> > > > This capability may only be relevant for Virtual KMS drivers and is 
> > > > currently
> > > > used only by virtio-gpu. Also, it can provide a potential solution for:
> > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514
> > > >
> > > > Signed-off-by: Vivek Kasireddy 
> > > > ---
> > > >  drivers/gpu/drm/drm_ioctl.c   | 3 +++
> > > >  include/drm/drm_mode_config.h | 8 
> > > >  include/uapi/drm/drm.h| 1 +
> > > >  3 files changed, 12 insertions(+)
> > > >
> > > > diff --git a/drivers/gpu/drm/drm_ioctl.c b/drivers/gpu/drm/drm_ioctl.c
> > > > index 8b8744dcf691..8a420844f8bc 100644
> > > > --- a/drivers/gpu/drm/drm_ioctl.c
> > > > +++ b/drivers/gpu/drm/drm_ioctl.c
> > > > @@ -302,6 +302,9 @@ static int drm_getcap(struct drm_device *dev, void 
> > > > *data,
> struct
> > > drm_file *file_
> > > > case DRM_CAP_CRTC_IN_VBLANK_EVENT:
> > > > req->value = 1;
> > > > break;
> > > > +   case DRM_CAP_RELEASE_FENCE:
> > > > +   req->value = dev->mode_config.release_fence;
> > > > +   break;
> > >
> > > Hi Vivek,
> > >
> > > is this actually necessary?
> > >
> > > I would think that userspace figures out the existence of the release
> > > fence capability by seeing that the KMS property "RELEASE_FENCE_PTR"
> > > either exists or not.
> > [Vivek] Yeah, that makes sense. However, in order for the userspace to not 
> > see
> > this property, we'd have to prevent drm core from exposing it; which means 
> > we
> > need to check dev->mode_config.release_fence before attaching the property
> > to the crtc.
> 
> Kernel implementation details, I don't bother with those personally. ;-)
> 
> Sounds right.
> 
> > >
> > > However, would we not need a client cap instead?
> > >
> > > If a KMS driver knows that userspace is aware of "RELEASE_FENCE_PTR"
> > > and will use it when necessary, then the KMS driver can send the
> > > pageflip completion without waiting for the host OS to signal the old
> > > buffer as free for re-use.
> > [Vivek] Right, the KMS driver can just look at whether the release_fence was
> > added by the userspace (in the atomic commit) to determine whether it needs
> > to wait for the old fb.
> 
> You could do it that way, but is it a good idea? I'm not sure.
> 
> > > If the KMS driver does not know that userspace can handle pageflip
> > > completing "too early", then it has no choice but to wait until the old
> > > buffer is really free before signalling pageflip completion.
> > >
> > > Wouldn't that make sense?
> > [Vivek] Yes; DRM_CAP_RELEASE_FENCE may not be necessary to
> > implement the behavior you suggest which makes sense.
> >
> > >
> > >
> > > Otherwise, this proposal sounds fine to me.
> > [Vivek] Did you get a chance to review the Weston MR:
> > https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
> >
> > Could you please take a look?
> 
> Unfortunately I cannot promise any timely feedback on that, I try to
> concentrate on CM However, I'm not the only Weston reviewer, I
> hope.
[Kasireddy, Vivek] I was going to say it's a small patch to review but, ok np, 
I'll
ping Simon or Michel or Daniel.

Thanks,
Vivek
> 
> 
> Thanks,
> pq
> 
> >
> > Thanks,
> > Vivek
> >
> > >
> > >
> > > Thanks,
> > > pq
> > >
> > >
> > > > default:
> > > > return -EINVAL;
> > > > }
> > > > diff --git a/include/drm/drm_mode_config.h 
> > > > b/include/drm/drm_mode_config.h
> > > > index 12b964540069..944bebf359d7 100644
> > > > --- a/include/drm/drm_mode_config.h
> > > > +++ b/include/drm/drm_mode_config.h
> > > > @@ -935,6 +935,14 @@ struct drm_mode_config {
> > > >  */
> > > >

RE: [RFC v1 3/6] drm: Add a capability flag to support additional flip completion signalling

2021-10-14 Thread Kasireddy, Vivek
Hi Pekka,
Thank you for reviewing this patch.
 
> On Mon, 13 Sep 2021 16:35:26 -0700
> Vivek Kasireddy  wrote:
> 
> > If a driver supports this capability, it means that there would be an
> > additional signalling mechanism for a page flip completion in addition
> > to out_fence or DRM_MODE_PAGE_FLIP_EVENT.
> >
> > This capability may only be relevant for Virtual KMS drivers and is 
> > currently
> > used only by virtio-gpu. Also, it can provide a potential solution for:
> > https://gitlab.freedesktop.org/wayland/weston/-/issues/514
> >
> > Signed-off-by: Vivek Kasireddy 
> > ---
> >  drivers/gpu/drm/drm_ioctl.c   | 3 +++
> >  include/drm/drm_mode_config.h | 8 
> >  include/uapi/drm/drm.h| 1 +
> >  3 files changed, 12 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/drm_ioctl.c b/drivers/gpu/drm/drm_ioctl.c
> > index 8b8744dcf691..8a420844f8bc 100644
> > --- a/drivers/gpu/drm/drm_ioctl.c
> > +++ b/drivers/gpu/drm/drm_ioctl.c
> > @@ -302,6 +302,9 @@ static int drm_getcap(struct drm_device *dev, void 
> > *data, struct
> drm_file *file_
> > case DRM_CAP_CRTC_IN_VBLANK_EVENT:
> > req->value = 1;
> > break;
> > +   case DRM_CAP_RELEASE_FENCE:
> > +   req->value = dev->mode_config.release_fence;
> > +   break;
> 
> Hi Vivek,
> 
> is this actually necessary?
> 
> I would think that userspace figures out the existence of the release
> fence capability by seeing that the KMS property "RELEASE_FENCE_PTR"
> either exists or not.
[Vivek] Yeah, that makes sense. However, in order for the userspace to not see
this property, we'd have to prevent drm core from exposing it; which means we
need to check dev->mode_config.release_fence before attaching the property
to the crtc.
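Something along these lines is what I had in mind (an illustrative sketch only,
mirroring how OUT_FENCE_PTR is registered; RELEASE_FENCE_PTR is the property
proposed in this RFC, not an existing one):

        /* only expose the property when the driver opted in */
        if (dev->mode_config.release_fence) {
                struct drm_property *prop;

                prop = drm_property_create_signed_range(dev,
                                                        DRM_MODE_PROP_ATOMIC,
                                                        "RELEASE_FENCE_PTR",
                                                        S32_MIN, S32_MAX);
                if (prop)
                        drm_object_attach_property(&crtc->base, prop, 0);
        }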

> 
> However, would we not need a client cap instead?
> 
> If a KMS driver knows that userspace is aware of "RELEASE_FENCE_PTR"
> and will use it when necessary, then the KMS driver can send the
> pageflip completion without waiting for the host OS to signal the old
> buffer as free for re-use.
[Vivek] Right, the KMS driver can just look at whether the release_fence was
added by the userspace (in the atomic commit) to determine whether it needs
to wait for the old fb.

> 
> If the KMS driver does not know that userspace can handle pageflip
> completing "too early", then it has no choice but to wait until the old
> buffer is really free before signalling pageflip completion.
> 
> Wouldn't that make sense?
[Vivek] Yes; DRM_CAP_RELEASE_FENCE may not be necessary to
implement the behavior you suggest which makes sense.

> 
> 
> Otherwise, this proposal sounds fine to me.
[Vivek] Did you get a chance to review the Weston MR:
https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668

Could you please take a look?

Thanks,
Vivek

> 
> 
> Thanks,
> pq
> 
> 
> > default:
> > return -EINVAL;
> > }
> > diff --git a/include/drm/drm_mode_config.h b/include/drm/drm_mode_config.h
> > index 12b964540069..944bebf359d7 100644
> > --- a/include/drm/drm_mode_config.h
> > +++ b/include/drm/drm_mode_config.h
> > @@ -935,6 +935,14 @@ struct drm_mode_config {
> >  */
> > bool normalize_zpos;
> >
> > +   /**
> > +* @release_fence:
> > +*
> > +* If this option is set, it means there would be an additional 
> > signalling
> > +* mechanism for a page flip completion.
> > +*/
> > +   bool release_fence;
> > +
> > /**
> >  * @modifiers_property: Plane property to list support modifier/format
> >  * combination.
> > diff --git a/include/uapi/drm/drm.h b/include/uapi/drm/drm.h
> > index 3b810b53ba8b..8b8985f65581 100644
> > --- a/include/uapi/drm/drm.h
> > +++ b/include/uapi/drm/drm.h
> > @@ -767,6 +767,7 @@ struct drm_gem_open {
> >   * Documentation/gpu/drm-mm.rst, section "DRM Sync Objects".
> >   */
> >  #define DRM_CAP_SYNCOBJ_TIMELINE   0x14
> > +#define DRM_CAP_RELEASE_FENCE  0x15
> >
> >  /* DRM_IOCTL_GET_CAP ioctl argument type */
> >  struct drm_get_cap {



RE: [PATCH v2 11/12] drm/virtio: implement context init: add virtio_gpu_fence_event

2021-09-17 Thread Kasireddy, Vivek
Hi Gurchetan,

> 
> Similar to DRM_VMW_EVENT_FENCE_SIGNALED.  Sends a pollable event
> to the DRM file descriptor when a fence on a specific ring is
> signaled.
> 
> One difference is the event is not exposed via the UAPI -- this is
> because host responses are on a shared memory buffer of type
> BLOB_MEM_GUEST [this is the common way to receive responses with
> virtgpu].  As such, there is no context specific read(..)
> implementation either -- just a poll(..) implementation.
[Kasireddy, Vivek] Given my limited understanding of virtio_gpu 3D/Virgl, I am
wondering why you'd need a new internal event associated with a fence; would
you not be able to accomplish the same by adding the out_fence_fd (from execbuf)
to your userspace's event loop (in addition to DRM fd) and get signalled?
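(For illustration, what I mean is something like the below on the userspace side --
just a sketch, assuming out_fence_fd is a sync_file fd, which reports POLLIN once
the fence signals:

#include <poll.h>

static int wait_for_drm_event_or_fence(int drm_fd, int out_fence_fd)
{
        struct pollfd fds[] = {
                { .fd = drm_fd,       .events = POLLIN },
                { .fd = out_fence_fd, .events = POLLIN },
        };

        /* returns when there is either a DRM event to read or the
         * fence has signaled */
        return poll(fds, 2, -1);
}
)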

Thanks,
Vivek

> 
> Signed-off-by: Gurchetan Singh 
> Acked-by: Nicholas Verne 
> ---
>  drivers/gpu/drm/virtio/virtgpu_drv.c   | 43 +-
>  drivers/gpu/drm/virtio/virtgpu_drv.h   |  7 +
>  drivers/gpu/drm/virtio/virtgpu_fence.c | 10 ++
>  drivers/gpu/drm/virtio/virtgpu_ioctl.c | 34 
>  4 files changed, 93 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/virtio/virtgpu_drv.c 
> b/drivers/gpu/drm/virtio/virtgpu_drv.c
> index 9d963f1fda8f..749db18dcfa2 100644
> --- a/drivers/gpu/drm/virtio/virtgpu_drv.c
> +++ b/drivers/gpu/drm/virtio/virtgpu_drv.c
> @@ -29,6 +29,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> 
>  #include 
>  #include 
> @@ -155,6 +157,35 @@ static void virtio_gpu_config_changed(struct 
> virtio_device
> *vdev)
>   schedule_work(&vgdev->config_changed_work);
>  }
> 
> +static __poll_t virtio_gpu_poll(struct file *filp,
> + struct poll_table_struct *wait)
> +{
> + struct drm_file *drm_file = filp->private_data;
> + struct virtio_gpu_fpriv *vfpriv = drm_file->driver_priv;
> + struct drm_device *dev = drm_file->minor->dev;
> + struct drm_pending_event *e = NULL;
> + __poll_t mask = 0;
> +
> + if (!vfpriv->ring_idx_mask)
> + return drm_poll(filp, wait);
> +
> + poll_wait(filp, &drm_file->event_wait, wait);
> +
> + if (!list_empty(&drm_file->event_list)) {
> + spin_lock_irq(&dev->event_lock);
> + e = list_first_entry(&drm_file->event_list,
> +  struct drm_pending_event, link);
> + drm_file->event_space += e->event->length;
> + list_del(&e->link);
> + spin_unlock_irq(&dev->event_lock);
> +
> + kfree(e);
> + mask |= EPOLLIN | EPOLLRDNORM;
> + }
> +
> + return mask;
> +}
> +
>  static struct virtio_device_id id_table[] = {
>   { VIRTIO_ID_GPU, VIRTIO_DEV_ANY_ID },
>   { 0 },
> @@ -194,7 +225,17 @@ MODULE_AUTHOR("Dave Airlie ");
>  MODULE_AUTHOR("Gerd Hoffmann ");
>  MODULE_AUTHOR("Alon Levy");
> 
> -DEFINE_DRM_GEM_FOPS(virtio_gpu_driver_fops);
> +static const struct file_operations virtio_gpu_driver_fops = {
> + .owner  = THIS_MODULE,
> + .open   = drm_open,
> + .release= drm_release,
> + .unlocked_ioctl = drm_ioctl,
> + .compat_ioctl   = drm_compat_ioctl,
> + .poll   = virtio_gpu_poll,
> + .read   = drm_read,
> + .llseek = noop_llseek,
> + .mmap   = drm_gem_mmap
> +};
> 
>  static const struct drm_driver driver = {
>   .driver_features = DRIVER_MODESET | DRIVER_GEM | DRIVER_RENDER |
> DRIVER_ATOMIC,
> diff --git a/drivers/gpu/drm/virtio/virtgpu_drv.h 
> b/drivers/gpu/drm/virtio/virtgpu_drv.h
> index cb60d52c2bd1..e0265fe74aa5 100644
> --- a/drivers/gpu/drm/virtio/virtgpu_drv.h
> +++ b/drivers/gpu/drm/virtio/virtgpu_drv.h
> @@ -138,11 +138,18 @@ struct virtio_gpu_fence_driver {
>   spinlock_t   lock;
>  };
> 
> +#define VIRTGPU_EVENT_FENCE_SIGNALED_INTERNAL 0x1000
> +struct virtio_gpu_fence_event {
> + struct drm_pending_event base;
> + struct drm_event event;
> +};
> +
>  struct virtio_gpu_fence {
>   struct dma_fence f;
>   uint32_t ring_idx;
>   uint64_t fence_id;
>   bool emit_fence_info;
> + struct virtio_gpu_fence_event *e;
>   struct virtio_gpu_fence_driver *drv;
>   struct list_head node;
>  };
> diff --git a/drivers/gpu/drm/virtio/virtgpu_fence.c
> b/drivers/gpu/drm/virtio/virtgpu_fence.c
> index 98a00c1e654d..f28357dbde35 100644
> --- a/drivers/gpu/drm/virtio/virtgpu_fence.c
> +++ b/drivers/gpu/drm/virtio/virtgpu_fence.c
> @@ -152,11 +152,21 @@ 

RE: [RFC v1 4/6] drm/virtio: Probe and implement VIRTIO_GPU_F_RELEASE_FENCE feature

2021-09-15 Thread Kasireddy, Vivek
Hi Gerd,

>   Hi,
> 
> > --- a/include/uapi/linux/virtio_gpu.h
> > +++ b/include/uapi/linux/virtio_gpu.h
> > @@ -60,6 +60,8 @@
> >   */
> >  #define VIRTIO_GPU_F_RESOURCE_BLOB   3
> >
> > +#define VIRTIO_GPU_F_RELEASE_FENCE  4
> > +
> >  enum virtio_gpu_ctrl_type {
> > VIRTIO_GPU_UNDEFINED = 0,
> 
> Where is the virtio-spec update for that?
[Kasireddy, Vivek] I was going to do that if there were a consensus over 
DRM_CAP_RELEASE_FENCE.
Otherwise, I don't think VIRTIO_GPU_F_RELEASE_FENCE is needed.

Thanks,
Vivek

> 
> thanks,
>   Gerd



RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-11 Thread Kasireddy, Vivek
Hi Michel,
 
> On 2021-08-10 10:30 a.m., Daniel Vetter wrote:
> > On Tue, Aug 10, 2021 at 08:21:09AM +, Kasireddy, Vivek wrote:
> >>> On Fri, Aug 06, 2021 at 07:27:13AM +0000, Kasireddy, Vivek wrote:
> >>>>>>>
> >>>>>>> Hence my gut feeling reaction that first we need to get these two
> >>>>>>> compositors aligned in their timings, which propobably needs
> >>>>>>> consistent vblank periods/timestamps across them (plus/minux
> >>>>>>> guest/host clocksource fun ofc). Without this any of the next steps
> >>>>>>> will simply not work because there's too much jitter by the time the
> >>>>>>> guest compositor gets the flip completion events.
> >>>>>> [Kasireddy, Vivek] Timings are not a problem and do not significantly
> >>>>>> affect the repaint cycles from what I have seen so far.
> >>>>>>
> >>>>>>>
> >>>>>>> Once we have solid events I think we should look into statically
> >>>>>>> tuning guest/host compositor deadlines (like you've suggested in a
> >>>>>>> bunch of places) to consisently make that deadline and hit 60 fps.
> >>>>>>> With that we can then look into tuning this automatically and what to
> >>>>>>> do when e.g. switching between copying and zero-copy on the host side
> >>>>>>> (which might be needed in some cases) and how to handle all that.
> >>>>>> [Kasireddy, Vivek] As I confirm here:
> >>> https://gitlab.freedesktop.org/wayland/weston/-
> >>>>> /issues/514#note_984065
> >>>>>> tweaking the deadlines works (i.e., we get 60 FPS) as we expect. 
> >>>>>> However,
> >>>>>> I feel that this zero-copy solution I am trying to create should be 
> >>>>>> independent
> >>>>>> of compositors' deadlines, delays or other scheduling parameters.
> >>>>>
> >>>>> That's not how compositors work nowadays. Your problem is that you don't
> >>>>> have the guest/host compositor in sync. zero-copy only changes the 
> >>>>> timing,
> >>>>> so it changes things from "rendering way too many frames" to "rendering
> >>>>> way too few frames".
> >>>>>
> >>>>> We need to fix the timing/sync issue here first, not paper over it with
> >>>>> hacks.
> >>>> [Kasireddy, Vivek] What I really meant is that the zero-copy solution 
> >>>> should be
> >>>> independent of the scheduling policies to ensure that it works with all 
> >>>> compositors.
> >>>>  IIUC, Weston for example uses the vblank/pageflip completion timestamp, 
> >>>> the
> >>>> configurable repaint-window value, refresh-rate, etc to determine when 
> >>>> to start
> >>>> its next repaint -- if there is any damage:
> >>>> timespec_add_nsec(&output->next_repaint, stamp, refresh_nsec);
> >>>> timespec_add_msec(&output->next_repaint, &output->next_repaint,
> >>>>  -compositor->repaint_msec);
> >>>>
> >>>> And, in the case of VKMS, since there is no real hardware, the timestamp 
> >>>> is always:
> >>>> now = ktime_get();
> >>>> send_vblank_event(dev, e, seq, now);
> >>>
> >>> vkms has been fixed since a while to fake high-precision timestamps like
> >>> from a real display.
> >> [Kasireddy, Vivek] IIUC, that might be one of the reasons why the Guest 
> >> does not need
> >> to have the same timestamp as that of the Host -- to work as expected.
> >>
> >>>
> >>>> When you say that the Guest/Host compositor need to stay in sync, are you
> >>>> suggesting that we need to ensure that the vblank timestamp on the Host
> >>>> needs to be shared and be the same on the Guest and a vblank/pageflip
> >>>> completion for the Guest needs to be sent at exactly the same time it is 
> >>>> sent
> >>>> on the Host? If yes, I'd say that we do send the pageflip completion to 
> >>>> Guest
> >>>> around the same time a vblank is generated on the Host but it does not 
> >>>> help
> >>>> because the Guest compositor would only have 9 ms to submit 

RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-11 Thread Kasireddy, Vivek
Hi Daniel,

> On Tue, Aug 10, 2021 at 08:21:09AM +0000, Kasireddy, Vivek wrote:
> > Hi Daniel,
> >
> > > On Fri, Aug 06, 2021 at 07:27:13AM +, Kasireddy, Vivek wrote:
> > > > Hi Daniel,
> > > >
> > > > > > > > >>> The solution:
> > > > > > > > >>> - To ensure full framerate, the Guest compositor has to 
> > > > > > > > >>> start it's repaint
> > > cycle
> > > > > > > (including
> > > > > > > > >>> the 9 ms wait) when the Host compositor sends the frame 
> > > > > > > > >>> callback event
> to
> > > its
> > > > > > > clients.
> > > > > > > > >>> In order for this to happen, the dma-fence that the Guest 
> > > > > > > > >>> KMS waits on -
> -
> > > before
> > > > > > > sending
> > > > > > > > >>> pageflip completion -- cannot be tied to a 
> > > > > > > > >>> wl_buffer.release event. This
> > > means
> > > > > that,
> > > > > > > the
> > > > > > > > >>> Guest compositor has to be forced to use a new buffer for 
> > > > > > > > >>> its next
> repaint
> > > cycle
> > > > > > > when it
> > > > > > > > >>> gets a pageflip completion.
> > > > > > > > >>
> > > > > > > > >> Is that really the only solution?
> > > > > > > > > [Kasireddy, Vivek] There are a few others I mentioned here:
> > > > > > > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > > > > > > > > But I think none of them are as compelling as this one.
> > > > > > > > >
> > > > > > > > >>
> > > > > > > > >> If we fix the event timestamps so that both guest and host 
> > > > > > > > >> use the same
> > > > > > > > >> timestamp, but then the guest starts 5ms (or something like 
> > > > > > > > >> that) earlier,
> > > > > > > > >> then things should work too? I.e.
> > > > > > > > >> - host compositor starts at (previous_frametime + 9ms)
> > > > > > > > >> - guest compositor starts at (previous_frametime + 4ms)
> > > > > > > > >>
> > > > > > > > >> Ofc this only works if the frametimes we hand out to both 
> > > > > > > > >> match
> _exactly_
> > > > > > > > >> and are as high-precision as the ones on the host side. 
> > > > > > > > >> Which for many
> gpu
> > > > > > > > >> drivers at least is the case, and all the ones you care 
> > > > > > > > >> about for sure :-)
> > > > > > > > >>
> > > > > > > > >> But if the frametimes the guest receives are the no_vblank 
> > > > > > > > >> fake ones,
> then
> > > > > > > > >> they'll be all over the place and this carefully tuned 
> > > > > > > > >> low-latency redraw
> > > > > > > > >> loop falls apart. Aside fromm the fact that without tuning 
> > > > > > > > >> the guests to
> > > > > > > > >> be earlier than the hosts, you're guaranteed to miss every 
> > > > > > > > >> frame (except
> > > > > > > > >> when the timing wobbliness in the guest is big enough by 
> > > > > > > > >> chance to make
> > > > > > > > >> the deadline on the oddball frame).
> > > > > > > > > [Kasireddy, Vivek] The Guest and Host use different event 
> > > > > > > > > timestamps as
> we
> > > don't
> > > > > > > > > share these between the Guest and the Host. It does not seem 
> > > > > > > > > to be causing
> any
> > > > > other
> > > > > > > > > problems so far but we did try the experiment you mentioned 
> > > > > > > > > (i.e.,
> adjus

RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-10 Thread Kasireddy, Vivek
Hi Daniel,

> On Fri, Aug 06, 2021 at 07:27:13AM +0000, Kasireddy, Vivek wrote:
> > Hi Daniel,
> >
> > > > > > >>> The solution:
> > > > > > >>> - To ensure full framerate, the Guest compositor has to start 
> > > > > > >>> it's repaint
> cycle
> > > > > (including
> > > > > > >>> the 9 ms wait) when the Host compositor sends the frame 
> > > > > > >>> callback event to
> its
> > > > > clients.
> > > > > > >>> In order for this to happen, the dma-fence that the Guest KMS 
> > > > > > >>> waits on --
> before
> > > > > sending
> > > > > > >>> pageflip completion -- cannot be tied to a wl_buffer.release 
> > > > > > >>> event. This
> means
> > > that,
> > > > > the
> > > > > > >>> Guest compositor has to be forced to use a new buffer for its 
> > > > > > >>> next repaint
> cycle
> > > > > when it
> > > > > > >>> gets a pageflip completion.
> > > > > > >>
> > > > > > >> Is that really the only solution?
> > > > > > > [Kasireddy, Vivek] There are a few others I mentioned here:
> > > > > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > > > > > > But I think none of them are as compelling as this one.
> > > > > > >
> > > > > > >>
> > > > > > >> If we fix the event timestamps so that both guest and host use 
> > > > > > >> the same
> > > > > > >> timestamp, but then the guest starts 5ms (or something like 
> > > > > > >> that) earlier,
> > > > > > >> then things should work too? I.e.
> > > > > > >> - host compositor starts at (previous_frametime + 9ms)
> > > > > > >> - guest compositor starts at (previous_frametime + 4ms)
> > > > > > >>
> > > > > > >> Ofc this only works if the frametimes we hand out to both match 
> > > > > > >> _exactly_
> > > > > > >> and are as high-precision as the ones on the host side. Which 
> > > > > > >> for many gpu
> > > > > > >> drivers at least is the case, and all the ones you care about 
> > > > > > >> for sure :-)
> > > > > > >>
> > > > > > >> But if the frametimes the guest receives are the no_vblank fake 
> > > > > > >> ones, then
> > > > > > >> they'll be all over the place and this carefully tuned 
> > > > > > >> low-latency redraw
> > > > > > >> loop falls apart. Aside fromm the fact that without tuning the 
> > > > > > >> guests to
> > > > > > >> be earlier than the hosts, you're guaranteed to miss every frame 
> > > > > > >> (except
> > > > > > >> when the timing wobbliness in the guest is big enough by chance 
> > > > > > >> to make
> > > > > > >> the deadline on the oddball frame).
> > > > > > > [Kasireddy, Vivek] The Guest and Host use different event 
> > > > > > > timestamps as we
> don't
> > > > > > > share these between the Guest and the Host. It does not seem to 
> > > > > > > be causing any
> > > other
> > > > > > > problems so far but we did try the experiment you mentioned 
> > > > > > > (i.e., adjusting
> the
> > > > > delays)
> > > > > > > and it works. However, this patch series is meant to fix the 
> > > > > > > issue without
> having to
> > > > > tweak
> > > > > > > anything (delays) because we can't do this for every compositor 
> > > > > > > out there.
> > > > > >
> > > > > > Maybe there could be a mechanism which allows the compositor in the 
> > > > > > guest to
> > > > > automatically adjust its repaint cycle as needed.
> > > > > >
> > > > > > This might even be possible without requiring changes in each 
> > > > > > compositor, by
> > > adjus

RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-06 Thread Kasireddy, Vivek
Hi Daniel,

> > > > >>> The solution:
> > > > >>> - To ensure full framerate, the Guest compositor has to start it's 
> > > > >>> repaint cycle
> > > (including
> > > > >>> the 9 ms wait) when the Host compositor sends the frame callback 
> > > > >>> event to its
> > > clients.
> > > > >>> In order for this to happen, the dma-fence that the Guest KMS waits 
> > > > >>> on -- before
> > > sending
> > > > >>> pageflip completion -- cannot be tied to a wl_buffer.release event. 
> > > > >>> This means
> that,
> > > the
> > > > >>> Guest compositor has to be forced to use a new buffer for its next 
> > > > >>> repaint cycle
> > > when it
> > > > >>> gets a pageflip completion.
> > > > >>
> > > > >> Is that really the only solution?
> > > > > [Kasireddy, Vivek] There are a few others I mentioned here:
> > > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > > > > But I think none of them are as compelling as this one.
> > > > >
> > > > >>
> > > > >> If we fix the event timestamps so that both guest and host use the 
> > > > >> same
> > > > >> timestamp, but then the guest starts 5ms (or something like that) 
> > > > >> earlier,
> > > > >> then things should work too? I.e.
> > > > >> - host compositor starts at (previous_frametime + 9ms)
> > > > >> - guest compositor starts at (previous_frametime + 4ms)
> > > > >>
> > > > >> Ofc this only works if the frametimes we hand out to both match 
> > > > >> _exactly_
> > > > >> and are as high-precision as the ones on the host side. Which for 
> > > > >> many gpu
> > > > >> drivers at least is the case, and all the ones you care about for 
> > > > >> sure :-)
> > > > >>
> > > > >> But if the frametimes the guest receives are the no_vblank fake 
> > > > >> ones, then
> > > > >> they'll be all over the place and this carefully tuned low-latency 
> > > > >> redraw
> > > > >> loop falls apart. Aside fromm the fact that without tuning the 
> > > > >> guests to
> > > > >> be earlier than the hosts, you're guaranteed to miss every frame 
> > > > >> (except
> > > > >> when the timing wobbliness in the guest is big enough by chance to 
> > > > >> make
> > > > >> the deadline on the oddball frame).
> > > > > [Kasireddy, Vivek] The Guest and Host use different event timestamps 
> > > > > as we don't
> > > > > share these between the Guest and the Host. It does not seem to be 
> > > > > causing any
> other
> > > > > problems so far but we did try the experiment you mentioned (i.e., 
> > > > > adjusting the
> > > delays)
> > > > > and it works. However, this patch series is meant to fix the issue 
> > > > > without having to
> > > tweak
> > > > > anything (delays) because we can't do this for every compositor out 
> > > > > there.
> > > >
> > > > Maybe there could be a mechanism which allows the compositor in the 
> > > > guest to
> > > automatically adjust its repaint cycle as needed.
> > > >
> > > > This might even be possible without requiring changes in each 
> > > > compositor, by
> adjusting
> > > the vertical blank periods in the guest to be aligned with the host 
> > > compositor repaint
> > > cycles. Not sure about that though.
> > > >
> > > > Even if not, both this series or making it possible to queue multiple 
> > > > flips require
> > > corresponding changes in each compositor as well to have any effect.
> > >
> > > Yeah from all the discussions and tests done it sounds even with a
> > > deeper queue we have big coordination issues between the guest and
> > > host compositor (like the example that the guest is now rendering at
> > > 90fps instead of 60fps like the host).
> > [Kasireddy, Vivek] Oh, I think you are referring to my reply to Gerd. That 
> > 90 FPS vs
> > 60 FPS problem is a comple

RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-04 Thread Kasireddy, Vivek
Hi Daniel,

> > >>> The solution:
> > >>> - To ensure full framerate, the Guest compositor has to start it's 
> > >>> repaint cycle
> (including
> > >>> the 9 ms wait) when the Host compositor sends the frame callback event 
> > >>> to its
> clients.
> > >>> In order for this to happen, the dma-fence that the Guest KMS waits on 
> > >>> -- before
> sending
> > >>> pageflip completion -- cannot be tied to a wl_buffer.release event. 
> > >>> This means that,
> the
> > >>> Guest compositor has to be forced to use a new buffer for its next 
> > >>> repaint cycle
> when it
> > >>> gets a pageflip completion.
> > >>
> > >> Is that really the only solution?
> > > [Kasireddy, Vivek] There are a few others I mentioned here:
> > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > > But I think none of them are as compelling as this one.
> > >
> > >>
> > >> If we fix the event timestamps so that both guest and host use the same
> > >> timestamp, but then the guest starts 5ms (or something like that) 
> > >> earlier,
> > >> then things should work too? I.e.
> > >> - host compositor starts at (previous_frametime + 9ms)
> > >> - guest compositor starts at (previous_frametime + 4ms)
> > >>
> > >> Ofc this only works if the frametimes we hand out to both match _exactly_
> > >> and are as high-precision as the ones on the host side. Which for many 
> > >> gpu
> > >> drivers at least is the case, and all the ones you care about for sure 
> > >> :-)
> > >>
> > >> But if the frametimes the guest receives are the no_vblank fake ones, 
> > >> then
> > >> they'll be all over the place and this carefully tuned low-latency redraw
> > >> loop falls apart. Aside fromm the fact that without tuning the guests to
> > >> be earlier than the hosts, you're guaranteed to miss every frame (except
> > >> when the timing wobbliness in the guest is big enough by chance to make
> > >> the deadline on the oddball frame).
> > > [Kasireddy, Vivek] The Guest and Host use different event timestamps as 
> > > we don't
> > > share these between the Guest and the Host. It does not seem to be 
> > > causing any other
> > > problems so far but we did try the experiment you mentioned (i.e., 
> > > adjusting the
> delays)
> > > and it works. However, this patch series is meant to fix the issue 
> > > without having to
> tweak
> > > anything (delays) because we can't do this for every compositor out there.
> >
> > Maybe there could be a mechanism which allows the compositor in the guest to
> automatically adjust its repaint cycle as needed.
> >
> > This might even be possible without requiring changes in each compositor, 
> > by adjusting
> the vertical blank periods in the guest to be aligned with the host 
> compositor repaint
> cycles. Not sure about that though.
> >
> > Even if not, both this series or making it possible to queue multiple flips 
> > require
> corresponding changes in each compositor as well to have any effect.
> 
> Yeah from all the discussions and tests done it sounds even with a
> deeper queue we have big coordination issues between the guest and
> host compositor (like the example that the guest is now rendering at
> 90fps instead of 60fps like the host).
[Kasireddy, Vivek] Oh, I think you are referring to my reply to Gerd. That 90 
FPS vs 
60 FPS problem is a completely different issue that is associated with Qemu GTK 
UI
backend. With the GTK backend -- and also with SDL backend -- we Blit the Guest
scanout FB onto one of the backbuffers managed by EGL. 

I am trying to add a new Qemu Wayland UI backend so that we can eliminate that 
Blit
and thereby have a truly zero-copy solution. And, this is where I am running 
into the 
halved frame-rate issue -- the current problem.

> 
> Hence my gut feeling reaction that first we need to get these two
> compositors aligned in their timings, which propobably needs
> consistent vblank periods/timestamps across them (plus/minux
> guest/host clocksource fun ofc). Without this any of the next steps
> will simply not work because there's too much jitter by the time the
> guest compositor gets the flip completion events.
[Kasireddy, Vivek] Timings are not a problem and do not significantly
affect the repaint cycles from what I have seen so far.

> 
> Once we have solid events I think

RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-04 Thread Kasireddy, Vivek
Hi Gerd,

> 
> > > virtio_gpu_primary_plane_update() will send RESOURCE_FLUSH only for
> > > DIRTYFB and both SET_SCANOUT + RESOURCE_FLUSH for page-flip, and I
> > > think for the page-flip case the host (aka qemu) doesn't get the
> > > "wait until old framebuffer is not in use any more" right yet.
> > [Kasireddy, Vivek] As you know, with the GTK UI backend and this patch 
> > series:
> > https://lists.nongnu.org/archive/html/qemu-devel/2021-06/msg06745.html
> > we do create a sync file fd -- after the Blit -- and wait (adding it to 
> > Qemu's main
> > event loop) for it to ensure that the Guest scanout FB is longer in use on 
> > the Host.
> > This mechanism works in a similarly way for both frontbuffer DIRTYFB case 
> > and
> > also the double-buffer case.
> 
> Well, we don't explicitly wait on the old framebuffer.  Not fully sure
> this is actually needed, maybe the command ordering (SET_SCANOUT goes
> first) is enough.
[Kasireddy, Vivek] When the sync file fd is signaled, the new FB can be 
considered done/free
on the Host; and, when this new FB becomes the old FB -- after another FB is 
submitted
by the Guest -- we don't need to explicitly wait as we already did that in the 
previous
cycle. 

Strictly speaking, in the double-buffered Guest case, we should be waiting for 
the
sync file fd of the old FB and not the new one. However, if we do this, we saw 
that
the Guest will render faster (~90 FPS) than what the Host can consume (~60 FPS)
resulting in unnecessary GPU cycles. And, in addition, we can't be certain about
whether a Guest is using double-buffering or single as we noticed that Windows
Guests tend to switch between single and double-buffering at runtime based on
the damage, etc.

> 
> > > So we'll need a host-side fix for that and a guest-side fix to switch
> > > from a blocking wait on the fence to vblank events.
> > [Kasireddy, Vivek] Do you see any concerns with the blocking wait?
> 
> Well, it's sync vs. async for userspace.
> 
> With the blocking wait the userspace ioctl (PAGE_FLIP or the atomic
> version of it) will return when the host is done.
> 
> Without the blocking wait the userspace ioctl will return right away and
> userspace can do something else until the host is done (and the vbland
> event is sent to notify userspace).
[Kasireddy, Vivek] Right, but upstream Weston -- and I am guessing Mutter as 
well -- 
almost always choose DRM_MODE_ATOMIC_NONBLOCK. In this case, the
atomic ioctl call would not block and the blocking wait will instead happen in 
the
commit_work/commit_tail workqueue thread.

> 
> > And, are you
> > suggesting that we use a vblank timer?
> 
> I think we should send the vblank event when the RESOURCE_FLUSH fence
> signals the host is done.
[Kasireddy, Vivek] That is how it works now:
drm_atomic_helper_commit_planes(dev, old_state, 0);

drm_atomic_helper_commit_modeset_enables(dev, old_state);

drm_atomic_helper_fake_vblank(old_state);

The blocking wait is in the plane_update hook called by 
drm_atomic_helper_commit_planes()
and immediately after that the fake vblank is sent.
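
For reference, the commit tail roughly mirrors the default drm_atomic_helper_commit_tail();
here is a minimal sketch of the flow described above (guest_kms_commit_tail() and
host_release_fence are made-up names for illustration, the drm_atomic_helper_* calls are
the real ones):

#include <drm/drm_atomic_helper.h>

static void guest_kms_commit_tail(struct drm_atomic_state *old_state)
{
        struct drm_device *dev = old_state->dev;

        /*
         * The primary plane_update hook does roughly:
         *   dma_fence_wait(bo->host_release_fence, false);
         *   ... queue SET_SCANOUT + RESOURCE_FLUSH to the host ...
         * so this call is where the blocking wait happens.
         */
        drm_atomic_helper_commit_planes(dev, old_state, 0);

        drm_atomic_helper_commit_modeset_enables(dev, old_state);

        /* no_vblank == true: userspace gets its pageflip completion here,
         * immediately after the blocking wait above. */
        drm_atomic_helper_fake_vblank(old_state);

        drm_atomic_helper_commit_hw_done(old_state);
        drm_atomic_helper_cleanup_planes(dev, old_state);
}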

Thanks,
Vivek
> 
> take care,
>   Gerd



RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-04 Thread Kasireddy, Vivek
Hi Michel,

> >
> >>> The goal:
> >>> - Maintain full framerate even when the Guest scanout FB is flipped onto 
> >>> a hardware
> >> plane
> >>> on the Host -- regardless of either compositor's scheduling policy -- 
> >>> without making
> any
> >>> copies and ensuring that both Host and Guest are not accessing the buffer 
> >>> at the same
> >> time.
> >>>
> >>> The problem:
> >>> - If the Host compositor flips the client's buffer (in this case Guest 
> >>> compositor's
> buffer)
> >>> onto a hardware plane, then it can send a wl_buffer.release event for the 
> >>> previous
> buffer
> >>> only after it gets a pageflip completion. And, if the Guest compositor 
> >>> takes 10-12 ms
> to
> >>> submit a new buffer and given the fact that the Host compositor waits 
> >>> only for 9 ms,
> the
> >>> Guest compositor will miss the Host's repaint cycle resulting in halved 
> >>> frame-rate.
> >>>
> >>> The solution:
> >>> - To ensure full framerate, the Guest compositor has to start it's 
> >>> repaint cycle
> (including
> >>> the 9 ms wait) when the Host compositor sends the frame callback event to 
> >>> its clients.
> >>> In order for this to happen, the dma-fence that the Guest KMS waits on -- 
> >>> before
> sending
> >>> pageflip completion -- cannot be tied to a wl_buffer.release event. This 
> >>> means that,
> the
> >>> Guest compositor has to be forced to use a new buffer for its next 
> >>> repaint cycle when
> it
> >>> gets a pageflip completion.
> >>
> >> Is that really the only solution?
> > [Kasireddy, Vivek] There are a few others I mentioned here:
> > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > But I think none of them are as compelling as this one.
> >
> >>
> >> If we fix the event timestamps so that both guest and host use the same
> >> timestamp, but then the guest starts 5ms (or something like that) earlier,
> >> then things should work too? I.e.
> >> - host compositor starts at (previous_frametime + 9ms)
> >> - guest compositor starts at (previous_frametime + 4ms)
> >>
> >> Ofc this only works if the frametimes we hand out to both match _exactly_
> >> and are as high-precision as the ones on the host side. Which for many gpu
> >> drivers at least is the case, and all the ones you care about for sure :-)
> >>
> >> But if the frametimes the guest receives are the no_vblank fake ones, then
> >> they'll be all over the place and this carefully tuned low-latency redraw
> >> loop falls apart. Aside fromm the fact that without tuning the guests to
> >> be earlier than the hosts, you're guaranteed to miss every frame (except
> >> when the timing wobbliness in the guest is big enough by chance to make
> >> the deadline on the oddball frame).
> > [Kasireddy, Vivek] The Guest and Host use different event timestamps as we 
> > don't
> > share these between the Guest and the Host. It does not seem to be causing 
> > any other
> > problems so far but we did try the experiment you mentioned (i.e., 
> > adjusting the delays)
> > and it works. However, this patch series is meant to fix the issue without 
> > having to tweak
> > anything (delays) because we can't do this for every compositor out there.
> 
> Maybe there could be a mechanism which allows the compositor in the guest to
> automatically adjust its repaint cycle as needed.
> 
> This might even be possible without requiring changes in each compositor, by 
> adjusting
> the vertical blank periods in the guest to be aligned with the host 
> compositor repaint
> cycles. Not sure about that though.
[Kasireddy, Vivek] The problem really is that the Guest compositor -- or any 
other compositor
for that matter -- assumes that after a pageflip completion, the old buffer 
submitted in the
previous flip is free and can be reused again. I think this is a guarantee 
given by KMS. If we have
to enforce this, we (Guest KMS) have to wait until the Host compositor sends a
wl_buffer.release -- which can only happen after the Host gets a pageflip completion,
assuming it uses hardware planes.
From this point onwards, the Guest compositor only has 9 ms (in the case of 
Weston) -- or less
based on the Host compositor's scheduling policy -- to submit a new frame.

Although, we can adjust the repaint-window of the Guest com

RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-03 Thread Kasireddy, Vivek
Hi Gerd,

> 
>   Hi,
> 
> > > That sounds sensible to me.  Fence the virtio commands, make sure (on
> > > the host side) the command completes only when the work is actually done
> > > not only submitted.  Has recently been added to qemu for RESOURCE_FLUSH
> > > (aka frontbuffer rendering) and doing the same for SET_SCANOUT (aka
> > > pageflipping), then send vblank events to userspace on command
> > > completion certainly makes sense.
> >
> > Hm how does this all work? At least drm/virtio uses
> > drm_atomic_helper_dirtyfb, so both DIRTYFB ioctl and atomic flips all end
> > up in the same driver path for everything. Or do you just combine the
> > resource_flush with the flip as needed and let the host side figure it all
> > out? From a quick read of virtgpu_plane.c that seems to be the case ...
> 
> virtio_gpu_primary_plane_update() will send RESOURCE_FLUSH only for
> DIRTYFB and both SET_SCANOUT + RESOURCE_FLUSH for page-flip, and I
> think for the page-flip case the host (aka qemu) doesn't get the
> "wait until old framebuffer is not in use any more" right yet.
[Kasireddy, Vivek] As you know, with the GTK UI backend and this patch series: 
https://lists.nongnu.org/archive/html/qemu-devel/2021-06/msg06745.html
we do create a sync file fd -- after the Blit -- and wait (adding it to Qemu's main
event loop) for it to ensure that the Guest scanout FB is no longer in use on the Host.
This mechanism works in a similar way for both the frontbuffer DIRTYFB case and
also the double-buffer case.

The out_fence work is only relevant for the future Wayland UI backend though.
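
For reference, a bare-bones sketch of that mechanism, assuming the
EGL_ANDROID_native_fence_sync extension (as in the Qemu series linked above);
gd_hw_gl_flushed() stands in for whatever callback eventually tells the guest the FB is
free and is hand-waved here:

#include <epoxy/gl.h>
#include <epoxy/egl.h>
#include "qemu/main-loop.h"     /* qemu_set_fd_handler() */

static void gd_hw_gl_flushed(void *opaque);

static void gd_create_fence_and_wait(EGLDisplay dpy, void *opaque)
{
    /* Fence the Blit that was just submitted ... */
    EGLSyncKHR sync = eglCreateSyncKHR(dpy, EGL_SYNC_NATIVE_FENCE_ANDROID, NULL);
    glFlush();

    /* ... and export it as a sync file fd. */
    int fence_fd = eglDupNativeFenceFDANDROID(dpy, sync);

    /* The fd becomes readable only when the GPU work is done; the handler
     * then completes the fenced RESOURCE_FLUSH so the guest knows its
     * scanout FB is no longer in use on the Host. */
    qemu_set_fd_handler(fence_fd, gd_hw_gl_flushed, NULL, opaque);
}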

> 
> So we'll need a host-side fix for that and a guest-side fix to switch
> from a blocking wait on the fence to vblank events.
[Kasireddy, Vivek] Do you see any concerns with the blocking wait? And, are you
suggesting that we use a vblank timer? Not sure if that would be needed because it
would not align with the render/draw signals used with GTK. And, the DRM core
does send out an event -- immediately after the blocking wait -- to the Guest
compositor since no_vblank=true.

> 
> > Also to make this work we don't just need the fence, we need the timestamp
> > (in a clock domain the guest can correct for ofc) of the host side kms
> > driver flip completion. If you just have the fence then the jitter from
> > going through all the layers will most likely make it unusable.
> 
> Well, there are no timestamps in the virtio-gpu protocol ...
> 
> Also I'm not sure they would be that helpful, any timing is *much* less
> predictable in a virtual machine, especially in case the host machine is
> loaded.
[Kasireddy, Vivek] I agree; I think sharing the Host timestamps with the Guest 
or 
vice-versa may not be useful. We have not run into any problems without these 
so far.

Thanks,
Vivek

> 
> take care,
>   Gerd



RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-03 Thread Kasireddy, Vivek
Hi Daniel,

> > > > By separating the OUT_FENCE signalling from pageflip completion allows
> > > > a Guest compositor to start a new repaint cycle with a new buffer
> > > > instead of waiting for the old buffer to be free.
> > > >
> > > > This work is based on the idea/suggestion from Simon and Pekka.
> > > >
> > > > This capability can be a solution for this issue:
> > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514
> > > >
> > > > Corresponding Weston MR:
> > > > https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
> > >
> > > Uh I kinda wanted to discuss this a bit more before we jump into typing
> > > code, but well I guess not that much work yet.
> > [Kasireddy, Vivek] Right, it wasn't a lot of work :)
> >
> > >
> > > So maybe I'm not understanding the problem, but I think the fundamental
> > > underlying issue is that with KMS you can have at most 2 buffers
> > > in-flight, due to our queue depth limit of 1 pending flip.
> > [Kasireddy, Vivek] Let me summarize the problem again from the perspective 
> > of
> > both the Host (Weston) and Guest (Weston) compositors assuming a 
> > refresh-rate
> > of 60 -- which implies the Vblank/Vsync is generated every ~16.66 ms.
> > Host compositor:
> > - After a pageflip completion event, it starts its next repaint cycle by 
> > waiting for 9 ms
> > and then submits the atomic commit and at the tail end of its cycle sends a 
> > frame
> callback
> > event to all its clients (who registered and submitted frames) indicating 
> > to them to
> > start their next redraw  -- giving them at-least ~16 ms to submit a new 
> > frame to be
> > included in its next repaint. Why a configurable 9 ms delay is needed is 
> > explained
> > in Pekka's blog post here:
> > https://ppaalanen.blogspot.com/2015/02/weston-repaint-scheduling.html
> >
> > - It'll send a wl_buffer.release event for a client submitted previous 
> > buffer only
> > when the client has submitted a new buffer and:
> > a) When it hasn't started its repaint cycle yet OR
> > b) When it clears its old state after it gets a pageflip completion event 
> > -- if it had
> > flipped the client's buffer onto a hardware plane.
> >
> > Guest compositor:
> > - After a pageflip completion is sent by Guest KMS, it takes about 10-12 ms 
> > for the
> > Guest compositor to submit a new atomic commit. This time of 10-12 ms 
> > includes the
> > 9 ms wait -- just like the Host compositor -- for its clients to submit new 
> > buffers.
> > - When it gets a pageflip completion, it assumes that the previously 
> > submitted buffer
> > is free for re-use and uses it again -- resulting in the usage of only 2 
> > out of a maximum
> > of 4 backbuffers included as part of the Mesa GBM surface implementation.
> >
> > Guest KMS/Virtio-gpu/Qemu Wayland UI:
> > - Because no_vblank=true for Guest KMS and since the vblank event (which 
> > also serves
> > as the pageflip completion event for user-space) is sent right away after 
> > atomic commit,
> > as Gerd said, we use an internal dma-fence to block/wait the Guest KMS 
> > until we know
> for
> > sure that the Host is completely done using the buffer. To ensure this, we 
> > signal the
> dma-fence
> > only after the Host compositor sends a wl_buffer.release event or an 
> > equivalent signal.
> >
> > The goal:
> > - Maintain full framerate even when the Guest scanout FB is flipped onto a 
> > hardware
> plane
> > on the Host -- regardless of either compositor's scheduling policy -- 
> > without making any
> > copies and ensuring that both Host and Guest are not accessing the buffer 
> > at the same
> time.
> >
> > The problem:
> > - If the Host compositor flips the client's buffer (in this case Guest 
> > compositor's buffer)
> > onto a hardware plane, then it can send a wl_buffer.release event for the 
> > previous buffer
> > only after it gets a pageflip completion. And, if the Guest compositor 
> > takes 10-12 ms to
> > submit a new buffer and given the fact that the Host compositor waits only 
> > for 9 ms, the
> > Guest compositor will miss the Host's repaint cycle resulting in halved 
> > frame-rate.
> >
> > The solution:
> > - To ensure full framerate, the Guest compositor has to start it's repaint 
> > cycle (including
> > the 9 ms wait) when the Host compositor sends the frame

RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-02 Thread Kasireddy, Vivek
Hi Daniel,

> 
> On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote:
> > By separating the OUT_FENCE signalling from pageflip completion allows
> > a Guest compositor to start a new repaint cycle with a new buffer
> > instead of waiting for the old buffer to be free.
> >
> > This work is based on the idea/suggestion from Simon and Pekka.
> >
> > This capability can be a solution for this issue:
> > https://gitlab.freedesktop.org/wayland/weston/-/issues/514
> >
> > Corresponding Weston MR:
> > https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
> 
> Uh I kinda wanted to discuss this a bit more before we jump into typing
> code, but well I guess not that much work yet.
[Kasireddy, Vivek] Right, it wasn't a lot of work :)

> 
> So maybe I'm not understanding the problem, but I think the fundamental
> underlying issue is that with KMS you can have at most 2 buffers
> in-flight, due to our queue depth limit of 1 pending flip.
[Kasireddy, Vivek] Let me summarize the problem again from the perspective of
both the Host (Weston) and Guest (Weston) compositors assuming a refresh-rate
of 60 -- which implies the Vblank/Vsync is generated every ~16.66 ms.
Host compositor:
- After a pageflip completion event, it starts its next repaint cycle by 
waiting for 9 ms
and then submits the atomic commit and at the tail end of its cycle sends a 
frame callback
event to all its clients (who registered and submitted frames) indicating to them to
start their next redraw -- giving them at least ~16 ms to submit a new frame to be
included in its next repaint. Why a configurable 9 ms delay is needed is 
explained
in Pekka's blog post here:
https://ppaalanen.blogspot.com/2015/02/weston-repaint-scheduling.html

- It'll send a wl_buffer.release event for a client submitted previous buffer 
only
when the client has submitted a new buffer and:
a) When it hasn't started its repaint cycle yet OR
b) When it clears its old state after it gets a pageflip completion event -- if 
it had
flipped the client's buffer onto a hardware plane.

Guest compositor:
- After a pageflip completion is sent by Guest KMS, it takes about 10-12 ms for 
the 
Guest compositor to submit a new atomic commit. This time of 10-12 ms includes 
the
9 ms wait -- just like the Host compositor -- for its clients to submit new 
buffers.
- When it gets a pageflip completion, it assumes that the previously submitted 
buffer
is free for re-use and uses it again -- resulting in the usage of only 2 out of 
a maximum
of 4 backbuffers included as part of the Mesa GBM surface implementation.

Guest KMS/Virtio-gpu/Qemu Wayland UI:
- Because no_vblank=true for Guest KMS and since the vblank event (which also 
serves
as the pageflip completion event for user-space) is sent right away after 
atomic commit,
as Gerd said, we use an internal dma-fence to block/wait the Guest KMS until we 
know for
sure that the Host is completely done using the buffer. To ensure this, we 
signal the dma-fence
only after the Host compositor sends a wl_buffer.release event or an equivalent 
signal.
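
A skeletal sketch of the Host-side half of this (the Wayland UI backend does not exist
yet, so this is purely illustrative; wl_buffer_add_listener() is the real
libwayland-client API while virtio_gpu_signal_scanout_fence() is a made-up stand-in for
however the fenced command gets completed in Qemu):

static void scanout_buffer_released(void *data, struct wl_buffer *wl_buffer)
{
    /* The Host compositor no longer reads the guest FB, so the internal
     * dma-fence that the Guest KMS is blocked on can signal now. */
    virtio_gpu_signal_scanout_fence(data);
}

static const struct wl_buffer_listener scanout_buffer_listener = {
    .release = scanout_buffer_released,
};

/* ... after attaching/committing the guest dmabuf as a wl_buffer: */
wl_buffer_add_listener(buf, &scanout_buffer_listener, g);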

The goal:
- Maintain full framerate even when the Guest scanout FB is flipped onto a 
hardware plane
on the Host -- regardless of either compositor's scheduling policy -- without 
making any
copies and ensuring that both Host and Guest are not accessing the buffer at 
the same time.

The problem:
- If the Host compositor flips the client's buffer (in this case Guest 
compositor's buffer) 
onto a hardware plane, then it can send a wl_buffer.release event for the 
previous buffer
only after it gets a pageflip completion. And, if the Guest compositor takes 
10-12 ms to
submit a new buffer and given the fact that the Host compositor waits only for 
9 ms, the
Guest compositor will miss the Host's repaint cycle resulting in halved 
frame-rate.

The solution:
- To ensure full framerate, the Guest compositor has to start its repaint cycle (including
the 9 ms wait) when the Host compositor sends the frame callback event to its clients.
In order for this to happen, the dma-fence that the Guest KMS waits on -- before sending
pageflip completion -- cannot be tied to a wl_buffer.release event. This means that the
Guest compositor has to be forced to use a new buffer for its next repaint cycle when it
gets a pageflip completion.
- The Weston MR I linked above does this by getting an out_fence fd and taking a reference
on all the FBs included in the atomic commit, forcing the compositor to use new FBs for its
next repaint cycle. It releases the references when the out_fence is signalled later, i.e.,
when the Host compositor sends a wl_buffer.release event.
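
In code terms, a loose sketch of that approach (fb_ref_all_committed() and
handle_out_fence() are hypothetical helpers; OUT_FENCE_PTR, drmModeAtomicAddProperty()
and wl_event_loop_add_fd() are the real libdrm/libwayland-server APIs):

static int commit_and_defer_buffer_release(int drm_fd, drmModeAtomicReq *req,
                                           uint32_t crtc_id, uint32_t out_fence_ptr_prop,
                                           struct wl_event_loop *loop)
{
        int out_fence_fd = -1;

        /* Ask KMS for a fence that only signals once the Host compositor has
         * really released the buffer (i.e. on wl_buffer.release, per this RFC). */
        drmModeAtomicAddProperty(req, crtc_id, out_fence_ptr_prop,
                                 (uint64_t)(uintptr_t)&out_fence_fd);

        if (drmModeAtomicCommit(drm_fd, req, DRM_MODE_ATOMIC_NONBLOCK, NULL))
                return -1;

        /* Pin every FB in this commit so the next repaint is forced to pick
         * fresh buffers instead of reusing these. */
        fb_ref_all_committed();

        /* Drop those references only when the out-fence signals. */
        wl_event_loop_add_fd(loop, out_fence_fd, WL_EVENT_READABLE,
                             handle_out_fence, NULL);
        return 0;
}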

> 
> Unfortunately that means for virtual hw where it takes a few more
> steps/vblanks until the framebuffer actually shows up on screen and is
> scanned out, we suffer deeply. The usual fix for that is to drop the
> latency and increa

RE: [RFC v1 4/4] drm/virtio: Probe and implement VIRTIO_GPU_F_OUT_FENCE feature

2021-07-29 Thread Kasireddy, Vivek
Hi Gerd,

> 
>   Hi,
> 
> > +   bool has_out_fence;
> 
> > +   if (virtio_has_feature(vgdev->vdev, VIRTIO_GPU_F_OUT_FENCE)) {
> > +   vgdev->has_out_fence = true;
> > +   vgdev->ddev->mode_config.deferred_out_fence = true;
> 
> Looks like you don't need has_out_fence, you can just use
> vgdev->ddev->mode_config.deferred_out_fence instead.
[Kasireddy, Vivek] Right, I don't need has_out_fence; will fix it.

Thanks,
Vivek
> 
> take care,
>   Gerd



RE: [RFC v1 2/4] virtio-gpu uapi: Add VIRTIO_GPU_F_OUT_FENCE feature

2021-07-29 Thread Kasireddy, Vivek
Hi Gerd,

> 
> On Thu, Jul 29, 2021 at 01:16:57AM -0700, Vivek Kasireddy wrote:
> > This feature enables the Guest to wait to know when a resource
> > is completely consumed by the Host.
> 
> virtio spec update?
> 
> What are the exact semantics?
[Kasireddy, Vivek] As of now, this is still an RFC version. If everyone (Weston
upstream, drm upstream and you) agrees that this is a reasonable way to
solve https://gitlab.freedesktop.org/wayland/weston/-/issues/514, then I'd go
ahead and send out the spec updates and cleaner versions of these patches --
with more documentation.

> 
> Why a new command?  Can't you simply fence one of the commands sent
> anyway (set_scanout probably for page-flip updates)?
[Kasireddy, Vivek] Yes, I think I could add a fence (and an out_fence) to 
set-scanout-blob. 

> 
> (feature flag is probably needed even in case we don't need a new
> command to make sure the host sends the completion when processing
> the command is actually done, i.e. in case of qemu the recently added
> fence support is there).
[Kasireddy, Vivek] The recently added fence support was for resource_flush and
specifically for the GTK UI or similar backends. I tried using the same mechanism for
the Wayland UI backend but ran into the above Weston issue. This feature (OUT_FENCE)
is a potential solution for that issue.

Thanks,
Vivek
> 
> take care,
>   Gerd



RE: [PATCH 1/3] virtio-gpu uapi: Add VIRTIO_GPU_F_EXPLICIT_FLUSH feature

2021-05-24 Thread Kasireddy, Vivek
Hi Gerd,
Any further comments on this?

Thanks,
Vivek

> 
> Hi Gerd,
> 
> > > [Kasireddy, Vivek] Correct, that is exactly what I want -- make the
> > > Guest wait until it gets notified that the Host is completely done 
> > > processing/using the
> fb.
> > > However, there can be two resources the guest can be made to wait
> > > on: wait for the new/current fb that is being submitted to be
> > > processed (explicit flush)
> >
> > That would be wait on resource_flush case, right?
> [Kasireddy, Vivek] Yes, correct.
> 
> >
> > > or wait for the previous fb that was submitted earlier (in the
> > > previous repaint cycle) to be processed (explicit sync).
> >
> > That would be the wait on set_scanout case, right?
> [Kasireddy, Vivek] Right.
> 
> >
> > And it would effectively wait on the previous fb not being needed by
> > the host any more (because the page-flip to the new fb completed) so
> > the guest can re-use the previous fb to render the next frame, right?
> [Kasireddy, Vivek] Yup.
> 
> >
> > (also when doing front-buffer rendering with xorg/fbcon and then doing
> > a virtual console switch the guest could wait for the console switch
> > being completed).
> >
> > > IIUC, Explicit sync only makes sense if 1) the Host windowing system
> > > also supports that feature/protocol (currently only upstream Weston
> > > does but I'd like to add it to Mutter if no one else does) or if
> > > there is a way to figure out (dma-buf sync file?) if the Host has
> > > completely processed the fb and 2) if Qemu UI is not doing a blit and 
> > > instead
> submitting the guest fb/dmabuf directly to the Host windowing system.
> > > As you are aware, 2) can possibly be done with dbus/pipewire Qemu UI
> > > backends (I'll explore this soon) but not with GTK or SDL.
> >
> > Well, I think we need to clearly define the wait flag semantics.
> [Kasireddy, Vivek] At-least with our passthrough use-case (maybe not with 
> Virgl), I think
> we need to ensure the following criteria:
> 1) With Blobs, ensure that the Guest and Host would never use the dmabuf/FB 
> at the same
> time.
> 2) The Guest should not render more frames than the refresh rate of the Host 
> so that GPU
> resources are not wasted.
> 
> > Should resource_flush with wait flag wait until the host is done
> > reading the resource (blit done)?
> [Kasireddy, Vivek] I started with this but did not find it useful as it did 
> not meet
> 2) above. However, I think we could have a flag for this if the Guest is 
> using a virtual
> vblank/timer and only wants to wait until the blit is done.
> 
> > Or should it wait until the host screen has been updated (gtk draw
> > callback completed)?
> [Kasireddy, Vivek] This is what the last 7 patches of my Blob series (v3) do. 
> So, we'd want
> to have a separate flag for this as well. And, lastly, we are going to need 
> another flag for
> the set_scanout case where we wait for the previous fb to be synchronized.
> 
> >
> > Everything else will be a host/guest implementation detail then, and
> > of course this needs some integration with the UI on the host side and
> > different UIs might have to do different things.
> [Kasireddy, Vivek] Sure, I think we can start with GTK and go from there.
> 
> >
> > On the guest side integrating this with fences will give us enough
> > flexibility on how we want handle the waits.  Simplest would be to
> > just block.
> [Kasireddy, Vivek] I agree; simply blocking (dma_fence_wait) is more than 
> enough for
> most use-cases.
> 
> >We could implement virtual vblanks, which would probably make  most
> >userspace work fine without explicit virtio-gpu support.  If needed  we
> >could even give userspace access to the fence so it can choose how to
> >wait.
> [Kasireddy, Vivek] Virtual vblanks is not a bad idea but I think blocking 
> with fences in the
> Guest kernel space seems more simpler. And, sharing fences with the Guest 
> compositor is
> also very interesting but I suspect we might need to modify the compositor 
> for this use-
> case, which might be a non-starter. Lastly, even with virtual vblanks, we 
> still need to make
> sure that we meet the two criteria mentioned above.
> 
> Thanks,
> Vivek



RE: [PATCH 1/3] virtio-gpu uapi: Add VIRTIO_GPU_F_EXPLICIT_FLUSH feature

2021-05-17 Thread Kasireddy, Vivek
Hi Gerd,

> > [Kasireddy, Vivek] Correct, that is exactly what I want -- make the Guest 
> > wait
> > until it gets notified that the Host is completely done processing/using 
> > the fb.
> > However, there can be two resources the guest can be made to wait on: wait 
> > for
> > the new/current fb that is being submitted to be processed (explicit flush)
> 
> That would be wait on resource_flush case, right?
[Kasireddy, Vivek] Yes, correct.

> 
> > or wait for the previous fb that was submitted earlier (in the
> > previous repaint cycle) to be processed (explicit sync).
> 
> That would be the wait on set_scanout case, right?
[Kasireddy, Vivek] Right.

> 
> And it would effectively wait on the previous fb not being needed by the
> host any more (because the page-flip to the new fb completed) so the
> guest can re-use the previous fb to render the next frame, right?
[Kasireddy, Vivek] Yup.

> 
> (also when doing front-buffer rendering with xorg/fbcon and then doing a
> virtual console switch the guest could wait for the console switch being
> completed).
> 
> > IIUC, Explicit sync only makes sense if 1) the Host windowing system also 
> > supports
> > that feature/protocol (currently only upstream Weston does but I'd like to 
> > add it to
> > Mutter if no one else does) or if there is a way to figure out (dma-buf 
> > sync file?) if
> > the Host has completely processed the fb and 2) if Qemu UI is not doing a 
> > blit and
> > instead submitting the guest fb/dmabuf directly to the Host windowing 
> > system.
> > As you are aware, 2) can possibly be done with dbus/pipewire Qemu UI 
> > backends
> > (I'll explore this soon) but not with GTK or SDL.
> 
> Well, I think we need to clearly define the wait flag semantics. 
[Kasireddy, Vivek] At least with our passthrough use-case (maybe not with Virgl),
I think we need to ensure the following criteria:
1) With Blobs, ensure that the Guest and Host would never use the dmabuf/FB at
the same time. 
2) The Guest should not render more frames than the refresh rate of the Host so
that GPU resources are not wasted.

> Should resource_flush with wait flag wait until the host is done reading the
> resource (blit done)?
[Kasireddy, Vivek] I started with this but did not find it useful as it did not 
meet
2) above. However, I think we could have a flag for this if the Guest is using a
virtual vblank/timer and only wants to wait until the blit is done.

> Or should it wait until the host screen has been
> updated (gtk draw callback completed)?
[Kasireddy, Vivek] This is what the last 7 patches of my Blob series (v3) do. 
So,
we'd want to have a separate flag for this as well. And, lastly, we are going to
need another flag for the set_scanout case where we wait for the previous
fb to be synchronized.

> 
> Everything else will be a host/guest implementation detail then, and
> of course this needs some integration with the UI on the host side and
> different UIs might have to do different things.
[Kasireddy, Vivek] Sure, I think we can start with GTK and go from there.

> 
> On the guest side integrating this with fences will give us enough
> flexibility on how we want handle the waits.  Simplest would be to just
> block. 
[Kasireddy, Vivek] I agree; simply blocking (dma_fence_wait) is more than
enough for most use-cases.

>We could implement virtual vblanks, which would probably make
> most userspace work fine without explicit virtio-gpu support.  If needed
> we could even give userspace access to the fence so it can choose how to
> wait.
[Kasireddy, Vivek] Virtual vblanks are not a bad idea but I think blocking with
fences in the Guest kernel space seems simpler. And, sharing fences with
the Guest compositor is also very interesting but I suspect we might need to
modify the compositor for this use-case, which might be a non-starter. Lastly,
even with virtual vblanks, we still need to make sure that we meet the two
criteria mentioned above.

Thanks,
Vivek



RE: [PATCH 1/3] virtio-gpu uapi: Add VIRTIO_GPU_F_EXPLICIT_FLUSH feature

2021-05-12 Thread Kasireddy, Vivek
Hi Gerd,

> > However, as part of this feature (explicit flush), I'd like to make the 
> > Guest wait until
> > the current resource (as specified by resource_flush or set_scanout) is 
> > flushed or
> > synchronized. But for a different feature I am thinking of (explicit sync), 
> > I'd like to
> > make the Guest wait for the previous buffer/resource submitted (available 
> > via
> > old_state->fb).
> 
> For page-flipping I guess?  i.e. you want submit a new framebuffer, then
> wait until the host doesn't need the previous one?  That is likewise
> linked to a command, although it is set_scanout this time.
[Kasireddy, Vivek] Mainly for page-flipping, but I'd also like fbcon and Xorg, which
do frontbuffer rendering/updates, to work seamlessly as well.

> 
> So, right now qemu simply queues the request and completes the command
> when a guest sends a resource_flush our set_scanout command.  You want
> be notified when the host is actually done processing the request.
[Kasireddy, Vivek] Correct, that is exactly what I want -- make the Guest wait
until it gets notified that the Host is completely done processing/using the fb.
However, there can be two resources the guest can be made to wait on: wait for
the new/current fb that is being submitted to be processed (explicit flush) or 
wait
for the previous fb that was submitted earlier (in the previous repaint cycle)
to be processed (explicit sync).

IIUC, Explicit sync only makes sense if 1) the Host windowing system also 
supports
that feature/protocol (currently only upstream Weston does but I'd like to add 
it to
Mutter if no one else does) or if there is a way to figure out (dma-buf sync 
file?) if
the Host has completely processed the fb and 2) if Qemu UI is not doing a blit 
and
instead submitting the guest fb/dmabuf directly to the Host windowing system.
As you are aware, 2) can possibly be done with dbus/pipewire Qemu UI backends
(I'll explore this soon) but not with GTK or SDL. 

Ideally, I'd like to have Explicit sync but until 2) above happens, Explicit 
flush can
be a reasonable alternative given the blit that happens with GTK/SDL backends. 
By making the Guest wait until the UI has submitted the buffer to the Host 
windowing system, we can theoretically tie the Guest repaint freq to the Host
vblank and thus have Guest apps render at 60 FPS by default instead of the 90 FPS
that I see without explicit flush. This feature would also help Windows guests that
can only do frontbuffer rendering with Blobs.

> 
> I still think it makes sense extend the resource_flush and set_scanout
> commands for that, for example by adding a flag for the flags field in
> the request header.  That way it is clear what exactly you are waiting
> for.  You can also attach a fence to the request which you can later
> wait on.
[Kasireddy, Vivek] Yeah, I am reluctant to add a new cmd as well but I want
it to work for all Guests and Hosts. Anyway, let me try your idea of adding
specific flags and see if it works.
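
For illustration only, the direction being discussed could look roughly like this
(VIRTIO_GPU_FLAG_WAIT_HOST is a hypothetical flag name and nothing is final; struct
virtio_gpu_ctrl_hdr and VIRTIO_GPU_FLAG_FENCE are the existing uapi):

/* hypothetical: "complete this command only when the host is fully done" */
#define VIRTIO_GPU_FLAG_WAIT_HOST (1 << 1)

/* guest side, when building a SET_SCANOUT or RESOURCE_FLUSH request: */
hdr->flags   |= cpu_to_le32(VIRTIO_GPU_FLAG_FENCE | VIRTIO_GPU_FLAG_WAIT_HOST);
hdr->fence_id = cpu_to_le64(fence_id);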

Thanks,
Vivek

> 
> take care,
>   Gerd



RE: [PATCH 1/3] virtio-gpu uapi: Add VIRTIO_GPU_F_EXPLICIT_FLUSH feature

2021-05-11 Thread Kasireddy, Vivek
Hi Gerd,

> On Tue, May 11, 2021 at 01:36:08AM -0700, Vivek Kasireddy wrote:
> > This feature enables the Guest to wait until a flush has been
> > performed on a buffer it has submitted to the Host.
> 
> This needs a virtio-spec update documenting the new feature.
[Kasireddy, Vivek] Yes, I was planning to do that after getting your 
thoughts on this feature.

> 
> > +   VIRTIO_GPU_CMD_WAIT_FLUSH,
> 
> Why a new command?
> 
> If I understand it correctly you want wait until
> VIRTIO_GPU_CMD_RESOURCE_FLUSH is done.  We could
> extend the VIRTIO_GPU_CMD_RESOURCE_FLUSH command
> for that instead.
[Kasireddy, Vivek] VIRTIO_GPU_CMD_RESOURCE_FLUSH can trigger/queue a
redraw that may be performed synchronously or asynchronously depending on the
UI (GLArea is async and gtk-egl is sync but can be made async). I'd like to make the
Guest wait until the actual redraw happens (until glFlush or eglSwapBuffers, again
depending on the UI).

However, as part of this feature (explicit flush), I'd like to make the Guest 
wait until
the current resource (as specified by resource_flush or set_scanout) is flushed 
or
synchronized. But for a different feature I am thinking of (explicit sync), I'd 
like to
make the Guest wait for the previous buffer/resource submitted (available via 
old_state->fb).

I think it may be possible to accomplish both features by overloading 
resource_flush
but given the various combinations of Guests (Android/Chrome OS, Windows, Linux)
and Hosts (Android/Chrome OS, Linux) that are or will be supported with 
virtio-gpu +
i915, I figured adding a new command might be cleaner.

Thanks,
Vivek


> 
> take care,
>   Gerd



RE: [PATCH 1/2] drm/virtio: Create Dumb BOs as guest Blobs

2021-03-31 Thread Kasireddy, Vivek
Hi Gerd,

> > If support for Blob resources is available, then dumb BOs created by
> > the driver can be considered as guest Blobs. And, for guest Blobs,
> > there is no need to do any transfers or flushes
> 
> No.  VIRTGPU_BLOB_FLAG_USE_SHAREABLE means the host (aka device in virtio
> terms) *can* create a shared mapping.  So, the guest sends still needs to 
> send transfer
> commands, and then the device can shortcut the transfer commands on the host 
> side in
> case a shared mapping exists.
[Kasireddy, Vivek] Ok. IIUC, are you saying that the device may or may not 
create a shared
mapping (meaning res->image) and that the driver should not make any 
assumptions about
that and thus still do the transfers and flushes?

Also, could you please briefly explain what VIRTIO_GPU_BLOB_FLAG_USE_MAPPABLE
means, given that the spec does not describe these blob_flags clearly? This is
what the spec says:

"The driver MUST inform the device if the blob resource is used for
memory access, sharing between driver instances and/or sharing with
other devices. This is done via the \field{blob_flags} field."

And, what should be the default blob_flags value for a dumb bo if the userspace 
does not
specify them?
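
For context, the guest-side change under discussion boils down to roughly this sketch
(abridged; the exact structure/field names are from my patch and may still change):

/* in virtio_gpu_mode_dumb_create(), when the device supports blob resources */
if (vgdev->has_resource_blob) {
        params.blob_mem   = VIRTGPU_BLOB_MEM_GUEST;
        /* the host *can* create a shared mapping for it; flushes are still
         * needed for dirty tracking, as you point out */
        params.blob_flags = VIRTGPU_BLOB_FLAG_USE_SHAREABLE;
        params.blob       = true;
}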

> 
> flush commands are still needed for dirty tracking.
> 
> > but we do need to do set_scanout even if the FB has not changed as
> > part of plane updates.
> 
> Sounds like you workaround host bugs.  This should not be needed with properly
> implemented flush.
[Kasireddy, Vivek] With the patches I tested with:
https://lists.nongnu.org/archive/html/qemu-devel/2021-03/msg09786.html

I noticed that if we do not have res->image and only have res->blob, we have to
re-submit the blob/dmabuf and update the displaysurface if the guest made updates
to it (in this case, the same FB), which can only happen if we call set_scanout_blob.
IIUC, flush only marks the area as dirty but does not re-submit the updated buffer/blob,
and I see a flicker if I let it do dpy_gfx_update().
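
Roughly, what I mean on the set_scanout_blob path is something like this (a sketch; the
res->blob plumbing and the creation of new_surface are hand-waved, while
dpy_gfx_replace_surface() and dpy_gfx_update() are the existing console hooks):

/* set_scanout(_blob) with only res->blob: rebuild the display surface from
 * the dmabuf/blob and hand it to the UI again */
dpy_gfx_replace_surface(scanout->con, new_surface);

/* resource_flush alone only marks the area dirty and does not pick up the
 * re-submitted buffer, hence the flicker mentioned above */
dpy_gfx_update(scanout->con, x, y, width, height);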

Thanks,
Vivek

> 
> take care,
>   Gerd



RE: [RFC v3 2/3] virtio: Introduce Vdmabuf driver

2021-02-22 Thread Kasireddy, Vivek
Hi Gerd,

> 
> On Fri, Feb 12, 2021 at 08:15:12AM +, Kasireddy, Vivek wrote:
> > Hi Gerd,
> > [Kasireddy, Vivek] Just to confirm my understanding of what you are
> > suggesting, are you saying that we need to either have Weston allocate
> > scanout buffers (GBM surface/BO) using virtio-gpu and render into them
> > using i915; or have virtio-gpu allocate pages and export a dma-buf and
> > have Weston create a GBM BO by calling gbm_bo_import(fd) and render into 
> > the BO
> using i915?
> 
> Not sure what the difference between the former and the latter is.
[Kasireddy, Vivek] Oh, what I meant is whether you were suggesting that we 
create a GBM device and create a GBM surface and BOs using this device or
just create a raw/dumb GEM object and create a GBM BO by importing it. As
we just discovered, the former means we have to initialize virgl which 
complicates
things so we went with the latter.

> 
> > [Kasireddy, Vivek] We are only interested in Qemu UI at the moment but
> > if we were to use virtio-gpu, we are going to need to add one more vq
> > and support for managing buffers, events, etc.
> 
> Should be easy and it should not need any virtio-gpu driver changes.
[Kasireddy, Vivek] Vdmabuf v4, that implements your suggestion -- to have
Vdmabuf allocate pages --  is posted here:
https://lists.freedesktop.org/archives/dri-devel/2021-February/297841.html
and tested it with Weston Headless and Qemu:
https://gitlab.freedesktop.org/Vivek/weston/-/blob/vdmabuf/libweston/backend-headless/headless.c#L522
https://lists.nongnu.org/archive/html/qemu-devel/2021-02/msg02976.html

Having said that, after discussing with Daniel Vetter, we are now switching our
focus to virtio-gpu to compare and contrast both solutions. 

> 
> You can use virtio-gpu like a dumb scanout device.  Create a dumb bo, create a
> framebuffer for the bo, map the framebuffer to the crtc.
> 
> Then export the bo, import into i915, use it as render target.  When 
> rendering is done flush
> (DRM_IOCTL_MODE_DIRTYFB).  Alternatively allocate multiple bo's + framebuffers
> and pageflip.
[Kasireddy, Vivek] Since we are testing with Weston, we are looking at 
pageflips (4 color
buffers). And, this part so far seems to work where virtio-gpu is used for kms 
(max_outputs=1)
and Iris/i915 is used for rendering. We are currently glueing virtio-gpu and 
i915 in Weston but
eventually the plan is to glue them (virgl/virtio-gpu and Iris) in Mesa if 
possible using KMSRO
(KMS render only) to avoid having to change Weston or X or other user-space 
components.
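
The dumb-scanout workflow Gerd describes above, in libdrm terms (a compressed sketch with
error handling omitted; the caller is assumed to have already picked crtc_id, conn_id and
mode):

#include <xf86drm.h>
#include <xf86drmMode.h>

static int scanout_on_virtio_render_on_i915(int virtio_fd, uint32_t crtc_id,
                                            uint32_t conn_id, drmModeModeInfo *mode,
                                            uint32_t w, uint32_t h)
{
        struct drm_mode_create_dumb create = { .width = w, .height = h, .bpp = 32 };
        uint32_t fb_id;
        int prime_fd;

        /* 1. dumb BO + framebuffer on the virtio-gpu (KMS-only) device */
        drmIoctl(virtio_fd, DRM_IOCTL_MODE_CREATE_DUMB, &create);
        drmModeAddFB(virtio_fd, w, h, 24, 32, create.pitch, create.handle, &fb_id);

        /* 2. export the BO; the i915/Iris side imports it with
         *    gbm_bo_import(GBM_BO_IMPORT_FD, ...) and renders into it */
        drmPrimeHandleToFD(virtio_fd, create.handle, DRM_CLOEXEC | DRM_RDWR, &prime_fd);

        /* 3. scan out, then flush after every frame rendered by i915 */
        drmModeSetCrtc(virtio_fd, crtc_id, fb_id, 0, 0, &conn_id, 1, mode);
        return drmModeDirtyFB(virtio_fd, fb_id, NULL, 0);
}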

> 
> Pretty standard workflow for cases where rendering and scanout are handled by 
> different
> devices.  As far I know not uncommon in the arm world.
> 
> Right now this will involve a memcpy() for any display update because qemu is 
> a bit
> behind on supporting recent virtio-gpu features.
[Kasireddy, Vivek] IIUC, I think you are referring to creating the Pixman image 
in set_scanout.
What additional features need to be implemented, or what would you recommend, to
replace the memcpy() with a dma-buf? Also, how should we ensure that access to the
guest fb/dmabuf is synchronized so that the Guest and the Host do not access the
backing storage of the dmabuf at the same time?

Thanks,
Vivek

> 
> take care,
>   Gerd



RE: [RFC v3 2/3] virtio: Introduce Vdmabuf driver

2021-02-12 Thread Kasireddy, Vivek
Hi Christian,

> 
> Hi Vivek,
> 
> > [Kasireddy, Vivek] What if I do mmap() on the fd followed by mlock()
> > or mmap() followed by get_user_pages()? If it still fails, would
> > ioremapping the device memory and poking at the backing storage be an
> > option? Or, if I bind the passthrough'd GPU device to vfio-pci and tap
> > into the memory region associated with the device memory, can it be made to 
> > work?
> 
> get_user_pages() is not allowed on mmaped DMA-bufs in the first place.
> 
> Daniel is currently adding code to make sure that this is never ever used.
> 
> > And, I noticed that for PFNs that do not have valid struct page
> > associated with it, KVM does a memremap() to access/map them. Is this an 
> > option?
> 
> No, even for system memory which has a valid struct page touching it when it 
> is part of a
> DMA-buf is illegal since the reference count and mapping fields in struct 
> page might be
> used for something different.
> 
> Keep in mind that struct page is a heavily overloaded structure for different 
> use cases. You
> can't just use it for a different use case than what the owner of the page 
> has intended it.
[Kasireddy, Vivek] What is your recommended/acceptable way for doing what I am 
trying to 
do?

Thanks,
Vivek

> 
> Regards,
> Christian.
> 
> >
> >
> > Thanks,
> > Vivek
> >> take care,
> >>Gerd



RE: [RFC v3 2/3] virtio: Introduce Vdmabuf driver

2021-02-12 Thread Kasireddy, Vivek
Hi Gerd,

> > > You don't have to use the rendering pipeline.  You can let the i915
> > > gpu render into a dma-buf shared with virtio-gpu, then use
> > > virtio-gpu only for buffer sharing with the host.
[Kasireddy, Vivek] Just to confirm my understanding of what you are suggesting, 
are
you saying that we need to either have Weston allocate scanout buffers (GBM 
surface/BO)
using virtio-gpu and render into them using i915; or have virtio-gpu allocate 
pages and 
export a dma-buf and have Weston create a GBM BO by calling gbm_bo_import(fd) 
and
render into the BO using i915?

> Hmm, why a big mode switch?  You should be able to do that without modifying 
> the
> virtio-gpu guest driver.  On the host side qemu needs some work to support 
> the most
> recent virtio-gpu features like the buffer uuids (assuming you use qemu 
> userspace), right
> now those are only supported by crosvm.
[Kasireddy, Vivek] We are only interested in Qemu UI at the moment but if we 
were to use
virtio-gpu, we are going to need to add one more vq and support for managing 
buffers, 
events, etc.

Thanks,
Vivek

> 
> It might be useful to add support for display-less virtio-gpu, i.e.
> "qemu -device virtio-gpu-pci,max_outputs=0".  Right now the linux driver 
> throws an error
> in case no output (crtc) is present.  Should be fixable without too much 
> effort though,
> effectively the sanity check would have to be moved from driver 
> initialization to
> commands like SET_SCANOUT which manage the outputs.
> 
> take care,
>   Gerd



RE: [RFC v3 2/3] virtio: Introduce Vdmabuf driver

2021-02-09 Thread Kasireddy, Vivek
Hi Gerd,

> -Original Message-
> From: Gerd Hoffmann 
> Sent: Tuesday, February 09, 2021 12:45 AM
> To: Kasireddy, Vivek 
> Cc: Daniel Vetter ; 
> virtualizat...@lists.linux-foundation.org; dri-
> de...@lists.freedesktop.org; Vetter, Daniel ;
> daniel.vet...@ffwll.ch; Kim, Dongwon ;
> sumit.sem...@linaro.org; christian.koe...@amd.com; linux-me...@vger.kernel.org
> Subject: Re: [RFC v3 2/3] virtio: Introduce Vdmabuf driver
> 
>   Hi,
> 
> > > > > Nack, this doesn't work on dma-buf. And it'll blow up at runtime
> > > > > when you enable the very recently merged CONFIG_DMABUF_DEBUG (would
> > > > > be good to test with that, just to make sure).
> > [Kasireddy, Vivek] Although, I have not tested it yet but it looks like 
> > this will
> > throw a wrench in our solution as we use sg_next to iterate over all the 
> > struct page *
> > and get their PFNs. I wonder if there is any other clean way to get the 
> > PFNs of all
> > the pages associated with a dmabuf.
> 
> Well, there is no guarantee that dma-buf backing storage actually has
> struct page ...
[Kasireddy, Vivek] What if I do mmap() on the fd followed by mlock() or mmap()
followed by get_user_pages()? If it still fails, would ioremapping the device 
memory
and poking at the backing storage be an option? Or, if I bind the passthrough'd 
GPU device
to vfio-pci and tap into the memory region associated with the device memory, 
can it be
made to work? 

And, I noticed that for PFNs that do not have valid struct page associated with 
it, KVM
does a memremap() to access/map them. Is this an option?

> 
> > [Kasireddy, Vivek] To exclude such cases, would it not be OK to limit the 
> > scope
> > of this solution (Vdmabuf) to make it clear that the dma-buf has to live in 
> > Guest RAM?
> > Or, are there any ways to pin the dma-buf pages in Guest RAM to make this
> > solution work?
> 
> At that point it becomes (i915) driver-specific.  If you go that route
> it doesn't look that useful to use dma-bufs in the first place ...
[Kasireddy, Vivek] I prefer not to make this driver specific if possible.

> 
> > IIUC, Virtio GPU is used to present a virtual GPU to the Guest and all the 
> > rendering
> > commands are captured and forwarded to the Host GPU via Virtio.
> 
> You don't have to use the rendering pipeline.  You can let the i915 gpu
> render into a dma-buf shared with virtio-gpu, then use virtio-gpu only for
> buffer sharing with the host.
[Kasireddy, Vivek] Is this the most viable path forward? I am not sure how 
complex or 
feasible it would be but I'll look into it.
Also, not using the rendering capabilities of virtio-gpu and turning it into a sharing-only
device means there would be a giant mode switch with a lot of if() conditions sprinkled
across the driver. Are you OK with that?

Thanks,
Vivek
> 
> take care,
>   Gerd



RE: [RFC v3 2/3] virtio: Introduce Vdmabuf driver

2021-02-08 Thread Kasireddy, Vivek
Hi Gerd, Daniel,

> -Original Message-
> From: Daniel Vetter 
> Sent: Monday, February 08, 2021 1:39 AM
> To: Gerd Hoffmann 
> Cc: Daniel Vetter ; Kasireddy, Vivek 
> ;
> virtualizat...@lists.linux-foundation.org; dri-devel@lists.freedesktop.org; 
> Vetter, Daniel
> ; daniel.vet...@ffwll.ch; Kim, Dongwon
> ; sumit.sem...@linaro.org; christian.koe...@amd.com;
> linux-me...@vger.kernel.org
> Subject: Re: [RFC v3 2/3] virtio: Introduce Vdmabuf driver
> 
> On Mon, Feb 08, 2021 at 08:57:48AM +0100, Gerd Hoffmann wrote:
> >   Hi,
> >
> > > > +/* extract pages referenced by sgt */ static struct page
> > > > +**extr_pgs(struct sg_table *sgt, int *nents, int *last_len)
> > >
> > > Nack, this doesn't work on dma-buf. And it'll blow up at runtime
> > > when you enable the very recently merged CONFIG_DMABUF_DEBUG (would
> > > be good to test with that, just to make sure).
[Kasireddy, Vivek] Although I have not tested it yet, it looks like this will
throw a wrench in our solution as we use sg_next to iterate over all the struct page *
and get their PFNs. I wonder if there is any other clean way to get the PFNs of all
the pages associated with a dmabuf.
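
For reference, the kind of walk I mean is roughly the following (a sketch of what this RFC
does, written with the sgtable page iterator; as noted above, it is only valid when the
backing storage is guest system RAM with real struct pages):

#include <linux/scatterlist.h>
#include <linux/mm.h>

static int vdmabuf_collect_pfns(struct sg_table *sgt, unsigned long *pfns, int max)
{
        struct sg_page_iter piter;
        int n = 0;

        for_each_sgtable_page(sgt, &piter, 0) {
                if (n >= max)
                        break;
                pfns[n++] = page_to_pfn(sg_page_iter_page(&piter));
        }
        return n;
}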

> >
> > > Aside from this, for virtio/kvm use-cases we've already merged the
> > > udmabuf driver. Does this not work for your usecase?
> >
> > udmabuf can be used on the host side to make a collection of guest
> > pages available as host dmabuf.  It's part of the puzzle, but not a
> > complete solution.
> >
> > As I understand it the intended workflow is this:
> >
> >   (1) guest gpu driver exports some object as dma-buf
> >   (2) dma-buf is imported into this new driver.
> >   (3) driver sends the pages to the host.
> >   (4) hypervisor uses udmabuf to create a host dma-buf.
> >   (5) host dma-buf is passed on.
> >
> > And step (3) is the problematic one as this will not work in case the
> > dma-buf doesn't live in guest ram but in -- for example -- gpu device
> > memory.
> 
> Yup, vram or similar special ram is the reason why an importer can't look at 
> the pages
> behind a dma-buf sg table.
[Kasireddy, Vivek] To exclude such cases, would it not be OK to limit the scope 
of this solution (Vdmabuf) to make it clear that the dma-buf has to live in 
Guest RAM?
Or, are there any ways to pin the dma-buf pages in Guest RAM to make this
solution work?

> 
> > Reversing the driver roles in the guest (virtio driver allocates pages
> > and exports the dma-buf to the guest gpu driver) should work fine.
> 
> Yup, this needs to flow the other way round than in these patches.
[Kasireddy, Vivek] That might work but I am afraid it means making invasive 
changes
to the Guest GPU driver (i915 in our case) which we are trying to avoid to
keep this solution more generic.

> 
> > Which btw is something you can do today with virtio-gpu.
> > Maybe it makes sense to have the option to run virtio-gpu in
> > render-only mode for that use case.
> 
> Yeah that sounds like a useful addition.
> 
> Also, the same flow should work for real gpus passed through as pci devices. 
> What we
> need is some way to surface the dma-buf on the guest side, which I think 
> doesn't exist yet
> stand-alone. But this role could be fulfilled by virtio-gpu in render-only 
> mode I think. And
> (assuming I've understood the recent discussions around virtio dma-buf 
> sharing using
> virtio ids) this would give you some neat zero-copy tricks for free if you 
> share multiple
> devices.
> 
> Also if you really want seamless buffer sharing between devices that are 
> passed to the
> guest and devices on the host side (like displays I guess?
> or maybe video encode if this is for cloug gaming?), then using virtio-gpu in 
> render mode
> should also allow you to pass the dma_fence back
> Which we'll need too, not just the dma-buf.
> 
> So at a first guess I'd say "render-only virtio-gpu mode" sounds like 
> something rather
> useful. But I might be totally off here.
[Kasireddy, Vivek] Let me present more details about the use-case we are trying 
to solve;
Sorry for the crude graphic below:

[ASCII diagram, mangled by the archive and cut off here. What survives: the Guest side
runs Weston (Headless) and the Host side runs the Qemu UI; the labelled steps are
(1) export a prime fd of the scanout buffer, (3) share the UUID with Qemu, and
(4) Qemu calls Import using this UUID and gets a new dmabuf fd that is used with EGL.]