Re: dma-buf sg mangling

Christian König Tue, 14 May 2024 00:00:57 -0700

Am 14.05.24 um 06:15 schrieb Zack Rusin:

On Mon, May 13, 2024 at 1:09 PM Christian König
<christian.koe...@amd.com>  wrote:

Am 10.05.24 um 18:34 schrieb Zack Rusin:

Hey,


so this is a bit of a silly problem but I'd still like to solve it
properly. The tldr is that virtualized drivers abuse
drm_driver::gem_prime_import_sg_table (at least vmwgfx and xen do,
virtgpu and xen punt on it) because there doesn't seem to be a
universally supported way of converting the sg_table back to a list of
pages without some form of gart to do it.

Well the whole point is that you should never touch the pages in the
sg_table in the first place.

The long term plan is actually to completely remove the pages from that
interface.

First let me clarify that I'm not arguing for access to those pages.
What I'd like to figure out are precise semantics for all of this
prime import/map business on drivers that don't have some dedicated
hardware to turn dma_addr_t array into something readable. If the
general consensus is that those devices are broken, so be it.

Well that stuff is actually surprisingly well documented, see here:https://docs.kernel.org/driver-api/dma-buf.html#cpu-access-to-dma-buffer-objects

It's just that the documentation is written from the perspective of theexporter and userspace, so it's probably not that easy to understandwhat you should do as an importer.


Maybe we should add a sentence or two to clarify this.

drm_prime_sg_to_page_array is deprecated (for all the right reasons on
actual hardware) but in our cooky virtualized world we don't have
gart's so what are we supposed to do with the dma_addr_t from the
imported sg_table? What makes it worse (and definitely breaks xen) is
that with CONFIG_DMABUF_DEBUG the sg page_link is mangled via
mangle_sg_table so drm_prime_sg_to_page_array won't even work.

XEN and KVM were actually adjusted to not touch the struct pages any more.

I'm not sure if that work is already upstream or not but I had to
explain it over and over again why their approach doesn't work.

I'd love to see those patches. Upstream xen definitely still uses it:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/xen/xen_drm_front_gem.c#n263
which looks pretty broken to me, especially with CONFIG_DMABUF_DEBUG
because drm_gem_prime_import
will call dma_buf_map_attachment_unlocked:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/drm_prime.c#n940
which will call __map_dma_buf
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dma-buf/dma-buf.c#n1131
which will mangle the sg's page_list before calling xen's
gem_prime_import_sg_table. Which means the drm_prime_sg_to_page_array
that's used in xen's gem_prime_import_sg_table is silently generating
broken pages and the entire thing should just be a kernel oops (btw,
it'd probably be a good idea to not have drm_prime_sg_to_page_array
generate garbage with CONFIG_DMABUF_DEBUG and print some kind of a
warning).

I honestly didn't followed the discussion to the end, but both Sima andme pointed that out to the XEN people and there were quite a bit of backand forth how to fix it.


Let me try to dig that up.

The reason why I'm saying it's a bit of a silly problem is that afaik
currently it only affects IGT testing with vgem (because the rest of
external gem objects will be from the virtualized gpu itself which is
different). But do you have any ideas on what we'd like to do with
this long term? i.e. we have a virtualized gpus without iommu, we have
sg_table with some memory and we'd like to import it. Do we just
assume that the sg_table on those configs will always reference cpu
accessible memory (i.e. if it's external it only comes through
drm_gem_shmem_object) and just do some horrific abomination like:
for (i = 0; i < bo->ttm->num_pages; ++i) {
      phys_addr_t pa = dma_to_phys(vmw->drm.dev, bo->ttm->dma_address[i]);
      pages[i] = pfn_to_page(PHYS_PFN(pa));
}
or add a "i know this is cpu accessible, please demangle" flag to
drm_prime_sg_to_page_array or try to have some kind of more permanent
solution?

Well there is no solution for that. Accessing the underlying struct page
through the sg_table is illegal in the first place.

So the question is not how to access the struct page, but rather why do
you want to do this?

Rob mentioned one usecase, but to be honest, as I mentioned in the
beginning I'd like to have a semantic clarity to the general problem
of going from dma_addr_t to something readable on non iomem resources,
e.g. get the IGT vgem<->vmwgfx tests running, i.e.:
vgem_handle = dumb_buffer_create(vgem_fd, ....);
dma_buf_fd = prime_handle_to_fd(vgem_fd, vgem_handle);
vmw_handle = prime_fd_to_handle(vmw_fd, dma_buf_fd);
void *ptr = vmw_map_bo(vmw_fd, vmw_handle, ...); <- crash

trying to map that bo will crash because we'll endup in
ttm_bo_vm_fault_reserved which will check whether
bo->resource->bus.is_iomem, which it won't be because every vmwgfx
buffer is just system memory and it will try to access the ttm pages
which don't exist and crash:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/ttm/ttm_bo_vm.c#n249

Going through TTM for the VM fault is completely illegal to begin with.What you do instead is to re-route the mmap() request to the exporter,see how dma_buf_mmap() is being used by drm_gem_shmem_mmap() for an example.

It was even discussed to do this generally in the GEM layer, but IIRC itwas rejected because driver stacks should avoid doing this and use theoriginal exporter for the mmap() instead when possible.

Sima and Thomas probably knew this better than me, but I think theproblem might be that VMWGFX didn't used that GEM layer until recentlyand so most likely never enforced some of the restrictions there.

On our hypervisors that are less than 8 years old all of vmwgfx
buffers are always system memory, and when we get an external buffer
from vgem, which is also system memory we have no currently supported
way of populating the ttm::pages to be able to map/read it.

It's fine if the general consensus is "that's crazy, we can't fault on
external buffers pages" and "without some gart like thing in your
device we can not make prime work" but I do want to have some clarity
on how/whether this is supposed to work.

Short summary is that you redirect your mmap() call to thedma_buf_mmap() instead of handling it yourself.

Or to put it another way: imagine two different cpu vgem like drivers,
how does one driver import the sg_table from another and convert it to
something userspace readable?


Well you don't touch the sg_table at all in that case.

If you need an userspace mapping you use dma_buf_mmap(), if you need akernel mapping you use the vmap interface.

If you are a hypervisor like XEN, KVM and VMW you basically don't touchany of that and just mirror parts of an address space between host andguest through an MMU notifier (include faults and invalidation).

See the recent (~2 month old) discussion between Christoph Hellwig, theKVM people and me on the some mailing lists.


Regards,
Christian.

Re: dma-buf sg mangling

Reply via email to