> -----Original Message-----
> From: Daniel Vetter <dan...@ffwll.ch>
> Sent: Thursday, January 25, 2024 1:33 PM
> To: Christian König <christian.koe...@amd.com>
> Cc: Zeng, Oak <oak.z...@intel.com>; Danilo Krummrich <d...@redhat.com>;
> Dave Airlie <airl...@redhat.com>; Daniel Vetter <dan...@ffwll.ch>; Felix
> Kuehling <felix.kuehl...@amd.com>; Welty, Brian <brian.we...@intel.com>; dri-
> de...@lists.freedesktop.org; intel...@lists.freedesktop.org; Bommu, Krishnaiah
> <krishnaiah.bo...@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimi...@intel.com>; thomas.hellst...@linux.intel.com;
> Vishwanathapura, Niranjana <niranjana.vishwanathap...@intel.com>; Brost,
> Matthew <matthew.br...@intel.com>; Gupta, saurabhg
> <saurabhg.gu...@intel.com>
> Subject: Re: Making drm_gpuvm work across gpu devices
> 
> On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:
> > Am 23.01.24 um 20:37 schrieb Zeng, Oak:
> > > [SNIP]
> > > Yes, most APIs are per-device based.
> > >
> > > One exception I know of is actually the kfd SVM API. If you look at the
> > > svm_ioctl function, it is per-process based. Each kfd_process represents
> > > a process across N gpu devices.
> >
> > Yeah and that was a big mistake in my opinion. We should really not do that
> > ever again.
> >
> > > It needs to be said that kfd SVM represents a shared virtual address
> > > space across the CPU and all GPU devices on the system. This is by the
> > > definition of SVM (shared virtual memory). This is very different from
> > > our legacy gpu *device* driver, which works for only one device (i.e.,
> > > if you want one device to access another device's memory, you will have
> > > to use dma-buf export/import etc).
> >
> > Exactly that thinking is what we have currently found to be a blocker for
> > virtualization projects. Having SVM as a device-independent feature which
> > somehow ties to the process address space turned out to be an extremely bad
> > idea.
> >
> > The background is that this only works for some use cases but not all of
> > them.
> >
> > What's working much better is to just have a mirror functionality which says
> > that a range A..B of the process address space is mapped into a range C..D
> > of the GPU address space.
> >
> > Those ranges can then be used to implement the SVM feature required by
> > higher-level APIs; it is not something you need at the UAPI level or even
> > inside the low-level kernel memory management.
> >
> > When you talk about migrating memory to a device you also do this on a
> > per-device basis and *not* tied to the process address space. If you then
> > get crappy performance because userspace gave contradicting information
> > about where to migrate memory, then that's a bug in userspace and not
> > something the kernel should try to prevent somehow.
> >
> > [SNIP]
> > > > I think if you start using the same drm_gpuvm for multiple devices you
> > > > will sooner or later start to run into the same mess we have seen with
> > > > KFD, where we moved more and more functionality from the KFD to the DRM
> > > > render node because we found that a lot of the stuff simply doesn't work
> > > > correctly with a single object to maintain the state.
> > > As I understand it, KFD is designed to work across devices. A single
> > > pseudo /dev/kfd device represents all hardware gpu devices. That is why
> > > during kfd open, many pdds (process device data) are created, each for
> > > one hardware device for this process.
> >
> > Yes, I'm perfectly aware of that. And I can only repeat myself that I see
> > this design as a rather extreme failure. And I think it's one of the reasons
> > why NVidia is so dominant with Cuda.
> >
> > This whole approach KFD takes was designed with the idea of extending the
> > CPU process into the GPUs, but this idea only works for a few use cases and
> > is not something we should apply to drivers in general.
> >
> > A very good example is the virtualization use case, where you end up with
> > CPU address != GPU address because the VAs are actually coming from the
> > guest VM and not the host process.
> >
> > SVM is a high-level concept of OpenCL, Cuda, ROCm etc. This should not have
> > any influence on the design of the kernel UAPI.
> >
> > If you want to do something similar as KFD for Xe I think you need to get
> > explicit permission to do this from Dave and Daniel and maybe even Linus.
> 
> I think the one and only exception where an SVM uapi like in kfd makes
> sense is if the _hardware_ itself, not the software-stack-defined semantics
> that you've happened to build on top of that hw, enforces a 1:1 mapping
> with the cpu process address space.
> 
> Which means your hardware is using PASID, IOMMU-based translation, PCI-ATS
> (address translation services) or whatever your hw calls it, and has _no_
> device-side pagetables on top. From what I've seen, all devices with device
> memory do have such pagetables, simply because they need some place to
> store whether that memory is currently in device memory or should be
> translated using PASID. Currently there's no gpu that works with PASID
> only, but there are some on-cpu-die accelerator things that do work like
> that.
> 
> Maybe in the future there will be some accelerators that are fully cpu
> cache coherent (including atomics) with something like CXL, and the
> on-device memory is managed as normal system memory with struct page as
> ZONE_DEVICE and accelerator va -> physical address translation is only
> done with PASID ... but for now I haven't seen that, definitely not in
> upstream drivers.
> 
> And the moment you have some per-device pagetables or per-device memory
> management of some sort (like using gpuva mgr) then I'm 100% agreeing with
> Christian that the kfd SVM model is too strict and not a great idea.
> 


A GPU is nothing more than a piece of HW that accelerates part of a program,
just like an extra CPU core. From this perspective, a unified virtual address
space across the CPU and all GPU devices (and any other accelerators) is always
more convenient to program than a split address space between devices.

In reality, GPU programming started from a split address space. HMM is designed
to provide a unified virtual address space without most of the advanced
hardware features you listed above.

I am aware that Nvidia's new hardware platforms such as Grace Hopper natively
support the Unified Memory programming model through hardware-based memory
coherence among all CPUs and GPUs. For such systems, HMM is not required.

You can think of HMM as a software-based solution to provide a unified address
space between the CPU and devices. Both AMD and Nvidia have been providing a
unified address space through HMM, and I think it is still valuable.
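
To make that concrete, here is a rough sketch of the usual HMM pattern for
mirroring one CPU VA range into a device page table, with no ATS/PASID
hardware required. struct my_device, its pt_lock and my_device_program_ptes()
are made-up placeholders for driver-specific code; only the hmm_range_fault()
and mmu_interval_notifier calls are the upstream kernel API:

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/mutex.h>
#include <linux/slab.h>

struct my_device {                      /* hypothetical driver state */
        struct mutex pt_lock;           /* protects the device page table */
};

/* Hypothetical hook: write the snapshotted PFNs into the GPU page table. */
int my_device_program_ptes(struct my_device *mdev, unsigned long start,
                           unsigned long end, const unsigned long *hmm_pfns);

static int my_mirror_range(struct my_device *mdev,
                           struct mmu_interval_notifier *notifier,
                           unsigned long start, unsigned long end)
{
        unsigned long npages = (end - start) >> PAGE_SHIFT;
        struct hmm_range range = {
                .notifier      = notifier,
                .start         = start,
                .end           = end,
                .default_flags = HMM_PFN_REQ_FAULT,
        };
        int ret;

        range.hmm_pfns = kvmalloc_array(npages, sizeof(*range.hmm_pfns),
                                        GFP_KERNEL);
        if (!range.hmm_pfns)
                return -ENOMEM;

again:
        range.notifier_seq = mmu_interval_read_begin(notifier);

        /* Fault in and snapshot the CPU page table entries for the range. */
        mmap_read_lock(notifier->mm);
        ret = hmm_range_fault(&range);
        mmap_read_unlock(notifier->mm);
        if (ret == -EBUSY)
                goto again;
        if (ret)
                goto out;

        /*
         * Program the GPU page table under the driver lock, but only if the
         * CPU mapping did not change while we dropped the mmap lock.
         */
        mutex_lock(&mdev->pt_lock);
        if (mmu_interval_read_retry(notifier, range.notifier_seq)) {
                mutex_unlock(&mdev->pt_lock);
                goto again;
        }
        ret = my_device_program_ptes(mdev, start, end, range.hmm_pfns);
        mutex_unlock(&mdev->pt_lock);

out:
        kvfree(range.hmm_pfns);
        return ret;
}

The point of the sketch is that the unified address space comes purely from
replaying CPU mappings into a per-device page table; no special hardware
coherence is needed, and the device never owns the process address space.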

Regards,
Oak  



> Cheers, Sima
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
