On Thu, May 15, 2025 at 10:51 AM Danilo Krummrich <d...@kernel.org> wrote: > > On Thu, May 15, 2025 at 10:34:07AM -0700, Rob Clark wrote: > > On Thu, May 15, 2025 at 8:30 AM Danilo Krummrich <d...@kernel.org> wrote: > > > > > > On Thu, May 15, 2025 at 07:59:16AM -0700, Rob Clark wrote: > > > > > > Thanks for the detailed explanation! > > > > > > > On Thu, May 15, 2025 at 2:00 AM Danilo Krummrich <d...@kernel.org> > > > > wrote: > > > > > > > > > > On Wed, May 14, 2025 at 10:53:16AM -0700, Rob Clark wrote: > > > > > > From: Rob Clark <robdcl...@chromium.org> > > > > > > > > > > > > Eases migration for drivers where VAs don't hold hard references to > > > > > > their associated BO, avoiding reference loops. > > > > > > > > > > > > In particular, msm uses soft references to optimistically keep > > > > > > around > > > > > > mappings until the BO is distroyed. Which obviously won't work if > > > > > > the > > > > > > VA (the mapping) is holding a reference to the BO. > > > > > > > > > > Ick! This is all complicated enough. Allow drivers to bypass the > > > > > proper > > > > > reference counting for GEM objects in the context of VM_BO structures > > > > > seems like > > > > > an insane footgun. > > > > > > > > > > I don't understand why MSM would need weak references here. Why does > > > > > msm need > > > > > that, but nouveau, Xe, panthor, PowerVR do not? > > > > > > > > Most of those drivers were designed (and had their UABI designed) with > > > > gpuvm, or at least sparse, in mind from the get go. I'm not sure > > > > about nouveau, but I guess it just got lucky that it's UABI semantics > > > > fit having the VMA hold a reference to the BO. > > > > > > > > Unfortunately, msm pre-dates sparse.. and in the beginning there was > > > > only a single global VM, multiple VMs was something retrofitted ~6yrs > > > > (?) back. For existing msm, the VMA(s) are implicitly torn down when > > > > the GEM obj is freed. This won't work with the VMA(s) holding hard > > > > references to the BO. > > > > > > Ok, that makes sense to me, but why can't this be changed? I don't see > > > how the > > > uAPI would be affected, this is just an implementation detail, no? > > > > It's about the behaviour of the API, there is no explicit VMA > > creation/destruction in the uAPI. > > But that shouldn't matter? Userspace gives you a BO, the driver creates VMAs > itself, which can have a reference on the VM_BO, which references the original > BO. At this point you can drop the original reference of the BO and just > destroy > all corresponding VMAs once the driver fulfilled the request from userspace?
Having the submit hold a reference to the VM_BO, and then this funny looking bit of code in gem_close() gets us part way there: vm_bo = drm_gpuvm_bo_find(ctx->vm, obj); if (vm_bo) { drm_gpuvm_bo_put(vm_bo); drm_gpuvm_bo_put(vm_bo); } But we still leak BO's used in other VMs.. scanout, and various other fw and other internal BOs... those would all have to be tracked down and to find _someplace_ to break the VM_BO circular reference loop. > > > > When userspace opts-in to "VM_BIND" mode, which it has to do before > > > > the VM is created, then we don't set this flag, the VMA holds a hard > > > > reference to the BO as it does with other drivers. But consider this > > > > use-case, which is perfectly valid for old (existing) userspace: > > > > > > > > 1) Userspace creates a BO > > > > 2) Submits rendering referencing the BO > > > > 3) Immediately closes the BO handle, without waiting for the submit to > > > > complete > > > > > > > > In this case, the submit holds a reference to the BO which holds a > > > > reference to the VMA. > > > > > > Can't you just instead create the VMAs, which hold a reference to the > > > VM_BO, > > > which holds a reference to the BO, then drop the drop the original BO > > > reference > > > and finally, when everything is completed, remove all VMAs of the VM_BO? > > > > Perhaps the submit could hold a ref to the VM_BO instead of the BO to > > cover that particular case. > > > > But for the legacy world, the VMA is implicitly torn down when the BO > > is freed. Which will never happen if the VM_BO holds a reference to > > the BO. > > Sure, I get that; what I do not get is why it can't be changed, e.g. in the > way > described above. > > > > This should do exactly the same *and* be conformant with GPUVM design. > > > > > > > Everything is torn down gracefully when the > > > > submit completes. But if the VMA held a hard reference to the BO then > > > > you'd have a reference loop. > > > > > > > > So there really is no other way to use gpuvm _and_ maintain backwards > > > > compatibility with the semantics of the pre-VM_BIND UAPI without this > > > > flag. > > > > > > Again, how is this important for maintaining backwards compatibility with > > > the > > > uAPI? This all seems like a driver internal implementation detail to me. > > > > > > So, is there a technical reason, or is it more that it would be more > > > effort on > > > the driver end to rework things accordingly? > > > > If there were a way to work without WEAK_REF, it seems like it would > > be harder and much less of a drop in change. > > So, you're saying there is no technical blocker to rework it? Not clear.. it would certainly make conversion to gpuvm a much bigger flag-day, because without WEAK_REF the way gpuvm works is exactly backwards from how the thing it is replacing works. BR, -R