On Thu, May 15, 2025 at 10:51 AM Danilo Krummrich <d...@kernel.org> wrote:
>
> On Thu, May 15, 2025 at 10:34:07AM -0700, Rob Clark wrote:
> > On Thu, May 15, 2025 at 8:30 AM Danilo Krummrich <d...@kernel.org> wrote:
> > >
> > > On Thu, May 15, 2025 at 07:59:16AM -0700, Rob Clark wrote:
> > >
> > > Thanks for the detailed explanation!
> > >
> > > > On Thu, May 15, 2025 at 2:00 AM Danilo Krummrich <d...@kernel.org> 
> > > > wrote:
> > > > >
> > > > > On Wed, May 14, 2025 at 10:53:16AM -0700, Rob Clark wrote:
> > > > > > From: Rob Clark <robdcl...@chromium.org>
> > > > > >
> > > > > > Eases migration for drivers where VAs don't hold hard references to
> > > > > > their associated BO, avoiding reference loops.
> > > > > >
> > > > > > In particular, msm uses soft references to optimistically keep 
> > > > > > around
> > > > > > mappings until the BO is distroyed.  Which obviously won't work if 
> > > > > > the
> > > > > > VA (the mapping) is holding a reference to the BO.
> > > > >
> > > > > Ick! This is all complicated enough. Allow drivers to bypass the 
> > > > > proper
> > > > > reference counting for GEM objects in the context of VM_BO structures 
> > > > > seems like
> > > > > an insane footgun.
> > > > >
> > > > > I don't understand why MSM would need weak references here. Why does 
> > > > > msm need
> > > > > that, but nouveau, Xe, panthor, PowerVR do not?
> > > >
> > > > Most of those drivers were designed (and had their UABI designed) with
> > > > gpuvm, or at least sparse, in mind from the get go.  I'm not sure
> > > > about nouveau, but I guess it just got lucky that it's UABI semantics
> > > > fit having the VMA hold a reference to the BO.
> > > >
> > > > Unfortunately, msm pre-dates sparse.. and in the beginning there was
> > > > only a single global VM, multiple VMs was something retrofitted ~6yrs
> > > > (?) back.  For existing msm, the VMA(s) are implicitly torn down when
> > > > the GEM obj is freed.  This won't work with the VMA(s) holding hard
> > > > references to the BO.
> > >
> > > Ok, that makes sense to me, but why can't this be changed? I don't see 
> > > how the
> > > uAPI would be affected, this is just an implementation detail, no?
> >
> > It's about the behaviour of the API, there is no explicit VMA
> > creation/destruction in the uAPI.
>
> But that shouldn't matter? Userspace gives you a BO, the driver creates VMAs
> itself, which can have a reference on the VM_BO, which references the original
> BO. At this point you can drop the original reference of the BO and just 
> destroy
> all corresponding VMAs once the driver fulfilled the request from userspace?

Having the submit hold a reference to the VM_BO, and then this funny
looking bit of code in gem_close() gets us part way there:

   vm_bo = drm_gpuvm_bo_find(ctx->vm, obj);
   if (vm_bo) {
      drm_gpuvm_bo_put(vm_bo);
      drm_gpuvm_bo_put(vm_bo);
  }

But we still leak BO's used in other VMs.. scanout, and various other
fw and other internal BOs... those would all have to be tracked down
and to find _someplace_ to break the VM_BO  circular reference loop.

> > > > When userspace opts-in to "VM_BIND" mode, which it has to do before
> > > > the VM is created, then we don't set this flag, the VMA holds a hard
> > > > reference to the BO as it does with other drivers.  But consider this
> > > > use-case, which is perfectly valid for old (existing) userspace:
> > > >
> > > > 1) Userspace creates a BO
> > > > 2) Submits rendering referencing the BO
> > > > 3) Immediately closes the BO handle, without waiting for the submit to 
> > > > complete
> > > >
> > > > In this case, the submit holds a reference to the BO which holds a
> > > > reference to the VMA.
> > >
> > > Can't you just instead create the VMAs, which hold a reference to the 
> > > VM_BO,
> > > which holds a reference to the BO, then drop the drop the original BO 
> > > reference
> > > and finally, when everything is completed, remove all VMAs of the VM_BO?
> >
> > Perhaps the submit could hold a ref to the VM_BO instead of the BO to
> > cover that particular case.
> >
> > But for the legacy world, the VMA is implicitly torn down when the BO
> > is freed.  Which will never happen if the VM_BO holds a reference to
> > the BO.
>
> Sure, I get that; what I do not get is why it can't be changed, e.g. in the 
> way
> described above.
>
> > > This should do exactly the same *and* be conformant with GPUVM design.
> > >
> > > > Everything is torn down gracefully when the
> > > > submit completes.  But if the VMA held a hard reference to the BO then
> > > > you'd have a reference loop.
> > > >
> > > > So there really is no other way to use gpuvm _and_ maintain backwards
> > > > compatibility with the semantics of the pre-VM_BIND UAPI without this
> > > > flag.
> > >
> > > Again, how is this important for maintaining backwards compatibility with 
> > > the
> > > uAPI? This all seems like a driver internal implementation detail to me.
> > >
> > > So, is there a technical reason, or is it more that it would be more 
> > > effort on
> > > the driver end to rework things accordingly?
> >
> > If there were a way to work without WEAK_REF, it seems like it would
> > be harder and much less of a drop in change.
>
> So, you're saying there is no technical blocker to rework it?

Not clear.. it would certainly make conversion to gpuvm a much bigger
flag-day, because without WEAK_REF the way gpuvm works is exactly
backwards from how the thing it is replacing works.

BR,
-R

Reply via email to