On 9/1/23 14:10, Danilo Krummrich wrote:
On Fri, Sep 01, 2023 at 07:59:21AM +0200, Thomas Hellström (Intel) wrote:

On 8/31/23 21:07, Danilo Krummrich wrote:
On Thu, Aug 31, 2023 at 06:53:01PM +0200, Thomas Hellström (Intel) wrote:
Hi,

On 8/31/23 13:18, Danilo Krummrich wrote:
On Thu, Aug 31, 2023 at 11:04:06AM +0200, Thomas Hellström (Intel) wrote:
Hi!

On 8/30/23 17:00, Danilo Krummrich wrote:
On Wed, Aug 30, 2023 at 03:42:08PM +0200, Thomas Hellström (Intel) wrote:
On 8/30/23 14:49, Danilo Krummrich wrote:
Hi Thomas,

thanks for having a look!

On Wed, Aug 30, 2023 at 09:27:45AM +0200, Thomas Hellström (Intel) wrote:
Hi, Danilo.

Some quick comments since I'm doing some Xe work in this area. Will probably
get back with more.

On 8/20/23 23:53, Danilo Krummrich wrote:
<snip>

diff --git a/include/drm/drm_gpuva_mgr.h b/include/drm/drm_gpuva_mgr.h
index ed8d50200cc3..693e2da3f425 100644
--- a/include/drm/drm_gpuva_mgr.h
+++ b/include/drm/drm_gpuva_mgr.h
@@ -26,12 +26,16 @@
        */
       #include <linux/list.h>
+#include <linux/dma-resv.h>
+#include <linux/maple_tree.h>
       #include <linux/rbtree.h>
       #include <linux/types.h>
       #include <drm/drm_gem.h>
+#include <drm/drm_exec.h>
       struct drm_gpuva_manager;
+struct drm_gpuva_gem;
       struct drm_gpuva_fn_ops;
       /**
@@ -140,7 +144,7 @@ struct drm_gpuva {
       int drm_gpuva_insert(struct drm_gpuva_manager *mgr, struct drm_gpuva 
*va);
       void drm_gpuva_remove(struct drm_gpuva *va);
-void drm_gpuva_link(struct drm_gpuva *va);
+void drm_gpuva_link(struct drm_gpuva *va, struct drm_gpuva_gem *vm_bo);
       void drm_gpuva_unlink(struct drm_gpuva *va);
       struct drm_gpuva *drm_gpuva_find(struct drm_gpuva_manager *mgr,
@@ -240,15 +244,137 @@ struct drm_gpuva_manager {
         * @ops: &drm_gpuva_fn_ops providing the split/merge steps to drivers
         */
        const struct drm_gpuva_fn_ops *ops;
+
+       /**
+        * @d_obj: Dummy GEM object; used internally to pass the GPU VMs
+        * dma-resv to &drm_exec.
+        */
+       struct drm_gem_object d_obj;
+
+       /**
+        * @resv: the &dma_resv for &drm_gem_objects mapped in this GPU VA
+        * space
+        */
+       struct dma_resv *resv;
+
+       /**
+        * @exec: the &drm_exec helper to lock external &drm_gem_objects
+        */
+       struct drm_exec exec;
+
+       /**
+        * @mt_ext: &maple_tree storing external &drm_gem_objects
+        */
+       struct maple_tree mt_ext;
Why are you using a maple tree here? Insertion and removal is O(log(n))
instead of O(1) for a list?

Having a list of drm_gem_objects directly wouldn't work, as multiple GPU-VMs
could have mappings of the same extobj.

I considered using the VM_BO abstraction (struct drm_gpuva_gem) as list entry
instead, which also seems to be the obvious choice. However, there is a locking
conflict.

A drm_gem_object keeps a list of drm_gpuva_gems, while each drm_gpuva_gem keeps
a list of drm_gpuvas. Both lists are either protected with the dma-resv lock of
the corresponding drm_gem_object, or with an external lock provided by the
driver (see drm_gem_gpuva_set_lock()). The latter is used by drivers performing
changes on the GPUVA space directly from the fence signalling path.

Now, similar to what drm_gpuva_link() and drm_gpuva_unlink() are doing already,
we'd want to add a drm_gpuva_gem to the extobj list for the first mapping being
linked and we'd want to remove it for the last one being unlinked.

(Actually we'd want to add the drm_gpuva_gem object to the extobj list even
before, because otherwise we'd not acquire it's dma-resv lock of this GEM object
through drm_gpuva_manager_lock(). But that's trival, we could do that when we
create the drm_gpuva_gem, which we need to do anyways.)

Anyway, we'd probably want to keep removing the drm_gpuva_gem from the extobj
list from drm_gpuva_unlink() when the last mapping of this BO is unlinked. In
order to do so, we'd (as discussed above) either need to hold the outer GPU-VM
lock or the GPU-VMs dma-resv lock. Both would be illegal in the case
drm_gpuva_unlink() is called from within the fence signalling path. For drivers
like XE or Nouveau, we'd at least need to make sure to not mess up the locking
hierarchy of GPU-VM lock and dma-resv lock of the corresponding BO.

Considering all that, I thought it's probably better to track extobjs separate
from the drm_gpuva_gem, hence the maple tree choice.
Hm. OK, in Xe we're having a list of the xe_vmas (drm_gpuvas) that point to
external objects, or in the case of multiple mappings to the same gem
object, only one of the drm_gpuvas is in the list. These are protected by
the GPU-VM lock. I don't see a problem with removing those from the fence
signalling path, though?
I intentionally tried to avoid keeping a list of drm_gpuvas to track extobjs,
since this is generic code I don't know how much mappings of an external object
the corresponding driver potentially creates. This could become a pretty large
list to iterate. Another reason was, that I want to keep the drm_gpuva structure
as small as possible, hence avoiding another list_head.
Yes, the list might be pretty large, but OTOH you never iterate to access a
single list element. When you need to iterate the whole list you need to do
that regardless of the data structure used. As for the list head, it might
perhaps be aliased (union) with an upcoming userptr list head?

Oh, I did not mean that I'm concerned about the size of a list of extobjs in
general, that would indeed be the same for every data structure chosen. But I
would be concerned about keeping a list of *all* mappings being backed by an
extobj.

Now, it sounds like in XE you're doing some kind of optimization just keeping a
single mapping of an extobj in the list? How do you know when to remove it? What
if the mapping from the extobj list gets unmapped, but there is still another
one left in the GPU-VM being backed by the same BO?
When removing from the lists, we iterate through the object's list of vmas,
and if there is one matching the same vm, we replace the old one with the
new one. A similar iteration is done when adding to avoid adding one that is
already on the list.
I see, but wouldn't this be O(n) on insertion and O(m) on removal of an extobj,
while using the maple tree is O(log(n))?
No, insertion and removal is O(m) where m is the number of vms the object is
currently bound to. Typically a very small number.
Ok, my guess was that on insertion you'd actually walk the extobj list and see
if there's a vma backed by the same BO already, while on removal you said you're
walking the BO's vma list. So I guess on insertion you're also walking the BO's
vma list and see if there's already a mapping for this VM?

In your case that might make sense if you expect the extobj list to be larger
than the BO's vma list typically. In general I don't think this is true.

I think we're then optimizing for different scenarios. Our compute driver
will use mostly external objects only, and if shared, I don't forsee them
bound to many VMs. What saves us currently here is that in compute mode we
only really traverse the extobj list after a preempt fence wait, or when a
vm is using a new context for the first time. So vm's extobj list is pretty
large. Each bo's vma list will typically be pretty small.

Admittedly, I did not had in mind VMs where every GEM is an extobj. However,
especially for iterating a lot of extobjs a maple tree should perform better
than a list.


Another reason for us to use the list is that one possible, but not yet
implemented, workaround for this is the "vm fence", which when attached to
external bos pulls them off the extobj list and on "enable_signalling()"
splices its sublist of external bos back, and then snapshots the vm's
dma_resv and waits for all its fences. (The idea is that it should very
seldom be waited for in practice, and largely eliminate the extobj
handling). Here a list is an ideal data structure for list removal and
splicing. TBH we really want to avoid this optimization but we need to see
how bad extobj handling ends up in practice for the compute drivers.

If you end up doing this I highly doubt it'd make sense to use the GPUVA
manager for that, even if it would implement extobjs as a list of drm_gpuva_gems
(VM_BOs). It'd probably be a mess. When you remove extobjs from the GPUVA
manager, not because they're actually gone, but because you want to keep them
separate, you'd need to make sure to keep the drm_gpuva_gem structure alive,
which means you would need to increase the GPUVA managers refcount for extobjs
manually. You could probably also just "steal" them silently, but that'd be
quite nasty as well.




Although assuming that's a no-go for GPUVA wouldn't an XArray be a better
choice, keeping O(1)?
When tracking extobjs, the address of the drm_gem_object is the key while the
reference count is the value. I was thinking of an XArray as well, but I was
worried that the corresponding indices could be too much distributed for an
XArray to still be efficient. Now that I think about it, it's probably not that
bad.

Btw., while I agree trying to make things as efficient as possible, what is the
magnitue for extobjs to be tracked, do we need to worry about the O(log(n))?
Not sure yet, TBH, but I think one of our UMDs can only use external object,
because they don't know at creation time which ones need exporting. However
if this turns out to be too bad, there are various flavours of "clever but
complicated" optimizations that we could think of to reduce the list size.
Still in our case, we opted for the vma list head for now.
Considering the above, I would guess that if your current approach is good
enough, a maple tree will work as well.
Hmm, Yeah it's probably a bikeshed since each drm_exec builds a realloced
array of all external objects on each exec.
I did a quick sketchy benchmark, which is probably good enough. In a maple tree
with 0xFFFF - 1 existing entries insertion of a random (non-existant) entry
took on average ~530ns over 1k iterations.

The average insertion time for each entry to build up a tree with 0xFFFF - 1
entries in the first place was ~1.3us. That's expected since it should hit
memory allocations more often than the previous one. The maximum peak was ~10us.
Inserting already existing entries took ~300ns.

That's probably good enough.

That's hard to tell because we have nothing to compare with. For drm_exec,
Christian chose a realloced array because of linked list cache locality
issues, and Xarray locking requirements causing measurable performance
issues. Wouldn't a maple tree suffer from both of these?

Maple tree was designed for cache efficient traversal and to replace rbtree and
linked lists in MM because of their lack of cache efficiency. (That's also why
it is really unfortunate that we couldn't use maple tree for VMA tracking in the
GPUVA manager.)

In terms of locking, I can only imagine an issue because Xarray always seems to
use RCU and hence you can't get rid of some grace period latency? Otherwise it
should just be a spinlock.

@Christian: Or was there a different issue?

Maple tree can disable RCU entirely [1] AFAIK, hence likely we can avoid such an
issue.

[1] 
https://elixir.bootlin.com/linux/latest/source/include/linux/maple_tree.h#L612


In any case if you go for the maple tree would it be possible to hide the
implementation in a way as to make it not too hard to replace if real-world
workloads prove it necessary?

Of course, I would want to do that anyway.

Just a heads-up. It looks like (with the help of Boris) I can come up with a 
solution
everyone should be happy with. I think we can move extobjs to a list and API 
and locking
wise just do (almost) everything as we would as if there wouldn't be the 
use-case of
updating the VA space with direct callback from the fence signaling path.

Drivers doing that can simply schedule work to call drm_gpuva_unlink() from the 
fence
signaling path for explicit unmaps to avoid locking issues.

Drivers relying on VA space updates in the IOCTL already aren't affected at all.

I'll probably send out a v2 today or tomorrow.

- Danilo




Otherwise, if you want, I could do some experiments with Xarray and see how
that works out compared to using a maple tree.

Btw. another nice thing about using Xarray or maple tree for that is that
drivers updating the VA space from the fence signalling path don't need to
hold a GPU-VM lock to update the extobj list. Actually, they might not need
a GPU-VM lock at all.
I still don't follow why drivers would want to do that. Isn't the VA space /
fence object list always updated sync from the IOCTL?
For the extobj list I don't see any advantage not doing that in the IOCTL right
away. For the VA space there are a few advantages doing it in the fence
signalling path.

(1) No need to allocate drm_gpuva_ops at all. For a given map / unmap request
      the driver can receive the callbacks for map / remap / unmap directly.
(2) No need to unwind VA space updates on failure, also no need for any other
      unwind tricks.
(3) Synchronous bind jobs can be injected at any point of time and don't need to
      be queued up in the scheduler to preserve ordering.
(4) Potentially less error prone ressource management. Although, I admit partly
      this is just the consequence of (1) and (2).

Actually, once I get the page table management prepared for that I'd like to
move Nouveau over this approach.

OK. I guess I need to look at the resulting implementation to fully digest
this.

Thanks,

Thomas



/Thomas


/Thomas


+
+       /**
+        * @evict: structure holding the evict list and evict list lock
+        */
+       struct {
+               /**
+                * @list: &list_head storing &drm_gem_objects currently being
+                * evicted
+                */
+               struct list_head list;
+
+               /**
+                * @lock: spinlock to protect the evict list against concurrent
+                * insertion / removal of different &drm_gpuva_gems
+                */
+               spinlock_t lock;
+       } evict;
       };
       void drm_gpuva_manager_init(struct drm_gpuva_manager *mgr,
+                           struct drm_device *drm,
                            const char *name,
                            u64 start_offset, u64 range,
                            u64 reserve_offset, u64 reserve_range,
                            const struct drm_gpuva_fn_ops *ops);
       void drm_gpuva_manager_destroy(struct drm_gpuva_manager *mgr);
+/**
+ * DRM_GPUVA_EXEC - returns the &drm_gpuva_managers &drm_exec instance
+ * @mgr: the &drm_gpuva_managers to return the &drm_exec instance for
+ */
+#define DRM_GPUVA_EXEC(mgr)    &(mgr)->exec
A struct ww_acquire_ctx and thus a drm_exec is fundamentally per task and
should typically be allocated on the stack. Otherwise you'd need to protect
the mgr->exec member with an exclusive lock throughout the locking process,
and that's not what we want.
Oh, good point. I think it works in Nouveau, because there it's implicitly
protected with the job submission lock.

Did you consider subclassing a drm_exec for drm_gpuva purposes and add
needed ops to it: Like so:
That's a good idea, will take this into V2.
Actually, I'm not fully sure that was a good idea: I've now have a working
version of Xe ported over to drm_exec, having these helpers in mind and with
the intention to start using them as they mature. What I found, though is
that open-coding the drm_exec loop is not all that bad, but that building
blocks that can be called from within the loop are useful:

Like the drm_gpuva_prepare_objects() and an imaginary
drm_gpuva_prepare_gpuva() that locks the vm resv and the resv of the object
(if different and the gpuva points to the object. And
drm_gpuva_prepare_array() although we don't use it within Xe. That means you
can use these building blocks like helpers and avoid the fn() callback by
instead open-coding.

But I guess YMMV.
That's exactly why those building blocks are exported, I already had in mind
that there might be drivers which still want to open-code the drm_exec loop,
while others might just want a simple interface to lock everything.

I still think it is a good idea, but I'd keep that as simple as possible. And
for everything else just let the driver open-code it and use the "building
blocks" - will also expand the bulding blocks to what you mentioned above.

struct drm_gpuva_exec_ops {
         int (*fn) (struct drm_gpuva_exec *exec, int num_fences);
Is this the fn argument from drm_gpuva_manager_lock_extra()?

         int (*bo_validate) (struct drm_gpuva_exec *exec, struct drm_gem_object
*obj);
I guess we could also keep that within the drm_gpuva_fn_ops? This should always
be the same callback, right?

};

struct drm_gpuva_exec {
         const struct drm_gpuva_exec_ops *ops;
         struct drm_exec exec;
         struct drm_gpuva_manager *mgr;
};

Although I'd actually expect bo_validate to be part of fn in the typical
case. The drm_gpuva_exec would then be allocated by the caller on the stack.
This doesn't sound like my assumption about fn() above is correct.
Well one important thing in our conversion is that ttm_bo_validate () needs
to be in the until_all_locked() loop. We want to be able soon to use
sleeping locks for eviction, so a xe_bo_validate() would, at least
temporarily, add locked objects to the drm_exec list of locked objects. That
means everything that may end up calling validate deep within the call chain
needs to be part of the until_all_locked() loop, so our
drm_gpuva_manager_lock_extra() fn callback would include those validates and
look different all the time. Hence that's why open-coding isn't all that
bad...
Oh, I see. You indeed want to call validate() from within until_all_locked().

/Thomas


<snip>


Reply via email to