Re: [PATCH v3 5/6] drm/i915/ttm: Implement asynchronous TTM moves

2021-11-17 Thread Thomas Hellström
Hi, Matthew

Finally got some time to look at this more in-depth, please see below.

On Mon, 2021-11-15 at 17:16 +, Matthew Auld wrote:
> On 14/11/2021 11:12, Thomas Hellström wrote:
> > Don't wait sync while migrating, but rather make the GPU blit await
> > the
> > dependencies and add a moving fence to the object.
> > 
> > This also enables asynchronous VRAM management in that on eviction,
> > rather than waiting for the moving fence to expire before freeing
> > VRAM,
> > it is freed immediately and the fence is stored with the VRAM
> > manager and
> > handed out to newly allocated objects to await before clears and
> > swapins,
> > or for kernel objects before setting up gpu vmas or mapping.
> > 
> > To collect dependencies before migrating, add a set of utilities
> > that
> > coalesce these to a single dma_fence.
> > 
> > What is still missing for fully asynchronous operation is
> > asynchronous vma
> > unbinding, which is still to be implemented.
> > 
> > This commit substantially reduces execution time in the
> > gem_lmem_swapping
> > test.
> > 
> > v2:
> > - Make a couple of functions static.
> > 
> > Signed-off-by: Thomas Hellström 
> > ---
> >   drivers/gpu/drm/i915/gem/i915_gem_ttm.c  |  10 +
> >   drivers/gpu/drm/i915/gem/i915_gem_ttm.h  |   2 +-
> >   drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 329 +--
> >   drivers/gpu/drm/i915/gem/i915_gem_wait.c |   4 +-
> >   4 files changed, 318 insertions(+), 27 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> > b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> > index a1df49378a0f..111a4282d779 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> > @@ -326,6 +326,9 @@ static bool i915_ttm_eviction_valuable(struct
> > ttm_buffer_object *bo,
> >   {
> > struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
> >   
> > +   if (!obj)
> > +   return false;
> > +
> > /*
> >  * EXTERNAL objects should never be swapped out by TTM,
> > instead we need
> >  * to handle that ourselves. TTM will already skip such
> > objects for us,
> > @@ -448,6 +451,10 @@ static int
> > i915_ttm_shrinker_release_pages(struct drm_i915_gem_object *obj,
> > if (bo->ttm->page_flags & TTM_TT_FLAG_SWAPPED)
> > return 0;
> >   
> > +   ret = ttm_bo_wait_ctx(bo, &ctx);
> > +   if (ret)
> > +   return ret;
> 
> 
> Why do we need this? Also not needed for the above purge case?

This is for bos with an on-going async move to system. The
intel_migrate code doesn't set up vmas, so unbinding doesn't
necessarily idle. The purge code currently idles in TTM, but in both
cases we should probably add another argument to
shrinker_release_pages(), move this wait before the purge, and return
-EBUSY unless we have SHRINK_ACTIVE.

> 
> > +
> > bo->ttm->page_flags |= TTM_TT_FLAG_SWAPPED;
> > ret = ttm_bo_validate(bo, , );
> > if (ret) {
> > @@ -549,6 +556,9 @@ static void i915_ttm_swap_notify(struct
> > ttm_buffer_object *bo)
> > struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
> > int ret = i915_ttm_move_notify(bo);
> >   
> > +   if (!obj)
> > +   return;
> 
> It looks like the i915_ttm_move_notify(bo) already dereferenced the
> GEM 
> bo. Or did something in there maybe nuke it?
> 
> > +
> > GEM_WARN_ON(ret);
> > GEM_WARN_ON(obj->ttm.cached_io_rsgt);
> > if (!ret && obj->mm.madv != I915_MADV_WILLNEED)
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.h
> > b/drivers/gpu/drm/i915/gem/i915_gem_ttm.h
> > index 82cdabb542be..9d698ad00853 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.h
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.h
> > @@ -37,7 +37,7 @@ void i915_ttm_bo_destroy(struct ttm_buffer_object
> > *bo);
> >   static inline struct drm_i915_gem_object *
> >   i915_ttm_to_gem(struct ttm_buffer_object *bo)
> >   {
> > -   if (GEM_WARN_ON(bo->destroy != i915_ttm_bo_destroy))
> > +   if (bo->destroy != i915_ttm_bo_destroy)
> > return NULL;
> 
> So this would indicate a "ghost" object, or is this something else?
> How 
> scared should we be with this, like with the above checking for NULL
> GEM 
> object state? In general do you know where we need the above
> checking?

Yeah, these are ghost objects and this is a major flaw in TTM in that
some callbacks are per device and not per object. Should have been
fixed long ago :/. For the ttm_tt callbacks, obj might be NULL but we
must still be able to cope with that. For other callbacks we should
ignore the ghost objects. I'll do a second audit here.

> 
> >   
> > return container_of(bo, struct drm_i915_gem_object,
> > __do_not_access);
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> > b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> > index f35b386c56ca..ae2c49fc3500 100644
> > --- 

Re: [PATCH v3 5/6] drm/i915/ttm: Implement asynchronous TTM moves

2021-11-15 Thread Thomas Hellström



On 11/15/21 18:16, Matthew Auld wrote:

Thanks for reviewing, Matthew,

I'll take a look at the comments.

/Thomas




Re: [PATCH v3 5/6] drm/i915/ttm: Implement asynchronous TTM moves

2021-11-15 Thread Matthew Auld

On 14/11/2021 11:12, Thomas Hellström wrote:

[quoted commit message and first diff hunks trimmed; see the reply above]

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
index f35b386c56ca..ae2c49fc3500 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
@@ -3,6 +3,8 @@
   * Copyright © 2021 Intel Corporation
   */
  
+#include 

+
  #include 
  
  #include "i915_drv.h"

@@ -41,6 +43,228 @@ void i915_ttm_migrate_set_failure_modes(bool gpu_migration,
  }
  #endif
  
+/**
+ * DOC: Set of utilities to dynamically collect dependencies and
+ * eventually coalesce them into a single fence which is fed into
+ * the migration code. That single fence is, in the case of dependencies
+ * from multiple contexts, a struct dma_fence_array, since the
+ * i915 request code can break that up and await the individual
+ * fences.


Would it make sense to add a few more details here on why we need 
this? IIUC it looks like TTM expects a single context/timeline for a 
pipelined move, like with that dma_fence_is_later() check in 
pipeline_evict? Maybe there is something already documented in TTM we 
can link to here?



+ *
+ * While collecting the individual dependencies, we store the refcounted
+ * struct dma_fence pointers in a realloc-type-managed pointer array, since
+ * that can be easily fed into a dma_fence_array. Other options are
+ * available, like for example an xarray for similarity with drm/sched.
+ * Can be changed