Re: [PATCH 01/13] drm: execution context for GEM buffers v4

2023-06-19 Thread Intel



On 6/19/23 11:48, Christian König wrote:

Hi,

On 19.06.23 at 11:33, Thomas Hellström (Intel) wrote:

[SNIP]
Sometimes you want to just drop the contended lock after the above 
relaxation, and not add it as prelocked, if the contended object goes 
out of scope. Eviction would in some situations be one such example; 
-EDEADLK leading to an error path where the object should otherwise be 
freed is another. Perhaps we could add an argument to prepare_obj() as 
to whether the object should be immediately put after relaxation.


I was considering a try_prepare version as well, that should cover 
this use case.


That sounds a bit different from this use-case. The use-case above 
would, on -EDEADLOCK actually unlock everything, then lock-slow the 
contending lock and then immediately unlock it and drop.


Hui? What would that be good for?


It's for the case where you have nested locking, the contended lock has 
gone out of scope and you're probably not going to need it on the next 
attempt. I think the refcounted "prelocked" object that is lacking in 
the i915 variant will resolve all correctness / use-after-free issues, 
but the object might still be needlessly carried around for yet another 
locking round.

It sounds like try_prepare would just skip locking and continue, with 
everything locked so far remaining locked?


Correct.
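For illustration, a try_prepare variant with those semantics could look
roughly like the sketch below. This is only an illustration against the
helpers in this patch (drm_exec_obj_locked() and the embedded ww ticket);
the function name and the -EBUSY return are made up here, not part of the
posted series:

static int drm_exec_try_prepare_obj(struct drm_exec *exec,
				    struct drm_gem_object *obj,
				    unsigned int num_fences)
{
	int ret;

	ret = dma_resv_lock(obj->resv, &exec->ticket);
	if (ret == -EDEADLK) {
		/* Contended: give up on @obj, keep everything locked so far. */
		return -EBUSY;
	}
	if (ret)
		return ret;

	ret = drm_exec_obj_locked(exec, obj);
	if (ret) {
		dma_resv_unlock(obj->resv);
		return ret;
	}

	return dma_resv_reserve_fences(obj->resv, num_fences);
}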

+    ret = drm_exec_obj_locked(exec, obj);
+    if (unlikely(ret)) {
+    dma_resv_unlock(obj->resv);
+    goto error_dropref;
+    }
+
+    swap(exec->prelocked, obj);
+
+error_dropref:
+    /* Always cleanup the contention so that error handling can kick in */

+    drm_gem_object_put(obj);
+    exec->contended = NULL;
+    return ret;
+}
+
+/**
+ * drm_exec_prepare_obj - prepare a GEM object for use
+ * @exec: the drm_exec object with the state
+ * @obj: the GEM object to prepare
+ * @num_fences: how many fences to reserve
+ *
+ * Prepare a GEM object for use by locking it and reserving fence slots. All
+ * successfully locked objects are put into the locked container.
+ *
+ * Returns: -EDEADLK if a contention is detected, -EALREADY when object is
+ * already locked, -ENOMEM when memory allocation failed and zero for success.
+ */
+int drm_exec_prepare_obj(struct drm_exec *exec, struct drm_gem_object *obj,
+			 unsigned int num_fences)
+{
+    int ret;
+
+    ret = drm_exec_lock_contended(exec);
+    if (unlikely(ret))
+    return ret;
+
+    if (exec->prelocked == obj) {
+    drm_gem_object_put(exec->prelocked);
+    exec->prelocked = NULL;
+
+    return dma_resv_reserve_fences(obj->resv, num_fences);
+    }
+
+    if (exec->flags & DRM_EXEC_FLAG_INTERRUPTIBLE)
+    ret = dma_resv_lock_interruptible(obj->resv, &exec->ticket);
+    else
+    ret = dma_resv_lock(obj->resv, &exec->ticket);
+
+    if (unlikely(ret == -EDEADLK)) {
+    drm_gem_object_get(obj);
+    exec->contended = obj;
+    return -EDEADLK;
+    }
+
+    if (unlikely(ret == -EALREADY &&
+    (exec->flags & DRM_EXEC_FLAG_ALLOW_DUPLICATES)))
+    goto reserve_fences;
+
+    if (unlikely(ret))
+    return ret;
+
+    ret = drm_exec_obj_locked(exec, obj);
+    if (ret)
+    goto error_unlock;
+
+reserve_fences:
+    /* Keep locked when reserving fences fails */
+    return dma_resv_reserve_fences(obj->resv, num_fences);


Ugh, what is the use-case for keeping things locked here? How would 
a caller tell the difference between an error where everything is 
locked and one where nothing is locked? IMO, we should unlock on error 
here. If there indeed is a use-case we should add a separate function 
for reserving fences for all locked objects, rather than sometimes 
returning locked on error and sometimes not.


We return the object locked here because it was too much churn to 
remove it again from the array and we are getting fully cleaned up 
at the end anyway.


OK, so if we add an unlock functionality, we could just have a 
consistent locking state on error return?


Yeah, that should work. Going to work on this.
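For reference, an unlock helper giving that consistent locking state
could look roughly like the sketch below. This is only an illustration
against the array-based tracking in this patch; the name and exact
bookkeeping are assumptions, not the actual follow-up implementation:

static void drm_exec_unlock_obj(struct drm_exec *exec,
				struct drm_gem_object *obj)
{
	unsigned int i;

	for (i = 0; i < exec->num_objects; i++) {
		if (exec->objects[i] != obj)
			continue;

		dma_resv_unlock(obj->resv);
		drm_gem_object_put(obj);
		/* Keep the array dense: move the last entry into the hole. */
		exec->objects[i] = exec->objects[--exec->num_objects];
		return;
	}
}

With something like this, prepare_obj() could unlock on fence reservation
failure and still leave the caller with a well-defined set of locked
objects.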


Great.

Thanks,

Thomas

Regards,
Christian.



Thanks,
Thomas



Regards,
Christian.



Thanks,

Thomas



+
+error_unlock:
+    dma_resv_unlock(obj->resv);
+    return ret;
+}
+EXPORT_SYMBOL(drm_exec_prepare_obj);
+
+/**
+ * drm_exec_prepare_array - helper to prepare an array of objects
+ * @exec: the drm_exec object with the state
+ * @objects: array of GEM object to prepare
+ * @num_objects: number of GEM objects in the array
+ * @num_fences: number of fences to reserve on each GEM object
+ *
+ * Prepares all GEM objects in an array, handles contention but aborts on
+ * first error otherwise. Reserves @num_fences on each GEM object after
+ * locking it.

+ *
+ * Returns: -EALREADY when object is already locked, -ENOMEM when memory
+ * allocation failed and zero for success.
+ */
+int drm_exec_prepare_array(struct drm_exec *exec,
+  

Re: [PATCH 01/13] drm: execution context for GEM buffers v4

2023-06-19 Thread Intel

Hi!

On 6/19/23 11:20, Christian König wrote:

Hi guys,

On 19.06.23 at 10:59, Thomas Hellström (Intel) wrote:

[SNIP]


I really need to find some time to work on that anyway.

I've been playing with drm_exec for a couple weeks now, and I wanted
to share something I hacked to try and make the API simpler and
more robust against misuse (see the below diff, which is a slightly
adjusted version of your work).


It would be good if we could have someone taking charge of this 
series and addressing all review comments; I see some of my comments 
getting lost, we have multiple submitters, and I can't find a 
dri-devel patchwork entry for this. Anyway, some comments below.


I can try to find some time for the series this week (As long as 
nobody comes along and has any burning roof).


In this version, the user is no longer in control of the retry
loop. Instead, it provides an expression (a call to a
sub-function) to be re-evaluated each time a contention is
detected. IMHO, this makes the 'prepare-objs' functions easier to
apprehend, and avoids any mistake like calling
drm_exec_continue_on_contention() in an inner loop, or breaking
out of the drm_exec_while_all_locked() loop unintentionally.


In i915 we've had a very similar helper to this, and while I agree this 
newer version would probably help make code cleaner, OTOH there also 
are some places where the short drm_exec_while_all_locked()-like blocks 
don't really motivate a separate function. Porting i915 to the current 
version will take some work. For the xe driver both versions would work 
fine.


Yeah, this is actually what my first version of this looked like. But 
I abandoned that approach because we have a lot of cases where we just 
quickly want to lock a few GEM objects and don't want the extra 
overhead of putting all the state into some bag to forward it to a 
function.

Some additional review comments not related to the interface change 
below:

It also makes the internal management a bit simpler, since we
no longer call drm_exec_cleanup() on the first attempt, and can
thus get rid of the DRM_EXEC_DUMMY trick.

In the below diff, I also re-introduced native support for
duplicates as an opt-in, so we don't have to do things like:

ret = drm_exec_prepare_obj(exec, obj, num_fences);
if (ret == -EALREADY)
    ret = dma_resv_reserve_fences(obj->resv, num_fences);
if (ret)
    return ret;

and can just do:

ret = drm_exec_prepare_obj(exec, obj, num_fences);
if (ret)
    return ret;

Of course drivers can open-code a wrapper doing the same thing, but
given at least pvr and radeon need this, it'd be nice if the core
could support it natively.
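The open-coded wrapper mentioned above would amount to something like this
hypothetical driver-side helper (the name is made up; the body is exactly
the pattern quoted earlier):

static int my_exec_prepare_obj_allow_dup(struct drm_exec *exec,
					 struct drm_gem_object *obj,
					 unsigned int num_fences)
{
	int ret = drm_exec_prepare_obj(exec, obj, num_fences);

	/* Treat a duplicate as success, but still reserve the fence slots. */
	if (ret == -EALREADY)
		ret = dma_resv_reserve_fences(obj->resv, num_fences);
	return ret;
}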

That's mostly it. Just wanted to share what I had in case you're
interested. If not, that's fine too.

Regards,

Boris
---
  Documentation/gpu/drm-mm.rst |  12 ++
  drivers/gpu/drm/Kconfig  |   6 +
  drivers/gpu/drm/Makefile |   2 +
  drivers/gpu/drm/drm_exec.c   | 274 +++
  include/drm/drm_exec.h   | 130 +
  5 files changed, 424 insertions(+)
  create mode 100644 drivers/gpu/drm/drm_exec.c
  create mode 100644 include/drm/drm_exec.h

diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
index fe40ee686f6e..c9f120cfe730 100644
--- a/Documentation/gpu/drm-mm.rst
+++ b/Documentation/gpu/drm-mm.rst
@@ -524,6 +524,18 @@ DRM Sync Objects
  .. kernel-doc:: drivers/gpu/drm/drm_syncobj.c
 :export:
  +DRM Execution context
+=====================
+
+.. kernel-doc:: drivers/gpu/drm/drm_exec.c
+   :doc: Overview
+
+.. kernel-doc:: include/drm/drm_exec.h
+   :internal:
+
+.. kernel-doc:: drivers/gpu/drm/drm_exec.c
+   :export:
+
  GPU Scheduler
  =============
  diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index 76991720637c..01a38fcdb1c4 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -194,6 +194,12 @@ config DRM_TTM
    GPU memory types. Will be enabled automatically if a device driver
    uses it.
  +config DRM_EXEC
+    tristate
+    depends on DRM
+    help
+  Execution context for command submissions
+
  config DRM_BUDDY
  tristate
  depends on DRM
diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index 1873f64db171..18a02eaf2d49 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -79,6 +79,8 @@ obj-$(CONFIG_DRM_PANEL_ORIENTATION_QUIRKS) += drm_panel_orientation_quirks.o
  #
  # Memory-management helpers
  #
+#
+obj-$(CONFIG_DRM_EXEC) += drm_exec.o
    obj-$(CONFIG_DRM_BUDDY) += drm_buddy.o
  diff --git a/drivers/gpu/drm/drm_exec.c b/drivers/gpu/drm/drm_exec.c
new file mode 100644
index ..e0ad1a3e1610
--- /dev/null
+++ b/drivers/gpu/drm/drm_exec.c
@@ -0,0 +1,274 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+
+#include <drm/drm_exec.h>
+#include <drm/drm_gem.h>
+#include <linux/dma-resv.h>
+
+/**
+ * DOC: Overview
+ *
+ * This component mainly abstracts the retry loop necessary for locking
+ * multiple GEM objects while preparing …

Re: [PATCH 01/13] drm: execution context for GEM buffers v4

2023-06-19 Thread Intel



On 6/17/23 13:54, Boris Brezillon wrote:

+Matthew who's been using drm_exec in Xe if I'm correct.

Hello Christian,

On Wed, 14 Jun 2023 15:02:52 +0200
Boris Brezillon  wrote:


On Wed, 14 Jun 2023 14:30:53 +0200
Christian König  wrote:


On 14.06.23 at 14:23, Boris Brezillon wrote:

Hi Christian,

On Thu,  4 May 2023 13:51:47 +0200
"Christian König"  wrote:


This adds the infrastructure for an execution context for GEM buffers
which is similar to the existing TTMs execbuf util and intended to replace
it in the long term.

The basic functionality is that we abstract the necessary loop to lock
many different GEM buffers with automated deadlock and duplicate handling.

As many other drivers do already, we are considering using drm_exec()
for our resv locking in the PowerVR driver, so we might have more
questions/comments in the coming days/weeks, but I already have a
couple right now (see below).


v3: drop duplicate tracking, radeon is really the only one needing that

I think we'd actually be interested in duplicate tracking. Is there any
way we can make it an optional feature through some extra helpers/flags?
Doesn't have to be done in this patch series, I'm just wondering if this
is something we can share as well.

You can still capture the -EALREADY error and act appropriately in your
driver.

For radeon it just means ignoring the error code and going ahead, but
that behavior doesn't seem to be desired in most cases.

Initially I though we need to separately track how many and how often
BOs are duplicated, but there is simply no use for this.
   

[...]


+/**
+ * DOC: Overview
+ *
+ * This component mainly abstracts the retry loop necessary for locking
+ * multiple GEM objects while preparing hardware operations (e.g. command
+ * submissions, page table updates etc..).
+ *
+ * If a contention is detected while locking a GEM object the cleanup procedure
+ * unlocks all previously locked GEM objects and locks the contended one first
+ * before locking any further objects.
+ *
+ * After an object is locked fences slots can optionally be reserved on the
+ * dma_resv object inside the GEM object.
+ *
+ * A typical usage pattern should look like this::
+ *
+ * struct drm_gem_object *obj;
+ * struct drm_exec exec;
+ * unsigned long index;
+ * int ret;
+ *
+ * drm_exec_init(&exec, true);
+ * drm_exec_while_not_all_locked(&exec) {
+ * ret = drm_exec_prepare_obj(&exec, boA, 1);
+ * drm_exec_continue_on_contention(&exec);
+ * if (ret)
+ * goto error;
+ *

Have you considered defining a drm_exec_try_prepare_obj_or_retry()
combining drm_exec_prepare_obj() and drm_exec_continue_on_contention()?

#define drm_exec_try_prepare_obj_or_retry(exec, obj, num_fences) \
  ({ \
  int __ret = drm_exec_prepare_obj(exec, obj, num_fences); \
  if (unlikely(drm_exec_is_contended(exec))) \
  continue; \
  __ret; \
  })

This way the following pattern

ret = drm_exec_prepare_obj(&exec, boA, 1);
drm_exec_continue_on_contention(&exec);
if (ret)
goto error;

can be turned into something more conventional:

ret = drm_exec_try_prepare_obj_or_retry(&exec, boA, 1);
if (ret)
goto error;

Yeah, I was considering that as well. But then I abandoned it as too
complicated.

I really need to find some time to work on that anyway.

I've been playing with drm_exec for a couple weeks now, and I wanted
to share something I hacked to try and make the API simpler and
more robust against misuse (see the below diff, which is a slightly
adjusted version of your work).


It would be good if we could have someone taking charge of this series 
and addressing all review comments; I see some of my comments getting lost, 
we have multiple submitters, and I can't find a dri-devel patchwork entry 
for this. Anyway, some comments below.

In this version, the user is no longer in control of the retry
loop. Instead, it provides an expression (a call to a
sub-function) to be re-evaluated each time a contention is
detected. IMHO, this makes the 'prepare-objs' functions easier to
apprehend, and avoids any mistake like calling
drm_exec_continue_on_contention() in an inner loop, or breaking
out of the drm_exec_while_all_locked() loop unintentionally.


In i915 we've had a very similar helper to this, and while I agree this 
newer version would probably help make code cleaner, OTOH there also 
are some places where the short drm_exec_while_all_locked()-like blocks 
don't really motivate a separate function. Porting i915 to the current 
version will take some work. For the xe driver both versions would work 
fine.


Some additional review comments not related to the interface change below:



It also makes the internal management a bit simpler, since we
no longer call drm_exec_cleanup() on the first attempt, and can
thus get rid of the DRM_EXEC_DUMMY trick. …

Re: [PATCH 01/13] drm: execution context for GEM buffers v4

2023-05-04 Thread Intel



On 5/4/23 13:51, Christian König wrote:

This adds the infrastructure for an execution context for GEM buffers
which is similar to the existing TTMs execbuf util and intended to replace
it in the long term.

The basic functionality is that we abstract the necessary loop to lock
many different GEM buffers with automated deadlock and duplicate handling.

v2: drop xarray and use dynamic resized array instead, the locking
 overhead is unnecessary and measurable.
v3: drop duplicate tracking, radeon is really the only one needing that.
v4: fixes issues pointed out by Danilo, some typos in comments and a
 helper for lock arrays of GEM objects.

Signed-off-by: Christian König 

...

+/**
+ * struct drm_exec - Execution context
+ */
+struct drm_exec {
+   /**
+* @interruptible: If locks should be taken interruptible
+*/
+   boolinterruptible;
+
+   /**
+* @ticket: WW ticket used for acquiring locks
+*/
+   struct ww_acquire_ctx   ticket;
+
+   /**
+* @num_objects: number of objects locked
+*/
+   unsigned intnum_objects;
+
+   /**
+* @max_objects: maximum objects in array
+*/
+   unsigned intmax_objects;
+
+   /**
+* @objects: array of the locked objects
+*/
+   struct drm_gem_object   **objects;


Hi, Christian. Did you consider using a list here with links embedded in 
gem objects, now that only locked objects are to be put on the list / array?


That should work as only the process owning the lock may access the list 
link. Apart from getting rid of reallocing this is beneficial for the 
more general types of ww transactions that are used by i915 (and to a 
minor extent xe as well, I think).


In those cases we would want to unlock a temporary held object within 
the while_not_all_locked() loop and would then have to search the entire 
array for the correct pointer. Instead one could just remove it from the 
list of locked objects.
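As a sketch of that idea, assuming struct drm_gem_object grew a
struct list_head exec_link member and drm_exec kept a locked_list instead
of the array (both purely illustrative, not in the posted patch):

/* Track locked objects on a list instead of a reallocated array. */
static void drm_exec_track_locked(struct drm_exec *exec,
				  struct drm_gem_object *obj)
{
	/* Only the ticket owner may touch obj->exec_link, so no extra lock. */
	list_add_tail(&obj->exec_link, &exec->locked_list);
}

static void drm_exec_drop_locked(struct drm_exec *exec,
				 struct drm_gem_object *obj)
{
	/* O(1) removal: no array search when unlocking a temporary object. */
	list_del(&obj->exec_link);
	dma_resv_unlock(obj->resv);
}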


Thanks,

Thomas



Re: [PATCH 1/9] drm: execution context for GEM buffers v3

2023-03-10 Thread Intel

Hi Christian

On 2/28/23 09:33, Christian König wrote:

This adds the infrastructure for an execution context for GEM buffers
which is similar to the existing TTMs execbuf util and intended to replace
it in the long term.

The basic functionality is that we abstract the necessary loop to lock
many different GEM buffers with automated deadlock and duplicate handling.

v2: drop xarray and use dynamic resized array instead, the locking
 overhead is unnecessary and measurable.
v3: drop duplicate tracking, radeon is really the only one needing that.

Signed-off-by: Christian König 


Nice. This seems to have all we need for now for Xe as well, although 
not for i915 ATM.


I see future extension of this for things like sleeping lock evictions 
(where locks may go out of scope during the locking loop), including 
other resv containers, passing through dma-buf move_notify() etc. What 
are your thoughts around that? Extend this, or build a different 
functionality and make this a specialization of that?


What about the drm_gem_lock_reservations() interface? While a bit less 
capable, it's confusing having two APIs doing essentially the same thing; 
could we deprecate that?


Finally a comment below:



---
  Documentation/gpu/drm-mm.rst |  12 ++
  drivers/gpu/drm/Kconfig  |   6 +
  drivers/gpu/drm/Makefile |   2 +
  drivers/gpu/drm/drm_exec.c   | 249 +++
  include/drm/drm_exec.h   | 115 
  5 files changed, 384 insertions(+)
  create mode 100644 drivers/gpu/drm/drm_exec.c
  create mode 100644 include/drm/drm_exec.h

diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
index a79fd3549ff8..a52e6f4117d6 100644
--- a/Documentation/gpu/drm-mm.rst
+++ b/Documentation/gpu/drm-mm.rst
@@ -493,6 +493,18 @@ DRM Sync Objects
  .. kernel-doc:: drivers/gpu/drm/drm_syncobj.c
 :export:
  
+DRM Execution context

+=====================
+
+.. kernel-doc:: drivers/gpu/drm/drm_exec.c
+   :doc: Overview
+
+.. kernel-doc:: include/drm/drm_exec.h
+   :internal:
+
+.. kernel-doc:: drivers/gpu/drm/drm_exec.c
+   :export:
+
  GPU Scheduler
  =============
  
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig

index 17d252dc25e2..84a5fc28c48d 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -200,6 +200,12 @@ config DRM_TTM
  GPU memory types. Will be enabled automatically if a device driver
  uses it.
  
+config DRM_EXEC

+   tristate
+   depends on DRM
+   help
+ Execution context for command submissions
+
  config DRM_BUDDY
tristate
depends on DRM
diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index ab4460fcd63f..d40defbb0347 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -78,6 +78,8 @@ obj-$(CONFIG_DRM_PANEL_ORIENTATION_QUIRKS) += drm_panel_orientation_quirks.o
  #
  # Memory-management helpers
  #
+#
+obj-$(CONFIG_DRM_EXEC) += drm_exec.o
  
  obj-$(CONFIG_DRM_BUDDY) += drm_buddy.o
  
diff --git a/drivers/gpu/drm/drm_exec.c b/drivers/gpu/drm/drm_exec.c

new file mode 100644
index ..df546cc5a227
--- /dev/null
+++ b/drivers/gpu/drm/drm_exec.c
@@ -0,0 +1,249 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+
+#include <drm/drm_exec.h>
+#include <drm/drm_gem.h>
+#include <linux/dma-resv.h>
+
+/**
+ * DOC: Overview
+ *
+ * This component mainly abstracts the retry loop necessary for locking
+ * multiple GEM objects while preparing hardware operations (e.g. command
+ * submissions, page table updates etc..).
+ *
+ * If a contention is detected while locking a GEM object the cleanup procedure
+ * unlocks all previously locked GEM objects and locks the contended one first
+ * before locking any further objects.
+ *
+ * After an object is locked fences slots can optionally be reserved on the
+ * dma_resv object inside the GEM object.
+ *
+ * A typical usage pattern should look like this::
+ *
+ * struct drm_gem_object *obj;
+ * struct drm_exec exec;
+ * unsigned long index;
+ * int ret;
+ *
+ * drm_exec_init(&exec, true);
+ * drm_exec_while_not_all_locked(&exec) {
+ * ret = drm_exec_prepare_obj(&exec, boA, 1);
+ * drm_exec_continue_on_contention(&exec);
+ * if (ret)
+ * goto error;
+ *
+ * ret = drm_exec_lock(&exec, boB, 1);


Should be drm_exec_prepare_obj()?


+ * drm_exec_continue_on_contention(&exec);
+ * if (ret)
+ * goto error;
+ * }
+ *
+ * drm_exec_for_each_locked_object(&exec, index, obj) {
+ * dma_resv_add_fence(obj->resv, fence, DMA_RESV_USAGE_READ);
+ * ...
+ * }
+ * drm_exec_fini(&exec);
+ *
+ * See struct drm_exec for more details.
+ */
+
+/* Dummy value used to initially enter the retry loop */
+#define DRM_EXEC_DUMMY (void*)~0
+
+/* Unlock all objects and drop references */
+static void drm_exec_unlock_all(struct drm_exec *exec)
+{
+   struct drm_gem_object *obj;
+   unsigned long index;
+
+   

Re: [PATCH 9/9] drm: move ttm_execbuf_util into vmwgfx

2023-03-09 Thread Intel

Hi,

On 3/9/23 06:14, Zack Rusin wrote:

On Wed, 2023-03-08 at 10:10 +0100, Christian König wrote:

On 08.03.23 at 06:14, Zack Rusin wrote:

On Tue, 2023-02-28 at 09:34 +0100, Christian König wrote:

VMWGFX is the only remaining user of this and should probably be moved over
to drm_exec when it starts using GEM as well.

Is this because vmwgfx piggybacks buffer-id relocations on top of ttm
validations or
did you just find it too hard to port it over? I'd prefer to avoid ttm moves to
vmwgfx and at least have a clear idea of what we need to do to port.

I've just found it too hard to port it over because vmwgfx does some
strange things with the validation code here.

If you want we can take a deeper look at this together, but I need to
find some time.

Alternatively just tell me how to do it and I will add that to the patch
set :)

I don't want to hold up the set (it looks good btw), because I had to look at
something else today and tomorrow.

We overload the validation lists to do quite a bit more than just reservations
though.

There are, I think, four separate things that need to be refactored there
(Christian, feel free to skip this section, this is mainly for VMware folks on 
the
team):
1) Relocations - userspace uses the id's of the bo's in the command stream,
but on the kernel side those id's are different (or in vmwgfx terminology
gem id != mob id), so the buffer id's in the command stream need to be
replaced,
2) Resource validation. vmwgfx splits the userspace objects into buffers and
resources (shaders, surfaces, contexts). The resources are not buffers but
are backed by them. A single buffer can back multiple different resources,
and sometimes the kernel has to actually allocate a buffer to back a
resource and attach it to it (i.e. in common terminology the buffer is the
memory and resources are placed in it). Now this shouldn't be in the kernel
at all; the resources shouldn't have been kernel objects and instead we
should have left them completely to userspace.


The reason behind this is partly historical, since initially the 
resources (IIRC surfaces and shaders at that time) were per-device, not 
per-context, and not backed by buffer objects.

The main reason for not doing the transition to user-space for resources 
was the SVGA device "between-draw-call-validation". If you for example 
unbound a mob that was backing a resource that was bound to a context, 
you'd need to start a whole chain of resource invalidations to avoid the 
device complaining about invalid setups at awkward moments, for example 
when it felt like suspending.


Not sure whether that is gone now though, or whether there are better 
ways to handle that.


/Thomas



3) Coherency tracking. We use validation lists as a central place for
tracking which bo's/resources are used in a command buffer, and we use it
to keep track of which buffers/resources will end up dirty to implement
coherency.
4) Central place to allocate memory for relocation/validation nodes.

Where we want to end up is with 2 completely gone from the kernel side and
1, 3 and 4 refactored and cleaned up. I think there's at least 4 separate
patches to this port, so it's not a trivial thing. We will take a look at
this on Friday in more detail to see what we can do.

z


Re: [PATCH 9/9] drm: move ttm_execbuf_util into vmwgfx

2023-03-09 Thread Intel



On 3/8/23 10:10, Christian König wrote:

On 08.03.23 at 06:14, Zack Rusin wrote:

On Tue, 2023-02-28 at 09:34 +0100, Christian König wrote:
VMWGFX is the only remaining user of this and should probably be moved
over to drm_exec when it starts using GEM as well.

Is this because vmwgfx piggybacks buffer-id relocations on top of ttm
validations or did you just find it too hard to port it over? I'd prefer
to avoid ttm moves to vmwgfx and at least have a clear idea of what we
need to do to port.


I've just found it too hard to port it over because vmwgfx does some 
strange things with the validation code here.


If you want we can take a deeper look at this together, but I need to 
find some time.


Alternatively just tell me how to do it and I will add that to the 
patch set :)


Regards,
Christian.


Hmm.

It turns out Xe was using these from the very start as well. But I will 
take a look at what it takes to deprecate that usage, so don't let that 
stop this removal. We need a more flexible WW transaction handling anyway.


/Thomas

z


Re: [PATCH v6 02/22] drm/gem: Move mapping of imported dma-bufs to drm_gem_mmap_obj()

2022-06-29 Thread Intel



On 6/29/22 10:22, Dmitry Osipenko wrote:

On 6/29/22 09:40, Thomas Hellström (Intel) wrote:

On 5/27/22 01:50, Dmitry Osipenko wrote:

Drivers that use drm_gem_mmap() and drm_gem_mmap_obj() helpers don't
handle imported dma-bufs properly, which results in mapping of something
other than the imported dma-buf. For example, on NVIDIA Tegra we get a
hard lockup when userspace writes to the memory mapping of a dma-buf that
was imported into Tegra's DRM GEM.

To fix this bug, move mapping of imported dma-bufs to drm_gem_mmap_obj().
Now mmaping of imported dma-bufs works properly for all DRM drivers.

Same comment about Fixes: as in patch 1,

Cc: sta...@vger.kernel.org
Signed-off-by: Dmitry Osipenko 
---
   drivers/gpu/drm/drm_gem.c  | 3 +++
   drivers/gpu/drm/drm_gem_shmem_helper.c | 9 -
   drivers/gpu/drm/tegra/gem.c    | 4 
   3 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 86d670c71286..7c0b025508e4 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -1038,6 +1038,9 @@ int drm_gem_mmap_obj(struct drm_gem_object *obj, unsigned long obj_size,
   if (obj_size < vma->vm_end - vma->vm_start)
   return -EINVAL;
   +    if (obj->import_attach)
+    return dma_buf_mmap(obj->dma_buf, vma, 0);

If we start enabling mmaping of imported dma-bufs on a majority of
drivers in this way, how do we ensure that user-space is not blindly
using the object mmap without calling DMA_BUF_IOCTL_SYNC, which is
needed before and after CPU access of mmap'ed dma-bufs?

I was under the impression (admittedly without looking) that the few
drivers that actually called into dma_buf_mmap() had some private
user-mode driver code in place that ensured this happened.

Since it's userspace who does the mapping, it should be the
responsibility of userspace to do all the necessary syncing.
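For concreteness, the syncing in question is the DMA_BUF_IOCTL_SYNC
bracketing from the dma-buf uapi; a minimal userspace sketch of a CPU
write to an mmap'ed dma-buf (write_dmabuf() is just an example name):

#include <linux/dma-buf.h>
#include <string.h>
#include <sys/ioctl.h>

static int write_dmabuf(int dmabuf_fd, void *map, size_t len)
{
	struct dma_buf_sync sync = {
		.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE,
	};

	if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync))
		return -1;

	memset(map, 0, len);	/* the actual CPU access */

	sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
	return ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
}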


Sure, but nothing prohibits user-space from ignoring the syncing, thinking 
"it works anyway" when testing those drivers where the syncing is a NOP. And 
when a driver that actually needs syncing is finally tested, it's too late 
to fix all broken user-space.



I'm not sure whether anyone in userspace really needs to map imported
dma-bufs in practice. Nevertheless, this use-case is broken and should be
fixed by either allowing the mapping or prohibiting it.

Then I'd vote for prohibiting it, at least for now. And moving forward 
we could perhaps revisit the dma-buf need for syncing, requiring those 
drivers that actually need it to implement emulated coherent memory, which 
can be done not too inefficiently (vmwgfx being one example).


/Thomas

Re: [PATCH v6 08/22] drm/virtio: Unlock reservations on dma_resv_reserve_fences() error

2022-06-29 Thread Intel



On 5/27/22 01:50, Dmitry Osipenko wrote:

Unlock reservations on dma_resv_reserve_fences() error to fix recursive
locking of the reservations when this error happens.

Fixes:

Cc: sta...@vger.kernel.org


With that fixed,

Reviewed-by: Thomas Hellström 



Signed-off-by: Dmitry Osipenko 
---
  drivers/gpu/drm/virtio/virtgpu_gem.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/virtio/virtgpu_gem.c 
b/drivers/gpu/drm/virtio/virtgpu_gem.c
index 580a78809836..7db48d17ee3a 100644
--- a/drivers/gpu/drm/virtio/virtgpu_gem.c
+++ b/drivers/gpu/drm/virtio/virtgpu_gem.c
@@ -228,8 +228,10 @@ int virtio_gpu_array_lock_resv(struct virtio_gpu_object_array *objs)
  
  	for (i = 0; i < objs->nents; ++i) {
  		ret = dma_resv_reserve_fences(objs->objs[i]->resv, 1);
-   if (ret)
+   if (ret) {
+   virtio_gpu_array_unlock_resv(objs);
return ret;
+   }
}
return ret;
  }


Re: [PATCH v6 02/22] drm/gem: Move mapping of imported dma-bufs to drm_gem_mmap_obj()

2022-06-29 Thread Intel



On 5/27/22 01:50, Dmitry Osipenko wrote:

Drivers that use drm_gem_mmap() and drm_gem_mmap_obj() helpers don't
handle imported dma-bufs properly, which results in mapping of something
other than the imported dma-buf. For example, on NVIDIA Tegra we get a hard
lockup when userspace writes to the memory mapping of a dma-buf that was
imported into Tegra's DRM GEM.

To fix this bug, move mapping of imported dma-bufs to drm_gem_mmap_obj().
Now mmaping of imported dma-bufs works properly for all DRM drivers.

Same comment about Fixes: as in patch 1,


Cc: sta...@vger.kernel.org
Signed-off-by: Dmitry Osipenko 
---
  drivers/gpu/drm/drm_gem.c  | 3 +++
  drivers/gpu/drm/drm_gem_shmem_helper.c | 9 -
  drivers/gpu/drm/tegra/gem.c| 4 
  3 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 86d670c71286..7c0b025508e4 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -1038,6 +1038,9 @@ int drm_gem_mmap_obj(struct drm_gem_object *obj, unsigned long obj_size,
if (obj_size < vma->vm_end - vma->vm_start)
return -EINVAL;
  
+	if (obj->import_attach)
+		return dma_buf_mmap(obj->dma_buf, vma, 0);


If we start enabling mmaping of imported dma-bufs on a majority of 
drivers in this way, how do we ensure that user-space is not blindly 
using the object mmap without calling DMA_BUF_IOCTL_SYNC, which is 
needed before and after CPU access of mmap'ed dma-bufs?


I was under the impression (admittedly without looking) that the few 
drivers that actually called into dma_buf_mmap() had some private 
user-mode driver code in place that ensured this happened.


/Thomas



+
/* Take a ref for this mapping of the object, so that the fault
 * handler can dereference the mmap offset's pointer to the object.
 * This reference is cleaned up by the corresponding vm_close
diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c 
b/drivers/gpu/drm/drm_gem_shmem_helper.c
index 8ad0e02991ca..6190f5018986 100644
--- a/drivers/gpu/drm/drm_gem_shmem_helper.c
+++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
@@ -609,17 +609,8 @@ EXPORT_SYMBOL_GPL(drm_gem_shmem_vm_ops);
   */
  int drm_gem_shmem_mmap(struct drm_gem_shmem_object *shmem, struct vm_area_struct *vma)
  {
-	struct drm_gem_object *obj = &shmem->base;
int ret;
  
-	if (obj->import_attach) {
-   /* Drop the reference drm_gem_mmap_obj() acquired.*/
-   drm_gem_object_put(obj);
-   vma->vm_private_data = NULL;
-
-   return dma_buf_mmap(obj->dma_buf, vma, 0);
-   }
-
ret = drm_gem_shmem_get_pages(shmem);
if (ret) {
drm_gem_vm_close(vma);
diff --git a/drivers/gpu/drm/tegra/gem.c b/drivers/gpu/drm/tegra/gem.c
index 7c7dd84e6db8..f92aa20d63bb 100644
--- a/drivers/gpu/drm/tegra/gem.c
+++ b/drivers/gpu/drm/tegra/gem.c
@@ -564,6 +564,10 @@ int __tegra_gem_mmap(struct drm_gem_object *gem, struct vm_area_struct *vma)
  {
struct tegra_bo *bo = to_tegra_bo(gem);
  
+	/* imported dma-buf is mapped by drm_gem_mmap_obj() */
+   if (gem->import_attach)
+   return 0;
+
if (!bo->pages) {
unsigned long vm_pgoff = vma->vm_pgoff;
int err;


Re: [PATCH v6 14/22] dma-buf: Introduce new locking convention

2022-06-28 Thread Intel



On 5/30/22 15:57, Dmitry Osipenko wrote:

On 5/30/22 16:41, Christian König wrote:

Hi Dmitry,

On 30.05.22 at 15:26, Dmitry Osipenko wrote:

Hello Christian,

On 5/30/22 09:50, Christian König wrote:

Hi Dmitry,

First of all please separate out this patch from the rest of the series,
since this is a complex separate structural change.

I assume all the patches will go via the DRM tree in the end since the
rest of the DRM patches in this series depend on this dma-buf change.
But I see that separation may ease reviewing of the dma-buf changes, so
let's try it.

That sounds like you are underestimating a bit how much trouble this
will be.


I have tried this before and failed because catching all the locks in
the right code paths are very tricky. So expect some fallout from this
and make sure the kernel test robot and CI systems are clean.

Sure, I'll fix up all the reported things in the next iteration.

BTW, have you ever posted yours version of the patch? Will be great if
we could compare the changed code paths.

No, I never even finished creating it after realizing how much work it
would be.


This patch introduces a new locking convention for dma-buf users. From
now on all dma-buf importers are responsible for holding the dma-buf
reservation lock around operations performed over dma-bufs.

This patch implements the new dma-buf locking convention by:

     1. Making dma-buf API functions take the reservation lock.

     2. Adding new locked variants of the dma-buf API functions for
        drivers that need to manage imported dma-bufs under the held lock.

Instead of adding new locked variants please mark all variants which
expect to be called without a lock with an _unlocked postfix.

This should make it easier to remove those in a follow up patch set and
then fully move the locking into the importer.
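The suggested convention would amount to pairs like the following
hypothetical example (dma_buf_do_thing() is a stand-in name, not a real
dma-buf function):

/* Locked variant: caller must hold dmabuf->resv. */
int dma_buf_do_thing(struct dma_buf *dmabuf);

/* _unlocked variant: takes and drops the reservation lock internally. */
static inline int dma_buf_do_thing_unlocked(struct dma_buf *dmabuf)
{
	int ret;

	dma_resv_lock(dmabuf->resv, NULL);
	ret = dma_buf_do_thing(dmabuf);
	dma_resv_unlock(dmabuf->resv);
	return ret;
}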

Do we really want to move all the locks to the importers? Seems the
majority of drivers should be happy with the dma-buf helpers handling
the locking for them.

Yes, I clearly think so.


     3. Converting all drivers to the new locking scheme.

I have strong doubts that you got all of them. At least radeon and
nouveau should grab the reservation lock in their ->attach callbacks
somehow.

Radeon and Nouveau use gem_prime_import_sg_table() and they take resv
lock already, seems they should be okay (?)

You are looking at the wrong side. You need to fix the export code path,
not the import ones.

See for example attach on radeon works like this
drm_gem_map_attach->drm_gem_pin->radeon_gem_prime_pin->radeon_bo_reserve->ttm_bo_reserve->dma_resv_lock.

Yeah, I was looking at the both sides, but missed this one.


Also i915 will run into trouble with attach. In particular since i915 
starts a full ww transaction in its attach callback to be able to lock 
other objects if migration is needed. I think i915 CI would catch this 
in a selftest.


Perhaps it's worthwhile to take a step back and figure out, if the 
importer is required to lock, which callbacks might need a ww acquire 
context?


(And off-topic: since we do a lot of fancy stuff under dma-resv locks, 
including waiting for fences and other locks, IMO taking these locks 
uninterruptibly should ring a warning bell.)


/Thomas

Same for nouveau and probably a few other exporters as well. That will
certainly cause a deadlock if you don't fix it.

I strongly suggest to do this step by step, first attach/detach and then
the rest.

Thank you very much for the suggestions. I'll implement them in the next
version.



Re: [PATCH v6 01/22] drm/gem: Properly annotate WW context on drm_gem_lock_reservations() error

2022-06-28 Thread Intel

Hi,

On 5/27/22 01:50, Dmitry Osipenko wrote:

Use ww_acquire_fini() in the error code paths. Otherwise lockdep
thinks that lock is held when lock's memory is freed after the
drm_gem_lock_reservations() error. The WW needs to be annotated
as "freed"


s /WW/ww_acquire_context/ ?
s /"freed"/"released"/ ?



, which fixes the noisy "WARNING: held lock freed!" splat
of VirtIO-GPU driver with CONFIG_DEBUG_MUTEXES=y and enabled lockdep.

Cc: sta...@vger.kernel.org


Can you dig up the commit in error and add a Fixes: Tag?

Using that and "dim fixes" will also make the Cc: stable tag a bit more 
verbose.


With that fixed,

Reviewed-by: Thomas Hellström 



Signed-off-by: Dmitry Osipenko 
---
  drivers/gpu/drm/drm_gem.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index eb0c2d041f13..86d670c71286 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -1226,7 +1226,7 @@ drm_gem_lock_reservations(struct drm_gem_object **objs, 
int count,
ret = dma_resv_lock_slow_interruptible(obj->resv,
 acquire_ctx);
if (ret) {
-   ww_acquire_done(acquire_ctx);
+   ww_acquire_fini(acquire_ctx);
return ret;
}
}
@@ -1251,7 +1251,7 @@ drm_gem_lock_reservations(struct drm_gem_object **objs, 
int count,
goto retry;
}
  
-			ww_acquire_done(acquire_ctx);

+   ww_acquire_fini(acquire_ctx);
return ret;
}
}


Re: [PATCH 1/2] drm/ttm: cleanup and add TTM_PL_FLAG_TEMPORARY

2021-05-31 Thread Intel


On 5/31/21 2:02 PM, Christian König wrote:

On 31.05.21 at 13:19, Thomas Hellström (Intel) wrote:


On 5/31/21 12:56 PM, Christian König wrote:

On 31.05.21 at 12:46, Thomas Hellström (Intel) wrote:


On 5/31/21 12:32 PM, Christian König wrote:

On 31.05.21 at 11:52, Thomas Hellström (Intel) wrote:

Hi, Lang,

On 5/31/21 10:19 AM, Yu, Lang wrote:

[Public]


Hi,
On 5/27/21 3:30 AM, Lang Yu wrote:

Make TTM_PL_FLAG_* start from zero and add
TTM_PL_FLAG_TEMPORARY flag for temporary
GTT allocation use.

GTT is a driver private acronym, right? And it doesn't look like
TTM_PL_FLAG_TEMPORARY will be used in core TTM, so should we instead set
aside a mask in the PL flag for driver-private use?

Hi Thomas,

Thanks for your comments and advice. GTT means Graphics Translation 
Table here, which seems a general acronym. TTM_PL_FLAG_TEMPORARY may 
also be used by other drivers.

I have made other patches for this. Please help review.

Regards,
Lang

My point was really that the flag naming and documentation should 
reflect what core ttm is doing with that flag. If there is no 
specific core TTM usage, IMO we should move it to a driver 
specific flag to avoid future confusion. In particular a writer 
of a generic TTM resource manager should be able to know without 
looking at an old commit message what the placement flag is 
intended for.


So here we add a flag where core TTM forces a bo move on validate 
and that's it. And that appears to be how it's used when an 
amdgpu bo is evicted to GTT, (btw should it be accounted in this 
situation?)


The other use is to force the amdgpu driver to temporarily accept 
it into GTT when there is a lack of space, and IMHO that's a 
driver specific use and we should allocate a mask for driver 
specific flags for that.


So shouldn't this be two flags, really?


Well one flag makes sense for the use case at hand that drivers 
want to signal to TTM that an allocation is only temporary and not 
considered valid.


That we then use this flag to implement temporary GTT allocations 
to avoid problems during eviction is just extending that use case.


OK, but it looked like there were actually two use-cases. One for 
possibly long-term VRAM evictions to GTT, the other for the 
temporary use-case where the hop resource allocations sometimes 
failed. Or did I misunderstand the code?


Ok sounds like we need more documentation here. That's really one 
use case.


Key point is we need temporary allocation during multihop which 
should be handled differently to normal allocations.


Yes, that part is clear from the patches. The part that I can't fit 
into that pattern is why the evict flags when evicting from visible 
VRAM to GTT or ordinary VRAM are marked with TTM_PL_FLAG_TEMPORARY. 
Wouldn't those remain evicted in that placement until re-validated to 
visible VRAM at an unknown future time?


Not necessarily.

The situation we ran into was the following:
1. OOM on VRAM, we try to evict.

2. GTT space is used up as well, ok evict directly to SYSTEM.

3. For VRAM->SYSTEM eviction we use a temporary bounce buffer.

4. Waiting for the bounce buffer to become idle is interrupted by a 
signal, so the BO is still backed by the bounce buffer.


5. Next CS, BO is validated with VRAM|GTT. TTM sees that it is in GTT 
and doesn't move the buffer.


6. CS goes into nirvana because bounce buffers are not meant to be 
used for CS (we can ignore alignment and accounting for them).



Yes, makes sense to me.

/Thomas




Re: [PATCH 1/2] drm/ttm: cleanup and add TTM_PL_FLAG_TEMPORARY

2021-05-31 Thread Intel


On 5/31/21 12:56 PM, Christian König wrote:

On 31.05.21 at 12:46, Thomas Hellström (Intel) wrote:


On 5/31/21 12:32 PM, Christian König wrote:

On 31.05.21 at 11:52, Thomas Hellström (Intel) wrote:

Hi, Lang,

On 5/31/21 10:19 AM, Yu, Lang wrote:

[Public]


Hi,
On 5/27/21 3:30 AM, Lang Yu wrote:

Make TTM_PL_FLAG_* start from zero and add
TTM_PL_FLAG_TEMPORARY flag for temporary
GTT allocation use.

GTT is a driver private acronym, right? And it doesn't look like
TTM_PL_FLAG_TEMPORARY will be used in core TTM, so should we instead set
aside a mask in the PL flag for driver-private use?

Hi Thomas,

Thanks for your comments and advice. GTT means Graphics Translation 
Table here, which seems a general acronym. TTM_PL_FLAG_TEMPORARY may 
also be used by other drivers.

I have made other patches for this. Please help review.

Regards,
Lang

My point was really that the flag naming and documentation should 
reflect what core ttm is doing with that flag. If there is no 
specific core TTM usage, IMO we should move it to a driver specific 
flag to avoid future confusion. In particular a writer of a generic 
TTM resource manager should be able to know without looking at an 
old commit message what the placement flag is intended for.


So here we add a flag where core TTM forces a bo move on validate 
and that's it. And that appears to be how it's used when an amdgpu 
bo is evicted to GTT, (btw should it be accounted in this situation?)


The other use is to force the amdgpu driver to temporarily accept 
it into GTT when there is a lack of space, and IMHO that's a driver 
specific use and we should allocate a mask for driver specific 
flags for that.


So shouldn't this be two flags, really?


Well one flag makes sense for the use case at hand that drivers want 
to signal to TTM that an allocation is only temporary and not 
considered valid.


That we then use this flag to implement temporary GTT allocations to 
avoid problems during eviction is just extending that use case.


OK, but it looked like there were actually two use-cases. One for 
possibly long-term VRAM evictions to GTT, the other for the temporary 
use-case where the hop resource allocations sometimes failed. Or did 
I misunderstand the code?


Ok sounds like we need more documentation here. That's really one use 
case.


Key point is we need temporary allocation during multihop which should 
be handled differently to normal allocations.


Yes, that part is clear from the patches. The part that I can't fit into 
that pattern is why the evict flags when evicting from visible VRAM to 
GTT or ordinary VRAM are marked with TTM_PL_FLAG_TEMPORARY. Wouldn't 
those remain evicted in that placement until re-validated to visible 
VRAM at an unknown future time?


Patch 3/3:

amdgpu_bo_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_GTT);
abo->placements[0].flags |= TTM_PL_FLAG_TEMPORARY;





Christian.



/Thomas



Christian.



TTM_PL_FLAG_FORCE_MOVE

and

AMDGPU_PL_FLAG_TEMPORARY

Thanks,

/Thomas


Thomas
-Original Message-
From: Yu, Lang 
Sent: Thursday, May 27, 2021 9:31 AM
To: amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org
Cc: Koenig, Christian ; Huang, Ray
; Deucher, Alexander ;
Yu, Lang 
Subject: [PATCH 1/2] drm/ttm: cleanup and add TTM_PL_FLAG_TEMPORARY

Make TTM_PL_FLAG_* start from zero and add TTM_PL_FLAG_TEMPORARY 
flag

for temporary GTT allocation use.

Signed-off-by: Lang Yu 
---
include/drm/ttm/ttm_placement.h | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/drm/ttm/ttm_placement.h b/include/drm/ttm/ttm_placement.h
index aa6ba4d0cf78..9f5cfc7c2d5a 100644

--- a/include/drm/ttm/ttm_placement.h
+++ b/include/drm/ttm/ttm_placement.h
@@ -47,8 +47,9 @@
  * top of the memory area, instead of the bottom.
  */

-#define TTM_PL_FLAG_CONTIGUOUS  (1 << 19)
-#define TTM_PL_FLAG_TOPDOWN (1 << 22)
+#define TTM_PL_FLAG_CONTIGUOUS  (1 << 0)
+#define TTM_PL_FLAG_TOPDOWN (1 << 1)
+#define TTM_PL_FLAG_TEMPORARY   (1 << 2)

/**
  * struct ttm_place
--
2.25.1



Re: [PATCH 1/2] drm/ttm: cleanup and add TTM_PL_FLAG_TEMPORARY

2021-05-31 Thread Intel


On 5/31/21 12:32 PM, Christian König wrote:

On 31.05.21 at 11:52, Thomas Hellström (Intel) wrote:

Hi, Lang,

On 5/31/21 10:19 AM, Yu, Lang wrote:

[Public]


Hi,
On 5/27/21 3:30 AM, Lang Yu wrote:

Make TTM_PL_FLAG_* start from zero and add
TTM_PL_FLAG_TEMPORARY flag for temporary
GTT allocation use.

GTT is a driver private acronym, right? And it doesn't look like
TTM_PL_FLAG_TEMPORARY will be used in core TTM, so should we instead set
aside a mask in the PL flag for driver-private use?

Hi Thomas,

Thanks for your comments and advice. GTT means Graphics Translation 
Table here, which seems a general acronym. TTM_PL_FLAG_TEMPORARY may 
also be used by other drivers.

I have made other patches for this. Please help review.

Regards,
Lang

My point was really that the flag naming and documentation should 
reflect what core ttm is doing with that flag. If there is no 
specific core TTM usage, IMO we should move it to a driver specific 
flag to avoid future confusion. In particular a writer of a generic 
TTM resource manager should be able to know without looking at an old 
commit message what the placement flag is intended for.


So here we add a flag where core TTM forces a bo move on validate and 
that's it. And that appears to be how it's used when an amdgpu bo is 
evicted to GTT, (btw should it be accounted in this situation?)


The other use is to force the amdgpu driver to temporarily accept it 
into GTT when there is a lack of space, and IMHO that's a driver 
specific use and we should allocate a mask for driver specific flags 
for that.


So shouldn't this be two flags, really?


Well one flag makes sense for the use case at hand that drivers want 
to signal to TTM that an allocation is only temporary and not 
considered valid.


That we then use this flag to implement temporary GTT allocations to 
avoid problems during eviction is just extending that use case.


OK, but it looked like there were actually two use-cases. One for 
possibly long-term VRAM evictions to GTT, the other for the temporary 
use-case where the hop resource allocations sometimes failed. Or did I 
misunderstand the code?


/Thomas



Christian.



TTM_PL_FLAG_FORCE_MOVE

and

AMDGPU_PL_FLAG_TEMPORARY

Thanks,

/Thomas


Thomas
-Original Message-
From: Yu, Lang 
Sent: Thursday, May 27, 2021 9:31 AM
To: amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org
Cc: Koenig, Christian ; Huang, Ray
; Deucher, Alexander ;
Yu, Lang 
Subject: [PATCH 1/2] drm/ttm: cleanup and add TTM_PL_FLAG_TEMPORARY

Make TTM_PL_FLAG_* start from zero and add TTM_PL_FLAG_TEMPORARY flag
for temporary GTT allocation use.

Signed-off-by: Lang Yu 
---
include/drm/ttm/ttm_placement.h | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/drm/ttm/ttm_placement.h b/include/drm/ttm/ttm_placement.h
index aa6ba4d0cf78..9f5cfc7c2d5a 100644

--- a/include/drm/ttm/ttm_placement.h
+++ b/include/drm/ttm/ttm_placement.h
@@ -47,8 +47,9 @@
  * top of the memory area, instead of the bottom.
  */

-#define TTM_PL_FLAG_CONTIGUOUS  (1 << 19)
-#define TTM_PL_FLAG_TOPDOWN (1 << 22)
+#define TTM_PL_FLAG_CONTIGUOUS  (1 << 0)
+#define TTM_PL_FLAG_TOPDOWN (1 << 1)
+#define TTM_PL_FLAG_TEMPORARY   (1 << 2)

/**
  * struct ttm_place
--
2.25.1



Re: [PATCH 1/2] drm/ttm: cleanup and add TTM_PL_FLAG_TEMPORARY

2021-05-31 Thread Intel

Hi, Lang,

On 5/31/21 10:19 AM, Yu, Lang wrote:

[Public]


Hi,
On 5/27/21 3:30 AM, Lang Yu wrote:

Make TTM_PL_FLAG_* start from zero and add
TTM_PL_FLAG_TEMPORARY flag for temporary
GTT allocation use.

GTT is a driver private acronym, right? And it doesn't look like
TTM_PL_FLAG_TEMPORARY will be used in core TTM, so should we instead set
aside a mask in the PL flag for driver-private use?

Hi Thomas,

Thanks for your comments and advice. GTT means Graphics Translation Table 
here, which seems a general acronym. TTM_PL_FLAG_TEMPORARY may also be used 
by other drivers.
I have made other patches for this. Please help review.

Regards,
Lang

My point was really that the flag naming and documentation should 
reflect what core ttm is doing with that flag. If there is no specific 
core TTM usage, IMO we should move it to a driver specific flag to avoid 
future confusion. In particular a writer of a generic TTM resource 
manager should be able to know without looking at an old commit message 
what the placement flag is intended for.


So here we add a flag where core TTM forces a bo move on validate and 
that's it. And that appears to be how it's used when an amdgpu bo is 
evicted to GTT, (btw should it be accounted in this situation?)


The other use is to force the amdgpu driver to temporarily accept it 
into GTT when there is a lack of space, and IMHO that's a driver 
specific use and we should allocate a mask for driver specific flags for 
that.


So shouldn't this be two flags, really?

TTM_PL_FLAG_FORCE_MOVE

and

AMDGPU_PL_FLAG_TEMPORARY
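A sketch of that split, with a range reserved for driver-private placement
flags; the exact names and bit positions here are only illustrative, not
part of the posted patch:

/* Core TTM flags, low bits. */
#define TTM_PL_FLAG_CONTIGUOUS		(1 << 0)
#define TTM_PL_FLAG_TOPDOWN		(1 << 1)
#define TTM_PL_FLAG_FORCE_MOVE		(1 << 2)	/* force a bo move on validate */

/* Upper bits reserved for driver-private placement flags. */
#define TTM_PL_FLAG_PRIV_SHIFT		16
#define AMDGPU_PL_FLAG_TEMPORARY	(1 << TTM_PL_FLAG_PRIV_SHIFT)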

Thanks,

/Thomas


Thomas
-Original Message-
From: Yu, Lang 
Sent: Thursday, May 27, 2021 9:31 AM
To: amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org
Cc: Koenig, Christian ; Huang, Ray
; Deucher, Alexander ;
Yu, Lang 
Subject: [PATCH 1/2] drm/ttm: cleanup and add TTM_PL_FLAG_TEMPORARY

Make TTM_PL_FLAG_* start from zero and add TTM_PL_FLAG_TEMPORARY flag
for temporary GTT allocation use.

Signed-off-by: Lang Yu 
---
include/drm/ttm/ttm_placement.h | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/drm/ttm/ttm_placement.h b/include/drm/ttm/ttm_placement.h
index aa6ba4d0cf78..9f5cfc7c2d5a 100644
--- a/include/drm/ttm/ttm_placement.h
+++ b/include/drm/ttm/ttm_placement.h
@@ -47,8 +47,9 @@
  * top of the memory area, instead of the bottom.
  */

-#define TTM_PL_FLAG_CONTIGUOUS  (1 << 19)
-#define TTM_PL_FLAG_TOPDOWN (1 << 22)
+#define TTM_PL_FLAG_CONTIGUOUS  (1 << 0)
+#define TTM_PL_FLAG_TOPDOWN (1 << 1)
+#define TTM_PL_FLAG_TEMPORARY   (1 << 2)

/**
  * struct ttm_place
--
2.25.1



Re: [PATCH] drm/ttm: stop warning on TT shrinker failure

2021-03-24 Thread Intel


On 3/23/21 4:45 PM, Christian König wrote:

Am 23.03.21 um 16:13 schrieb Michal Hocko:

On Tue 23-03-21 14:56:54, Christian König wrote:

On 23.03.21 at 14:41, Michal Hocko wrote:

[...]
Anyway, I am wondering whether the overall approach is sound. Why don't
you simply use shmem as your backing storage from the beginning and pin
those pages if they are used by the device?

Yeah, that is exactly what the Intel guys are doing for their integrated
GPUs :)
integrated

GPUs :)

Problem is for TTM I need to be able to handle dGPUs and those have all
kinds of funny allocation restrictions. In other words I need to guarantee
that the allocated memory is coherently accessible to the GPU without
using SWIOTLB.

The simple case is that the device can only do DMA32, but you also get
devices which can only do 40 bits or 48 bits.

On top of that you also got AGP, CMA and stuff like CPU cache behavior
changes (write back vs. write through, vs. uncached).

OK, so the underlying problem seems to be that gfp mask (thus
mapping_gfp_mask) cannot really reflect your requirements, right?  Would
it help if shmem would allow to provide an allocation callback to
override alloc_page_vma which is used currently? I am pretty sure there
will be more to handle but going through shmem for the whole life time
is just so much easier to reason about than some tricks to abuse shmem
just for the swapout path.


Well it's a start, but the pages can have special CPU cache settings. 
So direct IO from/to them usually doesn't work as expected.


In addition to that, for AGP and CMA I need to make sure that I give 
those pages back to the relevant subsystems instead of just dropping 
the page reference.


So I would need to block for the swapio to be completed.

Anyway I probably need to revert those patches for now since this 
isn't working as we hoped it would.


Thanks for the explanation how stuff works here.


Another alternative here, that I've tried before without success, would 
perhaps be to drop shmem completely and, if it's a normal page (no 
DMA or funny caching attributes), just use add_to_swap_cache()? If it's 
something else, try to allocate a page with the relevant gfp attributes, 
copy, and add_to_swap_cache()? Or perhaps that doesn't work well from a 
shrinker either?


/Thomas





Christian.


Re: [PATCH 17/35] drm/amdkfd: register HMM device private zone

2021-03-11 Thread Intel


On 3/4/21 6:58 PM, Felix Kuehling wrote:

On 2021-03-01 at 3:46 a.m., Thomas Hellström (Intel) wrote:

On 3/1/21 9:32 AM, Daniel Vetter wrote:

On Wed, Jan 06, 2021 at 10:01:09PM -0500, Felix Kuehling wrote:

From: Philip Yang 

Register vram memory as MEMORY_DEVICE_PRIVATE type resource, to
allocate vram backing pages for page migration.

Signed-off-by: Philip Yang 
Signed-off-by: Felix Kuehling 

So maybe I'm getting this all wrong, but I think that the current ttm
fault code relies on devmap pte entries (especially for hugepte entries)
to stop get_user_pages. But this only works if the pte happens to not
point at a range with devmap pages.

I don't think that's in TTM yet, but the proposed fix, yes (see email
I just sent in another thread),
but only for huge ptes.


This patch here changes that, and so probably breaks this devmap pte
hack
ttm is using?

If I'm not wrong here then I think we need to first fix up the ttm
code to
not use the devmap hack anymore, before a ttm based driver can
register a
dev_pagemap. Also adding Thomas since that just came up in another
discussion.

It doesn't break the ttm devmap hack per se, but it indeed allows gup
to the range registered. But here's where I lack understanding of why
we can't allow gup-ing TTM ptes if there indeed is a backing
struct page? Because registering MEMORY_DEVICE_PRIVATE implies that,
right?

I wasn't aware that TTM used devmap at all. If it does, what type of
memory does it use?

MEMORY_DEVICE_PRIVATE is like swapped out memory. It cannot be mapped in
the CPU page table. GUP would cause a page fault to swap it back into
system memory. We are looking into using MEMORY_DEVICE_GENERIC for a
future coherent memory architecture, where device memory can be
coherently accessed by the CPU and GPU.

As I understand it, our DEVICE_PRIVATE registration is not tied to an
actual physical address. Thus your devmap registration and our devmap
registration could probably coexist without any conflict. You'll just
have the overhead of two sets of struct pages for the same memory.

Regards,
   Felix
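
For readers unfamiliar with the registration being discussed, it looks
roughly like this. A hedged sketch against the ~v5.11 API; error
handling is trimmed, the function name is invented, and my_pagemap_ops
(which must provide .migrate_to_ram) is assumed to exist:

static int register_private_zone(struct device *dev,
				 struct dev_pagemap *pgmap,
				 unsigned long size)
{
	struct resource *res;
	void *addr;

	/* Not tied to real physical addresses: claim a free region. */
	res = devm_request_free_mem_region(dev, &iomem_resource, size);
	if (IS_ERR(res))
		return PTR_ERR(res);

	pgmap->type = MEMORY_DEVICE_PRIVATE;
	pgmap->range.start = res->start;
	pgmap->range.end = res->end;
	pgmap->nr_range = 1;
	pgmap->ops = &my_pagemap_ops;

	addr = devm_memremap_pages(dev, pgmap);	/* creates the struct pages */
	return PTR_ERR_OR_ZERO(addr);
}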


Hi, Felix. TTM doesn't use devmap yet, but is thinking of using it for
faking pmd_special(), which isn't available. That would mean
pmd_devmap() + no registered dev_pagemap meaning special in the sense
documented by vm_normal_page(). The implication here would be that if
you register memory like above, TTM would never be able to set up a
huge page table entry to it. But it sounds like that's not an issue?
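
Roughly the core-mm behaviour being leaned on, paraphrased and
simplified from the huge-page gup path (not a literal copy):

	/* A pmd_devmap() entry without a registered dev_pagemap yields
	 * no page, i.e. behaves as "special" -- the property TTM would
	 * piggy-back on. */
	if (pmd_devmap(*pmd)) {
		pgmap = get_dev_pagemap(pmd_pfn(*pmd), NULL);
		if (!pgmap)
			return NULL;	/* gup fails: no struct page handed out */
	}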


/Thomas








Re: [PATCH 17/35] drm/amdkfd: register HMM device private zone

2021-03-01 Thread Intel


On 3/1/21 9:58 AM, Daniel Vetter wrote:

On Mon, Mar 01, 2021 at 09:46:44AM +0100, Thomas Hellström (Intel) wrote:

On 3/1/21 9:32 AM, Daniel Vetter wrote:

On Wed, Jan 06, 2021 at 10:01:09PM -0500, Felix Kuehling wrote:

From: Philip Yang 

Register vram memory as MEMORY_DEVICE_PRIVATE type resource, to
allocate vram backing pages for page migration.

Signed-off-by: Philip Yang 
Signed-off-by: Felix Kuehling 

So maybe I'm getting this all wrong, but I think that the current ttm
fault code relies on devmap pte entries (especially for hugepte entries)
to stop get_user_pages. But this only works if the pte happens to not
point at a range with devmap pages.

I don't think that's in TTM yet, but the proposed fix, yes (see email I just
sent in another thread),
but only for huge ptes.


This patch here changes that, and so probably breaks this devmap pte hack
ttm is using?

If I'm not wrong here then I think we need to first fix up the ttm code to
not use the devmap hack anymore, before a ttm based driver can register a
dev_pagemap. Also adding Thomas since that just came up in another
discussion.

It doesn't break the ttm devmap hack per se, but it indeed allows gup to the
range registered. But here's where I lack understanding: why can't we
allow gup-ing TTM ptes if there indeed is a backing struct page? Because
registering MEMORY_DEVICE_PRIVATE implies that, right?

We need to keep supporting buffer based memory management for all the
non-compute users. Because those require end-of-batch dma_fence semantics,
which prevents us from using gpu page faults, which makes hmm not really
work.

And for buffer based memory manager we can't have gup pin random pages in
there, that's not really how it works. Worst case ttm just assumes it can
actually move buffers and reallocate them as it sees fit, and your gup
mapping (for direct i/o or whatever) now points at a page of a buffer that
you don't even own anymore. That's not good. Hence also all the
discussions about preventing gup for bo mappings in general.

Once we throw hmm into the mix we need to be really careful that the two
worlds don't collide. Pure hmm is fine, pure bo managed memory is fine,
mixing them is tricky.
-Daniel


Hmm, OK so then registering MEMORY_DEVICE_PRIVATE means we can't set 
pxx_devmap because that would allow gup, which, in turn, means no huge 
TTM ptes.


/Thomas



Re: [PATCH 17/35] drm/amdkfd: register HMM device private zone

2021-03-01 Thread Intel



On 3/1/21 9:32 AM, Daniel Vetter wrote:

On Wed, Jan 06, 2021 at 10:01:09PM -0500, Felix Kuehling wrote:

From: Philip Yang 

Register vram memory as MEMORY_DEVICE_PRIVATE type resource, to
allocate vram backing pages for page migration.

Signed-off-by: Philip Yang 
Signed-off-by: Felix Kuehling 

So maybe I'm getting this all wrong, but I think that the current ttm
fault code relies on devmap pte entries (especially for hugepte entries)
to stop get_user_pages. But this only works if the pte happens to not
point at a range with devmap pages.


I don't think that's in TTM yet, but the proposed fix, yes (see email I 
just sent in another thread),

but only for huge ptes.



This patch here changes that, and so probably breaks this devmap pte hack
ttm is using?

If I'm not wrong here then I think we need to first fix up the ttm code to
not use the devmap hack anymore, before a ttm based driver can register a
dev_pagemap. Also adding Thomas since that just came up in another
discussion.


It doesn't break the ttm devmap hack per se, but it indeed allows gup to
the range registered. But here's where I lack understanding: why can't
we allow gup-ing TTM ptes if there indeed is a backing struct page?
Because registering MEMORY_DEVICE_PRIVATE implies that, right?


/Thomas


-Daniel





Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-22 Thread Intel


On 2020-07-22 16:23, Christian König wrote:

On 22.07.20 at 16:07, Daniel Vetter wrote:

On Wed, Jul 22, 2020 at 3:12 PM Thomas Hellström (Intel)
 wrote:

On 2020-07-22 14:41, Daniel Vetter wrote:

I'm pretty sure there's more bugs, I just haven't heard from them yet.
Also due to the opt-in nature of dma-fence we can limit the scope of
what we fix fairly naturally, just don't put them where no one cares
:-) Of course that also hides general locking issues in dma_fence
signalling code, but well *shrug*.

Hmm, yes. Another potential big problem would be drivers that want to
use gpu page faults in the dma-fence critical sections with the
batch-based programming model.

Yeah that's a massive can of worms. But luckily there's no such driver
merged in upstream, so hopefully we can think about all the
constraints and how to best annotate this before we land any
code and have big regrets.


Do you want the bad news? I once made a prototype for that when Vega10
came out.


But we abandoned this approach for the batch-based approach
because of the horrible performance.


In context of the previous discussion I'd consider the fact that it's 
not performant in the batch-based model good news :)


Thomas




KFD is going to see that, but this is only with user queues and no 
dma_fence involved whatsoever.


Christian.


-Daniel



--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-22 Thread Intel



On 2020-07-22 14:41, Daniel Vetter wrote:


Ah I think I misunderstood which options you want to compare here. I'm
not sure how much pain fixing up "dma-fence as memory fence" really
is. That's kinda why I want a lot more testing on my annotation
patches, to figure that out. Not much feedback aside from amdgpu and
intel, and those two drivers pretty much need to sort out their memory
fence issues anyway (because of userptr and stuff like that).

The only other issues outside of these two drivers I'm aware of:
- various scheduler drivers doing allocations in the drm/scheduler
critical section. Since all arm-soc drivers have a mildly shoddy
memory model of "we just pin everything" they don't really have to
deal with this. So we might just declare arm as a platform broken and
not taint the dma-fence critical sections with fs_reclaim. Otoh we
need to fix this for drm/scheduler anyway, I think best option would
be to have a mempool for hw fences in the scheduler itself, and at
that point fixing the other drivers shouldn't be too onerous.

- vmwgfx doing a dma_resv in the atomic commit tail. Entirely
orthogonal to the entire memory fence discussion.


With vmwgfx there is another issue that is hit when the gpu signals an
error. At that point the batch might be restarted with a new meta
command buffer that needs to be allocated out of a dma pool, in the
fence critical section. That's probably a bit nasty to fix, but not
impossible.
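
A hedged sketch of the pool direction suggested above, applicable both
to the scheduler hw-fence case and to allocations like this one; all
names and the pool size are illustrative:

#include <linux/mempool.h>
#include <linux/dma-fence.h>

struct my_hw_fence {
	struct dma_fence base;	/* illustrative */
};

static mempool_t *hw_fence_pool;

static int my_fence_pool_init(void)
{
	/* preallocate so the dma-fence signalling path never enters reclaim */
	hw_fence_pool = mempool_create_kmalloc_pool(16,
					sizeof(struct my_hw_fence));
	return hw_fence_pool ? 0 : -ENOMEM;
}

/* called from the dma-fence critical path, e.g. an error/restart handler */
static struct my_hw_fence *my_hw_fence_alloc(void)
{
	/* GFP_NOWAIT: fail rather than recurse into reclaim from here */
	return mempool_alloc(hw_fence_pool, GFP_NOWAIT);
}

The cost of this design is that allocation can now fail, so the caller
needs a recovery path that itself never waits on reclaim or dma-fences.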




I'm pretty sure there's more bugs, I just haven't heard from them yet.
Also due to the opt-in nature of dma-fence we can limit the scope of
what we fix fairly naturally, just don't put them where no one cares
:-) Of course that also hides general locking issues in dma_fence
signalling code, but well *shrug*.
Hmm, yes. Another potential big problem would be drivers that want to 
use gpu page faults in the dma-fence critical sections with the 
batch-based programming model.


/Thomas




Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-22 Thread Intel


On 2020-07-22 13:39, Daniel Vetter wrote:

On Wed, Jul 22, 2020 at 12:31 PM Thomas Hellström (Intel)
 wrote:


On 2020-07-22 11:45, Daniel Vetter wrote:

On Wed, Jul 22, 2020 at 10:05 AM Thomas Hellström (Intel)
 wrote:

On 2020-07-22 09:11, Daniel Vetter wrote:

On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
 wrote:

On 2020-07-22 00:45, Dave Airlie wrote:

On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
 wrote:

On 7/21/20 9:45 AM, Christian König wrote:

On 21.07.20 at 09:41, Daniel Vetter wrote:

On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
wrote:

Hi,

On 7/9/20 2:33 PM, Daniel Vetter wrote:

Comes up every few years, gets somewhat tedious to discuss, let's
write this down once and for all.

What I'm not sure about is whether the text should be more explicit in
flat out mandating the amdkfd eviction fences for long running compute
workloads or workloads where userspace fencing is allowed.

Although (in my humble opinion) it might be possible to completely
untangle
kernel-introduced fences for resource management and dma-fences used
for
completion- and dependency tracking and lift a lot of restrictions
for the
dma-fences, including prohibiting infinite ones, I think this makes
sense
describing the current state.

Yeah I think a future patch needs to type up how we want to make that
happen (for some cross driver consistency) and what needs to be
considered. Some of the necessary parts are already there (with like the
preemption fences amdkfd has as an example), but I think some clear docs
on what's required from both hw, drivers and userspace would be really
good.

I'm currently writing that up, but probably still need a few days for
this.

Great! I put down some (very) initial thoughts a couple of weeks ago
building on eviction fences for various hardware complexity levels here:

https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt

We are seeing HW that has recoverable GPU page faults but only for
compute tasks, and scheduler without semaphores hw for graphics.

So a single driver may have to expose both models to userspace and
also introduces the problem of how to interoperate between the two
models on one card.

Dave.

Hmm, yes. To begin with it's important to note that this is not a
replacement for new programming models or APIs. This is something that
takes place internally in drivers to mitigate many of the restrictions
that are currently imposed on dma-fence and documented in this and
previous series. It's basically the driver-private narrow completions
Jason suggested in the lockdep patches discussions implemented the same
way as eviction-fences.

The memory fence API would be local to helpers and middle-layers like
TTM, and the corresponding drivers.  The only cross-driver-like
visibility would be that the dma-buf move_notify() callback would not be
allowed to wait on dma-fences or something that depends on a dma-fence.

Because we can't preempt (on some engines at least) we already have
the requirement that cross driver buffer management can get stuck on a
dma-fence. Not even taking into account the horrors we do with
userptr, which are cross driver no matter what. Limiting move_notify
to memory fences only doesn't work, since the pte clearing might need
to wait for a dma_fence first. Hence this becomes a full end-of-batch
fence, not just a limited kernel-internal memory fence.

For non-preemptible hardware the memory fence typically *is* the
end-of-batch fence. (Unless, as documented, there is a scheduler
consuming sync-file dependencies in which case the memory fence wait
needs to be able to break out of that). The key thing is not that we can
break out of execution, but that we can break out of dependencies, since
when we're executing all dependencies (modulo semaphores) are already
fulfilled. That's what's eliminating the deadlocks.


That's kinda why I think only reasonable option is to toss in the
towel and declare dma-fence to be the memory fence (and suck up all
the consequences of that decision as uapi, which is kinda where we
are), and construct something new free-wheeling for userspace
fencing. But only for engines that allow enough preempt/gpu page
faulting to make that possible. Free wheeling userspace fences/gpu
semaphores or whatever you want to call them (on windows I think it's
monitored fence) only work if you can preempt to decouple the memory
fences from your gpu command execution.

There's the in-between step of just decoupling the batchbuffer
submission prep for hw without any preempt (but a scheduler), but that
seems kinda pointless. Modern execbuf should be O(1) fastpath, with
all the allocation/mapping work pulled out ahead. vk exposes that
model directly to clients, GL drivers could use it internally too, so
I see zero value in spending lots of time engineering very tricky
kernel code just for old userspace. Much more reasonable to do that in
userspace, where we have real debuggers and no panics about security
bugs (or well, a lot less, webgl is still a thing, but at least
browsers realized you need to contain that completely).

Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-22 Thread Intel


On 2020-07-22 11:45, Daniel Vetter wrote:

On Wed, Jul 22, 2020 at 10:05 AM Thomas Hellström (Intel)
 wrote:


On 2020-07-22 09:11, Daniel Vetter wrote:

On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
 wrote:

On 2020-07-22 00:45, Dave Airlie wrote:

On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
 wrote:

On 7/21/20 9:45 AM, Christian König wrote:

On 21.07.20 at 09:41, Daniel Vetter wrote:

On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
wrote:

Hi,

On 7/9/20 2:33 PM, Daniel Vetter wrote:

Comes up every few years, gets somewhat tedious to discuss, let's
write this down once and for all.

What I'm not sure about is whether the text should be more explicit in
flat out mandating the amdkfd eviction fences for long running compute
workloads or workloads where userspace fencing is allowed.

Although (in my humble opinion) it might be possible to completely
untangle
kernel-introduced fences for resource management and dma-fences used
for
completion- and dependency tracking and lift a lot of restrictions
for the
dma-fences, including prohibiting infinite ones, I think this makes
sense
describing the current state.

Yeah I think a future patch needs to type up how we want to make that
happen (for some cross driver consistency) and what needs to be
considered. Some of the necessary parts are already there (with like the
preemption fences amdkfd has as an example), but I think some clear docs
on what's required from both hw, drivers and userspace would be really
good.

I'm currently writing that up, but probably still need a few days for
this.

Great! I put down some (very) initial thoughts a couple of weeks ago
building on eviction fences for various hardware complexity levels here:

https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt

We are seeing HW that has recoverable GPU page faults but only for
compute tasks, and scheduler without semaphores hw for graphics.

So a single driver may have to expose both models to userspace and
also introduces the problem of how to interoperate between the two
models on one card.

Dave.

Hmm, yes. To begin with it's important to note that this is not a
replacement for new programming models or APIs. This is something that
takes place internally in drivers to mitigate many of the restrictions
that are currently imposed on dma-fence and documented in this and
previous series. It's basically the driver-private narrow completions
Jason suggested in the lockdep patches discussions implemented the same
way as eviction-fences.

The memory fence API would be local to helpers and middle-layers like
TTM, and the corresponding drivers.  The only cross-driver-like
visibility would be that the dma-buf move_notify() callback would not be
allowed to wait on dma-fences or something that depends on a dma-fence.

Because we can't preempt (on some engines at least) we already have
the requirement that cross driver buffer management can get stuck on a
dma-fence. Not even taking into account the horrors we do with
userptr, which are cross driver no matter what. Limiting move_notify
to memory fences only doesn't work, since the pte clearing might need
to wait for a dma_fence first. Hence this becomes a full end-of-batch
fence, not just a limited kernel-internal memory fence.

For non-preemptible hardware the memory fence typically *is* the
end-of-batch fence. (Unless, as documented, there is a scheduler
consuming sync-file dependencies in which case the memory fence wait
needs to be able to break out of that). The key thing is not that we can
break out of execution, but that we can break out of dependencies, since
when we're executing all dependencies (modulo semaphores) are already
fulfilled. That's what's eliminating the deadlocks.


That's kinda why I think only reasonable option is to toss in the
towel and declare dma-fence to be the memory fence (and suck up all
the consequences of that decision as uapi, which is kinda where we
are), and construct something new free-wheeling for userspace
fencing. But only for engines that allow enough preempt/gpu page
faulting to make that possible. Free wheeling userspace fences/gpu
semaphores or whatever you want to call them (on windows I think it's
monitored fence) only work if you can preempt to decouple the memory
fences from your gpu command execution.

There's the in-between step of just decoupling the batchbuffer
submission prep for hw without any preempt (but a scheduler), but that
seems kinda pointless. Modern execbuf should be O(1) fastpath, with
all the allocation/mapping work pulled out ahead. vk exposes that
model directly to clients, GL drivers could use it internally too, so
I see zero value in spending lots of time engineering very tricky
kernel code just for old userspace. Much more reasonable to do that in
userspace, where we have real debuggers and no panics about security
bugs (or well, a lot less, webgl is still a thing, but at least
browsers realized you need to contain that completely).

Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-22 Thread Intel


On 2020-07-22 09:11, Daniel Vetter wrote:

On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
 wrote:


On 2020-07-22 00:45, Dave Airlie wrote:

On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
 wrote:

On 7/21/20 9:45 AM, Christian König wrote:

On 21.07.20 at 09:41, Daniel Vetter wrote:

On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
wrote:

Hi,

On 7/9/20 2:33 PM, Daniel Vetter wrote:

Comes up every few years, gets somewhat tedious to discuss, let's
write this down once and for all.

What I'm not sure about is whether the text should be more explicit in
flat out mandating the amdkfd eviction fences for long running compute
workloads or workloads where userspace fencing is allowed.

Although (in my humble opinion) it might be possible to completely
untangle
kernel-introduced fences for resource management and dma-fences used
for
completion- and dependency tracking and lift a lot of restrictions
for the
dma-fences, including prohibiting infinite ones, I think this makes
sense
describing the current state.

Yeah I think a future patch needs to type up how we want to make that
happen (for some cross driver consistency) and what needs to be
considered. Some of the necessary parts are already there (with like the
preemption fences amdkfd has as an example), but I think some clear docs
on what's required from both hw, drivers and userspace would be really
good.

I'm currently writing that up, but probably still need a few days for
this.

Great! I put down some (very) initial thoughts a couple of weeks ago
building on eviction fences for various hardware complexity levels here:

https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt

We are seeing HW that has recoverable GPU page faults but only for
compute tasks, and scheduler without semaphores hw for graphics.

So a single driver may have to expose both models to userspace and
also introduces the problem of how to interoperate between the two
models on one card.

Dave.

Hmm, yes. To begin with it's important to note that this is not a
replacement for new programming models or APIs. This is something that
takes place internally in drivers to mitigate many of the restrictions
that are currently imposed on dma-fence and documented in this and
previous series. It's basically the driver-private narrow completions
Jason suggested in the lockdep patches discussions implemented the same
way as eviction-fences.

The memory fence API would be local to helpers and middle-layers like
TTM, and the corresponding drivers.  The only cross-driver-like
visibility would be that the dma-buf move_notify() callback would not be
allowed to wait on dma-fences or something that depends on a dma-fence.

Because we can't preempt (on some engines at least) we already have
the requirement that cross driver buffer management can get stuck on a
dma-fence. Not even taking into account the horrors we do with
userptr, which are cross driver no matter what. Limiting move_notify
to memory fences only doesn't work, since the pte clearing might need
to wait for a dma_fence first. Hence this becomes a full end-of-batch
fence, not just a limited kernel-internal memory fence.


For non-preemptible hardware the memory fence typically *is* the 
end-of-batch fence. (Unless, as documented, there is a scheduler 
consuming sync-file dependencies in which case the memory fence wait 
needs to be able to break out of that). The key thing is not that we can 
break out of execution, but that we can break out of dependencies, since 
when we're executing all dependencies (modulo semaphores) are already
fulfilled. That's what's eliminating the deadlocks.




That's kinda why I think only reasonable option is to toss in the
towel and declare dma-fence to be the memory fence (and suck up all
the consequences of that decision as uapi, which is kinda where we
are), and construct something new free-wheeling for userspace
fencing. But only for engines that allow enough preempt/gpu page
faulting to make that possible. Free wheeling userspace fences/gpu
semaphores or whatever you want to call them (on windows I think it's
monitored fence) only work if you can preempt to decouple the memory
fences from your gpu command execution.

There's the in-between step of just decoupling the batchbuffer
submission prep for hw without any preempt (but a scheduler), but that
seems kinda pointless. Modern execbuf should be O(1) fastpath, with
all the allocation/mapping work pulled out ahead. vk exposes that
model directly to clients, GL drivers could use it internally too, so
I see zero value in spending lots of time engineering very tricky
kernel code just for old userspace. Much more reasonable to do that in
userspace, where we have real debuggers and no panics about security
bugs (or well, a lot less, webgl is still a thing, but at least
browsers realized you need to contain that completely).


Sure, it's definitely a big chunk of work

Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-22 Thread Intel


On 2020-07-22 00:45, Dave Airlie wrote:

On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
 wrote:


On 7/21/20 9:45 AM, Christian König wrote:

On 21.07.20 at 09:41, Daniel Vetter wrote:

On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
wrote:

Hi,

On 7/9/20 2:33 PM, Daniel Vetter wrote:

Comes up every few years, gets somewhat tedious to discuss, let's
write this down once and for all.

What I'm not sure about is whether the text should be more explicit in
flat out mandating the amdkfd eviction fences for long running compute
workloads or workloads where userspace fencing is allowed.

Although (in my humble opinion) it might be possible to completely
untangle
kernel-introduced fences for resource management and dma-fences used
for
completion- and dependency tracking and lift a lot of restrictions
for the
dma-fences, including prohibiting infinite ones, I think this makes
sense
describing the current state.

Yeah I think a future patch needs to type up how we want to make that
happen (for some cross driver consistency) and what needs to be
considered. Some of the necessary parts are already there (with like the
preemption fences amdkfd has as an example), but I think some clear docs
on what's required from both hw, drivers and userspace would be really
good.

I'm currently writing that up, but probably still need a few days for
this.

Great! I put down some (very) initial thoughts a couple of weeks ago
building on eviction fences for various hardware complexity levels here:

https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt

We are seeing HW that has recoverable GPU page faults but only for
compute tasks, and scheduler without semaphores hw for graphics.

So a single driver may have to expose both models to userspace and
also introduces the problem of how to interoperate between the two
models on one card.

Dave.


Hmm, yes. To begin with it's important to note that this is not a
replacement for new programming models or APIs. This is something that
takes place internally in drivers to mitigate many of the restrictions 
that are currently imposed on dma-fence and documented in this and 
previous series. It's basically the driver-private narrow completions 
Jason suggested in the lockdep patches discussions implemented the same 
way as eviction-fences.


The memory fence API would be local to helpers and middle-layers like 
TTM, and the corresponding drivers.  The only cross-driver-like 
visibility would be that the dma-buf move_notify() callback would not be 
allowed to wait on dma-fences or something that depends on a dma-fence.


So with that in mind, I don't foresee engines with different 
capabilities on the same card being a problem.


/Thomas




Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-21 Thread Intel


On 2020-07-21 15:59, Christian König wrote:

On 21.07.20 at 12:47, Thomas Hellström (Intel) wrote:

...
Yes, we can't do magic. As soon as an indefinite batch makes it to 
such hardware we've lost. But since we can break out while the batch 
is stuck in the scheduler waiting, what I believe we *can* do with 
this approach is to avoid deadlocks due to locally unknown 
dependencies, which has some bearing on this documentation patch, and 
also to allow memory allocation in dma-fence (not memory-fence) 
critical sections, like gpu fault- and error handlers without 
resorting to using memory pools.


Avoiding deadlocks is only the tip of the iceberg here.

When you allow the kernel to depend on user space to proceed with some 
operation there are a lot more things which need consideration.


E.g. what happens when a userspace process which has submitted stuff
to the kernel is killed? Are the prepared commands sent to the
hardware or aborted as well? What do we do with other processes
waiting for that stuff?


How do we do resource accounting? When processes need to block when
submitting to the hardware stuff which is not ready, we have a process
we can punish for blocking resources. But how is kernel memory used
for a submission accounted? How do we avoid denial-of-service attacks
here where somebody eats up all memory by doing submissions which
can't finish?


Hmm. Are these problems really unique to user-space controlled 
dependencies? Couldn't you hit the same or similar problems with 
mis-behaving shaders blocking timeline progress?


/Thomas





Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-21 Thread Intel


On 7/21/20 11:50 AM, Daniel Vetter wrote:

On Tue, Jul 21, 2020 at 11:38 AM Thomas Hellström (Intel)
 wrote:


On 7/21/20 10:55 AM, Christian König wrote:

On 21.07.20 at 10:47, Thomas Hellström (Intel) wrote:

On 7/21/20 9:45 AM, Christian König wrote:

On 21.07.20 at 09:41, Daniel Vetter wrote:

On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
wrote:

Hi,

On 7/9/20 2:33 PM, Daniel Vetter wrote:

Comes up every few years, gets somewhat tedious to discuss, let's
write this down once and for all.

What I'm not sure about is whether the text should be more
explicit in
flat out mandating the amdkfd eviction fences for long running
compute
workloads or workloads where userspace fencing is allowed.

Although (in my humble opinion) it might be possible to completely
untangle
kernel-introduced fences for resource management and dma-fences
used for
completion- and dependency tracking and lift a lot of restrictions
for the
dma-fences, including prohibiting infinite ones, I think this
makes sense
describing the current state.

Yeah I think a future patch needs to type up how we want to make that
happen (for some cross driver consistency) and what needs to be
considered. Some of the necessary parts are already there (with
like the
preemption fences amdkfd has as an example), but I think some clear
docs
on what's required from both hw, drivers and userspace would be really
good.

I'm currently writing that up, but probably still need a few days
for this.

Great! I put down some (very) initial thoughts a couple of weeks ago
building on eviction fences for various hardware complexity levels here:

https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt


I don't think that this will ever be possible.

See that Daniel describes in his text is that indefinite fences are a
bad idea for memory management, and I think that this is a fixed fact.

In other words the whole concept of submitting work to the kernel
which depends on some user space interaction doesn't work and never will.

Well the idea here is that memory management will *never* depend on
indefinite fences: As soon as someone waits on a memory manager fence
(be it eviction, shrinker or mmu notifier) it breaks out of any
dma-fence dependencies and /or user-space interaction. The text tries to
describe what's required to be able to do that (save for non-preemptible
gpus where someone submits a forever-running shader).

Yeah I think that part of your text is good to describe how to
untangle memory fences from synchronization fences given how much the
hw can do.


So while I think this is possible (until someone comes up with a case
where it wouldn't work of course), I guess Daniel has a point in that it
won't happen because of inertia and there might be better options.

Yeah it's just I don't see much chance for splitting dma-fence itself.
That's also why I'm not positive on the "no hw preemption, only
scheduler" case: You still have a dma_fence for the batch itself,
which means still no userspace controlled synchronization or other
form of indefinite batches allowed. So not getting us any closer to
enabling the compute use cases people want.


Yes, we can't do magic. As soon as an indefinite batch makes it to such 
hardware we've lost. But since we can break out while the batch is stuck 
in the scheduler waiting, what I believe we *can* do with this approach 
is to avoid deadlocks due to locally unknown dependencies, which has 
some bearing on this documentation patch, and also to allow memory 
allocation in dma-fence (not memory-fence) critical sections, like gpu 
fault- and error handlers without resorting to using memory pools.


But again. I'm not saying we should actually implement this. Better to 
consider it and reject it than not consider it at all.


/Thomas




Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-21 Thread Intel


On 7/21/20 10:55 AM, Christian König wrote:

On 21.07.20 at 10:47, Thomas Hellström (Intel) wrote:


On 7/21/20 9:45 AM, Christian König wrote:

On 21.07.20 at 09:41, Daniel Vetter wrote:
On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel) 
wrote:

Hi,

On 7/9/20 2:33 PM, Daniel Vetter wrote:

Comes up every few years, gets somewhat tedious to discuss, let's
write this down once and for all.

What I'm not sure about is whether the text should be more 
explicit in
flat out mandating the amdkfd eviction fences for long running 
compute

workloads or workloads where userspace fencing is allowed.
Although (in my humble opinion) it might be possible to completely 
untangle
kernel-introduced fences for resource management and dma-fences 
used for
completion- and dependency tracking and lift a lot of restrictions 
for the
dma-fences, including prohibiting infinite ones, I think this 
makes sense

describing the current state.

Yeah I think a future patch needs to type up how we want to make that
happen (for some cross driver consistency) and what needs to be
considered. Some of the necessary parts are already there (with 
like the
preemption fences amdkfd has as an example), but I think some clear 
docs

on what's required from both hw, drivers and userspace would be really
good.


I'm currently writing that up, but probably still need a few days 
for this.


Great! I put down some (very) initial thoughts a couple of weeks ago 
building on eviction fences for various hardware complexity levels here:


https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt



I don't think that this will ever be possible.

See that Daniel describes in his text is that indefinite fences are a 
bad idea for memory management, and I think that this is a fixed fact.


In other words the whole concept of submitting work to the kernel 
which depends on some user space interaction doesn't work and never will.


Well the idea here is that memory management will *never* depend on 
indefinite fences: As soon as someone waits on a memory manager fence 
(be it eviction, shrinker or mmu notifier) it breaks out of any 
dma-fence dependencies and /or user-space interaction. The text tries to 
describe what's required to be able to do that (save for non-preemptible 
gpus where someone submits a forever-running shader).


So while I think this is possible (until someone comes up with a case 
where it wouldn't work of course), I guess Daniel has a point in that it 
won't happen because of inertia and there might be better options.


/Thomas




Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-21 Thread Intel


On 7/21/20 9:45 AM, Christian König wrote:

On 21.07.20 at 09:41, Daniel Vetter wrote:
On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel) 
wrote:

Hi,

On 7/9/20 2:33 PM, Daniel Vetter wrote:

Comes up every few years, gets somewhat tedious to discuss, let's
write this down once and for all.

What I'm not sure about is whether the text should be more explicit in
flat out mandating the amdkfd eviction fences for long running compute
workloads or workloads where userspace fencing is allowed.
Although (in my humble opinion) it might be possible to completely 
untangle
kernel-introduced fences for resource management and dma-fences used 
for
completion- and dependency tracking and lift a lot of restrictions 
for the
dma-fences, including prohibiting infinite ones, I think this makes 
sense

describing the current state.

Yeah I think a future patch needs to type up how we want to make that
happen (for some cross driver consistency) and what needs to be
considered. Some of the necessary parts are already there (with like the
preemption fences amdkfd has as an example), but I think some clear docs
on what's required from both hw, drivers and userspace would be really
good.


I'm currently writing that up, but probably still need a few days for 
this.


Great! I put down some (very) initial thoughts a couple of weeks ago 
building on eviction fences for various hardware complexity levels here:


https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt

/Thomas




Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-20 Thread Intel

Hi,

On 7/9/20 2:33 PM, Daniel Vetter wrote:

Comes up every few years, gets somewhat tedious to discuss, let's
write this down once and for all.

What I'm not sure about is whether the text should be more explicit in
flat out mandating the amdkfd eviction fences for long running compute
workloads or workloads where userspace fencing is allowed.


Although (in my humble opinion) it might be possible to completely 
untangle kernel-introduced fences for resource management and dma-fences 
used for completion- and dependency tracking and lift a lot of 
restrictions for the dma-fences, including prohibiting infinite ones, I 
think this makes sense describing the current state.


Reviewed-by: Thomas Hellstrom 




Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-11 Thread Intel


On 6/4/20 10:12 AM, Daniel Vetter wrote:

Two in one go:
- it is allowed to call dma_fence_wait() while holding a
   dma_resv_lock(). This is fundamental to how eviction works with ttm,
   so required.

- it is allowed to call dma_fence_wait() from memory reclaim contexts,
   specifically from shrinker callbacks (which i915 does), and from mmu
   notifier callbacks (which amdgpu does, and which i915 sometimes also
   does, and probably always should, but that's kinda a debate). Also
   for stuff like HMM we really need to be able to do this, or things
   get real dicey.

Consequence is that any critical path necessary to get to a
dma_fence_signal for a fence must never a) call dma_resv_lock nor b)
allocate memory with GFP_KERNEL. Also by implication of
dma_resv_lock(), no userspace faulting allowed. That's some supremely
obnoxious limitations, which is why we need to sprinkle the right
annotations to all relevant paths.

The one big locking context we're leaving out here is mmu notifiers,
added in

commit 23b68395c7c78a764e8963fc15a7cfd318bf187f
Author: Daniel Vetter 
Date:   Mon Aug 26 22:14:21 2019 +0200

 mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end

that one covers a lot of other callsites, and it's also allowed to
wait on dma-fences from mmu notifiers. But there's no ready-made
functions exposed to prime this, so I've left it out for now.

v2: Also track against mmu notifier context.

v3: kerneldoc to spec the cross-driver contract. Note that currently
i915 throws in a hard-coded 10s timeout on foreign fences (not sure
why that was done, but it's there), which is why that rule is worded
with SHOULD instead of MUST.

Also some of the mmu_notifier/shrinker rules might surprise SoC
drivers, I haven't fully audited them all. Which is infeasible anyway,
we'll need to run them with lockdep and dma-fence annotations and see
what goes boom.

v4: A spelling fix from Mika

Cc: Mika Kuoppala 
Cc: Thomas Hellstrom 
Cc: linux-me...@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
Cc: linux-r...@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-...@lists.freedesktop.org
Cc: Chris Wilson 
Cc: Maarten Lankhorst 
Cc: Christian König 
Signed-off-by: Daniel Vetter 
---
  Documentation/driver-api/dma-buf.rst |  6 
  drivers/dma-buf/dma-fence.c  | 41 
  drivers/dma-buf/dma-resv.c   |  4 +++
  include/linux/dma-fence.h|  1 +
  4 files changed, 52 insertions(+)


I still have my doubts about allowing fence waiting from within 
shrinkers. IMO ideally they should use a trywait approach, in order to 
allow memory allocation during command submission for drivers that
publish fences before command submission. (Since early reservation 
object release requires that).
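
A minimal sketch of such a trywait approach in a shrinker; the
object/LRU types are invented and LRU locking is omitted:

struct my_obj {
	struct list_head lru_link;
	struct dma_fence *fence;
};
static LIST_HEAD(my_lru);	/* protected by a driver lock, omitted here */

static unsigned long my_shrink_scan(struct shrinker *shrink,
				    struct shrink_control *sc)
{
	struct my_obj *obj;
	unsigned long freed = 0;

	list_for_each_entry(obj, &my_lru, lru_link) {
		if (!dma_fence_is_signaled(obj->fence))
			continue;	/* trywait: never block in reclaim */
		freed += my_free_backing_store(obj);
		if (freed >= sc->nr_to_scan)
			break;
	}
	return freed;
}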


But since drivers are already waiting from within shrinkers and I take 
your word for HMM requiring this,


Reviewed-by: Thomas Hellström 




Re: [PATCH 1/6] drm/ttm: Add unampping of the entire device address space

2020-06-11 Thread Intel


On 6/10/20 11:19 PM, Andrey Grodzovsky wrote:


On 6/10/20 4:30 PM, Thomas Hellström (Intel) wrote:


On 6/10/20 5:30 PM, Daniel Vetter wrote:

On Wed, Jun 10, 2020 at 04:05:04PM +0200, Christian König wrote:

On 10.06.20 at 15:54, Andrey Grodzovsky wrote:


On 6/10/20 6:15 AM, Thomas Hellström (Intel) wrote:


On 6/9/20 7:21 PM, Koenig, Christian wrote:


On 09.06.2020 18:37, "Grodzovsky, Andrey" wrote:


 On 6/5/20 2:40 PM, Christian König wrote:
 > On 05.06.20 at 16:29, Andrey Grodzovsky wrote:
 >>
 >> On 5/11/20 2:45 AM, Christian König wrote:
 >>> On 09.05.20 at 20:51, Andrey Grodzovsky wrote:
 >>>> Signed-off-by: Andrey Grodzovsky 


 >>>> ---
 >>>> drivers/gpu/drm/ttm/ttm_bo.c | 22 +-
 >>>> include/drm/ttm/ttm_bo_driver.h | 2 ++
 >>>>   2 files changed, 23 insertions(+), 1 deletion(-)
 >>>>
 >>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c
 >>>> b/drivers/gpu/drm/ttm/ttm_bo.c
 >>>> index c5b516f..eae61cc 100644
 >>>> --- a/drivers/gpu/drm/ttm/ttm_bo.c
 >>>> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
 >>>> @@ -1750,9 +1750,29 @@ void ttm_bo_unmap_virtual(struct
 >>>> ttm_buffer_object *bo)
 >>>> ttm_bo_unmap_virtual_locked(bo);
 >>>> ttm_mem_io_unlock(man);
 >>>>   }
 >>>> +EXPORT_SYMBOL(ttm_bo_unmap_virtual);
 >>>>   +void ttm_bo_unmap_virtual_address_space(struct
 ttm_bo_device *bdev)
 >>>> +{
 >>>> +    struct ttm_mem_type_manager *man;
 >>>> +    int i;
 >>>> -EXPORT_SYMBOL(ttm_bo_unmap_virtual);
 >>>
 >>>> +    for (i = 0; i < TTM_NUM_MEM_TYPES; i++) {
 >>>> +    man = >man[i];
 >>>> +    if (man->has_type && man->use_type)
 >>>> + ttm_mem_io_lock(man, false);
 >>>> +    }
 >>>
 >>> You should drop that it will just result in a deadlock
 warning for
 >>> Nouveau and has no effect at all.
 >>>
 >>> Apart from that looks good to me,
 >>> Christian.
 >>
 >>
 >> As I am considering to re-include this in V2 of the
 patchsets, can
 >> you clarify please why this will have no effect at all ?
 >
 > The locks are exclusive for Nouveau to allocate/free the io
 address
 > space.
 >
 > Since we don't do this here we don't need the locks.
 >
 > Christian.


 So basically calling unmap_mapping_range doesn't require 
any extra
 locking around it and whatever locks are taken within the 
function

 should be enough ?



I think so, yes.

Christian.

Yes, that's true. However, without the bo reservation, nothing stops
a PTE from being immediately re-faulted back again. Even while
unmap_mapping_range() is running.

Can you explain more on this - specifically, which function to 
reserve

the BO, why BO reservation would prevent re-fault of the PTE ?

Thomas is talking about ttm_bo_reserve()/ttm_bo_unreserve(), but we
don't need this because we unmap everything since the whole device is
gone, and don't just manipulate a single BO.


So the device removed flag needs to be advertized before this
function is run,


I indeed intend to call this right after calling drm_dev_unplug from
amdgpu_pci_remove while adding drm_dev_enter/exit in ttm_bo_vm_fault
(or in an amdgpu-specific wrapper, since I don't see how I can access
struct drm_device from ttm_bo_vm_fault), and this in my understanding
should stop a PTE from being re-faulted back as you pointed out - so
again I don't see how bo reservation would prevent it, so it looks
like I am missing something...
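
The wrapper being described might look roughly like this; a hedged
sketch that assumes a GEM-based ttm_buffer_object so the drm_device is
reachable via bo->base.dev, and an invented function name:

static vm_fault_t amdgpu_ttm_vm_fault(struct vm_fault *vmf)
{
	struct ttm_buffer_object *bo = vmf->vma->vm_private_data;
	struct drm_device *ddev = bo->base.dev;
	vm_fault_t ret;
	int idx;

	if (!drm_dev_enter(ddev, &idx))
		return VM_FAULT_SIGBUS;	/* unplugged: refuse to re-insert PTEs */

	ret = ttm_bo_vm_fault(vmf);
	drm_dev_exit(idx);
	return ret;
}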



(perhaps with a memory barrier pair).


drm_dev_unplug and drm_dev_enter/exit are RCU synchronized, and so I
don't think they require any extra memory barriers for visibility of
the removed flag being set.


As far as I can see that should be perfectly sufficient.

Only if you have a drm_dev_enter/exit pair in your fault handler.
Otherwise you're still open to the races Thomas described. But aside 
from
that the drm_dev_unplug stuff has all the barriers and stuff to make 
sure

nothing escapes.

Failure to drm_dev_enter could then also trigger the special case 
where we

put a dummy page in place.
-Daniel


Hmm, yes, indeed advertising the flag before the call to
unmap_mapping_range isn't enough, since there might be fault handlers
running that haven't picked up the flag when unmap_mapping_range is
launched.



If you mean those fault handlers that were in progress when the flag 
(drm_dev_unplug) was set in amdgpu_pci_remove then as long as i wr

Re: [PATCH 1/6] drm/ttm: Add unampping of the entire device address space

2020-06-10 Thread Intel


On 6/10/20 5:30 PM, Daniel Vetter wrote:

On Wed, Jun 10, 2020 at 04:05:04PM +0200, Christian König wrote:

On 10.06.20 at 15:54, Andrey Grodzovsky wrote:


On 6/10/20 6:15 AM, Thomas Hellström (Intel) wrote:


On 6/9/20 7:21 PM, Koenig, Christian wrote:


On 09.06.2020 18:37, "Grodzovsky, Andrey" wrote:


 On 6/5/20 2:40 PM, Christian König wrote:
 > On 05.06.20 at 16:29, Andrey Grodzovsky wrote:
 >>
 >> On 5/11/20 2:45 AM, Christian König wrote:
 >>> On 09.05.20 at 20:51, Andrey Grodzovsky wrote:
 >>>> Signed-off-by: Andrey Grodzovsky 
 >>>> ---
 >>>> drivers/gpu/drm/ttm/ttm_bo.c    | 22 +-
 >>>> include/drm/ttm/ttm_bo_driver.h |  2 ++
 >>>>   2 files changed, 23 insertions(+), 1 deletion(-)
 >>>>
 >>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c
 >>>> b/drivers/gpu/drm/ttm/ttm_bo.c
 >>>> index c5b516f..eae61cc 100644
 >>>> --- a/drivers/gpu/drm/ttm/ttm_bo.c
 >>>> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
 >>>> @@ -1750,9 +1750,29 @@ void ttm_bo_unmap_virtual(struct
 >>>> ttm_buffer_object *bo)
 >>>> ttm_bo_unmap_virtual_locked(bo);
 >>>> ttm_mem_io_unlock(man);
 >>>>   }
 >>>> +EXPORT_SYMBOL(ttm_bo_unmap_virtual);
 >>>>   +void ttm_bo_unmap_virtual_address_space(struct
 ttm_bo_device *bdev)
 >>>> +{
 >>>> +    struct ttm_mem_type_manager *man;
 >>>> +    int i;
 >>>> -EXPORT_SYMBOL(ttm_bo_unmap_virtual);
 >>>
 >>>> +    for (i = 0; i < TTM_NUM_MEM_TYPES; i++) {
 >>>> +    man = >man[i];
 >>>> +    if (man->has_type && man->use_type)
 >>>> + ttm_mem_io_lock(man, false);
 >>>> +    }
 >>>
 >>> You should drop that it will just result in a deadlock
 warning for
 >>> Nouveau and has no effect at all.
 >>>
 >>> Apart from that looks good to me,
 >>> Christian.
 >>
 >>
 >> As I am considering to re-include this in V2 of the
 patchsets, can
 >> you clarify please why this will have no effect at all ?
 >
 > The locks are exclusive for Nouveau to allocate/free the io
 address
 > space.
 >
 > Since we don't do this here we don't need the locks.
 >
 > Christian.


 So basically calling unmap_mapping_range doesn't require any extra
 locking around it and whatever locks are taken within the function
 should be enough ?



I think so, yes.

Christian.

Yes, that's true. However, without the bo reservation, nothing stops
a PTE from being immediately re-faulted back again. Even while
unmap_mapping_range() is running.


Can you explain more on this - specifically, which function to reserve
the BO, why BO reservation would prevent re-fault of the PTE ?


Thomas is talking about ttm_bo_reserve()/ttm_bo_unreserve(), but we don't
need this because we unmap everything since the whole device is gone,
and don't just manipulate a single BO.


So the device removed flag needs to be advertized before this
function is run,


I indeed intend to call this right after calling drm_dev_unplug from
amdgpu_pci_remove while adding drm_dev_enter/exit in ttm_bo_vm_fault
(or in an amdgpu-specific wrapper, since I don't see how I can access
struct drm_device from ttm_bo_vm_fault), and this in my understanding
should stop a PTE from being re-faulted back as you pointed out - so
again I don't see how bo reservation would prevent it, so it looks
like I am missing something...



(perhaps with a memory barrier pair).


drm_dev_unplug and drm_dev_enter/exit are RCU synchronized, and so I
don't think they require any extra memory barriers for visibility of
the removed flag being set.


As far as I can see that should be perfectly sufficient.

Only if you have a drm_dev_enter/exit pair in your fault handler.
Otherwise you're still open to the races Thomas described. But aside from
that the drm_dev_unplug stuff has all the barriers and stuff to make sure
nothing escapes.

Failure to drm_dev_enter could then also trigger the special case where we
put a dummy page in place.
-Daniel


Hmm, yes, indeed advertising the flag before the call to
unmap_mapping_range isn't enough, since there might be fault handlers
running that haven't picked up the flag when unmap_mapping_range is
launched.


For the special case of syncing a full address-space 
unmap_mapping_range() with fault handlers regardless of the reason for 
the full address-space unmap_mapping_range() one could either traverse 
the address space (drm_vma_manager) and grab *all* bo reservations 
around the unmap_mapping_range(), or grab the i_mmap_lock in read mode 
in the fault handler. (It's taken in write mode in unmap_mapping_range). 
While the latter may seem like a simple solution, one should probably 
consider the overhead both in run-time and scaling ability.
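
To make the second option concrete, a hedged sketch of the
fault-handler side, with ttm_bo_vm_fault() standing in for the PTE
insertion and an invented function name:

static vm_fault_t my_bo_vm_fault(struct vm_fault *vmf)
{
	struct ttm_buffer_object *bo = vmf->vma->vm_private_data;
	struct address_space *mapping = bo->bdev->dev_mapping;
	vm_fault_t ret;

	/* read mode excludes the write-mode hold in unmap_mapping_range() */
	i_mmap_lock_read(mapping);
	ret = ttm_bo_vm_fault(vmf);	/* inserts the PTE(s) */
	i_mmap_unlock_read(mapping);

	return ret;
}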


/Thomas




Re: [PATCH 1/6] drm/ttm: Add unampping of the entire device address space

2020-06-10 Thread Intel


On 6/10/20 3:54 PM, Andrey Grodzovsky wrote:



On 6/10/20 6:15 AM, Thomas Hellström (Intel) wrote:



On 6/9/20 7:21 PM, Koenig, Christian wrote:



On 09.06.2020 18:37, "Grodzovsky, Andrey" wrote:



On 6/5/20 2:40 PM, Christian König wrote:
> Am 05.06.20 um 16:29 schrieb Andrey Grodzovsky:
>>
>> On 5/11/20 2:45 AM, Christian König wrote:
>>> Am 09.05.20 um 20:51 schrieb Andrey Grodzovsky:
>>>> Signed-off-by: Andrey Grodzovsky 
>>>> ---
>>>> drivers/gpu/drm/ttm/ttm_bo.c    | 22 +-
>>>> include/drm/ttm/ttm_bo_driver.h |  2 ++
>>>>   2 files changed, 23 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c
>>>> b/drivers/gpu/drm/ttm/ttm_bo.c
>>>> index c5b516f..eae61cc 100644
>>>> --- a/drivers/gpu/drm/ttm/ttm_bo.c
>>>> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
>>>> @@ -1750,9 +1750,29 @@ void ttm_bo_unmap_virtual(struct
>>>> ttm_buffer_object *bo)
>>>> ttm_bo_unmap_virtual_locked(bo);
>>>> ttm_mem_io_unlock(man);
>>>>   }
>>>> +EXPORT_SYMBOL(ttm_bo_unmap_virtual);
>>>>   +void ttm_bo_unmap_virtual_address_space(struct
ttm_bo_device *bdev)
>>>> +{
>>>> +    struct ttm_mem_type_manager *man;
>>>> +    int i;
>>>> -EXPORT_SYMBOL(ttm_bo_unmap_virtual);
>>>
>>>> +    for (i = 0; i < TTM_NUM_MEM_TYPES; i++) {
>>>> +    man = >man[i];
>>>> +    if (man->has_type && man->use_type)
>>>> + ttm_mem_io_lock(man, false);
>>>> +    }
>>>
>>> You should drop that it will just result in a deadlock
warning for
>>> Nouveau and has no effect at all.
>>>
>>> Apart from that looks good to me,
>>> Christian.
>>
>>
>> As I am considering to re-include this in V2 of the
patchsets, can
>> you clarify please why this will have no effect at all ?
>
> The locks are exclusive for Nouveau to allocate/free the io
address
> space.
>
> Since we don't do this here we don't need the locks.
>
> Christian.


So basically calling unmap_mapping_range doesn't require any extra
locking around it and whatever locks are taken within the function
should be enough ?



I think so, yes.

Christian.


Yes, that's true. However, without the bo reservation, nothing stops 
a PTE from being immediately re-faulted back again. Even while 
unmap_mapping_range() is running.




Can you explain more on this - specifically, which function to reserve 
the BO, why BO reservation would prevent re-fault of the PTE ?


The bo reservation is taken in the TTM fault handler and temporarily 
blocks inserting a new PTE. So typically the bo reservation is held 
around unmap_mapping_range() and the buffer object operation that 
triggered it (typically a move or change of backing store).
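
A hedged sketch of that pattern with the TTM API of the time; error
handling simplified and the helper name invented:

static int my_bo_kill_ptes(struct ttm_buffer_object *bo)
{
	int ret = ttm_bo_reserve(bo, true, false, NULL);

	if (ret)
		return ret;
	/* The fault handler also takes the reservation, so it cannot
	 * re-insert a PTE while we hold it across the unmap. */
	ttm_bo_unmap_virtual(bo);	/* calls unmap_mapping_range() */
	/* ... move the bo / switch its backing store here ... */
	ttm_bo_unreserve(bo);
	return 0;
}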


/Thomas





Re: [PATCH 02/18] dma-buf: minor doc touch-ups

2020-06-10 Thread Intel



On 6/4/20 10:12 AM, Daniel Vetter wrote:

Just some tiny edits:
- fix link to struct dma_fence
- give slightly more meaningful title - the polling here is about
   implicit fences, explicit fences (in sync_file or drm_syncobj) also
   have their own polling

Signed-off-by: Daniel Vetter 


Reviewed-by: Thomas Hellstrom 




Re: [PATCH 01/18] mm: Track mmu notifiers in fs_reclaim_acquire/release

2020-06-10 Thread Intel

Hi, Daniel,

Please see below.

On 6/4/20 10:12 AM, Daniel Vetter wrote:

fs_reclaim_acquire/release nicely catch recursion issues when
allocating GFP_KERNEL memory against shrinkers (which gpu drivers tend
to use to keep the excessive caches in check). For mmu notifier
recursions we do have lockdep annotations since 23b68395c7c7
("mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end").

But these only fire if a path actually results in some pte
invalidation - for most small allocations that's very rarely the case.
The other trouble is that pte invalidation can happen any time when
__GFP_RECLAIM is set. Which means only really GFP_ATOMIC is a safe
choice, GFP_NOIO isn't good enough to avoid potential mmu notifier
recursion.

I was pondering whether we should just do the general annotation, but
there's always the risk for false positives. Plus I'm assuming that
the core fs and io code is a lot better reviewed and tested than
random mmu notifier code in drivers. Hence why I decide to only
annotate for that specific case.

Furthermore even if we'd create a lockdep map for direct reclaim, we'd
still need to explicit pull in the mmu notifier map - there's a lot
more places that do pte invalidation than just direct reclaim, these
two contexts aren't the same.

Note that the mmu notifiers needing their own independent lockdep map
is also the reason we can't hold them from fs_reclaim_acquire to
fs_reclaim_release - it would nest with the acquisition in the pte
invalidation code, causing a lockdep splat. And we can't remove the
annotations from pte invalidation and all the other places since
they're called from many other places than page reclaim. Hence we can
only do the equivalent of might_lock, but on the raw lockdep map.

With this we can also remove the lockdep priming added in 66204f1d2d1b
("mm/mmu_notifiers: prime lockdep") since the new annotations are
strictly more powerful.
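
To illustrate the class of bug this catches, consider a driver pattern
along these lines (all names below are made up for illustration):

#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/mmu_notifier.h>

static DEFINE_MUTEX(drv_lock);	/* hypothetical driver lock */

/* Runs in the mmu notifier invalidate_range_start path. */
static int drv_invalidate_range_start(struct mmu_notifier *mn,
				      const struct mmu_notifier_range *range)
{
	mutex_lock(&drv_lock);
	/* ... zap gpu page tables ... */
	mutex_unlock(&drv_lock);
	return 0;
}

static void *drv_alloc(void)
{
	void *p;

	mutex_lock(&drv_lock);
	/*
	 * GFP_KERNEL implies __GFP_RECLAIM: direct reclaim may call back
	 * into drv_invalidate_range_start() and deadlock on drv_lock.
	 * With this patch lockdep flags the combination on every
	 * allocation, not only when reclaim actually invalidates a pte.
	 */
	p = kmalloc(64, GFP_KERNEL);
	mutex_unlock(&drv_lock);
	return p;
}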

Cc: Andrew Morton 
Cc: Jason Gunthorpe 
Cc: linux...@kvack.org
Cc: linux-r...@vger.kernel.org
Cc: Maarten Lankhorst 
Cc: Christian König 
Signed-off-by: Daniel Vetter 
---
This is part of a gpu lockdep annotation series simply because it
really helps to catch issues where gpu subsystem locks and primitives
can deadlock with themselves through allocations and mmu notifiers.
But aside from that motivation it should be completely free-standing,
and can land through -mm/-rdma/-hmm or any other tree really whenever.
-Daniel
---
  mm/mmu_notifier.c |  7 -------
  mm/page_alloc.c   | 23 ++++++++++++++---------
  2 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 06852b896fa6..5d578b9122f8 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -612,13 +612,6 @@ int __mmu_notifier_register(struct mmu_notifier *subscription,
 	lockdep_assert_held_write(&mm->mmap_sem);
 	BUG_ON(atomic_read(&mm->mm_users) <= 0);
 
-	if (IS_ENABLED(CONFIG_LOCKDEP)) {
-		fs_reclaim_acquire(GFP_KERNEL);
-		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
-		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
-		fs_reclaim_release(GFP_KERNEL);
-	}
-
 	if (!mm->notifier_subscriptions) {
 		/*
 		 * kmalloc cannot be called under mm_take_all_locks(), but we
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 13cc653122b7..f8a222db4a53 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -57,6 +57,7 @@
  #include 
  #include 
  #include 
+#include <linux/mmu_notifier.h>
  #include 
  #include 
  #include 
@@ -4124,7 +4125,7 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags)
 static struct lockdep_map __fs_reclaim_map =
 	STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
 
-static bool __need_fs_reclaim(gfp_t gfp_mask)
+static bool __need_reclaim(gfp_t gfp_mask)
 {
 	gfp_mask = current_gfp_context(gfp_mask);
 
@@ -4136,10 +4137,6 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
 	if (current->flags & PF_MEMALLOC)
 		return false;
 
-	/* We're only interested __GFP_FS allocations for now */
-	if (!(gfp_mask & __GFP_FS))
-		return false;
-
 	if (gfp_mask & __GFP_NOLOCKDEP)
 		return false;
  
@@ -4158,15 +4155,23 @@ void __fs_reclaim_release(void)
 
 void fs_reclaim_acquire(gfp_t gfp_mask)
 {
-	if (__need_fs_reclaim(gfp_mask))
-		__fs_reclaim_acquire();
+	if (__need_reclaim(gfp_mask)) {
+		if (!(gfp_mask & __GFP_FS))

Hmm. Shouldn't this be "if (gfp_mask & __GFP_FS)" or am I misunderstanding?

+			__fs_reclaim_acquire();

#ifdef CONFIG_MMU_NOTIFIER?

+
+		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
+		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
+
+	}
 }
 EXPORT_SYMBOL_GPL(fs_reclaim_acquire);
 
 void fs_reclaim_release(gfp_t gfp_mask)
 {
-	if 

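For reference, folding in both comments above would give something like
this (my reading of the intended logic, not the actual patch):

void fs_reclaim_acquire(gfp_t gfp_mask)
{
	if (__need_reclaim(gfp_mask)) {
		if (gfp_mask & __GFP_FS)	/* condition inverted vs. the patch */
			__fs_reclaim_acquire();

#ifdef CONFIG_MMU_NOTIFIER
		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
#endif
	}
}
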
Re: [PATCH 5/6] drm/ttm: Add destroy flag in TTM BO eviction interface

2020-06-10 Thread Intel



On 5/9/20 8:51 PM, Andrey Grodzovsky wrote:

This will allow to invalidate, destroy backing storage and notify users
of BOs when device is unpluged.

Signed-off-by: Andrey Grodzovsky 


Please add a motivation in the commit message and use imperative wording 
("Allow to invalidate..." instead of "This will allow to")


s /unpluged/unplugged/



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c  |  2 +-
  drivers/gpu/drm/nouveau/nouveau_drm.c   |  2 +-
  drivers/gpu/drm/qxl/qxl_object.c|  4 +--
  drivers/gpu/drm/radeon/radeon_object.c  |  2 +-
  drivers/gpu/drm/ttm/ttm_bo.c| 41 ++---
  drivers/gpu/drm/vmwgfx/vmwgfx_drv.c |  6 ++---
  include/drm/ttm/ttm_bo_api.h|  2 +-
  8 files changed, 35 insertions(+), 26 deletions(-)

diff --git a/include/drm/ttm/ttm_bo_api.h b/include/drm/ttm/ttm_bo_api.h
index b9bc1b0..9d57b8c 100644
--- a/include/drm/ttm/ttm_bo_api.h
+++ b/include/drm/ttm/ttm_bo_api.h
@@ -597,7 +597,7 @@ int ttm_bo_clean_mm(struct ttm_bo_device *bdev, unsigned mem_type);
  * -ERESTARTSYS: The call was interrupted by a signal while waiting to
  * evict a buffer.
  */


Please also update the function documentation.
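
E.g. something along these lines for the new parameter (suggested
wording only, not from the patch):

/**
 * ttm_bo_evict_mm
 *
 * @bdev: Pointer to a ttm_bo_device struct.
 * @mem_type: The memory type to evict from.
 * @destroy: Also invalidate the evicted BOs and destroy their backing
 * storage, e.g. on device unplug.
 * ...
 */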


-int ttm_bo_evict_mm(struct ttm_bo_device *bdev, unsigned mem_type);
+int ttm_bo_evict_mm(struct ttm_bo_device *bdev, unsigned mem_type, bool destroy);
  
  /**

   * ttm_kmap_obj_virtual



Thanks,

Thomas




Re: [PATCH 1/6] drm/ttm: Add unampping of the entire device address space

2020-06-10 Thread Intel


On 6/9/20 7:21 PM, Koenig, Christian wrote:



On 09.06.2020 18:37, "Grodzovsky, Andrey" wrote:



On 6/5/20 2:40 PM, Christian König wrote:
> On 05.06.20 at 16:29, Andrey Grodzovsky wrote:
>>
>> On 5/11/20 2:45 AM, Christian König wrote:
>>> On 09.05.20 at 20:51, Andrey Grodzovsky wrote:
 Signed-off-by: Andrey Grodzovsky 
 ---
 drivers/gpu/drm/ttm/ttm_bo.c    | 22 +++++++++++++++++++++-
 include/drm/ttm/ttm_bo_driver.h |  2 ++
   2 files changed, 23 insertions(+), 1 deletion(-)

 diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
 index c5b516f..eae61cc 100644
 --- a/drivers/gpu/drm/ttm/ttm_bo.c
 +++ b/drivers/gpu/drm/ttm/ttm_bo.c
 @@ -1750,9 +1750,29 @@ void ttm_bo_unmap_virtual(struct ttm_buffer_object *bo)
  	ttm_bo_unmap_virtual_locked(bo);
  	ttm_mem_io_unlock(man);
  }
 +EXPORT_SYMBOL(ttm_bo_unmap_virtual);
 
 +void ttm_bo_unmap_virtual_address_space(struct ttm_bo_device *bdev)
 +{
 +	struct ttm_mem_type_manager *man;
 +	int i;
 
 -EXPORT_SYMBOL(ttm_bo_unmap_virtual);
 
 +	for (i = 0; i < TTM_NUM_MEM_TYPES; i++) {
 +		man = &bdev->man[i];
 +		if (man->has_type && man->use_type)
 +			ttm_mem_io_lock(man, false);
 +	}
>>>
>>> You should drop that; it will just result in a deadlock warning for
>>> Nouveau and has no effect at all.
>>>
>>> Apart from that looks good to me,
>>> Christian.
>>
>>
>> As I am considering re-including this in V2 of the patchset, can
>> you please clarify why this will have no effect at all?
>
> The locks are exclusive for Nouveau to allocate/free the io address
> space.
>
> Since we don't do this here we don't need the locks.
>
> Christian.


So basically calling unmap_mapping_range() doesn't require any extra
locking around it, and whatever locks are taken within the function
should be enough?



I think so, yes.

Christian.


Yes, that's true. However, without the bo reservation, nothing stops a
PTE from being immediately re-faulted back again, even while
unmap_mapping_range() is running. So the device-removed flag needs to be
advertised before this function is run (perhaps with a memory barrier
pair). That should probably be added to the function documentation.
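
A rough sketch of such a flag/barrier pairing (the field and handler
names here are hypothetical):

/* Unplug side: */
WRITE_ONCE(ddev->removed, true);
smp_wmb();	/* pairs with smp_rmb() in the fault handler */
ttm_bo_unmap_virtual_address_space(bdev);

/* Fault handler side, before (re-)inserting a PTE: */
smp_rmb();
if (READ_ONCE(ddev->removed))
	return VM_FAULT_SIGBUS;	/* refuse to re-insert the PTE */
/* ... normal TTM fault handling ... */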


(Other than that, please add a commit message if respinning).

/Thomas





Re: [PATCH] dma-fence: basic lockdep annotations

2020-06-05 Thread Intel
 critical for reaching completion of
   fences. One example would be a scheduler kthread which picks up jobs
   and pushes them into hardware, where the interrupt handler or
   another completion thread calls dma_fence_signal(). But if the
   scheduler thread hangs, then all the fences hang, hence we need to
   manually annotate it. cross-release aimed to solve this by chaining
   cross-release dependencies, but the dependency from scheduler thread
   to the completion interrupt handler goes through hw where
   cross-release code can't observe it.

In short, without manual annotations and careful review of the start
and end of critical sections, cross-release dependency tracking doesn't
work. We need explicit annotations.

v2: handle soft/hardirq ctx better against write side and dont forget
EXPORT_SYMBOL, drivers can't use this otherwise.

v3: Kerneldoc.

v4: Some spelling fixes from Mika

v5: Amend commit message to explain in detail why cross-release isn't
the solution.

Cc: Mika Kuoppala 
Cc: Thomas Hellstrom 
Cc: linux-me...@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
Cc: linux-r...@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-...@lists.freedesktop.org
Cc: Chris Wilson 
Cc: Maarten Lankhorst 
Cc: Christian König 
Signed-off-by: Daniel Vetter 
---


Reviewed-by: Thomas Hellström 




Re: [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-04 Thread Intel


On 6/4/20 10:12 AM, Daniel Vetter wrote:
...

Thread A:

mutex_lock(A);
mutex_unlock(A);

dma_fence_signal();

Thread B:

mutex_lock(A);
dma_fence_wait();
mutex_unlock(A);

Thread B is blocked on A signalling the fence, but A never gets around
to that because it cannot acquire the lock A.
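
With the helpers this patch adds, the signalling side of thread A would
be annotated roughly like so (a sketch, assuming a published fence):

bool cookie;

/* The fence is now published and others may wait on it. */
cookie = dma_fence_begin_signalling();

mutex_lock(&A);
mutex_unlock(&A);
dma_fence_signal(fence);

dma_fence_end_signalling(cookie);

/* dma_fence_wait() takes the annotation's write side, so thread B
 * holding mutex A across the wait now triggers a lockdep splat. */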

Note that dma_fence_wait() is allowed to be nested within
dma_fence_begin/end_signalling sections. To allow this to happen the
read lock needs to be upgraded to a write lock, which means that if any
other lock is acquired between the dma_fence_begin_signalling() call and
the call to dma_fence_wait(), and is still held, this will result in an
immediate lockdep complaint. The only other option would be to not
annotate such calls, defeating the point. Therefore these annotations
cannot be sprinkled over the code entirely mindlessly to avoid false
positives.


Just realized, isn't that example actually a true positive, or at least 
a great candidate for a true positive, since if another thread reenters 
that signaling path, it will block on that mutex, and the fence would 
never be signaled unless there is another signaling path?


Although I agree the conclusion is sound: These annotations cannot be 
sprinkled mindlessly over the code.


/Thomas








v2: handle soft/hardirq ctx better against write side and dont forget
EXPORT_SYMBOL, drivers can't use this otherwise.

v3: Kerneldoc.

v4: Some spelling fixes from Mika

Cc: Mika Kuoppala 
Cc: Thomas Hellstrom 
Cc: linux-me...@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
Cc: linux-r...@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-...@lists.freedesktop.org
Cc: Chris Wilson 
Cc: Maarten Lankhorst 
Cc: Christian König 
Signed-off-by: Daniel Vetter 
---
  Documentation/driver-api/dma-buf.rst |  12 +-
  drivers/dma-buf/dma-fence.c  | 161 +++
  include/linux/dma-fence.h|  12 ++
  3 files changed, 182 insertions(+), 3 deletions(-)

diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
index 63dec76d1d8d..05d856131140 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -100,11 +100,11 @@ CPU Access to DMA Buffer Objects
  .. kernel-doc:: drivers/dma-buf/dma-buf.c
 :doc: cpu access
  
-Fence Poll Support
-~~~~~~~~~~~~~~~~~~
+Implicit Fence Poll Support
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. kernel-doc:: drivers/dma-buf/dma-buf.c
-   :doc: fence polling
+   :doc: implicit fence polling
 
 Kernel Functions and Structures Reference
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -133,6 +133,12 @@ DMA Fences
  .. kernel-doc:: drivers/dma-buf/dma-fence.c
 :doc: DMA fences overview
  
+DMA Fence Signalling Annotations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: drivers/dma-buf/dma-fence.c
+   :doc: fence signalling annotation
+
 DMA Fences Functions Reference
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 656e9ac2d028..0005bc002529 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -110,6 +110,160 @@ u64 dma_fence_context_alloc(unsigned num)
  }
  EXPORT_SYMBOL(dma_fence_context_alloc);
  
+/**
+ * DOC: fence signalling annotation
+ *
+ * Proving correctness of all the kernel code around &dma_fence through code
+ * review and testing is tricky for a few reasons:
+ *
+ * * It is a cross-driver contract, and therefore all drivers must follow the
+ *   same rules for lock nesting order, calling contexts for various functions
+ *   and anything else significant for in-kernel interfaces. But it is also
+ *   impossible to test all drivers in a single machine, hence brute-force N vs.
+ *   N testing of all combinations is impossible. Even just limiting to the
+ *   possible combinations is infeasible.
+ *
+ * * There is an enormous amount of driver code involved. For render drivers
+ *   there's the tail of command submission, after fences are published,
+ *   scheduler code, interrupt and workers to process job completion,
+ *   and timeout, gpu reset and gpu hang recovery code. Plus for integration
+ *   with core mm we have &mmu_notifier, respectively &mmu_interval_notifier,
+ *   and &shrinker. For modesetting drivers there's the commit tail functions
+ *   between when fences for an atomic modeset are published, and when the
+ *   corresponding vblank completes, including any interrupt processing and
+ *   related workers. Auditing all that code, across all drivers, is not
+ *   feasible.
+ *
+ * * Due to how many other subsystems are involved and the locking hierarchies
+ *   this pulls in there is extremely thin wiggle-room for driver-specific
+ *   differences. &dma_fence interacts with almost all of the core memory
+ *   handling through page fault handlers via &dma_resv, dma_resv_lock() and
+ *   dma_resv_unlock(). On the other side it also interacts through all
+ *   allocation sites through &mmu_notifier

Re: [RFC 02/17] dma-fence: basic lockdep annotations

2020-05-28 Thread Intel

On 2020-05-12 10:59, Daniel Vetter wrote:

Design is similar to the lockdep annotations for workers, but with
some twists:

- We use a read-lock for the execution/worker/completion side, so that
   this explicit annotation can be more liberally sprinkled around.
   With read locks lockdep isn't going to complain if the read-side
   isn't nested the same way under all circumstances, so ABBA deadlocks
   are ok. Which they are, since this is an annotation only.

- We're using non-recursive lockdep read lock mode, since in recursive
   read lock mode lockdep does not catch read side hazards. And we
   _very_ much want read side hazards to be caught. For full details of
   this limitation see

   commit e91498589746065e3ae95d9a00b068e525eec34f
   Author: Peter Zijlstra 
   Date:   Wed Aug 23 13:13:11 2017 +0200

   locking/lockdep/selftests: Add mixed read-write ABBA tests

- To allow nesting of the read-side explicit annotations we explicitly
   keep track of the nesting. lock_is_held() allows us to do that.

- The wait-side annotation is a write lock, and entirely done within
   dma_fence_wait() for everyone by default.

- To be able to freely annotate helper functions I want to make it ok
   to call dma_fence_begin/end_signalling from soft/hardirq context.
   First attempt was using the hardirq locking context for the write
   side in lockdep, but this forces all normal spinlocks nested within
   dma_fence_begin/end_signalling to be irq-safe spinlocks. That's
   bollocks.

   The approach now is to simply check in_atomic(), and for these cases
   entirely rely on the might_sleep() check in dma_fence_wait(). That
   will catch any wrong nesting against spinlocks from soft/hardirq
   contexts; see the sketch just below.
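
Put together, the read-side helpers end up roughly like this (a sketch
paraphrasing the patch; details in the actual series may differ):

static struct lockdep_map dma_fence_lockdep_map = {
	.name = "dma_fence_map"
};

bool dma_fence_begin_signalling(void)
{
	/* explicitly nesting ... */
	if (lock_is_held_type(&dma_fence_lockdep_map, 1))
		return true;

	/* rely on might_sleep check for soft/hardirq locks inversions */
	if (in_atomic())
		return true;

	/* ... and non-recursive read lock */
	lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
	return false;
}

void dma_fence_end_signalling(bool cookie)
{
	if (cookie)
		return;

	lock_release(&dma_fence_lockdep_map, _RET_IP_);
}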

The idea here is that every code path that's critical for eventually
signalling a dma_fence should be annotated with
dma_fence_begin/end_signalling. The annotation ideally starts right
after a dma_fence is published (added to a dma_resv, exposed as a
sync_file fd, attached to a drm_syncobj fd, or anything else that
makes the dma_fence visible to other kernel threads), up to and
including the dma_fence_wait(). Examples are irq handlers, the
scheduler rt threads, the tail of execbuf (after the corresponding
fences are visible), any workers that end up signalling dma_fences and
really anything else. Not annotated should be code paths that only
complete fences opportunistically as the gpu progresses, like e.g.
shrinker/eviction code.

The main class of deadlocks this is supposed to catch are:

Thread A:

mutex_lock(A);
mutex_unlock(A);

dma_fence_signal();

Thread B:

mutex_lock(A);
dma_fence_wait();
mutex_unlock(A);

Thread B is blocked on A signalling the fence, but A never gets around
to that because it cannot acquire the lock A.

Note that dma_fence_wait() is allowed to be nested within
dma_fence_begin/end_signalling sections. To allow this to happen the
read lock needs to be upgraded to a write lock, which means that if any
other lock is acquired between the dma_fence_begin_signalling() call and
the call to dma_fence_wait(), and is still held, this will result in an
immediate lockdep complaint. The only other option would be to not
annotate such calls, defeating the point. Therefore these annotations
cannot be sprinkled over the code entirely mindlessly to avoid false
positives.

v2: handle soft/hardirq ctx better against write side and dont forget
EXPORT_SYMBOL, drivers can't use this otherwise.

Cc: linux-me...@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
Cc: linux-r...@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-...@lists.freedesktop.org
Cc: Chris Wilson 
Cc: Maarten Lankhorst 
Cc: Christian König 
Signed-off-by: Daniel Vetter 


LGTM. Perhaps add some in-code documentation on how the new functions
are meant to be used.


Otherwise for patch 2 and 3,

Reviewed-by: Thomas Hellstrom 

