RE: Suspecting corrupted VBIOS after update of AMDGPU on AMD7870

2020-01-30 Thread Liu, Zhan
Okay, I see. From your attached dmesg.log, the issue comes from here:

[   26.265638] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, 
signaled seq=1, emitted seq=2
[   26.265764] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: 
process  pid 0 thread  pid 0
[   26.265771] [drm] GPU recovery disabled.

There was a very similar issue that was recently fixed and merged into the 5.5 
kernel. I noticed that you are using the 5.4 kernel, so you could give 5.5 a 
spin to see what happens.

As for these "Green Dots" at the BIOS initialization stage: the main amdgpu driver 
is not loaded yet at that point, so they shouldn't be related to amdgpu.

BTW, your new findings link 
(https://gist.github.com/Kreyren/3e55e9a754e58956e1690e38b1888de7) gives me a 
404. Please fix the link. Good luck!

Warm regards,
Zhan



From: Jacob Hrbek 
Sent: 2020/January/30, Thursday 5:55 PM
To: Liu, Zhan ; amd-gfx@lists.freedesktop.org
Subject: Re: Suspecting corrupted VBIOS after update of AMDGPU on AMD7870

Hello Zhan,
Here it is:
https://gist.githubusercontent.com/Kreyren/e35587d8710e63e511e69d8653fd996b/raw/628df1c76ff99adab1d2161e6a20f631de101d5c/gistfile1.txt

Note that I'm updating previous gists with new findings 
(https://gist.github.com/Kreyren/3e55e9a754e58956e1690e38b1888de7).

If relevant, I'm also getting these 'Green dots' at BIOS initialization 
(https://linx.li/s/8j3poh2z.png).
These dots are not present anywhere else and were not present before said 
update.

Thanks,
- Jacob Hrbek

On Thu, Jan 30, 2020 at 8:10 PM Liu, Zhan <zhan@amd.com> wrote:
Hi Jacob,

Thank you for your bug report.

I saw you attached xorg.log, which is great. Could you also grab dmesg.log via 
SSH?

Thanks,
Zhan


From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of Jacob Hrbek
Sent: 2020/January/30, Thursday 12:18 PM
To: amd-gfx@lists.freedesktop.org
Subject: Suspecting corrupted VBIOS after update of AMDGPU on AMD7870

Hello,
I believe that a system update that included amdgpu on Debian testing (though I am 
on LFS) corrupted the VBIOS on my AMD7870 (+- 4 hours after the update, the GPU 
using the AMDGPU/Radeon drivers produced no output).
I'm sending this email to report a possible bug, with my findings at 
https://gist.github.com/Kreyren/3e55e9a754e58956e1690e38b1888de7
and I would appreciate any help in excluding VBIOS corruption from the 
diagnostics.
Thanks,
- Jacob Hrbek
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco

2020-01-30 Thread Zeng, Oak
[AMD Official Use Only - Internal Distribution Only]

Hi Felix,

See one inline comment

Regards,
Oak

-Original Message-
From: amd-gfx  On Behalf Of Felix 
Kuehling
Sent: Thursday, January 30, 2020 6:24 PM
To: Alex Deucher 
Cc: Deucher, Alexander ; Bhardwaj, Rajneesh 
; amd-gfx list 
Subject: Re: [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco


On 2020-01-30 17:11, Alex Deucher wrote:
> On Thu, Jan 30, 2020 at 4:55 PM Felix Kuehling  wrote:
>> On 2020-01-30 14:01, Bhardwaj, Rajneesh wrote:
>>> Hello Felix,
>>>
>>> Thanks for your time to review and for your feedback.
>>>
>>> On 1/29/2020 5:52 PM, Felix Kuehling wrote:
Hi Rajneesh,

 See comments inline ...

 And a general question: Why do you need to set the 
 autosuspend_delay in so many places? Amdgpu only has a single call 
 to this function during initialization.
>>>
>>> We don't. I have called this out in the cover letter since I too feel 
>>> it's not the best way to do what we want to do. I have seen weird 
>>> issues on dual GPUs with KFDTest that sometimes result in system 
>>> hangs and GPU hangs etc.
>>>
>>> Even if I try running kfdtest on just one node in a multi-GPU 
>>> system, the BACO exit seems to wake up all GPUs, and then one goes to 
>>> auto-suspend again while the other performs the desired kfdtest operation.
>>> I have never seen any issue with a single GPU card. So in the current 
>>> approach, I am making sure that the GPUs are not quickly 
>>> auto-suspended while the KFD operations are ongoing, but once 
>>> kfdtest is finished, the runtime suspend call will ensure the timeout 
>>> is reset back to 5 seconds.
>> So by setting the timeout to 60s, you can fix applications that run 
>> for less than 60s without calling into KFD. What about applications 
>> that run for longer without calling into KFD? This doesn't sound like 
>> a solution, it's just a hack.
>>
>>
>>>
>>> I would like to avoid this and would appreciate some pointers on where 
>>> I can put get_sync calls while unbinding. I have tried a number of places 
>>> but saw some issues. So any help will be highly appreciated, to 
>>> identify the proper logical inverse of kfd_bind_process_to_device.
>> kfd_bind_process_to_device is called when KFD starts using a device. 
>> It only stops using that device when the process is destroyed. 
>> Therefore I recommended moving the cleanup code into 
>> kfd_process_destroy_pdds, which iterates over all devices (PDDs) during 
>> process termination.
>>
>> Note that kfd_bind_process_to_device can be called many times for the 
>> same device. If you need to keep usage counters balanced, you need 
>> extra checks to make sure you only do runtime-PM stuff the first time 
>> it's called for a particular device in a particular process.
>>
>> Also, in kfd_process_destroy_pdds you may need to check whether 
>> kfd_bind_process_to_device has been called for a particular PDD 
>> before releasing runtime PM.
> runtimepm already has reference counting.  You can use something like
> pm_runtime_get_sync() or pm_runtime_get_noresume() depending on 
> whether you want to make sure the GPU is powered up or not.  That will 
> increase the usage count on the device for runtime pm.  Then when you 
> want to decrement and possibly suspend the device, call
> pm_runtime_mark_last_busy() and pm_runtime_put_autosuspend().  That is 
> basically what we do in amdgpu to handle device usage.

My point is, kfd_bind_process_to_device can be called many times. If we call 
pm_runtime_get_sync, we'll need to release an unpredictable number of usage 
counts later. Therefore I suggested ensuring we only do this the first time 
kfd_bind_process_to_device is called for a particular process-device.

Similarly, in kfd_process_destroy_pdds we should only release the usage count 
if we actually took it in kfd_bind_process_to_device.


> We call
> pm_runtime_get_noresume() in our fence_emit function which is what
> gets called when we submit work to the GPU.  That increments the usage
> count for runpm.  Then we call pm_runtime_mark_last_busy() and
> pm_runtime_put_autosuspend() in our fence_process function.  That gets
> called as each job on the GPU is retired.  For kfd, I think you'll
> want to call pm_runtime_get_sync() at the start of your ioctl to
> wake the device, then pm_runtime_mark_last_busy() and
> pm_runtime_put_autosuspend() at the end of your ioctl to match
> the start.  Then in your actual ioctl work, you'll want to call
> pm_runtime_get_noresume() every time a queue gets created and
> pm_runtime_mark_last_busy() and pm_runtime_put_autosuspend() each time
> a queue is destroyed.  That should keep the device awake as long as
> there are any queues around.  Due to stuff like xgmi, it's probably
> best to just do this for all GPUs on the system just in case you have
> any p2p stuff going on.

OK, that's a completely different approach from what I had in mind. 
Basically tying device usage to both ioctls and the 

Re: [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco

2020-01-30 Thread Felix Kuehling


On 2020-01-30 17:11, Alex Deucher wrote:

On Thu, Jan 30, 2020 at 4:55 PM Felix Kuehling  wrote:

On 2020-01-30 14:01, Bhardwaj, Rajneesh wrote:

Hello Felix,

Thanks for your time to review and for your feedback.

On 1/29/2020 5:52 PM, Felix Kuehling wrote:

Hi Rajneesh,

See comments inline ...

And a general question: Why do you need to set the autosuspend_delay
in so many places? Amdgpu only has a single call to this function
during initialization.


We don't. I have called this out in the cover letter since I too feel it's
not the best way to do what we want to do. I have seen weird issues on
dual GPUs with KFDTest that sometimes result in system hangs and GPU
hangs etc.

Even if I try running kfdtest on just one node in a multi-GPU
system, the BACO exit seems to wake up all GPUs, and then one goes to
auto-suspend again while the other performs the desired kfdtest operation.
I have never seen any issue with a single GPU card. So in the current
approach, I am making sure that the GPUs are not quickly auto-suspended
while the KFD operations are ongoing, but once kfdtest is finished,
the runtime suspend call will ensure the timeout is reset back to 5
seconds.

So by setting the timeout to 60s, you can fix applications that run for
less than 60s without calling into KFD. What about applications that run
for longer without calling into KFD? This doesn't sound like a solution,
it's just a hack.




I would like to avoid this and would appreciate some pointers on where I can
put get_sync calls while unbinding. I have tried a number of places
but saw some issues. So any help will be highly appreciated, to
identify the proper logical inverse of kfd_bind_process_to_device.

kfd_bind_process_to_device is called when KFD starts using a device. It
only stops using that device when the process is destroyed. Therefore I
recommended moving the cleanup code into kfd_process_destroy_pdds, which
iterates over all devices (PDDs) during process termination.

Note that kfd_bind_process_to_device can be called many times for the
same device. If you need to keep usage counters balanced, you need extra
checks to make sure you only do runtime-PM stuff the first time it's
called for a particular device in a particular process.

Also, in kfd_process_destroy_pdds you may need to check whether
kfd_bind_process_to_device has been called for a particular PDD before
releasing runtime PM.

runtimepm already has reference counting.  You can use something like
pm_runtime_get_sync() or pm_runtime_get_noresume() depending on
whether you want to make sure the GPU is powered up or not.  That will
increase the usage count on the device for runtime pm.  Then when you
want to decrement and possibly suspend the device, call
pm_runtime_mark_last_busy() and pm_runtime_put_autosuspend().  That is
basically what we do in amdgpu to handle device usage.


My point is, kfd_bind_process_to_device can be called many times. If we 
call pm_runtime_get_sync, we'll need to release an unpredictable number 
of usage counts later. Therefore I suggested ensuring we only do this 
the first time kfd_bind_process_to_device is called for a particular 
process-device.


Similarly, in kfd_process_destroy_pdds we should only release the usage 
count if we actually took it in kfd_bind_process_to_device.




We call
pm_runtime_get_noresume() in our fence_emit function which is what
gets called when we submit work to the GPU.  That increments the usage
count for runpm.  Then we call pm_runtime_mark_last_busy() and
pm_runtime_put_autosuspend() in our fence_process function.  That gets
called as each job on the GPU is retired.  For kfd, I think you'll
want to call pm_runtime_get_sync() at the start of your ioctl to
wake the device, then pm_runtime_mark_last_busy() and
pm_runtime_put_autosuspend() at the end of your ioctl to match
the start.  Then in your actual ioctl work, you'll want to call
pm_runtime_get_noresume() every time a queue gets created and
pm_runtime_mark_last_busy() and pm_runtime_put_autosuspend() each time
a queue is destroyed.  That should keep the device awake as long as
there are any queues around.  Due to stuff like xgmi, it's probably
best to just do this for all GPUs on the system just in case you have
any p2p stuff going on.


OK, that's a completely different approach from what I had in mind. 
Basically tying device usage to both ioctls and the existence of user 
mode queues. P2P adds extra complication. Not just XGMI but also PCIe 
P2P where GPU A wants to access GPU B's memory but GPU B doesn't have 
any activity to keep it awake. To simplify this, we could also associate 
usage counters with allocating memory. That way any GPU that has memory 
that may be remotely accessed would be kept awake.


Always keeping all GPUs awake seems wasteful. I think being able to 
power down unused GPUs would be a very useful feature.


I still believe my idea of taking a usage counter in 
kfd_bind_process_to_device and dropping it in kfd_process_destroy_pdds 
is 

RE: [PATCH] drm/amdkfd: Add queue information to sysfs

2020-01-30 Thread Kasiviswanathan, Harish
[AMD Official Use Only - Internal Distribution Only]

One minor comment.

-Original Message-
From: amd-gfx  On Behalf Of Amber Lin
Sent: Thursday, January 30, 2020 12:46 AM
To: amd-gfx@lists.freedesktop.org
Cc: Lin, Amber 
Subject: [PATCH] drm/amdkfd: Add queue information to sysfs

Provide compute queue information in sysfs under /sys/class/kfd/kfd/proc.
The format is /sys/class/kfd/kfd/proc/<pid>/queues/<queue id>/XX, where
XX is one of three files, size, type, and gpuid, representing the queue size,
queue type, and the GPU this queue uses. The <queue id> folder and the files
underneath are generated when a queue is created. They are removed when the
queue is destroyed.

Signed-off-by: Amber Lin 
---
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h  |  9 ++
 drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 99 ++
 .../gpu/drm/amd/amdkfd/kfd_process_queue_manager.c |  2 +
 3 files changed, 110 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index c0b0def..cb2d2d7 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -503,6 +503,12 @@ struct queue {
struct kfd_process  *process;
struct kfd_dev  *device;
void *gws;
+
+   /* procfs */
+   struct kobject *kobj_qid;
+   struct attribute attr_size;
+   struct attribute attr_type;
+   struct attribute attr_gpuid;
 };
 
 /*
@@ -730,6 +736,7 @@ struct kfd_process {
 
/* Kobj for our procfs */
struct kobject *kobj;
+   struct kobject *kobj_queues;
struct attribute attr_pasid;
 };
 
@@ -836,6 +843,8 @@ extern struct device *kfd_device;
 /* KFD's procfs */
 void kfd_procfs_init(void);
 void kfd_procfs_shutdown(void);
+int kfd_procfs_add_queue(struct queue *q);
+void kfd_procfs_del_queue(struct queue *q);
 
 /* Topology */
 int kfd_topology_init(void);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 25b90f7..0220651 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -132,6 +132,97 @@ void kfd_procfs_shutdown(void)
}
 }
 
+static int kfd_procfs_add_file(const char *name, struct kobject *kobj,
+  struct attribute *attr)
+{
+   int ret;
+
+   attr->name = name;
+   attr->mode = KFD_SYSFS_FILE_MODE;
+   sysfs_attr_init(attr);
+   ret = sysfs_create_file(kobj, attr);
+   if (ret)
+   pr_warn("Creating %s file failed", name);
+   return ret;
+}
+
+static ssize_t kfd_procfs_queue_show(struct kobject *kobj,
+struct attribute *attr, char *buffer)
+{
+   if (!strcmp(attr->name, "size")) {
+   struct queue *q = container_of(attr, struct queue, attr_size);
+   return snprintf(buffer, PAGE_SIZE, "%llu",
+   q->properties.queue_size);
+   } else if (!strcmp(attr->name, "type")) {
+   struct queue *q = container_of(attr, struct queue, attr_type);
+   return snprintf(buffer, PAGE_SIZE, "%d", q->properties.type);
+   } else if (!strcmp(attr->name, "gpuid")) {
+   struct queue *q = container_of(attr, struct queue, attr_gpuid);
+   return snprintf(buffer, PAGE_SIZE, "%u", q->device->id);
+   } else
+   pr_err("Invalid attribute");
+
+   return 0;
+}
+
+static const struct sysfs_ops procfs_queue_ops = {
+   .show = kfd_procfs_queue_show,
+};
+
+static struct kobj_type procfs_queue_type = {
+   .release = kfd_procfs_kobj_release,
+   .sysfs_ops = &procfs_queue_ops,
+};
+
+int kfd_procfs_add_queue(struct queue *q)
+{
+   struct kfd_process *proc;
+   int ret;
+
+   if (!q || !q->process)
+   return -EINVAL;
+   proc = q->process;
+
+   /* Create proc/<pid>/queues/<queue id> folder */
+   if (!proc->kobj_queues)
+   return -EFAULT;
+   if (q->kobj_qid)
+   return -EEXIST;
+   q->kobj_qid = kfd_alloc_struct(q->kobj_qid);
+   if (!q->kobj_qid)
+   return -ENOMEM;
+   ret = kobject_init_and_add(q->kobj_qid, &procfs_queue_type,
+   proc->kobj_queues, "%u", q->properties.queue_id);
+   if (ret < 0) {
+   pr_warn("Creating proc//queues/%u failed",
+   q->properties.queue_id);
+   return ret;
+   }
+
+   /* Create proc/<pid>/queues/<queue id>/XX files */
+   kfd_procfs_add_file("size", q->kobj_qid, &q->attr_size);
+   kfd_procfs_add_file("type", q->kobj_qid, &q->attr_type);
+   kfd_procfs_add_file("gpuid", q->kobj_qid, &q->attr_gpuid);
+
+   return 0;
+}
+
+void kfd_procfs_del_queue(struct queue *q)
+{
+   struct kfd_process *proc;

[HK] : proc is not used

+
+   if (!q || !q->process)
+   return;
+   proc = q->process;
+
+   sysfs_remove_file(q->kobj_qid, &q->attr_size);
+   sysfs_remove_file(q->kobj_qid, 

Re: [PATCH] drm/amdkfd: Fix a bug in SDMA RLC queue counting under HWS mode

2020-01-30 Thread Yong Zhao

True. It is a bug too. I am looking into it.

Yong

On 2020-01-30 5:51 p.m., Felix Kuehling wrote:

On 2020-01-30 17:29, Yong Zhao wrote:

The sdma_queue_count increment should be done before
execute_queues_cpsch(), which calls pm_calc_rlib_size() where
sdma_queue_count is used to calculate whether over_subscription is
triggered.

With the previous code, when a SDMA queue is created,
compute_queue_count in pm_calc_rlib_size() is one more than the
actual compute queue number, because the queue_count has been
incremented while sdma_queue_count has not. This patch fixes that.

Change-Id: I20353e657efd505353d0dd9f7eb2fab5085e7202
Signed-off-by: Yong Zhao 


Reviewed-by: Felix Kuehling 

But I took a look at pm_calc_rlib_size. I don't think subtracting 
dqm->sdma_queue_count from dqm->queue_count is quite correct, 
because sdma_queue_count counts all SDMA queues, while queue_count 
only counts active queues. So an application that creates inactive 
SDMA queues will also trigger errors here. We probably need to count 
active compute and active SDMA queues separately in DQM to fix this 
properly.


Regards,
  Felix



---
  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 10 ++
  1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c

index 2870553a2ce0..80d22bf702e8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -1237,16 +1237,18 @@ static int create_queue_cpsch(struct 
device_queue_manager *dqm, struct queue *q,

    list_add(&q->list, &qpd->queues_list);
  qpd->queue_count++;
+
+    if (q->properties.type == KFD_QUEUE_TYPE_SDMA)
+    dqm->sdma_queue_count++;
+    else if (q->properties.type == KFD_QUEUE_TYPE_SDMA_XGMI)
+    dqm->xgmi_sdma_queue_count++;
+
  if (q->properties.is_active) {
  dqm->queue_count++;
  retval = execute_queues_cpsch(dqm,
  KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0);
  }
  -    if (q->properties.type == KFD_QUEUE_TYPE_SDMA)
-    dqm->sdma_queue_count++;
-    else if (q->properties.type == KFD_QUEUE_TYPE_SDMA_XGMI)
-    dqm->xgmi_sdma_queue_count++;
  /*
   * Unconditionally increment this counter, regardless of the 
queue's

   * type or whether the queue is active.

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdkfd: Fix a bug in SDMA RLC queue counting under HWS mode

2020-01-30 Thread Felix Kuehling

On 2020-01-30 17:29, Yong Zhao wrote:

The sdma_queue_count increment should be done before
execute_queues_cpsch(), which calls pm_calc_rlib_size() where
sdma_queue_count is used to calculate whether over_subscription is
triggered.

With the previous code, when a SDMA queue is created,
compute_queue_count in pm_calc_rlib_size() is one more than the
actual compute queue number, because the queue_count has been
incremented while sdma_queue_count has not. This patch fixes that.

Change-Id: I20353e657efd505353d0dd9f7eb2fab5085e7202
Signed-off-by: Yong Zhao 


Reviewed-by: Felix Kuehling 

But I took a look at pm_calc_rlib_size. I don't think subtracting 
dqm->sdma_queue_count from dqm->queue_count is quite correct, 
because sdma_queue_count counts all SDMA queues, while queue_count only 
counts active queues. So an application that creates inactive SDMA 
queues will also trigger errors here. We probably need to count active 
compute and active SDMA queues separately in DQM to fix this properly.
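
Something like this in the queue create/destroy paths, instead of
deriving the compute count by subtraction (the active_* DQM fields are
hypothetical):

    if (q->properties.is_active) {
            if (q->properties.type == KFD_QUEUE_TYPE_SDMA ||
                q->properties.type == KFD_QUEUE_TYPE_SDMA_XGMI)
                    dqm->active_sdma_queue_count++;
            else
                    dqm->active_compute_queue_count++;
    }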


Regards,
  Felix



---
  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 10 ++
  1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 2870553a2ce0..80d22bf702e8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -1237,16 +1237,18 @@ static int create_queue_cpsch(struct 
device_queue_manager *dqm, struct queue *q,
  
	list_add(&q->list, &qpd->queues_list);

qpd->queue_count++;
+
+   if (q->properties.type == KFD_QUEUE_TYPE_SDMA)
+   dqm->sdma_queue_count++;
+   else if (q->properties.type == KFD_QUEUE_TYPE_SDMA_XGMI)
+   dqm->xgmi_sdma_queue_count++;
+
if (q->properties.is_active) {
dqm->queue_count++;
retval = execute_queues_cpsch(dqm,
KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0);
}
  
-	if (q->properties.type == KFD_QUEUE_TYPE_SDMA)

-   dqm->sdma_queue_count++;
-   else if (q->properties.type == KFD_QUEUE_TYPE_SDMA_XGMI)
-   dqm->xgmi_sdma_queue_count++;
/*
 * Unconditionally increment this counter, regardless of the queue's
 * type or whether the queue is active.

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 5/5] drm/amdgpu: rework synchronization of VM updates v4

2020-01-30 Thread Felix Kuehling

On 2020-01-30 7:49, Christian König wrote:

If provided we only sync to the BOs reservation
object and no longer to the root PD.

v2: update comment, cleanup amdgpu_bo_sync_wait_resv
v3: use correct reservation object while clearing
v4: fix typo in amdgpu_bo_sync_wait_resv

Signed-off-by: Christian König 
Tested-by: Tom St Denis 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c  | 35 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h  |  3 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c|  7 
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  | 41 -
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h  |  4 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c  | 22 +++
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 15 ++--
  7 files changed, 66 insertions(+), 61 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 46c76e2e1281..6f60a581e3ba 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1403,30 +1403,51 @@ void amdgpu_bo_fence(struct amdgpu_bo *bo, struct 
dma_fence *fence,
  }
  
  /**

- * amdgpu_sync_wait_resv - Wait for BO reservation fences
+ * amdgpu_bo_sync_wait_resv - Wait for BO reservation fences
   *
- * @bo: buffer object
+ * @adev: amdgpu device pointer
+ * @resv: reservation object to sync to
+ * @sync_mode: synchronization mode
   * @owner: fence owner
   * @intr: Whether the wait is interruptible
   *
+ * Extract the fences from the reservation object and waits for them to finish.
+ *
   * Returns:
   * 0 on success, errno otherwise.
   */
-int amdgpu_bo_sync_wait(struct amdgpu_bo *bo, void *owner, bool intr)
+int amdgpu_bo_sync_wait_resv(struct amdgpu_device *adev, struct dma_resv *resv,
+enum amdgpu_sync_mode sync_mode, void *owner,
+bool intr)
  {
-   struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
struct amdgpu_sync sync;
int r;
  
	amdgpu_sync_create(&sync);

-   amdgpu_sync_resv(adev, &sync, bo->tbo.base.resv,
-AMDGPU_SYNC_NE_OWNER, owner);
+   amdgpu_sync_resv(adev, &sync, resv, sync_mode, owner);
	r = amdgpu_sync_wait(&sync, intr);
	amdgpu_sync_free(&sync);
-
return r;
  }
  
+/**

+ * amdgpu_bo_sync_wait - Wrapper for amdgpu_bo_sync_wait_resv
+ * @bo: buffer object to wait for
+ * @owner: fence owner
+ * @intr: Whether the wait is interruptible
+ *
+ * Wrapper to wait for fences in a BO.
+ * Returns:
+ * 0 on success, errno otherwise.
+ */
+int amdgpu_bo_sync_wait(struct amdgpu_bo *bo, void *owner, bool intr)
+{
+   struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+
+   return amdgpu_bo_sync_wait_resv(adev, bo->tbo.base.resv,
+   AMDGPU_SYNC_NE_OWNER, owner, intr);
+}
+
  /**
   * amdgpu_bo_gpu_offset - return GPU offset of bo
   * @bo:   amdgpu object for which we query the offset
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index 2eeafc77c9c1..96d805889e8d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -283,6 +283,9 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo);
  int amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo);
  void amdgpu_bo_fence(struct amdgpu_bo *bo, struct dma_fence *fence,
 bool shared);
+int amdgpu_bo_sync_wait_resv(struct amdgpu_device *adev, struct dma_resv *resv,
+enum amdgpu_sync_mode sync_mode, void *owner,
+bool intr);
  int amdgpu_bo_sync_wait(struct amdgpu_bo *bo, void *owner, bool intr);
  u64 amdgpu_bo_gpu_offset(struct amdgpu_bo *bo);
  int amdgpu_bo_validate(struct amdgpu_bo *bo);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
index 9f42032676da..b86392253696 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
@@ -249,13 +249,6 @@ int amdgpu_sync_resv(struct amdgpu_device *adev, struct 
amdgpu_sync *sync,
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
continue;
  
-		/* VM updates only sync with moves but not with user

-* command submissions or KFD evictions fences
-*/
-   if (fence_owner != AMDGPU_FENCE_OWNER_UNDEFINED &&
-   owner == AMDGPU_FENCE_OWNER_VM)
-   continue;
-
/* Ignore fences depending on the sync mode */
switch (mode) {
case AMDGPU_SYNC_ALWAYS:
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 6038b3c89633..aaae2b5e6eee 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -797,7 +797,7 @@ 

Re: [PATCH 4/5] drm/amdgpu: simplify and fix amdgpu_sync_resv

2020-01-30 Thread Felix Kuehling

On 2020-01-30 7:49, Christian König wrote:

No matter what we always need to sync to moves.

Signed-off-by: Christian König 
Tested-by: Tom St Denis 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 15 +++
  1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
index c124f64e7aae..9f42032676da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
@@ -232,10 +232,19 @@ int amdgpu_sync_resv(struct amdgpu_device *adev, struct 
amdgpu_sync *sync,
  
  		f = rcu_dereference_protected(flist->shared[i],

  dma_resv_held(resv));
+
+   fence_owner = amdgpu_sync_get_owner(f);
+
+   /* Always sync to moves, no matter what */
+   if (fence_owner == AMDGPU_FENCE_OWNER_UNDEFINED) {
+   r = amdgpu_sync_fence(sync, f, false);
+   if (r)
+   break;
+   }
+
/* We only want to trigger KFD eviction fences on
 * evict or move jobs. Skip KFD fences otherwise.
 */
-   fence_owner = amdgpu_sync_get_owner(f);
if (fence_owner == AMDGPU_FENCE_OWNER_KFD &&
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
continue;
@@ -265,9 +274,7 @@ int amdgpu_sync_resv(struct amdgpu_device *adev, struct 
amdgpu_sync *sync,
break;
  
  		case AMDGPU_SYNC_EXPLICIT:

-   if (owner != AMDGPU_FENCE_OWNER_UNDEFINED)
-   continue;
-   break;
+   continue;
}
  
  		r = amdgpu_sync_fence(sync, f, false);

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amdkfd: Fix a bug in SDMA RLC queue counting under HWS mode

2020-01-30 Thread Yong Zhao
The sdma_queue_count increment should be done before
execute_queues_cpsch(), which calls pm_calc_rlib_size() where
sdma_queue_count is used to calculate whether over_subscription is
triggered.

With the previous code, when a SDMA queue is created,
compute_queue_count in pm_calc_rlib_size() is one more than the
actual compute queue number, because the queue_count has been
incremented while sdma_queue_count has not. This patch fixes that.

Change-Id: I20353e657efd505353d0dd9f7eb2fab5085e7202
Signed-off-by: Yong Zhao 
---
 drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 2870553a2ce0..80d22bf702e8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -1237,16 +1237,18 @@ static int create_queue_cpsch(struct 
device_queue_manager *dqm, struct queue *q,
 
	list_add(&q->list, &qpd->queues_list);
qpd->queue_count++;
+
+   if (q->properties.type == KFD_QUEUE_TYPE_SDMA)
+   dqm->sdma_queue_count++;
+   else if (q->properties.type == KFD_QUEUE_TYPE_SDMA_XGMI)
+   dqm->xgmi_sdma_queue_count++;
+
if (q->properties.is_active) {
dqm->queue_count++;
retval = execute_queues_cpsch(dqm,
KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0);
}
 
-   if (q->properties.type == KFD_QUEUE_TYPE_SDMA)
-   dqm->sdma_queue_count++;
-   else if (q->properties.type == KFD_QUEUE_TYPE_SDMA_XGMI)
-   dqm->xgmi_sdma_queue_count++;
/*
 * Unconditionally increment this counter, regardless of the queue's
 * type or whether the queue is active.
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 2/5] drm/amdgpu: return EINVAL instead of ENOENT in the VM code

2020-01-30 Thread Felix Kuehling

On 2020-01-30 7:49, Christian König wrote:

That we can't find a PD above the root is expected; this can only happen if
we try to update a larger range than is actually managed by the VM.

Signed-off-by: Christian König 
Tested-by: Tom St Denis 

Reviewed-by: Felix Kuehling 

---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 4ba6a5e5d094..9705c961405b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1475,7 +1475,7 @@ static int amdgpu_vm_update_ptes(struct 
amdgpu_vm_update_params *params,
 * shift we should go up one level and check it again.
 */
		if (!amdgpu_vm_pt_ancestor(&cursor))
-   return -ENOENT;
+   return -EINVAL;
continue;
}
  

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 3/5] drm/amdgpu: allow higher level PD invalidations

2020-01-30 Thread Felix Kuehling

On 2020-01-30 7:49, Christian König wrote:

Allow partial invalidation on unallocated PDs. This is useful when we
need to silence faults to stop interrupt floods on Vega.

Signed-off-by: Christian König 
Tested-by: Tom St Denis 


I already reviewed this a week ago. With two style nit-picks fixed, this 
patch is


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 23 ++-
  1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 9705c961405b..6038b3c89633 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1467,9 +1467,8 @@ static int amdgpu_vm_update_ptes(struct 
amdgpu_vm_update_params *params,
 * smaller than the address shift. Go to the next
 * child entry and try again.
 */
-   if (!amdgpu_vm_pt_descendant(adev, &cursor))
-   return -ENOENT;
-   continue;
+   if (amdgpu_vm_pt_descendant(adev, &cursor))
+   continue;
} else if (frag >= parent_shift) {
/* If the fragment size is even larger than the parent
 * shift we should go up one level and check it again.
@@ -1480,8 +1479,19 @@ static int amdgpu_vm_update_ptes(struct 
amdgpu_vm_update_params *params,
}
  
  		pt = cursor.entry->base.bo;

-   if (!pt)
-   return -ENOENT;
+   if (!pt) {
+   /* We need all PDs and PTs for mapping something, */
+   if (flags & AMDGPU_PTE_VALID)
+   return -ENOENT;
+
+   /* but unmapping something can happen at a higher
+* level. */


checkpatch.pl complains about multi-line comments with the */ not on its 
own line.




+   if (!amdgpu_vm_pt_ancestor(&cursor))
+   return -EINVAL;
+
+   pt = cursor.entry->base.bo;
+   shift = parent_shift;
+   }
  
  		/* Looks good so far, calculate parameters for the update */

incr = (uint64_t)AMDGPU_GPU_PAGE_SIZE << shift;
@@ -1495,6 +1505,9 @@ static int amdgpu_vm_update_ptes(struct 
amdgpu_vm_update_params *params,
uint64_t upd_end = min(entry_end, frag_end);
unsigned nptes = (upd_end - frag_start) >> shift;
  
+			/* This can happen when we set higher level PDs to

+* silent to stop fault floods. */


Same as above.



+   nptes = max(nptes, 1u);
amdgpu_vm_update_flags(params, pt, cursor.level,
   pe_start, dst, nptes, incr,
   flags | AMDGPU_PTE_FRAG(frag));

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco

2020-01-30 Thread Alex Deucher
On Thu, Jan 30, 2020 at 4:55 PM Felix Kuehling  wrote:
>
> On 2020-01-30 14:01, Bhardwaj, Rajneesh wrote:
> > Hello Felix,
> >
> > Thanks for your time to review and for your feedback.
> >
> > On 1/29/2020 5:52 PM, Felix Kuehling wrote:
> >> HI Rajneesh,
> >>
> >> See comments inline ...
> >>
> >> And a general question: Why do you need to set the autosuspend_delay
> >> in so many places? Amdgpu only has a single call to this function
> >> during initialization.
> >
> >
> > We don't. I have called this out in the cover letter since I too feel it's
> > not the best way to do what we want to do. I have seen weird issues on
> > dual GPUs with KFDTest that sometimes result in system hangs and GPU
> > hangs etc.
> >
> > Even if I try running kfdtest on just one node in a multi-GPU
> > system, the BACO exit seems to wake up all GPUs, and then one goes to
> > auto-suspend again while the other performs the desired kfdtest operation.
> > I have never seen any issue with a single GPU card. So in the current
> > approach, I am making sure that the GPUs are not quickly auto-suspended
> > while the KFD operations are ongoing, but once kfdtest is finished,
> > the runtime suspend call will ensure the timeout is reset back to 5
> > seconds.
>
> So by setting the timeout to 60s, you can fix applications that run for
> less than 60s without calling into KFD. What about applications that run
> for longer without calling into KFD? This doesn't sound like a solution,
> it's just a hack.
>
>
> >
> >
> > I would like to avoid this and would appreciate some pointers on where I can
> > put get_sync calls while unbinding. I have tried a number of places
> > but saw some issues. So any help will be highly appreciated, to
> > identify the proper logical inverse of kfd_bind_process_to_device.
>
> kfd_bind_process_to_device is called when KFD starts using a device. It
> only stops using that device when the process is destroyed. Therefore I
> recommended moving the cleanup code into kfd_process_destroy_pdds, which
> iterates over all devices (PDDs) during process termination.
>
> Note that kfd_bind_process_to_device can be called many times for the
> same device. If you need to keep usage counters balanced, you need extra
> checks to make sure you only do runtime-PM stuff the first time it's
> called for a particular device in a particular process.
>
> Also, in kfd_process_destroy_pdds you may need to check whether
> kfd_bind_process_to_device has been called for a particular PDD before
> releasing runtime PM.

runtimepm already has reference counting.  You can use something like
pm_runtime_get_sync() or pm_runtime_get_noresume() depending on
whether you want to make sure the GPU is powered up or not.  That will
increase the usage count on the device for runtime pm.  Then when you
want to decrement and possibly suspend the device, call
pm_runtime_mark_last_busy() and pm_runtime_put_autosuspend().  That is
basically what we do in amdgpu to handle device usage.  We call
pm_runtime_get_noresume() in our fence_emit function which is what
gets called when we submit work to the GPU.  That increments the usage
count for runpm.  Then we call pm_runtime_mark_last_busy() and
pm_runtime_put_autosuspend() in our fence_process function.  That gets
called as each job on the GPU is retired.  For kfd, I think you'll
want to call pm_runtime_get_sync() at the start of your ioctl to
wake the device, then pm_runtime_mark_last_busy() and
pm_runtime_put_autosuspend() at the end of your ioctl to match
the start.  Then in your actual ioctl work, you'll want to call
pm_runtime_get_noresume() every time a queue gets created and
pm_runtime_mark_last_busy() and pm_runtime_put_autosuspend() each time
a queue is destroyed.  That should keep the device awake as long as
there are any queues around.  Due to stuff like xgmi, it's probably
best to just do this for all GPUs on the system just in case you have
any p2p stuff going on.
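
A minimal sketch of that pattern (the pm_runtime_* calls are the real
runtime-PM API; the kfd-side function shapes and the dev pointer are
assumptions for illustration):

    #include <linux/pm_runtime.h>

    /* Wrap an ioctl so the GPU is awake for its whole duration. */
    static long kfd_ioctl_sketch(struct device *dev)
    {
            long ret;

            ret = pm_runtime_get_sync(dev);  /* wake the GPU, take a usage count */
            if (ret < 0) {
                    pm_runtime_put_noidle(dev);  /* balance the count on failure */
                    return ret;
            }

            ret = 0;  /* ... the actual ioctl work goes here ... */

            pm_runtime_mark_last_busy(dev);   /* restart the autosuspend timer */
            pm_runtime_put_autosuspend(dev);  /* drop the count; suspend later */
            return ret;
    }

    /* Hold one usage count for the lifetime of each user queue. */
    static void on_queue_create(struct device *dev)
    {
            pm_runtime_get_noresume(dev);  /* count only; device already awake */
    }

    static void on_queue_destroy(struct device *dev)
    {
            pm_runtime_mark_last_busy(dev);
            pm_runtime_put_autosuspend(dev);
    }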

Alex

>
> Regards,
>Felix
>
>
> >
> >
> >>
> >> On 2020-01-27 20:29, Rajneesh Bhardwaj wrote:
> >>> So far the kfd driver implemented same routines for runtime and system
> >>> wide suspend and resume (s2idle or mem). During system wide suspend the
> >>> kfd aquires an atomic lock that prevents any more user processes to
> >>> create queues and interact with kfd driver and amd gpu. This mechanism
> >>> created problem when amdgpu device is runtime suspended with BACO
> >>> enabled. Any application that relies on kfd driver fails to load
> >>> because
> >>> the driver reports a locked kfd device since gpu is runtime suspended.
> >>>
> >>> However, in an ideal case, when gpu is runtime  suspended the kfd
> >>> driver
> >>> should be able to:
> >>>
> >>>   - auto resume amdgpu driver whenever a client requests compute
> >>> service
> >>>   - prevent runtime suspend for amdgpu  while kfd is in use
> >>>
> >>> This change refactors the amdgpu and amdkfd drivers to support BACO and
> >>> runtime runtime power management.
> >>>
> >>> Signed-off-by: 

Re: [PATCH 1/5] drm/amdgpu: fix braces in amdgpu_vm_update_ptes

2020-01-30 Thread Felix Kuehling


On 2020-01-30 7:49, Christian König wrote:

For the root PD, mask can be 0xffffffff as well, which would
overrun to 0 if we don't cast it before we add one.

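To make the overflow concrete (a standalone sketch, not part of the patch):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint32_t mask = 0xffffffff;  /* the root-PD case */
            /* The 32-bit addition wraps to 0 before the widening cast... */
            uint64_t wrong = (uint64_t)(mask + 1);
            /* ...while casting first gives the intended 0x100000000. */
            uint64_t right = (uint64_t)mask + 1;

            printf("wrong=%llu right=%llu\n",
                   (unsigned long long)wrong, (unsigned long long)right);
            return 0;
    }
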
You're fixing parentheses, not braces.

Parentheses: ()
Brackets: []
Braces: {}

With the title fixed, this patch is

Reviewed-by: Felix Kuehling 



Signed-off-by: Christian König 
Tested-by: Tom St Denis 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 5cb182231f5d..4ba6a5e5d094 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1487,7 +1487,7 @@ static int amdgpu_vm_update_ptes(struct 
amdgpu_vm_update_params *params,
incr = (uint64_t)AMDGPU_GPU_PAGE_SIZE << shift;
mask = amdgpu_vm_entries_mask(adev, cursor.level);
pe_start = ((cursor.pfn >> shift) & mask) * 8;
-   entry_end = (uint64_t)(mask + 1) << shift;
+   entry_end = ((uint64_t)mask + 1) << shift;
entry_end += cursor.pfn & ~(entry_end - 1);
entry_end = min(entry_end, end);
  

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco

2020-01-30 Thread Felix Kuehling

On 2020-01-30 14:01, Bhardwaj, Rajneesh wrote:

Hello Felix,

Thanks for your time to review and for your feedback.

On 1/29/2020 5:52 PM, Felix Kuehling wrote:

Hi Rajneesh,

See comments inline ...

And a general question: Why do you need to set the autosuspend_delay 
in so many places? Amdgpu only has a single call to this function 
during initialization.



We don't. I have called this out in the cover letter since I too feel it's 
not the best way to do what we want to do. I have seen weird issues on 
dual GPUs with KFDTest that sometimes result in system hangs and GPU 
hangs etc.


Even if I try running kfdtest on just one node in a multi-GPU 
system, the BACO exit seems to wake up all GPUs, and then one goes to 
auto-suspend again while the other performs the desired kfdtest operation. 
I have never seen any issue with a single GPU card. So in the current 
approach, I am making sure that the GPUs are not quickly auto-suspended 
while the KFD operations are ongoing, but once kfdtest is finished, 
the runtime suspend call will ensure the timeout is reset back to 5 
seconds.


So by setting the timeout to 60s, you can fix applications that run for 
less than 60s without calling into KFD. What about applications that run 
for longer without calling into KFD? This doesn't sound like a solution, 
it's just a hack.






I would like to avoid this and would appreciate some pointers on where I 
can put get_sync calls while unbinding. I have tried a number of places 
but saw some issues. So any help will be highly appreciated, to 
identify the proper logical inverse of kfd_bind_process_to_device.


kfd_bind_process_to_device is called when KFD starts using a device. It 
only stops using that device when the process is destroyed. Therefore I 
recommended moving the cleanup code into kfd_process_destroy_pdds, which 
iterates over all devices (PDDs) during process termination.


Note that kfd_bind_process_to_device can be called many times for the 
same device. If you need to keep usage counters balanced, you need extra 
checks to make sure you only do runtime-PM stuff the first time it's 
called for a particular device in a particular process.


Also, in kfd_process_destroy_pdds you may need to check whether 
kfd_bind_process_to_device has been called for a particular PDD before 
releasing runtime PM.


Regards,
  Felix







On 2020-01-27 20:29, Rajneesh Bhardwaj wrote:

So far the kfd driver implemented same routines for runtime and system
wide suspend and resume (s2idle or mem). During system wide suspend the
kfd aquires an atomic lock that prevents any more user processes to
create queues and interact with kfd driver and amd gpu. This mechanism
created problem when amdgpu device is runtime suspended with BACO
enabled. Any application that relies on kfd driver fails to load 
because

the driver reports a locked kfd device since gpu is runtime suspended.

However, in an ideal case, when gpu is runtime  suspended the kfd 
driver

should be able to:

  - auto resume amdgpu driver whenever a client requests compute 
service

  - prevent runtime suspend for amdgpu  while kfd is in use

This change refactors the amdgpu and amdkfd drivers to support BACO and
runtime runtime power management.

Signed-off-by: Rajneesh Bhardwaj 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 -
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  8 +++---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +--
  drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 31 
+-

  drivers/gpu/drm/amd/amdkfd/kfd_iommu.c |  5 +++-
  drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 19 ++---
  6 files changed, 51 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c

index 8609287620ea..314c4a2a0354 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -178,18 +178,18 @@ void amdgpu_amdkfd_interrupt(struct 
amdgpu_device *adev,

  kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
  }
  -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev)
+void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
  {
  if (adev->kfd.dev)
-    kgd2kfd_suspend(adev->kfd.dev);
+    kgd2kfd_suspend(adev->kfd.dev, run_pm);
  }
  -int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
+int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
  {
  int r = 0;
    if (adev->kfd.dev)
-    r = kgd2kfd_resume(adev->kfd.dev);
+    r = kgd2kfd_resume(adev->kfd.dev, run_pm);
    return r;
  }
@@ -713,11 +713,11 @@ void kgd2kfd_exit(void)
  {
  }
  -void kgd2kfd_suspend(struct kfd_dev *kfd)
+void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
  {
  }
  -int kgd2kfd_resume(struct kfd_dev *kfd)
+int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
  {
  return 0;
  }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h


Re: [PATCH 3/3] drm/amdgpu/smu_v11_0: Correct behavior of restoring default tables (v2)

2020-01-30 Thread Matt Coffin
It's worth noting here that I don't have a Vega20 card to test with, so
it might be prudent to get a Tested-by from someone who has access to one.

It *should* work since it's so similar to the navi10 code, but it is
moderately untested.

On 1/29/20 11:17 AM, Alex Deucher wrote:
> From: Matt Coffin 
> 
> Previously, the sysfs functionality for restoring the default powerplay
> table was sourcing its information from the currently-staged powerplay
> table.
> 
> This patch adds a step to cache the first overdrive table that we see on
> boot, so that it can be used later to "restore" the powerplay table.
> 
> v2: squash my original with Matt's fix
> 
> Bug: https://gitlab.freedesktop.org/drm/amd/issues/1020
> Signed-off-by: Matt Coffin 
> Reviewed-by: Alex Deucher 
> Signed-off-by: Alex Deucher 
> ---
>  .../gpu/drm/amd/powerplay/inc/amdgpu_smu.h|  1 +
>  drivers/gpu/drm/amd/powerplay/navi10_ppt.c|  7 +
>  drivers/gpu/drm/amd/powerplay/smu_v11_0.c |  6 
>  drivers/gpu/drm/amd/powerplay/vega20_ppt.c| 28 ++-
>  4 files changed, 22 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/powerplay/inc/amdgpu_smu.h 
> b/drivers/gpu/drm/amd/powerplay/inc/amdgpu_smu.h
> index b0591a8dda41..1e33c3e9b98d 100644
> --- a/drivers/gpu/drm/amd/powerplay/inc/amdgpu_smu.h
> +++ b/drivers/gpu/drm/amd/powerplay/inc/amdgpu_smu.h
> @@ -273,6 +273,7 @@ struct smu_table_context
>   uint8_t thermal_controller_type;
>  
>   void*overdrive_table;
> + void*boot_overdrive_table;
>  };
>  
>  struct smu_dpm_context {
> diff --git a/drivers/gpu/drm/amd/powerplay/navi10_ppt.c 
> b/drivers/gpu/drm/amd/powerplay/navi10_ppt.c
> index d2d45181ae23..26cfccc57331 100644
> --- a/drivers/gpu/drm/amd/powerplay/navi10_ppt.c
> +++ b/drivers/gpu/drm/amd/powerplay/navi10_ppt.c
> @@ -2063,6 +2063,13 @@ static int navi10_od_edit_dpm_table(struct smu_context 
> *smu, enum PP_OD_DPM_TABL
>   return ret;
>   od_table->UclkFmax = input[1];
>   break;
> + case PP_OD_RESTORE_DEFAULT_TABLE:
> + if (!(table_context->overdrive_table && 
> table_context->boot_overdrive_table)) {
> + pr_err("Overdrive table was not initialized!\n");
> + return -EINVAL;
> + }
> + memcpy(table_context->overdrive_table, 
> table_context->boot_overdrive_table, sizeof(OverDriveTable_t));
> + break;
>   case PP_OD_COMMIT_DPM_TABLE:
>   navi10_dump_od_table(od_table);
>   ret = smu_update_table(smu, SMU_TABLE_OVERDRIVE, 0, (void 
> *)od_table, true);
> diff --git a/drivers/gpu/drm/amd/powerplay/smu_v11_0.c 
> b/drivers/gpu/drm/amd/powerplay/smu_v11_0.c
> index 02f8c9cb89d9..0dc49479a7eb 100644
> --- a/drivers/gpu/drm/amd/powerplay/smu_v11_0.c
> +++ b/drivers/gpu/drm/amd/powerplay/smu_v11_0.c
> @@ -1882,6 +1882,12 @@ int smu_v11_0_set_default_od_settings(struct 
> smu_context *smu, bool initialize,
>   pr_err("Failed to export overdrive table!\n");
>   return ret;
>   }
> + if (!table_context->boot_overdrive_table) {
> + table_context->boot_overdrive_table = 
> kmemdup(table_context->overdrive_table, overdrive_table_size, GFP_KERNEL);
> + if (!table_context->boot_overdrive_table) {
> + return -ENOMEM;
> + }
> + }
>   }
>   ret = smu_update_table(smu, SMU_TABLE_OVERDRIVE, 0, 
> table_context->overdrive_table, true);
>   if (ret) {
> diff --git a/drivers/gpu/drm/amd/powerplay/vega20_ppt.c 
> b/drivers/gpu/drm/amd/powerplay/vega20_ppt.c
> index 38febd5ca4da..4ad8d6c14ee5 100644
> --- a/drivers/gpu/drm/amd/powerplay/vega20_ppt.c
> +++ b/drivers/gpu/drm/amd/powerplay/vega20_ppt.c
> @@ -1706,22 +1706,11 @@ static int vega20_set_default_od_settings(struct 
> smu_context *smu,
>   struct smu_table_context *table_context = &smu->smu_table;
>   int ret;
>  
> - if (initialize) {
> - if (table_context->overdrive_table)
> - return -EINVAL;
> -
> - table_context->overdrive_table = 
> kzalloc(sizeof(OverDriveTable_t), GFP_KERNEL);
> -
> - if (!table_context->overdrive_table)
> - return -ENOMEM;
> -
> - ret = smu_update_table(smu, SMU_TABLE_OVERDRIVE, 0,
> -table_context->overdrive_table, false);
> - if (ret) {
> - pr_err("Failed to export over drive table!\n");
> - return ret;
> - }
> + ret = smu_v11_0_set_default_od_settings(smu, initialize, 
> sizeof(OverDriveTable_t));
> + if (ret)
> + return ret;
>  
> + if (initialize) {
>   ret = vega20_set_default_od8_setttings(smu);
>   if (ret)

Re: [PATCH 6/6] drm/amd/display: REFERENCE for srm interface patches

2020-01-30 Thread Harry Wentland
Thanks for providing more documentation and this reference.

The patch set (1-5) is
Reviewed-by: Harry Wentland 

Harry

On 2020-01-22 4:05 p.m., Bhawanpreet Lakha wrote:
> This is just a reference for the patches. not to be merged
> 
> Signed-off-by: Bhawanpreet Lakha 
> ---
>  REFERENCE | 49 +
>  1 file changed, 49 insertions(+)
>  create mode 100644 REFERENCE
> 
> diff --git a/REFERENCE b/REFERENCE
> new file mode 100644
> index ..2e53f9cc82ff
> --- /dev/null
> +++ b/REFERENCE
> @@ -0,0 +1,49 @@
> +SRM interface reference usermode scripts for Ubuntu.
> +These are just reference scripts to facilitate the SRM interface for amdgpu.
> +
> ++-+
> +| Main script, this is called on boot/shutdown/suspend/resume , it calls the 
> sysfs to get/set the SRM |
> +| FILE: /home/amdgpu_hdcp_srm_script.sh  
>  |
> ++-+
> +#!/bin/bash
> +
> +SRMFILE="/home/SRM"
> +sudo cat "$SRMFILE" > /sys/class/drm/card0/device/hdcp_srm
> +sudo cat /sys/class/drm/card0/device/hdcp_srm > "$SRMFILE"
> +
> +
> +
> +
> +
> ++-+
> +| .service file, This is placed into /etc/systemd/system/ so it runs the 
> main script on boot/shutdown |
> +| FILE: /etc/systemd/system/amdgpu_hdc_srm_boot_shutdown.service 
>  |
> ++-+
> +[Unit]
> +Description=HDCP SRM boot and shutdown save/load
> +
> +[Service]
> +Type=simple
> +ExecStart=/home/amdgpu_hdcp_srm_script.sh
> +ExecStop=/home/amdgpu_hdcp_srm_script.sh
> +
> +[Install]
> +WantedBy=multi-user.target
> +
> +
> +
> ++-+
> +| To run the script on boot/start run
>   |
> ++-+
> +sudo systemctl start amdgpu_hdc_srm_boot_shutdown
> +sudo systemctl enable amdgpu_hdc_srm_boot_shutdown
> +
> +
> +
> ++-+
> +| To symlink the files (adding to these directories will run the script on 
> +suspend/resume)|
> ++-+
> +sudo ln -s $SCRIPTFILE /lib/systemd/system-sleep/amdgpu_hdcp_srm
> +sudo ln -s $SCRIPTFILE /usr/lib/pm-utils/sleep.d/95amdgpu_hdcp_srm
> +
> +
> 
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: Suspecting corrupted VBIOS after update of AMDGPU on AMD7870

2020-01-30 Thread Liu, Zhan
Hi Jacob,

Thank you for your bug report.

I saw you attached xorg.log, which is great. Could you also grab dmesg.log via 
SSH?

Thanks,
Zhan


From: amd-gfx  On Behalf Of Jacob Hrbek
Sent: 2020/January/30, Thursday 12:18 PM
To: amd-gfx@lists.freedesktop.org
Subject: Suspecting corrupted VBIOS after update of AMDGPU on AMD7870

Hello,
I believe that a system update that included amdgpu on Debian testing (though I am 
on LFS) corrupted the VBIOS on my AMD7870 (+- 4 hours after the update, the GPU 
using the AMDGPU/Radeon drivers produced no output).
I'm sending this email to report a possible bug, with my findings at 
https://gist.github.com/Kreyren/3e55e9a754e58956e1690e38b1888de7
and I would appreciate any help in excluding VBIOS corruption from the 
diagnostics.
Thanks,
- Jacob Hrbek
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco

2020-01-30 Thread Bhardwaj, Rajneesh



On 1/28/2020 3:09 PM, Zeng, Oak wrote:

[AMD Official Use Only - Internal Distribution Only]



Regards,
Oak

-Original Message-
From: amd-gfx  On Behalf Of Rajneesh 
Bhardwaj
Sent: Monday, January 27, 2020 8:29 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Kuehling, Felix 
; Bhardwaj, Rajneesh 
Subject: [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco

So far the kfd driver implemented same routines for runtime and system wide 
suspend and resume (s2idle or mem). During system wide suspend the kfd aquires 
an atomic lock that prevents any more user processes to create queues and 
interact with kfd driver and amd gpu. This mechanism created problem when 
amdgpu device is runtime suspended with BACO enabled. Any application that 
relies on kfd driver fails to load because the driver reports a locked kfd 
device since gpu is runtime suspended.

However, in an ideal case, when gpu is runtime  suspended the kfd driver should 
be able to:

  - auto resume amdgpu driver whenever a client requests compute service
  - prevent runtime suspend for amdgpu  while kfd is in use

This change refactors the amdgpu and amdkfd drivers to support BACO and runtime 
runtime power management.
[Oak] two runtime above



Thanks, will fix it.




Signed-off-by: Rajneesh Bhardwaj 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 -  
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  8 +++---  
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +--
  drivers/gpu/drm/amd/amdkfd/kfd_device.c| 31 +-
  drivers/gpu/drm/amd/amdkfd/kfd_iommu.c |  5 +++-
  drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 19 ++---
  6 files changed, 51 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 8609287620ea..314c4a2a0354 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -178,18 +178,18 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);  }
  
-void amdgpu_amdkfd_suspend(struct amdgpu_device *adev)

+void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
[Oak] The name run_pm is a little bit confusing. Maybe system_wide_pm or 
none_runtime_pm?
  {
if (adev->kfd.dev)
-   kgd2kfd_suspend(adev->kfd.dev);
+   kgd2kfd_suspend(adev->kfd.dev, run_pm);
  }
  
-int amdgpu_amdkfd_resume(struct amdgpu_device *adev)

+int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
  {
int r = 0;
  
  	if (adev->kfd.dev)

-   r = kgd2kfd_resume(adev->kfd.dev);
+   r = kgd2kfd_resume(adev->kfd.dev, run_pm);
  
  	return r;

  }
@@ -713,11 +713,11 @@ void kgd2kfd_exit(void)  {  }
  
-void kgd2kfd_suspend(struct kfd_dev *kfd)

+void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
  {
  }
  
-int kgd2kfd_resume(struct kfd_dev *kfd)

+int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
  {
return 0;
  }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 0fa898d30207..3dce4a51e522 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -123,8 +123,8 @@ struct amdkfd_process_info {  int amdgpu_amdkfd_init(void); 
 void amdgpu_amdkfd_fini(void);
  
-void amdgpu_amdkfd_suspend(struct amdgpu_device *adev); -int amdgpu_amdkfd_resume(struct amdgpu_device *adev);

+void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
+int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
  void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
const void *ih_ring_entry);
  void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev); @@ -250,8 +250,8 
@@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
 struct drm_device *ddev,
 const struct kgd2kfd_shared_resources *gpu_resources); 
 void kgd2kfd_device_exit(struct kfd_dev *kfd); -void kgd2kfd_suspend(struct 
kfd_dev *kfd); -int kgd2kfd_resume(struct kfd_dev *kfd);
+void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm); int
+kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
  int kgd2kfd_pre_reset(struct kfd_dev *kfd);  int kgd2kfd_post_reset(struct 
kfd_dev *kfd);  void kgd2kfd_interrupt(struct kfd_dev *kfd, const void 
*ih_ring_entry); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d9e5eac182d3..787d49e9f4de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3314,7 +3314,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
fbcon)
}
}
  
-	amdgpu_amdkfd_suspend(adev);

+   amdgpu_amdkfd_suspend(adev, fbcon);
  
  	amdgpu_ras_suspend(adev);
  
@@ -3398,7 +3398,7 @@ int amdgpu_device_resume(struct drm_device 

Re: [Patch v1 1/5] drm/amdgpu: always enable runtime power management

2020-01-30 Thread Bhardwaj, Rajneesh



On 1/28/2020 3:14 PM, Alex Deucher wrote:

[CAUTION: External Email]

On Mon, Jan 27, 2020 at 8:30 PM Rajneesh Bhardwaj
 wrote:

This allows runtime power management to kick in on the amdgpu driver when
the underlying hardware supports either BOCO or BACO. This can still be
avoided if the boot arg amdgpu.runpm=0 is supplied.

 BOCO: Bus Off, Chip Off
 BACO: Bus Alive, Chip Off

Signed-off-by: Rajneesh Bhardwaj 

This patch should be the last one in the series, otherwise we'll
enable runpm on BACO capable devices before the KFD code is in place.
Also, it's only supported on VI and newer asics, so we should use this
patch instead:
https://patchwork.freedesktop.org/patch/335402/

Alex


Thanks, will fix in v2.



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 9 -
  1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index 3a0ea9096498..7958d508486e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -169,11 +169,10 @@ int amdgpu_driver_load_kms(struct drm_device *dev, 
unsigned long flags)
 goto out;
 }

-   if (amdgpu_device_supports_boco(dev) &&
-   (amdgpu_runtime_pm != 0)) /* enable runpm by default */
-   adev->runpm = true;
-   else if (amdgpu_device_supports_baco(dev) &&
-(amdgpu_runtime_pm > 0)) /* enable runpm if runpm=1 */
+   /* always enable runtime power management except when amdgpu.runpm=0 */
+   if ((amdgpu_device_supports_boco(dev) ||
+   amdgpu_device_supports_baco(dev))
+   && (amdgpu_runtime_pm != 0))
 adev->runpm = true;

 /* Call ACPI methods: require modeset init
--
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [Patch v1 3/5] drm/amdkfd: Introduce debugfs option to disable baco

2020-01-30 Thread Bhardwaj, Rajneesh

Hi Alex


Thanks for your time and feedback!


On 1/28/2020 3:22 PM, Alex Deucher wrote:

[CAUTION: External Email]

On Mon, Jan 27, 2020 at 8:30 PM Rajneesh Bhardwaj
 wrote:

When BACO is enabled by default, it can sometimes cause additional
trouble when debugging KFD issues. This debugfs override allows
temporarily disabling BACO for debug purposes without having to reboot
the machine.

However, in some cases one suspend-resume cycle might be needed if
the device is already runtime suspended.

e.g

sudo rtcwake -m < mem or freeze > -s 15

or

by triggering an autosuspend event from user space, by doing something
like:

echo 6000 > /sys/bus/pci/devices/0000\:03\:00.0/power/autosuspend_delay_ms

 Usage:

echo 0 > /sys/kernel/debug/kfd/enable_baco and run
cat /sys/kernel/debug/kfd/baco_status to verify whether BACO is
enabled or disabled by kfd driver.

It should be noted that while enabling baco again via kfd override, we
must do the following steps:

1. echo 0 > /sys/kernel/debug/kfd/enable_baco
2. sudo rtcwake -m < mem > -s 15

In this case, we need the GPU to be fully reset, which is done by the
BIOS. This is not possible in the case of S2idle (i.e. freeze), so we
must use mem for sleep.


I think we can drop this patch in favor of just using the standard
runtime pm control.  E.g.,
/sys/class/drm/card0/device/power/control



Sure, I was using the /sys/bus/pci way to do it and found it was not
easy. Since this sysfs interface exists, I will drop the patch.
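
For the record, the standard knobs work like this (assuming card0 is the dGPU):

echo on > /sys/class/drm/card0/device/power/control    (pin the device awake, runtime PM off)
echo auto > /sys/class/drm/card0/device/power/control  (re-enable runtime PM)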



Regards

Rajneesh



___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [Patch v1 4/5] drm/amdkfd: show warning when kfd is locked

2020-01-30 Thread Bhardwaj, Rajneesh


On 1/28/2020 5:42 PM, Felix Kuehling wrote:

On 2020-01-27 20:29, Rajneesh Bhardwaj wrote:

During system suspend the kfd driver acquires a lock that prohibits
further kfd actions until the gpu is resumed. This adds some info which
can be useful while debugging.

Signed-off-by: Rajneesh Bhardwaj 
---
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c

index 1aebda4bbbe7..081cc5f40d18 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -127,6 +127,8 @@ static int kfd_open(struct inode *inode, struct 
file *filep)

  return PTR_ERR(process);
    if (kfd_is_locked()) {
+    dev_warn(kfd_device, "kfd is locked!\n"
+    "process %d unreferenced", process->pasid);


If this is for debugging, make it dev_dbg. Printing warnings like this 
usually confuses people reporting completely unrelated problems that 
aren't even kernel problems at all. "Look, I found a warning in the 
kernel log. It must be a kernel problem."


Agree. Will fix in v2.




Regards,
  Felix



  kfd_unref_process(process);
  return -EAGAIN;
  }

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco

2020-01-30 Thread Bhardwaj, Rajneesh

Hello Felix,

Thanks for your time to review and for your feedback.

On 1/29/2020 5:52 PM, Felix Kuehling wrote:

HI Rajneesh,

See comments inline ...

And a general question: Why do you need to set the autosuspend_delay 
in so many places? Amdgpu only has a single call to this function 
during initialization.



We don't. I called this out in the cover letter since I too feel it's
not the best way to do what we want. I have seen weird issues on
dual-GPU systems with KFDTest that sometimes result in system hangs, gpu
hangs, etc.


Even if I run kfdtest on just one node in a multi-gpu system, the BACO
exit seems to wake up all gpus, and then one goes back to auto suspend
while the other performs the desired kfdtest operation. I have never
seen any issue with a single gpu card. So in the current approach I make
sure that the GPUs are not quickly auto-suspended while the kfd
operations are ongoing, and once the kfdtest is finished, the runtime
suspend call will ensure the timeout is reset back to 5 seconds.
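
For reference, the usual way to avoid retuning autosuspend_delay at all is to
hold a runtime PM usage reference for the whole KFD operation. A minimal
sketch (kfd_hold_gpu/kfd_release_gpu are hypothetical helper names; the
pm_runtime_* calls are the stock linux/pm_runtime.h API):

#include <linux/pm_runtime.h>

/* Take a usage reference: resumes the GPU (BACO exit) if needed and
 * keeps runtime suspend from kicking in while KFD work is in flight. */
static int kfd_hold_gpu(struct device *dev)
{
	int ret = pm_runtime_get_sync(dev);

	if (ret < 0) {
		/* get_sync holds the reference even on failure, drop it */
		pm_runtime_put_noidle(dev);
		return ret;
	}
	return 0;
}

/* Drop the reference: the device re-arms its normal autosuspend delay,
 * so no manual timeout fiddling is needed on the way out. */
static void kfd_release_gpu(struct device *dev)
{
	pm_runtime_mark_last_busy(dev);
	pm_runtime_put_autosuspend(dev);
}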



I would like to avoid this and would appreciate some pointers on where I
can put the get_sync calls while unbinding. I have tried a number of
places but saw some issues, so any help identifying the proper logical
inverse of kfd_bind_process_to_device would be highly appreciated.





On 2020-01-27 20:29, Rajneesh Bhardwaj wrote:

So far the kfd driver implemented the same routines for runtime and
system-wide suspend and resume (s2idle or mem). During system-wide suspend
the kfd acquires an atomic lock that prevents any more user processes from
creating queues and interacting with the kfd driver and the amd gpu. This
mechanism created a problem when the amdgpu device is runtime suspended
with BACO enabled. Any application that relies on the kfd driver fails to
load because the driver reports a locked kfd device while the gpu is
runtime suspended.

However, in an ideal case, when the gpu is runtime suspended the kfd driver
should be able to:

  - auto resume the amdgpu driver whenever a client requests compute service
  - prevent runtime suspend for amdgpu while kfd is in use

This change refactors the amdgpu and amdkfd drivers to support BACO and
runtime runtime power management.

Signed-off-by: Rajneesh Bhardwaj 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 -
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  8 +++---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +--
  drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 31 +-
  drivers/gpu/drm/amd/amdkfd/kfd_iommu.c     |  5 +++-
  drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 19 ++---
  6 files changed, 51 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 8609287620ea..314c4a2a0354 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -178,18 +178,18 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
 	kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
 }

-void amdgpu_amdkfd_suspend(struct amdgpu_device *adev)
+void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
 {
 	if (adev->kfd.dev)
-		kgd2kfd_suspend(adev->kfd.dev);
+		kgd2kfd_suspend(adev->kfd.dev, run_pm);
 }

-int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
+int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
 {
 	int r = 0;

 	if (adev->kfd.dev)
-		r = kgd2kfd_resume(adev->kfd.dev);
+		r = kgd2kfd_resume(adev->kfd.dev, run_pm);

 	return r;
 }
@@ -713,11 +713,11 @@ void kgd2kfd_exit(void)
 {
 }

-void kgd2kfd_suspend(struct kfd_dev *kfd)
+void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
 {
 }

-int kgd2kfd_resume(struct kfd_dev *kfd)
+int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
 {
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 0fa898d30207..3dce4a51e522 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -123,8 +123,8 @@ struct amdkfd_process_info {
 int amdgpu_amdkfd_init(void);
 void amdgpu_amdkfd_fini(void);

-void amdgpu_amdkfd_suspend(struct amdgpu_device *adev);
-int amdgpu_amdkfd_resume(struct amdgpu_device *adev);
+void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
+int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
 void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
 			const void *ih_ring_entry);
 void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
@@ -250,8 +250,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
 			 struct drm_device *ddev,
 			 const struct kgd2kfd_shared_resources *gpu_resources);
 void kgd2kfd_device_exit(struct kfd_dev *kfd);
-void kgd2kfd_suspend(struct kfd_dev *kfd);
-int kgd2kfd_resume(struct kfd_dev *kfd);
+void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
+int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);

Re: [PATCH] drm/amd/display: Only enable cursor on pipes that need it

2020-01-30 Thread Harry Wentland
On 2020-01-30 1:29 p.m., Nicholas Kazlauskas wrote:
> [Why]
> In current code we're essentially drawing the cursor on every pipe
> that contains it. This only works when the planes have the same
> scaling for src to dest rect, otherwise we'll get "double cursor" where
> one cursor is incorrectly filtered and offset from the real position.
> 
> [How]
> Without dedicated cursor planes on DCN we require at least one pipe
> that matches the scaling of the current timing.
> 
> This is an optimization and workaround for the most common case where
> the top-most plane is not scaled but the bottom-most plane is scaled.
> 
> Whenever a pipe has a parent pipe in the blending tree whose recout
> fully contains the current pipe we can disable the pipe.
> 
> This only applies when the pipe is actually visible of course.
> 
> Signed-off-by: Nicholas Kazlauskas 

Reviewed-by: Harry Wentland 

Harry

> ---
>  .../amd/display/dc/dcn10/dcn10_hw_sequencer.c | 30 +++
>  1 file changed, 30 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c 
> b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c
> index f2127afb37b2..1008ac8a0f2a 100644
> --- a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c
> +++ b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c
> @@ -2911,6 +2911,33 @@ void dcn10_update_dchub(struct dce_hwseq *hws, struct 
> dchub_init_data *dh_data)
>   hubbub->funcs->update_dchub(hubbub, dh_data);
>  }
>  
> +static bool dcn10_can_pipe_disable_cursor(struct pipe_ctx *pipe_ctx)
> +{
> + struct pipe_ctx *test_pipe;
> +	const struct rect *r1 = &pipe_ctx->plane_res.scl_data.recout, *r2;
> + int r1_r = r1->x + r1->width, r1_b = r1->y + r1->height, r2_r, r2_b;
> +
> + /**
> +  * Disable the cursor if there's another pipe above this with a
> +  * plane that contains this pipe's viewport to prevent double cursor
> +  * and incorrect scaling artifacts.
> +  */
> + for (test_pipe = pipe_ctx->top_pipe; test_pipe;
> +  test_pipe = test_pipe->top_pipe) {
> + if (!test_pipe->plane_state->visible)
> + continue;
> +
> +		r2 = &test_pipe->plane_res.scl_data.recout;
> + r2_r = r2->x + r2->width;
> + r2_b = r2->y + r2->height;
> +
> +		if (r1->x >= r2->x && r1->y >= r2->y && r1_r <= r2_r && r1_b <= r2_b)
> + return true;
> + }
> +
> + return false;
> +}
> +
>  void dcn10_set_cursor_position(struct pipe_ctx *pipe_ctx)
>  {
>   struct dc_cursor_position pos_cpy = pipe_ctx->stream->cursor_position;
> @@ -2956,6 +2983,9 @@ void dcn10_set_cursor_position(struct pipe_ctx 
> *pipe_ctx)
>   == PLN_ADDR_TYPE_VIDEO_PROGRESSIVE)
>   pos_cpy.enable = false;
>  
> + if (pos_cpy.enable && dcn10_can_pipe_disable_cursor(pipe_ctx))
> + pos_cpy.enable = false;
> +
>   // Swap axis and mirror horizontally
>   if (param.rotation == ROTATION_ANGLE_90) {
>   uint32_t temp_x = pos_cpy.x;
> 
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: Fix implicit enum conversion in gfx_v9_4_ras_error_inject

2020-01-30 Thread Alex Deucher
On Thu, Jan 30, 2020 at 3:33 AM Nathan Chancellor
 wrote:
>
> Clang warns:
>
> ../drivers/gpu/drm/amd/amdgpu/gfx_v9_4.c:967:35: warning: implicit
> conversion from enumeration type 'enum amdgpu_ras_block' to different
> enumeration type 'enum ta_ras_block' [-Wenum-conversion]
> block_info.block_id = info->head.block;
> ~ ~~~^
> 1 warning generated.
>
> Use the function added in commit 828cfa29093f ("drm/amdgpu: Fix amdgpu
> ras to ta enums conversion") that handles this conversion explicitly.
>
> Fixes: 4c461d89db4f ("drm/amdgpu: add RAS support for the gfx block of 
> Arcturus")
> Link: https://github.com/ClangBuiltLinux/linux/issues/849
> Signed-off-by: Nathan Chancellor 

Applied.  thanks!

Alex

> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4.c
> index e19d275f3f7d..f099f13d7f1e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4.c
> @@ -964,7 +964,7 @@ int gfx_v9_4_ras_error_inject(struct amdgpu_device *adev, 
> void *inject_if)
> if (!amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__GFX))
> return -EINVAL;
>
> -   block_info.block_id = info->head.block;
> +   block_info.block_id = amdgpu_ras_block_to_ta(info->head.block);
> block_info.sub_block_index = info->head.sub_block_index;
> block_info.inject_error_type = 
> amdgpu_ras_error_to_ta(info->head.type);
> block_info.address = info->address;
> --
> 2.25.0
>
> ___
> dri-devel mailing list
> dri-de...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amd/display: Only enable cursor on pipes that need it

2020-01-30 Thread Nicholas Kazlauskas
[Why]
In current code we're essentially drawing the cursor on every pipe
that contains it. This only works when the planes have the same
scaling for src to dest rect, otherwise we'll get "double cursor" where
one cursor is incorrectly filtered and offset from the real position.

[How]
Without dedicated cursor planes on DCN we require at least one pipe
that matches the scaling of the current timing.

This is an optimization and workaround for the most common case where
the top-most plane is not scaled but the bottom-most plane is scaled.

Whenever a pipe has a parent pipe in the blending tree whose recout
fully contains the current pipe we can disable the pipe.

This only applies when the pipe is actually visible of course.
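
To make the containment rule concrete, here is a self-contained sketch of the
check with made-up rectangles (plain C, not the DC code itself):

#include <stdbool.h>
#include <stdio.h>

struct rect { int x, y, width, height; };

/* True when r1 lies entirely within r2 - the condition under which the
 * pipe owning r1 can leave cursor drawing to the pipe above it. */
static bool recout_contained(const struct rect *r1, const struct rect *r2)
{
	return r1->x >= r2->x && r1->y >= r2->y &&
	       r1->x + r1->width  <= r2->x + r2->width &&
	       r1->y + r1->height <= r2->y + r2->height;
}

int main(void)
{
	struct rect bottom = { 0, 0, 1920, 1080 };	/* scaled plane's recout */
	struct rect top    = { 0, 0, 1920, 1080 };	/* unscaled plane above it */

	/* prints 1: the bottom pipe's cursor can be disabled */
	printf("%d\n", recout_contained(&bottom, &top));
	return 0;
}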

Signed-off-by: Nicholas Kazlauskas 
---
 .../amd/display/dc/dcn10/dcn10_hw_sequencer.c | 30 +++
 1 file changed, 30 insertions(+)

diff --git a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c 
b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c
index f2127afb37b2..1008ac8a0f2a 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c
@@ -2911,6 +2911,33 @@ void dcn10_update_dchub(struct dce_hwseq *hws, struct 
dchub_init_data *dh_data)
hubbub->funcs->update_dchub(hubbub, dh_data);
 }
 
+static bool dcn10_can_pipe_disable_cursor(struct pipe_ctx *pipe_ctx)
+{
+   struct pipe_ctx *test_pipe;
+   const struct rect *r1 = &pipe_ctx->plane_res.scl_data.recout, *r2;
+   int r1_r = r1->x + r1->width, r1_b = r1->y + r1->height, r2_r, r2_b;
+
+   /**
+* Disable the cursor if there's another pipe above this with a
+* plane that contains this pipe's viewport to prevent double cursor
+* and incorrect scaling artifacts.
+*/
+   for (test_pipe = pipe_ctx->top_pipe; test_pipe;
+test_pipe = test_pipe->top_pipe) {
+   if (!test_pipe->plane_state->visible)
+   continue;
+
+   r2 = &test_pipe->plane_res.scl_data.recout;
+   r2_r = r2->x + r2->width;
+   r2_b = r2->y + r2->height;
+
+   if (r1->x >= r2->x && r1->y >= r2->y && r1_r <= r2_r && r1_b <= r2_b)
+   return true;
+   }
+
+   return false;
+}
+
 void dcn10_set_cursor_position(struct pipe_ctx *pipe_ctx)
 {
struct dc_cursor_position pos_cpy = pipe_ctx->stream->cursor_position;
@@ -2956,6 +2983,9 @@ void dcn10_set_cursor_position(struct pipe_ctx *pipe_ctx)
== PLN_ADDR_TYPE_VIDEO_PROGRESSIVE)
pos_cpy.enable = false;
 
+   if (pos_cpy.enable && dcn10_can_pipe_disable_cursor(pipe_ctx))
+   pos_cpy.enable = false;
+
// Swap axis and mirror horizontally
if (param.rotation == ROTATION_ANGLE_90) {
uint32_t temp_x = pos_cpy.x;
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amdgpu/vcn: leave all dpg mode unpause in idle work

2020-01-30 Thread James Zhu
Leave all dpg-mode unpausing to the idle work handler to fix the VCN2.*
timeout issue during transcoding.

Signed-off-by: James Zhu 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 13 ++---
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
index f96464e..ed4b753 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
@@ -329,20 +329,11 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
 
if (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG){
struct dpg_pause_state new_state;
-   unsigned int fences = 0;
-   unsigned int i;
-
-   for (i = 0; i < adev->vcn.num_enc_rings; ++i) {
-   fences += amdgpu_fence_count_emitted(&adev->vcn.inst[ring->me].ring_enc[i]);
-   }
-   if (fences)
-   new_state.fw_based = VCN_DPG_STATE__PAUSE;
-   else
-   new_state.fw_based = VCN_DPG_STATE__UNPAUSE;
 
if (ring->funcs->type == AMDGPU_RING_TYPE_VCN_ENC)
new_state.fw_based = VCN_DPG_STATE__PAUSE;
-
+   else
+   new_state.fw_based = adev->vcn.pause_state.fw_based;
	adev->vcn.pause_dpg_mode(adev, ring->me, &new_state);
}
 }
-- 
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Suspecting corrupted VBIOS after update of AMDGPU on AMD7870

2020-01-30 Thread Jacob Hrbek
Hello,

I believe that a system update that included amdgpu on debian testing (but I
am on LFS) corrupted my VBIOS on an AMD7870 (+- 4 hours after the update,
the GPU using the AMDGPU/Radeon drivers resulted in no output).

I'm sending this email to report a possible bug, with my findings at
https://gist.github.com/Kreyren/3e55e9a754e58956e1690e38b1888de7, and I
would appreciate any help in excluding VBIOS corruption from the
diagnostics.

Thanks,
- Jacob Hrbek
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 3/5] drm/amdgpu: allow higher level PD invalidations

2020-01-30 Thread Christian König
Allow partial invalidation on unallocated PDs. This is useful when we
need to silence faults to stop interrupt floods on Vega.

Signed-off-by: Christian König 
Tested-by: Tom St Denis 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 23 ++-
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 9705c961405b..6038b3c89633 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1467,9 +1467,8 @@ static int amdgpu_vm_update_ptes(struct 
amdgpu_vm_update_params *params,
 * smaller than the address shift. Go to the next
 * child entry and try again.
 */
-   if (!amdgpu_vm_pt_descendant(adev, &cursor))
-   return -ENOENT;
-   continue;
+   if (amdgpu_vm_pt_descendant(adev, &cursor))
+   continue;
} else if (frag >= parent_shift) {
/* If the fragment size is even larger than the parent
 * shift we should go up one level and check it again.
@@ -1480,8 +1479,19 @@ static int amdgpu_vm_update_ptes(struct 
amdgpu_vm_update_params *params,
}
 
pt = cursor.entry->base.bo;
-   if (!pt)
-   return -ENOENT;
+   if (!pt) {
+   /* We need all PDs and PTs for mapping something, */
+   if (flags & AMDGPU_PTE_VALID)
+   return -ENOENT;
+
+   /* but unmapping something can happen at a higher
+* level. */
+   if (!amdgpu_vm_pt_ancestor(&cursor))
+   return -EINVAL;
+
+   pt = cursor.entry->base.bo;
+   shift = parent_shift;
+   }
 
/* Looks good so far, calculate parameters for the update */
incr = (uint64_t)AMDGPU_GPU_PAGE_SIZE << shift;
@@ -1495,6 +1505,9 @@ static int amdgpu_vm_update_ptes(struct 
amdgpu_vm_update_params *params,
uint64_t upd_end = min(entry_end, frag_end);
unsigned nptes = (upd_end - frag_start) >> shift;
 
+   /* This can happen when we set higher level PDs to
+* silent to stop fault floods. */
+   nptes = max(nptes, 1u);
amdgpu_vm_update_flags(params, pt, cursor.level,
   pe_start, dst, nptes, incr,
   flags | AMDGPU_PTE_FRAG(frag));
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 2/5] drm/amdgpu: return EINVAL instead of ENOENT in the VM code

2020-01-30 Thread Christian König
That we can't find a PD above the root is expected and can only happen if
we try to update a larger range than is actually managed by the VM.

Signed-off-by: Christian König 
Tested-by: Tom St Denis 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 4ba6a5e5d094..9705c961405b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1475,7 +1475,7 @@ static int amdgpu_vm_update_ptes(struct 
amdgpu_vm_update_params *params,
 * shift we should go up one level and check it again.
 */
	if (!amdgpu_vm_pt_ancestor(&cursor))
-   return -ENOENT;
+   return -EINVAL;
continue;
}
 
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 5/5] drm/amdgpu: rework synchronization of VM updates v4

2020-01-30 Thread Christian König
If provided we only sync to the BO's reservation
object and no longer to the root PD.

v2: update comment, cleanup amdgpu_bo_sync_wait_resv
v3: use correct reservation object while clearing
v4: fix typo in amdgpu_bo_sync_wait_resv

Signed-off-by: Christian König 
Tested-by: Tom St Denis 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c  | 35 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h  |  3 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c|  7 
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  | 41 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h  |  4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c  | 22 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 15 ++--
 7 files changed, 66 insertions(+), 61 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 46c76e2e1281..6f60a581e3ba 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1403,30 +1403,51 @@ void amdgpu_bo_fence(struct amdgpu_bo *bo, struct 
dma_fence *fence,
 }
 
 /**
- * amdgpu_sync_wait_resv - Wait for BO reservation fences
+ * amdgpu_bo_sync_wait_resv - Wait for BO reservation fences
  *
- * @bo: buffer object
+ * @adev: amdgpu device pointer
+ * @resv: reservation object to sync to
+ * @sync_mode: synchronization mode
  * @owner: fence owner
  * @intr: Whether the wait is interruptible
  *
+ * Extract the fences from the reservation object and waits for them to finish.
+ *
  * Returns:
  * 0 on success, errno otherwise.
  */
-int amdgpu_bo_sync_wait(struct amdgpu_bo *bo, void *owner, bool intr)
+int amdgpu_bo_sync_wait_resv(struct amdgpu_device *adev, struct dma_resv *resv,
+enum amdgpu_sync_mode sync_mode, void *owner,
+bool intr)
 {
-   struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
struct amdgpu_sync sync;
int r;
 
	amdgpu_sync_create(&sync);
-   amdgpu_sync_resv(adev, &sync, bo->tbo.base.resv,
-		     AMDGPU_SYNC_NE_OWNER, owner);
+   amdgpu_sync_resv(adev, &sync, resv, sync_mode, owner);
	r = amdgpu_sync_wait(&sync, intr);
	amdgpu_sync_free(&sync);
-
return r;
 }
 
+/**
+ * amdgpu_bo_sync_wait - Wrapper for amdgpu_bo_sync_wait_resv
+ * @bo: buffer object to wait for
+ * @owner: fence owner
+ * @intr: Whether the wait is interruptible
+ *
+ * Wrapper to wait for fences in a BO.
+ * Returns:
+ * 0 on success, errno otherwise.
+ */
+int amdgpu_bo_sync_wait(struct amdgpu_bo *bo, void *owner, bool intr)
+{
+   struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+
+   return amdgpu_bo_sync_wait_resv(adev, bo->tbo.base.resv,
+   AMDGPU_SYNC_NE_OWNER, owner, intr);
+}
+
 /**
  * amdgpu_bo_gpu_offset - return GPU offset of bo
  * @bo:amdgpu object for which we query the offset
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index 2eeafc77c9c1..96d805889e8d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -283,6 +283,9 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo);
 int amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo);
 void amdgpu_bo_fence(struct amdgpu_bo *bo, struct dma_fence *fence,
 bool shared);
+int amdgpu_bo_sync_wait_resv(struct amdgpu_device *adev, struct dma_resv *resv,
+enum amdgpu_sync_mode sync_mode, void *owner,
+bool intr);
 int amdgpu_bo_sync_wait(struct amdgpu_bo *bo, void *owner, bool intr);
 u64 amdgpu_bo_gpu_offset(struct amdgpu_bo *bo);
 int amdgpu_bo_validate(struct amdgpu_bo *bo);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
index 9f42032676da..b86392253696 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
@@ -249,13 +249,6 @@ int amdgpu_sync_resv(struct amdgpu_device *adev, struct 
amdgpu_sync *sync,
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
continue;
 
-   /* VM updates only sync with moves but not with user
-* command submissions or KFD evictions fences
-*/
-   if (fence_owner != AMDGPU_FENCE_OWNER_UNDEFINED &&
-   owner == AMDGPU_FENCE_OWNER_VM)
-   continue;
-
/* Ignore fences depending on the sync mode */
switch (mode) {
case AMDGPU_SYNC_ALWAYS:
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 6038b3c89633..aaae2b5e6eee 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -797,7 +797,7 @@ static int amdgpu_vm_clear_bo(struct amdgpu_device *adev,
params.vm = vm;

[PATCH 4/5] drm/amdgpu: simplify and fix amdgpu_sync_resv

2020-01-30 Thread Christian König
No matter what, we always need to sync to moves.

Signed-off-by: Christian König 
Tested-by: Tom St Denis 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
index c124f64e7aae..9f42032676da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
@@ -232,10 +232,19 @@ int amdgpu_sync_resv(struct amdgpu_device *adev, struct 
amdgpu_sync *sync,
 
f = rcu_dereference_protected(flist->shared[i],
  dma_resv_held(resv));
+
+   fence_owner = amdgpu_sync_get_owner(f);
+
+   /* Always sync to moves, no matter what */
+   if (fence_owner == AMDGPU_FENCE_OWNER_UNDEFINED) {
+   r = amdgpu_sync_fence(sync, f, false);
+   if (r)
+   break;
+   }
+
/* We only want to trigger KFD eviction fences on
 * evict or move jobs. Skip KFD fences otherwise.
 */
-   fence_owner = amdgpu_sync_get_owner(f);
if (fence_owner == AMDGPU_FENCE_OWNER_KFD &&
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
continue;
@@ -265,9 +274,7 @@ int amdgpu_sync_resv(struct amdgpu_device *adev, struct 
amdgpu_sync *sync,
break;
 
case AMDGPU_SYNC_EXPLICIT:
-   if (owner != AMDGPU_FENCE_OWNER_UNDEFINED)
-   continue;
-   break;
+   continue;
}
 
r = amdgpu_sync_fence(sync, f, false);
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 1/5] drm/amdgpu: fix braces in amdgpu_vm_update_ptes

2020-01-30 Thread Christian König
For the root PD the mask can be 0xffffffff as well, which would
overrun to 0 if we don't cast it before we add one.
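
A quick standalone illustration of the overflow (plain C; mask stands in for
the 32-bit value amdgpu_vm_entries_mask() returns, shift kept at 0 for clarity):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t mask = 0xffffffffu;	/* root PD mask */
	unsigned int shift = 0;

	/* broken: mask + 1 wraps to 0 in 32 bits before the cast applies */
	uint64_t bad  = (uint64_t)(mask + 1) << shift;
	/* fixed: widen to 64 bits first, then add one */
	uint64_t good = ((uint64_t)mask + 1) << shift;

	printf("bad=%llu good=%llu\n",
	       (unsigned long long)bad, (unsigned long long)good);
	/* prints: bad=0 good=4294967296 */
	return 0;
}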

Signed-off-by: Christian König 
Tested-by: Tom St Denis 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 5cb182231f5d..4ba6a5e5d094 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1487,7 +1487,7 @@ static int amdgpu_vm_update_ptes(struct 
amdgpu_vm_update_params *params,
incr = (uint64_t)AMDGPU_GPU_PAGE_SIZE << shift;
mask = amdgpu_vm_entries_mask(adev, cursor.level);
pe_start = ((cursor.pfn >> shift) & mask) * 8;
-   entry_end = (uint64_t)(mask + 1) << shift;
+   entry_end = ((uint64_t)mask + 1) << shift;
entry_end += cursor.pfn & ~(entry_end - 1);
entry_end = min(entry_end, end);
 
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amd/powerplay: fix navi10 system intermittent reboot issue

2020-01-30 Thread Xu, Feifei


Reviewed-by: Feifei Xu 

> On Jan 30, 2020, at 16:59, Evan Quan  wrote:
> 
> This workaround is needed only for Navi10 12 Gbps SKUs.
> 
> Change-Id: I4bfcb8a8dbff785a159e6a1ed413d93063403ab3
> Signed-off-by: Evan Quan 
> ---
> drivers/gpu/drm/amd/powerplay/amdgpu_smu.c| 18 +++
> .../gpu/drm/amd/powerplay/inc/amdgpu_smu.h|  1 +
> drivers/gpu/drm/amd/powerplay/inc/smu_types.h |  2 +
> .../drm/amd/powerplay/inc/smu_v11_0_ppsmc.h   |  5 +-
> drivers/gpu/drm/amd/powerplay/navi10_ppt.c| 49 +++
> drivers/gpu/drm/amd/powerplay/smu_internal.h  |  3 ++
> 6 files changed, 77 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c 
> b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
> index fabc46dfb933..9d1075823681 100644
> --- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
> +++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
> @@ -21,6 +21,7 @@
>  */
> 
> #include 
> +#include 
> 
> #include "pp_debug.h"
> #include "amdgpu.h"
> @@ -1138,6 +1139,23 @@ static int smu_smc_table_hw_init(struct smu_context 
> *smu,
>ret = smu_system_features_control(smu, true);
>if (ret)
>return ret;
> +
> +if (adev->asic_type == CHIP_NAVI10) {
> +if ((adev->pdev->device == 0x731f && (adev->pdev->revision == 0xc2 ||
> +  adev->pdev->revision == 0xc3 ||
> +  adev->pdev->revision == 0xca ||
> +  adev->pdev->revision == 0xcb)) ||
> +(adev->pdev->device == 0x66af && (adev->pdev->revision == 0xf3 ||
> +  adev->pdev->revision == 0xf4 ||
> +  adev->pdev->revision == 0xf5 ||
> +  adev->pdev->revision == 0xf6))) {
> +ret = smu_disable_umc_cdr_12gbps_workaround(smu);
> +if (ret) {
> +pr_err("Workaround failed to disable UMC CDR feature on 12Gbps SKU!\n");
> +return ret;
> +}
> +}
> +}
>}
>if (adev->asic_type != CHIP_ARCTURUS) {
>ret = smu_notify_display_change(smu);
> diff --git a/drivers/gpu/drm/amd/powerplay/inc/amdgpu_smu.h 
> b/drivers/gpu/drm/amd/powerplay/inc/amdgpu_smu.h
> index b8781820cec1..75c79b52ef03 100644
> --- a/drivers/gpu/drm/amd/powerplay/inc/amdgpu_smu.h
> +++ b/drivers/gpu/drm/amd/powerplay/inc/amdgpu_smu.h
> @@ -566,6 +566,7 @@ struct pptable_funcs {
>int (*set_soft_freq_limited_range)(struct smu_context *smu, enum 
> smu_clk_type clk_type, uint32_t min, uint32_t max);
>int (*override_pcie_parameters)(struct smu_context *smu);
>uint32_t (*get_pptable_power_limit)(struct smu_context *smu);
> +int (*disable_umc_cdr_12gbps_workaround)(struct smu_context *smu);
> };
> 
> int smu_load_microcode(struct smu_context *smu);
> diff --git a/drivers/gpu/drm/amd/powerplay/inc/smu_types.h 
> b/drivers/gpu/drm/amd/powerplay/inc/smu_types.h
> index d8c9b7f91fcc..a5b4df146713 100644
> --- a/drivers/gpu/drm/amd/powerplay/inc/smu_types.h
> +++ b/drivers/gpu/drm/amd/powerplay/inc/smu_types.h
> @@ -170,6 +170,8 @@
>__SMU_DUMMY_MAP(SetSoftMinJpeg),  \
>__SMU_DUMMY_MAP(SetHardMinFclkByFreq),\
>__SMU_DUMMY_MAP(DFCstateControl), \
> +__SMU_DUMMY_MAP(DAL_DISABLE_DUMMY_PSTATE_CHANGE), \
> +__SMU_DUMMY_MAP(DAL_ENABLE_DUMMY_PSTATE_CHANGE), \
> 
> #undef __SMU_DUMMY_MAP
> #define __SMU_DUMMY_MAP(type)SMU_MSG_##type
> diff --git a/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h 
> b/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h
> index 373861ddccd0..406bfd187ce8 100644
> --- a/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h
> +++ b/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h
> @@ -120,7 +120,10 @@
> #define PPSMC_MSG_GetVoltageByDpmOverdrive   0x45
> #define PPSMC_MSG_BacoAudioD3PME 0x48
> 
> -#define PPSMC_Message_Count  0x49
> +#define PPSMC_MSG_DALDisableDummyPstateChange0x49
> +#define PPSMC_MSG_DALEnableDummyPstateChange 0x4A
> +
> +#define PPSMC_Message_Count  0x4B
> 
> typedef uint32_t PPSMC_Result;
> typedef uint32_t PPSMC_Msg;
> diff --git a/drivers/gpu/drm/amd/powerplay/navi10_ppt.c 
> b/drivers/gpu/drm/amd/powerplay/navi10_ppt.c
> index a0a342f6127f..3feb339a434e 100644
> --- a/drivers/gpu/drm/amd/powerplay/navi10_ppt.c
> +++ b/drivers/gpu/drm/amd/powerplay/navi10_ppt.c
> @@ -119,6 +119,8 @@ static struct smu_11_0_cmn2aisc_mapping 
> navi10_message_map[SMU_MSG_MAX_COUNT] =
>MSG_MAP(PowerDownJpeg,PPSMC_MSG_PowerDownJpeg),
>MSG_MAP(BacoAudioD3PME,PPSMC_MSG_BacoAudioD3PME),
>MSG_MAP(ArmD3,PPSMC_MSG_ArmD3),
> +MSG_MAP(DAL_DISABLE_DUMMY_PSTATE_CHANGE, PPSMC_MSG_DALDisableDummyPstateChange),
> +MSG_MAP(DAL_ENABLE_DUMMY_PSTATE_CHANGE, PPSMC_MSG_DALEnableDummyPstateChange),
> };
> 
> static struct 

Re: [PATCH 1/2] drm/amdgpu: enable GPU reset by default on Navi

2020-01-30 Thread Pierre-Eric Pelloux-Prayer
Hi Alex,

I had one issue while testing this patch on an RX5700.

After a gfx hang a reset is executed.
Switching to a VT and restarting gdm works fine, but the clocks seem messed up:
   - lots of graphical artifacts (underflows?)
   - pp_dpm_sclk and pp_dpm_socclk have strange values (see attached files)
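(For reference, those files live under /sys/class/drm/card0/device/ on an
amdgpu system, and the currently selected level is marked with a trailing
'*' when you cat them.)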

dmesg output (from the gfx hang):
[  169.755071] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for 
fences timed out!
[  169.755173] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for 
fences timed out!
[  174.874847] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 
timeout, signaled seq=10034, emitted seq=10036
[  174.874925] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: 
process gnome-shell pid 1724 thread gnome-shel:cs0 pid 1741
[  174.874933] amdgpu :0b:00.0: GPU reset begin!
[  174.875192] [ cut here ]
[  174.875282] WARNING: CPU: 0 PID: 7 at 
drivers/gpu/drm/amd/amdgpu/../display/dc/dcn20/dcn20_resource.c:2969 
dcn20_validate_bandwidth+0x87/0xe0 [amdgpu]
[  174.875283] Modules linked in: binfmt_misc(E) nls_ascii(E) nls_cp437(E) 
vfat(E) fat(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) 
ledtrig_audio(E) snd_hda_codec_hdmi(E) snd_hda_intel(E) snd_intel_nhlt(E) 
snd_hda_codec(E) snd_hwdep(E) efi_pstore(E) snd_hda_core(E) snd_pcm(E) 
snd_timer(E) snd(E) ccp(E) xpad(E) wmi_bmof(E) mxm_wmi(E) evdev(E) joydev(E) 
ff_memless(E) efivars(E) soundcore(E) pcspkr(E) k10temp(E) sp5100_tco(E) 
rng_core(E) sg(E) wmi(E) button(E) acpi_cpufreq(E) uinput(E) parport_pc(E) 
ppdev(E) lp(E) parport(E) efivarfs(E) ip_tables(E) autofs4(E) ext4(E) 
crc32c_generic(E) crc16(E) mbcache(E) jbd2(E) dm_crypt(E) dm_mod(E) 
hid_generic(E) usbhid(E) hid(E) sd_mod(E) crct10dif_pclmul(E) crc32_pclmul(E) 
crc32c_intel(E) ghash_clmulni_intel(E) amdgpu(E) gpu_sched(E) ttm(E) 
drm_kms_helper(E) aesni_intel(E) glue_helper(E) xhci_pci(E) ahci(E) 
crypto_simd(E) libahci(E) cryptd(E) drm(E) xhci_hcd(E) i2c_piix4(E) libata(E) 
igb(E) usbcore(E) scsi_mod(E) dca(E) i2c_algo_bit(E) gpio_amdpt(E)
[  174.875310]  gpio_generic(E)
[  174.875313] CPU: 0 PID: 7 Comm: kworker/0:1 Tainted: GE 
5.4.0-rc7-02679-g9d664d914f0e #10
[  174.875314] Hardware name: Gigabyte Technology Co., Ltd. X470 AORUS ULTRA 
GAMING/X470 AORUS ULTRA GAMING-CF, BIOS F6 01/25/2019
[  174.875318] Workqueue: events drm_sched_job_timedout [gpu_sched]
[  174.875404] RIP: 0010:dcn20_validate_bandwidth+0x87/0xe0 [amdgpu]
[  174.875406] Code: 2d 44 22 a5 e8 1d 00 00 75 26 f2 0f 11 85 a8 21 00 00 31 
d2 48 89 ee 4c 89 ef e8 d4 f5 ff ff 41 89 c4 22 85 e8 1d 00 00 75 4a <0f> 0b eb 
02 75 d1 f2 0f 10 14 24 f2 0f 11 95 a8 21 00 00 e8 f1 4b
[  174.875407] RSP: 0018:a87880067a90 EFLAGS: 00010246
[  174.875408] RAX:  RBX:  RCX: 1e61
[  174.875409] RDX: 1e60 RSI: 981f3ec2d880 RDI: 0002d880
[  174.875409] RBP: 981f25a6 R08: 0006 R09: 
[  174.875410] R10: 0001 R11: 00010001 R12: 0001
[  174.875411] R13: 981f1af4 R14:  R15: 981f25a6
[  174.875412] FS:  () GS:981f3ec0() 
knlGS:
[  174.875413] CS:  0010 DS:  ES:  CR0: 80050033
[  174.875414] CR2: 7f42858f3000 CR3: 0007f02ae000 CR4: 003406f0
[  174.875414] Call Trace:
[  174.875498]  dc_validate_global_state+0x25f/0x2d0 [amdgpu]
[  174.875583]  amdgpu_dm_atomic_check+0x8ff/0xf20 [amdgpu]
[  174.875587]  ? __ww_mutex_lock.isra.0+0x3a/0x760
[  174.875590]  ? _cond_resched+0x15/0x30
[  174.875591]  ? __ww_mutex_lock.isra.0+0x3a/0x760
[  174.875606]  drm_atomic_check_only+0x554/0x7e0 [drm]
[  174.875620]  ? drm_connector_list_iter_next+0x7d/0x90 [drm]
[  174.875632]  drm_atomic_commit+0x13/0x50 [drm]
[  174.875640]  drm_atomic_helper_disable_all+0x14c/0x160 [drm_kms_helper]
[  174.875647]  drm_atomic_helper_suspend+0x60/0xf0 [drm_kms_helper]
[  174.875730]  dm_suspend+0x1c/0x60 [amdgpu]
[  174.875782]  amdgpu_device_ip_suspend_phase1+0x81/0xe0 [amdgpu]
[  174.875836]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
[  174.875923]  amdgpu_device_pre_asic_reset+0x191/0x1a4 [amdgpu]
[  174.876010]  amdgpu_device_gpu_recover.cold+0x43a/0xbca [amdgpu]
[  174.876084]  amdgpu_job_timedout+0x103/0x130 [amdgpu]
[  174.876088]  drm_sched_job_timedout+0x7f/0xe0 [gpu_sched]
[  174.876092]  process_one_work+0x1b5/0x360
[  174.876094]  worker_thread+0x50/0x3c0
[  174.876096]  kthread+0xf9/0x130
[  174.876097]  ? process_one_work+0x360/0x360
[  174.876099]  ? kthread_park+0x90/0x90
[  174.876100]  ret_from_fork+0x22/0x40
[  174.876103] ---[ end trace af4365804bf318ce ]---
[  175.346937] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
[  175.600179] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[  175.853418] [drm:gfx_v10_0_cp_gfx_enable [amdgpu]] *ERROR* failed to halt cp 
gfx
[  175.874639] 

RE: [PATCH] drm/amd/dm/mst: Ignore payload update failures on disable

2020-01-30 Thread Lin, Wayne
[AMD Public Use]

Hi Lyude,

Thanks for the patch!
I'm wondering if this error still occurs with this patch applied:
https://patchwork.kernel.org/patch/11274363/
In that patch I tried to clean up all mgr->proposed_vcpis[] so that
drm_dp_update_payload_part1() will skip all invalid ports.

However, I'm also thinking of improving that patch.
Maybe it is better to do the cleanup of proposed_vcpis[] in
dm_helpers_dp_mst_write_payload_allocation_table(), since that is the
actual place where DC tries to update the status for a specific
VC stream. If that sounds reasonable, I'll do it in the future :)

> -Original Message-
> From: Lyude Paul 
> Sent: Saturday, January 25, 2020 2:57 AM
> To: Lipski, Mikita ; Wentland, Harry
> ; amd-gfx@lists.freedesktop.org
> Cc: sta...@vger.kernel.org; Wentland, Harry ; Li,
> Sun peng (Leo) ; Deucher, Alexander
> ; Koenig, Christian
> ; Zhou, David(ChunMing)
> ; David Airlie ; Daniel Vetter
> ; Lakha, Bhawanpreet ;
> Lipski, Mikita ; Sam Ravnborg ;
> David Francis ; Tsai, Martin
> ; Chris Wilson ; Lee, Alvin
> ; Jean Delvare ;
> dri-de...@lists.freedesktop.org; linux-ker...@vger.kernel.org; Lin, Wayne
> 
> Subject: Re: [PATCH] drm/amd/dm/mst: Ignore payload update failures on
> disable
> 
> 
> 
> On Fri, 2020-01-24 at 11:39 -0500, Mikita Lipski wrote:
> >
> > On 1/24/20 9:55 AM, Harry Wentland wrote:
> > > On 2020-01-23 7:06 p.m., Lyude Paul wrote:
> > > > Disabling a display on MST can potentially happen after the entire
> > > > MST topology has been removed, which means that we can't
> > > > communicate with the topology at all in this scenario. Likewise,
> > > > this also means that we can't properly update payloads on the
> > > > topology and as such, it's a good idea to ignore payload update failures
> when disabling displays.
> > > > Currently, amdgpu makes the mistake of halting the payload update
> > > > process when any payload update failures occur, resulting in
> > > > leaving DC's local copies of the payload tables out of date.
> > > >
> > > > This ends up causing problems with hotplugging MST topologies, and
> > > > causes modesets on the second hotplug to fail like so:
> > > >
> > > > [drm] Failed to updateMST allocation table forpipe idx:1
> > > > [ cut here ]
> > > > WARNING: CPU: 5 PID: 1511 at
> > > > drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc_link.c:2677
> > > > update_mst_stream_alloc_table+0x11e/0x130 [amdgpu] Modules linked
> > > > in: cdc_ether usbnet fuse xt_conntrack nf_conntrack
> > > > nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4
> > > > nft_counter nft_compat nf_tables nfnetlink tun bridge stp llc
> > > > sunrpc vfat fat wmi_bmof uvcvideo snd_hda_codec_realtek
> > > > snd_hda_codec_generic snd_hda_codec_hdmi videobuf2_vmalloc
> > > > snd_hda_intel videobuf2_memops
> > > > videobuf2_v4l2 snd_intel_dspcfg videobuf2_common crct10dif_pclmul
> > > > snd_hda_codec videodev crc32_pclmul snd_hwdep snd_hda_core
> > > > ghash_clmulni_intel snd_seq mc joydev pcspkr snd_seq_device
> > > > snd_pcm sp5100_tco k10temp i2c_piix4 snd_timer thinkpad_acpi
> > > > ledtrig_audio snd wmi soundcore video i2c_scmi acpi_cpufreq
> > > > ip_tables amdgpu(O) rtsx_pci_sdmmc amd_iommu_v2 gpu_sched
> mmc_core
> > > > i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt
> > > > fb_sys_fops cec drm crc32c_intel serio_raw hid_multitouch r8152
> > > > mii nvme r8169 nvme_core rtsx_pci pinctrl_amd
> > > > CPU: 5 PID: 1511 Comm: gnome-shell Tainted: G   O
> 5.5.0-
> > > > rc7Lyude-Test+ #4
> > > > Hardware name: LENOVO FA495SIT26/FA495SIT26, BIOS
> R12ET22W(0.22 )
> > > > 01/31/2019
> > > > RIP: 0010:update_mst_stream_alloc_table+0x11e/0x130 [amdgpu]
> > > > Code: 28 00 00 00 75 2b 48 8d 65 e0 5b 41 5c 41 5d 41 5e 5d c3 0f
> > > > b6 06
> > > > 49 89 1c 24 41 88 44 24 08 0f b6 46 01 41 88 44 24 09 eb 93 <0f>
> > > > 0b e9 2f ff ff ff e8 a6 82 a3 c2 66 0f 1f 44 00 00 0f 1f 44 00
> > > > RSP: 0018:ac428127f5b0 EFLAGS: 00010202
> > > > RAX: 0002 RBX: 8d1e166eee80 RCX:
> 
> > > > RDX: ac428127f668 RSI: 8d1e166eee80 RDI: ac428127f610
> > > > RBP: ac428127f640 R08: c03d94a8 R09: 
> > > > R10: 8d1e24b02000 R11: ac428127f5b0 R12: 8d1e1b83d000
> > > > R13: 8d1e1bea0b08 R14: 0002 R15:
> 0002
> > > > FS:  7fab23ffcd80() GS:8d1e28b4()
> > > > knlGS:
> > > > CS:  0010 DS:  ES:  CR0: 80050033
> > > > CR2: 7f151f1711e8 CR3: 0005997c CR4:
> 003406e0
> > > > Call Trace:
> > > >   ? mutex_lock+0xe/0x30
> > > >   dc_link_allocate_mst_payload+0x9a/0x210 [amdgpu]
> > > >   ? dm_read_reg_func+0x39/0xb0 [amdgpu]
> > > >   ? core_link_enable_stream+0x656/0x730 [amdgpu]
> > > >   core_link_enable_stream+0x656/0x730 [amdgpu]
> > > >   dce110_apply_ctx_to_hw+0x58e/0x5d0 [amdgpu]
> > > >   ? dcn10_verify_allow_pstate_change_high+0x1d/0x280 [amdgpu]
> > > >   ? 

[PATCH] drm/amd/powerplay: fix navi10 system intermittent reboot issue

2020-01-30 Thread Evan Quan
This workaround is needed only for Navi10 12 Gbps SKUs.

Change-Id: I4bfcb8a8dbff785a159e6a1ed413d93063403ab3
Signed-off-by: Evan Quan 
---
 drivers/gpu/drm/amd/powerplay/amdgpu_smu.c| 18 +++
 .../gpu/drm/amd/powerplay/inc/amdgpu_smu.h|  1 +
 drivers/gpu/drm/amd/powerplay/inc/smu_types.h |  2 +
 .../drm/amd/powerplay/inc/smu_v11_0_ppsmc.h   |  5 +-
 drivers/gpu/drm/amd/powerplay/navi10_ppt.c| 49 +++
 drivers/gpu/drm/amd/powerplay/smu_internal.h  |  3 ++
 6 files changed, 77 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c 
b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
index fabc46dfb933..9d1075823681 100644
--- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
@@ -21,6 +21,7 @@
  */
 
 #include 
+#include 
 
 #include "pp_debug.h"
 #include "amdgpu.h"
@@ -1138,6 +1139,23 @@ static int smu_smc_table_hw_init(struct smu_context *smu,
ret = smu_system_features_control(smu, true);
if (ret)
return ret;
+
+		if (adev->asic_type == CHIP_NAVI10) {
+			if ((adev->pdev->device == 0x731f && (adev->pdev->revision == 0xc2 ||
+							      adev->pdev->revision == 0xc3 ||
+							      adev->pdev->revision == 0xca ||
+							      adev->pdev->revision == 0xcb)) ||
+			    (adev->pdev->device == 0x66af && (adev->pdev->revision == 0xf3 ||
+							      adev->pdev->revision == 0xf4 ||
+							      adev->pdev->revision == 0xf5 ||
+							      adev->pdev->revision == 0xf6))) {
+				ret = smu_disable_umc_cdr_12gbps_workaround(smu);
+				if (ret) {
+					pr_err("Workaround failed to disable UMC CDR feature on 12Gbps SKU!\n");
+					return ret;
+				}
+			}
+		}
}
if (adev->asic_type != CHIP_ARCTURUS) {
ret = smu_notify_display_change(smu);
diff --git a/drivers/gpu/drm/amd/powerplay/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/powerplay/inc/amdgpu_smu.h
index b8781820cec1..75c79b52ef03 100644
--- a/drivers/gpu/drm/amd/powerplay/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/powerplay/inc/amdgpu_smu.h
@@ -566,6 +566,7 @@ struct pptable_funcs {
int (*set_soft_freq_limited_range)(struct smu_context *smu, enum 
smu_clk_type clk_type, uint32_t min, uint32_t max);
int (*override_pcie_parameters)(struct smu_context *smu);
uint32_t (*get_pptable_power_limit)(struct smu_context *smu);
+   int (*disable_umc_cdr_12gbps_workaround)(struct smu_context *smu);
 };
 
 int smu_load_microcode(struct smu_context *smu);
diff --git a/drivers/gpu/drm/amd/powerplay/inc/smu_types.h 
b/drivers/gpu/drm/amd/powerplay/inc/smu_types.h
index d8c9b7f91fcc..a5b4df146713 100644
--- a/drivers/gpu/drm/amd/powerplay/inc/smu_types.h
+++ b/drivers/gpu/drm/amd/powerplay/inc/smu_types.h
@@ -170,6 +170,8 @@
__SMU_DUMMY_MAP(SetSoftMinJpeg),  \
__SMU_DUMMY_MAP(SetHardMinFclkByFreq),\
__SMU_DUMMY_MAP(DFCstateControl), \
+   __SMU_DUMMY_MAP(DAL_DISABLE_DUMMY_PSTATE_CHANGE), \
+   __SMU_DUMMY_MAP(DAL_ENABLE_DUMMY_PSTATE_CHANGE), \
 
 #undef __SMU_DUMMY_MAP
 #define __SMU_DUMMY_MAP(type)  SMU_MSG_##type
diff --git a/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h 
b/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h
index 373861ddccd0..406bfd187ce8 100644
--- a/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h
+++ b/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h
@@ -120,7 +120,10 @@
 #define PPSMC_MSG_GetVoltageByDpmOverdrive   0x45
 #define PPSMC_MSG_BacoAudioD3PME 0x48
 
-#define PPSMC_Message_Count  0x49
+#define PPSMC_MSG_DALDisableDummyPstateChange0x49
+#define PPSMC_MSG_DALEnableDummyPstateChange 0x4A
+
+#define PPSMC_Message_Count  0x4B
 
 typedef uint32_t PPSMC_Result;
 typedef uint32_t PPSMC_Msg;
diff --git a/drivers/gpu/drm/amd/powerplay/navi10_ppt.c 
b/drivers/gpu/drm/amd/powerplay/navi10_ppt.c
index a0a342f6127f..3feb339a434e 100644
--- a/drivers/gpu/drm/amd/powerplay/navi10_ppt.c
+++ b/drivers/gpu/drm/amd/powerplay/navi10_ppt.c
@@ -119,6 +119,8 @@ static struct smu_11_0_cmn2aisc_mapping 
navi10_message_map[SMU_MSG_MAX_COUNT] =
MSG_MAP(PowerDownJpeg,  PPSMC_MSG_PowerDownJpeg),
MSG_MAP(BacoAudioD3PME, PPSMC_MSG_BacoAudioD3PME),
MSG_MAP(ArmD3,  PPSMC_MSG_ArmD3),
+   

[PATCH] drm/amdgpu: Fix implicit enum conversion in gfx_v9_4_ras_error_inject

2020-01-30 Thread Nathan Chancellor
Clang warns:

../drivers/gpu/drm/amd/amdgpu/gfx_v9_4.c:967:35: warning: implicit
conversion from enumeration type 'enum amdgpu_ras_block' to different
enumeration type 'enum ta_ras_block' [-Wenum-conversion]
block_info.block_id = info->head.block;
~ ~~~^
1 warning generated.

Use the function added in commit 828cfa29093f ("drm/amdgpu: Fix amdgpu
ras to ta enums conversion") that handles this conversion explicitly.
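
A tiny standalone reproduction of this warning class (plain C, toy enums
rather than the driver's real types):

enum block_a { BLOCK_A_GFX };
enum block_b { BLOCK_B_GFX };

enum block_b convert_ok(enum block_a a)
{
	return (enum block_b)a;	/* explicit cast: clang stays quiet */
}

enum block_b convert_warns(enum block_a a)
{
	return a;		/* clang -Wenum-conversion fires here */
}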

Fixes: 4c461d89db4f ("drm/amdgpu: add RAS support for the gfx block of 
Arcturus")
Link: https://github.com/ClangBuiltLinux/linux/issues/849
Signed-off-by: Nathan Chancellor 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4.c
index e19d275f3f7d..f099f13d7f1e 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4.c
@@ -964,7 +964,7 @@ int gfx_v9_4_ras_error_inject(struct amdgpu_device *adev, 
void *inject_if)
if (!amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__GFX))
return -EINVAL;
 
-   block_info.block_id = info->head.block;
+   block_info.block_id = amdgpu_ras_block_to_ta(info->head.block);
block_info.sub_block_index = info->head.sub_block_index;
block_info.inject_error_type = amdgpu_ras_error_to_ta(info->head.type);
block_info.address = info->address;
-- 
2.25.0

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx