[PATCH v2] drm/kfd: fix a system crash issue during GPU recovery

2020-09-01 Thread Dennis Li
The crash log as the below:

[Thu Aug 20 23:18:14 2020] general protection fault:  [#1] SMP NOPTI
[Thu Aug 20 23:18:14 2020] CPU: 152 PID: 1837 Comm: kworker/152:1 Tainted: G
   OE 5.4.0-42-generic #46~18.04.1-Ubuntu
[Thu Aug 20 23:18:14 2020] Hardware name: GIGABYTE G482-Z53-YF/MZ52-G40-00, 
BIOS R12 05/13/2020
[Thu Aug 20 23:18:14 2020] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
[Thu Aug 20 23:18:14 2020] RIP: 0010:evict_process_queues_cpsch+0xc9/0x130 
[amdgpu]
[Thu Aug 20 23:18:14 2020] Code: 49 8d 4d 10 48 39 c8 75 21 eb 44 83 fa 03 74 
36 80 78 72 00 74 0c 83 ab 68 01 00 00 01 41 c6 45 41 00 48 8b 00 48 39 c8 74 
25 <80> 78 70 00 c6 40 6d 01 74 ee 8b 50 28 c6 40 70 00 83 ab 60 01 00
[Thu Aug 20 23:18:14 2020] RSP: 0018:b29b52f6fc90 EFLAGS: 00010213
[Thu Aug 20 23:18:14 2020] RAX: 1c884edb0a118914 RBX: 8a0d45ff3c00 RCX: 
8a2d83e41038
[Thu Aug 20 23:18:14 2020] RDX:  RSI: 0082 RDI: 
8a0e2e4178c0
[Thu Aug 20 23:18:14 2020] RBP: b29b52f6fcb0 R08: 1b64 R09: 
0004
[Thu Aug 20 23:18:14 2020] R10: b29b52f6fb78 R11: 0001 R12: 
8a0d45ff3d28
[Thu Aug 20 23:18:14 2020] R13: 8a2d83e41028 R14:  R15: 

[Thu Aug 20 23:18:14 2020] FS:  () 
GS:8a0e2e40() knlGS:
[Thu Aug 20 23:18:14 2020] CS:  0010 DS:  ES:  CR0: 80050033
[Thu Aug 20 23:18:14 2020] CR2: 55c783c0e6a8 CR3: 0034a1284000 CR4: 
00340ee0
[Thu Aug 20 23:18:14 2020] Call Trace:
[Thu Aug 20 23:18:14 2020]  kfd_process_evict_queues+0x43/0xd0 [amdgpu]
[Thu Aug 20 23:18:14 2020]  kfd_suspend_all_processes+0x60/0xf0 [amdgpu]
[Thu Aug 20 23:18:14 2020]  kgd2kfd_suspend.part.7+0x43/0x50 [amdgpu]
[Thu Aug 20 23:18:14 2020]  kgd2kfd_pre_reset+0x46/0x60 [amdgpu]
[Thu Aug 20 23:18:14 2020]  amdgpu_amdkfd_pre_reset+0x1a/0x20 [amdgpu]
[Thu Aug 20 23:18:14 2020]  amdgpu_device_gpu_recover+0x377/0xf90 [amdgpu]
[Thu Aug 20 23:18:14 2020]  ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu]
[Thu Aug 20 23:18:14 2020]  amdgpu_ras_do_recovery+0x159/0x190 [amdgpu]
[Thu Aug 20 23:18:14 2020]  process_one_work+0x20f/0x400
[Thu Aug 20 23:18:14 2020]  worker_thread+0x34/0x410

When GPU hang, user process will fail to create a compute queue whose
struct object will be freed later, but driver wrongly add this queue to
queue list of the proccess. And then kfd_process_evict_queues will
access a freed memory, which cause a system crash.

v2:
The failure to execute_queues should probably not be reported to
the caller of create_queue, because the queue was already created.
Therefore change to ignore the return value from execute_queues.

Signed-off-by: Dennis Li 

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 560adc57a050..069ba4be1e8f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -1302,7 +1302,7 @@ static int create_queue_cpsch(struct device_queue_manager 
*dqm, struct queue *q,
if (q->properties.is_active) {
increment_queue_count(dqm, q->properties.type);
 
-   retval = execute_queues_cpsch(dqm,
+   execute_queues_cpsch(dqm,
KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0);
}
 
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH V2] drm/amdgpu: Do not move root PT bo to relocated list

2020-09-01 Thread Pan, Xinhui


> 2020年9月1日 21:54,Christian König  写道:
> 
> Agreed, that change doesn't seem to make sense and your backtrace is mangled 
> so barely readable.

it is reply that messed up the logs.

And this patch was sent on 10th Feb. 
> 
> Christian.
> 
> Am 01.09.20 um 14:59 schrieb Liu, Monk:
>> [AMD Official Use Only - Internal Distribution Only]
>> 
>> See that we already have such logic:
>> 
>> 282 static void amdgpu_vm_bo_relocated(struct amdgpu_vm_bo_base *vm_bo)
>>  283 {
>>  284 if (vm_bo->bo->parent)
>>  285 list_move(_bo->vm_status, _bo->vm->relocated);
>>  286 else
>>  287 amdgpu_vm_bo_idle(vm_bo);
>>  288 }
>> 
>> Why you need to do the bo->parent check out side ?

because it is me that moves such logic into amdgpu_vm_bo_relocated.

>> 
>> -邮件原件-
>> 发件人: amd-gfx  代表 Pan, Xinhui
>> 发送时间: 2020年2月10日 9:04
>> 收件人: amd-gfx@lists.freedesktop.org
>> 抄送: Deucher, Alexander ; Koenig, Christian 
>> 
>> 主题: [PATCH V2] drm/amdgpu: Do not move root PT bo to relocated list
>> 
>> hit panic when we update the page tables.
>> 
>> <1>[  122.103290] BUG: kernel NULL pointer dereference, address: 
>> 0008 <1>[  122.103348] #PF: supervisor read access in kernel 
>> mode <1>[  122.103376] #PF: error_code(0x) - not-present page <6>[  
>> 122.103403] PGD 0 P4D 0 <4>[  122.103421] Oops:  [#1] SMP PTI
>> <4>[  122.103442] CPU: 13 PID: 2133 Comm: kfdtest Tainted: G   OE
>>  5.4.0-rc7+ #7
>> <4>[  122.103480] Hardware name: Supermicro SYS-7048GR-TR/X10DRG-Q, BIOS 
>> 3.0b 03/09/2018 <4>[  122.103657] RIP: 
>> 0010:amdgpu_vm_update_pdes+0x140/0x330 [amdgpu] <4>[  122.103689] Code: 03 
>> 4c 89 73 08 49 89 9d c8 00 00 00 48 8b 7b f0 c6 43 10 00 45 31 c0 48 8b 87 
>> 28 04 00 00 48 85 c0 74 07 4c 8b 80 20 04 00 00 <4d> 8b 70 08 31 f6 49 8b 86 
>> 28 04 00 00 48 85 c0 74 0f 48 8b 80 28 <4>[  122.103769] RSP: 
>> 0018:b49a0a6a3a98 EFLAGS: 00010246 <4>[  122.103797] RAX: 
>>  RBX: 9020f823c148 RCX: dead0122 <4>[  
>> 122.103831] RDX: 9020ece70018 RSI: 9020f823c0c8 RDI: 
>> 9010ca31c800 <4>[  122.103865] RBP: b49a0a6a3b38 R08: 
>>  R09: 0001 <4>[  122.103899] R10: 
>> 6044f994 R11: df57fb58 R12: 9020f823c000 <4>[  
>> 122.103933] R13: 9020f823c000 R14: 9020f823c0c8 R15: 
>> 9010d5d2 <4>[  122.103968] FS:  7f32c83dc780() 
>> GS:9020ff38() knlGS: <4>[  122.104006] CS:  0010 
>> DS:  ES:  CR0: 80050033 <4>[  122.104035] CR2: 
>> 0008 CR3: 002036bba005 CR4: 003606e0 <4>[  
>> 122.104069] DR0:  DR1:  DR2: 
>>  <4>[  122.104103] DR3:  DR6: 
>> fffe0ff0 DR7: 0400 <4>[  122.104137] Call Trace:
>> <4>[  122.104241]  vm_update_pds+0x31/0x50 [amdgpu] <4>[  122.104347]  
>> amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x2ef/0x690 [amdgpu] <4>[  122.104466] 
>>  kfd_process_alloc_gpuvm+0x98/0x190 [amdgpu] <4>[  122.104576]  
>> kfd_process_device_init_vm.part.8+0xf3/0x1f0 [amdgpu] <4>[  122.104688]  
>> kfd_process_device_init_vm+0x24/0x30 [amdgpu] <4>[  122.104794]  
>> kfd_ioctl_acquire_vm+0xa4/0xc0 [amdgpu] <4>[  122.104900]  
>> kfd_ioctl+0x277/0x500 [amdgpu] <4>[  122.105001]  ? 
>> kfd_ioctl_free_memory_of_gpu+0xc0/0xc0 [amdgpu] <4>[  122.105039]  ? 
>> rcu_read_lock_sched_held+0x4f/0x80
>> <4>[  122.105068]  ? kmem_cache_free+0x2ba/0x300 <4>[  122.105093]  ? 
>> vm_area_free+0x18/0x20 <4>[  122.105117]  ? find_held_lock+0x35/0xa0 <4>[  
>> 122.105143]  do_vfs_ioctl+0xa9/0x6f0 <4>[  122.106001]  ksys_ioctl+0x75/0x80 
>> <4>[  122.106802]  ? do_syscall_64+0x17/0x230 <4>[  122.107605]  
>> __x64_sys_ioctl+0x1a/0x20 <4>[  122.108378]  do_syscall_64+0x5f/0x230 <4>[  
>> 122.109118]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> <4>[  122.109842] RIP: 0033:0x7f32c6b495d7
>> 
>> Signed-off-by: xinhui pan 
>> ---
>> change from v1:
>>move root pt bo to idle state instead.
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 9 ++---
>>  1 file changed, 6 insertions(+), 3 deletions(-)
>> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index 3195bc9..c3d1af5 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -2619,9 +2619,12 @@ void amdgpu_vm_bo_invalidate(struct amdgpu_device 
>> *adev,
>>  continue;
>>  bo_base->moved = true;
>> 
>> -if (bo->tbo.type == ttm_bo_type_kernel)
>> -amdgpu_vm_bo_relocated(bo_base);
>> -else if (bo->tbo.base.resv == vm->root.base.bo->tbo.base.resv)
>> +if (bo->tbo.type == ttm_bo_type_kernel) {
>> +if (bo->parent)
>> +amdgpu_vm_bo_relocated(bo_base);
>> +else
>> +amdgpu_vm_bo_idle(bo_base);
>> +} else if (bo->tbo.base.resv == vm->root.base.bo->tbo.base.resv)
>>  amdgpu_vm_bo_moved(bo_base);
>>  else
>>  amdgpu_vm_bo_invalidated(bo_base);
>> --
>> 2.7.4
>> 

Re: [PATCH 0/3] Use implicit kref infra

2020-09-01 Thread Pan, Xinhui


> 2020年9月2日 11:46,Tuikov, Luben  写道:
> 
> On 2020-09-01 21:42, Pan, Xinhui wrote:
>> If you take a look at the below function, you should not use driver's 
>> release to free adev. As dev is embedded in adev.
> 
> Do you mean "look at the function below", using "below" as an adverb?
> "below" is not an adjective.
> 
> I know dev is embedded in adev--I did that patchset.
> 
>> 
>> 809 static void drm_dev_release(struct kref *ref)
>> 810 {
>> 811 struct drm_device *dev = container_of(ref, struct drm_device, 
>> ref);
>> 812
>> 813 if (dev->driver->release)
>> 814 dev->driver->release(dev);
>> 815 
>> 816 drm_managed_release(dev);
>> 817 
>> 818 kfree(dev->managed.final_kfree);
>> 819 }
> 
> That's simple--this comes from change c6603c740e0e3
> and it should be reverted. Simple as that.
> 
> The version before this change was absolutely correct:
> 
> static void drm_dev_release(struct kref *ref)
> {
>   if (dev->driver->release)
>   dev->driver->release(dev);
>   else
>   drm_dev_fini(dev);
> }
> 
> Meaning, "the kref is now 0"--> if the driver
> has a release, call it, else use our own.
> But note that nothing can be assumed after this point,
> about the existence of "dev".
> 
> It is exactly because struct drm_device is statically
> embedded into a container, struct amdgpu_device,
> that this change above should be reverted.
> 
> This is very similar to how fops has open/release
> but no close. That is, the "release" is called
> only when the last kref is released, i.e. when
> kref goes from non-zero to zero.
> 
> This uses the kref infrastructure which has been
> around for about 20 years in the Linux kernel.
> 
> I suggest reading the comments
> in drm_dev.c mostly, "DOC: driver instance overview"
> starting at line 240 onwards. This is right above
> drm_put_dev(). There is actually an example of a driver
> in the comment. Also the comment to drm_dev_init().
> 
> Now, take a look at this:
> 
> /**
> * drm_dev_put - Drop reference of a DRM device
> * @dev: device to drop reference of or NULL
> *
> * This decreases the ref-count of @dev by one. The device is destroyed if the
> * ref-count drops to zero.
> */
> void drm_dev_put(struct drm_device *dev)
> {
>if (dev)
>kref_put(>ref, drm_dev_release);
> }
> EXPORT_SYMBOL(drm_dev_put);
> 
> Two things:
> 
> 1. It is us, who kzalloc the amdgpu device, which contains
> the drm_device (you'll see this discussed in the reading
> material I pointed to above). We do this because we're
> probing the PCI device whether we'll work it it or not.
> 

that is true.
My understanding of the drm core code is like something below.
struct B { 
strcut A 
}
we initialize A firstly and initialize B in the end. But destroy B firstly and 
destory A in the end.
But yes, practice is more complex. 
if B has nothing to be destroyed. we can destory A directly, otherwise destroy 
B firstly.

in this case, we can do something below in our release()
//some cleanup work of B
drm_dev_fini(dev);//destroy A
kfree(adev)

> 2. Using the kref infrastructure, when the ref goes to 0,
> drm_dev_release is called. And here's the KEY:
> Because WE allocated the container, we should free it--after the release
> method is called, DRM cannot assume anything about the drm
> device or the container. The "release" method is final.
> 
> We allocate, we free. And we free only when the ref goes to 0.
> 
> DRM can, in due time, "free" itself of the DRM device and stop
> having knowledge of it--that's fine, but as long as the ref
> is not 0, the amdgpu device and thus the contained DRM device,
> cannot be freed.
> 
>> 
>> You have to make another change something like
>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>> index 13068fdf4331..2aabd2b4c63b 100644
>> --- a/drivers/gpu/drm/drm_drv.c
>> +++ b/drivers/gpu/drm/drm_drv.c
>> @@ -815,7 +815,8 @@ static void drm_dev_release(struct kref *ref)
>> 
>>drm_managed_release(dev);
>> 
>> -   kfree(dev->managed.final_kfree);
>> +   if (dev->driver->final_release)
>> +   dev->driver->final_release(dev);
>> }
> 
> No. What's this?
> There is no such thing as "final" release, nor is there a "partial" release.
> When the kref goes to 0, the device disappears. Simple.
> If someone is using it, they should kref-get it, and when they're
> done with it, they should kref-put it.

I just take an example here. add another release in the end. then no one could 
touch us. IOW, final_release.


A destroy B by a callback, then A destroy itself. It assumes B just free its 
own resource.
but that makes trouble if some resource of A is allocated by B.
Because B must take care of these common resource shared between A and B.

yes, that logical is more complex. So I think we can revert drm_dev_release to 
its previous version.

> 
> The whole point is that this is done implicitly, via the kref infrastructure.
> 

Re: [PATCH] drm/amdgpu: Fix a redundant kfree

2020-09-01 Thread Luben Tuikov
On 2020-09-01 5:58 p.m., Pan, Xinhui wrote:
> [AMD Official Use Only - Internal Distribution Only]
> 
> 
> 
> The correct thing to do this is to
> _leave the amdgpu_driver_release()_ alone,
> remove "drmm_add_final_kfree()" and qualify
> the WARN_ON() in drm_dev_register() by
> the existence of drm_driver.release() (i.e. non-NULL).
> 
> Re:  this drm driver release callback is more like a release notify. It is 
> called in the beginning of the total release sequence. As you have made drm 
> device a member of adev. So you can not free adev in the driver release 
> callback.
> 

No. The DRM driver release callback is not "more like release notify".
When the kref goes to 0, the release callback is called. The kref
infrastructure has been around in the kernel for about 20 years now.

I don't understand what a "total release sequence" is. Is there
a "partial release sequence"? The stack below shows an example
of a release. It's very simple, when the kref goes to 0,
the DRM driver's callback is called.

> Maybe change the sequence of release,  say, put drm driver release in the end 
> of total release sequence.
> Or still use the final_kfree to free adev and our release callback just do 
> some other cleanup work.

No. "final kfree" is a hack. What we want to use is the implicit
nature of the kref infrastructure. You don't really care to "free",
how and when it happens. You just forget about "free."
It will implicitly take place when the kref transitions from non-zero
to 0. It's implicit.

Another advantage of using DRM driver's "release" callback
on kref going to 0, is that we can do other interesting stuff
just before we free the container structure amdgpu_device.

And most of all, we don't really need to be concerned with
sequencing the free, how, when, etc., it naturally takes place
on a kref-put, when that transitions the kref to 0.

Regards,
Luben


> --
> *From:* Tuikov, Luben 
> *Sent:* Wednesday, September 2, 2020 4:35:32 AM
> *To:* Alex Deucher ; Pan, Xinhui ; 
> Daniel Vetter 
> *Cc:* amd-gfx@lists.freedesktop.org ; Deucher, 
> Alexander 
> *Subject:* Re: [PATCH] drm/amdgpu: Fix a redundant kfree
>  
> On 2020-09-01 10:12 a.m., Alex Deucher wrote:
>> On Tue, Sep 1, 2020 at 3:46 AM Pan, Xinhui  wrote:
>>>
>>> [AMD Official Use Only - Internal Distribution Only]
>>>
>>> drm_dev_alloc() alloc *dev* and set managed.final_kfree to dev to free
>>> itself.
>>> Now from commit 5cdd68498918("drm/amdgpu: Embed drm_device into
>>> amdgpu_device (v3)") we alloc *adev* and ddev is just a member of it.
>>> So drm_dev_release try to free a wrong pointer then.
>>>
>>> Also driver's release trys to free adev, but drm_dev_release will
>>> access dev after call drvier's release.
>>>
>>> To fix it, remove driver's release and set managed.final_kfree to adev.
>> 
>> I've got to admit, the documentation around drm_dev_init is hard to
>> follow.  We aren't really using the drmm stuff, but you have to use
>> drmm_add_final_kfree to avoid a warning.  The logic seems to make
>> sense though.
>> Acked-by: Alex Deucher 
> 
> The logic in patch 3/3 uses the kref infrastructure
> as described in drm_drv.c's comment on what the DRM
> usage is, i.e. taking advantage of the kref infrastructure.
> 
> In amdgpu_pci_probe() we call drm_dev_init() which takes
> a ref of 1 on the kref in the DRM device structure,
> and from then on, only when the kref transitions
> from non-zero to 0, do we free the container of
> DRM device, and this is beautifully shown in the
> kernel stack below (please take a look at the kernel
> stack below).
> 
> Using a kref is very powerful as it is implicit:
> when the kref transitions from non-zero to 0,
> then call the release method.
> 
> Furthermore, we own the release method, and we
> like that, as it is pure, as well as,
> there may be more things we'd like to do in the future
> before we free the amdgpu device: maybe free memory we're
> using specifically for that PCI device, maybe write
> some registers, maybe notify someone or 

Re: [PATCH 0/3] Use implicit kref infra

2020-09-01 Thread Luben Tuikov
On 2020-09-01 21:42, Pan, Xinhui wrote:
> If you take a look at the below function, you should not use driver's release 
> to free adev. As dev is embedded in adev.

Do you mean "look at the function below", using "below" as an adverb?
"below" is not an adjective.

I know dev is embedded in adev--I did that patchset.

> 
>  809 static void drm_dev_release(struct kref *ref)
>  810 {
>  811 struct drm_device *dev = container_of(ref, struct drm_device, 
> ref);
>  812
>  813 if (dev->driver->release)
>  814 dev->driver->release(dev);
>  815 
>  816 drm_managed_release(dev);
>  817 
>  818 kfree(dev->managed.final_kfree);
>  819 }

That's simple--this comes from change c6603c740e0e3
and it should be reverted. Simple as that.

The version before this change was absolutely correct:

static void drm_dev_release(struct kref *ref)
{
if (dev->driver->release)
dev->driver->release(dev);
else
drm_dev_fini(dev);
}

Meaning, "the kref is now 0"--> if the driver
has a release, call it, else use our own.
But note that nothing can be assumed after this point,
about the existence of "dev".

It is exactly because struct drm_device is statically
embedded into a container, struct amdgpu_device,
that this change above should be reverted.

This is very similar to how fops has open/release
but no close. That is, the "release" is called
only when the last kref is released, i.e. when
kref goes from non-zero to zero.

This uses the kref infrastructure which has been
around for about 20 years in the Linux kernel.

I suggest reading the comments
in drm_dev.c mostly, "DOC: driver instance overview"
starting at line 240 onwards. This is right above
drm_put_dev(). There is actually an example of a driver
in the comment. Also the comment to drm_dev_init().

Now, take a look at this:

/**
 * drm_dev_put - Drop reference of a DRM device
 * @dev: device to drop reference of or NULL
 *
 * This decreases the ref-count of @dev by one. The device is destroyed if the
 * ref-count drops to zero.
 */
void drm_dev_put(struct drm_device *dev)
{
if (dev)
kref_put(>ref, drm_dev_release);
}
EXPORT_SYMBOL(drm_dev_put);

Two things:

1. It is us, who kzalloc the amdgpu device, which contains
the drm_device (you'll see this discussed in the reading
material I pointed to above). We do this because we're
probing the PCI device whether we'll work it it or not.

2. Using the kref infrastructure, when the ref goes to 0,
drm_dev_release is called. And here's the KEY:
Because WE allocated the container, we should free it--after the release
method is called, DRM cannot assume anything about the drm
device or the container. The "release" method is final.

We allocate, we free. And we free only when the ref goes to 0.

DRM can, in due time, "free" itself of the DRM device and stop
having knowledge of it--that's fine, but as long as the ref
is not 0, the amdgpu device and thus the contained DRM device,
cannot be freed.

> 
> You have to make another change something like
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index 13068fdf4331..2aabd2b4c63b 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -815,7 +815,8 @@ static void drm_dev_release(struct kref *ref)
>  
> drm_managed_release(dev);
>  
> -   kfree(dev->managed.final_kfree);
> +   if (dev->driver->final_release)
> +   dev->driver->final_release(dev);
>  }

No. What's this?
There is no such thing as "final" release, nor is there a "partial" release.
When the kref goes to 0, the device disappears. Simple.
If someone is using it, they should kref-get it, and when they're
done with it, they should kref-put it.

The whole point is that this is done implicitly, via the kref infrastructure.
drm_dev_init() which we call in our PCI probe function, sets the kref to 1--all
as per the documentation I pointed you to above.

Another point is that we can do some other stuff in the release
function, notify someone, write some registers, free memory we use
for that PCI device, etc.

If the "managed resources" infrastructure wants to stay, it should hook
itself into drm_dev_fini() and into drm_dev_init() or drm_dev_register().
It shouldn't have to be so out-of-place like in patch 2/3 of this series,
where the drmm_add_final_kfree() is smack-dab in the middle of our PCI
discovery function, surrounded on top and bottom by drm_dev_init()
and drm_dev_register(). The "managed resources" infra should be non-invasive
and drivers shouldn't have to change to use it--it should be invisible to them.
Then our kref would just work.

> 
> And in the final_release callback we free the dev. But that is a little 
> complex now. so I prefer still using final_kfree.
> Of course we can do some cleanup work in the driver's release callback. BUT 
> no kfree.

No! No final_kfree. It's a hack.

Read the documentation in drm_drv.c I noted 

[PATCH] drm/amdgpu: Revert "drm/amdgpu: stop allocating dummy GTT nodes"

2020-09-01 Thread xinhui pan
This reverts commit 1e691e2444871d1fde11b611653b5da9010dcec8.

mem->mm_node now could be NULL with commit above. That makes
amdgpu_vm_bo_split_mapping touchs outside of the page table as
max_entries set to S64_MAX;

before we fix that issue, revert commit above.

[  978.955925] BUG: unable to handle page fault for address: 94dfc4bc
[  978.963424] #PF: supervisor read access in kernel mode
[  978.969034] #PF: error_code(0x) - not-present page
[  978.974662] PGD 72e201067 P4D 72e201067 PUD 86a414067 PMD 86a3ee067 PTE 
80083b43f060
[  978.983494] Oops:  [#1] SMP DEBUG_PAGEALLOC NOPTI
[  978.988992] CPU: 0 PID: 12264 Comm: Xorg Tainted: GW  O  
5.9.0-rc2+ #46
[  978.997394] Hardware name: System manufacturer System Product Name/PRIME 
Z390-A, BIOS 1401 11/26/2019
[  979.007495] RIP: 0010:amdgpu_vm_bo_update+0x5af/0x880 [amdgpu]
[  979.013881] Code: ff ff ff ff ff 7f 48 8b 45 c0 4c 8d 04 d8 b8 01 00 00 00 
eb 09 48 83 c0 01 48 39 c2 76 12 49 8b 74 c0 f8 48 81 c6 00 10 00 00 <49> 39 34 
c0 74 e5 8b 75 b4 4c 8b 45 c8 48 38
[  979.034354] RSP: 0018:a94281403ba8 EFLAGS: 00010206
[  979.040050] RAX: 0200 RBX: 0e00 RCX: 001049e8
[  979.047824] RDX: 7fff RSI: 0007c5e0 RDI: 94dfd5fc
[  979.055644] RBP: a94281403c40 R08: 94dfc4bbf000 R09: 0001
[  979.063441] R10:  R11:  R12: 001047e8
[  979.071279] R13:  R14: 001047e9 R15: 94dfc4e9af48
[  979.079098] FS:  7f19d3d00a80() GS:94e007e0() 
knlGS:
[  979.087911] CS:  0010 DS:  ES:  CR0: 80050033
[  979.094240] CR2: 94dfc4bc CR3: 0007c408c005 CR4: 003706f0
[  979.102050] DR0:  DR1:  DR2: 
[  979.109868] DR3:  DR6: fffe0ff0 DR7: 0400
[  979.117669] Call Trace:
[  979.120393]  amdgpu_gem_va_ioctl+0x533/0x560 [amdgpu]
[  979.125970]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
[  979.131914]  drm_ioctl_kernel+0xb4/0x100 [drm]
[  979.136792]  drm_ioctl+0x241/0x400 [drm]
[  979.141100]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
[  979.147003]  ? _raw_spin_unlock_irqrestore+0x4c/0x60
[  979.152446]  ? trace_hardirqs_on+0x2b/0xf0
[  979.156977]  amdgpu_drm_ioctl+0x4e/0x80 [amdgpu]
[  979.162033]  __x64_sys_ioctl+0x91/0xc0
[  979.166117]  do_syscall_64+0x38/0x90
[  979.170022]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  979.175537] RIP: 0033:0x7f19d405e37b
[  979.179450] Code: 0f 1e fa 48 8b 05 15 3b 0d 00 64 c7 00 26 00 00 00 48 c7 
c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 
f0 ff ff 73 01 c3 48 8b 0d e5 3a 08
[  979.200034] RSP: 002b:7ffe66c9e938 EFLAGS: 0246 ORIG_RAX: 
0010
[  979.208330] RAX: ffda RBX: 7ffe66c9e990 RCX: 7f19d405e37b
[  979.216147] RDX: 7ffe66c9e990 RSI: c0286448 RDI: 0010
[  979.223897] RBP: c0286448 R08: 0001039e9000 R09: 000e
[  979.231742] R10: 5640dcedf010 R11: 0246 R12: 
[  979.239555] R13: 0010 R14: 0001 R15: 7ffe66c9ea58
[  979.247358] Modules linked in: amdgpu(O) iommu_v2 gpu_sched(O) ttm(O) 
drm_kms_helper(O) cec i2c_algo_bit fb_sys_fops syscopyarea sysfillrect 
sysimgblt overlay binfmt_misc snd_sof_pci snd_sos
[  979.247375]  x_tables autofs4 crc32_pclmul e1000e i2c_i801 i2c_smbus ahci 
libahci wmi video pinctrl_cannonlake pinctrl_intel
[  979.354934] CR2: 94dfc4bc
[  979.358566] ---[ end trace 5b622843e4242519 ]---

Signed-off-by: xinhui pan 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c | 104 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c |  18 +---
 2 files changed, 80 insertions(+), 42 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
index e1b66898cb76..295d6fbcda8f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
@@ -150,7 +150,60 @@ static int amdgpu_gtt_mgr_fini(struct ttm_mem_type_manager 
*man)
  */
 bool amdgpu_gtt_mgr_has_gart_addr(struct ttm_mem_reg *mem)
 {
-   return mem->mm_node != NULL;
+   struct amdgpu_gtt_node *node = mem->mm_node;
+
+   return (node->node.start != AMDGPU_BO_INVALID_OFFSET);
+}
+
+/**
+ * amdgpu_gtt_mgr_alloc - allocate new ranges
+ *
+ * @man: TTM memory type manager
+ * @tbo: TTM BO we need this range for
+ * @place: placement flags and restrictions
+ * @mem: the resulting mem object
+ *
+ * Allocate the address space for a node.
+ */
+static int amdgpu_gtt_mgr_alloc(struct ttm_mem_type_manager *man,
+   struct ttm_buffer_object *tbo,
+   const struct ttm_place *place,
+   struct ttm_mem_reg *mem)
+{
+   struct amdgpu_device *adev = 

Re: [PATCH] drm/kfd: fix a system crash issue during GPU recovery

2020-09-01 Thread Felix Kuehling

On 2020-09-01 11:21 a.m., Li, Dennis wrote:

[AMD Official Use Only - Internal Distribution Only]

Hi, Felix,
  If GPU hang, execute_queues_cpsch will fail to unmap or map queues and then 
create_queue_cpsch will return error. If pqm_create_queue find create_queue_cpsch 
failed, it will call uninit_queue to free queue object. However this queue object 
has been added in to qpd->queues_list in the old code.


Right, that's a problem. I think the intention here is to keep going 
because a failure to execute the runlist affects not just the queue that 
was just created, but all queues in all processes.


The failure to execute_queues should probably not be reported to the 
caller of create_queue, because the queue was already created, and the 
problem with execute_queues has a much bigger scope than this one 
caller. So I think the correct solution is to ignore the return value 
from execute_queues.


As a follow up, we should probably handle all the error scenarios inside 
execute_queues and make it a void function. Failure to unmap queues 
already triggers a GPU reset, so nothing new needs to be done for that. 
But we need to add handling of failures to map queues. It doesn't 
require a GPU reset, because the problem is in the kernel (e.g. out of 
memory), not the GPU. The best we can do is report this asynchronously 
as a GPU hang to all KFD processes, so they know the GPU is no longer 
going to work for them.


Regards,
  Felix



Best Regards
Dennis Li

-Original Message-
From: Kuehling, Felix 
Sent: Tuesday, September 1, 2020 9:26 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Zhang, Hawking ; Koenig, Christian 

Subject: Re: [PATCH] drm/kfd: fix a system crash issue during GPU recovery

I'm not sure how the bug you're fixing is caused, but your fix is clearly in 
the wrong place.

A queue being disabled is not the same thing as a queue being destroyed.
Queues can be disabled for legitimate reasons, but they still should exist and be 
in the qpd->queues_list.

If a destroyed queue is left on the qpd->queues_list, that would be a problem. 
Can you point out where such a thing is happening?

Thanks,
   Felix


Am 2020-08-31 um 9:36 p.m. schrieb Dennis Li:

The crash log as the below:

[Thu Aug 20 23:18:14 2020] general protection fault:  [#1] SMP NOPTI
[Thu Aug 20 23:18:14 2020] CPU: 152 PID: 1837 Comm: kworker/152:1 Tainted: G
   OE 5.4.0-42-generic #46~18.04.1-Ubuntu
[Thu Aug 20 23:18:14 2020] Hardware name: GIGABYTE
G482-Z53-YF/MZ52-G40-00, BIOS R12 05/13/2020 [Thu Aug 20 23:18:14
2020] Workqueue: events amdgpu_ras_do_recovery [amdgpu] [Thu Aug 20
23:18:14 2020] RIP: 0010:evict_process_queues_cpsch+0xc9/0x130
[amdgpu] [Thu Aug 20 23:18:14 2020] Code: 49 8d 4d 10 48 39 c8 75 21
eb 44 83 fa 03 74 36 80 78 72 00 74 0c 83 ab 68 01 00 00 01 41 c6 45
41 00 48 8b 00 48 39 c8 74 25 <80> 78 70 00 c6 40 6d 01 74 ee 8b 50 28
c6 40 70 00 83 ab 60 01 00 [Thu Aug 20 23:18:14 2020] RSP:
0018:b29b52f6fc90 EFLAGS: 00010213 [Thu Aug 20 23:18:14 2020] RAX:
1c884edb0a118914 RBX: 8a0d45ff3c00 RCX: 8a2d83e41038 [Thu Aug
20 23:18:14 2020] RDX:  RSI: 0082 RDI:
8a0e2e4178c0 [Thu Aug 20 23:18:14 2020] RBP: b29b52f6fcb0 R08:
1b64 R09: 0004 [Thu Aug 20 23:18:14 2020] R10:
b29b52f6fb78 R11: 0001 R12: 8a0d45ff3d28 [Thu Aug 20 
23:18:14 2020] R13: 8a2d83e41028 R14:  R15: 
 [Thu Aug 20 23:18:14 2020] FS:  () 
GS:8a0e2e40() knlGS: [Thu Aug 20 23:18:14 2020] CS: 
 0010 DS:  ES:  CR0: 80050033 [Thu Aug 20 23:18:14 2020] CR2: 
55c783c0e6a8 CR3: 0034a1284000 CR4: 00340ee0 [Thu Aug 20 
23:18:14 2020] Call Trace:
[Thu Aug 20 23:18:14 2020]  kfd_process_evict_queues+0x43/0xd0
[amdgpu] [Thu Aug 20 23:18:14 2020]
kfd_suspend_all_processes+0x60/0xf0 [amdgpu] [Thu Aug 20 23:18:14
2020]  kgd2kfd_suspend.part.7+0x43/0x50 [amdgpu] [Thu Aug 20 23:18:14
2020]  kgd2kfd_pre_reset+0x46/0x60 [amdgpu] [Thu Aug 20 23:18:14 2020]
amdgpu_amdkfd_pre_reset+0x1a/0x20 [amdgpu] [Thu Aug 20 23:18:14 2020]
amdgpu_device_gpu_recover+0x377/0xf90 [amdgpu] [Thu Aug 20 23:18:14
2020]  ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [Thu Aug 20
23:18:14 2020]  amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [Thu Aug
20 23:18:14 2020]  process_one_work+0x20f/0x400 [Thu Aug 20 23:18:14
2020]  worker_thread+0x34/0x410

When GPU hang, user process will fail to create a compute queue whose
struct object will be freed later, but driver wrongly add this queue
to queue list of the proccess. And then kfd_process_evict_queues will
access a freed memory, which cause a system crash.

Signed-off-by: Dennis Li 

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 560adc57a050..d5e6b07ffb27 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c

RE: [PATCH] drm/amd/display: Fix a list corruption

2020-09-01 Thread Xu, Feifei
[AMD Official Use Only - Internal Distribution Only]

Acked-by: Feifei Xu 

-Original Message-
From: amd-gfx  On Behalf Of Pan, Xinhui
Sent: Tuesday, September 1, 2020 3:58 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander 
Subject: [PATCH] drm/amd/display: Fix a list corruption

[AMD Official Use Only - Internal Distribution Only]

[AMD Official Use Only - Internal Distribution Only]

Remove the private obj from the internal list before we free aconnector.

[   56.925828] BUG: unable to handle page fault for address: 8f84a870a560
[   56.933272] #PF: supervisor read access in kernel mode
[   56.938801] #PF: error_code(0x) - not-present page
[   56.944376] PGD 18e605067 P4D 18e605067 PUD 86a614067 PMD 86a4d0067 PTE 
8008578f5060
[   56.953260] Oops:  [#1] SMP DEBUG_PAGEALLOC NOPTI
[   56.958815] CPU: 6 PID: 1407 Comm: bash Tainted: G   O  
5.9.0-rc2+ #46
[   56.967092] Hardware name: System manufacturer System Product Name/PRIME 
Z390-A, BIOS 1401 11/26/2019
[   56.977162] RIP: 0010:__list_del_entry_valid+0x31/0xa0
[   56.982768] Code: 00 ad de 55 48 8b 17 4c 8b 47 08 48 89 e5 48 39 c2 74 27 
48 b8 22 01 00 00 00 00 ad de 49 39 c0 74 2d 49 8b 30 48 39 fe 75 3d <48> 8b 52 
08 48 39 f2 75 4c b8 01 00 00 00 5d c3 48 89 7
[   57.003327] RSP: 0018:b40c81687c90 EFLAGS: 00010246
[   57.009048] RAX: dead0122 RBX: 8f84ea41f4f0 RCX: 0006
[   57.016871] RDX: 8f84a870a558 RSI: 8f84ea41f4f0 RDI: 8f84ea41f4f0
[   57.024672] RBP: b40c81687c90 R08: 8f84ea400998 R09: 0001
[   57.032490] R10:  R11:  R12: 0006
[   57.040287] R13: 8f84ea422a90 R14: 8f84b4129a20 R15: fff2
[   57.048105] FS:  7f550d885740() GS:8f850960() 
knlGS:
[   57.056979] CS:  0010 DS:  ES:  CR0: 80050033
[   57.063260] CR2: 8f84a870a560 CR3: 0007e5144001 CR4: 003706e0
[   57.071053] DR0:  DR1:  DR2: 
[   57.078849] DR3:  DR6: fffe0ff0 DR7: 0400
[   57.086684] Call Trace:
[   57.089381]  drm_atomic_private_obj_fini+0x29/0x82 [drm]
[   57.095247]  amdgpu_dm_fini+0x83/0x170 [amdgpu]
[   57.100264]  dm_hw_fini+0x23/0x30 [amdgpu]
[   57.104814]  amdgpu_device_fini+0x1df/0x4fe [amdgpu]
[   57.110271]  amdgpu_driver_unload_kms+0x43/0x70 [amdgpu]
[   57.116136]  amdgpu_pci_remove+0x3b/0x60 [amdgpu]
[   57.121291]  pci_device_remove+0x3e/0xb0
[   57.125583]  device_release_driver_internal+0xff/0x1d0
[   57.131223]  device_release_driver+0x12/0x20
[   57.135903]  pci_stop_bus_device+0x70/0xa0
[   57.140401]  pci_stop_and_remove_bus_device_locked+0x1b/0x30
[   57.146571]  remove_store+0x7b/0x90
[   57.150429]  dev_attr_store+0x17/0x30
[   57.154441]  sysfs_kf_write+0x4b/0x60
[   57.158479]  kernfs_fop_write+0xe8/0x1d0
[   57.162788]  vfs_write+0xf5/0x230
[   57.166426]  ksys_write+0x70/0xf0
[   57.170087]  __x64_sys_write+0x1a/0x20
[   57.174219]  do_syscall_64+0x38/0x90
[   57.178145]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: xinhui pan 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index f52533ee7372..cb624ee70545 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -5076,6 +5076,7 @@ static void amdgpu_dm_connector_destroy(struct 
drm_connector *connector)  struct amdgpu_device *adev = 
drm_to_adev(connector->dev);  struct amdgpu_display_manager *dm = >dm;

+drm_atomic_private_obj_fini(>mst_mgr.base);
 #if defined(CONFIG_BACKLIGHT_CLASS_DEVICE) ||\
 defined(CONFIG_BACKLIGHT_CLASS_DEVICE_MODULE)

--
2.25.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfxdata=02%7C01%7CFeifei.Xu%40amd.com%7C39a1103c3dfc4794778c08d84e4cb1c2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637345438618002415sdata=WXlcZfzbfIKcrcCR8DQoT2GerjWbT0MorFil%2FP3LCAA%3Dreserved=0
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 2/2] drm/amdgpu/gmc10: print client id string for gfxhub

2020-09-01 Thread Felix Kuehling

Should there a corresponding change in mmhub_v2_0.c?

Other than that, the series is

Reviewed-by: Felix Kuehling 

On 2020-09-01 5:51 p.m., Alex Deucher wrote:

Print the name of the client rather than the number.  This
makes it easier to debug what block is causing the fault.

Signed-off-by: Alex Deucher 
---
  drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c | 30 +---
  drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c | 30 +---
  2 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
index 76acd7f7723e..b882ac59879a 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
@@ -31,6 +31,27 @@
  
  #include "soc15_common.h"
  
+static const char *gfxhub_client_ids[] = {

+   "CB/DB",
+   "Reserved",
+   "GE1",
+   "GE2",
+   "CPF",
+   "CPC",
+   "CPG",
+   "RLC",
+   "TCP",
+   "SQC (inst)",
+   "SQC (data)",
+   "SQG",
+   "Reserved",
+   "SDMA0",
+   "SDMA1",
+   "GCR",
+   "SDMA2",
+   "SDMA3",
+};
+
  static uint32_t gfxhub_v2_0_get_invalidate_req(unsigned int vmid,
   uint32_t flush_type)
  {
@@ -55,12 +76,15 @@ static void
  gfxhub_v2_0_print_l2_protection_fault_status(struct amdgpu_device *adev,
 uint32_t status)
  {
+   u32 cid = REG_GET_FIELD(status,
+   GCVM_L2_PROTECTION_FAULT_STATUS, CID);
+
dev_err(adev->dev,
"GCVM_L2_PROTECTION_FAULT_STATUS:0x%08X\n",
status);
-   dev_err(adev->dev, "\t Faulty UTCL2 client ID: 0x%lx\n",
-   REG_GET_FIELD(status,
-   GCVM_L2_PROTECTION_FAULT_STATUS, CID));
+   dev_err(adev->dev, "\t Faulty UTCL2 client ID: %s (0x%x)\n",
+   cid >= ARRAY_SIZE(gfxhub_client_ids) ? "unknown" : 
gfxhub_client_ids[cid],
+   cid);
dev_err(adev->dev, "\t MORE_FAULTS: 0x%lx\n",
REG_GET_FIELD(status,
GCVM_L2_PROTECTION_FAULT_STATUS, MORE_FAULTS));
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c 
b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
index 80c906a0383f..237a9ff5afa0 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
@@ -31,6 +31,27 @@
  
  #include "soc15_common.h"
  
+static const char *gfxhub_client_ids[] = {

+   "CB/DB",
+   "Reserved",
+   "GE1",
+   "GE2",
+   "CPF",
+   "CPC",
+   "CPG",
+   "RLC",
+   "TCP",
+   "SQC (inst)",
+   "SQC (data)",
+   "SQG",
+   "Reserved",
+   "SDMA0",
+   "SDMA1",
+   "GCR",
+   "SDMA2",
+   "SDMA3",
+};
+
  static uint32_t gfxhub_v2_1_get_invalidate_req(unsigned int vmid,
   uint32_t flush_type)
  {
@@ -55,12 +76,15 @@ static void
  gfxhub_v2_1_print_l2_protection_fault_status(struct amdgpu_device *adev,
 uint32_t status)
  {
+   u32 cid = REG_GET_FIELD(status,
+   GCVM_L2_PROTECTION_FAULT_STATUS, CID);
+
dev_err(adev->dev,
"GCVM_L2_PROTECTION_FAULT_STATUS:0x%08X\n",
status);
-   dev_err(adev->dev, "\t Faulty UTCL2 client ID: 0x%lx\n",
-   REG_GET_FIELD(status,
-   GCVM_L2_PROTECTION_FAULT_STATUS, CID));
+   dev_err(adev->dev, "\t Faulty UTCL2 client ID: %s (0x%x)\n",
+   cid >= ARRAY_SIZE(gfxhub_client_ids) ? "unknown" : 
gfxhub_client_ids[cid],
+   cid);
dev_err(adev->dev, "\t MORE_FAULTS: 0x%lx\n",
REG_GET_FIELD(status,
GCVM_L2_PROTECTION_FAULT_STATUS, MORE_FAULTS));

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amdgpu: add ta firmware load in psp_v12_0 for renoir

2020-09-01 Thread Changfeng.Zhu
From: changzhu 

From: Changfeng 

It needs to load renoir_ta firmware because hdcp is enabled by default
for renoir now. This can avoid error:DTM TA is not initialized

Change-Id: Ib2f03a531013e4b432c2e9d4ec3dc021b4f8da7d
Signed-off-by: Changfeng 
---
 drivers/gpu/drm/amd/amdgpu/psp_v12_0.c | 54 ++
 1 file changed, 54 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c 
b/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
index 6c9614f77d33..75489313dbad 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
@@ -38,6 +38,8 @@
 #include "oss/osssys_4_0_sh_mask.h"
 
 MODULE_FIRMWARE("amdgpu/renoir_asd.bin");
+MODULE_FIRMWARE("amdgpu/renoir_ta.bin");
+
 /* address block */
 #define smnMP1_FIRMWARE_FLAGS  0x3010024
 
@@ -45,7 +47,10 @@ static int psp_v12_0_init_microcode(struct psp_context *psp)
 {
struct amdgpu_device *adev = psp->adev;
const char *chip_name;
+   char fw_name[30];
int err = 0;
+   const struct ta_firmware_header_v1_0 *ta_hdr;
+   DRM_DEBUG("\n");
 
switch (adev->asic_type) {
case CHIP_RENOIR:
@@ -56,6 +61,55 @@ static int psp_v12_0_init_microcode(struct psp_context *psp)
}
 
err = psp_init_asd_microcode(psp, chip_name);
+   if (err)
+   goto out;
+
+   snprintf(fw_name, sizeof(fw_name), "amdgpu/%s_ta.bin", chip_name);
+   err = request_firmware(>psp.ta_fw, fw_name, adev->dev);
+   if (err) {
+   release_firmware(adev->psp.ta_fw);
+   adev->psp.ta_fw = NULL;
+   dev_info(adev->dev,
+"psp v12.0: Failed to load firmware \"%s\"\n",
+fw_name);
+   } else {
+   err = amdgpu_ucode_validate(adev->psp.ta_fw);
+   if (err)
+   goto out2;
+
+   ta_hdr = (const struct ta_firmware_header_v1_0 *)
+adev->psp.ta_fw->data;
+   adev->psp.ta_hdcp_ucode_version =
+   le32_to_cpu(ta_hdr->ta_hdcp_ucode_version);
+   adev->psp.ta_hdcp_ucode_size =
+   le32_to_cpu(ta_hdr->ta_hdcp_size_bytes);
+   adev->psp.ta_hdcp_start_addr =
+   (uint8_t *)ta_hdr +
+   le32_to_cpu(ta_hdr->header.ucode_array_offset_bytes);
+
+   adev->psp.ta_fw_version = 
le32_to_cpu(ta_hdr->header.ucode_version);
+
+   adev->psp.ta_dtm_ucode_version =
+   le32_to_cpu(ta_hdr->ta_dtm_ucode_version);
+   adev->psp.ta_dtm_ucode_size =
+   le32_to_cpu(ta_hdr->ta_dtm_size_bytes);
+   adev->psp.ta_dtm_start_addr =
+   (uint8_t *)adev->psp.ta_hdcp_start_addr +
+   le32_to_cpu(ta_hdr->ta_dtm_offset_bytes);
+   }
+
+   return 0;
+
+out2:
+   release_firmware(adev->psp.ta_fw);
+   adev->psp.ta_fw = NULL;
+out:
+   if (err) {
+   dev_err(adev->dev,
+   "psp v12.0: Failed to load firmware \"%s\"\n",
+   fw_name);
+   }
+
return err;
 }
 
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 0/3] Use implicit kref infra

2020-09-01 Thread Pan, Xinhui
If you take a look at the below function, you should not use driver's release 
to free adev. As dev is embedded in adev.

 809 static void drm_dev_release(struct kref *ref)
 810 {
 811 struct drm_device *dev = container_of(ref, struct drm_device, ref);
 812
 813 if (dev->driver->release)
 814 dev->driver->release(dev);
 815 
 816 drm_managed_release(dev);
 817 
 818 kfree(dev->managed.final_kfree);
 819 }

You have to make another change something like
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 13068fdf4331..2aabd2b4c63b 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -815,7 +815,8 @@ static void drm_dev_release(struct kref *ref)
 
drm_managed_release(dev);
 
-   kfree(dev->managed.final_kfree);
+   if (dev->driver->final_release)
+   dev->driver->final_release(dev);
 }

And in the final_release callback we free the dev. But that is a little complex 
now. so I prefer still using final_kfree.
Of course we can do some cleanup work in the driver's release callback. BUT no 
kfree.

-原始邮件-
发件人: "Tuikov, Luben" 
日期: 2020年9月2日 星期三 09:07
收件人: "amd-gfx@lists.freedesktop.org" , 
"dri-de...@lists.freedesktop.org" 
抄送: "Deucher, Alexander" , Daniel Vetter 
, "Pan, Xinhui" , "Tuikov, Luben" 

主题: [PATCH 0/3] Use implicit kref infra

Use the implicit kref infrastructure to free the container
struct amdgpu_device, container of struct drm_device.

First, in drm_dev_register(), do not indiscriminately warn
when a DRM driver hasn't opted for managed.final_kfree,
but instead check if the driver has provided its own
"release" function callback in the DRM driver structure.
If that is the case, no warning.

Remove drmm_add_final_kfree(). We take care of that, in the
kref "release" callback when all refs are down to 0, via
drm_dev_put(), i.e. the free is implicit.

Remove superfluous NULL check, since the DRM device to be
suspended always exists, so long as the underlying PCI and
DRM devices exist.

Luben Tuikov (3):
  drm: No warn for drivers who provide release
  drm/amdgpu: Remove drmm final free
  drm/amdgpu: Remove superfluous NULL check

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 2 --
 drivers/gpu/drm/drm_drv.c  | 3 ++-
 3 files changed, 2 insertions(+), 6 deletions(-)

-- 
2.28.0.394.ge197136389



___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 0/3] Use implicit kref infra

2020-09-01 Thread Luben Tuikov
Use the implicit kref infrastructure to free the container
struct amdgpu_device, container of struct drm_device.

First, in drm_dev_register(), do not indiscriminately warn
when a DRM driver hasn't opted for managed.final_kfree,
but instead check if the driver has provided its own
"release" function callback in the DRM driver structure.
If that is the case, no warning.

Remove drmm_add_final_kfree(). We take care of that, in the
kref "release" callback when all refs are down to 0, via
drm_dev_put(), i.e. the free is implicit.

Remove superfluous NULL check, since the DRM device to be
suspended always exists, so long as the underlying PCI and
DRM devices exist.

Luben Tuikov (3):
  drm: No warn for drivers who provide release
  drm/amdgpu: Remove drmm final free
  drm/amdgpu: Remove superfluous NULL check

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 2 --
 drivers/gpu/drm/drm_drv.c  | 3 ++-
 3 files changed, 2 insertions(+), 6 deletions(-)

-- 
2.28.0.394.ge197136389

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 2/3] drm/amdgpu: Remove drmm final free

2020-09-01 Thread Luben Tuikov
The amdgpu driver implements its own DRM driver
release function which naturally frees
the container struct amdgpu_device of
the DRM device, on a "final" kref-put,
i.e. when the kref transitions from non-zero
to 0.

Signed-off-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 459cf13e76fe..17d49f1d86e7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1153,8 +1153,6 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
if (ret)
goto err_free;
 
-   drmm_add_final_kfree(ddev, ddev);
-
if (!supports_atomic)
ddev->driver_features &= ~DRIVER_ATOMIC;
 
-- 
2.28.0.394.ge197136389

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 3/3] drm/amdgpu: Remove superfluous NULL check

2020-09-01 Thread Luben Tuikov
The DRM device is a static member of
the amdgpu device structure and as such
always exists, so long as the PCI and
thus the amdgpu device exist.

Signed-off-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c4900471beb0..6dcc256b9ebc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3471,9 +3471,6 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
fbcon)
struct drm_connector_list_iter iter;
int r;
 
-   if (!dev)
-   return -ENODEV;
-
adev = drm_to_adev(dev);
 
if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
-- 
2.28.0.394.ge197136389

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 1/3] drm: No warn for drivers who provide release

2020-09-01 Thread Luben Tuikov
Drivers usually allocate their container
struct at PCI probe time, then call drm_dev_init(),
which initializes the contained DRM dev kref to 1.

A DRM driver may provide their own kref
release method, which frees the container
object, the container of the DRM device,
on the last "put" which usually comes
after the PCI device has been freed
with PCI and with DRM.

If a driver has provided their own "release"
method in the drm_driver structure, then
do not check "managed.final_kfree", and thus
do not splat a WARN_ON in the kernel log
when a driver which implements "release"
is loaded.

This patch adds this one-line check.

Signed-off-by: Luben Tuikov 
---
 drivers/gpu/drm/drm_drv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 13068fdf4331..952455dedb8c 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -935,7 +935,8 @@ int drm_dev_register(struct drm_device *dev, unsigned long 
flags)
if (!driver->load)
drm_mode_config_validate(dev);
 
-   WARN_ON(!dev->managed.final_kfree);
+   if (!driver->release)
+   WARN_ON(!dev->managed.final_kfree);
 
if (drm_dev_needs_global_mutex(dev))
mutex_lock(_global_mutex);
-- 
2.28.0.394.ge197136389

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amdkfd: Move process doorbell allocation into kfd device

2020-09-01 Thread Mukul Joshi
Move doorbell allocation for a process into kfd device and
allocate doorbell space in each PDD during process creation.
Currently, KFD manages its own doorbell space but for some
devices, amdgpu would allocate the complete doorbell
space instead of leaving a chunk of doorbell space for KFD to
manage. In a system with mix of such devices, KFD would need
to request process doorbell space based on the type of device,
either from amdgpu or from its own doorbell space.

Signed-off-by: Mukul Joshi 
---
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  | 30 +--
 drivers/gpu/drm/amd/amdkfd/kfd_device.c   |  3 ++
 .../drm/amd/amdkfd/kfd_device_queue_manager.c |  3 +-
 drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c | 37 ++-
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 17 ++---
 drivers/gpu/drm/amd/amdkfd/kfd_process.c  | 21 ++-
 6 files changed, 64 insertions(+), 47 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index b7b16adb0615..b23caa78328b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1290,18 +1290,6 @@ static int kfd_ioctl_alloc_memory_of_gpu(struct file 
*filep,
return -EINVAL;
}
 
-   if (flags & KFD_IOC_ALLOC_MEM_FLAGS_DOORBELL) {
-   if (args->size != kfd_doorbell_process_slice(dev))
-   return -EINVAL;
-   offset = kfd_get_process_doorbells(dev, p);
-   } else if (flags & KFD_IOC_ALLOC_MEM_FLAGS_MMIO_REMAP) {
-   if (args->size != PAGE_SIZE)
-   return -EINVAL;
-   offset = amdgpu_amdkfd_get_mmio_remap_phys_addr(dev->kgd);
-   if (!offset)
-   return -ENOMEM;
-   }
-
mutex_lock(>mutex);
 
pdd = kfd_bind_process_to_device(dev, p);
@@ -1310,6 +1298,24 @@ static int kfd_ioctl_alloc_memory_of_gpu(struct file 
*filep,
goto err_unlock;
}
 
+   if (flags & KFD_IOC_ALLOC_MEM_FLAGS_DOORBELL) {
+   if (args->size != kfd_doorbell_process_slice(dev)) {
+   err = -EINVAL;
+   goto err_unlock;
+   }
+   offset = kfd_get_process_doorbells(dev, pdd);
+   } else if (flags & KFD_IOC_ALLOC_MEM_FLAGS_MMIO_REMAP) {
+   if (args->size != PAGE_SIZE) {
+   err = -EINVAL;
+   goto err_unlock;
+   }
+   offset = amdgpu_amdkfd_get_mmio_remap_phys_addr(dev->kgd);
+   if (!offset) {
+   err = -ENOMEM;
+   goto err_unlock;
+   }
+   }
+
err = amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
dev->kgd, args->va_addr, args->size,
pdd->vm, (struct kgd_mem **) , ,
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 0e71a0543f98..a857282f3d09 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -583,6 +583,8 @@ struct kfd_dev *kgd2kfd_probe(struct kgd_dev *kgd,
 
atomic_set(>sram_ecc_flag, 0);
 
+   ida_init(>doorbell_ida);
+
return kfd;
 }
 
@@ -798,6 +800,7 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd)
kfd_interrupt_exit(kfd);
kfd_topology_remove_device(kfd);
kfd_doorbell_fini(kfd);
+   ida_destroy(>doorbell_ida);
kfd_gtt_sa_fini(kfd);
amdgpu_amdkfd_free_gtt_mem(kfd->kgd, kfd->gtt_mem);
if (kfd->gws)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 560adc57a050..b9d1359c6fe0 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -191,9 +191,8 @@ static int allocate_doorbell(struct qcm_process_device 
*qpd, struct queue *q)
}
 
q->properties.doorbell_off =
-   kfd_get_doorbell_dw_offset_in_bar(dev, q->process,
+   kfd_get_doorbell_dw_offset_in_bar(dev, qpd_to_pdd(qpd),
  q->doorbell_id);
-
return 0;
 }
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
index 8e0c00b9555e..5946bfb6b75c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
@@ -31,9 +31,6 @@
  * kernel queues using the first doorbell page reserved for the kernel.
  */
 
-static DEFINE_IDA(doorbell_ida);
-static unsigned int max_doorbell_slices;
-
 /*
  * Each device exposes a doorbell aperture, a PCI MMIO aperture that
  * receives 32-bit writes that are passed to queues as wptr values.
@@ -84,9 +81,9 @@ int kfd_doorbell_init(struct kfd_dev *kfd)
else
return -ENOSPC;
 
-   

Re: [PATCH] drm/amdgpu: Fix a redundant kfree

2020-09-01 Thread Pan, Xinhui
[AMD Official Use Only - Internal Distribution Only]


The correct thing to do this is to
_leave the amdgpu_driver_release()_ alone,
remove "drmm_add_final_kfree()" and qualify
the WARN_ON() in drm_dev_register() by
the existence of drm_driver.release() (i.e. non-NULL).

Re:  this drm driver release callback is more like a release notify. It is 
called in the beginning of the total release sequence. As you have made drm 
device a member of adev. So you can not free adev in the driver release 
callback.

Maybe change the sequence of release,  say, put drm driver release in the end 
of total release sequence.
Or still use the final_kfree to free adev and our release callback just do some 
other cleanup work.

From: Tuikov, Luben 
Sent: Wednesday, September 2, 2020 4:35:32 AM
To: Alex Deucher ; Pan, Xinhui ; 
Daniel Vetter 
Cc: amd-gfx@lists.freedesktop.org ; Deucher, 
Alexander 
Subject: Re: [PATCH] drm/amdgpu: Fix a redundant kfree

On 2020-09-01 10:12 a.m., Alex Deucher wrote:
> On Tue, Sep 1, 2020 at 3:46 AM Pan, Xinhui  wrote:
>>
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> drm_dev_alloc() alloc *dev* and set managed.final_kfree to dev to free
>> itself.
>> Now from commit 5cdd68498918("drm/amdgpu: Embed drm_device into
>> amdgpu_device (v3)") we alloc *adev* and ddev is just a member of it.
>> So drm_dev_release try to free a wrong pointer then.
>>
>> Also driver's release trys to free adev, but drm_dev_release will
>> access dev after call drvier's release.
>>
>> To fix it, remove driver's release and set managed.final_kfree to adev.
>
> I've got to admit, the documentation around drm_dev_init is hard to
> follow.  We aren't really using the drmm stuff, but you have to use
> drmm_add_final_kfree to avoid a warning.  The logic seems to make
> sense though.
> Acked-by: Alex Deucher 

The logic in patch 3/3 uses the kref infrastructure
as described in drm_drv.c's comment on what the DRM
usage is, i.e. taking advantage of the kref infrastructure.

In amdgpu_pci_probe() we call drm_dev_init() which takes
a ref of 1 on the kref in the DRM device structure,
and from then on, only when the kref transitions
from non-zero to 0, do we free the container of
DRM device, and this is beautifully shown in the
kernel stack below (please take a look at the kernel
stack below).

Using a kref is very powerful as it is implicit:
when the kref transitions from non-zero to 0,
then call the release method.

Furthermore, we own the release method, and we
like that, as it is pure, as well as,
there may be more things we'd like to do in the future
before we free the amdgpu device: maybe free memory we're
using specifically for that PCI device, maybe write
some registers, maybe notify someone or something, etc.

On another note, adding "drmm_add_final_kfree()" in the middle
of amdgpu_pci_probe() seems hackish--it's neither part
of drm_dev_init() nor of drm_dev_register(). We really
don't need it, since we rely on the kref infrastructure
to tell us when to free the device, and if you'd look
at the beautiful stack below, it knows exactly when that is,
i.e. when to free it.

The correct thing to do this is to
_leave the amdgpu_driver_release()_ alone,
remove "drmm_add_final_kfree()" and qualify
the WARN_ON() in drm_dev_register() by
the existence of drm_driver.release() (i.e. non-NULL).

I'll submit a sequence of patches to fix this right.

Regards,
Luben

>
>>
>> [   36.269348] BUG: unable to handle page fault for address: a0c279940028
>> [   36.276841] #PF: supervisor read access in kernel mode
>> [   36.282434] #PF: error_code(0x) - not-present page
>> [   36.288053] PGD 676601067 P4D 676601067 PUD 86a414067 PMD 86a247067 PTE 
>> 8008066bf060
>> [   36.296868] Oops:  [#1] SMP DEBUG_PAGEALLOC NOPTI
>> [   36.302409] CPU: 4 PID: 1375 Comm: bash Tainted: G   O  
>> 5.9.0-rc2+ #46
>> [   36.310670] Hardware name: System manufacturer System Product Name/PRIME 
>> Z390-A, BIOS 1401 11/26/2019
>> [   36.320725] RIP: 0010:drm_managed_release+0x25/0x110 [drm]
>> [   36.326741] Code: 80 00 00 00 00 0f 1f 44 00 00 55 48 c7 c2 5a 9f 41 c0 
>> be 00 02 00 00 48 89 e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 08 <48> 
>> 8b 7f 18 e8 c2 10 ff ff 4d 8b 74 24 20 49 8d 44 24 5
>> [   36.347217] RSP: 0018:b9424141fce0 EFLAGS: 00010282
>> [   36.352931] RAX: 0006 RBX: a0c279940010 RCX: 
>> 0006
>> [   36.360718] RDX: c0419f5a RSI: 0200 RDI: 
>> a0c279940010
>> [   36.368503] RBP: b9424141fd10 R08: 0001 R09: 
>> 0001
>> [   36.376304] R10:  R11:  R12: 
>> a0c279940010
>> [   36.384070] R13: c0e2a000 R14: a0c26924e220 R15: 
>> fff2
>> [   36.391845] FS:  7fc4a277b740() GS:a0c288e0() 
>> knlGS:
>> [   36.400669] CS:  0010 DS:  ES:  CR0: 80050033
>> [   36.406937] CR2: 

[PATCH 1/2] drm/amdgpu/gmc9: print client id string for gfxhub

2020-09-01 Thread Alex Deucher
Print the name of the client rather than the number.  This
makes it easier to debug what block is causing the fault.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 30 +++
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 1ca79030e95e..7e86aee60c64 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -71,6 +71,22 @@
 #define mmDCHUBBUB_SDPIF_MMIO_CNTRL_0_BASE_IDX 
2
 
 
+static const char *gfxhub_client_ids[] = {
+   "CB",
+   "DB",
+   "IA",
+   "WD",
+   "CPF",
+   "CPC",
+   "CPG",
+   "RLC",
+   "TCP",
+   "SQC (inst)",
+   "SQC (data)",
+   "SQG",
+   "PA",
+};
+
 static const u32 golden_settings_vega10_hdp[] =
 {
0xf64, 0x0fff, 0x,
@@ -303,7 +319,7 @@ static int gmc_v9_0_process_interrupt(struct amdgpu_device 
*adev,
 {
struct amdgpu_vmhub *hub;
bool retry_fault = !!(entry->src_data[1] & 0x80);
-   uint32_t status = 0;
+   uint32_t status = 0, cid = 0;
u64 addr;
char hub_name[10];
 
@@ -340,6 +356,8 @@ static int gmc_v9_0_process_interrupt(struct amdgpu_device 
*adev,
RREG32(hub->vm_l2_pro_fault_status);
 
status = RREG32(hub->vm_l2_pro_fault_status);
+   cid = REG_GET_FIELD(status,
+   VM_L2_PROTECTION_FAULT_STATUS, CID);
WREG32_P(hub->vm_l2_pro_fault_cntl, 1, ~1);
}
 
@@ -362,9 +380,13 @@ static int gmc_v9_0_process_interrupt(struct amdgpu_device 
*adev,
dev_err(adev->dev,
"VM_L2_PROTECTION_FAULT_STATUS:0x%08X\n",
status);
-   dev_err(adev->dev, "\t Faulty UTCL2 client ID: 0x%lx\n",
-   REG_GET_FIELD(status,
-   VM_L2_PROTECTION_FAULT_STATUS, CID));
+   if (hub == >vmhub[AMDGPU_GFXHUB_0])
+   dev_err(adev->dev, "\t Faulty UTCL2 client ID: 
%s (0x%x)\n",
+   cid >= ARRAY_SIZE(gfxhub_client_ids) ? 
"unknown" : gfxhub_client_ids[cid],
+   cid);
+   else
+   dev_err(adev->dev, "\t Faulty UTCL2 client ID: 
0x%x\n",
+   cid);
dev_err(adev->dev, "\t MORE_FAULTS: 0x%lx\n",
REG_GET_FIELD(status,
VM_L2_PROTECTION_FAULT_STATUS, MORE_FAULTS));
-- 
2.25.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 2/2] drm/amdgpu/gmc10: print client id string for gfxhub

2020-09-01 Thread Alex Deucher
Print the name of the client rather than the number.  This
makes it easier to debug what block is causing the fault.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c | 30 +---
 drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c | 30 +---
 2 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
index 76acd7f7723e..b882ac59879a 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
@@ -31,6 +31,27 @@
 
 #include "soc15_common.h"
 
+static const char *gfxhub_client_ids[] = {
+   "CB/DB",
+   "Reserved",
+   "GE1",
+   "GE2",
+   "CPF",
+   "CPC",
+   "CPG",
+   "RLC",
+   "TCP",
+   "SQC (inst)",
+   "SQC (data)",
+   "SQG",
+   "Reserved",
+   "SDMA0",
+   "SDMA1",
+   "GCR",
+   "SDMA2",
+   "SDMA3",
+};
+
 static uint32_t gfxhub_v2_0_get_invalidate_req(unsigned int vmid,
   uint32_t flush_type)
 {
@@ -55,12 +76,15 @@ static void
 gfxhub_v2_0_print_l2_protection_fault_status(struct amdgpu_device *adev,
 uint32_t status)
 {
+   u32 cid = REG_GET_FIELD(status,
+   GCVM_L2_PROTECTION_FAULT_STATUS, CID);
+
dev_err(adev->dev,
"GCVM_L2_PROTECTION_FAULT_STATUS:0x%08X\n",
status);
-   dev_err(adev->dev, "\t Faulty UTCL2 client ID: 0x%lx\n",
-   REG_GET_FIELD(status,
-   GCVM_L2_PROTECTION_FAULT_STATUS, CID));
+   dev_err(adev->dev, "\t Faulty UTCL2 client ID: %s (0x%x)\n",
+   cid >= ARRAY_SIZE(gfxhub_client_ids) ? "unknown" : 
gfxhub_client_ids[cid],
+   cid);
dev_err(adev->dev, "\t MORE_FAULTS: 0x%lx\n",
REG_GET_FIELD(status,
GCVM_L2_PROTECTION_FAULT_STATUS, MORE_FAULTS));
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c 
b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
index 80c906a0383f..237a9ff5afa0 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
@@ -31,6 +31,27 @@
 
 #include "soc15_common.h"
 
+static const char *gfxhub_client_ids[] = {
+   "CB/DB",
+   "Reserved",
+   "GE1",
+   "GE2",
+   "CPF",
+   "CPC",
+   "CPG",
+   "RLC",
+   "TCP",
+   "SQC (inst)",
+   "SQC (data)",
+   "SQG",
+   "Reserved",
+   "SDMA0",
+   "SDMA1",
+   "GCR",
+   "SDMA2",
+   "SDMA3",
+};
+
 static uint32_t gfxhub_v2_1_get_invalidate_req(unsigned int vmid,
   uint32_t flush_type)
 {
@@ -55,12 +76,15 @@ static void
 gfxhub_v2_1_print_l2_protection_fault_status(struct amdgpu_device *adev,
 uint32_t status)
 {
+   u32 cid = REG_GET_FIELD(status,
+   GCVM_L2_PROTECTION_FAULT_STATUS, CID);
+
dev_err(adev->dev,
"GCVM_L2_PROTECTION_FAULT_STATUS:0x%08X\n",
status);
-   dev_err(adev->dev, "\t Faulty UTCL2 client ID: 0x%lx\n",
-   REG_GET_FIELD(status,
-   GCVM_L2_PROTECTION_FAULT_STATUS, CID));
+   dev_err(adev->dev, "\t Faulty UTCL2 client ID: %s (0x%x)\n",
+   cid >= ARRAY_SIZE(gfxhub_client_ids) ? "unknown" : 
gfxhub_client_ids[cid],
+   cid);
dev_err(adev->dev, "\t MORE_FAULTS: 0x%lx\n",
REG_GET_FIELD(status,
GCVM_L2_PROTECTION_FAULT_STATUS, MORE_FAULTS));
-- 
2.25.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 3/3] drm/amdgpu: Embed drm_device into amdgpu_device (v2)

2020-09-01 Thread Luben Tuikov
On 2020-09-01 9:49 a.m., Alex Deucher wrote:
> On Tue, Sep 1, 2020 at 3:44 AM Daniel Vetter  wrote:
>>
>> On Wed, Aug 19, 2020 at 01:00:42AM -0400, Luben Tuikov wrote:
>>> a) Embed struct drm_device into struct amdgpu_device.
>>> b) Modify the inline-f drm_to_adev() accordingly.
>>> c) Modify the inline-f adev_to_drm() accordingly.
>>> d) Eliminate the use of drm_device.dev_private,
>>>in amdgpu.
>>> e) Switch from using drm_dev_alloc() to
>>>drm_dev_init().
>>> f) Add a DRM driver release function, which frees
>>>the container amdgpu_device after all krefs on
>>>the contained drm_device have been released.
>>>
>>> v2: Split out adding adev_to_drm() into its own
>>> patch (previous commit), making this patch
>>> more succinct and clear. More detailed commit
>>> description.
>>>
>>> Signed-off-by: Luben Tuikov 
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu.h| 10 ++---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +++-
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 43 ++
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c| 20 +++---
>>>  4 files changed, 43 insertions(+), 45 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index 735480cc7dcf..107a6ec920f7 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -724,8 +724,8 @@ struct amd_powerplay {
>>>  #define AMDGPU_MAX_DF_PERFMONS 4
>>>  struct amdgpu_device {
>>>   struct device   *dev;
>>> - struct drm_device   *ddev;
>>>   struct pci_dev  *pdev;
>>> + struct drm_device   ddev;
>>>
>>>  #ifdef CONFIG_DRM_AMD_ACP
>>>   struct amdgpu_acp   acp;
>>> @@ -990,12 +990,12 @@ struct amdgpu_device {
>>>
>>>  static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
>>>  {
>>> - return ddev->dev_private;
>>> + return container_of(ddev, struct amdgpu_device, ddev);
>>>  }
>>>
>>>  static inline struct drm_device *adev_to_drm(struct amdgpu_device *adev)
>>>  {
>>> - return adev->ddev;
>>> + return >ddev;
>>>  }
>>>
>>>  static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device 
>>> *bdev)
>>> @@ -1004,8 +1004,6 @@ static inline struct amdgpu_device 
>>> *amdgpu_ttm_adev(struct ttm_bo_device *bdev)
>>>  }
>>>
>>>  int amdgpu_device_init(struct amdgpu_device *adev,
>>> -struct drm_device *ddev,
>>> -struct pci_dev *pdev,
>>>  uint32_t flags);
>>>  void amdgpu_device_fini(struct amdgpu_device *adev);
>>>  int amdgpu_gpu_wait_for_idle(struct amdgpu_device *adev);
>>> @@ -1195,7 +1193,7 @@ static inline void *amdgpu_atpx_get_dhandle(void) { 
>>> return NULL; }
>>>  extern const struct drm_ioctl_desc amdgpu_ioctls_kms[];
>>>  extern const int amdgpu_max_kms_ioctl;
>>>
>>> -int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags);
>>> +int amdgpu_driver_load_kms(struct amdgpu_device *adev, unsigned long 
>>> flags);
>>>  void amdgpu_driver_unload_kms(struct drm_device *dev);
>>>  void amdgpu_driver_lastclose_kms(struct drm_device *dev);
>>>  int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file 
>>> *file_priv);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 07012d71eeea..6e529548e708 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -1216,7 +1216,8 @@ static int amdgpu_device_check_arguments(struct 
>>> amdgpu_device *adev)
>>>   * Callback for the switcheroo driver.  Suspends or resumes the
>>>   * the asics before or after it is powered up using ACPI methods.
>>>   */
>>> -static void amdgpu_switcheroo_set_state(struct pci_dev *pdev, enum 
>>> vga_switcheroo_state state)
>>> +static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
>>> + enum vga_switcheroo_state state)
>>>  {
>>>   struct drm_device *dev = pci_get_drvdata(pdev);
>>>   int r;
>>> @@ -2977,8 +2978,6 @@ static const struct attribute 
>>> *amdgpu_dev_attributes[] = {
>>>   * amdgpu_device_init - initialize the driver
>>>   *
>>>   * @adev: amdgpu_device pointer
>>> - * @ddev: drm dev pointer
>>> - * @pdev: pci dev pointer
>>>   * @flags: driver flags
>>>   *
>>>   * Initializes the driver info and hw (all asics).
>>> @@ -2986,18 +2985,15 @@ static const struct attribute 
>>> *amdgpu_dev_attributes[] = {
>>>   * Called at driver startup.
>>>   */
>>>  int amdgpu_device_init(struct amdgpu_device *adev,
>>> -struct drm_device *ddev,
>>> -struct pci_dev *pdev,
>>>  uint32_t flags)
>>>  {
>>> + struct drm_device *ddev = adev_to_drm(adev);
>>> + struct pci_dev *pdev = adev->pdev;
>>>   int r, i;
>>>   bool boco = false;
>>>   u32 

Re: [PATCH] drm/amdgpu: Fix a redundant kfree

2020-09-01 Thread Luben Tuikov
On 2020-09-01 10:12 a.m., Alex Deucher wrote:
> On Tue, Sep 1, 2020 at 3:46 AM Pan, Xinhui  wrote:
>>
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> drm_dev_alloc() alloc *dev* and set managed.final_kfree to dev to free
>> itself.
>> Now from commit 5cdd68498918("drm/amdgpu: Embed drm_device into
>> amdgpu_device (v3)") we alloc *adev* and ddev is just a member of it.
>> So drm_dev_release try to free a wrong pointer then.
>>
>> Also driver's release trys to free adev, but drm_dev_release will
>> access dev after call drvier's release.
>>
>> To fix it, remove driver's release and set managed.final_kfree to adev.
> 
> I've got to admit, the documentation around drm_dev_init is hard to
> follow.  We aren't really using the drmm stuff, but you have to use
> drmm_add_final_kfree to avoid a warning.  The logic seems to make
> sense though.
> Acked-by: Alex Deucher 

The logic in patch 3/3 uses the kref infrastructure
as described in drm_drv.c's comment on what the DRM
usage is, i.e. taking advantage of the kref infrastructure.

In amdgpu_pci_probe() we call drm_dev_init() which takes
a ref of 1 on the kref in the DRM device structure,
and from then on, only when the kref transitions
from non-zero to 0, do we free the container of
DRM device, and this is beautifully shown in the
kernel stack below (please take a look at the kernel
stack below).

Using a kref is very powerful as it is implicit:
when the kref transitions from non-zero to 0,
then call the release method.

Furthermore, we own the release method, and we
like that, as it is pure, as well as,
there may be more things we'd like to do in the future
before we free the amdgpu device: maybe free memory we're
using specifically for that PCI device, maybe write
some registers, maybe notify someone or something, etc.

On another note, adding "drmm_add_final_kfree()" in the middle
of amdgpu_pci_probe() seems hackish--it's neither part
of drm_dev_init() nor of drm_dev_register(). We really
don't need it, since we rely on the kref infrastructure
to tell us when to free the device, and if you'd look
at the beautiful stack below, it knows exactly when that is,
i.e. when to free it.

The correct thing to do this is to
_leave the amdgpu_driver_release()_ alone,
remove "drmm_add_final_kfree()" and qualify
the WARN_ON() in drm_dev_register() by
the existence of drm_driver.release() (i.e. non-NULL).

I'll submit a sequence of patches to fix this right.

Regards,
Luben

> 
>>
>> [   36.269348] BUG: unable to handle page fault for address: a0c279940028
>> [   36.276841] #PF: supervisor read access in kernel mode
>> [   36.282434] #PF: error_code(0x) - not-present page
>> [   36.288053] PGD 676601067 P4D 676601067 PUD 86a414067 PMD 86a247067 PTE 
>> 8008066bf060
>> [   36.296868] Oops:  [#1] SMP DEBUG_PAGEALLOC NOPTI
>> [   36.302409] CPU: 4 PID: 1375 Comm: bash Tainted: G   O  
>> 5.9.0-rc2+ #46
>> [   36.310670] Hardware name: System manufacturer System Product Name/PRIME 
>> Z390-A, BIOS 1401 11/26/2019
>> [   36.320725] RIP: 0010:drm_managed_release+0x25/0x110 [drm]
>> [   36.326741] Code: 80 00 00 00 00 0f 1f 44 00 00 55 48 c7 c2 5a 9f 41 c0 
>> be 00 02 00 00 48 89 e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 08 <48> 
>> 8b 7f 18 e8 c2 10 ff ff 4d 8b 74 24 20 49 8d 44 24 5
>> [   36.347217] RSP: 0018:b9424141fce0 EFLAGS: 00010282
>> [   36.352931] RAX: 0006 RBX: a0c279940010 RCX: 
>> 0006
>> [   36.360718] RDX: c0419f5a RSI: 0200 RDI: 
>> a0c279940010
>> [   36.368503] RBP: b9424141fd10 R08: 0001 R09: 
>> 0001
>> [   36.376304] R10:  R11:  R12: 
>> a0c279940010
>> [   36.384070] R13: c0e2a000 R14: a0c26924e220 R15: 
>> fff2
>> [   36.391845] FS:  7fc4a277b740() GS:a0c288e0() 
>> knlGS:
>> [   36.400669] CS:  0010 DS:  ES:  CR0: 80050033
>> [   36.406937] CR2: a0c279940028 CR3: 000792304006 CR4: 
>> 003706e0
>> [   36.414732] DR0:  DR1:  DR2: 
>> 
>> [   36.422550] DR3:  DR6: fffe0ff0 DR7: 
>> 0400
>> [   36.430354] Call Trace:
>> [   36.433044]  drm_dev_put.part.0+0x40/0x60 [drm]
>> [   36.438017]  drm_dev_put+0x13/0x20 [drm]
>> [   36.442398]  amdgpu_pci_remove+0x56/0x60 [amdgpu]
>> [   36.447528]  pci_device_remove+0x3e/0xb0
>> [   36.451807]  device_release_driver_internal+0xff/0x1d0
>> [   36.457416]  device_release_driver+0x12/0x20
>> [   36.462094]  pci_stop_bus_device+0x70/0xa0
>> [   36.466588]  pci_stop_and_remove_bus_device_locked+0x1b/0x30
>> [   36.472786]  remove_store+0x7b/0x90
>> [   36.476614]  dev_attr_store+0x17/0x30
>> [   36.480646]  sysfs_kf_write+0x4b/0x60
>> [   36.484655]  kernfs_fop_write+0xe8/0x1d0
>> [   36.488952]  vfs_write+0xf5/0x230
>> [   36.492562]  ksys_write+0x70/0xf0
>> [   36.496206]  

[PATCH 2/2] drm/amdgpu: disable gpu-sched load balance for uvd

2020-09-01 Thread Nirmoy Das
On hardware with multiple uvd instances, dependent uvd
jobs may get scheduled to different uvd instances. Because
uvd jobs retain hw context, dependent jobs should always
run on the same uvd instance. This patch disables gpu scheduler's
load balancer for a context that binds jobs from same the
context to a uvd instance.

Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 59032c26fc82..fc392dfd1789 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -114,7 +114,10 @@ static int amdgpu_ctx_init_entity(struct amdgpu_ctx *ctx, 
u32 hw_ip,
scheds = adev->gpu_sched[hw_ip][hw_prio].sched;
num_scheds = adev->gpu_sched[hw_ip][hw_prio].num_scheds;

-   if (hw_ip == AMDGPU_HW_IP_VCN_ENC || hw_ip == AMDGPU_HW_IP_VCN_DEC) {
+   if (hw_ip == AMDGPU_HW_IP_UVD ||
+   hw_ip == AMDGPU_HW_IP_UVD_ENC ||
+   hw_ip == AMDGPU_HW_IP_VCN_ENC ||
+   hw_ip == AMDGPU_HW_IP_VCN_DEC) {
sched = drm_sched_pick_best(scheds, num_scheds);
scheds = 
num_scheds = 1;
--
2.28.0

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 1/2] Revert "drm/amdgpu: disable gpu-sched load balance for uvd"

2020-09-01 Thread Nirmoy Das
This reverts commit e0300ed8820d19fe108006cf1b69fa26f0b4e3fc.

We should also disable load balance for AMDGPU_HW_IP_UVD_ENC jobs.

Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 7cd398d25498..59032c26fc82 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -114,9 +114,7 @@ static int amdgpu_ctx_init_entity(struct amdgpu_ctx *ctx, 
u32 hw_ip,
scheds = adev->gpu_sched[hw_ip][hw_prio].sched;
num_scheds = adev->gpu_sched[hw_ip][hw_prio].num_scheds;
 
-   if (hw_ip == AMDGPU_HW_IP_VCN_ENC ||
-   hw_ip == AMDGPU_HW_IP_VCN_DEC ||
-   hw_ip == AMDGPU_HW_IP_UVD) {
+   if (hw_ip == AMDGPU_HW_IP_VCN_ENC || hw_ip == AMDGPU_HW_IP_VCN_DEC) {
sched = drm_sched_pick_best(scheds, num_scheds);
scheds = 
num_scheds = 1;
-- 
2.28.0

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 4/5] drm_dp_cec: add plumbing in preparation for MST support

2020-09-01 Thread Lyude Paul
Super minor nitpicks:

On Tue, 2020-09-01 at 16:22 +1000, Sam McNally wrote:
> From: Hans Verkuil 
> 
> Signed-off-by: Hans Verkuil 
> [sa...@chromium.org:
>  - rebased
>  - removed polling-related changes
>  - moved the calls to drm_dp_cec_(un)set_edid() into the next patch
> ]
> Signed-off-by: Sam McNally 
> ---
> 
>  .../display/amdgpu_dm/amdgpu_dm_mst_types.c   |  2 +-
>  drivers/gpu/drm/drm_dp_cec.c  | 22 ++-
>  drivers/gpu/drm/i915/display/intel_dp.c   |  2 +-
>  drivers/gpu/drm/nouveau/nouveau_connector.c   |  2 +-
>  include/drm/drm_dp_helper.h   |  6 +++--
>  5 files changed, 19 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
> index 461fa4da0a34..6e7075893ec9 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
> @@ -419,7 +419,7 @@ void amdgpu_dm_initialize_dp_connector(struct
> amdgpu_display_manager *dm,
>  
>   drm_dp_aux_init(>dm_dp_aux.aux);
>   drm_dp_cec_register_connector(>dm_dp_aux.aux,
> -   >base);
> +   >base, false);
>  
>   if (aconnector->base.connector_type == DRM_MODE_CONNECTOR_eDP)
>   return;
> diff --git a/drivers/gpu/drm/drm_dp_cec.c b/drivers/gpu/drm/drm_dp_cec.c
> index 3ab2609f9ec7..04ab7b88055c 100644
> --- a/drivers/gpu/drm/drm_dp_cec.c
> +++ b/drivers/gpu/drm/drm_dp_cec.c
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /*
>   * Unfortunately it turns out that we have a chicken-and-egg situation
> @@ -338,8 +339,6 @@ void drm_dp_cec_set_edid(struct drm_dp_aux *aux, const
> struct edid *edid)
>   if (aux->cec.adap) {
>   if (aux->cec.adap->capabilities == cec_caps &&
>   aux->cec.adap->available_log_addrs == num_las) {
> - /* Unchanged, so just set the phys addr */
> - cec_s_phys_addr_from_edid(aux->cec.adap, edid);
>   goto unlock;
>   }

May as well drop the braces here

>   /*
> @@ -364,15 +363,16 @@ void drm_dp_cec_set_edid(struct drm_dp_aux *aux, const
> struct edid *edid)
>   if (cec_register_adapter(aux->cec.adap, connector->dev->dev)) {
>   cec_delete_adapter(aux->cec.adap);
>   aux->cec.adap = NULL;
> - } else {
> - /*
> -  * Update the phys addr for the new CEC adapter. When called
> -  * from drm_dp_cec_register_connector() edid == NULL, so in
> -  * that case the phys addr is just invalidated.
> -  */
> - cec_s_phys_addr_from_edid(aux->cec.adap, edid);
>   }
>  unlock:
> + /*
> +  * Update the phys addr for the new CEC adapter. When called
> +  * from drm_dp_cec_register_connector() edid == NULL, so in
> +  * that case the phys addr is just invalidated.
> +  */
> + if (aux->cec.adap && edid) {
> + cec_s_phys_addr_from_edid(aux->cec.adap, edid);
> + }

And here

>   mutex_unlock(>cec.lock);
>  }
>  EXPORT_SYMBOL(drm_dp_cec_set_edid);
> @@ -418,6 +418,7 @@ EXPORT_SYMBOL(drm_dp_cec_unset_edid);
>   * drm_dp_cec_register_connector() - register a new connector
>   * @aux: DisplayPort AUX channel
>   * @connector: drm connector
> + * @is_mst: set to true if this is an MST branch
>   *
>   * A new connector was registered with associated CEC adapter name and
>   * CEC adapter parent device. After registering the name and parent
> @@ -425,12 +426,13 @@ EXPORT_SYMBOL(drm_dp_cec_unset_edid);
>   * CEC and to register a CEC adapter if that is the case.
>   */
>  void drm_dp_cec_register_connector(struct drm_dp_aux *aux,
> -struct drm_connector *connector)
> +struct drm_connector *connector, bool is_mst)
>  {
>   WARN_ON(aux->cec.adap);
>   if (WARN_ON(!aux->transfer))
>   return;
>   aux->cec.connector = connector;
> + aux->cec.is_mst = is_mst;

Also JFYI, you can also check aux->is_remote, but maybe you've got another
reason for copying this here

Either way:

Reviewed-by: Lyude Paul 

...Also, maybe this is just a coincidence - but do I know your name from
somewhere? Perhaps an IRC community from long ago?

>   INIT_DELAYED_WORK(>cec.unregister_work,
> drm_dp_cec_unregister_work);
>  }
> diff --git a/drivers/gpu/drm/i915/display/intel_dp.c
> b/drivers/gpu/drm/i915/display/intel_dp.c
> index 82b9de274f65..744cb55572f9 100644
> --- a/drivers/gpu/drm/i915/display/intel_dp.c
> +++ b/drivers/gpu/drm/i915/display/intel_dp.c
> @@ -6261,7 +6261,7 @@ intel_dp_connector_register(struct drm_connector
> *connector)
>   intel_dp->aux.dev = connector->kdev;
>   ret = drm_dp_aux_register(_dp->aux);
>   if 

Re: [PATCH] drm/radeon: Reset ASIC if suspend is not managed by platform firmware

2020-09-01 Thread Alex Deucher
On Tue, Sep 1, 2020 at 12:21 PM Kai-Heng Feng
 wrote:
>
>
>
> > On Sep 1, 2020, at 22:19, Alex Deucher  wrote:
> >
> > On Tue, Sep 1, 2020 at 3:32 AM Kai-Heng Feng
> >  wrote:
> >>
> >> Suspend with s2idle or by the following steps cause screen frozen:
> >> # echo devices > /sys/power/pm_test
> >> # echo freeze > /sys/power/mem
> >>
> >> [  289.625461] [drm:uvd_v1_0_ib_test [radeon]] *ERROR* radeon: fence wait 
> >> timed out.
> >> [  289.625494] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed 
> >> testing IB on ring 5 (-110).
> >>
> >> The issue doesn't happen on traditional S3, probably because firmware or
> >> hardware provides extra power management.
> >>
> >> Inspired by Daniel Drake's patch [1] on amdgpu, using a similar approach
> >> can fix the issue.
> >
> > It doesn't actually fix the issue.  The device is never powered down
> > so you are using more power than you would if you did not suspend in
> > the first place.  The reset just works around the fact that the device
> > is never powered down.
>
> So how do we properly suspend/resume the device without help from platform 
> firmware?

I guess you don't?

Alex


>
> Kai-Heng
>
> >
> > Alex
> >
> >>
> >> [1] https://patchwork.freedesktop.org/patch/335839/
> >>
> >> Signed-off-by: Kai-Heng Feng 
> >> ---
> >> drivers/gpu/drm/radeon/radeon_device.c | 3 +++
> >> 1 file changed, 3 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/radeon/radeon_device.c 
> >> b/drivers/gpu/drm/radeon/radeon_device.c
> >> index 266e3cbbd09b..df823b9ad79f 100644
> >> --- a/drivers/gpu/drm/radeon/radeon_device.c
> >> +++ b/drivers/gpu/drm/radeon/radeon_device.c
> >> @@ -33,6 +33,7 @@
> >> #include 
> >> #include 
> >> #include 
> >> +#include 
> >>
> >> #include 
> >> #include 
> >> @@ -1643,6 +1644,8 @@ int radeon_suspend_kms(struct drm_device *dev, bool 
> >> suspend,
> >>rdev->asic->asic_reset(rdev, true);
> >>pci_restore_state(dev->pdev);
> >>} else if (suspend) {
> >> +   if (pm_suspend_no_platform())
> >> +   rdev->asic->asic_reset(rdev, true);
> >>/* Shut down the device */
> >>pci_disable_device(dev->pdev);
> >>pci_set_power_state(dev->pdev, PCI_D3hot);
> >> --
> >> 2.17.1
> >>
> >> ___
> >> dri-devel mailing list
> >> dri-de...@lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/radeon: Reset ASIC if suspend is not managed by platform firmware

2020-09-01 Thread Kai-Heng Feng



> On Sep 1, 2020, at 22:19, Alex Deucher  wrote:
> 
> On Tue, Sep 1, 2020 at 3:32 AM Kai-Heng Feng
>  wrote:
>> 
>> Suspend with s2idle or by the following steps cause screen frozen:
>> # echo devices > /sys/power/pm_test
>> # echo freeze > /sys/power/mem
>> 
>> [  289.625461] [drm:uvd_v1_0_ib_test [radeon]] *ERROR* radeon: fence wait 
>> timed out.
>> [  289.625494] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed 
>> testing IB on ring 5 (-110).
>> 
>> The issue doesn't happen on traditional S3, probably because firmware or
>> hardware provides extra power management.
>> 
>> Inspired by Daniel Drake's patch [1] on amdgpu, using a similar approach
>> can fix the issue.
> 
> It doesn't actually fix the issue.  The device is never powered down
> so you are using more power than you would if you did not suspend in
> the first place.  The reset just works around the fact that the device
> is never powered down.

So how do we properly suspend/resume the device without help from platform 
firmware?

Kai-Heng

> 
> Alex
> 
>> 
>> [1] https://patchwork.freedesktop.org/patch/335839/
>> 
>> Signed-off-by: Kai-Heng Feng 
>> ---
>> drivers/gpu/drm/radeon/radeon_device.c | 3 +++
>> 1 file changed, 3 insertions(+)
>> 
>> diff --git a/drivers/gpu/drm/radeon/radeon_device.c 
>> b/drivers/gpu/drm/radeon/radeon_device.c
>> index 266e3cbbd09b..df823b9ad79f 100644
>> --- a/drivers/gpu/drm/radeon/radeon_device.c
>> +++ b/drivers/gpu/drm/radeon/radeon_device.c
>> @@ -33,6 +33,7 @@
>> #include 
>> #include 
>> #include 
>> +#include 
>> 
>> #include 
>> #include 
>> @@ -1643,6 +1644,8 @@ int radeon_suspend_kms(struct drm_device *dev, bool 
>> suspend,
>>rdev->asic->asic_reset(rdev, true);
>>pci_restore_state(dev->pdev);
>>} else if (suspend) {
>> +   if (pm_suspend_no_platform())
>> +   rdev->asic->asic_reset(rdev, true);
>>/* Shut down the device */
>>pci_disable_device(dev->pdev);
>>pci_set_power_state(dev->pdev, PCI_D3hot);
>> --
>> 2.17.1
>> 
>> ___
>> dri-devel mailing list
>> dri-de...@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/dri-devel

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/kfd: fix a system crash issue during GPU recovery

2020-09-01 Thread Li, Dennis
[AMD Official Use Only - Internal Distribution Only]

Hi, Felix,
 If GPU hang, execute_queues_cpsch will fail to unmap or map queues and 
then create_queue_cpsch will return error. If pqm_create_queue find 
create_queue_cpsch failed, it will call uninit_queue to free queue object. 
However this queue object has been added in to qpd->queues_list in the old 
code. 

Best Regards
Dennis Li

-Original Message-
From: Kuehling, Felix  
Sent: Tuesday, September 1, 2020 9:26 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, 
Alexander ; Zhang, Hawking ; 
Koenig, Christian 
Subject: Re: [PATCH] drm/kfd: fix a system crash issue during GPU recovery

I'm not sure how the bug you're fixing is caused, but your fix is clearly in 
the wrong place.

A queue being disabled is not the same thing as a queue being destroyed.
Queues can be disabled for legitimate reasons, but they still should exist and 
be in the qpd->queues_list.

If a destroyed queue is left on the qpd->queues_list, that would be a problem. 
Can you point out where such a thing is happening?

Thanks,
  Felix


Am 2020-08-31 um 9:36 p.m. schrieb Dennis Li:
> The crash log as the below:
>
> [Thu Aug 20 23:18:14 2020] general protection fault:  [#1] SMP NOPTI
> [Thu Aug 20 23:18:14 2020] CPU: 152 PID: 1837 Comm: kworker/152:1 Tainted: G  
>  OE 5.4.0-42-generic #46~18.04.1-Ubuntu
> [Thu Aug 20 23:18:14 2020] Hardware name: GIGABYTE 
> G482-Z53-YF/MZ52-G40-00, BIOS R12 05/13/2020 [Thu Aug 20 23:18:14 
> 2020] Workqueue: events amdgpu_ras_do_recovery [amdgpu] [Thu Aug 20 
> 23:18:14 2020] RIP: 0010:evict_process_queues_cpsch+0xc9/0x130 
> [amdgpu] [Thu Aug 20 23:18:14 2020] Code: 49 8d 4d 10 48 39 c8 75 21 
> eb 44 83 fa 03 74 36 80 78 72 00 74 0c 83 ab 68 01 00 00 01 41 c6 45 
> 41 00 48 8b 00 48 39 c8 74 25 <80> 78 70 00 c6 40 6d 01 74 ee 8b 50 28 
> c6 40 70 00 83 ab 60 01 00 [Thu Aug 20 23:18:14 2020] RSP: 
> 0018:b29b52f6fc90 EFLAGS: 00010213 [Thu Aug 20 23:18:14 2020] RAX: 
> 1c884edb0a118914 RBX: 8a0d45ff3c00 RCX: 8a2d83e41038 [Thu Aug 
> 20 23:18:14 2020] RDX:  RSI: 0082 RDI: 
> 8a0e2e4178c0 [Thu Aug 20 23:18:14 2020] RBP: b29b52f6fcb0 R08: 
> 1b64 R09: 0004 [Thu Aug 20 23:18:14 2020] R10: 
> b29b52f6fb78 R11: 0001 R12: 8a0d45ff3d28 [Thu Aug 20 
> 23:18:14 2020] R13: 8a2d83e41028 R14:  R15: 
>  [Thu Aug 20 23:18:14 2020] FS:  () 
> GS:8a0e2e40() knlGS: [Thu Aug 20 23:18:14 2020] 
> CS:  0010 DS:  ES:  CR0: 80050033 [Thu Aug 20 23:18:14 2020] 
> CR2: 55c783c0e6a8 CR3: 0034a1284000 CR4: 00340ee0 [Thu Aug 20 
> 23:18:14 2020] Call Trace:
> [Thu Aug 20 23:18:14 2020]  kfd_process_evict_queues+0x43/0xd0 
> [amdgpu] [Thu Aug 20 23:18:14 2020]  
> kfd_suspend_all_processes+0x60/0xf0 [amdgpu] [Thu Aug 20 23:18:14 
> 2020]  kgd2kfd_suspend.part.7+0x43/0x50 [amdgpu] [Thu Aug 20 23:18:14 
> 2020]  kgd2kfd_pre_reset+0x46/0x60 [amdgpu] [Thu Aug 20 23:18:14 2020]  
> amdgpu_amdkfd_pre_reset+0x1a/0x20 [amdgpu] [Thu Aug 20 23:18:14 2020]  
> amdgpu_device_gpu_recover+0x377/0xf90 [amdgpu] [Thu Aug 20 23:18:14 
> 2020]  ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [Thu Aug 20 
> 23:18:14 2020]  amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [Thu Aug 
> 20 23:18:14 2020]  process_one_work+0x20f/0x400 [Thu Aug 20 23:18:14 
> 2020]  worker_thread+0x34/0x410
>
> When GPU hang, user process will fail to create a compute queue whose 
> struct object will be freed later, but driver wrongly add this queue 
> to queue list of the proccess. And then kfd_process_evict_queues will 
> access a freed memory, which cause a system crash.
>
> Signed-off-by: Dennis Li 
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> index 560adc57a050..d5e6b07ffb27 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -1296,16 +1296,18 @@ static int create_queue_cpsch(struct 
> device_queue_manager *dqm, struct queue *q,
>   mqd_mgr->init_mqd(mqd_mgr, >mqd, q->mqd_mem_obj,
>   >gart_mqd_addr, >properties);
>  
> - list_add(>list, >queues_list);
> - qpd->queue_count++;
> -
>   if (q->properties.is_active) {
>   increment_queue_count(dqm, q->properties.type);
>  
>   retval = execute_queues_cpsch(dqm,
>   KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0);
> + if (retval)
> + goto out_execute_cpsch;
>   }
>  
> + list_add(>list, >queues_list);
> + qpd->queue_count++;
> +
>   /*
>* Unconditionally increment this counter, regardless of the queue's
>* type or whether the queue is active.
> @@ -1318,6 +1320,9 @@ static int create_queue_cpsch(struct 
> 

Re: [PATCH] drm/amdgpu: block ring buffer access during GPU recovery

2020-09-01 Thread Andrey Grodzovsky

Now i get it, I missed the 'else' part.

Acked-by: Andrey Grodzovsky 

Andrey

On 8/31/20 10:45 PM, Li, Dennis wrote:

[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey,
 
RE- Isn't adev->reset_sem non-recursive ? How this works when you try to access registers from within GPU reset thread while adev->reset_sem is already write locked from amdgpu_device_lock_adev earlier in the same thread ?


Deli: down_read_trylock will fail in this case, return false immediately and will 
not lock adev->reset_sem. In GPU reset thread, we should use MMIO instead of 
KIQ to access registers.

Best Regards
Dennis Li
-Original Message-
From: Grodzovsky, Andrey 
Sent: Tuesday, September 1, 2020 9:40 AM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Kuehling, Felix ; Zhang, Hawking 
; Koenig, Christian 
Subject: Re: [PATCH] drm/amdgpu: block ring buffer access during GPU recovery


On 8/31/20 9:17 PM, Dennis Li wrote:

When GPU is in reset, its status isn't stable and ring buffer also
need be reset when resuming. Therefore driver should protect GPU
recovery thread from ring buffer accessed by other threads. Otherwise
GPU will randomly hang during recovery.

Signed-off-by: Dennis Li 

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 172dc47b7f39..8db56a22cd1b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -319,8 +319,13 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, 
uint32_t reg,
   {
uint32_t ret;
   
-	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))

-   return amdgpu_kiq_rreg(adev, reg);
+   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
+   amdgpu_sriov_runtime(adev) &&
+   down_read_trylock(>reset_sem)) {
+   ret = amdgpu_kiq_rreg(adev, reg);
+   up_read(>reset_sem);
+   return ret;
+   }


Isn't adev->reset_sem non-recursive ? How this works when you try to access 
registers from within GPU reset thread while adev->reset_sem is already write 
locked from amdgpu_device_lock_adev earlier in the same thread ?

Andrey


   
   	if ((reg * 4) < adev->rmmio_size)

ret = readl(((void __iomem *)adev->rmmio) + (reg * 4)); @@ 
-332,6
+337,7 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
ret = readl(((void __iomem *)adev->rmmio) + (mmMM_DATA * 4));
spin_unlock_irqrestore(>mmio_idx_lock, flags);
}
+
trace_amdgpu_mm_rreg(adev->pdev->device, reg, ret);
return ret;
   }
@@ -407,8 +413,13 @@ void static inline amdgpu_mm_wreg_mmio(struct 
amdgpu_device *adev, uint32_t reg,
   void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
uint32_t acc_flags)
   {
-   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
-   return amdgpu_kiq_wreg(adev, reg, v);
+   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
+   amdgpu_sriov_runtime(adev) &&
+   down_read_trylock(>reset_sem)) {
+   amdgpu_kiq_wreg(adev, reg, v);
+   up_read(>reset_sem);
+   return;
+   }
   
   	amdgpu_mm_wreg_mmio(adev, reg, v, acc_flags);

   }
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index ad9ad622ccce..4ea2a065daa9 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -287,7 +287,7 @@ static void gmc_v10_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
 */
if (adev->gfx.kiq.ring.sched.ready &&
(amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) &&
-   !amdgpu_in_reset(adev)) {
+   down_read_trylock(>reset_sem)) {
   
   		struct amdgpu_vmhub *hub = >vmhub[vmhub];

const unsigned eng = 17;
@@ -297,6 +297,8 @@ static void gmc_v10_0_flush_gpu_tlb(struct
amdgpu_device *adev, uint32_t vmid,
   
   		amdgpu_virt_kiq_reg_write_reg_wait(adev, req, ack, inv_req,

1 << vmid);
+
+   up_read(>reset_sem);
return;
}
   
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c

b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index e1a0ae327cf5..33b7cf1c79ec 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -501,12 +501,13 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
 */
if (adev->gfx.kiq.ring.sched.ready &&
(amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) 
&&
-   !amdgpu_in_reset(adev)) {
+   down_read_trylock(>reset_sem)) {
uint32_t req = hub->vm_inv_eng0_req + hub->eng_distance * eng;
uint32_t ack = hub->vm_inv_eng0_ack + hub->eng_distance * eng;
   
   		

Re: [PATCH] drm/radeon: Reset ASIC if suspend is not managed by platform firmware

2020-09-01 Thread Alex Deucher
On Tue, Sep 1, 2020 at 3:32 AM Kai-Heng Feng
 wrote:
>
> Suspend with s2idle or by the following steps cause screen frozen:
>  # echo devices > /sys/power/pm_test
>  # echo freeze > /sys/power/mem
>
> [  289.625461] [drm:uvd_v1_0_ib_test [radeon]] *ERROR* radeon: fence wait 
> timed out.
> [  289.625494] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed 
> testing IB on ring 5 (-110).
>
> The issue doesn't happen on traditional S3, probably because firmware or
> hardware provides extra power management.
>
> Inspired by Daniel Drake's patch [1] on amdgpu, using a similar approach
> can fix the issue.

It doesn't actually fix the issue.  The device is never powered down
so you are using more power than you would if you did not suspend in
the first place.  The reset just works around the fact that the device
is never powered down.

Alex

>
> [1] https://patchwork.freedesktop.org/patch/335839/
>
> Signed-off-by: Kai-Heng Feng 
> ---
>  drivers/gpu/drm/radeon/radeon_device.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/radeon/radeon_device.c 
> b/drivers/gpu/drm/radeon/radeon_device.c
> index 266e3cbbd09b..df823b9ad79f 100644
> --- a/drivers/gpu/drm/radeon/radeon_device.c
> +++ b/drivers/gpu/drm/radeon/radeon_device.c
> @@ -33,6 +33,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -1643,6 +1644,8 @@ int radeon_suspend_kms(struct drm_device *dev, bool 
> suspend,
> rdev->asic->asic_reset(rdev, true);
> pci_restore_state(dev->pdev);
> } else if (suspend) {
> +   if (pm_suspend_no_platform())
> +   rdev->asic->asic_reset(rdev, true);
> /* Shut down the device */
> pci_disable_device(dev->pdev);
> pci_set_power_state(dev->pdev, PCI_D3hot);
> --
> 2.17.1
>
> ___
> dri-devel mailing list
> dri-de...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: Fix a redundant kfree

2020-09-01 Thread Alex Deucher
On Tue, Sep 1, 2020 at 3:46 AM Pan, Xinhui  wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> drm_dev_alloc() alloc *dev* and set managed.final_kfree to dev to free
> itself.
> Now from commit 5cdd68498918("drm/amdgpu: Embed drm_device into
> amdgpu_device (v3)") we alloc *adev* and ddev is just a member of it.
> So drm_dev_release try to free a wrong pointer then.
>
> Also driver's release trys to free adev, but drm_dev_release will
> access dev after call drvier's release.
>
> To fix it, remove driver's release and set managed.final_kfree to adev.

I've got to admit, the documentation around drm_dev_init is hard to
follow.  We aren't really using the drmm stuff, but you have to use
drmm_add_final_kfree to avoid a warning.  The logic seems to make
sense though.
Acked-by: Alex Deucher 

>
> [   36.269348] BUG: unable to handle page fault for address: a0c279940028
> [   36.276841] #PF: supervisor read access in kernel mode
> [   36.282434] #PF: error_code(0x) - not-present page
> [   36.288053] PGD 676601067 P4D 676601067 PUD 86a414067 PMD 86a247067 PTE 
> 8008066bf060
> [   36.296868] Oops:  [#1] SMP DEBUG_PAGEALLOC NOPTI
> [   36.302409] CPU: 4 PID: 1375 Comm: bash Tainted: G   O  
> 5.9.0-rc2+ #46
> [   36.310670] Hardware name: System manufacturer System Product Name/PRIME 
> Z390-A, BIOS 1401 11/26/2019
> [   36.320725] RIP: 0010:drm_managed_release+0x25/0x110 [drm]
> [   36.326741] Code: 80 00 00 00 00 0f 1f 44 00 00 55 48 c7 c2 5a 9f 41 c0 be 
> 00 02 00 00 48 89 e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 08 <48> 8b 
> 7f 18 e8 c2 10 ff ff 4d 8b 74 24 20 49 8d 44 24 5
> [   36.347217] RSP: 0018:b9424141fce0 EFLAGS: 00010282
> [   36.352931] RAX: 0006 RBX: a0c279940010 RCX: 
> 0006
> [   36.360718] RDX: c0419f5a RSI: 0200 RDI: 
> a0c279940010
> [   36.368503] RBP: b9424141fd10 R08: 0001 R09: 
> 0001
> [   36.376304] R10:  R11:  R12: 
> a0c279940010
> [   36.384070] R13: c0e2a000 R14: a0c26924e220 R15: 
> fff2
> [   36.391845] FS:  7fc4a277b740() GS:a0c288e0() 
> knlGS:
> [   36.400669] CS:  0010 DS:  ES:  CR0: 80050033
> [   36.406937] CR2: a0c279940028 CR3: 000792304006 CR4: 
> 003706e0
> [   36.414732] DR0:  DR1:  DR2: 
> 
> [   36.422550] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [   36.430354] Call Trace:
> [   36.433044]  drm_dev_put.part.0+0x40/0x60 [drm]
> [   36.438017]  drm_dev_put+0x13/0x20 [drm]
> [   36.442398]  amdgpu_pci_remove+0x56/0x60 [amdgpu]
> [   36.447528]  pci_device_remove+0x3e/0xb0
> [   36.451807]  device_release_driver_internal+0xff/0x1d0
> [   36.457416]  device_release_driver+0x12/0x20
> [   36.462094]  pci_stop_bus_device+0x70/0xa0
> [   36.466588]  pci_stop_and_remove_bus_device_locked+0x1b/0x30
> [   36.472786]  remove_store+0x7b/0x90
> [   36.476614]  dev_attr_store+0x17/0x30
> [   36.480646]  sysfs_kf_write+0x4b/0x60
> [   36.484655]  kernfs_fop_write+0xe8/0x1d0
> [   36.488952]  vfs_write+0xf5/0x230
> [   36.492562]  ksys_write+0x70/0xf0
> [   36.496206]  __x64_sys_write+0x1a/0x20
> [   36.500292]  do_syscall_64+0x38/0x90
> [   36.504219]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Signed-off-by: xinhui pan 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 10 +-
>  1 file changed, 1 insertion(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index c12e9acd421d..52fc0c6c8538 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -1165,7 +1165,7 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
>  if (ret)
>  goto err_free;
>
> -drmm_add_final_kfree(ddev, ddev);
> +drmm_add_final_kfree(ddev, adev);
>
>  if (!supports_atomic)
>  ddev->driver_features &= ~DRIVER_ATOMIC;
> @@ -1221,13 +1221,6 @@ amdgpu_pci_remove(struct pci_dev *pdev)
>  drm_dev_put(dev);
>  }
>
> -static void amdgpu_driver_release(struct drm_device *ddev)
> -{
> -struct amdgpu_device *adev = drm_to_adev(ddev);
> -
> -kfree(adev);
> -}
> -
>  static void
>  amdgpu_pci_shutdown(struct pci_dev *pdev)
>  {
> @@ -1522,7 +1515,6 @@ static struct drm_driver kms_driver = {
>  .open = amdgpu_driver_open_kms,
>  .postclose = amdgpu_driver_postclose_kms,
>  .lastclose = amdgpu_driver_lastclose_kms,
> -.release   = amdgpu_driver_release,
>  .irq_handler = amdgpu_irq_handler,
>  .ioctls = amdgpu_ioctls_kms,
>  .gem_free_object_unlocked = amdgpu_gem_object_free,
> --
> 2.25.1
>
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org

Re: [PATCH] drm/amdgpu/dc: Require primary plane to be enabled whenever the CRTC is

2020-09-01 Thread Harry Wentland


On 2020-09-01 3:54 a.m., Daniel Vetter wrote:
> On Wed, Aug 26, 2020 at 11:24:23AM +0300, Pekka Paalanen wrote:
>> On Tue, 25 Aug 2020 12:58:19 -0400
>> "Kazlauskas, Nicholas"  wrote:
>>
>>> On 2020-08-22 5:59 a.m., Michel Dänzer wrote:
 On 2020-08-21 8:07 p.m., Kazlauskas, Nicholas wrote:  
> On 2020-08-21 12:57 p.m., Michel Dänzer wrote:  
>> From: Michel Dänzer 
>>
>> Don't check drm_crtc_state::active for this either, per its
>> documentation in include/drm/drm_crtc.h:
>>
>>    * Hence drivers must not consult @active in their various
>>    * _mode_config_funcs.atomic_check callback to reject an atomic
>>    * commit.
>>
>> The atomic helpers disable the CRTC as needed for disabling the primary
>> plane.
>>
>> This prevents at least the following problems if the primary plane gets
>> disabled (e.g. due to destroying the FB assigned to the primary plane,
>> as happens e.g. with mutter in Wayland mode):
>>
>> * Toggling CRTC active to 1 failed if the cursor plane was enabled
>>     (e.g. via legacy DPMS property & cursor ioctl).
>> * Enabling the cursor plane failed, e.g. via the legacy cursor ioctl.  
>
> We previously had the requirement that the primary plane must be enabled
> but some userspace expects that they can enable just the overlay plane
> without anything else.
>
> I think the chromuiumos atomictest validates that this works as well:
>
> So is DRM going forward then with the expectation that this is wrong
> behavior from userspace?
>
> We require at least one plane to be enabled to display a cursor, but it
> doesn't necessarily need to be the primary.  

 It's a "pick your poison" situation:

 1) Currently the checks are invalid (atomic_check must not decide based
 on drm_crtc_state::active), and it's easy for legacy KMS userspace to
 accidentally hit errors trying to enable/move the cursor or switch DPMS
 off → on.

 2) Accurately rejecting only atomic states where the cursor plane is
 enabled but all other planes are off would break the KMS helper code,
 which can only deal with the "CRTC on & primary plane off is not
 allowed" case specifically.

 3) This patch addresses 1) & 2) but may break existing atomic userspace
 which wants to enable an overlay plane while disabling the primary plane.


 I do think in principle atomic userspace is expected to handle case 3)
 and leave the primary plane enabled. However, this is not ideal from an
 energy consumption PoV. Therefore, here's another idea for a possible
 way out of this quagmire:

 amdgpu_dm does not reject any atomic states based on which planes are
 enabled in it. If the cursor plane is enabled but all other planes are
 off, amdgpu_dm internally either:

 a) Enables an overlay plane and makes it invisible, e.g. by assigning a
 minimum size FB with alpha = 0.

 b) Enables the primary plane and assigns a minimum size FB (scaled up to
 the required size) containing all black, possibly using compression.
 (Trying to minimize the memory bandwidth)


 Does either of these seem feasible? If both do, which one would be
 preferable?

   
>>>
>>> It's really the same solution since DCN doesn't make a distinction 
>>> between primary or overlay planes in hardware. DCE doesn't have overlay 
>>> planes enabled so this is not relevant there.
>>>
>>> The old behavior (pre 5.1?) was to silently accept the commit even 
>>> though the screen would be completely black instead of outright 
>>> rejecting the commit.
>>>
>>> I almost wonder if that makes more sense in the short term here since 
>>> the only "userspace" affected here is IGT. We'll fail the CRC checks, 
>>> but no userspace actually tries to actively use a cursor with no primary 
>>> plane enabled from my understanding.
>>
>> Hi,
>>
>> I believe that there exists userspace that will *accidentally* attempt
>> to update the cursor plane while primary plane or whole CRTC is off.
>> Some versions of Mutter might do that on racy conditions, I suspect.
>> These are legacy KMS users, not atomic KMS.
>>
>> However, I do not believe there exists any userspace that would
>> actually expect the display to show the cursor plane alone without a
>> primary plane. Therefore I'd be ok with legacy cursor ioctls silently
>> succeeding. Atomic commits not. So the difference has to be in the
>> translation from legacy UAPI to kernel internal atomic interface.
>>
>>> In the long term I think we can work on getting cursor actually on the 
>>> screen in this case, though I can't say I really like having to reserve 
>>> some small buffer (eg. 16x16) for allowing lightup on this corner case.
>>
>> Why would you bother implementing that?
>>
>> Is there really an IGT test that unconditionally demands cursor plane
>> to be usable without 

Re: 回复: [PATCH V2] drm/amdgpu: Do not move root PT bo to relocated list

2020-09-01 Thread Christian König
Agreed, that change doesn't seem to make sense and your backtrace is 
mangled so barely readable.


Christian.

Am 01.09.20 um 14:59 schrieb Liu, Monk:

[AMD Official Use Only - Internal Distribution Only]

See that we already have such logic:

282 static void amdgpu_vm_bo_relocated(struct amdgpu_vm_bo_base *vm_bo)
  283 {
  284 if (vm_bo->bo->parent)
  285 list_move(_bo->vm_status, _bo->vm->relocated);
  286 else
  287 amdgpu_vm_bo_idle(vm_bo);
  288 }

Why you need to do the bo->parent check out side ?

-邮件原件-
发件人: amd-gfx  代表 Pan, Xinhui
发送时间: 2020年2月10日 9:04
收件人: amd-gfx@lists.freedesktop.org
抄送: Deucher, Alexander ; Koenig, Christian 

主题: [PATCH V2] drm/amdgpu: Do not move root PT bo to relocated list

hit panic when we update the page tables.

<1>[  122.103290] BUG: kernel NULL pointer dereference, address: 0008 <1>[  
122.103348] #PF: supervisor read access in kernel mode <1>[  122.103376] #PF: error_code(0x) - 
not-present page <6>[  122.103403] PGD 0 P4D 0 <4>[  122.103421] Oops:  [#1] SMP PTI
<4>[  122.103442] CPU: 13 PID: 2133 Comm: kfdtest Tainted: G   OE 
5.4.0-rc7+ #7
<4>[  122.103480] Hardware name: Supermicro SYS-7048GR-TR/X10DRG-Q, BIOS 3.0b 03/09/2018 <4>[  122.103657] RIP: 0010:amdgpu_vm_update_pdes+0x140/0x330 [amdgpu] 
<4>[  122.103689] Code: 03 4c 89 73 08 49 89 9d c8 00 00 00 48 8b 7b f0 c6 43 10 00 45 31 c0 48 8b 87 28 04 00 00 48 85 c0 74 07 4c 8b 80 20 04 00 00 <4d> 8b 70 08 
31 f6 49 8b 86 28 04 00 00 48 85 c0 74 0f 48 8b 80 28 <4>[  122.103769] RSP: 0018:b49a0a6a3a98 EFLAGS: 00010246 <4>[  122.103797] RAX:  RBX: 
9020f823c148 RCX: dead0122 <4>[  122.103831] RDX: 9020ece70018 RSI: 9020f823c0c8 RDI: 9010ca31c800 <4>[  122.103865] RBP: b49a0a6a3b38 
R08:  R09: 0001 <4>[  122.103899] R10: 6044f994 R11: df57fb58 R12: 9020f823c000 <4>[  122.103933] R13: 
9020f823c000 R14: 9020f823c0c8 R15: 9010d5d2 <4>[  122.103968] FS:  7f32c83dc780() GS:9020ff38() knlGS: <4>[  
122.104006] CS:  0010 DS:  ES:  CR0: 80050033 <4>[  122.104035] CR2: 0008 CR3: 002036bba005 CR4: 003606e0 <4>[  122.104069] 
DR0:  DR1:  DR2:  <4>[  122.104103] DR3:  DR6: fffe0ff0 DR7: 0400 <4>[  
122.104137] Call Trace:
<4>[  122.104241]  vm_update_pds+0x31/0x50 [amdgpu] <4>[  122.104347]  amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x2ef/0x690 
[amdgpu] <4>[  122.104466]  kfd_process_alloc_gpuvm+0x98/0x190 [amdgpu] <4>[  122.104576]  
kfd_process_device_init_vm.part.8+0xf3/0x1f0 [amdgpu] <4>[  122.104688]  kfd_process_device_init_vm+0x24/0x30 [amdgpu] 
<4>[  122.104794]  kfd_ioctl_acquire_vm+0xa4/0xc0 [amdgpu] <4>[  122.104900]  kfd_ioctl+0x277/0x500 [amdgpu] <4>[  
122.105001]  ? kfd_ioctl_free_memory_of_gpu+0xc0/0xc0 [amdgpu] <4>[  122.105039]  ? rcu_read_lock_sched_held+0x4f/0x80
<4>[  122.105068]  ? kmem_cache_free+0x2ba/0x300 <4>[  122.105093]  ? vm_area_free+0x18/0x20 <4>[  122.105117]  ? 
find_held_lock+0x35/0xa0 <4>[  122.105143]  do_vfs_ioctl+0xa9/0x6f0 <4>[  122.106001]  ksys_ioctl+0x75/0x80 <4>[  
122.106802]  ? do_syscall_64+0x17/0x230 <4>[  122.107605]  __x64_sys_ioctl+0x1a/0x20 <4>[  122.108378]  
do_syscall_64+0x5f/0x230 <4>[  122.109118]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[  122.109842] RIP: 0033:0x7f32c6b495d7

Signed-off-by: xinhui pan 
---
change from v1:
move root pt bo to idle state instead.
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 9 ++---
  1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 3195bc9..c3d1af5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2619,9 +2619,12 @@ void amdgpu_vm_bo_invalidate(struct amdgpu_device *adev,
  continue;
  bo_base->moved = true;

-if (bo->tbo.type == ttm_bo_type_kernel)
-amdgpu_vm_bo_relocated(bo_base);
-else if (bo->tbo.base.resv == vm->root.base.bo->tbo.base.resv)
+if (bo->tbo.type == ttm_bo_type_kernel) {
+if (bo->parent)
+amdgpu_vm_bo_relocated(bo_base);
+else
+amdgpu_vm_bo_idle(bo_base);
+} else if (bo->tbo.base.resv == vm->root.base.bo->tbo.base.resv)
  amdgpu_vm_bo_moved(bo_base);
  else
  amdgpu_vm_bo_invalidated(bo_base);
--
2.7.4
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfxdata=02%7C01%7Cmonk.liu%40amd.com%7C9fa67fef539f46e8d94f08d7adc52b5a%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637168934683625832sdata=nYoeBXXOdePn4Q6hxrlSTzT1PxTx%2FRwWqeKXGmQ5NPY%3Dreserved=0
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org

Re: [PATCH 3/3] drm/amdgpu: Embed drm_device into amdgpu_device (v2)

2020-09-01 Thread Alex Deucher
On Tue, Sep 1, 2020 at 3:44 AM Daniel Vetter  wrote:
>
> On Wed, Aug 19, 2020 at 01:00:42AM -0400, Luben Tuikov wrote:
> > a) Embed struct drm_device into struct amdgpu_device.
> > b) Modify the inline-f drm_to_adev() accordingly.
> > c) Modify the inline-f adev_to_drm() accordingly.
> > d) Eliminate the use of drm_device.dev_private,
> >in amdgpu.
> > e) Switch from using drm_dev_alloc() to
> >drm_dev_init().
> > f) Add a DRM driver release function, which frees
> >the container amdgpu_device after all krefs on
> >the contained drm_device have been released.
> >
> > v2: Split out adding adev_to_drm() into its own
> > patch (previous commit), making this patch
> > more succinct and clear. More detailed commit
> > description.
> >
> > Signed-off-by: Luben Tuikov 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu.h| 10 ++---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +++-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 43 ++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c| 20 +++---
> >  4 files changed, 43 insertions(+), 45 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index 735480cc7dcf..107a6ec920f7 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -724,8 +724,8 @@ struct amd_powerplay {
> >  #define AMDGPU_MAX_DF_PERFMONS 4
> >  struct amdgpu_device {
> >   struct device   *dev;
> > - struct drm_device   *ddev;
> >   struct pci_dev  *pdev;
> > + struct drm_device   ddev;
> >
> >  #ifdef CONFIG_DRM_AMD_ACP
> >   struct amdgpu_acp   acp;
> > @@ -990,12 +990,12 @@ struct amdgpu_device {
> >
> >  static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
> >  {
> > - return ddev->dev_private;
> > + return container_of(ddev, struct amdgpu_device, ddev);
> >  }
> >
> >  static inline struct drm_device *adev_to_drm(struct amdgpu_device *adev)
> >  {
> > - return adev->ddev;
> > + return >ddev;
> >  }
> >
> >  static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device 
> > *bdev)
> > @@ -1004,8 +1004,6 @@ static inline struct amdgpu_device 
> > *amdgpu_ttm_adev(struct ttm_bo_device *bdev)
> >  }
> >
> >  int amdgpu_device_init(struct amdgpu_device *adev,
> > -struct drm_device *ddev,
> > -struct pci_dev *pdev,
> >  uint32_t flags);
> >  void amdgpu_device_fini(struct amdgpu_device *adev);
> >  int amdgpu_gpu_wait_for_idle(struct amdgpu_device *adev);
> > @@ -1195,7 +1193,7 @@ static inline void *amdgpu_atpx_get_dhandle(void) { 
> > return NULL; }
> >  extern const struct drm_ioctl_desc amdgpu_ioctls_kms[];
> >  extern const int amdgpu_max_kms_ioctl;
> >
> > -int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags);
> > +int amdgpu_driver_load_kms(struct amdgpu_device *adev, unsigned long 
> > flags);
> >  void amdgpu_driver_unload_kms(struct drm_device *dev);
> >  void amdgpu_driver_lastclose_kms(struct drm_device *dev);
> >  int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file 
> > *file_priv);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 07012d71eeea..6e529548e708 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -1216,7 +1216,8 @@ static int amdgpu_device_check_arguments(struct 
> > amdgpu_device *adev)
> >   * Callback for the switcheroo driver.  Suspends or resumes the
> >   * the asics before or after it is powered up using ACPI methods.
> >   */
> > -static void amdgpu_switcheroo_set_state(struct pci_dev *pdev, enum 
> > vga_switcheroo_state state)
> > +static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
> > + enum vga_switcheroo_state state)
> >  {
> >   struct drm_device *dev = pci_get_drvdata(pdev);
> >   int r;
> > @@ -2977,8 +2978,6 @@ static const struct attribute 
> > *amdgpu_dev_attributes[] = {
> >   * amdgpu_device_init - initialize the driver
> >   *
> >   * @adev: amdgpu_device pointer
> > - * @ddev: drm dev pointer
> > - * @pdev: pci dev pointer
> >   * @flags: driver flags
> >   *
> >   * Initializes the driver info and hw (all asics).
> > @@ -2986,18 +2985,15 @@ static const struct attribute 
> > *amdgpu_dev_attributes[] = {
> >   * Called at driver startup.
> >   */
> >  int amdgpu_device_init(struct amdgpu_device *adev,
> > -struct drm_device *ddev,
> > -struct pci_dev *pdev,
> >  uint32_t flags)
> >  {
> > + struct drm_device *ddev = adev_to_drm(adev);
> > + struct pci_dev *pdev = adev->pdev;
> >   int r, i;
> >   bool boco = false;
> >   u32 max_MBps;
> >
> >   adev->shutdown = false;
> > - 

Re: [PATCH] drm/kfd: fix a system crash issue during GPU recovery

2020-09-01 Thread Felix Kuehling
I'm not sure how the bug you're fixing is caused, but your fix is
clearly in the wrong place.

A queue being disabled is not the same thing as a queue being destroyed.
Queues can be disabled for legitimate reasons, but they still should
exist and be in the qpd->queues_list.

If a destroyed queue is left on the qpd->queues_list, that would be a
problem. Can you point out where such a thing is happening?

Thanks,
  Felix


Am 2020-08-31 um 9:36 p.m. schrieb Dennis Li:
> The crash log as the below:
>
> [Thu Aug 20 23:18:14 2020] general protection fault:  [#1] SMP NOPTI
> [Thu Aug 20 23:18:14 2020] CPU: 152 PID: 1837 Comm: kworker/152:1 Tainted: G  
>  OE 5.4.0-42-generic #46~18.04.1-Ubuntu
> [Thu Aug 20 23:18:14 2020] Hardware name: GIGABYTE G482-Z53-YF/MZ52-G40-00, 
> BIOS R12 05/13/2020
> [Thu Aug 20 23:18:14 2020] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
> [Thu Aug 20 23:18:14 2020] RIP: 0010:evict_process_queues_cpsch+0xc9/0x130 
> [amdgpu]
> [Thu Aug 20 23:18:14 2020] Code: 49 8d 4d 10 48 39 c8 75 21 eb 44 83 fa 03 74 
> 36 80 78 72 00 74 0c 83 ab 68 01 00 00 01 41 c6 45 41 00 48 8b 00 48 39 c8 74 
> 25 <80> 78 70 00 c6 40 6d 01 74 ee 8b 50 28 c6 40 70 00 83 ab 60 01 00
> [Thu Aug 20 23:18:14 2020] RSP: 0018:b29b52f6fc90 EFLAGS: 00010213
> [Thu Aug 20 23:18:14 2020] RAX: 1c884edb0a118914 RBX: 8a0d45ff3c00 RCX: 
> 8a2d83e41038
> [Thu Aug 20 23:18:14 2020] RDX:  RSI: 0082 RDI: 
> 8a0e2e4178c0
> [Thu Aug 20 23:18:14 2020] RBP: b29b52f6fcb0 R08: 1b64 R09: 
> 0004
> [Thu Aug 20 23:18:14 2020] R10: b29b52f6fb78 R11: 0001 R12: 
> 8a0d45ff3d28
> [Thu Aug 20 23:18:14 2020] R13: 8a2d83e41028 R14:  R15: 
> 
> [Thu Aug 20 23:18:14 2020] FS:  () 
> GS:8a0e2e40() knlGS:
> [Thu Aug 20 23:18:14 2020] CS:  0010 DS:  ES:  CR0: 80050033
> [Thu Aug 20 23:18:14 2020] CR2: 55c783c0e6a8 CR3: 0034a1284000 CR4: 
> 00340ee0
> [Thu Aug 20 23:18:14 2020] Call Trace:
> [Thu Aug 20 23:18:14 2020]  kfd_process_evict_queues+0x43/0xd0 [amdgpu]
> [Thu Aug 20 23:18:14 2020]  kfd_suspend_all_processes+0x60/0xf0 [amdgpu]
> [Thu Aug 20 23:18:14 2020]  kgd2kfd_suspend.part.7+0x43/0x50 [amdgpu]
> [Thu Aug 20 23:18:14 2020]  kgd2kfd_pre_reset+0x46/0x60 [amdgpu]
> [Thu Aug 20 23:18:14 2020]  amdgpu_amdkfd_pre_reset+0x1a/0x20 [amdgpu]
> [Thu Aug 20 23:18:14 2020]  amdgpu_device_gpu_recover+0x377/0xf90 [amdgpu]
> [Thu Aug 20 23:18:14 2020]  ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu]
> [Thu Aug 20 23:18:14 2020]  amdgpu_ras_do_recovery+0x159/0x190 [amdgpu]
> [Thu Aug 20 23:18:14 2020]  process_one_work+0x20f/0x400
> [Thu Aug 20 23:18:14 2020]  worker_thread+0x34/0x410
>
> When GPU hang, user process will fail to create a compute queue whose
> struct object will be freed later, but driver wrongly add this queue to
> queue list of the proccess. And then kfd_process_evict_queues will
> access a freed memory, which cause a system crash.
>
> Signed-off-by: Dennis Li 
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> index 560adc57a050..d5e6b07ffb27 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -1296,16 +1296,18 @@ static int create_queue_cpsch(struct 
> device_queue_manager *dqm, struct queue *q,
>   mqd_mgr->init_mqd(mqd_mgr, >mqd, q->mqd_mem_obj,
>   >gart_mqd_addr, >properties);
>  
> - list_add(>list, >queues_list);
> - qpd->queue_count++;
> -
>   if (q->properties.is_active) {
>   increment_queue_count(dqm, q->properties.type);
>  
>   retval = execute_queues_cpsch(dqm,
>   KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0);
> + if (retval)
> + goto out_execute_cpsch;
>   }
>  
> + list_add(>list, >queues_list);
> + qpd->queue_count++;
> +
>   /*
>* Unconditionally increment this counter, regardless of the queue's
>* type or whether the queue is active.
> @@ -1318,6 +1320,9 @@ static int create_queue_cpsch(struct 
> device_queue_manager *dqm, struct queue *q,
>   dqm_unlock(dqm);
>   return retval;
>  
> +out_execute_cpsch:
> + decrement_queue_count(dqm, q->properties.type);
> + dqm_unlock(dqm);
>  out_deallocate_doorbell:
>   deallocate_doorbell(qpd, q);
>  out_deallocate_sdma_queue:
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


回复: [PATCH V2] drm/amdgpu: Do not move root PT bo to relocated list

2020-09-01 Thread Liu, Monk
[AMD Official Use Only - Internal Distribution Only]

See that we already have such logic:

282 static void amdgpu_vm_bo_relocated(struct amdgpu_vm_bo_base *vm_bo)
 283 {
 284 if (vm_bo->bo->parent)
 285 list_move(_bo->vm_status, _bo->vm->relocated);
 286 else
 287 amdgpu_vm_bo_idle(vm_bo);
 288 }

Why you need to do the bo->parent check out side ?

-邮件原件-
发件人: amd-gfx  代表 Pan, Xinhui
发送时间: 2020年2月10日 9:04
收件人: amd-gfx@lists.freedesktop.org
抄送: Deucher, Alexander ; Koenig, Christian 

主题: [PATCH V2] drm/amdgpu: Do not move root PT bo to relocated list

hit panic when we update the page tables.

<1>[  122.103290] BUG: kernel NULL pointer dereference, address: 
0008 <1>[  122.103348] #PF: supervisor read access in kernel mode 
<1>[  122.103376] #PF: error_code(0x) - not-present page <6>[  122.103403] 
PGD 0 P4D 0 <4>[  122.103421] Oops:  [#1] SMP PTI
<4>[  122.103442] CPU: 13 PID: 2133 Comm: kfdtest Tainted: G   OE 
5.4.0-rc7+ #7
<4>[  122.103480] Hardware name: Supermicro SYS-7048GR-TR/X10DRG-Q, BIOS 3.0b 
03/09/2018 <4>[  122.103657] RIP: 0010:amdgpu_vm_update_pdes+0x140/0x330 
[amdgpu] <4>[  122.103689] Code: 03 4c 89 73 08 49 89 9d c8 00 00 00 48 8b 7b 
f0 c6 43 10 00 45 31 c0 48 8b 87 28 04 00 00 48 85 c0 74 07 4c 8b 80 20 04 00 
00 <4d> 8b 70 08 31 f6 49 8b 86 28 04 00 00 48 85 c0 74 0f 48 8b 80 28 <4>[  
122.103769] RSP: 0018:b49a0a6a3a98 EFLAGS: 00010246 <4>[  122.103797] RAX: 
 RBX: 9020f823c148 RCX: dead0122 <4>[  122.103831] 
RDX: 9020ece70018 RSI: 9020f823c0c8 RDI: 9010ca31c800 <4>[  
122.103865] RBP: b49a0a6a3b38 R08:  R09: 0001 
<4>[  122.103899] R10: 6044f994 R11: df57fb58 R12: 
9020f823c000 <4>[  122.103933] R13: 9020f823c000 R14: 9020f823c0c8 
R15: 9010d5d2 <4>[  122.103968] FS:  7f32c83dc780() 
GS:9020ff38() knlGS: <4>[  122.104006] CS:  0010 
DS:  ES:  CR0: 80050033 <4>[  122.104035] CR2: 0008 
CR3: 002036bba005 CR4: 003606e0 <4>[  122.104069] DR0: 
 DR1:  DR2:  <4>[  122.104103] 
DR3:  DR6: fffe0ff0 DR7: 0400 <4>[  
122.104137] Call Trace:
<4>[  122.104241]  vm_update_pds+0x31/0x50 [amdgpu] <4>[  122.104347]  
amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x2ef/0x690 [amdgpu] <4>[  122.104466]  
kfd_process_alloc_gpuvm+0x98/0x190 [amdgpu] <4>[  122.104576]  
kfd_process_device_init_vm.part.8+0xf3/0x1f0 [amdgpu] <4>[  122.104688]  
kfd_process_device_init_vm+0x24/0x30 [amdgpu] <4>[  122.104794]  
kfd_ioctl_acquire_vm+0xa4/0xc0 [amdgpu] <4>[  122.104900]  
kfd_ioctl+0x277/0x500 [amdgpu] <4>[  122.105001]  ? 
kfd_ioctl_free_memory_of_gpu+0xc0/0xc0 [amdgpu] <4>[  122.105039]  ? 
rcu_read_lock_sched_held+0x4f/0x80
<4>[  122.105068]  ? kmem_cache_free+0x2ba/0x300 <4>[  122.105093]  ? 
vm_area_free+0x18/0x20 <4>[  122.105117]  ? find_held_lock+0x35/0xa0 <4>[  
122.105143]  do_vfs_ioctl+0xa9/0x6f0 <4>[  122.106001]  ksys_ioctl+0x75/0x80 
<4>[  122.106802]  ? do_syscall_64+0x17/0x230 <4>[  122.107605]  
__x64_sys_ioctl+0x1a/0x20 <4>[  122.108378]  do_syscall_64+0x5f/0x230 <4>[  
122.109118]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[  122.109842] RIP: 0033:0x7f32c6b495d7

Signed-off-by: xinhui pan 
---
change from v1:
   move root pt bo to idle state instead.
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 3195bc9..c3d1af5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2619,9 +2619,12 @@ void amdgpu_vm_bo_invalidate(struct amdgpu_device *adev,
 continue;
 bo_base->moved = true;

-if (bo->tbo.type == ttm_bo_type_kernel)
-amdgpu_vm_bo_relocated(bo_base);
-else if (bo->tbo.base.resv == vm->root.base.bo->tbo.base.resv)
+if (bo->tbo.type == ttm_bo_type_kernel) {
+if (bo->parent)
+amdgpu_vm_bo_relocated(bo_base);
+else
+amdgpu_vm_bo_idle(bo_base);
+} else if (bo->tbo.base.resv == vm->root.base.bo->tbo.base.resv)
 amdgpu_vm_bo_moved(bo_base);
 else
 amdgpu_vm_bo_invalidated(bo_base);
--
2.7.4
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfxdata=02%7C01%7Cmonk.liu%40amd.com%7C9fa67fef539f46e8d94f08d7adc52b5a%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637168934683625832sdata=nYoeBXXOdePn4Q6hxrlSTzT1PxTx%2FRwWqeKXGmQ5NPY%3Dreserved=0
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu/dc: Require primary plane to be enabled whenever the CRTC is

2020-09-01 Thread Daniel Vetter
On Tue, Sep 01, 2020 at 10:56:42AM +0200, Michel Dänzer wrote:
> On 2020-09-01 9:57 a.m., Daniel Vetter wrote:
> > On Tue, Aug 25, 2020 at 04:55:28PM +0200, Michel Dänzer wrote:
> >> On 2020-08-24 9:43 a.m., Pekka Paalanen wrote:
> >> 
> >>> Sounds like the helpers you refer to are inadequate for your case.
> >>> Can't you fix the helpers in the long run and land this patch as an
> >>> immediate fix?
> >> 
> >> I'd rather leave working on those helpers to KMS developers. :)
> > 
> > I thought the issue is the rmfb ioctl trickery, which has this assumption
> > fully backed in. The wiggle room in there for semantic changes is iirc
> > pretty much nil, it took us a pile of patches to just get to the current
> > state. So it's not helper code, it's core legacy ioctl code for atomic
> > drivers.
> 
> My bad. Should I respin with an amended commit log?

Yeah if it's not yet merged then I think would be good to clarify that
this assumption is baked into rmfb, and that together with the assumption
the the legacy cursor ioctls just work we have fallout.

If (and hey I've been on vacations for 2 weeks, so good chances I remebers
this all wrong) this is indeed what we discussed a few weeks ago.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu/dc: Require primary plane to be enabled whenever the CRTC is

2020-09-01 Thread Michel Dänzer
On 2020-09-01 9:57 a.m., Daniel Vetter wrote:
> On Tue, Aug 25, 2020 at 04:55:28PM +0200, Michel Dänzer wrote:
>> On 2020-08-24 9:43 a.m., Pekka Paalanen wrote:
>> 
>>> Sounds like the helpers you refer to are inadequate for your case.
>>> Can't you fix the helpers in the long run and land this patch as an
>>> immediate fix?
>> 
>> I'd rather leave working on those helpers to KMS developers. :)
> 
> I thought the issue is the rmfb ioctl trickery, which has this assumption
> fully backed in. The wiggle room in there for semantic changes is iirc
> pretty much nil, it took us a pile of patches to just get to the current
> state. So it's not helper code, it's core legacy ioctl code for atomic
> drivers.

My bad. Should I respin with an amended commit log?


-- 
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v2] drm/amdgpu: block ring buffer access during GPU recovery

2020-09-01 Thread Christian König

Am 01.09.20 um 09:50 schrieb Dennis Li:

When GPU is in reset, its status isn't stable and ring buffer also need
be reset when resuming. Therefore driver should protect GPU recovery
thread from ring buffer accessed by other threads. Otherwise GPU will
randomly hang during recovery.

v2: correct indent

Signed-off-by: Dennis Li 


Reviewed-by: Christian König 



diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 172dc47b7f39..9b586bc80c38 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -319,8 +319,12 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, 
uint32_t reg,
  {
uint32_t ret;
  
-	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))

-   return amdgpu_kiq_rreg(adev, reg);
+   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev) &&
+   down_read_trylock(>reset_sem)) {
+   ret = amdgpu_kiq_rreg(adev, reg);
+   up_read(>reset_sem);
+   return ret;
+   }
  
  	if ((reg * 4) < adev->rmmio_size)

ret = readl(((void __iomem *)adev->rmmio) + (reg * 4));
@@ -332,6 +336,7 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, 
uint32_t reg,
ret = readl(((void __iomem *)adev->rmmio) + (mmMM_DATA * 4));
spin_unlock_irqrestore(>mmio_idx_lock, flags);
}
+
trace_amdgpu_mm_rreg(adev->pdev->device, reg, ret);
return ret;
  }
@@ -407,8 +412,12 @@ void static inline amdgpu_mm_wreg_mmio(struct 
amdgpu_device *adev, uint32_t reg,
  void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
uint32_t acc_flags)
  {
-   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
-   return amdgpu_kiq_wreg(adev, reg, v);
+   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev) &&
+   down_read_trylock(>reset_sem)) {
+   amdgpu_kiq_wreg(adev, reg, v);
+   up_read(>reset_sem);
+   return;
+   }
  
  	amdgpu_mm_wreg_mmio(adev, reg, v, acc_flags);

  }
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index ad9ad622ccce..31359e519d69 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -287,8 +287,7 @@ static void gmc_v10_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
 */
if (adev->gfx.kiq.ring.sched.ready &&
(amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) &&
-   !amdgpu_in_reset(adev)) {
-
+   down_read_trylock(>reset_sem)) {
struct amdgpu_vmhub *hub = >vmhub[vmhub];
const unsigned eng = 17;
u32 inv_req = hub->vmhub_funcs->get_invalidate_req(vmid, 
flush_type);
@@ -297,6 +296,8 @@ static void gmc_v10_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
  
  		amdgpu_virt_kiq_reg_write_reg_wait(adev, req, ack, inv_req,

1 << vmid);
+
+   up_read(>reset_sem);
return;
}
  
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c

index e1a0ae327cf5..c602ddc68384 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -500,13 +500,14 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
 * as GFXOFF under bare metal
 */
if (adev->gfx.kiq.ring.sched.ready &&
-   (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) 
&&
-   !amdgpu_in_reset(adev)) {
+   (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) &&
+   down_read_trylock(>reset_sem)) {
uint32_t req = hub->vm_inv_eng0_req + hub->eng_distance * eng;
uint32_t ack = hub->vm_inv_eng0_ack + hub->eng_distance * eng;
  
  		amdgpu_virt_kiq_reg_write_reg_wait(adev, req, ack, inv_req,

   1 << vmid);
+   up_read(>reset_sem);
return;
}
  
@@ -599,7 +600,7 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct amdgpu_device *adev,

if (amdgpu_in_reset(adev))
return -EIO;
  
-	if (ring->sched.ready) {

+   if (ring->sched.ready && down_read_trylock(>reset_sem)) {
/* Vega20+XGMI caches PTEs in TC and TLB. Add a
 * heavy-weight TLB flush (type 2), which flushes
 * both. Due to a race condition with concurrent
@@ -626,6 +627,7 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct 
amdgpu_device *adev,
if (r) {
amdgpu_ring_undo(ring);
spin_unlock(>gfx.kiq.ring_lock);
+   up_read(>reset_sem);
return -ETIME;
}
  
@@ 

RE: [PATCH v2] drm/amdgpu: block ring buffer access during GPU recovery

2020-09-01 Thread Zhang, Hawking
[AMD Public Use]

Reviewed-by: Hawking Zhang 

Regards,
Hawking
-Original Message-
From: Dennis Li  
Sent: Tuesday, September 1, 2020 15:50
To: amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Kuehling, Felix ; Zhang, 
Hawking ; Koenig, Christian 
Cc: Li, Dennis 
Subject: [PATCH v2] drm/amdgpu: block ring buffer access during GPU recovery

When GPU is in reset, its status isn't stable and ring buffer also need be 
reset when resuming. Therefore driver should protect GPU recovery thread from 
ring buffer accessed by other threads. Otherwise GPU will randomly hang during 
recovery.

v2: correct indent

Signed-off-by: Dennis Li 

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 172dc47b7f39..9b586bc80c38 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -319,8 +319,12 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, 
uint32_t reg,  {
uint32_t ret;
 
-   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
-   return amdgpu_kiq_rreg(adev, reg);
+   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev) &&
+   down_read_trylock(>reset_sem)) {
+   ret = amdgpu_kiq_rreg(adev, reg);
+   up_read(>reset_sem);
+   return ret;
+   }
 
if ((reg * 4) < adev->rmmio_size)
ret = readl(((void __iomem *)adev->rmmio) + (reg * 4)); @@ 
-332,6 +336,7 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t 
reg,
ret = readl(((void __iomem *)adev->rmmio) + (mmMM_DATA * 4));
spin_unlock_irqrestore(>mmio_idx_lock, flags);
}
+
trace_amdgpu_mm_rreg(adev->pdev->device, reg, ret);
return ret;
 }
@@ -407,8 +412,12 @@ void static inline amdgpu_mm_wreg_mmio(struct 
amdgpu_device *adev, uint32_t reg,  void amdgpu_mm_wreg(struct amdgpu_device 
*adev, uint32_t reg, uint32_t v,
uint32_t acc_flags)
 {
-   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
-   return amdgpu_kiq_wreg(adev, reg, v);
+   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev) &&
+   down_read_trylock(>reset_sem)) {
+   amdgpu_kiq_wreg(adev, reg, v);
+   up_read(>reset_sem);
+   return;
+   }
 
amdgpu_mm_wreg_mmio(adev, reg, v, acc_flags);  } diff --git 
a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index ad9ad622ccce..31359e519d69 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -287,8 +287,7 @@ static void gmc_v10_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
 */
if (adev->gfx.kiq.ring.sched.ready &&
(amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) &&
-   !amdgpu_in_reset(adev)) {
-
+   down_read_trylock(>reset_sem)) {
struct amdgpu_vmhub *hub = >vmhub[vmhub];
const unsigned eng = 17;
u32 inv_req = hub->vmhub_funcs->get_invalidate_req(vmid, 
flush_type); @@ -297,6 +296,8 @@ static void gmc_v10_0_flush_gpu_tlb(struct 
amdgpu_device *adev, uint32_t vmid,
 
amdgpu_virt_kiq_reg_write_reg_wait(adev, req, ack, inv_req,
1 << vmid);
+
+   up_read(>reset_sem);
return;
}
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index e1a0ae327cf5..c602ddc68384 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -500,13 +500,14 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
 * as GFXOFF under bare metal
 */
if (adev->gfx.kiq.ring.sched.ready &&
-   (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) 
&&
-   !amdgpu_in_reset(adev)) {
+   (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) &&
+   down_read_trylock(>reset_sem)) {
uint32_t req = hub->vm_inv_eng0_req + hub->eng_distance * eng;
uint32_t ack = hub->vm_inv_eng0_ack + hub->eng_distance * eng;
 
amdgpu_virt_kiq_reg_write_reg_wait(adev, req, ack, inv_req,
   1 << vmid);
+   up_read(>reset_sem);
return;
}
 
@@ -599,7 +600,7 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct 
amdgpu_device *adev,
if (amdgpu_in_reset(adev))
return -EIO;
 
-   if (ring->sched.ready) {
+   if (ring->sched.ready && down_read_trylock(>reset_sem)) {
/* Vega20+XGMI caches PTEs in TC and TLB. Add a
 * heavy-weight TLB flush (type 2), which flushes
 * both. Due to a race condition with concurrent @@ -626,6 
+627,7 

Re: [PATCH] drm/amdgpu/dc: Require primary plane to be enabled whenever the CRTC is

2020-09-01 Thread Daniel Vetter
On Tue, Aug 25, 2020 at 04:55:28PM +0200, Michel Dänzer wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> On 2020-08-24 9:43 a.m., Pekka Paalanen wrote:
> > On Sat, 22 Aug 2020 11:59:26 +0200 Michel Dänzer
> >  wrote:
> >> On 2020-08-21 8:07 p.m., Kazlauskas, Nicholas wrote:
> >>> On 2020-08-21 12:57 p.m., Michel Dänzer wrote:
>  From: Michel Dänzer 
> 
>  Don't check drm_crtc_state::active for this either, per its
>  documentation in include/drm/drm_crtc.h:
> 
>  * Hence drivers must not consult @active in their various *
>  _mode_config_funcs.atomic_check callback to reject an
>  atomic * commit.
> 
>  The atomic helpers disable the CRTC as needed for disabling
>  the primary plane.
> 
>  This prevents at least the following problems if the primary
>  plane gets disabled (e.g. due to destroying the FB assigned
>  to the primary plane, as happens e.g. with mutter in Wayland
>  mode):
> 
>  * Toggling CRTC active to 1 failed if the cursor plane was
>  enabled (e.g. via legacy DPMS property & cursor ioctl). *
>  Enabling the cursor plane failed, e.g. via the legacy cursor
>  ioctl.
> >>>
> >>> We previously had the requirement that the primary plane must
> >>> be enabled but some userspace expects that they can enable
> >>> just the overlay plane without anything else.
> >>>
> >>> I think the chromuiumos atomictest validates that this works
> >>> as well:
> >>>
> >>> So is DRM going forward then with the expectation that this is
> >>> wrong behavior from userspace?
> >>>
> >>> We require at least one plane to be enabled to display a
> >>> cursor, but it doesn't necessarily need to be the primary.
> >>
> >> It's a "pick your poison" situation:
> >>
> >> 1) Currently the checks are invalid (atomic_check must not
> >> decide based on drm_crtc_state::active), and it's easy for legacy
> >> KMS userspace to accidentally hit errors trying to enable/move
> >> the cursor or switch DPMS off → on.
> >>
> >> 2) Accurately rejecting only atomic states where the cursor
> >> plane is enabled but all other planes are off would break the
> >> KMS helper code, which can only deal with the "CRTC on & primary
> >> plane off is not allowed" case specifically.
> >>
> >> 3) This patch addresses 1) & 2) but may break existing atomic
> >> userspace which wants to enable an overlay plane while disabling
> >> the primary plane.
> >>
> >>
> >> I do think in principle atomic userspace is expected to handle
> >> case 3) and leave the primary plane enabled.
> >
> > Hi,
> >
> > my opinion as a userspace developer is that enabling a CRTC
> > without a primary plane has traditionally not worked, so userspace
> > cannot *rely* on it ever working. I think legacy KMS API does not
> > even allow to express that really, or has built-in assumptions that
> > it doesn't work - you can call the legacy cursor ioctls, they
> > don't fail, but also the CRTC remains off. Correct me if this is
> > not true.
> >
> > Atomic userspace OTOH is required to test the strange (and all!)
> > cases like this to see if they work or not, and carry on with a
> > fallback if they don't. Userspace that wants to use a CRTC without
> > a primary plane likely needs other CRTC properties as well, like
> > background color.
> >
> > I would guess that when two regressions conflict, the older
> > userspace would win. Hence legacy KMS using Mutter should keep
> > working, while atomic userspace is left with increased power
> > consumption. Not using a hardware cursor (Mutter) is much more
> > easily an end-user visible regression than slightly shorter
> > battery life.
> >
> > Atomic test commits are allowed to fail at any time for any reason
> > and something that previously worked can fail on the next frame or
> > on the next kernel, as far as I understand.
> 
> All of this basically matches my understanding.
> 
> 
> > Sounds like the helpers you refer to are inadequate for your case.
> > Can't you fix the helpers in the long run and land this patch as an
> > immediate fix?
> 
> I'd rather leave working on those helpers to KMS developers. :)

I thought the issue is the rmfb ioctl trickery, which has this assumption
fully backed in. The wiggle room in there for semantic changes is iirc
pretty much nil, it took us a pile of patches to just get to the current
state. So it's not helper code, it's core legacy ioctl code for atomic
drivers.
-Daniel

> 
> 
> - -- 
> Earthling Michel Dänzer   |   https://redhat.com
> Libre software enthusiast | Mesa and X developer
> -BEGIN PGP SIGNATURE-
> 
> iF0EARECAB0WIQSwn681vpFFIZgJURRaga+OatuyAAUCX0UmXAAKCRBaga+Oatuy
> AIeNAJoC9UgOrF+qBq08uOyjaV7Vfp+PgACfSp3nXB3un3LUZQxrvaxMAON+eIs=
> =Pbd2
> -END PGP SIGNATURE-
> ___
> dri-devel mailing list
> dri-de...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

-- 

[PATCH] drm/amd/display: Fix a list corruption

2020-09-01 Thread Pan, Xinhui
[AMD Official Use Only - Internal Distribution Only]

Remove the private obj from the internal list before we free aconnector.

[   56.925828] BUG: unable to handle page fault for address: 8f84a870a560
[   56.933272] #PF: supervisor read access in kernel mode
[   56.938801] #PF: error_code(0x) - not-present page
[   56.944376] PGD 18e605067 P4D 18e605067 PUD 86a614067 PMD 86a4d0067 PTE 
8008578f5060
[   56.953260] Oops:  [#1] SMP DEBUG_PAGEALLOC NOPTI
[   56.958815] CPU: 6 PID: 1407 Comm: bash Tainted: G   O  
5.9.0-rc2+ #46
[   56.967092] Hardware name: System manufacturer System Product Name/PRIME 
Z390-A, BIOS 1401 11/26/2019
[   56.977162] RIP: 0010:__list_del_entry_valid+0x31/0xa0
[   56.982768] Code: 00 ad de 55 48 8b 17 4c 8b 47 08 48 89 e5 48 39 c2 74 27 
48 b8 22 01 00 00 00 00 ad de 49 39 c0 74 2d 49 8b 30 48 39 fe 75 3d <48> 8b 52 
08 48 39 f2 75 4c b8 01 00 00 00 5d c3 48 89 7
[   57.003327] RSP: 0018:b40c81687c90 EFLAGS: 00010246
[   57.009048] RAX: dead0122 RBX: 8f84ea41f4f0 RCX: 0006
[   57.016871] RDX: 8f84a870a558 RSI: 8f84ea41f4f0 RDI: 8f84ea41f4f0
[   57.024672] RBP: b40c81687c90 R08: 8f84ea400998 R09: 0001
[   57.032490] R10:  R11:  R12: 0006
[   57.040287] R13: 8f84ea422a90 R14: 8f84b4129a20 R15: fff2
[   57.048105] FS:  7f550d885740() GS:8f850960() 
knlGS:
[   57.056979] CS:  0010 DS:  ES:  CR0: 80050033
[   57.063260] CR2: 8f84a870a560 CR3: 0007e5144001 CR4: 003706e0
[   57.071053] DR0:  DR1:  DR2: 
[   57.078849] DR3:  DR6: fffe0ff0 DR7: 0400
[   57.086684] Call Trace:
[   57.089381]  drm_atomic_private_obj_fini+0x29/0x82 [drm]
[   57.095247]  amdgpu_dm_fini+0x83/0x170 [amdgpu]
[   57.100264]  dm_hw_fini+0x23/0x30 [amdgpu]
[   57.104814]  amdgpu_device_fini+0x1df/0x4fe [amdgpu]
[   57.110271]  amdgpu_driver_unload_kms+0x43/0x70 [amdgpu]
[   57.116136]  amdgpu_pci_remove+0x3b/0x60 [amdgpu]
[   57.121291]  pci_device_remove+0x3e/0xb0
[   57.125583]  device_release_driver_internal+0xff/0x1d0
[   57.131223]  device_release_driver+0x12/0x20
[   57.135903]  pci_stop_bus_device+0x70/0xa0
[   57.140401]  pci_stop_and_remove_bus_device_locked+0x1b/0x30
[   57.146571]  remove_store+0x7b/0x90
[   57.150429]  dev_attr_store+0x17/0x30
[   57.154441]  sysfs_kf_write+0x4b/0x60
[   57.158479]  kernfs_fop_write+0xe8/0x1d0
[   57.162788]  vfs_write+0xf5/0x230
[   57.166426]  ksys_write+0x70/0xf0
[   57.170087]  __x64_sys_write+0x1a/0x20
[   57.174219]  do_syscall_64+0x38/0x90
[   57.178145]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: xinhui pan 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index f52533ee7372..cb624ee70545 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -5076,6 +5076,7 @@ static void amdgpu_dm_connector_destroy(struct 
drm_connector *connector)
 struct amdgpu_device *adev = drm_to_adev(connector->dev);
 struct amdgpu_display_manager *dm = >dm;

+drm_atomic_private_obj_fini(>mst_mgr.base);
 #if defined(CONFIG_BACKLIGHT_CLASS_DEVICE) ||\
 defined(CONFIG_BACKLIGHT_CLASS_DEVICE_MODULE)

--
2.25.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu/dc: Require primary plane to be enabled whenever the CRTC is

2020-09-01 Thread Daniel Vetter
On Wed, Aug 26, 2020 at 11:24:23AM +0300, Pekka Paalanen wrote:
> On Tue, 25 Aug 2020 12:58:19 -0400
> "Kazlauskas, Nicholas"  wrote:
> 
> > On 2020-08-22 5:59 a.m., Michel Dänzer wrote:
> > > On 2020-08-21 8:07 p.m., Kazlauskas, Nicholas wrote:  
> > >> On 2020-08-21 12:57 p.m., Michel Dänzer wrote:  
> > >>> From: Michel Dänzer 
> > >>>
> > >>> Don't check drm_crtc_state::active for this either, per its
> > >>> documentation in include/drm/drm_crtc.h:
> > >>>
> > >>>    * Hence drivers must not consult @active in their various
> > >>>    * _mode_config_funcs.atomic_check callback to reject an atomic
> > >>>    * commit.
> > >>>
> > >>> The atomic helpers disable the CRTC as needed for disabling the primary
> > >>> plane.
> > >>>
> > >>> This prevents at least the following problems if the primary plane gets
> > >>> disabled (e.g. due to destroying the FB assigned to the primary plane,
> > >>> as happens e.g. with mutter in Wayland mode):
> > >>>
> > >>> * Toggling CRTC active to 1 failed if the cursor plane was enabled
> > >>>     (e.g. via legacy DPMS property & cursor ioctl).
> > >>> * Enabling the cursor plane failed, e.g. via the legacy cursor ioctl.  
> > >>
> > >> We previously had the requirement that the primary plane must be enabled
> > >> but some userspace expects that they can enable just the overlay plane
> > >> without anything else.
> > >>
> > >> I think the chromuiumos atomictest validates that this works as well:
> > >>
> > >> So is DRM going forward then with the expectation that this is wrong
> > >> behavior from userspace?
> > >>
> > >> We require at least one plane to be enabled to display a cursor, but it
> > >> doesn't necessarily need to be the primary.  
> > > 
> > > It's a "pick your poison" situation:
> > > 
> > > 1) Currently the checks are invalid (atomic_check must not decide based
> > > on drm_crtc_state::active), and it's easy for legacy KMS userspace to
> > > accidentally hit errors trying to enable/move the cursor or switch DPMS
> > > off → on.
> > > 
> > > 2) Accurately rejecting only atomic states where the cursor plane is
> > > enabled but all other planes are off would break the KMS helper code,
> > > which can only deal with the "CRTC on & primary plane off is not
> > > allowed" case specifically.
> > > 
> > > 3) This patch addresses 1) & 2) but may break existing atomic userspace
> > > which wants to enable an overlay plane while disabling the primary plane.
> > > 
> > > 
> > > I do think in principle atomic userspace is expected to handle case 3)
> > > and leave the primary plane enabled. However, this is not ideal from an
> > > energy consumption PoV. Therefore, here's another idea for a possible
> > > way out of this quagmire:
> > > 
> > > amdgpu_dm does not reject any atomic states based on which planes are
> > > enabled in it. If the cursor plane is enabled but all other planes are
> > > off, amdgpu_dm internally either:
> > > 
> > > a) Enables an overlay plane and makes it invisible, e.g. by assigning a
> > > minimum size FB with alpha = 0.
> > > 
> > > b) Enables the primary plane and assigns a minimum size FB (scaled up to
> > > the required size) containing all black, possibly using compression.
> > > (Trying to minimize the memory bandwidth)
> > > 
> > > 
> > > Does either of these seem feasible? If both do, which one would be
> > > preferable?
> > > 
> > >   
> > 
> > It's really the same solution since DCN doesn't make a distinction 
> > between primary or overlay planes in hardware. DCE doesn't have overlay 
> > planes enabled so this is not relevant there.
> > 
> > The old behavior (pre 5.1?) was to silently accept the commit even 
> > though the screen would be completely black instead of outright 
> > rejecting the commit.
> > 
> > I almost wonder if that makes more sense in the short term here since 
> > the only "userspace" affected here is IGT. We'll fail the CRC checks, 
> > but no userspace actually tries to actively use a cursor with no primary 
> > plane enabled from my understanding.
> 
> Hi,
> 
> I believe that there exists userspace that will *accidentally* attempt
> to update the cursor plane while primary plane or whole CRTC is off.
> Some versions of Mutter might do that on racy conditions, I suspect.
> These are legacy KMS users, not atomic KMS.
> 
> However, I do not believe there exists any userspace that would
> actually expect the display to show the cursor plane alone without a
> primary plane. Therefore I'd be ok with legacy cursor ioctls silently
> succeeding. Atomic commits not. So the difference has to be in the
> translation from legacy UAPI to kernel internal atomic interface.
> 
> > In the long term I think we can work on getting cursor actually on the 
> > screen in this case, though I can't say I really like having to reserve 
> > some small buffer (eg. 16x16) for allowing lightup on this corner case.
> 
> Why would you bother implementing that?
> 
> Is there really an IGT test that unconditionally demands cursor plane
> 

[PATCH v2] drm/amdgpu: block ring buffer access during GPU recovery

2020-09-01 Thread Dennis Li
When GPU is in reset, its status isn't stable and ring buffer also need
be reset when resuming. Therefore driver should protect GPU recovery
thread from ring buffer accessed by other threads. Otherwise GPU will
randomly hang during recovery.

v2: correct indent

Signed-off-by: Dennis Li 

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 172dc47b7f39..9b586bc80c38 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -319,8 +319,12 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, 
uint32_t reg,
 {
uint32_t ret;
 
-   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
-   return amdgpu_kiq_rreg(adev, reg);
+   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev) &&
+   down_read_trylock(>reset_sem)) {
+   ret = amdgpu_kiq_rreg(adev, reg);
+   up_read(>reset_sem);
+   return ret;
+   }
 
if ((reg * 4) < adev->rmmio_size)
ret = readl(((void __iomem *)adev->rmmio) + (reg * 4));
@@ -332,6 +336,7 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, 
uint32_t reg,
ret = readl(((void __iomem *)adev->rmmio) + (mmMM_DATA * 4));
spin_unlock_irqrestore(>mmio_idx_lock, flags);
}
+
trace_amdgpu_mm_rreg(adev->pdev->device, reg, ret);
return ret;
 }
@@ -407,8 +412,12 @@ void static inline amdgpu_mm_wreg_mmio(struct 
amdgpu_device *adev, uint32_t reg,
 void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
uint32_t acc_flags)
 {
-   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
-   return amdgpu_kiq_wreg(adev, reg, v);
+   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev) &&
+   down_read_trylock(>reset_sem)) {
+   amdgpu_kiq_wreg(adev, reg, v);
+   up_read(>reset_sem);
+   return;
+   }
 
amdgpu_mm_wreg_mmio(adev, reg, v, acc_flags);
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index ad9ad622ccce..31359e519d69 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -287,8 +287,7 @@ static void gmc_v10_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
 */
if (adev->gfx.kiq.ring.sched.ready &&
(amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) &&
-   !amdgpu_in_reset(adev)) {
-
+   down_read_trylock(>reset_sem)) {
struct amdgpu_vmhub *hub = >vmhub[vmhub];
const unsigned eng = 17;
u32 inv_req = hub->vmhub_funcs->get_invalidate_req(vmid, 
flush_type);
@@ -297,6 +296,8 @@ static void gmc_v10_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
 
amdgpu_virt_kiq_reg_write_reg_wait(adev, req, ack, inv_req,
1 << vmid);
+
+   up_read(>reset_sem);
return;
}
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index e1a0ae327cf5..c602ddc68384 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -500,13 +500,14 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
 * as GFXOFF under bare metal
 */
if (adev->gfx.kiq.ring.sched.ready &&
-   (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) 
&&
-   !amdgpu_in_reset(adev)) {
+   (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) &&
+   down_read_trylock(>reset_sem)) {
uint32_t req = hub->vm_inv_eng0_req + hub->eng_distance * eng;
uint32_t ack = hub->vm_inv_eng0_ack + hub->eng_distance * eng;
 
amdgpu_virt_kiq_reg_write_reg_wait(adev, req, ack, inv_req,
   1 << vmid);
+   up_read(>reset_sem);
return;
}
 
@@ -599,7 +600,7 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct 
amdgpu_device *adev,
if (amdgpu_in_reset(adev))
return -EIO;
 
-   if (ring->sched.ready) {
+   if (ring->sched.ready && down_read_trylock(>reset_sem)) {
/* Vega20+XGMI caches PTEs in TC and TLB. Add a
 * heavy-weight TLB flush (type 2), which flushes
 * both. Due to a race condition with concurrent
@@ -626,6 +627,7 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct 
amdgpu_device *adev,
if (r) {
amdgpu_ring_undo(ring);
spin_unlock(>gfx.kiq.ring_lock);
+   up_read(>reset_sem);
return -ETIME;
}
 
@@ -634,9 +636,10 @@ static int 

Re: [PATCH 1/1] drm/amdgpu: disable gpu-sched load balance for uvd

2020-09-01 Thread Nirmoy


On 9/1/20 9:07 AM, Paul Menzel wrote:

Dear Nirmoy,


Am 31.08.20 um 12:45 schrieb Nirmoy Das:

UVD dependent jobs should run on the same udv instance.


Why? Datasheet? Performance reasons? What happens if they do not run 
on the UVD instance? Are there bug reports?



Sorry about that, I should've elaborated more. I will take care of that 
in my next patch.



Nirmoy




It’d be great if you extended the commit message.


This patch disables gpu scheduler's load balancer for
a context which binds jobs from same the context to a udv
instance.

Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c

index 59032c26fc82..7cd398d25498 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -114,7 +114,9 @@ static int amdgpu_ctx_init_entity(struct 
amdgpu_ctx *ctx, u32 hw_ip,

  scheds = adev->gpu_sched[hw_ip][hw_prio].sched;
  num_scheds = adev->gpu_sched[hw_ip][hw_prio].num_scheds;

-    if (hw_ip == AMDGPU_HW_IP_VCN_ENC || hw_ip == 
AMDGPU_HW_IP_VCN_DEC) {

+    if (hw_ip == AMDGPU_HW_IP_VCN_ENC ||
+    hw_ip == AMDGPU_HW_IP_VCN_DEC ||
+    hw_ip == AMDGPU_HW_IP_UVD) {
  sched = drm_sched_pick_best(scheds, num_scheds);
  scheds = 
  num_scheds = 1;



Kind regards,

Paul

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amdgpu: Fix a redundant kfree

2020-09-01 Thread Pan, Xinhui
[AMD Official Use Only - Internal Distribution Only]

drm_dev_alloc() alloc *dev* and set managed.final_kfree to dev to free
itself.
Now from commit 5cdd68498918("drm/amdgpu: Embed drm_device into
amdgpu_device (v3)") we alloc *adev* and ddev is just a member of it.
So drm_dev_release try to free a wrong pointer then.

Also driver's release trys to free adev, but drm_dev_release will
access dev after call drvier's release.

To fix it, remove driver's release and set managed.final_kfree to adev.

[   36.269348] BUG: unable to handle page fault for address: a0c279940028
[   36.276841] #PF: supervisor read access in kernel mode
[   36.282434] #PF: error_code(0x) - not-present page
[   36.288053] PGD 676601067 P4D 676601067 PUD 86a414067 PMD 86a247067 PTE 
8008066bf060
[   36.296868] Oops:  [#1] SMP DEBUG_PAGEALLOC NOPTI
[   36.302409] CPU: 4 PID: 1375 Comm: bash Tainted: G   O  
5.9.0-rc2+ #46
[   36.310670] Hardware name: System manufacturer System Product Name/PRIME 
Z390-A, BIOS 1401 11/26/2019
[   36.320725] RIP: 0010:drm_managed_release+0x25/0x110 [drm]
[   36.326741] Code: 80 00 00 00 00 0f 1f 44 00 00 55 48 c7 c2 5a 9f 41 c0 be 
00 02 00 00 48 89 e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 08 <48> 8b 7f 
18 e8 c2 10 ff ff 4d 8b 74 24 20 49 8d 44 24 5
[   36.347217] RSP: 0018:b9424141fce0 EFLAGS: 00010282
[   36.352931] RAX: 0006 RBX: a0c279940010 RCX: 0006
[   36.360718] RDX: c0419f5a RSI: 0200 RDI: a0c279940010
[   36.368503] RBP: b9424141fd10 R08: 0001 R09: 0001
[   36.376304] R10:  R11:  R12: a0c279940010
[   36.384070] R13: c0e2a000 R14: a0c26924e220 R15: fff2
[   36.391845] FS:  7fc4a277b740() GS:a0c288e0() 
knlGS:
[   36.400669] CS:  0010 DS:  ES:  CR0: 80050033
[   36.406937] CR2: a0c279940028 CR3: 000792304006 CR4: 003706e0
[   36.414732] DR0:  DR1:  DR2: 
[   36.422550] DR3:  DR6: fffe0ff0 DR7: 0400
[   36.430354] Call Trace:
[   36.433044]  drm_dev_put.part.0+0x40/0x60 [drm]
[   36.438017]  drm_dev_put+0x13/0x20 [drm]
[   36.442398]  amdgpu_pci_remove+0x56/0x60 [amdgpu]
[   36.447528]  pci_device_remove+0x3e/0xb0
[   36.451807]  device_release_driver_internal+0xff/0x1d0
[   36.457416]  device_release_driver+0x12/0x20
[   36.462094]  pci_stop_bus_device+0x70/0xa0
[   36.466588]  pci_stop_and_remove_bus_device_locked+0x1b/0x30
[   36.472786]  remove_store+0x7b/0x90
[   36.476614]  dev_attr_store+0x17/0x30
[   36.480646]  sysfs_kf_write+0x4b/0x60
[   36.484655]  kernfs_fop_write+0xe8/0x1d0
[   36.488952]  vfs_write+0xf5/0x230
[   36.492562]  ksys_write+0x70/0xf0
[   36.496206]  __x64_sys_write+0x1a/0x20
[   36.500292]  do_syscall_64+0x38/0x90
[   36.504219]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: xinhui pan 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 10 +-
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index c12e9acd421d..52fc0c6c8538 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1165,7 +1165,7 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
 if (ret)
 goto err_free;

-drmm_add_final_kfree(ddev, ddev);
+drmm_add_final_kfree(ddev, adev);

 if (!supports_atomic)
 ddev->driver_features &= ~DRIVER_ATOMIC;
@@ -1221,13 +1221,6 @@ amdgpu_pci_remove(struct pci_dev *pdev)
 drm_dev_put(dev);
 }

-static void amdgpu_driver_release(struct drm_device *ddev)
-{
-struct amdgpu_device *adev = drm_to_adev(ddev);
-
-kfree(adev);
-}
-
 static void
 amdgpu_pci_shutdown(struct pci_dev *pdev)
 {
@@ -1522,7 +1515,6 @@ static struct drm_driver kms_driver = {
 .open = amdgpu_driver_open_kms,
 .postclose = amdgpu_driver_postclose_kms,
 .lastclose = amdgpu_driver_lastclose_kms,
-.release   = amdgpu_driver_release,
 .irq_handler = amdgpu_irq_handler,
 .ioctls = amdgpu_ioctls_kms,
 .gem_free_object_unlocked = amdgpu_gem_object_free,
--
2.25.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 3/3] drm/amdgpu: Embed drm_device into amdgpu_device (v2)

2020-09-01 Thread Daniel Vetter
On Wed, Aug 19, 2020 at 01:00:42AM -0400, Luben Tuikov wrote:
> a) Embed struct drm_device into struct amdgpu_device.
> b) Modify the inline-f drm_to_adev() accordingly.
> c) Modify the inline-f adev_to_drm() accordingly.
> d) Eliminate the use of drm_device.dev_private,
>in amdgpu.
> e) Switch from using drm_dev_alloc() to
>drm_dev_init().
> f) Add a DRM driver release function, which frees
>the container amdgpu_device after all krefs on
>the contained drm_device have been released.
> 
> v2: Split out adding adev_to_drm() into its own
> patch (previous commit), making this patch
> more succinct and clear. More detailed commit
> description.
> 
> Signed-off-by: Luben Tuikov 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h| 10 ++---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 43 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c| 20 +++---
>  4 files changed, 43 insertions(+), 45 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 735480cc7dcf..107a6ec920f7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -724,8 +724,8 @@ struct amd_powerplay {
>  #define AMDGPU_MAX_DF_PERFMONS 4
>  struct amdgpu_device {
>   struct device   *dev;
> - struct drm_device   *ddev;
>   struct pci_dev  *pdev;
> + struct drm_device   ddev;
>  
>  #ifdef CONFIG_DRM_AMD_ACP
>   struct amdgpu_acp   acp;
> @@ -990,12 +990,12 @@ struct amdgpu_device {
>  
>  static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
>  {
> - return ddev->dev_private;
> + return container_of(ddev, struct amdgpu_device, ddev);
>  }
>  
>  static inline struct drm_device *adev_to_drm(struct amdgpu_device *adev)
>  {
> - return adev->ddev;
> + return >ddev;
>  }
>  
>  static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device 
> *bdev)
> @@ -1004,8 +1004,6 @@ static inline struct amdgpu_device 
> *amdgpu_ttm_adev(struct ttm_bo_device *bdev)
>  }
>  
>  int amdgpu_device_init(struct amdgpu_device *adev,
> -struct drm_device *ddev,
> -struct pci_dev *pdev,
>  uint32_t flags);
>  void amdgpu_device_fini(struct amdgpu_device *adev);
>  int amdgpu_gpu_wait_for_idle(struct amdgpu_device *adev);
> @@ -1195,7 +1193,7 @@ static inline void *amdgpu_atpx_get_dhandle(void) { 
> return NULL; }
>  extern const struct drm_ioctl_desc amdgpu_ioctls_kms[];
>  extern const int amdgpu_max_kms_ioctl;
>  
> -int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags);
> +int amdgpu_driver_load_kms(struct amdgpu_device *adev, unsigned long flags);
>  void amdgpu_driver_unload_kms(struct drm_device *dev);
>  void amdgpu_driver_lastclose_kms(struct drm_device *dev);
>  int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file 
> *file_priv);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 07012d71eeea..6e529548e708 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1216,7 +1216,8 @@ static int amdgpu_device_check_arguments(struct 
> amdgpu_device *adev)
>   * Callback for the switcheroo driver.  Suspends or resumes the
>   * the asics before or after it is powered up using ACPI methods.
>   */
> -static void amdgpu_switcheroo_set_state(struct pci_dev *pdev, enum 
> vga_switcheroo_state state)
> +static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
> + enum vga_switcheroo_state state)
>  {
>   struct drm_device *dev = pci_get_drvdata(pdev);
>   int r;
> @@ -2977,8 +2978,6 @@ static const struct attribute *amdgpu_dev_attributes[] 
> = {
>   * amdgpu_device_init - initialize the driver
>   *
>   * @adev: amdgpu_device pointer
> - * @ddev: drm dev pointer
> - * @pdev: pci dev pointer
>   * @flags: driver flags
>   *
>   * Initializes the driver info and hw (all asics).
> @@ -2986,18 +2985,15 @@ static const struct attribute 
> *amdgpu_dev_attributes[] = {
>   * Called at driver startup.
>   */
>  int amdgpu_device_init(struct amdgpu_device *adev,
> -struct drm_device *ddev,
> -struct pci_dev *pdev,
>  uint32_t flags)
>  {
> + struct drm_device *ddev = adev_to_drm(adev);
> + struct pci_dev *pdev = adev->pdev;
>   int r, i;
>   bool boco = false;
>   u32 max_MBps;
>  
>   adev->shutdown = false;
> - adev->dev = >dev;
> - adev->ddev = ddev;
> - adev->pdev = pdev;
>   adev->flags = flags;
>  
>   if (amdgpu_force_asic_type >= 0 && amdgpu_force_asic_type < CHIP_LAST)
> @@ -3451,9 +3447,8 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
> fbcon)
>   

Re: [PATCH v2 1/2] drm: allow limiting the scatter list size.

2020-09-01 Thread Daniel Vetter
On Tue, Aug 18, 2020 at 11:20:16AM +0200, Gerd Hoffmann wrote:
> Add max_segment argument to drm_prime_pages_to_sg().  When set pass it
> through to the __sg_alloc_table_from_pages() call, otherwise use
> SCATTERLIST_MAX_SEGMENT.
> 
> Also add max_segment field to drm driver and pass it to
> drm_prime_pages_to_sg() calls in drivers and helpers.
> 
> v2: place max_segment in drm driver not gem object.
> 
> Signed-off-by: Gerd Hoffmann 
> ---
>  include/drm/drm_device.h|  8 
>  include/drm/drm_prime.h |  3 ++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c |  3 ++-
>  drivers/gpu/drm/drm_gem_shmem_helper.c  |  3 ++-
>  drivers/gpu/drm/drm_prime.c | 10 +++---
>  drivers/gpu/drm/etnaviv/etnaviv_gem.c   |  3 ++-
>  drivers/gpu/drm/etnaviv/etnaviv_gem_prime.c |  3 ++-
>  drivers/gpu/drm/msm/msm_gem.c   |  3 ++-
>  drivers/gpu/drm/msm/msm_gem_prime.c |  3 ++-
>  drivers/gpu/drm/nouveau/nouveau_prime.c |  3 ++-
>  drivers/gpu/drm/radeon/radeon_prime.c   |  3 ++-
>  drivers/gpu/drm/rockchip/rockchip_drm_gem.c |  6 --
>  drivers/gpu/drm/tegra/gem.c |  3 ++-
>  drivers/gpu/drm/vgem/vgem_drv.c |  3 ++-
>  drivers/gpu/drm/xen/xen_drm_front_gem.c |  3 ++-
>  15 files changed, 43 insertions(+), 17 deletions(-)
> 
> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> index 0988351d743c..47cb547a8115 100644
> --- a/include/drm/drm_device.h
> +++ b/include/drm/drm_device.h
> @@ -329,6 +329,14 @@ struct drm_device {
>*/
>   struct drm_fb_helper *fb_helper;
>  
> + /**
> +  * @max_segment:
> +  *
> +  * Max size for scatter list segments.  When unset the default
> +  * (SCATTERLIST_MAX_SEGMENT) is used.
> +  */
> + size_t max_segment;

Is there no better place for this then "at the bottom"? drm_device is a
huge structure, piling stuff up randomly doesn't make it better :-)

I think ideally we'd have a gem substruct like we have on the modeset side
at least.
-Daniel

> +
>   /* Everything below here is for legacy driver, never use! */
>   /* private: */
>  #if IS_ENABLED(CONFIG_DRM_LEGACY)
> diff --git a/include/drm/drm_prime.h b/include/drm/drm_prime.h
> index 9af7422b44cf..2c3689435cb4 100644
> --- a/include/drm/drm_prime.h
> +++ b/include/drm/drm_prime.h
> @@ -88,7 +88,8 @@ void drm_gem_dmabuf_vunmap(struct dma_buf *dma_buf, void 
> *vaddr);
>  int drm_gem_prime_mmap(struct drm_gem_object *obj, struct vm_area_struct 
> *vma);
>  int drm_gem_dmabuf_mmap(struct dma_buf *dma_buf, struct vm_area_struct *vma);
>  
> -struct sg_table *drm_prime_pages_to_sg(struct page **pages, unsigned int 
> nr_pages);
> +struct sg_table *drm_prime_pages_to_sg(struct page **pages, unsigned int 
> nr_pages,
> +size_t max_segment);
>  struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj,
>int flags);
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
> index 519ce4427fce..8f6a647757e7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
> @@ -303,7 +303,8 @@ static struct sg_table *amdgpu_dma_buf_map(struct 
> dma_buf_attachment *attach,
>   switch (bo->tbo.mem.mem_type) {
>   case TTM_PL_TT:
>   sgt = drm_prime_pages_to_sg(bo->tbo.ttm->pages,
> - bo->tbo.num_pages);
> + bo->tbo.num_pages,
> + obj->dev->max_segment);
>   if (IS_ERR(sgt))
>   return sgt;
>  
> diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c 
> b/drivers/gpu/drm/drm_gem_shmem_helper.c
> index 4b7cfbac4daa..8f47b41b0b2f 100644
> --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> @@ -656,7 +656,8 @@ struct sg_table *drm_gem_shmem_get_sg_table(struct 
> drm_gem_object *obj)
>  
>   WARN_ON(shmem->base.import_attach);
>  
> - return drm_prime_pages_to_sg(shmem->pages, obj->size >> PAGE_SHIFT);
> + return drm_prime_pages_to_sg(shmem->pages, obj->size >> PAGE_SHIFT,
> +  obj->dev->max_segment);
>  }
>  EXPORT_SYMBOL_GPL(drm_gem_shmem_get_sg_table);
>  
> diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
> index 1693aa7c14b5..27c783fd6633 100644
> --- a/drivers/gpu/drm/drm_prime.c
> +++ b/drivers/gpu/drm/drm_prime.c
> @@ -802,7 +802,8 @@ static const struct dma_buf_ops drm_gem_prime_dmabuf_ops 
> =  {
>   *
>   * This is useful for implementing _gem_object_funcs.get_sg_table.
>   */
> -struct sg_table *drm_prime_pages_to_sg(struct page **pages, unsigned int 
> nr_pages)
> +struct sg_table *drm_prime_pages_to_sg(struct page **pages, unsigned int 
> nr_pages,
> +size_t 

[PATCH] drm/radeon: Reset ASIC if suspend is not managed by platform firmware

2020-09-01 Thread Kai-Heng Feng
Suspend with s2idle or by the following steps cause screen frozen:
 # echo devices > /sys/power/pm_test
 # echo freeze > /sys/power/mem

[  289.625461] [drm:uvd_v1_0_ib_test [radeon]] *ERROR* radeon: fence wait timed 
out.
[  289.625494] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed 
testing IB on ring 5 (-110).

The issue doesn't happen on traditional S3, probably because firmware or
hardware provides extra power management.

Inspired by Daniel Drake's patch [1] on amdgpu, using a similar approach
can fix the issue.

[1] https://patchwork.freedesktop.org/patch/335839/

Signed-off-by: Kai-Heng Feng 
---
 drivers/gpu/drm/radeon/radeon_device.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/radeon/radeon_device.c 
b/drivers/gpu/drm/radeon/radeon_device.c
index 266e3cbbd09b..df823b9ad79f 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1643,6 +1644,8 @@ int radeon_suspend_kms(struct drm_device *dev, bool 
suspend,
rdev->asic->asic_reset(rdev, true);
pci_restore_state(dev->pdev);
} else if (suspend) {
+   if (pm_suspend_no_platform())
+   rdev->asic->asic_reset(rdev, true);
/* Shut down the device */
pci_disable_device(dev->pdev);
pci_set_power_state(dev->pdev, PCI_D3hot);
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 3/7] drm/amd/display: Avoid using unvalidated tiling_flags and tmz_surface in prepare_planes

2020-09-01 Thread Daniel Vetter
On Mon, Aug 17, 2020 at 02:23:47AM -0400, Marek Olšák wrote:
> On Wed, Aug 12, 2020 at 9:54 AM Daniel Vetter  wrote:
> 
> > On Tue, Aug 11, 2020 at 09:42:11AM -0400, Marek Olšák wrote:
> > > There are a few cases when the flags can change, for example DCC can be
> > > disabled due to a hw limitation in the 3d engine. Modifiers give the
> > > misleading impression that they help with that, but they don't. They
> > don't
> > > really help with anything.
> >
> > But if that happens, how do you tell the other side that it needs to
> > sample new flags? Does that just happen all the time?
> >
> > Also do the DDC state changes happen for shared buffers too?
> >
> 
> I thought we were only talking about shared buffers.
> 
> If the other side is only a consumer and the producer must disable DCC, the
> producer decompresses DCC and then disables it and updates the BO flags.
> The consumer doesn't need the new flags, because even if DCC stays enabled
> in the consumer, it's in a decompressed state (it has no effect). Only the
> producer knows it's disabled, and any new consumer will also know it when
> it queries the latest BO flags.
> 
> It doesn't work if both sides use writes, because it's not communicated
> that DCC is disabled (BO flags are queried only once). This hasn't been a
> problem so far.
> 
> Is there a way to disable DCC correctly and safely across processes? Yes.
> So why don't we do it? Because it would add more GPU overhead.

Yeah but in this case you can get away with just sampling the bo flags
once (which is what you're doing), so doing that at addfb time should be
perfectly fine. Ofc you might waste a bit of $something also scanning out
the compression metadata (which tells the hw that it's all uncompressed),
but that doesn't seem to be a problem for you.

So treating the legacy bo flags as invariant for shared buffers should be
perfectly fine.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: block ring buffer access during GPU recovery

2020-09-01 Thread Christian König

Am 01.09.20 um 03:17 schrieb Dennis Li:

When GPU is in reset, its status isn't stable and ring buffer also need
be reset when resuming. Therefore driver should protect GPU recovery
thread from ring buffer accessed by other threads. Otherwise GPU will
randomly hang during recovery.


One style comment inline, apart from that looks good to me.



Signed-off-by: Dennis Li 

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 172dc47b7f39..8db56a22cd1b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -319,8 +319,13 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, 
uint32_t reg,
  {
uint32_t ret;
  
-	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))

-   return amdgpu_kiq_rreg(adev, reg);
+   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
+   amdgpu_sriov_runtime(adev) &&
+   down_read_trylock(>reset_sem)) {


This is not correctly indented, the following lines of an if clause 
should be under the ( of the first line. Same for other cases as well.


Regards,
Christian.


+   ret = amdgpu_kiq_rreg(adev, reg);
+   up_read(>reset_sem);
+   return ret;
+   }
  
  	if ((reg * 4) < adev->rmmio_size)

ret = readl(((void __iomem *)adev->rmmio) + (reg * 4));
@@ -332,6 +337,7 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, 
uint32_t reg,
ret = readl(((void __iomem *)adev->rmmio) + (mmMM_DATA * 4));
spin_unlock_irqrestore(>mmio_idx_lock, flags);
}
+
trace_amdgpu_mm_rreg(adev->pdev->device, reg, ret);
return ret;
  }
@@ -407,8 +413,13 @@ void static inline amdgpu_mm_wreg_mmio(struct 
amdgpu_device *adev, uint32_t reg,
  void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
uint32_t acc_flags)
  {
-   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
-   return amdgpu_kiq_wreg(adev, reg, v);
+   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
+   amdgpu_sriov_runtime(adev) &&
+   down_read_trylock(>reset_sem)) {
+   amdgpu_kiq_wreg(adev, reg, v);
+   up_read(>reset_sem);
+   return;
+   }
  
  	amdgpu_mm_wreg_mmio(adev, reg, v, acc_flags);

  }
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index ad9ad622ccce..4ea2a065daa9 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -287,7 +287,7 @@ static void gmc_v10_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
 */
if (adev->gfx.kiq.ring.sched.ready &&
(amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) &&
-   !amdgpu_in_reset(adev)) {
+   down_read_trylock(>reset_sem)) {
  
  		struct amdgpu_vmhub *hub = >vmhub[vmhub];

const unsigned eng = 17;
@@ -297,6 +297,8 @@ static void gmc_v10_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
  
  		amdgpu_virt_kiq_reg_write_reg_wait(adev, req, ack, inv_req,

1 << vmid);
+
+   up_read(>reset_sem);
return;
}
  
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c

index e1a0ae327cf5..33b7cf1c79ec 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -501,12 +501,13 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
 */
if (adev->gfx.kiq.ring.sched.ready &&
(amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) 
&&
-   !amdgpu_in_reset(adev)) {
+   down_read_trylock(>reset_sem)) {
uint32_t req = hub->vm_inv_eng0_req + hub->eng_distance * eng;
uint32_t ack = hub->vm_inv_eng0_ack + hub->eng_distance * eng;
  
  		amdgpu_virt_kiq_reg_write_reg_wait(adev, req, ack, inv_req,

   1 << vmid);
+   up_read(>reset_sem);
return;
}
  
@@ -599,7 +600,8 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct amdgpu_device *adev,

if (amdgpu_in_reset(adev))
return -EIO;
  
-	if (ring->sched.ready) {

+   if (ring->sched.ready &&
+down_read_trylock(>reset_sem)) {
/* Vega20+XGMI caches PTEs in TC and TLB. Add a
 * heavy-weight TLB flush (type 2), which flushes
 * both. Due to a race condition with concurrent
@@ -626,6 +628,7 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct 
amdgpu_device *adev,
if (r) {
amdgpu_ring_undo(ring);
spin_unlock(>gfx.kiq.ring_lock);
+   up_read(>reset_sem);

Re: [PATCH] drm/amdgpu: block ring buffer access during GPU recovery

2020-09-01 Thread Christian König

Yeah, correct.

What we maybe should do is to add a WARN_ON() which tests if the current 
thread is the one which has locked the semaphore to catch this case.


Regards,
Christian.

Am 01.09.20 um 04:45 schrieb Li, Dennis:

[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey,
 
RE- Isn't adev->reset_sem non-recursive ? How this works when you try to access registers from within GPU reset thread while adev->reset_sem is already write locked from amdgpu_device_lock_adev earlier in the same thread ?


Deli: down_read_trylock will fail in this case, return false immediately and will 
not lock adev->reset_sem. In GPU reset thread, we should use MMIO instead of 
KIQ to access registers.

Best Regards
Dennis Li
-Original Message-
From: Grodzovsky, Andrey 
Sent: Tuesday, September 1, 2020 9:40 AM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Kuehling, Felix ; Zhang, Hawking 
; Koenig, Christian 
Subject: Re: [PATCH] drm/amdgpu: block ring buffer access during GPU recovery


On 8/31/20 9:17 PM, Dennis Li wrote:

When GPU is in reset, its status isn't stable and ring buffer also
need be reset when resuming. Therefore driver should protect GPU
recovery thread from ring buffer accessed by other threads. Otherwise
GPU will randomly hang during recovery.

Signed-off-by: Dennis Li 

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 172dc47b7f39..8db56a22cd1b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -319,8 +319,13 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, 
uint32_t reg,
   {
uint32_t ret;
   
-	if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))

-   return amdgpu_kiq_rreg(adev, reg);
+   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
+   amdgpu_sriov_runtime(adev) &&
+   down_read_trylock(>reset_sem)) {
+   ret = amdgpu_kiq_rreg(adev, reg);
+   up_read(>reset_sem);
+   return ret;
+   }


Isn't adev->reset_sem non-recursive ? How this works when you try to access 
registers from within GPU reset thread while adev->reset_sem is already write 
locked from amdgpu_device_lock_adev earlier in the same thread ?

Andrey


   
   	if ((reg * 4) < adev->rmmio_size)

ret = readl(((void __iomem *)adev->rmmio) + (reg * 4)); @@ 
-332,6
+337,7 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
ret = readl(((void __iomem *)adev->rmmio) + (mmMM_DATA * 4));
spin_unlock_irqrestore(>mmio_idx_lock, flags);
}
+
trace_amdgpu_mm_rreg(adev->pdev->device, reg, ret);
return ret;
   }
@@ -407,8 +413,13 @@ void static inline amdgpu_mm_wreg_mmio(struct 
amdgpu_device *adev, uint32_t reg,
   void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
uint32_t acc_flags)
   {
-   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
-   return amdgpu_kiq_wreg(adev, reg, v);
+   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
+   amdgpu_sriov_runtime(adev) &&
+   down_read_trylock(>reset_sem)) {
+   amdgpu_kiq_wreg(adev, reg, v);
+   up_read(>reset_sem);
+   return;
+   }
   
   	amdgpu_mm_wreg_mmio(adev, reg, v, acc_flags);

   }
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index ad9ad622ccce..4ea2a065daa9 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -287,7 +287,7 @@ static void gmc_v10_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
 */
if (adev->gfx.kiq.ring.sched.ready &&
(amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) &&
-   !amdgpu_in_reset(adev)) {
+   down_read_trylock(>reset_sem)) {
   
   		struct amdgpu_vmhub *hub = >vmhub[vmhub];

const unsigned eng = 17;
@@ -297,6 +297,8 @@ static void gmc_v10_0_flush_gpu_tlb(struct
amdgpu_device *adev, uint32_t vmid,
   
   		amdgpu_virt_kiq_reg_write_reg_wait(adev, req, ack, inv_req,

1 << vmid);
+
+   up_read(>reset_sem);
return;
}
   
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c

b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index e1a0ae327cf5..33b7cf1c79ec 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -501,12 +501,13 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
 */
if (adev->gfx.kiq.ring.sched.ready &&
(amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) 
&&
-   !amdgpu_in_reset(adev)) {
+   down_read_trylock(>reset_sem)) {
uint32_t req = hub->vm_inv_eng0_req + hub->eng_distance 

Re: [PATCH 1/1] drm/amdgpu: disable gpu-sched load balance for uvd

2020-09-01 Thread Paul Menzel

Dear Nirmoy,


Am 31.08.20 um 12:45 schrieb Nirmoy Das:

UVD dependent jobs should run on the same udv instance.


Why? Datasheet? Performance reasons? What happens if they do not run on 
the UVD instance? Are there bug reports?


It’d be great if you extended the commit message.


This patch disables gpu scheduler's load balancer for
a context which binds jobs from same the context to a udv
instance.

Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 59032c26fc82..7cd398d25498 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -114,7 +114,9 @@ static int amdgpu_ctx_init_entity(struct amdgpu_ctx *ctx, 
u32 hw_ip,
scheds = adev->gpu_sched[hw_ip][hw_prio].sched;
num_scheds = adev->gpu_sched[hw_ip][hw_prio].num_scheds;

-   if (hw_ip == AMDGPU_HW_IP_VCN_ENC || hw_ip == AMDGPU_HW_IP_VCN_DEC) {
+   if (hw_ip == AMDGPU_HW_IP_VCN_ENC ||
+   hw_ip == AMDGPU_HW_IP_VCN_DEC ||
+   hw_ip == AMDGPU_HW_IP_UVD) {
sched = drm_sched_pick_best(scheds, num_scheds);
scheds = 
num_scheds = 1;



Kind regards,

Paul
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 4/5] drm_dp_cec: add plumbing in preparation for MST support

2020-09-01 Thread Sam McNally
From: Hans Verkuil 

Signed-off-by: Hans Verkuil 
[sa...@chromium.org:
 - rebased
 - removed polling-related changes
 - moved the calls to drm_dp_cec_(un)set_edid() into the next patch
]
Signed-off-by: Sam McNally 
---

 .../display/amdgpu_dm/amdgpu_dm_mst_types.c   |  2 +-
 drivers/gpu/drm/drm_dp_cec.c  | 22 ++-
 drivers/gpu/drm/i915/display/intel_dp.c   |  2 +-
 drivers/gpu/drm/nouveau/nouveau_connector.c   |  2 +-
 include/drm/drm_dp_helper.h   |  6 +++--
 5 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
index 461fa4da0a34..6e7075893ec9 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
@@ -419,7 +419,7 @@ void amdgpu_dm_initialize_dp_connector(struct 
amdgpu_display_manager *dm,
 
drm_dp_aux_init(>dm_dp_aux.aux);
drm_dp_cec_register_connector(>dm_dp_aux.aux,
- >base);
+ >base, false);
 
if (aconnector->base.connector_type == DRM_MODE_CONNECTOR_eDP)
return;
diff --git a/drivers/gpu/drm/drm_dp_cec.c b/drivers/gpu/drm/drm_dp_cec.c
index 3ab2609f9ec7..04ab7b88055c 100644
--- a/drivers/gpu/drm/drm_dp_cec.c
+++ b/drivers/gpu/drm/drm_dp_cec.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Unfortunately it turns out that we have a chicken-and-egg situation
@@ -338,8 +339,6 @@ void drm_dp_cec_set_edid(struct drm_dp_aux *aux, const 
struct edid *edid)
if (aux->cec.adap) {
if (aux->cec.adap->capabilities == cec_caps &&
aux->cec.adap->available_log_addrs == num_las) {
-   /* Unchanged, so just set the phys addr */
-   cec_s_phys_addr_from_edid(aux->cec.adap, edid);
goto unlock;
}
/*
@@ -364,15 +363,16 @@ void drm_dp_cec_set_edid(struct drm_dp_aux *aux, const 
struct edid *edid)
if (cec_register_adapter(aux->cec.adap, connector->dev->dev)) {
cec_delete_adapter(aux->cec.adap);
aux->cec.adap = NULL;
-   } else {
-   /*
-* Update the phys addr for the new CEC adapter. When called
-* from drm_dp_cec_register_connector() edid == NULL, so in
-* that case the phys addr is just invalidated.
-*/
-   cec_s_phys_addr_from_edid(aux->cec.adap, edid);
}
 unlock:
+   /*
+* Update the phys addr for the new CEC adapter. When called
+* from drm_dp_cec_register_connector() edid == NULL, so in
+* that case the phys addr is just invalidated.
+*/
+   if (aux->cec.adap && edid) {
+   cec_s_phys_addr_from_edid(aux->cec.adap, edid);
+   }
mutex_unlock(>cec.lock);
 }
 EXPORT_SYMBOL(drm_dp_cec_set_edid);
@@ -418,6 +418,7 @@ EXPORT_SYMBOL(drm_dp_cec_unset_edid);
  * drm_dp_cec_register_connector() - register a new connector
  * @aux: DisplayPort AUX channel
  * @connector: drm connector
+ * @is_mst: set to true if this is an MST branch
  *
  * A new connector was registered with associated CEC adapter name and
  * CEC adapter parent device. After registering the name and parent
@@ -425,12 +426,13 @@ EXPORT_SYMBOL(drm_dp_cec_unset_edid);
  * CEC and to register a CEC adapter if that is the case.
  */
 void drm_dp_cec_register_connector(struct drm_dp_aux *aux,
-  struct drm_connector *connector)
+  struct drm_connector *connector, bool is_mst)
 {
WARN_ON(aux->cec.adap);
if (WARN_ON(!aux->transfer))
return;
aux->cec.connector = connector;
+   aux->cec.is_mst = is_mst;
INIT_DELAYED_WORK(>cec.unregister_work,
  drm_dp_cec_unregister_work);
 }
diff --git a/drivers/gpu/drm/i915/display/intel_dp.c 
b/drivers/gpu/drm/i915/display/intel_dp.c
index 82b9de274f65..744cb55572f9 100644
--- a/drivers/gpu/drm/i915/display/intel_dp.c
+++ b/drivers/gpu/drm/i915/display/intel_dp.c
@@ -6261,7 +6261,7 @@ intel_dp_connector_register(struct drm_connector 
*connector)
intel_dp->aux.dev = connector->kdev;
ret = drm_dp_aux_register(_dp->aux);
if (!ret)
-   drm_dp_cec_register_connector(_dp->aux, connector);
+   drm_dp_cec_register_connector(_dp->aux, connector, false);
return ret;
 }
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_connector.c 
b/drivers/gpu/drm/nouveau/nouveau_connector.c
index 49dd0cbc332f..671a70e95cd1 100644
--- a/drivers/gpu/drm/nouveau/nouveau_connector.c
+++ b/drivers/gpu/drm/nouveau/nouveau_connector.c
@@ -1414,7 +1414,7 @@ nouveau_connector_create(struct drm_device *dev,
switch (type) {
case