Re: [PATCH] drm/amdgpu: fix buffer funcs setting order on suspend

2023-12-05 Thread Luben Tuikov
On 2023-12-04 09:13, Alex Deucher wrote:
> We need to disable this after the last eviction
> call, but before we disable the SDMA IP.
> 
> Fixes: b70438004a14 ("drm/amdgpu: move buffer funcs setting up a level")
> Link: https://lore.kernel.org/r/87edgv4x3i@vps.thesusis.net
> Signed-off-by: Alex Deucher 
> Cc: Phillip Susi 
> Cc: Luben Tuikov 

Thank you Alex for this patch and thank you Phillip for testing it!
(Let's also add a Tested-by tag.)

Reviewed-by: Luben Tuikov 

Regards,
Luben

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index f29d0faf956e..b76ec5ec04c5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4593,8 +4593,6 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
>  
>   amdgpu_ras_suspend(adev);
>  
> - amdgpu_ttm_set_buffer_funcs_status(adev, false);
> -
>   amdgpu_device_ip_suspend_phase1(adev);
>  
>   if (!adev->in_s0ix)
> @@ -4604,6 +4602,8 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
>   if (r)
>   return r;
>  
> + amdgpu_ttm_set_buffer_funcs_status(adev, false);
> +
>   amdgpu_fence_driver_hw_fini(adev);
>  
>   amdgpu_device_ip_suspend_phase2(adev);


OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: Radeon regression in 6.6 kernel

2023-11-29 Thread Luben Tuikov
On 2023-11-29 22:36, Luben Tuikov wrote:
> On 2023-11-29 15:49, Alex Deucher wrote:
>> On Wed, Nov 29, 2023 at 3:10 PM Alex Deucher  wrote:
>>>
>>> Actually I think I see the problem.  I'll try and send out a patch
>>> later today to test.
>>
>> Does the attached patch fix it?
> 
> Thanks for the patch, Alex.
> 
> Is it possible for AMD to also reproduce this issue and test this patch on a 
> Navi23 system?
> 
>> From 96e75b5218f7a124eafa53853681eef8fe567ab8 Mon Sep 17 00:00:00 2001
>> From: Alex Deucher 
>> Date: Wed, 29 Nov 2023 15:44:25 -0500
>> Subject: [PATCH] drm/amdgpu: fix buffer funcs setting order on suspend
>>
>> We need to make disable this after the last eviction
> 
> "make disable" --> "disable"
> 
>> call, but before we disable the SDMA IP.
>>
>> Fixes: b70438004a14 ("drm/amdgpu: move buffer funcs setting up a level")
>> Link: 
>> https://lists.freedesktop.org/archives/amd-gfx/2023-November/101197.html
> 
> Link: https://lore.kernel.org/r/87edgv4x3i@vps.thesusis.net
> 
> Let's link the start of the thread.
> 
> Regards,
> Luben
> 
>> Signed-off-by: Alex Deucher 
>> Cc: Phillip Susi 
>> Cc: Luben Tuikov 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index b5edf40b5d03..78553e027db4 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4531,8 +4531,6 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
>>  
>>  amdgpu_ras_suspend(adev);
>>  
>> -amdgpu_ttm_set_buffer_funcs_status(adev, false);
>> -
>>  amdgpu_device_ip_suspend_phase1(adev);
>>  
>>  if (!adev->in_s0ix)
>> @@ -4542,6 +4540,8 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
>>  if (r)
>>  return r;
>>  
>> +amdgpu_ttm_set_buffer_funcs_status(adev, false);
>> +

If you're moving this past phase 1, there's another instance in 
amdgpu_device_ip_suspend(),
which may need to be moved down.

Regards,
Luben

>>  amdgpu_fence_driver_hw_fini(adev);
>>  
>>  amdgpu_device_ip_suspend_phase2(adev);
> 
>>
>> Alex
>>
>>>
>>> Alex
>>>
>>> On Wed, Nov 29, 2023 at 1:52 PM Alex Deucher  wrote:
>>>>
>>>> On Wed, Nov 29, 2023 at 11:41 AM Luben Tuikov  wrote:
>>>>>
>>>>> On 2023-11-29 10:22, Alex Deucher wrote:
>>>>>> On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher  
>>>>>> wrote:
>>>>>>>
>>>>>>> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov  
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> On 2023-11-28 17:13, Alex Deucher wrote:
>>>>>>>>> On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Alex Deucher  writes:
>>>>>>>>>>
>>>>>>>>>>>> In that case those are the already known problems with the 
>>>>>>>>>>>> scheduler
>>>>>>>>>>>> changes, aren't they?
>>>>>>>>>>>
>>>>>>>>>>> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
>>>>>>>>>>> misunderstanding what the original report was actually testing.  If 
>>>>>>>>>>> it
>>>>>>>>>>> was 6.7, then try reverting:
>>>>>>>>>>> 56e449603f0ac580700621a356d35d5716a62ce5
>>>>>>>>>>> b70438004a14f4d0f9890b3297cd66248728546c
>>>>>>>>>>
>>>>>>>>>> At some point it was suggested that I file a gitlab issue, but I took
>>>>>>>>>> this to mean it was already known and being worked on.  -rc3 came out
>>>>>>>>>> today and still has the problem.  Is there a known issue I could 
>>>>>>>>>> track?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> At this point, unless there are any objections, I think we should just
>>>>>>>>> revert the two patches

Re: Radeon regression in 6.6 kernel

2023-11-29 Thread Luben Tuikov
On 2023-11-29 15:49, Alex Deucher wrote:
> On Wed, Nov 29, 2023 at 3:10 PM Alex Deucher  wrote:
>>
>> Actually I think I see the problem.  I'll try and send out a patch
>> later today to test.
> 
> Does the attached patch fix it?

Thanks for the patch, Alex.

Is it possible for AMD to also reproduce this issue and test this patch on a 
Navi23 system?

> From 96e75b5218f7a124eafa53853681eef8fe567ab8 Mon Sep 17 00:00:00 2001
> From: Alex Deucher 
> Date: Wed, 29 Nov 2023 15:44:25 -0500
> Subject: [PATCH] drm/amdgpu: fix buffer funcs setting order on suspend
> 
> We need to make disable this after the last eviction

"make disable" --> "disable"

> call, but before we disable the SDMA IP.
> 
> Fixes: b70438004a14 ("drm/amdgpu: move buffer funcs setting up a level")
> Link: https://lists.freedesktop.org/archives/amd-gfx/2023-November/101197.html

Link: https://lore.kernel.org/r/87edgv4x3i@vps.thesusis.net

Let's link the start of the thread.

Regards,
Luben

> Signed-off-by: Alex Deucher 
> Cc: Phillip Susi 
> Cc: Luben Tuikov 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index b5edf40b5d03..78553e027db4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4531,8 +4531,6 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
>  
>   amdgpu_ras_suspend(adev);
>  
> - amdgpu_ttm_set_buffer_funcs_status(adev, false);
> -
>   amdgpu_device_ip_suspend_phase1(adev);
>  
>   if (!adev->in_s0ix)
> @@ -4542,6 +4540,8 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
>   if (r)
>   return r;
>  
> + amdgpu_ttm_set_buffer_funcs_status(adev, false);
> +
>   amdgpu_fence_driver_hw_fini(adev);
>  
>   amdgpu_device_ip_suspend_phase2(adev);

> 
> Alex
> 
>>
>> Alex
>>
>> On Wed, Nov 29, 2023 at 1:52 PM Alex Deucher  wrote:
>>>
>>> On Wed, Nov 29, 2023 at 11:41 AM Luben Tuikov  wrote:
>>>>
>>>> On 2023-11-29 10:22, Alex Deucher wrote:
>>>>> On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher  
>>>>> wrote:
>>>>>>
>>>>>> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov  
>>>>>> wrote:
>>>>>>>
>>>>>>> On 2023-11-28 17:13, Alex Deucher wrote:
>>>>>>>> On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Alex Deucher  writes:
>>>>>>>>>
>>>>>>>>>>> In that case those are the already known problems with the scheduler
>>>>>>>>>>> changes, aren't they?
>>>>>>>>>>
>>>>>>>>>> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
>>>>>>>>>> misunderstanding what the original report was actually testing.  If 
>>>>>>>>>> it
>>>>>>>>>> was 6.7, then try reverting:
>>>>>>>>>> 56e449603f0ac580700621a356d35d5716a62ce5
>>>>>>>>>> b70438004a14f4d0f9890b3297cd66248728546c
>>>>>>>>>
>>>>>>>>> At some point it was suggested that I file a gitlab issue, but I took
>>>>>>>>> this to mean it was already known and being worked on.  -rc3 came out
>>>>>>>>> today and still has the problem.  Is there a known issue I could 
>>>>>>>>> track?
>>>>>>>>>
>>>>>>>>
>>>>>>>> At this point, unless there are any objections, I think we should just
>>>>>>>> revert the two patches
>>>>>>> Uhm, no.
>>>>>>>
>>>>>>> Why "the two" patches?
>>>>>>>
>>>>>>> This email, part of this thread,
>>>>>>>
>>>>>>> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/
>>>>>>>
>>>>>>> clearly states that reverting *only* this commit,
>>>>>>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number 
>>>>>>> of run-queues
>>>>>>> *does not* mitigate the failed suspend.

Re: Radeon regression in 6.6 kernel

2023-11-29 Thread Luben Tuikov
On 2023-11-29 10:22, Alex Deucher wrote:
> On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher  wrote:
>>
>> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov  wrote:
>>>
>>> On 2023-11-28 17:13, Alex Deucher wrote:
>>>> On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  wrote:
>>>>>
>>>>> Alex Deucher  writes:
>>>>>
>>>>>>> In that case those are the already known problems with the scheduler
>>>>>>> changes, aren't they?
>>>>>>
>>>>>> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
>>>>>> misunderstanding what the original report was actually testing.  If it
>>>>>> was 6.7, then try reverting:
>>>>>> 56e449603f0ac580700621a356d35d5716a62ce5
>>>>>> b70438004a14f4d0f9890b3297cd66248728546c
>>>>>
>>>>> At some point it was suggested that I file a gitlab issue, but I took
>>>>> this to mean it was already known and being worked on.  -rc3 came out
>>>>> today and still has the problem.  Is there a known issue I could track?
>>>>>
>>>>
>>>> At this point, unless there are any objections, I think we should just
>>>> revert the two patches
>>> Uhm, no.
>>>
>>> Why "the two" patches?
>>>
>>> This email, part of this thread,
>>>
>>> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/
>>>
>>> clearly states that reverting *only* this commit,
>>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of 
>>> run-queues
>>> *does not* mitigate the failed suspend. (Furthermore, this commit doesn't 
>>> really change
>>> anything operational, other than using an allocated array, instead of a 
>>> static one, in DRM,
>>> while the 2nd patch is solely contained within the amdgpu driver code.)
>>>
>>> Leaving us with only this change,
>>> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level
>>> to be at fault, as the kernel log attached in the linked email above shows.
>>>
>>> The conclusion is that only b70438004a14f4 needs reverting.
>>
>> b70438004a14f4 was a fix for 56e449603f0ac5.  Without b70438004a14f4,
>> 56e449603f0ac5 breaks amdgpu.
> 
> We can try and re-enable it in the next kernel.  I'm just not sure
> we'll be able to fix this in time for 6.7 with the holidays and all
> and I don't want to cause a lot of scheduler churn at the end of the
> 6.7 cycle if we hold off and try and fix it.  Reverting seems like the
> best short term solution.

A lot of subsequent code has come in since commit 56e449603f0ac5, as it opened
the opportunity for a 1-to-1 relationship between an entity and a scheduler.
(Should've always been the case, from the outset. Not sure why it was coded as
a fixed-size array.)

Given that commit 56e449603f0ac5 has nothing to do with amdgpu, and the problem
is wholly contained in amdgpu, and no other driver has this problem, there is
no reason to have to "churn", i.e. go back and forth in DRM, only to cover up
an init bug in amdgpu. See the response I just sent in @this thread:
https://lore.kernel.org/r/05007cb0-871e-4dc7-af58-1351f4ba4...@gmail.com

And it's not like this issue is unknown. I first posted about it on 2023-10-16. 

Ideally, amdgpu would just fix their init code.
-- 
Regards,
Luben




Re: Radeon regression in 6.6 kernel

2023-11-29 Thread Luben Tuikov
On 2023-11-29 08:50, Alex Deucher wrote:
> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov  wrote:
>>
>> On 2023-11-28 17:13, Alex Deucher wrote:
>>> On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  wrote:
>>>>
>>>> Alex Deucher  writes:
>>>>
>>>>>> In that case those are the already known problems with the scheduler
>>>>>> changes, aren't they?
>>>>>
>>>>> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
>>>>> misunderstanding what the original report was actually testing.  If it
>>>>> was 6.7, then try reverting:
>>>>> 56e449603f0ac580700621a356d35d5716a62ce5
>>>>> b70438004a14f4d0f9890b3297cd66248728546c
>>>>
>>>> At some point it was suggested that I file a gitlab issue, but I took
>>>> this to mean it was already known and being worked on.  -rc3 came out
>>>> today and still has the problem.  Is there a known issue I could track?
>>>>
>>>
>>> At this point, unless there are any objections, I think we should just
>>> revert the two patches
>> Uhm, no.
>>
>> Why "the two" patches?
>>
>> This email, part of this thread,
>>
>> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/
>>
>> clearly states that reverting *only* this commit,
>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of 
>> run-queues
>> *does not* mitigate the failed suspend. (Furthermore, this commit doesn't 
>> really change
>> anything operational, other than using an allocated array, instead of a 
>> static one, in DRM,
>> while the 2nd patch is solely contained within the amdgpu driver code.)
>>
>> Leaving us with only this change,
>> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level
>> to be at fault, as the kernel log attached in the linked email above shows.
>>
>> The conclusion is that only b70438004a14f4 needs reverting.
> 
> b70438004a14f4 was a fix for 56e449603f0ac5.  Without b70438004a14f4,
> 56e449603f0ac5 breaks amdgpu.

It doesn't "break" it, amdgpu just needs to be fixed.

I know we put in a Fixes tag in 
b70438004a14f4 "drm/amdgpu: move buffer funcs setting up a level"
pointing to 56e449603f0ac5 "drm/sched: Convert the GPU scheduler to variable 
number of run-queues",
but given the testing Phillip has done, the culprit is wholly contained in
the amdgpu driver code.

No other driver has this problem since commit 56e449603f0ac5.

The Fixes tag in b70438004a14f4 "drm/amdgpu: move buffer funcs setting up a
level" should ideally have pointed to an amdgpu-driver commit only (perhaps an
old-old commit), and I was a bit uncomfortable putting in a Fixes tag which
pointed to DRM code, but we did it so that the amdgpu commit follows the
changes in DRM. In retrospect, the Fixes tag should've pointed to the
amdgpu-driver commit from when the amdgpu code was originally written.

I remember that the problem was really that amdgpu called 
drm_sched_entity_init(),
in amdgpu_ttm_set_buffer_funcs_status() without actually having initialized the 
scheduler
used therein. For instance, the code before commit b70438004a14f4, looked like 
this:

void amdgpu_ttm_set_buffer_funcs_status(struct amdgpu_device *adev, bool enable)
{
	struct ttm_resource_manager *man = ttm_manager_type(&adev->mman.bdev, TTM_PL_VRAM);
	uint64_t size;
	int r;

	if (!adev->mman.initialized || amdgpu_in_reset(adev) ||
	    adev->mman.buffer_funcs_enabled == enable)
		return;

	if (enable) {
		struct amdgpu_ring *ring;
		struct drm_gpu_scheduler *sched;

		ring = adev->mman.buffer_funcs_ring;
		sched = &ring->sched;                         <-- LT: No one has initialized this scheduler
		r = drm_sched_entity_init(&adev->mman.entity, <-- Oopses, now that sched->sched_rq is not a static array
					  DRM_SCHED_PRIORITY_KERNEL, &sched,
					  1, NULL);
		if (r) {
			DRM_ERROR("Failed setting up TTM BO move entity (%d)\n",
				  r);
			return;
		}


Before commit 56e449603f0ac5, amdgpu was getting away with this, because the 
sched->sched_rq
was a static array.

Ideally, amdgpu code would be fixed.
-- 
Regards,
Luben




Re: Radeon regression in 6.6 kernel

2023-11-28 Thread Luben Tuikov
On 2023-11-28 17:13, Alex Deucher wrote:
> On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi  wrote:
>>
>> Alex Deucher  writes:
>>
 In that case those are the already known problems with the scheduler
 changes, aren't they?
>>>
>>> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
>>> misunderstanding what the original report was actually testing.  If it
>>> was 6.7, then try reverting:
>>> 56e449603f0ac580700621a356d35d5716a62ce5
>>> b70438004a14f4d0f9890b3297cd66248728546c
>>
>> At some point it was suggested that I file a gitlab issue, but I took
>> this to mean it was already known and being worked on.  -rc3 came out
>> today and still has the problem.  Is there a known issue I could track?
>>
> 
> At this point, unless there are any objections, I think we should just
> revert the two patches
Uhm, no.

Why "the two" patches?

This email, part of this thread,

https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/

clearly states that reverting *only* this commit,
56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of 
run-queues
*does not* mitigate the failed suspend. (Furthermore, this commit doesn't 
really change
anything operational, other than using an allocated array, instead of a static 
one, in DRM,
while the 2nd patch is solely contained within the amdgpu driver code.)

Leaving us with only this change,
b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level
to be at fault, as the kernel log attached in the linked email above shows.

The conclusion is that only b70438004a14f4 needs reverting.
-- 
Regards,
Luben




Re: Radeon regression in 6.6 kernel

2023-11-22 Thread Luben Tuikov
On 2023-11-21 17:05, Phillip Susi wrote:
> Alex Deucher  writes:
> 
>> Does reverting 56e449603f0ac580700621a356d35d5716a62ce5 alone fix it?
>> Can you also attach your full dmesg log for the failed suspend?
> 
> No, it doesn't.  Here is the full syslog from the boot with only that
> revert:
> 

Thank you Phillip for verifying this.

BTW, luben.tui...@amd.com should absolutely bounce for everyone sending emails 
to it. Not sure why it is still active.
My new email is the one this email is coming from.
-- 
Regards,
Luben




Re: [PATCH] drm/amdgpu: move UVD and VCE sched entity init after sched init

2023-11-09 Thread Luben Tuikov
On 2023-11-09 11:13, Alex Deucher wrote:
> Ping?
> 
> On Wed, Nov 8, 2023 at 1:42 PM Alex Deucher  wrote:
>>
>> We need kernel scheduling entities to deal with handle clean up
>> if apps are not cleaned up properly.  With commit 56e449603f0ac5
>> ("drm/sched: Convert the GPU scheduler to variable number of run-queues")
>> the scheduler entities have to be created after scheduler init, so
>> change the ordering to fix this.
>>
>> v2: Leave logic in UVD and VCE code
>>
>> Fixes: 56e449603f0ac5 ("drm/sched: Convert the GPU scheduler to variable 
>> number of run-queues")
>> Signed-off-by: Alex Deucher 
>> Cc: ltuiko...@gmail.com

Reviewed-by: Luben Tuikov 

Regards,
Luben

>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c| 22 ++--
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.h|  2 +-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c| 24 +++---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vce.h|  2 +-
>>  drivers/gpu/drm/amd/amdgpu/uvd_v3_1.c  |  2 --
>>  drivers/gpu/drm/amd/amdgpu/uvd_v4_2.c  |  2 --
>>  drivers/gpu/drm/amd/amdgpu/uvd_v5_0.c  |  2 --
>>  drivers/gpu/drm/amd/amdgpu/uvd_v6_0.c  |  2 --
>>  drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c  |  4 
>>  drivers/gpu/drm/amd/amdgpu/vce_v2_0.c  |  2 --
>>  drivers/gpu/drm/amd/amdgpu/vce_v3_0.c  |  2 --
>>  drivers/gpu/drm/amd/amdgpu/vce_v4_0.c  |  5 -
>>  13 files changed, 37 insertions(+), 46 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 43a95feba884..03e669c34033 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -2499,6 +2499,18 @@ static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
>>   ring->name);
>> return r;
>> }
>> +   r = amdgpu_uvd_entity_init(adev, ring);
>> +   if (r) {
>> +   DRM_ERROR("Failed to create UVD scheduling entity on ring %s.\n",
>> + ring->name);
>> +   return r;
>> +   }
>> +   r = amdgpu_vce_entity_init(adev, ring);
>> +   if (r) {
>> +   DRM_ERROR("Failed to create VCE scheduling entity on ring %s.\n",
>> + ring->name);
>> +   return r;
>> +   }
>> }
>>
>> amdgpu_xcp_update_partition_sched_list(adev);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>> index 815b7c34ed33..65949cc7abb9 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>> @@ -399,20 +399,20 @@ int amdgpu_uvd_sw_fini(struct amdgpu_device *adev)
>>   *
>>   * @adev: amdgpu_device pointer
>>   *
>> + * Initialize the entity used for handle management in the kernel driver.
>>   */
>> -int amdgpu_uvd_entity_init(struct amdgpu_device *adev)
>> +int amdgpu_uvd_entity_init(struct amdgpu_device *adev, struct amdgpu_ring *ring)
>>  {
>> -   struct amdgpu_ring *ring;
>> -   struct drm_gpu_scheduler *sched;
>> -   int r;
>> +   if (ring == &adev->uvd.inst[0].ring) {
>> +   struct drm_gpu_scheduler *sched = &ring->sched;
>> +   int r;
>>
>> -   ring = &adev->uvd.inst[0].ring;
>> -   sched = &ring->sched;
>> -   r = drm_sched_entity_init(&adev->uvd.entity, DRM_SCHED_PRIORITY_NORMAL,
>> - &sched, 1, NULL);
>> -   if (r) {
>> -   DRM_ERROR("Failed setting up UVD kernel entity.\n");
>> -   return r;
>> +   r = drm_sched_entity_init(&adev->uvd.entity, DRM_SCHED_PRIORITY_NORMAL,
>> + &sched, 1, NULL);
>> +   if (r) {
>> +   DRM_ERROR("Failed setting up UVD kernel entity.\n");
>> +   return r;
>> +   }
>> }
>>
>> return 0;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.h
>> index a9f342537c68..9dfad2f48ef4 100644
>> --- a/drivers/gpu

Re: [PATCH] drm/amdgpu: move buffer funcs setting up a level

2023-10-26 Thread Luben Tuikov
Pushed to drm-misc-next.

Regards,
Luben

On 2023-10-26 15:52, Luben Tuikov wrote:
> On 2023-10-26 15:32, Alex Deucher wrote:
>> On Thu, Oct 26, 2023 at 2:22 AM Christian König
>>  wrote:
>>>
>>> Am 25.10.23 um 19:19 schrieb Alex Deucher:
>>>> Rather than doing this in the IP code for the SDMA paging
>>>> engine, move it up to the core device level init level.
>>>> This should fix the scheduler init ordering.
>>>>
>>>> v2: drop extra parens
>>>> v3: drop SDMA helpers
>>>>
>>>> Tested-by: Luben Tuikov 
>>>> Signed-off-by: Alex Deucher 
>>>
>>> I don't know of hand if the high level function really cover everything,
>>> so only Acked-by: Christian König  for now.
>>>
>>
>> Luben,
>>
>> Was this needed for some of the scheduler stuff that is pending?  If
>> you would rather take it via drm-misc to align with the scheduler
>> changes, that works for me, otherwise I can take it via the amdgpu
>> tree.
> 
> Hi Alex,
> 
> Yes, it does.
> 
> I can take it via drm-misc-next as that where the scheduler changes landed.
> 
> I'll add Christian's Acked-by.
> 
> I'll add a Fixes tag because ideally it should've gone before the dynamic
> sched_rq commit.
> 
> Thanks for the heads-up!
> 
> Regards,
> Luben
> 
> 
> 
>>
>> Thanks,
>>
>> Alex
>>
>>
>>> Christian.
>>>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +++
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c   | 21 -
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h   |  1 -
>>>>   drivers/gpu/drm/amd/amdgpu/cik_sdma.c  |  5 -
>>>>   drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c |  5 -
>>>>   drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c |  5 -
>>>>   drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 16 +---
>>>>   drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c | 10 +-
>>>>   drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 10 +-
>>>>   drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 10 +-
>>>>   drivers/gpu/drm/amd/amdgpu/si_dma.c|  5 -
>>>>   11 files changed, 19 insertions(+), 84 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 2031a467b721..5c90080e93ba 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -2662,6 +2662,9 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
>>>>   if (r)
>>>>   goto init_failed;
>>>>
>>>> + if (adev->mman.buffer_funcs_ring->sched.ready)
>>>> + amdgpu_ttm_set_buffer_funcs_status(adev, true);
>>>> +
>>>>   /* Don't init kfd if whole hive need to be reset during init */
>>>>   if (!adev->gmc.xgmi.pending_reset) {
>>>>   kgd2kfd_init_zone_device(adev);
>>>> @@ -3260,6 +3263,8 @@ int amdgpu_device_ip_suspend(struct amdgpu_device *adev)
>>>>   amdgpu_virt_request_full_gpu(adev, false);
>>>>   }
>>>>
>>>> + amdgpu_ttm_set_buffer_funcs_status(adev, false);
>>>> +
>>>>   r = amdgpu_device_ip_suspend_phase1(adev);
>>>>   if (r)
>>>>   return r;
>>>> @@ -3449,6 +3454,9 @@ static int amdgpu_device_ip_resume(struct amdgpu_device *adev)
>>>>
>>>>   r = amdgpu_device_ip_resume_phase2(adev);
>>>>
>>>> + if (adev->mman.buffer_funcs_ring->sched.ready)
>>>> + amdgpu_ttm_set_buffer_funcs_status(adev, true);
>>>> +
>>>>   return r;
>>>>   }
>>>>
>>>> @@ -4236,6 +4244,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>>>>   /* disable ras feature must before hw fini */
>>>>   amdgpu_ras_pre_fini(adev);
>>>>
>>>> + amdgpu_ttm_set_buffer_funcs_status(adev, false);
>>>> +
>>>>   amdgpu_device_ip_fini_early(adev);
>>>>
>>>>   amdgpu_irq_fini_hw(adev);
>>>> @@ -4407,6 +4417,8 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)

Re: [PATCH] drm/amdgpu: move buffer funcs setting up a level

2023-10-26 Thread Luben Tuikov
On 2023-10-26 15:32, Alex Deucher wrote:
> On Thu, Oct 26, 2023 at 2:22 AM Christian König
>  wrote:
>>
>> Am 25.10.23 um 19:19 schrieb Alex Deucher:
>>> Rather than doing this in the IP code for the SDMA paging
>>> engine, move it up to the core device level init level.
>>> This should fix the scheduler init ordering.
>>>
>>> v2: drop extra parens
>>> v3: drop SDMA helpers
>>>
>>> Tested-by: Luben Tuikov 
>>> Signed-off-by: Alex Deucher 
>>
>> I don't know of hand if the high level function really cover everything,
>> so only Acked-by: Christian König  for now.
>>
> 
> Luben,
> 
> Was this needed for some of the scheduler stuff that is pending?  If
> you would rather take it via drm-misc to align with the scheduler
> changes, that works for me, otherwise I can take it via the amdgpu
> tree.

Hi Alex,

Yes, it does.

I can take it via drm-misc-next as that where the scheduler changes landed.

I'll add Christian's Acked-by.

I'll add a Fixes tag because ideally it should've gone before the dynamic
sched_rq commit.

Thanks for the heads-up!

Regards,
Luben



> 
> Thanks,
> 
> Alex
> 
> 
>> Christian.
>>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c   | 21 -
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h   |  1 -
>>>   drivers/gpu/drm/amd/amdgpu/cik_sdma.c  |  5 -
>>>   drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c |  5 -
>>>   drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c |  5 -
>>>   drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 16 +---
>>>   drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c | 10 +-
>>>   drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 10 +-
>>>   drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 10 +-
>>>   drivers/gpu/drm/amd/amdgpu/si_dma.c|  5 -
>>>   11 files changed, 19 insertions(+), 84 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 2031a467b721..5c90080e93ba 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -2662,6 +2662,9 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
>>>   if (r)
>>>   goto init_failed;
>>>
>>> + if (adev->mman.buffer_funcs_ring->sched.ready)
>>> + amdgpu_ttm_set_buffer_funcs_status(adev, true);
>>> +
>>>   /* Don't init kfd if whole hive need to be reset during init */
>>>   if (!adev->gmc.xgmi.pending_reset) {
>>>   kgd2kfd_init_zone_device(adev);
>>> @@ -3260,6 +3263,8 @@ int amdgpu_device_ip_suspend(struct amdgpu_device *adev)
>>>   amdgpu_virt_request_full_gpu(adev, false);
>>>   }
>>>
>>> + amdgpu_ttm_set_buffer_funcs_status(adev, false);
>>> +
>>>   r = amdgpu_device_ip_suspend_phase1(adev);
>>>   if (r)
>>>   return r;
>>> @@ -3449,6 +3454,9 @@ static int amdgpu_device_ip_resume(struct amdgpu_device *adev)
>>>
>>>   r = amdgpu_device_ip_resume_phase2(adev);
>>>
>>> + if (adev->mman.buffer_funcs_ring->sched.ready)
>>> + amdgpu_ttm_set_buffer_funcs_status(adev, true);
>>> +
>>>   return r;
>>>   }
>>>
>>> @@ -4236,6 +4244,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>>>   /* disable ras feature must before hw fini */
>>>   amdgpu_ras_pre_fini(adev);
>>>
>>> + amdgpu_ttm_set_buffer_funcs_status(adev, false);
>>> +
>>>   amdgpu_device_ip_fini_early(adev);
>>>
>>>   amdgpu_irq_fini_hw(adev);
>>> @@ -4407,6 +4417,8 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
>>>
>>>   amdgpu_ras_suspend(adev);
>>>
>>> + amdgpu_ttm_set_buffer_funcs_status(adev, false);
>>> +
>>>   amdgpu_device_ip_suspend_phase1(adev);
>>>
>>>   if (!adev->in_s0ix)
>>> @@ -5178,6 +5190,9 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>>>   if (r)
>>>   goto out;
>>>
>>> + if (tmp_adev->mman.buffer_funcs_ring->sched.ready)

[PATCH] MAINTAINERS: Update the GPU Scheduler email

2023-10-26 Thread Luben Tuikov
Update the GPU Scheduler maintainer email.

Cc: Alex Deucher 
Cc: Christian König 
Cc: Daniel Vetter 
Cc: Dave Airlie 
Cc: AMD Graphics 
Cc: Direct Rendering Infrastructure - Development 

Signed-off-by: Luben Tuikov 
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 4452508bc1b040..f13e476ed8038b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7153,7 +7153,7 @@ F:Documentation/devicetree/bindings/display/xlnx/
 F: drivers/gpu/drm/xlnx/
 
 DRM GPU SCHEDULER
-M: Luben Tuikov 
+M: Luben Tuikov 
 L: dri-de...@lists.freedesktop.org
 S: Maintained
 T: git git://anongit.freedesktop.org/drm/drm-misc

base-commit: 56e449603f0ac580700621a356d35d5716a62ce5
-- 
2.42.0



[PATCH] drm/amdgpu: move buffer funcs setting up a level (v2)

2023-10-24 Thread Luben Tuikov
From: Alex Deucher 

Rather than doing this in the IP code for the SDMA paging
engine, move it up to the core device level init level.
This should fix the scheduler init ordering.

v2: Fix checkpatch parens complaint; long lines. (Luben)

Signed-off-by: Alex Deucher 
Tested-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c   | 21 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h   |  1 +
 drivers/gpu/drm/amd/amdgpu/cik_sdma.c  |  5 -
 drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c |  5 -
 drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c |  5 -
 drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 16 +---
 drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c | 10 +-
 drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 10 +-
 drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 10 +-
 drivers/gpu/drm/amd/amdgpu/si_dma.c|  5 -
 11 files changed, 38 insertions(+), 62 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 7ec32b44df052f..47c1e60109c14c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2662,6 +2662,8 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
if (r)
goto init_failed;
 
+   amdgpu_sdma_set_buffer_funcs_helper(adev);
+
/* Don't init kfd if whole hive need to be reset during init */
if (!adev->gmc.xgmi.pending_reset) {
kgd2kfd_init_zone_device(adev);
@@ -3260,6 +3262,8 @@ int amdgpu_device_ip_suspend(struct amdgpu_device *adev)
amdgpu_virt_request_full_gpu(adev, false);
}
 
+   amdgpu_sdma_unset_buffer_funcs_helper(adev);
+
r = amdgpu_device_ip_suspend_phase1(adev);
if (r)
return r;
@@ -3449,6 +3453,8 @@ static int amdgpu_device_ip_resume(struct amdgpu_device 
*adev)
 
r = amdgpu_device_ip_resume_phase2(adev);
 
+   amdgpu_sdma_set_buffer_funcs_helper(adev);
+
return r;
 }
 
@@ -4236,6 +4242,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
/* disable ras feature must before hw fini */
amdgpu_ras_pre_fini(adev);
 
+   amdgpu_sdma_unset_buffer_funcs_helper(adev);
+
amdgpu_device_ip_fini_early(adev);
 
amdgpu_irq_fini_hw(adev);
@@ -4407,6 +4415,8 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
fbcon)
 
amdgpu_ras_suspend(adev);
 
+   amdgpu_sdma_unset_buffer_funcs_helper(adev);
+
amdgpu_device_ip_suspend_phase1(adev);
 
if (!adev->in_s0ix)
@@ -5178,6 +5188,8 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,
if (r)
goto out;
 
+   amdgpu_sdma_set_buffer_funcs_helper(tmp_adev);
+
	if (vram_lost)
		amdgpu_device_fill_reset_magic(tmp_adev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
index e8cbc4142d8021..c4d642b06f3c5b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
@@ -292,6 +292,27 @@ int amdgpu_sdma_init_microcode(struct amdgpu_device *adev,
return err;
 }
 
+void amdgpu_sdma_set_buffer_funcs_helper(struct amdgpu_device *adev)
+{
+   struct amdgpu_ring *sdma;
+   int i;
+
+   for (i = 0; i < adev->sdma.num_instances; i++) {
+   if (adev->sdma.has_page_queue) {
+   sdma = &adev->sdma.instance[i].page;
+   if (adev->mman.buffer_funcs_ring == sdma && sdma->sched.ready) {
+   amdgpu_ttm_set_buffer_funcs_status(adev, true);
+   break;
+   }
+   }
+   sdma = &adev->sdma.instance[i].ring;
+   if (adev->mman.buffer_funcs_ring == sdma && sdma->sched.ready) {
+   amdgpu_ttm_set_buffer_funcs_status(adev, true);
+   break;
+   }
+   }
+}
+
 void amdgpu_sdma_unset_buffer_funcs_helper(struct amdgpu_device *adev)
 {
struct amdgpu_ring *sdma;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
index 513ac22120c1fa..33209593e97461 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
@@ -169,6 +169,7 @@ int amdgpu_sdma_init_microcode(struct amdgpu_device *adev, 
u32 instance,
   bool duplicate);
 void amdgpu_sdma_destroy_inst_ctx(struct amdgpu_device *adev,
 bool duplicate);
+void amdgpu_sdma_set_buffer_funcs_helper(struct amdgpu_device *adev);
 void amdgpu_sdma_unset_buffer_funcs_helper(struct amdgpu_device *adev);
 int amdgpu_sdma_ras_

Re: [PATCH] drm/amdgpu: Initialize schedulers before using them

2023-10-24 Thread Luben Tuikov
On 2023-10-24 10:46, Alex Deucher wrote:
> On Tue, Oct 24, 2023 at 6:14 AM Christian König
>  wrote:
>>
>> [SNIP]
>>> Let me take a closer look first
>>
>> I think I've figured out why this isn't working as expected. It started
>> with this patch here:
>>
>> commit 5fd8518d187ed03403a4d4f7f56f52c00b11c148
>> Author: Andrey Grodzovsky 
>> Date:   Mon Dec 6 14:59:35 2021 -0500
>>
>>  drm/amdgpu: Move scheduler init to after XGMI is ready
>>
>>  Before we initialize schedulers we must know which reset
>>  domain we are in - for a single device there is a single
>>  domain per device and so single wq per device. For XGMI
>>  the reset domain spans the entire XGMI hive and so the
>>  reset wq is per hive.
>>
>>  Signed-off-by: Andrey Grodzovsky 
>>  Reviewed-by: Christian König 
>>  Link: https://www.spinics.net/lists/amd-gfx/msg74112.html
>>
>> Andrey separated the scheduler initialization from the ring init because
>> we need some of the rings for XGMI initialization which in turn is
>> necessary to figure out the XGMI hive and so the reset domain for the
>> scheduler.
>>
>> The code inside amdgpu_ttm_set_buffer_funcs_status() is actually
>> correct, the problem is that this is called as part of the hw init which
>> comes earlier than the scheduler init.
>>
>> @Alex, Ideas how to fix this? My best guess is that we should move the
>> call to amdgpu_ttm_set_buffer_funcs_status() from the DMA specific code
>> into the higher level handling in amdgpu_device.c
> 
> Yes, I think so, but there could be some tricky ordering issues with
> respect to suspend and resume.  I think something like the attached
> patch should do the trick.

This patch works. I've tested suspend and resume too.

Tested-by: Luben Tuikov 

scripts/checkpatch.pl complains about extra parenthesis.

-- 
Regards,
Luben



Re: [PATCH] drm/amdgpu: Initialize schedulers before using them

2023-10-23 Thread Luben Tuikov
On 2023-10-23 01:49, Christian König wrote:
> 
> 
> Am 23.10.23 um 05:23 schrieb Luben Tuikov:
>> Initialize ring schedulers before using them, very early in the amdgpu boot,
>> at PCI probe time, specifically in the fill-buffer path of frame-buffer dumb-create.
>>
>> This was discovered by using dynamic scheduler run-queues, which showed that
>> amdgpu was using a scheduler before calling drm_sched_init(), and the only
>> reason it was working was because sched_rq[] was statically allocated in the
>> scheduler structure. However, the scheduler structure had _not_ been
>> initialized.
>>
>> When switching to dynamically allocated run-queues, this lack of
>> initialization was causing an oops and a blank screen at boot up. This patch
>> fixes this amdgpu bug.
>>
>> This patch depends on the "drm/sched: Convert the GPU scheduler to variable
>> number of run-queues" patch, as that patch prevents subsequent scheduler
>> initialization if a scheduler has already been initialized.
>>
>> Cc: Christian König 
>> Cc: Alex Deucher 
>> Cc: Felix Kuehling 
>> Cc: AMD Graphics 
>> Signed-off-by: Luben Tuikov 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 14 ++
>>   1 file changed, 14 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> index 4e51dce3aab5d6..575ef7e1e30fd4 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> @@ -60,6 +60,7 @@
>>   #include "amdgpu_atomfirmware.h"
>>   #include "amdgpu_res_cursor.h"
>>   #include "bif/bif_4_1_d.h"
>> +#include "amdgpu_reset.h"
>>   
>>   MODULE_IMPORT_NS(DMA_BUF);
>>   
>> @@ -2059,6 +2060,19 @@ void amdgpu_ttm_set_buffer_funcs_status(struct 
>> amdgpu_device *adev, bool enable)
>>   
>>  ring = adev->mman.buffer_funcs_ring;
>>  sched = >sched;
>> +
>> +r = drm_sched_init(sched, &amdgpu_sched_ops,
>> +   DRM_SCHED_PRIORITY_COUNT,
>> +   ring->num_hw_submission, 0,
>> +   adev->sdma_timeout, adev->reset_domain->wq,
>> +   ring->sched_score, ring->name,
>> +   adev->dev);
>> +if (r) {
>> +drm_err(adev, "%s: couldn't initialize ring:%s 
>> error:%d\n",
>> +__func__, ring->name, r);
>> +return;
>> +}
> 
> That doesn't look correct either.
> 
> amdgpu_ttm_set_buffer_funcs_status() should only be called with 
> enable=true as argument *after* the copy ring is initialized and valid 
> to use. One part of this ring initialization is to setup the scheduler.

It's the only way to keep the functionality of amdgpu_fill_buffer()
from amdgpu_mode_dumb_create(), from drm_client_framebuffer_create(),
from ... without an oops and a blank screen at boot up.

Here is a stack of the oops:

Oct 20 22:12:34 fedora kernel: RIP: 0010:drm_sched_job_arm+0x1f/0x60 [gpu_sched]
Oct 20 22:12:34 fedora kernel: Code: 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 
00 00 55 53 48 8b 6f 58 48 85 ed 74 3f 48 89 fb 48 89 ef e8 95 34 00 00 48 8b 
45 10 <48> 8b 50 08 48 89 53 18 8b 45 24 89 43 54 b8 01 00 00 00 f0 48 0f
Oct 20 22:12:34 fedora kernel: RSP: 0018:c90001613838 EFLAGS: 00010246
Oct 20 22:12:34 fedora kernel: RAX:  RBX: 88812f33b400 RCX: 
0004
Oct 20 22:12:34 fedora kernel: RDX:  RSI: c9000395145c RDI: 
88812eacf850
Oct 20 22:12:34 fedora kernel: RBP: 88812eacf850 R08: 0004 R09: 
0003
Oct 20 22:12:34 fedora kernel: R10: c066b850 R11: bc848ef1 R12: 

Oct 20 22:12:34 fedora kernel: R13: 0004 R14: 00800300 R15: 
0100
Oct 20 22:12:34 fedora kernel: FS:  7f7be4866940() 
GS:0ed0() knlGS:
Oct 20 22:12:34 fedora kernel: CS:  0010 DS:  ES:  CR0: 80050033
Oct 20 22:12:34 fedora kernel: CR2: 0008 CR3: 00012cf22000 CR4: 
003506e0
Oct 20 22:12:34 fedora kernel: Call Trace:
Oct 20 22:12:34 fedora kernel:  
Oct 20 22:12:34 fedora kernel:  ? __die+0x1f/0x70
Oct 20 22:12:34 fedora kernel:  ? page_fault_oops+0x149/0x440
Oct 20 22:12:34 fedora kernel:  ? drm_sched_fence_alloc+0x1a/0x40 [gpu_sched]
Oct 20 22:12:34 fedora kernel:  ? amdgpu_job_alloc_with_ib+0x34/0xb0 [amdgpu]
Oct 20 22:12:34 fedora kernel:  ? srso_return_thunk+0x5/0x10
Oct 20 22:1

[PATCH] drm/amdgpu: Initialize schedulers before using them

2023-10-22 Thread Luben Tuikov
Initialize ring schedulers before using them, very early in the amdgpu boot,
at PCI probe time, specifically in the fill-buffer path of frame-buffer dumb-create.

This was discovered by using dynamic scheduler run-queues, which showed that
amdgpu was using a scheduler before calling drm_sched_init(), and the only
reason it was working was because sched_rq[] was statically allocated in the
scheduler structure. However, the scheduler structure had _not_ been
initialized.

When switching to dynamically allocated run-queues, this lack of
initialization was causing an oops and a blank screen at boot up. This patch
fixes this amdgpu bug.

This patch depends on the "drm/sched: Convert the GPU scheduler to variable
number of run-queues" patch, as that patch prevents subsequent scheduler
initialization if a scheduler has already been initialized.

Cc: Christian König 
Cc: Alex Deucher 
Cc: Felix Kuehling 
Cc: AMD Graphics 
Signed-off-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 4e51dce3aab5d6..575ef7e1e30fd4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -60,6 +60,7 @@
 #include "amdgpu_atomfirmware.h"
 #include "amdgpu_res_cursor.h"
 #include "bif/bif_4_1_d.h"
+#include "amdgpu_reset.h"
 
 MODULE_IMPORT_NS(DMA_BUF);
 
@@ -2059,6 +2060,19 @@ void amdgpu_ttm_set_buffer_funcs_status(struct 
amdgpu_device *adev, bool enable)
 
ring = adev->mman.buffer_funcs_ring;
	sched = &ring->sched;
+
+   r = drm_sched_init(sched, &amdgpu_sched_ops,
+  DRM_SCHED_PRIORITY_COUNT,
+  ring->num_hw_submission, 0,
+  adev->sdma_timeout, adev->reset_domain->wq,
+  ring->sched_score, ring->name,
+  adev->dev);
+   if (r) {
+   drm_err(adev, "%s: couldn't initialize ring:%s error:%d\n",
+   __func__, ring->name, r);
+   return;
+   }
+
	r = drm_sched_entity_init(&adev->mman.high_pr,
				  DRM_SCHED_PRIORITY_KERNEL, &sched,
				  1, NULL);

base-commit: 05d3ef8bba77c1b5f98d941d8b2d4aeab8118ef1
prerequisite-patch-id: c52673df9b6fc9ee001d6261c7ac107b618912a0
-- 
2.42.0



Re: [PATCH] drm/amdgpu: Remove redundant call to priority_is_valid()

2023-10-21 Thread Luben Tuikov
On 2023-10-20 12:37, Alex Deucher wrote:
> On Tue, Oct 17, 2023 at 9:22 PM Luben Tuikov  wrote:
>>
>> Remove a redundant call to amdgpu_ctx_priority_is_valid() from
>> amdgpu_ctx_priority_permit(), which is called from amdgpu_ctx_init() which is
>> called from amdgpu_ctx_alloc() which is called from amdgpu_ctx_ioctl(), where
>> we've called amdgpu_ctx_priority_is_valid() already first thing in the
>> function.
>>
>> Cc: Alex Deucher 
>> Cc: Christian König 
>> Signed-off-by: Luben Tuikov 
> 
> Please push this to drm-misc since it depends on your previous patches.

Done!

Pushed to drm-misc-fixes, where the other two landed.

Regards,
Luben

> 
> Alex
> 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 15 ---
>>  1 file changed, 8 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> index 68db924161ef66..4c6ffca97c4512 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> @@ -56,6 +56,10 @@ bool amdgpu_ctx_priority_is_valid(int32_t ctx_prio)
>> return true;
>> default:
>> case AMDGPU_CTX_PRIORITY_UNSET:
>> +   /* UNSET priority is not valid and we don't carry that
>> +* around, but set it to NORMAL in the only place this
>> +* function is called, amdgpu_ctx_ioctl().
>> +*/
>> return false;
>> }
>>  }
>> @@ -96,9 +100,6 @@ amdgpu_ctx_to_drm_sched_prio(int32_t ctx_prio)
>>  static int amdgpu_ctx_priority_permit(struct drm_file *filp,
>>   int32_t priority)
>>  {
>> -   if (!amdgpu_ctx_priority_is_valid(priority))
>> -   return -EINVAL;
>> -
>> /* NORMAL and below are accessible by everyone */
>> if (priority <= AMDGPU_CTX_PRIORITY_NORMAL)
>> return 0;
>> @@ -625,8 +626,6 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
>> return 0;
>>  }
>>
>> -
>> -
>>  static int amdgpu_ctx_stable_pstate(struct amdgpu_device *adev,
>> struct amdgpu_fpriv *fpriv, uint32_t id,
>> bool set, u32 *stable_pstate)
>> @@ -669,8 +668,10 @@ int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
>> id = args->in.ctx_id;
>> priority = args->in.priority;
>>
>> -   /* For backwards compatibility reasons, we need to accept
>> -* ioctls with garbage in the priority field */
>> +   /* For backwards compatibility, we need to accept ioctls with garbage
>> +* in the priority field. Garbage values in the priority field result
>> +* in the priority being set to NORMAL.
>> +*/
>> if (!amdgpu_ctx_priority_is_valid(priority))
>> priority = AMDGPU_CTX_PRIORITY_NORMAL;
>>
>>
>> base-commit: 915718484b8fa1eede4499a939e2e4fc0d85caa4
>> prerequisite-patch-id: a36f628997d923f66da5342e760e8b45ff959fb8
>> prerequisite-patch-id: f15148c302329c0c60d86040571c61d367bd05e7
>> --
>> 2.42.0
>>



[PATCH] drm/amdgpu: Remove redundant call to priority_is_valid()

2023-10-17 Thread Luben Tuikov
Remove a redundant call to amdgpu_ctx_priority_is_valid() from
amdgpu_ctx_priority_permit(), which is called from amdgpu_ctx_init() which is
called from amdgpu_ctx_alloc() which is called from amdgpu_ctx_ioctl(), where
we've called amdgpu_ctx_priority_is_valid() already first thing in the
function.

Cc: Alex Deucher 
Cc: Christian König 
Signed-off-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 68db924161ef66..4c6ffca97c4512 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -56,6 +56,10 @@ bool amdgpu_ctx_priority_is_valid(int32_t ctx_prio)
return true;
default:
case AMDGPU_CTX_PRIORITY_UNSET:
+   /* UNSET priority is not valid and we don't carry that
+* around, but set it to NORMAL in the only place this
+* function is called, amdgpu_ctx_ioctl().
+*/
return false;
}
 }
@@ -96,9 +100,6 @@ amdgpu_ctx_to_drm_sched_prio(int32_t ctx_prio)
 static int amdgpu_ctx_priority_permit(struct drm_file *filp,
  int32_t priority)
 {
-   if (!amdgpu_ctx_priority_is_valid(priority))
-   return -EINVAL;
-
/* NORMAL and below are accessible by everyone */
if (priority <= AMDGPU_CTX_PRIORITY_NORMAL)
return 0;
@@ -625,8 +626,6 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
return 0;
 }
 
-
-
 static int amdgpu_ctx_stable_pstate(struct amdgpu_device *adev,
struct amdgpu_fpriv *fpriv, uint32_t id,
bool set, u32 *stable_pstate)
@@ -669,8 +668,10 @@ int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
id = args->in.ctx_id;
priority = args->in.priority;
 
-   /* For backwards compatibility reasons, we need to accept
-* ioctls with garbage in the priority field */
+   /* For backwards compatibility, we need to accept ioctls with garbage
+* in the priority field. Garbage values in the priority field, result
+* in the priority being set to NORMAL.
+*/
if (!amdgpu_ctx_priority_is_valid(priority))
priority = AMDGPU_CTX_PRIORITY_NORMAL;
 

base-commit: 915718484b8fa1eede4499a939e2e4fc0d85caa4
prerequisite-patch-id: a36f628997d923f66da5342e760e8b45ff959fb8
prerequisite-patch-id: f15148c302329c0c60d86040571c61d367bd05e7
-- 
2.42.0



Re: [PATCH 1/2] drm/amdgpu: Unset context priority is now invalid

2023-10-17 Thread Luben Tuikov
On 2023-10-17 09:22, Alex Deucher wrote:
> On Tue, Oct 17, 2023 at 12:52 AM Luben Tuikov  wrote:
>>
>> A context priority value of AMDGPU_CTX_PRIORITY_UNSET is now invalid--instead of
>> carrying it around and passing it to the Direct Rendering Manager--and it
>> becomes AMDGPU_CTX_PRIORITY_NORMAL in amdgpu_ctx_ioctl(), the gateway to context
>> creation.
>>
>> Cc: Alex Deucher 
>> Cc: Christian König 
>> Signed-off-by: Luben Tuikov 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> index 0dc9c655c4fbdb..092962b93064fc 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> @@ -47,7 +47,6 @@ const unsigned int 
>> amdgpu_ctx_num_entities[AMDGPU_HW_IP_NUM] = {
>>  bool amdgpu_ctx_priority_is_valid(int32_t ctx_prio)
>>  {
>> switch (ctx_prio) {
>> -   case AMDGPU_CTX_PRIORITY_UNSET:
>> case AMDGPU_CTX_PRIORITY_VERY_LOW:
>> case AMDGPU_CTX_PRIORITY_LOW:
>> case AMDGPU_CTX_PRIORITY_NORMAL:
>> @@ -55,6 +54,7 @@ bool amdgpu_ctx_priority_is_valid(int32_t ctx_prio)
>> case AMDGPU_CTX_PRIORITY_VERY_HIGH:
>> return true;
>> default:
>> +   case AMDGPU_CTX_PRIORITY_UNSET:
>> return false;
> 
> I don't recall if any userspace uses this, but this would break
> userspace if it does.

This is shielded from user space in the following manner,
1) amdgpu_ctx_priority_is_valid() is called from amdgpu_ctx_ioctl() and
   if amdgpu_ctx_priority_is_valid() returns false, we set the priority to 
NORMAL.
   See the 2nd patch.
2) It is also called from amdgpu_ctx_priority_permit(), which is called
   from amdgpu_ctx_init() which is called from amdgpu_ctx_alloc() which
   is called from amdgpu_ctx_ioctl(), _after_ the call described above,
   and thus priority is now NORMAL.

Plus I'm typing this on a running system with 6.6.0 + those two patches.

User space can send us down UNSET, but we set it to NORMAL.

Can I get an R-B?

> 
> Alex
> 
>> }
>>  }
>>
>> base-commit: dc9b2e683bcba017588b9aaad80f442ad004a48f
>> --
>> 2.42.0
>>

-- 
Regards,
Luben



[PATCH 2/2] gpu/drm: Eliminate DRM_SCHED_PRIORITY_UNSET

2023-10-16 Thread Luben Tuikov
Eliminate DRM_SCHED_PRIORITY_UNSET, value of -2, whose only user was
amdgpu. Furthermore, eliminate an index bug, in that when amdgpu boots, it
calls drm_sched_entity_init() with DRM_SCHED_PRIORITY_UNSET, which uses it to
index sched->sched_rq[].

Cc: Alex Deucher 
Cc: Christian König 
Signed-off-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 3 ++-
 include/drm/gpu_scheduler.h | 3 +--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 092962b93064fc..aac52d9754e6da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -64,7 +64,8 @@ amdgpu_ctx_to_drm_sched_prio(int32_t ctx_prio)
 {
switch (ctx_prio) {
case AMDGPU_CTX_PRIORITY_UNSET:
-   return DRM_SCHED_PRIORITY_UNSET;
+   pr_warn_once("AMD-->DRM context priority value UNSET-->NORMAL");
+   return DRM_SCHED_PRIORITY_NORMAL;
 
case AMDGPU_CTX_PRIORITY_VERY_LOW:
return DRM_SCHED_PRIORITY_MIN;
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index f9544d9b670d33..ac65f0626cfc91 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -68,8 +68,7 @@ enum drm_sched_priority {
DRM_SCHED_PRIORITY_HIGH,
DRM_SCHED_PRIORITY_KERNEL,
 
-   DRM_SCHED_PRIORITY_COUNT,
-   DRM_SCHED_PRIORITY_UNSET = -2
+   DRM_SCHED_PRIORITY_COUNT
 };
 
 /* Used to chose between FIFO and RR jobs scheduling */
-- 
2.42.0



[PATCH 1/2] drm/amdgpu: Unset context priority is now invalid

2023-10-16 Thread Luben Tuikov
A context priority value of AMDGPU_CTX_PRIORITY_UNSET is now invalid--instead of
carrying it around and passing it to the Direct Rendering Manager--and it
becomes AMDGPU_CTX_PRIORITY_NORMAL in amdgpu_ctx_ioctl(), the gateway to context
creation.

Cc: Alex Deucher 
Cc: Christian König 
Signed-off-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 0dc9c655c4fbdb..092962b93064fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -47,7 +47,6 @@ const unsigned int amdgpu_ctx_num_entities[AMDGPU_HW_IP_NUM] 
= {
 bool amdgpu_ctx_priority_is_valid(int32_t ctx_prio)
 {
switch (ctx_prio) {
-   case AMDGPU_CTX_PRIORITY_UNSET:
case AMDGPU_CTX_PRIORITY_VERY_LOW:
case AMDGPU_CTX_PRIORITY_LOW:
case AMDGPU_CTX_PRIORITY_NORMAL:
@@ -55,6 +54,7 @@ bool amdgpu_ctx_priority_is_valid(int32_t ctx_prio)
case AMDGPU_CTX_PRIORITY_VERY_HIGH:
return true;
default:
+   case AMDGPU_CTX_PRIORITY_UNSET:
return false;
}
 }

base-commit: dc9b2e683bcba017588b9aaad80f442ad004a48f
-- 
2.42.0



Re: [PATCH] drm/amdgpu: fix software pci_unplug on some chips

2023-10-11 Thread Luben Tuikov
On 2023-10-11 21:31, vitaly.pros...@amd.com wrote:
> From: Vitaly Prosyak 
> 
> When a software 'pci unplug' using IGT is executed, the sysfs directory
> entry is NULL for different RAS blocks like hdp, umc, etc.
> Before calling 'sysfs_remove_file_from_group' and 'sysfs_remove_group',
> check that 'sd' is not NULL.
> 
> [  +0.01] RIP: 0010:sysfs_remove_group+0x83/0x90
> [  +0.02] Code: 31 c0 31 d2 31 f6 31 ff e9 9a a8 b4 00 4c 89 e7 e8 f2 a2 
> ff ff eb c2 49 8b 55 00 48 8b 33 48 c7 c7 80 65 94 82 e8 cd 82 bb ff <0f> 0b 
> eb cc 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
> [  +0.01] RSP: 0018:c90002067c90 EFLAGS: 00010246
> [  +0.02] RAX:  RBX: 824ea180 RCX: 
> 
> [  +0.01] RDX:  RSI:  RDI: 
> 
> [  +0.01] RBP: c90002067ca8 R08:  R09: 
> 
> [  +0.01] R10:  R11:  R12: 
> 
> [  +0.01] R13: 88810a395f48 R14: 888101aab0d0 R15: 
> 
> [  +0.01] FS:  7f5ddaa43a00() GS:88841e80() 
> knlGS:
> [  +0.02] CS:  0010 DS:  ES:  CR0: 80050033
> [  +0.01] CR2: 7f8ffa61ba50 CR3: 000106432000 CR4: 
> 00350ef0
> [  +0.01] Call Trace:
> [  +0.01]  
> [  +0.01]  ? show_regs+0x72/0x90
> [  +0.02]  ? sysfs_remove_group+0x83/0x90
> [  +0.02]  ? __warn+0x8d/0x160
> [  +0.01]  ? sysfs_remove_group+0x83/0x90
> [  +0.01]  ? report_bug+0x1bb/0x1d0
> [  +0.03]  ? handle_bug+0x46/0x90
> [  +0.01]  ? exc_invalid_op+0x19/0x80
> [  +0.02]  ? asm_exc_invalid_op+0x1b/0x20
> [  +0.03]  ? sysfs_remove_group+0x83/0x90
> [  +0.01]  dpm_sysfs_remove+0x61/0x70
> [  +0.02]  device_del+0xa3/0x3d0
> [  +0.02]  ? ktime_get_mono_fast_ns+0x46/0xb0
> [  +0.02]  device_unregister+0x18/0x70
> [  +0.01]  i2c_del_adapter+0x26d/0x330
> [  +0.02]  arcturus_i2c_control_fini+0x25/0x50 [amdgpu]
> [  +0.000236]  smu_sw_fini+0x38/0x260 [amdgpu]
> [  +0.000241]  amdgpu_device_fini_sw+0x116/0x670 [amdgpu]
> [  +0.000186]  ? mutex_lock+0x13/0x50
> [  +0.03]  amdgpu_driver_release_kms+0x16/0x40 [amdgpu]
> [  +0.000192]  drm_minor_release+0x4f/0x80 [drm]
> [  +0.25]  drm_release+0xfe/0x150 [drm]
> [  +0.27]  __fput+0x9f/0x290
> [  +0.02]  fput+0xe/0x20
> [  +0.02]  task_work_run+0x61/0xa0
> [  +0.02]  exit_to_user_mode_prepare+0x150/0x170
> [  +0.02]  syscall_exit_to_user_mode+0x2a/0x50
> 
> Cc: Hawking Zhang 
> Cc: Luben Tuikov 
> Cc: Alex Deucher 
> Cc: Christian Koenig 
> Signed-off-by: Vitaly Prosyak 

Reviewed-by: Luben Tuikov 

Regards,
Luben

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 5fb57419ef77..1673a10835a1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1390,7 +1390,8 @@ static void 
> amdgpu_ras_sysfs_remove_bad_page_node(struct amdgpu_device *adev)
>  {
>   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>  
> - sysfs_remove_file_from_group(&adev->dev->kobj,
> + if (adev->dev->kobj.sd)
> + sysfs_remove_file_from_group(&adev->dev->kobj,
>   &con->badpages_attr.attr,
>   RAS_FS_NAME);
>  }
> @@ -1409,7 +1410,8 @@ static int amdgpu_ras_sysfs_remove_dev_attr_node(struct 
> amdgpu_device *adev)
>   .attrs = attrs,
>   };
>  
> - sysfs_remove_group(&adev->dev->kobj, &group);
> + if (adev->dev->kobj.sd)
> + sysfs_remove_group(&adev->dev->kobj, &group);
>  
>   return 0;
>  }
> @@ -1456,7 +1458,8 @@ int amdgpu_ras_sysfs_remove(struct amdgpu_device *adev,
>   if (!obj || !obj->attr_inuse)
>   return -EINVAL;
>  
> - sysfs_remove_file_from_group(&adev->dev->kobj,
> + if (adev->dev->kobj.sd)
> + sysfs_remove_file_from_group(&adev->dev->kobj,
>   &obj->sysfs_attr.attr,
>   RAS_FS_NAME);
>   obj->attr_inuse = 0;



Re: [PATCH] drm/amdgpu: Annotate struct amdgpu_bo_list with __counted_by

2023-10-04 Thread Luben Tuikov
On 2023-10-03 19:29, Kees Cook wrote:
> Prepare for the coming implementation by GCC and Clang of the __counted_by
> attribute. Flexible array members annotated with __counted_by can have
> their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for
> array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
> functions).
> 
> As found with Coccinelle[1], add __counted_by for struct amdgpu_bo_list.
> Additionally, since the element count member must be set before accessing
> the annotated flexible array member, move its initialization earlier.
> 
> Cc: Alex Deucher 
> Cc: "Christian König" 
> Cc: "Pan, Xinhui" 
> Cc: David Airlie 
> Cc: Daniel Vetter 
> Cc: "Gustavo A. R. Silva" 
> Cc: Luben Tuikov 
> Cc: Christophe JAILLET 
> Cc: Felix Kuehling 
> Cc: amd-gfx@lists.freedesktop.org
> Cc: dri-de...@lists.freedesktop.org
> Cc: linux-harden...@vger.kernel.org
> Link: 
> https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
>  [1]
> Signed-off-by: Kees Cook 

Reviewed-by: Luben Tuikov 
-- 
Regards,
Luben

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.c | 2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.c
> index 6f5b641b631e..781e5c5ce04d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.c
> @@ -84,6 +84,7 @@ int amdgpu_bo_list_create(struct amdgpu_device *adev, 
> struct drm_file *filp,
>  
>   kref_init(&list->refcount);
>  
> + list->num_entries = num_entries;
>   array = list->entries;
>  
>   for (i = 0; i < num_entries; ++i) {
> @@ -129,7 +130,6 @@ int amdgpu_bo_list_create(struct amdgpu_device *adev, 
> struct drm_file *filp,
>   }
>  
>   list->first_userptr = first_userptr;
> - list->num_entries = num_entries;
>   sort(array, last_entry, sizeof(struct amdgpu_bo_list_entry),
>amdgpu_bo_list_entry_cmp, NULL);
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h
> index 6a703be45d04..555cd6d877c3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h
> @@ -56,7 +56,7 @@ struct amdgpu_bo_list {
>*/
>   struct mutex bo_list_mutex;
>  
> - struct amdgpu_bo_list_entry entries[];
> + struct amdgpu_bo_list_entry entries[] __counted_by(num_entries);
>  };
>  
>  int amdgpu_bo_list_get(struct amdgpu_fpriv *fpriv, int id,




[PATCH] drm/amdgpu: Fix a memory leak

2023-09-22 Thread Luben Tuikov
Fix a memory leak in amdgpu_fru_get_product_info().

Cc: Alex Deucher 
Reported-by: Yang Wang 
Fixes: 0dbf2c56262532 ("drm/amdgpu: Interpret IPMI data for product information 
(v2)")
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c
index 9c66d98af6d86a..7cd0dfaeee206c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c
@@ -170,6 +170,7 @@ int amdgpu_fru_get_product_info(struct amdgpu_device *adev)
csum += pia[size - 1];
if (csum) {
DRM_ERROR("Bad Product Info Area checksum: 0x%02x", csum);
+   kfree(pia);
return -EIO;
}
 

base-commit: 14d13f757d369c9873ebbe34d02d0896f5de565e
-- 
2.42.0



Re: [PATCH] drm/amdgpu: fix memory leak in amdgpu_fru_get_product_info()

2023-09-22 Thread Luben Tuikov
On 2023-09-22 16:58, Luben Tuikov wrote:
> On 2023-09-22 01:27, Yang Wang wrote:
>> fix a memory leak that occurs when csum is non-zero:
>> the original function returns directly and forgets to free the 'pia' resource.
>>
>> Fixes: 0dbf2c562625 ("drm/amdgpu: Interpret IPMI data for product 
>> information (v2)")
>>
>> CC: Luben Tuikov 
>> Signed-off-by: Yang Wang 
> 
> Ah, yes, we should free "pia". Good catch!
> 
> Reviewed-by: Luben Tuikov 

Retracted--see my follow-up email of making this a one-liner change, instead
of adding a whole bunch of changes which are unnecessary to fix this memory 
leak.

Regards,
Luben

> 
> Regards,
> Luben
> 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c | 11 ++-
>>  1 file changed, 6 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c
>> index 401651f28ba2..50b6eb447726 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c
>> @@ -111,7 +111,7 @@ int amdgpu_fru_get_product_info(struct amdgpu_device 
>> *adev)
>>  {
>>  unsigned char buf[8], *pia;
>>  u32 addr, fru_addr;
>> -int size, len;
>> +int size, len, ret = 0;
>>  u8 csum;
>>  
>>  if (!is_fru_eeprom_supported(adev, &fru_addr))
>> @@ -171,16 +171,17 @@ int amdgpu_fru_get_product_info(struct amdgpu_device 
>> *adev)
>>  /* Read the whole PIA. */
>>  len = amdgpu_eeprom_read(adev->pm.fru_eeprom_i2c_bus, addr, pia, size);
>>  if (len != size) {
>> -kfree(pia);
>>  DRM_ERROR("Couldn't read the Product Info Area: %d", len);
>> -return len < 0 ? len : -EIO;
>> +ret = len < 0 ? len : -EIO;
>> +goto Out;
>>  }
> 
> 
>>  
>>  for (csum = 0; size > 0; size--)
>>  csum += pia[size - 1];
>>  if (csum) {
>>  DRM_ERROR("Bad Product Info Area checksum: 0x%02x", csum);
>> -return -EIO;
>> +ret = -EIO;
>> +goto Out;
>>  }
>>  
>>  /* Now extract useful information from the PIA.
>> @@ -220,7 +221,7 @@ int amdgpu_fru_get_product_info(struct amdgpu_device 
>> *adev)
>>  adev->serial[sizeof(adev->serial) - 1] = '\0';
>>  Out:
>>  kfree(pia);
>> -return 0;
>> +return ret;
>>  }
>>  
>>  /**
> 

-- 
Regards,
Luben



Re: [PATCH] drm/amdgpu: fix memory leak in amdgpu_fru_get_product_info()

2023-09-22 Thread Luben Tuikov
On 2023-09-22 01:27, Yang Wang wrote:
> fix a memory leak that occurs when csum is non-zero:
> the original function returns directly and forgets to free the 'pia' resource.
> 
> Fixes: 0dbf2c562625 ("drm/amdgpu: Interpret IPMI data for product information 
> (v2)")
> 
> CC: Luben Tuikov 
> Signed-off-by: Yang Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c
> index 401651f28ba2..50b6eb447726 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c
> @@ -111,7 +111,7 @@ int amdgpu_fru_get_product_info(struct amdgpu_device 
> *adev)
>  {
>   unsigned char buf[8], *pia;
>   u32 addr, fru_addr;
> - int size, len;
> + int size, len, ret = 0;
>   u8 csum;
>  
>   if (!is_fru_eeprom_supported(adev, &fru_addr))
> @@ -171,16 +171,17 @@ int amdgpu_fru_get_product_info(struct amdgpu_device 
> *adev)
>   /* Read the whole PIA. */
>   len = amdgpu_eeprom_read(adev->pm.fru_eeprom_i2c_bus, addr, pia, size);
>   if (len != size) {
> - kfree(pia);
>   DRM_ERROR("Couldn't read the Product Info Area: %d", len);
> - return len < 0 ? len : -EIO;
> + ret = len < 0 ? len : -EIO;
> + goto Out;
>   }
>  
>   for (csum = 0; size > 0; size--)
>   csum += pia[size - 1];
>   if (csum) {
>   DRM_ERROR("Bad Product Info Area checksum: 0x%02x", csum);
> - return -EIO;
> + ret = -EIO;

Actually the memory leak is ONLY here.

I wonder if you could've simply added a

kfree(pia);

> + goto Out;
>   }

before the goto Out; which would result in a one-line change.

Yeah, please do that instead.

So, don't add "ret" and whatnot. Just add the one-liner "kfree(pia);" before
the "goto Out;" and make it a SINGLE-LINE change to fix this memory leak.

You don't need so many formulaic changes adding "ret" and the like.
Just a one-liner change, please.

>  
>   /* Now extract useful information from the PIA.
> @@ -220,7 +221,7 @@ int amdgpu_fru_get_product_info(struct amdgpu_device 
> *adev)
>   adev->serial[sizeof(adev->serial) - 1] = '\0';
>  Out:
>   kfree(pia);
> - return 0;
> + return ret;
>  }
>  
>  /**

-- 
Regards,
Luben
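The two fix styles debated above can be sketched side by side in plain userspace C. This is only an illustration of the control flow, not the driver code: the buffer and checksum logic are stand-ins for the real PIA handling, and free() plays the role of kfree().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Style 1: the one-liner fix -- free right before the early return. */
static int parse_one_liner(const unsigned char *src, size_t size)
{
	unsigned char *pia = malloc(size);
	unsigned char csum = 0;
	size_t i;

	if (!pia)
		return -12;                    /* -ENOMEM */
	memcpy(pia, src, size);
	for (i = size; i > 0; i--)
		csum += pia[i - 1];
	if (csum) {
		free(pia);                     /* the added one-liner */
		return -5;                     /* -EIO */
	}
	free(pia);
	return 0;
}

/* Style 2: a "ret" variable converging on a single exit label. */
static int parse_goto_out(const unsigned char *src, size_t size)
{
	unsigned char *pia = malloc(size);
	unsigned char csum = 0;
	size_t i;
	int ret = 0;

	if (!pia)
		return -12;
	memcpy(pia, src, size);
	for (i = size; i > 0; i--)
		csum += pia[i - 1];
	if (csum) {
		ret = -5;
		goto Out;
	}
Out:
	free(pia);
	return ret;
}
```

Both variants free the buffer on the checksum-failure path; the one-liner touches a single line, while the "goto Out" form pays off only as more exit paths accumulate.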



Re: [PATCH] drm/amdgpu: fix memory leak in amdgpu_fru_get_product_info()

2023-09-22 Thread Luben Tuikov
On 2023-09-22 01:27, Yang Wang wrote:
> Fix a memory leak that occurs when the csum check fails (csum != 0):
> the original function returns directly and forgets to free the 'pia' resource.
> 
> Fixes: 0dbf2c562625 ("drm/amdgpu: Interpret IPMI data for product information 
> (v2)")
> 
> CC: Luben Tuikov 
> Signed-off-by: Yang Wang 

Ah, yes, we should free "pia". Good catch!

Reviewed-by: Luben Tuikov 

Regards,
Luben

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c
> index 401651f28ba2..50b6eb447726 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fru_eeprom.c
> @@ -111,7 +111,7 @@ int amdgpu_fru_get_product_info(struct amdgpu_device 
> *adev)
>  {
>   unsigned char buf[8], *pia;
>   u32 addr, fru_addr;
> - int size, len;
> + int size, len, ret = 0;
>   u8 csum;
>  
>   if (!is_fru_eeprom_supported(adev, &fru_addr))
> @@ -171,16 +171,17 @@ int amdgpu_fru_get_product_info(struct amdgpu_device 
> *adev)
>   /* Read the whole PIA. */
>   len = amdgpu_eeprom_read(adev->pm.fru_eeprom_i2c_bus, addr, pia, size);
>   if (len != size) {
> - kfree(pia);
>   DRM_ERROR("Couldn't read the Product Info Area: %d", len);
> - return len < 0 ? len : -EIO;
> + ret = len < 0 ? len : -EIO;
> + goto Out;
>   }


>  
>   for (csum = 0; size > 0; size--)
>   csum += pia[size - 1];
>   if (csum) {
>   DRM_ERROR("Bad Product Info Area checksum: 0x%02x", csum);
> - return -EIO;
> + ret = -EIO;
> + goto Out;
>   }
>  
>   /* Now extract useful information from the PIA.
> @@ -220,7 +221,7 @@ int amdgpu_fru_get_product_info(struct amdgpu_device 
> *adev)
>   adev->serial[sizeof(adev->serial) - 1] = '\0';
>  Out:
>   kfree(pia);
> - return 0;
> + return ret;
>  }
>  
>  /**

-- 
Regards,
Luben



Re: [PATCH v2] drm/amdgpu: Prefer pr_err/_warn/_notice over printk in amdgpu_atpx_handler.c

2023-08-01 Thread Luben Tuikov
On 2023-08-01 01:01, Srinivasan Shanmugam wrote:
> Fixes the following style issues:
> 
> ERROR: open brace '{' following function definitions go on the next line
> WARNING: printk() should include KERN_ facility level
> 
> Cc: Guchun Chen 
> Cc: Christian König 
> Cc: Alex Deucher 
> Cc: Bert Karwatzki 
> Cc: "Pan, Xinhui" 
> Cc: Luben Tuikov 
> Signed-off-by: Srinivasan Shanmugam 

Yeah, looks good.

Reviewed-by: Luben Tuikov 

Regards,
Luben

> ---
> v2:
>  - Updated commit title as per log levels updated in this patch
>  - Updated with appropriate log levels (Luben)
> 
>  .../gpu/drm/amd/amdgpu/amdgpu_atpx_handler.c  | 29 +++
>  1 file changed, 17 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_atpx_handler.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_atpx_handler.c
> index d6d986be906a..375f02002579 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_atpx_handler.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_atpx_handler.c
> @@ -74,24 +74,29 @@ struct atpx_mux {
>   u16 mux;
>  } __packed;
>  
> -bool amdgpu_has_atpx(void) {
> +bool amdgpu_has_atpx(void)
> +{
>   return amdgpu_atpx_priv.atpx_detected;
>  }
>  
> -bool amdgpu_has_atpx_dgpu_power_cntl(void) {
> +bool amdgpu_has_atpx_dgpu_power_cntl(void)
> +{
>   return amdgpu_atpx_priv.atpx.functions.power_cntl;
>  }
>  
> -bool amdgpu_is_atpx_hybrid(void) {
> +bool amdgpu_is_atpx_hybrid(void)
> +{
>   return amdgpu_atpx_priv.atpx.is_hybrid;
>  }
>  
> -bool amdgpu_atpx_dgpu_req_power_for_displays(void) {
> +bool amdgpu_atpx_dgpu_req_power_for_displays(void)
> +{
>   return amdgpu_atpx_priv.atpx.dgpu_req_power_for_displays;
>  }
>  
>  #if defined(CONFIG_ACPI)
> -void *amdgpu_atpx_get_dhandle(void) {
> +void *amdgpu_atpx_get_dhandle(void)
> +{
>   return amdgpu_atpx_priv.dhandle;
>  }
>  #endif
> @@ -134,7 +139,7 @@ static union acpi_object *amdgpu_atpx_call(acpi_handle 
> handle, int function,
>  
>   /* Fail only if calling the method fails and ATPX is supported */
>   if (ACPI_FAILURE(status) && status != AE_NOT_FOUND) {
> - printk("failed to evaluate ATPX got %s\n",
> + pr_err("failed to evaluate ATPX got %s\n",
>  acpi_format_exception(status));
>   kfree(buffer.pointer);
>   return NULL;
> @@ -190,7 +195,7 @@ static int amdgpu_atpx_validate(struct amdgpu_atpx *atpx)
>  
>   size = *(u16 *) info->buffer.pointer;
>   if (size < 10) {
> - printk("ATPX buffer is too small: %zu\n", size);
> + pr_err("ATPX buffer is too small: %zu\n", size);
>   kfree(info);
>   return -EINVAL;
>   }
> @@ -223,11 +228,11 @@ static int amdgpu_atpx_validate(struct amdgpu_atpx 
> *atpx)
>   atpx->is_hybrid = false;
>   if (valid_bits & ATPX_MS_HYBRID_GFX_SUPPORTED) {
>   if (amdgpu_atpx_priv.quirks & AMDGPU_PX_QUIRK_FORCE_ATPX) {
> - printk("ATPX Hybrid Graphics, forcing to ATPX\n");
> + pr_warn("ATPX Hybrid Graphics, forcing to ATPX\n");
>   atpx->functions.power_cntl = true;
>   atpx->is_hybrid = false;
>   } else {
> - printk("ATPX Hybrid Graphics\n");
> + pr_notice("ATPX Hybrid Graphics\n");
>   /*
>* Disable legacy PM methods only when pcie port PM is 
> usable,
>* otherwise the device might fail to power off or 
> power on.
> @@ -269,7 +274,7 @@ static int amdgpu_atpx_verify_interface(struct 
> amdgpu_atpx *atpx)
>  
>   size = *(u16 *) info->buffer.pointer;
>   if (size < 8) {
> - printk("ATPX buffer is too small: %zu\n", size);
> + pr_err("ATPX buffer is too small: %zu\n", size);
>   err = -EINVAL;
>   goto out;
>   }
> @@ -278,8 +283,8 @@ static int amdgpu_atpx_verify_interface(struct 
> amdgpu_atpx *atpx)
>   memcpy(&output, info->buffer.pointer, size);
>  
>   /* TODO: check version? */
> - printk("ATPX version %u, functions 0x%08x\n",
> -output.version, output.function_bits);
> + pr_notice("ATPX version %u, functions 0x%08x\n",
> +   output.version, output.function_bits);
>  
>   amdgpu_atpx_parse_functions(&atpx->functions, output.function_bits);
>  

-- 
Regards,
Luben



Re: [PATCH v2] Revert "drm/radeon: Prefer dev_* variant over printk"

2023-07-31 Thread Luben Tuikov
On 2023-07-31 22:54, Srinivasan Shanmugam wrote:
> Usage of container_of is wrong here.
> struct acpi_device *adev = container_of(handle, struct acpi_device, handle)
> 
> This reverts commit 35ef33db90303589c76658207347539cf33f5ae3.
> 
> References: https://gitlab.freedesktop.org/drm/amd/-/issues/2744
> Cc: Guchun Chen 
> Cc: Christian König 
> Cc: Alex Deucher 
> Cc: Bert Karwatzki 
> Cc: "Pan, Xinhui" 
> Cc: Luben Tuikov 
> Signed-off-by: Srinivasan Shanmugam 
> Reviewed-by: Guchun Chen 

Reviewed-by: Luben Tuikov 

Regards,
Luben

> ---
> v2:
>  - Added missing commit id. 
> 
> 
>  drivers/gpu/drm/radeon/radeon_atpx_handler.c | 12 
>  1 file changed, 4 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/radeon/radeon_atpx_handler.c 
> b/drivers/gpu/drm/radeon/radeon_atpx_handler.c
> index fb4d931fdf18..595354e3ce0b 100644
> --- a/drivers/gpu/drm/radeon/radeon_atpx_handler.c
> +++ b/drivers/gpu/drm/radeon/radeon_atpx_handler.c
> @@ -94,8 +94,6 @@ static union acpi_object *radeon_atpx_call(acpi_handle 
> handle, int function,
>   union acpi_object atpx_arg_elements[2];
>   struct acpi_object_list atpx_arg;
>   struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL };
> - struct acpi_device *adev = container_of(handle, struct acpi_device, 
> handle);
> - struct device *dev = &adev->dev;
>  
>   atpx_arg.count = 2;
>   atpx_arg.pointer = &atpx_arg_elements[0];
> @@ -117,8 +115,8 @@ static union acpi_object *radeon_atpx_call(acpi_handle 
> handle, int function,
>  
>   /* Fail only if calling the method fails and ATPX is supported */
>   if (ACPI_FAILURE(status) && status != AE_NOT_FOUND) {
> - dev_err(dev, "failed to evaluate ATPX got %s\n",
> - acpi_format_exception(status));
> + pr_err("failed to evaluate ATPX got %s\n",
> +acpi_format_exception(status));
>   kfree(buffer.pointer);
>   return NULL;
>   }
> @@ -159,8 +157,6 @@ static void radeon_atpx_parse_functions(struct 
> radeon_atpx_functions *f, u32 mas
>  static int radeon_atpx_validate(struct radeon_atpx *atpx)
>  {
>   u32 valid_bits = 0;
> - struct acpi_device *adev = container_of(atpx->handle, struct 
> acpi_device, handle);
> - struct device *dev = &adev->dev;
>  
>   if (atpx->functions.px_params) {
>   union acpi_object *info;
> @@ -175,7 +171,7 @@ static int radeon_atpx_validate(struct radeon_atpx *atpx)
>  
>   size = *(u16 *) info->buffer.pointer;
>   if (size < 10) {
> - dev_err(dev, "ATPX buffer is too small: %zu\n", size);
> + pr_err("ATPX buffer is too small: %zu\n", size);
>   kfree(info);
>   return -EINVAL;
>   }
> @@ -206,7 +202,7 @@ static int radeon_atpx_validate(struct radeon_atpx *atpx)
>  
>   atpx->is_hybrid = false;
>   if (valid_bits & ATPX_MS_HYBRID_GFX_SUPPORTED) {
> - dev_info(dev, "ATPX Hybrid Graphics\n");
> + pr_info("ATPX Hybrid Graphics\n");
>   /*
>* Disable legacy PM methods only when pcie port PM is usable,
>* otherwise the device might fail to power off or power on.

-- 
Regards,
Luben
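The container_of() misuse that motivates the revert above can be shown in a few lines of stand-alone C. This sketch uses a local copy of the kernel's container_of() and hypothetical stand-in types ("handle" for an opaque acpi_handle, "device_like" for struct acpi_device); it is not the driver code.

```c
#include <assert.h>
#include <stddef.h>

/* Local copy of the kernel's container_of() idea for this sketch. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct handle { int id; };           /* stand-in for an opaque acpi_handle */

struct device_like {                 /* stand-in for struct acpi_device */
	int dummy;
	struct handle handle;        /* the embedded member */
};

/* container_of() is only valid when the pointer really is the address of
 * the embedded member. A handle that is NOT embedded in a device_like
 * yields a garbage parent pointer -- exactly the assumption the revert
 * removes, where any acpi_handle was treated as living inside a
 * struct acpi_device. */
static struct device_like *parent_of(struct handle *h)
{
	return container_of(h, struct device_like, handle);
}
```

For a genuinely embedded member, parent_of() recovers the enclosing structure; for a free-standing handle it computes a pointer into unrelated memory, which must never be dereferenced.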



Re: [PATCH] drm/amdgpu: Prefer pr_err/_info over printk in amdgpu_atpx_handler.c

2023-07-31 Thread Luben Tuikov
On 2023-07-31 08:18, Srinivasan Shanmugam wrote:
> Fixes the following style issues:
> 
> ERROR: open brace '{' following function definitions go on the next line
> WARNING: printk() should include KERN_ facility level
> 
> Cc: Guchun Chen 
> Cc: Christian König 
> Cc: Alex Deucher 
> Cc: Bert Karwatzki 
> Signed-off-by: Srinivasan Shanmugam 
> ---
>  .../gpu/drm/amd/amdgpu/amdgpu_atpx_handler.c  | 29 +++
>  1 file changed, 17 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_atpx_handler.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_atpx_handler.c
> index d6d986be906a..9ba49a1b7c34 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_atpx_handler.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_atpx_handler.c
> @@ -74,24 +74,29 @@ struct atpx_mux {
>   u16 mux;
>  } __packed;
>  
> -bool amdgpu_has_atpx(void) {
> +bool amdgpu_has_atpx(void)
> +{
>   return amdgpu_atpx_priv.atpx_detected;
>  }
>  
> -bool amdgpu_has_atpx_dgpu_power_cntl(void) {
> +bool amdgpu_has_atpx_dgpu_power_cntl(void)
> +{
>   return amdgpu_atpx_priv.atpx.functions.power_cntl;
>  }
>  
> -bool amdgpu_is_atpx_hybrid(void) {
> +bool amdgpu_is_atpx_hybrid(void)
> +{
>   return amdgpu_atpx_priv.atpx.is_hybrid;
>  }
>  
> -bool amdgpu_atpx_dgpu_req_power_for_displays(void) {
> +bool amdgpu_atpx_dgpu_req_power_for_displays(void)
> +{
>   return amdgpu_atpx_priv.atpx.dgpu_req_power_for_displays;
>  }
>  
>  #if defined(CONFIG_ACPI)
> -void *amdgpu_atpx_get_dhandle(void) {
> +void *amdgpu_atpx_get_dhandle(void)
> +{
>   return amdgpu_atpx_priv.dhandle;
>  }
>  #endif
> @@ -134,7 +139,7 @@ static union acpi_object *amdgpu_atpx_call(acpi_handle 
> handle, int function,
>  
>   /* Fail only if calling the method fails and ATPX is supported */
>   if (ACPI_FAILURE(status) && status != AE_NOT_FOUND) {
> - printk("failed to evaluate ATPX got %s\n",
> + pr_err("failed to evaluate ATPX got %s\n",
>  acpi_format_exception(status));
>   kfree(buffer.pointer);
>   return NULL;
> @@ -190,7 +195,7 @@ static int amdgpu_atpx_validate(struct amdgpu_atpx *atpx)
>  
>   size = *(u16 *) info->buffer.pointer;
>   if (size < 10) {
> - printk("ATPX buffer is too small: %zu\n", size);
> + pr_err("ATPX buffer is too small: %zu\n", size);
>   kfree(info);
>   return -EINVAL;
>   }
> @@ -223,11 +228,11 @@ static int amdgpu_atpx_validate(struct amdgpu_atpx 
> *atpx)
>   atpx->is_hybrid = false;
>   if (valid_bits & ATPX_MS_HYBRID_GFX_SUPPORTED) {
>   if (amdgpu_atpx_priv.quirks & AMDGPU_PX_QUIRK_FORCE_ATPX) {
> - printk("ATPX Hybrid Graphics, forcing to ATPX\n");
> + pr_info("ATPX Hybrid Graphics, forcing to ATPX\n");

This should be a pr_warn().

>   atpx->functions.power_cntl = true;
>   atpx->is_hybrid = false;
>   } else {
> - printk("ATPX Hybrid Graphics\n");
> + pr_info("ATPX Hybrid Graphics\n");

I'd use pr_notice() here.

>   /*
>* Disable legacy PM methods only when pcie port PM is 
> usable,
>* otherwise the device might fail to power off or 
> power on.
> @@ -269,7 +274,7 @@ static int amdgpu_atpx_verify_interface(struct 
> amdgpu_atpx *atpx)
>  
>   size = *(u16 *) info->buffer.pointer;
>   if (size < 8) {
> - printk("ATPX buffer is too small: %zu\n", size);
> + pr_err("ATPX buffer is too small: %zu\n", size);
>   err = -EINVAL;
>   goto out;
>   }
> @@ -278,8 +283,8 @@ static int amdgpu_atpx_verify_interface(struct 
> amdgpu_atpx *atpx)
>   memcpy(&output, info->buffer.pointer, size);
>  
>   /* TODO: check version? */
> - printk("ATPX version %u, functions 0x%08x\n",
> -output.version, output.function_bits);
> + pr_info("ATPX version %u, functions 0x%08x\n",
> + output.version, output.function_bits);

Probably pr_notice().

>  
>   amdgpu_atpx_parse_functions(&atpx->functions, output.function_bits);
>  

-- 
Regards,
Luben
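The level choices suggested above (pr_err for failures, pr_warn for the forced-ATPX quirk, pr_notice for informational state) can be mocked in userspace. This is a simplified sketch, not kernel code: in the kernel, pr_err()/pr_warn()/pr_notice() expand to printk(KERN_ERR ...) and so on, stamping an explicit syslog severity (3 = err, 4 = warning, 5 = notice) on each message, whereas a bare printk() falls back to the default level. The capture buffer below exists only so the behavior can be checked.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Last formatted line, captured for inspection in this mock. */
static char last_line[128];

/* Mimic the kernel convention of prefixing an explicit severity. */
#define MOCK_PRINTK(level, fmt, ...) \
	snprintf(last_line, sizeof(last_line), "<%d>" fmt, level, ##__VA_ARGS__)

#define pr_err(fmt, ...)    MOCK_PRINTK(3, fmt, ##__VA_ARGS__)
#define pr_warn(fmt, ...)   MOCK_PRINTK(4, fmt, ##__VA_ARGS__)
#define pr_notice(fmt, ...) MOCK_PRINTK(5, fmt, ##__VA_ARGS__)
```

With these mocks, the three call sites from the patch each carry a distinct, explicit severity, which is the point of the review comments above.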



Re: [PATCH] Revert "drm/radeon: Prefer dev_* variant over printk"

2023-07-31 Thread Luben Tuikov
On 2023-07-31 08:04, Srinivasan Shanmugam wrote:
> Usage of container_of is wrong here.
> struct acpi_device *adev = container_of(handle, struct acpi_device, handle)
> 
> References: https://gitlab.freedesktop.org/drm/amd/-/issues/2744
> Cc: Guchun Chen 
> Cc: Christian König 
> Cc: Alex Deucher 
> Cc: Bert Karwatzki 
> Signed-off-by: Srinivasan Shanmugam 

Why are there two reverts?
Which commit is _this_ patch reverting?

You should have a single revert commit for a single commit, like what you have
here:

https://lore.kernel.org/r/20230731115828.2850334-1-srinivasan.shanmu...@amd.com

Either this or that, but not both.

If you're reverting a commit, you should list the commit you're reverting,
and it is missing here in the description of this revert.

Regards,
Luben


> ---
>  drivers/gpu/drm/radeon/radeon_atpx_handler.c | 12 
>  1 file changed, 4 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/radeon/radeon_atpx_handler.c 
> b/drivers/gpu/drm/radeon/radeon_atpx_handler.c
> index fb4d931fdf18..595354e3ce0b 100644
> --- a/drivers/gpu/drm/radeon/radeon_atpx_handler.c
> +++ b/drivers/gpu/drm/radeon/radeon_atpx_handler.c
> @@ -94,8 +94,6 @@ static union acpi_object *radeon_atpx_call(acpi_handle 
> handle, int function,
>   union acpi_object atpx_arg_elements[2];
>   struct acpi_object_list atpx_arg;
>   struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL };
> - struct acpi_device *adev = container_of(handle, struct acpi_device, 
> handle);
> - struct device *dev = &adev->dev;
>  
>   atpx_arg.count = 2;
>   atpx_arg.pointer = &atpx_arg_elements[0];
> @@ -117,8 +115,8 @@ static union acpi_object *radeon_atpx_call(acpi_handle 
> handle, int function,
>  
>   /* Fail only if calling the method fails and ATPX is supported */
>   if (ACPI_FAILURE(status) && status != AE_NOT_FOUND) {
> - dev_err(dev, "failed to evaluate ATPX got %s\n",
> - acpi_format_exception(status));
> + pr_err("failed to evaluate ATPX got %s\n",
> +acpi_format_exception(status));
>   kfree(buffer.pointer);
>   return NULL;
>   }
> @@ -159,8 +157,6 @@ static void radeon_atpx_parse_functions(struct 
> radeon_atpx_functions *f, u32 mas
>  static int radeon_atpx_validate(struct radeon_atpx *atpx)
>  {
>   u32 valid_bits = 0;
> - struct acpi_device *adev = container_of(atpx->handle, struct 
> acpi_device, handle);
> - struct device *dev = &adev->dev;
>  
>   if (atpx->functions.px_params) {
>   union acpi_object *info;
> @@ -175,7 +171,7 @@ static int radeon_atpx_validate(struct radeon_atpx *atpx)
>  
>   size = *(u16 *) info->buffer.pointer;
>   if (size < 10) {
> - dev_err(dev, "ATPX buffer is too small: %zu\n", size);
> + pr_err("ATPX buffer is too small: %zu\n", size);
>   kfree(info);
>   return -EINVAL;
>   }
> @@ -206,7 +202,7 @@ static int radeon_atpx_validate(struct radeon_atpx *atpx)
>  
>   atpx->is_hybrid = false;
>   if (valid_bits & ATPX_MS_HYBRID_GFX_SUPPORTED) {
> - dev_info(dev, "ATPX Hybrid Graphics\n");
> + pr_info("ATPX Hybrid Graphics\n");
>   /*
>* Disable legacy PM methods only when pcie port PM is usable,
>* otherwise the device might fail to power off or power on.

-- 
Regards,
Luben



Re: [PATCH RFC v1 00/52] drm/crtc: Rename struct drm_crtc::dev to drm_dev

2023-07-12 Thread Luben Tuikov
On 2023-07-12 09:53, Christian König wrote:
> On 2023-07-12 15:38, Uwe Kleine-König wrote:
>> Hello Maxime,
>>
>> On Wed, Jul 12, 2023 at 02:52:38PM +0200, Maxime Ripard wrote:
>>> On Wed, Jul 12, 2023 at 01:02:53PM +0200, Uwe Kleine-König wrote:
> Background is that this makes merge conflicts easier to handle and detect.
 Really?
>>> FWIW, I agree with Christian here.
>>>
 Each file (apart from include/drm/drm_crtc.h) is only touched once. So
 unless I'm missing something you don't get less or easier conflicts by
 doing it all in a single patch. But you gain the freedom to drop a
 patch for one driver without having to drop the rest with it.
>>> Not really, because the last patch removed the union anyway. So you have
>>> to revert both the last patch, plus that driver one. And then you need
>>> to add a TODO to remove that union eventually.
>> Yes, with a single patch you have only one revert (but 194 files changed,
>> 1264 insertions(+), 1296 deletions(-)) instead of two (one of them: 1
>> file changed, 9 insertions(+), 1 deletion(-); the other maybe a bit
>> bigger). (And maybe you get away with just reverting the last patch.)
>>
>> With a single patch the TODO after a revert is "redo it all again (and
>> prepare for a different set of conflicts)" while with the split series
>> it's only "fix that one driver that was forgotten/borked" + reapply that
>> 10 line patch.
> 
> Yeah, but for a maintainer the size of the patches doesn't matter. 
> That's only interesting if you need to manually review the patch, which 
> you hopefully doesn't do in case of something auto-generated.
> 
> In other words if the patch is auto-generated re-applying it completely 
> is less work than fixing things up individually.
> 
>>   As the one who gets that TODO, I prefer the latter.
> 
> Yeah, but your personal preferences are not a technical relevant 
> argument to a maintainer.
> 
> At the end of the day Dave or Daniel need to decide, because they need 
> to live with it.
> 
> Regards,
> Christian.
> 
>>
>> So in sum: If your metric is "small count of reverted commits", you're
>> right. If however your metric is: Better get 95% of this series' change
>> in than maybe 0%, the split series is the way to do it.
>>
>> With me having spend ~3h on this series' changes, it's maybe
>> understandable that I did it the way I did.
>>
>> FTR: This series was created on top of v6.5-rc1. If you apply it to
>> drm-misc-next you get a (trivial) conflict in patch #2. If I consider to
>> be the responsible maintainer who applies this series, I like being able
>> to just do git am --skip then.
>>
>> FTR#2: In drm-misc-next is a new driver
>> (drivers/gpu/drm/loongson/lsdc_crtc.c) so skipping the last patch for
>> now might indeed be a good idea.
>>
 So I still like the split version better, but I'm open to a more
 verbose reasoning from your side.
>>> You're doing only one thing here, really: you change the name of a
>>> structure field. If it was shared between multiple maintainers, then
>>> sure, splitting that up is easier for everyone, but this will go through
>>> drm-misc, so I can't see the benefit it brings.
>> I see your argument, but I think mine weights more.

I'm with Maxime and Christian on this--a single action necessitates a single 
patch.
One single movement. As Maxime said "either 0 or 100."

As to the name, perhaps "drm_dev" is more descriptive than just "drm".
What is "drm"? Ah it's a "dev", as in "drm dev"... Then why not rename it
to "drm_dev"? You are renaming it from "dev" to something more descriptive
after all. "dev" --> "drm" is no better, but "dev" --> "drm_dev" is just
right.
-- 
Regards,
Luben



Re: [PATCH] drm/amdgpu: Rename to amdgpu_vm_tlb_seq_struct

2023-07-12 Thread Luben Tuikov
On 2023-07-12 03:57, Christian König wrote:
> On 2023-07-12 08:58, Luben Tuikov wrote:
>> Rename struct amdgpu_vm_tlb_seq_cb {...} to struct amdgpu_vm_tlb_seq_struct
>> {...}, so as to not conflict with documentation processing tools. Of course, 
>> C
>> has no problem with this.
> 
> Hui? What exactly is duplicated here? Is the structure defined in 
> different files with a different meaning?

The same name is used for the function and for the structure.

struct amdgpu_vm_tlb_seq_cb {...}

and

static void amdgpu_vm_tlb_seq_cb(struct dma_fence *fence,
 struct dma_fence_cb *cb)

C has no problem with this, but document processing tools do,
and in general it doesn't seem like a good practice to have
the same name for both.

Regards,
Luben

> 
> Christian.
> 
>>
>> Cc: Randy Dunlap 
>> Cc: Alex Deucher 
>> Cc: Christian König 
>> Link: 
>> https://lore.kernel.org/r/b5ebc891-ee63-1638-0377-7b512d34b...@infradead.org
>> Signed-off-by: Luben Tuikov 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 8 
>>   1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index 92a84e7b0db85b..32adc31c093d84 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -111,9 +111,9 @@ struct amdgpu_prt_cb {
>>   };
>>   
>>   /**
>> - * struct amdgpu_vm_tlb_seq_cb - Helper to increment the TLB flush sequence
>> + * struct amdgpu_vm_tlb_seq_struct - Helper to increment the TLB flush 
>> sequence
>>*/
>> -struct amdgpu_vm_tlb_seq_cb {
>> +struct amdgpu_vm_tlb_seq_struct {
>>  /**
>>   * @vm: pointer to the amdgpu_vm structure to set the fence sequence on
>>   */
>> @@ -833,7 +833,7 @@ int amdgpu_vm_update_pdes(struct amdgpu_device *adev,
>>   static void amdgpu_vm_tlb_seq_cb(struct dma_fence *fence,
>>   struct dma_fence_cb *cb)
>>   {
>> -struct amdgpu_vm_tlb_seq_cb *tlb_cb;
>> +struct amdgpu_vm_tlb_seq_struct *tlb_cb;
>>   
>>  tlb_cb = container_of(cb, typeof(*tlb_cb), cb);
>>  atomic64_inc(&tlb_cb->vm->tlb_seq);
>> @@ -871,7 +871,7 @@ int amdgpu_vm_update_range(struct amdgpu_device *adev, 
>> struct amdgpu_vm *vm,
>> struct dma_fence **fence)
>>   {
>>  struct amdgpu_vm_update_params params;
>> -struct amdgpu_vm_tlb_seq_cb *tlb_cb;
>> +struct amdgpu_vm_tlb_seq_struct *tlb_cb;
>>  struct amdgpu_res_cursor cursor;
>>  enum amdgpu_sync_mode sync_mode;
>>  int r, idx;
>>
>> base-commit: 50db2d96b49b7d6cdb12e71e4204cf7180d3bab5
> 
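The point made above ("C has no problem with this") follows from C keeping struct tags in a separate namespace from ordinary identifiers, so a structure and a function may legally share a name; only the documentation tooling trips over the duplicate. A minimal stand-alone demonstration, with stand-in names rather than the driver's:

```c
#include <assert.h>

struct tlb_seq_cb {          /* stand-in for the struct being renamed */
	long seq;
};

/* Same name as the struct tag: perfectly legal C, since struct tags and
 * function names live in different namespaces. */
static long tlb_seq_cb(struct tlb_seq_cb *cb)
{
	return ++cb->seq;
}
```

This compiles cleanly under C, which is exactly why the clash survived until kernel-doc processing complained.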



[PATCH] drm/amdgpu: Rename to amdgpu_vm_tlb_seq_struct

2023-07-12 Thread Luben Tuikov
Rename struct amdgpu_vm_tlb_seq_cb {...} to struct amdgpu_vm_tlb_seq_struct
{...}, so as to not conflict with documentation processing tools. Of course, C
has no problem with this.

Cc: Randy Dunlap 
Cc: Alex Deucher 
Cc: Christian König 
Link: 
https://lore.kernel.org/r/b5ebc891-ee63-1638-0377-7b512d34b...@infradead.org
Signed-off-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 92a84e7b0db85b..32adc31c093d84 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -111,9 +111,9 @@ struct amdgpu_prt_cb {
 };
 
 /**
- * struct amdgpu_vm_tlb_seq_cb - Helper to increment the TLB flush sequence
+ * struct amdgpu_vm_tlb_seq_struct - Helper to increment the TLB flush sequence
  */
-struct amdgpu_vm_tlb_seq_cb {
+struct amdgpu_vm_tlb_seq_struct {
/**
 * @vm: pointer to the amdgpu_vm structure to set the fence sequence on
 */
@@ -833,7 +833,7 @@ int amdgpu_vm_update_pdes(struct amdgpu_device *adev,
 static void amdgpu_vm_tlb_seq_cb(struct dma_fence *fence,
 struct dma_fence_cb *cb)
 {
-   struct amdgpu_vm_tlb_seq_cb *tlb_cb;
+   struct amdgpu_vm_tlb_seq_struct *tlb_cb;
 
tlb_cb = container_of(cb, typeof(*tlb_cb), cb);
	atomic64_inc(&tlb_cb->vm->tlb_seq);
@@ -871,7 +871,7 @@ int amdgpu_vm_update_range(struct amdgpu_device *adev, 
struct amdgpu_vm *vm,
   struct dma_fence **fence)
 {
struct amdgpu_vm_update_params params;
-   struct amdgpu_vm_tlb_seq_cb *tlb_cb;
+   struct amdgpu_vm_tlb_seq_struct *tlb_cb;
struct amdgpu_res_cursor cursor;
enum amdgpu_sync_mode sync_mode;
int r, idx;

base-commit: 50db2d96b49b7d6cdb12e71e4204cf7180d3bab5
-- 
2.41.0



[PATCH] drm/amdgpu: Fix usage of UMC fill record in RAS

2023-06-10 Thread Luben Tuikov
The commit listed in the Fixes tag below introduced a bug in
amdgpu_ras.c::amdgpu_reserve_page_direct(): the new
amdgpu_umc_fill_error_record() already right-shifts the physical address
(argument "uint64_t retired_page", a misleading name) by
AMDGPU_GPU_PAGE_SHIFT internally. Thus, when amdgpu_reserve_page_direct()
passes "address" to that new function, it should NOT right-shift it as well,
since doing so erroneously yields a page address of 0 for the first
2^(2*AMDGPU_GPU_PAGE_SHIFT) memory addresses.

This commit fixes this bug.

Cc: Tao Zhou 
Cc: Hawking Zhang 
Cc: Alex Deucher 
Fixes: 400013b268cb66 ("drm/amdgpu: add umc_fill_error_record to make code more 
simple")
Signed-off-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 27a32933cbee3b..700eb180ea60fa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -171,8 +171,7 @@ static int amdgpu_reserve_page_direct(struct amdgpu_device 
*adev, uint64_t addre
 
	memset(&err_rec, 0x0, sizeof(struct eeprom_table_record));
	err_data.err_addr = &err_rec;
-	amdgpu_umc_fill_error_record(&err_data, address,
-			(address >> AMDGPU_GPU_PAGE_SHIFT), 0, 0);
+	amdgpu_umc_fill_error_record(&err_data, address, address, 0, 0);
 
if (amdgpu_bad_page_threshold != 0) {
amdgpu_ras_add_bad_pages(adev, err_data.err_addr,

base-commit: 7eda4451177abbc8d2fab24f816a3243dd1808d8
prerequisite-patch-id: f2a3eadc5d7074225109701f1bb43b38bd6024fd
-- 
2.41.0
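The double-shift described in the commit message can be checked with a few lines of arithmetic. This is a sketch, not driver code: the constant mirrors the driver's 4 KiB GPU page (a shift of 12), and fill_error_record() stands in for amdgpu_umc_fill_error_record()'s internal shift.

```c
#include <assert.h>
#include <stdint.h>

#define GPU_PAGE_SHIFT 12   /* assumed value of AMDGPU_GPU_PAGE_SHIFT */

/* The callee already shifts internally, as the fixed commit does. */
static uint64_t fill_error_record(uint64_t retired_page)
{
	return retired_page >> GPU_PAGE_SHIFT;
}
```

A caller that pre-shifts therefore applies the shift twice, collapsing every address below 2^(2*GPU_PAGE_SHIFT) (16 MiB here) to page 0, while the fixed caller passes the raw address and the page number survives.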



[PATCH] drm/amdgpu: Report ras_num_recs in debugfs

2023-06-02 Thread Luben Tuikov
Report the number of records stored in the RAS EEPROM table in debugfs.

This can be used by user-space to calculate the capacity of the RAS EEPROM
table since "bad_page_cnt_threshold" is also reported in the same place in
debugfs.

See commit reference 7fb6407145479d (drm/amdgpu: Add bad_page_cnt_threshold to
debugfs, 2021-04-13).

ras_num_recs can already be inferred by dumping the RAS EEPROM table, also in
the same debugfs location, see commit reference c65b0805e77919 (drm/amdgpu:
RAS EEPROM table is now in debugfs, 2021-04-08). This commit makes it an
integer value easily shown in a single file.

Cc: Alex Deucher 
Cc: Hawking Zhang 
Cc: Tao Zhou 
Cc: Stanley Yang 
Cc: John Clements 
Signed-off-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index f2da69adcd9d48..68163890f9632d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1487,6 +1487,7 @@ static int amdgpu_ras_sysfs_remove_all(struct 
amdgpu_device *adev)
 static struct dentry *amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device 
*adev)
 {
struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct amdgpu_ras_eeprom_control *eeprom = &con->eeprom_control;
struct drm_minor  *minor = adev_to_drm(adev)->primary;
struct dentry *dir;
 
@@ -1497,6 +1498,7 @@ static struct dentry 
*amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *
			    &amdgpu_ras_debugfs_eeprom_ops);
	debugfs_create_u32("bad_page_cnt_threshold", 0444, dir,
			   &con->bad_page_cnt_threshold);
+	debugfs_create_u32("ras_num_recs", 0444, dir, &eeprom->ras_num_recs);
	debugfs_create_x32("ras_hw_enabled", 0444, dir, &con->ras_hw_enabled);
	debugfs_create_x32("ras_enabled", 0444, dir, &con->ras_enabled);
debugfs_create_file("ras_eeprom_size", S_IRUGO, dir, adev,

base-commit: e82c20a8755677528a5e01e58b7763a42edf
-- 
2.41.0
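The user-space calculation the commit message describes can be sketched directly: with both "ras_num_recs" and "bad_page_cnt_threshold" readable from debugfs, the remaining RAS EEPROM table capacity is simply their difference. The values below are made up for illustration; this is not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Remaining record slots before the bad-page threshold is reached. */
static uint32_t ras_table_free_slots(uint32_t bad_page_cnt_threshold,
				     uint32_t ras_num_recs)
{
	if (ras_num_recs >= bad_page_cnt_threshold)
		return 0;                 /* table considered full */
	return bad_page_cnt_threshold - ras_num_recs;
}
```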



Re: [PATCH] drm/amdgpu: Mark mmhub_v1_8_mmea_err_status_reg as __maybe_unused

2023-05-25 Thread Luben Tuikov
On 2023-05-25 12:29, Nathan Chancellor wrote:
> On Thu, May 25, 2023 at 12:26:56PM -0400, Luben Tuikov wrote:
>> On 2023-05-25 11:22, Nathan Chancellor wrote:
>>> On Fri, May 19, 2023 at 06:14:38PM +0530, Srinivasan Shanmugam wrote:
>>>> Silencing the following compiler error:
>>>>
>>>> drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c:704:23: error: variable 
>>>> 'mmhub_v1_8_mmea_err_status_reg' is not needed and will not be emitted 
>>>> [-Werror,-Wunneeded-internal-declaration]
>>>> static const uint32_t mmhub_v1_8_mmea_err_status_reg[] = {
>>>>   ^
>>>> 1 error generated.
>>>>
>>>> Mark the variable as __maybe_unused to make it clear to clang that this
>>>> is expected, so there is no more warning.
>>>>
>>>> Cc: Christian König 
>>>> Cc: Lijo Lazar 
>>>> Cc: Luben Tuikov 
>>>> Cc: Alex Deucher 
>>>> Signed-off-by: Srinivasan Shanmugam 
>>>
>>> Traditionally, this attribute would go between the [] and =, but that is
>>> a nit. Can someone please pick this up to unblock our builds on -next?
>>>
>>> Reviewed-by: Nathan Chancellor 
>>
>> I'll pick this up, fix it, and submit to amd-staging-drm-next.
> 
> Thanks a lot :)
> 
>> Which -next are you referring to, Nathan?
> 
> linux-next, this warning breaks the build when -Werror is enabled, such
> as with allmodconfig:
> 
> https://storage.tuxsuite.com/public/clangbuiltlinux/continuous-integration2/builds/2QHtlCTz2JL0yXNpRB5hVmiP9lq/build.log
> 

Hi Nathan,

Thanks for the pointers.

Srinivasan has already submitted it to amd-staging-drm-next.

Seems Alex will push it upstream.

Not sure how fast you need it; we can send you the commit itself
for you to git-am if you cannot wait.

Regards,
Luben

> Cheers,
> Nathan
> 
>>>> ---
>>>>  drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c | 1 +
>>>>  1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>>>> index 3648994724c2..cba087e529c0 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>>>> @@ -701,6 +701,7 @@ static void mmhub_v1_8_reset_ras_error_count(struct 
>>>> amdgpu_device *adev)
>>>>mmhub_v1_8_inst_reset_ras_error_count(adev, i);
>>>>  }
>>>>  
>>>> +__maybe_unused
>>>>  static const uint32_t mmhub_v1_8_mmea_err_status_reg[] = {
>>>>regMMEA0_ERR_STATUS,
>>>>regMMEA1_ERR_STATUS,
>>>> -- 
>>>> 2.25.1
>>>>
>>



Re: [PATCH] drm/amdgpu: Mark mmhub_v1_8_mmea_err_status_reg as __maybe_unused

2023-05-25 Thread Luben Tuikov
On 2023-05-25 11:22, Nathan Chancellor wrote:
> On Fri, May 19, 2023 at 06:14:38PM +0530, Srinivasan Shanmugam wrote:
>> Silencing the compiler from below compilation error:
>>
>> drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c:704:23: error: variable 
>> 'mmhub_v1_8_mmea_err_status_reg' is not needed and will not be emitted 
>> [-Werror,-Wunneeded-internal-declaration]
>> static const uint32_t mmhub_v1_8_mmea_err_status_reg[] = {
>>   ^
>> 1 error generated.
>>
>> Mark the variable as __maybe_unused to make it clear to clang that this
>> is expected, so there is no more warning.
>>
>> Cc: Christian König 
>> Cc: Lijo Lazar 
>> Cc: Luben Tuikov 
>> Cc: Alex Deucher 
>> Signed-off-by: Srinivasan Shanmugam 
> 
> Traditionally, this attribute would go between the [] and =, but that is
> a nit. Can someone please pick this up to unblock our builds on -next?
> 
> Reviewed-by: Nathan Chancellor 

I'll pick this up, fix it, and submit to amd-staging-drm-next.

Which -next are you referring to, Nathan?

Regards,
Luben


> 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c 
>> b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>> index 3648994724c2..cba087e529c0 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
>> @@ -701,6 +701,7 @@ static void mmhub_v1_8_reset_ras_error_count(struct 
>> amdgpu_device *adev)
>>  mmhub_v1_8_inst_reset_ras_error_count(adev, i);
>>  }
>>  
>> +__maybe_unused
>>  static const uint32_t mmhub_v1_8_mmea_err_status_reg[] = {
>>  regMMEA0_ERR_STATUS,
>>  regMMEA1_ERR_STATUS,
>> -- 
>> 2.25.1
>>



Re: [PATCH] drm/amd/amdgpu: Fix errors & warnings in amdgpu _uvd, _vce.c

2023-05-17 Thread Luben Tuikov
Acked-by: Luben Tuikov 

Regards,
Luben

On 2023-05-17 11:56, Srinivasan Shanmugam wrote:
> Fix below checkpatch errors & warnings:
> 
> In amdgpu_uvd.c:
> 
> WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
> WARNING: Prefer 'unsigned int *' to bare use of 'unsigned *'
> WARNING: Missing a blank line after declarations
> WARNING: %Lx is non-standard C, use %llx
> ERROR: space required before the open parenthesis '('
> ERROR: space required before the open brace '{'
> WARNING: %LX is non-standard C, use %llX
> WARNING: Block comments use * on subsequent lines
> +/* multiple fence commands without any stream commands in between can
> +   crash the vcpu so just try to emmit a dummy create/destroy msg to
> 
> WARNING: Block comments use a trailing */ on a separate line
> +   avoid this */
> WARNING: braces {} are not necessary for single statement blocks
> +   for (j = 0; j < adev->uvd.num_enc_rings; ++j) {
> +   fences += 
> amdgpu_fence_count_emitted(&adev->uvd.inst[i].ring_enc[j]);
> +   }
> 
> In amdgpu_vce.c:
> 
> WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
> WARNING: Missing a blank line after declarations
> WARNING: %Lx is non-standard C, use %llx
> WARNING: Possible repeated word: 'we'
> ERROR: space required before the open parenthesis '('
> 
> Cc: Alex Deucher 
> Cc: Christian König 
> Signed-off-by: Srinivasan Shanmugam 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c | 83 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c | 39 ++--
>  2 files changed, 63 insertions(+), 59 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> index 6887109abb13..b7441654e6fa 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> @@ -96,16 +96,16 @@
>   */
>  struct amdgpu_uvd_cs_ctx {
>   struct amdgpu_cs_parser *parser;
> - unsigned reg, count;
> - unsigned data0, data1;
> - unsigned idx;
> + unsigned int reg, count;
> + unsigned int data0, data1;
> + unsigned int idx;
>   struct amdgpu_ib *ib;
>  
>   /* does the IB has a msg command */
>   bool has_msg_cmd;
>  
>   /* minimum buffer sizes */
> - unsigned *buf_sizes;
> + unsigned int *buf_sizes;
>  };
>  
>  #ifdef CONFIG_DRM_AMDGPU_SI
> @@ -186,7 +186,7 @@ int amdgpu_uvd_sw_init(struct amdgpu_device *adev)
>   unsigned long bo_size;
>   const char *fw_name;
>   const struct common_firmware_header *hdr;
> - unsigned family_id;
> + unsigned int family_id;
>   int i, j, r;
>  
>   INIT_DELAYED_WORK(>uvd.idle_work, amdgpu_uvd_idle_work_handler);
> @@ -275,7 +275,7 @@ int amdgpu_uvd_sw_init(struct amdgpu_device *adev)
>   family_id = le32_to_cpu(hdr->ucode_version) & 0xff;
>  
>   if (adev->asic_type < CHIP_VEGA20) {
> - unsigned version_major, version_minor;
> + unsigned int version_major, version_minor;
>  
>   version_major = (le32_to_cpu(hdr->ucode_version) >> 24) & 0xff;
>   version_minor = (le32_to_cpu(hdr->ucode_version) >> 8) & 0xff;
> @@ -420,7 +420,7 @@ int amdgpu_uvd_entity_init(struct amdgpu_device *adev)
>  
>  int amdgpu_uvd_suspend(struct amdgpu_device *adev)
>  {
> - unsigned size;
> + unsigned int size;
>   void *ptr;
>   int i, j, idx;
>   bool in_ras_intr = amdgpu_ras_intr_triggered();
> @@ -469,7 +469,7 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
>  
>  int amdgpu_uvd_resume(struct amdgpu_device *adev)
>  {
> - unsigned size;
> + unsigned int size;
>   void *ptr;
>   int i, idx;
>  
> @@ -491,7 +491,7 @@ int amdgpu_uvd_resume(struct amdgpu_device *adev)
>   adev->uvd.inst[i].saved_bo = NULL;
>   } else {
>   const struct common_firmware_header *hdr;
> - unsigned offset;
> + unsigned int offset;
>  
>   hdr = (const struct common_firmware_header 
> *)adev->uvd.fw->data;
>   if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP) {
> @@ -542,6 +542,7 @@ void amdgpu_uvd_free_handles(struct amdgpu_device *adev, 
> struct drm_file *filp)
>  static void amdgpu_uvd_force_into_uvd_segment(struct amdgpu_bo *abo)
>  {
>   int i;
> +
>   for (i = 0; i < abo->placement.num_placement; ++i) {
>   abo->placements[i].fpfn = 0 >> PAGE_SHIFT;
>   abo->placements[i].lpfn = (256 * 1024 * 1024) &

Re: [PATCH] drm/amd/amdgpu: Fix errors & warnings in amdgpu_vcn.c

2023-05-17 Thread Luben Tuikov
Acked-by: Luben Tuikov 

Regards,
Luben

On 2023-05-17 11:13, Srinivasan Shanmugam wrote:
> Fix below checkpatch insisted error & warnings:
> 
> ERROR: space required before the open brace '{'
> WARNING: braces {} are not necessary for any arm of this statement
> +   if ((type == VCN_ENCODE_RING) && (vcn_config & 
> VCN_BLOCK_ENCODE_DISABLE_MASK)) {
> [...]
> +   } else if ((type == VCN_DECODE_RING) && (vcn_config & 
> VCN_BLOCK_DECODE_DISABLE_MASK)) {
> [...]
> +   } else if ((type == VCN_UNIFIED_RING) && (vcn_config & 
> VCN_BLOCK_QUEUE_DISABLE_MASK)) {
> [...]
> ERROR: code indent should use tabs where possible
> WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
> WARNING: braces {} are not necessary for single statement blocks
> +   for (i = 0; i < adev->vcn.num_enc_rings; ++i) {
> +   fence[j] += 
> amdgpu_fence_count_emitted(&adev->vcn.inst[j].ring_enc[i]);
> +
> ERROR: space required before the open parenthesis '('
> WARNING: Missing a blank line after declarations
> WARNING: please, no spaces at the start of a line
> WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using 
> octal permissions '0444'.
> 
> Cc: Alex Deucher 
> Cc: Christian König 
> Signed-off-by: Srinivasan Shanmugam 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 35 -
>  1 file changed, 17 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> index 06ec2dc55857..c088111c2321 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> @@ -169,7 +169,7 @@ int amdgpu_vcn_sw_init(struct amdgpu_device *adev)
>   if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP)
>   bo_size += 
> AMDGPU_GPU_PAGE_ALIGN(le32_to_cpu(hdr->ucode_size_bytes) + 8);
>  
> - if (adev->ip_versions[UVD_HWIP][0] >= IP_VERSION(4, 0, 0)){
> + if (adev->ip_versions[UVD_HWIP][0] >= IP_VERSION(4, 0, 0)) {
>   fw_shared_size = AMDGPU_GPU_PAGE_ALIGN(sizeof(struct 
> amdgpu_vcn4_fw_shared));
>   log_offset = offsetof(struct amdgpu_vcn4_fw_shared, fw_log);
>   } else {
> @@ -276,20 +276,19 @@ bool amdgpu_vcn_is_disabled_vcn(struct amdgpu_device 
> *adev, enum vcn_ring_type t
>   bool ret = false;
>   int vcn_config = adev->vcn.vcn_config[vcn_instance];
>  
> - if ((type == VCN_ENCODE_RING) && (vcn_config & 
> VCN_BLOCK_ENCODE_DISABLE_MASK)) {
> + if ((type == VCN_ENCODE_RING) && (vcn_config & 
> VCN_BLOCK_ENCODE_DISABLE_MASK))
>   ret = true;
> - } else if ((type == VCN_DECODE_RING) && (vcn_config & 
> VCN_BLOCK_DECODE_DISABLE_MASK)) {
> + else if ((type == VCN_DECODE_RING) && (vcn_config & 
> VCN_BLOCK_DECODE_DISABLE_MASK))
>   ret = true;
> - } else if ((type == VCN_UNIFIED_RING) && (vcn_config & 
> VCN_BLOCK_QUEUE_DISABLE_MASK)) {
> + else if ((type == VCN_UNIFIED_RING) && (vcn_config & 
> VCN_BLOCK_QUEUE_DISABLE_MASK))
>   ret = true;
> - }
>  
>   return ret;
>  }
>  
>  int amdgpu_vcn_suspend(struct amdgpu_device *adev)
>  {
> - unsigned size;
> + unsigned int size;
>   void *ptr;
>   int i, idx;
>  
> @@ -318,7 +317,7 @@ int amdgpu_vcn_suspend(struct amdgpu_device *adev)
>  
>  int amdgpu_vcn_resume(struct amdgpu_device *adev)
>  {
> - unsigned size;
> + unsigned int size;
>   void *ptr;
>   int i, idx;
>  
> @@ -340,7 +339,7 @@ int amdgpu_vcn_resume(struct amdgpu_device *adev)
>   adev->vcn.inst[i].saved_bo = NULL;
>   } else {
>   const struct common_firmware_header *hdr;
> - unsigned offset;
> + unsigned int offset;
>  
>   hdr = (const struct common_firmware_header 
> *)adev->vcn.fw->data;
>   if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP) {
> @@ -371,9 +370,8 @@ static void amdgpu_vcn_idle_work_handler(struct 
> work_struct *work)
>   if (adev->vcn.harvest_config & (1 << j))
>   continue;
>  
> - for (i = 0; i < adev->vcn.num_enc_rings; ++i) {
> + for (i = 0; i < adev->vcn.num_enc_rings; ++i)
>   fence[j] += 
> amdgpu_fence_count_emitted(&adev->vcn.inst[j].ring_enc[i]);
> - }
>  
>   if (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG){
>

Re: [PATCH] drm/amd/amdgpu: Fix warnings in amdgpu_encoders.c

2023-05-17 Thread Luben Tuikov
Acked-by: Luben Tuikov 

Regards,
Luben

On 2023-05-17 09:11, Srinivasan Shanmugam wrote:
> Fix below checkpatch warnings:
> 
> WARNING: Missing a blank line after declarations
> +   struct amdgpu_connector *amdgpu_connector = 
> to_amdgpu_connector(connector);
> +   amdgpu_encoder->active_device = 
> amdgpu_encoder->devices & amdgpu_connector->devices;
> 
> WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
> 
> Cc: Alex Deucher 
> Cc: Christian König 
> Signed-off-by: Srinivasan Shanmugam 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_encoders.c | 13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_encoders.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_encoders.c
> index c96e458ed088..93868ff01fb7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_encoders.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_encoders.c
> @@ -71,6 +71,7 @@ void amdgpu_encoder_set_active_device(struct drm_encoder 
> *encoder)
>   drm_for_each_connector_iter(connector, &iter) {
>   if (connector->encoder == encoder) {
>   struct amdgpu_connector *amdgpu_connector = 
> to_amdgpu_connector(connector);
> +
>   amdgpu_encoder->active_device = amdgpu_encoder->devices 
> & amdgpu_connector->devices;
>   DRM_DEBUG_KMS("setting active device to %08x from %08x 
> %08x for encoder %d\n",
> amdgpu_encoder->active_device, 
> amdgpu_encoder->devices,
> @@ -166,12 +167,12 @@ void amdgpu_panel_mode_fixup(struct drm_encoder 
> *encoder,
>  {
>   struct amdgpu_encoder *amdgpu_encoder = to_amdgpu_encoder(encoder);
>   struct drm_display_mode *native_mode = &amdgpu_encoder->native_mode;
> - unsigned hblank = native_mode->htotal - native_mode->hdisplay;
> - unsigned vblank = native_mode->vtotal - native_mode->vdisplay;
> - unsigned hover = native_mode->hsync_start - native_mode->hdisplay;
> - unsigned vover = native_mode->vsync_start - native_mode->vdisplay;
> - unsigned hsync_width = native_mode->hsync_end - 
> native_mode->hsync_start;
> - unsigned vsync_width = native_mode->vsync_end - 
> native_mode->vsync_start;
> + unsigned int hblank = native_mode->htotal - native_mode->hdisplay;
> + unsigned int vblank = native_mode->vtotal - native_mode->vdisplay;
> + unsigned int hover = native_mode->hsync_start - native_mode->hdisplay;
> + unsigned int vover = native_mode->vsync_start - native_mode->vdisplay;
> + unsigned int hsync_width = native_mode->hsync_end - 
> native_mode->hsync_start;
> + unsigned int vsync_width = native_mode->vsync_end - 
> native_mode->vsync_start;
>  
>   adjusted_mode->clock = native_mode->clock;
>   adjusted_mode->flags = native_mode->flags;



Re: [PATCH] drm/amd/amdgpu: Fix error & warnings in amdgpu_ttm.c

2023-05-17 Thread Luben Tuikov
Acked-by: Luben Tuikov 

Regards,
Luben

On 2023-05-17 10:45, Srinivasan Shanmugam wrote:
> Fix below checkpatch insisted error & warnings:
> 
> ERROR: Macros with complex values should be enclosed in parentheses
> WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
> WARNING: braces {} are not necessary for single statement blocks
> WARNING: Block comments use a trailing */ on a separate line
> WARNING: Missing a blank line after declarations
> 
> Cc: Alex Deucher 
> Cc: Christian König 
> Signed-off-by: Srinivasan Shanmugam 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 25 +
>  1 file changed, 13 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index ad664ef640ff..f6d9f904b20b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -65,7 +65,7 @@
>  
>  MODULE_IMPORT_NS(DMA_BUF);
>  
> -#define AMDGPU_TTM_VRAM_MAX_DW_READ  (size_t)128
> +#define AMDGPU_TTM_VRAM_MAX_DW_READ  ((size_t)128)
>  
>  static int amdgpu_ttm_backend_bind(struct ttm_device *bdev,
>  struct ttm_tt *ttm,
> @@ -184,11 +184,11 @@ static void amdgpu_evict_flags(struct ttm_buffer_object 
> *bo,
>  static int amdgpu_ttm_map_buffer(struct ttm_buffer_object *bo,
>struct ttm_resource *mem,
>struct amdgpu_res_cursor *mm_cur,
> -  unsigned window, struct amdgpu_ring *ring,
> +  unsigned int window, struct amdgpu_ring *ring,
>bool tmz, uint64_t *size, uint64_t *addr)
>  {
>   struct amdgpu_device *adev = ring->adev;
> - unsigned offset, num_pages, num_dw, num_bytes;
> + unsigned int offset, num_pages, num_dw, num_bytes;
>   uint64_t src_addr, dst_addr;
>   struct amdgpu_job *job;
>   void *cpu_addr;
> @@ -1061,9 +1061,9 @@ static struct ttm_tt *amdgpu_ttm_tt_create(struct 
> ttm_buffer_object *bo,
>   enum ttm_caching caching;
>  
>   gtt = kzalloc(sizeof(struct amdgpu_ttm_tt), GFP_KERNEL);
> - if (gtt == NULL) {
> + if (!gtt)
>   return NULL;
> - }
> +
>   gtt->gobj = &bo->base;
>   if (adev->gmc.mem_partitions && abo->xcp_id >= 0)
>   gtt->pool_id = KFD_XCP_MEM_ID(adev, abo->xcp_id);
> @@ -1848,9 +1848,8 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
>*place on the VRAM, so reserve it early.
>*/
>   r = amdgpu_ttm_fw_reserve_vram_init(adev);
> - if (r) {
> + if (r)
>   return r;
> - }
>  
>   /*
>*The reserved vram for driver must be pinned to the specified
> @@ -1874,7 +1873,8 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
>   /* allocate memory as required for VGA
>* This is used for VGA emulation and pre-OS scanout buffers to
>* avoid display artifacts while transitioning between pre-OS
> -  * and driver.  */
> +  * and driver.
> +  */
>   if (!adev->gmc.is_app_apu) {
>   r = amdgpu_bo_create_kernel_at(adev, 0,
>  adev->mman.stolen_vga_size,
> @@ -1903,7 +1903,7 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
>   }
>  
>   DRM_INFO("amdgpu: %uM of VRAM memory ready\n",
> -  (unsigned) (adev->gmc.real_vram_size / (1024 * 1024)));
> +  (unsigned int)(adev->gmc.real_vram_size / (1024 * 1024)));
>  
>   /* Compute GTT size, either based on TTM limit
>* or whatever the user passed on module init.
> @@ -1920,7 +1920,7 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
>   return r;
>   }
>   DRM_INFO("amdgpu: %uM of GTT memory ready.\n",
> -  (unsigned)(gtt_size / (1024 * 1024)));
> +  (unsigned int)(gtt_size / (1024 * 1024)));
>  
>   /* Initialize preemptible memory pool */
>   r = amdgpu_preempt_mgr_init(adev);
> @@ -1962,6 +1962,7 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
>  void amdgpu_ttm_fini(struct amdgpu_device *adev)
>  {
>   int idx;
> +
>   if (!adev->mman.initialized)
>   return;
>  
> @@ -2090,10 +2091,10 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring, 
> uint64_t src_offset,
>  bool vm_needs_flush, bool tmz)
>  {
>   struct amdgpu_device *adev = ring->adev;
> - unsigned num_loops, num_dw;
> + unsigned int num_loops, num_dw;
>   struct amdgpu_job *job;
>   uint32_t max_bytes;
> - unsigned i;
> + unsigned int i;
>   int r;
>  
>   if (!direct_submit && !ring->sched.ready) {



Re: [PATCH] drm/sched: Check scheduler work queue before calling timeout handling

2023-05-10 Thread Luben Tuikov
On 2023-05-10 10:24, vitaly prosyak wrote:
> 
> On 2023-05-10 10:19, Luben Tuikov wrote:
>> On 2023-05-10 09:51, vitaly.pros...@amd.com wrote:
>>> From: Vitaly Prosyak 
>>>
>>> During an IGT GPU reset test we see again oops despite of
>>> commit 0c8c901aaaebc9 (drm/sched: Check scheduler ready before calling
>>> timeout handling).
>>>
>>> It uses ready condition whether to call drm_sched_fault which unwind
>>> the TDR leads to GPU reset.
>>> However it looks the ready condition is overloaded with other meanings,
>>> for example, for the following stack is related GPU reset :
>>>
>>> 0  gfx_v9_0_cp_gfx_start
>>> 1  gfx_v9_0_cp_gfx_resume
>>> 2  gfx_v9_0_cp_resume
>>> 3  gfx_v9_0_hw_init
>>> 4  gfx_v9_0_resume
>>> 5  amdgpu_device_ip_resume_phase2
>>>
>>> does the following:
>>> /* start the ring */
>>> gfx_v9_0_cp_gfx_start(adev);
>>> ring->sched.ready = true;
>>>
>>> The same approach is for other ASICs as well :
>>> gfx_v8_0_cp_gfx_resume
>>> gfx_v10_0_kiq_resume, etc...
>>>
>>> As a result, our GPU reset test causes GPU fault which calls 
>>> unconditionally gfx_v9_0_fault
>>> and then drm_sched_fault. However now it depends on whether the interrupt 
>>> service routine
>>> drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which 
>>> sets the ready
>>> field of the scheduler to true even  for uninitialized schedulers and 
>>> causes oops vs
>>> no fault or when ISR  drm_sched_fault is completed prior  
>>> gfx_v9_0_cp_gfx_start and
>>> NULL pointer dereference does not occur.
>>>
>>> Use the field timeout_wq  to prevent oops for uninitialized schedulers.
>>> The field could be initialized by the work queue of resetting the domain.
>>>
>>> Fixes: 0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling 
>>> timeout handling")
>>>
>>> v1: Corrections to commit message (Luben)
>>> Signed-off-by: Vitaly Prosyak 
>>> Reviewed-by: Luben Tuikov 
>> I didn't give my RB to this patch so I'm not sure what it is doing here.
> I removed your rb, also if you do not know what is doing here why do you want 
> to push this to amd-staging-drm-next and to  drm-misc-fixed?

I'll add my RB as I push it to those two branches.
I'll also add a Link tag and fix the commit SHA for the Fixes tag to
one which is found in drm-misc-fixes.

Thanks for the patch fixing this long-standing bug.

Regards,
Luben


>>
>> The fixes tag should be before the SOB tag, and the v1 line should be 
>> separated
>> by a line before the Git tags.
>>
>> Since this is a good patch and I want it in both drm-misc-fixed and 
>> amd-staging-drm-next,
>> I'll submit it to drm-misc-fixed with a Link: and RB/SOB tag there and then 
>> cherry-pick
>> that into amd-staging-drm-next.
>>
>> Don't push it to amd-staging-drm-next.
>>
>> I'll fix this and submit to amd-staging-drm-next and to drm-misc-fixed with
>> a Link: tag.
>>
>> Regards,
>> Luben
>>
>>
>>> ---
>>>  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 649fac2e1ccb..670b7997f389 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -308,7 +308,7 @@ static void drm_sched_start_timeout(struct 
>>> drm_gpu_scheduler *sched)
>>>   */
>>>  void drm_sched_fault(struct drm_gpu_scheduler *sched)
>>>  {
>>> -   if (sched->ready)
>>> +   if (sched->timeout_wq)
>>> mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
>>>  }
>>>  EXPORT_SYMBOL(drm_sched_fault);



Re: [PATCH] drm/sched: Check scheduler work queue before calling timeout handling

2023-05-10 Thread Luben Tuikov
On 2023-05-10 09:51, vitaly.pros...@amd.com wrote:
> From: Vitaly Prosyak 
> 
> During an IGT GPU reset test we see again oops despite of
> commit 0c8c901aaaebc9 (drm/sched: Check scheduler ready before calling
> timeout handling).
> 
> It uses ready condition whether to call drm_sched_fault which unwind
> the TDR leads to GPU reset.
> However it looks the ready condition is overloaded with other meanings,
> for example, for the following stack is related GPU reset :
> 
> 0  gfx_v9_0_cp_gfx_start
> 1  gfx_v9_0_cp_gfx_resume
> 2  gfx_v9_0_cp_resume
> 3  gfx_v9_0_hw_init
> 4  gfx_v9_0_resume
> 5  amdgpu_device_ip_resume_phase2
> 
> does the following:
>   /* start the ring */
>   gfx_v9_0_cp_gfx_start(adev);
>   ring->sched.ready = true;
> 
> The same approach is for other ASICs as well :
> gfx_v8_0_cp_gfx_resume
> gfx_v10_0_kiq_resume, etc...
> 
> As a result, our GPU reset test causes GPU fault which calls unconditionally 
> gfx_v9_0_fault
> and then drm_sched_fault. However now it depends on whether the interrupt 
> service routine
> drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which 
> sets the ready
> field of the scheduler to true even  for uninitialized schedulers and causes 
> oops vs
> no fault or when ISR  drm_sched_fault is completed prior  
> gfx_v9_0_cp_gfx_start and
> NULL pointer dereference does not occur.
> 
> Use the field timeout_wq  to prevent oops for uninitialized schedulers.
> The field could be initialized by the work queue of resetting the domain.
> 
> Fixes: 0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling 
> timeout handling")
> 
> v1: Corrections to commit message (Luben)
> Signed-off-by: Vitaly Prosyak 
> Reviewed-by: Luben Tuikov 

I didn't give my RB to this patch so I'm not sure what it is doing here.

The fixes tag should be before the SOB tag, and the v1 line should be separated
by a line before the Git tags.

Since this is a good patch and I want it in both drm-misc-fixes and
amd-staging-drm-next, I'll submit it to drm-misc-fixes with a Link: and
RB/SOB tag there and then cherry-pick that into amd-staging-drm-next.

Don't push it to amd-staging-drm-next.

I'll fix this and submit to amd-staging-drm-next and to drm-misc-fixes with
a Link: tag.

Regards,
Luben


> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 649fac2e1ccb..670b7997f389 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -308,7 +308,7 @@ static void drm_sched_start_timeout(struct 
> drm_gpu_scheduler *sched)
>   */
>  void drm_sched_fault(struct drm_gpu_scheduler *sched)
>  {
> - if (sched->ready)
> + if (sched->timeout_wq)
>   mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
>  }
>  EXPORT_SYMBOL(drm_sched_fault);



Re: [PATCH] drm/sched: Check scheduler work queue before calling timeout handling

2023-05-10 Thread Luben Tuikov
On 2023-05-09 17:43, vitaly.pros...@amd.com wrote:
> From: Vitaly Prosyak 
> 
> During an IGT GPU reset test we see again oops despite of
> commit 0c8c901aaaebc9bf8bf189ffc116e678f7a2dc16
> drm/sched: Check scheduler ready before calling timeout handling.

You can probably use the more succinct fixes line:
0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling")

> 
> It uses ready condition whether to call drm_sched_fault which unwind
> the TDR leads to GPU reset.
> However it looks the ready condition is overloaded with other meanings,
> for example, for the following stack is related GPU reset :
> 
> 0  gfx_v9_0_cp_gfx_start
> 1  gfx_v9_0_cp_gfx_resume
> 2  gfx_v9_0_cp_resume
> 3  gfx_v9_0_hw_init
> 4  gfx_v9_0_resume
> 5  amdgpu_device_ip_resume_phase2
> 
> does the following:
>   /* start the ring */
>   gfx_v9_0_cp_gfx_start(adev);
>   ring->sched.ready = true;
> 
> The same approach is for other ASICs as well :
> gfx_v8_0_cp_gfx_resume
> gfx_v10_0_kiq_resume, etc...
> 
> As a result, our GPU reset test causes GPU fault which calls unconditionally 
> gfx_v9_0_fault
> and then drm_sched_fault. However now it depends on whether the interrupt 
> service routine
> drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which 
> sets the ready
> field of the scheduler to true even  for not initialized schedulers and 
> causes oops vs

"not initialized" --> "uninitialized" reads better.

> no fault or when ISR  drm_sched_fault is completed prior  
> gfx_v9_0_cp_gfx_start and
> NULL pointer dereference does not occur.
> 
> Use the field timeout_wq  to prevent oops for uninitialized schedulers.
> The field could be initialized by the work queue of resetting the domain.
> 
> Signed-off-by: Vitaly Prosyak 

Add a Fixes tag,

Fixes: 0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling")

Before the SOB tag.

> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 649fac2e1ccb..670b7997f389 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -308,7 +308,7 @@ static void drm_sched_start_timeout(struct 
> drm_gpu_scheduler *sched)
>   */
>  void drm_sched_fault(struct drm_gpu_scheduler *sched)
>  {
> - if (sched->ready)
> + if (sched->timeout_wq)
>   mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
>  }
>  EXPORT_SYMBOL(drm_sched_fault);

Yes, this does indeed seem more correct.

Apply the comments above and repost the patch to amd-gfx and dri-devel and
I'll push it to drm-misc-fixes and amd-staging-drm-next.
-- 
Regards,
Luben



Re: [PATCH 2/2] drm/amdgpu: drop unused function

2023-05-03 Thread Luben Tuikov
I suppose we have this information elsewhere.

Series is:
Reviewed-by: Luben Tuikov 

Regards,
Luben

On 2023-05-03 11:02, Alex Deucher wrote:
> Ping?
> 
> On Thu, Apr 27, 2023 at 2:34 PM Alex Deucher  
> wrote:
>>
>> amdgpu_discovery_get_ip_version() has not been used since
>> commit c40bdfb2ffa4 ("drm/amdgpu: fix incorrect VCN revision in SRIOV")
>> so drop it.
>>
>> Signed-off-by: Alex Deucher 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 48 ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h |  2 -
>>  2 files changed, 50 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
>> index 76ceca05452e..b58d94dc1924 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
>> @@ -1208,54 +1208,6 @@ static int amdgpu_discovery_reg_base_init(struct 
>> amdgpu_device *adev)
>> return 0;
>>  }
>>
>> -int amdgpu_discovery_get_ip_version(struct amdgpu_device *adev, int hw_id, 
>> int number_instance,
>> -   int *major, int *minor, int *revision)
>> -{
>> -   struct binary_header *bhdr;
>> -   struct ip_discovery_header *ihdr;
>> -   struct die_header *dhdr;
>> -   struct ip *ip;
>> -   uint16_t die_offset;
>> -   uint16_t ip_offset;
>> -   uint16_t num_dies;
>> -   uint16_t num_ips;
>> -   int i, j;
>> -
>> -   if (!adev->mman.discovery_bin) {
>> -   DRM_ERROR("ip discovery uninitialized\n");
>> -   return -EINVAL;
>> -   }
>> -
>> -   bhdr = (struct binary_header *)adev->mman.discovery_bin;
>> -   ihdr = (struct ip_discovery_header *)(adev->mman.discovery_bin +
>> -   le16_to_cpu(bhdr->table_list[IP_DISCOVERY].offset));
>> -   num_dies = le16_to_cpu(ihdr->num_dies);
>> -
>> -   for (i = 0; i < num_dies; i++) {
>> -   die_offset = le16_to_cpu(ihdr->die_info[i].die_offset);
>> -   dhdr = (struct die_header *)(adev->mman.discovery_bin + 
>> die_offset);
>> -   num_ips = le16_to_cpu(dhdr->num_ips);
>> -   ip_offset = die_offset + sizeof(*dhdr);
>> -
>> -   for (j = 0; j < num_ips; j++) {
>> -   ip = (struct ip *)(adev->mman.discovery_bin + 
>> ip_offset);
>> -
>> -   if ((le16_to_cpu(ip->hw_id) == hw_id) && 
>> (ip->number_instance == number_instance)) {
>> -   if (major)
>> -   *major = ip->major;
>> -   if (minor)
>> -   *minor = ip->minor;
>> -   if (revision)
>> -   *revision = ip->revision;
>> -   return 0;
>> -   }
>> -   ip_offset += struct_size(ip, base_address, 
>> ip->num_base_address);
>> -   }
>> -   }
>> -
>> -   return -EINVAL;
>> -}
>> -
>>  static void amdgpu_discovery_harvest_ip(struct amdgpu_device *adev)
>>  {
>> int vcn_harvest_count = 0;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
>> index 8563dd4a7dc2..63ec6924b907 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
>> @@ -28,8 +28,6 @@
>>  #define DISCOVERY_TMR_OFFSET(64 << 10)
>>
>>  void amdgpu_discovery_fini(struct amdgpu_device *adev);
>> -int amdgpu_discovery_get_ip_version(struct amdgpu_device *adev, int hw_id, 
>> int number_instance,
>> -int *major, int *minor, int *revision);
>>  int amdgpu_discovery_set_ip_blocks(struct amdgpu_device *adev);
>>
>>  #endif /* __AMDGPU_DISCOVERY__ */
>> --
>> 2.40.0
>>



Re: [PATCH] drm/amdgpu: put MQDs in VRAM

2023-05-03 Thread Luben Tuikov
Reviewed-by: Luben Tuikov 

Regards,
Luben

On 2023-05-01 10:55, Alex Deucher wrote:
> Ping?
> 
> Alex
> 
> On Fri, Apr 28, 2023 at 11:57 AM Alex Deucher  
> wrote:
>>
>> Reduces preemption latency.
>> Only enable this for gfx10 and 11 for now
>> to avoid changing behavior on gfx 8 and 9.
>>
>> v2: move MES MQDs into VRAM as well (YuBiao)
>> v3: enable on gfx10, 11 only (Alex)
>> v4: minor style changes, document why gfx10/11 only (Alex)
>>
>> Signed-off-by: Alex Deucher 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 9 +++--
>>  drivers/gpu/drm/amd/amdgpu/mes_v10_1.c  | 1 +
>>  drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  | 1 +
>>  3 files changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> index 90f5d302d5f3..b91be56ba773 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> @@ -382,6 +382,11 @@ int amdgpu_gfx_mqd_sw_init(struct amdgpu_device *adev,
>> int r, i, j;
>> struct amdgpu_kiq *kiq = &adev->gfx.kiq[xcc_id];
>> struct amdgpu_ring *ring = &kiq->ring;
>> +   u32 domain = AMDGPU_GEM_DOMAIN_GTT;
>> +
>> +   /* Only enable on gfx10 and 11 for now to avoid changing behavior on 
>> older chips */
>> +   if (adev->ip_versions[GC_HWIP][0] >= IP_VERSION(10, 0, 0))
>> +   domain |= AMDGPU_GEM_DOMAIN_VRAM;
>>
>> /* create MQD for KIQ */
>> if (!adev->enable_mes_kiq && !ring->mqd_obj) {
>> @@ -413,7 +418,7 @@ int amdgpu_gfx_mqd_sw_init(struct amdgpu_device *adev,
>> ring = &adev->gfx.gfx_ring[i];
>> if (!ring->mqd_obj) {
>> r = amdgpu_bo_create_kernel(adev, mqd_size, PAGE_SIZE,
>> -   AMDGPU_GEM_DOMAIN_GTT, &ring->mqd_obj,
>> +   domain, &ring->mqd_obj,
>> &ring->mqd_gpu_addr, &ring->mqd_ptr);
>> if (r) {
>> dev_warn(adev->dev, "failed to 
>> create ring mqd bo (%d)", r);
>> @@ -435,7 +440,7 @@ int amdgpu_gfx_mqd_sw_init(struct amdgpu_device *adev,
>> ring = &adev->gfx.compute_ring[j];
>> if (!ring->mqd_obj) {
>> r = amdgpu_bo_create_kernel(adev, mqd_size, PAGE_SIZE,
>> -   AMDGPU_GEM_DOMAIN_GTT, &ring->mqd_obj,
>> +   domain, &ring->mqd_obj,
>> &ring->mqd_gpu_addr, &ring->mqd_ptr);
>> if (r) {
>> dev_warn(adev->dev, "failed to create ring 
>> mqd bo (%d)", r);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c 
>> b/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c
>> index 0599f8a6813e..4560476c7c31 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c
>> @@ -901,6 +901,7 @@ static int mes_v10_1_mqd_sw_init(struct amdgpu_device 
>> *adev,
>> return 0;
>>
>> r = amdgpu_bo_create_kernel(adev, mqd_size, PAGE_SIZE,
>> +   AMDGPU_GEM_DOMAIN_VRAM |
>> AMDGPU_GEM_DOMAIN_GTT, &ring->mqd_obj,
>> &ring->mqd_gpu_addr, &ring->mqd_ptr);
>> if (r) {
>> diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c 
>> b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
>> index e853bcb892fc..3adb450eec07 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
>> @@ -999,6 +999,7 @@ static int mes_v11_0_mqd_sw_init(struct amdgpu_device 
>> *adev,
>> return 0;
>>
>> r = amdgpu_bo_create_kernel(adev, mqd_size, PAGE_SIZE,
>> +   AMDGPU_GEM_DOMAIN_VRAM |
>> AMDGPU_GEM_DOMAIN_GTT, &ring->mqd_obj,
>> &ring->mqd_gpu_addr, &ring->mqd_ptr);
>> if (r) {
>> --
>> 2.40.0
>>



Re: [PATCH 1/8] drm/scheduler: properly forward fence errors

2023-04-20 Thread Luben Tuikov
Hi Christian,

Thanks for working on this.

Series is,
Reviewed-by: Luben Tuikov 

Regards,
Luben

On 2023-04-20 07:57, Christian König wrote:
> When a hw fence is signaled with an error properly forward that to the
> finished fence.
> 
> Signed-off-by: Christian König 
> ---
>  drivers/gpu/drm/scheduler/sched_entity.c |  4 +---
>  drivers/gpu/drm/scheduler/sched_fence.c  |  4 +++-
>  drivers/gpu/drm/scheduler/sched_main.c   | 18 --
>  include/drm/gpu_scheduler.h  |  2 +-
>  4 files changed, 13 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
> b/drivers/gpu/drm/scheduler/sched_entity.c
> index 15d04a0ec623..eaf71fe15ed3 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -144,7 +144,7 @@ static void drm_sched_entity_kill_jobs_work(struct 
> work_struct *wrk)
>  {
>   struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
>  
> - drm_sched_fence_finished(job->s_fence);
> + drm_sched_fence_finished(job->s_fence, -ESRCH);
>   WARN_ON(job->s_fence->parent);
>   job->sched->ops->free_job(job);
>  }
> @@ -195,8 +195,6 @@ static void drm_sched_entity_kill(struct drm_sched_entity 
> *entity)
>   while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
>   struct drm_sched_fence *s_fence = job->s_fence;
>  
> - dma_fence_set_error(&s_fence->finished, -ESRCH);
> -
>   dma_fence_get(&s_fence->finished);
>   if (!prev || dma_fence_add_callback(prev, &job->finish_cb,
>  drm_sched_entity_kill_jobs_cb))
> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c 
> b/drivers/gpu/drm/scheduler/sched_fence.c
> index 7fd869520ef2..1a6bea98c5cc 100644
> --- a/drivers/gpu/drm/scheduler/sched_fence.c
> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
> @@ -53,8 +53,10 @@ void drm_sched_fence_scheduled(struct drm_sched_fence 
> *fence)
>   dma_fence_signal(&fence->scheduled);
>  }
>  
> -void drm_sched_fence_finished(struct drm_sched_fence *fence)
> +void drm_sched_fence_finished(struct drm_sched_fence *fence, int result)
>  {
> + if (result)
> + dma_fence_set_error(&fence->finished, result);
>   dma_fence_signal(&fence->finished);
>  }
>  
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index fcd4bfef7415..649fac2e1ccb 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -257,7 +257,7 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>   *
>   * Finish the job's fence and wake up the worker thread.
>   */
> -static void drm_sched_job_done(struct drm_sched_job *s_job)
> +static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
>  {
>   struct drm_sched_fence *s_fence = s_job->s_fence;
>   struct drm_gpu_scheduler *sched = s_fence->sched;
> @@ -268,7 +268,7 @@ static void drm_sched_job_done(struct drm_sched_job 
> *s_job)
>   trace_drm_sched_process_job(s_fence);
>  
>   dma_fence_get(&s_fence->finished);
> - drm_sched_fence_finished(s_fence);
> + drm_sched_fence_finished(s_fence, result);
>   dma_fence_put(&s_fence->finished);
>   wake_up_interruptible(&sched->wake_up_worker);
>  }
> @@ -282,7 +282,7 @@ static void drm_sched_job_done_cb(struct dma_fence *f, 
> struct dma_fence_cb *cb)
>  {
>   struct drm_sched_job *s_job = container_of(cb, struct drm_sched_job, 
> cb);
>  
> - drm_sched_job_done(s_job);
> + drm_sched_job_done(s_job, f->error);
>  }
>  
>  /**
> @@ -533,12 +533,12 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, 
> bool full_recovery)
>   r = dma_fence_add_callback(fence, &s_job->cb,
>  drm_sched_job_done_cb);
>   if (r == -ENOENT)
> - drm_sched_job_done(s_job);
> + drm_sched_job_done(s_job, fence->error);
>   else if (r)
>   DRM_DEV_ERROR(sched->dev, "fence add callback 
> failed (%d)\n",
> r);
>   } else
> - drm_sched_job_done(s_job);
> + drm_sched_job_done(s_job, 0);
>   }
>  
>   if (full_recovery) {
> @@ -1010,15 +1010,13 @@ static int drm_sched_main(void *param)
>   r = dma_fence_add_callback(fence, &s_job->cb,
> 

Re: [PATCH] drm/sched: Check scheduler ready before calling timeout handling

2023-04-11 Thread Luben Tuikov
On 2023-04-11 17:39, Alex Deucher wrote:
> On Thu, Apr 6, 2023 at 4:01 PM Luben Tuikov  wrote:
>>
>> From: Vitaly Prosyak 
>>
>> During an IGT GPU reset test we see the following oops,
>>
>> [  +0.03] [ cut here ]
>> [  +0.00] WARNING: CPU: 9 PID: 0 at kernel/workqueue.c:1656 
>> __queue_delayed_work+0x6d/0xa0
>> [  +0.04] Modules linked in: iptable_filter bpfilter amdgpu(OE) 
>> nls_iso8859_1 snd_hda_codec_realtek snd_hda_codec_generic intel_rapl_msr 
>> ledtrig_audio snd_hda_codec_hdmi intel_rapl_common snd_hda_intel 
>> edac_mce_amd snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core 
>> iommu_v2 gpu_sched(OE) kvm_amd drm_buddy snd_hwdep kvm video drm_ttm_helper 
>> snd_pcm ttm snd_seq_midi drm_display_helper snd_seq_midi_event snd_rawmidi 
>> cec crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 snd_seq aesni_intel 
>> rc_core crypto_simd cryptd binfmt_misc drm_kms_helper rapl snd_seq_device 
>> input_leds joydev snd_timer i2c_algo_bit syscopyarea snd ccp sysfillrect 
>> sysimgblt wmi_bmof k10temp soundcore mac_hid sch_fq_codel msr parport_pc 
>> ppdev drm lp parport ramoops reed_solomon pstore_blk pstore_zone efi_pstore 
>> ip_tables x_tables autofs4 hid_generic usbhid hid r8169 ahci xhci_pci 
>> gpio_amdpt realtek i2c_piix4 wmi crc32_pclmul xhci_pci_renesas libahci 
>> gpio_generic
>> [  +0.70] CPU: 9 PID: 0 Comm: swapper/9 Tainted: GW OE  
>> 6.1.11+ #2
>> [  +0.03] Hardware name: Gigabyte Technology Co., Ltd. AB350-Gaming 
>> 3/AB350-Gaming 3-CF, BIOS F7 06/16/2017
>> [  +0.01] RIP: 0010:__queue_delayed_work+0x6d/0xa0
>> [  +0.03] Code: 7a 50 48 01 c1 48 89 4a 30 81 ff 00 20 00 00 75 38 4c 89 
>> cf e8 64 3e 0a 00 5d e9 1e c5 11 01 e8 99 f7 ff ff 5d e9 13 c5 11 01 <0f> 0b 
>> eb c1 0f 0b 48 81 7a 38 70 5c 0e 81 74 9f 0f 0b 48 8b 42 28
>> [  +0.02] RSP: 0018:c9398d60 EFLAGS: 00010007
>> [  +0.02] RAX: 88810d589c60 RBX:  RCX: 
>> 
>> [  +0.02] RDX: 88810d589c58 RSI:  RDI: 
>> 2000
>> [  +0.01] RBP: c9398d60 R08:  R09: 
>> 88810d589c78
>> [  +0.02] R10: 72705f305f39765f R11: 7866673a6d72645b R12: 
>> 88810d589c58
>> [  +0.01] R13: 2000 R14:  R15: 
>> 
>> [  +0.02] FS:  () GS:8887fee4() 
>> knlGS:
>> [  +0.01] CS:  0010 DS:  ES:  CR0: 80050033
>> [  +0.02] CR2: 5562c4797fa0 CR3: 000110da CR4: 
>> 003506e0
>> [  +0.02] Call Trace:
>> [  +0.01]  
>> [  +0.01]  mod_delayed_work_on+0x5e/0xa0
>> [  +0.04]  drm_sched_fault+0x23/0x30 [gpu_sched]
>> [  +0.07]  gfx_v9_0_fault.isra.0+0xa6/0xd0 [amdgpu]
>> [  +0.000258]  gfx_v9_0_priv_reg_irq+0x29/0x40 [amdgpu]
>> [  +0.000254]  amdgpu_irq_dispatch+0x1ac/0x2b0 [amdgpu]
>> [  +0.000243]  amdgpu_ih_process+0x89/0x130 [amdgpu]
>> [  +0.000245]  amdgpu_irq_handler+0x24/0x60 [amdgpu]
>> [  +0.000165]  __handle_irq_event_percpu+0x4f/0x1a0
>> [  +0.03]  handle_irq_event_percpu+0x15/0x50
>> [  +0.01]  handle_irq_event+0x39/0x60
>> [  +0.02]  handle_edge_irq+0xa8/0x250
>> [  +0.03]  __common_interrupt+0x7b/0x150
>> [  +0.02]  common_interrupt+0xc1/0xe0
>> [  +0.03]  
>> [  +0.00]  
>> [  +0.01]  asm_common_interrupt+0x27/0x40
>> [  +0.02] RIP: 0010:native_safe_halt+0xb/0x10
>> [  +0.03] Code: 46 ff ff ff cc cc cc cc cc cc cc cc cc cc cc eb 07 0f 00 
>> 2d 69 f2 5e 00 f4 e9 f1 3b 3e 00 90 eb 07 0f 00 2d 59 f2 5e 00 fb f4  e0 
>> 3b 3e 00 0f 1f 44 00 00 55 48 89 e5 53 e8 b1 d4 fe ff 66 90
>> [  +0.02] RSP: 0018:c918fdc8 EFLAGS: 0246
>> [  +0.02] RAX: 4000 RBX: 0002e5a8 RCX: 
>> 001f
>> [  +0.01] RDX: 0001 RSI: 888101298800 RDI: 
>> 888101298864
>> [  +0.01] RBP: c918fdd0 R08: 00527f64bd8b R09: 
>> 0001dc90
>> [  +0.01] R10: 0001dc90 R11: 0003 R12: 
>> 0001
>> [  +0.01] R13: 888101298864 R14: 832d9e20 R15: 
>> 888193aa8c00
>> [  +0.03]  ? acpi_idle_do_entry+0x5e/0x70
>> [  +0.02]  acpi_idle_enter+0xd1/0x160
>> [  +0.03]  cpuidle_enter_state+0x9a/0x6e0
>> [  +0.03]  cpuidle_enter+0x2e/0x50
>> [  +0.03]  call_cpuidle+0x23/0x50
>> [  +0.02]  do_idle+0x1de/0x260
>> [  +0.02]  cpu_s

Re: [PATCH] drm/amdgpu: refine get gpu clock counter method

2023-04-07 Thread Luben Tuikov
Acked-by: Luben Tuikov 

Regards,
Luben

On 2023-04-06 06:13, Tong Liu01 wrote:
> [why]
> regGOLDEN_TSC_COUNT_LOWER/regGOLDEN_TSC_COUNT_UPPER are protected and
> inaccessible under sriov.
> The clock counter high bit may update during reading process.
> 
> [How]
> Replace regGOLDEN_TSC_COUNT_LOWER/regGOLDEN_TSC_COUNT_UPPER with
> regCP_MES_MTIME_LO/regCP_MES_MTIME_HI to get gpu clock under sriov.
> Refine get gpu clock counter method to make the result more precise.
> 
> Signed-off-by: Tong Liu01 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 17 +++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> index ecf8ceb53311..107c487c0c37 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> @@ -4671,11 +4671,24 @@ static int gfx_v11_0_post_soft_reset(void *handle)
>  static uint64_t gfx_v11_0_get_gpu_clock_counter(struct amdgpu_device *adev)
>  {
>   uint64_t clock;
> + uint64_t clock_counter_lo, clock_counter_hi_pre, clock_counter_hi_after;
>  
>   amdgpu_gfx_off_ctrl(adev, false);
>   mutex_lock(&adev->gfx.gpu_clock_mutex);
> - clock = (uint64_t)RREG32_SOC15(SMUIO, 0, regGOLDEN_TSC_COUNT_LOWER) |
> - ((uint64_t)RREG32_SOC15(SMUIO, 0, regGOLDEN_TSC_COUNT_UPPER) << 
> 32ULL);
> + if (amdgpu_sriov_vf(adev)) {
> + clock_counter_hi_pre = (uint64_t)RREG32_SOC15(GC, 0, 
> regCP_MES_MTIME_HI);
> + clock_counter_lo = (uint64_t)RREG32_SOC15(GC, 0, 
> regCP_MES_MTIME_LO);
> + clock_counter_hi_after = (uint64_t)RREG32_SOC15(GC, 0, 
> regCP_MES_MTIME_HI);
> + if (clock_counter_hi_pre != clock_counter_hi_after)
> + clock_counter_lo = (uint64_t)RREG32_SOC15(GC, 0, 
> regCP_MES_MTIME_LO);
> + } else {
> + clock_counter_hi_pre = (uint64_t)RREG32_SOC15(SMUIO, 0, 
> regGOLDEN_TSC_COUNT_UPPER);
> + clock_counter_lo = (uint64_t)RREG32_SOC15(SMUIO, 0, 
> regGOLDEN_TSC_COUNT_LOWER);
> + clock_counter_hi_after = (uint64_t)RREG32_SOC15(SMUIO, 0, 
> regGOLDEN_TSC_COUNT_UPPER);
> + if (clock_counter_hi_pre != clock_counter_hi_after)
> + clock_counter_lo = (uint64_t)RREG32_SOC15(SMUIO, 0, 
> regGOLDEN_TSC_COUNT_LOWER);
> + }
> + clock = clock_counter_lo | (clock_counter_hi_after << 32ULL);
>   mutex_unlock(&adev->gfx.gpu_clock_mutex);
>   amdgpu_gfx_off_ctrl(adev, true);
>   return clock;
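The sriov and non-sriov branches in the patch above both follow the standard split-register pattern for sampling a 64-bit counter exposed as two 32-bit halves: read HI, then LO, then HI again; if the two HI reads differ, LO wrapped between the reads and must be re-read before combining with the second HI value. A minimal sketch of the idea in plain C (a simulated counter, not the amdgpu register API; all names here are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Simulated 64-bit counter exposed as two 32-bit registers.
 * advance_on_read mimics the hardware ticking between accesses. */
struct split_counter {
    uint64_t value;
    uint64_t advance_on_read;
};

static uint32_t read_lo(struct split_counter *c)
{
    uint32_t lo = (uint32_t)(c->value & 0xffffffffu);
    c->value += c->advance_on_read;   /* counter keeps running */
    return lo;
}

static uint32_t read_hi(struct split_counter *c)
{
    return (uint32_t)(c->value >> 32);
}

/* HI -> LO -> HI sample; re-read LO only if HI changed in between,
 * then pair the fresh LO with the second HI read. */
static uint64_t read_counter64(struct split_counter *c)
{
    uint32_t hi_pre  = read_hi(c);
    uint32_t lo      = read_lo(c);
    uint32_t hi_post = read_hi(c);

    if (hi_pre != hi_post)
        lo = read_lo(c);   /* LO wrapped past zero; take a fresh LO */
    return ((uint64_t)hi_post << 32) | lo;
}
```

The re-read costs one extra register access only in the rare wrap window, which is why this is cheaper than holding a lock across the whole sample.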



[PATCH] drm/sched: Check scheduler ready before calling timeout handling

2023-04-06 Thread Luben Tuikov
IOS F7 06/16/2017
[  +0.012101] amdgpu :0c:00.0: amdgpu: GPU reset begin!
[  +0.005136] RIP: 0010:__queue_work+0x1f/0x4e0
[  +0.04] Code: 87 cd 11 01 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 
41 57 41 56 41 55 49 89 d5 41 54 49 89 f4 53 48 83 ec 10 89 7d d4  86 02 01 
00 00 01 0f 85 6c 03 00 00 e8 7f 36 08 00 8b 45 d4 48

For gfx_rings the schedulers may not be initialized by
amdgpu_device_init_schedulers() due to ring->no_scheduler flag being set to
true and thus the timeout_wq is NULL. As a result, since all ASICs call
drm_sched_fault() unconditionally even for schedulers which have not been
initialized, it is simpler to use the ready condition which indicates whether
the given scheduler worker thread runs and whether the timeout_wq of the reset
domain has been initialized.

Signed-off-by: Vitaly Prosyak 
Cc: Christian König 
Reviewed-by: Luben Tuikov 
Signed-off-by: Luben Tuikov 
---
 drivers/gpu/drm/scheduler/sched_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index fd22d753b4ed0c..fcd4bfef741580 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -308,7 +308,8 @@ static void drm_sched_start_timeout(struct 
drm_gpu_scheduler *sched)
  */
 void drm_sched_fault(struct drm_gpu_scheduler *sched)
 {
-   mod_delayed_work(sched->timeout_wq, >work_tdr, 0);
+   if (sched->ready)
+   mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
 }
 EXPORT_SYMBOL(drm_sched_fault);
 

base-commit: 49144cd279d047c1d572a57323df3af8e1461894
-- 
2.40.0
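The guard in the patch above is an instance of a general rule: a fault path that can fire from interrupt context must not touch deferred-work machinery that was never initialized (here, a NULL timeout_wq for rings with no_scheduler set). A minimal sketch of the shape of the fix in plain C (not the kernel workqueue API; the struct and names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for a scheduler with optional timeout handling. */
struct toy_sched {
    bool ready;        /* set only once the worker/timeout infra exists */
    void *timeout_wq;  /* NULL when the scheduler was never initialized */
    int faults_queued;
};

/* Guarded fault handler: an unguarded version would queue work on a
 * NULL workqueue and trip the __queue_delayed_work warning.
 * Returns true if timeout handling was actually scheduled. */
static bool toy_sched_fault(struct toy_sched *s)
{
    if (!s->ready)
        return false;      /* nothing to do for an uninitialized ring */
    s->faults_queued++;    /* stands in for mod_delayed_work(...) */
    return true;
}
```

Checking a single ready flag keeps the hot path cheap and avoids sprinkling per-caller NULL checks across every ASIC's fault handler.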



Re: [PATCH v3 7/9] drm/amdgpu: map usermode queue into MES

2023-04-05 Thread Luben Tuikov
On 2023-04-05 06:06, Shashank Sharma wrote:
> 
> On 04/04/2023 22:58, Luben Tuikov wrote:
>> On 2023-04-04 12:36, Shashank Sharma wrote:
>>> On 04/04/2023 18:30, Luben Tuikov wrote:
>>>> On 2023-03-29 12:04, Shashank Sharma wrote:
>>>>> From: Shashank Sharma 
>>>>>
>>>>> This patch adds new functions to map/unmap a usermode queue into
>>>>> the FW, using the MES ring. As soon as this mapping is done, the
>>>>> queue would  be considered ready to accept the workload.
>>>>>
>>>>> V1: Addressed review comments from Alex on the RFC patch series
>>>>>   - Map/Unmap should be IP specific.
>>>>> V2:
>>>>>   Addressed review comments from Christian:
>>>>>   - Fix the wptr_mc_addr calculation (moved into another patch)
>>>>>   Addressed review comments from Alex:
>>>>>   - Do not add fptrs for map/unmap
>>>>>
>>>>> V3: Integration with doorbell manager
>>>>>
>>>>> Cc: Alex Deucher 
>>>>> Cc: Christian Koenig 
>>>>> Signed-off-by: Shashank Sharma 
>>>>> ---
>>>> Just add all your Cc right here, and let git-send-email figure it out.
>>>> Between the Cc tags and the SMTP CC list, Felix is the only one missing.
>>> No, that's not how it is.
>>>
>>> You keep people cc'ed in the cover letter so that they get informed
>>> every time this patch is pushed/used on any opensource branch.
>> The cover letter is optional, and you can add Cc tags
>> into the cover letter and then git-send-email would extract those
>> (any and all) tags from the cover letter too.
>>
>> When you pick and choose whom to add in the Cc tags, and whom to
>> add to the SMTP CC list, it creates division.
> 
> 
> Exactly my point, there is no guideline on whom to add in Cc 
> cover-letter and whom to add manually, its all preference.
> 
> Now different people can have different preference, and a review comment 
> on what is your preference of what to
> 
> keep on cover letter does seem like a nitpick.

I am describing consensus. Take a look at DRM commits to see what
people do. It'd be nice if you followed that.

> 
>>
>>> People who are added manually in cc are required for this code review
>>> session.
>> No such rule exists. It is best to add all the Cc into the commit message,
>> so that it is preserved in Git history.
> I believe this is also not a rule, we are discussing preferences only. 
> It is my preference that I want to keep only Alex and Christian in Cc.
>>
>> For instance, I just randomly did "git log drivers/gpu/drm/*.[hc]" in
>> amd-staging-drm-next, and this is the first commit which came up,
>>
>> commit 097ee58f3ddf7d
>> Author: Harry Wentland 
>> Date:   Fri Jan 13 11:24:09 2023 -0500
>>
>>  drm/connector: print max_requested_bpc in state debugfs
>>  
>>  This is useful to understand the bpc defaults and
>>  support of a driver.
>>  
>>  Signed-off-by: Harry Wentland 
>>  Cc: Pekka Paalanen 
>>  Cc: Sebastian Wick 
>>  Cc: vitaly.pros...@amd.com
>>  Cc: Uma Shankar 
>>  Cc: Ville Syrjälä 
>>  Cc: Joshua Ashton 
>>  Cc: Jani Nikula 
>>  Cc: dri-de...@lists.freedesktop.org
>>  Cc: amd-gfx@lists.freedesktop.org
>>  Reviewed-By: Joshua Ashton 
>>  Link: 
>> https://patchwork.freedesktop.org/patch/msgid/20230113162428.33874-3-harry.wentl...@amd.com
>>  Signed-off-by: Alex Deucher 
>>
>> As you can see the whole Cc list and the MLs are part of the Cc tags.
>> And the rest of the commits are also good examples of how to do it.
>> (Don't worry about the Link tag--it is automatically added by tools
>> maintainers use, although some use Lore.)
>> This preserves things in Git history, and it's a good thing if we need
>> to data mine and brainstorm later on on patches, design, and so on.
> 
> No, this is not random, this is actually well planned. All of these 

I never said it is "random"--it is not, it is well planned because
everyone submitting to DRM does this--it's common practice.

> people here in Cc are either the maintainers or respective domain experts or
> 
> contributors of color management feature and keeping them in CC is about 
> how this color management feature is being carried forward, so this is
> 
> exactly aligned with my point. Do note that this is a DRM level change 
> (not AMDGPU level).

So, 

Re: [PATCH] drm/amdgpu: Fix warnings

2023-04-05 Thread Luben Tuikov
Reviewed-by: Luben Tuikov 

Regards,
Luben

On 2023-04-05 05:45, Lijo Lazar wrote:
> Fix below warning due to incompatible types in conditional operator
> 
> ../pm/swsmu/smu13/smu_v13_0_6_ppt.c:315:17: sparse: sparse: incompatible
> types in conditional expression (different base types):
> 
> Signed-off-by: Lijo Lazar 
> Reported-by: kernel test robot 
> Link: 
> https://lore.kernel.org/oe-kbuild-all/202303082135.njdx1bij-...@intel.com/
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index bbac4239ceb3..376d14de7602 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1241,7 +1241,7 @@ int emu_soc_asic_init(struct amdgpu_device *adev);
>   ((adev)->asic_funcs->flush_hdp ? (adev)->asic_funcs->flush_hdp((adev), 
> (r)) : (adev)->hdp.funcs->flush_hdp((adev), (r)))
>  #define amdgpu_asic_invalidate_hdp(adev, r) \
>   ((adev)->asic_funcs->invalidate_hdp ? 
> (adev)->asic_funcs->invalidate_hdp((adev), (r)) : \
> -  ((adev)->hdp.funcs->invalidate_hdp ? 
> (adev)->hdp.funcs->invalidate_hdp((adev), (r)) : 0))
> +  ((adev)->hdp.funcs->invalidate_hdp ? 
> (adev)->hdp.funcs->invalidate_hdp((adev), (r)) : (void)0))
>  #define amdgpu_asic_need_full_reset(adev) 
> (adev)->asic_funcs->need_full_reset((adev))
>  #define amdgpu_asic_init_doorbell_index(adev) 
> (adev)->asic_funcs->init_doorbell_index((adev))
>  #define amdgpu_asic_get_pcie_usage(adev, cnt0, cnt1) 
> ((adev)->asic_funcs->get_pcie_usage((adev), (cnt0), (cnt1)))



Re: [RFC PATCH 0/4] uapi, drm: Add and implement RLIMIT_GPUPRIO

2023-04-04 Thread Luben Tuikov
Hi!

On 2023-04-04 04:50, Christian König wrote:
> Adding a bunch of people who have been involved in this before.
> 
> Am 03.04.23 um 22:15 schrieb Joshua Ashton:
>> On 4/3/23 20:54, Christian König wrote:
>>> Am 03.04.23 um 21:40 schrieb Joshua Ashton:
 [SNIP]
 Anyway, please let me know what you think!
 Definitely open to any feedback and advice you may have. :D
>>>
>>> Well the basic problem is that higher priority queues can be used to 
>>> starve low priority queues.
>>>
>>> This starvation in turn is very very bad for memory management since 
>>> the dma_fence's the GPU scheduler deals with have very strong 
>>> restrictions.
>>>
>>> Even exposing this under CAP_SYS_NICE is questionable, so we will 
>>> most likely have to NAK this.
>>
>> This is already exposed with CAP_SYS_NICE and is relied on by SteamVR 
>> for async reprojection and Gamescope's composite path on Steam Deck.
> 
> Yeah, I know I was the one who designed that :)
> 
>>
>> Having a high priority async compute queue is really really important 
>> and advantageous for these tasks.
>>
>> The majority of usecases for something like this is going to be a 
>> compositor which does some really tiny amount of work per-frame but is 
>> incredibly latency dependent (as it depends on latching onto buffers 
>> just before vblank to do it's work)

There seems to be a dependency here. Is it possible to express this
dependency so that this work is done on vblank, then whoever needs
this, can latch onto vblank and get scheduled and completed before the vblank?

The problem generally is "We need to do some work B in order to satisfy
some condition in work A. Let's raise the ``priority'' of work B so that
if A needs it, when it needs it, it is ready." Or something to that effect.

The system would be much more responsive and run optimally, if such
dependencies are expressed directly, as opposed to trying to game
the scheduler and add more and more priorities, one on top of the other,
every so often.

It's okay to have priorities when tasks are independent and unrelated. But
when they do depend on each other directly, or indirectly (as in when memory
allocation or freeing is concerned), thus creating priority inversion,
then the best scheduler is the fair, oldest-ready-first scheduling, which
is the default GPU scheduler in DRM at the moment (for the last few months).
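The contrast between strict priority picking and the fair, oldest-ready-first policy mentioned here can be shown in a few lines: under strict priority, a constant stream of high-priority jobs starves the low-priority queue indefinitely, while picking the oldest ready job bounds everyone's wait. A toy illustration in plain C (conceptual only, not the DRM scheduler API; all names are made up):

```c
#include <assert.h>
#include <stddef.h>

struct toy_job {
    int priority;             /* larger = more important */
    unsigned long submitted;  /* monotonic submission stamp */
    int ready;
};

/* Strict priority: highest-priority ready job wins, ties broken by age.
 * A steady supply of high-priority jobs starves everything below them. */
static const struct toy_job *pick_priority(const struct toy_job *jobs, size_t n)
{
    const struct toy_job *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (!jobs[i].ready)
            continue;
        if (!best || jobs[i].priority > best->priority ||
            (jobs[i].priority == best->priority &&
             jobs[i].submitted < best->submitted))
            best = &jobs[i];
    }
    return best;
}

/* Oldest-ready-first: age alone decides, so no queue starves and the
 * maximum wait of any ready job is bounded. */
static const struct toy_job *pick_oldest(const struct toy_job *jobs, size_t n)
{
    const struct toy_job *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (!jobs[i].ready)
            continue;
        if (!best || jobs[i].submitted < best->submitted)
            best = &jobs[i];
    }
    return best;
}
```

With an old low-priority job and two newer high-priority jobs pending, the two policies pick different winners, which is exactly the starvation-vs-fairness trade-off under discussion.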

>> Starving and surpassing work on other queues is kind of the entire 
>> point. Gamescope and SteamVR do it on ACE as well so GFX work can run 
>> alongside it.

Are there no dependencies between them?

I mean if they're independent, we already have run queues with
different priorities. But if they're dependent, perhaps
we can express this explicitly so that we don't starve
other tasks/queues...

Regards,
Luben

> 
> Yes, unfortunately exactly that.
> 
> The problem is that our memory management is designed around the idea 
> that submissions to the hardware are guaranteed to finish at some point 
> in the future.
> 
> When we now have a functionality which allows to extend the amount of 
> time some work needs to finish on the hardware infinitely, then we have 
> a major problem at hand.
> 
> What we could do is to make the GPU scheduler more clever and make sure 
> that while higher priority submissions get precedence and can even 
> preempt low priority submissions we still guarantee some forward 
> progress for everybody.
> 
> Luben has been looking into a similar problem AMD internally as well, 
> maybe he has some idea here but I doubt that the solution will be simple.
> 
> Regards,
> Christian.
> 
>>
>> - Joshie ✨
>>
> 



Re: [PATCH v3 7/9] drm/amdgpu: map usermode queue into MES

2023-04-04 Thread Luben Tuikov
On 2023-04-04 12:36, Shashank Sharma wrote:
> 
> On 04/04/2023 18:30, Luben Tuikov wrote:
>> On 2023-03-29 12:04, Shashank Sharma wrote:
>>> From: Shashank Sharma 
>>>
>>> This patch adds new functions to map/unmap a usermode queue into
>>> the FW, using the MES ring. As soon as this mapping is done, the
>>> queue would  be considered ready to accept the workload.
>>>
>>> V1: Addressed review comments from Alex on the RFC patch series
>>>  - Map/Unmap should be IP specific.
>>> V2:
>>>  Addressed review comments from Christian:
>>>  - Fix the wptr_mc_addr calculation (moved into another patch)
>>>  Addressed review comments from Alex:
>>>  - Do not add fptrs for map/unmap
>>>
>>> V3: Integration with doorbell manager
>>>
>>> Cc: Alex Deucher 
>>> Cc: Christian Koenig 
>>> Signed-off-by: Shashank Sharma 
>>> ---
>> Just add all your Cc right here, and let git-send-email figure it out.
>> Between the Cc tags and the SMTP CC list, Felix is the only one missing.
> 
> No, that's not how it is.
> 
> You keep people cc'ed in the cover letter so that they get informed 
> every time this patch is pushed/used on any opensource branch.

The cover letter is optional, and you can add Cc tags
into the cover letter and then git-send-email would extract those
(any and all) tags from the cover letter too.

When you pick and choose whom to add in the Cc tags, and whom to
add to the SMTP CC list, it creates division.

> People who are added manually in cc are required for this code review 
> session.

No such rule exists. It is best to add all the Cc into the commit message,
so that it is preserved in Git history.

For instance, I just randomly did "git log drivers/gpu/drm/*.[hc]" in
amd-staging-drm-next, and this is the first commit which came up,

commit 097ee58f3ddf7d
Author: Harry Wentland 
Date:   Fri Jan 13 11:24:09 2023 -0500

drm/connector: print max_requested_bpc in state debugfs

This is useful to understand the bpc defaults and
support of a driver.

Signed-off-by: Harry Wentland 
Cc: Pekka Paalanen 
Cc: Sebastian Wick 
Cc: vitaly.pros...@amd.com
Cc: Uma Shankar 
Cc: Ville Syrjälä 
Cc: Joshua Ashton 
Cc: Jani Nikula 
Cc: dri-de...@lists.freedesktop.org
Cc: amd-gfx@lists.freedesktop.org
Reviewed-By: Joshua Ashton 
Link: 
https://patchwork.freedesktop.org/patch/msgid/20230113162428.33874-3-harry.wentl...@amd.com
Signed-off-by: Alex Deucher 

As you can see the whole Cc list and the MLs are part of the Cc tags.
And the rest of the commits are also good examples of how to do it.
(Don't worry about the Link tag--it is automatically added by tools
maintainers use, although some use Lore.)
This preserves things in Git history, and it's a good thing if we need
to data mine and brainstorm later on on patches, design, and so on.

A good tool to use is "scripts/get_maintainer.pl" which works
on a file, directory and even patch files.

I usually include everyone get_maintainer.pl prints, and on subsequent patch
revisions, also people who have previously commented on the patchset, as they
might be interested to follow up. It's a good thing to do.

Here are a couple of resources, in addition to DRM commits in the tree,
https://www.kernel.org/doc/html/v4.12/process/5.Posting.html#patch-formatting-and-changelogs
https://www.kernel.org/doc/html/v4.12/process/submitting-patches.html#the-canonical-patch-format
(The second link includes links to more resources on good patch writing.)

Hope this helps.

Regards,
Luben


> 
> - Shashank
> 
>> Regards,
>> Luben
>>
>>>   .../drm/amd/amdgpu/amdgpu_userqueue_gfx_v11.c | 70 +++
>>>   1 file changed, 70 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue_gfx_v11.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue_gfx_v11.c
>>> index 39e90ea32fcb..1627641a4a4e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue_gfx_v11.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue_gfx_v11.c
>>> @@ -23,12 +23,73 @@
>>>   #include "amdgpu.h"
>>>   #include "amdgpu_userqueue.h"
>>>   #include "v11_structs.h"
>>> +#include "amdgpu_mes.h"
>>>   
>>>   #define AMDGPU_USERQ_PROC_CTX_SZ PAGE_SIZE
>>>   #define AMDGPU_USERQ_GANG_CTX_SZ PAGE_SIZE
>>>   #define AMDGPU_USERQ_FW_CTX_SZ PAGE_SIZE
>>>   #define AMDGPU_USERQ_GDS_CTX_SZ PAGE_SIZE
>>>   
>>> +static int
>>> +amdgpu_userq_gfx_v11_map(struct amdgpu_userq_mgr *uq_mgr,
>>> + struct amd

Re: [PATCH v3 7/9] drm/amdgpu: map usermode queue into MES

2023-04-04 Thread Luben Tuikov
On 2023-03-29 12:04, Shashank Sharma wrote:
> From: Shashank Sharma 
> 
> This patch adds new functions to map/unmap a usermode queue into
> the FW, using the MES ring. As soon as this mapping is done, the
> queue would  be considered ready to accept the workload.
> 
> V1: Addressed review comments from Alex on the RFC patch series
> - Map/Unmap should be IP specific.
> V2:
> Addressed review comments from Christian:
> - Fix the wptr_mc_addr calculation (moved into another patch)
> Addressed review comments from Alex:
> - Do not add fptrs for map/unmap
> 
> V3: Integration with doorbell manager
> 
> Cc: Alex Deucher 
> Cc: Christian Koenig 
> Signed-off-by: Shashank Sharma 
> ---

Just add all your Cc right here, and let git-send-email figure it out.
Between the Cc tags and the SMTP CC list, Felix is the only one missing.

Regards,
Luben

>  .../drm/amd/amdgpu/amdgpu_userqueue_gfx_v11.c | 70 +++
>  1 file changed, 70 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue_gfx_v11.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue_gfx_v11.c
> index 39e90ea32fcb..1627641a4a4e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue_gfx_v11.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue_gfx_v11.c
> @@ -23,12 +23,73 @@
>  #include "amdgpu.h"
>  #include "amdgpu_userqueue.h"
>  #include "v11_structs.h"
> +#include "amdgpu_mes.h"
>  
>  #define AMDGPU_USERQ_PROC_CTX_SZ PAGE_SIZE
>  #define AMDGPU_USERQ_GANG_CTX_SZ PAGE_SIZE
>  #define AMDGPU_USERQ_FW_CTX_SZ PAGE_SIZE
>  #define AMDGPU_USERQ_GDS_CTX_SZ PAGE_SIZE
>  
> +static int
> +amdgpu_userq_gfx_v11_map(struct amdgpu_userq_mgr *uq_mgr,
> + struct amdgpu_usermode_queue *queue)
> +{
> +struct amdgpu_device *adev = uq_mgr->adev;
> +struct mes_add_queue_input queue_input;
> +int r;
> +
> +memset(&queue_input, 0x0, sizeof(struct mes_add_queue_input));
> +
> +queue_input.process_va_start = 0;
> +queue_input.process_va_end = (adev->vm_manager.max_pfn - 1) << 
> AMDGPU_GPU_PAGE_SHIFT;
> +queue_input.process_quantum = 10; /* 10ms */
> +queue_input.gang_quantum = 1; /* 1ms */
> +queue_input.paging = false;
> +
> +queue_input.gang_context_addr = queue->gang_ctx_gpu_addr;
> +queue_input.process_context_addr = queue->proc_ctx_gpu_addr;
> +queue_input.inprocess_gang_priority = AMDGPU_MES_PRIORITY_LEVEL_NORMAL;
> +queue_input.gang_global_priority_level = 
> AMDGPU_MES_PRIORITY_LEVEL_NORMAL;
> +
> +queue_input.process_id = queue->vm->pasid;
> +queue_input.queue_type = queue->queue_type;
> +queue_input.mqd_addr = queue->mqd.gpu_addr;
> +queue_input.wptr_addr = queue->userq_prop.wptr_gpu_addr;
> +queue_input.queue_size = queue->userq_prop.queue_size >> 2;
> +queue_input.doorbell_offset = queue->userq_prop.doorbell_index;
> +queue_input.page_table_base_addr = 
> amdgpu_gmc_pd_addr(queue->vm->root.bo);
> +
> +amdgpu_mes_lock(&adev->mes);
> +r = adev->mes.funcs->add_hw_queue(&adev->mes, &queue_input);
> +amdgpu_mes_unlock(&adev->mes);
> +if (r) {
> +DRM_ERROR("Failed to map queue in HW, err (%d)\n", r);
> +return r;
> +}
> +
> +DRM_DEBUG_DRIVER("Queue %d mapped successfully\n", queue->queue_id);
> +return 0;
> +}
> +
> +static void
> +amdgpu_userq_gfx_v11_unmap(struct amdgpu_userq_mgr *uq_mgr,
> +   struct amdgpu_usermode_queue *queue)
> +{
> +struct amdgpu_device *adev = uq_mgr->adev;
> +struct mes_remove_queue_input queue_input;
> +int r;
> +
> +memset(&queue_input, 0x0, sizeof(struct mes_remove_queue_input));
> +queue_input.doorbell_offset = queue->userq_prop.doorbell_index;
> +queue_input.gang_context_addr = queue->gang_ctx_gpu_addr;
> +
> +amdgpu_mes_lock(&adev->mes);
> +r = adev->mes.funcs->remove_hw_queue(&adev->mes, &queue_input);
> +amdgpu_mes_unlock(&adev->mes);
> +if (r)
> +DRM_ERROR("Failed to unmap queue in HW, err (%d)\n", r);
> +}
> +
>  static int amdgpu_userq_gfx_v11_create_ctx_space(struct amdgpu_userq_mgr 
> *uq_mgr,
>   struct 
> amdgpu_usermode_queue *queue)
>  {
> @@ -129,6 +190,14 @@ amdgpu_userq_gfx_v11_mqd_create(struct amdgpu_userq_mgr 
> *uq_mgr, struct amdgpu_u
>  
>  amdgpu_userq_set_ctx_space(uq_mgr, queue);
>  amdgpu_bo_unreserve(mqd->obj);
> +
> +/* Map the queue in HW using MES ring */
> +r = amdgpu_userq_gfx_v11_map(uq_mgr, queue);
> +if (r) {
> +DRM_ERROR("Failed to map userqueue (%d)\n", r);
> +goto free_ctx;
> +}
> +
>  DRM_DEBUG_DRIVER("MQD for queue %d created\n", queue->queue_id);
>  return 0;
>  
> @@ -147,6 +216,7 @@ amdgpu_userq_gfx_v11_mqd_destroy(struct amdgpu_userq_mgr 
> *uq_mgr, struct amdgpu_
>  {
>  struct amdgpu_userq_ctx_space *mqd = >mqd;
>  
> +amdgpu_userq_gfx_v11_unmap(uq_mgr, queue);
>  amdgpu_userq_gfx_v11_destroy_ctx_space(uq_mgr, queue);
>  amdgpu_bo_free_kernel(>obj,
> 

Re: [PATCH v3 5/9] drm/amdgpu: create context space for usermode queue

2023-04-04 Thread Luben Tuikov
On 2023-03-29 12:04, Shashank Sharma wrote:
> From: Shashank Sharma 
> 
> The FW expects us to allocate atleast one page as context space to
> process gang, process, shadow, GDS and FW  related work. This patch
> creates a joint object for the same, and calculates GPU space offsets
> for each of these spaces.
> 
> V1: Addressed review comments on RFC patch:
> Alex: Make this function IP specific
> 
> V2: Addressed review comments from Christian
> - Allocate only one object for total FW space, and calculate
>   offsets for each of these objects.
> 
> V3: Integration with doorbell manager
> 
> Cc: Alex Deucher 
> Cc: Christian Koenig 
> Signed-off-by: Shashank Sharma 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c |  1 +
>  .../drm/amd/amdgpu/amdgpu_userqueue_gfx_v11.c | 60 ++-
>  .../gpu/drm/amd/include/amdgpu_userqueue.h|  7 +++
>  3 files changed, 66 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
> index 052c2c1e8aed..5672efcbcffc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
> @@ -71,6 +71,7 @@ static int amdgpu_userqueue_create(struct drm_file *filp, 
> union drm_amdgpu_userq
>  queue->userq_prop.queue_size = mqd_in->queue_size;
>  
>  queue->doorbell_handle = mqd_in->doorbell_handle;
> +queue->shadow_ctx_gpu_addr = mqd_in->shadow_va;
>  queue->queue_type = mqd_in->ip_type;
>  queue->flags = mqd_in->flags;
>  queue->vm = >vm;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue_gfx_v11.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue_gfx_v11.c
> index 12e1a785b65a..52de96727f98 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue_gfx_v11.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue_gfx_v11.c
> @@ -23,6 +23,51 @@
>  #include "amdgpu.h"
>  #include "amdgpu_userqueue.h"
>  
> +#define AMDGPU_USERQ_PROC_CTX_SZ PAGE_SIZE
> +#define AMDGPU_USERQ_GANG_CTX_SZ PAGE_SIZE
> +#define AMDGPU_USERQ_FW_CTX_SZ PAGE_SIZE
> +#define AMDGPU_USERQ_GDS_CTX_SZ PAGE_SIZE
> +

Leaving only a single space between the macro name and its value
makes it hard to read. Please align the value columns
and leave at least 3 spaces--this would make it readable.
For instance,

#define AMDGPU_USERQ_PROC_CTX_SZ   PAGE_SIZE
#define AMDGPU_USERQ_GANG_CTX_SZ   PAGE_SIZE
#define AMDGPU_USERQ_FW_CTX_SZ PAGE_SIZE
#define AMDGPU_USERQ_GDS_CTX_SZPAGE_SIZE

Regards,
Luben

> +static int amdgpu_userq_gfx_v11_create_ctx_space(struct amdgpu_userq_mgr 
> *uq_mgr,
> + struct 
> amdgpu_usermode_queue *queue)
> +{
> +struct amdgpu_device *adev = uq_mgr->adev;
> +struct amdgpu_userq_ctx_space *ctx = >fw_space;
> +int r, size;
> +
> +/*
> + * The FW expects atleast one page space allocated for
> + * process ctx, gang ctx, gds ctx, fw ctx and shadow ctx each.
> + */
> +size = AMDGPU_USERQ_PROC_CTX_SZ + AMDGPU_USERQ_GANG_CTX_SZ +
> +   AMDGPU_USERQ_FW_CTX_SZ + AMDGPU_USERQ_GDS_CTX_SZ;
> +r = amdgpu_bo_create_kernel(adev, size, PAGE_SIZE,
> +AMDGPU_GEM_DOMAIN_GTT,
> +>obj,
> +>gpu_addr,
> +>cpu_ptr);
> +if (r) {
> +DRM_ERROR("Failed to allocate ctx space bo for userqueue, err:%d\n", 
> r);
> +return r;
> +}
> +
> +queue->proc_ctx_gpu_addr = ctx->gpu_addr;
> +queue->gang_ctx_gpu_addr = queue->proc_ctx_gpu_addr + 
> AMDGPU_USERQ_PROC_CTX_SZ;
> +queue->fw_ctx_gpu_addr = queue->gang_ctx_gpu_addr + 
> AMDGPU_USERQ_GANG_CTX_SZ;
> +queue->gds_ctx_gpu_addr = queue->fw_ctx_gpu_addr + 
> AMDGPU_USERQ_FW_CTX_SZ;
> +return 0;
> +}
> +
> +static void amdgpu_userq_gfx_v11_destroy_ctx_space(struct amdgpu_userq_mgr 
> *uq_mgr,
> +   struct 
> amdgpu_usermode_queue *queue)
> +{
> +struct amdgpu_userq_ctx_space *ctx = >fw_space;
> +
> +amdgpu_bo_free_kernel(>obj,
> +  >gpu_addr,
> +  >cpu_ptr);
> +}
> +
>  static int
>  amdgpu_userq_gfx_v11_mqd_create(struct amdgpu_userq_mgr *uq_mgr, struct 
> amdgpu_usermode_queue *queue)
>  {
> @@ -43,10 +88,17 @@ amdgpu_userq_gfx_v11_mqd_create(struct amdgpu_userq_mgr 
> *uq_mgr, struct amdgpu_u
>  }
>  
>  memset(mqd->cpu_ptr, 0, size);
> +
> +r = amdgpu_userq_gfx_v11_create_ctx_space(uq_mgr, queue);
> +if (r) {
> +DRM_ERROR("Failed to create CTX space for userqueue (%d)\n", r);
> +goto free_mqd;
> +}
> +
>  r = amdgpu_bo_reserve(mqd->obj, false);
>  if (unlikely(r != 0)) {
>  DRM_ERROR("Failed to reserve mqd for userqueue (%d)", r);
> -goto free_mqd;
> +goto free_ctx;
>  }
>  
>  queue->userq_prop.use_doorbell = true;
> @@ -55,12 +107,15 @@ 

Re: [PATCH v3 4/9] drm/amdgpu: create GFX-gen11 MQD for userqueue

2023-04-04 Thread Luben Tuikov
On 2023-03-29 12:04, Shashank Sharma wrote:
> From: Shashank Sharma 
> 
> A Memory queue descriptor (MQD) of a userqueue defines it in the harware's
> context. As MQD format can vary between different graphics IPs, we need gfx
> GEN specific handlers to create MQDs.
> 
> This patch:
> - Introduces MQD hander functions for the usermode queues.
> - Adds new functions to create and destroy MQD for GFX-GEN-11-IP
> 
> V1: Worked on review comments from Alex:
> - Make MQD functions GEN and IP specific
> 
> V2: Worked on review comments from Alex:
> - Reuse the existing adev->mqd[ip] for MQD creation
> - Formatting and arrangement of code
> 
> V3:
> - Integration with doorbell manager
> 
> Cc: Alex Deucher 
> Cc: Christian Koenig 
> 
> Signed-off-by: Shashank Sharma 
> Signed-off-by: Arvind Yadav 
> ---

Don't break up the Cc tag list and the Sob tag list.

Check out the following resources:
https://www.kernel.org/doc/html/v4.12/process/5.Posting.html#patch-formatting-and-changelogs
https://www.kernel.org/doc/html/v4.12/process/submitting-patches.html#the-canonical-patch-format
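
For instance, a single contiguous tag block at the end of the commit
message would look like the sketch below (addresses elided here, as the
archive strips them):

```text
Cc: Alex Deucher <...>
Cc: Christian Koenig <...>
Signed-off-by: Shashank Sharma <...>
Signed-off-by: Arvind Yadav <...>
```

git-send-email then picks all of these up for the SMTP CC list automatically.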


>  drivers/gpu/drm/amd/amdgpu/Makefile   |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c | 21 +
>  .../drm/amd/amdgpu/amdgpu_userqueue_gfx_v11.c | 84 +++
>  .../gpu/drm/amd/include/amdgpu_userqueue.h|  7 ++
>  4 files changed, 113 insertions(+)
>  create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue_gfx_v11.c
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile 
> b/drivers/gpu/drm/amd/amdgpu/Makefile
> index 2d90ba618e5d..2cc7897de7e6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/Makefile
> +++ b/drivers/gpu/drm/amd/amdgpu/Makefile
> @@ -212,6 +212,7 @@ amdgpu-y += amdgpu_amdkfd.o
>  
>  # add usermode queue
>  amdgpu-y += amdgpu_userqueue.o
> +amdgpu-y += amdgpu_userqueue_gfx_v11.o
>  
>  ifneq ($(CONFIG_HSA_AMD),)
>  AMDKFD_PATH := ../amdkfd
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
> index 353f57c5a772..052c2c1e8aed 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
> @@ -81,6 +81,12 @@ static int amdgpu_userqueue_create(struct drm_file *filp, 
> union drm_amdgpu_userq
>  goto free_queue;
>  }
>  
> +r = uq_mgr->userq_funcs[queue->queue_type]->mqd_create(uq_mgr, queue);
> +if (r) {
> +DRM_ERROR("Failed to create/map userqueue MQD\n");
> +goto free_queue;
> +}
> +
>  args->out.queue_id = queue->queue_id;
>  args->out.flags = 0;
>  mutex_unlock(_mgr->userq_mutex);
> @@ -105,6 +111,7 @@ static void amdgpu_userqueue_destroy(struct drm_file 
> *filp, int queue_id)
>  }
>  
>  mutex_lock(_mgr->userq_mutex);
> +uq_mgr->userq_funcs[queue->queue_type]->mqd_destroy(uq_mgr, queue);
>  amdgpu_userqueue_free_index(uq_mgr, queue->queue_id);
>  mutex_unlock(_mgr->userq_mutex);
>  kfree(queue);
> @@ -135,6 +142,19 @@ int amdgpu_userq_ioctl(struct drm_device *dev, void 
> *data,
>  return r;
>  }
>  
> +extern const struct amdgpu_userq_funcs userq_gfx_v11_funcs;
> +
> +static void
> +amdgpu_userqueue_setup_ip_funcs(struct amdgpu_userq_mgr *uq_mgr)
> +{
> +int maj;
> +struct amdgpu_device *adev = uq_mgr->adev;
> +uint32_t version = adev->ip_versions[GC_HWIP][0];
> +
> +maj = IP_VERSION_MAJ(version);
> +if (maj == 11)
> +uq_mgr->userq_funcs[AMDGPU_HW_IP_GFX] = _gfx_v11_funcs;
> +}
>  
>  int amdgpu_userq_mgr_init(struct amdgpu_userq_mgr *userq_mgr, struct 
> amdgpu_device *adev)
>  {
> @@ -142,6 +162,7 @@ int amdgpu_userq_mgr_init(struct amdgpu_userq_mgr 
> *userq_mgr, struct amdgpu_devi
>  idr_init_base(_mgr->userq_idr, 1);
>  userq_mgr->adev = adev;
>  
> +amdgpu_userqueue_setup_ip_funcs(userq_mgr);
>  return 0;
>  }
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue_gfx_v11.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue_gfx_v11.c
> new file mode 100644
> index ..12e1a785b65a
> --- /dev/null
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue_gfx_v11.c
> @@ -0,0 +1,84 @@
> +/*
> + * Copyright 2022 Advanced Micro Devices, Inc.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * 

Re: [PATCH 01/16] drm/amdgpu: rename num_doorbells

2023-04-04 Thread Luben Tuikov
On 2023-03-29 11:47, Shashank Sharma wrote:
> From: Shashank Sharma 
> 
> Rename doorbell.num_doorbells to doorbell.num_kernel_doorbells to
> make it more readable.
> 
> Cc: Alex Deucher 
> Cc: Christian Koenig 
> Signed-off-by: Shashank Sharma 
> ---

Is there any reason you break up the Cc list between the Cc tags in the
commit message and the SMTP CC list?

Just do either/or, but it is preferable to add all the Cc's into the Cc tags
of the commit message, and then let git-send-email fill in the SMTP CC list,
using just the MLs in the --to= argument. (Although, one can include those
too in the Cc list.)

Regards,
Luben

>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c   |  6 +++---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   | 22 ++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h |  4 +++-
>  3 files changed, 17 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index f99d4873bf22..0385f7f69278 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -96,7 +96,7 @@ static void amdgpu_doorbell_get_kfd_info(struct 
> amdgpu_device *adev,
>size_t *start_offset)
>  {
>   /*
> -  * The first num_doorbells are used by amdgpu.
> +  * The first num_kernel_doorbells are used by amdgpu.
>* amdkfd takes whatever's left in the aperture.
>*/
>   if (adev->enable_mes) {
> @@ -109,11 +109,11 @@ static void amdgpu_doorbell_get_kfd_info(struct 
> amdgpu_device *adev,
>   *aperture_base = adev->doorbell.base;
>   *aperture_size = 0;
>   *start_offset = 0;
> - } else if (adev->doorbell.size > adev->doorbell.num_doorbells *
> + } else if (adev->doorbell.size > adev->doorbell.num_kernel_doorbells *
>   sizeof(u32)) {
>   *aperture_base = adev->doorbell.base;
>   *aperture_size = adev->doorbell.size;
> - *start_offset = adev->doorbell.num_doorbells * sizeof(u32);
> + *start_offset = adev->doorbell.num_kernel_doorbells * 
> sizeof(u32);
>   } else {
>   *aperture_base = 0;
>   *aperture_size = 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index afe6af9c0138..57ee1c4a81e9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -593,7 +593,7 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 
> index)
>   if (amdgpu_device_skip_hw_access(adev))
>   return 0;
>  
> - if (index < adev->doorbell.num_doorbells) {
> + if (index < adev->doorbell.num_kernel_doorbells) {
>   return readl(adev->doorbell.ptr + index);
>   } else {
>   DRM_ERROR("reading beyond doorbell aperture: 0x%08x!\n", index);
> @@ -616,7 +616,7 @@ void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 
> index, u32 v)
>   if (amdgpu_device_skip_hw_access(adev))
>   return;
>  
> - if (index < adev->doorbell.num_doorbells) {
> + if (index < adev->doorbell.num_kernel_doorbells) {
>   writel(v, adev->doorbell.ptr + index);
>   } else {
>   DRM_ERROR("writing beyond doorbell aperture: 0x%08x!\n", index);
> @@ -637,7 +637,7 @@ u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 
> index)
>   if (amdgpu_device_skip_hw_access(adev))
>   return 0;
>  
> - if (index < adev->doorbell.num_doorbells) {
> + if (index < adev->doorbell.num_kernel_doorbells) {
>   return atomic64_read((atomic64_t *)(adev->doorbell.ptr + 
> index));
>   } else {
>   DRM_ERROR("reading beyond doorbell aperture: 0x%08x!\n", index);
> @@ -660,7 +660,7 @@ void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, 
> u32 index, u64 v)
>   if (amdgpu_device_skip_hw_access(adev))
>   return;
>  
> - if (index < adev->doorbell.num_doorbells) {
> + if (index < adev->doorbell.num_kernel_doorbells) {
>   atomic64_set((atomic64_t *)(adev->doorbell.ptr + index), v);
>   } else {
>   DRM_ERROR("writing beyond doorbell aperture: 0x%08x!\n", index);
> @@ -1034,7 +1034,7 @@ static int amdgpu_device_doorbell_init(struct 
> amdgpu_device *adev)
>   if (adev->asic_type < CHIP_BONAIRE) {
>   adev->doorbell.base = 0;
>   adev->doorbell.size = 0;
> - adev->doorbell.num_doorbells = 0;
> + adev->doorbell.num_kernel_doorbells = 0;
>   adev->doorbell.ptr = NULL;
>   return 0;
>   }
> @@ -1049,27 +1049,27 @@ static int amdgpu_device_doorbell_init(struct 
> amdgpu_device *adev)
>   adev->doorbell.size = pci_resource_len(adev->pdev, 2);
>  
>   if (adev->enable_mes) {
> - adev->doorbell.num_doorbells 

Re: [PATCH v3 2/9] drm/amdgpu: add usermode queue base code

2023-04-04 Thread Luben Tuikov
On 2023-03-29 12:04, Shashank Sharma wrote:
> From: Shashank Sharma 
> 
> This patch adds skeleton code for amdgpu usermode queue. It contains:
> - A new files with init functions of usermode queues.
> - A queue context manager in driver private data.
> 
> V1: Worked on design review comments from RFC patch series:
> (https://patchwork.freedesktop.org/series/112214/)
> - Alex: Keep a list of queues, instead of single queue per process.
> - Christian: Use the queue manager instead of global ptrs,
>Don't keep the queue structure in amdgpu_ctx
> 
> V2:
>  - Reformatted code, split the big patch into two
> 
> V3:
> - Integration with doorbell manager
> 
> Cc: Alex Deucher 
> Cc: Christian Koenig 
> Signed-off-by: Shashank Sharma 
> ---
>  drivers/gpu/drm/amd/amdgpu/Makefile   |  2 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h   | 10 +++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c   |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c   |  6 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c | 39 +++
>  .../gpu/drm/amd/include/amdgpu_userqueue.h| 49 +++
>  6 files changed, 106 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
>  create mode 100644 drivers/gpu/drm/amd/include/amdgpu_userqueue.h
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile 
> b/drivers/gpu/drm/amd/amdgpu/Makefile
> index 204665f20319..2d90ba618e5d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/Makefile
> +++ b/drivers/gpu/drm/amd/amdgpu/Makefile
> @@ -210,6 +210,8 @@ amdgpu-y += \
>  # add amdkfd interfaces
>  amdgpu-y += amdgpu_amdkfd.o
>  
> +# add usermode queue
> +amdgpu-y += amdgpu_userqueue.o
>  
>  ifneq ($(CONFIG_HSA_AMD),)
>  AMDKFD_PATH := ../amdkfd
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 6b74df446694..c5f9af0e74ee 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -438,6 +438,14 @@ struct amdgpu_sa_manager {
>   uint32_talign;
>  };
>  
> +/* Gfx usermode queues */
> +struct amdgpu_userq_mgr {
> + struct idr userq_idr;
> + struct mutex userq_mutex;
> + struct amdgpu_device *adev;
> + const struct amdgpu_userq_funcs *userq_funcs[AMDGPU_HW_IP_NUM];
> +};
> +

Could you align the member names to the largest member's column,
as opposed to having only a single space in between?

It makes it so much more readable.
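
Something like the sketch below--illustration only, with stub types standing
in for the kernel's real definitions (and a placeholder AMDGPU_HW_IP_NUM) so
the suggested column alignment can be seen in isolation:

```c
/*
 * Illustration only: the stub types below stand in for the kernel's
 * real definitions so the alignment style compiles on its own.
 */
#define AMDGPU_HW_IP_NUM 9		/* placeholder value for the sketch */

struct idr	{ int dummy; };		/* stub */
struct mutex	{ int dummy; };		/* stub */
struct amdgpu_device;			/* opaque here */
struct amdgpu_userq_funcs;		/* opaque here */

/* Gfx usermode queues, with member names aligned to one column */
struct amdgpu_userq_mgr {
	struct idr			userq_idr;
	struct mutex			userq_mutex;
	struct amdgpu_device		*adev;
	const struct amdgpu_userq_funcs	*userq_funcs[AMDGPU_HW_IP_NUM];
};
```

The eye can then scan the member names in one vertical pass, which is the
point of the alignment.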

>  /* sub-allocation buffer */
>  struct amdgpu_sa_bo {
>   struct list_headolist;
> @@ -470,7 +478,6 @@ struct amdgpu_flip_work {
>   boolasync;
>  };
>  
> -
>  /*
>   * file private structure
>   */
> @@ -482,6 +489,7 @@ struct amdgpu_fpriv {
>   struct mutexbo_list_lock;
>   struct idr  bo_list_handles;
>   struct amdgpu_ctx_mgr   ctx_mgr;
> + struct amdgpu_userq_mgr userq_mgr;
>  };

Like here, and pretty much the rest of the kernel code.

>  
>  int amdgpu_file_to_fpriv(struct file *filp, struct amdgpu_fpriv **fpriv);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index b4f2d61ea0d5..2d6bcfd727c8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -52,6 +52,7 @@
>  #include "amdgpu_ras.h"
>  #include "amdgpu_xgmi.h"
>  #include "amdgpu_reset.h"
> +#include "amdgpu_userqueue.h"
>  
>  /*
>   * KMS wrapper.
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index 7aa7e52ca784..b16b8155a157 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -43,6 +43,7 @@
>  #include "amdgpu_gem.h"
>  #include "amdgpu_display.h"
>  #include "amdgpu_ras.h"
> +#include "amdgpu_userqueue.h"
>  
>  void amdgpu_unregister_gpu_instance(struct amdgpu_device *adev)
>  {
> @@ -1187,6 +1188,10 @@ int amdgpu_driver_open_kms(struct drm_device *dev, 
> struct drm_file *file_priv)
>  
>   amdgpu_ctx_mgr_init(>ctx_mgr, adev);
>  
> + r = amdgpu_userq_mgr_init(>userq_mgr, adev);
> + if (r)
> + DRM_WARN("Can't setup usermode queues, only legacy workload 
> submission will work\n");
> +
>   file_priv->driver_priv = fpriv;
>   goto out_suspend;
>  
> @@ -1254,6 +1259,7 @@ void amdgpu_driver_postclose_kms(struct drm_device *dev,
>  
>   amdgpu_ctx_mgr_fini(>ctx_mgr);
>   amdgpu_vm_fini(adev, >vm);
> + amdgpu_userq_mgr_fini(>userq_mgr);
>  
>   if (pasid)
>   amdgpu_pasid_free_delayed(pd->tbo.base.resv, pasid);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
> new file mode 100644
> index ..13e1eebc1cb6
> --- /dev/null
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
> @@ -0,0 +1,39 @@
> +/*
> + * Copyright 2022 Advanced Micro Devices, Inc.
> + *
> + * Permission is 

Re: [PATCH] drm/amdgpu: simplify amdgpu_ras_eeprom.c

2023-04-03 Thread Luben Tuikov
This patch is,

Reviewed-by: Luben Tuikov 

Regards,
Luben

On 2023-03-31 15:54, Alex Deucher wrote:
> All chips that support RAS also support IP discovery, so
> use the IP versions rather than a mix of IP versions and
> asic types.  Checking the validity of the atom_ctx pointer
> is not required as the vbios is already fetched at this
> point.
> 
> v2: add comments to id asic types based on feedback from Luben
> 
> Signed-off-by: Alex Deucher 
> Cc: Luben Tuikov 
> ---
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 72 ++-
>  1 file changed, 20 insertions(+), 52 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index 3106fa8a15ef..c2c2a7718613 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -106,48 +106,13 @@
>  #define to_amdgpu_device(x) (container_of(x, struct amdgpu_ras, 
> eeprom_control))->adev
>  
>  static bool __is_ras_eeprom_supported(struct amdgpu_device *adev)
> -{
> - if (adev->asic_type == CHIP_IP_DISCOVERY) {
> - switch (adev->ip_versions[MP1_HWIP][0]) {
> - case IP_VERSION(13, 0, 0):
> - case IP_VERSION(13, 0, 10):
> - return true;
> - default:
> - return false;
> - }
> - }
> -
> - return  adev->asic_type == CHIP_VEGA20 ||
> - adev->asic_type == CHIP_ARCTURUS ||
> - adev->asic_type == CHIP_SIENNA_CICHLID ||
> - adev->asic_type == CHIP_ALDEBARAN;
> -}
> -
> -static bool __get_eeprom_i2c_addr_arct(struct amdgpu_device *adev,
> -struct amdgpu_ras_eeprom_control 
> *control)
> -{
> - struct atom_context *atom_ctx = adev->mode_info.atom_context;
> -
> - if (!control || !atom_ctx)
> - return false;
> -
> - if (strnstr(atom_ctx->vbios_version,
> - "D342",
> - sizeof(atom_ctx->vbios_version)))
> - control->i2c_address = EEPROM_I2C_MADDR_0;
> - else
> - control->i2c_address = EEPROM_I2C_MADDR_4;
> -
> - return true;
> -}
> -
> -static bool __get_eeprom_i2c_addr_ip_discovery(struct amdgpu_device *adev,
> -struct amdgpu_ras_eeprom_control 
> *control)
>  {
>   switch (adev->ip_versions[MP1_HWIP][0]) {
> + case IP_VERSION(11, 0, 2): /* VEGA20 and ARCTURUS */
> + case IP_VERSION(11, 0, 7): /* Sienna cichlid */
>   case IP_VERSION(13, 0, 0):
> + case IP_VERSION(13, 0, 2): /* Aldebaran */
>   case IP_VERSION(13, 0, 10):
> - control->i2c_address = EEPROM_I2C_MADDR_4;
>   return true;
>   default:
>   return false;
> @@ -178,29 +143,32 @@ static bool __get_eeprom_i2c_addr(struct amdgpu_device 
> *adev,
>   return true;
>   }
>  
> - switch (adev->asic_type) {
> - case CHIP_VEGA20:
> - control->i2c_address = EEPROM_I2C_MADDR_0;
> + switch (adev->ip_versions[MP1_HWIP][0]) {
> + case IP_VERSION(11, 0, 2):
> + /* VEGA20 and ARCTURUS */
> + if (adev->asic_type == CHIP_VEGA20)
> + control->i2c_address = EEPROM_I2C_MADDR_0;
> + else if (strnstr(atom_ctx->vbios_version,
> +  "D342",
> +  sizeof(atom_ctx->vbios_version)))
> + control->i2c_address = EEPROM_I2C_MADDR_0;
> + else
> + control->i2c_address = EEPROM_I2C_MADDR_4;
>   return true;
> -
> - case CHIP_ARCTURUS:
> - return __get_eeprom_i2c_addr_arct(adev, control);
> -
> - case CHIP_SIENNA_CICHLID:
> + case IP_VERSION(11, 0, 7):
>   control->i2c_address = EEPROM_I2C_MADDR_0;
>   return true;
> -
> - case CHIP_ALDEBARAN:
> + case IP_VERSION(13, 0, 2):
>   if (strnstr(atom_ctx->vbios_version, "D673",
>   sizeof(atom_ctx->vbios_version)))
>   control->i2c_address = EEPROM_I2C_MADDR_4;
>   else
>   control->i2c_address = EEPROM_I2C_MADDR_0;
>   return true;
> -
> - case CHIP_IP_DISCOVERY:
> - return __get_eeprom_i2c_addr_ip_discovery(adev, control);
> -
> + case IP_VERSION(13, 0, 0):
> + case IP_VERSION(13, 0, 10):
> + control->i2c_address = EEPROM_I2C_MADDR_4;
> + return true;
>   default:
>   return false;
>   }



Re: [PATCH] drm/amdgpu: simplify amdgpu_ras_eeprom.c

2023-04-03 Thread Luben Tuikov
On 2023-03-31 15:30, Alex Deucher wrote:
> On Tue, Mar 28, 2023 at 12:30 PM Luben Tuikov  wrote:
>>
>> On 2023-03-27 20:11, Alex Deucher wrote:
>>> All chips that support RAS also support IP discovery, so
>>> use the IP versions rather than a mix of IP versions and
>>> asic types.
>>>
>>> Signed-off-by: Alex Deucher 
>>> Cc: Luben Tuikov 
>>> ---
>>>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 72 ++-
>>>  1 file changed, 20 insertions(+), 52 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>> index 3106fa8a15ef..c2ef2b1456bc 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>> @@ -106,48 +106,13 @@
>>>  #define to_amdgpu_device(x) (container_of(x, struct amdgpu_ras, 
>>> eeprom_control))->adev
>>>
>>>  static bool __is_ras_eeprom_supported(struct amdgpu_device *adev)
>>> -{
>>> - if (adev->asic_type == CHIP_IP_DISCOVERY) {
>>> - switch (adev->ip_versions[MP1_HWIP][0]) {
>>> - case IP_VERSION(13, 0, 0):
>>> - case IP_VERSION(13, 0, 10):
>>> - return true;
>>> - default:
>>> - return false;
>>> - }
>>> - }
>>> -
>>> - return  adev->asic_type == CHIP_VEGA20 ||
>>> - adev->asic_type == CHIP_ARCTURUS ||
>>> - adev->asic_type == CHIP_SIENNA_CICHLID ||
>>> - adev->asic_type == CHIP_ALDEBARAN;
>>> -}
>>> -
>>> -static bool __get_eeprom_i2c_addr_arct(struct amdgpu_device *adev,
>>> -struct amdgpu_ras_eeprom_control 
>>> *control)
>>> -{
>>> - struct atom_context *atom_ctx = adev->mode_info.atom_context;
>>> -
>>> - if (!control || !atom_ctx)
>>> - return false;
>>> -
>>> - if (strnstr(atom_ctx->vbios_version,
>>> - "D342",
>>> - sizeof(atom_ctx->vbios_version)))
>>> - control->i2c_address = EEPROM_I2C_MADDR_0;
>>> - else
>>> - control->i2c_address = EEPROM_I2C_MADDR_4;
>>> -
>>> - return true;
>>> -}
>>> -
>>> -static bool __get_eeprom_i2c_addr_ip_discovery(struct amdgpu_device *adev,
>>> -struct amdgpu_ras_eeprom_control 
>>> *control)
>>>  {
>>>   switch (adev->ip_versions[MP1_HWIP][0]) {
>>> + case IP_VERSION(11, 0, 2): /* VEGA20 and ARCTURUS */
>>> + case IP_VERSION(11, 0, 7):
>>>   case IP_VERSION(13, 0, 0):
>>> + case IP_VERSION(13, 0, 2):
>>>   case IP_VERSION(13, 0, 10):
>>
>> I'd add the rest of the proper names here which are being deleted by this 
>> change,
>> so as to not lose this information by this commit: Sienna Cichlid and 
>> Aldebaran,
>> the rest can be left blank as per the current state of the code.
> 
> Fixed.
> 
>>
>>> - control->i2c_address = EEPROM_I2C_MADDR_4;
>>>   return true;
>>>   default:
>>>   return false;
>>> @@ -178,29 +143,32 @@ static bool __get_eeprom_i2c_addr(struct 
>>> amdgpu_device *adev,
>>>   return true;
>>>   }
>>>
>>> - switch (adev->asic_type) {
>>> - case CHIP_VEGA20:
>>> - control->i2c_address = EEPROM_I2C_MADDR_0;
>>> + switch (adev->ip_versions[MP1_HWIP][0]) {
>>> + case IP_VERSION(11, 0, 2):
>>> + /* VEGA20 and ARCTURUS */
>>> + if (adev->asic_type == CHIP_VEGA20)
>>> + control->i2c_address = EEPROM_I2C_MADDR_0;
>>> + else if (strnstr(atom_ctx->vbios_version,
>>
>> In the code this is qualified with atom_ctx != NULL; and if it is,
>> then we return false. So, this is fine, iff we can guarantee that
>> "atom_ctx" will never be NULL. If, OTOH, we cannot guarantee that,
>> then we need to add,
>> else if (!atom_ctx)
>> return false;
>> else if (strnstr(...
>>
>> Although, I do recognize that for Aldebaran below, we do not qua

Re: [PATCH 11/16] drm/amdgpu: get absolute offset from doorbell index

2023-03-30 Thread Luben Tuikov
On 2023-03-29 11:47, Shashank Sharma wrote:
> This patch adds a helper function which converts a doorbell's
> relative index in a BO to an absolute doorbell offset in the
> doorbell BAR.
> 
> Cc: Alex Deucher 
> Cc: Christian Koenig 
> Signed-off-by: Shashank Sharma 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h  | 15 +++
>  .../gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c  | 26 +++
>  2 files changed, 41 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
> index 10a9bb10e974..3481e9d83879 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
> @@ -383,6 +383,21 @@ int amdgpu_doorbell_alloc_page(struct amdgpu_device 
> *adev,
>   */
>  int amdgpu_doorbell_create_kernel_doorbells(struct amdgpu_device *adev);
>  
> +/**
> + * amdgpu_doorbell_index_on_bar - Find doorbell's absolute offset in BAR
> + *
> + * @adev: amdgpu_device pointer
> + *
> + * @db_bo: doorbell object's bo
> + *
> + * @db_index: doorbell relative index in this doorbell object
> + *
> + * returns doorbell's absolute index in BAR
> + */
> +uint32_t amdgpu_doorbell_index_on_bar(struct amdgpu_device *adev,
> +struct amdgpu_bo *db_bo,
> +uint32_t doorbell_index);
> +

Two things:
1. No kernel doc for function declarations--this should go to where
the function is defined. (This also removes redundancy.)

2. No empty lines around function arguments in kernel doc. See this about
the format of function documentation:
https://www.kernel.org/doc/html/v4.12/doc-guide/kernel-doc.html#function-documentation
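
For reference, a sketch of the same comment in the canonical kernel-doc
shape: no blank lines between the parameter lines, and a "Return:" section.
Note also that the posted kernel-doc names the parameter @db_index while the
declaration calls it doorbell_index--the two must match. The function body
below is a stub for illustration, not the driver code:

```c
#include <stdint.h>

/* Format illustration only; the body below is a stub, not the driver code. */

/**
 * amdgpu_doorbell_index_on_bar - Find doorbell's absolute offset in BAR
 * @adev: amdgpu_device pointer
 * @db_bo: doorbell object's bo
 * @doorbell_index: doorbell relative index in this doorbell object
 *
 * Return: doorbell's absolute index in BAR
 */
static uint32_t amdgpu_doorbell_index_on_bar(void *adev, void *db_bo,
					     uint32_t doorbell_index)
{
	(void)adev;		/* unused in this stub */
	(void)db_bo;		/* unused in this stub */
	return doorbell_index;	/* stub body */
}
```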

>  #define RDOORBELL32(index) amdgpu_mm_rdoorbell(adev, (index))
>  #define WDOORBELL32(index, v) amdgpu_mm_wdoorbell(adev, (index), (v))
>  #define RDOORBELL64(index) amdgpu_mm_rdoorbell64(adev, (index))
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
> index 81713b2c28e1..c263bae6b0c4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
> @@ -130,6 +130,32 @@ void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, 
> u32 index, u64 v)
>   }
>  }
>  
> +/**
> + * amdgpu_doorbell_index_on_bar - Find doorbell's absolute offset in BAR
> + *
> + * @adev: amdgpu_device pointer
> + *
> + * @db_bo: doorbell object's bo
> + *
> + * @db_index: doorbell relative index in this doorbell object
> + *
> + * returns doorbell's absolute index in BAR
> + */
> +uint32_t amdgpu_doorbell_index_on_bar(struct amdgpu_device *adev,
> +struct amdgpu_bo *db_bo,
> +uint32_t doorbell_index)
> +{
> + int db_bo_offset;
> +
> + db_bo_offset = amdgpu_bo_gpu_offset_no_check(db_bo);

amdgpu_bo_gpu_offset_no_check() returns u64. Perhaps use u64,
or u32 (which is what this function returns) and cast it down.

> +
> + /*
> +  * doorbell index granularity is maintained at 32 bit
> +  * but doorbell's size is 64-bit, so index * 2
> +  */
> + return db_bo_offset / sizeof(u32) + doorbell_index * 2;

Perhaps add this inside the comment:
* (db_bo_offset + doorbell_index * 8) / sizeof(u32),
which seems clearer to me. But leave the return as is.
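
To see the equivalence outside the driver: each 64-bit doorbell occupies two
32-bit slots, so the form in the patch and the form suggested for the comment
compute the same index whenever db_bo_offset is a multiple of 4 (it is
page-aligned in practice). A minimal standalone sketch:

```c
#include <stdint.h>

/*
 * Sketch of the index arithmetic discussed above: each 64-bit doorbell
 * occupies two 32-bit slots, so the return expression below equals
 * (db_bo_offset + doorbell_index * 8) / sizeof(uint32_t) for any
 * 4-byte-aligned db_bo_offset.
 */
static inline uint32_t index_on_bar(uint64_t db_bo_offset,
				    uint32_t doorbell_index)
{
	/* same as: (db_bo_offset + doorbell_index * 8) / sizeof(uint32_t) */
	return db_bo_offset / sizeof(uint32_t) + doorbell_index * 2;
}
```

E.g. a (hypothetical) page-aligned offset of 4096 with relative index 3 gives
slot 1030 either way.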

Regards,
Luben

> +}
> +
>  /**
>   * amdgpu_doorbell_free_page - Free a doorbell page
>   *



Re: [PATCH 09/16] drm/amdgpu: create kernel doorbell page

2023-03-30 Thread Luben Tuikov
On 2023-03-30 10:53, Shashank Sharma wrote:
> 
> On 30/03/2023 16:49, Christian König wrote:
>> Am 30.03.23 um 16:40 schrieb Shashank Sharma:
>>>
>>> On 30/03/2023 16:24, Luben Tuikov wrote:
>>>> On 2023-03-29 11:47, Shashank Sharma wrote:
>>>>> From: Shashank Sharma 
>>>>>
>>>>> This patch:
>>>>> - creates a doorbell page for graphics driver usages.
>>>>> - removes the adev->doorbell.ptr variable, replaces it with
>>>>>    kernel-doorbell-bo's cpu address.
>>>>>
>>>>> Cc: Alex Deucher 
>>>>> Cc: Christian Koenig 
>>>>> Signed-off-by: Shashank Sharma 
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h  | 16 ++-
>>>>>   .../gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c  | 44 
>>>>> +++
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c   |  7 +++
>>>>>   3 files changed, 57 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
>>>>> index 6581b78fe438..10a9bb10e974 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
>>>>> @@ -49,10 +49,13 @@ struct amdgpu_doorbell {
>>>>>   /* doorbell mmio */
>>>>>   resource_size_t    base;
>>>>>   resource_size_t    size;
>>>>> -    u32 __iomem    *ptr;
>>>>> +    u32    __iomem    *ptr;
>>>>>     /* Number of doorbells reserved for amdgpu kernel driver */
>>>>>   u32 num_kernel_doorbells;
>>>>> +
>>>>> +    /* For kernel doorbell pages */
>>>>> +    struct amdgpu_doorbell_obj kernel_doorbells;
>>>>>   };
>>>> Here's an example where it could be confusing what the difference is
>>>> between "struct amdgpu_doorbell" and "struct amdgpu_doorbell_obj".
>>>> As the comment to the struct doorbell_obj declarations says
>>>> in patch 7,
>>>>> +/* Structure to hold doorbell pages from PCI doorbell BAR */
>>>>> +struct amdgpu_doorbell_obj {
>>>
>>> What is the confusion ? This is the object which is holding doorbell 
>>> page. There could be 2 type of
>>>
>>> doorbell pages, kernel and process, this is a kernel one.
>>>
>>>> Perhaps we should call it "struct amdgpu_doorbell_bo", since
>>>> it does contain amdgpu_bo's, which are doorbell's bos.
>>>
>>> This is not a buffer object (memory), this is doorbell object, so 
>>> calling it bo would be wrong.
>>
>> I think what Luben means is that in object orient programming this 
>> here would be the class. The object is then the actual instantiation 
>> of that.
>>
> Why should we even bother about OOPs terminology in kernel C code ? I 
> think we are spending too much time in something not worth.

Because you're using "object" incorrectly. Especially for people with
vast programming experience, this creates confusion. Please don't use
"obj" in the name of a structure. Perhaps use "bo" or "page" or something
which it really _is_. But don't mix OOP terminology in non-OOP code. We
have people who program both sides of the aisle, and this creates confusion.

Let's use structure names which really describe what something is. This would
help very much new people reading the code in the future, to form mental
concepts and better understand the code.

Regards,
Luben

> 
> 
>> But I have some real doubts that this is the right approach in the 
>> first place.
> 
> 
> I would like to discuss and understand more on this technical aspect. 
> Can you please have a look at the whole series and check how we have
> 
> handled the existing doorbell clients (KFD, MES), and if you feel the 
> same, we should talk more on this ?
> 
> - Shashank
> 
>>
>> Regards,
>> Christian.
>>
>>>
>>> - Shashank
>>>
>>>>
>>>> Regards,
>>>> Luben
>>>>
>>>>>     /* Reserved doorbells for amdgpu (including multimedia).
>>>>> @@ -369,6 +372,17 @@ void amdgpu_doorbell_free_page(struct 
>>>>> amdgpu_device *adev,
>>>>>   int amdgpu_doorbell_alloc_page(struct amdgpu_device *adev,
>>>>> 

Re: [PATCH 09/16] drm/amdgpu: create kernel doorbell page

2023-03-30 Thread Luben Tuikov
On 2023-03-30 10:49, Christian König wrote:
> Am 30.03.23 um 16:40 schrieb Shashank Sharma:
>>
>> On 30/03/2023 16:24, Luben Tuikov wrote:
>>> On 2023-03-29 11:47, Shashank Sharma wrote:
>>>> From: Shashank Sharma 
>>>>
>>>> This patch:
>>>> - creates a doorbell page for graphics driver usages.
>>>> - removes the adev->doorbell.ptr variable, replaces it with
>>>>    kernel-doorbell-bo's cpu address.
>>>>
>>>> Cc: Alex Deucher 
>>>> Cc: Christian Koenig 
>>>> Signed-off-by: Shashank Sharma 
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h  | 16 ++-
>>>>   .../gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c  | 44 
>>>> +++
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c   |  7 +++
>>>>   3 files changed, 57 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
>>>> index 6581b78fe438..10a9bb10e974 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
>>>> @@ -49,10 +49,13 @@ struct amdgpu_doorbell {
>>>>   /* doorbell mmio */
>>>>   resource_size_t    base;
>>>>   resource_size_t    size;
>>>> -    u32 __iomem    *ptr;
>>>> +    u32    __iomem    *ptr;
>>>>     /* Number of doorbells reserved for amdgpu kernel driver */
>>>>   u32 num_kernel_doorbells;
>>>> +
>>>> +    /* For kernel doorbell pages */
>>>> +    struct amdgpu_doorbell_obj kernel_doorbells;
>>>>   };
>>> Here's an example where it could be confusing what the difference is
>>> between "struct amdgpu_doorbell" and "struct amdgpu_doorbell_obj".
>>> As the comment to the struct doorbell_obj declarations says
>>> in patch 7,
>>>> +/* Structure to hold doorbell pages from PCI doorbell BAR */
>>>> +struct amdgpu_doorbell_obj {
>>
>> What is the confusion ? This is the object which is holding doorbell 
>> page. There could be 2 type of
>>
>> doorbell pages, kernel and process, this is a kernel one.
>>
>>> Perhaps we should call it "struct amdgpu_doorbell_bo", since
>>> it does contain amdgpu_bo's, which are doorbell's bos.
>>
>> This is not a buffer object (memory), this is doorbell object, so 
>> calling it bo would be wrong.
> 
> I think what Luben means is that in object orient programming this here 
> would be the class. The object is then the actual instantiation of that.

Yes, absolutely exactly what Christian said.

Regards,
Luben

> 
> But I have some real doubts that this is the right approach in the first 
> place.
> 
> Regards,
> Christian.
> 
>>
>> - Shashank
>>
>>>
>>> Regards,
>>> Luben
>>>
>>>>     /* Reserved doorbells for amdgpu (including multimedia).
>>>> @@ -369,6 +372,17 @@ void amdgpu_doorbell_free_page(struct 
>>>> amdgpu_device *adev,
>>>>   int amdgpu_doorbell_alloc_page(struct amdgpu_device *adev,
>>>>   struct amdgpu_doorbell_obj *db_obj);
>>>>   +/**
>>>> + * amdgpu_doorbell_create_kernel_doorbells - Create kernel 
>>>> doorbells for graphics
>>>> + *
>>>> + * @adev: amdgpu_device pointer
>>>> + *
>>>> + * Creates doorbells for graphics driver
>>>> + *
>>>> + * returns 0 on success, error otherwise.
>>>> + */
>>>> +int amdgpu_doorbell_create_kernel_doorbells(struct amdgpu_device 
>>>> *adev);
>>>> +
>>>>   #define RDOORBELL32(index) amdgpu_mm_rdoorbell(adev, (index))
>>>>   #define WDOORBELL32(index, v) amdgpu_mm_wdoorbell(adev, (index), (v))
>>>>   #define RDOORBELL64(index) amdgpu_mm_rdoorbell64(adev, (index))
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
>>>> index 8be15b82b545..b46fe8b1378d 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
>>>> @@ -160,6 +160,38 @@ int amdgpu_doorbell_alloc_page(struct 
>>>> amdgpu_device *adev,
>>>>   return 0;
>>>>   }
>>>>   +/**
>>>> + * am

Re: [PATCH 10/16] drm/amdgpu: validate doorbell read/write

2023-03-30 Thread Luben Tuikov
On 2023-03-30 10:37, Shashank Sharma wrote:
> 
> On 30/03/2023 16:34, Luben Tuikov wrote:
>> On 2023-03-29 11:47, Shashank Sharma wrote:
>>> This patch:
>>> - updates start/end values for each of the doorbell object
>>>created.
>>> - adds a function which validates that the kernel doorbell read/write
>>>is within this range.
>>> - uses this function during doorbell writes from kernel.
>>>
>>> Cc: Alex Deucher 
>>> Cc: Christian Koenig 
>>> Signed-off-by: Shashank Sharma 
>>> Signed-off-by: Arvind Yadav 
>>> ---
>>>   .../gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c  | 29 ---
>>>   1 file changed, 25 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
>>> index b46fe8b1378d..81713b2c28e1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
>>> @@ -22,6 +22,25 @@
>>>*/
>>>   
>>>   #include "amdgpu.h"
>>> +#include "kfd_priv.h"
>>> +
>>> +static inline
>>> +bool amdgpu_doorbell_valid(struct amdgpu_device *adev, u32 index)
>>> +{
>>> +   if (index >= adev->doorbell.kernel_doorbells.start &&
>>> +   index < adev->doorbell.kernel_doorbells.end)
>>> +   return true;
>>> +
>>> +   if (index >= adev->mes.kernel_doorbells.start &&
>>> +   index < adev->mes.kernel_doorbells.end)
>>> +   return true;
>>> +
>>> +   if (index >= adev->kfd.dev->kernel_doorbells.start &&
>>> +   index < adev->kfd.dev->kernel_doorbells.end)
>>> +   return true;
>>> +
>>> +   return false;
>>> +}
>> Here you're excluding "end".
>>
>> In patch 7 we see this:
>>> +   /* Last index in this object */
>>> +   uint32_t end;
>> Which implies that "end" is included, but in this patch, the code excludes 
>> it.
>> Perhaps you intended to use "index <= ...end" here?
> 
> No, this is intended, same as array object calculation.
> 
> end = start + size;
> 
> max = start + size - 1

This I understand, but "end" is NEVER "start + size" in all
code written since 1969. "end" is outside the bounds and thus
never used like that.

"start" and "end" usage comes from RTL and is always inclusive,
and "end" always fits in the same sized register as that of "start".
But if you use "size" and add, it may overflow. So, enough history.

"end" is inclusive. If this is not the case in your implementation,
then please use "size".

> 
> so (< end) not (<= end)
> 
> end says last index in this doorbell range;

This I don't understand.

This isn't how "start" and "end" are being used.
Their usage comes from RTL, and is always inclusive.

Either use "start" and "size" or make "end" be inclusive.

I'd prefer using "start" and "size" as this is traditionally
what is done in memory management in software (not RTL).

However, using "end" in software makes it tricky to calculate size, since
one always has to compute "end - start + 1", and this can lead to bugs
and errors.

Please use "start" and "size", then.

Regards,
Luben


> 
> - Shashank
> 
>>
>> Since this isn't RTL, wouldn't it be better to describe the doorbell 
>> instance,
>> with a "start" and "size"? This is traditionally used in memory management,
>> and it makes comparisons and checks easy.
>>
>> Regards,
>> Luben
>>
>>
>>>   
>>>   /**
>>>* amdgpu_mm_rdoorbell - read a doorbell dword
>>> @@ -37,7 +56,7 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 
>>> index)
>>> if (amdgpu_device_skip_hw_access(adev))
>>> return 0;
>>>   
>>> -   if (index < adev->doorbell.num_kernel_doorbells) {
>>> +   if (amdgpu_doorbell_valid(adev, index)) {
>>> return readl(adev->doorbell.ptr + index);
>>> } else {
>>> DRM_ERROR("reading beyond doorbell aperture: 0x%08x!\n", index);
>>> @@ -60,7 +79,7 @@ void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 
>>> index, u32 v)
>>> if (amdgpu_devic

Re: [PATCH 07/16] drm/amdgpu: add helper to create doorbell pages

2023-03-30 Thread Luben Tuikov
On 2023-03-30 10:28, Alex Deucher wrote:
> On Wed, Mar 29, 2023 at 11:48 AM Shashank Sharma
>  wrote:
>>
>> From: Shashank Sharma 
>>
>> This patch adds helper functions to create and free doorbell
>> pages for kernel objects.
> 
> I think we can probably drop this patch.  I think it would be simpler
> to just use standard amdgpu_bos to represent them and then maybe a
> helper function to calculate the BAR offset from a doorbell bo object.

I agree.

Regards,
Luben

> 
> Alex
> 
>>
>> Cc: Alex Deucher 
>> Cc: Christian Koenig 
>> Signed-off-by: Shashank Sharma 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h  | 41 
>>  .../gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c  | 49 +++
>>  2 files changed, 90 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
>> index f9c3b77bf65d..6581b78fe438 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
>> @@ -27,6 +27,24 @@
>>  /*
>>   * GPU doorbell structures, functions & helpers
>>   */
>> +
>> +/* Structure to hold doorbell pages from PCI doorbell BAR */
>> +struct amdgpu_doorbell_obj {
>> +   struct amdgpu_bo *bo;
>> +   uint64_t gpu_addr;
>> +   uint32_t *cpu_addr;
>> +   uint32_t size;
>> +
>> +   /* First index in this object */
>> +   uint32_t start;
>> +
>> +   /* Last index in this object */
>> +   uint32_t end;
>> +
>> +   /* bitmap for dynamic doorbell allocation from this object */
>> +   unsigned long *doorbell_bitmap;
>> +};
>> +
>>  struct amdgpu_doorbell {
>> /* doorbell mmio */
>> resource_size_t base;
>> @@ -328,6 +346,29 @@ int amdgpu_device_doorbell_init(struct amdgpu_device 
>> *adev);
>>   */
>>  void amdgpu_device_doorbell_fini(struct amdgpu_device *adev);
>>
>> +/**
>> + * amdgpu_doorbell_free_page - Free a doorbell page
>> + *
>> + * @adev: amdgpu_device pointer
>> + *
>> + * @db_obj: previously allocated doorbell page details
>> + *
>> + */
>> +void amdgpu_doorbell_free_page(struct amdgpu_device *adev,
>> +   struct amdgpu_doorbell_obj *db_obj);
>> +
>> +/**
>> + * amdgpu_doorbell_alloc_page - create a page from doorbell pool
>> + *
>> + * @adev: amdgpu_device pointer
>> + *
>> + * @db_obj: doorbell page structure to fill details with
>> + *
>> + * returns 0 on success, else error number
>> + */
>> +int amdgpu_doorbell_alloc_page(struct amdgpu_device *adev,
>> +   struct amdgpu_doorbell_obj *db_obj);
>> +
>>  #define RDOORBELL32(index) amdgpu_mm_rdoorbell(adev, (index))
>>  #define WDOORBELL32(index, v) amdgpu_mm_wdoorbell(adev, (index), (v))
>>  #define RDOORBELL64(index) amdgpu_mm_rdoorbell64(adev, (index))
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
>> index 1aea92363fd3..8be15b82b545 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
>> @@ -111,6 +111,55 @@ void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, 
>> u32 index, u64 v)
>> }
>>  }
>>
>> +/**
>> + * amdgpu_doorbell_free_page - Free a doorbell page
>> + *
>> + * @adev: amdgpu_device pointer
>> + *
>> + * @db_obj: previously allocated doorbell page details
>> + *
>> + */
>> +void amdgpu_doorbell_free_page(struct amdgpu_device *adev,
>> +   struct amdgpu_doorbell_obj *db_obj)
>> +{
>> +   amdgpu_bo_free_kernel(&db_obj->bo,
>> + &db_obj->gpu_addr,
>> + (void **)&db_obj->cpu_addr);
>> +
>> +}
>> +
>> +/**
>> + * amdgpu_doorbell_alloc_page - create a page from doorbell pool
>> + *
>> + * @adev: amdgpu_device pointer
>> + *
>> + * @db_obj: doorbell page structure to fill details with
>> + *
>> + * returns 0 on success, else error number
>> + */
>> +int amdgpu_doorbell_alloc_page(struct amdgpu_device *adev,
>> +   struct amdgpu_doorbell_obj *db_obj)
>> +{
>> +   int r;
>> +
>> +   db_obj->size = ALIGN(db_obj->size, PAGE_SIZE);
>> +
>> +   r = amdgpu_bo_create_kernel(adev,
>> +   db_obj->size,
>> +   PAGE_SIZE,
>> +   AMDGPU_GEM_DOMAIN_DOORBELL,
>> +   &db_obj->bo,
>> +   &db_obj->gpu_addr,
>> +   (void **)&db_obj->cpu_addr);
>> +
>> +   if (r) {
>> +   DRM_ERROR("Failed to create doorbell BO, err=%d\n", r);
>> +   return r;
>> +   }
>> +
>> +   return 0;
>> +}
>> +
>>  /*
>>   * GPU doorbell aperture helpers function.
>>   */
>> --
>> 2.40.0
>>



Re: [PATCH 07/16] drm/amdgpu: add helper to create doorbell pages

2023-03-30 Thread Luben Tuikov
On 2023-03-30 10:34, Shashank Sharma wrote:
> 
> On 30/03/2023 16:15, Luben Tuikov wrote:
>> On 2023-03-30 10:04, Shashank Sharma wrote:
>>> On 30/03/2023 15:42, Luben Tuikov wrote:
>>>> On 2023-03-29 11:47, Shashank Sharma wrote:
>>>>> From: Shashank Sharma 
>>>>>
>>>>> This patch adds helper functions to create and free doorbell
>>>>> pages for kernel objects.
>>>>>
>>>>> Cc: Alex Deucher 
>>>>> Cc: Christian Koenig 
>>>>> Signed-off-by: Shashank Sharma 
>>>>> ---
>>>>>drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h  | 41 
>>>>>.../gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c  | 49 +++
>>>>>2 files changed, 90 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
>>>>> index f9c3b77bf65d..6581b78fe438 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
>>>>> @@ -27,6 +27,24 @@
>>>>>/*
>>>>> * GPU doorbell structures, functions & helpers
>>>>> */
>>>>> +
>>>>> +/* Structure to hold doorbell pages from PCI doorbell BAR */
>>>>> +struct amdgpu_doorbell_obj {
>>>> In the comment you say "Structure to hold ...";
>>>> it is a C structure, but then in the name of a function we see "obj".
>>>> (Object is something which is defined like in memory, i.e. it exists, not
>>>> something which is only declared.)
>>>> This is just a declaration of a structure, not an object per se.
>>>> I'd call it "struct amdgpu_doorbell_struct" or just "struct 
>>>> amdgpu_doorbell".
>>> It is similar to struct amdgpu buffer object (struct amdgpu_bo), and
>>> many more existing structure.
>> The amdgpu_bo is very different from a structure describing a doorbell.
>> The doorbell description isn't really "an object". I understand
>> the enthusiasm, but it is really not "an object". It's just a doorbell
>> description. :-)
> 
> amdgpu_bo is page of ttm_memory with additional information,
> 
> amdgpu_doorbell_obj is a page of ttm_doorbells with additional information
> 
> (it is not just one doorbell description)
> 
> I don't see a problem here.

There is no problem; it is just that the name may confuse future
maintainers and readers.

If amdgpu_doorbell_obj stores a page/pages, maybe "amdgpu_doorbell_bo"
would be more descriptive.

I'm merely trying to find the closest descriptive name; since this is
not C++, using "obj" is confusing.

Regards,
Luben

> 
> - Shashank
> 
>>
>> Regards,
>> Luben
>>
>>> - Shashank
>>>
>>>> Then in the definition, you can call it an object/objects, if you'd like,
>>>> like "struct amdgpu_doorbell *doorb_object[];" then you can say
>>>> "db = doorb_object[i]";
>>>>
>>>> Regards,
>>>> Luben
>>>>
>>>>> + struct amdgpu_bo *bo;
>>>>> + uint64_t gpu_addr;
>>>>> + uint32_t *cpu_addr;
>>>>> + uint32_t size;
>>>>> +
>>>>> + /* First index in this object */
>>>>> + uint32_t start;
>>>>> +
>>>>> + /* Last index in this object */
>>>>> + uint32_t end;
>>>>> +
>>>>> + /* bitmap for dynamic doorbell allocation from this object */
>>>>> + unsigned long *doorbell_bitmap;
>>>>> +};
>>>>> +
>>>>>struct amdgpu_doorbell {
>>>>>   /* doorbell mmio */
>>>>>   resource_size_t base;
>>>>> @@ -328,6 +346,29 @@ int amdgpu_device_doorbell_init(struct amdgpu_device 
>>>>> *adev);
>>>>> */
>>>>>void amdgpu_device_doorbell_fini(struct amdgpu_device *adev);
>>>>>
>>>>> +/**
>>>>> + * amdgpu_doorbell_free_page - Free a doorbell page
>>>>> + *
>>>>> + * @adev: amdgpu_device pointer
>>>>> + *
>>>>> + * @db_obj: previously allocated doorbell page details
>>>>> + *
>>>>> + */
>>>>> +void amdgpu_doorbell_free_page(struct amdgpu_device *adev,

Re: [PATCH 10/16] drm/amdgpu: validate doorbell read/write

2023-03-30 Thread Luben Tuikov
On 2023-03-29 11:47, Shashank Sharma wrote:
> This patch:
> - updates start/end values for each of the doorbell object
>   created.
> - adds a function which validates that the kernel doorbell read/write
>   is within this range.
> - uses this function during doorbell writes from kernel.
> 
> Cc: Alex Deucher 
> Cc: Christian Koenig 
> Signed-off-by: Shashank Sharma 
> Signed-off-by: Arvind Yadav 
> ---
>  .../gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c  | 29 ---
>  1 file changed, 25 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
> index b46fe8b1378d..81713b2c28e1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
> @@ -22,6 +22,25 @@
>   */
>  
>  #include "amdgpu.h"
> +#include "kfd_priv.h"
> +
> +static inline
> +bool amdgpu_doorbell_valid(struct amdgpu_device *adev, u32 index)
> +{
> + if (index >= adev->doorbell.kernel_doorbells.start &&
> + index < adev->doorbell.kernel_doorbells.end)
> + return true;
> +
> + if (index >= adev->mes.kernel_doorbells.start &&
> + index < adev->mes.kernel_doorbells.end)
> + return true;
> +
> + if (index >= adev->kfd.dev->kernel_doorbells.start &&
> + index < adev->kfd.dev->kernel_doorbells.end)
> + return true;
> +
> + return false;
> +}

Here you're excluding "end".

In patch 7 we see this:
> + /* Last index in this object */
> + uint32_t end;

Which implies that "end" is included, but in this patch, the code excludes it.
Perhaps you intended to use "index <= ...end" here?

Since this isn't RTL, wouldn't it be better to describe the doorbell instance,
with a "start" and "size"? This is traditionally used in memory management,
and it makes comparisons and checks easy.

Regards,
Luben


>  
>  /**
>   * amdgpu_mm_rdoorbell - read a doorbell dword
> @@ -37,7 +56,7 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 
> index)
>   if (amdgpu_device_skip_hw_access(adev))
>   return 0;
>  
> - if (index < adev->doorbell.num_kernel_doorbells) {
> + if (amdgpu_doorbell_valid(adev, index)) {
>   return readl(adev->doorbell.ptr + index);
>   } else {
>   DRM_ERROR("reading beyond doorbell aperture: 0x%08x!\n", index);
> @@ -60,7 +79,7 @@ void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 
> index, u32 v)
>   if (amdgpu_device_skip_hw_access(adev))
>   return;
>  
> - if (index < adev->doorbell.num_kernel_doorbells) {
> + if (amdgpu_doorbell_valid(adev, index)) {
>   writel(v, adev->doorbell.ptr + index);
>   } else {
>   DRM_ERROR("writing beyond doorbell aperture: 0x%08x!\n", index);
> @@ -81,7 +100,7 @@ u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 
> index)
>   if (amdgpu_device_skip_hw_access(adev))
>   return 0;
>  
> - if (index < adev->doorbell.num_kernel_doorbells) {
> + if (amdgpu_doorbell_valid(adev, index)) {
>   return atomic64_read((atomic64_t *)(adev->doorbell.ptr + 
> index));
>   } else {
>   DRM_ERROR("reading beyond doorbell aperture: 0x%08x!\n", index);
> @@ -104,7 +123,7 @@ void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, 
> u32 index, u64 v)
>   if (amdgpu_device_skip_hw_access(adev))
>   return;
>  
> - if (index < adev->doorbell.num_kernel_doorbells) {
> + if (amdgpu_doorbell_valid(adev, index)) {
>   atomic64_set((atomic64_t *)(adev->doorbell.ptr + index), v);
>   } else {
>   DRM_ERROR("writing beyond doorbell aperture: 0x%08x!\n", index);
> @@ -157,6 +176,8 @@ int amdgpu_doorbell_alloc_page(struct amdgpu_device *adev,
>   return r;
>   }
>  
> + db_obj->start = amdgpu_doorbell_index_on_bar(adev, db_obj->bo, 0);
> + db_obj->end = db_obj->start + db_obj->size / sizeof(u32);
>   return 0;
>  }
>  



Re: [PATCH 09/16] drm/amdgpu: create kernel doorbell page

2023-03-30 Thread Luben Tuikov
On 2023-03-29 11:47, Shashank Sharma wrote:
> From: Shashank Sharma 
> 
> This patch:
> - creates a doorbell page for graphics driver usages.
> - removes the adev->doorbell.ptr variable, replaces it with
>   kernel-doorbell-bo's cpu address.
> 
> Cc: Alex Deucher 
> Cc: Christian Koenig 
> Signed-off-by: Shashank Sharma 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h  | 16 ++-
>  .../gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c  | 44 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c   |  7 +++
>  3 files changed, 57 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
> index 6581b78fe438..10a9bb10e974 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
> @@ -49,10 +49,13 @@ struct amdgpu_doorbell {
>   /* doorbell mmio */
>   resource_size_t base;
>   resource_size_t size;
> - u32 __iomem *ptr;
> + u32 __iomem *ptr;
>  
>   /* Number of doorbells reserved for amdgpu kernel driver */
>   u32 num_kernel_doorbells;
> +
> + /* For kernel doorbell pages */
> + struct amdgpu_doorbell_obj kernel_doorbells;
>  };

Here's an example where it could be confusing what the difference is
between "struct amdgpu_doorbell" and "struct amdgpu_doorbell_obj".
As the comment to the struct doorbell_obj declarations says
in patch 7,

> +/* Structure to hold doorbell pages from PCI doorbell BAR */
> +struct amdgpu_doorbell_obj {

Perhaps we should call it "struct amdgpu_doorbell_bo", since
it does contain amdgpu_bo's, which are doorbell's bos.

Regards,
Luben

>  
>  /* Reserved doorbells for amdgpu (including multimedia).
> @@ -369,6 +372,17 @@ void amdgpu_doorbell_free_page(struct amdgpu_device 
> *adev,
>  int amdgpu_doorbell_alloc_page(struct amdgpu_device *adev,
>   struct amdgpu_doorbell_obj *db_obj);
>  
> +/**
> + * amdgpu_doorbell_create_kernel_doorbells - Create kernel doorbells for 
> graphics
> + *
> + * @adev: amdgpu_device pointer
> + *
> + * Creates doorbells for graphics driver
> + *
> + * returns 0 on success, error otherwise.
> + */
> +int amdgpu_doorbell_create_kernel_doorbells(struct amdgpu_device *adev);
> +
>  #define RDOORBELL32(index) amdgpu_mm_rdoorbell(adev, (index))
>  #define WDOORBELL32(index, v) amdgpu_mm_wdoorbell(adev, (index), (v))
>  #define RDOORBELL64(index) amdgpu_mm_rdoorbell64(adev, (index))
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
> index 8be15b82b545..b46fe8b1378d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
> @@ -160,6 +160,38 @@ int amdgpu_doorbell_alloc_page(struct amdgpu_device 
> *adev,
>   return 0;
>  }
>  
> +/**
> + * amdgpu_doorbell_create_kernel_doorbells - Create kernel doorbells for 
> graphics
> + *
> + * @adev: amdgpu_device pointer
> + *
> + * Creates doorbells for graphics driver
> + *
> + * returns 0 on success, error otherwise.
> + */
> +int amdgpu_doorbell_create_kernel_doorbells(struct amdgpu_device *adev)
> +{
> + int r;
> + struct amdgpu_doorbell_obj *kernel_doorbells = 
> &adev->doorbell.kernel_doorbells;
> +
> + kernel_doorbells->doorbell_bitmap = 
> bitmap_zalloc(adev->doorbell.num_kernel_doorbells,
> +   GFP_KERNEL);
> + if (!kernel_doorbells->doorbell_bitmap) {
> + DRM_ERROR("Failed to create kernel doorbell bitmap\n");
> + return -ENOMEM;
> + }
> +
> + kernel_doorbells->size = adev->doorbell.num_kernel_doorbells * 
> sizeof(u32);
> + r = amdgpu_doorbell_alloc_page(adev, kernel_doorbells);
> + if (r) {
> + bitmap_free(kernel_doorbells->doorbell_bitmap);
> + DRM_ERROR("Failed to allocate kernel doorbells, err=%d\n", r);
> + return r;
> + }
> +
> + return 0;
> +}
> +
>  /*
>   * GPU doorbell aperture helpers function.
>   */
> @@ -179,7 +211,6 @@ int amdgpu_device_doorbell_init(struct amdgpu_device 
> *adev)
>   adev->doorbell.base = 0;
>   adev->doorbell.size = 0;
>   adev->doorbell.num_kernel_doorbells = 0;
> - adev->doorbell.ptr = NULL;
>   return 0;
>   }
>  
> @@ -208,12 +239,7 @@ int amdgpu_device_doorbell_init(struct amdgpu_device 
> *adev)
>   if (adev->asic_type >= CHIP_VEGA10)
>   adev->doorbell.num_kernel_doorbells += 0x400;
>  
> - adev->doorbell.ptr = ioremap(adev->doorbell.base,
> -  adev->doorbell.num_kernel_doorbells *
> -  sizeof(u32));
> - if (adev->doorbell.ptr == NULL)
> - return -ENOMEM;
> -
> + adev->doorbell.ptr = ioremap(adev->doorbell.base, adev->doorbell.size);
>   return 0;
>  }
>  
> @@ -226,6 

Re: [PATCH 07/16] drm/amdgpu: add helper to create doorbell pages

2023-03-30 Thread Luben Tuikov
On 2023-03-30 10:04, Shashank Sharma wrote:
> 
> On 30/03/2023 15:42, Luben Tuikov wrote:
>> On 2023-03-29 11:47, Shashank Sharma wrote:
>>> From: Shashank Sharma 
>>>
>>> This patch adds helper functions to create and free doorbell
>>> pages for kernel objects.
>>>
>>> Cc: Alex Deucher 
>>> Cc: Christian Koenig 
>>> Signed-off-by: Shashank Sharma 
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h  | 41 
>>>   .../gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c  | 49 +++
>>>   2 files changed, 90 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
>>> index f9c3b77bf65d..6581b78fe438 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
>>> @@ -27,6 +27,24 @@
>>>   /*
>>>* GPU doorbell structures, functions & helpers
>>>*/
>>> +
>>> +/* Structure to hold doorbell pages from PCI doorbell BAR */
>>> +struct amdgpu_doorbell_obj {
>> In the comment you say "Structure to hold ...";
>> it is a C structure, but then in the name of a function we see "obj".
>> (Object is something which is defined like in memory, i.e. it exists, not
>> something which is only declared.)
>> This is just a declaration of a structure, not an object per se.
>> I'd call it "struct amdgpu_doorbell_struct" or just "struct amdgpu_doorbell".
> 
> It is similar to struct amdgpu buffer object (struct amdgpu_bo), and 
> many more existing structure.

The amdgpu_bo is very different from a structure describing a doorbell.
The doorbell description isn't really "an object". I understand
the enthusiasm, but it is really not "an object". It's just a doorbell
description. :-)

Regards,
Luben

> 
> - Shashank
> 
>> Then in the definition, you can call it an object/objects, if you'd like,
>> like "struct amdgpu_doorbell *doorb_object[];" then you can say
>> "db = doorb_object[i]";
>>
>> Regards,
>> Luben
>>
>>> +   struct amdgpu_bo *bo;
>>> +   uint64_t gpu_addr;
>>> +   uint32_t *cpu_addr;
>>> +   uint32_t size;
>>> +
>>> +   /* First index in this object */
>>> +   uint32_t start;
>>> +
>>> +   /* Last index in this object */
>>> +   uint32_t end;
>>> +
>>> +   /* bitmap for dynamic doorbell allocation from this object */
>>> +   unsigned long *doorbell_bitmap;
>>> +};
>>> +
>>>   struct amdgpu_doorbell {
>>> /* doorbell mmio */
>>> resource_size_t base;
>>> @@ -328,6 +346,29 @@ int amdgpu_device_doorbell_init(struct amdgpu_device 
>>> *adev);
>>>*/
>>>   void amdgpu_device_doorbell_fini(struct amdgpu_device *adev);
>>>   
>>> +/**
>>> + * amdgpu_doorbell_free_page - Free a doorbell page
>>> + *
>>> + * @adev: amdgpu_device pointer
>>> + *
>>> + * @db_obj: previously allocated doorbell page details
>>> + *
>>> + */
>>> +void amdgpu_doorbell_free_page(struct amdgpu_device *adev,
>>> +   struct amdgpu_doorbell_obj *db_obj);
>>> +
>>> +/**
>>> + * amdgpu_doorbell_alloc_page - create a page from doorbell pool
>>> + *
>>> + * @adev: amdgpu_device pointer
>>> + *
>>> + * @db_obj: doorbell page structure to fill details with
>>> + *
>>> + * returns 0 on success, else error number
>>> + */
>>> +int amdgpu_doorbell_alloc_page(struct amdgpu_device *adev,
>>> +   struct amdgpu_doorbell_obj *db_obj);
>>> +
>>>   #define RDOORBELL32(index) amdgpu_mm_rdoorbell(adev, (index))
>>>   #define WDOORBELL32(index, v) amdgpu_mm_wdoorbell(adev, (index), (v))
>>>   #define RDOORBELL64(index) amdgpu_mm_rdoorbell64(adev, (index))
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
>>> index 1aea92363fd3..8be15b82b545 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
>>> @@ -111,6 +111,55 @@ void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, 
>>> u32 index, u64 v)
>>> }
>>>   }
>>>   
>>> +/**
>>> + * amdgpu_d

Re: [PATCH 06/16] drm/amdgpu: accommodate DOMAIN/PL_DOORBELL

2023-03-30 Thread Luben Tuikov
On 2023-03-30 09:48, Shashank Sharma wrote:
> 
> On 30/03/2023 15:45, Luben Tuikov wrote:
>> On 2023-03-30 09:43, Shashank Sharma wrote:
>>> On 30/03/2023 15:33, Luben Tuikov wrote:
>>>> On 2023-03-30 07:14, Christian König wrote:
>>>>> Am 29.03.23 um 17:47 schrieb Shashank Sharma:
>>>>>> From: Alex Deucher 
>>>>>>
>>>>>> This patch adds changes:
>>>>>> - to accommodate the new GEM domain DOORBELL
>>>>>> - to accommodate the new TTM PL DOORBELL
>>>>>>
>>>>>> in order to manage doorbell pages as GEM object.
>>>>>>
>>>>>> V2: Addressed review comments from Christian
>>>>>>- drop the doorbell changes for pinning/unpinning
>>>>>>- drop the doorbell changes for dma-buf map
>>>>>>- drop the doorbell changes for sgt
>>>>>>- no need to handle TTM_PL_FLAG_CONTIGUOUS for doorbell
>>>>>>- add caching type for doorbell
>>>>>>
>>>>>> Cc: Alex Deucher 
>>>>>> Cc: Christian Koenig 
>>>>>>
>>>>>> Signed-off-by: Alex Deucher 
>>>>>> Signed-off-by: Shashank Sharma 
>>>> Generally there are no empty lines in the tag list. Perhaps remove it?
>>> I would prefer to keep it, to highlight the CC parts.
>> I've never seen a commit with them separated. Perhaps follow Linux custom?
> 
> IIRC This is not against Linux patch formatting/message body guidelines.

The tag list forms a block, a paragraph, which is easy to scan and separate
out of the description of the patch, which in itself can have many paragraphs
separated by blank lines: subject line, paragraph 1, paragraph 2, ...,
paragraph of tags. Furthermore, these tags are added/appended by automated
scripts/tools, which wouldn't add an empty line.

Check out the following resources:
https://www.kernel.org/doc/html/v4.12/process/5.Posting.html#patch-formatting-and-changelogs
https://www.kernel.org/doc/html/v4.12/process/submitting-patches.html#the-canonical-patch-format

"git log -- drivers/gpu/drm/." is also a very helpful reference to see some 
good patch formatting.

Please remove the empty line between the Cc and Sob lines, so it forms a tag 
paragraph.

Regards,
Luben


> 
> - Shashank
> 
>> Regards,
>> Luben
>>
>>> - Shashank
>>>
>>>> Regards,
>>>> Luben
>>>>
>>>>>> ---
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 11 ++-
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h |  2 ++
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c| 16 +++-
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h|  1 +
>>>>>> 4 files changed, 28 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>> index 4e684c2afc70..0ec080e240ad 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>> @@ -147,6 +147,14 @@ void amdgpu_bo_placement_from_domain(struct 
>>>>>> amdgpu_bo *abo, u32 domain)
>>>>>>  c++;
>>>>>>  }
>>>>>> 
>>>>>> +if (domain & AMDGPU_GEM_DOMAIN_DOORBELL) {
>>>>>> +places[c].fpfn = 0;
>>>>>> +places[c].lpfn = 0;
>>>>>> +places[c].mem_type = AMDGPU_PL_DOORBELL;
>>>>>> +places[c].flags = 0;
>>>>>> +c++;
>>>>>> +}
>>>>>> +
>>>>>>  if (domain & AMDGPU_GEM_DOMAIN_GTT) {
>>>>>>  places[c].fpfn = 0;
>>>>>>  places[c].lpfn = 0;
>>>>>> @@ -466,7 +474,7 @@ static bool amdgpu_bo_validate_size(struct 
>>>>>> amdgpu_device *adev,
>>>>>>  goto fail;
>>>>>>  }
>>>>>> 
>>>>>> -/* TODO add more domains checks, such as AMDGPU_GEM_DOMAIN_CPU 
>>>>>> */
>>>>>> +/* TODO add more domains checks, such as AMDGPU_GEM_DOMAIN_CPU, 
>>>>>>  AMDGPU_GEM_DOM

Re: [PATCH 00/16] AMDGPU Doorbell manager

2023-03-30 Thread Luben Tuikov
As I'm reviewing this, it is obvious that this patchset hasn't gone
through scripts/checkpatch.pl.

It's good practice to run one's patches through scripts/checkpatch.pl,
to see deviations from common Linux practices, and correct them.

Regards,
Luben

On 2023-03-29 11:47, Shashank Sharma wrote:
> The doorbells in AMDGPU drivers are currently managed by different
> users in a scattered way, across the driver. The existing clients are:
> - AMDGPU graphics driver for kernel level doorbell writes.
> - AMDGPU MES module for kernel level doorbell write (MES ring test).
> - AMDGPU MES modules for kernel level aggregated doorbell writes.
> - AMDGPU MES module for MES process doorbell writes.
> - AMDKFD module for KFD/KIQ kernel doorbell writes.
> - AMDKFD module for KFD process doorbell writes.
> - AMDGPU usermode queues for usermode doorbell writes (upcoming).
> 
> This patch series introduces Doorbell-manager to keep the doorbell handling
> at a central place. The fundamental changes are:
> 
> - Introduce and accommodate a new GEM domain for doorbells.
> - Prepare the AMDGPU ttm backend for handling doorbell allocation.
> - Introduce doorbell-manager functions to allocate, free and index
>   doorbells in one unique way.
> - Create doorbell BOs for kernel-level and process level doorbell
>   operations, and place them in existing structures.
> - Modify the existing graphics, KFD and MES code to use the
>   doorbell-manager functions.
> - Remove the existing doorbell management code in KFD/MES.
> 
> PS: This series has been sanity tested with kfd_test_suit to ensure
> it is not introducing any regressions due to kfd doorbell changes.
> 
> The idea is that:
> - a kernel client can call doorbell manager functions to allocate/free
>   doorbell pages.
> - a usermode app can directly allocate a page from the doorbell bar just
>   like a GEM object and use it for different usermode queues.
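
The per-page bookkeeping this implies (the `doorbell_bitmap` field added by the
series) can be sketched as a small userspace model: indices are handed out by
finding the first clear bit in a page's bitmap. This is a hypothetical
illustration, not the kernel code; the names and the slots-per-page count are
made up here.

```c
#include <assert.h>

#define DB_PER_PAGE 64	/* assumption: 64 doorbell slots per page */

struct db_page_model {
	unsigned long long bitmap;	/* one bit per doorbell slot */
};

/* Hand out the first free index in the page, or -1 if the page is full. */
static int db_alloc_index(struct db_page_model *p)
{
	int i;

	for (i = 0; i < DB_PER_PAGE; i++) {
		if (!(p->bitmap & (1ULL << i))) {
			p->bitmap |= 1ULL << i;
			return i;
		}
	}
	return -1;
}

/* Release a previously allocated index so it can be handed out again. */
static void db_free_index(struct db_page_model *p, int idx)
{
	p->bitmap &= ~(1ULL << idx);
}
```

Freed indices are reused on the next allocation, which is the point of keeping
a bitmap per page rather than a bump counter.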
> 
> Alex Deucher (2):
>   drm/amdgpu: add UAPI for allocating doorbell memory
>   drm/amdgpu: accommodate DOMAIN/PL_DOORBELL
> 
> Shashank Sharma (14):
>   drm/amdgpu: rename num_doorbells
>   drm/amdgpu: include protection for doobell.h
>   drm/amdgpu: create a new file for doorbell manager
>   drm/amdgpu: don't modify num_doorbells for mes
>   drm/amdgpu: add helper to create doorbell pages
>   drm/amdgpu: initialize ttm for doorbells
>   drm/amdgpu: create kernel doorbell page
>   drm/amdgpu: validate doorbell read/write
>   drm/amdgpu: get absolute offset from doorbell index
>   drm/amdgpu: use doorbell manager for kfd kernel doorbells
>   drm/amdgpu: use doorbell manager for kfd process doorbells
>   drm/amdgpu: remove unused functions and variables
>   drm/amdgpu: use doorbell mgr for MES kernel doorbells
>   drm/amdgpu: use doorbell mgr for MES process doorbells
> 
>  drivers/gpu/drm/amd/amdgpu/Makefile   |   2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c|   6 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c| 164 --
>  drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h  | 102 +-
>  .../gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c  | 304 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c   | 165 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h   |  17 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c|  11 +-
>  .../gpu/drm/amd/amdgpu/amdgpu_res_cursor.h|   2 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c   |  31 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h   |   1 +
>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  |  13 -
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c   |   4 +-
>  .../drm/amd/amdkfd/kfd_device_queue_manager.c |  16 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c | 198 
>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h |  23 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_process.c  |  26 +-
>  .../amd/amdkfd/kfd_process_queue_manager.c|  16 +-
>  include/uapi/drm/amdgpu_drm.h |   7 +-
>  19 files changed, 636 insertions(+), 472 deletions(-)
>  create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
> 



Re: [PATCH 06/16] drm/amdgpu: accommodate DOMAIN/PL_DOORBELL

2023-03-30 Thread Luben Tuikov
On 2023-03-30 09:43, Shashank Sharma wrote:
> 
> On 30/03/2023 15:33, Luben Tuikov wrote:
>> On 2023-03-30 07:14, Christian König wrote:
>>> On 29.03.23 at 17:47, Shashank Sharma wrote:
>>>> From: Alex Deucher 
>>>>
>>>> This patch adds changes:
>>>> - to accommodate the new GEM domain DOORBELL
>>>> - to accommodate the new TTM PL DOORBELL
>>>>
>>>> in order to manage doorbell pages as GEM object.
>>>>
>>>> V2: Addressed review comments from Christian
>>>>   - drop the doorbell changes for pinning/unpinning
>>>>   - drop the doorbell changes for dma-buf map
>>>>   - drop the doorbell changes for sgt
>>>>   - no need to handle TTM_PL_FLAG_CONTIGUOUS for doorbell
>>>>   - add caching type for doorbell
>>>>
>>>> Cc: Alex Deucher 
>>>> Cc: Christian Koenig 
>>>>
>>>> Signed-off-by: Alex Deucher 
>>>> Signed-off-by: Shashank Sharma 
>> Generally there are no empty lines in the tag list. Perhaps remove it?
> 
> I would prefer to keep it, to highlight the CC parts.

I've never seen a commit with them separated. Perhaps follow Linux custom?

Regards,
Luben

> 
> - Shashank
> 
>> Regards,
>> Luben
>>
>>>> ---
>>>>drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 11 ++-
>>>>drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h |  2 ++
>>>>drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c| 16 +++-
>>>>drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h|  1 +
>>>>4 files changed, 28 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>> index 4e684c2afc70..0ec080e240ad 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>> @@ -147,6 +147,14 @@ void amdgpu_bo_placement_from_domain(struct amdgpu_bo 
>>>> *abo, u32 domain)
>>>>c++;
>>>>}
>>>>
>>>> +  if (domain & AMDGPU_GEM_DOMAIN_DOORBELL) {
>>>> +  places[c].fpfn = 0;
>>>> +  places[c].lpfn = 0;
>>>> +  places[c].mem_type = AMDGPU_PL_DOORBELL;
>>>> +  places[c].flags = 0;
>>>> +  c++;
>>>> +  }
>>>> +
>>>>if (domain & AMDGPU_GEM_DOMAIN_GTT) {
>>>>places[c].fpfn = 0;
>>>>places[c].lpfn = 0;
>>>> @@ -466,7 +474,7 @@ static bool amdgpu_bo_validate_size(struct 
>>>> amdgpu_device *adev,
>>>>goto fail;
>>>>}
>>>>
>>>> -  /* TODO add more domains checks, such as AMDGPU_GEM_DOMAIN_CPU */
>>>> +  /* TODO add more domains checks, such as AMDGPU_GEM_DOMAIN_CPU,  
>>>> AMDGPU_GEM_DOMAIN_DOORBELL */
>>>>return true;
>>>>
>>>>fail:
>>>> @@ -1013,6 +1021,7 @@ void amdgpu_bo_unpin(struct amdgpu_bo *bo)
>>>>} else if (bo->tbo.resource->mem_type == TTM_PL_TT) {
>>>>atomic64_sub(amdgpu_bo_size(bo), &adev->gart_pin_size);
>>>>}
>>>> +
>>> Unrelated newline, probably just a leftover.
>>>
>>> Apart from that the patch is Reviewed-by: Christian König
>>> 
>>>
>>> Regards,
>>> Christian.
>>>
>>>>}
>>>>
>>>>static const char *amdgpu_vram_names[] = {
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
>>>> index 5c4f93ee0c57..3c988cc406e4 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
>>>> @@ -90,6 +90,7 @@ static inline void amdgpu_res_first(struct ttm_resource 
>>>> *res,
>>>>cur->node = block;
>>>>break;
>>>>case TTM_PL_TT:
>>>> +  case AMDGPU_PL_DOORBELL:
>>>>node = to_ttm_range_mgr_node(res)->mm_nodes;
>>>>while (start >= node->size << PAGE_SHIFT)
>>>>start -= node++->s

Re: [PATCH 07/16] drm/amdgpu: add helper to create doorbell pages

2023-03-30 Thread Luben Tuikov
On 2023-03-30 09:42, Luben Tuikov wrote:
> On 2023-03-29 11:47, Shashank Sharma wrote:
>> From: Shashank Sharma 
>>
>> This patch adds helper functions to create and free doorbell
>> pages for kernel objects.
>>
>> Cc: Alex Deucher 
>> Cc: Christian Koenig 
>> Signed-off-by: Shashank Sharma 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h  | 41 
>>  .../gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c  | 49 +++
>>  2 files changed, 90 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
>> index f9c3b77bf65d..6581b78fe438 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
>> @@ -27,6 +27,24 @@
>>  /*
>>   * GPU doorbell structures, functions & helpers
>>   */
>> +
>> +/* Structure to hold doorbell pages from PCI doorbell BAR */
>> +struct amdgpu_doorbell_obj {
> 
> In the comment you say "Structure to hold ...";
> it is a C structure, but then in the name of a function we see "obj".

I mean here " in the name of the structure we see ..."

Regards,
Luben

> (An object is something which is defined, i.e. it exists in memory, not
> something which is only declared.)
> This is just a declaration of a structure, not an object per se.
> I'd call it "struct amdgpu_doorbell_struct" or just "struct amdgpu_doorbell".
> 
> Then in the definition, you can call it an object/objects, if you'd like,
> like "struct amdgpu_doorbell *doorb_object[];" then you can say
> "db = doorb_object[i]";
> 
> Regards,
> Luben
> 
>> +struct amdgpu_bo *bo;
>> +uint64_t gpu_addr;
>> +uint32_t *cpu_addr;
>> +uint32_t size;
>> +
>> +/* First index in this object */
>> +uint32_t start;
>> +
>> +/* Last index in this object */
>> +uint32_t end;
>> +
>> +/* bitmap for dynamic doorbell allocation from this object */
>> +unsigned long *doorbell_bitmap;
>> +};
>> +
>>  struct amdgpu_doorbell {
>>  /* doorbell mmio */
>>  resource_size_t base;
>> @@ -328,6 +346,29 @@ int amdgpu_device_doorbell_init(struct amdgpu_device 
>> *adev);
>>   */
>>  void amdgpu_device_doorbell_fini(struct amdgpu_device *adev);
>>  
>> +/**
>> + * amdgpu_doorbell_free_page - Free a doorbell page
>> + *
>> + * @adev: amdgpu_device pointer
>> + *
>> + * @db_obj: previously allocated doorbell page details
>> + *
>> + */
>> +void amdgpu_doorbell_free_page(struct amdgpu_device *adev,
>> +struct amdgpu_doorbell_obj *db_obj);
>> +
>> +/**
>> + * amdgpu_doorbell_alloc_page - create a page from doorbell pool
>> + *
>> + * @adev: amdgpu_device pointer
>> + *
>> + * @db_obj: doorbell page structure to fill details with
>> + *
>> + * returns 0 on success, else error number
>> + */
>> +int amdgpu_doorbell_alloc_page(struct amdgpu_device *adev,
>> +struct amdgpu_doorbell_obj *db_obj);
>> +
>>  #define RDOORBELL32(index) amdgpu_mm_rdoorbell(adev, (index))
>>  #define WDOORBELL32(index, v) amdgpu_mm_wdoorbell(adev, (index), (v))
>>  #define RDOORBELL64(index) amdgpu_mm_rdoorbell64(adev, (index))
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
>> index 1aea92363fd3..8be15b82b545 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
>> @@ -111,6 +111,55 @@ void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, 
>> u32 index, u64 v)
>>  }
>>  }
>>  
>> +/**
>> + * amdgpu_doorbell_free_page - Free a doorbell page
>> + *
>> + * @adev: amdgpu_device pointer
>> + *
>> + * @db_obj: previously allocated doorbell page details
>> + *
>> + */
>> +void amdgpu_doorbell_free_page(struct amdgpu_device *adev,
>> +struct amdgpu_doorbell_obj *db_obj)
>> +{
>> +amdgpu_bo_free_kernel(&db_obj->bo,
>> +  &db_obj->gpu_addr,
>> +  (void **)&db_obj->cpu_addr);
>> +
>> +}
>> +
>> +/**
>> + * amdgpu_doorbell_alloc_page - create a page from doorbell pool
>> + *
>> + * @adev: amdgpu_device pointer
>> + *
>> + * @db_obj: doorbell page structure to fill 

Re: [PATCH 07/16] drm/amdgpu: add helper to create doorbell pages

2023-03-30 Thread Luben Tuikov
On 2023-03-29 11:47, Shashank Sharma wrote:
> From: Shashank Sharma 
> 
> This patch adds helper functions to create and free doorbell
> pages for kernel objects.
> 
> Cc: Alex Deucher 
> Cc: Christian Koenig 
> Signed-off-by: Shashank Sharma 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h  | 41 
>  .../gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c  | 49 +++
>  2 files changed, 90 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
> index f9c3b77bf65d..6581b78fe438 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
> @@ -27,6 +27,24 @@
>  /*
>   * GPU doorbell structures, functions & helpers
>   */
> +
> +/* Structure to hold doorbell pages from PCI doorbell BAR */
> +struct amdgpu_doorbell_obj {

In the comment you say "Structure to hold ...";
it is a C structure, but then in the name of a function we see "obj".
(An object is something which is defined, i.e. it exists in memory, not
something which is only declared.)
This is just a declaration of a structure, not an object per se.
I'd call it "struct amdgpu_doorbell_struct" or just "struct amdgpu_doorbell".

Then in the definition, you can call it an object/objects, if you'd like,
like "struct amdgpu_doorbell *doorb_object[];" then you can say
"db = doorb_object[i]";

Regards,
Luben

> + struct amdgpu_bo *bo;
> + uint64_t gpu_addr;
> + uint32_t *cpu_addr;
> + uint32_t size;
> +
> + /* First index in this object */
> + uint32_t start;
> +
> + /* Last index in this object */
> + uint32_t end;
> +
> + /* bitmap for dynamic doorbell allocation from this object */
> + unsigned long *doorbell_bitmap;
> +};
> +
>  struct amdgpu_doorbell {
>   /* doorbell mmio */
>   resource_size_t base;
> @@ -328,6 +346,29 @@ int amdgpu_device_doorbell_init(struct amdgpu_device 
> *adev);
>   */
>  void amdgpu_device_doorbell_fini(struct amdgpu_device *adev);
>  
> +/**
> + * amdgpu_doorbell_free_page - Free a doorbell page
> + *
> + * @adev: amdgpu_device pointer
> + *
> + * @db_obj: previously allocated doorbell page details
> + *
> + */
> +void amdgpu_doorbell_free_page(struct amdgpu_device *adev,
> + struct amdgpu_doorbell_obj *db_obj);
> +
> +/**
> + * amdgpu_doorbell_alloc_page - create a page from doorbell pool
> + *
> + * @adev: amdgpu_device pointer
> + *
> + * @db_obj: doorbell page structure to fill details with
> + *
> + * returns 0 on success, else error number
> + */
> +int amdgpu_doorbell_alloc_page(struct amdgpu_device *adev,
> + struct amdgpu_doorbell_obj *db_obj);
> +
>  #define RDOORBELL32(index) amdgpu_mm_rdoorbell(adev, (index))
>  #define WDOORBELL32(index, v) amdgpu_mm_wdoorbell(adev, (index), (v))
>  #define RDOORBELL64(index) amdgpu_mm_rdoorbell64(adev, (index))
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
> index 1aea92363fd3..8be15b82b545 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
> @@ -111,6 +111,55 @@ void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, 
> u32 index, u64 v)
>   }
>  }
>  
> +/**
> + * amdgpu_doorbell_free_page - Free a doorbell page
> + *
> + * @adev: amdgpu_device pointer
> + *
> + * @db_obj: previously allocated doorbell page details
> + *
> + */
> +void amdgpu_doorbell_free_page(struct amdgpu_device *adev,
> + struct amdgpu_doorbell_obj *db_obj)
> +{
> + amdgpu_bo_free_kernel(&db_obj->bo,
> +   &db_obj->gpu_addr,
> +   (void **)&db_obj->cpu_addr);
> +
> +}
> +
> +/**
> + * amdgpu_doorbell_alloc_page - create a page from doorbell pool
> + *
> + * @adev: amdgpu_device pointer
> + *
> + * @db_obj: doorbell page structure to fill details with
> + *
> + * returns 0 on success, else error number
> + */
> +int amdgpu_doorbell_alloc_page(struct amdgpu_device *adev,
> + struct amdgpu_doorbell_obj *db_obj)
> +{
> + int r;
> +
> + db_obj->size = ALIGN(db_obj->size, PAGE_SIZE);
> +
> + r = amdgpu_bo_create_kernel(adev,
> + db_obj->size,
> + PAGE_SIZE,
> + AMDGPU_GEM_DOMAIN_DOORBELL,
> + &db_obj->bo,
> + &db_obj->gpu_addr,
> + (void **)&db_obj->cpu_addr);
> +
> + if (r) {
> + DRM_ERROR("Failed to create doorbell BO, err=%d\n", r);
> + return r;
> + }
> +
> + return 0;
> +}
> +
>  /*
>   * GPU doorbell aperture helpers function.
>   */
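
The alloc helper quoted above rounds the requested size up to a whole page
with ALIGN() before creating the BO. A minimal userspace model of that
rounding follows, assuming a power-of-two alignment (which the bitwise trick
requires); the macro name and page size here are illustrative, not taken from
the kernel headers.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096UL	/* illustrative page size */

/* Round x up to the next multiple of a; a must be a power of two. */
static unsigned long model_align(unsigned long x, unsigned long a)
{
	return (x + a - 1) & ~(a - 1);
}
```

So a 1-byte request still consumes a full page, and an exact multiple of the
page size is returned unchanged.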



Re: [PATCH 06/16] drm/amdgpu: accommodate DOMAIN/PL_DOORBELL

2023-03-30 Thread Luben Tuikov
On 2023-03-30 07:14, Christian König wrote:
> On 29.03.23 at 17:47, Shashank Sharma wrote:
>> From: Alex Deucher 
>>
>> This patch adds changes:
>> - to accommodate the new GEM domain DOORBELL
>> - to accommodate the new TTM PL DOORBELL
>>
>> in order to manage doorbell pages as GEM object.
>>
>> V2: Addressed review comments from Christian
>>  - drop the doorbell changes for pinning/unpinning
>>  - drop the doorbell changes for dma-buf map
>>  - drop the doorbell changes for sgt
>>  - no need to handle TTM_PL_FLAG_CONTIGUOUS for doorbell
>>  - add caching type for doorbell
>>
>> Cc: Alex Deucher 
>> Cc: Christian Koenig 
>>
>> Signed-off-by: Alex Deucher 
>> Signed-off-by: Shashank Sharma 

Generally there are no empty lines in the tag list. Perhaps remove it?

Regards,
Luben

>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 11 ++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h |  2 ++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c| 16 +++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h|  1 +
>>   4 files changed, 28 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> index 4e684c2afc70..0ec080e240ad 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> @@ -147,6 +147,14 @@ void amdgpu_bo_placement_from_domain(struct amdgpu_bo 
>> *abo, u32 domain)
>>  c++;
>>  }
>>   
>> +if (domain & AMDGPU_GEM_DOMAIN_DOORBELL) {
>> +places[c].fpfn = 0;
>> +places[c].lpfn = 0;
>> +places[c].mem_type = AMDGPU_PL_DOORBELL;
>> +places[c].flags = 0;
>> +c++;
>> +}
>> +
>>  if (domain & AMDGPU_GEM_DOMAIN_GTT) {
>>  places[c].fpfn = 0;
>>  places[c].lpfn = 0;
>> @@ -466,7 +474,7 @@ static bool amdgpu_bo_validate_size(struct amdgpu_device 
>> *adev,
>>  goto fail;
>>  }
>>   
>> -/* TODO add more domains checks, such as AMDGPU_GEM_DOMAIN_CPU */
>> +/* TODO add more domains checks, such as AMDGPU_GEM_DOMAIN_CPU,  
>> AMDGPU_GEM_DOMAIN_DOORBELL */
>>  return true;
>>   
>>   fail:
>> @@ -1013,6 +1021,7 @@ void amdgpu_bo_unpin(struct amdgpu_bo *bo)
>>  } else if (bo->tbo.resource->mem_type == TTM_PL_TT) {
>>  atomic64_sub(amdgpu_bo_size(bo), &adev->gart_pin_size);
>>  }
>> +
> 
> Unrelated newline, probably just a leftover.
> 
> Apart from that the patch is Reviewed-by: Christian König 
> 
> 
> Regards,
> Christian.
> 
>>   }
>>   
>>   static const char *amdgpu_vram_names[] = {
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
>> index 5c4f93ee0c57..3c988cc406e4 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
>> @@ -90,6 +90,7 @@ static inline void amdgpu_res_first(struct ttm_resource 
>> *res,
>>  cur->node = block;
>>  break;
>>  case TTM_PL_TT:
>> +case AMDGPU_PL_DOORBELL:
>>  node = to_ttm_range_mgr_node(res)->mm_nodes;
>>  while (start >= node->size << PAGE_SHIFT)
>>  start -= node++->size << PAGE_SHIFT;
>> @@ -152,6 +153,7 @@ static inline void amdgpu_res_next(struct 
>> amdgpu_res_cursor *cur, uint64_t size)
>>  cur->size = min(amdgpu_vram_mgr_block_size(block), 
>> cur->remaining);
>>  break;
>>  case TTM_PL_TT:
>> +case AMDGPU_PL_DOORBELL:
>>  node = cur->node;
>>   
>>  cur->node = ++node;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> index 55e0284b2bdd..6f61491ef3dd 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> @@ -128,6 +128,7 @@ static void amdgpu_evict_flags(struct ttm_buffer_object 
>> *bo,
>>  case AMDGPU_PL_GDS:
>>  case AMDGPU_PL_GWS:
>>  case AMDGPU_PL_OA:
>> +case AMDGPU_PL_DOORBELL:
>>  placement->num_placement = 0;
>>  placement->num_busy_placement = 0;
>>  return;
>> @@ -500,9 +501,11 @@ static int amdgpu_bo_move(struct ttm_buffer_object *bo, 
>> bool evict,
>>  if (old_mem->mem_type == AMDGPU_PL_GDS ||
>>  old_mem->mem_type == AMDGPU_PL_GWS ||
>>  old_mem->mem_type == AMDGPU_PL_OA ||
>> +old_mem->mem_type == AMDGPU_PL_DOORBELL ||
>>  new_mem->mem_type == AMDGPU_PL_GDS ||
>>  new_mem->mem_type == AMDGPU_PL_GWS ||
>> -new_mem->mem_type == AMDGPU_PL_OA) {
>> +new_mem->mem_type == AMDGPU_PL_OA ||
>> +new_mem->mem_type == AMDGPU_PL_DOORBELL) {
>>  /* Nothing to save here */
>>  ttm_bo_move_null(bo, new_mem);
>>  goto out;
>> @@ -586,6 +589,12 @@ static int amdgpu_ttm_io_mem_reserve(struct ttm_device 
>> 

Re: [PATCH 03/16] drm/amdgpu: create a new file for doorbell manager

2023-03-30 Thread Luben Tuikov
Hi Shashank,

Inline:

On 2023-03-30 07:09, Christian König wrote:
> On 29.03.23 at 17:47, Shashank Sharma wrote:
>> From: Shashank Sharma 
>>
>> This patch:
>> - creates a new file for doorbell management.
>> - moves doorbell code from amdgpu_device.c to this file.
>>
>> Cc: Alex Deucher 
>> Cc: Christian Koenig 
>> Signed-off-by: Shashank Sharma 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/Makefile   |   2 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c| 164 ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h  |  22 +++
>>   .../gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c  | 186 ++
>>   4 files changed, 209 insertions(+), 165 deletions(-)
>>   create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell_mgr.c
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile 
>> b/drivers/gpu/drm/amd/amdgpu/Makefile
>> index 798d0e9a60b7..204665f20319 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/Makefile
>> +++ b/drivers/gpu/drm/amd/amdgpu/Makefile
>> @@ -41,7 +41,7 @@ ccflags-y := -I$(FULL_AMD_PATH)/include/asic_reg \
>>   amdgpu-y := amdgpu_drv.o
>>   
>>   # add KMS driver
>> -amdgpu-y += amdgpu_device.o amdgpu_kms.o \
>> +amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o amdgpu_kms.o \
>>  amdgpu_atombios.o atombios_crtc.o amdgpu_connectors.o \
>>  atom.o amdgpu_fence.o amdgpu_ttm.o amdgpu_object.o amdgpu_gart.o \
>>  amdgpu_encoders.o amdgpu_display.o amdgpu_i2c.o \
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 57ee1c4a81e9..7f8fcac4f18b 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -579,94 +579,6 @@ void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev,
>>  }
>>   }
>>   
>> -/**
>> - * amdgpu_mm_rdoorbell - read a doorbell dword
>> - *
>> - * @adev: amdgpu_device pointer
>> - * @index: doorbell index
>> - *
>> - * Returns the value in the doorbell aperture at the
>> - * requested doorbell index (CIK).
>> - */
>> -u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
>> -{
>> -if (amdgpu_device_skip_hw_access(adev))
>> -return 0;
>> -
>> -if (index < adev->doorbell.num_kernel_doorbells) {
>> -return readl(adev->doorbell.ptr + index);
>> -} else {
>> -DRM_ERROR("reading beyond doorbell aperture: 0x%08x!\n", index);
>> -return 0;
>> -}
>> -}
>> -
>> -/**
>> - * amdgpu_mm_wdoorbell - write a doorbell dword
>> - *
>> - * @adev: amdgpu_device pointer
>> - * @index: doorbell index
>> - * @v: value to write
>> - *
>> - * Writes @v to the doorbell aperture at the
>> - * requested doorbell index (CIK).
>> - */
>> -void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
>> -{
>> -if (amdgpu_device_skip_hw_access(adev))
>> -return;
>> -
>> -if (index < adev->doorbell.num_kernel_doorbells) {
>> -writel(v, adev->doorbell.ptr + index);
>> -} else {
>> -DRM_ERROR("writing beyond doorbell aperture: 0x%08x!\n", index);
>> -}
>> -}
>> -
>> -/**
>> - * amdgpu_mm_rdoorbell64 - read a doorbell Qword
>> - *
>> - * @adev: amdgpu_device pointer
>> - * @index: doorbell index
>> - *
>> - * Returns the value in the doorbell aperture at the
>> - * requested doorbell index (VEGA10+).
>> - */
>> -u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 index)
>> -{
>> -if (amdgpu_device_skip_hw_access(adev))
>> -return 0;
>> -
>> -if (index < adev->doorbell.num_kernel_doorbells) {
>> -return atomic64_read((atomic64_t *)(adev->doorbell.ptr + 
>> index));
>> -} else {
>> -DRM_ERROR("reading beyond doorbell aperture: 0x%08x!\n", index);
>> -return 0;
>> -}
>> -}
>> -
>> -/**
>> - * amdgpu_mm_wdoorbell64 - write a doorbell Qword
>> - *
>> - * @adev: amdgpu_device pointer
>> - * @index: doorbell index
>> - * @v: value to write
>> - *
>> - * Writes @v to the doorbell aperture at the
>> - * requested doorbell index (VEGA10+).
>> - */
>> -void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, u32 index, u64 v)
>> -{
>> -if (amdgpu_device_skip_hw_access(adev))
>> -return;
>> -
>> -if (index < adev->doorbell.num_kernel_doorbells) {
>> -atomic64_set((atomic64_t *)(adev->doorbell.ptr + index), v);
>> -} else {
>> -DRM_ERROR("writing beyond doorbell aperture: 0x%08x!\n", index);
>> -}
>> -}
>> -
>>   /**
>>* amdgpu_device_indirect_rreg - read an indirect register
>>*
>> @@ -1016,82 +928,6 @@ int amdgpu_device_pci_reset(struct amdgpu_device *adev)
>>  return pci_reset_function(adev->pdev);
>>   }
>>   
>> -/*
>> - * GPU doorbell aperture helpers function.
>> - */
>> -/**
>> - * amdgpu_device_doorbell_init - Init doorbell driver information.
>> - *
>> - * @adev: amdgpu_device pointer
>> - *
>> - * Init doorbell driver information (CIK)
>> - * Returns 0 on success, error on failure.
>> - */
>> 

Re: [PATCH 01/16] drm/amdgpu: rename num_doorbells

2023-03-30 Thread Luben Tuikov
->doorbell.size = pci_resource_len(adev->pdev, 2);
>  
>   if (adev->enable_mes) {
> - adev->doorbell.num_doorbells =
> + adev->doorbell.num_kernel_doorbells =
>   adev->doorbell.size / sizeof(u32);
>   } else {
> - adev->doorbell.num_doorbells =
> + adev->doorbell.num_kernel_doorbells =
>   min_t(u32, adev->doorbell.size / sizeof(u32),
> adev->doorbell_index.max_assignment+1);
> - if (adev->doorbell.num_doorbells == 0)
> + if (adev->doorbell.num_kernel_doorbells == 0)
>   return -EINVAL;
>  
>   /* For Vega, reserve and map two pages on doorbell BAR since 
> SDMA
>* paging queue doorbell use the second page. The
>* AMDGPU_DOORBELL64_MAX_ASSIGNMENT definition assumes all the
>* doorbells are in the first page. So with paging queue 
> enabled,
> -  * the max num_doorbells should + 1 page (0x400 in dword)
> +  * the max num_kernel_doorbells should + 1 page (0x400 in dword)
>*/
>   if (adev->asic_type >= CHIP_VEGA10)
> - adev->doorbell.num_doorbells += 0x400;
> + adev->doorbell.num_kernel_doorbells += 0x400;
>   }
>  
>   adev->doorbell.ptr = ioremap(adev->doorbell.base,
> -  adev->doorbell.num_doorbells *
> +  adev->doorbell.num_kernel_doorbells *
>sizeof(u32));
>   if (adev->doorbell.ptr == NULL)
>   return -ENOMEM;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
> index 7199b6b0be81..12263986f889 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_doorbell.h
> @@ -29,7 +29,9 @@ struct amdgpu_doorbell {
>   resource_size_t base;
>   resource_size_t size;
>   u32 __iomem *ptr;
> - u32 num_doorbells;  /* Number of doorbells actually 
> reserved for amdgpu. */
> +
> + /* Number of doorbells reserved for amdgpu kernel driver */
> + u32 num_kernel_doorbells;

The variable name should be indented to the same column as the previous 
variables.
u32 num_kernel_doorbells;

With that change, this patch is
Acked-by: Luben Tuikov 
-- 
Regards,
Luben

>  };
>  
>  /* Reserved doorbells for amdgpu (including multimedia).



Re: [PATCH] drm/amdgpu: Fix desktop freezed after gpu-reset

2023-03-30 Thread Luben Tuikov
Hi Alan,

Inline:

On 2023-03-30 06:48, Christian König wrote:
> On 30.03.23 at 11:15, Liu, HaoPing (Alan) wrote:
>>
>> [AMD Official Use Only - General]
>>
>>  
>>
>> Hi Christian,
>>
>>  
>>
>> Thanks for the review. Please see inline.
>>
>>  
>>
>> Best Regards,
>>
>> Alan
>>
>>  
>>
>> -Original Message-
>> From: Christian König 
>> Sent: Tuesday, March 28, 2023 7:16 PM
>> To: Liu, HaoPing (Alan) ; amd-gfx@lists.freedesktop.org
>> Cc: Lakha, Bhawanpreet 
>> Subject: Re: [PATCH] drm/amdgpu: Fix desktop freezed after gpu-reset
>>
>>  
>>
>> Am 27.03.23 um 17:20 schrieb Alan Liu:
>>
>> > [Why]
>>
>> > After gpu-reset, sometimes the driver would fail to enable vblank irq,
>>
>> > causing flip_done timed out and the desktop freezed.
>>
>> > 
>>
>> > During gpu-reset, we will disable and enable vblank irq in
>>
>> > dm_suspend() and dm_resume(). Later on in
>>
>> > amdgpu_irq_gpu_reset_resume_helper(), we will check irqs' refcount and 
>> > decide to enable or disable the irqs again.
>>
>> > 
>>
>> > However, we have 2 sets of API for controling vblank irq, one is
>>
>> > dm_vblank_get/put() and another is amdgpu_irq_get/put(). Each API has
>>
>> > its own refcount and flag to store the state of vblank irq, and they
>>
>> > are not synchronized.
>>
>>  
>>
>> This is the source of the problem and you should address this instead.
>>
>> The change you suggested below would break in some use cases.
>>
>>  
>>
>> In struct drm_vblank_crtc we have a vblank irq refcount controlled by drm 
>> layer, and in struct amdgpu_irq_src we have enabled_types as refcount for 
>> each irq controlled by the dm.
>>
>> I think the best solution will be to get rid of the refcount in drm and only 
>> maintain the dm one, and add a crtc function for the drm layer to get the 
>> refcount/state of vblank.
>>
>> But this may be dangerous since it’s a change in drm layer. Do you have any 
>> comments?
>>
> 
> You don't necessarily need to remove it completely, what you can do as well 
> is properly chaining of them.
> 
> E.g. when the DRM counter goes from 0->1 or 1->0 it calls some function to 
> enable/disable the hw irq. In this situation you call 
> amdgpu_irq_get()/amdgpu_irq_put() as appropriate.
> 
> The code in this patch already looks like it goes into the right
> direction regarding that. It just seems to be that you have some race issues
> when you need to add checks that the IRQ counter doesn't go below 0.

Changing the DRM layer is generally not a good idea, unless
there is a compelling reason to do so, like fixing a bug, or adding
a new feature benefiting all drivers. As there are many drivers using
DRM, any changes in DRM are vetted thoroughly and need a good reason to
take place.

I suggest you follow Christian's advice.

Note that there's already a callback from drm_vblank_get() down
to amdgpu_enable_vblank_kms() which calls amdgpu_irq_get(). Perhaps,
you can leverage that. Similarly for the drm_vblank_put() to
the amdgpu_vblank_put() path.
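For illustration only, the chaining described above could look roughly like this
as a standalone userspace model (all names here are hypothetical, not the actual
DRM/amdgpu API):

```c
#include <assert.h>

/* Hypothetical model of chaining two refcounts: the DRM-level vblank
 * refcount only forwards to the amdgpu-level refcount on its 0->1 and
 * 1->0 transitions, so the two counts can never get out of sync. */

static int amdgpu_level_refcount;
static int hw_irq_enabled;

static void model_amdgpu_irq_get(void)
{
	if (++amdgpu_level_refcount == 1)
		hw_irq_enabled = 1;	/* would program the HW register */
}

static void model_amdgpu_irq_put(void)
{
	/* Guard against going below 0, as suggested for amdgpu_irq_put(). */
	if (amdgpu_level_refcount > 0 && --amdgpu_level_refcount == 0)
		hw_irq_enabled = 0;	/* would disable in the HW register */
}

static int drm_level_refcount;

static void model_drm_vblank_get(void)
{
	if (++drm_level_refcount == 1)
		model_amdgpu_irq_get();	/* 0->1: forward to the lower layer */
}

static void model_drm_vblank_put(void)
{
	if (--drm_level_refcount == 0)
		model_amdgpu_irq_put();	/* 1->0: forward to the lower layer */
}
```

With this shape, nested get/get/put/put sequences touch the hardware only on the
outermost transitions, which is the point of the chaining.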

> 
>>  
>>
>> > 
>>
>> > In drm we use the first API to control vblank irq but in
>>
>> > amdgpu_irq_gpu_reset_resume_helper() we use the second set of API.
>>
>> > 
>>
>> > The failure happens when vblank irq was enabled by dm_vblank_get()
>>
>> > before gpu-reset, we have vblank->enabled true. However, during
>>
>> > gpu-reset, in amdgpu_irq_gpu_reset_resume_helper(), vblank irq's state
>>
>> > checked from
>>
>> > amdgpu_irq_update() is DISABLED. So finally it will disable vblank irq
>>
>> > again. After gpu-reset, if there is a cursor plane commit, the driver
>>
>> > will try to enable vblank irq by calling drm_vblank_enable(), but the
>>
>> > vblank->enabled is still true, so it fails to turn on vblank irq and
>>
>> > causes flip_done to not complete in the vblank irq handler, and the desktop
>>
>> > becomes frozen.
>>
>> > 
>>
>> > [How]
>>
>> > Combining the 2 vblank control APIs by letting drm's API finally calls
>>
>> > amdgpu_irq's API, so the irq's refcount and state of both APIs can be
>>
>> > synchronized. Also add a check to prevent refcount from being less
>>
>> > than
>>
>> > 0 in amdgpu_irq_put().
>>
>> > 
>>
>> > Signed-off-by: Alan Liu 
>>
>> > ---
>>
>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c    |  3 +++
>>
>> >   .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c | 14 ++
>>
>> >   2 files changed, 13 insertions(+), 4 deletions(-)
>>
>> > 
>>
>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>
>> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>
>> > index a6aef488a822..1b66003657e2 100644
>>
>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>
>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>
>> > @@ -597,6 +597,9 @@ int amdgpu_irq_put(struct amdgpu_device *adev, struct 
>> > amdgpu_irq_src *src,
>>
>> >    if (!src->enabled_types || !src->funcs->set)
>>
>> >       return -EINVAL;
>>
>> >  
>>
>> > + if (!amdgpu_irq_enabled(adev, src, type))
>>
>> > +   return 0;

Re: [PATCH] drm/amdgpu: Fix desktop freezed after gpu-reset

2023-03-30 Thread Luben Tuikov
Hi Alan,

I'll comment in the other thread, as it seems Christian commented directly on your
patch the day after my comment, rather than following up on my email sent the
previous day. We now have two divergent threads where you posted two identical
comments, and it shouldn't be like that--we should have one thread only.

Regards,
Luben

On 2023-03-30 04:59, Liu, HaoPing (Alan) wrote:
> [AMD Official Use Only - General]
> 
> 
> Hi, Luben
> 
>  
> 
> Thanks for the review. Please see inline.
> 
>  
> 
> Best Regards,
> 
> Alan
> 
>  
> 
> -Original Message-
> From: Tuikov, Luben 
> Sent: Tuesday, March 28, 2023 3:00 AM
> To: Liu, HaoPing (Alan) ; amd-gfx@lists.freedesktop.org
> Cc: Lakha, Bhawanpreet 
> Subject: Re: [PATCH] drm/amdgpu: Fix desktop freezed after gpu-reset
> 
>  
> 
> Hi,
> 
>  
> 
> That's a good fix. Some questions and comments below:
> 
>  
> 
> On 2023-03-27 11:20, Alan Liu wrote:
> 
>> [Why]
> 
>> After gpu-reset, sometimes the driver would fail to enable vblank irq,
> 
>> causing flip_done to time out and the desktop to freeze.
> 
>>
> 
>> During gpu-reset, we will disable and enable vblank irq in
> 
>> dm_suspend() and dm_resume(). Later on in
> 
>> amdgpu_irq_gpu_reset_resume_helper(), we will check irqs' refcount and 
>> decide to enable or disable the irqs again.
> 
>>
> 
>> However, we have 2 sets of API for controlling vblank irq, one is
> 
>> dm_vblank_get/put() and another is amdgpu_irq_get/put(). Each API has
> 
>> its own refcount and flag to store the state of vblank irq, and they
> 
>> are not synchronized.
> 
>  
> 
> Is it possible to reconcile controlling VBlank IRQ to a single refcount?
> 
>  
> 
> In struct drm_vblank_crtc, we have “enabled” and “refcount” to store vblank 
> irq state, and in struct amdgpu_irq_src we have “enabled_types” as the 
> refcount for each irq in dm layer.
> 
> To reconcile vblank irq to a single refcount, my idea is to remove enabled 
> and refcount from struct drm_vblank_crtc, and add a callback function like 
> vblank_irq_enabled() to drm_crtc_funcs.
> 
> Drm layer can use this function to check the state or refcount of vblank irq 
> from dm layer. But it may be dangerous because it is a change to drm layer. 
> Do you have any comments?
> 
>  
> 
>>
> 
>> In drm we use the first API to control vblank irq but in
> 
>> amdgpu_irq_gpu_reset_resume_helper() we use the second set of API.
> 
>>
> 
>> The failure happens when vblank irq was enabled by dm_vblank_get()
> 
>> before gpu-reset, we have vblank->enabled true. However, during
> 
>> gpu-reset, in amdgpu_irq_gpu_reset_resume_helper(), vblank irq's state
> 
>> checked from
> 
>> amdgpu_irq_update() is DISABLED. So finally it will disable vblank irq
> 
>> again. After gpu-reset, if there is a cursor plane commit, the driver
> 
>> will try to enable vblank irq by calling drm_vblank_enable(), but the
> 
>> vblank->enabled is still true, so it fails to turn on vblank irq and
> 
>> causes flip_done to not complete in the vblank irq handler, and the desktop
> 
>> becomes frozen.
> 
>>
> 
>> [How]
> 
>> Combining the 2 vblank control APIs by letting drm's API finally calls
> 
>> amdgpu_irq's API, so the irq's refcount and state of both APIs can be
> 
>> synchronized. Also add a check to prevent refcount from being less
> 
>> than
> 
>> 0 in amdgpu_irq_put().
> 
>>
> 
>> Signed-off-by: Alan Liu 
> 
>> ---
> 
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c    |  3 +++
> 
>>  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c | 14
> 
>> ++
> 
>>  2 files changed, 13 insertions(+), 4 deletions(-)
> 
>>
> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> 
>> index a6aef488a822..1b66003657e2 100644
> 
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> 
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> 
>> @@ -597,6 +597,9 @@ int amdgpu_irq_put(struct amdgpu_device *adev, struct 
>> amdgpu_irq_src *src,
> 
>>    if (!src->enabled_types || !src->funcs->set)
> 
>>   return -EINVAL;
> 
>> 
> 
>> + if (!amdgpu_irq_enabled(adev, src, type))
> 
>> +   return 0;
> 
>> +
> 
>>    if (atomic_dec_and_test(&src->enabled_types[type]))
> 
>>   return amdgpu_irq_update(adev, src, type);
> 
>> 
> 
>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c
> 
>> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c
> 
>> index dc4f37240beb..e04f846b0b19 100644
> 
>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c
> 
>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c
> 
>> @@ -146,7 +146,7 @@ static void vblank_control_worker(struct
> 
>> work_struct *work)
> 
>> 
> 
>>  static inline int dm_set_vblank(struct drm_crtc *crtc, bool enable) 
> 
>> {
> 
>> -  enum dc_irq_source irq_source;
> 
>> + int irq_type;
> 
>>    struct amdgpu_crtc *acrtc = to_amdgpu_crtc(crtc);
> 

Re: [PATCH 1/3] drm/amdgpu: add sysfs node vclk1 and dclk1

2023-03-29 Thread Luben Tuikov
Series is,

Acked-by: Luben Tuikov 

Regards,
Luben

On 2023-03-29 06:51, Tong Liu01 wrote:
> User can check pp_dpm_vclk1 and pp_dpm_dclk1 for DPM frequency of
> vcn and dcn
> 
> Signed-off-by: Tong Liu01 
> ---
>  .../gpu/drm/amd/include/kgd_pp_interface.h|  2 ++
>  drivers/gpu/drm/amd/pm/amdgpu_pm.c| 32 +++
>  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c |  8 +
>  3 files changed, 42 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/include/kgd_pp_interface.h 
> b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
> index 86b6b0c9fb02..9f542f6e19ed 100644
> --- a/drivers/gpu/drm/amd/include/kgd_pp_interface.h
> +++ b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
> @@ -104,7 +104,9 @@ enum pp_clock_type {
>   PP_FCLK,
>   PP_DCEFCLK,
>   PP_VCLK,
> + PP_VCLK1,
>   PP_DCLK,
> + PP_DCLK1,
>   OD_SCLK,
>   OD_MCLK,
>   OD_VDDC_CURVE,
> diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
> b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> index d75a67cfe523..9991447b5f14 100644
> --- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> +++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> @@ -1180,6 +1180,21 @@ static ssize_t amdgpu_set_pp_dpm_vclk(struct device 
> *dev,
>   return amdgpu_set_pp_dpm_clock(dev, PP_VCLK, buf, count);
>  }
>  
> +static ssize_t amdgpu_get_pp_dpm_vclk1(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + return amdgpu_get_pp_dpm_clock(dev, PP_VCLK1, buf);
> +}
> +
> +static ssize_t amdgpu_set_pp_dpm_vclk1(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf,
> + size_t count)
> +{
> + return amdgpu_set_pp_dpm_clock(dev, PP_VCLK1, buf, count);
> +}
> +
>  static ssize_t amdgpu_get_pp_dpm_dclk(struct device *dev,
>   struct device_attribute *attr,
>   char *buf)
> @@ -1195,6 +1210,21 @@ static ssize_t amdgpu_set_pp_dpm_dclk(struct device 
> *dev,
>   return amdgpu_set_pp_dpm_clock(dev, PP_DCLK, buf, count);
>  }
>  
> +static ssize_t amdgpu_get_pp_dpm_dclk1(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + return amdgpu_get_pp_dpm_clock(dev, PP_DCLK1, buf);
> +}
> +
> +static ssize_t amdgpu_set_pp_dpm_dclk1(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf,
> + size_t count)
> +{
> + return amdgpu_set_pp_dpm_clock(dev, PP_DCLK1, buf, count);
> +}
> +
>  static ssize_t amdgpu_get_pp_dpm_dcefclk(struct device *dev,
>   struct device_attribute *attr,
>   char *buf)
> @@ -2002,7 +2032,9 @@ static struct amdgpu_device_attr amdgpu_device_attrs[] 
> = {
>   AMDGPU_DEVICE_ATTR_RW(pp_dpm_socclk,
> ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
>   AMDGPU_DEVICE_ATTR_RW(pp_dpm_fclk,  
> ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
>   AMDGPU_DEVICE_ATTR_RW(pp_dpm_vclk,  
> ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
> + AMDGPU_DEVICE_ATTR_RW(pp_dpm_vclk1, 
> ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
>   AMDGPU_DEVICE_ATTR_RW(pp_dpm_dclk,  
> ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
> + AMDGPU_DEVICE_ATTR_RW(pp_dpm_dclk1, 
> ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
>   AMDGPU_DEVICE_ATTR_RW(pp_dpm_dcefclk,   
> ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
>   AMDGPU_DEVICE_ATTR_RW(pp_dpm_pcie,  
> ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
>   AMDGPU_DEVICE_ATTR_RW(pp_sclk_od,   
> ATTR_FLAG_BASIC),
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c 
> b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> index 94fe8593444a..056ac2b512eb 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> @@ -2022,8 +2022,12 @@ static int smu_force_ppclk_levels(void *handle,
>   clk_type = SMU_DCEFCLK; break;
>   case PP_VCLK:
>   clk_type = SMU_VCLK; break;
> + case PP_VCLK1:
> + clk_type = SMU_VCLK1; break;
>   case PP_DCLK:
>   clk_type = SMU_DCLK; break;
> + case PP_DCLK1:
> + clk_type = SMU_DCLK1; break;
>   case OD_SCLK:
>   clk_type = SMU_OD_SCLK; break;
>   case OD_MCLK:
> @@ -2409,8 +2413,12 @@ static enum smu_clk_type smu_convert_to_smuclk(enum 
> pp_clock_type type)
>   clk_type = SMU_DCEFCLK; break;
>   case PP_VCLK:
>   clk_type = SMU_VCLK; break;
> + case PP_VCLK1:
> + clk_type = SMU_VCLK1; break;
>   case PP_DCLK:
>   clk_type = SMU_DCLK; break;
> + case PP_DCLK1:
> + clk_type = SMU_DCLK1; break;
>   case OD_SCLK:
>   clk_type = SMU_OD_SCLK; break;
>   case OD_MCLK:



Re: [PATCH] drm/amdgpu: simplify amdgpu_ras_eeprom.c

2023-03-28 Thread Luben Tuikov
On 2023-03-27 20:11, Alex Deucher wrote:
> All chips that support RAS also support IP discovery, so
> use the IP versions rather than a mix of IP versions and
> asic types.
> 
> Signed-off-by: Alex Deucher 
> Cc: Luben Tuikov 
> ---
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 72 ++-
>  1 file changed, 20 insertions(+), 52 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index 3106fa8a15ef..c2ef2b1456bc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -106,48 +106,13 @@
>  #define to_amdgpu_device(x) (container_of(x, struct amdgpu_ras, 
> eeprom_control))->adev
>  
>  static bool __is_ras_eeprom_supported(struct amdgpu_device *adev)
> -{
> - if (adev->asic_type == CHIP_IP_DISCOVERY) {
> - switch (adev->ip_versions[MP1_HWIP][0]) {
> - case IP_VERSION(13, 0, 0):
> - case IP_VERSION(13, 0, 10):
> - return true;
> - default:
> - return false;
> - }
> - }
> -
> - return  adev->asic_type == CHIP_VEGA20 ||
> - adev->asic_type == CHIP_ARCTURUS ||
> - adev->asic_type == CHIP_SIENNA_CICHLID ||
> - adev->asic_type == CHIP_ALDEBARAN;
> -}
> -
> -static bool __get_eeprom_i2c_addr_arct(struct amdgpu_device *adev,
> -struct amdgpu_ras_eeprom_control 
> *control)
> -{
> - struct atom_context *atom_ctx = adev->mode_info.atom_context;
> -
> - if (!control || !atom_ctx)
> - return false;
> -
> - if (strnstr(atom_ctx->vbios_version,
> - "D342",
> - sizeof(atom_ctx->vbios_version)))
> - control->i2c_address = EEPROM_I2C_MADDR_0;
> - else
> - control->i2c_address = EEPROM_I2C_MADDR_4;
> -
> - return true;
> -}
> -
> -static bool __get_eeprom_i2c_addr_ip_discovery(struct amdgpu_device *adev,
> -struct amdgpu_ras_eeprom_control 
> *control)
>  {
>   switch (adev->ip_versions[MP1_HWIP][0]) {
> + case IP_VERSION(11, 0, 2): /* VEGA20 and ARCTURUS */
> + case IP_VERSION(11, 0, 7):
>   case IP_VERSION(13, 0, 0):
> + case IP_VERSION(13, 0, 2):
>   case IP_VERSION(13, 0, 10):

I'd add the rest of the proper names here which are being deleted by this change,
so as to not lose this information in this commit: Sienna Cichlid and Aldebaran;
the rest can be left blank, as per the current state of the code.

> - control->i2c_address = EEPROM_I2C_MADDR_4;
>   return true;
>   default:
>   return false;
> @@ -178,29 +143,32 @@ static bool __get_eeprom_i2c_addr(struct amdgpu_device 
> *adev,
>   return true;
>   }
>  
> - switch (adev->asic_type) {
> - case CHIP_VEGA20:
> - control->i2c_address = EEPROM_I2C_MADDR_0;
> + switch (adev->ip_versions[MP1_HWIP][0]) {
> + case IP_VERSION(11, 0, 2):
> + /* VEGA20 and ARCTURUS */
> + if (adev->asic_type == CHIP_VEGA20)
> + control->i2c_address = EEPROM_I2C_MADDR_0;
> + else if (strnstr(atom_ctx->vbios_version,

In the code this is qualified with atom_ctx != NULL, and if atom_ctx is NULL,
we return false. So, this is fine, iff we can guarantee that
"atom_ctx" will never be NULL. If, OTOH, we cannot guarantee that,
then we need to add,
else if (!atom_ctx)
return false;
else if (strnstr(...

Although, I do recognize that for Aldebaran below, we do not qualify
atom_ctx, so we should probably qualify there too.

> +  "D342",
> +  sizeof(atom_ctx->vbios_version)))
> + control->i2c_address = EEPROM_I2C_MADDR_0;
> + else
> + control->i2c_address = EEPROM_I2C_MADDR_4;
>   return true;
> -
> - case CHIP_ARCTURUS:
> - return __get_eeprom_i2c_addr_arct(adev, control);
> -
> - case CHIP_SIENNA_CICHLID:
> + case IP_VERSION(11, 0, 7):
>   control->i2c_address = EEPROM_I2C_MADDR_0;
>   return true;
> -
> - case CHIP_ALDEBARAN:
> + case IP_VERSION(13, 0, 2):
>   if (strnstr(atom_ctx->vbios_version, "D673",
>   sizeof(atom_ctx->vbios_version)))
> control->i2c_address = EEPROM_I2C_MADDR_4;

Re: [PATCH] drm/amdgpu: enable sysfs node pp_dpm_vclk1 for some asics

2023-03-28 Thread Luben Tuikov
Looks good--thanks!

Acked-by: Luben Tuikov 

Regards,
Luben

On 2023-03-28 07:41, Tong Liu01 wrote:
> Add sysfs node pp_dpm_vclk1 for gc11.0.3
> 
> Signed-off-by: Tong Liu01 
> ---
>  .../gpu/drm/amd/include/kgd_pp_interface.h|  1 +
>  drivers/gpu/drm/amd/pm/amdgpu_pm.c| 22 +++
>  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c |  4 
>  3 files changed, 27 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/include/kgd_pp_interface.h 
> b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
> index 86b6b0c9fb02..fe75497eeeab 100644
> --- a/drivers/gpu/drm/amd/include/kgd_pp_interface.h
> +++ b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
> @@ -104,6 +104,7 @@ enum pp_clock_type {
>   PP_FCLK,
>   PP_DCEFCLK,
>   PP_VCLK,
> + PP_VCLK1,
>   PP_DCLK,
>   OD_SCLK,
>   OD_MCLK,
> diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
> b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> index d75a67cfe523..1da6e9469450 100644
> --- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> +++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> @@ -1180,6 +1180,21 @@ static ssize_t amdgpu_set_pp_dpm_vclk(struct device 
> *dev,
>   return amdgpu_set_pp_dpm_clock(dev, PP_VCLK, buf, count);
>  }
>  
> +static ssize_t amdgpu_get_pp_dpm_vclk1(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + return amdgpu_get_pp_dpm_clock(dev, PP_VCLK1, buf);
> +}
> +
> +static ssize_t amdgpu_set_pp_dpm_vclk1(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf,
> + size_t count)
> +{
> + return amdgpu_set_pp_dpm_clock(dev, PP_VCLK1, buf, count);
> +}
> +
>  static ssize_t amdgpu_get_pp_dpm_dclk(struct device *dev,
>   struct device_attribute *attr,
>   char *buf)
> @@ -2002,6 +2017,7 @@ static struct amdgpu_device_attr amdgpu_device_attrs[] 
> = {
>   AMDGPU_DEVICE_ATTR_RW(pp_dpm_socclk,
> ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
>   AMDGPU_DEVICE_ATTR_RW(pp_dpm_fclk,  
> ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
>   AMDGPU_DEVICE_ATTR_RW(pp_dpm_vclk,  
> ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
> + AMDGPU_DEVICE_ATTR_RW(pp_dpm_vclk1, 
> ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
>   AMDGPU_DEVICE_ATTR_RW(pp_dpm_dclk,  
> ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
>   AMDGPU_DEVICE_ATTR_RW(pp_dpm_dcefclk,   
> ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
>   AMDGPU_DEVICE_ATTR_RW(pp_dpm_pcie,  
> ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
> @@ -2091,6 +2107,12 @@ static int default_attr_update(struct amdgpu_device 
> *adev, struct amdgpu_device_
> gc_ver == IP_VERSION(11, 0, 2) ||
> gc_ver == IP_VERSION(11, 0, 3)))
>   *states = ATTR_STATE_UNSUPPORTED;
> + } else if (DEVICE_ATTR_IS(pp_dpm_vclk1)) {
> + if (!((gc_ver == IP_VERSION(10, 3, 1) ||
> +gc_ver == IP_VERSION(10, 3, 0) ||
> +gc_ver == IP_VERSION(11, 0, 2) ||
> +gc_ver == IP_VERSION(11, 0, 3)) && 
> adev->vcn.num_vcn_inst >= 2))
> + *states = ATTR_STATE_UNSUPPORTED;
>   } else if (DEVICE_ATTR_IS(pp_dpm_dclk)) {
>   if (!(gc_ver == IP_VERSION(10, 3, 1) ||
> gc_ver == IP_VERSION(10, 3, 0) ||
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c 
> b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> index b5d64749990e..bffbef3f666d 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> @@ -2006,6 +2006,8 @@ static int smu_force_ppclk_levels(void *handle,
>   clk_type = SMU_DCEFCLK; break;
>   case PP_VCLK:
>   clk_type = SMU_VCLK; break;
> + case PP_VCLK1:
> + clk_type = SMU_VCLK1; break;
>   case PP_DCLK:
>   clk_type = SMU_DCLK; break;
>   case OD_SCLK:
> @@ -2393,6 +2395,8 @@ static enum smu_clk_type smu_convert_to_smuclk(enum 
> pp_clock_type type)
>   clk_type = SMU_DCEFCLK; break;
>   case PP_VCLK:
>   clk_type = SMU_VCLK; break;
> + case PP_VCLK1:
> + clk_type = SMU_VCLK1; break;
>   case PP_DCLK:
>   clk_type = SMU_DCLK; break;
>   case OD_SCLK:



Re: [PATCH] drm/amdgpu: Fix desktop freezed after gpu-reset

2023-03-27 Thread Luben Tuikov
Hi,

That's a good fix. Some questions and comments below:

On 2023-03-27 11:20, Alan Liu wrote:
> [Why]
> After gpu-reset, sometimes the driver would fail to enable vblank irq,
> causing flip_done to time out and the desktop to freeze.
> 
> During gpu-reset, we will disable and enable vblank irq in dm_suspend()
> and dm_resume(). Later on in amdgpu_irq_gpu_reset_resume_helper(), we
> will check irqs' refcount and decide to enable or disable the irqs again.
> 
> However, we have 2 sets of API for controlling vblank irq, one is
> dm_vblank_get/put() and another is amdgpu_irq_get/put(). Each API has
> its own refcount and flag to store the state of vblank irq, and they
> are not synchronized.

Is it possible to reconcile controlling VBlank IRQ to a single refcount?

> 
> In drm we use the first API to control vblank irq but in
> amdgpu_irq_gpu_reset_resume_helper() we use the second set of API.
> 
> The failure happens when vblank irq was enabled by dm_vblank_get() before
> gpu-reset, we have vblank->enabled true. However, during gpu-reset, in
> amdgpu_irq_gpu_reset_resume_helper(), vblank irq's state checked from
> amdgpu_irq_update() is DISABLED. So finally it will disable vblank irq
> again. After gpu-reset, if there is a cursor plane commit, the driver
> will try to enable vblank irq by calling drm_vblank_enable(), but the
> vblank->enabled is still true, so it fails to turn on vblank irq and
> causes flip_done to not complete in the vblank irq handler, and the desktop
> becomes frozen.
> 
> [How]
> Combining the 2 vblank control APIs by letting drm's API finally calls
> amdgpu_irq's API, so the irq's refcount and state of both APIs can be
> synchronized. Also add a check to prevent refcount from being less than
> 0 in amdgpu_irq_put().
> 
> Signed-off-by: Alan Liu 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c|  3 +++
>  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c | 14 ++
>  2 files changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> index a6aef488a822..1b66003657e2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> @@ -597,6 +597,9 @@ int amdgpu_irq_put(struct amdgpu_device *adev, struct 
> amdgpu_irq_src *src,
>   if (!src->enabled_types || !src->funcs->set)
>   return -EINVAL;
>  
> + if (!amdgpu_irq_enabled(adev, src, type))
> + return 0;
> +
>   if (atomic_dec_and_test(&src->enabled_types[type]))
>   return amdgpu_irq_update(adev, src, type);
>  
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c 
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c
> index dc4f37240beb..e04f846b0b19 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c
> @@ -146,7 +146,7 @@ static void vblank_control_worker(struct work_struct 
> *work)
>  
>  static inline int dm_set_vblank(struct drm_crtc *crtc, bool enable)
>  {
> - enum dc_irq_source irq_source;
> + int irq_type;
>   struct amdgpu_crtc *acrtc = to_amdgpu_crtc(crtc);
>   struct amdgpu_device *adev = drm_to_adev(crtc->dev);
>   struct dm_crtc_state *acrtc_state = to_dm_crtc_state(crtc->state);
> @@ -169,10 +169,16 @@ static inline int dm_set_vblank(struct drm_crtc *crtc, 
> bool enable)
>   if (rc)
>   return rc;
>  
> - irq_source = IRQ_TYPE_VBLANK + acrtc->otg_inst;
> + irq_type = amdgpu_display_crtc_idx_to_irq_type(adev, acrtc->crtc_id);
> +
> + if (enable)
> + rc = amdgpu_irq_get(adev, &adev->crtc_irq, irq_type);
> +
> + else

There's an unnecessary empty line before the "else". It's a good idea
to run patches through scripts/checkpatch.pl.

> + rc = amdgpu_irq_put(adev, &adev->crtc_irq, irq_type);
>  
> - if (!dc_interrupt_set(adev->dm.dc, irq_source, enable))
> - return -EBUSY;
> + if (rc)
> + return rc;
>  
>  skip:
>   if (amdgpu_in_reset(adev))

-- 
Regards,
Luben



Re: [PATCH] drm/scheduler: Fix variable name in function description

2023-03-27 Thread Luben Tuikov
Pushed through drm-misc-next.

Regards,
Luben

On 2023-03-27 11:02, Luben Tuikov wrote:
> Thanks for the fix. I'll push this via amd-staging-drm-next.
> 
> Reviewed-by: Luben Tuikov 
> 
> Regards,
> Luben
> 
> On 2023-03-25 09:15, Caio Novais wrote:
>> Compiling AMD GPU drivers displays two warnings:
>>
>> drivers/gpu/drm/scheduler/sched_main.c:738: warning: Function parameter or 
>> member 'file' not described in 'drm_sched_job_add_syncobj_dependency'
>> drivers/gpu/drm/scheduler/sched_main.c:738: warning: Excess function
>> parameter 'file_private' description in
>> 'drm_sched_job_add_syncobj_dependency'
>>
>> Get rid of them by renaming the variable name on the function description
>>
>> Signed-off-by: Caio Novais 
>> ---
>>  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index 214364fccb71..7db586e6fce6 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -722,7 +722,7 @@ EXPORT_SYMBOL(drm_sched_job_add_dependency);
>>  /**
>>   * drm_sched_job_add_syncobj_dependency - adds a syncobj's fence as a job 
>> dependency
>>   * @job: scheduler job to add the dependencies to
>> - * @file_private: drm file private pointer
>> + * @file: drm file private pointer
>>   * @handle: syncobj handle to lookup
>>   * @point: timeline point
>>   *
> 



Re: [PATCH] drm/scheduler: Fix variable name in function description

2023-03-27 Thread Luben Tuikov
Thanks for the fix. I'll push this via amd-staging-drm-next.

Reviewed-by: Luben Tuikov 

Regards,
Luben

On 2023-03-25 09:15, Caio Novais wrote:
> Compiling AMD GPU drivers displays two warnings:
> 
> drivers/gpu/drm/scheduler/sched_main.c:738: warning: Function parameter or 
> member 'file' not described in 'drm_sched_job_add_syncobj_dependency'
> drivers/gpu/drm/scheduler/sched_main.c:738: warning: Excess function
> parameter 'file_private' description in
> 'drm_sched_job_add_syncobj_dependency'
> 
> Get rid of them by renaming the variable name on the function description
> 
> Signed-off-by: Caio Novais 
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 214364fccb71..7db586e6fce6 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -722,7 +722,7 @@ EXPORT_SYMBOL(drm_sched_job_add_dependency);
>  /**
>   * drm_sched_job_add_syncobj_dependency - adds a syncobj's fence as a job 
> dependency
>   * @job: scheduler job to add the dependencies to
> - * @file_private: drm file private pointer
> + * @file: drm file private pointer
>   * @handle: syncobj handle to lookup
>   * @point: timeline point
>   *



Re: [PATCH] drm/amd/amdgpu: Fix logic bug in fatal error handling

2023-03-23 Thread Luben Tuikov
On 2023-03-23 09:29, Christian König wrote:
> Am 23.03.23 um 13:04 schrieb Srinivasan Shanmugam:
>> CC  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.o
>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2567:28: error: bitwise or with 
>> non-zero value always evaluates to true 
>> [-Werror,-Wtautological-bitwise-compare]
>>if (adev->ras_hw_enabled | AMDGPU_RAS_BLOCK__DF)
>>~^~
>>
>> Presumably the author intended to test if AMDGPU_RAS_BLOCK__DF
>> bit was set if ras is enabled, so that's what I'm changing the
>> code to. Hopefully to do the right thing.
> 
> That looks like a nice catch to me, but I don't really know the ras code 
> that well.
> 
> Hawking, Luben or whoever is more familiar with that should probably 
> comment as well.

Thanks Christian--yeah, it looks like a typo. Fix is already committed
into amd-staging-drm-next.
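The general pattern the compiler warning points at, in miniature (illustrative
only; the MODEL_BLOCK_DF bit value is hypothetical, not the actual RAS code):

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_BLOCK_DF (1u << 3)	/* hypothetical bit, for illustration */

/* Buggy form: a bitwise OR with a non-zero constant is always non-zero,
 * so the condition is unconditionally true. */
static int df_enabled_buggy(uint32_t mask)
{
	return (mask | MODEL_BLOCK_DF) != 0;
}

/* Intended form: test whether the DF bit is actually set in the mask. */
static int df_enabled_fixed(uint32_t mask)
{
	return (mask & MODEL_BLOCK_DF) != 0;
}
```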
-- 
Regards,
Luben



[PATCH 1/2] drm/amdgpu: Remove second moot switch to set EEPROM I2C address

2023-03-23 Thread Luben Tuikov
Remove second switch since it already has its own function and case in the
first switch. This also avoids requalifying the EEPROM I2C address for VEGA20,
SIENNA CICHLID, and ALDEBARAN, as those have been set by the first switch and
shouldn't match SMU v13.0.x.

Cc: Candice Li 
Cc: Kent Russell 
Cc: Alex Deucher 
Fixes: 15822529468331 ("drm/amdgpu: Add EEPROM I2C address for smu v13_0_0")
Fixes: c9bdc6c3cf39df ("drm/amdgpu: Add EEPROM I2C address support for ip 
discovery")
Signed-off-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 2e08fce8752179..5c21480fff9c8b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -205,15 +205,6 @@ static bool __get_eeprom_i2c_addr(struct amdgpu_device 
*adev,
return false;
}
 
-   switch (adev->ip_versions[MP1_HWIP][0]) {
-   case IP_VERSION(13, 0, 0):
-   control->i2c_address = EEPROM_I2C_MADDR_4;
-   break;
-
-   default:
-   break;
-   }
-
return true;
 }
 

base-commit: 0f2fb865a56c747449f645d81cd842492459bcaa
-- 
2.40.0



[PATCH 2/2] drm/amdgpu: Return from switch early for EEPROM I2C address

2023-03-23 Thread Luben Tuikov
As soon as control->i2c_address is set, return; remove the "break;" from the
switch--it is unnecessary. This mimics what happens for some cases in the
switch, where we call a helper function with a direct "return".

Remove final function "return true;" to indicate that the switch is final and
terminal, and that there should be no code after the switch.

Cc: Candice Li 
Cc: Kent Russell 
Cc: Alex Deucher 
Signed-off-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 5c21480fff9c8b..3106fa8a15efef 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -181,14 +181,14 @@ static bool __get_eeprom_i2c_addr(struct amdgpu_device 
*adev,
switch (adev->asic_type) {
case CHIP_VEGA20:
control->i2c_address = EEPROM_I2C_MADDR_0;
-   break;
+   return true;
 
case CHIP_ARCTURUS:
return __get_eeprom_i2c_addr_arct(adev, control);
 
case CHIP_SIENNA_CICHLID:
control->i2c_address = EEPROM_I2C_MADDR_0;
-   break;
+   return true;
 
case CHIP_ALDEBARAN:
if (strnstr(atom_ctx->vbios_version, "D673",
@@ -196,7 +196,7 @@ static bool __get_eeprom_i2c_addr(struct amdgpu_device 
*adev,
control->i2c_address = EEPROM_I2C_MADDR_4;
else
control->i2c_address = EEPROM_I2C_MADDR_0;
-   break;
+   return true;
 
case CHIP_IP_DISCOVERY:
return __get_eeprom_i2c_addr_ip_discovery(adev, control);
@@ -204,8 +204,6 @@ static bool __get_eeprom_i2c_addr(struct amdgpu_device 
*adev,
default:
return false;
}
-
-   return true;
 }
 
 static void
-- 
2.40.0



[PATCH] tests/amdgpu: Allow to exclude a test or a suite of tests

2023-03-21 Thread Luben Tuikov
Add the command line argument -e s[.t] to exclude (disable) suite s, or to
exclude suite s test t.

This is useful for instance to run the Basic Suite, but disable the GPU reset
test, on the command line, like this:

amdgpu_tests -s 1 -e 1.13

This option can be specified more than once on the command line, in order to
exclude more than one suite and/or suite and test combination from being run.
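Parsing the s[.t] argument can be done with strtol; a standalone sketch
(a hypothetical helper, not the patch's exact code):

```c
#include <assert.h>
#include <stdlib.h>

/* Parse "s" or "s.t" into suite/test numbers; *test is set to -1 when
 * only a suite is given. Returns 0 on success, -1 on a malformed string. */
static int parse_suite_test(const char *arg, long *suite, long *test)
{
	char *end;

	*suite = strtol(arg, &end, 10);
	*test = -1;
	if (end == arg)		/* no digits at all */
		return -1;
	if (*end == '.') {
		const char *tp = end + 1;

		*test = strtol(tp, &end, 10);
		if (end == tp)	/* '.' with no test number after it */
			return -1;
	}
	return *end == '\0' ? 0 : -1;	/* reject trailing garbage */
}
```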

Cc: Alex Deucher 
Signed-off-by: Luben Tuikov 
---
 tests/amdgpu/amdgpu_test.c | 187 ++---
 1 file changed, 152 insertions(+), 35 deletions(-)

diff --git a/tests/amdgpu/amdgpu_test.c b/tests/amdgpu/amdgpu_test.c
index 59ca49bdef5f20..ec787889afd25f 100644
--- a/tests/amdgpu/amdgpu_test.c
+++ b/tests/amdgpu/amdgpu_test.c
@@ -296,11 +296,14 @@ static void display_test_suites(void)
 
 /** Help string for command line parameters */
 static const char usage[] =
-   "Usage: %s [-hlpr] [-s ] [-t ] [-f] "
+   "Usage: %s [-hlpr] [-s ] [-e [.] [-e ...]] [-t ] [-f] "
"[-b ] [-d ]\n"
"Where,\n"
"  -b  Specify device's PCI bus id to run tests\n"
"  -d  Specify device's PCI device id to run tests (optional)\n"
	"  -e <s>[.<t>]  Disable test <t> of suite <s>. If only <s> is given, then disable\n"
	"  the whole suite. Can be specified more than once on the command line\n"
+   "  to disable multiple tests or suites.\n"
"  -f  Force executing inactive suite or test\n"
"  -h  Display this help\n"
"  -l  Display all test suites and their tests\n"
@@ -309,7 +312,7 @@ static const char usage[] =
"  -s   Enable only test suite \n"
"  -t   Enable only test  of test suite \n";
 /** Specified options strings for getopt */
-static const char options[]   = "hlrps:t:b:d:f";
+static const char options[]   = "hlrps:t:e:b:d:f";
 
 /* Open AMD devices.
  * Return the number of AMD device opened.
@@ -664,6 +667,48 @@ char *amdgpu_get_device_from_fd(int fd)
 #endif
 }
 
+#ifndef ARRAY_SIZE
+#define ARRAY_SIZE(_A) (sizeof(_A)/sizeof(_A[0]))
+#endif
+
+static void amdgpu_test_disable(long suite, long test)
+{
+   const char *suite_name;
+
+   if (suite < 1)
+   return;
+
+   /* The array is 0-based, so subtract 1. */
+   suite--;
+   if (suite >= ARRAY_SIZE(suites) - 1)
+   return;
+
+   suite_name = suites[suite].pName;
+   if (test < 1) {
+   fprintf(stderr, "Deactivating suite %s\n", suite_name);
+   amdgpu_set_suite_active(suite_name, CU_FALSE);
+   } else {
+   int ii;
+
+   /* The array is 0-based so subtract 1. */
+   test--;
+   for (ii = 0; suites[suite].pTests[ii].pName; ii++) {
+   if (ii == test) {
+   fprintf(stderr, "Deactivating %s:%s\n",
+   suite_name,
+   suites[suite].pTests[ii].pName);
+   amdgpu_set_test_active(suite_name,
+  
suites[suite].pTests[ii].pName,
+  CU_FALSE);
+   break;
+   }
+   }
+
+   if (suites[suite].pTests[ii].pName == NULL)
+   fprintf(stderr, "No such suite.test %ld.%ld\n", suite, 
test);
+   }
+}
+
 /* The main() function for setting up and running the tests.
  * Returns a CUE_SUCCESS on successful running, another
  * CUnit error code on failure.
@@ -682,48 +727,21 @@ int main(int argc, char **argv)
int display_list = 0;
int force_run = 0;
 
-   for (i = 0; i < MAX_CARDS_SUPPORTED; i++)
-   drm_amdgpu[i] = -1;
-
-
-   /* Parse command line string */
+   /* Parse command line string.
+* Process various command line options as early as possible.
+*/
opterr = 0; /* Do not print error messages from getopt */
while ((c = getopt(argc, argv, options)) != -1) {
switch (c) {
-   case 'l':
-   display_list = 1;
-   break;
-   case 's':
-   suite_id = atoi(optarg);
-   break;
-   case 't':
-   test_id = atoi(optarg);
-   break;
-   case 'b':
-   pci_bus_id = atoi(optarg);
-   break;
-   case 'd':
-   sscanf(optarg, "%x", &pci_device_id);
-   break;
-   case 'p':
-   display_devices = 1;
-   

[PATCH] tests/amdgpu: Add all 9 options to the help output

2023-03-16 Thread Luben Tuikov
Add -s and -t to the help output, as well as sort
the options output alphabetically.

Cc: Alex Deucher 
Signed-off-by: Luben Tuikov 
---
 tests/amdgpu/amdgpu_test.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

MR 291 has been updated.

diff --git a/tests/amdgpu/amdgpu_test.c b/tests/amdgpu/amdgpu_test.c
index b8fd638c5f4e97..59ca49bdef5f20 100644
--- a/tests/amdgpu/amdgpu_test.c
+++ b/tests/amdgpu/amdgpu_test.c
@@ -298,14 +298,16 @@ static void display_test_suites(void)
 static const char usage[] =
"Usage: %s [-hlpr] [-s ] [-t ] [-f] "
"[-b ] [-d ]\n"
-   "where:\n"
-   "   l - Display all suites and their tests\n"
-   "   r - Run the tests on render node\n"
-   "   b - Specify device's PCI bus id to run tests\n"
-   "   d - Specify device's PCI device id to run tests (optional)\n"
-   "   p - Display information of AMDGPU devices in system\n"
-   "   f - Force executing inactive suite or test\n"
-   "   h - Display this help\n";
+   "Where,\n"
+   "  -b  Specify device's PCI bus id to run tests\n"
+   "  -d  Specify device's PCI device id to run tests (optional)\n"
+   "  -f  Force executing inactive suite or test\n"
+   "  -h  Display this help\n"
+   "  -l  Display all test suites and their tests\n"
+   "  -p  Display information of AMDGPU devices in system\n"
+   "  -r  Run the tests on render node\n"
+   "  -s   Enable only test suite \n"
+   "  -t   Enable only test  of test suite \n";
 /** Specified options strings for getopt */
 static const char options[]   = "hlrps:t:b:d:f";
 
-- 
2.40.0



[PATCH] tests/amdgpu: Fix Usage string

2023-03-16 Thread Luben Tuikov
Fix the Usage: string on -h (help) in amdgpu_tests.c,
so brackets match, and remove mismatched angle brackets.

Cc: Alex Deucher 
Signed-off-by: Luben Tuikov 
---
 tests/amdgpu/amdgpu_test.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tests/amdgpu/amdgpu_test.c b/tests/amdgpu/amdgpu_test.c
index 9abe5730905ad7..b8fd638c5f4e97 100644
--- a/tests/amdgpu/amdgpu_test.c
+++ b/tests/amdgpu/amdgpu_test.c
@@ -296,8 +296,8 @@ static void display_test_suites(void)
 
 /** Help string for command line parameters */
 static const char usage[] =
-   "Usage: %s [-hlpr] [<-s > [-t ] [-f]] "
-   "[-b  [-d ]]\n"
+   "Usage: %s [-hlpr] [-s ] [-t ] [-f] "
+   "[-b ] [-d ]\n"
"where:\n"
"   l - Display all suites and their tests\n"
"   r - Run the tests on render node\n"

base-commit: 332809f3ee19f07abc03b62d5892fae51b9d9902
-- 
2.40.0



[PATCH] umr: Add umrgui to bash command line completion

2023-03-16 Thread Luben Tuikov
Add umrgui to bash command line completion as well.

Cc: Tom StDenis 
Signed-off-by: Luben Tuikov 
---
 scripts/umr-completion.bash | 1 +
 1 file changed, 1 insertion(+)

diff --git a/scripts/umr-completion.bash b/scripts/umr-completion.bash
index 344095b5d0633d..8bb2e7abb1fe97 100644
--- a/scripts/umr-completion.bash
+++ b/scripts/umr-completion.bash
@@ -400,3 +400,4 @@ _umr_completion()
 }
 
 complete -F _umr_completion umr
+complete -F _umr_completion umrgui

base-commit: e90f402de7d132a8faad94a8c093db5fe187f6d5
-- 
2.40.0



Re: [PATCH v2] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs

2023-03-15 Thread Luben Tuikov
On 2023-03-08 21:27, YuBiao Wang wrote:
> v2: Add comments to clarify in the code.
> 
> [Why]
> For engines not supporting soft reset, i.e. VCN, there will be a failed
> ib test before mode 1 reset during asic reset. The fences in this case
> are never signaled and next time when we try to free the sa_bo, kernel
> will hang.
> 
> [How]
> During pre_asic_reset, driver will clear job fences and afterwards the
> fences' refcount will be reduced to 1. For drm_sched_jobs it will be
> released in job_free_cb, and for non-sched jobs like ib_test, it's meant
> to be released in sa_bo_free but only when the fences are signaled. So
> we have to force signal the non_sched bad job's fence during
> pre_asic_reset or the clear is not complete.
> 
> Signed-off-by: YuBiao Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index faff4a3f96e6..ad7c5b70c35a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -673,6 +673,7 @@ void amdgpu_fence_driver_clear_job_fences(struct 
> amdgpu_ring *ring)
>  {
>   int i;
>   struct dma_fence *old, **ptr;
> + struct amdgpu_job *job;
>  
>   for (i = 0; i <= ring->fence_drv.num_fences_mask; i++) {
>   ptr = &ring->fence_drv.fences[i];
> @@ -680,6 +681,13 @@ void amdgpu_fence_driver_clear_job_fences(struct 
> amdgpu_ring *ring)
>   if (old && old->ops == &amdgpu_job_fence_ops) {
>   RCU_INIT_POINTER(*ptr, NULL);
>   dma_fence_put(old);
> + /* For non-sched bad job, i.e. failed ib test, we need 
> to force
> +  * signal it right here or we won't be able to track 
> them in fence drv
> +  * and they will remain unsignaled during sa_bo free.
> +  */
> + job = container_of(old, struct amdgpu_job, hw_fence);
> + if (!job->base.s_fence && !dma_fence_is_signaled(old))
> + dma_fence_signal(old);

Hi YuBiao,

Thanks for adding the clarifying comments and sending a v2 of this patch.

Perhaps move the chunk you're adding, to sit before, rather than after,
the statements of the if-conditional. Also move the "job" variable
declaration to be inside the if-conditional, since it is not used
by anything outside it. Something like this (note a few small fixes to the
comment):

if (old && old->ops == &amdgpu_job_fence_ops) {
struct amdgpu_job *job;

/* For non-scheduler bad job, i.e. failed IB test, we need to 
signal
 * it right here or we won't be able to track them in fence_drv
 * and they will remain unsignaled during sa_bo free.
 */
job = container_of(old, struct amdgpu_job, hw_fence);
if (!job->base.s_fence && !dma_fence_is_signaled(old))
dma_fence_signal(old);
RCU_INIT_POINTER(*ptr, NULL);
dma_fence_put(old);
}
Then, give it a test.
With that change, and upon successful test results, this patch is,
Acked-by: Luben Tuikov 
-- 
Regards,
Luben
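For readers unfamiliar with the container_of() step the review discusses: it recovers the enclosing structure from a pointer to one of its embedded members. A self-contained userspace sketch of the same pattern follows; the mock types and the maybe_force_signal() name are invented here and are not the amdgpu definitions:

```c
#include <stdbool.h>
#include <stddef.h>

/* Mock stand-ins for the kernel structures; only the fields the
 * pattern needs.  Not the real amdgpu types. */
struct mock_fence {
	bool signaled;
};

struct mock_job {
	void *s_fence;              /* NULL for non-scheduler jobs */
	struct mock_fence hw_fence; /* fence embedded in the job */
};

/* Same idea as the kernel's container_of() macro: step back from the
 * member to the start of the enclosing structure. */
#define mock_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Mirrors the !job->base.s_fence && !dma_fence_is_signaled(old)
 * check from the patch: force-signal only non-scheduler job fences. */
void maybe_force_signal(struct mock_fence *old)
{
	struct mock_job *job =
		mock_container_of(old, struct mock_job, hw_fence);

	if (!job->s_fence && !old->signaled)
		old->signaled = true;
}
```

A scheduler job (non-NULL s_fence) is left untouched, which is exactly why the review asks whether !s_fence is a reliable marker for "non-scheduler job".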



Re: [PATCH v2] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs

2023-03-10 Thread Luben Tuikov
On 2023-03-08 21:27, YuBiao Wang wrote:
> v2: Add comments to clarify in the code.
> 
> [Why]
> For engines not supporting soft reset, i.e. VCN, there will be a failed
> ib test before mode 1 reset during asic reset. The fences in this case
> are never signaled and next time when we try to free the sa_bo, kernel
> will hang.
> 
> [How]
> During pre_asic_reset, driver will clear job fences and afterwards the
> fences' refcount will be reduced to 1. For drm_sched_jobs it will be
> released in job_free_cb, and for non-sched jobs like ib_test, it's meant
> to be released in sa_bo_free but only when the fences are signaled. So
> we have to force signal the non_sched bad job's fence during
> pre_asic_reset or the clear is not complete.
> 
> Signed-off-by: YuBiao Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index faff4a3f96e6..ad7c5b70c35a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -673,6 +673,7 @@ void amdgpu_fence_driver_clear_job_fences(struct 
> amdgpu_ring *ring)
>  {
>   int i;
>   struct dma_fence *old, **ptr;
> + struct amdgpu_job *job;
>  
>   for (i = 0; i <= ring->fence_drv.num_fences_mask; i++) {
>   ptr = &ring->fence_drv.fences[i];
> @@ -680,6 +681,13 @@ void amdgpu_fence_driver_clear_job_fences(struct 
> amdgpu_ring *ring)
>   if (old && old->ops == &amdgpu_job_fence_ops) {
>   RCU_INIT_POINTER(*ptr, NULL);
>   dma_fence_put(old);
> + /* For non-sched bad job, i.e. failed ib test, we need 
> to force
> +  * signal it right here or we won't be able to track 
> them in fence drv
> +  * and they will remain unsignaled during sa_bo free.
> +  */
> + job = container_of(old, struct amdgpu_job, hw_fence);
> + if (!job->base.s_fence && !dma_fence_is_signaled(old))
> + dma_fence_signal(old);

Conceptually, I don't mind this patch for what it does. The only thing which 
worries me
is this check here, !job->base.s_fence, which is used here to qualify that we
can signal the fence (and of course that the fence is not yet signalled.) We 
need
to audit this check to make sure that it is not overloaded to mean other 
things. I'll
take a look.

>   }
>   }
>  }

-- 
Regards,
Luben



Re: [PATCH] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs

2023-03-07 Thread Luben Tuikov
On 2023-03-07 15:36, Luben Tuikov wrote:
> + job = container_of(old, struct amdgpu_job, hw_fence);
> + if (!job->base.s_fence && !dma_fence_is_signaled(old))
> + dma_fence_signal(old);

Thinking about this more, is !job->base.s_fence condition here
enough to mean "non-sched jobs like ib_test"?

I feel that it is a bit overloaded here--could we have this condition
satisfied, yet we can't willy-nilly signal the fence here?
-- 
Regards,
Luben



Re: [PATCH] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs

2023-03-07 Thread Luben Tuikov
Hi,

Thanks for your patch!

On 2023-03-07 02:07, Christian König wrote:
> Am 07.03.23 um 08:02 schrieb YuBiao Wang:
>> [Why]
>> For engines not supporting soft reset, i.e. VCN, there will be a failed
>> ib test before mode 1 reset during asic reset. The fences in this case
>> are never signaled and next time when we try to free the sa_bo, kernel
>> will hang.
>>
>> [How]
>> During pre_asic_reset, driver will clear job fences and afterwards the
>> fences' refcount will be reduced to 1. For drm_sched_jobs it will be
>> released in job_free_cb, and for non-sched jobs like ib_test, it's meant
>> to be released in sa_bo_free but only when the fences are signaled. So

So, you're missing a signal for the non-scheduler job fences?

>> we have to force signal the non_sched bad job's fence during
>> pre_asic_reset or the clear is not complete.

Do you want to add a function which does just this (signals 
non-scheduler job fences) in amdgpu_device_pre_asic_reset(),
and resubmit your patch? (There will be code redundancy, but may
make the point clearer.)

Are we missing to signal non-scheduler job fences on reset altogether?
-- 
Regards,
Luben

> 
> Well NAK for now. It looks once more like one of those not very well 
> thought through changes.
> 
> Luben can you please take a look at this and double check it?
> 
> Thanks,
> Christian.
> 
>>
>> Signed-off-by: YuBiao Wang 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index faff4a3f96e6..2e549bd50990 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -673,6 +673,7 @@ void amdgpu_fence_driver_clear_job_fences(struct 
>> amdgpu_ring *ring)
>>   {
>>  int i;
>>  struct dma_fence *old, **ptr;
>> +struct amdgpu_job *job;
>>   
>>  for (i = 0; i <= ring->fence_drv.num_fences_mask; i++) {
>>  ptr = &ring->fence_drv.fences[i];
>> @@ -680,6 +681,9 @@ void amdgpu_fence_driver_clear_job_fences(struct 
>> amdgpu_ring *ring)
>>  if (old && old->ops == &amdgpu_job_fence_ops) {
>>  RCU_INIT_POINTER(*ptr, NULL);
>>  dma_fence_put(old);
>> +job = container_of(old, struct amdgpu_job, hw_fence);
>> +if (!job->base.s_fence && !dma_fence_is_signaled(old))
>> +dma_fence_signal(old);
>>  }
>>  }
>>   }
> 


Re: [PATCH 1/9] drm: execution context for GEM buffers v3

2023-03-03 Thread Luben Tuikov
On 2023-02-28 03:33, Christian König wrote:
> This adds the infrastructure for an execution context for GEM buffers
> which is similar to the existinc TTMs execbuf util and intended to replace
> it in the long term.
> 
> The basic functionality is that we abstracts the necessary loop to lock
> many different GEM buffers with automated deadlock and duplicate handling.
> 
> v2: drop xarray and use dynamic resized array instead, the locking
> overhead is unecessary and measureable.
> v3: drop duplicate tracking, radeon is really the only one needing that.
> 
> Signed-off-by: Christian König 
> ---
>  Documentation/gpu/drm-mm.rst |  12 ++
>  drivers/gpu/drm/Kconfig  |   6 +
>  drivers/gpu/drm/Makefile |   2 +
>  drivers/gpu/drm/drm_exec.c   | 249 +++
>  include/drm/drm_exec.h   | 115 
>  5 files changed, 384 insertions(+)
>  create mode 100644 drivers/gpu/drm/drm_exec.c
>  create mode 100644 include/drm/drm_exec.h
> 
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index a79fd3549ff8..a52e6f4117d6 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -493,6 +493,18 @@ DRM Sync Objects
>  .. kernel-doc:: drivers/gpu/drm/drm_syncobj.c
> :export:
>  
> +DRM Execution context
> +=
> +
> +.. kernel-doc:: drivers/gpu/drm/drm_exec.c
> +   :doc: Overview
> +
> +.. kernel-doc:: include/drm/drm_exec.h
> +   :internal:
> +
> +.. kernel-doc:: drivers/gpu/drm/drm_exec.c
> +   :export:
> +
>  GPU Scheduler
>  =
>  
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index 17d252dc25e2..84a5fc28c48d 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -200,6 +200,12 @@ config DRM_TTM
> GPU memory types. Will be enabled automatically if a device driver
> uses it.
>  
> +config DRM_EXEC
> + tristate
> + depends on DRM
> + help
> +   Execution context for command submissions
> +
>  config DRM_BUDDY
>   tristate
>   depends on DRM
> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
> index ab4460fcd63f..d40defbb0347 100644
> --- a/drivers/gpu/drm/Makefile
> +++ b/drivers/gpu/drm/Makefile
> @@ -78,6 +78,8 @@ obj-$(CONFIG_DRM_PANEL_ORIENTATION_QUIRKS) += 
> drm_panel_orientation_quirks.o
>  #
>  # Memory-management helpers
>  #
> +#
> +obj-$(CONFIG_DRM_EXEC) += drm_exec.o
>  
>  obj-$(CONFIG_DRM_BUDDY) += drm_buddy.o
>  
> diff --git a/drivers/gpu/drm/drm_exec.c b/drivers/gpu/drm/drm_exec.c
> new file mode 100644
> index ..df546cc5a227
> --- /dev/null
> +++ b/drivers/gpu/drm/drm_exec.c
> @@ -0,0 +1,249 @@
> +/* SPDX-License-Identifier: GPL-2.0 OR MIT */
> +
> +#include <drm/drm_exec.h>
> +#include <drm/drm_gem.h>
> +#include <linux/dma-resv.h>
> +
> +/**
> + * DOC: Overview
> + *
> + * This component mainly abstracts the retry loop necessary for locking
> + * multiple GEM objects while preparing hardware operations (e.g. command
> + * submissions, page table updates etc..).
> + *
> + * If a contention is detected while locking a GEM object the cleanup 
> procedure
> + * unlocks all previously locked GEM objects and locks the contended one 
> first
> + * before locking any further objects.
> + *
> + * After an object is locked fences slots can optionally be reserved on the
> + * dma_resv object inside the GEM object.
> + *
> + * A typical usage pattern should look like this::
> + *
> + *   struct drm_gem_object *obj;
> + *   struct drm_exec exec;
> + *   unsigned long index;
> + *   int ret;
> + *
> + *   drm_exec_init(&exec, true);
> + *   drm_exec_while_not_all_locked(&exec) {
> + *   ret = drm_exec_prepare_obj(&exec, boA, 1);
> + *   drm_exec_continue_on_contention(&exec);
> + *   if (ret)
> + *   goto error;
> + *
> + *   ret = drm_exec_lock(&exec, boB, 1);
> + *   drm_exec_continue_on_contention(&exec);
> + *   if (ret)
> + *   goto error;
> + *   }
> + *
> + *   drm_exec_for_each_locked_object(&exec, index, obj) {
> + *   dma_resv_add_fence(obj->resv, fence, DMA_RESV_USAGE_READ);
> + *   ...
> + *   }
> + *   drm_exec_fini(&exec);

Maybe add the error: label here to show how recovery is to be had.

> + *
> + * See struct dma_exec for more details.
> + */
> +
> +/* Dummy value used to initially enter the retry loop */
> +#define DRM_EXEC_DUMMY (void*)~0
> +
> +/* Unlock all objects and drop references */
> +static void drm_exec_unlock_all(struct drm_exec *exec)
> +{
> + struct drm_gem_object *obj;
> + unsigned long index;
> +
> + drm_exec_for_each_locked_object(exec, index, obj) {
> + dma_resv_unlock(obj->resv);
> + drm_gem_object_put(obj);
> + }
> +
> + if (exec->prelocked) {
> + dma_resv_unlock(exec->prelocked->resv);
> + drm_gem_object_put(exec->prelocked);
> + exec->prelocked = NULL;
> + }
> +}
> +
> +/**
> + * drm_exec_init - initialize a drm_exec object
> + * @exec: the 
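To illustrate the recovery flow the review asks the Overview to show (the error: label feeding back into cleanup), here is a userspace mock of the locking sequence. mock_exec and its helpers are invented for this sketch; only the control flow mirrors the drm_exec usage example:

```c
#include <stdbool.h>

/* Userspace mock of the drm_exec usage pattern, showing where the
 * error: label would sit.  All names here are made up; the real API
 * is the one in the patch under review. */
struct mock_exec {
	int locked;      /* number of objects currently "locked" */
	bool contended;
	bool finished;   /* set once cleanup ran */
};

void mock_exec_init(struct mock_exec *exec)
{
	exec->locked = 0;
	exec->contended = false;
	exec->finished = false;
}

/* Pretend to lock one more object; fail when we are about to take
 * lock number fail_on (0 means never fail). */
int mock_exec_lock(struct mock_exec *exec, int fail_on)
{
	if (exec->locked + 1 == fail_on)
		return -1;   /* e.g. -ENOMEM in the real API */
	exec->locked++;
	return 0;
}

void mock_exec_fini(struct mock_exec *exec)
{
	exec->locked = 0;
	exec->finished = true;
}

/* Lock two objects; on any error fall through to the error: label so
 * mock_exec_fini() still runs -- the recovery path the review wants
 * the documentation example to include. */
int lock_two(struct mock_exec *exec, int fail_on)
{
	int ret;

	mock_exec_init(exec);
	ret = mock_exec_lock(exec, fail_on);
	if (ret)
		goto error;
	ret = mock_exec_lock(exec, fail_on);
	if (ret)
		goto error;
	return 0;
error:
	mock_exec_fini(exec);
	return ret;
}
```

The point of the shape is that every failure site shares a single cleanup exit, so no locked object leaks regardless of where the loop bails out.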

Re: [PATCH 1/9] drm: execution context for GEM buffers v3

2023-03-03 Thread Luben Tuikov
On 2023-02-28 14:13, Danilo Krummrich wrote:
> On 2/28/23 09:33, Christian König wrote:
>> This adds the infrastructure for an execution context for GEM buffers
>> which is similar to the existinc TTMs execbuf util and intended to replace
> 
> "existing"
> 
>> it in the long term.
>>
>> The basic functionality is that we abstracts the necessary loop to lock
>> many different GEM buffers with automated deadlock and duplicate handling.
>>
>> v2: drop xarray and use dynamic resized array instead, the locking
>>  overhead is unecessary and measureable.
> 
> "unecessary", "measurable"

"unnecessary".
-- 
Regards,
Luben



Re: [PATCH v3] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

2023-02-02 Thread Luben Tuikov
Hi Guilherme,

Thanks for redoing to a v3. This patch is:

Reviewed-by: Luben Tuikov 

Regards,
Luben

On 2023-02-02 08:48, Guilherme G. Piccoli wrote:
> Currently amdgpu calls drm_sched_fini() from the fence driver sw fini
> routine - such function is expected to be called only after the
> respective init function - drm_sched_init() - was executed successfully.
> 
> Happens that we faced a driver probe failure in the Steam Deck
> recently, and the function drm_sched_fini() was called even without
> its counter-part had been previously called, causing the following oops:
> 
> amdgpu: probe of :04:00.0 failed with error -110
> BUG: kernel NULL pointer dereference, address: 0090
> PGD 0 P4D 0
> Oops: 0002 [#1] PREEMPT SMP NOPTI
> CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338
> Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
> RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
> [...]
> Call Trace:
>  
>  amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
>  amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
>  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
>  devm_drm_dev_init_release+0x49/0x70
>  [...]
> 
> To prevent that, check if the drm_sched was properly initialized for a
> given ring before calling its fini counter-part.
> 
> Notice ideally we'd use sched.ready for that; such field is set as the latest
> thing on drm_sched_init(). But amdgpu seems to "override" the meaning of such
> field - in the above oops for example, it was a GFX ring causing the crash, 
> and
> the sched.ready field was set to true in the ring init routine, regardless of
> the state of the DRM scheduler. Hence, we ended-up using sched.ops as per
> Christian's suggestion [0], and also removed the no_scheduler check [1].
> 
> [0] 
> https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb...@amd.com/
> [1] 
> https://lore.kernel.org/amd-gfx/cd0e2994-f85f-d837-609f-7056d5fb7...@amd.com/
> 
> Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in 
> s3 test (v2)")
> Suggested-by: Christian König 
> Cc: Guchun Chen 
> Cc: Luben Tuikov 
> Cc: Mario Limonciello 
> Signed-off-by: Guilherme G. Piccoli 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 00444203220d..faff4a3f96e6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device 
> *adev)
>   if (!ring || !ring->fence_drv.initialized)
>   continue;
>  
> - if (!ring->no_scheduler)
> + /*
> +  * Notice we check for sched.ops since there's some
> +  * override on the meaning of sched.ready by amdgpu.
> +  * The natural check would be sched.ready, which is
> +  * set as drm_sched_init() finishes...
> +  */
> + if (ring->sched.ops)
>   drm_sched_fini(>sched);
>  
>   for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)

-- 
Regards,
Luben
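The pattern settled on above (judging whether init ever ran by the ops pointer, rather than by a ready flag the driver may repurpose) can be sketched in isolation. All names below are mocks invented for this sketch, not the drm code:

```c
#include <stddef.h>

/* Mock of the init/fini pairing discussed above: fini is safe to call
 * only if init ran, and "did init run?" is judged by whether the ops
 * pointer was set, since the ready flag may be overridden elsewhere. */
struct mock_ops { int dummy; };

struct mock_sched {
	const struct mock_ops *ops; /* set only by mock_sched_init() */
	int ready;                  /* may be overridden by the driver */
};

static const struct mock_ops some_ops;

int fini_calls; /* counts teardown work actually performed */

void mock_sched_init(struct mock_sched *s)
{
	s->ops = &some_ops;
	s->ready = 1;
}

void mock_sched_fini(struct mock_sched *s)
{
	fini_calls++;
	s->ready = 0;
}

/* The guarded teardown: skip fini when init never ran, even if the
 * ready flag claims otherwise. */
void teardown(struct mock_sched *s)
{
	if (s->ops)
		mock_sched_fini(s);
}
```

A scheduler whose probe failed before init never gets its fini called, avoiding the NULL-dereference oops in the report above.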



  1   2   3   4   5   6   7   8   9   >