RE: [PATCH v4] drm/amdgpu: drop gfx_v11_0_cp_ecc_error_irq_funcs

2023-05-03 Thread Zhang, Horatio
[AMD Official Use Only - General]

Hi Hawking,

Thank you for your review. 

I will change the check to if (adev->gfx.cp_ecc_error_irq.funcs) and submit
this patch to amd-staging-drm-next.

Regards,
Horatio

-Original Message-
From: Zhang, Hawking  
Sent: Friday, April 28, 2023 1:31 PM
To: Zhang, Horatio ; amd-gfx@lists.freedesktop.org
Cc: Yao, Longlong ; Chai, Thomas ; 
Zhang, Horatio ; Koenig, Christian 
; Chen, Guchun ; Xu, Feifei 

Subject: RE: [PATCH v4] drm/amdgpu: drop gfx_v11_0_cp_ecc_error_irq_funcs

[AMD Official Use Only - General]

+   if (!(adev->gfx.cp_ecc_error_irq.funcs == NULL)) {

Just check if (adev->gfx.cp_ecc_error_irq.funcs) {
+   r = amdgpu_irq_get(adev, &adev->gfx.cp_ecc_error_irq, 0);
+   if (r)
+   goto late_fini;
+   }

Or you can use

if (!adev->gfx.cp_ecc_error_irq.funcs)
        return 0;

Either way works for me. With that addressed, the change is

Reviewed-by: Hawking Zhang 

Regards,
Hawking
-Original Message-
From: Horatio Zhang 
Sent: Friday, April 28, 2023 11:48
To: Zhang, Hawking ; amd-gfx@lists.freedesktop.org
Cc: Yao, Longlong ; Chai, Thomas ; 
Zhang, Horatio ; Zhang, Hawking ; 
Koenig, Christian ; Chen, Guchun 
; Xu, Feifei 
Subject: [PATCH v4] drm/amdgpu: drop gfx_v11_0_cp_ecc_error_irq_funcs

The gfx.cp_ecc_error_irq is retired in gfx11, but gfx_v11_0_hw_fini still uses
amdgpu_irq_put to disable this interrupt, which causes the call trace below.

[  102.873958] Call Trace:
[  102.873959]  <TASK>
[  102.873961]  gfx_v11_0_hw_fini+0x23/0x1e0 [amdgpu]
[  102.874019]  gfx_v11_0_suspend+0xe/0x20 [amdgpu]
[  102.874072]  amdgpu_device_ip_suspend_phase2+0x240/0x460 [amdgpu]
[  102.874122]  amdgpu_device_ip_suspend+0x3d/0x80 [amdgpu]
[  102.874172]  amdgpu_device_pre_asic_reset+0xd9/0x490 [amdgpu]
[  102.874223]  amdgpu_device_gpu_recover.cold+0x548/0xce6 [amdgpu]
[  102.874321]  amdgpu_debugfs_reset_work+0x4c/0x70 [amdgpu]
[  102.874375]  process_one_work+0x21f/0x3f0
[  102.874377]  worker_thread+0x200/0x3e0
[  102.874378]  ? process_one_work+0x3f0/0x3f0
[  102.874379]  kthread+0xfd/0x130
[  102.874380]  ? kthread_complete_and_exit+0x20/0x20
[  102.874381]  ret_from_fork+0x22/0x30

v2:
- Handle the umc and gfx ras cases in a separate patch
- Retire gfx_v11_0_cp_ecc_error_irq_funcs in gfx11

v3:
- Improve the subject and code comments
- Add a check for gfx11 in amdgpu_gfx_ras_late_init

v4:
- Drop the defines of CP_ME1_PIPE_INST_ADDR_INTERVAL and SET_ECC_ME_PIPE_STATE,
  which were only used in gfx_v11_0_set_cp_ecc_error_state
- Check cp_ecc_error_irq.funcs rather than the IP version so the check keeps
  working on future IPs

Signed-off-by: Horatio Zhang 
Reviewed-by: Hawking Zhang 
Acked-by: Christian König 
Reviewed-by: Guchun Chen 
Reviewed-by: Feifei Xu 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c |  8 +++--  
drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c  | 46 -
 2 files changed, 5 insertions(+), 49 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index 66b9740ec376..042f69e1ef91 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -720,9 +720,11 @@ int amdgpu_gfx_ras_late_init(struct amdgpu_device *adev, 
struct ras_common_if *r
if (r)
return r;

-   r = amdgpu_irq_get(adev, &adev->gfx.cp_ecc_error_irq, 0);
-   if (r)
-   goto late_fini;
+   if (!(adev->gfx.cp_ecc_error_irq.funcs == NULL)) {
+   r = amdgpu_irq_get(adev, &adev->gfx.cp_ecc_error_irq, 0);
+   if (r)
+   goto late_fini;
+   }
} else {
amdgpu_ras_feature_enable_on_boot(adev, ras_block, 0);
}
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index 8a4c4769e607..4ba21f798c72 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -1355,13 +1355,6 @@ static int gfx_v11_0_sw_init(void *handle)
if (r)
return r;

-   /* ECC error */
-   r = amdgpu_irq_add_id(adev, SOC21_IH_CLIENTID_GRBM_CP,
- GFX_11_0_0__SRCID__CP_ECC_ERROR,
- >gfx.cp_ecc_error_irq);
-   if (r)
-   return r;
-
/* FED error */
r = amdgpu_irq_add_id(adev, SOC21_IH_CLIENTID_GFX,
  GFX_11_0_0__SRCID__RLC_GC_FED_INTERRUPT,
@@ -4483,7 +4476,6 @@ static int gfx_v11_0_hw_fini(void *handle)
struct amdgpu_device *adev = (struct amdgpu_device *)handle;
int r;

-   amdgpu_irq_put(adev, &adev->gfx.cp_ecc_error_irq, 0);
amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0);
amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0);

@@ -5962,36 +5954,6 @@ static void 

[pull] amdgpu drm-fixes-6.4

2023-05-03 Thread Alex Deucher
Hi Dave, Daniel,

Fixes for 6.4.

The following changes since commit d893f39320e1248d1c97fde0d6e51e5ea008a76b:

  drm/amd/display: Lowering min Z8 residency time (2023-04-26 22:53:58 -0400)

are available in the Git repository at:

  https://gitlab.freedesktop.org/agd5f/linux.git 
tags/amd-drm-fixes-6.4-2023-05-03

for you to fetch changes up to 1253685f0d3eb3eab0bfc4bf15ab341a5f3da0c8:

  drm/amdgpu: drop redundant sched job cleanup when cs is aborted (2023-05-03 
23:10:02 -0400)


amd-drm-fixes-6.4-2023-05-03:

amdgpu:
- GPU reset fixes
- Doorbell fix when resizing BARs
- Fix spurious warnings in gmc
- Locking fix for AMDGPU_SCHED IOCTL
- SR-IOV fix
- DCN 3.1.4 fix
- DCN 3.2 fix
- Fix job cleanup when CS is aborted


Chia-I Wu (1):
  drm/amdgpu: add a missing lock for AMDGPU_SCHED

Guchun Chen (1):
  drm/amdgpu: drop redundant sched job cleanup when cs is aborted

Hamza Mahfooz (1):
  drm/amdgpu: fix an amdgpu_irq_put() issue in gmc_v9_0_hw_fini()

Horace Chen (1):
  drm/amdgpu: disable SDMA WPTR_POLL_ENABLE for SR-IOV

Horatio Zhang (2):
  drm/amdgpu: fix amdgpu_irq_put call trace in gmc_v11_0_hw_fini
  drm/amdgpu: fix amdgpu_irq_put call trace in gmc_v10_0_hw_fini

Leo Chen (1):
  drm/amd/display: Change default Z8 watermark values

Samson Tam (1):
  drm/amd/display: filter out invalid bits in pipe_fuses

Shane Xiao (1):
  drm/amdgpu: Enable doorbell selfring after resize FB BAR

lyndonli (2):
  drm/amdgpu: Fix mode2 reset for sienna cichlid
  drm/amdgpu: Use the default reset when loading or reloading the driver

 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 13 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  7 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_sched.c  |  6 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c |  1 -
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c |  1 -
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  |  1 -
 drivers/gpu/drm/amd/amdgpu/nv.c| 23 +++-
 drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c |  5 +
 drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c|  2 +-
 drivers/gpu/drm/amd/amdgpu/soc15.c | 25 +-
 drivers/gpu/drm/amd/amdgpu/soc21.c | 23 +++-
 .../gpu/drm/amd/display/dc/dcn32/dcn32_resource.c  | 10 -
 .../drm/amd/display/dc/dcn321/dcn321_resource.c| 10 -
 .../gpu/drm/amd/display/dc/dml/dcn314/dcn314_fpu.c |  4 ++--
 14 files changed, 78 insertions(+), 53 deletions(-)


[PATCH] drm:amd:amdgpu: Fix missing buffer object unlock in failure path

2023-05-03 Thread Sukrut Bellary
smatch warning -
1) drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c:3615 gfx_v9_0_kiq_resume()
warn: inconsistent returns 'ring->mqd_obj->tbo.base.resv'.

2) drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:6901 gfx_v10_0_kiq_resume()
warn: inconsistent returns 'ring->mqd_obj->tbo.base.resv'.

Signed-off-by: Sukrut Bellary 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 4 +++-
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c  | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 8bd07ff59671..66d5c5d68454 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -6891,8 +6891,10 @@ static int gfx_v10_0_kiq_resume(struct amdgpu_device 
*adev)
return r;
 
r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
-   if (unlikely(r != 0))
+   if (unlikely(r != 0)) {
+   amdgpu_bo_unreserve(ring->mqd_obj);
return r;
+   }
 
gfx_v10_0_kiq_init_queue(ring);
amdgpu_bo_kunmap(ring->mqd_obj);
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index bce6919d666a..d5715d8a4128 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -3617,8 +3617,10 @@ static int gfx_v9_0_kiq_resume(struct amdgpu_device 
*adev)
return r;
 
r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
-   if (unlikely(r != 0))
+   if (unlikely(r != 0)) {
+   amdgpu_bo_unreserve(ring->mqd_obj);
return r;
+   }
 
gfx_v9_0_kiq_init_queue(ring);
amdgpu_bo_kunmap(ring->mqd_obj);
-- 
2.34.1



Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Marek Olšák
On Wed, May 3, 2023, 14:53 André Almeida  wrote:

On 03/05/2023 14:08, Marek Olšák wrote:
> > GPU hangs are pretty common post-bringup. They are not common per user,
> > but if we gather all hangs from all users, we can have lots and lots of
> > them.
> >
> > GPU hangs are indeed not very debuggable. There are however some things
> > we can do:
> > - Identify the hanging IB by its VA (the kernel should know it)
>
> How can the kernel tell which VA range is being executed? I only found
> that information at mmCP_IB1_BASE_ regs, but as stated in this thread by
> Christian this is not reliable to be read.
>

The kernel receives the VA and the size via the CS ioctl. When user queues
are enabled, the kernel will no longer receive them.
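
(For reference, the per-IB chunk userspace passes through the CS ioctl carries
exactly that pair. The struct and field names below are from
include/uapi/drm/amdgpu_drm.h; the surrounding variables are placeholders only:)

    struct drm_amdgpu_cs_chunk_ib ib_chunk = {
            .flags       = 0,
            .va_start    = ib_gpu_va,        /* GPU VA of the IB (placeholder) */
            .ib_bytes    = ib_size_in_bytes, /* IB size in bytes (placeholder) */
            .ip_type     = AMDGPU_HW_IP_GFX,
            .ip_instance = 0,
            .ring        = 0,
    };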


> > - Read and parse the IB to detect memory corruption.
> > - Print active waves with shader disassembly if SQ isn't hung (often
> > it's not).
> >
> > Determining which packet the CP is stuck on is tricky. The CP has 2
> > engines (one frontend and one backend) that work on the same command
> > buffer. The frontend engine runs ahead, executes some packets and
> > forwards others to the backend engine. Only the frontend engine has the
> > command buffer VA somewhere. The backend engine only receives packets
> > from the frontend engine via a FIFO, so it might not be possible to tell
> > where it's stuck if it's stuck.
>
> Do they run asynchronously, or does the front end wait for the back end to
> execute?
>

They run asynchronously and should run asynchronously for performance, but
they can be synchronized using a special packet (PFP_SYNC_ME).
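
(For reference, that synchronization is emitted roughly like this in the
kernel's gfx ring code; this is a sketch modeled on the PACKET3_PFP_SYNC_ME
emission in gfx_v8/9/10, not a verbatim copy:)

    /* stall the ME (backend) until the PFP (frontend) has caught up */
    amdgpu_ring_write(ring, PACKET3(PACKET3_PFP_SYNC_ME, 0));
    amdgpu_ring_write(ring, 0x0);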

Marek


> >
> > When the gfx pipeline hangs outside of shaders, making a scandump seems
> > to be the only way to have a chance at finding out what's going wrong,
> > and only AMD-internal versions of hw can be scanned.
> >
> > Marek
> >
> > On Wed, May 3, 2023 at 11:23 AM Christian König
> >  > > wrote:
> >
> > Am 03.05.23 um 17:08 schrieb Felix Kuehling:
> >  > Am 2023-05-03 um 03:59 schrieb Christian König:
> >  >> Am 02.05.23 um 20:41 schrieb Alex Deucher:
> >  >>> On Tue, May 2, 2023 at 11:22 AM Timur Kristóf
> >  >>> mailto:timur.kris...@gmail.com>>
> wrote:
> >   [SNIP]
> >   In my opinion, the correct solution to those problems
> would be
> >   if
> >   the kernel could give userspace the necessary information
> > about
> >   a
> >   GPU hang before a GPU reset.
> >  
> >  >>>   The fundamental problem here is that the kernel doesn't
> have
> >  >>> that
> >  >>> information either. We know which IB timed out and can
> >  >>> potentially do
> >  >>> a devcoredump when that happens, but that's it.
> >  >>
> >  >> Is it really not possible to know such a fundamental thing
> > as what
> >  >> the
> >  >> GPU was doing when it hung? How are we supposed to do any
> > kind of
> >  >> debugging without knowing that?
> >  >>
> >  >> Yes, that's indeed something at least I try to figure out for
> years
> >  >> as well.
> >  >>
> >  >> Basically there are two major problems:
> >  >> 1. When the ASIC is hung you can't talk to the firmware engines
> any
> >  >> more and most state is not exposed directly, but just through
> some
> >  >> fw/hw interface.
> >  >> Just take a look at how umr reads the shader state from the
> SQ.
> >  >> When that block is hung you can't do that any more and basically
> > have
> >  >> no chance at all to figure out why it's hung.
> >  >>
> >  >> Same for other engines, I remember once spending a week
> > figuring
> >  >> out why the UVD block is hung during suspend. Turned out to be a
> >  >> debugging nightmare because any time you touch any register of
> that
> >  >> block the whole system would hang.
> >  >>
> >  >> 2. There are tons of things going on in a pipeline fashion or
> even
> >  >> completely in parallel. For example the CP is just the beginning
> > of a
> >  >> rather long pipeline which at the end produces a bunch of pixels.
> >  >> In almost all cases I've seen you ran into a problem
> somewhere
> >  >> deep in the pipeline and only very rarely at the beginning.
> >  >>
> >  >>
> >  >> I wonder what AMD's Windows driver team is doing with this
> > problem,
> >  >> surely they must have better tools to deal with GPU hangs?
> >  > For better or worse, most teams internally rely on scan dumps
> via
> >  > JTAG
> >  > which sort of limits the usefulness outside of AMD, but also
> > gives
> >  > you
> >  > the exact state of the hardware when it's hung so the
> > hardware teams
> >  > prefer it.
> >  >
> >   How does this approach scale? It's 

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread André Almeida

On 03/05/2023 14:43, Timur Kristóf wrote:

Hi Felix,

On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote:

That's the worst-case scenario where you're debugging HW or FW
issues.
Those should be pretty rare post-bringup. But are there hangs caused
by
user mode driver or application bugs that are easier to debug and
probably don't even require a GPU reset?


There are many GPU hangs that gamers experience while playing. We have
dozens of open bug reports against RADV about GPU hangs on various GPU
generations. These usually fall into two categories:

1. When the hang always happens at the same point in a game. These are
painful to debug but manageable.
2. "Random" hangs that happen to users over the course of playing a
game for several hours. It is absolute hell to try to even reproduce
let alone diagnose these issues, and this is what we would like to
improve.

For these hard-to-diagnose problems, it is already a challenge to
determine whether the problem is in the kernel (eg. setting wrong voltages
/ frequencies) or in userspace (eg. missing some synchronization); it can
even be a game bug that we need to work around.


For example most VM faults can
be handled without hanging the GPU. Similarly, a shader in an endless
loop should not require a full GPU reset.


This is actually not the case, AFAIK André's test case was an app that
had an infinite loop in a shader.



This is the test app if anyone wants to try it out:
https://github.com/andrealmeid/vulkan-triangle-v1. Just compile and run.


The kernel calls amdgpu_ring_soft_recovery() when I run my example, but 
I'm not sure what a soft recovery means here and if it's a full GPU 
reset or not.


But if we can at least trust the CP registers to dump information for
soft resets, it would be some improvement over the current state, I think.




It's more complicated for graphics because of the more complex
pipeline
and the lack of CWSR. But it should still be possible to do some
debugging without JTAG if the problem is in SW and not HW or FW. It's
probably worth improving that debugability without getting hung-up on
the worst case.


I agree, and we welcome any constructive suggestion to improve the
situation. It seems like our idea doesn't work if the kernel can't give
us the information we need.

How do we move forward?

Best regards,
Timur



Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread André Almeida

On 03/05/2023 14:08, Marek Olšák wrote:
GPU hangs are pretty common post-bringup. They are not common per user, 
but if we gather all hangs from all users, we can have lots and lots of 
them.


GPU hangs are indeed not very debuggable. There are however some things 
we can do:

- Identify the hanging IB by its VA (the kernel should know it)


How can the kernel tell which VA range is being executed? I only found 
that information at mmCP_IB1_BASE_ regs, but as stated in this thread by 
Christian this is not reliable to be read.



- Read and parse the IB to detect memory corruption.
- Print active waves with shader disassembly if SQ isn't hung (often 
it's not).


Determining which packet the CP is stuck on is tricky. The CP has 2 
engines (one frontend and one backend) that work on the same command 
buffer. The frontend engine runs ahead, executes some packets and 
forwards others to the backend engine. Only the frontend engine has the 
command buffer VA somewhere. The backend engine only receives packets 
from the frontend engine via a FIFO, so it might not be possible to tell 
where it's stuck if it's stuck.


Do they run asynchronously, or does the front end wait for the back end to
execute?




When the gfx pipeline hangs outside of shaders, making a scandump seems 
to be the only way to have a chance at finding out what's going wrong, 
and only AMD-internal versions of hw can be scanned.


Marek

On Wed, May 3, 2023 at 11:23 AM Christian König 
> wrote:


On 03.05.23 at 17:08, Felix Kuehling wrote:
 > On 2023-05-03 at 03:59, Christian König wrote:
 >> On 02.05.23 at 20:41, Alex Deucher wrote:
 >>> On Tue, May 2, 2023 at 11:22 AM Timur Kristóf
 >>> mailto:timur.kris...@gmail.com>> wrote:
  [SNIP]
  In my opinion, the correct solution to those problems would be
  if
  the kernel could give userspace the necessary information
about
  a
  GPU hang before a GPU reset.
 
 >>>   The fundamental problem here is that the kernel doesn't have
 >>> that
 >>> information either. We know which IB timed out and can
 >>> potentially do
 >>> a devcoredump when that happens, but that's it.
 >>
 >> Is it really not possible to know such a fundamental thing
as what
 >> the
 >> GPU was doing when it hung? How are we supposed to do any
kind of
 >> debugging without knowing that?
 >>
 >> Yes, that's indeed something at least I try to figure out for years
 >> as well.
 >>
 >> Basically there are two major problems:
 >> 1. When the ASIC is hung you can't talk to the firmware engines any
 >> more and most state is not exposed directly, but just through some
 >> fw/hw interface.
 >>     Just take a look at how umr reads the shader state from the SQ.
 >> When that block is hung you can't do that any more and basically
have
 >> no chance at all to figure out why it's hung.
 >>
 >>     Same for other engines, I remember once spending a week
figuring
 >> out why the UVD block is hung during suspend. Turned out to be a
 >> debugging nightmare because any time you touch any register of that
 >> block the whole system would hang.
 >>
 >> 2. There are tons of things going on in a pipeline fashion or even
 >> completely in parallel. For example the CP is just the beginning
of a
 >> rather long pipeline which at the end produces a bunch of pixels.
 >>     In almost all cases I've seen you ran into a problem somewhere
 >> deep in the pipeline and only very rarely at the beginning.
 >>
 >>
 >> I wonder what AMD's Windows driver team is doing with this
problem,
 >> surely they must have better tools to deal with GPU hangs?
 > For better or worse, most teams internally rely on scan dumps via
 > JTAG
 > which sort of limits the usefulness outside of AMD, but also
gives
 > you
 > the exact state of the hardware when it's hung so the
hardware teams
 > prefer it.
 >
  How does this approach scale? It's not something we can ask
users to
  do, and even if all of us in the radv team had a JTAG device, we
  wouldn't be able to play every game that users experience
random hangs
  with.
 >>> It doesn't scale or lend itself particularly well to external
 >>> development, but that's the current state of affairs.
 >>
 >> The usual approach seems to be to reproduce a problem in a lab and
 >> have a JTAG attached to give the hw guys a scan dump and they can
 >> then tell you why something didn't worked as expected.
 >
 > That's the worst-case scenario where you're debugging HW or FW
issues.
 > Those should be pretty rare post-bringup. But 

Re: drm/amdgpu: fix an amdgpu_irq_put() issue in gmc_v9_0_hw_fini()

2023-05-03 Thread Limonciello, Mario



On 5/2/2023 11:51 AM, Hamza Mahfooz wrote:

As mentioned in commit 9128e6babf10 ("drm/amdgpu: fix
amdgpu_irq_put call trace in gmc_v10_0_hw_fini") and commit c094b8923bdd
("drm/amdgpu: fix amdgpu_irq_put call trace in gmc_v11_0_hw_fini"), it
is meaningless to call amdgpu_irq_put() for gmc.ecc_irq. So, remove it
from gmc_v9_0_hw_fini().

Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2522
Fixes: 3029c855d79f ("drm/amdgpu: Fix desktop freezed after gpu-reset")
Signed-off-by: Hamza Mahfooz 


Reviewed-by: Mario Limonciello 


---
  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 1 -
  1 file changed, 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 290804a06e05..6ae5cee9b64b 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -1999,7 +1999,6 @@ static int gmc_v9_0_hw_fini(void *handle)
if (adev->mmhub.funcs->update_power_gating)
adev->mmhub.funcs->update_power_gating(adev, false);
  
-	amdgpu_irq_put(adev, &adev->gmc.ecc_irq, 0);

	amdgpu_irq_put(adev, &adev->gmc.vm_fault, 0);
  
  	return 0;


Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Timur Kristóf
Hi Felix,

On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote:
> That's the worst-case scenario where you're debugging HW or FW
> issues. 
> Those should be pretty rare post-bringup. But are there hangs caused
> by 
> user mode driver or application bugs that are easier to debug and 
> probably don't even require a GPU reset?

There are many GPU hangs that gamers experience while playing. We have
dozens of open bug reports against RADV about GPU hangs on various GPU
generations. These usually fall into two categories:

1. When the hang always happens at the same point in a game. These are
painful to debug but manageable.
2. "Random" hangs that happen to users over the course of playing a
game for several hours. It is absolute hell to try to even reproduce
let alone diagnose these issues, and this is what we would like to
improve.

For these hard-to-diagnose problems, it is already a challenge to
determine whether the problem is in the kernel (eg. setting wrong voltages
/ frequencies) or in userspace (eg. missing some synchronization); it can
even be a game bug that we need to work around.

> For example most VM faults can 
> be handled without hanging the GPU. Similarly, a shader in an endless
> loop should not require a full GPU reset.

This is actually not the case, AFAIK André's test case was an app that
had an infinite loop in a shader.

> 
> It's more complicated for graphics because of the more complex
> pipeline 
> and the lack of CWSR. But it should still be possible to do some 
> debugging without JTAG if the problem is in SW and not HW or FW. It's
> probably worth improving that debugability without getting hung-up on
> the worst case.

I agree, and we welcome any constructive suggestion to improve the
situation. It seems like our idea doesn't work if the kernel can't give
us the information we need.

How do we move forward?

Best regards,
Timur



Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Marek Olšák
WRITE_DATA with ENGINE=PFP will execute the packet on the frontend engine,
while ENGINE=ME will execute the packet on the backend engine.
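
(An illustrative sketch using the soc15d.h packet macros; engine_sel selects
the engine as described above, and the dst_sel value here is just an example:)

    /* WRITE_DATA to a GPU address, executed on the selected engine:
     * ENGINE_SEL 0 = ME (backend), 1 = PFP (frontend)
     */
    amdgpu_ring_write(ring, PACKET3(PACKET3_WRITE_DATA, 3));
    amdgpu_ring_write(ring, WRITE_DATA_DST_SEL(5) |
                            WRITE_DATA_ENGINE_SEL(engine_sel) | WR_CONFIRM);
    amdgpu_ring_write(ring, lower_32_bits(gpu_addr));
    amdgpu_ring_write(ring, upper_32_bits(gpu_addr));
    amdgpu_ring_write(ring, value);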

Marek

On Wed, May 3, 2023 at 1:08 PM Marek Olšák  wrote:

> GPU hangs are pretty common post-bringup. They are not common per user,
> but if we gather all hangs from all users, we can have lots and lots of
> them.
>
> GPU hangs are indeed not very debuggable. There are however some things we
> can do:
> - Identify the hanging IB by its VA (the kernel should know it)
> - Read and parse the IB to detect memory corruption.
> - Print active waves with shader disassembly if SQ isn't hung (often it's
> not).
>
> Determining which packet the CP is stuck on is tricky. The CP has 2
> engines (one frontend and one backend) that work on the same command
> buffer. The frontend engine runs ahead, executes some packets and forwards
> others to the backend engine. Only the frontend engine has the command
> buffer VA somewhere. The backend engine only receives packets from the
> frontend engine via a FIFO, so it might not be possible to tell where it's
> stuck if it's stuck.
>
> When the gfx pipeline hangs outside of shaders, making a scandump seems to
> be the only way to have a chance at finding out what's going wrong, and
> only AMD-internal versions of hw can be scanned.
>
> Marek
>
> On Wed, May 3, 2023 at 11:23 AM Christian König <
> ckoenig.leichtzumer...@gmail.com> wrote:
>
>> Am 03.05.23 um 17:08 schrieb Felix Kuehling:
>> > Am 2023-05-03 um 03:59 schrieb Christian König:
>> >> Am 02.05.23 um 20:41 schrieb Alex Deucher:
>> >>> On Tue, May 2, 2023 at 11:22 AM Timur Kristóf
>> >>>  wrote:
>>  [SNIP]
>>  In my opinion, the correct solution to those problems would be
>>  if
>>  the kernel could give userspace the necessary information about
>>  a
>>  GPU hang before a GPU reset.
>> 
>> >>>   The fundamental problem here is that the kernel doesn't have
>> >>> that
>> >>> information either. We know which IB timed out and can
>> >>> potentially do
>> >>> a devcoredump when that happens, but that's it.
>> >>
>> >> Is it really not possible to know such a fundamental thing as what
>> >> the
>> >> GPU was doing when it hung? How are we supposed to do any kind of
>> >> debugging without knowing that?
>> >>
>> >> Yes, that's indeed something at least I try to figure out for years
>> >> as well.
>> >>
>> >> Basically there are two major problems:
>> >> 1. When the ASIC is hung you can't talk to the firmware engines any
>> >> more and most state is not exposed directly, but just through some
>> >> fw/hw interface.
>> >> Just take a look at how umr reads the shader state from the SQ.
>> >> When that block is hung you can't do that any more and basically have
>> >> no chance at all to figure out why it's hung.
>> >>
>> >> Same for other engines, I remember once spending a week figuring
>> >> out why the UVD block is hung during suspend. Turned out to be a
>> >> debugging nightmare because any time you touch any register of that
>> >> block the whole system would hang.
>> >>
>> >> 2. There are tons of things going on in a pipeline fashion or even
>> >> completely in parallel. For example the CP is just the beginning of a
>> >> rather long pipeline which at the end produces a bunch of pixels.
>> >> In almost all cases I've seen you ran into a problem somewhere
>> >> deep in the pipeline and only very rarely at the beginning.
>> >>
>> >>
>> >> I wonder what AMD's Windows driver team is doing with this problem,
>> >> surely they must have better tools to deal with GPU hangs?
>> > For better or worse, most teams internally rely on scan dumps via
>> > JTAG
>> > which sort of limits the usefulness outside of AMD, but also gives
>> > you
>> > the exact state of the hardware when it's hung so the hardware teams
>> > prefer it.
>> >
>>  How does this approach scale? It's not something we can ask users to
>>  do, and even if all of us in the radv team had a JTAG device, we
>>  wouldn't be able to play every game that users experience random
>> hangs
>>  with.
>> >>> It doesn't scale or lend itself particularly well to external
>> >>> development, but that's the current state of affairs.
>> >>
>> >> The usual approach seems to be to reproduce a problem in a lab and
>> >> have a JTAG attached to give the hw guys a scan dump and they can
>> >> then tell you why something didn't worked as expected.
>> >
>> > That's the worst-case scenario where you're debugging HW or FW issues.
>> > Those should be pretty rare post-bringup. But are there hangs caused
>> > by user mode driver or application bugs that are easier to debug and
>> > probably don't even require a GPU reset? For example most VM faults
>> > can be handled without hanging the GPU. Similarly, a shader in an
>> > endless loop should not require a full GPU reset. In the KFD compute
>> > case, 

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Marek Olšák
GPU hangs are pretty common post-bringup. They are not common per user, but
if we gather all hangs from all users, we can have lots and lots of them.

GPU hangs are indeed not very debuggable. There are however some things we
can do:
- Identify the hanging IB by its VA (the kernel should know it)
- Read and parse the IB to detect memory corruption.
- Print active waves with shader disassembly if SQ isn't hung (often it's
not).

Determining which packet the CP is stuck on is tricky. The CP has 2 engines
(one frontend and one backend) that work on the same command buffer. The
frontend engine runs ahead, executes some packets and forwards others to
the backend engine. Only the frontend engine has the command buffer VA
somewhere. The backend engine only receives packets from the frontend
engine via a FIFO, so it might not be possible to tell where it's stuck if
it's stuck.

When the gfx pipeline hangs outside of shaders, making a scandump seems to
be the only way to have a chance at finding out what's going wrong, and
only AMD-internal versions of hw can be scanned.

Marek

On Wed, May 3, 2023 at 11:23 AM Christian König <
ckoenig.leichtzumer...@gmail.com> wrote:

> Am 03.05.23 um 17:08 schrieb Felix Kuehling:
> > Am 2023-05-03 um 03:59 schrieb Christian König:
> >> Am 02.05.23 um 20:41 schrieb Alex Deucher:
> >>> On Tue, May 2, 2023 at 11:22 AM Timur Kristóf
> >>>  wrote:
>  [SNIP]
>  In my opinion, the correct solution to those problems would be
>  if
>  the kernel could give userspace the necessary information about
>  a
>  GPU hang before a GPU reset.
> 
> >>>   The fundamental problem here is that the kernel doesn't have
> >>> that
> >>> information either. We know which IB timed out and can
> >>> potentially do
> >>> a devcoredump when that happens, but that's it.
> >>
> >> Is it really not possible to know such a fundamental thing as what
> >> the
> >> GPU was doing when it hung? How are we supposed to do any kind of
> >> debugging without knowing that?
> >>
> >> Yes, that's indeed something at least I try to figure out for years
> >> as well.
> >>
> >> Basically there are two major problems:
> >> 1. When the ASIC is hung you can't talk to the firmware engines any
> >> more and most state is not exposed directly, but just through some
> >> fw/hw interface.
> >> Just take a look at how umr reads the shader state from the SQ.
> >> When that block is hung you can't do that any more and basically have
> >> no chance at all to figure out why it's hung.
> >>
> >> Same for other engines, I remember once spending a week figuring
> >> out why the UVD block is hung during suspend. Turned out to be a
> >> debugging nightmare because any time you touch any register of that
> >> block the whole system would hang.
> >>
> >> 2. There are tons of things going on in a pipeline fashion or even
> >> completely in parallel. For example the CP is just the beginning of a
> >> rather long pipeline which at the end produces a bunch of pixels.
> >> In almost all cases I've seen you ran into a problem somewhere
> >> deep in the pipeline and only very rarely at the beginning.
> >>
> >>
> >> I wonder what AMD's Windows driver team is doing with this problem,
> >> surely they must have better tools to deal with GPU hangs?
> > For better or worse, most teams internally rely on scan dumps via
> > JTAG
> > which sort of limits the usefulness outside of AMD, but also gives
> > you
> > the exact state of the hardware when it's hung so the hardware teams
> > prefer it.
> >
>  How does this approach scale? It's not something we can ask users to
>  do, and even if all of us in the radv team had a JTAG device, we
>  wouldn't be able to play every game that users experience random hangs
>  with.
> >>> It doesn't scale or lend itself particularly well to external
> >>> development, but that's the current state of affairs.
> >>
> >> The usual approach seems to be to reproduce a problem in a lab and
> >> have a JTAG attached to give the hw guys a scan dump and they can
> >> then tell you why something didn't worked as expected.
> >
> > That's the worst-case scenario where you're debugging HW or FW issues.
> > Those should be pretty rare post-bringup. But are there hangs caused
> > by user mode driver or application bugs that are easier to debug and
> > probably don't even require a GPU reset? For example most VM faults
> > can be handled without hanging the GPU. Similarly, a shader in an
> > endless loop should not require a full GPU reset. In the KFD compute
> > case, that's still preemptible and the offending process can be killed
> > with Ctrl-C or debugged with rocm-gdb.
>
> We also have infinite loop in shader abort for gfx and page faults are
> pretty rare with OpenGL (a bit more often with Vulkan) and can be
> handled gracefully on modern hw (they just spam the logs).
>
> The 

RE: [PATCH v2] drm/amdkfd: Expose proc sysfs folder contents after permission check

2023-05-03 Thread Kasiviswanathan, Harish
[AMD Official Use Only - General]

One minor comment inline.

-Original Message-
From: amd-gfx  On Behalf Of Sreekant 
Somasekharan
Sent: Friday, April 28, 2023 3:12 PM
To: amd-gfx@lists.freedesktop.org
Cc: Somasekharan, Sreekant 
Subject: [PATCH v2] drm/amdkfd: Expose proc sysfs folder contents after 
permission check

Access to proc sysfs folder/subfolder contents is permitted only
if the kfd_devcgroup_check_permission() function returns success. This
restricts users from accessing sysfs files for a process running
on a device to which the user has no access.

Signed-off-by: Sreekant Somasekharan 
---
 drivers/gpu/drm/amd/amdkfd/kfd_process.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 95cc63d9f578..8ff505d29bb4 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -275,6 +275,8 @@ static int kfd_get_cu_occupancy(struct attribute *attr, 
char *buffer)
 
pdd = container_of(attr, struct kfd_process_device, attr_cu_occupancy);
dev = pdd->dev;
+   if (dev && kfd_devcgroup_check_permission(dev))
+   return -EPERM;
if (dev->kfd2kgd->get_cu_occupancy == NULL)
return -EINVAL;
 
@@ -308,10 +310,14 @@ static ssize_t kfd_procfs_show(struct kobject *kobj, 
struct attribute *attr,
} else if (strncmp(attr->name, "vram_", 5) == 0) {
struct kfd_process_device *pdd = container_of(attr, struct 
kfd_process_device,
  attr_vram);
+   if (pdd->dev && kfd_devcgroup_check_permission(pdd->dev))
+   return -EPERM;
return snprintf(buffer, PAGE_SIZE, "%llu\n", 
READ_ONCE(pdd->vram_usage));
} else if (strncmp(attr->name, "sdma_", 5) == 0) {
struct kfd_process_device *pdd = container_of(attr, struct 
kfd_process_device,
  attr_sdma);
+   if (pdd->dev && kfd_devcgroup_check_permission(pdd->dev))
+   return -EPERM;
[HK]: Move the if condition below the following struct declaration, otherwise
the compiler will spew this warning:
warning: ISO C90 forbids mixed declarations and code
[-Wdeclaration-after-statement]
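
I.e. roughly the following, reusing the identifiers from this hunk (a sketch
only; the rest of the branch stays unchanged):

} else if (strncmp(attr->name, "sdma_", 5) == 0) {
        struct kfd_process_device *pdd = container_of(attr, struct kfd_process_device,
                                                      attr_sdma);
        struct kfd_sdma_activity_handler_workarea sdma_activity_work_handler;

        /* permission check after all local declarations */
        if (pdd->dev && kfd_devcgroup_check_permission(pdd->dev))
                return -EPERM;
        /* INIT_WORK() and the rest of the branch follow unchanged */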

struct kfd_sdma_activity_handler_workarea 
sdma_activity_work_handler;
 
INIT_WORK(&sdma_activity_work_handler.sdma_activity_work,
@@ -379,6 +385,8 @@ static ssize_t kfd_procfs_queue_show(struct kobject *kobj,
 struct attribute *attr, char *buffer)
 {
struct queue *q = container_of(kobj, struct queue, kobj);
+   if (q->device && kfd_devcgroup_check_permission(q->device))
+   return -EPERM;
 
if (!strcmp(attr->name, "size"))
return snprintf(buffer, PAGE_SIZE, "%llu",
@@ -402,6 +410,8 @@ static ssize_t kfd_procfs_stats_show(struct kobject *kobj,
attr_evict);
uint64_t evict_jiffies;
 
+   if (pdd->dev && kfd_devcgroup_check_permission(pdd->dev))
+   return -EPERM;
evict_jiffies = atomic64_read(&pdd->evict_duration_counter);
 
return snprintf(buffer,
@@ -427,16 +437,22 @@ static ssize_t kfd_sysfs_counters_show(struct kobject 
*kobj,
if (!strcmp(attr->name, "faults")) {
pdd = container_of(attr, struct kfd_process_device,
   attr_faults);
+   if (pdd->dev && kfd_devcgroup_check_permission(pdd->dev))
+   return -EPERM;
return sysfs_emit(buf, "%llu\n", READ_ONCE(pdd->faults));
}
if (!strcmp(attr->name, "page_in")) {
pdd = container_of(attr, struct kfd_process_device,
   attr_page_in);
+   if (pdd->dev && kfd_devcgroup_check_permission(pdd->dev))
+   return -EPERM;
return sysfs_emit(buf, "%llu\n", READ_ONCE(pdd->page_in));
}
if (!strcmp(attr->name, "page_out")) {
pdd = container_of(attr, struct kfd_process_device,
   attr_page_out);
+   if (pdd->dev && kfd_devcgroup_check_permission(pdd->dev))
+   return -EPERM;
return sysfs_emit(buf, "%llu\n", READ_ONCE(pdd->page_out));
}
return 0;
-- 
2.25.1


Re: [PATCH] drm/amdgpu: unlock on error in gfx_v9_4_3_kiq_resume()

2023-05-03 Thread Alex Deucher
Applied.  Thanks!

Alex

On Wed, May 3, 2023 at 11:29 AM Dan Carpenter  wrote:
>
> Smatch complains that we need to drop this lock before returning.
>
> drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c:1838 gfx_v9_4_3_kiq_resume()
> warn: inconsistent returns 'ring->mqd_obj->tbo.base.resv'.
>
> Fixes: 86301129698b ("drm/amdgpu: split gc v9_4_3 functionality from gc v9_0")
> Signed-off-by: Dan Carpenter 
> ---
> The Fixes tag is weird, but I think it's correct?
>
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> index 56a415e151d4..552729a514d3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> @@ -1827,8 +1827,10 @@ static int gfx_v9_4_3_kiq_resume(struct amdgpu_device 
> *adev, int xcc_id)
> return r;
>
> r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
> -   if (unlikely(r != 0))
> +   if (unlikely(r != 0)) {
> +   amdgpu_bo_unreserve(ring->mqd_obj);
> return r;
> +   }
>
> gfx_v9_4_3_kiq_init_queue(ring, xcc_id);
> amdgpu_bo_kunmap(ring->mqd_obj);
> --
> 2.39.2
>


Re: [PATCH] drm/amdgpu: unlock the correct lock in amdgpu_gfx_enable_kcq()

2023-05-03 Thread Alex Deucher
Applied.  Thanks!

On Wed, May 3, 2023 at 11:29 AM Dan Carpenter  wrote:
>
> We changed which lock we are supposed to take, but this error path
> was accidentally overlooked, so it still drops the old lock.
>
> Fixes: def799c6596d ("drm/amdgpu: add multi-xcc support to amdgpu_gfx 
> interfaces (v4)")
> Signed-off-by: Dan Carpenter 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index 60bb4bba1994..1de3fffae9d7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -566,7 +566,7 @@ int amdgpu_gfx_enable_kcq(struct amdgpu_device *adev, int 
> xcc_id)
> kiq->pmf->set_resources_size);
> if (r) {
> DRM_ERROR("Failed to lock KIQ (%d).\n", r);
> -   spin_unlock(&adev->gfx.kiq[0].ring_lock);
> +   spin_unlock(&kiq->ring_lock);
> return r;
> }
>
> --
> 2.39.2
>


Re: [Intel-gfx] [RFC PATCH 2/4] drm/cgroup: Add memory accounting to DRM cgroup

2023-05-03 Thread Maarten Lankhorst



On 2023-05-03 17:31, Tvrtko Ursulin wrote:


On 03/05/2023 09:34, Maarten Lankhorst wrote:

Based roughly on the rdma and misc cgroup controllers, with a lot of
the accounting code borrowed from rdma.

The interface is simple:
- populate drmcgroup_device->regions[..] name and size for each active
   region.
- Call drm(m)cg_register_device()
- Use drmcg_try_charge to check if you can allocate a chunk of memory,
   use drmcg_uncharge when freeing it. This may return an error code,
   or -EAGAIN when the cgroup limit is reached.

The ttm code transforms -EAGAIN back to -ENOSPC since it has specific
logic for -ENOSPC, and returning -EAGAIN to userspace causes drmIoctl
to restart infinitely.
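
(A minimal driver-side sketch of that flow; the function names come from the
list above, but the exact prototypes and arguments are an assumption and may
differ from the actual patch:)

	/* hypothetical allocation path: charge before committing, uncharge on failure */
	ret = drmcg_try_charge(&my_dev->drmcg, region_id, size);
	if (ret == -EAGAIN)
		ret = -ENOSPC;	/* what the ttm glue described above does */
	if (ret)
		return ret;

	ret = my_hw_allocate(my_dev, region_id, size);	/* hypothetical helper */
	if (ret)
		drmcg_uncharge(&my_dev->drmcg, region_id, size);
	return ret;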

This API allows you to limit stuff with cgroups.
You can see the supported cards in /sys/fs/cgroup/drm.capacity
You need to echo +drm to cgroup.subtree_control, and then you can
partition memory.

In each cgroup subdir:
drm.max shows the current limits of the cgroup.
drm.current the current amount of allocated memory used by this cgroup.
drm.events shows the amount of time max memory was reached.


Events is not in the patch?


Oops, correct.

I removed it since it added more complexity, and didn't seem granular 
enough to be useful.


I removed it from the documentation, but not the commit message it seems. :)




Signed-off-by: Maarten Lankhorst 
---
  Documentation/admin-guide/cgroup-v2.rst |  46 ++
  Documentation/gpu/drm-compute.rst   |  54 +++
  include/linux/cgroup_drm.h  |  81 
  kernel/cgroup/drm.c | 539 +++-
  4 files changed, 699 insertions(+), 21 deletions(-)
  create mode 100644 Documentation/gpu/drm-compute.rst

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst

index f67c0829350b..b858d99cb2ef 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2374,6 +2374,52 @@ RDMA Interface Files
    mlx4_0 hca_handle=1 hca_object=20
    ocrdma1 hca_handle=1 hca_object=23
  +DRM
+
+
+The "drm" controller regulates the distribution and accounting of
+DRM resources.
+
+DRM Interface Files
+
+
+  drm.max
+    A readwrite nested-keyed file that exists for all the cgroups
+    except root that describes current configured resource limit
+    for a DRM device.
+
+    Lines are keyed by device name and are not ordered.
+    Each line contains space separated resource name and its configured
+    limit that can be distributed.
+
+    The following nested keys are defined.
+
+  ==========   ===========================================================
+  region.*     Maximum amount of bytes that allocatable in this region
+  ==========   ===========================================================

+
+    An example for xe follows::
+
+  :03:00.0 region.vram0=1073741824 region.stolen=max
+
+  drm.capacity
+    A read-only file that describes maximum region capacity.
+    It only exists on the root cgroup. Not all memory can be
+    allocated by cgroups, as the kernel reserves some for
+    internal use.
+
+    An example for xe follows::
+
+  :03:00.0 region.vram0=8514437120 region.stolen=67108864
+
+  drm.current
+    A read-only file that describes current resource usage.
+    It exists for all the cgroup except root.
+
+    An example for xe follows::
+
+  :03:00.0 region.vram0=12550144 region.stolen=8650752
+
  HugeTLB
  ---
  diff --git a/Documentation/gpu/drm-compute.rst 
b/Documentation/gpu/drm-compute.rst

new file mode 100644
index ..116270976ef7
--- /dev/null
+++ b/Documentation/gpu/drm-compute.rst
@@ -0,0 +1,54 @@
+==
+Long running workloads and compute
+==
+
+Long running workloads (compute) are workloads that will not complete in 10
+seconds. (The time let the user wait before he reaches for the power button).
+This means that other techniques need to be used to manage those workloads,
+that cannot use fences.
+
+Some hardware may schedule compute jobs, and have no way to pre-empt them, or
+have their memory swapped out from them. Or they simply want their workload
+not to be preempted or swapped out at all.
+
+This means that it differs from what is described in driver-api/dma-buf.rst.
+
+As with normal compute jobs, dma-fence may not be used at all. In this case,
+not even to force preemption. The driver with is simply forced to unmap a BO
+from the long compute job's address space on unbind immediately, not even
+waiting for the workload to complete. Effectively this terminates the workload
+when there is no hardware support to recover.
+
+Since this is undesirable, there need to be mitigations to prevent a workload
+from being terminated. There are several possible approach, all with their
+advantages and drawbacks.
+
+The first approach you will likely try is to pin all buffers used by compute.
+This 

Re: [Intel-gfx] [RFC PATCH 2/4] drm/cgroup: Add memory accounting to DRM cgroup

2023-05-03 Thread Tvrtko Ursulin



On 03/05/2023 09:34, Maarten Lankhorst wrote:

Based roughly on the rdma and misc cgroup controllers, with a lot of
the accounting code borrowed from rdma.

The interface is simple:
- populate drmcgroup_device->regions[..] name and size for each active
   region.
- Call drm(m)cg_register_device()
- Use drmcg_try_charge to check if you can allocate a chunk of memory,
   use drmcg_uncharge when freeing it. This may return an error code,
   or -EAGAIN when the cgroup limit is reached.

The ttm code transforms -EAGAIN back to -ENOSPC since it has specific
logic for -ENOSPC, and returning -EAGAIN to userspace causes drmIoctl
to restart infinitely.

This API allows you to limit stuff with cgroups.
You can see the supported cards in /sys/fs/cgroup/drm.capacity
You need to echo +drm to cgroup.subtree_control, and then you can
partition memory.

In each cgroup subdir:
drm.max shows the current limits of the cgroup.
drm.current the current amount of allocated memory used by this cgroup.
drm.events shows the amount of time max memory was reached.


Events is not in the patch?


Signed-off-by: Maarten Lankhorst 
---
  Documentation/admin-guide/cgroup-v2.rst |  46 ++
  Documentation/gpu/drm-compute.rst   |  54 +++
  include/linux/cgroup_drm.h  |  81 
  kernel/cgroup/drm.c | 539 +++-
  4 files changed, 699 insertions(+), 21 deletions(-)
  create mode 100644 Documentation/gpu/drm-compute.rst

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
index f67c0829350b..b858d99cb2ef 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2374,6 +2374,52 @@ RDMA Interface Files
  mlx4_0 hca_handle=1 hca_object=20
  ocrdma1 hca_handle=1 hca_object=23
  
+DRM

+
+
+The "drm" controller regulates the distribution and accounting of
+DRM resources.
+
+DRM Interface Files
+
+
+  drm.max
+   A readwrite nested-keyed file that exists for all the cgroups
+   except root that describes current configured resource limit
+   for a DRM device.
+
+   Lines are keyed by device name and are not ordered.
+   Each line contains space separated resource name and its configured
+   limit that can be distributed.
+
+   The following nested keys are defined.
+
+ =====
+ region.*  Maximum amount of bytes that allocatable in this region
+ =====
+
+   An example for xe follows::
+
+ :03:00.0 region.vram0=1073741824 region.stolen=max
+
+  drm.capacity
+   A read-only file that describes maximum region capacity.
+   It only exists on the root cgroup. Not all memory can be
+   allocated by cgroups, as the kernel reserves some for
+   internal use.
+
+   An example for xe follows::
+
+ :03:00.0 region.vram0=8514437120 region.stolen=67108864
+
+  drm.current
+   A read-only file that describes current resource usage.
+   It exists for all the cgroup except root.
+
+   An example for xe follows::
+
+ :03:00.0 region.vram0=12550144 region.stolen=8650752
+
  HugeTLB
  ---
  
diff --git a/Documentation/gpu/drm-compute.rst b/Documentation/gpu/drm-compute.rst

new file mode 100644
index ..116270976ef7
--- /dev/null
+++ b/Documentation/gpu/drm-compute.rst
@@ -0,0 +1,54 @@
+==
+Long running workloads and compute
+==
+
+Long running workloads (compute) are workloads that will not complete in 10
+seconds. (The time let the user wait before he reaches for the power button).
+This means that other techniques need to be used to manage those workloads,
+that cannot use fences.
+
+Some hardware may schedule compute jobs, and have no way to pre-empt them, or
+have their memory swapped out from them. Or they simply want their workload
+not to be preempted or swapped out at all.
+
+This means that it differs from what is described in driver-api/dma-buf.rst.
+
+As with normal compute jobs, dma-fence may not be used at all. In this case,
+not even to force preemption. The driver with is simply forced to unmap a BO
+from the long compute job's address space on unbind immediately, not even
+waiting for the workload to complete. Effectively this terminates the workload
+when there is no hardware support to recover.
+
+Since this is undesirable, there need to be mitigations to prevent a workload
+from being terminated. There are several possible approach, all with their
+advantages and drawbacks.
+
+The first approach you will likely try is to pin all buffers used by compute.
+This guarantees that the job will run uninterrupted, but also allows a very
+denial of service attack by pinning as much memory as possible, hogging the
+all GPU memory, and possibly a 

[PATCH] drm/amdgpu: unlock on error in gfx_v9_4_3_kiq_resume()

2023-05-03 Thread Dan Carpenter
Smatch complains that we need to drop this lock before returning.

drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c:1838 gfx_v9_4_3_kiq_resume()
warn: inconsistent returns 'ring->mqd_obj->tbo.base.resv'.

Fixes: 86301129698b ("drm/amdgpu: split gc v9_4_3 functionality from gc v9_0")
Signed-off-by: Dan Carpenter 
---
The Fixes tag is weird, but I think it's correct?

 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
index 56a415e151d4..552729a514d3 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
@@ -1827,8 +1827,10 @@ static int gfx_v9_4_3_kiq_resume(struct amdgpu_device 
*adev, int xcc_id)
return r;
 
r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
-   if (unlikely(r != 0))
+   if (unlikely(r != 0)) {
+   amdgpu_bo_unreserve(ring->mqd_obj);
return r;
+   }
 
gfx_v9_4_3_kiq_init_queue(ring, xcc_id);
amdgpu_bo_kunmap(ring->mqd_obj);
-- 
2.39.2



[PATCH] drm/amdgpu: unlock the correct lock in amdgpu_gfx_enable_kcq()

2023-05-03 Thread Dan Carpenter
We changed which lock we are supposed to take, but this error path
was accidentally overlooked, so it still drops the old lock.

Fixes: def799c6596d ("drm/amdgpu: add multi-xcc support to amdgpu_gfx 
interfaces (v4)")
Signed-off-by: Dan Carpenter 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index 60bb4bba1994..1de3fffae9d7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -566,7 +566,7 @@ int amdgpu_gfx_enable_kcq(struct amdgpu_device *adev, int 
xcc_id)
kiq->pmf->set_resources_size);
if (r) {
DRM_ERROR("Failed to lock KIQ (%d).\n", r);
-   spin_unlock(&adev->gfx.kiq[0].ring_lock);
+   spin_unlock(&kiq->ring_lock);
return r;
}
 
-- 
2.39.2



Re: [PATCH 2/2] drm/amdgpu: drop unused function

2023-05-03 Thread Christian König

On 03.05.23 at 17:24, Alex Deucher wrote:

On Wed, May 3, 2023 at 11:20 AM Christian König
 wrote:

Reviewed-by: Christian König  for this one.

Can't say much about the first one. That was just the hack because some
bit in the IP version was re-used on SRIOV, wasn't it?

Yes, the high 2 bits of the revision number were reused for additional
information so they needed to be masked out when considering the IP
revision.  We already track those extra bits elsewhere and mask out
those bits so that IP version is never seen anymore in practice.


In this case rb for that one as well.

Christian.



Alex


Christian.

On 03.05.23 at 17:02, Alex Deucher wrote:

Ping?

On Thu, Apr 27, 2023 at 2:34 PM Alex Deucher  wrote:

amdgpu_discovery_get_ip_version() has not been used since
commit c40bdfb2ffa4 ("drm/amdgpu: fix incorrect VCN revision in SRIOV")
so drop it.

Signed-off-by: Alex Deucher 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 48 ---
   drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h |  2 -
   2 files changed, 50 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
index 76ceca05452e..b58d94dc1924 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
@@ -1208,54 +1208,6 @@ static int amdgpu_discovery_reg_base_init(struct 
amdgpu_device *adev)
  return 0;
   }

-int amdgpu_discovery_get_ip_version(struct amdgpu_device *adev, int hw_id, int 
number_instance,
-   int *major, int *minor, int *revision)
-{
-   struct binary_header *bhdr;
-   struct ip_discovery_header *ihdr;
-   struct die_header *dhdr;
-   struct ip *ip;
-   uint16_t die_offset;
-   uint16_t ip_offset;
-   uint16_t num_dies;
-   uint16_t num_ips;
-   int i, j;
-
-   if (!adev->mman.discovery_bin) {
-   DRM_ERROR("ip discovery uninitialized\n");
-   return -EINVAL;
-   }
-
-   bhdr = (struct binary_header *)adev->mman.discovery_bin;
-   ihdr = (struct ip_discovery_header *)(adev->mman.discovery_bin +
-   le16_to_cpu(bhdr->table_list[IP_DISCOVERY].offset));
-   num_dies = le16_to_cpu(ihdr->num_dies);
-
-   for (i = 0; i < num_dies; i++) {
-   die_offset = le16_to_cpu(ihdr->die_info[i].die_offset);
-   dhdr = (struct die_header *)(adev->mman.discovery_bin + 
die_offset);
-   num_ips = le16_to_cpu(dhdr->num_ips);
-   ip_offset = die_offset + sizeof(*dhdr);
-
-   for (j = 0; j < num_ips; j++) {
-   ip = (struct ip *)(adev->mman.discovery_bin + 
ip_offset);
-
-   if ((le16_to_cpu(ip->hw_id) == hw_id) && 
(ip->number_instance == number_instance)) {
-   if (major)
-   *major = ip->major;
-   if (minor)
-   *minor = ip->minor;
-   if (revision)
-   *revision = ip->revision;
-   return 0;
-   }
-   ip_offset += struct_size(ip, base_address, 
ip->num_base_address);
-   }
-   }
-
-   return -EINVAL;
-}
-
   static void amdgpu_discovery_harvest_ip(struct amdgpu_device *adev)
   {
  int vcn_harvest_count = 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
index 8563dd4a7dc2..63ec6924b907 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
@@ -28,8 +28,6 @@
   #define DISCOVERY_TMR_OFFSET(64 << 10)

   void amdgpu_discovery_fini(struct amdgpu_device *adev);
-int amdgpu_discovery_get_ip_version(struct amdgpu_device *adev, int hw_id, int 
number_instance,
-int *major, int *minor, int *revision);
   int amdgpu_discovery_set_ip_blocks(struct amdgpu_device *adev);

   #endif /* __AMDGPU_DISCOVERY__ */
--
2.40.0





Re: [PATCH 2/2] drm/amdgpu: drop unused function

2023-05-03 Thread Alex Deucher
On Wed, May 3, 2023 at 11:20 AM Christian König
 wrote:
>
> Reviewed-by: Christian König  for this one.
>
> Can't say much about the first one. That was just the hack because some
> bit in the IP version was re-used on SRIOV, wasn't it?

Yes, the high 2 bits of the revision number were reused for additional
information so they needed to be masked out when considering the IP
revision.  We already track those extra bits elsewhere and mask out
those bits so that IP version is never seen anymore in practice.
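
(From memory, the discovery parser does roughly the following for VCN; treat
the exact mask and field names as illustrative rather than authoritative:)

	/* top two bits of ip->revision carry extra (e.g. harvest) info on SRIOV;
	 * stash them and strip them before the value is used as an IP revision */
	adev->vcn.vcn_config[adev->vcn.num_vcn_inst] = ip->revision & 0xc0;
	ip->revision &= ~0xc0;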

Alex

>
> Christian.
>
> On 03.05.23 at 17:02, Alex Deucher wrote:
> > Ping?
> >
> > On Thu, Apr 27, 2023 at 2:34 PM Alex Deucher  
> > wrote:
> >> amdgpu_discovery_get_ip_version() has not been used since
> >> commit c40bdfb2ffa4 ("drm/amdgpu: fix incorrect VCN revision in SRIOV")
> >> so drop it.
> >>
> >> Signed-off-by: Alex Deucher 
> >> ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 48 ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h |  2 -
> >>   2 files changed, 50 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> >> index 76ceca05452e..b58d94dc1924 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> >> @@ -1208,54 +1208,6 @@ static int amdgpu_discovery_reg_base_init(struct 
> >> amdgpu_device *adev)
> >>  return 0;
> >>   }
> >>
> >> -int amdgpu_discovery_get_ip_version(struct amdgpu_device *adev, int 
> >> hw_id, int number_instance,
> >> -   int *major, int *minor, int *revision)
> >> -{
> >> -   struct binary_header *bhdr;
> >> -   struct ip_discovery_header *ihdr;
> >> -   struct die_header *dhdr;
> >> -   struct ip *ip;
> >> -   uint16_t die_offset;
> >> -   uint16_t ip_offset;
> >> -   uint16_t num_dies;
> >> -   uint16_t num_ips;
> >> -   int i, j;
> >> -
> >> -   if (!adev->mman.discovery_bin) {
> >> -   DRM_ERROR("ip discovery uninitialized\n");
> >> -   return -EINVAL;
> >> -   }
> >> -
> >> -   bhdr = (struct binary_header *)adev->mman.discovery_bin;
> >> -   ihdr = (struct ip_discovery_header *)(adev->mman.discovery_bin +
> >> -   
> >> le16_to_cpu(bhdr->table_list[IP_DISCOVERY].offset));
> >> -   num_dies = le16_to_cpu(ihdr->num_dies);
> >> -
> >> -   for (i = 0; i < num_dies; i++) {
> >> -   die_offset = le16_to_cpu(ihdr->die_info[i].die_offset);
> >> -   dhdr = (struct die_header *)(adev->mman.discovery_bin + 
> >> die_offset);
> >> -   num_ips = le16_to_cpu(dhdr->num_ips);
> >> -   ip_offset = die_offset + sizeof(*dhdr);
> >> -
> >> -   for (j = 0; j < num_ips; j++) {
> >> -   ip = (struct ip *)(adev->mman.discovery_bin + 
> >> ip_offset);
> >> -
> >> -   if ((le16_to_cpu(ip->hw_id) == hw_id) && 
> >> (ip->number_instance == number_instance)) {
> >> -   if (major)
> >> -   *major = ip->major;
> >> -   if (minor)
> >> -   *minor = ip->minor;
> >> -   if (revision)
> >> -   *revision = ip->revision;
> >> -   return 0;
> >> -   }
> >> -   ip_offset += struct_size(ip, base_address, 
> >> ip->num_base_address);
> >> -   }
> >> -   }
> >> -
> >> -   return -EINVAL;
> >> -}
> >> -
> >>   static void amdgpu_discovery_harvest_ip(struct amdgpu_device *adev)
> >>   {
> >>  int vcn_harvest_count = 0;
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h 
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
> >> index 8563dd4a7dc2..63ec6924b907 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
> >> @@ -28,8 +28,6 @@
> >>   #define DISCOVERY_TMR_OFFSET(64 << 10)
> >>
> >>   void amdgpu_discovery_fini(struct amdgpu_device *adev);
> >> -int amdgpu_discovery_get_ip_version(struct amdgpu_device *adev, int 
> >> hw_id, int number_instance,
> >> -int *major, int *minor, int 
> >> *revision);
> >>   int amdgpu_discovery_set_ip_blocks(struct amdgpu_device *adev);
> >>
> >>   #endif /* __AMDGPU_DISCOVERY__ */
> >> --
> >> 2.40.0
> >>
>


Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Christian König

Am 03.05.23 um 17:08 schrieb Felix Kuehling:

Am 2023-05-03 um 03:59 schrieb Christian König:

Am 02.05.23 um 20:41 schrieb Alex Deucher:
On Tue, May 2, 2023 at 11:22 AM Timur Kristóf 
 wrote:

[SNIP]

In my opinion, the correct solution to those problems would be
if
the kernel could give userspace the necessary information about
a
GPU hang before a GPU reset.


  The fundamental problem here is that the kernel doesn't have
that
information either. We know which IB timed out and can
potentially do
a devcoredump when that happens, but that's it.


Is it really not possible to know such a fundamental thing as what
the
GPU was doing when it hung? How are we supposed to do any kind of
debugging without knowing that?


Yes, that's indeed something at least I try to figure out for years 
as well.


Basically there are two major problems:
1. When the ASIC is hung you can't talk to the firmware engines any 
more and most state is not exposed directly, but just through some 
fw/hw interface.
    Just take a look at how umr reads the shader state from the SQ. 
When that block is hung you can't do that any more and basically have 
no chance at all to figure out why it's hung.


    Same for other engines, I remember once spending a week figuring 
out why the UVD block was hung during suspend. It turned out to be a 
debugging nightmare because any time you touched any register of that 
block the whole system would hang.


2. There are tons of things going on in a pipeline fashion or even 
completely in parallel. For example the CP is just the beginning of a 
rather long pipeline which at the end produces a bunch of pixels.
    In almost all cases I've seen you ran into a problem somewhere 
deep in the pipeline and only very rarely at the beginning.




I wonder what AMD's Windows driver team is doing with this problem,
surely they must have better tools to deal with GPU hangs?

For better or worse, most teams internally rely on scan dumps via
JTAG
which sort of limits the usefulness outside of AMD, but also gives
you
the exact state of the hardware when it's hung so the hardware teams
prefer it.


How does this approach scale? It's not something we can ask users to
do, and even if all of us in the radv team had a JTAG device, we
wouldn't be able to play every game that users experience random hangs
with.

It doesn't scale or lend itself particularly well to external
development, but that's the current state of affairs.


The usual approach seems to be to reproduce a problem in a lab and 
have a JTAG attached to give the hw guys a scan dump and they can 
then tell you why something didn't work as expected.


That's the worst-case scenario where you're debugging HW or FW issues. 
Those should be pretty rare post-bringup. But are there hangs caused 
by user mode driver or application bugs that are easier to debug and 
probably don't even require a GPU reset? For example most VM faults 
can be handled without hanging the GPU. Similarly, a shader in an 
endless loop should not require a full GPU reset. In the KFD compute 
case, that's still preemptible and the offending process can be killed 
with Ctrl-C or debugged with rocm-gdb.


We also have an infinite-loop-in-shader abort for gfx, and page faults are 
pretty rare with OpenGL (a bit more often with Vulkan) and can be 
handled gracefully on modern hw (they just spam the logs).


The majority of the problems are unfortunately hard hangs caused by some 
hw issue. Those can be caused by unlucky timing, power management, or 
doing things in an order the hw doesn't expect.


Regards,
Christian.



It's more complicated for graphics because of the more complex 
pipeline and the lack of CWSR. But it should still be possible to do 
some debugging without JTAG if the problem is in SW and not HW or FW. 
It's probably worth improving that debugability without getting 
hung-up on the worst case.


Maybe user mode graphics queues will offer a better way of recovering 
from these kinds of bugs, if the graphics pipeline can be unstuck 
without a GPU reset, just by killing the offending user mode queue.


Regards,
  Felix




And yes that absolutely doesn't scale.

Christian.



Alex






Re: [PATCH 2/2] drm/amdgpu: drop unused function

2023-05-03 Thread Christian König

Reviewed-by: Christian König  for this one.

Can't say much about the first one. That was just the hack because some 
bit in the IP version was re-used on SRIOV, wasn't it?


Christian.

Am 03.05.23 um 17:02 schrieb Alex Deucher:

Ping?

On Thu, Apr 27, 2023 at 2:34 PM Alex Deucher  wrote:

amdgpu_discovery_get_ip_version() has not been used since
commit c40bdfb2ffa4 ("drm/amdgpu: fix incorrect VCN revision in SRIOV")
so drop it.

Signed-off-by: Alex Deucher 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 48 ---
  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h |  2 -
  2 files changed, 50 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
index 76ceca05452e..b58d94dc1924 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
@@ -1208,54 +1208,6 @@ static int amdgpu_discovery_reg_base_init(struct 
amdgpu_device *adev)
 return 0;
  }

-int amdgpu_discovery_get_ip_version(struct amdgpu_device *adev, int hw_id, int 
number_instance,
-   int *major, int *minor, int *revision)
-{
-   struct binary_header *bhdr;
-   struct ip_discovery_header *ihdr;
-   struct die_header *dhdr;
-   struct ip *ip;
-   uint16_t die_offset;
-   uint16_t ip_offset;
-   uint16_t num_dies;
-   uint16_t num_ips;
-   int i, j;
-
-   if (!adev->mman.discovery_bin) {
-   DRM_ERROR("ip discovery uninitialized\n");
-   return -EINVAL;
-   }
-
-   bhdr = (struct binary_header *)adev->mman.discovery_bin;
-   ihdr = (struct ip_discovery_header *)(adev->mman.discovery_bin +
-   le16_to_cpu(bhdr->table_list[IP_DISCOVERY].offset));
-   num_dies = le16_to_cpu(ihdr->num_dies);
-
-   for (i = 0; i < num_dies; i++) {
-   die_offset = le16_to_cpu(ihdr->die_info[i].die_offset);
-   dhdr = (struct die_header *)(adev->mman.discovery_bin + 
die_offset);
-   num_ips = le16_to_cpu(dhdr->num_ips);
-   ip_offset = die_offset + sizeof(*dhdr);
-
-   for (j = 0; j < num_ips; j++) {
-   ip = (struct ip *)(adev->mman.discovery_bin + 
ip_offset);
-
-   if ((le16_to_cpu(ip->hw_id) == hw_id) && 
(ip->number_instance == number_instance)) {
-   if (major)
-   *major = ip->major;
-   if (minor)
-   *minor = ip->minor;
-   if (revision)
-   *revision = ip->revision;
-   return 0;
-   }
-   ip_offset += struct_size(ip, base_address, 
ip->num_base_address);
-   }
-   }
-
-   return -EINVAL;
-}
-
  static void amdgpu_discovery_harvest_ip(struct amdgpu_device *adev)
  {
 int vcn_harvest_count = 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
index 8563dd4a7dc2..63ec6924b907 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
@@ -28,8 +28,6 @@
  #define DISCOVERY_TMR_OFFSET(64 << 10)

  void amdgpu_discovery_fini(struct amdgpu_device *adev);
-int amdgpu_discovery_get_ip_version(struct amdgpu_device *adev, int hw_id, int 
number_instance,
-int *major, int *minor, int *revision);
  int amdgpu_discovery_set_ip_blocks(struct amdgpu_device *adev);

  #endif /* __AMDGPU_DISCOVERY__ */
--
2.40.0





Re: [PATCH 2/2] drm/amdgpu: drop unused function

2023-05-03 Thread Luben Tuikov
I suppose we have this information elsewhere.

Series is:
Reviewed-by: Luben Tuikov 

Regards,
Luben

On 2023-05-03 11:02, Alex Deucher wrote:
> Ping?
> 
> On Thu, Apr 27, 2023 at 2:34 PM Alex Deucher  
> wrote:
>>
>> amdgpu_discovery_get_ip_version() has not been used since
>> commit c40bdfb2ffa4 ("drm/amdgpu: fix incorrect VCN revision in SRIOV")
>> so drop it.
>>
>> Signed-off-by: Alex Deucher 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 48 ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h |  2 -
>>  2 files changed, 50 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
>> index 76ceca05452e..b58d94dc1924 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
>> @@ -1208,54 +1208,6 @@ static int amdgpu_discovery_reg_base_init(struct 
>> amdgpu_device *adev)
>> return 0;
>>  }
>>
>> -int amdgpu_discovery_get_ip_version(struct amdgpu_device *adev, int hw_id, 
>> int number_instance,
>> -   int *major, int *minor, int *revision)
>> -{
>> -   struct binary_header *bhdr;
>> -   struct ip_discovery_header *ihdr;
>> -   struct die_header *dhdr;
>> -   struct ip *ip;
>> -   uint16_t die_offset;
>> -   uint16_t ip_offset;
>> -   uint16_t num_dies;
>> -   uint16_t num_ips;
>> -   int i, j;
>> -
>> -   if (!adev->mman.discovery_bin) {
>> -   DRM_ERROR("ip discovery uninitialized\n");
>> -   return -EINVAL;
>> -   }
>> -
>> -   bhdr = (struct binary_header *)adev->mman.discovery_bin;
>> -   ihdr = (struct ip_discovery_header *)(adev->mman.discovery_bin +
>> -   le16_to_cpu(bhdr->table_list[IP_DISCOVERY].offset));
>> -   num_dies = le16_to_cpu(ihdr->num_dies);
>> -
>> -   for (i = 0; i < num_dies; i++) {
>> -   die_offset = le16_to_cpu(ihdr->die_info[i].die_offset);
>> -   dhdr = (struct die_header *)(adev->mman.discovery_bin + 
>> die_offset);
>> -   num_ips = le16_to_cpu(dhdr->num_ips);
>> -   ip_offset = die_offset + sizeof(*dhdr);
>> -
>> -   for (j = 0; j < num_ips; j++) {
>> -   ip = (struct ip *)(adev->mman.discovery_bin + 
>> ip_offset);
>> -
>> -   if ((le16_to_cpu(ip->hw_id) == hw_id) && 
>> (ip->number_instance == number_instance)) {
>> -   if (major)
>> -   *major = ip->major;
>> -   if (minor)
>> -   *minor = ip->minor;
>> -   if (revision)
>> -   *revision = ip->revision;
>> -   return 0;
>> -   }
>> -   ip_offset += struct_size(ip, base_address, 
>> ip->num_base_address);
>> -   }
>> -   }
>> -
>> -   return -EINVAL;
>> -}
>> -
>>  static void amdgpu_discovery_harvest_ip(struct amdgpu_device *adev)
>>  {
>> int vcn_harvest_count = 0;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
>> index 8563dd4a7dc2..63ec6924b907 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
>> @@ -28,8 +28,6 @@
>>  #define DISCOVERY_TMR_OFFSET(64 << 10)
>>
>>  void amdgpu_discovery_fini(struct amdgpu_device *adev);
>> -int amdgpu_discovery_get_ip_version(struct amdgpu_device *adev, int hw_id, 
>> int number_instance,
>> -int *major, int *minor, int *revision);
>>  int amdgpu_discovery_set_ip_blocks(struct amdgpu_device *adev);
>>
>>  #endif /* __AMDGPU_DISCOVERY__ */
>> --
>> 2.40.0
>>



Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Felix Kuehling

Am 2023-05-03 um 03:59 schrieb Christian König:

Am 02.05.23 um 20:41 schrieb Alex Deucher:
On Tue, May 2, 2023 at 11:22 AM Timur Kristóf 
 wrote:

[SNIP]

In my opinion, the correct solution to those problems would be
if
the kernel could give userspace the necessary information about
a
GPU hang before a GPU reset.


  The fundamental problem here is that the kernel doesn't have
that
information either. We know which IB timed out and can
potentially do
a devcoredump when that happens, but that's it.


Is it really not possible to know such a fundamental thing as what
the
GPU was doing when it hung? How are we supposed to do any kind of
debugging without knowing that?


Yes, that's indeed something at least I try to figure out for years as 
well.


Basically there are two major problems:
1. When the ASIC is hung you can't talk to the firmware engines any 
more and most state is not exposed directly, but just through some 
fw/hw interface.
    Just take a look at how umr reads the shader state from the SQ. 
When that block is hung you can't do that any more and basically have 
no chance at all to figure out why it's hung.


    Same for other engines, I remember once spending a week figuring 
out why the UVD block was hung during suspend. It turned out to be a 
debugging nightmare because any time you touched any register of that 
block the whole system would hang.


2. There are tons of things going on in a pipeline fashion or even 
completely in parallel. For example the CP is just the beginning of a 
rather long pipeline which at the end produces a bunch of pixels.
    In almost all cases I've seen you ran into a problem somewhere 
deep in the pipeline and only very rarely at the beginning.




I wonder what AMD's Windows driver team is doing with this problem,
surely they must have better tools to deal with GPU hangs?

For better or worse, most teams internally rely on scan dumps via
JTAG
which sort of limits the usefulness outside of AMD, but also gives
you
the exact state of the hardware when it's hung so the hardware teams
prefer it.


How does this approach scale? It's not something we can ask users to
do, and even if all of us in the radv team had a JTAG device, we
wouldn't be able to play every game that users experience random hangs
with.

It doesn't scale or lend itself particularly well to external
development, but that's the current state of affairs.


The usual approach seems to be to reproduce a problem in a lab and 
have a JTAG attached to give the hw guys a scan dump and they can then 
tell you why something didn't work as expected.


That's the worst-case scenario where you're debugging HW or FW issues. 
Those should be pretty rare post-bringup. But are there hangs caused by 
user mode driver or application bugs that are easier to debug and 
probably don't even require a GPU reset? For example most VM faults can 
be handled without hanging the GPU. Similarly, a shader in an endless 
loop should not require a full GPU reset. In the KFD compute case, 
that's still preemptible and the offending process can be killed with 
Ctrl-C or debugged with rocm-gdb.


It's more complicated for graphics because of the more complex pipeline 
and the lack of CWSR. But it should still be possible to do some 
debugging without JTAG if the problem is in SW and not HW or FW. It's 
probably worth improving that debugability without getting hung-up on 
the worst case.


Maybe user mode graphics queues will offer a better way of recovering 
from these kinds of bugs, if the graphics pipeline can be unstuck 
without a GPU reset, just by killing the offending user mode queue.


Regards,
  Felix




And yes that absolutely doesn't scale.

Christian.



Alex




Re: [PATCH 2/2] drm/amdgpu: drop unused function

2023-05-03 Thread Alex Deucher
Ping?

On Thu, Apr 27, 2023 at 2:34 PM Alex Deucher  wrote:
>
> amdgpu_discovery_get_ip_version() has not been used since
> commit c40bdfb2ffa4 ("drm/amdgpu: fix incorrect VCN revision in SRIOV")
> so drop it.
>
> Signed-off-by: Alex Deucher 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 48 ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h |  2 -
>  2 files changed, 50 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> index 76ceca05452e..b58d94dc1924 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> @@ -1208,54 +1208,6 @@ static int amdgpu_discovery_reg_base_init(struct 
> amdgpu_device *adev)
> return 0;
>  }
>
> -int amdgpu_discovery_get_ip_version(struct amdgpu_device *adev, int hw_id, 
> int number_instance,
> -   int *major, int *minor, int *revision)
> -{
> -   struct binary_header *bhdr;
> -   struct ip_discovery_header *ihdr;
> -   struct die_header *dhdr;
> -   struct ip *ip;
> -   uint16_t die_offset;
> -   uint16_t ip_offset;
> -   uint16_t num_dies;
> -   uint16_t num_ips;
> -   int i, j;
> -
> -   if (!adev->mman.discovery_bin) {
> -   DRM_ERROR("ip discovery uninitialized\n");
> -   return -EINVAL;
> -   }
> -
> -   bhdr = (struct binary_header *)adev->mman.discovery_bin;
> -   ihdr = (struct ip_discovery_header *)(adev->mman.discovery_bin +
> -   le16_to_cpu(bhdr->table_list[IP_DISCOVERY].offset));
> -   num_dies = le16_to_cpu(ihdr->num_dies);
> -
> -   for (i = 0; i < num_dies; i++) {
> -   die_offset = le16_to_cpu(ihdr->die_info[i].die_offset);
> -   dhdr = (struct die_header *)(adev->mman.discovery_bin + 
> die_offset);
> -   num_ips = le16_to_cpu(dhdr->num_ips);
> -   ip_offset = die_offset + sizeof(*dhdr);
> -
> -   for (j = 0; j < num_ips; j++) {
> -   ip = (struct ip *)(adev->mman.discovery_bin + 
> ip_offset);
> -
> -   if ((le16_to_cpu(ip->hw_id) == hw_id) && 
> (ip->number_instance == number_instance)) {
> -   if (major)
> -   *major = ip->major;
> -   if (minor)
> -   *minor = ip->minor;
> -   if (revision)
> -   *revision = ip->revision;
> -   return 0;
> -   }
> -   ip_offset += struct_size(ip, base_address, 
> ip->num_base_address);
> -   }
> -   }
> -
> -   return -EINVAL;
> -}
> -
>  static void amdgpu_discovery_harvest_ip(struct amdgpu_device *adev)
>  {
> int vcn_harvest_count = 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
> index 8563dd4a7dc2..63ec6924b907 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.h
> @@ -28,8 +28,6 @@
>  #define DISCOVERY_TMR_OFFSET(64 << 10)
>
>  void amdgpu_discovery_fini(struct amdgpu_device *adev);
> -int amdgpu_discovery_get_ip_version(struct amdgpu_device *adev, int hw_id, 
> int number_instance,
> -int *major, int *minor, int *revision);
>  int amdgpu_discovery_set_ip_blocks(struct amdgpu_device *adev);
>
>  #endif /* __AMDGPU_DISCOVERY__ */
> --
> 2.40.0
>


Re: [PATCH 1/2] drm/amdgpu: drop invalid IP revision

2023-05-03 Thread Alex Deucher
Ping?

On Thu, Apr 27, 2023 at 2:34 PM Alex Deucher  wrote:
>
> This was already fixed and dropped in:
> commit baf3f8f37406 ("drm/amdgpu: handle SRIOV VCN revision parsing")
> commit c40bdfb2ffa4 ("drm/amdgpu: fix incorrect VCN revision in SRIOV")
> But it seems to have been accidentally left around in a merge.
>
> Signed-off-by: Alex Deucher 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> index 0ba013275dc1..76ceca05452e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> @@ -1939,7 +1939,6 @@ static int amdgpu_discovery_set_mm_ip_blocks(struct 
> amdgpu_device *adev)
> case IP_VERSION(3, 1, 1):
> case IP_VERSION(3, 1, 2):
> case IP_VERSION(3, 0, 2):
> -   case IP_VERSION(3, 0, 192):
> amdgpu_device_ip_block_add(adev, _v3_0_ip_block);
> if (!amdgpu_sriov_vf(adev))
> amdgpu_device_ip_block_add(adev, 
> _v3_0_ip_block);
> --
> 2.40.0
>


Re: [PATCH] drm/amdgpu: put MQDs in VRAM

2023-05-03 Thread Luben Tuikov
Reviewed-by: Luben Tuikov 

Regards,
Luben

On 2023-05-01 10:55, Alex Deucher wrote:
> Ping?
> 
> Alex
> 
> On Fri, Apr 28, 2023 at 11:57 AM Alex Deucher  
> wrote:
>>
>> Reduces preemption latency.
>> Only enable this for gfx10 and 11 for now
>> to avoid changing behavior on gfx 8 and 9.
>>
>> v2: move MES MQDs into VRAM as well (YuBiao)
>> v3: enable on gfx10, 11 only (Alex)
>> v4: minor style changes, document why gfx10/11 only (Alex)
>>
>> Signed-off-by: Alex Deucher 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 9 +++--
>>  drivers/gpu/drm/amd/amdgpu/mes_v10_1.c  | 1 +
>>  drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  | 1 +
>>  3 files changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> index 90f5d302d5f3..b91be56ba773 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> @@ -382,6 +382,11 @@ int amdgpu_gfx_mqd_sw_init(struct amdgpu_device *adev,
>> int r, i, j;
>> struct amdgpu_kiq *kiq = >gfx.kiq[xcc_id];
>> struct amdgpu_ring *ring = >ring;
>> +   u32 domain = AMDGPU_GEM_DOMAIN_GTT;
>> +
>> +   /* Only enable on gfx10 and 11 for now to avoid changing behavior on 
>> older chips */
>> +   if (adev->ip_versions[GC_HWIP][0] >= IP_VERSION(10, 0, 0))
>> +   domain |= AMDGPU_GEM_DOMAIN_VRAM;
>>
>> /* create MQD for KIQ */
>> if (!adev->enable_mes_kiq && !ring->mqd_obj) {
>> @@ -413,7 +418,7 @@ int amdgpu_gfx_mqd_sw_init(struct amdgpu_device *adev,
>> ring = >gfx.gfx_ring[i];
>> if (!ring->mqd_obj) {
>> r = amdgpu_bo_create_kernel(adev, mqd_size, 
>> PAGE_SIZE,
>> -   
>> AMDGPU_GEM_DOMAIN_GTT, >mqd_obj,
>> +   domain, 
>> >mqd_obj,
>> 
>> >mqd_gpu_addr, >mqd_ptr);
>> if (r) {
>> dev_warn(adev->dev, "failed to 
>> create ring mqd bo (%d)", r);
>> @@ -435,7 +440,7 @@ int amdgpu_gfx_mqd_sw_init(struct amdgpu_device *adev,
>> ring = >gfx.compute_ring[j];
>> if (!ring->mqd_obj) {
>> r = amdgpu_bo_create_kernel(adev, mqd_size, 
>> PAGE_SIZE,
>> -   AMDGPU_GEM_DOMAIN_GTT, 
>> >mqd_obj,
>> +   domain, >mqd_obj,
>> >mqd_gpu_addr, 
>> >mqd_ptr);
>> if (r) {
>> dev_warn(adev->dev, "failed to create ring 
>> mqd bo (%d)", r);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c 
>> b/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c
>> index 0599f8a6813e..4560476c7c31 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c
>> @@ -901,6 +901,7 @@ static int mes_v10_1_mqd_sw_init(struct amdgpu_device 
>> *adev,
>> return 0;
>>
>> r = amdgpu_bo_create_kernel(adev, mqd_size, PAGE_SIZE,
>> +   AMDGPU_GEM_DOMAIN_VRAM |
>> AMDGPU_GEM_DOMAIN_GTT, >mqd_obj,
>> >mqd_gpu_addr, >mqd_ptr);
>> if (r) {
>> diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c 
>> b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
>> index e853bcb892fc..3adb450eec07 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
>> @@ -999,6 +999,7 @@ static int mes_v11_0_mqd_sw_init(struct amdgpu_device 
>> *adev,
>> return 0;
>>
>> r = amdgpu_bo_create_kernel(adev, mqd_size, PAGE_SIZE,
>> +   AMDGPU_GEM_DOMAIN_VRAM |
>> AMDGPU_GEM_DOMAIN_GTT, >mqd_obj,
>> >mqd_gpu_addr, >mqd_ptr);
>> if (r) {
>> --
>> 2.40.0
>>



Re: [Intel-xe] [RFC PATCH 3/4] drm/ttm: Handle -EAGAIN in ttm_resource_alloc as -ENOSPC.

2023-05-03 Thread Thomas Hellström

Hi, Maarten

On 5/3/23 10:34, Maarten Lankhorst wrote:

This allows the drm cgroup controller to signal that no space is available.

XXX: This is a hopeless simplification that changes behavior, and
returns -ENOSPC even if we could evict ourselves from the current
cgroup.

Ideally, the eviction code becomes cgroup aware, and will force eviction
from the current cgroup or its parents.

Signed-off-by: Maarten Lankhorst 


Thinking of the shrinker analogy, do non-cgroup aware shrinkers just 
shrink blindly or do they reject shrinking like this patch when a cgroup 
limit is reached?


/Thomas



---
  drivers/gpu/drm/ttm/ttm_bo.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index bd5dae4d1624..e057d5d8f09a 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -731,6 +731,8 @@ static int ttm_bo_mem_force_space(struct ttm_buffer_object 
*bo,
ret = ttm_resource_alloc(bo, place, mem);
if (likely(!ret))
break;
+   if (ret == -EAGAIN)
+   return -ENOSPC;
if (unlikely(ret != -ENOSPC))
return ret;
ret = ttm_mem_evict_first(bdev, man, place, ctx,
@@ -783,7 +785,7 @@ int ttm_bo_mem_space(struct ttm_buffer_object *bo,
  
  		type_found = true;

ret = ttm_resource_alloc(bo, place, mem);
-   if (ret == -ENOSPC)
+   if (ret == -ENOSPC || ret == -EAGAIN)
continue;
if (unlikely(ret))
goto error;


[PATCH] drm/amdgpu: drop redundant sched job cleanup when cs is aborted

2023-05-03 Thread Guchun Chen
Once command submission failed due to userptr invalidation in
amdgpu_cs_submit, legacy code will perform cleanup of scheduler
job. However, it's not needed at all, as former commit has integrated
job cleanup stuff into amdgpu_job_free. Otherwise, because of double
free, a NULL pointer dereference will occur in such scenario.

Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/2457
Fixes: f7d66fb2ea43 ("drm/amdgpu: cleanup scheduler job initialization v2")
Signed-off-by: Guchun Chen 
Reviewed-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 13 +++--
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index cb771c73cd07..9879aac3bcdb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1303,7 +1303,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
r = drm_sched_job_add_dependency(>base, fence);
if (r) {
dma_fence_put(fence);
-   goto error_cleanup;
+   return r;
}
}
 
@@ -1330,7 +1330,8 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
}
if (r) {
r = -EAGAIN;
-   goto error_unlock;
+   mutex_unlock(>adev->notifier_lock);
+   return r;
}
 
p->fence = dma_fence_get(>base.s_fence->finished);
@@ -1377,14 +1378,6 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
mutex_unlock(>adev->notifier_lock);
mutex_unlock(>bo_list->bo_list_mutex);
return 0;
-
-error_unlock:
-   mutex_unlock(>adev->notifier_lock);
-
-error_cleanup:
-   for (i = 0; i < p->gang_size; ++i)
-   drm_sched_job_cleanup(>jobs[i]->base);
-   return r;
 }
 
 /* Cleanup the parser structure */
-- 
2.25.1



Re: [Intel-xe] [RFC PATCH 3/4] drm/ttm: Handle -EAGAIN in ttm_resource_alloc as -ENOSPC.

2023-05-03 Thread Maarten Lankhorst


On 2023-05-03 11:11, Thomas Hellström wrote:

Hi, Maarten

On 5/3/23 10:34, Maarten Lankhorst wrote:

This allows the drm cgroup controller to signal that no space is available.

XXX: This is a hopeless simplification that changes behavior, and
returns -ENOSPC even if we could evict ourselves from the current
cgroup.

Ideally, the eviction code becomes cgroup aware, and will force eviction
from the current cgroup or its parents.

Signed-off-by: Maarten Lankhorst 


Thinking of the shrinker analogy, do non-cgroup aware shrinkers just 
shrink blindly or do they reject shrinking like this patch when a 
cgroup limit is reached?


When I made the cgroup controller return -ENOSPC I just hit an infinite 
loop since it sees enough memory is free and tries to allocate memory 
again. Hence the -EAGAIN handling here. It returns -ENOSPC, without the 
infinite looping.


I think there should be 2 code paths:

- OOM, generic case: Handle like we do now. No need for special cgroup 
handling needed right now.


Might change if we implement cgroup memory semantics. See the memory 
section of Documentation/admin-guide/cgroup-v2.rst


It could be useful regardless.

- OOM, cgroup limit reached: check for each BO whether evicting it would help 
unblock the relevant limit.


root
 +- cg1
 |   +- cg1.0
 |   +- cg1.1
 |   +- cg1.2
 +- cg2

If we hit the cg limit in cg1.0 for only cg1.0, it doesn't make sense to evict 
from any other cgroup.
If we hit the limit in cg1.0 for the entirety of cg1, it makes sense to evict 
from any of the cg1 nodes, but not from cg2.

This should be relatively straightforward to implement. We identify which 
cgroup hit a limit, and then let the shrinker
run only on that cgroup and its children.

This could be simplified to the OOM generic case, for root/NULL cg.
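
A rough, self-contained sketch of that victim selection (all types and
helpers below are made up for illustration; this is neither TTM's nor the
cgroup controller's real API): walk the LRU and only consider BOs whose
charge sits in the subtree of the cgroup that hit its limit, with the
plain OOM case falling out when no limiting cgroup is given.

#include <stdbool.h>
#include <stddef.h>

struct cg {                     /* stand-in for a drm cgroup node */
	struct cg *parent;
};

struct bo {                     /* stand-in for a buffer object on an LRU list */
	struct cg *charged_cg;  /* cgroup this BO's memory is charged to */
	struct bo *next;
};

/* True if node equals ancestor or is one of its descendants. */
static bool cg_in_subtree(const struct cg *node, const struct cg *ancestor)
{
	for (; node; node = node->parent)
		if (node == ancestor)
			return true;
	return false;
}

/*
 * Pick the first LRU entry that would actually relieve the limit that was
 * hit: only BOs charged inside limit_cg's subtree qualify.  Passing a NULL
 * limit_cg degrades to the ordinary "evict anything" OOM behaviour.
 * A NULL return means nothing in that subtree is left to evict (-ENOSPC).
 */
static struct bo *pick_eviction_victim(struct bo *lru_head, const struct cg *limit_cg)
{
	struct bo *bo;

	for (bo = lru_head; bo; bo = bo->next)
		if (!limit_cg || cg_in_subtree(bo->charged_cg, limit_cg))
			return bo;
	return NULL;
}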


~Maarten


Re: [PATCH v2] drm/amd/amdgpu: Fix errors & warnings in amdgpu _bios, _cs, _dma_buf, _fence.c

2023-05-03 Thread Christian König

Am 03.05.23 um 11:00 schrieb Srinivasan Shanmugam:

The following checkpatch errors & warnings are removed.

ERROR: else should follow close brace '}'
ERROR: trailing statements should be on next line
WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
WARNING: Possible repeated word: 'Fences'
WARNING: Missing a blank line after declarations
WARNING: braces {} are not necessary for single statement blocks
WARNING: Comparisons should place the constant on the right side of the test
WARNING: printk() should include KERN_ facility level

Cc: Christian König 
Cc: Alex Deucher 
Signed-off-by: Srinivasan Shanmugam 


Reviewed-by: Christian König 


---

v2:
  - Best to make the word "Fences" one line from that. E.g. "/* Fences mark an
event...". (Christian)

  drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c| 16 +++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 16 
  drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c |  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c   | 14 --
  4 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
index 30c28a69e847..b582b83c4984 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
@@ -104,9 +104,8 @@ static bool igp_read_bios_from_vram(struct amdgpu_device 
*adev)
adev->bios = NULL;
vram_base = pci_resource_start(adev->pdev, 0);
bios = ioremap_wc(vram_base, size);
-   if (!bios) {
+   if (!bios)
return false;
-   }
  
  	adev->bios = kmalloc(size, GFP_KERNEL);

if (!adev->bios) {
@@ -133,9 +132,8 @@ bool amdgpu_read_bios(struct amdgpu_device *adev)
adev->bios = NULL;
/* XXX: some cards may return 0 for rom size? ddx has a workaround */
bios = pci_map_rom(adev->pdev, );
-   if (!bios) {
+   if (!bios)
return false;
-   }
  
  	adev->bios = kzalloc(size, GFP_KERNEL);

if (adev->bios == NULL) {
@@ -168,9 +166,9 @@ static bool amdgpu_read_bios_from_rom(struct amdgpu_device 
*adev)
header[AMD_VBIOS_SIGNATURE_END] = 0;
  
  	if ((!AMD_IS_VALID_VBIOS(header)) ||

-   0 != memcmp((char *)[AMD_VBIOS_SIGNATURE_OFFSET],
-   AMD_VBIOS_SIGNATURE,
-   strlen(AMD_VBIOS_SIGNATURE)))
+   memcmp((char *)[AMD_VBIOS_SIGNATURE_OFFSET],
+  AMD_VBIOS_SIGNATURE,
+  strlen(AMD_VBIOS_SIGNATURE)) != 0)
return false;
  
  	/* valid vbios, go on */

@@ -264,7 +262,7 @@ static int amdgpu_atrm_call(acpi_handle atrm_handle, 
uint8_t *bios,
  
  	status = acpi_evaluate_object(atrm_handle, NULL, _arg, );

if (ACPI_FAILURE(status)) {
-   printk("failed to evaluate ATRM got %s\n", 
acpi_format_exception(status));
+   DRM_ERROR("failed to evaluate ATRM got %s\n", 
acpi_format_exception(status));
return -ENODEV;
}
  
@@ -363,7 +361,7 @@ static bool amdgpu_acpi_vfct_bios(struct amdgpu_device *adev)

struct acpi_table_header *hdr;
acpi_size tbl_size;
UEFI_ACPI_VFCT *vfct;
-   unsigned offset;
+   unsigned int offset;
  
  	if (!ACPI_SUCCESS(acpi_get_table("VFCT", 1, )))

return false;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index cb771c73cd07..c5521f9953a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -395,7 +395,7 @@ static int amdgpu_cs_p2_dependencies(struct 
amdgpu_cs_parser *p,
  {
struct drm_amdgpu_cs_chunk_dep *deps = chunk->kdata;
struct amdgpu_fpriv *fpriv = p->filp->driver_priv;
-   unsigned num_deps;
+   unsigned int num_deps;
int i, r;
  
  	num_deps = chunk->length_dw * 4 /

@@ -466,7 +466,7 @@ static int amdgpu_cs_p2_syncobj_in(struct amdgpu_cs_parser 
*p,
   struct amdgpu_cs_chunk *chunk)
  {
struct drm_amdgpu_cs_chunk_sem *deps = chunk->kdata;
-   unsigned num_deps;
+   unsigned int num_deps;
int i, r;
  
  	num_deps = chunk->length_dw * 4 /

@@ -484,7 +484,7 @@ static int amdgpu_cs_p2_syncobj_timeline_wait(struct 
amdgpu_cs_parser *p,
  struct amdgpu_cs_chunk *chunk)
  {
struct drm_amdgpu_cs_chunk_syncobj *syncobj_deps = chunk->kdata;
-   unsigned num_deps;
+   unsigned int num_deps;
int i, r;
  
  	num_deps = chunk->length_dw * 4 /

@@ -504,7 +504,7 @@ static int amdgpu_cs_p2_syncobj_out(struct amdgpu_cs_parser 
*p,
struct amdgpu_cs_chunk *chunk)
  {
struct drm_amdgpu_cs_chunk_sem *deps = chunk->kdata;
-   unsigned num_deps;
+   unsigned int num_deps;
int i;
  
  	num_deps = chunk->length_dw * 4 /

@@ -538,7 +538,7 @@ static int 

[PATCH] drm/amd/amdgpu: Remove unnecessary OOM messages

2023-05-03 Thread Srinivasan Shanmugam
The following checkpatch warning is removed.

WARNING: Possible unnecessary 'out of memory' message

Cc: Christian König 
Cc: Alex Deucher 
Signed-off-by: Srinivasan Shanmugam 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
index b582b83c4984..2e0f82f7b2c0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
@@ -318,10 +318,8 @@ static bool amdgpu_atrm_get_bios(struct amdgpu_device 
*adev)
pci_dev_put(pdev);
 
adev->bios = kmalloc(size, GFP_KERNEL);
-   if (!adev->bios) {
-   dev_err(adev->dev, "Unable to allocate bios\n");
+   if (!adev->bios)
return false;
-   }
 
for (i = 0; i < size / ATRM_BIOS_PAGE; i++) {
ret = amdgpu_atrm_call(atrm_handle,
-- 
2.25.1



[PATCH v2] drm/amd/amdgpu: Fix errors & warnings in amdgpu _bios, _cs, _dma_buf, _fence.c

2023-05-03 Thread Srinivasan Shanmugam
The following checkpatch errors & warnings are removed.

ERROR: else should follow close brace '}'
ERROR: trailing statements should be on next line
WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
WARNING: Possible repeated word: 'Fences'
WARNING: Missing a blank line after declarations
WARNING: braces {} are not necessary for single statement blocks
WARNING: Comparisons should place the constant on the right side of the test
WARNING: printk() should include KERN_ facility level

Cc: Christian König 
Cc: Alex Deucher 
Signed-off-by: Srinivasan Shanmugam 
---

v2:
 - Best to make the word "Fences" one line from that. E.g. "/* Fences mark an 
event...". (Christian)

 drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c| 16 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 16 
 drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c   | 14 --
 4 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
index 30c28a69e847..b582b83c4984 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
@@ -104,9 +104,8 @@ static bool igp_read_bios_from_vram(struct amdgpu_device 
*adev)
adev->bios = NULL;
vram_base = pci_resource_start(adev->pdev, 0);
bios = ioremap_wc(vram_base, size);
-   if (!bios) {
+   if (!bios)
return false;
-   }
 
adev->bios = kmalloc(size, GFP_KERNEL);
if (!adev->bios) {
@@ -133,9 +132,8 @@ bool amdgpu_read_bios(struct amdgpu_device *adev)
adev->bios = NULL;
/* XXX: some cards may return 0 for rom size? ddx has a workaround */
bios = pci_map_rom(adev->pdev, );
-   if (!bios) {
+   if (!bios)
return false;
-   }
 
adev->bios = kzalloc(size, GFP_KERNEL);
if (adev->bios == NULL) {
@@ -168,9 +166,9 @@ static bool amdgpu_read_bios_from_rom(struct amdgpu_device 
*adev)
header[AMD_VBIOS_SIGNATURE_END] = 0;
 
if ((!AMD_IS_VALID_VBIOS(header)) ||
-   0 != memcmp((char *)[AMD_VBIOS_SIGNATURE_OFFSET],
-   AMD_VBIOS_SIGNATURE,
-   strlen(AMD_VBIOS_SIGNATURE)))
+   memcmp((char *)[AMD_VBIOS_SIGNATURE_OFFSET],
+  AMD_VBIOS_SIGNATURE,
+  strlen(AMD_VBIOS_SIGNATURE)) != 0)
return false;
 
/* valid vbios, go on */
@@ -264,7 +262,7 @@ static int amdgpu_atrm_call(acpi_handle atrm_handle, 
uint8_t *bios,
 
status = acpi_evaluate_object(atrm_handle, NULL, _arg, );
if (ACPI_FAILURE(status)) {
-   printk("failed to evaluate ATRM got %s\n", 
acpi_format_exception(status));
+   DRM_ERROR("failed to evaluate ATRM got %s\n", 
acpi_format_exception(status));
return -ENODEV;
}
 
@@ -363,7 +361,7 @@ static bool amdgpu_acpi_vfct_bios(struct amdgpu_device 
*adev)
struct acpi_table_header *hdr;
acpi_size tbl_size;
UEFI_ACPI_VFCT *vfct;
-   unsigned offset;
+   unsigned int offset;
 
if (!ACPI_SUCCESS(acpi_get_table("VFCT", 1, )))
return false;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index cb771c73cd07..c5521f9953a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -395,7 +395,7 @@ static int amdgpu_cs_p2_dependencies(struct 
amdgpu_cs_parser *p,
 {
struct drm_amdgpu_cs_chunk_dep *deps = chunk->kdata;
struct amdgpu_fpriv *fpriv = p->filp->driver_priv;
-   unsigned num_deps;
+   unsigned int num_deps;
int i, r;
 
num_deps = chunk->length_dw * 4 /
@@ -466,7 +466,7 @@ static int amdgpu_cs_p2_syncobj_in(struct amdgpu_cs_parser 
*p,
   struct amdgpu_cs_chunk *chunk)
 {
struct drm_amdgpu_cs_chunk_sem *deps = chunk->kdata;
-   unsigned num_deps;
+   unsigned int num_deps;
int i, r;
 
num_deps = chunk->length_dw * 4 /
@@ -484,7 +484,7 @@ static int amdgpu_cs_p2_syncobj_timeline_wait(struct 
amdgpu_cs_parser *p,
  struct amdgpu_cs_chunk *chunk)
 {
struct drm_amdgpu_cs_chunk_syncobj *syncobj_deps = chunk->kdata;
-   unsigned num_deps;
+   unsigned int num_deps;
int i, r;
 
num_deps = chunk->length_dw * 4 /
@@ -504,7 +504,7 @@ static int amdgpu_cs_p2_syncobj_out(struct amdgpu_cs_parser 
*p,
struct amdgpu_cs_chunk *chunk)
 {
struct drm_amdgpu_cs_chunk_sem *deps = chunk->kdata;
-   unsigned num_deps;
+   unsigned int num_deps;
int i;
 
num_deps = chunk->length_dw * 4 /
@@ -538,7 +538,7 @@ static int amdgpu_cs_p2_syncobj_timeline_signal(struct 
amdgpu_cs_parser *p,
  

Re: [PATCH] drm/amd/amdgpu: Fix errors & warnings in amdgpu _bios, _cs, _dma_buf, _fence.c

2023-05-03 Thread Christian König

Am 03.05.23 um 10:46 schrieb Srinivasan Shanmugam:

The following checkpatch errors & warnings are removed.

ERROR: else should follow close brace '}'
ERROR: trailing statements should be on next line
WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
WARNING: Possible repeated word: 'Fences'
WARNING: Missing a blank line after declarations
WARNING: braces {} are not necessary for single statement blocks
WARNING: Comparisons should place the constant on the right side of the test
WARNING: printk() should include KERN_ facility level

Cc: Christian König 
Cc: Alex Deucher 
Signed-off-by: Srinivasan Shanmugam 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c| 16 +++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 16 
  drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c |  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c   | 15 +--
  4 files changed, 25 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
index 30c28a69e847..b582b83c4984 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
@@ -104,9 +104,8 @@ static bool igp_read_bios_from_vram(struct amdgpu_device 
*adev)
adev->bios = NULL;
vram_base = pci_resource_start(adev->pdev, 0);
bios = ioremap_wc(vram_base, size);
-   if (!bios) {
+   if (!bios)
return false;
-   }
  
  	adev->bios = kmalloc(size, GFP_KERNEL);

if (!adev->bios) {
@@ -133,9 +132,8 @@ bool amdgpu_read_bios(struct amdgpu_device *adev)
adev->bios = NULL;
/* XXX: some cards may return 0 for rom size? ddx has a workaround */
bios = pci_map_rom(adev->pdev, );
-   if (!bios) {
+   if (!bios)
return false;
-   }
  
  	adev->bios = kzalloc(size, GFP_KERNEL);

if (adev->bios == NULL) {
@@ -168,9 +166,9 @@ static bool amdgpu_read_bios_from_rom(struct amdgpu_device 
*adev)
header[AMD_VBIOS_SIGNATURE_END] = 0;
  
  	if ((!AMD_IS_VALID_VBIOS(header)) ||

-   0 != memcmp((char *)[AMD_VBIOS_SIGNATURE_OFFSET],
-   AMD_VBIOS_SIGNATURE,
-   strlen(AMD_VBIOS_SIGNATURE)))
+   memcmp((char *)[AMD_VBIOS_SIGNATURE_OFFSET],
+  AMD_VBIOS_SIGNATURE,
+  strlen(AMD_VBIOS_SIGNATURE)) != 0)
return false;
  
  	/* valid vbios, go on */

@@ -264,7 +262,7 @@ static int amdgpu_atrm_call(acpi_handle atrm_handle, 
uint8_t *bios,
  
  	status = acpi_evaluate_object(atrm_handle, NULL, _arg, );

if (ACPI_FAILURE(status)) {
-   printk("failed to evaluate ATRM got %s\n", 
acpi_format_exception(status));
+   DRM_ERROR("failed to evaluate ATRM got %s\n", 
acpi_format_exception(status));
return -ENODEV;
}
  
@@ -363,7 +361,7 @@ static bool amdgpu_acpi_vfct_bios(struct amdgpu_device *adev)

struct acpi_table_header *hdr;
acpi_size tbl_size;
UEFI_ACPI_VFCT *vfct;
-   unsigned offset;
+   unsigned int offset;
  
  	if (!ACPI_SUCCESS(acpi_get_table("VFCT", 1, )))

return false;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index cb771c73cd07..c5521f9953a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -395,7 +395,7 @@ static int amdgpu_cs_p2_dependencies(struct 
amdgpu_cs_parser *p,
  {
struct drm_amdgpu_cs_chunk_dep *deps = chunk->kdata;
struct amdgpu_fpriv *fpriv = p->filp->driver_priv;
-   unsigned num_deps;
+   unsigned int num_deps;
int i, r;
  
  	num_deps = chunk->length_dw * 4 /

@@ -466,7 +466,7 @@ static int amdgpu_cs_p2_syncobj_in(struct amdgpu_cs_parser 
*p,
   struct amdgpu_cs_chunk *chunk)
  {
struct drm_amdgpu_cs_chunk_sem *deps = chunk->kdata;
-   unsigned num_deps;
+   unsigned int num_deps;
int i, r;
  
  	num_deps = chunk->length_dw * 4 /

@@ -484,7 +484,7 @@ static int amdgpu_cs_p2_syncobj_timeline_wait(struct 
amdgpu_cs_parser *p,
  struct amdgpu_cs_chunk *chunk)
  {
struct drm_amdgpu_cs_chunk_syncobj *syncobj_deps = chunk->kdata;
-   unsigned num_deps;
+   unsigned int num_deps;
int i, r;
  
  	num_deps = chunk->length_dw * 4 /

@@ -504,7 +504,7 @@ static int amdgpu_cs_p2_syncobj_out(struct amdgpu_cs_parser 
*p,
struct amdgpu_cs_chunk *chunk)
  {
struct drm_amdgpu_cs_chunk_sem *deps = chunk->kdata;
-   unsigned num_deps;
+   unsigned int num_deps;
int i;
  
  	num_deps = chunk->length_dw * 4 /

@@ -538,7 +538,7 @@ static int amdgpu_cs_p2_syncobj_timeline_signal(struct 
amdgpu_cs_parser *p,
struct amdgpu_cs_chunk *chunk)
  {
struct 

[PATCH] drm/amd/amdgpu: Fix errors & warnings in amdgpu _bios, _cs, _dma_buf, _fence.c

2023-05-03 Thread Srinivasan Shanmugam
The following checkpatch errors & warnings are removed.

ERROR: else should follow close brace '}'
ERROR: trailing statements should be on next line
WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
WARNING: Possible repeated word: 'Fences'
WARNING: Missing a blank line after declarations
WARNING: braces {} are not necessary for single statement blocks
WARNING: Comparisons should place the constant on the right side of the test
WARNING: printk() should include KERN_ facility level

Cc: Christian König 
Cc: Alex Deucher 
Signed-off-by: Srinivasan Shanmugam 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c| 16 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 16 
 drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c   | 15 +--
 4 files changed, 25 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
index 30c28a69e847..b582b83c4984 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
@@ -104,9 +104,8 @@ static bool igp_read_bios_from_vram(struct amdgpu_device 
*adev)
adev->bios = NULL;
vram_base = pci_resource_start(adev->pdev, 0);
bios = ioremap_wc(vram_base, size);
-   if (!bios) {
+   if (!bios)
return false;
-   }
 
adev->bios = kmalloc(size, GFP_KERNEL);
if (!adev->bios) {
@@ -133,9 +132,8 @@ bool amdgpu_read_bios(struct amdgpu_device *adev)
adev->bios = NULL;
/* XXX: some cards may return 0 for rom size? ddx has a workaround */
bios = pci_map_rom(adev->pdev, );
-   if (!bios) {
+   if (!bios)
return false;
-   }
 
adev->bios = kzalloc(size, GFP_KERNEL);
if (adev->bios == NULL) {
@@ -168,9 +166,9 @@ static bool amdgpu_read_bios_from_rom(struct amdgpu_device 
*adev)
header[AMD_VBIOS_SIGNATURE_END] = 0;
 
if ((!AMD_IS_VALID_VBIOS(header)) ||
-   0 != memcmp((char *)[AMD_VBIOS_SIGNATURE_OFFSET],
-   AMD_VBIOS_SIGNATURE,
-   strlen(AMD_VBIOS_SIGNATURE)))
+   memcmp((char *)[AMD_VBIOS_SIGNATURE_OFFSET],
+  AMD_VBIOS_SIGNATURE,
+  strlen(AMD_VBIOS_SIGNATURE)) != 0)
return false;
 
/* valid vbios, go on */
@@ -264,7 +262,7 @@ static int amdgpu_atrm_call(acpi_handle atrm_handle, 
uint8_t *bios,
 
status = acpi_evaluate_object(atrm_handle, NULL, _arg, );
if (ACPI_FAILURE(status)) {
-   printk("failed to evaluate ATRM got %s\n", 
acpi_format_exception(status));
+   DRM_ERROR("failed to evaluate ATRM got %s\n", 
acpi_format_exception(status));
return -ENODEV;
}
 
@@ -363,7 +361,7 @@ static bool amdgpu_acpi_vfct_bios(struct amdgpu_device 
*adev)
struct acpi_table_header *hdr;
acpi_size tbl_size;
UEFI_ACPI_VFCT *vfct;
-   unsigned offset;
+   unsigned int offset;
 
if (!ACPI_SUCCESS(acpi_get_table("VFCT", 1, )))
return false;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index cb771c73cd07..c5521f9953a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -395,7 +395,7 @@ static int amdgpu_cs_p2_dependencies(struct 
amdgpu_cs_parser *p,
 {
struct drm_amdgpu_cs_chunk_dep *deps = chunk->kdata;
struct amdgpu_fpriv *fpriv = p->filp->driver_priv;
-   unsigned num_deps;
+   unsigned int num_deps;
int i, r;
 
num_deps = chunk->length_dw * 4 /
@@ -466,7 +466,7 @@ static int amdgpu_cs_p2_syncobj_in(struct amdgpu_cs_parser 
*p,
   struct amdgpu_cs_chunk *chunk)
 {
struct drm_amdgpu_cs_chunk_sem *deps = chunk->kdata;
-   unsigned num_deps;
+   unsigned int num_deps;
int i, r;
 
num_deps = chunk->length_dw * 4 /
@@ -484,7 +484,7 @@ static int amdgpu_cs_p2_syncobj_timeline_wait(struct 
amdgpu_cs_parser *p,
  struct amdgpu_cs_chunk *chunk)
 {
struct drm_amdgpu_cs_chunk_syncobj *syncobj_deps = chunk->kdata;
-   unsigned num_deps;
+   unsigned int num_deps;
int i, r;
 
num_deps = chunk->length_dw * 4 /
@@ -504,7 +504,7 @@ static int amdgpu_cs_p2_syncobj_out(struct amdgpu_cs_parser 
*p,
struct amdgpu_cs_chunk *chunk)
 {
struct drm_amdgpu_cs_chunk_sem *deps = chunk->kdata;
-   unsigned num_deps;
+   unsigned int num_deps;
int i;
 
num_deps = chunk->length_dw * 4 /
@@ -538,7 +538,7 @@ static int amdgpu_cs_p2_syncobj_timeline_signal(struct 
amdgpu_cs_parser *p,
struct amdgpu_cs_chunk *chunk)
 {
struct drm_amdgpu_cs_chunk_syncobj *syncobj_deps = 

[RFC PATCH 2/4] drm/cgroup: Add memory accounting to DRM cgroup

2023-05-03 Thread Maarten Lankhorst
Based roughly on the rdma and misc cgroup controllers, with a lot of
the accounting code borrowed from rdma.

The interface is simple:
- populate drmcgroup_device->regions[..] name and size for each active
  region.
- Call drm(m)cg_register_device()
- Use drmcg_try_charge to check if you can allocate a chunk of memory,
  use drmcg_uncharge when freeing it. This may return an error code,
  or -EAGAIN when the cgroup limit is reached.

The ttm code transforms -EAGAIN back to -ENOSPC since it has specific
logic for -ENOSPC, and returning -EAGAIN to userspace causes drmIoctl
to restart infinitely.

This API allows you to limit stuff with cgroups.
You can see the supported cards in /sys/fs/cgroup/drm.capacity
You need to echo +drm to cgroup.subtree_control, and then you can
partition memory.

In each cgroup subdir:
drm.max shows the current limits of the cgroup.
drm.current shows the current amount of allocated memory used by this cgroup.
drm.events shows the number of times the max memory limit was reached.
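
For reference, here is the charge/uncharge pairing a driver-side region
allocator ends up doing with this controller. It is a fragment in the style
of the xe hookup from patch 4/4, with the argument order as I read it there;
treat it as illustrative rather than a settled API.

	/* Account the allocation against the calling task's cgroup before
	 * carving memory out of the region; -EAGAIN means a cgroup limit
	 * was hit (TTM maps that back to -ENOSPC, see patch 3/4). */
	err = drmcg_try_charge(&vres->cg, &xe->cg, mgr->mem_type, vres->base.size);
	if (err)
		goto error_unlock;

	/* ... carve the memory out of the region ... */

	/* On free, or on a later error path, drop the charge again. */
	drmcg_uncharge(vres->cg, &xe->cg, mgr->mem_type, vres->base.size);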

Signed-off-by: Maarten Lankhorst 
---
 Documentation/admin-guide/cgroup-v2.rst |  46 ++
 Documentation/gpu/drm-compute.rst   |  54 +++
 include/linux/cgroup_drm.h  |  81 
 kernel/cgroup/drm.c | 539 +++-
 4 files changed, 699 insertions(+), 21 deletions(-)
 create mode 100644 Documentation/gpu/drm-compute.rst

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
index f67c0829350b..b858d99cb2ef 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2374,6 +2374,52 @@ RDMA Interface Files
  mlx4_0 hca_handle=1 hca_object=20
  ocrdma1 hca_handle=1 hca_object=23
 
+DRM
+
+
+The "drm" controller regulates the distribution and accounting of
+DRM resources.
+
+DRM Interface Files
+
+
+  drm.max
+   A readwrite nested-keyed file that exists for all the cgroups
+   except root that describes current configured resource limit
+   for a DRM device.
+
+   Lines are keyed by device name and are not ordered.
+   Each line contains space separated resource name and its configured
+   limit that can be distributed.
+
+   The following nested keys are defined.
+
+ =====
+ region.*  Maximum amount of bytes that allocatable in this region
+ =====
+
+   An example for xe follows::
+
+ :03:00.0 region.vram0=1073741824 region.stolen=max
+
+  drm.capacity
+   A read-only file that describes maximum region capacity.
+   It only exists on the root cgroup. Not all memory can be
+   allocated by cgroups, as the kernel reserves some for
+   internal use.
+
+   An example for xe follows::
+
+ :03:00.0 region.vram0=8514437120 region.stolen=67108864
+
+  drm.current
+   A read-only file that describes current resource usage.
+   It exists for all the cgroup except root.
+
+   An example for xe follows::
+
+ :03:00.0 region.vram0=12550144 region.stolen=8650752
+
 HugeTLB
 ---
 
diff --git a/Documentation/gpu/drm-compute.rst 
b/Documentation/gpu/drm-compute.rst
new file mode 100644
index ..116270976ef7
--- /dev/null
+++ b/Documentation/gpu/drm-compute.rst
@@ -0,0 +1,54 @@
+==
+Long running workloads and compute
+==
+
+Long running workloads (compute) are workloads that will not complete in 10
+seconds (the time a user is willing to wait before reaching for the power button).
+This means that other techniques need to be used to manage those workloads,
+that cannot use fences.
+
+Some hardware may schedule compute jobs, and have no way to pre-empt them, or
+have their memory swapped out from them. Or they simply want their workload
+not to be preempted or swapped out at all.
+
+This means that it differs from what is described in driver-api/dma-buf.rst.
+
+As with normal compute jobs, dma-fence may not be used at all. In this case,
+not even to force preemption. The driver is simply forced to unmap a BO
+from the long compute job's address space on unbind immediately, not even
+waiting for the workload to complete. Effectively this terminates the workload
+when there is no hardware support to recover.
+
+Since this is undesirable, there need to be mitigations to prevent a workload
+from being terminated. There are several possible approaches, all with their
+advantages and drawbacks.
+
+The first approach you will likely try is to pin all buffers used by compute.
+This guarantees that the job will run uninterrupted, but also allows a trivial
+denial of service attack by pinning as much memory as possible, hogging
+all GPU memory, and possibly a huge chunk of CPU memory.
+
+A second approach that will work slightly better on its own is 

[RFC PATCH 4/4] drm/xe: Add support for the drm cgroup

2023-05-03 Thread Maarten Lankhorst
Add some code to implement basic support for the vram0, vram1 and stolen
memory regions.

I fear the try_charge code should probably be done inside TTM. This
code should interact with the shrinker, but for a simple RFC it's good
enough.

Signed-off-by: Maarten Lankhorst 
---
 drivers/gpu/drm/xe/xe_device.c |  4 
 drivers/gpu/drm/xe/xe_device_types.h   |  4 
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c   | 21 +++--
 drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h |  5 +
 4 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 45d6e5ff47fd..f0a5af15a662 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -291,6 +291,10 @@ int xe_device_probe(struct xe_device *xe)
/* Allocate and map stolen after potential VRAM resize */
xe_ttm_stolen_mgr_init(xe);
 
+   err = drmmcg_register_device(>drm, >cg);
+   if (err)
+   goto err_irq_shutdown;
+
/*
 * Now that GT is initialized (TTM in particular),
 * we can try to init display, and inherit the initial fb.
diff --git a/drivers/gpu/drm/xe/xe_device_types.h 
b/drivers/gpu/drm/xe/xe_device_types.h
index 1cb404e48aaa..04b85060cbec 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -12,6 +12,8 @@
 #include 
 #include 
 
+#include 
+
 #include "xe_gt_types.h"
 #include "xe_platform_types.h"
 #include "xe_step_types.h"
@@ -55,6 +57,8 @@ struct xe_device {
/** @drm: drm device */
struct drm_device drm;
 
+   struct drmcgroup_device cg;
+
/** @info: device info */
struct intel_device_info {
/** @graphics_name: graphics IP name */
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c 
b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index 73836b9b7fed..263cd4ef7b6d 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -50,6 +50,7 @@ static int xe_ttm_vram_mgr_new(struct ttm_resource_manager 
*man,
   struct ttm_resource **res)
 {
struct xe_ttm_vram_mgr *mgr = to_xe_ttm_vram_mgr(man);
+   struct xe_device *xe = ttm_to_xe_device(tbo->bdev);
struct xe_ttm_vram_mgr_resource *vres;
struct drm_buddy *mm = >mm;
u64 size, remaining_size, min_page_size;
@@ -116,9 +117,8 @@ static int xe_ttm_vram_mgr_new(struct ttm_resource_manager 
*man,
 
mutex_lock(>lock);
if (lpfn <= mgr->visible_size >> PAGE_SHIFT && size > 
mgr->visible_avail) {
-   mutex_unlock(>lock);
err = -ENOSPC;
-   goto error_fini;
+   goto error_unlock;
}
 
if (place->fpfn + (size >> PAGE_SHIFT) != place->lpfn &&
@@ -129,6 +129,10 @@ static int xe_ttm_vram_mgr_new(struct ttm_resource_manager 
*man,
lpfn = max_t(unsigned long, place->fpfn + (size >> PAGE_SHIFT), 
lpfn);
}
 
+   err = drmcg_try_charge(>cg, >cg, mgr->mem_type, 
vres->base.size);
+   if (err)
+   goto error_unlock;
+
remaining_size = size;
do {
/*
@@ -197,6 +201,8 @@ static int xe_ttm_vram_mgr_new(struct ttm_resource_manager 
*man,
 
 error_free_blocks:
drm_buddy_free_list(mm, >blocks);
+   drmcg_uncharge(vres->cg, >cg, mgr->mem_type, vres->base.size);
+error_unlock:
mutex_unlock(>lock);
 error_fini:
ttm_resource_fini(man, >base);
@@ -211,6 +217,7 @@ static void xe_ttm_vram_mgr_del(struct ttm_resource_manager 
*man,
struct xe_ttm_vram_mgr_resource *vres =
to_xe_ttm_vram_mgr_resource(res);
struct xe_ttm_vram_mgr *mgr = to_xe_ttm_vram_mgr(man);
+   struct xe_device *xe = ttm_to_xe_device(man->bdev);
struct drm_buddy *mm = >mm;
 
mutex_lock(>lock);
@@ -218,6 +225,7 @@ static void xe_ttm_vram_mgr_del(struct ttm_resource_manager 
*man,
mgr->visible_avail += vres->used_visible_size;
mutex_unlock(>lock);
 
+   drmcg_uncharge(vres->cg, >cg, mgr->mem_type, vres->base.size);
ttm_resource_fini(man, res);
 
kfree(vres);
@@ -337,6 +345,15 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe, struct 
xe_ttm_vram_mgr *mgr,
struct ttm_resource_manager *man = >manager;
int err;
 
+   xe->cg.regions[mem_type].size = size;
+
+   if (mem_type == XE_PL_STOLEN) {
+   xe->cg.regions[mem_type].name = "stolen";
+   } else {
+   xe->cg.regions[mem_type].name =
+   mem_type == XE_PL_VRAM0 ? "vram0" : "vram1";
+   }
+
man->func = _ttm_vram_mgr_func;
mgr->mem_type = mem_type;
mutex_init(>lock);
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
index 3d9417ff7434..232585d7ae69 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
@@ 
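The drmcg_*/drmmcg_* calls used in the hunks above are introduced by patch 2/4,
"drm/cgroup: Add memory accounting to DRM cgroup", which is not quoted in this
excerpt. As a reading aid only, the interface is assumed to look roughly like
the sketch below; every prototype, struct layout and member name here is
inferred from the xe call sites, not taken from the actual patch or from
include/linux/cgroup_drm.h, and may well differ:

/*
 * Assumed shape of the drm cgroup charging interface, inferred purely
 * from the xe call sites above.  All names and signatures are guesses.
 */
struct drmcgroup_device {
	struct {
		const char *name;	/* e.g. "vram0", "vram1", "stolen" */
		u64 size;		/* total capacity of the region */
	} regions[TTM_NUM_MEM_TYPES];	/* array size guessed from regions[mem_type] */
};

/* Register a device's memory regions with the drm cgroup controller. */
int drmmcg_register_device(struct drm_device *dev,
			   struct drmcgroup_device *cgdev);

/* Charge @size bytes of region @region to the caller's cgroup; on success
 * *@cg presumably records which cgroup was charged so drmcg_uncharge() can
 * release it again.  Presumably returns -EAGAIN when a cgroup limit is
 * exceeded, matching how patch 3/4 treats that error. */
int drmcg_try_charge(struct drm_cgroup_state **cg,
		     struct drmcgroup_device *cgdev,
		     u32 region, u64 size);
void drmcg_uncharge(struct drm_cgroup_state *cg,
		    struct drmcgroup_device *cgdev,
		    u32 region, u64 size);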

[RFC PATCH 3/4] drm/ttm: Handle -EAGAIN in ttm_resource_alloc as -ENOSPC.

2023-05-03 Thread Maarten Lankhorst
This allows the drm cgroup controller to return that no space is available.

XXX: This is a hopeless simplification that changes behavior, and
returns -ENOSPC even if we could evict ourselves from the current
cgroup.

Ideally, the eviction code becomes cgroup aware, and will force eviction
from the current cgroup or its parents.
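
For illustration only, a cgroup-aware variant of the force-space loop might
look roughly like the sketch below. It is not part of this patch:
ttm_mem_evict_first_in_cgroup() and drmcg_of_bo() are invented names for
helpers that do not exist, and struct drm_cgroup_state is only borrowed from
the skeleton controller patch in this series.

/*
 * Hypothetical sketch: keep evicting until the allocation succeeds or
 * nothing that can make progress is left to evict.  On a cgroup limit
 * hit (-EAGAIN) only buffers charged to the offending cgroup hierarchy
 * are worth evicting, so the scan is restricted to them.
 */
static int ttm_bo_mem_force_space_cgroup(struct ttm_buffer_object *bo,
					 struct ttm_resource_manager *man,
					 const struct ttm_place *place,
					 struct ttm_resource **mem,
					 struct ttm_operation_ctx *ctx)
{
	struct drm_cgroup_state *cg = drmcg_of_bo(bo);	/* invented helper */
	int ret;

	do {
		ret = ttm_resource_alloc(bo, place, mem);
		if (likely(!ret))
			return 0;
		if (ret != -ENOSPC && ret != -EAGAIN)
			return ret;

		/* -EAGAIN: evict within the cgroup; -ENOSPC: evict anything. */
		ret = ttm_mem_evict_first_in_cgroup(bo->bdev, man, place, ctx,
						    ret == -EAGAIN ? cg : NULL);
	} while (!ret);

	return ret;
}

The point is only that an -EAGAIN caused by a cgroup limit can be resolved
solely by evicting buffers charged to that cgroup hierarchy, which is why
this patch simply maps it to -ENOSPC for now.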

Signed-off-by: Maarten Lankhorst 
---
 drivers/gpu/drm/ttm/ttm_bo.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index bd5dae4d1624..e057d5d8f09a 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -731,6 +731,8 @@ static int ttm_bo_mem_force_space(struct ttm_buffer_object *bo,
ret = ttm_resource_alloc(bo, place, mem);
if (likely(!ret))
break;
+   if (ret == -EAGAIN)
+   return -ENOSPC;
if (unlikely(ret != -ENOSPC))
return ret;
ret = ttm_mem_evict_first(bdev, man, place, ctx,
@@ -783,7 +785,7 @@ int ttm_bo_mem_space(struct ttm_buffer_object *bo,
 
type_found = true;
ret = ttm_resource_alloc(bo, place, mem);
-   if (ret == -ENOSPC)
+   if (ret == -ENOSPC || ret == -EAGAIN)
continue;
if (unlikely(ret))
goto error;
-- 
2.34.1



[RFC PATCH 1/4] cgroup: Add the DRM cgroup controller

2023-05-03 Thread Maarten Lankhorst
From: Tvrtko Ursulin 

Skeleton controller without any functionality.

Signed-off-by: Tvrtko Ursulin 
Signed-off-by: Maarten Lankhorst 
---
 include/linux/cgroup_drm.h|  9 ++
 include/linux/cgroup_subsys.h |  4 +++
 init/Kconfig  |  7 
 kernel/cgroup/Makefile|  1 +
 kernel/cgroup/drm.c   | 60 +++
 5 files changed, 81 insertions(+)
 create mode 100644 include/linux/cgroup_drm.h
 create mode 100644 kernel/cgroup/drm.c

diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h
new file mode 100644
index ..8ef66a47619f
--- /dev/null
+++ b/include/linux/cgroup_drm.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#ifndef _CGROUP_DRM_H
+#define _CGROUP_DRM_H
+
+#endif /* _CGROUP_DRM_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 445235487230..49460494a010 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -65,6 +65,10 @@ SUBSYS(rdma)
 SUBSYS(misc)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_DRM)
+SUBSYS(drm)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/init/Kconfig b/init/Kconfig
index 1fb5f313d18f..1679229143c0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1093,6 +1093,13 @@ config CGROUP_RDMA
  Attaching processes with active RDMA resources to the cgroup
  hierarchy is allowed even if can cross the hierarchy's limit.
 
+config CGROUP_DRM
+   bool "DRM controller"
+   help
+ Provides the DRM subsystem controller.
+
+ ...
+
 config CGROUP_FREEZER
bool "Freezer controller"
help
diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
index 12f8457ad1f9..849bd2917477 100644
--- a/kernel/cgroup/Makefile
+++ b/kernel/cgroup/Makefile
@@ -6,4 +6,5 @@ obj-$(CONFIG_CGROUP_PIDS) += pids.o
 obj-$(CONFIG_CGROUP_RDMA) += rdma.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_CGROUP_MISC) += misc.o
+obj-$(CONFIG_CGROUP_DRM) += drm.o
 obj-$(CONFIG_CGROUP_DEBUG) += debug.o
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
new file mode 100644
index ..02c8eaa633d3
--- /dev/null
+++ b/kernel/cgroup/drm.c
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_drm.h>
+#include <linux/slab.h>
+
+struct drm_cgroup_state {
+   struct cgroup_subsys_state css;
+};
+
+struct drm_root_cgroup_state {
+   struct drm_cgroup_state drmcs;
+};
+
+static struct drm_root_cgroup_state root_drmcs;
+
+static inline struct drm_cgroup_state *
+css_to_drmcs(struct cgroup_subsys_state *css)
+{
+   return container_of(css, struct drm_cgroup_state, css);
+}
+
+static void drmcs_free(struct cgroup_subsys_state *css)
+{
+   struct drm_cgroup_state *drmcs = css_to_drmcs(css);
+
+   if (drmcs != &root_drmcs.drmcs)
+   kfree(drmcs);
+}
+
+static struct cgroup_subsys_state *
+drmcs_alloc(struct cgroup_subsys_state *parent_css)
+{
+   struct drm_cgroup_state *drmcs;
+
+   if (!parent_css) {
+   drmcs = &root_drmcs.drmcs;
+   } else {
+   drmcs = kzalloc(sizeof(*drmcs), GFP_KERNEL);
+   if (!drmcs)
+   return ERR_PTR(-ENOMEM);
+   }
+
+   return &drmcs->css;
+}
+
+struct cftype files[] = {
+   { } /* Zero entry terminates. */
+};
+
+struct cgroup_subsys drm_cgrp_subsys = {
+   .css_alloc  = drmcs_alloc,
+   .css_free   = drmcs_free,
+   .early_init = false,
+   .legacy_cftypes = files,
+   .dfl_cftypes= files,
+};
-- 
2.34.1



[RFC PATCH 0/4] Add support for DRM cgroup memory accounting.

2023-05-03 Thread Maarten Lankhorst
RFC as I'm looking for comments.

For long-running compute workloads, it can be beneficial to partition GPU memory
between cgroups, so each cgroup can use its maximum amount of memory without
interfering with other scheduled jobs. Done properly, this can alleviate the
need for eviction, since eviction might result in a job being terminated if the
GPU doesn't support mid-thread preemption or recoverable page faults.

This is done by adding a bunch of knobs to cgroup:
drm.capacity: Shows maximum capacity of each resource region.
drm.max: Display or limit max amount of memory.
drm.current: Current amount of memory in use.

TTM has not been made cgroup aware yet, so instead of evicting from
the current cgroup to stay within the cgroup limits, it simply returns
the error -ENOSPC to userspace.
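
As a rough illustration of how these knobs could be driven from userspace, the
snippet below writes a per-region limit into drm.max and reads back drm.current.
The cgroup path (/sys/fs/cgroup/mygroup), the "region value" write syntax, and
the 256 MiB figure are assumptions made for the example; only the file names
drm.max/drm.current and the region name "vram0" come from this series.

/* Minimal sketch: cap the "vram0" region for one cgroup, then read usage.
 * Assumes cgroup v2 mounted at /sys/fs/cgroup and an io.max-like
 * "region value" syntax for drm.max -- both are assumptions, not
 * documented behaviour of this RFC. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *max_path = "/sys/fs/cgroup/mygroup/drm.max";
	const char *cur_path = "/sys/fs/cgroup/mygroup/drm.current";
	const char *limit = "vram0 268435456\n";	/* 256 MiB, assumed format */
	char buf[256];
	ssize_t n;
	int fd;

	fd = open(max_path, O_WRONLY);
	if (fd < 0)
		return 1;
	if (write(fd, limit, strlen(limit)) < 0)
		perror("write drm.max");
	close(fd);

	fd = open(cur_path, O_RDONLY);
	if (fd < 0)
		return 1;
	n = read(fd, buf, sizeof(buf) - 1);
	if (n > 0) {
		buf[n] = '\0';
		printf("current usage:\n%s", buf);
	}
	close(fd);
	return 0;
}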

I've used Tvrtko's cgroup controller series as a base, but it implemented
scheduling weight, not memory accounting, so I only ended up keeping the
base patch.

Xe is not upstream yet, so the driver specific patch will only apply on
https://gitlab.freedesktop.org/drm/xe/kernel

Maarten Lankhorst (3):
  drm/cgroup: Add memory accounting to DRM cgroup
  drm/ttm: Handle -EAGAIN in ttm_resource_alloc as -ENOSPC.
  drm/xe: Add support for the drm cgroup

Tvrtko Ursulin (1):
  cgroup: Add the DRM cgroup controller

 Documentation/admin-guide/cgroup-v2.rst|  46 ++
 Documentation/gpu/drm-compute.rst  |  54 ++
 drivers/gpu/drm/ttm/ttm_bo.c   |   4 +-
 drivers/gpu/drm/xe/xe_device.c |   4 +
 drivers/gpu/drm/xe/xe_device_types.h   |   4 +
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c   |  21 +-
 drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h |   5 +
 include/linux/cgroup_drm.h |  90 
 include/linux/cgroup_subsys.h  |   4 +
 init/Kconfig   |   7 +
 kernel/cgroup/Makefile |   1 +
 kernel/cgroup/drm.c| 557 +
 12 files changed, 794 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/gpu/drm-compute.rst
 create mode 100644 include/linux/cgroup_drm.h
 create mode 100644 kernel/cgroup/drm.c

-- 
2.34.1



Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Christian König

On 02.05.23 at 20:41, Alex Deucher wrote:

On Tue, May 2, 2023 at 11:22 AM Timur Kristóf  wrote:

[SNIP]

In my opinion, the correct solution to those problems would be if the kernel
could give userspace the necessary information about a GPU hang before a GPU
reset.


The fundamental problem here is that the kernel doesn't have that information
either. We know which IB timed out and can potentially do a devcoredump when
that happens, but that's it.


Is it really not possible to know such a fundamental thing as what the GPU was
doing when it hung? How are we supposed to do any kind of debugging without
knowing that?


Yes, that's indeed something that at least I have been trying to figure out for years as well.

Basically there are two major problems:
1. When the ASIC is hung you can't talk to the firmware engines any more, and
most state is not exposed directly, but only through some fw/hw interface.
    Just take a look at how umr reads the shader state from the SQ. When that
block is hung you can't do that any more and you basically have no chance at
all to figure out why it's hung.


    Same for other engines: I remember once spending a week figuring out why
the UVD block was hung during suspend. It turned out to be a debugging
nightmare because any time you touched any register of that block the whole
system would hang.


2. There are tons of things going on in a pipeline fashion or even completely
in parallel. For example, the CP is just the beginning of a rather long
pipeline which at the end produces a bunch of pixels.
    In almost all cases I've seen, you run into a problem somewhere deep in
the pipeline and only very rarely at the beginning.




I wonder what AMD's Windows driver team is doing about this problem; surely
they must have better tools to deal with GPU hangs?

For better or worse, most teams internally rely on scan dumps via JTAG, which
sort of limits the usefulness outside of AMD, but it also gives you the exact
state of the hardware when it's hung, so the hardware teams prefer it.


How does this approach scale? It's not something we can ask users to
do, and even if all of us in the radv team had a JTAG device, we
wouldn't be able to play every game that users experience random hangs
with.

It doesn't scale or lend itself particularly well to external
development, but that's the current state of affairs.


The usual approach seems to be to reproduce a problem in a lab with a JTAG
device attached to give the hw guys a scan dump, and they can then tell you
why something didn't work as expected.


And yes that absolutely doesn't scale.

Christian.



Alex