Re: [PATCH v2] drm/amdgpu: add ring timeout information in devcoredump

2024-03-05 Thread Khatri, Sunil



On 3/5/2024 6:40 PM, Christian König wrote:

Am 05.03.24 um 12:58 schrieb Sunil Khatri:

Add ring timeout related information to the amdgpu
devcoredump file for debugging purposes.

During the GPU recovery process the registered callback
is triggered and adds the debug information to the data
file created by the devcoredump framework under the
directory /sys/class/devcoredump/devcdX/

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 15 +++
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h |  2 ++
  2 files changed, 17 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c

index a59364e9b6ed..aa7fed59a0d5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -196,6 +196,13 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,

 coredump->reset_task_info.process_name,
 coredump->reset_task_info.pid);
  +    if (coredump->ring_timeout) {
+    drm_printf(&p, "\nRing timed out details\n");
+    drm_printf(&p, "IP Type: %d Ring Name: %s\n",
+    coredump->ring->funcs->type,
+    coredump->ring->name);
+    }
+
  if (coredump->reset_vram_lost)
  drm_printf(&p, "VRAM is lost due to GPU reset!\n");
  if (coredump->adev->reset_info.num_regs) {
@@ -220,6 +227,8 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,

  {
  struct amdgpu_coredump_info *coredump;
  struct drm_device *dev = adev_to_drm(adev);
+    struct amdgpu_job *job = reset_context->job;
+    struct drm_sched_job *s_job;
    coredump = kzalloc(sizeof(*coredump), GFP_NOWAIT);
@@ -228,6 +237,12 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,

  return;
  }
  +    if (job) {
+    s_job = &job->base;
+    coredump->ring = to_amdgpu_ring(s_job->sched);
+    coredump->ring_timeout = true;
+    }
+
  coredump->reset_vram_lost = vram_lost;
    if (reset_context->job && reset_context->job->vm) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h

index 19899f6b9b2b..6d67001a1057 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -97,6 +97,8 @@ struct amdgpu_coredump_info {
  struct amdgpu_task_info reset_task_info;
  struct timespec64   reset_time;
  bool    reset_vram_lost;
+    struct amdgpu_ring  *ring;
+    bool    ring_timeout;


I think you can drop ring_timeout, just having ring as optional 
information should be enough.


Apart from that looks pretty good I think.

- A GPU reset could happen for at least two reasons: 1. via debugfs
(cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover), where there is no timeout or
page fault. In this case we need to know whether a ring timeout
happened or not, otherwise we would print empty information in the
devcoredump. The same goes for page faults: there too we need to
check whether the recovery ran due to a page fault, and only then add that
information.


So to cover all the use cases I added this parameter.
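
For reference, a minimal sketch of what Christian's variant would look like
(since the coredump is kzalloc'ed, coredump->ring simply stays NULL when no
job/ring is involved, so the pointer itself can act as the flag):

    if (coredump->ring) {
        drm_printf(&p, "\nRing timed out details\n");
        drm_printf(&p, "IP Type: %d Ring Name: %s\n",
                   coredump->ring->funcs->type,
                   coredump->ring->name);
    }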


Thanks
Sunil


Regards,
Christian.


  };
  #endif




Re: [PATCH] drm/amdgpu: cache in more vm fault information

2024-03-06 Thread Khatri, Sunil



On 3/6/2024 3:37 PM, Christian König wrote:

Am 06.03.24 um 10:04 schrieb Sunil Khatri:

When a page fault interrupt is raised, there
is a lot more information that is useful for
developers to analyse the page fault.


Well, actually that information is not that interesting because it
is hw generation specific.


You should probably rather use the decoded strings here, e.g. hub, 
client, xcc_id, node_id etc...


See gmc_v9_0_process_interrupt() for an example.

I saw that gmc v9 does provide more information than what v10 and v11
provide, like node_id and the die the fault came from, but that again is very
specific to IP_VERSION(9, 4, 3); I don't know why that information is
not there in v10 and v11.


I agree with your point, but as of now during a page fault we are dumping
this information, which is useful: which client generated the interrupt,
for which src, and other information like the address. So I think we should
provide similar information in the devcoredump.


Currently we do not have all this information from either the job or the vm
derived from the job during a reset. We could surely add more
relevant information later on as requested, but this information is
useful, as eventually it is only developers who would use the dump file
provided by a customer to debug.


Below is the information that I dump in the devcoredump. I feel that is good
information, and new information could be added and picked up later.



Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773)
in page starting at address 0x from client 0x1b (UTCL2)


Regards
Sunil Khatri



Regards,
Christian.



Add all such information to the last cached
page fault from the interrupt handler.

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 9 +++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 7 ++-
  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 2 +-
  drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 2 +-
  drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c  | 2 +-
  drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c  | 2 +-
  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 2 +-
  7 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c

index 4299ce386322..b77e8e28769d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2905,7 +2905,7 @@ void amdgpu_debugfs_vm_bo_info(struct amdgpu_vm *vm, struct seq_file *m)

   * Cache the fault info for later use by userspace in debugging.
   */
  void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
-  unsigned int pasid,
+  struct amdgpu_iv_entry *entry,
    uint64_t addr,
    uint32_t status,
    unsigned int vmhub)
@@ -2915,7 +2915,7 @@ void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,

    xa_lock_irqsave(&adev->vm_manager.pasids, flags);
  -    vm = xa_load(&adev->vm_manager.pasids, pasid);
+    vm = xa_load(&adev->vm_manager.pasids, entry->pasid);
  /* Don't update the fault cache if status is 0.  In the multiple
   * fault case, subsequent faults will return a 0 status which is
   * useless for userspace and replaces the useful fault status, so
@@ -2924,6 +2924,11 @@ void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,

  if (vm && status) {
  vm->fault_info.addr = addr;
  vm->fault_info.status = status;
+    vm->fault_info.client_id = entry->client_id;
+    vm->fault_info.src_id = entry->src_id;
+    vm->fault_info.vmid = entry->vmid;
+    vm->fault_info.pasid = entry->pasid;
+    vm->fault_info.ring_id = entry->ring_id;
  if (AMDGPU_IS_GFXHUB(vmhub)) {
  vm->fault_info.vmhub = AMDGPU_VMHUB_TYPE_GFX;
  vm->fault_info.vmhub |=
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h

index 047ec1930d12..c7782a89bdb5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -286,6 +286,11 @@ struct amdgpu_vm_fault_info {
  uint32_t    status;
  /* which vmhub? gfxhub, mmhub, etc. */
  unsigned int    vmhub;
+    unsigned int    client_id;
+    unsigned int    src_id;
+    unsigned int    ring_id;
+    unsigned int    pasid;
+    unsigned int    vmid;
  };
    struct amdgpu_vm {
@@ -605,7 +610,7 @@ static inline void amdgpu_vm_eviction_unlock(struct amdgpu_vm *vm)

  }
    void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
-  unsigned int pasid,
+  struct amdgpu_iv_entry *entry,
    uint64_t addr,
    uint32_t status,
    unsigned int vmhub);
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c

index d933e19e0cf5..6b177ce8db0e 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c

Re: [PATCH] drm/amdgpu: cache in more vm fault information

2024-03-06 Thread Khatri, Sunil



On 3/6/2024 6:12 PM, Christian König wrote:

Am 06.03.24 um 11:40 schrieb Khatri, Sunil:


On 3/6/2024 3:37 PM, Christian König wrote:

Am 06.03.24 um 10:04 schrieb Sunil Khatri:

When a page fault interrupt is raised, there
is a lot more information that is useful for
developers to analyse the page fault.


Well, actually that information is not that interesting because it
is hw generation specific.


You should probably rather use the decoded strings here, e.g. hub, 
client, xcc_id, node_id etc...


See gmc_v9_0_process_interrupt() for an example.

I saw that gmc v9 does provide more information than what v10 and v11
provide, like node_id and the die the fault came from, but that again is very
specific to IP_VERSION(9, 4, 3); I don't know why that information
is not there in v10 and v11.


I agree with your point, but as of now during a page fault we are
dumping this information, which is useful: which client generated the
interrupt, for which src, and other information like the address. So I
think we should provide similar information in the
devcoredump.


Currently we do not have all this information from either the job or the vm
derived from the job during a reset. We could surely add more
relevant information later on as requested, but this information is
useful, as eventually it is only developers who would use the dump file
provided by a customer to debug.


Below is the information that I dump in the devcoredump. I feel that is good
information, and new information could be added and picked up later.



Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773)
in page starting at address 0x from client 0x1b (UTCL2)


This is a perfect example of what I mean. What you record in the patch is the
client_id, but this is basically meaningless unless you have access
to the AMD internal hw documentation.


What you really need is the client in decoded form, in this case
UTCL2. You can keep the client_id additionally, but the decoded client
string is mandatory to have, I think.


Sure, I am capturing that information. As I am trying to keep the
memory interaction to a minimum, since we are still in interrupt context
here, I recorded the integer information instead of decoding
and writing strings there itself, postponing the decoding until we dump.

For example, decoding to gfxhub/mmhub based on vmhub/vmid_src, and the client
string from the client id. So are we good to go with sharing the above
details in the devcoredump using the additional information cached from
the page fault?


regards
sunil



Regards,
Christian.




Re: [PATCH] drm/amdgpu: cache in more vm fault information

2024-03-06 Thread Khatri, Sunil



On 3/6/2024 8:34 PM, Christian König wrote:

Am 06.03.24 um 15:29 schrieb Alex Deucher:

On Wed, Mar 6, 2024 at 8:04 AM Khatri, Sunil  wrote:


On 3/6/2024 6:12 PM, Christian König wrote:

Am 06.03.24 um 11:40 schrieb Khatri, Sunil:

On 3/6/2024 3:37 PM, Christian König wrote:

Am 06.03.24 um 10:04 schrieb Sunil Khatri:

When a page fault interrupt is raised, there
is a lot more information that is useful for
developers to analyse the page fault.

Well, actually that information is not that interesting because it
is hw generation specific.

You should probably rather use the decoded strings here, e.g. hub,
client, xcc_id, node_id etc...

See gmc_v9_0_process_interrupt() for an example.

I saw that gmc v9 does provide more information than what v10 and v11
provide, like node_id and the die the fault came from, but that again is very
specific to IP_VERSION(9, 4, 3); I don't know why that information
is not there in v10 and v11.

I agree with your point, but as of now during a page fault we are
dumping this information, which is useful: which client generated the
interrupt, for which src, and other information like the address. So I
think we should provide similar information in the
devcoredump.

Currently we do not have all this information from either the job or the vm
derived from the job during a reset. We could surely add more
relevant information later on as requested, but this information is
useful, as eventually it is only developers who would use the dump file
provided by a customer to debug.

Below is the information that I dump in the devcoredump. I feel that is good
information, and new information could be added and picked up later.


Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773)
in page starting at address 0x from client 0x1b 
(UTCL2)

This is a perfect example of what I mean. What you record in the patch is the
client_id, but this is basically meaningless unless you have access
to the AMD internal hw documentation.

What you really need is the client in decoded form, in this case
UTCL2. You can keep the client_id additionally, but the decoded client
string is mandatory to have, I think.

Sure, I am capturing that information. As I am trying to keep the
memory interaction to a minimum, since we are still in interrupt context
here, I recorded the integer information instead of decoding
and writing strings there itself, postponing the decoding until we dump.

For example, decoding to gfxhub/mmhub based on vmhub/vmid_src, and the client
string from the client id. So are we good to go with sharing the above
details in the devcoredump using the additional information cached from
the page fault?

I think amdgpu_vm_fault_info() has everything you need already (vmhub,
status, and addr).  client_id and src_id are just tokens in the
interrupt cookie so we know which IP to route the interrupt to. We
know what they will be because otherwise we'd be in the interrupt
handler for a different IP.  I don't think ring_id has any useful
information in this context and vmid and pasid are probably not too
useful because they are just tokens to associate the fault with a
process.  It would be better to have the process name.


Just to share context here Alex: I am preparing this for the devcoredump; my
intention was to replicate the information which the KMD shares
in dmesg for page faults. If we do not add the client id specifically,
we would not be able to share enough information in the devcoredump.
It would be just the address and hub (gfxhub/mmhub), and I think that is
partial information, as the src id, client id and IP block share good
information.


For process related information, we are already capturing that as part
of the dump through existing functionality.

 AMDGPU Device Coredump 
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 45.084775181
process_name: soft_recovery_p PID: 1780

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773)
in page starting at address 0x from client 0x1b (UTCL2)
VRAM is lost due to GPU reset!

Regards
Sunil



The decoded client name would be really useful I think, since the fault
handler is a catch-all and handles a whole bunch of different clients.


But that should ideally be passed in as a const string instead of the hw
generation specific client_id.


As long as it's only a pointer we also don't run into the trouble that 
we need to allocate memory for it.
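
(A minimal sketch of that const-string idea, with a hypothetical client_name
pointer member in the fault info; the decoded name is a pointer into a static
table, so the interrupt handler allocates nothing:)

    /* interrupt handler: cache a pointer into the static name table */
    vm->fault_info.client_name = soc15_ih_clientid_name[entry->client_id];

    /* devcoredump read callback: print it directly */
    drm_printf(&p, "from client %s\n", coredump->fault_info.client_name);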


I agree, but I prefer adding the client id and decoding it in the devcoredump
using soc15_ih_clientid_name[fault_info->client_id]. That is better, as
otherwise we would have to sprintf this string into fault_info in irq context,
which writes more bytes to memory, I guess, compared to an integer :)


We can argue about taking values like pasid, vmid and ring id off
if they are not useful at all.
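
(And a sketch of that decode-at-read-time variant, keeping only the integer in
the cached fault info and resolving the string when the devcoredump file is
read; this assumes the fault info has been copied into the coredump:)

    drm_printf(&p, "[%s] page fault from client %s (0x%x)\n",
               fault_info->vmhub ? "mmhub" : "gfxhub",
               soc15_ih_clientid_name[fault_info->client_id],
               fault_info->client_id);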


Regards
Sunil



Christian.



Alex


regards
sun

Re: [PATCH] drm/amdgpu: cache in more vm fault information

2024-03-06 Thread Khatri, Sunil



On 3/6/2024 9:19 PM, Alex Deucher wrote:

On Wed, Mar 6, 2024 at 10:32 AM Alex Deucher  wrote:

On Wed, Mar 6, 2024 at 10:13 AM Khatri, Sunil  wrote:


On 3/6/2024 8:34 PM, Christian König wrote:

Am 06.03.24 um 15:29 schrieb Alex Deucher:

On Wed, Mar 6, 2024 at 8:04 AM Khatri, Sunil  wrote:

On 3/6/2024 6:12 PM, Christian König wrote:

Am 06.03.24 um 11:40 schrieb Khatri, Sunil:

On 3/6/2024 3:37 PM, Christian König wrote:

Am 06.03.24 um 10:04 schrieb Sunil Khatri:

When a page fault interrupt is raised, there
is a lot more information that is useful for
developers to analyse the page fault.

Well, actually that information is not that interesting because
it is hw generation specific.

You should probably rather use the decoded strings here, e.g. hub,
client, xcc_id, node_id etc...

See gmc_v9_0_process_interrupt() for an example.

I saw that gmc v9 does provide more information than what v10 and v11
provide, like node_id and the die the fault came from, but that again is very
specific to IP_VERSION(9, 4, 3); I don't know why that information
is not there in v10 and v11.

I agree with your point, but as of now during a page fault we are
dumping this information, which is useful: which client generated the
interrupt, for which src, and other information like the address. So I
think we should provide similar information in the devcoredump.

Currently we do not have all this information from either the job or the vm
derived from the job during a reset. We could surely add more
relevant information later on as requested, but this information is
useful, as eventually it is only developers who would use the dump file
provided by a customer to debug.

Below is the information that I dump in the devcoredump. I feel that is
good information, and new information could be added and picked up later.


Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773)
in page starting at address 0x from client 0x1b
(UTCL2)

This is a perfect example of what I mean. What you record in the patch is the
client_id, but this is basically meaningless unless you have access
to the AMD internal hw documentation.

What you really need is the client in decoded form, in this case
UTCL2. You can keep the client_id additionally, but the decoded client
string is mandatory to have, I think.

Sure, I am capturing that information. As I am trying to keep the
memory interaction to a minimum, since we are still in interrupt context
here, I recorded the integer information instead of decoding
and writing strings there itself, postponing the decoding until we dump.

For example, decoding to gfxhub/mmhub based on vmhub/vmid_src, and the client
string from the client id. So are we good to go with sharing the above
details in the devcoredump using the additional information cached from
the page fault?

I think amdgpu_vm_fault_info() has everything you need already (vmhub,
status, and addr).  client_id and src_id are just tokens in the
interrupt cookie so we know which IP to route the interrupt to. We
know what they will be because otherwise we'd be in the interrupt
handler for a different IP.  I don't think ring_id has any useful
information in this context and vmid and pasid are probably not too
useful because they are just tokens to associate the fault with a
process.  It would be better to have the process name.

Just to share context here Alex: I am preparing this for the devcoredump; my
intention was to replicate the information which the KMD shares
in dmesg for page faults. If we do not add the client id specifically,
we would not be able to share enough information in the devcoredump.
It would be just the address and hub (gfxhub/mmhub), and I think that is
partial information, as the src id, client id and IP block share good
information.

We also need to include the status register value.  That contains the
important information (type of access, fault type, client, etc.).
Client_id and src_id are only used to route the interrupt to the right
software code.  E.g., a different client_id and src_id would be a
completely different interrupt (e.g., vblank or fence, etc.).  For GPU
page faults the client_id and src_id will always be the same.

The devcoredump should also include information about the GPU itself
as well (e.g., PCI DID/VID, maybe some of the relevant IP versions).

We already have "status", which is the register "GCVM_L2_PROTECTION_FAULT_STATUS".
But the problem here is that this all needs to be captured in interrupt context,
which I want to avoid, and these are family specific calls.
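
(For illustration, a sketch of decoding the cached status value at dump time,
modeled on the dmesg decoding in gmc_v10_0_process_interrupt(); the
GCVM_L2_PROTECTION_FAULT_STATUS field macros are gfx10-specific and would have
to be selected per family:)

    u32 status = coredump->fault_info.status;

    drm_printf(&p, "Faulty UTCL2 client ID: 0x%lx\n",
               REG_GET_FIELD(status, GCVM_L2_PROTECTION_FAULT_STATUS, CID));
    drm_printf(&p, "MORE_FAULTS: 0x%lx\n",
               REG_GET_FIELD(status, GCVM_L2_PROTECTION_FAULT_STATUS, MORE_FAULTS));
    drm_printf(&p, "WALKER_ERROR: 0x%lx\n",
               REG_GET_FIELD(status, GCVM_L2_PROTECTION_FAULT_STATUS, WALKER_ERROR));
    drm_printf(&p, "PERMISSION_FAULTS: 0x%lx\n",
               REG_GET_FIELD(status, GCVM_L2_PROTECTION_FAULT_STATUS, PERMISSION_FAULTS));
    drm_printf(&p, "RW: 0x%lx\n",
               REG_GET_FIELD(status, GCVM_L2_PROTECTION_FAULT_STATUS, RW));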

chip family would also be good.  And also vram size.

If we have a way to identify the chip and we have the vm status
register and vm fault address, we can decode all of the fault
information.

In this patch I am focusing on the page fault specific information only [taking
one thing at a time]. But eventually I will be adding more information as per
the devcoredump JIRA plan, and will keep the other information in the todo list.
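
(A sketch of what such a GPU identification section could look like; the
fields come from struct amdgpu_device, the layout is only illustrative:)

    drm_printf(&p, "\nGPU Device\n");
    drm_printf(&p, "PCI VID: 0x%04x DID: 0x%04x\n",
               adev->pdev->vendor, adev->pdev->device);
    drm_printf(&p, "family: %d\n", adev->family);
    drm_printf(&p, "VRAM size: %llu MiB\n",
               adev->gmc.real_vram_size >> 20);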

Re: [PATCH] drm/amdgpu: cache in more vm fault information

2024-03-06 Thread Khatri, Sunil



On 3/6/2024 9:07 PM, Christian König wrote:

Am 06.03.24 um 16:13 schrieb Khatri, Sunil:


I agree, but I prefer adding the client id and decoding it in the devcoredump
using soc15_ih_clientid_name[fault_info->client_id]. That is better, as
otherwise we would have to sprintf this string into fault_info in irq context,
which writes more bytes to memory, I guess, compared to an integer :)


Well, I totally agree that we shouldn't fiddle too much in


Re: [PATCH] drm/amdgpu: cache in more vm fault information

2024-03-06 Thread Khatri, Sunil
As discussed, we decided that we don't need the client id, src id, pasid,
etc. in the page fault information dump, so this patch isn't needed anymore.
I am dropping this patch and will add the new page fault information to the
devcoredump, which is all available in existing structures.


As discussed, we just need to provide the faulting address and the fault status
register, with the GPU family to decode the fault, along with the process
information.


Regards
Sunil Khatri


RE: [PATCH] drm/amdgpu: cache in more vm fault information

2024-03-06 Thread Khatri, Sunil
[AMD Official Use Only - General]

Ignore this. Triggered wrongly.

-Original Message-
From: Sunil Khatri 
Sent: Wednesday, March 6, 2024 11:50 PM
To: Deucher, Alexander ; Koenig, Christian 
; Sharma, Shashank 
Cc: amd-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; 
linux-ker...@vger.kernel.org; Joshi, Mukul ; Paneer 
Selvam, Arunpravin ; Khatri, Sunil 

Subject: [PATCH] drm/amdgpu: cache in more vm fault information

When a page fault interrupt is raised there is a lot more information that is
useful for developers to analyse the page fault.

Add all such information to the last cached page fault from the interrupt handler.

Signed-off-by: Sunil Khatri 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 9 +++--  
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 7 ++-  
drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 2 +-  
drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 2 +-  
drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c  | 2 +-  
drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c  | 2 +-  
drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 2 +-
 7 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 4299ce386322..b77e8e28769d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2905,7 +2905,7 @@ void amdgpu_debugfs_vm_bo_info(struct amdgpu_vm *vm, struct seq_file *m)
  * Cache the fault info for later use by userspace in debugging.
  */
 void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
- unsigned int pasid,
+ struct amdgpu_iv_entry *entry,
  uint64_t addr,
  uint32_t status,
  unsigned int vmhub)
@@ -2915,7 +2915,7 @@ void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,

xa_lock_irqsave(&adev->vm_manager.pasids, flags);

-   vm = xa_load(&adev->vm_manager.pasids, pasid);
+   vm = xa_load(&adev->vm_manager.pasids, entry->pasid);
/* Don't update the fault cache if status is 0.  In the multiple
 * fault case, subsequent faults will return a 0 status which is
 * useless for userspace and replaces the useful fault status, so
@@ -2924,6 +2924,11 @@ void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
if (vm && status) {
vm->fault_info.addr = addr;
vm->fault_info.status = status;
+   vm->fault_info.client_id = entry->client_id;
+   vm->fault_info.src_id = entry->src_id;
+   vm->fault_info.vmid = entry->vmid;
+   vm->fault_info.pasid = entry->pasid;
+   vm->fault_info.ring_id = entry->ring_id;
if (AMDGPU_IS_GFXHUB(vmhub)) {
vm->fault_info.vmhub = AMDGPU_VMHUB_TYPE_GFX;
vm->fault_info.vmhub |=
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index 047ec1930d12..c7782a89bdb5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -286,6 +286,11 @@ struct amdgpu_vm_fault_info {
uint32_tstatus;
/* which vmhub? gfxhub, mmhub, etc. */
unsigned intvmhub;
+   unsigned intclient_id;
+   unsigned intsrc_id;
+   unsigned intring_id;
+   unsigned intpasid;
+   unsigned intvmid;
 };

 struct amdgpu_vm {
@@ -605,7 +610,7 @@ static inline void amdgpu_vm_eviction_unlock(struct amdgpu_vm *vm)
 }

 void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
- unsigned int pasid,
+ struct amdgpu_iv_entry *entry,
  uint64_t addr,
  uint32_t status,
  unsigned int vmhub);
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index d933e19e0cf5..6b177ce8db0e 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -150,7 +150,7 @@ static int gmc_v10_0_process_interrupt(struct amdgpu_device *adev,
status = RREG32(hub->vm_l2_pro_fault_status);
WREG32_P(hub->vm_l2_pro_fault_cntl, 1, ~1);

-   amdgpu_vm_update_fault_cache(adev, entry->pasid, addr, status,
+   amdgpu_vm_update_fault_cache(adev, entry, addr, status,
 entry->vmid_src ? AMDGPU_MMHUB0(0) : AMDGPU_GFXHUB(0));
}

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
index 527dc917e049..bcf254856a3e 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -121,7 +121,7 @@ static int g

Re: [PATCH] drm/amdgpu: add vm fault information to devcoredump

2024-03-06 Thread Khatri, Sunil



On 3/7/2024 12:51 AM, Deucher, Alexander wrote:

[Public]


-Original Message-
From: Sunil Khatri 
Sent: Wednesday, March 6, 2024 1:20 PM
To: Deucher, Alexander ; Koenig, Christian
; Sharma, Shashank

Cc: amd-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; linux-
ker...@vger.kernel.org; Joshi, Mukul ; Paneer
Selvam, Arunpravin ; Khatri, Sunil

Subject: [PATCH] drm/amdgpu: add vm fault information to devcoredump

Add page fault information to the devcoredump.

Output of devcoredump:
 AMDGPU Device Coredump 
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 29.725011811
process_name: soft_recovery_p PID: 1720

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

[gfxhub] Page fault observed for GPU family:143 Faulty page starting at

I think we should add a separate section for the GPU identification information
(family, PCI ids, IP versions, etc.).  For this patch, I think it's fine to just
print the fault address and status.


Noted

Regards
Sunil


Alex


address 0x
Protection fault status register:0x301031

VRAM is lost due to GPU reset!

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 15 ++-
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h |  1 +
  2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 147100c27c2d..d7fea6cdf2f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
  coredump->ring->name);
   }

+ if (coredump->fault_info.status) {
+ struct amdgpu_vm_fault_info *fault_info = &coredump->fault_info;
+
+ drm_printf(&p, "\n[%s] Page fault observed for GPU family:%d\n",
+fault_info->vmhub ? "mmhub" : "gfxhub",
+coredump->adev->family);
+ drm_printf(&p, "Faulty page starting at address 0x%016llx\n",
+fault_info->addr);
+ drm_printf(&p, "Protection fault status register:0x%x\n",
+fault_info->status);
+ }
+
   if (coredump->reset_vram_lost)
- drm_printf(&p, "VRAM is lost due to GPU reset!\n");
+ drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
   if (coredump->adev->reset_info.num_regs) {
   drm_printf(&p, "AMDGPU register dumps:\nOffset: Value:\n");

@@ -253,6 +265,7 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
   if (job) {
   s_job = &job->base;
   coredump->ring = to_amdgpu_ring(s_job->sched);
+ coredump->fault_info = job->vm->fault_info;
   }

   coredump->adev = adev;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
index 60522963aaca..3197955264f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -98,6 +98,7 @@ struct amdgpu_coredump_info {
   struct timespec64   reset_time;
   boolreset_vram_lost;
   struct amdgpu_ring  *ring;
+ struct amdgpu_vm_fault_info fault_info;
  };
  #endif

--
2.34.1


Re: [PATCH] drm/amdgpu: add vm fault information to devcoredump

2024-03-07 Thread Khatri, Sunil



On 3/7/2024 1:47 PM, Christian König wrote:

Am 06.03.24 um 19:19 schrieb Sunil Khatri:

Add page fault information to the devcoredump.

Output of devcoredump:
 AMDGPU Device Coredump 
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 29.725011811
process_name: soft_recovery_p PID: 1720

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

[gfxhub] Page fault observed for GPU family:143
Faulty page starting at address 0x
Protection fault status register:0x301031

VRAM is lost due to GPU reset!

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 15 ++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h |  1 +
  2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c

index 147100c27c2d..d7fea6cdf2f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,

 coredump->ring->name);
  }
  +    if (coredump->fault_info.status) {
+    struct amdgpu_vm_fault_info *fault_info = &coredump->fault_info;

+
+    drm_printf(&p, "\n[%s] Page fault observed for GPU family:%d\n",

+   fault_info->vmhub ? "mmhub" : "gfxhub",
+   coredump->adev->family);
+    drm_printf(&p, "Faulty page starting at address 0x%016llx\n",
+   fault_info->addr);
+    drm_printf(&p, "Protection fault status register:0x%x\n",
+   fault_info->status);
+    }
+
  if (coredump->reset_vram_lost)
-    drm_printf(&p, "VRAM is lost due to GPU reset!\n");
+    drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
  if (coredump->adev->reset_info.num_regs) {
  drm_printf(&p, "AMDGPU register dumps:\nOffset: 
Value:\n");
  @@ -253,6 +265,7 @@ void amdgpu_coredump(struct amdgpu_device 
*adev, bool vram_lost,

  if (job) {
  s_job = &job->base;
  coredump->ring = to_amdgpu_ring(s_job->sched);
+    coredump->fault_info = job->vm->fault_info;


That's illegal. The VM pointer might already be stale at this point.

I think you need to add the fault info of the last fault globally in 
the VRAM manager or move this to the process info Shashank is working on.
Are you saying that during the reset, or otherwise, a vm which is part 
of this job could already have been freed, leaving us with a NULL 
dereference or an invalid reference? So far, based on the resets and 
page faults I have reproduced using the same app we are using for IH 
overflow, I have only ever seen a valid vm.


Assuming amdgpu_vm is freed or stale for this job, are you suggesting 
we update this information in adev->vm_manager along with the existing 
per-vm fault_info, or only in vm_manager?


Regards,
Christian.


  }
    coredump->adev = adev;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h

index 60522963aaca..3197955264f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -98,6 +98,7 @@ struct amdgpu_coredump_info {
  struct timespec64   reset_time;
  bool    reset_vram_lost;
  struct amdgpu_ring    *ring;
+    struct amdgpu_vm_fault_info    fault_info;
  };
  #endif




Re: [PATCH] drm/amdgpu: add vm fault information to devcoredump

2024-03-07 Thread Khatri, Sunil



On 3/7/2024 6:10 PM, Christian König wrote:

Am 07.03.24 um 09:37 schrieb Khatri, Sunil:


On 3/7/2024 1:47 PM, Christian König wrote:

Am 06.03.24 um 19:19 schrieb Sunil Khatri:

Add page fault information to the devcoredump.

Output of devcoredump:
 AMDGPU Device Coredump 
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 29.725011811
process_name: soft_recovery_p PID: 1720

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

[gfxhub] Page fault observed for GPU family:143
Faulty page starting at address 0x
Protection fault status register:0x301031

VRAM is lost due to GPU reset!

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 15 ++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h |  1 +
  2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c

index 147100c27c2d..d7fea6cdf2f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t 
offset, size_t count,

 coredump->ring->name);
  }
  +    if (coredump->fault_info.status) {
+    struct amdgpu_vm_fault_info *fault_info = 
&coredump->fault_info;

+
+    drm_printf(&p, "\n[%s] Page fault observed for GPU 
family:%d\n",

+   fault_info->vmhub ? "mmhub" : "gfxhub",
+   coredump->adev->family);
+    drm_printf(&p, "Faulty page starting at address 0x%016llx\n",
+   fault_info->addr);
+    drm_printf(&p, "Protection fault status register:0x%x\n",
+   fault_info->status);
+    }
+
  if (coredump->reset_vram_lost)
-    drm_printf(&p, "VRAM is lost due to GPU reset!\n");
+    drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
  if (coredump->adev->reset_info.num_regs) {
  drm_printf(&p, "AMDGPU register dumps:\nOffset: 
Value:\n");
  @@ -253,6 +265,7 @@ void amdgpu_coredump(struct amdgpu_device 
*adev, bool vram_lost,

  if (job) {
  s_job = &job->base;
  coredump->ring = to_amdgpu_ring(s_job->sched);
+    coredump->fault_info = job->vm->fault_info;


That's illegal. The VM pointer might already be stale at this point.

I think you need to add the fault info of the last fault globally in 
the VRAM manager or move this to the process info Shashank is 
working on.
Are you saying that during the reset, or otherwise, a vm which is part 
of this job could already have been freed, leaving us with a NULL 
dereference or an invalid reference? So far, based on the resets and 
page faults I have reproduced using the same app we are using for IH 
overflow, I have only ever seen a valid vm.


Assuming amdgpu_vm is freed or stale for this job, are you 
suggesting we update this information in adev->vm_manager along 
with the existing per-vm fault_info, or only in vm_manager?


Good question. Having it both in the VM as well as in the VM manager 
sounds like the simplest option for now.


Let me update the patch then with information in VM manager.
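
For illustration, caching the fault in both places could look roughly 
like this; the helper name and call site are assumptions for the 
sketch, not the driver's actual API:

/* sketch only: would be called from the GMC page fault interrupt path */
static void amdgpu_vm_cache_fault_info(struct amdgpu_device *adev,
				       struct amdgpu_vm *vm,
				       uint64_t addr, uint32_t status,
				       unsigned int vmhub)
{
	/* per-VM copy, may go stale once the VM is destroyed */
	if (vm) {
		vm->fault_info.addr = addr;
		vm->fault_info.status = status;
		vm->fault_info.vmhub = vmhub;
	}

	/* global copy in the VM manager, survives the VM's lifetime */
	adev->vm_manager.fault_info.addr = addr;
	adev->vm_manager.fault_info.status = status;
	adev->vm_manager.fault_info.vmhub = vmhub;
}

With that, amdgpu_coredump() can read adev->vm_manager.fault_info 
without dereferencing job->vm at all.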

Regards
Sunil



Regards,
Christian.



Regards,
Christian.


  }
    coredump->adev = adev;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h

index 60522963aaca..3197955264f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -98,6 +98,7 @@ struct amdgpu_coredump_info {
  struct timespec64   reset_time;
  bool    reset_vram_lost;
  struct amdgpu_ring    *ring;
+    struct amdgpu_vm_fault_info    fault_info;
  };
  #endif






Re: [PATCH 2/2] drm/amdgpu: add vm fault information to devcoredump

2024-03-07 Thread Khatri, Sunil



On 3/8/2024 12:44 AM, Alex Deucher wrote:

On Thu, Mar 7, 2024 at 12:00 PM Sunil Khatri  wrote:

Add page fault information to the devcoredump.

Output of devcoredump:
 AMDGPU Device Coredump 
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 29.725011811
process_name: soft_recovery_p PID: 1720

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

[gfxhub] Page fault observed
Faulty page starting at address 0x

Do you want a : before the address for consistency?

Sure.



Protection fault status register:0x301031

How about a space after the : for consistency?

For parsability, it may make more sense to just have a list of key value pairs:
[GPU page fault]
hub:
addr:
status:
[Ring timeout details]
IP:
ring:
name:

etc.


Sure, I agree, but till now I was capturing the information the way we 
share it in dmesg, which is user readable. Once we have enough data I 
could arrange it all in key: value pairs as you suggest, in a later 
patch, if that works?
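
For reference, a key: value layout along those lines could be emitted 
from the read callback roughly like this (a sketch of the formatting 
only, reusing the fault_info and coredump variables from the patch 
above):

	drm_printf(&p, "[GPU page fault]\n");
	drm_printf(&p, "hub: %s\n", fault_info->vmhub ? "mmhub" : "gfxhub");
	drm_printf(&p, "addr: 0x%016llx\n", fault_info->addr);
	drm_printf(&p, "status: 0x%x\n", fault_info->status);
	drm_printf(&p, "[Ring timeout details]\n");
	drm_printf(&p, "IP: %d\n", coredump->ring->funcs->type);
	drm_printf(&p, "ring: %s\n", coredump->ring->name);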





VRAM is lost due to GPU reset!

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +-
  1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 147100c27c2d..dd39e614d907 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
size_t count,
coredump->ring->name);
 }

+   if (coredump->adev) {
+   struct amdgpu_vm_fault_info *fault_info =
+   &coredump->adev->vm_manager.fault_info;
+
+   drm_printf(&p, "\n[%s] Page fault observed\n",
+  fault_info->vmhub ? "mmhub" : "gfxhub");
+   drm_printf(&p, "Faulty page starting at address 0x%016llx\n",
+  fault_info->addr);
+   drm_printf(&p, "Protection fault status register:0x%x\n",
+  fault_info->status);
+   }
+
 if (coredump->reset_vram_lost)
-   drm_printf(&p, "VRAM is lost due to GPU reset!\n");
+   drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
 if (coredump->adev->reset_info.num_regs) {
 drm_printf(&p, "AMDGPU register dumps:\nOffset: Value:\n");

--
2.34.1



Re: [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump

2024-03-08 Thread Khatri, Sunil



On 3/8/2024 2:39 PM, Christian König wrote:

Am 07.03.24 um 21:50 schrieb Sunil Khatri:

Add page fault information to the devcoredump.

Output of devcoredump:
 AMDGPU Device Coredump 
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 29.725011811
process_name: soft_recovery_p PID: 1720

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

[gfxhub] Page fault observed
Faulty page starting at address: 0x
Protection fault status register: 0x301031

VRAM is lost due to GPU reset!

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +-
  1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c

index 147100c27c2d..8794a3c21176 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t 
offset, size_t count,

 coredump->ring->name);
  }
  +    if (coredump->adev) {
+    struct amdgpu_vm_fault_info *fault_info =
+    &coredump->adev->vm_manager.fault_info;
+
+    drm_printf(&p, "\n[%s] Page fault observed\n",
+   fault_info->vmhub ? "mmhub" : "gfxhub");
+    drm_printf(&p, "Faulty page starting at address: 0x%016llx\n",
+   fault_info->addr);
+    drm_printf(&p, "Protection fault status register: 0x%x\n",
+   fault_info->status);
+    }
+
  if (coredump->reset_vram_lost)
-    drm_printf(&p, "VRAM is lost due to GPU reset!\n");
+    drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");


Why this additional new line?
The intent is that the devcoredump has its different sections clearly 
demarcated with a new line; otherwise "VRAM is lost due to GPU reset!" 
looks like part of the page fault information.

[gfxhub] Page fault observed
Faulty page starting at address: 0x
Protection fault status register: 0x301031

VRAM is lost due to GPU reset!

Regards
Sunil



Apart from that looks really good to me.

Regards,
Christian.


  if (coredump->adev->reset_info.num_regs) {
  drm_printf(&p, "AMDGPU register dumps:\nOffset: 
Value:\n");




RE: [PATCH] drm/amdgpu: add all ringbuffer information in devcoredump

2024-03-11 Thread Khatri, Sunil
Ignore this; I updated the commit message and subject, so I am sending a new mail.


-Original Message-
From: Sunil Khatri  
Sent: Monday, March 11, 2024 5:04 PM
To: Deucher, Alexander ; Koenig, Christian 
; Sharma, Shashank 
Cc: amd-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; 
linux-ker...@vger.kernel.org; Khatri, Sunil 
Subject: [PATCH] drm/amdgpu: add all ringbuffer information in devcoredump

Add ringbuffer information such as:
rptr, wptr, ring name, ring size and also the ring contents for each ring on a 
gpu reset.

Signed-off-by: Sunil Khatri 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 6d059f853adc..1992760039da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -215,6 +215,27 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
size_t count,
   fault_info->status);
}
 
+   drm_printf(&p, "Ring buffer information\n");
+   for (int i = 0; i < coredump->adev->num_rings; i++) {
+   int j = 0;
+   struct amdgpu_ring *ring = coredump->adev->rings[i];
+
+   drm_printf(&p, "ring name: %s\n", ring->name);
+   drm_printf(&p, "Rptr: 0x%llx Wptr: 0x%llx\n",
+  amdgpu_ring_get_rptr(ring) & ring->buf_mask,
+  amdgpu_ring_get_wptr(ring) & ring->buf_mask);
+   drm_printf(&p, "Ring size in dwords: %d\n",
+  ring->ring_size / 4);
+   drm_printf(&p, "Ring contents\n");
+   drm_printf(&p, "Offset \t Value\n");
+
+   while (j < ring->ring_size) {
+   drm_printf(&p, "0x%x \t 0x%x\n", j, ring->ring[j/4]);
+   j += 4;
+   }
+   drm_printf(&p, "Ring dumped\n");
+   }
+
if (coredump->reset_vram_lost)
drm_printf(&p, "VRAM is lost due to GPU reset!\n");
if (coredump->adev->reset_info.num_regs) {
--
2.34.1



Re: [PATCH] drm/amdgpu: add ring buffer information in devcoredump

2024-03-11 Thread Khatri, Sunil



On 3/11/2024 7:29 PM, Christian König wrote:



Am 11.03.24 um 13:22 schrieb Sunil Khatri:

Add relevant ringbuffer information such as
rptr, wptr, ring name, ring size and also
the ring contents for each ring on a gpu reset.

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 21 +
  1 file changed, 21 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c

index 6d059f853adc..1992760039da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -215,6 +215,27 @@ amdgpu_devcoredump_read(char *buffer, loff_t 
offset, size_t count,

 fault_info->status);
  }
  +    drm_printf(&p, "Ring buffer information\n");
+    for (int i = 0; i < coredump->adev->num_rings; i++) {
+    int j = 0;
+    struct amdgpu_ring *ring = coredump->adev->rings[i];
+
+    drm_printf(&p, "ring name: %s\n", ring->name);
+    drm_printf(&p, "Rptr: 0x%llx Wptr: 0x%llx\n",
+   amdgpu_ring_get_rptr(ring) & ring->buf_mask,
+   amdgpu_ring_get_wptr(ring) & ring->buf_mask);


Don't apply the mask here. We do have some use cases where the rptr 
and wptr are outside the ring buffer.

Sure, I will remove the mask.



+    drm_printf(&p, "Ring size in dwords: %d\n",
+   ring->ring_size / 4);


Rather print the mask as additional value here.

Does it help if I add the mask value, something like the sketch below?
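
Roughly, with both comments applied (a sketch of the suggested change, 
not the final patch; format specifiers may need adjusting to the actual 
field types):

	drm_printf(&p, "Rptr: 0x%llx Wptr: 0x%llx\n",
		   amdgpu_ring_get_rptr(ring),
		   amdgpu_ring_get_wptr(ring));
	drm_printf(&p, "Ring size in dwords: %d Buffer mask: 0x%x\n",
		   ring->ring_size / 4, ring->buf_mask);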



+    drm_printf(&p, "Ring contents\n");
+    drm_printf(&p, "Offset \t Value\n");
+
+    while (j < ring->ring_size) {
+    drm_printf(&p, "0x%x \t 0x%x\n", j, ring->ring[j/4]);
+    j += 4;
+    }



+    drm_printf(&p, "Ring dumped\n");


That seems superfluous.


Noted


Regards
Sunil



Regards,
Christian.


+    }
+
  if (coredump->reset_vram_lost)
  drm_printf(&p, "VRAM is lost due to GPU reset!\n");
  if (coredump->adev->reset_info.num_regs) {




Re: [PATCH 1/2] drm/amdgpu: add the IP information of the soc

2024-03-13 Thread Khatri, Sunil
[AMD Official Use Only - General]

Gentle Reminder for review.

Regards,
Sunil

Get Outlook for Android<https://aka.ms/AAb9ysg>

From: Sunil Khatri 
Sent: Tuesday, March 12, 2024 6:11:47 PM
To: Deucher, Alexander ; Koenig, Christian 
; Sharma, Shashank 
Cc: amd-...@lists.freedesktop.org ; 
dri-devel@lists.freedesktop.org ; 
linux-ker...@vger.kernel.org ; Khatri, Sunil 

Subject: [PATCH 1/2] drm/amdgpu: add the IP information of the soc

Add all the IP's information on a SOC to the
devcoredump.

Signed-off-by: Sunil Khatri 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index a0dbccad2f53..611fdb90a1fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -196,6 +196,25 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
size_t count,
coredump->reset_task_info.process_name,
coredump->reset_task_info.pid);

+   /* GPU IP's information of the SOC */
+   if (coredump->adev) {
+   drm_printf(&p, "\nIP Information\n");
+   drm_printf(&p, "SOC Family: %d\n", coredump->adev->family);
+   drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id);
+
+   for (int i = 0; i < coredump->adev->num_ip_blocks; i++) {
+   struct amdgpu_ip_block *ip =
+   &coredump->adev->ip_blocks[i];
+   drm_printf(&p, "IP type: %d IP name: %s\n",
+  ip->version->type,
+  ip->version->funcs->name);
+   drm_printf(&p, "IP version: (%d,%d,%d)\n\n",
+  ip->version->major,
+  ip->version->minor,
+  ip->version->rev);
+   }
+   }
+
 if (coredump->ring) {
 drm_printf(&p, "\nRing timed out details\n");
 drm_printf(&p, "IP Type: %d Ring Name: %s\n",
--
2.34.1



Re: [PATCH 1/2] drm/amdgpu: add the IP information of the soc

2024-03-13 Thread Khatri, Sunil
[AMD Official Use Only - General]

Gentle reminder for review.

Regards
Sunil

Get Outlook for Android<https://aka.ms/AAb9ysg>

From: Sunil Khatri 
Sent: Tuesday, March 12, 2024 6:11:47 PM
To: Deucher, Alexander ; Koenig, Christian 
; Sharma, Shashank 
Cc: amd-...@lists.freedesktop.org ; 
dri-devel@lists.freedesktop.org ; 
linux-ker...@vger.kernel.org ; Khatri, Sunil 

Subject: [PATCH 1/2] drm/amdgpu: add the IP information of the soc

Add all the IP's information on a SOC to the
devcoredump.

Signed-off-by: Sunil Khatri 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index a0dbccad2f53..611fdb90a1fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -196,6 +196,25 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
size_t count,
coredump->reset_task_info.process_name,
coredump->reset_task_info.pid);

+   /* GPU IP's information of the SOC */
+   if (coredump->adev) {
+   drm_printf(&p, "\nIP Information\n");
+   drm_printf(&p, "SOC Family: %d\n", coredump->adev->family);
+   drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id);
+
+   for (int i = 0; i < coredump->adev->num_ip_blocks; i++) {
+   struct amdgpu_ip_block *ip =
+   &coredump->adev->ip_blocks[i];
+   drm_printf(&p, "IP type: %d IP name: %s\n",
+  ip->version->type,
+  ip->version->funcs->name);
+   drm_printf(&p, "IP version: (%d,%d,%d)\n\n",
+  ip->version->major,
+  ip->version->minor,
+  ip->version->rev);
+   }
+   }
+
 if (coredump->ring) {
 drm_printf(&p, "\nRing timed out details\n");
 drm_printf(&p, "IP Type: %d Ring Name: %s\n",
--
2.34.1



Re: [PATCH 2/2] drm:amdgpu: add firmware information of all IP's

2024-03-13 Thread Khatri, Sunil
[AMD Official Use Only - General]

Gentle reminder

Regards
Sunil

Get Outlook for Android<https://aka.ms/AAb9ysg>

From: Sunil Khatri 
Sent: Tuesday, March 12, 2024 6:11:48 PM
To: Deucher, Alexander ; Koenig, Christian 
; Sharma, Shashank 
Cc: amd-...@lists.freedesktop.org ; 
dri-devel@lists.freedesktop.org ; 
linux-ker...@vger.kernel.org ; Khatri, Sunil 

Subject: [PATCH 2/2] drm:amdgpu: add firmware information of all IP's

Add firmware version information of each
IP and each instance where applicable.

Signed-off-by: Sunil Khatri 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 122 ++
 1 file changed, 122 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 611fdb90a1fc..78ddc58aef67 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -168,6 +168,123 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool 
vram_lost,
 {
 }
 #else
+static void amdgpu_devcoredump_fw_info(struct amdgpu_device *adev, struct 
drm_printer *p)
+{
+   uint32_t version;
+   uint32_t feature;
+   uint8_t smu_program, smu_major, smu_minor, smu_debug;
+
+   drm_printf(p, "VCE feature version: %u, fw version: 0x%08x\n",
+  adev->vce.fb_version, adev->vce.fw_version);
+   drm_printf(p, "UVD feature version: %u, fw version: 0x%08x\n",
+  0, adev->uvd.fw_version);
+   drm_printf(p, "GMC feature version: %u, fw version: 0x%08x\n",
+  0, adev->gmc.fw_version);
+   drm_printf(p, "ME feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.me_feature_version, adev->gfx.me_fw_version);
+   drm_printf(p, "PFP feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.pfp_feature_version, adev->gfx.pfp_fw_version);
+   drm_printf(p, "CE feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.ce_feature_version, adev->gfx.ce_fw_version);
+   drm_printf(p, "RLC feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_feature_version, adev->gfx.rlc_fw_version);
+
+   drm_printf(p, "RLC SRLC feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_srlc_feature_version,
+  adev->gfx.rlc_srlc_fw_version);
+   drm_printf(p, "RLC SRLG feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_srlg_feature_version,
+  adev->gfx.rlc_srlg_fw_version);
+   drm_printf(p, "RLC SRLS feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_srls_feature_version,
+  adev->gfx.rlc_srls_fw_version);
+   drm_printf(p, "RLCP feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlcp_ucode_feature_version,
+  adev->gfx.rlcp_ucode_version);
+   drm_printf(p, "RLCV feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlcv_ucode_feature_version,
+  adev->gfx.rlcv_ucode_version);
+   drm_printf(p, "MEC feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.mec_feature_version,
+  adev->gfx.mec_fw_version);
+
+   if (adev->gfx.mec2_fw)
+   drm_printf(p,
+  "MEC2 feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.mec2_feature_version,
+  adev->gfx.mec2_fw_version);
+
+   drm_printf(p, "IMU feature version: %u, fw version: 0x%08x\n",
+  0, adev->gfx.imu_fw_version);
+   drm_printf(p, "PSP SOS feature version: %u, fw version: 0x%08x\n",
+  adev->psp.sos.feature_version,
+  adev->psp.sos.fw_version);
+   drm_printf(p, "PSP ASD feature version: %u, fw version: 0x%08x\n",
+  adev->psp.asd_context.bin_desc.feature_version,
+  adev->psp.asd_context.bin_desc.fw_version);
+
+   drm_printf(p, "TA XGMI feature version: 0x%08x, fw version: 0x%08x\n",
+  adev->psp.xgmi_context.context.bin_desc.feature_version,
+  adev->psp.xgmi_context.context.bin_desc.fw_version);
+   drm_printf(p, "TA RAS feature version: 0x%08x, fw version: 0x%08x\n",
+  adev->psp.ras_context.context.bin_desc.feature_version,
+  adev->psp.ras_context.context.bin_desc.fw_version);
+   drm_printf(p, "TA HDCP feature version: 0x%08x, fw version: 0x%08x\n",
+  adev->psp.hdcp_context.context.bin_desc.feature_version,
+  adev->psp.hdcp_context.cont

Re: [PATCH 1/2] drm/amdgpu: add the IP information of the soc

2024-03-13 Thread Khatri, Sunil



On 3/14/2024 1:58 AM, Alex Deucher wrote:

On Tue, Mar 12, 2024 at 8:41 AM Sunil Khatri  wrote:

Add all the IP's information on a SOC to the
devcoredump.

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++
  1 file changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index a0dbccad2f53..611fdb90a1fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -196,6 +196,25 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
size_t count,
coredump->reset_task_info.process_name,
coredump->reset_task_info.pid);

+   /* GPU IP's information of the SOC */
+   if (coredump->adev) {
+   drm_printf(&p, "\nIP Information\n");
+   drm_printf(&p, "SOC Family: %d\n", coredump->adev->family);
+   drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id);
+
+   for (int i = 0; i < coredump->adev->num_ip_blocks; i++) {
+   struct amdgpu_ip_block *ip =
+   &coredump->adev->ip_blocks[i];
+   drm_printf(&p, "IP type: %d IP name: %s\n",
+  ip->version->type,
+  ip->version->funcs->name);
+   drm_printf(&p, "IP version: (%d,%d,%d)\n\n",
+  ip->version->major,
+  ip->version->minor,
+  ip->version->rev);
+   }
+   }

I think the IP discovery table would be more useful.  Either walk the
adev->ip_versions structure, or just include the IP discovery binary.


I did explore adev->ip_versions, and just walking the array does not 
directly give any useful information. There is no way to find the 
following directly from adev->ip_versions unless I also reparse the 
discovery binary, as amdgpu_discovery_reg_base_init() does, and walk 
through the headers of the various IPs:

a. which IPs are present on the SOC or not;
b. how many instances there are.
Also, I again have to convert back to the major, minor and rev 
convention for this information to be useful. I am exploring it more; 
if I find some other information I will update.


adev->ip_block[] is derived from IP discovery and covers only the 
blocks that are present on the SOC, so we are not reading information 
that is not applicable to the SOC. We have the name, type and version 
number of the IPs available on the SOC. If you want, I could also add 
the number of instances of each IP, if you think that is useful 
information here. Could you share what information is missing in this 
approach so I can include it?


As for dumping the IP discovery binary, I do not understand how that 
information would be directly useful, since it needs to be decoded 
like we do in discovery init. Please correct me if my understanding is 
wrong here.

Alex


+
 if (coredump->ring) {
 drm_printf(&p, "\nRing timed out details\n");
 drm_printf(&p, "IP Type: %d Ring Name: %s\n",
--
2.34.1



Re: [PATCH 2/2] drm:amdgpu: add firmware information of all IP's

2024-03-13 Thread Khatri, Sunil



On 3/14/2024 2:06 AM, Alex Deucher wrote:

On Tue, Mar 12, 2024 at 8:42 AM Sunil Khatri  wrote:

Add firmware version information of each
IP and each instance where applicable.


Is there a way we can share some common code with devcoredump,
debugfs, and the info IOCTL?  All three places need to query this
information and the same logic is repeated in each case.


Hello Alex,

Yes, you are absolutely right, the same information is being retrieved 
again as is done in debugfs. I can reorganize the code so the same 
function can be used by debugfs and devcoredump, but that is exactly 
what I tried to avoid here. I tried to use minimal functionality in 
devcoredump without shuffling a lot of code here and there.


Also, our devcoredump is implemented in amdgpu_reset.c and not all the 
information is available there; we might have to include a lot of 
headers and cross-call functions in amdgpu_reset unless we have a 
dedicated file for devcoredump.


The info IOCTL does have a lot of information which is also in the 
pipeline to be dumped, but if we want to reuse the functionality of 
the IOCTL we need to reorganize a lot of code.


If that is the need of the hour I could work on that. Please let me know.

This is my suggestion, if it makes sense:

1. If we want to reuse a lot of functionality, then we need to 
modularize some of the functions further so they can be consumed 
directly by devcoredump (see the sketch after this list).
2. We should also have a dedicated devcoredump.c/.h, so it is easy to 
include headers of the needed functionality cleanly and easy to expand 
devcoredump.
3. Based on the priority and importance of this task we can add 
information; otherwise some repetition is a real possibility.
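
As a rough illustration of point 1, with hypothetical names (neither 
the helper nor its file exists in the driver today):

/* hypothetical shared helper, e.g. in a dedicated amdgpu_devcoredump.c */
void amdgpu_fw_info_print(struct amdgpu_device *adev, struct drm_printer *p);

/* devcoredump read callback */
amdgpu_fw_info_print(coredump->adev, &p);

/* debugfs show callback, via a drm_printer wrapping the seq_file */
struct drm_printer dp = drm_seq_file_printer(m);
amdgpu_fw_info_print(adev, &dp);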




Alex



Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 122 ++
  1 file changed, 122 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 611fdb90a1fc..78ddc58aef67 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -168,6 +168,123 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool 
vram_lost,
  {
  }
  #else
+static void amdgpu_devcoredump_fw_info(struct amdgpu_device *adev, struct 
drm_printer *p)
+{
+   uint32_t version;
+   uint32_t feature;
+   uint8_t smu_program, smu_major, smu_minor, smu_debug;
+
+   drm_printf(p, "VCE feature version: %u, fw version: 0x%08x\n",
+  adev->vce.fb_version, adev->vce.fw_version);
+   drm_printf(p, "UVD feature version: %u, fw version: 0x%08x\n",
+  0, adev->uvd.fw_version);
+   drm_printf(p, "GMC feature version: %u, fw version: 0x%08x\n",
+  0, adev->gmc.fw_version);
+   drm_printf(p, "ME feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.me_feature_version, adev->gfx.me_fw_version);
+   drm_printf(p, "PFP feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.pfp_feature_version, adev->gfx.pfp_fw_version);
+   drm_printf(p, "CE feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.ce_feature_version, adev->gfx.ce_fw_version);
+   drm_printf(p, "RLC feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_feature_version, adev->gfx.rlc_fw_version);
+
+   drm_printf(p, "RLC SRLC feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_srlc_feature_version,
+  adev->gfx.rlc_srlc_fw_version);
+   drm_printf(p, "RLC SRLG feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_srlg_feature_version,
+  adev->gfx.rlc_srlg_fw_version);
+   drm_printf(p, "RLC SRLS feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_srls_feature_version,
+  adev->gfx.rlc_srls_fw_version);
+   drm_printf(p, "RLCP feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlcp_ucode_feature_version,
+  adev->gfx.rlcp_ucode_version);
+   drm_printf(p, "RLCV feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlcv_ucode_feature_version,
+  adev->gfx.rlcv_ucode_version);
+   drm_printf(p, "MEC feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.mec_feature_version,
+  adev->gfx.mec_fw_version);
+
+   if (adev->gfx.mec2_fw)
+   drm_printf(p,
+  "MEC2 feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.mec2_feature_version,
+  adev->gfx.mec2_fw_version);
+
+   drm_printf(p, "IMU feature version: %u, fw version: 0x%08x\n",
+  0, adev->gfx.imu_fw_version);
+   drm_printf(p, "PSP SOS feature version: %u, fw version: 0x%08x\n",
+  adev->psp.sos.feature_version,
+  adev->psp.sos.fw_version);
+   drm_printf(p, "PSP ASD feat

Re: [PATCH 2/2] drm:amdgpu: add firmware information of all IP's

2024-03-14 Thread Khatri, Sunil



On 3/14/2024 11:40 AM, Sharma, Shashank wrote:


On 14/03/2024 06:58, Khatri, Sunil wrote:


On 3/14/2024 2:06 AM, Alex Deucher wrote:
On Tue, Mar 12, 2024 at 8:42 AM Sunil Khatri  
wrote:

Add firmware version information of each
IP and each instance where applicable.


Is there a way we can share some common code with devcoredump,
debugfs, and the info IOCTL?  All three places need to query this
information and the same logic is repeated in each case.


Hello Alex,

Yes, you are absolutely right, the same information is being retrieved 
again as is done in debugfs. I can reorganize the code so the same 
function can be used by debugfs and devcoredump, but that is exactly 
what I tried to avoid here. I tried to use minimal functionality in 
devcoredump without shuffling a lot of code here and there.


Also, our devcoredump is implemented in amdgpu_reset.c and not all the 
information is available there; we might have to include a lot of 
headers and cross-call functions in amdgpu_reset unless we have a 
dedicated file for devcoredump.



I think Alex is suggesting to have one common backend to generate all 
the core debug info, and then different wrapper functions which can 
pack this raw info into the packets aligned with respective infra like 
devcore/debugfs/info IOCTL, which seems like a good idea to me.


If you think you need a new file for this backend it should be fine.
My suggestion was along the same lines: if we want to use the same 
infra to access information from different parts of the code, then we 
need to reorganize. And at the same time, since there is quite some 
data we are planning to add to devcoredump, I recommend a dedicated 
.c/.h instead of using amdgpu_reset.c, so a clean include is easy to 
maintain.


Once Alex confirms it, I can start working on the design and on what 
information we need for this.


Regards
Sunil



something like:

amdgpu_debug_core.c::

struct amdgpu_core_debug_info {
	/* Superset of all the info you are collecting from HW */
};

- amdgpu_debug_generate_core_info
{
	/* This function collects the core debug info from HW and saves in
	 * amdgpu_core_debug_info, we can update this periodically
	 * regardless of a request */
}

and then:

devcore_info *amdgpu_debug_pack_for_devcore(core_debug_info)
{
	/* convert core debug info into devcore aligned format/data */
}

ioctl_info *amdgpu_debug_pack_for_info_ioctl(core_debug_info)
{
	/* convert core debug info into info IOCTL aligned format/data */
}

debugfs_info *amdgpu_debug_pack_for_debugfs(core_debug_info)
{
	/* convert core debug info into debugfs aligned format/data */
}

- Shashank





The info IOCTL does have a lot of information which is also in the 
pipeline to be dumped, but if we want to reuse the functionality of 
the IOCTL we need to reorganize a lot of code.


If that is the need of the hour I could work on that. Please let me 
know.


This is my suggestion, if it makes sense:

1. If we want to reuse a lot of functionality, then we need to 
modularize some of the functions further so they can be consumed 
directly by devcoredump.
2. We should also have a dedicated devcoredump.c/.h, so it is easy 
to include headers of the needed functionality cleanly and easy to 
expand devcoredump.
3. Based on the priority and importance of this task we can add 
information; otherwise some repetition is a real possibility.




Alex



Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 122 
++

  1 file changed, 122 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c

index 611fdb90a1fc..78ddc58aef67 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -168,6 +168,123 @@ void amdgpu_coredump(struct amdgpu_device 
*adev, bool vram_lost,

  {
  }
  #else
+static void amdgpu_devcoredump_fw_info(struct amdgpu_device *adev, 
struct drm_printer *p)

+{
+   uint32_t version;
+   uint32_t feature;
+   uint8_t smu_program, smu_major, smu_minor, smu_debug;
+
+   drm_printf(p, "VCE feature version: %u, fw version: 0x%08x\n",
+  adev->vce.fb_version, adev->vce.fw_version);
+   drm_printf(p, "UVD feature version: %u, fw version: 0x%08x\n",
+  0, adev->uvd.fw_version);
+   drm_printf(p, "GMC feature version: %u, fw version: 0x%08x\n",
+  0, adev->gmc.fw_version);
+   drm_printf(p, "ME feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.me_feature_version, 
adev->gfx.me_fw_version);

+   drm_printf(p, "PFP feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.pfp_feature_version, 
adev->gfx.pfp_fw_version);

+   drm_printf(p, "CE feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.ce_feature_version, 
adev->gfx.ce_fw_version);

+   

Re: [PATCH 1/2] drm/amdgpu: add the IP information of the soc

2024-03-14 Thread Khatri, Sunil



On 3/14/2024 8:12 PM, Alex Deucher wrote:

On Thu, Mar 14, 2024 at 1:44 AM Khatri, Sunil  wrote:


On 3/14/2024 1:58 AM, Alex Deucher wrote:

On Tue, Mar 12, 2024 at 8:41 AM Sunil Khatri  wrote:

Add all the IP's information on a SOC to the
devcoredump.

Signed-off-by: Sunil Khatri 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++
   1 file changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index a0dbccad2f53..611fdb90a1fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -196,6 +196,25 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
size_t count,
 coredump->reset_task_info.process_name,
 coredump->reset_task_info.pid);

+   /* GPU IP's information of the SOC */
+   if (coredump->adev) {
+   drm_printf(&p, "\nIP Information\n");
+   drm_printf(&p, "SOC Family: %d\n", coredump->adev->family);
+   drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id);
+
+   for (int i = 0; i < coredump->adev->num_ip_blocks; i++) {
+   struct amdgpu_ip_block *ip =
+   &coredump->adev->ip_blocks[i];
+   drm_printf(&p, "IP type: %d IP name: %s\n",
+  ip->version->type,
+  ip->version->funcs->name);
+   drm_printf(&p, "IP version: (%d,%d,%d)\n\n",
+  ip->version->major,
+  ip->version->minor,
+  ip->version->rev);
+   }
+   }

I think the IP discovery table would be more useful.  Either walk the
adev->ip_versions structure, or just include the IP discovery binary.

I did explore adev->ip_versions, and just walking the array does not
directly give any useful information. There is no way to find the
following directly from adev->ip_versions unless I also reparse the
discovery binary, as amdgpu_discovery_reg_base_init() does, and walk
through the headers of the various IPs:
a. which IPs are present on the SOC or not;
b. how many instances there are.
Also, I again have to convert back to the major, minor and rev
convention for this information to be useful. I am exploring it more;
if I find some other information I will update.

adev->ip_block[] is derived from IP discovery and covers only the
blocks that are present on the SOC, so we are not reading information
that is not applicable to the SOC. We have the name, type and version
number of the IPs available on the SOC. If you want, I could also add
the number of instances of each IP, if you think that is useful
information here. Could you share what information is missing in this
approach so I can include it?

I was hoping to get the actual IP versions for the IPs from IP
discovery rather than the versions from the ip_block array.  The
latter are common so you can end up with the same version used across
a wide variety of chips (e.g., all gfx10.x based chips use the same
gfx 10 IP code even if the actual IP version is different for most of
the chips).

Got it. Let me check how it could be done right.



As for dumping the IP discovery binary, I do not understand how that
information would be directly useful, since it needs to be decoded
like we do in discovery init. Please correct me if my understanding is
wrong here.

It's probably not a high priority, I was just thinking it might be
useful to have in case there ended up being some problem related to
the IP discovery table on some boards.  E.g., we'd know that all
boards with a certain harvest config seem to align with a reported
problem.  Similar for vbios.  It's more for telemetry.  E.g., all the
boards reporting some problem have a particular powerplay config or
whatever.

I got it.
But there are two points of contention here in my understanding. First, 
the dump only works when there is a reset, so I am not sure whether it 
could be used very early in development or not. Second, devcoredump is 
4096 bytes/4 KB of memory into which we are dumping all the 
information. I am not sure whether that can be increased, but it might 
not be enough if we are planning to dump everything into it.


Another point: since we have sysfs/debugfs/info ioctl etc. information 
available, we should sort out what is really helpful in debugging a 
GPU hang, and that is what gets added to devcoredump.


Regards
Sunil



Alex



Alex


+
  if (coredump->ring) {
  drm_printf(&p, "\nRing timed out details\n");
  drm_printf(&p, "IP Type: %d Ring Name: %s\n",
--
2.34.1



RE: [PATCH] drm/amdgpu: add the hw_ip version of all IP's

2024-03-15 Thread Khatri, Sunil
[AMD Official Use Only - General]

Hello Alex

Added the information directly from the ip_versions and also added names for 
each IP so the version information makes more sense to the user.

Below is the output in devcoredump now:
IP Information
SOC Family: 143
SOC Revision id: 0
SOC External Revision id: 50
HWIP: GC[1][0]: v10.3.2.0.0
HWIP: HDP[2][0]: v5.0.3.0.0
HWIP: SDMA0[3][0]: v5.2.2.0.0
HWIP: SDMA1[4][0]: v5.2.2.0.0
HWIP: MMHUB[12][0]: v2.1.0.0.0
HWIP: ATHUB[13][0]: v2.1.0.0.0
HWIP: NBIO[14][0]: v3.3.1.0.0
HWIP: MP0[15][0]: v11.0.11.0.0
HWIP: MP1[16][0]: v11.0.11.0.0
HWIP: UVD/JPEG/VCN[17][0]: v3.0.0.0.0
HWIP: UVD/JPEG/VCN[17][1]: v3.0.1.0.0
HWIP: DF[21][0]: v3.7.3.0.0
HWIP: DCE[22][0]: v3.0.0.0.0
HWIP: OSSSYS[23][0]: v5.0.3.0.0
HWIP: SMUIO[24][0]: v11.0.6.0.0
HWIP: NBIF[26][0]: v3.3.1.0.0
HWIP: THM[27][0]: v11.0.5.0.0
HWIP: CLK[28][0]: v11.0.3.0.0
HWIP: CLK[28][1]: v11.0.3.0.0
HWIP: CLK[28][2]: v11.0.3.0.0
HWIP: CLK[28][3]: v11.0.3.0.0
HWIP: CLK[28][4]: v11.0.3.0.0
HWIP: CLK[28][5]: v11.0.3.0.0
HWIP: CLK[28][6]: v11.0.3.0.0
HWIP: CLK[28][7]: v11.0.3.0.0
HWIP: UMC[29][0]: v8.7.1.0.0
HWIP: UMC[29][1]: v8.7.1.0.0
HWIP: UMC[29][2]: v8.7.1.0.0
HWIP: UMC[29][3]: v8.7.1.0.0
HWIP: UMC[29][4]: v8.7.1.0.0
HWIP: UMC[29][5]: v8.7.1.0.0
HWIP: PCIE[33][0]: v6.5.0.0.0
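
For anyone decoding these tuples by hand: the five fields come from
unpacking adev->ip_versions with the IP_VERSION_* helpers used in the
patch below. From memory the packing is roughly the following, but the
authoritative definitions live in amdgpu.h:

#define IP_VERSION_MAJ(ver)     ((ver) >> 24)
#define IP_VERSION_MIN(ver)     (((ver) >> 16) & 0xFF)
#define IP_VERSION_REV(ver)     (((ver) >> 8) & 0xFF)
#define IP_VERSION_VARIANT(ver) (((ver) >> 4) & 0xF)
#define IP_VERSION_SUBREV(ver)  ((ver) & 0xF)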


-Original Message-
From: Sunil Khatri 
Sent: Friday, March 15, 2024 5:43 PM
To: Deucher, Alexander ; Koenig, Christian 
; Sharma, Shashank 
Cc: amd-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; 
linux-ker...@vger.kernel.org; Khatri, Sunil 
Subject: [PATCH] drm/amdgpu: add the hw_ip version of all IP's

Add all the IP's version information on a SOC to the devcoredump.

Signed-off-by: Sunil Khatri 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 62 +++
 1 file changed, 62 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index a0dbccad2f53..3d4bfe0a5a7c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -29,6 +29,43 @@
 #include "sienna_cichlid.h"
 #include "smu_v13_0_10.h"

+const char *hw_ip_names[MAX_HWIP] = {
+   [GC_HWIP]   = "GC",
+   [HDP_HWIP]  = "HDP",
+   [SDMA0_HWIP]= "SDMA0",
+   [SDMA1_HWIP]= "SDMA1",
+   [SDMA2_HWIP]= "SDMA2",
+   [SDMA3_HWIP]= "SDMA3",
+   [SDMA4_HWIP]= "SDMA4",
+   [SDMA5_HWIP]= "SDMA5",
+   [SDMA6_HWIP]= "SDMA6",
+   [SDMA7_HWIP]= "SDMA7",
+   [LSDMA_HWIP]= "LSDMA",
+   [MMHUB_HWIP]= "MMHUB",
+   [ATHUB_HWIP]= "ATHUB",
+   [NBIO_HWIP] = "NBIO",
+   [MP0_HWIP]  = "MP0",
+   [MP1_HWIP]  = "MP1",
+   [UVD_HWIP]  = "UVD/JPEG/VCN",
+   [VCN1_HWIP] = "VCN1",
+   [VCE_HWIP]  = "VCE",
+   [VPE_HWIP]  = "VPE",
+   [DF_HWIP]   = "DF",
+   [DCE_HWIP]  = "DCE",
+   [OSSSYS_HWIP]   = "OSSSYS",
+   [SMUIO_HWIP]= "SMUIO",
+   [PWR_HWIP]  = "PWR",
+   [NBIF_HWIP] = "NBIF",
+   [THM_HWIP]  = "THM",
+   [CLK_HWIP]  = "CLK",
+   [UMC_HWIP]  = "UMC",
+   [RSMU_HWIP] = "RSMU",
+   [XGMI_HWIP] = "XGMI",
+   [DCI_HWIP]  = "DCI",
+   [PCIE_HWIP] = "PCIE",
+};
+
+
 int amdgpu_reset_init(struct amdgpu_device *adev)  {
int ret = 0;
@@ -196,6 +233,31 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
size_t count,
   coredump->reset_task_info.process_name,
   coredump->reset_task_info.pid);

+   /* GPU IP's information of the SOC */
+   if (coredump->adev) {
+
+   drm_printf(&p, "\nIP Information\n");
+   drm_printf(&p, "SOC Family: %d\n", coredump->adev->family);
+   drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id);
+   drm_printf(&p, "SOC External Revision id: %d\n",
+  coredump->adev->external_rev_id);
+
+   for (int i = 1; i < MAX_HWIP; i++) {
+   for (int j = 0; j < HWIP_MAX_INSTANCE; j++) {
+   int ver = coredump->adev->ip_versions[i][j];
+
+   if (ve

Re: [PATCH] drm/amdgpu: add the hw_ip version of all IP's

2024-03-15 Thread Khatri, Sunil



On 3/15/2024 6:45 PM, Alex Deucher wrote:

On Fri, Mar 15, 2024 at 8:13 AM Sunil Khatri  wrote:

Add all the IP's version information on a SOC to the
devcoredump.

Signed-off-by: Sunil Khatri 

This looks great.

Reviewed-by: Alex Deucher 


Thanks Alex




---
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 62 +++
  1 file changed, 62 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index a0dbccad2f53..3d4bfe0a5a7c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -29,6 +29,43 @@
  #include "sienna_cichlid.h"
  #include "smu_v13_0_10.h"

+const char *hw_ip_names[MAX_HWIP] = {
+   [GC_HWIP]   = "GC",
+   [HDP_HWIP]  = "HDP",
+   [SDMA0_HWIP]= "SDMA0",
+   [SDMA1_HWIP]= "SDMA1",
+   [SDMA2_HWIP]= "SDMA2",
+   [SDMA3_HWIP]= "SDMA3",
+   [SDMA4_HWIP]= "SDMA4",
+   [SDMA5_HWIP]= "SDMA5",
+   [SDMA6_HWIP]= "SDMA6",
+   [SDMA7_HWIP]= "SDMA7",
+   [LSDMA_HWIP]= "LSDMA",
+   [MMHUB_HWIP]= "MMHUB",
+   [ATHUB_HWIP]= "ATHUB",
+   [NBIO_HWIP] = "NBIO",
+   [MP0_HWIP]  = "MP0",
+   [MP1_HWIP]  = "MP1",
+   [UVD_HWIP]  = "UVD/JPEG/VCN",
+   [VCN1_HWIP] = "VCN1",
+   [VCE_HWIP]  = "VCE",
+   [VPE_HWIP]  = "VPE",
+   [DF_HWIP]   = "DF",
+   [DCE_HWIP]  = "DCE",
+   [OSSSYS_HWIP]   = "OSSSYS",
+   [SMUIO_HWIP]= "SMUIO",
+   [PWR_HWIP]  = "PWR",
+   [NBIF_HWIP] = "NBIF",
+   [THM_HWIP]  = "THM",
+   [CLK_HWIP]  = "CLK",
+   [UMC_HWIP]  = "UMC",
+   [RSMU_HWIP] = "RSMU",
+   [XGMI_HWIP] = "XGMI",
+   [DCI_HWIP]  = "DCI",
+   [PCIE_HWIP] = "PCIE",
+};
+
+
  int amdgpu_reset_init(struct amdgpu_device *adev)
  {
 int ret = 0;
@@ -196,6 +233,31 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
size_t count,
coredump->reset_task_info.process_name,
coredump->reset_task_info.pid);

+   /* GPU IP's information of the SOC */
+   if (coredump->adev) {
+
+   drm_printf(&p, "\nIP Information\n");
+   drm_printf(&p, "SOC Family: %d\n", coredump->adev->family);
+   drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id);
+   drm_printf(&p, "SOC External Revision id: %d\n",
+  coredump->adev->external_rev_id);
+
+   for (int i = 1; i < MAX_HWIP; i++) {
+   for (int j = 0; j < HWIP_MAX_INSTANCE; j++) {
+   int ver = coredump->adev->ip_versions[i][j];
+
+   if (ver)
+   drm_printf(&p, "HWIP: %s[%d][%d]: 
v%d.%d.%d.%d.%d\n",
+  hw_ip_names[i], i, j,
+  IP_VERSION_MAJ(ver),
+  IP_VERSION_MIN(ver),
+  IP_VERSION_REV(ver),
+  IP_VERSION_VARIANT(ver),
+  IP_VERSION_SUBREV(ver));
+   }
+   }
+   }
+
 if (coredump->ring) {
 drm_printf(&p, "\nRing timed out details\n");
 drm_printf(&p, "IP Type: %d Ring Name: %s\n",
--
2.34.1



Re: [PATCH] drm/amdgpu: refactor code to reuse system information

2024-03-19 Thread Khatri, Sunil
Validated the code by using the function from devcoredump the same way 
the ioctl would use it, and I am getting valid values.


Also, this would be the container for the information that we need to 
share between ioctl, debugfs and devcoredump, and we can keep updating 
it based on the information needed.
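
For example, the devcoredump path can then consume it like this (a 
sketch assuming the amdgpu_device_info() signature from the patch 
below, returning 0 on success):

	struct drm_amdgpu_info_device dev_info = {};

	if (!amdgpu_device_info(coredump->adev, &dev_info)) {
		drm_printf(&p, "Device id: 0x%x\n", dev_info.device_id);
		drm_printf(&p, "Chip rev: %d\n", dev_info.chip_rev);
		drm_printf(&p, "Family: %d\n", dev_info.family);
	}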



On 3/19/2024 6:02 PM, Sunil Khatri wrote:

Refactor the code so debugfs and devcoredump can reuse
the common information and avoid unnecessary copy of it.

created a new file which would be the right place to
hold functions which will be used between sysfs, debugfs
and devcoredump.

Cc: Christian König 
Cc: Alex Deucher 
Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/Makefile |   2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu.h |   1 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c | 151 
  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 118 +--
  4 files changed, 157 insertions(+), 115 deletions(-)
  create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c

diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile 
b/drivers/gpu/drm/amd/amdgpu/Makefile
index 4536c8ad0e11..05d34f4b18f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/Makefile
+++ b/drivers/gpu/drm/amd/amdgpu/Makefile
@@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o 
amdgpu_kms.o \
amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \
amdgpu_fw_attestation.o amdgpu_securedisplay.o \
amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \
-   amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o
+   amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o 
amdgpu_devinfo.o
  
  amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h

index 9c62552bec34..0267870aa9b1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1609,4 +1609,5 @@ extern const struct attribute_group 
amdgpu_vram_mgr_attr_group;
  extern const struct attribute_group amdgpu_gtt_mgr_attr_group;
  extern const struct attribute_group amdgpu_flash_attr_group;
  
+int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info);

  #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c
new file mode 100644
index ..d2c15a1dcb0d
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c
@@ -0,0 +1,151 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright 2024 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+
+#include "amdgpu.h"
+#include "amd_pcie.h"
+
+#include 
+
+int amdgpu_device_info(struct amdgpu_device *adev, struct 
drm_amdgpu_info_device *dev_info)
+{
+   int ret;
+   uint64_t vm_size;
+   uint32_t pcie_gen_mask;
+
+   if (dev_info == NULL)
+   return -EINVAL;
+
+   dev_info->device_id = adev->pdev->device;
+   dev_info->chip_rev = adev->rev_id;
+   dev_info->external_rev = adev->external_rev_id;
+   dev_info->pci_rev = adev->pdev->revision;
+   dev_info->family = adev->family;
+   dev_info->num_shader_engines = adev->gfx.config.max_shader_engines;
+   dev_info->num_shader_arrays_per_engine = adev->gfx.config.max_sh_per_se;
+   /* return all clocks in KHz */
+   dev_info->gpu_counter_freq = amdgpu_asic_get_xclk(adev) * 10;
+   if (adev->pm.dpm_enabled) {
+   dev_info->max_engine_clock = amdgpu_dpm_get_sclk(adev, false) * 
10;
+   dev_info->max_memory_clock = amdgpu_dpm_get_mclk(adev, false) * 
10;
+   dev_info->min_engine_clock = amdgpu_dpm_get_sclk(adev, true) * 
10;
+   dev_info->min_memory_clock = amdgpu_dpm_get_mclk(adev, true) * 
10;
+   } else {
+   dev_info->max_engine_clock =
+   dev_info->min_engine_clock =
+  

Re: [PATCH] drm/amdgpu: refactor code to reuse system information

2024-03-19 Thread Khatri, Sunil



On 3/19/2024 7:19 PM, Lazar, Lijo wrote:


On 3/19/2024 6:02 PM, Sunil Khatri wrote:

Refactor the code so debugfs and devcoredump can reuse
the common information and avoid unnecessary copy of it.

created a new file which would be the right place to
hold functions which will be used between sysfs, debugfs
and devcoredump.

Cc: Christian König 
Cc: Alex Deucher 
Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/Makefile |   2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu.h |   1 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c | 151 
  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 118 +--
  4 files changed, 157 insertions(+), 115 deletions(-)
  create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c

diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile 
b/drivers/gpu/drm/amd/amdgpu/Makefile
index 4536c8ad0e11..05d34f4b18f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/Makefile
+++ b/drivers/gpu/drm/amd/amdgpu/Makefile
@@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o 
amdgpu_kms.o \
amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \
amdgpu_fw_attestation.o amdgpu_securedisplay.o \
amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \
-   amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o
+   amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o 
amdgpu_devinfo.o
  
  amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h

index 9c62552bec34..0267870aa9b1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1609,4 +1609,5 @@ extern const struct attribute_group 
amdgpu_vram_mgr_attr_group;
  extern const struct attribute_group amdgpu_gtt_mgr_attr_group;
  extern const struct attribute_group amdgpu_flash_attr_group;
  
+int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info);

  #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c
new file mode 100644
index ..d2c15a1dcb0d
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c
@@ -0,0 +1,151 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright 2024 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+
+#include "amdgpu.h"
+#include "amd_pcie.h"
+
+#include 
+
+int amdgpu_device_info(struct amdgpu_device *adev, struct 
drm_amdgpu_info_device *dev_info)
+{
+   int ret;
+   uint64_t vm_size;
+   uint32_t pcie_gen_mask;
+
+   if (dev_info == NULL)
+   return -EINVAL;
+
+   dev_info->device_id = adev->pdev->device;
+   dev_info->chip_rev = adev->rev_id;
+   dev_info->external_rev = adev->external_rev_id;
+   dev_info->pci_rev = adev->pdev->revision;
+   dev_info->family = adev->family;
+   dev_info->num_shader_engines = adev->gfx.config.max_shader_engines;
+   dev_info->num_shader_arrays_per_engine = adev->gfx.config.max_sh_per_se;
+   /* return all clocks in KHz */
+   dev_info->gpu_counter_freq = amdgpu_asic_get_xclk(adev) * 10;
+   if (adev->pm.dpm_enabled) {
+   dev_info->max_engine_clock = amdgpu_dpm_get_sclk(adev, false) * 
10;
+   dev_info->max_memory_clock = amdgpu_dpm_get_mclk(adev, false) * 
10;
+   dev_info->min_engine_clock = amdgpu_dpm_get_sclk(adev, true) * 
10;
+   dev_info->min_memory_clock = amdgpu_dpm_get_mclk(adev, true) * 
10;
+   } else {
+   dev_info->max_engine_clock =
+   dev_info->min_engine_clock =
+   adev->clock.default_sclk * 10;
+   dev_info->max_memory_clock =
+   dev_info->min_memory_clock =
+   adev->clock.default_mclk * 10;
+   }
+   dev_info->enabled

Re: [PATCH] drm/amdgpu: refactor code to reuse system information

2024-03-19 Thread Khatri, Sunil


On 3/19/2024 7:43 PM, Lazar, Lijo wrote:


On 3/19/2024 7:27 PM, Khatri, Sunil wrote:

On 3/19/2024 7:19 PM, Lazar, Lijo wrote:

On 3/19/2024 6:02 PM, Sunil Khatri wrote:

Refactor the code so debugfs and devcoredump can reuse
the common information and avoid unnecessary copy of it.

created a new file which would be the right place to
hold functions which will be used between sysfs, debugfs
and devcoredump.

Cc: Christian König
Cc: Alex Deucher
Signed-off-by: Sunil Khatri
---
   drivers/gpu/drm/amd/amdgpu/Makefile |   2 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu.h |   1 +
   drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c | 151 
   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 118 +--
   4 files changed, 157 insertions(+), 115 deletions(-)
   create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c

diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile
b/drivers/gpu/drm/amd/amdgpu/Makefile
index 4536c8ad0e11..05d34f4b18f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/Makefile
+++ b/drivers/gpu/drm/amd/amdgpu/Makefile
@@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o
amdgpu_kms.o \
   amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \
   amdgpu_fw_attestation.o amdgpu_securedisplay.o \
   amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \
-    amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o
+    amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o
amdgpu_devinfo.o
     amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o
   diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 9c62552bec34..0267870aa9b1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1609,4 +1609,5 @@ extern const struct attribute_group
amdgpu_vram_mgr_attr_group;
   extern const struct attribute_group amdgpu_gtt_mgr_attr_group;
   extern const struct attribute_group amdgpu_flash_attr_group;
   +int amdgpu_device_info(struct amdgpu_device *adev, struct
drm_amdgpu_info_device *dev_info);
   #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c
new file mode 100644
index ..d2c15a1dcb0d
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c
@@ -0,0 +1,151 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright 2024 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+
+#include "amdgpu.h"
+#include "amd_pcie.h"
+
+#include 
+
+int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info)
+{
+    int ret;
+    uint64_t vm_size;
+    uint32_t pcie_gen_mask;
+
+    if (dev_info == NULL)
+    return -EINVAL;
+
+    dev_info->device_id = adev->pdev->device;
+    dev_info->chip_rev = adev->rev_id;
+    dev_info->external_rev = adev->external_rev_id;
+    dev_info->pci_rev = adev->pdev->revision;
+    dev_info->family = adev->family;
+    dev_info->num_shader_engines = adev->gfx.config.max_shader_engines;
+    dev_info->num_shader_arrays_per_engine = adev->gfx.config.max_sh_per_se;
+    /* return all clocks in KHz */
+    dev_info->gpu_counter_freq = amdgpu_asic_get_xclk(adev) * 10;
+    if (adev->pm.dpm_enabled) {
+    dev_info->max_engine_clock = amdgpu_dpm_get_sclk(adev, false) * 10;
+    dev_info->max_memory_clock = amdgpu_dpm_get_mclk(adev, false) * 10;
+    dev_info->min_engine_clock = amdgpu_dpm_get_sclk(adev, true) * 10;
+    dev_info->min_memory_clock = amdgpu_dpm_get_mclk(adev, true) * 10;
+    } else {
+    dev_info->max_engine_clock =
+    dev_info->min_engine_clock =
+    adev->clock.default_sclk * 10;
+    dev_info->max_memory_clock =
+    dev_info->min_memory_clock =
+    adev->clock.default_mclk * 10;

Re: [PATCH v2] drm/amdgpu: refactor code to reuse system information

2024-03-19 Thread Khatri, Sunil



On 3/19/2024 8:07 PM, Christian König wrote:

On 19.03.24 at 15:25, Sunil Khatri wrote:

Refactor the code so debugfs and devcoredump can reuse
the common information and avoid unnecessary copy of it.

created a new file which would be the right place to
hold functions which will be used between ioctl, debugfs
and devcoredump.


Ok, taking a closer look, that is certainly not a good idea.

The devinfo structure was just created because somebody thought that 
mixing all that stuff into one structure would be a good idea.


We have pretty much deprecated that approach and should *really* not 
change anything here any more.
To support the ioctl we are keeping that information the same, without 
changing it. The intent behind adding a new file is that more information 
will be coming into it. Next in line is firmware information, which is 
again a huge function with a lot of information, and to share that 
information between devcoredump, the ioctl and sysfs, the new file seems 
to be the right idea after some discussions.
FYI: this will not be just one function in the new file; more are to come, 
so code can be reused without copying it.
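
As a rough illustration of the reuse being argued for here, the devcoredump
side would call the same helper the INFO ioctl uses instead of duplicating
the fill logic. A minimal sketch, assuming the amdgpu_device_info() helper
from this patch; the printing function and the fields chosen are
illustrative, not part of the series:

static void amdgpu_devcoredump_print_devinfo(struct amdgpu_device *adev,
                                             struct drm_printer *p)
{
        struct drm_amdgpu_info_device *dev_info;

        dev_info = kzalloc(sizeof(*dev_info), GFP_KERNEL);
        if (!dev_info)
                return;

        /* Same fill path as the INFO ioctl, no second copy of the logic. */
        if (!amdgpu_device_info(adev, dev_info)) {
                drm_printf(p, "Device id: 0x%x\n", dev_info->device_id);
                drm_printf(p, "Family: %u\n", dev_info->family);
                drm_printf(p, "External rev id: 0x%x\n", dev_info->external_rev);
        }

        kfree(dev_info);
}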


Regards,
Christian.



Cc: Christian König 
Cc: Alex Deucher 
Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/Makefile |   2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu.h |   1 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c | 151 
  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 118 +--
  4 files changed, 157 insertions(+), 115 deletions(-)
  create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c

diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile b/drivers/gpu/drm/amd/amdgpu/Makefile
index 4536c8ad0e11..05d34f4b18f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/Makefile
+++ b/drivers/gpu/drm/amd/amdgpu/Makefile
@@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o amdgpu_kms.o \
  amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \
  amdgpu_fw_attestation.o amdgpu_securedisplay.o \
  amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \
-    amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o
+    amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o amdgpu_devinfo.o

    amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 9c62552bec34..0267870aa9b1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1609,4 +1609,5 @@ extern const struct attribute_group amdgpu_vram_mgr_attr_group;
  extern const struct attribute_group amdgpu_gtt_mgr_attr_group;
  extern const struct attribute_group amdgpu_flash_attr_group;
  +int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info);
  #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c
new file mode 100644
index ..fdcbc1984031
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c
@@ -0,0 +1,151 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright 2024 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+
+#include "amdgpu.h"
+#include "amd_pcie.h"
+
+#include 
+
+int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info)
+{
+    int ret;
+    uint64_t vm_size;
+    uint32_t pcie_gen_mask;
+
+    if (dev_info == NULL)
+    return -EINVAL;
+
+    dev_info->device_id = adev->pdev->device;
+    dev_info->chip_rev = adev->rev_id;
+    dev_info->external_rev = adev->external_rev_id;
+    dev_info->pci_rev = adev->pdev->revision;
+    dev_info->family = adev->family;
+    dev_info->num_shader_engines = adev->gfx.config.max_shader_engines;
+    dev_info->num_shader_arrays_per_engine = adev->gfx.config.max_sh_per_se;
+    /* return all clocks in KHz */
+    dev_info->gpu_counter_freq = amdgpu_asic_get_xclk(adev) * 10;

Re: [PATCH] drm/amdgpu: refactor code to reuse system information

2024-03-19 Thread Khatri, Sunil

Sent a new patch based on discussion with Alex.

On 3/19/2024 8:34 PM, Christian König wrote:

On 19.03.24 at 15:59, Alex Deucher wrote:

On Tue, Mar 19, 2024 at 10:56 AM Christian König  wrote:

On 19.03.24 at 15:26, Alex Deucher wrote:
On Tue, Mar 19, 2024 at 8:32 AM Sunil Khatri  wrote:

Refactor the code so debugfs and devcoredump can reuse
the common information and avoid unnecessary copy of it.

created a new file which would be the right place to
hold functions which will be used between sysfs, debugfs
and devcoredump.

Cc: Christian König 
Cc: Alex Deucher 
Signed-off-by: Sunil Khatri 
---
   drivers/gpu/drm/amd/amdgpu/Makefile |   2 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu.h |   1 +
   drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c | 151 
   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 118 +--
   4 files changed, 157 insertions(+), 115 deletions(-)
   create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c

diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile b/drivers/gpu/drm/amd/amdgpu/Makefile

index 4536c8ad0e11..05d34f4b18f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/Makefile
+++ b/drivers/gpu/drm/amd/amdgpu/Makefile
@@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o amdgpu_kms.o \
   amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \
   amdgpu_fw_attestation.o amdgpu_securedisplay.o \
   amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \
-   amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o
+   amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o amdgpu_devinfo.o


   amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 9c62552bec34..0267870aa9b1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1609,4 +1609,5 @@ extern const struct attribute_group amdgpu_vram_mgr_attr_group;
   extern const struct attribute_group amdgpu_gtt_mgr_attr_group;
   extern const struct attribute_group amdgpu_flash_attr_group;

+int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info);
   #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c
new file mode 100644
index ..d2c15a1dcb0d
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c
@@ -0,0 +1,151 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright 2024 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+
+#include "amdgpu.h"
+#include "amd_pcie.h"
+
+#include 
+
+int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info)

We can probably keep this in amdgpu_kms.c unless that file is getting
too big.  I don't think it warrants a new file at this point.  If you
do keep it in amdgpu_kms.c, I'd recommend renaming it to something
like amdgpu_kms_device_info() to keep the naming conventions.

We should not be using this for anything new in the first place.

A whole bunch of the stuff inside the devinfo structure has been
deprecated because we found that putting everything into one structure
was a bad idea.

It's a convenient way to collect a lot of useful information that we
want in the core dump.  Plus it's not going anywhere because we need
to keep compatibility in the IOCTL.


Yeah, and exactly that is what I'm strictly against. The devinfo wasn't 
used for new stuff because we found that it is way too inflexible.


That's why we have multiple separate IOCTLs for the memory and 
firmware information for example.


We should really *not* reuse that for the device core dumping.

Rather just use the same information from the different IPs and 
subsystems directly. E.g. add a function to the VM, GFX etc for 
printing out devcoredump infos.
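
As a rough illustration, the per-IP shape suggested here would have each
subsystem print its own section of the dump instead of funnelling everything
through devinfo. A sketch only; none of these function names exist in the
driver:

static void amdgpu_vm_coredump_print(struct amdgpu_device *adev,
                                     struct drm_printer *p)
{
        /* The VM code owns and prints its own state. */
        drm_printf(p, "VM: max_pfn 0x%llx\n", adev->vm_manager.max_pfn);
}

static void amdgpu_gfx_coredump_print(struct amdgpu_device *adev,
                                      struct drm_printer *p)
{
        drm_printf(p, "GFX: %u SEs, %u SHs per SE\n",
                   adev->gfx.config.max_shader_engines,
                   adev->gfx.config.max_sh_per_se);
}

The devcoredump read callback would then call each IP's print hook in turn.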


I have pushed new v2 based on this discussion.

RE: [PATCH v2] drm/amdgpu: refactor code to reuse system information

2024-03-19 Thread Khatri, Sunil

Ignore this as I have sent v3.

-----Original Message-----
From: Sunil Khatri 
Sent: Tuesday, March 19, 2024 8:41 PM
To: Deucher, Alexander ; Koenig, Christian 
; Sharma, Shashank 
Cc: amd-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; 
linux-ker...@vger.kernel.org; Zhang, Hawking ; Kuehling, 
Felix ; Lazar, Lijo ; Khatri, Sunil 

Subject: [PATCH v2] drm/amdgpu: refactor code to reuse system information

Refactor the code so debugfs and devcoredump can reuse the common information 
and avoid unnecessary copy of it.

created a new file which would be the right place to hold functions which will 
be used between ioctl, debugfs and devcoredump.

Cc: Christian König 
Cc: Alex Deucher 
Signed-off-by: Sunil Khatri 
---
 drivers/gpu/drm/amd/amdgpu/Makefile          |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.c | 146 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.h |  33 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c      | 117 +--
 4 files changed, 182 insertions(+), 116 deletions(-)
 create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.c
 create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.h

diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile b/drivers/gpu/drm/amd/amdgpu/Makefile
index 4536c8ad0e11..2c5c800c1ed6 100644
--- a/drivers/gpu/drm/amd/amdgpu/Makefile
+++ b/drivers/gpu/drm/amd/amdgpu/Makefile
@@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o amdgpu_kms.o \
amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \
amdgpu_fw_attestation.o amdgpu_securedisplay.o \
amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \
-   amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o
+   amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o amdgpu_coreinfo.o

 amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.c
new file mode 100644
index ..597fc9d432ce
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.c
@@ -0,0 +1,146 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright 2024 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+
+#include "amdgpu_coreinfo.h"
+#include "amd_pcie.h"
+
+
+void amdgpu_coreinfo_devinfo(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info)
+{
+   int ret;
+   uint64_t vm_size;
+   uint32_t pcie_gen_mask;
+
+   dev_info->device_id = adev->pdev->device;
+   dev_info->chip_rev = adev->rev_id;
+   dev_info->external_rev = adev->external_rev_id;
+   dev_info->pci_rev = adev->pdev->revision;
+   dev_info->family = adev->family;
+   dev_info->num_shader_engines = adev->gfx.config.max_shader_engines;
+   dev_info->num_shader_arrays_per_engine = adev->gfx.config.max_sh_per_se;
+   /* return all clocks in KHz */
+   dev_info->gpu_counter_freq = amdgpu_asic_get_xclk(adev) * 10;
+   if (adev->pm.dpm_enabled) {
+   dev_info->max_engine_clock = amdgpu_dpm_get_sclk(adev, false) * 10;
+   dev_info->max_memory_clock = amdgpu_dpm_get_mclk(adev, false) * 10;
+   dev_info->min_engine_clock = amdgpu_dpm_get_sclk(adev, true) * 10;
+   dev_info->min_memory_clock = amdgpu_dpm_get_mclk(adev, true) * 10;
+   } else {
+   dev_info->max_engine_clock =
+   dev_info->min_engine_clock =
+   adev->clock.default_sclk * 10;
+   dev_info->max_memory_clock =
+   dev_info->min_memory_clock =
+   adev->clock.default_mclk * 10;
+   }
+   dev_info->enab

Re: [PATCH] drm/amdgpu: add support of bios dump in devcoredump

2024-03-26 Thread Khatri, Sunil



On 3/26/2024 10:23 PM, Alex Deucher wrote:

On Tue, Mar 26, 2024 at 10:38 AM Sunil Khatri  wrote:

dump the bios binary in the devcoredump.

Signed-off-by: Sunil Khatri 
---
  .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c  | 20 +++
  1 file changed, 20 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
index 44c5da8aa9ce..f33963d777eb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
@@ -132,6 +132,26 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
 	drm_printf(&p, "Faulty page starting at address: 0x%016llx\n", fault_info->addr);
 	drm_printf(&p, "Protection fault status register: 0x%x\n\n", fault_info->status);

+   /* Dump BIOS */
+   if (coredump->adev->bios && coredump->adev->bios_size) {
+   int i = 0;
+
+   drm_printf(&p, "BIOS Binary dump\n");
+   drm_printf(&p, "Valid BIOS  Size:%d bytes type:%s\n",
+  coredump->adev->bios_size,
+  coredump->adev->is_atom_fw ?
+  "Atom bios":"Non Atom Bios");
+
+   while (i < coredump->adev->bios_size) {
+   /* Printing 15 bytes in a line */
+   if (i % 15 == 0)
+   drm_printf(&p, "\n");
+   drm_printf(&p, "0x%x \t", coredump->adev->bios[i]);
+   i++;
+   }
+   drm_printf(&p, "\n");
+   }

I don't think it's too useful to dump this as text.  I was hoping it
could be a binary.  I guess we can just get this from debugfs if we
need it, if a binary is not possible.



Yes, this dumps in text format only, and the binary is already available 
in debugfs. So I am discarding the patch.
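
For reference, serving the image as raw bytes would mean handling the read
callback's offset/count pair directly instead of rendering hex text. A
hypothetical helper, not the submitted patch; 'text_len' stands for the size
of the textual part that precedes the binary region:

static ssize_t coredump_copy_bios(char *buffer, loff_t offset, size_t count,
                                  size_t text_len, const u8 *bios,
                                  size_t bios_size)
{
        loff_t bios_off;

        if (offset < text_len)
                return 0;       /* reader is still inside the text section */

        bios_off = offset - text_len;
        if (bios_off >= bios_size)
                return 0;       /* past the end of the image */

        count = min_t(size_t, count, bios_size - bios_off);
        memcpy(buffer, bios + bios_off, count);

        return count;
}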




Alex



+
 /* Add ring buffer information */
 drm_printf(&p, "Ring buffer information\n");
 for (int i = 0; i < coredump->adev->num_rings; i++) {
--
2.34.1



Re: [PATCH] drm/amdgpu: add IP's FW information to devcoredump

2024-03-27 Thread Khatri, Sunil



On 3/28/2024 8:38 AM, Alex Deucher wrote:

On Tue, Mar 26, 2024 at 1:31 PM Sunil Khatri  wrote:

Add FW information of all the IP's in the devcoredump.

Signed-off-by: Sunil Khatri 

Might want to include the vbios version info as well, e.g.,
atom_context->name
atom_context->vbios_pn
atom_context->vbios_ver_str
atom_context->date
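
Roughly, the addition could look like the sketch below, assuming the
atom_context fields listed above; the function name and labels are
illustrative:

static void amdgpu_devcoredump_vbios_info(struct amdgpu_device *adev,
                                          struct drm_printer *p)
{
        struct atom_context *ctx = adev->mode_info.atom_context;

        if (!ctx)
                return;

        drm_printf(p, "VBIOS name: %s\n", ctx->name);
        drm_printf(p, "VBIOS pn: %s\n", ctx->vbios_pn);
        drm_printf(p, "VBIOS version: %s\n", ctx->vbios_ver_str);
        drm_printf(p, "VBIOS date: %s\n", ctx->date);
}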


Sure, I will add those parameters too.

Regards

Sunil


Either way,
Reviewed-by: Alex Deucher 


---
  .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c  | 122 ++
  1 file changed, 122 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
index 44c5da8aa9ce..d598b6520ec9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
@@ -69,6 +69,124 @@ const char *hw_ip_names[MAX_HWIP] = {
 [PCIE_HWIP] = "PCIE",
  };

+static void amdgpu_devcoredump_fw_info(struct amdgpu_device *adev,
+  struct drm_printer *p)
+{
+   uint32_t version;
+   uint32_t feature;
+   uint8_t smu_program, smu_major, smu_minor, smu_debug;
+
+   drm_printf(p, "VCE feature version: %u, fw version: 0x%08x\n",
+  adev->vce.fb_version, adev->vce.fw_version);
+   drm_printf(p, "UVD feature version: %u, fw version: 0x%08x\n", 0,
+  adev->uvd.fw_version);
+   drm_printf(p, "GMC feature version: %u, fw version: 0x%08x\n", 0,
+  adev->gmc.fw_version);
+   drm_printf(p, "ME feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.me_feature_version, adev->gfx.me_fw_version);
+   drm_printf(p, "PFP feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.pfp_feature_version, adev->gfx.pfp_fw_version);
+   drm_printf(p, "CE feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.ce_feature_version, adev->gfx.ce_fw_version);
+   drm_printf(p, "RLC feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_feature_version, adev->gfx.rlc_fw_version);
+
+   drm_printf(p, "RLC SRLC feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_srlc_feature_version,
+  adev->gfx.rlc_srlc_fw_version);
+   drm_printf(p, "RLC SRLG feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_srlg_feature_version,
+  adev->gfx.rlc_srlg_fw_version);
+   drm_printf(p, "RLC SRLS feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_srls_feature_version,
+  adev->gfx.rlc_srls_fw_version);
+   drm_printf(p, "RLCP feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlcp_ucode_feature_version,
+  adev->gfx.rlcp_ucode_version);
+   drm_printf(p, "RLCV feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlcv_ucode_feature_version,
+  adev->gfx.rlcv_ucode_version);
+   drm_printf(p, "MEC feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.mec_feature_version, adev->gfx.mec_fw_version);
+
+   if (adev->gfx.mec2_fw)
+   drm_printf(p, "MEC2 feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.mec2_feature_version,
+  adev->gfx.mec2_fw_version);
+
+   drm_printf(p, "IMU feature version: %u, fw version: 0x%08x\n", 0,
+  adev->gfx.imu_fw_version);
+   drm_printf(p, "PSP SOS feature version: %u, fw version: 0x%08x\n",
+  adev->psp.sos.feature_version, adev->psp.sos.fw_version);
+   drm_printf(p, "PSP ASD feature version: %u, fw version: 0x%08x\n",
+  adev->psp.asd_context.bin_desc.feature_version,
+  adev->psp.asd_context.bin_desc.fw_version);
+
+   drm_printf(p, "TA XGMI feature version: 0x%08x, fw version: 0x%08x\n",
+  adev->psp.xgmi_context.context.bin_desc.feature_version,
+  adev->psp.xgmi_context.context.bin_desc.fw_version);
+   drm_printf(p, "TA RAS feature version: 0x%08x, fw version: 0x%08x\n",
+  adev->psp.ras_context.context.bin_desc.feature_version,
+  adev->psp.ras_context.context.bin_desc.fw_version);
+   drm_printf(p, "TA HDCP feature version: 0x%08x, fw version: 0x%08x\n",
+  adev->psp.hdcp_context.context.bin_desc.feature_version,
+  adev->psp.hdcp_context.context.bin_desc.fw_version);
+   drm_printf(p, "TA DTM feature version: 0x%08x, fw version: 0x%08x\n",
+  adev->psp.dtm_context.context.bin_desc.feature_version,
+  adev->psp.dtm_context.context.bin_desc.fw_version);
+   drm_printf(p, "TA RAP feature version: 0x%08x, fw version: 0x%08x\n",
+  adev->psp.rap_context.context.bin_desc.feature_version,
+  adev->psp.rap_context.context.bin_desc.fw_version);