Re: [PATCH v2] drm/amdgpu: add ring timeout information in devcoredump
On 3/5/2024 6:40 PM, Christian König wrote:
> Am 05.03.24 um 12:58 schrieb Sunil Khatri:
>> Add ring timeout related information in the amdgpu devcoredump file
>> for debugging purposes. During the GPU recovery process the
>> registered callback is triggered and adds the debug information to
>> the data file created by the devcoredump framework under the
>> directory /sys/class/devcoredump/devcdx/
>>
>> Signed-off-by: Sunil Khatri
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 15 +++++++++++++++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h |  2 ++
>>  2 files changed, 17 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> index a59364e9b6ed..aa7fed59a0d5 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> @@ -196,6 +196,13 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
>>  		   coredump->reset_task_info.process_name,
>>  		   coredump->reset_task_info.pid);
>>
>> +	if (coredump->ring_timeout) {
>> +		drm_printf(&p, "\nRing timed out details\n");
>> +		drm_printf(&p, "IP Type: %d Ring Name: %s\n",
>> +			   coredump->ring->funcs->type,
>> +			   coredump->ring->name);
>> +	}
>> +
>>  	if (coredump->reset_vram_lost)
>>  		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
>>
>>  	if (coredump->adev->reset_info.num_regs) {
>> @@ -220,6 +227,8 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
>>  {
>>  	struct amdgpu_coredump_info *coredump;
>>  	struct drm_device *dev = adev_to_drm(adev);
>> +	struct amdgpu_job *job = reset_context->job;
>> +	struct drm_sched_job *s_job;
>>
>>  	coredump = kzalloc(sizeof(*coredump), GFP_NOWAIT);
>>
>> @@ -228,6 +237,12 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
>>  		return;
>>  	}
>>
>> +	if (job) {
>> +		s_job = &job->base;
>> +		coredump->ring = to_amdgpu_ring(s_job->sched);
>> +		coredump->ring_timeout = true;
>> +	}
>> +
>>  	coredump->reset_vram_lost = vram_lost;
>>
>>  	if (reset_context->job && reset_context->job->vm) {
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>> index 19899f6b9b2b..6d67001a1057 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>> @@ -97,6 +97,8 @@ struct amdgpu_coredump_info {
>>  	struct amdgpu_task_info reset_task_info;
>>  	struct timespec64 reset_time;
>>  	bool reset_vram_lost;
>> +	struct amdgpu_ring *ring;
>> +	bool ring_timeout;
>>  };
>>
>>  #endif
>
> I think you can drop ring_timeout; just having ring as optional
> information should be enough.
>
> Apart from that looks pretty good I think.

A GPU reset can happen for at least two reasons:

1. Via debugfs: cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover. There
   is no timeout or page fault here, so we need to know whether a ring
   timeout happened or not, else we would print empty information in
   the devcoredump. The same goes for page faults: there too we need to
   check whether recovery ran due to a page fault, and only then add
   that information. So to cover all use cases I added this parameter.

Thanks
Sunil

> Regards,
> Christian.
Re: [PATCH] drm/amdgpu: cache in more vm fault information
On 3/6/2024 3:37 PM, Christian König wrote:
> Am 06.03.24 um 10:04 schrieb Sunil Khatri:
>> When a page fault interrupt is raised there is a lot more information
>> that is useful for developers to analyse the page fault.
>
> Well, actually that information is not that interesting, because it is
> hw generation specific. You should probably rather use the decoded
> strings here, e.g. hub, client, xcc_id, node_id etc...
>
> See gmc_v9_0_process_interrupt() for an example.

I saw this; v9 does provide more information than what v10 and v11
provide, like node_id and which die the fault came from, but that is
again very specific to IP_VERSION(9, 4, 3). I don't know why that
information is not there in v10 and v11.

I agree with your point, but as of now during a page fault we are
dumping this information, which is useful: which client generated the
interrupt, for which src, and other information like the address. So I
think we should provide similar information in the devcoredump.
Currently we do not have all this information from either the job or
the vm derived from the job during a reset.

We could surely add more relevant information later on request, but
this information is useful, as eventually it is developers who would
use the dump file provided by a customer to debug.

Below is the information that I dump in devcoredump and I feel that is
good information, but new information could be added later:

Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773) in page
starting at address 0x from client 0x1b (UTCL2)

Regards
Sunil Khatri

> Regards,
> Christian.

>> Add all such information in the last cached page fault from the
>> interrupt handler.
>> Signed-off-by: Sunil Khatri
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 9 +++--
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 7 ++-
>>  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 2 +-
>>  drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 2 +-
>>  drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c  | 2 +-
>>  drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c  | 2 +-
>>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 2 +-
>>  7 files changed, 18 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index 4299ce386322..b77e8e28769d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -2905,7 +2905,7 @@ void amdgpu_debugfs_vm_bo_info(struct amdgpu_vm *vm, struct seq_file *m)
>>   * Cache the fault info for later use by userspace in debugging.
>>   */
>>  void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
>> -				  unsigned int pasid,
>> +				  struct amdgpu_iv_entry *entry,
>>  				  uint64_t addr,
>>  				  uint32_t status,
>>  				  unsigned int vmhub)
>> @@ -2915,7 +2915,7 @@ void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
>>  	xa_lock_irqsave(&adev->vm_manager.pasids, flags);
>>
>> -	vm = xa_load(&adev->vm_manager.pasids, pasid);
>> +	vm = xa_load(&adev->vm_manager.pasids, entry->pasid);
>>  	/* Don't update the fault cache if status is 0.  In the multiple
>>  	 * fault case, subsequent faults will return a 0 status which is
>>  	 * useless for userspace and replaces the useful fault status, so
>> @@ -2924,6 +2924,11 @@ void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
>>  	if (vm && status) {
>>  		vm->fault_info.addr = addr;
>>  		vm->fault_info.status = status;
>> +		vm->fault_info.client_id = entry->client_id;
>> +		vm->fault_info.src_id = entry->src_id;
>> +		vm->fault_info.vmid = entry->vmid;
>> +		vm->fault_info.pasid = entry->pasid;
>> +		vm->fault_info.ring_id = entry->ring_id;
>>  		if (AMDGPU_IS_GFXHUB(vmhub)) {
>>  			vm->fault_info.vmhub = AMDGPU_VMHUB_TYPE_GFX;
>>  			vm->fault_info.vmhub |=
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> index 047ec1930d12..c7782a89bdb5 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> @@ -286,6 +286,11 @@ struct amdgpu_vm_fault_info {
>>  	uint32_t status;
>>  	/* which vmhub? gfxhub, mmhub, etc. */
>>  	unsigned int vmhub;
>> +	unsigned int client_id;
>> +	unsigned int src_id;
>> +	unsigned int ring_id;
>> +	unsigned int pasid;
>> +	unsigned int vmid;
>>  };
>>
>>  struct amdgpu_vm {
>> @@ -605,7 +610,7 @@ static inline void amdgpu_vm_eviction_unlock(struct amdgpu_vm *vm)
>>  }
>>
>>  void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
>> -				  unsigned int pasid,
>> +				  struct amdgpu_iv_entry *entry,
>>  				  uint64_t addr,
>>  				  uint32_t status,
>>  				  unsigned int vmhub);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
>> index d933e19e0cf5..6b177ce8db0e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
Re: [PATCH] drm/amdgpu: cache in more vm fault information
On 3/6/2024 6:12 PM, Christian König wrote:
> Am 06.03.24 um 11:40 schrieb Khatri, Sunil:
>> [snip]
>>
>> Below is the information that I dump in devcoredump and I feel that
>> is good information, but new information could be added later:
>>
>> Page fault information
>> [gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773) in page
>> starting at address 0x from client 0x1b (UTCL2)
>
> This is a perfect example of what I mean. What you record in the
> patch is the client_id, but this is basically meaningless unless you
> have access to the AMD internal hw documentation. What you really
> need is the client in decoded form, in this case UTCL2. You can keep
> the client_id additionally, but the decoded client string is
> mandatory to have, I think.

Sure, I am capturing that information. I am trying to keep memory
interaction to a minimum, since we are still in interrupt context
here; that is why I recorded the integer information instead of
decoding and writing strings right there, postponing the decode until
we dump. E.g. decoding to gfxhub/mmhub based on vmhub/vmid_src, and
the client string from the client id.

So are we good to go with sharing the above details in devcoredump,
using the additional information cached from the page fault?

regards
sunil

> Regards,
> Christian.

>> [snip]
Re: [PATCH] drm/amdgpu: cache in more vm fault information
On 3/6/2024 8:34 PM, Christian König wrote:
> Am 06.03.24 um 15:29 schrieb Alex Deucher:
>> On Wed, Mar 6, 2024 at 8:04 AM Khatri, Sunil wrote:
>> [snip]
>>
>> I think amdgpu_vm_fault_info() has everything you need already
>> (vmhub, status, and addr). client_id and src_id are just tokens in
>> the interrupt cookie so we know which IP to route the interrupt to.
>> We know what they will be, because otherwise we'd be in the
>> interrupt handler for a different IP. I don't think ring_id has any
>> useful information in this context, and vmid and pasid are probably
>> not too useful because they are just tokens to associate the fault
>> with a process. It would be better to have the process name.

Just to share context here, Alex: I am preparing this for devcoredump.
My intention was to replicate the information which KMD shares in
dmesg for page faults. If we do not add the client id specifically, we
would not be able to share enough information in the devcoredump; it
would be just the address and the hub (gfxhub/mmhub), and I think that
is partial information, as src id, client id and IP block together
share good information. For process related information, we are
capturing that as part of the dump from existing functionality.
AMDGPU Device Coredump
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 45.084775181
process_name: soft_recovery_p PID: 1780

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773) in page
starting at address 0x from client 0x1b (UTCL2)
VRAM is lost due to GPU reset!

Regards
Sunil

> The decoded client name would be really useful, I think, since the
> fault handler is a catch-all and handles a whole bunch of different
> clients. But that should ideally be passed in as a const string
> instead of the hw generation specific client_id. As long as it's
> only a pointer, we also don't run into the trouble that we need to
> allocate memory for it.

I agree, but I prefer adding the client id and decoding it in
devcoredump using soc15_ih_clientid_name[fault_info->client_id];
otherwise we have to sprintf that string into fault_info in IRQ
context, which writes more bytes to memory than an integer :)

We can argue about values like pasid, vmid and ring id being taken
off if they are totally not useful.

Regards
Sunil

> Christian.
Re: [PATCH] drm/amdgpu: cache in more vm fault information
On 3/6/2024 9:19 PM, Alex Deucher wrote:
> On Wed, Mar 6, 2024 at 10:32 AM Alex Deucher wrote:
>> On Wed, Mar 6, 2024 at 10:13 AM Khatri, Sunil wrote:
>> [snip]
>
> We also need to include the status register value. That contains the
> important information (type of access, fault type, client, etc.).
> Client_id and src_id are only used to route the interrupt to the
> right software code. E.g., a different client_id and src_id would be
> a completely different interrupt (e.g., vblank or fence, etc.). For
> GPU page faults the client_id and src_id will always be the same.
>
> The devcoredump should also include information about the GPU itself
> (e.g., PCI DID/VID, maybe some of the relevant IP versions).

We already have "status", which is the register
GCVM_L2_PROTECTION_FAULT_STATUS. But the problem here is that this
all needs to be captured in interrupt context, which I want to avoid,
and these are family specific calls.

> Chip family would also be good, and also vram size. If we have a way
> to identify the chip and we have the vm status register and the vm
> fault address, we can decode all of the fault information.

In this patch I am focusing on page fault specific information only
[taking one thing at a time]. But eventually I will be adding more
information as per the devcoredump JIRA plan, and will keep this in
the todo for the other information as well.
Re: [PATCH] drm/amdgpu: cache in more vm fault information
On 3/6/2024 9:07 PM, Christian König wrote:
> Am 06.03.24 um 16:13 schrieb Khatri, Sunil:
> [snip]
>> I agree, but I prefer adding the client id and decoding it in
>> devcoredump using soc15_ih_clientid_name[fault_info->client_id];
>> otherwise we have to sprintf the string into fault_info in IRQ
>> context, which writes more bytes to memory than an integer :)
>
> Well, I totally agree that we shouldn't fiddle too much in
Re: [PATCH] drm/amdgpu: cache in more vm fault information
On 3/6/2024 9:45 PM, Alex Deucher wrote:
> On Wed, Mar 6, 2024 at 11:06 AM Khatri, Sunil wrote:
> [snip]
Re: [PATCH] drm/amdgpu: cache in more vm fault information
On 3/6/2024 9:49 PM, Christian König wrote:
Am 06.03.24 um 17:06 schrieb Khatri, Sunil:
On 3/6/2024 9:07 PM, Christian König wrote:
Am 06.03.24 um 16:13 schrieb Khatri, Sunil:
On 3/6/2024 8:34 PM, Christian König wrote:
Am 06.03.24 um 15:29 schrieb Alex Deucher:
On Wed, Mar 6, 2024 at 8:04 AM Khatri, Sunil wrote:
On 3/6/2024 6:12 PM, Christian König wrote:
Am 06.03.24 um 11:40 schrieb Khatri, Sunil:
On 3/6/2024 3:37 PM, Christian König wrote:
Am 06.03.24 um 10:04 schrieb Sunil Khatri:

When a page fault interrupt is raised there is a lot more information that is useful for developers to analyse the page fault.

Well, actually that information is not that interesting, because it is hw generation specific. You should probably rather use the decoded strings here, e.g. hub, client, xcc_id, node_id etc... See gmc_v9_0_process_interrupt() as an example.

I saw this; v9 does provide more information than what v10 and v11 provide, like node_id and which die the fault came from, but that again is very specific to IP_VERSION(9, 4, 3). I don't know why that information is not there in v10 and v11. I agree with your point, but as of now during a page fault we are dumping this information, which is useful: which client generated the interrupt, for which src, and other information like the address. So I think we should provide similar information in the devcoredump. Currently we do not have all this information from either the job or the vm derived from the job during a reset. We could surely add more relevant information later on per request, but this information is useful, as eventually it is developers who will use the dump file provided by a customer to debug. Below is the information that I dump in devcore, and I feel that is good information, but new information could be added later:

Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773) in page starting at address 0x from client 0x1b (UTCL2)

This is a perfect example of what I mean. What you record in the patch is the client_id, but this is basically meaningless unless you have access to the AMD internal hw documentation. What you really need is the client in decoded form, in this case UTCL2. You can keep the client_id additionally, but the decoded client string is mandatory to have, I think.

Sure, I am capturing that information. I am trying to keep memory interaction to a minimum, since we are still in interrupt context here; that is why I recorded the integer information instead of decoding and writing strings right there, postponing the decode until we dump. Like decoding to gfxhub/mmhub based on vmhub/vmid_src, and the client string from the client id. So are we good to go with sharing the above details in devcoredump, using the additional information cached from the page fault?

I think amdgpu_vm_fault_info() has everything you need already (vmhub, status, and addr). client_id and src_id are just tokens in the interrupt cookie so we know which IP to route the interrupt to. We know what they will be, because otherwise we'd be in the interrupt handler for a different IP. I don't think ring_id has any useful information in this context, and vmid and pasid are probably not too useful because they are just tokens to associate the fault with a process. It would be better to have the process name.

Just to share context here, Alex: I am preparing this for devcoredump, and my intention was to replicate the information which KMD shares in dmesg for page faults. If we do not add the client id specifically, we would not be able to share enough information in devcoredump. It would be just the address and the hub (gfxhub/mmhub), and I think that is partial information, since src id, client id, and IP block convey good information. For process related information, we are capturing it as part of the dump through existing functionality.

AMDGPU Device Coredump
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 45.084775181
process_name: soft_recovery_p PID: 1780
Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0
Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773) in page starting at address 0x from client 0x1b (UTCL2)
VRAM is lost due to GPU reset!

Regards
Sunil

The decoded client name would be really useful, I think, since the fault handler is a catch-all and handles a whole bunch of different clients. But that should ideally be passed in as a const string instead of the hw generation specific client_id. As long as it's only a pointer we also don't run into the trouble that we need to allocate memory for it.

I agree, but I prefer adding the client id and decoding it in devcoredump using soc15_ih_clientid_name[fault_info->client_id]; that is better, else we have to sprintf that string into fault_info in irq context, which is writing more data in the interrupt context.
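The trade-off discussed above — record only integer ids in the interrupt path, and decode them to strings only when the coredump is read — can be sketched as below. This is an illustrative userspace mock under stated assumptions: the table contents and the 0x1b index are hypothetical stand-ins, not the kernel's actual soc15_ih_clientid_name[] values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical subset of an IH client-id -> name table; the real kernel
 * table is soc15_ih_clientid_name[] and its indices/entries differ. */
static const char *const client_name[] = {
	[0x00] = "IH",
	[0x12] = "VMC",
	[0x1b] = "UTCL2",
};

/* Only the integer id is cached in the (interrupt) fault path ... */
struct fault_cache {
	unsigned int client_id;
};

static void cache_fault(struct fault_cache *fc, unsigned int client_id)
{
	fc->client_id = client_id;	/* no allocation, no string work */
}

/* ... and it is decoded to a string only when the dump is generated. */
static const char *decode_client(const struct fault_cache *fc)
{
	size_t n = sizeof(client_name) / sizeof(client_name[0]);

	if (fc->client_id < n && client_name[fc->client_id])
		return client_name[fc->client_id];
	return "unknown";
}
```

The interrupt handler touches one integer; the table lookup yields a const string pointer at dump time, so no memory needs to be allocated or formatted in irq context.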
Re: [PATCH] drm/amdgpu: cache in more vm fault information
As discussed, we decided that we don't need the client id, src id, pasid, etc. in the page fault information dump, so this patch isn't needed anymore. Dropping this patch; the new page fault information will instead be added to the devcoredump from structures that already exist. As discussed, we just need to provide the faulting address and the fault status register, with the GPU family to decode the fault, along with the process information.

Regards
Sunil Khatri
RE: [PATCH] drm/amdgpu: cache in more vm fault information
[AMD Official Use Only - General]

Ignore this. Triggered wrongly.

-----Original Message-----
From: Sunil Khatri
Sent: Wednesday, March 6, 2024 11:50 PM
To: Deucher, Alexander; Koenig, Christian; Sharma, Shashank
Cc: amd-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; linux-ker...@vger.kernel.org; Joshi, Mukul; Paneer Selvam, Arunpravin; Khatri, Sunil
Subject: [PATCH] drm/amdgpu: cache in more vm fault information

When a page fault interrupt is raised there is a lot more information that is useful for developers to analyse the page fault. Add all such information to the last cached page fault from the interrupt handler.

Signed-off-by: Sunil Khatri
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 9 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 7 ++-
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 2 +-
 7 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 4299ce386322..b77e8e28769d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2905,7 +2905,7 @@ void amdgpu_debugfs_vm_bo_info(struct amdgpu_vm *vm, struct seq_file *m)
  * Cache the fault info for later use by userspace in debugging.
  */
 void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
-				  unsigned int pasid,
+				  struct amdgpu_iv_entry *entry,
 				  uint64_t addr,
 				  uint32_t status,
 				  unsigned int vmhub)
@@ -2915,7 +2915,7 @@ void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
 	xa_lock_irqsave(&adev->vm_manager.pasids, flags);
-	vm = xa_load(&adev->vm_manager.pasids, pasid);
+	vm = xa_load(&adev->vm_manager.pasids, entry->pasid);
 	/* Don't update the fault cache if status is 0.  In the multiple
 	 * fault case, subsequent faults will return a 0 status which is
 	 * useless for userspace and replaces the useful fault status, so
@@ -2924,6 +2924,11 @@ void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
 	if (vm && status) {
 		vm->fault_info.addr = addr;
 		vm->fault_info.status = status;
+		vm->fault_info.client_id = entry->client_id;
+		vm->fault_info.src_id = entry->src_id;
+		vm->fault_info.vmid = entry->vmid;
+		vm->fault_info.pasid = entry->pasid;
+		vm->fault_info.ring_id = entry->ring_id;
 		if (AMDGPU_IS_GFXHUB(vmhub)) {
 			vm->fault_info.vmhub = AMDGPU_VMHUB_TYPE_GFX;
 			vm->fault_info.vmhub |=
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index 047ec1930d12..c7782a89bdb5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -286,6 +286,11 @@ struct amdgpu_vm_fault_info {
 	uint32_t	status;
 	/* which vmhub? gfxhub, mmhub, etc. */
 	unsigned int	vmhub;
+	unsigned int	client_id;
+	unsigned int	src_id;
+	unsigned int	ring_id;
+	unsigned int	pasid;
+	unsigned int	vmid;
 };

 struct amdgpu_vm {
@@ -605,7 +610,7 @@ static inline void amdgpu_vm_eviction_unlock(struct amdgpu_vm *vm)
 }
 void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
-				  unsigned int pasid,
+				  struct amdgpu_iv_entry *entry,
 				  uint64_t addr,
 				  uint32_t status,
 				  unsigned int vmhub);
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index d933e19e0cf5..6b177ce8db0e 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -150,7 +150,7 @@ static int gmc_v10_0_process_interrupt(struct amdgpu_device *adev,
 		status = RREG32(hub->vm_l2_pro_fault_status);
 		WREG32_P(hub->vm_l2_pro_fault_cntl, 1, ~1);
-		amdgpu_vm_update_fault_cache(adev, entry->pasid, addr, status,
+		amdgpu_vm_update_fault_cache(adev, entry, addr, status,
 					     entry->vmid_src ?
 					     AMDGPU_MMHUB0(0) : AMDGPU_GFXHUB(0));
 	}
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
index 527dc917e049..bcf254856a3e 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -121,7 +121,7 @@ static int g
Re: [PATCH] drm/amdgpu: add vm fault information to devcoredump
On 3/7/2024 12:51 AM, Deucher, Alexander wrote:
[Public]

-----Original Message-----
From: Sunil Khatri
Sent: Wednesday, March 6, 2024 1:20 PM
To: Deucher, Alexander; Koenig, Christian; Sharma, Shashank
Cc: amd-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; linux-ker...@vger.kernel.org; Joshi, Mukul; Paneer Selvam, Arunpravin; Khatri, Sunil
Subject: [PATCH] drm/amdgpu: add vm fault information to devcoredump

Add page fault information to the devcoredump.

Output of devcoredump:
AMDGPU Device Coredump
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 29.725011811
process_name: soft_recovery_p PID: 1720
Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0
[gfxhub] Page fault observed for GPU family:143
Faulty page starting at address 0x
Protection fault status register:0x301031
VRAM is lost due to GPU reset!

I think we should add a separate section for the GPU identification information (family, PCI ids, IP versions, etc.). For this patch, I think it's fine to just print the fault address and status.

Noted.

Regards
Sunil

Alex

Signed-off-by: Sunil Khatri
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 15 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 +
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 147100c27c2d..d7fea6cdf2f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
 			   coredump->ring->name);
 	}
+	if (coredump->fault_info.status) {
+		struct amdgpu_vm_fault_info *fault_info = &coredump->fault_info;
+
+		drm_printf(&p, "\n[%s] Page fault observed for GPU family:%d\n",
+			   fault_info->vmhub ? "mmhub" : "gfxhub",
+			   coredump->adev->family);
+		drm_printf(&p, "Faulty page starting at address 0x%016llx\n",
+			   fault_info->addr);
+		drm_printf(&p, "Protection fault status register:0x%x\n",
+			   fault_info->status);
+	}
+
 	if (coredump->reset_vram_lost)
-		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
+		drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
 	if (coredump->adev->reset_info.num_regs) {
 		drm_printf(&p, "AMDGPU register dumps:\nOffset:     Value:\n");
@@ -253,6 +265,7 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
 	if (job) {
 		s_job = &job->base;
 		coredump->ring = to_amdgpu_ring(s_job->sched);
+		coredump->fault_info = job->vm->fault_info;
 	}
 	coredump->adev = adev;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
index 60522963aaca..3197955264f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -98,6 +98,7 @@ struct amdgpu_coredump_info {
 	struct timespec64	reset_time;
 	bool			reset_vram_lost;
 	struct amdgpu_ring	*ring;
+	struct amdgpu_vm_fault_info fault_info;
 };
 #endif
--
2.34.1
Re: [PATCH] drm/amdgpu: add vm fault information to devcoredump
On 3/7/2024 1:47 PM, Christian König wrote:
Am 06.03.24 um 19:19 schrieb Sunil Khatri:

Add page fault information to the devcoredump.

Output of devcoredump:
AMDGPU Device Coredump
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 29.725011811
process_name: soft_recovery_p PID: 1720
Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0
[gfxhub] Page fault observed for GPU family:143
Faulty page starting at address 0x
Protection fault status register:0x301031
VRAM is lost due to GPU reset!

Signed-off-by: Sunil Khatri
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 15 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 +
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 147100c27c2d..d7fea6cdf2f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
 			   coredump->ring->name);
 	}
+	if (coredump->fault_info.status) {
+		struct amdgpu_vm_fault_info *fault_info = &coredump->fault_info;
+
+		drm_printf(&p, "\n[%s] Page fault observed for GPU family:%d\n",
+			   fault_info->vmhub ? "mmhub" : "gfxhub",
+			   coredump->adev->family);
+		drm_printf(&p, "Faulty page starting at address 0x%016llx\n",
+			   fault_info->addr);
+		drm_printf(&p, "Protection fault status register:0x%x\n",
+			   fault_info->status);
+	}
+
 	if (coredump->reset_vram_lost)
-		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
+		drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
 	if (coredump->adev->reset_info.num_regs) {
 		drm_printf(&p, "AMDGPU register dumps:\nOffset:     Value:\n");
@@ -253,6 +265,7 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
 	if (job) {
 		s_job = &job->base;
 		coredump->ring = to_amdgpu_ring(s_job->sched);
+		coredump->fault_info = job->vm->fault_info;

That's illegal. The VM pointer might already be stale at this point. I think you need to add the fault info of the last fault globally in the VRAM manager, or move this to the process info Shashank is working on.

Regards,
Christian.

Are you saying that during the reset, or otherwise, a vm which is part of this job could have been freed and we might have a NULL dereference or an invalid reference? So far, with the resets and page faults I have triggered using the same app we are using for IH overflow, I have only ever seen a valid vm. Assuming the amdgpu_vm for this job is freed or stale, are you suggesting updating this information in adev->vm_manager along with the existing per-vm fault_info, or only in vm_manager?

 	}
 	coredump->adev = adev;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
index 60522963aaca..3197955264f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -98,6 +98,7 @@ struct amdgpu_coredump_info {
 	struct timespec64	reset_time;
 	bool			reset_vram_lost;
 	struct amdgpu_ring	*ring;
+	struct amdgpu_vm_fault_info fault_info;
 };
 #endif
Re: [PATCH] drm/amdgpu: add vm fault information to devcoredump
On 3/7/2024 6:10 PM, Christian König wrote:
Am 07.03.24 um 09:37 schrieb Khatri, Sunil:
On 3/7/2024 1:47 PM, Christian König wrote:
Am 06.03.24 um 19:19 schrieb Sunil Khatri:

Add page fault information to the devcoredump.

Output of devcoredump:
AMDGPU Device Coredump
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 29.725011811
process_name: soft_recovery_p PID: 1720
Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0
[gfxhub] Page fault observed for GPU family:143
Faulty page starting at address 0x
Protection fault status register:0x301031
VRAM is lost due to GPU reset!

Signed-off-by: Sunil Khatri
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 15 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 +
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 147100c27c2d..d7fea6cdf2f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
 			   coredump->ring->name);
 	}
+	if (coredump->fault_info.status) {
+		struct amdgpu_vm_fault_info *fault_info = &coredump->fault_info;
+
+		drm_printf(&p, "\n[%s] Page fault observed for GPU family:%d\n",
+			   fault_info->vmhub ? "mmhub" : "gfxhub",
+			   coredump->adev->family);
+		drm_printf(&p, "Faulty page starting at address 0x%016llx\n",
+			   fault_info->addr);
+		drm_printf(&p, "Protection fault status register:0x%x\n",
+			   fault_info->status);
+	}
+
 	if (coredump->reset_vram_lost)
-		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
+		drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
 	if (coredump->adev->reset_info.num_regs) {
 		drm_printf(&p, "AMDGPU register dumps:\nOffset:     Value:\n");
@@ -253,6 +265,7 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
 	if (job) {
 		s_job = &job->base;
 		coredump->ring = to_amdgpu_ring(s_job->sched);
+		coredump->fault_info = job->vm->fault_info;

That's illegal. The VM pointer might already be stale at this point. I think you need to add the fault info of the last fault globally in the VRAM manager, or move this to the process info Shashank is working on.

Are you saying that during the reset, or otherwise, a vm which is part of this job could have been freed and we might have a NULL dereference or an invalid reference? So far, with the resets and page faults I have triggered using the same app we are using for IH overflow, I have only ever seen a valid vm. Assuming the amdgpu_vm for this job is freed or stale, are you suggesting updating this information in adev->vm_manager along with the existing per-vm fault_info, or only in vm_manager?

Good question. Having it both in the VM as well as the VM manager sounds like the simplest option for now.

Let me update the patch then with the information in the VM manager.

Regards
Sunil

Regards,
Christian.

 	}
 	coredump->adev = adev;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
index 60522963aaca..3197955264f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -98,6 +98,7 @@ struct amdgpu_coredump_info {
 	struct timespec64	reset_time;
 	bool			reset_vram_lost;
 	struct amdgpu_ring	*ring;
+	struct amdgpu_vm_fault_info fault_info;
 };
 #endif
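The concern in this exchange — the coredump must not keep a pointer into a VM that may be freed before the dump is read — comes down to copying the fault info by value and keeping a copy in a longer-lived manager structure. Below is a minimal userspace sketch of that ownership pattern; the struct names are simplified stand-ins for amdgpu_vm_fault_info, amdgpu_vm_manager, and amdgpu_coredump_info, not the real kernel definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-ins for the kernel structures. */
struct fault_info {
	uint64_t addr;
	uint32_t status;
	unsigned int vmhub;
};

struct vm      { struct fault_info fault_info; }; /* may be freed any time  */
struct manager { struct fault_info fault_info; }; /* global, outlives VMs   */
struct coredump { struct fault_info fault_info; };/* holds a value, no ptr  */

/* Record the fault both per-VM and in the manager (the option the
 * thread settles on), so the data survives VM teardown. */
static void record_fault(struct vm *vm, struct manager *mgr,
			 uint64_t addr, uint32_t status, unsigned int vmhub)
{
	struct fault_info fi = { .addr = addr, .status = status, .vmhub = vmhub };

	vm->fault_info = fi;
	mgr->fault_info = fi;
}

/* At coredump time, capture by struct copy: no stale VM pointer is
 * dereferenced when the devcoredump file is later read. */
static void capture(struct coredump *cd, const struct manager *mgr)
{
	cd->fault_info = mgr->fault_info;
}
```

The key point is that `capture()` takes its data from the manager copy, so it remains valid even after the originating VM has been freed.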
Re: [PATCH 2/2] drm/amdgpu: add vm fault information to devcoredump
On 3/8/2024 12:44 AM, Alex Deucher wrote:
On Thu, Mar 7, 2024 at 12:00 PM Sunil Khatri wrote:

Add page fault information to the devcoredump.

Output of devcoredump:
AMDGPU Device Coredump
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 29.725011811
process_name: soft_recovery_p PID: 1720
Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0
[gfxhub] Page fault observed
Faulty page starting at address 0x

Do you want a : before the address for consistency?

Sure.

Protection fault status register:0x301031

How about a space after the : for consistency? For parsability, it may make more sense to just have a list of key value pairs:

[GPU page fault]
hub:
addr:
status:

[Ring timeout details]
IP:
ring:
name:

etc.

Sure, I agree, but till now I was capturing the information the way we share it in dmesg, which is user readable. Once we have enough data I could arrange it all in key: value pairs like you suggest, in a later patch, if that works?

VRAM is lost due to GPU reset!

Signed-off-by: Sunil Khatri
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 147100c27c2d..dd39e614d907 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
 			   coredump->ring->name);
 	}
+	if (coredump->adev) {
+		struct amdgpu_vm_fault_info *fault_info =
+			&coredump->adev->vm_manager.fault_info;
+
+		drm_printf(&p, "\n[%s] Page fault observed\n",
+			   fault_info->vmhub ? "mmhub" : "gfxhub");
+		drm_printf(&p, "Faulty page starting at address 0x%016llx\n",
+			   fault_info->addr);
+		drm_printf(&p, "Protection fault status register:0x%x\n",
+			   fault_info->status);
+	}
+
 	if (coredump->reset_vram_lost)
-		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
+		drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
 	if (coredump->adev->reset_info.num_regs) {
 		drm_printf(&p, "AMDGPU register dumps:\nOffset:     Value:\n");
--
2.34.1
Re: [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump
On 3/8/2024 2:39 PM, Christian König wrote:
Am 07.03.24 um 21:50 schrieb Sunil Khatri:

Add page fault information to the devcoredump.

Output of devcoredump:
AMDGPU Device Coredump
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 29.725011811
process_name: soft_recovery_p PID: 1720
Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0
[gfxhub] Page fault observed
Faulty page starting at address: 0x
Protection fault status register: 0x301031
VRAM is lost due to GPU reset!

Signed-off-by: Sunil Khatri
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 147100c27c2d..8794a3c21176 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
 			   coredump->ring->name);
 	}
+	if (coredump->adev) {
+		struct amdgpu_vm_fault_info *fault_info =
+			&coredump->adev->vm_manager.fault_info;
+
+		drm_printf(&p, "\n[%s] Page fault observed\n",
+			   fault_info->vmhub ? "mmhub" : "gfxhub");
+		drm_printf(&p, "Faulty page starting at address: 0x%016llx\n",
+			   fault_info->addr);
+		drm_printf(&p, "Protection fault status register: 0x%x\n",
+			   fault_info->status);
+	}
+
 	if (coredump->reset_vram_lost)
-		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
+		drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");

Why this additional new line?

The intent is for the devcoredump to have different sections clearly demarcated with a new line; otherwise "VRAM is lost due to GPU reset!" looks like part of the page fault information:

[gfxhub] Page fault observed
Faulty page starting at address: 0x
Protection fault status register: 0x301031

VRAM is lost due to GPU reset!

Regards
Sunil

Apart from that, looks really good to me.

Regards,
Christian.

 	if (coredump->adev->reset_info.num_regs) {
 		drm_printf(&p, "AMDGPU register dumps:\nOffset:     Value:\n");
RE: [PATCH] drm/amdgpu: add all ringbuffer information in devcoredump
Ignore this, as I updated the commit message and subject and am sending a new mail.

-----Original Message-----
From: Sunil Khatri
Sent: Monday, March 11, 2024 5:04 PM
To: Deucher, Alexander; Koenig, Christian; Sharma, Shashank
Cc: amd-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; linux-ker...@vger.kernel.org; Khatri, Sunil
Subject: [PATCH] drm/amdgpu: add all ringbuffer information in devcoredump

Add ringbuffer information such as: rptr, wptr, ring name, ring size and also the ring contents for each ring on a gpu reset.

Signed-off-by: Sunil Khatri
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 6d059f853adc..1992760039da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -215,6 +215,27 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
 			   fault_info->status);
 	}
+	drm_printf(&p, "Ring buffer information\n");
+	for (int i = 0; i < coredump->adev->num_rings; i++) {
+		int j = 0;
+		struct amdgpu_ring *ring = coredump->adev->rings[i];
+
+		drm_printf(&p, "ring name: %s\n", ring->name);
+		drm_printf(&p, "Rptr: 0x%llx Wptr: 0x%llx\n",
+			   amdgpu_ring_get_rptr(ring) & ring->buf_mask,
+			   amdgpu_ring_get_wptr(ring) & ring->buf_mask);
+		drm_printf(&p, "Ring size in dwords: %d\n",
+			   ring->ring_size / 4);
+		drm_printf(&p, "Ring contents\n");
+		drm_printf(&p, "Offset \t Value\n");
+
+		while (j < ring->ring_size) {
+			drm_printf(&p, "0x%x \t 0x%x\n", j, ring->ring[j / 4]);
+			j += 4;
+		}
+		drm_printf(&p, "Ring dumped\n");
+	}
+
 	if (coredump->reset_vram_lost)
 		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
 	if (coredump->adev->reset_info.num_regs) {
--
2.34.1
Re: [PATCH] drm/amdgpu: add ring buffer information in devcoredump
On 3/11/2024 7:29 PM, Christian König wrote: Am 11.03.24 um 13:22 schrieb Sunil Khatri: Add relevant ring buffer information such as rptr, wptr, ring name, ring size and also the ring contents for each ring on a gpu reset. Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 21 + 1 file changed, 21 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index 6d059f853adc..1992760039da 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -215,6 +215,27 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, fault_info->status); } + drm_printf(&p, "Ring buffer information\n"); + for (int i = 0; i < coredump->adev->num_rings; i++) { + int j = 0; + struct amdgpu_ring *ring = coredump->adev->rings[i]; + + drm_printf(&p, "ring name: %s\n", ring->name); + drm_printf(&p, "Rptr: 0x%llx Wptr: 0x%llx\n", + amdgpu_ring_get_rptr(ring) & ring->buf_mask, + amdgpu_ring_get_wptr(ring) & ring->buf_mask); Don't apply the mask here. We do have some use cases where the rptr and wptr are outside the ring buffer. Sure, I will remove the mask. + drm_printf(&p, "Ring size in dwords: %d\n", + ring->ring_size / 4); Rather print the mask as additional value here. Does adding the mask value help? + drm_printf(&p, "Ring contents\n"); + drm_printf(&p, "Offset \t Value\n"); + + while (j < ring->ring_size) { + drm_printf(&p, "0x%x \t 0x%x\n", j, ring->ring[j/4]); + j += 4; + } + drm_printf(&p, "Ring dumped\n"); That seems superfluous. Noted Regards Sunil Regards, Christian. + } + if (coredump->reset_vram_lost) drm_printf(&p, "VRAM is lost due to GPU reset!\n"); if (coredump->adev->reset_info.num_regs) {
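Christian's point about the mask can be illustrated in isolation. amdgpu ring buffers are power-of-two sized, so the dword mask is (ring size in dwords) minus one; rptr/wptr are free-running counters that can legitimately point past the buffer, and masking them hides that fact. A minimal userspace sketch (function names are illustrative, not the driver's):

```c
#include <assert.h>

/* buf_mask for a power-of-two ring: ring_size_bytes/4 dwords, minus one. */
static unsigned int dword_mask(unsigned int ring_size_bytes)
{
	return ring_size_bytes / 4 - 1;
}

/* A free-running pointer wraps to a buffer offset via the mask. Printing
 * the raw pointer instead (as Christian asks) preserves the information
 * that the pointer ran outside the ring buffer. */
static unsigned int wrapped_offset(unsigned long long ptr, unsigned int mask)
{
	return (unsigned int)(ptr & mask);
}
```

Printing the raw rptr/wptr plus the mask as a separate value gives a reader of the dump both the true counter and enough information to compute the buffer offset.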
Re: [PATCH 1/2] drm/amdgpu: add the IP information of the soc
[AMD Official Use Only - General] Gentle Reminder for review. Regards, Sunil Get Outlook for Android<https://aka.ms/AAb9ysg> From: Sunil Khatri Sent: Tuesday, March 12, 2024 6:11:47 PM To: Deucher, Alexander ; Koenig, Christian ; Sharma, Shashank Cc: amd-...@lists.freedesktop.org ; dri-devel@lists.freedesktop.org ; linux-ker...@vger.kernel.org ; Khatri, Sunil Subject: [PATCH 1/2] drm/amdgpu: add the IP information of the soc Add all the IP's information on a SOC to the devcoredump. Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index a0dbccad2f53..611fdb90a1fc 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -196,6 +196,25 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, coredump->reset_task_info.process_name, coredump->reset_task_info.pid); + /* GPU IP's information of the SOC */ + if (coredump->adev) { + drm_printf(&p, "\nIP Information\n"); + drm_printf(&p, "SOC Family: %d\n", coredump->adev->family); + drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id); + + for (int i = 0; i < coredump->adev->num_ip_blocks; i++) { + struct amdgpu_ip_block *ip = + &coredump->adev->ip_blocks[i]; + drm_printf(&p, "IP type: %d IP name: %s\n", + ip->version->type, + ip->version->funcs->name); + drm_printf(&p, "IP version: (%d,%d,%d)\n\n", + ip->version->major, + ip->version->minor, + ip->version->rev); + } + } + if (coredump->ring) { drm_printf(&p, "\nRing timed out details\n"); drm_printf(&p, "IP Type: %d Ring Name: %s\n", -- 2.34.1
Re: [PATCH 2/2] drm:amdgpu: add firmware information of all IP's
[AMD Official Use Only - General] Gentle reminder Regards Sunil Get Outlook for Android<https://aka.ms/AAb9ysg> From: Sunil Khatri Sent: Tuesday, March 12, 2024 6:11:48 PM To: Deucher, Alexander ; Koenig, Christian ; Sharma, Shashank Cc: amd-...@lists.freedesktop.org ; dri-devel@lists.freedesktop.org ; linux-ker...@vger.kernel.org ; Khatri, Sunil Subject: [PATCH 2/2] drm:amdgpu: add firmware information of all IP's Add firmware version information of each IP and each instance where applicable. Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 122 ++ 1 file changed, 122 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index 611fdb90a1fc..78ddc58aef67 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -168,6 +168,123 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost, { } #else +static void amdgpu_devcoredump_fw_info(struct amdgpu_device *adev, struct drm_printer *p) +{ + uint32_t version; + uint32_t feature; + uint8_t smu_program, smu_major, smu_minor, smu_debug; + + drm_printf(p, "VCE feature version: %u, fw version: 0x%08x\n", + adev->vce.fb_version, adev->vce.fw_version); + drm_printf(p, "UVD feature version: %u, fw version: 0x%08x\n", + 0, adev->uvd.fw_version); + drm_printf(p, "GMC feature version: %u, fw version: 0x%08x\n", + 0, adev->gmc.fw_version); + drm_printf(p, "ME feature version: %u, fw version: 0x%08x\n", + adev->gfx.me_feature_version, adev->gfx.me_fw_version); + drm_printf(p, "PFP feature version: %u, fw version: 0x%08x\n", + adev->gfx.pfp_feature_version, adev->gfx.pfp_fw_version); + drm_printf(p, "CE feature version: %u, fw version: 0x%08x\n", + adev->gfx.ce_feature_version, adev->gfx.ce_fw_version); + drm_printf(p, "RLC feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlc_feature_version, adev->gfx.rlc_fw_version); + + drm_printf(p, "RLC SRLC feature version: %u, fw version: 
0x%08x\n", + adev->gfx.rlc_srlc_feature_version, + adev->gfx.rlc_srlc_fw_version); + drm_printf(p, "RLC SRLG feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlc_srlg_feature_version, + adev->gfx.rlc_srlg_fw_version); + drm_printf(p, "RLC SRLS feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlc_srls_feature_version, + adev->gfx.rlc_srls_fw_version); + drm_printf(p, "RLCP feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlcp_ucode_feature_version, + adev->gfx.rlcp_ucode_version); + drm_printf(p, "RLCV feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlcv_ucode_feature_version, + adev->gfx.rlcv_ucode_version); + drm_printf(p, "MEC feature version: %u, fw version: 0x%08x\n", + adev->gfx.mec_feature_version, + adev->gfx.mec_fw_version); + + if (adev->gfx.mec2_fw) + drm_printf(p, + "MEC2 feature version: %u, fw version: 0x%08x\n", + adev->gfx.mec2_feature_version, + adev->gfx.mec2_fw_version); + + drm_printf(p, "IMU feature version: %u, fw version: 0x%08x\n", + 0, adev->gfx.imu_fw_version); + drm_printf(p, "PSP SOS feature version: %u, fw version: 0x%08x\n", + adev->psp.sos.feature_version, + adev->psp.sos.fw_version); + drm_printf(p, "PSP ASD feature version: %u, fw version: 0x%08x\n", + adev->psp.asd_context.bin_desc.feature_version, + adev->psp.asd_context.bin_desc.fw_version); + + drm_printf(p, "TA XGMI feature version: 0x%08x, fw version: 0x%08x\n", + adev->psp.xgmi_context.context.bin_desc.feature_version, + adev->psp.xgmi_context.context.bin_desc.fw_version); + drm_printf(p, "TA RAS feature version: 0x%08x, fw version: 0x%08x\n", + adev->psp.ras_context.context.bin_desc.feature_version, + adev->psp.ras_context.context.bin_desc.fw_version); + drm_printf(p, "TA HDCP feature version: 0x%08x, fw version: 0x%08x\n", + adev->psp.hdcp_context.context.bin_desc.feature_version, + adev->psp.hdcp_context.cont
Re: [PATCH 1/2] drm/amdgpu: add the IP information of the soc
On 3/14/2024 1:58 AM, Alex Deucher wrote: On Tue, Mar 12, 2024 at 8:41 AM Sunil Khatri wrote: Add all the IP's information on a SOC to the devcoredump. Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index a0dbccad2f53..611fdb90a1fc 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -196,6 +196,25 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, coredump->reset_task_info.process_name, coredump->reset_task_info.pid); + /* GPU IP's information of the SOC */ + if (coredump->adev) { + drm_printf(&p, "\nIP Information\n"); + drm_printf(&p, "SOC Family: %d\n", coredump->adev->family); + drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id); + + for (int i = 0; i < coredump->adev->num_ip_blocks; i++) { + struct amdgpu_ip_block *ip = + &coredump->adev->ip_blocks[i]; + drm_printf(&p, "IP type: %d IP name: %s\n", + ip->version->type, + ip->version->funcs->name); + drm_printf(&p, "IP version: (%d,%d,%d)\n\n", + ip->version->major, + ip->version->minor, + ip->version->rev); + } + } I think the IP discovery table would be more useful. Either walk the adev->ip_versions structure, or just include the IP discovery binary. I did explore adev->ip_versions, and just going through the array doesn't give any useful information directly. There is no way to find the things below directly from adev->ip_versions unless I also reparse the discovery binary again, like amdgpu_discovery_reg_base_init does, and walk through the headers of the various IPs using the discovery binary: a. Which IPs are available on the SOC or not. b. How many instances there are. Also, I would again have to change back to the major, minor and rev convention for this information to be useful. I am exploring it more; if I find some other information I will update. 
adev->ip_block[] is derived from IP discovery only, for each block that is present on the SOC, so we are not reading information that isn't applicable to the SOC. We have the name, type and version number of the IPs available on the SOC. If you want, I could also add the number of instances of each IP if you think that's useful information here. Could you share what information is missing in this approach so I can include it? As for dumping the IP discovery binary, I don't understand how that information would be useful directly, since it needs to be decoded like we do in discovery init. Please correct me if my understanding is wrong here. Alex + if (coredump->ring) { drm_printf(&p, "\nRing timed out details\n"); drm_printf(&p, "IP Type: %d Ring Name: %s\n", -- 2.34.1
Re: [PATCH 2/2] drm:amdgpu: add firmware information of all IP's
On 3/14/2024 2:06 AM, Alex Deucher wrote: On Tue, Mar 12, 2024 at 8:42 AM Sunil Khatri wrote: Add firmware version information of each IP and each instance where applicable. Is there a way we can share some common code with devcoredump, debugfs, and the info IOCTL? All three places need to query this information and the same logic is repeated in each case. Hello Alex, yes, you are absolutely right: the same information is being retrieved again, as is done in debugfs. I can reorganize the code so the same function could be used by debugfs and devcoredump, but this is exactly what I tried to avoid here. I tried to use minimal functionality in devcoredump without shuffling a lot of code around. Also, our devcoredump is implemented in amdgpu_reset.c, and not all the information is available there, so we might have to include a lot of headers and cross-reference functions in amdgpu_reset until we have a dedicated file for devcoredump. The info IOCTL does have a lot of information that is also in the pipeline to be dumped, but if we want to reuse the functionality of the IOCTL we need to reorganize a lot of code. If that is the need of the hour, I could work on that. Please let me know. This is my suggestion, if it makes sense: 1. If we want to reuse a lot of functionality, then we need to modularize some of the functions further so they could be consumed directly by devcoredump. 2. We should also have a dedicated devcoredump.c/.h file so it is easy to include headers of the needed functionality cleanly and easy to expand devcoredump. 3. Based on the priority and importance of this task we can add information; otherwise some repetition is a real possibility. 
Alex Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 122 ++ 1 file changed, 122 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index 611fdb90a1fc..78ddc58aef67 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -168,6 +168,123 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost, { } #else +static void amdgpu_devcoredump_fw_info(struct amdgpu_device *adev, struct drm_printer *p) +{ + uint32_t version; + uint32_t feature; + uint8_t smu_program, smu_major, smu_minor, smu_debug; + + drm_printf(p, "VCE feature version: %u, fw version: 0x%08x\n", + adev->vce.fb_version, adev->vce.fw_version); + drm_printf(p, "UVD feature version: %u, fw version: 0x%08x\n", + 0, adev->uvd.fw_version); + drm_printf(p, "GMC feature version: %u, fw version: 0x%08x\n", + 0, adev->gmc.fw_version); + drm_printf(p, "ME feature version: %u, fw version: 0x%08x\n", + adev->gfx.me_feature_version, adev->gfx.me_fw_version); + drm_printf(p, "PFP feature version: %u, fw version: 0x%08x\n", + adev->gfx.pfp_feature_version, adev->gfx.pfp_fw_version); + drm_printf(p, "CE feature version: %u, fw version: 0x%08x\n", + adev->gfx.ce_feature_version, adev->gfx.ce_fw_version); + drm_printf(p, "RLC feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlc_feature_version, adev->gfx.rlc_fw_version); + + drm_printf(p, "RLC SRLC feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlc_srlc_feature_version, + adev->gfx.rlc_srlc_fw_version); + drm_printf(p, "RLC SRLG feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlc_srlg_feature_version, + adev->gfx.rlc_srlg_fw_version); + drm_printf(p, "RLC SRLS feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlc_srls_feature_version, + adev->gfx.rlc_srls_fw_version); + drm_printf(p, "RLCP feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlcp_ucode_feature_version, + 
adev->gfx.rlcp_ucode_version); + drm_printf(p, "RLCV feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlcv_ucode_feature_version, + adev->gfx.rlcv_ucode_version); + drm_printf(p, "MEC feature version: %u, fw version: 0x%08x\n", + adev->gfx.mec_feature_version, + adev->gfx.mec_fw_version); + + if (adev->gfx.mec2_fw) + drm_printf(p, + "MEC2 feature version: %u, fw version: 0x%08x\n", + adev->gfx.mec2_feature_version, + adev->gfx.mec2_fw_version); + + drm_printf(p, "IMU feature version: %u, fw version: 0x%08x\n", + 0, adev->gfx.imu_fw_version); + drm_printf(p, "PSP SOS feature version: %u, fw version: 0x%08x\n", + adev->psp.sos.feature_version, + adev->psp.sos.fw_version); + drm_printf(p, "PSP ASD feat
Re: [PATCH 2/2] drm:amdgpu: add firmware information of all IP's
On 3/14/2024 11:40 AM, Sharma, Shashank wrote: On 14/03/2024 06:58, Khatri, Sunil wrote: On 3/14/2024 2:06 AM, Alex Deucher wrote: On Tue, Mar 12, 2024 at 8:42 AM Sunil Khatri wrote: Add firmware version information of each IP and each instance where applicable. Is there a way we can share some common code with devcoredump, debugfs, and the info IOCTL? All three places need to query this information and the same logic is repeated in each case. Hello Alex, yes, you are absolutely right: the same information is being retrieved again, as is done in debugfs. I can reorganize the code so the same function could be used by debugfs and devcoredump, but this is exactly what I tried to avoid here. I tried to use minimal functionality in devcoredump without shuffling a lot of code around. Also, our devcoredump is implemented in amdgpu_reset.c, and not all the information is available there, so we might have to include a lot of headers and cross-reference functions in amdgpu_reset until we have a dedicated file for devcoredump. I think Alex is suggesting to have one common backend to generate all the core debug info, and then different wrapper functions which can pack this raw info into the packets aligned with respective infra like devcore/debugfs/info IOCTL, which seems like a good idea to me. If you think you need a new file for this backend it should be fine. My suggestion was along the same lines: if we want to use the same infra to access information from different parts of the code, then we need to reorganize. At the same time, since there is quite some data we are planning to add to devcoredump, I recommend having a dedicated .c/.h instead of using amdgpu_reset.c, so a clean include is easy to maintain. Once Alex confirms it, I can start working on the design and on what information we need for this. 
Regards Sunil something like: amdgpu_debug_core.c:: struct amdgpu_core_debug_info { /* Superset of all the info you are collecting from HW */ }; - amdgpu_debug_generate_core_info { /* This function collects the core debug info from HW and saves in amdgpu_core_debug_info, we can update this periodically regardless of a request */ } and then: devcore_info *amdgpu_debug_pack_for_devcore(core_debug_info) { /* convert core debug info into devcore aligned format/data */ } ioctl_info *amdgpu_debug_pack_for_info_ioctl(core_debug_info) { /* convert core debug info into info IOCTL aligned format/data */ } debugfs_info *amdgpu_debug_pack_for_debugfs(core_debug_info) { /* convert core debug info into debugfs aligned format/data */ } - Shashank Info IOCTL does have a lot of information which also is in pipeline to be dumped but this if we want to reuse the functionality of IOCTL we need to reorganize a lot of code. If that is the need of the hour i could work on that. Please let me know. This is my suggestion if it makes sense: 1. If we want to reuse a lot of functionality then we need to modularize some of the functions further so they could be consumed directly by devcoredump. 2. We should also have a dedicated file for devcoredump.c/.h so its easy to include headers of needed functionality cleanly and easy to expand devcoredump. 3. based on the priority and importance of this task we can add information else some repetition is a real possibility. 
Alex Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 122 ++ 1 file changed, 122 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index 611fdb90a1fc..78ddc58aef67 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -168,6 +168,123 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost, { } #else +static void amdgpu_devcoredump_fw_info(struct amdgpu_device *adev, struct drm_printer *p) +{ + uint32_t version; + uint32_t feature; + uint8_t smu_program, smu_major, smu_minor, smu_debug; + + drm_printf(p, "VCE feature version: %u, fw version: 0x%08x\n", + adev->vce.fb_version, adev->vce.fw_version); + drm_printf(p, "UVD feature version: %u, fw version: 0x%08x\n", + 0, adev->uvd.fw_version); + drm_printf(p, "GMC feature version: %u, fw version: 0x%08x\n", + 0, adev->gmc.fw_version); + drm_printf(p, "ME feature version: %u, fw version: 0x%08x\n", + adev->gfx.me_feature_version, adev->gfx.me_fw_version); + drm_printf(p, "PFP feature version: %u, fw version: 0x%08x\n", + adev->gfx.pfp_feature_version, adev->gfx.pfp_fw_version); + drm_printf(p, "CE feature version: %u, fw version: 0x%08x\n", + adev->gfx.ce_feature_version, adev->gfx.ce_fw_version); +
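Shashank's pseudocode above can be sketched in plain C. Everything here is illustrative (struct fields, function names are not the driver's): one collector fills a shared superset struct, and thin per-consumer formatters pack it, so the devcoredump, debugfs and info-IOCTL paths never each re-query the hardware:

```c
#include <stdio.h>
#include <string.h>

/* Illustrative superset of the shared debug info; the real struct would
 * hold everything the three consumers need. */
struct core_debug_info {
	unsigned int family;
	unsigned int rev_id;
	char ring_name[16];
};

/* One collector; in the driver this would read adev/HW state. */
static void collect_core_debug_info(struct core_debug_info *info)
{
	info->family = 143;
	info->rev_id = 0;
	snprintf(info->ring_name, sizeof(info->ring_name), "gfx_0.0.0");
}

/* Consumers only format the already-collected data; a debugfs or
 * info-IOCTL packer would be another thin function like this one. */
static int pack_for_devcore(const struct core_debug_info *info,
			    char *buf, size_t len)
{
	return snprintf(buf, len,
			"SOC Family: %u\nSOC Revision id: %u\nRing Name: %s\n",
			info->family, info->rev_id, info->ring_name);
}
```

The design point is the single collection step: the formatters stay trivial, and adding a new consumer never duplicates the hardware-query logic.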
Re: [PATCH 1/2] drm/amdgpu: add the IP information of the soc
On 3/14/2024 8:12 PM, Alex Deucher wrote: On Thu, Mar 14, 2024 at 1:44 AM Khatri, Sunil wrote: On 3/14/2024 1:58 AM, Alex Deucher wrote: On Tue, Mar 12, 2024 at 8:41 AM Sunil Khatri wrote: Add all the IP's information on a SOC to the devcoredump. Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index a0dbccad2f53..611fdb90a1fc 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -196,6 +196,25 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, coredump->reset_task_info.process_name, coredump->reset_task_info.pid); + /* GPU IP's information of the SOC */ + if (coredump->adev) { + drm_printf(&p, "\nIP Information\n"); + drm_printf(&p, "SOC Family: %d\n", coredump->adev->family); + drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id); + + for (int i = 0; i < coredump->adev->num_ip_blocks; i++) { + struct amdgpu_ip_block *ip = + &coredump->adev->ip_blocks[i]; + drm_printf(&p, "IP type: %d IP name: %s\n", + ip->version->type, + ip->version->funcs->name); + drm_printf(&p, "IP version: (%d,%d,%d)\n\n", + ip->version->major, + ip->version->minor, + ip->version->rev); + } + } I think the IP discovery table would be more useful. Either walk the adev->ip_versions structure, or just include the IP discovery binary. I did explore the adev->ip_versions and if i just go through the array it doesn't give any useful information directly. There are no ways to find directly from adev->ip_versions below things until i also reparse the discovery binary again like done the discovery amdgpu_discovery_reg_base_init and walk through the headers of various ips using discovery binary. a. Which IP is available on soc or not. b. 
How many instances there are. Also, I would again have to change back to the major, minor and rev convention for this information to be useful. I am exploring it more; if I find some other information I will update. adev->ip_block[] is derived from IP discovery only, for each block that is present on the SOC, so we are not reading information that isn't applicable to the SOC. We have the name, type and version number of the IPs available on the SOC. If you want, I could also add the number of instances of each IP if you think that's useful information here. Could you share what information is missing in this approach so I can include it? I was hoping to get the actual IP versions for the IPs from IP discovery rather than the versions from the ip_block array. The latter are common so you can end up with the same version used across a wide variety of chips (e.g., all gfx10.x based chips use the same gfx 10 IP code even if the actual IP version is different for most of the chips). Got it, let me check how it could be done correctly. As for dumping the IP discovery binary, I don't understand how that information would be useful directly, since it needs to be decoded like we do in discovery init. Please correct me if my understanding is wrong here. It's probably not a high priority, I was just thinking it might be useful to have in case there ended up being some problem related to the IP discovery table on some boards. E.g., we'd know that all boards with a certain harvest config seem to align with a reported problem. Similar for vbios. It's more for telemetry. E.g., all the boards reporting some problem have a particular powerplay config or whatever. I got it. But there are two points of contention here, in my understanding. First, the dump works only when there is a reset, and I am not sure whether it could be used very early in development or not. Second, devcoredump is 4096 bytes/4 KB of memory where we are dumping all the information. 
I am not sure if that could be increased, but it might not be enough if we are planning to dump everything to it. Another point: since we have sysfs/debugfs/info ioctl etc. information available, we should sort out what really is helpful in debugging a GPU hang and add that to devcore. Regards Sunil Alex Alex + if (coredump->ring) { drm_printf(&p, "\nRing timed out details\n"); drm_printf(&p, "IP Type: %d Ring Name: %s\n", -- 2.34.1
RE: [PATCH] drm/amdgpu: add the hw_ip version of all IP's
[AMD Official Use Only - General] Hello Alex Added the information directly from the ip_version and also added names for each ip so the version information makes more sense to the user. Below is the output in devcoredump now: IP Information SOC Family: 143 SOC Revision id: 0 SOC External Revision id: 50 HWIP: GC[1][0]: v10.3.2.0.0 HWIP: HDP[2][0]: v5.0.3.0.0 HWIP: SDMA0[3][0]: v5.2.2.0.0 HWIP: SDMA1[4][0]: v5.2.2.0.0 HWIP: MMHUB[12][0]: v2.1.0.0.0 HWIP: ATHUB[13][0]: v2.1.0.0.0 HWIP: NBIO[14][0]: v3.3.1.0.0 HWIP: MP0[15][0]: v11.0.11.0.0 HWIP: MP1[16][0]: v11.0.11.0.0 HWIP: UVD/JPEG/VCN[17][0]: v3.0.0.0.0 HWIP: UVD/JPEG/VCN[17][1]: v3.0.1.0.0 HWIP: DF[21][0]: v3.7.3.0.0 HWIP: DCE[22][0]: v3.0.0.0.0 HWIP: OSSSYS[23][0]: v5.0.3.0.0 HWIP: SMUIO[24][0]: v11.0.6.0.0 HWIP: NBIF[26][0]: v3.3.1.0.0 HWIP: THM[27][0]: v11.0.5.0.0 HWIP: CLK[28][0]: v11.0.3.0.0 HWIP: CLK[28][1]: v11.0.3.0.0 HWIP: CLK[28][2]: v11.0.3.0.0 HWIP: CLK[28][3]: v11.0.3.0.0 HWIP: CLK[28][4]: v11.0.3.0.0 HWIP: CLK[28][5]: v11.0.3.0.0 HWIP: CLK[28][6]: v11.0.3.0.0 HWIP: CLK[28][7]: v11.0.3.0.0 HWIP: UMC[29][0]: v8.7.1.0.0 HWIP: UMC[29][1]: v8.7.1.0.0 HWIP: UMC[29][2]: v8.7.1.0.0 HWIP: UMC[29][3]: v8.7.1.0.0 HWIP: UMC[29][4]: v8.7.1.0.0 HWIP: UMC[29][5]: v8.7.1.0.0 HWIP: PCIE[33][0]: v6.5.0.0.0 -Original Message- From: Sunil Khatri Sent: Friday, March 15, 2024 5:43 PM To: Deucher, Alexander ; Koenig, Christian ; Sharma, Shashank Cc: amd-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; linux-ker...@vger.kernel.org; Khatri, Sunil Subject: [PATCH] drm/amdgpu: add the hw_ip version of all IP's Add all the IP's version information on a SOC to the devcoredump. 
Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 62 +++ 1 file changed, 62 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index a0dbccad2f53..3d4bfe0a5a7c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -29,6 +29,43 @@ #include "sienna_cichlid.h" #include "smu_v13_0_10.h" +const char *hw_ip_names[MAX_HWIP] = { + [GC_HWIP] = "GC", + [HDP_HWIP] = "HDP", + [SDMA0_HWIP]= "SDMA0", + [SDMA1_HWIP]= "SDMA1", + [SDMA2_HWIP]= "SDMA2", + [SDMA3_HWIP]= "SDMA3", + [SDMA4_HWIP]= "SDMA4", + [SDMA5_HWIP]= "SDMA5", + [SDMA6_HWIP]= "SDMA6", + [SDMA7_HWIP]= "SDMA7", + [LSDMA_HWIP]= "LSDMA", + [MMHUB_HWIP]= "MMHUB", + [ATHUB_HWIP]= "ATHUB", + [NBIO_HWIP] = "NBIO", + [MP0_HWIP] = "MP0", + [MP1_HWIP] = "MP1", + [UVD_HWIP] = "UVD/JPEG/VCN", + [VCN1_HWIP] = "VCN1", + [VCE_HWIP] = "VCE", + [VPE_HWIP] = "VPE", + [DF_HWIP] = "DF", + [DCE_HWIP] = "DCE", + [OSSSYS_HWIP] = "OSSSYS", + [SMUIO_HWIP]= "SMUIO", + [PWR_HWIP] = "PWR", + [NBIF_HWIP] = "NBIF", + [THM_HWIP] = "THM", + [CLK_HWIP] = "CLK", + [UMC_HWIP] = "UMC", + [RSMU_HWIP] = "RSMU", + [XGMI_HWIP] = "XGMI", + [DCI_HWIP] = "DCI", + [PCIE_HWIP] = "PCIE", +}; + + int amdgpu_reset_init(struct amdgpu_device *adev) { int ret = 0; @@ -196,6 +233,31 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, coredump->reset_task_info.process_name, coredump->reset_task_info.pid); + /* GPU IP's information of the SOC */ + if (coredump->adev) { + + drm_printf(&p, "\nIP Information\n"); + drm_printf(&p, "SOC Family: %d\n", coredump->adev->family); + drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id); + drm_printf(&p, "SOC External Revision id: %d\n", + coredump->adev->external_rev_id); + + for (int i = 1; i < MAX_HWIP; i++) { + for (int j = 0; j < HWIP_MAX_INSTANCE; j++) { + int ver = coredump->adev->ip_versions[i][j]; + + if (ve
Re: [PATCH] drm/amdgpu: add the hw_ip version of all IP's
On 3/15/2024 6:45 PM, Alex Deucher wrote: On Fri, Mar 15, 2024 at 8:13 AM Sunil Khatri wrote: Add all the IP's version information on a SOC to the devcoredump. Signed-off-by: Sunil Khatri This looks great. Reviewed-by: Alex Deucher Thanks Alex --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 62 +++ 1 file changed, 62 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index a0dbccad2f53..3d4bfe0a5a7c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -29,6 +29,43 @@ #include "sienna_cichlid.h" #include "smu_v13_0_10.h" +const char *hw_ip_names[MAX_HWIP] = { + [GC_HWIP] = "GC", + [HDP_HWIP] = "HDP", + [SDMA0_HWIP]= "SDMA0", + [SDMA1_HWIP]= "SDMA1", + [SDMA2_HWIP]= "SDMA2", + [SDMA3_HWIP]= "SDMA3", + [SDMA4_HWIP]= "SDMA4", + [SDMA5_HWIP]= "SDMA5", + [SDMA6_HWIP]= "SDMA6", + [SDMA7_HWIP]= "SDMA7", + [LSDMA_HWIP]= "LSDMA", + [MMHUB_HWIP]= "MMHUB", + [ATHUB_HWIP]= "ATHUB", + [NBIO_HWIP] = "NBIO", + [MP0_HWIP] = "MP0", + [MP1_HWIP] = "MP1", + [UVD_HWIP] = "UVD/JPEG/VCN", + [VCN1_HWIP] = "VCN1", + [VCE_HWIP] = "VCE", + [VPE_HWIP] = "VPE", + [DF_HWIP] = "DF", + [DCE_HWIP] = "DCE", + [OSSSYS_HWIP] = "OSSSYS", + [SMUIO_HWIP]= "SMUIO", + [PWR_HWIP] = "PWR", + [NBIF_HWIP] = "NBIF", + [THM_HWIP] = "THM", + [CLK_HWIP] = "CLK", + [UMC_HWIP] = "UMC", + [RSMU_HWIP] = "RSMU", + [XGMI_HWIP] = "XGMI", + [DCI_HWIP] = "DCI", + [PCIE_HWIP] = "PCIE", +}; + + int amdgpu_reset_init(struct amdgpu_device *adev) { int ret = 0; @@ -196,6 +233,31 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, coredump->reset_task_info.process_name, coredump->reset_task_info.pid); + /* GPU IP's information of the SOC */ + if (coredump->adev) { + + drm_printf(&p, "\nIP Information\n"); + drm_printf(&p, "SOC Family: %d\n", coredump->adev->family); + drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id); + drm_printf(&p, "SOC External Revision id: %d\n", + 
coredump->adev->external_rev_id); + + for (int i = 1; i < MAX_HWIP; i++) { + for (int j = 0; j < HWIP_MAX_INSTANCE; j++) { + int ver = coredump->adev->ip_versions[i][j]; + + if (ver) + drm_printf(&p, "HWIP: %s[%d][%d]: v%d.%d.%d.%d.%d\n", + hw_ip_names[i], i, j, + IP_VERSION_MAJ(ver), + IP_VERSION_MIN(ver), + IP_VERSION_REV(ver), + IP_VERSION_VARIANT(ver), + IP_VERSION_SUBREV(ver)); + } + } + } + if (coredump->ring) { drm_printf(&p, "\nRing timed out details\n"); drm_printf(&p, "IP Type: %d Ring Name: %s\n", -- 2.34.1
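The `v%d.%d.%d.%d.%d` output above comes from unpacking one 32-bit IP version value. Below is a self-contained sketch with illustrative packing macros; the real `IP_VERSION_*` macros live in the amdgpu headers, and the field widths there are authoritative:

```c
#include <stdio.h>
#include <string.h>

/* Illustrative packing: 8 bits each for major/minor/rev and 4 bits each
 * for variant/subrev. Check the amdgpu headers for the driver's actual
 * layout before relying on these widths. */
#define IP_VERSION_FULL(maj, min, rev, var, sub) \
	(((maj) << 24) | ((min) << 16) | ((rev) << 8) | ((var) << 4) | (sub))
#define IP_VERSION_MAJ(ver)     ((ver) >> 24)
#define IP_VERSION_MIN(ver)     (((ver) >> 16) & 0xFF)
#define IP_VERSION_REV(ver)     (((ver) >> 8) & 0xFF)
#define IP_VERSION_VARIANT(ver) (((ver) >> 4) & 0xF)
#define IP_VERSION_SUBREV(ver)  ((ver) & 0xF)

/* Produce one "HWIP: ..." line in the same format the patch prints. */
static int format_hwip(char *buf, size_t len, const char *name,
		       int i, int j, unsigned int ver)
{
	return snprintf(buf, len, "HWIP: %s[%d][%d]: v%d.%d.%d.%d.%d\n",
			name, i, j,
			IP_VERSION_MAJ(ver), IP_VERSION_MIN(ver),
			IP_VERSION_REV(ver), IP_VERSION_VARIANT(ver),
			IP_VERSION_SUBREV(ver));
}
```

Packing all five fields into one integer is what lets the dump loop treat `adev->ip_versions[i][j]` as a plain array of integers, skipping empty slots with a simple zero check.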
Re: [PATCH] drm/amdgpu: refactor code to reuse system information
Validated the code by using the function in same way as ioctl would use in devcoredump and getting the valid values. Also this would be the container of the information that we need to share between ioctl, debugfs and devcoredump and keep updating this based on information needed. On 3/19/2024 6:02 PM, Sunil Khatri wrote: Refactor the code so debugfs and devcoredump can reuse the common information and avoid unnecessary copy of it. created a new file which would be the right place to hold functions which will be used between sysfs, debugfs and devcoredump. Cc: Christian König Cc: Alex Deucher Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/Makefile | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c | 151 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 118 +-- 4 files changed, 157 insertions(+), 115 deletions(-) create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile b/drivers/gpu/drm/amd/amdgpu/Makefile index 4536c8ad0e11..05d34f4b18f5 100644 --- a/drivers/gpu/drm/amd/amdgpu/Makefile +++ b/drivers/gpu/drm/amd/amdgpu/Makefile @@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o amdgpu_kms.o \ amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \ amdgpu_fw_attestation.o amdgpu_securedisplay.o \ amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \ - amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o + amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o amdgpu_devinfo.o amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 9c62552bec34..0267870aa9b1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -1609,4 +1609,5 @@ extern const struct attribute_group amdgpu_vram_mgr_attr_group; extern const struct attribute_group amdgpu_gtt_mgr_attr_group; extern const struct attribute_group 
amdgpu_flash_attr_group; +int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info); #endif diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c new file mode 100644 index ..d2c15a1dcb0d --- /dev/null +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c @@ -0,0 +1,151 @@ +// SPDX-License-Identifier: MIT +/* + * Copyright 2024 Advanced Micro Devices, Inc. + * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR + * OTHER DEALINGS IN THE SOFTWARE. 
+ * + */ + +#include "amdgpu.h" +#include "amd_pcie.h" + +#include + +int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info) +{ + int ret; + uint64_t vm_size; + uint32_t pcie_gen_mask; + + if (dev_info == NULL) + return -EINVAL; + + dev_info->device_id = adev->pdev->device; + dev_info->chip_rev = adev->rev_id; + dev_info->external_rev = adev->external_rev_id; + dev_info->pci_rev = adev->pdev->revision; + dev_info->family = adev->family; + dev_info->num_shader_engines = adev->gfx.config.max_shader_engines; + dev_info->num_shader_arrays_per_engine = adev->gfx.config.max_sh_per_se; + /* return all clocks in KHz */ + dev_info->gpu_counter_freq = amdgpu_asic_get_xclk(adev) * 10; + if (adev->pm.dpm_enabled) { + dev_info->max_engine_clock = amdgpu_dpm_get_sclk(adev, false) * 10; + dev_info->max_memory_clock = amdgpu_dpm_get_mclk(adev, false) * 10; + dev_info->min_engine_clock = amdgpu_dpm_get_sclk(adev, true) * 10; + dev_info->min_memory_clock = amdgpu_dpm_get_mclk(adev, true) * 10; + } else { + dev_info->max_engine_clock = + dev_info->min_engine_clock = +
Re: [PATCH] drm/amdgpu: refactor code to reuse system information
On 3/19/2024 7:19 PM, Lazar, Lijo wrote: On 3/19/2024 6:02 PM, Sunil Khatri wrote: Refactor the code so debugfs and devcoredump can reuse the common information and avoid unnecessary copy of it. created a new file which would be the right place to hold functions which will be used between sysfs, debugfs and devcoredump. Cc: Christian König Cc: Alex Deucher Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/Makefile | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c | 151 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 118 +-- 4 files changed, 157 insertions(+), 115 deletions(-) create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile b/drivers/gpu/drm/amd/amdgpu/Makefile index 4536c8ad0e11..05d34f4b18f5 100644 --- a/drivers/gpu/drm/amd/amdgpu/Makefile +++ b/drivers/gpu/drm/amd/amdgpu/Makefile @@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o amdgpu_kms.o \ amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \ amdgpu_fw_attestation.o amdgpu_securedisplay.o \ amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \ - amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o + amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o amdgpu_devinfo.o amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 9c62552bec34..0267870aa9b1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -1609,4 +1609,5 @@ extern const struct attribute_group amdgpu_vram_mgr_attr_group; extern const struct attribute_group amdgpu_gtt_mgr_attr_group; extern const struct attribute_group amdgpu_flash_attr_group; +int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info); #endif diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c new file mode 100644 index 
..d2c15a1dcb0d --- /dev/null +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c @@ -0,0 +1,151 @@ +// SPDX-License-Identifier: MIT +/* + * Copyright 2024 Advanced Micro Devices, Inc. + * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR + * OTHER DEALINGS IN THE SOFTWARE. 
+ * + */ + +#include "amdgpu.h" +#include "amd_pcie.h" + +#include + +int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info) +{ + int ret; + uint64_t vm_size; + uint32_t pcie_gen_mask; + + if (dev_info == NULL) + return -EINVAL; + + dev_info->device_id = adev->pdev->device; + dev_info->chip_rev = adev->rev_id; + dev_info->external_rev = adev->external_rev_id; + dev_info->pci_rev = adev->pdev->revision; + dev_info->family = adev->family; + dev_info->num_shader_engines = adev->gfx.config.max_shader_engines; + dev_info->num_shader_arrays_per_engine = adev->gfx.config.max_sh_per_se; + /* return all clocks in KHz */ + dev_info->gpu_counter_freq = amdgpu_asic_get_xclk(adev) * 10; + if (adev->pm.dpm_enabled) { + dev_info->max_engine_clock = amdgpu_dpm_get_sclk(adev, false) * 10; + dev_info->max_memory_clock = amdgpu_dpm_get_mclk(adev, false) * 10; + dev_info->min_engine_clock = amdgpu_dpm_get_sclk(adev, true) * 10; + dev_info->min_memory_clock = amdgpu_dpm_get_mclk(adev, true) * 10; + } else { + dev_info->max_engine_clock = + dev_info->min_engine_clock = + adev->clock.default_sclk * 10; + dev_info->max_memory_clock = + dev_info->min_memory_clock = + adev->clock.default_mclk * 10; + } + dev_info->enabled
Re: [PATCH] drm/amdgpu: refactor code to reuse system information
On 3/19/2024 7:43 PM, Lazar, Lijo wrote: On 3/19/2024 7:27 PM, Khatri, Sunil wrote: On 3/19/2024 7:19 PM, Lazar, Lijo wrote: On 3/19/2024 6:02 PM, Sunil Khatri wrote: Refactor the code so debugfs and devcoredump can reuse the common information and avoid unnecessary copy of it. created a new file which would be the right place to hold functions which will be used between sysfs, debugfs and devcoredump. Cc: Christian König Cc: Alex Deucher Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/Makefile | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c | 151 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 118 +-- 4 files changed, 157 insertions(+), 115 deletions(-) create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile b/drivers/gpu/drm/amd/amdgpu/Makefile index 4536c8ad0e11..05d34f4b18f5 100644 --- a/drivers/gpu/drm/amd/amdgpu/Makefile +++ b/drivers/gpu/drm/amd/amdgpu/Makefile @@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o amdgpu_kms.o \ amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \ amdgpu_fw_attestation.o amdgpu_securedisplay.o \ amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \ - amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o + amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o amdgpu_devinfo.o amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 9c62552bec34..0267870aa9b1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -1609,4 +1609,5 @@ extern const struct attribute_group amdgpu_vram_mgr_attr_group; extern const struct attribute_group amdgpu_gtt_mgr_attr_group; extern const struct attribute_group amdgpu_flash_attr_group; +int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info); #endif diff --git 
a/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c new file mode 100644 index ..d2c15a1dcb0d --- /dev/null +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c @@ -0,0 +1,151 @@ +// SPDX-License-Identifier: MIT +/* + * Copyright 2024 Advanced Micro Devices, Inc. + * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR + * OTHER DEALINGS IN THE SOFTWARE. 
+ * + */ + +#include "amdgpu.h" +#include "amd_pcie.h" + +#include + +int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info) +{ + int ret; + uint64_t vm_size; + uint32_t pcie_gen_mask; + + if (dev_info == NULL) + return -EINVAL; + + dev_info->device_id = adev->pdev->device; + dev_info->chip_rev = adev->rev_id; + dev_info->external_rev = adev->external_rev_id; + dev_info->pci_rev = adev->pdev->revision; + dev_info->family = adev->family; + dev_info->num_shader_engines = adev->gfx.config.max_shader_engines; + dev_info->num_shader_arrays_per_engine = adev->gfx.config.max_sh_per_se; + /* return all clocks in KHz */ + dev_info->gpu_counter_freq = amdgpu_asic_get_xclk(adev) * 10; + if (adev->pm.dpm_enabled) { + dev_info->max_engine_clock = amdgpu_dpm_get_sclk(adev, false) * 10; + dev_info->max_memory_clock = amdgpu_dpm_get_mclk(adev, false) * 10; + dev_info->min_engine_clock = amdgpu_dpm_get_sclk(adev, true) * 10; + dev_info->min_memory_clock = amdgpu_dpm_get_mclk(adev, true) * 10; + } else { + dev_info->max_engine_clock = + dev_info->min_engine_clock = + adev->clock.default_sclk * 10; + dev_info->max_memory_clock = + dev_info->min_memory_clock = + adev->clock.default_mc
Re: [PATCH v2] drm/amdgpu: refactor code to reuse system information
On 3/19/2024 8:07 PM, Christian König wrote: On 19.03.24 at 15:25, Sunil Khatri wrote: Refactor the code so debugfs and devcoredump can reuse the common information and avoid unnecessary copy of it. created a new file which would be the right place to hold functions which will be used between ioctl, debugfs and devcoredump. Ok, taking a closer look, that is certainly not a good idea. The devinfo structure was just created because somebody thought that mixing all that stuff into one structure would be a good idea. We have pretty much deprecated that approach and should *really* not change anything here any more. To support the ioctl we are keeping that information the same, without changing it. The intent behind adding a new file is that more information will be added to it. Next in line is firmware information, which is again a huge function with a lot of information, and to use that information in devcoredump, the ioctl and sysfs, the new file seemed the right idea after some discussions. FYI: this will not be just one function in the new file but more to come, so code can be reused without copying it. Regards, Christian.
Cc: Christian König Cc: Alex Deucher Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/Makefile | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c | 151 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 118 +-- 4 files changed, 157 insertions(+), 115 deletions(-) create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile b/drivers/gpu/drm/amd/amdgpu/Makefile index 4536c8ad0e11..05d34f4b18f5 100644 --- a/drivers/gpu/drm/amd/amdgpu/Makefile +++ b/drivers/gpu/drm/amd/amdgpu/Makefile @@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o amdgpu_kms.o \ amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \ amdgpu_fw_attestation.o amdgpu_securedisplay.o \ amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \ - amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o + amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o amdgpu_devinfo.o amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 9c62552bec34..0267870aa9b1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -1609,4 +1609,5 @@ extern const struct attribute_group amdgpu_vram_mgr_attr_group; extern const struct attribute_group amdgpu_gtt_mgr_attr_group; extern const struct attribute_group amdgpu_flash_attr_group; +int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info); #endif diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c new file mode 100644 index ..fdcbc1984031 --- /dev/null +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c @@ -0,0 +1,151 @@ +// SPDX-License-Identifier: MIT +/* + * Copyright 2024 Advanced Micro Devices, Inc. 
+ * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR + * OTHER DEALINGS IN THE SOFTWARE. + * + */ + +#include "amdgpu.h" +#include "amd_pcie.h" + +#include + +int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info) +{ + int ret; + uint64_t vm_size; + uint32_t pcie_gen_mask; + + if (dev_info == NULL) + return -EINVAL; + + dev_info->device_id = adev->pdev->device; + dev_info->chip_rev = adev->rev_id; + dev_info->external_rev = adev->external_rev_id; + dev_info->pci_rev = adev->pdev->revision; + dev_info->family = adev->family; + dev_info->num_shader_engines = adev->gfx.config.max_shader_engines; + dev_info->num_shader_arrays_per_engine = adev->gfx.config.max_sh_per_se; + /* return all clocks in KHz */ + dev_info->gpu_counter
Re: [PATCH] drm/amdgpu: refactor code to reuse system information
Sent a new patch based on discussion with Alex. On 3/19/2024 8:34 PM, Christian König wrote: Am 19.03.24 um 15:59 schrieb Alex Deucher: On Tue, Mar 19, 2024 at 10:56 AM Christian König wrote: Am 19.03.24 um 15:26 schrieb Alex Deucher: On Tue, Mar 19, 2024 at 8:32 AM Sunil Khatri wrote: Refactor the code so debugfs and devcoredump can reuse the common information and avoid unnecessary copy of it. created a new file which would be the right place to hold functions which will be used between sysfs, debugfs and devcoredump. Cc: Christian König Cc: Alex Deucher Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/Makefile | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c | 151 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 118 +-- 4 files changed, 157 insertions(+), 115 deletions(-) create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile b/drivers/gpu/drm/amd/amdgpu/Makefile index 4536c8ad0e11..05d34f4b18f5 100644 --- a/drivers/gpu/drm/amd/amdgpu/Makefile +++ b/drivers/gpu/drm/amd/amdgpu/Makefile @@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o amdgpu_kms.o \ amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \ amdgpu_fw_attestation.o amdgpu_securedisplay.o \ amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \ - amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o + amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o amdgpu_devinfo.o amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 9c62552bec34..0267870aa9b1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -1609,4 +1609,5 @@ extern const struct attribute_group amdgpu_vram_mgr_attr_group; extern const struct attribute_group amdgpu_gtt_mgr_attr_group; extern const struct attribute_group amdgpu_flash_attr_group; +int amdgpu_device_info(struct 
amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info); #endif diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c new file mode 100644 index ..d2c15a1dcb0d --- /dev/null +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c @@ -0,0 +1,151 @@ +// SPDX-License-Identifier: MIT +/* + * Copyright 2024 Advanced Micro Devices, Inc. + * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR + * OTHER DEALINGS IN THE SOFTWARE. + * + */ + +#include "amdgpu.h" +#include "amd_pcie.h" + +#include + +int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info) We can probably keep this in amdgpu_kms.c unless that file is getting too big. I don't think it warrants a new file at this point. If you do keep it in amdgpu_kms.c, I'd recommend renaming it to something like amdgpu_kms_device_info() to keep the naming conventions. We should not be using this for anything new in the first place. 
A whole bunch of the stuff inside the devinfo structure has been deprecated because we found that putting everything into one structure was a bad idea. It's a convenient way to collect a lot of useful information that we want in the core dump. Plus it's not going anywhere because we need to keep compatibility in the IOCTL. Yeah, and exactly that is what I'm strictly against. The devinfo structure wasn't used for new stuff because we found that it is way too inflexible. That's why we have multiple separate IOCTLs for the memory and firmware information, for example. We should really *not* reuse that for the device core dumping. Rather, just use the same information from the different IPs and subsystems directly. E.g. add a function to the VM, GFX etc. for printing out devcoredump infos. I have pushed new v2 based o
RE: [PATCH v2] drm/amdgpu: refactor code to reuse system information
[AMD Official Use Only - General] Ignore this as I have sent v3. -Original Message- From: Sunil Khatri Sent: Tuesday, March 19, 2024 8:41 PM To: Deucher, Alexander ; Koenig, Christian ; Sharma, Shashank Cc: amd-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; linux-ker...@vger.kernel.org; Zhang, Hawking ; Kuehling, Felix ; Lazar, Lijo ; Khatri, Sunil Subject: [PATCH v2] drm/amdgpu: refactor code to reuse system information Refactor the code so debugfs and devcoredump can reuse the common information and avoid unnecessary copy of it. created a new file which would be the right place to hold functions which will be used between ioctl, debugfs and devcoredump. Cc: Christian König Cc: Alex Deucher Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/Makefile | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.c | 146 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.h | 33 + drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 117 +-- 4 files changed, 182 insertions(+), 116 deletions(-) create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.c create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.h diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile b/drivers/gpu/drm/amd/amdgpu/Makefile index 4536c8ad0e11..2c5c800c1ed6 100644 --- a/drivers/gpu/drm/amd/amdgpu/Makefile +++ b/drivers/gpu/drm/amd/amdgpu/Makefile @@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o amdgpu_kms.o \ amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \ amdgpu_fw_attestation.o amdgpu_securedisplay.o \ amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \ - amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o + amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o +amdgpu_coreinfo.o amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.c new file mode 100644 index ..597fc9d432ce --- /dev/null +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.c @@
-0,0 +1,146 @@ +// SPDX-License-Identifier: MIT +/* + * Copyright 2024 Advanced Micro Devices, Inc. + * + * Permission is hereby granted, free of charge, to any person +obtaining a + * copy of this software and associated documentation files (the +"Software"), + * to deal in the Software without restriction, including without +limitation + * the rights to use, copy, modify, merge, publish, distribute, +sublicense, + * and/or sell copies of the Software, and to permit persons to whom +the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be +included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT +SHALL + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, +DAMAGES OR + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR +OTHERWISE, + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE +OR + * OTHER DEALINGS IN THE SOFTWARE. 
+ * + */ + +#include "amdgpu_coreinfo.h" +#include "amd_pcie.h" + + +void amdgpu_coreinfo_devinfo(struct amdgpu_device *adev, struct +drm_amdgpu_info_device *dev_info) { + int ret; + uint64_t vm_size; + uint32_t pcie_gen_mask; + + dev_info->device_id = adev->pdev->device; + dev_info->chip_rev = adev->rev_id; + dev_info->external_rev = adev->external_rev_id; + dev_info->pci_rev = adev->pdev->revision; + dev_info->family = adev->family; + dev_info->num_shader_engines = adev->gfx.config.max_shader_engines; + dev_info->num_shader_arrays_per_engine = adev->gfx.config.max_sh_per_se; + /* return all clocks in KHz */ + dev_info->gpu_counter_freq = amdgpu_asic_get_xclk(adev) * 10; + if (adev->pm.dpm_enabled) { + dev_info->max_engine_clock = amdgpu_dpm_get_sclk(adev, false) * 10; + dev_info->max_memory_clock = amdgpu_dpm_get_mclk(adev, false) * 10; + dev_info->min_engine_clock = amdgpu_dpm_get_sclk(adev, true) * 10; + dev_info->min_memory_clock = amdgpu_dpm_get_mclk(adev, true) * 10; + } else { + dev_info->max_engine_clock = + dev_info->min_engine_clock = + adev->clock.default_sclk * 10; + dev_info->max_memory_clock = + dev_info->min_memory_clock = + adev->clock.default_mclk * 10; + } + dev_info->enab
Re: [PATCH] drm/amdgpu: add support of bios dump in devcoredump
On 3/26/2024 10:23 PM, Alex Deucher wrote: On Tue, Mar 26, 2024 at 10:38 AM Sunil Khatri wrote: Dump the BIOS binary in the devcoredump. Signed-off-by: Sunil Khatri --- .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c | 20 +++ 1 file changed, 20 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c index 44c5da8aa9ce..f33963d777eb 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c @@ -132,6 +132,26 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, drm_printf(&p, "Faulty page starting at address: 0x%016llx\n", fault_info->addr); drm_printf(&p, "Protection fault status register: 0x%x\n\n", fault_info->status); + /* Dump BIOS */ + if (coredump->adev->bios && coredump->adev->bios_size) { + int i = 0; + + drm_printf(&p, "BIOS Binary dump\n"); + drm_printf(&p, "Valid BIOS Size:%d bytes type:%s\n", + coredump->adev->bios_size, + coredump->adev->is_atom_fw ? + "Atom bios":"Non Atom Bios"); + + while (i < coredump->adev->bios_size) { + /* Printing 15 bytes in a line */ + if (i % 15 == 0) + drm_printf(&p, "\n"); + drm_printf(&p, "0x%x \t", coredump->adev->bios[i]); + i++; + } + drm_printf(&p, "\n"); + } I don't think it's too useful to dump this as text. I was hoping it could be a binary. I guess we can just get this from debugfs if we need it, if a binary is not possible. Yes, this dumps in text format only, and the binary is already available in debugfs. So I am discarding the patch. Alex + /* Add ring buffer information */ drm_printf(&p, "Ring buffer information\n"); for (int i = 0; i < coredump->adev->num_rings; i++) { -- 2.34.1
Re: [PATCH] drm/amdgpu: add IP's FW information to devcoredump
On 3/28/2024 8:38 AM, Alex Deucher wrote: On Tue, Mar 26, 2024 at 1:31 PM Sunil Khatri wrote: Add FW information of all the IPs in the devcoredump. Signed-off-by: Sunil Khatri Might want to include the vbios version info as well, e.g., atom_context->name atom_context->vbios_pn atom_context->vbios_ver_str atom_context->date Sure, I will add those parameters too. Regards, Sunil Either way, Reviewed-by: Alex Deucher --- .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c | 122 ++ 1 file changed, 122 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c index 44c5da8aa9ce..d598b6520ec9 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c @@ -69,6 +69,124 @@ const char *hw_ip_names[MAX_HWIP] = { [PCIE_HWIP] = "PCIE", }; +static void amdgpu_devcoredump_fw_info(struct amdgpu_device *adev, + struct drm_printer *p) +{ + uint32_t version; + uint32_t feature; + uint8_t smu_program, smu_major, smu_minor, smu_debug; + + drm_printf(p, "VCE feature version: %u, fw version: 0x%08x\n", + adev->vce.fb_version, adev->vce.fw_version); + drm_printf(p, "UVD feature version: %u, fw version: 0x%08x\n", 0, + adev->uvd.fw_version); + drm_printf(p, "GMC feature version: %u, fw version: 0x%08x\n", 0, + adev->gmc.fw_version); + drm_printf(p, "ME feature version: %u, fw version: 0x%08x\n", + adev->gfx.me_feature_version, adev->gfx.me_fw_version); + drm_printf(p, "PFP feature version: %u, fw version: 0x%08x\n", + adev->gfx.pfp_feature_version, adev->gfx.pfp_fw_version); + drm_printf(p, "CE feature version: %u, fw version: 0x%08x\n", + adev->gfx.ce_feature_version, adev->gfx.ce_fw_version); + drm_printf(p, "RLC feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlc_feature_version, adev->gfx.rlc_fw_version); + + drm_printf(p, "RLC SRLC feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlc_srlc_feature_version, + adev->gfx.rlc_srlc_fw_version); + drm_printf(p, 
"RLC SRLG feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlc_srlg_feature_version, + adev->gfx.rlc_srlg_fw_version); + drm_printf(p, "RLC SRLS feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlc_srls_feature_version, + adev->gfx.rlc_srls_fw_version); + drm_printf(p, "RLCP feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlcp_ucode_feature_version, + adev->gfx.rlcp_ucode_version); + drm_printf(p, "RLCV feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlcv_ucode_feature_version, + adev->gfx.rlcv_ucode_version); + drm_printf(p, "MEC feature version: %u, fw version: 0x%08x\n", + adev->gfx.mec_feature_version, adev->gfx.mec_fw_version); + + if (adev->gfx.mec2_fw) + drm_printf(p, "MEC2 feature version: %u, fw version: 0x%08x\n", + adev->gfx.mec2_feature_version, + adev->gfx.mec2_fw_version); + + drm_printf(p, "IMU feature version: %u, fw version: 0x%08x\n", 0, + adev->gfx.imu_fw_version); + drm_printf(p, "PSP SOS feature version: %u, fw version: 0x%08x\n", + adev->psp.sos.feature_version, adev->psp.sos.fw_version); + drm_printf(p, "PSP ASD feature version: %u, fw version: 0x%08x\n", + adev->psp.asd_context.bin_desc.feature_version, + adev->psp.asd_context.bin_desc.fw_version); + + drm_printf(p, "TA XGMI feature version: 0x%08x, fw version: 0x%08x\n", + adev->psp.xgmi_context.context.bin_desc.feature_version, + adev->psp.xgmi_context.context.bin_desc.fw_version); + drm_printf(p, "TA RAS feature version: 0x%08x, fw version: 0x%08x\n", + adev->psp.ras_context.context.bin_desc.feature_version, + adev->psp.ras_context.context.bin_desc.fw_version); + drm_printf(p, "TA HDCP feature version: 0x%08x, fw version: 0x%08x\n", + adev->psp.hdcp_context.context.bin_desc.feature_version, + adev->psp.hdcp_context.context.bin_desc.fw_version); + drm_printf(p, "TA DTM feature version: 0x%08x, fw version: 0x%08x\n", + adev->psp.dtm_context.context.bin_desc.feature_version, + adev->psp.dtm_context.context.bin_desc.fw_version); + drm_printf(p, "TA RAP feature 
version: 0x%08x, fw version: 0x%08x\n", + adev->psp.rap_context.context.bin_desc.feature_version, + adev->psp.rap_context.context.bin_desc.fw_