Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking
On 07.08.20 04:22, Li, Dennis wrote:

[AMD Public Use]

[SNIP]
> I think it is a limitation of init_rwsem.

And exactly that's wrong; this is intentional and perfectly correct.

[Dennis Li] I couldn't understand. Why is it perfectly correct? For example, we define two rw_sems: a and b. If we don't check the init_rwsem definition, we may think case #1 and case #2 have the same behavior, but in fact they are different.

Case 1:
init_rwsem(&a);
init_rwsem(&b);

Case 2:
void init_rwsem_ext(rw_sem *sem)
{
	init_rwsem(sem);
}
init_rwsem_ext(&a);
init_rwsem_ext(&b);

As far as I know it is perfectly possible that the locks in the hive are not always grabbed in the same order. And that's why lockdep is complaining about this.

[Dennis Li] No. I take a hive with two devices (a and b) to explain why lockdep complains.

// First the driver locks the reset_sem of all devices
down_write(&a->reset_sem);
do_suspend(a);
down_write(&b->reset_sem); // Because b shares the same lock_class_key with a, lockdep takes a and b as the same rw_sem and complains here.
do_suspend(b);

// do recovery
do_hive_recovery()

// unlock the reset_sem of all devices
do_resume(a);
up_write(&a->reset_sem);
do_resume(b);
up_write(&b->reset_sem);

Ah! Now I understand what you are working around. So the problem is the static lock_class_key in the macro?

[Dennis Li] Yes. The author of init_rwsem might not have considered a use case like ours.

Well, this is also really not the intended use case. When you lock the same rwsem class recursively you can easily run into deadlocks if you don't keep the order the same all the time.

And abstracting init functions behind your own layer is a no-go in the Linux kernel as well.

What we should do instead is to make sure we have only a single lock for the complete hive.

[Dennis Li] If we use a single lock, users must wait for all devices to resume successfully, but in fact their tasks are only running on device a. It is not friendly to users.

Well, that is intentional, I would say.
We can only restart submissions after all devices are resumed successfully, because otherwise it can happen that a job on device A depends on memory on device B which isn't accessible.

[Dennis Li] Yes, you are right. The driver has to make sure that the shared resources (such as the shared memory) are ready before unlocking the lock of each adev one by one. But each device still has private hardware resources such as video and display.

Yeah, but those are negligible and we have a team working on display support for XGMI. So this will sooner or later become a problem as well.

I still think that a single rwsem for the whole hive is the best option here.

[Dennis Li] For your above concern, we can open a new thread to discuss it. If we decide to use a single lock after discussing it, we can create another patch for it.

That's a really good argument, but I still hesitate to merge this patch. How severe is the lockdep splat?

At bare minimum we need a "/* TODO: " comment why we do this and how to remove the workaround again.

Regards,
Christian.

Best Regards
Dennis Li

Regards,
Christian.

Regards,
Christian.

On 06.08.20 11:15, Li, Dennis wrote:

[AMD Official Use Only - Internal Distribution Only]

Hi, Christian,
      For this case, it is safe to use a separate lock key. Please see the definition of init_rwsem below. Every init_rwsem call will use a new static key, but devices of the hive share the same code to do initialization, so their lock_class_keys are the same. I think it is a limitation of init_rwsem. In our case, it should be correct that the reset_sem of each adev has a different lock_class_key. BTW, this change doesn't affect dead-lock detection, it just corrects it.
#define init_rwsem(sem)					\
do {							\
	static struct lock_class_key __key;		\
							\
	__init_rwsem((sem), #sem, &__key);		\
} while (0)

Best Regards
Dennis Li

-----Original Message-----
From: Koenig, Christian
Sent: Thursday, August 6, 2020 4:13 PM
To: Li, Dennis; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking
Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

Preventing locking problems during implementation is obviously a good approach, but lockdep has proven to be massively useful for finding and fixing problems.

Disabling a lockdep splat by annotating locks with separate classes is usually a no-go and only allowed if there is no other potential approach.

In this case here we should really clean things up instead.

Christian.

On 06.08.20 09:44, Li, Dennis wrote:

[AMD Official Use Only - Internal Distribution Only]

Hi, Christian,
      I agree with your concern. However we shouldn't rely on the system to detect dead-locks, and should consider this when doing the code implementation. In fact, dead-lock detection isn't enabled by default. For your proposal to move reset_sem into the hive structure, we can open a new topic to discuss it.
RE: [PATCH 2/2] drm/amdgpu: reconfigure spm golden settings on Navi1x after GFXOFF exit(v3)
[AMD Public Use]

Please check one comment inline.

Regards,
Guchun

-----Original Message-----
From: Tianci Yin
Sent: Friday, August 7, 2020 10:37 AM
To: amd-gfx@lists.freedesktop.org
Cc: Tuikov, Luben; Deucher, Alexander; Zhang, Hawking; Xu, Feifei; Hesik, Christopher; Swamy, Manjunatha; Quan, Evan; Chen, Guchun; Feng, Kenneth; Yin, Tianci (Rico)
Subject: [PATCH 2/2] drm/amdgpu: reconfigure spm golden settings on Navi1x after GFXOFF exit(v3)

From: "Tianci.Yin"

On Navi1x, the SPM golden settings are lost after GFXOFF enter/exit, so reconfigure the golden settings after GFXOFF exit.

Change-Id: I9358ba9c65f241c36f8a35916170b19535148ee9
Signed-off-by: Tianci.Yin
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index 8eff0173360d..9e133fd0372d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -571,8 +571,14 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
 	if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
 		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
 	} else if (!enable && adev->gfx.gfx_off_state) {
-		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false))
+		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX,
+						       false)) {
 			adev->gfx.gfx_off_state = false;
+
+			if (adev->gfx.funcs->init_spm_golden) {
+				dev_dbg(adev->dev, "GFXOFF is disabled, re-init SPM golden settings\n");
+				amdgpu_gfx_init_spm_golden(adev);

[Guchun] Since we already have the function pointer check here, why do we use another function (the macro) for the execution? Is init_spm_golden a guard to indicate SPM availability on Navi1x ASICs?

+			}
+		}
 	}
 	mutex_unlock(&adev->gfx.gfx_off_mutex);
--
2.17.1
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH 2/2] drm/amdgpu: reconfigure spm golden settings on Navi1x after GFXOFF exit(v3)
[AMD Public Use]

Series is Reviewed-by: Hawking Zhang

Regards,
Hawking

-----Original Message-----
From: Tianci Yin
Sent: Friday, August 7, 2020 10:37
To: amd-gfx@lists.freedesktop.org
Cc: Tuikov, Luben; Deucher, Alexander; Zhang, Hawking; Xu, Feifei; Hesik, Christopher; Swamy, Manjunatha; Quan, Evan; Chen, Guchun; Feng, Kenneth; Yin, Tianci (Rico)
Subject: [PATCH 2/2] drm/amdgpu: reconfigure spm golden settings on Navi1x after GFXOFF exit(v3)

From: "Tianci.Yin"

On Navi1x, the SPM golden settings are lost after GFXOFF enter/exit, so reconfigure the golden settings after GFXOFF exit.

Change-Id: I9358ba9c65f241c36f8a35916170b19535148ee9
Signed-off-by: Tianci.Yin
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index 8eff0173360d..9e133fd0372d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -571,8 +571,14 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
 	if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
 		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
 	} else if (!enable && adev->gfx.gfx_off_state) {
-		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false))
+		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX,
+						       false)) {
 			adev->gfx.gfx_off_state = false;
+
+			if (adev->gfx.funcs->init_spm_golden) {
+				dev_dbg(adev->dev, "GFXOFF is disabled, re-init SPM golden settings\n");
+				amdgpu_gfx_init_spm_golden(adev);
+			}
+		}
 	}
 	mutex_unlock(&adev->gfx.gfx_off_mutex);
--
2.17.1
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH 2/2] drm/amdgpu: reconfigure spm golden settings on Navi1x after GFXOFF exit(v3)
From: "Tianci.Yin"

On Navi1x, the SPM golden settings are lost after GFXOFF enter/exit, so reconfigure the golden settings after GFXOFF exit.

Change-Id: I9358ba9c65f241c36f8a35916170b19535148ee9
Signed-off-by: Tianci.Yin
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index 8eff0173360d..9e133fd0372d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -571,8 +571,14 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
 	if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
 		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
 	} else if (!enable && adev->gfx.gfx_off_state) {
-		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false))
+		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
 			adev->gfx.gfx_off_state = false;
+
+			if (adev->gfx.funcs->init_spm_golden) {
+				dev_dbg(adev->dev, "GFXOFF is disabled, re-init SPM golden settings\n");
+				amdgpu_gfx_init_spm_golden(adev);
+			}
+		}
 	}
 	mutex_unlock(&adev->gfx.gfx_off_mutex);
--
2.17.1
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH 1/2] drm/amdgpu: add interface amdgpu_gfx_init_spm_golden for Navi1x
From: "Tianci.Yin"

On Navi1x, the SPM golden settings are lost after GFXOFF enter/exit, so reconfiguration is needed. Turn the configuration code into an interface for future use.

Change-Id: I172f3dc7f59da69b0364052dcad75a9c9aab019e
Reviewed-by: Luben Tuikov
Reviewed-by: Feifei Xu
Signed-off-by: Tianci.Yin
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h |  2 ++
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c  | 34 ++---
 2 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
index 1e7a2b0997c5..a611e78dd4ba 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -216,6 +216,7 @@ struct amdgpu_gfx_funcs {
 	int (*ras_error_inject)(struct amdgpu_device *adev, void *inject_if);
 	int (*query_ras_error_count) (struct amdgpu_device *adev, void *ras_error_status);
 	void (*reset_ras_error_count) (struct amdgpu_device *adev);
+	void (*init_spm_golden)(struct amdgpu_device *adev);
 };

 struct sq_work {
@@ -324,6 +325,7 @@ struct amdgpu_gfx {
 #define amdgpu_gfx_get_gpu_clock_counter(adev) (adev)->gfx.funcs->get_gpu_clock_counter((adev))
 #define amdgpu_gfx_select_se_sh(adev, se, sh, instance) (adev)->gfx.funcs->select_se_sh((adev), (se), (sh), (instance))
 #define amdgpu_gfx_select_me_pipe_q(adev, me, pipe, q, vmid) (adev)->gfx.funcs->select_me_pipe_q((adev), (me), (pipe), (q), (vmid))
+#define amdgpu_gfx_init_spm_golden(adev) (adev)->gfx.funcs->init_spm_golden((adev))

 /**
  * amdgpu_gfx_create_bitmask - create a bitmask
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index db9f1e89a0f8..da21ad04ac0f 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -3307,6 +3307,29 @@ static void gfx_v10_0_set_kiq_pm4_funcs(struct amdgpu_device *adev)
 	adev->gfx.kiq.pmf = &gfx_v10_0_kiq_pm4_funcs;
 }

+static void gfx_v10_0_init_spm_golden_registers(struct amdgpu_device *adev)
+{
+	switch (adev->asic_type) {
+	case CHIP_NAVI10:
+		soc15_program_register_sequence(adev,
+						golden_settings_gc_rlc_spm_10_0_nv10,
+						(const u32)ARRAY_SIZE(golden_settings_gc_rlc_spm_10_0_nv10));
+		break;
+	case CHIP_NAVI14:
+		soc15_program_register_sequence(adev,
+						golden_settings_gc_rlc_spm_10_1_nv14,
+						(const u32)ARRAY_SIZE(golden_settings_gc_rlc_spm_10_1_nv14));
+		break;
+	case CHIP_NAVI12:
+		soc15_program_register_sequence(adev,
+						golden_settings_gc_rlc_spm_10_1_2_nv12,
+						(const u32)ARRAY_SIZE(golden_settings_gc_rlc_spm_10_1_2_nv12));
+		break;
+	default:
+		break;
+	}
+}
+
 static void gfx_v10_0_init_golden_registers(struct amdgpu_device *adev)
 {
 	switch (adev->asic_type) {
@@ -3317,9 +3340,6 @@ static void gfx_v10_0_init_golden_registers(struct amdgpu_device *adev)
 		soc15_program_register_sequence(adev,
 						golden_settings_gc_10_0_nv10,
 						(const u32)ARRAY_SIZE(golden_settings_gc_10_0_nv10));
-		soc15_program_register_sequence(adev,
-						golden_settings_gc_rlc_spm_10_0_nv10,
-						(const u32)ARRAY_SIZE(golden_settings_gc_rlc_spm_10_0_nv10));
 		break;
 	case CHIP_NAVI14:
 		soc15_program_register_sequence(adev,
@@ -3328,9 +3348,6 @@ static void gfx_v10_0_init_golden_registers(struct amdgpu_device *adev)
 		soc15_program_register_sequence(adev,
 						golden_settings_gc_10_1_nv14,
 						(const u32)ARRAY_SIZE(golden_settings_gc_10_1_nv14));
-		soc15_program_register_sequence(adev,
-						golden_settings_gc_rlc_spm_10_1_nv14,
-						(const u32)ARRAY_SIZE(golden_settings_gc_rlc_spm_10_1_nv14));
 		break;
 	case CHIP_NAVI12:
 		soc15_program_register_sequence(adev,
@@ -3339,9 +3356,6 @@ static void gfx_v10_0_init_golden_registers(struct amdgpu_device *adev)
 		soc15_program_register_sequence(adev,
 						golden_settings_gc_10_1_2_nv12,
 						(const u32)ARRAY_SIZE(golden_settings_gc_10_1_2_nv12));
-		soc15_program_register_sequence(adev,
-						golden_
RE: [PATCH] drm/amdgpu: annotate a false positive recursive locking
[AMD Public Use]

> [SNIP]
>>> I think it is a limitation of init_rwsem.
>> And exactly that's wrong, this is intentional and perfectly correct.
>>
>> [Dennis Li] I couldn't understand. Why is it perfectly correct? For
>> example, we define two rw_sems: a and b. If we don't check the init_rwsem
>> definition, we may think case #1 and case #2 have the same behavior, but in
>> fact they are different.
>>
>> Case 1:
>> init_rwsem(&a);
>> init_rwsem(&b);
>>
>> Case 2:
>> void init_rwsem_ext(rw_sem *sem)
>> {
>>    init_rwsem(sem);
>> }
>> init_rwsem_ext(&a);
>> init_rwsem_ext(&b);
>>
>> As far as I know it is perfectly possible that the locks in the hive are not
>> always grabbed in the same order. And that's why lockdep is complaining
>> about this.
>> [Dennis Li] No. I take a hive with two devices (a and b) to explain why
>> lockdep complains.
>>
>> // First the driver locks the reset_sem of all devices
>> down_write(&a->reset_sem);
>> do_suspend(a);
>> down_write(&b->reset_sem); // Because b shares the same lock_class_key
>> with a, lockdep takes a and b as the same rw_sem and complains here.
>> do_suspend(b);
>>
>> // do recovery
>> do_hive_recovery()
>>
>> // unlock the reset_sem of all devices
>> do_resume(a);
>> up_write(&a->reset_sem);
>> do_resume(b);
>> up_write(&b->reset_sem);
> Ah! Now I understand what you are working around. So the problem is the
> static lock_class_key in the macro?
> [Dennis Li] Yes. The author of init_rwsem might not have considered a use
> case like ours.
>
>> What we should do instead is to make sure we have only a single lock for the
>> complete hive instead.
>> [Dennis Li] If we use a single lock, users must wait for all devices
>> to resume successfully, but in fact their tasks are only running on device a.
>> It is not friendly to users.
> Well that is intentional I would say.
> We can only restart submissions after all devices are resumed successfully,
> cause otherwise it can happen that a job on device A depends on memory on
> device B which isn't accessible.
> [Dennis Li] Yes, you are right. The driver has to make sure that the shared
> resources (such as the shared memory) are ready before unlocking the lock of
> each adev one by one. But each device still has private hardware resources
> such as video and display.

Yeah, but those are negligible and we have a team working on display support for XGMI. So this will sooner or later become a problem as well.

I still think that a single rwsem for the whole hive is still the best option here.

[Dennis Li] For your above concern, we can open a new thread to discuss it. If we decide to use a single lock after discussing it, we can create another patch for it.

Best Regards
Dennis Li

> Regards,
> Christian.
>
>> Regards,
>> Christian.
>>
>> On 06.08.20 11:15, Li, Dennis wrote:
>>> [AMD Official Use Only - Internal Distribution Only]
>>>
>>> Hi, Christian,
>>> For this case, it is safe to use a separate lock key. Please see the
>>> definition of init_rwsem below. Every init_rwsem call will use a new
>>> static key, but devices of the hive share the same code to do
>>> initialization, so their lock_class_keys are the same. I think it is a
>>> limitation of init_rwsem. In our case, it should be correct that the
>>> reset_sem of each adev has a different lock_class_key. BTW, this change
>>> doesn't affect dead-lock detection, it just corrects it.
>>>
>>> #define init_rwsem(sem)				\
>>> do {						\
>>> 	static struct lock_class_key __key;	\
>>> 						\
>>> 	__init_rwsem((sem), #sem, &__key);	\
>>> } while (0)
>>>
>>> Best Regards
>>> Dennis Li
>>> -----Original Message-----
>>> From: Koenig, Christian
>>> Sent: Thursday, August 6, 2020 4:13 PM
>>> To: Li, Dennis; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking
>>> Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking
>>>
>>> Preventing locking problems during implementation is obviously a good
>>> approach, but lockdep has proven to be massively useful for finding and
>>> fixing problems.
>>>
>>> Disabling a lockdep splat by annotating locks with separate classes is
>>> usually a no-go and only allowed if there is no other potential approach.
>>>
>>> In this case here we should really clean things up instead.
>>>
>>> Christian.
>>>
>>> On 06.08.20 09:44, Li, Dennis wrote:
>>>> [AMD Official Use Only - Internal Distribution Only]
>>>>
>>>> Hi, Christian,
>>>> I agree with your concern. However we shouldn't rely on the system to
>>>> detect dead-locks, and should consider this when doing the code
>>>> implementation. In fact, dead-lock detection isn't enabled by default.
>>>> For your proposal to move reset_sem into the hive structure, we can
>>>> open a new topic to discuss it.
Re: [PATCH 3/3] drm/radeon: drop superflous AGP handling
On Thu, 6 Aug 2020 at 23:51, Christian König wrote:
>
> The object flags created in radeon_ttm_placement_from_domain take care that
> we use the correct caching for AGP, this is just superflous.
>
> Signed-off-by: Christian König

Reviewed-by: Dave Airlie

> ---
>  drivers/gpu/drm/radeon/radeon_ttm.c | 21 ++---
>  1 file changed, 2 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
> index 31f4cf211b6a..290c8b479853 100644
> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> @@ -76,26 +76,9 @@ static int radeon_ttm_init_vram(struct radeon_device *rdev)
>
>  static int radeon_ttm_init_gtt(struct radeon_device *rdev)
>  {
> -	uint32_t available_caching, default_caching;
> -
> -	available_caching = TTM_PL_MASK_CACHING;
> -	default_caching = TTM_PL_FLAG_CACHED;
> -
> -#if IS_ENABLED(CONFIG_AGP)
> -	if (rdev->flags & RADEON_IS_AGP) {
> -		if (!rdev->ddev->agp) {
> -			DRM_ERROR("AGP is not enabled\n");
> -			return -EINVAL;
> -		}
> -		available_caching = TTM_PL_FLAG_UNCACHED |
> -			TTM_PL_FLAG_WC;
> -		default_caching = TTM_PL_FLAG_WC;
> -	}
> -#endif
> -
>  	return ttm_range_man_init(&rdev->mman.bdev, TTM_PL_TT,
> -				  available_caching,
> -				  default_caching, true,
> +				  TTM_PL_MASK_CACHING,
> +				  TTM_PL_FLAG_CACHED, true,
>  				  rdev->mc.gtt_size >> PAGE_SHIFT);
>  }
>
> --
> 2.17.1
>
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH 2/3] drm/ttm: give resource functions their own [ch] files
On Thu, 6 Aug 2020 at 23:51, Christian König wrote:
>
> This is a separate object we work within TTM.

Reviewed-by: Dave Airlie

>
> Signed-off-by: Christian König
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |   2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c    |   8 +-
>  drivers/gpu/drm/nouveau/nouveau_bo.c       |   4 +-
>  drivers/gpu/drm/radeon/radeon_ttm.c        |   4 +-
>  drivers/gpu/drm/ttm/Makefile               |   3 +-
>  drivers/gpu/drm/ttm/ttm_bo.c               | 124 +-
>  drivers/gpu/drm/ttm/ttm_bo_util.c          |   4 +-
>  drivers/gpu/drm/ttm/ttm_resource.c         | 151
>  include/drm/ttm/ttm_bo_api.h               |  70 +-
>  include/drm/ttm/ttm_bo_driver.h            | 189 ---
>  include/drm/ttm/ttm_resource.h             | 263 +
>  11 files changed, 446 insertions(+), 376 deletions(-)
>  create mode 100644 drivers/gpu/drm/ttm/ttm_resource.c
>  create mode 100644 include/drm/ttm/ttm_resource.h
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index 43f4966331dd..b36d94f57d42 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -381,7 +381,7 @@ int amdgpu_bo_create_kernel_at(struct amdgpu_device *adev,
>  	if (cpu_addr)
>  		amdgpu_bo_kunmap(*bo_ptr);
>
> -	ttm_bo_mem_put(&(*bo_ptr)->tbo, &(*bo_ptr)->tbo.mem);
> +	ttm_resource_free(&(*bo_ptr)->tbo, &(*bo_ptr)->tbo.mem);
>
>  	for (i = 0; i < (*bo_ptr)->placement.num_placement; ++i) {
>  		(*bo_ptr)->placements[i].fpfn = offset >> PAGE_SHIFT;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 67d2eef2f9eb..462402fcd1a4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -578,7 +578,7 @@ static int amdgpu_move_vram_ram(struct ttm_buffer_object *bo, bool evict,
>  	/* move BO (in tmp_mem) to new_mem */
>  	r = ttm_bo_move_ttm(bo, ctx, new_mem);
>  out_cleanup:
> -	ttm_bo_mem_put(bo, &tmp_mem);
> +	ttm_resource_free(bo, &tmp_mem);
>  	return r;
>  }
>
> @@ -625,7 +625,7 @@ static int amdgpu_move_ram_vram(struct ttm_buffer_object *bo, bool evict,
>  		goto out_cleanup;
>  	}
>  out_cleanup:
> -	ttm_bo_mem_put(bo, &tmp_mem);
> +	ttm_resource_free(bo, &tmp_mem);
>  	return r;
>  }
>
> @@ -1203,11 +1203,11 @@ int amdgpu_ttm_alloc_gart(struct ttm_buffer_object *bo)
>  		gtt->offset = (u64)tmp.start << PAGE_SHIFT;
>  		r = amdgpu_ttm_gart_bind(adev, bo, flags);
>  		if (unlikely(r)) {
> -			ttm_bo_mem_put(bo, &tmp);
> +			ttm_resource_free(bo, &tmp);
>  			return r;
>  		}
>
> -		ttm_bo_mem_put(bo, &bo->mem);
> +		ttm_resource_free(bo, &bo->mem);
>  		bo->mem = tmp;
>  	}
>
> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> index 604a74323696..29d7d7e100f7 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> @@ -1191,7 +1191,7 @@ nouveau_bo_move_flipd(struct ttm_buffer_object *bo, bool evict, bool intr,
>
>  	ret = ttm_bo_move_ttm(bo, &ctx, new_reg);
> out:
> -	ttm_bo_mem_put(bo, &tmp_reg);
> +	ttm_resource_free(bo, &tmp_reg);
>  	return ret;
>  }
>
> @@ -1227,7 +1227,7 @@ nouveau_bo_move_flips(struct ttm_buffer_object *bo, bool evict, bool intr,
>  		goto out;
>
> out:
> -	ttm_bo_mem_put(bo, &tmp_reg);
> +	ttm_resource_free(bo, &tmp_reg);
>  	return ret;
>  }
>
> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
> index 3355b69b13d1..31f4cf211b6a 100644
> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> @@ -271,7 +271,7 @@ static int radeon_move_vram_ram(struct ttm_buffer_object *bo,
>  	}
>  	r = ttm_bo_move_ttm(bo, &ctx, new_mem);
> out_cleanup:
> -	ttm_bo_mem_put(bo, &tmp_mem);
> +	ttm_resource_free(bo, &tmp_mem);
>  	return r;
>  }
>
> @@ -309,7 +309,7 @@ static int radeon_move_ram_vram(struct ttm_buffer_object *bo,
>  		goto out_cleanup;
>  	}
> out_cleanup:
> -	ttm_bo_mem_put(bo, &tmp_mem);
> +	ttm_resource_free(bo, &tmp_mem);
>  	return r;
>  }
>
> diff --git a/drivers/gpu/drm/ttm/Makefile b/drivers/gpu/drm/ttm/Makefile
> index e54326e6cea4..90c0da88cc98 100644
> --- a/drivers/gpu/drm/ttm/Makefile
> +++ b/drivers/gpu/drm/ttm/Makefile
> @@ -4,7 +4,8 @@
>
>  ttm-y := ttm_memory.o ttm_tt.o ttm_bo.o \
>  	ttm_bo_util.o ttm_bo_vm.o ttm_module.o \
> -	ttm_execbuf_util.o ttm_page_alloc.o ttm_range_manager.o
> +	ttm_execbuf_util.o ttm_page_alloc.o ttm
Re: [PATCH 1/3] drm/ttm: rename ttm_resource_manager_func callbacks
On Thu, 6 Aug 2020 at 23:51, Christian König wrote:
>
> The names get/put are associated with reference counting
> in the Linux kernel, use alloc/free instead.
>

Sounds good,

Reviewed-by: Dave Airlie

> Signed-off-by: Christian König
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c   |  4 ++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c  |  6 +++---
>  drivers/gpu/drm/nouveau/nouveau_ttm.c         | 12 ++--
>  drivers/gpu/drm/ttm/ttm_bo.c                  |  8
>  drivers/gpu/drm/ttm/ttm_range_manager.c       | 16
>  drivers/gpu/drm/vmwgfx/vmwgfx_gmrid_manager.c |  4 ++--
>  drivers/gpu/drm/vmwgfx/vmwgfx_thp.c           |  4 ++--
>  include/drm/ttm/ttm_bo_driver.h               | 18 +-
>  8 files changed, 36 insertions(+), 36 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
> index c847a5fe94c9..4f8451bdc28c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
> @@ -324,7 +324,7 @@ static void amdgpu_gtt_mgr_debug(struct ttm_resource_manager *man,
>  }
>
>  static const struct ttm_resource_manager_func amdgpu_gtt_mgr_func = {
> -	.get_node = amdgpu_gtt_mgr_new,
> -	.put_node = amdgpu_gtt_mgr_del,
> +	.alloc = amdgpu_gtt_mgr_new,
> +	.free = amdgpu_gtt_mgr_del,
>  	.debug = amdgpu_gtt_mgr_debug
>  };
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> index 895634cbf999..044262b9ded4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> @@ -619,7 +619,7 @@ static void amdgpu_vram_mgr_debug(struct ttm_resource_manager *man,
>  }
>
>  static const struct ttm_resource_manager_func amdgpu_vram_mgr_func = {
> -	.get_node = amdgpu_vram_mgr_new,
> -	.put_node = amdgpu_vram_mgr_del,
> -	.debug = amdgpu_vram_mgr_debug
> +	.alloc = amdgpu_vram_mgr_new,
> +	.free = amdgpu_vram_mgr_del,
> +	.debug = amdgpu_vram_mgr_debug
>  };
> diff --git a/drivers/gpu/drm/nouveau/nouveau_ttm.c b/drivers/gpu/drm/nouveau/nouveau_ttm.c
> index e6a30865a00b..53c6f8827322 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_ttm.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c
> @@ -64,8 +64,8 @@ nouveau_vram_manager_new(struct ttm_resource_manager *man,
>  }
>
>  const struct ttm_resource_manager_func nouveau_vram_manager = {
> -	.get_node = nouveau_vram_manager_new,
> -	.put_node = nouveau_manager_del,
> +	.alloc = nouveau_vram_manager_new,
> +	.free = nouveau_manager_del,
>  };
>
>  static int
> @@ -87,8 +87,8 @@ nouveau_gart_manager_new(struct ttm_resource_manager *man,
>  }
>
>  const struct ttm_resource_manager_func nouveau_gart_manager = {
> -	.get_node = nouveau_gart_manager_new,
> -	.put_node = nouveau_manager_del,
> +	.alloc = nouveau_gart_manager_new,
> +	.free = nouveau_manager_del,
>  };
>
>  static int
> @@ -119,8 +119,8 @@ nv04_gart_manager_new(struct ttm_resource_manager *man,
>  }
>
>  const struct ttm_resource_manager_func nv04_gart_manager = {
> -	.get_node = nv04_gart_manager_new,
> -	.put_node = nouveau_manager_del,
> +	.alloc = nv04_gart_manager_new,
> +	.free = nouveau_manager_del,
>  };
>
>  int
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index ad09329b62d3..ae71c3ab6cc4 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -846,20 +846,20 @@ static int ttm_bo_mem_get(struct ttm_buffer_object *bo,
>  	struct ttm_resource_manager *man = ttm_manager_type(bo->bdev, mem->mem_type);
>
>  	mem->mm_node = NULL;
> -	if (!man->func || !man->func->get_node)
> +	if (!man->func || !man->func->alloc)
>  		return 0;
>
> -	return man->func->get_node(man, bo, place, mem);
> +	return man->func->alloc(man, bo, place, mem);
>  }
>
>  void ttm_bo_mem_put(struct ttm_buffer_object *bo, struct ttm_resource *mem)
>  {
>  	struct ttm_resource_manager *man = ttm_manager_type(bo->bdev, mem->mem_type);
>
> -	if (!man->func || !man->func->put_node)
> +	if (!man->func || !man->func->free)
>  		return;
>
> -	man->func->put_node(man, mem);
> +	man->func->free(man, mem);
>  	mem->mm_node = NULL;
>  	mem->mem_type = TTM_PL_SYSTEM;
>  }
> diff --git a/drivers/gpu/drm/ttm/ttm_range_manager.c b/drivers/gpu/drm/ttm/ttm_range_manager.c
> index 274a05ca13d3..770c8988c139 100644
> --- a/drivers/gpu/drm/ttm/ttm_range_manager.c
> +++ b/drivers/gpu/drm/ttm/ttm_range_manager.c
> @@ -54,10 +54,10 @@ static inline struct ttm_range_manager *to_range_manager(struct ttm_resource_man
>  	return container_of(man, struct ttm_range_manager, manager);
>  }
>
> -static int ttm_range_man_get_node(struct ttm_resource_m
Re: [PATCH] drm/amdgpu: Skip some registers config for SRIOV
On 2020-08-06 3:40 a.m., Liu ChengZhe wrote:
> For VF, registers L2_CNTL, L2_CONTEXT1_IDENTITY_APERTURE
> L2_PROTECTION_FAULT_CNTL are not accesible, skip the

Spelling: "accessible". I'd rather move the "For VF" to the end, after "accessible", like this. I also believe it should be "to VF" as opposed to "for VF". Both are correct, but they mean different things. Maybe like this:

    Some registers are not accessible to virtual function setup, so skip
    their initialization when in VF-SRIOV mode.

> configuration for them in SRIOV mode.
>
> Signed-off-by: Liu ChengZhe
> ---
>  drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c | 12 ++--
>  drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c  | 12 ++--
>  2 files changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
> index 1f6112b7fa49..6b96f45fde2a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
> @@ -330,10 +330,13 @@ int gfxhub_v2_1_gart_enable(struct amdgpu_device *adev)
>  	gfxhub_v2_1_init_gart_aperture_regs(adev);
>  	gfxhub_v2_1_init_system_aperture_regs(adev);
>  	gfxhub_v2_1_init_tlb_regs(adev);
> -	gfxhub_v2_1_init_cache_regs(adev);
> +	if (!amdgpu_sriov_vf(adev))
> +		gfxhub_v2_1_init_cache_regs(adev);

Yes, that's the literal meaning of the commit description, but it would be cleaner if gfxhub_v2_1_init_cache_regs(adev) did that check instead. (Some might even argue it'd be more object-oriented...)

So then this code would look like a sequence of statements, unchanged, and each method of initialization would know/check whether it is applicable to "adev". It also makes it more centralized, with less code duplication and less code clutter.

>
>  	gfxhub_v2_1_enable_system_domain(adev);
> -	gfxhub_v2_1_disable_identity_aperture(adev);
> +	if (!amdgpu_sriov_vf(adev))
> +		gfxhub_v2_1_disable_identity_aperture(adev);
> +

Ditto.

>  	gfxhub_v2_1_setup_vmid_config(adev);
>  	gfxhub_v2_1_program_invalidation(adev);
>
> @@ -372,7 +375,12 @@ void gfxhub_v2_1_gart_disable(struct amdgpu_device *adev)
>  void gfxhub_v2_1_set_fault_enable_default(struct amdgpu_device *adev,
>  					  bool value)
>  {
> +	/*These regs are not accessible for VF, PF will program in SRIOV */

Add a space after the asterisk and before the beginning of the sentence. Format the comment in one of the two acceptable Linux kernel coding styles. End the sentence with a full stop. It also helps to spell out "registers". Like this perhaps:

    /* These registers are not accessible to VF-SRIOV.
     * The PF will program them instead.
     */

> +	if (amdgpu_sriov_vf(adev))
> +		return;
> +
>  	u32 tmp;

The definition of this automatic variable should precede all other code in this function. The compiler will optimize it away anyway.

> +
>  	tmp = RREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL);
>  	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
>  			    RANGE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
> diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c
> index d83912901f73..9cfde9b81600 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c
> @@ -321,10 +321,13 @@ int mmhub_v2_0_gart_enable(struct amdgpu_device *adev)
>  	mmhub_v2_0_init_gart_aperture_regs(adev);
>  	mmhub_v2_0_init_system_aperture_regs(adev);
>  	mmhub_v2_0_init_tlb_regs(adev);
> -	mmhub_v2_0_init_cache_regs(adev);
> +	if (!amdgpu_sriov_vf(adev))
> +		mmhub_v2_0_init_cache_regs(adev);

Move this check inside said function.

>
>  	mmhub_v2_0_enable_system_domain(adev);
> -	mmhub_v2_0_disable_identity_aperture(adev);
> +	if (!amdgpu_sriov_vf(adev))
> +		mmhub_v2_0_disable_identity_aperture(adev);
> +

Ditto.

>  	mmhub_v2_0_setup_vmid_config(adev);
>  	mmhub_v2_0_program_invalidation(adev);
>
> @@ -364,7 +367,12 @@ void mmhub_v2_0_gart_disable(struct amdgpu_device *adev)
>   */
>  void mmhub_v2_0_set_fault_enable_default(struct amdgpu_device *adev, bool value)
>  {
> +	/*These regs are not accessible for VF, PF will program in SRIOV */

You can duplicate this comment from the one above.

> +	if (amdgpu_sriov_vf(adev))
> +		return;
> +
>  	u32 tmp;

Should be the first thing in the function body, as noted above.

Regards,
Luben

> +
>  	tmp = RREG32_SOC15(MMHUB, 0, mmMMVM_L2_PROTECTION_FAULT_CNTL);
>  	tmp = REG_SET_FIELD(tmp, MMVM_L2_PROTECTION_FAULT_CNTL,
>  			    RANGE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
>
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH v2] drm/amdgpu: unlock mutex on error
Yes, that's fine. Thanks for fix. Reviewed-by: Luben Tuikov On 2020-08-05 9:31 p.m., Dennis Li wrote: > Make sure to unlock the mutex when error happen > > v2: > 1. correct syntax error in the commit comment > 2. remove change-Id > > Acked-by: Nirmoy Das > Signed-off-by: Dennis Li > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c > index a0ea663ecdbc..5e5369abc6fa 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c > @@ -632,13 +632,14 @@ int amdgpu_amdkfd_submit_ib(struct kgd_dev *kgd, enum > kgd_engine_type engine, > } > > ret = amdgpu_ib_schedule(ring, 1, ib, job, &f); > + > + up_read(&adev->reset_sem); > + > if (ret) { > DRM_ERROR("amdgpu: failed to schedule IB.\n"); > goto err_ib_sched; > } > > - up_read(&adev->reset_sem); > - > ret = dma_fence_wait(f, false); > > err_ib_sched: > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c > index 4e017f379eb6..67a756f4337b 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c > @@ -545,7 +545,7 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev) > } > ret = amdgpu_xgmi_update_topology(hive, tmp_adev); > if (ret) > - goto exit; > + goto exit_unlock; > } > > /* get latest topology info for each device from psp */ > @@ -558,7 +558,7 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev) > tmp_adev->gmc.xgmi.node_id, > tmp_adev->gmc.xgmi.hive_id, ret); > /* To do : continue with some node failed or > disable the whole hive */ > - goto exit; > + goto exit_unlock; > } > } > } > @@ -566,7 +566,7 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev) > if (!ret) > ret = amdgpu_xgmi_sysfs_add_dev_info(adev, hive); > > - > +exit_unlock: > mutex_unlock(&hive->hive_lock); > exit: > if (!ret) > ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH 5/7] drm/amd/display: Reset plane for anything that's not a FAST update
On 2020-08-05 4:45 p.m., Rodrigo Siqueira wrote: On 07/30, Nicholas Kazlauskas wrote: [Why] MEDIUM or FULL updates can require global validation or affect bandwidth. By treating these all simply as surface updates we aren't actually passing this through DC global validation. [How] There's currently no way to pass surface updates through DC global validation, nor do I think it's a good idea to change the interface to accept these. DC global validation itself is currently stateless, and we can move our update type checking to be stateless as well by duplicating DC surface checks in DM based on DRM properties. We wanted to rely on DC automatically determining this since DC knows best, but DM is ultimately what fills in everything into DC plane state so it does need to know as well. There are basically only three paths that we exercise in DM today: 1) Cursor (async update) 2) Pageflip (fast update) 3) Full pipe programming (medium/full updates) Which means that anything that's more than a pageflip really needs to go down path #3. So this change duplicates all the surface update checks based on DRM state instead inside of should_reset_plane(). Next step is dropping dm_determine_update_type_for_commit and we no longer require the old DC state at all for global validation. Optimization can come later so we don't reset DC planes at all for MEDIUM udpates and avoid validation, but we might require some extra checks in DM to achieve this. How about adding this optimization description in our TODO list under-display folder? Reviewed-by: Rodrigo Siqueira Sure, I'll make another patch to clean up some of the TODO items in the text file. 
Regards, Nicholas Kazlauskas Cc: Bhawanpreet Lakha Cc: Hersen Wu Signed-off-by: Nicholas Kazlauskas --- .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 25 +++ 1 file changed, 25 insertions(+) diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index 0d5f45742bb5..2cbb29199e61 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -8336,6 +8336,31 @@ static bool should_reset_plane(struct drm_atomic_state *state, if (old_other_state->crtc != new_other_state->crtc) return true; + /* Src/dst size and scaling updates. */ + if (old_other_state->src_w != new_other_state->src_w || + old_other_state->src_h != new_other_state->src_h || + old_other_state->crtc_w != new_other_state->crtc_w || + old_other_state->crtc_h != new_other_state->crtc_h) + return true; + + /* Rotation / mirroring updates. */ + if (old_other_state->rotation != new_other_state->rotation) + return true; + + /* Blending updates. */ + if (old_other_state->pixel_blend_mode != + new_other_state->pixel_blend_mode) + return true; + + /* Alpha updates. */ + if (old_other_state->alpha != new_other_state->alpha) + return true; + + /* Colorspace changes. */ + if (old_other_state->color_range != new_other_state->color_range || + old_other_state->color_encoding != new_other_state->color_encoding) + return true; + /* Framebuffer checks fall at the end. 
*/ if (!old_other_state->fb || !new_other_state->fb) continue; -- 2.25.1 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH 2/7] drm/amd/display: Reset plane when tiling flags change
On 2020-08-05 5:11 p.m., Rodrigo Siqueira wrote: On 07/30, Nicholas Kazlauskas wrote: [Why] Enabling or disable DCC or switching between tiled and linear formats can require bandwidth updates. They're currently skipping all DC validation by being treated as purely surface updates. [How] Treat tiling_flag changes (which encode DCC state) as a condition for resetting the plane. Cc: Bhawanpreet Lakha Cc: Rodrigo Siqueira Cc: Hersen Wu Signed-off-by: Nicholas Kazlauskas --- .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 19 --- 1 file changed, 16 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index 7cc5ab90ce13..bf1881bd492c 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -8332,6 +8332,8 @@ static bool should_reset_plane(struct drm_atomic_state *state, * TODO: Come up with a more elegant solution for this. */ for_each_oldnew_plane_in_state(state, other, old_other_state, new_other_state, i) { + struct dm_plane_state *old_dm_plane_state, *new_dm_plane_state; + if (other->type == DRM_PLANE_TYPE_CURSOR) continue; @@ -8342,9 +8344,20 @@ static bool should_reset_plane(struct drm_atomic_state *state, if (old_other_state->crtc != new_other_state->crtc) return true; - /* TODO: Remove this once we can handle fast format changes. */ - if (old_other_state->fb && new_other_state->fb && - old_other_state->fb->format != new_other_state->fb->format) + /* Framebuffer checks fall at the end. */ + if (!old_other_state->fb || !new_other_state->fb) + continue; + + /* Pixel format changes can require bandwidth updates. */ + if (old_other_state->fb->format != new_other_state->fb->format) + return true; + + old_dm_plane_state = to_dm_plane_state(old_other_state); + new_dm_plane_state = to_dm_plane_state(new_other_state); + + /* Tiling and DCC changes also require bandwidth updates. 
*/ + if (old_dm_plane_state->tiling_flags != + new_dm_plane_state->tiling_flags) Why not add a case when we move to a TMZ area? Reviewed-by: Rodrigo Siqueira TMZ doesn't affect DML calculations or validation in this case so we can safely skip it. Regards, Nicholas Kazlauskas return true; } -- 2.25.1 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] drm/amdgpu: don't create entity when use cpu to update page table
NAK, we also use the entity context number for debugging. Additional to this the entities should not need any additional resources. So the functions are only initializing fields. Regards, Christian. Am 06.08.2020 18:06 schrieb "Wang, Kevin(Yang)" : the entity isn't needed when vm use cpu to update page table. Signed-off-by: Kevin Wang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 45 ++ 1 file changed, 25 insertions(+), 20 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 71e005cf2952..e15c29d613d9 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -2802,20 +2802,6 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm, spin_lock_init(&vm->invalidated_lock); INIT_LIST_HEAD(&vm->freed); - - /* create scheduler entities for page table updates */ - r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_NORMAL, - adev->vm_manager.vm_pte_scheds, - adev->vm_manager.vm_pte_num_scheds, NULL); - if (r) - return r; - - r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_NORMAL, - adev->vm_manager.vm_pte_scheds, - adev->vm_manager.vm_pte_num_scheds, NULL); - if (r) - goto error_free_immediate; - vm->pte_support_ats = false; vm->is_compute_context = false; @@ -2835,10 +2821,25 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm, !amdgpu_gmc_vram_full_visible(&adev->gmc)), "CPU update of VM recommended only for large BAR system\n"); - if (vm->use_cpu_for_update) + if (vm->use_cpu_for_update) { vm->update_funcs = &amdgpu_vm_cpu_funcs; - else + } else { + /* create scheduler entities for page table updates */ + r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_NORMAL, + adev->vm_manager.vm_pte_scheds, + adev->vm_manager.vm_pte_num_scheds, NULL); + if (r) + return r; + + r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_NORMAL, + adev->vm_manager.vm_pte_scheds, + adev->vm_manager.vm_pte_num_scheds, NULL); + 
if (r) + goto error_free_immediate; + vm->update_funcs = &amdgpu_vm_sdma_funcs; + } + vm->last_update = NULL; vm->last_unlocked = dma_fence_get_stub(); @@ -2895,10 +2896,12 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm, error_free_delayed: dma_fence_put(vm->last_unlocked); - drm_sched_entity_destroy(&vm->delayed); + if (!vm->use_cpu_for_update) + drm_sched_entity_destroy(&vm->delayed); error_free_immediate: - drm_sched_entity_destroy(&vm->immediate); + if (!vm->use_cpu_for_update) + drm_sched_entity_destroy(&vm->immediate); return r; } @@ -3120,8 +3123,10 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm) amdgpu_bo_unref(&root); WARN_ON(vm->root.base.bo); - drm_sched_entity_destroy(&vm->immediate); - drm_sched_entity_destroy(&vm->delayed); + if (!vm->use_cpu_for_update) { + drm_sched_entity_destroy(&vm->immediate); + drm_sched_entity_destroy(&vm->delayed); + } if (!RB_EMPTY_ROOT(&vm->va.rb_root)) { dev_err(adev->dev, "still active bo inside vm\n"); -- 2.27.0 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] drm/amdgpu: fix uninit-value in arcturus_log_thermal_throttling_event()
On Thu, Aug 6, 2020 at 11:59 AM Kevin Wang wrote: > > when function arcturus_get_smu_metrics_data() call failed, > it will cause the variable "throttler_status" isn't initialized before use. > > warning: > powerplay/arcturus_ppt.c:2268:24: warning: ‘throttler_status’ may be used > uninitialized in this function [-Wmaybe-uninitialized] > 2268 | if (throttler_status & logging_label[throttler_idx].feature_mask) { > > Signed-off-by: Kevin Wang Reviewed-by: Alex Deucher > --- > drivers/gpu/drm/amd/powerplay/arcturus_ppt.c | 9 ++--- > 1 file changed, 6 insertions(+), 3 deletions(-) > > diff --git a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c > b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c > index e58af84433c7..946dbc0db1b1 100644 > --- a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c > +++ b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c > @@ -2253,14 +2253,17 @@ static const struct throttling_logging_label { > }; > static void arcturus_log_thermal_throttling_event(struct smu_context *smu) > { > + int ret; > int throttler_idx, throtting_events = 0, buf_idx = 0; > struct amdgpu_device *adev = smu->adev; > uint32_t throttler_status; > char log_buf[256]; > > - arcturus_get_smu_metrics_data(smu, > - METRICS_THROTTLER_STATUS, > - &throttler_status); > + ret = arcturus_get_smu_metrics_data(smu, > + METRICS_THROTTLER_STATUS, > + &throttler_status); > + if (ret) > + return; > > memset(log_buf, 0, sizeof(log_buf)); > for (throttler_idx = 0; throttler_idx < ARRAY_SIZE(logging_label); > -- > 2.27.0 > > ___ > amd-gfx mailing list > amd-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/amdgpu: don't create entity when use cpu to update page table
the entity isn't needed when vm use cpu to update page table. Signed-off-by: Kevin Wang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 45 ++ 1 file changed, 25 insertions(+), 20 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 71e005cf2952..e15c29d613d9 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -2802,20 +2802,6 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm, spin_lock_init(&vm->invalidated_lock); INIT_LIST_HEAD(&vm->freed); - - /* create scheduler entities for page table updates */ - r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_NORMAL, - adev->vm_manager.vm_pte_scheds, - adev->vm_manager.vm_pte_num_scheds, NULL); - if (r) - return r; - - r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_NORMAL, - adev->vm_manager.vm_pte_scheds, - adev->vm_manager.vm_pte_num_scheds, NULL); - if (r) - goto error_free_immediate; - vm->pte_support_ats = false; vm->is_compute_context = false; @@ -2835,10 +2821,25 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm, !amdgpu_gmc_vram_full_visible(&adev->gmc)), "CPU update of VM recommended only for large BAR system\n"); - if (vm->use_cpu_for_update) + if (vm->use_cpu_for_update) { vm->update_funcs = &amdgpu_vm_cpu_funcs; - else + } else { + /* create scheduler entities for page table updates */ + r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_NORMAL, + adev->vm_manager.vm_pte_scheds, + adev->vm_manager.vm_pte_num_scheds, NULL); + if (r) + return r; + + r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_NORMAL, + adev->vm_manager.vm_pte_scheds, + adev->vm_manager.vm_pte_num_scheds, NULL); + if (r) + goto error_free_immediate; + vm->update_funcs = &amdgpu_vm_sdma_funcs; + } + vm->last_update = NULL; vm->last_unlocked = dma_fence_get_stub(); @@ -2895,10 +2896,12 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm, 
error_free_delayed: dma_fence_put(vm->last_unlocked); - drm_sched_entity_destroy(&vm->delayed); + if (!vm->use_cpu_for_update) + drm_sched_entity_destroy(&vm->delayed); error_free_immediate: - drm_sched_entity_destroy(&vm->immediate); + if (!vm->use_cpu_for_update) + drm_sched_entity_destroy(&vm->immediate); return r; } @@ -3120,8 +3123,10 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm) amdgpu_bo_unref(&root); WARN_ON(vm->root.base.bo); - drm_sched_entity_destroy(&vm->immediate); - drm_sched_entity_destroy(&vm->delayed); + if (!vm->use_cpu_for_update) { + drm_sched_entity_destroy(&vm->immediate); + drm_sched_entity_destroy(&vm->delayed); + } if (!RB_EMPTY_ROOT(&vm->va.rb_root)) { dev_err(adev->dev, "still active bo inside vm\n"); -- 2.27.0 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/amdgpu: fix uninit-value in arcturus_log_thermal_throttling_event()
when function arcturus_get_smu_metrics_data() call failed, it will cause the variable "throttler_status" isn't initialized before use. warning: powerplay/arcturus_ppt.c:2268:24: warning: ‘throttler_status’ may be used uninitialized in this function [-Wmaybe-uninitialized] 2268 | if (throttler_status & logging_label[throttler_idx].feature_mask) { Signed-off-by: Kevin Wang --- drivers/gpu/drm/amd/powerplay/arcturus_ppt.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c index e58af84433c7..946dbc0db1b1 100644 --- a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c +++ b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c @@ -2253,14 +2253,17 @@ static const struct throttling_logging_label { }; static void arcturus_log_thermal_throttling_event(struct smu_context *smu) { + int ret; int throttler_idx, throtting_events = 0, buf_idx = 0; struct amdgpu_device *adev = smu->adev; uint32_t throttler_status; char log_buf[256]; - arcturus_get_smu_metrics_data(smu, - METRICS_THROTTLER_STATUS, - &throttler_status); + ret = arcturus_get_smu_metrics_data(smu, + METRICS_THROTTLER_STATUS, + &throttler_status); + if (ret) + return; memset(log_buf, 0, sizeof(log_buf)); for (throttler_idx = 0; throttler_idx < ARRAY_SIZE(logging_label); -- 2.27.0 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH 7/7] drm/amd/display: Replace DRM private objects with subclassed DRM atomic state
On 2020-08-05 4:37 p.m., Rodrigo Siqueira wrote: Hi, I have some minor inline comments, but everything looks fine when I tested it on Raven; feel free to add Tested-by: Rodrigo Siqueira in the whole series. Thanks for the reviews! I can clean up the nitpicks for this patch and make a v2. Regards, Nicholas Kazlauskas On 07/30, Nicholas Kazlauskas wrote: [Why] DM atomic check was structured in a way that we required old DC state in order to dynamically add and remove planes and streams from the context to build the DC state context for validation. DRM private objects were used to carry over the last DC state and were added to the context on nearly every commit - regardless of fast or full so we could check whether or not the new state could affect bandwidth. The problem with this model is that DRM private objects do not implicitly stall out other commits. So if you have two commits touching separate DRM objects they could run concurrently and potentially execute out of order - leading to a use-after-free. If we want this to be safe we have two options: 1. Stall out concurrent commits since they touch the same private object 2. Refactor DM to not require old DC state and drop private object usage [How] This implements approach #2 since it still allows for judder free updates in multi-display scenarios. I'll list the big changes in order at a high level: 1. Subclass DRM atomic state instead of using DRM private objects. DC relied on the old state to determine which changes cause bandwidth updates but now we have DM perform similar checks based on DRM state instead - dropping the requirement for old state to exist at all. This means that we can now build a new DC context from scratch whenever we have something that DM thinks could affect bandwidth. Whenever we need to rebuild bandwidth we now add all CRTCs and planes to the DRM state in order to get the absolute set of DC streams and DC planes. 
This introduces a stall on other commits, but this stall already exists because of the lock_and_validation logic and it's necessary since updates may move around pipes and require full reprogramming. 2. Drop workarounds to add planes to maintain z-order early in atomic check. This is no longer needed because of the changes for (1). This also involves fixing up should_plane_reset checks since we can just avoid resetting streams and planes when they haven't actually changed. 3. Rework dm_update_crtc_state and dm_update_plane_state to be single pass instead of two pass. This is necessary since we no longer have the dc_state to add and remove planes to the context in and we want to defer creation to the end of commit_check. It also makes the logic a lot simpler to follow since as an added bonus. Cc: Bhawanpreet Lakha Cc: Harry Wentland Cc: Leo Li Cc: Daniel Vetter Signed-off-by: Nicholas Kazlauskas --- .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 720 +++--- .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h | 11 +- 2 files changed, 280 insertions(+), 451 deletions(-) diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index 59829ec81694..97a7dfc620e8 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -1839,7 +1839,6 @@ static int dm_resume(void *handle) struct drm_plane *plane; struct drm_plane_state *new_plane_state; struct dm_plane_state *dm_new_plane_state; - struct dm_atomic_state *dm_state = to_dm_atomic_state(dm->atomic_obj.state); enum dc_connection_type new_connection_type = dc_connection_none; struct dc_state *dc_state; int i, r, j; @@ -1879,11 +1878,6 @@ static int dm_resume(void *handle) return 0; } - /* Recreate dc_state - DC invalidates it when setting power state to S3. */ - dc_release_state(dm_state->context); - dm_state->context = dc_create_state(dm->dc); - /* TODO: Remove dc_state->dccg, use dc->dccg directly. 
*/ - dc_resource_state_construct(dm->dc, dm_state->context); /* Before powering on DC we need to re-initialize DMUB. */ r = dm_dmub_hw_init(adev); @@ -2019,11 +2013,51 @@ const struct amdgpu_ip_block_version dm_ip_block = * *WIP* */ +struct drm_atomic_state *dm_atomic_state_alloc(struct drm_device *dev) +{ + struct dm_atomic_state *dm_state; + + dm_state = kzalloc(sizeof(*dm_state), GFP_KERNEL); How about use GFP_ATOMIC here? + + if (!dm_state) + return NULL; + + if (drm_atomic_state_init(dev, &dm_state->base) < 0) { + kfree(dm_state); + return NULL; + } + + return &dm_state->base; +} + +void dm_atomic_state_free(struct drm_atomic_state *state) +{ + struct dm_atomic_state *dm_state = to_dm_atomic_state(state); + + if (dm_state->context) { + dc_release_state(dm_state->context); +
RE: [PATCH] drm/amdgpu: use mode1 reset by default for sienna_cichlid
[AMD Public Use] Reviewed-by: Hawking Zhang Regards, Hawking -Original Message- From: Gao, Likun Sent: Thursday, August 6, 2020 17:43 To: amd-gfx@lists.freedesktop.org Cc: Zhang, Hawking ; Feng, Kenneth ; Sheng, Wenhui ; Gao, Likun Subject: [PATCH] drm/amdgpu: use mode1 reset by default for sienna_cichlid From: Likun Gao Swith default gpu reset method for sienna_cichlid to MODE1 reset. Signed-off-by: Likun Gao Change-Id: I775e5a66bbac474f65ca8c999136ccaf9c1dc14e --- drivers/gpu/drm/amd/amdgpu/nv.c | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c index 74d02d270d34..da8024c2826e 100644 --- a/drivers/gpu/drm/amd/amdgpu/nv.c +++ b/drivers/gpu/drm/amd/amdgpu/nv.c @@ -362,10 +362,15 @@ nv_asic_reset_method(struct amdgpu_device *adev) dev_warn(adev->dev, "Specified reset method:%d isn't supported, using AUTO instead.\n", amdgpu_reset_method); - if (smu_baco_is_support(smu)) - return AMD_RESET_METHOD_BACO; - else + switch (adev->asic_type) { + case CHIP_SIENNA_CICHLID: return AMD_RESET_METHOD_MODE1; + default: + if (smu_baco_is_support(smu)) + return AMD_RESET_METHOD_BACO; + else + return AMD_RESET_METHOD_MODE1; + } } static int nv_asic_reset(struct amdgpu_device *adev) -- 2.25.1 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH 1/3] drm/ttm: rename ttm_resource_manager_func callbacks
The names get/put are associated with reference counting in the Linux kernel, use alloc/free instead. Signed-off-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c | 4 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 6 +++--- drivers/gpu/drm/nouveau/nouveau_ttm.c | 12 ++-- drivers/gpu/drm/ttm/ttm_bo.c | 8 drivers/gpu/drm/ttm/ttm_range_manager.c | 16 drivers/gpu/drm/vmwgfx/vmwgfx_gmrid_manager.c | 4 ++-- drivers/gpu/drm/vmwgfx/vmwgfx_thp.c | 4 ++-- include/drm/ttm/ttm_bo_driver.h | 18 +- 8 files changed, 36 insertions(+), 36 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c index c847a5fe94c9..4f8451bdc28c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c @@ -324,7 +324,7 @@ static void amdgpu_gtt_mgr_debug(struct ttm_resource_manager *man, } static const struct ttm_resource_manager_func amdgpu_gtt_mgr_func = { - .get_node = amdgpu_gtt_mgr_new, - .put_node = amdgpu_gtt_mgr_del, + .alloc = amdgpu_gtt_mgr_new, + .free = amdgpu_gtt_mgr_del, .debug = amdgpu_gtt_mgr_debug }; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c index 895634cbf999..044262b9ded4 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c @@ -619,7 +619,7 @@ static void amdgpu_vram_mgr_debug(struct ttm_resource_manager *man, } static const struct ttm_resource_manager_func amdgpu_vram_mgr_func = { - .get_node = amdgpu_vram_mgr_new, - .put_node = amdgpu_vram_mgr_del, - .debug = amdgpu_vram_mgr_debug + .alloc = amdgpu_vram_mgr_new, + .free = amdgpu_vram_mgr_del, + .debug = amdgpu_vram_mgr_debug }; diff --git a/drivers/gpu/drm/nouveau/nouveau_ttm.c b/drivers/gpu/drm/nouveau/nouveau_ttm.c index e6a30865a00b..53c6f8827322 100644 --- a/drivers/gpu/drm/nouveau/nouveau_ttm.c +++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c @@ -64,8 +64,8 @@ 
nouveau_vram_manager_new(struct ttm_resource_manager *man, } const struct ttm_resource_manager_func nouveau_vram_manager = { - .get_node = nouveau_vram_manager_new, - .put_node = nouveau_manager_del, + .alloc = nouveau_vram_manager_new, + .free = nouveau_manager_del, }; static int @@ -87,8 +87,8 @@ nouveau_gart_manager_new(struct ttm_resource_manager *man, } const struct ttm_resource_manager_func nouveau_gart_manager = { - .get_node = nouveau_gart_manager_new, - .put_node = nouveau_manager_del, + .alloc = nouveau_gart_manager_new, + .free = nouveau_manager_del, }; static int @@ -119,8 +119,8 @@ nv04_gart_manager_new(struct ttm_resource_manager *man, } const struct ttm_resource_manager_func nv04_gart_manager = { - .get_node = nv04_gart_manager_new, - .put_node = nouveau_manager_del, + .alloc = nv04_gart_manager_new, + .free = nouveau_manager_del, }; int diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index ad09329b62d3..ae71c3ab6cc4 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -846,20 +846,20 @@ static int ttm_bo_mem_get(struct ttm_buffer_object *bo, struct ttm_resource_manager *man = ttm_manager_type(bo->bdev, mem->mem_type); mem->mm_node = NULL; - if (!man->func || !man->func->get_node) + if (!man->func || !man->func->alloc) return 0; - return man->func->get_node(man, bo, place, mem); + return man->func->alloc(man, bo, place, mem); } void ttm_bo_mem_put(struct ttm_buffer_object *bo, struct ttm_resource *mem) { struct ttm_resource_manager *man = ttm_manager_type(bo->bdev, mem->mem_type); - if (!man->func || !man->func->put_node) + if (!man->func || !man->func->free) return; - man->func->put_node(man, mem); + man->func->free(man, mem); mem->mm_node = NULL; mem->mem_type = TTM_PL_SYSTEM; } diff --git a/drivers/gpu/drm/ttm/ttm_range_manager.c b/drivers/gpu/drm/ttm/ttm_range_manager.c index 274a05ca13d3..770c8988c139 100644 --- a/drivers/gpu/drm/ttm/ttm_range_manager.c +++ 
b/drivers/gpu/drm/ttm/ttm_range_manager.c @@ -54,10 +54,10 @@ static inline struct ttm_range_manager *to_range_manager(struct ttm_resource_man return container_of(man, struct ttm_range_manager, manager); } -static int ttm_range_man_get_node(struct ttm_resource_manager *man, - struct ttm_buffer_object *bo, - const struct ttm_place *place, - struct ttm_resource *mem) +static int ttm_range_man_alloc(struct ttm_resource_manager *man, + struct ttm_buffer_object
[PATCH 3/3] drm/radeon: drop superfluous AGP handling
The object flags created in radeon_ttm_placement_from_domain take care that we use the correct caching for AGP, this is just superflous. Signed-off-by: Christian König --- drivers/gpu/drm/radeon/radeon_ttm.c | 21 ++--- 1 file changed, 2 insertions(+), 19 deletions(-) diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c index 31f4cf211b6a..290c8b479853 100644 --- a/drivers/gpu/drm/radeon/radeon_ttm.c +++ b/drivers/gpu/drm/radeon/radeon_ttm.c @@ -76,26 +76,9 @@ static int radeon_ttm_init_vram(struct radeon_device *rdev) static int radeon_ttm_init_gtt(struct radeon_device *rdev) { - uint32_t available_caching, default_caching; - - available_caching = TTM_PL_MASK_CACHING; - default_caching = TTM_PL_FLAG_CACHED; - -#if IS_ENABLED(CONFIG_AGP) - if (rdev->flags & RADEON_IS_AGP) { - if (!rdev->ddev->agp) { - DRM_ERROR("AGP is not enabled\n"); - return -EINVAL; - } - available_caching = TTM_PL_FLAG_UNCACHED | - TTM_PL_FLAG_WC; - default_caching = TTM_PL_FLAG_WC; - } -#endif - return ttm_range_man_init(&rdev->mman.bdev, TTM_PL_TT, - available_caching, - default_caching, true, + TTM_PL_MASK_CACHING, + TTM_PL_FLAG_CACHED, true, rdev->mc.gtt_size >> PAGE_SHIFT); } -- 2.17.1 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH 2/3] drm/ttm: give resource functions their own [ch] files
This is a separate object we work within TTM. Signed-off-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c| 8 +- drivers/gpu/drm/nouveau/nouveau_bo.c | 4 +- drivers/gpu/drm/radeon/radeon_ttm.c| 4 +- drivers/gpu/drm/ttm/Makefile | 3 +- drivers/gpu/drm/ttm/ttm_bo.c | 124 +- drivers/gpu/drm/ttm/ttm_bo_util.c | 4 +- drivers/gpu/drm/ttm/ttm_resource.c | 151 include/drm/ttm/ttm_bo_api.h | 70 +- include/drm/ttm/ttm_bo_driver.h| 189 --- include/drm/ttm/ttm_resource.h | 263 + 11 files changed, 446 insertions(+), 376 deletions(-) create mode 100644 drivers/gpu/drm/ttm/ttm_resource.c create mode 100644 include/drm/ttm/ttm_resource.h diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c index 43f4966331dd..b36d94f57d42 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -381,7 +381,7 @@ int amdgpu_bo_create_kernel_at(struct amdgpu_device *adev, if (cpu_addr) amdgpu_bo_kunmap(*bo_ptr); - ttm_bo_mem_put(&(*bo_ptr)->tbo, &(*bo_ptr)->tbo.mem); + ttm_resource_free(&(*bo_ptr)->tbo, &(*bo_ptr)->tbo.mem); for (i = 0; i < (*bo_ptr)->placement.num_placement; ++i) { (*bo_ptr)->placements[i].fpfn = offset >> PAGE_SHIFT; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c index 67d2eef2f9eb..462402fcd1a4 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c @@ -578,7 +578,7 @@ static int amdgpu_move_vram_ram(struct ttm_buffer_object *bo, bool evict, /* move BO (in tmp_mem) to new_mem */ r = ttm_bo_move_ttm(bo, ctx, new_mem); out_cleanup: - ttm_bo_mem_put(bo, &tmp_mem); + ttm_resource_free(bo, &tmp_mem); return r; } @@ -625,7 +625,7 @@ static int amdgpu_move_ram_vram(struct ttm_buffer_object *bo, bool evict, goto out_cleanup; } out_cleanup: - ttm_bo_mem_put(bo, &tmp_mem); + ttm_resource_free(bo, &tmp_mem); return r; } @@ -1203,11 +1203,11 
@@ int amdgpu_ttm_alloc_gart(struct ttm_buffer_object *bo) gtt->offset = (u64)tmp.start << PAGE_SHIFT; r = amdgpu_ttm_gart_bind(adev, bo, flags); if (unlikely(r)) { - ttm_bo_mem_put(bo, &tmp); + ttm_resource_free(bo, &tmp); return r; } - ttm_bo_mem_put(bo, &bo->mem); + ttm_resource_free(bo, &bo->mem); bo->mem = tmp; } diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c index 604a74323696..29d7d7e100f7 100644 --- a/drivers/gpu/drm/nouveau/nouveau_bo.c +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c @@ -1191,7 +1191,7 @@ nouveau_bo_move_flipd(struct ttm_buffer_object *bo, bool evict, bool intr, ret = ttm_bo_move_ttm(bo, &ctx, new_reg); out: - ttm_bo_mem_put(bo, &tmp_reg); + ttm_resource_free(bo, &tmp_reg); return ret; } @@ -1227,7 +1227,7 @@ nouveau_bo_move_flips(struct ttm_buffer_object *bo, bool evict, bool intr, goto out; out: - ttm_bo_mem_put(bo, &tmp_reg); + ttm_resource_free(bo, &tmp_reg); return ret; } diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c index 3355b69b13d1..31f4cf211b6a 100644 --- a/drivers/gpu/drm/radeon/radeon_ttm.c +++ b/drivers/gpu/drm/radeon/radeon_ttm.c @@ -271,7 +271,7 @@ static int radeon_move_vram_ram(struct ttm_buffer_object *bo, } r = ttm_bo_move_ttm(bo, &ctx, new_mem); out_cleanup: - ttm_bo_mem_put(bo, &tmp_mem); + ttm_resource_free(bo, &tmp_mem); return r; } @@ -309,7 +309,7 @@ static int radeon_move_ram_vram(struct ttm_buffer_object *bo, goto out_cleanup; } out_cleanup: - ttm_bo_mem_put(bo, &tmp_mem); + ttm_resource_free(bo, &tmp_mem); return r; } diff --git a/drivers/gpu/drm/ttm/Makefile b/drivers/gpu/drm/ttm/Makefile index e54326e6cea4..90c0da88cc98 100644 --- a/drivers/gpu/drm/ttm/Makefile +++ b/drivers/gpu/drm/ttm/Makefile @@ -4,7 +4,8 @@ ttm-y := ttm_memory.o ttm_tt.o ttm_bo.o \ ttm_bo_util.o ttm_bo_vm.o ttm_module.o \ - ttm_execbuf_util.o ttm_page_alloc.o ttm_range_manager.o + ttm_execbuf_util.o ttm_page_alloc.o ttm_range_manager.o \ + ttm_resource.o 
ttm-$(CONFIG_AGP) += ttm_agp_backend.o ttm-$(CONFIG_DRM_TTM_DMA_PAGE_POOL) += ttm_page_alloc_dma.o diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index ae71c3ab6cc4..55890314316b 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/tt
Re: [PATCH] drm/amdgpu: make sure userptr ttm is allocated
On 2020-08-06 2:56 p.m., Christian König wrote: > We need to allocate that manually now. > > Signed-off-by: Christian König > Fixes: 2ddef17678bc (HEAD) drm/ttm: make TT creation purely optional v3 Reviewed-by: Michel Dänzer Tested-by: Michel Dänzer Thanks! -- Earthling Michel Dänzer | https://redhat.com Libre software enthusiast | Mesa and X developer ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking
On 06.08.20 at 13:38, Li, Dennis wrote: [AMD Official Use Only - Internal Distribution Only] [SNIP] I think it is a limitation of init_rwsem. And exactly that's wrong, this is intentional and perfectly correct. [Dennis Li] I couldn't understand. Why is it perfectly correct? For example, we define two rw_semaphores: a and b. If we don't check the init_rwsem definition, we may think case #1 and case #2 have the same behavior, but in fact they are different.

Case 1:

    init_rwsem(&a);
    init_rwsem(&b);

Case 2:

    void init_rwsem_ext(struct rw_semaphore *sem)
    {
        init_rwsem(sem);
    }
    init_rwsem_ext(&a);
    init_rwsem_ext(&b);

As far as I know it is perfectly possible that the locks in the hive are not always grabbed in the same order. And that's why lockdep is complaining about this. [Dennis Li] No. Let me take a hive with two devices (a and b) to explain why lockdep complains:

    // First the driver locks the reset_sem of all devices
    down_write(&a->reset_sem);
    do_suspend(a);
    down_write(&b->reset_sem); // Because b shares the same lock_class_key with a, lockdep
                               // treats a and b as the same rw_sem and complains here.
    do_suspend(b);

    // do recovery
    do_hive_recovery();

    // unlock the reset_sem of all devices
    do_resume(a);
    up_write(&a->reset_sem);
    do_resume(b);
    up_write(&b->reset_sem);

Ah! Now I understand what you are working around. So the problem is the static lock_class_key in the macro? [Dennis Li] Yes. The author of init_rwsem probably did not consider a use case like ours. What we should do instead is to make sure we have only a single lock for the complete hive instead. [Dennis Li] If we use a single lock, users must wait for all devices to resume successfully, even when their tasks are only running on device a. It is not friendly to users. Well, that is intentional I would say. We can only restart submissions after all devices are resumed successfully, because otherwise it can happen that a job on device A depends on memory on device B which isn't accessible. [Dennis Li] Yes, you are right. 
The driver has to make sure that the shared resources (such as the shared memory) are ready before unlocking the adevs' locks one by one. But each device still has private hardware resources such as video and display. Yeah, but those are negligible and we have a team working on display support for XGMI. So this will sooner or later become a problem as well. I still think that a single rwsem for the whole hive is still the best option here.

Regards,
Christian.

On 06.08.20 at 11:15, Li, Dennis wrote: [AMD Official Use Only - Internal Distribution Only] Hi Christian, for this case it is safe to use a separate lock key. Please see the definition of init_rwsem below. Every init_rwsem call will use a new static key, but the devices of the hive share the same code to do initialization, so their lock_class_keys are the same. I think it is a limitation of init_rwsem. In our case, it should be correct that the reset_sem of each adev has a different lock_class_key. BTW, this change doesn't affect dead-lock detection; it just corrects it.

    #define init_rwsem(sem)                             \
    do {                                                \
            static struct lock_class_key __key;         \
                                                        \
            __init_rwsem((sem), #sem, &__key);          \
    } while (0)

Best Regards
Dennis Li

-----Original Message-----
From: Koenig, Christian
Sent: Thursday, August 6, 2020 4:13 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Kuehling, Felix ; Zhang, Hawking
Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

Preventing locking problems during implementation is obviously a good approach, but lockdep has proven to be massively useful for finding and fixing problems. Disabling a lockdep splat by annotating locks with separate classes is usually a no-go and only allowed if there is no other potential approach. In this case here we should really clean things up instead. Christian.

On 06.08.20 at 09:44, Li, Dennis wrote: [AMD Official Use Only - Internal Distribution Only] Hi Christian, I agree with your concern. 
However, we shouldn't rely on the system to detect deadlocks; we should also consider this when doing the code implementation. In fact, deadlock detection isn't enabled by default. As for your proposal to move reset_sem into the hive structure, we can open a new topic to discuss it. Currently we can't be sure which is the best solution. For example, with your proposal we must wait for all devices to resume successfully before resubmitting an old task on one device, which will affect performance.

Best Regards
Dennis Li

-----Original Message-----
From: amd-gfx On Behalf Of Christian König
Sent: Thursday, August 6, 2020 3:08 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Kuehling, Felix ; Zha
[PATCH] drm/amdgpu: make sure userptr ttm is allocated
We need to allocate that manually now. Signed-off-by: Christian König Fixes: 2ddef17678bc (HEAD) drm/ttm: make TT creation purely optional v3 --- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c| 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 17 +++-- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 4 ++-- 4 files changed, 15 insertions(+), 10 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index 9015c7b76d60..55d2e870fdda 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -564,7 +564,7 @@ static int init_user_pages(struct kgd_mem *mem, uint64_t user_addr) mutex_lock(&process_info->lock); - ret = amdgpu_ttm_tt_set_userptr(bo->tbo.ttm, user_addr, 0); + ret = amdgpu_ttm_tt_set_userptr(&bo->tbo, user_addr, 0); if (ret) { pr_err("%s: Failed to set userptr: %d\n", __func__, ret); goto out; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c index de9784b0c19b..b2dc320e7305 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c @@ -332,7 +332,7 @@ int amdgpu_gem_userptr_ioctl(struct drm_device *dev, void *data, bo = gem_to_amdgpu_bo(gobj); bo->preferred_domains = AMDGPU_GEM_DOMAIN_GTT; bo->allowed_domains = AMDGPU_GEM_DOMAIN_GTT; - r = amdgpu_ttm_tt_set_userptr(bo->tbo.ttm, args->addr, args->flags); + r = amdgpu_ttm_tt_set_userptr(&bo->tbo, args->addr, args->flags); if (r) goto release_object; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c index 28557839f132..67d2eef2f9eb 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c @@ -1407,21 +1407,26 @@ static void amdgpu_ttm_tt_unpopulate(struct ttm_tt *ttm) * amdgpu_ttm_tt_set_userptr - Initialize userptr GTT ttm_tt for the current * task * - * @ttm: The ttm_tt object 
to bind this userptr object to + * @bo: The ttm_buffer_object to bind this userptr to * @addr: The address in the current tasks VM space to use * @flags: Requirements of userptr object. * * Called by amdgpu_gem_userptr_ioctl() to bind userptr pages * to current task */ -int amdgpu_ttm_tt_set_userptr(struct ttm_tt *ttm, uint64_t addr, - uint32_t flags) +int amdgpu_ttm_tt_set_userptr(struct ttm_buffer_object *bo, + uint64_t addr, uint32_t flags) { - struct amdgpu_ttm_tt *gtt = (void *)ttm; + struct amdgpu_ttm_tt *gtt; - if (gtt == NULL) - return -EINVAL; + if (!bo->ttm) { + /* TODO: We want a separate TTM object type for userptrs */ + bo->ttm = amdgpu_ttm_tt_create(bo, 0); + if (bo->ttm == NULL) + return -ENOMEM; + } + gtt = (void*)bo->ttm; gtt->userptr = addr; gtt->userflags = flags; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h index 36b024fd077e..433156c2e8fb 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h @@ -132,8 +132,8 @@ static inline bool amdgpu_ttm_tt_get_user_pages_done(struct ttm_tt *ttm) #endif void amdgpu_ttm_tt_set_user_pages(struct ttm_tt *ttm, struct page **pages); -int amdgpu_ttm_tt_set_userptr(struct ttm_tt *ttm, uint64_t addr, -uint32_t flags); +int amdgpu_ttm_tt_set_userptr(struct ttm_buffer_object *bo, + uint64_t addr, uint32_t flags); bool amdgpu_ttm_tt_has_userptr(struct ttm_tt *ttm); struct mm_struct *amdgpu_ttm_tt_get_usermm(struct ttm_tt *ttm); bool amdgpu_ttm_tt_affect_userptr(struct ttm_tt *ttm, unsigned long start, -- 2.17.1 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH] drm/amdgpu: annotate a false positive recursive locking
[AMD Official Use Only - Internal Distribution Only] [SNIP] >> I think it is a limitation of init_rwsem. > And exactly that's wrong, this is intentional and perfectly correct. > > [Dennis Li] I couldn't understand. Why is it a perfectly correct? > For example, we define two rw_sem: a and b. If we don't check init_rwsem > definition, we may think case#1 and case#2 have the same behavior, but in > fact they are different. > > Case 1: > init_rwsem(&a); > init_rwsem(&b); > > Case2: > void init_rwsem_ext(rw_sem* sem) > { > init_rwsem(sem); > } > init_rwsem_ext(&a); > init_rwsem_ext(&b); > > As far as I know it is perfectly possible that the locks in the hive are not > always grabbed in the same order. And that's why lockdep is complaining about > this. > [Dennis Li] No. I takes a hive with two devices(a and b) to explain why > lockdep complain. > > // Firstly driver lock the reset_sem of all devices > down_write(&a->reset_sem); do_suspend(a); > down_write(&b->reset_sem); // Because b shared the same lock_class_key > with a, lockdep will take a and b as the same rw_sem and complain here. > do_suspend(b); > > // do recovery > do_hive_recovery() > > // unlock the reset_sem of all devices do_resume(a); > up_write(&a->reset_sem); do_resume(b); up_write(&b->reset_sem); Ah! Now I understand what you are working around. So the problem is the static lock_class_key in the macro? [Dennis Li] Yes. The author of init_rwsem might not consider our similar use case. > What we should do instead is to make sure we have only a single lock for the > complete hive instead. > [Dennis Li] If we use a single lock, users will must wait for all devices > resuming successfully, but in fact their tasks are only running in device a. > It is not friendly to users. Well that is intentional I would say. We can only restart submissions after all devices are resumed successfully, cause otherwise it can happen that a job on device A depends on memory on device B which isn't accessible. 
[Dennis Li] Yes, you are right. The driver has to make sure that the shared resources (such as the shared memory) are ready before unlocking the adevs' locks one by one. But each device still has private hardware resources such as video and display.

Regards,
Christian.

>
> Regards,
> Christian.
>
> Am 06.08.20 um 11:15 schrieb Li, Dennis:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> Hi, Christian,
>> For this case, it is safe to use separated lock key. Please see the
>> definition of init_rwsem as the below. Every init_rwsem calling will use a
>> new static key, but devices of the hive share the same code to do
>> initialization, so their lock_class_key are the same. I think it is a
>> limitation of init_rwsem. In our case, it should be correct that reset_sem
>> of each adev has different lock_class_key. BTW, this change doesn't effect
>> dead-lock detection and just correct it.
>>
>> #define init_rwsem(sem) \
>> do { \
>> static struct lock_class_key __key; \
>> \
>> __init_rwsem((sem), #sem, &__key); \
>> } while (0)
>>
>> Best Regards
>> Dennis Li
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: Thursday, August 6, 2020 4:13 PM
>> To: Li, Dennis ; amd-gfx@lists.freedesktop.org;
>> Deucher, Alexander ; Kuehling, Felix
>> ; Zhang, Hawking
>> Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive
>> locking
>>
>> Preventing locking problems during implementation is obviously a good
>> approach, but lockdep has proven to be massively useful for finding and
>> fixing problems.
>>
>> Disabling lockdep splat by annotating lock with separate classes is usually
>> a no-go and only allowed if there is no other potential approach.
>>
>> In this case here we should really clean things up instead.
>>
>> Christian.
>>
>> Am 06.08.20 um 09:44 schrieb Li, Dennis:
>>> [AMD Official Use Only - Internal Distribution Only]
>>>
>>> Hi, Christian,
>>> I agree with your concern. 
However we shouldn't rely on system to >>> detect dead-lock, and should consider this when doing code implementation. >>> In fact, dead-lock detection isn't enabled by default. >>> For your proposal to remove reset_sem into the hive structure, we >>> can open a new topic to discuss it. Currently we couldn't make sure which >>> is the best solution. For example, with your proposal, we must wait for all >>> devices resuming successfully before resubmit an old task in one device, >>> which will effect performance. >>> >>> Best Regards >>> Dennis Li >>> -Original Message- >>> From: amd-gfx On Behalf Of >>> Christian König >>> Sent: Thursday, August 6, 2020 3:08 PM >>> To: Li, Dennis ; amd-gfx@lists.freedesktop.org; >>> Deucher, Alexander ; Kuehling, Felix >>> ; Zhang, Hawking >>>
Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking
[SNIP] I think it is a limitation of init_rwsem. And exactly that's wrong, this is intentional and perfectly correct. [Dennis Li] I couldn't understand. Why is it perfectly correct? For example, we define two rw_semaphores: a and b. If we don't check the init_rwsem definition, we may think case #1 and case #2 have the same behavior, but in fact they are different.

Case 1:

    init_rwsem(&a);
    init_rwsem(&b);

Case 2:

    void init_rwsem_ext(struct rw_semaphore *sem)
    {
        init_rwsem(sem);
    }
    init_rwsem_ext(&a);
    init_rwsem_ext(&b);

As far as I know it is perfectly possible that the locks in the hive are not always grabbed in the same order. And that's why lockdep is complaining about this. [Dennis Li] No. Let me take a hive with two devices (a and b) to explain why lockdep complains:

    // First the driver locks the reset_sem of all devices
    down_write(&a->reset_sem);
    do_suspend(a);
    down_write(&b->reset_sem); // Because b shares the same lock_class_key with a, lockdep
                               // treats a and b as the same rw_sem and complains here.
    do_suspend(b);

    // do recovery
    do_hive_recovery();

    // unlock the reset_sem of all devices
    do_resume(a);
    up_write(&a->reset_sem);
    do_resume(b);
    up_write(&b->reset_sem);

Ah! Now I understand what you are working around. So the problem is the static lock_class_key in the macro? What we should do instead is to make sure we have only a single lock for the complete hive instead. [Dennis Li] If we use a single lock, users must wait for all devices to resume successfully, even when their tasks are only running on device a. It is not friendly to users. Well, that is intentional I would say. We can only restart submissions after all devices are resumed successfully, because otherwise it can happen that a job on device A depends on memory on device B which isn't accessible.

Regards,
Christian.

On 06.08.20 at 11:15, Li, Dennis wrote: [AMD Official Use Only - Internal Distribution Only] Hi Christian, for this case it is safe to use a separate lock key. 
Please see the definition of init_rwsem below. Every init_rwsem call will use a new static key, but the devices of the hive share the same code to do initialization, so their lock_class_keys are the same. I think it is a limitation of init_rwsem. In our case, it should be correct that the reset_sem of each adev has a different lock_class_key. BTW, this change doesn't affect dead-lock detection; it just corrects it.

    #define init_rwsem(sem)                             \
    do {                                                \
            static struct lock_class_key __key;         \
                                                        \
            __init_rwsem((sem), #sem, &__key);          \
    } while (0)

Best Regards
Dennis Li

-----Original Message-----
From: Koenig, Christian
Sent: Thursday, August 6, 2020 4:13 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Kuehling, Felix ; Zhang, Hawking
Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

Preventing locking problems during implementation is obviously a good approach, but lockdep has proven to be massively useful for finding and fixing problems. Disabling a lockdep splat by annotating locks with separate classes is usually a no-go and only allowed if there is no other potential approach. In this case here we should really clean things up instead. Christian.

On 06.08.20 at 09:44, Li, Dennis wrote: [AMD Official Use Only - Internal Distribution Only] Hi Christian, I agree with your concern. However, we shouldn't rely on the system to detect deadlocks; we should also consider this when doing the code implementation. In fact, deadlock detection isn't enabled by default. As for your proposal to move reset_sem into the hive structure, we can open a new topic to discuss it. Currently we can't be sure which is the best solution. For example, with your proposal we must wait for all devices to resume successfully before resubmitting an old task on one device, which will affect performance. 
Best Regards
Dennis Li

-----Original Message-----
From: amd-gfx On Behalf Of Christian König
Sent: Thursday, August 6, 2020 3:08 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Kuehling, Felix ; Zhang, Hawking
Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

On 06.08.20 at 09:02, Dennis Li wrote:
[ 584.110304]
[ 584.110590] WARNING: possible recursive locking detected
[ 584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE
[ 584.64]
[ 584.111456] kworker/38:1/553 is trying to acquire lock:
[ 584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[ 584.112112] but task is already holding lock:
[ 584.112673] 9b1603d247a0 (&adev->reset_sem)
Re: [PATCH] drm/amd/powerplay: update driver if file for sienna_cichlid
[AMD Official Use Only - Internal Distribution Only] Reviewed-by: Kenneth Feng On 2020/8/6 5:42 PM, "Gao, Likun" wrote: From: Likun Gao Update driver if file for sienna_cichlid. Signed-off-by: Likun Gao Change-Id: If405461cfbe0133ceb61fa123272b2e53db99755 --- .../drm/amd/powerplay/inc/smu11_driver_if_sienna_cichlid.h | 6 +++--- drivers/gpu/drm/amd/powerplay/inc/smu_v11_0.h | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/powerplay/inc/smu11_driver_if_sienna_cichlid.h b/drivers/gpu/drm/amd/powerplay/inc/smu11_driver_if_sienna_cichlid.h index aa2708fccb6d..5ef9c92f57c4 100644 --- a/drivers/gpu/drm/amd/powerplay/inc/smu11_driver_if_sienna_cichlid.h +++ b/drivers/gpu/drm/amd/powerplay/inc/smu11_driver_if_sienna_cichlid.h @@ -27,7 +27,7 @@ // *** IMPORTANT *** // SMU TEAM: Always increment the interface version if // any structure is changed in this file -#define SMU11_DRIVER_IF_VERSION 0x34 +#define SMU11_DRIVER_IF_VERSION 0x35 #define PPTABLE_Sienna_Cichlid_SMU_VERSION 5 @@ -127,7 +127,7 @@ #define FEATURE_DF_CSTATE_BIT 45 #define FEATURE_2_STEP_PSTATE_BIT 46 #define FEATURE_SMNCLK_DPM_BIT 47 -#define FEATURE_SPARE_48_BIT48 +#define FEATURE_PERLINK_GMIDOWN_BIT 48 #define FEATURE_GFX_EDC_BIT 49 #define FEATURE_SPARE_50_BIT50 #define FEATURE_SPARE_51_BIT51 @@ -169,7 +169,7 @@ typedef enum { #define DPM_OVERRIDE_DISABLE_DFLL_PLL_SHUTDOWN 0x0200 #define DPM_OVERRIDE_DISABLE_MEMORY_TEMPERATURE_READ 0x0400 #define DPM_OVERRIDE_DISABLE_VOLT_LINK_VCN_DCEFCLK 0x0800 -#define DPM_OVERRIDE_DISABLE_FAST_FCLK_TIMER 0x1000 +#define DPM_OVERRIDE_ENABLE_FAST_FCLK_TIMER 0x1000 #define DPM_OVERRIDE_DISABLE_VCN_PG 0x2000 #define DPM_OVERRIDE_DISABLE_FMAX_VMAX 0x4000 diff --git a/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0.h b/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0.h index 6a42331aba8a..737b6d14372c 100644 --- a/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0.h +++ b/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0.h @@ -30,7 +30,7 @@ #define 
SMU11_DRIVER_IF_VERSION_NV10 0x36 #define SMU11_DRIVER_IF_VERSION_NV12 0x33 #define SMU11_DRIVER_IF_VERSION_NV14 0x36 -#define SMU11_DRIVER_IF_VERSION_Sienna_Cichlid 0x34 +#define SMU11_DRIVER_IF_VERSION_Sienna_Cichlid 0x35 #define SMU11_DRIVER_IF_VERSION_Navy_Flounder 0x3 /* MP Apertures */ -- 2.25.1
RE: [PATCH] drm/amdgpu: annotate a false positive recursive locking
[AMD Official Use Only - Internal Distribution Only]

Hi Christian, see my comments below.

Best Regards
Dennis Li

-----Original Message-----
From: Koenig, Christian
Sent: Thursday, August 6, 2020 5:19 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Kuehling, Felix ; Zhang, Hawking
Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

> I think it is a limitation of init_rwsem.

And exactly that's wrong, this is intentional and perfectly correct.

[Dennis Li] I couldn't understand. Why is it perfectly correct? For example, we define two rw_semaphores: a and b. If we don't check the init_rwsem definition, we may think case #1 and case #2 have the same behavior, but in fact they are different.

Case 1:

    init_rwsem(&a);
    init_rwsem(&b);

Case 2:

    void init_rwsem_ext(struct rw_semaphore *sem)
    {
        init_rwsem(sem);
    }
    init_rwsem_ext(&a);
    init_rwsem_ext(&b);

As far as I know it is perfectly possible that the locks in the hive are not always grabbed in the same order. And that's why lockdep is complaining about this.

[Dennis Li] No. Let me take a hive with two devices (a and b) to explain why lockdep complains:

    // First the driver locks the reset_sem of all devices
    down_write(&a->reset_sem);
    do_suspend(a);
    down_write(&b->reset_sem); // Because b shares the same lock_class_key with a, lockdep
                               // treats a and b as the same rw_sem and complains here.
    do_suspend(b);

    // do recovery
    do_hive_recovery();

    // unlock the reset_sem of all devices
    do_resume(a);
    up_write(&a->reset_sem);
    do_resume(b);
    up_write(&b->reset_sem);

What we should do instead is to make sure we have only a single lock for the complete hive instead.

[Dennis Li] If we use a single lock, users must wait for all devices to resume successfully, even when their tasks are only running on device a. It is not friendly to users.

Regards,
Christian.

On 06.08.20 at 11:15, Li, Dennis wrote:
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi, Christian,
> For this case, it is safe to use separated lock key. 
Please see the > definition of init_rwsem as the below. Every init_rwsem calling will use a > new static key, but devices of the hive share the same code to do > initialization, so their lock_class_key are the same. I think it is a > limitation of init_rwsem. In our case, it should be correct that reset_sem > of each adev has different lock_class_key. BTW, this change doesn't effect > dead-lock detection and just correct it. > > #define init_rwsem(sem) \ > do { \ > static struct lock_class_key __key; \ > \ > __init_rwsem((sem), #sem, &__key); \ > } while (0) > > Best Regards > Dennis Li > -Original Message- > From: Koenig, Christian > Sent: Thursday, August 6, 2020 4:13 PM > To: Li, Dennis ; amd-gfx@lists.freedesktop.org; > Deucher, Alexander ; Kuehling, Felix > ; Zhang, Hawking > Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive > locking > > Preventing locking problems during implementation is obviously a good > approach, but lockdep has proven to be massively useful for finding and > fixing problems. > > Disabling lockdep splat by annotating lock with separate classes is usually a > no-go and only allowed if there is no other potential approach. > > In this case here we should really clean things up instead. > > Christian. > > Am 06.08.20 um 09:44 schrieb Li, Dennis: >> [AMD Official Use Only - Internal Distribution Only] >> >> Hi, Christian, >> I agree with your concern. However we shouldn't rely on system to >> detect dead-lock, and should consider this when doing code implementation. >> In fact, dead-lock detection isn't enabled by default. >> For your proposal to remove reset_sem into the hive structure, we >> can open a new topic to discuss it. Currently we couldn't make sure which is >> the best solution. For example, with your proposal, we must wait for all >> devices resuming successfully before resubmit an old task in one device, >> which will effect performance. 
>> >> Best Regards >> Dennis Li >> -Original Message- >> From: amd-gfx On Behalf Of >> Christian König >> Sent: Thursday, August 6, 2020 3:08 PM >> To: Li, Dennis ; amd-gfx@lists.freedesktop.org; >> Deucher, Alexander ; Kuehling, Felix >> ; Zhang, Hawking >> Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive >> locking >> >> Am 06.08.20 um 09:02 schrieb Dennis Li: >>> [ 584.110304] >>> [ 584.110590] WARNING: possible recursive locking detected >>> [ 584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE >>> [ 584.64] >>> [ 584.111456] kworker/38:1/553 is trying to acquire lock: >>> [ 584.111721] 9b15
[PATCH] drm/amdgpu: use mode1 reset by default for sienna_cichlid
From: Likun Gao

Switch the default GPU reset method for sienna_cichlid to MODE1 reset.

Signed-off-by: Likun Gao
Change-Id: I775e5a66bbac474f65ca8c999136ccaf9c1dc14e
---
 drivers/gpu/drm/amd/amdgpu/nv.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
index 74d02d270d34..da8024c2826e 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -362,10 +362,15 @@ nv_asic_reset_method(struct amdgpu_device *adev)
 		dev_warn(adev->dev, "Specified reset method:%d isn't supported, using AUTO instead.\n",
 			 amdgpu_reset_method);
 
-	if (smu_baco_is_support(smu))
-		return AMD_RESET_METHOD_BACO;
-	else
+	switch (adev->asic_type) {
+	case CHIP_SIENNA_CICHLID:
 		return AMD_RESET_METHOD_MODE1;
+	default:
+		if (smu_baco_is_support(smu))
+			return AMD_RESET_METHOD_BACO;
+		else
+			return AMD_RESET_METHOD_MODE1;
+	}
 }
 
 static int nv_asic_reset(struct amdgpu_device *adev)
-- 
2.25.1
[PATCH] drm/amd/powerplay: update driver if file for sienna_cichlid
From: Likun Gao Update drive if file for sienna_cichlid. Signed-off-by: Likun Gao Change-Id: If405461cfbe0133ceb61fa123272b2e53db99755 --- .../drm/amd/powerplay/inc/smu11_driver_if_sienna_cichlid.h | 6 +++--- drivers/gpu/drm/amd/powerplay/inc/smu_v11_0.h | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/powerplay/inc/smu11_driver_if_sienna_cichlid.h b/drivers/gpu/drm/amd/powerplay/inc/smu11_driver_if_sienna_cichlid.h index aa2708fccb6d..5ef9c92f57c4 100644 --- a/drivers/gpu/drm/amd/powerplay/inc/smu11_driver_if_sienna_cichlid.h +++ b/drivers/gpu/drm/amd/powerplay/inc/smu11_driver_if_sienna_cichlid.h @@ -27,7 +27,7 @@ // *** IMPORTANT *** // SMU TEAM: Always increment the interface version if // any structure is changed in this file -#define SMU11_DRIVER_IF_VERSION 0x34 +#define SMU11_DRIVER_IF_VERSION 0x35 #define PPTABLE_Sienna_Cichlid_SMU_VERSION 5 @@ -127,7 +127,7 @@ #define FEATURE_DF_CSTATE_BIT 45 #define FEATURE_2_STEP_PSTATE_BIT 46 #define FEATURE_SMNCLK_DPM_BIT 47 -#define FEATURE_SPARE_48_BIT48 +#define FEATURE_PERLINK_GMIDOWN_BIT 48 #define FEATURE_GFX_EDC_BIT 49 #define FEATURE_SPARE_50_BIT50 #define FEATURE_SPARE_51_BIT51 @@ -169,7 +169,7 @@ typedef enum { #define DPM_OVERRIDE_DISABLE_DFLL_PLL_SHUTDOWN 0x0200 #define DPM_OVERRIDE_DISABLE_MEMORY_TEMPERATURE_READ 0x0400 #define DPM_OVERRIDE_DISABLE_VOLT_LINK_VCN_DCEFCLK 0x0800 -#define DPM_OVERRIDE_DISABLE_FAST_FCLK_TIMER 0x1000 +#define DPM_OVERRIDE_ENABLE_FAST_FCLK_TIMER 0x1000 #define DPM_OVERRIDE_DISABLE_VCN_PG 0x2000 #define DPM_OVERRIDE_DISABLE_FMAX_VMAX 0x4000 diff --git a/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0.h b/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0.h index 6a42331aba8a..737b6d14372c 100644 --- a/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0.h +++ b/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0.h @@ -30,7 +30,7 @@ #define SMU11_DRIVER_IF_VERSION_NV10 0x36 #define SMU11_DRIVER_IF_VERSION_NV12 0x33 #define SMU11_DRIVER_IF_VERSION_NV14 0x36 -#define 
SMU11_DRIVER_IF_VERSION_Sienna_Cichlid 0x34 +#define SMU11_DRIVER_IF_VERSION_Sienna_Cichlid 0x35 #define SMU11_DRIVER_IF_VERSION_Navy_Flounder 0x3 /* MP Apertures */ -- 2.25.1
Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking
> I think it is a limitation of init_rwsem.

And exactly that's wrong, this is intentional and perfectly correct.

As far as I know it is perfectly possible that the locks in the hive are not always grabbed in the same order. And that's why lockdep is complaining about this.

What we should do instead is to make sure we have only a single lock for the complete hive instead.

Regards,
Christian.

On 06.08.20 at 11:15, Li, Dennis wrote: [AMD Official Use Only - Internal Distribution Only] Hi Christian, for this case it is safe to use a separate lock key. Please see the definition of init_rwsem below. Every init_rwsem call will use a new static key, but the devices of the hive share the same code to do initialization, so their lock_class_keys are the same. I think it is a limitation of init_rwsem. In our case, it should be correct that the reset_sem of each adev has a different lock_class_key. BTW, this change doesn't affect dead-lock detection; it just corrects it.

    #define init_rwsem(sem)                             \
    do {                                                \
            static struct lock_class_key __key;         \
                                                        \
            __init_rwsem((sem), #sem, &__key);          \
    } while (0)

Best Regards
Dennis Li

-----Original Message-----
From: Koenig, Christian
Sent: Thursday, August 6, 2020 4:13 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Kuehling, Felix ; Zhang, Hawking
Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

Preventing locking problems during implementation is obviously a good approach, but lockdep has proven to be massively useful for finding and fixing problems.

Disabling a lockdep splat by annotating locks with separate classes is usually a no-go and only allowed if there is no other potential approach.

In this case here we should really clean things up instead.

Christian.

On 06.08.20 at 09:44, Li, Dennis wrote: [AMD Official Use Only - Internal Distribution Only] Hi Christian, I agree with your concern. However, we shouldn't rely on the system to detect deadlocks; we should also consider this when doing the code implementation. 
In fact, dead-lock detection isn't enabled by default. For your proposal to remove reset_sem into the hive structure, we can open a new topic to discuss it. Currently we couldn't make sure which is the best solution. For example, with your proposal, we must wait for all devices resuming successfully before resubmit an old task in one device, which will effect performance. Best Regards Dennis Li -Original Message- From: amd-gfx On Behalf Of Christian König Sent: Thursday, August 6, 2020 3:08 PM To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Kuehling, Felix ; Zhang, Hawking Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking Am 06.08.20 um 09:02 schrieb Dennis Li: [ 584.110304] [ 584.110590] WARNING: possible recursive locking detected [ 584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE [ 584.64] [ 584.111456] kworker/38:1/553 is trying to acquire lock: [ 584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.112112] but task is already holding lock: [ 584.112673] 9b1603d247a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.113068] other info that might help us debug this: [ 584.113689] Possible unsafe locking scenario: [ 584.114350]CPU0 [ 584.114685] [ 584.115014] lock(&adev->reset_sem); [ 584.115349] lock(&adev->reset_sem); [ 584.115678] *** DEADLOCK *** [ 584.116624] May be due to missing lock nesting notation [ 584.117284] 4 locks held by kworker/38:1/553: [ 584.117616] #0: 9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630 [ 584.117967] #1: ac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630 [ 584.118358] #2: c1c2a5d0 (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] [ 584.118786] #3: 9b1603d247a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.119222] stack backtrace: [ 584.119990] CPU: 38 PID: 
553 Comm: kworker/38:1 Kdump: loaded Tainted: G OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 [ 584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019 [ 584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu] [ 584.121638] Call Trace: [ 584.122050] dump_stack+0x98/0xd5 [ 584.122499] __lock_acquire+0x1139/0x16e0 [ 584.122931] ? trace_hardirqs_on+0x3b/0xf0 [ 584.123358] ? canc
RE: [PATCH] drm/amdgpu: annotate a false positive recursive locking
[AMD Official Use Only - Internal Distribution Only]
Hi, Christian,
For this case, it is safe to use a separate lock key. Please see the definition of init_rwsem below. Every init_rwsem call uses a new static key, but the devices of the hive share the same initialization code, so their lock_class_keys are the same. I think it is a limitation of init_rwsem. In our case, it is correct for the reset_sem of each adev to have a different lock_class_key. BTW, this change doesn't affect deadlock detection; it just corrects the annotation.

#define init_rwsem(sem)					\
do {							\
	static struct lock_class_key __key;		\
							\
	__init_rwsem((sem), #sem, &__key);		\
} while (0)

Best Regards
Dennis Li

-----Original Message-----
From: Koenig, Christian
Sent: Thursday, August 6, 2020 4:13 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Kuehling, Felix ; Zhang, Hawking
Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

Preventing locking problems during implementation is obviously a good approach, but lockdep has proven to be massively useful for finding and fixing problems. Disabling a lockdep splat by annotating locks with separate classes is usually a no-go and only allowed if there is no other potential approach. In this case here we should really clean things up instead.

Christian.

On 06.08.20 at 09:44, Li, Dennis wrote:
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi, Christian,
> I agree with your concern. However we shouldn't rely on the system to
> detect deadlocks; we should consider this when doing the code implementation. In
> fact, deadlock detection isn't enabled by default.
> For your proposal to move reset_sem into the hive structure, we can
> open a new topic to discuss it. Currently we can't be sure which is the
> best solution. For example, with your proposal, we must wait for all devices
> to resume successfully before resubmitting an old task on one device, which will
> affect performance.
> > Best Regards > Dennis Li > -Original Message- > From: amd-gfx On Behalf Of > Christian König > Sent: Thursday, August 6, 2020 3:08 PM > To: Li, Dennis ; amd-gfx@lists.freedesktop.org; > Deucher, Alexander ; Kuehling, Felix > ; Zhang, Hawking > Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive > locking > > Am 06.08.20 um 09:02 schrieb Dennis Li: >> [ 584.110304] >> [ 584.110590] WARNING: possible recursive locking detected >> [ 584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE >> [ 584.64] >> [ 584.111456] kworker/38:1/553 is trying to acquire lock: >> [ 584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at: >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.112112] >> but task is already holding lock: >> [ 584.112673] 9b1603d247a0 (&adev->reset_sem){}, at: >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.113068] >> other info that might help us debug this: >> [ 584.113689] Possible unsafe locking scenario: >> >> [ 584.114350]CPU0 >> [ 584.114685] >> [ 584.115014] lock(&adev->reset_sem); >> [ 584.115349] lock(&adev->reset_sem); >> [ 584.115678] >> *** DEADLOCK *** >> >> [ 584.116624] May be due to missing lock nesting notation >> >> [ 584.117284] 4 locks held by kworker/38:1/553: >> [ 584.117616] #0: 9ad635c1d348 ((wq_completion)events){+.+.}, >> at: process_one_work+0x21f/0x630 [ 584.117967] #1: ac708e1c3e58 >> ((work_completion)(&con->recovery_work)){+.+.}, at: >> process_one_work+0x21f/0x630 [ 584.118358] #2: c1c2a5d0 >> (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] >> [ 584.118786] #3: 9b1603d247a0 (&adev->reset_sem){}, at: >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.119222] >> stack backtrace: >> [ 584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G >> OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 >> [ 584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, >> BIOS 3.1 05/23/2019 [ 584.121223] Workqueue: events >> amdgpu_ras_do_recovery [amdgpu] [ 
584.121638] Call Trace: >> [ 584.122050] dump_stack+0x98/0xd5 >> [ 584.122499] __lock_acquire+0x1139/0x16e0 [ 584.122931] ? >> trace_hardirqs_on+0x3b/0xf0 [ 584.123358] ? >> cancel_delayed_work+0xa6/0xc0 [ 584.123771] lock_acquire+0xb8/0x1c0 >> [ 584.124197] ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ >> 584.124599] down_write+0x49/0x120 [ 584.125032] ? >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125472] >> amdgpu_device_
Re: [PATCH] drm/amdgpu: Enable P2P dmabuf over XGMI
Am 06.08.20 um 11:04 schrieb Arunpravin: Access the exported P2P dmabuf over XGMI, if available. Otherwise, fall back to the existing PCIe method. Signed-off-by: Arunpravin Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 34 + drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.h | 2 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 19 ++-- 3 files changed, 52 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c index ffeb20f11c07..589d008f91df 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c @@ -35,6 +35,7 @@ #include "amdgpu_display.h" #include "amdgpu_gem.h" #include "amdgpu_dma_buf.h" +#include "amdgpu_xgmi.h" #include #include #include @@ -560,3 +561,36 @@ struct drm_gem_object *amdgpu_gem_prime_import(struct drm_device *dev, obj->import_attach = attach; return obj; } + +/** + * amdgpu_dmabuf_is_xgmi_accessible - Check if xgmi available for P2P transfer + * + * @adev: amdgpu_device pointer of the importer + * @bo: amdgpu buffer object + * + * Returns: + * True if dmabuf accessible over xgmi, false otherwise. 
+ */ +bool amdgpu_dmabuf_is_xgmi_accessible(struct amdgpu_device *adev, + struct amdgpu_bo *bo) +{ + struct drm_gem_object *obj = &bo->tbo.base; + struct drm_gem_object *gobj; + + if (obj->import_attach) { + struct dma_buf *dma_buf = obj->import_attach->dmabuf; + + if (dma_buf->ops != &amdgpu_dmabuf_ops) + /* No XGMI with non AMD GPUs */ + return false; + + gobj = dma_buf->priv; + bo = gem_to_amdgpu_bo(gobj); + } + + if (amdgpu_xgmi_same_hive(adev, amdgpu_ttm_adev(bo->tbo.bdev)) && + (bo->preferred_domains & AMDGPU_GEM_DOMAIN_VRAM)) + return true; + + return false; +} diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.h index ec447a7b6b28..2c5c84a06bb9 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.h @@ -29,6 +29,8 @@ struct dma_buf *amdgpu_gem_prime_export(struct drm_gem_object *gobj, int flags); struct drm_gem_object *amdgpu_gem_prime_import(struct drm_device *dev, struct dma_buf *dma_buf); +bool amdgpu_dmabuf_is_xgmi_accessible(struct amdgpu_device *adev, + struct amdgpu_bo *bo); void *amdgpu_gem_prime_vmap(struct drm_gem_object *obj); void amdgpu_gem_prime_vunmap(struct drm_gem_object *obj, void *vaddr); int amdgpu_gem_prime_mmap(struct drm_gem_object *obj, diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 71e005cf2952..771c27478bb1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include "amdgpu.h" @@ -35,6 +36,7 @@ #include "amdgpu_amdkfd.h" #include "amdgpu_gmc.h" #include "amdgpu_xgmi.h" +#include "amdgpu_dma_buf.h" /** * DOC: GPUVM @@ -1778,15 +1780,24 @@ int amdgpu_vm_bo_update(struct amdgpu_device *adev, struct amdgpu_bo_va *bo_va, nodes = NULL; resv = vm->root.base.bo->tbo.base.resv; } else { + struct drm_gem_object *obj = &bo->tbo.base; struct ttm_dma_tt *ttm; + resv = bo->tbo.base.resv; 
+ if (obj->import_attach && bo_va->is_xgmi) { + struct dma_buf *dma_buf = obj->import_attach->dmabuf; + struct drm_gem_object *gobj = dma_buf->priv; + struct amdgpu_bo *abo = gem_to_amdgpu_bo(gobj); + + if (abo->tbo.mem.mem_type == TTM_PL_VRAM) + bo = gem_to_amdgpu_bo(gobj); + } mem = &bo->tbo.mem; nodes = mem->mm_node; if (mem->mem_type == TTM_PL_TT) { ttm = container_of(bo->tbo.ttm, struct ttm_dma_tt, ttm); pages_addr = ttm->dma_address; } - resv = bo->tbo.base.resv; } if (bo) { @@ -2132,8 +2143,10 @@ struct amdgpu_bo_va *amdgpu_vm_bo_add(struct amdgpu_device *adev, INIT_LIST_HEAD(&bo_va->valids); INIT_LIST_HEAD(&bo_va->invalids); - if (bo && amdgpu_xgmi_same_hive(adev, amdgpu_ttm_adev(bo->tbo.bdev)) && - (bo->preferred_domains & AMDGPU_GEM_DOMAIN_VRAM)) { + if (!bo) + return bo_va; + + if (amdgpu_dmabuf_is_xgmi_accessible(adev, bo)) { bo_va->is_xgmi = true; /* Power up XGMI if it can be pot
[PATCH] drm/amdgpu: Enable P2P dmabuf over XGMI
Access the exported P2P dmabuf over XGMI, if available. Otherwise, fall back to the existing PCIe method. Signed-off-by: Arunpravin --- drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 34 + drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.h | 2 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 19 ++-- 3 files changed, 52 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c index ffeb20f11c07..589d008f91df 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c @@ -35,6 +35,7 @@ #include "amdgpu_display.h" #include "amdgpu_gem.h" #include "amdgpu_dma_buf.h" +#include "amdgpu_xgmi.h" #include #include #include @@ -560,3 +561,36 @@ struct drm_gem_object *amdgpu_gem_prime_import(struct drm_device *dev, obj->import_attach = attach; return obj; } + +/** + * amdgpu_dmabuf_is_xgmi_accessible - Check if xgmi available for P2P transfer + * + * @adev: amdgpu_device pointer of the importer + * @bo: amdgpu buffer object + * + * Returns: + * True if dmabuf accessible over xgmi, false otherwise. 
+ */ +bool amdgpu_dmabuf_is_xgmi_accessible(struct amdgpu_device *adev, + struct amdgpu_bo *bo) +{ + struct drm_gem_object *obj = &bo->tbo.base; + struct drm_gem_object *gobj; + + if (obj->import_attach) { + struct dma_buf *dma_buf = obj->import_attach->dmabuf; + + if (dma_buf->ops != &amdgpu_dmabuf_ops) + /* No XGMI with non AMD GPUs */ + return false; + + gobj = dma_buf->priv; + bo = gem_to_amdgpu_bo(gobj); + } + + if (amdgpu_xgmi_same_hive(adev, amdgpu_ttm_adev(bo->tbo.bdev)) && + (bo->preferred_domains & AMDGPU_GEM_DOMAIN_VRAM)) + return true; + + return false; +} diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.h index ec447a7b6b28..2c5c84a06bb9 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.h @@ -29,6 +29,8 @@ struct dma_buf *amdgpu_gem_prime_export(struct drm_gem_object *gobj, int flags); struct drm_gem_object *amdgpu_gem_prime_import(struct drm_device *dev, struct dma_buf *dma_buf); +bool amdgpu_dmabuf_is_xgmi_accessible(struct amdgpu_device *adev, + struct amdgpu_bo *bo); void *amdgpu_gem_prime_vmap(struct drm_gem_object *obj); void amdgpu_gem_prime_vunmap(struct drm_gem_object *obj, void *vaddr); int amdgpu_gem_prime_mmap(struct drm_gem_object *obj, diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 71e005cf2952..771c27478bb1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include "amdgpu.h" @@ -35,6 +36,7 @@ #include "amdgpu_amdkfd.h" #include "amdgpu_gmc.h" #include "amdgpu_xgmi.h" +#include "amdgpu_dma_buf.h" /** * DOC: GPUVM @@ -1778,15 +1780,24 @@ int amdgpu_vm_bo_update(struct amdgpu_device *adev, struct amdgpu_bo_va *bo_va, nodes = NULL; resv = vm->root.base.bo->tbo.base.resv; } else { + struct drm_gem_object *obj = &bo->tbo.base; struct ttm_dma_tt *ttm; + resv = bo->tbo.base.resv; 
+ if (obj->import_attach && bo_va->is_xgmi) { + struct dma_buf *dma_buf = obj->import_attach->dmabuf; + struct drm_gem_object *gobj = dma_buf->priv; + struct amdgpu_bo *abo = gem_to_amdgpu_bo(gobj); + + if (abo->tbo.mem.mem_type == TTM_PL_VRAM) + bo = gem_to_amdgpu_bo(gobj); + } mem = &bo->tbo.mem; nodes = mem->mm_node; if (mem->mem_type == TTM_PL_TT) { ttm = container_of(bo->tbo.ttm, struct ttm_dma_tt, ttm); pages_addr = ttm->dma_address; } - resv = bo->tbo.base.resv; } if (bo) { @@ -2132,8 +2143,10 @@ struct amdgpu_bo_va *amdgpu_vm_bo_add(struct amdgpu_device *adev, INIT_LIST_HEAD(&bo_va->valids); INIT_LIST_HEAD(&bo_va->invalids); - if (bo && amdgpu_xgmi_same_hive(adev, amdgpu_ttm_adev(bo->tbo.bdev)) && - (bo->preferred_domains & AMDGPU_GEM_DOMAIN_VRAM)) { + if (!bo) + return bo_va; + + if (amdgpu_dmabuf_is_xgmi_accessible(adev, bo)) { bo_va->is_xgmi = true; /* Power up XGMI if it can be potentially used */ amdgpu_xgmi_set_pstate(adev, AMDGPU_XGMI_PSTATE_MAX_VEG
RE: [PATCH] drm/amd/mst: clean DP main link status only when unplug mst 1st link
[AMD Public Use] > -Original Message- > From: Siqueira, Rodrigo > Sent: Wednesday, August 5, 2020 10:55 PM > To: Lin, Wayne > Cc: amd-gfx@lists.freedesktop.org; Lakha, Bhawanpreet > ; Wu, Hersen ; > Kazlauskas, Nicholas ; Zuo, Jerry > > Subject: Re: [PATCH] drm/amd/mst: clean DP main link status only when > unplug mst 1st link > > On 08/04, Wayne Lin wrote: > > [Why] > > Under DP daisy chain scenario as below: > > > > Src - Monitor_1 - Monitor_2 > > > > If unplug 2nd Monitor_2 and plug in again, observe that Monitor_1 > > doesn't light up. > > > > When unplug 2nd monitor, we clear the > > dc_link->cur_link_settings.lane_count in dm_dp_destroy_mst_connector(). > > However this link status is a shared data structure by all connected > > mst monitors. Although the 2nd monitor is gone, this link status > > should still be retained for other connected mst monitors. > > Otherwise, when we plug the 2nd monitor in again, we find out that > > main link is not trained and do link training again. Payload ID > > Table for Monitor_1 is ruined and we don't reallocate it. > > > > [How] > > In dm_dp_destroy_mst_connector(), only clean the cur_link_settings > > when we no longer do mst mode. 
> > > > Signed-off-by: Wayne Lin > > --- > > drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c | 5 > - > > 1 file changed, 4 insertions(+), 1 deletion(-) > > > > diff --git > > a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c > > b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c > > index 2c10352fa514..526f29598403 100644 > > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c > > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c > > @@ -415,7 +415,10 @@ static void dm_dp_destroy_mst_connector(struct > drm_dp_mst_topology_mgr *mgr, > >aconnector->dc_sink); > > dc_sink_release(aconnector->dc_sink); > > aconnector->dc_sink = NULL; > > - aconnector->dc_link->cur_link_settings.lane_count = 0; > > + mutex_lock(&mgr->lock); > > + if (!mgr->mst_state) > > + aconnector->dc_link->cur_link_settings.lane_count = 0; > > + mutex_unlock(&mgr->lock); > Hi Wayne, > > The change looks good to me. > > Reviewed-by: Rodrigo Siqueira > > Just for curiosity, do you know why we use a mutex instead of > spin_lock_irq for this case? FWIU, the spin_lock_irq behavior looks > better for this sort of manipulation. Hi Siqueira, Thanks for your time. AFAIK, changing mst_state (e.g. enabling MST mode) involves some reading/writing steps on DPCD which might cost some time. > > Thanks > > > } > > > > drm_connector_unregister(connector); > > -- > > 2.17.1 > > > > -- > Rodrigo Siqueira > https://siqueira.tech -- Wayne Lin ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking
Preventing locking problems during implementation is obviously a good approach, but lockdep has proven to be massively useful for finding and fixing problems. Disabling a lockdep splat by annotating locks with separate classes is usually a no-go and only allowed if there is no other potential approach. In this case here we should really clean things up instead.

Christian.

On 06.08.20 at 09:44, Li, Dennis wrote:
[AMD Official Use Only - Internal Distribution Only]
Hi, Christian,
I agree with your concern. However we shouldn't rely on the system to detect deadlocks; we should consider this when doing the code implementation. In fact, deadlock detection isn't enabled by default. For your proposal to move reset_sem into the hive structure, we can open a new topic to discuss it. Currently we can't be sure which is the best solution. For example, with your proposal, we must wait for all devices to resume successfully before resubmitting an old task on one device, which will affect performance.

Best Regards
Dennis Li

-----Original Message-----
From: amd-gfx On Behalf Of Christian König
Sent: Thursday, August 6, 2020 3:08 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Kuehling, Felix ; Zhang, Hawking
Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

On 06.08.20 at 09:02, Dennis Li wrote:
[ 584.110304] [ 584.110590] WARNING: possible recursive locking detected [ 584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE [ 584.64] [ 584.111456] kworker/38:1/553 is trying to acquire lock: [ 584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.112112] but task is already holding lock: [ 584.112673] 9b1603d247a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.113068] other info that might help us debug this: [ 584.113689] Possible unsafe locking scenario: [ 584.114350]CPU0 [ 584.114685] [ 584.115014] lock(&adev->reset_sem); [ 584.115349]
lock(&adev->reset_sem); [ 584.115678] *** DEADLOCK *** [ 584.116624] May be due to missing lock nesting notation [ 584.117284] 4 locks held by kworker/38:1/553: [ 584.117616] #0: 9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630 [ 584.117967] #1: ac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630 [ 584.118358] #2: c1c2a5d0 (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] [ 584.118786] #3: 9b1603d247a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.119222] stack backtrace: [ 584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 [ 584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019 [ 584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu] [ 584.121638] Call Trace: [ 584.122050] dump_stack+0x98/0xd5 [ 584.122499] __lock_acquire+0x1139/0x16e0 [ 584.122931] ? trace_hardirqs_on+0x3b/0xf0 [ 584.123358] ? cancel_delayed_work+0xa6/0xc0 [ 584.123771] lock_acquire+0xb8/0x1c0 [ 584.124197] ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.124599] down_write+0x49/0x120 [ 584.125032] ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125472] amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125910] ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [ 584.126367] amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [ 584.126789] process_one_work+0x29e/0x630 [ 584.127208] worker_thread+0x3c/0x3f0 [ 584.127621] ? __kthread_parkme+0x61/0x90 [ 584.128014] kthread+0x12f/0x150 [ 584.128402] ? process_one_work+0x630/0x630 [ 584.128790] ? kthread_park+0x90/0x90 [ 584.129174] ret_from_fork+0x3a/0x50 Each adev has owned lock_class_key to avoid false positive recursive locking. NAK, that is not a false positive but a real problem. The issue here is that we have multiple reset semaphores, one for each device in the hive. 
If those are not acquired in the correct order we deadlock. The real solution would be to move the reset_sem into the hive structure and make sure that we lock it only once. Christian. Signed-off-by: Dennis Li Change-Id: I7571efeccbf15483982031d00504a353031a854a diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index e97c088d03b3..766dc8f8c8a0 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -967,6 +967,7 @@ struct amdgpu_device { atomic_tin_gpu_reset; enum pp_mp1_state mp1_state; struct rw_semaphore reset_sem; + struct lock_
RE: [PATCH] drm/amdgpu: annotate a false positive recursive locking
[AMD Official Use Only - Internal Distribution Only]
Hi, Christian,
I agree with your concern. However we shouldn't rely on the system to detect deadlocks; we should consider this when doing the code implementation. In fact, deadlock detection isn't enabled by default. For your proposal to move reset_sem into the hive structure, we can open a new topic to discuss it. Currently we can't be sure which is the best solution. For example, with your proposal, we must wait for all devices to resume successfully before resubmitting an old task on one device, which will affect performance.

Best Regards
Dennis Li

-----Original Message-----
From: amd-gfx On Behalf Of Christian König
Sent: Thursday, August 6, 2020 3:08 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Kuehling, Felix ; Zhang, Hawking
Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

On 06.08.20 at 09:02, Dennis Li wrote:
> [ 584.110304] > [ 584.110590] WARNING: possible recursive locking detected > [ 584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE > [ 584.64] > [ 584.111456] kworker/38:1/553 is trying to acquire lock: > [ 584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at: > amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.112112] > but task is already holding lock: > [ 584.112673] 9b1603d247a0 (&adev->reset_sem){}, at: > amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.113068] > other info that might help us debug this: > [ 584.113689] Possible unsafe locking scenario: > > [ 584.114350]CPU0 > [ 584.114685] > [ 584.115014] lock(&adev->reset_sem); > [ 584.115349] lock(&adev->reset_sem); > [ 584.115678] > *** DEADLOCK *** > > [ 584.116624] May be due to missing lock nesting notation > > [ 584.117284] 4 locks held by kworker/38:1/553: > [ 584.117616] #0: 9ad635c1d348 ((wq_completion)events){+.+.}, > at: process_one_work+0x21f/0x630 [ 584.117967] #1: ac708e1c3e58 > ((work_completion)(&con->recovery_work)){+.+.}, at: > process_one_work+0x21f/0x630 [ 
584.118358] #2: c1c2a5d0 > (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] [ > 584.118786] #3: 9b1603d247a0 (&adev->reset_sem){}, at: > amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.119222] > stack backtrace: > [ 584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G > OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 > [ 584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, > BIOS 3.1 05/23/2019 [ 584.121223] Workqueue: events > amdgpu_ras_do_recovery [amdgpu] [ 584.121638] Call Trace: > [ 584.122050] dump_stack+0x98/0xd5 > [ 584.122499] __lock_acquire+0x1139/0x16e0 [ 584.122931] ? > trace_hardirqs_on+0x3b/0xf0 [ 584.123358] ? > cancel_delayed_work+0xa6/0xc0 [ 584.123771] lock_acquire+0xb8/0x1c0 > [ 584.124197] ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ > 584.124599] down_write+0x49/0x120 [ 584.125032] ? > amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125472] > amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125910] ? > amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [ 584.126367] > amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [ 584.126789] > process_one_work+0x29e/0x630 [ 584.127208] worker_thread+0x3c/0x3f0 > [ 584.127621] ? __kthread_parkme+0x61/0x90 [ 584.128014] > kthread+0x12f/0x150 [ 584.128402] ? process_one_work+0x630/0x630 [ > 584.128790] ? kthread_park+0x90/0x90 [ 584.129174] > ret_from_fork+0x3a/0x50 > > Each adev has owned lock_class_key to avoid false positive recursive > locking. NAK, that is not a false positive but a real problem. The issue here is that we have multiple reset semaphores, one for each device in the hive. If those are not acquired in the correct order we deadlock. The real solution would be to move the reset_sem into the hive structure and make sure that we lock it only once. Christian. 
> > Signed-off-by: Dennis Li > Change-Id: I7571efeccbf15483982031d00504a353031a854a > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > index e97c088d03b3..766dc8f8c8a0 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > @@ -967,6 +967,7 @@ struct amdgpu_device { > atomic_tin_gpu_reset; > enum pp_mp1_state mp1_state; > struct rw_semaphore reset_sem; > + struct lock_class_key lock_key; > struct amdgpu_doorbell_index doorbell_index; > > struct mutexnotifier_lock; > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > index 6c572db42d92..d78df9312d34 10
Re: [PATCH] drm/amdgpu: fix spelling mistake "Falied" -> "Failed"
Hi all, On Wed, 05 Aug 2020 15:19:38 -0700 Joe Perches wrote: > > On Wed, 2020-08-05 at 17:27 -0400, Alex Deucher wrote: > > On Wed, Aug 5, 2020 at 4:53 PM Joe Perches wrote: > > > On Wed, 2020-08-05 at 16:01 -0400, Alex Deucher wrote: > > > > On Wed, Aug 5, 2020 at 7:35 AM Colin King > > > > wrote: > > > > > From: Colin Ian King > > > > > > > > > > There is a spelling mistake in a DRM_ERROR message. Fix it. > > > > > > > > > > Signed-off-by: Colin Ian King > > > > > > > > This is already fixed. > > > > > > This fix is not in today's -next. > > > > > > Perhaps whatever tree it's fixed in should be in -next. > > > > > > > Weird. It's in the drm-next tree as: > > > > commit 4afaa61db9cf5250b5734c2531b226e7b3a3d691 > > Author: Colin Ian King > > Date: Fri Jul 10 09:37:58 2020 +0100 > > > > drm/amdgpu: fix spelling mistake "Falied" -> "Failed" > > > > There is a spelling mistake in a DRM_ERROR error message. Fix it. > > > > Signed-off-by: Colin Ian King > > Signed-off-by: Alex Deucher > > > > Alex > > > > > $ git show --oneline -s > > > d15fe4ec0435 (HEAD, tag: next-20200805, origin/master, origin/HEAD) Add > > > linux-next specific files for 20200805 > > > > > > $ git grep -i falied drivers > > > drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c:DRM_ERROR("Falied > > > to terminate tmr\n"); > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > > > [] > > > > > @@ -2010,7 +2010,7 @@ static int psp_suspend(void *handle) > > > > > > > > > > ret = psp_tmr_terminate(psp); > > > > > if (ret) { > > > > > - DRM_ERROR("Falied to terminate tmr\n"); > > > > > + DRM_ERROR("Failed to terminate tmr\n"); > > > > > return ret; > > > > > } > > Dunno. > > Maybe it's due to some ordering of trees in > how -next accumulates patches? 
The spelling error is introduced in two commits:

c564b8601ae9 ("drm/amdgpu: add TMR destory function for psp") in Linus' tree between v5.8-rc4 and rc5

90937420c44f ("drm/amdgpu: add TMR destory function for psp") in the amdgpu tree between two merges by the drm tree.

In this same interval, the error is corrected by commit 4afaa61db9cf ("drm/amdgpu: fix spelling mistake "Falied" -> "Failed"") so when David comes to merge the amdgpu tree in commit 206739119508 ("Merge tag 'amd-drm-next-5.9-2020-07-17' of git://people.freedesktop.org/~agd5f/linux into drm-next") the spelling error has been introduced on one side of the merge, and introduced and corrected on the other. This would have produced a conflict which David presumably resolved in haste by picking the HEAD side of the merge instead of the MERGE_HEAD side (it happens).

This could have been avoided by not cherry-picking fix commits around in the amdgpu process - instead having a fixes branch that is merged into the next branch after the fixes branch has been accepted upstream (that way there is only one commit for each fix and fewer conflicts). I have to deal with this sort of conflict (sometimes daily) due to the drm processes. It's a pain as I have to track down each conflict to see if the same patches appear on both sides of merges and then try to figure out what other changes occur. (This is only slightly helped by having the "cherry-picked from" tags in the fix commits.)

--
Cheers,
Stephen Rothwell
[PATCH] drm/amdgpu: Skip some registers config for SRIOV
For VF, the registers L2_CNTL, L2_CONTEXT1_IDENTITY_APERTURE and
L2_PROTECTION_FAULT_CNTL are not accessible; skip configuring them in
SRIOV mode.

Signed-off-by: Liu ChengZhe
---
 drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c | 12 ++--
 drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c  | 12 ++--
 2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
index 1f6112b7fa49..6b96f45fde2a 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
@@ -330,10 +330,13 @@ int gfxhub_v2_1_gart_enable(struct amdgpu_device *adev)
 	gfxhub_v2_1_init_gart_aperture_regs(adev);
 	gfxhub_v2_1_init_system_aperture_regs(adev);
 	gfxhub_v2_1_init_tlb_regs(adev);
-	gfxhub_v2_1_init_cache_regs(adev);
+	if (!amdgpu_sriov_vf(adev))
+		gfxhub_v2_1_init_cache_regs(adev);

 	gfxhub_v2_1_enable_system_domain(adev);
-	gfxhub_v2_1_disable_identity_aperture(adev);
+	if (!amdgpu_sriov_vf(adev))
+		gfxhub_v2_1_disable_identity_aperture(adev);
+
 	gfxhub_v2_1_setup_vmid_config(adev);
 	gfxhub_v2_1_program_invalidation(adev);

@@ -372,7 +375,12 @@ void gfxhub_v2_1_gart_disable(struct amdgpu_device *adev)
 void gfxhub_v2_1_set_fault_enable_default(struct amdgpu_device *adev,
 					  bool value)
 {
+	/*These regs are not accessible for VF, PF will program in SRIOV */
+	if (amdgpu_sriov_vf(adev))
+		return;
+
 	u32 tmp;
+
 	tmp = RREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL);
 	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
 			    RANGE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c
index d83912901f73..9cfde9b81600 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c
@@ -321,10 +321,13 @@ int mmhub_v2_0_gart_enable(struct amdgpu_device *adev)
 	mmhub_v2_0_init_gart_aperture_regs(adev);
 	mmhub_v2_0_init_system_aperture_regs(adev);
 	mmhub_v2_0_init_tlb_regs(adev);
-	mmhub_v2_0_init_cache_regs(adev);
+	if (!amdgpu_sriov_vf(adev))
+		mmhub_v2_0_init_cache_regs(adev);

 	mmhub_v2_0_enable_system_domain(adev);
-	mmhub_v2_0_disable_identity_aperture(adev);
+	if (!amdgpu_sriov_vf(adev))
+		mmhub_v2_0_disable_identity_aperture(adev);
+
 	mmhub_v2_0_setup_vmid_config(adev);
 	mmhub_v2_0_program_invalidation(adev);

@@ -364,7 +367,12 @@ void mmhub_v2_0_gart_disable(struct amdgpu_device *adev)
  */
 void mmhub_v2_0_set_fault_enable_default(struct amdgpu_device *adev, bool value)
 {
+	/*These regs are not accessible for VF, PF will program in SRIOV */
+	if (amdgpu_sriov_vf(adev))
+		return;
+
 	u32 tmp;
+
 	tmp = RREG32_SOC15(MMHUB, 0, mmMMVM_L2_PROTECTION_FAULT_CNTL);
 	tmp = REG_SET_FIELD(tmp, MMVM_L2_PROTECTION_FAULT_CNTL,
 			    RANGE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
--
2.25.1
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] drm/amdgpu: fix spelling mistake "Falied" -> "Failed"
On Thu, Aug 06, 2020 at 09:36:41AM +1000, Stephen Rothwell wrote:
> Hi all,
>
> On Wed, 05 Aug 2020 15:19:38 -0700 Joe Perches wrote:
> >
> > On Wed, 2020-08-05 at 17:27 -0400, Alex Deucher wrote:
> > > On Wed, Aug 5, 2020 at 4:53 PM Joe Perches wrote:
> > > > On Wed, 2020-08-05 at 16:01 -0400, Alex Deucher wrote:
> > > > > On Wed, Aug 5, 2020 at 7:35 AM Colin King wrote:
> > > > > > From: Colin Ian King
> > > > > >
> > > > > > There is a spelling mistake in a DRM_ERROR message. Fix it.
> > > > > >
> > > > > > Signed-off-by: Colin Ian King
> > > > >
> > > > > This is already fixed.
> > > >
> > > > This fix is not in today's -next.
> > > >
> > > > Perhaps whatever tree it's fixed in should be in -next.
> > >
> > > Weird. It's in the drm-next tree as:
> > >
> > > commit 4afaa61db9cf5250b5734c2531b226e7b3a3d691
> > > Author: Colin Ian King
> > > Date: Fri Jul 10 09:37:58 2020 +0100
> > >
> > >     drm/amdgpu: fix spelling mistake "Falied" -> "Failed"
> > >
> > >     There is a spelling mistake in a DRM_ERROR error message. Fix it.
> > >
> > >     Signed-off-by: Colin Ian King
> > >     Signed-off-by: Alex Deucher
> > >
> > > Alex
> > >
> > > > $ git show --oneline -s
> > > > d15fe4ec0435 (HEAD, tag: next-20200805, origin/master, origin/HEAD) Add linux-next specific files for 20200805
> > > >
> > > > $ git grep -i falied drivers
> > > > drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c: DRM_ERROR("Falied to terminate tmr\n");
> > > >
> > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > > > []
> > > > > > @@ -2010,7 +2010,7 @@ static int psp_suspend(void *handle)
> > > > > >
> > > > > >  	ret = psp_tmr_terminate(psp);
> > > > > >  	if (ret) {
> > > > > > -		DRM_ERROR("Falied to terminate tmr\n");
> > > > > > +		DRM_ERROR("Failed to terminate tmr\n");
> > > > > >  		return ret;
> > > > > >  	}
> >
> > Dunno.
> >
> > Maybe it's due to some ordering of trees in how -next accumulates patches?
> The spelling error is introduced in two commits:
>
>   c564b8601ae9 ("drm/amdgpu: add TMR destory function for psp")
>
> in Linus' tree between v5.8-rc4 and rc5, and
>
>   90937420c44f ("drm/amdgpu: add TMR destory function for psp")
>
> in the amdgpu tree between two merges by the drm tree. In this same
> interval, the error is corrected by commit
>
>   4afaa61db9cf ("drm/amdgpu: fix spelling mistake "Falied" -> "Failed"")
>
> so when David comes to merge the amdgpu tree in commit
>
>   206739119508 ("Merge tag 'amd-drm-next-5.9-2020-07-17' of git://people.freedesktop.org/~agd5f/linux into drm-next")
>
> the spelling error has been introduced on one side of the merge and
> introduced and corrected on the other. This would have produced a
> conflict which David presumably resolved in haste by picking the HEAD
> side of the merge instead of the MERGE_HEAD side (it happens).
>
> This could have been avoided by not cherry-picking fix commits around
> in the amdgpu process - instead having a fixes branch that is merged
> into the next branch after the fixes branch has been accepted upstream
> (that way there is only one commit for each fix and fewer conflicts).
>
> I have to deal with this sort of conflict (sometimes daily) due to
> the drm processes. It's a pain, as I have to track down each conflict to
> see if the same patches appear on both sides of merges and then try to
> figure out what other changes occur. (This is only slightly helped by
> having the "cherry-picked from" tags in the fix commits.)

Yeah, cherry-picking breaks if you only occasionally cherry-pick - then stuff gets lost. I'd say just apply it once more for the merge window fixes pull.

-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking
On 06.08.20 at 09:02, Dennis Li wrote:

[  584.110304]
[  584.110590] WARNING: possible recursive locking detected
[  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE
[  584.64]
[  584.111456] kworker/38:1/553 is trying to acquire lock:
[  584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.112112] but task is already holding lock:
[  584.112673] 9b1603d247a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.113068] other info that might help us debug this:
[  584.113689]  Possible unsafe locking scenario:
[  584.114350]        CPU0
[  584.114685]
[  584.115014]   lock(&adev->reset_sem);
[  584.115349]   lock(&adev->reset_sem);
[  584.115678]  *** DEADLOCK ***
[  584.116624]  May be due to missing lock nesting notation
[  584.117284] 4 locks held by kworker/38:1/553:
[  584.117616]  #0: 9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630
[  584.117967]  #1: ac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630
[  584.118358]  #2: c1c2a5d0 (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu]
[  584.118786]  #3: 9b1603d247a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.119222] stack backtrace:
[  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
[  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
[  584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
[  584.121638] Call Trace:
[  584.122050]  dump_stack+0x98/0xd5
[  584.122499]  __lock_acquire+0x1139/0x16e0
[  584.122931]  ? trace_hardirqs_on+0x3b/0xf0
[  584.123358]  ? cancel_delayed_work+0xa6/0xc0
[  584.123771]  lock_acquire+0xb8/0x1c0
[  584.124197]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.124599]  down_write+0x49/0x120
[  584.125032]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.125472]  amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.125910]  ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu]
[  584.126367]  amdgpu_ras_do_recovery+0x159/0x190 [amdgpu]
[  584.126789]  process_one_work+0x29e/0x630
[  584.127208]  worker_thread+0x3c/0x3f0
[  584.127621]  ? __kthread_parkme+0x61/0x90
[  584.128014]  kthread+0x12f/0x150
[  584.128402]  ? process_one_work+0x630/0x630
[  584.128790]  ? kthread_park+0x90/0x90
[  584.129174]  ret_from_fork+0x3a/0x50

Each adev owns its own lock_class_key to avoid false positive recursive locking.

NAK, that is not a false positive but a real problem.

The issue here is that we have multiple reset semaphores, one for each device in the hive. If those are not acquired in the correct order we deadlock.

The real solution would be to move the reset_sem into the hive structure and make sure that we lock it only once.

Christian.

Signed-off-by: Dennis Li
Change-Id: I7571efeccbf15483982031d00504a353031a854a

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index e97c088d03b3..766dc8f8c8a0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -967,6 +967,7 @@ struct amdgpu_device {
 	atomic_t			in_gpu_reset;
 	enum pp_mp1_state		mp1_state;
 	struct rw_semaphore		reset_sem;
+	struct lock_class_key		lock_key;
 	struct amdgpu_doorbell_index	doorbell_index;

 	struct mutex			notifier_lock;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 6c572db42d92..d78df9312d34 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3037,6 +3037,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	mutex_init(&adev->virt.vf_errors.lock);
 	hash_init(adev->mn_hash);
 	init_rwsem(&adev->reset_sem);
+	lockdep_set_class(&adev->reset_sem, &adev->lock_key);
 	atomic_set(&adev->in_gpu_reset, 0);
 	mutex_init(&adev->psp.mutex);
 	mutex_init(&adev->notifier_lock);
[PATCH] drm/amdgpu: annotate a false positive recursive locking
[  584.110304]
[  584.110590] WARNING: possible recursive locking detected
[  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE
[  584.64]
[  584.111456] kworker/38:1/553 is trying to acquire lock:
[  584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.112112] but task is already holding lock:
[  584.112673] 9b1603d247a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.113068] other info that might help us debug this:
[  584.113689]  Possible unsafe locking scenario:
[  584.114350]        CPU0
[  584.114685]
[  584.115014]   lock(&adev->reset_sem);
[  584.115349]   lock(&adev->reset_sem);
[  584.115678]  *** DEADLOCK ***
[  584.116624]  May be due to missing lock nesting notation
[  584.117284] 4 locks held by kworker/38:1/553:
[  584.117616]  #0: 9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630
[  584.117967]  #1: ac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630
[  584.118358]  #2: c1c2a5d0 (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu]
[  584.118786]  #3: 9b1603d247a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.119222] stack backtrace:
[  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
[  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
[  584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
[  584.121638] Call Trace:
[  584.122050]  dump_stack+0x98/0xd5
[  584.122499]  __lock_acquire+0x1139/0x16e0
[  584.122931]  ? trace_hardirqs_on+0x3b/0xf0
[  584.123358]  ? cancel_delayed_work+0xa6/0xc0
[  584.123771]  lock_acquire+0xb8/0x1c0
[  584.124197]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.124599]  down_write+0x49/0x120
[  584.125032]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.125472]  amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.125910]  ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu]
[  584.126367]  amdgpu_ras_do_recovery+0x159/0x190 [amdgpu]
[  584.126789]  process_one_work+0x29e/0x630
[  584.127208]  worker_thread+0x3c/0x3f0
[  584.127621]  ? __kthread_parkme+0x61/0x90
[  584.128014]  kthread+0x12f/0x150
[  584.128402]  ? process_one_work+0x630/0x630
[  584.128790]  ? kthread_park+0x90/0x90
[  584.129174]  ret_from_fork+0x3a/0x50

Each adev owns its own lock_class_key to avoid false positive recursive locking.

Signed-off-by: Dennis Li
Change-Id: I7571efeccbf15483982031d00504a353031a854a

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index e97c088d03b3..766dc8f8c8a0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -967,6 +967,7 @@ struct amdgpu_device {
 	atomic_t			in_gpu_reset;
 	enum pp_mp1_state		mp1_state;
 	struct rw_semaphore		reset_sem;
+	struct lock_class_key		lock_key;
 	struct amdgpu_doorbell_index	doorbell_index;

 	struct mutex			notifier_lock;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 6c572db42d92..d78df9312d34 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3037,6 +3037,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	mutex_init(&adev->virt.vf_errors.lock);
 	hash_init(adev->mn_hash);
 	init_rwsem(&adev->reset_sem);
+	lockdep_set_class(&adev->reset_sem, &adev->lock_key);
 	atomic_set(&adev->in_gpu_reset, 0);
 	mutex_init(&adev->psp.mutex);
 	mutex_init(&adev->notifier_lock);
--
2.17.1