Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue.
Am 11.09.2018 um 04:16 schrieb Deng, Emily: -Original Message- From: amd-gfx On Behalf Of Deng, Emily Sent: Monday, September 10, 2018 6:33 PM To: Koenig, Christian ; amd- g...@lists.freedesktop.org Subject: RE: [PATCH v2] drm/amdgpu: Fix the dead lock issue. -Original Message- From: Koenig, Christian Sent: Monday, September 10, 2018 6:02 PM To: Deng, Emily ; amd-gfx@lists.freedesktop.org Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue. Am 10.09.2018 um 11:55 schrieb Deng, Emily: -Original Message- From: amd-gfx On Behalf Of Christian König Sent: Monday, September 10, 2018 5:49 PM To: Deng, Emily ; amd-gfx@lists.freedesktop.org Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue. Am 10.09.2018 um 11:47 schrieb Deng, Emily: -Original Message- From: Christian König Sent: Monday, September 10, 2018 5:41 PM To: Deng, Emily ; amd-gfx@lists.freedesktop.org Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue. Am 10.09.2018 um 11:34 schrieb Emily Deng: It will ramdomly have the dead lock issue when test TDR: 1. amdgpu_device_handle_vram_lost gets the lock shadow_list_lock 2. amdgpu_bo_create locked the bo's resv lock 3. amdgpu_bo_create_shadow is waiting for the shadow_list_lock 4. amdgpu_device_recover_vram_from_shadow is waiting for the bo's resv lock. v2: Make a local copy of the list Signed-off-by: Emily Deng --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++- drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 + 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index acfc63e..2b9f597 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3006,6 +3006,9 @@ static int amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) long r = 1; int i = 0; long tmo; + struct list_head local_shadow_list; + + INIT_LIST_HEAD(_shadow_list); if (amdgpu_sriov_runtime(adev)) tmo = msecs_to_jiffies(8000); @@ -3013,8 +3016,15 @@ static int amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) tmo = msecs_to_jiffies(100); DRM_INFO("recover vram bo from shadow start\n"); + mutex_lock(>shadow_list_lock); list_for_each_entry_safe(bo, tmp, >shadow_list, shadow_list) { + amdgpu_bo_ref(bo); + list_add_tail(>copy_shadow_list, _shadow_list); + } Please don't add an extra copy_shadow_list field to amdgpu_bo. If don't use an extra variable, the local shadow list will change according the adev->shadow_list, both for adding or deleting, it is not we want. That is not correct, see amdgpu_bo_destroy: if (!list_empty(>shadow_list)) { mutex_lock(>shadow_list_lock); list_del_init(>shadow_list); mutex_unlock(>shadow_list_lock); } The BO is only removed from the list when it is destroyed, since we grabbed a local reference it can't be destroyed. So we are safe here. The amdgpu_bo_ref could only avoid deleting the bo, but when it is already in amdgpu_bo_destroy, the shadow_list seems still will be delete. Yeah, had the same thought this night. To make it even worse the reservation object will already be destroyed and so we actually can't reserve the BOs here. Christian. Sorry I am not meaning the delete, what about the adding. That will still go to adev->shadow_list and not affect our local list in any way. We are not interested in any newly allocated shadow BOs, so that should be unproblematic. Understand, will send a patch later. Best wishes Emily Deng Regards, Christian. Regards, Christian. Instead just use bo->shadow list for this. When you hold a reference to the BO it should not be removed from the shadow list. Additional to that you can just use list_splice_init() to move the whole shadow list to your local list. Christian. + mutex_unlock(>shadow_list_lock); + + list_for_each_entry_safe(bo, tmp, _shadow_list, +copy_shadow_list) { next = NULL; amdgpu_device_recover_vram_from_shadow(adev, ring, bo, ); if (fence) { @@ -3033,8 +3043,8 @@ static int amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) dma_fence_put(fence); fence = next; + amdgpu_bo_unref(); } - mutex_unlock(>shadow_list_lock); if (fence) { r = dma_fence_wait_timeout(fence, false, tmo); diff -- git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h index 907fdf4..cfee16c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h @@ -103,6 +103,7 @@ struct amdgpu_bo {
RE: [PATCH v2] drm/amdgpu: Fix the dead lock issue.
>-Original Message- >From: amd-gfx On Behalf Of Deng, >Emily >Sent: Monday, September 10, 2018 6:33 PM >To: Koenig, Christian ; amd- >g...@lists.freedesktop.org >Subject: RE: [PATCH v2] drm/amdgpu: Fix the dead lock issue. > >>-Original Message- >>From: Koenig, Christian >>Sent: Monday, September 10, 2018 6:02 PM >>To: Deng, Emily ; amd-gfx@lists.freedesktop.org >>Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue. >> >>Am 10.09.2018 um 11:55 schrieb Deng, Emily: >>>> -Original Message- >>>> From: amd-gfx On Behalf Of >>>> Christian König >>>> Sent: Monday, September 10, 2018 5:49 PM >>>> To: Deng, Emily ; amd-gfx@lists.freedesktop.org >>>> Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue. >>>> >>>> Am 10.09.2018 um 11:47 schrieb Deng, Emily: >>>>>> -----Original Message- >>>>>> From: Christian König >>>>>> Sent: Monday, September 10, 2018 5:41 PM >>>>>> To: Deng, Emily ; >>>>>> amd-gfx@lists.freedesktop.org >>>>>> Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue. >>>>>> >>>>>> Am 10.09.2018 um 11:34 schrieb Emily Deng: >>>>>>> It will ramdomly have the dead lock issue when test TDR: >>>>>>> 1. amdgpu_device_handle_vram_lost gets the lock shadow_list_lock 2. >>>>>>> amdgpu_bo_create locked the bo's resv lock 3. >>>>>>> amdgpu_bo_create_shadow is waiting for the shadow_list_lock 4. >>>>>>> amdgpu_device_recover_vram_from_shadow is waiting for the bo's >>>>>>> resv lock. >>>>>>> >>>>>>> v2: >>>>>>> Make a local copy of the list >>>>>>> >>>>>>> Signed-off-by: Emily Deng >>>>>>> --- >>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 >+++- >>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 + >>>>>>> 2 files changed, 12 insertions(+), 1 deletion(-) >>>>>>> >>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>> index acfc63e..2b9f597 100644 >>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>> @@ -3006,6 +3006,9 @@ static int >>>>>> amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) >>>>>>> long r = 1; >>>>>>> int i = 0; >>>>>>> long tmo; >>>>>>> + struct list_head local_shadow_list; >>>>>>> + >>>>>>> + INIT_LIST_HEAD(_shadow_list); >>>>>>> >>>>>>> if (amdgpu_sriov_runtime(adev)) >>>>>>> tmo = msecs_to_jiffies(8000); @@ -3013,8 +3016,15 >>@@ static >>>>>>> int >>>>>> amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) >>>>>>> tmo = msecs_to_jiffies(100); >>>>>>> >>>>>>> DRM_INFO("recover vram bo from shadow start\n"); >>>>>>> + >>>>>>> mutex_lock(>shadow_list_lock); >>>>>>> list_for_each_entry_safe(bo, tmp, >shadow_list, >>>>>>> shadow_list) { >>>>>>> + amdgpu_bo_ref(bo); >>>>>>> + list_add_tail(>copy_shadow_list, >_shadow_list); >>>>>>> + } >>>>>> Please don't add an extra copy_shadow_list field to amdgpu_bo. >>>>> If don't use an extra variable, the local shadow list will change >>>>> according the adev->shadow_list, both for adding or deleting, it is >>>>> not we >>>> want. >>>> >>>> That is not correct, see amdgpu_bo_destroy: >>>>> if (!list_empty(>shadow_list)) { >>>>> mutex_lock(>shadow_list_lock); >>>>> list_del_init(>shadow_list); >>>>> mutex_unlock(>shadow_list_lock); >>>>> } >>>> The BO is only removed f
Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue.
Am 10.09.2018 um 11:55 schrieb Deng, Emily: -Original Message- From: amd-gfx On Behalf Of Christian König Sent: Monday, September 10, 2018 5:49 PM To: Deng, Emily ; amd-gfx@lists.freedesktop.org Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue. Am 10.09.2018 um 11:47 schrieb Deng, Emily: -Original Message- From: Christian König Sent: Monday, September 10, 2018 5:41 PM To: Deng, Emily ; amd-gfx@lists.freedesktop.org Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue. Am 10.09.2018 um 11:34 schrieb Emily Deng: It will ramdomly have the dead lock issue when test TDR: 1. amdgpu_device_handle_vram_lost gets the lock shadow_list_lock 2. amdgpu_bo_create locked the bo's resv lock 3. amdgpu_bo_create_shadow is waiting for the shadow_list_lock 4. amdgpu_device_recover_vram_from_shadow is waiting for the bo's resv lock. v2: Make a local copy of the list Signed-off-by: Emily Deng --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++- drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 + 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index acfc63e..2b9f597 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3006,6 +3006,9 @@ static int amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) long r = 1; int i = 0; long tmo; + struct list_head local_shadow_list; + + INIT_LIST_HEAD(_shadow_list); if (amdgpu_sriov_runtime(adev)) tmo = msecs_to_jiffies(8000); @@ -3013,8 +3016,15 @@ static int amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) tmo = msecs_to_jiffies(100); DRM_INFO("recover vram bo from shadow start\n"); + mutex_lock(>shadow_list_lock); list_for_each_entry_safe(bo, tmp, >shadow_list, shadow_list) { + amdgpu_bo_ref(bo); + list_add_tail(>copy_shadow_list, _shadow_list); + } Please don't add an extra copy_shadow_list field to amdgpu_bo. If don't use an extra variable, the local shadow list will change according the adev->shadow_list, both for adding or deleting, it is not we want. That is not correct, see amdgpu_bo_destroy: if (!list_empty(>shadow_list)) { mutex_lock(>shadow_list_lock); list_del_init(>shadow_list); mutex_unlock(>shadow_list_lock); } The BO is only removed from the list when it is destroyed, since we grabbed a local reference it can't be destroyed. So we are safe here. Sorry I am not meaning the delete, what about the adding. That will still go to adev->shadow_list and not affect our local list in any way. We are not interested in any newly allocated shadow BOs, so that should be unproblematic. Regards, Christian. Regards, Christian. Instead just use bo->shadow list for this. When you hold a reference to the BO it should not be removed from the shadow list. Additional to that you can just use list_splice_init() to move the whole shadow list to your local list. Christian. + mutex_unlock(>shadow_list_lock); + + list_for_each_entry_safe(bo, tmp, _shadow_list, +copy_shadow_list) { next = NULL; amdgpu_device_recover_vram_from_shadow(adev, ring, bo, ); if (fence) { @@ -3033,8 +3043,8 @@ static int amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) dma_fence_put(fence); fence = next; + amdgpu_bo_unref(); } - mutex_unlock(>shadow_list_lock); if (fence) { r = dma_fence_wait_timeout(fence, false, tmo); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h index 907fdf4..cfee16c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h @@ -103,6 +103,7 @@ struct amdgpu_bo { struct list_headmn_list; struct list_headshadow_list; }; +struct list_headcopy_shadow_list; struct kgd_mem *kfd_bo; }; ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH v2] drm/amdgpu: Fix the dead lock issue.
>-Original Message- >From: amd-gfx On Behalf Of >Christian König >Sent: Monday, September 10, 2018 5:49 PM >To: Deng, Emily ; amd-gfx@lists.freedesktop.org >Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue. > >Am 10.09.2018 um 11:47 schrieb Deng, Emily: >>> -Original Message- >>> From: Christian König >>> Sent: Monday, September 10, 2018 5:41 PM >>> To: Deng, Emily ; amd-gfx@lists.freedesktop.org >>> Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue. >>> >>> Am 10.09.2018 um 11:34 schrieb Emily Deng: >>>> It will ramdomly have the dead lock issue when test TDR: >>>> 1. amdgpu_device_handle_vram_lost gets the lock shadow_list_lock 2. >>>> amdgpu_bo_create locked the bo's resv lock 3. >>>> amdgpu_bo_create_shadow is waiting for the shadow_list_lock 4. >>>> amdgpu_device_recover_vram_from_shadow is waiting for the bo's resv >>>> lock. >>>> >>>> v2: >>>> Make a local copy of the list >>>> >>>> Signed-off-by: Emily Deng >>>> --- >>>>drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++- >>>>drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 + >>>>2 files changed, 12 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> index acfc63e..2b9f597 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> @@ -3006,6 +3006,9 @@ static int >>> amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) >>>>long r = 1; >>>>int i = 0; >>>>long tmo; >>>> + struct list_head local_shadow_list; >>>> + >>>> + INIT_LIST_HEAD(_shadow_list); >>>> >>>>if (amdgpu_sriov_runtime(adev)) >>>>tmo = msecs_to_jiffies(8000); >>>> @@ -3013,8 +3016,15 @@ static int >>> amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) >>>>tmo = msecs_to_jiffies(100); >>>> >>>>DRM_INFO("recover vram bo from shadow start\n"); >>>> + >>>>mutex_lock(>shadow_list_lock); >>>>list_for_each_entry_safe(bo, tmp, >shadow_list, >>>> shadow_list) { >>>> + amdgpu_bo_ref(bo); >>>> + list_add_tail(>copy_shadow_list, _shadow_list); >>>> + } >>> Please don't add an extra copy_shadow_list field to amdgpu_bo. >> If don't use an extra variable, the local shadow list will change >> according the adev->shadow_list, both for adding or deleting, it is not we >want. > >That is not correct, see amdgpu_bo_destroy: >> if (!list_empty(>shadow_list)) { >> mutex_lock(>shadow_list_lock); >> list_del_init(>shadow_list); >> mutex_unlock(>shadow_list_lock); >> } > >The BO is only removed from the list when it is destroyed, since we grabbed a >local reference it can't be destroyed. So we are safe here. Sorry I am not meaning the delete, what about the adding. >Regards, >Christian. > >> >>> Instead just use bo->shadow list for this. When you hold a reference >>> to the BO it should not be removed from the shadow list. >>> >>> Additional to that you can just use list_splice_init() to move the >>> whole shadow list to your local list. >>> >>> Christian. >>> >>>> + mutex_unlock(>shadow_list_lock); >>>> + >>>> + list_for_each_entry_safe(bo, tmp, _shadow_list, >>>> +copy_shadow_list) { >>>>next = NULL; >>>>amdgpu_device_recover_vram_from_shadow(adev, ring, bo, >>> ); >>>>if (fence) { >>>> @@ -3033,8 +3043,8 @@ static int >>> amdgpu_device_handle_vram_lost(struct >>>> amdgpu_device *adev) >>>> >>>>dma_fence_put(fence); >>>>fence = next; >>>> + amdgpu_bo_unref(); >>>>} >>>> - mutex_unlock(>shadow_list_lock); >>>> >>>>if (fence) { >>>>r = dma_fence_wait_timeout(fence, false, tmo); diff >>>> --git >>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >>>> index 907fdf4..cfee16c 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >>>> @@ -103,6 +103,7 @@ struct amdgpu_bo { >>>>struct list_headmn_list; >>>>struct list_headshadow_list; >>>>}; >>>> +struct list_headcopy_shadow_list; >>>> >>>>struct kgd_mem *kfd_bo; >>>>}; > >___ >amd-gfx mailing list >amd-gfx@lists.freedesktop.org >https://lists.freedesktop.org/mailman/listinfo/amd-gfx ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue.
Am 10.09.2018 um 11:47 schrieb Deng, Emily: -Original Message- From: Christian König Sent: Monday, September 10, 2018 5:41 PM To: Deng, Emily ; amd-gfx@lists.freedesktop.org Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue. Am 10.09.2018 um 11:34 schrieb Emily Deng: It will ramdomly have the dead lock issue when test TDR: 1. amdgpu_device_handle_vram_lost gets the lock shadow_list_lock 2. amdgpu_bo_create locked the bo's resv lock 3. amdgpu_bo_create_shadow is waiting for the shadow_list_lock 4. amdgpu_device_recover_vram_from_shadow is waiting for the bo's resv lock. v2: Make a local copy of the list Signed-off-by: Emily Deng --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++- drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 + 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index acfc63e..2b9f597 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3006,6 +3006,9 @@ static int amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) long r = 1; int i = 0; long tmo; + struct list_head local_shadow_list; + + INIT_LIST_HEAD(_shadow_list); if (amdgpu_sriov_runtime(adev)) tmo = msecs_to_jiffies(8000); @@ -3013,8 +3016,15 @@ static int amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) tmo = msecs_to_jiffies(100); DRM_INFO("recover vram bo from shadow start\n"); + mutex_lock(>shadow_list_lock); list_for_each_entry_safe(bo, tmp, >shadow_list, shadow_list) { + amdgpu_bo_ref(bo); + list_add_tail(>copy_shadow_list, _shadow_list); + } Please don't add an extra copy_shadow_list field to amdgpu_bo. If don't use an extra variable, the local shadow list will change according the adev->shadow_list, both for adding or deleting, it is not we want. That is not correct, see amdgpu_bo_destroy: if (!list_empty(>shadow_list)) { mutex_lock(>shadow_list_lock); list_del_init(>shadow_list); mutex_unlock(>shadow_list_lock); } The BO is only removed from the list when it is destroyed, since we grabbed a local reference it can't be destroyed. So we are safe here. Regards, Christian. Instead just use bo->shadow list for this. When you hold a reference to the BO it should not be removed from the shadow list. Additional to that you can just use list_splice_init() to move the whole shadow list to your local list. Christian. + mutex_unlock(>shadow_list_lock); + + list_for_each_entry_safe(bo, tmp, _shadow_list, +copy_shadow_list) { next = NULL; amdgpu_device_recover_vram_from_shadow(adev, ring, bo, ); if (fence) { @@ -3033,8 +3043,8 @@ static int amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) dma_fence_put(fence); fence = next; + amdgpu_bo_unref(); } - mutex_unlock(>shadow_list_lock); if (fence) { r = dma_fence_wait_timeout(fence, false, tmo); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h index 907fdf4..cfee16c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h @@ -103,6 +103,7 @@ struct amdgpu_bo { struct list_headmn_list; struct list_headshadow_list; }; +struct list_headcopy_shadow_list; struct kgd_mem *kfd_bo; }; ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH v2] drm/amdgpu: Fix the dead lock issue.
>-Original Message- >From: Christian König >Sent: Monday, September 10, 2018 5:41 PM >To: Deng, Emily ; amd-gfx@lists.freedesktop.org >Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue. > >Am 10.09.2018 um 11:34 schrieb Emily Deng: >> It will ramdomly have the dead lock issue when test TDR: >> 1. amdgpu_device_handle_vram_lost gets the lock shadow_list_lock 2. >> amdgpu_bo_create locked the bo's resv lock 3. amdgpu_bo_create_shadow >> is waiting for the shadow_list_lock 4. >> amdgpu_device_recover_vram_from_shadow is waiting for the bo's resv >> lock. >> >> v2: >> Make a local copy of the list >> >> Signed-off-by: Emily Deng >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++- >> drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 + >> 2 files changed, 12 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> index acfc63e..2b9f597 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> @@ -3006,6 +3006,9 @@ static int >amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) >> long r = 1; >> int i = 0; >> long tmo; >> +struct list_head local_shadow_list; >> + >> +INIT_LIST_HEAD(_shadow_list); >> >> if (amdgpu_sriov_runtime(adev)) >> tmo = msecs_to_jiffies(8000); >> @@ -3013,8 +3016,15 @@ static int >amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) >> tmo = msecs_to_jiffies(100); >> >> DRM_INFO("recover vram bo from shadow start\n"); >> + >> mutex_lock(>shadow_list_lock); >> list_for_each_entry_safe(bo, tmp, >shadow_list, shadow_list) >> { >> +amdgpu_bo_ref(bo); >> +list_add_tail(>copy_shadow_list, _shadow_list); >> +} > >Please don't add an extra copy_shadow_list field to amdgpu_bo. If don't use an extra variable, the local shadow list will change according the adev->shadow_list, both for adding or deleting, it is not we want. >Instead just use bo->shadow list for this. When you hold a reference to the BO >it should not be removed from the shadow list. > >Additional to that you can just use list_splice_init() to move the whole shadow >list to your local list. > >Christian. > >> +mutex_unlock(>shadow_list_lock); >> + >> +list_for_each_entry_safe(bo, tmp, _shadow_list, >> +copy_shadow_list) { >> next = NULL; >> amdgpu_device_recover_vram_from_shadow(adev, ring, bo, >); >> if (fence) { >> @@ -3033,8 +3043,8 @@ static int >amdgpu_device_handle_vram_lost(struct >> amdgpu_device *adev) >> >> dma_fence_put(fence); >> fence = next; >> +amdgpu_bo_unref(); >> } >> -mutex_unlock(>shadow_list_lock); >> >> if (fence) { >> r = dma_fence_wait_timeout(fence, false, tmo); diff --git >> a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >> index 907fdf4..cfee16c 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >> @@ -103,6 +103,7 @@ struct amdgpu_bo { >> struct list_headmn_list; >> struct list_headshadow_list; >> }; >> +struct list_headcopy_shadow_list; >> >> struct kgd_mem *kfd_bo; >> }; ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue.
Am 10.09.2018 um 11:34 schrieb Emily Deng: It will ramdomly have the dead lock issue when test TDR: 1. amdgpu_device_handle_vram_lost gets the lock shadow_list_lock 2. amdgpu_bo_create locked the bo's resv lock 3. amdgpu_bo_create_shadow is waiting for the shadow_list_lock 4. amdgpu_device_recover_vram_from_shadow is waiting for the bo's resv lock. v2: Make a local copy of the list Signed-off-by: Emily Deng --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++- drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 + 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index acfc63e..2b9f597 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3006,6 +3006,9 @@ static int amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) long r = 1; int i = 0; long tmo; + struct list_head local_shadow_list; + + INIT_LIST_HEAD(_shadow_list); if (amdgpu_sriov_runtime(adev)) tmo = msecs_to_jiffies(8000); @@ -3013,8 +3016,15 @@ static int amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) tmo = msecs_to_jiffies(100); DRM_INFO("recover vram bo from shadow start\n"); + mutex_lock(>shadow_list_lock); list_for_each_entry_safe(bo, tmp, >shadow_list, shadow_list) { + amdgpu_bo_ref(bo); + list_add_tail(>copy_shadow_list, _shadow_list); + } Please don't add an extra copy_shadow_list field to amdgpu_bo. Instead just use bo->shadow list for this. When you hold a reference to the BO it should not be removed from the shadow list. Additional to that you can just use list_splice_init() to move the whole shadow list to your local list. Christian. + mutex_unlock(>shadow_list_lock); + + list_for_each_entry_safe(bo, tmp, _shadow_list, copy_shadow_list) { next = NULL; amdgpu_device_recover_vram_from_shadow(adev, ring, bo, ); if (fence) { @@ -3033,8 +3043,8 @@ static int amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) dma_fence_put(fence); fence = next; + amdgpu_bo_unref(); } - mutex_unlock(>shadow_list_lock); if (fence) { r = dma_fence_wait_timeout(fence, false, tmo); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h index 907fdf4..cfee16c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h @@ -103,6 +103,7 @@ struct amdgpu_bo { struct list_headmn_list; struct list_headshadow_list; }; +struct list_headcopy_shadow_list; struct kgd_mem *kfd_bo; }; ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx