On 01/07/2019 11:36 AM, StDenis, Tom wrote:
> On 2019-01-07 11:33 a.m., Grodzovsky, Andrey wrote:
>>
>> On 01/07/2019 11:16 AM, Liu, Shaoyun wrote:
>>> I think it's reasonable to use the hive-specific lock for hive-specific
>>> functions.
>>> The change is Acked-by: Shaoyun Liu <shaoyun....@amd.com>
>>>
>>> -----Original Message-----
>>> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of StDenis, 
>>> Tom
>>> Sent: Monday, January 7, 2019 10:16 AM
>>> To: amd-gfx@lists.freedesktop.org
>>> Cc: StDenis, Tom <tom.stde...@amd.com>
>>> Subject: [PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v2)
>>>
>>> v2: Move locks around in other functions so that this function can stand on
>>> its own.  Also, only hold the hive-specific lock for add/remove device
>>> instead of the driver-global lock, so devices cannot be added/removed in
>>> parallel within one hive.
>>>
>>> Signed-off-by: Tom St Denis <tom.stde...@amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   | 36 ++++++++++++++--------
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  2 +-
>>>     3 files changed, 25 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 39d5d058b2c7..13d8e2ad2f7a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -3525,7 +3525,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
>>> *adev,
>>>              * by different nodes. No point also since the one node already 
>>> executing
>>>              * reset will also reset all the other nodes in the hive.
>>>              */
>>> -   hive = amdgpu_get_xgmi_hive(adev);
>>> +   hive = amdgpu_get_xgmi_hive(adev, 0);
>>>             if (hive && adev->gmc.xgmi.num_physical_nodes > 1 &&
>>>                 !mutex_trylock(&hive->hive_lock))
>>>                     return 0;
>> Let's say I have device 0 in hive A and it just got a GPU reset, and at
>> the same time device 1 is being added to the same hive through
>> amdgpu_xgmi_add_device: hive->hive_lock is acquired by the new device
>> being added, so the GPU reset for device 0 bails out on
>> '!mutex_trylock(&hive->hive_lock)' without completing the reset.
>> Also, in general I feel a bit uncomfortable about the confusing
>> acquisition scheme in the function, and the fact that you take
>> hive->hive_lock inside amdgpu_get_xgmi_hive but the release is still
>> outside of the function.
> Is adding a device while resetting a device even a valid operation
> anyways?

In theory it's valid if you have hot-pluggable devices.
>
> I think this means more so that the reset logic is broken.  Instead
> there should be a per-hive reset lock that is taken and that is tested
> instead.
>
> Tom

The hive->hive_lock was added exactly for this purpose and is used only for
that purpose. Maybe the name I gave it wasn't reflective of its
purpose :)

Andrey

>
>
>> Andrey
>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>>> index 8a8bc60cb6b4..ebf50809485f 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>>> @@ -40,26 +40,39 @@ void *amdgpu_xgmi_hive_try_lock(struct amdgpu_hive_info 
>>> *hive)
>>>             return &hive->device_list;
>>>     }
>>>     
>>> -struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev)
>>> +struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device
>>> +*adev, int lock)
>>>     {
>>>             int i;
>>>             struct amdgpu_hive_info *tmp;
>>>     
>>>             if (!adev->gmc.xgmi.hive_id)
>>>                     return NULL;
>>> +
>>> +   mutex_lock(&xgmi_mutex);
>>> +
>>>             for (i = 0 ; i < hive_count; ++i) {
>>>                     tmp = &xgmi_hives[i];
>>> -           if (tmp->hive_id == adev->gmc.xgmi.hive_id)
>>> +           if (tmp->hive_id == adev->gmc.xgmi.hive_id) {
>>> +                   if (lock)
>>> +                           mutex_lock(&tmp->hive_lock);
>>> +                   mutex_unlock(&xgmi_mutex);
>>>                             return tmp;
>>> +           }
>>>             }
>>> -   if (i >= AMDGPU_MAX_XGMI_HIVE)
>>> +   if (i >= AMDGPU_MAX_XGMI_HIVE) {
>>> +           mutex_unlock(&xgmi_mutex);
>>>                     return NULL;
>>> +   }
>>>     
>>>             /* initialize new hive if not exist */
>>>             tmp = &xgmi_hives[hive_count++];
>>>             tmp->hive_id = adev->gmc.xgmi.hive_id;
>>>             INIT_LIST_HEAD(&tmp->device_list);
>>>             mutex_init(&tmp->hive_lock);
>>> +   if (lock)
>>> +           mutex_lock(&tmp->hive_lock);
>>> +
>>> +   mutex_unlock(&xgmi_mutex);
>>>     
>>>             return tmp;
>>>     }
>>> @@ -111,8 +124,8 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev)
>>>                     return ret;
>>>             }
>>>     
>>> -   mutex_lock(&xgmi_mutex);
>>> -   hive = amdgpu_get_xgmi_hive(adev);
>>> +   /* find hive and take lock */
>>> +   hive = amdgpu_get_xgmi_hive(adev, 1);
>>>             if (!hive)
>>>                     goto exit;
>>>     
>>> @@ -142,8 +155,8 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev)
>>>                             break;
>>>             }
>>>     
>>> +   mutex_unlock(&hive->hive_lock);
>>>     exit:
>>> -   mutex_unlock(&xgmi_mutex);
>>>             return ret;
>>>     }
>>>     
>>> @@ -154,15 +167,12 @@ void amdgpu_xgmi_remove_device(struct amdgpu_device 
>>> *adev)
>>>             if (!adev->gmc.xgmi.supported)
>>>                     return;
>>>     
>>> -   mutex_lock(&xgmi_mutex);
>>> -
>>> -   hive = amdgpu_get_xgmi_hive(adev);
>>> +   hive = amdgpu_get_xgmi_hive(adev, 1);
>>>             if (!hive)
>>> -           goto exit;
>>> +           return;
>>>     
>>>             if (!(hive->number_devices--))
>>>                     mutex_destroy(&hive->hive_lock);
>>> -
>>> -exit:
>>> -   mutex_unlock(&xgmi_mutex);
>>> +   else
>>> +           mutex_unlock(&hive->hive_lock);
>>>     }
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>>> index 6151eb9c8ad3..8d7984844174 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>>> @@ -32,7 +32,7 @@ struct amdgpu_hive_info {
>>>             struct mutex hive_lock;
>>>     };
>>>     
>>> -struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev);
>>> +struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device
>>> +*adev, int lock);
>>>     int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive, struct 
>>> amdgpu_device *adev);  int amdgpu_xgmi_add_device(struct amdgpu_device 
>>> *adev);  void amdgpu_xgmi_remove_device(struct amdgpu_device *adev);
>>> --
>>> 2.17.2
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
