Re: [PATCH V2] drm/amdgpu: Fix incorrect return value

2024-04-16 Thread Christian König

Well, there are multiple things to consider here.

This is clearly not called from an interrupt handler, otherwise locking a mutex 
would be illegal. So the question is: who is calling this? And can the 
function be called from different threads at the same time?


As far as I can see, you don't take that into account in the patch.

If instead there is some kind of single-threaded worker handling the RAS 
errors, then I strongly suggest solving this issue in the worker.


As far as I can see, hacking around a broken caller by inserting 
amdgpu_vram_mgr_query_page_status() inside 
amdgpu_vram_mgr_reserve_range() is absolutely a no-go. That is really 
bad coding style.


It could be that the VRAM manager needs to provide 
atomic uniqueness for the reserved addresses, e.g. so that 
amdgpu_vram_mgr_reserve_range() can be called multiple times with the same 
address from multiple threads; but then we would need a different data 
structure than a linked list protected by a mutex.


Regards,
Christian.

Am 15.04.24 um 06:04 schrieb Chai, Thomas:

[AMD Official Use Only - General]

Hi Christian:
If an ECC error occurs at an address, HW will generate an interrupt to SW 
to retire all pages located in the same physical row as the error address, based 
on the physical characteristics of the memory device.
Therefore, if other pages located in the same physical row as the error 
address encounter ECC errors later, HW will generate further interrupts 
to SW to retire these same pages again, so amdgpu_vram_mgr_reserve_range 
will be called multiple times to reserve the same pages.

 I think it is more appropriate to do the status check inside the function. 
If the check is not done at the function entry, people who are unfamiliar with 
this part of the code can easily make mistakes when calling it.


-
Best Regards,
Thomas

-Original Message-
From: Christian König 
Sent: Friday, April 12, 2024 5:24 PM
To: Chai, Thomas ; amd-gfx@lists.freedesktop.org
Cc: Chai, Thomas ; Zhang, Hawking ; Zhou1, Tao 
; Li, Candice ; Wang, Yang(Kevin) 
; Yang, Stanley 
Subject: Re: [PATCH V2] drm/amdgpu: Fix incorrect return value

Am 12.04.24 um 10:55 schrieb YiPeng Chai:

[Why]
After calling amdgpu_vram_mgr_reserve_range multiple times with the
same address, calling amdgpu_vram_mgr_query_page_status will always
return -EBUSY.
From the second call to amdgpu_vram_mgr_reserve_range onward, the same
address is added to the reservations_pending list again and is never
moved to the reserved_pages list, because the address has already been
reserved.

Well, just to make it clear: that approach is a NAK until my concerns are resolved.

Regards,
Christian.


[How]
First, check the address status before calling
amdgpu_vram_mgr_do_reserve: if the address is already reserved, do
nothing; if the address is already on the reservations_pending list,
directly reserve the memory; only add new nodes for addresses that are
on neither the reserved_pages list nor the reservations_pending list.

V2:
   Avoid repeated locking/unlocking.

Signed-off-by: YiPeng Chai 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 25 +---
   1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
index 1e36c428d254..a636d3f650b1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
@@ -317,7 +317,6 @@ static void amdgpu_vram_mgr_do_reserve(struct ttm_resource_manager *man)
 
 		dev_dbg(adev->dev, "Reservation 0x%llx - %lld, Succeeded\n",
 			rsv->start, rsv->size);
-
 		vis_usage = amdgpu_vram_mgr_vis_size(adev, block);
 		atomic64_add(vis_usage, &mgr->vis_usage);
 		spin_lock(&man->bdev->lru_lock);
@@ -340,19 +339,27 @@ int amdgpu_vram_mgr_reserve_range(struct amdgpu_vram_mgr *mgr,
 				  uint64_t start, uint64_t size)
 {
 	struct amdgpu_vram_reservation *rsv;
+	int ret = 0;
 
-	rsv = kzalloc(sizeof(*rsv), GFP_KERNEL);
-	if (!rsv)
-		return -ENOMEM;
+	ret = amdgpu_vram_mgr_query_page_status(mgr, start);
+	if (!ret)
+		return 0;
 
-	INIT_LIST_HEAD(&rsv->allocated);
-	INIT_LIST_HEAD(&rsv->blocks);
+	if (ret == -ENOENT) {
+		rsv = kzalloc(sizeof(*rsv), GFP_KERNEL);
+		if (!rsv)
+			return -ENOMEM;
 
-	rsv->start = start;
-	rsv->size = size;
+		INIT_LIST_HEAD(&rsv->allocated);
+		INIT_LIST_HEAD(&rsv->blocks);
+
+		rsv->start = start;
+		rsv->size = size;
+	}
 
 	mutex_lock(&mgr->lock);
-	list_add_tail(&rsv->blocks, &mgr->reservations_pending);
+	if (ret == -ENOENT)
+		list_add_tail(&rsv->blocks, &mgr->reservations_pending);
 	amdgpu_vram_mgr_do_reserve(&mgr->manager);
 	mutex_unlock(&mgr->lock);





RE: [PATCH V2] drm/amdgpu: Fix incorrect return value

2024-04-14 Thread Chai, Thomas

Hi Christian:
   If an ECC error occurs at an address, HW will generate an interrupt to SW to 
retire all pages located in the same physical row as the error address, based on 
the physical characteristics of the memory device.
   Therefore, if other pages located in the same physical row as the error 
address encounter ECC errors later, HW will generate further interrupts 
to SW to retire these same pages again, so amdgpu_vram_mgr_reserve_range 
will be called multiple times to reserve the same pages.

I think it is more appropriate to do the status check inside the function. 
If the check is not done at the function entry, people who are unfamiliar with 
this part of the code can easily make mistakes when calling it.


-
Best Regards,
Thomas

-Original Message-
From: Christian König 
Sent: Friday, April 12, 2024 5:24 PM
To: Chai, Thomas ; amd-gfx@lists.freedesktop.org
Cc: Chai, Thomas ; Zhang, Hawking ; 
Zhou1, Tao ; Li, Candice ; Wang, 
Yang(Kevin) ; Yang, Stanley 
Subject: Re: [PATCH V2] drm/amdgpu: Fix incorrect return value

Am 12.04.24 um 10:55 schrieb YiPeng Chai:
> [Why]
> After calling amdgpu_vram_mgr_reserve_range multiple times with the
> same address, calling amdgpu_vram_mgr_query_page_status will always
> return -EBUSY.
> From the second call to amdgpu_vram_mgr_reserve_range onward, the same
> address is added to the reservations_pending list again and is never
> moved to the reserved_pages list, because the address has already been
> reserved.

Well, just to make it clear: that approach is a NAK until my concerns are resolved.

Regards,
Christian.

>
> [How]
> First, check the address status before calling
> amdgpu_vram_mgr_do_reserve: if the address is already reserved, do
> nothing; if the address is already on the reservations_pending list,
> directly reserve the memory; only add new nodes for addresses that are
> on neither the reserved_pages list nor the reservations_pending list.
>
> V2:
>   Avoid repeated locking/unlocking.
>
> Signed-off-by: YiPeng Chai 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 25 +---
>   1 file changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> index 1e36c428d254..a636d3f650b1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> @@ -317,7 +317,6 @@ static void amdgpu_vram_mgr_do_reserve(struct ttm_resource_manager *man)
> 
>  		dev_dbg(adev->dev, "Reservation 0x%llx - %lld, Succeeded\n",
>  			rsv->start, rsv->size);
> -
>  		vis_usage = amdgpu_vram_mgr_vis_size(adev, block);
>  		atomic64_add(vis_usage, &mgr->vis_usage);
>  		spin_lock(&man->bdev->lru_lock);
> @@ -340,19 +339,27 @@ int amdgpu_vram_mgr_reserve_range(struct amdgpu_vram_mgr *mgr,
>  				  uint64_t start, uint64_t size)
>  {
>  	struct amdgpu_vram_reservation *rsv;
> +	int ret = 0;
> 
> -	rsv = kzalloc(sizeof(*rsv), GFP_KERNEL);
> -	if (!rsv)
> -		return -ENOMEM;
> +	ret = amdgpu_vram_mgr_query_page_status(mgr, start);
> +	if (!ret)
> +		return 0;
> 
> -	INIT_LIST_HEAD(&rsv->allocated);
> -	INIT_LIST_HEAD(&rsv->blocks);
> +	if (ret == -ENOENT) {
> +		rsv = kzalloc(sizeof(*rsv), GFP_KERNEL);
> +		if (!rsv)
> +			return -ENOMEM;
> 
> -	rsv->start = start;
> -	rsv->size = size;
> +		INIT_LIST_HEAD(&rsv->allocated);
> +		INIT_LIST_HEAD(&rsv->blocks);
> +
> +		rsv->start = start;
> +		rsv->size = size;
> +	}
> 
>  	mutex_lock(&mgr->lock);
> -	list_add_tail(&rsv->blocks, &mgr->reservations_pending);
> +	if (ret == -ENOENT)
> +		list_add_tail(&rsv->blocks, &mgr->reservations_pending);
>  	amdgpu_vram_mgr_do_reserve(&mgr->manager);
>  	mutex_unlock(&mgr->lock);
>



Re: [PATCH V2] drm/amdgpu: Fix incorrect return value

2024-04-12 Thread Christian König

Am 12.04.24 um 10:55 schrieb YiPeng Chai:

[Why]
   After calling amdgpu_vram_mgr_reserve_range
multiple times with the same address, calling
amdgpu_vram_mgr_query_page_status will always
return -EBUSY.
   From the second call to amdgpu_vram_mgr_reserve_range
onward, the same address is added to the reservations_pending
list again and is never moved to the reserved_pages list,
because the address has already been reserved.


Well, just to make it clear: that approach is a NAK until my concerns are 
resolved.


Regards,
Christian.



[How]
   First, check the address status before calling
amdgpu_vram_mgr_do_reserve: if the address is already
reserved, do nothing; if the address is already on the
reservations_pending list, directly reserve the memory;
only add new nodes for addresses that are on neither the
reserved_pages list nor the reservations_pending list.

V2:
  Avoid repeated locking/unlocking.

Signed-off-by: YiPeng Chai 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 25 +---
  1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
index 1e36c428d254..a636d3f650b1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
@@ -317,7 +317,6 @@ static void amdgpu_vram_mgr_do_reserve(struct ttm_resource_manager *man)
 
 		dev_dbg(adev->dev, "Reservation 0x%llx - %lld, Succeeded\n",
 			rsv->start, rsv->size);
-
 		vis_usage = amdgpu_vram_mgr_vis_size(adev, block);
 		atomic64_add(vis_usage, &mgr->vis_usage);
 		spin_lock(&man->bdev->lru_lock);
@@ -340,19 +339,27 @@ int amdgpu_vram_mgr_reserve_range(struct amdgpu_vram_mgr *mgr,
 				  uint64_t start, uint64_t size)
 {
 	struct amdgpu_vram_reservation *rsv;
+	int ret = 0;
 
-	rsv = kzalloc(sizeof(*rsv), GFP_KERNEL);
-	if (!rsv)
-		return -ENOMEM;
+	ret = amdgpu_vram_mgr_query_page_status(mgr, start);
+	if (!ret)
+		return 0;
 
-	INIT_LIST_HEAD(&rsv->allocated);
-	INIT_LIST_HEAD(&rsv->blocks);
+	if (ret == -ENOENT) {
+		rsv = kzalloc(sizeof(*rsv), GFP_KERNEL);
+		if (!rsv)
+			return -ENOMEM;
 
-	rsv->start = start;
-	rsv->size = size;
+		INIT_LIST_HEAD(&rsv->allocated);
+		INIT_LIST_HEAD(&rsv->blocks);
+
+		rsv->start = start;
+		rsv->size = size;
+	}
 
 	mutex_lock(&mgr->lock);
-	list_add_tail(&rsv->blocks, &mgr->reservations_pending);
+	if (ret == -ENOENT)
+		list_add_tail(&rsv->blocks, &mgr->reservations_pending);
 	amdgpu_vram_mgr_do_reserve(&mgr->manager);
 	mutex_unlock(&mgr->lock);
  




RE: [PATCH V2] drm/amdgpu: Fix incorrect return value

2024-04-12 Thread Zhou1, Tao

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Chai, Thomas 
> Sent: Friday, April 12, 2024 4:56 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Chai, Thomas ; Zhang, Hawking
> ; Zhou1, Tao ; Li, Candice
> ; Wang, Yang(Kevin) ; Yang,
> Stanley ; Chai, Thomas 
> Subject: [PATCH V2] drm/amdgpu: Fix incorrect return value
>
> [Why]
>   After calling amdgpu_vram_mgr_reserve_range multiple times with the same
> address, calling amdgpu_vram_mgr_query_page_status will always return
> -EBUSY.
>   From the second call to amdgpu_vram_mgr_reserve_range onward, the same
> address is added to the reservations_pending list again and is never moved
> to the reserved_pages list, because the address has already been reserved.
>
> [How]
>   First, check the address status before calling
> amdgpu_vram_mgr_do_reserve: if the address is already reserved, do nothing;
> if the address is already on the reservations_pending list, directly
> reserve the memory; only add new nodes for addresses that are on neither
> the reserved_pages list nor the reservations_pending list.
>
> V2:
>  Avoid repeated locking/unlocking.
>
> Signed-off-by: YiPeng Chai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 25 +---
>  1 file changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> index 1e36c428d254..a636d3f650b1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> @@ -317,7 +317,6 @@ static void amdgpu_vram_mgr_do_reserve(struct ttm_resource_manager *man)
> 
>  		dev_dbg(adev->dev, "Reservation 0x%llx - %lld, Succeeded\n",
>  			rsv->start, rsv->size);
> -
>  		vis_usage = amdgpu_vram_mgr_vis_size(adev, block);
>  		atomic64_add(vis_usage, &mgr->vis_usage);
>  		spin_lock(&man->bdev->lru_lock);
> @@ -340,19 +339,27 @@ int amdgpu_vram_mgr_reserve_range(struct amdgpu_vram_mgr *mgr,
>  				  uint64_t start, uint64_t size)
>  {
>  	struct amdgpu_vram_reservation *rsv;
> +	int ret = 0;
> 
> -	rsv = kzalloc(sizeof(*rsv), GFP_KERNEL);
> -	if (!rsv)
> -		return -ENOMEM;
> +	ret = amdgpu_vram_mgr_query_page_status(mgr, start);
> +	if (!ret)
> +		return 0;
> 
> -	INIT_LIST_HEAD(&rsv->allocated);
> -	INIT_LIST_HEAD(&rsv->blocks);
> +	if (ret == -ENOENT) {
> +		rsv = kzalloc(sizeof(*rsv), GFP_KERNEL);
> +		if (!rsv)
> +			return -ENOMEM;
> 
> -	rsv->start = start;
> -	rsv->size = size;
> +		INIT_LIST_HEAD(&rsv->allocated);
> +		INIT_LIST_HEAD(&rsv->blocks);
> +
> +		rsv->start = start;
> +		rsv->size = size;
> +	}
> 
>  	mutex_lock(&mgr->lock);
> -	list_add_tail(&rsv->blocks, &mgr->reservations_pending);
> +	if (ret == -ENOENT)
> +		list_add_tail(&rsv->blocks, &mgr->reservations_pending);
>  	amdgpu_vram_mgr_do_reserve(&mgr->manager);
>  	mutex_unlock(&mgr->lock);
>
> --
> 2.34.1