Re: [Bug][5.18-rc0] Between commits ed4643521e6a and 34af78c4e616, appears warning "WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120" and some ga

2022-04-14 Thread Mikhail Gavrilov
On Sat, Apr 9, 2022 at 7:27 PM Christian König
 wrote:
>
> That's unfortunately not the end of the story.
>
> This is fixing your problem, but reintroducing the original problem that
> we call the syncobj with a lock held which can crash badly as well.
>
> Going to take a closer look on Monday. I hope you can test a few more
> patches to help narrow down what's actually going wrong here.
>
> Thanks,
> Christian.
>

Hi Christian.
I'm sorry to trouble you.
Have you forgotten about this issue?

-- 
Best Regards,
Mike Gavrilov.


RE: [PATCH] drm/amdgpu/pm: fix the null pointer while the smu is disabled

2022-04-14 Thread Liu, Aaron
[AMD Official Use Only]

Reviewed-by: Aaron Liu 

> -Original Message-
> From: amd-gfx  On Behalf Of
> Huang Rui
> Sent: Friday, April 15, 2022 12:13 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Huang, Tim
> ; Quan, Evan ; Huang, Ray
> ; Du, Xiaojian 
> Subject: [PATCH] drm/amdgpu/pm: fix the null pointer while the smu is
> disabled
> 
> We need to check whether pp_funcs is initialized when releasing the
> context; otherwise it will trigger a null pointer panic when the
> software SMU is not enabled.
> 
> [ 1109.404555] BUG: kernel NULL pointer dereference, address: 0078
> [ 1109.404609] #PF: supervisor read access in kernel mode
> [ 1109.404638] #PF: error_code(0x) - not-present page
> [ 1109.404657] PGD 0 P4D 0
> [ 1109.404672] Oops:  [#1] PREEMPT SMP NOPTI
> [ 1109.404701] CPU: 7 PID: 9150 Comm: amdgpu_test Tainted: G   OEL 5.16.0-custom #1
> [ 1109.404732] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> [ 1109.404765] RIP: 0010:amdgpu_dpm_force_performance_level+0x1d/0x170 [amdgpu]
> [ 1109.405109] Code: 5d c3 44 8b a3 f0 80 00 00 eb e5 66 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 08 4c 8b b7 f0 7d 00 00 <49> 83 7e 78 00 0f 84 f2 00 00 00 80 bf 87 80 00 00 00 48 89 fb 0f
> [ 1109.405176] RSP: 0018:af3083ad7c20 EFLAGS: 00010282
> [ 1109.405203] RAX:  RBX: 9796b1c14600 RCX: 02862007
> [ 1109.405229] RDX: 97968591c8c0 RSI: 0001 RDI: 9796a370
> [ 1109.405260] RBP: af3083ad7c50 R08: 9897de00 R09: 979688d9db60
> [ 1109.405286] R10:  R11: 979688d9db90 R12: 0001
> [ 1109.405316] R13: 9796a370 R14:  R15: 9796a3708fc0
> [ 1109.405345] FS:  7ff055cff180() GS:9796bfdc() knlGS:
> [ 1109.405378] CS:  0010 DS:  ES:  CR0: 80050033
> [ 1109.405400] CR2: 0078 CR3: 0a394000 CR4: 000506e0
> [ 1109.405434] Call Trace:
> [ 1109.405445]  
> [ 1109.405456]  ? delete_object_full+0x1d/0x20
> [ 1109.405480]  amdgpu_ctx_set_stable_pstate+0x7c/0xa0 [amdgpu]
> [ 1109.405698]  amdgpu_ctx_fini.part.0+0xcb/0x100 [amdgpu]
> [ 1109.405911]  amdgpu_ctx_do_release+0x71/0x80 [amdgpu]
> [ 1109.406121]  amdgpu_ctx_ioctl+0x52d/0x550 [amdgpu]
> [ 1109.406327]  ? _raw_spin_unlock+0x1a/0x30
> [ 1109.406354]  ? drm_gem_handle_delete+0x81/0xb0 [drm]
> [ 1109.406400]  ? amdgpu_ctx_get_entity+0x2c0/0x2c0 [amdgpu]
> [ 1109.406609]  drm_ioctl_kernel+0xb6/0x140 [drm]
> 
> Signed-off-by: Huang Rui 
> ---
>  drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
> b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
> index 1d63f1e8884c..428623e64e8f 100644
> --- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
> +++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
> @@ -717,7 +717,7 @@ int amdgpu_dpm_force_performance_level(struct
> amdgpu_device *adev,
> 
>   AMD_DPM_FORCED_LEVEL_PROFILE_MIN_MCLK |
> 
>   AMD_DPM_FORCED_LEVEL_PROFILE_PEAK;
> 
> - if (!pp_funcs->force_performance_level)
> + if (!pp_funcs || !pp_funcs->force_performance_level)
>   return 0;
> 
>   if (adev->pm.dpm.thermal_active)
> --
> 2.25.1


Re: [PATCH] drm/amdkfd: only allow heavy-weight TLB flush on some ASICs for SVM too

2022-04-14 Thread Paul Menzel

Dear Lang,


Am 15.04.22 um 05:20 schrieb Lang Yu:

On 04/14/ , Paul Menzel wrote:



Am 14.04.22 um 10:19 schrieb Lang Yu:

The idea is from commit a50fe7078035 ("drm/amdkfd: Only apply heavy-weight
TLB flush on Aldebaran") and commit f61c40c0757a ("drm/amdkfd: enable
heavy-weight TLB flush on Arcturus"). Otherwise, we will run into problems
on some ASICs when running SVM applications.


Please list the ASICs, you know of having problems, and even add how to
reproduce this.


Actually, this is ported from previous commits. You can find more details
in the commits I mentioned. At the moment, ASICs other than Aldebaran
and Arcturus probably have the problem.


I think it’s always good to make it as easy as possible for reviewers 
and, later, people reading a commit, and to include the necessary 
information directly in the commit message. It’d be great if you amended 
the commit message.



Running an SVM application can reproduce the issue.


Thanks. How will it fail though?

(Also, a small implementation note would be nice to have. Maybe: Move 
the helper function into the header `kfd_priv.h`, and use in 
`svm_range_unmap_from_gpus()`.)



Kind regards,

Paul



Signed-off-by: Lang Yu 
---
   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 8 
   drivers/gpu/drm/amd/amdkfd/kfd_priv.h| 8 
   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 4 +++-
   3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 91f82a9ccdaf..459f59e3d0ed 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1128,14 +1128,6 @@ static int kfd_ioctl_free_memory_of_gpu(struct file 
*filep,
return ret;
   }
-static bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)
-{
-   return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
-   (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) &&
-   dev->adev->sdma.instance[0].fw_version >= 18) ||
-   KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
-}
-
   static int kfd_ioctl_map_memory_to_gpu(struct file *filep,
struct kfd_process *p, void *data)
   {
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 8a43def1f638..aff6f598ff2c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1328,6 +1328,14 @@ void kfd_signal_poison_consumed_event(struct kfd_dev 
*dev, u32 pasid);
   void kfd_flush_tlb(struct kfd_process_device *pdd, enum TLB_FLUSH_TYPE type);
+static inline bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)
+{
+   return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
+  (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) &&
+  dev->adev->sdma.instance[0].fw_version >= 18) ||
+  KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
+}
+
   bool kfd_is_locked(void);
   /* Compute profile */
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 459fa07a3bcc..5afe216cf099 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1229,7 +1229,9 @@ svm_range_unmap_from_gpus(struct svm_range *prange, 
unsigned long start,
if (r)
break;
}
-   kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT);
+
+   if (kfd_flush_tlb_after_unmap(pdd->dev))
+   kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT);
}
return r;


Re: [PATCH] drm/amdkfd: only allow heavy-weight TLB flush on some ASICs for SVM too

2022-04-14 Thread Felix Kuehling



Am 2022-04-14 um 22:47 schrieb Lang Yu:

On 04/14/ , Felix Kuehling wrote:

Am 2022-04-14 um 04:19 schrieb Lang Yu:

The idea is from commit a50fe7078035 ("drm/amdkfd: Only apply heavy-weight
TLB flush on Aldebaran") and commit f61c40c0757a ("drm/amdkfd: enable
heavy-weight TLB flush on Arcturus"). Otherwise, we will run into problems
on some ASICs when running SVM applications.

Signed-off-by: Lang Yu 
---
   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 8 
   drivers/gpu/drm/amd/amdkfd/kfd_priv.h| 8 
   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 4 +++-
   3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 91f82a9ccdaf..459f59e3d0ed 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1128,14 +1128,6 @@ static int kfd_ioctl_free_memory_of_gpu(struct file 
*filep,
return ret;
   }
-static bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)
-{
-   return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
-   (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) &&
-   dev->adev->sdma.instance[0].fw_version >= 18) ||
-   KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
-}
-
   static int kfd_ioctl_map_memory_to_gpu(struct file *filep,
struct kfd_process *p, void *data)
   {
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 8a43def1f638..aff6f598ff2c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1328,6 +1328,14 @@ void kfd_signal_poison_consumed_event(struct kfd_dev 
*dev, u32 pasid);
   void kfd_flush_tlb(struct kfd_process_device *pdd, enum TLB_FLUSH_TYPE type);
+static inline bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)
+{
+   return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
+  (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) &&
+  dev->adev->sdma.instance[0].fw_version >= 18) ||
+  KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
+}
+
   bool kfd_is_locked(void);
   /* Compute profile */
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 459fa07a3bcc..5afe216cf099 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1229,7 +1229,9 @@ svm_range_unmap_from_gpus(struct svm_range *prange, 
unsigned long start,
if (r)
break;
}
-   kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT);
+
+   if (kfd_flush_tlb_after_unmap(pdd->dev))
+   kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT);

Then you probably need to add another flush_tlb call in
svm_range_map_to_gpus.

There is a TLB_FLUSH_LEGACY call in svm_range_map_to_gpus, the same as in
kfd_ioctl_map_memory_to_gpu. Do we still need to add another one?


Right, I missed that one. I think that should cover it and the patch 
looks good to me.


Reviewed-by: Felix Kuehling 




Regards,
Lang


Regards,
   Felix



}
return r;


[PATCH] drm/amdgpu/pm: fix the null pointer while the smu is disabled

2022-04-14 Thread Huang Rui
We need to check whether pp_funcs is initialized when releasing the
context; otherwise it will trigger a null pointer panic when the software
SMU is not enabled.

[ 1109.404555] BUG: kernel NULL pointer dereference, address: 0078
[ 1109.404609] #PF: supervisor read access in kernel mode
[ 1109.404638] #PF: error_code(0x) - not-present page
[ 1109.404657] PGD 0 P4D 0
[ 1109.404672] Oops:  [#1] PREEMPT SMP NOPTI
[ 1109.404701] CPU: 7 PID: 9150 Comm: amdgpu_test Tainted: G   OEL
5.16.0-custom #1
[ 1109.404732] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS 
VirtualBox 12/01/2006
[ 1109.404765] RIP: 0010:amdgpu_dpm_force_performance_level+0x1d/0x170 [amdgpu]
[ 1109.405109] Code: 5d c3 44 8b a3 f0 80 00 00 eb e5 66 90 0f 1f 44 00 00 55 
48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 08 4c 8b b7 f0 7d 00 00 <49> 83 7e 
78 00 0f 84 f2 00 00 00 80 bf 87 80 00 00 00 48 89 fb 0f
[ 1109.405176] RSP: 0018:af3083ad7c20 EFLAGS: 00010282
[ 1109.405203] RAX:  RBX: 9796b1c14600 RCX: 02862007
[ 1109.405229] RDX: 97968591c8c0 RSI: 0001 RDI: 9796a370
[ 1109.405260] RBP: af3083ad7c50 R08: 9897de00 R09: 979688d9db60
[ 1109.405286] R10:  R11: 979688d9db90 R12: 0001
[ 1109.405316] R13: 9796a370 R14:  R15: 9796a3708fc0
[ 1109.405345] FS:  7ff055cff180() GS:9796bfdc() 
knlGS:
[ 1109.405378] CS:  0010 DS:  ES:  CR0: 80050033
[ 1109.405400] CR2: 0078 CR3: 0a394000 CR4: 000506e0
[ 1109.405434] Call Trace:
[ 1109.405445]  
[ 1109.405456]  ? delete_object_full+0x1d/0x20
[ 1109.405480]  amdgpu_ctx_set_stable_pstate+0x7c/0xa0 [amdgpu]
[ 1109.405698]  amdgpu_ctx_fini.part.0+0xcb/0x100 [amdgpu]
[ 1109.405911]  amdgpu_ctx_do_release+0x71/0x80 [amdgpu]
[ 1109.406121]  amdgpu_ctx_ioctl+0x52d/0x550 [amdgpu]
[ 1109.406327]  ? _raw_spin_unlock+0x1a/0x30
[ 1109.406354]  ? drm_gem_handle_delete+0x81/0xb0 [drm]
[ 1109.406400]  ? amdgpu_ctx_get_entity+0x2c0/0x2c0 [amdgpu]
[ 1109.406609]  drm_ioctl_kernel+0xb6/0x140 [drm]

Signed-off-by: Huang Rui 
---
 drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
index 1d63f1e8884c..428623e64e8f 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
@@ -717,7 +717,7 @@ int amdgpu_dpm_force_performance_level(struct amdgpu_device 
*adev,
AMD_DPM_FORCED_LEVEL_PROFILE_MIN_MCLK |
AMD_DPM_FORCED_LEVEL_PROFILE_PEAK;
 
-   if (!pp_funcs->force_performance_level)
+   if (!pp_funcs || !pp_funcs->force_performance_level)
return 0;
 
if (adev->pm.dpm.thermal_active)
-- 
2.25.1



Re: [PATCH] drm/amdkfd: only allow heavy-weight TLB flush on some ASICs for SVM too

2022-04-14 Thread Lang Yu
On 04/14/ , Paul Menzel wrote:
> Dear Lang,
> 
> 
> Thank you for the patch.
> 
> Am 14.04.22 um 10:19 schrieb Lang Yu:
> > The idea is from commit a50fe7078035 ("drm/amdkfd: Only apply heavy-weight
> > TLB flush on Aldebaran") and commit f61c40c0757a ("drm/amdkfd: enable
> > heavy-weight TLB flush on Arcturus"). Otherwise, we will run into problems
> > on some ASICs when running SVM applications.
> 
> Please list the ASICs, you know of having problems, and even add how to
> reproduce this.

Actually, this is ported from previous commits. You can find more details
in the commits I mentioned. At the moment, ASICs other than Aldebaran
and Arcturus probably have the problem. Running an SVM application
can reproduce the issue.

Thanks,
Lang

> 
> Kind regards,
> 
> Paul
> 
> 
> > Signed-off-by: Lang Yu 
> > ---
> >   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 8 
> >   drivers/gpu/drm/amd/amdkfd/kfd_priv.h| 8 
> >   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 4 +++-
> >   3 files changed, 11 insertions(+), 9 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
> > b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> > index 91f82a9ccdaf..459f59e3d0ed 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> > @@ -1128,14 +1128,6 @@ static int kfd_ioctl_free_memory_of_gpu(struct file 
> > *filep,
> > return ret;
> >   }
> > -static bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)
> > -{
> > -   return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
> > -   (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) &&
> > -   dev->adev->sdma.instance[0].fw_version >= 18) ||
> > -   KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
> > -}
> > -
> >   static int kfd_ioctl_map_memory_to_gpu(struct file *filep,
> > struct kfd_process *p, void *data)
> >   {
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
> > b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > index 8a43def1f638..aff6f598ff2c 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > @@ -1328,6 +1328,14 @@ void kfd_signal_poison_consumed_event(struct kfd_dev 
> > *dev, u32 pasid);
> >   void kfd_flush_tlb(struct kfd_process_device *pdd, enum TLB_FLUSH_TYPE 
> > type);
> > +static inline bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)
> > +{
> > +   return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
> > +  (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) &&
> > +  dev->adev->sdma.instance[0].fw_version >= 18) ||
> > +  KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
> > +}
> > +
> >   bool kfd_is_locked(void);
> >   /* Compute profile */
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
> > b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> > index 459fa07a3bcc..5afe216cf099 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> > @@ -1229,7 +1229,9 @@ svm_range_unmap_from_gpus(struct svm_range *prange, 
> > unsigned long start,
> > if (r)
> > break;
> > }
> > -   kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT);
> > +
> > +   if (kfd_flush_tlb_after_unmap(pdd->dev))
> > +   kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT);
> > }
> > return r;


Re: [PATCH] drm/amdkfd: only allow heavy-weight TLB flush on some ASICs for SVM too

2022-04-14 Thread Lang Yu
On 04/14/ , Eric Huang wrote:
> 
> 
> On 2022-04-14 04:19, Lang Yu wrote:
> > The idea is from commit a50fe7078035 ("drm/amdkfd: Only apply heavy-weight
> > TLB flush on Aldebaran") and commit f61c40c0757a ("drm/amdkfd: enable
> > heavy-weight TLB flush on Arcturus"). Otherwise, we will run into problems
> > on some ASICs when running SVM applications.
> > 
> > Signed-off-by: Lang Yu 
> > ---
> >   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 8 
> >   drivers/gpu/drm/amd/amdkfd/kfd_priv.h| 8 
> >   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 4 +++-
> >   3 files changed, 11 insertions(+), 9 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
> > b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> > index 91f82a9ccdaf..459f59e3d0ed 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> > @@ -1128,14 +1128,6 @@ static int kfd_ioctl_free_memory_of_gpu(struct file 
> > *filep,
> > return ret;
> >   }
> > -static bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)
> > -{
> > -   return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
> > -   (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) &&
> > -   dev->adev->sdma.instance[0].fw_version >= 18) ||
> > -   KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
> > -}
> > -
> >   static int kfd_ioctl_map_memory_to_gpu(struct file *filep,
> > struct kfd_process *p, void *data)
> >   {
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
> > b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > index 8a43def1f638..aff6f598ff2c 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > @@ -1328,6 +1328,14 @@ void kfd_signal_poison_consumed_event(struct kfd_dev 
> > *dev, u32 pasid);
> >   void kfd_flush_tlb(struct kfd_process_device *pdd, enum TLB_FLUSH_TYPE 
> > type);
> > +static inline bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)
> > +{
> > +   return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
> > +  (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) &&
> > +  dev->adev->sdma.instance[0].fw_version >= 18) ||
> > +  KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
> > +}
> > +
> It is a cosmetic change for function kfd_flush_tlb_after_unmap, and not
> related to the topic. You can separate that into another patch.

Okay. Thanks.

Regards,
Lang

> Regards,
> Eric
> >   bool kfd_is_locked(void);
> >   /* Compute profile */
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
> > b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> > index 459fa07a3bcc..5afe216cf099 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> > @@ -1229,7 +1229,9 @@ svm_range_unmap_from_gpus(struct svm_range *prange, 
> > unsigned long start,
> > if (r)
> > break;
> > }
> > -   kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT);
> > +
> > +   if (kfd_flush_tlb_after_unmap(pdd->dev))
> > +   kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT);
> > }
> > return r;
> 


[linux-next:master] BUILD REGRESSION 40354149f4d738dc3492d9998e45b3f02950369a

2022-04-14 Thread kernel test robot
tree/branch: 
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
branch HEAD: 40354149f4d738dc3492d9998e45b3f02950369a  Add linux-next specific 
files for 20220414

Error/Warning reports:

https://lore.kernel.org/linux-mm/202203160358.yulpl6b4-...@intel.com
https://lore.kernel.org/linux-mm/202204081656.6x4pfen4-...@intel.com
https://lore.kernel.org/linux-mm/202204140108.derahwen-...@intel.com
https://lore.kernel.org/lkml/202204140043.tx7bibvi-...@intel.com
https://lore.kernel.org/llvm/202203241958.uw9bwfmd-...@intel.com

Error/Warning: (recently discovered and may have been fixed)

ERROR: dtschema minimum version is v2022.3
aarch64-linux-ld: (.text+0x8): undefined reference to `crypto_sm4_sbox'
aarch64-linux-ld: sm4-neon-glue.c:(.text+0x880): undefined reference to 
`sm4_crypt_block'
drivers/bus/mhi/host/main.c:792:13: warning: parameter 'event_quota' set but 
not used
drivers/bus/mhi/host/main.c:792:13: warning: parameter 'event_quota' set but 
not used [-Wunused-but-set-parameter]
drivers/dma-buf/st-dma-fence-unwrap.c:25:4: warning: unused function 
'to_mock_fence'
drivers/gpu/drm/amd/amdgpu/../display/dc/virtual/virtual_link_hwss.c:32:6: 
warning: no previous prototype for 'virtual_setup_stream_attribute' 
[-Wmissing-prototypes]
sm4-neon-glue.c:(.text+0x190): undefined reference to `sm4_expandkey'
sm4-neon-glue.c:(.text+0x268): undefined reference to `sm4_crypt_block'

Unverified Error/Warning (likely false positive, please contact us if 
interested):

Makefile:684: arch/h8300/Makefile: No such file or directory
arch/Kconfig:10: can't open file "arch/h8300/Kconfig"
arch/s390/include/asm/spinlock.h:81:3: error: unexpected token in '.rept' 
directive
arch/s390/include/asm/spinlock.h:81:3: error: unknown directive
arch/s390/include/asm/spinlock.h:81:3: error: unmatched '.endr' directive
arch/s390/lib/spinlock.c:78:3: error: unexpected token in '.rept' directive
arch/s390/lib/spinlock.c:78:3: error: unknown directive
arch/s390/lib/spinlock.c:78:3: error: unmatched '.endr' directive
drivers/dma-buf/st-dma-fence-unwrap.c:125:13: warning: variable 'err' set but 
not used [-Wunused-but-set-variable]
drivers/edac/edac_device.c:73 edac_device_alloc_ctl_info() warn: Please 
consider using kcalloc instead
drivers/edac/edac_mc.c:369 edac_mc_alloc() warn: Please consider using kcalloc 
instead
drivers/gpu/drm/amd/amdgpu/../display/dc/dcn31/dcn31_hubp.c:57:6: warning: no 
previous prototype for 'hubp31_program_extended_blank' [-Wmissing-prototypes]
drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c:403:5: warning: Value stored to 'ret' 
is never read [clang-analyzer-deadcode.DeadStores]
drivers/hwmon/da9055-hwmon.c:201:9: warning: Call to function 'sprintf' is 
insecure as it does not provide bounding of the memory buffer or security 
checks introduced in the C11 standard. Replace with analogous functions that 
support length arguments or provides boundary checks such as 'sprintf_s' in 
case of C11 
[clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling]
drivers/hwmon/vt8231.c:634:9: warning: Call to function 'sprintf' is insecure 
as it does not provide bounding of the memory buffer or security checks 
introduced in the C11 standard. Replace with analogous functions that support 
length arguments or provides boundary checks such as 'sprintf_s' in case of C11 
[clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling]
drivers/misc/cb710/debug.c:96:1: warning: Call to function 'sprintf' is 
insecure as it does not provide bounding of the memory buffer or security 
checks introduced in the C11 standard. Replace with analogous functions that 
support length arguments or provides boundary checks such as 'sprintf_s' in 
case of C11 
[clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling]
drivers/usb/chipidea/core.c:956:10: warning: Call to function 'sprintf' is 
insecure as it does not provide bounding of the memory buffer or security 
checks introduced in the C11 standard. Replace with analogous functions that 
support length arguments or provides boundary checks such as 'sprintf_s' in 
case of C11 
[clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling]
drivers/usb/gadget/configfs.c:237:8: warning: Call to function 'sprintf' is 
insecure as it does not provide bounding of the memory buffer or security 
checks introduced in the C11 standard. Replace with analogous functions that 
support length arguments or provides boundary checks such as 'sprintf_s' in 
case of C11 
[clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling]
drivers/usb/gadget/udc/core.c:1664:9: warning: Call to function 'sprintf' is 
insecure as it does not provide bounding of the memory buffer or security 
checks introduced in the C11 standard. Replace with analogous functions that 
support length arguments or provides boundary checks such as 'sprintf_s' in 
case of C11 
[clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling]
drivers/video/fb

RE: [PATCH] drm/amd/amdgpu: Remove static from variable in RLCG Reg RW.

2022-04-14 Thread Wan, Gavin
[AMD Official Use Only]

Dear Paul,

Thanks for your comments. Let me do it.

Thanks,
Gavin

-Original Message-
From: Paul Menzel 
Sent: Thursday, April 14, 2022 1:59 PM
To: Wan, Gavin 
Cc: amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amd/amdgpu: Remove static from variable in RLCG Reg RW.

Dear Gavin,


Thank you for your patch.

Am 13.04.22 um 17:26 schrieb Gavin Wan:

Should you re-roll your patch (v2), please remove the dot/period from the end 
of the git commit message summary (subject).

> [why] These static variables saves the RLC Scratch registers address.

s/saves/save/

>When we installed multiple GPUs (for example: XGMI setting) and

s/installed/install/

>multiple GPUs call the function at same time. The RLC Scratch

… same time, the RLC …

>registers address are changed each other. Then it caused

s/caused/causes/

>reading/writing to wrong GPU.

I see from other patches posted here, that [why] is put on a separate line, so 
you do not need to indent the text.

[why]

These static …

>
> [fix] Removed the static from the variables. The variables are
>in stack.

Same here, though *how* instead of *fix* seems more common.

s/Removed/Remove/
s/in stack/on the stack/

>
> Signed-off-by: Gavin Wan 
> Change-Id: Iee78849291d4f7a9688ecc5165bec70ee85cdfbe

Without the Gerrit URL that line is useless.


Kind regards.

Paul


> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 10 +-
>   1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> index d5eea031c3e3..d18a05a20566 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> @@ -868,11 +868,11 @@ static u32 amdgpu_virt_rlcg_reg_rw(struct amdgpu_device 
> *adev, u32 offset, u32 v
>   uint32_t timeout = 5;
>   uint32_t i, tmp;
>   uint32_t ret = 0;
> - static void *scratch_reg0;
> - static void *scratch_reg1;
> - static void *scratch_reg2;
> - static void *scratch_reg3;
> - static void *spare_int;
> + void *scratch_reg0;
> + void *scratch_reg1;
> + void *scratch_reg2;
> + void *scratch_reg3;
> + void *spare_int;
>
>   if (!adev->gfx.rlc.rlcg_reg_access_supported) {
>   dev_err(adev->dev,


Re: [PATCH 1/2] drm/amdkfd: Fix GWS queue count

2022-04-14 Thread Felix Kuehling



Am 2022-04-14 um 13:56 schrieb Yat Sin, David:

-Original Message-
From: Kuehling, Felix 
Sent: Thursday, April 14, 2022 11:42 AM
To: Yat Sin, David ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/2] drm/amdkfd: Fix GWS queue count


Am 2022-04-14 um 11:08 schrieb Yat Sin, David:

-Original Message-
From: Kuehling, Felix 
Sent: Thursday, April 14, 2022 10:52 AM
To: Yat Sin, David ;
amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/2] drm/amdkfd: Fix GWS queue count


Am 2022-04-13 um 11:51 schrieb David Yat Sin:

Queue can be inactive during process termination. This would cause
dqm->gws_queue_count to not be decremented. There can only be 1 GWS
queue per device process, so move the logic out of the loop.

Signed-off-by: David Yat Sin 
---
 .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 12 ++++++------

1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index acf4f7975850..7c107b88d944 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -1945,17 +1945,17 @@ static int process_termination_cpsch(struct

device_queue_manager *dqm,

else if (q->properties.type ==

KFD_QUEUE_TYPE_SDMA_XGMI)

deallocate_sdma_queue(dqm, q);

-   if (q->properties.is_active) {
+   if (q->properties.is_active)
decrement_queue_count(dqm, q->properties.type);
-   if (q->properties.is_gws) {
-   dqm->gws_queue_count--;
-   qpd->mapped_gws_queue = false;
-   }
-   }

This looks like the original intention was to decrement the counter
for inactive GWS queues. This makes sense because this counter is
used to determine whether the runlist is oversubscribed. Inactive
queues are not in the runlist, so they should not be counted.

I see that the counter is updated when queues are activated and
deactivated in update_queue. So IMO this patch is both incorrect and
unnecessary. Did you see an actual problem with the GWS counter during

process termination?

If so, I'd like to know more to understand the root cause.

Regards,
     Felix

Yes, when using a unit test (for example KFDGWSTest.Semaphore):
1. Put a sleep in the middle of the application (after calling
hsaKmtAllocQueueGWS).
2. Run the application and kill it while it is in the sleep (the
application never calls queue.Destroy()).

Then inside the kernel, dqm->gws_queue_count keeps incrementing each time
the experiment is repeated and never goes back to 0. This seems to be
because the sleep causes the process to be evicted and
q->properties.is_active is false, so the code does not enter that if
statement.

That's weird. Can you find out why it's not getting there? In the test you
describe, I would expect the queue to be active, so the counter should be
decremented by the existing code.

Does the test evict the queues for some reason? is_active gets set to false
when a queue is evicted.

Yes, the queue is evicted because of the sleep() call in userspace.


Sleep() shouldn't cause queue evictions. Evictions are usually temporary 
due to memory manager events (memory evictions or MMU notifiers). More 
permanent evictions can happen after VM faults. Does the test cause a VM 
fault before the sleep?






Looks like we're missing code to update the
gws_queue_count in evict/restore_process_queues_cpsch (it is present in
evict/restore_process_queues_nocpsch).

I think this is the actual problem, and the increment/decrement should be 
done there. I did not realize dqm->gws_queue_count only counts non-evicted 
queues. I will post a new patch with this change instead.


Thanks,
  Felix





Maybe the management of this counter should be moved into
increment/decrement_queue_count, so we don't need to duplicate it in
many places.

ACK

Regards,
David


Regards,
    Felix



Regards,
David


dqm->total_queue_count--;
}

+   if (qpd->mapped_gws_queue) {
+   dqm->gws_queue_count--;
+   qpd->mapped_gws_queue = false;
+   }
+
/* Unregister process */
list_for_each_entry_safe(cur, next_dpn, >queues, list) {
if (qpd == cur->qpd) {


Re: [PATCH] drm/amd/amdgpu: Remove static from variable in RLCG Reg RW.

2022-04-14 Thread Paul Menzel

Dear Gavin,


Thank you for your patch.

Am 13.04.22 um 17:26 schrieb Gavin Wan:

Should you re-roll your patch (v2), please remove the dot/period from 
the end of the git commit message summary (subject).



[why] These static variables saves the RLC Scratch registers address.


s/saves/save/


   When we installed multiple GPUs (for example: XGMI setting) and


s/installed/install/


   multiple GPUs call the function at same time. The RLC Scratch


… same time, the RLC …


   registers address are changed each other. Then it caused


s/caused/causes/


   reading/writing to wrong GPU.


I see from other patches posted here, that [why] is put on a separate 
line, so you do not need to indent the text.


[why]

These static …



[fix] Removed the static from the variables. The variables are
   in stack.


Same here, though *how* instead of *fix* seems more common.

s/Removed/Remove/
s/in stack/on the stack/



Signed-off-by: Gavin Wan 
Change-Id: Iee78849291d4f7a9688ecc5165bec70ee85cdfbe


Without the Gerrit URL that line is useless.


Kind regards,

Paul



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 10 +-
  1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index d5eea031c3e3..d18a05a20566 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -868,11 +868,11 @@ static u32 amdgpu_virt_rlcg_reg_rw(struct amdgpu_device 
*adev, u32 offset, u32 v
uint32_t timeout = 5;
uint32_t i, tmp;
uint32_t ret = 0;
-   static void *scratch_reg0;
-   static void *scratch_reg1;
-   static void *scratch_reg2;
-   static void *scratch_reg3;
-   static void *spare_int;
+   void *scratch_reg0;
+   void *scratch_reg1;
+   void *scratch_reg2;
+   void *scratch_reg3;
+   void *spare_int;
  
  	if (!adev->gfx.rlc.rlcg_reg_access_supported) {

dev_err(adev->dev,


RE: [PATCH 1/2] drm/amdkfd: Fix GWS queue count

2022-04-14 Thread Yat Sin, David


> -Original Message-
> From: Kuehling, Felix 
> Sent: Thursday, April 14, 2022 11:42 AM
> To: Yat Sin, David ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 1/2] drm/amdkfd: Fix GWS queue count
> 
> 
> On 2022-04-14 at 11:08, Yat Sin, David wrote:
> >> -Original Message-
> >> From: Kuehling, Felix 
> >> Sent: Thursday, April 14, 2022 10:52 AM
> >> To: Yat Sin, David ;
> >> amd-gfx@lists.freedesktop.org
> >> Subject: Re: [PATCH 1/2] drm/amdkfd: Fix GWS queue count
> >>
> >>
> >> On 2022-04-13 at 11:51, David Yat Sin wrote:
> >>> Queue can be inactive during process termination. This would cause
> >>> dqm->gws_queue_count to not be decremented. There can only be 1
> GWS
> >>> queue per device process so moving the logic out of loop.
> >>>
> >>> Signed-off-by: David Yat Sin 
> >>> ---
> >>>.../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c| 12 ++-
> 
> >> -
> >>>1 file changed, 6 insertions(+), 6 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> >>> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> >>> index acf4f7975850..7c107b88d944 100644
> >>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> >>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> >>> @@ -1945,17 +1945,17 @@ static int process_termination_cpsch(struct
> >> device_queue_manager *dqm,
> >>>   else if (q->properties.type ==
> >> KFD_QUEUE_TYPE_SDMA_XGMI)
> >>>   deallocate_sdma_queue(dqm, q);
> >>>
> >>> - if (q->properties.is_active) {
> >>> + if (q->properties.is_active)
> >>>   decrement_queue_count(dqm, q->properties.type);
> >>> - if (q->properties.is_gws) {
> >>> - dqm->gws_queue_count--;
> >>> - qpd->mapped_gws_queue = false;
> >>> - }
> >>> - }
> >> This looks like the original intention was to decrement the counter
> >> for inactive GWS queues. This makes sense because this counter is
> >> used to determine whether the runlist is oversubscribed. Inactive
> >> queues are not in the runlist, so they should not be counted.
> >>
> >> I see that the counter is updated when queues are activated and
> >> deactivated in update_queue. So IMO this patch is both incorrect and
> >> unnecessary. Did you see an actual problem with the GWS counter during
> process termination?
> >> If so, I'd like to know more to understand the root cause.
> >>
> >> Regards,
> >>     Felix
> > Yes, when using a unit test (for example KFDGWSTest.Semaphore), 1. Put
> > a sleep in the middle of the application (after calling
> > hsaKmtAllocQueueGWS) 2. Run the application and kill it while it
> is sleeping (the application never calls queue.Destroy()).
> >
> > Then inside kernel dqm->gws_queue_count keeps incrementing each time
> the experiment is repeated and never goes back to 0. This seems to be
> because the sleep causes the process to be evicted and q-
> >properties.is_active is false so code does not enter that if statement.
> 
> That's weird. Can you find out why it's not getting there? In the test you
> describe, I would expect the queue to be active, so the counter should be
> decremented by the existing code.
> 
> Does the test evict the queues for some reason? is_active gets set to false
> when a queue is evicted. 
Yes, the queue is evicted because of the sleep() call in userspace.

>Looks like we're missing code to update the
> gws_queue_count in evict/restore_process_queues_cpsch (it is present in
> evict/restore_process_queues_nocpsch).

I think this is the actual problem and the increment/decrement should be done
there. I did not realize that dqm->gws_queue_count counts only non-evicted
queues. I will post a new patch with this change instead.

> 
> Maybe the management of this counter should be moved into
> increment/decrement_queue_count, so we don't need to duplicate it in
> many places.
ACK

Regards,
David

> 
> Regards,
>    Felix
> 
> 
> >
> > Regards,
> > David
> >
> >>>   dqm->total_queue_count--;
> >>>   }
> >>>
> >>> + if (qpd->mapped_gws_queue) {
> >>> + dqm->gws_queue_count--;
> >>> + qpd->mapped_gws_queue = false;
> >>> + }
> >>> +
> >>>   /* Unregister process */
> >>>   list_for_each_entry_safe(cur, next_dpn, &qpd->queues, list) {
> >>>   if (qpd == cur->qpd) {


Re: [PATCH] drm/amdkfd: only allow heavy-weight TLB flush on some ASICs for SVM too

2022-04-14 Thread Paul Menzel

Dear Lang,


Thank you for the patch.

On 14.04.22 at 10:19, Lang Yu wrote:

The idea is from commit a50fe7078035 ("drm/amdkfd: Only apply heavy-weight
TLB flush on Aldebaran") and commit f61c40c0757a ("drm/amdkfd: enable
heavy-weight TLB flush on Arcturus"). Otherwise, we will run into problems
on some ASICs when running SVM applications.


Please list the ASICs you know to have problems, and also describe how to
reproduce this.



Kind regards,

Paul



Signed-off-by: Lang Yu 
---
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 8 
  drivers/gpu/drm/amd/amdkfd/kfd_priv.h| 8 
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 4 +++-
  3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 91f82a9ccdaf..459f59e3d0ed 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1128,14 +1128,6 @@ static int kfd_ioctl_free_memory_of_gpu(struct file 
*filep,
return ret;
  }
  
-static bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)

-{
-   return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
-   (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) &&
-   dev->adev->sdma.instance[0].fw_version >= 18) ||
-   KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
-}
-
  static int kfd_ioctl_map_memory_to_gpu(struct file *filep,
struct kfd_process *p, void *data)
  {
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 8a43def1f638..aff6f598ff2c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1328,6 +1328,14 @@ void kfd_signal_poison_consumed_event(struct kfd_dev 
*dev, u32 pasid);
  
  void kfd_flush_tlb(struct kfd_process_device *pdd, enum TLB_FLUSH_TYPE type);
  
+static inline bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)

+{
+   return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
+  (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) &&
+  dev->adev->sdma.instance[0].fw_version >= 18) ||
+  KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
+}
+
  bool kfd_is_locked(void);
  
  /* Compute profile */

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 459fa07a3bcc..5afe216cf099 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1229,7 +1229,9 @@ svm_range_unmap_from_gpus(struct svm_range *prange, 
unsigned long start,
if (r)
break;
}
-   kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT);
+
+   if (kfd_flush_tlb_after_unmap(pdd->dev))
+   kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT);
}
  
  	return r;


Re: [PATCH] drm/amdkfd: Ignore bogus signals from MEC efficiently

2022-04-14 Thread philip yang

  


On 2022-04-07 22:40, Felix Kuehling wrote:


  MEC firmware sometimes sends signal interrupts without a valid context ID
on end of pipe events that don't intend to signal any HSA signals.
This triggers the slow path in kfd_signal_event_interrupt that scans the
entire event page for signaled events. Detect these signals in the top
half interrupt handler to stop processing them as early as possible.

Because we now always treat event ID 0 as invalid, reserve that ID during
process initialization.


Reviewed-by: Philip Yang <philip.y...@amd.com>

  
Signed-off-by: Felix Kuehling 
---
 drivers/gpu/drm/amd/amdkfd/kfd_events.c   | 22 +++---
 .../gpu/drm/amd/amdkfd/kfd_int_process_v9.c   | 29 +++
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_process.c  | 10 +--
 4 files changed, 56 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
index 75847c5d5957..e43bb14adfca 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
@@ -238,12 +238,24 @@ static int create_other_event(struct kfd_process *p, struct kfd_event *ev, const
 	return 0;
 }
 
-void kfd_event_init_process(struct kfd_process *p)
+int kfd_event_init_process(struct kfd_process *p)
 {
+	int id;
+
 	mutex_init(&p->event_mutex);
 	idr_init(&p->event_idr);
 	p->signal_page = NULL;
-	p->signal_event_count = 0;
+	p->signal_event_count = 1;
+	/* Allocate event ID 0. It is used for a fast path to ignore bogus events
+	 * that are sent by the CP without a context ID
+	 */
+	id = idr_alloc(&p->event_idr, NULL, 0, 1, GFP_KERNEL);
+	if (id < 0) {
+		idr_destroy(&p->event_idr);
+		mutex_destroy(&p->event_mutex);
+		return id;
+	}
+	return 0;
 }
 
 static void destroy_event(struct kfd_process *p, struct kfd_event *ev)
@@ -271,8 +283,10 @@ static void destroy_events(struct kfd_process *p)
 	uint32_t id;
 
 	idr_for_each_entry(&p->event_idr, ev, id)
-		destroy_event(p, ev);
+		if (ev)
+			destroy_event(p, ev);
 	idr_destroy(&p->event_idr);
+	mutex_destroy(&p->event_mutex);
 }
 
 /*
@@ -739,7 +753,7 @@ void kfd_signal_event_interrupt(u32 pasid, uint32_t partial_id,
 			 * iterate over the signal slots and lookup
 			 * only signaled events from the IDR.
 			 */
-			for (id = 0; id < KFD_SIGNAL_EVENT_LIMIT; id++)
+			for (id = 1; id < KFD_SIGNAL_EVENT_LIMIT; id++)
 if (READ_ONCE(slots[id]) != UNSIGNALED_EVENT_SLOT) {
 	ev = lookup_event_by_id(p, id);
 	set_event_from_interrupt(p, ev);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
index 03c29bdd89a1..7d0111c197c5 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
@@ -141,6 +141,21 @@ static void event_interrupt_poison_consumption(struct kfd_dev *dev,
 	}
 }
 
+static bool context_id_expected(struct kfd_dev *dev)
+{
+	/* Borrowing firmware versions for GWS support because they were known
+	 * to send context_ids on legitimate signals.
+	 */
+	switch (KFD_GC_VERSION(dev)) {
+	case IP_VERSION(9, 0, 1): return dev->mec_fw_version >= 0x81b3;
+	case IP_VERSION(9, 4, 0): return dev->mec_fw_version >= 0x1b3;
+	case IP_VERSION(9, 4, 1): return dev->mec_fw_version >= 0x30;
+	case IP_VERSION(9, 4, 2): return true; /* was never broken */
+	default:
+		return false;
+	}
+}
+
 static bool event_interrupt_isr_v9(struct kfd_dev *dev,
 	const uint32_t *ih_ring_entry,
 	uint32_t *patched_ihre,
@@ -206,6 +221,20 @@ static bool event_interrupt_isr_v9(struct kfd_dev *dev,
 	if (WARN_ONCE(pasid == 0, "Bug: No PASID in KFD interrupt"))
 		return false;
 
+	/* Workaround CP firmware sending bogus signals with 0 context_id.
+	 * Those can be safely ignored on hardware and firmware versions that
+	 * include a valid context_id on legitimate signals. This avoids the
+	 * slow path in kfd_signal_event_interrupt that scans all event slots
+	 * for signaled events.
+	 */
+	if (source_id == SOC15_INTSRC_CP_END_OF_PIPE) {
+		uint32_t context_id =
+			SOC15_CONTEXT_ID0_FROM_IH_ENTRY(ih_ring_entry);
+
+		if (context_id == 0 && context_id_expected(dev))
+			return false;
+	}
+
 	/* Interrupt types we care about: various signals and faults.
 	 * They will be forwarded to a work queue (see below).
 	 */
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index e1b7e6afa920..3761655ab0a9 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1292,7 +1292,7 @@ extern const struct kfd_event_interrupt_class event_interrupt_class_v9;
 
 extern const struct kfd_device_global_init_class device_global_init_class_cik;
 
-void kfd_event_init_process(struct kfd_process *p);
+int kfd_event_init_process(struct kfd_process *p);
 void kfd_event_free_process(struct kfd_process *p);
 int kfd_event_mmap(struct kfd_process *process, struct 

Re: [PATCH 1/2] drm/amdkfd: Fix GWS queue count

2022-04-14 Thread Felix Kuehling



On 2022-04-14 at 11:08, Yat Sin, David wrote:

-Original Message-
From: Kuehling, Felix 
Sent: Thursday, April 14, 2022 10:52 AM
To: Yat Sin, David ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/2] drm/amdkfd: Fix GWS queue count


On 2022-04-13 at 11:51, David Yat Sin wrote:

A queue can be inactive during process termination. This would cause
dqm->gws_queue_count not to be decremented. There can only be 1 GWS
queue per device per process, so move the logic out of the loop.

Signed-off-by: David Yat Sin 
---
   .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c| 12 ++-

-

   1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index acf4f7975850..7c107b88d944 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -1945,17 +1945,17 @@ static int process_termination_cpsch(struct

device_queue_manager *dqm,

else if (q->properties.type ==

KFD_QUEUE_TYPE_SDMA_XGMI)

deallocate_sdma_queue(dqm, q);

-   if (q->properties.is_active) {
+   if (q->properties.is_active)
decrement_queue_count(dqm, q->properties.type);
-   if (q->properties.is_gws) {
-   dqm->gws_queue_count--;
-   qpd->mapped_gws_queue = false;
-   }
-   }

This looks like the original intention was to decrement the counter for inactive
GWS queues. This makes sense because this counter is used to determine
whether the runlist is oversubscribed. Inactive queues are not in the runlist,
so they should not be counted.

I see that the counter is updated when queues are activated and deactivated
in update_queue. So IMO this patch is both incorrect and unnecessary. Did
you see an actual problem with the GWS counter during process termination?
If so, I'd like to know more to understand the root cause.

Regards,
    Felix

Yes, when using a unit test (for example KFDGWSTest.Semaphore),
1. Put a sleep in the middle of the application (after calling 
hsaKmtAllocQueueGWS)
2. Run the application and kill it while it is sleeping (the application
never calls queue.Destroy()).

Then, inside the kernel, dqm->gws_queue_count keeps incrementing each time the
experiment is repeated and never goes back to 0. This seems to be because the sleep
causes the process to be evicted, so q->properties.is_active is false and the code
does not enter that if statement.


That's weird. Can you find out why it's not getting there? In the test 
you describe, I would expect the queue to be active, so the counter 
should be decremented by the existing code.


Does the test evict the queues for some reason? is_active gets set to 
false when a queue is evicted. Looks like we're missing code to update 
the gws_queue_count in evict/restore_process_queues_cpsch (it is present 
in evict/restore_process_queues_nocpsch).


Maybe the management of this counter should be moved into 
increment/decrement_queue_count, so we don't need to duplicate it in 
many places.


Regards,
  Felix




Regards,
David


dqm->total_queue_count--;
}

+   if (qpd->mapped_gws_queue) {
+   dqm->gws_queue_count--;
+   qpd->mapped_gws_queue = false;
+   }
+
/* Unregister process */
	list_for_each_entry_safe(cur, next_dpn, &qpd->queues, list) {
if (qpd == cur->qpd) {


Re: [PATCH] drm/radeon: Add build directory to include path

2022-04-14 Thread Masahiro Yamada
Hi.

On Thu, Apr 14, 2022 at 10:50 PM Michel Dänzer
 wrote:
>
> On 2022-04-14 15:34, Alex Deucher wrote:
> > On Thu, Apr 14, 2022 at 4:44 AM Christian König
> >  wrote:
> >> On 14.04.22 at 09:37, Michel Dänzer wrote:
> >>> On 2022-04-14 08:24, Christian König wrote:
>  On 13.04.22 at 18:14, Michel Dänzer wrote:
> > From: Michel Dänzer 
> >
> > Fixes compile errors with out-of-tree builds, e.g.
> >
> > ../drivers/gpu/drm/radeon/r420.c:38:10: fatal error: r420_reg_safe.h: 
> > No such file or directory
> >  38 | #include "r420_reg_safe.h"
> > |  ^
> 
>  Well stuff like that usually points to a broken build environment.
> >>> Just a separate build directory. Specifically, I'm hitting the errors with
> >>>
> >>>   make -C build-amd64 M=drivers/gpu/drm


Maybe

make  O=build-arm64   drivers/gpu/drm/

is the way you were searching for.

It builds only drivers/gpu/drm/
in the separate directory.




> >>>
> >>> Generated headers such as r420_reg_safe.h reside in the build directory, 
> >>> so source files in the source directory can't find them without an 
> >>> explicit search path.
> >>
> >> I'm trying to swap back into my brain how all of this used to work, but
> >> that's a really long time ago that I tried this as well.
> >>
> >>> Are you saying that should get added automagically somehow?


For the kernel tree, yes, it is done automatically.

See the code in scripts/Makefile.lib:

# $(srctree)/$(src) for including checkin headers from generated source files
# $(objtree)/$(obj) for including generated headers from checkin source files
ifeq ($(KBUILD_EXTMOD),)
ifdef building_out_of_srctree
_c_flags   += -I $(srctree)/$(src) -I $(objtree)/$(obj)
_a_flags   += -I $(srctree)/$(src) -I $(objtree)/$(obj)
_cpp_flags += -I $(srctree)/$(src) -I $(objtree)/$(obj)
endif
endif




But, you used M=drivers/gpu/drm.
So, it did not work.



M= is intended for building external modules.

I do not recommend it for in-tree drivers.






> >>
> >> Yes, exactly that. I'm like 95% sure that used to work, but I don't know
> >> why exactly either.
> >>
> >>> FWIW, this is pretty common in the kernel according to git grep.
> >>
> >> Maybe Alex or somebody else with some more background in the kernel
> >> Makefiles could jump in and help here.
> >
> > I don't remember either.  I vaguely recall the build support for the
> > mkregtable stuff being reworked a while ago.  A quick zip through the
> > git logs shows a series from Masahiro Yamada from 2020.
>
> Yamada-san, can you help us? :)
>
> See https://patchwork.freedesktop.org/patch/482011/ for my patch.
>
>
> --
> Earthling Michel Dänzer|  https://redhat.com
> Libre software enthusiast  | Mesa and Xwayland developer



--
Best Regards
Masahiro Yamada


Re: [PATCH] drm/amdkfd: only allow heavy-weight TLB flush on some ASICs for SVM too

2022-04-14 Thread Felix Kuehling



On 2022-04-14 at 04:19, Lang Yu wrote:

The idea is from commit a50fe7078035 ("drm/amdkfd: Only apply heavy-weight
TLB flush on Aldebaran") and commit f61c40c0757a ("drm/amdkfd: enable
heavy-weight TLB flush on Arcturus"). Otherwise, we will run into problems
on some ASICs when running SVM applications.

Signed-off-by: Lang Yu 
---
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 8 
  drivers/gpu/drm/amd/amdkfd/kfd_priv.h| 8 
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 4 +++-
  3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 91f82a9ccdaf..459f59e3d0ed 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1128,14 +1128,6 @@ static int kfd_ioctl_free_memory_of_gpu(struct file 
*filep,
return ret;
  }
  
-static bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)

-{
-   return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
-   (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) &&
-   dev->adev->sdma.instance[0].fw_version >= 18) ||
-   KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
-}
-
  static int kfd_ioctl_map_memory_to_gpu(struct file *filep,
struct kfd_process *p, void *data)
  {
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 8a43def1f638..aff6f598ff2c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1328,6 +1328,14 @@ void kfd_signal_poison_consumed_event(struct kfd_dev 
*dev, u32 pasid);
  
  void kfd_flush_tlb(struct kfd_process_device *pdd, enum TLB_FLUSH_TYPE type);
  
+static inline bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)

+{
+   return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
+  (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) &&
+  dev->adev->sdma.instance[0].fw_version >= 18) ||
+  KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
+}
+
  bool kfd_is_locked(void);
  
  /* Compute profile */

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 459fa07a3bcc..5afe216cf099 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1229,7 +1229,9 @@ svm_range_unmap_from_gpus(struct svm_range *prange, 
unsigned long start,
if (r)
break;
}
-   kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT);
+
+   if (kfd_flush_tlb_after_unmap(pdd->dev))
+   kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT);


Then you probably need to add another flush_tlb call in 
svm_range_map_to_gpus.


Regards,
  Felix



}
  
  	return r;


RE: [PATCH 1/2] drm/amdkfd: Fix GWS queue count

2022-04-14 Thread Yat Sin, David


> -Original Message-
> From: Kuehling, Felix 
> Sent: Thursday, April 14, 2022 10:52 AM
> To: Yat Sin, David ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 1/2] drm/amdkfd: Fix GWS queue count
> 
> 
> > On 2022-04-13 at 11:51, David Yat Sin wrote:
> > Queue can be inactive during process termination. This would cause
> > dqm->gws_queue_count to not be decremented. There can only be 1 GWS
> > queue per device process so moving the logic out of loop.
> >
> > Signed-off-by: David Yat Sin 
> > ---
> >   .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c| 12 ++-
> -
> >   1 file changed, 6 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> > b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> > index acf4f7975850..7c107b88d944 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> > @@ -1945,17 +1945,17 @@ static int process_termination_cpsch(struct
> device_queue_manager *dqm,
> > else if (q->properties.type ==
> KFD_QUEUE_TYPE_SDMA_XGMI)
> > deallocate_sdma_queue(dqm, q);
> >
> > -   if (q->properties.is_active) {
> > +   if (q->properties.is_active)
> > decrement_queue_count(dqm, q->properties.type);
> > -   if (q->properties.is_gws) {
> > -   dqm->gws_queue_count--;
> > -   qpd->mapped_gws_queue = false;
> > -   }
> > -   }
> 
> This looks like the original intention was to decrement the counter for 
> inactive
> GWS queues. This makes sense because this counter is used to determine
> whether the runlist is oversubscribed. Inactive queues are not in the runlist,
> so they should not be counted.
> 
> I see that the counter is updated when queues are activated and deactivated
> in update_queue. So IMO this patch is both incorrect and unnecessary. Did
> you see an actual problem with the GWS counter during process termination?
> If so, I'd like to know more to understand the root cause.
> 
> Regards,
>    Felix

Yes, when using a unit test (for example KFDGWSTest.Semaphore), 
1. Put a sleep in the middle of the application (after calling 
hsaKmtAllocQueueGWS)
2. Run the application and kill it while it is sleeping (the application
never calls queue.Destroy()).

Then, inside the kernel, dqm->gws_queue_count keeps incrementing each time the
experiment is repeated and never goes back to 0. This seems to be because the
sleep causes the process to be evicted, so q->properties.is_active is false and
the code does not enter that if statement.

Regards,
David

> 
> 
> >
> > dqm->total_queue_count--;
> > }
> >
> > +   if (qpd->mapped_gws_queue) {
> > +   dqm->gws_queue_count--;
> > +   qpd->mapped_gws_queue = false;
> > +   }
> > +
> > /* Unregister process */
> > list_for_each_entry_safe(cur, next_dpn, >queues, list) {
> > if (qpd == cur->qpd) {


Re: [PATCH 1/2] drm/amdkfd: Fix GWS queue count

2022-04-14 Thread Felix Kuehling



On 2022-04-13 at 11:51, David Yat Sin wrote:

A queue can be inactive during process termination. This would cause
dqm->gws_queue_count not to be decremented. There can only be 1 GWS
queue per device per process, so move the logic out of the loop.

Signed-off-by: David Yat Sin 
---
  .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c| 12 ++--
  1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index acf4f7975850..7c107b88d944 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -1945,17 +1945,17 @@ static int process_termination_cpsch(struct 
device_queue_manager *dqm,
else if (q->properties.type == KFD_QUEUE_TYPE_SDMA_XGMI)
deallocate_sdma_queue(dqm, q);
  
-		if (q->properties.is_active) {

+   if (q->properties.is_active)
decrement_queue_count(dqm, q->properties.type);
-   if (q->properties.is_gws) {
-   dqm->gws_queue_count--;
-   qpd->mapped_gws_queue = false;
-   }
-   }


This looks like the original intention was to decrement the counter for 
inactive GWS queues. This makes sense because this counter is used to 
determine whether the runlist is oversubscribed. Inactive queues are not 
in the runlist, so they should not be counted.


I see that the counter is updated when queues are activated and 
deactivated in update_queue. So IMO this patch is both incorrect and 
unnecessary. Did you see an actual problem with the GWS counter during 
process termination? If so, I'd like to know more to understand the root 
cause.


Regards,
  Felix


  
  		dqm->total_queue_count--;

}
  
+	if (qpd->mapped_gws_queue) {

+   dqm->gws_queue_count--;
+   qpd->mapped_gws_queue = false;
+   }
+
/* Unregister process */
	list_for_each_entry_safe(cur, next_dpn, &qpd->queues, list) {
if (qpd == cur->qpd) {


Re: [PATCH] drm/amdkfd: fix race condition in kfd_wait_on_events

2022-04-14 Thread philip yang

  


On 2022-04-12 17:58, Felix Kuehling wrote:


  Add the waiters to the wait queue during initialization, while holding the
event spinlock. Otherwise the waiter will not get activated if the event
signals before being added to the wait queue.


Reviewed-by: Philip Yang

  
Signed-off-by: Felix Kuehling 
---
 drivers/gpu/drm/amd/amdkfd/kfd_events.c | 26 +
 1 file changed, 5 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
index e43bb14adfca..ca562b9a8abe 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
@@ -784,7 +784,7 @@ static struct kfd_event_waiter *alloc_event_waiters(uint32_t num_events)
 	return event_waiters;
 }
 
-static int init_event_waiter_get_status(struct kfd_process *p,
+static int init_event_waiter(struct kfd_process *p,
 		struct kfd_event_waiter *waiter,
 		uint32_t event_id)
 {
@@ -797,25 +797,13 @@ static int init_event_waiter_get_status(struct kfd_process *p,
 	waiter->event = ev;
 	waiter->activated = ev->signaled;
 	ev->signaled = ev->signaled && !ev->auto_reset;
+	if (!waiter->activated)
+		add_wait_queue(&ev->wq, &waiter->wait);
 	spin_unlock(&ev->lock);
 
 	return 0;
 }
 
-static void init_event_waiter_add_to_waitlist(struct kfd_event_waiter *waiter)
-{
-	struct kfd_event *ev = waiter->event;
-
-	/* Only add to the wait list if we actually need to
-	 * wait on this event.
-	 */
-	if (!waiter->activated) {
-		spin_lock(&ev->lock);
-		add_wait_queue(&ev->wq, &waiter->wait);
-		spin_unlock(&ev->lock);
-	}
-}
-
 /* test_event_condition - Test condition of events being waited for
  * @all:   Return completion only if all events have signaled
  * @num_events:Number of events to wait for
@@ -945,8 +933,8 @@ int kfd_wait_on_events(struct kfd_process *p,
 			goto out_unlock;
 		}
 
-		ret = init_event_waiter_get_status(p, &event_waiters[i],
-						event_data.event_id);
+		ret = init_event_waiter(p, &event_waiters[i],
+					event_data.event_id);
 		if (ret)
 			goto out_unlock;
 	}
@@ -964,10 +952,6 @@ int kfd_wait_on_events(struct kfd_process *p,
 		goto out_unlock;
 	}
 
-	/* Add to wait lists if we need to wait. */
-	for (i = 0; i < num_events; i++)
-		init_event_waiter_add_to_waitlist(&event_waiters[i]);
-
 	mutex_unlock(&p->event_mutex);
 
 	while (true) {


  



Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler

2022-04-14 Thread Andrey Grodzovsky
My bad, I see you already fixed this in amd-staging-drm-next. We had an
issue in an internal branch with this and just reinvented the wheel :))


Andrey

On 2022-04-14 10:32, Andrey Grodzovsky wrote:

Yea, i need to improve it a bit, ignore this one, will be back with V2.

Andrey

On 2022-04-14 03:12, Chen, Guchun wrote:

It's in amdgpu_pci_resume.

Andrey, shall we modify the code accordingly in amdgpu_pci_resume as 
well? Otherwise, an unset/unlock leak will happen when 
pci_channel_state != pci_channel_io_frozen.


Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Christian König
Sent: Thursday, April 14, 2022 2:40 PM
To: Grodzovsky, Andrey ; amd-gfx@lists.freedesktop.org
Cc: Antonovitch, Anatoli 
Subject: Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler




On 13.04.22 at 21:31, Andrey Grodzovsky wrote:

Lock reset domain unconditionally because on resume we unlock it
unconditionally.
This solves a mutex deadlock when handling both FATAL and non-FATAL PCI
errors one after another.

Signed-off-by: Andrey Grodzovsky 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++---
   1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1cc488a767d8..c65f25e3a0fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5531,18 +5531,18 @@ pci_ers_result_t
amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
      adev->pci_channel_state = state;
   +    /*
+ * Locking adev->reset_domain->sem will prevent any external access

+ * to GPU during PCI error recovery
+ */
+    amdgpu_device_lock_reset_domain(adev->reset_domain);
+    amdgpu_device_set_mp1_state(adev);
+
   switch (state) {
   case pci_channel_io_normal:
   return PCI_ERS_RESULT_CAN_RECOVER;

BTW: Where are we unlocking that again?


   /* Fatal error, prepare for slot reset */
   case pci_channel_io_frozen:
-    /*
- * Locking adev->reset_domain->sem will prevent any 
external access

- * to GPU during PCI error recovery
- */
- amdgpu_device_lock_reset_domain(adev->reset_domain);
-    amdgpu_device_set_mp1_state(adev);
-
   /*
    * Block any work scheduling as we do for regular GPU reset
    * for the duration of the recovery


Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler

2022-04-14 Thread Andrey Grodzovsky

Yeah, I need to improve it a bit, ignore this one, will be back with V2.

Andrey

On 2022-04-14 03:12, Chen, Guchun wrote:

It's in amdgpu_pci_resume.

Andrey, shall we modify the code accordingly in amdgpu_pci_resume as well? 
Otherwise, an unset/unlock leak will happen when pci_channel_state != 
pci_channel_io_frozen.

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Christian 
König
Sent: Thursday, April 14, 2022 2:40 PM
To: Grodzovsky, Andrey ; 
amd-gfx@lists.freedesktop.org
Cc: Antonovitch, Anatoli 
Subject: Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler



Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky:

Lock reset domain unconditionally because on resume we unlock it
unconditionally.
This solves a mutex deadlock when handling both FATAL and non-FATAL PCI
errors one after another.

Signed-off-by: Andrey Grodzovsky 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++---
   1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1cc488a767d8..c65f25e3a0fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5531,18 +5531,18 @@ pci_ers_result_t
amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
   
   	adev->pci_channel_state = state;
   
+	/*
+	 * Locking adev->reset_domain->sem will prevent any external access
+	 * to GPU during PCI error recovery
+	 */
+   amdgpu_device_lock_reset_domain(adev->reset_domain);
+   amdgpu_device_set_mp1_state(adev);
+
switch (state) {
case pci_channel_io_normal:
return PCI_ERS_RESULT_CAN_RECOVER;

BTW: Where are we unlocking that again?


/* Fatal error, prepare for slot reset */
case pci_channel_io_frozen:
-	/*
-	 * Locking adev->reset_domain->sem will prevent any external access
-	 * to GPU during PCI error recovery
-	 */
-   amdgpu_device_lock_reset_domain(adev->reset_domain);
-   amdgpu_device_set_mp1_state(adev);
-
/*
 * Block any work scheduling as we do for regular GPU reset
 * for the duration of the recovery


Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler

2022-04-14 Thread Andrey Grodzovsky



On 2022-04-14 02:40, Christian König wrote:



Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky:

Lock reset domain unconditionally because on resume
we unlock it unconditionally.
This solves a mutex deadlock when handling both FATAL
and non-FATAL PCI errors one after another.

Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++---
  1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index 1cc488a767d8..c65f25e3a0fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5531,18 +5531,18 @@ pci_ers_result_t 
amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta

    adev->pci_channel_state = state;
  +    /*
+ * Locking adev->reset_domain->sem will prevent any external access
+ * to GPU during PCI error recovery
+ */
+    amdgpu_device_lock_reset_domain(adev->reset_domain);
+    amdgpu_device_set_mp1_state(adev);
+
  switch (state) {
  case pci_channel_io_normal:
  return PCI_ERS_RESULT_CAN_RECOVER;


BTW: Where are we unlocking that again?



In amdgpu_pci_resume, but you made me realize I can do this better.
I will be back with V2.

Andrey





  /* Fatal error, prepare for slot reset */
  case pci_channel_io_frozen:
-    /*
- * Locking adev->reset_domain->sem will prevent any external access
- * to GPU during PCI error recovery
- */
-    amdgpu_device_lock_reset_domain(adev->reset_domain);
-    amdgpu_device_set_mp1_state(adev);
-
  /*
   * Block any work scheduling as we do for regular GPU reset
   * for the duration of the recovery




Re: [PATCH] drm/amd/amdgpu: Remove static from variable in RLCG Reg RW.

2022-04-14 Thread Alex Deucher
On Wed, Apr 13, 2022 at 11:27 AM Gavin Wan  wrote:
>
> [why] These static variables save the RLC scratch register addresses.
>   When multiple GPUs are installed (for example, in an XGMI setup)
>   and call this function at the same time, they overwrite each
>   other's RLC scratch register addresses, causing reads/writes to
>   go to the wrong GPU.
>
> [fix] Removed the static qualifier from the variables, so they
>   live on the stack.

Please add:
Fixes: 5d447e29670148 ("drm/amdgpu: add helper for rlcg indirect reg access")

With that added.
Reviewed-by: Alex Deucher 

>
> Signed-off-by: Gavin Wan 
> Change-Id: Iee78849291d4f7a9688ecc5165bec70ee85cdfbe
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> index d5eea031c3e3..d18a05a20566 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> @@ -868,11 +868,11 @@ static u32 amdgpu_virt_rlcg_reg_rw(struct amdgpu_device 
> *adev, u32 offset, u32 v
> uint32_t timeout = 5;
> uint32_t i, tmp;
> uint32_t ret = 0;
> -   static void *scratch_reg0;
> -   static void *scratch_reg1;
> -   static void *scratch_reg2;
> -   static void *scratch_reg3;
> -   static void *spare_int;
> +   void *scratch_reg0;
> +   void *scratch_reg1;
> +   void *scratch_reg2;
> +   void *scratch_reg3;
> +   void *spare_int;
>
> if (!adev->gfx.rlc.rlcg_reg_access_supported) {
> dev_err(adev->dev,
> --
> 2.32.0
>


Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

2022-04-14 Thread Shuotao Xu


On Apr 14, 2022, at 1:31 AM, Andrey Grodzovsky 
mailto:andrey.grodzov...@amd.com>> wrote:



On 2022-04-13 12:03, Shuotao Xu wrote:


On Apr 11, 2022, at 11:52 PM, Andrey Grodzovsky 
mailto:andrey.grodzov...@amd.com>> wrote:

[Some people who received this message don't often get email from 
andrey.grodzov...@amd.com. Learn why this is 
important at http://aka.ms/LearnAboutSenderIdentification.]

On 2022-04-08 21:28, Shuotao Xu wrote:

On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky 
mailto:andrey.grodzov...@amd.com>> wrote:

[Some people who received this message don't often get email from 
andrey.grodzov...@amd.com. Learn why this is 
important at http://aka.ms/LearnAboutSenderIdentification.]

On 2022-04-08 04:45, Shuotao Xu wrote:
Adding PCIe Hotplug Support for AMDKFD: the support of hot-plug of GPU
devices can open doors for many advanced applications in data center
in the next few years, such as for GPU resource
disaggregation. Current AMDKFD does not support hot-unplug because of
the following reasons:

1. During PCIe removal, decrement KFD lock which was incremented at
the beginning of hw fini; otherwise kfd_open later is going to
fail.
I assumed you read my comment last time, still you do same approach.
More in details bellow
Aha, I like your fix:) I was not familiar with drm APIs so just only half 
understood your comment last time.

BTW, I tried hot-plugging out a GPU when rocm application is still running.
From dmesg, application is still trying to access the removed kfd device, and 
are met with some errors.


The application is supposed to keep running: it holds the drm_device
reference as long as it has an open
FD to the device, and final cleanup comes only after the app dies,
thus releasing the FD and the last
drm_device reference.

The application would hang and not exit in this case.


Actually I tried kill -7 $pid, and the process exits. The dmesg has some
warnings though.

[  711.769977] WARNING: CPU: 23 PID: 344 at 
.../amdgpu-rocm5.0.2/src/amd/amdgpu/amdgpu_object.c:1336 
amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
[  711.770528] Modules linked in: amdgpu(OE) amdttm(OE) amd_sched(OE) 
amdkcl(OE) iommu_v2 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo 
xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat 
nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT 
nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter 
ip6_tables iptable_filter overlay binfmt_misc intel_rapl_msr i40iw 
intel_rapl_common skx_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp 
kvm_intel rpcrdma kvm sunrpc ipmi_ssif ib_umad ib_ipoib rdma_ucm irqbypass rapl 
joydev acpi_ipmi input_leds intel_cstate ipmi_si ipmi_devintf mei_me mei 
intel_pch_thermal ipmi_msghandler ioatdma mac_hid lpc_ich dca acpi_power_meter 
acpi_pad sch_fq_codel ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp 
libiscsi scsi_transport_iscsi pci_stub ip_tables x_tables autofs4 btrfs 
blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy 
async_pq async_xor async_tx xor
[  711.779359]  raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib 
ib_uverbs ib_core ast drm_vram_helper i2c_algo_bit drm_ttm_helper ttm 
drm_kms_helper syscopyarea crct10dif_pclmul crc32_pclmul ghash_clmulni_intel 
sysfillrect uas hid_generic sysimgblt aesni_intel mlx5_core fb_sys_fops 
crypto_simd usbhid cryptd drm i40e pci_hyperv_intf usb_storage glue_helper 
mlxfw hid ahci libahci wmi
[  711.779752] CPU: 23 PID: 344 Comm: kworker/23:1 Tainted: GW  OE 
5.11.0+ #1
[  711.779755] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 
2.1 08/14/2018
[  711.779756] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[  711.779955] RIP: 0010:amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
[  711.780141] Code: e8 b5 af 34 f4 e9 1f ff ff ff 48 39 c2 74 07 0f 0b e9 69 
ff ff ff 4c 89 e7 e8 3c b4 16 00 e9 5c ff ff ff e8 a2 ce fd f3 eb cf <0f> 0b eb 
cb e8 d7 02 34 f4 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55
[  711.780143] RSP: 0018:a8100dd67c30 EFLAGS: 00010282
[  711.780145] RAX: ffea RBX: 89980e792058 RCX: 
[  711.780147] RDX:  RSI: 89a8f9ad8870 RDI: 89a8f9ad8870
[  711.780148] RBP: a8100dd67c50 R08:  R09: fff99b18
[  711.780149] R10: a8100dd67bd0 R11: a8100dd67908 R12: 89980e792000
[  711.780151] R13: 89980e792058 R14: 89980e7921bc R15: dead0100
[  711.780152] FS:  () GS:89a8f9ac() 
knlGS:
[  711.780154] CS:  0010 DS:  ES:  CR0: 80050033
[  711.780156] CR2: 7ffddac6f71f CR3: 0030bb80a003 CR4: 007706e0
[  711.780157] DR0:  DR1:  DR2: 
[  711.780159] DR3:  DR6: fffe0ff0 DR7: 0400
[  711.780160] PKRU: 5554
[  

Re: [PATCH] drm/amdkfd: only allow heavy-weight TLB flush on some ASICs for SVM too

2022-04-14 Thread Eric Huang




On 2022-04-14 04:19, Lang Yu wrote:

The idea is from commit a50fe7078035 ("drm/amdkfd: Only apply heavy-weight
TLB flush on Aldebaran") and commit f61c40c0757a ("drm/amdkfd: enable
heavy-weight TLB flush on Arcturus"). Otherwise, we will run into problems
on some ASICs when running SVM applications.

Signed-off-by: Lang Yu 
---
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 8 
  drivers/gpu/drm/amd/amdkfd/kfd_priv.h| 8 
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 4 +++-
  3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 91f82a9ccdaf..459f59e3d0ed 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1128,14 +1128,6 @@ static int kfd_ioctl_free_memory_of_gpu(struct file 
*filep,
return ret;
  }
  
-static bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)
-{
-   return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
-   (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) &&
-   dev->adev->sdma.instance[0].fw_version >= 18) ||
-   KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
-}
-
  static int kfd_ioctl_map_memory_to_gpu(struct file *filep,
struct kfd_process *p, void *data)
  {
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 8a43def1f638..aff6f598ff2c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1328,6 +1328,14 @@ void kfd_signal_poison_consumed_event(struct kfd_dev 
*dev, u32 pasid);
  
  void kfd_flush_tlb(struct kfd_process_device *pdd, enum TLB_FLUSH_TYPE type);
  
+static inline bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)
+{
+   return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
+  (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) &&
+  dev->adev->sdma.instance[0].fw_version >= 18) ||
+  KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
+}
+
It is a cosmetic change for function kfd_flush_tlb_after_unmap, and not 
related to the topic. You can separate that into another patch.


Regards,
Eric

  bool kfd_is_locked(void);
  
  /* Compute profile */

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 459fa07a3bcc..5afe216cf099 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1229,7 +1229,9 @@ svm_range_unmap_from_gpus(struct svm_range *prange, 
unsigned long start,
if (r)
break;
}
-   kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT);
+
+   if (kfd_flush_tlb_after_unmap(pdd->dev))
+   kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT);
}
  
  	return r;




Re: [PATCH] drm/radeon: Add build directory to include path

2022-04-14 Thread Alex Deucher
On Thu, Apr 14, 2022 at 4:44 AM Christian König
 wrote:
>
> Am 14.04.22 um 09:37 schrieb Michel Dänzer:
> > On 2022-04-14 08:24, Christian König wrote:
> >> Am 13.04.22 um 18:14 schrieb Michel Dänzer:
> >>> From: Michel Dänzer 
> >>>
> >>> Fixes compile errors with out-of-tree builds, e.g.
> >>>
> >>> ../drivers/gpu/drm/radeon/r420.c:38:10: fatal error: r420_reg_safe.h: No 
> >>> such file or directory
> >>>  38 | #include "r420_reg_safe.h"
> >>> |  ^
> >>
> >> Well stuff like that usually points to a broken build environment.
> > Just a separate build directory. Specifically, I'm hitting the errors with
> >
> >   make -C build-amd64 M=drivers/gpu/drm
> >
> > Generated headers such as r420_reg_safe.h reside in the build directory, so 
> > source files in the source directory can't find them without an explicit 
> > search path.
>
> I'm trying to swap back into my brain how all of this used to work, but
> that's a really long time ago that I tried this as well.
>
> > Are you saying that should get added automagically somehow?
>
> Yes, exactly that. I'm like 95% sure that used to work, but I don't know
> why exactly either.
>
> > FWIW, this is pretty common in the kernel according to git grep.
>
> Maybe Alex or somebody else with some more background in the kernel
> Makefiles could jump in and help here.

I don't remember either.  I vaguely recall the build support for the
mkregtable stuff being reworked a while ago.  A quick zip through the
git logs shows a series from Masahiro Yamada from 2020.

Alex
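For context, a change of this shape in kbuild usually reduces to one line in the module Makefile. The fragment below is a sketch of what such a change could look like, not a quote of the posted patch:

```make
# Hypothetical sketch for drivers/gpu/drm/radeon/Makefile: generated
# headers such as r420_reg_safe.h land in the build directory ($(obj))
# on out-of-tree builds, so add it to the compiler include path.
ccflags-y += -I $(obj)
```

Whether kbuild is supposed to supply this search path automatically, as Christian remembers, is the open question in the thread.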


Re: [PATCH 5.10 1/1] drm/amdgpu: Ensure the AMDGPU file descriptor is legitimate

2022-04-14 Thread Greg KH
On Tue, Apr 12, 2022 at 04:20:57PM +0100, Lee Jones wrote:
> [ Upstream commit b40a6ab2cf9213923bf8e821ce7fa7f6a0a26990 ]
> 
> This is a partial cherry-pick of the above upstream commit.
> 
> It ensures the file descriptor passed in by userspace is a valid one.
> 
> Cc: Felix Kuehling 
> Cc: Alex Deucher 
> Cc: "Christian König" 
> Cc: David Airlie 
> Cc: Daniel Vetter 
> Cc: amd-gfx@lists.freedesktop.org
> Cc: dri-de...@lists.freedesktop.org
> Signed-off-by: Lee Jones 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)

Now queued up, thanks.

greg k-h


Re: [PATCH 4.19 1/2] drm/amdgpu: Check if fd really is an amdgpu fd.

2022-04-14 Thread Greg KH
On Tue, Apr 12, 2022 at 04:35:28PM +0100, Lee Jones wrote:
> From: Bas Nieuwenhuizen 
> 
> [ Upstream commit 021830d24ba55a578f602979274965344c8e6284 ]
> 
> Otherwise we interpret the file private data as drm & amdgpu data
> while it might not be, possibly allowing one to get memory corruption.
> 
> Cc: Felix Kuehling 
> Cc: Alex Deucher 
> Cc: "Christian König" 
> Cc: David Airlie 
> Cc: Daniel Vetter 
> Cc: amd-gfx@lists.freedesktop.org
> Cc: dri-de...@lists.freedesktop.org
> Signed-off-by: Bas Nieuwenhuizen 
> Reviewed-by: Christian König 
> Signed-off-by: Alex Deucher 
> Signed-off-by: Lee Jones 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h   |  2 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c   | 16 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_sched.c | 10 +++---
>  3 files changed, 25 insertions(+), 3 deletions(-)

Both now queued up, thanks.

greg k-h


Re: [PATCHv4] drm/amdgpu: disable ASPM on Intel Alder Lake based systems

2022-04-14 Thread Alex Deucher
On Thu, Apr 14, 2022 at 3:52 AM Paul Menzel  wrote:
>
> [Cc: -kernel test robot ]
>
> Dear Alex, dear Richard,
>
>
> Am 13.04.22 um 15:00 schrieb Alex Deucher:
> > On Wed, Apr 13, 2022 at 3:43 AM Paul Menzel wrote:
>
> >> Thank you for sending out v4.
> >>
> >> Am 12.04.22 um 23:50 schrieb Richard Gong:
> >>> Active State Power Management (ASPM) feature is enabled since kernel 5.14.
> >>> There are some AMD GFX cards (such as WX3200 and RX640) that won't work
> >>> with ASPM-enabled Intel Alder Lake based systems. Using these GFX cards as
> >>> video/display output, Intel Alder Lake based systems will hang during
> >>> suspend/resume.
> >>
> >> I am still not clear, what “hang during suspend/resume” means. I guess
> >> suspending works fine? During resume (S3 or S0ix?), where does it hang?
> >> The system is functional, but there are only display problems?
> >>
> >>> The issue was initially reported on one system (Dell Precision 3660 with
> >>> BIOS version 0.14.81), but was later confirmed to affect at least 4 Alder
> >>> Lake based systems.
> >>>
> >>> Add extra check to disable ASPM on Intel Alder Lake based systems.
> >>>
> >>> Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default")
> >>> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1885
> >>> Reported-by: kernel test robot 
> >>
> >> This tag is a little confusing. Maybe clarify that it was for an issue
> >> in a previous patch iteration?
> >>
> >>> Signed-off-by: Richard Gong 
> >>> ---
> >>> v4: s/CONFIG_X86_64/CONFIG_X86
> >>>   enhanced check logic
> >>> v3: s/intel_core_asom_chk/aspm_support_quirk_check
> >>>   correct build error with W=1 option
> >>> v2: correct commit description
> >>>   move the check from chip family to problematic platform
> >>> ---
> >>>drivers/gpu/drm/amd/amdgpu/vi.c | 17 -
> >>>1 file changed, 16 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c 
> >>> b/drivers/gpu/drm/amd/amdgpu/vi.c
> >>> index 039b90cdc3bc..b33e0a9bee65 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/vi.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/vi.c
> >>> @@ -81,6 +81,10 @@
> >>>#include "mxgpu_vi.h"
> >>>#include "amdgpu_dm.h"
> >>>
> >>> +#if IS_ENABLED(CONFIG_X86)
> >>> +#include 
> >>> +#endif
> >>> +
> >>>#define ixPCIE_LC_L1_PM_SUBSTATE0x100100C6
> >>>#define PCIE_LC_L1_PM_SUBSTATE__LC_L1_SUBSTATES_OVERRIDE_EN_MASK
> >>> 0x0001L
> >>>#define PCIE_LC_L1_PM_SUBSTATE__LC_PCI_PM_L1_2_OVERRIDE_MASK
> >>> 0x0002L
> >>> @@ -1134,13 +1138,24 @@ static void vi_enable_aspm(struct amdgpu_device 
> >>> *adev)
> >>>WREG32_PCIE(ixPCIE_LC_CNTL, data);
> >>>}
> >>>
> >>> +static bool aspm_support_quirk_check(void)
> >>> +{
> >>> + if (IS_ENABLED(CONFIG_X86)) {
> >>> + struct cpuinfo_x86 *c = &cpu_data(0);
> >>> +
> >>> + return !(c->x86 == 6 && c->x86_model == INTEL_FAM6_ALDERLAKE);
> >>> + }
> >>> +
> >>> + return true;
> >>> +}
> >>> +
> >>>static void vi_program_aspm(struct amdgpu_device *adev)
> >>>{
> >>>u32 data, data1, orig;
> >>>bool bL1SS = false;
> >>>bool bClkReqSupport = true;
> >>>
> >>> - if (!amdgpu_device_should_use_aspm(adev))
> >>> + if (!amdgpu_device_should_use_aspm(adev) || 
> >>> !aspm_support_quirk_check())
> >>>return;
> >>
> >> Can users still forcefully enable ASPM with the parameter `amdgpu.aspm`?
> >>
> >>>
> >>>if (adev->flags & AMD_IS_APU ||
> >>
> >> If I remember correctly, there were also newer cards, where ASPM worked
> >> with Intel Alder Lake, right? Can only the problematic generations for
> >> WX3200 and RX640 be excluded from ASPM?
> >
> > This patch only disables it for the generation that was problematic.
>
> Could that please be made clear in the commit message summary, and message?

Sure.  Richard, please add that this only disables ASPM on VI parts
when in an alderlake system.

>
> Loosely related, is there a public (or internal issue) to analyze how to
> get ASPM working for VI generation devices with Intel Alder Lake?

We'd need support from Intel.  I'm not sure where things currently stand.

Alex


Re: [PATCH] drm/radeon: Add build directory to include path

2022-04-14 Thread Christian König

Am 14.04.22 um 09:37 schrieb Michel Dänzer:

On 2022-04-14 08:24, Christian König wrote:

Am 13.04.22 um 18:14 schrieb Michel Dänzer:

From: Michel Dänzer 

Fixes compile errors with out-of-tree builds, e.g.

../drivers/gpu/drm/radeon/r420.c:38:10: fatal error: r420_reg_safe.h: No such 
file or directory
     38 | #include "r420_reg_safe.h"
    |  ^


Well stuff like that usually points to a broken build environment.

Just a separate build directory. Specifically, I'm hitting the errors with

  make -C build-amd64 M=drivers/gpu/drm

Generated headers such as r420_reg_safe.h reside in the build directory, so 
source files in the source directory can't find them without an explicit search 
path.


I'm trying to swap back into my brain how all of this used to work, but 
that's a really long time ago that I tried this as well.



Are you saying that should get added automagically somehow?


Yes, exactly that. I'm like 95% sure that used to work, but I don't know 
why exactly either.



FWIW, this is pretty common in the kernel according to git grep.


Maybe Alex or somebody else with some more background in the kernel 
Makefiles could jump in and help here.


Christian.


[PATCH] drm/amdkfd: only allow heavy-weight TLB flush on some ASICs for SVM too

2022-04-14 Thread Lang Yu
The idea is from commit a50fe7078035 ("drm/amdkfd: Only apply heavy-weight
TLB flush on Aldebaran") and commit f61c40c0757a ("drm/amdkfd: enable
heavy-weight TLB flush on Arcturus"). Otherwise, we will run into problems
on some ASICs when running SVM applications.

Signed-off-by: Lang Yu 
---
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 8 
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h| 8 
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 4 +++-
 3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 91f82a9ccdaf..459f59e3d0ed 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1128,14 +1128,6 @@ static int kfd_ioctl_free_memory_of_gpu(struct file 
*filep,
return ret;
 }
 
-static bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)
-{
-   return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
-   (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) &&
-   dev->adev->sdma.instance[0].fw_version >= 18) ||
-   KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
-}
-
 static int kfd_ioctl_map_memory_to_gpu(struct file *filep,
struct kfd_process *p, void *data)
 {
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 8a43def1f638..aff6f598ff2c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1328,6 +1328,14 @@ void kfd_signal_poison_consumed_event(struct kfd_dev 
*dev, u32 pasid);
 
 void kfd_flush_tlb(struct kfd_process_device *pdd, enum TLB_FLUSH_TYPE type);
 
+static inline bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)
+{
+   return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
+  (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) &&
+  dev->adev->sdma.instance[0].fw_version >= 18) ||
+  KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
+}
+
 bool kfd_is_locked(void);
 
 /* Compute profile */
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 459fa07a3bcc..5afe216cf099 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1229,7 +1229,9 @@ svm_range_unmap_from_gpus(struct svm_range *prange, 
unsigned long start,
if (r)
break;
}
-   kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT);
+
+   if (kfd_flush_tlb_after_unmap(pdd->dev))
+   kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT);
}
 
return r;
-- 
2.25.1



Re: [PATCHv4] drm/amdgpu: disable ASPM on Intel Alder Lake based systems

2022-04-14 Thread Paul Menzel

[Cc: -kernel test robot ]

Dear Alex, dear Richard,


Am 13.04.22 um 15:00 schrieb Alex Deucher:

On Wed, Apr 13, 2022 at 3:43 AM Paul Menzel wrote:



Thank you for sending out v4.

Am 12.04.22 um 23:50 schrieb Richard Gong:

Active State Power Management (ASPM) feature is enabled since kernel 5.14.
There are some AMD GFX cards (such as WX3200 and RX640) that won't work
with ASPM-enabled Intel Alder Lake based systems. Using these GFX cards as
video/display output, Intel Alder Lake based systems will hang during
suspend/resume.


I am still not clear, what “hang during suspend/resume” means. I guess
suspending works fine? During resume (S3 or S0ix?), where does it hang?
The system is functional, but there are only display problems?


The issue was initially reported on one system (Dell Precision 3660 with
BIOS version 0.14.81), but was later confirmed to affect at least 4 Alder
Lake based systems.

Add extra check to disable ASPM on Intel Alder Lake based systems.

Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default")
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1885
Reported-by: kernel test robot 


This tag is a little confusing. Maybe clarify that it was for an issue
in a previous patch iteration?


Signed-off-by: Richard Gong 
---
v4: s/CONFIG_X86_64/CONFIG_X86
  enhanced check logic
v3: s/intel_core_asom_chk/aspm_support_quirk_check
  correct build error with W=1 option
v2: correct commit description
  move the check from chip family to problematic platform
---
   drivers/gpu/drm/amd/amdgpu/vi.c | 17 -
   1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c
index 039b90cdc3bc..b33e0a9bee65 100644
--- a/drivers/gpu/drm/amd/amdgpu/vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/vi.c
@@ -81,6 +81,10 @@
   #include "mxgpu_vi.h"
   #include "amdgpu_dm.h"

+#if IS_ENABLED(CONFIG_X86)
+#include 
+#endif
+
   #define ixPCIE_LC_L1_PM_SUBSTATE0x100100C6
   #define PCIE_LC_L1_PM_SUBSTATE__LC_L1_SUBSTATES_OVERRIDE_EN_MASK
0x0001L
   #define PCIE_LC_L1_PM_SUBSTATE__LC_PCI_PM_L1_2_OVERRIDE_MASK
0x0002L
@@ -1134,13 +1138,24 @@ static void vi_enable_aspm(struct amdgpu_device *adev)
   WREG32_PCIE(ixPCIE_LC_CNTL, data);
   }

+static bool aspm_support_quirk_check(void)
+{
+ if (IS_ENABLED(CONFIG_X86)) {
+ struct cpuinfo_x86 *c = &cpu_data(0);
+
+ return !(c->x86 == 6 && c->x86_model == INTEL_FAM6_ALDERLAKE);
+ }
+
+ return true;
+}
+
   static void vi_program_aspm(struct amdgpu_device *adev)
   {
   u32 data, data1, orig;
   bool bL1SS = false;
   bool bClkReqSupport = true;

- if (!amdgpu_device_should_use_aspm(adev))
+ if (!amdgpu_device_should_use_aspm(adev) || !aspm_support_quirk_check())
   return;


Can users still forcefully enable ASPM with the parameter `amdgpu.aspm`?



   if (adev->flags & AMD_IS_APU ||


If I remember correctly, there were also newer cards, where ASPM worked
with Intel Alder Lake, right? Can only the problematic generations for
WX3200 and RX640 be excluded from ASPM?


This patch only disables it for the generation that was problematic.


Could that please be made clear in the commit message summary, and message?

Loosely related, is there a public (or internal issue) to analyze how to 
get ASPM working for VI generation devices with Intel Alder Lake?



Kind regards,

Paul


Re: [PATCH v2] drm/amdgpu: Fix one use-after-free of VM

2022-04-14 Thread Paul Menzel

Dear Xinhui,


Am 14.04.22 um 07:45 schrieb Paul Menzel:


Thank you for rerolling the patch.

Am 14.04.22 um 07:03 schrieb xinhui pan:

VM might already be freed when amdgpu_vm_tlb_seq_cb() is called.
We see the calltrace below.

Fix it by keeping the last flush fence around and wait for it to signal


Nit: Please add a dot/period to the end of sentences.



BUG kmalloc-4k (Not tainted): Poison overwritten

0x9c88630414e8-0x9c88630414e8 @offset=5352. First byte 0x6c
instead of 0x6b Allocated in amdgpu_driver_open_kms+0x9d/0x360 [amdgpu]
age=44 cpu=0 pid=2343
  __slab_alloc.isra.0+0x4f/0x90
  kmem_cache_alloc_trace+0x6b8/0x7a0
  amdgpu_driver_open_kms+0x9d/0x360 [amdgpu]
  drm_file_alloc+0x222/0x3e0 [drm]
  drm_open+0x11d/0x410 [drm]
Freed in amdgpu_driver_postclose_kms+0x3e9/0x550 [amdgpu] age=22 cpu=1
pid=2485
  kfree+0x4a2/0x580
  amdgpu_driver_postclose_kms+0x3e9/0x550 [amdgpu]
  drm_file_free+0x24e/0x3c0 [drm]
  drm_close_helper.isra.0+0x90/0xb0 [drm]
  drm_release+0x97/0x1a0 [drm]
  __fput+0xb6/0x280
  fput+0xe/0x10
  task_work_run+0x64/0xb0


The v2 annotation is missing.


Suggested-by: Christian König 
Signed-off-by: xinhui pan 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 17 ++---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h |  1 +
  2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c

index 645ce28277c2..cd5aa7edd451 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -932,9 +932,12 @@ int amdgpu_vm_update_range(struct amdgpu_device 
*adev, struct amdgpu_vm *vm,

  if (flush_tlb || params.table_freed) {
  tlb_cb->vm = vm;
-    if (!fence || !*fence ||
-    dma_fence_add_callback(*fence, &tlb_cb->cb,
-   amdgpu_vm_tlb_seq_cb))
+    if (fence && *fence &&
+    !dma_fence_add_callback(*fence, &tlb_cb->cb,
+   amdgpu_vm_tlb_seq_cb)) {
+    dma_fence_put(vm->last_tlb_flush);
+    vm->last_tlb_flush = dma_fence_get(*fence);
+    } else
  amdgpu_vm_tlb_seq_cb(NULL, &tlb_cb->cb);


The Linux kernel coding style uses braces for all branches of a 
conditional statement, if one branch uses braces. [1]


By the way `scripts/checkpatch.pl --strict …` would also have found this:

```
$ scripts/checkpatch.pl --strict 
0001-drm-amdgpu-Fix-one-use-after-free-of-VM.patch

CHECK: braces {} should be used on all arms of this statement
#53: FILE: drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:935:
+   if (fence && *fence &&
[...]
+   } else
[...]

CHECK: Unbalanced braces around else statement
#58: FILE: drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:940:
+   } else

total: 0 errors, 0 warnings, 2 checks, 54 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
  mechanically convert to the typical style using --fix or 
--fix-inplace.


0001-drm-amdgpu-Fix-one-use-after-free-of-VM.patch has style problems, 
please review.

```


Kind regards,

Paul



  tlb_cb = NULL;
  }
@@ -2094,6 +2097,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, 
struct amdgpu_vm *vm)

  vm->update_funcs = &amdgpu_vm_sdma_funcs;
  vm->last_update = NULL;
  vm->last_unlocked = dma_fence_get_stub();
+    vm->last_tlb_flush = dma_fence_get_stub();
  mutex_init(&vm->eviction_lock);
  vm->evicting = false;
@@ -2132,6 +2136,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, 
struct amdgpu_vm *vm)

  vm->root.bo = NULL;
  error_free_delayed:
+    dma_fence_put(vm->last_tlb_flush);
  dma_fence_put(vm->last_unlocked);
  drm_sched_entity_destroy(&vm->delayed);
@@ -2248,6 +2253,7 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, 
struct amdgpu_vm *vm)

  struct amdgpu_bo_va_mapping *mapping, *tmp;
  bool prt_fini_needed = !!adev->gmc.gmc_funcs->set_prt;
  struct amdgpu_bo *root;
+    unsigned long flags;
  int i;
  amdgpu_amdkfd_gpuvm_destroy_cb(adev, vm);
@@ -2257,6 +2263,11 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, 
struct amdgpu_vm *vm)

  amdgpu_vm_set_pasid(adev, vm, 0);
  dma_fence_wait(vm->last_unlocked, false);
  dma_fence_put(vm->last_unlocked);
+    dma_fence_wait(vm->last_tlb_flush, false);
+    /* Make sure that all fence callbacks have completed */
+    spin_lock_irqsave(vm->last_tlb_flush->lock, flags);
+    spin_unlock_irqrestore(vm->last_tlb_flush->lock, flags);
+    dma_fence_put(vm->last_tlb_flush);
  list_for_each_entry_safe(mapping, tmp, &vm->freed, list) {
  if (mapping->flags & AMDGPU_PTE_PRT && prt_fini_needed) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h

index 1a814fb8..6b06a214f05f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -286,6 +286,7 @@ struct amdgpu_vm {
  /* Last finished delayed update */
  atomic64_t		tlb_seq;

RE: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler

2022-04-14 Thread Chen, Guchun
It's in amdgpu_pci_resume.

Andrey, shall we modify the code accordingly in amdgpu_pci_resume as well? 
Otherwise, an unset/unlock leak will happen when pci_channel_state != 
pci_channel_io_frozen.

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Christian 
König
Sent: Thursday, April 14, 2022 2:40 PM
To: Grodzovsky, Andrey ; 
amd-gfx@lists.freedesktop.org
Cc: Antonovitch, Anatoli 
Subject: Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler



Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky:
> Lock reset domain unconditionally because on resume we unlock it 
> unconditionally.
> This solves a mutex deadlock when handling both FATAL and non-FATAL PCI
> errors one after another.
>
> Signed-off-by: Andrey Grodzovsky 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++---
>   1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 1cc488a767d8..c65f25e3a0fc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5531,18 +5531,18 @@ pci_ers_result_t 
> amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
>   
>   adev->pci_channel_state = state;
>   
> + /*
> +  * Locking adev->reset_domain->sem will prevent any external access
> +  * to GPU during PCI error recovery
> +  */
> + amdgpu_device_lock_reset_domain(adev->reset_domain);
> + amdgpu_device_set_mp1_state(adev);
> +
>   switch (state) {
>   case pci_channel_io_normal:
>   return PCI_ERS_RESULT_CAN_RECOVER;

BTW: Where are we unlocking that again?

>   /* Fatal error, prepare for slot reset */
>   case pci_channel_io_frozen:
> - /*
> -  * Locking adev->reset_domain->sem will prevent any external access
> -  * to GPU during PCI error recovery
> -  */
> - amdgpu_device_lock_reset_domain(adev->reset_domain);
> - amdgpu_device_set_mp1_state(adev);
> -
>   /*
>* Block any work scheduling as we do for regular GPU reset
>* for the duration of the recovery



Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler

2022-04-14 Thread Christian König




Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky:

Lock reset domain unconditionally because on resume
we unlock it unconditionally.
This solves a mutex deadlock when handling both FATAL
and non-FATAL PCI errors one after another.

Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++---
  1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1cc488a767d8..c65f25e3a0fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5531,18 +5531,18 @@ pci_ers_result_t amdgpu_pci_error_detected(struct 
pci_dev *pdev, pci_channel_sta
  
  	adev->pci_channel_state = state;
  
+   /*
+    * Locking adev->reset_domain->sem will prevent any external access
+    * to GPU during PCI error recovery
+    */
+   amdgpu_device_lock_reset_domain(adev->reset_domain);
+   amdgpu_device_set_mp1_state(adev);
+
switch (state) {
case pci_channel_io_normal:
return PCI_ERS_RESULT_CAN_RECOVER;


BTW: Where are we unlocking that again?


/* Fatal error, prepare for slot reset */
case pci_channel_io_frozen:
-   /*
-    * Locking adev->reset_domain->sem will prevent any external access
-    * to GPU during PCI error recovery
-    */
-   amdgpu_device_lock_reset_domain(adev->reset_domain);
-   amdgpu_device_set_mp1_state(adev);
-
/*
 * Block any work scheduling as we do for regular GPU reset
 * for the duration of the recovery




Re: [PATCH v2] drm/amdgpu: Fix one use-after-free of VM

2022-04-14 Thread Christian König

Am 14.04.22 um 07:03 schrieb xinhui pan:

VM might already be freed when amdgpu_vm_tlb_seq_cb() is called.
We see the calltrace below.

Fix it by keeping the last flush fence around and waiting for it to signal.

BUG kmalloc-4k (Not tainted): Poison overwritten

0x9c88630414e8-0x9c88630414e8 @offset=5352. First byte 0x6c
instead of 0x6b Allocated in amdgpu_driver_open_kms+0x9d/0x360 [amdgpu]
age=44 cpu=0 pid=2343
  __slab_alloc.isra.0+0x4f/0x90
  kmem_cache_alloc_trace+0x6b8/0x7a0
  amdgpu_driver_open_kms+0x9d/0x360 [amdgpu]
  drm_file_alloc+0x222/0x3e0 [drm]
  drm_open+0x11d/0x410 [drm]
Freed in amdgpu_driver_postclose_kms+0x3e9/0x550 [amdgpu] age=22 cpu=1
pid=2485
  kfree+0x4a2/0x580
  amdgpu_driver_postclose_kms+0x3e9/0x550 [amdgpu]
  drm_file_free+0x24e/0x3c0 [drm]
  drm_close_helper.isra.0+0x90/0xb0 [drm]
  drm_release+0x97/0x1a0 [drm]
  __fput+0xb6/0x280
  fput+0xe/0x10
  task_work_run+0x64/0xb0

Suggested-by: Christian König 
Signed-off-by: xinhui pan 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 17 ++---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h |  1 +
  2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 645ce28277c2..cd5aa7edd451 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -932,9 +932,12 @@ int amdgpu_vm_update_range(struct amdgpu_device *adev, 
struct amdgpu_vm *vm,
  
  	if (flush_tlb || params.table_freed) {

tlb_cb->vm = vm;
-   if (!fence || !*fence ||
-   dma_fence_add_callback(*fence, &tlb_cb->cb,
-  amdgpu_vm_tlb_seq_cb))
+   if (fence && *fence &&
+   !dma_fence_add_callback(*fence, &tlb_cb->cb,
+  amdgpu_vm_tlb_seq_cb)) {
+   dma_fence_put(vm->last_tlb_flush);
+   vm->last_tlb_flush = dma_fence_get(*fence);
+   } else
amdgpu_vm_tlb_seq_cb(NULL, _cb->cb);


One style nit pick here: When one side of an if/else has {}, the other 
side should have {} as well.


Apart from that the patch is Reviewed-by: Christian König 



Thanks,
Christian.


tlb_cb = NULL;
}
@@ -2094,6 +2097,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct 
amdgpu_vm *vm)
	vm->update_funcs = &amdgpu_vm_sdma_funcs;
vm->last_update = NULL;
vm->last_unlocked = dma_fence_get_stub();
+   vm->last_tlb_flush = dma_fence_get_stub();
  
	mutex_init(&vm->eviction_lock);

vm->evicting = false;
@@ -2132,6 +2136,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct 
amdgpu_vm *vm)
vm->root.bo = NULL;
  
  error_free_delayed:

+   dma_fence_put(vm->last_tlb_flush);
dma_fence_put(vm->last_unlocked);
	drm_sched_entity_destroy(&vm->delayed);
  
@@ -2248,6 +2253,7 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm)

struct amdgpu_bo_va_mapping *mapping, *tmp;
bool prt_fini_needed = !!adev->gmc.gmc_funcs->set_prt;
struct amdgpu_bo *root;
+   unsigned long flags;
int i;
  
  	amdgpu_amdkfd_gpuvm_destroy_cb(adev, vm);

@@ -2257,6 +2263,11 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct 
amdgpu_vm *vm)
amdgpu_vm_set_pasid(adev, vm, 0);
dma_fence_wait(vm->last_unlocked, false);
dma_fence_put(vm->last_unlocked);
+   dma_fence_wait(vm->last_tlb_flush, false);
+   /* Make sure that all fence callbacks have completed */
+   spin_lock_irqsave(vm->last_tlb_flush->lock, flags);
+   spin_unlock_irqrestore(vm->last_tlb_flush->lock, flags);
+   dma_fence_put(vm->last_tlb_flush);
  
  	list_for_each_entry_safe(mapping, tmp, >freed, list) {

if (mapping->flags & AMDGPU_PTE_PRT && prt_fini_needed) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index 1a814fb8..6b06a214f05f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -286,6 +286,7 @@ struct amdgpu_vm {
  
  	/* Last finished delayed update */

atomic64_t  tlb_seq;
+   struct dma_fence*last_tlb_flush;
  
  	/* Last unlocked submission to the scheduler entities */

struct dma_fence*last_unlocked;




Re: [PATCH] drm/radeon: Add build directory to include path

2022-04-14 Thread Christian König

Am 13.04.22 um 18:14 schrieb Michel Dänzer:

From: Michel Dänzer 

Fixes compile errors with out-of-tree builds, e.g.

../drivers/gpu/drm/radeon/r420.c:38:10: fatal error: r420_reg_safe.h: No such 
file or directory
38 | #include "r420_reg_safe.h"
   |  ^



Well stuff like that usually points to a broken build environment.

Christian.



Signed-off-by: Michel Dänzer 
---
  drivers/gpu/drm/radeon/Makefile | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/radeon/Makefile b/drivers/gpu/drm/radeon/Makefile
index 11c97edde54d..37caf5236048 100644
--- a/drivers/gpu/drm/radeon/Makefile
+++ b/drivers/gpu/drm/radeon/Makefile
@@ -3,6 +3,8 @@
  # Makefile for the drm device driver.  This driver provides support for the
  # Direct Rendering Infrastructure (DRI) in XFree86 4.1.0 and higher.
  
+ccflags-y += -I$(src)
+
  hostprogs := mkregtable
  targets := rn50_reg_safe.h r100_reg_safe.h r200_reg_safe.h rv515_reg_safe.h 
r300_reg_safe.h r420_reg_safe.h rs600_reg_safe.h r600_reg_safe.h 
evergreen_reg_safe.h cayman_reg_safe.h