RE: [PATCH v4] drm/amdgpu : Add register read/write debugfs support for AID's

2023-12-19 Thread Kamal, Asad
[AMD Official Use Only - General]

Reviewed-by: Asad Kamal 

Thanks & Regards
Asad

-Original Message-
From: amd-gfx  On Behalf Of Mangesh Gadre
Sent: Wednesday, December 20, 2023 10:43 AM
To: amd-gfx@lists.freedesktop.org; Zhang, Hawking ; 
Lazar, Lijo 
Cc: Gadre, Mangesh ; Koenig, Christian 

Subject: [PATCH v4] drm/amdgpu : Add register read/write debugfs support for 
AID's

SMN address is larger than 32 bits for registers on different AID's Updating 
existing interface to support access to such registers.

Signed-off-by: Mangesh Gadre 
Reviewed-by: Christian König 
---
v2 : Adding hardware family check for creating
 debugfs interface for PCIe register access
v3 : Instead of creating new debugfs interface,now using
 existing interface with address range check for
 call to appropriate interface (Lijo)
v4 : Using available helper instead of explicit right
 shift operations (Christian)


 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 96d634bfa448..391af8060704 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -559,7 +559,11 @@ static ssize_t amdgpu_debugfs_regs_pcie_read(struct file 
*f, char __user *buf,
while (size) {
uint32_t value;

-   value = RREG32_PCIE(*pos);
+   if (upper_32_bits(*pos))
+   value = RREG32_PCIE_EXT(*pos);
+   else
+   value = RREG32_PCIE(*pos);
+
r = put_user(value, (uint32_t *)buf);
if (r)
goto out;
@@ -619,7 +623,10 @@ static ssize_t amdgpu_debugfs_regs_pcie_write(struct file 
*f, const char __user
if (r)
goto out;

-   WREG32_PCIE(*pos, value);
+   if (upper_32_bits(*pos))
+   WREG32_PCIE_EXT(*pos, value);
+   else
+   WREG32_PCIE(*pos, value);

result += 4;
buf += 4;
--
2.34.1



[PATCH v4] drm/amdgpu : Add register read/write debugfs support for AID's

2023-12-19 Thread Mangesh Gadre
SMN address is larger than 32 bits for registers on different AID's
Updating existing interface to support access to such registers.

Signed-off-by: Mangesh Gadre 
Reviewed-by: Christian König 
---
v2 : Adding hardware family check for creating
 debugfs interface for PCIe register access
v3 : Instead of creating new debugfs interface,now using
 existing interface with address range check for
 call to appropriate interface (Lijo)
v4 : Using available helper instead of explicit right 
 shift operations (Christian)


 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 96d634bfa448..391af8060704 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -559,7 +559,11 @@ static ssize_t amdgpu_debugfs_regs_pcie_read(struct file 
*f, char __user *buf,
while (size) {
uint32_t value;
 
-   value = RREG32_PCIE(*pos);
+   if (upper_32_bits(*pos))
+   value = RREG32_PCIE_EXT(*pos);
+   else
+   value = RREG32_PCIE(*pos);
+
r = put_user(value, (uint32_t *)buf);
if (r)
goto out;
@@ -619,7 +623,10 @@ static ssize_t amdgpu_debugfs_regs_pcie_write(struct file 
*f, const char __user
if (r)
goto out;
 
-   WREG32_PCIE(*pos, value);
+   if (upper_32_bits(*pos))
+   WREG32_PCIE_EXT(*pos, value);
+   else
+   WREG32_PCIE(*pos, value);
 
result += 4;
buf += 4;
-- 
2.34.1



Re: [PATCH v3 1/2] drm/amdgpu: Auto-validate DMABuf imports in compute VMs

2023-12-19 Thread Felix Kuehling


On 2023-12-19 3:10, Christian König wrote:

Am 15.12.23 um 16:19 schrieb Felix Kuehling:


On 2023-12-15 07:30, Christian König wrote:
@@ -1425,11 +1451,21 @@ int amdgpu_vm_handle_moved(struct 
amdgpu_device *adev,

  }
    r = amdgpu_vm_bo_update(adev, bo_va, clear);
-    if (r)
-    return r;
    if (unlock)
  dma_resv_unlock(resv);
+    if (r)
+    return r;
+
+    /* Remember evicted DMABuf imports in compute VMs for later
+ * validation
+ */
+    if (vm->is_compute_context && bo_va->base.bo &&
+    bo_va->base.bo->tbo.base.import_attach &&
+    (!bo_va->base.bo->tbo.resource ||
+ bo_va->base.bo->tbo.resource->mem_type == TTM_PL_SYSTEM))
+    amdgpu_vm_bo_evicted(&bo_va->base);
+


The change looks mostly good now. Just one thing which worries me is 
that when GFX and compute is mixed in the same VM this here might 
cause problems when we run into an error during command submission.


E.g. GFX validates the VM BOs, but then the IOCTL fails before 
calling amdgpu_vm_handle_moved().


In this case the DMA-buf wouldn't be validated any more.


This code path shouldn't be relevant for command submission, but for 
the amdgpu_vm_handle_moved call in amdgpu_dma_buf_move_notify. That's 
where the BO is first found to be evicted and its PTEs invalidated. 
That's where we need to remember it so it can be validated in the 
next call to amdgpu_vm_validate.


Currently the amdgpu_cs code path calls amdgpu_vm_validate with 
ticket=NULL, so it won't validate these imports. The only place that 
auto-validates evicted imports is amdgpu_amdkfd_restore_process_bos. 
So none of this should affect amdgpu_cs command submission.


Yeah, but ticket=NULL will result in removing those imports from the 
validation list.


I have a comment for that in amdgpu_vm_validate:

if (!ticket) {
/* We need to move the BO out of the evicted
 * list to avoid an infinite loop. It will be
 * moved back to evicted in the next
 * amdgpu_vm_handle_moved.
 */
amdgpu_vm_bo_invalidated(bo_base);
spin_lock(&vm->status_lock);
continue;
}

The net result is that the BO is still tracked as evicted.


This could potentially result in not validating them should we ever 
mix GFX and Compute submissions in the same VM.


My eviction test does exactly this, and it's working just fine.

Regards,
  Felix




For now that works, but we need to clean that up more thoughtfully I 
think.


Christian.




The flow for amdgpu_amdkfd_restore_process_bos is:

 * amdgpu_vm_validate validates the evicted dmabuf imports (moves the
   bo_vas from "evicted" to "invalidated")
 * amdgpu_vm_handle_moved iterates over invalidated bo_vas and updates
   the PTEs with valid entries (moves the bo_vas to "done")


Regards,
  Felix




Regards,
Christian. 


Re: Regression in 6.6: trying to set DPMS mode kills radeon (r600)

2023-12-19 Thread Holger Hoffstätte

On 2023-12-19 19:46, Alex Deucher wrote:

On Mon, Dec 18, 2023 at 1:52 PM Holger Hoffstätte
 wrote:


On 2023-12-16 18:36, Holger Hoffstätte wrote:



The affected machine is an older SandyBridge dektop with a fanless
r600 Redwood GPU, using the radeon driver. "Recently" - some time
after the last few 6.6.x stable updates - it started to die with GPU
lockups. I first blamed this on standby/resume - because why not? - but
this turned out to be wrong; the real culprit is DPMS.

I use xfce-power-manager as "screensaver" to turn off the display after
inacitvity. This can be configured in two ways: "suspend" and "poweroff".
I've been using "poweroff" since forever without problems, until now.

The symptom is that everything works fine until the screensaver kicks in
and tries to turn the monitor off, which sends the radeon driver and the GPU
into a complete tailspin.





Eventually the screensaver tries to switch off the monitor via DPMS "poweroff" 
method and
this greatly upsets the GPU:

Dec 12 20:39:59 ragnarok kernel: radeon :01:00.0: ring 0 stalled for more 
than 10140msec
Dec 12 20:39:59 ragnarok kernel: radeon :01:00.0: GPU lockup (current fence 
id 0x0002 last fence id 0x0003 on ring 0)


In the meantime I have confirmed that all this is still more complicated:
even using the "suspend" method only works after boot, not after a system 
suspend
cycle. Yes, weird but reproducible.

I have tried to chase down the problematic release, and as suspected this
started to happen with 6.6.5; 6.6.4 is fine.

Based on this information I found the offending commits and reverted them
in order from 6.6.7, which fixes everything for me:

b0399e22ada0 "drm/amd/display: Remove power sequencing check"
45f98fccb1f6 "drm/amd/display: Refactor edp power control"


Those patches are for amdgpu.  From the logs in your original post,
you are using the radeon driver.  They two are completely separate
drivers.  I don't see how those patches could be related.  That code
would never even execute.


Hi,

I understand the difference between amdgpu and radeon, that's why I was
wondering why those patches would make a difference.

The crash/no-crash behaviour was definitely reproducible - same config
and clean rebuild every time etc. My only guess was that maybe one of the
touched headers got included in the drm-display-helper used by radeon as
well, but that is seemingly not the case either.

In any case, it seems that whatever was going on is fixed in stable-6.6.8-rc1;
at least I haven't been able to reproduce the lockup so far, with various
combinations of display suspend/resume. There's at least one EDID-related patch
in 6.6.8 but I don' understand enough about the various display technologies to
assess whether that could have played a role.

You can probably imagine how frustrating it is to have a GPU that deadlocks 
while
_not_ doing anything. At least it seems to be working again now, either way.

Thanks for reading!

cheers
Holger


Re: [PATCH] drm/amd/display: remove redundant initialization of variable remainder

2023-12-19 Thread Alex Deucher
Applied.  Thanks!

On Tue, Dec 19, 2023 at 12:40 PM Colin Ian King  wrote:
>
> Variable remainder is being initialized with a value that is never read,
> the assignment is redundant and can be removed. Also add a newline
> after the declaration to clean up the coding style.
>
> Signed-off-by: Colin Ian King 
> ---
>  drivers/gpu/drm/amd/display/dc/basics/conversion.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/display/dc/basics/conversion.c 
> b/drivers/gpu/drm/amd/display/dc/basics/conversion.c
> index e295a839ab47..1090d235086a 100644
> --- a/drivers/gpu/drm/amd/display/dc/basics/conversion.c
> +++ b/drivers/gpu/drm/amd/display/dc/basics/conversion.c
> @@ -103,7 +103,8 @@ void convert_float_matrix(
>
>  static uint32_t find_gcd(uint32_t a, uint32_t b)
>  {
> -   uint32_t remainder = 0;
> +   uint32_t remainder;
> +
> while (b != 0) {
> remainder = a % b;
> a = b;
> --
> 2.39.2
>


Re: Regression in 6.6: trying to set DPMS mode kills radeon (r600)

2023-12-19 Thread Alex Deucher
On Mon, Dec 18, 2023 at 1:52 PM Holger Hoffstätte
 wrote:
>
> On 2023-12-16 18:36, Holger Hoffstätte wrote:
>
> 
> > The affected machine is an older SandyBridge dektop with a fanless
> > r600 Redwood GPU, using the radeon driver. "Recently" - some time
> > after the last few 6.6.x stable updates - it started to die with GPU
> > lockups. I first blamed this on standby/resume - because why not? - but
> > this turned out to be wrong; the real culprit is DPMS.
> >
> > I use xfce-power-manager as "screensaver" to turn off the display after
> > inacitvity. This can be configured in two ways: "suspend" and "poweroff".
> > I've been using "poweroff" since forever without problems, until now.
> >
> > The symptom is that everything works fine until the screensaver kicks in
> > and tries to turn the monitor off, which sends the radeon driver and the GPU
> > into a complete tailspin.
>
> 
>
> > Eventually the screensaver tries to switch off the monitor via DPMS 
> > "poweroff" method and
> > this greatly upsets the GPU:
> >
> > Dec 12 20:39:59 ragnarok kernel: radeon :01:00.0: ring 0 stalled for 
> > more than 10140msec
> > Dec 12 20:39:59 ragnarok kernel: radeon :01:00.0: GPU lockup (current 
> > fence id 0x0002 last fence id 0x0003 on ring 0)
>
> In the meantime I have confirmed that all this is still more complicated:
> even using the "suspend" method only works after boot, not after a system 
> suspend
> cycle. Yes, weird but reproducible.
>
> I have tried to chase down the problematic release, and as suspected this
> started to happen with 6.6.5; 6.6.4 is fine.
>
> Based on this information I found the offending commits and reverted them
> in order from 6.6.7, which fixes everything for me:
>
> b0399e22ada0 "drm/amd/display: Remove power sequencing check"
> 45f98fccb1f6 "drm/amd/display: Refactor edp power control"

Those patches are for amdgpu.  From the logs in your original post,
you are using the radeon driver.  They two are completely separate
drivers.  I don't see how those patches could be related.  That code
would never even execute.

Alex


[PATCH] drm/amd/display: remove redundant initialization of variable remainder

2023-12-19 Thread Colin Ian King
Variable remainder is being initialized with a value that is never read,
the assignment is redundant and can be removed. Also add a newline
after the declaration to clean up the coding style.

Signed-off-by: Colin Ian King 
---
 drivers/gpu/drm/amd/display/dc/basics/conversion.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/dc/basics/conversion.c 
b/drivers/gpu/drm/amd/display/dc/basics/conversion.c
index e295a839ab47..1090d235086a 100644
--- a/drivers/gpu/drm/amd/display/dc/basics/conversion.c
+++ b/drivers/gpu/drm/amd/display/dc/basics/conversion.c
@@ -103,7 +103,8 @@ void convert_float_matrix(
 
 static uint32_t find_gcd(uint32_t a, uint32_t b)
 {
-   uint32_t remainder = 0;
+   uint32_t remainder;
+
while (b != 0) {
remainder = a % b;
a = b;
-- 
2.39.2



Re: [PATCH] drm/amdgpu/gfx11: need acquire mutex before access CP_VMID_RESET

2023-12-19 Thread Alex Deucher
On Tue, Dec 19, 2023 at 4:30 AM Jack Xiao  wrote:
>
> It's required to take the gfx mutex before access to CP_VMID_RESET,
> for there is a race condition with CP firmware to write the register.
>
> Signed-off-by: Jack Xiao 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 20 
>  1 file changed, 20 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> index bdcf96df69e6..ae3370d34d11 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> @@ -4518,6 +4518,22 @@ static int gfx_v11_0_soft_reset(void *handle)
> }
> }
>

We probably want a CPU mutex or spinlock to protect this as well
unless this is already protected by the reset lock.

Alex

> +   /* Try to require the gfx mutex before access to CP_VMID_RESET */
> +   for (i = 0; i < adev->usec_timeout; i++) {
> +   /* Request with MeId=2, PipeId=0 */
> +   tmp = REG_SET_FIELD(0, CP_GFX_INDEX_MUTEX, REQUEST, 1);
> +   tmp = REG_SET_FIELD(tmp, CP_GFX_INDEX_MUTEX, CLIENTID, 4);
> +   WREG32_SOC15(GC, 0, regCP_GFX_INDEX_MUTEX, tmp);
> +   if (RREG32_SOC15(GC, 0, regCP_GFX_INDEX_MUTEX) == tmp)
> +   break;
> +   udelay(1);
> +   }
> +
> +   if (i >= adev->usec_timeout) {
> +   printk("Failed to require the gfx mutex during soft reset\n");
> +   return -EINVAL;
> +   }
> +
> WREG32_SOC15(GC, 0, regCP_VMID_RESET, 0xfffe);
>
> // Read CP_VMID_RESET register three times.
> @@ -4526,6 +4542,10 @@ static int gfx_v11_0_soft_reset(void *handle)
> RREG32_SOC15(GC, 0, regCP_VMID_RESET);
> RREG32_SOC15(GC, 0, regCP_VMID_RESET);
>
> +   /* release the gfx mutex */
> +   tmp = REG_SET_FIELD(tmp, CP_GFX_INDEX_MUTEX, REQUEST, 0);
> +   WREG32_SOC15(GC, 0, regCP_GFX_INDEX_MUTEX, tmp);
> +
> for (i = 0; i < adev->usec_timeout; i++) {
> if (!RREG32_SOC15(GC, 0, regCP_HQD_ACTIVE) &&
> !RREG32_SOC15(GC, 0, regCP_GFX_HQD_ACTIVE))
> --
> 2.41.0
>


Re: regression/bisected/6.7rc1: Instead of desktop I see a horizontal flashing bar with a picture of the desktop background on white screen

2023-12-19 Thread Alex Deucher
On Mon, Dec 18, 2023 at 1:31 PM Mikhail Gavrilov
 wrote:
>
> On Fri, Dec 15, 2023 at 9:14 PM Hamza Mahfooz  wrote:
> >
> > Can you try the following patch with old fw (version 0x07002100 should
> > be fine)?: https://patchwork.freedesktop.org/patch/572298/
> >
>
> Tested-by: Mikhail Gavrilov  on 7900XTX 
> hardware.
>
> Can I ask?
> What does SubVP actually do?
> I read on Phoronix that this is new feature of DCN 3.2 hardware
> https://www.phoronix.com/news/AMDGPU-Linux-6.5-Improvements
> But I didn't notice that anything began to work better after enabling
> this feature.
> On the contrary, my kernel logs began to become overgrown with
> unpleasant errors.
> See here: https://gitlab.freedesktop.org/drm/amd/-/issues/2796
> I bisected this issue and bisect heads me to commit
> 299004271cbf0315da327c4bd67aec3e7041cb32 which enables SubVP high
> refresh rate.
> But without SubVP I also had 120Hz and 4K. So I ask again what is the
> profit of SubVP?

IIRC, SubVP is a power saving feature which allows us to dynamically
adjust the memory clocks in more cases with narrow blanking periods.

Alex


[PATCH] drm/amdgpu: Use kzalloc instead of kmalloc+__GFP_ZERO in amdgpu_ras.c

2023-12-19 Thread Srinivasan Shanmugam
Fixes the below smatch warnings:

drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2543 amdgpu_ras_recovery_init() warn: 
Please consider using kzalloc instead of kmalloc
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2830 amdgpu_ras_init() warn: Please 
consider using kzalloc instead of kmalloc

Cc: Christian König 
Cc: Alex Deucher 
Signed-off-by: Srinivasan Shanmugam 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index bad62141f708..e541e6925918 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2540,7 +2540,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
return 0;
 
data = &con->eh_data;
-   *data = kmalloc(sizeof(**data), GFP_KERNEL | __GFP_ZERO);
+   *data = kzalloc(sizeof(**data), GFP_KERNEL);
if (!*data) {
ret = -ENOMEM;
goto out;
@@ -2827,10 +2827,10 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
if (con)
return 0;
 
-   con = kmalloc(sizeof(struct amdgpu_ras) +
+   con = kzalloc(sizeof(*con) +
sizeof(struct ras_manager) * AMDGPU_RAS_BLOCK_COUNT +
sizeof(struct ras_manager) * AMDGPU_RAS_MCA_BLOCK_COUNT,
-   GFP_KERNEL|__GFP_ZERO);
+   GFP_KERNEL);
if (!con)
return -ENOMEM;
 
-- 
2.34.1



[PATCH] drm/amdgpu: Cleanup indenting in amdgpu_connector_dvi_detect()

2023-12-19 Thread Srinivasan Shanmugam
drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c:1106 
amdgpu_connector_dvi_detect() warn: inconsistent indenting

'Fixes: 760817a60724 ("drm/amdgpu: Refactor
'amdgpu_connector_dvi_detect' in amdgpu_connectors.c")'
Cc: Christian König 
Cc: Alex Deucher 
Cc: "Pan, Xinhui" 
Cc: Rodrigo Siqueira 
Cc: Aurabindo Pillai 
Signed-off-by: Srinivasan Shanmugam 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c
index 96f63fd39b9e..9caba10315a8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c
@@ -1103,7 +1103,7 @@ amdgpu_connector_dvi_detect(struct drm_connector 
*connector, bool force)
 * DDC line.  The latter is more complex because with 
DVI<->HDMI adapters
 * you don't really know what's connected to which port 
as both are digital.
 */
-amdgpu_connector_shared_ddc(&ret, connector, 
amdgpu_connector);
+   amdgpu_connector_shared_ddc(&ret, connector, 
amdgpu_connector);
}
}
 
-- 
2.34.1



[PATCH Review V2 1/1] drm/amdgpu: Fix ecc irq enable/disable unpaired

2023-12-19 Thread Stanley . Yang
The ecc_irq is disabled while GPU mode2 reset suspending process,
but not be enabled during GPU mode2 reset resume process.

Changed from V1:
only do sdma/gfx ras_late_init in aldebaran_mode2_restore_ip,
delete amdgpu_ras_late_resume function.

Signed-off-by: Stanley.Yang 
---
 drivers/gpu/drm/amd/amdgpu/aldebaran.c | 28 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c |  3 +++
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c |  4 
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  |  3 +++
 4 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/aldebaran.c 
b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
index 02f4c6f9d4f6..b60a3c1bd0f2 100644
--- a/drivers/gpu/drm/amd/amdgpu/aldebaran.c
+++ b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
@@ -330,6 +330,7 @@ aldebaran_mode2_restore_hwcontext(struct 
amdgpu_reset_control *reset_ctl,
 {
struct list_head *reset_device_list = reset_context->reset_device_list;
struct amdgpu_device *tmp_adev = NULL;
+   struct amdgpu_ras *con;
int r;
 
if (reset_device_list == NULL)
@@ -355,7 +356,32 @@ aldebaran_mode2_restore_hwcontext(struct 
amdgpu_reset_control *reset_ctl,
 */
amdgpu_register_gpu_instance(tmp_adev);
 
-   /* Resume RAS */
+   /* Resume RAS, ecc_irq */
+   con = amdgpu_ras_get_context(tmp_adev);
+   if (!amdgpu_sriov_vf(tmp_adev) && con) {
+   if (tmp_adev->sdma.ras &&
+   amdgpu_ras_is_supported(tmp_adev, 
AMDGPU_RAS_BLOCK__SDMA) &&
+   tmp_adev->sdma.ras->ras_block.ras_late_init) {
+   r = 
tmp_adev->sdma.ras->ras_block.ras_late_init(tmp_adev,
+   
&tmp_adev->sdma.ras->ras_block.ras_comm);
+   if (r) {
+   dev_err(tmp_adev->dev, "SDMA failed to 
execute ras_late_init! ret:%d\n", r);
+   goto end;
+   }
+   }
+
+   if (tmp_adev->gfx.ras &&
+   amdgpu_ras_is_supported(tmp_adev, 
AMDGPU_RAS_BLOCK__GFX) &&
+   tmp_adev->gfx.ras->ras_block.ras_late_init) {
+   r = 
tmp_adev->gfx.ras->ras_block.ras_late_init(tmp_adev,
+   
&tmp_adev->gfx.ras->ras_block.ras_comm);
+   if (r) {
+   dev_err(tmp_adev->dev, "GFX failed to 
execute ras_late_init! ret:%d\n", r);
+   goto end;
+   }
+   }
+   }
+
amdgpu_ras_resume(tmp_adev);
 
/* Update PSP FW topology after reset */
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index 09cbca596bb5..b93a0baeb2d3 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -1043,6 +1043,9 @@ static int gmc_v10_0_hw_fini(void *handle)
 
amdgpu_irq_put(adev, &adev->gmc.vm_fault, 0);
 
+   if (adev->gmc.ecc_irq.funcs)
+   amdgpu_irq_put(adev, &adev->gmc.ecc_irq, 0);
+
return 0;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
index 416f3e4f0438..e633e60850b3 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -941,6 +941,10 @@ static int gmc_v11_0_hw_fini(void *handle)
}
 
amdgpu_irq_put(adev, &adev->gmc.vm_fault, 0);
+
+   if (adev->gmc.ecc_irq.funcs)
+   amdgpu_irq_put(adev, &adev->gmc.ecc_irq, 0);
+
gmc_v11_0_gart_disable(adev);
 
return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 205db28a9803..8ac4d5b7fb37 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -2388,6 +2388,9 @@ static int gmc_v9_0_hw_fini(void *handle)
 
amdgpu_irq_put(adev, &adev->gmc.vm_fault, 0);
 
+   if (adev->gmc.ecc_irq.funcs)
+   amdgpu_irq_put(adev, &adev->gmc.ecc_irq, 0);
+
return 0;
 }
 
-- 
2.25.1



Re: [PATCH] drm/amd/pm: Remove I2C_CLASS_SPD support

2023-12-19 Thread Wolfram Sang
On Mon, Nov 13, 2023 at 12:37:15PM +0100, Heiner Kallweit wrote:
> I2C_CLASS_SPD was used to expose the EEPROM content to user space,
> via the legacy eeprom driver. Now that this driver has been removed,
> we can remove I2C_CLASS_SPD support. at24 driver with explicit
> instantiation should be used instead.
> 
> If in doubt this patch could be applied via the i2c tree.
> 
> Signed-off-by: Heiner Kallweit 

Applied to for-next, thanks!



signature.asc
Description: PGP signature


Re: [PATCH v5 00/20] remove I2C_CLASS_DDC support

2023-12-19 Thread Wolfram Sang

> I created an immutable branch for this which the buildbots will
> hopefully check over night. I will reply with comments tomorrow when I
> got the buildbot results.

Applied to for-next, thanks!



signature.asc
Description: PGP signature


RE: [PATCH] drm/amdgpu/gfx11: need acquire mutex before access CP_VMID_RESET

2023-12-19 Thread Zhang, Hawking
[AMD Official Use Only - General]

+   /* release the gfx mutex */
+   tmp = REG_SET_FIELD(tmp, CP_GFX_INDEX_MUTEX, REQUEST, 0);
+   WREG32_SOC15(GC, 0, regCP_GFX_INDEX_MUTEX, tmp);

Shall we add a check by reading back CP_GFX_INDEX_MUTEX to ensure the release 
is done correctly?

Regards,
Hawking

-Original Message-
From: Xiao, Jack 
Sent: Tuesday, December 19, 2023 17:24
To: amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Zhang, Hawking 
Cc: Xiao, Jack 
Subject: [PATCH] drm/amdgpu/gfx11: need acquire mutex before access 
CP_VMID_RESET

It's required to take the gfx mutex before access to CP_VMID_RESET, for there 
is a race condition with CP firmware to write the register.

Signed-off-by: Jack Xiao 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index bdcf96df69e6..ae3370d34d11 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -4518,6 +4518,22 @@ static int gfx_v11_0_soft_reset(void *handle)
}
}

+   /* Try to require the gfx mutex before access to CP_VMID_RESET */
+   for (i = 0; i < adev->usec_timeout; i++) {
+   /* Request with MeId=2, PipeId=0 */
+   tmp = REG_SET_FIELD(0, CP_GFX_INDEX_MUTEX, REQUEST, 1);
+   tmp = REG_SET_FIELD(tmp, CP_GFX_INDEX_MUTEX, CLIENTID, 4);
+   WREG32_SOC15(GC, 0, regCP_GFX_INDEX_MUTEX, tmp);
+   if (RREG32_SOC15(GC, 0, regCP_GFX_INDEX_MUTEX) == tmp)
+   break;
+   udelay(1);
+   }
+
+   if (i >= adev->usec_timeout) {
+   printk("Failed to require the gfx mutex during soft reset\n");
+   return -EINVAL;
+   }
+
WREG32_SOC15(GC, 0, regCP_VMID_RESET, 0xfffe);

// Read CP_VMID_RESET register three times.
@@ -4526,6 +4542,10 @@ static int gfx_v11_0_soft_reset(void *handle)
RREG32_SOC15(GC, 0, regCP_VMID_RESET);
RREG32_SOC15(GC, 0, regCP_VMID_RESET);

+   /* release the gfx mutex */
+   tmp = REG_SET_FIELD(tmp, CP_GFX_INDEX_MUTEX, REQUEST, 0);
+   WREG32_SOC15(GC, 0, regCP_GFX_INDEX_MUTEX, tmp);
+
for (i = 0; i < adev->usec_timeout; i++) {
if (!RREG32_SOC15(GC, 0, regCP_HQD_ACTIVE) &&
!RREG32_SOC15(GC, 0, regCP_GFX_HQD_ACTIVE))
--
2.41.0



Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

2023-12-19 Thread Mikhail Gavrilov
On Fri, Dec 15, 2023 at 5:37 PM Christian König
 wrote:
>
> I have no idea :)
>
>  From the logs I can see that the AMDGPU now has the proper BARs assigned:
>
> [5.722015] pci :03:00.0: [1002:73df] type 00 class 0x038000
> [5.722051] pci :03:00.0: reg 0x10: [mem
> 0xf8-0xfb 64bit pref]
> [5.722081] pci :03:00.0: reg 0x18: [mem
> 0xfc-0xfc0fff 64bit pref]
> [5.722112] pci :03:00.0: reg 0x24: [mem 0xfca0-0xfcaf]
> [5.722134] pci :03:00.0: reg 0x30: [mem 0xfcb0-0xfcb1 pref]
> [5.722368] pci :03:00.0: PME# supported from D1 D2 D3hot D3cold
> [5.722484] pci :03:00.0: 63.008 Gb/s available PCIe bandwidth,
> limited by 8.0 GT/s PCIe x8 link at :00:01.1 (capable of 252.048
> Gb/s with 16.0 GT/s PCIe x16 link)
>
> And with that the driver can work perfectly fine.
>
> Have you updated the BIOS or added/removed some other hardware? Maybe
> somebody added a quirk for your BIOS into the PCIe code or something
> like that.

No, nothing changed in hardware.
But I found the commit which fixes it.

> git bisect unfixed
92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6 is the first fixed commit
commit 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6
Author: Vasant Hegde 
Date:   Thu Sep 21 09:21:45 2023 +

iommu/amd: Introduce iommu_dev_data.flags to track device capabilities

Currently we use struct iommu_dev_data.iommu_v2 to keep track of the device
ATS, PRI, and PASID capabilities. But these capabilities can be enabled
independently (except PRI requires ATS support). Hence, replace
the iommu_v2 variable with a flags variable, which keep track of the device
capabilities.

From commit 9bf49e36d718 ("PCI/ATS: Handle sharing of PF PRI Capability
with all VFs"), device PRI/PASID is shared between PF and any associated
VFs. Hence use pci_pri_supported() and pci_pasid_features() instead of
pci_find_ext_capability() to check device PRI/PASID support.

Signed-off-by: Vasant Hegde 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Jerry Snitselaar 
Link: https://lore.kernel.org/r/20230921092147.5930-13-vasant.he...@amd.com
Signed-off-by: Joerg Roedel 

 drivers/iommu/amd/amd_iommu_types.h |  3 ++-
 drivers/iommu/amd/iommu.c   | 46 ++---
 2 files changed, 30 insertions(+), 19 deletions(-)


> git bisect log
git bisect start '--term-new=fixed' '--term-old=unfixed'
# status: waiting for both good and bad commits
# fixed: [33cc938e65a98f1d29d0a18403dbbee050dcad9a] Linux 6.7-rc4
git bisect fixed 33cc938e65a98f1d29d0a18403dbbee050dcad9a
# status: waiting for good commit(s), bad commit known
# unfixed: [ffc253263a1375a65fa6c9f62a893e9767fbebfa] Linux 6.6
git bisect unfixed ffc253263a1375a65fa6c9f62a893e9767fbebfa
# unfixed: [7d461b291e65938f15f56fe58da2303b07578a76] Merge tag
'drm-next-2023-10-31-1' of git://anongit.freedesktop.org/drm/drm
git bisect unfixed 7d461b291e65938f15f56fe58da2303b07578a76
# unfixed: [e14aec23025eeb1f2159ba34dbc1458467c4c347] s390/ap: fix AP
bus crash on early config change callback invocation
git bisect unfixed e14aec23025eeb1f2159ba34dbc1458467c4c347
# unfixed: [be3ca57cfb777ad820c6659d52e60bbdd36bf5ff] Merge tag
'media/v6.7-1' of
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
git bisect unfixed be3ca57cfb777ad820c6659d52e60bbdd36bf5ff
# fixed: [c0d12d769299e1e08338988c7745009e0db2a4a0] Merge tag
'drm-next-2023-11-10' of git://anongit.freedesktop.org/drm/drm
git bisect fixed c0d12d769299e1e08338988c7745009e0db2a4a0
# fixed: [4bbdb725a36b0d235f3b832bd0c1e885f0442d9f] Merge tag
'iommu-updates-v6.7' of
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
git bisect fixed 4bbdb725a36b0d235f3b832bd0c1e885f0442d9f
# unfixed: [25b6377007ebe1c3ede773fd6979f613386db000] Merge tag
'drm-next-2023-11-07' of git://anongit.freedesktop.org/drm/drm
git bisect unfixed 25b6377007ebe1c3ede773fd6979f613386db000
# unfixed: [67c0afb6424fee94238d9a32b97c407d0c97155e] Merge tag
'exfat-for-6.7-rc1-part2' of
git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
git bisect unfixed 67c0afb6424fee94238d9a32b97c407d0c97155e
# unfixed: [3613047280ec42a4e1350fdc1a6dd161ff4008cc] Merge tag
'v6.6-rc7' into core
git bisect unfixed 3613047280ec42a4e1350fdc1a6dd161ff4008cc
# fixed: [cedc811c76778bdef91d405717acee0de54d8db5] iommu/amd: Remove
DMA_FQ type from domain allocation path
git bisect fixed cedc811c76778bdef91d405717acee0de54d8db5
# unfixed: [b0cc5dae1ac0c18748706a4beb636e3b726dd744] iommu/amd:
Rename ats related variables
git bisect unfixed b0cc5dae1ac0c18748706a4beb636e3b726dd744
# fixed: [5a0b11a180a9b82b4437a4be1cf73530053f139b] iommu/amd: Remove
iommu_v2 module
git bisect fixed 5a0b11a180a9b82b4437a4be1cf73530053f139b
# fixed: [92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6] iommu/amd:
Introduce iommu_dev_data.flags to track device capabilities
git bisect fixed 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6
# unfixed: [739eb25514c90aa

[PATCH] drm/amdgpu/gfx11: need acquire mutex before access CP_VMID_RESET

2023-12-19 Thread Jack Xiao
It's required to take the gfx mutex before access to CP_VMID_RESET,
for there is a race condition with CP firmware to write the register.

Signed-off-by: Jack Xiao 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index bdcf96df69e6..ae3370d34d11 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -4518,6 +4518,22 @@ static int gfx_v11_0_soft_reset(void *handle)
}
}
 
+   /* Try to require the gfx mutex before access to CP_VMID_RESET */
+   for (i = 0; i < adev->usec_timeout; i++) {
+   /* Request with MeId=2, PipeId=0 */
+   tmp = REG_SET_FIELD(0, CP_GFX_INDEX_MUTEX, REQUEST, 1);
+   tmp = REG_SET_FIELD(tmp, CP_GFX_INDEX_MUTEX, CLIENTID, 4);
+   WREG32_SOC15(GC, 0, regCP_GFX_INDEX_MUTEX, tmp);
+   if (RREG32_SOC15(GC, 0, regCP_GFX_INDEX_MUTEX) == tmp)
+   break;
+   udelay(1);
+   }
+
+   if (i >= adev->usec_timeout) {
+   printk("Failed to require the gfx mutex during soft reset\n");
+   return -EINVAL;
+   }
+
WREG32_SOC15(GC, 0, regCP_VMID_RESET, 0xfffe);
 
// Read CP_VMID_RESET register three times.
@@ -4526,6 +4542,10 @@ static int gfx_v11_0_soft_reset(void *handle)
RREG32_SOC15(GC, 0, regCP_VMID_RESET);
RREG32_SOC15(GC, 0, regCP_VMID_RESET);
 
+   /* release the gfx mutex */
+   tmp = REG_SET_FIELD(tmp, CP_GFX_INDEX_MUTEX, REQUEST, 0);
+   WREG32_SOC15(GC, 0, regCP_GFX_INDEX_MUTEX, tmp);
+
for (i = 0; i < adev->usec_timeout; i++) {
if (!RREG32_SOC15(GC, 0, regCP_HQD_ACTIVE) &&
!RREG32_SOC15(GC, 0, regCP_GFX_HQD_ACTIVE))
-- 
2.41.0



Re: [PATCH v3 1/2] drm/amdgpu: Auto-validate DMABuf imports in compute VMs

2023-12-19 Thread Christian König

Am 15.12.23 um 16:19 schrieb Felix Kuehling:


On 2023-12-15 07:30, Christian König wrote:
@@ -1425,11 +1451,21 @@ int amdgpu_vm_handle_moved(struct 
amdgpu_device *adev,

  }
    r = amdgpu_vm_bo_update(adev, bo_va, clear);
-    if (r)
-    return r;
    if (unlock)
  dma_resv_unlock(resv);
+    if (r)
+    return r;
+
+    /* Remember evicted DMABuf imports in compute VMs for later
+ * validation
+ */
+    if (vm->is_compute_context && bo_va->base.bo &&
+    bo_va->base.bo->tbo.base.import_attach &&
+    (!bo_va->base.bo->tbo.resource ||
+ bo_va->base.bo->tbo.resource->mem_type == TTM_PL_SYSTEM))
+    amdgpu_vm_bo_evicted(&bo_va->base);
+


The change looks mostly good now. Just one thing which worries me is 
that when GFX and compute is mixed in the same VM this here might 
cause problems when we run into an error during command submission.


E.g. GFX validates the VM BOs, but then the IOCTL fails before 
calling amdgpu_vm_handle_moved().


In this case the DMA-buf wouldn't be validated any more.


This code path shouldn't be relevant for command submission, but for 
the amdgpu_vm_handle_moved call in amdgpu_dma_buf_move_notify. That's 
where the BO is first found to be evicted and its PTEs invalidated. 
That's where we need to remember it so it can be validated in the next 
call to amdgpu_vm_validate.


Currently the amdgpu_cs code path calls amdgpu_vm_validate with 
ticket=NULL, so it won't validate these imports. The only place that 
auto-validates evicted imports is amdgpu_amdkfd_restore_process_bos. 
So none of this should affect amdgpu_cs command submission.


Yeah, but ticket=NULL will result in removing those imports from the 
validation list. This could potentially result in not validating them 
should we ever mix GFX and Compute submissions in the same VM.


For now that works, but we need to clean that up more thoughtfully I think.

Christian.




The flow for amdgpu_amdkfd_restore_process_bos is:

 * amdgpu_vm_validate validates the evicted dmabuf imports (moves the
   bo_vas from "evicted" to "invalidated")
 * amdgpu_vm_handle_moved iterates over invalidated bo_vas and updates
   the PTEs with valid entries (moves the bo_vas to "done")


Regards,
  Felix




Regards,
Christian.