RE: [PATCH v4] drm/amdgpu : Add register read/write debugfs support for AID's
[AMD Official Use Only - General] Reviewed-by: Asad Kamal Thanks & Regards Asad -Original Message- From: amd-gfx On Behalf Of Mangesh Gadre Sent: Wednesday, December 20, 2023 10:43 AM To: amd-gfx@lists.freedesktop.org; Zhang, Hawking ; Lazar, Lijo Cc: Gadre, Mangesh ; Koenig, Christian Subject: [PATCH v4] drm/amdgpu : Add register read/write debugfs support for AID's SMN address is larger than 32 bits for registers on different AID's Updating existing interface to support access to such registers. Signed-off-by: Mangesh Gadre Reviewed-by: Christian König --- v2 : Adding hardware family check for creating debugfs interface for PCIe register access v3 : Instead of creating new debugfs interface,now using existing interface with address range check for call to appropriate interface (Lijo) v4 : Using available helper instead of explicit right shift operations (Christian) drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c index 96d634bfa448..391af8060704 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c @@ -559,7 +559,11 @@ static ssize_t amdgpu_debugfs_regs_pcie_read(struct file *f, char __user *buf, while (size) { uint32_t value; - value = RREG32_PCIE(*pos); + if (upper_32_bits(*pos)) + value = RREG32_PCIE_EXT(*pos); + else + value = RREG32_PCIE(*pos); + r = put_user(value, (uint32_t *)buf); if (r) goto out; @@ -619,7 +623,10 @@ static ssize_t amdgpu_debugfs_regs_pcie_write(struct file *f, const char __user if (r) goto out; - WREG32_PCIE(*pos, value); + if (upper_32_bits(*pos)) + WREG32_PCIE_EXT(*pos, value); + else + WREG32_PCIE(*pos, value); result += 4; buf += 4; -- 2.34.1
[PATCH v4] drm/amdgpu : Add register read/write debugfs support for AID's
SMN address is larger than 32 bits for registers on different AID's Updating existing interface to support access to such registers. Signed-off-by: Mangesh Gadre Reviewed-by: Christian König --- v2 : Adding hardware family check for creating debugfs interface for PCIe register access v3 : Instead of creating new debugfs interface,now using existing interface with address range check for call to appropriate interface (Lijo) v4 : Using available helper instead of explicit right shift operations (Christian) drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c index 96d634bfa448..391af8060704 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c @@ -559,7 +559,11 @@ static ssize_t amdgpu_debugfs_regs_pcie_read(struct file *f, char __user *buf, while (size) { uint32_t value; - value = RREG32_PCIE(*pos); + if (upper_32_bits(*pos)) + value = RREG32_PCIE_EXT(*pos); + else + value = RREG32_PCIE(*pos); + r = put_user(value, (uint32_t *)buf); if (r) goto out; @@ -619,7 +623,10 @@ static ssize_t amdgpu_debugfs_regs_pcie_write(struct file *f, const char __user if (r) goto out; - WREG32_PCIE(*pos, value); + if (upper_32_bits(*pos)) + WREG32_PCIE_EXT(*pos, value); + else + WREG32_PCIE(*pos, value); result += 4; buf += 4; -- 2.34.1
Re: [PATCH v3 1/2] drm/amdgpu: Auto-validate DMABuf imports in compute VMs
On 2023-12-19 3:10, Christian König wrote: Am 15.12.23 um 16:19 schrieb Felix Kuehling: On 2023-12-15 07:30, Christian König wrote: @@ -1425,11 +1451,21 @@ int amdgpu_vm_handle_moved(struct amdgpu_device *adev, } r = amdgpu_vm_bo_update(adev, bo_va, clear); - if (r) - return r; if (unlock) dma_resv_unlock(resv); + if (r) + return r; + + /* Remember evicted DMABuf imports in compute VMs for later + * validation + */ + if (vm->is_compute_context && bo_va->base.bo && + bo_va->base.bo->tbo.base.import_attach && + (!bo_va->base.bo->tbo.resource || + bo_va->base.bo->tbo.resource->mem_type == TTM_PL_SYSTEM)) + amdgpu_vm_bo_evicted(&bo_va->base); + The change looks mostly good now. Just one thing which worries me is that when GFX and compute is mixed in the same VM this here might cause problems when we run into an error during command submission. E.g. GFX validates the VM BOs, but then the IOCTL fails before calling amdgpu_vm_handle_moved(). In this case the DMA-buf wouldn't be validated any more. This code path shouldn't be relevant for command submission, but for the amdgpu_vm_handle_moved call in amdgpu_dma_buf_move_notify. That's where the BO is first found to be evicted and its PTEs invalidated. That's where we need to remember it so it can be validated in the next call to amdgpu_vm_validate. Currently the amdgpu_cs code path calls amdgpu_vm_validate with ticket=NULL, so it won't validate these imports. The only place that auto-validates evicted imports is amdgpu_amdkfd_restore_process_bos. So none of this should affect amdgpu_cs command submission. Yeah, but ticket=NULL will result in removing those imports from the validation list. I have a comment for that in amdgpu_vm_validate: if (!ticket) { /* We need to move the BO out of the evicted * list to avoid an infinite loop. It will be * moved back to evicted in the next * amdgpu_vm_handle_moved. */ amdgpu_vm_bo_invalidated(bo_base); spin_lock(&vm->status_lock); continue; } The net result is that the BO is still tracked as evicted. This could potentially result in not validating them should we ever mix GFX and Compute submissions in the same VM. My eviction test does exactly this, and it's working just fine. Regards, Felix For now that works, but we need to clean that up more thoughtfully I think. Christian. The flow for amdgpu_amdkfd_restore_process_bos is: * amdgpu_vm_validate validates the evicted dmabuf imports (moves the bo_vas from "evicted" to "invalidated") * amdgpu_vm_handle_moved iterates over invalidated bo_vas and updates the PTEs with valid entries (moves the bo_vas to "done") Regards, Felix Regards, Christian.
Re: Regression in 6.6: trying to set DPMS mode kills radeon (r600)
On 2023-12-19 19:46, Alex Deucher wrote: On Mon, Dec 18, 2023 at 1:52 PM Holger Hoffstätte wrote: On 2023-12-16 18:36, Holger Hoffstätte wrote: The affected machine is an older SandyBridge dektop with a fanless r600 Redwood GPU, using the radeon driver. "Recently" - some time after the last few 6.6.x stable updates - it started to die with GPU lockups. I first blamed this on standby/resume - because why not? - but this turned out to be wrong; the real culprit is DPMS. I use xfce-power-manager as "screensaver" to turn off the display after inacitvity. This can be configured in two ways: "suspend" and "poweroff". I've been using "poweroff" since forever without problems, until now. The symptom is that everything works fine until the screensaver kicks in and tries to turn the monitor off, which sends the radeon driver and the GPU into a complete tailspin. Eventually the screensaver tries to switch off the monitor via DPMS "poweroff" method and this greatly upsets the GPU: Dec 12 20:39:59 ragnarok kernel: radeon :01:00.0: ring 0 stalled for more than 10140msec Dec 12 20:39:59 ragnarok kernel: radeon :01:00.0: GPU lockup (current fence id 0x0002 last fence id 0x0003 on ring 0) In the meantime I have confirmed that all this is still more complicated: even using the "suspend" method only works after boot, not after a system suspend cycle. Yes, weird but reproducible. I have tried to chase down the problematic release, and as suspected this started to happen with 6.6.5; 6.6.4 is fine. Based on this information I found the offending commits and reverted them in order from 6.6.7, which fixes everything for me: b0399e22ada0 "drm/amd/display: Remove power sequencing check" 45f98fccb1f6 "drm/amd/display: Refactor edp power control" Those patches are for amdgpu. From the logs in your original post, you are using the radeon driver. They two are completely separate drivers. I don't see how those patches could be related. That code would never even execute. Hi, I understand the difference between amdgpu and radeon, that's why I was wondering why those patches would make a difference. The crash/no-crash behaviour was definitely reproducible - same config and clean rebuild every time etc. My only guess was that maybe one of the touched headers got included in the drm-display-helper used by radeon as well, but that is seemingly not the case either. In any case, it seems that whatever was going on is fixed in stable-6.6.8-rc1; at least I haven't been able to reproduce the lockup so far, with various combinations of display suspend/resume. There's at least one EDID-related patch in 6.6.8 but I don' understand enough about the various display technologies to assess whether that could have played a role. You can probably imagine how frustrating it is to have a GPU that deadlocks while _not_ doing anything. At least it seems to be working again now, either way. Thanks for reading! cheers Holger
Re: [PATCH] drm/amd/display: remove redundant initialization of variable remainder
Applied. Thanks! On Tue, Dec 19, 2023 at 12:40 PM Colin Ian King wrote: > > Variable remainder is being initialized with a value that is never read, > the assignment is redundant and can be removed. Also add a newline > after the declaration to clean up the coding style. > > Signed-off-by: Colin Ian King > --- > drivers/gpu/drm/amd/display/dc/basics/conversion.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/display/dc/basics/conversion.c > b/drivers/gpu/drm/amd/display/dc/basics/conversion.c > index e295a839ab47..1090d235086a 100644 > --- a/drivers/gpu/drm/amd/display/dc/basics/conversion.c > +++ b/drivers/gpu/drm/amd/display/dc/basics/conversion.c > @@ -103,7 +103,8 @@ void convert_float_matrix( > > static uint32_t find_gcd(uint32_t a, uint32_t b) > { > - uint32_t remainder = 0; > + uint32_t remainder; > + > while (b != 0) { > remainder = a % b; > a = b; > -- > 2.39.2 >
Re: Regression in 6.6: trying to set DPMS mode kills radeon (r600)
On Mon, Dec 18, 2023 at 1:52 PM Holger Hoffstätte wrote: > > On 2023-12-16 18:36, Holger Hoffstätte wrote: > > > > The affected machine is an older SandyBridge dektop with a fanless > > r600 Redwood GPU, using the radeon driver. "Recently" - some time > > after the last few 6.6.x stable updates - it started to die with GPU > > lockups. I first blamed this on standby/resume - because why not? - but > > this turned out to be wrong; the real culprit is DPMS. > > > > I use xfce-power-manager as "screensaver" to turn off the display after > > inacitvity. This can be configured in two ways: "suspend" and "poweroff". > > I've been using "poweroff" since forever without problems, until now. > > > > The symptom is that everything works fine until the screensaver kicks in > > and tries to turn the monitor off, which sends the radeon driver and the GPU > > into a complete tailspin. > > > > > Eventually the screensaver tries to switch off the monitor via DPMS > > "poweroff" method and > > this greatly upsets the GPU: > > > > Dec 12 20:39:59 ragnarok kernel: radeon :01:00.0: ring 0 stalled for > > more than 10140msec > > Dec 12 20:39:59 ragnarok kernel: radeon :01:00.0: GPU lockup (current > > fence id 0x0002 last fence id 0x0003 on ring 0) > > In the meantime I have confirmed that all this is still more complicated: > even using the "suspend" method only works after boot, not after a system > suspend > cycle. Yes, weird but reproducible. > > I have tried to chase down the problematic release, and as suspected this > started to happen with 6.6.5; 6.6.4 is fine. > > Based on this information I found the offending commits and reverted them > in order from 6.6.7, which fixes everything for me: > > b0399e22ada0 "drm/amd/display: Remove power sequencing check" > 45f98fccb1f6 "drm/amd/display: Refactor edp power control" Those patches are for amdgpu. From the logs in your original post, you are using the radeon driver. They two are completely separate drivers. I don't see how those patches could be related. That code would never even execute. Alex
[PATCH] drm/amd/display: remove redundant initialization of variable remainder
Variable remainder is being initialized with a value that is never read, the assignment is redundant and can be removed. Also add a newline after the declaration to clean up the coding style. Signed-off-by: Colin Ian King --- drivers/gpu/drm/amd/display/dc/basics/conversion.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/display/dc/basics/conversion.c b/drivers/gpu/drm/amd/display/dc/basics/conversion.c index e295a839ab47..1090d235086a 100644 --- a/drivers/gpu/drm/amd/display/dc/basics/conversion.c +++ b/drivers/gpu/drm/amd/display/dc/basics/conversion.c @@ -103,7 +103,8 @@ void convert_float_matrix( static uint32_t find_gcd(uint32_t a, uint32_t b) { - uint32_t remainder = 0; + uint32_t remainder; + while (b != 0) { remainder = a % b; a = b; -- 2.39.2
Re: [PATCH] drm/amdgpu/gfx11: need acquire mutex before access CP_VMID_RESET
On Tue, Dec 19, 2023 at 4:30 AM Jack Xiao wrote: > > It's required to take the gfx mutex before access to CP_VMID_RESET, > for there is a race condition with CP firmware to write the register. > > Signed-off-by: Jack Xiao > --- > drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 20 > 1 file changed, 20 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c > b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c > index bdcf96df69e6..ae3370d34d11 100644 > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c > @@ -4518,6 +4518,22 @@ static int gfx_v11_0_soft_reset(void *handle) > } > } > We probably want a CPU mutex or spinlock to protect this as well unless this is already protected by the reset lock. Alex > + /* Try to require the gfx mutex before access to CP_VMID_RESET */ > + for (i = 0; i < adev->usec_timeout; i++) { > + /* Request with MeId=2, PipeId=0 */ > + tmp = REG_SET_FIELD(0, CP_GFX_INDEX_MUTEX, REQUEST, 1); > + tmp = REG_SET_FIELD(tmp, CP_GFX_INDEX_MUTEX, CLIENTID, 4); > + WREG32_SOC15(GC, 0, regCP_GFX_INDEX_MUTEX, tmp); > + if (RREG32_SOC15(GC, 0, regCP_GFX_INDEX_MUTEX) == tmp) > + break; > + udelay(1); > + } > + > + if (i >= adev->usec_timeout) { > + printk("Failed to require the gfx mutex during soft reset\n"); > + return -EINVAL; > + } > + > WREG32_SOC15(GC, 0, regCP_VMID_RESET, 0xfffe); > > // Read CP_VMID_RESET register three times. > @@ -4526,6 +4542,10 @@ static int gfx_v11_0_soft_reset(void *handle) > RREG32_SOC15(GC, 0, regCP_VMID_RESET); > RREG32_SOC15(GC, 0, regCP_VMID_RESET); > > + /* release the gfx mutex */ > + tmp = REG_SET_FIELD(tmp, CP_GFX_INDEX_MUTEX, REQUEST, 0); > + WREG32_SOC15(GC, 0, regCP_GFX_INDEX_MUTEX, tmp); > + > for (i = 0; i < adev->usec_timeout; i++) { > if (!RREG32_SOC15(GC, 0, regCP_HQD_ACTIVE) && > !RREG32_SOC15(GC, 0, regCP_GFX_HQD_ACTIVE)) > -- > 2.41.0 >
Re: regression/bisected/6.7rc1: Instead of desktop I see a horizontal flashing bar with a picture of the desktop background on white screen
On Mon, Dec 18, 2023 at 1:31 PM Mikhail Gavrilov wrote: > > On Fri, Dec 15, 2023 at 9:14 PM Hamza Mahfooz wrote: > > > > Can you try the following patch with old fw (version 0x07002100 should > > be fine)?: https://patchwork.freedesktop.org/patch/572298/ > > > > Tested-by: Mikhail Gavrilov on 7900XTX > hardware. > > Can I ask? > What does SubVP actually do? > I read on Phoronix that this is new feature of DCN 3.2 hardware > https://www.phoronix.com/news/AMDGPU-Linux-6.5-Improvements > But I didn't notice that anything began to work better after enabling > this feature. > On the contrary, my kernel logs began to become overgrown with > unpleasant errors. > See here: https://gitlab.freedesktop.org/drm/amd/-/issues/2796 > I bisected this issue and bisect heads me to commit > 299004271cbf0315da327c4bd67aec3e7041cb32 which enables SubVP high > refresh rate. > But without SubVP I also had 120Hz and 4K. So I ask again what is the > profit of SubVP? IIRC, SubVP is a power saving feature which allows us to dynamically adjust the memory clocks in more cases with narrow blanking periods. Alex
[PATCH] drm/amdgpu: Use kzalloc instead of kmalloc+__GFP_ZERO in amdgpu_ras.c
Fixes the below smatch warnings: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2543 amdgpu_ras_recovery_init() warn: Please consider using kzalloc instead of kmalloc drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2830 amdgpu_ras_init() warn: Please consider using kzalloc instead of kmalloc Cc: Christian König Cc: Alex Deucher Signed-off-by: Srinivasan Shanmugam --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c index bad62141f708..e541e6925918 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c @@ -2540,7 +2540,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev) return 0; data = &con->eh_data; - *data = kmalloc(sizeof(**data), GFP_KERNEL | __GFP_ZERO); + *data = kzalloc(sizeof(**data), GFP_KERNEL); if (!*data) { ret = -ENOMEM; goto out; @@ -2827,10 +2827,10 @@ int amdgpu_ras_init(struct amdgpu_device *adev) if (con) return 0; - con = kmalloc(sizeof(struct amdgpu_ras) + + con = kzalloc(sizeof(*con) + sizeof(struct ras_manager) * AMDGPU_RAS_BLOCK_COUNT + sizeof(struct ras_manager) * AMDGPU_RAS_MCA_BLOCK_COUNT, - GFP_KERNEL|__GFP_ZERO); + GFP_KERNEL); if (!con) return -ENOMEM; -- 2.34.1
[PATCH] drm/amdgpu: Cleanup indenting in amdgpu_connector_dvi_detect()
drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c:1106 amdgpu_connector_dvi_detect() warn: inconsistent indenting 'Fixes: 760817a60724 ("drm/amdgpu: Refactor 'amdgpu_connector_dvi_detect' in amdgpu_connectors.c")' Cc: Christian König Cc: Alex Deucher Cc: "Pan, Xinhui" Cc: Rodrigo Siqueira Cc: Aurabindo Pillai Signed-off-by: Srinivasan Shanmugam --- drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c index 96f63fd39b9e..9caba10315a8 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c @@ -1103,7 +1103,7 @@ amdgpu_connector_dvi_detect(struct drm_connector *connector, bool force) * DDC line. The latter is more complex because with DVI<->HDMI adapters * you don't really know what's connected to which port as both are digital. */ -amdgpu_connector_shared_ddc(&ret, connector, amdgpu_connector); + amdgpu_connector_shared_ddc(&ret, connector, amdgpu_connector); } } -- 2.34.1
[PATCH Review V2 1/1] drm/amdgpu: Fix ecc irq enable/disable unpaired
The ecc_irq is disabled while GPU mode2 reset suspending process, but not be enabled during GPU mode2 reset resume process. Changed from V1: only do sdma/gfx ras_late_init in aldebaran_mode2_restore_ip, delete amdgpu_ras_late_resume function. Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/aldebaran.c | 28 +- drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 3 +++ drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 4 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 3 +++ 4 files changed, 37 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/aldebaran.c b/drivers/gpu/drm/amd/amdgpu/aldebaran.c index 02f4c6f9d4f6..b60a3c1bd0f2 100644 --- a/drivers/gpu/drm/amd/amdgpu/aldebaran.c +++ b/drivers/gpu/drm/amd/amdgpu/aldebaran.c @@ -330,6 +330,7 @@ aldebaran_mode2_restore_hwcontext(struct amdgpu_reset_control *reset_ctl, { struct list_head *reset_device_list = reset_context->reset_device_list; struct amdgpu_device *tmp_adev = NULL; + struct amdgpu_ras *con; int r; if (reset_device_list == NULL) @@ -355,7 +356,32 @@ aldebaran_mode2_restore_hwcontext(struct amdgpu_reset_control *reset_ctl, */ amdgpu_register_gpu_instance(tmp_adev); - /* Resume RAS */ + /* Resume RAS, ecc_irq */ + con = amdgpu_ras_get_context(tmp_adev); + if (!amdgpu_sriov_vf(tmp_adev) && con) { + if (tmp_adev->sdma.ras && + amdgpu_ras_is_supported(tmp_adev, AMDGPU_RAS_BLOCK__SDMA) && + tmp_adev->sdma.ras->ras_block.ras_late_init) { + r = tmp_adev->sdma.ras->ras_block.ras_late_init(tmp_adev, + &tmp_adev->sdma.ras->ras_block.ras_comm); + if (r) { + dev_err(tmp_adev->dev, "SDMA failed to execute ras_late_init! ret:%d\n", r); + goto end; + } + } + + if (tmp_adev->gfx.ras && + amdgpu_ras_is_supported(tmp_adev, AMDGPU_RAS_BLOCK__GFX) && + tmp_adev->gfx.ras->ras_block.ras_late_init) { + r = tmp_adev->gfx.ras->ras_block.ras_late_init(tmp_adev, + &tmp_adev->gfx.ras->ras_block.ras_comm); + if (r) { + dev_err(tmp_adev->dev, "GFX failed to execute ras_late_init! ret:%d\n", r); + goto end; + } + } + } + amdgpu_ras_resume(tmp_adev); /* Update PSP FW topology after reset */ diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c index 09cbca596bb5..b93a0baeb2d3 100644 --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c @@ -1043,6 +1043,9 @@ static int gmc_v10_0_hw_fini(void *handle) amdgpu_irq_put(adev, &adev->gmc.vm_fault, 0); + if (adev->gmc.ecc_irq.funcs) + amdgpu_irq_put(adev, &adev->gmc.ecc_irq, 0); + return 0; } diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c index 416f3e4f0438..e633e60850b3 100644 --- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c @@ -941,6 +941,10 @@ static int gmc_v11_0_hw_fini(void *handle) } amdgpu_irq_put(adev, &adev->gmc.vm_fault, 0); + + if (adev->gmc.ecc_irq.funcs) + amdgpu_irq_put(adev, &adev->gmc.ecc_irq, 0); + gmc_v11_0_gart_disable(adev); return 0; diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c index 205db28a9803..8ac4d5b7fb37 100644 --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c @@ -2388,6 +2388,9 @@ static int gmc_v9_0_hw_fini(void *handle) amdgpu_irq_put(adev, &adev->gmc.vm_fault, 0); + if (adev->gmc.ecc_irq.funcs) + amdgpu_irq_put(adev, &adev->gmc.ecc_irq, 0); + return 0; } -- 2.25.1
Re: [PATCH] drm/amd/pm: Remove I2C_CLASS_SPD support
On Mon, Nov 13, 2023 at 12:37:15PM +0100, Heiner Kallweit wrote: > I2C_CLASS_SPD was used to expose the EEPROM content to user space, > via the legacy eeprom driver. Now that this driver has been removed, > we can remove I2C_CLASS_SPD support. at24 driver with explicit > instantiation should be used instead. > > If in doubt this patch could be applied via the i2c tree. > > Signed-off-by: Heiner Kallweit Applied to for-next, thanks! signature.asc Description: PGP signature
Re: [PATCH v5 00/20] remove I2C_CLASS_DDC support
> I created an immutable branch for this which the buildbots will > hopefully check over night. I will reply with comments tomorrow when I > got the buildbot results. Applied to for-next, thanks! signature.asc Description: PGP signature
RE: [PATCH] drm/amdgpu/gfx11: need acquire mutex before access CP_VMID_RESET
[AMD Official Use Only - General] + /* release the gfx mutex */ + tmp = REG_SET_FIELD(tmp, CP_GFX_INDEX_MUTEX, REQUEST, 0); + WREG32_SOC15(GC, 0, regCP_GFX_INDEX_MUTEX, tmp); Shall we add a check by reading back CP_GFX_INDEX_MUTEX to ensure the release is done correctly? Regards, Hawking -Original Message- From: Xiao, Jack Sent: Tuesday, December 19, 2023 17:24 To: amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Zhang, Hawking Cc: Xiao, Jack Subject: [PATCH] drm/amdgpu/gfx11: need acquire mutex before access CP_VMID_RESET It's required to take the gfx mutex before access to CP_VMID_RESET, for there is a race condition with CP firmware to write the register. Signed-off-by: Jack Xiao --- drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 20 1 file changed, 20 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c index bdcf96df69e6..ae3370d34d11 100644 --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c @@ -4518,6 +4518,22 @@ static int gfx_v11_0_soft_reset(void *handle) } } + /* Try to require the gfx mutex before access to CP_VMID_RESET */ + for (i = 0; i < adev->usec_timeout; i++) { + /* Request with MeId=2, PipeId=0 */ + tmp = REG_SET_FIELD(0, CP_GFX_INDEX_MUTEX, REQUEST, 1); + tmp = REG_SET_FIELD(tmp, CP_GFX_INDEX_MUTEX, CLIENTID, 4); + WREG32_SOC15(GC, 0, regCP_GFX_INDEX_MUTEX, tmp); + if (RREG32_SOC15(GC, 0, regCP_GFX_INDEX_MUTEX) == tmp) + break; + udelay(1); + } + + if (i >= adev->usec_timeout) { + printk("Failed to require the gfx mutex during soft reset\n"); + return -EINVAL; + } + WREG32_SOC15(GC, 0, regCP_VMID_RESET, 0xfffe); // Read CP_VMID_RESET register three times. @@ -4526,6 +4542,10 @@ static int gfx_v11_0_soft_reset(void *handle) RREG32_SOC15(GC, 0, regCP_VMID_RESET); RREG32_SOC15(GC, 0, regCP_VMID_RESET); + /* release the gfx mutex */ + tmp = REG_SET_FIELD(tmp, CP_GFX_INDEX_MUTEX, REQUEST, 0); + WREG32_SOC15(GC, 0, regCP_GFX_INDEX_MUTEX, tmp); + for (i = 0; i < adev->usec_timeout; i++) { if (!RREG32_SOC15(GC, 0, regCP_HQD_ACTIVE) && !RREG32_SOC15(GC, 0, regCP_GFX_HQD_ACTIVE)) -- 2.41.0
Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"
On Fri, Dec 15, 2023 at 5:37 PM Christian König wrote: > > I have no idea :) > > From the logs I can see that the AMDGPU now has the proper BARs assigned: > > [5.722015] pci :03:00.0: [1002:73df] type 00 class 0x038000 > [5.722051] pci :03:00.0: reg 0x10: [mem > 0xf8-0xfb 64bit pref] > [5.722081] pci :03:00.0: reg 0x18: [mem > 0xfc-0xfc0fff 64bit pref] > [5.722112] pci :03:00.0: reg 0x24: [mem 0xfca0-0xfcaf] > [5.722134] pci :03:00.0: reg 0x30: [mem 0xfcb0-0xfcb1 pref] > [5.722368] pci :03:00.0: PME# supported from D1 D2 D3hot D3cold > [5.722484] pci :03:00.0: 63.008 Gb/s available PCIe bandwidth, > limited by 8.0 GT/s PCIe x8 link at :00:01.1 (capable of 252.048 > Gb/s with 16.0 GT/s PCIe x16 link) > > And with that the driver can work perfectly fine. > > Have you updated the BIOS or added/removed some other hardware? Maybe > somebody added a quirk for your BIOS into the PCIe code or something > like that. No, nothing changed in hardware. But I found the commit which fixes it. > git bisect unfixed 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6 is the first fixed commit commit 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6 Author: Vasant Hegde Date: Thu Sep 21 09:21:45 2023 + iommu/amd: Introduce iommu_dev_data.flags to track device capabilities Currently we use struct iommu_dev_data.iommu_v2 to keep track of the device ATS, PRI, and PASID capabilities. But these capabilities can be enabled independently (except PRI requires ATS support). Hence, replace the iommu_v2 variable with a flags variable, which keep track of the device capabilities. From commit 9bf49e36d718 ("PCI/ATS: Handle sharing of PF PRI Capability with all VFs"), device PRI/PASID is shared between PF and any associated VFs. Hence use pci_pri_supported() and pci_pasid_features() instead of pci_find_ext_capability() to check device PRI/PASID support. Signed-off-by: Vasant Hegde Reviewed-by: Jason Gunthorpe Reviewed-by: Jerry Snitselaar Link: https://lore.kernel.org/r/20230921092147.5930-13-vasant.he...@amd.com Signed-off-by: Joerg Roedel drivers/iommu/amd/amd_iommu_types.h | 3 ++- drivers/iommu/amd/iommu.c | 46 ++--- 2 files changed, 30 insertions(+), 19 deletions(-) > git bisect log git bisect start '--term-new=fixed' '--term-old=unfixed' # status: waiting for both good and bad commits # fixed: [33cc938e65a98f1d29d0a18403dbbee050dcad9a] Linux 6.7-rc4 git bisect fixed 33cc938e65a98f1d29d0a18403dbbee050dcad9a # status: waiting for good commit(s), bad commit known # unfixed: [ffc253263a1375a65fa6c9f62a893e9767fbebfa] Linux 6.6 git bisect unfixed ffc253263a1375a65fa6c9f62a893e9767fbebfa # unfixed: [7d461b291e65938f15f56fe58da2303b07578a76] Merge tag 'drm-next-2023-10-31-1' of git://anongit.freedesktop.org/drm/drm git bisect unfixed 7d461b291e65938f15f56fe58da2303b07578a76 # unfixed: [e14aec23025eeb1f2159ba34dbc1458467c4c347] s390/ap: fix AP bus crash on early config change callback invocation git bisect unfixed e14aec23025eeb1f2159ba34dbc1458467c4c347 # unfixed: [be3ca57cfb777ad820c6659d52e60bbdd36bf5ff] Merge tag 'media/v6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media git bisect unfixed be3ca57cfb777ad820c6659d52e60bbdd36bf5ff # fixed: [c0d12d769299e1e08338988c7745009e0db2a4a0] Merge tag 'drm-next-2023-11-10' of git://anongit.freedesktop.org/drm/drm git bisect fixed c0d12d769299e1e08338988c7745009e0db2a4a0 # fixed: [4bbdb725a36b0d235f3b832bd0c1e885f0442d9f] Merge tag 'iommu-updates-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu git bisect fixed 4bbdb725a36b0d235f3b832bd0c1e885f0442d9f # unfixed: [25b6377007ebe1c3ede773fd6979f613386db000] Merge tag 'drm-next-2023-11-07' of git://anongit.freedesktop.org/drm/drm git bisect unfixed 25b6377007ebe1c3ede773fd6979f613386db000 # unfixed: [67c0afb6424fee94238d9a32b97c407d0c97155e] Merge tag 'exfat-for-6.7-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat git bisect unfixed 67c0afb6424fee94238d9a32b97c407d0c97155e # unfixed: [3613047280ec42a4e1350fdc1a6dd161ff4008cc] Merge tag 'v6.6-rc7' into core git bisect unfixed 3613047280ec42a4e1350fdc1a6dd161ff4008cc # fixed: [cedc811c76778bdef91d405717acee0de54d8db5] iommu/amd: Remove DMA_FQ type from domain allocation path git bisect fixed cedc811c76778bdef91d405717acee0de54d8db5 # unfixed: [b0cc5dae1ac0c18748706a4beb636e3b726dd744] iommu/amd: Rename ats related variables git bisect unfixed b0cc5dae1ac0c18748706a4beb636e3b726dd744 # fixed: [5a0b11a180a9b82b4437a4be1cf73530053f139b] iommu/amd: Remove iommu_v2 module git bisect fixed 5a0b11a180a9b82b4437a4be1cf73530053f139b # fixed: [92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6] iommu/amd: Introduce iommu_dev_data.flags to track device capabilities git bisect fixed 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6 # unfixed: [739eb25514c90aa
[PATCH] drm/amdgpu/gfx11: need acquire mutex before access CP_VMID_RESET
It's required to take the gfx mutex before access to CP_VMID_RESET, for there is a race condition with CP firmware to write the register. Signed-off-by: Jack Xiao --- drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 20 1 file changed, 20 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c index bdcf96df69e6..ae3370d34d11 100644 --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c @@ -4518,6 +4518,22 @@ static int gfx_v11_0_soft_reset(void *handle) } } + /* Try to require the gfx mutex before access to CP_VMID_RESET */ + for (i = 0; i < adev->usec_timeout; i++) { + /* Request with MeId=2, PipeId=0 */ + tmp = REG_SET_FIELD(0, CP_GFX_INDEX_MUTEX, REQUEST, 1); + tmp = REG_SET_FIELD(tmp, CP_GFX_INDEX_MUTEX, CLIENTID, 4); + WREG32_SOC15(GC, 0, regCP_GFX_INDEX_MUTEX, tmp); + if (RREG32_SOC15(GC, 0, regCP_GFX_INDEX_MUTEX) == tmp) + break; + udelay(1); + } + + if (i >= adev->usec_timeout) { + printk("Failed to require the gfx mutex during soft reset\n"); + return -EINVAL; + } + WREG32_SOC15(GC, 0, regCP_VMID_RESET, 0xfffe); // Read CP_VMID_RESET register three times. @@ -4526,6 +4542,10 @@ static int gfx_v11_0_soft_reset(void *handle) RREG32_SOC15(GC, 0, regCP_VMID_RESET); RREG32_SOC15(GC, 0, regCP_VMID_RESET); + /* release the gfx mutex */ + tmp = REG_SET_FIELD(tmp, CP_GFX_INDEX_MUTEX, REQUEST, 0); + WREG32_SOC15(GC, 0, regCP_GFX_INDEX_MUTEX, tmp); + for (i = 0; i < adev->usec_timeout; i++) { if (!RREG32_SOC15(GC, 0, regCP_HQD_ACTIVE) && !RREG32_SOC15(GC, 0, regCP_GFX_HQD_ACTIVE)) -- 2.41.0
Re: [PATCH v3 1/2] drm/amdgpu: Auto-validate DMABuf imports in compute VMs
Am 15.12.23 um 16:19 schrieb Felix Kuehling: On 2023-12-15 07:30, Christian König wrote: @@ -1425,11 +1451,21 @@ int amdgpu_vm_handle_moved(struct amdgpu_device *adev, } r = amdgpu_vm_bo_update(adev, bo_va, clear); - if (r) - return r; if (unlock) dma_resv_unlock(resv); + if (r) + return r; + + /* Remember evicted DMABuf imports in compute VMs for later + * validation + */ + if (vm->is_compute_context && bo_va->base.bo && + bo_va->base.bo->tbo.base.import_attach && + (!bo_va->base.bo->tbo.resource || + bo_va->base.bo->tbo.resource->mem_type == TTM_PL_SYSTEM)) + amdgpu_vm_bo_evicted(&bo_va->base); + The change looks mostly good now. Just one thing which worries me is that when GFX and compute is mixed in the same VM this here might cause problems when we run into an error during command submission. E.g. GFX validates the VM BOs, but then the IOCTL fails before calling amdgpu_vm_handle_moved(). In this case the DMA-buf wouldn't be validated any more. This code path shouldn't be relevant for command submission, but for the amdgpu_vm_handle_moved call in amdgpu_dma_buf_move_notify. That's where the BO is first found to be evicted and its PTEs invalidated. That's where we need to remember it so it can be validated in the next call to amdgpu_vm_validate. Currently the amdgpu_cs code path calls amdgpu_vm_validate with ticket=NULL, so it won't validate these imports. The only place that auto-validates evicted imports is amdgpu_amdkfd_restore_process_bos. So none of this should affect amdgpu_cs command submission. Yeah, but ticket=NULL will result in removing those imports from the validation list. This could potentially result in not validating them should we ever mix GFX and Compute submissions in the same VM. For now that works, but we need to clean that up more thoughtfully I think. Christian. The flow for amdgpu_amdkfd_restore_process_bos is: * amdgpu_vm_validate validates the evicted dmabuf imports (moves the bo_vas from "evicted" to "invalidated") * amdgpu_vm_handle_moved iterates over invalidated bo_vas and updates the PTEs with valid entries (moves the bo_vas to "done") Regards, Felix Regards, Christian.