[PATCH] drm/amdgpu: Log if device is unsupported
Since there is overlap in supported devices, both modules load, but only
one will bind to a particular device depending on the model and user's
configuration. amdgpu binds to all display class devices with VID 0x1002
and then determines whether or not to bind to a device based on whether
the individual device is supported by the driver or not. Log that case,
so users looking at the logs know what is going on.

Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2608
Signed-off-by: Paul Menzel
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 86fbb4138285..410ff918c350 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2062,8 +2062,10 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
 	/* skip devices which are owned by radeon */
 	for (i = 0; i < ARRAY_SIZE(amdgpu_unsupported_pciidlist); i++) {
-		if (amdgpu_unsupported_pciidlist[i] == pdev->device)
+		if (amdgpu_unsupported_pciidlist[i] == pdev->device) {
+			DRM_INFO("This hardware is only supported by radeon.");
 			return -ENODEV;
+		}
 	}
 
 	if (amdgpu_aspm == -1 && !pcie_aspm_enabled(pdev))
-- 
2.40.1
Re: How are the DC patches tested?
Dear Daniel,

Am 09.05.22 um 17:21 schrieb Wheeler, Daniel:
> I've made some edits to my cover letter to hopefully make it clearer
> with what is being done.

Thank you.

> Hi all,
>
> This week this patchset was tested on the following systems:
>
> HP Envy 360, with Ryzen 5 4500U
> Lenovo Thinkpad T14s Gen2, with AMD Ryzen 5 5650U
> Sapphire Pulse RX5700XT
> Reference AMD RX6800
> Engineering board with Ryzen 9 5900H
>
> These systems were tested on the following display types:
>
> eDP, (1080p 60hz [4500U, 5650U, 5900H])
> VGA and DVI (1680x1050 60Hz [DP to VGA/DVI, USB-C to DVI/VGA])
> DP/HDMI/USB-C (1440p 170hz, 4k 60hz, 4k 144hz [Includes USB-C to DP/HDMI adapters])
>
> MST tested with Startech MST14DP123DP and 2x 4k 60Hz displays
>
> DSC tested with Cable Matters 101075 (DP to 3x DP), and 201375
> (USB-C to 3x DP) with 3x 4k60 displays
>
> The testing is a mix of automated and manual tests. Manual testing
> includes (but is not limited to):
>
> Changing display configurations and settings
> Benchmark testing
> Feature testing (Freesync, etc.)
>
> Automated testing includes (but is not limited to):
>
> Script testing (scripts to automate some of the manual checks)
> IGT testing
>
> The patchset consists of the most recent amd-staging-drm-next branch
> with a selection of patches added on top of it. This goes for both
> Ubuntu testing and Chrome OS testing.

Please mention the commit hash of the commit at the top of
amd-staging-drm-next.

> Tested on Ubuntu 22.04 and Chrome OS

I still do not see the Chrome OS version and devices listed. I think,
that information would be valuable.

> Tested-by: Daniel Wheeler

Kind regards,

Paul
Fwd: Re: amd | ASUS ROG Strix G513QY fails to resume from suspend [regression] (#2008)
Dear Linux regressions folks,

A user reported a regression [1], which also trickled down to the stable
series, for example between 5.15.13 and 5.15.14.

Kind regards,

Paul

[1]: https://gitlab.freedesktop.org/drm/amd/-/issues/2008

--- Begin Message ---
spin83 commented:

It doesn't work either.

-- 
View it on GitLab: https://gitlab.freedesktop.org/drm/amd/-/issues/2008#note_1375796
You're receiving this email because of your account on gitlab.freedesktop.org.
--- End Message ---
Re: [PATCH] drm/amdgpu/gfx11: fix mes mqd settings and map_queue packet
Dear Alex, dear Jack,

Thank you for the patch.

Am 09.05.22 um 21:06 schrieb Alex Deucher:
> From: Jack Xiao
>
> a. use correct mes mqd settings

Can you please elaborate? What is wrong with the old ones, and what are
the correct ones?

> b. fix me field in map_queue packet

Can you please add some background? The new value is 2. What does it do?

It’d be great, if you could make it two patches.

> Signed-off-by: Jack Xiao
> Reviewed-by: Hawking Zhang
> Signed-off-by: Alex Deucher
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c |  7 +--
>  drivers/gpu/drm/amd/amdgpu/mes_v11_0.c | 20 ++--
>  2 files changed, 15 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> index 7614f38ff381..8a1bec70c719 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> @@ -145,16 +145,19 @@ static void gfx11_kiq_map_queues(struct amdgpu_ring *kiq_ring,
>  {
>  	uint64_t mqd_addr = amdgpu_bo_gpu_offset(ring->mqd_obj);
>  	uint64_t wptr_addr = ring->wptr_gpu_addr;
> -	uint32_t eng_sel = 0;
> +	uint32_t me = 0, eng_sel = 0;
>
>  	switch (ring->funcs->type) {
>  	case AMDGPU_RING_TYPE_COMPUTE:
> +		me = 1;
>  		eng_sel = 0;
>  		break;
>  	case AMDGPU_RING_TYPE_GFX:
> +		me = 0;
>  		eng_sel = 4;
>  		break;
>  	case AMDGPU_RING_TYPE_MES:
> +		me = 2;
>  		eng_sel = 5;
>  		break;
>  	default:
> @@ -168,7 +171,7 @@ static void gfx11_kiq_map_queues(struct amdgpu_ring *kiq_ring,
>  			  PACKET3_MAP_QUEUES_VMID(0) | /* VMID */
>  			  PACKET3_MAP_QUEUES_QUEUE(ring->queue) |
>  			  PACKET3_MAP_QUEUES_PIPE(ring->pipe) |
> -			  PACKET3_MAP_QUEUES_ME((ring->me == 1 ? 0 : 1)) |
> +			  PACKET3_MAP_QUEUES_ME((me)) |
>  			  PACKET3_MAP_QUEUES_QUEUE_TYPE(0) | /* queue_type: normal compute queue */
>  			  PACKET3_MAP_QUEUES_ALLOC_FORMAT(0) | /* alloc format: all_on_one_pipe */
>  			  PACKET3_MAP_QUEUES_ENGINE_SEL(eng_sel) |
> diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> index 5d4d54cabf70..fcf51947bb18 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> @@ -29,7 +29,7 @@
>  #include "gc/gc_11_0_0_offset.h"
>  #include "gc/gc_11_0_0_sh_mask.h"
>  #include "gc/gc_11_0_0_default.h"
> -#include "v10_structs.h"
> +#include "v11_structs.h"
>  #include "mes_v11_api_def.h"
>
>  MODULE_FIRMWARE("amdgpu/gc_11_0_0_mes.bin");
> @@ -637,7 +637,7 @@ static int mes_v11_0_allocate_eop_buf(struct amdgpu_device *adev,
>
>  static int mes_v11_0_mqd_init(struct amdgpu_ring *ring)
>  {
> -	struct v10_compute_mqd *mqd = ring->mqd_ptr;
> +	struct v11_compute_mqd *mqd = ring->mqd_ptr;
>  	uint64_t hqd_gpu_addr, wb_gpu_addr, eop_base_addr;
>  	uint32_t tmp;
>
> @@ -724,22 +724,22 @@ static int mes_v11_0_mqd_init(struct amdgpu_ring *ring)
>  	mqd->cp_hqd_vmid = 0;
>  	/* activate the queue */
>  	mqd->cp_hqd_active = 1;
> -	mqd->cp_hqd_persistent_state = regCP_HQD_PERSISTENT_STATE_DEFAULT;
> +
> +	tmp = regCP_HQD_PERSISTENT_STATE_DEFAULT;
> +	tmp = REG_SET_FIELD(tmp, CP_HQD_PERSISTENT_STATE,
> +			    PRELOAD_SIZE, 0x55);
> +	mqd->cp_hqd_persistent_state = tmp;
> +
>  	mqd->cp_hqd_ib_control = regCP_HQD_IB_CONTROL_DEFAULT;
>  	mqd->cp_hqd_iq_timer = regCP_HQD_IQ_TIMER_DEFAULT;
>  	mqd->cp_hqd_quantum = regCP_HQD_QUANTUM_DEFAULT;
>
> -	tmp = regCP_HQD_GFX_CONTROL_DEFAULT;
> -	tmp = REG_SET_FIELD(tmp, CP_HQD_GFX_CONTROL, DB_UPDATED_MSG_EN, 1);
> -	/* offset: 184 - this is used for CP_HQD_GFX_CONTROL */
> -	mqd->cp_hqd_suspend_cntl_stack_offset = tmp;
> -

What was wrong with this?

Kind regards,

Paul

>  	return 0;
>  }
>
>  static void mes_v11_0_queue_init_register(struct amdgpu_ring *ring)
>  {
> -	struct v10_compute_mqd *mqd = ring->mqd_ptr;
> +	struct v11_compute_mqd *mqd = ring->mqd_ptr;
>  	struct amdgpu_device *adev = ring->adev;
>  	uint32_t data = 0;
>
> @@ -910,7 +910,7 @@ static int mes_v11_0_kiq_ring_init(struct amdgpu_device *adev)
>  static int mes_v11_0_mqd_sw_init(struct amdgpu_device *adev,
>  				 enum admgpu_mes_pipe pipe)
>  {
> -	int r, mqd_size = sizeof(struct v10_compute_mqd);
> +	int r, mqd_size = sizeof(struct v11_compute_mqd);
>  	struct amdgpu_ring *ring;
>
>  	if (pipe == AMDGPU_MES_KIQ_PIPE)
Re: [PATCH] drm/amdgpu: Check kernel version for 'hypervisor_is_type'
Dear Yongqiang,

Thank you for your patch.

Am 09.05.22 um 19:07 schrieb Yongqiang Sun:
> hypervisor_is_type is added since kernel 4.15, dkms package failed to

Also mention the commit hash and summary of the commit adding that.

> install on older kernel e.g. 3.10. Use marcro check kernel version to
> determine whether the function is

macro (found by my spell checker)

> used.

Please reflow for 75 characters per line.

The Signed-off-by line is missing.

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c  | 4 +++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 4 +++-
>  2 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index 88b852b3a2cb..963b2e68205e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -28,7 +28,7 @@
>  #ifdef CONFIG_X86
>  #include
>  #endif
> -
> +#include
>  #include "amdgpu.h"
>  #include "amdgpu_gmc.h"
>  #include "amdgpu_ras.h"
> @@ -653,10 +653,12 @@ void amdgpu_gmc_get_vbios_allocations(struct amdgpu_device *adev)
>  	 * VEGA10 SRIOV VF with MS_HYPERV host needs some firmware reserved area.
>  	 */
>  #ifdef CONFIG_X86
> +#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 15, 0)
>  	if (amdgpu_sriov_vf(adev) && hypervisor_is_type(X86_HYPER_MS_HYPERV)) {
>  		adev->mman.stolen_reserved_offset = 0x50;
>  		adev->mman.stolen_reserved_size = 0x20;
>  	}
> +#endif

`scripts/checkpatch.pl` warns about using `LINUX_VERSION_CODE`.

>  #endif
>  		break;
>  	case CHIP_RAVEN:
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> index 02b161a1287b..5a50122a1161 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> @@ -26,7 +26,7 @@
>  #ifdef CONFIG_X86
>  #include
>  #endif
> -
> +#include
>  #include
>  #include
>
> @@ -728,8 +728,10 @@ void amdgpu_detect_virtualization(struct amdgpu_device *adev)
>  	case CHIP_VEGA10:
>  		soc15_set_virt_ops(adev);
>  #ifdef CONFIG_X86
> +#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 15, 0)
>  		/* not send GPU_INIT_DATA with MS_HYPERV */
>  		if (!hypervisor_is_type(X86_HYPER_MS_HYPERV))
> +#endif
>  #endif
>  			/* send a dummy GPU_INIT_DATA request to host on vega10 */
>  			amdgpu_virt_request_init_data(adev);

Is this patch for inclusion to the Linux kernel or just for people
building the newer code as a module for older Linux kernel versions?

Kind regards,

Paul
Re: How are the DC patches tested?
Dear Rodrigo,

Thank you for the quick response.

Am 09.05.22 um 16:15 schrieb Rodrigo Siqueira Jordao:
> On 2022-05-09 10:00, Paul Menzel wrote:
>> Am 09.05.22 um 15:14 schrieb Wheeler, Daniel:
>> […]
>>> This week this patchset was tested on the following systems:
>>>
>>> Lenovo Thinkpad T14s Gen2 with AMD Ryzen 5 5650U, with the following
>>> display types: eDP 1080p 60hz, 4k 60hz (via USB-C to DP/HDMI), 1440p
>>> 144hz (via USB-C to DP/HDMI), 1680*1050 60hz (via USB-C to DP and
>>> then DP to DVI/VGA)
>>>
>>> Sapphire Pulse RX5700XT with the following display types: 4k 60hz
>>> (via DP/HDMI), 1440p 144hz (via DP/HDMI), 1680*1050 60hz (via DP to
>>> DVI/VGA)
>>>
>>> Reference AMD RX6800 with the following display types: 4k 60hz (via
>>> DP/HDMI and USB-C to DP/HDMI), 1440p 144hz (via USB-C to DP/HDMI and
>>> USB-C to DP/HDMI), 1680*1050 60hz (via DP to DVI/VGA)
>>>
>>> Included testing using a Startech DP 1.4 MST hub at 2x 4k 60hz and
>>> DSC via USB-C to DP DSC Hub with 3x 4k 60hz.
>>>
>>> Tested on Ubuntu 22.04 with Kernel Version 5.16, and ChromeOS
>>
>> I am curious, what this means exactly? You clone the Ubuntu Linux 5.16
>> kernel source, and then apply your patches on top? (Do they even
>> apply?)
>
> All of these "promotion" patches are tested by using
> amd-staging-drm-next. In a few words:
>
> 1. We get the latest code from amd-staging-drm-next;
> 2. We apply these weekly promotion patches on top of it;
> 3. We compile, run unit tests, and run many manual tests (Daniel does
>    that).
>
> If everything is alright with Daniel's tests, we feel confident to
> merge these series on top amd-staging-drm-next (we are basically trying
> to avoid regressions here).
>
> Anyway, maybe we can rephrase:
>
>   Tested on Ubuntu 22.04 with Kernel Version 5.16, and ChromeOS
>
> to
>
>   Tested on Ubuntu 22.04 and ChromeOS with amd-staging-drm-next +
>   promotion patches.

Yes, that’d be great. Maybe even reference the commit hash from the
commit on top of *amd-staging-drm-next*. (Nit: ChromeOS → Chrome OS)

>> The same for Chrome OS. Do you use Chrome OS Flex [1] with the systems
>> you listed? If not, what Google Chromebooks/-boxes did you test with?
>> The Linux kernel version is also tied for a device and Chrome OS
>> release. Please mention those too.

As written, the used Chrome OS version (and devices) would be helpful
too.

>> Is it documented somewhere, what tests you run exactly?
>
> We run IGT tests, some scripts that validate some specific areas, and
> Daniel has an extensive set of manual tests.

Kind regards,

Paul
How are the DC patches tested? (was: [PATCH 00/15] DC Patches May 9, 2022)
[Sorry for the incomplete first message.]

Dear Daniel,

Am 09.05.22 um 15:14 schrieb Wheeler, Daniel:
> […]
> This week this patchset was tested on the following systems:
>
> Lenovo Thinkpad T14s Gen2 with AMD Ryzen 5 5650U, with the following
> display types: eDP 1080p 60hz, 4k 60hz (via USB-C to DP/HDMI), 1440p
> 144hz (via USB-C to DP/HDMI), 1680*1050 60hz (via USB-C to DP and then
> DP to DVI/VGA)
>
> Sapphire Pulse RX5700XT with the following display types: 4k 60hz (via
> DP/HDMI), 1440p 144hz (via DP/HDMI), 1680*1050 60hz (via DP to DVI/VGA)
>
> Reference AMD RX6800 with the following display types: 4k 60hz (via
> DP/HDMI and USB-C to DP/HDMI), 1440p 144hz (via USB-C to DP/HDMI and
> USB-C to DP/HDMI), 1680*1050 60hz (via DP to DVI/VGA)
>
> Included testing using a Startech DP 1.4 MST hub at 2x 4k 60hz and DSC
> via USB-C to DP DSC Hub with 3x 4k 60hz.
>
> Tested on Ubuntu 22.04 with Kernel Version 5.16, and ChromeOS

I am curious, what this means exactly? You clone the Ubuntu Linux 5.16
kernel source, and then apply your patches on top? (Do they even apply?)

The same for Chrome OS. Do you use Chrome OS Flex [1] with the systems
you listed? If not, what Google Chromebooks/-boxes did you test with?
The Linux kernel version is also tied for a device and Chrome OS
release. Please mention those too.

Is it documented somewhere, what tests you run exactly?

Kind regards,

Paul

[1]: https://chromeenterprise.google/os/chromeosflex/
Re: [PATCH 00/15] DC Patches May 9, 2022
Dear Daniel,

Am 09.05.22 um 15:14 schrieb Wheeler, Daniel:
> […]
> This week this patchset was tested on the following systems:
>
> Lenovo Thinkpad T14s Gen2 with AMD Ryzen 5 5650U, with the following
> display types: eDP 1080p 60hz, 4k 60hz (via USB-C to DP/HDMI), 1440p
> 144hz (via USB-C to DP/HDMI), 1680*1050 60hz (via USB-C to DP and then
> DP to DVI/VGA)
>
> Sapphire Pulse RX5700XT with the following display types: 4k 60hz (via
> DP/HDMI), 1440p 144hz (via DP/HDMI), 1680*1050 60hz (via DP to DVI/VGA)
>
> Reference AMD RX6800 with the following display types: 4k 60hz (via
> DP/HDMI and USB-C to DP/HDMI), 1440p 144hz (via USB-C to DP/HDMI and
> USB-C to DP/HDMI), 1680*1050 60hz (via DP to DVI/VGA)
>
> Included testing using a Startech DP 1.4 MST hub at 2x 4k 60hz and DSC
> via USB-C to DP DSC Hub with 3x 4k 60hz.
>
> Tested on Ubuntu 22.04 with Kernel Version 5.16, and ChromeOS

Kind regards,

Paul
Re: [PATCH] drm/amdgpu/vcn: fixed no previous prototype warning.
Dear James,

You sent this message 30 minutes after sending the same message before
(Message-ID: <20220508173008.281673-1-james@amd.com>). Are there any
differences?

Am 08.05.22 um 20:00 schrieb James Zhu:
> Fixed warning: no previous prototype for 'vcn_dec_sw_ring_emit_fence'

Please use imperative mood in the commit message summary. Maybe even:

> Include header for `vcn_dec_sw_ring_emit_fence()`

Also, please do not add a dot/period to the end of commit messages.

Kind regards,

Paul

> Signed-off-by: James Zhu
> Reported-by: kernel test robot
> ---
>  drivers/gpu/drm/amd/amdgpu/vcn_sw_ring.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_sw_ring.c b/drivers/gpu/drm/amd/amdgpu/vcn_sw_ring.c
> index f4f97f0f5c64..1ceda3d0cd5b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vcn_sw_ring.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_sw_ring.c
> @@ -22,6 +22,7 @@
>   */
>
>  #include "amdgpu.h"
> +#include "vcn_sw_ring.h"
>
>  void vcn_dec_sw_ring_emit_fence(struct amdgpu_ring *ring, u64 addr, u64 seq,
>  				uint32_t flags)
Re: [PATCH] amdgpu/pm: Disable managing power profiles on SRIOV for Sienna Cichlid
Dear Danijel,

Am 06.05.22 um 09:52 schrieb Danijel Slivka:

Please add a reasoning/motivation in the commit message body.

Kind regards,

Paul

> Signed-off-by: Danijel Slivka
> ---
>  drivers/gpu/drm/amd/pm/amdgpu_pm.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> index 70a0aad05426..59a662aeaaf3 100644
> --- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> +++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> @@ -2025,6 +2025,8 @@ static int default_attr_update(struct amdgpu_device *adev, struct amdgpu_device_
>  	} else if (DEVICE_ATTR_IS(pp_power_profile_mode)) {
>  		if (amdgpu_dpm_get_power_profile_mode(adev, NULL) == -EOPNOTSUPP)
>  			*states = ATTR_STATE_UNSUPPORTED;
> +		else if (gc_ver == IP_VERSION(10, 3, 0) && amdgpu_sriov_vf(adev))
> +			*states = ATTR_STATE_UNSUPPORTED;
>  	}
>
>  	switch (gc_ver) {
> @@ -2038,6 +2040,13 @@ static int default_attr_update(struct amdgpu_device *adev, struct amdgpu_device_
>  			dev_attr->store = NULL;
>  		}
>  		break;
> +	case IP_VERSION(10, 3, 0):
> +		if (DEVICE_ATTR_IS(power_dpm_force_performance_level) &&
> +		    amdgpu_sriov_vf(adev)) {
> +			dev_attr->attr.mode &= ~S_IWUGO;
> +			dev_attr->store = NULL;
> +		}
> +		break;
>  	default:
>  		break;
>  	}
Re: [PATCH] drm/amdgpu: flush delete wq after wait fence
Dear Yiqing,

Thank you for your patch.

Am 05.05.22 um 08:35 schrieb Yiqing Yao:
> [why]
> lru_list not empty warning in sw fini during repeated device bind
> unbind. There should be a amdgpu_fence_wait_empty() before the
> flush_delayed_work() call as sugested.

sug*g*ested

Suggested by whom? Or do you mean suggested in this diff?

> [how]
> Do flush_delayed_work for ttm bo delayed delete wq after
> fence_driver_hw_fini.
>
> Signed-off-by: Yiqing Yao
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 14c5ccf81e80..92e5ed3ed345 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4003,10 +4003,6 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>  {
>  	dev_info(adev->dev, "amdgpu: finishing device.\n");
>  	flush_delayed_work(&adev->delayed_init_work);
> -	if (adev->mman.initialized) {
> -		flush_delayed_work(&adev->mman.bdev.wq);
> -		ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
> -	}

From the commit message, it’s not clear, that you remove this here.

Kind regards,

Paul

>  	adev->shutdown = true;
>
>  	/* make sure IB test finished before entering exclusive mode
> @@ -4029,6 +4025,11 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>  	}
>  	amdgpu_fence_driver_hw_fini(adev);
>
> +	if (adev->mman.initialized) {
> +		flush_delayed_work(&adev->mman.bdev.wq);
> +		ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
> +	}
> +
>  	if (adev->pm_sysfs_en)
>  		amdgpu_pm_sysfs_fini(adev);
>  	if (adev->ucode_sysfs_en)
Re: [PATCHv4] drm/amdgpu: disable ASPM on Intel Alder Lake based systems
Dear Daniel,

Am 03.05.22 um 14:25 schrieb Daniel Stone:
> On Sun, 1 May 2022 at 08:08, Paul Menzel wrote:
>> Am 26.04.22 um 15:53 schrieb Gong, Richard:
>>> I think so. We captured dmesg log.
>>
>> Then the (whole) system did *not* freeze, if you could still log in
>> (maybe over network) and execute `dmesg`. Please also paste the
>> amdgpu(?) error logs in the commit message.
>>
>>> As mentioned early we need support from Intel on how to get ASPM
>>> working for VI generation on Intel Alder Lake, but we don't know
>>> where things currently stand.
>>
>> Who is working on this, and knows?
>
> This has gone beyond the point of a reasonable request. The amount of
> detail you're demanding is completely unnecessary.

If a quirk is introduced possibly leading to higher power consumption,
especially on systems nobody has access to yet, then the detail, where
the system hangs/freezes, is not unreasonable at all. In the Linux logs
from the issue there are messages like

    [   58.101385] Freezing of tasks failed after 20.003 seconds (4 tasks refusing to freeze, wq_busy=0):
    [   78.278403] Freezing of tasks failed after 20.008 seconds (4 tasks refusing to freeze, wq_busy=0):

and it looks like several suspend/resume cycles were done.

I see a lot of commit messages over the whole Linux kernel, where this
level of detail is provided (by default). The second question was not
for the commit message, but just for documentation purposes when the
problem is going to be fixed properly. And it looks like (at least
publicly) analyzing the root cause is not happening, and once the quirk
lands, nobody is going to feel the pressure to work on it, as everyone’s
plates are full.

Kind regards,

Paul
Re: [PATCHv5] drm/amdgpu: vi: disable ASPM on Intel Alder Lake based systems
Dear Richard,

Am 29.04.22 um 18:06 schrieb Richard Gong:
> Active State Power Management (ASPM) feature is enabled since kernel
> 5.14. There are some AMD Volcanic Islands (VI) GFX cards, such as the
> WX3200 and RX640, that do not work with ASPM-enabled Intel Alder Lake
> based systems. Using these GFX cards as video/display output, Intel
> Alder Lake based systems will freeze after suspend/resume.

As replied in v4 just now, “freeze” is misleading if you can still run
`dmesg` after resume.

Kind regards,

Paul

> The issue was originally reported on one system (Dell Precision 3660
> with BIOS version 0.14.81), but was later confirmed to affect at least
> 4 pre-production Alder Lake based systems.
>
> Add an extra check to disable ASPM on Intel Alder Lake based systems
> with the problematic AMD Volcanic Islands GFX cards.
>
> Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default")
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1885
> Reported-by: kernel test robot
> Signed-off-by: Richard Gong
> ---
> v5: added vi to commit header and updated commit message
>     rolled back guard with the preprocessor as did in v2 to correct
>     build error on non-x86 systems
> v4: s/CONFIG_X86_64/CONFIG_X86
>     enhanced check logic
> v3: s/intel_core_aspm_chk/aspm_support_quirk_check
>     correct build error with W=1 option
> v2: correct commit description
>     move the check from chip family to problematic platform
> ---
>  drivers/gpu/drm/amd/amdgpu/vi.c | 17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c
> index 039b90cdc3bc..45f0188c4273 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vi.c
> @@ -81,6 +81,10 @@
>  #include "mxgpu_vi.h"
>  #include "amdgpu_dm.h"
>
> +#if IS_ENABLED(CONFIG_X86)
> +#include
> +#endif
> +
>  #define ixPCIE_LC_L1_PM_SUBSTATE 0x100100C6
>  #define PCIE_LC_L1_PM_SUBSTATE__LC_L1_SUBSTATES_OVERRIDE_EN_MASK 0x0001L
>  #define PCIE_LC_L1_PM_SUBSTATE__LC_PCI_PM_L1_2_OVERRIDE_MASK 0x0002L
> @@ -1134,13 +1138,24 @@ static void vi_enable_aspm(struct amdgpu_device *adev)
>  	WREG32_PCIE(ixPCIE_LC_CNTL, data);
>  }
>
> +static bool aspm_support_quirk_check(void)
> +{
> +#if IS_ENABLED(CONFIG_X86)
> +	struct cpuinfo_x86 *c = &cpu_data(0);
> +
> +	return !(c->x86 == 6 && c->x86_model == INTEL_FAM6_ALDERLAKE);
> +#else
> +	return true;
> +#endif
> +}
> +
>  static void vi_program_aspm(struct amdgpu_device *adev)
>  {
>  	u32 data, data1, orig;
>  	bool bL1SS = false;
>  	bool bClkReqSupport = true;
>
> -	if (!amdgpu_device_should_use_aspm(adev))
> +	if (!amdgpu_device_should_use_aspm(adev) || !aspm_support_quirk_check())
>  		return;
>
>  	if (adev->flags & AMD_IS_APU ||
Re: [PATCHv4] drm/amdgpu: disable ASPM on Intel Alder Lake based systems
Dear Richard,

Sorry for the late reply.

Am 26.04.22 um 15:53 schrieb Gong, Richard:
> On 4/21/2022 12:35 AM, Paul Menzel wrote:
>> Am 21.04.22 um 03:12 schrieb Gong, Richard:
>>> On 4/20/2022 3:29 PM, Paul Menzel wrote:
>>>> Am 19.04.22 um 23:46 schrieb Gong, Richard:
>>>>> On 4/14/2022 2:52 AM, Paul Menzel wrote:
>>>>>> [Cc: -kernel test robot]
>>>>>>
>>>>>> […]
>>>>>>
>>>>>> Am 13.04.22 um 15:00 schrieb Alex Deucher:
>>>>>>> On Wed, Apr 13, 2022 at 3:43 AM Paul Menzel wrote:
>>>>>>>> Thank you for sending out v4.
>>>>>>>>
>>>>>>>> Am 12.04.22 um 23:50 schrieb Richard Gong:
>>>>>>>> […]
>>>>>>>> I am still not clear, what “hang during suspend/resume” means.
>>>>>>>> I guess suspending works fine? During resume (S3 or S0ix?),
>>>>>>>> where does it hang? The system is functional, but there are only
>>>>>>>> display problems?
>>>>> System freeze after suspend/resume.
>>>> But you see certain messages still? At what point does it freeze
>>>> exactly? In the bug report you posted Linux messages.
>>> No, the system freeze then users have to recycle power to recover.
>> Then I misread the issue? Did you capture the messages over serial log
>> then?
> I think so. We captured dmesg log.

Then the (whole) system did *not* freeze, if you could still log in
(maybe over network) and execute `dmesg`. Please also paste the
amdgpu(?) error logs in the commit message.

> As mentioned early we need support from Intel on how to get ASPM
> working for VI generation on Intel Alder Lake, but we don't know where
> things currently stand.

Who is working on this, and knows?

Kind regards,

Paul
Re: [PATCHv4] drm/amdgpu: disable ASPM on Intel Alder Lake based systems
Dear Richard,

Am 21.04.22 um 03:12 schrieb Gong, Richard:
> On 4/20/2022 3:29 PM, Paul Menzel wrote:
>> Am 19.04.22 um 23:46 schrieb Gong, Richard:
>>> On 4/14/2022 2:52 AM, Paul Menzel wrote:
>>>> [Cc: -kernel test robot]
>>>>
>>>> […]
>>>>
>>>> Am 13.04.22 um 15:00 schrieb Alex Deucher:
>>>>> On Wed, Apr 13, 2022 at 3:43 AM Paul Menzel wrote:
>>>>>> Thank you for sending out v4.
>>>>>>
>>>>>> Am 12.04.22 um 23:50 schrieb Richard Gong:
>>>>>>> Active State Power Management (ASPM) feature is enabled since
>>>>>>> kernel 5.14. There are some AMD GFX cards (such as WX3200 and
>>>>>>> RX640) that won't work with ASPM-enabled Intel Alder Lake based
>>>>>>> systems. Using these GFX cards as video/display output, Intel
>>>>>>> Alder Lake based systems will hang during suspend/resume.
>>>>>>
>>>>>> [Your email program wraps lines in cited text for some reason,
>>>>>> making the citation harder to read.]
>>> Not sure why, I am using Mozila Thunderbird for email. I am not using
>>> MS Outlook for upstream email.
>> Strange. No idea if there were bugs in Mozilla Thunderbird 91.2.0,
>> released over half year ago. The current version is 91.8.1. [1]
>>>>>> I am still not clear, what “hang during suspend/resume” means. I
>>>>>> guess suspending works fine? During resume (S3 or S0ix?), where
>>>>>> does it hang? The system is functional, but there are only display
>>>>>> problems?
>>> System freeze after suspend/resume.
>> But you see certain messages still? At what point does it freeze
>> exactly? In the bug report you posted Linux messages.
> No, the system freeze then users have to recycle power to recover.

Then I misread the issue? Did you capture the messages over serial log
then?

>>>>>>> The issue was initially reported on one system (Dell Precision
>>>>>>> 3660 with BIOS version 0.14.81), but was later confirmed to
>>>>>>> affect at least 4 Alder Lake based systems.
>>>>>>>
>>>>>>> Add extra check to disable ASPM on Intel Alder Lake based
>>>>>>> systems.
>>>>>>>
>>>>>>> Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default")
>>>>>>> Link: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fdrm%2Famd%2F-%2Fissues%2F1885data=05%7C01%7Crichard.gong%40amd.com%7Cce01de048c61456174ff08da230c750d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637860833680922036%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7Csdata=vqhh3dTc%2FgBt7GrP9hKppWlrFy2F7DaivkNEuGekl0g%3Dreserved=0
>>>>>>
>>>>>> Thank you Microsoft Outlook for keeping us safe. :(
> I am not using MS Outlook for the email exchanges.

I guess, it’s not the client but the Microsoft email service (Exchange?)
adding these protection links. (Making it even harder for users to
actually verify the domain. No idea who comes up with these ideas, and
customers actually accepting those.)

>>>>>>> Reported-by: kernel test robot
>>>>>>
>>>>>> This tag is a little confusing. Maybe clarify that it was for an
>>>>>> issue in a previous patch iteration?
>>> I did describe in change-list version 3 below, which corrected the
>>> build error with W=1 option. It is not good idea to add the
>>> description for that to the commit message, this is why I add
>>> descriptions on change-list version 3.

Do as you wish, but the current style is confusing, and readers of the
commit are going to think, the kernel test robot reported the problem
with AMD VI ASICs and Intel Alder Lake systems.

>>>>>>> Signed-off-by: Richard Gong
>>>>>>> ---
>>>>>>> v4: s/CONFIG_X86_64/CONFIG_X86
>>>>>>>     enhanced check logic
>>>>>>> v3: s/intel_core_asom_chk/aspm_support_quirk_check
>>>>>>>     correct build error with W=1 option
>>>>>>> v2: correct commit description
>>>>>>>     move the check from chip family to problematic platform
>>>>>>> ---
>>>>>>>  drivers/gpu/drm/amd/amdgpu/vi.c | 17 ++++++++++++++++-
>>>>>>>  1 file changed, 16 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c
>>>>>>> index 039b90cdc3bc..b33e0a9bee65 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/vi.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vi.c
>>>>>>> @@ -81,6 +81,10 @@
>>>>>>>  #include "mxgpu_vi.h"
>>>>>>>  #include "amdgpu_dm.h"
>>>>>>>
>>>>>>> +#if IS_ENABLED(CONFIG_X86)
>>>>>>> +#include
>>>>>>> +#endif
>>>>>>> +
>>>>>>>  #define ixPCIE_LC_L1_PM_SUBSTATE 0x100100C6
>>>>>>>  #define PCIE_LC_L1_PM_SUBSTATE__LC_L1_SUBSTATES_OVERRIDE_EN_MASK 0x0001L
>>>>>>>  #define PCIE_LC_L1_PM_SUBSTATE__LC_PCI_PM_L1_2_OVERRIDE_MASK 0x0002L
>>>>>>> @@ -1134,13 +1138,24 @@ static void vi_enable_aspm(struct amdgpu_device *adev)
>>>>>>>  	WREG32_PCIE(ixPCIE_LC_CNTL, data);
>>>>>>>  }
>>>>>>>
>>>>>>> +static bool aspm_support_quirk_check(void)
>>>>>>> +{
>>>>>>> +	if (IS_ENABLED(CONFIG_X86)) {
>>>>>>> +		struct cpuinfo_x86 *c = &cpu_data(0);
>>>>>>> +
>>>>>>> +		return !(c->x86 == 6 && c->x86_model == INTEL_FAM6_ALDERLAKE);
>>>>>>> +	}
>>>>>>> +
>>>>>>> +	return true;
>>>>>>> +}
>>>>>>> +
>>>>>>>  static void vi_program_aspm(struct amdgpu_device *adev)
>>>>>>>  {
>>>>>>>  	u32 data, data1, orig;
>>>>>>>  	bool bL1SS = false;
>>>>>>>  	bool bClkReqSupport = true;
>>>>>>>
>>>>>>> -	if (!amdgpu_device_should_use_aspm(adev))
>>>>>>> +	if (!amdgpu_device_should_use_aspm(adev) || !aspm_support_quirk_check())
>>>>>>>  		return;
>>>>>>
>>>>>> Can users still forcefully enable ASPM with the parameter
>>>>>> `amdgpu.aspm`?
> As Mario mentioned in
Re: [PATCHv4] drm/amdgpu: disable ASPM on Intel Alder Lake based systems
Dear Richard, Am 20.04.22 um 22:56 schrieb Gong, Richard: On 4/20/2022 3:48 PM, Paul Menzel wrote: Am 20.04.22 um 22:40 schrieb Alex Deucher: On Wed, Apr 20, 2022 at 4:29 PM Paul Menzel wrote: Am 19.04.22 um 23:46 schrieb Gong, Richard: On 4/14/2022 2:52 AM, Paul Menzel wrote: [Cc: -kernel test robot ] […] Am 13.04.22 um 15:00 schrieb Alex Deucher: On Wed, Apr 13, 2022 at 3:43 AM Paul Menzel wrote: Thank you for sending out v4. Am 12.04.22 um 23:50 schrieb Richard Gong: Active State Power Management (ASPM) feature is enabled since kernel 5.14. There are some AMD GFX cards (such as WX3200 and RX640) that won't work with ASPM-enabled Intel Alder Lake based systems. Using these GFX cards as video/display output, Intel Alder Lake based systems will hang during suspend/resume. [Your email program wraps lines in cited text for some reason, making the citation harder to read.] I am still not clear, what “hang during suspend/resume” means. I guess suspending works fine? During resume (S3 or S0ix?), where does it hang? The system is functional, but there are only display problems? System freeze after suspend/resume. But you see certain messages still? At what point does it freeze exactly? In the bug report you posted Linux messages. The issue was initially reported on one system (Dell Precision 3660 with BIOS version 0.14.81), but was later confirmed to affect at least 4 Alder Lake based systems. Add extra check to disable ASPM on Intel Alder Lake based systems. 
Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default") Link: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fdrm%2Famd%2F-%2Fissues%2F1885data=05%7C01%7Crichard.gong%40amd.com%7C487aaa63098b462e146a08da230f2319%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637860845178176835%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7Csdata=3IVldn05qNa2XVp1Lu58SriS8k9mk4U9K9p3F3IYPe0%3Dreserved=0 Thank you Microsoft Outlook for keeping us safe. :( Reported-by: kernel test robot This tag is a little confusing. Maybe clarify that it was for an issue in a previous patch iteration? I did describe in change-list version 3 below, which corrected the build error with W=1 option. It is not good idea to add the description for that to the commit message, this is why I add descriptions on change-list version 3. Do as you wish, but the current style is confusing, and readers of the commit are going to think, the kernel test robot reported the problem with AMD VI ASICs and Intel Alder Lake systems. 
Signed-off-by: Richard Gong --- v4: s/CONFIG_X86_64/CONFIG_X86 enhanced check logic v3: s/intel_core_asom_chk/aspm_support_quirk_check correct build error with W=1 option v2: correct commit description move the check from chip family to problematic platform --- drivers/gpu/drm/amd/amdgpu/vi.c | 17 - 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c index 039b90cdc3bc..b33e0a9bee65 100644 --- a/drivers/gpu/drm/amd/amdgpu/vi.c +++ b/drivers/gpu/drm/amd/amdgpu/vi.c @@ -81,6 +81,10 @@ #include "mxgpu_vi.h" #include "amdgpu_dm.h" +#if IS_ENABLED(CONFIG_X86) +#include +#endif + #define ixPCIE_LC_L1_PM_SUBSTATE 0x100100C6 #define PCIE_LC_L1_PM_SUBSTATE__LC_L1_SUBSTATES_OVERRIDE_EN_MASK 0x0001L #define PCIE_LC_L1_PM_SUBSTATE__LC_PCI_PM_L1_2_OVERRIDE_MASK 0x0002L @@ -1134,13 +1138,24 @@ static void vi_enable_aspm(struct amdgpu_device *adev) WREG32_PCIE(ixPCIE_LC_CNTL, data); } +static bool aspm_support_quirk_check(void) +{ + if (IS_ENABLED(CONFIG_X86)) { + struct cpuinfo_x86 *c = _data(0); + + return !(c->x86 == 6 && c->x86_model == INTEL_FAM6_ALDERLAKE); + } + + return true; +} + static void vi_program_aspm(struct amdgpu_device *adev) { u32 data, data1, orig; bool bL1SS = false; bool bClkReqSupport = true; - if (!amdgpu_device_should_use_aspm(adev)) + if (!amdgpu_device_should_use_aspm(adev) || !aspm_support_quirk_check()) return; Can users still forcefully enable ASPM with the parameter `amdgpu.aspm`? As Mario mentioned in a separate reply, we can't forcefully enable ASPM with the parameter 'amdgpu.aspm'. That would be a regression on systems where ASPM used to work. Hmm. I guess, you could say, there are no such systems. if (adev->flags & AMD_IS_APU || If I remember correctly, there were also newer cards, where ASPM worked with Intel Alder Lake, right? Can only the problematic generations for WX3200 and RX640 be excluded from ASPM? 
This patch only disables it for the generation that was problematic. Could that please be made clear in the commit message summary, and message?
Re: [PATCHv4] drm/amdgpu: disable ASPM on Intel Alder Lake based systems
Dear Alex, Am 20.04.22 um 22:40 schrieb Alex Deucher: On Wed, Apr 20, 2022 at 4:29 PM Paul Menzel wrote: Am 19.04.22 um 23:46 schrieb Gong, Richard: On 4/14/2022 2:52 AM, Paul Menzel wrote: [Cc: -kernel test robot ] […] Am 13.04.22 um 15:00 schrieb Alex Deucher: On Wed, Apr 13, 2022 at 3:43 AM Paul Menzel wrote: Thank you for sending out v4. Am 12.04.22 um 23:50 schrieb Richard Gong: Active State Power Management (ASPM) feature is enabled since kernel 5.14. There are some AMD GFX cards (such as WX3200 and RX640) that won't work with ASPM-enabled Intel Alder Lake based systems. Using these GFX cards as video/display output, Intel Alder Lake based systems will hang during suspend/resume. [Your email program wraps lines in cited text for some reason, making the citation harder to read.] I am still not clear, what “hang during suspend/resume” means. I guess suspending works fine? During resume (S3 or S0ix?), where does it hang? The system is functional, but there are only display problems? System freeze after suspend/resume. But you see certain messages still? At what point does it freeze exactly? In the bug report you posted Linux messages. The issue was initially reported on one system (Dell Precision 3660 with BIOS version 0.14.81), but was later confirmed to affect at least 4 Alder Lake based systems. Add extra check to disable ASPM on Intel Alder Lake based systems. Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default") Link: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fdrm%2Famd%2F-%2Fissues%2F1885data=04%7C01%7Crichard.gong%40amd.com%7Ce7febed5d6a441c3a58008da1debb99c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637855195670542145%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=7cEnE%2BSM9e5IGFxSLloCLtCOxovBpaPz0Ns0Ta2vVlc%3Dreserved=0 Thank you Microsoft Outlook for keeping us safe. :( Reported-by: kernel test robot This tag is a little confusing. 
Maybe clarify that it was for an issue in a previous patch iteration? I did describe it in the change list of version 3 below, which corrected the build error with the W=1 option. It is not a good idea to add that description to the commit message, which is why I put it in the change list of version 3. Do as you wish, but the current style is confusing, and readers of the commit are going to think the kernel test robot reported the problem with AMD VI ASICs and Intel Alder Lake systems. Signed-off-by: Richard Gong --- v4: s/CONFIG_X86_64/CONFIG_X86 enhanced check logic v3: s/intel_core_asom_chk/aspm_support_quirk_check correct build error with W=1 option v2: correct commit description move the check from chip family to problematic platform --- drivers/gpu/drm/amd/amdgpu/vi.c | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c index 039b90cdc3bc..b33e0a9bee65 100644 --- a/drivers/gpu/drm/amd/amdgpu/vi.c +++ b/drivers/gpu/drm/amd/amdgpu/vi.c @@ -81,6 +81,10 @@ #include "mxgpu_vi.h" #include "amdgpu_dm.h" +#if IS_ENABLED(CONFIG_X86) +#include <asm/intel-family.h> +#endif + #define ixPCIE_LC_L1_PM_SUBSTATE 0x100100C6 #define PCIE_LC_L1_PM_SUBSTATE__LC_L1_SUBSTATES_OVERRIDE_EN_MASK 0x0001L #define PCIE_LC_L1_PM_SUBSTATE__LC_PCI_PM_L1_2_OVERRIDE_MASK 0x0002L @@ -1134,13 +1138,24 @@ static void vi_enable_aspm(struct amdgpu_device *adev) WREG32_PCIE(ixPCIE_LC_CNTL, data); } +static bool aspm_support_quirk_check(void) +{ + if (IS_ENABLED(CONFIG_X86)) { + struct cpuinfo_x86 *c = &cpu_data(0); + + return !(c->x86 == 6 && c->x86_model == INTEL_FAM6_ALDERLAKE); + } + + return true; +} + static void vi_program_aspm(struct amdgpu_device *adev) { u32 data, data1, orig; bool bL1SS = false; bool bClkReqSupport = true; - if (!amdgpu_device_should_use_aspm(adev)) + if (!amdgpu_device_should_use_aspm(adev) || !aspm_support_quirk_check()) return; Can users still forcefully enable ASPM with the parameter `amdgpu.aspm`?
As Mario mentioned in a separate reply, we can't forcefully enable ASPM with the parameter 'amdgpu.aspm'. That would be a regression on systems where ASPM used to work. Hmm. I guess, you could say, there are no such systems. if (adev->flags & AMD_IS_APU || If I remember correctly, there were also newer cards, where ASPM worked with Intel Alder Lake, right? Can only the problematic generations for WX3200 and RX640 be excluded from ASPM? This patch only disables it for the generation that was problematic. Could that please be made clear in the commit message summary, and message? Are you ok with the commit messages below? Please change the commit message summary. Maybe: drm/amdgpu: VI: Disa
Re: [PATCHv4] drm/amdgpu: disable ASPM on Intel Alder Lake based systems
Dear Richard, Am 19.04.22 um 23:46 schrieb Gong, Richard: On 4/14/2022 2:52 AM, Paul Menzel wrote: [Cc: -kernel test robot ] […] Am 13.04.22 um 15:00 schrieb Alex Deucher: On Wed, Apr 13, 2022 at 3:43 AM Paul Menzel wrote: Thank you for sending out v4. Am 12.04.22 um 23:50 schrieb Richard Gong: Active State Power Management (ASPM) feature is enabled since kernel 5.14. There are some AMD GFX cards (such as WX3200 and RX640) that won't work with ASPM-enabled Intel Alder Lake based systems. Using these GFX cards as video/display output, Intel Alder Lake based systems will hang during suspend/resume. [Your email program wraps lines in cited text for some reason, making the citation harder to read.] I am still not clear, what “hang during suspend/resume” means. I guess suspending works fine? During resume (S3 or S0ix?), where does it hang? The system is functional, but there are only display problems? System freeze after suspend/resume. But you see certain messages still? At what point does it freeze exactly? In the bug report you posted Linux messages. The issue was initially reported on one system (Dell Precision 3660 with BIOS version 0.14.81), but was later confirmed to affect at least 4 Alder Lake based systems. Add extra check to disable ASPM on Intel Alder Lake based systems. Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default") Link: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fdrm%2Famd%2F-%2Fissues%2F1885data=04%7C01%7Crichard.gong%40amd.com%7Ce7febed5d6a441c3a58008da1debb99c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637855195670542145%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=7cEnE%2BSM9e5IGFxSLloCLtCOxovBpaPz0Ns0Ta2vVlc%3Dreserved=0 Thank you Microsoft Outlook for keeping us safe. :( Reported-by: kernel test robot This tag is a little confusing. Maybe clarify that it was for an issue in a previous patch iteration? 
I did describe it in the change list of version 3 below, which corrected the build error with the W=1 option. It is not a good idea to add that description to the commit message, which is why I put it in the change list of version 3. Do as you wish, but the current style is confusing, and readers of the commit are going to think the kernel test robot reported the problem with AMD VI ASICs and Intel Alder Lake systems. Signed-off-by: Richard Gong --- v4: s/CONFIG_X86_64/CONFIG_X86 enhanced check logic v3: s/intel_core_asom_chk/aspm_support_quirk_check correct build error with W=1 option v2: correct commit description move the check from chip family to problematic platform --- drivers/gpu/drm/amd/amdgpu/vi.c | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c index 039b90cdc3bc..b33e0a9bee65 100644 --- a/drivers/gpu/drm/amd/amdgpu/vi.c +++ b/drivers/gpu/drm/amd/amdgpu/vi.c @@ -81,6 +81,10 @@ #include "mxgpu_vi.h" #include "amdgpu_dm.h" +#if IS_ENABLED(CONFIG_X86) +#include <asm/intel-family.h> +#endif + #define ixPCIE_LC_L1_PM_SUBSTATE 0x100100C6 #define PCIE_LC_L1_PM_SUBSTATE__LC_L1_SUBSTATES_OVERRIDE_EN_MASK 0x0001L #define PCIE_LC_L1_PM_SUBSTATE__LC_PCI_PM_L1_2_OVERRIDE_MASK 0x0002L @@ -1134,13 +1138,24 @@ static void vi_enable_aspm(struct amdgpu_device *adev) WREG32_PCIE(ixPCIE_LC_CNTL, data); } +static bool aspm_support_quirk_check(void) +{ + if (IS_ENABLED(CONFIG_X86)) { + struct cpuinfo_x86 *c = &cpu_data(0); + + return !(c->x86 == 6 && c->x86_model == INTEL_FAM6_ALDERLAKE); + } + + return true; +} + static void vi_program_aspm(struct amdgpu_device *adev) { u32 data, data1, orig; bool bL1SS = false; bool bClkReqSupport = true; - if (!amdgpu_device_should_use_aspm(adev)) + if (!amdgpu_device_should_use_aspm(adev) || !aspm_support_quirk_check()) return; Can users still forcefully enable ASPM with the parameter `amdgpu.aspm`?
As Mario mentioned in a separate reply, we can't forcefully enable ASPM with the parameter 'amdgpu.aspm'. That would be a regression on systems where ASPM used to work. Hmm. I guess, you could say, there are no such systems. if (adev->flags & AMD_IS_APU || If I remember correctly, there were also newer cards, where ASPM worked with Intel Alder Lake, right? Can only the problematic generations for WX3200 and RX640 be excluded from ASPM? This patch only disables it for the generation that was problematic. Could that please be made clear in the commit message summary, and message? Are you ok with the commit messages below? Please change the commit message summary. Maybe: drm/amdgpu: VI: Disable ASPM on Intel Alder Lake based systems Active State Power Management (ASPM) featur
Re: [PATCH] drm/amdgpu: clean up psp ip if hw_init failed v2
Dear Alice, Thank you for your patch. Am 20.04.22 um 21:37 schrieb Alice Wong: Call psp_hw_fini when psp_hw_init failed. Please amend the commit message, and add the motivation/reasoning too [1]. I think it’s common, if a patch (series) is rerolled, to list the changes between the versions. Kind regards, Paul [1]: https://cbea.ms/git-commit/ Signed-off-by: Alice Wong --- drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 57 + 1 file changed, 29 insertions(+), 28 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c index 5b9e48d44f5b..52b14efa848a 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c @@ -2537,6 +2537,34 @@ static int psp_load_fw(struct amdgpu_device *adev) return ret; } +static int psp_hw_fini(void *handle) +{ + struct amdgpu_device *adev = (struct amdgpu_device *)handle; + struct psp_context *psp = &adev->psp; + + if (psp->ta_fw) { + psp_ras_terminate(psp); + psp_securedisplay_terminate(psp); + psp_rap_terminate(psp); + psp_dtm_terminate(psp); + psp_hdcp_terminate(psp); + } + + psp_asd_terminate(psp); + + psp_tmr_terminate(psp); + psp_ring_destroy(psp, PSP_RING_TYPE__KM); + + amdgpu_bo_free_kernel(&psp->fw_pri_bo, + &psp->fw_pri_mc_addr, &psp->fw_pri_buf); + amdgpu_bo_free_kernel(&psp->fence_buf_bo, + &psp->fence_buf_mc_addr, &psp->fence_buf); + amdgpu_bo_free_kernel(&psp->cmd_buf_bo, &psp->cmd_buf_mc_addr, + (void **)&psp->cmd_buf_mem); + + return 0; +} + static int psp_hw_init(void *handle) { int ret; @@ -2563,37 +2591,10 @@ static int psp_hw_init(void *handle) failed: adev->firmware.load_type = AMDGPU_FW_LOAD_DIRECT; mutex_unlock(&adev->firmware.mutex); + psp_hw_fini(handle); return -EINVAL; } -static int psp_hw_fini(void *handle) -{ - struct amdgpu_device *adev = (struct amdgpu_device *)handle; - struct psp_context *psp = &adev->psp; - - if (psp->ta_fw) { - psp_ras_terminate(psp); - psp_securedisplay_terminate(psp); - psp_rap_terminate(psp); - psp_dtm_terminate(psp); - psp_hdcp_terminate(psp); - } - - 
psp_asd_terminate(psp); - - psp_tmr_terminate(psp); - psp_ring_destroy(psp, PSP_RING_TYPE__KM); - - amdgpu_bo_free_kernel(&psp->fw_pri_bo, - &psp->fw_pri_mc_addr, &psp->fw_pri_buf); - amdgpu_bo_free_kernel(&psp->fence_buf_bo, - &psp->fence_buf_mc_addr, &psp->fence_buf); - amdgpu_bo_free_kernel(&psp->cmd_buf_bo, &psp->cmd_buf_mc_addr, - (void **)&psp->cmd_buf_mem); - - return 0; -} - static int psp_suspend(void *handle) { int ret;
Re: [PATCH] drm/amdgpu: don't runtime suspend if there are displays attached (v2)
Dear Alex, Thank you for the patch. Am 13.04.22 um 22:15 schrieb Alex Deucher: We normally runtime suspend when there are displays attached if they are in the DPMS off state, however, if something wakes the GPU we send a hotplug event on resume (in case any displays were connected while the GPU was in suspend) which can cause userspace to light up the displays again soon after they were turned off. Prior to commit 087451f372bf76 ("drm/amdgpu: use generic fb helpers instead of setting up AMD own's."), the driver took a runtime pm reference when the fbdev emulation was enabled because we didn't implement proper shadowing support for vram access when the device was off so the device never runtime suspended when there was a console bound. Once that commit landed, we now utilize the core fb helper implementation which properly handles the emulation, so runtime pm now suspends in cases where it did not before. Ultimately, we need to sort out why runtime suspend is not working in this case for some users, but this should restore similar behavior to before. v2: move check into runtime_suspend Fixes: 087451f372bf76 ("drm/amdgpu: use generic fb helpers instead of setting up AMD own's.") Tested-by: Michele Ballabio On what system and device?
Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 107 1 file changed, 72 insertions(+), 35 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index 4efaa183abcd..97a1aa02d76e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -2395,6 +2395,71 @@ static int amdgpu_pmops_restore(struct device *dev) return amdgpu_device_resume(drm_dev, true); } +static int amdgpu_runtime_idle_check_display(struct device *dev) +{ + struct pci_dev *pdev = to_pci_dev(dev); + struct drm_device *drm_dev = pci_get_drvdata(pdev); + struct amdgpu_device *adev = drm_to_adev(drm_dev); + + if (adev->mode_info.num_crtc) { + struct drm_connector *list_connector; + struct drm_connector_list_iter iter; + int ret = 0; + + /* XXX: Return busy if any displays are connected to avoid +* possible display wake ups after runtime resume due to Nit: wakeups +* hotplug events in case any displays were connected while +* the GPU was in suspend. Remove this once that is fixed. +*/ Do you have an (internal) issue to track this? 
+ mutex_lock(&drm_dev->mode_config.mutex); + drm_connector_list_iter_begin(drm_dev, &iter); + drm_for_each_connector_iter(list_connector, &iter) { + if (list_connector->status == connector_status_connected) { + ret = -EBUSY; + break; + } + } + drm_connector_list_iter_end(&iter); + mutex_unlock(&drm_dev->mode_config.mutex); + + if (ret) + return ret; + + if (amdgpu_device_has_dc_support(adev)) { + struct drm_crtc *crtc; + + drm_for_each_crtc(crtc, drm_dev) { + drm_modeset_lock(&crtc->mutex, NULL); + if (crtc->state->active) + ret = -EBUSY; + drm_modeset_unlock(&crtc->mutex); + if (ret < 0) + break; + } + } else { + mutex_lock(&drm_dev->mode_config.mutex); + drm_modeset_lock(&drm_dev->mode_config.connection_mutex, NULL); + + drm_connector_list_iter_begin(drm_dev, &iter); + drm_for_each_connector_iter(list_connector, &iter) { + if (list_connector->dpms == DRM_MODE_DPMS_ON) { + ret = -EBUSY; + break; + } + } + + drm_connector_list_iter_end(&iter); + + drm_modeset_unlock(&drm_dev->mode_config.connection_mutex); + mutex_unlock(&drm_dev->mode_config.mutex); + } + if (ret) + return ret; + } + + return 0; +} + static int amdgpu_pmops_runtime_suspend(struct device *dev) { struct pci_dev *pdev = to_pci_dev(dev); @@ -2407,6 +2472,10 @@ static int amdgpu_pmops_runtime_suspend(struct device *dev) return -EBUSY; } + ret = amdgpu_runtime_idle_check_display(dev); + if (ret) + return ret; + /* wait for all rings to drain before suspending */ for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
Re: [PATCH v2 2/2] drm/amdkfd: CRIU add support for GWS queues
Dear David, Thank you for sending out v3 of these patches. Am 19.04.22 um 02:04 schrieb Yat Sin, David: -Original Message- From: Paul Menzel Sent: Monday, April 18, 2022 4:23 PM […] Am 18.04.22 um 18:44 schrieb David Yat Sin: In the commit message summary, you could reorder some words: Add CRIU support for GWS queues Adding support to checkpoint/restore GWS(Global Wave Sync) queues. s/Adding/Add/ Did you miss the two comments above? Please add a space before the (. ACK How can this be tested? We have some internal tests that can be used to specifically test this feature. Nice. Are you going to publish these in the future? Signed-off-by: David Yat Sin --- drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 2 +- drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 10 +++--- 2 files changed, 8 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h index f36062be9ca8..192dbef04c43 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h @@ -1102,7 +1102,7 @@ struct kfd_criu_queue_priv_data { uint32_t priority; uint32_t q_percent; uint32_t doorbell_id; - uint32_t is_gws; + uint32_t gws; Why is the new name better? The old variable (is_gws) was obtained from the queue_properties structure during checkpoint and is only used temporarily during queue creation, so this variable cannot be used to determine whether a queue has gws enabled. The new variable (gws) is obtained from the queue structure. The name is changed to better reflect this. Further down you seem to use it like a boolean though. So a name reflecting that would be nice. 
uint32_t sdma_id; uint32_t eop_ring_buffer_size; uint32_t ctx_save_restore_area_size; diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c index 6eca9509f2e3..4f58e671d39b 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c @@ -636,6 +636,8 @@ static int criu_checkpoint_queue(struct kfd_process_device *pdd, q_data->ctx_save_restore_area_size = q->properties.ctx_save_restore_area_size; + q_data->gws = !!q->gws; + ret = pqm_checkpoint_mqd(&pdd->process->pqm, q->properties.queue_id, mqd, ctl_stack); if (ret) { pr_err("Failed checkpoint queue_mqd (%d)\n", ret); @@ -743,7 +745,6 @@ static void set_queue_properties_from_criu(struct queue_properties *qp, struct kfd_criu_queue_priv_data *q_data) { qp->is_interop = false; - qp->is_gws = q_data->is_gws; qp->queue_percent = q_data->q_percent; qp->priority = q_data->priority; qp->queue_address = q_data->q_address; @@ -826,12 +827,15 @@ int kfd_criu_restore_queue(struct kfd_process *p, NULL); if (ret) { pr_err("Failed to create new queue err:%d\n", ret); - ret = -EINVAL; + goto exit; } + if (q_data->gws) + ret = pqm_set_gws(&p->pqm, q_data->q_id, pdd->dev->gws); + exit: if (ret) - pr_err("Failed to create queue (%d)\n", ret); + pr_err("Failed to restore queue (%d)\n", ret); Maybe separate this out, so it can be applied to stable series. Did you miss this comment? else pr_debug("Queue id %d was restored successfully\n", queue_id); Kind regards, Paul
Re: [PATCH v3 1/2] drm/amdkfd: Fix GWS queue count
Dear David, Thank you for rerolling the patches. Am 19.04.22 um 03:32 schrieb David Yat Sin: dqm->gws_queue_count and pdd->qpd.mapped_gws_queue need to be updated each time the queue gets evicted. Fixes: b8020b0304c8 ("drm/amdkfd: Enable over-subscription with >1 GWS queue") For the future, no blank line is needed to separate the Fixes tag from the rest of the block. Kind regards, Paul Signed-off-by: David Yat Sin Reviewed-by: Felix Kuehling --- .../drm/amd/amdkfd/kfd_device_queue_manager.c | 83 +-- 1 file changed, 37 insertions(+), 46 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c index acf4f7975850..198672264492 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c @@ -130,19 +130,33 @@ void program_sh_mem_settings(struct device_queue_manager *dqm, } static void increment_queue_count(struct device_queue_manager *dqm, - enum kfd_queue_type type) + struct qcm_process_device *qpd, + struct queue *q) { dqm->active_queue_count++; - if (type == KFD_QUEUE_TYPE_COMPUTE || type == KFD_QUEUE_TYPE_DIQ) + if (q->properties.type == KFD_QUEUE_TYPE_COMPUTE || + q->properties.type == KFD_QUEUE_TYPE_DIQ) dqm->active_cp_queue_count++; + + if (q->properties.is_gws) { + dqm->gws_queue_count++; + qpd->mapped_gws_queue = true; + } } static void decrement_queue_count(struct device_queue_manager *dqm, - enum kfd_queue_type type) + struct qcm_process_device *qpd, + struct queue *q) { dqm->active_queue_count--; - if (type == KFD_QUEUE_TYPE_COMPUTE || type == KFD_QUEUE_TYPE_DIQ) + if (q->properties.type == KFD_QUEUE_TYPE_COMPUTE || + q->properties.type == KFD_QUEUE_TYPE_DIQ) dqm->active_cp_queue_count--; + + if (q->properties.is_gws) { + dqm->gws_queue_count--; + qpd->mapped_gws_queue = false; + } } /* @@ -412,7 +426,7 @@ static int create_queue_nocpsch(struct device_queue_manager *dqm, list_add(&q->list, &qpd->queues_list); 
qpd->queue_count++; if (q->properties.is_active) - increment_queue_count(dqm, q->properties.type); + increment_queue_count(dqm, qpd, q); /* * Unconditionally increment this counter, regardless of the queue's @@ -601,13 +615,8 @@ static int destroy_queue_nocpsch_locked(struct device_queue_manager *dqm, deallocate_vmid(dqm, qpd, q); } qpd->queue_count--; - if (q->properties.is_active) { - decrement_queue_count(dqm, q->properties.type); - if (q->properties.is_gws) { - dqm->gws_queue_count--; - qpd->mapped_gws_queue = false; - } - } + if (q->properties.is_active) + decrement_queue_count(dqm, qpd, q); return retval; } @@ -700,12 +709,11 @@ static int update_queue(struct device_queue_manager *dqm, struct queue *q, * dqm->active_queue_count to determine whether a new runlist must be * uploaded. */ - if (q->properties.is_active && !prev_active) - increment_queue_count(dqm, q->properties.type); - else if (!q->properties.is_active && prev_active) - decrement_queue_count(dqm, q->properties.type); - - if (q->gws && !q->properties.is_gws) { + if (q->properties.is_active && !prev_active) { + increment_queue_count(dqm, &pdd->qpd, q); + } else if (!q->properties.is_active && prev_active) { + decrement_queue_count(dqm, &pdd->qpd, q); + } else if (q->gws && !q->properties.is_gws) { if (q->properties.is_active) { dqm->gws_queue_count++; pdd->qpd.mapped_gws_queue = true; @@ -767,11 +775,7 @@ static int evict_process_queues_nocpsch(struct device_queue_manager *dqm, mqd_mgr = dqm->mqd_mgrs[get_mqd_type_from_queue_type( q->properties.type)]; q->properties.is_active = false; - decrement_queue_count(dqm, q->properties.type); - if (q->properties.is_gws) { - dqm->gws_queue_count--; - qpd->mapped_gws_queue = false; - } + decrement_queue_count(dqm, qpd, q); if (WARN_ONCE(!dqm->sched_running, "Evict when stopped\n")) continue; @@ -817,7 +821,7 @@ static int evict_process_queues_cpsch(struct device_queue_manager *dqm, continue; q->properties.is_active = false; -
Re: [PATCH v2 2/2] drm/amdkfd: CRIU add support for GWS queues
Dear David, Thank you for your patch. Am 18.04.22 um 18:44 schrieb David Yat Sin: In the commit message summary, you could reorder some words: Add CRIU support for GWS queues Adding support to checkpoint/restore GWS(Global Wave Sync) queues. s/Adding/Add/ Please add a space before the (. How can this be tested? Signed-off-by: David Yat Sin --- drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 2 +- drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 10 +++--- 2 files changed, 8 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h index f36062be9ca8..192dbef04c43 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h @@ -1102,7 +1102,7 @@ struct kfd_criu_queue_priv_data { uint32_t priority; uint32_t q_percent; uint32_t doorbell_id; - uint32_t is_gws; + uint32_t gws; Why is the new name better? uint32_t sdma_id; uint32_t eop_ring_buffer_size; uint32_t ctx_save_restore_area_size; diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c index 6eca9509f2e3..4f58e671d39b 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c @@ -636,6 +636,8 @@ static int criu_checkpoint_queue(struct kfd_process_device *pdd, q_data->ctx_save_restore_area_size = q->properties.ctx_save_restore_area_size; + q_data->gws = !!q->gws; + ret = pqm_checkpoint_mqd(&pdd->process->pqm, q->properties.queue_id, mqd, ctl_stack); if (ret) { pr_err("Failed checkpoint queue_mqd (%d)\n", ret); @@ -743,7 +745,6 @@ static void set_queue_properties_from_criu(struct queue_properties *qp, struct kfd_criu_queue_priv_data *q_data) { qp->is_interop = false; - qp->is_gws = q_data->is_gws; qp->queue_percent = q_data->q_percent; qp->priority = q_data->priority; qp->queue_address = q_data->q_address; @@ -826,12 +827,15 @@ int kfd_criu_restore_queue(struct kfd_process *p, NULL); if
(ret) { pr_err("Failed to create new queue err:%d\n", ret); - ret = -EINVAL; + goto exit; } + if (q_data->gws) + ret = pqm_set_gws(&p->pqm, q_data->q_id, pdd->dev->gws); + exit: if (ret) - pr_err("Failed to create queue (%d)\n", ret); + pr_err("Failed to restore queue (%d)\n", ret); Maybe separate this out, so it can be applied to stable series. else pr_debug("Queue id %d was restored successfully\n", queue_id); Kind regards, Paul
Re: [PATCH v2 1/2] drm/amdkfd: Fix GWS queue count
Dear David, Thank you for your patch. Am 18.04.22 um 18:44 schrieb David Yat Sin: dqm->gws_queue_count and pdd->qpd.mapped_gws_queue needs to be updated s/needs/need/ each time the queue gets evicted. Why? Do you only change the case, when an element of the queue gets evicted? Next time, a short note about the implementation would be nice. No Fixes tag? Kind regards, Paul Signed-off-by: David Yat Sin --- .../drm/amd/amdkfd/kfd_device_queue_manager.c | 83 +-- 1 file changed, 37 insertions(+), 46 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c index acf4f7975850..198672264492 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c @@ -130,19 +130,33 @@ void program_sh_mem_settings(struct device_queue_manager *dqm, } static void increment_queue_count(struct device_queue_manager *dqm, - enum kfd_queue_type type) + struct qcm_process_device *qpd, + struct queue *q) { dqm->active_queue_count++; - if (type == KFD_QUEUE_TYPE_COMPUTE || type == KFD_QUEUE_TYPE_DIQ) + if (q->properties.type == KFD_QUEUE_TYPE_COMPUTE || + q->properties.type == KFD_QUEUE_TYPE_DIQ) dqm->active_cp_queue_count++; + + if (q->properties.is_gws) { + dqm->gws_queue_count++; + qpd->mapped_gws_queue = true; + } } static void decrement_queue_count(struct device_queue_manager *dqm, - enum kfd_queue_type type) + struct qcm_process_device *qpd, + struct queue *q) { dqm->active_queue_count--; - if (type == KFD_QUEUE_TYPE_COMPUTE || type == KFD_QUEUE_TYPE_DIQ) + if (q->properties.type == KFD_QUEUE_TYPE_COMPUTE || + q->properties.type == KFD_QUEUE_TYPE_DIQ) dqm->active_cp_queue_count--; + + if (q->properties.is_gws) { + dqm->gws_queue_count--; + qpd->mapped_gws_queue = false; + } } /* @@ -412,7 +426,7 @@ static int create_queue_nocpsch(struct device_queue_manager *dqm, list_add(&q->list, &qpd->queues_list); qpd->queue_count++; if (q->properties.is_active) - 
increment_queue_count(dqm, q->properties.type); + increment_queue_count(dqm, qpd, q); /* * Unconditionally increment this counter, regardless of the queue's @@ -601,13 +615,8 @@ static int destroy_queue_nocpsch_locked(struct device_queue_manager *dqm, deallocate_vmid(dqm, qpd, q); } qpd->queue_count--; - if (q->properties.is_active) { - decrement_queue_count(dqm, q->properties.type); - if (q->properties.is_gws) { - dqm->gws_queue_count--; - qpd->mapped_gws_queue = false; - } - } + if (q->properties.is_active) + decrement_queue_count(dqm, qpd, q); return retval; } @@ -700,12 +709,11 @@ static int update_queue(struct device_queue_manager *dqm, struct queue *q, * dqm->active_queue_count to determine whether a new runlist must be * uploaded. */ - if (q->properties.is_active && !prev_active) - increment_queue_count(dqm, q->properties.type); - else if (!q->properties.is_active && prev_active) - decrement_queue_count(dqm, q->properties.type); - - if (q->gws && !q->properties.is_gws) { + if (q->properties.is_active && !prev_active) { + increment_queue_count(dqm, &pdd->qpd, q); + } else if (!q->properties.is_active && prev_active) { + decrement_queue_count(dqm, &pdd->qpd, q); + } else if (q->gws && !q->properties.is_gws) { if (q->properties.is_active) { dqm->gws_queue_count++; pdd->qpd.mapped_gws_queue = true; @@ -767,11 +775,7 @@ static int evict_process_queues_nocpsch(struct device_queue_manager *dqm, mqd_mgr = dqm->mqd_mgrs[get_mqd_type_from_queue_type( q->properties.type)]; q->properties.is_active = false; - decrement_queue_count(dqm, q->properties.type); - if (q->properties.is_gws) { - dqm->gws_queue_count--; - qpd->mapped_gws_queue = false; - } + decrement_queue_count(dqm, qpd, q); if (WARN_ONCE(!dqm->sched_running, "Evict when stopped\n")) continue; @@ -817,7 +821,7 @@ static int evict_process_queues_cpsch(struct device_queue_manager *dqm, continue; q->properties.is_active = false; -
Re: [PATCH 1/2] Documentation/gpu: Add entries to amdgpu glossary
Dear Tales, Thank you for your patch. Am 15.04.22 um 21:50 schrieb Tales Lelo da Aparecida: Add missing acronyms to the amdgpu glossary. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/1939#note_1309737 Signed-off-by: Tales Lelo da Aparecida --- Documentation/gpu/amdgpu/amdgpu-glossary.rst | 13 + 1 file changed, 13 insertions(+) diff --git a/Documentation/gpu/amdgpu/amdgpu-glossary.rst b/Documentation/gpu/amdgpu/amdgpu-glossary.rst index 859dcec6c6f9..48829d097f40 100644 --- a/Documentation/gpu/amdgpu/amdgpu-glossary.rst +++ b/Documentation/gpu/amdgpu/amdgpu-glossary.rst @@ -8,12 +8,19 @@ we have a dedicated glossary for Display Core at .. glossary:: +active_cu_number + The number of CUs that are active on the system. The number of active + CUs may be less than SE * SH * CU depending on the board configuration. + CP Command Processor CPLIB Content Protection Library +CU + Compute unit Capitalize the U in *unit* as seems to be done in the rest of the files? + DFS Digital Frequency Synthesizer @@ -74,6 +81,12 @@ we have a dedicated glossary for Display Core at SDMA System DMA +SE + Shader Engine + +SH + SHader array No idea if the H should be capitalized. + SMU System Management Unit Kind regards, Paul
Re: [PATCH] drm/amdkfd: only allow heavy-weight TLB flush on some ASICs for SVM too
Dear Lang, Am 15.04.22 um 05:20 schrieb Lang Yu: On 04/14/ , Paul Menzel wrote: Am 14.04.22 um 10:19 schrieb Lang Yu: The idea is from commit a50fe7078035 ("drm/amdkfd: Only apply heavy-weight TLB flush on Aldebaran") and commit f61c40c0757a ("drm/amdkfd: enable heavy-weight TLB flush on Arcturus"). Otherwise, we will run into problems on some ASICs when running SVM applications. Please list the ASICs, you know of having problems, and even add how to reproduce this. Actually, this is ported from previous commits. You can find more details from the commits I mentioned. At the moment the ASICs except Aldebaran and Arcturus probably have the problem. I think, it’s always good to make it as easy as possible for reviewers and, later, people reading a commit, and include the necessary information directly in the commit message. It’d be great if you amended the commit message. And running a SVM application could reproduce the issue. Thanks. How will it fail though? (Also, a small implementation note would be nice to have. Maybe: Move the helper function into the header `kfd_priv.h`, and use in `svm_range_unmap_from_gpus()`.) 
Kind regards, Paul Signed-off-by: Lang Yu --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 8 drivers/gpu/drm/amd/amdkfd/kfd_priv.h| 8 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 4 +++- 3 files changed, 11 insertions(+), 9 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c index 91f82a9ccdaf..459f59e3d0ed 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c @@ -1128,14 +1128,6 @@ static int kfd_ioctl_free_memory_of_gpu(struct file *filep, return ret; } -static bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev) -{ - return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) || - (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) && - dev->adev->sdma.instance[0].fw_version >= 18) || - KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0); -} - static int kfd_ioctl_map_memory_to_gpu(struct file *filep, struct kfd_process *p, void *data) { diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h index 8a43def1f638..aff6f598ff2c 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h @@ -1328,6 +1328,14 @@ void kfd_signal_poison_consumed_event(struct kfd_dev *dev, u32 pasid); void kfd_flush_tlb(struct kfd_process_device *pdd, enum TLB_FLUSH_TYPE type); +static inline bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev) +{ + return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) || + (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) && + dev->adev->sdma.instance[0].fw_version >= 18) || + KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0); +} + bool kfd_is_locked(void); /* Compute profile */ diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 459fa07a3bcc..5afe216cf099 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -1229,7 +1229,9 @@ svm_range_unmap_from_gpus(struct svm_range *prange, unsigned long start, if (r) break; } - kfd_flush_tlb(pdd, 
TLB_FLUSH_HEAVYWEIGHT); + + if (kfd_flush_tlb_after_unmap(pdd->dev)) + kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT); } return r;
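The gating helper discussed in this thread can be sketched in plain userspace C. This is a hypothetical stand-in (the `dev_info` struct and `flush_tlb_after_unmap` name are illustrative, not the real `struct kfd_dev` API); the IP-version values are taken from the quoted patch: heavy-weight flush only on GC 9.4.0, 9.4.2, and 9.4.1 with SDMA firmware 18 or newer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the quoted kfd_flush_tlb_after_unmap() helper.
 * IP_VERSION packs major/minor/revision into one comparable integer,
 * mirroring the kernel macro's intent. */
#define IP_VERSION(maj, min, rev) (((maj) << 16) | ((min) << 8) | (rev))

struct dev_info {
    unsigned int gc_version;       /* stand-in for KFD_GC_VERSION(dev) */
    unsigned int sdma0_fw_version; /* stand-in for sdma.instance[0].fw_version */
};

static bool flush_tlb_after_unmap(const struct dev_info *dev)
{
    return dev->gc_version == IP_VERSION(9, 4, 2) ||
           (dev->gc_version == IP_VERSION(9, 4, 1) &&
            dev->sdma0_fw_version >= 18) ||
           dev->gc_version == IP_VERSION(9, 4, 0);
}
```

With this gate in place, `svm_range_unmap_from_gpus()` only issues the heavy-weight flush on the ASICs known to need (and support) it, instead of unconditionally.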
Re: [PATCH] drm/amd/amdgpu: Remove static from variable in RLCG Reg RW.
Dear Gavin, Thank you for your patch. Am 13.04.22 um 17:26 schrieb Gavin Wan: Should you re-roll your patch (v2), please remove the dot/period from the end of the git commit message summary (subject). [why] These static variables saves the RLC Scratch registers address. s/saves/save/ When we installed multiple GPUs (for example: XGMI setting) and s/installed/install/ multiple GPUs call the function at same time. The RLC Scratch … same time, the RLC … registers address are changed each other. Then it caused s/caused/causes/ reading/writing to wrong GPU. I see from other patches posted here, that [why] is put on a separate line, so you do not need to indent the text. [why] These static … [fix] Removed the static from the variables. The variables are in stack. Same here, though *how* instead of *fix* seems more common. s/Removed/Remove/ s/in stack/on the stack/ Signed-off-by: Gavin Wan Change-Id: Iee78849291d4f7a9688ecc5165bec70ee85cdfbe Without the Gerrit URL that line is useless. Kind regards. Paul --- drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c index d5eea031c3e3..d18a05a20566 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c @@ -868,11 +868,11 @@ static u32 amdgpu_virt_rlcg_reg_rw(struct amdgpu_device *adev, u32 offset, u32 v uint32_t timeout = 5; uint32_t i, tmp; uint32_t ret = 0; - static void *scratch_reg0; - static void *scratch_reg1; - static void *scratch_reg2; - static void *scratch_reg3; - static void *spare_int; + void *scratch_reg0; + void *scratch_reg1; + void *scratch_reg2; + void *scratch_reg3; + void *spare_int; if (!adev->gfx.rlc.rlcg_reg_access_supported) { dev_err(adev->dev,
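The bug the patch above fixes can be illustrated with a small userspace sketch. This is not the real amdgpu code: the two-phase access functions below are a hypothetical model that makes the interleaving explicit. A function-local `static` acts like one global slot shared by every GPU calling the function, so when a second device starts an access mid-way through the first device's access, the first device ends up using the second device's register base; an automatic (stack) variable is private to each call and avoids this.

```c
#include <assert.h>

/* Shared slot, standing in for `static void *scratch_reg0` in the
 * original amdgpu_virt_rlcg_reg_rw(). */
static int shared_scratch_base;

/* Phase 1 of an access: record which device's register base to use.
 * Any other device calling this overwrites the shared slot. */
static void broken_begin_access(int dev_base)
{
    shared_scratch_base = dev_base;
}

/* Phase 2: compute the register address from the (possibly clobbered)
 * shared slot. */
static int broken_finish_access(int offset)
{
    return shared_scratch_base + offset;
}

/* The fixed pattern: the base lives on the stack, per call, per device. */
static int fixed_access(int dev_base, int offset)
{
    int base = dev_base;
    return base + offset;
}
```

Running device 1's access between device 0's two phases shows device 0 reading through device 1's base, which is exactly the "reading/writing to wrong GPU" failure the commit message describes.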
Re: [PATCH] drm/amdkfd: only allow heavy-weight TLB flush on some ASICs for SVM too
Dear Lang, Thank you for the patch. Am 14.04.22 um 10:19 schrieb Lang Yu: The idea is from commit a50fe7078035 ("drm/amdkfd: Only apply heavy-weight TLB flush on Aldebaran") and commit f61c40c0757a ("drm/amdkfd: enable heavy-weight TLB flush on Arcturus"). Otherwise, we will run into problems on some ASICs when running SVM applications. Please list the ASICs, you know of having problems, and even add how to reproduce this. Kind regards, Paul Signed-off-by: Lang Yu --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 8 drivers/gpu/drm/amd/amdkfd/kfd_priv.h| 8 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 4 +++- 3 files changed, 11 insertions(+), 9 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c index 91f82a9ccdaf..459f59e3d0ed 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c @@ -1128,14 +1128,6 @@ static int kfd_ioctl_free_memory_of_gpu(struct file *filep, return ret; } -static bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev) -{ - return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) || - (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) && - dev->adev->sdma.instance[0].fw_version >= 18) || - KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0); -} - static int kfd_ioctl_map_memory_to_gpu(struct file *filep, struct kfd_process *p, void *data) { diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h index 8a43def1f638..aff6f598ff2c 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h @@ -1328,6 +1328,14 @@ void kfd_signal_poison_consumed_event(struct kfd_dev *dev, u32 pasid); void kfd_flush_tlb(struct kfd_process_device *pdd, enum TLB_FLUSH_TYPE type); +static inline bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev) +{ + return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) || + (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) && + dev->adev->sdma.instance[0].fw_version >= 18) || + KFD_GC_VERSION(dev) == 
IP_VERSION(9, 4, 0); +} + bool kfd_is_locked(void); /* Compute profile */ diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 459fa07a3bcc..5afe216cf099 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -1229,7 +1229,9 @@ svm_range_unmap_from_gpus(struct svm_range *prange, unsigned long start, if (r) break; } - kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT); + + if (kfd_flush_tlb_after_unmap(pdd->dev)) + kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT); } return r;
Re: [PATCHv4] drm/amdgpu: disable ASPM on Intel Alder Lake based systems
[Cc: -kernel test robot ] Dear Alex, dear Richard, Am 13.04.22 um 15:00 schrieb Alex Deucher: On Wed, Apr 13, 2022 at 3:43 AM Paul Menzel wrote: Thank you for sending out v4. Am 12.04.22 um 23:50 schrieb Richard Gong: Active State Power Management (ASPM) feature is enabled since kernel 5.14. There are some AMD GFX cards (such as WX3200 and RX640) that won't work with ASPM-enabled Intel Alder Lake based systems. Using these GFX cards as video/display output, Intel Alder Lake based systems will hang during suspend/resume. I am still not clear, what “hang during suspend/resume” means. I guess suspending works fine? During resume (S3 or S0ix?), where does it hang? The system is functional, but there are only display problems? The issue was initially reported on one system (Dell Precision 3660 with BIOS version 0.14.81), but was later confirmed to affect at least 4 Alder Lake based systems. Add extra check to disable ASPM on Intel Alder Lake based systems. Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default") Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1885 Reported-by: kernel test robot This tag is a little confusing. Maybe clarify that it was for an issue in a previous patch iteration? 
Signed-off-by: Richard Gong --- v4: s/CONFIG_X86_64/CONFIG_X86 enhanced check logic v3: s/intel_core_asom_chk/aspm_support_quirk_check correct build error with W=1 option v2: correct commit description move the check from chip family to problematic platform --- drivers/gpu/drm/amd/amdgpu/vi.c | 17 - 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c index 039b90cdc3bc..b33e0a9bee65 100644 --- a/drivers/gpu/drm/amd/amdgpu/vi.c +++ b/drivers/gpu/drm/amd/amdgpu/vi.c @@ -81,6 +81,10 @@ #include "mxgpu_vi.h" #include "amdgpu_dm.h" +#if IS_ENABLED(CONFIG_X86) +#include +#endif + #define ixPCIE_LC_L1_PM_SUBSTATE0x100100C6 #define PCIE_LC_L1_PM_SUBSTATE__LC_L1_SUBSTATES_OVERRIDE_EN_MASK 0x0001L #define PCIE_LC_L1_PM_SUBSTATE__LC_PCI_PM_L1_2_OVERRIDE_MASK 0x0002L @@ -1134,13 +1138,24 @@ static void vi_enable_aspm(struct amdgpu_device *adev) WREG32_PCIE(ixPCIE_LC_CNTL, data); } +static bool aspm_support_quirk_check(void) +{ + if (IS_ENABLED(CONFIG_X86)) { + struct cpuinfo_x86 *c = _data(0); + + return !(c->x86 == 6 && c->x86_model == INTEL_FAM6_ALDERLAKE); + } + + return true; +} + static void vi_program_aspm(struct amdgpu_device *adev) { u32 data, data1, orig; bool bL1SS = false; bool bClkReqSupport = true; - if (!amdgpu_device_should_use_aspm(adev)) + if (!amdgpu_device_should_use_aspm(adev) || !aspm_support_quirk_check()) return; Can users still forcefully enable ASPM with the parameter `amdgpu.aspm`? if (adev->flags & AMD_IS_APU || If I remember correctly, there were also newer cards, where ASPM worked with Intel Alder Lake, right? Can only the problematic generations for WX3200 and RX640 be excluded from ASPM? This patch only disables it for the generation that was problematic. Could that please be made clear in the commit message summary, and message? Loosely related, is there a public (or internal issue) to analyze how to get ASPM working for VI generation devices with Intel Alder Lake? 
Kind regards, Paul
Re: [PATCH v2] drm/amdgpu: Fix one use-after-free of VM
Dear Xinhui, Am 14.04.22 um 07:45 schrieb Paul Menzel: Thank you for rerolling the patch. Am 14.04.22 um 07:03 schrieb xinhui pan: VM might already be freed when amdgpu_vm_tlb_seq_cb() is called. We see the calltrace below. Fix it by keeping the last flush fence around and wait for it to signal Nit: Please add a dot/period to the end of sentences. BUG kmalloc-4k (Not tainted): Poison overwritten 0x9c88630414e8-0x9c88630414e8 @offset=5352. First byte 0x6c instead of 0x6b Allocated in amdgpu_driver_open_kms+0x9d/0x360 [amdgpu] age=44 cpu=0 pid=2343 __slab_alloc.isra.0+0x4f/0x90 kmem_cache_alloc_trace+0x6b8/0x7a0 amdgpu_driver_open_kms+0x9d/0x360 [amdgpu] drm_file_alloc+0x222/0x3e0 [drm] drm_open+0x11d/0x410 [drm] Freed in amdgpu_driver_postclose_kms+0x3e9/0x550 [amdgpu] age=22 cpu=1 pid=2485 kfree+0x4a2/0x580 amdgpu_driver_postclose_kms+0x3e9/0x550 [amdgpu] drm_file_free+0x24e/0x3c0 [drm] drm_close_helper.isra.0+0x90/0xb0 [drm] drm_release+0x97/0x1a0 [drm] __fput+0xb6/0x280 fput+0xe/0x10 task_work_run+0x64/0xb0 The v2 annotation is missing. 
Suggested-by: Christian König Signed-off-by: xinhui pan --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 17 ++--- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 1 + 2 files changed, 15 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 645ce28277c2..cd5aa7edd451 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -932,9 +932,12 @@ int amdgpu_vm_update_range(struct amdgpu_device *adev, struct amdgpu_vm *vm, if (flush_tlb || params.table_freed) { tlb_cb->vm = vm; - if (!fence || !*fence || - dma_fence_add_callback(*fence, _cb->cb, - amdgpu_vm_tlb_seq_cb)) + if (fence && *fence && + !dma_fence_add_callback(*fence, _cb->cb, + amdgpu_vm_tlb_seq_cb)) { + dma_fence_put(vm->last_tlb_flush); + vm->last_tlb_flush = dma_fence_get(*fence); + } else amdgpu_vm_tlb_seq_cb(NULL, _cb->cb); The Linux kernel coding style uses braces for all branches of a conditional statement, if one branch uses braces. [1] By the way `scripts/checkpatch.pl --strict …` would also have found this: ``` $ scripts/checkpatch.pl --strict 0001-drm-amdgpu-Fix-one-use-after-free-of-VM.patch CHECK: braces {} should be used on all arms of this statement #53: FILE: drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:935: + if (fence && *fence && [...] + } else [...] CHECK: Unbalanced braces around else statement #58: FILE: drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:940: + } else total: 0 errors, 0 warnings, 2 checks, 54 lines checked NOTE: For some of the reported defects, checkpatch may be able to mechanically convert to the typical style using --fix or --fix-inplace. 0001-drm-amdgpu-Fix-one-use-after-free-of-VM.patch has style problems, please review. 
``` Kind regards, Paul tlb_cb = NULL; } @@ -2094,6 +2097,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm) vm->update_funcs = _vm_sdma_funcs; vm->last_update = NULL; vm->last_unlocked = dma_fence_get_stub(); + vm->last_tlb_flush = dma_fence_get_stub(); mutex_init(>eviction_lock); vm->evicting = false; @@ -2132,6 +2136,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm) vm->root.bo = NULL; error_free_delayed: + dma_fence_put(vm->last_tlb_flush); dma_fence_put(vm->last_unlocked); drm_sched_entity_destroy(>delayed); @@ -2248,6 +2253,7 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm) struct amdgpu_bo_va_mapping *mapping, *tmp; bool prt_fini_needed = !!adev->gmc.gmc_funcs->set_prt; struct amdgpu_bo *root; + unsigned long flags; int i; amdgpu_amdkfd_gpuvm_destroy_cb(adev, vm); @@ -2257,6 +2263,11 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm) amdgpu_vm_set_pasid(adev, vm, 0); dma_fence_wait(vm->last_unlocked, false); dma_fence_put(vm->last_unlocked); + dma_fence_wait(vm->last_tlb_flush, false); + /* Make sure that all fence callbacks have completed */ + spin_lock_irqsave(vm->last_tlb_flush->lock, flags); + spin_unlock_irqrestore(vm->last_tlb_flush->lock, flags); + dma_fence_put(vm->last_tlb_flush); list_for_each_entry_safe(mapping, tmp, >freed, list) { if (mapping->flags & AMDGPU_PTE_PRT && prt_fini_needed) { diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h index 1a814fb8..6b06a214f05f 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h +++ b/drivers
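The checkpatch complaint quoted above boils down to one kernel coding-style rule: if one arm of an if/else needs braces, every arm gets braces. A minimal generic illustration (made-up function, not from the patch):

```c
#include <assert.h>

/* Kernel style: the two-statement arm forces braces, so the
 * single-statement else arm is braced too. */
static int abs_diff(int a, int b, int *swaps)
{
    int d;

    if (a < b) {
        (*swaps)++;        /* two statements here force braces... */
        d = b - a;
    } else {
        d = a - b;         /* ...so this arm gets matching braces */
    }
    return d;
}
```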
Re: [PATCH v2] drm/amdgpu: Fix one use-after-free of VM
Dear Xinhui, Thank you for rerolling the patch. Am 14.04.22 um 07:03 schrieb xinhui pan: VM might already be freed when amdgpu_vm_tlb_seq_cb() is called. We see the calltrace below. Fix it by keeping the last flush fence around and wait for it to signal Nit: Please add a dot/period to the end of sentences. BUG kmalloc-4k (Not tainted): Poison overwritten 0x9c88630414e8-0x9c88630414e8 @offset=5352. First byte 0x6c instead of 0x6b Allocated in amdgpu_driver_open_kms+0x9d/0x360 [amdgpu] age=44 cpu=0 pid=2343 __slab_alloc.isra.0+0x4f/0x90 kmem_cache_alloc_trace+0x6b8/0x7a0 amdgpu_driver_open_kms+0x9d/0x360 [amdgpu] drm_file_alloc+0x222/0x3e0 [drm] drm_open+0x11d/0x410 [drm] Freed in amdgpu_driver_postclose_kms+0x3e9/0x550 [amdgpu] age=22 cpu=1 pid=2485 kfree+0x4a2/0x580 amdgpu_driver_postclose_kms+0x3e9/0x550 [amdgpu] drm_file_free+0x24e/0x3c0 [drm] drm_close_helper.isra.0+0x90/0xb0 [drm] drm_release+0x97/0x1a0 [drm] __fput+0xb6/0x280 fput+0xe/0x10 task_work_run+0x64/0xb0 The v2 annotation is missing. 
Suggested-by: Christian König Signed-off-by: xinhui pan --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 17 ++--- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 1 + 2 files changed, 15 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 645ce28277c2..cd5aa7edd451 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -932,9 +932,12 @@ int amdgpu_vm_update_range(struct amdgpu_device *adev, struct amdgpu_vm *vm, if (flush_tlb || params.table_freed) { tlb_cb->vm = vm; - if (!fence || !*fence || - dma_fence_add_callback(*fence, _cb->cb, - amdgpu_vm_tlb_seq_cb)) + if (fence && *fence && + !dma_fence_add_callback(*fence, _cb->cb, + amdgpu_vm_tlb_seq_cb)) { + dma_fence_put(vm->last_tlb_flush); + vm->last_tlb_flush = dma_fence_get(*fence); + } else amdgpu_vm_tlb_seq_cb(NULL, _cb->cb); The Linux kernel coding style uses braces for all branches of a conditional statement, if one branch uses braces. 
[1] Kind regards, Paul tlb_cb = NULL; } @@ -2094,6 +2097,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm) vm->update_funcs = _vm_sdma_funcs; vm->last_update = NULL; vm->last_unlocked = dma_fence_get_stub(); + vm->last_tlb_flush = dma_fence_get_stub(); mutex_init(>eviction_lock); vm->evicting = false; @@ -2132,6 +2136,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm) vm->root.bo = NULL; error_free_delayed: + dma_fence_put(vm->last_tlb_flush); dma_fence_put(vm->last_unlocked); drm_sched_entity_destroy(>delayed); @@ -2248,6 +2253,7 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm) struct amdgpu_bo_va_mapping *mapping, *tmp; bool prt_fini_needed = !!adev->gmc.gmc_funcs->set_prt; struct amdgpu_bo *root; + unsigned long flags; int i; amdgpu_amdkfd_gpuvm_destroy_cb(adev, vm); @@ -2257,6 +2263,11 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm) amdgpu_vm_set_pasid(adev, vm, 0); dma_fence_wait(vm->last_unlocked, false); dma_fence_put(vm->last_unlocked); + dma_fence_wait(vm->last_tlb_flush, false); + /* Make sure that all fence callbacks have completed */ + spin_lock_irqsave(vm->last_tlb_flush->lock, flags); + spin_unlock_irqrestore(vm->last_tlb_flush->lock, flags); + dma_fence_put(vm->last_tlb_flush); list_for_each_entry_safe(mapping, tmp, >freed, list) { if (mapping->flags & AMDGPU_PTE_PRT && prt_fini_needed) { diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h index 1a814fb8..6b06a214f05f 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h @@ -286,6 +286,7 @@ struct amdgpu_vm { /* Last finished delayed update */ atomic64_t tlb_seq; + struct dma_fence*last_tlb_flush; /* Last unlocked submission to the scheduler entities */ struct dma_fence*last_unlocked; [1]: https://www.kernel.org/doc/html/latest/process/coding-style.html#placing-braces-and-spaces
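The core idea of the fix can be modelled in userspace. This is a hypothetical sketch, not the real dma_fence API: a reference-counted "fence" stands in for `struct dma_fence`, and the VM holds a counted reference to the *last* flush fence so teardown can wait on it before freeing, which is what prevents a late TLB-flush callback from touching freed VM memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct fence {
    int refcount;
    bool signaled;
};

static int fences_freed;   /* test aid: counts destroyed fences */

static struct fence *fence_get(struct fence *f) { f->refcount++; return f; }

static void fence_put(struct fence *f)
{
    if (--f->refcount == 0) {
        fences_freed++;
        free(f);
    }
}

static struct fence *fence_new(void)
{
    struct fence *f = calloc(1, sizeof(*f));
    f->refcount = 1;
    return f;
}

struct vm { struct fence *last_tlb_flush; };

/* Mirrors the hunk in amdgpu_vm_update_range(): drop the previous fence
 * and keep a reference to the newest flush fence. */
static void vm_record_flush(struct vm *vm, struct fence *f)
{
    fence_put(vm->last_tlb_flush);
    vm->last_tlb_flush = fence_get(f);
}

/* Mirrors amdgpu_vm_fini(): wait for the last flush, then drop the
 * final reference. The assert stands in for dma_fence_wait(). */
static void vm_fini(struct vm *vm)
{
    assert(vm->last_tlb_flush->signaled);
    fence_put(vm->last_tlb_flush);
}
```

Because the VM's own reference keeps the fence alive until `vm_fini()`, the submitter can drop its reference earlier without the fence (or the callback hanging off it) outliving the VM.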
Re: [PATCHv4] drm/amdgpu: disable ASPM on Intel Alder Lake based systems
Dear Richard, Thank you for sending out v4. Am 12.04.22 um 23:50 schrieb Richard Gong: Active State Power Management (ASPM) feature is enabled since kernel 5.14. There are some AMD GFX cards (such as WX3200 and RX640) that won't work with ASPM-enabled Intel Alder Lake based systems. Using these GFX cards as video/display output, Intel Alder Lake based systems will hang during suspend/resume. I am still not clear, what “hang during suspend/resume” means. I guess suspending works fine? During resume (S3 or S0ix?), where does it hang? The system is functional, but there are only display problems? The issue was initially reported on one system (Dell Precision 3660 with BIOS version 0.14.81), but was later confirmed to affect at least 4 Alder Lake based systems. Add extra check to disable ASPM on Intel Alder Lake based systems. Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default") Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1885 Reported-by: kernel test robot This tag is a little confusing. Maybe clarify that it was for an issue in a previous patch iteration? 
Signed-off-by: Richard Gong --- v4: s/CONFIG_X86_64/CONFIG_X86 enhanced check logic v3: s/intel_core_asom_chk/aspm_support_quirk_check correct build error with W=1 option v2: correct commit description move the check from chip family to problematic platform --- drivers/gpu/drm/amd/amdgpu/vi.c | 17 - 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c index 039b90cdc3bc..b33e0a9bee65 100644 --- a/drivers/gpu/drm/amd/amdgpu/vi.c +++ b/drivers/gpu/drm/amd/amdgpu/vi.c @@ -81,6 +81,10 @@ #include "mxgpu_vi.h" #include "amdgpu_dm.h" +#if IS_ENABLED(CONFIG_X86) +#include +#endif + #define ixPCIE_LC_L1_PM_SUBSTATE 0x100100C6 #define PCIE_LC_L1_PM_SUBSTATE__LC_L1_SUBSTATES_OVERRIDE_EN_MASK 0x0001L #define PCIE_LC_L1_PM_SUBSTATE__LC_PCI_PM_L1_2_OVERRIDE_MASK 0x0002L @@ -1134,13 +1138,24 @@ static void vi_enable_aspm(struct amdgpu_device *adev) WREG32_PCIE(ixPCIE_LC_CNTL, data); } +static bool aspm_support_quirk_check(void) +{ + if (IS_ENABLED(CONFIG_X86)) { + struct cpuinfo_x86 *c = _data(0); + + return !(c->x86 == 6 && c->x86_model == INTEL_FAM6_ALDERLAKE); + } + + return true; +} + static void vi_program_aspm(struct amdgpu_device *adev) { u32 data, data1, orig; bool bL1SS = false; bool bClkReqSupport = true; - if (!amdgpu_device_should_use_aspm(adev)) + if (!amdgpu_device_should_use_aspm(adev) || !aspm_support_quirk_check()) return; Can users still forcefully enable ASPM with the parameter `amdgpu.aspm`? if (adev->flags & AMD_IS_APU || If I remember correctly, there were also newer cards, where ASPM worked with Intel Alder Lake, right? Can only the problematic generations for WX3200 and RX640 be excluded from ASPM? Kind regards, Paul
Re: [PATCH v2] drm/amdgpu: Make sure ttm delayed work finished
Dear xinhui, Am 13.04.22 um 08:46 schrieb xinhui pan: ttm_device_delayed_workqueue would reschedule itself if there is pending BO to be destroyed. So just one flush + cancel_sync is not enough. We still see lru_list not empty warnging. warning (`scripts/checkpatch.pl --codespell` should have found this.) Fix it by waiting all BO to be destroyed. Please explain the waiting algorithm. Why at least one millisecond? Do you have a reproducer for this issue? Acked-by: Guchun Chen Signed-off-by: xinhui pan --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 6f47726f1765..56dcf8c3a3cd 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3957,11 +3957,17 @@ static void amdgpu_device_unmap_mmio(struct amdgpu_device *adev) */ void amdgpu_device_fini_hw(struct amdgpu_device *adev) { + int pending = 1; Could this be bool? + dev_info(adev->dev, "amdgpu: finishing device.\n"); flush_delayed_work(&adev->delayed_init_work); - if (adev->mman.initialized) { + while (adev->mman.initialized && pending) { flush_delayed_work(&adev->mman.bdev.wq); - ttm_bo_lock_delayed_workqueue(&adev->mman.bdev); + pending = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev); + if (pending) { + ttm_bo_unlock_delayed_workqueue(&adev->mman.bdev, true); Does this call affect `adev->mman.initialized`? If not, a do-while loop might be better suited, so pending is only checked once. if (adev->mman.initialized) { do { flush_delayed_work(&adev->mman.bdev.wq); ttm_bo_unlock_delayed_workqueue(&adev->mman.bdev, true); msleep(((HZ / 100) < 1) ? 1 : HZ / 100); } while (ttm_bo_lock_delayed_workqueue(&adev->mman.bdev)); } The logic is slightly different, as `ttm_bo_unlock_delayed_workqueue(&adev->mman.bdev, true);` and the sleep are run at least once, so the suggestion might be unsuitable. + msleep(((HZ / 100) < 1) ? 1 : HZ / 100); Maybe add a comment for that formula? (No idea if it is common knowledge, but why is a delay needed, and why can the loop not run as fast as possible?) + } } adev->shutdown = true; Kind regards, Paul
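The drain loop and the `HZ` formula questioned above can be modelled in userspace C. This is a hypothetical stand-in (no real TTM calls): the helper simulates `flush_delayed_work()` + `ttm_bo_lock_delayed_workqueue()` by destroying one pending BO per poll, and `poll_delay_jiffies()` is the formula from the patch, which sleeps roughly 10 ms (HZ/100 jiffies) but clamps to at least one jiffy, which matters on kernels built with HZ < 100.

```c
#include <assert.h>

#define HZ 250   /* assumed build-time tick rate, for illustration */

/* The formula from the patch: ~10 ms, but never less than one jiffy. */
static int poll_delay_jiffies(void)
{
    return (HZ / 100) < 1 ? 1 : HZ / 100;
}

static int pending_bos = 3;   /* pretend three BOs still await destruction */

/* Stand-in for one flush pass over the delayed workqueue: each pass
 * destroys one BO and reports how many remain. */
static int flush_and_count_pending(void)
{
    if (pending_bos > 0)
        pending_bos--;
    return pending_bos;
}

/* The do-while shape suggested in the review: poll until nothing is
 * pending; real code would msleep(poll_delay_jiffies()) each iteration. */
static int drain(void)
{
    int polls = 0;

    do {
        polls++;
    } while (flush_and_count_pending() > 0);
    return polls;
}
```

With HZ = 250 the clamp is a no-op (HZ/100 = 2 jiffies); only with HZ below 100 does the `? 1 :` branch kick in, which is presumably why Paul asks for a comment on it.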
Re: Vega 56 failing to process EDID from VR Headset
Dear James, Am 13.04.22 um 00:13 schrieb James Dutton: On Tue, 12 Apr 2022 at 07:13, Paul Menzel wrote: Am 11.04.22 um 23:39 schrieb James Dutton: So, did you do any changes to Linux? Why do you think the EDID is at fault? […] I suggest to analyze, why `No DP link bandwidth` is logged. The macro is `DC_NO_DP_LINK_BANDWIDTH`, and you should first check why `dp_validate_mode_timing()` in `drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c` returns false. PS: Using the issue tracker [1] might make it easier to keep track of this problem, and also to attach all the necessary information. [1]: https://gitlab.freedesktop.org/drm/amd/-/issues/ I will do some more investigation. In addition to it not processing the EDID particularly well... Since my email, I have found out that it is failing to complete CR (Clock Recovery) on Link 0,2, but it works on 1,3 at HBR2. All 4 Links work at HBR1. (I need the HBR2 working) The CR negotiation in the code looks a bit wrong to me, so I will look into that a bit more. Looking at the current amdgpu source code (I am using Mainline kernel version 5.17.1), it appears to retry CR negotiation, but each time it uses the same settings, rather than try different driver parameters, as recommended in the DP standards and compliance test documents. […] Awesome, that you review the code with your expertise. Though I suggest to look at and work on agd5f/amd-staging-drm-next [1], having the latest code for the driver. Once I know more, I will put all the info in the issue track, as you suggest. I am looking forward to it. To not get lost in all the problems, one email or issue per problem might be a good way forward, and adding people adding the code (`git blame -w`) to Cc might also help. Happy debugging and hacking! Kind regards, Paul [1]: https://gitlab.freedesktop.org/agd5f/linux/
Re: [PATCH 1/2] drm/amd/amdgpu: Update PF2VF header
[Removed unintended paste in second line] Am 13.04.22 um 09:03 schrieb Paul Menzel: Dear Bokun, Thank you for rerolling the patch. Please add the iteration/version in the subject next time `[PATCH v2 1/2]` or so. Am 12.04.22 um 23:31 schrieb Bokun Zhang: - Add proper indentation in the header file Please use that as the commit message summary, and the subject/summary as the title for the cover letter? Maybe: drm/amd/amdgpu: Properly indent PF2VF header Signed-off-by: Bokun Zhang --- drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h | 95 ++--- 1 file changed, 46 insertions(+), 49 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h b/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h index 7326b6c1b71c..65433cbb00c5 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h @@ -1,34 +1,33 @@ /* - * Copyright 2018-2019 Advanced Micro Devices, Inc. + * Copyright (c) 2018-2021 Advanced Micro Devices, Inc. All rights reserved. * - * Permission is hereby granted, free of charge, to any person obtaining a - * copy of this software and associated documentation files (the "Software"), - * to deal in the Software without restriction, including without limitation - * the rights to use, copy, modify, merge, publish, distribute, sublicense, - * and/or sell copies of the Software, and to permit persons to whom the - * Software is furnished to do so, subject to the following conditions: + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the "Software"), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: * * The above copyright notice and this permission notice shall be included in * all copies or 
substantial portions of the Software. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, - * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL - * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR - * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, - * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR - * OTHER DEALINGS IN THE SOFTWARE. - * + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. */ The header change seems unrelated from indentation? […] Kind regards, Paul
Re: [PATCH 1/2] drm/amd/amdgpu: Update PF2VF header
Dear Bokun, drm/amd/amdgpu: Update PF2VF header Thank you for rerolling the patch. Please add the iteration/version in the subject next time `[PATCH v2 1/2]` or so. Am 12.04.22 um 23:31 schrieb Bokun Zhang: - Add proper indentation in the header file Please use that as the commit message summary, and the subject/summary as the title for the cover letter? Maybe: drm/amd/amdgpu: Properly indent PF2VF header Signed-off-by: Bokun Zhang --- drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h | 95 ++--- 1 file changed, 46 insertions(+), 49 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h b/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h index 7326b6c1b71c..65433cbb00c5 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h @@ -1,34 +1,33 @@ /* - * Copyright 2018-2019 Advanced Micro Devices, Inc. + * Copyright (c) 2018-2021 Advanced Micro Devices, Inc. All rights reserved. * - * Permission is hereby granted, free of charge, to any person obtaining a - * copy of this software and associated documentation files (the "Software"), - * to deal in the Software without restriction, including without limitation - * the rights to use, copy, modify, merge, publish, distribute, sublicense, - * and/or sell copies of the Software, and to permit persons to whom the - * Software is furnished to do so, subject to the following conditions: + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the "Software"), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: * * The above copyright notice and this permission notice shall be included in * all copies or substantial portions of the Software. 
* * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, - * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL - * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR - * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, - * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR - * OTHER DEALINGS IN THE SOFTWARE. - * + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. */ The header change seems unrelated from indentation? […] Kind regards, Paul
Re: [PATCH] Revert "drm/amd/display: Pass HostVM enable flag into DCN3.1 DML"
Dear Rodrigo, Thank you for the patch. Am 12.04.22 um 20:44 schrieb Rodrigo Siqueira: This reverts commit 367b3e934f578f6c0d5d8ca5987dc6ac4cd6831d. While we were testing DCN3.1 with a hub, we noticed that only one of 2 connected displays lights up when using some specific display resolution. After bisecting this issue, we figured out the commit mentioned above introduced this issue. We are investigating why this patch introduced this regression, but we need to revert it for now. Thank you for bisecting this. It’d be great if you added the information about the test setup: graphics card, displays (model and connected how), model of the hub, and the resolution causing issues. Kind regards, Paul Cc: Harry Wentland Cc: Mark Broadworth Cc: Michael Strauss Signed-off-by: Rodrigo Siqueira --- drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c index 6cc580be7c79..5b3f0c2dfb55 100644 --- a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c +++ b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c @@ -1668,7 +1668,6 @@ int dcn31_populate_dml_pipes_from_context( pipes[pipe_cnt].pipe.src.immediate_flip = true; pipes[pipe_cnt].pipe.src.unbounded_req_mode = false; - pipes[pipe_cnt].pipe.src.hostvm = dc->res_pool->hubbub->riommu_active; pipes[pipe_cnt].pipe.src.gpuvm = true; pipes[pipe_cnt].pipe.src.dcc_fraction_of_zs_req_luma = 0; pipes[pipe_cnt].pipe.src.dcc_fraction_of_zs_req_chroma = 0;
Re: AMDGPU: RX 6500 XT: System reset when loading module [SOLVED]
Dear Cal, Am 12.04.22 um 00:31 schrieb Cal Peake: I wanted to put a capper on this just in case anyone was interested, or in case any future people find this thread, because I did find a resolution. Yes, that is very much appreciated. Turns out the way to stop the system from crashing was to disable Global C-State Control in the BIOS. Christian, Alex, you guys seem to have been on the right track in that it was something power related. I haven't yet been able to figure out what Global C-State Control exactly does. My best guess as to what was happening: once the GPU power management code was loaded and the GPU dropped into a very low power state, the CPU saw this and decided to match it, lowering its own power state to such a point that it hard resets the system. (Just a wild theory anyway :-) If anyone knows what this feature really does, or has any better theories as to why it doesn't play nice with AMD GPUs, please do share! It might be related to bug report *Random Soft Lockup on new Ryzen build* [1], and the referenced issues there. It’d be great if you could post here or there a summary with your system details (especially the system and GPU firmware versions). I am still a little upset that AMD has until now not been able to properly analyze and fix this (with the ODMs). Kind regards, Paul [1]: https://bugzilla.kernel.org/show_bug.cgi?id=196683
Re: [PATCH ] drm/amdgpu: fix discovery ip failed
Dear Jie, Am 12.04.22 um 03:52 schrieb Zhang, Jesse (Jie): [AMD Official Use Only] Thanks Paul Menzel Attach the patch file. Please do not attach patch files, as it breaks workflows and inline commenting. Please use `git send-email` or equivalent. Your colleagues should be able to help you with this. […] Kind regards, Paul
Re: [PATCH] drm/amd/amdgpu: Not request init data for MS_HYPERV with vega10
[Cc: +x86 folks] Dear Alex, dear x86 folks, x86 folks, can you think of alternatives to access `X86_HYPER_MS_HYPERV` from `arch/x86/include/asm/hypervisor.h` without any preprocessor ifdef-ery? Am 11.04.22 um 18:28 schrieb Alex Deucher: On Mon, Apr 11, 2022 at 11:28 AM Paul Menzel wrote: […] Am 11.04.22 um 15:59 schrieb Yongqiang Sun: MS_HYPERV with vega10 doesn't have the interface to process request init data msg. Should some Hyper-V folks be added to the reviewers list too? Check hypervisor type to not send the request for MS_HYPERV. Please add a blank line between paragraphs. Signed-off-by: Yongqiang Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c index 933c41f77c92..56b130ec44a9 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c @@ -23,6 +23,10 @@ #include +#ifdef CONFIG_X86 +#include +#endif + #include #include "amdgpu.h" @@ -721,8 +725,12 @@ void amdgpu_detect_virtualization(struct amdgpu_device *adev) break; case CHIP_VEGA10: soc15_set_virt_ops(adev); - /* send a dummy GPU_INIT_DATA request to host on vega10 */ - amdgpu_virt_request_init_data(adev); +#ifdef CONFIG_X86 + /* not send GPU_INIT_DATA with MS_HYPERV*/ + if (hypervisor_is_type(X86_HYPER_MS_HYPERV) == false) +#endif Why guard everything with CONFIG_X86? (If it’s needed, it should be done in C code.) X86_HYPER_MS_HYPERV only available on x86. Sorry, I missed the X86 dependency when quickly looking at the Hyper-V stub IOMMU driver `drivers/iommu/hyperv-iommu.c`, but missed that `HYPERV_IOMMU` has `depends on HYPERV && X86`. Kind regards, Paul + /* send a dummy GPU_INIT_DATA request to host on vega10 */ + amdgpu_virt_request_init_data(adev); break; case CHIP_VEGA20: case CHIP_ARCTURUS: Kind regards, Paul
AMD Display Core (DC) patches (was: [PATCH 13/16] drm/amd/display: Revert FEC check in validation)
[Cc: +dri-de...@lists.freedesktop.org, +Daniel Vetter, +Alexander Deucher, +Greg KH] Dear Alex, I am a little confused and upset about how Display Core patches are handled in the Linux kernel. Am 25.03.22 um 23:53 schrieb Alex Hung: From: Martin Leung git puts a line “This reverts commit …” into the commit message when something is reverted. Why isn’t this here? Right now, commit 7d56a154e22f, reverted here, is proposed for the stable series. I guess it is because these indicators and metadata are missing. why and how: causes failure on install on certain machines Why are such commit messages accepted? What does “failure on install” even mean? Why can’t the machine configuration be documented so it can be reproduced when necessary? No less confusing, the date you posted it on amd-gfx is from March 25th, 2022, but the author date of the commit in agd5f/amd-staging-drm-next is `Fri Mar 18 11:12:36 2022 -0400`. Why is the patch missing the Date field then? Reviewed-by: George Shen Acked-by: Alex Hung Signed-off-by: Martin Leung Shouldn’t the Signed-off-by line by the author go first? You committed this on `Mon Mar 28 08:26:48 2022 -0600`, while you posted the patch on amd-gfx on Friday. How should *proper* review happen over the weekend?
--- drivers/gpu/drm/amd/display/dc/core/dc.c | 4 1 file changed, 4 deletions(-) diff --git a/drivers/gpu/drm/amd/display/dc/core/dc.c b/drivers/gpu/drm/amd/display/dc/core/dc.c index f2ad8f58e69c..c436db416708 100644 --- a/drivers/gpu/drm/amd/display/dc/core/dc.c +++ b/drivers/gpu/drm/amd/display/dc/core/dc.c @@ -1496,10 +1496,6 @@ bool dc_validate_boot_timing(const struct dc *dc, if (!link->link_enc->funcs->is_dig_enabled(link->link_enc)) return false; - /* Check for FEC status*/ - if (link->link_enc->funcs->fec_is_active(link->link_enc)) - return false; - enc_inst = link->link_enc->funcs->get_dig_frontend(link->link_enc); if (enc_inst == ENGINE_ID_UNKNOWN) The patch reverted here also lacked proper review, had a commit message that left much to be desired, and did not follow the Linux kernel coding style (missing space before the comment terminator), so it should not have been committed in the first place. Seeing how many people are in the Cc list, I would have hoped that somebody had noticed and commented. The current state also makes it really hard for non-AMD employees to get the necessary information to do proper reviews, as the needed documentation and information is non-public. So good/excellent commit messages are a must. I seem to remember that you replied to me once that Display Core patches are also shared with the Microsoft Windows driver, restricting the workflow options. But I think the issues I mentioned are unrelated. I know graphics hardware is very complex, but if the quality of the commits and reviews improved, it would hopefully save time for everyone in the end, as fewer bugs would be introduced. Could the AMD team please address these issues as soon as possible? Kind regards, Paul
Re: [PATCH 04/13] drm/amd/display: FEC check in timing validation
Dear Alex, Am 19.03.22 um 08:43 schrieb Paul Menzel: Dear Alex, dear Chiawen, Thank you for your patch. Am 18.03.22 um 22:47 schrieb Alex Hung: From: Chiawen Huang [Why] disable/enable leads fec mismatch between hw/sw fec state. 1. Disable/enable of what? 2. How can this be reproduced? 3. s/fec/FEC/ [How] check fec status to fastboot on/off. What do you mean by “to fastboot on/off”? Reviewed-by: Anthony Koo Acked-by: Alex Hung Signed-off-by: Chiawen Huang --- drivers/gpu/drm/amd/display/dc/core/dc.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/gpu/drm/amd/display/dc/core/dc.c b/drivers/gpu/drm/amd/display/dc/core/dc.c index f6e19efea756..75f9c97bebb0 100644 --- a/drivers/gpu/drm/amd/display/dc/core/dc.c +++ b/drivers/gpu/drm/amd/display/dc/core/dc.c @@ -1496,6 +1496,10 @@ bool dc_validate_boot_timing(const struct dc *dc, if (!link->link_enc->funcs->is_dig_enabled(link->link_enc)) return false; + /* Check for FEC status*/ Missing space before `*/`. + if (link->link_enc->funcs->fec_is_active(link->link_enc)) + return false; + enc_inst = link->link_enc->funcs->get_dig_frontend(link->link_enc); if (enc_inst == ENGINE_ID_UNKNOWN) I just saw, that this patch was committed on March 25th, 2022 (commit 7d56a154e22f) with my comments ignored. Could you please explain why? Kind regards, Paul
Re: Vega 56 failing to process EDID from VR Headset
Dear James, Am 11.04.22 um 23:39 schrieb James Dutton: I have an Oculus Rift S, that I am trying to get working in Linux. Please always mention the Linux kernel version. I have an AMD Vega 56 graphics card. The VR headset plugs into a display port of the Vega56. The amdgpu driver sees the connection, and tries to process it. The problem is it cannot process the EDID, so fails to recognise the VR headset, and the VR headset does not work as a result. Please find the EDID below. I am guessing that the following is causing the problem: Established Timings I & II: none Standard Timings: none Forcing the driver to understand the Detailed mode, to which it is failing. If it helps, when attached to windows 10, it uses 1440x2560, portrait mode. Some dmesg lines that may be useful: // We should pick 1440x2560 as Windows picks that, but for some reason is rejects it with error 10. [10402.650734] [drm:create_validate_stream_for_sink [amdgpu]] Mode 1440x2560 (clk 571570) failed DC validation with error 10 (No DP link bandwidth) [10402.650991] [drm:update_stream_scaling_settings [amdgpu]] Destination Rectangle x:0 y:0 width:1440 height:2560 [10402.651225] [drm:create_validate_stream_for_sink [amdgpu]] Mode 1440x2560 (clk 571570) failed DC validation with error 10 (No DP link bandwidth) (Please use an email program, which does not wrap lines after 72 characters.) Can anyone help give me with some pointers as to how to get the amdgpu driver to accept this EDID? So, did you do any changes to Linux? Why do you think the EDID is at fault? […] I suggest to analyze, why `No DP link bandwidth` is logged. The macro is `DC_NO_DP_LINK_BANDWIDTH`, and you should first check why `dp_validate_mode_timing()` in `drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c` returns false. Kind regards, Paul PS: Using the issue tracker [1] might make it easier to keep track of this problem, and also to attach all the necessary information. [1]: https://gitlab.freedesktop.org/drm/amd/-/issues/
Re: [PATCH] drm/amd/amdgpu: Not request init data for MS_HYPERV with vega10
Dear Yongqiang, Thank you for your patch. Am 11.04.22 um 15:59 schrieb Yongqiang Sun: MS_HYPERV with vega10 doesn't have the interface to process request init data msg. Should some Hyper-V folks be added to the reviewers list too? Check hypervisor type to not send the request for MS_HYPERV. Please add a blank line between paragraphs. Signed-off-by: Yongqiang Sun --- drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c index 933c41f77c92..56b130ec44a9 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c @@ -23,6 +23,10 @@ #include +#ifdef CONFIG_X86 +#include +#endif + #include #include "amdgpu.h" @@ -721,8 +725,12 @@ void amdgpu_detect_virtualization(struct amdgpu_device *adev) break; case CHIP_VEGA10: soc15_set_virt_ops(adev); - /* send a dummy GPU_INIT_DATA request to host on vega10 */ - amdgpu_virt_request_init_data(adev); +#ifdef CONFIG_X86 + /* not send GPU_INIT_DATA with MS_HYPERV*/ + if (hypervisor_is_type(X86_HYPER_MS_HYPERV) == false) +#endif Why guard everything with CONFIG_X86? (If it’s needed, it should be done in C code.) + /* send a dummy GPU_INIT_DATA request to host on vega10 */ + amdgpu_virt_request_init_data(adev); break; case CHIP_VEGA20: case CHIP_ARCTURUS: Kind regards, Paul
Re: [PATCH ] drm/amdgpu: fix discovery ip failed
Dear Jie, Thank you for your patch. Am 11.04.22 um 17:15 schrieb Zhang, Jesse(Jie): You might want to add a space before the (. [AMD Official Use Only] Please send a patch with `git format-patch` or similar. Fix discovery ip failed, and the log: On what system? 56.129549] [drm:amdgpu_discovery_validate_ip [amdgpu]] *ERROR* Unexpected number_instance (64) from ip discovery blob [ 56.130129] [drm:amdgpu_discovery_validate_ip [amdgpu]] *ERROR* Unexpected hw_id (45056) from ip discovery blob [ 56.130701] [drm:amdgpu_discovery_validate_ip [amdgpu]] *ERROR* Unexpected number_instance (66) from ip discovery blob [ 56.131283] [drm:amdgpu_discovery_validate_ip [amdgpu]] *ERROR* Unexpected hw_id (45568) from ip discovery blob [ 56.131855] [drm:amdgpu_discovery_validate_ip [amdgpu]] *ERROR* Unexpected number_instance (66) from ip discovery blob [ 56.132436] [drm:amdgpu_discovery_validate_ip [amdgpu]] *ERROR* Unexpected hw_id (480) from ip discovery blob [ 56.133053] [drm:amdgpu_discovery_validate_ip [amdgpu]] *ERROR* Unexpected hw_id (608) from ip discovery blob [ 56.133626] [drm:amdgpu_discovery_validate_ip [amdgpu]] *ERROR* Unexpected hw_id (640) from ip discovery blob [ 56.134207] [drm:amdgpu_discovery_validate_ip [amdgpu]] *ERROR* Unexpected number_instance (64) from ip discovery blob [ 56.134780] [drm:amdgpu_discovery_validate_ip [amdgpu]] *ERROR* Unexpected number_instance (64) from ip discovery blob [ 56.135360] [drm:amdgpu_discovery_validate_ip [amdgpu]] *ERROR* Unexpected hw_id (28672) from ip discovery blob Please describe the reason for the failure, and your fix. A Signed-off-by line is missing. 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c index 766006a075ec..a778b0392e9f 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c @@ -798,7 +798,7 @@ static int amdgpu_discovery_sysfs_ips(struct amdgpu_device *adev, res = kobject_add(_hw_instance->kobj, NULL, "%d", ip_hw_instance->num_instance); next_ip: - ip_offset += sizeof(*ip) + 4 * (ip->num_base_address - 1); + ip_offset += struct_size(ip, base_address, ip->num_base_address); } } @@ -1063,7 +1063,7 @@ int amdgpu_discovery_reg_base_init(struct amdgpu_device *adev) } next_ip: - ip_offset += sizeof(*ip) + 4 * (ip->num_base_address - 1); + ip_offset += struct_size(ip, base_address, ip->num_base_address); } } @@ -1113,7 +1113,7 @@ int amdgpu_discovery_get_ip_version(struct amdgpu_device *adev, int hw_id, int n *revision = ip->revision; return 0; } - ip_offset += sizeof(*ip) + 4 * (ip->num_base_address - 1); + ip_offset += struct_size(ip, base_address, ip->num_base_address); } } Kind regards, Paul
Re: [PATCH v2] drm/amd/pm: Disable fan control if not supported
Dear Elena, Am 11.04.22 um 15:32 schrieb Elena Sakhnovitch: On Sienna Cichlid, not all platforms use PMFW based fan control. On such ASICs fan control by PMFW will be disabled in PPTable. Disable hwmon knobs for fan control also as it is not possible to report or control fan speed on such platforms through driver. v2: FEATURE_FAN_CONTROL_MASK is replaced with FEATURE_FAN_CONTROL_BIT Please add a blank line before the v2 line. Also please give one specific example for a platform without PMFW based fan control. Signed-off-by: Elena Sakhnovitch --- .../drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c | 15 ++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c index ab3e9d8b831e..ddc388b061b6 100644 --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c @@ -371,6 +371,18 @@ static void sienna_cichlid_check_bxco_support(struct smu_context *smu) } } +static void sienna_cichlid_check_fan_support(struct smu_context *smu) +{ + struct smu_table_context *table_context = >smu_table; + PPTable_t *pptable = table_context->driver_pptable; + /* No sort of fan control possible if PPTable has it disabled */ + smu->adev->pm.no_fan = + !(pptable->FeaturesToRun[1] & (1U << FEATURE_FAN_CONTROL_BIT)); + if (smu->adev->pm.no_fan) + dev_info_once(smu->adev->dev, + "PMFW based fan control disabled"); Maybe clarify: … according to PPTable. 
+} + static int sienna_cichlid_check_powerplay_table(struct smu_context *smu) { struct smu_table_context *table_context = >smu_table; @@ -381,6 +393,7 @@ static int sienna_cichlid_check_powerplay_table(struct smu_context *smu) smu->dc_controlled_by_gpio = true; sienna_cichlid_check_bxco_support(smu); + sienna_cichlid_check_fan_support(smu); table_context->thermal_controller_type = powerplay_table->thermal_controller_type; @@ -410,7 +423,7 @@ static int sienna_cichlid_append_powerplay_table(struct smu_context *smu) GET_PPTABLE_MEMBER(I2cControllers, _member); memcpy(table_member, smc_dpm_table->I2cControllers, sizeof(*smc_dpm_table) - sizeof(smc_dpm_table->table_header)); - + Unrelated, and should be a separate patch. return 0; } Kind regards, Paul
Re: [PATCH] drm/amdgpu: Release memory when psp sw_init is failed
Dear Ma, Thank you for your patch. Am 11.04.22 um 14:42 schrieb Ma Jun: Release the memory (psp->cmd) when psp initialization is failed in psp_sw_init s/is failed/fails/ Period/full stop at the end of sentences, and it still fits into 75 characters per line. Next time you could also add one simple statement like: Add new label `failure` to jump to in case of the failure. Signed-off-by: Ma Jun Change-Id: I2f88b5919142d55dd7d3820a7da94823286db235 Without the URL of the Gerrit instance, the Change-Id is not of much use. --- drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 15 +-- 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c index a6acec1a6155..1227dc014c80 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c @@ -305,9 +305,10 @@ static int psp_sw_init(void *handle) ret = psp_init_sriov_microcode(psp); else ret = psp_init_microcode(psp); + Unrelated. if (ret) { DRM_ERROR("Failed to load psp firmware!\n"); - return ret; + goto failure; } adev->psp.xgmi_context.supports_extended_data = @@ -339,25 +340,27 @@ static int psp_sw_init(void *handle) ret = psp_memory_training_init(psp); if (ret) { DRM_ERROR("Failed to initialize memory training!\n"); - return ret; + goto failure; } ret = psp_mem_training(psp, PSP_MEM_TRAIN_COLD_BOOT); if (ret) { DRM_ERROR("Failed to process memory training!\n"); - return ret; + goto failure; } } if (adev->ip_versions[MP0_HWIP][0] == IP_VERSION(11, 0, 0) || adev->ip_versions[MP0_HWIP][0] == IP_VERSION(11, 0, 7)) { ret= psp_sysfs_init(adev); - if (ret) { - return ret; - } + if (ret) + goto failure; } return 0; +failure: + kfree(psp->cmd); + return ret; } static int psp_sw_fini(void *handle) Acked-by: Paul Menzel Kind regards, Paul
Re: [PATCHv2] drm/amdgpu: disable ASPM on Intel AlderLake based systems
Dear Richard, Am 11.04.22 um 13:38 schrieb Gong, Richard: On 4/11/2022 2:41 AM, Paul Menzel wrote: [Cc: +] Am 11.04.22 um 02:27 schrieb Gong, Richard: On 4/8/2022 7:19 PM, Paul Menzel wrote: Am 08.04.22 um 21:05 schrieb Richard Gong: Active State Power Management (ASPM) feature is enabled since kernel 5.14. There are some AMD GFX cards (such as WX3200 and RX640) that cannot be used with Intel AlderLake based systems to enable ASPM. Using these GFX Alder Lake will correct in the next version. cards as video/display output, Intel Alder Lake based systems will hang during suspend/resume. Please reflow for 75 characters per line. Also please mention the exact system you had problems with (also firmware versions). Add extra check to disable ASPM on Intel AlderLake based systems. Is that a problem with Intel Alder Lake or the Dell system? Shouldn’t ASPM just be disabled for the problematic cards for the Dell system. You write newer cards worked fine. There is a problem with Dell system (Dell Precision DT workstation), which is based on Intel Alder Lake. ASPM works just fine on these GPU's. It's more of an issue with whether the underlying platform supports ASPM or not. At least you didn’t document what the real issue is, You can refer to bug tag from the comment messages. Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1885 No, the commit message should be self-contained, and reviewers and readers of the commit message not required to read comments of bug reports. Please add the necessary information to the commit message. Kind regards, Paul that ASPM does not work. With current information (some GPU graphics card with the the Dell system and others don’t), it could be the GPU, the Dell system (firmware, …), a problem with Alder Lake SOC, or another bug. I hope you are in contact with Dell to analyze it, so ASPM can be enabled again. […] Kind regards, Paul
Re: [PATCH] drm/amd/pm: fix the deadlock issue observed on SI
Dear Evan, It would have been nice if you had put me in Cc, as I also reported the regression. Am 11.04.22 um 09:47 schrieb Evan Quan: By placing those unrelated code outside of adev->pm.mutex protections or restructuring the call flow, we can eliminate the deadlock issue. This comes with no real logics change. Please describe the deadlock issue and the fix in more detail. What code is unrelated, and why was it put into the mutex protections in the first place? Fixes: 3712e7a49459 ("drm/amd/pm: unified lock protections in amdgpu_dpm.c") No blank line needed. Signed-off-by: Evan Quan Change-Id: I75de4350d9c2517aba0d6adc12e1bc69430fd800 Without the Gerrit instance URL, the Change-Id is useless. As it’s a regression, please follow the documentation, and add the related tags. Fixes: 3712e7a49459 ("drm/amd/pm: unified lock protections in amdgpu_dpm.c") Reported-by: Paul Menzel Reported-by: Arthur Marsh Link: https://lore.kernel.org/r/9e689fea-6c69-f4b0-8dee-32c4cf7d8...@molgen.mpg.de BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1957 Also add the link from Arthur. Kind regards, Paul [1]: https://linux-regtracking.leemhuis.info/regzbot/mainline/ --- drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 39 +++ .../gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c| 10 - drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c| 35 - .../gpu/drm/amd/pm/powerplay/amd_powerplay.c | 10 - 4 files changed, 39 insertions(+), 55 deletions(-) diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c index 5504d81c77b7..72e7b5d40af6 100644 --- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c +++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c @@ -427,6 +427,7 @@ int amdgpu_dpm_read_sensor(struct amdgpu_device *adev, enum amd_pp_sensors senso void amdgpu_dpm_compute_clocks(struct amdgpu_device *adev) { const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs; + int i; You could make it unsigned, as the variable should never be assigned negative values. 
if (!adev->pm.dpm_enabled) return; @@ -434,6 +435,15 @@ void amdgpu_dpm_compute_clocks(struct amdgpu_device *adev) if (!pp_funcs->pm_compute_clocks) return; + if (adev->mode_info.num_crtc) + amdgpu_display_bandwidth_update(adev); + + for (i = 0; i < AMDGPU_MAX_RINGS; i++) { + struct amdgpu_ring *ring = adev->rings[i]; + if (ring && ring->sched.ready) + amdgpu_fence_wait_empty(ring); + } + mutex_lock(>pm.mutex); pp_funcs->pm_compute_clocks(adev->powerplay.pp_handle); mutex_unlock(>pm.mutex); @@ -443,6 +453,20 @@ void amdgpu_dpm_enable_uvd(struct amdgpu_device *adev, bool enable) { int ret = 0; + if (adev->family == AMDGPU_FAMILY_SI) { + mutex_lock(>pm.mutex); + if (enable) { + adev->pm.dpm.uvd_active = true; + adev->pm.dpm.state = POWER_STATE_TYPE_INTERNAL_UVD; + } else { + adev->pm.dpm.uvd_active = false; + } + mutex_unlock(>pm.mutex); + + amdgpu_dpm_compute_clocks(adev); + return; + } + ret = amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_UVD, !enable); if (ret) DRM_ERROR("Dpm %s uvd failed, ret = %d. \n", @@ -453,6 +477,21 @@ void amdgpu_dpm_enable_vce(struct amdgpu_device *adev, bool enable) { int ret = 0; + if (adev->family == AMDGPU_FAMILY_SI) { + mutex_lock(>pm.mutex); + if (enable) { + adev->pm.dpm.vce_active = true; + /* XXX select vce level based on ring/task */ + adev->pm.dpm.vce_level = AMD_VCE_LEVEL_AC_ALL; + } else { + adev->pm.dpm.vce_active = false; + } + mutex_unlock(>pm.mutex); + + amdgpu_dpm_compute_clocks(adev); + return; + } + ret = amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_VCE, !enable); if (ret) DRM_ERROR("Dpm %s vce failed, ret = %d. 
\n", diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c b/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c index 9613c6181c17..d3fe149d8476 100644 --- a/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c @@ -1028,16 +1028,6 @@ static int amdgpu_dpm_change_power_state_locked(struct amdgpu_device *adev) void amdgpu_legacy_dpm_compute_clocks(void *handle) { struct amdgpu_device *adev = (struct amdgpu_device *)handle; - int i = 0; - - if (adev->mode_info.num_crtc) - amdgpu_display
Re: [PATCHv2] drm/amdgpu: disable ASPM on Intel AlderLake based systems
[Cc: +] Dear Richard, Am 11.04.22 um 02:27 schrieb Gong, Richard: On 4/8/2022 7:19 PM, Paul Menzel wrote: Am 08.04.22 um 21:05 schrieb Richard Gong: Active State Power Management (ASPM) feature is enabled since kernel 5.14. There are some AMD GFX cards (such as WX3200 and RX640) that cannot be used with Intel AlderLake based systems to enable ASPM. Using these GFX Alder Lake will correct in the next version. cards as video/display output, Intel Alder Lake based systems will hang during suspend/resume. Please reflow for 75 characters per line. Also please mention the exact system you had problems with (also firmware versions). Add extra check to disable ASPM on Intel AlderLake based systems. Is that a problem with Intel Alder Lake or the Dell system? Shouldn’t ASPM just be disabled for the problematic cards for the Dell system. You write newer cards worked fine. There is a problem with Dell system (Dell Precision DT workstation), which is based on Intel Alder Lake. ASPM works just fine on these GPU's. It's more of an issue with whether the underlying platform supports ASPM or not. At least you didn’t document what the real issue is, that ASPM does not work. With current information (some GPU graphics card with the the Dell system and others don’t), it could be the GPU, the Dell system (firmware, …), a problem with Alder Lake SOC, or another bug. I hope you are in contact with Dell to analyze it, so ASPM can be enabled again. […] Kind regards, Paul
Re: [PATCHv2] drm/amdgpu: disable ASPM on Intel AlderLake based systems
Dear Richard, Thank you for your response, but please reply to your own reply next time. Am 11.04.22 um 02:37 schrieb Gong, Richard: On 4/8/2022 7:19 PM, Paul Menzel wrote: Thank you for your patch. Am 08.04.22 um 21:05 schrieb Richard Gong: Active State Power Management (ASPM) feature is enabled since kernel 5.14. There are some AMD GFX cards (such as WX3200 and RX640) that cannot be used with Intel AlderLake based systems to enable ASPM. Using these GFX Alder Lake Actually there are 2 formats (one with space, another is w/o space) in the upstream sources, so I will keep that unchanged and use the format w/o space. Do you mean the Linux kernel sources? Anyway, please use the correct spelling [1]. Kind regards, Paul [1]: https://ark.intel.com/content/www/us/en/ark/products/codename/147470/products-formerly-alder-lake.html [2]: https://en.wikipedia.org/wiki/Alder_Lake
Re: [PATCH 04/20] drm/amd/display: Disallow entering PSR when panel is disconnected
Dear Pavle, dear Max, Am 08.04.22 um 19:18 schrieb Pavle Kotarac: From: Max Erenberg [WHY] The dGPU cannot enter PSR when it is not connected to a panel. Maybe spell out Panel Self Refresh once. [HOW] Added a check to dc_link_set_psr_allow_active s/Added/Add/ which returns early if panel is disconnected. Please reflow for 57 characters per line. Reviewed-by: Harry Vanzylldejong Reviewed-by: Evgenii Krasnikov Reviewed-by: Nicholas Choi Acked-by: Pavle Kotarac Signed-off-by: Max Erenberg --- drivers/gpu/drm/amd/display/dc/core/dc_link.c | 5 + 1 file changed, 5 insertions(+) diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_link.c b/drivers/gpu/drm/amd/display/dc/core/dc_link.c index 47b67fd1e84c..22f2d88fab99 100644 --- a/drivers/gpu/drm/amd/display/dc/core/dc_link.c +++ b/drivers/gpu/drm/amd/display/dc/core/dc_link.c @@ -3079,6 +3079,11 @@ bool dc_link_set_psr_allow_active(struct dc_link *link, const bool *allow_active if (!dc_get_edp_link_panel_inst(dc, link, _inst)) return false; + if (allow_active && link->type == dc_connection_none) { Why does `allow_active` need to be checked? + // Don't enter PSR if panel is not connected + return false; + } + /* Set power optimization flag */ if (power_opts && link->psr_settings.psr_power_opt != *power_opts) { link->psr_settings.psr_power_opt = *power_opts; Kind regards, Paul
Re: [PATCH 12/20] drm/amd/display: Add odm seamless boot support
Dear Pavle, dear Duncan, Thank you for the patch. Am 08.04.22 um 19:19 schrieb Pavle Kotarac: From: Duncan Ma [WHY] Implement changes to transition from Pre-OS odm to What is odm/ODM? Original Device Manufacturer? Post-OS odm support. Seamless boot case is also considered. Please answer the question Why? better. What is Post-OS odm support? Why are change to the transition needed? What is seamless boot? Please add references. [HOW] Revised validation logic when marking for seamless boot. How is it revised exactly? Init resources accordingly when Pre-OS has odm enabled. Reset odm and det size when transitioning Pre-OS odm to Post-OS non-odm to avoid corruption. Apply logic to set odm accordingly upon commit. I looked through the diff, but would love a more elaborate description of the implementation. How was and can this tested? Please reflow for 75 characters per line as textwidth. Reviewed-by: Nicholas Kazlauskas Acked-by: Pavle Kotarac Signed-off-by: "Duncan Ma" --- drivers/gpu/drm/amd/display/dc/core/dc.c | 13 +++ .../gpu/drm/amd/display/dc/core/dc_resource.c | 82 --- drivers/gpu/drm/amd/display/dc/dc.h | 1 + drivers/gpu/drm/amd/display/dc/dc_stream.h| 1 + .../amd/display/dc/dcn10/dcn10_hw_sequencer.c | 21 + .../gpu/drm/amd/display/dc/dcn31/dcn31_optc.c | 21 + .../gpu/drm/amd/display/dc/dcn31/dcn31_optc.h | 2 + .../amd/display/dc/inc/hw/timing_generator.h | 2 + 8 files changed, 115 insertions(+), 28 deletions(-) diff --git a/drivers/gpu/drm/amd/display/dc/core/dc.c b/drivers/gpu/drm/amd/display/dc/core/dc.c index c436db416708..c2fcd67bcc4d 100644 --- a/drivers/gpu/drm/amd/display/dc/core/dc.c +++ b/drivers/gpu/drm/amd/display/dc/core/dc.c @@ -1569,11 +1569,24 @@ bool dc_validate_boot_timing(const struct dc *dc, if (dc_is_dp_signal(link->connector_signal)) { unsigned int pix_clk_100hz; + uint32_t numOdmPipes = 1; Maybe add a comment, that the type is due to `get_optc_source` signature. Why initialize it? 
get_optc_source seems to always assign a value to the passed variable. Why CamelCase? + uint32_t id_src[4] = {0}; dc->res_pool->dp_clock_source->funcs->get_pixel_clk_frequency_100hz( dc->res_pool->dp_clock_source, tg_inst, _clk_100hz); + if (tg->funcs->get_optc_source) + tg->funcs->get_optc_source(tg, + , _src[0], _src[1]); + + if (numOdmPipes == 2) + pix_clk_100hz *= 2; + if (numOdmPipes == 4) + pix_clk_100hz *= 4; + + // Note: In rare cases, HW pixclk may differ from crtc's pixclk + // slightly due to rounding issues in 10 kHz units. The comment could be added in a separate patch, and also the values be logged if they are different. if (crtc_timing->pix_clk_100hz != pix_clk_100hz) return false; diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_resource.c b/drivers/gpu/drm/amd/display/dc/core/dc_resource.c index f5777a71f2f1..f292303b75a5 100644 --- a/drivers/gpu/drm/amd/display/dc/core/dc_resource.c +++ b/drivers/gpu/drm/amd/display/dc/core/dc_resource.c @@ -2120,6 +2120,8 @@ static int acquire_resource_from_hw_enabled_state( { struct dc_link *link = stream->link; unsigned int i, inst, tg_inst = 0; + uint32_t numPipes = 1; Why CamelCase? 
+ uint32_t id_src[4] = {0}; /* Check for enabled DIG to identify enabled display */ if (!link->link_enc->funcs->is_dig_enabled(link->link_enc)) @@ -2148,38 +2150,62 @@ static int acquire_resource_from_hw_enabled_state( if (!res_ctx->pipe_ctx[tg_inst].stream) { struct pipe_ctx *pipe_ctx = _ctx->pipe_ctx[tg_inst]; - pipe_ctx->stream_res.tg = pool->timing_generators[tg_inst]; - pipe_ctx->plane_res.mi = pool->mis[tg_inst]; - pipe_ctx->plane_res.hubp = pool->hubps[tg_inst]; - pipe_ctx->plane_res.ipp = pool->ipps[tg_inst]; - pipe_ctx->plane_res.xfm = pool->transforms[tg_inst]; - pipe_ctx->plane_res.dpp = pool->dpps[tg_inst]; - pipe_ctx->stream_res.opp = pool->opps[tg_inst]; - - if (pool->dpps[tg_inst]) { - pipe_ctx->plane_res.mpcc_inst = pool->dpps[tg_inst]->inst; - - // Read DPP->MPCC->OPP Pipe from HW State - if (pool->mpc->funcs->read_mpcc_state) { - struct mpcc_state s = {0}; - - pool->mpc->funcs->read_mpcc_state(pool->mpc, pipe_ctx->plane_res.mpcc_inst, ); - - if (s.dpp_id < MAX_MPCC) - pool->mpc->mpcc_array[pipe_ctx->plane_res.mpcc_inst].dpp_id = s.dpp_id; - -
Re: [PATCH 03/20] drm/amd/display: Disabling Z10 on DCN31
Dear Pavle, dear syerizvi, Thank you for the patch. Am 08.04.22 um 19:18 schrieb Pavle Kotarac: From: "AMD\\syerizvi" Please correct the author name. Imperative mood should be used in the commit message summary: Disable Z10 …. [WHY] Z10 is should not be enabled by default on DCN31. According to what source? [HOW] Using DC debug flags to disable Z10 by default on DCN31. s/Using/Use/ Why is that grouped under debug flags? Does not seem related. Reviewed-by: Eric Yang Acked-by: Pavle Kotarac Signed-off-by: AMD\syerizvi Please fix the name. --- drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c index f27262417abe..6cc580be7c79 100644 --- a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c +++ b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c @@ -885,6 +885,7 @@ static const struct dc_debug_options debug_defaults_drv = { .afmt = true, } }, + .disable_z10 = true, .optimize_edp_link_rate = true, .enable_sw_cntl_psr = true, .apply_vendor_specific_lttpr_wa = true, Kind regards, Paul
Re: [PATCH 02/20] drm/amd/display: do not wait for mpc idle if tg is disabled
Dear Pavle, Thank you for the patch. Am 08.04.22 um 19:18 schrieb Pavle Kotarac: From: Josip Pavic [Why] When booting, the driver waits for the MPC idle bit to be set as part of pipe initialization. However, on some systems this occurs before OTG is enabled, and since the MPC idle bit won't be set until the vupdate signal occurs (which requires OTG to be enabled), this never happens and the wait times out. This can add hundreds of milliseconds to the boot time. Please list one specific system, where OTG is enabled later. [How] Do not wait for mpc idle if tg is disabled Please add a dot/period at the end of sentences. Reviewed-by: Jun Lei Acked-by: Pavle Kotarac Signed-off-by: Josip Pavic --- drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c index 50820e79d3c4..2d3d870f0bea 100644 --- a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c +++ b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c @@ -3185,7 +3185,8 @@ void dcn10_wait_for_mpcc_disconnect( if (pipe_ctx->stream_res.opp->mpcc_disconnect_pending[mpcc_inst]) { struct hubp *hubp = get_hubp_by_inst(res_pool, mpcc_inst); - res_pool->mpc->funcs->wait_for_idle(res_pool->mpc, mpcc_inst); + if (pipe_ctx->stream_res.tg->funcs->is_tg_enabled(pipe_ctx->stream_res.tg)) + res_pool->mpc->funcs->wait_for_idle(res_pool->mpc, mpcc_inst); pipe_ctx->stream_res.opp->mpcc_disconnect_pending[mpcc_inst] = false; hubp->funcs->set_blank(hubp, true); } Kind regards, Paul
Re: [PATCH 01/20] drm/amd/display: undo clearing of z10 related function pointers
Dear Eric, dear Pavle, Thank you for the patch. Am 08.04.22 um 19:18 schrieb Pavle Kotarac: From: Eric Yang [Why] Z10 and S0i3 have some shared path. Previous code clean up , 1. cleanup 2. Remove the space before the comma, or remove the unneeded comma. incorrectly removed these pointers, which breaks s0i3 restore Please add a dot/period at the end of sentences. How does the breakage manifest exactly? Maybe mention, that it only happened with `dc->debug.disable_z10`. [How] Do not clear the function pointers based on Z10 disable. What commit introduced the regression? Please add a Fixes tag. Reviewed-by: Nicholas Kazlauskas Acked-by: Pavle Kotarac Signed-off-by: Eric Yang --- drivers/gpu/drm/amd/display/dc/dcn31/dcn31_init.c | 5 - 1 file changed, 5 deletions(-) diff --git a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_init.c b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_init.c index d7559e5a99ce..e708f07fe75a 100644 --- a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_init.c +++ b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_init.c @@ -153,9 +153,4 @@ void dcn31_hw_sequencer_construct(struct dc *dc) dc->hwss.init_hw = dcn20_fpga_init_hw; dc->hwseq->funcs.init_pipes = NULL; } - if (dc->debug.disable_z10) { - /*hw not support z10 or sw disable it*/ - dc->hwss.z10_restore = NULL; - dc->hwss.z10_save_init = NULL; - } } Kind regards, Paul
Re: [PATCHv2] drm/amdgpu: disable ASPM on Intel AlderLake based systems
Dear Richard, Thank you for your patch. Am 08.04.22 um 21:05 schrieb Richard Gong: Active State Power Management (ASPM) feature is enabled since kernel 5.14. There are some AMD GFX cards (such as WX3200 and RX640) that cannot be used with Intel AlderLake based systems to enable ASPM. Using these GFX Alder Lake cards as video/display output, Intel Alder Lake based systems will hang during suspend/resume. Please reflow for 75 characters per line. Also please mention the exact system you had problems with (also firmware versions). Add extra check to disable ASPM on Intel AlderLake based systems. Is that a problem with Intel Alder Lake or the Dell system? Shouldn’t ASPM just be disabled for the problematic cards for the Dell system? You write newer cards worked fine. Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default") Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1885 Signed-off-by: Richard Gong --- v2: correct commit description move the check from chip family to problematic platform --- drivers/gpu/drm/amd/amdgpu/vi.c | 17 - 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c index 039b90cdc3bc..8b4eaf54b23e 100644 --- a/drivers/gpu/drm/amd/amdgpu/vi.c +++ b/drivers/gpu/drm/amd/amdgpu/vi.c @@ -81,6 +81,10 @@ #include "mxgpu_vi.h" #include "amdgpu_dm.h" +#if IS_ENABLED(CONFIG_X86_64) +#include <asm/intel-family.h> +#endif + #define ixPCIE_LC_L1_PM_SUBSTATE 0x100100C6 #define PCIE_LC_L1_PM_SUBSTATE__LC_L1_SUBSTATES_OVERRIDE_EN_MASK 0x0001L #define PCIE_LC_L1_PM_SUBSTATE__LC_PCI_PM_L1_2_OVERRIDE_MASK 0x0002L @@ -1134,13 +1138,24 @@ static void vi_enable_aspm(struct amdgpu_device *adev) WREG32_PCIE(ixPCIE_LC_CNTL, data); } +static bool intel_core_apsm_chk(void) aspm +{ +#if IS_ENABLED(CONFIG_X86_64) + struct cpuinfo_x86 *c = &cpu_data(0); + + return (c->x86 == 6 && c->x86_model == INTEL_FAM6_ALDERLAKE); +#else + return false; +#endif Please do the check in C code and not the preprocessor.
+} + static void vi_program_aspm(struct amdgpu_device *adev) { u32 data, data1, orig; bool bL1SS = false; bool bClkReqSupport = true; - if (!amdgpu_device_should_use_aspm(adev)) + if (!amdgpu_device_should_use_aspm(adev) || intel_core_apsm_chk()) return; if (adev->flags & AMD_IS_APU || Kind regards, Paul
Re: [PATCH Review 1/1] drm/amdgpu: add umc query error status function
Dear Stanley, Thank you for the patch. Am 08.04.22 um 10:10 schrieb Stanley.Yang: Please remove the dot/period from the name. In order to debug ras error, driver will print IPID/SYND/MISC0 register value if detect correctable or uncorrectable error. … if it detects … Provide umc_query_error_status_helper function to reduce code redundancy. Can you please make this two patches. First refactoring, and then adding new call sites. (The current commit message summary, does not say anything about new call-sites.) Change-Id: I09a2aae85cde3ab2cb6b042b973da6839ad024ec Without the Gerrit review instance, the Change-Id is not of any use. Signed-off-by: Stanley.Yang Please remove the . from the name. Kind regards, Paul --- drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 106 -- 1 file changed, 48 insertions(+), 58 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c index c45d9c14ecbc..9d3b54778417 100644 --- a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c +++ b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c @@ -64,21 +64,62 @@ static inline uint32_t get_umc_v6_7_channel_index(struct amdgpu_device *adev, return adev->umc.channel_idx_tbl[umc_inst * adev->umc.channel_inst_num + ch_inst]; } +static void umc_query_error_status_helper(struct amdgpu_device *adev, + uint64_t mc_umc_status, uint32_t umc_reg_offset) +{ + uint32_t mc_umc_addr; + uint64_t reg_value; + + if (REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, Deferred) == 1) + dev_info(adev->dev, "Deferred error, no user action is needed.\n"); + + if (mc_umc_status) + dev_info(adev->dev, "MCA STATUS 0x%llx, umc_reg_offset 0x%x\n", mc_umc_status, umc_reg_offset); + + /* print IPID registers value */ + mc_umc_addr = + SOC15_REG_OFFSET(UMC, 0, regMCA_UMC_UMC0_MCUMC_IPIDT0); + reg_value = RREG64_PCIE((mc_umc_addr + umc_reg_offset) * 4); + if (reg_value) + dev_info(adev->dev, "MCA IPID 0x%llx, umc_reg_offset 0x%x\n", reg_value, umc_reg_offset); + + /* print SYND registers value */ + mc_umc_addr = 
+ SOC15_REG_OFFSET(UMC, 0, regMCA_UMC_UMC0_MCUMC_SYNDT0); + reg_value = RREG64_PCIE((mc_umc_addr + umc_reg_offset) * 4); + if (reg_value) + dev_info(adev->dev, "MCA SYND 0x%llx, umc_reg_offset 0x%x\n", reg_value, umc_reg_offset); + + /* print MISC0 registers value */ + mc_umc_addr = + SOC15_REG_OFFSET(UMC, 0, regMCA_UMC_UMC0_MCUMC_MISC0T0); + reg_value = RREG64_PCIE((mc_umc_addr + umc_reg_offset) * 4); + if (reg_value) + dev_info(adev->dev, "MCA MISC0 0x%llx, umc_reg_offset 0x%x\n", reg_value, umc_reg_offset); +} + static void umc_v6_7_ecc_info_query_correctable_error_count(struct amdgpu_device *adev, uint32_t umc_inst, uint32_t ch_inst, unsigned long *error_count) { uint64_t mc_umc_status; uint32_t eccinfo_table_idx; + uint32_t umc_reg_offset; struct amdgpu_ras *ras = amdgpu_ras_get_context(adev); + umc_reg_offset = get_umc_v6_7_reg_offset(adev, + umc_inst, ch_inst); + eccinfo_table_idx = umc_inst * adev->umc.channel_inst_num + ch_inst; /* check for SRAM correctable error MCUMC_STATUS is a 64 bit register */ mc_umc_status = ras->umc_ecc.ecc[eccinfo_table_idx].mca_umc_status; if (REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, Val) == 1 && - REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, CECC) == 1) + REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, CECC) == 1) { *error_count += 1; + + umc_query_error_status_helper(adev, mc_umc_status, umc_reg_offset); + } } static void umc_v6_7_ecc_info_querry_uncorrectable_error_count(struct amdgpu_device *adev, @@ -88,8 +129,6 @@ static void umc_v6_7_ecc_info_querry_uncorrectable_error_count(struct amdgpu_dev uint64_t mc_umc_status; uint32_t eccinfo_table_idx; uint32_t umc_reg_offset; - uint32_t mc_umc_addr; - uint64_t reg_value; struct amdgpu_ras *ras = amdgpu_ras_get_context(adev); umc_reg_offset = get_umc_v6_7_reg_offset(adev, @@ -106,32 +145,7 @@ static void umc_v6_7_ecc_info_querry_uncorrectable_error_count(struct amdgpu_dev REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, TCC) == 1)) 
{ *error_count += 1; - if (REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, Deferred) == 1) - dev_info(adev->dev, "Deferred error, no user action is needed.\n"); - - if (mc_umc_status) - dev_info(adev->dev,
Re: [PATCH 1/1] amdgpu/pm: Clarify Documentation of error handling in send_smc_mesg
Dear Darren, Thank you for your patch. Am 08.04.22 um 04:26 schrieb Darren Powell: Contrary to the smu_cmn_send_smc_msg_with_param documentation, two cases exist where messages are silently dropped with no error returned to the caller. These cases occur in unusual situations where either: 1. the caller is a virtual GPU, or 2. a PCI recovery is underway and the HW is not yet in sync with the SW For more details see commit 4ea5081c82c4 ("drm/amd/powerplay: enable SMC message filter") commit bf36b52e781d ("drm/amdgpu: Avoid accessing HW when suspending SW state") Please remove the indentation. If you re-rolled the patch, you could also spell *documentation* lowercase in the commit message summary. Signed-off-by: Darren Powell --- drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c index b8d0c70ff668..b1bd1990c88b 100644 --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c @@ -356,12 +356,15 @@ int smu_cmn_wait_for_response(struct smu_context *smu) * completion of the command, and return back a value from the SMU in * @read_arg pointer. * - * Return 0 on success, -errno on error, if we weren't able to send + * Return 0 on success, or if the message is dropped. + * On error, -errno is returned if we weren't able to send * the message or if the message completed with some kind of * error. See __smu_cmn_reg2errno() for details of the -errno. * * If we weren't able to send the message to the SMU, we also print - * the error to the standard log. + * the error to the standard log. Dropped messages can be caused + * due to PCI slot recovery or attempting to send from a virtual GPU, + * and do not print an error. 
* * Command completion status is printed only if the -errno is * -EREMOTEIO, indicating that the SMU returned back an base-commit: 4585c45a6a66cb17cc97f4370457503746e540b7 The diff looks good – despite Mozilla Thunderbird quoting it strangely. Kind regards, Paul
Re: [PATCH v12] drm/amdgpu: add drm buddy support to amdgpu
Dear Arunpravin, Thank you for your patch. Am 07.04.22 um 07:46 schrieb Arunpravin Paneer Selvam: - Switch to drm buddy allocator - Add resource cursor support for drm buddy I thought after the last long discussion, you would actually act on the review comments. Daniel wrote a good summary, you could more or less copy and paste. So why didn’t you? So, I really wish to not have the patch committed as is. The summary should also say something about using mutex over spinlocks. For me the version change summaries below are just for reviewers of earlier iterations to see what changed, and not something to be read easily. Kind regards, Paul v2(Matthew Auld): - replace spinlock with mutex as we call kmem_cache_zalloc (..., GFP_KERNEL) in drm_buddy_alloc() function - lock drm_buddy_block_trim() function as it calls mark_free/mark_split are all globally visible v3(Matthew Auld): - remove trim method error handling as we address the failure case at drm_buddy_block_trim() function v4: - fix warnings reported by kernel test robot v5: - fix merge conflict issue v6: - fix warnings reported by kernel test robot v7: - remove DRM_BUDDY_RANGE_ALLOCATION flag usage v8: - keep DRM_BUDDY_RANGE_ALLOCATION flag usage - resolve conflicts created by drm/amdgpu: remove VRAM accounting v2 v9(Christian): - merged the below patch - drm/amdgpu: move vram inline functions into a header - rename label name as fallback - move struct amdgpu_vram_mgr to amdgpu_vram_mgr.h - remove unnecessary flags from struct amdgpu_vram_reservation - rewrite block NULL check condition - change else style as per coding standard - rewrite the node max size - add a helper function to fetch the first entry from the list v10(Christian): - rename amdgpu_get_node() function name as amdgpu_vram_mgr_first_block v11: - if size is not aligned with min_page_size, enable is_contiguous flag, therefore, the size round up to the power of two and trimmed to the original size.
v12: - rename the function names having prefix as amdgpu_vram_mgr_*() - modify the round_up() logic conforming to contiguous flag enablement or if size is not aligned to min_block_size - modify the trim logic - rename node as block wherever applicable Signed-off-by: Arunpravin Paneer Selvam --- drivers/gpu/drm/Kconfig | 1 + .../gpu/drm/amd/amdgpu/amdgpu_res_cursor.h| 97 - drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 10 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 359 ++ 4 files changed, 291 insertions(+), 176 deletions(-) diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig index f1422bee3dcc..5133c3f028ab 100644 --- a/drivers/gpu/drm/Kconfig +++ b/drivers/gpu/drm/Kconfig @@ -280,6 +280,7 @@ config DRM_AMDGPU select HWMON select BACKLIGHT_CLASS_DEVICE select INTERVAL_TREE + select DRM_BUDDY help Choose this option if you have a recent AMD Radeon graphics card. diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h index acfa207cf970..6546552e596c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h @@ -30,12 +30,15 @@ #include #include +#include "amdgpu_vram_mgr.h" + /* state back for walking over vram_mgr and gtt_mgr allocations */ struct amdgpu_res_cursor { uint64_tstart; uint64_tsize; uint64_tremaining; - struct drm_mm_node *node; + void*node; + uint32_tmem_type; }; /** @@ -52,27 +55,63 @@ static inline void amdgpu_res_first(struct ttm_resource *res, uint64_t start, uint64_t size, struct amdgpu_res_cursor *cur) { + struct drm_buddy_block *block; + struct list_head *head, *next; struct drm_mm_node *node; - if (!res || res->mem_type == TTM_PL_SYSTEM) { - cur->start = start; - cur->size = size; - cur->remaining = size; - cur->node = NULL; - WARN_ON(res && start + size > res->num_pages << PAGE_SHIFT); - return; - } + if (!res) + goto fallback; BUG_ON(start + size > res->num_pages << PAGE_SHIFT); - node = to_ttm_range_mgr_node(res)->mm_nodes; 
- while (start >= node->size << PAGE_SHIFT) - start -= node++->size << PAGE_SHIFT; + cur->mem_type = res->mem_type; + + switch (cur->mem_type) { + case TTM_PL_VRAM: + head = &to_amdgpu_vram_mgr_resource(res)->blocks; + + block = list_first_entry_or_null(head, +struct drm_buddy_block, +
Re: [PATCH] drm/amdgpu/smu10: fix SoC/fclk units in auto mode
Dear Alex, Thank you for your patch. Am 01.04.22 um 17:16 schrieb Alex Deucher: SMU takes clock limits in Mhz units. socclk and fclk were using 10 khz units in some cases. Switch to Mhz units. Fixes higher than required SoC clocks. It’d be great if you used 75 characters per line in commit messages. Fixes: 97cf32996c46d9 ("drm/amd/pm: Removed fixed clock in auto mode DPM") Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c index 9ddd8491ff00..545dd68a8c18 100644 --- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c +++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c @@ -773,13 +773,13 @@ static int smu10_dpm_force_dpm_level(struct pp_hwmgr *hwmgr, smum_send_msg_to_smc_with_parameter(hwmgr, PPSMC_MSG_SetHardMinFclkByFreq, hwmgr->display_config->num_display > 3 ? - data->clock_vol_info.vdd_dep_on_fclk->entries[0].clk : + (data->clock_vol_info.vdd_dep_on_fclk->entries[0].clk / 100) : min_mclk, Should that also be indented now? 
NULL); smum_send_msg_to_smc_with_parameter(hwmgr, PPSMC_MSG_SetHardMinSocclkByFreq, - data->clock_vol_info.vdd_dep_on_socclk->entries[0].clk, + data->clock_vol_info.vdd_dep_on_socclk->entries[0].clk / 100, NULL); smum_send_msg_to_smc_with_parameter(hwmgr, PPSMC_MSG_SetHardMinVcn, @@ -792,11 +792,11 @@ static int smu10_dpm_force_dpm_level(struct pp_hwmgr *hwmgr, NULL); smum_send_msg_to_smc_with_parameter(hwmgr, PPSMC_MSG_SetSoftMaxFclkByFreq, - data->clock_vol_info.vdd_dep_on_fclk->entries[index_fclk].clk, + data->clock_vol_info.vdd_dep_on_fclk->entries[index_fclk].clk / 100, NULL); smum_send_msg_to_smc_with_parameter(hwmgr, PPSMC_MSG_SetSoftMaxSocclkByFreq, - data->clock_vol_info.vdd_dep_on_socclk->entries[index_socclk].clk, + data->clock_vol_info.vdd_dep_on_socclk->entries[index_socclk].clk / 100, NULL); smum_send_msg_to_smc_with_parameter(hwmgr, PPSMC_MSG_SetSoftMaxVcn, Reviewed-by: Paul Menzel Kind regards, Paul
Re: Public patches but non-public development branch
Dear Alex, Am 05.04.22 um 16:13 schrieb Alex Deucher: On Tue, Apr 5, 2022 at 10:03 AM Paul Menzel wrote: Am 05.04.22 um 15:14 schrieb Alex Deucher: On Tue, Apr 5, 2022 at 3:02 AM Paul Menzel wrote: Am 05.04.22 um 08:54 schrieb Christian König: Am 05.04.22 um 08:45 schrieb Paul Menzel: Am 04.04.22 um 23:42 schrieb Philip Yang: bo_adev is NULL for system memory mapping to GPU. Fixes: 05fe8eeca92 (drm/amdgpu: fix TLB flushing during eviction) Sorry, where can I find that commit? Well that's expected, the development branch is not public. Well obviously, it was unexpected for me. How should I have known? Where is that documented? If the patches are publicly posted to the mailing list, why is that development branch not public? The current situation is really frustrating for non-AMD employees. How can the current situation be improved? Our development branch (https://gitlab.freedesktop.org/agd5f/linux/-/commits/amd-staging-drm-next) is available publicly. There can be a day or so of lag depending on when it gets mirrored (e.g., over the weekend). Thank you for the clarification. As can be seen at hand, it still causes confusion though. commit 05fe8eeca927e29b81f3f2a799e9b9b88b0989a9 Author: Christian König AuthorDate: Wed Mar 30 10:53:15 2022 +0200 Commit: Christian König CommitDate: Fri Apr 1 11:05:51 2022 +0200 Today is Tuesday, so it wasn’t mirrored yesterday, on Monday. To avoid this friction in the future, is there an automated way to mirror the branches? git hooks should allow that to be done on every push for example. It's a bit more complicated than that since we have various CI systems and IT security policies involved, but we can look into it. That’d be awesome. Thank you! Kind regards, Paul
Re: Public patches but non-public development branch
Dear Alex, Am 05.04.22 um 15:14 schrieb Alex Deucher: On Tue, Apr 5, 2022 at 3:02 AM Paul Menzel wrote: Am 05.04.22 um 08:54 schrieb Christian König: Am 05.04.22 um 08:45 schrieb Paul Menzel: Am 04.04.22 um 23:42 schrieb Philip Yang: bo_adev is NULL for system memory mapping to GPU. Fixes: 05fe8eeca92 (drm/amdgpu: fix TLB flushing during eviction) Sorry, where can I find that commit? Well that's expected, the development branch is not public. Well obviously, it was unexpected for me. How should I have known? Where is that documented? If the patches are publicly posted to the mailing list, why is that development branch not public? The current situation is really frustrating for non-AMD employees. How can the current situation be improved? Our development branch (https://gitlab.freedesktop.org/agd5f/linux/-/commits/amd-staging-drm-next) is available publicly. There can be a day or so of lag depending on when it gets mirrored (e.g., over the weekend). Thank you for the clarification. As can be seen at hand, it still causes confusion though. commit 05fe8eeca927e29b81f3f2a799e9b9b88b0989a9 Author: Christian König AuthorDate: Wed Mar 30 10:53:15 2022 +0200 Commit: Christian König CommitDate: Fri Apr 1 11:05:51 2022 +0200 Today is Tuesday, so it wasn’t mirrored yesterday, on Monday. To avoid this friction in the future, is there an automated way to mirror the branches? git hooks should allow that to be done on every push for example. Kind regards, Paul
Public patches but non-public development branch (Re: [PATCH 1/1] drm/amdkfd: Add missing NULL check in svm_range_map_to_gpu)
Dear Christian, Am 05.04.22 um 08:54 schrieb Christian König: Am 05.04.22 um 08:45 schrieb Paul Menzel: Am 04.04.22 um 23:42 schrieb Philip Yang: bo_adev is NULL for system memory mapping to GPU. Fixes: 05fe8eeca92 (drm/amdgpu: fix TLB flushing during eviction) Sorry, where can I find that commit? Well that's expected, the development branch is not public. Well obviously, it was unexpected for me. How should I have known? Where is that documented? If the patches are publicly posted to the mailing list, why is that development branch not public? The current situation is really frustrating for non-AMD employees. How can the current situation be improved? Kind regards, Paul I do not see it in drm-next from agd5f git archive git@gitlab.freedesktop.org:agd5f/linux.git. $ git log --oneline -1 e45422695c19 (HEAD, agd5f/drm-next) drm/amdkfd: Create file descriptor after client is added to smi_clients list Kind regards, Paul Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 907b02045824..d3fb2d0b5a25 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -1281,7 +1281,7 @@ svm_range_map_to_gpu(struct kfd_process_device *pdd, struct svm_range *prange, last_start, prange->start + i, pte_flags, last_start - prange->start, - bo_adev->vm_manager.vram_base_offset, + bo_adev ? bo_adev->vm_manager.vram_base_offset : 0, NULL, dma_addr, &vm->last_update); for (j = last_start - prange->start; j <= i; j++)
Re: [PATCH 1/1] drm/amdkfd: Add missing NULL check in svm_range_map_to_gpu
Dear Philip, Am 04.04.22 um 23:42 schrieb Philip Yang: bo_adev is NULL for system memory mapping to GPU. Fixes: 05fe8eeca92 (drm/amdgpu: fix TLB flushing during eviction) Sorry, where can I find that commit? I do not see it in drm-next from agd5f git archive git@gitlab.freedesktop.org:agd5f/linux.git. $ git log --oneline -1 e45422695c19 (HEAD, agd5f/drm-next) drm/amdkfd: Create file descriptor after client is added to smi_clients list Kind regards, Paul Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 907b02045824..d3fb2d0b5a25 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -1281,7 +1281,7 @@ svm_range_map_to_gpu(struct kfd_process_device *pdd, struct svm_range *prange, last_start, prange->start + i, pte_flags, last_start - prange->start, - bo_adev->vm_manager.vram_base_offset, + bo_adev ? bo_adev->vm_manager.vram_base_offset : 0, NULL, dma_addr, &vm->last_update); for (j = last_start - prange->start; j <= i; j++)
Re: [PATCH 1/1] drm/amdgpu: Flush TLB after mapping for VG20+XGMI
Dear Philip, Thank you for your response. Am 04.04.22 um 18:52 schrieb philip yang: On 2022-04-04 12:19, Paul Menzel wrote: Am 01.04.22 um 21:57 schrieb Philip Yang: For VG20 + XGMI bridge, all mappings PTEs cache in TC, this may have stall invalid PTEs in TC because one cache line has 8 pages. Need always Can you please rephrase. “may have stall …” is really hard to understand. The patch already pushed and merged. Sorry, but that really sucks, that commits with such a hard to understand commit message are merged. flush_tlb after updating mapping. Maybe: So, always flush_tlb after updating the mapping. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index f0aec04111a3..687c9a140645 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -837,6 +837,12 @@ int amdgpu_vm_update_range(struct amdgpu_device *adev, struct amdgpu_vm *vm, goto error_unlock; } +/* Vega20+XGMI where PTEs get inadvertently cached in L2 texture cache, + * heavy-weight flush TLB unconditionally. + */ +flush_tlb |= (adev->gmc.xgmi.num_physical_nodes && + adev->ip_versions[GC_HWIP][0] == IP_VERSION(9, 4, 0)); + memset(&params, 0, sizeof(params)); params.adev = adev; params.vm = vm; Did you do any performance measurement, if that flushing affects anything? There was another patch to optimize TLB flush to improve map to GPU performance, for this config, always flush TLB after updating mapping is the original performance before the optimization. So, why didn’t you reference this commit in the commit message, and also did not give that information in the commit message? Kind regards, Paul
Re: [PATCH 1/1] drm/amdgpu: Flush TLB after mapping for VG20+XGMI
Dear Philip, Thank you for your patch. Am 01.04.22 um 21:57 schrieb Philip Yang: For VG20 + XGMI bridge, all mappings PTEs cache in TC, this may have stall invalid PTEs in TC because one cache line has 8 pages. Need always Can you please rephrase. “may have stall …” is really hard to understand. flush_tlb after updating mapping. Maybe: So, always flush_tlb after updating the mapping. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index f0aec04111a3..687c9a140645 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -837,6 +837,12 @@ int amdgpu_vm_update_range(struct amdgpu_device *adev, struct amdgpu_vm *vm, goto error_unlock; } + /* Vega20+XGMI where PTEs get inadvertently cached in L2 texture cache, +* heavy-weight flush TLB unconditionally. +*/ + flush_tlb |= (adev->gmc.xgmi.num_physical_nodes && + adev->ip_versions[GC_HWIP][0] == IP_VERSION(9, 4, 0)); + memset(&params, 0, sizeof(params)); params.adev = adev; params.vm = vm; Did you do any performance measurement, if that flushing affects anything? Kind regards, Paul
Regression: No signal when loading amdgpu, and system lockup (was: [PATCH V4 17/17] drm/amd/pm: unified lock protections in amdgpu_dpm.c)
#regzbot introduced: 3712e7a494596b26861f4dc9b81676d1d0272eaf #regzbot title: No signal when loading amdgpu, and system lockup #regzbot monitor: https://gitlab.freedesktop.org/drm/amd/-/issues/1957 Dear Arthur, Thank you for your message. Too bad you didn’t update the subject, and didn’t include regressi...@lists.linux.dev and notify regzbot [1] about it. (It’s understandable, as you might be unfamiliar with the processes, but I would have expected at least Even to do.) So I also spent quite some time on bisecting, but reached the same conclusion. Am 31.03.22 um 04:28 schrieb Arthur Marsh: Hi, I have a Cape Verde GPU card in my pc and after git bisecting a situation where, at the time of the amdgpu module, the monitor would lose signal and the pc locked up so that it only responded to a magic sysreq boot (with no logging due to it happening before the root filesystem was writeable), the above commit was identified as the culprit. The GPU card is a Gigabyte R7 250 with pci-id 1002:682b (rev 87). With the 5.17.0 kernel and a kernel command line of: amdgpu.audio=1 amdgpu.si_support=1 the following dmesg output was received: [ 76.118991] [drm] amdgpu kernel modesetting enabled. [ 76.119100] amdgpu :01:00.0: vgaarb: deactivate vga console [ 76.120004] Console: switching to colour dummy device 80x25 [ 76.120203] [drm] initializing kernel modesetting (VERDE 0x1002:0x682B 0x1458:0x22CA 0x87). 
[ 76.120211] amdgpu :01:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported [ 76.120235] [drm] register mmio base: 0xFE8C [ 76.120238] [drm] register mmio size: 262144 [ 76.120245] [drm] add ip block number 0 [ 76.120248] [drm] add ip block number 1 [ 76.120251] [drm] add ip block number 2 [ 76.120253] [drm] add ip block number 3 [ 76.120256] [drm] add ip block number 4 [ 76.120258] [drm] add ip block number 5 [ 76.120261] [drm] add ip block number 6 [ 76.120264] [drm] add ip block number 7 [ 76.163659] [drm] BIOS signature incorrect 5b 7 [ 76.163669] resource sanity check: requesting [mem 0x000c-0x000d], which spans more than PCI Bus :00 [mem 0x000d-0x000d window] [ 76.163677] caller pci_map_rom+0x68/0x1b0 mapping multiple BARs [ 76.163691] amdgpu :01:00.0: No more image in the PCI ROM [ 76.164996] amdgpu :01:00.0: amdgpu: Fetched VBIOS from ROM BAR [ 76.165001] amdgpu: ATOM BIOS: xxx-xxx-xxx [ 76.165018] amdgpu :01:00.0: amdgpu: PCIE atomic ops is not supported [ 76.165270] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit [ 76.349679] amdgpu :01:00.0: amdgpu: VRAM: 2048M 0x00F4 - 0x00F47FFF (2048M used) [ 76.349716] amdgpu :01:00.0: amdgpu: GART: 1024M 0x00FF - 0x00FF3FFF [ 76.349753] [drm] Detected VRAM RAM=2048M, BAR=256M [ 76.349764] [drm] RAM width 128bits DDR3 [ 76.349940] [drm] amdgpu: 2048M of VRAM memory ready [ 76.349953] [drm] amdgpu: 3072M of GTT memory ready. [ 76.349992] [drm] GART: num cpu pages 262144, num gpu pages 262144 [ 76.350506] amdgpu :01:00.0: amdgpu: PCIE GART of 1024M enabled (table at 0x00F40090). 
[ 76.495343] [drm] Internal thermal controller with fan control [ 76.495391] [drm] amdgpu: dpm initialized [ 76.495637] [drm] AMDGPU Display Connectors [ 76.495647] [drm] Connector 0: [ 76.495655] [drm] HDMI-A-1 [ 76.495662] [drm] HPD1 [ 76.495668] [drm] DDC: 0x194c 0x194c 0x194d 0x194d 0x194e 0x194e 0x194f 0x194f [ 76.495685] [drm] Encoders: [ 76.495691] [drm] DFP1: INTERNAL_UNIPHY [ 76.495699] [drm] Connector 1: [ 76.495706] [drm] DVI-D-1 [ 76.495712] [drm] HPD2 [ 76.495718] [drm] DDC: 0x1950 0x1950 0x1951 0x1951 0x1952 0x1952 0x1953 0x1953 [ 76.495733] [drm] Encoders: [ 76.495739] [drm] DFP2: INTERNAL_UNIPHY [ 76.495746] [drm] Connector 2: [ 76.495753] [drm] VGA-1 [ 76.495758] [drm] DDC: 0x1970 0x1970 0x1971 0x1971 0x1972 0x1972 0x1973 0x1973 [ 76.495773] [drm] Encoders: [ 76.495779] [drm] CRT1: INTERNAL_KLDSCP_DAC1 [ 76.599604] [drm] Found UVD firmware Version: 64.0 Family ID: 13 [ 76.603443] [drm] PCIE gen 2 link speeds already enabled [ 77.149564] [drm] UVD initialized successfully. [ 77.149578] amdgpu :01:00.0: amdgpu: SE 1, SH per SE 2, CU per SH 5, active_cu_number 8 [ 77.456492] RTL8211B Gigabit Ethernet r8169-0-300:00: attached PHY driver (mii_bus:phy_addr=r8169-0-300:00, irq=MAC) [ 77.486245] [drm] Initialized amdgpu 3.44.0 20150101 for :01:00.0 on minor 0 [ 77.521555] r8169 :03:00.0 eth0: Link is Down [ 77.547158] fbcon: amdgpudrmfb (fb0) is primary device [ 77.591226] Console: switching to colour frame buffer device 240x67 [ 77.600296] amdgpu :01:00.0: [drm] fb0: amdgpudrmfb frame buffer device I can supply extra details but found no logging from the sessions that experienced the lock-up. I had created issue *Worker [210] processing
Re: [Bug][5.18-rc0] Between commits ed4643521e6a and 34af78c4e616, appears warning "WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120" and some ga
Dear Christian, Am 04.04.22 um 08:30 schrieb Christian König: those are two independent and already known problems. The warning triggered from the sync_file is already fixed in drm-misc-next-fixes, but so far I couldn't figure out why the games suddenly don't work any more. There is a bug report for that, but bisecting the changes didn't yield anything valuable so far. So if you can come up with something, that would be rather valuable. It’d be great, if you (or somebody else) could provide the URL to that issue. Kind regards, Paul
Re: [PATCH] drm/amdgpu/vcn: remove Unneeded semicolon
Dear Haowen, Thank you for your patch. Am 31.03.22 um 07:56 schrieb Haowen Bai: In the commit message summary, please use: Remove unneeded semicolon report by coccicheck: drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c:1951:2-3: Unneeded semicolon fixed c543dcb ("drm/amdgpu/vcn: Add VCN ras error query support") Please use Fixes: … and a commit hash length of 12 characters. (`scripts/checkpatch.pl …` should tell you about this.) Kind regards, Paul Signed-off-by: Haowen Bai --- drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c index 3e1de8c..17d44be 100644 --- a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c @@ -1948,7 +1948,7 @@ static uint32_t vcn_v2_6_query_poison_by_instance(struct amdgpu_device *adev, break; default: break; - }; + } if (poison_stat) dev_info(adev->dev, "Poison detected in VCN%d, sub_block%d\n",
Re: [PATCH] drm/amdgpu: don't use BACO for reset in S3
Dear Alex, Thank you for your patch. Am 31.03.22 um 16:56 schrieb Alex Deucher: Seems to cause reboots or hangs on some systems. It’d be great if you listed the systems from the bug reports. Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1924 Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1953 Fixes: daf8de0874ab5b ("drm/amdgpu: always reset the asic in suspend (v2)") Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c index c73fb73e9628..1ff6c42fb110 100644 --- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c +++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c @@ -173,6 +173,9 @@ bool amdgpu_dpm_is_baco_supported(struct amdgpu_device *adev) if (!pp_funcs || !pp_funcs->get_asic_baco_capability) return false; + /* don't use baco for reset in S3 */ In issue #1924, I think you also write, it’s a workaround. Maybe make that clear in the comment? + if (adev->in_s3) + return false; mutex_lock(&adev->pm.mutex); Kind regards, Paul
Re: [PATCH 1/1] drm: add PSR2 support and capability definition as per eDP 1.5
Dear David, Thank you for your patch. Am 31.03.22 um 19:26 schrieb David Zhang: [why & how] In eDP 1.5 spec, some new DPCD bit fileds are defined for PSR-SU support. You could be specific by using the exact number two. Maybe: As per eDP 1.5 specification, add the two DPCD bit fields below for PSR-SU support: 1. DP_PSR2_WITH_Y_COORD_ET_SUPPORTED 2. DP_PSR2_SU_AUX_FRAME_SYNC_NOT_NEEDED Kind regards, Paul Signed-off-by: David Zhang --- include/drm/drm_dp_helper.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/drm/drm_dp_helper.h b/include/drm/drm_dp_helper.h index 30359e434c3f..ac7b7571ae66 100644 --- a/include/drm/drm_dp_helper.h +++ b/include/drm/drm_dp_helper.h @@ -361,6 +361,7 @@ struct drm_panel; # define DP_PSR_IS_SUPPORTED1 # define DP_PSR2_IS_SUPPORTED 2 /* eDP 1.4 */ # define DP_PSR2_WITH_Y_COORD_IS_SUPPORTED 3 /* eDP 1.4a */ +# define DP_PSR2_WITH_Y_COORD_ET_SUPPORTED 4 /* eDP 1.5, adopted eDP 1.4b SCR */ #define DP_PSR_CAPS 0x071 /* XXX 1.2? */ # define DP_PSR_NO_TRAIN_ON_EXIT1 @@ -375,6 +376,7 @@ struct drm_panel; # define DP_PSR_SETUP_TIME_SHIFT1 # define DP_PSR2_SU_Y_COORDINATE_REQUIRED (1 << 4) /* eDP 1.4a */ # define DP_PSR2_SU_GRANULARITY_REQUIRED(1 << 5) /* eDP 1.4b */ +# define DP_PSR2_SU_AUX_FRAME_SYNC_NOT_NEEDED (1 << 6)/* eDP 1.5, adopted eDP 1.4b SCR */ #define DP_PSR2_SU_X_GRANULARITY 0x072 /* eDP 1.4b */ #define DP_PSR2_SU_Y_GRANULARITY 0x074 /* eDP 1.4b */
Re: [PATCH V2] drm/amdgpu: fix incorrect GCR_GENERAL_CNTL address
Dear Ruili, Am 29.03.22 um 11:17 schrieb Ji, Ruili: This is not related to any issue. I didn’t mean an issue (where I’d use Resolves to differentiate the two cases), but the commit introducing the incorrect address. Kind regards, Paul PS: Please use interleaved style when replying instead of top-posting.
Re: [PATCH v2] drm/amdgpu: fix that issue that the number of the crtc of the 3250c is not correct
[Cc: Remove undeliverable Chun Ming Zhou <mailto:david1.z...@amd.com>] Am 30.03.22 um 08:34 schrieb Paul Menzel: Dear Tsung-Hua, Thank you for your patch. Am 30.03.22 um 04:46 schrieb Ryan Lin: The commit message summary is quite long and confusing. Maybe: Use 3 CRTC for 3250c to get internal display working [Why] External displays take priority over internal display when there are fewer display controllers than displays. This causes the internal display to not work on the Chromebook google/zork. [How] The root cause is because of that number of the crtc is not correct. The root cause is the incorrect number of four configured CRTCs. The number of the crtc on the 3250c is 3, but on the 3500c is 4. On the source code, we can see that number of the crtc has been fixed at 4. Needs to set the num_crtc to 3 for 3250c platform. Please do not wrap lines after each sentence, and use a text width of 75 characters. v2: - remove unnecessary comments and Id Signed-off-by: Ryan Lin --- drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index 40c91b448f7da..455a2c45e8cda 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -2738,9 +2738,15 @@ static int dm_early_init(void *handle) break; #if defined(CONFIG_DRM_AMD_DC_DCN1_0) case CHIP_RAVEN: - adev->mode_info.num_crtc = 4; - adev->mode_info.num_hpd = 4; - adev->mode_info.num_dig = 4; + if (adev->rev_id >= 8) { Is there some define for that number? Maybe add a comment, that it’s for 3250c? Kind regards, Paul + adev->mode_info.num_crtc = 3; + adev->mode_info.num_hpd = 3; + adev->mode_info.num_dig = 3; + } else { + adev->mode_info.num_crtc = 4; + adev->mode_info.num_hpd = 4; + adev->mode_info.num_dig = 4; + } break; #endif #if defined(CONFIG_DRM_AMD_DC_DCN2_0)
Re: [PATCH] drm/amd/display: Fix by adding FPU protection for dcn30_internal_validate_bw
Dear Chandan, Am 29.03.22 um 10:29 schrieb CHANDAN VURDIGERE NATARAJ: Is it common to spell your name all uppercase? If not, please use Chandan Vurdigere Nataraj. [WHY] The [] already emphasize the word, so Why could be used. Below general protection fault observed when WebGL Aquarium is run for longer duration. If drm debug logs are enabled and set to 0x1f then the In what browser and what version? issue is observed within 10 minutes of run. Were you able to reproduce it without drm debug logs? [ 100.717056] general protection fault, probably for non-canonical address 0x2d33302d32323032: [#1] PREEMPT SMP NOPTI [ 100.727921] CPU: 3 PID: 1906 Comm: DrmThread Tainted: GW 5.15.30 #12 d726c6a2d6ebe5cf9223931cbca6892f916fe18b [ 100.754419] RIP: 0010:CalculateSwathWidth+0x1f7/0x44f [ 100.767109] Code: 00 00 00 f2 42 0f 11 04 f0 48 8b 85 88 00 00 00 f2 42 0f 10 04 f0 48 8b 85 98 00 00 00 f2 42 0f 11 04 f0 48 8b 45 10 0f 57 c0 42 0f 2a 04 b0 0f 57 c9 f3 43 0f 2a 0c b4 e8 8c e2 f3 ff 48 8b [ 100.781269] RSP: 0018:a9230079eeb0 EFLAGS: 00010246 [ 100.812528] RAX: 2d33302d32323032 RBX: 0500 RCX: [ 100.819656] RDX: 0001 RSI: 99deb712c49c RDI: [ 100.826781] RBP: a9230079ef50 R08: 99deb712460c R09: 99deb712462c [ 100.833907] R10: 99deb7124940 R11: 99deb7124d70 R12: 99deb712ae44 [ 100.841033] R13: 0001 R14: R15: a9230079f0a0 [ 100.848159] FS: 7af121212640() GS:99deba78() knlGS: [ 100.856240] CS: 0010 DS: ES: CR0: 80050033 [ 100.861980] CR2: 209000fe1000 CR3: 00011b18c000 CR4: 00350ee0 [ 100.869106] Call Trace: [ 100.871555] [ 100.873655] ? asm_sysvec_reschedule_ipi+0x12/0x20 [ 100.878449] CalculateSwathAndDETConfiguration+0x1a3/0x6dd [ 100.883937] dml31_ModeSupportAndSystemConfigurationFull+0x2ce4/0x76da [ 100.890467] ? kallsyms_lookup_buildid+0xc8/0x163 [ 100.895173] ? kallsyms_lookup_buildid+0xc8/0x163 [ 100.899874] ? __sprint_symbol+0x80/0x135 [ 100.903883] ? dm_update_plane_state+0x3f9/0x4d2 [ 100.908500] ? symbol_string+0xb7/0xde [ 100.912250] ?
number+0x145/0x29b [ 100.915566] ? vsnprintf+0x341/0x5ff [ 100.919141] ? desc_read_finalized_seq+0x39/0x87 [ 100.923755] ? update_load_avg+0x1b9/0x607 [ 100.927849] ? compute_mst_dsc_configs_for_state+0x7d/0xd5b [ 100.933416] ? fetch_pipe_params+0xa4d/0xd0c [ 100.937686] ? dc_fpu_end+0x3d/0xa8 [ 100.941175] dml_get_voltage_level+0x16b/0x180 [ 100.945619] dcn30_internal_validate_bw+0x10e/0x89b [ 100.950495] ? dcn31_validate_bandwidth+0x68/0x1fc [ 100.955285] ? resource_build_scaling_params+0x98b/0xb8c [ 100.960595] ? dcn31_validate_bandwidth+0x68/0x1fc [ 100.965384] dcn31_validate_bandwidth+0x9a/0x1fc [ 100.970001] dc_validate_global_state+0x238/0x295 [ 100.974703] amdgpu_dm_atomic_check+0x9c1/0xbce [ 100.979235] ? _printk+0x59/0x73 [ 100.982467] drm_atomic_check_only+0x403/0x78b [ 100.986912] drm_mode_atomic_ioctl+0x49b/0x546 [ 100.991358] ? drm_ioctl+0x1c1/0x3b3 [ 100.994936] ? drm_atomic_set_property+0x92a/0x92a [ 100.999725] drm_ioctl_kernel+0xdc/0x149 [ 101.003648] drm_ioctl+0x27f/0x3b3 [ 101.007051] ? drm_atomic_set_property+0x92a/0x92a [ 101.011842] amdgpu_drm_ioctl+0x49/0x7d [ 101.015679] __se_sys_ioctl+0x7c/0xb8 [ 101.015685] do_syscall_64+0x5f/0xb8 [ 101.015690] ? __irq_exit_rcu+0x34/0x96 [HOW] It calles populate_dml_pipes which uses doubles to initialize. calls Excuse my ignorance. So using doubles causes a context switch? Adding FPU protection avoids context switch and probable loss of vba context as there is potential contention while drm debug logs are enabled. 
Signed-off-by: CHANDAN VURDIGERE NATARAJ diff --git a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c index 826970f2bd0a..f27262417abe 100644 --- a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c +++ b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c @@ -1750,7 +1750,9 @@ bool dcn31_validate_bandwidth(struct dc *dc, BW_VAL_TRACE_COUNT(); + DC_FP_START(); out = dcn30_internal_validate_bw(dc, context, pipes, &pipe_cnt, &vlevel, fast_validate); + DC_FP_END(); // Disable fast_validate to set min dcfclk in calculate_wm_and_dlg if (pipe_cnt == 0) Kind regards, Paul
Re: [PATCH] drm/amdgpu: Support AMDGPU RAS debugfs poll interface
Dear Yi Peng, Thank you for the patch. Two nits regarding the commit message. Am 29.03.22 um 09:38 schrieb yipechai: Some AMDGPU RAS debugfs operations like UE injection can cause gpu reset. Before doing the next debugfs operation, the application should call poll to check if the gpu has finished recovering. Please use a text width of 75 characters per line for the commit message body. Signed-off-by: yipechai Could you please configure git (and your email program) to use your full name? $ git config --global user.name "Yi Peng Chai" # no idea if correct […] Kind regards, Paul
Re: [PATCH V2] drm/amdgpu: fix incorrect GCR_GENERAL_CNTL address
Dear Ruili, Thank you for your patch. Am 28.03.22 um 06:58 schrieb Ji, Ruili: From: Ruili Ji gfx10.3.3/gfx10.3.6/gfx10.3.7 shall use 0x1580 address for GCR_GENERAL_CNTL Is any “user-visible“ problem fixed by this? Please add a Fixes tag. Kind regards, Paul Signed-off-by: Ruili Ji --- drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c index 99df18ae7316..e4c9d92ac381 100644 --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c @@ -3300,7 +3300,7 @@ static const struct soc15_reg_golden golden_settings_gc_10_3_3[] = SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG3, 0x, 0x0280), SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG4, 0x, 0x0080), SOC15_REG_GOLDEN_VALUE(GC, 0, mmGB_ADDR_CONFIG, 0x0c1807ff, 0x0242), - SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL, 0x1ff1, 0x0500), + SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL_Vangogh, 0x1ff1, 0x0500), SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL1_PIPE_STEER, 0x00ff, 0x00e4), SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_0, 0x, 0x32103210), SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_1, 0x, 0x32103210), @@ -3436,7 +3436,7 @@ static const struct soc15_reg_golden golden_settings_gc_10_3_6[] = SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG3, 0x, 0x0280), SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG4, 0x, 0x0080), SOC15_REG_GOLDEN_VALUE(GC, 0, mmGB_ADDR_CONFIG, 0x0c1807ff, 0x0042), - SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL, 0x1ff1, 0x0500), + SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL_Vangogh, 0x1ff1, 0x0500), SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL1_PIPE_STEER, 0x00ff, 0x0044), SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_0, 0x, 0x32103210), SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_1, 0x, 0x32103210), @@ -3461,7 +3461,7 @@ static const struct soc15_reg_golden golden_settings_gc_10_3_7[] = { SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG3, 0x, 0x0280), 
SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG4, 0x, 0x0080), SOC15_REG_GOLDEN_VALUE(GC, 0, mmGB_ADDR_CONFIG, 0x0c1807ff, 0x0041), - SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL, 0x1ff1, 0x0500), + SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL_Vangogh, 0x1ff1, 0x0500), SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL1_PIPE_STEER, 0x00ff, 0x00e4), SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_0, 0x, 0x32103210), SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_1, 0x, 0x32103210),
Re: 回复: Re: [PATCH] drm/amdgpu: resolve s3 hang for r7340
Dear 李真, [Your mailer formatted the message oddly. Maybe configure it to use only plain text email with no HTML parts – common in Linux kernel community –, or, if not possible, switch to something else (Mozilla Thunderbird, …).] Am 29.03.22 um 04:54 schrieb 李真能: […] *Date:* 2022-03-28 15:38 *From:* Paul Menzel […] Am 28.03.22 um 09:36 schrieb Paul Menzel: > Dear Zhenneng, > > > Thank you for your patch. > > Am 28.03.22 um 06:05 schrieb Zhenneng Li: >> This is a workaround for s3 hang for r7340(amdgpu). > > Is it hanging when resuming from S3? Yes, this func is a delayed work after init graphics card. Thank you for clarifying it. > Maybe also use the line below for > the commit message summary: > > drm/amdgpu: Add 1 ms delay to init handler to fix s3 resume hang > > Also, please add a space before the ( in “r7340(amdgpu)”. > >> When we test s3 with r7340 on arm64 platform, graphics card will hang up, >> the error message are as follows: >> Mar 4 01:14:11 greatwall-GW-XX-XXX kernel: [1.599374][ 7] [ T291] amdgpu :02:00.0: fb0: amdgpudrmfb frame buffer device >> Mar 4 01:14:11 greatwall-GW-XX-XXX kernel: [1.612869][ 7] [ T291] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP blockfailed -22 >> Mar 4 01:14:11 greatwall-GW-XX-XXX kernel: [1.623392][ 7] [ T291] amdgpu :02:00.0: amdgpu_device_ip_late_init failed >> Mar 4 01:14:11 greatwall-GW-XX-XXX kernel: [1.630696][ 7] [ T291] amdgpu :02:00.0: Fatal error during GPU init >> Mar 4 01:14:11 greatwall-GW-XX-XXX kernel: [1.637477][ 7] [ T291] [drm] amdgpu: finishing device. > > The prefix in the beginning is not really needed. Only the stuff after > `kernel: `. > > Maybe also add the output of `lspci -nn -s …` for that r7340 device. > >> Change-Id: I5048b3894c0ca9faf2f4847ddab61f9eb17b4823 > > Without the Gerrit instance this belongs to, the Change-Id is of no use > in the public. 
> >> Signed-off-by: Zhenneng Li >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> index 3987ecb24ef4..1eced991b5b2 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> @@ -2903,6 +2903,8 @@ static void >> amdgpu_device_delayed_init_work_handler(struct work_struct *work) >> container_of(work, struct amdgpu_device, delayed_init_work.work); >> int r; >> +mdelay(1); >> + > > Wow, I wonder how long it took you to find that workaround. About 3 months, I try to add this delay work(amdgpu_device_delayed_init_work_handler) from 2000ms to 2500ms, or use mb() instead of mdelay(1), but it's useless, I don't know the reason,the occurrence probability of this bug is one ten-thousandth, do you know the possible reasons? Oh, it’s not even always reproducible. That is hard. Did you try another graphics card or another ARM board to rule out hardware specific issues? Sorry, I do not. Maybe the developers with access to non-public datasheets and erratas know. >> r = amdgpu_ib_ring_tests(adev); >> if (r) >> DRM_ERROR("ib ring test failed (%d).\n", r); Kind regards, Paul
Re: [PATCH] amd/display: set backlight only if required
Dear Shirish, Thank you for the patch. Am 11.03.22 um 16:33 schrieb Shirish S: [Why] comparing pwm bl values (coverted) with user brightness(converted) 1. co*n*verted 2. Please add a space before the (. levels in commit_tail leads to continuous setting of backlight via dmub as they don't to match. Maybe add a blank line between paragraphs. This leads overdrive in queuing of commands to DMCU that sometimes lead What is “overdrive in queuing”? lead*s* to depending on load on DMCU fw: Leads to what? Words missing after *to*. "[drm:dc_dmub_srv_wait_idle] *ERROR* Error waiting for DMUB idle: status=3" You could also add the example from your discussion with Harry. [How] Store last successfully set backlight value and compare with it instead of pwm reads which is not what we should compare with. Maybe extend it with the information gotten from Harry, that this is expected, when ABM is enabled. Kind regards, Paul Signed-off-by: Shirish S --- drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 7 --- drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h | 6 ++ 2 files changed, 10 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index df0980ff9a63..2b8337e47861 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -3972,7 +3972,7 @@ static u32 convert_brightness_to_user(const struct amdgpu_dm_backlight_caps *cap max - min); } -static int amdgpu_dm_backlight_set_level(struct amdgpu_display_manager *dm, +static void amdgpu_dm_backlight_set_level(struct amdgpu_display_manager *dm, int bl_idx, u32 user_brightness) { @@ -4003,7 +4003,8 @@ static int amdgpu_dm_backlight_set_level(struct amdgpu_display_manager *dm, DRM_DEBUG("DM: Failed to update backlight on eDP[%d]\n", bl_idx); } - return rc ? 
0 : 1; + if (rc) + dm->actual_brightness[bl_idx] = user_brightness; } static int amdgpu_dm_backlight_update_status(struct backlight_device *bd) @@ -9944,7 +9945,7 @@ static void amdgpu_dm_atomic_commit_tail(struct drm_atomic_state *state) /* restore the backlight level */ for (i = 0; i < dm->num_of_edps; i++) { if (dm->backlight_dev[i] && - (amdgpu_dm_backlight_get_level(dm, i) != dm->brightness[i])) + (dm->actual_brightness[i] != dm->brightness[i])) amdgpu_dm_backlight_set_level(dm, i, dm->brightness[i]); } #endif diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h index 372f9adf091a..321279bc877b 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h @@ -540,6 +540,12 @@ struct amdgpu_display_manager { * cached backlight values. */ u32 brightness[AMDGPU_DM_MAX_NUM_EDP]; + /** +* @actual_brightness: +* +* last successfully applied backlight values. +*/ + u32 actual_brightness[AMDGPU_DM_MAX_NUM_EDP]; }; enum dsc_clock_force_state {
Re: BUG: Bad page state in process systemd-udevd (was: [PATCH v9 bpf-next 1/9] x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP)
Dear Song, Am 28.03.22 um 21:24 schrieb Song Liu: On Mar 27, 2022, at 11:51 PM, Paul Menzel wrote: Am 28.03.22 um 08:37 schrieb Song Liu: […] On Mar 27, 2022, at 3:36 AM, Paul Menzel wrote: Am 26.03.22 um 19:46 schrieb Paul Menzel: #regzbot introduced: fac54e2bfb5be2b0bbf115fe80d45f59fd773048 #regzbot title: BUG: Bad page state in process systemd-udevd Am 04.02.22 um 19:57 schrieb Song Liu: From: Song Liu This enables module_alloc() to allocate huge page for 2MB+ requests. To check the difference of this change, we need enable config CONFIG_PTDUMP_DEBUGFS, and call module_alloc(2MB). Before the change, /sys/kernel/debug/page_tables/kernel shows pte for this map. With the change, /sys/kernel/debug/page_tables/ show pmd for thie map. Signed-off-by: Song Liu --- arch/x86/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 6fddb63271d9..e0e0d00cf103 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -159,6 +159,7 @@ config X86 select HAVE_ALIGNED_STRUCT_PAGEif SLUB select HAVE_ARCH_AUDITSYSCALL select HAVE_ARCH_HUGE_VMAPif X86_64 || X86_PAE +select HAVE_ARCH_HUGE_VMALLOCif HAVE_ARCH_HUGE_VMAP select HAVE_ARCH_JUMP_LABEL select HAVE_ARCH_JUMP_LABEL_RELATIVE select HAVE_ARCH_KASANif X86_64 Testing Linus’ current master branch, Linux logs critical messages like below: BUG: Bad page state in process systemd-udevd pfn:102e03 I bisected to your commit fac54e2bfb5 (x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP). Sorry, I forget to mention, that this is a 32-bit (i686) userspace, but a 64-bit Linux kernel, so it might be the same issue as mentioned in commit eed1fcee556f (x86: Disable HAVE_ARCH_HUGE_VMALLOC on 32-bit x86), but didn’t fix the issue for 64-bit Linux kernel and 32-bit userspace. I will look more into this tomorrow. To clarify, what is the 32-bit user space that triggers this? Is it systemd-udevd? Is the systemd also i686? Yes, everything – also systemd – is i686. 
You can build a 32-bit VM image with grml-debootstrap [1]: sudo DEBOOTSTRAP=mmdebstrap ~/src/grml-debootstrap/grml-debootstrap --vm --vmfile --vmsize 3G --target /dev/shm/debian-32.img -r sid --arch i686 --filesystem ext4 Then run that with QEMU, but pass the 64-bit Linux kernel to QEMU directly with the switches `-kernel` and `-append`, or install the amd64 Linux kernel into the Debian VM image or the package created with `make bindeb-pkg` with `dpkg -i …`. Thanks for these information! I tried the following, but couldn't reproduce the issue. sudo ./grml-debootstrap --vm --vmfile --vmsize 3G --target ../debian-32.img -r sid --arch i386 --filesystem ext4 Note: s/i686/i386/. Also I run this on Fedora, so I didn't specify DEBOOTSTRAP. Then I run it with qemu-system-x86_64 \ -boot d ./debian-32.img -m 1024 -smp 4 \ -kernel ./bzImage \ -nographic -append 'root=/dev/sda1 ro console=ttyS0,115200' The VM boots fine. The config being used is x86_64_defconfig + CONFIG_DRM_FBDEV_EMULATION. I wonder whether this is caused by different config or different image. Could you please share your config? Sorry, for leading you on the wrong path. I actually just wanted to help getting a 32-bit userspace set up quickly. I haven’t tried reproducing the issue in a VM, and used only the ASUS F2A85-M PRO. Booting the system with `nomodeset`, I didn’t see the error. No idea if it’s related to framebuffer handling or specific to AMD graphics device. PS: I couldn't figure out the root password of the image, --password option of grml-debootstrap doesn't seem to work. Hmm, I thought it’s asking you during install, but I haven’t done it in a while. Kind regards, Paul
Re: [PATCH v4 5/6] drm/amdgpu/vcn: VCN ras error query support
Dear Mohammad, Am 28.03.22 um 11:49 schrieb Ziya, Mohammad zafar: -Original Message- From: Paul Menzel Sent: Monday, March 28, 2022 3:08 PM Am 28.03.22 um 10:47 schrieb Ziya, Mohammad zafar: […] -Original Message- From: Paul Menzel Sent: Monday, March 28, 2022 1:39 PM Am 28.03.22 um 10:00 schrieb Ziya, Mohammad zafar: […] From: Paul Menzel Sent: Monday, March 28, 2022 1:22 PM Am 28.03.22 um 09:43 schrieb Zhou1, Tao: -Original Message- From: Ziya, Mohammad zafar Sent: Monday, March 28, 2022 2:25 PM […] +static uint32_t vcn_v2_6_query_poison_by_instance(struct amdgpu_device *adev, + uint32_t instance, uint32_t sub_block) { + uint32_t poison_stat = 0, reg_value = 0; + + switch (sub_block) { + case AMDGPU_VCN_V2_6_VCPU_VCODEC: + reg_value = RREG32_SOC15(VCN, instance, mmUVD_RAS_VCPU_VCODEC_STATUS); + poison_stat = REG_GET_FIELD(reg_value, UVD_RAS_VCPU_VCODEC_STATUS, POISONED_PF); + break; + default: + break; + }; + + if (poison_stat) + dev_info(adev->dev, "Poison detected in VCN%d, sub_block%d\n", + instance, sub_block); What should a user do with that information? Faulty hardware, …? [Mohammad]: This message will help to identify the faulty hardware, the hardware ID will also log along with poison, help to identify among multiple hardware installed on the system. Thank you for clarifying. If it’s indeed faulty hardware, should the log level be increased to be an error? Keep in mind, that normal ignorant users (like me) are reading the message, and it’d be great to guide them a little. They do not know what “Poison“ means I guess. Maybe: A hardware corruption was found indicating the device might be faulty. (Poison detected in VCN%d, sub_block%d)\n (Keep in mind, I do not know anything about RAS.) [Mohammad]: It is an error condition, but this is just an information message which could have been ignored as well because VCN just consumed the poison, not created. 
Sorry, I have never seen these message in `dmesg`, so could you give an example log please, what the user would see? [Mohammad]: [ 231.181316] amdgpu :8a:00.0: amdgpu: Poison detected in VCN0, sub_block0 Sample message from amdgpu " [ 237.013029] amdgpu :8a:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available " Hmm, that is six seconds later, so, if Linux logs other stuff in between, no idea if the connection will be made. Both messages read like debug message, with normal users not having a clue what to do. Can that be improved by rewording them? Kind regards, Paul
Re: [PATCH v4 5/6] drm/amdgpu/vcn: VCN ras error query support
Dear Mohammad, Am 28.03.22 um 10:00 schrieb Ziya, Mohammad zafar: […] From: Paul Menzel Sent: Monday, March 28, 2022 1:22 PM […] Tao, it’s hard to find your reply in the quote, as the message is not quoted correctly (> prepended). Is it possible to use a different email client? Am 28.03.22 um 09:43 schrieb Zhou1, Tao: -Original Message- From: Ziya, Mohammad zafar Sent: Monday, March 28, 2022 2:25 PM […] +static uint32_t vcn_v2_6_query_poison_by_instance(struct amdgpu_device *adev, + uint32_t instance, uint32_t sub_block) { + uint32_t poison_stat = 0, reg_value = 0; + + switch (sub_block) { + case AMDGPU_VCN_V2_6_VCPU_VCODEC: + reg_value = RREG32_SOC15(VCN, instance, mmUVD_RAS_VCPU_VCODEC_STATUS); + poison_stat = REG_GET_FIELD(reg_value, UVD_RAS_VCPU_VCODEC_STATUS, POISONED_PF); + break; + default: + break; + }; + + if (poison_stat) + dev_info(adev->dev, "Poison detected in VCN%d, sub_block%d\n", + instance, sub_block); What should a user do with that information? Faulty hardware, …? [Mohammad]: This message will help to identify the faulty hardware, the hardware ID will also log along with poison, help to identify among multiple hardware installed on the system. Thank you for clarifying. If it’s indeed faulty hardware, should the log level be increased to be an error? Keep in mind, that normal ignorant users (like me) are reading the message, and it’d be great to guide them a little. They do not know what “Poison“ means I guess. Maybe: A hardware corruption was found indicating the device might be faulty. (Poison detected in VCN%d, sub_block%d)\n (Keep in mind, I do not know anything about RAS.) + + return poison_stat; +} + +static bool vcn_v2_6_query_poison_status(struct amdgpu_device *adev) { + uint32_t inst, sub; + uint32_t poison_stat = 0; + + for (inst = 0; inst < adev->vcn.num_vcn_inst; inst++) + for (sub = 0; sub < AMDGPU_VCN_V2_6_MAX_SUB_BLOCK; sub++) + poison_stat += + vcn_v2_6_query_poison_by_instance(adev, inst, sub); + + return poison_stat ? 
true : false; [Tao] just want to confirm the logic, if the POISONED_PF of one instance is 1 and another is 0, the poison_stat is true, correct? Can we add a print message for this case? Same question to JPEG. [Mohammad]: Message will be printed on function block ahead of the function. Is the `dev_info` message in `vcn_v2_6_query_poison_by_instance()` doing what you want? Also, instead of `poison_stat ? true : false;` a common pattern is `!!poison_stat` I believe. […] [Mohammad]: Noted the change. Will make to return !!poison_stat ? true : false; No, just return !!poison_stat; […] Kind regards, Paul
Re: [PATCH v4 5/6] drm/amdgpu/vcn: VCN ras error query support
Dear Mohammad, dear Tao, Tao, it’s hard to find your reply in the quote, as the message is not quoted correctly (> prepended). Is it possible to use a different email client? Am 28.03.22 um 09:43 schrieb Zhou1, Tao: -Original Message- From: Ziya, Mohammad zafar Sent: Monday, March 28, 2022 2:25 PM To: amd-gfx@lists.freedesktop.org Cc: Lazar, Lijo ; Zhou1, Tao ; Zhang, Hawking ; Ziya, Mohammad zafar Subject: [PATCH v4 5/6] drm/amdgpu/vcn: VCN ras error query support RAS error query support addition for VCN 2.6 V2: removed unused option and corrected comment format Moved the register definition under header file Please wrap lines after 75 characters. (`scripts/checkpatch.pl` should have warned you about that). V3: poison query status check added. Removed error query interface V4: MMSCH poison check option removed, return true/false refactored. Signed-off-by: Mohammad Zafar Ziya Reviewed-by: Hawking Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 1 + drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c | 71 + drivers/gpu/drm/amd/amdgpu/vcn_v2_5.h | 6 +++ 3 files changed, 78 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h index 1e1a3b736859..606df8869b89 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h @@ -508,6 +508,7 @@ struct amdgpu_ras_block_hw_ops { void (*query_ras_error_address)(struct amdgpu_device *adev, void *ras_error_status); void (*reset_ras_error_count)(struct amdgpu_device *adev); void (*reset_ras_error_status)(struct amdgpu_device *adev); + bool (*query_poison_status)(struct amdgpu_device *adev); }; /* work flow diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c index 1869bae4104b..3988fc647741 100644 --- a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c @@ -31,6 +31,7 @@ #include "soc15d.h" #include "vcn_v2_0.h" #include "mmsch_v1_0.h" +#include "vcn_v2_5.h" #include 
"vcn/vcn_2_5_offset.h" #include "vcn/vcn_2_5_sh_mask.h" @@ -59,6 +60,7 @@ static int vcn_v2_5_set_powergating_state(void *handle, static int vcn_v2_5_pause_dpg_mode(struct amdgpu_device *adev, int inst_idx, struct dpg_pause_state *new_state); static int vcn_v2_5_sriov_start(struct amdgpu_device *adev); +static void vcn_v2_5_set_ras_funcs(struct amdgpu_device *adev); static int amdgpu_ih_clientid_vcns[] = { SOC15_IH_CLIENTID_VCN, @@ -100,6 +102,7 @@ static int vcn_v2_5_early_init(void *handle) vcn_v2_5_set_dec_ring_funcs(adev); vcn_v2_5_set_enc_ring_funcs(adev); vcn_v2_5_set_irq_funcs(adev); + vcn_v2_5_set_ras_funcs(adev); return 0; } @@ -1930,3 +1933,71 @@ const struct amdgpu_ip_block_version vcn_v2_6_ip_block = .rev = 0, .funcs = _v2_6_ip_funcs, }; + +static uint32_t vcn_v2_6_query_poison_by_instance(struct amdgpu_device *adev, + uint32_t instance, uint32_t sub_block) { + uint32_t poison_stat = 0, reg_value = 0; + + switch (sub_block) { + case AMDGPU_VCN_V2_6_VCPU_VCODEC: + reg_value = RREG32_SOC15(VCN, instance, mmUVD_RAS_VCPU_VCODEC_STATUS); + poison_stat = REG_GET_FIELD(reg_value, UVD_RAS_VCPU_VCODEC_STATUS, POISONED_PF); + break; + default: + break; + }; + + if (poison_stat) + dev_info(adev->dev, "Poison detected in VCN%d, sub_block%d\n", + instance, sub_block); What should a user do with that information? Faulty hardware, …? + + return poison_stat; +} + +static bool vcn_v2_6_query_poison_status(struct amdgpu_device *adev) { + uint32_t inst, sub; + uint32_t poison_stat = 0; + + for (inst = 0; inst < adev->vcn.num_vcn_inst; inst++) + for (sub = 0; sub < AMDGPU_VCN_V2_6_MAX_SUB_BLOCK; sub++) + poison_stat += + vcn_v2_6_query_poison_by_instance(adev, inst, sub); + + return poison_stat ? true : false; [Tao] just want to confirm the logic, if the POISONED_PF of one instance is 1 and another is 0, the poison_stat is true, correct? Can we add a print message for this case? Same question to JPEG. 
Is the `dev_info` message in `vcn_v2_6_query_poison_by_instance()` doing what you want? Also, instead of `poison_stat ? true : false;` a common pattern is `!!poison_stat` I believe. Kind regards, Paul +} + +const struct amdgpu_ras_block_hw_ops vcn_v2_6_ras_hw_ops = { + .query_poison_status = vcn_v2_6_query_poison_status, }; + +static struct amdgpu_vcn_ras vcn_v2_6_ras = { + .ras_block = { + .hw_ops = &vcn_v2_6_ras_hw_ops, + }, +}; + +static void vcn_v2_5_set_ras_funcs(struct amdgpu_device *adev) { + switch
Re: [PATCH] drm/amdgpu: resolve s3 hang for r7340
[Cc: -Jack Zhang (invalid address)] Am 28.03.22 um 09:36 schrieb Paul Menzel: Dear Zhenneng, Thank you for your patch. Am 28.03.22 um 06:05 schrieb Zhenneng Li: This is a workaround for s3 hang for r7340(amdgpu). Is it hanging when resuming from S3? Maybe also use the line below for the commit message summary: drm/amdgpu: Add 1 ms delay to init handler to fix s3 resume hang Also, please add a space before the ( in “r7340(amdgpu)”. When we test s3 with r7340 on arm64 platform, graphics card will hang up, the error message are as follows: Mar 4 01:14:11 greatwall-GW-XX-XXX kernel: [ 1.599374][ 7] [ T291] amdgpu 0000:02:00.0: fb0: amdgpudrmfb frame buffer device Mar 4 01:14:11 greatwall-GW-XX-XXX kernel: [ 1.612869][ 7] [ T291] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block failed -22 Mar 4 01:14:11 greatwall-GW-XX-XXX kernel: [ 1.623392][ 7] [ T291] amdgpu 0000:02:00.0: amdgpu_device_ip_late_init failed Mar 4 01:14:11 greatwall-GW-XX-XXX kernel: [ 1.630696][ 7] [ T291] amdgpu 0000:02:00.0: Fatal error during GPU init Mar 4 01:14:11 greatwall-GW-XX-XXX kernel: [ 1.637477][ 7] [ T291] [drm] amdgpu: finishing device. The prefix in the beginning is not really needed. Only the stuff after `kernel: `. Maybe also add the output of `lspci -nn -s …` for that r7340 device. Change-Id: I5048b3894c0ca9faf2f4847ddab61f9eb17b4823 Without the Gerrit instance this belongs to, the Change-Id is of no use in the public. 
Signed-off-by: Zhenneng Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 3987ecb24ef4..1eced991b5b2 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -2903,6 +2903,8 @@ static void amdgpu_device_delayed_init_work_handler(struct work_struct *work) container_of(work, struct amdgpu_device, delayed_init_work.work); int r; + mdelay(1); + Wow, I wonder how long it took you to find that workaround. r = amdgpu_ib_ring_tests(adev); if (r) DRM_ERROR("ib ring test failed (%d).\n", r); Kind regards, Paul
Re: [PATCH] drm/amdgpu: resolve s3 hang for r7340
Dear Zhenneng, Thank you for your patch. Am 28.03.22 um 06:05 schrieb Zhenneng Li: This is a workaround for s3 hang for r7340(amdgpu). Is it hanging when resuming from S3? Maybe also use the line below for the commit message summary: drm/amdgpu: Add 1 ms delay to init handler to fix s3 resume hang Also, please add a space before the ( in “r7340(amdgpu)”. When we test s3 with r7340 on arm64 platform, graphics card will hang up, the error message are as follows: Mar 4 01:14:11 greatwall-GW-XX-XXX kernel: [1.599374][ 7] [ T291] amdgpu 0000:02:00.0: fb0: amdgpudrmfb frame buffer device Mar 4 01:14:11 greatwall-GW-XX-XXX kernel: [1.612869][ 7] [ T291] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block failed -22 Mar 4 01:14:11 greatwall-GW-XX-XXX kernel: [1.623392][ 7] [ T291] amdgpu 0000:02:00.0: amdgpu_device_ip_late_init failed Mar 4 01:14:11 greatwall-GW-XX-XXX kernel: [1.630696][ 7] [ T291] amdgpu 0000:02:00.0: Fatal error during GPU init Mar 4 01:14:11 greatwall-GW-XX-XXX kernel: [1.637477][ 7] [ T291] [drm] amdgpu: finishing device. The prefix in the beginning is not really needed. Only the stuff after `kernel: `. Maybe also add the output of `lspci -nn -s …` for that r7340 device. Change-Id: I5048b3894c0ca9faf2f4847ddab61f9eb17b4823 Without the Gerrit instance this belongs to, the Change-Id is of no use in the public. Signed-off-by: Zhenneng Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 3987ecb24ef4..1eced991b5b2 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -2903,6 +2903,8 @@ static void amdgpu_device_delayed_init_work_handler(struct work_struct *work) container_of(work, struct amdgpu_device, delayed_init_work.work); int r; + mdelay(1); + Wow, I wonder how long it took you to find that workaround. 
r = amdgpu_ib_ring_tests(adev); if (r) DRM_ERROR("ib ring test failed (%d).\n", r); Kind regards, Paul
Re: [PATCH v4 2/6] drm/amdgpu/vcn: Add vcn ras support
Dear Mohammad, Thank you for your patch. Am 28.03.22 um 08:24 schrieb Mohammad Zafar Ziya: VCN block ras feature support addition V2: default ras callback removed Signed-off-by: Mohammad Zafar Ziya Reviewed-by: Hawking Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h index e2fde88aaf5e..ea07974ef6f4 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h @@ -24,6 +24,8 @@ #ifndef __AMDGPU_VCN_H__ #define __AMDGPU_VCN_H__ +#include "amdgpu_ras.h" + #define AMDGPU_VCN_STACK_SIZE (128*1024) #define AMDGPU_VCN_CONTEXT_SIZE (512*1024) @@ -232,6 +234,10 @@ struct amdgpu_vcn_inst { struct amdgpu_vcn_fw_shared fw_shared; }; +struct amdgpu_vcn_ras { + struct amdgpu_ras_block_object ras_block; +}; + struct amdgpu_vcn { unsignedfw_version; struct delayed_work idle_work; @@ -251,6 +257,9 @@ struct amdgpu_vcn { unsignedharvest_config; int (*pause_dpg_mode)(struct amdgpu_device *adev, int inst_idx, struct dpg_pause_state *new_state); + + struct ras_common_if*ras_if; + struct amdgpu_vcn_ras *ras; }; struct amdgpu_fw_shared_rb_ptrs_struct { @@ -339,4 +348,5 @@ void amdgpu_vcn_setup_ucode(struct amdgpu_device *adev); void amdgpu_vcn_fwlog_init(struct amdgpu_vcn_inst *vcn); void amdgpu_debugfs_vcn_fwlog_init(struct amdgpu_device *adev, uint8_t i, struct amdgpu_vcn_inst *vcn); + #endif This hunk looks unrelated. Maybe remove it? Kind regards, Paul
Re: [PATCH v4 0/6] VCN and JPEG RAS poison detection
Dear Mohammad, Am 28.03.22 um 08:24 schrieb Mohammad Zafar Ziya: VCN and JPEG RAS poison detection It’d be great if you extended this a little bit. Especially, how it can be tested. Mohammad Zafar Ziya (6): drm/amdgpu: Add vcn and jpeg ras support flag drm/amdgpu/vcn: Add vcn ras support drm/amdgpu/jpeg: Add jpeg block ras support drm/amdgpu/vcn: vcn and jpeg ver 2.6 ras register definition drm/amdgpu/vcn: VCN ras error query support drm/amdgpu/jpeg: jpeg ras error query support It’d be great if you made the last three commit message summaries also statements (by adding a verb in imperative mood). Kind regards, Paul
Re: [PATCH 01/13] drm/amd/display: HDCP SEND AKI INIT error
Dear Alex, Am 25.03.22 um 21:43 schrieb Hung, Alex: Thanks for your feedback. Thank you for your reply. For the future, it’d be great if you could use interleaved style [1], when replying, or when it’s a new topic, do not cite the previous message. I fixed many errors and typos you highlighted in this series. In cases where modification requires re-testing we or anyone can have follow-up patches in the future. Thank you. Did you resubmit them in *[PATCH 00/16] DC Patches March 25, 2022*? Unfortunately, they are not tagged as v2 and the emails do not contain an information what was changed between v1 and v2. Can you please send them again with the correct tags, and information included? Kind regards, Paul [1]: https://en.wikipedia.org/wiki/Posting_style
Re: Commit messages
Dear Christian, dear Daniel, dear Alex, Am 23.03.22 um 16:32 schrieb Christian König: Am 23.03.22 um 16:24 schrieb Daniel Stone: On Wed, 23 Mar 2022 at 15:14, Alex Deucher wrote: On Wed, Mar 23, 2022 at 11:04 AM Daniel Stone wrote: That's not what anyone's saying here ... No-one's demanding AMD publish RTL, or internal design docs, or hardware specs, or URLs to JIRA tickets no-one can access. This is a large and invasive commit with pretty big ramifications; containing exactly two lines of commit message, one of which just duplicates the subject. It cannot be the case that it's completely impossible to provide any justification, background, or details, about this commit being made. Unless, of course, it's to fix a non-public security issue, that is reasonable justification for eliding some of the details. But then again, 'huge change which is very deliberately opaque' is a really good way to draw a lot of attention to the commit, and it would be better to provide more detail about the change to help it slip under the radar. If dri-devel@ isn't allowed to inquire about patches which are posted, then CCing the list is just a façade; might as well just do it all internally and periodically dump out pull requests. I think we are in agreement. I think the withheld information Christian was referring to was on another thread with Christian and Paul discussing a workaround for a hardware bug: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.spinics.net%2Flists%2Famd-gfx%2Fmsg75908.htmldata=04%7C01%7Cchristian.koenig%40amd.com%7C6a3f2815d83b4872577008da0ce1347a%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637836458652370599%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=QtNB0XHMhTgH%2FNHMwF23Qn%2BgSdYyHJSenbpP%2FHG%2BkxE%3Dreserved=0 (Thank you Microsoft for keeping us safe.) 
I guess it proves, how assuming what other people should know/have read, especially when crossing message threads, is causing confusion and misunderstandings. Right, that definitely seems like some crossed wires. I don't see anything wrong with that commit at all: the commit message and a comment notes that there is a hardware issue preventing Raven from being able to do TMZ+GTT, and the code does the very straightforward and obvious thing to ensure that on VCN 1.0, any TMZ buffer must be VRAM-placed. My questions were: Where is that documented, and how can this be reproduced? Shouldn’t these be answered by the commit message? In five(?) years, somebody, maybe even with access to the currently non-public documents, finds a fault in the commit, and would be helped by having a document/errata number to look at. To verify the fix, the developer would need a method to produce the error, so why not just share it? Also, I assume that workarounds often come with downsides, as otherwise it would have been programmed like this from the beginning, or instead of “workaround” it would be called “improvement”. Shouldn’t that also be answered? So totally made-up example: Currently, there is a graphics corruption running X on system Y. This is caused by a hardware bug in Raven ASIC (details internal document #/AMD-Jira #N), and can be worked around by [what is in the commit message]. The workaround does not affect the performance, and testing X shows the error is fixed. This one, on the other hand, is much less clear ... Yes, completely agree. I mean a good bunch of comments on commit messages are certainly valid and we could improve them. That’d be great. But this patch here was worked on by both AMD and Intel developers. Where both sides and I think even people from other companies perfectly understands why, what, how etc... When now somebody comes along and asks for a whole explanation of the context why we do it then that sounds really strange to me. 
The motivation should be part of the commit message. I didn’t mean anyone to rewrite buddy memory allocator Wikipedia article [1]. But the commit message at hand for switching the allocator is definitely too terse. Kind regards, Paul [1]: https://en.wikipedia.org/wiki/Buddy_memory_allocation
Re: [PATCH V2] drm/amdgpu/vcn3: send smu interface type
Dear Boyuan, Am 25.03.22 um 06:53 schrieb Paul Menzel: Dear Boyuan, Thank for the improved patch. Am 24.03.22 um 18:25 schrieb Zhang, Boyuan: [AMD Official Use Only] No idea if this would confuse `git am`. From: Boyuan Zhang mailto:boyuan.zh...@amd.com>> Your mailer(?) mangled the patch. Did you edit it in your MUA’s compose window? For VCN FW to detect ASIC type, in order to use different mailbox registers. V2: simplify codes and fix format issue. Signed-off-by: Boyuan Zhang mailto:boyuan.zh...@amd.com>> Acked-by Huang Rui --- drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h | 7 +++ drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c | 5 + 2 files changed, 12 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h index e2fde88aaf5e..f06fb7f882e2 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h @@ -159,6 +159,7 @@ #define AMDGPU_VCN_MULTI_QUEUE_FLAG (1 << 8) #define AMDGPU_VCN_SW_RING_FLAG (1 << 9) #define AMDGPU_VCN_FW_LOGGING_FLAG (1 << 10) +#define AMDGPU_VCN_SMU_VERSION_INFO_FLAG (1 << 11) #define AMDGPU_VCN_IB_FLAG_DECODE_BUFFER 0x0001 #define AMDGPU_VCN_CMD_FLAG_MSG_BUFFER 0x0001 @@ -279,6 +280,11 @@ struct amdgpu_fw_shared_fw_logging { uint32_t size; }; +struct amdgpu_fw_shared_smu_interface_info { + uint8_t smu_interface_type; + uint8_t padding[3]; +}; + struct amdgpu_fw_shared { uint32_t present_flag_0; uint8_t pad[44]; @@ -287,6 +293,7 @@ struct amdgpu_fw_shared { struct amdgpu_fw_shared_multi_queue multi_queue; struct amdgpu_fw_shared_sw_ring sw_ring; struct amdgpu_fw_shared_fw_logging fw_log; + struct amdgpu_fw_shared_smu_interface_info smu_interface_info; }; struct amdgpu_vcn_fwlog { diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c index b16c56aa2d22..9925b2bc63b9 100644 --- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c @@ -219,6 +219,11 @@ static int vcn_v3_0_sw_init(void *handle) 
cpu_to_le32(AMDGPU_VCN_MULTI_QUEUE_FLAG) | cpu_to_le32(AMDGPU_VCN_FW_SHARED_FLAG_0_RB); fw_shared->sw_ring.is_enabled = cpu_to_le32(DEC_SW_RING_ENABLED); + fw_shared->present_flag_0 |= AMDGPU_VCN_SMU_VERSION_INFO_FLAG; + if (adev->ip_versions[UVD_HWIP][0] == IP_VERSION(3, 1, 2)) + fw_shared->smu_interface_info.smu_interface_type = 2; + else if (adev->ip_versions[UVD_HWIP][0] == IP_VERSION(3, 1, 1)) As commented on patch v1, please also put (3, 1, 1) first. + fw_shared->smu_interface_info.smu_interface_type = 1; if (amdgpu_vcnfw_log) amdgpu_vcn_fwlog_init(&adev->vcn.inst[i]); -- 2.25.1 The whole patch cannot be applied. Did `scripts/checkpatch.pl` not detect this? Please fix and resend. Kind regards, Paul
Re: [PATCH V2] drm/amdgpu/vcn3: send smu interface type
Dear Boyuan, Thank for the improved patch. Am 24.03.22 um 18:25 schrieb Zhang, Boyuan: [AMD Official Use Only] No idea if this would confuse `git am`. From: Boyuan Zhang mailto:boyuan.zh...@amd.com>> Your mailer(?) mangled the patch. Did you edit it in your MUA’s compose window? For VCN FW to detect ASIC type, in order to use different mailbox registers. V2: simplify codes and fix format issue. Signed-off-by: Boyuan Zhang mailto:boyuan.zh...@amd.com>> Acked-by Huang Rui --- drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h | 7 +++ drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c | 5 + 2 files changed, 12 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h index e2fde88aaf5e..f06fb7f882e2 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h @@ -159,6 +159,7 @@ #define AMDGPU_VCN_MULTI_QUEUE_FLAG (1 << 8) #define AMDGPU_VCN_SW_RING_FLAG (1 << 9) #define AMDGPU_VCN_FW_LOGGING_FLAG (1 << 10) +#define AMDGPU_VCN_SMU_VERSION_INFO_FLAG (1 << 11) #define AMDGPU_VCN_IB_FLAG_DECODE_BUFFER0x0001 #define AMDGPU_VCN_CMD_FLAG_MSG_BUFFER 0x0001 @@ -279,6 +280,11 @@ struct amdgpu_fw_shared_fw_logging { uint32_t size; }; +struct amdgpu_fw_shared_smu_interface_info { + uint8_t smu_interface_type; + uint8_t padding[3]; +}; + struct amdgpu_fw_shared { uint32_t present_flag_0; uint8_t pad[44]; @@ -287,6 +293,7 @@ struct amdgpu_fw_shared { struct amdgpu_fw_shared_multi_queue multi_queue; struct amdgpu_fw_shared_sw_ring sw_ring; struct amdgpu_fw_shared_fw_logging fw_log; + struct amdgpu_fw_shared_smu_interface_info smu_interface_info; }; struct amdgpu_vcn_fwlog { diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c index b16c56aa2d22..9925b2bc63b9 100644 --- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c @@ -219,6 +219,11 @@ static int vcn_v3_0_sw_init(void *handle) cpu_to_le32(AMDGPU_VCN_MULTI_QUEUE_FLAG) | 
cpu_to_le32(AMDGPU_VCN_FW_SHARED_FLAG_0_RB); fw_shared->sw_ring.is_enabled = cpu_to_le32(DEC_SW_RING_ENABLED); + fw_shared->present_flag_0 |= AMDGPU_VCN_SMU_VERSION_INFO_FLAG; + if (adev->ip_versions[UVD_HWIP][0] == IP_VERSION(3, 1, 2)) + fw_shared->smu_interface_info.smu_interface_type = 2; + else if (adev->ip_versions[UVD_HWIP][0] == IP_VERSION(3, 1, 1)) + fw_shared->smu_interface_info.smu_interface_type = 1; if (amdgpu_vcnfw_log) amdgpu_vcn_fwlog_init(&adev->vcn.inst[i]); -- 2.25.1 The whole patch cannot be applied. Did `scripts/checkpatch.pl` not detect this? Please fix and resend. Kind regards, Paul
Re: [PATCH v2] drm/amdkfd: Check use_xgmi_p2p before reporting hive_id
Dear Divya, Am 23.03.22 um 15:20 schrieb Divya Shikre: Recently introduced commit "845ebd6b7c32 drm/amdgpu: Add use_xgmi_p2p module parameter" did not update XGMI iolinks when use_xgmi_p2p is disabled. Add fix to not create XGMI iolinks in KFD topology when this parameter is disabled. Thank you for rerolling the patch. Please also add: Fixes: 845ebd6b7c32 ("drm/amdgpu: Add use_xgmi_p2p module parameter") Signed-off-by: Divya Shikre --- drivers/gpu/drm/amd/amdkfd/kfd_device.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c index 339e12c94cff..d5536e33b3d8 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c @@ -536,7 +536,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd, goto kfd_doorbell_error; } - kfd->hive_id = kfd->adev->gmc.xgmi.hive_id; + if (amdgpu_use_xgmi_p2p) + kfd->hive_id = kfd->adev->gmc.xgmi.hive_id; kfd->noretry = kfd->adev->gmc.noretry; The rest looks good. Kind regards, Paul
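As an aside, a `Fixes:` line in the conventional format (12-character abbreviated hash plus quoted subject) can be generated with git. The throwaway repository below only exists so the example is self-contained; in a real tree you would point `git log` at the offending commit instead:

```shell
# Create a disposable repository with one commit to demonstrate on.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git -c user.email=dev@example.org -c user.name=Dev commit -q --allow-empty \
    -m 'drm/amdgpu: Add use_xgmi_p2p module parameter'
# Emit the conventional Fixes: tag for the most recent commit.
git log -1 --abbrev=12 --format='Fixes: %h ("%s")'
```

In a kernel checkout, `git log -1 --abbrev=12 --format='Fixes: %h ("%s")' 845ebd6b7c32` would produce the exact line Paul asks for.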
Re: ATI Radeon Mobility 3470 fails with Kernel 5.0
Dear Craig, Am 24.03.22 um 02:09 schrieb Craig M: I should add some further information: I have tried using Manjaro KDE, Kubuntu 20 and Kubuntu 18.03 live disks and all exhibit this 'tearing' issue. The problem begins rather early on in the boot sequence while the TUI is logging the startup information and continues to be a problem as the screen becomes graphic. I feel that I should be able to rectify this with a setting change. Again, Thanks for any help. Craig. In the future, it’d be great if instead of top-posting you used interleaved style. On 3/24/22, Craig M wrote: Hi, I'm trying to track down a problem with an ancient graphics device ATI Radeon Mobility 3470. Ubuntu/Kubuntu 18 is shipped kernel 4.18 and the radeon drivers work well. With 18.02 and later it ships with kernel 5.0 and the radeon drivers don't. What I'm seeing is hard to describe. It looks a lot like the scan lines are interleaved and there's a lot of noise. It's not the typical screen tearing, nor is it just snow (black and white noise). I can *just* make out what it being displayed onscreen, but hurts the eyes a lot to try and do so! If I hold the kernel back at 4.18 things are just fine. Some detailed info (from a working 4.18 kernel). Note that the returned information from a 5.x kernel isn't all that much different: $ lshw -c video *-display description: VGA compatible controller product: RV620/M82 [Mobility Radeon HD 3450/3470] vendor: Advanced Micro Devices, Inc. [AMD/ATI] physical id: 0 bus info: pci@:01:00.0 version: 00 width: 32 bits clock: 33MHz capabilities: pm pciexpress msi vga_controller bus_master cap_list rom configuration: driver=radeon latency=0 resources: irq:26 memory:c000-c7ff ioport:9000(size=256) memory:c802-c802 memory:c-d $ modinfo radeon filename: /lib/modules/4.18.0-17-generic/kernel/drivers/gpu/drm/radeon/radeon.ko license: GPL and additional rights description: ATI Radeon author: Gareth Hughes, Keith Whitwell, others. ... 
name: radeon vermagic: 4.18.0-17-generic SMP mod_unload signat: PKCS#7 signer: sig_key: sig_hashalgo: md4 Looking at the output of `dmesg` might also help. I'm just trying to track down the driver changes between 4.18 and 5.0 to see what I can do to change settings. Any help would be greatly appreciated. I've had a quick look through https://gitlab.freedesktop.org/xorg/driver/xf86-video-ati but I don't know where to start (as there's no correlation between releases and kernels, from what I can see). As switching the Linux kernel version works, you need to look at the driver in the Linux kernel. As you can easily reproduce your problem, and you are most likely alone with this old hardware, bisecting will probably be the way to go. First, you can try the packages from Ubuntu’s Kernel PPA [1]. After figuring out in which version the regression occurs, you then build the Linux kernel yourself, for example with `make bindeb-pkg`, and try to find out the exact commit. There should be some guides on the WWW explaining all this in detail. Kind regards and good luck, Paul PS: It would also be nice to know if the problem is still present in Linux 5.17. [1]: https://kernel.ubuntu.com/~kernel-ppa/mainline/
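The bisection workflow suggested above can be sketched with a toy repository, where a file standing in for "the kernel" becomes bad at commit 4. With a real kernel tree, the script passed to `git bisect run` would build and boot-test the kernel instead of reading a file:

```shell
# Build a five-commit toy history; the "bug" appears at commit 4.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email dev@example.org
git config user.name Dev
for i in 1 2 3 4 5; do
    echo "$i" > state
    git add state
    git commit -qm "commit $i"
done
# HEAD is known bad, four commits back was known good.
git bisect start HEAD HEAD~4
# Exit 0 = good, non-zero = bad; bisect halves the range automatically.
git bisect run sh -c 'test "$(cat state)" -lt 4'
```

`git bisect run` finishes by printing which commit is the first bad one; `git bisect reset` then returns the tree to its original state.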
Re: [PATCH] drm/amdgpu/vcn3: send smu interface type
Dear Yifan, dear Boyuan, Am 24.03.22 um 03:59 schrieb Yifan Zhang: From: Boyuan Zhang For VCN FW to detect ASIC type What affect does this have? How does VCN FW behave different now? Signed-off-by: Boyuan Zhang Signed-off-by: Yifan Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h | 7 +++ drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c | 7 +++ 2 files changed, 14 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h index e2fde88aaf5e..f06fb7f882e2 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h @@ -159,6 +159,7 @@ #define AMDGPU_VCN_MULTI_QUEUE_FLAG (1 << 8) #define AMDGPU_VCN_SW_RING_FLAG (1 << 9) #define AMDGPU_VCN_FW_LOGGING_FLAG(1 << 10) +#define AMDGPU_VCN_SMU_VERSION_INFO_FLAG (1 << 11) #define AMDGPU_VCN_IB_FLAG_DECODE_BUFFER 0x0001 #define AMDGPU_VCN_CMD_FLAG_MSG_BUFFER0x0001 @@ -279,6 +280,11 @@ struct amdgpu_fw_shared_fw_logging { uint32_t size; }; +struct amdgpu_fw_shared_smu_interface_info { + uint8_t smu_interface_type; + uint8_t padding[3]; +}; + struct amdgpu_fw_shared { uint32_t present_flag_0; uint8_t pad[44]; @@ -287,6 +293,7 @@ struct amdgpu_fw_shared { struct amdgpu_fw_shared_multi_queue multi_queue; struct amdgpu_fw_shared_sw_ring sw_ring; struct amdgpu_fw_shared_fw_logging fw_log; + struct amdgpu_fw_shared_smu_interface_info smu_interface_info; }; struct amdgpu_vcn_fwlog { diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c index b16c56aa2d22..c5bf7cbfa82c 100644 --- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c @@ -219,6 +219,13 @@ static int vcn_v3_0_sw_init(void *handle) cpu_to_le32(AMDGPU_VCN_MULTI_QUEUE_FLAG) | cpu_to_le32(AMDGPU_VCN_FW_SHARED_FLAG_0_RB); fw_shared->sw_ring.is_enabled = cpu_to_le32(DEC_SW_RING_ENABLED); + if (adev->ip_versions[UVD_HWIP][0] == IP_VERSION(3, 1, 2)) { + fw_shared->present_flag_0 |= AMDGPU_VCN_SMU_VERSION_INFO_FLAG; + 
fw_shared->smu_interface_info.smu_interface_type = 2; + } else if(adev->ip_versions[UVD_HWIP][0] == IP_VERSION(3, 1, 1)) { Please add a space before the (, which `checkpatch.pl` also would have found: $ scripts/checkpatch.pl /dev/shm/0001-drm-amdgpu-vcn3-send-smu-interface-type.patch ERROR: space required before the open parenthesis '(' #58: FILE: drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c:225: + } else if(adev->ip_versions[UVD_HWIP][0] == IP_VERSION(3, 1, 1)) { Also why not order it smallest version first? Will there ever be other IP versions for VCN 3.0? + fw_shared->present_flag_0 |= AMDGPU_VCN_SMU_VERSION_INFO_FLAG; + fw_shared->smu_interface_info.smu_interface_type = 1; + } if (amdgpu_vcnfw_log) amdgpu_vcn_fwlog_init(&adev->vcn.inst[i]); Kind regards, Paul
Commit messages (was: [PATCH v11] drm/amdgpu: add drm buddy support to amdgpu)
Dear Christian, Am 23.03.22 um 08:42 schrieb Christian König: Am 23.03.22 um 07:42 schrieb Paul Menzel: Am 23.03.22 um 07:25 schrieb Arunpravin Paneer Selvam: - Remove drm_mm references and replace with drm buddy functionalities The commit message summary to me suggested, you can somehow use both allocators now. Two suggestions below: 1. Switch to drm buddy allocator 2. Use drm buddy alllocator - Add res cursor support for drm buddy As an allocator switch sounds invasive, could you please extend the commit message, briefly describing the current situation, saying what the downsides are, and why the buddy allocator is “better”. Well, Paul please stop bothering developers with those requests. It's my job as maintainer to supervise the commit messages and it is certainly NOT require to explain all the details of the current situation in a commit message. That is just overkill. I did not request all the details, and I think my requests are totally reasonable. But let’s change the perspective. If there were not any AMD graphics drivers bug, I would have never needed to look at the code and deal with it. Unfortunately the AMD graphics driver situation – which improved a lot in recent years – with no public documentation, proprietary firmware and complex devices is still not optimal, and a lot of bugs get reported, and I am also hit by bugs, taking time to deal with them, and maybe reporting and helping to analyze them. So to keep your wording, if you would stop bothering users with bugs and requesting their help in fixing them – asking the user to bisect the issue is often the first thing. Actually it should not be unreasonable for customers buying an AMD device to expect get bug free drivers. It’s strange and a sad fact, that the software industry succeeded to sway that valid expectation and customers now except they need to regularly install software updates, and do not get, for example, a price reduction when there are bugs. 
Also, as stated everywhere, reviewer time is scarce, so commit authors should make it easy to attract new folks. A simple note that we are switching from the drm_mm backend to the buddy backend is sufficient, and that is exactly what the commit message is saying here.

Sorry, I disagree. The motivation needs to be part of the commit message. For example, see the recent discussion on the LWN article *Donenfeld: Random number generator enhancements for Linux 5.17 and 5.18* [1]. How much the commit message should be extended, I do not know, but the current state is insufficient (too terse).

Kind regards,

Paul

[1]: https://lwn.net/Articles/888413/ "Donenfeld: Random number generator enhancements for Linux 5.17 and 5.18"
Re: [PATCH v11] drm/amdgpu: add drm buddy support to amdgpu
Dear Arunpravin,

Thank you for your patch.

On 23.03.22 at 07:25, Arunpravin Paneer Selvam wrote:

- Remove drm_mm references and replace with drm buddy functionalities

The commit message summary to me suggested, you can somehow use both allocators now. Two suggestions below:

1. Switch to drm buddy allocator
2. Use drm buddy allocator

- Add res cursor support for drm buddy

As an allocator switch sounds invasive, could you please extend the commit message, briefly describing the current situation, saying what the downsides are, and why the buddy allocator is “better”.

How did you test it? How can it be tested, that there are no regressions?

v2(Matthew Auld):

Nit: I’d add a space before (.

Kind regards,

Paul

- replace spinlock with mutex as we call kmem_cache_zalloc (..., GFP_KERNEL) in drm_buddy_alloc() function
- lock drm_buddy_block_trim() function as it calls mark_free/mark_split, which are all globally visible

v3(Matthew Auld):
- remove trim method error handling as we address the failure case at drm_buddy_block_trim() function

v4:
- fix warnings reported by kernel test robot

v5:
- fix merge conflict issue

v6:
- fix warnings reported by kernel test robot

v7:
- remove DRM_BUDDY_RANGE_ALLOCATION flag usage

v8:
- keep DRM_BUDDY_RANGE_ALLOCATION flag usage
- resolve conflicts created by drm/amdgpu: remove VRAM accounting v2

v9(Christian):
- merged the below patch
  - drm/amdgpu: move vram inline functions into a header
- rename label name as fallback
- move struct amdgpu_vram_mgr to amdgpu_vram_mgr.h
- remove unnecessary flags from struct amdgpu_vram_reservation
- rewrite block NULL check condition
- change else style as per coding standard
- rewrite the node max size
- add a helper function to fetch the first entry from the list

v10(Christian):
- rename amdgpu_get_node() function name as amdgpu_vram_mgr_first_block

v11:
- if size is not aligned with min_page_size, enable the is_contiguous flag; the size is then rounded up to a power of two and trimmed back to the original size.
Signed-off-by: Arunpravin Paneer Selvam
---
 drivers/gpu/drm/Kconfig                      |   1 +
 .../gpu/drm/amd/amdgpu/amdgpu_res_cursor.h   |  97 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h      |  10 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 263 ++
 4 files changed, 234 insertions(+), 137 deletions(-)

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index f1422bee3dcc..5133c3f028ab 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -280,6 +280,7 @@ config DRM_AMDGPU
 	select HWMON
 	select BACKLIGHT_CLASS_DEVICE
 	select INTERVAL_TREE
+	select DRM_BUDDY
 	help
 	  Choose this option if you have a recent AMD Radeon graphics card.

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
index acfa207cf970..864c609ba00b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
@@ -30,12 +30,15 @@
 #include
 #include
+#include "amdgpu_vram_mgr.h"
+
 /* state back for walking over vram_mgr and gtt_mgr allocations */
 struct amdgpu_res_cursor {
 	uint64_t		start;
 	uint64_t		size;
 	uint64_t		remaining;
-	struct drm_mm_node	*node;
+	void			*node;
+	uint32_t		mem_type;
 };

 /**
@@ -52,27 +55,63 @@ static inline void amdgpu_res_first(struct ttm_resource *res,
 				    uint64_t start, uint64_t size,
 				    struct amdgpu_res_cursor *cur)
 {
+	struct drm_buddy_block *block;
+	struct list_head *head, *next;
 	struct drm_mm_node *node;

-	if (!res || res->mem_type == TTM_PL_SYSTEM) {
-		cur->start = start;
-		cur->size = size;
-		cur->remaining = size;
-		cur->node = NULL;
-		WARN_ON(res && start + size > res->num_pages << PAGE_SHIFT);
-		return;
-	}
+	if (!res)
+		goto fallback;

 	BUG_ON(start + size > res->num_pages << PAGE_SHIFT);

-	node = to_ttm_range_mgr_node(res)->mm_nodes;
-	while (start >= node->size << PAGE_SHIFT)
-		start -= node++->size << PAGE_SHIFT;
+	cur->mem_type = res->mem_type;
+
+	switch (cur->mem_type) {
+	case TTM_PL_VRAM:
+		head = _amdgpu_vram_mgr_node(res)->blocks;
+
+		block = list_first_entry_or_null(head,
+						 struct drm_buddy_block,
+						 link);
+		if (!block)
+			goto fallback;
+
+		while (start >= amdgpu_node_size(block)) {
+			start -=
Re: [PATCH] drm/amdkfd:Check use_xgmi_p2p before reporting hive_id
Dear Divya,

Thank you for your patch.

On 22.03.22 at 20:23, Divya Shikre wrote:

Please add a space after the colon in the commit message summary.

XGMI iolinks will not be created in KFD topology when use_xgmi_p2p module parameter is disabled in a dGPU.

1. Is “will not be” the current problem, or the result of your patch? Please start by describing the current problem, and how to reproduce it.
2. Please wrap lines after 75 characters, and add a blank line between paragraphs.

The script `scripts/checkpatch.pl`, which you should run on all your patches before submitting them, also warns about the first issue:

WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per line)

In this case PCIe p2p links will be used instead.

Is this a regression, and does it fix another commit? If so, please add a Fixes: line.

Signed-off-by: Divya Shikre
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 339e12c94cff..d5536e33b3d8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -536,7 +536,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
 		goto kfd_doorbell_error;
 	}

-	kfd->hive_id = kfd->adev->gmc.xgmi.hive_id;
+	if (amdgpu_use_xgmi_p2p)
+		kfd->hive_id = kfd->adev->gmc.xgmi.hive_id;

Is hive_id now unset?

 	kfd->noretry = kfd->adev->gmc.noretry;

Kind regards,

Paul
Re: [PATCH] drm/amdgpu: Correct the parameter name of amdgpu_ring_init
Dear Jun,

Thank you for your patch.

On 23.03.22 at 04:02, Ma, Jun wrote:
On 3/22/2022 at 9:33 PM, Christian König wrote:
On 22.03.22 at 13:53, Ma Jun wrote:

Correct the parameter name of amdgpu_ring_init() in header file.

Maybe write "Sync up header and implementation to use the same parameter names", otherwise somebody could think that this is a real functional bug fix and backport it.

ok, will fix this in v2

(Your mailer incorrectly quoted your reply sentence, making it look like it was written by Christian.)

If you re-roll the patch, please also mention both parameter names in the commit message, so people do not have to search the diff for them, and can verify the changes match your intention. (The summary uses the singular *name*, adding a little confusion.)

Kind regards,

Paul

Signed-off-by: Ma Jun

With the commit message and subject adjusted, the patch is

Reviewed-by: Christian König

Change-Id: I202d76ba04b137926b456b1c8a4c05a5b1a01bff
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 5320bb0883d8..317d80209e95 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -300,8 +300,8 @@ void amdgpu_ring_generic_pad_ib(struct amdgpu_ring *ring,
 				struct amdgpu_ib *ib);
 void amdgpu_ring_commit(struct amdgpu_ring *ring);
 void amdgpu_ring_undo(struct amdgpu_ring *ring);
 int amdgpu_ring_init(struct amdgpu_device *adev, struct amdgpu_ring *ring,
-		     unsigned int ring_size, struct amdgpu_irq_src *irq_src,
-		     unsigned int irq_type, unsigned int prio,
+		     unsigned int max_dw, struct amdgpu_irq_src *irq_src,
+		     unsigned int irq_type, unsigned int hw_prio,
 		     atomic_t *sched_score);
 void amdgpu_ring_fini(struct amdgpu_ring *ring);
 void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring,