[PATCH] drm/amdgpu: Log if device is unsupported

2023-06-05 Thread Paul Menzel
Since there is overlap in supported devices, both modules load, but only
one will bind to a particular device depending on the model and user's
configuration.

amdgpu binds to all display class devices with VID 0x1002 and then
determines whether or not to bind to a device based on whether the
individual device is supported by the driver or not. Log that case, so
users looking at the logs know what is going on.

Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2608
Signed-off-by: Paul Menzel 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 86fbb4138285..410ff918c350 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2062,8 +2062,10 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
 
/* skip devices which are owned by radeon */
for (i = 0; i < ARRAY_SIZE(amdgpu_unsupported_pciidlist); i++) {
-   if (amdgpu_unsupported_pciidlist[i] == pdev->device)
+   if (amdgpu_unsupported_pciidlist[i] == pdev->device) {
+   DRM_INFO("This hardware is only supported by radeon.");
return -ENODEV;
+   }
}
 
if (amdgpu_aspm == -1 && !pcie_aspm_enabled(pdev))
-- 
2.40.1



Re: How are the DC patches tested?

2022-05-10 Thread Paul Menzel

Dear Daniel,


Am 09.05.22 um 17:21 schrieb Wheeler, Daniel:


I've made some edits to my cover letter to hopefully make it clearer
with what is being done.

Thank you.

Hi all,
  
		This week this patchset was tested on the following systems:
  
		HP Envy 360, with Ryzen 5 4500U

Lenovo Thinkpad T14s Gen2, with AMD Ryzen 5 5650U
Sapphire Pulse RX5700XT
Reference AMD RX6800
Engineering board with Ryzen 9 5900H

These systems were tested on the following display types:
eDP, (1080p 60hz [4500U, 5650U, 5900H])
VGA and DVI (1680x1050 60HZ [DP to VGA/DVI, USB-C to DVI/VGA])
DP/HDMI/USB-C (1440p 170hz, 4k 60hz, 4k 144hz [Includes USB-C 
to DP/HDMI adapters])

MST tested with Startech MST14DP123DP and 2x 4k 60Hz displays
DSC tested with Cable Matters 101075 (DP to 3x DP), and 201375 
(USB-C to 3x DP) with 3x 4k60 displays
  
		The testing is a mix of automated and manual tests. Manual testing includes (but is not limited to):

Changing display configurations and settings
Benchmark testing
Feature testing (Freesync, etc.)

Automated testing includes (but is not limited to):
Script testing (scripts to automate some of the manual checks)
IGT testing

The patchset consists of the most recent amd-staging-drm-next 
branch with a selection of patches added on top of it. This goes for both 
Ubuntu testing and Chrome   OS testing.


Please mention the commit hash of the commit at the top of 
amd-staging-drm-next.




Tested on Ubuntu 22.04 and Chrome OS


I still do not see the Chrome OS version and devices listed. I think, 
that information would be valuable.


  
		Tested-by: Daniel Wheeler 


Kind regards,

Paul


Fwd: Re: amd | ASUS ROG Strix G513QY fails to resume from suspend [regression] (#2008)

2022-05-10 Thread Paul Menzel

Dear Linux regressions folks,


A user reported a regression [1], which also trickled down to the stable 
series, for example between 5.15.13 and 5.15.14.



Kind regards,

Paul


[1]: https://gitlab.freedesktop.org/drm/amd/-/issues/2008--- Begin Message ---



spin83 commented:


It doesn't work either.

-- 
View it on GitLab: 
https://gitlab.freedesktop.org/drm/amd/-/issues/2008#note_1375796
You're receiving this email because of your account on gitlab.freedesktop.org.


--- End Message ---


Re: [PATCH] drm/amdgpu/gfx11: fix mes mqd settings and map_queue packet

2022-05-10 Thread Paul Menzel

Dear Alex, dear Jack,


Thank you for the patch.

Am 09.05.22 um 21:06 schrieb Alex Deucher:

From: Jack Xiao 

a. use correct mes mqd settings


Can you please elaborate? What is wrong with the old ones, and what are 
the correct ones?



b. fix me field in map_queue packet


Can you please add some background? The new value is 2. What does it do?

It’d be great, if you could make it two patches.


Signed-off-by: Jack Xiao 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher 
---
  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c |  7 +--
  drivers/gpu/drm/amd/amdgpu/mes_v11_0.c | 20 ++--
  2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index 7614f38ff381..8a1bec70c719 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -145,16 +145,19 @@ static void gfx11_kiq_map_queues(struct amdgpu_ring 
*kiq_ring,
  {
uint64_t mqd_addr = amdgpu_bo_gpu_offset(ring->mqd_obj);
uint64_t wptr_addr = ring->wptr_gpu_addr;
-   uint32_t eng_sel = 0;
+   uint32_t me = 0, eng_sel = 0;
  
  	switch (ring->funcs->type) {

case AMDGPU_RING_TYPE_COMPUTE:
+   me = 1;
eng_sel = 0;
break;
case AMDGPU_RING_TYPE_GFX:
+   me = 0;
eng_sel = 4;
break;
case AMDGPU_RING_TYPE_MES:
+   me = 2;
eng_sel = 5;
break;
default:
@@ -168,7 +171,7 @@ static void gfx11_kiq_map_queues(struct amdgpu_ring 
*kiq_ring,
  PACKET3_MAP_QUEUES_VMID(0) | /* VMID */
  PACKET3_MAP_QUEUES_QUEUE(ring->queue) |
  PACKET3_MAP_QUEUES_PIPE(ring->pipe) |
- PACKET3_MAP_QUEUES_ME((ring->me == 1 ? 0 : 1)) |
+ PACKET3_MAP_QUEUES_ME((me)) |
  PACKET3_MAP_QUEUES_QUEUE_TYPE(0) | /*queue_type: 
normal compute queue */
  PACKET3_MAP_QUEUES_ALLOC_FORMAT(0) | /* alloc format: 
all_on_one_pipe */
  PACKET3_MAP_QUEUES_ENGINE_SEL(eng_sel) |
diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
index 5d4d54cabf70..fcf51947bb18 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
@@ -29,7 +29,7 @@
  #include "gc/gc_11_0_0_offset.h"
  #include "gc/gc_11_0_0_sh_mask.h"
  #include "gc/gc_11_0_0_default.h"
-#include "v10_structs.h"
+#include "v11_structs.h"
  #include "mes_v11_api_def.h"
  
  MODULE_FIRMWARE("amdgpu/gc_11_0_0_mes.bin");

@@ -637,7 +637,7 @@ static int mes_v11_0_allocate_eop_buf(struct amdgpu_device 
*adev,
  
  static int mes_v11_0_mqd_init(struct amdgpu_ring *ring)

  {
-   struct v10_compute_mqd *mqd = ring->mqd_ptr;
+   struct v11_compute_mqd *mqd = ring->mqd_ptr;
uint64_t hqd_gpu_addr, wb_gpu_addr, eop_base_addr;
uint32_t tmp;
  
@@ -724,22 +724,22 @@ static int mes_v11_0_mqd_init(struct amdgpu_ring *ring)

mqd->cp_hqd_vmid = 0;
/* activate the queue */
mqd->cp_hqd_active = 1;
-   mqd->cp_hqd_persistent_state = regCP_HQD_PERSISTENT_STATE_DEFAULT;
+
+   tmp = regCP_HQD_PERSISTENT_STATE_DEFAULT;
+   tmp = REG_SET_FIELD(tmp, CP_HQD_PERSISTENT_STATE,
+   PRELOAD_SIZE, 0x55);
+   mqd->cp_hqd_persistent_state = tmp;
+
mqd->cp_hqd_ib_control = regCP_HQD_IB_CONTROL_DEFAULT;
mqd->cp_hqd_iq_timer = regCP_HQD_IQ_TIMER_DEFAULT;
mqd->cp_hqd_quantum = regCP_HQD_QUANTUM_DEFAULT;
  
-	tmp = regCP_HQD_GFX_CONTROL_DEFAULT;

-   tmp = REG_SET_FIELD(tmp, CP_HQD_GFX_CONTROL, DB_UPDATED_MSG_EN, 1);
-   /* offset: 184 - this is used for CP_HQD_GFX_CONTROL */
-   mqd->cp_hqd_suspend_cntl_stack_offset = tmp;
-


What was wrong with this?


Kind regards,

Paul



return 0;
  }
  
  static void mes_v11_0_queue_init_register(struct amdgpu_ring *ring)

  {
-   struct v10_compute_mqd *mqd = ring->mqd_ptr;
+   struct v11_compute_mqd *mqd = ring->mqd_ptr;
struct amdgpu_device *adev = ring->adev;
uint32_t data = 0;
  
@@ -910,7 +910,7 @@ static int mes_v11_0_kiq_ring_init(struct amdgpu_device *adev)

  static int mes_v11_0_mqd_sw_init(struct amdgpu_device *adev,
 enum admgpu_mes_pipe pipe)
  {
-   int r, mqd_size = sizeof(struct v10_compute_mqd);
+   int r, mqd_size = sizeof(struct v11_compute_mqd);
struct amdgpu_ring *ring;
  
  	if (pipe == AMDGPU_MES_KIQ_PIPE)


Re: [PATCH] drm/amdgpu: Check kernel version for 'hypervisor_is_type'

2022-05-10 Thread Paul Menzel

Dear Yongqiang,


Thank you for your patch.

Am 09.05.22 um 19:07 schrieb Yongqiang Sun:

hypervisor_is_type is added since kernel 4.15, dkms package failed to


Also mention the commit hash and summary of the commit adding that.


install on older kernel e.g. 3.10.

Use marcro check kernel version to determine whether the function is


macro (found by my spell checker)


used.


Please reflow for 75 characters per line.

The Signed-off-by line is missing.


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c  | 4 +++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 4 +++-
  2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 88b852b3a2cb..963b2e68205e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -28,7 +28,7 @@
  #ifdef CONFIG_X86
  #include 
  #endif
-
+#include 
  #include "amdgpu.h"
  #include "amdgpu_gmc.h"
  #include "amdgpu_ras.h"
@@ -653,10 +653,12 @@ void amdgpu_gmc_get_vbios_allocations(struct 
amdgpu_device *adev)
 * VEGA10 SRIOV VF with MS_HYPERV host needs some firmware 
reserved area.
 */
  #ifdef CONFIG_X86
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 15, 0)
if (amdgpu_sriov_vf(adev) && 
hypervisor_is_type(X86_HYPER_MS_HYPERV)) {
adev->mman.stolen_reserved_offset = 0x50;
adev->mman.stolen_reserved_size = 0x20;
}
+#endif


`scripts/checkpatch.pl` warns about using `LINUX_VERSION_CODE`.


  #endif
break;
case CHIP_RAVEN:
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 02b161a1287b..5a50122a1161 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -26,7 +26,7 @@
  #ifdef CONFIG_X86
  #include 
  #endif
-
+#include 
  #include 
  #include 
  
@@ -728,8 +728,10 @@ void amdgpu_detect_virtualization(struct amdgpu_device *adev)

case CHIP_VEGA10:
soc15_set_virt_ops(adev);
  #ifdef CONFIG_X86
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 15, 0)
/* not send GPU_INIT_DATA with MS_HYPERV*/
if (!hypervisor_is_type(X86_HYPER_MS_HYPERV))
+#endif
  #endif
/* send a dummy GPU_INIT_DATA request to host 
on vega10 */
amdgpu_virt_request_init_data(adev);


Is this patch for inclusion to the Linux kernel or just for people 
building the newer code as module for older Linux kernel versions?



Kind regards,

Paul


Re: How are the DC patches tested?

2022-05-09 Thread Paul Menzel

Dear Rodrigo,


Thank you for the quick response.

Am 09.05.22 um 16:15 schrieb Rodrigo Siqueira Jordao:


On 2022-05-09 10:00, Paul Menzel wrote:



Am 09.05.22 um 15:14 schrieb Wheeler, Daniel:

[…]


This week this patchset was tested on the following systems:

Lenovo Thinkpad T14s Gen2 with AMD Ryzen 5 5650U, with the following
display types: eDP 1080p 60hz, 4k 60hz  (via USB-C to DP/HDMI), 1440p
144hz (via USB-C to DP/HDMI), 1680*1050 60hz (via USB-C to DP and
then DP to DVI/VGA)

Sapphire Pulse RX5700XT with the following display types: 4k 60hz
(via DP/HDMI), 1440p 144hz (via DP/HDMI), 1680*1050 60hz (via DP to
DVI/VGA)

Reference AMD RX6800 with the following display types: 4k 60hz  (via
DP/HDMI and USB-C to DP/HDMI), 1440p 144hz (via USB-C to DP/HDMI and
USB-C to DP/HDMI), 1680*1050 60hz (via DP to DVI/VGA)

Included testing using a Startech DP 1.4 MST hub at 2x 4k 60hz and
DSC via USB-C to DP DSC Hub with 3x 4k 60hz.

Tested on Ubuntu 22.04 with Kernel Version 5.16, and ChromeOS


I am curious, what this means exactly? You clone the Ubuntu Linux 5.16 
kernel source, and then apply your patches on top? (Do they even apply?)


All of these "promotion" patches are tested by using 
amd-staging-drm-next. In a few words:


1. We get the latest code from amd-staging-drm-next;
2. We apply these weekly promotion patches on top of it;
3. We compile, run unit tests, and run many manual tests (Daniel does 
that).


If everything is alright with Daniel's tests, we feel confident to merge 
these series on top amd-staging-drm-next (we are basically trying to 
avoid regressions here).


Anyway, maybe we can rephrase:

  Tested on Ubuntu 22.04 with Kernel Version 5.16, and ChromeOS

to

  Tested on Ubuntu 22.04 and ChromeOS with amd-staging-drm-next + 
promotion patches.


Yes, that’d be great. Maybe even reference the commit hash from the 
commit on top of *amd-staging-drm-next*.


(Nit: ChromeOS → Chrome OS)

The same for Chrome OS. Do you use Chrome OS Flex [1] with the systems 
you listed? If not, what Google Chromebooks/-boxes did you test with? 
The Linux kernel version is also tied for a device and Chrome OS 
release. Please mention those too.


As written, the used Chrome OS version (and devices) would be helpful too.


Is it documented somewhere, what tests you run exactly?


We run IGT tests, some scripts that validate some specific areas, and 
Daniel has an extensive set of manual tests.



Kind regards,

Paul


How are the DC patches tested? (was: [PATCH 00/15] DC Patches May 9, 2022)

2022-05-09 Thread Paul Menzel

[Sorry for the incomplete first message.]

Dear Daniel,


Am 09.05.22 um 15:14 schrieb Wheeler, Daniel:

[…]


This week this patchset was tested on the following systems:

Lenovo Thinkpad T14s Gen2 with AMD Ryzen 5 5650U, with the following
display types: eDP 1080p 60hz, 4k 60hz  (via USB-C to DP/HDMI), 1440p
144hz (via USB-C to DP/HDMI), 1680*1050 60hz (via USB-C to DP and
then DP to DVI/VGA)

Sapphire Pulse RX5700XT with the following display types: 4k 60hz
(via DP/HDMI), 1440p 144hz (via DP/HDMI), 1680*1050 60hz (via DP to
DVI/VGA)

Reference AMD RX6800 with the following display types: 4k 60hz  (via
DP/HDMI and USB-C to DP/HDMI), 1440p 144hz (via USB-C to DP/HDMI and
USB-C to DP/HDMI), 1680*1050 60hz (via DP to DVI/VGA)

Included testing using a Startech DP 1.4 MST hub at 2x 4k 60hz and
DSC via USB-C to DP DSC Hub with 3x 4k 60hz.

Tested on Ubuntu 22.04 with Kernel Version 5.16, and ChromeOS


I am curious, what this means exactly? You clone the Ubuntu Linux 5.16 
kernel source, and then apply your patches on top? (Do they even apply?)


The same for Chrome OS. Do you use Chrome OS Flex [1] with the systems 
you listed? If not, what Google Chromebooks/-boxes did you test with? 
The Linux kernel version is also tied for a device and Chrome OS 
release. Please mention those too.


Is it documented somewhere, what tests you run exactly?


Kind regards,

Paul


[1]: https://chromeenterprise.google/os/chromeosflex/


Re: [PATCH 00/15] DC Patches May 9, 2022

2022-05-09 Thread Paul Menzel

Dear Daniel,


Am 09.05.22 um 15:14 schrieb Wheeler, Daniel:

[…]


This week this patchset was tested on the following systems:

Lenovo Thinkpad T14s Gen2 with AMD Ryzen 5 5650U, with the following
display types: eDP 1080p 60hz, 4k 60hz  (via USB-C to DP/HDMI), 1440p
144hz (via USB-C to DP/HDMI), 1680*1050 60hz (via USB-C to DP and
then DP to DVI/VGA)

Sapphire Pulse RX5700XT with the following display types: 4k 60hz
(via DP/HDMI), 1440p 144hz (via DP/HDMI), 1680*1050 60hz (via DP to
DVI/VGA)

Reference AMD RX6800 with the following display types: 4k 60hz  (via
DP/HDMI and USB-C to DP/HDMI), 1440p 144hz (via USB-C to DP/HDMI and
USB-C to DP/HDMI), 1680*1050 60hz (via DP to DVI/VGA)

Included testing using a Startech DP 1.4 MST hub at 2x 4k 60hz and
DSC via USB-C to DP DSC Hub with 3x 4k 60hz.

Tested on Ubuntu 22.04 with Kernel Version 5.16, and ChromeOS


Kind regards,

Paul


Re: [PATCH] drm/amdgpu/vcn: fixed no previous prototype warning.

2022-05-09 Thread Paul Menzel

Dear James,


You sent this message 30 minutes after sending the same message before 
(Message-ID: <20220508173008.281673-1-james@amd.com>). Are there any 
differences?


Am 08.05.22 um 20:00 schrieb James Zhu:

Fixed warning: no previous prototype for 'vcn_dec_sw_ring_emit_fence'


Please use imperative mood in the commit message summary. Maybe even:

> Include header for `vcn_dec_sw_ring_emit_fence()`

Also, please do not add a dot/period to the end of commit messages.


Kind regards,

Paul



Signed-off-by: James Zhu 
Reported-by: kernel test robot 
---
  drivers/gpu/drm/amd/amdgpu/vcn_sw_ring.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_sw_ring.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_sw_ring.c
index f4f97f0f5c64..1ceda3d0cd5b 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_sw_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_sw_ring.c
@@ -22,6 +22,7 @@
   */
  
  #include "amdgpu.h"

+#include "vcn_sw_ring.h"
  
  void vcn_dec_sw_ring_emit_fence(struct amdgpu_ring *ring, u64 addr,

u64 seq, uint32_t flags)


Re: [PATCH] amdgpu/pm: Disable managing power profiles on SRIOV for Sienna Cichlid

2022-05-06 Thread Paul Menzel

Dear Danijel,


Am 06.05.22 um 09:52 schrieb Danijel Slivka:

Please add a reasoning/motivation in the commit message body.


Kind regards,

Paul



Signed-off-by: Danijel Slivka 
---
  drivers/gpu/drm/amd/pm/amdgpu_pm.c | 9 +
  1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
index 70a0aad05426..59a662aeaaf3 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
@@ -2025,6 +2025,8 @@ static int default_attr_update(struct amdgpu_device 
*adev, struct amdgpu_device_
} else if (DEVICE_ATTR_IS(pp_power_profile_mode)) {
if (amdgpu_dpm_get_power_profile_mode(adev, NULL) == 
-EOPNOTSUPP)
*states = ATTR_STATE_UNSUPPORTED;
+   else if (gc_ver == IP_VERSION(10, 3, 0) && 
amdgpu_sriov_vf(adev))
+   *states = ATTR_STATE_UNSUPPORTED;
}
  
  	switch (gc_ver) {

@@ -2038,6 +2040,13 @@ static int default_attr_update(struct amdgpu_device 
*adev, struct amdgpu_device_
dev_attr->store = NULL;
}
break;
+   case IP_VERSION(10, 3, 0):
+   if (DEVICE_ATTR_IS(power_dpm_force_performance_level) &&
+   amdgpu_sriov_vf(adev)) {
+   dev_attr->attr.mode &= ~S_IWUGO;
+   dev_attr->store = NULL;
+   }
+   break;
default:
break;
}


Re: [PATCH] drm/amdgpu: flush delete wq after wait fence

2022-05-05 Thread Paul Menzel

Dear Yiging,


Thank you for your patch.

Am 05.05.22 um 08:35 schrieb Yiqing Yao:

[why]
lru_list not empty warning in sw fini during repeated device bind unbind.
There should be a amdgpu_fence_wait_empty() before the flush_delayed_work()
call as sugested.


sug*g*ested

Suggested by whom? Or do you mean suggested in this diff?


[how]
Do flush_delayed_work for ttm bo delayed delete wq after fence_driver_hw_fini.

Signed-off-by: Yiqing Yao 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +
  1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 14c5ccf81e80..92e5ed3ed345 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4003,10 +4003,6 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
  {
dev_info(adev->dev, "amdgpu: finishing device.\n");
flush_delayed_work(>delayed_init_work);
-   if (adev->mman.initialized) {
-   flush_delayed_work(>mman.bdev.wq);
-   ttm_bo_lock_delayed_workqueue(>mman.bdev);
-   }


From the commit message, it’s not clear, that you remove this here.


Kind regards,

Paul



adev->shutdown = true;
  
  	/* make sure IB test finished before entering exclusive mode

@@ -4029,6 +4025,11 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
}
amdgpu_fence_driver_hw_fini(adev);
  
+	if (adev->mman.initialized) {

+   flush_delayed_work(>mman.bdev.wq);
+   ttm_bo_lock_delayed_workqueue(>mman.bdev);
+   }
+
if (adev->pm_sysfs_en)
amdgpu_pm_sysfs_fini(adev);
if (adev->ucode_sysfs_en)


Re: [PATCHv4] drm/amdgpu: disable ASPM on Intel Alder Lake based systems

2022-05-03 Thread Paul Menzel

Dear Daniel,


Am 03.05.22 um 14:25 schrieb Daniel Stone:

On Sun, 1 May 2022 at 08:08, Paul Menzel  wrote:

Am 26.04.22 um 15:53 schrieb Gong, Richard:

I think so. We captured dmesg log.


Then the (whole) system did *not* freeze, if you could still log in
(maybe over network) and execute `dmesg`. Please also paste the
amdgpu(?) error logs in the commit message.


As mentioned early we need support from Intel on how to get ASPM working
for VI generation on Intel Alder Lake, but we don't know where things
currently stand.


Who is working on this, and knows?


This has gone beyond the point of a reasonable request. The amount of
detail you're demanding is completely unnecessary.


If a quirk is introduced possibly leading to higher power consumption, 
especially on systems nobody has access to yet, then the detail, where 
the system hangs/freezes is not unreasonable at all.


In the Linux logs from the issue there are messages like

[   58.101385] Freezing of tasks failed after 20.003 seconds (4 
tasks refusing to freeze, wq_busy=0):


[   78.278403] Freezing of tasks failed after 20.008 seconds (4 
tasks refusing to freeze, wq_busy=0):


and it looks like several suspend/resume cycles were done.

I see a lot of commit messages over the whole Linux kernel, where this 
level of detail is provided (by default), and


The second question was not for the commit message, but just for 
documentation purpose when the problem is going to be fixed properly. 
And it looks like (at least publicly) analyzing the root cause is not 
happening, and once the quirk lands, nobody is going to feel the 
pressure to work on it, as everyone’s plates are full.



Kind regards,

Paul


Re: [PATCHv5] drm/amdgpu: vi: disable ASPM on Intel Alder Lake based systems

2022-05-01 Thread Paul Menzel

Dear Richard,


Am 29.04.22 um 18:06 schrieb Richard Gong:

Active State Power Management (ASPM) feature is enabled since kernel 5.14.
There are some AMD Volcanic Islands (VI) GFX cards, such as the WX3200 and
RX640, that do not work with ASPM-enabled Intel Alder Lake based systems.
Using these GFX cards as video/display output, Intel Alder Lake based
systems will freeze after suspend/resume.


As replied in v4 just now, “freeze” is misleading if you can still run 
`dmesg` after resume.



Kind regards,

Paul



The issue was originally reported on one system (Dell Precision 3660 with
BIOS version 0.14.81), but was later confirmed to affect at least 4
pre-production Alder Lake based systems.

Add an extra check to disable ASPM on Intel Alder Lake based systems with
the problematic AMD Volcanic Islands GFX cards.

Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default")
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1885
Reported-by: kernel test robot 
Signed-off-by: Richard Gong 
---
v5: added vi to commit header and updated commit message
 rolled back guard with the preprocessor as did in v2 to correct build
 error on non-x86 systems
v4: s/CONFIG_X86_64/CONFIG_X86
 enhanced check logic
v3: s/intel_core_aspm_chk/aspm_support_quirk_check
 correct build error with W=1 option
v2: correct commit description
 move the check from chip family to problematic platform
---
  drivers/gpu/drm/amd/amdgpu/vi.c | 17 -
  1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c
index 039b90cdc3bc..45f0188c4273 100644
--- a/drivers/gpu/drm/amd/amdgpu/vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/vi.c
@@ -81,6 +81,10 @@
  #include "mxgpu_vi.h"
  #include "amdgpu_dm.h"
  
+#if IS_ENABLED(CONFIG_X86)

+#include 
+#endif
+
  #define ixPCIE_LC_L1_PM_SUBSTATE  0x100100C6
  #define PCIE_LC_L1_PM_SUBSTATE__LC_L1_SUBSTATES_OVERRIDE_EN_MASK  
0x0001L
  #define PCIE_LC_L1_PM_SUBSTATE__LC_PCI_PM_L1_2_OVERRIDE_MASK  0x0002L
@@ -1134,13 +1138,24 @@ static void vi_enable_aspm(struct amdgpu_device *adev)
WREG32_PCIE(ixPCIE_LC_CNTL, data);
  }
  
+static bool aspm_support_quirk_check(void)

+{
+#if IS_ENABLED(CONFIG_X86)
+   struct cpuinfo_x86 *c = _data(0);
+
+   return !(c->x86 == 6 && c->x86_model == INTEL_FAM6_ALDERLAKE);
+#else
+   return true;
+#endif
+}
+
  static void vi_program_aspm(struct amdgpu_device *adev)
  {
u32 data, data1, orig;
bool bL1SS = false;
bool bClkReqSupport = true;
  
-	if (!amdgpu_device_should_use_aspm(adev))

+   if (!amdgpu_device_should_use_aspm(adev) || !aspm_support_quirk_check())
return;
  
  	if (adev->flags & AMD_IS_APU ||


Re: [PATCHv4] drm/amdgpu: disable ASPM on Intel Alder Lake based systems

2022-05-01 Thread Paul Menzel

Dear Richard,


Sorry for the late reply.

Am 26.04.22 um 15:53 schrieb Gong, Richard:


On 4/21/2022 12:35 AM, Paul Menzel wrote:



Am 21.04.22 um 03:12 schrieb Gong, Richard:


On 4/20/2022 3:29 PM, Paul Menzel wrote:



Am 19.04.22 um 23:46 schrieb Gong, Richard:


On 4/14/2022 2:52 AM, Paul Menzel wrote:

[Cc: -kernel test robot ]


[…]


Am 13.04.22 um 15:00 schrieb Alex Deucher:

On Wed, Apr 13, 2022 at 3:43 AM Paul Menzel wrote:



Thank you for sending out v4.

Am 12.04.22 um 23:50 schrieb Richard Gong:


[…]

I am still not clear, what “hang during suspend/resume” means. I 
guess
suspending works fine? During resume (S3 or S0ix?), where does 
it hang?

The system is functional, but there are only display problems?

System freeze after suspend/resume.


But you see certain messages still? At what point does it freeze 
exactly? In the bug report you posted Linux messages.


No, the system freeze then users have to recycle power to recover.


Then I misread the issue? Did you capture the messages over serial log 
then?


I think so. We captured dmesg log.


Then the (whole) system did *not* freeze, if you could still log in 
(maybe over network) and execute `dmesg`. Please also paste the 
amdgpu(?) error logs in the commit message.


As mentioned early we need support from Intel on how to get ASPM working 
for VI generation on Intel Alder Lake, but we don't know where things 
currently stand.


Who is working on this, and knows?


Kind regards,

Paul


Re: [PATCHv4] drm/amdgpu: disable ASPM on Intel Alder Lake based systems

2022-04-20 Thread Paul Menzel

Dear Richard,


Am 21.04.22 um 03:12 schrieb Gong, Richard:


On 4/20/2022 3:29 PM, Paul Menzel wrote:



Am 19.04.22 um 23:46 schrieb Gong, Richard:


On 4/14/2022 2:52 AM, Paul Menzel wrote:

[Cc: -kernel test robot ]


[…]


Am 13.04.22 um 15:00 schrieb Alex Deucher:

On Wed, Apr 13, 2022 at 3:43 AM Paul Menzel wrote:



Thank you for sending out v4.

Am 12.04.22 um 23:50 schrieb Richard Gong:
Active State Power Management (ASPM) feature is enabled since 
kernel 5.14.
There are some AMD GFX cards (such as WX3200 and RX640) that 
won't work
with ASPM-enabled Intel Alder Lake based systems. Using these GFX 
cards as
video/display output, Intel Alder Lake based systems will hang 
during

suspend/resume.


[Your email program wraps lines in cited text for some reason, making 
the citation harder to read.]


Not sure why, I am using Mozila Thunderbird for email. I am not using MS 
Outlook for upstream email.


Strange. No idea if there were bugs in Mozilla Thunderbird 91.2.0, 
released over half year ago. The current version is 91.8.1. [1]


I am still not clear, what “hang during suspend/resume” means. I 
guess
suspending works fine? During resume (S3 or S0ix?), where does it 
hang?

The system is functional, but there are only display problems?

System freeze after suspend/resume.


But you see certain messages still? At what point does it freeze 
exactly? In the bug report you posted Linux messages.


No, the system freeze then users have to recycle power to recover.


Then I misread the issue? Did you capture the messages over serial log then?

The issue was initially reported on one system (Dell Precision 
3660 with
BIOS version 0.14.81), but was later confirmed to affect at least 
4 Alder

Lake based systems.

Add extra check to disable ASPM on Intel Alder Lake based systems.

Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default")
Link: 
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fdrm%2Famd%2F-%2Fissues%2F1885data=05%7C01%7Crichard.gong%40amd.com%7Cce01de048c61456174ff08da230c750d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637860833680922036%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7Csdata=vqhh3dTc%2FgBt7GrP9hKppWlrFy2F7DaivkNEuGekl0g%3Dreserved=0 



Thank you Microsoft Outlook for keeping us safe. :(

I am not using MS Outlook for the email exchanges.


I guess, it’s not the client but the Microsoft email service (Exchange?) 
no idea adding these protection links. (Making it even harder for users 
to actually verify domain. No idea who comes up with these ideas, and 
customers actually accepting those.)




Reported-by: kernel test robot 


This tag is a little confusing. Maybe clarify that it was for an 
issue

in a previous patch iteration?


I did describe in change-list version 3 below, which corrected the 
build error with W=1 option.


It is not good idea to add the description for that to the commit 
message, this is why I add descriptions on change-list version 3.


Do as you wish, but the current style is confusing, and readers of the 
commit are going to think, the kernel test robot reported the problem 
with AMD VI ASICs and Intel Alder Lake systems.





Signed-off-by: Richard Gong 
---
v4: s/CONFIG_X86_64/CONFIG_X86
  enhanced check logic
v3: s/intel_core_asom_chk/aspm_support_quirk_check
  correct build error with W=1 option
v2: correct commit description
  move the check from chip family to problematic platform
---
   drivers/gpu/drm/amd/amdgpu/vi.c | 17 -
   1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c 
b/drivers/gpu/drm/amd/amdgpu/vi.c

index 039b90cdc3bc..b33e0a9bee65 100644
--- a/drivers/gpu/drm/amd/amdgpu/vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/vi.c
@@ -81,6 +81,10 @@
   #include "mxgpu_vi.h"
   #include "amdgpu_dm.h"

+#if IS_ENABLED(CONFIG_X86)
+#include 
+#endif
+
   #define ixPCIE_LC_L1_PM_SUBSTATE    0x100100C6
   #define 
PCIE_LC_L1_PM_SUBSTATE__LC_L1_SUBSTATES_OVERRIDE_EN_MASK 0x0001L
   #define PCIE_LC_L1_PM_SUBSTATE__LC_PCI_PM_L1_2_OVERRIDE_MASK 
0x0002L
@@ -1134,13 +1138,24 @@ static void vi_enable_aspm(struct 
amdgpu_device *adev)

   WREG32_PCIE(ixPCIE_LC_CNTL, data);
   }

+static bool aspm_support_quirk_check(void)
+{
+ if (IS_ENABLED(CONFIG_X86)) {
+ struct cpuinfo_x86 *c = _data(0);
+
+ return !(c->x86 == 6 && c->x86_model == 
INTEL_FAM6_ALDERLAKE);

+ }
+
+ return true;
+}
+
   static void vi_program_aspm(struct amdgpu_device *adev)
   {
   u32 data, data1, orig;
   bool bL1SS = false;
   bool bClkReqSupport = true;

- if (!amdgpu_device_should_use_aspm(adev))
+ if (!amdgpu_device_should_use_aspm(adev) || 
!aspm_support_quirk_check())

   return;


Can users still forcefully enable ASPM with the parameter 
`amdgpu.aspm`?


As Mario mentioned in 

Re: [PATCHv4] drm/amdgpu: disable ASPM on Intel Alder Lake based systems

2022-04-20 Thread Paul Menzel

Dear Richard,


Am 20.04.22 um 22:56 schrieb Gong, Richard:


On 4/20/2022 3:48 PM, Paul Menzel wrote:



Am 20.04.22 um 22:40 schrieb Alex Deucher:
On Wed, Apr 20, 2022 at 4:29 PM Paul Menzel  
wrote:



Am 19.04.22 um 23:46 schrieb Gong, Richard:


On 4/14/2022 2:52 AM, Paul Menzel wrote:

[Cc: -kernel test robot ]


[…]


Am 13.04.22 um 15:00 schrieb Alex Deucher:

On Wed, Apr 13, 2022 at 3:43 AM Paul Menzel wrote:



Thank you for sending out v4.

Am 12.04.22 um 23:50 schrieb Richard Gong:

Active State Power Management (ASPM) feature is enabled since
kernel 5.14.
There are some AMD GFX cards (such as WX3200 and RX640) that won't
work
with ASPM-enabled Intel Alder Lake based systems. Using these GFX
cards as
video/display output, Intel Alder Lake based systems will hang 
during

suspend/resume.


[Your email program wraps lines in cited text for some reason, making
the citation harder to read.]



I am still not clear, what “hang during suspend/resume” means. I 
guess
suspending works fine? During resume (S3 or S0ix?), where does 
it hang?

The system is functional, but there are only display problems?

System freeze after suspend/resume.


But you see certain messages still? At what point does it freeze
exactly? In the bug report you posted Linux messages.

The issue was initially reported on one system (Dell Precision 
3660

with
BIOS version 0.14.81), but was later confirmed to affect at 
least 4

Alder
Lake based systems.

Add extra check to disable ASPM on Intel Alder Lake based systems.

Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default")
Link:
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fdrm%2Famd%2F-%2Fissues%2F1885data=05%7C01%7Crichard.gong%40amd.com%7C487aaa63098b462e146a08da230f2319%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637860845178176835%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7Csdata=3IVldn05qNa2XVp1Lu58SriS8k9mk4U9K9p3F3IYPe0%3Dreserved=0 



Thank you Microsoft Outlook for keeping us safe. :(



Reported-by: kernel test robot 


This tag is a little confusing. Maybe clarify that it was for an 
issue

in a previous patch iteration?


I did describe in change-list version 3 below, which corrected the 
build

error with W=1 option.

It is not good idea to add the description for that to the commit
message, this is why I add descriptions on change-list version 3.


Do as you wish, but the current style is confusing, and readers of the
commit are going to think, the kernel test robot reported the problem
with AMD VI ASICs and Intel Alder Lake systems.




Signed-off-by: Richard Gong 
---
v4: s/CONFIG_X86_64/CONFIG_X86
   enhanced check logic
v3: s/intel_core_asom_chk/aspm_support_quirk_check
   correct build error with W=1 option
v2: correct commit description
   move the check from chip family to problematic platform
---
    drivers/gpu/drm/amd/amdgpu/vi.c | 17 -
    1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c
b/drivers/gpu/drm/amd/amdgpu/vi.c
index 039b90cdc3bc..b33e0a9bee65 100644
--- a/drivers/gpu/drm/amd/amdgpu/vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/vi.c
@@ -81,6 +81,10 @@
    #include "mxgpu_vi.h"
    #include "amdgpu_dm.h"

+#if IS_ENABLED(CONFIG_X86)
+#include 
+#endif
+
    #define ixPCIE_LC_L1_PM_SUBSTATE    0x100100C6
    #define 
PCIE_LC_L1_PM_SUBSTATE__LC_L1_SUBSTATES_OVERRIDE_EN_MASK

0x0001L
    #define PCIE_LC_L1_PM_SUBSTATE__LC_PCI_PM_L1_2_OVERRIDE_MASK
0x0002L
@@ -1134,13 +1138,24 @@ static void vi_enable_aspm(struct
amdgpu_device *adev)
    WREG32_PCIE(ixPCIE_LC_CNTL, data);
    }

+static bool aspm_support_quirk_check(void)
+{
+ if (IS_ENABLED(CONFIG_X86)) {
+ struct cpuinfo_x86 *c = _data(0);
+
+ return !(c->x86 == 6 && c->x86_model ==
INTEL_FAM6_ALDERLAKE);
+ }
+
+ return true;
+}
+
    static void vi_program_aspm(struct amdgpu_device *adev)
    {
    u32 data, data1, orig;
    bool bL1SS = false;
    bool bClkReqSupport = true;

- if (!amdgpu_device_should_use_aspm(adev))
+ if (!amdgpu_device_should_use_aspm(adev) ||
!aspm_support_quirk_check())
    return;


Can users still forcefully enable ASPM with the parameter
`amdgpu.aspm`?

As Mario mentioned in a separate reply, we can't forcefully enable 
ASPM

with the parameter 'amdgpu.aspm'.


That would be a regression on systems where ASPM used to work. Hmm. I
guess, you could say, there are no such systems.



    if (adev->flags & AMD_IS_APU ||


If I remember correctly, there were also newer cards, where ASPM 
worked
with Intel Alder Lake, right? Can only the problematic 
generations for

WX3200 and RX640 be excluded from ASPM?


This patch only disables it for the generatioaon that was 
problematic.


Could that please be made clear in the commit message summary, and
m

Re: [PATCHv4] drm/amdgpu: disable ASPM on Intel Alder Lake based systems

2022-04-20 Thread Paul Menzel

Dear Alex,


Am 20.04.22 um 22:40 schrieb Alex Deucher:

On Wed, Apr 20, 2022 at 4:29 PM Paul Menzel  wrote:



Am 19.04.22 um 23:46 schrieb Gong, Richard:


On 4/14/2022 2:52 AM, Paul Menzel wrote:

[Cc: -kernel test robot ]


[…]


Am 13.04.22 um 15:00 schrieb Alex Deucher:

On Wed, Apr 13, 2022 at 3:43 AM Paul Menzel wrote:



Thank you for sending out v4.

Am 12.04.22 um 23:50 schrieb Richard Gong:

Active State Power Management (ASPM) feature is enabled since
kernel 5.14.
There are some AMD GFX cards (such as WX3200 and RX640) that won't
work
with ASPM-enabled Intel Alder Lake based systems. Using these GFX
cards as
video/display output, Intel Alder Lake based systems will hang during
suspend/resume.


[Your email program wraps lines in cited text for some reason, making
the citation harder to read.]



I am still not clear, what “hang during suspend/resume” means. I guess
suspending works fine? During resume (S3 or S0ix?), where does it hang?
The system is functional, but there are only display problems?

System freeze after suspend/resume.


But you see certain messages still? At what point does it freeze
exactly? In the bug report you posted Linux messages.


The issue was initially reported on one system (Dell Precision 3660
with
BIOS version 0.14.81), but was later confirmed to affect at least 4
Alder
Lake based systems.

Add extra check to disable ASPM on Intel Alder Lake based systems.

Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default")
Link:
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fdrm%2Famd%2F-%2Fissues%2F1885data=04%7C01%7Crichard.gong%40amd.com%7Ce7febed5d6a441c3a58008da1debb99c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637855195670542145%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=7cEnE%2BSM9e5IGFxSLloCLtCOxovBpaPz0Ns0Ta2vVlc%3Dreserved=0


Thank you Microsoft Outlook for keeping us safe. :(



Reported-by: kernel test robot 


This tag is a little confusing. Maybe clarify that it was for an issue
in a previous patch iteration?


I did describe in change-list version 3 below, which corrected the build
error with W=1 option.

It is not good idea to add the description for that to the commit
message, this is why I add descriptions on change-list version 3.


Do as you wish, but the current style is confusing, and readers of the
commit are going to think, the kernel test robot reported the problem
with AMD VI ASICs and Intel Alder Lake systems.




Signed-off-by: Richard Gong 
---
v4: s/CONFIG_X86_64/CONFIG_X86
   enhanced check logic
v3: s/intel_core_asom_chk/aspm_support_quirk_check
   correct build error with W=1 option
v2: correct commit description
   move the check from chip family to problematic platform
---
drivers/gpu/drm/amd/amdgpu/vi.c | 17 -
1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c
b/drivers/gpu/drm/amd/amdgpu/vi.c
index 039b90cdc3bc..b33e0a9bee65 100644
--- a/drivers/gpu/drm/amd/amdgpu/vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/vi.c
@@ -81,6 +81,10 @@
#include "mxgpu_vi.h"
#include "amdgpu_dm.h"

+#if IS_ENABLED(CONFIG_X86)
+#include 
+#endif
+
#define ixPCIE_LC_L1_PM_SUBSTATE0x100100C6
#define PCIE_LC_L1_PM_SUBSTATE__LC_L1_SUBSTATES_OVERRIDE_EN_MASK
0x0001L
#define PCIE_LC_L1_PM_SUBSTATE__LC_PCI_PM_L1_2_OVERRIDE_MASK
0x0002L
@@ -1134,13 +1138,24 @@ static void vi_enable_aspm(struct
amdgpu_device *adev)
WREG32_PCIE(ixPCIE_LC_CNTL, data);
}

+static bool aspm_support_quirk_check(void)
+{
+ if (IS_ENABLED(CONFIG_X86)) {
+ struct cpuinfo_x86 *c = _data(0);
+
+ return !(c->x86 == 6 && c->x86_model ==
INTEL_FAM6_ALDERLAKE);
+ }
+
+ return true;
+}
+
static void vi_program_aspm(struct amdgpu_device *adev)
{
u32 data, data1, orig;
bool bL1SS = false;
bool bClkReqSupport = true;

- if (!amdgpu_device_should_use_aspm(adev))
+ if (!amdgpu_device_should_use_aspm(adev) ||
!aspm_support_quirk_check())
return;


Can users still forcefully enable ASPM with the parameter
`amdgpu.aspm`?


As Mario mentioned in a separate reply, we can't forcefully enable ASPM
with the parameter 'amdgpu.aspm'.


That would be a regression on systems where ASPM used to work. Hmm. I
guess, you could say, there are no such systems.



if (adev->flags & AMD_IS_APU ||


If I remember correctly, there were also newer cards, where ASPM worked
with Intel Alder Lake, right? Can only the problematic generations for
WX3200 and RX640 be excluded from ASPM?


This patch only disables it for the generatioaon that was problematic.


Could that please be made clear in the commit message summary, and
message?


Are you ok with the commit messages below?


Please change the commit message summary. Maybe:

drm/amdgpu: VI: Disa

Re: [PATCHv4] drm/amdgpu: disable ASPM on Intel Alder Lake based systems

2022-04-20 Thread Paul Menzel

Dear Richard,


Am 19.04.22 um 23:46 schrieb Gong, Richard:


On 4/14/2022 2:52 AM, Paul Menzel wrote:

[Cc: -kernel test robot ]


[…]


Am 13.04.22 um 15:00 schrieb Alex Deucher:

On Wed, Apr 13, 2022 at 3:43 AM Paul Menzel wrote:



Thank you for sending out v4.

Am 12.04.22 um 23:50 schrieb Richard Gong:
Active State Power Management (ASPM) feature is enabled since 
kernel 5.14.
There are some AMD GFX cards (such as WX3200 and RX640) that won't 
work
with ASPM-enabled Intel Alder Lake based systems. Using these GFX 
cards as

video/display output, Intel Alder Lake based systems will hang during
suspend/resume.


[Your email program wraps lines in cited text for some reason, making 
the citation harder to read.]




I am still not clear, what “hang during suspend/resume” means. I guess
suspending works fine? During resume (S3 or S0ix?), where does it hang?
The system is functional, but there are only display problems?

System freeze after suspend/resume.


But you see certain messages still? At what point does it freeze 
exactly? In the bug report you posted Linux messages.


The issue was initially reported on one system (Dell Precision 3660 
with
BIOS version 0.14.81), but was later confirmed to affect at least 4 
Alder

Lake based systems.

Add extra check to disable ASPM on Intel Alder Lake based systems.

Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default")
Link: 
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fdrm%2Famd%2F-%2Fissues%2F1885data=04%7C01%7Crichard.gong%40amd.com%7Ce7febed5d6a441c3a58008da1debb99c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637855195670542145%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=7cEnE%2BSM9e5IGFxSLloCLtCOxovBpaPz0Ns0Ta2vVlc%3Dreserved=0


Thank you Microsoft Outlook for keeping us safe. :(



Reported-by: kernel test robot 


This tag is a little confusing. Maybe clarify that it was for an issue
in a previous patch iteration?


I did describe in change-list version 3 below, which corrected the build 
error with W=1 option.


It is not good idea to add the description for that to the commit 
message, this is why I add descriptions on change-list version 3.


Do as you wish, but the current style is confusing, and readers of the 
commit are going to think, the kernel test robot reported the problem 
with AMD VI ASICs and Intel Alder Lake systems.





Signed-off-by: Richard Gong 
---
v4: s/CONFIG_X86_64/CONFIG_X86
  enhanced check logic
v3: s/intel_core_asom_chk/aspm_support_quirk_check
  correct build error with W=1 option
v2: correct commit description
  move the check from chip family to problematic platform
---
   drivers/gpu/drm/amd/amdgpu/vi.c | 17 -
   1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c 
b/drivers/gpu/drm/amd/amdgpu/vi.c

index 039b90cdc3bc..b33e0a9bee65 100644
--- a/drivers/gpu/drm/amd/amdgpu/vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/vi.c
@@ -81,6 +81,10 @@
   #include "mxgpu_vi.h"
   #include "amdgpu_dm.h"

+#if IS_ENABLED(CONFIG_X86)
+#include 
+#endif
+
   #define ixPCIE_LC_L1_PM_SUBSTATE    0x100100C6
   #define PCIE_LC_L1_PM_SUBSTATE__LC_L1_SUBSTATES_OVERRIDE_EN_MASK 
0x0001L
   #define PCIE_LC_L1_PM_SUBSTATE__LC_PCI_PM_L1_2_OVERRIDE_MASK 
0x0002L
@@ -1134,13 +1138,24 @@ static void vi_enable_aspm(struct 
amdgpu_device *adev)

   WREG32_PCIE(ixPCIE_LC_CNTL, data);
   }

+static bool aspm_support_quirk_check(void)
+{
+ if (IS_ENABLED(CONFIG_X86)) {
+ struct cpuinfo_x86 *c = _data(0);
+
+ return !(c->x86 == 6 && c->x86_model == 
INTEL_FAM6_ALDERLAKE);

+ }
+
+ return true;
+}
+
   static void vi_program_aspm(struct amdgpu_device *adev)
   {
   u32 data, data1, orig;
   bool bL1SS = false;
   bool bClkReqSupport = true;

- if (!amdgpu_device_should_use_aspm(adev))
+ if (!amdgpu_device_should_use_aspm(adev) || 
!aspm_support_quirk_check())

   return;


Can users still forcefully enable ASPM with the parameter 
`amdgpu.aspm`?


As Mario mentioned in a separate reply, we can't forcefully enable ASPM 
with the parameter 'amdgpu.aspm'.


That would be a regression on systems where ASPM used to work. Hmm. I 
guess, you could say, there are no such systems.




   if (adev->flags & AMD_IS_APU ||


If I remember correctly, there were also newer cards, where ASPM worked
with Intel Alder Lake, right? Can only the problematic generations for
WX3200 and RX640 be excluded from ASPM?


This patch only disables it for the generatioaon that was problematic.


Could that please be made clear in the commit message summary, and 
message?


Are you ok with the commit messages below?


Please change the commit message summary. Maybe:

drm/amdgpu: VI: Disable ASPM on Intel Alder Lake based systems


Active State Power Management (ASPM) featur

Re: [PATCH] drm/amdgpu: clean up psp ip if hw_init failed v2

2022-04-20 Thread Paul Menzel

Dear Alice,


Thank you for your patch.

Am 20.04.22 um 21:37 schrieb Alice Wong:

Call psp_hw_fini when psp_hw_init failed.


Please amend the commit message, and add the motivation/reasoning too [1].

I think it’s common, if a patch (series) is rerolled to list the changes 
between the versions.



Kind regards,

Paul


[1]: https://cbea.ms/git-commit/


Signed-off-by: Alice Wong 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 57 +
  1 file changed, 29 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 5b9e48d44f5b..52b14efa848a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -2537,6 +2537,34 @@ static int psp_load_fw(struct amdgpu_device *adev)
return ret;
  }
  
+static int psp_hw_fini(void *handle)

+{
+   struct amdgpu_device *adev = (struct amdgpu_device *)handle;
+   struct psp_context *psp = >psp;
+
+   if (psp->ta_fw) {
+   psp_ras_terminate(psp);
+   psp_securedisplay_terminate(psp);
+   psp_rap_terminate(psp);
+   psp_dtm_terminate(psp);
+   psp_hdcp_terminate(psp);
+   }
+
+   psp_asd_terminate(psp);
+
+   psp_tmr_terminate(psp);
+   psp_ring_destroy(psp, PSP_RING_TYPE__KM);
+
+   amdgpu_bo_free_kernel(>fw_pri_bo,
+ >fw_pri_mc_addr, >fw_pri_buf);
+   amdgpu_bo_free_kernel(>fence_buf_bo,
+ >fence_buf_mc_addr, >fence_buf);
+   amdgpu_bo_free_kernel(>cmd_buf_bo, >cmd_buf_mc_addr,
+ (void **)>cmd_buf_mem);
+
+   return 0;
+}
+
  static int psp_hw_init(void *handle)
  {
int ret;
@@ -2563,37 +2591,10 @@ static int psp_hw_init(void *handle)
  failed:
adev->firmware.load_type = AMDGPU_FW_LOAD_DIRECT;
mutex_unlock(>firmware.mutex);
+   psp_hw_fini(handle);
return -EINVAL;
  }
  
-static int psp_hw_fini(void *handle)

-{
-   struct amdgpu_device *adev = (struct amdgpu_device *)handle;
-   struct psp_context *psp = >psp;
-
-   if (psp->ta_fw) {
-   psp_ras_terminate(psp);
-   psp_securedisplay_terminate(psp);
-   psp_rap_terminate(psp);
-   psp_dtm_terminate(psp);
-   psp_hdcp_terminate(psp);
-   }
-
-   psp_asd_terminate(psp);
-
-   psp_tmr_terminate(psp);
-   psp_ring_destroy(psp, PSP_RING_TYPE__KM);
-
-   amdgpu_bo_free_kernel(>fw_pri_bo,
- >fw_pri_mc_addr, >fw_pri_buf);
-   amdgpu_bo_free_kernel(>fence_buf_bo,
- >fence_buf_mc_addr, >fence_buf);
-   amdgpu_bo_free_kernel(>cmd_buf_bo, >cmd_buf_mc_addr,
- (void **)>cmd_buf_mem);
-
-   return 0;
-}
-
  static int psp_suspend(void *handle)
  {
int ret;


Re: [PATCH] drm/amdgpu: don't runtime suspend if there are displays attached (v2)

2022-04-19 Thread Paul Menzel

Dear Alex,


Thank you for the patch.

Am 13.04.22 um 22:15 schrieb Alex Deucher:

We normally runtime suspend when there are displays attached if they
are in the DPMS off state, however, if something wakes the GPU
we send a hotplug event on resume (in case any displays were connected
while the GPU was in suspend) which can cause userspace to light
up the displays again soon after they were turned off.

Prior to
commit 087451f372bf76 ("drm/amdgpu: use generic fb helpers instead of setting up AMD 
own's."),
the driver took a runtime pm reference when the fbdev emulation was
enabled because we didn't implement proper shadowing support for
vram access when the device was off so the device never runtime
suspended when there was a console bound.  Once that commit landed,
we now utilize the core fb helper implementation which properly
handles the emulation, so runtime pm now suspends in cases where it did
not before.  Ultimately, we need to sort out why runtime suspend in not
working in this case for some users, but this should restore similar
behavior to before.

v2: move check into runtime_suspend

Fixes: 087451f372bf76 ("drm/amdgpu: use generic fb helpers instead of setting up AMD 
own's.")
Tested-by: Michele Ballabio 


On what system and device?


Signed-off-by: Alex Deucher 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 107 
  1 file changed, 72 insertions(+), 35 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 4efaa183abcd..97a1aa02d76e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2395,6 +2395,71 @@ static int amdgpu_pmops_restore(struct device *dev)
return amdgpu_device_resume(drm_dev, true);
  }
  
+static int amdgpu_runtime_idle_check_display(struct device *dev)

+{
+   struct pci_dev *pdev = to_pci_dev(dev);
+   struct drm_device *drm_dev = pci_get_drvdata(pdev);
+   struct amdgpu_device *adev = drm_to_adev(drm_dev);
+
+   if (adev->mode_info.num_crtc) {
+   struct drm_connector *list_connector;
+   struct drm_connector_list_iter iter;
+   int ret = 0;
+
+   /* XXX: Return busy if any displays are connected to avoid
+* possible display wake ups after runtime resume due to


Nit: wakeups


+* hotplug events in case any displays were connected while
+* the GPU was in suspend.  Remove this once that is fixed.
+*/


Do you have an (internal) issue to track this?


+   mutex_lock(_dev->mode_config.mutex);
+   drm_connector_list_iter_begin(drm_dev, );
+   drm_for_each_connector_iter(list_connector, ) {
+   if (list_connector->status == 
connector_status_connected) {
+   ret = -EBUSY;
+   break;
+   }
+   }
+   drm_connector_list_iter_end();
+   mutex_unlock(_dev->mode_config.mutex);
+
+   if (ret)
+   return ret;
+
+   if (amdgpu_device_has_dc_support(adev)) {
+   struct drm_crtc *crtc;
+
+   drm_for_each_crtc(crtc, drm_dev) {
+   drm_modeset_lock(>mutex, NULL);
+   if (crtc->state->active)
+   ret = -EBUSY;
+   drm_modeset_unlock(>mutex);
+   if (ret < 0)
+   break;
+   }
+   } else {
+   mutex_lock(_dev->mode_config.mutex);
+   
drm_modeset_lock(_dev->mode_config.connection_mutex, NULL);
+
+   drm_connector_list_iter_begin(drm_dev, );
+   drm_for_each_connector_iter(list_connector, ) {
+   if (list_connector->dpms ==  DRM_MODE_DPMS_ON) {
+   ret = -EBUSY;
+   break;
+   }
+   }
+
+   drm_connector_list_iter_end();
+
+   
drm_modeset_unlock(_dev->mode_config.connection_mutex);
+   mutex_unlock(_dev->mode_config.mutex);
+   }
+   if (ret)
+   return ret;
+   }
+
+   return 0;
+}
+
  static int amdgpu_pmops_runtime_suspend(struct device *dev)
  {
struct pci_dev *pdev = to_pci_dev(dev);
@@ -2407,6 +2472,10 @@ static int amdgpu_pmops_runtime_suspend(struct device 
*dev)
return -EBUSY;
}
  
+	ret = amdgpu_runtime_idle_check_display(dev);

+   if (ret)
+   return ret;
+
/* wait for all rings to drain before suspending */
for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
 

Re: [PATCH v2 2/2] drm/amdkfd: CRIU add support for GWS queues

2022-04-19 Thread Paul Menzel



Dear David,


Thank you for sending out v3 of these patches.

Am 19.04.22 um 02:04 schrieb Yat Sin, David:




-Original Message-
From: Paul Menzel 
Sent: Monday, April 18, 2022 4:23 PM


[…]

Am 18.04.22 um 18:44 schrieb David Yat Sin:

In the commit message summary, you could reorder some words:

Add CRIU support for GWS queues


Adding support to checkpoint/restore GWS(Global Wave Sync) queues.


s/Adding/Add/


Did you miss the two comments above?


Please add a space before the (.

ACK


How can this be tested?

We have some internal tests that can we be used to specifically test this 
feature.


Nice. Are you going to publish these in the future?


Signed-off-by: David Yat Sin 
---
   drivers/gpu/drm/amd/amdkfd/kfd_priv.h  |  2 +-
   drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 10 +++---
   2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index f36062be9ca8..192dbef04c43 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1102,7 +1102,7 @@ struct kfd_criu_queue_priv_data {
uint32_t priority;
uint32_t q_percent;
uint32_t doorbell_id;
-   uint32_t is_gws;
+   uint32_t gws;


Why is the new name better?

The old variable (is_gws) was obtained from the queue_properties
structure during checkpoint and is only used temporarily during queue
creation, so this variable cannot be used to determine whether a
queue as gws enabled. The new variable (gws) is obtained from the
queue structure. The name is changed to better reflect this.


Further down you seem to use it like a boolean though. So a name
reflecting that would be nice.


uint32_t sdma_id;
uint32_t eop_ring_buffer_size;
uint32_t ctx_save_restore_area_size; diff --git
a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
index 6eca9509f2e3..4f58e671d39b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
@@ -636,6 +636,8 @@ static int criu_checkpoint_queue(struct

kfd_process_device *pdd,

q_data->ctx_save_restore_area_size =
q->properties.ctx_save_restore_area_size;

+   q_data->gws = !!q->gws;
+
ret = pqm_checkpoint_mqd(>process->pqm, q-> properties.queue_id, 
mqd, ctl_stack);
if (ret) {
pr_err("Failed checkpoint queue_mqd (%d)\n", ret); @@ -743,7
+745,6 @@ static void set_queue_properties_from_criu(struct queue_properties 
*qp,
  struct kfd_criu_queue_priv_data 
*q_data)
   {
qp->is_interop = false;
-   qp->is_gws = q_data->is_gws;
qp->queue_percent = q_data->q_percent;
qp->priority = q_data->priority;
qp->queue_address = q_data->q_address; @@ -826,12 +827,15 @@

int kfd_criu_restore_queue(struct kfd_process *p,

NULL);
if (ret) {
pr_err("Failed to create new queue err:%d\n", ret);
-   ret = -EINVAL;
+   goto exit;
}

+   if (q_data->gws)
+   ret = pqm_set_gws(>pqm, q_data->q_id, pdd->dev->gws);
+
   exit:
if (ret)
-   pr_err("Failed to create queue (%d)\n", ret);
+   pr_err("Failed to restore queue (%d)\n", ret);


Maybe separate this out, so it can be applied to stable series.


Did you miss this comment?


else
pr_debug("Queue id %d was restored successfully\n", queue_id);




Kind regards,

Paul


Re: [PATCH v3 1/2] drm/amdkfd: Fix GWS queue count

2022-04-19 Thread Paul Menzel

Dear David,


Thank you for rerolling the patches.

Am 19.04.22 um 03:32 schrieb David Yat Sin:

dqm->gws_queue_count and pdd->qpd.mapped_gws_queue need to be updated
each time the queue gets evicted.

Fixes: b8020b0304c8 ("drm/amdkfd: Enable over-subscription with >1 GWS queue")



For the future, no blank line is needed to separate the Fixes tag from 
the rest of the block.



Kind regards,

Paul



Signed-off-by: David Yat Sin 
Reviewed-by: Felix Kuehling 
---
  .../drm/amd/amdkfd/kfd_device_queue_manager.c | 83 +--
  1 file changed, 37 insertions(+), 46 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index acf4f7975850..198672264492 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -130,19 +130,33 @@ void program_sh_mem_settings(struct device_queue_manager 
*dqm,
  }
  
  static void increment_queue_count(struct device_queue_manager *dqm,

-   enum kfd_queue_type type)
+ struct qcm_process_device *qpd,
+ struct queue *q)
  {
dqm->active_queue_count++;
-   if (type == KFD_QUEUE_TYPE_COMPUTE || type == KFD_QUEUE_TYPE_DIQ)
+   if (q->properties.type == KFD_QUEUE_TYPE_COMPUTE ||
+   q->properties.type == KFD_QUEUE_TYPE_DIQ)
dqm->active_cp_queue_count++;
+
+   if (q->properties.is_gws) {
+   dqm->gws_queue_count++;
+   qpd->mapped_gws_queue = true;
+   }
  }
  
  static void decrement_queue_count(struct device_queue_manager *dqm,

-   enum kfd_queue_type type)
+ struct qcm_process_device *qpd,
+ struct queue *q)
  {
dqm->active_queue_count--;
-   if (type == KFD_QUEUE_TYPE_COMPUTE || type == KFD_QUEUE_TYPE_DIQ)
+   if (q->properties.type == KFD_QUEUE_TYPE_COMPUTE ||
+   q->properties.type == KFD_QUEUE_TYPE_DIQ)
dqm->active_cp_queue_count--;
+
+   if (q->properties.is_gws) {
+   dqm->gws_queue_count--;
+   qpd->mapped_gws_queue = false;
+   }
  }
  
  /*

@@ -412,7 +426,7 @@ static int create_queue_nocpsch(struct device_queue_manager 
*dqm,
list_add(>list, >queues_list);
qpd->queue_count++;
if (q->properties.is_active)
-   increment_queue_count(dqm, q->properties.type);
+   increment_queue_count(dqm, qpd, q);
  
  	/*

 * Unconditionally increment this counter, regardless of the queue's
@@ -601,13 +615,8 @@ static int destroy_queue_nocpsch_locked(struct 
device_queue_manager *dqm,
deallocate_vmid(dqm, qpd, q);
}
qpd->queue_count--;
-   if (q->properties.is_active) {
-   decrement_queue_count(dqm, q->properties.type);
-   if (q->properties.is_gws) {
-   dqm->gws_queue_count--;
-   qpd->mapped_gws_queue = false;
-   }
-   }
+   if (q->properties.is_active)
+   decrement_queue_count(dqm, qpd, q);
  
  	return retval;

  }
@@ -700,12 +709,11 @@ static int update_queue(struct device_queue_manager *dqm, 
struct queue *q,
 * dqm->active_queue_count to determine whether a new runlist must be
 * uploaded.
 */
-   if (q->properties.is_active && !prev_active)
-   increment_queue_count(dqm, q->properties.type);
-   else if (!q->properties.is_active && prev_active)
-   decrement_queue_count(dqm, q->properties.type);
-
-   if (q->gws && !q->properties.is_gws) {
+   if (q->properties.is_active && !prev_active) {
+   increment_queue_count(dqm, >qpd, q);
+   } else if (!q->properties.is_active && prev_active) {
+   decrement_queue_count(dqm, >qpd, q);
+   } else if (q->gws && !q->properties.is_gws) {
if (q->properties.is_active) {
dqm->gws_queue_count++;
pdd->qpd.mapped_gws_queue = true;
@@ -767,11 +775,7 @@ static int evict_process_queues_nocpsch(struct 
device_queue_manager *dqm,
mqd_mgr = dqm->mqd_mgrs[get_mqd_type_from_queue_type(
q->properties.type)];
q->properties.is_active = false;
-   decrement_queue_count(dqm, q->properties.type);
-   if (q->properties.is_gws) {
-   dqm->gws_queue_count--;
-   qpd->mapped_gws_queue = false;
-   }
+   decrement_queue_count(dqm, qpd, q);
  
  		if (WARN_ONCE(!dqm->sched_running, "Evict when stopped\n"))

continue;
@@ -817,7 +821,7 @@ static int evict_process_queues_cpsch(struct 
device_queue_manager *dqm,
continue;
  
  		q->properties.is_active = false;

- 

Re: [PATCH v2 2/2] drm/amdkfd: CRIU add support for GWS queues

2022-04-18 Thread Paul Menzel

Dear David,


Thank you for your patch.

Am 18.04.22 um 18:44 schrieb David Yat Sin:

In the commit message summary, you could reorder some words:

Add CRIU support for GWS queues


Adding support to checkpoint/restore GWS(Global Wave Sync) queues.


s/Adding/Add/

Please add a space before the (.

How can this be tested?


Signed-off-by: David Yat Sin 
---
  drivers/gpu/drm/amd/amdkfd/kfd_priv.h  |  2 +-
  drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 10 +++---
  2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index f36062be9ca8..192dbef04c43 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1102,7 +1102,7 @@ struct kfd_criu_queue_priv_data {
uint32_t priority;
uint32_t q_percent;
uint32_t doorbell_id;
-   uint32_t is_gws;
+   uint32_t gws;


Why is the new name better?


uint32_t sdma_id;
uint32_t eop_ring_buffer_size;
uint32_t ctx_save_restore_area_size;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
index 6eca9509f2e3..4f58e671d39b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
@@ -636,6 +636,8 @@ static int criu_checkpoint_queue(struct kfd_process_device 
*pdd,
q_data->ctx_save_restore_area_size =
q->properties.ctx_save_restore_area_size;
  
+	q_data->gws = !!q->gws;

+
ret = pqm_checkpoint_mqd(>process->pqm, q->properties.queue_id, 
mqd, ctl_stack);
if (ret) {
pr_err("Failed checkpoint queue_mqd (%d)\n", ret);
@@ -743,7 +745,6 @@ static void set_queue_properties_from_criu(struct 
queue_properties *qp,
  struct kfd_criu_queue_priv_data 
*q_data)
  {
qp->is_interop = false;
-   qp->is_gws = q_data->is_gws;
qp->queue_percent = q_data->q_percent;
qp->priority = q_data->priority;
qp->queue_address = q_data->q_address;
@@ -826,12 +827,15 @@ int kfd_criu_restore_queue(struct kfd_process *p,
NULL);
if (ret) {
pr_err("Failed to create new queue err:%d\n", ret);
-   ret = -EINVAL;
+   goto exit;
}
  
+	if (q_data->gws)

+   ret = pqm_set_gws(>pqm, q_data->q_id, pdd->dev->gws);
+
  exit:
if (ret)
-   pr_err("Failed to create queue (%d)\n", ret);
+   pr_err("Failed to restore queue (%d)\n", ret);


Maybe separate this out, so it can be applied to stable series.


else
pr_debug("Queue id %d was restored successfully\n", queue_id);
  



Kind regards,

Paul


Re: [PATCH v2 1/2] drm/amdkfd: Fix GWS queue count

2022-04-18 Thread Paul Menzel

Dear David,


Thank you for your patch.

Am 18.04.22 um 18:44 schrieb David Yat Sin:

dqm->gws_queue_count and pdd->qpd.mapped_gws_queue needs to be updated


s/needs/need/


each time the queue gets evicted.


Why?

Do you only change the case, when an element of the queue gets evicted?

Next time, a short note about the implementation would be nic.e

No Fixes tag?


Kind regards,

Paul



Signed-off-by: David Yat Sin 
---
  .../drm/amd/amdkfd/kfd_device_queue_manager.c | 83 +--
  1 file changed, 37 insertions(+), 46 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index acf4f7975850..198672264492 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -130,19 +130,33 @@ void program_sh_mem_settings(struct device_queue_manager 
*dqm,
  }
  
  static void increment_queue_count(struct device_queue_manager *dqm,

-   enum kfd_queue_type type)
+ struct qcm_process_device *qpd,
+ struct queue *q)
  {
dqm->active_queue_count++;
-   if (type == KFD_QUEUE_TYPE_COMPUTE || type == KFD_QUEUE_TYPE_DIQ)
+   if (q->properties.type == KFD_QUEUE_TYPE_COMPUTE ||
+   q->properties.type == KFD_QUEUE_TYPE_DIQ)
dqm->active_cp_queue_count++;
+
+   if (q->properties.is_gws) {
+   dqm->gws_queue_count++;
+   qpd->mapped_gws_queue = true;
+   }
  }
  
  static void decrement_queue_count(struct device_queue_manager *dqm,

-   enum kfd_queue_type type)
+ struct qcm_process_device *qpd,
+ struct queue *q)
  {
dqm->active_queue_count--;
-   if (type == KFD_QUEUE_TYPE_COMPUTE || type == KFD_QUEUE_TYPE_DIQ)
+   if (q->properties.type == KFD_QUEUE_TYPE_COMPUTE ||
+   q->properties.type == KFD_QUEUE_TYPE_DIQ)
dqm->active_cp_queue_count--;
+
+   if (q->properties.is_gws) {
+   dqm->gws_queue_count--;
+   qpd->mapped_gws_queue = false;
+   }
  }
  
  /*

@@ -412,7 +426,7 @@ static int create_queue_nocpsch(struct device_queue_manager 
*dqm,
list_add(>list, >queues_list);
qpd->queue_count++;
if (q->properties.is_active)
-   increment_queue_count(dqm, q->properties.type);
+   increment_queue_count(dqm, qpd, q);
  
  	/*

 * Unconditionally increment this counter, regardless of the queue's
@@ -601,13 +615,8 @@ static int destroy_queue_nocpsch_locked(struct 
device_queue_manager *dqm,
deallocate_vmid(dqm, qpd, q);
}
qpd->queue_count--;
-   if (q->properties.is_active) {
-   decrement_queue_count(dqm, q->properties.type);
-   if (q->properties.is_gws) {
-   dqm->gws_queue_count--;
-   qpd->mapped_gws_queue = false;
-   }
-   }
+   if (q->properties.is_active)
+   decrement_queue_count(dqm, qpd, q);
  
  	return retval;

  }
@@ -700,12 +709,11 @@ static int update_queue(struct device_queue_manager *dqm, 
struct queue *q,
 * dqm->active_queue_count to determine whether a new runlist must be
 * uploaded.
 */
-   if (q->properties.is_active && !prev_active)
-   increment_queue_count(dqm, q->properties.type);
-   else if (!q->properties.is_active && prev_active)
-   decrement_queue_count(dqm, q->properties.type);
-
-   if (q->gws && !q->properties.is_gws) {
+   if (q->properties.is_active && !prev_active) {
+   increment_queue_count(dqm, >qpd, q);
+   } else if (!q->properties.is_active && prev_active) {
+   decrement_queue_count(dqm, >qpd, q);
+   } else if (q->gws && !q->properties.is_gws) {
if (q->properties.is_active) {
dqm->gws_queue_count++;
pdd->qpd.mapped_gws_queue = true;
@@ -767,11 +775,7 @@ static int evict_process_queues_nocpsch(struct 
device_queue_manager *dqm,
mqd_mgr = dqm->mqd_mgrs[get_mqd_type_from_queue_type(
q->properties.type)];
q->properties.is_active = false;
-   decrement_queue_count(dqm, q->properties.type);
-   if (q->properties.is_gws) {
-   dqm->gws_queue_count--;
-   qpd->mapped_gws_queue = false;
-   }
+   decrement_queue_count(dqm, qpd, q);
  
  		if (WARN_ONCE(!dqm->sched_running, "Evict when stopped\n"))

continue;
@@ -817,7 +821,7 @@ static int evict_process_queues_cpsch(struct 
device_queue_manager *dqm,
continue;
  
  		q->properties.is_active = false;

-   

Re: [PATCH 1/2] Documentation/gpu: Add entries to amdgpu glossary

2022-04-17 Thread Paul Menzel

Dear Tales,


Thank you for your patch.

Am 15.04.22 um 21:50 schrieb Tales Lelo da Aparecida:

Add missing acronyms to the amdgppu glossary.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/1939#note_1309737
Signed-off-by: Tales Lelo da Aparecida 
---
  Documentation/gpu/amdgpu/amdgpu-glossary.rst | 13 +
  1 file changed, 13 insertions(+)

diff --git a/Documentation/gpu/amdgpu/amdgpu-glossary.rst 
b/Documentation/gpu/amdgpu/amdgpu-glossary.rst
index 859dcec6c6f9..48829d097f40 100644
--- a/Documentation/gpu/amdgpu/amdgpu-glossary.rst
+++ b/Documentation/gpu/amdgpu/amdgpu-glossary.rst
@@ -8,12 +8,19 @@ we have a dedicated glossary for Display Core at
  
  .. glossary::
  
+active_cu_number

+  The number of CUs that are active on the system.  The number of active
+  CUs may be less than SE * SH * CU depending on the board configuration.
+
  CP
Command Processor
  
  CPLIB

Content Protection Library
  
+CU

+  Compute unit


Capitalize the U in *unit* as seems to be done in the rest of the files?


+
  DFS
Digital Frequency Synthesizer
  
@@ -74,6 +81,12 @@ we have a dedicated glossary for Display Core at

  SDMA
System DMA
  
+SE

+  Shader Engine
+
+SH
+  SHader array


No idea if the H should be capitalized.


+
  SMU
System Management Unit
  



Kind regards,

Paul


Re: [PATCH] drm/amdkfd: only allow heavy-weight TLB flush on some ASICs for SVM too

2022-04-14 Thread Paul Menzel

Dear Lang,


Am 15.04.22 um 05:20 schrieb Lang Yu:

On 04/14/ , Paul Menzel wrote:



Am 14.04.22 um 10:19 schrieb Lang Yu:

The idea is from commit a50fe7078035 ("drm/amdkfd: Only apply heavy-weight
TLB flush on Aldebaran") and commit f61c40c0757a ("drm/amdkfd: enable
heavy-weight TLB flush on Arcturus"). Otherwise, we will run into problems
on some ASICs when running SVM applications.


Please list the ASICs, you know of having problems, and even add how to
reproduce this.


Actually, this is ported from previous commits. You can find more details
from the commits I mentioned. At the moment the ASICs except Aldebaran
and Arcturus probably have the problem.


I think, it’s always good to make it as easy as possible for reviewers 
and, later, people reading a commit, and include the necessary 
information directly in the commit message. It’d be great if you amended 
the commit message.



And running a SVM application could reproduce the issue.


Thanks. How will it fail though?

(Also, a small implementation note would be nice to have. Maybe: Move 
the helper function into the header `kfd_priv.h`, and use in 
`svm_range_unmap_from_gpus()`.)



Kind regards,

Paul



Signed-off-by: Lang Yu 
---
   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 8 
   drivers/gpu/drm/amd/amdkfd/kfd_priv.h| 8 
   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 4 +++-
   3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 91f82a9ccdaf..459f59e3d0ed 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1128,14 +1128,6 @@ static int kfd_ioctl_free_memory_of_gpu(struct file 
*filep,
return ret;
   }
-static bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)
-{
-   return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
-   (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) &&
-   dev->adev->sdma.instance[0].fw_version >= 18) ||
-   KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
-}
-
   static int kfd_ioctl_map_memory_to_gpu(struct file *filep,
struct kfd_process *p, void *data)
   {
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 8a43def1f638..aff6f598ff2c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1328,6 +1328,14 @@ void kfd_signal_poison_consumed_event(struct kfd_dev 
*dev, u32 pasid);
   void kfd_flush_tlb(struct kfd_process_device *pdd, enum TLB_FLUSH_TYPE type);
+static inline bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)
+{
+   return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
+  (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) &&
+  dev->adev->sdma.instance[0].fw_version >= 18) ||
+  KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
+}
+
   bool kfd_is_locked(void);
   /* Compute profile */
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 459fa07a3bcc..5afe216cf099 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1229,7 +1229,9 @@ svm_range_unmap_from_gpus(struct svm_range *prange, 
unsigned long start,
if (r)
break;
}
-   kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT);
+
+   if (kfd_flush_tlb_after_unmap(pdd->dev))
+   kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT);
}
return r;


Re: [PATCH] drm/amd/amdgpu: Remove static from variable in RLCG Reg RW.

2022-04-14 Thread Paul Menzel

Dear Gavin,


Thank you for your patch.

Am 13.04.22 um 17:26 schrieb Gavin Wan:

Should you re-roll your patch (v2), please remove the dot/period from 
the end of the git commit message summary (subject).



[why] These static variables saves the RLC Scratch registers address.


s/saves/save/


   When we installed multiple GPUs (for example: XGMI setting) and


s/installed/install/


   multiple GPUs call the function at same time. The RLC Scratch


… same time, the RLC …


   registers address are changed each other. Then it caused


s/caused/causes/


   reading/writing to wrong GPU.


I see from other patches posted here, that [why] is put on a separate 
line, so you do not need to indent the text.


[why]

These static …



[fix] Removed the static from the variables. The variables are
   in stack.


Same here, though *how* instead of *fix* seems more common.

s/Removed/Remove/
s/in stack/on the stack/



Signed-off-by: Gavin Wan 
Change-Id: Iee78849291d4f7a9688ecc5165bec70ee85cdfbe


Without the Gerrit URL that line is useless.


Kind regards.

Paul



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 10 +-
  1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index d5eea031c3e3..d18a05a20566 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -868,11 +868,11 @@ static u32 amdgpu_virt_rlcg_reg_rw(struct amdgpu_device 
*adev, u32 offset, u32 v
uint32_t timeout = 5;
uint32_t i, tmp;
uint32_t ret = 0;
-   static void *scratch_reg0;
-   static void *scratch_reg1;
-   static void *scratch_reg2;
-   static void *scratch_reg3;
-   static void *spare_int;
+   void *scratch_reg0;
+   void *scratch_reg1;
+   void *scratch_reg2;
+   void *scratch_reg3;
+   void *spare_int;
  
  	if (!adev->gfx.rlc.rlcg_reg_access_supported) {

dev_err(adev->dev,


Re: [PATCH] drm/amdkfd: only allow heavy-weight TLB flush on some ASICs for SVM too

2022-04-14 Thread Paul Menzel

Dear Lang,


Thank you for the patch.

Am 14.04.22 um 10:19 schrieb Lang Yu:

The idea is from commit a50fe7078035 ("drm/amdkfd: Only apply heavy-weight
TLB flush on Aldebaran") and commit f61c40c0757a ("drm/amdkfd: enable
heavy-weight TLB flush on Arcturus"). Otherwise, we will run into problems
on some ASICs when running SVM applications.


Please list the ASICs, you know of having problems, and even add how to 
reproduce this.



Kind regards,

Paul



Signed-off-by: Lang Yu 
---
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 8 
  drivers/gpu/drm/amd/amdkfd/kfd_priv.h| 8 
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 4 +++-
  3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 91f82a9ccdaf..459f59e3d0ed 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1128,14 +1128,6 @@ static int kfd_ioctl_free_memory_of_gpu(struct file 
*filep,
return ret;
  }
  
-static bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)

-{
-   return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
-   (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) &&
-   dev->adev->sdma.instance[0].fw_version >= 18) ||
-   KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
-}
-
  static int kfd_ioctl_map_memory_to_gpu(struct file *filep,
struct kfd_process *p, void *data)
  {
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 8a43def1f638..aff6f598ff2c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1328,6 +1328,14 @@ void kfd_signal_poison_consumed_event(struct kfd_dev 
*dev, u32 pasid);
  
  void kfd_flush_tlb(struct kfd_process_device *pdd, enum TLB_FLUSH_TYPE type);
  
+static inline bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)

+{
+   return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
+  (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) &&
+  dev->adev->sdma.instance[0].fw_version >= 18) ||
+  KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
+}
+
  bool kfd_is_locked(void);
  
  /* Compute profile */

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 459fa07a3bcc..5afe216cf099 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1229,7 +1229,9 @@ svm_range_unmap_from_gpus(struct svm_range *prange, 
unsigned long start,
if (r)
break;
}
-   kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT);
+
+   if (kfd_flush_tlb_after_unmap(pdd->dev))
+   kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT);
}
  
  	return r;


Re: [PATCHv4] drm/amdgpu: disable ASPM on Intel Alder Lake based systems

2022-04-14 Thread Paul Menzel

[Cc: -kernel test robot ]

Dear Alex, dear Richard,


Am 13.04.22 um 15:00 schrieb Alex Deucher:

On Wed, Apr 13, 2022 at 3:43 AM Paul Menzel wrote:



Thank you for sending out v4.

Am 12.04.22 um 23:50 schrieb Richard Gong:

Active State Power Management (ASPM) feature is enabled since kernel 5.14.
There are some AMD GFX cards (such as WX3200 and RX640) that won't work
with ASPM-enabled Intel Alder Lake based systems. Using these GFX cards as
video/display output, Intel Alder Lake based systems will hang during
suspend/resume.


I am still not clear, what “hang during suspend/resume” means. I guess
suspending works fine? During resume (S3 or S0ix?), where does it hang?
The system is functional, but there are only display problems?


The issue was initially reported on one system (Dell Precision 3660 with
BIOS version 0.14.81), but was later confirmed to affect at least 4 Alder
Lake based systems.

Add extra check to disable ASPM on Intel Alder Lake based systems.

Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default")
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1885
Reported-by: kernel test robot 


This tag is a little confusing. Maybe clarify that it was for an issue
in a previous patch iteration?


Signed-off-by: Richard Gong 
---
v4: s/CONFIG_X86_64/CONFIG_X86
  enhanced check logic
v3: s/intel_core_asom_chk/aspm_support_quirk_check
  correct build error with W=1 option
v2: correct commit description
  move the check from chip family to problematic platform
---
   drivers/gpu/drm/amd/amdgpu/vi.c | 17 -
   1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c
index 039b90cdc3bc..b33e0a9bee65 100644
--- a/drivers/gpu/drm/amd/amdgpu/vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/vi.c
@@ -81,6 +81,10 @@
   #include "mxgpu_vi.h"
   #include "amdgpu_dm.h"

+#if IS_ENABLED(CONFIG_X86)
+#include 
+#endif
+
   #define ixPCIE_LC_L1_PM_SUBSTATE0x100100C6
   #define PCIE_LC_L1_PM_SUBSTATE__LC_L1_SUBSTATES_OVERRIDE_EN_MASK
0x0001L
   #define PCIE_LC_L1_PM_SUBSTATE__LC_PCI_PM_L1_2_OVERRIDE_MASK
0x0002L
@@ -1134,13 +1138,24 @@ static void vi_enable_aspm(struct amdgpu_device *adev)
   WREG32_PCIE(ixPCIE_LC_CNTL, data);
   }

+static bool aspm_support_quirk_check(void)
+{
+ if (IS_ENABLED(CONFIG_X86)) {
+ struct cpuinfo_x86 *c = _data(0);
+
+ return !(c->x86 == 6 && c->x86_model == INTEL_FAM6_ALDERLAKE);
+ }
+
+ return true;
+}
+
   static void vi_program_aspm(struct amdgpu_device *adev)
   {
   u32 data, data1, orig;
   bool bL1SS = false;
   bool bClkReqSupport = true;

- if (!amdgpu_device_should_use_aspm(adev))
+ if (!amdgpu_device_should_use_aspm(adev) || !aspm_support_quirk_check())
   return;


Can users still forcefully enable ASPM with the parameter `amdgpu.aspm`?



   if (adev->flags & AMD_IS_APU ||


If I remember correctly, there were also newer cards, where ASPM worked
with Intel Alder Lake, right? Can only the problematic generations for
WX3200 and RX640 be excluded from ASPM?


This patch only disables it for the generation that was problematic.


Could that please be made clear in the commit message summary, and message?

Loosely related, is there a public (or internal issue) to analyze how to 
get ASPM working for VI generation devices with Intel Alder Lake?



Kind regards,

Paul


Re: [PATCH v2] drm/amdgpu: Fix one use-after-free of VM

2022-04-14 Thread Paul Menzel

Dear Xinhui,


Am 14.04.22 um 07:45 schrieb Paul Menzel:


Thank you for rerolling the patch.

Am 14.04.22 um 07:03 schrieb xinhui pan:

VM might already be freed when amdgpu_vm_tlb_seq_cb() is called.
We see the calltrace below.

Fix it by keeping the last flush fence around and wait for it to signal


Nit: Please add a dot/period to the end of sentences.



BUG kmalloc-4k (Not tainted): Poison overwritten

0x9c88630414e8-0x9c88630414e8 @offset=5352. First byte 0x6c
instead of 0x6b Allocated in amdgpu_driver_open_kms+0x9d/0x360 [amdgpu]
age=44 cpu=0 pid=2343
  __slab_alloc.isra.0+0x4f/0x90
  kmem_cache_alloc_trace+0x6b8/0x7a0
  amdgpu_driver_open_kms+0x9d/0x360 [amdgpu]
  drm_file_alloc+0x222/0x3e0 [drm]
  drm_open+0x11d/0x410 [drm]
Freed in amdgpu_driver_postclose_kms+0x3e9/0x550 [amdgpu] age=22 cpu=1
pid=2485
  kfree+0x4a2/0x580
  amdgpu_driver_postclose_kms+0x3e9/0x550 [amdgpu]
  drm_file_free+0x24e/0x3c0 [drm]
  drm_close_helper.isra.0+0x90/0xb0 [drm]
  drm_release+0x97/0x1a0 [drm]
  __fput+0xb6/0x280
  fput+0xe/0x10
  task_work_run+0x64/0xb0


The v2 annotation is missing.


Suggested-by: Christian König 
Signed-off-by: xinhui pan 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 17 ++---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h |  1 +
  2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c

index 645ce28277c2..cd5aa7edd451 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -932,9 +932,12 @@ int amdgpu_vm_update_range(struct amdgpu_device 
*adev, struct amdgpu_vm *vm,

  if (flush_tlb || params.table_freed) {
  tlb_cb->vm = vm;
-    if (!fence || !*fence ||
-    dma_fence_add_callback(*fence, _cb->cb,
-   amdgpu_vm_tlb_seq_cb))
+    if (fence && *fence &&
+    !dma_fence_add_callback(*fence, _cb->cb,
+   amdgpu_vm_tlb_seq_cb)) {
+    dma_fence_put(vm->last_tlb_flush);
+    vm->last_tlb_flush = dma_fence_get(*fence);
+    } else
  amdgpu_vm_tlb_seq_cb(NULL, _cb->cb);


The Linux kernel coding style uses braces for all branches of a 
conditional statement, if one branch uses braces. [1]


By the way `scripts/checkpatch.pl --strict …` would also have found this:

```
$ scripts/checkpatch.pl --strict 
0001-drm-amdgpu-Fix-one-use-after-free-of-VM.patch

CHECK: braces {} should be used on all arms of this statement
#53: FILE: drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:935:
+   if (fence && *fence &&
[...]
+   } else
[...]

CHECK: Unbalanced braces around else statement
#58: FILE: drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:940:
+   } else

total: 0 errors, 0 warnings, 2 checks, 54 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
  mechanically convert to the typical style using --fix or 
--fix-inplace.


0001-drm-amdgpu-Fix-one-use-after-free-of-VM.patch has style problems, 
please review.

```


Kind regards,

Paul



  tlb_cb = NULL;
  }
@@ -2094,6 +2097,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, 
struct amdgpu_vm *vm)

  vm->update_funcs = _vm_sdma_funcs;
  vm->last_update = NULL;
  vm->last_unlocked = dma_fence_get_stub();
+    vm->last_tlb_flush = dma_fence_get_stub();
  mutex_init(>eviction_lock);
  vm->evicting = false;
@@ -2132,6 +2136,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, 
struct amdgpu_vm *vm)

  vm->root.bo = NULL;
  error_free_delayed:
+    dma_fence_put(vm->last_tlb_flush);
  dma_fence_put(vm->last_unlocked);
  drm_sched_entity_destroy(>delayed);
@@ -2248,6 +2253,7 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, 
struct amdgpu_vm *vm)

  struct amdgpu_bo_va_mapping *mapping, *tmp;
  bool prt_fini_needed = !!adev->gmc.gmc_funcs->set_prt;
  struct amdgpu_bo *root;
+    unsigned long flags;
  int i;
  amdgpu_amdkfd_gpuvm_destroy_cb(adev, vm);
@@ -2257,6 +2263,11 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, 
struct amdgpu_vm *vm)

  amdgpu_vm_set_pasid(adev, vm, 0);
  dma_fence_wait(vm->last_unlocked, false);
  dma_fence_put(vm->last_unlocked);
+    dma_fence_wait(vm->last_tlb_flush, false);
+    /* Make sure that all fence callbacks have completed */
+    spin_lock_irqsave(vm->last_tlb_flush->lock, flags);
+    spin_unlock_irqrestore(vm->last_tlb_flush->lock, flags);
+    dma_fence_put(vm->last_tlb_flush);
  list_for_each_entry_safe(mapping, tmp, >freed, list) {
  if (mapping->flags & AMDGPU_PTE_PRT && prt_fini_needed) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h

index 1a814fb8..6b06a214f05f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers

Re: [PATCH v2] drm/amdgpu: Fix one use-after-free of VM

2022-04-13 Thread Paul Menzel

Dear Xinhui,


Thank you for rerolling the patch.

Am 14.04.22 um 07:03 schrieb xinhui pan:

VM might already be freed when amdgpu_vm_tlb_seq_cb() is called.
We see the calltrace below.

Fix it by keeping the last flush fence around and wait for it to signal


Nit: Please add a dot/period to the end of sentences.



BUG kmalloc-4k (Not tainted): Poison overwritten

0x9c88630414e8-0x9c88630414e8 @offset=5352. First byte 0x6c
instead of 0x6b Allocated in amdgpu_driver_open_kms+0x9d/0x360 [amdgpu]
age=44 cpu=0 pid=2343
  __slab_alloc.isra.0+0x4f/0x90
  kmem_cache_alloc_trace+0x6b8/0x7a0
  amdgpu_driver_open_kms+0x9d/0x360 [amdgpu]
  drm_file_alloc+0x222/0x3e0 [drm]
  drm_open+0x11d/0x410 [drm]
Freed in amdgpu_driver_postclose_kms+0x3e9/0x550 [amdgpu] age=22 cpu=1
pid=2485
  kfree+0x4a2/0x580
  amdgpu_driver_postclose_kms+0x3e9/0x550 [amdgpu]
  drm_file_free+0x24e/0x3c0 [drm]
  drm_close_helper.isra.0+0x90/0xb0 [drm]
  drm_release+0x97/0x1a0 [drm]
  __fput+0xb6/0x280
  fput+0xe/0x10
  task_work_run+0x64/0xb0


The v2 annotation is missing.


Suggested-by: Christian König 
Signed-off-by: xinhui pan 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 17 ++---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h |  1 +
  2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 645ce28277c2..cd5aa7edd451 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -932,9 +932,12 @@ int amdgpu_vm_update_range(struct amdgpu_device *adev, 
struct amdgpu_vm *vm,
  
  	if (flush_tlb || params.table_freed) {

tlb_cb->vm = vm;
-   if (!fence || !*fence ||
-   dma_fence_add_callback(*fence, _cb->cb,
-  amdgpu_vm_tlb_seq_cb))
+   if (fence && *fence &&
+   !dma_fence_add_callback(*fence, _cb->cb,
+  amdgpu_vm_tlb_seq_cb)) {
+   dma_fence_put(vm->last_tlb_flush);
+   vm->last_tlb_flush = dma_fence_get(*fence);
+   } else
amdgpu_vm_tlb_seq_cb(NULL, _cb->cb);


The Linux kernel coding style uses braces for all branches of a 
conditional statement, if one branch uses braces. [1]



Kind regards,

Paul



tlb_cb = NULL;
}
@@ -2094,6 +2097,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct 
amdgpu_vm *vm)
vm->update_funcs = _vm_sdma_funcs;
vm->last_update = NULL;
vm->last_unlocked = dma_fence_get_stub();
+   vm->last_tlb_flush = dma_fence_get_stub();
  
  	mutex_init(>eviction_lock);

vm->evicting = false;
@@ -2132,6 +2136,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct 
amdgpu_vm *vm)
vm->root.bo = NULL;
  
  error_free_delayed:

+   dma_fence_put(vm->last_tlb_flush);
dma_fence_put(vm->last_unlocked);
drm_sched_entity_destroy(>delayed);
  
@@ -2248,6 +2253,7 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm)

struct amdgpu_bo_va_mapping *mapping, *tmp;
bool prt_fini_needed = !!adev->gmc.gmc_funcs->set_prt;
struct amdgpu_bo *root;
+   unsigned long flags;
int i;
  
  	amdgpu_amdkfd_gpuvm_destroy_cb(adev, vm);

@@ -2257,6 +2263,11 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct 
amdgpu_vm *vm)
amdgpu_vm_set_pasid(adev, vm, 0);
dma_fence_wait(vm->last_unlocked, false);
dma_fence_put(vm->last_unlocked);
+   dma_fence_wait(vm->last_tlb_flush, false);
+   /* Make sure that all fence callbacks have completed */
+   spin_lock_irqsave(vm->last_tlb_flush->lock, flags);
+   spin_unlock_irqrestore(vm->last_tlb_flush->lock, flags);
+   dma_fence_put(vm->last_tlb_flush);
  
  	list_for_each_entry_safe(mapping, tmp, >freed, list) {

if (mapping->flags & AMDGPU_PTE_PRT && prt_fini_needed) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index 1a814fb8..6b06a214f05f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -286,6 +286,7 @@ struct amdgpu_vm {
  
  	/* Last finished delayed update */

atomic64_t  tlb_seq;
+   struct dma_fence*last_tlb_flush;
  
  	/* Last unlocked submission to the scheduler entities */

struct dma_fence*last_unlocked;



[1]: 
https://www.kernel.org/doc/html/latest/process/coding-style.html#placing-braces-and-spaces


Re: [PATCHv4] drm/amdgpu: disable ASPM on Intel Alder Lake based systems

2022-04-13 Thread Paul Menzel

Dear Richard,


Thank you for sending out v4.

Am 12.04.22 um 23:50 schrieb Richard Gong:

Active State Power Management (ASPM) feature is enabled since kernel 5.14.
There are some AMD GFX cards (such as WX3200 and RX640) that won't work
with ASPM-enabled Intel Alder Lake based systems. Using these GFX cards as
video/display output, Intel Alder Lake based systems will hang during
suspend/resume.


I am still not clear, what “hang during suspend/resume” means. I guess 
suspending works fine? During resume (S3 or S0ix?), where does it hang? 
The system is functional, but there are only display problems?



The issue was initially reported on one system (Dell Precision 3660 with
BIOS version 0.14.81), but was later confirmed to affect at least 4 Alder
Lake based systems.

Add extra check to disable ASPM on Intel Alder Lake based systems.

Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default")
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1885
Reported-by: kernel test robot 


This tag is a little confusing. Maybe clarify that it was for an issue 
in a previous patch iteration?



Signed-off-by: Richard Gong 
---
v4: s/CONFIG_X86_64/CONFIG_X86
 enhanced check logic
v3: s/intel_core_asom_chk/aspm_support_quirk_check
 correct build error with W=1 option
v2: correct commit description
 move the check from chip family to problematic platform
---
  drivers/gpu/drm/amd/amdgpu/vi.c | 17 -
  1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c
index 039b90cdc3bc..b33e0a9bee65 100644
--- a/drivers/gpu/drm/amd/amdgpu/vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/vi.c
@@ -81,6 +81,10 @@
  #include "mxgpu_vi.h"
  #include "amdgpu_dm.h"
  
+#if IS_ENABLED(CONFIG_X86)

+#include 
+#endif
+
  #define ixPCIE_LC_L1_PM_SUBSTATE  0x100100C6
  #define PCIE_LC_L1_PM_SUBSTATE__LC_L1_SUBSTATES_OVERRIDE_EN_MASK  
0x0001L
  #define PCIE_LC_L1_PM_SUBSTATE__LC_PCI_PM_L1_2_OVERRIDE_MASK  0x0002L
@@ -1134,13 +1138,24 @@ static void vi_enable_aspm(struct amdgpu_device *adev)
WREG32_PCIE(ixPCIE_LC_CNTL, data);
  }
  
+static bool aspm_support_quirk_check(void)

+{
+   if (IS_ENABLED(CONFIG_X86)) {
+   struct cpuinfo_x86 *c = _data(0);
+
+   return !(c->x86 == 6 && c->x86_model == INTEL_FAM6_ALDERLAKE);
+   }
+
+   return true;
+}
+
  static void vi_program_aspm(struct amdgpu_device *adev)
  {
u32 data, data1, orig;
bool bL1SS = false;
bool bClkReqSupport = true;
  
-	if (!amdgpu_device_should_use_aspm(adev))

+   if (!amdgpu_device_should_use_aspm(adev) || !aspm_support_quirk_check())
return;


Can users still forcefully enable ASPM with the parameter `amdgpu.aspm`?

  
  	if (adev->flags & AMD_IS_APU ||


If I remember correctly, there were also newer cards, where ASPM worked 
with Intel Alder Lake, right? Can only the problematic generations for 
WX3200 and RX640 be excluded from ASPM?



Kind regards,

Paul


Re: [PATCH v2] drm/amdgpu: Make sure ttm delayed work finished

2022-04-13 Thread Paul Menzel

Dear xinhui,


Am 13.04.22 um 08:46 schrieb xinhui pan:

ttm_device_delayed_workqueue would reschedule itself if there is pending
BO to be destroyed. So just one flush + cancel_sync is not enough. We
still see lru_list not empty warnging.


warning

(`scripts/checkpatch.pl --codespell` should have found this.)



Fix it by waiting all BO to be destroyed.


Please explain the waiting algorithm. Why at least one millisecond?

Do you have a reproducer for this issue?


Acked-by: Guchun Chen 
Signed-off-by: xinhui pan 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 --
  1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 6f47726f1765..56dcf8c3a3cd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3957,11 +3957,17 @@ static void amdgpu_device_unmap_mmio(struct 
amdgpu_device *adev)
   */
  void amdgpu_device_fini_hw(struct amdgpu_device *adev)
  {
+   int pending = 1;


Could this be bool?


+
dev_info(adev->dev, "amdgpu: finishing device.\n");
flush_delayed_work(>delayed_init_work);
-   if (adev->mman.initialized) {
+   while (adev->mman.initialized && pending) {
flush_delayed_work(>mman.bdev.wq);
-   ttm_bo_lock_delayed_workqueue(>mman.bdev);
+   pending = ttm_bo_lock_delayed_workqueue(>mman.bdev);
+   if (pending) {
+   ttm_bo_unlock_delayed_workqueue(>mman.bdev, true);


Does this call affect `adev->mman.initialized`? If not a do-while loop 
might be better suited, so pending is only checked once.


if (adev->mman.initialized) {
do {
flush_delayed_work(>mman.bdev.wq);
ttm_bo_unlock_delayed_workqueue(>mman.bdev, true);
msleep(((HZ / 100) < 1) ? 1 : HZ / 100);
} while (ttm_bo_lock_delayed_workqueue(>mman.bdev));
}

The logic is slightly different, as 
`ttm_bo_unlock_delayed_workqueue(>mman.bdev, true);` and the sleep 
are run at least once, so the suggestion might be unsuitable.



+   msleep(((HZ / 100) < 1) ? 1 : HZ / 100);


Maybe add a comment for that formula? (No idea, if common knowledge, but 
why is a delay needed, and the loop cannot be run as fast as possible?)



+   }
}
adev->shutdown = true;
  



Kind regards,

Paul


Re: Vega 56 failing to process EDID from VR Headset

2022-04-13 Thread Paul Menzel

Dear James,


Am 13.04.22 um 00:13 schrieb James Dutton:

On Tue, 12 Apr 2022 at 07:13, Paul Menzel wrote:

Am 11.04.22 um 23:39 schrieb James Dutton:
So, did you do any changes to Linux? Why do you think the EDID is at fault?

[…]

I suggest to analyze, why `No DP link bandwidth` is logged. The macro is
`DC_NO_DP_LINK_BANDWIDTH`, and you should first check why
`dp_validate_mode_timing()` in
`drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c` returns false.



PS: Using the issue tracker [1] might make it easier to keep track of
this problem, and also to attach all the necessary information.


[1]: https://gitlab.freedesktop.org/drm/amd/-/issues/



I will do some more investigation. In addition to it not processing
the EDID particularly well...
Since my email, I have found out that it is failing to complete CR
(Clock Recovery) on Link 0,2, but it works on 1,3 at HBR2. All 4 Links
work at HBR1.   (I need the HBR2 working)
The CR negotiation in the code looks a bit wrong to me, so I will look
into that a bit more.
Looking at the current amdgpu source code   (I am using Mainline
kernel version 5.17.1), it appears to retry CR negotiation, but each
time it uses the same settings, rather than try different driver
parameters, as recommended in the DP standards and compliance test
documents.


[…]

Awesome, that you review the code with your expertise. Though I suggest 
to look at and work on agd5f/amd-staging-drm-next [1], having the latest 
code for the driver.



Once I know more, I will put all the info in the issue track, as you
suggest.


I am looking forward to it. To not get lost in all the problems, one 
email or issue per problem might be a good way forward, and adding 
people adding the code (`git blame -w`) to Cc might also help.


Happy debugging and hacking!


Kind regards,

Paul


[1]: https://gitlab.freedesktop.org/agd5f/linux/


Re: [PATCH 1/2] drm/amd/amdgpu: Update PF2VF header

2022-04-13 Thread Paul Menzel

[Removed unintended paste in second line]

Am 13.04.22 um 09:03 schrieb Paul Menzel:

Dear Bokun,


Thank you for rerolling the patch. Please add the iteration/version in 
the subject next time `[PATCH v2 1/2]` or so.


Am 12.04.22 um 23:31 schrieb Bokun Zhang:

- Add proper indentation in the header file


Please use that as the commit message summary, and the subject/summary 
as the title for the cover letter? Maybe:


drm/amd/amdgpu: Properly indent PF2VF header


Signed-off-by: Bokun Zhang 
---
  drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h | 95 ++---
  1 file changed, 46 insertions(+), 49 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h 
b/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h

index 7326b6c1b71c..65433cbb00c5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
@@ -1,34 +1,33 @@
  /*
- * Copyright 2018-2019 Advanced Micro Devices, Inc.
+ * Copyright (c) 2018-2021 Advanced Micro Devices, Inc. All rights reserved.
   *
- * Permission is hereby granted, free of charge, to any person obtaining a
- * copy of this software and associated documentation files (the "Software"),
- * to deal in the Software without restriction, including without limitation
- * the rights to use, copy, modify, merge, publish, distribute, sublicense,
- * and/or sell copies of the Software, and to permit persons to whom the
- * Software is furnished to do so, subject to the following conditions:
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
   *
   * The above copyright notice and this permission notice shall be included in
   * all copies or substantial portions of the Software.
   *
   * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
   * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
- * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
- * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
- * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
- * OTHER DEALINGS IN THE SOFTWARE.
- *
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
   */


The header change seems unrelated from indentation?

[…]


Kind regards,

Paul


Re: [PATCH 1/2] drm/amd/amdgpu: Update PF2VF header

2022-04-13 Thread Paul Menzel

Dear Bokun,
 drm/amd/amdgpu: Update PF2VF header

Thank you for rerolling the patch. Please add the iteration/version in 
the subject next time `[PATCH v2 1/2]` or so.


Am 12.04.22 um 23:31 schrieb Bokun Zhang:

- Add proper indentation in the header file


Please use that as the commit message summary, and the subject/summary 
as the title for the cover letter? Maybe:


drm/amd/amdgpu: Properly indent PF2VF header


Signed-off-by: Bokun Zhang 
---
  drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h | 95 ++---
  1 file changed, 46 insertions(+), 49 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h 
b/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
index 7326b6c1b71c..65433cbb00c5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
@@ -1,34 +1,33 @@
  /*
- * Copyright 2018-2019 Advanced Micro Devices, Inc.
+ * Copyright (c) 2018-2021 Advanced Micro Devices, Inc. All rights reserved.
   *
- * Permission is hereby granted, free of charge, to any person obtaining a
- * copy of this software and associated documentation files (the "Software"),
- * to deal in the Software without restriction, including without limitation
- * the rights to use, copy, modify, merge, publish, distribute, sublicense,
- * and/or sell copies of the Software, and to permit persons to whom the
- * Software is furnished to do so, subject to the following conditions:
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
   *
   * The above copyright notice and this permission notice shall be included in
   * all copies or substantial portions of the Software.
   *
   * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
   * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
- * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
- * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
- * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
- * OTHER DEALINGS IN THE SOFTWARE.
- *
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
   */


The header change seems unrelated from indentation?

[…]


Kind regards,

Paul


Re: [PATCH] Revert "drm/amd/display: Pass HostVM enable flag into DCN3.1 DML"

2022-04-12 Thread Paul Menzel

Dear Rodrigo,


Thank you for the patch.


Am 12.04.22 um 20:44 schrieb Rodrigo Siqueira:

This reverts commit 367b3e934f578f6c0d5d8ca5987dc6ac4cd6831d.

While we were testing DCN3.1 with a hub, we noticed that only one of 2
connected displays lights up when using some specific display
resolution. After bisecting this issue, we figured out the commit
mentioned above introduced this issue. We are investigating why this
patch introduced this regression, but we need to revert it for now.


Thank you for bisecting this. It’d be great if you added the information 
about the test setup: graphics card, displays (model and connected how), 
model of the hub, and the resolution causing issues.



Kind regards,

Paul



Cc: Harry Wentland 
Cc: Mark Broadworth 
Cc: Michael Strauss 
Signed-off-by: Rodrigo Siqueira 
---
  drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c | 1 -
  1 file changed, 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c 
b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c
index 6cc580be7c79..5b3f0c2dfb55 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c
@@ -1668,7 +1668,6 @@ int dcn31_populate_dml_pipes_from_context(
pipes[pipe_cnt].pipe.src.immediate_flip = true;
  
  		pipes[pipe_cnt].pipe.src.unbounded_req_mode = false;

-   pipes[pipe_cnt].pipe.src.hostvm = 
dc->res_pool->hubbub->riommu_active;
pipes[pipe_cnt].pipe.src.gpuvm = true;
pipes[pipe_cnt].pipe.src.dcc_fraction_of_zs_req_luma = 0;
pipes[pipe_cnt].pipe.src.dcc_fraction_of_zs_req_chroma = 0;


Re: AMDGPU: RX 6500 XT: System reset when loading module [SOLVED]

2022-04-12 Thread Paul Menzel

Dear Cal,


Am 12.04.22 um 00:31 schrieb Cal Peake:


I wanted to put a capper on this just in case anyone was interested, or in
case any future people find this thread, because I did find a resolution.


Yes, that is very much appreciated.


Turns out the way to stop the system from crashing was to disable Global
C-State Control in the BIOS.

Christian, Alex, you guys seem to have been on the right track in that it
was something power related.

I haven't yet been able to figure out what Global C-State Control exactly
does.

My best guess as to what was happening: once the GPU power management code
was loaded and the GPU dropped into a very low power state, the CPU saw
this and decided to match it, lowering its own power state to such a point
that it hard resets the system. (Just a wild theory anyway :-)

If anyone knows what this feature really does, or has any better theories
as to why it doesn't play nice with AMD GPUs, please do share!


It might be related to bug report *Random Soft Lockup on new Ryzen 
build* [1], and the referenced issues there. It’d be great if you could 
post here or there a summary with your system details (especially the 
system and GPU firmware versions).


Me is still a little upset, how AMD has until now not been able to 
properly analyze and fix this (with the ODMs).



Kind regards,

Paul


[1]: https://bugzilla.kernel.org/show_bug.cgi?id=196683


Re: [PATCH ] drm/amdgpu: fix discovery ip failed

2022-04-12 Thread Paul Menzel

Dear Jie,


Am 12.04.22 um 03:52 schrieb Zhang, Jesse (Jie):

[AMD Official Use Only]

Thanks  Paul Menzel
Attach  the patch file .


Please do not attach patch files, as it breaks work flows and inline 
commenting. Please use `git send-email` or equivalent. Your colleagues 
should be able to help you with this.


[…]


Kind regards,

Paul


Re: [PATCH] drm/amd/amdgpu: Not request init data for MS_HYPERV with vega10

2022-04-12 Thread Paul Menzel

[Cc: +x86 folks]

Dear Alex, dear x86 folks,


x86 folks, can you think of alternatives to access `X86_HYPER_MS_HYPERV` 
from `arch/x86/include/asm/hypervisor.h` without any preprocessor ifdef-ery?



Am 11.04.22 um 18:28 schrieb Alex Deucher:

On Mon, Apr 11, 2022 at 11:28 AM Paul Menzel wrote:


[…]


Am 11.04.22 um 15:59 schrieb Yongqiang Sun:

MS_HYPERV with vega10 doesn't have the interface to process
request init data msg.


Should some Hyper-V folks be added to the reviewers list too?


Check hypervisor type to not send the request for MS_HYPERV.


Please add a blank line between paragraphs.


Signed-off-by: Yongqiang Sun 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 12 ++--
   1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 933c41f77c92..56b130ec44a9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -23,6 +23,10 @@

   #include 

+#ifdef CONFIG_X86
+#include 
+#endif
+
   #include 

   #include "amdgpu.h"
@@ -721,8 +725,12 @@ void amdgpu_detect_virtualization(struct amdgpu_device 
*adev)
   break;
   case CHIP_VEGA10:
   soc15_set_virt_ops(adev);
- /* send a dummy GPU_INIT_DATA request to host on vega10 */
- amdgpu_virt_request_init_data(adev);
+#ifdef CONFIG_X86
+ /* not send GPU_INIT_DATA with MS_HYPERV*/
+ if (hypervisor_is_type(X86_HYPER_MS_HYPERV) == false)
+#endif


Why guard everything with CONFIG_X86? (If it’s needed, it should be done
in C code.)


X86_HYPER_MS_HYPERV only available on x86.


Sorry, I missed the X86 dependency when quickly looking at the Hyper-V 
stub IOMMU driver `drivers/iommu/hyperv-iommu.c`, but missed that 
`HYPERV_IOMMU` has `depends on HYPERV && X86`.



Kind regards,

Paul



+ /* send a dummy GPU_INIT_DATA request to host on 
vega10 */
+ amdgpu_virt_request_init_data(adev);
   break;
   case CHIP_VEGA20:
   case CHIP_ARCTURUS:



Kind regards,

Paul


AMD Display Core (DC) patches (was: [PATCH 13/16] drm/amd/display: Revert FEC check in validation)

2022-04-12 Thread Paul Menzel
[Cc: +dri-de...@lists.freedesktop.org, +Daniel Vetter, +Alexander 
Deucher, +Greg KH]



Dear Alex,


I am a little confused and upset about how Display Core patches are 
handled in the Linux kernel.



Am 25.03.22 um 23:53 schrieb Alex Hung:

From: Martin Leung 


git puts a line “This reverts commit …” into the commit message, when 
something is reverted. Why isn’t this here? Right now, commit 
7d56a154e22f, reverted here, is proposed for the stable series. I guess, 
because these indicators and meta data are missing.



why and how:
causes failure on install on certain machines


Why are such kind of commit messages accepted? What does “failure on 
install” even mean? Why can’t the machine configuration be documented so 
it can be reproduced, when necessary.


No less confusing, the date you posted it on amd-gfx is from March 25th, 
2022, but the author date of the commit in agd5f/amd-staging-drm-next is 
`Fri Mar 18 11:12:36 2022 -0400`. Why is the patch missing the Date 
field then?



Reviewed-by: George Shen 
Acked-by: Alex Hung 
Signed-off-by: Martin Leung 


Shouldn’t the Signed-off-by line by the author go first?

You committed this on `Mon Mar 28 08:26:48 2022 -0600`, while you posted 
the patch on amd-gfx on Friday. How should *proper* review happen over 
the weekend?



---
  drivers/gpu/drm/amd/display/dc/core/dc.c | 4 
  1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/core/dc.c 
b/drivers/gpu/drm/amd/display/dc/core/dc.c
index f2ad8f58e69c..c436db416708 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc.c
@@ -1496,10 +1496,6 @@ bool dc_validate_boot_timing(const struct dc *dc,
if (!link->link_enc->funcs->is_dig_enabled(link->link_enc))
return false;
  
-	/* Check for FEC status*/

-   if (link->link_enc->funcs->fec_is_active(link->link_enc))
-   return false;
-
enc_inst = link->link_enc->funcs->get_dig_frontend(link->link_enc);
  
  	if (enc_inst == ENGINE_ID_UNKNOWN)


The patch reverted here, also lacked proper review, had a to-be desired 
commit message, did not follow the Linux kernel coding style (missing 
space before the comment terminator), so should not have been committed 
in the first place.


Seeing how many people are in the Cc list, I would have hoped, that 
somebody noticed and commented. The current state also makes it really 
hard for non-AMD employees to get the necessary information to do proper 
reviews as the needed documentation and information is non-public. So 
good/excellent commit messages are a must. I think to remember, you 
replied to me once, that Display Core patches are shared also with the 
Microsoft Windows driver, restricting the workflow options. But I think 
the issues I mentioned are unrelated. I know graphics hardware is very 
complex, but if quality of the commits and review would be improved, 
hopefully it saves time for everyone in the end, as less bugs are 
introduced.


Could AMD team please address these issues as soon as possible?


Kind regards,

Paul


Re: [PATCH 04/13] drm/amd/display: FEC check in timing validation

2022-04-12 Thread Paul Menzel

Dear Alex,


Am 19.03.22 um 08:43 schrieb Paul Menzel:

Dear Alex, dear Chiawen,


Thank you for your patch.

Am 18.03.22 um 22:47 schrieb Alex Hung:

From: Chiawen Huang 

[Why]
disable/enable leads fec mismatch between hw/sw fec state.


1.  Disable/enable of what?
2.  How can this be reproduced?
3.  s/fec/FEC/


[How]
check fec status to fastboot on/off.


What do you mean by “to fastboot on/off”?


Reviewed-by: Anthony Koo 
Acked-by: Alex Hung 
Signed-off-by: Chiawen Huang 
---
  drivers/gpu/drm/amd/display/dc/core/dc.c | 4 
  1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/display/dc/core/dc.c 
b/drivers/gpu/drm/amd/display/dc/core/dc.c

index f6e19efea756..75f9c97bebb0 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc.c
@@ -1496,6 +1496,10 @@ bool dc_validate_boot_timing(const struct dc *dc,
  if (!link->link_enc->funcs->is_dig_enabled(link->link_enc))
  return false;
+    /* Check for FEC status*/


Missing space before `*/`.


+    if (link->link_enc->funcs->fec_is_active(link->link_enc))
+    return false;
+
  enc_inst = link->link_enc->funcs->get_dig_frontend(link->link_enc);
  if (enc_inst == ENGINE_ID_UNKNOWN)


I just saw, that this patch was committed on March 25th, 2022 (commit 
7d56a154e22f) with my comments ignored. Could you please explain why?



Kind regards,

Paul


Re: Vega 56 failing to process EDID from VR Headset

2022-04-12 Thread Paul Menzel

Dear James,


Am 11.04.22 um 23:39 schrieb James Dutton:


I have an Oculus Rift S, that I am trying to get working in Linux.


Please always mention the Linux kernel version.


I have an AMD Vega 56  graphics card.

The VR headset plugs into a display port of the Vega56.
The amdgpu driver sees the connection, and tries to process it.
The problem is it cannot process the EDID, so fails to recognise the
VR headset, and the VR headset does not work as a result.
Please find the EDID below.
I am guessing that the following is causing the problem:
Established Timings I & II: none
Standard Timings: none
Forcing the driver to understand the Detailed mode, to which it is failing.
If it helps, when attached to windows 10, it uses 1440x2560, portrait mode.
Some dmesg lines that may be useful:
// We should pick 1440x2560 as Windows picks that, but for some reason
is rejects it with error 10.
[10402.650734] [drm:create_validate_stream_for_sink [amdgpu]] Mode
1440x2560 (clk 571570) failed DC validation with error 10 (No DP link
bandwidth)
[10402.650991] [drm:update_stream_scaling_settings [amdgpu]]
Destination Rectangle x:0  y:0  width:1440  height:2560
[10402.651225] [drm:create_validate_stream_for_sink [amdgpu]] Mode
1440x2560 (clk 571570) failed DC validation with error 10 (No DP link
bandwidth)


(Please use an email program, which does not wrap lines after 72 
characters.)



Can anyone help give me with some pointers as to how to get the amdgpu
driver to accept this EDID?


So, did you do any changes to Linux? Why do you think the EDID is at fault?

[…]

I suggest to analyze, why `No DP link bandwidth` is logged. The macro is 
`DC_NO_DP_LINK_BANDWIDTH`, and you should first check why 
`dp_validate_mode_timing()` in 
`drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c` returns false.



Kind regards,

Paul


PS: Using the issue tracker [1] might make it easier to keep track of 
this problem, and also to attach all the necessary information.



[1]: https://gitlab.freedesktop.org/drm/amd/-/issues/


Re: [PATCH] drm/amd/amdgpu: Not request init data for MS_HYPERV with vega10

2022-04-11 Thread Paul Menzel

Dear Yongqiang,


Thank you for your patch.

Am 11.04.22 um 15:59 schrieb Yongqiang Sun:

MS_HYPERV with vega10 doesn't have the interface to process
request init data msg.


Should some Hyper-V folks be added to the reviewers list too?


Check hypervisor type to not send the request for MS_HYPERV.


Please add a blank line between paragraphs.


Signed-off-by: Yongqiang Sun 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 12 ++--
  1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 933c41f77c92..56b130ec44a9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -23,6 +23,10 @@
  
  #include 
  
+#ifdef CONFIG_X86

+#include 
+#endif
+
  #include 
  
  #include "amdgpu.h"

@@ -721,8 +725,12 @@ void amdgpu_detect_virtualization(struct amdgpu_device 
*adev)
break;
case CHIP_VEGA10:
soc15_set_virt_ops(adev);
-   /* send a dummy GPU_INIT_DATA request to host on vega10 
*/
-   amdgpu_virt_request_init_data(adev);
+#ifdef CONFIG_X86
+   /* not send GPU_INIT_DATA with MS_HYPERV*/
+   if (hypervisor_is_type(X86_HYPER_MS_HYPERV) == false)
+#endif


Why guard everything with CONFIG_X86? (If it’s needed, it should be done 
in C code.)



+   /* send a dummy GPU_INIT_DATA request to host 
on vega10 */
+   amdgpu_virt_request_init_data(adev);
break;
case CHIP_VEGA20:
case CHIP_ARCTURUS:



Kind regards,

Paul


Re: [PATCH ] drm/amdgpu: fix discovery ip failed

2022-04-11 Thread Paul Menzel

Dear Jie,


Thank you for your patch.


Am 11.04.22 um 17:15 schrieb Zhang, Jesse(Jie):

You might want to add a space before the (.


[AMD Official Use Only]


Please send a patch with `git format-patch` or similar.


Fix discovery ip failed, and the log:


On what system?


56.129549] [drm:amdgpu_discovery_validate_ip [amdgpu]] *ERROR* Unexpected 
number_instance (64) from ip discovery blob
[   56.130129] [drm:amdgpu_discovery_validate_ip [amdgpu]] *ERROR* Unexpected 
hw_id (45056) from ip discovery blob
[   56.130701] [drm:amdgpu_discovery_validate_ip [amdgpu]] *ERROR* Unexpected 
number_instance (66) from ip discovery blob
[   56.131283] [drm:amdgpu_discovery_validate_ip [amdgpu]] *ERROR* Unexpected 
hw_id (45568) from ip discovery blob
[   56.131855] [drm:amdgpu_discovery_validate_ip [amdgpu]] *ERROR* Unexpected 
number_instance (66) from ip discovery blob
[   56.132436] [drm:amdgpu_discovery_validate_ip [amdgpu]] *ERROR* Unexpected 
hw_id (480) from ip discovery blob
[   56.133053] [drm:amdgpu_discovery_validate_ip [amdgpu]] *ERROR* Unexpected 
hw_id (608) from ip discovery blob
[   56.133626] [drm:amdgpu_discovery_validate_ip [amdgpu]] *ERROR* Unexpected 
hw_id (640) from ip discovery blob
[   56.134207] [drm:amdgpu_discovery_validate_ip [amdgpu]] *ERROR* Unexpected 
number_instance (64) from ip discovery blob
[   56.134780] [drm:amdgpu_discovery_validate_ip [amdgpu]] *ERROR* Unexpected 
number_instance (64) from ip discovery blob
[   56.135360] [drm:amdgpu_discovery_validate_ip [amdgpu]] *ERROR* Unexpected 
hw_id (28672) from ip discovery blob


Please describe the reason for the failure, and your fix.

A Signed-off-by line is missing.


diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
index 766006a075ec..a778b0392e9f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
@@ -798,7 +798,7 @@ static int amdgpu_discovery_sysfs_ips(struct amdgpu_device 
*adev,
 res = kobject_add(_hw_instance->kobj, NULL,
   "%d", ip_hw_instance->num_instance);
next_ip:
-   ip_offset += sizeof(*ip) + 4 * (ip->num_base_address - 
1);
+   ip_offset += struct_size(ip, base_address, 
ip->num_base_address);
 }
 }

@@ -1063,7 +1063,7 @@ int amdgpu_discovery_reg_base_init(struct amdgpu_device 
*adev)
 }

next_ip:
-   ip_offset += sizeof(*ip) + 4 * (ip->num_base_address - 
1);
+   ip_offset += struct_size(ip, base_address, 
ip->num_base_address);
 }
 }

@@ -1113,7 +1113,7 @@ int amdgpu_discovery_get_ip_version(struct amdgpu_device 
*adev, int hw_id, int n
 *revision = ip->revision;
 return 0;
 }
-   ip_offset += sizeof(*ip) + 4 * (ip->num_base_address - 
1);
+   ip_offset += struct_size(ip, base_address, 
ip->num_base_address);
 }
 }



Kind regards,

Paul


Re: [PATCH v2] drm/amd/pm: Disable fan control if not supported

2022-04-11 Thread Paul Menzel

Dear Elena,


Am 11.04.22 um 15:32 schrieb Elena Sakhnovitch:

On Sienna Cichild, not all platforms use PMFW based fan control. On such
ASICs fan control by PMFW will be disabled in PPTable. Disable hwmon
knobs for fan control also as it is not possible to report or control
fan speed on such platforms through driver.
v2: FEATURE_FAN_CONTROL_MASK is replaced with FEATURE_FAN_CONTROL_BIT


Please add a blank line before the v2 line.

Also please give one specific example for a platform without PMFW based 
fan control.



Signed-off-by: Elena Sakhnovitch  
---
  .../drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c   | 15 ++-
  1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
index ab3e9d8b831e..ddc388b061b6 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
@@ -371,6 +371,18 @@ static void sienna_cichlid_check_bxco_support(struct 
smu_context *smu)
}
  }
  
+static void sienna_cichlid_check_fan_support(struct smu_context *smu)

+{
+   struct smu_table_context *table_context = >smu_table;
+   PPTable_t *pptable = table_context->driver_pptable;
+   /* No sort of fan control possible if PPTable has it disabled */
+   smu->adev->pm.no_fan =
+   !(pptable->FeaturesToRun[1] & (1U << FEATURE_FAN_CONTROL_BIT));
+   if (smu->adev->pm.no_fan)
+   dev_info_once(smu->adev->dev,
+ "PMFW based fan control disabled");


Maybe clarify:

… according to PPTable.


+}
+
  static int sienna_cichlid_check_powerplay_table(struct smu_context *smu)
  {
struct smu_table_context *table_context = >smu_table;
@@ -381,6 +393,7 @@ static int sienna_cichlid_check_powerplay_table(struct 
smu_context *smu)
smu->dc_controlled_by_gpio = true;
  
  	sienna_cichlid_check_bxco_support(smu);

+   sienna_cichlid_check_fan_support(smu);
  
  	table_context->thermal_controller_type =

powerplay_table->thermal_controller_type;
@@ -410,7 +423,7 @@ static int sienna_cichlid_append_powerplay_table(struct 
smu_context *smu)
GET_PPTABLE_MEMBER(I2cControllers, _member);
memcpy(table_member, smc_dpm_table->I2cControllers,
sizeof(*smc_dpm_table) - 
sizeof(smc_dpm_table->table_header));
-   
+


Unrelated, and should be a separate patch.


return 0;
  }
  



Kind regards,

Paul


Re: [PATCH] drm/amdgpu: Release memory when psp sw_init is failed

2022-04-11 Thread Paul Menzel

Dear Ma,


Thank you for your patch.

Am 11.04.22 um 14:42 schrieb Ma Jun:

Release the memory (psp->cmd) when psp initialization is
failed in psp_sw_init


s/is failed/fails/

Period/full stop at the end of sentences, and it still fits into 75 
characters per line.


Next time you could also add one simple statement like:

Add new label `failure` to jump to in case of the failure.


Signed-off-by: Ma Jun 
Change-Id: I2f88b5919142d55dd7d3820a7da94823286db235


Without the URL of the Gerrit instance, the Change-Id is not of much use.


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 15 +--
  1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index a6acec1a6155..1227dc014c80 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -305,9 +305,10 @@ static int psp_sw_init(void *handle)
ret = psp_init_sriov_microcode(psp);
else
ret = psp_init_microcode(psp);
+


Unrelated.


if (ret) {
DRM_ERROR("Failed to load psp firmware!\n");
-   return ret;
+   goto failure;
}
  
  	adev->psp.xgmi_context.supports_extended_data =

@@ -339,25 +340,27 @@ static int psp_sw_init(void *handle)
ret = psp_memory_training_init(psp);
if (ret) {
DRM_ERROR("Failed to initialize memory training!\n");
-   return ret;
+   goto failure;
}
  
  		ret = psp_mem_training(psp, PSP_MEM_TRAIN_COLD_BOOT);

if (ret) {
DRM_ERROR("Failed to process memory training!\n");
-   return ret;
+   goto failure;
}
}
  
  	if (adev->ip_versions[MP0_HWIP][0] == IP_VERSION(11, 0, 0) ||

adev->ip_versions[MP0_HWIP][0] == IP_VERSION(11, 0, 7)) {
ret= psp_sysfs_init(adev);
-   if (ret) {
-   return ret;
-   }
+   if (ret)
+   goto failure;
}
  
  	return 0;

+failure:
+   kfree(psp->cmd);
+   return ret;
  }
  
  static int psp_sw_fini(void *handle)


Acked-by: Paul Menzel 


Kind regards,

Paul


Re: [PATCHv2] drm/amdgpu: disable ASPM on Intel AlderLake based systems

2022-04-11 Thread Paul Menzel

Dear Richard,


Am 11.04.22 um 13:38 schrieb Gong, Richard:


On 4/11/2022 2:41 AM, Paul Menzel wrote:

[Cc: +]



Am 11.04.22 um 02:27 schrieb Gong, Richard:


On 4/8/2022 7:19 PM, Paul Menzel wrote:



Am 08.04.22 um 21:05 schrieb Richard Gong:
Active State Power Management (ASPM) feature is enabled since 
kernel 5.14.

There are some AMD GFX cards (such as WX3200 and RX640) that cannot be
used with Intel AlderLake based systems to enable ASPM. Using these 
GFX


Alder Lake

will correct in the next version.



cards as video/display output, Intel Alder Lake based systems will hang
during suspend/resume.


Please reflow for 75 characters per line.

Also please mention the exact system you had problems with (also 
firmware versions).




Add extra check to disable ASPM on Intel AlderLake based systems.


Is that a problem with Intel Alder Lake or the Dell system? 
Shouldn’t ASPM just be disabled for the problematic cards for the 
Dell system. You write newer cards worked fine.


There is a problem with Dell system (Dell Precision DT workstation), 
which is based on Intel Alder Lake.


ASPM works just fine on these GPU's. It's more of an issue with 
whether the underlying platform supports ASPM or not.


At least you didn’t document what the real issue is,


You can refer to bug tag from the comment messages.

Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1885


No, the commit message should be self-contained, and reviewers and 
readers of the commit message not required to read comments of bug 
reports. Please add the necessary information to the commit message.



Kind regards,

Paul


that ASPM does not work. With current information (some GPU graphics 
card with the the Dell system and others don’t), it could be the GPU, 
the Dell system (firmware, …), a problem with Alder Lake SOC, or 
another bug. I hope you are in contact with Dell to analyze it, so 
ASPM can be enabled again.


[…]


Kind regards,

Paul


Re: [PATCH] drm/amd/pm: fix the deadlock issue observed on SI

2022-04-11 Thread Paul Menzel

Dear Evan,


It would have been nice, if you put me in Cc as I also reported the 
regression.


Am 11.04.22 um 09:47 schrieb Evan Quan:

By placing those unrelated code outside of adev->pm.mutex
protections or restructuring the call flow, we can eliminate
the deadlock issue. This comes with no real logics change.


Please describe the deadlock issue and the fix in more detail. What code 
is unrelated, and why was it put into the mutex protections in the first 
place?



Fixes: 3712e7a49459 ("drm/amd/pm: unified lock protections in amdgpu_dpm.c")


No blank line needed.


Signed-off-by: Evan Quan 
Change-Id: I75de4350d9c2517aba0d6adc12e1bc69430fd800


Without the Gerrit instance URL, the Change-Id is useless.

As it’s a regression, please follow the documentation, and add the 
related tags.


Fixes: 3712e7a49459 ("drm/amd/pm: unified lock protections in amdgpu_dpm.c")
Reported-by: Paul Menzel 
Reported-by: Arthur Marsh 
Link: 
https://lore.kernel.org/r/9e689fea-6c69-f4b0-8dee-32c4cf7d8...@molgen.mpg.de

BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1957

Also add the link from Arthur.


Kind regards,

Paul


[1]: https://linux-regtracking.leemhuis.info/regzbot/mainline/


---
  drivers/gpu/drm/amd/pm/amdgpu_dpm.c   | 39 +++
  .../gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c| 10 -
  drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c| 35 -
  .../gpu/drm/amd/pm/powerplay/amd_powerplay.c  | 10 -
  4 files changed, 39 insertions(+), 55 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
index 5504d81c77b7..72e7b5d40af6 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
@@ -427,6 +427,7 @@ int amdgpu_dpm_read_sensor(struct amdgpu_device *adev, enum 
amd_pp_sensors senso
  void amdgpu_dpm_compute_clocks(struct amdgpu_device *adev)
  {
const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
+   int i;


You could make it unsigned, as the variable never should be assigned 
negaitve values.


  
  	if (!adev->pm.dpm_enabled)

return;
@@ -434,6 +435,15 @@ void amdgpu_dpm_compute_clocks(struct amdgpu_device *adev)
if (!pp_funcs->pm_compute_clocks)
return;
  
+	if (adev->mode_info.num_crtc)

+   amdgpu_display_bandwidth_update(adev);
+
+   for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
+   struct amdgpu_ring *ring = adev->rings[i];
+   if (ring && ring->sched.ready)
+   amdgpu_fence_wait_empty(ring);
+   }
+
mutex_lock(>pm.mutex);
pp_funcs->pm_compute_clocks(adev->powerplay.pp_handle);
mutex_unlock(>pm.mutex);
@@ -443,6 +453,20 @@ void amdgpu_dpm_enable_uvd(struct amdgpu_device *adev, 
bool enable)
  {
int ret = 0;
  
+	if (adev->family == AMDGPU_FAMILY_SI) {

+   mutex_lock(>pm.mutex);
+   if (enable) {
+   adev->pm.dpm.uvd_active = true;
+   adev->pm.dpm.state = POWER_STATE_TYPE_INTERNAL_UVD;
+   } else {
+   adev->pm.dpm.uvd_active = false;
+   }
+   mutex_unlock(>pm.mutex);
+
+   amdgpu_dpm_compute_clocks(adev);
+   return;
+   }
+
ret = amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_UVD, 
!enable);
if (ret)
DRM_ERROR("Dpm %s uvd failed, ret = %d. \n",
@@ -453,6 +477,21 @@ void amdgpu_dpm_enable_vce(struct amdgpu_device *adev, 
bool enable)
  {
int ret = 0;
  
+	if (adev->family == AMDGPU_FAMILY_SI) {

+   mutex_lock(>pm.mutex);
+   if (enable) {
+   adev->pm.dpm.vce_active = true;
+   /* XXX select vce level based on ring/task */
+   adev->pm.dpm.vce_level = AMD_VCE_LEVEL_AC_ALL;
+   } else {
+   adev->pm.dpm.vce_active = false;
+   }
+   mutex_unlock(>pm.mutex);
+
+   amdgpu_dpm_compute_clocks(adev);
+   return;
+   }
+
ret = amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_VCE, 
!enable);
if (ret)
DRM_ERROR("Dpm %s vce failed, ret = %d. \n",
diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c 
b/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c
index 9613c6181c17..d3fe149d8476 100644
--- a/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c
+++ b/drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c
@@ -1028,16 +1028,6 @@ static int amdgpu_dpm_change_power_state_locked(struct 
amdgpu_device *adev)
  void amdgpu_legacy_dpm_compute_clocks(void *handle)
  {
struct amdgpu_device *adev = (struct amdgpu_device *)handle;
-   int i = 0;
-
-   if (adev->mode_info.num_crtc)
-   amdgpu_display

Re: [PATCHv2] drm/amdgpu: disable ASPM on Intel AlderLake based systems

2022-04-11 Thread Paul Menzel

[Cc: +]

Dear Richard,


Am 11.04.22 um 02:27 schrieb Gong, Richard:


On 4/8/2022 7:19 PM, Paul Menzel wrote:



Am 08.04.22 um 21:05 schrieb Richard Gong:

Active State Power Management (ASPM) feature is enabled since kernel 5.14.
There are some AMD GFX cards (such as WX3200 and RX640) that cannot be
used with Intel AlderLake based systems to enable ASPM. Using these GFX


Alder Lake

will correct in the next version.



cards as video/display output, Intel Alder Lake based systems will hang
during suspend/resume.


Please reflow for 75 characters per line.

Also please mention the exact system you had problems with (also 
firmware versions).




Add extra check to disable ASPM on Intel AlderLake based systems.


Is that a problem with Intel Alder Lake or the Dell system? Shouldn’t 
ASPM just be disabled for the problematic cards for the Dell system. 
You write newer cards worked fine.


There is a problem with Dell system (Dell Precision DT workstation), 
which is based on Intel Alder Lake.


ASPM works just fine on these GPU's. It's more of an issue with whether 
the underlying platform supports ASPM or not.


At least you didn’t document what the real issue is, that ASPM does not 
work. With current information (some GPU graphics card with the the Dell 
system and others don’t), it could be the GPU, the Dell system 
(firmware, …), a problem with Alder Lake SOC, or another bug. I hope you 
are in contact with Dell to analyze it, so ASPM can be enabled again.


[…]


Kind regards,

Paul


Re: [PATCHv2] drm/amdgpu: disable ASPM on Intel AlderLake based systems

2022-04-11 Thread Paul Menzel

Dear Richard,


Thank you for your response, but please reply to your own reply next time.

Am 11.04.22 um 02:37 schrieb Gong, Richard:


On 4/8/2022 7:19 PM, Paul Menzel wrote:



Thank you for your patch.

Am 08.04.22 um 21:05 schrieb Richard Gong:
Active State Power Management (ASPM) feature is enabled since kernel 
5.14.

There are some AMD GFX cards (such as WX3200 and RX640) that cannot be
used with Intel AlderLake based systems to enable ASPM. Using these GFX


Alder Lake
Actually there are 2 formats (one with space, another is w/o space) in 
the upstream sources, so I will keep that unchanged and use the format 
w/o space.


Do you mean the Linux kernel sources? Anyway, please use the correct 
spelling [1].



Kind regards,

Paul


[1]: 
https://ark.intel.com/content/www/us/en/ark/products/codename/147470/products-formerly-alder-lake.html

[2]: https://en.wikipedia.org/wiki/Alder_Lake


Re: [PATCH 04/20] drm/amd/display: Disallow entering PSR when panel is disconnected

2022-04-09 Thread Paul Menzel

Dear Pavle, dear Max,


Am 08.04.22 um 19:18 schrieb Pavle Kotarac:

From: Max Erenberg 

[WHY]
The dGPU cannot enter PSR when it is not connected to a panel.


Maybe spell out Panel Self Refresh once.


[HOW]
Added a check to dc_link_set_psr_allow_active


s/Added/Add/


which returns early if panel is disconnected.


Please reflow for 57 characters per line.



Reviewed-by: Harry Vanzylldejong 
Reviewed-by: Evgenii Krasnikov 
Reviewed-by: Nicholas Choi 
Acked-by: Pavle Kotarac 
Signed-off-by: Max Erenberg 
---
  drivers/gpu/drm/amd/display/dc/core/dc_link.c | 5 +
  1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_link.c 
b/drivers/gpu/drm/amd/display/dc/core/dc_link.c
index 47b67fd1e84c..22f2d88fab99 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc_link.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc_link.c
@@ -3079,6 +3079,11 @@ bool dc_link_set_psr_allow_active(struct dc_link *link, 
const bool *allow_active
if (!dc_get_edp_link_panel_inst(dc, link, _inst))
return false;
  
+	if (allow_active && link->type == dc_connection_none) {


Why does `allow_active` need to be checked?


+   // Don't enter PSR if panel is not connected
+   return false;
+   }
+
/* Set power optimization flag */
if (power_opts && link->psr_settings.psr_power_opt != *power_opts) {
link->psr_settings.psr_power_opt = *power_opts;



Kind regards,

Paul


Re: [PATCH 12/20] drm/amd/display: Add odm seamless boot support

2022-04-09 Thread Paul Menzel

Dear Pavle, dear Duncan,


Thank you for the patch.

Am 08.04.22 um 19:19 schrieb Pavle Kotarac:

From: Duncan Ma 

[WHY]
Implement changes to transition from Pre-OS odm to


What is odm/ODM? Original Device Manufacturer?


Post-OS odm support. Seamless boot case is also
considered.


Please answer the question Why? better.

What is Post-OS odm support? Why are change to the transition needed?

What is seamless boot? Please add references.



[HOW]
Revised validation logic when marking for seamless
boot.


How is it revised exactly?


Init resources accordingly when Pre-OS has
odm enabled. Reset odm and det size when transitioning
Pre-OS odm to Post-OS non-odm to avoid corruption.
Apply logic to set odm accordingly upon commit.


I looked through the diff, but would love a more elaborate description 
of the implementation.


How was and can this tested?

Please reflow for 75 characters per line as textwidth.


Reviewed-by: Nicholas Kazlauskas 
Acked-by: Pavle Kotarac 
Signed-off-by: "Duncan Ma" 
---
  drivers/gpu/drm/amd/display/dc/core/dc.c  | 13 +++
  .../gpu/drm/amd/display/dc/core/dc_resource.c | 82 ---
  drivers/gpu/drm/amd/display/dc/dc.h   |  1 +
  drivers/gpu/drm/amd/display/dc/dc_stream.h|  1 +
  .../amd/display/dc/dcn10/dcn10_hw_sequencer.c | 21 +
  .../gpu/drm/amd/display/dc/dcn31/dcn31_optc.c | 21 +
  .../gpu/drm/amd/display/dc/dcn31/dcn31_optc.h |  2 +
  .../amd/display/dc/inc/hw/timing_generator.h  |  2 +
  8 files changed, 115 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/core/dc.c 
b/drivers/gpu/drm/amd/display/dc/core/dc.c
index c436db416708..c2fcd67bcc4d 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc.c
@@ -1569,11 +1569,24 @@ bool dc_validate_boot_timing(const struct dc *dc,
  
  	if (dc_is_dp_signal(link->connector_signal)) {

unsigned int pix_clk_100hz;
+   uint32_t numOdmPipes = 1;


Maybe add a comment, that the type is due to `get_optc_source` signature.

Why initialize it? get_optc_source seems to always assign a value to the 
passed variable.


Why CamelCase?


+   uint32_t id_src[4] = {0};
  
  		dc->res_pool->dp_clock_source->funcs->get_pixel_clk_frequency_100hz(

dc->res_pool->dp_clock_source,
tg_inst, _clk_100hz);
  
+		if (tg->funcs->get_optc_source)

+   tg->funcs->get_optc_source(tg,
+   , _src[0], 
_src[1]);
+
+   if (numOdmPipes == 2)
+   pix_clk_100hz *= 2;
+   if (numOdmPipes == 4)
+   pix_clk_100hz *= 4;
+
+   // Note: In rare cases, HW pixclk may differ from crtc's pixclk
+   // slightly due to rounding issues in 10 kHz units.


The comment could be added in a separate patch, and also the values be 
logged if they are different.



if (crtc_timing->pix_clk_100hz != pix_clk_100hz)
return false;
  
diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_resource.c b/drivers/gpu/drm/amd/display/dc/core/dc_resource.c

index f5777a71f2f1..f292303b75a5 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc_resource.c
@@ -2120,6 +2120,8 @@ static int acquire_resource_from_hw_enabled_state(
  {
struct dc_link *link = stream->link;
unsigned int i, inst, tg_inst = 0;
+   uint32_t numPipes = 1;


Why CamelCase?


+   uint32_t id_src[4] = {0};
  
  	/* Check for enabled DIG to identify enabled display */

if (!link->link_enc->funcs->is_dig_enabled(link->link_enc))
@@ -2148,38 +2150,62 @@ static int acquire_resource_from_hw_enabled_state(
if (!res_ctx->pipe_ctx[tg_inst].stream) {
struct pipe_ctx *pipe_ctx = _ctx->pipe_ctx[tg_inst];
  
-		pipe_ctx->stream_res.tg = pool->timing_generators[tg_inst];

-   pipe_ctx->plane_res.mi = pool->mis[tg_inst];
-   pipe_ctx->plane_res.hubp = pool->hubps[tg_inst];
-   pipe_ctx->plane_res.ipp = pool->ipps[tg_inst];
-   pipe_ctx->plane_res.xfm = pool->transforms[tg_inst];
-   pipe_ctx->plane_res.dpp = pool->dpps[tg_inst];
-   pipe_ctx->stream_res.opp = pool->opps[tg_inst];
-
-   if (pool->dpps[tg_inst]) {
-   pipe_ctx->plane_res.mpcc_inst = 
pool->dpps[tg_inst]->inst;
-
-   // Read DPP->MPCC->OPP Pipe from HW State
-   if (pool->mpc->funcs->read_mpcc_state) {
-   struct mpcc_state s = {0};
-
-   pool->mpc->funcs->read_mpcc_state(pool->mpc, 
pipe_ctx->plane_res.mpcc_inst, );
-
-   if (s.dpp_id < MAX_MPCC)
-   
pool->mpc->mpcc_array[pipe_ctx->plane_res.mpcc_inst].dpp_id = s.dpp_id;
-
-  

Re: [PATCH 03/20] drm/amd/display: Disabling Z10 on DCN31

2022-04-09 Thread Paul Menzel

Dear Pavle, dear syerizvi,


Thank you for the patch.

Am 08.04.22 um 19:18 schrieb Pavle Kotarac:

From: "AMD\\syerizvi" 


Please correct the author name.

Imperative mood should be used in the commit message summary: Disable Z10 ….



[WHY]
Z10 is should not be enabled by default on DCN31.


According to what source?



[HOW]
Using DC debug flags to disable Z10 by default on DCN31.


s/Using/Use/

Why is that grouped under debug flags? Does not seem related.



Reviewed-by: Eric Yang 
Acked-by: Pavle Kotarac 
Signed-off-by: AMD\syerizvi 


Please fix the name.


---
  drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c 
b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c
index f27262417abe..6cc580be7c79 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c
@@ -885,6 +885,7 @@ static const struct dc_debug_options debug_defaults_drv = {
.afmt = true,
}
},
+   .disable_z10 = true,
.optimize_edp_link_rate = true,
.enable_sw_cntl_psr = true,
.apply_vendor_specific_lttpr_wa = true,



Kind regards,

Paul


Re: [PATCH 02/20] drm/amd/display: do not wait for mpc idle if tg is disabled

2022-04-09 Thread Paul Menzel

Dear Pavle,


Thank you for the patch.

Am 08.04.22 um 19:18 schrieb Pavle Kotarac:

From: Josip Pavic 

[Why]
When booting, the driver waits for the MPC idle bit to be set as part of
pipe initialization. However, on some systems this occurs before OTG is
enabled, and since the MPC idle bit won't be set until the vupdate
signal occurs (which requires OTG to be enabled), this never happens and
the wait times out. This can add hundreds of milliseconds to the boot
time.


Please list one specific system, where OTG is enabled later.



[How]
Do not wait for mpc idle if tg is disabled


Please add a dot/period at the end of sentences.


Reviewed-by: Jun Lei 
Acked-by: Pavle Kotarac 
Signed-off-by: Josip Pavic 
---
  drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c 
b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c
index 50820e79d3c4..2d3d870f0bea 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c
@@ -3185,7 +3185,8 @@ void dcn10_wait_for_mpcc_disconnect(
if 
(pipe_ctx->stream_res.opp->mpcc_disconnect_pending[mpcc_inst]) {
struct hubp *hubp = get_hubp_by_inst(res_pool, 
mpcc_inst);
  
-			res_pool->mpc->funcs->wait_for_idle(res_pool->mpc, mpcc_inst);

+   if 
(pipe_ctx->stream_res.tg->funcs->is_tg_enabled(pipe_ctx->stream_res.tg))
+   
res_pool->mpc->funcs->wait_for_idle(res_pool->mpc, mpcc_inst);

pipe_ctx->stream_res.opp->mpcc_disconnect_pending[mpcc_inst] = false;
hubp->funcs->set_blank(hubp, true);
}



Kind regards,

Paul


Re: [PATCH 01/20] drm/amd/display: undo clearing of z10 related function pointers

2022-04-09 Thread Paul Menzel

Dear Eric, dear Pavle,


Thank you for the patch.

Am 08.04.22 um 19:18 schrieb Pavle Kotarac:

From: Eric Yang 

[Why]
Z10 and S0i3 have some shared path. Previous code clean up ,


1.  cleanup
2.  Remove the space before the comma, or remove the unneeded comma.


incorrectly removed these pointers, which breaks s0i3 restore


Please add a dot/period at the end of sentences.

How does tho breakage manifest exactly?

Maybe mention, that it only happened with `dc->debug.disable_z10`.



[How]
Do not clear the function pointers based on Z10 disable.



What commit introduced the regression? Please add a Fixes tag.


Reviewed-by: Nicholas Kazlauskas 
Acked-by: Pavle Kotarac 
Signed-off-by: Eric Yang 
---
  drivers/gpu/drm/amd/display/dc/dcn31/dcn31_init.c | 5 -
  1 file changed, 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_init.c 
b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_init.c
index d7559e5a99ce..e708f07fe75a 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_init.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_init.c
@@ -153,9 +153,4 @@ void dcn31_hw_sequencer_construct(struct dc *dc)
dc->hwss.init_hw = dcn20_fpga_init_hw;
dc->hwseq->funcs.init_pipes = NULL;
}
-   if (dc->debug.disable_z10) {
-   /*hw not support z10 or sw disable it*/
-   dc->hwss.z10_restore = NULL;
-   dc->hwss.z10_save_init = NULL;
-   }
  }



Kind regards,

Paul


Re: [PATCHv2] drm/amdgpu: disable ASPM on Intel AlderLake based systems

2022-04-08 Thread Paul Menzel

Dear Richard,


Thank you for your patch.

Am 08.04.22 um 21:05 schrieb Richard Gong:

Active State Power Management (ASPM) feature is enabled since kernel 5.14.
There are some AMD GFX cards (such as WX3200 and RX640) that cannot be
used with Intel AlderLake based systems to enable ASPM. Using these GFX


Alder Lake


cards as video/display output, Intel Alder Lake based systems will hang
during suspend/resume.


Please reflow for 75 characters per line.

Also please mention the exact system you had problems with (also 
firmware versions).




Add extra check to disable ASPM on Intel AlderLake based systems.


Is that a problem with Intel Alder Lake or the Dell system? Shouldn’t 
ASPM just be disabled for the problematic cards for the Dell system. You 
write newer cards worked fine.



Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default")
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1885
Signed-off-by: Richard Gong 
---
v2: correct commit description
 move the check from chip family to problematic platform
---
  drivers/gpu/drm/amd/amdgpu/vi.c | 17 -
  1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c
index 039b90cdc3bc..8b4eaf54b23e 100644
--- a/drivers/gpu/drm/amd/amdgpu/vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/vi.c
@@ -81,6 +81,10 @@
  #include "mxgpu_vi.h"
  #include "amdgpu_dm.h"
  
+#if IS_ENABLED(CONFIG_X86_64)

+#include 
+#endif
+
  #define ixPCIE_LC_L1_PM_SUBSTATE  0x100100C6
  #define PCIE_LC_L1_PM_SUBSTATE__LC_L1_SUBSTATES_OVERRIDE_EN_MASK  
0x0001L
  #define PCIE_LC_L1_PM_SUBSTATE__LC_PCI_PM_L1_2_OVERRIDE_MASK  0x0002L
@@ -1134,13 +1138,24 @@ static void vi_enable_aspm(struct amdgpu_device *adev)
WREG32_PCIE(ixPCIE_LC_CNTL, data);
  }
  
+static bool intel_core_apsm_chk(void)


aspm


+{
+#if IS_ENABLED(CONFIG_X86_64)
+   struct cpuinfo_x86 *c = _data(0);
+
+   return (c->x86 == 6 && c->x86_model == INTEL_FAM6_ALDERLAKE);
+#else
+   return false;
+#endif


Please do the check in C code and not the preprocessor.


+}
+
  static void vi_program_aspm(struct amdgpu_device *adev)
  {
u32 data, data1, orig;
bool bL1SS = false;
bool bClkReqSupport = true;
  
-	if (!amdgpu_device_should_use_aspm(adev))

+   if (!amdgpu_device_should_use_aspm(adev) || intel_core_apsm_chk())
return;
  
  	if (adev->flags & AMD_IS_APU ||



Kind regards,

Paul


Re: [PATCH Review 1/1] drm/amdgpu: add umc query error status function

2022-04-08 Thread Paul Menzel

Dear Stanley,


Thank you for the patch.

Am 08.04.22 um 10:10 schrieb Stanley.Yang:

Please remove the dot/period from the name.


In order to debug ras error, driver will print IPID/SYND/MISC0
register value if detect correctable or uncorrectable error.


… if it detects …


Provide umc_query_error_status_helper function to reduce code
redundancy.


Can you please make this two patches. First refactoring, and then adding 
new call sites. (The current commit message summary, does not say 
anything about new call-sites.)



Change-Id: I09a2aae85cde3ab2cb6b042b973da6839ad024ec


Without the Gerrit review instance, the Change-Id is not of any use.


Signed-off-by: Stanley.Yang 


Please remove the . from the name.


Kind regards,

Paul



---
  drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 106 --
  1 file changed, 48 insertions(+), 58 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c 
b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
index c45d9c14ecbc..9d3b54778417 100644
--- a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
+++ b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
@@ -64,21 +64,62 @@ static inline uint32_t get_umc_v6_7_channel_index(struct 
amdgpu_device *adev,
return adev->umc.channel_idx_tbl[umc_inst * adev->umc.channel_inst_num 
+ ch_inst];
  }
  
+static void umc_query_error_status_helper(struct amdgpu_device *adev,

+ uint64_t mc_umc_status, 
uint32_t umc_reg_offset)
+{
+   uint32_t mc_umc_addr;
+   uint64_t reg_value;
+
+   if (REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, Deferred) 
== 1)
+   dev_info(adev->dev, "Deferred error, no user action is 
needed.\n");
+
+   if (mc_umc_status)
+   dev_info(adev->dev, "MCA STATUS 0x%llx, umc_reg_offset 0x%x\n", 
mc_umc_status, umc_reg_offset);
+
+   /* print IPID registers value */
+   mc_umc_addr =
+   SOC15_REG_OFFSET(UMC, 0, regMCA_UMC_UMC0_MCUMC_IPIDT0);
+   reg_value = RREG64_PCIE((mc_umc_addr + umc_reg_offset) * 4);
+   if (reg_value)
+   dev_info(adev->dev, "MCA IPID 0x%llx, umc_reg_offset 0x%x\n", 
reg_value, umc_reg_offset);
+
+   /* print SYND registers value */
+   mc_umc_addr =
+   SOC15_REG_OFFSET(UMC, 0, regMCA_UMC_UMC0_MCUMC_SYNDT0);
+   reg_value = RREG64_PCIE((mc_umc_addr + umc_reg_offset) * 4);
+   if (reg_value)
+   dev_info(adev->dev, "MCA SYND 0x%llx, umc_reg_offset 0x%x\n", 
reg_value, umc_reg_offset);
+
+   /* print MISC0 registers value */
+   mc_umc_addr =
+   SOC15_REG_OFFSET(UMC, 0, regMCA_UMC_UMC0_MCUMC_MISC0T0);
+   reg_value = RREG64_PCIE((mc_umc_addr + umc_reg_offset) * 4);
+   if (reg_value)
+   dev_info(adev->dev, "MCA MISC0 0x%llx, umc_reg_offset 0x%x\n", 
reg_value, umc_reg_offset);
+}
+
  static void umc_v6_7_ecc_info_query_correctable_error_count(struct 
amdgpu_device *adev,
   uint32_t umc_inst, uint32_t 
ch_inst,
   unsigned long *error_count)
  {
uint64_t mc_umc_status;
uint32_t eccinfo_table_idx;
+   uint32_t umc_reg_offset;
struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
  
+	umc_reg_offset = get_umc_v6_7_reg_offset(adev,

+   umc_inst, ch_inst);
+
eccinfo_table_idx = umc_inst * adev->umc.channel_inst_num + ch_inst;
/* check for SRAM correctable error
  MCUMC_STATUS is a 64 bit register */
mc_umc_status = ras->umc_ecc.ecc[eccinfo_table_idx].mca_umc_status;
if (REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, Val) == 1 
&&
-   REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, CECC) == 
1)
+   REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, CECC) == 
1) {
*error_count += 1;
+
+   umc_query_error_status_helper(adev, mc_umc_status, 
umc_reg_offset);
+   }
  }
  
  static void umc_v6_7_ecc_info_querry_uncorrectable_error_count(struct amdgpu_device *adev,

@@ -88,8 +129,6 @@ static void 
umc_v6_7_ecc_info_querry_uncorrectable_error_count(struct amdgpu_dev
uint64_t mc_umc_status;
uint32_t eccinfo_table_idx;
uint32_t umc_reg_offset;
-   uint32_t mc_umc_addr;
-   uint64_t reg_value;
struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
  
  	umc_reg_offset = get_umc_v6_7_reg_offset(adev,

@@ -106,32 +145,7 @@ static void 
umc_v6_7_ecc_info_querry_uncorrectable_error_count(struct amdgpu_dev
REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, TCC) == 
1)) {
*error_count += 1;
  
-		if (REG_GET_FIELD(mc_umc_status, MCA_UMC_UMC0_MCUMC_STATUST0, Deferred) == 1)

-   dev_info(adev->dev, "Deferred error, no user action is 
needed.\n");
-
-   if (mc_umc_status)
-   dev_info(adev->dev, 

Re: [PATCH 1/1] amdgpu/pm: Clarify Documentation of error handling in send_smc_mesg

2022-04-08 Thread Paul Menzel

Dear Darren,


Thank you for your patch.

Am 08.04.22 um 04:26 schrieb Darren Powell:

  Contrary to the smu_cmn_send_smc_msg_with_param documentation, two
  cases exist where messages are silently dropped with no error returned
  to the caller. These cases occur in unusual situations where either:
   1. the caller is a virtual GPU, or
   2. a PCI recovery is underway and the HW is not yet in sync with the SW

  For more details see
   commit 4ea5081c82c4 ("drm/amd/powerplay: enable SMC message filter")
   commit bf36b52e781d ("drm/amdgpu: Avoid accessing HW when suspending SW 
state")


Please remove the indentation. If you re-rolled the patch, you could 
also spell *documentation* lowercase in the commit message summary.



Signed-off-by: Darren Powell 
---
  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 7 +--
  1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
index b8d0c70ff668..b1bd1990c88b 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
@@ -356,12 +356,15 @@ int smu_cmn_wait_for_response(struct smu_context *smu)
   * completion of the command, and return back a value from the SMU in
   * @read_arg pointer.
   *
- * Return 0 on success, -errno on error, if we weren't able to send
+ * Return 0 on success, or if the message is dropped.
+ * On error, -errno is returned if we weren't able to send
   * the message or if the message completed with some kind of
   * error. See __smu_cmn_reg2errno() for details of the -errno.
   *
   * If we weren't able to send the message to the SMU, we also print
- * the error to the standard log.
+ * the error to the standard log. Dropped messages can be caused
+ * due to PCI slot recovery or attempting to send from a virtual GPU,
+ * and do not print an error.
   *
   * Command completion status is printed only if the -errno is
   * -EREMOTEIO, indicating that the SMU returned back an

base-commit: 4585c45a6a66cb17cc97f4370457503746e540b7


The diff looks good – despite Mozilla Thunderbird quoting it strangely.


Kind regards,

Paul


Re: [PATCH v12] drm/amdgpu: add drm buddy support to amdgpu

2022-04-07 Thread Paul Menzel

Dear Arunpravin,


Thank you for your patch.

Am 07.04.22 um 07:46 schrieb Arunpravin Paneer Selvam:

- Switch to drm buddy allocator
- Add resource cursor support for drm buddy


I though after the last long discussion, you would actually act on the 
review comments. Daniel wrote a good summary, you could more or less 
copy and past. So why didn’t you?


So, I really wish to not have the patch commit as is.

The summary should also say something about using mutex over spinlocks. 
For me the version change summaries below are just for reviewers of 
earlier iterations to see what changed, and not something to be read easily.



Kind regards,

Paul



v2(Matthew Auld):
   - replace spinlock with mutex as we call kmem_cache_zalloc
 (..., GFP_KERNEL) in drm_buddy_alloc() function

   - lock drm_buddy_block_trim() function as it calls
 mark_free/mark_split are all globally visible

v3(Matthew Auld):
   - remove trim method error handling as we address the failure case
 at drm_buddy_block_trim() function

v4:
   - fix warnings reported by kernel test robot 

v5:
   - fix merge conflict issue

v6:
   - fix warnings reported by kernel test robot 

v7:
   - remove DRM_BUDDY_RANGE_ALLOCATION flag usage

v8:
   - keep DRM_BUDDY_RANGE_ALLOCATION flag usage
   - resolve conflicts created by drm/amdgpu: remove VRAM accounting v2

v9(Christian):
   - merged the below patch
  - drm/amdgpu: move vram inline functions into a header
   - rename label name as fallback
   - move struct amdgpu_vram_mgr to amdgpu_vram_mgr.h
   - remove unnecessary flags from struct amdgpu_vram_reservation
   - rewrite block NULL check condition
   - change else style as per coding standard
   - rewrite the node max size
   - add a helper function to fetch the first entry from the list

v10(Christian):
- rename amdgpu_get_node() function name as amdgpu_vram_mgr_first_block

v11:
- if size is not aligned with min_page_size, enable is_contiguous flag,
  therefore, the size round up to the power of two and trimmed to the
  original size.
v12:
- rename the function names having prefix as amdgpu_vram_mgr_*()
- modify the round_up() logic conforming to contiguous flag enablement
  or if size is not aligned to min_block_size
- modify the trim logic
- rename node as block wherever applicable

Signed-off-by: Arunpravin Paneer Selvam 
---
  drivers/gpu/drm/Kconfig   |   1 +
  .../gpu/drm/amd/amdgpu/amdgpu_res_cursor.h|  97 -
  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h   |  10 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c  | 359 ++
  4 files changed, 291 insertions(+), 176 deletions(-)

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index f1422bee3dcc..5133c3f028ab 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -280,6 +280,7 @@ config DRM_AMDGPU
select HWMON
select BACKLIGHT_CLASS_DEVICE
select INTERVAL_TREE
+   select DRM_BUDDY
help
  Choose this option if you have a recent AMD Radeon graphics card.
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h

index acfa207cf970..6546552e596c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
@@ -30,12 +30,15 @@
  #include 
  #include 
  
+#include "amdgpu_vram_mgr.h"

+
  /* state back for walking over vram_mgr and gtt_mgr allocations */
  struct amdgpu_res_cursor {
uint64_tstart;
uint64_tsize;
uint64_tremaining;
-   struct drm_mm_node  *node;
+   void*node;
+   uint32_tmem_type;
  };
  
  /**

@@ -52,27 +55,63 @@ static inline void amdgpu_res_first(struct ttm_resource 
*res,
uint64_t start, uint64_t size,
struct amdgpu_res_cursor *cur)
  {
+   struct drm_buddy_block *block;
+   struct list_head *head, *next;
struct drm_mm_node *node;
  
-	if (!res || res->mem_type == TTM_PL_SYSTEM) {

-   cur->start = start;
-   cur->size = size;
-   cur->remaining = size;
-   cur->node = NULL;
-   WARN_ON(res && start + size > res->num_pages << PAGE_SHIFT);
-   return;
-   }
+   if (!res)
+   goto fallback;
  
  	BUG_ON(start + size > res->num_pages << PAGE_SHIFT);
  
-	node = to_ttm_range_mgr_node(res)->mm_nodes;

-   while (start >= node->size << PAGE_SHIFT)
-   start -= node++->size << PAGE_SHIFT;
+   cur->mem_type = res->mem_type;
+
+   switch (cur->mem_type) {
+   case TTM_PL_VRAM:
+   head = _amdgpu_vram_mgr_resource(res)->blocks;
+
+   block = list_first_entry_or_null(head,
+struct drm_buddy_block,
+

Re: [PATCH] drm/amdgpu/smu10: fix SoC/fclk units in auto mode

2022-04-06 Thread Paul Menzel

Dear Alex,


Thank you for your patch.

Am 01.04.22 um 17:16 schrieb Alex Deucher:

SMU takes clock limits in Mhz units.  socclk and fclk were
using 10 khz units in some cases.  Switch to Mhz units.
Fixes higher than required SoC clocks.


It’d be great if you used 75 characters per line in commit messages.


Fixes: 97cf32996c46d9 ("drm/amd/pm: Removed fixed clock in auto mode DPM")
Signed-off-by: Alex Deucher 
---
  drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c
index 9ddd8491ff00..545dd68a8c18 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c
@@ -773,13 +773,13 @@ static int smu10_dpm_force_dpm_level(struct pp_hwmgr 
*hwmgr,
smum_send_msg_to_smc_with_parameter(hwmgr,
PPSMC_MSG_SetHardMinFclkByFreq,
hwmgr->display_config->num_display 
> 3 ?
-   
data->clock_vol_info.vdd_dep_on_fclk->entries[0].clk :
+   
(data->clock_vol_info.vdd_dep_on_fclk->entries[0].clk / 100) :
min_mclk,


Should that also be indented now?


NULL);
  
  		smum_send_msg_to_smc_with_parameter(hwmgr,


PPSMC_MSG_SetHardMinSocclkByFreq,
-   
data->clock_vol_info.vdd_dep_on_socclk->entries[0].clk,
+   
data->clock_vol_info.vdd_dep_on_socclk->entries[0].clk / 100,
NULL);
smum_send_msg_to_smc_with_parameter(hwmgr,
PPSMC_MSG_SetHardMinVcn,
@@ -792,11 +792,11 @@ static int smu10_dpm_force_dpm_level(struct pp_hwmgr 
*hwmgr,
NULL);
smum_send_msg_to_smc_with_parameter(hwmgr,
PPSMC_MSG_SetSoftMaxFclkByFreq,
-   
data->clock_vol_info.vdd_dep_on_fclk->entries[index_fclk].clk,
+   
data->clock_vol_info.vdd_dep_on_fclk->entries[index_fclk].clk / 100,
NULL);
smum_send_msg_to_smc_with_parameter(hwmgr,

PPSMC_MSG_SetSoftMaxSocclkByFreq,
-   
data->clock_vol_info.vdd_dep_on_socclk->entries[index_socclk].clk,
+   
data->clock_vol_info.vdd_dep_on_socclk->entries[index_socclk].clk / 100,
NULL);
smum_send_msg_to_smc_with_parameter(hwmgr,
    PPSMC_MSG_SetSoftMaxVcn,


Reviewed-by: Paul Menzel 


Kind regards,

Paul


Re: Public patches but non-public development branch

2022-04-05 Thread Paul Menzel

Dear Alex,


Am 05.04.22 um 16:13 schrieb Alex Deucher:

On Tue, Apr 5, 2022 at 10:03 AM Paul Menzel  wrote:



Am 05.04.22 um 15:14 schrieb Alex Deucher:

On Tue, Apr 5, 2022 at 3:02 AM Paul Menzel wrote:



Am 05.04.22 um 08:54 schrieb Christian König:

Am 05.04.22 um 08:45 schrieb Paul Menzel:



Am 04.04.22 um 23:42 schrieb Philip Yang:

bo_adev is NULL for system memory mapping to GPU.

Fixes: 05fe8eeca92 (drm/amdgpu: fix TLB flushing during eviction)


Sorry, where can I find that commit?


Well that's expected, the development branch is not public.


Well obviously, it was unexpected for me. How should I have known? Where
is that documented? If the patches are publicly posted to the mailing
list, why is that development branch not public?

The current situation is really frustrating for non-AMD employees. How
can the current situation be improved?


Our development branch
(https://gitlab.freedesktop.org/agd5f/linux/-/commits/amd-staging-drm-next)
is available publicly.  There can be a day or so of lag depending on
when it gets mirrored (e.g., over the weekend).


Thank you for the clarification. As can be seen at hand, it still causes
confusion though.

  commit 05fe8eeca927e29b81f3f2a799e9b9b88b0989a9
  Author: Christian König 
  AuthorDate: Wed Mar 30 10:53:15 2022 +0200
  Commit: Christian König 
  CommitDate: Fri Apr 1 11:05:51 2022 +0200

Today is Tuesday, so it wasn’t mirrored yesterday, on Monday.

To avoid this friction in the future, is there an automated way to
mirror the branches? git hooks should allow that to be done on every
push for example.


It's a bit more complicated than that since we have various CI systems
and IT security policies involved, but we can look into it.


That’d be awesome. Thank you!


Kind regards,

Paul


Re: Public patches but non-public development branch

2022-04-05 Thread Paul Menzel

Dear Alex,


Am 05.04.22 um 15:14 schrieb Alex Deucher:

On Tue, Apr 5, 2022 at 3:02 AM Paul Menzel wrote:



Am 05.04.22 um 08:54 schrieb Christian König:

Am 05.04.22 um 08:45 schrieb Paul Menzel:



Am 04.04.22 um 23:42 schrieb Philip Yang:

bo_adev is NULL for system memory mapping to GPU.

Fixes: 05fe8eeca92 (drm/amdgpu: fix TLB flushing during eviction)


Sorry, where can I find that commit?


Well that's expected, the development branch is not public.


Well obviously, it was unexpected for me. How should I have known? Where
is that documented? If the patches are publicly posted to the mailing
list, why is that development branch not public?

The current situation is really frustrating for non-AMD employees. How
can the current situation be improved?


Our development branch
(https://gitlab.freedesktop.org/agd5f/linux/-/commits/amd-staging-drm-next)
is available publicly.  There can be a day or so of lag depending on
when it gets mirrored (e.g., over the weekend).


Thank you for the clarification. As can be seen at hand, it still causes 
confusion though.


commit 05fe8eeca927e29b81f3f2a799e9b9b88b0989a9
Author: Christian König 
AuthorDate: Wed Mar 30 10:53:15 2022 +0200
Commit: Christian König 
CommitDate: Fri Apr 1 11:05:51 2022 +0200

Today is Tuesday, so it wasn’t mirrored yesterday, on Monday.

To avoid this friction in the future, is there an automated way to 
mirror the branches? git hooks should allow that to be done on every 
push for example.



Kind regards,

Paul


Public patches but non-public development branch (Re: [PATCH 1/1] drm/amdkfd: Add missing NULL check in svm_range_map_to_gpu)

2022-04-05 Thread Paul Menzel

Dear Christian,


Am 05.04.22 um 08:54 schrieb Christian König:

Am 05.04.22 um 08:45 schrieb Paul Menzel:



Am 04.04.22 um 23:42 schrieb Philip Yang:

bo_adev is NULL for system memory mapping to GPU.

Fixes: 05fe8eeca92 (drm/amdgpu: fix TLB flushing during eviction)


Sorry, where can I find that commit?


Well that's expected, the development branch is not public.


Well obviously, it was unexpected for me. How should I have known? Where 
is that documented? If the patches are publicly posted to the mailing 
list, why is that development branch not public?


The current situation is really frustrating for non-AMD employees. How 
can the current situation be improved?



Kind regards,

Paul


I do not see it in drm-next from agd5f git archive 
g...@gitlab.freedesktop.org:agd5f/linux.git.


    $ git log --oneline -1
    e45422695c19 (HEAD, agd5f/drm-next) drm/amdkfd: Create file 
descriptor after client is added to smi_clients list



Kind regards,

Paul



Signed-off-by: Philip Yang 
---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c

index 907b02045824..d3fb2d0b5a25 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1281,7 +1281,7 @@ svm_range_map_to_gpu(struct kfd_process_device 
*pdd, struct svm_range *prange,

 last_start, prange->start + i,
 pte_flags,
 last_start - prange->start,
-   bo_adev->vm_manager.vram_base_offset,
+   bo_adev ? 
bo_adev->vm_manager.vram_base_offset : 0,

 NULL, dma_addr, >last_update);
    for (j = last_start - prange->start; j <= i; j++)


Re: [PATCH 1/1] drm/amdkfd: Add missing NULL check in svm_range_map_to_gpu

2022-04-05 Thread Paul Menzel

Dear Philip,


Am 04.04.22 um 23:42 schrieb Philip Yang:

bo_adev is NULL for system memory mapping to GPU.

Fixes: 05fe8eeca92 (drm/amdgpu: fix TLB flushing during eviction)


Sorry, where can I find that commit? I do not see it in drm-next from 
agd5f git archive g...@gitlab.freedesktop.org:agd5f/linux.git.


$ git log --oneline -1
e45422695c19 (HEAD, agd5f/drm-next) drm/amdkfd: Create file 
descriptor after client is added to smi_clients list



Kind regards,

Paul



Signed-off-by: Philip Yang 
---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 907b02045824..d3fb2d0b5a25 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1281,7 +1281,7 @@ svm_range_map_to_gpu(struct kfd_process_device *pdd, 
struct svm_range *prange,
   last_start, prange->start + i,
   pte_flags,
   last_start - prange->start,
-  bo_adev->vm_manager.vram_base_offset,
+  bo_adev ? 
bo_adev->vm_manager.vram_base_offset : 0,
   NULL, dma_addr, >last_update);
  
  		for (j = last_start - prange->start; j <= i; j++)


Re: [PATCH 1/1] drm/amdgpu: Flush TLB after mapping for VG20+XGMI

2022-04-04 Thread Paul Menzel

Dear Philip,


Thank you for your response.

Am 04.04.22 um 18:52 schrieb philip yang:


On 2022-04-04 12:19, Paul Menzel wrote:



Am 01.04.22 um 21:57 schrieb Philip Yang:

For VG20 + XGMI bridge, all mappings PTEs cache in TC, this may have
stall invalid PTEs in TC because one cache line has 8 pages. Need always


Can you please rephrase. “may have stall …” is really hard to understand.

The patch already pushed and merged.


Sorry, but that really sucks, that commits with such a hard to 
understand commit message are merged.



flush_tlb after updating mapping.


Maybe:

So, always flush_tlb after updating the mapping.


Signed-off-by: Philip Yang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 6 ++
  1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c

index f0aec04111a3..687c9a140645 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -837,6 +837,12 @@ int amdgpu_vm_update_range(struct amdgpu_device *adev, 
struct amdgpu_vm *vm,

  goto error_unlock;
  }
  +/* Vega20+XGMI where PTEs get inadvertently cached in L2 texture cache,
+ * heavy-weight flush TLB unconditionally.
+ */
+flush_tlb |= (adev->gmc.xgmi.num_physical_nodes &&
+  adev->ip_versions[GC_HWIP][0] == IP_VERSION(9, 4, 0));
+
  memset(, 0, sizeof(params));
  params.adev = adev;
  params.vm = vm;


Did you do any performance measurement, if that flushing affects anything?


There was another patch to optimize TLB flush to improve map to GPU performance,
for this config, always flush TLB after updating mapping is the original
performance before the optimization.


So, why didn’t you reference this commit in the commit message, and also 
did not give that information in the commit message?



Kind regards,

Paul


Re: [PATCH 1/1] drm/amdgpu: Flush TLB after mapping for VG20+XGMI

2022-04-04 Thread Paul Menzel

Dear Philip,


Thank you for your patch.

Am 01.04.22 um 21:57 schrieb Philip Yang:

For VG20 + XGMI bridge, all mappings PTEs cache in TC, this may have
stall invalid PTEs in TC because one cache line has 8 pages. Need always


Can you please rephrase. “may have stall …” is really hard to understand.


flush_tlb after updating mapping.


Maybe:

So, always flush_tlb after updating the mapping.


Signed-off-by: Philip Yang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 6 ++
  1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index f0aec04111a3..687c9a140645 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -837,6 +837,12 @@ int amdgpu_vm_update_range(struct amdgpu_device *adev, 
struct amdgpu_vm *vm,
goto error_unlock;
}
  
+	/* Vega20+XGMI where PTEs get inadvertently cached in L2 texture cache,

+* heavy-weight flush TLB unconditionally.
+*/
+   flush_tlb |= (adev->gmc.xgmi.num_physical_nodes &&
+ adev->ip_versions[GC_HWIP][0] == IP_VERSION(9, 4, 0));
+
memset(, 0, sizeof(params));
params.adev = adev;
params.vm = vm;


Did you do any performance measurement, if that flushing affects anything?


Kind regards,

Paul


Regression: No signal when loading amdgpu, and system lockup (was: [PATCH V4 17/17] drm/amd/pm: unified lock protections in amdgpu_dpm.c)

2022-04-04 Thread Paul Menzel

#regzbot introduced: 3712e7a494596b26861f4dc9b81676d1d0272eaf
#regzbot title: No signal when loading amdgpu, and system lockup
#regzbot monitor: https://gitlab.freedesktop.org/drm/amd/-/issues/1957

Dear Arthur,


Thank you for your message. Too bad you didn’t update the subject, and 
didn’t include regressi...@lists.linux.dev and notify regzbot [1] about 
it. (It’s understandable, as you might be unfamiliar with the processes, 
but I would have expected at least Even to do.) So I also spent quite 
some time on bisecting, but reached the same conclusion.


Am 31.03.22 um 04:28 schrieb Arthur Marsh:

Hi, I have a Cape Verde GPU card in my pc and after git bisecting a situation
where, at the time of the amdgpu module, the monitor would lose signal and
the pc locked up so that it only responded to a magic sysreq boot (with no
logging due to it happening before the root filesystem was writeable), the
above commit was identified as the culprit.

The GPU card is a Gigabyte R7 250 with pci-id 1002:682b (rev 87).

With the 5.17.0 kernel and a kernel command line of:

amdgpu.audio=1 amdgpu.si_support=1

the following dmesg output was received:

[   76.118991] [drm] amdgpu kernel modesetting enabled.
[   76.119100] amdgpu :01:00.0: vgaarb: deactivate vga console
[   76.120004] Console: switching to colour dummy device 80x25
[   76.120203] [drm] initializing kernel modesetting (VERDE 0x1002:0x682B 
0x1458:0x22CA 0x87).
[   76.120211] amdgpu :01:00.0: amdgpu: Trusted Memory Zone (TMZ) feature 
not supported
[   76.120235] [drm] register mmio base: 0xFE8C
[   76.120238] [drm] register mmio size: 262144
[   76.120245] [drm] add ip block number 0 
[   76.120248] [drm] add ip block number 1 
[   76.120251] [drm] add ip block number 2 
[   76.120253] [drm] add ip block number 3 
[   76.120256] [drm] add ip block number 4 
[   76.120258] [drm] add ip block number 5 
[   76.120261] [drm] add ip block number 6 
[   76.120264] [drm] add ip block number 7 
[   76.163659] [drm] BIOS signature incorrect 5b 7
[   76.163669] resource sanity check: requesting [mem 0x000c-0x000d], 
which spans more than PCI Bus :00 [mem 0x000d-0x000d window]
[   76.163677] caller pci_map_rom+0x68/0x1b0 mapping multiple BARs
[   76.163691] amdgpu :01:00.0: No more image in the PCI ROM
[   76.164996] amdgpu :01:00.0: amdgpu: Fetched VBIOS from ROM BAR
[   76.165001] amdgpu: ATOM BIOS: xxx-xxx-xxx
[   76.165018] amdgpu :01:00.0: amdgpu: PCIE atomic ops is not supported
[   76.165270] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment 
size is 9-bit
[   76.349679] amdgpu :01:00.0: amdgpu: VRAM: 2048M 0x00F4 - 
0x00F47FFF (2048M used)
[   76.349716] amdgpu :01:00.0: amdgpu: GART: 1024M 0x00FF - 
0x00FF3FFF
[   76.349753] [drm] Detected VRAM RAM=2048M, BAR=256M
[   76.349764] [drm] RAM width 128bits DDR3
[   76.349940] [drm] amdgpu: 2048M of VRAM memory ready
[   76.349953] [drm] amdgpu: 3072M of GTT memory ready.
[   76.349992] [drm] GART: num cpu pages 262144, num gpu pages 262144
[   76.350506] amdgpu :01:00.0: amdgpu: PCIE GART of 1024M enabled (table 
at 0x00F40090).
[   76.495343] [drm] Internal thermal controller with fan control
[   76.495391] [drm] amdgpu: dpm initialized
[   76.495637] [drm] AMDGPU Display Connectors
[   76.495647] [drm] Connector 0:
[   76.495655] [drm]   HDMI-A-1
[   76.495662] [drm]   HPD1
[   76.495668] [drm]   DDC: 0x194c 0x194c 0x194d 0x194d 0x194e 0x194e 0x194f 
0x194f
[   76.495685] [drm]   Encoders:
[   76.495691] [drm] DFP1: INTERNAL_UNIPHY
[   76.495699] [drm] Connector 1:
[   76.495706] [drm]   DVI-D-1
[   76.495712] [drm]   HPD2
[   76.495718] [drm]   DDC: 0x1950 0x1950 0x1951 0x1951 0x1952 0x1952 0x1953 
0x1953
[   76.495733] [drm]   Encoders:
[   76.495739] [drm] DFP2: INTERNAL_UNIPHY
[   76.495746] [drm] Connector 2:
[   76.495753] [drm]   VGA-1
[   76.495758] [drm]   DDC: 0x1970 0x1970 0x1971 0x1971 0x1972 0x1972 0x1973 
0x1973
[   76.495773] [drm]   Encoders:
[   76.495779] [drm] CRT1: INTERNAL_KLDSCP_DAC1
[   76.599604] [drm] Found UVD firmware Version: 64.0 Family ID: 13
[   76.603443] [drm] PCIE gen 2 link speeds already enabled
[   77.149564] [drm] UVD initialized successfully.
[   77.149578] amdgpu :01:00.0: amdgpu: SE 1, SH per SE 2, CU per SH 5, 
active_cu_number 8
[   77.456492] RTL8211B Gigabit Ethernet r8169-0-300:00: attached PHY driver 
(mii_bus:phy_addr=r8169-0-300:00, irq=MAC)
[   77.486245] [drm] Initialized amdgpu 3.44.0 20150101 for :01:00.0 on 
minor 0
[   77.521555] r8169 :03:00.0 eth0: Link is Down
[   77.547158] fbcon: amdgpudrmfb (fb0) is primary device
[   77.591226] Console: switching to colour frame buffer device 240x67
[   77.600296] amdgpu :01:00.0: [drm] fb0: amdgpudrmfb frame buffer device

I can supply extra details but found no logging from the sessions that 
experienced the lock-up.


I had created issue *Worker [210] processing 

Re: [Bug][5.18-rc0] Between commits ed4643521e6a and 34af78c4e616, appears warning "WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120" and some ga

2022-04-04 Thread Paul Menzel

Dear Christian,


Am 04.04.22 um 08:30 schrieb Christian König:


those are two independent and already known problems.

The warning triggered from the sync_file is already fixed in 
drm-misc-next-fixes, but so far I couldn't figure out why the games 
suddenly doesn't work any more.


There is a bug report for that, but bisecting the changes didn't yielded 
anything valuable so far.


So if you can come up with something that would be rather valuable.


It’d be great, if you (or somebody else) could provide the URL to that 
issue.



Kind regards,

Paul


Re: [PATCH] drm/amdgpu/vcn: remove Unneeded semicolon

2022-03-31 Thread Paul Menzel

Dear Haowen,


Thank you for your patch.

Am 31.03.22 um 07:56 schrieb Haowen Bai:

In the commit message summary, please use:

Remove unneeded semicolon


report by coccicheck:
drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c:1951:2-3: Unneeded semicolon

fixed c543dcb ("drm/amdgpu/vcn: Add VCN ras error query support")


Please use

Fixes: …

and a commit hash length of 12 characters. (`scripts/checkpatch.pl …` 
should tell you about this.)



Kind regards,

Paul



Signed-off-by: Haowen Bai 
---
  drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
index 3e1de8c..17d44be 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
@@ -1948,7 +1948,7 @@ static uint32_t vcn_v2_6_query_poison_by_instance(struct 
amdgpu_device *adev,
break;
default:
break;
-   };
+   }
  
  	if (poison_stat)

dev_info(adev->dev, "Poison detected in VCN%d, sub_block%d\n",


Re: [PATCH] drm/amdgpu: don't use BACO for reset in S3

2022-03-31 Thread Paul Menzel

Dear Alex,


Thank you for your patch.

Am 31.03.22 um 16:56 schrieb Alex Deucher:

Seems to cause a reboots or hangs on some systems.


It’d be great if you listed the systems from the bug reports.


Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1924
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1953
Fixes: daf8de0874ab5b ("drm/amdgpu: always reset the asic in suspend (v2)")
Signed-off-by: Alex Deucher 
---
  drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
index c73fb73e9628..1ff6c42fb110 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
@@ -173,6 +173,9 @@ bool amdgpu_dpm_is_baco_supported(struct amdgpu_device 
*adev)
  
  	if (!pp_funcs || !pp_funcs->get_asic_baco_capability)

return false;
+   /* don't use baco for reset in S3 */


In issue #1924, I think you also write, it’s a workaround. Maybe make 
that clear in the comment?



+   if (adev->in_s3)
+   return false;
  
  	mutex_lock(>pm.mutex);
  



Kind regards,

Paul


Re: [PATCH 1/1] drm: add PSR2 support and capability definition as per eDP 1.5

2022-03-31 Thread Paul Menzel

Dear David,


Thank you for your patch.

Am 31.03.22 um 19:26 schrieb David Zhang:

[why & how]
In eDP 1.5 spec, some new DPCD bit fileds are defined for PSR-SU
support.


You could be specific by using the exact number two. Maybe:

As per eDP 1.5 specification, add the two DPCD bit fields below for 
PSR-SU support:


1.  DP_PSR2_WITH_Y_COORD_ET_SUPPORTED
2.  DP_PSR2_SU_AUX_FRAME_SYNC_NOT_NEEDED


Kind regards,

Paul


Signed-off-by: David Zhang 
---
  include/drm/drm_dp_helper.h | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/include/drm/drm_dp_helper.h b/include/drm/drm_dp_helper.h
index 30359e434c3f..ac7b7571ae66 100644
--- a/include/drm/drm_dp_helper.h
+++ b/include/drm/drm_dp_helper.h
@@ -361,6 +361,7 @@ struct drm_panel;
  # define DP_PSR_IS_SUPPORTED1
  # define DP_PSR2_IS_SUPPORTED 2   /* eDP 1.4 */
  # define DP_PSR2_WITH_Y_COORD_IS_SUPPORTED  3 /* eDP 1.4a */
+# define DP_PSR2_WITH_Y_COORD_ET_SUPPORTED  4  /* eDP 1.5, adopted eDP 
1.4b SCR */
  
  #define DP_PSR_CAPS 0x071   /* XXX 1.2? */

  # define DP_PSR_NO_TRAIN_ON_EXIT1
@@ -375,6 +376,7 @@ struct drm_panel;
  # define DP_PSR_SETUP_TIME_SHIFT1
  # define DP_PSR2_SU_Y_COORDINATE_REQUIRED   (1 << 4)  /* eDP 1.4a */
  # define DP_PSR2_SU_GRANULARITY_REQUIRED(1 << 5)  /* eDP 1.4b */
+# define DP_PSR2_SU_AUX_FRAME_SYNC_NOT_NEEDED (1 << 6)/* eDP 1.5, adopted eDP 
1.4b SCR */
  
  #define DP_PSR2_SU_X_GRANULARITY	0x072 /* eDP 1.4b */

  #define DP_PSR2_SU_Y_GRANULARITY  0x074 /* eDP 1.4b */


Re: [PATCH V2] drm/amdgpu: fix incorrect GCR_GENERAL_CNTL address

2022-03-30 Thread Paul Menzel

Dear Ruili,


Am 29.03.22 um 11:17 schrieb Ji, Ruili:


This is not related to any issue.


I didn’t mean an issue (where I’d use Resolves to differentiate the two 
cases), but the commit introducing the incorrect address.



Kind regards,

Paul


PS: Please use interleaved style when replying instead of top-posting.


Re: [PATCH v2] drm/amdgpu: fix that issue that the number of the crtc of the 3250c is not correct

2022-03-30 Thread Paul Menzel

[Cc: Remove undeliverable Chun Ming Zhou <mailto:david1.z...@amd.com>]

Am 30.03.22 um 08:34 schrieb Paul Menzel:

Dear Tsung-Hua,


Thank you for your patch.

Am 30.03.22 um 04:46 schrieb Ryan Lin:

The commit message summary is quite long and confusing. Maybe:

Use 3 CRTC for 3250c to get internal display working


[Why]
External displays take priority over internal display when there are 
fewer

display controllers than displays.


This causes the internal display to not work on the Chromebook google/zork.


[How]
The root cause is because of that number of the crtc is not correct.


The root cause is the incorrect number of four configured CRTCs.


The number of the crtc on the 3250c is 3, but on the 3500c is 4.
On the source code, we can see that number of the crtc has been fixed 
at 4.

Needs to set the num_crtc to 3 for 3250c platform.


Please do not wrap lines after each sentence, and use a text width of 75 
characters.



v2:
    - remove unnecessary comments and Id

Signed-off-by: Ryan Lin 

---
  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 12 +---
  1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c

index 40c91b448f7da..455a2c45e8cda 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -2738,9 +2738,15 @@ static int dm_early_init(void *handle)
  break;
  #if defined(CONFIG_DRM_AMD_DC_DCN1_0)
  case CHIP_RAVEN:
-    adev->mode_info.num_crtc = 4;
-    adev->mode_info.num_hpd = 4;
-    adev->mode_info.num_dig = 4;
+    if (adev->rev_id >= 8) {


Is there some define for that number? Maybe add a comment, that it’s for 
3250c?



Kind regards,

Paul



+    adev->mode_info.num_crtc = 3;
+    adev->mode_info.num_hpd = 3;
+    adev->mode_info.num_dig = 3;
+    } else {
+    adev->mode_info.num_crtc = 4;
+    adev->mode_info.num_hpd = 4;
+    adev->mode_info.num_dig = 4;
+    }
  break;
  #endif
  #if defined(CONFIG_DRM_AMD_DC_DCN2_0)


Re: [PATCH v2] drm/amdgpu: fix that issue that the number of the crtc of the 3250c is not correct

2022-03-30 Thread Paul Menzel

Dear Tsung-Hua,


Thank you for your patch.

Am 30.03.22 um 04:46 schrieb Ryan Lin:

The commit message summary is quite long and confusing. Maybe:

Use 3 CRTC for 3250c to get internal display working


[Why]
External displays take priority over internal display when there are fewer
display controllers than displays.


This causes the internal display to not work on the Chromebook google/zork.


[How]
The root cause is because of that number of the crtc is not correct.


The root cause is the incorrect number of four configured CRTCs.


The number of the crtc on the 3250c is 3, but on the 3500c is 4.
On the source code, we can see that number of the crtc has been fixed at 4.
Needs to set the num_crtc to 3 for 3250c platform.


Please do not wrap lines after each sentence, and use a text width of 75 
characters.



v2:
- remove unnecessary comments and Id

Signed-off-by: Ryan Lin 

---
  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 12 +---
  1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 40c91b448f7da..455a2c45e8cda 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -2738,9 +2738,15 @@ static int dm_early_init(void *handle)
break;
  #if defined(CONFIG_DRM_AMD_DC_DCN1_0)
case CHIP_RAVEN:
-   adev->mode_info.num_crtc = 4;
-   adev->mode_info.num_hpd = 4;
-   adev->mode_info.num_dig = 4;
+   if (adev->rev_id >= 8) {


Is there some define for that number? Maybe add a comment, that it’s for 
3250c?



Kind regards,

Paul



+   adev->mode_info.num_crtc = 3;
+   adev->mode_info.num_hpd = 3;
+   adev->mode_info.num_dig = 3;
+   } else {
+   adev->mode_info.num_crtc = 4;
+   adev->mode_info.num_hpd = 4;
+   adev->mode_info.num_dig = 4;
+   }
break;
  #endif
  #if defined(CONFIG_DRM_AMD_DC_DCN2_0)


Re: [PATCH] drm/amd/display: Fix by adding FPU protection for dcn30_internal_validate_bw

2022-03-29 Thread Paul Menzel

Dear Chandan,


Am 29.03.22 um 10:29 schrieb CHANDAN VURDIGERE NATARAJ:

Is it common to spell your name all uppercase? If not, please use 
Chandan nVurdigere Nataraj.



[WHY]


The [] already emphasize the word, so Why could be used.


Below general protection fault observed when WebGL Aquarium is run for
longer duration. If drm debug logs are enabled and set to 0x1f then the


In what browser and what version?


issue is observed within 10 minutes of run.


Where you able to reproduce it without drm debug logs?


[  100.717056] general protection fault, probably for non-canonical address 
0x2d33302d32323032:  [#1] PREEMPT SMP NOPTI
[  100.727921] CPU: 3 PID: 1906 Comm: DrmThread Tainted: GW 
5.15.30 #12 d726c6a2d6ebe5cf9223931cbca6892f916fe18b
[  100.754419] RIP: 0010:CalculateSwathWidth+0x1f7/0x44f
[  100.767109] Code: 00 00 00 f2 42 0f 11 04 f0 48 8b 85 88 00 00 00 f2 42 0f 10 04 
f0 48 8b 85 98 00 00 00 f2 42 0f 11 04 f0 48 8b 45 10 0f 57 c0  42 0f 2a 04 
b0 0f 57 c9 f3 43 0f 2a 0c b4 e8 8c e2 f3 ff 48 8b
[  100.781269] RSP: 0018:a9230079eeb0 EFLAGS: 00010246
[  100.812528] RAX: 2d33302d32323032 RBX: 0500 RCX: 
[  100.819656] RDX: 0001 RSI: 99deb712c49c RDI: 
[  100.826781] RBP: a9230079ef50 R08: 99deb712460c R09: 99deb712462c
[  100.833907] R10: 99deb7124940 R11: 99deb7124d70 R12: 99deb712ae44
[  100.841033] R13: 0001 R14:  R15: a9230079f0a0
[  100.848159] FS:  7af121212640() GS:99deba78() 
knlGS:
[  100.856240] CS:  0010 DS:  ES:  CR0: 80050033
[  100.861980] CR2: 209000fe1000 CR3: 00011b18c000 CR4: 00350ee0
[  100.869106] Call Trace:
[  100.871555]  
[  100.873655]  ? asm_sysvec_reschedule_ipi+0x12/0x20
[  100.878449]  CalculateSwathAndDETConfiguration+0x1a3/0x6dd
[  100.883937]  dml31_ModeSupportAndSystemConfigurationFull+0x2ce4/0x76da
[  100.890467]  ? kallsyms_lookup_buildid+0xc8/0x163
[  100.895173]  ? kallsyms_lookup_buildid+0xc8/0x163
[  100.899874]  ? __sprint_symbol+0x80/0x135
[  100.903883]  ? dm_update_plane_state+0x3f9/0x4d2
[  100.908500]  ? symbol_string+0xb7/0xde
[  100.912250]  ? number+0x145/0x29b
[  100.915566]  ? vsnprintf+0x341/0x5ff
[  100.919141]  ? desc_read_finalized_seq+0x39/0x87
[  100.923755]  ? update_load_avg+0x1b9/0x607
[  100.927849]  ? compute_mst_dsc_configs_for_state+0x7d/0xd5b
[  100.933416]  ? fetch_pipe_params+0xa4d/0xd0c
[  100.937686]  ? dc_fpu_end+0x3d/0xa8
[  100.941175]  dml_get_voltage_level+0x16b/0x180
[  100.945619]  dcn30_internal_validate_bw+0x10e/0x89b
[  100.950495]  ? dcn31_validate_bandwidth+0x68/0x1fc
[  100.955285]  ? resource_build_scaling_params+0x98b/0xb8c
[  100.960595]  ? dcn31_validate_bandwidth+0x68/0x1fc
[  100.965384]  dcn31_validate_bandwidth+0x9a/0x1fc
[  100.970001]  dc_validate_global_state+0x238/0x295
[  100.974703]  amdgpu_dm_atomic_check+0x9c1/0xbce
[  100.979235]  ? _printk+0x59/0x73
[  100.982467]  drm_atomic_check_only+0x403/0x78b
[  100.986912]  drm_mode_atomic_ioctl+0x49b/0x546
[  100.991358]  ? drm_ioctl+0x1c1/0x3b3
[  100.994936]  ? drm_atomic_set_property+0x92a/0x92a
[  100.999725]  drm_ioctl_kernel+0xdc/0x149
[  101.003648]  drm_ioctl+0x27f/0x3b3
[  101.007051]  ? drm_atomic_set_property+0x92a/0x92a
[  101.011842]  amdgpu_drm_ioctl+0x49/0x7d
[  101.015679]  __se_sys_ioctl+0x7c/0xb8
[  101.015685]  do_syscall_64+0x5f/0xb8
[  101.015690]  ? __irq_exit_rcu+0x34/0x96

[HOW]
It calles populate_dml_pipes which uses doubles to initialize.


calls

Excuse my ignorance. So using doubles causes a context switch?


Adding FPU protection avoids context switch and probable loss of vba context
as there is potential contention while drm debug logs are enabled.

Signed-off-by: CHANDAN VURDIGERE NATARAJ 

diff --git a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c 
b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c
index 826970f2bd0a..f27262417abe 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c
@@ -1750,7 +1750,9 @@ bool dcn31_validate_bandwidth(struct dc *dc,
  
  	BW_VAL_TRACE_COUNT();
  
+	DC_FP_START();

out = dcn30_internal_validate_bw(dc, context, pipes, _cnt, 
, fast_validate);
+   DC_FP_END();
  
  	// Disable fast_validate to set min dcfclk in alculate_wm_and_dlg

if (pipe_cnt == 0)



Kind regards,

Paul


Re: [PATCH] drm/amdgpu: Support AMDGPU RAS debugfs poll interface

2022-03-29 Thread Paul Menzel

Dear Yi Peng,


Thank you for the patch. Two nits regarding the commit message.

Am 29.03.22 um 09:38 schrieb yipechai:

Some AMDGPU RAS debugfs operations like UE injection
can cause gpu reset. Before doing the next debugfs
operation, the application should call poll to check
if the gpu has finished recovering.


Please use a text width of 75 characters per line for the commit message 
body.



Signed-off-by: yipechai 


Could you please configure git (and your email program) to use your full 
name?


$ git config --global user.name "Yi Peng Chai" # no idea if correct

[…]


Kind regards,

Paul


Re: [PATCH V2] drm/amdgpu: fix incorrect GCR_GENERAL_CNTL address

2022-03-29 Thread Paul Menzel

Dear Ruili,


Thank you for your patch.

Am 28.03.22 um 06:58 schrieb Ji, Ruili:

From: Ruili Ji 

gfx10.3.3/gfx10.3.6/gfx10.3.7 shall use 0x1580 address for GCR_GENERAL_CNTL


Is any “user-visible“ problem fixed by this?

Please add a Fixes tag.


Kind regards,

Paul



Signed-off-by: Ruili Ji 
---
  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 99df18ae7316..e4c9d92ac381 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -3300,7 +3300,7 @@ static const struct soc15_reg_golden 
golden_settings_gc_10_3_3[] =
SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG3, 0x, 0x0280),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG4, 0x, 0x0080),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGB_ADDR_CONFIG, 0x0c1807ff, 0x0242),
-   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL, 0x1ff1, 
0x0500),
+   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL_Vangogh, 0x1ff1, 
0x0500),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL1_PIPE_STEER, 0x00ff, 0x00e4),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_0, 0x, 
0x32103210),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_1, 0x, 
0x32103210),
@@ -3436,7 +3436,7 @@ static const struct soc15_reg_golden 
golden_settings_gc_10_3_6[] =
SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG3, 0x, 0x0280),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG4, 0x, 0x0080),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGB_ADDR_CONFIG, 0x0c1807ff, 0x0042),
-   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL, 0x1ff1, 
0x0500),
+   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL_Vangogh, 0x1ff1, 
0x0500),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL1_PIPE_STEER, 0x00ff, 0x0044),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_0, 0x, 
0x32103210),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_1, 0x, 
0x32103210),
@@ -3461,7 +3461,7 @@ static const struct soc15_reg_golden 
golden_settings_gc_10_3_7[] = {
SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG3, 0x, 0x0280),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG4, 0x, 0x0080),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGB_ADDR_CONFIG, 0x0c1807ff, 0x0041),
-   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL, 0x1ff1, 
0x0500),
+   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL_Vangogh, 0x1ff1, 
0x0500),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL1_PIPE_STEER, 0x00ff, 0x00e4),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_0, 0x, 
0x32103210),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_1, 0x, 
0x32103210),


Re: 回复: Re: [PATCH] drm/amdgpu: resolve s3 hang for r7340

2022-03-29 Thread Paul Menzel

Dear 李真,


[Your mailer formatted the message oddly. Maybe configure it to use only 
plain text email with no HTML parts – common in Linux kernel community 
–, or, if not possible, switch to something else (Mozilla Thunderbird, …).]



Am 29.03.22 um 04:54 schrieb 李真能:

[…]


*日 期:*2022-03-28 15:38
*发件人:*Paul Menzel


[…]


Am 28.03.22 um 09:36 schrieb Paul Menzel:
  > Dear Zhenneng,
  >
  >
  > Thank you for your patch.
  >
  > Am 28.03.22 um 06:05 schrieb Zhenneng Li:
  >> This is a workaround for s3 hang for r7340(amdgpu).
  >
  > Is it hanging when resuming from S3?

Yes, this func is a delayed work after init graphics card.


Thank for clarifying it.


  > Maybe also use the line below for
  > the commit message summary:
  >
  > drm/amdgpu: Add 1 ms delay to init handler to fix s3 resume hang
  >
  > Also, please add a space before the ( in “r7340(amdgpu)”.
  >
  >> When we test s3 with r7340 on arm64 platform, graphics card will hang up,
  >> the error message are as follows:
  >> Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.599374][ 7] [  
T291] amdgpu :02:00.0: fb0: amdgpudrmfb frame buffer device
  >> Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.612869][ 7] [  
T291] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP blockfailed 
-22
  >> Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.623392][ 7] [  
T291] amdgpu :02:00.0: amdgpu_device_ip_late_init failed
  >> Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.630696][ 7] [  
T291] amdgpu :02:00.0: Fatal error during GPU init
  >> Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.637477][ 7] [  
T291] [drm] amdgpu: finishing device.
  >
  > The prefix in the beginning is not really needed. Only the stuff after
  > `kernel: `.
  >
  > Maybe also add the output of `lspci -nn -s …` for that r7340 device.
  >
  >> Change-Id: I5048b3894c0ca9faf2f4847ddab61f9eb17b4823
  >
  > Without the Gerrit instance this belongs to, the Change-Id is of no use
  > in the public.
  >
  >> Signed-off-by: Zhenneng Li
  >> ---
  >>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++
  >>   1 file changed, 2 insertions(+)
  >>
  >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
  >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
  >> index 3987ecb24ef4..1eced991b5b2 100644
  >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
  >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
  >> @@ -2903,6 +2903,8 @@ static void
  >> amdgpu_device_delayed_init_work_handler(struct work_struct *work)
  >>   container_of(work, struct amdgpu_device, delayed_init_work.work);
  >>   int r;
  >> +mdelay(1);
  >> +
  >

  > Wow, I wonder how long it took you to find that workaround.

About 3 months, I try to add this delay
work(amdgpu_device_delayed_init_work_handler) from 2000ms to 2500ms, or use mb()
instead of mdelay(1), but it's useless, I don't know the reason,the occurrence
probability  of this bug is one ten-thousandth, do you know the possible 
reasons?


Oh, it’s not even always reproducible. That is hard. Did you try another 
graphics card or another ARM board to rule out hardware specific issues?


Sorry, I do not. Maybe the developers with access to non-public 
datasheets and erratas know.



  >>   r = amdgpu_ib_ring_tests(adev);
  >>   if (r)
  >>   DRM_ERROR("ib ring test failed (%d).\n", r);



Kind regards,

Paul


Re: [PATCH] amd/display: set backlight only if required

2022-03-28 Thread Paul Menzel

Dear Shirish,


Thank you for the patch.


Am 11.03.22 um 16:33 schrieb Shirish S:

[Why]
comparing pwm bl values (coverted) with user brightness(converted)


1.  co*n*verted
2.  Please add a space before the (.


levels in commit_tail leads to continuous setting of backlight via dmub
as they don't to match.


Maybe add a blank line between paragraphs.


This leads overdrive in queuing of commands to DMCU that sometimes lead


What is “overdrive in queuing”?

lead*s*


to depending on load on DMCU fw:


Leads to what? Words missing after *to*.


"[drm:dc_dmub_srv_wait_idle] *ERROR* Error waiting for DMUB idle: status=3"


You could also add the example from your discussion with Harry.


[How]
Store last successfully set backlight value and compare with it instead
of pwm reads which is not what we should compare with.


Maybe extend it with the information gotten from Harry, that this is 
expected, when ABM is enabled.



Kind regards,

Paul



Signed-off-by: Shirish S 
---
  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 7 ---
  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h | 6 ++
  2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index df0980ff9a63..2b8337e47861 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -3972,7 +3972,7 @@ static u32 convert_brightness_to_user(const struct 
amdgpu_dm_backlight_caps *cap
 max - min);
  }
  
-static int amdgpu_dm_backlight_set_level(struct amdgpu_display_manager *dm,

+static void amdgpu_dm_backlight_set_level(struct amdgpu_display_manager *dm,
 int bl_idx,
 u32 user_brightness)
  {
@@ -4003,7 +4003,8 @@ static int amdgpu_dm_backlight_set_level(struct 
amdgpu_display_manager *dm,
DRM_DEBUG("DM: Failed to update backlight on 
eDP[%d]\n", bl_idx);
}
  
-	return rc ? 0 : 1;

+   if (rc)
+   dm->actual_brightness[bl_idx] = user_brightness;
  }
  
  static int amdgpu_dm_backlight_update_status(struct backlight_device *bd)

@@ -9944,7 +9945,7 @@ static void amdgpu_dm_atomic_commit_tail(struct 
drm_atomic_state *state)
/* restore the backlight level */
for (i = 0; i < dm->num_of_edps; i++) {
if (dm->backlight_dev[i] &&
-   (amdgpu_dm_backlight_get_level(dm, i) != dm->brightness[i]))
+   (dm->actual_brightness[i] != dm->brightness[i]))
amdgpu_dm_backlight_set_level(dm, i, dm->brightness[i]);
}
  #endif
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
index 372f9adf091a..321279bc877b 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
@@ -540,6 +540,12 @@ struct amdgpu_display_manager {
 * cached backlight values.
 */
u32 brightness[AMDGPU_DM_MAX_NUM_EDP];
+   /**
+* @actual_brightness:
+*
+* last successfully applied backlight values.
+*/
+   u32 actual_brightness[AMDGPU_DM_MAX_NUM_EDP];
  };
  
  enum dsc_clock_force_state {


Re: BUG: Bad page state in process systemd-udevd (was: [PATCH v9 bpf-next 1/9] x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP)

2022-03-28 Thread Paul Menzel

Dear Song,


Am 28.03.22 um 21:24 schrieb Song Liu:


On Mar 27, 2022, at 11:51 PM, Paul Menzel  wrote:



Am 28.03.22 um 08:37 schrieb Song Liu:


[…]


On Mar 27, 2022, at 3:36 AM, Paul Menzel  wrote:



Am 26.03.22 um 19:46 schrieb Paul Menzel:

#regzbot introduced: fac54e2bfb5be2b0bbf115fe80d45f59fd773048
#regzbot title: BUG: Bad page state in process systemd-udevd



Am 04.02.22 um 19:57 schrieb Song Liu:

From: Song Liu 

This enables module_alloc() to allocate huge page for 2MB+ requests.
To check the difference of this change, we need enable config
CONFIG_PTDUMP_DEBUGFS, and call module_alloc(2MB). Before the change,
/sys/kernel/debug/page_tables/kernel shows pte for this map. With the
change, /sys/kernel/debug/page_tables/ show pmd for thie map.

Signed-off-by: Song Liu 
---
   arch/x86/Kconfig | 1 +
   1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6fddb63271d9..e0e0d00cf103 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -159,6 +159,7 @@ config X86
   select HAVE_ALIGNED_STRUCT_PAGEif SLUB
   select HAVE_ARCH_AUDITSYSCALL
   select HAVE_ARCH_HUGE_VMAPif X86_64 || X86_PAE
+select HAVE_ARCH_HUGE_VMALLOCif HAVE_ARCH_HUGE_VMAP
   select HAVE_ARCH_JUMP_LABEL
   select HAVE_ARCH_JUMP_LABEL_RELATIVE
   select HAVE_ARCH_KASANif X86_64

Testing Linus’ current master branch, Linux logs critical messages like below:
 BUG: Bad page state in process systemd-udevd  pfn:102e03
I bisected to your commit fac54e2bfb5 (x86/Kconfig: select
HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP).

Sorry, I forget to mention, that this is a 32-bit (i686) userspace,
but a 64-bit Linux kernel, so it might be the same issue as
mentioned in commit eed1fcee556f (x86: Disable
HAVE_ARCH_HUGE_VMALLOC on 32-bit x86), but didn’t fix the issue for
64-bit Linux kernel and 32-bit userspace.

I will look more into this tomorrow. To clarify, what is the 32-bit
user space that triggers this? Is it systemd-udevd? Is the systemd
also i686?


Yes, everything – also systemd – is i686. You can build a 32-bit VM image with 
grml-debootstrap [1]:

sudo DEBOOTSTRAP=mmdebstrap ~/src/grml-debootstrap/grml-debootstrap --vm 
--vmfile --vmsize 3G --target /dev/shm/debian-32.img -r sid --arch i686 
--filesystem ext4

Then run that with QEMU, but pass the 64-bit Linux kernel to QEMU directly with 
the switches `-kernel` and `-append`, or install the amd64 Linux kernel into 
the Debian VM image or the package created with `make bindeb-pkg` with `dpkg -i 
…`.


Thanks for these information!

I tried the following, but couldn't reproduce the issue.

sudo ./grml-debootstrap --vm --vmfile --vmsize 3G --target ../debian-32.img -r 
sid --arch i386 --filesystem ext4

Note: s/i686/i386/. Also I run this on Fedora, so I didn't specify DEBOOTSTRAP.

Then I run it with

qemu-system-x86_64 \
   -boot d ./debian-32.img -m 1024 -smp 4 \
   -kernel ./bzImage \
   -nographic -append 'root=/dev/sda1 ro console=ttyS0,115200'

The VM boots fine. The config being used is x86_64_defconfig +
CONFIG_DRM_FBDEV_EMULATION.

I wonder whether this is caused by different config or different image.
Could you please share your config?


Sorry, for leading you on the wrong path. I actually just wanted to help 
getting a 32-bit userspace set up quickly. I haven’t tried reproducing 
the issue in a VM, and used only the ASUS F2A85-M PRO.


Booting the system with `nomodeset`, I didn’t see the error. No idea if 
it’s related to framebuffer handling or specific to AMD graphics device.



PS: I couldn't figure out the root password of the image, --password
option of grml-debootstrap doesn't seem to work.


Hmm, I thought it’s asking you during install, but I haven’t done it in 
a while.



Kind regards,

Paul


Re: [PATCH v4 5/6] drm/amdgpu/vcn: VCN ras error query support

2022-03-28 Thread Paul Menzel

Dear Mohammad,


Am 28.03.22 um 11:49 schrieb Ziya, Mohammad zafar:


-Original Message-
From: Paul Menzel 
Sent: Monday, March 28, 2022 3:08 PM



Am 28.03.22 um 10:47 schrieb Ziya, Mohammad zafar:

[…]


-Original Message-
From: Paul Menzel 
Sent: Monday, March 28, 2022 1:39 PM



Am 28.03.22 um 10:00 schrieb Ziya, Mohammad zafar:

[…]


From: Paul Menzel 
Sent: Monday, March 28, 2022 1:22 PM



Am 28.03.22 um 09:43 schrieb Zhou1, Tao:

-Original Message-
From: Ziya, Mohammad zafar 
Sent: Monday, March 28, 2022 2:25 PM


[…]


+static uint32_t vcn_v2_6_query_poison_by_instance(struct amdgpu_device *adev,
+   uint32_t instance, uint32_t sub_block) {
+   uint32_t poison_stat = 0, reg_value = 0;
+
+   switch (sub_block) {
+   case AMDGPU_VCN_V2_6_VCPU_VCODEC:
+   reg_value = RREG32_SOC15(VCN, instance, 
mmUVD_RAS_VCPU_VCODEC_STATUS);
+   poison_stat = REG_GET_FIELD(reg_value, 
UVD_RAS_VCPU_VCODEC_STATUS, POISONED_PF);
+   break;
+   default:
+   break;
+   };
+
+   if (poison_stat)
+   dev_info(adev->dev, "Poison detected in VCN%d, sub_block%d\n",
+   instance, sub_block);


What should a user do with that information? Faulty hardware, …?


[Mohammad]: This message will help to identify the faulty hardware,
the hardware ID will also log along with poison, help to identify
among multiple hardware installed on the system.


Thank you for clarifying. If it’s indeed faulty hardware, should the
log level be increased to be an error? Keep in mind, that normal
ignorant users (like me) are reading the message, and it’d be great
to guide them a little. They do not know what “Poison“ means I guess. Maybe:

A hardware corruption was found indicating the device might be faulty.
(Poison detected in VCN%d, sub_block%d)\n

(Keep in mind, I do not know anything about RAS.)


[Mohammad]: It is an error condition, but this is just an information
message which could have been ignored as well because VCN just
consumed the poison, not created.


Sorry, I have never seen these message in `dmesg`, so could you give an
example log please, what the user would see?



[Mohammad]: [  231.181316] amdgpu :8a:00.0: amdgpu: Poison detected in 
VCN0, sub_block0

Sample message from amdgpu " [  237.013029] amdgpu :8a:00.0: amdgpu: HDCP: 
optional hdcp ta ucode is not available "


Hmm, that is six seconds later, so, if Linux logs other stuff in 
between, no idea if the connection will be made.


Both messages read like debug message, with normal users not having a 
clue what to do. Can that be improved by rewording them?



Kind regards,

Paul


Re: [PATCH v4 5/6] drm/amdgpu/vcn: VCN ras error query support

2022-03-28 Thread Paul Menzel



Dear Mohammad,


Am 28.03.22 um 10:47 schrieb Ziya, Mohammad zafar:

[…]


-Original Message-
From: Paul Menzel 
Sent: Monday, March 28, 2022 1:39 PM



Am 28.03.22 um 10:00 schrieb Ziya, Mohammad zafar:

[…]


From: Paul Menzel 
Sent: Monday, March 28, 2022 1:22 PM



Am 28.03.22 um 09:43 schrieb Zhou1, Tao:

-Original Message-
From: Ziya, Mohammad zafar 
Sent: Monday, March 28, 2022 2:25 PM


[…]


+static uint32_t vcn_v2_6_query_poison_by_instance(struct amdgpu_device *adev,
+   uint32_t instance, uint32_t sub_block) {
+   uint32_t poison_stat = 0, reg_value = 0;
+
+   switch (sub_block) {
+   case AMDGPU_VCN_V2_6_VCPU_VCODEC:
+   reg_value = RREG32_SOC15(VCN, instance, 
mmUVD_RAS_VCPU_VCODEC_STATUS);
+   poison_stat = REG_GET_FIELD(reg_value, 
UVD_RAS_VCPU_VCODEC_STATUS, POISONED_PF);
+   break;
+   default:
+   break;
+   };
+
+   if (poison_stat)
+   dev_info(adev->dev, "Poison detected in VCN%d, sub_block%d\n",
+   instance, sub_block);


What should a user do with that information? Faulty hardware, …?


[Mohammad]: This message will help to identify the faulty hardware,
the hardware ID will also log along with poison, help to identify
among multiple hardware installed on the system.


Thank you for clarifying. If it’s indeed faulty hardware, should the log level 
be
increased to be an error? Keep in mind, that normal ignorant users (like me)
are reading the message, and it’d be great to guide them a little. They do not
know what “Poison“ means I guess. Maybe:

A hardware corruption was found indicating the device might be faulty.
(Poison detected in VCN%d, sub_block%d)\n

(Keep in mind, I do not know anything about RAS.)


[Mohammad]: It is an error condition, but this is just an information
message which could have been ignored as well because VCN just
consumed the poison, not created.


Sorry, I have never seen these message in `dmesg`, so could you give an 
example log please, what the user would see?



Kind regards,

Paul


Re: [PATCH v4 5/6] drm/amdgpu/vcn: VCN ras error query support

2022-03-28 Thread Paul Menzel

Dear Mohammad,


Am 28.03.22 um 10:00 schrieb Ziya, Mohammad zafar:

[…]


From: Paul Menzel 
Sent: Monday, March 28, 2022 1:22 PM


[…]


Tao, it’s hard to find your reply in the quote, as the message is not quoted
correctly (> prepended). Is it possible to use a different email client?


Am 28.03.22 um 09:43 schrieb Zhou1, Tao:

-Original Message-
From: Ziya, Mohammad zafar 
Sent: Monday, March 28, 2022 2:25 PM


[…]


+static uint32_t vcn_v2_6_query_poison_by_instance(struct amdgpu_device *adev,
+   uint32_t instance, uint32_t sub_block) {
+   uint32_t poison_stat = 0, reg_value = 0;
+
+   switch (sub_block) {
+   case AMDGPU_VCN_V2_6_VCPU_VCODEC:
+   reg_value = RREG32_SOC15(VCN, instance, 
mmUVD_RAS_VCPU_VCODEC_STATUS);
+   poison_stat = REG_GET_FIELD(reg_value, 
UVD_RAS_VCPU_VCODEC_STATUS, POISONED_PF);
+   break;
+   default:
+   break;
+   };
+
+   if (poison_stat)
+   dev_info(adev->dev, "Poison detected in VCN%d, sub_block%d\n",
+   instance, sub_block);


What should a user do with that information? Faulty hardware, …?


[Mohammad]: This message will help to identify the faulty hardware,
the hardware ID will also log along with poison, help to identify
among multiple hardware installed on the system.


Thank you for clarifying. If it’s indeed faulty hardware, should the log 
level be increased to be an error? Keep in mind, that normal ignorant 
users (like me) are reading the message, and it’d be great to guide them 
a little. They do not know what “Poison“ means I guess. Maybe:


A hardware corruption was found indicating the device might be faulty. 
(Poison detected in VCN%d, sub_block%d)\n


(Keep in mind, I do not know anything about RAS.)


+
+   return poison_stat;
+}
+
+static bool vcn_v2_6_query_poison_status(struct amdgpu_device *adev) {
+   uint32_t inst, sub;
+   uint32_t poison_stat = 0;
+
+   for (inst = 0; inst < adev->vcn.num_vcn_inst; inst++)
+   for (sub = 0; sub < AMDGPU_VCN_V2_6_MAX_SUB_BLOCK; sub++)
+   poison_stat +=
+   vcn_v2_6_query_poison_by_instance(adev, inst, sub);
+
+   return poison_stat ? true : false;

[Tao] just want to confirm the logic, if the POISONED_PF of one instance is 1

and another is 0, the poison_stat is true, correct?

Can we add a print message for this case? Same question to JPEG.


[Mohammad]: Message will be printed on function block ahead of the function.


Is the `dev_info` message in `vcn_v2_6_query_poison_by_instance()` doing
what you want?

Also, instead of `poison_stat ? true : false;` a common pattern is
`!!poison_stat` I believe.


[…]


[Mohammad]: Noted the change. Will make to return !!poison_stat ? true : false;


No, just

return !!poison_stat;

[…]


Kind regards,

Paul


Re: [PATCH v4 5/6] drm/amdgpu/vcn: VCN ras error query support

2022-03-28 Thread Paul Menzel

Dear Mohammad, dear Tao,


Tao, it’s hard to find your reply in the quote, as the message is not 
quoted correctly (> prepended). Is it possible to use a different email 
client?



Am 28.03.22 um 09:43 schrieb Zhou1, Tao:

-Original Message-
From: Ziya, Mohammad zafar 
Sent: Monday, March 28, 2022 2:25 PM
To: amd-gfx@lists.freedesktop.org
Cc: Lazar, Lijo ; Zhou1, Tao ; Zhang, Hawking 
; Ziya, Mohammad zafar 
Subject: [PATCH v4 5/6] drm/amdgpu/vcn: VCN ras error query support

RAS error query support addition for VCN 2.6

V2: removed unused option and corrected comment format Moved the register 
definition under header file


Please wrap lines after 75 characters. (`scripts/checkpatch.pl` should 
have warned you about that).



V3: poison query status check added.
Removed error query interface

V4: MMSCH poison check option removed, return true/false refactored.

Signed-off-by: Mohammad Zafar Ziya 
Reviewed-by: Hawking Zhang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  1 +
  drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c   | 71 +
  drivers/gpu/drm/amd/amdgpu/vcn_v2_5.h   |  6 +++
  3 files changed, 78 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index 1e1a3b736859..606df8869b89 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -508,6 +508,7 @@ struct amdgpu_ras_block_hw_ops {
void (*query_ras_error_address)(struct amdgpu_device *adev, void 
*ras_error_status);
void (*reset_ras_error_count)(struct amdgpu_device *adev);
void (*reset_ras_error_status)(struct amdgpu_device *adev);
+   bool (*query_poison_status)(struct amdgpu_device *adev);
  };
  
  /* work flow

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
index 1869bae4104b..3988fc647741 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
@@ -31,6 +31,7 @@
  #include "soc15d.h"
  #include "vcn_v2_0.h"
  #include "mmsch_v1_0.h"
+#include "vcn_v2_5.h"
  
  #include "vcn/vcn_2_5_offset.h"

  #include "vcn/vcn_2_5_sh_mask.h"
@@ -59,6 +60,7 @@ static int vcn_v2_5_set_powergating_state(void *handle,  
static int vcn_v2_5_pause_dpg_mode(struct amdgpu_device *adev,
int inst_idx, struct dpg_pause_state 
*new_state);  static int vcn_v2_5_sriov_start(struct amdgpu_device *adev);
+static void vcn_v2_5_set_ras_funcs(struct amdgpu_device *adev);
  
  static int amdgpu_ih_clientid_vcns[] = {

SOC15_IH_CLIENTID_VCN,
@@ -100,6 +102,7 @@ static int vcn_v2_5_early_init(void *handle)
vcn_v2_5_set_dec_ring_funcs(adev);
vcn_v2_5_set_enc_ring_funcs(adev);
vcn_v2_5_set_irq_funcs(adev);
+   vcn_v2_5_set_ras_funcs(adev);
  
  	return 0;

  }
@@ -1930,3 +1933,71 @@ const struct amdgpu_ip_block_version vcn_v2_6_ip_block =
.rev = 0,
.funcs = _v2_6_ip_funcs,
  };
+
+static uint32_t vcn_v2_6_query_poison_by_instance(struct amdgpu_device *adev,
+   uint32_t instance, uint32_t sub_block) {
+   uint32_t poison_stat = 0, reg_value = 0;
+
+   switch (sub_block) {
+   case AMDGPU_VCN_V2_6_VCPU_VCODEC:
+   reg_value = RREG32_SOC15(VCN, instance, 
mmUVD_RAS_VCPU_VCODEC_STATUS);
+   poison_stat = REG_GET_FIELD(reg_value, 
UVD_RAS_VCPU_VCODEC_STATUS, POISONED_PF);
+   break;
+   default:
+   break;
+   };
+
+   if (poison_stat)
+   dev_info(adev->dev, "Poison detected in VCN%d, sub_block%d\n",
+   instance, sub_block);


What should a user do with that information? Faulty hardware, …?


+
+   return poison_stat;
+}
+
+static bool vcn_v2_6_query_poison_status(struct amdgpu_device *adev) {
+   uint32_t inst, sub;
+   uint32_t poison_stat = 0;
+
+   for (inst = 0; inst < adev->vcn.num_vcn_inst; inst++)
+   for (sub = 0; sub < AMDGPU_VCN_V2_6_MAX_SUB_BLOCK; sub++)
+   poison_stat +=
+   vcn_v2_6_query_poison_by_instance(adev, inst, sub);
+
+   return poison_stat ? true : false;

[Tao] just want to confirm the logic, if the POISONED_PF of one instance is 1 
and another is 0, the poison_stat is true, correct?
Can we add a print message for this case? Same question to JPEG.


Is the `dev_info` message in `vcn_v2_6_query_poison_by_instance()` doing 
what you want?


Also, instead of `poison_stat ? true : false;` a common pattern is 
`!!poison_stat` I believe.



Kind regards,

Paul



+}
+
+const struct amdgpu_ras_block_hw_ops vcn_v2_6_ras_hw_ops = {
+   .query_poison_status = vcn_v2_6_query_poison_status, };
+
+static struct amdgpu_vcn_ras vcn_v2_6_ras = {
+   .ras_block = {
+   .hw_ops = _v2_6_ras_hw_ops,
+   },
+};
+
+static void vcn_v2_5_set_ras_funcs(struct amdgpu_device *adev) {
+   switch 

Re: [PATCH] drm/amdgpu: resolve s3 hang for r7340

2022-03-28 Thread Paul Menzel

[Cc: -Jack Zhang (invalid address)

Am 28.03.22 um 09:36 schrieb Paul Menzel:

Dear Zhenneng,


Thank you for your patch.

Am 28.03.22 um 06:05 schrieb Zhenneng Li:

This is a workaround for s3 hang for r7340(amdgpu).


Is it hanging when resuming from S3? Maybe also use the line below for 
the commit message summary:


drm/amdgpu: Add 1 ms delay to init handler to fix s3 resume hang

Also, please add a space before the ( in “r7340(amdgpu)”.


When we test s3 with r7340 on arm64 platform, graphics card will hang up,
the error message are as follows:
Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [    1.599374][ 7] [  T291] 
amdgpu :02:00.0: fb0: amdgpudrmfb frame buffer device
Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [    1.612869][ 7] [  T291] 
[drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block 
 failed -22
Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [    1.623392][ 7] [  T291] 
amdgpu :02:00.0: amdgpu_device_ip_late_init failed
Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [    1.630696][ 7] [  T291] 
amdgpu :02:00.0: Fatal error during GPU init
Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [    1.637477][ 7] [  T291] 
[drm] amdgpu: finishing device.


The prefix in the beginning is not really needed. Only the stuff after 
`kernel: `.


Maybe also add the output of `lspci -nn -s …` for that r7340 device.


Change-Id: I5048b3894c0ca9faf2f4847ddab61f9eb17b4823


Without the Gerrit instance this belongs to, the Change-Id is of no use 
in the public.



Signed-off-by: Zhenneng Li 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index 3987ecb24ef4..1eced991b5b2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2903,6 +2903,8 @@ static void 
amdgpu_device_delayed_init_work_handler(struct work_struct *work)

  container_of(work, struct amdgpu_device, delayed_init_work.work);
  int r;
+    mdelay(1);
+


Wow, I wonder how long it took you to find that workaround.


  r = amdgpu_ib_ring_tests(adev);
  if (r)
  DRM_ERROR("ib ring test failed (%d).\n", r);



Kind regards,

Paul


Re: [PATCH] drm/amdgpu: resolve s3 hang for r7340

2022-03-28 Thread Paul Menzel

Dear Zhenneng,


Thank you for your patch.

Am 28.03.22 um 06:05 schrieb Zhenneng Li:

This is a workaround for s3 hang for r7340(amdgpu).


Is it hanging when resuming from S3? Maybe also use the line below for 
the commit message summary:


drm/amdgpu: Add 1 ms delay to init handler to fix s3 resume hang

Also, please add a space before the ( in “r7340(amdgpu)”.


When we test s3 with r7340 on arm64 platform, graphics card will hang up,
the error message are as follows:
Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.599374][ 7] [  T291] 
amdgpu :02:00.0: fb0: amdgpudrmfb frame buffer device
Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.612869][ 7] [  T291] 
[drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block 
 failed -22
Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.623392][ 7] [  T291] 
amdgpu :02:00.0: amdgpu_device_ip_late_init failed
Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.630696][ 7] [  T291] 
amdgpu :02:00.0: Fatal error during GPU init
Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.637477][ 7] [  T291] 
[drm] amdgpu: finishing device.


The prefix in the beginning is not really needed. Only the stuff after 
`kernel: `.


Maybe also add the output of `lspci -nn -s …` for that r7340 device.


Change-Id: I5048b3894c0ca9faf2f4847ddab61f9eb17b4823


Without the Gerrit instance this belongs to, the Change-Id is of no use 
in the public.



Signed-off-by: Zhenneng Li 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3987ecb24ef4..1eced991b5b2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2903,6 +2903,8 @@ static void 
amdgpu_device_delayed_init_work_handler(struct work_struct *work)
container_of(work, struct amdgpu_device, 
delayed_init_work.work);
int r;
  
+	mdelay(1);

+


Wow, I wonder how long it took you to find that workaround.


r = amdgpu_ib_ring_tests(adev);
if (r)
DRM_ERROR("ib ring test failed (%d).\n", r);



Kind regards,

Paul


Re: [PATCH v4 2/6] drm/amdgpu/vcn: Add vcn ras support

2022-03-28 Thread Paul Menzel

Dear Mohammad,


Thank you for your patch.

Am 28.03.22 um 08:24 schrieb Mohammad Zafar Ziya:

VCN block ras feature support addition

V2: default ras callback removed

Signed-off-by: Mohammad Zafar Ziya 
Reviewed-by: Hawking Zhang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h | 10 ++
  1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
index e2fde88aaf5e..ea07974ef6f4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
@@ -24,6 +24,8 @@
  #ifndef __AMDGPU_VCN_H__
  #define __AMDGPU_VCN_H__
  
+#include "amdgpu_ras.h"

+
  #define AMDGPU_VCN_STACK_SIZE (128*1024)
  #define AMDGPU_VCN_CONTEXT_SIZE   (512*1024)
  
@@ -232,6 +234,10 @@ struct amdgpu_vcn_inst {

struct amdgpu_vcn_fw_shared fw_shared;
  };
  
+struct amdgpu_vcn_ras {

+   struct amdgpu_ras_block_object ras_block;
+};
+
  struct amdgpu_vcn {
unsignedfw_version;
struct delayed_work idle_work;
@@ -251,6 +257,9 @@ struct amdgpu_vcn {
unsignedharvest_config;
int (*pause_dpg_mode)(struct amdgpu_device *adev,
int inst_idx, struct dpg_pause_state *new_state);
+
+   struct ras_common_if*ras_if;
+   struct amdgpu_vcn_ras   *ras;
  };
  
  struct amdgpu_fw_shared_rb_ptrs_struct {

@@ -339,4 +348,5 @@ void amdgpu_vcn_setup_ucode(struct amdgpu_device *adev);
  void amdgpu_vcn_fwlog_init(struct amdgpu_vcn_inst *vcn);
  void amdgpu_debugfs_vcn_fwlog_init(struct amdgpu_device *adev,
 uint8_t i, struct amdgpu_vcn_inst *vcn);
+
  #endif


This hunk looks unrelated. Maybe remove it?


Kind regards,

Paul


Re: [PATCH v4 0/6] VCN and JPEG RAS poison detection

2022-03-28 Thread Paul Menzel

Dear Mahommad,


Am 28.03.22 um 08:24 schrieb Mohammad Zafar Ziya:

VCN and JPEG RAS poison detection


It’d be great if you extended this a little bit. Especially, how it can 
be tested.



Mohammad Zafar Ziya (6):
   drm/amdgpu: Add vcn and jpeg ras support flag
   drm/amdgpu/vcn: Add vcn ras support
   drm/amdgpu/jpeg: Add jpeg block ras support
   drm/amdgpu/vcn: vcn and jpeg ver 2.6 ras register definition
   drm/amdgpu/vcn: VCN ras error query support
   drm/amdgpu/jpeg: jpeg ras error query support


It’d be great if you made the last three commit message summaries also 
statements (by adding a verb in imperative mood).



Kind regards,

Paul


Re: [PATCH 01/13] drm/amd/display: HDCP SEND AKI INIT error

2022-03-27 Thread Paul Menzel

Dear Alex,


Am 25.03.22 um 21:43 schrieb Hung, Alex:


Thanks for your feedback.


Thank you for your reply. For the future, it’d be great if you could use
interleaved style [1], when replying, or when it’s a new topic, do not 
cite the previous message.



I fixed many errors and typos you highlighted in this series. In
cases where modification requires re-testing we or anyone can have
follow-up patches in the future.


Thank you. Did you resubmit them in *[PATCH 00/16] DC Patches March 25, 
2022*? Unfortunately, they are not tagged as v2 and the emails do not 
contain an information what was changed between v1 and v2. Can you 
please send them again with the correct tags, and information included?



Kind regards,

Paul


[1]: https://en.wikipedia.org/wiki/Posting_style


Re: Commit messages

2022-03-25 Thread Paul Menzel

Dear Christian, dear Daniel, dear Alex,


Am 23.03.22 um 16:32 schrieb Christian König:

Am 23.03.22 um 16:24 schrieb Daniel Stone:

On Wed, 23 Mar 2022 at 15:14, Alex Deucher  wrote:
On Wed, Mar 23, 2022 at 11:04 AM Daniel Stone  
wrote:

That's not what anyone's saying here ...

No-one's demanding AMD publish RTL, or internal design docs, or
hardware specs, or URLs to JIRA tickets no-one can access.

This is a large and invasive commit with pretty big ramifications;
containing exactly two lines of commit message, one of which just
duplicates the subject.

It cannot be the case that it's completely impossible to provide any
justification, background, or details, about this commit being made.
Unless, of course, it's to fix a non-public security issue, that is
reasonable justification for eliding some of the details. But then
again, 'huge change which is very deliberately opaque' is a really
good way to draw a lot of attention to the commit, and it would be
better to provide more detail about the change to help it slip under
the radar.

If dri-devel@ isn't allowed to inquire about patches which are posted,
then CCing the list is just a façade; might as well just do it all
internally and periodically dump out pull requests.

I think we are in agreement. I think the withheld information
Christian was referring to was on another thread with Christian and
Paul discussing a workaround for a hardware bug:
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.spinics.net%2Flists%2Famd-gfx%2Fmsg75908.htmldata=04%7C01%7Cchristian.koenig%40amd.com%7C6a3f2815d83b4872577008da0ce1347a%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637836458652370599%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=QtNB0XHMhTgH%2FNHMwF23Qn%2BgSdYyHJSenbpP%2FHG%2BkxE%3Dreserved=0 


(Thank you Microsoft for keeping us safe.)

I guess it proves, how assuming what other people should know/have read, 
especially when crossing message threads, is causing confusion and 
misunderstandings.



Right, that definitely seems like some crossed wires. I don't see
anything wrong with that commit at all: the commit message and a
comment notes that there is a hardware issue preventing Raven from
being able to do TMZ+GTT, and the code does the very straightforward
and obvious thing to ensure that on VCN 1.0, any TMZ buffer must be
VRAM-placed.


My questions were:

Where is that documented, and how can this be reproduced? 


Shouldn’t these be answered by the commit message? In five(?) years, 
somebody, maybe even with access to the currently non-public documents, 
finds a fault in the commit, and would be helped by having an 
document/errata number where to look at. To verify the fix, the 
developer would need a method to produce the error, so why not just 
share it?


Also, I assume that workarounds often come with downsides, as otherwise 
it would have been programmed like this from the beginning, or instead 
of “workaround” it would be called “improvement”. Shouldn’t that also be 
answered?


So totally made-up example:

Currently, there is a graphics corruption running X on system Y. This is 
caused by a hardware bug in Raven ASIC (details internal document 
#/AMD-Jira #N), and can be worked around by [what is in the commit 
message].


The workaround does not affect the performance, and testing X shows the 
error is fixed.



This one, on the other hand, is much less clear ...


Yes, completely agree. I mean a good bunch of comments on commit 
messages are certainly valid and we could improve them.


That’d be great.

But this patch here was worked on by both AMD and Intel developers. 
Where both sides and I think even people from other companies perfectly 
understands why, what, how etc...


When now somebody comes along and asks for a whole explanation of the 
context why we do it then that sounds really strange to me.


The motivation should be part of the commit message. I didn’t mean 
anyone to rewrite buddy memory allocator Wikipedia article [1]. But the 
commit message at hand for switching the allocator is definitely too terse.



Kind regards,

Paul


[1]: https://en.wikipedia.org/wiki/Buddy_memory_allocation


Re: [PATCH V2] drm/amdgpu/vcn3: send smu interface type

2022-03-25 Thread Paul Menzel

Dear Boyuan,


Am 25.03.22 um 06:53 schrieb Paul Menzel:

Dear Boyuan,


Thank for the improved patch.


Am 24.03.22 um 18:25 schrieb Zhang, Boyuan:

[AMD Official Use Only]


No idea if this would confuse `git am`.


From: Boyuan Zhang mailto:boyuan.zh...@amd.com>>


Your mailer(?) mangled the patch. Did you edit it in your MUA’s compose 
window?


For VCN FW to detect ASIC type, in order to use different mailbox 
registers.


V2: simplify codes and fix format issue.

Signed-off-by: Boyuan Zhang 
mailto:boyuan.zh...@amd.com>>

Acked-by Huang Rui 
---
drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h | 7 +++
drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c   | 5 +
2 files changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h

index e2fde88aaf5e..f06fb7f882e2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
@@ -159,6 +159,7 @@
#define AMDGPU_VCN_MULTI_QUEUE_FLAG   (1 << 8)
#define AMDGPU_VCN_SW_RING_FLAG  (1 << 9)
#define AMDGPU_VCN_FW_LOGGING_FLAG (1 << 10)
+#define AMDGPU_VCN_SMU_VERSION_INFO_FLAG (1 << 11)
  #define AMDGPU_VCN_IB_FLAG_DECODE_BUFFER    0x0001
#define AMDGPU_VCN_CMD_FLAG_MSG_BUFFER 0x0001
@@ -279,6 +280,11 @@ struct amdgpu_fw_shared_fw_logging {
    uint32_t size;
};
+struct amdgpu_fw_shared_smu_interface_info {
+ uint8_t smu_interface_type;
+ uint8_t padding[3];
+};
+
struct amdgpu_fw_shared {
    uint32_t present_flag_0;
    uint8_t pad[44];
@@ -287,6 +293,7 @@ struct amdgpu_fw_shared {
    struct amdgpu_fw_shared_multi_queue multi_queue;
    struct amdgpu_fw_shared_sw_ring sw_ring;
    struct amdgpu_fw_shared_fw_logging fw_log;
+ struct amdgpu_fw_shared_smu_interface_info 
smu_interface_info;

};
  struct amdgpu_vcn_fwlog {
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c

index b16c56aa2d22..9925b2bc63b9 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
@@ -219,6 +219,11 @@ static int vcn_v3_0_sw_init(void *handle)
 
cpu_to_le32(AMDGPU_VCN_MULTI_QUEUE_FLAG) |
 
cpu_to_le32(AMDGPU_VCN_FW_SHARED_FLAG_0_RB);
    fw_shared->sw_ring.is_enabled = 
cpu_to_le32(DEC_SW_RING_ENABLED);
+ fw_shared->present_flag_0 |= 
AMDGPU_VCN_SMU_VERSION_INFO_FLAG;
+ if (adev->ip_versions[UVD_HWIP][0] == 
IP_VERSION(3, 1, 2))
+ 
fw_shared->smu_interface_info.smu_interface_type = 2;
+ else if (adev->ip_versions[UVD_HWIP][0] 
== IP_VERSION(3, 1, 1))


As commented on patch v1, please also put (3, 1, 1) first.

+ 
fw_shared->smu_interface_info.smu_interface_type = 1;

 if (amdgpu_vcnfw_log)

amdgpu_vcn_fwlog_init(>vcn.inst[i]);

--
2.25.1


The whole patch cannot be applied. Did `scripts/checkpatch.pl` not 
detect this? Please fix and resend.



Kind regards,

Paul


Re: [PATCH V2] drm/amdgpu/vcn3: send smu interface type

2022-03-24 Thread Paul Menzel

Dear Boyuan,


Thank for the improved patch.


Am 24.03.22 um 18:25 schrieb Zhang, Boyuan:

[AMD Official Use Only]


No idea if this would confuse `git am`.


From: Boyuan Zhang mailto:boyuan.zh...@amd.com>>


Your mailer(?) mangled the patch. Did you edit it in your MUA’s compose 
window?



For VCN FW to detect ASIC type, in order to use different mailbox registers.

V2: simplify codes and fix format issue.

Signed-off-by: Boyuan Zhang mailto:boyuan.zh...@amd.com>>
Acked-by Huang Rui 
---
drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h | 7 +++
drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c   | 5 +
2 files changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
index e2fde88aaf5e..f06fb7f882e2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
@@ -159,6 +159,7 @@
#define AMDGPU_VCN_MULTI_QUEUE_FLAG   (1 << 8)
#define AMDGPU_VCN_SW_RING_FLAG  (1 << 9)
#define AMDGPU_VCN_FW_LOGGING_FLAG (1 << 10)
+#define AMDGPU_VCN_SMU_VERSION_INFO_FLAG (1 << 11)
  #define AMDGPU_VCN_IB_FLAG_DECODE_BUFFER0x0001
#define AMDGPU_VCN_CMD_FLAG_MSG_BUFFER 0x0001
@@ -279,6 +280,11 @@ struct amdgpu_fw_shared_fw_logging {
uint32_t size;
};
+struct amdgpu_fw_shared_smu_interface_info {
+ uint8_t smu_interface_type;
+ uint8_t padding[3];
+};
+
struct amdgpu_fw_shared {
uint32_t present_flag_0;
uint8_t pad[44];
@@ -287,6 +293,7 @@ struct amdgpu_fw_shared {
struct amdgpu_fw_shared_multi_queue multi_queue;
struct amdgpu_fw_shared_sw_ring sw_ring;
struct amdgpu_fw_shared_fw_logging fw_log;
+ struct amdgpu_fw_shared_smu_interface_info smu_interface_info;
};
  struct amdgpu_vcn_fwlog {
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
index b16c56aa2d22..9925b2bc63b9 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
@@ -219,6 +219,11 @@ static int vcn_v3_0_sw_init(void *handle)

 cpu_to_le32(AMDGPU_VCN_MULTI_QUEUE_FLAG) |

 cpu_to_le32(AMDGPU_VCN_FW_SHARED_FLAG_0_RB);
fw_shared->sw_ring.is_enabled = 
cpu_to_le32(DEC_SW_RING_ENABLED);
+ fw_shared->present_flag_0 |= 
AMDGPU_VCN_SMU_VERSION_INFO_FLAG;
+ if (adev->ip_versions[UVD_HWIP][0] == 
IP_VERSION(3, 1, 2))
+ 
fw_shared->smu_interface_info.smu_interface_type = 2;
+ else if (adev->ip_versions[UVD_HWIP][0] == 
IP_VERSION(3, 1, 1))
+ 
fw_shared->smu_interface_info.smu_interface_type = 1;
 if (amdgpu_vcnfw_log)

amdgpu_vcn_fwlog_init(>vcn.inst[i]);
--
2.25.1


The whole patch cannot be applied. Did `scripts/checkpatch.pl` not 
detect this? Please fix and resend.



Kind regards,

Paul


Re: [PATCH v2] drm/amdkfd: Check use_xgmi_p2p before reporting hive_id

2022-03-24 Thread Paul Menzel

Dear Divya,


Am 23.03.22 um 15:20 schrieb Divya Shikre:

Recently introduced commit "845ebd6b7c32 drm/amdgpu: Add
use_xgmi_p2p module parameter" did not update XGMI iolinks
when use_xgmi_p2p is disabled. Add fix to not create XGMI
iolinks in KFD topology when this parameter is disabled.


Thank you for rerolling the patch. Please also add:

Fixes: 845ebd6b7c32 ("drm/amdgpu: Add use_xgmi_p2p module parameter")


Signed-off-by: Divya Shikre 
---
  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 339e12c94cff..d5536e33b3d8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -536,7 +536,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
goto kfd_doorbell_error;
}
  
-	kfd->hive_id = kfd->adev->gmc.xgmi.hive_id;

+   if (amdgpu_use_xgmi_p2p)
+   kfd->hive_id = kfd->adev->gmc.xgmi.hive_id;
  
  	kfd->noretry = kfd->adev->gmc.noretry;
  


The rest looks good.


Kind regards,

Paul


Re: ATI Radeon Mobility 3470 fails with Kernel 5.0

2022-03-24 Thread Paul Menzel

Dear Craig,


Am 24.03.22 um 02:09 schrieb Craig M:

I should add some further information:

I have tried using Manjaro KDE, Kubuntu 20 and Kubuntu 18.03 live
disks and all exhibit this 'tearing' issue.

The problem begins rather early on in the boot sequence while the TUI
is logging the startup information and continues to be a problem as
the screen becomes graphic.

I feel that I should be able to rectify this with a setting change.

Again, Thanks for any help.
Craig.


In the future, it’d be great if instead of top-posting you used 
interleaved style.



On 3/24/22, Craig M  wrote:

Hi,

I'm trying to track down a problem with an ancient graphics device ATI
Radeon Mobility 3470.

Ubuntu/Kubuntu 18 is shipped kernel 4.18 and the radeon drivers work well.
With 18.02 and later it ships with kernel 5.0 and the radeon drivers don't.

What I'm seeing is hard to describe. It looks a lot like the scan
lines are interleaved and there's a lot of noise. It's not the typical
screen tearing, nor is it just snow (black and white noise). I can
*just* make out what it being displayed onscreen, but hurts the eyes a
lot to try and do so!

If I hold the kernel back at 4.18 things are just fine.

Some detailed info (from a working 4.18 kernel). Note that the
returned information from a 5.x kernel isn't all that much different:

$ lshw -c video
*-display
   description: VGA compatible controller
   product: RV620/M82 [Mobility Radeon HD 3450/3470]
   vendor: Advanced Micro Devices, Inc. [AMD/ATI]
   physical id: 0
   bus info: pci@:01:00.0
   version: 00
   width: 32 bits
   clock: 33MHz
   capabilities: pm pciexpress msi vga_controller bus_master
cap_list rom   configuration: driver=radeon latency=0
resources: irq:26 memory:c000-c7ff ioport:9000(size=256)
memory:c802-c802 memory:c-d


$ modinfo radeon

filename:
/lib/modules/4.18.0-17-generic/kernel/drivers/gpu/drm/radeon/radeon.ko
license:    GPL and additional rights description:    ATI Radeon
author: Gareth Hughes, Keith Whitwell, others.

...
name:   radeon
vermagic:   4.18.0-17-generic SMP mod_unload
signat: PKCS#7
signer:
sig_key:
sig_hashalgo:   md4



Looking at the output of `dmesg` might also help.


I'm just trying to track down the driver changes between 4.18 and 5.0
to see what I can do to change settings. Any help  would be greatly
appreciated. I've had a quick look through
https://gitlab.freedesktop.org/xorg/driver/xf86-video-ati but I don't
know where to start (as there's no correlation between releases and
kernels, from what I can see).


As switching the Linux kernel version works, you need to look at the 
driver in the Linux kernel. As you can easily reproduce your problem, 
and you are most likely alone with this old hardware, bisecting will 
probably the way to go. First you can try the packages from Ubuntu’s 
Kernel PPA [1]. Figuring out, in what version the regression occurs, you 
then build the Linux kernel yourself, for example with `make 
bindeb-pkg`, and try to find out the exact commit. There should be some 
guides on the WWW explaining all this in detail.



Kind regards and good luck,

Paul


PS: It would also be nice to know, if the problem is still present in 
Linux 5.17.



[1]: https://kernel.ubuntu.com/~kernel-ppa/mainline/


Re: [PATCH] drm/amdgpu/vcn3: send smu interface type

2022-03-23 Thread Paul Menzel

Dear Yifan, dear Boyuan,


Am 24.03.22 um 03:59 schrieb Yifan Zhang:

From: Boyuan Zhang 

For VCN FW to detect ASIC type


What affect does this have? How does VCN FW behave different now?


Signed-off-by: Boyuan Zhang 
Signed-off-by: Yifan Zhang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h | 7 +++
  drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c   | 7 +++
  2 files changed, 14 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
index e2fde88aaf5e..f06fb7f882e2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
@@ -159,6 +159,7 @@
  #define AMDGPU_VCN_MULTI_QUEUE_FLAG   (1 << 8)
  #define AMDGPU_VCN_SW_RING_FLAG   (1 << 9)
  #define AMDGPU_VCN_FW_LOGGING_FLAG(1 << 10)
+#define AMDGPU_VCN_SMU_VERSION_INFO_FLAG (1 << 11)
  
  #define AMDGPU_VCN_IB_FLAG_DECODE_BUFFER	0x0001

  #define AMDGPU_VCN_CMD_FLAG_MSG_BUFFER0x0001
@@ -279,6 +280,11 @@ struct amdgpu_fw_shared_fw_logging {
uint32_t size;
  };
  
+struct amdgpu_fw_shared_smu_interface_info {

+   uint8_t smu_interface_type;
+   uint8_t padding[3];
+};
+
  struct amdgpu_fw_shared {
uint32_t present_flag_0;
uint8_t pad[44];
@@ -287,6 +293,7 @@ struct amdgpu_fw_shared {
struct amdgpu_fw_shared_multi_queue multi_queue;
struct amdgpu_fw_shared_sw_ring sw_ring;
struct amdgpu_fw_shared_fw_logging fw_log;
+   struct amdgpu_fw_shared_smu_interface_info smu_interface_info;
  };
  
  struct amdgpu_vcn_fwlog {

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
index b16c56aa2d22..c5bf7cbfa82c 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
@@ -219,6 +219,13 @@ static int vcn_v3_0_sw_init(void *handle)
 
cpu_to_le32(AMDGPU_VCN_MULTI_QUEUE_FLAG) |
 
cpu_to_le32(AMDGPU_VCN_FW_SHARED_FLAG_0_RB);
fw_shared->sw_ring.is_enabled = 
cpu_to_le32(DEC_SW_RING_ENABLED);
+   if (adev->ip_versions[UVD_HWIP][0] == IP_VERSION(3, 1, 2)) {
+   fw_shared->present_flag_0 |= 
AMDGPU_VCN_SMU_VERSION_INFO_FLAG;
+   fw_shared->smu_interface_info.smu_interface_type = 2;
+   } else if(adev->ip_versions[UVD_HWIP][0] == IP_VERSION(3, 1, 
1)) {


Please add a space before the (, which `checkpatch.pl` also would have 
found:


$ scripts/checkpatch.pl 
/dev/shm/0001-drm-amdgpu-vcn3-send-smu-interface-type.patch

ERROR: space required before the open parenthesis '('
#58: FILE: drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c:225:
+   } else if(adev->ip_versions[UVD_HWIP][0] == 
IP_VERSION(3, 1, 1)) {


Also why not order it smallest version first? Will there ever be other 
IP versions for VCN 3.0?



+   fw_shared->present_flag_0 |= 
AMDGPU_VCN_SMU_VERSION_INFO_FLAG;
+   fw_shared->smu_interface_info.smu_interface_type = 1;
+   }
  
  		if (amdgpu_vcnfw_log)

amdgpu_vcn_fwlog_init(>vcn.inst[i]);



Kind regards,

Paul


Commit messages (was: [PATCH v11] drm/amdgpu: add drm buddy support to amdgpu)

2022-03-23 Thread Paul Menzel

Dear Christian,


Am 23.03.22 um 08:42 schrieb Christian König:


Am 23.03.22 um 07:42 schrieb Paul Menzel:



Am 23.03.22 um 07:25 schrieb Arunpravin Paneer Selvam:

- Remove drm_mm references and replace with drm buddy functionalities


The commit message summary to me suggested, you can somehow use both 
allocators now. Two suggestions below:


1.  Switch to drm buddy allocator
2.  Use drm buddy alllocator


- Add res cursor support for drm buddy


As an allocator switch sounds invasive, could you please extend the 
commit message, briefly describing the current situation, saying what 
the downsides are, and why the buddy allocator is “better”.


Well, Paul please stop bothering developers with those requests.

It's my job as maintainer to supervise the commit messages and it is 
certainly NOT require to explain all the details of the current 
situation in a commit message. That is just overkill.


I did not request all the details, and I think my requests are totally 
reasonable. But let’s change the perspective. If there were not any AMD 
graphics drivers bug, I would have never needed to look at the code and 
deal with it. Unfortunately the AMD graphics driver situation – which 
improved a lot in recent years – with no public documentation, 
proprietary firmware and complex devices is still not optimal, and a lot 
of bugs get reported, and I am also hit by bugs, taking time to deal 
with them, and maybe reporting and helping to analyze them. So to keep 
your wording, if you would stop bothering users with bugs and requesting 
their help in fixing them – asking the user to bisect the issue is often 
the first thing. Actually it should not be unreasonable for customers 
buying an AMD device to expect get bug free drivers. It’s strange and a 
sad fact, that the software industry succeeded to sway that valid 
expectation and customers now except they need to regularly install 
software updates, and do not get, for example, a price reduction when 
there are bugs.


Also, as stated everywhere, reviewer time is scarce, so commit authors 
should make it easy to attract new folks.


A simple note that we are switching from the drm_mm backend to the buddy 
backend is sufficient, and that is exactly what the commit message is 
saying here.


Sorry, I disagree. The motivation needs to be part of the commit 
message. For example see recent discussion on the LWN article 
*Donenfeld: Random number generator enhancements for Linux 5.17 and 
5.18* [1].


How much the commit message should be extended, I do not know, but the 
current state is insufficient (too terse).



Kind regards,

Paul


[1]: https://lwn.net/Articles/888413/
 "Donenfeld: Random number generator enhancements for Linux 5.17 
and 5.18"


Re: [PATCH v11] drm/amdgpu: add drm buddy support to amdgpu

2022-03-23 Thread Paul Menzel

Dear Arunprivin,


Thank you for your patch.

Am 23.03.22 um 07:25 schrieb Arunpravin Paneer Selvam:

- Remove drm_mm references and replace with drm buddy functionalities


The commit message summary to me suggested, you can somehow use both 
allocators now. Two suggestions below:


1.  Switch to drm buddy allocator
2.  Use drm buddy alllocator


- Add res cursor support for drm buddy


As an allocator switch sounds invasive, could you please extend the 
commit message, briefly describing the current situation, saying what 
the downsides are, and why the buddy allocator is “better”.


How did you test it? How can it be tested, that there are no regressions?


v2(Matthew Auld):


Nit: I’d add a space before (.


Kind regards,

Paul



   - replace spinlock with mutex as we call kmem_cache_zalloc
 (..., GFP_KERNEL) in drm_buddy_alloc() function

   - lock drm_buddy_block_trim() function as it calls
 mark_free/mark_split are all globally visible

v3(Matthew Auld):
   - remove trim method error handling as we address the failure case
 at drm_buddy_block_trim() function

v4:
   - fix warnings reported by kernel test robot 

v5:
   - fix merge conflict issue

v6:
   - fix warnings reported by kernel test robot 

v7:
   - remove DRM_BUDDY_RANGE_ALLOCATION flag usage

v8:
   - keep DRM_BUDDY_RANGE_ALLOCATION flag usage
   - resolve conflicts created by drm/amdgpu: remove VRAM accounting v2

v9(Christian):
   - merged the below patch
  - drm/amdgpu: move vram inline functions into a header
   - rename label name as fallback
   - move struct amdgpu_vram_mgr to amdgpu_vram_mgr.h
   - remove unnecessary flags from struct amdgpu_vram_reservation
   - rewrite block NULL check condition
   - change else style as per coding standard
   - rewrite the node max size
   - add a helper function to fetch the first entry from the list

v10(Christian):
- rename amdgpu_get_node() function name as amdgpu_vram_mgr_first_block

v11:
- if size is not aligned with min_page_size, enable is_contiguous flag,
  therefore, the size round up to the power of two and trimmed to the
  original size.

Signed-off-by: Arunpravin Paneer Selvam 
---
  drivers/gpu/drm/Kconfig   |   1 +
  .../gpu/drm/amd/amdgpu/amdgpu_res_cursor.h|  97 +--
  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h   |  10 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c  | 263 ++
  4 files changed, 234 insertions(+), 137 deletions(-)

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index f1422bee3dcc..5133c3f028ab 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -280,6 +280,7 @@ config DRM_AMDGPU
select HWMON
select BACKLIGHT_CLASS_DEVICE
select INTERVAL_TREE
+   select DRM_BUDDY
help
  Choose this option if you have a recent AMD Radeon graphics card.
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h

index acfa207cf970..864c609ba00b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
@@ -30,12 +30,15 @@
  #include 
  #include 
  
+#include "amdgpu_vram_mgr.h"

+
  /* state back for walking over vram_mgr and gtt_mgr allocations */
  struct amdgpu_res_cursor {
uint64_tstart;
uint64_tsize;
uint64_tremaining;
-   struct drm_mm_node  *node;
+   void*node;
+   uint32_tmem_type;
  };
  
  /**

@@ -52,27 +55,63 @@ static inline void amdgpu_res_first(struct ttm_resource 
*res,
uint64_t start, uint64_t size,
struct amdgpu_res_cursor *cur)
  {
+   struct drm_buddy_block *block;
+   struct list_head *head, *next;
struct drm_mm_node *node;
  
-	if (!res || res->mem_type == TTM_PL_SYSTEM) {

-   cur->start = start;
-   cur->size = size;
-   cur->remaining = size;
-   cur->node = NULL;
-   WARN_ON(res && start + size > res->num_pages << PAGE_SHIFT);
-   return;
-   }
+   if (!res)
+   goto fallback;
  
  	BUG_ON(start + size > res->num_pages << PAGE_SHIFT);
  
-	node = to_ttm_range_mgr_node(res)->mm_nodes;

-   while (start >= node->size << PAGE_SHIFT)
-   start -= node++->size << PAGE_SHIFT;
+   cur->mem_type = res->mem_type;
+
+   switch (cur->mem_type) {
+   case TTM_PL_VRAM:
+   head = _amdgpu_vram_mgr_node(res)->blocks;
+
+   block = list_first_entry_or_null(head,
+struct drm_buddy_block,
+link);
+   if (!block)
+   goto fallback;
+
+   while (start >= amdgpu_node_size(block)) {
+   start -= 

Re: [PATCH] drm/amdkfd:Check use_xgmi_p2p before reporting hive_id

2022-03-23 Thread Paul Menzel

Dear Divya,


Thank you for your patch.

Am 22.03.22 um 20:23 schrieb Divya Shikre:

Please add a space after the colon in the commit message summary.


XGMI iolinks will not be created in KFD topology when use_xgmi_p2p module 
parameter
is disabled in a dGPU.


1.  Is “will not be” the current problem, or the result of your patch? 
Please start by describing the current problem, how to reproduce it.


2.  Please wrap lines after 75 characters, and add a blank line between 
paragraphs.


The script `scripts/checkpatch.pl`, you should run on all your patches 
before submitting them, also warns about the first issue:


WARNING: Possible unwrapped commit description (prefer a maximum 75 
chars per line)



In this case PCIe p2p links will be used instead.


Is this a regression, and fixes another commit? If so, please add a 
Fixes line.



Signed-off-by: Divya Shikre 
---
  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 339e12c94cff..d5536e33b3d8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -536,7 +536,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
goto kfd_doorbell_error;
}
  
-	kfd->hive_id = kfd->adev->gmc.xgmi.hive_id;

+   if (amdgpu_use_xgmi_p2p)
+   kfd->hive_id = kfd->adev->gmc.xgmi.hive_id;


Is hive_id now unset?


kfd->noretry = kfd->adev->gmc.noretry;
  



Kind regards,

Paul


Re: [PATCH] drm/amdgpu: Correct the parameter name of amdgpu_ring_init

2022-03-23 Thread Paul Menzel

Dear Jun,


Thank you for your patch.

Am 23.03.22 um 04:02 schrieb Ma, Jun:


On 3/22/2022 9:33 PM, Christian König wrote:

Am 22.03.22 um 13:53 schrieb Ma Jun:

Correct the parameter name of amdgpu_ring_init() in header file.


Maybe write "Sync up header and implementation to use the same parameter
names", otherwise somebody could think that this is a real functional
bug fix and backport it.
ok, will fix this in v2


(Your mailer incorrectly quoted your reply sentence, making it look like 
it was written by Christian.)


If you re-roll the patch, please also mention both parameter names in 
the commit message, so people do not have to search in the diff for 
them, and can verify the changes match your intention. (The summary uses 
singular *name* adding a little confusion.)



Kind regards,

Paul



Signed-off-by: Ma Jun 


With the commit message and subject adjusted the patch is Reviewed-by:
Christian König 


Change-Id: I202d76ba04b137926b456b1c8a4c05a5b1a01bff
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 4 ++--
   1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 5320bb0883d8..317d80209e95 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -300,8 +300,8 @@ void amdgpu_ring_generic_pad_ib(struct amdgpu_ring *ring, 
struct amdgpu_ib *ib);
   void amdgpu_ring_commit(struct amdgpu_ring *ring);
   void amdgpu_ring_undo(struct amdgpu_ring *ring);
   int amdgpu_ring_init(struct amdgpu_device *adev, struct amdgpu_ring *ring,
-unsigned int ring_size, struct amdgpu_irq_src *irq_src,
-unsigned int irq_type, unsigned int prio,
+unsigned int max_dw, struct amdgpu_irq_src *irq_src,
+unsigned int irq_type, unsigned int hw_prio,
 atomic_t *sched_score);
   void amdgpu_ring_fini(struct amdgpu_ring *ring);
   void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring,




  1   2   3   >