Re: [RFC 3/6] drm/amdgpu: Fix crash on modprobe

2021-12-21 Thread Christian König

On 21.12.21 at 17:03, Andrey Grodzovsky wrote:


On 2021-12-21 2:02 a.m., Christian König wrote:



On 20.12.21 at 20:22, Andrey Grodzovsky wrote:


On 2021-12-20 2:17 a.m., Christian König wrote:

On 17.12.21 at 23:27, Andrey Grodzovsky wrote:

Restrict job resubmission to the suspend case
only, since the schedulers are not initialised yet
on probe.

Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 5527c68c51de..8ebd954e06c6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -582,7 +582,7 @@ void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)

  if (!ring || !ring->fence_drv.initialized)
  continue;
-    if (!ring->no_scheduler) {
+    if (adev->in_suspend && !ring->no_scheduler) {


Uff, why is that suddenly necessary? Because of the changed order?

Christian.



Yes.


Mhm, that's quite bad design then.



If you look at the original patch for this,
https://www.spinics.net/lists/amd-gfx/msg67560.html, you will
see that restarting the scheduler here is only relevant for the
suspend/resume case, because there was a race to fix. There is no
point in this code on driver init because nothing has been submitted
to the scheduler yet, so it seems OK to me to add a condition that
this code only runs in the in_suspend case.


Yeah, but having extra logic like this means that we have some design 
issue in the IP block handling.


We need to clean that and some other odd approaches up at some point, 
but probably not now.


Christian.






How about we keep the order as is and allow specifying the reset work
queue with drm_sched_start()?



As I mentioned above, the fact we even have drm_sched_start there is
just part of a solution to resolve a race during suspend/resume. It is
not for device initialization, and indeed other client drivers of the
GPU scheduler never call drm_sched_start on device init. We must
guarantee that the reset work queue is already initialized before any
job submission to the scheduler,

and because of that IMHO the right place for this is drm_sched_init.
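To sketch the idea (purely illustrative, not the actual API at this
point): the driver would create the reset work queue first and hand it
to drm_sched_init() together with the other scheduler parameters, so it
provably exists before the first job can be submitted:

    /* Illustrative amdgpu ring setup, assuming a drm_sched_init()
     * variant that takes the reset workqueue up front; names are
     * approximate, not the real signature. */
    struct workqueue_struct *reset_wq;

    reset_wq = alloc_ordered_workqueue("amdgpu-reset", 0);
    if (!reset_wq)
            return -ENOMEM;

    r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
                       num_hw_submission, amdgpu_job_hang_limit,
                       timeout, reset_wq, NULL, ring->name);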

Andrey




Christian.



Andrey





drm_sched_resubmit_jobs(&ring->sched);
  drm_sched_start(&ring->sched, true);
  }








[PATCH] drm/ttm: add workaround for some arm hardware issue

2021-12-21 Thread Victor Zhao
Some Arm-based platforms have a hardware issue which may
generate incorrect addresses when receiving writes from the CPU
with a discontiguous set of byte enables. This affects writes
with the write-combine property.

Work around this by changing PROT_NORMAL_NC to PROT_DEVICE_nGnRE on arm.
As this is an issue with some specific Arm-based CPUs, add
a ttm parameter to control it.

Signed-off-by: Victor Zhao 
---
 drivers/gpu/drm/ttm/ttm_module.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/ttm/ttm_module.c b/drivers/gpu/drm/ttm/ttm_module.c
index e87f40674a4d..b27473cbbd52 100644
--- a/drivers/gpu/drm/ttm/ttm_module.c
+++ b/drivers/gpu/drm/ttm/ttm_module.c
@@ -41,6 +41,12 @@
 
 #include "ttm_module.h"
 
+static int enable_use_wc = 1;
+
+MODULE_PARM_DESC(enable_use_wc,
+   "control write combine usage on arm platform due to hardware issue with write combine found on some specific arm cpu (1 = enable(default), 0 = disable)");
+module_param(enable_use_wc, int, 0644);
+
 /**
  * ttm_prot_from_caching - Modify the page protection according to the
  * ttm cacing mode
@@ -63,7 +69,7 @@ pgprot_t ttm_prot_from_caching(enum ttm_caching caching, pgprot_t tmp)
 #endif
 #if defined(__ia64__) || defined(__arm__) || defined(__aarch64__) || \
defined(__powerpc__) || defined(__mips__)
-   if (caching == ttm_write_combined)
+   if (caching == ttm_write_combined && enable_use_wc != 0)
tmp = pgprot_writecombine(tmp);
else
tmp = pgprot_noncached(tmp);
-- 
2.25.1
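As a usage note: because the parameter is registered with module_param(..., 0644),
it can be set in the usual ways (ttm.enable_use_wc=0 on the kernel command line
when TTM is built in, or modprobe ttm enable_use_wc=0 when it is modular) and
flipped at runtime through /sys/module/ttm/parameters/enable_use_wc. This is
standard module_param behaviour rather than anything specific to this patch,
and a runtime change only affects mappings created after the write.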



[RFC PATH 3/3] drm: replace allow_fb_modifiers with fb_modifiers_not_supported

2021-12-21 Thread Tomohito Esaki
Since almost all drivers support fb modifiers, allow_fb_modifiers is
replaced with fb_modifiers_not_supported and removed.

Signed-off-by: Tomohito Esaki 
---
 drivers/gpu/drm/drm_framebuffer.c|  6 +++---
 drivers/gpu/drm/drm_ioctl.c  |  2 +-
 drivers/gpu/drm/drm_plane.c  |  9 -
 drivers/gpu/drm/selftests/test-drm_framebuffer.c |  1 -
 include/drm/drm_mode_config.h| 16 
 5 files changed, 4 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/drm_framebuffer.c b/drivers/gpu/drm/drm_framebuffer.c
index 07f5abc875e9..4562a8b86579 100644
--- a/drivers/gpu/drm/drm_framebuffer.c
+++ b/drivers/gpu/drm/drm_framebuffer.c
@@ -309,7 +309,7 @@ drm_internal_framebuffer_create(struct drm_device *dev,
}
 
if (r->flags & DRM_MODE_FB_MODIFIERS &&
-   !dev->mode_config.allow_fb_modifiers) {
+   dev->mode_config.fb_modifiers_not_supported) {
DRM_DEBUG_KMS("driver does not support fb modifiers\n");
return ERR_PTR(-EINVAL);
}
@@ -594,7 +594,7 @@ int drm_mode_getfb2_ioctl(struct drm_device *dev,
r->pixel_format = fb->format->format;
 
r->flags = 0;
-   if (dev->mode_config.allow_fb_modifiers)
+   if (!dev->mode_config.fb_modifiers_not_supported)
r->flags |= DRM_MODE_FB_MODIFIERS;
 
for (i = 0; i < ARRAY_SIZE(r->handles); i++) {
@@ -607,7 +607,7 @@ int drm_mode_getfb2_ioctl(struct drm_device *dev,
for (i = 0; i < fb->format->num_planes; i++) {
r->pitches[i] = fb->pitches[i];
r->offsets[i] = fb->offsets[i];
-   if (dev->mode_config.allow_fb_modifiers)
+   if (!dev->mode_config.fb_modifiers_not_supported)
r->modifier[i] = fb->modifier;
}
 
diff --git a/drivers/gpu/drm/drm_ioctl.c b/drivers/gpu/drm/drm_ioctl.c
index 8b8744dcf691..51fcf1298023 100644
--- a/drivers/gpu/drm/drm_ioctl.c
+++ b/drivers/gpu/drm/drm_ioctl.c
@@ -297,7 +297,7 @@ static int drm_getcap(struct drm_device *dev, void *data, struct drm_file *file_
req->value = 64;
break;
case DRM_CAP_ADDFB2_MODIFIERS:
-   req->value = dev->mode_config.allow_fb_modifiers;
+   req->value = !dev->mode_config.fb_modifiers_not_supported;
break;
case DRM_CAP_CRTC_IN_VBLANK_EVENT:
req->value = 1;
diff --git a/drivers/gpu/drm/drm_plane.c b/drivers/gpu/drm/drm_plane.c
index 75308ee240c0..5b546d80d248 100644
--- a/drivers/gpu/drm/drm_plane.c
+++ b/drivers/gpu/drm/drm_plane.c
@@ -302,15 +302,6 @@ static int __drm_universal_plane_init(struct drm_device 
*dev,
}
}
 
-   /* autoset the cap and check for consistency across all planes */
-   if (format_modifier_count) {
-   drm_WARN_ON(dev, !config->allow_fb_modifiers &&
-   !list_empty(&config->plane_list));
-   config->allow_fb_modifiers = true;
-   } else {
-   drm_WARN_ON(dev, config->allow_fb_modifiers);
-   }
-
plane->modifier_count = format_modifier_count;
plane->modifiers = kmalloc_array(format_modifier_count,
 sizeof(format_modifiers[0]),
diff --git a/drivers/gpu/drm/selftests/test-drm_framebuffer.c b/drivers/gpu/drm/selftests/test-drm_framebuffer.c
index 61b44d3a6a61..f6d66285c5fc 100644
--- a/drivers/gpu/drm/selftests/test-drm_framebuffer.c
+++ b/drivers/gpu/drm/selftests/test-drm_framebuffer.c
@@ -323,7 +323,6 @@ static struct drm_device mock_drm_device = {
.max_width = MAX_WIDTH,
.min_height = MIN_HEIGHT,
.max_height = MAX_HEIGHT,
-   .allow_fb_modifiers = true,
.funcs = &mock_config_funcs,
},
 };
diff --git a/include/drm/drm_mode_config.h b/include/drm/drm_mode_config.h
index c56f298c55bd..6fd13d6510f1 100644
--- a/include/drm/drm_mode_config.h
+++ b/include/drm/drm_mode_config.h
@@ -904,22 +904,6 @@ struct drm_mode_config {
 */
bool async_page_flip;
 
-   /**
-* @allow_fb_modifiers:
-*
-* Whether the driver supports fb modifiers in the ADDFB2.1 ioctl call.
-* Note that drivers should not set this directly, it is automatically
-* set in drm_universal_plane_init().
-*
-* IMPORTANT:
-*
-* If this is set the driver must fill out the full implicit modifier
-* information in their &drm_mode_config_funcs.fb_create hook for legacy
-* userspace which does not set modifiers. Otherwise the GETFB2 ioctl is
-* broken for modifier aware userspace.
-*/
-   bool allow_fb_modifiers;
-
/**
 * @fb_modifiers_not_supported:
 *
-- 
2.17.1
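To make the userspace-visible contract concrete: the DRM_CAP_ADDFB2_MODIFIERS
capability reads exactly as before; only the driver-side bookkeeping is
inverted. A minimal check using the standard libdrm drmGetCap() call (nothing
here is specific to this series):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <xf86drm.h>

    int main(void)
    {
            uint64_t has_mods = 0;
            int fd = open("/dev/dri/card0", O_RDWR);

            if (fd < 0)
                    return 1;
            /* Reports 1 unless the driver set fb_modifiers_not_supported. */
            if (drmGetCap(fd, DRM_CAP_ADDFB2_MODIFIERS, &has_mods) == 0)
                    printf("ADDFB2 modifiers: %s\n", has_mods ? "yes" : "no");
            close(fd);
            return 0;
    }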



[RFC PATH 2/3] drm: set fb_modifiers_not_supported flag in legacy drivers

2021-12-21 Thread Tomohito Esaki
Set the fb_modifiers_not_supported flag in legacy drivers whose planes
support non-linear layouts but do not support modifiers, and replace
allow_fb_modifiers with fb_modifiers_not_supported.

Signed-off-by: Tomohito Esaki 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_display.c | 6 +++---
 drivers/gpu/drm/amd/amdgpu/dce_v10_0.c  | 2 ++
 drivers/gpu/drm/amd/amdgpu/dce_v11_0.c  | 2 ++
 drivers/gpu/drm/amd/amdgpu/dce_v6_0.c   | 1 +
 drivers/gpu/drm/amd/amdgpu/dce_v8_0.c   | 2 ++
 drivers/gpu/drm/nouveau/nouveau_display.c   | 6 --
 drivers/gpu/drm/radeon/radeon_display.c | 2 ++
 7 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
index dc50c05f23fc..cbaea9c6cfda 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
@@ -958,7 +958,7 @@ static int amdgpu_display_verify_sizes(struct amdgpu_framebuffer *rfb)
int ret;
unsigned int i, block_width, block_height, block_size_log2;
 
-   if (!rfb->base.dev->mode_config.allow_fb_modifiers)
+   if (rfb->base.dev->mode_config.fb_modifiers_not_supported)
return 0;
 
for (i = 0; i < format_info->num_planes; ++i) {
@@ -1145,7 +1145,7 @@ int amdgpu_display_framebuffer_init(struct drm_device *dev,
if (ret)
return ret;
 
-   if (!dev->mode_config.allow_fb_modifiers) {
+   if (dev->mode_config.fb_modifiers_not_supported) {
drm_WARN_ONCE(dev, adev->family >= AMDGPU_FAMILY_AI,
  "GFX9+ requires FB check based on format 
modifier\n");
ret = check_tiling_flags_gfx6(rfb);
@@ -1153,7 +1153,7 @@ int amdgpu_display_framebuffer_init(struct drm_device *dev,
return ret;
}
 
-   if (dev->mode_config.allow_fb_modifiers &&
+   if (!dev->mode_config.fb_modifiers_not_supported &&
!(rfb->base.flags & DRM_MODE_FB_MODIFIERS)) {
ret = convert_tiling_flags_to_modifier(rfb);
if (ret) {
diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v10_0.c b/drivers/gpu/drm/amd/amdgpu/dce_v10_0.c
index d1570a462a51..fb61c0814115 100644
--- a/drivers/gpu/drm/amd/amdgpu/dce_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/dce_v10_0.c
@@ -2798,6 +2798,8 @@ static int dce_v10_0_sw_init(void *handle)
adev_to_drm(adev)->mode_config.preferred_depth = 24;
adev_to_drm(adev)->mode_config.prefer_shadow = 1;
 
+   adev_to_drm(adev)->mode_config.fb_modifiers_not_supported = true;
+
adev_to_drm(adev)->mode_config.fb_base = adev->gmc.aper_base;
 
r = amdgpu_display_modeset_create_props(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
index 18a7b3bd633b..17942a11366d 100644
--- a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
@@ -2916,6 +2916,8 @@ static int dce_v11_0_sw_init(void *handle)
adev_to_drm(adev)->mode_config.preferred_depth = 24;
adev_to_drm(adev)->mode_config.prefer_shadow = 1;
 
+   adev_to_drm(adev)->mode_config.fb_modifiers_not_supported = true;
+
adev_to_drm(adev)->mode_config.fb_base = adev->gmc.aper_base;
 
r = amdgpu_display_modeset_create_props(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v6_0.c b/drivers/gpu/drm/amd/amdgpu/dce_v6_0.c
index c7803dc2b2d5..2ec99ec8e1a3 100644
--- a/drivers/gpu/drm/amd/amdgpu/dce_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/dce_v6_0.c
@@ -2674,6 +2674,7 @@ static int dce_v6_0_sw_init(void *handle)
adev_to_drm(adev)->mode_config.max_height = 16384;
adev_to_drm(adev)->mode_config.preferred_depth = 24;
adev_to_drm(adev)->mode_config.prefer_shadow = 1;
+   adev_to_drm(adev)->mode_config.fb_modifiers_not_supported = true;
adev_to_drm(adev)->mode_config.fb_base = adev->gmc.aper_base;
 
r = amdgpu_display_modeset_create_props(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v8_0.c b/drivers/gpu/drm/amd/amdgpu/dce_v8_0.c
index b200b9e722d9..8369336cec90 100644
--- a/drivers/gpu/drm/amd/amdgpu/dce_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/dce_v8_0.c
@@ -2699,6 +2699,8 @@ static int dce_v8_0_sw_init(void *handle)
adev_to_drm(adev)->mode_config.preferred_depth = 24;
adev_to_drm(adev)->mode_config.prefer_shadow = 1;
 
+   adev_to_drm(adev)->mode_config.fb_modifiers_not_supported = true;
+
adev_to_drm(adev)->mode_config.fb_base = adev->gmc.aper_base;
 
r = amdgpu_display_modeset_create_props(adev);
diff --git a/drivers/gpu/drm/nouveau/nouveau_display.c b/drivers/gpu/drm/nouveau/nouveau_display.c
index 929de41c281f..1ecad7fa3e8a 100644
--- a/drivers/gpu/drm/nouveau/nouveau_display.c
+++ b/drivers/gpu/drm/nouveau/nouveau_display.c
@@ -711,10 +711,12 @@ nouveau_display_create(struct drm_device *dev)
 &disp->disp);
if (ret == 0) {
 

[RFC PATH 1/3] drm: add support modifiers for drivers whose planes only support linear layout

2021-12-21 Thread Tomohito Esaki
The LINEAR modifier is advertised as default if a driver doesn't specify
modifiers. However, there are legacy drivers such as radeon that do not
support modifiers but infer the actual layout of the underlying buffer.
Therefore, a new flag, fb_modifiers_not_supported, is introduced for these
legacy drivers. allow_fb_modifiers will be replaced with this new flag.

Signed-off-by: Tomohito Esaki 
---
 drivers/gpu/drm/drm_plane.c   | 34 ++
 include/drm/drm_mode_config.h | 10 ++
 include/drm/drm_plane.h   |  3 +++
 3 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/drm_plane.c b/drivers/gpu/drm/drm_plane.c
index 82afb854141b..75308ee240c0 100644
--- a/drivers/gpu/drm/drm_plane.c
+++ b/drivers/gpu/drm/drm_plane.c
@@ -161,6 +161,16 @@ modifiers_ptr(struct drm_format_modifier_blob *blob)
return (struct drm_format_modifier *)(((char *)blob) + blob->modifiers_offset);
 }
 
+static bool check_format_modifier(struct drm_plane *plane, uint32_t format,
+ uint64_t modifier)
+{
+   if (plane->funcs->format_mod_supported)
+   return plane->funcs->format_mod_supported(plane, format,
+ modifier);
+
+   return modifier == DRM_FORMAT_MOD_LINEAR;
+}
+
static int create_in_format_blob(struct drm_device *dev, struct drm_plane *plane)
 {
const struct drm_mode_config *config = &dev->mode_config;
@@ -203,16 +213,15 @@ static int create_in_format_blob(struct drm_device *dev, struct drm_plane *plane
memcpy(formats_ptr(blob_data), plane->format_types, formats_size);
 
/* If we can't determine support, just bail */
-   if (!plane->funcs->format_mod_supported)
+   if (config->fb_modifiers_not_supported)
goto done;
 
mod = modifiers_ptr(blob_data);
for (i = 0; i < plane->modifier_count; i++) {
for (j = 0; j < plane->format_count; j++) {
-   if (plane->funcs->format_mod_supported(plane,
-  plane->format_types[j],
-  plane->modifiers[i])) {
-
+   if (check_format_modifier(plane,
+ plane->format_types[j],
+ plane->modifiers[i])) {
mod->formats |= 1ULL << j;
}
}
@@ -242,6 +251,10 @@ static int __drm_universal_plane_init(struct drm_device *dev,
  const char *name, va_list ap)
 {
struct drm_mode_config *config = &dev->mode_config;
+   const uint64_t default_modifiers[] = {
+   DRM_FORMAT_MOD_LINEAR,
+   DRM_FORMAT_MOD_INVALID
+   };
unsigned int format_modifier_count = 0;
int ret;
 
@@ -282,6 +295,11 @@ static int __drm_universal_plane_init(struct drm_device *dev,
 
while (*temp_modifiers++ != DRM_FORMAT_MOD_INVALID)
format_modifier_count++;
+   } else {
+   if (!dev->mode_config.fb_modifiers_not_supported) {
+   format_modifiers = default_modifiers;
+   format_modifier_count = 1;
+   }
}
 
/* autoset the cap and check for consistency across all planes */
@@ -346,7 +364,7 @@ static int __drm_universal_plane_init(struct drm_device *dev,
drm_object_attach_property(&plane->base, config->prop_src_h, 0);
}
 
-   if (config->allow_fb_modifiers)
+   if (format_modifier_count)
create_in_format_blob(dev, plane);
 
return 0;
@@ -373,8 +391,8 @@ static int __drm_universal_plane_init(struct drm_device *dev,
  * drm_universal_plane_init() to let the DRM managed resource infrastructure
  * take care of cleanup and deallocation.
  *
- * Drivers supporting modifiers must set @format_modifiers on all their planes,
- * even those that only support DRM_FORMAT_MOD_LINEAR.
+ * For drivers supporting modifiers, all planes will advertise
+ * DRM_FORMAT_MOD_LINEAR support if @format_modifiers is not set.
  *
  * Returns:
  * Zero on success, error code on failure.
diff --git a/include/drm/drm_mode_config.h b/include/drm/drm_mode_config.h
index 48b7de80daf5..c56f298c55bd 100644
--- a/include/drm/drm_mode_config.h
+++ b/include/drm/drm_mode_config.h
@@ -920,6 +920,16 @@ struct drm_mode_config {
 */
bool allow_fb_modifiers;
 
+   /**
+* @fb_modifiers_not_supported:
+*
+* This flag is for legacy drivers such as radeon that do not support
+* modifiers but infer the actual layout of the underlying buffer.
+* Generally, all drivers should support modifiers, so this flag should
+* not be set.
+*/
+   bool fb_modifiers_not_supported;
+
/**
 * 

[RFC PATCH 0/3] Add support modifiers for drivers whose planes only support linear layout

2021-12-21 Thread Tomohito Esaki
Some drivers whose planes only support linear-layout framebuffers do not
support format modifiers.
These drivers should support modifiers; however, the DRM core should handle
this rather than open-coding it in every driver.

In this patch series, these drivers expose format modifiers based on the
following suggestion[1].

On Thu, Nov 18, 2021 at 01:02:11PM +, Daniel Stone wrote:
> I think the best way forward here is:
>   - add a new mode_config.cannot_support_modifiers flag, and enable
> this in radeon (plus any other drivers in the same boat)
>   - change drm_universal_plane_init() to advertise the LINEAR modifier
> when NULL is passed as the modifier list (including installing a
> default .format_mod_supported hook)
>   - remove the mode_config.allow_fb_modifiers hook and always
> advertise modifier support, unless
> mode_config.cannot_support_modifiers is set


[1] 
https://patchwork.kernel.org/project/linux-renesas-soc/patch/20190509054518.10781-1-e...@igel.co.jp/#24602575


Tomohito Esaki (3):
  drm: add support modifiers for drivers whose planes only support
linear layout
  drm: set fb_modifiers_not_supported flag in legacy drivers
  drm: replace allow_fb_modifiers with fb_modifiers_not_supported

 drivers/gpu/drm/amd/amdgpu/amdgpu_display.c   |  6 +--
 drivers/gpu/drm/amd/amdgpu/dce_v10_0.c|  2 +
 drivers/gpu/drm/amd/amdgpu/dce_v11_0.c|  2 +
 drivers/gpu/drm/amd/amdgpu/dce_v6_0.c |  1 +
 drivers/gpu/drm/amd/amdgpu/dce_v8_0.c |  2 +
 drivers/gpu/drm/drm_framebuffer.c |  6 +--
 drivers/gpu/drm/drm_ioctl.c   |  2 +-
 drivers/gpu/drm/drm_plane.c   | 41 +++
 drivers/gpu/drm/nouveau/nouveau_display.c |  6 ++-
 drivers/gpu/drm/radeon/radeon_display.c   |  2 +
 .../gpu/drm/selftests/test-drm_framebuffer.c  |  1 -
 include/drm/drm_mode_config.h | 18 +++-
 include/drm/drm_plane.h   |  3 ++
 13 files changed, 54 insertions(+), 38 deletions(-)

-- 
2.17.1
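Concretely, a linear-only driver keeps the call it has today, and with this
series the core advertises DRM_FORMAT_MOD_LINEAR on its behalf as long as
fb_modifiers_not_supported is left unset. A sketch (the plane, funcs and
format variables are placeholders, not names from the series):

    /* Linear-only driver: no modifier list and no .format_mod_supported
     * hook; the DRM core itself now advertises DRM_FORMAT_MOD_LINEAR. */
    ret = drm_universal_plane_init(dev, plane, possible_crtcs,
                                   &my_plane_funcs, formats, nformats,
                                   NULL, /* format_modifiers */
                                   DRM_PLANE_TYPE_PRIMARY, NULL);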


RE: [PATCH V5 13/16] drm/amd/pm: relocate the power related headers

2021-12-21 Thread Quan, Evan
[AMD Official Use Only]



> -Original Message-
> From: Lazar, Lijo 
> Sent: Tuesday, December 21, 2021 2:22 PM
> To: Quan, Evan ; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander 
> Subject: Re: [PATCH V5 13/16] drm/amd/pm: relocate the power related
> headers
> 
> 
> 
> On 12/13/2021 9:22 AM, Evan Quan wrote:
> > Instead of centralizing all headers in the same folder, separate them
> > into different folders and place them alongside the source files that
> > really need them.
> >
> > Signed-off-by: Evan Quan 
> > Change-Id: Id74cb4c7006327ca7ecd22daf17321e417c4aa71
> > --
> > v1->v2:
> >- create separate holders for driver_if and ppsmc headers(Lijo)
> > ---
> >   drivers/gpu/drm/amd/pm/Makefile   | 12 ---
> >   drivers/gpu/drm/amd/pm/legacy-dpm/Makefile| 32
> +++
> >   .../pm/{powerplay => legacy-dpm}/cik_dpm.h|  0
> >   .../amd/pm/{powerplay => legacy-dpm}/kv_dpm.c |  0
> >   .../amd/pm/{powerplay => legacy-dpm}/kv_dpm.h |  0
> >   .../amd/pm/{powerplay => legacy-dpm}/kv_smc.c |  0
> >   .../pm/{powerplay => legacy-dpm}/legacy_dpm.c |  0
> >   .../pm/{powerplay => legacy-dpm}/legacy_dpm.h |  0
> >   .../amd/pm/{powerplay => legacy-dpm}/ppsmc.h  |  0
> >   .../pm/{powerplay => legacy-dpm}/r600_dpm.h   |  0
> >   .../amd/pm/{powerplay => legacy-dpm}/si_dpm.c |  0
> >   .../amd/pm/{powerplay => legacy-dpm}/si_dpm.h |  0
> >   .../amd/pm/{powerplay => legacy-dpm}/si_smc.c |  0
> >   .../{powerplay => legacy-dpm}/sislands_smc.h  |  0
> >   drivers/gpu/drm/amd/pm/powerplay/Makefile |  6 +---
> >   .../pm/{ => powerplay}/inc/amd_powerplay.h|  0
> >   .../drm/amd/pm/{ => powerplay}/inc/cz_ppsmc.h |  0
> >   .../amd/pm/{ => powerplay}/inc/fiji_ppsmc.h   |  0
> >   .../pm/{ => powerplay}/inc/hardwaremanager.h  |  0
> >   .../drm/amd/pm/{ => powerplay}/inc/hwmgr.h|  0
> >   .../{ => powerplay}/inc/polaris10_pwrvirus.h  |  0
> >   .../amd/pm/{ => powerplay}/inc/power_state.h  |  0
> >   .../drm/amd/pm/{ => powerplay}/inc/pp_debug.h |  0
> >   .../amd/pm/{ => powerplay}/inc/pp_endian.h|  0
> >   .../amd/pm/{ => powerplay}/inc/pp_thermal.h   |  0
> >   .../amd/pm/{ => powerplay}/inc/ppinterrupt.h  |  0
> >   .../drm/amd/pm/{ => powerplay}/inc/rv_ppsmc.h |  0
> >   .../drm/amd/pm/{ => powerplay}/inc/smu10.h|  0
> >   .../pm/{ => powerplay}/inc/smu10_driver_if.h  |  0
> >   .../pm/{ => powerplay}/inc/smu11_driver_if.h  |  0
> >   .../gpu/drm/amd/pm/{ => powerplay}/inc/smu7.h |  0
> >   .../drm/amd/pm/{ => powerplay}/inc/smu71.h|  0
> >   .../pm/{ => powerplay}/inc/smu71_discrete.h   |  0
> >   .../drm/amd/pm/{ => powerplay}/inc/smu72.h|  0
> >   .../pm/{ => powerplay}/inc/smu72_discrete.h   |  0
> >   .../drm/amd/pm/{ => powerplay}/inc/smu73.h|  0
> >   .../pm/{ => powerplay}/inc/smu73_discrete.h   |  0
> >   .../drm/amd/pm/{ => powerplay}/inc/smu74.h|  0
> >   .../pm/{ => powerplay}/inc/smu74_discrete.h   |  0
> >   .../drm/amd/pm/{ => powerplay}/inc/smu75.h|  0
> >   .../pm/{ => powerplay}/inc/smu75_discrete.h   |  0
> >   .../amd/pm/{ => powerplay}/inc/smu7_common.h  |  0
> >   .../pm/{ => powerplay}/inc/smu7_discrete.h|  0
> >   .../amd/pm/{ => powerplay}/inc/smu7_fusion.h  |  0
> >   .../amd/pm/{ => powerplay}/inc/smu7_ppsmc.h   |  0
> >   .../gpu/drm/amd/pm/{ => powerplay}/inc/smu8.h |  0
> >   .../amd/pm/{ => powerplay}/inc/smu8_fusion.h  |  0
> >   .../gpu/drm/amd/pm/{ => powerplay}/inc/smu9.h |  0
> >   .../pm/{ => powerplay}/inc/smu9_driver_if.h   |  0
> >   .../{ => powerplay}/inc/smu_ucode_xfer_cz.h   |  0
> >   .../{ => powerplay}/inc/smu_ucode_xfer_vi.h   |  0
> >   .../drm/amd/pm/{ => powerplay}/inc/smumgr.h   |  0
> >   .../amd/pm/{ => powerplay}/inc/tonga_ppsmc.h  |  0
> >   .../amd/pm/{ => powerplay}/inc/vega10_ppsmc.h |  0
> >   .../inc/vega12/smu9_driver_if.h   |  0
> >   .../amd/pm/{ => powerplay}/inc/vega12_ppsmc.h |  0
> >   .../amd/pm/{ => powerplay}/inc/vega20_ppsmc.h |  0
> >   .../drm/amd/pm/{ => swsmu}/inc/amdgpu_smu.h   |  0
> >   .../inc/interface}/smu11_driver_if_arcturus.h |  0
> >   .../smu11_driver_if_cyan_skillfish.h  |  0
> >   .../inc/interface}/smu11_driver_if_navi10.h   |  0
> >   .../smu11_driver_if_sienna_cichlid.h  |  0
> >   .../inc/interface}/smu11_driver_if_vangogh.h  |  0
> >   .../inc/interface}/smu12_driver_if.h  |  0
> >   .../interface}/smu13_driver_if_aldebaran.h|  0
> >   .../interface}/smu13_driver_if_yellow_carp.h  |  0
> >   .../inc/interface}/smu_v11_5_pmfw.h   |  0
> >   .../inc/interface}/smu_v11_8_pmfw.h   |  0
> >   .../inc/interface}/smu_v13_0_1_pmfw.h |  0
> >   .../inc/message}/aldebaran_ppsmc.h|  0
> >   .../inc/message}/arcturus_ppsmc.h |  0
> >   .../inc/message}/smu_v11_0_7_ppsmc.h  |  0
> >   .../inc/message}/smu_v11_0_ppsmc.h|  0
> >   .../inc/message}/smu_v11_5_ppsmc.h|  0
> >   .../inc/message}/smu_v11_8_ppsmc.h|  0

RE: [PATCH V5 05/16] drm/amd/pm: do not expose those APIs used internally only in si_dpm.c

2021-12-21 Thread Quan, Evan
[AMD Official Use Only]



> -Original Message-
> From: Lazar, Lijo 
> Sent: Tuesday, December 21, 2021 2:17 PM
> To: Quan, Evan ; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander 
> Subject: Re: [PATCH V5 05/16] drm/amd/pm: do not expose those APIs used
> internally only in si_dpm.c
> 
> 
> 
> On 12/13/2021 9:22 AM, Evan Quan wrote:
> > Move them to si_dpm.c instead.
> >
> > Signed-off-by: Evan Quan 
> > Change-Id: I288205cfd7c6ba09cfb22626ff70360d61ff0c67
> > --
> > v1->v2:
> >- rename the API with "si_" prefix(Alex)
> > v2->v3:
> >- rename other data structures used only in si_dpm.c(Lijo)
> > v3->v4:
> >- rename Macros used only in si_dpm.c with "SI_" prefix(Lijo)
> > ---
> >   drivers/gpu/drm/amd/pm/amdgpu_dpm.c   |  25 -
> >   drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   |  25 -
> >   drivers/gpu/drm/amd/pm/powerplay/si_dpm.c | 106
> +++---
> >   drivers/gpu/drm/amd/pm/powerplay/si_dpm.h |  15 ++-
> >   4 files changed, 83 insertions(+), 88 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
> > index 08770888cabb..0f9e109941f1 100644
> > --- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
> > +++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
> > @@ -894,31 +894,6 @@ void amdgpu_add_thermal_controller(struct
> amdgpu_device *adev)
> > }
> >   }
> >
> > -enum amdgpu_pcie_gen amdgpu_get_pcie_gen_support(struct
> amdgpu_device *adev,
> > -u32 sys_mask,
> > -enum amdgpu_pcie_gen
> asic_gen,
> > -enum amdgpu_pcie_gen
> default_gen)
> > -{
> > -   switch (asic_gen) {
> > -   case AMDGPU_PCIE_GEN1:
> > -   return AMDGPU_PCIE_GEN1;
> > -   case AMDGPU_PCIE_GEN2:
> > -   return AMDGPU_PCIE_GEN2;
> > -   case AMDGPU_PCIE_GEN3:
> > -   return AMDGPU_PCIE_GEN3;
> > -   default:
> > -   if ((sys_mask & CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3)
> &&
> > -   (default_gen == AMDGPU_PCIE_GEN3))
> > -   return AMDGPU_PCIE_GEN3;
> > -   else if ((sys_mask &
> CAIL_PCIE_LINK_SPEED_SUPPORT_GEN2) &&
> > -(default_gen == AMDGPU_PCIE_GEN2))
> > -   return AMDGPU_PCIE_GEN2;
> > -   else
> > -   return AMDGPU_PCIE_GEN1;
> > -   }
> > -   return AMDGPU_PCIE_GEN1;
> > -}
> > -
> >   struct amd_vce_state*
> >   amdgpu_get_vce_clock_state(void *handle, u32 idx)
> >   {
> > diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
> > index 6681b878e75f..f43b96dfe9d8 100644
> > --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
> > +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
> > @@ -45,19 +45,6 @@ enum amdgpu_int_thermal_type {
> > THERMAL_TYPE_KV,
> >   };
> >
> > -enum amdgpu_dpm_auto_throttle_src {
> > -   AMDGPU_DPM_AUTO_THROTTLE_SRC_THERMAL,
> > -   AMDGPU_DPM_AUTO_THROTTLE_SRC_EXTERNAL
> > -};
> > -
> > -enum amdgpu_dpm_event_src {
> > -   AMDGPU_DPM_EVENT_SRC_ANALOG = 0,
> > -   AMDGPU_DPM_EVENT_SRC_EXTERNAL = 1,
> > -   AMDGPU_DPM_EVENT_SRC_DIGITAL = 2,
> > -   AMDGPU_DPM_EVENT_SRC_ANALOG_OR_EXTERNAL = 3,
> > -   AMDGPU_DPM_EVENT_SRC_DIGIAL_OR_EXTERNAL = 4
> > -};
> > -
> >   struct amdgpu_ps {
> > u32 caps; /* vbios flags */
> > u32 class; /* vbios flags */
> > @@ -252,13 +239,6 @@ struct amdgpu_dpm_fan {
> > bool ucode_fan_control;
> >   };
> >
> > -enum amdgpu_pcie_gen {
> > -   AMDGPU_PCIE_GEN1 = 0,
> > -   AMDGPU_PCIE_GEN2 = 1,
> > -   AMDGPU_PCIE_GEN3 = 2,
> > -   AMDGPU_PCIE_GEN_INVALID = 0x
> > -};
> > -
> >   #define amdgpu_dpm_reset_power_profile_state(adev, request) \
> > ((adev)->powerplay.pp_funcs->reset_power_profile_state(\
> > (adev)->powerplay.pp_handle, request))
> > @@ -403,11 +383,6 @@ void amdgpu_free_extended_power_table(struct amdgpu_device *adev);
> >
> >   void amdgpu_add_thermal_controller(struct amdgpu_device *adev);
> >
> > -enum amdgpu_pcie_gen amdgpu_get_pcie_gen_support(struct
> amdgpu_device *adev,
> > -u32 sys_mask,
> > -enum amdgpu_pcie_gen
> asic_gen,
> > -enum amdgpu_pcie_gen
> default_gen);
> > -
> >   struct amd_vce_state*
> >   amdgpu_get_vce_clock_state(void *handle, u32 idx);
> >
> > diff --git a/drivers/gpu/drm/amd/pm/powerplay/si_dpm.c b/drivers/gpu/drm/amd/pm/powerplay/si_dpm.c
> > index 81f82aa05ec2..5bd7a24b70b6 100644
> > --- a/drivers/gpu/drm/amd/pm/powerplay/si_dpm.c
> > +++ b/drivers/gpu/drm/amd/pm/powerplay/si_dpm.c
> > @@ -96,6 +96,19 @@ union pplib_clock_info {
> > struct _ATOM_PPLIB_SI_CLOCK_INFO si;
> >   };
> >
> > +enum si_dpm_auto_throttle_src {
> > +   SI_DPM_AUTO_THROTTLE_SRC_THERMAL,
> > +   SI_DPM_AUTO_THROTTLE_SRC_EXTERNAL
> > +};
> > +
> > +enum si_dpm_event_src {
> > +   

RE: [PATCH] amdgpu/pm: Modify sysfs pp_dpm_sclk to have only read permission in ONEVF mode

2021-12-21 Thread Quan, Evan
[AMD Official Use Only]



From: amd-gfx  On Behalf Of Nikolic, Marina
Sent: Tuesday, December 21, 2021 10:36 PM
To: Russell, Kent ; amd-gfx@lists.freedesktop.org
Cc: Mitrovic, Milan ; Kitchen, Greg 

Subject: Re: [PATCH] amdgpu/pm: Modify sysfs pp_dpm_sclk to have only read 
permission in ONEVF mode


[AMD Official Use Only]


[AMD Official Use Only]

From 06359f3be0c0b889519d6dd954fb11f31e9a15e0 Mon Sep 17 00:00:00 2001
From: Marina Nikolic <marina.niko...@amd.com>
Date: Tue, 14 Dec 2021 20:57:53 +0800
Subject: [PATCH] amdgpu/pm: Modify sysfs pp_dpm_sclk to have only read
 permission in ONEVF mode
[Quan, Evan] With the subject updated (remove the description about
pp_dpm_sclk), the patch is acked-by: Evan Quan 

BR
Evan
== Description ==
Setting through sysfs should not be allowed in SRIOV mode.
These calls will not be processed by FW anyway,
but error handling on sysfs level should be improved.

== Changes ==
This patch prohibits performing of all set commands
in SRIOV mode on sysfs level.
It offers better error handling as calls that are
not allowed will not be propagated further.

== Test ==
Writing to any sysfs file in passthrough mode will succeed.
Writing to any sysfs file in ONEVF mode will yield error:
"calling process does not have sufficient permission to execute a command".

Signed-off-by: Marina Nikolic <marina.niko...@amd.com>
---
 drivers/gpu/drm/amd/pm/amdgpu_pm.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
index 082539c70fd4..c43818cd02aa 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
@@ -2133,6 +2133,12 @@ static int default_attr_update(struct amdgpu_device *adev, struct amdgpu_device_
}
}

+   /* setting should not be allowed from VF */
+   if (amdgpu_sriov_vf(adev)) {
+   dev_attr->attr.mode &= ~S_IWUGO;
+   dev_attr->store = NULL;
+   }
+
 #undef DEVICE_ATTR_IS

return 0;
--
2.20.1
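For illustration of the user-visible effect: clearing S_IWUGO leaves the
affected attributes read-only in sysfs, so under SRIOV a write attempt
already fails at open() with a permission error before it ever reaches the
driver, and nulling out .store covers any path that might bypass the mode
bits.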


From: Nikolic, Marina <marina.niko...@amd.com>
Sent: Tuesday, December 21, 2021 3:15 PM
To: Russell, Kent <kent.russ...@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Mitrovic, Milan <milan.mitro...@amd.com>; Kitchen, Greg <greg.kitc...@amd.com>
Subject: Re: [PATCH] amdgpu/pm: Modify sysfs pp_dpm_sclk to have only read
permission in ONEVF mode

Hi Kent,

Thank you for the review. Yes, I can confirm I am trying to set this for every
single file in SRIOV mode.
@Kitchen, Greg required this for the ROCM-SMI 5.0 release. In case you need
it, he can provide more details.
I'm going to clarify the commit message more and send a new patch.

BR,
Marina

From: Russell, Kent <kent.russ...@amd.com>
Sent: Monday, December 20, 2021 8:01 PM
To: Nikolic, Marina <marina.niko...@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Mitrovic, Milan <milan.mitro...@amd.com>; Nikolic, Marina <marina.niko...@amd.com>; Kitchen, Greg <greg.kitc...@amd.com>
Subject: RE: [PATCH] amdgpu/pm: Modify sysfs pp_dpm_sclk to have only read
permission in ONEVF mode

[AMD Official Use Only]

> -----Original Message-----
> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of Marina Nikolic
> Sent: Monday, December 20, 2021 11:09 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Mitrovic, Milan <milan.mitro...@amd.com>; Nikolic, Marina <marina.niko...@amd.com>; Kitchen, Greg <greg.kitc...@amd.com>
> Subject: [PATCH] amdgpu/pm: Modify sysfs pp_dpm_sclk to have only read permission in
> ONEVF mode
>
> == Description ==
> Due to security reasons setting through sysfs
> should only be allowed in passthrough mode.
> Options that are not mapped as SMU messages
> do not have any mechanism to distinguish between
> passthrough, onevf and multivf use cases.
> A unified approach is needed.
>
> == Changes ==
> This patch introduces a new mechanism to distinguish
> ONEVF and PASSTHROUGH use cases on sysfs level
> and prohibit setting (writing to sysfs).
> It also applies the new mechanism to the pp_dpm_sclk sysfs file.
>
> == Test ==
> Writing to pp_dpm_sclk sysfs file in passthrough mode will succeed.
> Writing to pp_dpm_sclk sysfs file in ONEVF mode will yield error:
> "calling process does not have sufficient permission to execute a command".
> Sysfs pp_dpm_sclk will not be created in MULTIVF mode.
>
> Signed-off-by: Marina Nikolic <marina.niko...@amd.com>
> ---
>  drivers/gpu/drm/amd/pm/amdgpu_pm.c | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> index 

RE: [PATCH] drm/amdgpu: fix runpm documentation

2021-12-21 Thread Quan, Evan
[AMD Official Use Only]



> -Original Message-
> From: amd-gfx  On Behalf Of Alex
> Deucher
> Sent: Tuesday, December 21, 2021 10:59 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander 
> Subject: [PATCH] drm/amdgpu: fix runpm documentation
> 
> It's not only supported by HG/PX laptops.  It's supported
> by all dGPUs which support BOCO/BACO functionality (runtime
> D3).
> 
> BOCO - Bus Off, Chip Off.  The entire chip is powered off.
>This is controlled by ACPI.
> BACO - Bus Active, Chip Off.  The chip still shows up
>on the PCI bus, but the device itself is powered
>down.
> 
> Signed-off-by: Alex Deucher 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index a78bbea9629d..f001924ed92e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -331,9 +331,10 @@ module_param_named(aspm, amdgpu_aspm, int, 0444);
>  /**
>   * DOC: runpm (int)
>   * Override for runtime power management control for dGPUs in PX/HG
> laptops. The amdgpu driver can dynamically power down
[Quan, Evan] This ("dGPUs in PX/HG laptops") also needs to be updated. Maybe
it was missed unintentionally?
With that fixed, the patch is reviewed-by: Evan Quan 

BR
Evan
> - * the dGPU on PX/HG laptops when it is idle. The default is -1 (auto enable). Setting the value to 0 disables this functionality.
> + * the dGPUs when they are idle if supported. The default is -1 (auto enable).
> + * Setting the value to 0 disables this functionality.
>   */
> -MODULE_PARM_DESC(runpm, "PX runtime pm (2 = force enable with BAMACO, 1 = force enable with BACO, 0 = disable, -1 = PX only default)");
> +MODULE_PARM_DESC(runpm, "PX runtime pm (2 = force enable with BAMACO, 1 = force enable with BACO, 0 = disable, -1 = auto)");
>  module_param_named(runpm, amdgpu_runtime_pm, int, 0444);
> 
>  /**
> --
> 2.33.1
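For reference, the override itself is unchanged: runtime pm can still be
forced off with amdgpu.runpm=0 on the kernel command line; the fix only
makes the description match the fact that the option covers all
BACO/BOCO-capable dGPUs, not just PX/HG laptops.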


Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-21 Thread Yann Dirson



----- Original Message -----
> From: "Alex Deucher" 
> To: "Yann Dirson" 
> Cc: "Christian König" , "amd-gfx list" 
> 
> Sent: Tuesday, December 21, 2021 23:31:01
> Subject: Re: Various problems trying to vga-passthrough a Renoir iGPU to a 
> xen/qubes-os hvm
> 
> On Tue, Dec 21, 2021 at 5:12 PM Yann Dirson  wrote:
> >
> >
> > Alex wrote:
> > >
> > > On Sun, Dec 19, 2021 at 11:41 AM Yann Dirson 
> > > wrote:
> > > >
> > > > Christian wrote:
> > > > > On 19.12.21 at 17:00, Yann Dirson wrote:
> > > > > > Alex wrote:
> > > > > >> Thinking about this more, I think the problem might be
> > > > > >> related
> > > > > >> to
> > > > > >> CPU
> > > > > >> access to "VRAM".  APUs don't have dedicated VRAM, they
> > > > > >> use a
> > > > > >> reserved
> > > > > >> carve out region at the top of system memory.  For CPU
> > > > > >> access
> > > > > >> to
> > > > > >> this
> > > > > >> memory, we kmap the physical address of the carve out
> > > > > >> region
> > > > > >> of
> > > > > >> system
> > > > > >> memory.  You'll need to make sure that region is
> > > > > >> accessible to
> > > > > >> the
> > > > > >> guest.
> > > > > > So basically, the non-virt flow is is: (video?) BIOS
> > > > > > reserves
> > > > > > memory, marks it
> > > > > > as reserved in e820, stores the physaddr somewhere, which
> > > > > > the
> > > > > > GPU
> > > > > > driver gets.
> > > > > > Since I suppose this includes the framebuffer, this
> > > > > > probably
> > > > > > has to
> > > > > > occur around
> > > > > > the moment the driver calls
> > > > > > drm_aperture_remove_conflicting_pci_framebuffers()
> > > > > > (which happens before this hw init step), right ?
> > > > >
> > > > > Well, that partially correct. The efifb is using the PCIe
> > > > > resources
> > > > > to
> > > > > access the framebuffer and as far as I know we use that one
> > > > > to
> > > > > kick
> > > > > it out.
> > > > >
> > > > > The stolen memory we get over e820/registers is separate to
> > > > > that.
> >
> > How is the stolen memory communicated to the driver ?  That host
> > physical
> > memory probably has to be mapped at the same guest physical address
> > for
> > the magic to work, right ?
> 
> Correct.  The driver reads the physical location of that memory from
> hardware registers.  Removing this chunk of code from gmc_v9_0.c will
> force the driver to use the BAR, but I'm not sure if there are any
> other places in the driver that make assumptions about using the
> physical host address or not on APUs off hand.
> 
> if ((adev->flags & AMD_IS_APU) ||
> (adev->gmc.xgmi.supported &&
>  adev->gmc.xgmi.connected_to_cpu)) {
> adev->gmc.aper_base =
> adev->gfxhub.funcs->get_mc_fb_offset(adev) +
> adev->gmc.xgmi.physical_node_id *
> adev->gmc.xgmi.node_segment_size;
> adev->gmc.aper_size = adev->gmc.real_vram_size;
> }
> 
> 
> 
> >
> > > > >
> > > > > > ... which brings me to a point that's been puzzling me for
> > > > > > some
> > > > > > time, which is
> > > > > > that as the hw init fails, the efifb driver is still using
> > > > > > the
> > > > > > framebuffer.
> > > > >
> > > > > No, it isn't. You are probably just still seeing the same
> > > > > screen.
> > > > >
> > > > > The issue is most likely that while efi was kicked out nobody
> > > > > re-programmed the display hardware to show something
> > > > > different.
> > > > >
> > > > > > Am I right in suspecting that efifb should get stripped of
> > > > > > its
> > > > > > ownership of the
> > > > > > fb aperture first, and that if I don't get a black screen
> > > > > > on
> > > > > > hw_init failure
> > > > > > that issue should be the first focus point ?
> > > > >
> > > > > You assumption with the black screen is incorrect. Since the
> > > > > hardware
> > > > > works independent even if you kick out efi you still have the
> > > > > same
> > > > > screen content, you just can't update it anymore.
> > > >
> > > > It's not only that the screen keeps its contents, it's that the
> > > > dom0
> > > > happily continues updating it.
> > >
> > > If the hypevisor is using efifb, then yes that could be a problem
> > > as
> > > the hypervisor could be writing to the efifb resources which ends
> > > up
> > > writing to the same physical memory.  That applies to any GPU on
> > > a
> > > UEFI system.  You'll need to make sure efifb is not in use in the
> > > hypervisor.
> >
> > That remark evokes several things to me.  First one is that every
> > time
> > I've tried booting with efifb disabled in dom0, there was no
> > visible
> > improvements in the guest driver - i.e. I really have to dig how
> > vram mapping
> > is performed and check things are as expected anyway.
> 
> Ultimately you end up at the same physical memory.  efifb uses the
> PCI
> BAR which points to the same physical memory that the driver directly
> maps.
> 
> >
> > The other is that, when dom0 cannot use efifb, 

Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-21 Thread Alex Deucher
On Tue, Dec 21, 2021 at 5:12 PM Yann Dirson  wrote:
>
>
> Alex wrote:
> >
> > On Sun, Dec 19, 2021 at 11:41 AM Yann Dirson  wrote:
> > >
> > > Christian wrote:
> > > > On 19.12.21 at 17:00, Yann Dirson wrote:
> > > > > Alex wrote:
> > > > >> Thinking about this more, I think the problem might be related
> > > > >> to
> > > > >> CPU
> > > > >> access to "VRAM".  APUs don't have dedicated VRAM, they use a
> > > > >> reserved
> > > > >> carve out region at the top of system memory.  For CPU access
> > > > >> to
> > > > >> this
> > > > >> memory, we kmap the physical address of the carve out region
> > > > >> of
> > > > >> system
> > > > >> memory.  You'll need to make sure that region is accessible to
> > > > >> the
> > > > >> guest.
> > > > > So basically, the non-virt flow is is: (video?) BIOS reserves
> > > > > memory, marks it
> > > > > as reserved in e820, stores the physaddr somewhere, which the
> > > > > GPU
> > > > > driver gets.
> > > > > Since I suppose this includes the framebuffer, this probably
> > > > > has to
> > > > > occur around
> > > > > the moment the driver calls
> > > > > drm_aperture_remove_conflicting_pci_framebuffers()
> > > > > (which happens before this hw init step), right ?
> > > >
> > > > Well, that partially correct. The efifb is using the PCIe
> > > > resources
> > > > to
> > > > access the framebuffer and as far as I know we use that one to
> > > > kick
> > > > it out.
> > > >
> > > > The stolen memory we get over e820/registers is separate to that.
>
> How is the stolen memory communicated to the driver ?  That host physical
> memory probably has to be mapped at the same guest physical address for
> the magic to work, right ?

Correct.  The driver reads the physical location of that memory from
hardware registers.  Removing this chunk of code from gmc_v9_0.c will
force the driver to use the BAR, but I'm not sure if there are any
other places in the driver that make assumptions about using the
physical host address or not on APUs off hand.

if ((adev->flags & AMD_IS_APU) ||
(adev->gmc.xgmi.supported &&
 adev->gmc.xgmi.connected_to_cpu)) {
adev->gmc.aper_base =
adev->gfxhub.funcs->get_mc_fb_offset(adev) +
adev->gmc.xgmi.physical_node_id *
adev->gmc.xgmi.node_segment_size;
adev->gmc.aper_size = adev->gmc.real_vram_size;
}



>
> > > >
> > > > > ... which brings me to a point that's been puzzling me for some
> > > > > time, which is
> > > > > that as the hw init fails, the efifb driver is still using the
> > > > > framebuffer.
> > > >
> > > > No, it isn't. You are probably just still seeing the same screen.
> > > >
> > > > The issue is most likely that while efi was kicked out nobody
> > > > re-programmed the display hardware to show something different.
> > > >
> > > > > Am I right in suspecting that efifb should get stripped of its
> > > > > ownership of the
> > > > > fb aperture first, and that if I don't get a black screen on
> > > > > hw_init failure
> > > > > that issue should be the first focus point ?
> > > >
> > > > You assumption with the black screen is incorrect. Since the
> > > > hardware
> > > > works independent even if you kick out efi you still have the
> > > > same
> > > > screen content, you just can't update it anymore.
> > >
> > > It's not only that the screen keeps its contents, it's that the
> > > dom0
> > > happily continues updating it.
> >
> > If the hypevisor is using efifb, then yes that could be a problem as
> > the hypervisor could be writing to the efifb resources which ends up
> > writing to the same physical memory.  That applies to any GPU on a
> > UEFI system.  You'll need to make sure efifb is not in use in the
> > hypervisor.
>
> That remark evokes several things to me.  First one is that every time
> I've tried booting with efifb disabled in dom0, there was no visible
> improvements in the guest driver - i.e. I really have to dig how vram mapping
> is performed and check things are as expected anyway.

Ultimately you end up at the same physical memory.  efifb uses the PCI
BAR which points to the same physical memory that the driver directly
maps.

>
> The other is that, when dom0 cannot use efifb, entering a luks key is
> suddenly less user-friendly.  But in theory I'd think we could overcome
> this by letting dom0 use efifb until ready to start the guest, a simple
> driver unbind at the right moment should be expected to work, right ?
> Going further and allowing the guest to use efifb on its own could
> possibly be more tricky (starting with a different state?) but does
> not seem to sound completely outlandish either - or does it ?
>

efifb just takes whatever hardware state the GOP driver in the pre-OS
environment left the GPU in.  Once you have a driver loaded in the OS,
that state is gone so I don't see much value in using efifb once you
have a real driver in the mix.  If you want a console 

Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."

2021-12-21 Thread Alex Deucher
On Tue, Dec 21, 2021 at 5:09 PM Harry Wentland  wrote:
>
>
>
> On 2021-12-21 16:18, Alex Deucher wrote:
> > On Tue, Dec 21, 2021 at 1:47 PM Deucher, Alexander
> >  wrote:
> >>
> >> [Public]
> >>
> >>> -Original Message-
> >>> From: Deucher, Alexander
> >>> Sent: Tuesday, December 21, 2021 12:01 PM
> >>> To: Linus Torvalds ; Imre Deak
> >>> ; amd-gfx@lists.freedesktop.org
> >>> Cc: Daniel Vetter ; Kai-Heng Feng
> >>> 
> >>> Subject: RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
> >>> PCI device ..."
> >>>
> >>> [Public]
> >>>
>  -Original Message-
>  From: Linus Torvalds 
>  Sent: Monday, December 20, 2021 5:05 PM
>  To: Imre Deak 
>  Cc: Daniel Vetter ; Deucher, Alexander
>  ; Kai-Heng Feng
>  
>  Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb:
>  Release PCI device ..."
> 
>  On Mon, Dec 20, 2021 at 1:33 PM Imre Deak 
> >>> wrote:
> >
> > amdgpu.runpm=0
> 
>  Hmmm.
> 
>  This does seem to "work", but not very well.
> 
>  With this, what seems to happen is odd: I lock the screen, wait, it
>  goes "No signal, shutting down", but then doesn't actually shut down
>  but stays black (with the backlight on). After _another_ five seconds
>  or so, the monitor goes "No signal, shutting down" _again_, and at that
> >>> point it actually does it.
> 
>  So it solves my immediate problem - in that yes, the backlight finally
>  does turn off in the end - but it does seem to be still broken.
> 
>  I'm very surprised if no AMD drm developers can see this exact same 
>  thing.
>  This is a very simple setup. The only possibly slightly less common
>  thing is that I have two monitors, but while that is not necessarily
>  the _most_ common setup in an absolute sense, I'd expect it to be very
>  common among DRM developers..
> 
>  I guess I can just change the revert to just a
> 
>  -int amdgpu_runtime_pm = -1;
>  +int amdgpu_runtime_pm = 0;
> 
>  instead. The auto-detect is apparently broken. Maybe it should only
>  kick in for LVDS screens on actual laptops?
> 
>  Note: on my machine, I get that
> 
> amdgpu :49:00.0: amdgpu: Using BACO for runtime pm
> 
>  so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
>  and it's only that BACO case that is broken.
> 
>  I have no idea what any of those three things are - I'm just looking
>  at the uses of that amdgpu_runtime_pm variable.
> 
>  amdgpu people: if you don't want that amdgpu_runtime_pm turned off by
>  default, tell me something else to try.
> >>>
> >>> For a little background, runtime PM support was added about 10 year ago
> >>> originally to support laptops with multiple GPUs (integrated and 
> >>> discrete).
> >>> It's not specific to the display hardware.  When the GPU is idle, it can 
> >>> be
> >>> powered down completely.  In the case of these laptops, it's D3 cold
> >>> (managed by ACPI, we call this BOCO in AMD parlance - Bus Off, Chip Off)
> >>> which powers off the dGPU completely (i.e., it disappears from the bus).  
> >>> A
> >>> few years ago we extended this to support desktop dGPUs as well which
> >>> support their own version of runtime D3 (called BACO in AMD parlance - Bus
> >>> Active, Chip Off).  The driver can put the chip into a low power state 
> >>> where
> >>> everything except the bus interface is powered down (to avoid the device
> >>> disappearing from the bus).  So this has worked for almost 2 years now on
> >>> BACO capable parts and for a decade or more on BOCO systems.
> >>> Unfortunately, changing the default runpm parameter setting would cause a
> >>> flood of bug reports about runtime power management breaking and
> >>> suddenly systems are using more power.
> >>>
> >>> Imre's commit (55285e21f045) fixes another commit (a6c0fd3d5a8b).
> >>> Runtime pm was working on amdgpu prior to that commit.  Is it possible
> >>> there is still some race between when amdgpu takes over from efifb?  Does
> >>> it work properly when all pm_runtime calls in efifb are removed or if 
> >>> efifb is
> >>> not enabled?  Runtime pm for Polaris boards has been enabled by default
> >>> since 4fdda2e66de0b which predates both of those patches.
> >>
> >> Thinking about this more, I wonder if there was some change in some 
> >> userspace component which was hidden by the changes in 55285e21f045 and 
> >> a6c0fd3d5a8b.  E.g., some desktop component started polling for display 
> >> changes or GPU temperature or something like that and when a6c0fd3d5a8b 
> >> was in place the GPU never entered runtime suspend.  Then when 
> >> 55285e21f045 was applied, it unmasked the new behavior in the userpace 
> >> component.
> >>
> >> What should happen is that when all of the displays blank, assuming the 
> >> GPU is otherwise idle, the GPU will runtime suspend after  seconds.  

Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-21 Thread Yann Dirson


Alex wrote:
> 
> On Sun, Dec 19, 2021 at 11:41 AM Yann Dirson  wrote:
> >
> > Christian wrote:
> > > On 19.12.21 at 17:00, Yann Dirson wrote:
> > > > Alex wrote:
> > > >> Thinking about this more, I think the problem might be related
> > > >> to
> > > >> CPU
> > > >> access to "VRAM".  APUs don't have dedicated VRAM, they use a
> > > >> reserved
> > > >> carve out region at the top of system memory.  For CPU access
> > > >> to
> > > >> this
> > > >> memory, we kmap the physical address of the carve out region
> > > >> of
> > > >> system
> > > >> memory.  You'll need to make sure that region is accessible to
> > > >> the
> > > >> guest.
> > > > So basically, the non-virt flow is is: (video?) BIOS reserves
> > > > memory, marks it
> > > > as reserved in e820, stores the physaddr somewhere, which the
> > > > GPU
> > > > driver gets.
> > > > Since I suppose this includes the framebuffer, this probably
> > > > has to
> > > > occur around
> > > > the moment the driver calls
> > > > drm_aperture_remove_conflicting_pci_framebuffers()
> > > > (which happens before this hw init step), right ?
> > >
> > > Well, that partially correct. The efifb is using the PCIe
> > > resources
> > > to
> > > access the framebuffer and as far as I know we use that one to
> > > kick
> > > it out.
> > >
> > > The stolen memory we get over e820/registers is separate to that.

How is the stolen memory communicated to the driver ?  That host physical
memory probably has to be mapped at the same guest physical address for
the magic to work, right ?

> > >
> > > > ... which brings me to a point that's been puzzling me for some
> > > > time, which is
> > > > that as the hw init fails, the efifb driver is still using the
> > > > framebuffer.
> > >
> > > No, it isn't. You are probably just still seeing the same screen.
> > >
> > > The issue is most likely that while efi was kicked out nobody
> > > re-programmed the display hardware to show something different.
> > >
> > > > Am I right in suspecting that efifb should get stripped of its
> > > > ownership of the
> > > > fb aperture first, and that if I don't get a black screen on
> > > > hw_init failure
> > > > that issue should be the first focus point ?
> > >
> > > You assumption with the black screen is incorrect. Since the
> > > hardware
> > > works independent even if you kick out efi you still have the
> > > same
> > > screen content, you just can't update it anymore.
> >
> > It's not only that the screen keeps its contents, it's that the
> > dom0
> > happily continues updating it.
> 
> If the hypevisor is using efifb, then yes that could be a problem as
> the hypervisor could be writing to the efifb resources which ends up
> writing to the same physical memory.  That applies to any GPU on a
> UEFI system.  You'll need to make sure efifb is not in use in the
> hypervisor.

That remark evokes several things to me.  First one is that every time
I've tried booting with efifb disabled in dom0, there was no visible
improvements in the guest driver - i.e. I really have to dig how vram mapping
is performed and check things are as expected anyway.

The other is that, when dom0 cannot use efifb, entering a luks key is
suddenly less user-friendly.  But in theory I'd think we could overcome
this by letting dom0 use efifb until ready to start the guest, a simple
driver unbind at the right moment should be expected to work, right ?
Going further and allowing the guest to use efifb on its own could
possibly be more tricky (starting with a different state?) but does
not seem to sound completely outlandish either - or does it ?

> 
> Alex
> 
> 
> >
> > > But putting efi asside what Alex pointed out pretty much breaks
> > > your
> > > neck trying to forward the device. You maybe could try to hack
> > > the
> > > driver to use the PCIe BAR for framebuffer access, but that might
> > > be
> > > quite a bit slower.
> > >
> > > Regards,
> > > Christian.
> > >
> > > >
> > > >> Alex
> > > >>
> > > >> On Mon, Dec 13, 2021 at 3:29 PM Alex Deucher
> > > >> 
> > > >> wrote:
> > > >>> On Sun, Dec 12, 2021 at 5:19 PM Yann Dirson 
> > > >>> wrote:
> > >  Alex wrote:
> > > > On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson
> > > > 
> > > > wrote:
> > > >> Hi Alex,
> > > >>
> > > >>> We have not validated virtualization of our integrated
> > > >>> GPUs.  I
> > > >>> don't
> > > >>> know that it will work at all.  We had done a bit of
> > > >>> testing but
> > > >>> ran
> > > >>> into the same issues with the PSP, but never had a chance
> > > >>> to
> > > >>> debug
> > > >>> further because this feature is not productized.
> > > >> ...
> > > >>> You need a functional PSP to get the GPU driver up and
> > > >>> running.
> > > >> Ah, thanks for the hint :)
> > > >>
> > > >> I guess that if I want to have any chance to get the PSP
> > > >> working
> > > >> I'm
> > > >> going to need more details on it.  A quick search some
> > > 

Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."

2021-12-21 Thread Harry Wentland



On 2021-12-21 16:18, Alex Deucher wrote:
> On Tue, Dec 21, 2021 at 1:47 PM Deucher, Alexander
>  wrote:
>>
>> [Public]
>>
>>> -Original Message-
>>> From: Deucher, Alexander
>>> Sent: Tuesday, December 21, 2021 12:01 PM
>>> To: Linus Torvalds ; Imre Deak
>>> ; amd-gfx@lists.freedesktop.org
>>> Cc: Daniel Vetter ; Kai-Heng Feng
>>> 
>>> Subject: RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
>>> PCI device ..."
>>>
>>> [Public]
>>>
 -Original Message-
 From: Linus Torvalds 
 Sent: Monday, December 20, 2021 5:05 PM
 To: Imre Deak 
 Cc: Daniel Vetter ; Deucher, Alexander
 ; Kai-Heng Feng
 
 Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb:
 Release PCI device ..."

 On Mon, Dec 20, 2021 at 1:33 PM Imre Deak 
>>> wrote:
>
> amdgpu.runpm=0

 Hmmm.

 This does seem to "work", but not very well.

 With this, what seems to happen is odd: I lock the screen, wait, it
 goes "No signal, shutting down", but then doesn't actually shut down
 but stays black (with the backlight on). After _another_ five seconds
 or so, the monitor goes "No signal, shutting down" _again_, and at that
>>> point it actually does it.

 So it solves my immediate problem - in that yes, the backlight finally
 does turn off in the end - but it does seem to be still broken.

 I'm very surprised if no AMD drm developers can see this exact same thing.
 This is a very simple setup. The only possibly slightly less common
 thing is that I have two monitors, but while that is not necessarily
 the _most_ common setup in an absolute sense, I'd expect it to be very
 common among DRM developers..

 I guess I can just change the revert to just a

 -int amdgpu_runtime_pm = -1;
 +int amdgpu_runtime_pm = 0;

 instead. The auto-detect is apparently broken. Maybe it should only
 kick in for LVDS screens on actual laptops?

 Note: on my machine, I get that

amdgpu :49:00.0: amdgpu: Using BACO for runtime pm

 so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
 and it's only that BACO case that is broken.

 I have no idea what any of those three things are - I'm just looking
 at the uses of that amdgpu_runtime_pm variable.

 amdgpu people: if you don't want that amdgpu_runtime_pm turned off by
 default, tell me something else to try.
>>>
>>> For a little background, runtime PM support was added about 10 years ago
>>> originally to support laptops with multiple GPUs (integrated and discrete).
>>> It's not specific to the display hardware.  When the GPU is idle, it can be
>>> powered down completely.  In the case of these laptops, it's D3 cold
>>> (managed by ACPI, we call this BOCO in AMD parlance - Bus Off, Chip Off)
>>> which powers off the dGPU completely (i.e., it disappears from the bus).  A
>>> few years ago we extended this to support desktop dGPUs as well which
>>> support their own version of runtime D3 (called BACO in AMD parlance - Bus
>>> Active, Chip Off).  The driver can put the chip into a low power state where
>>> everything except the bus interface is powered down (to avoid the device
>>> disappearing from the bus).  So this has worked for almost 2 years now on
>>> BACO capable parts and for a decade or more on BOCO systems.
>>> Unfortunately, changing the default runpm parameter setting would cause a
>>> flood of bug reports about runtime power management breaking and
>>> suddenly systems are using more power.
>>>
>>> Imre's commit (55285e21f045) fixes another commit (a6c0fd3d5a8b).
>>> Runtime pm was working on amdgpu prior to that commit.  Is it possible
>>> there is still some race between when amdgpu takes over from efifb?  Does
>>> it work properly when all pm_runtime calls in efifb are removed or if efifb
>>> is not enabled?  Runtime pm for Polaris boards has been enabled by default
>>> since 4fdda2e66de0b which predates both of those patches.
>>
>> Thinking about this more, I wonder if there was some change in some 
>> userspace component which was hidden by the changes in 55285e21f045 and 
>> a6c0fd3d5a8b.  E.g., some desktop component started polling for display 
>> changes or GPU temperature or something like that and when a6c0fd3d5a8b was 
>> in place the GPU never entered runtime suspend.  Then when 55285e21f045 was 
>> applied, it unmasked the new behavior in the userpace component.
>>
>> What should happen is that when all of the displays blank, assuming the GPU 
>> is otherwise idle, the GPU will runtime suspend after  seconds.  When you 
>> move the mouse or hit the keyboard, that should trigger the GPU to
>> runtime resume and then the displays will be re-enabled.
>>
>> In the behavior you are seeing, when the displays come back on after they 
>> blank are you seeing the device resume from runtime suspend?  On resume from 
>> 

Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."

2021-12-21 Thread Alex Deucher
On Tue, Dec 21, 2021 at 1:47 PM Deucher, Alexander
 wrote:
>
> [Public]
>
> > -Original Message-
> > From: Deucher, Alexander
> > Sent: Tuesday, December 21, 2021 12:01 PM
> > To: Linus Torvalds ; Imre Deak
> > ; amd-gfx@lists.freedesktop.org
> > Cc: Daniel Vetter ; Kai-Heng Feng
> > 
> > Subject: RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
> > PCI device ..."
> >
> > [Public]
> >
> > > -Original Message-
> > > From: Linus Torvalds 
> > > Sent: Monday, December 20, 2021 5:05 PM
> > > To: Imre Deak 
> > > Cc: Daniel Vetter ; Deucher, Alexander
> > > ; Kai-Heng Feng
> > > 
> > > Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb:
> > > Release PCI device ..."
> > >
> > > On Mon, Dec 20, 2021 at 1:33 PM Imre Deak 
> > wrote:
> > > >
> > > > amdgpu.runpm=0
> > >
> > > Hmmm.
> > >
> > > This does seem to "work", but not very well.
> > >
> > > With this, what seems to happen is odd: I lock the screen, wait, it
> > > goes "No signal, shutting down", but then doesn't actually shut down
> > > but stays black (with the backlight on). After _another_ five seconds
> > > or so, the monitor goes "No signal, shutting down" _again_, and at that
> > point it actually does it.
> > >
> > > So it solves my immediate problem - in that yes, the backlight finally
> > > does turn off in the end - but it does seem to be still broken.
> > >
> > > I'm very surprised if no AMD drm developers can see this exact same thing.
> > > This is a very simple setup. The only possibly slightly less common
> > > thing is that I have two monitors, but while that is not necessarily
> > > the _most_ common setup in an absolute sense, I'd expect it to be very
> > > common among DRM developers..
> > >
> > > I guess I can just change the revert to just a
> > >
> > > -int amdgpu_runtime_pm = -1;
> > > +int amdgpu_runtime_pm = 0;
> > >
> > > instead. The auto-detect is apparently broken. Maybe it should only
> > > kick in for LVDS screens on actual laptops?
> > >
> > > Note: on my machine, I get that
> > >
> > >amdgpu :49:00.0: amdgpu: Using BACO for runtime pm
> > >
> > > so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
> > > and it's only that BACO case that is broken.
> > >
> > > I have no idea what any of those three things are - I'm just looking
> > > at the uses of that amdgpu_runtime_pm variable.
> > >
> > > amdgpu people: if you don't want that amdgpu_runtime_pm turned off by
> > > default, tell me something else to try.
> >
> > For a little background, runtime PM support was added about 10 years ago
> > originally to support laptops with multiple GPUs (integrated and discrete).
> > It's not specific to the display hardware.  When the GPU is idle, it can be
> > powered down completely.  In the case of these laptops, it's D3 cold
> > (managed by ACPI, we call this BOCO in AMD parlance - Bus Off, Chip Off)
> > which powers off the dGPU completely (i.e., it disappears from the bus).  A
> > few years ago we extended this to support desktop dGPUs as well which
> > support their own version of runtime D3 (called BACO in AMD parlance - Bus
> > Active, Chip Off).  The driver can put the chip into a low power state where
> > everything except the bus interface is powered down (to avoid the device
> > disappearing from the bus).  So this has worked for almost 2 years now on
> > BACO capable parts and for a decade or more on BOCO systems.
> > Unfortunately, changing the default runpm parameter setting would cause a
> > flood of bug reports about runtime power management breaking and
> > suddenly systems are using more power.
> >
> > Imre's commit (55285e21f045) fixes another commit (a6c0fd3d5a8b).
> > Runtime pm was working on amdgpu prior to that commit.  Is it possible
> > there is still some race between when amdgpu takes over from efifb?  Does
> > it work properly when all pm_runtime calls in efifb are removed or if efifb
> > is not enabled?  Runtime pm for Polaris boards has been enabled by default
> > since 4fdda2e66de0b which predates both of those patches.
>
> Thinking about this more, I wonder if there was some change in some userspace 
> component which was hidden by the changes in 55285e21f045 and a6c0fd3d5a8b.  
> E.g., some desktop component started polling for display changes or GPU 
> temperature or something like that and when a6c0fd3d5a8b was in place the GPU 
> never entered runtime suspend.  Then when 55285e21f045 was applied, it 
> unmasked the new behavior in the userpace component.
>
> What should happen is that when all of the displays blank, assuming the GPU 
> is otherwise idle, the GPU will runtime suspend after  seconds.  When you 
> move the mouse or hit the keyboard, that should trigger the GPU to
> runtime resume and then the displays will be re-enabled.
>
> In the behavior you are seeing, when the displays come back on after they 
> blank are you seeing the device resume from runtime suspend?  On resume from 
> suspend (runtime 

Re: [PATCH 10/19] drm/amd/display: Changed pipe split policy to allow for multi-display pipe split

2021-12-21 Thread Alex Deucher
On Fri, Dec 17, 2021 at 4:51 PM Rodrigo Siqueira Jordao
 wrote:
>
>
>
> On 2021-12-17 4:36 p.m., Deucher, Alexander wrote:
> > [AMD Official Use Only]
> >
> >
> > Maybe add Bug links for:
> > https://gitlab.freedesktop.org/drm/amd/-/issues/1522
> > 
> > https://gitlab.freedesktop.org/drm/amd/-/issues/1709
> > 
> > https://gitlab.freedesktop.org/drm/amd/-/issues/1655
> > 
> > https://gitlab.freedesktop.org/drm/amd/-/issues/1403
> > 
>
> Sure, I'll update the commit message before apply this patch.
>
> Thanks.
>
> >
> >
> >
> > 
> > *From:* amd-gfx  on behalf of
> > Rodrigo Siqueira 
> > *Sent:* Friday, December 17, 2021 4:23 PM
> > *To:* amd-gfx@lists.freedesktop.org 
> > *Cc:* Wang, Chao-kai (Stylon) ; Cyr, Aric
> > ; Li, Sun peng (Leo) ; Wentland,
> > Harry ; Zhuo, Qingqing (Lillian)
> > ; Siqueira, Rodrigo ;
> > Li, Roman ; Chiu, Solomon ;
> > Pillai, Aurabindo ; Wang, Angus
> > ; Lin, Wayne ; Lipski, Mikita
> > ; Lakha, Bhawanpreet ;
> > Gutierrez, Agustin ; Kotarac, Pavle
> > 
> > *Subject:* [PATCH 10/19] drm/amd/display: Changed pipe split policy to
> > allow for multi-display pipe split
> > From: Angus Wang 
> >
> > [WHY]
> > Current implementation of pipe split policy prevents pipe split with
> > multiple displays connected, which caused the MCLK speed to be stuck at
> > max
> >
> > [HOW]
> > Changed the pipe split policies so that pipe split is allowed for
> > multi-display configurations
> >
> > Reviewed-by: Aric Cyr 
> > Acked-by: Rodrigo Siqueira 
> > Signed-off-by: Angus Wang 
> > ---
> >   drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c   | 2 +-
> >   drivers/gpu/drm/amd/display/dc/dcn201/dcn201_resource.c | 2 +-
> >   drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c   | 2 +-
> >   drivers/gpu/drm/amd/display/dc/dcn30/dcn30_resource.c   | 2 +-
> >   drivers/gpu/drm/amd/display/dc/dcn301/dcn301_resource.c | 2 +-
> >   drivers/gpu/drm/amd/display/dc/dcn302/dcn302_resource.c | 2 +-
> >   drivers/gpu/drm/amd/display/dc/dcn303/dcn303_resource.c | 2 +-
> >   drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c   | 2 +-

Also, it looks like dcn10_resource.c was missed.  Was that intentional?

Alex


> >   8 files changed, 8 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c
> > b/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c
> > index 2a72517e2b28..2bc93df023ad 100644
> > --- a/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c
> > +++ b/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c
> > @@ -1069,7 +1069,7 @@ static const struct dc_debug_options
> > debug_defaults_drv = {
> >   .timing_trace = false,
> >   .clock_trace = true,
> >   .disable_pplib_clock_request = true,
> > -   .pipe_split_policy = MPC_SPLIT_AVOID_MULT_DISP,
> > +   .pipe_split_policy = MPC_SPLIT_DYNAMIC,
> >   .force_single_disp_pipe_split = false,
> >   .disable_dcc = DCC_ENABLE,
> >   .vsr_support = true,
> > diff --git a/drivers/gpu/drm/amd/display/dc/dcn201/dcn201_resource.c
> > b/drivers/gpu/drm/amd/display/dc/dcn201/dcn201_resource.c
> > index d6acf9a8590a..0bb7d3dd53fa 100644
> > --- a/drivers/gpu/drm/amd/display/dc/dcn201/dcn201_resource.c
> > +++ b/drivers/gpu/drm/amd/display/dc/dcn201/dcn201_resource.c
> > @@ -603,7 +603,7 @@ static const struct dc_debug_options
> > debug_defaults_drv = {
> >   .timing_trace = false,
> >   .clock_trace = true,
> >   .disable_pplib_clock_request = true,
> > -   .pipe_split_policy = MPC_SPLIT_AVOID,
> > +   .pipe_split_policy = MPC_SPLIT_DYNAMIC,
> >   .force_single_disp_pipe_split = false,
> >   .disable_dcc = DCC_ENABLE,
> >   .vsr_support = true,
> > diff --git a/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c
> > b/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c
> > index ca1bbc942fd4..e5cc6bf45743 100644
> > --- a/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c
> > +++ b/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c
> > @@ -873,7 +873,7 @@ static const struct dc_debug_options
> > debug_defaults_drv = {
> >   .clock_trace = true,
> >   .disable_pplib_clock_request = true,
> >   .min_disp_clk_khz = 10,
> > -   .pipe_split_policy = MPC_SPLIT_AVOID_MULT_DISP,
> > +   .pipe_split_policy = MPC_SPLIT_DYNAMIC,
> >   .force_single_disp_pipe_split = false,
> >   .disable_dcc = DCC_ENABLE,
> >   .vsr_support = true,
> > diff --git 

Re: Re: Re: Re: Re: Re: [PATCH] drm/amdgpu: fixup bad vram size on gmc v8

2021-12-21 Thread Alex Deucher
Yes, you can either do that, or if amdgpu is loaded, just read the data
from /sys/kernel/debug/dri/0/amdgpu_vbios

Alex


On Mon, Dec 20, 2021 at 3:06 AM 周宗敏  wrote:

>
>
> Dear Alex:
>
>
> I've never tried to get a VBIOS before, so can you tell me how to get a
> vbios image copy for you?
>
> I tried googling, and it seems it may be obtained the following way:
>
> echo 1 > /sys/devices/pci0000:00/0000:00:02.0/rom
>
> cat /sys/devices/pci0000:00/0000:00:02.0/rom > vbios.dump
>
> echo 0 > /sys/devices/pci0000:00/0000:00:02.0/rom
>
>
> Is that right?
>
>
> Thanks very much.
>
>
> 
>
>
>
>
>
>
> *Subject:* Re: Re: Re: Re: Re: [PATCH] drm/amdgpu: fixup bad vram size on gmc
> v8
> *Date:* 2021-12-18 05:19
> *From:* Alex Deucher
> *To:* 周宗敏
>
>
> If you could get me a copy of the vbios image from a problematic board,
> that would be helpful.  In the meantime, I've applied the patch.
>
> Alex
>
>
> On Thu, Dec 16, 2021 at 9:38 PM 周宗敏  wrote:
>
>> Dear Alex:
>>
>>
>> > Is the issue reproducible with the same board in bare metal on x86? Or
>> > does it only happen with passthrough on ARM?
>>
>>
>> Unfortunately, my current environment is not convenient for testing this GPU
>> board on an x86 platform.
>>
>> But I can tell you the problem still occurs on ARM without passthrough to a
>> virtual machine.
>>
>>
>> In addition, at the end of 2020, my colleagues also found similar problems on
>> MIPS platforms with Radeon R7 340 graphics chips.
>>
>> So I think it can happen regardless of whether it is based on x86, ARM or MIPS.
>>
>>
>> I hope the above information is helpful to you, and I also think it will
>> be better for users if we can root cause this issue.
>>
>>
>> Best regards.
>>
>>
>>
>>
>> 
>>
>>
>>
>>
>>
>>
>> *Subject:* Re: Re: Re: Re: [PATCH] drm/amdgpu: fixup bad vram size on gmc v8
>>
>> *Date:* 2021-12-16 23:28
>> *From:* Alex Deucher
>> *To:* 周宗敏
>>
>>
>> Is the issue reproducible with the same board in bare metal on x86?  Or
>> does it only happen with passthrough on ARM?  Looking through the archives,
>> the SI patch I made was for an x86 laptop.  It would be nice to root cause
>> this, but there weren't any gfx8 boards with more than 64G of vram, so I
>> think it's safe.  That said, if you see similar issues with newer gfx IPs
>> then we have an issue since the upper bit will be meaningful, so it would
>> be nice to root cause this.
>>
>> Alex
>>
>>
>> On Thu, Dec 16, 2021 at 4:36 AM 周宗敏  wrote:
>>
>>> Hi  Christian,
>>>
>>>
>>> I'm testing the GPU passthrough feature, so I pass this GPU through to a
>>> virtual machine to use. It is based on an arm64 system.
>>>
>>> As far as I know, Alex has dealt with a similar problem in
>>> dri/radeon/si.c.  Maybe it has the same root cause?
>>>
>>> the history commit message is below:
>>>
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0ca223b029a261e82fb2f50c52eb85d510f4260e
>>>
>>> [image: image.png]
>>>
>>>
>>> Thanks very much.
>>>
>>>
>>>
>>> 
>>>
>>>
>>>
>>> *Subject:* Re: Re: Re: [PATCH] drm/amdgpu: fixup bad vram size on gmc v8
>>>
>>> *Date:* 2021-12-16 16:15
>>> *From:* Christian König
>>> *To:* 周宗敏; Alex Deucher
>>>
>>>
>>>
>>>
>>> Hi Zongmin,
>>>
>>>that strongly sounds like the ASIC is not correctly initialized when
>>>trying to read the register.
>>>
>>>What board and environment are you using this GPU with? Is that a
>>>  normal x86 system?
>>>
>>>Regards,
>>>Christian.
>>>
>>>
>>>
>>> On 16.12.21 04:11, 周宗敏 wrote:
>>>
>>>
>>>
>>> 1. The problematic board that I have tested is an [AMD/ATI] Lexa
>>>    PRO [Radeon RX 550/550X]; the vbios version is 113-RXF9310-C09-BT.
>>>
>>> 2. When an exception occurs I can see the following changes in the
>>>    values of the vram size read from RREG32(mmCONFIG_MEMSIZE); it
>>>    seems to have garbage in the upper 16 bits.
>>>
>>>    [image: image.png]
>>>
>>> 3. And then I can also see some dmesg like below.
>>>
>>>    When the vram size register has garbage, we may see an error
>>>    message like:
>>>
>>>    amdgpu :09:00.0: VRAM: 4286582784M 0x00F4 -
>>>    0x000FF8F4 (4286582784M used)
>>>
>>>    The correct message should look like:
>>>
>>>    amdgpu :09:00.0: VRAM: 4096M 0x00F4 -
>>>    0x00F4 (4096M used)
>>>
>>> If you have any problems, please send me mail.
>>>
>>> Thanks very much.
>>>
>>>
>>>
>>>
>>> 
>>>
>>> *Subject:* Re: [PATCH] drm/amdgpu: fixup bad vram size on gmc v8
>>>
>>> *Date:* 2021-12-16 04:23
>>> *From:* Alex Deucher
>>> *To:* Zongmin Zhou
>>>
>>>
>>>
>>>
>>> On Wed, Dec 15, 2021 at 10:31 AM Zongmin Zhou wrote:
>>>  >
>>>  > Some boards (like RX550) seem to have garbage in the upper
>>>  > 16 bits of the vram size register.  Check for
>>>  > this and clamp the size properly.  Fixes
>>>  > boards reporting bogus amounts of vram.
>>>  >
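
The clamp that commit message describes would presumably look something
like the sketch below. This is a hypothetical illustration built only from
this thread (gmc v8 reports the VRAM size in MB via mmCONFIG_MEMSIZE, and
per Alex no gfx8 board has more than 64G, so a non-zero upper half must be
garbage); the helper name is made up:

/* Hypothetical sketch of the described clamp; not the actual patch.
 * CONFIG_MEMSIZE reports the VRAM size in MB, so a sane gfx8 value
 * always fits in 16 bits (max 65535 MB, i.e. 64G).
 */
static u32 gmc_v8_clamp_vram_size_mb(u32 config_memsize)
{
	/* garbage upper bits, e.g. 4286582784M = 0xFF801000 on some RX 550s */
	if (config_memsize & 0xffff0000)
		config_memsize &= 0xffff;
	return config_memsize;
}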
>>>

Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-21 Thread Alex Deucher
On Sun, Dec 19, 2021 at 11:41 AM Yann Dirson  wrote:
>
> Christian wrote:
> > Am 19.12.21 um 17:00 schrieb Yann Dirson:
> > > Alex wrote:
> > >> Thinking about this more, I think the problem might be related to
> > >> CPU
> > >> access to "VRAM".  APUs don't have dedicated VRAM, they use a
> > >> reserved
> > >> carve out region at the top of system memory.  For CPU access to
> > >> this
> > >> memory, we kmap the physical address of the carve out region of
> > >> system
> > >> memory.  You'll need to make sure that region is accessible to the
> > >> guest.
> > > So basically, the non-virt flow is is: (video?) BIOS reserves
> > > memory, marks it
> > > as reserved in e820, stores the physaddr somewhere, which the GPU
> > > driver gets.
> > > Since I suppose this includes the framebuffer, this probably has to
> > > occur around
> > > the moment the driver calls
> > > drm_aperture_remove_conflicting_pci_framebuffers()
> > > (which happens before this hw init step), right ?
> >
> > Well, that's partially correct. The efifb is using the PCIe resources
> > to
> > access the framebuffer and as far as I know we use that one to kick
> > it out.
> >
> > The stolen memory we get over e820/registers is separate to that.
> >
> > > ... which brings me to a point that's been puzzling me for some
> > > time, which is
> > > that as the hw init fails, the efifb driver is still using the
> > > framebuffer.
> >
> > No, it isn't. You are probably just still seeing the same screen.
> >
> > The issue is most likely that while efi was kicked out nobody
> > re-programmed the display hardware to show something different.
> >
> > > Am I right in suspecting that efifb should get stripped of its
> > > ownership of the
> > > fb aperture first, and that if I don't get a black screen on
> > > hw_init failure
> > > that issue should be the first focus point ?
> >
> > Your assumption about the black screen is incorrect. Since the hardware
> > works independent even if you kick out efi you still have the same
> > screen content, you just can't update it anymore.
>
> It's not only that the screen keeps its contents, it's that the dom0
> happily continues updating it.

If the hypervisor is using efifb, then yes that could be a problem as
the hypervisor could be writing to the efifb resources which ends up
writing to the same physical memory.  That applies to any GPU on a
UEFI system.  You'll need to make sure efifb is not in use in the
hypervisor.

Alex


>
> > But putting efi aside what Alex pointed out pretty much breaks your
> > neck trying to forward the device. You maybe could try to hack the
> > driver to use the PCIe BAR for framebuffer access, but that might be
> > quite a bit slower.
> >
> > Regards,
> > Christian.
> >
> > >
> > >> Alex
> > >>
> > >> On Mon, Dec 13, 2021 at 3:29 PM Alex Deucher
> > >> 
> > >> wrote:
> > >>> On Sun, Dec 12, 2021 at 5:19 PM Yann Dirson 
> > >>> wrote:
> >  Alex wrote:
> > > On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson 
> > > wrote:
> > >> Hi Alex,
> > >>
> > >>> We have not validated virtualization of our integrated
> > >>> GPUs.  I
> > >>> don't
> > >>> know that it will work at all.  We had done a bit of
> > >>> testing but
> > >>> ran
> > >>> into the same issues with the PSP, but never had a chance
> > >>> to
> > >>> debug
> > >>> further because this feature is not productized.
> > >> ...
> > >>> You need a functional PSP to get the GPU driver up and
> > >>> running.
> > >> Ah, thanks for the hint :)
> > >>
> > >> I guess that if I want to have any chance to get the PSP
> > >> working
> > >> I'm
> > >> going to need more details on it.  A quick search some time
> > >> ago
> > >> mostly
> > >> brought reverse-engineering work, rather than official AMD
> > >> doc.
> > >>   Are
> > >> there some AMD resources I missed ?
> > > The driver code is pretty much it.
> >  Let's try to shed some more light on how things work, taking as
> >  excuse
> >  psp_v12_0_ring_create().
> > 
> >  First, register access through [RW]REG32_SOC15() is implemented
> >  in
> >  terms of __[RW]REG32_SOC15_RLC__(), which is basically a
> >  [RW]REG32(),
> >  except it has to be more complex in the SR-IOV case.
> >  Has the RLC anything to do with SR-IOV ?
> > >>> When running the driver on a SR-IOV virtual function (VF), some
> > >>> registers are not available directly via the VF's MMIO aperture
> > >>> so
> > >>> they need to go through the RLC.  For bare metal or passthrough
> > >>> this
> > >>> is not relevant.
> > >>>
> >  It accesses registers in the MMIO range of the MP0 IP, and the
> >  "MP0"
> >  name correlates highly with MMIO accesses in PSP-handling code.
> >  Is "MP0" another name for PSP (and "MP1" for SMU) ?  The MP0
> >  version
> > >>> Yes.
> > >>>
> >  reported at v11.0.3 by discovery seems to contradict the use of
> >  

RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."

2021-12-21 Thread Deucher, Alexander
[Public]

> -Original Message-
> From: Deucher, Alexander
> Sent: Tuesday, December 21, 2021 12:01 PM
> To: Linus Torvalds ; Imre Deak
> ; amd-gfx@lists.freedesktop.org
> Cc: Daniel Vetter ; Kai-Heng Feng
> 
> Subject: RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
> PCI device ..."
> 
> [Public]
> 
> > -Original Message-
> > From: Linus Torvalds 
> > Sent: Monday, December 20, 2021 5:05 PM
> > To: Imre Deak 
> > Cc: Daniel Vetter ; Deucher, Alexander
> > ; Kai-Heng Feng
> > 
> > Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb:
> > Release PCI device ..."
> >
> > On Mon, Dec 20, 2021 at 1:33 PM Imre Deak 
> wrote:
> > >
> > > amdgpu.runpm=0
> >
> > Hmmm.
> >
> > This does seem to "work", but not very well.
> >
> > With this, what seems to happen is odd: I lock the screen, wait, it
> > goes "No signal, shutting down", but then doesn't actually shut down
> > but stays black (with the backlight on). After _another_ five seconds
> > or so, the monitor goes "No signal, shutting down" _again_, and at that
> point it actually does it.
> >
> > So it solves my immediate problem - in that yes, the backlight finally
> > does turn off in the end - but it does seem to be still broken.
> >
> > I'm very surprised if no AMD drm developers can see this exact same thing.
> > This is a very simple setup. The only possibly slightly less common
> > thing is that I have two monitors, but while that is not necessarily
> > the _most_ common setup in an absolute sense, I'd expect it to be very
> > common among DRM developers..
> >
> > I guess I can just change the revert to just a
> >
> > -int amdgpu_runtime_pm = -1;
> > +int amdgpu_runtime_pm = 0;
> >
> > instead. The auto-detect is apparently broken. Maybe it should only
> > kick in for LVDS screens on actual laptops?
> >
> > Note: on my machine, I get that
> >
> >amdgpu :49:00.0: amdgpu: Using BACO for runtime pm
> >
> > so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
> > and it's only that BACO case that is broken.
> >
> > I have no idea what any of those three things are - I'm just looking
> > at the uses of that amdgpu_runtime_pm variable.
> >
> > amdgpu people: if you don't want that amdgpu_runtime_pm turned off by
> > default, tell me something else to try.
> 
> For a little background, runtime PM support was added about 10 years ago
> originally to support laptops with multiple GPUs (integrated and discrete).
> It's not specific to the display hardware.  When the GPU is idle, it can be
> powered down completely.  In the case of these laptops, it's D3 cold
> (managed by ACPI, we call this BOCO in AMD parlance - Bus Off, Chip Off)
> which powers off the dGPU completely (i.e., it disappears from the bus).  A
> few years ago we extended this to support desktop dGPUs as well which
> support their own version of runtime D3 (called BACO in AMD parlance - Bus
> Active, Chip Off).  The driver can put the chip into a low power state where
> everything except the bus interface is powered down (to avoid the device
> disappearing from the bus).  So this has worked for almost 2 years now on
> BACO capable parts and for a decade or more on BOCO systems.
> Unfortunately, changing the default runpm parameter setting would cause a
> flood of bug reports about runtime power management breaking and
> suddenly systems are using more power.
> 
> Imre's commit (55285e21f045) fixes another commit (a6c0fd3d5a8b).
> Runtime pm was working on amdgpu prior to that commit.  Is it possible
> there is still some race between when amdgpu takes over from efifb?  Does
> it work properly when all pm_runtime calls in efifb are removed or if efifb is
> not enabled?  Runtime pm for Polaris boards has been enabled by default
> since 4fdda2e66de0b which predates both of those patches.

Thinking about this more, I wonder if there was some change in some userspace 
component which was hidden by the changes in 55285e21f045 and a6c0fd3d5a8b.  
E.g., some desktop component started polling for display changes or GPU 
temperature or something like that and when a6c0fd3d5a8b was in place the GPU 
never entered runtime suspend.  Then when 55285e21f045 was applied, it unmasked 
the new behavior in the userpace component.

What should happen is that when all of the displays blank, assuming the GPU is
otherwise idle, the GPU will runtime suspend after  seconds.  When you move the
mouse or hit the keyboard, that should trigger the GPU to runtime resume
and then the displays will be re-enabled.
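
(For reference, that idle-then-suspend flow is the standard pm_runtime
autosuspend pattern; a minimal sketch, with an illustrative delay value
rather than the actual amdgpu one:

	/* Suspend the device once it has been idle for the configured
	 * delay after the last reference is dropped. 5000 ms is
	 * illustrative only. */
	pm_runtime_set_autosuspend_delay(dev, 5000);
	pm_runtime_use_autosuspend(dev);
	pm_runtime_mark_last_busy(dev);
	pm_runtime_put_autosuspend(dev);	/* drop ref, arm idle timer */

The hotplug notification described below is then what wakes userspace up
after such a resume.)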

In the behavior you are seeing, when the displays come back on after they blank 
are you seeing the device resume from runtime suspend?  On resume from suspend 
(runtime or system) we issue a hotplug notification to userspace in case any 
displays changed during suspend when the GPU was powered down (and hence could 
not detect a hotplug event).  Perhaps that event is triggering userspace to 
reprobe and re-enable the displays shortly 

RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."

2021-12-21 Thread Deucher, Alexander
[Public]

> -Original Message-
> From: Linus Torvalds 
> Sent: Monday, December 20, 2021 5:21 PM
> To: Imre Deak ; Koenig, Christian
> ; Pan, Xinhui ;
> Wentland, Harry 
> Cc: Daniel Vetter ; Deucher, Alexander
> ; Kai-Heng Feng
> ; amd-gfx list <amd-gfx@lists.freedesktop.org>
> Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
> PCI device ..."
> 
> [ Adding back in more amd people and the amd list, the people Daniel added
> seem to have gotten lost again, but I think people at least saw my original
> report thanks to Daniel ]
> 
> With "amdgpu.runpm=0", things are better, but not perfect. With that I can
> lock the screen, and it has to go through *two* cycles of "No signal, turning
> off", but on the second cycle it does finally work.
> 
> This was exposed by commit 55285e21f045 ("fbdev/efifb: Release PCI
> device's runtime PM ref during FB destroy"), probably because that made
> runtime PM actually potentially work, but it is then broken on amdgpu.
> 
> Absolutely nothing odd in my setup. Two monitors, one GPU. PCI ID
> 1002:67df rev e7, subsystem ID 1da2:e353.
> 
> I'd expect pretty much any amdgpu person to see this.
> 
> On Mon, Dec 20, 2021 at 2:04 PM Linus Torvalds  foundation.org> wrote:
> >
> > Note: on my machine, I get that
> >
> >amdgpu :49:00.0: amdgpu: Using BACO for runtime pm
> >
> > so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
> > and it's only that BACO case that is broken.
> 
> Hmm. The *documentation* says:
> 
> PX runtime pm
> 2 = force enable with BAMACO,
> 1 = force enable with BACO,
> 0 = disable,
> -1 = PX only default
> 
> but the code actually makes anything != 0 enable it, except on VEGA20 and
> ARCTURUS, where it needs to be positive.
> 
> My card is apparently "POLARIS10", whatever that means, which means that
> any non-zero value of amdgpu_runtime_pm will enable runtime PM as long
> as "amdgpu_device_supports_baco()" is true. Which it is.
> 
> Whatever. Now I'm just kwetching about the documentation not matching
> what I see the code doing, which is never a great sign when things don't
> work.

Apologies on the documentation.  -1 is the default and is enabled for all dGPUs 
which support runtime D3.  It was never fixed up when we extended support for 
runtime pm beyond PX/HG laptops.  Fixed up the documentation here:
https://patchwork.freedesktop.org/patch/467681/

Alex


RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."

2021-12-21 Thread Deucher, Alexander
[Public]

> -Original Message-
> From: Linus Torvalds 
> Sent: Monday, December 20, 2021 5:05 PM
> To: Imre Deak 
> Cc: Daniel Vetter ; Deucher, Alexander
> ; Kai-Heng Feng
> 
> Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
> PCI device ..."
> 
> On Mon, Dec 20, 2021 at 1:33 PM Imre Deak  wrote:
> >
> > amdgpu.runpm=0
> 
> Hmmm.
> 
> This does seem to "work", but not very well.
> 
> With this, what seems to happen is odd: I lock the screen, wait, it goes "No
> signal, shutting down", but then doesn't actually shut down but stays black
> (with the backlight on). After _another_ five seconds or so, the monitor goes
> "No signal, shutting down" _again_, and at that point it actually does it.
> 
> So it solves my immediate problem - in that yes, the backlight finally does
> turn off in the end - but it does seem to be still broken.
> 
> I'm very surprised if no AMD drm developers can see this exact same thing.
> This is a very simple setup. The only possibly slightly less common thing is 
> that
> I have two monitors, but while that is not necessarily the _most_ common
> setup in an absolute sense, I'd expect it to be very common among DRM
> developers..
> 
> I guess I can just change the revert to just a
> 
> -int amdgpu_runtime_pm = -1;
> +int amdgpu_runtime_pm = 0;
> 
> instead. The auto-detect is apparently broken. Maybe it should only kick in
> for LVDS screens on actual laptops?
> 
> Note: on my machine, I get that
> 
>amdgpu :49:00.0: amdgpu: Using BACO for runtime pm
> 
> so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
> and it's only that BACO case that is broken.
> 
> I have no idea what any of those three things are - I'm just looking at the
> uses of that amdgpu_runtime_pm variable.
> 
> amdgpu people: if you don't want that amdgpu_runtime_pm turned off by
> default, tell me something else to try.

For a little background, runtime PM support was added about 10 years ago 
originally to support laptops with multiple GPUs (integrated and discrete).  
It's not specific to the display hardware.  When the GPU is idle, it can be 
powered down completely.  In the case of these laptops, it's D3 cold (managed 
by ACPI, we call this BOCO in AMD parlance - Bus Off, Chip Off) which powers 
off the dGPU completely (i.e., it disappears from the bus).  A few years ago we 
extended this to support desktop dGPUs as well which support their own version 
of runtime D3 (called BACO in AMD parlance - Bus Active, Chip Off).  The driver 
can put the chip into a low power state where everything except the bus 
interface is powered down (to avoid the device disappearing from the bus).  So 
this has worked for almost 2 years now on BACO capable parts and for a decade 
or more on BOCO systems.  Unfortunately, changing the default runpm parameter 
setting would cause a flood of bug reports about runtime power management 
breaking and suddenly systems are using more power.

Imre's commit (55285e21f045) fixes another commit (a6c0fd3d5a8b).  Runtime pm 
was working on amdgpu prior to that commit.  Is it possible there is still some 
race between when amdgpu takes over from efifb?  Does it work properly when all 
pm_runtime calls in efifb are removed or if efifb is not enabled?  Runtime pm 
for Polaris boards has been enabled by default since 4fdda2e66de0b which 
predates both of those patches.

Alex
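
To make the BOCO/BACO distinction above concrete, here is a rough,
hypothetical sketch of that mode selection; the capability checks are
placeholders, not the actual amdgpu code:

/* BOCO: Bus Off, Chip Off - ACPI powers the dGPU off the bus entirely.
 * BACO: Bus Active, Chip Off - the chip powers down but keeps its PCI
 *       bus interface alive so it never disappears from the bus. */
enum runtime_d3_mode { RUNPM_NONE, RUNPM_BOCO, RUNPM_BACO };

static enum runtime_d3_mode pick_runtime_d3(bool has_acpi_px,
					    bool has_baco, int runpm_param)
{
	if (runpm_param == 0)
		return RUNPM_NONE;	/* user disabled runtime pm */
	if (has_acpi_px)
		return RUNPM_BOCO;	/* PX/HG laptop, ACPI-managed D3 cold */
	if (has_baco)
		return RUNPM_BACO;	/* desktop dGPU, chip-level power off */
	return RUNPM_NONE;
}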


Re: [RFC 4/6] drm/amdgpu: Serialize non TDR gpu recovery with TDRs

2021-12-21 Thread Andrey Grodzovsky



On 2021-12-21 2:59 a.m., Christian König wrote:

On 20.12.21 23:17, Andrey Grodzovsky wrote:


On 2021-12-20 2:20 a.m., Christian König wrote:

On 17.12.21 23:27, Andrey Grodzovsky wrote:

Use reset domain wq also for non TDR gpu recovery triggers
such as sysfs and RAS. We must serialize all possible
GPU recoveries to guarantee no concurrency there.
For TDR call the original recovery function directly since
it's already executed from within the wq. For others just
use a wrapper to queue work and wait on it to finish.

Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h    |  2 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 ++++++++++++++++++++++++++++++++-

  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
  3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h

index b5ff76aae7e0..8e96b9a14452 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1296,6 +1296,8 @@ bool amdgpu_device_has_job_running(struct 
amdgpu_device *adev);

  bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
  int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
    struct amdgpu_job* job);
+int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
+  struct amdgpu_job *job);
  void amdgpu_device_pci_config_reset(struct amdgpu_device *adev);
  int amdgpu_device_pci_reset(struct amdgpu_device *adev);
  bool amdgpu_device_need_post(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index b595e6d699b5..55cd67b9ede2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4979,7 +4979,7 @@ static void amdgpu_device_recheck_guilty_jobs(
   * Returns 0 for success or an error on failure.
   */
  -int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
+int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
    struct amdgpu_job *job)
  {
  struct list_head device_list, *device_list_handle = NULL;
@@ -5236,6 +5236,37 @@ int amdgpu_device_gpu_recover(struct 
amdgpu_device *adev,

  return r;
  }
  +struct recover_work_struct {


Please add an amdgpu_ prefix to the name.


+    struct work_struct base;
+    struct amdgpu_device *adev;
+    struct amdgpu_job *job;
+    int ret;
+};
+
+static void amdgpu_device_queue_gpu_recover_work(struct 
work_struct *work)

+{
+    struct recover_work_struct *recover_work = container_of(work, 
struct recover_work_struct, base);

+
+    recover_work->ret = 
amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);

+}
+/*
+ * Serialize gpu recover into reset domain single threaded wq
+ */
+int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
+    struct amdgpu_job *job)
+{
+    struct recover_work_struct work = {.adev = adev, .job = job};
+
+    INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
+
+    if (!queue_work(adev->reset_domain.wq, &work.base))
+    return -EAGAIN;
+
+    flush_work(&work.base);
+
+    return work.ret;
+}


Maybe that should be part of the scheduler code? Not really sure, 
just an idea.



Seems to me that since the reset domain is almost always above a 
single scheduler granularity then it wouldn't fit very well there.


Yeah, but what if we introduce a drm_sched_recover_queue and 
drm_sched_recover_work object?


It's probably ok to go forward with that for now, but this handling 
makes quite some sense to have independent of which driver is using 
it. So as soon as we see a second similar implementation we should 
move it into common code.


Regards,
Christian.



Agree, the only point I would stress is that we need an entity separate
from drm_gpu_scheduler, something like
drm_gpu_reset_domain which should point to or be pointed by a set of
schedulers that should go through
reset together.

Andrey
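
A minimal sketch of the kind of object being proposed, with all names
hypothetical since nothing like it exists in the scheduler yet:

/* One ordered workqueue serializing recovery for the set of schedulers
 * that must be reset together. Not actual DRM or amdgpu code. */
struct drm_gpu_reset_domain {
	struct kref refcount;		/* shared by several schedulers */
	struct workqueue_struct *wq;	/* from alloc_ordered_workqueue() */
	struct list_head schedulers;	/* schedulers reset as one unit */
};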




Andrey




Apart from that looks good to me,
Christian.


+
  /**
   * amdgpu_device_get_pcie_info - fence pcie info about the PCIE slot
   *
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c

index bfc47bea23db..38c9fd7b7ad4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -63,7 +63,7 @@ static enum drm_gpu_sched_stat 
amdgpu_job_timedout(struct drm_sched_job *s_job)

    ti.process_name, ti.tgid, ti.task_name, ti.pid);
    if (amdgpu_device_should_recover_gpu(ring->adev)) {
-    amdgpu_device_gpu_recover(ring->adev, job);
+    amdgpu_device_gpu_recover_imp(ring->adev, job);
  } else {
  drm_sched_suspend_timeout(&ring->sched);
  if (amdgpu_sriov_vf(adev))






Re: [RFC 3/6] drm/amdgpu: Fix crash on modprobe

2021-12-21 Thread Andrey Grodzovsky



On 2021-12-21 2:02 a.m., Christian König wrote:



On 20.12.21 20:22, Andrey Grodzovsky wrote:


On 2021-12-20 2:17 a.m., Christian König wrote:

On 17.12.21 23:27, Andrey Grodzovsky wrote:

Restrict jobs resubmission to suspend case
only since schedulers not initialised yet on
probe.

Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

index 5527c68c51de..8ebd954e06c6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -582,7 +582,7 @@ void amdgpu_fence_driver_hw_init(struct 
amdgpu_device *adev)

  if (!ring || !ring->fence_drv.initialized)
  continue;
  -    if (!ring->no_scheduler) {
+    if (adev->in_suspend && !ring->no_scheduler) {


Uff, why is that suddenly necessary? Because of the changed order?

Christian.



Yes.


Mhm, that's quite bad design then.



If you look at the original patch for this 
https://www.spinics.net/lists/amd-gfx/msg67560.html you will
see that restarting the scheduler here is only relevant for the 
suspend/resume case because there was
a race to fix. There is no point in this code on driver init because 
nothing was submitted to the scheduler yet,
and so it seems to me ok to add a condition that this code runs only 
in the in_suspend case.





How about we keep the order as is and allow specifying the reset work 
queue with drm_sched_start() ?



As I mentioned above, the fact we even have drm_sched_start there is 
just part of a solution to resolve a race
during suspend/resume. It is not for device initialization and indeed, 
other client drivers of the gpu scheduler never call
drm_sched_start on device init. We must guarantee that the reset work queue 
is already initialized before any job submission to the scheduler,
and because of that IMHO the right place for this is drm_sched_init.

Andrey
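
To illustrate the ordering argument, a sketch under the assumption that
the reset queue is created before the schedulers and handed to
drm_sched_init() as its timeout workqueue (abbreviated and hypothetical,
not the actual patch):

/* Allocate the ordered reset workqueue first, then give it to every
 * scheduler at init time, so the recovery queue provably exists
 * before any job can be submitted. */
adev->reset_domain.wq = alloc_ordered_workqueue("amdgpu-reset-domain", 0);
if (!adev->reset_domain.wq)
	return -ENOMEM;

r = drm_sched_init(&ring->sched, &amdgpu_sched_ops, num_hw_submission,
		   amdgpu_job_hang_limit, timeout,
		   adev->reset_domain.wq, NULL, ring->name);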




Christian.



Andrey





drm_sched_resubmit_jobs(&ring->sched);
  drm_sched_start(&ring->sched, true);
  }






[PATCH] drm/amdgpu: fix runpm documentation

2021-12-21 Thread Alex Deucher
It's not only supported by HG/PX laptops.  It's supported
by all dGPUs which support BOCO/BACO functionality (runtime
D3).

BOCO - Bus Off, Chip Off.  The entire chip is powered off.
   This is controlled by ACPI.
BACO - Bus Active, Chip Off.  The chip still shows up
   on the PCI bus, but the device itself is powered
   down.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index a78bbea9629d..f001924ed92e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -331,9 +331,10 @@ module_param_named(aspm, amdgpu_aspm, int, 0444);
 /**
  * DOC: runpm (int)
  * Override for runtime power management control for dGPUs in PX/HG laptops. 
The amdgpu driver can dynamically power down
- * the dGPU on PX/HG laptops when it is idle. The default is -1 (auto enable). 
Setting the value to 0 disables this functionality.
+ * the dGPUs when they are idle if supported. The default is -1 (auto enable).
+ * Setting the value to 0 disables this functionality.
  */
-MODULE_PARM_DESC(runpm, "PX runtime pm (2 = force enable with BAMACO, 1 = 
force enable with BACO, 0 = disable, -1 = PX only default)");
+MODULE_PARM_DESC(runpm, "PX runtime pm (2 = force enable with BAMACO, 1 = 
force enable with BACO, 0 = disable, -1 = auto)");
 module_param_named(runpm, amdgpu_runtime_pm, int, 0444);
 
 /**
-- 
2.33.1



Re: [PATCH] amdgpu/pm: Modify sysfs pp_dpm_sclk to have only read permission in ONEVF mode

2021-12-21 Thread Nikolic, Marina
[AMD Official Use Only]

>From 06359f3be0c0b889519d6dd954fb11f31e9a15e0 Mon Sep 17 00:00:00 2001
From: Marina Nikolic 
Date: Tue, 14 Dec 2021 20:57:53 +0800
Subject: [PATCH] amdgpu/pm: Modify sysfs pp_dpm_sclk to have only read
 permission in ONEVF mode

== Description ==
Setting through sysfs should not be allowed in SRIOV mode.
These calls will not be processed by FW anyway,
but error handling on sysfs level should be improved.

== Changes ==
This patch prohibits performing of all set commands
in SRIOV mode on sysfs level.
It offers better error handling as calls that are
not allowed will not be propagated further.

== Test ==
Writing to any sysfs file in passthrough mode will succeed.
Writing to any sysfs file in ONEVF mode will yield error:
"calling process does not have sufficient permission to execute a command".

Signed-off-by: Marina Nikolic 
---
 drivers/gpu/drm/amd/pm/amdgpu_pm.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
index 082539c70fd4..c43818cd02aa 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
@@ -2133,6 +2133,12 @@ static int default_attr_update(struct amdgpu_device 
*adev, struct amdgpu_device_
}
}

+   /* setting should not be allowed from VF */
+   if (amdgpu_sriov_vf(adev)) {
+   dev_attr->attr.mode &= ~S_IWUGO;
+   dev_attr->store = NULL;
+   }
+
 #undef DEVICE_ATTR_IS

return 0;
--
2.20.1


From: Nikolic, Marina 
Sent: Tuesday, December 21, 2021 3:15 PM
To: Russell, Kent ; amd-gfx@lists.freedesktop.org 

Cc: Mitrovic, Milan ; Kitchen, Greg 

Subject: Re: [PATCH] amdgpu/pm: Modify sysfs pp_dpm_sclk to have only read 
permission in ONEVF mode

Hi Kent,

Thank you for the review. Yes, I can confirm I am trying to set this for every 
single file for SRIOV mode.
@Kitchen, Greg required this for ROCM-SMI 5.0 
release. In case you need it, he can provide more details.
I'm going to clarify the commit message more and send a new patch.

BR,
Marina

From: Russell, Kent 
Sent: Monday, December 20, 2021 8:01 PM
To: Nikolic, Marina ; amd-gfx@lists.freedesktop.org 

Cc: Mitrovic, Milan ; Nikolic, Marina 
; Kitchen, Greg 
Subject: RE: [PATCH] amdgpu/pm: Modify sysfs pp_dpm_sclk to have only read 
permission in ONEVF mode

[AMD Official Use Only]

> -Original Message-
> From: amd-gfx  On Behalf Of Marina 
> Nikolic
> Sent: Monday, December 20, 2021 11:09 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Mitrovic, Milan ; Nikolic, Marina
> ; Kitchen, Greg 
> Subject: [PATCH] amdgpu/pm: Modify sysfs pp_dpm_sclk to have only read 
> permission in
> ONEVF mode
>
> == Description ==
> Due to security reasons setting through sysfs
> should only be allowed in passthrough mode.
> Options that are not mapped as SMU messages
> do not have any mechanism to distinguish between
> the passthrough, onevf and multivf use cases.
> A unified approach is needed.
>
> == Changes ==
> This patch introduces a new mechanism to distinguish
> the ONEVF and PASSTHROUGH use cases on the sysfs level
> and prohibit setting (writing to sysfs).
> It also applies the new mechanism to the pp_dpm_sclk sysfs file.
>
> == Test ==
> Writing to pp_dpm_sclk sysfs file in passthrough mode will succeed.
> Writing to pp_dpm_sclk sysfs file in ONEVF mode will yield error:
> "calling process does not have sufficient permission to execute a command".
> Sysfs pp_dpm_sclk will not be created in MULTIVF mode.
>
> Signed-off-by: Marina Nikolic 
> ---
>  drivers/gpu/drm/amd/pm/amdgpu_pm.c | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> index 082539c70fd4..d2b168babc7d 100644
> --- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> +++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> @@ -2133,6 +2133,12 @@ static int default_attr_update(struct amdgpu_device 
> *adev,
> struct amdgpu_device_
>   }
>   }
>
> + /* security: setting should not be allowed from VF */
> + if (amdgpu_sriov_vf(adev)) {

You should be checking for pp_dpm_sclk here, for example:
if (DEVICE_ATTR_IS(pp_dpm_sclk)) {

Otherwise I am pretty sure you're setting this for every single file. And is it 
only sclk? Or does it also need to affect mclk/fclk/etc? If it's only sclk, the 
line above should help. If it's for more, then the commit should try to clarify 
that as it's not 100% clear.

 Kent
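
Presumably something along these lines inside default_attr_update(),
combining the VF check with the attribute check (a sketch of the
suggestion, not a tested patch):

	/* restrict writes only for the specific attribute(s) on a VF,
	 * rather than for every sysfs file */
	if (amdgpu_sriov_vf(adev) && DEVICE_ATTR_IS(pp_dpm_sclk)) {
		dev_attr->attr.mode &= ~S_IWUGO;
		dev_attr->store = NULL;
	}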

> + dev_attr->attr.mode &= ~S_IWUGO;
> + dev_attr->store = NULL;
> + }
> +
>  #undef DEVICE_ATTR_IS
>
>   return 0;
> --
> 2.20.1



Re: [PATCH] amdgpu/pm: Modify sysfs pp_dpm_sclk to have only read permission in ONEVF mode

2021-12-21 Thread Nikolic, Marina
[AMD Official Use Only]

Hi Kent,

Thank you for the review. Yes, I can confirm I am trying to set this for every 
single file for SRIOV mode.
@Kitchen, Greg required this for ROCM-SMI 5.0 
release. In case you need it, he can provide more details.
I'm going to clarify the commit message more and send a new patch.

BR,
Marina

From: Russell, Kent 
Sent: Monday, December 20, 2021 8:01 PM
To: Nikolic, Marina ; amd-gfx@lists.freedesktop.org 

Cc: Mitrovic, Milan ; Nikolic, Marina 
; Kitchen, Greg 
Subject: RE: [PATCH] amdgpu/pm: Modify sysfs pp_dpm_sclk to have only read 
permission in ONEVF mode

[AMD Official Use Only]

> -Original Message-
> From: amd-gfx  On Behalf Of Marina 
> Nikolic
> Sent: Monday, December 20, 2021 11:09 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Mitrovic, Milan ; Nikolic, Marina
> ; Kitchen, Greg 
> Subject: [PATCH] amdgpu/pm: Modify sysfs pp_dpm_sclk to have only read 
> permission in
> ONEVF mode
>
> == Description ==
> Due to security reasons setting through sysfs
> should only be allowed in passthrough mode.
> Options that are not mapped as SMU messages
> do not have any mechanism to distinguish between
> the passthrough, onevf and multivf use cases.
> A unified approach is needed.
>
> == Changes ==
> This patch introduces a new mechanism to distinguish
> the ONEVF and PASSTHROUGH use cases on the sysfs level
> and prohibit setting (writing to sysfs).
> It also applies the new mechanism to the pp_dpm_sclk sysfs file.
>
> == Test ==
> Writing to pp_dpm_sclk sysfs file in passthrough mode will succeed.
> Writing to pp_dpm_sclk sysfs file in ONEVF mode will yield error:
> "calling process does not have sufficient permission to execute a command".
> Sysfs pp_dpm_sclk will not be created in MULTIVF mode.
>
> Signed-off-by: Marina Nikolic 
> ---
>  drivers/gpu/drm/amd/pm/amdgpu_pm.c | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> index 082539c70fd4..d2b168babc7d 100644
> --- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> +++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> @@ -2133,6 +2133,12 @@ static int default_attr_update(struct amdgpu_device 
> *adev,
> struct amdgpu_device_
>   }
>   }
>
> + /* security: setting should not be allowed from VF */
> + if (amdgpu_sriov_vf(adev)) {

You should be checking for pp_dpm_sclk here, for example:
if (DEVICE_ATTR_IS(pp_dpm_sclk)) {

Otherwise I am pretty sure you're setting this for every single file. And is it 
only sclk? Or does it also need to affect mclk/fclk/etc? If it's only sclk, the 
line above should help. If it's for more, then the commit should try to clarify 
that as it's not 100% clear.

 Kent

> + dev_attr->attr.mode &= ~S_IWUGO;
> + dev_attr->store = NULL;
> + }
> +
>  #undef DEVICE_ATTR_IS
>
>   return 0;
> --
> 2.20.1



Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."

2021-12-21 Thread Linus Torvalds
[ Adding back in more amd people and the amd list, the people Daniel
added seem to have gotten lost again, but I think people at least saw
my original report thanks to Daniel ]

With "amdgpu.runpm=0", things are better, but not perfect. With that I
can lock the screen, and it has to go through *two* cycles of "No
signal, turning off", but on the second cycle it does finally work.

This was exposed by commit 55285e21f045 ("fbdev/efifb: Release PCI
device's runtime PM ref during FB destroy"), probably because that
made runtime PM actually potentially work, but it is then broken on
amdgpu.

Absolutely nothing odd in my setup. Two monitors, one GPU. PCI ID
1002:67df rev e7, subsystem ID 1da2:e353.

I'd expect pretty much any amdgpu person to see this.

On Mon, Dec 20, 2021 at 2:04 PM Linus Torvalds
 wrote:
>
> Note: on my machine, I get that
>
>amdgpu :49:00.0: amdgpu: Using BACO for runtime pm
>
> so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
> and it's only that BACO case that is broken.

Hmm. The *documentation* says:

PX runtime pm
2 = force enable with BAMACO,
1 = force enable with BACO,
0 = disable,
-1 = PX only default

but the code actually makes anything != 0 enable it, except on VEGA20
and ARCTURUS, where it needs to be positive.

My card is apparently "POLARIS10", whatever that means, which means
that any non-zero value of amdgpu_runtime_pm will enable runtime PM as
long as "amdgpu_device_supports_baco()" is true. Which it is.

Whatever. Now I'm just kvetching about the documentation not matching
what I see the code doing, which is never a great sign when things
don't work.

  Linus
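
Sketched out, the behaviour being described is roughly the following (a
paraphrase of the logic, not the literal driver source):

/* Any non-zero runpm value enables runtime pm on a BACO-capable board,
 * except VEGA20/ARCTURUS, which require an explicitly positive value. */
bool enable_runpm = false;

if (amdgpu_runtime_pm != 0 && amdgpu_device_supports_baco(dev)) {
	switch (adev->asic_type) {
	case CHIP_VEGA20:
	case CHIP_ARCTURUS:
		enable_runpm = (amdgpu_runtime_pm > 0);
		break;
	default:
		enable_runpm = true;	/* e.g. POLARIS10 with the -1 default */
		break;
	}
}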


Re: [RFC 4/6] drm/amdgpu: Serialize non TDR gpu recovery with TDRs

2021-12-21 Thread Christian König

On 20.12.21 23:17, Andrey Grodzovsky wrote:


On 2021-12-20 2:20 a.m., Christian König wrote:

On 17.12.21 23:27, Andrey Grodzovsky wrote:

Use reset domain wq also for non TDR gpu recovery triggers
such as sysfs and RAS. We must serialize all possible
GPU recoveries to guarantee no concurrency there.
For TDR call the original recovery function directly since
it's already executed from within the wq. For others just
use a wrapper to queue work and wait on it to finish.

Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h    |  2 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 ++++++++++++++++++++++++++++++++-

  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
  3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h

index b5ff76aae7e0..8e96b9a14452 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1296,6 +1296,8 @@ bool amdgpu_device_has_job_running(struct 
amdgpu_device *adev);

  bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
  int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
    struct amdgpu_job* job);
+int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
+  struct amdgpu_job *job);
  void amdgpu_device_pci_config_reset(struct amdgpu_device *adev);
  int amdgpu_device_pci_reset(struct amdgpu_device *adev);
  bool amdgpu_device_need_post(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index b595e6d699b5..55cd67b9ede2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4979,7 +4979,7 @@ static void amdgpu_device_recheck_guilty_jobs(
   * Returns 0 for success or an error on failure.
   */
  -int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
+int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
    struct amdgpu_job *job)
  {
  struct list_head device_list, *device_list_handle = NULL;
@@ -5236,6 +5236,37 @@ int amdgpu_device_gpu_recover(struct 
amdgpu_device *adev,

  return r;
  }
  +struct recover_work_struct {


Please add an amdgpu_ prefix to the name.


+    struct work_struct base;
+    struct amdgpu_device *adev;
+    struct amdgpu_job *job;
+    int ret;
+};
+
+static void amdgpu_device_queue_gpu_recover_work(struct work_struct 
*work)

+{
+    struct recover_work_struct *recover_work = container_of(work, 
struct recover_work_struct, base);

+
+    recover_work->ret = 
amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);

+}
+/*
+ * Serialize gpu recover into reset domain single threaded wq
+ */
+int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
+    struct amdgpu_job *job)
+{
+    struct recover_work_struct work = {.adev = adev, .job = job};
+
+    INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
+
+    if (!queue_work(adev->reset_domain.wq, &work.base))
+    return -EAGAIN;
+
+    flush_work(&work.base);
+
+    return work.ret;
+}


Maybe that should be part of the scheduler code? Not really sure, 
just an idea.



Seems to me that since the reset domain is almost always above a 
single scheduler granularity then it wouldn't fit very well there.


Yeah, but what if we introduce a drm_sched_recover_queue and 
drm_sched_recover_work object?


It's probably ok to go forward with that for now, but this handling 
makes quite some sense to have independent of which driver is using it. 
So as soon as we see a second similar implementation we should move it 
into common code.


Regards,
Christian.



Andrey




Apart from that looks good to me,
Christian.


+
  /**
   * amdgpu_device_get_pcie_info - fence pcie info about the PCIE slot
   *
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c

index bfc47bea23db..38c9fd7b7ad4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -63,7 +63,7 @@ static enum drm_gpu_sched_stat 
amdgpu_job_timedout(struct drm_sched_job *s_job)

    ti.process_name, ti.tgid, ti.task_name, ti.pid);
    if (amdgpu_device_should_recover_gpu(ring->adev)) {
-    amdgpu_device_gpu_recover(ring->adev, job);
+    amdgpu_device_gpu_recover_imp(ring->adev, job);
  } else {
  drm_sched_suspend_timeout(&ring->sched);
  if (amdgpu_sriov_vf(adev))