[pull] amdgpu, radeon drm-fixes-5.9

2020-09-02 Thread Alex Deucher
Hi Dave, Daniel,

Fixes for 5.9.

The following changes since commit f75aef392f869018f78cfedf3c320a6b3fcfda6b:

  Linux 5.9-rc3 (2020-08-30 16:01:54 -0700)

are available in the Git repository at:

  git://people.freedesktop.org/~agd5f/linux tags/amd-drm-fixes-5.9-2020-09-03

for you to fetch changes up to fc8c70526bd30733ea8667adb8b8ffebea30a8ed:

  drm/radeon: Prefer lower feedback dividers (2020-09-03 00:37:30 -0400)


amd-drm-fixes-5.9-2020-09-03:

amdgpu:
- Fix for 32bit systems
- SW CTF fix
- Update for Sienna Cichlid
- CIK bug fixes

radeon:
- PLL fix


Evan Quan (1):
  drm/amd/pm: avoid false alarm due to confusing softwareshutdowntemp setting

Jiansong Chen (1):
  drm/amd/pm: enable MP0 DPM for sienna_cichlid

Kai-Heng Feng (1):
  drm/radeon: Prefer lower feedback dividers

Kevin Wang (1):
  drm/amd/pm: fix is_dpm_running() run error on 32bit system

Sandeep Raghuraman (2):
  drm/amdgpu: Specify get_argument function for ci_smu_funcs
  drm/amdgpu: Fix bug in reporting voltage for CIK

 drivers/gpu/drm/amd/powerplay/arcturus_ppt.c | 10 +++---
 drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c |  3 ++-
 drivers/gpu/drm/amd/powerplay/hwmgr/vega10_thermal.c | 14 --
 drivers/gpu/drm/amd/powerplay/navi10_ppt.c   | 10 +++---
 drivers/gpu/drm/amd/powerplay/sienna_cichlid_ppt.c   | 14 ++
 drivers/gpu/drm/amd/powerplay/smumgr/ci_smumgr.c |  2 ++
 drivers/gpu/drm/radeon/radeon_display.c  |  2 +-
 7 files changed, 41 insertions(+), 14 deletions(-)


Re: [PATCH 8/8] drm/amd/display: Expose modifiers.

2020-09-02 Thread Marek Olšák
OK. Reviewed-by: Marek Olšák 

Marek

On Wed, Sep 2, 2020 at 6:31 AM Bas Nieuwenhuizen wrote:

> On Fri, Aug 7, 2020 at 9:43 PM Marek Olšák  wrote:
> >
> > On Tue, Aug 4, 2020 at 5:32 PM Bas Nieuwenhuizen <b...@basnieuwenhuizen.nl> wrote:
> >>
> >> This exposes modifier support on GFX9+.
> >>
> >> Only modifiers that can be rendered on the current GPU are
> >> added. This is to reduce the number of modifiers exposed.
> >>
> >> The HW could expose more, but the best mechanism to decide
> >> what to expose without an explosion in modifiers is still
> >> to be decided, and in the meantime this should not regress
> >> things from pre-modifiers and does not risk regressions as
> >> we make up our mind in the future.
> >>
> >> Signed-off-by: Bas Nieuwenhuizen 
> >> ---
> >>  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 343 +-
> >>  1 file changed, 342 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> >> index c38257081868..6594cbe625f9 100644
> >> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> >> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> >> @@ -3891,6 +3891,340 @@ fill_gfx9_tiling_info_from_modifier(const struct amdgpu_device *adev,
> >> }
> >>  }
> >>
> >> +enum dm_micro_swizzle {
> >> +   MICRO_SWIZZLE_Z = 0,
> >> +   MICRO_SWIZZLE_S = 1,
> >> +   MICRO_SWIZZLE_D = 2,
> >> +   MICRO_SWIZZLE_R = 3
> >> +};
> >> +
> >> +static bool dm_plane_format_mod_supported(struct drm_plane *plane,
> >> + uint32_t format,
> >> + uint64_t modifier)
> >> +{
> >> +   struct amdgpu_device *adev = plane->dev->dev_private;
> >> +   const struct drm_format_info *info = drm_format_info(format);
> >> +
> >> +   enum dm_micro_swizzle microtile = modifier_gfx9_swizzle_mode(modifier) & 3;
> >> +
> >> +   if (!info)
> >> +   return false;
> >> +
> >> +   /*
> >> +* We always have to allow this modifier, because core DRM still
> >> +* checks LINEAR support if userspace does not provide modifiers.
> >> +*/
> >> +   if (modifier == DRM_FORMAT_MOD_LINEAR)
> >> +   return true;
> >> +
> >> +   /*
> >> +* The arbitrary tiling support for multiplane formats has not been
> >> +* hooked up.
> >> +*/
> >> +   if (info->num_planes > 1)
> >> +   return false;
> >> +
> >> +   /*
> >> +* For D swizzle the canonical modifier depends on the bpp, so check
> >> +* it here.
> >> +*/
> >> +   if (AMD_FMT_MOD_GET(TILE_VERSION, modifier) == AMD_FMT_MOD_TILE_VER_GFX9 &&
> >> +   adev->family >= AMDGPU_FAMILY_NV) {
> >> +   if (microtile == MICRO_SWIZZLE_D && info->cpp[0] == 4)
> >> +   return false;
> >> +   }
> >> +
> >> +   if (adev->family >= AMDGPU_FAMILY_RV && microtile == MICRO_SWIZZLE_D &&
> >> +   info->cpp[0] < 8)
> >> +   return false;
> >> +
> >> +   if (modifier_has_dcc(modifier)) {
> >> +   /* Per radeonsi comments 16/64 bpp are more complicated. */
> >> +   if (info->cpp[0] != 4)
> >> +   return false;
> >> +   }
> >> +
> >> +   return true;
> >> +}
> >> +
> >> +static void
> >> +add_modifier(uint64_t **mods, uint64_t *size, uint64_t *cap, uint64_t mod)
> >> +{
> >> +   if (!*mods)
> >> +   return;
> >> +
> >> +   if (*cap - *size < 1) {
> >> +   uint64_t new_cap = *cap * 2;
> >> +   uint64_t *new_mods = kmalloc(new_cap * sizeof(uint64_t), GFP_KERNEL);
> >> +
> >> +   if (!new_mods) {
> >> +   kfree(*mods);
> >> +   *mods = NULL;
> >> +   return;
> >> +   }
> >> +
> >> +   memcpy(new_mods, *mods, sizeof(uint64_t) * *size);
> >> +   kfree(*mods);
> >> +   *mods = new_mods;
> >> +   *cap = new_cap;
> >> +   }
> >> +
> >> +   (*mods)[*size] = mod;
> >> +   *size += 1;
> >> +}
> >> +
> >> +static void
> >> +add_gfx9_modifiers(const struct amdgpu_device *adev,
> >> + uint64_t **mods, uint64_t *size, uint64_t *capacity)
> >> +{
> >> +   int pipes = ilog2(adev->gfx.config.gb_addr_config_fields.num_pipes);
> >> +   int pipe_xor_bits = min(8, pipes +
> >> +   ilog2(adev->gfx.config.gb_addr_config_fields.num_se));
> >> +   int bank_xor_bits = min(8 - pipe_xor_bits,
> >> +   ilog2(adev->gfx.config.gb_addr_config_fields.num_banks));
> >> +   int rb = ilog2(adev->gfx.config.gb_addr_config_fields.num_se) +
> >> +   ilog2(adev->gfx.config.gb_addr_config_fields.num_rb_per_se);
> >> +
> >> +
> >> +   if (adev->family == AMDGPU_FAMILY_RV) {
> >> +   /*
> >> +* No _D DCC 

[PATCH] drm/amdgpu/gfx10: Delete some duplicated argument to '|'

2020-09-02 Thread Ye Bin
1. gfx_v10_0_soft_reset: duplicated GRBM_STATUS__BCI_BUSY_MASK
2. gfx_v10_0_update_gfx_clock_gating: duplicated AMD_CG_SUPPORT_GFX_CGLS

Signed-off-by: Ye Bin 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 2db195ec8d0c..d502e30f67d9 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -7055,8 +7055,7 @@ static int gfx_v10_0_soft_reset(void *handle)
   GRBM_STATUS__BCI_BUSY_MASK | GRBM_STATUS__SX_BUSY_MASK |
   GRBM_STATUS__TA_BUSY_MASK | GRBM_STATUS__DB_BUSY_MASK |
   GRBM_STATUS__CB_BUSY_MASK | GRBM_STATUS__GDS_BUSY_MASK |
-  GRBM_STATUS__SPI_BUSY_MASK | GRBM_STATUS__GE_BUSY_NO_DMA_MASK
-  | GRBM_STATUS__BCI_BUSY_MASK)) {
+  GRBM_STATUS__SPI_BUSY_MASK | GRBM_STATUS__GE_BUSY_NO_DMA_MASK)) {
grbm_soft_reset = REG_SET_FIELD(grbm_soft_reset,
GRBM_SOFT_RESET, SOFT_RESET_CP,
1);
@@ -7449,7 +7448,6 @@ static int gfx_v10_0_update_gfx_clock_gating(struct amdgpu_device *adev,
(AMD_CG_SUPPORT_GFX_MGCG |
 AMD_CG_SUPPORT_GFX_CGLS |
 AMD_CG_SUPPORT_GFX_CGCG |
-AMD_CG_SUPPORT_GFX_CGLS |
 AMD_CG_SUPPORT_GFX_3D_CGCG |
 AMD_CG_SUPPORT_GFX_3D_CGLS))
gfx_v10_0_enable_gui_idle_interrupt(adev, enable);
-- 
2.25.4



Re: [PATCH 0/3] Use implicit kref infra

2020-09-02 Thread Pan, Xinhui


> On Sep 2, 2020, at 22:50, Tuikov, Luben wrote:
> 
> On 2020-09-02 00:43, Pan, Xinhui wrote:
>> 
>> 
>>> On Sep 2, 2020, at 11:46, Tuikov, Luben wrote:
>>> 
>>> On 2020-09-01 21:42, Pan, Xinhui wrote:
 If you take a look at the below function, you should not use driver's 
 release to free adev. As dev is embedded in adev.
>>> 
>>> Do you mean "look at the function below", using "below" as an adverb?
>>> "below" is not an adjective.
>>> 
>>> I know dev is embedded in adev--I did that patchset.
>>> 
 
 static void drm_dev_release(struct kref *ref)
 {
	struct drm_device *dev = container_of(ref, struct drm_device, ref);

	if (dev->driver->release)
		dev->driver->release(dev);

	drm_managed_release(dev);

	kfree(dev->managed.final_kfree);
 }
>>> 
>>> That's simple--this comes from change c6603c740e0e3
>>> and it should be reverted. Simple as that.
>>> 
>>> The version before this change was absolutely correct:
>>> 
>>> static void drm_dev_release(struct kref *ref)
>>> {
>>> if (dev->driver->release)
>>> dev->driver->release(dev);
>>> else
>>> drm_dev_fini(dev);
>>> }
>>> 
>>> Meaning, "the kref is now 0"--> if the driver
>>> has a release, call it, else use our own.
>>> But note that nothing can be assumed after this point,
>>> about the existence of "dev".
>>> 
>>> It is exactly because struct drm_device is statically
>>> embedded into a container, struct amdgpu_device,
>>> that this change above should be reverted.
>>> 
>>> This is very similar to how fops has open/release
>>> but no close. That is, the "release" is called
>>> only when the last kref is released, i.e. when
>>> kref goes from non-zero to zero.
>>> 
>>> This uses the kref infrastructure which has been
>>> around for about 20 years in the Linux kernel.
>>> 
>>> I suggest reading the comments
>>> in drm_dev.c mostly, "DOC: driver instance overview"
>>> starting at line 240 onwards. This is right above
>>> drm_put_dev(). There is actually an example of a driver
>>> in the comment. Also the comment to drm_dev_init().
>>> 
>>> Now, take a look at this:
>>> 
>>> /**
>>> * drm_dev_put - Drop reference of a DRM device
>>> * @dev: device to drop reference of or NULL
>>> *
>>> * This decreases the ref-count of @dev by one. The device is destroyed if the
>>> * ref-count drops to zero.
>>> */
>>> void drm_dev_put(struct drm_device *dev)
>>> {
>>>   if (dev)
>>>   kref_put(&dev->ref, drm_dev_release);
>>> }
>>> EXPORT_SYMBOL(drm_dev_put);
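
To make the embedding pattern concrete, a minimal sketch with hypothetical names (my_device and my_release are illustrative, not the amdgpu code): the container embeds struct drm_device and is freed exactly once, from the release callback that runs when the kref drops to zero.

	struct my_device {
		struct drm_device ddev;		/* embedded by value, not a pointer */
		void *driver_state;
	};

	static void my_release(struct drm_device *ddev)
	{
		struct my_device *mdev = container_of(ddev, struct my_device, ddev);

		/* Last reference gone: tear down driver state, then free the
		 * container that holds the drm_device itself. */
		kfree(mdev->driver_state);
		kfree(mdev);
	}

Hooking my_release up as the .release member of the driver's struct drm_driver means every drm_dev_get() is balanced by a drm_dev_put(), and the final put invokes my_release.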
>>> 
>>> Two things:
>>> 
>>> 1. It is us, who kzalloc the amdgpu device, which contains
>>> the drm_device (you'll see this discussed in the reading
>>> material I pointed to above). We do this because we're
>>> probing the PCI device whether we'll work with it or not.
>>> 
>> 
>> that is true.
> 
> Of course it's true--good morning!
> 
>> My understanding of the drm core code is like something below.
> 
> Let me stop you right there--just read the documentation I pointed
> to you at.
> 
>> struct B { 
>>  strcut A 
>> }
>> we initialize A firstly and initialize B in the end. But destroy B firstly 
>> and destory A in the end.
> 
> No!
> B, which is the amdgpu_device struct "exists" before A, which is the DRM 
> struct.
> This is why DRM recommends to _embed_ it into the driver's own device struct,
> as the documentation I pointed you to at.
> 
I think you are misleading me here. A PCI dev, as you said below, can act in
many roles: a drm dev, a tty dev, etc.
say, struct B{
struct A;
struct TTY;
struct printer;
...
}
but TTY and the other members have nothing to do with our discussion.

B of course exists before A, but the code logic is not that. Code like the one
below is really rare in the drm world:
create_B()
{
init B members
return create_A()
}
So usually B has more work to do after it initializes A;
then the code should look like below:
create_B()
{
init B base members
create_A()
init B extended members
}


For the release part:
release B extended member
release A
release B base member

A good design should not have the so-called extended and base members existing
in the release process.
Now have a look at the drm core code.
It expects the driver to do the release process like below:
release B
cleanup work of A

As long as the cleanup work of A exists, we cannot do a pure release of B.

So if you want to follow the rules of kref, you have to rework the drm core code
first. Only after that can we do a pure release of B.

What confuses me is that the kref sits in the drm dev, so why must adev be
destroyed too when the drm dev is going to be destroyed?
adev is not equal to drm dev.
I think the struct below fits the logic better.
struct adev {
	struct drm *ddev_p = &ddev;
	struct type *odev_p = &odev;
	struct drm ddev;
	struct type odev;
}

> "undone" first, since the DRM layer may finish with a device, but
> the device may 

RE: [PATCH v4 1/8] drm/amdgpu: Avoid accessing HW when suspending SW state

2020-09-02 Thread Li, Dennis

Hi, andrey

Did you want to use adev->in_pci_err_recovery to avoid the hardware being
accessed by other threads while doing PCI recovery? If so, it is better to use
a lock to protect these accesses. This patch can't solve your issue completely.
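
For illustration, a minimal sketch of the kind of locking being suggested here (the lock name is hypothetical, not part of the patchset): the flag check and the HW access become one atomic unit, so a reader cannot race with the thread that flips the flag.

	u32 ret = 0;
	unsigned long flags;

	/* Hypothetical adev->pci_recovery_lock paired with in_pci_err_recovery:
	 * the flag cannot change between the check and the MMIO read. */
	spin_lock_irqsave(&adev->pci_recovery_lock, flags);
	if (!adev->in_pci_err_recovery)
		ret = readl(adev->rmmio + (reg * 4));
	spin_unlock_irqrestore(&adev->pci_recovery_lock, flags);
	return ret;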

Best Regards
Dennis Li
-Original Message-
From: Andrey Grodzovsky  
Sent: Thursday, September 3, 2020 2:42 AM
To: amd-gfx@lists.freedesktop.org; sathyanarayanan.kuppusw...@linux.intel.com; 
linux-...@vger.kernel.org
Cc: Deucher, Alexander ; Das, Nirmoy 
; Li, Dennis ; Koenig, Christian 
; Tuikov, Luben ; 
bhelg...@google.com; Grodzovsky, Andrey 
Subject: [PATCH v4 1/8] drm/amdgpu: Avoid accessing HW when suspending SW state

At this point the ASIC has already been reset by the HW/PSP, so the HW is not in
a proper state to be configured for suspension; some blocks might even be gated,
so it is best to avoid touching it.

v2: Rename in_dpc to more meaningful name

Signed-off-by: Andrey Grodzovsky 
Reviewed-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h|  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 38 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c|  6 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c|  6 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 18 --
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c |  3 +++
 6 files changed, 65 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index c311a3c..b20354f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -992,6 +992,7 @@ struct amdgpu_device {
atomic_tthrottling_logging_enabled;
struct ratelimit_state  throttling_logging_rs;
uint32_tras_features;
+   boolin_pci_err_recovery;
 };
 
 static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 74a1c03..1fbf8a1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -319,6 +319,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
 {
uint32_t ret;
 
+   if (adev->in_pci_err_recovery)
+   return 0;
+
if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
return amdgpu_kiq_rreg(adev, reg);
 
@@ -351,6 +354,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
  * Returns the 8 bit value from the offset specified.
  */
 uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
+   if (adev->in_pci_err_recovery)
+   return 0;
+
if (offset < adev->rmmio_size)
return (readb(adev->rmmio + offset));
BUG();
@@ -372,6 +378,9 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
  * Writes the value specified to the offset specified.
  */
 void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value) {
+   if (adev->in_pci_err_recovery)
+   return;
+
if (offset < adev->rmmio_size)
writeb(value, adev->rmmio + offset);
else
@@ -382,6 +391,9 @@ static inline void amdgpu_mm_wreg_mmio(struct amdgpu_device *adev,
   uint32_t reg, uint32_t v,
   uint32_t acc_flags)
 {
+   if (adev->in_pci_err_recovery)
+   return;
+
trace_amdgpu_mm_wreg(adev->pdev->device, reg, v);
 
if ((reg * 4) < adev->rmmio_size)
@@ -409,6 +421,9 @@ static inline void amdgpu_mm_wreg_mmio(struct amdgpu_device *adev,
 void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
uint32_t acc_flags)
 {
+   if (adev->in_pci_err_recovery)
+   return;
+
if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
return amdgpu_kiq_wreg(adev, reg, v);
 
@@ -423,6 +438,9 @@ void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
 void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
uint32_t acc_flags)
 {
+   if (adev->in_pci_err_recovery)
+   return;
+
if (amdgpu_sriov_fullaccess(adev) &&
adev->gfx.rlc.funcs &&
adev->gfx.rlc.funcs->is_rlcg_access_range) {
@@ -444,6 +462,9 @@ void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, uint32_t
  */
 u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)  {
+   if (adev->in_pci_err_recovery)
+   return 0;
+
if ((reg * 4) < adev->rio_mem_size)
return ioread32(adev->rio_mem + (reg * 4));
else {
@@ -463,6 +484,9 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 

Re: [PATCH v4 0/8] Implement PCI Error Recovery on Navi12

2020-09-02 Thread Bjorn Helgaas
On Wed, Sep 02, 2020 at 11:43:41PM +, Grodzovsky, Andrey wrote:
> It's based on v5.9-rc2 but won't apply cleanly, since it was applied on top
> of a significant number of amd-staging-drm-next patches.

Is there a git branch published somewhere?  It'd be nice to be able to
see the whole thing, including the bits that this depends on from
amd-staging-drm-next.


Re: [PATCH v4 0/8] Implement PCI Error Recovery on Navi12

2020-09-02 Thread Grodzovsky, Andrey
It's based on v5.9-rc2 but won't apply cleanly, since it was applied on top of
a significant number of amd-staging-drm-next patches.

Andrey

From: Bjorn Helgaas 
Sent: 02 September 2020 17:36
To: Grodzovsky, Andrey 
Cc: amd-gfx@lists.freedesktop.org ; 
sathyanarayanan.kuppusw...@linux.intel.com 
; linux-...@vger.kernel.org 
; Deucher, Alexander ; 
Das, Nirmoy ; Li, Dennis ; Koenig, 
Christian ; Tuikov, Luben ; 
bhelg...@google.com 
Subject: Re: [PATCH v4 0/8] Implement PCI Error Recovery on Navi12

On Wed, Sep 02, 2020 at 02:42:02PM -0400, Andrey Grodzovsky wrote:
> Many PCI bus controllers are able to detect a variety of hardware PCI errors
> on the bus, such as parity errors on the data and address buses. A typical
> action taken is to disconnect the affected device, halting all I/O to it.
> Typically, a reconnection mechanism is also offered, so that the affected PCI
> device(s) are reset and put back into working condition.
> In our case the reconnection mechanism is facilitated by the kernel Downstream
> Port Containment (DPC) driver, which will intercept the PCIe error and remove
> (isolate) the faulting device, after which it will call into the PCIe recovery
> code of the PCI core.
> This code will call hooks which are implemented in this patchset: the error is
> first reported, at which point we block the GPU scheduler; next, DPC resets
> the PCI link, which generates a HW interrupt that is intercepted by SMU/PSP,
> which start executing mode1 reset of the ASIC; the next step is the slot reset
> hook, at which point we wait for the ASIC reset to complete, restore PCI
> config space and run the HW suspend/resume sequence to reinit the ASIC.
> The last hook called is resume normal operation, at which point we will
> restart the GPU scheduler.
>
> More info on PCIe error handling and DPC are here:
> https://www.kernel.org/doc/html/latest/PCI/pci-error-recovery.html
> https://patchwork.kernel.org/patch/8945681/
>
> v4: Rebase to 5.9 kernel and revert the PCI error recovery core commit which
> breaks the feature.

What does this apply to?  I tried

  - v5.9-rc1 (9123e3a74ec7 ("Linux 5.9-rc1")),
  - v5.9-rc2 (d012a7190fc1 ("Linux 5.9-rc2")),
  - v5.9-rc3 (f75aef392f86 ("Linux 5.9-rc3")),
  - drm-next (3393649977f9 ("Merge tag 'drm-intel-next-2020-08-24-1' of 
git://anongit.freedesktop.org/drm/drm-intel into drm-next")),
  - linux-next (4442749a2031 ("Add linux-next specific files for 20200902"))

but it doesn't apply cleanly to any.

> Andrey Grodzovsky (8):
>   drm/amdgpu: Avoid accessing HW when suspending SW state
>   drm/amdgpu: Block all job scheduling activity during DPC recovery
>   drm/amdgpu: Fix SMU error failure
>   drm/amdgpu: Fix consecutive DPC recovery failures.
>   drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code.
>   drm/amdgpu: Disable DPC for XGMI for now.
>   drm/amdgpu: Minor checkpatch fix
>   Revert "PCI/ERR: Update error status after reset_link()"
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h|   6 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 247 
> +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|   4 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c|   6 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c|   6 +
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c |  18 ++-
>  drivers/gpu/drm/amd/amdgpu/nv.c|   4 +-
>  drivers/gpu/drm/amd/amdgpu/soc15.c |   4 +-
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c |   3 +
>  drivers/pci/pcie/err.c |   3 +-
>  10 files changed, 222 insertions(+), 79 deletions(-)
>
> --
> 2.7.4
>


Re: [PATCH v4 4/8] drm/amdgpu: Fix consecutive DPC recovery failures.

2020-09-02 Thread Bjorn Helgaas
On Wed, Sep 02, 2020 at 02:42:06PM -0400, Andrey Grodzovsky wrote:
> Cache the PCI state on boot and before each case where we might
> loose it.

s/loose/lose/

> v2: Add pci_restore_state while caching the PCI state to avoid
> breaking PCI core logic for stuff like suspend/resume.
> 
> v3: Extract pci_restore_state from amdgpu_device_cache_pci_state
> to avoid superfluous restores during GPU resets and suspend/resumes.
> 
> v4: Style fixes.

Is the DRM convention to keep the v2/v3/v4 stuff in the commit log?  I
keep those below the "---" or manually remove them for PCI, but use
the local convention, of course.

> + /* Have stored pci confspace at hand for restore in sudden PCI error */

I assume that at least from the perspective of this code, all PCI
errors are "sudden".  Or if they're not, I'm curious about which would
be sudden and which would not.

> + if (amdgpu_device_cache_pci_state(adev->pdev))
> + pci_restore_state(pdev);

> +bool amdgpu_device_cache_pci_state(struct pci_dev *pdev)
> +{
> + struct drm_device *dev = pci_get_drvdata(pdev);
> + struct amdgpu_device *adev = drm_to_adev(dev);
> + int r;
> +
> + r = pci_save_state(pdev);
> + if (!r) {
> + kfree(adev->pci_state);
> +
> + adev->pci_state = pci_store_saved_state(pdev);
> +
> + if (!adev->pci_state) {
> + DRM_ERROR("Failed to store PCI saved state");
> + return false;
> + }
> + } else {
> + DRM_WARN("Failed to save PCI state, err:%d\n", r);
> + return false;
> + }
> +
> + return true;
> +}
> +
> +bool amdgpu_device_load_pci_state(struct pci_dev *pdev)
> +{
> + struct drm_device *dev = pci_get_drvdata(pdev);
> + struct amdgpu_device *adev = drm_to_adev(dev);
> + int r;
> +
> + if (!adev->pci_state)
> + return false;
> +
> + r = pci_load_saved_state(pdev, adev->pci_state);

I'm a little bit hesitant to pci_load_saved_state() and
pci_store_saved_state() being used here, simply because they're
currently only used by VFIO, Xen, and nvme.  So I don't have a real
objection, but just pointing out that apparently you're doing
something really special that isn't commonly used and tested, so it's
more likely to be broken or incomplete.

There's lots of state that the PCI core *can't* save/restore, and
pci_load_saved_state() doesn't even handle all the architected PCI
state, i.e., we only call pci_add_cap_save_buffer() or
pci_add_ext_cap_save_buffer() for a few of the capabilities.


Re: [PATCH v4 2/8] drm/amdgpu: Block all job scheduling activity during DPC recovery

2020-09-02 Thread Bjorn Helgaas
On Wed, Sep 02, 2020 at 02:42:04PM -0400, Andrey Grodzovsky wrote:
> DPC recovery involves ASIC reset just as normal GPU recovery so block
> SW GPU schedulers and wait on all concurrent GPU resets.

> +  * Cancel and wait for all TDRs in progress if failing to
> +  * set  adev->in_gpu_reset in amdgpu_device_lock_adev

OCD typo, s/set  adev/set adev/ (two spaces)


Re: [PATCH v4 3/8] drm/amdgpu: Fix SMU error failure

2020-09-02 Thread Bjorn Helgaas
On Wed, Sep 02, 2020 at 02:42:05PM -0400, Andrey Grodzovsky wrote:
> Wait for HW/PSP initiated ASIC reset to complete before
> starting the recovery operations.
> 
> v2: Remove typo
> 
> Signed-off-by: Andrey Grodzovsky 
> Reviewed-by: Alex Deucher 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 --
>  1 file changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e999f1f..412d07e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4838,14 +4838,32 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>  {
>   struct drm_device *dev = pci_get_drvdata(pdev);
>   struct amdgpu_device *adev = drm_to_adev(dev);
> - int r;
> + int r, i;
>   bool vram_lost;
> + u32 memsize;
>  
>   DRM_INFO("PCI error: slot reset callback!!\n");
>  
> + /* wait for asic to come out of reset */

I know it's totally OCD, but it is a minor speed bump to read "asic"
here and "ASIC" in the commit log above and the new comment below.

> + msleep(500);
> +
>   pci_restore_state(pdev);
>  
> - adev->in_pci_err_recovery = true;
> + /* confirm  ASIC came out of reset */
> + for (i = 0; i < adev->usec_timeout; i++) {
> + memsize = amdgpu_asic_get_config_memsize(adev);
> +
> + if (memsize != 0x)

I guess this is a spot where you actually depend on an MMIO read
returning 0x because adev->in_pci_err_recovery is false at
this point, so amdgpu_mm_rreg() will actually *do* the MMIO read
instead of returning 0.  Right?

> + break;
> + udelay(1);
> + }
> + if (memsize == 0x) {
> + r = -ETIME;
> + goto out;
> + }
> +
> + /* TODO Call amdgpu_pre_asic_reset instead */
> + adev->in_pci_err_recovery = true;   
>   r = amdgpu_device_ip_suspend(adev);
>   adev->in_pci_err_recovery = false;
>   if (r)
> -- 
> 2.7.4
> 


Re: [PATCH v4 1/8] drm/amdgpu: Avoid accessing HW when suspending SW state

2020-09-02 Thread Bjorn Helgaas
On Wed, Sep 02, 2020 at 02:42:03PM -0400, Andrey Grodzovsky wrote:
> At this point the ASIC has already been reset by the HW/PSP,
> so the HW is not in a proper state to be configured for suspension;
> some blocks might even be gated, so it is best to avoid touching it.
> 
> v2: Rename in_dpc to more meaningful name
> 
> Signed-off-by: Andrey Grodzovsky 
> Reviewed-by: Alex Deucher 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h|  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 38 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c|  6 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c|  6 +
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 18 --
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c |  3 +++
>  6 files changed, 65 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index c311a3c..b20354f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -992,6 +992,7 @@ struct amdgpu_device {
>   atomic_tthrottling_logging_enabled;
>   struct ratelimit_state  throttling_logging_rs;
>   uint32_tras_features;
> + boolin_pci_err_recovery;
>  };
>  
>  static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 74a1c03..1fbf8a1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -319,6 +319,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
>  {
>   uint32_t ret;
>  
> + if (adev->in_pci_err_recovery)
> + return 0;

I don't know the whole scheme of this, but this looks racy.

It looks like the normal path through this function is the readl(),
which I assume is an MMIO read from the PCI device.  If this is called
after a PCI error occurs, but before amdgpu_pci_slot_reset() sets
adev->in_pci_err_recovery, the readl() will return 0x.

If this is called after amdgpu_pci_slot_reset() sets
adev->in_pci_err_recovery, it will return 0.  Do you really want those
two different cases?

>   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
>   return amdgpu_kiq_rreg(adev, reg);

> @@ -4773,7 +4809,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>  
>   pci_restore_state(pdev);
>  
> + adev->in_pci_err_recovery = true;
>   r = amdgpu_device_ip_suspend(adev);
> + adev->in_pci_err_recovery = false;
>   if (r)
>   goto out;


Re: [PATCH v4 0/8] Implement PCI Error Recovery on Navi12

2020-09-02 Thread Bjorn Helgaas
On Wed, Sep 02, 2020 at 02:42:02PM -0400, Andrey Grodzovsky wrote:
> Many PCI bus controllers are able to detect a variety of hardware PCI errors
> on the bus, such as parity errors on the data and address buses. A typical
> action taken is to disconnect the affected device, halting all I/O to it.
> Typically, a reconnection mechanism is also offered, so that the affected PCI
> device(s) are reset and put back into working condition.
> In our case the reconnection mechanism is facilitated by the kernel Downstream
> Port Containment (DPC) driver, which will intercept the PCIe error and remove
> (isolate) the faulting device, after which it will call into the PCIe recovery
> code of the PCI core.
> This code will call hooks which are implemented in this patchset: the error is
> first reported, at which point we block the GPU scheduler; next, DPC resets
> the PCI link, which generates a HW interrupt that is intercepted by SMU/PSP,
> which start executing mode1 reset of the ASIC; the next step is the slot reset
> hook, at which point we wait for the ASIC reset to complete, restore PCI
> config space and run the HW suspend/resume sequence to reinit the ASIC.
> The last hook called is resume normal operation, at which point we will
> restart the GPU scheduler.
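
The hooks named in this flow correspond to the PCI core's error-handler table. As a sketch of the wiring, using the handler names this patchset declares (the table itself is illustrative, not quoted from the patches):

	static const struct pci_error_handlers amdgpu_pci_err_handler = {
		.error_detected = amdgpu_pci_error_detected, /* block GPU scheduler */
		.mmio_enabled   = amdgpu_pci_mmio_enabled,
		.slot_reset     = amdgpu_pci_slot_reset,     /* wait for ASIC reset, reinit HW */
		.resume         = amdgpu_pci_resume,         /* restart GPU scheduler */
	};

The driver points its struct pci_driver at this table via .err_handler, and pcie_do_recovery() in the PCI core invokes each step in the order described above.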
> 
> More info on PCIe error handling and DPC are here:
> https://www.kernel.org/doc/html/latest/PCI/pci-error-recovery.html
> https://patchwork.kernel.org/patch/8945681/
> 
> v4: Rebase to 5.9 kernel and revert the PCI error recovery core commit which
> breaks the feature.

What does this apply to?  I tried 

  - v5.9-rc1 (9123e3a74ec7 ("Linux 5.9-rc1")),
  - v5.9-rc2 (d012a7190fc1 ("Linux 5.9-rc2")),
  - v5.9-rc3 (f75aef392f86 ("Linux 5.9-rc3")),
  - drm-next (3393649977f9 ("Merge tag 'drm-intel-next-2020-08-24-1' of 
git://anongit.freedesktop.org/drm/drm-intel into drm-next")),
  - linux-next (4442749a2031 ("Add linux-next specific files for 20200902"))

but it doesn't apply cleanly to any.

> Andrey Grodzovsky (8):
>   drm/amdgpu: Avoid accessing HW when suspending SW state
>   drm/amdgpu: Block all job scheduling activity during DPC recovery
>   drm/amdgpu: Fix SMU error failure
>   drm/amdgpu: Fix consecutive DPC recovery failures.
>   drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code.
>   drm/amdgpu: Disable DPC for XGMI for now.
>   drm/amdgpu: Minor checkpatch fix
>   Revert "PCI/ERR: Update error status after reset_link()"
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h|   6 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 247 
> +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|   4 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c|   6 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c|   6 +
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c |  18 ++-
>  drivers/gpu/drm/amd/amdgpu/nv.c|   4 +-
>  drivers/gpu/drm/amd/amdgpu/soc15.c |   4 +-
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c |   3 +
>  drivers/pci/pcie/err.c |   3 +-
>  10 files changed, 222 insertions(+), 79 deletions(-)
> 
> -- 
> 2.7.4
> 


Re: [PATCH v4 8/8] Revert "PCI/ERR: Update error status after reset_link()"

2020-09-02 Thread Kuppuswamy, Sathyanarayanan



On 9/2/20 12:54 PM, Andrey Grodzovsky wrote:

Yes, works also.

Can you provide me a formal patch that I can commit into our local amd staging
tree with my patch set?

https://patchwork.kernel.org/patch/11684175/mbox/


Alex - is that how we want to do it? Without this patch or a revert of the
original patch, the feature is broken.


Andrey

On 9/2/20 3:00 PM, Kuppuswamy, Sathyanarayanan wrote:



On 9/2/20 11:42 AM, Andrey Grodzovsky wrote:

This reverts commit 6d2c89441571ea534d6240f7724f518936c44f8d.

In the code below

 pci_walk_bus(bus, report_frozen_detected, &status);
-   if (reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
+   status = reset_link(dev, service);

status returned from report_frozen_detected is unconditionally masked
by status returned from reset_link which is wrong.

This breaks error recovery implementation for AMDGPU driver
by masking PCI_ERS_RESULT_NEED_RESET returned from amdgpu_pci_error_detected
and hence skipping the slot reset callback which is necessary for proper
ASIC recovery. Effectively no other callback besides resume callback will
be called after link reset the way it is implemented now regardless of what
value error_detected callback returns.


}

Instead of reverting this change, can you try the following patch?
https://lore.kernel.org/linux-pci/56ad4901-725f-7b88-2117-b124b28b027f@linux.intel.com/T/#me8029c04f63c21f9d1cb3b1ba2aeffbca3a60df5





--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH 0/3] Use implicit kref infra

2020-09-02 Thread Daniel Vetter
On Wed, Sep 2, 2020 at 9:55 PM Daniel Vetter  wrote:
>
> On Wed, Sep 2, 2020 at 9:17 PM Alex Deucher  wrote:
> >
> > On Wed, Sep 2, 2020 at 3:04 PM Luben Tuikov  wrote:
> > >
> > > On 2020-09-02 11:51 a.m., Daniel Stone wrote:
> > > > Hi Luben,
> > > >
> > > > On Wed, 2 Sep 2020 at 16:16, Luben Tuikov  wrote:
> > > >> Not sure how I can do this when someone doesn't want to read up on
> > > >> the kref infrastructure. Can you help?
> > > >>
> > > >> When someone starts off with "My understanding of ..." (as in the OP) 
> > > >> you know you're
> > > >> in trouble and in for a rough time.
> > > >>
> > > >> Such is the nature of world-wide open-to-everyone mailing lists where
> > > >> anyone can put forth an argument, regardless of their level of 
> > > >> understanding.
> > > >> The more obfuscated an argument, the more uncertainty.
> > > >>
> > > >> If one knows the kref infrastructure, it just clicks, no explanation
> > > >> necessary.
> > > >
> > > > Evidently there are more points of view than yours. Evidently your
> > > > method of persuasion is also not working, because this thread is now
> > > > getting quite long and not converging on your point of view (which you
> > > > are holding to be absolutely objectively correct).
> > > >
> > > > I think you need to re-evaluate the way in which you speak to people,
> > > > considering that it costs nothing to be polite and considerate, and
> > > > also takes effort to be rude and dismissive.
> > >
> > > Not sure how to help this:
> > >
> > > > My understanding of the drm core code is like something below.
> > > > struct B {
> > > >   strcut A
> > > > }
> > > > we initialize A firstly and initialize B in the end. But destroy B 
> > > > firstly and destory A in the end.
> > >
> >
> > Luben, please tone it down a bit.  You are coming across very harshly.
> > You do make a good point though.  What is the point of having the drm
> > release callback if it's ostensibly useless?  We should either use it
> > as intended to release the structures allocated by the driver or the
> > drm core should handle it all.  With the managed resources there is an
> > incongruity between allocation and freeing which leads to confusion.
> > Even with the proposed updated documentation, it's not clear to me who
> > should use the managed resources or not.  My understanding was that it
> > was optional for drivers that wanted it.
>
> In an ideal world this would all be perfectly clean. In reality we
> have huge existing drivers which, if at all feasible, can only be
> converted over step by step.
>
> So with that there's a few ways we can make this happen:
> - drmm resources are cleaned up before ->release is called. This means
> doing auto-cleanup of the final steps like cleanup up drm_device
> resources is gated on all drivers first being converted completely
> over to drmm, which is never going to happen. And it's holding up
> removing all the fairly simple cleanup code from small driver, which
> is where managed resources (whether drmm or devm) have the most
> benefit, often they completely eliminate the need for any explicit
> teardown code.
> - start in the middle. That just oopses because the unwind order isn't
> the inverse of the setup order anymore, and generally that's required.
> - start at the end. Unfortunately this means that the drm_device
> cannot be freed in the driver's ->release callback, therefore for
> transition purposes I had to sprinkle drmm_add_final_kfree all over
> the place. But if you use devm_drm_dev_alloc (like the updated docs
> recommend) that's not needed anymore, so really not an eyesore for
> driver developers.
>
> Yes there's mildly tricky code in the core as a result, but given that
> you guys won't volunteer to fix up the entire subsystem either, we just
> have to live with that I think. Also, the commit adding the
> drm_managed stuff does explain these implementation details and the
> reasons why.

Also note that tons of stuff in drm doesn't yet provide drmm_
versions, teardown-less drivers really only works for really simple
ones. So completely getting rid of the ->release callback will also
need lots of core work, like the currently in-flight series to add
more drmm_ helpers for kms objects:

https://lore.kernel.org/dri-devel/20200827160545.1146-1-p.za...@pengutronix.de/

Help obviously very much welcome.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH 0/3] Use implicit kref infra

2020-09-02 Thread Daniel Vetter
On Wed, Sep 2, 2020 at 9:17 PM Alex Deucher  wrote:
>
> On Wed, Sep 2, 2020 at 3:04 PM Luben Tuikov  wrote:
> >
> > On 2020-09-02 11:51 a.m., Daniel Stone wrote:
> > > Hi Luben,
> > >
> > > On Wed, 2 Sep 2020 at 16:16, Luben Tuikov  wrote:
> > >> Not sure how I can do this when someone doesn't want to read up on
> > >> the kref infrastructure. Can you help?
> > >>
> > >> When someone starts off with "My understanding of ..." (as in the OP) 
> > >> you know you're
> > >> in trouble and in for a rough time.
> > >>
> > >> Such is the nature of world-wide open-to-everyone mailing lists where
> > >> anyone can put forth an argument, regardless of their level of 
> > >> understanding.
> > >> The more obfuscated an argument, the more uncertainty.
> > >>
> > >> If one knows the kref infrastructure, it just clicks, no explanation
> > >> necessary.
> > >
> > > Evidently there are more points of view than yours. Evidently your
> > > method of persuasion is also not working, because this thread is now
> > > getting quite long and not converging on your point of view (which you
> > > are holding to be absolutely objectively correct).
> > >
> > > I think you need to re-evaluate the way in which you speak to people,
> > > considering that it costs nothing to be polite and considerate, and
> > > also takes effort to be rude and dismissive.
> >
> > Not sure how to help this:
> >
> > > My understanding of the drm core code is like something below.
> > > struct B {
> > >   strcut A
> > > }
> > > we initialize A firstly and initialize B in the end. But destroy B 
> > > firstly and destory A in the end.
> >
>
> Luben, please tone it down a bit.  You are coming across very harshly.
> You do make a good point though.  What is the point of having the drm
> release callback if it's ostensibly useless?  We should either use it
> as intended to release the structures allocated by the driver or the
> drm core should handle it all.  With the managed resources there is an
> incongruity between allocation and freeing which leads to confusion.
> Even with the proposed updated documentation, it's not clear to me who
> should use the managed resources or not.  My understanding was that it
> was optional for drivers that wanted it.

In an ideal world this would all be perfectly clean. In reality we
have huge existing drivers which, if at all feasible, can only be
converted over step by step.

So with that there's a few ways we can make this happen:
- drmm resources are cleaned up before ->release is called. This means
doing auto-cleanup of the final steps like cleanup up drm_device
resources is gated on all drivers first being converted completely
over to drmm, which is never going to happen. And it's holding up
removing all the fairly simple cleanup code from small driver, which
is where managed resources (whether drmm or devm) have the most
benefit, often they completely eliminate the need for any explicit
teardown code.
- start in the middle. That just oopses because the unwind order isn't
the inverse of the setup order anymore, and generally that's required.
- start at the end. Unfortunately this means that the drm_device
cannot be freed in the driver's ->release callback, therefore for
transition purposes I had to sprinkle drmm_add_final_kfree all over
the place. But if you use devm_drm_dev_alloc (like the updated docs
recommend) that's not needed anymore, so really not an eyesore for
driver developers.
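
As a concrete illustration of that recommendation, a minimal probe sketch with made-up names (my_device and my_drm_driver are hypothetical):

	struct my_device {
		struct drm_device drm;		/* must be embedded by value */
		void __iomem *mmio;
	};

	static int my_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
	{
		struct my_device *mdev;

		/* Allocates the container, initializes the embedded drm_device and
		 * ties its lifetime to the device: no drmm_add_final_kfree and no
		 * kfree in a ->release callback are needed for the structure. */
		mdev = devm_drm_dev_alloc(&pdev->dev, &my_drm_driver,
					  struct my_device, drm);
		if (IS_ERR(mdev))
			return PTR_ERR(mdev);

		return drm_dev_register(&mdev->drm, 0);
	}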

Yes there's mildly tricky code in the core as a result, but given that
you guys won't volunteer to fix up the entire subsystem either, we just
have to live with that I think. Also, the commit adding the
drm_managed stuff does explain these implementation details and the
reasons why.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH v4 8/8] Revert "PCI/ERR: Update error status after reset_link()"

2020-09-02 Thread Andrey Grodzovsky

Yes, works also.

Can you provide me a formal patch that I can commit into our local amd staging
tree with my patch set?


Alex - is that how we want to do it? Without this patch or a revert of the
original patch, the feature is broken.


Andrey

On 9/2/20 3:00 PM, Kuppuswamy, Sathyanarayanan wrote:



On 9/2/20 11:42 AM, Andrey Grodzovsky wrote:

This reverts commit 6d2c89441571ea534d6240f7724f518936c44f8d.

In the code below

 pci_walk_bus(bus, report_frozen_detected, &status);
-   if (reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
+   status = reset_link(dev, service);

status returned from report_frozen_detected is unconditionally masked
by status returned from reset_link which is wrong.

This breaks error recovery implementation for AMDGPU driver
by masking PCI_ERS_RESULT_NEED_RESET returned from amdgpu_pci_error_detected
and hence skipping the slot reset callback which is necessary for proper
ASIC recovery. Effectively no other callback besides resume callback will
be called after link reset the way it is implemented now regardless of what
value error_detected callback returns.


}

Instead of reverting this change, can you try the following patch?
https://lore.kernel.org/linux-pci/56ad4901-725f-7b88-2117-b124b28b027f@linux.intel.com/T/#me8029c04f63c21f9d1cb3b1ba2aeffbca3a60df5






suspicious conditions in amd display driver

2020-09-02 Thread Denis Efremov
Hi,

I've found suspicious code patterns in the amd display driver.
To me they look like redundant comparisons, but maybe they are logic bugs,
and dm_444_16 and dm_whole_buffer_for_single_stream_interleave should be
changed to other variables in the second disjuncts.

Here they are:

diff -u -p ./drivers/gpu/drm/amd/display/dc/dml/dcn30/display_mode_vba_30.c /tmp/nothing/drivers/gpu/drm/amd/display/dc/dml/dcn30/display_mode_vba_30.c
--- ./drivers/gpu/drm/amd/display/dc/dml/dcn30/display_mode_vba_30.c
+++ /tmp/nothing/drivers/gpu/drm/amd/display/dc/dml/dcn30/display_mode_vba_30.c
@@ -3235,7 +3235,6 @@ static bool CalculateBytePerPixelAnd256B
*BytePerPixelDETC = 0;
*BytePerPixelY = 4;
*BytePerPixelC = 0;
-   } else if (SourcePixelFormat == dm_444_16 || SourcePixelFormat == dm_444_16) { // <== same comparison with dm_444_16
*BytePerPixelDETY = 2;
*BytePerPixelDETC = 0;
*BytePerPixelY = 2;

@@ -5515,7 +5514,6 @@ static void CalculateWatermarksAndDRAMSp
if (WritebackPixelFormat[k] == dm_444_64) {
WritebackDRAMClockChangeLatencyHiding = 
WritebackDRAMClockChangeLatencyHiding / 2;
}
-   if (mode_lib->vba.WritebackConfiguration == dm_whole_buffer_for_single_stream_interleave || mode_lib->vba.WritebackConfiguration == dm_whole_buffer_for_single_stream_interleave) { // <== same comparison with dm_whole_buffer_for_single_stream_interleave


diff -u -p ./drivers/gpu/drm/amd/display/dc/dml/dcn30/display_rq_dlg_calc_30.c /tmp/nothing/drivers/gpu/drm/amd/display/dc/dml/dcn30/display_rq_dlg_calc_30.c
--- ./drivers/gpu/drm/amd/display/dc/dml/dcn30/display_rq_dlg_calc_30.c
+++ /tmp/nothing/drivers/gpu/drm/amd/display/dc/dml/dcn30/display_rq_dlg_calc_30.c
@@ -279,7 +279,6 @@ static bool CalculateBytePerPixelAnd256B
*BytePerPixelDETC = 0;
*BytePerPixelY = 4;
*BytePerPixelC = 0;
-   } else if (SourcePixelFormat == dm_444_16 || SourcePixelFormat == dm_444_16) { // <== same comparison with dm_444_16
*BytePerPixelDETY = 2;
*BytePerPixelDETC = 0;
*BytePerPixelY = 2;
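
For reference, since the two disjuncts in each flagged condition are identical, the code as written reduces to a single test, e.g.:

	/* As posted, this: */
	} else if (SourcePixelFormat == dm_444_16 || SourcePixelFormat == dm_444_16) {
	/* behaves exactly like: */
	} else if (SourcePixelFormat == dm_444_16) {

which is why each occurrence is either dead redundancy or a mistyped second operand.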

Thanks,
Denis


Re: [PATCH v4 8/8] Revert "PCI/ERR: Update error status after reset_link()"

2020-09-02 Thread Kuppuswamy, Sathyanarayanan




On 9/2/20 11:42 AM, Andrey Grodzovsky wrote:

This reverts commit 6d2c89441571ea534d6240f7724f518936c44f8d.

In the code below

 pci_walk_bus(bus, report_frozen_detected, &status);
-   if (reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
+   status = reset_link(dev, service);

status returned from report_frozen_detected is unconditionally masked
by status returned from reset_link which is wrong.

This breaks error recovery implementation for AMDGPU driver
by masking PCI_ERS_RESULT_NEED_RESET returned from amdgpu_pci_error_detected
and hence skipping the slot reset callback which is necessary for proper
ASIC recovery. Effectively no other callback besides resume callback will
be called after link reset the way it is implemented now regardless of what
value error_detected callback returns.


}

Instead of reverting this change, can you try the following patch?
https://lore.kernel.org/linux-pci/56ad4901-725f-7b88-2117-b124b28b027f@linux.intel.com/T/#me8029c04f63c21f9d1cb3b1ba2aeffbca3a60df5

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH 0/3] Use implicit kref infra

2020-09-02 Thread Alex Deucher
On Wed, Sep 2, 2020 at 3:04 PM Luben Tuikov  wrote:
>
> On 2020-09-02 11:51 a.m., Daniel Stone wrote:
> > Hi Luben,
> >
> > On Wed, 2 Sep 2020 at 16:16, Luben Tuikov  wrote:
> >> Not sure how I can do this when someone doesn't want to read up on
> >> the kref infrastructure. Can you help?
> >>
> >> When someone starts off with "My understanding of ..." (as in the OP) you 
> >> know you're
> >> in trouble and in for a rough time.
> >>
> >> Such is the nature of world-wide open-to-everyone mailing lists where
> >> anyone can put forth an argument, regardless of their level of 
> >> understanding.
> >> The more obfuscated an argument, the more uncertainty.
> >>
> >> If one knows the kref infrastructure, it just clicks, no explanation
> >> necessary.
> >
> > Evidently there are more points of view than yours. Evidently your
> > method of persuasion is also not working, because this thread is now
> > getting quite long and not converging on your point of view (which you
> > are holding to be absolutely objectively correct).
> >
> > I think you need to re-evaluate the way in which you speak to people,
> > considering that it costs nothing to be polite and considerate, and
> > also takes effort to be rude and dismissive.
>
> Not sure how to help this:
>
> > My understanding of the drm core code is like something below.
> > struct B {
> >   strcut A
> > }
> > we initialize A firstly and initialize B in the end. But destroy B firstly 
> > and destory A in the end.
>

Luben, please tone it down a bit.  You are coming across very harshly.
You do make a good point though.  What is the point of having the drm
release callback if it's ostensibly useless?  We should either use it
as intended to release the structures allocated by the driver or the
drm core should handle it all.  With the managed resources there is an
incongruity between allocation and freeing which leads to confusion.
Even with the proposed updated documentation, it's not clear to me who
should use the managed resources or not.  My understanding was that it
was optional for drivers that wanted it.

Alex


Re: [PATCH 0/3] Use implicit kref infra

2020-09-02 Thread Luben Tuikov
On 2020-09-02 11:51 a.m., Daniel Stone wrote:
> Hi Luben,
> 
> On Wed, 2 Sep 2020 at 16:16, Luben Tuikov  wrote:
>> Not sure how I can do this when someone doesn't want to read up on
>> the kref infrastructure. Can you help?
>>
>> When someone starts off with "My understanding of ..." (as in the OP) you 
>> know you're
>> in trouble and in for a rough time.
>>
>> Such is the nature of world-wide open-to-everyone mailing lists where
>> anyone can put forth an argument, regardless of their level of understanding.
>> The more obfuscated an argument, the more uncertainty.
>>
>> If one knows the kref infrastructure, it just clicks, no explanation
>> necessary.
> 
> Evidently there are more points of view than yours. Evidently your
> method of persuasion is also not working, because this thread is now
> getting quite long and not converging on your point of view (which you
> are holding to be absolutely objectively correct).
> 
> I think you need to re-evaluate the way in which you speak to people,
> considering that it costs nothing to be polite and considerate, and
> also takes effort to be rude and dismissive.

Not sure how to help this:

> My understanding of the drm core code is like something below.
> struct B { 
>   strcut A 
> }
> we initialize A firstly and initialize B in the end. But destroy B firstly 
> and destory A in the end.

Regards,
Luben


[PATCH v4 4/8] drm/amdgpu: Fix consecutive DPC recovery failures.

2020-09-02 Thread Andrey Grodzovsky
Cache the PCI state on boot and before each case where we might
loose it.

v2: Add pci_restore_state while caching the PCI state to avoid
breaking PCI core logic for stuff like suspend/resume.

v3: Extract pci_restore_state from amdgpu_device_cache_pci_state
to avoid superfluous restores during GPU resets and suspend/resumes.

v4: Style fixes.

Signed-off-by: Andrey Grodzovsky 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h|  5 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  4 +-
 drivers/gpu/drm/amd/amdgpu/nv.c|  4 +-
 drivers/gpu/drm/amd/amdgpu/soc15.c |  4 +-
 5 files changed, 70 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index b20354f..13f92de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -992,7 +992,9 @@ struct amdgpu_device {
atomic_tthrottling_logging_enabled;
struct ratelimit_state  throttling_logging_rs;
uint32_tras_features;
+
boolin_pci_err_recovery;
+   struct pci_saved_state  *pci_state;
 };
 
 static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
@@ -1272,6 +1274,9 @@ pci_ers_result_t amdgpu_pci_mmio_enabled(struct pci_dev *pdev);
 pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev);
 void amdgpu_pci_resume(struct pci_dev *pdev);
 
+bool amdgpu_device_cache_pci_state(struct pci_dev *pdev);
+bool amdgpu_device_load_pci_state(struct pci_dev *pdev);
+
 #include "amdgpu_object.h"
 
 /* used by df_v3_6.c and amdgpu_pmu.c */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 412d07e..174e09b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1283,7 +1283,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
 
pci_set_power_state(dev->pdev, PCI_D0);
-   pci_restore_state(dev->pdev);
+   amdgpu_device_load_pci_state(dev->pdev);
r = pci_enable_device(dev->pdev);
if (r)
DRM_WARN("pci_enable_device failed (%d)\n", r);
@@ -1296,7 +1296,7 @@ static void amdgpu_switcheroo_set_state(struct pci_dev *pdev,
drm_kms_helper_poll_disable(dev);
dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
amdgpu_device_suspend(dev, true);
-   pci_save_state(dev->pdev);
+   amdgpu_device_cache_pci_state(dev->pdev);
/* Shut down the device */
pci_disable_device(dev->pdev);
pci_set_power_state(dev->pdev, PCI_D3cold);
@@ -3399,6 +3399,10 @@ int amdgpu_device_init(struct amdgpu_device *adev,
if (r)
dev_err(adev->dev, "amdgpu_pmu_init failed\n");
 
+   /* Have stored pci confspace at hand for restore in sudden PCI error */
+   if (amdgpu_device_cache_pci_state(adev->pdev))
+   pci_restore_state(pdev);
+
return 0;
 
 failed:
@@ -3423,6 +3427,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
flush_delayed_work(>delayed_init_work);
adev->shutdown = true;
 
+   kfree(adev->pci_state);
+
/* make sure IB test finished before entering exclusive mode
 * to avoid preemption on IB test
 * */
@@ -4847,7 +4853,7 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
/* wait for asic to come out of reset */
msleep(500);
 
-   pci_restore_state(pdev);
+   amdgpu_device_load_pci_state(pdev);
 
/* confirm  ASIC came out of reset */
for (i = 0; i < adev->usec_timeout; i++) {
@@ -4927,6 +4933,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 out:
 
if (!r) {
+   if (amdgpu_device_cache_pci_state(adev->pdev))
+   pci_restore_state(adev->pdev);
+
DRM_INFO("PCIe error recovery succeeded\n");
} else {
DRM_ERROR("PCIe error recovery failed, err:%d", r);
@@ -4966,3 +4975,50 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
 
amdgpu_device_unlock_adev(adev);
 }
+
+bool amdgpu_device_cache_pci_state(struct pci_dev *pdev)
+{
+   struct drm_device *dev = pci_get_drvdata(pdev);
+   struct amdgpu_device *adev = drm_to_adev(dev);
+   int r;
+
+   r = pci_save_state(pdev);
+   if (!r) {
+   kfree(adev->pci_state);
+
+   adev->pci_state = pci_store_saved_state(pdev);
+
+   if (!adev->pci_state) {
+   DRM_ERROR("Failed to store PCI saved state");
+   return false;
+   }
+   } else {
+   DRM_WARN("Failed 

[PATCH v4 8/8] Revert "PCI/ERR: Update error status after reset_link()"

2020-09-02 Thread Andrey Grodzovsky
This reverts commit 6d2c89441571ea534d6240f7724f518936c44f8d.

In the code below

pci_walk_bus(bus, report_frozen_detected, &status);
-   if (reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
+   status = reset_link(dev, service);

the status returned from report_frozen_detected is unconditionally
overwritten by the status returned from reset_link, which is wrong.

This breaks the error recovery implementation of the AMDGPU driver
by masking the PCI_ERS_RESULT_NEED_RESET returned from amdgpu_pci_error_detected
and hence skipping the slot reset callback, which is necessary for proper
ASIC recovery. Effectively, as implemented now, no callback besides the resume
callback will be called after link reset, regardless of what value the
error_detected callback returns.

In general, step 6.1.4, which describes link reset, is not as well defined
as the other steps in terms of the expected return values and the
appropriate next actions.

Signed-off-by: Andrey Grodzovsky 
---
 drivers/pci/pcie/err.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index c543f41..81dd719 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -165,8 +165,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
pci_dbg(dev, "broadcast error_detected message\n");
if (state == pci_channel_io_frozen) {
pci_walk_bus(bus, report_frozen_detected, &status);
-   status = reset_link(dev);
-   if (status != PCI_ERS_RESULT_RECOVERED) {
+   if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) {
pci_warn(dev, "link reset failed\n");
goto failed;
}
-- 
2.7.4



[PATCH v4 1/8] drm/amdgpu: Avoid accessing HW when suspending SW state

2020-09-02 Thread Andrey Grodzovsky
At this point the ASIC has already been reset by the HW/PSP,
so the HW is not in a proper state to be configured for suspension;
some blocks might even be gated, so it is best to avoid touching them.

v2: Rename in_dpc to a more meaningful name

Signed-off-by: Andrey Grodzovsky 
Reviewed-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h|  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 38 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c|  6 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c|  6 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 18 --
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c |  3 +++
 6 files changed, 65 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index c311a3c..b20354f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -992,6 +992,7 @@ struct amdgpu_device {
atomic_tthrottling_logging_enabled;
struct ratelimit_state  throttling_logging_rs;
uint32_tras_features;
+   boolin_pci_err_recovery;
 };
 
 static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 74a1c03..1fbf8a1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -319,6 +319,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, 
uint32_t reg,
 {
uint32_t ret;
 
+   if (adev->in_pci_err_recovery)
+   return 0;
+
if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
return amdgpu_kiq_rreg(adev, reg);
 
@@ -351,6 +354,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, 
uint32_t reg,
  * Returns the 8 bit value from the offset specified.
  */
 uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
+   if (adev->in_pci_err_recovery)
+   return 0;
+
if (offset < adev->rmmio_size)
return (readb(adev->rmmio + offset));
BUG();
@@ -372,6 +378,9 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, 
uint32_t offset) {
  * Writes the value specified to the offset specified.
  */
 void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t 
value) {
+   if (adev->in_pci_err_recovery)
+   return;
+
if (offset < adev->rmmio_size)
writeb(value, adev->rmmio + offset);
else
@@ -382,6 +391,9 @@ static inline void amdgpu_mm_wreg_mmio(struct amdgpu_device 
*adev,
   uint32_t reg, uint32_t v,
   uint32_t acc_flags)
 {
+   if (adev->in_pci_err_recovery)
+   return;
+
trace_amdgpu_mm_wreg(adev->pdev->device, reg, v);
 
if ((reg * 4) < adev->rmmio_size)
@@ -409,6 +421,9 @@ static inline void amdgpu_mm_wreg_mmio(struct amdgpu_device 
*adev,
 void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v,
uint32_t acc_flags)
 {
+   if (adev->in_pci_err_recovery)
+   return;
+
if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
return amdgpu_kiq_wreg(adev, reg, v);
 
@@ -423,6 +438,9 @@ void amdgpu_mm_wreg(struct amdgpu_device *adev, uint32_t 
reg, uint32_t v,
 void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, uint32_t reg, 
uint32_t v,
uint32_t acc_flags)
 {
+   if (adev->in_pci_err_recovery)
+   return;
+
if (amdgpu_sriov_fullaccess(adev) &&
adev->gfx.rlc.funcs &&
adev->gfx.rlc.funcs->is_rlcg_access_range) {
@@ -444,6 +462,9 @@ void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev, 
uint32_t reg, uint32_t
  */
 u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
 {
+   if (adev->in_pci_err_recovery)
+   return 0;
+
if ((reg * 4) < adev->rio_mem_size)
return ioread32(adev->rio_mem + (reg * 4));
else {
@@ -463,6 +484,9 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
  */
 void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
 {
+   if (adev->in_pci_err_recovery)
+   return;
+
if ((reg * 4) < adev->rio_mem_size)
iowrite32(v, adev->rio_mem + (reg * 4));
else {
@@ -482,6 +506,9 @@ void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, 
u32 v)
  */
 u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 index)
 {
+   if (adev->in_pci_err_recovery)
+   return 0;
+
if (index < adev->doorbell.num_doorbells) {
return readl(adev->doorbell.ptr + index);
} else {
@@ -502,6 +529,9 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 
index)
  */
 void amdgpu_mm_wdoorbell(struct 

[PATCH v4 6/8] drm/amdgpu: Disable DPC for XGMI for now.

2020-09-02 Thread Andrey Grodzovsky
XGMI support is more complicated than single device support, as
synchronization between the device recovering from the PCI error
and the other members of the hive is required.
Leaving this for the next round.

Signed-off-by: Andrey Grodzovsky 
Reviewed-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c477cfd..4d4fc67 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4778,6 +4778,11 @@ pci_ers_result_t amdgpu_pci_error_detected(struct 
pci_dev *pdev, pci_channel_sta
 
DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
 
+   if (adev->gmc.xgmi.num_physical_nodes > 1) {
+   DRM_WARN("No support for XGMI hive yet...");
+   return PCI_ERS_RESULT_DISCONNECT;
+   }
+
switch (state) {
case pci_channel_io_normal:
return PCI_ERS_RESULT_CAN_RECOVER;
-- 
2.7.4



[PATCH v4 5/8] drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code.

2020-09-02 Thread Andrey Grodzovsky
Reuse existing functions from GPU recovery to avoid code
duplication.

Signed-off-by: Andrey Grodzovsky 
Reviewed-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 73 +-
 1 file changed, 12 insertions(+), 61 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 174e09b..c477cfd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4111,7 +4111,8 @@ static int amdgpu_device_pre_asic_reset(struct 
amdgpu_device *adev,
 
 static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
   struct list_head *device_list_handle,
-  bool *need_full_reset_arg)
+  bool *need_full_reset_arg,
+  bool skip_hw_reset)
 {
struct amdgpu_device *tmp_adev = NULL;
bool need_full_reset = *need_full_reset_arg, vram_lost = false;
@@ -4121,7 +4122,7 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info 
*hive,
 * ASIC reset has to be done on all HGMI hive nodes ASAP
 * to allow proper links negotiation in FW (within 1 sec)
 */
-   if (need_full_reset) {
+   if (!skip_hw_reset && need_full_reset) {
list_for_each_entry(tmp_adev, device_list_handle, 
gmc.xgmi.head) {
/* For XGMI run all resets in parallel to speed up the 
process */
if (tmp_adev->gmc.xgmi.num_physical_nodes > 1) {
@@ -4517,7 +4518,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
if (r)
adev->asic_reset_res = r;
} else {
-   r  = amdgpu_do_asic_reset(hive, device_list_handle, &need_full_reset);
+   r  = amdgpu_do_asic_reset(hive, device_list_handle, &need_full_reset, false);
if (r && r == -EAGAIN)
goto retry;
}
@@ -4845,14 +4846,19 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev 
*pdev)
struct drm_device *dev = pci_get_drvdata(pdev);
struct amdgpu_device *adev = drm_to_adev(dev);
int r, i;
-   bool vram_lost;
+   bool need_full_reset = true;
u32 memsize;
+   struct list_head device_list;
 
DRM_INFO("PCI error: slot reset callback!!\n");
 
+   INIT_LIST_HEAD(&device_list);
+   list_add_tail(&adev->gmc.xgmi.head, &device_list);
+
/* wait for asic to come out of reset */
msleep(500);
 
+   /* Restore PCI confspace */
amdgpu_device_load_pci_state(pdev);
 
/* confirm  ASIC came out of reset */
@@ -4868,70 +4874,15 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev 
*pdev)
goto out;
}
 
-   /* TODO Call amdgpu_pre_asic_reset instead */
adev->in_pci_err_recovery = true;   
-   r = amdgpu_device_ip_suspend(adev);
+   r = amdgpu_device_pre_asic_reset(adev, NULL, &need_full_reset);
adev->in_pci_err_recovery = false;
if (r)
goto out;
 
-
-   /* post card */
-   r = amdgpu_atom_asic_init(adev->mode_info.atom_context);
-   if (r)
-   goto out;
-
-   r = amdgpu_device_ip_resume_phase1(adev);
-   if (r)
-   goto out;
-
-   vram_lost = amdgpu_device_check_vram_lost(adev);
-   if (vram_lost) {
-   DRM_INFO("VRAM is lost due to GPU reset!\n");
-   amdgpu_inc_vram_lost(adev);
-   }
-
-   r = amdgpu_gtt_mgr_recover(
-   &adev->mman.bdev.man[TTM_PL_TT]);
-   if (r)
-   goto out;
-
-   r = amdgpu_device_fw_loading(adev);
-   if (r)
-   return r;
-
-   r = amdgpu_device_ip_resume_phase2(adev);
-   if (r)
-   goto out;
-
-   if (vram_lost)
-   amdgpu_device_fill_reset_magic(adev);
-
-   /*
-* Add this ASIC as tracked as reset was already
-* complete successfully.
-*/
-   amdgpu_register_gpu_instance(adev);
-
-   r = amdgpu_device_ip_late_init(adev);
-   if (r)
-   goto out;
-
-   amdgpu_fbdev_set_suspend(adev, 0);
-
-   /* must succeed. */
-   amdgpu_ras_resume(adev);
-
-
-   amdgpu_irq_gpu_reset_resume_helper(adev);
-   r = amdgpu_ib_ring_tests(adev);
-   if (r)
-   goto out;
-
-   r = amdgpu_device_recover_vram(adev);
+   r = amdgpu_do_asic_reset(NULL, &device_list, &need_full_reset, true);
 
 out:
-
if (!r) {
if (amdgpu_device_cache_pci_state(adev->pdev))
pci_restore_state(adev->pdev);
-- 
2.7.4



[PATCH v4 0/8] Implement PCI Error Recovery on Navi12

2020-09-02 Thread Andrey Grodzovsky
Many PCI bus controllers are able to detect a variety of hardware PCI errors on
the bus, such as parity errors on the data and address buses. A typical action
taken is to disconnect the affected device, halting all I/O to it. Typically, a
reconnection mechanism is also offered, so that the affected PCI device(s) are
reset and put back into working condition.
In our case the reconnection mechanism is facilitated by the kernel Downstream
Port Containment (DPC) driver, which intercepts the PCIe error, removes
(isolates) the faulting device, and then calls into the PCIe recovery code of
the PCI core.
This code calls the hooks implemented in this patchset: the error is first
reported, at which point we block the GPU scheduler; next, DPC resets the PCI
link, which generates a HW interrupt that is intercepted by the SMU/PSP, which
starts executing a mode1 reset of the ASIC; next, the slot reset hook is
called, at which point we wait for the ASIC reset to complete, restore the PCI
config space and run the HW suspend/resume sequence to reinitialize the ASIC.
The last hook called is resume normal operation, at which point we restart the
GPU scheduler.
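
For reference, a minimal sketch of how these hooks plug into the PCI core
(struct and field names as in include/linux/pci.h; the amdgpu_* callbacks are
the ones implemented in this series, and the exact hook-up shown here is only
illustrative):

/* The PCI core invokes these in the order described above:
 * error_detected -> (link reset by DPC) -> slot_reset -> resume.
 */
static const struct pci_error_handlers amdgpu_pci_err_handler = {
	.error_detected	= amdgpu_pci_error_detected,
	.mmio_enabled	= amdgpu_pci_mmio_enabled,
	.slot_reset	= amdgpu_pci_slot_reset,
	.resume		= amdgpu_pci_resume,
};

/* and then, in the driver's PCI driver declaration: */
static struct pci_driver amdgpu_kms_pci_driver = {
	/* ... existing fields (name, id_table, probe, remove, ...) */
	.err_handler	= &amdgpu_pci_err_handler,
};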

More info on PCIe error handling and DPC is available here:
https://www.kernel.org/doc/html/latest/PCI/pci-error-recovery.html
https://patchwork.kernel.org/patch/8945681/

v4: Rebase onto the 5.9 kernel and revert the PCI error recovery core commit
which breaks the feature.

Andrey Grodzovsky (8):
  drm/amdgpu: Avoid accessing HW when suspending SW state
  drm/amdgpu: Block all job scheduling activity during DPC recovery
  drm/amdgpu: Fix SMU error failure
  drm/amdgpu: Fix consecutive DPC recovery failures.
  drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code.
  drm/amdgpu: Disable DPC for XGMI for now.
  drm/amdgpu: Minor checkpatch fix
  Revert "PCI/ERR: Update error status after reset_link()"

 drivers/gpu/drm/amd/amdgpu/amdgpu.h|   6 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 247 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|   4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c|   6 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c|   6 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c |  18 ++-
 drivers/gpu/drm/amd/amdgpu/nv.c|   4 +-
 drivers/gpu/drm/amd/amdgpu/soc15.c |   4 +-
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c |   3 +
 drivers/pci/pcie/err.c |   3 +-
 10 files changed, 222 insertions(+), 79 deletions(-)

-- 
2.7.4



[PATCH v4 7/8] drm/amdgpu: Minor checkpatch fix

2020-09-02 Thread Andrey Grodzovsky
Signed-off-by: Andrey Grodzovsky 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 4d4fc67..3748bef 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -353,7 +353,8 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, 
uint32_t reg,
  *
  * Returns the 8 bit value from the offset specified.
  */
-uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset) {
+uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, uint32_t offset)
+{
if (adev->in_pci_err_recovery)
return 0;
 
@@ -377,7 +378,8 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, 
uint32_t offset) {
  *
  * Writes the value specified to the offset specified.
  */
-void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t 
value) {
+void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t 
value)
+{
if (adev->in_pci_err_recovery)
return;
 
-- 
2.7.4



[PATCH v4 2/8] drm/amdgpu: Block all job scheduling activity during DPC recovery

2020-09-02 Thread Andrey Grodzovsky
DPC recovery involves an ASIC reset, just as normal GPU recovery does, so block
the SW GPU schedulers and wait on all concurrent GPU resets.

Signed-off-by: Andrey Grodzovsky 
Acked-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 57 +++---
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1fbf8a1..e999f1f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4740,6 +4740,20 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
return 0;
 }
 
+static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
+{
+   int i;
+
+   for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+   struct amdgpu_ring *ring = adev->rings[i];
+
+   if (!ring || !ring->sched.thread)
+   continue;
+
+   cancel_delayed_work_sync(&ring->sched.work_tdr);
+   }
+}
+
 /**
  * amdgpu_pci_error_detected - Called when a PCI error is detected.
  * @pdev: PCI device struct
@@ -4753,15 +4767,37 @@ pci_ers_result_t amdgpu_pci_error_detected(struct 
pci_dev *pdev, pci_channel_sta
 {
struct drm_device *dev = pci_get_drvdata(pdev);
struct amdgpu_device *adev = drm_to_adev(dev);
+   int i;
 
DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
 
switch (state) {
case pci_channel_io_normal:
return PCI_ERS_RESULT_CAN_RECOVER;
-   case pci_channel_io_frozen:
-   /* Fatal error, prepare for slot reset */
-   amdgpu_device_lock_adev(adev);
+   /* Fatal error, prepare for slot reset */
+   case pci_channel_io_frozen: 
+   /*  
+* Cancel and wait for all TDRs in progress if failing to
+* set  adev->in_gpu_reset in amdgpu_device_lock_adev
+*
+* Locking adev->reset_sem will prevent any external access
+* to GPU during PCI error recovery
+*/
+   while (!amdgpu_device_lock_adev(adev, NULL))
+   amdgpu_cancel_all_tdr(adev);
+
+   /*
+* Block any work scheduling as we do for regular GPU reset
+* for the duration of the recovery
+*/
+   for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+   struct amdgpu_ring *ring = adev->rings[i];
+
+   if (!ring || !ring->sched.thread)
+   continue;
+
+   drm_sched_stop(&ring->sched, NULL);
+   }
return PCI_ERS_RESULT_NEED_RESET;
case pci_channel_io_perm_failure:
/* Permanent error, prepare for device removal */
@@ -4894,8 +4930,21 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
 {
struct drm_device *dev = pci_get_drvdata(pdev);
struct amdgpu_device *adev = drm_to_adev(dev);
+   int i;
 
-   amdgpu_device_unlock_adev(adev);
 
DRM_INFO("PCI error: resume callback!!\n");
+
+   for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+   struct amdgpu_ring *ring = adev->rings[i];
+
+   if (!ring || !ring->sched.thread)
+   continue;
+
+
+   drm_sched_resubmit_jobs(&ring->sched);
+   drm_sched_start(&ring->sched, true);
+   }
+
+   amdgpu_device_unlock_adev(adev);
 }
-- 
2.7.4



[PATCH v4 3/8] drm/amdgpu: Fix SMU error failure

2020-09-02 Thread Andrey Grodzovsky
Wait for the HW/PSP-initiated ASIC reset to complete before
starting the recovery operations.

v2: Remove typo

Signed-off-by: Andrey Grodzovsky 
Reviewed-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e999f1f..412d07e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4838,14 +4838,32 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev 
*pdev)
 {
struct drm_device *dev = pci_get_drvdata(pdev);
struct amdgpu_device *adev = drm_to_adev(dev);
-   int r;
+   int r, i;
bool vram_lost;
+   u32 memsize;
 
DRM_INFO("PCI error: slot reset callback!!\n");
 
+   /* wait for asic to come out of reset */
+   msleep(500);
+
pci_restore_state(pdev);
 
-   adev->in_pci_err_recovery = true;
+   /* confirm  ASIC came out of reset */
+   for (i = 0; i < adev->usec_timeout; i++) {
+   memsize = amdgpu_asic_get_config_memsize(adev);
+
+   if (memsize != 0xffffffff)
+   break;
+   udelay(1);
+   }
+   if (memsize == 0xffffffff) {
+   r = -ETIME;
+   goto out;
+   }
+
+   /* TODO Call amdgpu_pre_asic_reset instead */
+   adev->in_pci_err_recovery = true;   
r = amdgpu_device_ip_suspend(adev);
adev->in_pci_err_recovery = false;
if (r)
-- 
2.7.4



Re: [PATCH] drm/amdgpu: enable ih1 ih2 for Arcturus only

2020-09-02 Thread Felix Kuehling
On 2020-09-02 at 2:13 p.m., Alex Deucher wrote:
> On Wed, Sep 2, 2020 at 2:08 PM Alex Deucher  wrote:
>> On Wed, Sep 2, 2020 at 1:01 PM Alex Sierra  wrote:
>>> Enable multi-ring ih1 and ih2 for Arcturus only.
>>> For Navi10 family multi-ring has been disabled.
>>> Apparently, having multi-ring enabled in Navi was causing
>>> continuous page fault interrupts.
>>> Further investigation is needed to get to the root cause.
>>> Related issue link:
>>> https://gitlab.freedesktop.org/drm/amd/-/issues/1279
>>>
>> Before committing, let's verify that it fixes that issue.
> Looking at the bug report, the OSS (presumably IH) block is causing a
> write fault so I suspect arcturus may be affected by this as well.  We
> should double check the ring sizes and allocations.

Alejandro has been doing a lot of testing on Arcturus and didn't run
into this problem. That's why I suggested only disabling the IH rings on
Navi10 for now. We need the extra rings on Arcturus for our HMM work.

Regards,
  Felix


>
> Alex
>
>
>> Alex
>>
>>
>>> Signed-off-by: Alex Sierra 
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/navi10_ih.c | 30 --
>>>  1 file changed, 19 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c 
>>> b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>>> index 350f1bf063c6..4d73869870ab 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>>> @@ -306,7 +306,8 @@ static int navi10_ih_irq_init(struct amdgpu_device 
>>> *adev)
>>> } else {
>>> WREG32_SOC15(OSSSYS, 0, mmIH_RB_CNTL, ih_rb_cntl);
>>> }
>>> -   navi10_ih_reroute_ih(adev);
>>> +   if (adev->asic_type == CHIP_ARCTURUS)
>>> +   navi10_ih_reroute_ih(adev);
>>>
>>> if (unlikely(adev->firmware.load_type == AMDGPU_FW_LOAD_DIRECT)) {
>>> if (ih->use_bus_addr) {
>>> @@ -668,19 +669,26 @@ static int navi10_ih_sw_init(void *handle)
>>> adev->irq.ih.use_doorbell = true;
>>> adev->irq.ih.doorbell_index = adev->doorbell_index.ih << 1;
>>>
>>> -   r = amdgpu_ih_ring_init(adev, &adev->irq.ih1, PAGE_SIZE, true);
>>> -   if (r)
>>> -   return r;
>>> +   adev->irq.ih1.ring_size = 0;
>>> +   adev->irq.ih2.ring_size = 0;
>>>
>>> -   adev->irq.ih1.use_doorbell = true;
>>> -   adev->irq.ih1.doorbell_index = (adev->doorbell_index.ih + 1) << 1;
>>> +   if (adev->asic_type == CHIP_ARCTURUS) {
>>> +   r = amdgpu_ih_ring_init(adev, &adev->irq.ih1, PAGE_SIZE, true);
>>> +   if (r)
>>> +   return r;
>>>
>>> -   r = amdgpu_ih_ring_init(adev, &adev->irq.ih2, PAGE_SIZE, true);
>>> -   if (r)
>>> -   return r;
>>> +   adev->irq.ih1.use_doorbell = true;
>>> +   adev->irq.ih1.doorbell_index =
>>> +   (adev->doorbell_index.ih + 1) << 1;
>>> +
>>> +   r = amdgpu_ih_ring_init(adev, &adev->irq.ih2, PAGE_SIZE, true);
>>> +   if (r)
>>> +   return r;
>>>
>>> -   adev->irq.ih2.use_doorbell = true;
>>> -   adev->irq.ih2.doorbell_index = (adev->doorbell_index.ih + 2) << 1;
>>> +   adev->irq.ih2.use_doorbell = true;
>>> +   adev->irq.ih2.doorbell_index =
>>> +   (adev->doorbell_index.ih + 2) << 1;
>>> +   }
>>>
>>> r = amdgpu_irq_init(adev);
>>>
>>> --
>>> 2.17.1


[PATCH 2/2] drm/amdgpu/mmhub2.0: print client id string for mmhub

2020-09-02 Thread Alex Deucher
Print the name of the client rather than the number.  This
makes it easier to debug what block is causing the fault.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c | 88 +++--
 1 file changed, 82 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c 
b/drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c
index 5baf899417d8..2d88278c50bf 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c
@@ -36,6 +36,63 @@
 #define mmDAGB0_CNTL_MISC2_Sienna_Cichlid   0x0070
 #define mmDAGB0_CNTL_MISC2_Sienna_Cichlid_BASE_IDX  0
 
+static const char *mmhub_client_ids_navi1x[][2] = {
+   [3][0] = "DCEDMC",
+   [4][0] = "DCEVGA",
+   [5][0] = "MP0",
+   [6][0] = "MP1",
+   [13][0] = "VMC",
+   [14][0] = "HDP",
+   [15][0] = "OSS",
+   [16][0] = "VCNU",
+   [17][0] = "JPEG",
+   [18][0] = "VCN",
+   [3][1] = "DCEDMC",
+   [4][1] = "DCEXFC",
+   [5][1] = "DCEVGA",
+   [6][1] = "DCEDWB",
+   [7][1] = "MP0",
+   [8][1] = "MP1",
+   [9][1] = "DBGU1",
+   [10][1] = "DBGU0",
+   [11][1] = "XDP",
+   [14][1] = "HDP",
+   [15][1] = "OSS",
+   [16][1] = "VCNU",
+   [17][1] = "JPEG",
+   [18][1] = "VCN",
+};
+
+static const char *mmhub_client_ids_sienna_cichlid[][2] = {
+   [3][0] = "DCEDMC",
+   [4][0] = "DCEVGA",
+   [5][0] = "MP0",
+   [6][0] = "MP1",
+   [8][0] = "VMC",
+   [9][0] = "VCNU0",
+   [10][0] = "JPEG",
+   [12][0] = "VCNU1",
+   [13][0] = "VCN1",
+   [14][0] = "HDP",
+   [15][0] = "OSS",
+   [32+11][0] = "VCN0",
+   [0][1] = "DBGU0",
+   [1][1] = "DBGU1",
+   [2][1] = "DCEDWB",
+   [3][1] = "DCEDMC",
+   [4][1] = "DCEVGA",
+   [5][1] = "MP0",
+   [6][1] = "MP1",
+   [7][1] = "XDP",
+   [9][1] = "VCNU0",
+   [10][1] = "JPEG",
+   [11][1] = "VCN0",
+   [12][1] = "VCNU1",
+   [13][1] = "VCN1",
+   [14][1] = "HDP",
+   [15][1] = "OSS",
+};
+
 static uint32_t mmhub_v2_0_get_invalidate_req(unsigned int vmid,
  uint32_t flush_type)
 {
@@ -60,12 +117,33 @@ static void
 mmhub_v2_0_print_l2_protection_fault_status(struct amdgpu_device *adev,
 uint32_t status)
 {
+   uint32_t cid, rw;
+   const char *mmhub_cid = NULL;
+
+   cid = REG_GET_FIELD(status,
+   MMVM_L2_PROTECTION_FAULT_STATUS, CID);
+   rw = REG_GET_FIELD(status,
+  MMVM_L2_PROTECTION_FAULT_STATUS, RW);
+
dev_err(adev->dev,
"MMVM_L2_PROTECTION_FAULT_STATUS:0x%08X\n",
status);
-   dev_err(adev->dev, "\t Faulty UTCL2 client ID: 0x%lx\n",
-   REG_GET_FIELD(status,
-   MMVM_L2_PROTECTION_FAULT_STATUS, CID));
+   switch (adev->asic_type) {
+   case CHIP_NAVI10:
+   case CHIP_NAVI12:
+   case CHIP_NAVI14:
+   mmhub_cid = mmhub_client_ids_navi1x[cid][rw];
+   break;
+   case CHIP_SIENNA_CICHLID:
+   case CHIP_NAVY_FLOUNDER:
+   mmhub_cid = mmhub_client_ids_sienna_cichlid[cid][rw];
+   break;
+   default:
+   mmhub_cid = NULL;
+   break;
+   }
+   dev_err(adev->dev, "\t Faulty UTCL2 client ID: %s (0x%x)\n",
+   mmhub_cid ? mmhub_cid : "unknown", cid);
dev_err(adev->dev, "\t MORE_FAULTS: 0x%lx\n",
REG_GET_FIELD(status,
MMVM_L2_PROTECTION_FAULT_STATUS, MORE_FAULTS));
@@ -78,9 +156,7 @@ mmhub_v2_0_print_l2_protection_fault_status(struct 
amdgpu_device *adev,
dev_err(adev->dev, "\t MAPPING_ERROR: 0x%lx\n",
REG_GET_FIELD(status,
MMVM_L2_PROTECTION_FAULT_STATUS, MAPPING_ERROR));
-   dev_err(adev->dev, "\t RW: 0x%lx\n",
-   REG_GET_FIELD(status,
-   MMVM_L2_PROTECTION_FAULT_STATUS, RW));
+   dev_err(adev->dev, "\t RW: 0x%x\n", rw);
 }
 
 static void mmhub_v2_0_setup_vm_pt_regs(struct amdgpu_device *adev, uint32_t 
vmid,
-- 
2.25.4



Re: [PATCH] drm/amdgpu: enable ih1 ih2 for Arcturus only

2020-09-02 Thread Alex Deucher
On Wed, Sep 2, 2020 at 2:10 PM Felix Kuehling  wrote:
>
> On 2020-09-02 at 2:08 p.m., Alex Deucher wrote:
> > On Wed, Sep 2, 2020 at 1:01 PM Alex Sierra  wrote:
> >> Enable multi-ring ih1 and ih2 for Arcturus only.
> >> For Navi10 family multi-ring has been disabled.
> >> Apparently, having multi-ring enabled in Navi was causing
> >> continuous page fault interrupts.
> >> Further investigation is needed to get to the root cause.
> >> Related issue link:
> >> https://gitlab.freedesktop.org/drm/amd/-/issues/1279
> >>
> > Before committing, let's verify that it fixes that issue.
>
> Has anyone reproduced this in AMD? Or should we ask the gitlab issue
> reporter to test the patch?

I've asked on the bug report.  I think Nicolai reported an mmhub error
at some point, but I can't find the reference now.  I haven't heard of
anything else.

Alex

>
> Thanks,
>   Felix
>
>
> >
> > Alex
> >
> >
> >> Signed-off-by: Alex Sierra 
> >> ---
> >>  drivers/gpu/drm/amd/amdgpu/navi10_ih.c | 30 --
> >>  1 file changed, 19 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c 
> >> b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
> >> index 350f1bf063c6..4d73869870ab 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
> >> @@ -306,7 +306,8 @@ static int navi10_ih_irq_init(struct amdgpu_device 
> >> *adev)
> >> } else {
> >> WREG32_SOC15(OSSSYS, 0, mmIH_RB_CNTL, ih_rb_cntl);
> >> }
> >> -   navi10_ih_reroute_ih(adev);
> >> +   if (adev->asic_type == CHIP_ARCTURUS)
> >> +   navi10_ih_reroute_ih(adev);
> >>
> >> if (unlikely(adev->firmware.load_type == AMDGPU_FW_LOAD_DIRECT)) {
> >> if (ih->use_bus_addr) {
> >> @@ -668,19 +669,26 @@ static int navi10_ih_sw_init(void *handle)
> >> adev->irq.ih.use_doorbell = true;
> >> adev->irq.ih.doorbell_index = adev->doorbell_index.ih << 1;
> >>
> >> -   r = amdgpu_ih_ring_init(adev, &adev->irq.ih1, PAGE_SIZE, true);
> >> -   if (r)
> >> -   return r;
> >> +   adev->irq.ih1.ring_size = 0;
> >> +   adev->irq.ih2.ring_size = 0;
> >>
> >> -   adev->irq.ih1.use_doorbell = true;
> >> -   adev->irq.ih1.doorbell_index = (adev->doorbell_index.ih + 1) << 1;
> >> +   if (adev->asic_type == CHIP_ARCTURUS) {
> >> +   r = amdgpu_ih_ring_init(adev, &adev->irq.ih1, PAGE_SIZE, true);
> >> +   if (r)
> >> +   return r;
> >>
> >> -   r = amdgpu_ih_ring_init(adev, &adev->irq.ih2, PAGE_SIZE, true);
> >> -   if (r)
> >> -   return r;
> >> +   adev->irq.ih1.use_doorbell = true;
> >> +   adev->irq.ih1.doorbell_index =
> >> +   (adev->doorbell_index.ih + 1) << 1;
> >> +
> >> +   r = amdgpu_ih_ring_init(adev, &adev->irq.ih2, PAGE_SIZE, true);
> >> +   if (r)
> >> +   return r;
> >>
> >> -   adev->irq.ih2.use_doorbell = true;
> >> -   adev->irq.ih2.doorbell_index = (adev->doorbell_index.ih + 2) << 1;
> >> +   adev->irq.ih2.use_doorbell = true;
> >> +   adev->irq.ih2.doorbell_index =
> >> +   (adev->doorbell_index.ih + 2) << 1;
> >> +   }
> >>
> >> r = amdgpu_irq_init(adev);
> >>
> >> --
> >> 2.17.1


[PATCH 1/2] drm/amdgpu/gmc9: print client id string for mmhub

2020-09-02 Thread Alex Deucher
Print the name of the client rather than the number.  This
makes it easier to debug what block is causing the fault.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 239 +-
 1 file changed, 230 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 7e86aee60c64..f9e810126124 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -87,6 +87,203 @@ static const char *gfxhub_client_ids[] = {
"PA",
 };
 
+static const char *mmhub_client_ids_raven[][2] = {
+   [0][0] = "MP1",
+   [1][0] = "MP0",
+   [2][0] = "VCN",
+   [3][0] = "VCNU",
+   [4][0] = "HDP",
+   [5][0] = "DCE",
+   [13][0] = "UTCL2",
+   [19][0] = "TLS",
+   [26][0] = "OSS",
+   [27][0] = "SDMA0",
+   [0][1] = "MP1",
+   [1][1] = "MP0",
+   [2][1] = "VCN",
+   [3][1] = "VCNU",
+   [4][1] = "HDP",
+   [5][1] = "XDP",
+   [6][1] = "DBGU0",
+   [7][1] = "DCE",
+   [8][1] = "DCEDWB0",
+   [9][1] = "DCEDWB1",
+   [26][1] = "OSS",
+   [27][1] = "SDMA0",
+};
+
+static const char *mmhub_client_ids_renoir[][2] = {
+   [0][0] = "MP1",
+   [1][0] = "MP0",
+   [2][0] = "HDP",
+   [4][0] = "DCEDMC",
+   [5][0] = "DCEVGA",
+   [13][0] = "UTCL2",
+   [19][0] = "TLS",
+   [26][0] = "OSS",
+   [27][0] = "SDMA0",
+   [28][0] = "VCN",
+   [29][0] = "VCNU",
+   [30][0] = "JPEG",
+   [0][1] = "MP1",
+   [1][1] = "MP0",
+   [2][1] = "HDP",
+   [3][1] = "XDP",
+   [6][1] = "DBGU0",
+   [7][1] = "DCEDMC",
+   [8][1] = "DCEVGA",
+   [9][1] = "DCEDWB",
+   [26][1] = "OSS",
+   [27][1] = "SDMA0",
+   [28][1] = "VCN",
+   [29][1] = "VCNU",
+   [30][1] = "JPEG",
+};
+
+static const char *mmhub_client_ids_vega10[][2] = {
+   [0][0] = "MP0",
+   [1][0] = "UVD",
+   [2][0] = "UVDU",
+   [3][0] = "HDP",
+   [13][0] = "UTCL2",
+   [14][0] = "OSS",
+   [15][0] = "SDMA1",
+   [32+0][0] = "VCE0",
+   [32+1][0] = "VCE0U",
+   [32+2][0] = "XDMA",
+   [32+3][0] = "DCE",
+   [32+4][0] = "MP1",
+   [32+14][0] = "SDMA0",
+   [0][1] = "MP0",
+   [1][1] = "UVD",
+   [2][1] = "UVDU",
+   [3][1] = "DBGU0",
+   [4][1] = "HDP",
+   [5][1] = "XDP",
+   [14][1] = "OSS",
+   [15][1] = "SDMA0",
+   [32+0][1] = "VCE0",
+   [32+1][1] = "VCE0U",
+   [32+2][1] = "XDMA",
+   [32+3][1] = "DCE",
+   [32+4][1] = "DCEDWB",
+   [32+5][1] = "MP1",
+   [32+6][1] = "DBGU1",
+   [32+14][1] = "SDMA1",
+};
+
+static const char *mmhub_client_ids_vega12[][2] = {
+   [0][0] = "MP0",
+   [1][0] = "VCE0",
+   [2][0] = "VCE0U",
+   [3][0] = "HDP",
+   [13][0] = "UTCL2",
+   [14][0] = "OSS",
+   [15][0] = "SDMA1",
+   [32+0][0] = "DCE",
+   [32+1][0] = "XDMA",
+   [32+2][0] = "UVD",
+   [32+3][0] = "UVDU",
+   [32+4][0] = "MP1",
+   [32+15][0] = "SDMA0",
+   [0][1] = "MP0",
+   [1][1] = "VCE0",
+   [2][1] = "VCE0U",
+   [3][1] = "DBGU0",
+   [4][1] = "HDP",
+   [5][1] = "XDP",
+   [14][1] = "OSS",
+   [15][1] = "SDMA0",
+   [32+0][1] = "DCE",
+   [32+1][1] = "DCEDWB",
+   [32+2][1] = "XDMA",
+   [32+3][1] = "UVD",
+   [32+4][1] = "UVDU",
+   [32+5][1] = "MP1",
+   [32+6][1] = "DBGU1",
+   [32+15][1] = "SDMA1",
+};
+
+static const char *mmhub_client_ids_vega20[][2] = {
+   [0][0] = "XDMA",
+   [1][0] = "DCE",
+   [2][0] = "VCE0",
+   [3][0] = "VCE0U",
+   [4][0] = "UVD",
+   [5][0] = "UVD1U",
+   [13][0] = "OSS",
+   [14][0] = "HDP",
+   [15][0] = "SDMA0",
+   [32+0][0] = "UVD",
+   [32+1][0] = "UVDU",
+   [32+2][0] = "MP1",
+   [32+3][0] = "MP0",
+   [32+12][0] = "UTCL2",
+   [32+14][0] = "SDMA1",
+   [0][1] = "XDMA",
+   [1][1] = "DCE",
+   [2][1] = "DCEDWB",
+   [3][1] = "VCE0",
+   [4][1] = "VCE0U",
+   [5][1] = "UVD1",
+   [6][1] = "UVD1U",
+   [7][1] = "DBGU0",
+   [8][1] = "XDP",
+   [13][1] = "OSS",
+   [14][1] = "HDP",
+   [15][1] = "SDMA0",
+   [32+0][1] = "UVD",
+   [32+1][1] = "UVDU",
+   [32+2][1] = "DBGU1",
+   [32+3][1] = "MP1",
+   [32+4][1] = "MP0",
+   [32+14][1] = "SDMA1",
+};
+
+static const char *mmhub_client_ids_arcturus[][2] = {
+   [2][0] = "MP1",
+   [3][0] = "MP0",
+   [10][0] = "UTCL2",
+   [13][0] = "OSS",
+   [14][0] = "HDP",
+   [15][0] = "SDMA0",
+   [32+15][0] = "SDMA1",
+   [64+15][0] = "SDMA2",
+   [96+15][0] = "SDMA3",
+   [128+15][0] = "SDMA4",
+   [160+11][0] = "JPEG",
+   [160+12][0] = "VCN",
+   [160+13][0] = "VCNU",
+   [160+15][0] = "SDMA5",
+   [192+10][0] = "UTCL2",
+   [192+11][0] = "JPEG1",
+   

Re: [PATCH] drm/amdgpu: enable ih1 ih2 for Arcturus only

2020-09-02 Thread Alex Deucher
On Wed, Sep 2, 2020 at 2:08 PM Alex Deucher  wrote:
>
> On Wed, Sep 2, 2020 at 1:01 PM Alex Sierra  wrote:
> >
> > Enable multi-ring ih1 and ih2 for Arcturus only.
> > For Navi10 family multi-ring has been disabled.
> > Apparently, having multi-ring enabled in Navi was causing
> > continuous page fault interrupts.
> > Further investigation is needed to get to the root cause.
> > Related issue link:
> > https://gitlab.freedesktop.org/drm/amd/-/issues/1279
> >
>
> Before committing, let's verify that it fixes that issue.

Looking at the bug report, the OSS (presumably IH) block is causing a
write fault so I suspect arcturus may be affected by this as well.  We
should double check the ring sizes and allocations.

Alex


>
> Alex
>
>
> > Signed-off-by: Alex Sierra 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/navi10_ih.c | 30 --
> >  1 file changed, 19 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c 
> > b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
> > index 350f1bf063c6..4d73869870ab 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
> > @@ -306,7 +306,8 @@ static int navi10_ih_irq_init(struct amdgpu_device 
> > *adev)
> > } else {
> > WREG32_SOC15(OSSSYS, 0, mmIH_RB_CNTL, ih_rb_cntl);
> > }
> > -   navi10_ih_reroute_ih(adev);
> > +   if (adev->asic_type == CHIP_ARCTURUS)
> > +   navi10_ih_reroute_ih(adev);
> >
> > if (unlikely(adev->firmware.load_type == AMDGPU_FW_LOAD_DIRECT)) {
> > if (ih->use_bus_addr) {
> > @@ -668,19 +669,26 @@ static int navi10_ih_sw_init(void *handle)
> > adev->irq.ih.use_doorbell = true;
> > adev->irq.ih.doorbell_index = adev->doorbell_index.ih << 1;
> >
> > -   r = amdgpu_ih_ring_init(adev, &adev->irq.ih1, PAGE_SIZE, true);
> > -   if (r)
> > -   return r;
> > +   adev->irq.ih1.ring_size = 0;
> > +   adev->irq.ih2.ring_size = 0;
> >
> > -   adev->irq.ih1.use_doorbell = true;
> > -   adev->irq.ih1.doorbell_index = (adev->doorbell_index.ih + 1) << 1;
> > +   if (adev->asic_type == CHIP_ARCTURUS) {
> > +   r = amdgpu_ih_ring_init(adev, &adev->irq.ih1, PAGE_SIZE, true);
> > +   if (r)
> > +   return r;
> >
> > -   r = amdgpu_ih_ring_init(adev, &adev->irq.ih2, PAGE_SIZE, true);
> > -   if (r)
> > -   return r;
> > +   adev->irq.ih1.use_doorbell = true;
> > +   adev->irq.ih1.doorbell_index =
> > +   (adev->doorbell_index.ih + 1) << 1;
> > +
> > +   r = amdgpu_ih_ring_init(adev, &adev->irq.ih2, PAGE_SIZE, true);
> > +   if (r)
> > +   return r;
> >
> > -   adev->irq.ih2.use_doorbell = true;
> > -   adev->irq.ih2.doorbell_index = (adev->doorbell_index.ih + 2) << 1;
> > +   adev->irq.ih2.use_doorbell = true;
> > +   adev->irq.ih2.doorbell_index =
> > +   (adev->doorbell_index.ih + 2) << 1;
> > +   }
> >
> > r = amdgpu_irq_init(adev);
> >
> > --
> > 2.17.1


Re: [PATCH] drm/amdgpu: enable ih1 ih2 for Arcturus only

2020-09-02 Thread Felix Kuehling
On 2020-09-02 at 2:08 p.m., Alex Deucher wrote:
> On Wed, Sep 2, 2020 at 1:01 PM Alex Sierra  wrote:
>> Enable multi-ring ih1 and ih2 for Arcturus only.
>> For Navi10 family multi-ring has been disabled.
>> Apparently, having multi-ring enabled in Navi was causing
>> continuous page fault interrupts.
>> Further investigation is needed to get to the root cause.
>> Related issue link:
>> https://gitlab.freedesktop.org/drm/amd/-/issues/1279
>>
> Before committing, let's verify that it fixes that issue.

Has anyone reproduced this in AMD? Or should we ask the gitlab issue
reporter to test the patch?

Thanks,
  Felix


>
> Alex
>
>
>> Signed-off-by: Alex Sierra 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/navi10_ih.c | 30 --
>>  1 file changed, 19 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c 
>> b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>> index 350f1bf063c6..4d73869870ab 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>> @@ -306,7 +306,8 @@ static int navi10_ih_irq_init(struct amdgpu_device *adev)
>> } else {
>> WREG32_SOC15(OSSSYS, 0, mmIH_RB_CNTL, ih_rb_cntl);
>> }
>> -   navi10_ih_reroute_ih(adev);
>> +   if (adev->asic_type == CHIP_ARCTURUS)
>> +   navi10_ih_reroute_ih(adev);
>>
>> if (unlikely(adev->firmware.load_type == AMDGPU_FW_LOAD_DIRECT)) {
>> if (ih->use_bus_addr) {
>> @@ -668,19 +669,26 @@ static int navi10_ih_sw_init(void *handle)
>> adev->irq.ih.use_doorbell = true;
>> adev->irq.ih.doorbell_index = adev->doorbell_index.ih << 1;
>>
>> -   r = amdgpu_ih_ring_init(adev, &adev->irq.ih1, PAGE_SIZE, true);
>> -   if (r)
>> -   return r;
>> +   adev->irq.ih1.ring_size = 0;
>> +   adev->irq.ih2.ring_size = 0;
>>
>> -   adev->irq.ih1.use_doorbell = true;
>> -   adev->irq.ih1.doorbell_index = (adev->doorbell_index.ih + 1) << 1;
>> +   if (adev->asic_type == CHIP_ARCTURUS) {
>> +   r = amdgpu_ih_ring_init(adev, &adev->irq.ih1, PAGE_SIZE, true);
>> +   if (r)
>> +   return r;
>>
>> -   r = amdgpu_ih_ring_init(adev, &adev->irq.ih2, PAGE_SIZE, true);
>> -   if (r)
>> -   return r;
>> +   adev->irq.ih1.use_doorbell = true;
>> +   adev->irq.ih1.doorbell_index =
>> +   (adev->doorbell_index.ih + 1) << 1;
>> +
>> +   r = amdgpu_ih_ring_init(adev, &adev->irq.ih2, PAGE_SIZE, true);
>> +   if (r)
>> +   return r;
>>
>> -   adev->irq.ih2.use_doorbell = true;
>> -   adev->irq.ih2.doorbell_index = (adev->doorbell_index.ih + 2) << 1;
>> +   adev->irq.ih2.use_doorbell = true;
>> +   adev->irq.ih2.doorbell_index =
>> +   (adev->doorbell_index.ih + 2) << 1;
>> +   }
>>
>> r = amdgpu_irq_init(adev);
>>
>> --
>> 2.17.1


Re: [PATCH] drm/amdgpu: enable ih1 ih2 for Arcturus only

2020-09-02 Thread Alex Deucher
On Wed, Sep 2, 2020 at 1:01 PM Alex Sierra  wrote:
>
> Enable multi-ring ih1 and ih2 for Arcturus only.
> For Navi10 family multi-ring has been disabled.
> Apparently, having multi-ring enabled in Navi was causing
> continuous page fault interrupts.
> Further investigation is needed to get to the root cause.
> Related issue link:
> https://gitlab.freedesktop.org/drm/amd/-/issues/1279
>

Before committing, let's verify that it fixes that issue.

Alex


> Signed-off-by: Alex Sierra 
> ---
>  drivers/gpu/drm/amd/amdgpu/navi10_ih.c | 30 --
>  1 file changed, 19 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c 
> b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
> index 350f1bf063c6..4d73869870ab 100644
> --- a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
> @@ -306,7 +306,8 @@ static int navi10_ih_irq_init(struct amdgpu_device *adev)
> } else {
> WREG32_SOC15(OSSSYS, 0, mmIH_RB_CNTL, ih_rb_cntl);
> }
> -   navi10_ih_reroute_ih(adev);
> +   if (adev->asic_type == CHIP_ARCTURUS)
> +   navi10_ih_reroute_ih(adev);
>
> if (unlikely(adev->firmware.load_type == AMDGPU_FW_LOAD_DIRECT)) {
> if (ih->use_bus_addr) {
> @@ -668,19 +669,26 @@ static int navi10_ih_sw_init(void *handle)
> adev->irq.ih.use_doorbell = true;
> adev->irq.ih.doorbell_index = adev->doorbell_index.ih << 1;
>
> -   r = amdgpu_ih_ring_init(adev, &adev->irq.ih1, PAGE_SIZE, true);
> -   if (r)
> -   return r;
> +   adev->irq.ih1.ring_size = 0;
> +   adev->irq.ih2.ring_size = 0;
>
> -   adev->irq.ih1.use_doorbell = true;
> -   adev->irq.ih1.doorbell_index = (adev->doorbell_index.ih + 1) << 1;
> +   if (adev->asic_type == CHIP_ARCTURUS) {
> +   r = amdgpu_ih_ring_init(adev, &adev->irq.ih1, PAGE_SIZE, true);
> +   if (r)
> +   return r;
>
> -   r = amdgpu_ih_ring_init(adev, &adev->irq.ih2, PAGE_SIZE, true);
> -   if (r)
> -   return r;
> +   adev->irq.ih1.use_doorbell = true;
> +   adev->irq.ih1.doorbell_index =
> +   (adev->doorbell_index.ih + 1) << 1;
> +
> +   r = amdgpu_ih_ring_init(adev, &adev->irq.ih2, PAGE_SIZE, true);
> +   if (r)
> +   return r;
>
> -   adev->irq.ih2.use_doorbell = true;
> -   adev->irq.ih2.doorbell_index = (adev->doorbell_index.ih + 2) << 1;
> +   adev->irq.ih2.use_doorbell = true;
> +   adev->irq.ih2.doorbell_index =
> +   (adev->doorbell_index.ih + 2) << 1;
> +   }
>
> r = amdgpu_irq_init(adev);
>
> --
> 2.17.1


Re: [PATCH] drm/amdgpu: enable ih1 ih2 for Arcturus only

2020-09-02 Thread Felix Kuehling
On 2020-09-02 at 1:01 p.m., Alex Sierra wrote:
> Enable multi-ring ih1 and ih2 for Arcturus only.
> For Navi10 family multi-ring has been disabled.
> Apparently, having multi-ring enabled in Navi was causing
> continuous page fault interrupts.
> Further investigation is needed to get to the root cause.
> Related issue link:
> https://gitlab.freedesktop.org/drm/amd/-/issues/1279
>
> Signed-off-by: Alex Sierra 
> ---
>  drivers/gpu/drm/amd/amdgpu/navi10_ih.c | 30 --
>  1 file changed, 19 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c 
> b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
> index 350f1bf063c6..4d73869870ab 100644
> --- a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
> @@ -306,7 +306,8 @@ static int navi10_ih_irq_init(struct amdgpu_device *adev)
>   } else {
>   WREG32_SOC15(OSSSYS, 0, mmIH_RB_CNTL, ih_rb_cntl);
>   }
> - navi10_ih_reroute_ih(adev);
> + if (adev->asic_type == CHIP_ARCTURUS)

Instead of looking at the asic_type here, it would be better to check
the IH ring sizes. They are also used further down in this function as
the condition to enable the extra interrupt rings. It would be more
consistent and this way, the ASIC-specific condition would be limited to
one function, navi10_ih_sw_init.
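
As a sketch of that suggestion (a hypothetical rework, not the posted patch),
the check here could key off the ring allocation instead:

	/* reroute only when the extra IH rings were actually allocated */
	if (adev->irq.ih1.ring_size)
		navi10_ih_reroute_ih(adev);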


> + navi10_ih_reroute_ih(adev);
>  
>   if (unlikely(adev->firmware.load_type == AMDGPU_FW_LOAD_DIRECT)) {
>   if (ih->use_bus_addr) {
> @@ -668,19 +669,26 @@ static int navi10_ih_sw_init(void *handle)
>   adev->irq.ih.use_doorbell = true;
>   adev->irq.ih.doorbell_index = adev->doorbell_index.ih << 1;
>  
> - r = amdgpu_ih_ring_init(adev, &adev->irq.ih1, PAGE_SIZE, true);
> - if (r)
> - return r;
> + adev->irq.ih1.ring_size = 0;
> + adev->irq.ih2.ring_size = 0;
>  
> - adev->irq.ih1.use_doorbell = true;
> - adev->irq.ih1.doorbell_index = (adev->doorbell_index.ih + 1) << 1;
> + if (adev->asic_type == CHIP_ARCTURUS) {

This may apply to the Arcturus successor as well. I'd use asic_type <
NAVI10 instead, to be future-proof.
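
i.e., something along these lines (again a hypothetical sketch, assuming the
amd_asic_type enum orders the pre-Navi10 ASICs before CHIP_NAVI10):

	if (adev->asic_type < CHIP_NAVI10) {
		r = amdgpu_ih_ring_init(adev, &adev->irq.ih1, PAGE_SIZE, true);
		...
	}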

With these two issues fixed, the patch is

Reviewed-by: Felix Kuehling 


> + r = amdgpu_ih_ring_init(adev, &adev->irq.ih1, PAGE_SIZE, true);
> + if (r)
> + return r;
>  
> - r = amdgpu_ih_ring_init(adev, &adev->irq.ih2, PAGE_SIZE, true);
> - if (r)
> - return r;
> + adev->irq.ih1.use_doorbell = true;
> + adev->irq.ih1.doorbell_index =
> + (adev->doorbell_index.ih + 1) << 1;
> +
> + r = amdgpu_ih_ring_init(adev, &adev->irq.ih2, PAGE_SIZE, true);
> + if (r)
> + return r;
>  
> - adev->irq.ih2.use_doorbell = true;
> - adev->irq.ih2.doorbell_index = (adev->doorbell_index.ih + 2) << 1;
> + adev->irq.ih2.use_doorbell = true;
> + adev->irq.ih2.doorbell_index =
> + (adev->doorbell_index.ih + 2) << 1;
> + }
>  
>   r = amdgpu_irq_init(adev);
>  


[PATCH] drm/amdgpu: enable ih1 ih2 for Arcturus only

2020-09-02 Thread Alex Sierra
Enable multi-ring ih1 and ih2 for Arcturus only.
For Navi10 family multi-ring has been disabled.
Apparently, having multi-ring enabled in Navi was causing
continuous page fault interrupts.
Further investigation is needed to get to the root cause.
Related issue link:
https://gitlab.freedesktop.org/drm/amd/-/issues/1279

Signed-off-by: Alex Sierra 
---
 drivers/gpu/drm/amd/amdgpu/navi10_ih.c | 30 --
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c 
b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
index 350f1bf063c6..4d73869870ab 100644
--- a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
@@ -306,7 +306,8 @@ static int navi10_ih_irq_init(struct amdgpu_device *adev)
} else {
WREG32_SOC15(OSSSYS, 0, mmIH_RB_CNTL, ih_rb_cntl);
}
-   navi10_ih_reroute_ih(adev);
+   if (adev->asic_type == CHIP_ARCTURUS)
+   navi10_ih_reroute_ih(adev);
 
if (unlikely(adev->firmware.load_type == AMDGPU_FW_LOAD_DIRECT)) {
if (ih->use_bus_addr) {
@@ -668,19 +669,26 @@ static int navi10_ih_sw_init(void *handle)
adev->irq.ih.use_doorbell = true;
adev->irq.ih.doorbell_index = adev->doorbell_index.ih << 1;
 
-   r = amdgpu_ih_ring_init(adev, &adev->irq.ih1, PAGE_SIZE, true);
-   if (r)
-   return r;
+   adev->irq.ih1.ring_size = 0;
+   adev->irq.ih2.ring_size = 0;
 
-   adev->irq.ih1.use_doorbell = true;
-   adev->irq.ih1.doorbell_index = (adev->doorbell_index.ih + 1) << 1;
+   if (adev->asic_type == CHIP_ARCTURUS) {
+   r = amdgpu_ih_ring_init(adev, &adev->irq.ih1, PAGE_SIZE, true);
+   if (r)
+   return r;
 
-   r = amdgpu_ih_ring_init(adev, &adev->irq.ih2, PAGE_SIZE, true);
-   if (r)
-   return r;
+   adev->irq.ih1.use_doorbell = true;
+   adev->irq.ih1.doorbell_index =
+   (adev->doorbell_index.ih + 1) << 1;
+
+   r = amdgpu_ih_ring_init(adev, &adev->irq.ih2, PAGE_SIZE, true);
+   if (r)
+   return r;
 
-   adev->irq.ih2.use_doorbell = true;
-   adev->irq.ih2.doorbell_index = (adev->doorbell_index.ih + 2) << 1;
+   adev->irq.ih2.use_doorbell = true;
+   adev->irq.ih2.doorbell_index =
+   (adev->doorbell_index.ih + 2) << 1;
+   }
 
r = amdgpu_irq_init(adev);
 
-- 
2.17.1



Re: [PATCH] drm/amdgpu: fix max_entries calculation v3

2020-09-02 Thread Pan, Xinhui


> On 2 Sep 2020, at 23:21, Christian König wrote:
> 
> Calculate the correct value for max_entries or we might run past the
> end of the pages_addr array.
> 
> v2: Xinhui pointed out we don't need the shift
> v3: use local copy of start and simplify some calculation
> 
> Signed-off-by: Christian König 
> Fixes: 1e691e244487 drm/amdgpu: stop allocating dummy GTT nodes
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 8bc2253939be..7cbee1a7ec84 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -1697,7 +1697,7 @@ static int amdgpu_vm_bo_split_mapping(struct 
> amdgpu_device *adev,
>   AMDGPU_GPU_PAGES_IN_CPU_PAGE;
>   } else {
>   addr = 0;
> - max_entries = S64_MAX;
> + max_entries = mapping->last - start + 1;
>   }
> 
>   if (pages_addr) {
> @@ -1727,7 +1727,7 @@ static int amdgpu_vm_bo_split_mapping(struct 
> amdgpu_device *adev,
>   addr += pfn << PAGE_SHIFT;
>   }
> 
> - last = min((uint64_t)mapping->last, start + max_entries - 1);
> + last = start + max_entries - 1;

I am not sure about the case where we calculate max_entries from
nodes->size. Does it work in that case?

>   r = amdgpu_vm_bo_update_mapping(adev, vm, false, false, resv,
>   start, last, flags, addr,
>   dma_addr, fence);
> -- 
> 2.17.1


[PATCH] Query guest's information by VF2PF message

2020-09-02 Thread Bokun Zhang
From: Tiecheng Zhou 

  drm/amd: Add VF2PF message support and follow-up fix

  - Update VF2PF header
  - Implement VF2PF work thread

Change-Id: I15b3899719ea06b19a9ff683666f9bc4b8b6229c
Signed-off-by: Tiecheng Zhou 
Signed-off-by: Bokun Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |   9 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c| 244 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h|  85 +-
 drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h | 276 
 4 files changed, 480 insertions(+), 134 deletions(-)
 create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c4900471beb0..e0e2f3c25e3e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3396,8 +3396,10 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
/* make sure IB test finished before entering exclusive mode
 * to avoid preemption on IB test
 * */
-   if (amdgpu_sriov_vf(adev))
+   if (amdgpu_sriov_vf(adev)) {
amdgpu_virt_request_full_gpu(adev, false);
+   amdgpu_virt_fini_data_exchange(adev);
+   }
 
/* disable all interrupts */
amdgpu_irq_disable_all(adev);
@@ -4034,6 +4036,11 @@ static int amdgpu_device_pre_asic_reset(struct 
amdgpu_device *adev,
 
amdgpu_debugfs_wait_dump(adev);
 
+   if (amdgpu_sriov_vf(adev)) {
+   /* stop the data exchange thread */
+   amdgpu_virt_fini_data_exchange(adev);
+   }
+
/* block all schedulers and reset given job's ring */
for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
struct amdgpu_ring *ring = adev->rings[i];
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index f76961d17246..1f1171812e35 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -31,6 +31,12 @@
 #include "soc15.h"
 #include "nv.h"
 
+#define POPULATE_UCODE_INFO(vf2pf_info, ucode, ver) \
+   do { \
+   vf2pf_info->ucode_info[ucode].id = ucode; \
+   vf2pf_info->ucode_info[ucode].version = ver; \
+   } while (0)
+
 bool amdgpu_virt_mmio_blocked(struct amdgpu_device *adev)
 {
/* By now all MMIO pages except mailbox are blocked */
@@ -239,10 +245,10 @@ void amdgpu_virt_free_mm_table(struct amdgpu_device *adev)
 }
 
 
-int amdgpu_virt_fw_reserve_get_checksum(void *obj,
-   unsigned long obj_size,
-   unsigned int key,
-   unsigned int chksum)
+unsigned int amd_sriov_msg_checksum(void *obj,
+   unsigned long obj_size,
+   unsigned int key,
+   unsigned int checksum)
 {
unsigned int ret = key;
unsigned long i = 0;
@@ -252,9 +258,9 @@ int amdgpu_virt_fw_reserve_get_checksum(void *obj,
/* calculate checksum */
for (i = 0; i < obj_size; ++i)
ret += *(pos + i);
-   /* minus the chksum itself */
-   pos = (char *)&chksum;
-   for (i = 0; i < sizeof(chksum); ++i)
+   /* minus the checksum itself */
+   pos = (char *)&checksum;
+   for (i = 0; i < sizeof(checksum); ++i)
ret -= *(pos + i);
return ret;
 }
@@ -415,33 +421,187 @@ static void amdgpu_virt_add_bad_page(struct 
amdgpu_device *adev,
}
 }
 
-void amdgpu_virt_init_data_exchange(struct amdgpu_device *adev)
+static int amdgpu_virt_read_pf2vf_data(struct amdgpu_device *adev)
 {
-   uint32_t pf2vf_size = 0;
-   uint32_t checksum = 0;
+   struct amd_sriov_msg_pf2vf_info_header *pf2vf_info = 
adev->virt.fw_reserve.p_pf2vf;
+   uint32_t checksum;
uint32_t checkval;
-   char *str;
+
+   if (adev->virt.fw_reserve.p_pf2vf == NULL)
+   return -EINVAL;
+
+   if (pf2vf_info->size > 1024) {
+   DRM_ERROR("invalid pf2vf message size\n");
+   return -EINVAL;
+   }
+
+   switch (pf2vf_info->version) {
+   case 1:
+   checksum = ((struct amdgim_pf2vf_info_v1 
*)pf2vf_info)->checksum;
+   checkval = amd_sriov_msg_checksum(
+   adev->virt.fw_reserve.p_pf2vf, pf2vf_info->size,
+   adev->virt.fw_reserve.checksum_key, checksum);
+   if (checksum != checkval) {
+   DRM_ERROR("invalid pf2vf message\n");
+   return -EINVAL;
+   }
+
+   adev->virt.gim_feature =
+   ((struct amdgim_pf2vf_info_v1 
*)pf2vf_info)->feature_flags;
+   break;
+   case 2:
+   /* TODO: missing key, need to add it later */
+   checksum = ((struct amd_sriov_msg_pf2vf_info 
*)pf2vf_info)->checksum;
+  

Re: [PATCH 0/3] Use implicit kref infra

2020-09-02 Thread Daniel Stone
Hi Luben,

On Wed, 2 Sep 2020 at 16:16, Luben Tuikov  wrote:
> Not sure how I can do this when someone doesn't want to read up on
> the kref infrastructure. Can you help?
>
> When someone starts off with "My understanding of ..." (as in the OP) you 
> know you're
> in trouble and in for a rough time.
>
> Such is the nature of world-wide open-to-everyone mailing lists where
> anyone can put forth an argument, regardless of their level of understanding.
> The more obfuscated an argument, the more uncertainty.
>
> If one knows the kref infrastructure, it just clicks, no explanation
> necessary.

Evidently there are more points of view than yours. Evidently your
method of persuasion is also not working, because this thread is now
getting quite long and not converging on your point of view (which you
are holding to be absolutely objectively correct).

I think you need to re-evaluate the way in which you speak to people,
considering that it costs nothing to be polite and considerate, and
also takes effort to be rude and dismissive.

Cheers,
Daniel


[PATCH] drm/amdgpu: fix max_entries calculation v3

2020-09-02 Thread Christian König
Calculate the correct value for max_entries or we might run past the
end of the page_address array.

v2: Xinhui pointed out we don't need the shift
v3: use a local copy of start and simplify the calculation

Signed-off-by: Christian König 
Fixes: 1e691e244487 ("drm/amdgpu: stop allocating dummy GTT nodes")
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 8bc2253939be..7cbee1a7ec84 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1697,7 +1697,7 @@ static int amdgpu_vm_bo_split_mapping(struct 
amdgpu_device *adev,
AMDGPU_GPU_PAGES_IN_CPU_PAGE;
} else {
addr = 0;
-   max_entries = S64_MAX;
+   max_entries = mapping->last - start + 1;
}
 
if (pages_addr) {
@@ -1727,7 +1727,7 @@ static int amdgpu_vm_bo_split_mapping(struct 
amdgpu_device *adev,
addr += pfn << PAGE_SHIFT;
}
 
-   last = min((uint64_t)mapping->last, start + max_entries - 1);
+   last = start + max_entries - 1;
r = amdgpu_vm_bo_update_mapping(adev, vm, false, false, resv,
start, last, flags, addr,
dma_addr, fence);
-- 
2.17.1
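To make the bug this v3 addresses concrete (illustrative numbers only): say
mapping->start = 0x1000 and mapping->last = 0x1fff, both counted in GPU pages.
v2 computed max_entries from mapping->start, so it stayed 0x1000 on every loop
iteration. If the first amdgpu_vm_bo_update_mapping() pass consumed 0x400
entries, the local start advances to 0x1400, yet max_entries would still claim
0x1000 entries remain, 0x400 past mapping->last. v3 recomputes from the local
start (0x1fff - 0x1400 + 1 = 0xc00), which is exact, so the min() clamp on
last can be dropped.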



Re: [PATCH 0/3] Use implicit kref infra

2020-09-02 Thread Luben Tuikov
On 2020-09-02 11:00, Daniel Stone wrote:
> Hi Luben,
> 
> On Wed, 2 Sep 2020 at 15:51, Luben Tuikov  wrote:
>> Of course it's true--good morning!
>>
>> Let me stop you right there--just read the documentation I pointed
>> to you at.
>>
>> No!
>>
>> I'm sorry, that doesn't make sense.
>>
>> No, that's horrible.
>>
>> No, that's horrible.
>>
>> You need to understand how the kref infrastructure works in the kernel. I've 
>> said
>> it a million times: it's implicit.
>>
>> Or LESS. Less changes. Less is better. Basically revert and redo all this 
>> "managed resources".
> 
> There are many better ways to make your point. At the moment it's just
> getting lost in shouting.

Hi Daniel,

Not sure how I can do this when someone doesn't want to read up on
the kref infrastructure. Can you help?

When someone starts off with "My understanding of ..." (as in the OP) you know 
you're
in trouble and in for a rough time.

Such is the nature of world-wide open-to-everyone mailing lists where
anyone can put forth an argument, regardless of their level of understanding.
The more obfuscated an argument, the more uncertainty.

If one knows the kref infrastructure, it just clicks, no explanation
necessary.

Regards,
Luben

> 
> Cheers,
> Daniel
> 



Re: [PATCH] drm/amdgpu: fix max_entries calculation v2

2020-09-02 Thread Christian König

On 02.09.20 at 16:53, Pan, Xinhui wrote:



On Sep 2, 2020, at 22:31, Christian König  wrote:

On 02.09.20 at 16:27, Pan, Xinhui wrote:

On Sep 2, 2020, at 22:05, Christian König  wrote:

Calculate the correct value for max_entries or we might run past the
end of the page_address array.

v2: Xinhui pointed out we don't need the shift

Signed-off-by: Christian König 
Fixes: 1e691e244487 ("drm/amdgpu: stop allocating dummy GTT nodes")
---
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 8bc2253939be..be886bdca5c6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1697,7 +1697,7 @@ static int amdgpu_vm_bo_split_mapping(struct 
amdgpu_device *adev,
AMDGPU_GPU_PAGES_IN_CPU_PAGE;
} else {
addr = 0;
-   max_entries = S64_MAX;
+   max_entries = mapping->last - mapping->start + 1;

You need minus pfn here.

That doesn't sound correct either. The pfn is the destination of the mapping, 
e.g. the offset inside the BO and not related to the virtual address range we 
map.

I mean we need minus pfn too. pfn is mapping->offset >> PAGE_SHIFT.

In amdgpu_vm_bo_map(), there is a check  below
if (bo && offset + size > amdgpu_bo_size(bo))
return -EINVAL;
so mapping->offset is just an offset_in_bytes inside the BO as you said.


Correct, but this is the destination of the mapping and not the covered 
VA space.


In other words start can be 4, last be 5 and offset 64k and it would 
still be valid as long as the bo is at least 64k+2 pages in size.



mapping->start and mapping->last are virtual addresses in pfns, the range we 
are going to touch then is [start+ offset_in_pfns, last].


No, that is completely unrelated.

Christian.




The range we are going to touch is [start + offset, last].
so the max_entries is last - (start + offset) + 1. and offset is pfn in this 
case.

I still hit panic with this patch in practice.

Thanks for testing, I think I know what the problem is.

We need to use start instead of mapping->start, otherwise the value is
too large after the first iteration.

Give me a second for a v3.

Christian.


}

if (pages_addr) {
--
2.17.1




Re: [PATCH 0/3] Use implicit kref infra

2020-09-02 Thread Luben Tuikov
On 2020-09-02 02:52, Daniel Vetter wrote:
> On Tue, Sep 01, 2020 at 11:46:18PM -0400, Luben Tuikov wrote:
>> On 2020-09-01 21:42, Pan, Xinhui wrote:
>>> If you take a look at the below function, you should not use driver's 
>>> release to free adev. As dev is embedded in adev.
>>
>> Do you mean "look at the function below", using "below" as an adverb?
>> "below" is not an adjective.
>>
>> I know dev is embedded in adev--I did that patchset.
>>
>>>
>>>  809 static void drm_dev_release(struct kref *ref)
>>>  810 {
>>>  811 struct drm_device *dev = container_of(ref, struct drm_device, 
>>> ref);
>>>  812
>>>  813 if (dev->driver->release)
>>>  814 dev->driver->release(dev);
>>>  815 
>>>  816 drm_managed_release(dev);
>>>  817 
>>>  818 kfree(dev->managed.final_kfree);
>>>  819 }
>>
>> That's simple--this comes from change c6603c740e0e3
>> and it should be reverted. Simple as that.
>>
>> The version before this change was absolutely correct:
>>
>> static void drm_dev_release(struct kref *ref)
>> {
>>  if (dev->driver->release)
>>  dev->driver->release(dev);
>>  else
>>  drm_dev_fini(dev);
>> }
>>
>> Meaning, "the kref is now 0"--> if the driver
>> has a release, call it, else use our own.
>> But note that nothing can be assumed after this point,
>> about the existence of "dev".
>>
>> It is exactly because struct drm_device is statically
>> embedded into a container, struct amdgpu_device,
>> that this change above should be reverted.
>>
>> This is very similar to how fops has open/release
>> but no close. That is, the "release" is called
>> only when the last kref is released, i.e. when
>> kref goes from non-zero to zero.
>>
>> This uses the kref infrastructure which has been
>> around for about 20 years in the Linux kernel.
>>
>> I suggest reading the comments
>> in drm_dev.c mostly, "DOC: driver instance overview"
>> starting at line 240 onwards. This is right above
>> drm_put_dev(). There is actually an example of a driver
>> in the comment. Also the comment to drm_dev_init().
>>
>> Now, take a look at this:
>>
>> /**
>>  * drm_dev_put - Drop reference of a DRM device
>>  * @dev: device to drop reference of or NULL
>>  *
>>  * This decreases the ref-count of @dev by one. The device is destroyed if 
>> the
>>  * ref-count drops to zero.
>>  */
>> void drm_dev_put(struct drm_device *dev)
>> {
>> if (dev)
>> kref_put(&dev->ref, drm_dev_release);
>> }
>> EXPORT_SYMBOL(drm_dev_put);
>>
>> Two things:
>>
>> 1. It is us, who kzalloc the amdgpu device, which contains
>> the drm_device (you'll see this discussed in the reading
>> material I pointed to above). We do this because we're
>> probing the PCI device to see whether we'll work with it or not.
>>
>> 2. Using the kref infrastructure, when the ref goes to 0,
>> drm_dev_release is called. And here's the KEY:
>> Because WE allocated the container, we should free it--after the release
>> method is called, DRM cannot assume anything about the drm
>> device or the container. The "release" method is final.
>>
>> We allocate, we free. And we free only when the ref goes to 0.
>>
>> DRM can, in due time, "free" itself of the DRM device and stop
>> having knowledge of it--that's fine, but as long as the ref
>> is not 0, the amdgpu device and thus the contained DRM device,
>> cannot be freed.
>>
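For readers less familiar with the pattern under discussion, here is a minimal
self-contained sketch of the kref lifecycle being described (names
hypothetical, not the actual amdgpu code; needs <linux/kref.h> and
<linux/slab.h>):

struct my_dev {
        struct kref ref;
        /* ... device state ... */
};

static void my_dev_release(struct kref *ref)
{
        struct my_dev *mdev = container_of(ref, struct my_dev, ref);

        /* final reference dropped: nothing may touch mdev after this */
        kfree(mdev);
}

static struct my_dev *my_dev_create(void)
{
        struct my_dev *mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);

        if (mdev)
                kref_init(&mdev->ref);  /* refcount starts at 1 */
        return mdev;
}

Each user takes kref_get(&mdev->ref) while using the object and drops it with
kref_put(&mdev->ref, my_dev_release); the release callback runs implicitly on
the final put.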
>>>
>>> You have to make another change something like
>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>>> index 13068fdf4331..2aabd2b4c63b 100644
>>> --- a/drivers/gpu/drm/drm_drv.c
>>> +++ b/drivers/gpu/drm/drm_drv.c
>>> @@ -815,7 +815,8 @@ static void drm_dev_release(struct kref *ref)
>>>  
>>> drm_managed_release(dev);
>>>  
>>> -   kfree(dev->managed.final_kfree);
>>> +   if (dev->driver->final_release)
>>> +   dev->driver->final_release(dev);
>>>  }
>>
>> No. What's this?
>> There is no such thing as "final" release, nor is there a "partial" release.
>> When the kref goes to 0, the device disappears. Simple.
>> If someone is using it, they should kref-get it, and when they're
>> done with it, they should kref-put it.
>>
>> The whole point is that this is done implicitly, via the kref infrastructure.
>> drm_dev_init() which we call in our PCI probe function, sets the kref to 
>> 1--all
>> as per the documentation I pointed you to above.
>>
>> Another point is that we can do some other stuff in the release
>> function, notify someone, write some registers, free memory we use
>> for that PCI device, etc.
>>
>> If the "managed resources" infrastructure wants to stay, it should hook
>> itself into drm_dev_fini() and into drm_dev_init() or drm_dev_register().
>> It shouldn't have to be so out-of-place like in patch 2/3 of this series,
>> where the drmm_add_final_kfree() is smack-dab in the middle of our PCI
>> discovery function, surrounded on top and bottom by drm_dev_init()
>> and drm_dev_register(). The "managed resources" infra 

Re: [PATCH 0/3] Use implicit kref infra

2020-09-02 Thread Daniel Stone
Hi Luben,

On Wed, 2 Sep 2020 at 15:51, Luben Tuikov  wrote:
> Of course it's true--good morning!
>
> Let me stop you right there--just read the documentation I pointed
> to you at.
>
> No!
>
> I'm sorry, that doesn't make sense.
>
> No, that's horrible.
>
> No, that's horrible.
>
> You need to understand how the kref infrastructure works in the kernel. I've 
> said
> it a million times: it's implicit.
>
> Or LESS. Less changes. Less is better. Basically revert and redo all this 
> "managed resources".

There are many better ways to make your point. At the moment it's just
getting lost in shouting.

Cheers,
Daniel


Re: [PATCH] drm/amdgpu: fix max_entries calculation v2

2020-09-02 Thread Pan, Xinhui


> On Sep 2, 2020, at 22:31, Christian König  wrote:
> 
> On 02.09.20 at 16:27, Pan, Xinhui wrote:
>> 
>>> On Sep 2, 2020, at 22:05, Christian König  wrote:
>>> 
>>> Calculate the correct value for max_entries or we might run past the
>>> end of the page_address array.
>>> 
>>> v2: Xinhui pointed out we don't need the shift
>>> 
>>> Signed-off-by: Christian König 
>>> Fixes: 1e691e244487 ("drm/amdgpu: stop allocating dummy GTT nodes")
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>> 
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> index 8bc2253939be..be886bdca5c6 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> @@ -1697,7 +1697,7 @@ static int amdgpu_vm_bo_split_mapping(struct 
>>> amdgpu_device *adev,
>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE;
>>> } else {
>>> addr = 0;
>>> -   max_entries = S64_MAX;
>>> +   max_entries = mapping->last - mapping->start + 1;
>> You need minus pfn here.
> 
> That doesn't sound correct either. The pfn is the destination of the mapping, 
> e.g. the offset inside the BO and not related to the virtual address range we 
> map.

I mean we need minus pfn too. pfn is mapping->offset >> PAGE_SHIFT.

In amdgpu_vm_bo_map(), there is a check  below
if (bo && offset + size > amdgpu_bo_size(bo))
return -EINVAL;
so mapping->offset is just an offset_in_bytes inside the BO as you said. 

mapping->start and mapping->last are virtual addresses in pfns, the range we 
are going to touch then is [start+ offset_in_pfns, last].

> 
>> The range we are going to touch is [start + offset, last].
>> so the max_entries is last - (start + offset) + 1. and offset is pfn in this 
>> case.
>> 
>> I still hit panic with this patch in practice.
> 
> Thanks for testing, I think I know what the problem is.
> 
> We need to use start instead of mapping->start, otherwise the value is
> too large after the first iteration.
> 
> Give me a second for a v3.
> 
> Christian.
> 
>> 
>>> }
>>> 
>>> if (pages_addr) {
>>> -- 
>>> 2.17.1
>>> 
> 



Re: [PATCH 0/3] Use implicit kref infra

2020-09-02 Thread Luben Tuikov
On 2020-09-02 00:43, Pan, Xinhui wrote:
> 
> 
>> On Sep 2, 2020, at 11:46, Tuikov, Luben  wrote:
>>
>> On 2020-09-01 21:42, Pan, Xinhui wrote:
>>> If you take a look at the below function, you should not use driver's 
>>> release to free adev. As dev is embedded in adev.
>>
>> Do you mean "look at the function below", using "below" as an adverb?
>> "below" is not an adjective.
>>
>> I know dev is embedded in adev--I did that patchset.
>>
>>>
>>> 809 static void drm_dev_release(struct kref *ref)
>>> 810 {
>>> 811 struct drm_device *dev = container_of(ref, struct drm_device, 
>>> ref);
>>> 812
>>> 813 if (dev->driver->release)
>>> 814 dev->driver->release(dev);
>>> 815 
>>> 816 drm_managed_release(dev);
>>> 817 
>>> 818 kfree(dev->managed.final_kfree);
>>> 819 }
>>
>> That's simple--this comes from change c6603c740e0e3
>> and it should be reverted. Simple as that.
>>
>> The version before this change was absolutely correct:
>>
>> static void drm_dev_release(struct kref *ref)
>> {
>>  if (dev->driver->release)
>>  dev->driver->release(dev);
>>  else
>>  drm_dev_fini(dev);
>> }
>>
>> Meaning, "the kref is now 0"--> if the driver
>> has a release, call it, else use our own.
>> But note that nothing can be assumed after this point,
>> about the existence of "dev".
>>
>> It is exactly because struct drm_device is statically
>> embedded into a container, struct amdgpu_device,
>> that this change above should be reverted.
>>
>> This is very similar to how fops has open/release
>> but no close. That is, the "release" is called
>> only when the last kref is released, i.e. when
>> kref goes from non-zero to zero.
>>
>> This uses the kref infrastructure which has been
>> around for about 20 years in the Linux kernel.
>>
>> I suggest reading the comments
>> in drm_dev.c mostly, "DOC: driver instance overview"
>> starting at line 240 onwards. This is right above
>> drm_put_dev(). There is actually an example of a driver
>> in the comment. Also the comment to drm_dev_init().
>>
>> Now, take a look at this:
>>
>> /**
>> * drm_dev_put - Drop reference of a DRM device
>> * @dev: device to drop reference of or NULL
>> *
>> * This decreases the ref-count of @dev by one. The device is destroyed if the
>> * ref-count drops to zero.
>> */
>> void drm_dev_put(struct drm_device *dev)
>> {
>>if (dev)
>>kref_put(&dev->ref, drm_dev_release);
>> }
>> EXPORT_SYMBOL(drm_dev_put);
>>
>> Two things:
>>
>> 1. It is us, who kzalloc the amdgpu device, which contains
>> the drm_device (you'll see this discussed in the reading
>> material I pointed to above). We do this because we're
>> probing the PCI device to see whether we'll work with it or not.
>>
> 
> that is true.

Of course it's true--good morning!

> My understanding of the drm core code is something like the following.

Let me stop you right there--just read the documentation I pointed
to you at.

> struct B { 
>   struct A
> }
> we initialize A first and initialize B at the end, but destroy B first
> and destroy A at the end.

No!
B, which is the amdgpu_device struct "exists" before A, which is the DRM struct.
This is why DRM recommends to _embed_ it into the driver's own device struct,
as the documentation I pointed you to at.

A, the DRM struct, is an abstraction, and is "created" last, and
"undone" first, since the DRM layer may finish with a device, but
the device may still exists with the driver and as well as with PCI.
This is very VERY common, with kernels, devices, device abstractions,
device layers: DRM dev <-- amdgpu dev <-- PCI dev.
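Schematically, the embedding described here looks like the following (heavily
simplified; the real amdgpu_device carries far more state):

struct amdgpu_device_example {
        struct drm_device ddev;         /* the DRM abstraction, embedded */
        /* ... amdgpu- and PCI-level state ... */
};

/* recover the container from the embedded DRM device */
static inline struct amdgpu_device_example *
to_adev_example(struct drm_device *ddev)
{
        return container_of(ddev, struct amdgpu_device_example, ddev);
}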

> But yes, practice is more complex. 
> if B has nothing to be destroyed, we can destroy A directly; otherwise
> destroy B first.

I'm sorry, that doesn't make sense. There is no such thing as "destroy directly"
and "otherwise"--this is absolutely not how this works.

A good architecture doesn't have if-then-else--it's just a pure single-branch 
path.

> 
> in this case, we can do something below in our release()
> //some cleanup work of B
> drm_dev_fini(dev);//destroy A
> kfree(adev)
> 
>> 2. Using the kref infrastructure, when the ref goes to 0,
>> drm_dev_release is called. And here's the KEY:
>> Because WE allocated the container, we should free it--after the release
>> method is called, DRM cannot assume anything about the drm
>> device or the container. The "release" method is final.
>>
>> We allocate, we free. And we free only when the ref goes to 0.
>>
>> DRM can, in due time, "free" itself of the DRM device and stop
>> having knowledge of it--that's fine, but as long as the ref
>> is not 0, the amdgpu device and thus the contained DRM device,
>> cannot be freed.
>>
>>>
>>> You have to make another change something like
>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>>> index 13068fdf4331..2aabd2b4c63b 100644
>>> --- a/drivers/gpu/drm/drm_drv.c
>>> +++ b/drivers/gpu/drm/drm_drv.c
>>> @@ -815,7 +815,8 @@ static void 

Re: [PATCH] drm/amdgpu: Fix a redundant kfree

2020-09-02 Thread Luben Tuikov
On 2020-09-02 07:27, Daniel Vetter wrote:
> On Tue, Sep 1, 2020 at 10:35 PM Luben Tuikov  wrote:
>>
>> On 2020-09-01 10:12 a.m., Alex Deucher wrote:
>>> On Tue, Sep 1, 2020 at 3:46 AM Pan, Xinhui  wrote:

 [AMD Official Use Only - Internal Distribution Only]

 drm_dev_alloc() allocs *dev* and sets managed.final_kfree to dev so it
 frees itself.
 Now, since commit 5cdd68498918 ("drm/amdgpu: Embed drm_device into
 amdgpu_device (v3)"), we alloc *adev* and ddev is just a member of it.
 So drm_dev_release then tries to free the wrong pointer.

 Also, the driver's release tries to free adev, but drm_dev_release will
 access dev after calling the driver's release.

 To fix it, remove the driver's release and set managed.final_kfree to adev.
>>>
>>> I've got to admit, the documentation around drm_dev_init is hard to
>>> follow.  We aren't really using the drmm stuff, but you have to use
>>> drmm_add_final_kfree to avoid a warning.  The logic seems to make
>>> sense though.
> 
> I've just resent the patch which should clarify all this a bit.
> 
> And the warning isn't there just for lolz, if you enable KASAN it will
> report a use-after-free if you don't set this all up correctly. Note
> that drmm_ is already used by drm core code internally for everyone.

Well, you made the changes--of course it is. But something
being used by "everyone", doesn't mean it is a good thing.

It seems the motivation behind "managed resources", may have
been good, but the implementation, as is right now, makes
a mockery of the kref infrastructure and the original
_clean_ design of DRM init/fini sequence as I showed
in a previous email quoting the original version
of drm_dev_release():

static void drm_dev_release(struct kref *ref)
{
if (dev->driver->release)
dev->driver->release(dev);
else
drm_dev_fini(dev);
}

FWIW, the managed resources shouldn't be even known
by drivers, if well implemented--it should fold
into the current/original DRM driver infra.

Regards,
Luben

> -Daniel
> 
>>> Acked-by: Alex Deucher 
>>
>> The logic in patch 3/3 uses the kref infrastructure
>> as described in drm_drv.c's comment on what the DRM
>> usage is, i.e. taking advantage of the kref infrastructure.
>>
>> In amdgpu_pci_probe() we call drm_dev_init() which takes
>> a ref of 1 on the kref in the DRM device structure,
>> and from then on, only when the kref transitions
>> from non-zero to 0, do we free the container of
>> DRM device, and this is beautifully shown in the
>> kernel stack below (please take a look at the kernel
>> stack below).
>>
>> Using a kref is very powerful as it is implicit:
>> when the kref transitions from non-zero to 0,
>> then call the release method.
>>
>> Furthermore, we own the release method, and we
>> like that, as it is pure; also,
>> there may be more things we'd like to do in the future
>> before we free the amdgpu device: maybe free memory we're
>> using specifically for that PCI device, maybe write
>> some registers, maybe notify someone or something, etc.
>>
>> On another note, adding "drmm_add_final_kfree()" in the middle
>> of amdgpu_pci_probe() seems hackish--it's neither part
>> of drm_dev_init() nor of drm_dev_register(). We really
>> don't need it, since we rely on the kref infrastructure
>> to tell us when to free the device, and if you'd look
>> at the beautiful stack below, it knows exactly when that is,
>> i.e. when to free it.
>>
>> The correct thing to do this is to
>> _leave the amdgpu_driver_release()_ alone,
>> remove "drmm_add_final_kfree()" and qualify
>> the WARN_ON() in drm_dev_register() by
>> the existence of drm_driver.release() (i.e. non-NULL).
>>
>> I'll submit a sequence of patches to fix this right.
>>
>> Regards,
>> Luben
>>
>>>

 [   36.269348] BUG: unable to handle page fault for address: 
 a0c279940028
 [   36.276841] #PF: supervisor read access in kernel mode
 [   36.282434] #PF: error_code(0x) - not-present page
 [   36.288053] PGD 676601067 P4D 676601067 PUD 86a414067 PMD 86a247067 PTE 
 8008066bf060
 [   36.296868] Oops:  [#1] SMP DEBUG_PAGEALLOC NOPTI
 [   36.302409] CPU: 4 PID: 1375 Comm: bash Tainted: G   O  
 5.9.0-rc2+ #46
 [   36.310670] Hardware name: System manufacturer System Product 
 Name/PRIME Z390-A, BIOS 1401 11/26/2019
 [   36.320725] RIP: 0010:drm_managed_release+0x25/0x110 [drm]
 [   36.326741] Code: 80 00 00 00 00 0f 1f 44 00 00 55 48 c7 c2 5a 9f 41 c0 
 be 00 02 00 00 48 89 e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 08 
 <48> 8b 7f 18 e8 c2 10 ff ff 4d 8b 74 24 20 49 8d 44 24 5
 [   36.347217] RSP: 0018:b9424141fce0 EFLAGS: 00010282
 [   36.352931] RAX: 0006 RBX: a0c279940010 RCX: 
 0006
 [   36.360718] RDX: c0419f5a RSI: 0200 RDI: 
 a0c279940010
 [   36.368503] RBP: b9424141fd10 R08: 0001 R09: 
 

Re: [PATCH] drm/amdgpu: fix max_entries calculation v2

2020-09-02 Thread Christian König

On 02.09.20 at 16:27, Pan, Xinhui wrote:



On Sep 2, 2020, at 22:05, Christian König  wrote:

Calculate the correct value for max_entries or we might run past the
end of the page_address array.

v2: Xinhui pointed out we don't need the shift

Signed-off-by: Christian König 
Fixes: 1e691e244487 ("drm/amdgpu: stop allocating dummy GTT nodes")
---
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 8bc2253939be..be886bdca5c6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1697,7 +1697,7 @@ static int amdgpu_vm_bo_split_mapping(struct 
amdgpu_device *adev,
AMDGPU_GPU_PAGES_IN_CPU_PAGE;
} else {
addr = 0;
-   max_entries = S64_MAX;
+   max_entries = mapping->last - mapping->start + 1;

You need minus pfn here.


That doesn't sound correct either. The pfn is the destination of the 
mapping, e.g. the offset inside the BO and not related to the virtual 
address range we map.



The range we are going to touch is [start + offset, last].
so the max_entries is last - (start + offset) + 1. and offset is pfn in this 
case.

I still hit panic with this patch in practice.


Thanks for testing, I think I know what the problem is.

We need to use start instead of mapping->start, otherwise the value is
too large after the first iteration.


Give me a second for a v3.

Christian.




}

if (pages_addr) {
--
2.17.1





Re: [PATCH 2/2] drm/amdgpu/gmc10: print client id string for gfxhub

2020-09-02 Thread Deucher, Alexander
[AMD Official Use Only - Internal Distribution Only]

I'm working on the mmhub clients list.  Will send out patches for them soon.

Alex


From: Christian König 
Sent: Wednesday, September 2, 2020 3:17 AM
To: Kuehling, Felix ; Alex Deucher 
; amd-gfx@lists.freedesktop.org 

Cc: Deucher, Alexander 
Subject: Re: [PATCH 2/2] drm/amdgpu/gmc10: print client id string for gfxhub

On 02.09.20 at 04:32, Felix Kuehling wrote:
> Should there be a corresponding change in mmhub_v2_0.c?

It would be at least nice to have.

Maybe we should put a pointer to the array and its size into the hub
structure instead?
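That suggestion would look roughly like this (a sketch only; the added field
names are hypothetical):

struct amdgpu_vmhub {
        /* ... existing fields ... */
        const char * const *client_id_names;
        unsigned int num_client_ids;
};

static const char *hub_client_name(const struct amdgpu_vmhub *hub, u32 cid)
{
        if (cid >= hub->num_client_ids)
                return "unknown";
        return hub->client_id_names[cid];
}

The per-IP fault handlers would then share one lookup helper instead of each
carrying its own copy of the array and the bounds check.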

Anyway Reviewed-by: Christian König  for now.

Christian.

>
> Other than that, the series is
>
> Reviewed-by: Felix Kuehling 
>
> On 2020-09-01 5:51 p.m., Alex Deucher wrote:
>> Print the name of the client rather than the number.  This
>> makes it easier to debug what block is causing the fault.
>>
>> Signed-off-by: Alex Deucher 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c | 30 +---
>>   drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c | 30 +---
>>   2 files changed, 54 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
>> b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
>> index 76acd7f7723e..b882ac59879a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
>> @@ -31,6 +31,27 @@
>> #include "soc15_common.h"
>>   +static const char *gfxhub_client_ids[] = {
>> +"CB/DB",
>> +"Reserved",
>> +"GE1",
>> +"GE2",
>> +"CPF",
>> +"CPC",
>> +"CPG",
>> +"RLC",
>> +"TCP",
>> +"SQC (inst)",
>> +"SQC (data)",
>> +"SQG",
>> +"Reserved",
>> +"SDMA0",
>> +"SDMA1",
>> +"GCR",
>> +"SDMA2",
>> +"SDMA3",
>> +};
>> +
>>   static uint32_t gfxhub_v2_0_get_invalidate_req(unsigned int vmid,
>>  uint32_t flush_type)
>>   {
>> @@ -55,12 +76,15 @@ static void
>>   gfxhub_v2_0_print_l2_protection_fault_status(struct amdgpu_device
>> *adev,
>>uint32_t status)
>>   {
>> +u32 cid = REG_GET_FIELD(status,
>> +GCVM_L2_PROTECTION_FAULT_STATUS, CID);
>> +
>>   dev_err(adev->dev,
>>   "GCVM_L2_PROTECTION_FAULT_STATUS:0x%08X\n",
>>   status);
>> -dev_err(adev->dev, "\t Faulty UTCL2 client ID: 0x%lx\n",
>> -REG_GET_FIELD(status,
>> -GCVM_L2_PROTECTION_FAULT_STATUS, CID));
>> +dev_err(adev->dev, "\t Faulty UTCL2 client ID: %s (0x%x)\n",
>> +cid >= ARRAY_SIZE(gfxhub_client_ids) ? "unknown" :
>> gfxhub_client_ids[cid],
>> +cid);
>>   dev_err(adev->dev, "\t MORE_FAULTS: 0x%lx\n",
>>   REG_GET_FIELD(status,
>>   GCVM_L2_PROTECTION_FAULT_STATUS, MORE_FAULTS));
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
>> b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
>> index 80c906a0383f..237a9ff5afa0 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
>> @@ -31,6 +31,27 @@
>> #include "soc15_common.h"
>>   +static const char *gfxhub_client_ids[] = {
>> +"CB/DB",
>> +"Reserved",
>> +"GE1",
>> +"GE2",
>> +"CPF",
>> +"CPC",
>> +"CPG",
>> +"RLC",
>> +"TCP",
>> +"SQC (inst)",
>> +"SQC (data)",
>> +"SQG",
>> +"Reserved",
>> +"SDMA0",
>> +"SDMA1",
>> +"GCR",
>> +"SDMA2",
>> +"SDMA3",
>> +};
>> +
>>   static uint32_t gfxhub_v2_1_get_invalidate_req(unsigned int vmid,
>>  uint32_t flush_type)
>>   {
>> @@ -55,12 +76,15 @@ static void
>>   gfxhub_v2_1_print_l2_protection_fault_status(struct amdgpu_device
>> *adev,
>>uint32_t status)
>>   {
>> +u32 cid = REG_GET_FIELD(status,
>> +GCVM_L2_PROTECTION_FAULT_STATUS, CID);
>> +
>>   dev_err(adev->dev,
>>   "GCVM_L2_PROTECTION_FAULT_STATUS:0x%08X\n",
>>   status);
>> -dev_err(adev->dev, "\t Faulty UTCL2 client ID: 0x%lx\n",
>> -REG_GET_FIELD(status,
>> -GCVM_L2_PROTECTION_FAULT_STATUS, CID));
>> +dev_err(adev->dev, "\t Faulty UTCL2 client ID: %s (0x%x)\n",
>> +cid >= ARRAY_SIZE(gfxhub_client_ids) ? "unknown" :
>> gfxhub_client_ids[cid],
>> +cid);
>>   dev_err(adev->dev, "\t MORE_FAULTS: 0x%lx\n",
>>   REG_GET_FIELD(status,
>>   GCVM_L2_PROTECTION_FAULT_STATUS, MORE_FAULTS));

Re: [PATCH] drm/amdgpu: fix max_entries calculation v2

2020-09-02 Thread Pan, Xinhui


> On Sep 2, 2020, at 22:05, Christian König  wrote:
> 
> Calculate the correct value for max_entries or we might run past the
> end of the page_address array.
> 
> v2: Xinhui pointed out we don't need the shift
> 
> Signed-off-by: Christian König 
> Fixes: 1e691e244487 ("drm/amdgpu: stop allocating dummy GTT nodes")
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 8bc2253939be..be886bdca5c6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -1697,7 +1697,7 @@ static int amdgpu_vm_bo_split_mapping(struct 
> amdgpu_device *adev,
>   AMDGPU_GPU_PAGES_IN_CPU_PAGE;
>   } else {
>   addr = 0;
> - max_entries = S64_MAX;
> + max_entries = mapping->last - mapping->start + 1;

You need minus pfn here.

The range we are going to touch is [start + offset, last].
so the max_entries is last - (start + offset) + 1. and offset is pfn in this 
case.

I still hit panic with this patch in practice.

>   }
> 
>   if (pages_addr) {
> -- 
> 2.17.1
> 



Re: [PATCH] drm/amdgpu: fix max_entries calculation v2

2020-09-02 Thread Deucher, Alexander
[AMD Official Use Only - Internal Distribution Only]

Acked-by: Alex Deucher 

From: amd-gfx  on behalf of Christian 
König 
Sent: Wednesday, September 2, 2020 10:05 AM
To: Pan, Xinhui ; amd-gfx@lists.freedesktop.org 

Subject: [PATCH] drm/amdgpu: fix max_entries calculation v2

Calculate the correct value for max_entries or we might run past the
end of the page_address array.

v2: Xinhui pointed out we don't need the shift

Signed-off-by: Christian König 
Fixes: 1e691e244487 ("drm/amdgpu: stop allocating dummy GTT nodes")
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 8bc2253939be..be886bdca5c6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1697,7 +1697,7 @@ static int amdgpu_vm_bo_split_mapping(struct 
amdgpu_device *adev,
 AMDGPU_GPU_PAGES_IN_CPU_PAGE;
 } else {
 addr = 0;
-   max_entries = S64_MAX;
+   max_entries = mapping->last - mapping->start + 1;
 }

 if (pages_addr) {
--
2.17.1



Re: [PATCH] drm/amdgpu: add ta firmware load in psp_v12_0 for renoir

2020-09-02 Thread Deucher, Alexander
[AMD Public Use]

We also need to release the firmware when the driver unloads, or is that already 
handled in some common path?

Alex


From: amd-gfx  on behalf of 
Changfeng.Zhu 
Sent: Tuesday, September 1, 2020 10:25 PM
To: amd-gfx@lists.freedesktop.org ; Huang, Ray 
; Lakha, Bhawanpreet 
Cc: Zhu, Changfeng 
Subject: [PATCH] drm/amdgpu: add ta firmware load in psp_v12_0 for renoir

From: changzhu 

From: Changfeng 

It needs to load renoir_ta firmware because hdcp is enabled by default
for renoir now. This avoids the error: "DTM TA is not initialized".

Change-Id: Ib2f03a531013e4b432c2e9d4ec3dc021b4f8da7d
Signed-off-by: Changfeng 
---
 drivers/gpu/drm/amd/amdgpu/psp_v12_0.c | 54 ++
 1 file changed, 54 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c 
b/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
index 6c9614f77d33..75489313dbad 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
@@ -38,6 +38,8 @@
 #include "oss/osssys_4_0_sh_mask.h"

 MODULE_FIRMWARE("amdgpu/renoir_asd.bin");
+MODULE_FIRMWARE("amdgpu/renoir_ta.bin");
+
 /* address block */
 #define smnMP1_FIRMWARE_FLAGS   0x3010024

@@ -45,7 +47,10 @@ static int psp_v12_0_init_microcode(struct psp_context *psp)
 {
 struct amdgpu_device *adev = psp->adev;
 const char *chip_name;
+   char fw_name[30];
 int err = 0;
+   const struct ta_firmware_header_v1_0 *ta_hdr;
+   DRM_DEBUG("\n");

 switch (adev->asic_type) {
 case CHIP_RENOIR:
@@ -56,6 +61,55 @@ static int psp_v12_0_init_microcode(struct psp_context *psp)
 }

 err = psp_init_asd_microcode(psp, chip_name);
+   if (err)
+   goto out;
+
+   snprintf(fw_name, sizeof(fw_name), "amdgpu/%s_ta.bin", chip_name);
+   err = request_firmware(>psp.ta_fw, fw_name, adev->dev);
+   if (err) {
+   release_firmware(adev->psp.ta_fw);
+   adev->psp.ta_fw = NULL;
+   dev_info(adev->dev,
+"psp v12.0: Failed to load firmware \"%s\"\n",
+fw_name);
+   } else {
+   err = amdgpu_ucode_validate(adev->psp.ta_fw);
+   if (err)
+   goto out2;
+
+   ta_hdr = (const struct ta_firmware_header_v1_0 *)
+adev->psp.ta_fw->data;
+   adev->psp.ta_hdcp_ucode_version =
+   le32_to_cpu(ta_hdr->ta_hdcp_ucode_version);
+   adev->psp.ta_hdcp_ucode_size =
+   le32_to_cpu(ta_hdr->ta_hdcp_size_bytes);
+   adev->psp.ta_hdcp_start_addr =
+   (uint8_t *)ta_hdr +
+   le32_to_cpu(ta_hdr->header.ucode_array_offset_bytes);
+
+   adev->psp.ta_fw_version = 
le32_to_cpu(ta_hdr->header.ucode_version);
+
+   adev->psp.ta_dtm_ucode_version =
+   le32_to_cpu(ta_hdr->ta_dtm_ucode_version);
+   adev->psp.ta_dtm_ucode_size =
+   le32_to_cpu(ta_hdr->ta_dtm_size_bytes);
+   adev->psp.ta_dtm_start_addr =
+   (uint8_t *)adev->psp.ta_hdcp_start_addr +
+   le32_to_cpu(ta_hdr->ta_dtm_offset_bytes);
+   }
+
+   return 0;
+
+out2:
+   release_firmware(adev->psp.ta_fw);
+   adev->psp.ta_fw = NULL;
+out:
+   if (err) {
+   dev_err(adev->dev,
+   "psp v12.0: Failed to load firmware \"%s\"\n",
+   fw_name);
+   }
+
 return err;
 }

--
2.17.1
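On Alex's question about unloading: if the teardown is not already handled in
a common psp fini path, the matching release would be a sketch along these
lines (function name hypothetical):

static int psp_v12_0_fini_microcode_example(struct psp_context *psp)
{
        struct amdgpu_device *adev = psp->adev;

        /* release_firmware() tolerates NULL, so this is safe even when
         * the TA firmware failed to load */
        release_firmware(adev->psp.ta_fw);
        adev->psp.ta_fw = NULL;
        return 0;
}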



[PATCH] drm/amdgpu: fix max_entries calculation v2

2020-09-02 Thread Christian König
Calculate the correct value for max_entries or we might run past the
end of the page_address array.

v2: Xinhui pointed out we don't need the shift

Signed-off-by: Christian König 
Fixes: 1e691e244487 ("drm/amdgpu: stop allocating dummy GTT nodes")
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 8bc2253939be..be886bdca5c6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1697,7 +1697,7 @@ static int amdgpu_vm_bo_split_mapping(struct 
amdgpu_device *adev,
AMDGPU_GPU_PAGES_IN_CPU_PAGE;
} else {
addr = 0;
-   max_entries = S64_MAX;
+   max_entries = mapping->last - mapping->start + 1;
}
 
if (pages_addr) {
-- 
2.17.1



Re: [PATCH v2 0/24] backlight: add init macros and accessors

2020-09-02 Thread Daniel Thompson
On Fri, Aug 28, 2020 at 11:40:28AM +0200, Linus Walleij wrote:
> On Sun, Aug 23, 2020 at 12:45 PM Sam Ravnborg  wrote:
> 
> > The first patch trims backlight_update_status() so it can be called with a 
> > NULL
> > backlight_device. Then the caller does not need to add this check just to 
> > avoid
> > a NULL reference.
> >
> > The backlight drivers use several different patterns when registering
> > a backlight:
> >
> > - Register backlight and assign properties later
> > - Define a local backlight_properties variable and use memset
> > - Define a const backlight_properties and assign relevant properties
> >
> > On top of this there were differences in which members were assigned.
> >
> > To align how backlight drivers are initialized, introduce the following helper 
> > macros:
> > - DECLARE_BACKLIGHT_INIT_FIRMWARE()
> > - DECLARE_BACKLIGHT_INIT_PLATFORM()
> > - DECLARE_BACKLIGHT_INIT_RAW()
> >
> > The macros are introduced in patch 2.
> >
> > The backlight drivers used direct access to backlight_properties.
> > Encapsulate these in get/set access operations, resulting in the following 
> > benefits:
> > - The access methods can be called with a NULL pointer so logic around the
> >   access can be made simpler.
> > - The update_brightness and enable_brightness helpers simplify their users
> > - The code is in most cases more readable with the access operations.
> > - When everyone uses the access methods refactoring in the backlight core 
> > is simpler.
> >
> > The get/set operations are introduced in patch 3.
> >
> > The gpio backlight driver received a small overhaul in a set of three 
> > patches.
> > The result is a smaller and more readable driver.
> >
> > The remaining patches update all backlight users in drivers/gpu/drm/*
> > With this patch set all of drivers/gpu/drm/:
> > - All backlight references to FB_BLANK* are gone from drm/*
> > - All direct references to backlight properties are gone
> > - All panel drivers use the devm_ variant for registering backlight
> >   Daniel Vetter had some concerns with this for future updates,
> >   but we are aligned now and can update if refactoring demands it
> > - All panel drivers use the backlight support in drm_panel
> >
> > Individual patches are only sent to the people listed in the patch + a few 
> > more.
> > Please check https://lore.kernel.org/dri-devel/ for the full series.
> >
> > v2:
> >   - Documented BACKLIGHT_PROPS as it may be used by drivers
> >   - Dropped backlight_set_power_{on,off}, they were a mistake (Daniel)
> >   - Added backlight_update_brightness() and use it (Daniel)
> >   - Added backlight_enable_brightness() and use it
> >   - Moved remaining drm_panel driver to use backlight support in drm_panel
> >   - gpio backlight driver overhaul
> >
> > The patches are made on top of the for-backlight-next branch at
> > https://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight.git
> > The branch needs v5.8-rc1 backported to build as dev_err_probe()
> > is used.
> >
> > The first 6 patches are candidates for the backlight tree.
> > If they are applied then this should preferably be to an immutable
> > branch we can merge to drm-misc-next where the drm patches shall go.
> >
> > The drm patches have known conflicts and shall *not* be applied to the
> > backlight tree, they are included in this patchset to show how the
> > new functions are used.
> >
> > Diffstat for the drm bits alone looks nice:
> >  25 files changed, 243 insertions(+), 460 deletions(-)
> >
> > Feedback welcome!
> 
> Thank you for trying to make backlight easier for developers.
> I am a big supporter of this type of simplifications and
> generalizations, it is what makes DRM great.

+1!

I've reviewed and sent out patch by patch replies for the backlight
patches.

I've eyeballed the drm patches but not reviewed them at the same depth,
and FWIW, for all the patches whose subject *doesn't* start with
backlight:
Acked-by: Daniel Thompson 


Daniel.
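For context, the memset-plus-assignment pattern the cover letter sets out to
consolidate typically looks like this today (a sketch; the driver name, devm
usage, ops and brightness values are made up for illustration):

static const struct backlight_ops example_bl_ops = {
        /* .update_status / .get_brightness filled in by the driver */
};

static int example_bl_probe(struct device *dev, void *priv)
{
        struct backlight_properties props;
        struct backlight_device *bd;

        memset(&props, 0, sizeof(props));
        props.type = BACKLIGHT_RAW;
        props.max_brightness = 255;
        props.brightness = 255;

        bd = devm_backlight_device_register(dev, "example-bl", dev, priv,
                                            &example_bl_ops, &props);
        return PTR_ERR_OR_ZERO(bd);
}

As the cover letter describes, the DECLARE_BACKLIGHT_INIT_* macros aim to
replace exactly this kind of boilerplate with a one-line declaration.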


Re: [PATCH] drm/amdgpu: fix max_entries calculation

2020-09-02 Thread Christian König

On 02.09.20 at 15:02, Pan, Xinhui wrote:



On Sep 2, 2020, at 20:05, Christian König  wrote:

Calculate the correct value for max_entries or we might run past the
end of the page_address array.

Signed-off-by: Christian König 
Fixes: 1e691e244487 ("drm/amdgpu: stop allocating dummy GTT nodes")
---
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 8bc2253939be..8aa9584c184f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1697,7 +1697,8 @@ static int amdgpu_vm_bo_split_mapping(struct 
amdgpu_device *adev,
AMDGPU_GPU_PAGES_IN_CPU_PAGE;
} else {
addr = 0;
-   max_entries = S64_MAX;
+   max_entries = ((mapping->last - mapping->start) >>
+  AMDGPU_GPU_PAGE_SHIFT) + 1;

should it be like below?
max_entries = (mapping->last - mapping->start + 1 - pfn) * 
AMDGPU_GPU_PAGES_IN_CPU_PAGE;


Still not correct, but mine wasn't correct either.


last and start are already pfns. why still >> AMDGPU_GPU_PAGE_SHIFT? Am I 
missing something?


Yeah, that's wrong.

But this is in AMDGPU_GPU_PAGE_SIZE units and not PAGE_SIZE units, so 
multiplying it by AMDGPU_GPU_PAGES_IN_CPU_PAGE doesn't make too much 
sense either.


Christian.
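To unpack the unit mix-up (illustrative): AMDGPU_GPU_PAGE_SIZE is the GPU's
4 KiB page, and AMDGPU_GPU_PAGES_IN_CPU_PAGE = PAGE_SIZE / AMDGPU_GPU_PAGE_SIZE.
On a CPU with 64 KiB pages that factor is 16, and mapping->start/last already
count GPU pages while pages_addr[] is indexed per CPU page. A range of 0x100
GPU pages therefore spans 0x100 / 16 = 0x10 entries of pages_addr[] there;
multiplying the GPU-page count by 16 again would over-scale it rather than
convert it.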




}

if (pages_addr) {
--
2.17.1





Re: [PATCH] drm/amdgpu: fix max_entries calculation

2020-09-02 Thread Pan, Xinhui


> On Sep 2, 2020, at 20:05, Christian König  wrote:
> 
> Calculate the correct value for max_entries or we might run past the
> end of the page_address array.
> 
> Signed-off-by: Christian König 
> Fixes: 1e691e244487 ("drm/amdgpu: stop allocating dummy GTT nodes")
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 8bc2253939be..8aa9584c184f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -1697,7 +1697,8 @@ static int amdgpu_vm_bo_split_mapping(struct 
> amdgpu_device *adev,
>   AMDGPU_GPU_PAGES_IN_CPU_PAGE;
>   } else {
>   addr = 0;
> - max_entries = S64_MAX;
> + max_entries = ((mapping->last - mapping->start) >>
> +AMDGPU_GPU_PAGE_SHIFT) + 1;

should it be like below?
max_entries = (mapping->last - mapping->start + 1 - pfn) * 
AMDGPU_GPU_PAGES_IN_CPU_PAGE;

last and start are already pfns. why still >> AMDGPU_GPU_PAGE_SHIFT? Am I 
missing something?

>   }
> 
>   if (pages_addr) {
> -- 
> 2.17.1
> 



[PATCH] drm/amdgpu: fix max_entries calculation

2020-09-02 Thread Christian König
Calculate the correct value for max_entries or we might run past the
end of the page_address array.

Signed-off-by: Christian König 
Fixes: 1e691e244487 ("drm/amdgpu: stop allocating dummy GTT nodes")
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 8bc2253939be..8aa9584c184f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1697,7 +1697,8 @@ static int amdgpu_vm_bo_split_mapping(struct 
amdgpu_device *adev,
AMDGPU_GPU_PAGES_IN_CPU_PAGE;
} else {
addr = 0;
-   max_entries = S64_MAX;
+   max_entries = ((mapping->last - mapping->start) >>
+  AMDGPU_GPU_PAGE_SHIFT) + 1;
}
 
if (pages_addr) {
-- 
2.17.1



Re: [PATCH] drm/amdgpu: Fix a redundant kfree

2020-09-02 Thread Daniel Vetter
On Tue, Sep 1, 2020 at 10:35 PM Luben Tuikov  wrote:
>
> On 2020-09-01 10:12 a.m., Alex Deucher wrote:
> > On Tue, Sep 1, 2020 at 3:46 AM Pan, Xinhui  wrote:
> >>
> >> [AMD Official Use Only - Internal Distribution Only]
> >>
> >> drm_dev_alloc() allocs *dev* and sets managed.final_kfree to dev so it
> >> frees itself.
> >> Now, since commit 5cdd68498918 ("drm/amdgpu: Embed drm_device into
> >> amdgpu_device (v3)"), we alloc *adev* and ddev is just a member of it.
> >> So drm_dev_release then tries to free the wrong pointer.
> >>
> >> Also, the driver's release tries to free adev, but drm_dev_release will
> >> access dev after calling the driver's release.
> >>
> >> To fix it, remove the driver's release and set managed.final_kfree to adev.
> >
> > I've got to admit, the documentation around drm_dev_init is hard to
> > follow.  We aren't really using the drmm stuff, but you have to use
> > drmm_add_final_kfree to avoid a warning.  The logic seems to make
> > sense though.

I've just resent the patch which should clarify all this a bit.

And the warning isn't there just for lolz, if you enable KASAN it will
report a use-after-free if you don't set this all up correctly. Note
that drmm_ is already used by drm core code internally for everyone.
-Daniel

> > Acked-by: Alex Deucher 
>
> The logic in patch 3/3 uses the kref infrastructure
> as described in drm_drv.c's comment on what the DRM
> usage is, i.e. taking advantage of the kref infrastructure.
>
> In amdgpu_pci_probe() we call drm_dev_init() which takes
> a ref of 1 on the kref in the DRM device structure,
> and from then on, only when the kref transitions
> from non-zero to 0, do we free the container of
> DRM device, and this is beautifully shown in the
> kernel stack below (please take a look at the kernel
> stack below).
>
> Using a kref is very powerful as it is implicit:
> when the kref transitions from non-zero to 0,
> then call the release method.
>
> Furthermore, we own the release method, and we
> like that, as it is pure; also,
> there may be more things we'd like to do in the future
> before we free the amdgpu device: maybe free memory we're
> using specifically for that PCI device, maybe write
> some registers, maybe notify someone or something, etc.
>
> On another note, adding "drmm_add_final_kfree()" in the middle
> of amdgpu_pci_probe() seems hackish--it's neither part
> of drm_dev_init() nor of drm_dev_register(). We really
> don't need it, since we rely on the kref infrastructure
> to tell us when to free the device, and if you'd look
> at the beautiful stack below, it knows exactly when that is,
> i.e. when to free it.
>
> The correct thing to do this is to
> _leave the amdgpu_driver_release()_ alone,
> remove "drmm_add_final_kfree()" and qualify
> the WARN_ON() in drm_dev_register() by
> the existence of drm_driver.release() (i.e. non-NULL).
>
> I'll submit a sequence of patches to fix this right.
>
> Regards,
> Luben
>
> >
> >>
> >> [   36.269348] BUG: unable to handle page fault for address: 
> >> a0c279940028
> >> [   36.276841] #PF: supervisor read access in kernel mode
> >> [   36.282434] #PF: error_code(0x) - not-present page
> >> [   36.288053] PGD 676601067 P4D 676601067 PUD 86a414067 PMD 86a247067 PTE 
> >> 8008066bf060
> >> [   36.296868] Oops:  [#1] SMP DEBUG_PAGEALLOC NOPTI
> >> [   36.302409] CPU: 4 PID: 1375 Comm: bash Tainted: G   O  
> >> 5.9.0-rc2+ #46
> >> [   36.310670] Hardware name: System manufacturer System Product 
> >> Name/PRIME Z390-A, BIOS 1401 11/26/2019
> >> [   36.320725] RIP: 0010:drm_managed_release+0x25/0x110 [drm]
> >> [   36.326741] Code: 80 00 00 00 00 0f 1f 44 00 00 55 48 c7 c2 5a 9f 41 c0 
> >> be 00 02 00 00 48 89 e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 08 
> >> <48> 8b 7f 18 e8 c2 10 ff ff 4d 8b 74 24 20 49 8d 44 24 5
> >> [   36.347217] RSP: 0018:b9424141fce0 EFLAGS: 00010282
> >> [   36.352931] RAX: 0006 RBX: a0c279940010 RCX: 
> >> 0006
> >> [   36.360718] RDX: c0419f5a RSI: 0200 RDI: 
> >> a0c279940010
> >> [   36.368503] RBP: b9424141fd10 R08: 0001 R09: 
> >> 0001
> >> [   36.376304] R10:  R11:  R12: 
> >> a0c279940010
> >> [   36.384070] R13: c0e2a000 R14: a0c26924e220 R15: 
> >> fff2
> >> [   36.391845] FS:  7fc4a277b740() GS:a0c288e0() 
> >> knlGS:
> >> [   36.400669] CS:  0010 DS:  ES:  CR0: 80050033
> >> [   36.406937] CR2: a0c279940028 CR3: 000792304006 CR4: 
> >> 003706e0
> >> [   36.414732] DR0:  DR1:  DR2: 
> >> 
> >> [   36.422550] DR3:  DR6: fffe0ff0 DR7: 
> >> 0400
> >> [   36.430354] Call Trace:
> >> [   36.433044]  drm_dev_put.part.0+0x40/0x60 [drm]
> >> [   36.438017]  drm_dev_put+0x13/0x20 [drm]
> >> [   36.442398]  amdgpu_pci_remove+0x56/0x60 [amdgpu]
> >> [   

Re: [PATCH 1/1] drm/amdgpu: disable gpu-sched load balance for uvd_enc

2020-09-02 Thread Christian König

On 02.09.20 at 12:15, Nirmoy Das wrote:

On hardware with multiple uvd instances, dependent uvd_enc jobs
may get scheduled to different uvd instances. Because uvd_enc
jobs retain hw context, dependent jobs should always run on the
same uvd instance. This patch disables the GPU scheduler's load
balancer for such contexts, binding all jobs from one context to a
single uvd instance.

Signed-off-by: Nirmoy Das 


Reviewed-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 7cd398d25498..c80d8339f58c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -114,8 +114,10 @@ static int amdgpu_ctx_init_entity(struct amdgpu_ctx *ctx, 
u32 hw_ip,
scheds = adev->gpu_sched[hw_ip][hw_prio].sched;
num_scheds = adev->gpu_sched[hw_ip][hw_prio].num_scheds;

+   /* disable load balance if the hw engine retains context among 
dependent jobs */
if (hw_ip == AMDGPU_HW_IP_VCN_ENC ||
hw_ip == AMDGPU_HW_IP_VCN_DEC ||
+   hw_ip == AMDGPU_HW_IP_UVD_ENC ||
hw_ip == AMDGPU_HW_IP_UVD) {
sched = drm_sched_pick_best(scheds, num_scheds);
scheds = &sched;
--
2.28.0
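For context on the mechanism, a simplified sketch of what follows in
amdgpu_ctx_init_entity() (variable names abridged): passing a one-element
scheduler list pins the entity to a single ring, so the scheduler's load
balancer can never migrate later jobs from this context.

        struct drm_gpu_scheduler *sched;

        sched = drm_sched_pick_best(scheds, num_scheds);
        scheds = &sched;
        num_scheds = 1;

        /* an entity initialized with num_sched_list == 1 stays put */
        r = drm_sched_entity_init(&entity->entity, priority,
                                  scheds, num_scheds, &ctx->guilty);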



Re: [PATCH 8/8] drm/amd/display: Expose modifiers.

2020-09-02 Thread Bas Nieuwenhuizen
On Fri, Aug 7, 2020 at 9:43 PM Marek Olšák  wrote:
>
> On Tue, Aug 4, 2020 at 5:32 PM Bas Nieuwenhuizen  
> wrote:
>>
>> This expose modifier support on GFX9+.
>>
>> Only modifiers that can be rendered on the current GPU are
>> added. This is to reduce the number of modifiers exposed.
>>
>> The HW could expose more, but the best mechanism to decide
>> what to expose without an explosion in modifiers is still
>> to be decided, and in the meantime this should not regress
>> things from pre-modifiers and does not risk regressions as
>> we make up our mind in the future.
>>
>> Signed-off-by: Bas Nieuwenhuizen 
>> ---
>>  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 343 +-
>>  1 file changed, 342 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
>> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> index c38257081868..6594cbe625f9 100644
>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> @@ -3891,6 +3891,340 @@ fill_gfx9_tiling_info_from_modifier(const struct 
>> amdgpu_device *adev,
>> }
>>  }
>>
>> +enum dm_micro_swizzle {
>> +   MICRO_SWIZZLE_Z = 0,
>> +   MICRO_SWIZZLE_S = 1,
>> +   MICRO_SWIZZLE_D = 2,
>> +   MICRO_SWIZZLE_R = 3
>> +};
>> +
>> +static bool dm_plane_format_mod_supported(struct drm_plane *plane,
>> + uint32_t format,
>> + uint64_t modifier)
>> +{
>> +   struct amdgpu_device *adev = plane->dev->dev_private;
>> +   const struct drm_format_info *info = drm_format_info(format);
>> +
>> +   enum dm_micro_swizzle microtile = 
>> modifier_gfx9_swizzle_mode(modifier) & 3;
>> +
>> +   if (!info)
>> +   return false;
>> +
>> +   /*
>> +* We always have to allow this modifier, because core DRM still
>> +* checks LINEAR support if userspace does not provide modifiers.
>> +*/
>> +   if (modifier == DRM_FORMAT_MOD_LINEAR)
>> +   return true;
>> +
>> +   /*
>> +* The arbitrary tiling support for multiplane formats has not been 
>> hooked
>> +* up.
>> +*/
>> +   if (info->num_planes > 1)
>> +   return false;
>> +
>> +   /*
>> +* For D swizzle the canonical modifier depends on the bpp, so check
>> +* it here.
>> +*/
>> +   if (AMD_FMT_MOD_GET(TILE_VERSION, modifier) == 
>> AMD_FMT_MOD_TILE_VER_GFX9 &&
>> +   adev->family >= AMDGPU_FAMILY_NV) {
>> +   if (microtile == MICRO_SWIZZLE_D && info->cpp[0] == 4)
>> +   return false;
>> +   }
>> +
>> +   if (adev->family >= AMDGPU_FAMILY_RV && microtile == MICRO_SWIZZLE_D 
>> &&
>> +   info->cpp[0] < 8)
>> +   return false;
>> +
>> +   if (modifier_has_dcc(modifier)) {
>> +   /* Per radeonsi comments 16/64 bpp are more complicated. */
>> +   if (info->cpp[0] != 4)
>> +   return false;
>> +   }
>> +
>> +   return true;
>> +}
>> +
>> +static void
>> +add_modifier(uint64_t **mods, uint64_t *size, uint64_t *cap, uint64_t mod)
>> +{
>> +   if (!*mods)
>> +   return;
>> +
>> +   if (*cap - *size < 1) {
>> +   uint64_t new_cap = *cap * 2;
>> +   uint64_t *new_mods = kmalloc(new_cap * sizeof(uint64_t), 
>> GFP_KERNEL);
>> +
>> +   if (!new_mods) {
>> +   kfree(*mods);
>> +   *mods = NULL;
>> +   return;
>> +   }
>> +
>> +   memcpy(new_mods, *mods, sizeof(uint64_t) * *size);
>> +   kfree(*mods);
>> +   *mods = new_mods;
>> +   *cap = new_cap;
>> +   }
>> +
>> +   (*mods)[*size] = mod;
>> +   *size += 1;
>> +}
>> +
>> +static void
>> +add_gfx9_modifiers(const struct amdgpu_device *adev,
>> + uint64_t **mods, uint64_t *size, uint64_t *capacity)
>> +{
>> +   int pipes = ilog2(adev->gfx.config.gb_addr_config_fields.num_pipes);
>> +   int pipe_xor_bits = min(8, pipes +
>> +   
>> ilog2(adev->gfx.config.gb_addr_config_fields.num_se));
>> +   int bank_xor_bits = min(8 - pipe_xor_bits,
>> +   
>> ilog2(adev->gfx.config.gb_addr_config_fields.num_banks));
>> +   int rb = ilog2(adev->gfx.config.gb_addr_config_fields.num_se) +
>> +ilog2(adev->gfx.config.gb_addr_config_fields.num_rb_per_se);
>> +
>> +
>> +   if (adev->family == AMDGPU_FAMILY_RV) {
>> +   /*
>> +* No _D DCC swizzles yet because we only allow 32bpp, which
>> +* doesn't support _D on DCN
>> +*/
>> +
>> +   /*
>> +* Always enable constant encoding, because the only unit 
>> that
>> +* didn't support it was CB. But on 

[PATCH 1/1] drm/amdgpu: disable gpu-sched load balance for uvd_enc

2020-09-02 Thread Nirmoy Das
On hardware with multiple uvd instances, dependent uvd_enc jobs
may get scheduled to different uvd instances. Because uvd_enc
jobs retain hw context, dependent jobs should always run on the
same uvd instance. This patch disables the GPU scheduler's load
balancer for such contexts, binding all jobs from one context to a
single uvd instance.

Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 7cd398d25498..c80d8339f58c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -114,8 +114,10 @@ static int amdgpu_ctx_init_entity(struct amdgpu_ctx *ctx, 
u32 hw_ip,
scheds = adev->gpu_sched[hw_ip][hw_prio].sched;
num_scheds = adev->gpu_sched[hw_ip][hw_prio].num_scheds;

+   /* disable load balance if the hw engine retains context among 
dependent jobs */
if (hw_ip == AMDGPU_HW_IP_VCN_ENC ||
hw_ip == AMDGPU_HW_IP_VCN_DEC ||
+   hw_ip == AMDGPU_HW_IP_UVD_ENC ||
hw_ip == AMDGPU_HW_IP_UVD) {
sched = drm_sched_pick_best(scheds, num_scheds);
scheds = &sched;
--
2.28.0



Re: [PATCH 1/1] drm/amdgpu: disable gpu-sched load balance for uvd_enc

2020-09-02 Thread Nirmoy

Please ignore this.

On 9/2/20 12:08 PM, Nirmoy Das wrote:

On hardware with multiple uvd instances, dependent uvd_enc jobs
may get scheduled to different uvd instances. Because uvd_enc
jobs retain hw context, dependent jobs should always run on the
same uvd instance. This patch disables GPU scheduler's load balancer
for a context that binds jobs from same the context to a uvd
instance.



from same the context --> from the same context



Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 7cd398d25498..c80d8339f58c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -114,8 +114,10 @@ static int amdgpu_ctx_init_entity(struct amdgpu_ctx *ctx, 
u32 hw_ip,
scheds = adev->gpu_sched[hw_ip][hw_prio].sched;
num_scheds = adev->gpu_sched[hw_ip][hw_prio].num_scheds;

+   /* disable load balance if the hw engine retains context among 
dependent jobs */
if (hw_ip == AMDGPU_HW_IP_VCN_ENC ||
hw_ip == AMDGPU_HW_IP_VCN_DEC ||
+   hw_ip == AMDGPU_HW_IP_UVD_ENC ||
hw_ip == AMDGPU_HW_IP_UVD) {
sched = drm_sched_pick_best(scheds, num_scheds);
scheds = &sched;
--
2.28.0




[PATCH 1/1] drm/amdgpu: disable gpu-sched load balance for uvd_enc

2020-09-02 Thread Nirmoy Das
On hardware with multiple uvd instances, dependent uvd_enc jobs
may get scheduled to different uvd instances. Because uvd_enc
jobs retain hw context, dependent jobs should always run on the
same uvd instance. This patch disables GPU scheduler's load balancer
for a context that binds jobs from the same context to a uvd
instance.

Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 7cd398d25498..c80d8339f58c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -114,8 +114,10 @@ static int amdgpu_ctx_init_entity(struct amdgpu_ctx *ctx, 
u32 hw_ip,
scheds = adev->gpu_sched[hw_ip][hw_prio].sched;
num_scheds = adev->gpu_sched[hw_ip][hw_prio].num_scheds;

+   /* disable load balance if the hw engine retains context among 
dependent jobs */
if (hw_ip == AMDGPU_HW_IP_VCN_ENC ||
hw_ip == AMDGPU_HW_IP_VCN_DEC ||
+   hw_ip == AMDGPU_HW_IP_UVD_ENC ||
hw_ip == AMDGPU_HW_IP_UVD) {
sched = drm_sched_pick_best(scheds, num_scheds);
scheds = &sched;
--
2.28.0



Re: [PATCH 1/2] Revert "drm/amdgpu: disable gpu-sched load balance for uvd"

2020-09-02 Thread Nirmoy


On 9/2/20 8:55 AM, Christian König wrote:

Am 01.09.20 um 21:49 schrieb Nirmoy Das:

This reverts commit e0300ed8820d19fe108006cf1b69fa26f0b4e3fc.

We should also disable load balance for AMDGPU_HW_IP_UVD_ENC jobs.


Well revert and re-apply is usually not the best option. Just provide 
a delta patch and Alex might decide to squash it into the original one 
during upstreaming.



I wasn't sure how to handle that. Thanks, I will send a delta patch.


Regards,

Nirmoy




Christian.



Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 4 +---
  1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c

index 7cd398d25498..59032c26fc82 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -114,9 +114,7 @@ static int amdgpu_ctx_init_entity(struct 
amdgpu_ctx *ctx, u32 hw_ip,

  scheds = adev->gpu_sched[hw_ip][hw_prio].sched;
  num_scheds = adev->gpu_sched[hw_ip][hw_prio].num_scheds;
  -    if (hw_ip == AMDGPU_HW_IP_VCN_ENC ||
-    hw_ip == AMDGPU_HW_IP_VCN_DEC ||
-    hw_ip == AMDGPU_HW_IP_UVD) {
+    if (hw_ip == AMDGPU_HW_IP_VCN_ENC || hw_ip == 
AMDGPU_HW_IP_VCN_DEC) {

  sched = drm_sched_pick_best(scheds, num_scheds);
  scheds = &sched;
  num_scheds = 1;





[PATCH] drm/amdkfd: fix a memory leak issue

2020-09-02 Thread Dennis Li
In the resume stage of GPU recovery, start_cpsch will call pm_init,
which sets pm->allocated to false, so the next pm_release_ib has
no chance to release the ib memory.

Add pm_release_ib to stop_cpsch, which will be called in the suspend
stage of GPU recovery.

Signed-off-by: Dennis Li 

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 069ba4be1e8f..20ef048d6a03 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -1192,6 +1192,8 @@ static int stop_cpsch(struct device_queue_manager *dqm)
dqm->sched_running = false;
dqm_unlock(dqm);
 
+   pm_release_ib(&dqm->packets);
+
kfd_gtt_sa_free(dqm->dev, dqm->fence_mem);
pm_uninit(&dqm->packets, hanging);
 
-- 
2.17.1
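
The leak pattern described above is easy to reproduce in miniature. The
following self-contained C sketch (hypothetical, heavily simplified types)
shows how re-running an init that clears the 'allocated' flag without
freeing turns the later release into a no-op:

#include <stdlib.h>
#include <stdbool.h>

struct pm { void *ib; bool allocated; };

/* Hypothetical stand-in for pm_init(): clears the flag, never frees. */
static void pm_init(struct pm *pm)     { pm->allocated = false; }
static void pm_alloc_ib(struct pm *pm) { pm->ib = malloc(64); pm->allocated = true; }
static void pm_release_ib(struct pm *pm)
{
    if (pm->allocated) { free(pm->ib); pm->allocated = false; }
}

int main(void)
{
    struct pm pm = { 0 };
    pm_init(&pm);
    pm_alloc_ib(&pm);
    pm_init(&pm);       /* resume path: flag cleared, ib still allocated */
    pm_release_ib(&pm); /* no-op now, so the ib leaks; the fix releases first */
    return 0;
}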



Re: [PATCH] drm/amdgpu/dc: Require primary plane to be enabled whenever the CRTC is

2020-09-02 Thread Michel Dänzer
On 2020-09-02 9:02 a.m., Daniel Vetter wrote:
> On Tue, Sep 01, 2020 at 09:58:43AM -0400, Harry Wentland wrote:
>> On 2020-09-01 3:54 a.m., Daniel Vetter wrote:
>>> On Wed, Aug 26, 2020 at 11:24:23AM +0300, Pekka Paalanen wrote:
 On Tue, 25 Aug 2020 12:58:19 -0400
 "Kazlauskas, Nicholas"  wrote:

> On 2020-08-22 5:59 a.m., Michel Dänzer wrote:
>>
>> It's a "pick your poison" situation:
>>
>> 1) Currently the checks are invalid (atomic_check must not decide based
>> on drm_crtc_state::active), and it's easy for legacy KMS userspace to
>> accidentally hit errors trying to enable/move the cursor or switch DPMS
>> off → on.
>>
>> 2) Accurately rejecting only atomic states where the cursor plane is
>> enabled but all other planes are off would break the KMS helper code,
>> which can only deal with the "CRTC on & primary plane off is not
>> allowed" case specifically.
>>
>> 3) This patch addresses 1) & 2) but may break existing atomic userspace
>> which wants to enable an overlay plane while disabling the primary plane.
>>
>>
>> I do think in principle atomic userspace is expected to handle case 3)
>> and leave the primary plane enabled. However, this is not ideal from an
>> energy consumption PoV. Therefore, here's another idea for a possible
>> way out of this quagmire:
>>
>> amdgpu_dm does not reject any atomic states based on which planes are
>> enabled in it. If the cursor plane is enabled but all other planes are
>> off, amdgpu_dm internally either:
>>
>> a) Enables an overlay plane and makes it invisible, e.g. by assigning a
>> minimum size FB with alpha = 0.
>>
>> b) Enables the primary plane and assigns a minimum size FB (scaled up to
>> the required size) containing all black, possibly using compression.
>> (Trying to minimize the memory bandwidth)
>>
>>
>> Does either of these seem feasible? If both do, which one would be
>> preferable?
>>
>>   
>
> It's really the same solution since DCN doesn't make a distinction 
> between primary or overlay planes in hardware. DCE doesn't have overlay 
> planes enabled so this is not relevant there.
>
> The old behavior (pre 5.1?) was to silently accept the commit even 
> though the screen would be completely black instead of outright 
> rejecting the commit.
>
> I almost wonder if that makes more sense in the short term here since 
> the only "userspace" affected here is IGT. We'll fail the CRC checks, 
> but no userspace actually tries to actively use a cursor with no primary 
> plane enabled from my understanding.

 Hi,

 I believe that there exists userspace that will *accidentally* attempt
 to update the cursor plane while primary plane or whole CRTC is off.
 Some versions of Mutter might do that on racy conditions, I suspect.
 These are legacy KMS users, not atomic KMS.

 However, I do not believe there exists any userspace that would
 actually expect the display to show the cursor plane alone without a
 primary plane. Therefore I'd be ok with legacy cursor ioctls silently
 succeeding. Atomic commits not. So the difference has to be in the
 translation from legacy UAPI to kernel internal atomic interface.

> In the long term I think we can work on getting cursor actually on the 
> screen in this case, though I can't say I really like having to reserve 
> some small buffer (eg. 16x16) for allowing lightup on this corner case.

 Why would you bother implementing that?

 Is there really an IGT test that unconditionally demands cursor plane
 to be usable without any other planes?
>>>
>>> The cursor plane isn't anything else than any other plane, aside from the
>>> legacy uapi implication that it's used for the legacy cursor ioctls.
>>>
>>> Which means the cursor plane could actually be a full-featured plane, and
>>> it's totally legit to use just that without anything else enabled.
>>>
>>> So yeah if you allow that, it better show something :-)
>>>
>>> Personally I'd lean towards merging this patch to close the gap (oldest
>>> regressions wins and all that) and then implement the black plane hack on
>>> top.
>>
>> Not sure I'm a big fan of the black plane hack. Is there any way we
>> could allow the (non-displayed) cursor for the legacy IOCTL but not for
>> the atomic IOCTL? I assume that would require a change to core code in
>> the atomic helpers that convert legacy IOCTLs to atomic for drivers.
> 
> That's the "just dont show the cursor when it's not possible" hack, which
> is also rather iffy imo.
> 
> The other side is that this is all kinda uapi, or at least we've spent a
> lot of attempts trying to needle all this through rmfb and cursor ioctls,
> and I'm not sure what exactly you can change without breaking something.
> Yeah it's not helper stuff as 

[PATCH 1/9] drm/amd/pm: wrapper for postponing some setup job after DAL initialization (V2)

2020-09-02 Thread Evan Quan
So that ASIC-specific actions can be added.

V2: better namings

Change-Id: Iabc9241d3e10ece9cd54d8cdb3ae8c8b831c7bce
Signed-off-by: Evan Quan 
---
 drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h | 1 +
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c   | 6 ++
 drivers/gpu/drm/amd/pm/swsmu/smu_internal.h | 1 +
 3 files changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index d22a759b6b43..4acc3c4c4737 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -600,6 +600,7 @@ struct pptable_funcs {
int (*gfx_ulv_control)(struct smu_context *smu, bool enablement);
int (*deep_sleep_control)(struct smu_context *smu, bool enablement);
int (*get_fan_parameters)(struct smu_context *smu);
+   int (*post_init)(struct smu_context *smu);
 };
 
 typedef enum {
diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c 
b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index 7a55ece1f124..8d7c75c51fe5 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -473,6 +473,12 @@ static int smu_late_init(void *handle)
if (!smu->pm_enabled)
return 0;
 
+   ret = smu_post_init(smu);
+   if (ret) {
+   dev_err(adev->dev, "Failed to post smu init!\n");
+   return ret;
+   }
+
ret = smu_set_default_od_settings(smu);
if (ret) {
dev_err(adev->dev, "Failed to setup default OD settings!\n");
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_internal.h 
b/drivers/gpu/drm/amd/pm/swsmu/smu_internal.h
index 38c10177ed21..db903889f6a7 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_internal.h
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_internal.h
@@ -95,6 +95,7 @@
 #define smu_gfx_ulv_control(smu, enablement)   
smu_ppt_funcs(gfx_ulv_control, 0, smu, enablement)
 #define smu_deep_sleep_control(smu, enablement)
smu_ppt_funcs(deep_sleep_control, 0, smu, enablement)
 #define smu_get_fan_parameters(smu)
smu_ppt_funcs(get_fan_parameters, 0, smu)
+#define smu_post_init(smu) 
smu_ppt_funcs(post_init, 0, smu)
 
 #endif
 #endif
-- 
2.28.0
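
The smu_post_init() macro added in smu_internal.h expands to the usual
optional-callback pattern: invoke the per-ASIC hook if the pptable provides
one, otherwise fall back to a default return value. A standalone C sketch of
that pattern follows (illustrative names, not the real macro expansion):

#include <stdio.h>

struct smu;
struct pptable_funcs { int (*post_init)(struct smu *smu); };
struct smu { const struct pptable_funcs *funcs; };

/* Call the per-ASIC hook if present, otherwise return the default (0). */
static int smu_post_init(struct smu *smu)
{
    return smu->funcs->post_init ? smu->funcs->post_init(smu) : 0;
}

static int navi10_post_init(struct smu *smu)
{
    (void)smu;
    puts("navi10 post init");
    return 0;
}

int main(void)
{
    const struct pptable_funcs navi10 = { .post_init = navi10_post_init };
    const struct pptable_funcs other  = { 0 };
    struct smu a = { &navi10 }, b = { &other };
    smu_post_init(&a);        /* runs the ASIC hook */
    return smu_post_init(&b); /* no hook: default 0 */
}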



[PATCH 8/9] drm/amd/pm: correct the requirement for umc cdr workaround

2020-09-02 Thread Evan Quan
The workaround can be applied only with UCLK DPM enabled.
Also expand the workaround to more Navi10 SKUs and to
Navi14.

Change-Id: I8be4256079f81e292b39bcf43b4a84db82aa069b
Signed-off-by: Evan Quan 
---
 .../gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c   | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
index e02d036fb298..801c92eb439f 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
@@ -2185,19 +2185,18 @@ static int navi10_run_btc(struct smu_context *smu)
return ret;
 }
 
-static inline bool navi10_need_umc_cdr_12gbps_workaround(struct amdgpu_device 
*adev)
+static bool navi10_need_umc_cdr_12gbps_workaround(struct smu_context *smu)
 {
-   if (adev->asic_type != CHIP_NAVI10)
+   struct amdgpu_device *adev = smu->adev;
+
+   if (!smu_cmn_feature_is_enabled(smu, SMU_FEATURE_DPM_UCLK_BIT))
return false;
 
-   if (adev->pdev->device == 0x731f &&
-   (adev->pdev->revision == 0xc2 ||
-adev->pdev->revision == 0xc3 ||
-adev->pdev->revision == 0xca ||
-adev->pdev->revision == 0xcb))
+   if (adev->asic_type == CHIP_NAVI10 ||
+   adev->asic_type == CHIP_NAVI14)
return true;
-   else
-   return false;
+
+   return false;
 }
 
 static int navi10_umc_hybrid_cdr_workaround(struct smu_context *smu)
@@ -2285,7 +2284,7 @@ static int 
navi10_disable_umc_cdr_12gbps_workaround(struct smu_context *smu)
uint32_t param;
int ret = 0;
 
-   if (!navi10_need_umc_cdr_12gbps_workaround(adev))
+   if (!navi10_need_umc_cdr_12gbps_workaround(smu))
return 0;
 
ret = smu_cmn_send_smc_msg_with_param(smu,
-- 
2.28.0



[PATCH 3/9] drm/amd/pm: put Navi1X umc cdr workaround in post_smu_init

2020-09-02 Thread Evan Quan
That's where the uclk dpm gets enabled, and the
uclk cdr workaround can then be applied.

Change-Id: I520ae0fbc1c3be68324377c7d8c6dc4a346d3a57
Signed-off-by: Evan Quan 
---
 drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h|  1 -
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  |  6 --
 .../gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c| 18 ++
 drivers/gpu/drm/amd/pm/swsmu/smu_internal.h|  1 -
 4 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index 4acc3c4c4737..701a94d4b9f6 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -590,7 +590,6 @@ struct pptable_funcs {
int (*mode2_reset)(struct smu_context *smu);
int (*get_dpm_ultimate_freq)(struct smu_context *smu, enum smu_clk_type 
clk_type, uint32_t *min, uint32_t *max);
int (*set_soft_freq_limited_range)(struct smu_context *smu, enum 
smu_clk_type clk_type, uint32_t min, uint32_t max);
-   int (*disable_umc_cdr_12gbps_workaround)(struct smu_context *smu);
int (*set_power_source)(struct smu_context *smu, enum 
smu_power_src_type power_src);
void (*log_thermal_throttling_event)(struct smu_context *smu);
size_t (*get_pp_feature_mask)(struct smu_context *smu, char *buf);
diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c 
b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index 8d7c75c51fe5..a9c0c20efddb 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -975,12 +975,6 @@ static int smu_smc_hw_setup(struct smu_context *smu)
return ret;
}
 
-   ret = smu_disable_umc_cdr_12gbps_workaround(smu);
-   if (ret) {
-   dev_err(adev->dev, "Workaround failed to disable UMC CDR 
feature on 12Gbps SKU!\n");
-   return ret;
-   }
-
/*
 * For Navi1X, manually switch it to AC mode as PMFW
 * may boot it with DC mode.
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
index 8180b7f1..6674f3abd457 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
@@ -2568,6 +2568,7 @@ static int navi10_post_smu_init(struct smu_context *smu)
struct smu_feature *feature = &smu->smu_feature;
struct amdgpu_device *adev = smu->adev;
uint64_t feature_mask = 0;
+   int ret = 0;
 
/* For Navi1x, enable these features only after DAL initialization */
if (adev->pm.pp_feature & PP_SOCCLK_DPM_MASK)
@@ -2590,9 +2591,19 @@ static int navi10_post_smu_init(struct smu_context *smu)
  (unsigned long *)(&feature_mask),
  SMU_FEATURE_MAX);
 
-   return smu_cmn_feature_update_enable_state(smu,
-  feature_mask,
-  true);
+   ret = smu_cmn_feature_update_enable_state(smu,
+ feature_mask,
+ true);
+   if (ret) {
+   dev_err(adev->dev, "Failed to post uclk/socclk dpm 
enablement!\n");
+   return ret;
+   }
+
+   ret = navi10_disable_umc_cdr_12gbps_workaround(smu);
+   if (ret)
+   dev_err(adev->dev, "Failed to apply umc cdr workaround!\n");
+
+   return ret;
 }
 
 static const struct pptable_funcs navi10_ppt_funcs = {
@@ -2669,7 +2680,6 @@ static const struct pptable_funcs navi10_ppt_funcs = {
.set_default_od_settings = navi10_set_default_od_settings,
.od_edit_dpm_table = navi10_od_edit_dpm_table,
.run_btc = navi10_run_btc,
-   .disable_umc_cdr_12gbps_workaround = 
navi10_disable_umc_cdr_12gbps_workaround,
.set_power_source = smu_v11_0_set_power_source,
.get_pp_feature_mask = smu_cmn_get_pp_feature_mask,
.set_pp_feature_mask = smu_cmn_set_pp_feature_mask,
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_internal.h 
b/drivers/gpu/drm/amd/pm/swsmu/smu_internal.h
index db903889f6a7..521b805c920e 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_internal.h
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_internal.h
@@ -83,7 +83,6 @@
 #define smu_asic_set_performance_level(smu, level) 
smu_ppt_funcs(set_performance_level, -EINVAL, smu, level)
 #define smu_dump_pptable(smu)  
smu_ppt_funcs(dump_pptable, 0, smu)
 #define smu_update_pcie_parameters(smu, pcie_gen_cap, pcie_width_cap)  
smu_ppt_funcs(update_pcie_parameters, 0, smu, pcie_gen_cap, pcie_width_cap)
-#define smu_disable_umc_cdr_12gbps_workaround(smu) 
smu_ppt_funcs(disable_umc_cdr_12gbps_workaround, 0, smu)
 #define smu_set_power_source(smu, power_src)   
smu_ppt_funcs(set_power_source, 0, smu, power_src)
 #define smu_i2c_init(smu, control)  

[PATCH 2/9] drm/amd/pm: postpone SOCCLK/UCLK enablement after DAL initialization (V2)

2020-09-02 Thread Evan Quan
This is needed for Navi1X only. It may help with the display-missing
or hang issues seen on some high-resolution monitors.

V2: no UCLK DPM enablement for Navi10 A0 secure SKU

Change-Id: Id3965a638c2a238d52cf074f2111dc4bf2244a3e
Signed-off-by: Evan Quan 
---
 .../gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c   | 60 ---
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c|  6 +-
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h|  4 ++
 3 files changed, 46 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
index 42d53cca7360..8180b7f1 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
@@ -279,9 +279,6 @@ navi10_get_allowed_feature_mask(struct smu_context *smu,
| FEATURE_MASK(FEATURE_FW_CTF_BIT)
| FEATURE_MASK(FEATURE_OUT_OF_BAND_MONITOR_BIT);
 
-   if (adev->pm.pp_feature & PP_SOCCLK_DPM_MASK)
-   *(uint64_t *)feature_mask |= 
FEATURE_MASK(FEATURE_DPM_SOCCLK_BIT);
-
if (adev->pm.pp_feature & PP_SCLK_DPM_MASK)
*(uint64_t *)feature_mask |= 
FEATURE_MASK(FEATURE_DPM_GFXCLK_BIT);
 
@@ -291,11 +288,6 @@ navi10_get_allowed_feature_mask(struct smu_context *smu,
if (adev->pm.pp_feature & PP_DCEFCLK_DPM_MASK)
*(uint64_t *)feature_mask |= 
FEATURE_MASK(FEATURE_DPM_DCEFCLK_BIT);
 
-   if (adev->pm.pp_feature & PP_MCLK_DPM_MASK)
-   *(uint64_t *)feature_mask |= FEATURE_MASK(FEATURE_DPM_UCLK_BIT)
-   | FEATURE_MASK(FEATURE_MEM_VDDCI_SCALING_BIT)
-   | FEATURE_MASK(FEATURE_MEM_MVDD_SCALING_BIT);
-
if (adev->pm.pp_feature & PP_ULV_MASK)
*(uint64_t *)feature_mask |= FEATURE_MASK(FEATURE_GFX_ULV_BIT);
 
@@ -320,19 +312,12 @@ navi10_get_allowed_feature_mask(struct smu_context *smu,
if (smu->dc_controlled_by_gpio)
*(uint64_t *)feature_mask |= FEATURE_MASK(FEATURE_ACDC_BIT);
 
-   /* disable DPM UCLK and DS SOCCLK on navi10 A0 secure board */
-   if (is_asic_secure(smu)) {
-   /* only for navi10 A0 */
-   if ((adev->asic_type == CHIP_NAVI10) &&
-   (adev->rev_id == 0)) {
-   *(uint64_t *)feature_mask &=
-   ~(FEATURE_MASK(FEATURE_DPM_UCLK_BIT)
- | 
FEATURE_MASK(FEATURE_MEM_VDDCI_SCALING_BIT)
- | 
FEATURE_MASK(FEATURE_MEM_MVDD_SCALING_BIT));
-   *(uint64_t *)feature_mask &=
-   ~FEATURE_MASK(FEATURE_DS_SOCCLK_BIT);
-   }
-   }
+   /* DS SOCCLK enablement should be skipped for navi10 A0 secure board */
+   if (is_asic_secure(smu) &&
+   (adev->asic_type == CHIP_NAVI10) &&
+   (adev->rev_id == 0))
+   *(uint64_t *)feature_mask &=
+   ~FEATURE_MASK(FEATURE_DS_SOCCLK_BIT);
 
return 0;
 }
@@ -2578,6 +2563,38 @@ static int navi10_enable_mgpu_fan_boost(struct 
smu_context *smu)
   NULL);
 }
 
+static int navi10_post_smu_init(struct smu_context *smu)
+{
+   struct smu_feature *feature = &smu->smu_feature;
+   struct amdgpu_device *adev = smu->adev;
+   uint64_t feature_mask = 0;
+
+   /* For Navi1x, enable these features only after DAL initialization */
+   if (adev->pm.pp_feature & PP_SOCCLK_DPM_MASK)
+   feature_mask |= FEATURE_MASK(FEATURE_DPM_SOCCLK_BIT);
+
+   /* DPM UCLK enablement should be skipped for navi10 A0 secure board */
+   if (!(is_asic_secure(smu) &&
+(adev->asic_type == CHIP_NAVI10) &&
+(adev->rev_id == 0)) &&
+   (adev->pm.pp_feature & PP_MCLK_DPM_MASK))
+   feature_mask |= FEATURE_MASK(FEATURE_DPM_UCLK_BIT)
+   | FEATURE_MASK(FEATURE_MEM_VDDCI_SCALING_BIT)
+   | FEATURE_MASK(FEATURE_MEM_MVDD_SCALING_BIT);
+
+   if (!feature_mask)
+   return 0;
+
+   bitmap_or(feature->allowed,
+ feature->allowed,
+ (unsigned long *)(&feature_mask),
+ SMU_FEATURE_MAX);
+
+   return smu_cmn_feature_update_enable_state(smu,
+  feature_mask,
+  true);
+}
+
 static const struct pptable_funcs navi10_ppt_funcs = {
.get_allowed_feature_mask = navi10_get_allowed_feature_mask,
.set_default_dpm_table = navi10_set_default_dpm_table,
@@ -2661,6 +2678,7 @@ static const struct pptable_funcs navi10_ppt_funcs = {
.gfx_ulv_control = smu_v11_0_gfx_ulv_control,
.deep_sleep_control = smu_v11_0_deep_sleep_control,
.get_fan_parameters = 
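
The feature-mask bookkeeping in navi10_post_smu_init() shown earlier in this
patch boils down to OR-ing FEATURE_MASK() bits into a 64-bit word and handing
that to the enable-state update. A standalone C sketch with illustrative bit
positions (the real SMU feature numbering differs):

#include <stdint.h>
#include <stdio.h>

/* Illustrative bit positions; the real SMU feature numbering differs. */
#define FEATURE_DPM_SOCCLK_BIT 1
#define FEATURE_DPM_UCLK_BIT   2
#define FEATURE_MASK(bit)      (1ULL << (bit))

int main(void)
{
    uint64_t mask = 0;
    int socclk_dpm = 1, mclk_dpm = 1;

    if (socclk_dpm)
        mask |= FEATURE_MASK(FEATURE_DPM_SOCCLK_BIT);
    if (mclk_dpm)
        mask |= FEATURE_MASK(FEATURE_DPM_UCLK_BIT);

    if (!mask)
        return 0; /* nothing to enable, as in navi10_post_smu_init() */

    printf("enable features: 0x%llx\n", (unsigned long long)mask);
    return 0;
}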

[PATCH 5/9] drm/amd/pm: allocate a new buffer for pstate dummy reading

2020-09-02 Thread Evan Quan
This dummy reading buffer will be used for the new Navi1x
UMC CDR workaround.

Change-Id: Ida41374c0ea156527a1bf1104c7b2b909e562f7a
Signed-off-by: Evan Quan 
---
 drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  1 +
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 45 +++
 2 files changed, 46 insertions(+)

diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index 701a94d4b9f6..29e041d86ae5 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -270,6 +270,7 @@ struct smu_table_context
 */
struct smu_table driver_table;
struct smu_table memory_pool;
+   struct smu_table dummy_read_1_table;
uint8_t thermal_controller_type;
 
void *overdrive_table;
diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c 
b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index a9c0c20efddb..dab272721037 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -663,6 +663,45 @@ static int smu_free_memory_pool(struct smu_context *smu)
return 0;
 }
 
+static int smu_alloc_dummy_read_table(struct smu_context *smu)
+{
+   struct smu_table_context *smu_table = &smu->smu_table;
+   struct smu_table *dummy_read_1_table =
+   &smu_table->dummy_read_1_table;
+   struct amdgpu_device *adev = smu->adev;
+   int ret = 0;
+
+   dummy_read_1_table->size = 0x4;
+   dummy_read_1_table->align = PAGE_SIZE;
+   dummy_read_1_table->domain = AMDGPU_GEM_DOMAIN_VRAM;
+
+   ret = amdgpu_bo_create_kernel(adev,
+ dummy_read_1_table->size,
+ dummy_read_1_table->align,
+ dummy_read_1_table->domain,
+ &dummy_read_1_table->bo,
+ &dummy_read_1_table->mc_address,
+ &dummy_read_1_table->cpu_addr);
+   if (ret)
+   dev_err(adev->dev, "VRAM allocation for dummy read table 
failed!\n");
+
+   return ret;
+}
+
+static void smu_free_dummy_read_table(struct smu_context *smu)
+{
+   struct smu_table_context *smu_table = &smu->smu_table;
+   struct smu_table *dummy_read_1_table =
+   &smu_table->dummy_read_1_table;
+
+
+   amdgpu_bo_free_kernel(&dummy_read_1_table->bo,
+ &dummy_read_1_table->mc_address,
+ &dummy_read_1_table->cpu_addr);
+
+   memset(dummy_read_1_table, 0, sizeof(struct smu_table));
+}
+
 static int smu_smc_table_sw_init(struct smu_context *smu)
 {
int ret;
@@ -698,6 +737,10 @@ static int smu_smc_table_sw_init(struct smu_context *smu)
if (ret)
return ret;
 
+   ret = smu_alloc_dummy_read_table(smu);
+   if (ret)
+   return ret;
+
ret = smu_i2c_init(smu, &smu->adev->pm.smu_i2c);
if (ret)
return ret;
@@ -711,6 +754,8 @@ static int smu_smc_table_sw_fini(struct smu_context *smu)
 
smu_i2c_fini(smu, &smu->adev->pm.smu_i2c);
 
+   smu_free_dummy_read_table(smu);
+
ret = smu_free_memory_pool(smu);
if (ret)
return ret;
-- 
2.28.0
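
A standalone sketch of the alloc/free pairing this patch introduces, with
userspace stand-ins for amdgpu_bo_create_kernel()/amdgpu_bo_free_kernel();
zeroing the descriptor after free, as the patch does, guards against stale
pointers on a later suspend/resume cycle:

#include <stdlib.h>
#include <string.h>

struct smu_table { size_t size; void *cpu_addr; };

/* Userspace stand-in for amdgpu_bo_create_kernel(). */
static int alloc_dummy_read_table(struct smu_table *t)
{
    t->size = 0x4000; /* illustrative; the patch allocates a VRAM BO */
    t->cpu_addr = malloc(t->size);
    return t->cpu_addr ? 0 : -1;
}

/* Userspace stand-in for amdgpu_bo_free_kernel() plus the memset. */
static void free_dummy_read_table(struct smu_table *t)
{
    free(t->cpu_addr);
    memset(t, 0, sizeof(*t)); /* no stale pointers on the next sw_init */
}

int main(void)
{
    struct smu_table t = { 0 };
    if (alloc_dummy_read_table(&t))
        return 1;
    free_dummy_read_table(&t);
    return 0;
}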



[PATCH 6/9] drm/amd/pm: implement a new umc cdr workaround

2020-09-02 Thread Evan Quan
By uploading dummy pstate tables.

Change-Id: I9f52f965d23cae46b4a4eeab7790183e5d09bf27
Signed-off-by: Evan Quan 
---
 .../gpu/drm/amd/pm/inc/smu_11_0_cdr_table.h   | 194 ++
 drivers/gpu/drm/amd/pm/inc/smu_types.h|   2 +
 drivers/gpu/drm/amd/pm/inc/smu_v11_0_ppsmc.h  |   5 +-
 .../gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c   |  34 +++
 4 files changed, 234 insertions(+), 1 deletion(-)
 create mode 100644 drivers/gpu/drm/amd/pm/inc/smu_11_0_cdr_table.h

diff --git a/drivers/gpu/drm/amd/pm/inc/smu_11_0_cdr_table.h 
b/drivers/gpu/drm/amd/pm/inc/smu_11_0_cdr_table.h
new file mode 100644
index ..beab6d7b28b7
--- /dev/null
+++ b/drivers/gpu/drm/amd/pm/inc/smu_11_0_cdr_table.h
@@ -0,0 +1,194 @@
+/*
+ * Copyright 2020 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+
+
+#ifndef SMU_11_0_CDR_TABLE
+#define SMU_11_0_CDR_TABLE
+
+
+#pragma pack(push, 1)
+
+/// CDR table : PRBS sequence for DQ toggles
+
+/*static unsigned int NoDbiPrbs7[] =
+{
+//256 bytes, 256 byte aligned
+0x0f0f0f0f, 0x0f0f0f0f, 0x0f0f0f0f, 0xf0f00f0f, 0x0f0f0f0f, 0x0f0f0f0f, 
0xf0f0f0f0, 0x0f0f0f0f, 0x0f0f0f0f, 0xf0f00f0f, 0xf0f00f0f, 0x0f0f0f0f, 
0xf0f0f0f0, 0xf0f0f0f0, 0x0f0f0f0f, 0xf0f00f0f,
+0x0f0f0f0f, 0xf0f00f0f, 0xf0f0f0f0, 0x0f0f0f0f, 0xf0f0f0f0, 0xf0f00f0f, 
0xf0f00f0f, 0xf0f00f0f, 0x0f0ff0f0, 0xf0f0f0f0, 0xf0f0f0f0, 0x0f0ff0f0, 
0x0f0f0f0f, 0x0f0f0f0f, 0xf0f0f0f0, 0xf0f00f0f,
+0x0f0f0f0f, 0xf0f00f0f, 0x0f0ff0f0, 0x0f0f0f0f, 0xf0f0f0f0, 0x0f0ff0f0, 
0xf0f00f0f, 0xf0f00f0f, 0xf0f0f0f0, 0x0f0ff0f0, 0xf0f0f0f0, 0xf0f00f0f, 
0xf0f0f0f0, 0x0f0f0f0f, 0x0f0ff0f0, 0xf0f00f0f,
+0xf0f00f0f, 0x0f0ff0f0, 0x0f0ff0f0, 0xf0f0f0f0, 0x0f0ff0f0, 0xf0f0f0f0, 
0x0f0f0f0f, 0xf0f0f0f0, 0x0f0f0f0f, 0xf0f00f0f, 0xf0f00f0f, 0xf0f00f0f, 
0xf0f0f0f0, 0xf0f0f0f0, 0xf0f0f0f0, 0xf0f0,
+};
+
+
+static unsigned int DbiPrbs7[] =
+{
+// 256 bytes, 256 byte aligned
+0x, 0x, 0x, 0x, 0x, 0x, 
0x, 0x, 0x, 0x, 0x, 0x, 
0x, 0x, 0x, 0x,
+0x, 0x, 0x, 0x, 0x, 0x, 
0x, 0x, 0x, 0x, 0x, 0x, 
0x, 0x, 0x, 0x,
+0x, 0x, 0x, 0x, 0x, 0x, 
0x, 0x, 0x, 0x, 0x, 0x, 
0x, 0x, 0x, 0x,
+0x, 0x, 0x, 0x, 0x, 0x, 
0x, 0x, 0x, 0x, 0x, 0x, 
0x, 0x, 0x, 0x,
+};
+*/
+
+
+//4096 bytes, 256 byte aligned
+static unsigned int NoDbiPrbs7[] =
+{
+0x0f0f0f0f, 0x0f0f0f0f, 0x0f0f0f0f, 0xf0f00f0f, 0x0f0f0f0f, 0x0f0f0f0f, 
0xf0f0f0f0, 0x0f0f0f0f, 0x0f0f0f0f, 0xf0f00f0f, 0xf0f00f0f, 0x0f0f0f0f, 
0xf0f0f0f0, 0xf0f0f0f0, 0x0f0f0f0f, 0xf0f00f0f,
+0x0f0f0f0f, 0xf0f00f0f, 0xf0f0f0f0, 0x0f0f0f0f, 0xf0f0f0f0, 0xf0f00f0f, 
0xf0f00f0f, 0xf0f00f0f, 0x0f0ff0f0, 0xf0f0f0f0, 0xf0f0f0f0, 0x0f0ff0f0, 
0x0f0f0f0f, 0x0f0f0f0f, 0xf0f0f0f0, 0xf0f00f0f,
+0x0f0f0f0f, 0xf0f00f0f, 0x0f0ff0f0, 0x0f0f0f0f, 0xf0f0f0f0, 0x0f0ff0f0, 
0xf0f00f0f, 0xf0f00f0f, 0xf0f0f0f0, 0x0f0ff0f0, 0xf0f0f0f0, 0xf0f00f0f, 
0xf0f0f0f0, 0x0f0f0f0f, 0x0f0ff0f0, 0xf0f00f0f,
+0xf0f00f0f, 0x0f0ff0f0, 0x0f0ff0f0, 0xf0f0f0f0, 0x0f0ff0f0, 0xf0f0f0f0, 
0x0f0f0f0f, 0xf0f0f0f0, 0x0f0f0f0f, 0xf0f00f0f, 0xf0f00f0f, 0xf0f00f0f, 
0xf0f0f0f0, 0xf0f0f0f0, 0xf0f0f0f0, 0xf0f0,
+0x0f0f0f0f, 0x0f0f0f0f, 0x0f0f0f0f, 0xf0f00f0f, 0x0f0f0f0f, 0x0f0f0f0f, 
0xf0f0f0f0, 0x0f0f0f0f, 0x0f0f0f0f, 0xf0f00f0f, 0xf0f00f0f, 0x0f0f0f0f, 
0xf0f0f0f0, 0xf0f0f0f0, 0x0f0f0f0f, 0xf0f00f0f,
+0x0f0f0f0f, 0xf0f00f0f, 0xf0f0f0f0, 0x0f0f0f0f, 0xf0f0f0f0, 0xf0f00f0f, 
0xf0f00f0f, 0xf0f00f0f, 0x0f0ff0f0, 0xf0f0f0f0, 0xf0f0f0f0, 0x0f0ff0f0, 
0x0f0f0f0f, 0x0f0f0f0f, 

[PATCH 9/9] drm/amd/pm: make namings and comments more readable

2020-09-02 Thread Evan Quan
And to fit more accurately what the code does.

Change-Id: I2d917e66b55925c3a14aa96ac8e0c8c2110848c0
Signed-off-by: Evan Quan 
---
 drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
index 801c92eb439f..cd5394d4beb0 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
@@ -2185,7 +2185,7 @@ static int navi10_run_btc(struct smu_context *smu)
return ret;
 }
 
-static bool navi10_need_umc_cdr_12gbps_workaround(struct smu_context *smu)
+static bool navi10_need_umc_cdr_workaround(struct smu_context *smu)
 {
struct amdgpu_device *adev = smu->adev;
 
@@ -2276,7 +2276,7 @@ static int navi10_set_dummy_pstates_table_location(struct 
smu_context *smu)
   NULL);
 }
 
-static int navi10_disable_umc_cdr_12gbps_workaround(struct smu_context *smu)
+static int navi10_run_umc_cdr_workaround(struct smu_context *smu)
 {
struct amdgpu_device *adev = smu->adev;
uint8_t umc_fw_greater_than_v136 = false;
@@ -2284,7 +2284,7 @@ static int 
navi10_disable_umc_cdr_12gbps_workaround(struct smu_context *smu)
uint32_t param;
int ret = 0;
 
-   if (!navi10_need_umc_cdr_12gbps_workaround(smu))
+   if (!navi10_need_umc_cdr_workaround(smu))
return 0;
 
ret = smu_cmn_send_smc_msg_with_param(smu,
@@ -2655,7 +2655,7 @@ static int navi10_post_smu_init(struct smu_context *smu)
return ret;
}
 
-   ret = navi10_disable_umc_cdr_12gbps_workaround(smu);
+   ret = navi10_run_umc_cdr_workaround(smu);
if (ret)
dev_err(adev->dev, "Failed to apply umc cdr workaround!\n");
 
-- 
2.28.0



[PATCH 4/9] drm/amd/pm: revise the umc hybrid cdr workaround

2020-09-02 Thread Evan Quan
Drop the unused message (SMU_MSG_DAL_DISABLE_DUMMY_PSTATE_CHANGE).
And do not apply this workaround when the max uclk frequency
is greater than 750 MHz.

Change-Id: I862e80cc96424c82f34aff0fa85b3d37f4dbcb2b
Signed-off-by: Evan Quan 
---
 .../gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c   | 61 +++
 1 file changed, 34 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
index 6674f3abd457..79cd17d6bfaa 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
@@ -2181,18 +2181,6 @@ static int navi10_run_btc(struct smu_context *smu)
return ret;
 }
 
-static int navi10_dummy_pstate_control(struct smu_context *smu, bool enable)
-{
-   int result = 0;
-
-   if (!enable)
-   result = smu_cmn_send_smc_msg(smu, 
SMU_MSG_DAL_DISABLE_DUMMY_PSTATE_CHANGE, NULL);
-   else
-   result = smu_cmn_send_smc_msg(smu, 
SMU_MSG_DAL_ENABLE_DUMMY_PSTATE_CHANGE, NULL);
-
-   return result;
-}
-
 static inline bool navi10_need_umc_cdr_12gbps_workaround(struct amdgpu_device 
*adev)
 {
if (adev->asic_type != CHIP_NAVI10)
@@ -2208,32 +2196,32 @@ static inline bool 
navi10_need_umc_cdr_12gbps_workaround(struct amdgpu_device *a
return false;
 }
 
-static int navi10_disable_umc_cdr_12gbps_workaround(struct smu_context *smu)
+static int navi10_umc_hybrid_cdr_workaround(struct smu_context *smu)
 {
uint32_t uclk_count, uclk_min, uclk_max;
-   uint32_t smu_version;
int ret = 0;
 
-   if (!navi10_need_umc_cdr_12gbps_workaround(smu->adev))
-   return 0;
-
-   ret = smu_cmn_get_smc_version(smu, NULL, &smu_version);
-   if (ret)
-   return ret;
-
-   /* This workaround is available only for 42.50 or later SMC firmwares */
-   if (smu_version < 0x2A3200)
+   /* This workaround can be applied only with uclk dpm enabled */
+   if (!smu_cmn_feature_is_enabled(smu, SMU_FEATURE_DPM_UCLK_BIT))
return 0;
 
ret = smu_v11_0_get_dpm_level_count(smu, SMU_UCLK, &uclk_count);
if (ret)
return ret;
 
-   ret = smu_v11_0_get_dpm_freq_by_index(smu, SMU_UCLK, (uint16_t)0, 
&uclk_min);
+   ret = smu_v11_0_get_dpm_freq_by_index(smu, SMU_UCLK, 
(uint16_t)(uclk_count - 1), &uclk_max);
if (ret)
return ret;
 
-   ret = smu_v11_0_get_dpm_freq_by_index(smu, SMU_UCLK, 
(uint16_t)(uclk_count - 1), &uclk_max);
+   /*
+* The NAVI10_UMC_HYBRID_CDR_WORKAROUND_UCLK_THRESHOLD is 750Mhz.
+* This workaround is needed only when the max uclk frequency
+* is not greater than that.
+*/
+   if (uclk_max > 0x2EE)
+   return 0;
+
+   ret = smu_v11_0_get_dpm_freq_by_index(smu, SMU_UCLK, (uint16_t)0, 
&uclk_min);
if (ret)
return ret;
 
@@ -2250,8 +2238,27 @@ static int 
navi10_disable_umc_cdr_12gbps_workaround(struct smu_context *smu)
/*
 * In this case, SMU already disabled dummy pstate during enablement
 * of UCLK DPM, we have to re-enabled it.
-* */
-   return navi10_dummy_pstate_control(smu, true);
+*/
+   return smu_cmn_send_smc_msg(smu, 
SMU_MSG_DAL_ENABLE_DUMMY_PSTATE_CHANGE, NULL);
+}
+
+static int navi10_disable_umc_cdr_12gbps_workaround(struct smu_context *smu)
+{
+   uint32_t smu_version;
+   int ret = 0;
+
+   if (!navi10_need_umc_cdr_12gbps_workaround(smu->adev))
+   return 0;
+
+   ret = smu_cmn_get_smc_version(smu, NULL, &smu_version);
+   if (ret)
+   return ret;
+
+   /* This workaround is available only for 42.50 or later SMC firmwares */
+   if (smu_version < 0x2A3200)
+   return 0;
+
+   return navi10_umc_hybrid_cdr_workaround(smu);
 }
 
 static void navi10_fill_i2c_req(SwI2cRequest_t  *req, bool write,
-- 
2.28.0
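
As a quick sanity check of the magic number in the hunk above: 0x2EE =
2*256 + 14*16 + 14 = 750, so the "uclk_max > 0x2EE" test does compare
against the 750 MHz threshold named in the comment (assuming uclk levels
are reported in MHz, as elsewhere in this code).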



[PATCH 7/9] drm/amd/pm: apply the CDR workarounds only with some specific UMC firmwares

2020-09-02 Thread Evan Quan
And a different workaround will be applied based on the hybrid CDR bit.

Change-Id: I828dc3605dbe0bb5a5e1a0db409658608ff21888
Signed-off-by: Evan Quan 
---
 drivers/gpu/drm/amd/pm/inc/smu_types.h|  1 +
 drivers/gpu/drm/amd/pm/inc/smu_v11_0_ppsmc.h  |  4 ++-
 .../gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c   | 28 +++
 3 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/inc/smu_types.h 
b/drivers/gpu/drm/amd/pm/inc/smu_types.h
index 4a8655a20ef6..35fc46d3c9c0 100644
--- a/drivers/gpu/drm/amd/pm/inc/smu_types.h
+++ b/drivers/gpu/drm/amd/pm/inc/smu_types.h
@@ -175,6 +175,7 @@
__SMU_DUMMY_MAP(DAL_ENABLE_DUMMY_PSTATE_CHANGE), \
__SMU_DUMMY_MAP(SET_DRIVER_DUMMY_TABLE_DRAM_ADDR_HIGH), \
__SMU_DUMMY_MAP(SET_DRIVER_DUMMY_TABLE_DRAM_ADDR_LOW), \
+   __SMU_DUMMY_MAP(GET_UMC_FW_WA), \
__SMU_DUMMY_MAP(Mode1Reset), \
 
 #undef __SMU_DUMMY_MAP
diff --git a/drivers/gpu/drm/amd/pm/inc/smu_v11_0_ppsmc.h 
b/drivers/gpu/drm/amd/pm/inc/smu_v11_0_ppsmc.h
index fc8594e9b2bd..26181b679098 100644
--- a/drivers/gpu/drm/amd/pm/inc/smu_v11_0_ppsmc.h
+++ b/drivers/gpu/drm/amd/pm/inc/smu_v11_0_ppsmc.h
@@ -128,7 +128,9 @@
 #define PPSMC_MSG_SetDriverDummyTableDramAddrHigh 0x4E
 #define PPSMC_MSG_SetDriverDummyTableDramAddrLow  0x4F
 
-#define PPSMC_Message_Count  0x50
+#define PPSMC_MSG_GetUMCFWWA 0x50
+
+#define PPSMC_Message_Count  0x51
 
 typedef uint32_t PPSMC_Result;
 typedef uint32_t PPSMC_Msg;
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
index 061eee1a4c32..e02d036fb298 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
@@ -142,6 +142,7 @@ static struct cmn2asic_msg_mapping 
navi10_message_map[SMU_MSG_MAX_COUNT] = {
MSG_MAP(SetMGpuFanBoostLimitRpm,
PPSMC_MSG_SetMGpuFanBoostLimitRpm,  0),
MSG_MAP(SET_DRIVER_DUMMY_TABLE_DRAM_ADDR_HIGH, 
PPSMC_MSG_SetDriverDummyTableDramAddrHigh, 0),
MSG_MAP(SET_DRIVER_DUMMY_TABLE_DRAM_ADDR_LOW, 
PPSMC_MSG_SetDriverDummyTableDramAddrLow, 0),
+   MSG_MAP(GET_UMC_FW_WA,  PPSMC_MSG_GetUMCFWWA,   
0),
 };
 
 static struct cmn2asic_mapping navi10_clk_map[SMU_CLK_COUNT] = {
@@ -2278,21 +2279,36 @@ static int 
navi10_set_dummy_pstates_table_location(struct smu_context *smu)
 
 static int navi10_disable_umc_cdr_12gbps_workaround(struct smu_context *smu)
 {
-   uint32_t smu_version;
+   struct amdgpu_device *adev = smu->adev;
+   uint8_t umc_fw_greater_than_v136 = false;
+   uint8_t umc_fw_disable_cdr = false;
+   uint32_t param;
int ret = 0;
 
-   if (!navi10_need_umc_cdr_12gbps_workaround(smu->adev))
+   if (!navi10_need_umc_cdr_12gbps_workaround(adev))
return 0;
 
-   ret = smu_cmn_get_smc_version(smu, NULL, &smu_version);
+   ret = smu_cmn_send_smc_msg_with_param(smu,
+ SMU_MSG_GET_UMC_FW_WA,
+ 0,
+ &param);
if (ret)
return ret;
 
-   /* This workaround is available only for 42.50 or later SMC firmwares */
-   if (smu_version < 0x2A3200)
+   /* First bit indicates if the UMC f/w is greater than v136 */
+   umc_fw_greater_than_v136 = param & 0x1;
+
+   /* Second bit indicates if hybrid-cdr is disabled */
+   umc_fw_disable_cdr = param & 0x2;
+
+   /* w/a only allowed if UMC f/w is <= 136 */
+   if (umc_fw_greater_than_v136)
return 0;
 
-   return navi10_umc_hybrid_cdr_workaround(smu);
+   if (umc_fw_disable_cdr && adev->asic_type == CHIP_NAVI10)
+   return navi10_umc_hybrid_cdr_workaround(smu);
+   else
+   return navi10_set_dummy_pstates_table_location(smu);
 }
 
 static void navi10_fill_i2c_req(SwI2cRequest_t  *req, bool write,
-- 
2.28.0
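
The bit decode in the hunk above can be exercised in isolation. A
self-contained C sketch (the bit layout follows the patch; the sample value
and the dropped Navi10 ASIC check are illustrative only):

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
    /* Pretend SMU answer: bit 0 clear (fw <= v136), bit 1 set (CDR disabled). */
    unsigned int param = 0x2;

    bool fw_greater_than_v136 = param & 0x1; /* bit 0 per the patch */
    bool fw_disable_cdr       = param & 0x2; /* bit 1 per the patch */

    if (fw_greater_than_v136)
        puts("no workaround needed");
    else if (fw_disable_cdr)
        puts("run the hybrid CDR workaround"); /* plus an asic_type check in the patch */
    else
        puts("upload the dummy pstate tables");
    return 0;
}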



Re: [PATCH] drm/amdgpu: Revert "drm/amdgpu: stop allocating dummy GTT nodes"

2020-09-02 Thread Christian König
Forget it, I see the problem now as well. Give me a second to provide a 
better fix.


Thanks for the hint,
Christian.

Am 02.09.20 um 09:10 schrieb Christian König:
We got a bug report from upstream about this as well, but I couldn't 
reproduce it so far.


Why would we touch outside of the page table with this?

Regards,
Christian.

Am 02.09.20 um 05:43 schrieb xinhui pan:

This reverts commit 1e691e2444871d1fde11b611653b5da9010dcec8.

mem->mm_node could now be NULL with the commit above. That makes
amdgpu_vm_bo_split_mapping touch memory outside of the page table, as
max_entries is set to S64_MAX.

Until we fix that issue, revert the commit above.

[  978.955925] BUG: unable to handle page fault for address: 
94dfc4bc

[  978.963424] #PF: supervisor read access in kernel mode
[  978.969034] #PF: error_code(0x) - not-present page
[  978.974662] PGD 72e201067 P4D 72e201067 PUD 86a414067 PMD 
86a3ee067 PTE 80083b43f060

[  978.983494] Oops:  [#1] SMP DEBUG_PAGEALLOC NOPTI
[  978.988992] CPU: 0 PID: 12264 Comm: Xorg Tainted: G    W 
O  5.9.0-rc2+ #46
[  978.997394] Hardware name: System manufacturer System Product 
Name/PRIME Z390-A, BIOS 1401 11/26/2019

[  979.007495] RIP: 0010:amdgpu_vm_bo_update+0x5af/0x880 [amdgpu]
[  979.013881] Code: ff ff ff ff ff 7f 48 8b 45 c0 4c 8d 04 d8 b8 01 
00 00 00 eb 09 48 83 c0 01 48 39 c2 76 12 49 8b 74 c0 f8 48 81 c6 00 
10 00 00 <49> 39 34 c0 74 e5 8b 75 b4 4c 8b 45 c8 48 38

[  979.034354] RSP: 0018:a94281403ba8 EFLAGS: 00010206
[  979.040050] RAX: 0200 RBX: 0e00 RCX: 
001049e8
[  979.047824] RDX: 7fff RSI: 0007c5e0 RDI: 
94dfd5fc
[  979.055644] RBP: a94281403c40 R08: 94dfc4bbf000 R09: 
0001
[  979.063441] R10:  R11:  R12: 
001047e8
[  979.071279] R13:  R14: 001047e9 R15: 
94dfc4e9af48
[  979.079098] FS:  7f19d3d00a80() GS:94e007e0() 
knlGS:

[  979.087911] CS:  0010 DS:  ES:  CR0: 80050033
[  979.094240] CR2: 94dfc4bc CR3: 0007c408c005 CR4: 
003706f0
[  979.102050] DR0:  DR1:  DR2: 

[  979.109868] DR3:  DR6: fffe0ff0 DR7: 
0400

[  979.117669] Call Trace:
[  979.120393]  amdgpu_gem_va_ioctl+0x533/0x560 [amdgpu]
[  979.125970]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
[  979.131914]  drm_ioctl_kernel+0xb4/0x100 [drm]
[  979.136792]  drm_ioctl+0x241/0x400 [drm]
[  979.141100]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
[  979.147003]  ? _raw_spin_unlock_irqrestore+0x4c/0x60
[  979.152446]  ? trace_hardirqs_on+0x2b/0xf0
[  979.156977]  amdgpu_drm_ioctl+0x4e/0x80 [amdgpu]
[  979.162033]  __x64_sys_ioctl+0x91/0xc0
[  979.166117]  do_syscall_64+0x38/0x90
[  979.170022]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  979.175537] RIP: 0033:0x7f19d405e37b
[  979.179450] Code: 0f 1e fa 48 8b 05 15 3b 0d 00 64 c7 00 26 00 00 
00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 
00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e5 3a 08
[  979.200034] RSP: 002b:7ffe66c9e938 EFLAGS: 0246 ORIG_RAX: 
0010
[  979.208330] RAX: ffda RBX: 7ffe66c9e990 RCX: 
7f19d405e37b
[  979.216147] RDX: 7ffe66c9e990 RSI: c0286448 RDI: 
0010
[  979.223897] RBP: c0286448 R08: 0001039e9000 R09: 
000e
[  979.231742] R10: 5640dcedf010 R11: 0246 R12: 

[  979.239555] R13: 0010 R14: 0001 R15: 
7ffe66c9ea58
[  979.247358] Modules linked in: amdgpu(O) iommu_v2 gpu_sched(O) 
ttm(O) drm_kms_helper(O) cec i2c_algo_bit fb_sys_fops syscopyarea 
sysfillrect sysimgblt overlay binfmt_misc snd_sof_pci snd_sos
[  979.247375]  x_tables autofs4 crc32_pclmul e1000e i2c_i801 
i2c_smbus ahci libahci wmi video pinctrl_cannonlake pinctrl_intel

[  979.354934] CR2: 94dfc4bc
[  979.358566] ---[ end trace 5b622843e4242519 ]---

Signed-off-by: xinhui pan 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c | 104 ++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c |  18 +---
  2 files changed, 80 insertions(+), 42 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c

index e1b66898cb76..295d6fbcda8f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
@@ -150,7 +150,60 @@ static int amdgpu_gtt_mgr_fini(struct 
ttm_mem_type_manager *man)

   */
  bool amdgpu_gtt_mgr_has_gart_addr(struct ttm_mem_reg *mem)
  {
-    return mem->mm_node != NULL;
+    struct amdgpu_gtt_node *node = mem->mm_node;
+
+    return (node->node.start != AMDGPU_BO_INVALID_OFFSET);
+}
+
+/**
+ * amdgpu_gtt_mgr_alloc - allocate new ranges
+ *
+ * @man: TTM memory type manager
+ * @tbo: TTM BO we need this range for
+ * @place: placement flags and 

[PATCH] drm/managed: Cleanup of unused functions and polishing docs

2020-09-02 Thread Daniel Vetter
Following functions are only used internally, not by drivers:
- devm_drm_dev_init

Also, now that we have a very slick and polished way to allocate a
drm_device with devm_drm_dev_alloc, update all the docs to reflect the
new reality. Mostly this consists of deleting old and misleading
hints. Two main ones:

- it is no longer required that the drm_device base class is first in
  the structure. devm_drm_dev_alloc can cope with it being anywhere

- obviously embedded now strongly recommends using devm_drm_dev_alloc

v2: Fix typos (Noralf)

v3: Split out the removal of drm_dev_init, that's blocked on some
discussions on how to convert vgem/vkms/i915-selftests. Adjust commit
message to reflect that.

Cc: Noralf Trønnes 
Acked-by: Noralf Trønnes  (v2)
Acked-by: Sam Ravnborg 
Cc: Luben Tuikov 
Cc: amd-gfx@lists.freedesktop.org
Signed-off-by: Daniel Vetter 
---
 .../driver-api/driver-model/devres.rst|  2 +-
 drivers/gpu/drm/drm_drv.c | 78 +--
 drivers/gpu/drm/drm_managed.c |  2 +-
 include/drm/drm_device.h  |  2 +-
 include/drm/drm_drv.h | 16 ++--
 5 files changed, 30 insertions(+), 70 deletions(-)

diff --git a/Documentation/driver-api/driver-model/devres.rst 
b/Documentation/driver-api/driver-model/devres.rst
index efc21134..aa4d2420f79e 100644
--- a/Documentation/driver-api/driver-model/devres.rst
+++ b/Documentation/driver-api/driver-model/devres.rst
@@ -263,7 +263,7 @@ DMA
   dmam_pool_destroy()
 
 DRM
-  devm_drm_dev_init()
+  devm_drm_dev_alloc()
 
 GPIO
   devm_gpiod_get()
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index d4506f7a234e..7c1689842ec0 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -240,13 +240,13 @@ void drm_minor_release(struct drm_minor *minor)
  * DOC: driver instance overview
  *
  * A device instance for a drm driver is represented by  drm_device. 
This
- * is initialized with drm_dev_init(), usually from bus-specific ->probe()
- * callbacks implemented by the driver. The driver then needs to initialize all
- * the various subsystems for the drm device like memory management, vblank
- * handling, modesetting support and intial output configuration plus obviously
- * initialize all the corresponding hardware bits. Finally when everything is 
up
- * and running and ready for userspace the device instance can be published
- * using drm_dev_register().
+ * is allocated and initialized with devm_drm_dev_alloc(), usually from
+ * bus-specific ->probe() callbacks implemented by the driver. The driver then
+ * needs to initialize all the various subsystems for the drm device like 
memory
+ * management, vblank handling, modesetting support and initial output
+ * configuration plus obviously initialize all the corresponding hardware bits.
+ * Finally when everything is up and running and ready for userspace the device
+ * instance can be published using drm_dev_register().
  *
  * There is also deprecated support for initializing device instances using
  * bus-specific helpers and the _driver.load callback. But due to
@@ -274,7 +274,7 @@ void drm_minor_release(struct drm_minor *minor)
  *
  * The following example shows a typical structure of a DRM display driver.
  * The example focus on the probe() function and the other functions that is
- * almost always present and serves as a demonstration of devm_drm_dev_init().
+ * almost always present and serves as a demonstration of devm_drm_dev_alloc().
  *
  * .. code-block:: c
  *
@@ -294,22 +294,12 @@ void drm_minor_release(struct drm_minor *minor)
  * struct drm_device *drm;
  * int ret;
  *
- * // devm_kzalloc() can't be used here because the drm_device '
- * // lifetime can exceed the device lifetime if driver unbind
- * // happens when userspace still has open file descriptors.
- * priv = kzalloc(sizeof(*priv), GFP_KERNEL);
- * if (!priv)
- * return -ENOMEM;
- *
+ * priv = devm_drm_dev_alloc(&pdev->dev, &driver_drm_driver,
+ *   struct driver_device, drm);
+ * if (IS_ERR(priv))
+ * return PTR_ERR(priv);
 * drm = &priv->drm;
  *
- * ret = devm_drm_dev_init(&pdev->dev, drm, &driver_drm_driver);
- * if (ret) {
- * kfree(priv);
- * return ret;
- * }
- * drmm_add_final_kfree(drm, priv);
- *
  * ret = drmm_mode_config_init(drm);
  * if (ret)
  * return ret;
@@ -550,9 +540,9 @@ static void drm_fs_inode_free(struct inode *inode)
  * following guidelines apply:
  *
  *  - The entire device initialization procedure should be run from the
- *    &component_master_ops.master_bind callback, starting with drm_dev_init(),
- *then binding all components with component_bind_all() and finishing with
- *

Re: [PATCH 3/3] drm/amdgpu: Remove superfluous NULL check

2020-09-02 Thread Christian König

Am 02.09.20 um 08:59 schrieb Daniel Vetter:

On Tue, Sep 01, 2020 at 09:06:45PM -0400, Luben Tuikov wrote:

The DRM device is a static member of
the amdgpu device structure and as such
always exists, so long as the PCI and
thus the amdgpu device exist.

Signed-off-by: Luben Tuikov 

On this patch, but not the other two earlier in this series:

Acked-by: Daniel Vetter 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
  1 file changed, 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c4900471beb0..6dcc256b9ebc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3471,9 +3471,6 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
fbcon)
struct drm_connector_list_iter iter;
int r;
  
-	if (!dev)

-   return -ENODEV;
-
adev = drm_to_adev(dev);


Maybe this could now even fit into the declaration of adev.

But either way Acked-by: Christian König .

Christian.

  
  	if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)

--
2.28.0.394.ge197136389





Re: [PATCH 2/2] drm/amdgpu/gmc10: print client id string for gfxhub

2020-09-02 Thread Christian König

Am 02.09.20 um 04:32 schrieb Felix Kuehling:

Should there be a corresponding change in mmhub_v2_0.c?


It would be at least nice to have.

Maybe we should put a pointer to the array and its size into the hub 
structure instead?


Anyway Reviewed-by: Christian König  for now.

Christian.



Other than that, the series is

Reviewed-by: Felix Kuehling 

On 2020-09-01 5:51 p.m., Alex Deucher wrote:

Print the name of the client rather than the number.  This
makes it easier to debug what block is causing the fault.

Signed-off-by: Alex Deucher 
---
  drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c | 30 +---
  drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c | 30 +---
  2 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c

index 76acd7f7723e..b882ac59879a 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
@@ -31,6 +31,27 @@
    #include "soc15_common.h"
  +static const char *gfxhub_client_ids[] = {
+    "CB/DB",
+    "Reserved",
+    "GE1",
+    "GE2",
+    "CPF",
+    "CPC",
+    "CPG",
+    "RLC",
+    "TCP",
+    "SQC (inst)",
+    "SQC (data)",
+    "SQG",
+    "Reserved",
+    "SDMA0",
+    "SDMA1",
+    "GCR",
+    "SDMA2",
+    "SDMA3",
+};
+
  static uint32_t gfxhub_v2_0_get_invalidate_req(unsigned int vmid,
 uint32_t flush_type)
  {
@@ -55,12 +76,15 @@ static void
  gfxhub_v2_0_print_l2_protection_fault_status(struct amdgpu_device 
*adev,

   uint32_t status)
  {
+    u32 cid = REG_GET_FIELD(status,
+    GCVM_L2_PROTECTION_FAULT_STATUS, CID);
+
  dev_err(adev->dev,
  "GCVM_L2_PROTECTION_FAULT_STATUS:0x%08X\n",
  status);
-    dev_err(adev->dev, "\t Faulty UTCL2 client ID: 0x%lx\n",
-    REG_GET_FIELD(status,
-    GCVM_L2_PROTECTION_FAULT_STATUS, CID));
+    dev_err(adev->dev, "\t Faulty UTCL2 client ID: %s (0x%x)\n",
+    cid >= ARRAY_SIZE(gfxhub_client_ids) ? "unknown" : 
gfxhub_client_ids[cid],

+    cid);
  dev_err(adev->dev, "\t MORE_FAULTS: 0x%lx\n",
  REG_GET_FIELD(status,
  GCVM_L2_PROTECTION_FAULT_STATUS, MORE_FAULTS));
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c 
b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c

index 80c906a0383f..237a9ff5afa0 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
@@ -31,6 +31,27 @@
    #include "soc15_common.h"
  +static const char *gfxhub_client_ids[] = {
+    "CB/DB",
+    "Reserved",
+    "GE1",
+    "GE2",
+    "CPF",
+    "CPC",
+    "CPG",
+    "RLC",
+    "TCP",
+    "SQC (inst)",
+    "SQC (data)",
+    "SQG",
+    "Reserved",
+    "SDMA0",
+    "SDMA1",
+    "GCR",
+    "SDMA2",
+    "SDMA3",
+};
+
  static uint32_t gfxhub_v2_1_get_invalidate_req(unsigned int vmid,
 uint32_t flush_type)
  {
@@ -55,12 +76,15 @@ static void
  gfxhub_v2_1_print_l2_protection_fault_status(struct amdgpu_device 
*adev,

   uint32_t status)
  {
+    u32 cid = REG_GET_FIELD(status,
+    GCVM_L2_PROTECTION_FAULT_STATUS, CID);
+
  dev_err(adev->dev,
  "GCVM_L2_PROTECTION_FAULT_STATUS:0x%08X\n",
  status);
-    dev_err(adev->dev, "\t Faulty UTCL2 client ID: 0x%lx\n",
-    REG_GET_FIELD(status,
-    GCVM_L2_PROTECTION_FAULT_STATUS, CID));
+    dev_err(adev->dev, "\t Faulty UTCL2 client ID: %s (0x%x)\n",
+    cid >= ARRAY_SIZE(gfxhub_client_ids) ? "unknown" : 
gfxhub_client_ids[cid],

+    cid);
  dev_err(adev->dev, "\t MORE_FAULTS: 0x%lx\n",
  REG_GET_FIELD(status,
  GCVM_L2_PROTECTION_FAULT_STATUS, MORE_FAULTS));
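
The bounds-checked lookup the patch adds is worth noting because a fault can
report a client ID beyond the table. A self-contained C sketch of the same
idiom (table contents abbreviated; the patch lists all eighteen names):

#include <stdio.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

static const char *gfxhub_client_ids[] = { "CB/DB", "Reserved", "GE1", "GE2" };

static const char *client_name(unsigned int cid)
{
    return cid >= ARRAY_SIZE(gfxhub_client_ids) ? "unknown"
                                                : gfxhub_client_ids[cid];
}

int main(void)
{
    printf("cid 2  -> %s\n", client_name(2));  /* GE1 */
    printf("cid 99 -> %s\n", client_name(99)); /* unknown */
    return 0;
}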





Re: [PATCH V2] drm/amdgpu: Do not move root PT bo to relocated list

2020-09-02 Thread Christian König

Am 02.09.20 um 06:50 schrieb Pan, Xinhui:



On 2020-09-01 21:54, Christian König wrote:

Agreed, that change doesn't seem to make sense and your backtrace is mangled so 
barely readable.

it is the reply that messed up the logs.

And this patch was sent on 10th Feb.


Now I see it as well. I'm just not used to reading Chinese letters :)

I was already wondering why this came up again.

Regards,
Christian.


Christian.

Am 01.09.20 um 14:59 schrieb Liu, Monk:

[AMD Official Use Only - Internal Distribution Only]

See that we already have such logic:

282 static void amdgpu_vm_bo_relocated(struct amdgpu_vm_bo_base *vm_bo)
  283 {
  284 if (vm_bo->bo->parent)
  285 list_move(&vm_bo->vm_status, &vm_bo->vm->relocated);
  286 else
  287 amdgpu_vm_bo_idle(vm_bo);
  288 }

Why do you need to do the bo->parent check outside?

because it was me who moved such logic into amdgpu_vm_bo_relocated.


-----Original Message-----
From: amd-gfx on behalf of Pan, Xinhui
Sent: 2020-02-10 9:04
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander; Koenig, Christian

Subject: [PATCH V2] drm/amdgpu: Do not move root PT bo to relocated list

hit panic when we update the page tables.

<1>[  122.103290] BUG: kernel NULL pointer dereference, address: 0008
<1>[  122.103348] #PF: supervisor read access in kernel mode
<1>[  122.103376] #PF: error_code(0x) - not-present page
<6>[  122.103403] PGD 0 P4D 0
<4>[  122.103421] Oops:  [#1] SMP PTI
<4>[  122.103442] CPU: 13 PID: 2133 Comm: kfdtest Tainted: G   OE 5.4.0-rc7+ #7
<4>[  122.103480] Hardware name: Supermicro SYS-7048GR-TR/X10DRG-Q, BIOS 3.0b 03/09/2018
<4>[  122.103657] RIP: 0010:amdgpu_vm_update_pdes+0x140/0x330 [amdgpu]
<4>[  122.103689] Code: 03 4c 89 73 08 49 89 9d c8 00 00 00 48 8b 7b f0 c6 43 10 00 45 31 c0 48 8b 87 28 04 00 00 48 85 c0 74 07 4c 8b 80 20 04 00 00 <4d> 8b 70 08 31 f6 49 8b 86 28 04 00 00 48 85 c0 74 0f 48 8b 80 28
<4>[  122.103769] RSP: 0018:b49a0a6a3a98 EFLAGS: 00010246
<4>[  122.103797] RAX:  RBX: 9020f823c148 RCX: dead0122
<4>[  122.103831] RDX: 9020ece70018 RSI: 9020f823c0c8 RDI: 9010ca31c800
<4>[  122.103865] RBP: b49a0a6a3b38 R08:  R09: 0001
<4>[  122.103899] R10: 6044f994 R11: df57fb58 R12: 9020f823c000
<4>[  122.103933] R13: 9020f823c000 R14: 9020f823c0c8 R15: 9010d5d2
<4>[  122.103968] FS:  7f32c83dc780() GS:9020ff38() knlGS:
<4>[  122.104006] CS:  0010 DS:  ES:  CR0: 80050033
<4>[  122.104035] CR2: 0008 CR3: 002036bba005 CR4: 003606e0
<4>[  122.104069] DR0:  DR1:  DR2: 
<4>[  122.104103] DR3:  DR6: fffe0ff0 DR7: 0400
<4>[  122.104137] Call Trace:
<4>[  122.104241]  vm_update_pds+0x31/0x50 [amdgpu]
<4>[  122.104347]  amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x2ef/0x690 [amdgpu]
<4>[  122.104466]  kfd_process_alloc_gpuvm+0x98/0x190 [amdgpu]
<4>[  122.104576]  kfd_process_device_init_vm.part.8+0xf3/0x1f0 [amdgpu]
<4>[  122.104688]  kfd_process_device_init_vm+0x24/0x30 [amdgpu]
<4>[  122.104794]  kfd_ioctl_acquire_vm+0xa4/0xc0 [amdgpu]
<4>[  122.104900]  kfd_ioctl+0x277/0x500 [amdgpu]
<4>[  122.105001]  ? kfd_ioctl_free_memory_of_gpu+0xc0/0xc0 [amdgpu]
<4>[  122.105039]  ? rcu_read_lock_sched_held+0x4f/0x80
<4>[  122.105068]  ? kmem_cache_free+0x2ba/0x300
<4>[  122.105093]  ? vm_area_free+0x18/0x20
<4>[  122.105117]  ? find_held_lock+0x35/0xa0
<4>[  122.105143]  do_vfs_ioctl+0xa9/0x6f0
<4>[  122.106001]  ksys_ioctl+0x75/0x80
<4>[  122.106802]  ? do_syscall_64+0x17/0x230
<4>[  122.107605]  __x64_sys_ioctl+0x1a/0x20
<4>[  122.108378]  do_syscall_64+0x5f/0x230
<4>[  122.109118]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[  122.109842] RIP: 0033:0x7f32c6b495d7

Signed-off-by: xinhui pan 
---
Changes from v1:
move the root PT BO to the idle state instead.
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 9 ++---
  1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 3195bc9..c3d1af5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2619,9 +2619,12 @@ void amdgpu_vm_bo_invalidate(struct amdgpu_device *adev,
 			continue;
 		bo_base->moved = true;
 
-		if (bo->tbo.type == ttm_bo_type_kernel)
-			amdgpu_vm_bo_relocated(bo_base);
-		else if (bo->tbo.base.resv == vm->root.base.bo->tbo.base.resv)
+		if (bo->tbo.type == ttm_bo_type_kernel) {
+			if (bo->parent)
+				amdgpu_vm_bo_relocated(bo_base);
+			else
+				amdgpu_vm_bo_idle(bo_base);
+		} else if (bo->tbo.base.resv == vm->root.base.bo->tbo.base.resv)
 			amdgpu_vm_bo_moved(bo_base);
 		else
 			amdgpu_vm_bo_invalidated(bo_base);
--
2.7.4
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amdgpu: Revert "drm/amdgpu: stop allocating dummy GTT nodes"

2020-09-02 Thread Christian König
We got a bug report from upstream about this as well, but I couldn't 
reproduce it so far.


Why would we touch outside of the page table with this?

Regards,
Christian.

On 02.09.20 at 05:43, xinhui pan wrote:

This reverts commit 1e691e2444871d1fde11b611653b5da9010dcec8.

mem->mm_node can now be NULL with the commit above. That makes
amdgpu_vm_bo_split_mapping touch memory outside of the page table,
as max_entries is set to S64_MAX.

Until we fix that issue, revert the commit above.
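
To illustrate the failure mode, here is a rough sketch of how the mapping
split chooses its bound, assuming the 5.9-era amdgpu_vm_bo_split_mapping()
logic (not the verbatim kernel code):

	/* 'nodes' is mem->mm_node for the BO being mapped. */
	if (!nodes) {
		/* No allocation node: the mapping is treated as unbounded,
		 * so a later page-table walk can run past the end of the
		 * page table.
		 */
		addr = 0;
		max_entries = S64_MAX;
	} else {
		addr = nodes->start << PAGE_SHIFT;
		max_entries = (nodes->size - pfn) * AMDGPU_GPU_PAGES_IN_CPU_PAGE;
	}

With a NULL mm_node, the S64_MAX bound is what lets the update walk off the
end of the page table, producing the fault below.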

[  978.955925] BUG: unable to handle page fault for address: 94dfc4bc
[  978.963424] #PF: supervisor read access in kernel mode
[  978.969034] #PF: error_code(0x) - not-present page
[  978.974662] PGD 72e201067 P4D 72e201067 PUD 86a414067 PMD 86a3ee067 PTE 80083b43f060
[  978.983494] Oops:  [#1] SMP DEBUG_PAGEALLOC NOPTI
[  978.988992] CPU: 0 PID: 12264 Comm: Xorg Tainted: GW  O  5.9.0-rc2+ #46
[  978.997394] Hardware name: System manufacturer System Product Name/PRIME Z390-A, BIOS 1401 11/26/2019
[  979.007495] RIP: 0010:amdgpu_vm_bo_update+0x5af/0x880 [amdgpu]
[  979.013881] Code: ff ff ff ff ff 7f 48 8b 45 c0 4c 8d 04 d8 b8 01 00 00 00 eb 09 48 83 c0 01 48 39 c2 76 12 49 8b 74 c0 f8 48 81 c6 00 10 00 00 <49> 39 34 c0 74 e5 8b 75 b4 4c 8b 45 c8 48 38
[  979.034354] RSP: 0018:a94281403ba8 EFLAGS: 00010206
[  979.040050] RAX: 0200 RBX: 0e00 RCX: 001049e8
[  979.047824] RDX: 7fff RSI: 0007c5e0 RDI: 94dfd5fc
[  979.055644] RBP: a94281403c40 R08: 94dfc4bbf000 R09: 0001
[  979.063441] R10:  R11:  R12: 001047e8
[  979.071279] R13:  R14: 001047e9 R15: 94dfc4e9af48
[  979.079098] FS:  7f19d3d00a80() GS:94e007e0() knlGS:
[  979.087911] CS:  0010 DS:  ES:  CR0: 80050033
[  979.094240] CR2: 94dfc4bc CR3: 0007c408c005 CR4: 003706f0
[  979.102050] DR0:  DR1:  DR2: 
[  979.109868] DR3:  DR6: fffe0ff0 DR7: 0400
[  979.117669] Call Trace:
[  979.120393]  amdgpu_gem_va_ioctl+0x533/0x560 [amdgpu]
[  979.125970]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
[  979.131914]  drm_ioctl_kernel+0xb4/0x100 [drm]
[  979.136792]  drm_ioctl+0x241/0x400 [drm]
[  979.141100]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
[  979.147003]  ? _raw_spin_unlock_irqrestore+0x4c/0x60
[  979.152446]  ? trace_hardirqs_on+0x2b/0xf0
[  979.156977]  amdgpu_drm_ioctl+0x4e/0x80 [amdgpu]
[  979.162033]  __x64_sys_ioctl+0x91/0xc0
[  979.166117]  do_syscall_64+0x38/0x90
[  979.170022]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  979.175537] RIP: 0033:0x7f19d405e37b
[  979.179450] Code: 0f 1e fa 48 8b 05 15 3b 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e5 3a 08
[  979.200034] RSP: 002b:7ffe66c9e938 EFLAGS: 0246 ORIG_RAX: 0010
[  979.208330] RAX: ffda RBX: 7ffe66c9e990 RCX: 7f19d405e37b
[  979.216147] RDX: 7ffe66c9e990 RSI: c0286448 RDI: 0010
[  979.223897] RBP: c0286448 R08: 0001039e9000 R09: 000e
[  979.231742] R10: 5640dcedf010 R11: 0246 R12: 
[  979.239555] R13: 0010 R14: 0001 R15: 7ffe66c9ea58
[  979.247358] Modules linked in: amdgpu(O) iommu_v2 gpu_sched(O) ttm(O) drm_kms_helper(O) cec i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt overlay binfmt_misc snd_sof_pci snd_sos
[  979.247375]  x_tables autofs4 crc32_pclmul e1000e i2c_i801 i2c_smbus ahci libahci wmi video pinctrl_cannonlake pinctrl_intel
[  979.354934] CR2: 94dfc4bc
[  979.358566] ---[ end trace 5b622843e4242519 ]---

Signed-off-by: xinhui pan 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c | 104 ++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c |  18 +---
  2 files changed, 80 insertions(+), 42 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
index e1b66898cb76..295d6fbcda8f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
@@ -150,7 +150,60 @@ static int amdgpu_gtt_mgr_fini(struct ttm_mem_type_manager *man)
   */
  bool amdgpu_gtt_mgr_has_gart_addr(struct ttm_mem_reg *mem)
  {
-   return mem->mm_node != NULL;
+   struct amdgpu_gtt_node *node = mem->mm_node;
+
+   return (node->node.start != AMDGPU_BO_INVALID_OFFSET);
+}
+
+/**
+ * amdgpu_gtt_mgr_alloc - allocate new ranges
+ *
+ * @man: TTM memory type manager
+ * @tbo: TTM BO we need this range for
+ * @place: placement flags and restrictions
+ * @mem: the resulting mem object
+ *
+ * Allocate the address space for a node.
+ */
+static int amdgpu_gtt_mgr_alloc(struct ttm_mem_type_manager *man,
+  

Re: [PATCH] drm/radeon: Reset ASIC if suspend is not managed by platform firmware

2020-09-02 Thread Kai-Heng Feng



> On Sep 2, 2020, at 00:30, Alex Deucher  wrote:
> 
> On Tue, Sep 1, 2020 at 12:21 PM Kai-Heng Feng
>  wrote:
>> 
>> 
>> 
>>> On Sep 1, 2020, at 22:19, Alex Deucher  wrote:
>>> 
>>> On Tue, Sep 1, 2020 at 3:32 AM Kai-Heng Feng
>>>  wrote:
 
 Suspend with s2idle or by the following steps causes the screen to freeze:
 # echo devices > /sys/power/pm_test
 # echo freeze > /sys/power/mem
 
 [  289.625461] [drm:uvd_v1_0_ib_test [radeon]] *ERROR* radeon: fence wait timed out.
 [  289.625494] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on ring 5 (-110).
 
 The issue doesn't happen on traditional S3, probably because firmware or
 hardware provides extra power management.
 
 Inspired by Daniel Drake's patch [1] on amdgpu, using a similar approach
 can fix the issue.
>>> 
>>> It doesn't actually fix the issue.  The device is never powered down
>>> so you are using more power than you would if you did not suspend in
>>> the first place.  The reset just works around the fact that the device
>>> is never powered down.
>> 
>> So how do we properly suspend/resume the device without help from platform 
>> firmware?
> 
> I guess you don't?

Unfortunate but I guess we need to accept reality and use the default suspend 
method.

Kai-Heng

> 
> Alex
> 
> 
>> 
>> Kai-Heng
>> 
>>> 
>>> Alex
>>> 
 
 [1] https://patchwork.freedesktop.org/patch/335839/
 
 Signed-off-by: Kai-Heng Feng 
 ---
 drivers/gpu/drm/radeon/radeon_device.c | 3 +++
 1 file changed, 3 insertions(+)
 
 diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
 index 266e3cbbd09b..df823b9ad79f 100644
 --- a/drivers/gpu/drm/radeon/radeon_device.c
 +++ b/drivers/gpu/drm/radeon/radeon_device.c
 @@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
 +#include 
 
 #include 
 #include 
 @@ -1643,6 +1644,8 @@ int radeon_suspend_kms(struct drm_device *dev, bool suspend,
   rdev->asic->asic_reset(rdev, true);
   pci_restore_state(dev->pdev);
   } else if (suspend) {
 +   if (pm_suspend_no_platform())
 +   rdev->asic->asic_reset(rdev, true);
   /* Shut down the device */
   pci_disable_device(dev->pdev);
   pci_set_power_state(dev->pdev, PCI_D3hot);
 --
 2.17.1
 
 ___
 dri-devel mailing list
 dri-de...@lists.freedesktop.org
 https://lists.freedesktop.org/mailman/listinfo/dri-devel
>> 

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu/dc: Require primary plane to be enabled whenever the CRTC is

2020-09-02 Thread Daniel Vetter
On Tue, Sep 01, 2020 at 09:58:43AM -0400, Harry Wentland wrote:
> 
> 
> On 2020-09-01 3:54 a.m., Daniel Vetter wrote:
> > On Wed, Aug 26, 2020 at 11:24:23AM +0300, Pekka Paalanen wrote:
> >> On Tue, 25 Aug 2020 12:58:19 -0400
> >> "Kazlauskas, Nicholas"  wrote:
> >>
> >>> On 2020-08-22 5:59 a.m., Michel Dänzer wrote:
>  On 2020-08-21 8:07 p.m., Kazlauskas, Nicholas wrote:  
> > On 2020-08-21 12:57 p.m., Michel Dänzer wrote:  
> >> From: Michel Dänzer 
> >>
> >> Don't check drm_crtc_state::active for this either, per its
> >> documentation in include/drm/drm_crtc.h:
> >>
> >>    * Hence drivers must not consult @active in their various
> >>    * &drm_mode_config_funcs.atomic_check callback to reject an atomic
> >>    * commit.
> >>
> >> The atomic helpers disable the CRTC as needed for disabling the primary
> >> plane.
> >>
> >> This prevents at least the following problems if the primary plane gets
> >> disabled (e.g. due to destroying the FB assigned to the primary plane,
> >> as happens e.g. with mutter in Wayland mode):
> >>
> >> * Toggling CRTC active to 1 failed if the cursor plane was enabled
> >>     (e.g. via legacy DPMS property & cursor ioctl).
> >> * Enabling the cursor plane failed, e.g. via the legacy cursor ioctl.  
> >
> > We previously had the requirement that the primary plane must be enabled
> > but some userspace expects that they can enable just the overlay plane
> > without anything else.
> >
> > I think the chromiumos atomictest validates that this works as well:
> >
> > So is DRM going forward then with the expectation that this is wrong
> > behavior from userspace?
> >
> > We require at least one plane to be enabled to display a cursor, but it
> > doesn't necessarily need to be the primary.  
> 
>  It's a "pick your poison" situation:
> 
>  1) Currently the checks are invalid (atomic_check must not decide based
>  on drm_crtc_state::active), and it's easy for legacy KMS userspace to
>  accidentally hit errors trying to enable/move the cursor or switch DPMS
>  off → on.
> 
>  2) Accurately rejecting only atomic states where the cursor plane is
>  enabled but all other planes are off would break the KMS helper code,
>  which can only deal with the "CRTC on & primary plane off is not
>  allowed" case specifically.
> 
>  3) This patch addresses 1) & 2) but may break existing atomic userspace
>  which wants to enable an overlay plane while disabling the primary plane.
> 
> 
>  I do think in principle atomic userspace is expected to handle case 3)
>  and leave the primary plane enabled. However, this is not ideal from an
>  energy consumption PoV. Therefore, here's another idea for a possible
>  way out of this quagmire:
> 
>  amdgpu_dm does not reject any atomic states based on which planes are
>  enabled in it. If the cursor plane is enabled but all other planes are
>  off, amdgpu_dm internally either:
> 
>  a) Enables an overlay plane and makes it invisible, e.g. by assigning a
>  minimum size FB with alpha = 0.
> 
>  b) Enables the primary plane and assigns a minimum size FB (scaled up to
>  the required size) containing all black, possibly using compression.
>  (Trying to minimize the memory bandwidth)
> 
> 
>  Does either of these seem feasible? If both do, which one would be
>  preferable?
> 
>    
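
A minimal sketch of what option a) could look like, purely for illustration
(this is not amdgpu_dm code, and every helper name below is hypothetical):

static int dm_fake_invisible_overlay(struct drm_atomic_state *state,
				     struct drm_crtc *crtc)
{
	struct drm_plane_state *ps;

	/* hypothetical: acquire state for a spare overlay plane on this CRTC */
	ps = hypothetical_acquire_overlay_state(state, crtc);
	if (IS_ERR(ps))
		return PTR_ERR(ps);

	/* hypothetical: attach a minimum-size (e.g. 16x16) dummy FB */
	ps->fb = hypothetical_min_size_fb(crtc->dev);
	ps->crtc_w = 16;
	ps->crtc_h = 16;
	ps->alpha = 0;	/* fully transparent, so nothing is visible */

	return 0;
}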
> >>>
> >>> It's really the same solution since DCN doesn't make a distinction 
> >>> between primary or overlay planes in hardware. DCE doesn't have overlay 
> >>> planes enabled so this is not relevant there.
> >>>
> >>> The old behavior (pre 5.1?) was to silently accept the commit even 
> >>> though the screen would be completely black instead of outright 
> >>> rejecting the commit.
> >>>
> >>> I almost wonder if that makes more sense in the short term here since 
> >>> the only "userspace" affected here is IGT. We'll fail the CRC checks, 
> >>> but no userspace actually tries to actively use a cursor with no primary 
> >>> plane enabled from my understanding.
> >>
> >> Hi,
> >>
> >> I believe that there exists userspace that will *accidentally* attempt
> >> to update the cursor plane while primary plane or whole CRTC is off.
> >> Some versions of Mutter might do that on racy conditions, I suspect.
> >> These are legacy KMS users, not atomic KMS.
> >>
> >> However, I do not believe there exists any userspace that would
> >> actually expect the display to show the cursor plane alone without a
> >> primary plane. Therefore I'd be ok with legacy cursor ioctls silently
> >> succeeding. Atomic commits not. So the difference has to be in the
> >> translation from legacy UAPI to kernel internal atomic interface.
> >>
> >>> In the long term I think we can work on getting cursor actually on the 
> >>> screen in this 

Re: [PATCH 3/3] drm/amdgpu: Remove superfluous NULL check

2020-09-02 Thread Daniel Vetter
On Tue, Sep 01, 2020 at 09:06:45PM -0400, Luben Tuikov wrote:
> The DRM device is a static member of
> the amdgpu device structure and as such
> always exists, so long as the PCI and
> thus the amdgpu device exist.
> 
> Signed-off-by: Luben Tuikov 

On this patch, but not the other two earlier in this series:

Acked-by: Daniel Vetter 

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index c4900471beb0..6dcc256b9ebc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3471,9 +3471,6 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
>   struct drm_connector_list_iter iter;
>   int r;
>  
> - if (!dev)
> - return -ENODEV;
> -
>   adev = drm_to_adev(dev);
>  
>   if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
> -- 
> 2.28.0.394.ge197136389
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 1/2] Revert "drm/amdgpu: disable gpu-sched load balance for uvd"

2020-09-02 Thread Christian König

On 01.09.20 at 21:49, Nirmoy Das wrote:

This reverts commit e0300ed8820d19fe108006cf1b69fa26f0b4e3fc.

We should also disable load balance for AMDGPU_HW_IP_UVD_ENC jobs.


Well, revert and re-apply is usually not the best option. Just provide a 
delta patch and Alex might decide to squash it into the original one 
during upstreaming.


Christian.



Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 4 +---
  1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 7cd398d25498..59032c26fc82 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -114,9 +114,7 @@ static int amdgpu_ctx_init_entity(struct amdgpu_ctx *ctx, u32 hw_ip,
 	scheds = adev->gpu_sched[hw_ip][hw_prio].sched;
 	num_scheds = adev->gpu_sched[hw_ip][hw_prio].num_scheds;
 
-	if (hw_ip == AMDGPU_HW_IP_VCN_ENC ||
-	    hw_ip == AMDGPU_HW_IP_VCN_DEC ||
-	    hw_ip == AMDGPU_HW_IP_UVD) {
+	if (hw_ip == AMDGPU_HW_IP_VCN_ENC || hw_ip == AMDGPU_HW_IP_VCN_DEC) {
 		sched = drm_sched_pick_best(scheds, num_scheds);
 		scheds = &sched;
 		num_scheds = 1;


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 0/3] Use implicit kref infra

2020-09-02 Thread Daniel Vetter
On Tue, Sep 01, 2020 at 11:46:18PM -0400, Luben Tuikov wrote:
> On 2020-09-01 21:42, Pan, Xinhui wrote:
> > If you take a look at the below function, you should not use driver's 
> > release to free adev. As dev is embedded in adev.
> 
> Do you mean "look at the function below", using "below" as an adverb?
> "below" is not an adjective.
> 
> I know dev is embedded in adev--I did that patchset.
> 
> > 
> >  809 static void drm_dev_release(struct kref *ref)
> >  810 {
> >  811 struct drm_device *dev = container_of(ref, struct drm_device, ref);
> >  812
> >  813 if (dev->driver->release)
> >  814 dev->driver->release(dev);
> >  815
> >  816 drm_managed_release(dev);
> >  817
> >  818 kfree(dev->managed.final_kfree);
> >  819 }
> 
> That's simple--this comes from change c6603c740e0e3
> and it should be reverted. Simple as that.
> 
> The version before this change was absolutely correct:
> 
> static void drm_dev_release(struct kref *ref)
> {
>   if (dev->driver->release)
>   dev->driver->release(dev);
>   else
>   drm_dev_fini(dev);
> }
> 
> Meaning, "the kref is now 0"--> if the driver
> has a release, call it, else use our own.
> But note that nothing can be assumed after this point,
> about the existence of "dev".
> 
> It is exactly because struct drm_device is statically
> embedded into a container, struct amdgpu_device,
> that this change above should be reverted.
> 
> This is very similar to how fops has open/release
> but no close. That is, the "release" is called
> only when the last kref is released, i.e. when
> kref goes from non-zero to zero.
> 
> This uses the kref infrastructure which has been
> around for about 20 years in the Linux kernel.
> 
> I suggest reading the comments
> in drm_dev.c mostly, "DOC: driver instance overview"
> starting at line 240 onwards. This is right above
> drm_put_dev(). There is actually an example of a driver
> in the comment. Also the comment to drm_dev_init().
> 
> Now, take a look at this:
> 
> /**
>  * drm_dev_put - Drop reference of a DRM device
>  * @dev: device to drop reference of or NULL
>  *
>  * This decreases the ref-count of @dev by one. The device is destroyed if the
>  * ref-count drops to zero.
>  */
> void drm_dev_put(struct drm_device *dev)
> {
> if (dev)
> kref_put(&dev->ref, drm_dev_release);
> }
> EXPORT_SYMBOL(drm_dev_put);
> 
> Two things:
> 
> 1. It is us, who kzalloc the amdgpu device, which contains
> the drm_device (you'll see this discussed in the reading
> material I pointed to above). We do this because we're
> probing the PCI device, whether we'll work with it or not.
> 
> 2. Using the kref infrastructure, when the ref goes to 0,
> drm_dev_release is called. And here's the KEY:
> Because WE allocated the container, we should free it--after the release
> method is called, DRM cannot assume anything about the drm
> device or the container. The "release" method is final.
> 
> We allocate, we free. And we free only when the ref goes to 0.
> 
> DRM can, in due time, "free" itself of the DRM device and stop
> having knowledge of it--that's fine, but as long as the ref
> is not 0, the amdgpu device and thus the contained DRM device,
> cannot be freed.
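
The ownership rule described above can be demonstrated with a small
standalone program (plain C, with a simplified non-atomic refcount standing
in for struct kref; the names mirror the kernel ones but this is not kernel
code):

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct drm_device { int refcount; };

struct amdgpu_device {
	struct drm_device ddev;	/* embedded, not a pointer */
	int private_state;
};

static void release(struct drm_device *ddev)
{
	struct amdgpu_device *adev =
		container_of(ddev, struct amdgpu_device, ddev);

	printf("release: freeing the container\n");
	free(adev);	/* we allocated the container, we free it */
}

static void dev_put(struct drm_device *ddev)
{
	if (--ddev->refcount == 0)
		release(ddev);	/* ddev must not be touched afterwards */
}

int main(void)
{
	struct amdgpu_device *adev = calloc(1, sizeof(*adev));

	if (!adev)
		return 1;
	adev->ddev.refcount = 1;	/* like drm_dev_init() setting the kref to 1 */
	dev_put(&adev->ddev);		/* ref hits 0 -> release frees the container */
	return 0;
}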
> 
> > 
> > You have to make another change something like
> > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > index 13068fdf4331..2aabd2b4c63b 100644
> > --- a/drivers/gpu/drm/drm_drv.c
> > +++ b/drivers/gpu/drm/drm_drv.c
> > @@ -815,7 +815,8 @@ static void drm_dev_release(struct kref *ref)
> >  
> > drm_managed_release(dev);
> >  
> > -   kfree(dev->managed.final_kfree);
> > +   if (dev->driver->final_release)
> > +   dev->driver->final_release(dev);
> >  }
> 
> No. What's this?
> There is no such thing as "final" release, nor is there a "partial" release.
> When the kref goes to 0, the device disappears. Simple.
> If someone is using it, they should kref-get it, and when they're
> done with it, they should kref-put it.
> 
> The whole point is that this is done implicitly, via the kref infrastructure.
> drm_dev_init() which we call in our PCI probe function, sets the kref to 
> 1--all
> as per the documentation I pointed you to above.
> 
> Another point is that we can do some other stuff in the release
> function, notify someone, write some registers, free memory we use
> for that PCI device, etc.
> 
> If the "managed resources" infrastructure wants to stay, it should hook
> itself into drm_dev_fini() and into drm_dev_init() or drm_dev_register().
> It shouldn't have to be so out-of-place like in patch 2/3 of this series,
> where the drmm_add_final_kfree() is smack-dab in the middle of our PCI
> discovery function, surrounded on top and bottom by drm_dev_init()
> and drm_dev_register(). The "managed resources" infra should be non-invasive
> and drivers shouldn't have to change to use it--it should be invisible to 
> them.
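
For reference, the probe sequence being criticized looks roughly like this
(a simplified sketch of the series under discussion, not verbatim; the
field and driver-structure names are assumed):

	/* Simplified sketch: drmm_add_final_kfree() sits between
	 * drm_dev_init() and drm_dev_register() in the PCI probe path. */
	adev = kzalloc(sizeof(*adev), GFP_KERNEL);
	if (!adev)
		return -ENOMEM;

	ret = drm_dev_init(&adev->ddev, &kms_driver, &pdev->dev);
	if (ret)
		goto err_free;

	drmm_add_final_kfree(&adev->ddev, adev);	/* the out-of-place call */

	ret = drm_dev_register(&adev->ddev, ent->driver_data);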
> Then our 

RE: [PATCH] drm/kfd: fix a system crash issue during GPU recovery

2020-09-02 Thread Li, Dennis
[AMD Official Use Only - Internal Distribution Only]

Hi, Felix,

>>>The failure to execute_queues should probably not be reported to the caller 
>>>of create_queue, because the queue was already created, and the problem with 
>>>execute_queues has a much bigger scope than this one caller. So I think the 
>>>correct solution is to ignore the return value from 
>>>execute_queues.

Got it. I have created a patch v2 according to your suggestion. 

>>>As a follow up, we should probably handle all the error scenarios inside 
>>>execute_queues and make it a void function. Failure to unmap queues already 
>>>triggers a GPU reset, so nothing new needs to be done for that. 
>>>But we need to add handling of failures to map queues. It doesn't require a 
>>>GPU reset, because the problem is in the kernel (e.g. out of memory), not 
>>>the GPU. The best we can do is report this asynchronously as a GPU hang to 
>>>all KFD processes, so they know the GPU is no longer going to work 
>>>for them.

Understood. I will follow up on this issue and prepare a solution to discuss 
with you.

Best Regards
Dennis Li

-Original Message-
From: Kuehling, Felix  
Sent: Wednesday, September 2, 2020 11:26 AM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, 
Alexander ; Zhang, Hawking ; 
Koenig, Christian 
Subject: Re: [PATCH] drm/kfd: fix a system crash issue during GPU recovery

On 2020-09-01 11:21 a.m., Li, Dennis wrote:
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi, Felix,
>   If the GPU hangs, execute_queues_cpsch will fail to unmap or map queues and 
> then create_queue_cpsch will return an error. If pqm_create_queue finds that 
> create_queue_cpsch failed, it will call uninit_queue to free the queue object. 
> However, this queue object has already been added to qpd->queues_list in the 
> old code.

Right, that's a problem. I think the intention here is to keep going because a 
failure to execute the runlist affects not just the queue that was just 
created, but all queues in all processes.

The failure to execute_queues should probably not be reported to the caller of 
create_queue, because the queue was already created, and the problem with 
execute_queues has a much bigger scope than this one caller. So I think the 
correct solution is to ignore the return value from execute_queues.
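
A minimal sketch of that suggestion, based on the create_queue_cpsch() flow
in amdkfd (illustrative only, not the actual v2 patch):

	/* The queue was already created and is on qpd->queues_list; a
	 * runlist execution failure affects all queues in all processes,
	 * so don't propagate it to this one caller.
	 */
	retval = execute_queues_cpsch(dqm,
				      KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0);
	if (retval)
		pr_err("failed to execute runlist\n");

	/* report success for the queue creation itself */
	return 0;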

As a follow up, we should probably handle all the error scenarios inside 
execute_queues and make it a void function. Failure to unmap queues already 
triggers a GPU reset, so nothing new needs to be done for that. 
But we need to add handling of failures to map queues. It doesn't require a GPU 
reset, because the problem is in the kernel (e.g. out of memory), not the GPU. 
The best we can do is report this asynchronously as a GPU hang to all KFD 
processes, so they know the GPU is no longer going to work for them.

Regards,
   Felix

>
> Best Regards
> Dennis Li
>
> -Original Message-
> From: Kuehling, Felix 
> Sent: Tuesday, September 1, 2020 9:26 PM
> To: Li, Dennis ; amd-gfx@lists.freedesktop.org; 
> Deucher, Alexander ; Zhang, Hawking 
> ; Koenig, Christian 
> Subject: Re: [PATCH] drm/kfd: fix a system crash issue during GPU 
> recovery
>
> I'm not sure how the bug you're fixing is caused, but your fix is clearly in 
> the wrong place.
>
> A queue being disabled is not the same thing as a queue being destroyed.
> Queues can be disabled for legitimate reasons, but they still should exist 
> and be in the qpd->queues_list.
>
> If a destroyed queue is left on the qpd->queues_list, that would be a 
> problem. Can you point out where such a thing is happening?
>
> Thanks,
>    Felix
>
>
On 2020-08-31 at 9:36 p.m., Dennis Li wrote:
>> The crash log is as below:
>>
>> [Thu Aug 20 23:18:14 2020] general protection fault:  [#1] SMP NOPTI
>> [Thu Aug 20 23:18:14 2020] CPU: 152 PID: 1837 Comm: kworker/152:1 Tainted: G   OE 5.4.0-42-generic #46~18.04.1-Ubuntu
>> [Thu Aug 20 23:18:14 2020] Hardware name: GIGABYTE G482-Z53-YF/MZ52-G40-00, BIOS R12 05/13/2020
>> [Thu Aug 20 23:18:14 2020] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
>> [Thu Aug 20 23:18:14 2020] RIP: 0010:evict_process_queues_cpsch+0xc9/0x130 [amdgpu]
>> [Thu Aug 20 23:18:14 2020] Code: 49 8d 4d 10 48 39 c8 75 21 eb 44 83 fa 03 74 36 80 78 72 00 74 0c 83 ab 68 01 00 00 01 41 c6 45 41 00 48 8b 00 48 39 c8 74 25 <80> 78 70 00 c6 40 6d 01 74 ee 8b 50 28 c6 40 70 00 83 ab 60 01 00
>> [Thu Aug 20 23:18:14 2020] RSP: 0018:b29b52f6fc90 EFLAGS: 00010213
>> [Thu Aug 20 23:18:14 2020] RAX: 1c884edb0a118914 RBX: 8a0d45ff3c00 RCX: 8a2d83e41038
>> [Thu Aug 20 23:18:14 2020] RDX:  RSI: 0082 RDI: 8a0e2e4178c0
>> [Thu Aug 20 23:18:14 2020] RBP: b29b52f6fcb0 R08: 1b64 R09: 0004
>> [Thu Aug 20 23:18:14 2020] R10: b29b52f6fb78 R11: 0001 R12: 8a0d45ff3d28
>> [Thu Aug 20 23:18:14 2020] R13: 8a2d83e41028 R14:  R15: 