RE: [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure

2021-12-16 Thread Chen, Guchun
[Public]

Reviewed-by: Guchun Chen 

Regards,
Guchun

-Original Message-
From: Shi, Leslie  
Sent: Friday, December 17, 2021 10:26 AM
To: Grodzovsky, Andrey ; Koenig, Christian 
; Pan, Xinhui ; Deucher, 
Alexander ; amd-gfx@lists.freedesktop.org
Cc: Chen, Guchun ; Shi, Leslie 
Subject: [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is 
unplugged to prevent crash in GPU initialization failure

[Why]
In amdgpu_driver_load_kms, when amdgpu_device_init returns an error during
driver modprobe, it starts the error handling path immediately and calls into
amdgpu_device_unmap_mmio to release the mapped VRAM. However, in the subsequent
release callback, the driver still accesses the unmapped memory, such as
vcn.inst[i].fw_shared_cpu_addr in vcn_v3_0_sw_fini, so a kernel crash occurs.

[How]
Call amdgpu_device_unmap_mmio() only if the device is unplugged, to prevent
access to an invalid memory address in vcn_v3_0_sw_fini() when GPU
initialization fails.

Signed-off-by: Leslie Shi 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f31caec669e7..f6a47b927cfd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3899,7 +3899,9 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
 
amdgpu_gart_dummy_page_fini(adev);
 
-   amdgpu_device_unmap_mmio(adev);
+   if (drm_dev_is_unplugged(adev_to_drm(adev)))
+   amdgpu_device_unmap_mmio(adev);
+
 }
 
 void amdgpu_device_fini_sw(struct amdgpu_device *adev)
--
2.25.1
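A simplified, self-contained model of the guard in the patch above. The struct and function names below (`device`, `fini_hw_patched`, `sw_fini_reads_ok`) are illustrative stand-ins for the driver's structures, not the actual kernel code; the point is only the control flow: on an init failure the device is *not* unplugged, so the mapping must survive until the release callbacks have run, while a real hot-unplug still unmaps eagerly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for adev / drm_device state. */
struct device {
    bool unplugged;   /* models drm_dev_is_unplugged() */
    int *fw_shared;   /* models vcn.inst[i].fw_shared_cpu_addr */
    int backing;      /* models the mapped VRAM */
};

static void map_mmio(struct device *d)   { d->fw_shared = &d->backing; }
static void unmap_mmio(struct device *d) { d->fw_shared = NULL; }

/* Before the patch, fini_hw always unmapped, so a later sw_fini
 * dereferenced unmapped memory on the init-failure path. */
static void fini_hw_patched(struct device *d)
{
    if (d->unplugged)          /* the drm_dev_is_unplugged() guard */
        unmap_mmio(d);
}

/* Models vcn_v3_0_sw_fini touching fw_shared_cpu_addr. */
static bool sw_fini_reads_ok(struct device *d)
{
    return d->fw_shared != NULL;
}
```

With the guard, the init-failure path (device still present) leaves the mapping intact for sw_fini; the unplug path still unmaps immediately.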


[PATCH] drm/amdgpu: potential dereference of null pointer

2021-12-16 Thread Jiasheng Jiang
The return value of dma_alloc_coherent() needs to be checked to avoid
dereferencing a null pointer in memcpy_toio() if the allocation fails.

Fixes: 57430471e2fa ("drm/amdgpu: Add support for USBC PD FW download")
Signed-off-by: Jiasheng Jiang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index a09483beb968..613e25bf87e5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -3034,6 +3034,10 @@ static ssize_t psp_usbc_pd_fw_sysfs_write(struct device 
*dev,
 
/* We need contiguous physical mem to place the FW  for psp to access */
	cpu_addr = dma_alloc_coherent(adev->dev, usbc_pd_fw->size, &dma_addr,
GFP_KERNEL);
+   if (!cpu_addr) {
+   ret = -ENOMEM;
+   goto rel_buf;
+   }
 
ret = dma_mapping_error(adev->dev, dma_addr);
if (ret)
-- 
2.25.1
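The pattern the patch enforces, as a standalone sketch: check the returned pointer before writing to the buffer, and route failure through the existing cleanup label. Since dma_alloc_coherent() cannot run outside the kernel, a fallible allocator (`fake_dma_alloc`, a hypothetical helper) stands in for it here, and plain memcpy() stands in for memcpy_toio().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for dma_alloc_coherent(): returns NULL when forced to fail. */
static bool force_alloc_failure;

static void *fake_dma_alloc(size_t size)
{
    return force_alloc_failure ? NULL : malloc(size);
}

/* Mirrors the error flow in psp_usbc_pd_fw_sysfs_write():
 * allocate, check, write, and exit through a single cleanup label. */
static int load_fw(const char *fw, size_t size)
{
    int ret = 0;
    void *cpu_addr = fake_dma_alloc(size);

    if (!cpu_addr) {            /* the NULL check the patch adds */
        ret = -ENOMEM;
        goto rel_buf;           /* in the driver, rel_buf releases the fw buffer */
    }

    memcpy(cpu_addr, fw, size); /* would be memcpy_toio() in the driver */
    free(cpu_addr);
rel_buf:
    return ret;
}
```

Without the check, a failed allocation would reach the memcpy with a NULL destination, which is exactly the crash the patch prevents.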



Re: Re: Re: Re: Re: [PATCH] drm/amdgpu: fixup bad vram size on gmc v8

2021-12-16 Thread 周宗敏
Dear Alex,

> Is the issue reproducible with the same board in bare metal on x86? Or does
> it only happen with passthrough on ARM?

Unfortunately, my current environment is not convenient for testing this GPU
board on an x86 platform, but I can tell you the problem still occurs on ARM
without passthrough to a virtual machine. In addition, at the end of 2020, my
colleagues also found similar problems on MIPS platforms with Radeon R7 340
graphics chips. So I think it can happen regardless of whether the platform is
x86, ARM, or MIPS.

I hope the above information is helpful to you, and I also think it will be
better for users if this issue can be root caused.

Best regards.

    
        Subject: Re: Re: Re: Re: [PATCH] drm/amdgpu: fixup bad vram size on gmc v8
        Date: 2021-12-16 23:28
        From: Alex Deucher
        To: 周宗敏

Is the issue reproducible with the same board in bare metal on x86? Or does it
only happen with passthrough on ARM? Looking through the archives, the SI patch
I made was for an x86 laptop. It would be nice to root cause this, but there
weren't any gfx8 boards with more than 64G of vram, so I think it's safe. That
said, if you see similar issues with newer gfx IPs then we have an issue, since
the upper bit will be meaningful, so it would be nice to root cause this.

Alex

On Thu, Dec 16, 2021 at 4:36 AM 周宗敏 wrote:

Hi Christian,

I'm testing the GPU passthrough feature, so I pass this GPU through to a
virtual machine to use. It is based on an arm64 system.

As far as I know, Alex dealt with a similar problem in dri/radeon/si.c. Maybe
it has the same cause? The historical commit is here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0ca223b029a261e82fb2f50c52eb85d510f4260e

Thanks very much.

        Subject: Re: Re: Re: [PATCH] drm/amdgpu: fixup bad vram size on gmc v8
        Date: 2021-12-16 16:15
        From: Christian König
        To: 周宗敏, Alex Deucher
            
        
        Hi Zongmin,
    
    that strongly sounds like the ASIC is not correctly initialized when
    trying to read the register.
    
    What board and environment are you using this GPU with? Is that a
    normal x86 system?
    
    Regards,
    Christian.
    
    Am 16.12.21 um 04:11 schrieb 周宗敏:
    The problematic board that I have tested is an [AMD/ATI] Lexa PRO
    [Radeon RX 550/550X]; the vbios version is 113-RXF9310-C09-BT.

    When an exception occurs, I can see the following changes in the vram
    size value read from RREG32(mmCONFIG_MEMSIZE); it seems to have garbage
    in the upper 16 bits.

    I can also see some dmesg output like the one below. When the vram size
    register has garbage, we may see an error message like:

    amdgpu :09:00.0: VRAM: 4286582784M 0x00F4 - 0x000FF8F4 (4286582784M used)

    The correct message should look like:

    amdgpu :09:00.0: VRAM: 4096M 0x00F4 - 0x00F4 (4096M used)

    If you have any problems, please send me mail. Thanks very much.
        Subject: Re: [PATCH] drm/amdgpu: fixup bad vram size on gmc v8
        Date: 2021-12-16 04:23
        From: Alex Deucher
        To: Zongmin Zhou

On Wed, Dec 15, 2021 at 10:31 AM Zongmin Zhou wrote:
          >
          > Some boards(like RX550) seem to have garbage in the upper
          > 16 bits of the vram size register.  Check for
          > this and clamp the size properly.  Fixes
          > boards reporting bogus amounts of vram.
          >
          > after adding this patch, the maximum GPU VRAM size is 64GB;
          > otherwise only 64GB of vram will be used.
          
          Can you provide some examples of problematic boards and
          possibly a
          vbios image from the problematic board?  What values are you
          seeing?
          It would be nice to see what the boards are reporting and
          whether the
          lower 16 bits are actually correct or if it is some other
          issue.  This
          register is undefined until the asic has been initialized.
           The vbios
          programs it as part of its asic init sequence (either via
          vesa/gop or
          the OS driver).
          
          Alex
          
          
          >
          > Signed-off-by: Zongmin Zhou
            > ---
            >  drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 13 ++++++++++---
            >  1 file changed, 10 insertions(+), 3 deletions(-)
            >
            > diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
            b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
            > index 492ebed2915b..63b890f1e8af 100644
            > --- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
            > +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
            > @@ -515,10 +515,10 @@ static void
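The hunk above is cut off in the archive. Based on the commit message and Zongmin's report (garbage in the upper 16 bits of mmCONFIG_MEMSIZE, which reports the vram size in MB on gfx8), the clamping logic presumably looks something like the standalone sketch below; `clamp_vram_mb` is a hypothetical name for illustration, not the driver's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* mmCONFIG_MEMSIZE reports the vram size in MB; some boards
 * (e.g. RX 550) leave garbage in the upper 16 bits. Mask the
 * garbage off when the lower half alone carries a size. */
static uint32_t clamp_vram_mb(uint32_t reg)
{
    if ((reg & 0xffff0000u) && (reg & 0x0000ffffu))
        return reg & 0x0000ffffu;  /* keep the meaningful lower half */
    return reg;
}
```

A 4 GB board reading a corrupted value such as 0xABCD1000 would be clamped back to 0x1000 (4096 MB), matching the "correct message" in the report, while a clean reading like 4096 passes through unchanged. Masking to 16 bits is also why the commit message notes a 64 GB ceiling for gfx8 boards.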

[PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure

2021-12-16 Thread Leslie Shi
[Why]
In amdgpu_driver_load_kms, when amdgpu_device_init returns an error during
driver modprobe, it starts the error handling path immediately and calls into
amdgpu_device_unmap_mmio to release the mapped VRAM. However, in the subsequent
release callback, the driver still accesses the unmapped memory, such as
vcn.inst[i].fw_shared_cpu_addr in vcn_v3_0_sw_fini, so a kernel crash occurs.

[How]
Call amdgpu_device_unmap_mmio() only if the device is unplugged, to prevent
access to an invalid memory address in vcn_v3_0_sw_fini() when GPU
initialization fails.

Signed-off-by: Leslie Shi 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f31caec669e7..f6a47b927cfd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3899,7 +3899,9 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
 
amdgpu_gart_dummy_page_fini(adev);
 
-   amdgpu_device_unmap_mmio(adev);
+   if (drm_dev_is_unplugged(adev_to_drm(adev)))
+   amdgpu_device_unmap_mmio(adev);
+
 }
 
 void amdgpu_device_fini_sw(struct amdgpu_device *adev)
-- 
2.25.1



RE: [PATCH 3/4] drm/amdkfd: add reset queue function for RAS poison

2021-12-16 Thread Zhou1, Tao
[AMD Official Use Only]

OK, I'll rename it before submit.

Regards,
Tao

> -Original Message-
> From: Zhang, Hawking 
> Sent: Thursday, December 16, 2021 8:22 PM
> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org; Yang,
> Stanley ; Chai, Thomas ;
> Kuehling, Felix 
> Subject: RE: [PATCH 3/4] drm/amdkfd: add reset queue function for RAS poison
> 
> +
> + int (*unmap_queues_cpsch_poison)(struct device_queue_manager
> *dqm,
> + uint16_t pasid);
>  };
> 
> Might be better call it reset_queue directly (match with update_queue,
> create_queue, .etc.,)
> 
> Others look good to me
> 
> The series (4 patches) is
> 
> Reviewed-by: Hawking Zhang 
> 
> Regards,
> Hawking
> -Original Message-
> From: Zhou1, Tao 
> Sent: Thursday, December 16, 2021 19:36
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> ; Yang, Stanley ; Chai,
> Thomas ; Kuehling, Felix 
> Cc: Zhou1, Tao 
> Subject: [PATCH 3/4] drm/amdkfd: add reset queue function for RAS poison
> 
> The new interface unmaps queues with reset mode for a process that consumes
> RAS poison; it is only for compute queues.
> 
> Signed-off-by: Tao Zhou 
> ---
>  .../drm/amd/amdkfd/kfd_device_queue_manager.c| 16 ++++++++++++++++
>  .../drm/amd/amdkfd/kfd_device_queue_manager.h|  5 +++++
>  2 files changed, 21 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> index 01a2cc3928ac..b4b0495c7024 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -1476,6 +1476,21 @@ static int unmap_queues_cpsch(struct
> device_queue_manager *dqm,
>   return retval;
>  }
> 
> +/* only for compute queue */
> +static int unmap_queues_cpsch_poison(struct device_queue_manager *dqm,
> + uint16_t pasid)
> +{
> + int retval;
> +
> + dqm_lock(dqm);
> +
> + retval = unmap_queues_cpsch(dqm,
> KFD_UNMAP_QUEUES_FILTER_BY_PASID,
> + pasid, true);
> +
> + dqm_unlock(dqm);
> + return retval;
> +}
> +
>  /* dqm->lock mutex has to be locked before calling this function */
> static int execute_queues_cpsch(struct device_queue_manager *dqm,
>   enum kfd_unmap_queues_filter filter,
> @@ -1896,6 +1911,7 @@ struct device_queue_manager
> *device_queue_manager_init(struct kfd_dev *dev)
>   dqm->ops.evict_process_queues = evict_process_queues_cpsch;
>   dqm->ops.restore_process_queues =
> restore_process_queues_cpsch;
>   dqm->ops.get_wave_state = get_wave_state;
> + dqm->ops.unmap_queues_cpsch_poison =
> unmap_queues_cpsch_poison;
>   break;
>   case KFD_SCHED_POLICY_NO_HWS:
>   /* initialize dqm for no cp scheduling */ diff --git
> a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
> index 499fc0ea387f..19ec3e8859e8 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
> @@ -81,6 +81,8 @@ struct device_process_node {
>   *
>   * @get_wave_state: Retrieves context save state and optionally copies the
>   * control stack, if kept in the MQD, to the given userspace address.
> + *
> + * @unmap_queues_cpsch_poison: reset queue which consumes RAS poison
>   */
> 
>  struct device_queue_manager_ops {
> @@ -134,6 +136,9 @@ struct device_queue_manager_ops {
> void __user *ctl_stack,
> u32 *ctl_stack_used_size,
> u32 *save_area_used_size);
> +
> + int (*unmap_queues_cpsch_poison)(struct device_queue_manager
> *dqm,
> + uint16_t pasid);
>  };
> 
>  struct device_queue_manager_asic_ops {
> --
> 2.17.1


Re: [PATCH] drm/amdgpu: When the VCN(1.0) block is suspended, powergating is explicitly enabled

2021-12-16 Thread Alex Deucher
On Thu, Dec 16, 2021 at 8:43 PM Quan, Evan  wrote:
>
> [AMD Official Use Only]
>
> Hi Alex,
>
> Per our checking, vcn_v2 and vcn_v3 already have the dpm disablement (below)
> in their ->suspend routine, which should protect them from this issue.
> if (adev->pm.dpm_enabled)
> amdgpu_dpm_enable_uvd(adev, false);
> So, maybe it's a different story with those newer APUs. Can you forward the 
> bug reports to me?

https://gitlab.freedesktop.org/drm/amd/-/issues/1712#note_1192187

Alex

>
> Thanks,
> Evan
> > -Original Message-
> > From: Alex Deucher 
> > Sent: Thursday, December 16, 2021 11:38 PM
> > To: Quan, Evan 
> > Cc: Zhu, James ; Gong, Curry
> > ; amd-gfx@lists.freedesktop.org; Deucher,
> > Alexander ; Liu, Leo 
> > Subject: Re: [PATCH] drm/amdgpu: When the VCN(1.0) block is suspended,
> > powergating is explicitly enabled
> >
> > FWIW, it looks like all versions of VCN need the same fix.  There have been
> > reports of suspend failing when VCN is in use on other newer APUs as well.
> >
> > Alex
> >
> > On Tue, Dec 14, 2021 at 12:59 AM Quan, Evan  wrote:
> > >
> > > [AMD Official Use Only]
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > From: Zhu, James 
> > > Sent: Monday, December 13, 2021 9:39 PM
> > > To: Gong, Curry ; Zhu, James
> > ;
> > > amd-gfx@lists.freedesktop.org
> > > Cc: Liu, Leo ; Quan, Evan ;
> > > Deucher, Alexander 
> > > Subject: Re: [PATCH] drm/amdgpu: When the VCN(1.0) block is suspended,
> > > powergating is explicitly enabled
> > >
> > >
> > >
> > > Hi Curry, Evan,
> > >
> > > It seems the vcn 1.0 power gating sequence is special.
> > >
> > > if the original solution is thread safe, then the following solution will 
> > > be
> > thread safe also.
> > >
> > > static int vcn_v1_0_hw_fini(void *handle) {
> > > struct amdgpu_device *adev = (struct amdgpu_device *)handle;
> > >
> > > cancel_delayed_work_sync(&adev->vcn.idle_work);
> > >
> > > if ((adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG) ||
> > > (adev->vcn.cur_state != AMD_PG_STATE_GATE &&
> > >  RREG32_SOC15(VCN, 0, mmUVD_STATUS))) {
> > > +if (adev->pm.dpm_enabled)
> > > +amdgpu_dpm_enable_uvd(adev, false);
> > > +else
> > > +vcn_v1_0_set_powergating_state(adev, AMD_PG_STATE_GATE);
> > >
> > > [Quan, Evan] Considering adev->pm.dpm_enabled is always true,
> > “vcn_v1_0_set_powergating_state(adev, AMD_PG_STATE_GATE); “ will
> > become dead code.
> > >
> > > Also, vcn_v1_0_hw_fini is also used in other scenarios (besides the
> > suspend/resume discussed here). So it seems quite risky compared with
> > Curry’s original patch.
> > >
> > > Before we can come up with a better idea, I would suggest to land Curry’s
> > original patch as a quick fix.
> > >
> > >
> > >
> > > BR
> > >
> > > Evan
> > >
> > >
> > > }
> > >
> > > Best Regards!
> > >
> > > James
> > >
> > > On 2021-12-13 3:55 a.m., Gong, Curry wrote:
> > >
> > > [AMD Official Use Only]
> > >
> > >
> > >
> > > Hi James:
> > >
> > >
> > >
> > > With the following patch, an error will be reported when the driver is
> > > loaded
> > >
> > > +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
> > >
> > > @@ -1202,6 +1204,9 @@ static int vcn_v1_0_stop(struct amdgpu_device
> > > *adev)
> > >
> > > else
> > >
> > > r = vcn_v1_0_stop_spg_mode(adev);
> > >
> > >
> > >
> > > c
> > >
> > > return r;
> > >
> > > }
> > >
> > >
> > >
> > >
> > >
> > > $ dmesg
> > >
> > > [  363.181081] INFO: task kworker/3:2:223 blocked for more than 120
> > seconds.
> > >
> > > [  363.181150]   Tainted: G   OE 5.11.0-41-generic 
> > > #45~20.04.1-
> > Ubuntu
> > >
> > > [  363.181208] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > >
> > > [  363.181266] task:kworker/3:2 state:D stack:0 pid:  223 ppid:   
> > >   2
> > flags:0x4000
> > >
> > > [  363.181276] Workqueue: events vcn_v1_0_idle_work_handler [amdgpu]
> > >
> > > [  363.181612] Call Trace:
> > >
> > > [  363.181618]  __schedule+0x44c/0x8a0
> > >
> > > [  363.181627]  schedule+0x4f/0xc0
> > >
> > > [  363.181631]  schedule_preempt_disabled+0xe/0x10
> > >
> > > [  363.181636]  __mutex_lock.isra.0+0x183/0x4d0
> > >
> > > [  363.181643]  __mutex_lock_slowpath+0x13/0x20
> > >
> > > [  363.181648]  mutex_lock+0x32/0x40
> > >
> > > [  363.181652]  amdgpu_dpm_set_powergating_by_smu+0x9c/0x180
> > [amdgpu]
> > >
> > > [  363.182055]  amdgpu_dpm_enable_uvd+0x38/0x110 [amdgpu]
> > >
> > > [  363.182454]  vcn_v1_0_set_powergating_state+0x2e7e/0x3cf0 [amdgpu]
> > >
> > > [  363.182776]  amdgpu_device_ip_set_powergating_state+0x6c/0xc0
> > > [amdgpu]
> > >
> > > [  363.183028]  smu10_powergate_vcn+0x2a/0x80 [amdgpu]
> > >
> > > [  363.183361]  pp_set_powergating_by_smu+0xc5/0x2b0 [amdgpu]
> > >
> > > [  363.183699]  amdgpu_dpm_set_powergating_by_smu+0xb6/0x180
> > [amdgpu]
> > >
> > > [  363.184040]  amdgpu_dpm_enable_uvd+0x38/0x110 [amdgpu]
> > >
> > > [  363.184391]  vcn_v1_0_idle_work_handler+0xe1/0x130 [amdgpu]

RE: [PATCH] drm/amdgpu: When the VCN(1.0) block is suspended, powergating is explicitly enabled

2021-12-16 Thread Quan, Evan
[AMD Official Use Only]

Hi Alex,

Per our checking, vcn_v2 and vcn_v3 already have the dpm disablement (below) in
their ->suspend routine, which should protect them from this issue.
if (adev->pm.dpm_enabled)
amdgpu_dpm_enable_uvd(adev, false);
So, maybe it's a different story with those newer APUs. Can you forward the bug 
reports to me?

Thanks,
Evan
> -Original Message-
> From: Alex Deucher 
> Sent: Thursday, December 16, 2021 11:38 PM
> To: Quan, Evan 
> Cc: Zhu, James ; Gong, Curry
> ; amd-gfx@lists.freedesktop.org; Deucher,
> Alexander ; Liu, Leo 
> Subject: Re: [PATCH] drm/amdgpu: When the VCN(1.0) block is suspended,
> powergating is explicitly enabled
> 
> FWIW, it looks like all versions of VCN need the same fix.  There have been
> reports of suspend failing when VCN is in use on other newer APUs as well.
> 
> Alex
> 
> On Tue, Dec 14, 2021 at 12:59 AM Quan, Evan  wrote:
> >
> > [AMD Official Use Only]
> >
> >
> >
> >
> >
> >
> >
> > From: Zhu, James 
> > Sent: Monday, December 13, 2021 9:39 PM
> > To: Gong, Curry ; Zhu, James
> ;
> > amd-gfx@lists.freedesktop.org
> > Cc: Liu, Leo ; Quan, Evan ;
> > Deucher, Alexander 
> > Subject: Re: [PATCH] drm/amdgpu: When the VCN(1.0) block is suspended,
> > powergating is explicitly enabled
> >
> >
> >
> > Hi Curry, Evan,
> >
> > It seems the vcn 1.0 power gating sequence is special.
> >
> > if the original solution is thread safe, then the following solution will be
> thread safe also.
> >
> > static int vcn_v1_0_hw_fini(void *handle) {
> > struct amdgpu_device *adev = (struct amdgpu_device *)handle;
> >
> > cancel_delayed_work_sync(&adev->vcn.idle_work);
> >
> > if ((adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG) ||
> > (adev->vcn.cur_state != AMD_PG_STATE_GATE &&
> >  RREG32_SOC15(VCN, 0, mmUVD_STATUS))) {
> > +if (adev->pm.dpm_enabled)
> > +amdgpu_dpm_enable_uvd(adev, false);
> > +else
> > +vcn_v1_0_set_powergating_state(adev, AMD_PG_STATE_GATE);
> >
> > [Quan, Evan] Considering adev->pm.dpm_enabled is always true,
> “vcn_v1_0_set_powergating_state(adev, AMD_PG_STATE_GATE); “ will
> become dead code.
> >
> > Also, vcn_v1_0_hw_fini is also used in other scenarios (besides the
> suspend/resume discussed here). So it seems quite risky compared with
> Curry’s original patch.
> >
> > Before we can come up with a better idea, I would suggest to land Curry’s
> original patch as a quick fix.
> >
> >
> >
> > BR
> >
> > Evan
> >
> >
> > }
> >
> > Best Regards!
> >
> > James
> >
> > On 2021-12-13 3:55 a.m., Gong, Curry wrote:
> >
> > [AMD Official Use Only]
> >
> >
> >
> > Hi James:
> >
> >
> >
> > With the following patch, an error will be reported when the driver is
> > loaded
> >
> > +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
> >
> > @@ -1202,6 +1204,9 @@ static int vcn_v1_0_stop(struct amdgpu_device
> > *adev)
> >
> > else
> >
> > r = vcn_v1_0_stop_spg_mode(adev);
> >
> >
> >
> > c
> >
> > return r;
> >
> > }
> >
> >
> >
> >
> >
> > $ dmesg
> >
> > [  363.181081] INFO: task kworker/3:2:223 blocked for more than 120
> seconds.
> >
> > [  363.181150]   Tainted: G   OE 5.11.0-41-generic 
> > #45~20.04.1-
> Ubuntu
> >
> > [  363.181208] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> >
> > [  363.181266] task:kworker/3:2 state:D stack:0 pid:  223 ppid: 
> > 2
> flags:0x4000
> >
> > [  363.181276] Workqueue: events vcn_v1_0_idle_work_handler [amdgpu]
> >
> > [  363.181612] Call Trace:
> >
> > [  363.181618]  __schedule+0x44c/0x8a0
> >
> > [  363.181627]  schedule+0x4f/0xc0
> >
> > [  363.181631]  schedule_preempt_disabled+0xe/0x10
> >
> > [  363.181636]  __mutex_lock.isra.0+0x183/0x4d0
> >
> > [  363.181643]  __mutex_lock_slowpath+0x13/0x20
> >
> > [  363.181648]  mutex_lock+0x32/0x40
> >
> > [  363.181652]  amdgpu_dpm_set_powergating_by_smu+0x9c/0x180
> [amdgpu]
> >
> > [  363.182055]  amdgpu_dpm_enable_uvd+0x38/0x110 [amdgpu]
> >
> > [  363.182454]  vcn_v1_0_set_powergating_state+0x2e7e/0x3cf0 [amdgpu]
> >
> > [  363.182776]  amdgpu_device_ip_set_powergating_state+0x6c/0xc0
> > [amdgpu]
> >
> > [  363.183028]  smu10_powergate_vcn+0x2a/0x80 [amdgpu]
> >
> > [  363.183361]  pp_set_powergating_by_smu+0xc5/0x2b0 [amdgpu]
> >
> > [  363.183699]  amdgpu_dpm_set_powergating_by_smu+0xb6/0x180
> [amdgpu]
> >
> > [  363.184040]  amdgpu_dpm_enable_uvd+0x38/0x110 [amdgpu]
> >
> > [  363.184391]  vcn_v1_0_idle_work_handler+0xe1/0x130 [amdgpu]
> >
> > [  363.184667]  process_one_work+0x220/0x3c0
> >
> > [  363.184674]  worker_thread+0x4d/0x3f0
> >
> > [  363.184677]  ? process_one_work+0x3c0/0x3c0
> >
> > [  363.184680]  kthread+0x12b/0x150
> >
> > [  363.184685]  ? set_kthread_struct+0x40/0x40
> >
> > [  363.184690]  ret_from_fork+0x22/0x30
> >
> > [  363.184699] INFO: task kworker/2:2:233 blocked for more than 120
> seconds.
> >
> > [  363.184739]   Tainted: G   OE 
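The traces above show amdgpu_dpm_set_powergating_by_smu appearing twice on the same call stack: the idle-work handler re-enters a lock it already holds, and a non-recursive mutex then blocks forever, which is why the kworker shows up as hung. A minimal userspace illustration of the same failure mode follows; pthread_mutex_trylock is used for the second acquisition so the example terminates with an error code instead of hanging the way pthread_mutex_lock would.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* A default pthread mutex, like the kernel mutex in the trace, is not
 * recursive: a second acquisition by the owning thread cannot succeed. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static int reenter(void)
{
    int rc;

    pthread_mutex_lock(&lock);         /* first acquisition: fine */
    rc = pthread_mutex_trylock(&lock); /* re-entry: fails with EBUSY */
    pthread_mutex_unlock(&lock);       /* release the first acquisition */
    return rc;
}
```

Had the second call used pthread_mutex_lock, the thread would deadlock against itself — the userspace analogue of the blocked kworker in the dmesg output, and the reason the powergating call chain must not call back into amdgpu_dpm_enable_uvd.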

RE: [PATCH v3 4/5] drm/amdgpu: get xgmi info before ip_init

2021-12-16 Thread Skvortsov, Victor
[AMD Official Use Only]

I will fix the indent issue. There shouldn't be any conflicts between this 
patch and the xgmi init changes. We've run with those xgmi init patches applied 
on top of mainline-dkms-5.13 (where this change is already present) and found 
no issues.

Thanks,
Victor

-Original Message-
From: Alex Deucher  
Sent: Thursday, December 16, 2021 3:18 PM
To: Skvortsov, Victor 
Cc: amd-gfx list ; Deng, Emily 
; Liu, Monk ; Ming, Davis 
; Liu, Shaoyun ; Zhou, Peng Ju 
; Chen, JingWen ; Chen, Horace 
; Nieto, David M 
Subject: Re: [PATCH v3 4/5] drm/amdgpu: get xgmi info before ip_init

[CAUTION: External Email]

On Thu, Dec 16, 2021 at 2:43 PM Victor Skvortsov  
wrote:
>
> Driver needs to call get_xgmi_info() before ip_init to determine 
> whether it needs to handle a pending hive reset.
>
> Signed-off-by: Victor Skvortsov 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++
>  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 6 --
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 6 --
>  3 files changed, 7 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 5bd785cfc5ca..4fd370016834 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3576,6 +3576,13 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> if (r)
> return r;
>
> +   /* Need to get xgmi info early to decide the reset behavior*/
> +   if (adev->gmc.xgmi.supported) {
> +   r = adev->gfxhub.funcs->get_xgmi_info(adev);
> +   if (r)

Indentation looks off here, please fix that.  Also, will this conflict with 
Shaoyun's xgmi init patch which may or may not have landed?
Other than that, these patches look pretty good to me.

Alex


> +   return r;
> +   }
> +
> /* enable PCIE atomic ops */
> if (amdgpu_sriov_vf(adev))
> adev->have_atomics_support = ((struct 
> amd_sriov_msg_pf2vf_info *) diff --git 
> a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> index ae46eb35b3d7..3d5d47a799e3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> @@ -914,12 +914,6 @@ static int gmc_v10_0_sw_init(void *handle)
> return r;
> }
>
> -   if (adev->gmc.xgmi.supported) {
> -   r = adev->gfxhub.funcs->get_xgmi_info(adev);
> -   if (r)
> -   return r;
> -   }
> -
> r = gmc_v10_0_mc_init(adev);
> if (r)
> return r;
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 2b86c63b032a..57f2729a7bd0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -1628,12 +1628,6 @@ static int gmc_v9_0_sw_init(void *handle)
> }
> adev->need_swiotlb = drm_need_swiotlb(44);
>
> -   if (adev->gmc.xgmi.supported) {
> -   r = adev->gfxhub.funcs->get_xgmi_info(adev);
> -   if (r)
> -   return r;
> -   }
> -
> r = gmc_v9_0_mc_init(adev);
> if (r)
> return r;
> --
> 2.25.1
>


[pull] amdgpu, amdkfd, radeon drm-next-5.17

2021-12-16 Thread Alex Deucher
Hi Dave, Daniel,

More updates for 5.17.

The following changes since commit 3c021931023a30316db415044531b116b85e6ebd:

  drm/amdgpu: replace drm_detect_hdmi_monitor() with drm_display_info.is_hdmi 
(2021-12-07 13:13:07 -0500)

are available in the Git repository at:

  https://gitlab.freedesktop.org/agd5f/linux.git 
tags/amd-drm-next-5.17-2021-12-16

for you to fetch changes up to a342655865b2f14d1fbf346356d3b3360e63e872:

  drm/radeon: Fix syntax errors in comments (2021-12-14 16:11:02 -0500)


amdgpu:
- Add some display debugfs entries
- RAS fixes
- SR-IOV fixes
- W=1 fixes
- Documentation fixes
- IH timestamp fix
- Misc power fixes
- IP discovery fixes
- Large driver documentation updates
- Multi-GPU memory use reductions
- Misc display fixes and cleanups
- Add new SMU debug option

amdkfd:
- SVM fixes

radeon:
- Fix typo in comment


Anthony Koo (1):
  drm/amd/display: [FW Promotion] Release 0.0.97

Aric Cyr (1):
  drm/amd/display: 3.2.166

Aurabindo Pillai (1):
  drm/amd/display: Add feature flags to disable LTTPR

Eric Bernstein (1):
  drm/amd/display: ODM + MPO window on only one half of ODM

Evan Quan (2):
  drm/amdgpu: move smu_debug_mask to a more proper place
  drm/amdgpu: correct the wrong cached state for GMC on PICASSO

Fangzhi Zuo (1):
  drm/amd/display: Add Debugfs Entry to Force in SST Sequence

Felix Kuehling (4):
  drm/amdkfd: Fix error handling in svm_range_add
  drm/amdkfd: Fix svm_range_is_same_attrs
  drm/amdkfd: Don't split unchanged SVM ranges
  drm/amdkfd: Make KFD support on Hawaii experimental

Graham Sider (1):
  drm/amdkfd: add Navi2x to GWS init conditions

Guchun Chen (1):
  drm/amdgpu: use adev_to_drm to get drm_device pointer

Hawking Zhang (6):
  drm/amdgpu: add helper to load ip_discovery binary from file
  drm/amdgpu: rename discovery_read_binary helper
  drm/amdgpu: add helper to verify ip discovery binary signature
  drm/amdgpu: read and authenticate ip discovery binary
  drm/amdgpu: don't override default ECO_BITs setting
  drm/amdgpu: check df_funcs and its callback pointers

Isabella Basso (10):
  drm/amd: Mark IP_BASE definition as __maybe_unused
  drm/amd: fix improper docstring syntax
  drm/amdgpu: fix function scopes
  drm/amdkfd: fix function scopes
  drm/amd: append missing includes
  drm/amdgpu: fix location of prototype for amdgpu_kms_compat_ioctl
  drm/amdgpu: fix amdgpu_ras_mca_query_error_status scope
  drm/amdgpu: remove unnecessary variables
  drm/amdgpu: re-format file header comments
  drm/amd/display: fix function scopes

Jingwen Chen (2):
  drm/amd/amdgpu: fix psp tmr bo pin count leak in SRIOV
  drm/amd/amdgpu: fix gmc bo pin count leak in SRIOV

Jonathan Kim (1):
  drm/amdgpu: disable default navi2x co-op kernel support

Lang Yu (5):
  drm/amdgpu: remove power on/off SDMA in SMU hw_init/fini()
  drm/amdgpu: only hw fini SMU first for ASICs that need that
  drm/amdgpu: introduce a kind of halt state for amdgpu device
  drm/amdgpu: add support for SMU debug option
  drm/amd/pm: fix a potential gpu_metrics_table memory leak

Le Ma (1):
  drm/amdgpu: correct register access for RLC_JUMP_TABLE_RESTORE

Leslie Shi (2):
  drm/amdgpu: add modifiers in amdgpu_vkms_plane_init()
  drm/amdgpu: fix incorrect VCN revision in SRIOV

Lijo Lazar (1):
  drm/amd/pm: Skip power state allocation

Mario Limonciello (4):
  drm/amd: add some extra checks that is_dig_enabled is defined
  drm/amd: move variable to local scope
  drm/amd/pm: fix reading SMU FW version from amdgpu_firmware_info on YC
  drivers/amd/pm: drop statement to print FW version for smu_v13

Martin Leung (1):
  drm/amd/display: implement dc_mode_memclk

Michael Strauss (1):
  drm/amd/display: Force det buf size to 192KB with 3+ streams and upscaling

Michel Dänzer (2):
  drm/amd/display: Reduce stack size for 
dml31_ModeSupportAndSystemConfigurationFull
  drm/amd/display: Reduce stack size for dml31 UseMinimumDCFCLK

Mikita Lipski (1):
  drm/amd/display: parse and check PSR SU caps

Nicholas Kazlauskas (2):
  drm/amd/display: Set exit_optimized_pwr_state for DCN31
  drm/amd/display: Reset DMCUB before HW init

Philip Yang (3):
  drm/amdgpu: Handle fault with same timestamp
  drm/amdgpu: Detect if amdgpu in IOMMU direct map mode
  drm/amdgpu: Reduce SG bo memory usage for mGPUs

Rodrigo Siqueira (6):
  Documentation/gpu: Reorganize DC documentation
  Documentation/gpu: Document amdgpu_dm_visual_confirm debugfs entry
  Documentation/gpu: Document pipe split visual confirmation
  Documentation/gpu: How to collect DTN log
  Documentation/gpu: Add basic overview of DC pipeline
  Documentation/gpu: Add amdgpu and dc glossary

Solomon Chiu (1):
  

Re: [PATCH v3 4/5] drm/amdgpu: get xgmi info before ip_init

2021-12-16 Thread Alex Deucher
On Thu, Dec 16, 2021 at 2:43 PM Victor Skvortsov
 wrote:
>
> Driver needs to call get_xgmi_info() before ip_init
> to determine whether it needs to handle a pending hive reset.
>
> Signed-off-by: Victor Skvortsov 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++
>  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 6 --
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 6 --
>  3 files changed, 7 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 5bd785cfc5ca..4fd370016834 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3576,6 +3576,13 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> if (r)
> return r;
>
> +   /* Need to get xgmi info early to decide the reset behavior*/
> +   if (adev->gmc.xgmi.supported) {
> +   r = adev->gfxhub.funcs->get_xgmi_info(adev);
> +   if (r)

Indentation looks off here, please fix that.  Also, will this conflict
with Shaoyun's xgmi init patch which may or may not have landed?
Other than that, these patches look pretty good to me.

Alex


> +   return r;
> +   }
> +
> /* enable PCIE atomic ops */
> if (amdgpu_sriov_vf(adev))
> adev->have_atomics_support = ((struct 
> amd_sriov_msg_pf2vf_info *)
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> index ae46eb35b3d7..3d5d47a799e3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> @@ -914,12 +914,6 @@ static int gmc_v10_0_sw_init(void *handle)
> return r;
> }
>
> -   if (adev->gmc.xgmi.supported) {
> -   r = adev->gfxhub.funcs->get_xgmi_info(adev);
> -   if (r)
> -   return r;
> -   }
> -
> r = gmc_v10_0_mc_init(adev);
> if (r)
> return r;
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 2b86c63b032a..57f2729a7bd0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -1628,12 +1628,6 @@ static int gmc_v9_0_sw_init(void *handle)
> }
> adev->need_swiotlb = drm_need_swiotlb(44);
>
> -   if (adev->gmc.xgmi.supported) {
> -   r = adev->gfxhub.funcs->get_xgmi_info(adev);
> -   if (r)
> -   return r;
> -   }
> -
> r = gmc_v9_0_mc_init(adev);
> if (r)
> return r;
> --
> 2.25.1
>


RE: [PATCH v3 4/5] drm/amdgpu: get xgmi info before ip_init

2021-12-16 Thread Liu, Shaoyun
[AMD Official Use Only]

Reviewed-by: shaoyun.liu 

-Original Message-
From: Skvortsov, Victor  
Sent: Thursday, December 16, 2021 2:43 PM
To: amd-gfx@lists.freedesktop.org; Deng, Emily ; Liu, Monk 
; Ming, Davis ; Liu, Shaoyun 
; Zhou, Peng Ju ; Chen, JingWen 
; Chen, Horace ; Nieto, David M 

Cc: Skvortsov, Victor 
Subject: [PATCH v3 4/5] drm/amdgpu: get xgmi info before ip_init

Driver needs to call get_xgmi_info() before ip_init to determine whether it 
needs to handle a pending hive reset.

Signed-off-by: Victor Skvortsov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 6 --
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 6 --
 3 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5bd785cfc5ca..4fd370016834 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3576,6 +3576,13 @@ int amdgpu_device_init(struct amdgpu_device *adev,
if (r)
return r;
 
+   /* Need to get xgmi info early to decide the reset behavior*/
+   if (adev->gmc.xgmi.supported) {
+   r = adev->gfxhub.funcs->get_xgmi_info(adev);
+   if (r)
+   return r;
+   }
+
/* enable PCIE atomic ops */
if (amdgpu_sriov_vf(adev))
adev->have_atomics_support = ((struct amd_sriov_msg_pf2vf_info 
*) diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index ae46eb35b3d7..3d5d47a799e3 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -914,12 +914,6 @@ static int gmc_v10_0_sw_init(void *handle)
return r;
}
 
-   if (adev->gmc.xgmi.supported) {
-   r = adev->gfxhub.funcs->get_xgmi_info(adev);
-   if (r)
-   return r;
-   }
-
r = gmc_v10_0_mc_init(adev);
if (r)
return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 2b86c63b032a..57f2729a7bd0 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -1628,12 +1628,6 @@ static int gmc_v9_0_sw_init(void *handle)
}
adev->need_swiotlb = drm_need_swiotlb(44);
 
-   if (adev->gmc.xgmi.supported) {
-   r = adev->gfxhub.funcs->get_xgmi_info(adev);
-   if (r)
-   return r;
-   }
-
r = gmc_v9_0_mc_init(adev);
if (r)
return r;
--
2.25.1


Re: [PATCH v3 5/5] drm/amdgpu: Modify indirect register access for gfx9 sriov

2021-12-16 Thread Nieto, David M
[AMD Official Use Only]

Reviewed-by: David Nieto 

From: Skvortsov, Victor 
Sent: Thursday, December 16, 2021 11:42 AM
To: amd-gfx@lists.freedesktop.org ; Deng, Emily 
; Liu, Monk ; Ming, Davis 
; Liu, Shaoyun ; Zhou, Peng Ju 
; Chen, JingWen ; Chen, Horace 
; Nieto, David M 
Cc: Skvortsov, Victor 
Subject: [PATCH v3 5/5] drm/amdgpu: Modify indirect register access for gfx9 
sriov

Expand RLCG interface for new GC read & write commands.
New interface will only be used if the PF enables the flag in pf2vf msg.

v2: Added a description for the scratch registers

Signed-off-by: Victor Skvortsov 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 117 --
 1 file changed, 89 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index edb3e3b08eed..9189fb85a4dd 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -63,6 +63,13 @@
 #define mmGCEA_PROBE_MAP0x070c
 #define mmGCEA_PROBE_MAP_BASE_IDX   0

+#define GFX9_RLCG_GC_WRITE_OLD (0x8 << 28)
+#define GFX9_RLCG_GC_WRITE (0x0 << 28)
+#define GFX9_RLCG_GC_READ  (0x1 << 28)
+#define GFX9_RLCG_VFGATE_DISABLED  0x400
+#define GFX9_RLCG_WRONG_OPERATION_TYPE 0x200
+#define GFX9_RLCG_NOT_IN_RANGE 0x100
+
 MODULE_FIRMWARE("amdgpu/vega10_ce.bin");
 MODULE_FIRMWARE("amdgpu/vega10_pfp.bin");
 MODULE_FIRMWARE("amdgpu/vega10_me.bin");
@@ -739,7 +746,7 @@ static const u32 GFX_RLC_SRM_INDEX_CNTL_DATA_OFFSETS[] =
 mmRLC_SRM_INDEX_CNTL_DATA_7 - mmRLC_SRM_INDEX_CNTL_DATA_0,
 };

-static void gfx_v9_0_rlcg_w(struct amdgpu_device *adev, u32 offset, u32 v, u32 
flag)
+static u32 gfx_v9_0_rlcg_rw(struct amdgpu_device *adev, u32 offset, u32 v, 
uint32_t flag)
 {
 static void *scratch_reg0;
 static void *scratch_reg1;
@@ -748,21 +755,20 @@ static void gfx_v9_0_rlcg_w(struct amdgpu_device *adev, 
u32 offset, u32 v, u32 f
 static void *spare_int;
 static uint32_t grbm_cntl;
 static uint32_t grbm_idx;
+   uint32_t i = 0;
+   uint32_t retries = 5;
+   u32 ret = 0;
+   u32 tmp;

 scratch_reg0 = adev->rmmio + 
(adev->reg_offset[GC_HWIP][0][mmSCRATCH_REG0_BASE_IDX] + mmSCRATCH_REG0)*4;
 scratch_reg1 = adev->rmmio + 
(adev->reg_offset[GC_HWIP][0][mmSCRATCH_REG1_BASE_IDX] + mmSCRATCH_REG1)*4;
-   scratch_reg2 = adev->rmmio + 
(adev->reg_offset[GC_HWIP][0][mmSCRATCH_REG1_BASE_IDX] + mmSCRATCH_REG2)*4;
-   scratch_reg3 = adev->rmmio + 
(adev->reg_offset[GC_HWIP][0][mmSCRATCH_REG1_BASE_IDX] + mmSCRATCH_REG3)*4;
+   scratch_reg2 = adev->rmmio + 
(adev->reg_offset[GC_HWIP][0][mmSCRATCH_REG2_BASE_IDX] + mmSCRATCH_REG2)*4;
+   scratch_reg3 = adev->rmmio + 
(adev->reg_offset[GC_HWIP][0][mmSCRATCH_REG3_BASE_IDX] + mmSCRATCH_REG3)*4;
 spare_int = adev->rmmio + 
(adev->reg_offset[GC_HWIP][0][mmRLC_SPARE_INT_BASE_IDX] + mmRLC_SPARE_INT)*4;

 grbm_cntl = adev->reg_offset[GC_HWIP][0][mmGRBM_GFX_CNTL_BASE_IDX] + 
mmGRBM_GFX_CNTL;
 grbm_idx = adev->reg_offset[GC_HWIP][0][mmGRBM_GFX_INDEX_BASE_IDX] + 
mmGRBM_GFX_INDEX;

-   if (amdgpu_sriov_runtime(adev)) {
-   pr_err("shouldn't call rlcg write register during runtime\n");
-   return;
-   }
-
 if (offset == grbm_cntl || offset == grbm_idx) {
 if (offset  == grbm_cntl)
 writel(v, scratch_reg2);
@@ -771,41 +777,95 @@ static void gfx_v9_0_rlcg_w(struct amdgpu_device *adev, 
u32 offset, u32 v, u32 f

 writel(v, ((void __iomem *)adev->rmmio) + (offset * 4));
 } else {
-   uint32_t i = 0;
-   uint32_t retries = 5;
-
+   /*
+* SCRATCH_REG0   = read/write value
+* SCRATCH_REG1[30:28]   = command
+* SCRATCH_REG1[19:0]= address in dword
+* SCRATCH_REG1[26:24]   = Error reporting
+*/
 writel(v, scratch_reg0);
-   writel(offset | 0x80000000, scratch_reg1);
+   writel(offset | flag, scratch_reg1);
 writel(1, spare_int);
-   for (i = 0; i < retries; i++) {
-   u32 tmp;

+   for (i = 0; i < retries; i++) {
 tmp = readl(scratch_reg1);
-   if (!(tmp & 0x80000000))
+   if (!(tmp & flag))
 break;

 udelay(10);
 }
-   if (i >= retries)
-   pr_err("timeout: rlcg program reg:0x%05x failed !\n", 
offset);
+
+   if (i >= retries) {
+   if (amdgpu_sriov_reg_indirect_gc(adev)) {
+   if (tmp & GFX9_RLCG_VFGATE_DISABLED)
+  

Re: [PATCH v3 4/5] drm/amdgpu: get xgmi info before ip_init

2021-12-16 Thread Nieto, David M
[AMD Official Use Only]

Reviewed-by: David Nieto 

From: Skvortsov, Victor 
Sent: Thursday, December 16, 2021 11:42 AM
To: amd-gfx@lists.freedesktop.org ; Deng, Emily 
; Liu, Monk ; Ming, Davis 
; Liu, Shaoyun ; Zhou, Peng Ju 
; Chen, JingWen ; Chen, Horace 
; Nieto, David M 
Cc: Skvortsov, Victor 
Subject: [PATCH v3 4/5] drm/amdgpu: get xgmi info before ip_init

Driver needs to call get_xgmi_info() before ip_init
to determine whether it needs to handle a pending hive reset.

Signed-off-by: Victor Skvortsov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 6 --
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 6 --
 3 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5bd785cfc5ca..4fd370016834 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3576,6 +3576,13 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 if (r)
 return r;

+   /* Need to get xgmi info early to decide the reset behavior*/
+   if (adev->gmc.xgmi.supported) {
+   r = adev->gfxhub.funcs->get_xgmi_info(adev);
+   if (r)
+   return r;
+   }
+
 /* enable PCIE atomic ops */
 if (amdgpu_sriov_vf(adev))
 adev->have_atomics_support = ((struct amd_sriov_msg_pf2vf_info 
*)
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index ae46eb35b3d7..3d5d47a799e3 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -914,12 +914,6 @@ static int gmc_v10_0_sw_init(void *handle)
 return r;
 }

-   if (adev->gmc.xgmi.supported) {
-   r = adev->gfxhub.funcs->get_xgmi_info(adev);
-   if (r)
-   return r;
-   }
-
 r = gmc_v10_0_mc_init(adev);
 if (r)
 return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 2b86c63b032a..57f2729a7bd0 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -1628,12 +1628,6 @@ static int gmc_v9_0_sw_init(void *handle)
 }
 adev->need_swiotlb = drm_need_swiotlb(44);

-   if (adev->gmc.xgmi.supported) {
-   r = adev->gfxhub.funcs->get_xgmi_info(adev);
-   if (r)
-   return r;
-   }
-
 r = gmc_v9_0_mc_init(adev);
 if (r)
 return r;
--
2.25.1



Re: [PATCH v3 3/5] drm/amdgpu: Modify indirect register access for amdkfd_gfx_v9 sriov

2021-12-16 Thread Nieto, David M
[AMD Official Use Only]

Reviewed-by: David Nieto 

From: Skvortsov, Victor 
Sent: Thursday, December 16, 2021 11:42 AM
To: amd-gfx@lists.freedesktop.org ; Deng, Emily 
; Liu, Monk ; Ming, Davis 
; Liu, Shaoyun ; Zhou, Peng Ju 
; Chen, JingWen ; Chen, Horace 
; Nieto, David M 
Cc: Skvortsov, Victor 
Subject: [PATCH v3 3/5] drm/amdgpu: Modify indirect register access for 
amdkfd_gfx_v9 sriov

Modify GC register access from MMIO to RLCG if the indirect
flag is set

Signed-off-by: Victor Skvortsov 
---
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 27 +--
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
index ddfe7aff919d..1abf662a0e91 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
@@ -166,7 +166,7 @@ int kgd_gfx_v9_init_interrupts(struct amdgpu_device *adev, 
uint32_t pipe_id)

 lock_srbm(adev, mec, pipe, 0, 0);

-   WREG32(SOC15_REG_OFFSET(GC, 0, mmCPC_INT_CNTL),
+   WREG32_SOC15(GC, 0, mmCPC_INT_CNTL,
 CP_INT_CNTL_RING0__TIME_STAMP_INT_ENABLE_MASK |
 CP_INT_CNTL_RING0__OPCODE_ERROR_INT_ENABLE_MASK);

@@ -279,7 +279,7 @@ int kgd_gfx_v9_hqd_load(struct amdgpu_device *adev, void 
*mqd,
lower_32_bits((uintptr_t)wptr));
 WREG32_RLC(SOC15_REG_OFFSET(GC, 0, 
mmCP_HQD_PQ_WPTR_POLL_ADDR_HI),
upper_32_bits((uintptr_t)wptr));
-   WREG32(SOC15_REG_OFFSET(GC, 0, mmCP_PQ_WPTR_POLL_CNTL1),
+   WREG32_SOC15(GC, 0, mmCP_PQ_WPTR_POLL_CNTL1,
(uint32_t)get_queue_mask(adev, pipe_id, queue_id));
 }

@@ -488,13 +488,13 @@ bool kgd_gfx_v9_hqd_is_occupied(struct amdgpu_device 
*adev,
 uint32_t low, high;

 acquire_queue(adev, pipe_id, queue_id);
-   act = RREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_ACTIVE));
+   act = RREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE);
 if (act) {
 low = lower_32_bits(queue_address >> 8);
 high = upper_32_bits(queue_address >> 8);

-   if (low == RREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_PQ_BASE)) &&
-  high == RREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_PQ_BASE_HI)))
+   if (low == RREG32_SOC15(GC, 0, mmCP_HQD_PQ_BASE) &&
+  high == RREG32_SOC15(GC, 0, mmCP_HQD_PQ_BASE_HI))
 retval = true;
 }
 release_queue(adev);
@@ -556,7 +556,7 @@ int kgd_gfx_v9_hqd_destroy(struct amdgpu_device *adev, void 
*mqd,

 end_jiffies = (utimeout * HZ / 1000) + jiffies;
 while (true) {
-   temp = RREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_ACTIVE));
+   temp = RREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE);
 if (!(temp & CP_HQD_ACTIVE__ACTIVE_MASK))
 break;
 if (time_after(jiffies, end_jiffies)) {
@@ -645,7 +645,7 @@ int kgd_gfx_v9_wave_control_execute(struct amdgpu_device 
*adev,
 mutex_lock(&adev->grbm_idx_mutex);

 WREG32_SOC15_RLC_SHADOW(GC, 0, mmGRBM_GFX_INDEX, gfx_index_val);
-   WREG32(SOC15_REG_OFFSET(GC, 0, mmSQ_CMD), sq_cmd);
+   WREG32_SOC15(GC, 0, mmSQ_CMD, sq_cmd);

 data = REG_SET_FIELD(data, GRBM_GFX_INDEX,
 INSTANCE_BROADCAST_WRITES, 1);
@@ -722,7 +722,7 @@ static void get_wave_count(struct amdgpu_device *adev, int 
queue_idx,
 pipe_idx = queue_idx / adev->gfx.mec.num_queue_per_pipe;
 queue_slot = queue_idx % adev->gfx.mec.num_queue_per_pipe;
 soc15_grbm_select(adev, 1, pipe_idx, queue_slot, 0);
-   reg_val = RREG32(SOC15_REG_OFFSET(GC, 0, mmSPI_CSQ_WF_ACTIVE_COUNT_0) +
+   reg_val = RREG32_SOC15_IP(GC, SOC15_REG_OFFSET(GC, 0, 
mmSPI_CSQ_WF_ACTIVE_COUNT_0) +
  queue_slot);
 *wave_cnt = reg_val & SPI_CSQ_WF_ACTIVE_COUNT_0__COUNT_MASK;
 if (*wave_cnt != 0)
@@ -809,8 +809,7 @@ void kgd_gfx_v9_get_cu_occupancy(struct amdgpu_device 
*adev, int pasid,
 for (sh_idx = 0; sh_idx < sh_cnt; sh_idx++) {

 gfx_v9_0_select_se_sh(adev, se_idx, sh_idx, 
0xffffffff);
-   queue_map = RREG32(SOC15_REG_OFFSET(GC, 0,
-  mmSPI_CSQ_WF_ACTIVE_STATUS));
+   queue_map = RREG32_SOC15(GC, 0, 
mmSPI_CSQ_WF_ACTIVE_STATUS);

 /*
  * Assumption: queue map encodes following schema: four
@@ -860,17 +859,17 @@ void kgd_gfx_v9_program_trap_handler_settings(struct 
amdgpu_device *adev,
 /*
  * Program TBA registers
  */
-   WREG32(SOC15_REG_OFFSET(GC, 0, mmSQ_SHADER_TBA_LO),
+   WREG32_SOC15(GC, 0, mmSQ_SHADER_TBA_LO,
 lower_32_bits(tba_addr >> 8));
-   WREG32(SOC15_REG_OFFSET(GC, 0, 

Re: [PATCH v3 2/5] drm/amdgpu: Modify indirect register access for gmc_v9_0 sriov

2021-12-16 Thread Nieto, David M
[AMD Official Use Only]

Reviewed-by: David Nieto 

From: Skvortsov, Victor 
Sent: Thursday, December 16, 2021 11:42 AM
To: amd-gfx@lists.freedesktop.org ; Deng, Emily 
; Liu, Monk ; Ming, Davis 
; Liu, Shaoyun ; Zhou, Peng Ju 
; Chen, JingWen ; Chen, Horace 
; Nieto, David M 
Cc: Skvortsov, Victor 
Subject: [PATCH v3 2/5] drm/amdgpu: Modify indirect register access for 
gmc_v9_0 sriov

Modify GC register access from MMIO to RLCG if the
indirect flag is set

v2: Replaced ternary operator with if-else for better
readability

Signed-off-by: Victor Skvortsov 
---
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 57 ---
 1 file changed, 43 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index a5471923b3f6..2b86c63b032a 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -478,9 +478,18 @@ static int gmc_v9_0_vm_fault_interrupt_state(struct 
amdgpu_device *adev,
 hub = &adev->vmhub[j];
 for (i = 0; i < 16; i++) {
 reg = hub->vm_context0_cntl + i;
-   tmp = RREG32(reg);
+
+   if (j == AMDGPU_GFXHUB_0)
+   tmp = RREG32_SOC15_IP(GC, reg);
+   else
+   tmp = RREG32_SOC15_IP(MMHUB, reg);
+
 tmp &= ~bits;
-   WREG32(reg, tmp);
+
+   if (j == AMDGPU_GFXHUB_0)
+   WREG32_SOC15_IP(GC, reg, tmp);
+   else
+   WREG32_SOC15_IP(MMHUB, reg, tmp);
 }
 }
 break;
@@ -489,9 +498,18 @@ static int gmc_v9_0_vm_fault_interrupt_state(struct 
amdgpu_device *adev,
 hub = &adev->vmhub[j];
 for (i = 0; i < 16; i++) {
 reg = hub->vm_context0_cntl + i;
-   tmp = RREG32(reg);
+
+   if (j == AMDGPU_GFXHUB_0)
+   tmp = RREG32_SOC15_IP(GC, reg);
+   else
+   tmp = RREG32_SOC15_IP(MMHUB, reg);
+
 tmp |= bits;
-   WREG32(reg, tmp);
+
+   if (j == AMDGPU_GFXHUB_0)
+   WREG32_SOC15_IP(GC, reg, tmp);
+   else
+   WREG32_SOC15_IP(MMHUB, reg, tmp);
 }
 }
 break;
@@ -788,9 +806,12 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
 /* TODO: It needs to continue working on debugging with semaphore for 
GFXHUB as well. */
 if (use_semaphore) {
 for (j = 0; j < adev->usec_timeout; j++) {
-   /* a read return value of 1 means semaphore acuqire */
-   tmp = RREG32_NO_KIQ(hub->vm_inv_eng0_sem +
-   hub->eng_distance * eng);
+   /* a read return value of 1 means semaphore acquire */
+   if (vmhub == AMDGPU_GFXHUB_0)
+   tmp = RREG32_SOC15_IP_NO_KIQ(GC, 
hub->vm_inv_eng0_sem + hub->eng_distance * eng);
+   else
+   tmp = RREG32_SOC15_IP_NO_KIQ(MMHUB, 
hub->vm_inv_eng0_sem + hub->eng_distance * eng);
+
 if (tmp & 0x1)
 break;
 udelay(1);
@@ -801,8 +822,10 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
 }

 do {
-   WREG32_NO_KIQ(hub->vm_inv_eng0_req +
- hub->eng_distance * eng, inv_req);
+   if (vmhub == AMDGPU_GFXHUB_0)
+   WREG32_SOC15_IP_NO_KIQ(GC, hub->vm_inv_eng0_req + 
hub->eng_distance * eng, inv_req);
+   else
+   WREG32_SOC15_IP_NO_KIQ(MMHUB, hub->vm_inv_eng0_req + 
hub->eng_distance * eng, inv_req);

 /*
  * Issue a dummy read to wait for the ACK register to
@@ -815,8 +838,11 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
   hub->eng_distance * eng);

 for (j = 0; j < adev->usec_timeout; j++) {
-   tmp = RREG32_NO_KIQ(hub->vm_inv_eng0_ack +
-   hub->eng_distance * eng);
+   if (vmhub == AMDGPU_GFXHUB_0)
+   tmp = RREG32_SOC15_IP_NO_KIQ(GC, 

Re: [PATCH v3 1/5] drm/amdgpu: Add *_SOC15_IP_NO_KIQ() macro definitions

2021-12-16 Thread Nieto, David M
[AMD Official Use Only]

Reviewed-by: David Nieto 

From: Skvortsov, Victor 
Sent: Thursday, December 16, 2021 11:42 AM
To: amd-gfx@lists.freedesktop.org ; Deng, Emily 
; Liu, Monk ; Ming, Davis 
; Liu, Shaoyun ; Zhou, Peng Ju 
; Chen, JingWen ; Chen, Horace 
; Nieto, David M 
Cc: Skvortsov, Victor 
Subject: [PATCH v3 1/5] drm/amdgpu: Add *_SOC15_IP_NO_KIQ() macro definitions

Add helper macros to change register access
from direct to indirect.

Signed-off-by: Victor Skvortsov 
---
 drivers/gpu/drm/amd/amdgpu/soc15_common.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/soc15_common.h 
b/drivers/gpu/drm/amd/amdgpu/soc15_common.h
index 8a9ca87d8663..473767e03676 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15_common.h
+++ b/drivers/gpu/drm/amd/amdgpu/soc15_common.h
@@ -51,6 +51,8 @@

 #define RREG32_SOC15_IP(ip, reg) __RREG32_SOC15_RLC__(reg, 0, ip##_HWIP)

+#define RREG32_SOC15_IP_NO_KIQ(ip, reg) __RREG32_SOC15_RLC__(reg, 
AMDGPU_REGS_NO_KIQ, ip##_HWIP)
+
 #define RREG32_SOC15_NO_KIQ(ip, inst, reg) \
 __RREG32_SOC15_RLC__(adev->reg_offset[ip##_HWIP][inst][reg##_BASE_IDX] 
+ reg, \
  AMDGPU_REGS_NO_KIQ, ip##_HWIP)
@@ -65,6 +67,9 @@
 #define WREG32_SOC15_IP(ip, reg, value) \
  __WREG32_SOC15_RLC__(reg, value, 0, ip##_HWIP)

+#define WREG32_SOC15_IP_NO_KIQ(ip, reg, value) \
+__WREG32_SOC15_RLC__(reg, value, AMDGPU_REGS_NO_KIQ, ip##_HWIP)
+
 #define WREG32_SOC15_NO_KIQ(ip, inst, reg, value) \
 __WREG32_SOC15_RLC__(adev->reg_offset[ip##_HWIP][inst][reg##_BASE_IDX] 
+ reg, \
  value, AMDGPU_REGS_NO_KIQ, ip##_HWIP)
--
2.25.1



[PATCH v3 5/5] drm/amdgpu: Modify indirect register access for gfx9 sriov

2021-12-16 Thread Victor Skvortsov
Expand RLCG interface for new GC read & write commands.
New interface will only be used if the PF enables the flag in pf2vf msg.

v2: Added a description for the scratch registers

Signed-off-by: Victor Skvortsov 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 117 --
 1 file changed, 89 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index edb3e3b08eed..9189fb85a4dd 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -63,6 +63,13 @@
 #define mmGCEA_PROBE_MAP0x070c
 #define mmGCEA_PROBE_MAP_BASE_IDX   0
 
+#define GFX9_RLCG_GC_WRITE_OLD (0x8 << 28)
+#define GFX9_RLCG_GC_WRITE (0x0 << 28)
+#define GFX9_RLCG_GC_READ  (0x1 << 28)
+#define GFX9_RLCG_VFGATE_DISABLED  0x400
+#define GFX9_RLCG_WRONG_OPERATION_TYPE 0x200
+#define GFX9_RLCG_NOT_IN_RANGE 0x100
+
 MODULE_FIRMWARE("amdgpu/vega10_ce.bin");
 MODULE_FIRMWARE("amdgpu/vega10_pfp.bin");
 MODULE_FIRMWARE("amdgpu/vega10_me.bin");
@@ -739,7 +746,7 @@ static const u32 GFX_RLC_SRM_INDEX_CNTL_DATA_OFFSETS[] =
mmRLC_SRM_INDEX_CNTL_DATA_7 - mmRLC_SRM_INDEX_CNTL_DATA_0,
 };
 
-static void gfx_v9_0_rlcg_w(struct amdgpu_device *adev, u32 offset, u32 v, u32 
flag)
+static u32 gfx_v9_0_rlcg_rw(struct amdgpu_device *adev, u32 offset, u32 v, 
uint32_t flag)
 {
static void *scratch_reg0;
static void *scratch_reg1;
@@ -748,21 +755,20 @@ static void gfx_v9_0_rlcg_w(struct amdgpu_device *adev, 
u32 offset, u32 v, u32 f
static void *spare_int;
static uint32_t grbm_cntl;
static uint32_t grbm_idx;
+   uint32_t i = 0;
+   uint32_t retries = 5;
+   u32 ret = 0;
+   u32 tmp;
 
scratch_reg0 = adev->rmmio + 
(adev->reg_offset[GC_HWIP][0][mmSCRATCH_REG0_BASE_IDX] + mmSCRATCH_REG0)*4;
scratch_reg1 = adev->rmmio + 
(adev->reg_offset[GC_HWIP][0][mmSCRATCH_REG1_BASE_IDX] + mmSCRATCH_REG1)*4;
-   scratch_reg2 = adev->rmmio + 
(adev->reg_offset[GC_HWIP][0][mmSCRATCH_REG1_BASE_IDX] + mmSCRATCH_REG2)*4;
-   scratch_reg3 = adev->rmmio + 
(adev->reg_offset[GC_HWIP][0][mmSCRATCH_REG1_BASE_IDX] + mmSCRATCH_REG3)*4;
+   scratch_reg2 = adev->rmmio + 
(adev->reg_offset[GC_HWIP][0][mmSCRATCH_REG2_BASE_IDX] + mmSCRATCH_REG2)*4;
+   scratch_reg3 = adev->rmmio + 
(adev->reg_offset[GC_HWIP][0][mmSCRATCH_REG3_BASE_IDX] + mmSCRATCH_REG3)*4;
spare_int = adev->rmmio + 
(adev->reg_offset[GC_HWIP][0][mmRLC_SPARE_INT_BASE_IDX] + mmRLC_SPARE_INT)*4;
 
grbm_cntl = adev->reg_offset[GC_HWIP][0][mmGRBM_GFX_CNTL_BASE_IDX] + 
mmGRBM_GFX_CNTL;
grbm_idx = adev->reg_offset[GC_HWIP][0][mmGRBM_GFX_INDEX_BASE_IDX] + 
mmGRBM_GFX_INDEX;
 
-   if (amdgpu_sriov_runtime(adev)) {
-   pr_err("shouldn't call rlcg write register during runtime\n");
-   return;
-   }
-
if (offset == grbm_cntl || offset == grbm_idx) {
if (offset  == grbm_cntl)
writel(v, scratch_reg2);
@@ -771,41 +777,95 @@ static void gfx_v9_0_rlcg_w(struct amdgpu_device *adev, 
u32 offset, u32 v, u32 f
 
writel(v, ((void __iomem *)adev->rmmio) + (offset * 4));
} else {
-   uint32_t i = 0;
-   uint32_t retries = 5;
-
+   /*
+* SCRATCH_REG0 = read/write value
+* SCRATCH_REG1[30:28]  = command
+* SCRATCH_REG1[19:0]   = address in dword
+* SCRATCH_REG1[26:24]  = Error reporting
+*/
writel(v, scratch_reg0);
-   writel(offset | 0x80000000, scratch_reg1);
+   writel(offset | flag, scratch_reg1);
writel(1, spare_int);
-   for (i = 0; i < retries; i++) {
-   u32 tmp;
 
+   for (i = 0; i < retries; i++) {
tmp = readl(scratch_reg1);
-   if (!(tmp & 0x80000000))
+   if (!(tmp & flag))
break;
 
udelay(10);
}
-   if (i >= retries)
-   pr_err("timeout: rlcg program reg:0x%05x failed !\n", 
offset);
+
+   if (i >= retries) {
+   if (amdgpu_sriov_reg_indirect_gc(adev)) {
+   if (tmp & GFX9_RLCG_VFGATE_DISABLED)
+   pr_err("The vfgate is disabled, program 
reg:0x%05x failed!\n", offset);
+   else if (tmp & GFX9_RLCG_WRONG_OPERATION_TYPE)
+   pr_err("Wrong operation type, program 
reg:0x%05x failed!\n", offset);
+   else if (tmp & GFX9_RLCG_NOT_IN_RANGE)
+   pr_err("The 

[PATCH v3 4/5] drm/amdgpu: get xgmi info before ip_init

2021-12-16 Thread Victor Skvortsov
Driver needs to call get_xgmi_info() before ip_init
to determine whether it needs to handle a pending hive reset.

Signed-off-by: Victor Skvortsov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 6 --
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 6 --
 3 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5bd785cfc5ca..4fd370016834 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3576,6 +3576,13 @@ int amdgpu_device_init(struct amdgpu_device *adev,
if (r)
return r;
 
+   /* Need to get xgmi info early to decide the reset behavior*/
+   if (adev->gmc.xgmi.supported) {
+   r = adev->gfxhub.funcs->get_xgmi_info(adev);
+   if (r)
+   return r;
+   }
+
/* enable PCIE atomic ops */
if (amdgpu_sriov_vf(adev))
adev->have_atomics_support = ((struct amd_sriov_msg_pf2vf_info 
*)
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index ae46eb35b3d7..3d5d47a799e3 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -914,12 +914,6 @@ static int gmc_v10_0_sw_init(void *handle)
return r;
}
 
-   if (adev->gmc.xgmi.supported) {
-   r = adev->gfxhub.funcs->get_xgmi_info(adev);
-   if (r)
-   return r;
-   }
-
r = gmc_v10_0_mc_init(adev);
if (r)
return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 2b86c63b032a..57f2729a7bd0 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -1628,12 +1628,6 @@ static int gmc_v9_0_sw_init(void *handle)
}
adev->need_swiotlb = drm_need_swiotlb(44);
 
-   if (adev->gmc.xgmi.supported) {
-   r = adev->gfxhub.funcs->get_xgmi_info(adev);
-   if (r)
-   return r;
-   }
-
r = gmc_v9_0_mc_init(adev);
if (r)
return r;
-- 
2.25.1



[PATCH v3 3/5] drm/amdgpu: Modify indirect register access for amdkfd_gfx_v9 sriov

2021-12-16 Thread Victor Skvortsov
Modify GC register access from MMIO to RLCG if the indirect
flag is set

Signed-off-by: Victor Skvortsov 
---
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 27 +--
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
index ddfe7aff919d..1abf662a0e91 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
@@ -166,7 +166,7 @@ int kgd_gfx_v9_init_interrupts(struct amdgpu_device *adev, 
uint32_t pipe_id)
 
lock_srbm(adev, mec, pipe, 0, 0);
 
-   WREG32(SOC15_REG_OFFSET(GC, 0, mmCPC_INT_CNTL),
+   WREG32_SOC15(GC, 0, mmCPC_INT_CNTL,
CP_INT_CNTL_RING0__TIME_STAMP_INT_ENABLE_MASK |
CP_INT_CNTL_RING0__OPCODE_ERROR_INT_ENABLE_MASK);
 
@@ -279,7 +279,7 @@ int kgd_gfx_v9_hqd_load(struct amdgpu_device *adev, void 
*mqd,
   lower_32_bits((uintptr_t)wptr));
WREG32_RLC(SOC15_REG_OFFSET(GC, 0, 
mmCP_HQD_PQ_WPTR_POLL_ADDR_HI),
   upper_32_bits((uintptr_t)wptr));
-   WREG32(SOC15_REG_OFFSET(GC, 0, mmCP_PQ_WPTR_POLL_CNTL1),
+   WREG32_SOC15(GC, 0, mmCP_PQ_WPTR_POLL_CNTL1,
   (uint32_t)get_queue_mask(adev, pipe_id, queue_id));
}
 
@@ -488,13 +488,13 @@ bool kgd_gfx_v9_hqd_is_occupied(struct amdgpu_device 
*adev,
uint32_t low, high;
 
acquire_queue(adev, pipe_id, queue_id);
-   act = RREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_ACTIVE));
+   act = RREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE);
if (act) {
low = lower_32_bits(queue_address >> 8);
high = upper_32_bits(queue_address >> 8);
 
-   if (low == RREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_PQ_BASE)) &&
-  high == RREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_PQ_BASE_HI)))
+   if (low == RREG32_SOC15(GC, 0, mmCP_HQD_PQ_BASE) &&
+  high == RREG32_SOC15(GC, 0, mmCP_HQD_PQ_BASE_HI))
retval = true;
}
release_queue(adev);
@@ -556,7 +556,7 @@ int kgd_gfx_v9_hqd_destroy(struct amdgpu_device *adev, void 
*mqd,
 
end_jiffies = (utimeout * HZ / 1000) + jiffies;
while (true) {
-   temp = RREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_ACTIVE));
+   temp = RREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE);
if (!(temp & CP_HQD_ACTIVE__ACTIVE_MASK))
break;
if (time_after(jiffies, end_jiffies)) {
@@ -645,7 +645,7 @@ int kgd_gfx_v9_wave_control_execute(struct amdgpu_device 
*adev,
 mutex_lock(&adev->grbm_idx_mutex);
 
WREG32_SOC15_RLC_SHADOW(GC, 0, mmGRBM_GFX_INDEX, gfx_index_val);
-   WREG32(SOC15_REG_OFFSET(GC, 0, mmSQ_CMD), sq_cmd);
+   WREG32_SOC15(GC, 0, mmSQ_CMD, sq_cmd);
 
data = REG_SET_FIELD(data, GRBM_GFX_INDEX,
INSTANCE_BROADCAST_WRITES, 1);
@@ -722,7 +722,7 @@ static void get_wave_count(struct amdgpu_device *adev, int 
queue_idx,
pipe_idx = queue_idx / adev->gfx.mec.num_queue_per_pipe;
queue_slot = queue_idx % adev->gfx.mec.num_queue_per_pipe;
soc15_grbm_select(adev, 1, pipe_idx, queue_slot, 0);
-   reg_val = RREG32(SOC15_REG_OFFSET(GC, 0, mmSPI_CSQ_WF_ACTIVE_COUNT_0) +
+   reg_val = RREG32_SOC15_IP(GC, SOC15_REG_OFFSET(GC, 0, 
mmSPI_CSQ_WF_ACTIVE_COUNT_0) +
 queue_slot);
*wave_cnt = reg_val & SPI_CSQ_WF_ACTIVE_COUNT_0__COUNT_MASK;
if (*wave_cnt != 0)
@@ -809,8 +809,7 @@ void kgd_gfx_v9_get_cu_occupancy(struct amdgpu_device 
*adev, int pasid,
for (sh_idx = 0; sh_idx < sh_cnt; sh_idx++) {
 
 gfx_v9_0_select_se_sh(adev, se_idx, sh_idx, 0xffffffff);
-   queue_map = RREG32(SOC15_REG_OFFSET(GC, 0,
-  mmSPI_CSQ_WF_ACTIVE_STATUS));
+   queue_map = RREG32_SOC15(GC, 0, 
mmSPI_CSQ_WF_ACTIVE_STATUS);
 
/*
 * Assumption: queue map encodes following schema: four
@@ -860,17 +859,17 @@ void kgd_gfx_v9_program_trap_handler_settings(struct 
amdgpu_device *adev,
/*
 * Program TBA registers
 */
-   WREG32(SOC15_REG_OFFSET(GC, 0, mmSQ_SHADER_TBA_LO),
+   WREG32_SOC15(GC, 0, mmSQ_SHADER_TBA_LO,
 lower_32_bits(tba_addr >> 8));
-   WREG32(SOC15_REG_OFFSET(GC, 0, mmSQ_SHADER_TBA_HI),
+   WREG32_SOC15(GC, 0, mmSQ_SHADER_TBA_HI,
 upper_32_bits(tba_addr >> 8));
 
/*
 * Program TMA registers
 */
-   WREG32(SOC15_REG_OFFSET(GC, 0, mmSQ_SHADER_TMA_LO),
+   WREG32_SOC15(GC, 0, mmSQ_SHADER_TMA_LO,
lower_32_bits(tma_addr >> 8));
-   WREG32(SOC15_REG_OFFSET(GC, 0, mmSQ_SHADER_TMA_HI),
+   WREG32_SOC15(GC, 0, mmSQ_SHADER_TMA_HI,
 

[PATCH v3 2/5] drm/amdgpu: Modify indirect register access for gmc_v9_0 sriov

2021-12-16 Thread Victor Skvortsov
Modify GC register access from MMIO to RLCG if the
indirect flag is set

v2: Replaced ternary operator with if-else for better
readability

Signed-off-by: Victor Skvortsov 
---
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 57 ---
 1 file changed, 43 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index a5471923b3f6..2b86c63b032a 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -478,9 +478,18 @@ static int gmc_v9_0_vm_fault_interrupt_state(struct 
amdgpu_device *adev,
 hub = &adev->vmhub[j];
for (i = 0; i < 16; i++) {
reg = hub->vm_context0_cntl + i;
-   tmp = RREG32(reg);
+
+   if (j == AMDGPU_GFXHUB_0)
+   tmp = RREG32_SOC15_IP(GC, reg);
+   else
+   tmp = RREG32_SOC15_IP(MMHUB, reg);
+
tmp &= ~bits;
-   WREG32(reg, tmp);
+
+   if (j == AMDGPU_GFXHUB_0)
+   WREG32_SOC15_IP(GC, reg, tmp);
+   else
+   WREG32_SOC15_IP(MMHUB, reg, tmp);
}
}
break;
@@ -489,9 +498,18 @@ static int gmc_v9_0_vm_fault_interrupt_state(struct 
amdgpu_device *adev,
 hub = &adev->vmhub[j];
for (i = 0; i < 16; i++) {
reg = hub->vm_context0_cntl + i;
-   tmp = RREG32(reg);
+
+   if (j == AMDGPU_GFXHUB_0)
+   tmp = RREG32_SOC15_IP(GC, reg);
+   else
+   tmp = RREG32_SOC15_IP(MMHUB, reg);
+
tmp |= bits;
-   WREG32(reg, tmp);
+
+   if (j == AMDGPU_GFXHUB_0)
+   WREG32_SOC15_IP(GC, reg, tmp);
+   else
+   WREG32_SOC15_IP(MMHUB, reg, tmp);
}
}
break;
@@ -788,9 +806,12 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
/* TODO: It needs to continue working on debugging with semaphore for 
GFXHUB as well. */
if (use_semaphore) {
for (j = 0; j < adev->usec_timeout; j++) {
-   /* a read return value of 1 means semaphore acuqire */
-   tmp = RREG32_NO_KIQ(hub->vm_inv_eng0_sem +
-   hub->eng_distance * eng);
+   /* a read return value of 1 means semaphore acquire */
+   if (vmhub == AMDGPU_GFXHUB_0)
+   tmp = RREG32_SOC15_IP_NO_KIQ(GC, 
hub->vm_inv_eng0_sem + hub->eng_distance * eng);
+   else
+   tmp = RREG32_SOC15_IP_NO_KIQ(MMHUB, 
hub->vm_inv_eng0_sem + hub->eng_distance * eng);
+
if (tmp & 0x1)
break;
udelay(1);
@@ -801,8 +822,10 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
}
 
do {
-   WREG32_NO_KIQ(hub->vm_inv_eng0_req +
- hub->eng_distance * eng, inv_req);
+   if (vmhub == AMDGPU_GFXHUB_0)
+   WREG32_SOC15_IP_NO_KIQ(GC, hub->vm_inv_eng0_req + 
hub->eng_distance * eng, inv_req);
+   else
+   WREG32_SOC15_IP_NO_KIQ(MMHUB, hub->vm_inv_eng0_req + 
hub->eng_distance * eng, inv_req);
 
/*
 * Issue a dummy read to wait for the ACK register to
@@ -815,8 +838,11 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
  hub->eng_distance * eng);
 
for (j = 0; j < adev->usec_timeout; j++) {
-   tmp = RREG32_NO_KIQ(hub->vm_inv_eng0_ack +
-   hub->eng_distance * eng);
+   if (vmhub == AMDGPU_GFXHUB_0)
+   tmp = RREG32_SOC15_IP_NO_KIQ(GC, 
hub->vm_inv_eng0_ack + hub->eng_distance * eng);
+   else
+   tmp = RREG32_SOC15_IP_NO_KIQ(MMHUB, 
hub->vm_inv_eng0_ack + hub->eng_distance * eng);
+
if (tmp & (1 << vmid))
break;
udelay(1);
@@ -827,13 +853,16 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
} while (inv_req);
 

[PATCH v3 1/5] drm/amdgpu: Add *_SOC15_IP_NO_KIQ() macro definitions

2021-12-16 Thread Victor Skvortsov
Add helper macros to change register access
from direct to indirect.

Signed-off-by: Victor Skvortsov 
---
 drivers/gpu/drm/amd/amdgpu/soc15_common.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/soc15_common.h 
b/drivers/gpu/drm/amd/amdgpu/soc15_common.h
index 8a9ca87d8663..473767e03676 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15_common.h
+++ b/drivers/gpu/drm/amd/amdgpu/soc15_common.h
@@ -51,6 +51,8 @@
 
 #define RREG32_SOC15_IP(ip, reg) __RREG32_SOC15_RLC__(reg, 0, ip##_HWIP)
 
+#define RREG32_SOC15_IP_NO_KIQ(ip, reg) __RREG32_SOC15_RLC__(reg, 
AMDGPU_REGS_NO_KIQ, ip##_HWIP)
+
 #define RREG32_SOC15_NO_KIQ(ip, inst, reg) \
__RREG32_SOC15_RLC__(adev->reg_offset[ip##_HWIP][inst][reg##_BASE_IDX] 
+ reg, \
 AMDGPU_REGS_NO_KIQ, ip##_HWIP)
@@ -65,6 +67,9 @@
 #define WREG32_SOC15_IP(ip, reg, value) \
 __WREG32_SOC15_RLC__(reg, value, 0, ip##_HWIP)
 
+#define WREG32_SOC15_IP_NO_KIQ(ip, reg, value) \
+__WREG32_SOC15_RLC__(reg, value, AMDGPU_REGS_NO_KIQ, ip##_HWIP)
+
 #define WREG32_SOC15_NO_KIQ(ip, inst, reg, value) \
__WREG32_SOC15_RLC__(adev->reg_offset[ip##_HWIP][inst][reg##_BASE_IDX] 
+ reg, \
 value, AMDGPU_REGS_NO_KIQ, ip##_HWIP)
-- 
2.25.1



[PATCH v3 0/5] *** GFX9 RLCG Interface modifications ***

2021-12-16 Thread Victor Skvortsov
This patchset introduces an expanded sriov RLCG interface.
This interface will be used by Aldebaran in sriov
mode for indirect GC register access during full access.

v2: Added descriptions to scratch registers, and improved code readability.

v3: Remove the RLC function pointer init change. Move xgmi_info call
to earlier in init.

Victor Skvortsov (5):
  drm/amdgpu: Add *_SOC15_IP_NO_KIQ() macro definitions
  drm/amdgpu: Modify indirect register access for gmc_v9_0 sriov
  drm/amdgpu: Modify indirect register access for amdkfd_gfx_v9 sriov
  drm/amdgpu: get xgmi info before ip_init
  drm/amdgpu: Modify indirect register access for gfx9 sriov

 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c |  27 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c|   7 ++
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 117 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c|   6 -
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c |  63 +++---
 drivers/gpu/drm/amd/amdgpu/soc15_common.h |   5 +
 6 files changed, 157 insertions(+), 68 deletions(-)

-- 
2.25.1



Re: [PATCH] drm/amdkfd: make SPDX License expression more sound

2021-12-16 Thread Kate Stewart
On Thu, Dec 16, 2021 at 11:14 AM Richard Fontana  wrote:
>
> On Thu, Dec 16, 2021 at 4:45 AM Lukas Bulwahn  wrote:
> >
> > Commit b5f57384805a ("drm/amdkfd: Add sysfs bitfields and enums to uAPI")
> > adds include/uapi/linux/kfd_sysfs.h with the "GPL-2.0 OR MIT WITH
> > Linux-syscall-note" SPDX-License expression.
> >
> > The command ./scripts/spdxcheck.py warns:
> >
> >   include/uapi/linux/kfd_sysfs.h: 1:48 Exception not valid for license MIT: 
> > Linux-syscall-note
> >
> > For a uapi header, the file under GPLv2 License must be combined with the
> > Linux-syscall-note, but combining the MIT License with the
> > Linux-syscall-note makes no sense, as the note provides an exception for
> > GPL-licensed code, not for permissively licensed code.
> >
> > So, reorganize the SPDX expression to only combine the note with the GPL
> > License condition. This makes spdxcheck happy again.
> >
> > Signed-off-by: Lukas Bulwahn 
> > ---
> > I am not a lawyer and I do not intend to modify the actual licensing of
> > this header file. So, I really would like to have an Ack from some AMD
> > developer here.
> >
> > Maybe also a lawyer on the linux-spdx list can check my reasoning on the
> > licensing with the exception note?
>
> I believe "MIT WITH Linux-syscall-note" is a syntactically correct
> SPDX expression but is otherwise sort of non-meaningful.
> "(GPL-2.0 WITH Linux-syscall-note) OR MIT" is presumably what is
> intended here. But yes would be good to get confirmation from someone
> associated with AMD.

Thanks Lukas,  I agree that this is indeed clearer.
+1

Reviewed-by: kstew...@linuxfoundation.org


[PATCH] drm/amd/display: fix dereference before NULL check

2021-12-16 Thread José Expósito
The "plane_state" pointer was access before checking if it was NULL.

Avoid a possible NULL pointer dereference by accessing the plane
address after the check.

Addresses-Coverity-ID: 1474582 ("Dereference before null check")
Fixes: 3f68c01be9a22 ("drm/amd/display: add cyan_skillfish display support")
Signed-off-by: José Expósito 
---
 drivers/gpu/drm/amd/display/dc/dcn201/dcn201_hwseq.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dcn201/dcn201_hwseq.c 
b/drivers/gpu/drm/amd/display/dc/dcn201/dcn201_hwseq.c
index cfd09b3f705e..fe22530242d2 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn201/dcn201_hwseq.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn201/dcn201_hwseq.c
@@ -134,11 +134,12 @@ void dcn201_update_plane_addr(const struct dc *dc, struct 
pipe_ctx *pipe_ctx)
PHYSICAL_ADDRESS_LOC addr;
struct dc_plane_state *plane_state = pipe_ctx->plane_state;
struct dce_hwseq *hws = dc->hwseq;
-   struct dc_plane_address uma = plane_state->address;
+   struct dc_plane_address uma;
 
if (plane_state == NULL)
return;
 
+   uma = plane_state->address;
addr_patched = patch_address_for_sbs_tb_stereo(pipe_ctx, &addr);
 
plane_address_in_gpu_space_to_uma(hws, &uma);
-- 
2.25.1



Re: [PATCH 4/5] drm/amdgpu: Initialize Aldebaran RLC function pointers

2021-12-16 Thread Deucher, Alexander
[Public]

If it needs to be in drm-next, please make sure it lands there.

Alex


From: amd-gfx  on behalf of Liu, Shaoyun 

Sent: Thursday, December 16, 2021 12:51 PM
To: Skvortsov, Victor ; Alex Deucher 

Cc: Ming, Davis ; Chen, JingWen ; 
amd-gfx list ; Deng, Emily ; 
Chen, Horace ; Liu, Monk ; Nieto, David 
M 
Subject: RE: [PATCH 4/5] drm/amdgpu: Initialize Aldebaran RLC function pointers

[AMD Official Use Only]

Actually I don't know why the change "a35f147621bc drm/amdgpu: get xgmi info 
at eary_init" is not in drm-next; instead it's in amd-mainline-dkms-5.13.
That change is necessary for passing an XGMI hive through to a VM and relying on our 
driver to do the reset on the whole hive when the driver is loaded.

I checked the code again; it seems we should be OK as long as we get xgmi info 
at early_init. So since gfx_v9_0_set_rlc_funcs() already gets called in 
gfx_v9_0_early_init(), we can move get xgmi info out of gmc_early_init and 
call it as the last step of early_init.

Regards
Shaoyun.liu

-Original Message-
From: Skvortsov, Victor 
Sent: Thursday, December 16, 2021 9:28 AM
To: Alex Deucher 
Cc: amd-gfx list ; Deng, Emily 
; Liu, Monk ; Ming, Davis 
; Liu, Shaoyun ; Zhou, Peng Ju 
; Chen, JingWen ; Chen, Horace 
; Nieto, David M 
Subject: RE: [PATCH 4/5] drm/amdgpu: Initialize Aldebaran RLC function pointers

[AMD Official Use Only]

Gotcha, I will skip this patch for drm-next

-Original Message-
From: Alex Deucher 
Sent: Thursday, December 16, 2021 8:53 AM
To: Skvortsov, Victor 
Cc: amd-gfx list ; Deng, Emily 
; Liu, Monk ; Ming, Davis 
; Liu, Shaoyun ; Zhou, Peng Ju 
; Chen, JingWen ; Chen, Horace 
; Nieto, David M 
Subject: Re: [PATCH 4/5] drm/amdgpu: Initialize Aldebaran RLC function pointers

[CAUTION: External Email]

On Wed, Dec 15, 2021 at 6:58 PM Skvortsov, Victor  
wrote:
>
> [AMD Official Use Only]
>
> Hey Alex,
>
> This change was based on the fact that amd-mainline-dkms-5.13 calls 
> get_xgmi_info() in gmc_v9_0_early_init(). But I can see that in drm-next it's 
> instead called in gmc_v9_0_sw_init(). So, I'm not sure what the correct 
> behavior is. But I do agree that the change is kind of ugly. I don't know where 
> else to put it if we do need to call get_xgmi_info() in early_init.
>

We could skip this patch for drm-next and just apply it to the dkms branch.  
There's already a lot of ugly stuff in there to deal with multiple kernel 
versions.

Alex


> Thanks,
> Victor
>
> -Original Message-
> From: Alex Deucher 
> Sent: Wednesday, December 15, 2021 4:38 PM
> To: Skvortsov, Victor 
> Cc: amd-gfx list ; Deng, Emily
> ; Liu, Monk ; Ming, Davis
> ; Liu, Shaoyun ; Zhou, Peng
> Ju ; Chen, JingWen ; Chen,
> Horace ; Nieto, David M 
> Subject: Re: [PATCH 4/5] drm/amdgpu: Initialize Aldebaran RLC function
> pointers
>
> [CAUTION: External Email]
>
> On Wed, Dec 15, 2021 at 1:56 PM Victor Skvortsov  
> wrote:
> >
> > In SRIOV, RLC function pointers must be initialized early as we rely
> > on the RLCG interface for all GC register access.
> >
> > Signed-off-by: Victor Skvortsov 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 2 ++
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 3 +--
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.h | 2 ++
> >  3 files changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > index 65e1f6cc59dd..1bc92a38d124 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > @@ -844,6 +844,8 @@ static int amdgpu_discovery_set_gc_ip_blocks(struct 
> > amdgpu_device *adev)
> > case IP_VERSION(9, 4, 1):
> > case IP_VERSION(9, 4, 2):
> > amdgpu_device_ip_block_add(adev,
> > _v9_0_ip_block);
> > +   if (amdgpu_sriov_vf(adev) && adev->ip_versions[GC_HWIP][0] 
> > == IP_VERSION(9, 4, 2))
> > +   gfx_v9_0_set_rlc_funcs(adev);
>
> amdgpu_discovery.c is IP independent.  I'd rather not add random IP specific 
> function calls.  gfx_v9_0_set_rlc_funcs() already gets called in 
> gfx_v9_0_early_init().  Is that not early enough?  In general we shouldn't be 
> touching the hardware much if at all in early_init.
>
> Alex
>
> > break;
> > case IP_VERSION(10, 1, 10):
> > case IP_VERSION(10, 1, 2):
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > index edb3e3b08eed..d252b06efa43 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > @@ -816,7 +816,6 @@ static void gfx_v9_0_sriov_wreg(struct
> > amdgpu_device *adev, u32 offset,  static void
> > gfx_v9_0_set_ring_funcs(struct amdgpu_device *adev);  static void
> > gfx_v9_0_set_irq_funcs(struct amdgpu_device *adev);  static void
> > gfx_v9_0_set_gds_init(struct amdgpu_device *adev); -static void
> > 

RE: [PATCH 4/5] drm/amdgpu: Initialize Aldebaran RLC function pointers

2021-12-16 Thread Liu, Shaoyun
[AMD Official Use Only]

Actually I don't know why the change "a35f147621bc drm/amdgpu: get xgmi info 
at eary_init" is not in drm-next; instead it's in amd-mainline-dkms-5.13.
That change is necessary for passing an XGMI hive through to a VM and relying on our 
driver to do the reset on the whole hive when the driver is loaded.

I checked the code again; it seems we should be OK as long as we get xgmi info 
at early_init. So since gfx_v9_0_set_rlc_funcs() already gets called in 
gfx_v9_0_early_init(), we can move get xgmi info out of gmc_early_init and 
call it as the last step of early_init.

Regards
Shaoyun.liu

-Original Message-
From: Skvortsov, Victor  
Sent: Thursday, December 16, 2021 9:28 AM
To: Alex Deucher 
Cc: amd-gfx list ; Deng, Emily 
; Liu, Monk ; Ming, Davis 
; Liu, Shaoyun ; Zhou, Peng Ju 
; Chen, JingWen ; Chen, Horace 
; Nieto, David M 
Subject: RE: [PATCH 4/5] drm/amdgpu: Initialize Aldebaran RLC function pointers

[AMD Official Use Only]

Gotcha, I will skip this patch for drm-next

-Original Message-
From: Alex Deucher 
Sent: Thursday, December 16, 2021 8:53 AM
To: Skvortsov, Victor 
Cc: amd-gfx list ; Deng, Emily 
; Liu, Monk ; Ming, Davis 
; Liu, Shaoyun ; Zhou, Peng Ju 
; Chen, JingWen ; Chen, Horace 
; Nieto, David M 
Subject: Re: [PATCH 4/5] drm/amdgpu: Initialize Aldebaran RLC function pointers

[CAUTION: External Email]

On Wed, Dec 15, 2021 at 6:58 PM Skvortsov, Victor  
wrote:
>
> [AMD Official Use Only]
>
> Hey Alex,
>
> This change was based on the fact that amd-mainline-dkms-5.13 calls 
> get_xgmi_info() in gmc_v9_0_early_init(). But I can see that in drm-next it's 
> instead called in gmc_v9_0_sw_init(). So, I'm not sure what the correct 
> behavior is. But I do agree that the change is kind of ugly. I don't know where 
> else to put it if we do need to call get_xgmi_info() in early_init.
>

We could skip this patch for drm-next and just apply it to the dkms branch.  
There's already a lot of ugly stuff in there to deal with multiple kernel 
versions.

Alex


> Thanks,
> Victor
>
> -Original Message-
> From: Alex Deucher 
> Sent: Wednesday, December 15, 2021 4:38 PM
> To: Skvortsov, Victor 
> Cc: amd-gfx list ; Deng, Emily 
> ; Liu, Monk ; Ming, Davis 
> ; Liu, Shaoyun ; Zhou, Peng 
> Ju ; Chen, JingWen ; Chen, 
> Horace ; Nieto, David M 
> Subject: Re: [PATCH 4/5] drm/amdgpu: Initialize Aldebaran RLC function 
> pointers
>
> [CAUTION: External Email]
>
> On Wed, Dec 15, 2021 at 1:56 PM Victor Skvortsov  
> wrote:
> >
> > In SRIOV, RLC function pointers must be initialized early as we rely 
> > on the RLCG interface for all GC register access.
> >
> > Signed-off-by: Victor Skvortsov 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 2 ++
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 3 +--
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.h | 2 ++
> >  3 files changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > index 65e1f6cc59dd..1bc92a38d124 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > @@ -844,6 +844,8 @@ static int amdgpu_discovery_set_gc_ip_blocks(struct 
> > amdgpu_device *adev)
> > case IP_VERSION(9, 4, 1):
> > case IP_VERSION(9, 4, 2):
> > amdgpu_device_ip_block_add(adev, 
> > _v9_0_ip_block);
> > +   if (amdgpu_sriov_vf(adev) && adev->ip_versions[GC_HWIP][0] 
> > == IP_VERSION(9, 4, 2))
> > +   gfx_v9_0_set_rlc_funcs(adev);
>
> amdgpu_discovery.c is IP independent.  I'd rather not add random IP specific 
> function calls.  gfx_v9_0_set_rlc_funcs() already gets called in 
> gfx_v9_0_early_init().  Is that not early enough?  In general we shouldn't be 
> touching the hardware much if at all in early_init.
>
> Alex
>
> > break;
> > case IP_VERSION(10, 1, 10):
> > case IP_VERSION(10, 1, 2):
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > index edb3e3b08eed..d252b06efa43 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > @@ -816,7 +816,6 @@ static void gfx_v9_0_sriov_wreg(struct 
> > amdgpu_device *adev, u32 offset,  static void 
> > gfx_v9_0_set_ring_funcs(struct amdgpu_device *adev);  static void 
> > gfx_v9_0_set_irq_funcs(struct amdgpu_device *adev);  static void 
> > gfx_v9_0_set_gds_init(struct amdgpu_device *adev); -static void 
> > gfx_v9_0_set_rlc_funcs(struct amdgpu_device *adev);  static int 
> > gfx_v9_0_get_cu_info(struct amdgpu_device *adev,
> > struct amdgpu_cu_info *cu_info); 
> > static uint64_t gfx_v9_0_get_gpu_clock_counter(struct amdgpu_device 
> > *adev); @@ -7066,7 +7065,7 @@ static void gfx_v9_0_set_irq_funcs(struct 
> > amdgpu_device *adev)
> > adev->gfx.cp_ecc_error_irq.funcs = 
> > 

Re: [PATCH] drm/amdkfd: make SPDX License expression more sound

2021-12-16 Thread Richard Fontana
On Thu, Dec 16, 2021 at 4:45 AM Lukas Bulwahn  wrote:
>
> Commit b5f57384805a ("drm/amdkfd: Add sysfs bitfields and enums to uAPI")
> adds include/uapi/linux/kfd_sysfs.h with the "GPL-2.0 OR MIT WITH
> Linux-syscall-note" SPDX-License expression.
>
> The command ./scripts/spdxcheck.py warns:
>
>   include/uapi/linux/kfd_sysfs.h: 1:48 Exception not valid for license MIT: 
> Linux-syscall-note
>
> For a uapi header, the file under GPLv2 License must be combined with the
> Linux-syscall-note, but combining the MIT License with the
> Linux-syscall-note makes no sense, as the note provides an exception for
> GPL-licensed code, not for permissively licensed code.
>
> So, reorganize the SPDX expression to only combine the note with the GPL
> License condition. This makes spdxcheck happy again.
>
> Signed-off-by: Lukas Bulwahn 
> ---
> I am not a lawyer and I do not intend to modify the actual licensing of
> this header file. So, I really would like to have an Ack from some AMD
> developer here.
>
> Maybe also a lawyer on the linux-spdx list can check my reasoning on the
> licensing with the exception note?

I believe "MIT WITH Linux-syscall-note" is a syntactically correct
SPDX expression but is otherwise sort of non-meaningful.
"(GPL-2.0 WITH Linux-syscall-note) OR MIT" is presumably what is
intended here. But yes would be good to get confirmation from someone
associated with AMD.

Richard



Re: [PATCH v2] drm/amdgpu: Call amdgpu_device_unmap_mmio() iff device is unplugged to prevent crash in GPU initialization failure

2021-12-16 Thread Andrey Grodzovsky
Maybe we should just use drm_dev_is_unplugged() for this particular case, 
because there would be no race: once the device is unplugged, that state is 
final. It's the other way around that requires a strict drm_dev_enter/exit 
scope.


Andrey

On 2021-12-16 3:38 a.m., Christian König wrote:
The !drm_dev_enter() check is quite unusual and deserves a comment 
explaining what's going on here.


Apart from that it looks good with the typos fixed I think.

Christian.

On 16.12.21 at 08:27, Chen, Guchun wrote:

[Public]

My bad, I misunderstood this.

There are both spell typos in patch subject and body, s/iff/if.

The patch is: Reviewed-by: Guchun Chen 

Please wait for the ack from Andrey and Christian before pushing this.

Regards,
Guchun

-Original Message-
From: Shi, Leslie 
Sent: Thursday, December 16, 2021 3:00 PM
To: Chen, Guchun ; Grodzovsky, Andrey 
; Koenig, Christian 
; Pan, Xinhui ; 
Deucher, Alexander ; 
amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH v2] drm/amdgpu: Call amdgpu_device_unmap_mmio() 
iff device is unplugged to prevent crash in GPU initialization failure


[Public]

Hi Guchun,

As Andrey says, "we should not call amdgpu_device_unmap_mmio unless 
device is unplugged", I think we should call 
amdgpu_device_unmap_mmio() only if the device is unplugged 
(drm_dev_enter() returns false).


+if (!drm_dev_enter(adev_to_drm(adev), &idx))
+    amdgpu_device_unmap_mmio(adev);
+else
+    drm_dev_exit(idx);


Regards,
Leslie

-Original Message-
From: Chen, Guchun 
Sent: Thursday, December 16, 2021 2:46 PM
To: Shi, Leslie ; Grodzovsky, Andrey 
; Koenig, Christian 
; Pan, Xinhui ; 
Deucher, Alexander ; 
amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH v2] drm/amdgpu: Call amdgpu_device_unmap_mmio() 
iff device is unplugged to prevent crash in GPU initialization failure


[Public]

Hi Leslie,

I think we need to modify it like:

+if (drm_dev_enter(adev_to_drm(adev), &idx)) {
+    amdgpu_device_unmap_mmio(adev);
+    drm_dev_exit(idx);
+}

Also you need to credit Andrey a 'suggested-by' in your patch.

Regards,
Guchun

-Original Message-
From: Shi, Leslie 
Sent: Thursday, December 16, 2021 2:14 PM
To: Grodzovsky, Andrey ; Koenig, Christian 
; Pan, Xinhui ; 
Deucher, Alexander ; 
amd-gfx@lists.freedesktop.org
Cc: Chen, Guchun ; Shi, Leslie 

Subject: [PATCH v2] drm/amdgpu: Call amdgpu_device_unmap_mmio() iff 
device is unplugged to prevent crash in GPU initialization failure


[Why]
In amdgpu_driver_load_kms, when amdgpu_device_init returns error 
during driver modprobe, it will start the error handle path 
immediately and call into amdgpu_device_unmap_mmio as well to release 
mapped VRAM. However, in the following release callback, driver 
stills visits the unmapped memory like vcn.inst[i].fw_shared_cpu_addr 
in vcn_v3_0_sw_fini. So a kernel crash occurs.


[How]
call amdgpu_device_unmap_mmio() iff device is unplugged to prevent 
invalid memory address in

vcn_v3_0_sw_fini() when GPU initialization failure.

Signed-off-by: Leslie Shi 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 +++-
  1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index fb03d75880ec..d3656e7b60c2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3845,6 +3845,8 @@ static void amdgpu_device_unmap_mmio(struct 
amdgpu_device *adev)

   */
  void amdgpu_device_fini_hw(struct amdgpu_device *adev)  {
+    int idx;
+
  dev_info(adev->dev, "amdgpu: finishing device.\n");
  flush_delayed_work(>delayed_init_work);
  if (adev->mman.initialized) {
@@ -3888,7 +3890,11 @@ void amdgpu_device_fini_hw(struct 
amdgpu_device *adev)

    amdgpu_gart_dummy_page_fini(adev);
  -    amdgpu_device_unmap_mmio(adev);
+    if (!drm_dev_enter(adev_to_drm(adev), &idx))
+    amdgpu_device_unmap_mmio(adev);
+    else
+    drm_dev_exit(idx);
+
  }
    void amdgpu_device_fini_sw(struct amdgpu_device *adev)
--
2.25.1




RE: [PATCH] drm/amdgpu: add support for IP discovery gc_info table v2

2021-12-16 Thread Zhang, Hawking
[AMD Official Use Only]

Reviewed-by: Hawking Zhang 

Regards,
Hawking
-Original Message-
From: amd-gfx  On Behalf Of Alex Deucher
Sent: Thursday, December 16, 2021 11:18
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander 
Subject: [PATCH] drm/amdgpu: add support for IP discovery gc_info table v2

Used on gfx9 based systems. Fixes incorrect CU counts reported in the kernel 
log.

Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1833
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 76 +--
 drivers/gpu/drm/amd/include/discovery.h   | 49 
 2 files changed, 103 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
index ea00090b3fb3..bcc9343353b5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
@@ -526,10 +526,15 @@ void amdgpu_discovery_harvest_ip(struct amdgpu_device 
*adev)
}
 }

+union gc_info {
+   struct gc_info_v1_0 v1;
+   struct gc_info_v2_0 v2;
+};
+
 int amdgpu_discovery_get_gfx_info(struct amdgpu_device *adev)  {
struct binary_header *bhdr;
-   struct gc_info_v1_0 *gc_info;
+   union gc_info *gc_info;

if (!adev->mman.discovery_bin) {
DRM_ERROR("ip discovery uninitialized\n"); @@ -537,28 +542,55 
@@ int amdgpu_discovery_get_gfx_info(struct amdgpu_device *adev)
}

bhdr = (struct binary_header *)adev->mman.discovery_bin;
-   gc_info = (struct gc_info_v1_0 *)(adev->mman.discovery_bin +
+   gc_info = (union gc_info *)(adev->mman.discovery_bin +
le16_to_cpu(bhdr->table_list[GC].offset));
-
-   adev->gfx.config.max_shader_engines = le32_to_cpu(gc_info->gc_num_se);
-   adev->gfx.config.max_cu_per_sh = 2 * 
(le32_to_cpu(gc_info->gc_num_wgp0_per_sa) +
- 
le32_to_cpu(gc_info->gc_num_wgp1_per_sa));
-   adev->gfx.config.max_sh_per_se = le32_to_cpu(gc_info->gc_num_sa_per_se);
-   adev->gfx.config.max_backends_per_se = 
le32_to_cpu(gc_info->gc_num_rb_per_se);
-   adev->gfx.config.max_texture_channel_caches = 
le32_to_cpu(gc_info->gc_num_gl2c);
-   adev->gfx.config.max_gprs = le32_to_cpu(gc_info->gc_num_gprs);
-   adev->gfx.config.max_gs_threads = 
le32_to_cpu(gc_info->gc_num_max_gs_thds);
-   adev->gfx.config.gs_vgt_table_depth = 
le32_to_cpu(gc_info->gc_gs_table_depth);
-   adev->gfx.config.gs_prim_buffer_depth = 
le32_to_cpu(gc_info->gc_gsprim_buff_depth);
-   adev->gfx.config.double_offchip_lds_buf = 
le32_to_cpu(gc_info->gc_double_offchip_lds_buffer);
-   adev->gfx.cu_info.wave_front_size = le32_to_cpu(gc_info->gc_wave_size);
-   adev->gfx.cu_info.max_waves_per_simd = 
le32_to_cpu(gc_info->gc_max_waves_per_simd);
-   adev->gfx.cu_info.max_scratch_slots_per_cu = 
le32_to_cpu(gc_info->gc_max_scratch_slots_per_cu);
-   adev->gfx.cu_info.lds_size = le32_to_cpu(gc_info->gc_lds_size);
-   adev->gfx.config.num_sc_per_sh = le32_to_cpu(gc_info->gc_num_sc_per_se) 
/
-le32_to_cpu(gc_info->gc_num_sa_per_se);
-   adev->gfx.config.num_packer_per_sc = 
le32_to_cpu(gc_info->gc_num_packer_per_sc);
-
+   switch (gc_info->v1.header.version_major) {
+   case 1:
+   adev->gfx.config.max_shader_engines = 
le32_to_cpu(gc_info->v1.gc_num_se);
+   adev->gfx.config.max_cu_per_sh = 2 * 
(le32_to_cpu(gc_info->v1.gc_num_wgp0_per_sa) +
+ 
le32_to_cpu(gc_info->v1.gc_num_wgp1_per_sa));
+   adev->gfx.config.max_sh_per_se = 
le32_to_cpu(gc_info->v1.gc_num_sa_per_se);
+   adev->gfx.config.max_backends_per_se = 
le32_to_cpu(gc_info->v1.gc_num_rb_per_se);
+   adev->gfx.config.max_texture_channel_caches = 
le32_to_cpu(gc_info->v1.gc_num_gl2c);
+   adev->gfx.config.max_gprs = 
le32_to_cpu(gc_info->v1.gc_num_gprs);
+   adev->gfx.config.max_gs_threads = 
le32_to_cpu(gc_info->v1.gc_num_max_gs_thds);
+   adev->gfx.config.gs_vgt_table_depth = 
le32_to_cpu(gc_info->v1.gc_gs_table_depth);
+   adev->gfx.config.gs_prim_buffer_depth = 
le32_to_cpu(gc_info->v1.gc_gsprim_buff_depth);
+   adev->gfx.config.double_offchip_lds_buf = 
le32_to_cpu(gc_info->v1.gc_double_offchip_lds_buffer);
+   adev->gfx.cu_info.wave_front_size = 
le32_to_cpu(gc_info->v1.gc_wave_size);
+   adev->gfx.cu_info.max_waves_per_simd = 

RE: [PATCH v2] drm/amdgpu: Separate vf2pf work item init from virt data exchange

2021-12-16 Thread Liu, Shaoyun
[AMD Official Use Only]

This one looks better and more logical.

Reviewed-by: Shaoyun.liu 

-Original Message-
From: Skvortsov, Victor  
Sent: Thursday, December 16, 2021 10:39 AM
To: amd-gfx@lists.freedesktop.org; Liu, Shaoyun ; Nieto, 
David M 
Cc: Skvortsov, Victor 
Subject: [PATCH v2] drm/amdgpu: Separate vf2pf work item init from virt data 
exchange

We want to be able to call virt data exchange conditionally after gmc sw init 
to reserve bad pages as early as possible.
Since this is a conditional call, we will need to call it again unconditionally 
later in the init sequence.

Refactor the data exchange function so it can be called multiple times without 
re-initializing the work item.

v2: Cleaned up the code. Kept the original call to init_exchange_data() inside 
early init to initialize the work item, afterwards call
exchange_data() when needed.

Signed-off-by: Victor Skvortsov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  6 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   | 36 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  1 +
 3 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 48aeca3b8f16..ddc67b900587 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2316,6 +2316,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
*adev)
 
/* need to do gmc hw init early so we can allocate gpu mem */
if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_GMC) {
+   /* Try to reserve bad pages early */
+   if (amdgpu_sriov_vf(adev))
+   amdgpu_virt_exchange_data(adev);
+
r = amdgpu_device_vram_scratch_init(adev);
if (r) {
DRM_ERROR("amdgpu_vram_scratch_init failed 
%d\n", r); @@ -2347,7 +2351,7 @@ static int amdgpu_device_ip_init(struct 
amdgpu_device *adev)
}
 
if (amdgpu_sriov_vf(adev))
-   amdgpu_virt_init_data_exchange(adev);
+   amdgpu_virt_exchange_data(adev);
 
r = amdgpu_ib_pool_init(adev);
if (r) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 3fc49823f527..f8e574cc0e22 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -622,17 +622,35 @@ void amdgpu_virt_fini_data_exchange(struct amdgpu_device 
*adev)
 
 void amdgpu_virt_init_data_exchange(struct amdgpu_device *adev)  {
-   uint64_t bp_block_offset = 0;
-   uint32_t bp_block_size = 0;
-   struct amd_sriov_msg_pf2vf_info *pf2vf_v2 = NULL;
-
adev->virt.fw_reserve.p_pf2vf = NULL;
adev->virt.fw_reserve.p_vf2pf = NULL;
adev->virt.vf2pf_update_interval_ms = 0;
 
-   if (adev->mman.fw_vram_usage_va != NULL) {
+   if (adev->bios != NULL) {
adev->virt.vf2pf_update_interval_ms = 2000;
 
+   adev->virt.fw_reserve.p_pf2vf =
+   (struct amd_sriov_msg_pf2vf_info_header *)
+   (adev->bios + (AMD_SRIOV_MSG_PF2VF_OFFSET_KB << 10));
+
+   amdgpu_virt_read_pf2vf_data(adev);
+   }
+
+   if (adev->virt.vf2pf_update_interval_ms != 0) {
+   INIT_DELAYED_WORK(&adev->virt.vf2pf_work, 
amdgpu_virt_update_vf2pf_work_item);
+   schedule_delayed_work(&(adev->virt.vf2pf_work), 
msecs_to_jiffies(adev->virt.vf2pf_update_interval_ms));
+   }
+}
+
+
+void amdgpu_virt_exchange_data(struct amdgpu_device *adev) {
+   uint64_t bp_block_offset = 0;
+   uint32_t bp_block_size = 0;
+   struct amd_sriov_msg_pf2vf_info *pf2vf_v2 = NULL;
+
+   if (adev->mman.fw_vram_usage_va != NULL) {
+
adev->virt.fw_reserve.p_pf2vf =
(struct amd_sriov_msg_pf2vf_info_header *)
(adev->mman.fw_vram_usage_va + 
(AMD_SRIOV_MSG_PF2VF_OFFSET_KB << 10)); @@ -663,16 +681,10 @@ void 
amdgpu_virt_init_data_exchange(struct amdgpu_device *adev)
(adev->bios + (AMD_SRIOV_MSG_PF2VF_OFFSET_KB << 10));
 
amdgpu_virt_read_pf2vf_data(adev);
-
-   return;
-   }
-
-   if (adev->virt.vf2pf_update_interval_ms != 0) {
-   INIT_DELAYED_WORK(&adev->virt.vf2pf_work, 
amdgpu_virt_update_vf2pf_work_item);
-   schedule_delayed_work(&(adev->virt.vf2pf_work), 
adev->virt.vf2pf_update_interval_ms);
}
 }
 
+
 void amdgpu_detect_virtualization(struct amdgpu_device *adev)  {
uint32_t reg;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
index 8d4c20bb71c5..9adfb8d63280 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
@@ -308,6 +308,7 @@ int 

[PATCH v2] drm/amdgpu: Separate vf2pf work item init from virt data exchange

2021-12-16 Thread Victor Skvortsov
We want to be able to call virt data exchange conditionally
after gmc sw init to reserve bad pages as early as possible.
Since this is a conditional call, we will need to call
it again unconditionally later in the init sequence.

Refactor the data exchange function so it can be
called multiple times without re-initializing the work item.

v2: Cleaned up the code. Kept the original call to init_exchange_data()
inside early init to initialize the work item, afterwards call
exchange_data() when needed.

Signed-off-by: Victor Skvortsov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  6 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   | 36 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  1 +
 3 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 48aeca3b8f16..ddc67b900587 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2316,6 +2316,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
*adev)
 
/* need to do gmc hw init early so we can allocate gpu mem */
if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_GMC) {
+   /* Try to reserve bad pages early */
+   if (amdgpu_sriov_vf(adev))
+   amdgpu_virt_exchange_data(adev);
+
r = amdgpu_device_vram_scratch_init(adev);
if (r) {
DRM_ERROR("amdgpu_vram_scratch_init failed 
%d\n", r);
@@ -2347,7 +2351,7 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
*adev)
}
 
if (amdgpu_sriov_vf(adev))
-   amdgpu_virt_init_data_exchange(adev);
+   amdgpu_virt_exchange_data(adev);
 
r = amdgpu_ib_pool_init(adev);
if (r) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 3fc49823f527..f8e574cc0e22 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -622,17 +622,35 @@ void amdgpu_virt_fini_data_exchange(struct amdgpu_device 
*adev)
 
 void amdgpu_virt_init_data_exchange(struct amdgpu_device *adev)
 {
-   uint64_t bp_block_offset = 0;
-   uint32_t bp_block_size = 0;
-   struct amd_sriov_msg_pf2vf_info *pf2vf_v2 = NULL;
-
adev->virt.fw_reserve.p_pf2vf = NULL;
adev->virt.fw_reserve.p_vf2pf = NULL;
adev->virt.vf2pf_update_interval_ms = 0;
 
-   if (adev->mman.fw_vram_usage_va != NULL) {
+   if (adev->bios != NULL) {
adev->virt.vf2pf_update_interval_ms = 2000;
 
+   adev->virt.fw_reserve.p_pf2vf =
+   (struct amd_sriov_msg_pf2vf_info_header *)
+   (adev->bios + (AMD_SRIOV_MSG_PF2VF_OFFSET_KB << 10));
+
+   amdgpu_virt_read_pf2vf_data(adev);
+   }
+
+   if (adev->virt.vf2pf_update_interval_ms != 0) {
+   INIT_DELAYED_WORK(&adev->virt.vf2pf_work, 
amdgpu_virt_update_vf2pf_work_item);
+   schedule_delayed_work(&(adev->virt.vf2pf_work), 
msecs_to_jiffies(adev->virt.vf2pf_update_interval_ms));
+   }
+}
+
+
+void amdgpu_virt_exchange_data(struct amdgpu_device *adev)
+{
+   uint64_t bp_block_offset = 0;
+   uint32_t bp_block_size = 0;
+   struct amd_sriov_msg_pf2vf_info *pf2vf_v2 = NULL;
+
+   if (adev->mman.fw_vram_usage_va != NULL) {
+
adev->virt.fw_reserve.p_pf2vf =
(struct amd_sriov_msg_pf2vf_info_header *)
(adev->mman.fw_vram_usage_va + 
(AMD_SRIOV_MSG_PF2VF_OFFSET_KB << 10));
@@ -663,16 +681,10 @@ void amdgpu_virt_init_data_exchange(struct amdgpu_device 
*adev)
(adev->bios + (AMD_SRIOV_MSG_PF2VF_OFFSET_KB << 10));
 
amdgpu_virt_read_pf2vf_data(adev);
-
-   return;
-   }
-
-   if (adev->virt.vf2pf_update_interval_ms != 0) {
-   INIT_DELAYED_WORK(&adev->virt.vf2pf_work, 
amdgpu_virt_update_vf2pf_work_item);
-   schedule_delayed_work(&(adev->virt.vf2pf_work), 
adev->virt.vf2pf_update_interval_ms);
}
 }
 
+
 void amdgpu_detect_virtualization(struct amdgpu_device *adev)
 {
uint32_t reg;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
index 8d4c20bb71c5..9adfb8d63280 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
@@ -308,6 +308,7 @@ int amdgpu_virt_alloc_mm_table(struct amdgpu_device *adev);
 void amdgpu_virt_free_mm_table(struct amdgpu_device *adev);
 void amdgpu_virt_release_ras_err_handler_data(struct amdgpu_device *adev);
 void amdgpu_virt_init_data_exchange(struct amdgpu_device *adev);
+void amdgpu_virt_exchange_data(struct amdgpu_device *adev);
 void amdgpu_virt_fini_data_exchange(struct amdgpu_device *adev);
 void 

Re: [PATCH] drm/amdgpu: When the VCN(1.0) block is suspended, powergating is explicitly enabled

2021-12-16 Thread Alex Deucher
FWIW, it looks like all versions of VCN need the same fix.  There have
been reports of suspend failing when VCN is in use on other newer APUs
as well.

Alex

On Tue, Dec 14, 2021 at 12:59 AM Quan, Evan  wrote:
>
> [AMD Official Use Only]
>
>
>
>
>
>
>
> From: Zhu, James 
> Sent: Monday, December 13, 2021 9:39 PM
> To: Gong, Curry ; Zhu, James ; 
> amd-gfx@lists.freedesktop.org
> Cc: Liu, Leo ; Quan, Evan ; Deucher, 
> Alexander 
> Subject: Re: [PATCH] drm/amdgpu: When the VCN(1.0) block is suspended, 
> powergating is explicitly enabled
>
>
>
> Hi Curry, Evan,
>
> It seems the vcn1.0 power gate sequence is special.
>
> If the original solution is thread safe, then the following solution will be
> thread safe as well.
>
> static int vcn_v1_0_hw_fini(void *handle)
> {
> struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>
> cancel_delayed_work_sync(&adev->vcn.idle_work);
>
> if ((adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG) ||
> (adev->vcn.cur_state != AMD_PG_STATE_GATE &&
>  RREG32_SOC15(VCN, 0, mmUVD_STATUS))) {
> +if (adev->pm.dpm_enabled)
> +amdgpu_dpm_enable_uvd(adev, false);
> +else
> +vcn_v1_0_set_powergating_state(adev, AMD_PG_STATE_GATE);
>
> [Quan, Evan] Considering adev->pm.dpm_enabled is always true,
> "vcn_v1_0_set_powergating_state(adev, AMD_PG_STATE_GATE);" will become dead
> code.
>
> Also, vcn_v1_0_hw_fini is used in other scenarios besides the
> suspend/resume discussed here. So it seems quite risky compared with
> Curry's original patch.
>
> Until we come up with a better idea, I would suggest landing Curry's
> original patch as a quick fix.
>
>
>
> BR
>
> Evan
>
>
> }
>
> Best Regards!
>
> James
>
> On 2021-12-13 3:55 a.m., Gong, Curry wrote:
>
> [AMD Official Use Only]
>
>
>
> Hi James:
>
>
>
> With the following patch, an error will be reported when the driver is loaded
>
> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
>
> @@ -1202,6 +1204,9 @@ static int vcn_v1_0_stop(struct amdgpu_device *adev)
>
> else
>
> r = vcn_v1_0_stop_spg_mode(adev);
>
>
>
> c
>
> return r;
>
> }
>
>
>
>
>
> $ dmesg
>
> [  363.181081] INFO: task kworker/3:2:223 blocked for more than 120 seconds.
>
> [  363.181150]   Tainted: G   OE 5.11.0-41-generic 
> #45~20.04.1-Ubuntu
>
> [  363.181208] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
>
> [  363.181266] task:kworker/3:2 state:D stack:0 pid:  223 ppid: 2 
> flags:0x4000
>
> [  363.181276] Workqueue: events vcn_v1_0_idle_work_handler [amdgpu]
>
> [  363.181612] Call Trace:
>
> [  363.181618]  __schedule+0x44c/0x8a0
>
> [  363.181627]  schedule+0x4f/0xc0
>
> [  363.181631]  schedule_preempt_disabled+0xe/0x10
>
> [  363.181636]  __mutex_lock.isra.0+0x183/0x4d0
>
> [  363.181643]  __mutex_lock_slowpath+0x13/0x20
>
> [  363.181648]  mutex_lock+0x32/0x40
>
> [  363.181652]  amdgpu_dpm_set_powergating_by_smu+0x9c/0x180 [amdgpu]
>
> [  363.182055]  amdgpu_dpm_enable_uvd+0x38/0x110 [amdgpu]
>
> [  363.182454]  vcn_v1_0_set_powergating_state+0x2e7e/0x3cf0 [amdgpu]
>
> [  363.182776]  amdgpu_device_ip_set_powergating_state+0x6c/0xc0 [amdgpu]
>
> [  363.183028]  smu10_powergate_vcn+0x2a/0x80 [amdgpu]
>
> [  363.183361]  pp_set_powergating_by_smu+0xc5/0x2b0 [amdgpu]
>
> [  363.183699]  amdgpu_dpm_set_powergating_by_smu+0xb6/0x180 [amdgpu]
>
> [  363.184040]  amdgpu_dpm_enable_uvd+0x38/0x110 [amdgpu]
>
> [  363.184391]  vcn_v1_0_idle_work_handler+0xe1/0x130 [amdgpu]
>
> [  363.184667]  process_one_work+0x220/0x3c0
>
> [  363.184674]  worker_thread+0x4d/0x3f0
>
> [  363.184677]  ? process_one_work+0x3c0/0x3c0
>
> [  363.184680]  kthread+0x12b/0x150
>
> [  363.184685]  ? set_kthread_struct+0x40/0x40
>
> [  363.184690]  ret_from_fork+0x22/0x30
>
> [  363.184699] INFO: task kworker/2:2:233 blocked for more than 120 seconds.
>
> [  363.184739]   Tainted: G   OE 5.11.0-41-generic 
> #45~20.04.1-Ubuntu
>
> [  363.184782] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
>
> [  363.184825] task:kworker/2:2 state:D stack:0 pid:  233 ppid: 2 
> flags:0x4000
>
> [  363.184831] Workqueue: events amdgpu_device_delayed_init_work_handler 
> [amdgpu]
>
> [  363.185085] Call Trace:
>
> [  363.185087]  __schedule+0x44c/0x8a0
>
> [  363.185092]  schedule+0x4f/0xc0
>
> [  363.185095]  schedule_timeout+0x202/0x290
>
> [  363.185099]  ? sched_clock_cpu+0x11/0xb0
>
> [  363.185105]  wait_for_completion+0x94/0x100
>
> [  363.185110]  __flush_work+0x12a/0x1e0
>
> [  363.185113]  ? worker_detach_from_pool+0xc0/0xc0
>
> [  363.185119]  __cancel_work_timer+0x10e/0x190
>
> [  363.185123]  cancel_delayed_work_sync+0x13/0x20
>
> [  363.185126]  vcn_v1_0_ring_begin_use+0x20/0x70 [amdgpu]
>
> [  363.185401]  amdgpu_ring_alloc+0x48/0x60 [amdgpu]
>
> [  363.185640]  amdgpu_ib_schedule+0x493/0x600 [amdgpu]
>
> [  363.185884]  amdgpu_job_submit_direct+0x3c/0xd0 [amdgpu]
>
> [ 

Re: Re: Re: Re: [PATCH] drm/amdgpu: fixup bad vram size on gmc v8

2021-12-16 Thread Alex Deucher
Is the issue reproducible with the same board in bare metal on x86?  Or
does it only happen with passthrough on ARM?  Looking through the archives,
the SI patch I made was for an x86 laptop.  It would be nice to root cause
this, but there weren't any gfx8 boards with more than 64G of vram, so I
think it's safe.  That said, if you see similar issues with newer gfx IPs
then we have an issue since the upper bit will be meaningful, so it would
be nice to root cause this.

Alex


On Thu, Dec 16, 2021 at 4:36 AM 周宗敏  wrote:

> Hi Christian,
>
> I'm testing the GPU passthrough feature, so I pass this GPU through to a
> virtual machine to use. It is based on an arm64 system.
>
> As far as I know, Alex dealt with a similar problem in
> drm/radeon/si.c. Maybe they have the same root cause?
>
> The historical commit message is below:
>
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0ca223b029a261e82fb2f50c52eb85d510f4260e
>
> [image: image.png]
>
>
> Thanks very much.
>
>
>
> 
>
>
>
> *Subject:* Re: Re: Re: [PATCH] drm/amdgpu: fixup bad vram size on gmc v8
>
> *Date:* 2021-12-16 16:15
> *From:* Christian König
> *To:* 周宗敏, Alex Deucher
>
>
>
>
> Hi Zongmin,
>
> that strongly sounds like the ASIC is not correctly initialized when
> trying to read the register.
>
> What board and environment are you using this GPU with? Is that a
> normal x86 system?
>
> Regards,
> Christian.
>
>
>
> On 16.12.21 at 04:11, 周宗敏 wrote:
>
> 1. The problematic board that I have tested is an [AMD/ATI] Lexa PRO
>    [Radeon RX 550/550X]; the vbios version is 113-RXF9310-C09-BT.
>
> 2. When the exception occurs, I can see the following changes in the
>    vram size read from RREG32(mmCONFIG_MEMSIZE); it seems to have
>    garbage in the upper 16 bits.
>
>    [image: image.png]
>
> 3. I can also see some dmesg output like below. When the vram size
>    register has garbage, we may see an error message like:
>
>    amdgpu :09:00.0: VRAM: 4286582784M 0x00F4 -
>    0x000FF8F4 (4286582784M used)
>
>    The correct message should look like:
>
>    amdgpu :09:00.0: VRAM: 4096M 0x00F4 -
>    0x00F4 (4096M used)
>
> If you have any problems, please send me mail.
>
> Thanks very much.
>
>
>
>
> 
>
> *Subject:* Re: [PATCH] drm/amdgpu: fixup bad vram size on gmc v8
>
> *Date:* 2021-12-16 04:23
> *From:* Alex Deucher
> *To:* Zongmin Zhou
>
>
>
>
On Wed, Dec 15, 2021 at 10:31 AM Zongmin Zhou wrote:
>  >
>  > Some boards(like RX550) seem to have garbage in the upper
>  > 16 bits of the vram size register.  Check for
>  > this and clamp the size properly.  Fixes
>  > boards reporting bogus amounts of vram.
>  >
>  > After adding this patch, the maximum GPU VRAM size is 64GB;
>  > otherwise only 64GB of vram size will be used.
>
>  Can you provide some examples of problematic boards and possibly a
>  vbios image from the problematic board?  What values are you seeing?
>  It would be nice to see what the boards are reporting and whether the
>  lower 16 bits are actually correct or if it is some other issue.  This
>  register is undefined until the asic has been initialized.  The vbios
>  programs it as part of its asic init sequence (either via vesa/gop or
>  the OS driver).
>
>  Alex
>
>
>  >
>  > Signed-off-by: Zongmin Zhou
>> ---
>>  drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 13
>  ++---
>>  1 file changed, 10 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>  b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>> index 492ebed2915b..63b890f1e8af 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>> @@ -515,10 +515,10 @@ static void
>  gmc_v8_0_mc_program(struct amdgpu_device *adev)
>>  static int gmc_v8_0_mc_init(struct amdgpu_device
>  *adev)
>>  {
>> int r;
>> +   u32 tmp;
>>
>> adev->gmc.vram_width =
>  amdgpu_atombios_get_vram_width(adev);
>> if (!adev->gmc.vram_width) {
>> -   u32 tmp;
>> int chansize, numchan;
>>
>> /* Get VRAM informations */
>> @@ -562,8 +562,15 @@ static int gmc_v8_0_mc_init(struct
>  amdgpu_device *adev)
>> adev->gmc.vram_width = numchan *
>  chansize;
>> }
>> /* size in MB on si */
>> -   adev->gmc.mc_vram_size =
>  RREG32(mmCONFIG_MEMSIZE) * 1024ULL * 1024ULL;
>> -   

Re: [PATCH] drm/amd/display: Fix warning comparing pointer to 0

2021-12-16 Thread Alex Deucher
Applied.  Thanks!

Alex

On Thu, Dec 16, 2021 at 2:50 AM Jiapeng Chong
 wrote:
>
> Fix the following coccicheck warning:
>
> ./drivers/gpu/drm/amd/display/dc/clk_mgr/dcn301/vg_clk_mgr.c:744:35-36:
> WARNING comparing pointer to 0.
>
> Reported-by: Abaci Robot 
> Signed-off-by: Jiapeng Chong 
> ---
>  drivers/gpu/drm/amd/display/dc/clk_mgr/dcn301/vg_clk_mgr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn301/vg_clk_mgr.c 
> b/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn301/vg_clk_mgr.c
> index 8f78e62b28b7..48005def1164 100644
> --- a/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn301/vg_clk_mgr.c
> +++ b/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn301/vg_clk_mgr.c
> @@ -741,7 +741,7 @@ void vg_clk_mgr_construct(
> sizeof(struct watermarks),
> &clk_mgr->smu_wm_set.mc_address.quad_part);
>
> -   if (clk_mgr->smu_wm_set.wm_set == 0) {
> +   if (!clk_mgr->smu_wm_set.wm_set) {
> clk_mgr->smu_wm_set.wm_set = &dummy_wms;
> clk_mgr->smu_wm_set.mc_address.quad_part = 0;
> }
> --
> 2.20.1.7.g153144c
>


Re: [PATCH] drm/amdgpu: add support for IP discovery gc_info table v2

2021-12-16 Thread Alex Deucher
Ping?

On Wed, Dec 15, 2021 at 10:18 PM Alex Deucher  wrote:
>
> Used on gfx9 based systems. Fixes incorrect CU counts reported
> in the kernel log.
>
> Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1833
> Signed-off-by: Alex Deucher 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 76 +--
>  drivers/gpu/drm/amd/include/discovery.h   | 49 
>  2 files changed, 103 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> index ea00090b3fb3..bcc9343353b5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> @@ -526,10 +526,15 @@ void amdgpu_discovery_harvest_ip(struct amdgpu_device 
> *adev)
> }
>  }
>
> +union gc_info {
> +   struct gc_info_v1_0 v1;
> +   struct gc_info_v2_0 v2;
> +};
> +
>  int amdgpu_discovery_get_gfx_info(struct amdgpu_device *adev)
>  {
> struct binary_header *bhdr;
> -   struct gc_info_v1_0 *gc_info;
> +   union gc_info *gc_info;
>
> if (!adev->mman.discovery_bin) {
> DRM_ERROR("ip discovery uninitialized\n");
> @@ -537,28 +542,55 @@ int amdgpu_discovery_get_gfx_info(struct amdgpu_device 
> *adev)
> }
>
> bhdr = (struct binary_header *)adev->mman.discovery_bin;
> -   gc_info = (struct gc_info_v1_0 *)(adev->mman.discovery_bin +
> +   gc_info = (union gc_info *)(adev->mman.discovery_bin +
> le16_to_cpu(bhdr->table_list[GC].offset));
> -
> -   adev->gfx.config.max_shader_engines = le32_to_cpu(gc_info->gc_num_se);
> -   adev->gfx.config.max_cu_per_sh = 2 * 
> (le32_to_cpu(gc_info->gc_num_wgp0_per_sa) +
> - 
> le32_to_cpu(gc_info->gc_num_wgp1_per_sa));
> -   adev->gfx.config.max_sh_per_se = 
> le32_to_cpu(gc_info->gc_num_sa_per_se);
> -   adev->gfx.config.max_backends_per_se = 
> le32_to_cpu(gc_info->gc_num_rb_per_se);
> -   adev->gfx.config.max_texture_channel_caches = 
> le32_to_cpu(gc_info->gc_num_gl2c);
> -   adev->gfx.config.max_gprs = le32_to_cpu(gc_info->gc_num_gprs);
> -   adev->gfx.config.max_gs_threads = 
> le32_to_cpu(gc_info->gc_num_max_gs_thds);
> -   adev->gfx.config.gs_vgt_table_depth = 
> le32_to_cpu(gc_info->gc_gs_table_depth);
> -   adev->gfx.config.gs_prim_buffer_depth = 
> le32_to_cpu(gc_info->gc_gsprim_buff_depth);
> -   adev->gfx.config.double_offchip_lds_buf = 
> le32_to_cpu(gc_info->gc_double_offchip_lds_buffer);
> -   adev->gfx.cu_info.wave_front_size = 
> le32_to_cpu(gc_info->gc_wave_size);
> -   adev->gfx.cu_info.max_waves_per_simd = 
> le32_to_cpu(gc_info->gc_max_waves_per_simd);
> -   adev->gfx.cu_info.max_scratch_slots_per_cu = 
> le32_to_cpu(gc_info->gc_max_scratch_slots_per_cu);
> -   adev->gfx.cu_info.lds_size = le32_to_cpu(gc_info->gc_lds_size);
> -   adev->gfx.config.num_sc_per_sh = 
> le32_to_cpu(gc_info->gc_num_sc_per_se) /
> -
> le32_to_cpu(gc_info->gc_num_sa_per_se);
> -   adev->gfx.config.num_packer_per_sc = 
> le32_to_cpu(gc_info->gc_num_packer_per_sc);
> -
> +   switch (gc_info->v1.header.version_major) {
> +   case 1:
> +   adev->gfx.config.max_shader_engines = 
> le32_to_cpu(gc_info->v1.gc_num_se);
> +   adev->gfx.config.max_cu_per_sh = 2 * 
> (le32_to_cpu(gc_info->v1.gc_num_wgp0_per_sa) +
> + 
> le32_to_cpu(gc_info->v1.gc_num_wgp1_per_sa));
> +   adev->gfx.config.max_sh_per_se = 
> le32_to_cpu(gc_info->v1.gc_num_sa_per_se);
> +   adev->gfx.config.max_backends_per_se = 
> le32_to_cpu(gc_info->v1.gc_num_rb_per_se);
> +   adev->gfx.config.max_texture_channel_caches = 
> le32_to_cpu(gc_info->v1.gc_num_gl2c);
> +   adev->gfx.config.max_gprs = 
> le32_to_cpu(gc_info->v1.gc_num_gprs);
> +   adev->gfx.config.max_gs_threads = 
> le32_to_cpu(gc_info->v1.gc_num_max_gs_thds);
> +   adev->gfx.config.gs_vgt_table_depth = 
> le32_to_cpu(gc_info->v1.gc_gs_table_depth);
> +   adev->gfx.config.gs_prim_buffer_depth = 
> le32_to_cpu(gc_info->v1.gc_gsprim_buff_depth);
> +   adev->gfx.config.double_offchip_lds_buf = 
> le32_to_cpu(gc_info->v1.gc_double_offchip_lds_buffer);
> +   adev->gfx.cu_info.wave_front_size = 
> le32_to_cpu(gc_info->v1.gc_wave_size);
> +   adev->gfx.cu_info.max_waves_per_simd = 
> le32_to_cpu(gc_info->v1.gc_max_waves_per_simd);
> +   adev->gfx.cu_info.max_scratch_slots_per_cu = 
> le32_to_cpu(gc_info->v1.gc_max_scratch_slots_per_cu);
> +   adev->gfx.cu_info.lds_size = 
> le32_to_cpu(gc_info->v1.gc_lds_size);
> +   adev->gfx.config.num_sc_per_sh = 
> le32_to_cpu(gc_info->v1.gc_num_sc_per_se) /
> +   le32_to_cpu(gc_info->v1.gc_num_sa_per_se);
> + 

Re: [PATCH 1/2] drm/amdgpu/UAPI: add new CTX OP to get/set stable pstates

2021-12-16 Thread Alex Deucher
On Thu, Dec 16, 2021 at 9:24 AM Christian König
 wrote:
>
>
>
> Am 16.12.21 um 15:14 schrieb Alex Deucher:
> > Add a new CTX ioctl operation to set stable pstates for profiling.
> > When creating traces for tools like RGP or using SPM or doing
> > performance profiling, it's required to enable a special
> > stable profiling power state on the GPU.  These profiling
> > states set fixed clocks and disable certain other power
> > features like powergating which may impact the results.
> >
> > Historically, these profiling pstates were enabled via sysfs,
> > but this adds an interface to enable it via the CTX ioctl
> > from the application.  Since the power state is global
> > only one application can set it at a time, so if multiple
> > applications try and use it only the first will get it,
> > the ioctl will return -EBUSY for others.  The sysfs interface
> > will override whatever has been set by this interface.
> >
> > Mesa MR: https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/207
> >
> > Signed-off-by: Alex Deucher 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c| 145 -
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h|   1 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   1 +
> >   drivers/gpu/drm/amd/pm/amdgpu_pm.c |   7 +
> >   drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h|   3 +
> >   include/uapi/drm/amdgpu_drm.h  |  17 ++-
> >   6 files changed, 171 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > index 468003583b2a..327cf308c2ab 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > @@ -237,6 +237,7 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev,
> >   ctx->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
> >   ctx->init_priority = priority;
> >   ctx->override_priority = AMDGPU_CTX_PRIORITY_UNSET;
> > + ctx->stable_pstate = AMDGPU_CTX_STABLE_PSTATE_NONE;
> >
> >   return 0;
> >   }
> > @@ -255,6 +256,102 @@ static void amdgpu_ctx_fini_entity(struct 
> > amdgpu_ctx_entity *entity)
> >   kfree(entity);
> >   }
> >
> > +static int amdgpu_ctx_get_stable_pstate(struct amdgpu_ctx *ctx,
> > + u32 *stable_pstate)
> > +{
> > + struct amdgpu_device *adev = ctx->adev;
> > + const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
> > + enum amd_dpm_forced_level current_level;
> > +
> > + if (!ctx)
> > + return -EINVAL;
> > +
> > + if (pp_funcs->get_performance_level)
> > + current_level = amdgpu_dpm_get_performance_level(adev);
> > + else
> > + current_level = adev->pm.dpm.forced_level;
> > +
> > + switch (current_level) {
> > + case AMD_DPM_FORCED_LEVEL_PROFILE_STANDARD:
> > + *stable_pstate = AMDGPU_CTX_STABLE_PSTATE_STANDARD;
> > + break;
> > + case AMD_DPM_FORCED_LEVEL_PROFILE_MIN_SCLK:
> > + *stable_pstate = AMDGPU_CTX_STABLE_PSTATE_MIN_SCLK;
> > + break;
> > + case AMD_DPM_FORCED_LEVEL_PROFILE_MIN_MCLK:
> > + *stable_pstate = AMDGPU_CTX_STABLE_PSTATE_MIN_MCLK;
> > + break;
> > + case AMD_DPM_FORCED_LEVEL_PROFILE_PEAK:
> > + *stable_pstate = AMDGPU_CTX_STABLE_PSTATE_PEAK;
> > + break;
> > + default:
> > + *stable_pstate = AMDGPU_CTX_STABLE_PSTATE_NONE;
> > + break;
> > + }
> > + return 0;
> > +}
> > +
> > +static int amdgpu_ctx_set_stable_pstate(struct amdgpu_ctx *ctx,
> > + u32 stable_pstate)
> > +{
> > + struct amdgpu_device *adev = ctx->adev;
> > + const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
> > + enum amd_dpm_forced_level level, current_level;
> > + int r = 0;
>
> I would avoid initializing the return value and rather set it below
> after everything worked out.

Will fix.

>
> > +
> > + if (!ctx)
> > + return -EINVAL;
> > +
> > + mutex_lock(&adev->pm.stable_pstate_ctx_lock);
> > + if (adev->pm.stable_pstate_ctx && adev->pm.stable_pstate_ctx != ctx) {
> > + r = -EBUSY;
> > + goto done;
> > + }
> > +
> > + switch (stable_pstate) {
> > + case AMDGPU_CTX_STABLE_PSTATE_NONE:
> > + level = AMD_DPM_FORCED_LEVEL_AUTO;
> > + break;
> > + case AMDGPU_CTX_STABLE_PSTATE_STANDARD:
> > + level = AMD_DPM_FORCED_LEVEL_PROFILE_STANDARD;
> > + break;
> > + case AMDGPU_CTX_STABLE_PSTATE_MIN_SCLK:
> > + level = AMD_DPM_FORCED_LEVEL_PROFILE_MIN_SCLK;
> > + break;
> > + case AMDGPU_CTX_STABLE_PSTATE_MIN_MCLK:
> > + level = AMD_DPM_FORCED_LEVEL_PROFILE_MIN_MCLK;
> > + break;
> > + case AMDGPU_CTX_STABLE_PSTATE_PEAK:
> > + level = AMD_DPM_FORCED_LEVEL_PROFILE_PEAK;
> > + 

Re: [PATCH] drm/amdkfd: use max() and min() to make code cleaner

2021-12-16 Thread philip yang

  


On 2021-12-15 3:52 a.m., cgel@gmail.com wrote:

From: Changcheng Deng 

Use max() and min() in order to make code cleaner.

Reported-by: Zeal Robot 
Signed-off-by: Changcheng Deng 

Reviewed-by: Philip Yang 
Applied, thanks.


  
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 7e92dcea4ce8..c6d3555b5be6 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -2254,8 +2254,8 @@ svm_range_cpu_invalidate_pagetables(struct mmu_interval_notifier *mni,
 
 	start = mni->interval_tree.start;
 	last = mni->interval_tree.last;
-	start = (start > range->start ? start : range->start) >> PAGE_SHIFT;
-	last = (last < (range->end - 1) ? last : range->end - 1) >> PAGE_SHIFT;
+	start = max(start, range->start) >> PAGE_SHIFT;
+	last = min(last, range->end - 1) >> PAGE_SHIFT;
 	pr_debug("[0x%lx 0x%lx] range[0x%lx 0x%lx] notifier[0x%lx 0x%lx] %d\n",
 		 start, last, range->start >> PAGE_SHIFT,
 		 (range->end - 1) >> PAGE_SHIFT,


  



RE: [PATCH 4/5] drm/amdgpu: Initialize Aldebaran RLC function pointers

2021-12-16 Thread Chen, Guchun
[Public]

Hi Victor,

Calling get_xgmi_info() in gmc_v9_0_early_init comes from the patch below, which
was also sent out to amd-gfx for review as an upstream patch. Do you know why it's
missing from upstream?

drm/amdgpu: get xgmi info at eary_init

Driver needs to get XGMI info earlier, before ip_init, since driver needs 
to check
the XGMI setting to determine how to perform reset during init

Signed-off-by: shaoyunl 
Acked-by: Alex Deucher 
Change-Id: Ic37276bbb6640bb4e9360220fed99494cedd3ef5

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Skvortsov, 
Victor
Sent: Thursday, December 16, 2021 10:28 PM
To: Alex Deucher 
Cc: Ming, Davis ; Chen, JingWen ; 
amd-gfx list ; Deng, Emily ; 
Nieto, David M ; Chen, Horace ; Liu, 
Monk ; Liu, Shaoyun 
Subject: RE: [PATCH 4/5] drm/amdgpu: Initialize Aldebaran RLC function pointers

[AMD Official Use Only]

Gotcha, I will skip this patch for drm-next

-Original Message-
From: Alex Deucher 
Sent: Thursday, December 16, 2021 8:53 AM
To: Skvortsov, Victor 
Cc: amd-gfx list ; Deng, Emily 
; Liu, Monk ; Ming, Davis 
; Liu, Shaoyun ; Zhou, Peng Ju 
; Chen, JingWen ; Chen, Horace 
; Nieto, David M 
Subject: Re: [PATCH 4/5] drm/amdgpu: Initialize Aldebaran RLC function pointers

[CAUTION: External Email]

On Wed, Dec 15, 2021 at 6:58 PM Skvortsov, Victor  
wrote:
>
> [AMD Official Use Only]
>
> Hey Alex,
>
> This change was based on the fact that amd-mainline-dkms-5.13 calls 
> get_xgmi_info() in gmc_v9_0_early_init(). But I can see that in drm-next it's 
> instead called in gmc_v9_0_sw_init(). So I'm not sure what the correct 
> behavior is. But I do agree that the change is kind of ugly. I don't know where 
> else to put it if we do need to call get_xgmi_info() in early_init.
>

We could skip this patch for drm-next and just apply it to the dkms branch.  
There's already a lot of ugly stuff in there to deal with multiple kernel 
versions.

Alex


> Thanks,
> Victor
>
> -Original Message-
> From: Alex Deucher 
> Sent: Wednesday, December 15, 2021 4:38 PM
> To: Skvortsov, Victor 
> Cc: amd-gfx list ; Deng, Emily 
> ; Liu, Monk ; Ming, Davis 
> ; Liu, Shaoyun ; Zhou, Peng 
> Ju ; Chen, JingWen ; Chen, 
> Horace ; Nieto, David M 
> Subject: Re: [PATCH 4/5] drm/amdgpu: Initialize Aldebaran RLC function 
> pointers
>
> [CAUTION: External Email]
>
> On Wed, Dec 15, 2021 at 1:56 PM Victor Skvortsov  
> wrote:
> >
> > In SRIOV, RLC function pointers must be initialized early as we rely 
> > on the RLCG interface for all GC register access.
> >
> > Signed-off-by: Victor Skvortsov 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 2 ++
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 3 +--
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.h | 2 ++
> >  3 files changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > index 65e1f6cc59dd..1bc92a38d124 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > @@ -844,6 +844,8 @@ static int amdgpu_discovery_set_gc_ip_blocks(struct 
> > amdgpu_device *adev)
> > case IP_VERSION(9, 4, 1):
> > case IP_VERSION(9, 4, 2):
> > amdgpu_device_ip_block_add(adev, 
> > &gfx_v9_0_ip_block);
> > +   if (amdgpu_sriov_vf(adev) && adev->ip_versions[GC_HWIP][0] 
> > == IP_VERSION(9, 4, 2))
> > +   gfx_v9_0_set_rlc_funcs(adev);
>
> amdgpu_discovery.c is IP independent.  I'd rather not add random IP specific 
> function calls.  gfx_v9_0_set_rlc_funcs() already gets called in 
> gfx_v9_0_early_init().  Is that not early enough?  In general we shouldn't be 
> touching the hardware much if at all in early_init.
>
> Alex
>
> > break;
> > case IP_VERSION(10, 1, 10):
> > case IP_VERSION(10, 1, 2):
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > index edb3e3b08eed..d252b06efa43 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > @@ -816,7 +816,6 @@ static void gfx_v9_0_sriov_wreg(struct 
> > amdgpu_device *adev, u32 offset,  static void 
> > gfx_v9_0_set_ring_funcs(struct amdgpu_device *adev);  static void 
> > gfx_v9_0_set_irq_funcs(struct amdgpu_device *adev);  static void 
> > gfx_v9_0_set_gds_init(struct amdgpu_device *adev); -static void 
> > gfx_v9_0_set_rlc_funcs(struct amdgpu_device *adev);  static int 
> > gfx_v9_0_get_cu_info(struct amdgpu_device *adev,
> > struct amdgpu_cu_info *cu_info); 
> > static uint64_t gfx_v9_0_get_gpu_clock_counter(struct amdgpu_device 
> > *adev); @@ -7066,7 +7065,7 @@ static void gfx_v9_0_set_irq_funcs(struct 
> > amdgpu_device *adev)
> > adev->gfx.cp_ecc_error_irq.funcs = 
> > &gfx_v9_0_cp_ecc_error_irq_funcs;  }
> >
> > 

Re: [PATCH] drm/amdgpu/sriov : Check the memory can be accesssed by ttm_device_clear_dma_mappings.

2021-12-16 Thread Alex Deucher
On Tue, Dec 14, 2021 at 12:44 PM Surbhi Kakarya  wrote:
>
> In an SRIOV environment, if the event guard is enabled and the VF doesn't
> receive an ack from the PF for full access, the guest driver load crashes.
> This is caused by the call to ttm_device_clear_dma_mappings with a
> non-initialized mman during driver tear-down.
>
> This patch adds a check for whether mman initialization succeeded and
> takes the appropriate path based on the result.

Is this actually sr-iov specific?  I think any failure that happens
before ttm is set up would hit this.  So I think the wording could be
updated to drop the SR-IOV.

Alex

>
> Signed-off-by: skakarya 
> Change-Id: I1c18c7eb3500687c8b6e7fc414503dcf2a20b94c
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 598250a380f5..226110be7a2f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3984,7 +3984,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>
> amdgpu_irq_fini_hw(adev);
>
> -   ttm_device_clear_dma_mappings(&adev->mman.bdev);
> +   if (adev->mman.initialized)
> +   ttm_device_clear_dma_mappings(&adev->mman.bdev);
>
> amdgpu_gart_dummy_page_fini(adev);
>
> --
> 2.25.1
>


RE: [PATCH 4/5] drm/amdgpu: Initialize Aldebaran RLC function pointers

2021-12-16 Thread Skvortsov, Victor
[AMD Official Use Only]

Gotcha, I will skip this patch for drm-next

-Original Message-
From: Alex Deucher  
Sent: Thursday, December 16, 2021 8:53 AM
To: Skvortsov, Victor 
Cc: amd-gfx list ; Deng, Emily 
; Liu, Monk ; Ming, Davis 
; Liu, Shaoyun ; Zhou, Peng Ju 
; Chen, JingWen ; Chen, Horace 
; Nieto, David M 
Subject: Re: [PATCH 4/5] drm/amdgpu: Initialize Aldebaran RLC function pointers

[CAUTION: External Email]

On Wed, Dec 15, 2021 at 6:58 PM Skvortsov, Victor  
wrote:
>
> [AMD Official Use Only]
>
> Hey Alex,
>
> This change was based on the fact that amd-mainline-dkms-5.13 calls 
> get_xgmi_info() in gmc_v9_0_early_init(). But I can see that in drm-next it's 
> instead called in gmc_v9_0_sw_init(). So I'm not sure what the correct 
> behavior is. But I do agree that the change is kind of ugly. I don't know where 
> else to put it if we do need to call get_xgmi_info() in early_init.
>

We could skip this patch for drm-next and just apply it to the dkms branch.  
There's already a lot of ugly stuff in there to deal with multiple kernel 
versions.

Alex


> Thanks,
> Victor
>
> -Original Message-
> From: Alex Deucher 
> Sent: Wednesday, December 15, 2021 4:38 PM
> To: Skvortsov, Victor 
> Cc: amd-gfx list ; Deng, Emily 
> ; Liu, Monk ; Ming, Davis 
> ; Liu, Shaoyun ; Zhou, Peng 
> Ju ; Chen, JingWen ; Chen, 
> Horace ; Nieto, David M 
> Subject: Re: [PATCH 4/5] drm/amdgpu: Initialize Aldebaran RLC function 
> pointers
>
> [CAUTION: External Email]
>
> On Wed, Dec 15, 2021 at 1:56 PM Victor Skvortsov  
> wrote:
> >
> > In SRIOV, RLC function pointers must be initialized early as we rely 
> > on the RLCG interface for all GC register access.
> >
> > Signed-off-by: Victor Skvortsov 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 2 ++
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 3 +--
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.h | 2 ++
> >  3 files changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > index 65e1f6cc59dd..1bc92a38d124 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > @@ -844,6 +844,8 @@ static int amdgpu_discovery_set_gc_ip_blocks(struct 
> > amdgpu_device *adev)
> > case IP_VERSION(9, 4, 1):
> > case IP_VERSION(9, 4, 2):
> > amdgpu_device_ip_block_add(adev, &gfx_v9_0_ip_block);
> > +   if (amdgpu_sriov_vf(adev) && adev->ip_versions[GC_HWIP][0] 
> > == IP_VERSION(9, 4, 2))
> > +   gfx_v9_0_set_rlc_funcs(adev);
>
> amdgpu_discovery.c is IP independent.  I'd rather not add random IP specific 
> function calls.  gfx_v9_0_set_rlc_funcs() already gets called in 
> gfx_v9_0_early_init().  Is that not early enough?  In general we shouldn't be 
> touching the hardware much if at all in early_init.
>
> Alex
>
> > break;
> > case IP_VERSION(10, 1, 10):
> > case IP_VERSION(10, 1, 2):
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > index edb3e3b08eed..d252b06efa43 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > @@ -816,7 +816,6 @@ static void gfx_v9_0_sriov_wreg(struct 
> > amdgpu_device *adev, u32 offset,  static void 
> > gfx_v9_0_set_ring_funcs(struct amdgpu_device *adev);  static void 
> > gfx_v9_0_set_irq_funcs(struct amdgpu_device *adev);  static void 
> > gfx_v9_0_set_gds_init(struct amdgpu_device *adev); -static void 
> > gfx_v9_0_set_rlc_funcs(struct amdgpu_device *adev);  static int 
> > gfx_v9_0_get_cu_info(struct amdgpu_device *adev,
> > struct amdgpu_cu_info *cu_info); 
> > static uint64_t gfx_v9_0_get_gpu_clock_counter(struct amdgpu_device 
> > *adev); @@ -7066,7 +7065,7 @@ static void gfx_v9_0_set_irq_funcs(struct 
> > amdgpu_device *adev)
> > adev->gfx.cp_ecc_error_irq.funcs = &gfx_v9_0_cp_ecc_error_irq_funcs;  }
> >
> > -static void gfx_v9_0_set_rlc_funcs(struct amdgpu_device *adev)
> > +void gfx_v9_0_set_rlc_funcs(struct amdgpu_device *adev)
> >  {
> > switch (adev->ip_versions[GC_HWIP][0]) {
> > case IP_VERSION(9, 0, 1):
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.h
> > b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.h
> > index dfe8d4841f58..1817e252354f 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.h
> > @@ -29,4 +29,6 @@ extern const struct amdgpu_ip_block_version 
> > gfx_v9_0_ip_block;  void gfx_v9_0_select_se_sh(struct amdgpu_device *adev, 
> > u32 se_num, u32 sh_num,
> >u32 instance);
> >
> > +void gfx_v9_0_set_rlc_funcs(struct amdgpu_device *adev);
> > +
> >  #endif
> > --
> > 2.25.1
> >


Re: [PATCH 1/2] drm/amdgpu/UAPI: add new CTX OP to get/set stable pstates

2021-12-16 Thread Christian König




Am 16.12.21 um 15:14 schrieb Alex Deucher:

Add a new CTX ioctl operation to set stable pstates for profiling.
When creating traces for tools like RGP or using SPM or doing
performance profiling, it's required to enable a special
stable profiling power state on the GPU.  These profiling
states set fixed clocks and disable certain other power
features like powergating which may impact the results.

Historically, these profiling pstates were enabled via sysfs,
but this adds an interface to enable it via the CTX ioctl
from the application.  Since the power state is global
only one application can set it at a time, so if multiple
applications try and use it only the first will get it,
the ioctl will return -EBUSY for others.  The sysfs interface
will override whatever has been set by this interface.

Mesa MR: https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/207

Signed-off-by: Alex Deucher 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c| 145 -
  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h|   1 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   1 +
  drivers/gpu/drm/amd/pm/amdgpu_pm.c |   7 +
  drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h|   3 +
  include/uapi/drm/amdgpu_drm.h  |  17 ++-
  6 files changed, 171 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 468003583b2a..327cf308c2ab 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -237,6 +237,7 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev,
ctx->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
ctx->init_priority = priority;
ctx->override_priority = AMDGPU_CTX_PRIORITY_UNSET;
+   ctx->stable_pstate = AMDGPU_CTX_STABLE_PSTATE_NONE;
  
  	return 0;

  }
@@ -255,6 +256,102 @@ static void amdgpu_ctx_fini_entity(struct 
amdgpu_ctx_entity *entity)
kfree(entity);
  }
  
+static int amdgpu_ctx_get_stable_pstate(struct amdgpu_ctx *ctx,

+   u32 *stable_pstate)
+{
+   struct amdgpu_device *adev = ctx->adev;
+   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
+   enum amd_dpm_forced_level current_level;
+
+   if (!ctx)
+   return -EINVAL;
+
+   if (pp_funcs->get_performance_level)
+   current_level = amdgpu_dpm_get_performance_level(adev);
+   else
+   current_level = adev->pm.dpm.forced_level;
+
+   switch (current_level) {
+   case AMD_DPM_FORCED_LEVEL_PROFILE_STANDARD:
+   *stable_pstate = AMDGPU_CTX_STABLE_PSTATE_STANDARD;
+   break;
+   case AMD_DPM_FORCED_LEVEL_PROFILE_MIN_SCLK:
+   *stable_pstate = AMDGPU_CTX_STABLE_PSTATE_MIN_SCLK;
+   break;
+   case AMD_DPM_FORCED_LEVEL_PROFILE_MIN_MCLK:
+   *stable_pstate = AMDGPU_CTX_STABLE_PSTATE_MIN_MCLK;
+   break;
+   case AMD_DPM_FORCED_LEVEL_PROFILE_PEAK:
+   *stable_pstate = AMDGPU_CTX_STABLE_PSTATE_PEAK;
+   break;
+   default:
+   *stable_pstate = AMDGPU_CTX_STABLE_PSTATE_NONE;
+   break;
+   }
+   return 0;
+}
+
+static int amdgpu_ctx_set_stable_pstate(struct amdgpu_ctx *ctx,
+   u32 stable_pstate)
+{
+   struct amdgpu_device *adev = ctx->adev;
+   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
+   enum amd_dpm_forced_level level, current_level;
+   int r = 0;


I would avoid initializing the return value and rather set it below 
after everything worked out.



+
+   if (!ctx)
+   return -EINVAL;
+
+   mutex_lock(&adev->pm.stable_pstate_ctx_lock);
+   if (adev->pm.stable_pstate_ctx && adev->pm.stable_pstate_ctx != ctx) {
+   r = -EBUSY;
+   goto done;
+   }
+
+   switch (stable_pstate) {
+   case AMDGPU_CTX_STABLE_PSTATE_NONE:
+   level = AMD_DPM_FORCED_LEVEL_AUTO;
+   break;
+   case AMDGPU_CTX_STABLE_PSTATE_STANDARD:
+   level = AMD_DPM_FORCED_LEVEL_PROFILE_STANDARD;
+   break;
+   case AMDGPU_CTX_STABLE_PSTATE_MIN_SCLK:
+   level = AMD_DPM_FORCED_LEVEL_PROFILE_MIN_SCLK;
+   break;
+   case AMDGPU_CTX_STABLE_PSTATE_MIN_MCLK:
+   level = AMD_DPM_FORCED_LEVEL_PROFILE_MIN_MCLK;
+   break;
+   case AMDGPU_CTX_STABLE_PSTATE_PEAK:
+   level = AMD_DPM_FORCED_LEVEL_PROFILE_PEAK;
+   break;
+   default:
+   r = -EINVAL;
+   goto done;
+   }
+
+   if (pp_funcs->get_performance_level)
+   current_level = amdgpu_dpm_get_performance_level(adev);
+   else
+   current_level = adev->pm.dpm.forced_level;


Mhm, that looks strongly like it doesn't belong into the ctx code.

Didn't Evan have some patches to clean that 

Re: [bug report] new kvmalloc() WARN() triggered by DRM ioctls tracking

2021-12-16 Thread Steven Price
On 16/12/2021 14:15, Boris Brezillon wrote:
> Hi Steve,
> 
> On Thu, 16 Dec 2021 14:02:25 +
> Steven Price  wrote:
> 
>> + Boris
>>
>> On 16/12/2021 12:08, Dan Carpenter wrote:
>>> Hi DRM Devs,
>>>
>>> In commit 7661809d493b ("mm: don't allow oversized kvmalloc() calls")
>>> from July, Linus added a WARN_ONCE() for "crazy" allocations over 2GB.
>>> I have a static checker warning for this and most of the warnings are
>>> from DRM ioctls.
>>>
>>> drivers/gpu/drm/lima/lima_drv.c:124 lima_ioctl_gem_submit() warn: uncapped 
>>> user size for kvmalloc() will WARN
>>> drivers/gpu/drm/radeon/radeon_cs.c:291 radeon_cs_parser_init() warn: 
>>> uncapped user size for kvmalloc() will WARN
>>> drivers/gpu/drm/v3d/v3d_gem.c:311 v3d_lookup_bos() warn: uncapped user size 
>>> for kvmalloc() will WARN
>>> drivers/gpu/drm/v3d/v3d_gem.c:319 v3d_lookup_bos() warn: uncapped user size 
>>> for kvmalloc() will WARN
>>> drivers/gpu/drm/v3d/v3d_gem.c:601 v3d_get_multisync_post_deps() warn: 
>>> uncapped user size for kvmalloc() will WARN
>>> drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c:476 etnaviv_ioctl_gem_submit() 
>>> warn: uncapped user size for kvmalloc() will WARN
>>> drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c:477 etnaviv_ioctl_gem_submit() 
>>> warn: uncapped user size for kvmalloc() will WARN
>>> drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c:478 etnaviv_ioctl_gem_submit() 
>>> warn: uncapped user size for kvmalloc() will WARN
>>> drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c:479 etnaviv_ioctl_gem_submit() 
>>> warn: uncapped user size for kvmalloc() will WARN
>>> drivers/gpu/drm/virtio/virtgpu_ioctl.c:186 virtio_gpu_execbuffer_ioctl() 
>>> warn: uncapped user size for kvmalloc() will WARN
>>> drivers/gpu/drm/panfrost/panfrost_drv.c:198 panfrost_copy_in_sync() warn: 
>>> uncapped user size for kvmalloc() will WARN  
>>
>> I believe this one in Panfrost would be fixed by Boris's series
>> reworking the submit ioctl[1].
>>
>> Boris: are you planning on submitting that series soon - or is it worth
>> cherry picking the rework in patch 5 to fix this issue?
> 
> Don't know when I'll get back to it, so I'd recommend cherry-picking
> what you need.

Thanks, no problem - it was mostly that when I looked at the code I had the
feeling "surely this has already been fixed", then discovered your
series was never merged ;)

I'll hammer out a patch for this one issue.

Thanks,

Steve


Re: [bug report] new kvmalloc() WARN() triggered by DRM ioctls tracking

2021-12-16 Thread Boris Brezillon
Hi Steve,

On Thu, 16 Dec 2021 14:02:25 +
Steven Price  wrote:

> + Boris
> 
> On 16/12/2021 12:08, Dan Carpenter wrote:
> > Hi DRM Devs,
> > 
> > In commit 7661809d493b ("mm: don't allow oversized kvmalloc() calls")
> > from July, Linus added a WARN_ONCE() for "crazy" allocations over 2GB.
> > I have a static checker warning for this and most of the warnings are
> > from DRM ioctls.
> > 
> > drivers/gpu/drm/lima/lima_drv.c:124 lima_ioctl_gem_submit() warn: uncapped 
> > user size for kvmalloc() will WARN
> > drivers/gpu/drm/radeon/radeon_cs.c:291 radeon_cs_parser_init() warn: 
> > uncapped user size for kvmalloc() will WARN
> > drivers/gpu/drm/v3d/v3d_gem.c:311 v3d_lookup_bos() warn: uncapped user size 
> > for kvmalloc() will WARN
> > drivers/gpu/drm/v3d/v3d_gem.c:319 v3d_lookup_bos() warn: uncapped user size 
> > for kvmalloc() will WARN
> > drivers/gpu/drm/v3d/v3d_gem.c:601 v3d_get_multisync_post_deps() warn: 
> > uncapped user size for kvmalloc() will WARN
> > drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c:476 etnaviv_ioctl_gem_submit() 
> > warn: uncapped user size for kvmalloc() will WARN
> > drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c:477 etnaviv_ioctl_gem_submit() 
> > warn: uncapped user size for kvmalloc() will WARN
> > drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c:478 etnaviv_ioctl_gem_submit() 
> > warn: uncapped user size for kvmalloc() will WARN
> > drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c:479 etnaviv_ioctl_gem_submit() 
> > warn: uncapped user size for kvmalloc() will WARN
> > drivers/gpu/drm/virtio/virtgpu_ioctl.c:186 virtio_gpu_execbuffer_ioctl() 
> > warn: uncapped user size for kvmalloc() will WARN
> > drivers/gpu/drm/panfrost/panfrost_drv.c:198 panfrost_copy_in_sync() warn: 
> > uncapped user size for kvmalloc() will WARN  
> 
> I believe this one in Panfrost would be fixed by Boris's series
> reworking the submit ioctl[1].
> 
> Boris: are you planning on submitting that series soon - or is it worth
> cherry picking the rework in patch 5 to fix this issue?

Don't know when I'll get back to it, so I'd recommend cherry-picking
what you need.

Regards,

Boris


[PATCH 2/2] drm/amdgpu: bump driver version for new CTX OP to set/get stable pstates

2021-12-16 Thread Alex Deucher
So mesa and tools know when this is available.

Mesa MR: https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/207

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index bc1355c6248d..320706b3a8e0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -99,9 +99,10 @@
  * - 3.42.0 - Add 16bpc fixed point display support
  * - 3.43.0 - Add device hot plug/unplug support
  * - 3.44.0 - DCN3 supports DCC independent block settings: !64B && 128B, 64B 
&& 128B
+ * - 3.45.0 - Add context ioctl stable pstate interface
  */
 #define KMS_DRIVER_MAJOR   3
-#define KMS_DRIVER_MINOR   44
+#define KMS_DRIVER_MINOR   45
 #define KMS_DRIVER_PATCHLEVEL  0
 
 int amdgpu_vram_limit;
-- 
2.33.1



[PATCH 1/2] drm/amdgpu/UAPI: add new CTX OP to get/set stable pstates

2021-12-16 Thread Alex Deucher
Add a new CTX ioctl operation to set stable pstates for profiling.
When creating traces for tools like RGP or using SPM or doing
performance profiling, it's required to enable a special
stable profiling power state on the GPU.  These profiling
states set fixed clocks and disable certain other power
features like powergating which may impact the results.

Historically, these profiling pstates were enabled via sysfs,
but this adds an interface to enable it via the CTX ioctl
from the application.  Since the power state is global
only one application can set it at a time, so if multiple
applications try and use it only the first will get it,
the ioctl will return -EBUSY for others.  The sysfs interface
will override whatever has been set by this interface.
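The pstate-to-forced-level mapping the new CTX op implements can be modeled in isolation. The sketch below is a userspace-side model of the switch in amdgpu_ctx_set_stable_pstate(); the enum encodings here are illustrative only, since the authoritative AMDGPU_CTX_STABLE_PSTATE_* and AMD_DPM_FORCED_LEVEL_* values come from the kernel headers:

```c
#include <assert.h>

/* Illustrative stand-ins for the uapi/driver enums; real values live in
 * amdgpu_drm.h and amdgpu_dpm.h. */
enum stable_pstate {
	STABLE_PSTATE_NONE,
	STABLE_PSTATE_STANDARD,
	STABLE_PSTATE_MIN_SCLK,
	STABLE_PSTATE_MIN_MCLK,
	STABLE_PSTATE_PEAK,
};

enum forced_level {
	FORCED_LEVEL_AUTO,
	FORCED_LEVEL_PROFILE_STANDARD,
	FORCED_LEVEL_PROFILE_MIN_SCLK,
	FORCED_LEVEL_PROFILE_MIN_MCLK,
	FORCED_LEVEL_PROFILE_PEAK,
};

/* Mirrors the set-path switch; -1 corresponds to the ioctl's -EINVAL case. */
static int pstate_to_level(enum stable_pstate p)
{
	switch (p) {
	case STABLE_PSTATE_NONE:     return FORCED_LEVEL_AUTO;
	case STABLE_PSTATE_STANDARD: return FORCED_LEVEL_PROFILE_STANDARD;
	case STABLE_PSTATE_MIN_SCLK: return FORCED_LEVEL_PROFILE_MIN_SCLK;
	case STABLE_PSTATE_MIN_MCLK: return FORCED_LEVEL_PROFILE_MIN_MCLK;
	case STABLE_PSTATE_PEAK:     return FORCED_LEVEL_PROFILE_PEAK;
	default:                     return -1;
	}
}
```

The get path in the patch is simply the inverse of this table, with anything unrecognized reported as NONE.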

Mesa MR: https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/207

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c| 145 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h|   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   1 +
 drivers/gpu/drm/amd/pm/amdgpu_pm.c |   7 +
 drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h|   3 +
 include/uapi/drm/amdgpu_drm.h  |  17 ++-
 6 files changed, 171 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 468003583b2a..327cf308c2ab 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -237,6 +237,7 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev,
ctx->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
ctx->init_priority = priority;
ctx->override_priority = AMDGPU_CTX_PRIORITY_UNSET;
+   ctx->stable_pstate = AMDGPU_CTX_STABLE_PSTATE_NONE;
 
return 0;
 }
@@ -255,6 +256,102 @@ static void amdgpu_ctx_fini_entity(struct 
amdgpu_ctx_entity *entity)
kfree(entity);
 }
 
+static int amdgpu_ctx_get_stable_pstate(struct amdgpu_ctx *ctx,
+   u32 *stable_pstate)
+{
+   struct amdgpu_device *adev = ctx->adev;
+   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
+   enum amd_dpm_forced_level current_level;
+
+   if (!ctx)
+   return -EINVAL;
+
+   if (pp_funcs->get_performance_level)
+   current_level = amdgpu_dpm_get_performance_level(adev);
+   else
+   current_level = adev->pm.dpm.forced_level;
+
+   switch (current_level) {
+   case AMD_DPM_FORCED_LEVEL_PROFILE_STANDARD:
+   *stable_pstate = AMDGPU_CTX_STABLE_PSTATE_STANDARD;
+   break;
+   case AMD_DPM_FORCED_LEVEL_PROFILE_MIN_SCLK:
+   *stable_pstate = AMDGPU_CTX_STABLE_PSTATE_MIN_SCLK;
+   break;
+   case AMD_DPM_FORCED_LEVEL_PROFILE_MIN_MCLK:
+   *stable_pstate = AMDGPU_CTX_STABLE_PSTATE_MIN_MCLK;
+   break;
+   case AMD_DPM_FORCED_LEVEL_PROFILE_PEAK:
+   *stable_pstate = AMDGPU_CTX_STABLE_PSTATE_PEAK;
+   break;
+   default:
+   *stable_pstate = AMDGPU_CTX_STABLE_PSTATE_NONE;
+   break;
+   }
+   return 0;
+}
+
+static int amdgpu_ctx_set_stable_pstate(struct amdgpu_ctx *ctx,
+   u32 stable_pstate)
+{
+   struct amdgpu_device *adev = ctx->adev;
+   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
+   enum amd_dpm_forced_level level, current_level;
+   int r = 0;
+
+   if (!ctx)
+   return -EINVAL;
+
+   mutex_lock(&adev->pm.stable_pstate_ctx_lock);
+   if (adev->pm.stable_pstate_ctx && adev->pm.stable_pstate_ctx != ctx) {
+   r = -EBUSY;
+   goto done;
+   }
+
+   switch (stable_pstate) {
+   case AMDGPU_CTX_STABLE_PSTATE_NONE:
+   level = AMD_DPM_FORCED_LEVEL_AUTO;
+   break;
+   case AMDGPU_CTX_STABLE_PSTATE_STANDARD:
+   level = AMD_DPM_FORCED_LEVEL_PROFILE_STANDARD;
+   break;
+   case AMDGPU_CTX_STABLE_PSTATE_MIN_SCLK:
+   level = AMD_DPM_FORCED_LEVEL_PROFILE_MIN_SCLK;
+   break;
+   case AMDGPU_CTX_STABLE_PSTATE_MIN_MCLK:
+   level = AMD_DPM_FORCED_LEVEL_PROFILE_MIN_MCLK;
+   break;
+   case AMDGPU_CTX_STABLE_PSTATE_PEAK:
+   level = AMD_DPM_FORCED_LEVEL_PROFILE_PEAK;
+   break;
+   default:
+   r = -EINVAL;
+   goto done;
+   }
+
+   if (pp_funcs->get_performance_level)
+   current_level = amdgpu_dpm_get_performance_level(adev);
+   else
+   current_level = adev->pm.dpm.forced_level;
+
+   if ((current_level != level) && pp_funcs->force_performance_level) {
+   mutex_lock(&adev->pm.mutex);
+   r = amdgpu_dpm_force_performance_level(adev, level);
+   if (!r)
+   adev->pm.dpm.forced_level = level;
+   

Re: [bug report] new kvmalloc() WARN() triggered by DRM ioctls tracking

2021-12-16 Thread Steven Price
+ Boris

On 16/12/2021 12:08, Dan Carpenter wrote:
> Hi DRM Devs,
> 
> In commit 7661809d493b ("mm: don't allow oversized kvmalloc() calls")
> from July, Linus added a WARN_ONCE() for "crazy" allocations over 2GB.
> I have a static checker warning for this and most of the warnings are
> from DRM ioctls.
> 
> drivers/gpu/drm/lima/lima_drv.c:124 lima_ioctl_gem_submit() warn: uncapped 
> user size for kvmalloc() will WARN
> drivers/gpu/drm/radeon/radeon_cs.c:291 radeon_cs_parser_init() warn: uncapped 
> user size for kvmalloc() will WARN
> drivers/gpu/drm/v3d/v3d_gem.c:311 v3d_lookup_bos() warn: uncapped user size 
> for kvmalloc() will WARN
> drivers/gpu/drm/v3d/v3d_gem.c:319 v3d_lookup_bos() warn: uncapped user size 
> for kvmalloc() will WARN
> drivers/gpu/drm/v3d/v3d_gem.c:601 v3d_get_multisync_post_deps() warn: 
> uncapped user size for kvmalloc() will WARN
> drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c:476 etnaviv_ioctl_gem_submit() 
> warn: uncapped user size for kvmalloc() will WARN
> drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c:477 etnaviv_ioctl_gem_submit() 
> warn: uncapped user size for kvmalloc() will WARN
> drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c:478 etnaviv_ioctl_gem_submit() 
> warn: uncapped user size for kvmalloc() will WARN
> drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c:479 etnaviv_ioctl_gem_submit() 
> warn: uncapped user size for kvmalloc() will WARN
> drivers/gpu/drm/virtio/virtgpu_ioctl.c:186 virtio_gpu_execbuffer_ioctl() 
> warn: uncapped user size for kvmalloc() will WARN
> drivers/gpu/drm/panfrost/panfrost_drv.c:198 panfrost_copy_in_sync() warn: 
> uncapped user size for kvmalloc() will WARN

I believe this one in Panfrost would be fixed by Boris's series
reworking the submit ioctl[1].

Boris: are you planning on submitting that series soon - or is it worth
cherry picking the rework in patch 5 to fix this issue?

[1]
https://lore.kernel.org/all/20210930190954.1525933-1-boris.brezil...@collabora.com/

> drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:120 amdgpu_cs_parser_init() warn: 
> uncapped user size for kvmalloc() will WARN
> 
> These ioctls can all trigger the stack dump.  The line numbers are from
> linux next (next-20211214).
> 
> I feel like ideally if this could be fixed in a central way, but if not
> then hopefully I've added the relevant lists to the CC.

I've only looked at Panfrost, but at least here we're simply allowing
user space to allocate an arbitrary amount of kernel memory in one go -
which is always going to be a good way of triggering the OOM killer if
nothing else. Boris's series includes a change so that, instead of trying
to copy an (attacker-controlled) sized array into the kernel to process,
we copy each element of the array in turn.
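Sketched in userspace terms, that per-element pattern looks roughly like this (memcpy stands in for copy_from_user, and the struct/function names are hypothetical, not the actual Panfrost ones):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sync_item {
	unsigned int handle;
	unsigned int flags;
};

/* Userspace stand-in for copy_from_user(): returns bytes NOT copied. */
static unsigned long fake_copy_from_user(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
	return 0;
}

/* Instead of one kvmalloc(count * sizeof(*items)) with a user-controlled
 * count, copy and process one element per iteration using a fixed-size
 * on-stack buffer, so no oversized allocation can ever be requested. */
static int process_sync_items(const struct sync_item *user_ptr, size_t count,
			      unsigned int *handle_sum)
{
	size_t i;

	*handle_sum = 0;
	for (i = 0; i < count; i++) {
		struct sync_item item;

		if (fake_copy_from_user(&item, &user_ptr[i], sizeof(item)))
			return -14; /* would be -EFAULT in the kernel */
		*handle_sum += item.handle; /* per-element processing */
	}
	return 0;
}
```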

So I don't really see how this could be fixed in a central way (but some
of the other cases might be different).

Thanks,

Steve

> regards,
> dan carpenter
> 



Re: [PATCH] drm/amdgpu: Filter security violation registers

2021-12-16 Thread Deucher, Alexander
[Public]

Reviewed-by: Alex Deucher 

From: amd-gfx  on behalf of Bokun Zhang 

Sent: Wednesday, December 15, 2021 7:52 PM
To: amd-gfx@lists.freedesktop.org 
Cc: Zhang, Bokun 
Subject: [PATCH] drm/amdgpu: Filter security violation registers

Recently, there was a security policy update under SRIOV.
We need to filter the registers that hit the violation
and move the code to the host driver side so that
the guest driver can execute correctly.

Signed-off-by: Bokun Zhang 
Change-Id: Ida893bb17de17a80e865c7662f04c5562f5d2727
---
 drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 83 ++
 1 file changed, 46 insertions(+), 37 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c 
b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
index 4f546f632223..d3d6d5b045b8 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
@@ -542,9 +542,6 @@ static void sdma_v5_2_ctx_switch_enable(struct 
amdgpu_device *adev, bool enable)
 }

 for (i = 0; i < adev->sdma.num_instances; i++) {
-   f32_cntl = RREG32(sdma_v5_2_get_reg_offset(adev, i, 
mmSDMA0_CNTL));
-   f32_cntl = REG_SET_FIELD(f32_cntl, SDMA0_CNTL,
-   AUTO_CTXSW_ENABLE, enable ? 1 : 0);
 if (enable && amdgpu_sdma_phase_quantum) {
 WREG32_SOC15_IP(GC, sdma_v5_2_get_reg_offset(adev, i, 
mmSDMA0_PHASE0_QUANTUM),
phase_quantum);
@@ -553,7 +550,13 @@ static void sdma_v5_2_ctx_switch_enable(struct 
amdgpu_device *adev, bool enable)
 WREG32_SOC15_IP(GC, sdma_v5_2_get_reg_offset(adev, i, 
mmSDMA0_PHASE2_QUANTUM),
phase_quantum);
 }
-   WREG32(sdma_v5_2_get_reg_offset(adev, i, mmSDMA0_CNTL), 
f32_cntl);
+
+   if (!amdgpu_sriov_vf(adev)) {
+   f32_cntl = RREG32(sdma_v5_2_get_reg_offset(adev, i, 
mmSDMA0_CNTL));
+   f32_cntl = REG_SET_FIELD(f32_cntl, SDMA0_CNTL,
+   AUTO_CTXSW_ENABLE, enable ? 1 : 0);
+   WREG32(sdma_v5_2_get_reg_offset(adev, i, mmSDMA0_CNTL), 
f32_cntl);
+   }
 }

 }
@@ -576,10 +579,12 @@ static void sdma_v5_2_enable(struct amdgpu_device *adev, 
bool enable)
 sdma_v5_2_rlc_stop(adev);
 }

-   for (i = 0; i < adev->sdma.num_instances; i++) {
-   f32_cntl = RREG32(sdma_v5_2_get_reg_offset(adev, i, 
mmSDMA0_F32_CNTL));
-   f32_cntl = REG_SET_FIELD(f32_cntl, SDMA0_F32_CNTL, HALT, enable 
? 0 : 1);
-   WREG32(sdma_v5_2_get_reg_offset(adev, i, mmSDMA0_F32_CNTL), 
f32_cntl);
+   if (!amdgpu_sriov_vf(adev)) {
+   for (i = 0; i < adev->sdma.num_instances; i++) {
+   f32_cntl = RREG32(sdma_v5_2_get_reg_offset(adev, i, 
mmSDMA0_F32_CNTL));
+   f32_cntl = REG_SET_FIELD(f32_cntl, SDMA0_F32_CNTL, 
HALT, enable ? 0 : 1);
+   WREG32(sdma_v5_2_get_reg_offset(adev, i, 
mmSDMA0_F32_CNTL), f32_cntl);
+   }
 }
 }

@@ -608,7 +613,8 @@ static int sdma_v5_2_gfx_resume(struct amdgpu_device *adev)
ring = &adev->sdma.instance[i].ring;
 wb_offset = (ring->rptr_offs * 4);

-   WREG32_SOC15_IP(GC, sdma_v5_2_get_reg_offset(adev, i, 
mmSDMA0_SEM_WAIT_FAIL_TIMER_CNTL), 0);
+   if (!amdgpu_sriov_vf(adev))
+   WREG32_SOC15_IP(GC, sdma_v5_2_get_reg_offset(adev, i, 
mmSDMA0_SEM_WAIT_FAIL_TIMER_CNTL), 0);

 /* Set ring buffer size in dwords */
 rb_bufsz = order_base_2(ring->ring_size / 4);
@@ -683,32 +689,34 @@ static int sdma_v5_2_gfx_resume(struct amdgpu_device 
*adev)
 sdma_v5_2_ring_set_wptr(ring);

 /* set minor_ptr_update to 0 after wptr programed */
-   WREG32_SOC15_IP(GC, sdma_v5_2_get_reg_offset(adev, i, 
mmSDMA0_GFX_MINOR_PTR_UPDATE), 0);

-   /* set utc l1 enable flag always to 1 */
-   temp = RREG32(sdma_v5_2_get_reg_offset(adev, i, mmSDMA0_CNTL));
-   temp = REG_SET_FIELD(temp, SDMA0_CNTL, UTC_L1_ENABLE, 1);
-
-   /* enable MCBP */
-   temp = REG_SET_FIELD(temp, SDMA0_CNTL, MIDCMD_PREEMPT_ENABLE, 
1);
-   WREG32(sdma_v5_2_get_reg_offset(adev, i, mmSDMA0_CNTL), temp);
-
-   /* Set up RESP_MODE to non-copy addresses */
-   temp = RREG32_SOC15_IP(GC, sdma_v5_2_get_reg_offset(adev, i, 
mmSDMA0_UTCL1_CNTL));
-   temp = REG_SET_FIELD(temp, SDMA0_UTCL1_CNTL, RESP_MODE, 3);
-   temp = REG_SET_FIELD(temp, SDMA0_UTCL1_CNTL, REDO_DELAY, 9);
-   WREG32_SOC15_IP(GC, sdma_v5_2_get_reg_offset(adev, i, 
mmSDMA0_UTCL1_CNTL), temp);
-
-   /* program default cache read and write policy */
-   temp = 

Re: [PATCH 4/5] drm/amdgpu: Initialize Aldebaran RLC function pointers

2021-12-16 Thread Alex Deucher
On Wed, Dec 15, 2021 at 6:58 PM Skvortsov, Victor
 wrote:
>
> [AMD Official Use Only]
>
> Hey Alex,
>
> This change was based on the fact that amd-mainline-dkms-5.13 calls
> get_xgmi_info() in gmc_v9_0_early_init(). But I can see that in drm-next it's
> instead called in gmc_v9_0_sw_init(). So, I'm not sure what's the correct
> behavior. But I do agree that the change is kind of ugly. I don't know where
> else to put it if we do need to call get_xgmi_info() in early_init.
>

We could skip this patch for drm-next and just apply it to the dkms
branch.  There's already a lot of ugly stuff in there to deal with
multiple kernel versions.

Alex


> Thanks,
> Victor
>
> -Original Message-
> From: Alex Deucher 
> Sent: Wednesday, December 15, 2021 4:38 PM
> To: Skvortsov, Victor 
> Cc: amd-gfx list ; Deng, Emily 
> ; Liu, Monk ; Ming, Davis 
> ; Liu, Shaoyun ; Zhou, Peng Ju 
> ; Chen, JingWen ; Chen, Horace 
> ; Nieto, David M 
> Subject: Re: [PATCH 4/5] drm/amdgpu: Initialize Aldebaran RLC function 
> pointers
>
> [CAUTION: External Email]
>
> On Wed, Dec 15, 2021 at 1:56 PM Victor Skvortsov  
> wrote:
> >
> > In SRIOV, RLC function pointers must be initialized early as we rely
> > on the RLCG interface for all GC register access.
> >
> > Signed-off-by: Victor Skvortsov 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 2 ++
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 3 +--
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.h | 2 ++
> >  3 files changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > index 65e1f6cc59dd..1bc92a38d124 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > @@ -844,6 +844,8 @@ static int amdgpu_discovery_set_gc_ip_blocks(struct 
> > amdgpu_device *adev)
> > case IP_VERSION(9, 4, 1):
> > case IP_VERSION(9, 4, 2):
> > amdgpu_device_ip_block_add(adev, &gfx_v9_0_ip_block);
> > +   if (amdgpu_sriov_vf(adev) && adev->ip_versions[GC_HWIP][0] 
> > == IP_VERSION(9, 4, 2))
> > +   gfx_v9_0_set_rlc_funcs(adev);
>
> amdgpu_discovery.c is IP independent.  I'd rather not add random IP specific 
> function calls.  gfx_v9_0_set_rlc_funcs() already gets called in 
> gfx_v9_0_early_init().  Is that not early enough?  In general we shouldn't be 
> touching the hardware much if at all in early_init.
>
> Alex
>
> > break;
> > case IP_VERSION(10, 1, 10):
> > case IP_VERSION(10, 1, 2):
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > index edb3e3b08eed..d252b06efa43 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > @@ -816,7 +816,6 @@ static void gfx_v9_0_sriov_wreg(struct
> > amdgpu_device *adev, u32 offset,  static void
> > gfx_v9_0_set_ring_funcs(struct amdgpu_device *adev);  static void
> > gfx_v9_0_set_irq_funcs(struct amdgpu_device *adev);  static void
> > gfx_v9_0_set_gds_init(struct amdgpu_device *adev); -static void
> > gfx_v9_0_set_rlc_funcs(struct amdgpu_device *adev);  static int 
> > gfx_v9_0_get_cu_info(struct amdgpu_device *adev,
> > struct amdgpu_cu_info *cu_info);
> > static uint64_t gfx_v9_0_get_gpu_clock_counter(struct amdgpu_device
> > *adev); @@ -7066,7 +7065,7 @@ static void gfx_v9_0_set_irq_funcs(struct 
> > amdgpu_device *adev)
> > adev->gfx.cp_ecc_error_irq.funcs = &gfx_v9_0_cp_ecc_error_irq_funcs;  }
> >
> > -static void gfx_v9_0_set_rlc_funcs(struct amdgpu_device *adev)
> > +void gfx_v9_0_set_rlc_funcs(struct amdgpu_device *adev)
> >  {
> > switch (adev->ip_versions[GC_HWIP][0]) {
> > case IP_VERSION(9, 0, 1):
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.h
> > b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.h
> > index dfe8d4841f58..1817e252354f 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.h
> > @@ -29,4 +29,6 @@ extern const struct amdgpu_ip_block_version
> > gfx_v9_0_ip_block;  void gfx_v9_0_select_se_sh(struct amdgpu_device *adev, 
> > u32 se_num, u32 sh_num,
> >u32 instance);
> >
> > +void gfx_v9_0_set_rlc_funcs(struct amdgpu_device *adev);
> > +
> >  #endif
> > --
> > 2.25.1
> >


[PATCH] drm/amdkfd: make SPDX License expression more sound

2021-12-16 Thread Lukas Bulwahn
Commit b5f57384805a ("drm/amdkfd: Add sysfs bitfields and enums to uAPI")
adds include/uapi/linux/kfd_sysfs.h with the "GPL-2.0 OR MIT WITH
Linux-syscall-note" SPDX-License expression.

The command ./scripts/spdxcheck.py warns:

  include/uapi/linux/kfd_sysfs.h: 1:48 Exception not valid for license MIT: 
Linux-syscall-note

For a uapi header, the file under GPLv2 License must be combined with the
Linux-syscall-note, but combining the MIT License with the
Linux-syscall-note makes no sense, as the note provides an exception for
GPL-licensed code, not for permissively licensed code.

So, reorganize the SPDX expression to only combine the note with the GPL
License condition. This makes spdxcheck happy again.

Signed-off-by: Lukas Bulwahn 
---
I am not a lawyer and I do not intend to modify the actual licensing of
this header file. So, I really would like to have an Ack from some AMD
developer here.

Maybe also a lawyer on the linux-spdx list can check my reasoning on the
licensing with the exception note?

 include/uapi/linux/kfd_sysfs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/kfd_sysfs.h b/include/uapi/linux/kfd_sysfs.h
index e1fb78b4bf09..3e330f368917 100644
--- a/include/uapi/linux/kfd_sysfs.h
+++ b/include/uapi/linux/kfd_sysfs.h
@@ -1,4 +1,4 @@
-/* SPDX-License-Identifier: GPL-2.0 OR MIT WITH Linux-syscall-note */
+/* SPDX-License-Identifier: (GPL-2.0 WITH Linux-syscall-note) OR MIT */
 /*
  * Copyright 2021 Advanced Micro Devices, Inc.
  *
-- 
2.17.1



Re: Re: [PATCH] drm/amdgpu: fixup bad vram size on gmc v8

2021-12-16 Thread 周宗敏
Hi Christian,

I'm testing the GPU passthrough feature, so I pass this GPU through to a
virtual machine to use. It is based on an arm64 system.

As far as I know, Alex dealt with a similar problem in drm/radeon/si.c.
Maybe they have the same cause?

The historical commit message is below:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0ca223b029a261e82fb2f50c52eb85d510f4260e

Thanks very much.

Subject: Re: Re: [PATCH] drm/amdgpu: fixup bad vram size on gmc v8
Date: 2021-12-16 16:15
From: Christian König
To: 周宗敏, Alex Deucher
        Hi Zongmin,
    
    that strongly sounds like the ASIC is not correctly initialized when
    trying to read the register.
    
    What board and environment are you using this GPU with? Is that a
    normal x86 system?
    
    Regards,
    Christian.
    
    On 16.12.21 at 04:11, 周宗敏 wrote:

    The problematic board that I have tested is an [AMD/ATI] Lexa PRO
    [Radeon RX 550/550X]; the vbios version is 113-RXF9310-C09-BT.

    When an exception occurs, I can see the following changes in the vram
    size value read from RREG32(mmCONFIG_MEMSIZE); it seems to have
    garbage in the upper 16 bits.

    I can also see dmesg output like the following. When the vram size
    register has garbage, we may see an error message like:

    amdgpu :09:00.0: VRAM: 4286582784M 0x00F4 - 0x000FF8F4 (4286582784M used)

    The correct message should look like:

    amdgpu :09:00.0: VRAM: 4096M 0x00F4 - 0x00F4 (4096M used)

    If you have any problems, please send me mail. Thanks very much.
    Subject: Re: [PATCH] drm/amdgpu: fixup bad vram size on gmc v8
    Date: 2021-12-16 04:23
    From: Alex Deucher
    To: Zongmin Zhou

    On Wed, Dec 15, 2021 at 10:31 AM Zongmin Zhou wrote:
          >
          > Some boards(like RX550) seem to have garbage in the upper
          > 16 bits of the vram size register.  Check for
          > this and clamp the size properly.  Fixes
          > boards reporting bogus amounts of vram.
          >
          > after add this patch,the maximum GPU VRAM size is 64GB,
          > otherwise only 64GB vram size will be used.
          
          Can you provide some examples of problematic boards and
          possibly a
          vbios image from the problematic board?  What values are you
          seeing?
          It would be nice to see what the boards are reporting and
          whether the
          lower 16 bits are actually correct or if it is some other
          issue.  This
          register is undefined until the asic has been initialized.
           The vbios
          programs it as part of its asic init sequence (either via
          vesa/gop or
          the OS driver).
          
          Alex
          
          
          >
          > Signed-off-by: Zongmin Zhou
            > ---
            >  drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 13
            ++---
            >  1 file changed, 10 insertions(+), 3 deletions(-)
            >
            > diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
            b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
            > index 492ebed2915b..63b890f1e8af 100644
            > --- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
            > +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
            > @@ -515,10 +515,10 @@ static void
            gmc_v8_0_mc_program(struct amdgpu_device *adev)
            >  static int gmc_v8_0_mc_init(struct amdgpu_device
            *adev)
            >  {
            >         int r;
            > +       u32 tmp;
            >
            >         adev->gmc.vram_width =
            amdgpu_atombios_get_vram_width(adev);
            >         if (!adev->gmc.vram_width) {
            > -               u32 tmp;
            >                 int chansize, numchan;
            >
            >                 /* Get VRAM informations */
            > @@ -562,8 +562,15 @@ static int gmc_v8_0_mc_init(struct
            amdgpu_device *adev)
            >                 adev->gmc.vram_width = numchan *
            chansize;
            >         }
            >         /* size in MB on si */
            > -       adev->gmc.mc_vram_size =
            RREG32(mmCONFIG_MEMSIZE) * 1024ULL * 1024ULL;
            > -       adev->gmc.real_vram_size =
            RREG32(mmCONFIG_MEMSIZE) * 1024ULL * 1024ULL;
            > +       tmp = RREG32(mmCONFIG_MEMSIZE);
            > +       /* some boards may have garbage in the upper 16
            bits */
            > +       if (tmp & 0x) {
            > +               DRM_INFO("Probable bad vram size:
            0x%08x\n", tmp);
            > +               if (tmp & 0x)
            > +                   

RE: [PATCH 3/4] drm/amdkfd: add reset queue function for RAS poison

2021-12-16 Thread Zhang, Hawking
+
+   int (*unmap_queues_cpsch_poison)(struct device_queue_manager *dqm,
+   uint16_t pasid);
 };

Might be better to call it reset_queue directly (to match update_queue, 
create_queue, etc.)

Others look good to me

The series (4 patches) is

Reviewed-by: Hawking Zhang 

Regards,
Hawking
-Original Message-
From: Zhou1, Tao  
Sent: Thursday, December 16, 2021 19:36
To: amd-gfx@lists.freedesktop.org; Zhang, Hawking ; 
Yang, Stanley ; Chai, Thomas ; 
Kuehling, Felix 
Cc: Zhou1, Tao 
Subject: [PATCH 3/4] drm/amdkfd: add reset queue function for RAS poison

The new interface unmaps queues with reset mode for the process that consumes 
RAS poison; it's only for compute queues.

Signed-off-by: Tao Zhou 
---
 .../drm/amd/amdkfd/kfd_device_queue_manager.c| 16 
 .../drm/amd/amdkfd/kfd_device_queue_manager.h|  5 +
 2 files changed, 21 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 01a2cc3928ac..b4b0495c7024 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -1476,6 +1476,21 @@ static int unmap_queues_cpsch(struct 
device_queue_manager *dqm,
return retval;
 }
 
+/* only for compute queue */
+static int unmap_queues_cpsch_poison(struct device_queue_manager *dqm,
+   uint16_t pasid)
+{
+   int retval;
+
+   dqm_lock(dqm);
+
+   retval = unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_BY_PASID,
+   pasid, true);
+
+   dqm_unlock(dqm);
+   return retval;
+}
+
 /* dqm->lock mutex has to be locked before calling this function */
 static int execute_queues_cpsch(struct device_queue_manager *dqm,
 enum kfd_unmap_queues_filter filter,
@@ -1896,6 +1911,7 @@ struct device_queue_manager *device_queue_manager_init(struct kfd_dev *dev)
dqm->ops.evict_process_queues = evict_process_queues_cpsch;
dqm->ops.restore_process_queues = restore_process_queues_cpsch;
dqm->ops.get_wave_state = get_wave_state;
+   dqm->ops.unmap_queues_cpsch_poison = unmap_queues_cpsch_poison;
break;
case KFD_SCHED_POLICY_NO_HWS:
/* initialize dqm for no cp scheduling */
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
index 499fc0ea387f..19ec3e8859e8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
@@ -81,6 +81,8 @@ struct device_process_node {
  *
  * @get_wave_state: Retrieves context save state and optionally copies the
  * control stack, if kept in the MQD, to the given userspace address.
+ *
+ * @unmap_queues_cpsch_poison: reset queue which consumes RAS poison
  */
 
 struct device_queue_manager_ops {
@@ -134,6 +136,9 @@ struct device_queue_manager_ops {
  void __user *ctl_stack,
  u32 *ctl_stack_used_size,
  u32 *save_area_used_size);
+
+   int (*unmap_queues_cpsch_poison)(struct device_queue_manager *dqm,
+   uint16_t pasid);
 };
 
 struct device_queue_manager_asic_ops {
--
2.17.1



[bug report] new kvmalloc() WARN() triggered by DRM ioctls tracking

2021-12-16 Thread Dan Carpenter
Hi DRM Devs,

In commit 7661809d493b ("mm: don't allow oversized kvmalloc() calls")
from July, Linus added a WARN_ONCE() for "crazy" allocations over 2GB.
I have a static checker warning for this and most of the warnings are
from DRM ioctls.

drivers/gpu/drm/lima/lima_drv.c:124 lima_ioctl_gem_submit() warn: uncapped user 
size for kvmalloc() will WARN
drivers/gpu/drm/radeon/radeon_cs.c:291 radeon_cs_parser_init() warn: uncapped 
user size for kvmalloc() will WARN
drivers/gpu/drm/v3d/v3d_gem.c:311 v3d_lookup_bos() warn: uncapped user size for 
kvmalloc() will WARN
drivers/gpu/drm/v3d/v3d_gem.c:319 v3d_lookup_bos() warn: uncapped user size for 
kvmalloc() will WARN
drivers/gpu/drm/v3d/v3d_gem.c:601 v3d_get_multisync_post_deps() warn: uncapped 
user size for kvmalloc() will WARN
drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c:476 etnaviv_ioctl_gem_submit() 
warn: uncapped user size for kvmalloc() will WARN
drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c:477 etnaviv_ioctl_gem_submit() 
warn: uncapped user size for kvmalloc() will WARN
drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c:478 etnaviv_ioctl_gem_submit() 
warn: uncapped user size for kvmalloc() will WARN
drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c:479 etnaviv_ioctl_gem_submit() 
warn: uncapped user size for kvmalloc() will WARN
drivers/gpu/drm/virtio/virtgpu_ioctl.c:186 virtio_gpu_execbuffer_ioctl() warn: 
uncapped user size for kvmalloc() will WARN
drivers/gpu/drm/panfrost/panfrost_drv.c:198 panfrost_copy_in_sync() warn: 
uncapped user size for kvmalloc() will WARN
drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:120 amdgpu_cs_parser_init() warn: 
uncapped user size for kvmalloc() will WARN

These ioctls can all trigger the stack dump.  The line numbers are from
linux next (next-20211214).

I feel like ideally if this could be fixed in a central way, but if not
then hopefully I've added the relevant lists to the CC.

regards,
dan carpenter


[PATCH 3/4] drm/amdkfd: add reset queue function for RAS poison

2021-12-16 Thread Tao Zhou
The new interface unmaps queues with reset mode for the process that consumes
RAS poison; it's only for compute queues.

Signed-off-by: Tao Zhou 
---
 .../drm/amd/amdkfd/kfd_device_queue_manager.c| 16 
 .../drm/amd/amdkfd/kfd_device_queue_manager.h|  5 +
 2 files changed, 21 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 01a2cc3928ac..b4b0495c7024 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -1476,6 +1476,21 @@ static int unmap_queues_cpsch(struct 
device_queue_manager *dqm,
return retval;
 }
 
+/* only for compute queue */
+static int unmap_queues_cpsch_poison(struct device_queue_manager *dqm,
+   uint16_t pasid)
+{
+   int retval;
+
+   dqm_lock(dqm);
+
+   retval = unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_BY_PASID,
+   pasid, true);
+
+   dqm_unlock(dqm);
+   return retval;
+}
+
 /* dqm->lock mutex has to be locked before calling this function */
 static int execute_queues_cpsch(struct device_queue_manager *dqm,
enum kfd_unmap_queues_filter filter,
@@ -1896,6 +1911,7 @@ struct device_queue_manager 
*device_queue_manager_init(struct kfd_dev *dev)
dqm->ops.evict_process_queues = evict_process_queues_cpsch;
dqm->ops.restore_process_queues = restore_process_queues_cpsch;
dqm->ops.get_wave_state = get_wave_state;
+   dqm->ops.unmap_queues_cpsch_poison = unmap_queues_cpsch_poison;
break;
case KFD_SCHED_POLICY_NO_HWS:
/* initialize dqm for no cp scheduling */
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
index 499fc0ea387f..19ec3e8859e8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
@@ -81,6 +81,8 @@ struct device_process_node {
  *
  * @get_wave_state: Retrieves context save state and optionally copies the
  * control stack, if kept in the MQD, to the given userspace address.
+ *
+ * @unmap_queues_cpsch_poison: reset queue which consumes RAS poison
  */
 
 struct device_queue_manager_ops {
@@ -134,6 +136,9 @@ struct device_queue_manager_ops {
  void __user *ctl_stack,
  u32 *ctl_stack_used_size,
  u32 *save_area_used_size);
+
+   int (*unmap_queues_cpsch_poison)(struct device_queue_manager *dqm,
+   uint16_t pasid);
 };
 
 struct device_queue_manager_asic_ops {
-- 
2.17.1



[PATCH 4/4] drm/amdkfd: reset queue which consumes RAS poison (v2)

2021-12-16 Thread Tao Zhou
CP supports unmapping a queue with reset mode, which only destroys the
specific queue without affecting others.
Replacing the whole gpu reset with the reset queue mode for RAS poison
consumption saves much time, and we can also fall back to the gpu reset
solution if resetting the queue fails.

v2: Return directly if process is NULL;
The reset queue solution is not applicable to SDMA, fall back to the
legacy way;
Call kfd_unref_process after looking up the process.

Signed-off-by: Tao Zhou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c|  6 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|  3 +-
 .../gpu/drm/amd/amdkfd/kfd_int_process_v9.c   | 44 +--
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h |  2 +
 4 files changed, 47 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 46cf48b3904a..0bf09a94d944 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -721,13 +721,13 @@ bool amdgpu_amdkfd_have_atomics_support(struct 
amdgpu_device *adev)
return adev->have_atomics_support;
 }
 
-void amdgpu_amdkfd_ras_poison_consumption_handler(struct amdgpu_device *adev)
+void amdgpu_amdkfd_ras_poison_consumption_handler(struct amdgpu_device *adev, 
bool reset)
 {
struct ras_err_data err_data = {0, 0, 0, NULL};
 
/* CPU MCA will handle page retirement if connected_to_cpu is 1 */
if (!adev->gmc.xgmi.connected_to_cpu)
-   amdgpu_umc_process_ras_data_cb(adev, &err_data, NULL);
-   else
+   amdgpu_umc_do_page_retirement(adev, &err_data, NULL, reset);
+   else if (reset)
amdgpu_amdkfd_gpu_reset(adev);
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index fcbc8a9c9e06..61f899e54fd5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -296,7 +296,8 @@ int amdgpu_amdkfd_gpuvm_import_dmabuf(struct amdgpu_device 
*adev,
  uint64_t *mmap_offset);
 int amdgpu_amdkfd_get_tile_config(struct amdgpu_device *adev,
struct tile_config *config);
-void amdgpu_amdkfd_ras_poison_consumption_handler(struct amdgpu_device *adev);
+void amdgpu_amdkfd_ras_poison_consumption_handler(struct amdgpu_device *adev,
+   bool reset);
 #if IS_ENABLED(CONFIG_HSA_AMD)
 void amdgpu_amdkfd_gpuvm_init_mem_limits(void);
 void amdgpu_amdkfd_gpuvm_destroy_cb(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
index deb64168c9e8..bd29b50532ea 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
@@ -89,6 +89,44 @@ enum SQ_INTERRUPT_ERROR_TYPE {
 #define KFD_SQ_INT_DATA__ERR_TYPE_MASK 0xF0
 #define KFD_SQ_INT_DATA__ERR_TYPE__SHIFT 20
 
+static void event_interrupt_poison_consumption(struct kfd_dev *dev,
+   uint16_t pasid, uint16_t source_id)
+{
+   int ret = -EINVAL;
+   struct kfd_process *p = kfd_lookup_process_by_pasid(pasid);
+
+   if (!p)
+   return;
+
+   /* all queues of a process will be unmapped in one time */
+   if (atomic_read(&p->poison)) {
+   kfd_unref_process(p);
+   return;
+   }
+
+   atomic_set(&p->poison, 1);
+   kfd_unref_process(p);
+
+   switch (source_id) {
+   case SOC15_INTSRC_SQ_INTERRUPT_MSG:
+   if (dev->dqm->ops.unmap_queues_cpsch_poison)
+   ret = dev->dqm->ops.unmap_queues_cpsch_poison(dev->dqm, 
pasid);
+   break;
+   case SOC15_INTSRC_SDMA_ECC:
+   default:
+   break;
+   }
+
+   kfd_signal_poison_consumed_event(dev, pasid);
+
+   /* if resetting the queue passes, do page retirement without gpu reset;
+    * if resetting the queue fails, fall back to the gpu reset solution */
+   if (!ret)
+   amdgpu_amdkfd_ras_poison_consumption_handler(dev->adev, false);
+   else
+   amdgpu_amdkfd_ras_poison_consumption_handler(dev->adev, true);
+}
+
 static bool event_interrupt_isr_v9(struct kfd_dev *dev,
const uint32_t *ih_ring_entry,
uint32_t *patched_ihre,
@@ -230,8 +268,7 @@ static void event_interrupt_wq_v9(struct kfd_dev *dev,
sq_intr_err);
if (sq_intr_err != 
SQ_INTERRUPT_ERROR_TYPE_ILLEGAL_INST &&
sq_intr_err != 
SQ_INTERRUPT_ERROR_TYPE_MEMVIOL) {
-   kfd_signal_poison_consumed_event(dev, 
pasid);
-   
amdgpu_amdkfd_ras_poison_consumption_handler(dev->adev);
+   event_interrupt_poison_consumption(dev, 
pasid, 

[PATCH 2/4] drm/amdkfd: add reset parameter for unmap queues

2021-12-16 Thread Tao Zhou
So we can set the reset mode for the unmap operation; no functional change.

Signed-off-by: Tao Zhou 
---
 .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c| 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index dd0b952f0173..01a2cc3928ac 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -47,7 +47,7 @@ static int execute_queues_cpsch(struct device_queue_manager 
*dqm,
uint32_t filter_param);
 static int unmap_queues_cpsch(struct device_queue_manager *dqm,
enum kfd_unmap_queues_filter filter,
-   uint32_t filter_param);
+   uint32_t filter_param, bool reset);
 
 static int map_queues_cpsch(struct device_queue_manager *dqm);
 
@@ -570,7 +570,7 @@ static int update_queue(struct device_queue_manager *dqm, 
struct queue *q,
/* Make sure the queue is unmapped before updating the MQD */
if (dqm->sched_policy != KFD_SCHED_POLICY_NO_HWS) {
retval = unmap_queues_cpsch(dqm,
-   KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0);
+   KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0, 
false);
if (retval) {
pr_err("unmap queue failed\n");
goto out_unlock;
@@ -1223,7 +1223,7 @@ static int stop_cpsch(struct device_queue_manager *dqm)
}
 
if (!dqm->is_hws_hang)
-   unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0);
+   unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0, 
false);
hanging = dqm->is_hws_hang || dqm->is_resetting;
dqm->sched_running = false;
 
@@ -1419,7 +1419,7 @@ static int map_queues_cpsch(struct device_queue_manager 
*dqm)
 /* dqm->lock mutex has to be locked before calling this function */
 static int unmap_queues_cpsch(struct device_queue_manager *dqm,
enum kfd_unmap_queues_filter filter,
-   uint32_t filter_param)
+   uint32_t filter_param, bool reset)
 {
int retval = 0;
struct mqd_manager *mqd_mgr;
@@ -1432,7 +1432,7 @@ static int unmap_queues_cpsch(struct device_queue_manager 
*dqm,
return retval;
 
retval = pm_send_unmap_queue(&dqm->packet_mgr, KFD_QUEUE_TYPE_COMPUTE,
-   filter, filter_param, false, 0);
+   filter, filter_param, reset, 0);
if (retval)
return retval;
 
@@ -1485,7 +1485,7 @@ static int execute_queues_cpsch(struct 
device_queue_manager *dqm,
 
if (dqm->is_hws_hang)
return -EIO;
-   retval = unmap_queues_cpsch(dqm, filter, filter_param);
+   retval = unmap_queues_cpsch(dqm, filter, filter_param, false);
if (retval)
return retval;
 
-- 
2.17.1



[PATCH 1/4] drm/amdgpu: add gpu reset control for umc page retirement

2021-12-16 Thread Tao Zhou
Add a reset parameter for umc page retirement, letting the caller decide
whether to trigger a gpu reset during umc page retirement.

Signed-off-by: Tao Zhou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 15 ---
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h |  5 +++--
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
index 6e4bea012ea4..0c33f367a4e5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -23,6 +23,13 @@
 
 #include "amdgpu_ras.h"
 
+static int amdgpu_umc_process_ras_data_cb(struct amdgpu_device *adev,
+   void *ras_error_status,
+   struct amdgpu_iv_entry *entry)
+{
+   return amdgpu_umc_do_page_retirement(adev, ras_error_status, entry, 
true);
+}
+
 int amdgpu_umc_ras_late_init(struct amdgpu_device *adev)
 {
int r;
@@ -88,9 +95,10 @@ void amdgpu_umc_ras_fini(struct amdgpu_device *adev)
}
 }
 
-int amdgpu_umc_process_ras_data_cb(struct amdgpu_device *adev,
+int amdgpu_umc_do_page_retirement(struct amdgpu_device *adev,
void *ras_error_status,
-   struct amdgpu_iv_entry *entry)
+   struct amdgpu_iv_entry *entry,
+   bool reset)
 {
struct ras_err_data *err_data = (struct ras_err_data *)ras_error_status;
struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
@@ -164,7 +172,8 @@ int amdgpu_umc_process_ras_data_cb(struct amdgpu_device 
*adev,

adev->smu.ppt_funcs->send_hbm_bad_pages_num(&adev->smu, 
con->eeprom_control.ras_num_recs);
}
 
-   amdgpu_ras_reset_gpu(adev);
+   if (reset)
+   amdgpu_ras_reset_gpu(adev);
}
 
kfree(err_data->err_addr);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
index 9e40bade0a68..8d18d5121f66 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
@@ -78,9 +78,10 @@ struct amdgpu_umc {
 
 int amdgpu_umc_ras_late_init(struct amdgpu_device *adev);
 void amdgpu_umc_ras_fini(struct amdgpu_device *adev);
-int amdgpu_umc_process_ras_data_cb(struct amdgpu_device *adev,
+int amdgpu_umc_do_page_retirement(struct amdgpu_device *adev,
void *ras_error_status,
-   struct amdgpu_iv_entry *entry);
+   struct amdgpu_iv_entry *entry,
+   bool reset);
 int amdgpu_umc_process_ecc_irq(struct amdgpu_device *adev,
struct amdgpu_irq_src *source,
struct amdgpu_iv_entry *entry);
-- 
2.17.1



Re: [PATCH v4 2/6] drm: improve drm_buddy_alloc function

2021-12-16 Thread Thomas Zimmermann

Hi

Am 01.12.21 um 17:39 schrieb Arunpravin:

- Make drm_buddy_alloc a single function to handle
   range allocation and non-range allocation demands

- Implemented a new function alloc_range() which allocates
   the requested power-of-two block complying with range limitations

- Moved order computation and memory alignment logic from
   i915 driver to drm buddy

v2:
   merged below changes to keep the build unbroken
- drm_buddy_alloc_range() becomes obsolete and may be removed
- enable ttm range allocation (fpfn / lpfn) support in i915 driver
- apply enhanced drm_buddy_alloc() function to i915 driver

v3(Matthew Auld):
   - Fix alignment issues and remove unnecessary list_empty check
   - add more validation checks for input arguments
   - make alloc_range() block allocations as bottom-up
   - optimize order computation logic
   - replace uint64_t with u64, which is preferred in the kernel

v4(Matthew Auld):
   - keep drm_buddy_alloc_range() function implementation for generic
 actual range allocations
   - keep alloc_range() implementation for end bias allocations

Signed-off-by: Arunpravin 
---
  drivers/gpu/drm/drm_buddy.c   | 316 +-
  drivers/gpu/drm/i915/i915_ttm_buddy_manager.c |  67 ++--
  drivers/gpu/drm/i915/i915_ttm_buddy_manager.h |   2 +
  include/drm/drm_buddy.h   |  22 +-
  4 files changed, 285 insertions(+), 122 deletions(-)

diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
index 9340a4b61c5a..7f47632821f4 100644
--- a/drivers/gpu/drm/drm_buddy.c
+++ b/drivers/gpu/drm/drm_buddy.c
@@ -280,23 +280,97 @@ void drm_buddy_free_list(struct drm_buddy_mm *mm, struct 
list_head *objects)
  }
  EXPORT_SYMBOL(drm_buddy_free_list);
  
-/**
- * drm_buddy_alloc - allocate power-of-two blocks
- *
- * @mm: DRM buddy manager to allocate from
- * @order: size of the allocation
- *
- * The order value here translates to:
- *
- * 0 = 2^0 * mm->chunk_size
- * 1 = 2^1 * mm->chunk_size
- * 2 = 2^2 * mm->chunk_size
- *
- * Returns:
- * allocated ptr to the _buddy_block on success
- */
-struct drm_buddy_block *
-drm_buddy_alloc(struct drm_buddy_mm *mm, unsigned int order)
+static inline bool overlaps(u64 s1, u64 e1, u64 s2, u64 e2)
+{
+   return s1 <= e2 && e1 >= s2;
+}
+
+static inline bool contains(u64 s1, u64 e1, u64 s2, u64 e2)
+{
+   return s1 <= s2 && e1 >= e2;
+}
+
+static struct drm_buddy_block *
+alloc_range_bias(struct drm_buddy_mm *mm,
+u64 start, u64 end,
+unsigned int order)
+{
+   struct drm_buddy_block *block;
+   struct drm_buddy_block *buddy;
+   LIST_HEAD(dfs);
+   int err;
+   int i;
+
+   end = end - 1;
+
+   for (i = 0; i < mm->n_roots; ++i)
+   list_add_tail(&mm->roots[i]->tmp_link, &dfs);
+
+   do {
+   u64 block_start;
+   u64 block_end;
+
+   block = list_first_entry_or_null(&dfs,
+struct drm_buddy_block,
+tmp_link);
+   if (!block)
+   break;
+
+   list_del(&block->tmp_link);
+
+   if (drm_buddy_block_order(block) < order)
+   continue;
+
+   block_start = drm_buddy_block_offset(block);
+   block_end = block_start + drm_buddy_block_size(mm, block) - 1;
+
+   if (!overlaps(start, end, block_start, block_end))
+   continue;
+
+   if (drm_buddy_block_is_allocated(block))
+   continue;
+
+   if (contains(start, end, block_start, block_end) &&
+   order == drm_buddy_block_order(block)) {
+   /*
+* Find the free block within the range.
+*/
+   if (drm_buddy_block_is_free(block))
+   return block;
+
+   continue;
+   }
+
+   if (!drm_buddy_block_is_split(block)) {
+   err = split_block(mm, block);
+   if (unlikely(err))
+   goto err_undo;
+   }
+
+   list_add(&block->right->tmp_link, &dfs);
+   list_add(&block->left->tmp_link, &dfs);
+   } while (1);
+
+   return ERR_PTR(-ENOSPC);
+
+err_undo:
+   /*
+* We really don't want to leave around a bunch of split blocks, since
+* bigger is better, so make sure we merge everything back before we
+* free the allocated blocks.
+*/
+   buddy = get_buddy(block);
+   if (buddy &&
+   (drm_buddy_block_is_free(block) &&
+drm_buddy_block_is_free(buddy)))
+   __drm_buddy_free(mm, block);
+   return ERR_PTR(err);
+}
+
+static struct drm_buddy_block *
+alloc_from_freelist(struct drm_buddy_mm *mm,
+   unsigned int order,
+   unsigned long 

Re: [PATCH v4 2/6] drm: improve drm_buddy_alloc function

2021-12-16 Thread Matthew Auld

On 15/12/2021 20:46, Arunpravin wrote:



On 14/12/21 12:29 am, Matthew Auld wrote:

On 09/12/2021 15:47, Paneer Selvam, Arunpravin wrote:

[AMD Official Use Only]

Hi Matthew,

Ping on this?


No new comments from me :) I guess just a question of what we should do
with the selftests, and then ofc at some point being able to throw this
at CI, or at least test locally, once the series builds.


Sure :) I think we should rewrite the i915 buddy selftests since we now
have a single function for range and non-range requirements. Shall I
rewrite the i915 buddy selftests and move them to the drm selftests
folder? And for the time being, shall I remove i915_buddy_mock_selftest()
from the i915_mock_selftests.h list to avoid build errors?


Yeah, whatever is easiest.



Regards,
Arun
-Original Message-
From: amd-gfx  On Behalf Of Arunpravin
Sent: Wednesday, December 1, 2021 10:10 PM
To: dri-de...@lists.freedesktop.org; intel-...@lists.freedesktop.org; 
amd-gfx@lists.freedesktop.org
Cc: dan...@ffwll.ch; Paneer Selvam, Arunpravin ; 
jani.nik...@linux.intel.com; matthew.a...@intel.com; tzimmerm...@suse.de; Deucher, Alexander 
; Koenig, Christian 
Subject: [PATCH v4 2/6] drm: improve drm_buddy_alloc function

- Make drm_buddy_alloc a single function to handle
range allocation and non-range allocation demands

- Implemented a new function alloc_range() which allocates
the requested power-of-two block comply with range limitations

- Moved order computation and memory alignment logic from
i915 driver to drm buddy

v2:
merged below changes to keep the build unbroken
 - drm_buddy_alloc_range() becomes obsolete and may be removed
 - enable ttm range allocation (fpfn / lpfn) support in i915 driver
 - apply enhanced drm_buddy_alloc() function to i915 driver

v3(Matthew Auld):
- Fix alignment issues and remove unnecessary list_empty check
- add more validation checks for input arguments
- make alloc_range() block allocations as bottom-up
- optimize order computation logic
- replace uint64_t with u64, which is preferred in the kernel

v4(Matthew Auld):
- keep drm_buddy_alloc_range() function implementation for generic
  actual range allocations
- keep alloc_range() implementation for end bias allocations

Signed-off-by: Arunpravin 
---
   drivers/gpu/drm/drm_buddy.c   | 316 +-
   drivers/gpu/drm/i915/i915_ttm_buddy_manager.c |  67 ++--
   drivers/gpu/drm/i915/i915_ttm_buddy_manager.h |   2 +
   include/drm/drm_buddy.h   |  22 +-
   4 files changed, 285 insertions(+), 122 deletions(-)

diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c index 
9340a4b61c5a..7f47632821f4 100644
--- a/drivers/gpu/drm/drm_buddy.c
+++ b/drivers/gpu/drm/drm_buddy.c
@@ -280,23 +280,97 @@ void drm_buddy_free_list(struct drm_buddy_mm *mm, struct 
list_head *objects)  }  EXPORT_SYMBOL(drm_buddy_free_list);
   
-/**
- * drm_buddy_alloc - allocate power-of-two blocks
- *
- * @mm: DRM buddy manager to allocate from
- * @order: size of the allocation
- *
- * The order value here translates to:
- *
- * 0 = 2^0 * mm->chunk_size
- * 1 = 2^1 * mm->chunk_size
- * 2 = 2^2 * mm->chunk_size
- *
- * Returns:
- * allocated ptr to the _buddy_block on success
- */
-struct drm_buddy_block *
-drm_buddy_alloc(struct drm_buddy_mm *mm, unsigned int order)
+static inline bool overlaps(u64 s1, u64 e1, u64 s2, u64 e2) {
+   return s1 <= e2 && e1 >= s2;
+}
+
+static inline bool contains(u64 s1, u64 e1, u64 s2, u64 e2) {
+   return s1 <= s2 && e1 >= e2;
+}
+
+static struct drm_buddy_block *
+alloc_range_bias(struct drm_buddy_mm *mm,
+u64 start, u64 end,
+unsigned int order)
+{
+   struct drm_buddy_block *block;
+   struct drm_buddy_block *buddy;
+   LIST_HEAD(dfs);
+   int err;
+   int i;
+
+   end = end - 1;
+
+   for (i = 0; i < mm->n_roots; ++i)
+   list_add_tail(&mm->roots[i]->tmp_link, &dfs);
+
+   do {
+   u64 block_start;
+   u64 block_end;
+
+   block = list_first_entry_or_null(&dfs,
+struct drm_buddy_block,
+tmp_link);
+   if (!block)
+   break;
+
+   list_del(&block->tmp_link);
+
+   if (drm_buddy_block_order(block) < order)
+   continue;
+
+   block_start = drm_buddy_block_offset(block);
+   block_end = block_start + drm_buddy_block_size(mm, block) - 1;
+
+   if (!overlaps(start, end, block_start, block_end))
+   continue;
+
+   if (drm_buddy_block_is_allocated(block))
+   continue;
+
+   if (contains(start, end, block_start, block_end) &&
+   order == drm_buddy_block_order(block)) {
+   /*
+* Find the free block within 

[PATCH] drm/amd/display: Fix warning comparing pointer to 0

2021-12-16 Thread Jiapeng Chong
Fix the following coccicheck warning:

./drivers/gpu/drm/amd/display/dc/clk_mgr/dcn301/vg_clk_mgr.c:744:35-36:
WARNING comparing pointer to 0.

Reported-by: Abaci Robot 
Signed-off-by: Jiapeng Chong 
---
 drivers/gpu/drm/amd/display/dc/clk_mgr/dcn301/vg_clk_mgr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn301/vg_clk_mgr.c 
b/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn301/vg_clk_mgr.c
index 8f78e62b28b7..48005def1164 100644
--- a/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn301/vg_clk_mgr.c
+++ b/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn301/vg_clk_mgr.c
@@ -741,7 +741,7 @@ void vg_clk_mgr_construct(
sizeof(struct watermarks),
&clk_mgr->smu_wm_set.mc_address.quad_part);
 
-   if (clk_mgr->smu_wm_set.wm_set == 0) {
+   if (!clk_mgr->smu_wm_set.wm_set) {
clk_mgr->smu_wm_set.wm_set = _wms;
clk_mgr->smu_wm_set.mc_address.quad_part = 0;
}
-- 
2.20.1.7.g153144c



Re: [PATCH v2] drm/amdgpu: Call amdgpu_device_unmap_mmio() iff device is unplugged to prevent crash in GPU initialization failure

2021-12-16 Thread Christian König
The !drm_dev_enter() is quite unusual and deserves a comment explaining 
what's going on here.


Apart from that, it looks good with the typos fixed, I think.

Christian.

On 16.12.21 at 08:27, Chen, Guchun wrote:

[Public]

My bad for misunderstanding this.

There are spelling typos in both the patch subject and body: s/iff/if.

The patch is: Reviewed-by: Guchun Chen 

Please wait for the ack from Andrey and Christian before pushing this.

Regards,
Guchun

-Original Message-
From: Shi, Leslie 
Sent: Thursday, December 16, 2021 3:00 PM
To: Chen, Guchun ; Grodzovsky, Andrey ; Koenig, 
Christian ; Pan, Xinhui ; Deucher, Alexander 
; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH v2] drm/amdgpu: Call amdgpu_device_unmap_mmio() iff device 
is unplugged to prevent crash in GPU initialization failure

[Public]

Hi Guchun,

As Andrey says, "we should not call amdgpu_device_unmap_mmio unless device is
unplugged", I think we should call amdgpu_device_unmap_mmio() only if the device
is unplugged (drm_dev_enter() returns false).

+	if (!drm_dev_enter(adev_to_drm(adev), &idx))
+		amdgpu_device_unmap_mmio(adev);
+	else
+		drm_dev_exit(idx);


Regards,
Leslie

-Original Message-
From: Chen, Guchun 
Sent: Thursday, December 16, 2021 2:46 PM
To: Shi, Leslie ; Grodzovsky, Andrey ; Koenig, 
Christian ; Pan, Xinhui ; Deucher, Alexander 
; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH v2] drm/amdgpu: Call amdgpu_device_unmap_mmio() iff device 
is unplugged to prevent crash in GPU initialization failure

[Public]

Hi Leslie,

I think we need to modify it like:

+	if (drm_dev_enter(adev_to_drm(adev), &idx)) {
+		amdgpu_device_unmap_mmio(adev);
+		drm_dev_exit(idx);
+	}

Also you need to credit Andrey with a 'Suggested-by' in your patch.

Regards,
Guchun

-Original Message-
From: Shi, Leslie 
Sent: Thursday, December 16, 2021 2:14 PM
To: Grodzovsky, Andrey ; Koenig, Christian 
; Pan, Xinhui ; Deucher, Alexander 
; amd-gfx@lists.freedesktop.org
Cc: Chen, Guchun ; Shi, Leslie 
Subject: [PATCH v2] drm/amdgpu: Call amdgpu_device_unmap_mmio() iff device is 
unplugged to prevent crash in GPU initialization failure

[Why]
In amdgpu_driver_load_kms, when amdgpu_device_init returns error during driver 
modprobe, it will start the error handle path immediately and call into 
amdgpu_device_unmap_mmio as well to release mapped VRAM. However, in the 
following release callback, driver still visits the unmapped memory like
vcn.inst[i].fw_shared_cpu_addr in vcn_v3_0_sw_fini. So a kernel crash occurs.

[How]
call amdgpu_device_unmap_mmio() iff device is unplugged to prevent invalid 
memory address in
vcn_v3_0_sw_fini() when GPU initialization fails.

Signed-off-by: Leslie Shi 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 +++-
  1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index fb03d75880ec..d3656e7b60c2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3845,6 +3845,8 @@ static void amdgpu_device_unmap_mmio(struct amdgpu_device 
*adev)
   */
  void amdgpu_device_fini_hw(struct amdgpu_device *adev)
  {
+	int idx;
+
 	dev_info(adev->dev, "amdgpu: finishing device.\n");
 	flush_delayed_work(&adev->delayed_init_work);
 	if (adev->mman.initialized) {
@@ -3888,7 +3890,11 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
  
  	amdgpu_gart_dummy_page_fini(adev);
  
-	amdgpu_device_unmap_mmio(adev);
+	if (!drm_dev_enter(adev_to_drm(adev), &idx))
+		amdgpu_device_unmap_mmio(adev);
+	else
+		drm_dev_exit(idx);
+
  }
  
  void amdgpu_device_fini_sw(struct amdgpu_device *adev)

--
2.25.1




Re: [PATCH 6/7] drm/amdgpu: Ensure kunmap is called on error

2021-12-16 Thread Christian König

On 15.12.21 at 22:09, Ira Weiny wrote:

On Tue, Dec 14, 2021 at 08:09:29AM +0100, Christian König wrote:

On 14.12.21 at 04:37, Ira Weiny wrote:

On Mon, Dec 13, 2021 at 09:37:32PM +0100, Christian König wrote:

On 11.12.21 at 00:24, ira.we...@intel.com wrote:

From: Ira Weiny 

The default case leaves the buffer object mapped in error.

Add amdgpu_bo_kunmap() to that case to ensure the mapping is cleaned up.

Mhm, good catch. But why do you want to do this in the first place?

I'm not sure I understand the question.

Any mapping of memory should be paired with an unmapping when no longer needed.
And this is supported by the call to amdgpu_bo_kunmap() in the other
non-default cases.

Do you believe the mapping is not needed?

No, the unmapping is not needed here. See the function amdgpu_bo_kmap(), it
either creates the mapping or returns the cached pointer.

Ah I missed that.  Thanks.


A call to amdgpu_bo_kunmap() is only done in a few places where we know that
the created mapping most likely won't be needed any more. If that's not done
the mapping is automatically destroyed when the BO is moved or freed up.

I mean good bug fix, but you seem to see this as some kind of prerequisite
to some follow up work converting TTM to use kmap_local() which most likely
won't work in the first place.

Sure.  I see now that it is more complicated than I thought but I never thought
of this as a strict prerequisite.  Just something I found while trying to
figure out how this works.

How much of a speed up is it to maintain the ttm_bo_map_kmap map type?


Good question. I don't really know.

This used to be pretty important for older drivers since there the 
kernel needs to kmap individual pages and patch them up before sending 
the command stream to the hardware.


It most likely doesn't matter for modern hardware.


Could this all be done with vmap and just remove the kmap stuff?


Maybe, but I wouldn't bet on it and I don't really want to touch any of 
the old drivers to figure that out.


Christian.



Ira


Regards,
Christian.


Ira


Christian.


Signed-off-by: Ira Weiny 

---
NOTE: It seems like this function could use a fair bit of refactoring
but this is the easiest way to fix the actual bug.
---
drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
index 6f8de11a17f1..b3ffd0f6b35f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
@@ -889,6 +889,7 @@ static int amdgpu_uvd_cs_msg(struct amdgpu_uvd_cs_ctx *ctx,
return 0;
default:
+   amdgpu_bo_kunmap(bo);
DRM_ERROR("Illegal UVD message type (%d)!\n", msg_type);
}




Re: [PATCH] drm/amdgpu: fix mismatch warning between the prototype and function name

2021-12-16 Thread Christian König

On 16.12.21 at 05:39, Huang Rui wrote:

Fix the typo to align with the prototype and function name.

All warnings (new ones prefixed by >>):


drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:631: warning: expecting

prototype for amdgpu_fence_clear_job_fences(). Prototype was for
amdgpu_fence_driver_clear_job_fences() instead

Reported-by: kernel test robot 
Signed-off-by: Huang Rui 


Reviewed-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index db41d16838b9..9afd11ca2709 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -622,7 +622,7 @@ void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)
  }
  
  /**

- * amdgpu_fence_clear_job_fences - clear job embedded fences of ring
+ * amdgpu_fence_driver_clear_job_fences - clear job embedded fences of ring
   *
   * @ring: fence of the ring to be cleared
   *




Re: 回复: Re: [PATCH] drm/amdgpu: fixup bad vram size on gmc v8

2021-12-16 Thread Christian König

Hi Zongmin,

that strongly sounds like the ASIC is not correctly initialized when 
trying to read the register.


What board and environment are you using this GPU with? Is that a normal 
x86 system?


Regards,
Christian.

On 16.12.21 at 04:11, 周宗敏 wrote:


1. The problematic board I have tested is an [AMD/ATI] Lexa PRO
   [Radeon RX 550/550X]; the vbios version is 113-RXF9310-C09-BT.

2. When the exception occurs, I can see the following change in the
   vram size value read from RREG32(mmCONFIG_MEMSIZE): it seems to
   have garbage in the upper 16 bits.

   [screenshot attachment: image.png]


3. I can also see dmesg output like below.

   When the vram size register has garbage, we may see an error message
   like this:

amdgpu 0000:09:00.0: VRAM: 4286582784M 0x00F4 -
0x000FF8F4 (4286582784M used)

   The correct message should look like this:

amdgpu 0000:09:00.0: VRAM: 4096M 0x00F4 -
0x00F4 (4096M used)


If you have any problems, please send me mail.

Thanks very much.




*Subject:* Re: [PATCH] drm/amdgpu: fixup bad vram size on gmc v8
*Date:* 2021-12-16 04:23
*From:* Alex Deucher
*To:* Zongmin Zhou


On Wed, Dec 15, 2021 at 10:31 AM Zongmin Zhou wrote:
>
> Some boards (like RX550) seem to have garbage in the upper
> 16 bits of the vram size register.  Check for
> this and clamp the size properly.  Fixes
> boards reporting bogus amounts of vram.
>
> After this patch the maximum supported GPU VRAM size is 64GB;
> on boards reporting more, only the lower 16 bits (up to 64GB) will be used.

Can you provide some examples of problematic boards and possibly a
vbios image from the problematic board?  What values are you seeing?
It would be nice to see what the boards are reporting and whether the
lower 16 bits are actually correct or if it is some other issue.  This
register is undefined until the asic has been initialized.  The vbios
programs it as part of its asic init sequence (either via vesa/gop or
the OS driver).

Alex


>
> Signed-off-by: Zongmin Zhou
> ---
>  drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 13 ++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c

> index 492ebed2915b..63b890f1e8af 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
> @@ -515,10 +515,10 @@ static void gmc_v8_0_mc_program(struct 
amdgpu_device *adev)

>  static int gmc_v8_0_mc_init(struct amdgpu_device *adev)
>  {
>         int r;
> +       u32 tmp;
>
>         adev->gmc.vram_width = amdgpu_atombios_get_vram_width(adev);
>         if (!adev->gmc.vram_width) {
> -               u32 tmp;
>                 int chansize, numchan;
>
>                 /* Get VRAM informations */
> @@ -562,8 +562,15 @@ static int gmc_v8_0_mc_init(struct 
amdgpu_device *adev)

>                 adev->gmc.vram_width = numchan * chansize;
>         }
>         /* size in MB on si */
> -       adev->gmc.mc_vram_size = RREG32(mmCONFIG_MEMSIZE) * 1024ULL 
* 1024ULL;
> -       adev->gmc.real_vram_size = RREG32(mmCONFIG_MEMSIZE) * 
1024ULL * 1024ULL;

> +       tmp = RREG32(mmCONFIG_MEMSIZE);
> +       /* some boards may have garbage in the upper 16 bits */
> +       if (tmp & 0xffff0000) {
> +               DRM_INFO("Probable bad vram size: 0x%08x\n", tmp);
> +               if (tmp & 0xffff)
> +                       tmp &= 0xffff;
> +       }
> +       adev->gmc.mc_vram_size = tmp * 1024ULL * 1024ULL;
> +       adev->gmc.real_vram_size = adev->gmc.mc_vram_size;
>
>         if (!(adev->flags & AMD_IS_APU)) {
>                 r = amdgpu_device_resize_fb_bar(adev);
> --
> 2.25.1
>
>