Re: radeon Disabling GPU acceleration (WB disabled?)

2019-10-22 Thread Alex Deucher
On Tue, Oct 22, 2019 at 12:09 PM Michel Dänzer  wrote:
>
> On 2019-10-20 11:21 p.m., Meelis Roos wrote:
> > I tried 5.2, 5.3 and 5.4-rc4 on my old Fujitsu RX220 with integrated
> > Radeon RV100. Dmesg tells that GPU acceleration is disabled. I do not
> > know if it has been enabled in the past - no old kernels handy at the
> > moment.
> >
> > From dmesg it looks like it may be something with MTRR: WB disabled.
>
> That's harmless.
>
>
> > [8.535975] [drm] Driver supports precise vblank timestamp query.
> > [8.535981] radeon :00:05.0: Disabling GPU acceleration
>
> This looks like drm_irq_install returns an error in radeon_irq_kms_init.
>

Check to see that the sbios assigns an irq to the device.  There may
be an option in the sbios configuration settings.  IIRC, some old
platforms didn't always assign interrupts to vga devices.

Alex
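
For reference, a minimal sketch of the dependency Alex describes (illustrative only, not the driver's actual code): if the platform never routes an interrupt, pdev->irq stays 0 and drm_irq_install() rejects it, which is what makes radeon fall back to "Disabling GPU acceleration".

	/* Sketch: radeon_irq_kms_init() ultimately needs a valid IRQ line;
	 * drm_irq_install() returns -EINVAL for irq == 0. */
	if (pdev->irq == 0)
		dev_warn(&pdev->dev, "no IRQ assigned to the VGA device\n");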

RE: [PATCH] drm/amdgpu: Allow reading more status registers on si/cik

2019-10-22 Thread Deucher, Alexander
> -Original Message-
> From: amd-gfx  On Behalf Of
> Marek Olšák
> Sent: Tuesday, October 22, 2019 5:23 PM
> To: amd-gfx@lists.freedesktop.org
> Subject: [PATCH] drm/amdgpu: Allow reading more status registers on si/cik
> 
> From: Marek Olšák 
> 
> Allow userspace to read the same status registers for every family.
> Based on commit c7890fea; each of these registers is added if it is
> defined in the include files of the respective architecture.
> 
> Signed-off-by: Marek Olšák 

Reviewed-by: Alex Deucher 

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |  3 ++-
>  drivers/gpu/drm/amd/amdgpu/cik.c| 19 +++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/nv.c |  1 +
>  drivers/gpu/drm/amd/amdgpu/si.c | 11 +++++++++++
>  drivers/gpu/drm/amd/amdgpu/soc15.c  |  1 +
>  5 files changed, 34 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index a787d838974e..08610081bd9e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -76,23 +76,24 @@
>   * - 3.26.0 - GFX9: Process AMDGPU_IB_FLAG_TC_WB_NOT_INVALIDATE.
>   * - 3.27.0 - Add new chunk to AMDGPU_CS to enable BO_LIST creation.
>   * - 3.28.0 - Add AMDGPU_CHUNK_ID_SCHEDULED_DEPENDENCIES
>   * - 3.29.0 - Add AMDGPU_IB_FLAG_RESET_GDS_MAX_WAVE_ID
>   * - 3.30.0 - Add AMDGPU_SCHED_OP_CONTEXT_PRIORITY_OVERRIDE.
>   * - 3.31.0 - Add support for per-flip tiling attribute changes with DC
>   * - 3.32.0 - Add syncobj timeline support to AMDGPU_CS.
>   * - 3.33.0 - Fixes for GDS ENOMEM failures in AMDGPU_CS.
>   * - 3.34.0 - Non-DC can flip correctly between buffers with different 
> pitches
>   * - 3.35.0 - Add drm_amdgpu_info_device::tcc_disabled_mask
> + * - 3.36.0 - Allow reading more status registers on si/cik
>   */
>  #define KMS_DRIVER_MAJOR 3
> -#define KMS_DRIVER_MINOR 35
> +#define KMS_DRIVER_MINOR 36
>  #define KMS_DRIVER_PATCHLEVEL0
> 
>  int amdgpu_vram_limit = 0;
>  int amdgpu_vis_vram_limit = 0;
>  int amdgpu_gart_size = -1; /* auto */
>  int amdgpu_gtt_size = -1; /* auto */
>  int amdgpu_moverate = -1; /* auto */
>  int amdgpu_benchmarking = 0;
>  int amdgpu_testing = 0;
>  int amdgpu_audio = -1;
> diff --git a/drivers/gpu/drm/amd/amdgpu/cik.c
> b/drivers/gpu/drm/amd/amdgpu/cik.c
> index fc8b34480f66..2d64d270725d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/cik.c
> +++ b/drivers/gpu/drm/amd/amdgpu/cik.c
> @@ -959,20 +959,39 @@ static bool cik_read_bios_from_rom(struct
> amdgpu_device *adev,
>   WREG32(mmSMC_IND_INDEX_0, ixROM_DATA);
>   for (i = 0; i < length_dw; i++)
>   dw_ptr[i] = RREG32(mmSMC_IND_DATA_0);
>   spin_unlock_irqrestore(&adev->smc_idx_lock, flags);
> 
>   return true;
>  }
> 
>  static const struct amdgpu_allowed_register_entry
> cik_allowed_read_registers[] = {
>   {mmGRBM_STATUS},
> + {mmGRBM_STATUS2},
> + {mmGRBM_STATUS_SE0},
> + {mmGRBM_STATUS_SE1},
> + {mmGRBM_STATUS_SE2},
> + {mmGRBM_STATUS_SE3},
> + {mmSRBM_STATUS},
> + {mmSRBM_STATUS2},
> + {mmSDMA0_STATUS_REG + SDMA0_REGISTER_OFFSET},
> + {mmSDMA0_STATUS_REG + SDMA1_REGISTER_OFFSET},
> + {mmCP_STAT},
> + {mmCP_STALLED_STAT1},
> + {mmCP_STALLED_STAT2},
> + {mmCP_STALLED_STAT3},
> + {mmCP_CPF_BUSY_STAT},
> + {mmCP_CPF_STALLED_STAT1},
> + {mmCP_CPF_STATUS},
> + {mmCP_CPC_BUSY_STAT},
> + {mmCP_CPC_STALLED_STAT1},
> + {mmCP_CPC_STATUS},
>   {mmGB_ADDR_CONFIG},
>   {mmMC_ARB_RAMCFG},
>   {mmGB_TILE_MODE0},
>   {mmGB_TILE_MODE1},
>   {mmGB_TILE_MODE2},
>   {mmGB_TILE_MODE3},
>   {mmGB_TILE_MODE4},
>   {mmGB_TILE_MODE5},
>   {mmGB_TILE_MODE6},
>   {mmGB_TILE_MODE7},
> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c
> b/drivers/gpu/drm/amd/amdgpu/nv.c index 46206a1a1f4d..22ab1955b923
> 100644
> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
> @@ -171,20 +171,21 @@ static struct soc15_allowed_register_entry
> nv_allowed_read_registers[] = {
>   { SOC15_REG_ENTRY(SDMA0, 0, mmSDMA0_STATUS_REG)},
>   { SOC15_REG_ENTRY(SDMA1, 0, mmSDMA1_STATUS_REG)},
>  #endif
>   { SOC15_REG_ENTRY(GC, 0, mmCP_STAT)},
>   { SOC15_REG_ENTRY(GC, 0, mmCP_STALLED_STAT1)},
>   { SOC15_REG_ENTRY(GC, 0, mmCP_STALLED_STAT2)},
>   { SOC15_REG_ENTRY(GC, 0, mmCP_STALLED_STAT3)},
>   { SOC15_REG_ENTRY(GC, 0, mmCP_CPF_BUSY_STAT)},
>   { SOC15_REG_ENTRY(GC, 0, mmCP_CPF_STALLED_STAT1)},
>   { SOC15_REG_ENTRY(GC, 0, mmCP_CPF_STATUS)},
> + { SOC15_REG_ENTRY(GC, 0, mmCP_CPC_BUSY_STAT)},
>   { SOC15_REG_ENTRY(GC, 0, mmCP_CPC_STALLED_STAT1)},
>   { SOC15_REG_ENTRY(GC, 0, mmCP_CPC_STATUS)},
>   { SOC15_REG_ENTRY(GC, 0, mmGB_ADDR_CONFIG)},
>  };
> 
>  static uint32_t nv_read_indexed_register(struct amdgpu_device *adev, u32
> se_num,
>u32 sh_num, u32 reg_offset)
>  {
>   uint32_t val;
> 
> diff --git 

RE: [PATCH 2/4] drm/amd/powerplay: Add EEPROM I2C read/write support to Arcturus.

2019-10-22 Thread Chen, Guchun


> -Original Message-
> From: Grodzovsky, Andrey 
> Sent: Wednesday, October 23, 2019 3:10 AM
> To: Chen, Guchun ; amd-
> g...@lists.freedesktop.org
> Cc: Zhou1, Tao ; Deucher, Alexander
> ; noreply-conflue...@amd.com; Quan,
> Evan 
> Subject: Re: [PATCH 2/4] drm/amd/powerplay: Add EEPROM I2C read/write
> support to Arcturus.
> 
> 
> On 10/22/19 10:58 AM, Andrey Grodzovsky wrote:
> >
> > On 10/20/19 9:44 PM, Chen, Guchun wrote:
> >> Comment inline.
> >>
> >>
> >> Regards,
> >> Guchun
> >>
> >> -Original Message-
> >> From: Andrey Grodzovsky 
> >> Sent: Saturday, October 19, 2019 4:48 AM
> >> To: amd-gfx@lists.freedesktop.org
> >> Cc: Chen, Guchun ; Zhou1, Tao
> >> ; Deucher, Alexander
> ;
> >> noreply-conflue...@amd.com; Quan, Evan ;
> >> Grodzovsky, Andrey 
> >> Subject: [PATCH 2/4] drm/amd/powerplay: Add EEPROM I2C read/write
> >> support to Arcturus.
> >>
> >> The communication is done through SMU table and hence the code is in
> >> powerplay.
> >>
> >> Signed-off-by: Andrey Grodzovsky 
> >> ---
> >>   drivers/gpu/drm/amd/powerplay/arcturus_ppt.c | 229
> >> +++
> >>   1 file changed, 229 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
> >> b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
> >> index 90d871a..53d08de5 100644
> >> --- a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
> >> +++ b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
> >> @@ -36,6 +36,11 @@
> >>   #include "smu_v11_0_pptable.h"
> >>   #include "arcturus_ppsmc.h"
> >>   #include "nbio/nbio_7_4_sh_mask.h"
> >> +#include 
> >> +#include 
> >> +#include "amdgpu_ras.h"
> >> +
> >> +#define to_amdgpu_device(x) (container_of(x, struct amdgpu_ras,
> >> +eeprom_control.eeprom_accessor))->adev
> >>     #define CTF_OFFSET_EDGE    5
> >>   #define CTF_OFFSET_HOTSPOT    5
> >> @@ -171,6 +176,7 @@ static struct smu_11_0_cmn2aisc_mapping arcturus_table_map[SMU_TABLE_COUNT] = {
> >>   TAB_MAP(SMU_METRICS),
> >>   TAB_MAP(DRIVER_SMU_CONFIG),
> >>   TAB_MAP(OVERDRIVE),
> >> +    TAB_MAP(I2C_COMMANDS),
> >>   };
> >>   static struct smu_11_0_cmn2aisc_mapping arcturus_pwr_src_map[SMU_POWER_SOURCE_COUNT] = {
> >> @@ -293,6 +299,9 @@ static int arcturus_tables_init(struct smu_context *smu, struct smu_table *tables)
> >>   SMU_TABLE_INIT(tables, SMU_TABLE_SMU_METRICS,
> >> sizeof(SmuMetrics_t),
> >>  PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> >>   +    SMU_TABLE_INIT(tables, SMU_TABLE_I2C_COMMANDS,
> >> sizeof(SwI2cRequest_t),
> >> +   PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> >> +
> >>   smu_table->metrics_table = kzalloc(sizeof(SmuMetrics_t),
> >> GFP_KERNEL);
> >>   if (!smu_table->metrics_table)
> >>   return -ENOMEM;
> >> @@ -1927,6 +1936,224 @@ static int
> arcturus_dpm_set_uvd_enable(struct
> >> smu_context *smu, bool enable)
> >>   return ret;
> >>   }
> >>   +
> >> +static void arcturus_fill_eeprom_i2c_req(SwI2cRequest_t  *req, bool
> >> write,
> >> +  uint8_t address, uint32_t numbytes,
> >> +  uint8_t *data)
> >> +{
> >> +    int i;
> >> +
> >> +    BUG_ON(numbytes > MAX_SW_I2C_COMMANDS);
> >> +
> >> +    req->I2CcontrollerPort = 0;
> >> +    req->I2CSpeed = 2;
> >> +    req->SlaveAddress = address;
> >> +    req->NumCmds = numbytes;
> >> +
> >> +    for (i = 0; i < numbytes; i++) {
> >> +    SwI2cCmd_t *cmd = &req->SwI2cCmds[i];
> >> +
> >> +    /* First 2 bytes are always write for lower 2b EEPROM
> >> address */
> >> +    if (i < 2)
> >> +    cmd->Cmd = 1;
> >> +    else
> >> +    cmd->Cmd = write;
> >> +
> >> +
> >> +    /* Add RESTART for read  after address filled */
> >> +    cmd->CmdConfig |= (i == 2 && !write) ?
> >> CMDCONFIG_RESTART_MASK : 0;
> >> +
> >> +    /* Add STOP in the end */
> >> +    cmd->CmdConfig |= (i == (numbytes - 1)) ?
> >> CMDCONFIG_STOP_MASK : 0;
> >> +
> >> +    /* Fill with data regardless if read or write to simplify
> >> code */
> >> +    cmd->RegisterAddr = data[i];
> >> +    }
> >> +}
> >> +
> >> +static int arcturus_i2c_eeprom_read_data(struct i2c_adapter
> >> +*control,
> >> +   uint8_t address,
> >> +   uint8_t *data,
> >> +   uint32_t numbytes) {
> >> +    uint32_t  i, ret = 0;
> >> +    SwI2cRequest_t req;
> >> +    struct amdgpu_device *adev = to_amdgpu_device(control);
> >> +    struct smu_table_context *smu_table = &adev->smu.smu_table;
> >> +    struct smu_table *table =
> >> +   &smu_table->tables[SMU_TABLE_I2C_COMMANDS];
> >> +
> >> +    memset(&req, 0, sizeof(req));
> >> +    arcturus_fill_eeprom_i2c_req(&req, false, address, numbytes,
> >> +data);
> >> +
> >> +    mutex_lock(&adev->smu.mutex);
> >> +    /* Now read data starting with that address */
> >> +    ret = smu_update_table(&adev->smu, SMU_TABLE_I2C_COMMANDS, 0,
> >> +   &req, true);
> >> +    mutex_unlock(&adev->smu.mutex);
> >> +
> >> +    if 
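
To make the flow above concrete, here is a minimal usage sketch of the fill helper (illustrative only; the 0xA0 slave address and 0x0010 EEPROM offset are made-up values, not from the patch):

	uint8_t buf[2 + 4] = { 0x00, 0x10 };	/* EEPROM offset, MSB first */
	SwI2cRequest_t req;

	memset(&req, 0, sizeof(req));
	/* write = false: buf[0..1] still go out as writes (the address
	 * phase); the helper then inserts RESTART before the 4 reads and
	 * STOP at the end, as in arcturus_fill_eeprom_i2c_req() above. */
	arcturus_fill_eeprom_i2c_req(&req, false, 0xA0, sizeof(buf), buf);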

RE: [PATCH] drm/amdgpu/vcn: Enable VCN2.5 encoding

2019-10-22 Thread Deucher, Alexander
> -Original Message-
> From: amd-gfx  On Behalf Of Zhu,
> James
> Sent: Tuesday, October 22, 2019 10:49 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhu, James 
> Subject: [PATCH] drm/amdgpu/vcn: Enable VCN2.5 encoding
> 
> With VCN2.5 firmware (Version ENC: 1.1, Revision: 11) and later,
> VCN2.5 encoding works properly.
> 
> Signed-off-by: James Zhu 


Acked-by: Alex Deucher 

> ---
>  drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
> b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
> index d270df8..ff6cc77 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
> @@ -265,9 +265,6 @@ static int vcn_v2_5_hw_init(void *handle)
> 
>   for (i = 0; i < adev->vcn.num_enc_rings; ++i) {
>   ring = &adev->vcn.inst[j].ring_enc[i];
> - /* disable encode rings till the robustness of the FW
> */
> - ring->sched.ready = false;
> - continue;
>   r = amdgpu_ring_test_helper(ring);
>   if (r)
>   goto done;
> --
> 2.7.4
> 

RE: [PATCH 4/4] drm/amdgpu: Move amdgpu_ras_recovery_init to after SMU ready.

2019-10-22 Thread Chen, Guchun
Hi Andrey,

No, I don't want to see amdgpu_ras_recovery_init being called in different
places.
If calling amdgpu_ras_recovery_init as early as we can is not doable, due
to the Arcturus I2C code readiness timeline, I am fine with this patch.

Regards,
Guchun

-Original Message-
From: Grodzovsky, Andrey  
Sent: Tuesday, October 22, 2019 10:33 PM
To: Chen, Guchun ; amd-gfx@lists.freedesktop.org
Cc: Zhou1, Tao ; Deucher, Alexander 
; noreply-conflue...@amd.com; Quan, Evan 

Subject: Re: [PATCH 4/4] drm/amdgpu: Move amdgpu_ras_recovery_init to after SMU 
ready.

I am well aware that we want to do it ASAP, but what is your suggestion for the
Arcturus use case, where we have to do it AFTER the SMU is up and running? Do
you want to call amdgpu_ras_recovery_init in two different places depending on
whether this is Vega 20 or Arcturus? That would overcomplicate an already
complicated RAS init sequence.

Andrey

On 10/20/19 9:51 PM, Chen, Guchun wrote:
> I don't think postponing RAS recovery init is a reasonable proposal. What
> we do in recovery init is read the retired pages, if any, and retire the
> corresponding memory. Doing this before setting up the memory manager and
> reserving other memory blocks makes sure those pages are reserved safely.
>
> Regards,
> Guchun
>
> -Original Message-
> From: Andrey Grodzovsky 
> Sent: Saturday, October 19, 2019 4:49 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Chen, Guchun ; Zhou1, Tao 
> ; Deucher, Alexander ; 
> noreply-conflue...@amd.com; Quan, Evan ; 
> Grodzovsky, Andrey 
> Subject: [PATCH 4/4] drm/amdgpu: Move amdgpu_ras_recovery_init to after SMU 
> ready.
>
> For Arcturus the I2C traffic is done through SMU tables and so we must 
> postpone RAS recovery init to after they are ready which is in 
> amdgpu_device_ip_hw_init_phase2.
>
> Signed-off-by: Andrey Grodzovsky 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c| 11 ---
>   2 files changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 17cfdaf..c40e9a5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1850,6 +1850,19 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
> *adev)
>   if (r)
>   goto init_failed;
>   
> + /*
> +  * retired pages will be loaded from eeprom and reserved here,
> +  * it should be called after amdgpu_device_ip_hw_init_phase2 since
> +  * for some ASICs the RAS EEPROM code relies on the SMU being fully
> +  * functional for I2C communication, which is only true at this point.
> +  * recovery_init may fail, but it can free all resources allocated by
> +  * itself and its failure should not stop amdgpu init process.
> +  *
> +  * Note: theoretically, this should be called before all vram 
> allocations
> +  * to protect retired pages from being abused
> +  */
> + amdgpu_ras_recovery_init(adev);
> +
>   if (adev->gmc.xgmi.num_physical_nodes > 1)
>   amdgpu_xgmi_add_device(adev);
>   amdgpu_amdkfd_device_init(adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 2e85a51..1045c3f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -1721,17 +1721,6 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)  
> #endif
>   
>   /*
> -  * retired pages will be loaded from eeprom and reserved here,
> -  * it should be called after ttm init since new bo may be created,
> -  * recovery_init may fail, but it can free all resources allocated by
> -  * itself and its failure should not stop amdgpu init process.
> -  *
> -  * Note: theoretically, this should be called before all vram 
> allocations
> -  * to protect retired page from abusing
> -  */
> - amdgpu_ras_recovery_init(adev);
> -
> - /*
>*The reserved vram for firmware must be pinned to the specified
>*place on the VRAM, so reserve it early.
>*/
> --
> 2.7.4
>
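
Schematically, the resulting init order after this patch (a summary of the diff above, not new code):

	amdgpu_ttm_init(adev);                  /* VRAM/GTT managers up */
	amdgpu_device_ip_hw_init_phase2(adev);  /* SMU, and thus the I2C
	                                           path, fully functional */
	amdgpu_ras_recovery_init(adev);         /* now safe to read the
	                                           EEPROM and reserve
	                                           retired pages */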

[PATCH] drm/amdgpu: Allow reading more status registers on si/cik

2019-10-22 Thread Marek Olšák
From: Marek Olšák 

Allow userspace to read the same status registers for every family.
Based on commit c7890fea; each of these registers is added if it is
defined in the include files of the respective architecture.

Signed-off-by: Marek Olšák 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |  3 ++-
 drivers/gpu/drm/amd/amdgpu/cik.c| 19 +++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/nv.c |  1 +
 drivers/gpu/drm/amd/amdgpu/si.c | 11 +++++++++++
 drivers/gpu/drm/amd/amdgpu/soc15.c  |  1 +
 5 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index a787d838974e..08610081bd9e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -76,23 +76,24 @@
  * - 3.26.0 - GFX9: Process AMDGPU_IB_FLAG_TC_WB_NOT_INVALIDATE.
  * - 3.27.0 - Add new chunk to AMDGPU_CS to enable BO_LIST creation.
  * - 3.28.0 - Add AMDGPU_CHUNK_ID_SCHEDULED_DEPENDENCIES
  * - 3.29.0 - Add AMDGPU_IB_FLAG_RESET_GDS_MAX_WAVE_ID
  * - 3.30.0 - Add AMDGPU_SCHED_OP_CONTEXT_PRIORITY_OVERRIDE.
  * - 3.31.0 - Add support for per-flip tiling attribute changes with DC
  * - 3.32.0 - Add syncobj timeline support to AMDGPU_CS.
  * - 3.33.0 - Fixes for GDS ENOMEM failures in AMDGPU_CS.
  * - 3.34.0 - Non-DC can flip correctly between buffers with different pitches
  * - 3.35.0 - Add drm_amdgpu_info_device::tcc_disabled_mask
+ * - 3.36.0 - Allow reading more status registers on si/cik
  */
 #define KMS_DRIVER_MAJOR   3
-#define KMS_DRIVER_MINOR   35
+#define KMS_DRIVER_MINOR   36
 #define KMS_DRIVER_PATCHLEVEL  0
 
 int amdgpu_vram_limit = 0;
 int amdgpu_vis_vram_limit = 0;
 int amdgpu_gart_size = -1; /* auto */
 int amdgpu_gtt_size = -1; /* auto */
 int amdgpu_moverate = -1; /* auto */
 int amdgpu_benchmarking = 0;
 int amdgpu_testing = 0;
 int amdgpu_audio = -1;
diff --git a/drivers/gpu/drm/amd/amdgpu/cik.c b/drivers/gpu/drm/amd/amdgpu/cik.c
index fc8b34480f66..2d64d270725d 100644
--- a/drivers/gpu/drm/amd/amdgpu/cik.c
+++ b/drivers/gpu/drm/amd/amdgpu/cik.c
@@ -959,20 +959,39 @@ static bool cik_read_bios_from_rom(struct amdgpu_device 
*adev,
WREG32(mmSMC_IND_INDEX_0, ixROM_DATA);
for (i = 0; i < length_dw; i++)
dw_ptr[i] = RREG32(mmSMC_IND_DATA_0);
spin_unlock_irqrestore(&adev->smc_idx_lock, flags);
 
return true;
 }
 
 static const struct amdgpu_allowed_register_entry cik_allowed_read_registers[] 
= {
{mmGRBM_STATUS},
+   {mmGRBM_STATUS2},
+   {mmGRBM_STATUS_SE0},
+   {mmGRBM_STATUS_SE1},
+   {mmGRBM_STATUS_SE2},
+   {mmGRBM_STATUS_SE3},
+   {mmSRBM_STATUS},
+   {mmSRBM_STATUS2},
+   {mmSDMA0_STATUS_REG + SDMA0_REGISTER_OFFSET},
+   {mmSDMA0_STATUS_REG + SDMA1_REGISTER_OFFSET},
+   {mmCP_STAT},
+   {mmCP_STALLED_STAT1},
+   {mmCP_STALLED_STAT2},
+   {mmCP_STALLED_STAT3},
+   {mmCP_CPF_BUSY_STAT},
+   {mmCP_CPF_STALLED_STAT1},
+   {mmCP_CPF_STATUS},
+   {mmCP_CPC_BUSY_STAT},
+   {mmCP_CPC_STALLED_STAT1},
+   {mmCP_CPC_STATUS},
{mmGB_ADDR_CONFIG},
{mmMC_ARB_RAMCFG},
{mmGB_TILE_MODE0},
{mmGB_TILE_MODE1},
{mmGB_TILE_MODE2},
{mmGB_TILE_MODE3},
{mmGB_TILE_MODE4},
{mmGB_TILE_MODE5},
{mmGB_TILE_MODE6},
{mmGB_TILE_MODE7},
diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
index 46206a1a1f4d..22ab1955b923 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -171,20 +171,21 @@ static struct soc15_allowed_register_entry 
nv_allowed_read_registers[] = {
{ SOC15_REG_ENTRY(SDMA0, 0, mmSDMA0_STATUS_REG)},
{ SOC15_REG_ENTRY(SDMA1, 0, mmSDMA1_STATUS_REG)},
 #endif
{ SOC15_REG_ENTRY(GC, 0, mmCP_STAT)},
{ SOC15_REG_ENTRY(GC, 0, mmCP_STALLED_STAT1)},
{ SOC15_REG_ENTRY(GC, 0, mmCP_STALLED_STAT2)},
{ SOC15_REG_ENTRY(GC, 0, mmCP_STALLED_STAT3)},
{ SOC15_REG_ENTRY(GC, 0, mmCP_CPF_BUSY_STAT)},
{ SOC15_REG_ENTRY(GC, 0, mmCP_CPF_STALLED_STAT1)},
{ SOC15_REG_ENTRY(GC, 0, mmCP_CPF_STATUS)},
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CPC_BUSY_STAT)},
{ SOC15_REG_ENTRY(GC, 0, mmCP_CPC_STALLED_STAT1)},
{ SOC15_REG_ENTRY(GC, 0, mmCP_CPC_STATUS)},
{ SOC15_REG_ENTRY(GC, 0, mmGB_ADDR_CONFIG)},
 };
 
 static uint32_t nv_read_indexed_register(struct amdgpu_device *adev, u32 
se_num,
 u32 sh_num, u32 reg_offset)
 {
uint32_t val;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/si.c b/drivers/gpu/drm/amd/amdgpu/si.c
index 493af42152f2..29024e64c886 100644
--- a/drivers/gpu/drm/amd/amdgpu/si.c
+++ b/drivers/gpu/drm/amd/amdgpu/si.c
@@ -968,20 +968,31 @@ static void si_smc_wreg(struct amdgpu_device *adev, u32 
reg, u32 v)
unsigned long flags;
 
spin_lock_irqsave(&adev->smc_idx_lock, flags);
WREG32(SMC_IND_INDEX_0, (reg));
   

[PATCH] drm/amdgpu: Allow reading more status registers on si/cik

2019-10-22 Thread Marek Olšák
From: Trek 

Allow userspace to read the same status registers for every family.
Based on commit c7890fea; each of these registers is added if it is
defined in the include files of the respective architecture.

Signed-off-by: Trek 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |  3 ++-
 drivers/gpu/drm/amd/amdgpu/cik.c| 19 +++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/nv.c |  1 +
 drivers/gpu/drm/amd/amdgpu/si.c | 11 +++++++++++
 drivers/gpu/drm/amd/amdgpu/soc15.c  |  1 +
 5 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index a787d838974e..08610081bd9e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -76,23 +76,24 @@
  * - 3.26.0 - GFX9: Process AMDGPU_IB_FLAG_TC_WB_NOT_INVALIDATE.
  * - 3.27.0 - Add new chunk to AMDGPU_CS to enable BO_LIST creation.
  * - 3.28.0 - Add AMDGPU_CHUNK_ID_SCHEDULED_DEPENDENCIES
  * - 3.29.0 - Add AMDGPU_IB_FLAG_RESET_GDS_MAX_WAVE_ID
  * - 3.30.0 - Add AMDGPU_SCHED_OP_CONTEXT_PRIORITY_OVERRIDE.
  * - 3.31.0 - Add support for per-flip tiling attribute changes with DC
  * - 3.32.0 - Add syncobj timeline support to AMDGPU_CS.
  * - 3.33.0 - Fixes for GDS ENOMEM failures in AMDGPU_CS.
  * - 3.34.0 - Non-DC can flip correctly between buffers with different pitches
  * - 3.35.0 - Add drm_amdgpu_info_device::tcc_disabled_mask
+ * - 3.36.0 - Allow reading more status registers on si/cik
  */
 #define KMS_DRIVER_MAJOR   3
-#define KMS_DRIVER_MINOR   35
+#define KMS_DRIVER_MINOR   36
 #define KMS_DRIVER_PATCHLEVEL  0
 
 int amdgpu_vram_limit = 0;
 int amdgpu_vis_vram_limit = 0;
 int amdgpu_gart_size = -1; /* auto */
 int amdgpu_gtt_size = -1; /* auto */
 int amdgpu_moverate = -1; /* auto */
 int amdgpu_benchmarking = 0;
 int amdgpu_testing = 0;
 int amdgpu_audio = -1;
diff --git a/drivers/gpu/drm/amd/amdgpu/cik.c b/drivers/gpu/drm/amd/amdgpu/cik.c
index fc8b34480f66..2d64d270725d 100644
--- a/drivers/gpu/drm/amd/amdgpu/cik.c
+++ b/drivers/gpu/drm/amd/amdgpu/cik.c
@@ -959,20 +959,39 @@ static bool cik_read_bios_from_rom(struct amdgpu_device 
*adev,
WREG32(mmSMC_IND_INDEX_0, ixROM_DATA);
for (i = 0; i < length_dw; i++)
dw_ptr[i] = RREG32(mmSMC_IND_DATA_0);
spin_unlock_irqrestore(&adev->smc_idx_lock, flags);
 
return true;
 }
 
 static const struct amdgpu_allowed_register_entry cik_allowed_read_registers[] 
= {
{mmGRBM_STATUS},
+   {mmGRBM_STATUS2},
+   {mmGRBM_STATUS_SE0},
+   {mmGRBM_STATUS_SE1},
+   {mmGRBM_STATUS_SE2},
+   {mmGRBM_STATUS_SE3},
+   {mmSRBM_STATUS},
+   {mmSRBM_STATUS2},
+   {mmSDMA0_STATUS_REG + SDMA0_REGISTER_OFFSET},
+   {mmSDMA0_STATUS_REG + SDMA1_REGISTER_OFFSET},
+   {mmCP_STAT},
+   {mmCP_STALLED_STAT1},
+   {mmCP_STALLED_STAT2},
+   {mmCP_STALLED_STAT3},
+   {mmCP_CPF_BUSY_STAT},
+   {mmCP_CPF_STALLED_STAT1},
+   {mmCP_CPF_STATUS},
+   {mmCP_CPC_BUSY_STAT},
+   {mmCP_CPC_STALLED_STAT1},
+   {mmCP_CPC_STATUS},
{mmGB_ADDR_CONFIG},
{mmMC_ARB_RAMCFG},
{mmGB_TILE_MODE0},
{mmGB_TILE_MODE1},
{mmGB_TILE_MODE2},
{mmGB_TILE_MODE3},
{mmGB_TILE_MODE4},
{mmGB_TILE_MODE5},
{mmGB_TILE_MODE6},
{mmGB_TILE_MODE7},
diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
index 46206a1a1f4d..22ab1955b923 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -171,20 +171,21 @@ static struct soc15_allowed_register_entry 
nv_allowed_read_registers[] = {
{ SOC15_REG_ENTRY(SDMA0, 0, mmSDMA0_STATUS_REG)},
{ SOC15_REG_ENTRY(SDMA1, 0, mmSDMA1_STATUS_REG)},
 #endif
{ SOC15_REG_ENTRY(GC, 0, mmCP_STAT)},
{ SOC15_REG_ENTRY(GC, 0, mmCP_STALLED_STAT1)},
{ SOC15_REG_ENTRY(GC, 0, mmCP_STALLED_STAT2)},
{ SOC15_REG_ENTRY(GC, 0, mmCP_STALLED_STAT3)},
{ SOC15_REG_ENTRY(GC, 0, mmCP_CPF_BUSY_STAT)},
{ SOC15_REG_ENTRY(GC, 0, mmCP_CPF_STALLED_STAT1)},
{ SOC15_REG_ENTRY(GC, 0, mmCP_CPF_STATUS)},
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CPC_BUSY_STAT)},
{ SOC15_REG_ENTRY(GC, 0, mmCP_CPC_STALLED_STAT1)},
{ SOC15_REG_ENTRY(GC, 0, mmCP_CPC_STATUS)},
{ SOC15_REG_ENTRY(GC, 0, mmGB_ADDR_CONFIG)},
 };
 
 static uint32_t nv_read_indexed_register(struct amdgpu_device *adev, u32 
se_num,
 u32 sh_num, u32 reg_offset)
 {
uint32_t val;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/si.c b/drivers/gpu/drm/amd/amdgpu/si.c
index 493af42152f2..29024e64c886 100644
--- a/drivers/gpu/drm/amd/amdgpu/si.c
+++ b/drivers/gpu/drm/amd/amdgpu/si.c
@@ -968,20 +968,31 @@ static void si_smc_wreg(struct amdgpu_device *adev, u32 
reg, u32 v)
unsigned long flags;
 
spin_lock_irqsave(&adev->smc_idx_lock, flags);
WREG32(SMC_IND_INDEX_0, (reg));

Re: [PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume

2019-10-22 Thread Grodzovsky, Andrey

On 10/22/19 4:04 PM, Yang, Philip wrote:
>
> On 2019-10-22 3:36 p.m., Grodzovsky, Andrey wrote:
>> On 10/22/19 3:19 PM, Yang, Philip wrote:
>>> On 2019-10-22 2:40 p.m., Grodzovsky, Andrey wrote:
 On 10/22/19 2:38 PM, Grodzovsky, Andrey wrote:
> On 10/22/19 2:28 PM, Yang, Philip wrote:
>> If device reset/suspend/resume fails for some reason, the dqm lock is
>> held forever and this causes a deadlock. Below is a kernel backtrace of
>> an application opening kfd after suspend/resume failed.
>>
>> Instead of holding the dqm lock in pre_reset and releasing it in
>> post_reset, add a dqm->device_stopped flag which is modified in
>> dqm->ops.start and dqm->ops.stop. The flag doesn't need lock protection
>> because all writes/reads happen inside the dqm lock.
>>
>> For the HWS case, map_queues_cpsch and unmap_queues_cpsch check the
>> device_stopped flag before sending the updated runlist.
> Is there a chance of a race condition here where dqm->device_stopped
> returns true for some operation (e.g. map_queues_cpsch) but just as it
> proceeds GPU reset starts?
>
> Andrey
 Correction -

 dqm->device_stopped returns FALSE

>>> No race condition here, dqm->device_stopped is set to FALSE in
>>> kgd2kfd_post_reset -> dqm->ops.start(), which is the last step of
>>> amdgpu_device_gpu_recover, so it's safe to do map_queues_cpsch.
>>>
>>> Regards,
>>> Philip
>> Sorry - I was confused by the commit description vs. body - in the
>> description it's called the device_stopped flag while in the body it's
>> sched_running - probably the description needs to be fixed.
>>
> Thanks, commit description is updated.
>
>> So I mean the switch true->false when GPU reset just begins - you get an
>> IOCTL to map a queue (if I understood the KFD code correctly), check
>> dqm->sched_running == true and continue. What if right then a GPU reset
>> starts due to some issue like a job timeout or a user-triggered reset -
>> dqm->sched_running becomes false but you are already past that checkpoint
>> in the IOCTL, no?
>>
> Mapping a queue is done under the dqm lock - send the updated runlist to
> the HWS, then release the dqm lock. At GPU reset start, pre_reset will
> first unmap all queues under the dqm lock, then set dqm->sched_running to
> false, release the dqm lock, and then continue to reset the HW. The dqm
> lock guarantees the two operations are serialized.
>
> Philip


I see now that it's ok.

Andrey


>
>> Andrey
>>
 Andrey

>> v2: For no-HWS case, when device is stopped, don't call
>> load/destroy_mqd for eviction, restore and create queue, and avoid
>> debugfs dump hdqs.
>>
>> Backtrace of dqm lock deadlock:
>>
>> [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more
>> than 120 seconds.
>> [Thu Oct 17 16:43:37 2019]   Not tainted
>> 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
>> [Thu Oct 17 16:43:37 2019] "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [Thu Oct 17 16:43:37 2019] rocminfoD0  3024   2947
>> 0x8000
>> [Thu Oct 17 16:43:37 2019] Call Trace:
>> [Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0
>> [Thu Oct 17 16:43:37 2019]  schedule+0x32/0x70
>> [Thu Oct 17 16:43:37 2019]  schedule_preempt_disabled+0xa/0x10
>> [Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0
>> [Thu Oct 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0
>> [Thu Oct 17 16:43:37 2019]  ? process_termination_cpsch+0x24/0x2f0
>> [amdgpu]
>> [Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0
>> [amdgpu]
>> [Thu Oct 17 16:43:37 2019]
>> kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
>> [Thu Oct 17 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220
>> [amdgpu]
>> [Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0
>> [Thu Oct 17 16:43:37 2019]  exit_mmap+0x160/0x1a0
>> [Thu Oct 17 16:43:37 2019]  ? __handle_mm_fault+0xba3/0x1200
>> [Thu Oct 17 16:43:37 2019]  ? exit_robust_list+0x5a/0x110
>> [Thu Oct 17 16:43:37 2019]  mmput+0x4a/0x120
>> [Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20
>> [Thu Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200
>> [Thu Oct 17 16:43:37 2019]  do_group_exit+0x3a/0xa0
>> [Thu Oct 17 16:43:37 2019]  __x64_sys_exit_group+0x14/0x20
>> [Thu Oct 17 16:43:37 2019]  do_syscall_64+0x4f/0x100
>> [Thu Oct 17 16:43:37 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> Suggested-by: Felix Kuehling 
>> Signed-off-by: Philip Yang 
>> ---
>>drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  |  6 +--
>>drivers/gpu/drm/amd/amdkfd/kfd_device.c   |  5 --
>>.../drm/amd/amdkfd/kfd_device_queue_manager.c | 47 
>> +--
>>.../drm/amd/amdkfd/kfd_device_queue_manager.h |  1 +
>>4 files changed, 46 insertions(+), 13 deletions(-)
>>
>> diff --git 
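
For reference, the serialization Philip describes, as a simplified sketch (names follow the v2 patch; in the real code the lock is taken further up the call chain):

	/* reset side (pre_reset path), simplified */
	dqm_lock(dqm);
	/* ... unmap all queues from the HWS ... */
	dqm->sched_running = false;
	dqm_unlock(dqm);

	/* IOCTL side (map a queue), simplified */
	dqm_lock(dqm);
	if (dqm->sched_running) {
		/* ... send the updated runlist to the HWS ... */
	}
	dqm_unlock(dqm);

Because both sides run under the same dqm lock, sched_running cannot flip between the check and the runlist submission.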

Re: [PATCH v5 07/14] drm/dp_mst: Don't forget to update port->input in drm_dp_mst_handle_conn_stat()

2019-10-22 Thread Sean Paul
On Mon, Oct 21, 2019 at 10:36:02PM -0400, Lyude Paul wrote:
> This probably hasn't caused any problems up until now since it's
> probably nearly impossible to encounter this in the wild, however if we
> were to receive a connection status notification from the MST hub after
> resume while we're in the middle of reprobing the link addresses for a
> topology then there's a much larger chance that a port could have
> changed from being an output port to input port (or vice versa). If we
> forget to update this bit of information, we'll potentially ignore a
> valid PDT change on a downstream port because we think it's an input
> port.
> 
> So, make sure we read the input_port field in connection status
> notifications in drm_dp_mst_handle_conn_stat() to prevent this from
> happening once we've implemented suspend/resume reprobing.
> 
> Cc: Juston Li 
> Cc: Imre Deak 
> Cc: Ville Syrjälä 
> Cc: Harry Wentland 
> Cc: Daniel Vetter 
> Signed-off-by: Lyude Paul 
> Reviewed-by: Sean Paul 
> ---
>  drivers/gpu/drm/drm_dp_mst_topology.c | 52 +++
>  1 file changed, 38 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_dp_mst_topology.c 
> b/drivers/gpu/drm/drm_dp_mst_topology.c
> index 7bf4db91ff90..c8e218b902ae 100644
> --- a/drivers/gpu/drm/drm_dp_mst_topology.c
> +++ b/drivers/gpu/drm/drm_dp_mst_topology.c
> @@ -2079,18 +2079,40 @@ drm_dp_mst_handle_conn_stat(struct drm_dp_mst_branch 
> *mstb,
>  {
>   struct drm_dp_mst_topology_mgr *mgr = mstb->mgr;
>   struct drm_dp_mst_port *port;
> - int old_ddps;
> - bool dowork = false;
> + int old_ddps, ret;
> + u8 new_pdt;
> + bool dowork = false, create_connector = false;
>  
>   port = drm_dp_get_port(mstb, conn_stat->port_number);
>   if (!port)
>   return;
>  
> - /* Locking is only needed if the port's exposed to userspace */
> - if (port->connector)
> + if (port->connector) {
> + if (!port->input && conn_stat->input_port) {
> + /*
> +  * We can't remove a connector from an already exposed
> +  * port, so just throw the port out and make sure we
> +  * reprobe the link address of its parent MSTB
> +  */
> + drm_dp_mst_topology_unlink_port(mgr, port);
> + mstb->link_address_sent = false;
> + dowork = true;
> + goto out;
> + }
> +
> + /*
> +  * Locking is only needed if the port's exposed to userspace
> +  */

optional nit: this will still fit on one line

>   drm_modeset_lock(&mgr->base.lock, NULL);
> + } else if (port->input && !conn_stat->input_port) {
> + create_connector = true;
> + /* Reprobe link address so we get num_sdp_streams */
> + mstb->link_address_sent = false;
> + dowork = true;
> + }
>  
>   old_ddps = port->ddps;
> + port->input = conn_stat->input_port;
>   port->mcs = conn_stat->message_capability_status;
>   port->ldps = conn_stat->legacy_device_plug_status;
>   port->ddps = conn_stat->displayport_device_plug_status;
> @@ -2103,21 +2125,23 @@ drm_dp_mst_handle_conn_stat(struct drm_dp_mst_branch 
> *mstb,
>   }
>   }
>  
> - if (!port->input) {
> - int ret = drm_dp_port_set_pdt(port,
> -   conn_stat->peer_device_type);
> - if (ret == 1) {
> - dowork = true;
> - } else if (ret < 0) {
> - DRM_ERROR("Failed to change PDT for port %p: %d\n",
> -   port, ret);
> - dowork = false;
> - }
> + new_pdt = port->input ? DP_PEER_DEVICE_NONE : 
> conn_stat->peer_device_type;
> +
> + ret = drm_dp_port_set_pdt(port, new_pdt);
> + if (ret == 1) {
> + dowork = true;
> + } else if (ret < 0) {
> + DRM_ERROR("Failed to change PDT for port %p: %d\n",
> +   port, ret);
> + dowork = false;
>   }
>  
>   if (port->connector)
>   drm_modeset_unlock(&mgr->base.lock);
> + else if (create_connector)
> + drm_dp_mst_port_add_connector(mstb, port);
>  
> +out:
>   drm_dp_mst_topology_put_port(port);
>   if (dowork)
>   queue_work(system_long_wq, &mstb->mgr->work);
> -- 
> 2.21.0
> 

-- 
Sean Paul, Software Engineer, Google / Chromium OS

Re: [PATCH v5 06/14] drm/dp_mst: Protect drm_dp_mst_port members with locking

2019-10-22 Thread Sean Paul
On Mon, Oct 21, 2019 at 10:36:01PM -0400, Lyude Paul wrote:
> This is a complicated one. Essentially, there's currently a problem in the MST
> core that hasn't really caused any issues that we're aware of (emphasis on 
> "that
> we're aware of"): locking.
> 
> When we go through and probe the link addresses and path resources in a
> topology, we hold no locks when updating ports with said information. The
> members I'm referring to in particular are:
> 
> - ldps
> - ddps
> - mcs
> - pdt
> - dpcd_rev
> - num_sdp_streams
> - num_sdp_stream_sinks
> - available_pbn
> - input
> - connector
> 
> Now that we're handling UP requests asynchronously and will be using some of
> the struct members mentioned above in atomic modesetting in the future for
> features such as PBN validation, this is going to become a lot more important.
> As well, the next few commits that prepare us for and introduce suspend/resume
> reprobing will also need clear locking in order to prevent from additional
> racing hilarities that we never could have hit in the past.
> 
> So, let's solve this issue by using &mgr->base.lock, the modesetting
> lock which currently only protects &mgr->base.state. This works
> perfectly because it allows us to avoid blocking connection_mutex
> unnecessarily, and we can grab this in connector detection paths since
> it's a ww mutex. We start by having drm_dp_mst_handle_up_req() hold this
> when updating ports. For drm_dp_mst_handle_link_address_port() things
> are a bit more complicated. As I've learned the hard way, we can grab
> &mgr->base.lock for everything except for port->connector. See, our
> normal driver probing paths end up generating this rather obvious
> lockdep chain:
> 
> &dev->mode_config.mutex
>   -> crtc_ww_class_mutex/crtc_ww_class_acquire
>     -> &mgr->base.lock
> 
> However, sysfs grabs &dev->mode_config.mutex in order to protect itself
> from connector state changing under it. Because this entails grabbing
> kn->count, e.g. the lock that the kernel provides for protecting sysfs
> contexts, we end up grabbing kn->count followed by
> &dev->mode_config.mutex. This ends up creating an extremely rude chain:
> 
> &kn->count
>   -> &dev->mode_config.mutex
>     -> crtc_ww_class_mutex/crtc_ww_class_acquire
>       -> &mgr->base.lock
> 
> I mean, look at that thing! It's just evil!!! This gross thing ends up
> making any calls to drm_connector_register()/drm_connector_unregister()
> impossible when holding any kind of modesetting lock. This is annoying
> because ideally, we always want to ensure that
> drm_dp_mst_port->connector never changes when doing an atomic commit or
> check that would affect the atomic topology state so that it can
> reliably and easily be used from future DRM DP MST helpers to assist
> with tasks such as scanning through the current VCPI allocations and
> adding connectors which need to have their allocations updated in
> response to a bandwidth change or the like.
> 
> Being able to hold &mgr->base.lock throughout the entire link probe
> process would have been _great_, since we could prevent userspace from
> ever seeing any states in-between individual port changes and as a
> result likely end up with a much faster probe and more consistent
> results from said probes. But without some rework of how we handle
> connector probing in sysfs it's not at all currently possible. In the
> future, maybe we can try using the sysfs locks to protect updates to
> connector probing state and fix this mess.
> 
> So for now, to protect everything other than port->connector under
> &mgr->base.lock and ensure that we still have the guarantee that atomic
> check/commit contexts will never see port->connector change we use a
> silly trick. See: port->connector only needs to change in order to
> ensure that input ports (see the MST spec) never have a ghost connector
> associated with them. But, there's nothing stopping us from simply
> throwing the entire port out and creating a new one in order to maintain
> that requirement while still keeping port->connector consistent across
> the lifetime of the port in atomic check/commit contexts. For all
> intended purposes this works fine, as we validate ports in any contexts
> we care about before using them and as such will end up reporting the
> connector as disconnected until its port's destruction finalizes. So,
> we just do that in cases where we detect port->input has transitioned
> from true->false. We don't need to worry about the other direction,
> since a port without a connector isn't visible to userspace and as such
> doesn't need to be protected by &mgr->base.lock until we finish
> registering a connector for it.
> 
> For updating members of drm_dp_mst_port other than port->connector, we
> simply grab &mgr->base.lock in drm_dp_mst_link_probe_work() for already
> registered ports, update said members and drop the lock before
> potentially registering a connector and probing the link address of its
> children.
> 
> Finally, we modify drm_dp_mst_detect_port() to take a modesetting lock
> acquisition context in 
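
As a rough illustration of the scheme described above (a sketch, not code from the patch; it assumes mgr->base.lock is the drm_private_obj modeset lock this series repurposes):

	/* In a connector ->detect() path, which already carries a
	 * drm_modeset_acquire_ctx, port members can be read safely by
	 * taking the topology manager's base lock: */
	ret = drm_modeset_lock(&mgr->base.lock, ctx);
	if (ret)
		return ret; /* e.g. -EDEADLK: caller backs off and retries */
	/* port->pdt, port->ddps, port->mcs, ... are stable here */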

Re: [PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume

2019-10-22 Thread Yang, Philip


On 2019-10-22 3:36 p.m., Grodzovsky, Andrey wrote:
> 
> On 10/22/19 3:19 PM, Yang, Philip wrote:
>>
>> On 2019-10-22 2:40 p.m., Grodzovsky, Andrey wrote:
>>> On 10/22/19 2:38 PM, Grodzovsky, Andrey wrote:
 On 10/22/19 2:28 PM, Yang, Philip wrote:
> If device reset/suspend/resume fails for some reason, the dqm lock is
> held forever and this causes a deadlock. Below is a kernel backtrace of
> an application opening kfd after suspend/resume failed.
>
> Instead of holding the dqm lock in pre_reset and releasing it in
> post_reset, add a dqm->device_stopped flag which is modified in
> dqm->ops.start and dqm->ops.stop. The flag doesn't need lock protection
> because all writes/reads happen inside the dqm lock.
>
> For the HWS case, map_queues_cpsch and unmap_queues_cpsch check the
> device_stopped flag before sending the updated runlist.
 Is there a chance of a race condition here where dqm->device_stopped
 returns true for some operation (e.g. map_queues_cpsch) but just as it
 proceeds GPU reset starts?

 Andrey
>>>
>>> Correction -
>>>
>>> dqm->device_stopped returns FALSE
>>>
>> No race condition here, dqm->device_stopped is set to FALSE in
>> kgd2kfd_post_reset -> dqm->ops.start(), which is the last step of
>> amdgpu_device_gpu_recover, so it's safe to do map_queues_cpsch.
>>
>> Regards,
>> Philip
> 
> Sorry - I was confused by the commit description vs. body - in the
> description it's called the device_stopped flag while in the body it's
> sched_running - probably the description needs to be fixed.
> 
Thanks, commit description is updated.

> So I mean the switch true->false when GPU reset just begins - you get an
> IOCTL to map a queue (if I understood the KFD code correctly), check
> dqm->sched_running == true and continue. What if right then a GPU reset
> starts due to some issue like a job timeout or a user-triggered reset -
> dqm->sched_running becomes false but you are already past that checkpoint
> in the IOCTL, no?
>
Mapping a queue is done under the dqm lock - send the updated runlist to
the HWS, then release the dqm lock. At GPU reset start, pre_reset will
first unmap all queues under the dqm lock, then set dqm->sched_running to
false, release the dqm lock, and then continue to reset the HW. The dqm
lock guarantees the two operations are serialized.

Philip

> Andrey
> 
>>
>>> Andrey
>>>

> v2: For no-HWS case, when device is stopped, don't call
> load/destroy_mqd for eviction, restore and create queue, and avoid
> debugfs dump hdqs.
>
> Backtrace of dqm lock deadlock:
>
> [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more
> than 120 seconds.
> [Thu Oct 17 16:43:37 2019]   Not tainted
> 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
> [Thu Oct 17 16:43:37 2019] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Thu Oct 17 16:43:37 2019] rocminfoD0  3024   2947
> 0x8000
> [Thu Oct 17 16:43:37 2019] Call Trace:
> [Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0
> [Thu Oct 17 16:43:37 2019]  schedule+0x32/0x70
> [Thu Oct 17 16:43:37 2019]  schedule_preempt_disabled+0xa/0x10
> [Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0
> [Thu Oct 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0
> [Thu Oct 17 16:43:37 2019]  ? process_termination_cpsch+0x24/0x2f0
> [amdgpu]
> [Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0
> [amdgpu]
> [Thu Oct 17 16:43:37 2019]
> kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
> [Thu Oct 17 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220
> [amdgpu]
> [Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0
> [Thu Oct 17 16:43:37 2019]  exit_mmap+0x160/0x1a0
> [Thu Oct 17 16:43:37 2019]  ? __handle_mm_fault+0xba3/0x1200
> [Thu Oct 17 16:43:37 2019]  ? exit_robust_list+0x5a/0x110
> [Thu Oct 17 16:43:37 2019]  mmput+0x4a/0x120
> [Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20
> [Thu Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200
> [Thu Oct 17 16:43:37 2019]  do_group_exit+0x3a/0xa0
> [Thu Oct 17 16:43:37 2019]  __x64_sys_exit_group+0x14/0x20
> [Thu Oct 17 16:43:37 2019]  do_syscall_64+0x4f/0x100
> [Thu Oct 17 16:43:37 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Suggested-by: Felix Kuehling 
> Signed-off-by: Philip Yang 
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  |  6 +--
>   drivers/gpu/drm/amd/amdkfd/kfd_device.c   |  5 --
>   .../drm/amd/amdkfd/kfd_device_queue_manager.c | 47 
> +--
>   .../drm/amd/amdkfd/kfd_device_queue_manager.h |  1 +
>   4 files changed, 46 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> index d9e36dbf13d5..40d75c39f08e 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> 

Re: [PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume

2019-10-22 Thread Yang, Philip
On 2019-10-22 2:44 p.m., Kuehling, Felix wrote:
> On 2019-10-22 14:28, Yang, Philip wrote:
>> If device reset/suspend/resume fails for some reason, the dqm lock is
>> held forever and this causes a deadlock. Below is a kernel backtrace of
>> an application opening kfd after suspend/resume failed.
>>
>> Instead of holding the dqm lock in pre_reset and releasing it in
>> post_reset, add a dqm->device_stopped flag which is modified in
>> dqm->ops.start and dqm->ops.stop. The flag doesn't need lock protection
>> because all writes/reads happen inside the dqm lock.
>>
>> For the HWS case, map_queues_cpsch and unmap_queues_cpsch check the
>> device_stopped flag before sending the updated runlist.
>>
>> v2: For no-HWS case, when device is stopped, don't call
>> load/destroy_mqd for eviction, restore and create queue, and avoid
>> debugfs dump hdqs.
>>
>> Backtrace of dqm lock deadlock:
>>
>> [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more
>> than 120 seconds.
>> [Thu Oct 17 16:43:37 2019]   Not tainted
>> 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
>> [Thu Oct 17 16:43:37 2019] "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [Thu Oct 17 16:43:37 2019] rocminfoD0  3024   2947
>> 0x8000
>> [Thu Oct 17 16:43:37 2019] Call Trace:
>> [Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0
>> [Thu Oct 17 16:43:37 2019]  schedule+0x32/0x70
>> [Thu Oct 17 16:43:37 2019]  schedule_preempt_disabled+0xa/0x10
>> [Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0
>> [Thu Oct 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0
>> [Thu Oct 17 16:43:37 2019]  ? process_termination_cpsch+0x24/0x2f0
>> [amdgpu]
>> [Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0
>> [amdgpu]
>> [Thu Oct 17 16:43:37 2019]
>> kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
>> [Thu Oct 17 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220
>> [amdgpu]
>> [Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0
>> [Thu Oct 17 16:43:37 2019]  exit_mmap+0x160/0x1a0
>> [Thu Oct 17 16:43:37 2019]  ? __handle_mm_fault+0xba3/0x1200
>> [Thu Oct 17 16:43:37 2019]  ? exit_robust_list+0x5a/0x110
>> [Thu Oct 17 16:43:37 2019]  mmput+0x4a/0x120
>> [Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20
>> [Thu Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200
>> [Thu Oct 17 16:43:37 2019]  do_group_exit+0x3a/0xa0
>> [Thu Oct 17 16:43:37 2019]  __x64_sys_exit_group+0x14/0x20
>> [Thu Oct 17 16:43:37 2019]  do_syscall_64+0x4f/0x100
>> [Thu Oct 17 16:43:37 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> Suggested-by: Felix Kuehling 
>> Signed-off-by: Philip Yang 
> 
> Three more comments inline. With those comments addressed, this patch is
> 
> Reviewed-by: Felix Kuehling 
> 
> 
>> ---
>>drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  |  6 +--
>>drivers/gpu/drm/amd/amdkfd/kfd_device.c   |  5 --
>>.../drm/amd/amdkfd/kfd_device_queue_manager.c | 47 +--
>>.../drm/amd/amdkfd/kfd_device_queue_manager.h |  1 +
>>4 files changed, 46 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
>> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> index d9e36dbf13d5..40d75c39f08e 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> @@ -120,13 +120,13 @@ static int kfd_open(struct inode *inode, struct file 
>> *filep)
>>  		return -EPERM;
>>  	}
>>
>> +	if (kfd_is_locked())
>> +		return -EAGAIN;
>> +
>>  	process = kfd_create_process(filep);
>>  	if (IS_ERR(process))
>>  		return PTR_ERR(process);
>>
>> -	if (kfd_is_locked())
>> -		return -EAGAIN;
>> -
> 
> Is this part of the change still needed? I remember that this sequence
> was a bit tricky with some potential race condition when Shaoyun was
> working on it. This may have unintended side effects.
> 
> 
I'll revert this change to be safe; it is not needed, and it's not clear
whether there are side effects.

>>  dev_dbg(kfd_device, "process %d opened, compat mode (32 bit) - %d\n",
>>  process->pasid, process->is_32bit_user_mode);
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index 8f4b24e84964..4fa8834ce7cb 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -730,9 +730,6 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>>  		return 0;
>>  	kgd2kfd_suspend(kfd);
>>
>> -	/* hold dqm->lock to prevent further execution*/
>> -	dqm_lock(kfd->dqm);
>> -
>>  	kfd_signal_reset_event(kfd);
>>  	return 0;
>>  }
>> @@ -750,8 +747,6 @@ int kgd2kfd_post_reset(struct kfd_dev *kfd)
>>  	if (!kfd->init_complete)
>>  		return 0;
>>
>> -	dqm_unlock(kfd->dqm);
>> -
>>  	ret = kfd_resume(kfd);
>>  	if (ret)
>>  		return ret;
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 

Re: [PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume

2019-10-22 Thread Grodzovsky, Andrey

On 10/22/19 3:19 PM, Yang, Philip wrote:
>
> On 2019-10-22 2:40 p.m., Grodzovsky, Andrey wrote:
>> On 10/22/19 2:38 PM, Grodzovsky, Andrey wrote:
>>> On 10/22/19 2:28 PM, Yang, Philip wrote:
 If device reset/suspend/resume fails for some reason, the dqm lock is
 held forever and this causes a deadlock. Below is a kernel backtrace of
 an application opening kfd after suspend/resume failed.

 Instead of holding the dqm lock in pre_reset and releasing it in
 post_reset, add a dqm->device_stopped flag which is modified in
 dqm->ops.start and dqm->ops.stop. The flag doesn't need lock protection
 because all writes/reads happen inside the dqm lock.

 For the HWS case, map_queues_cpsch and unmap_queues_cpsch check the
 device_stopped flag before sending the updated runlist.
>>> Is there a chance of a race condition here where dqm->device_stopped
>>> returns true for some operation (e.g. map_queues_cpsch) but just as it
>>> proceeds GPU reset starts?
>>>
>>> Andrey
>>
>> Correction -
>>
>> dqm->device_stopped returns FALSE
>>
> No race condition here, dqm->device_stopped is set to FALSE in
> kgd2kfd_post_reset -> dqm->ops.start(), which is the last step of
> amdgpu_device_gpu_recover, so it's safe to do map_queues_cpsch.
>
> Regards,
> Philip

Sorry - I was confused by the commit description vs. body - in the
description it's called the device_stopped flag while in the body it's
sched_running - probably the description needs to be fixed.

So I mean the switch true->false when GPU reset just begins - you get an
IOCTL to map a queue (if I understood the KFD code correctly), check
dqm->sched_running == true and continue. What if right then a GPU reset
starts due to some issue like a job timeout or a user-triggered reset -
dqm->sched_running becomes false but you are already past that checkpoint
in the IOCTL, no?

Andrey

>
>> Andrey
>>
>>>
 v2: For no-HWS case, when device is stopped, don't call
 load/destroy_mqd for eviction, restore and create queue, and avoid
 debugfs dump hdqs.

 Backtrace of dqm lock deadlock:

 [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more
 than 120 seconds.
 [Thu Oct 17 16:43:37 2019]   Not tainted
 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
 [Thu Oct 17 16:43:37 2019] "echo 0 >
 /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 [Thu Oct 17 16:43:37 2019] rocminfoD0  3024   2947
 0x8000
 [Thu Oct 17 16:43:37 2019] Call Trace:
 [Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0
 [Thu Oct 17 16:43:37 2019]  schedule+0x32/0x70
 [Thu Oct 17 16:43:37 2019]  schedule_preempt_disabled+0xa/0x10
 [Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0
 [Thu Oct 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0
 [Thu Oct 17 16:43:37 2019]  ? process_termination_cpsch+0x24/0x2f0
 [amdgpu]
 [Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0
 [amdgpu]
 [Thu Oct 17 16:43:37 2019]
 kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
 [Thu Oct 17 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220
 [amdgpu]
 [Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0
 [Thu Oct 17 16:43:37 2019]  exit_mmap+0x160/0x1a0
 [Thu Oct 17 16:43:37 2019]  ? __handle_mm_fault+0xba3/0x1200
 [Thu Oct 17 16:43:37 2019]  ? exit_robust_list+0x5a/0x110
 [Thu Oct 17 16:43:37 2019]  mmput+0x4a/0x120
 [Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20
 [Thu Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200
 [Thu Oct 17 16:43:37 2019]  do_group_exit+0x3a/0xa0
 [Thu Oct 17 16:43:37 2019]  __x64_sys_exit_group+0x14/0x20
 [Thu Oct 17 16:43:37 2019]  do_syscall_64+0x4f/0x100
 [Thu Oct 17 16:43:37 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

 Suggested-by: Felix Kuehling 
 Signed-off-by: Philip Yang 
 ---
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  |  6 +--
  drivers/gpu/drm/amd/amdkfd/kfd_device.c   |  5 --
  .../drm/amd/amdkfd/kfd_device_queue_manager.c | 47 +--
  .../drm/amd/amdkfd/kfd_device_queue_manager.h |  1 +
  4 files changed, 46 insertions(+), 13 deletions(-)

 diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
 b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
 index d9e36dbf13d5..40d75c39f08e 100644
 --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
 +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
 @@ -120,13 +120,13 @@ static int kfd_open(struct inode *inode, struct file 
 *filep)
return -EPERM;
}
  
 +  if (kfd_is_locked())
 +  return -EAGAIN;
 +
process = kfd_create_process(filep);
if (IS_ERR(process))
return PTR_ERR(process);
  
 -  if (kfd_is_locked())
 -  return -EAGAIN;
 -
dev_dbg(kfd_device, 

Re: [PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume

2019-10-22 Thread Yang, Philip


On 2019-10-22 2:40 p.m., Grodzovsky, Andrey wrote:
> 
> On 10/22/19 2:38 PM, Grodzovsky, Andrey wrote:
>> On 10/22/19 2:28 PM, Yang, Philip wrote:
>>> If device reset/suspend/resume fails for some reason, the dqm lock is
>>> held forever and this causes a deadlock. Below is a kernel backtrace of
>>> an application opening kfd after suspend/resume failed.
>>>
>>> Instead of holding the dqm lock in pre_reset and releasing it in
>>> post_reset, add a dqm->device_stopped flag which is modified in
>>> dqm->ops.start and dqm->ops.stop. The flag doesn't need lock protection
>>> because all writes/reads happen inside the dqm lock.
>>>
>>> For the HWS case, map_queues_cpsch and unmap_queues_cpsch check the
>>> device_stopped flag before sending the updated runlist.
>>
>> Is there a chance of a race condition here where dqm->device_stopped
>> returns true for some operation (e.g. map_queues_cpsch) but just as it
>> proceeds GPU reset starts?
>>
>> Andrey
> 
> 
> Correction -
> 
> dqm->device_stopped returns FALSE
> 
No race condition here, dqm->device_stopped is set to FALSE in
kgd2kfd_post_reset -> dqm->ops.start(), which is the last step of
amdgpu_device_gpu_recover, so it's safe to do map_queues_cpsch.

Regards,
Philip

> Andrey
> 
>>
>>
>>> v2: For no-HWS case, when device is stopped, don't call
>>> load/destroy_mqd for eviction, restore and create queue, and avoid
>>> debugfs dump hdqs.
>>>
>>> Backtrace of dqm lock deadlock:
>>>
>>> [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more
>>> than 120 seconds.
>>> [Thu Oct 17 16:43:37 2019]   Not tainted
>>> 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
>>> [Thu Oct 17 16:43:37 2019] "echo 0 >
>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> [Thu Oct 17 16:43:37 2019] rocminfoD0  3024   2947
>>> 0x8000
>>> [Thu Oct 17 16:43:37 2019] Call Trace:
>>> [Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0
>>> [Thu Oct 17 16:43:37 2019]  schedule+0x32/0x70
>>> [Thu Oct 17 16:43:37 2019]  schedule_preempt_disabled+0xa/0x10
>>> [Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0
>>> [Thu Oct 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0
>>> [Thu Oct 17 16:43:37 2019]  ? process_termination_cpsch+0x24/0x2f0
>>> [amdgpu]
>>> [Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0
>>> [amdgpu]
>>> [Thu Oct 17 16:43:37 2019]
>>> kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
>>> [Thu Oct 17 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220
>>> [amdgpu]
>>> [Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0
>>> [Thu Oct 17 16:43:37 2019]  exit_mmap+0x160/0x1a0
>>> [Thu Oct 17 16:43:37 2019]  ? __handle_mm_fault+0xba3/0x1200
>>> [Thu Oct 17 16:43:37 2019]  ? exit_robust_list+0x5a/0x110
>>> [Thu Oct 17 16:43:37 2019]  mmput+0x4a/0x120
>>> [Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20
>>> [Thu Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200
>>> [Thu Oct 17 16:43:37 2019]  do_group_exit+0x3a/0xa0
>>> [Thu Oct 17 16:43:37 2019]  __x64_sys_exit_group+0x14/0x20
>>> [Thu Oct 17 16:43:37 2019]  do_syscall_64+0x4f/0x100
>>> [Thu Oct 17 16:43:37 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>
>>> Suggested-by: Felix Kuehling 
>>> Signed-off-by: Philip Yang 
>>> ---
>>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  |  6 +--
>>> drivers/gpu/drm/amd/amdkfd/kfd_device.c   |  5 --
>>> .../drm/amd/amdkfd/kfd_device_queue_manager.c | 47 +--
>>> .../drm/amd/amdkfd/kfd_device_queue_manager.h |  1 +
>>> 4 files changed, 46 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>>> index d9e36dbf13d5..40d75c39f08e 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>>> @@ -120,13 +120,13 @@ static int kfd_open(struct inode *inode, struct file 
>>> *filep)
>>> return -EPERM;
>>> }
>>> 
>>> +   if (kfd_is_locked())
>>> +   return -EAGAIN;
>>> +
>>> process = kfd_create_process(filep);
>>> if (IS_ERR(process))
>>> return PTR_ERR(process);
>>> 
>>> -   if (kfd_is_locked())
>>> -   return -EAGAIN;
>>> -
>>> dev_dbg(kfd_device, "process %d opened, compat mode (32 bit) - 
>>> %d\n",
>>> process->pasid, process->is_32bit_user_mode);
>>> 
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> index 8f4b24e84964..4fa8834ce7cb 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> @@ -730,9 +730,6 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>>> return 0;
>>> kgd2kfd_suspend(kfd);
>>> 
>>> -   /* hold dqm->lock to prevent further execution*/
>>> -   dqm_lock(kfd->dqm);
>>> -
>>> kfd_signal_reset_event(kfd);
>>> return 0;
>>> }
>>> @@ -750,8 

Re: [PATCH 2/4] drm/amd/powerplay: Add EEPROM I2C read/write support to Arcturus.

2019-10-22 Thread Grodzovsky, Andrey

On 10/22/19 10:58 AM, Andrey Grodzovsky wrote:
>
> On 10/20/19 9:44 PM, Chen, Guchun wrote:
>> Comment inline.
>>
>>
>> Regards,
>> Guchun
>>
>> -Original Message-
>> From: Andrey Grodzovsky 
>> Sent: Saturday, October 19, 2019 4:48 AM
>> To: amd-gfx@lists.freedesktop.org
>> Cc: Chen, Guchun ; Zhou1, Tao 
>> ; Deucher, Alexander ; 
>> noreply-conflue...@amd.com; Quan, Evan ; 
>> Grodzovsky, Andrey 
>> Subject: [PATCH 2/4] drm/amd/powerplay: Add EEPROM I2C read/write 
>> support to Arcturus.
>>
>> The communication is done through SMU table and hence the code is in 
>> powerplay.
>>
>> Signed-off-by: Andrey Grodzovsky 
>> ---
>>   drivers/gpu/drm/amd/powerplay/arcturus_ppt.c | 229 
>> +++
>>   1 file changed, 229 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c 
>> b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
>> index 90d871a..53d08de5 100644
>> --- a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
>> +++ b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
>> @@ -36,6 +36,11 @@
>>   #include "smu_v11_0_pptable.h"
>>   #include "arcturus_ppsmc.h"
>>   #include "nbio/nbio_7_4_sh_mask.h"
>> +#include 
>> +#include 
>> +#include "amdgpu_ras.h"
>> +
>> +#define to_amdgpu_device(x) (container_of(x, struct amdgpu_ras,
>> +eeprom_control.eeprom_accessor))->adev
>>     #define CTF_OFFSET_EDGE    5
>>   #define CTF_OFFSET_HOTSPOT    5
>> @@ -171,6 +176,7 @@ static struct smu_11_0_cmn2aisc_mapping 
>> arcturus_table_map[SMU_TABLE_COUNT] = {
>>   TAB_MAP(SMU_METRICS),
>>   TAB_MAP(DRIVER_SMU_CONFIG),
>>   TAB_MAP(OVERDRIVE),
>> +    TAB_MAP(I2C_COMMANDS),
>>   };
>>     static struct smu_11_0_cmn2aisc_mapping 
>> arcturus_pwr_src_map[SMU_POWER_SOURCE_COUNT] = { @@ -293,6 +299,9 @@ 
>> static int arcturus_tables_init(struct smu_context *smu, struct 
>> smu_table *table
>>   SMU_TABLE_INIT(tables, SMU_TABLE_SMU_METRICS, 
>> sizeof(SmuMetrics_t),
>>  PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
>>   +    SMU_TABLE_INIT(tables, SMU_TABLE_I2C_COMMANDS, 
>> sizeof(SwI2cRequest_t),
>> +   PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
>> +
>>   smu_table->metrics_table = kzalloc(sizeof(SmuMetrics_t), 
>> GFP_KERNEL);
>>   if (!smu_table->metrics_table)
>>   return -ENOMEM;
>> @@ -1927,6 +1936,224 @@ static int arcturus_dpm_set_uvd_enable(struct 
>> smu_context *smu, bool enable)
>>   return ret;
>>   }
>>   +
>> +static void arcturus_fill_eeprom_i2c_req(SwI2cRequest_t  *req, bool 
>> write,
>> +  uint8_t address, uint32_t numbytes,
>> +  uint8_t *data)
>> +{
>> +    int i;
>> +
>> +    BUG_ON(numbytes > MAX_SW_I2C_COMMANDS);
>> +
>> +    req->I2CcontrollerPort = 0;
>> +    req->I2CSpeed = 2;
>> +    req->SlaveAddress = address;
>> +    req->NumCmds = numbytes;
>> +
>> +    for (i = 0; i < numbytes; i++) {
>> +    SwI2cCmd_t *cmd = &req->SwI2cCmds[i];
>> +
>> +    /* First 2 bytes are always write for lower 2b EEPROM 
>> address */
>> +    if (i < 2)
>> +    cmd->Cmd = 1;
>> +    else
>> +    cmd->Cmd = write;
>> +
>> +
>> +    /* Add RESTART for read  after address filled */
>> +    cmd->CmdConfig |= (i == 2 && !write) ? 
>> CMDCONFIG_RESTART_MASK : 0;
>> +
>> +    /* Add STOP in the end */
>> +    cmd->CmdConfig |= (i == (numbytes - 1)) ? 
>> CMDCONFIG_STOP_MASK : 0;
>> +
>> +    /* Fill with data regardless if read or write to simplify 
>> code */
>> +    cmd->RegisterAddr = data[i];
>> +    }
>> +}
>> +
>> +static int arcturus_i2c_eeprom_read_data(struct i2c_adapter *control,
>> +   uint8_t address,
>> +   uint8_t *data,
>> +   uint32_t numbytes)
>> +{
>> +    uint32_t  i, ret = 0;
>> +    SwI2cRequest_t req;
>> +    struct amdgpu_device *adev = to_amdgpu_device(control);
>> +    struct smu_table_context *smu_table = &adev->smu.smu_table;
>> +    struct smu_table *table =
>> &smu_table->tables[SMU_TABLE_I2C_COMMANDS];
>> +
>> +    memset(&req, 0, sizeof(req));
>> +    arcturus_fill_eeprom_i2c_req(&req, false, address, numbytes, data);
>> +
>> +    mutex_lock(&adev->smu.mutex);
>> +    /* Now read data starting with that address */
>> +    ret = smu_update_table(&adev->smu, SMU_TABLE_I2C_COMMANDS, 0, &req,
>> +    true);
>> +    mutex_unlock(&adev->smu.mutex);
>> +
>> +    if (!ret) {
>> +    SwI2cRequest_t *res = (SwI2cRequest_t *)table->cpu_addr;
>> +
>> +    /* Assume SMU  fills res.SwI2cCmds[i].Data with read bytes */
>> +    for (i = 0; i < numbytes; i++)
>> +    data[i] = res->SwI2cCmds[i].Data;
>> +
>> +    pr_debug("arcturus_i2c_eeprom_read_data, address = %x, bytes 
>> = %d, data :",
>> +  (uint16_t)address, numbytes);
>> +
>> +    print_hex_dump(KERN_DEBUG, "data: ", DUMP_PREFIX_NONE,
>> +   8, 1, data, numbytes, false);
>> +    } else
>> +    pr_err("arcturus_i2c_eeprom_read_data - error occurred :%x", 

Re: [PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume

2019-10-22 Thread Kuehling, Felix
On 2019-10-22 14:28, Yang, Philip wrote:
> If device reset/suspend/resume fails for some reason, the dqm lock is
> held forever and this causes a deadlock. Below is a kernel backtrace from
> an application opening kfd after suspend/resume failed.
>
> Instead of holding the dqm lock in pre_reset and releasing it in
> post_reset, add a dqm->device_stopped flag which is modified in
> dqm->ops.start and dqm->ops.stop. The flag doesn't need lock protection
> because all writes/reads happen inside the dqm lock.
>
> For the HWS case, map_queues_cpsch and unmap_queues_cpsch check the
> device_stopped flag before sending the updated runlist.
>
> v2: For the no-HWS case, when the device is stopped, don't call
> load/destroy_mqd for eviction, restore and create queue, and avoid
> the debugfs dump of hqds.
>
> Backtrace of dqm lock deadlock:
>
> [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more
> than 120 seconds.
> [Thu Oct 17 16:43:37 2019]   Not tainted
> 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
> [Thu Oct 17 16:43:37 2019] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Thu Oct 17 16:43:37 2019] rocminfoD0  3024   2947
> 0x8000
> [Thu Oct 17 16:43:37 2019] Call Trace:
> [Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0
> [Thu Oct 17 16:43:37 2019]  schedule+0x32/0x70
> [Thu Oct 17 16:43:37 2019]  schedule_preempt_disabled+0xa/0x10
> [Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0
> [Thu Oct 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0
> [Thu Oct 17 16:43:37 2019]  ? process_termination_cpsch+0x24/0x2f0
> [amdgpu]
> [Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0
> [amdgpu]
> [Thu Oct 17 16:43:37 2019]
> kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
> [Thu Oct 17 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220
> [amdgpu]
> [Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0
> [Thu Oct 17 16:43:37 2019]  exit_mmap+0x160/0x1a0
> [Thu Oct 17 16:43:37 2019]  ? __handle_mm_fault+0xba3/0x1200
> [Thu Oct 17 16:43:37 2019]  ? exit_robust_list+0x5a/0x110
> [Thu Oct 17 16:43:37 2019]  mmput+0x4a/0x120
> [Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20
> [Thu Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200
> [Thu Oct 17 16:43:37 2019]  do_group_exit+0x3a/0xa0
> [Thu Oct 17 16:43:37 2019]  __x64_sys_exit_group+0x14/0x20
> [Thu Oct 17 16:43:37 2019]  do_syscall_64+0x4f/0x100
> [Thu Oct 17 16:43:37 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Suggested-by: Felix Kuehling 
> Signed-off-by: Philip Yang 

Three more comments inline. With those comments addressed, this patch is

Reviewed-by: Felix Kuehling 


> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  |  6 +--
>   drivers/gpu/drm/amd/amdkfd/kfd_device.c   |  5 --
>   .../drm/amd/amdkfd/kfd_device_queue_manager.c | 47 +--
>   .../drm/amd/amdkfd/kfd_device_queue_manager.h |  1 +
>   4 files changed, 46 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> index d9e36dbf13d5..40d75c39f08e 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> @@ -120,13 +120,13 @@ static int kfd_open(struct inode *inode, struct file 
> *filep)
>   return -EPERM;
>   }
>   
> + if (kfd_is_locked())
> + return -EAGAIN;
> +
>   process = kfd_create_process(filep);
>   if (IS_ERR(process))
>   return PTR_ERR(process);
>   
> - if (kfd_is_locked())
> - return -EAGAIN;
> -

Is this part of the change still needed? I remember that this sequence 
was a bit tricky with some potential race condition when Shaoyun was 
working on it. This may have unintended side effects.


>   dev_dbg(kfd_device, "process %d opened, compat mode (32 bit) - %d\n",
>   process->pasid, process->is_32bit_user_mode);
>   
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 8f4b24e84964..4fa8834ce7cb 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -730,9 +730,6 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>   return 0;
>   kgd2kfd_suspend(kfd);
>   
> - /* hold dqm->lock to prevent further execution*/
> - dqm_lock(kfd->dqm);
> -
>   kfd_signal_reset_event(kfd);
>   return 0;
>   }
> @@ -750,8 +747,6 @@ int kgd2kfd_post_reset(struct kfd_dev *kfd)
>   if (!kfd->init_complete)
>   return 0;
>   
> - dqm_unlock(kfd->dqm);
> -
>   ret = kfd_resume(kfd);
>   if (ret)
>   return ret;
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> index 81fb545cf42c..82e1c6280d13 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -340,6 +340,10 @@ static int 

Re: [PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume

2019-10-22 Thread Grodzovsky, Andrey

On 10/22/19 2:38 PM, Grodzovsky, Andrey wrote:
> On 10/22/19 2:28 PM, Yang, Philip wrote:
>> If device reset/suspend/resume fails for some reason, the dqm lock is
>> held forever and this causes a deadlock. Below is a kernel backtrace from
>> an application opening kfd after suspend/resume failed.
>>
>> Instead of holding the dqm lock in pre_reset and releasing it in
>> post_reset, add a dqm->device_stopped flag which is modified in
>> dqm->ops.start and dqm->ops.stop. The flag doesn't need lock protection
>> because all writes/reads happen inside the dqm lock.
>>
>> For the HWS case, map_queues_cpsch and unmap_queues_cpsch check the
>> device_stopped flag before sending the updated runlist.
>
> Is there a chance of race condition here where dqm->device_stopped
> returns true for some operation (e.g.map_queues_cpsch) but just as it
> proceeds GPU reset starts  ?
>
> Andrey


Correction -

dqm->device_stopped returns FALSE

Andrey

>
>
>> v2: For the no-HWS case, when the device is stopped, don't call
>> load/destroy_mqd for eviction, restore and create queue, and avoid
>> the debugfs dump of hqds.
>>
>> Backtrace of dqm lock deadlock:
>>
>> [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more
>> than 120 seconds.
>> [Thu Oct 17 16:43:37 2019]   Not tainted
>> 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
>> [Thu Oct 17 16:43:37 2019] "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [Thu Oct 17 16:43:37 2019] rocminfoD0  3024   2947
>> 0x8000
>> [Thu Oct 17 16:43:37 2019] Call Trace:
>> [Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0
>> [Thu Oct 17 16:43:37 2019]  schedule+0x32/0x70
>> [Thu Oct 17 16:43:37 2019]  schedule_preempt_disabled+0xa/0x10
>> [Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0
>> [Thu Oct 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0
>> [Thu Oct 17 16:43:37 2019]  ? process_termination_cpsch+0x24/0x2f0
>> [amdgpu]
>> [Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0
>> [amdgpu]
>> [Thu Oct 17 16:43:37 2019]
>> kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
>> [Thu Oct 17 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220
>> [amdgpu]
>> [Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0
>> [Thu Oct 17 16:43:37 2019]  exit_mmap+0x160/0x1a0
>> [Thu Oct 17 16:43:37 2019]  ? __handle_mm_fault+0xba3/0x1200
>> [Thu Oct 17 16:43:37 2019]  ? exit_robust_list+0x5a/0x110
>> [Thu Oct 17 16:43:37 2019]  mmput+0x4a/0x120
>> [Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20
>> [Thu Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200
>> [Thu Oct 17 16:43:37 2019]  do_group_exit+0x3a/0xa0
>> [Thu Oct 17 16:43:37 2019]  __x64_sys_exit_group+0x14/0x20
>> [Thu Oct 17 16:43:37 2019]  do_syscall_64+0x4f/0x100
>> [Thu Oct 17 16:43:37 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> Suggested-by: Felix Kuehling 
>> Signed-off-by: Philip Yang 
>> ---
>>drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  |  6 +--
>>drivers/gpu/drm/amd/amdkfd/kfd_device.c   |  5 --
>>.../drm/amd/amdkfd/kfd_device_queue_manager.c | 47 +--
>>.../drm/amd/amdkfd/kfd_device_queue_manager.h |  1 +
>>4 files changed, 46 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
>> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> index d9e36dbf13d5..40d75c39f08e 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> @@ -120,13 +120,13 @@ static int kfd_open(struct inode *inode, struct file 
>> *filep)
>>  return -EPERM;
>>  }
>>
>> +if (kfd_is_locked())
>> +return -EAGAIN;
>> +
>>  process = kfd_create_process(filep);
>>  if (IS_ERR(process))
>>  return PTR_ERR(process);
>>
>> -if (kfd_is_locked())
>> -return -EAGAIN;
>> -
>>  dev_dbg(kfd_device, "process %d opened, compat mode (32 bit) - %d\n",
>>  process->pasid, process->is_32bit_user_mode);
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index 8f4b24e84964..4fa8834ce7cb 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -730,9 +730,6 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>>  return 0;
>>  kgd2kfd_suspend(kfd);
>>
>> -/* hold dqm->lock to prevent further execution*/
>> -dqm_lock(kfd->dqm);
>> -
>>  kfd_signal_reset_event(kfd);
>>  return 0;
>>}
>> @@ -750,8 +747,6 @@ int kgd2kfd_post_reset(struct kfd_dev *kfd)
>>  if (!kfd->init_complete)
>>  return 0;
>>
>> -dqm_unlock(kfd->dqm);
>> -
>>  ret = kfd_resume(kfd);
>>  if (ret)
>>  return ret;
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
>> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>> index 81fb545cf42c..82e1c6280d13 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>> +++ 

Re: [PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume

2019-10-22 Thread Grodzovsky, Andrey
On 10/22/19 2:28 PM, Yang, Philip wrote:
> If device reset/suspend/resume fails for some reason, the dqm lock is
> held forever and this causes a deadlock. Below is a kernel backtrace from
> an application opening kfd after suspend/resume failed.
>
> Instead of holding the dqm lock in pre_reset and releasing it in
> post_reset, add a dqm->device_stopped flag which is modified in
> dqm->ops.start and dqm->ops.stop. The flag doesn't need lock protection
> because all writes/reads happen inside the dqm lock.
>
> For the HWS case, map_queues_cpsch and unmap_queues_cpsch check the
> device_stopped flag before sending the updated runlist.


Is there a chance of race condition here where dqm->device_stopped 
returns true for some operation (e.g.map_queues_cpsch) but just as it 
proceeds GPU reset starts  ?

Andrey


>
> v2: For the no-HWS case, when the device is stopped, don't call
> load/destroy_mqd for eviction, restore and create queue, and avoid
> the debugfs dump of hqds.
>
> Backtrace of dqm lock deadlock:
>
> [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more
> than 120 seconds.
> [Thu Oct 17 16:43:37 2019]   Not tainted
> 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
> [Thu Oct 17 16:43:37 2019] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Thu Oct 17 16:43:37 2019] rocminfoD0  3024   2947
> 0x8000
> [Thu Oct 17 16:43:37 2019] Call Trace:
> [Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0
> [Thu Oct 17 16:43:37 2019]  schedule+0x32/0x70
> [Thu Oct 17 16:43:37 2019]  schedule_preempt_disabled+0xa/0x10
> [Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0
> [Thu Oct 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0
> [Thu Oct 17 16:43:37 2019]  ? process_termination_cpsch+0x24/0x2f0
> [amdgpu]
> [Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0
> [amdgpu]
> [Thu Oct 17 16:43:37 2019]
> kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
> [Thu Oct 17 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220
> [amdgpu]
> [Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0
> [Thu Oct 17 16:43:37 2019]  exit_mmap+0x160/0x1a0
> [Thu Oct 17 16:43:37 2019]  ? __handle_mm_fault+0xba3/0x1200
> [Thu Oct 17 16:43:37 2019]  ? exit_robust_list+0x5a/0x110
> [Thu Oct 17 16:43:37 2019]  mmput+0x4a/0x120
> [Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20
> [Thu Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200
> [Thu Oct 17 16:43:37 2019]  do_group_exit+0x3a/0xa0
> [Thu Oct 17 16:43:37 2019]  __x64_sys_exit_group+0x14/0x20
> [Thu Oct 17 16:43:37 2019]  do_syscall_64+0x4f/0x100
> [Thu Oct 17 16:43:37 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Suggested-by: Felix Kuehling 
> Signed-off-by: Philip Yang 
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  |  6 +--
>   drivers/gpu/drm/amd/amdkfd/kfd_device.c   |  5 --
>   .../drm/amd/amdkfd/kfd_device_queue_manager.c | 47 +--
>   .../drm/amd/amdkfd/kfd_device_queue_manager.h |  1 +
>   4 files changed, 46 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> index d9e36dbf13d5..40d75c39f08e 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> @@ -120,13 +120,13 @@ static int kfd_open(struct inode *inode, struct file 
> *filep)
>   return -EPERM;
>   }
>   
> + if (kfd_is_locked())
> + return -EAGAIN;
> +
>   process = kfd_create_process(filep);
>   if (IS_ERR(process))
>   return PTR_ERR(process);
>   
> - if (kfd_is_locked())
> - return -EAGAIN;
> -
>   dev_dbg(kfd_device, "process %d opened, compat mode (32 bit) - %d\n",
>   process->pasid, process->is_32bit_user_mode);
>   
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 8f4b24e84964..4fa8834ce7cb 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -730,9 +730,6 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>   return 0;
>   kgd2kfd_suspend(kfd);
>   
> - /* hold dqm->lock to prevent further execution*/
> - dqm_lock(kfd->dqm);
> -
>   kfd_signal_reset_event(kfd);
>   return 0;
>   }
> @@ -750,8 +747,6 @@ int kgd2kfd_post_reset(struct kfd_dev *kfd)
>   if (!kfd->init_complete)
>   return 0;
>   
> - dqm_unlock(kfd->dqm);
> -
>   ret = kfd_resume(kfd);
>   if (ret)
>   return ret;
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> index 81fb545cf42c..82e1c6280d13 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -340,6 +340,10 @@ static int create_queue_nocpsch(struct 
> device_queue_manager *dqm,
>   mqd_mgr->init_mqd(mqd_mgr, &q->mqd, q->mqd_mem_obj,
>

[PATCH v2] drm/amdkfd: don't use dqm lock during device reset/suspend/resume

2019-10-22 Thread Yang, Philip
If device reset/suspend/resume fails for some reason, the dqm lock is
held forever and this causes a deadlock. Below is a kernel backtrace from
an application opening kfd after suspend/resume failed.

Instead of holding the dqm lock in pre_reset and releasing it in
post_reset, add a dqm->device_stopped flag which is modified in
dqm->ops.start and dqm->ops.stop. The flag doesn't need lock protection
because all writes/reads happen inside the dqm lock.

For the HWS case, map_queues_cpsch and unmap_queues_cpsch check the
device_stopped flag before sending the updated runlist.

v2: For the no-HWS case, when the device is stopped, don't call
load/destroy_mqd for eviction, restore and create queue, and avoid
the debugfs dump of hqds.

Backtrace of dqm lock deadlock:

[Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more
than 120 seconds.
[Thu Oct 17 16:43:37 2019]   Not tainted
5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
[Thu Oct 17 16:43:37 2019] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Oct 17 16:43:37 2019] rocminfoD0  3024   2947
0x8000
[Thu Oct 17 16:43:37 2019] Call Trace:
[Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0
[Thu Oct 17 16:43:37 2019]  schedule+0x32/0x70
[Thu Oct 17 16:43:37 2019]  schedule_preempt_disabled+0xa/0x10
[Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0
[Thu Oct 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0
[Thu Oct 17 16:43:37 2019]  ? process_termination_cpsch+0x24/0x2f0
[amdgpu]
[Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0
[amdgpu]
[Thu Oct 17 16:43:37 2019]
kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
[Thu Oct 17 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220
[amdgpu]
[Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0
[Thu Oct 17 16:43:37 2019]  exit_mmap+0x160/0x1a0
[Thu Oct 17 16:43:37 2019]  ? __handle_mm_fault+0xba3/0x1200
[Thu Oct 17 16:43:37 2019]  ? exit_robust_list+0x5a/0x110
[Thu Oct 17 16:43:37 2019]  mmput+0x4a/0x120
[Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20
[Thu Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200
[Thu Oct 17 16:43:37 2019]  do_group_exit+0x3a/0xa0
[Thu Oct 17 16:43:37 2019]  __x64_sys_exit_group+0x14/0x20
[Thu Oct 17 16:43:37 2019]  do_syscall_64+0x4f/0x100
[Thu Oct 17 16:43:37 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Suggested-by: Felix Kuehling 
Signed-off-by: Philip Yang 
---
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  |  6 +--
 drivers/gpu/drm/amd/amdkfd/kfd_device.c   |  5 --
 .../drm/amd/amdkfd/kfd_device_queue_manager.c | 47 +--
 .../drm/amd/amdkfd/kfd_device_queue_manager.h |  1 +
 4 files changed, 46 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index d9e36dbf13d5..40d75c39f08e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -120,13 +120,13 @@ static int kfd_open(struct inode *inode, struct file 
*filep)
return -EPERM;
}
 
+   if (kfd_is_locked())
+   return -EAGAIN;
+
process = kfd_create_process(filep);
if (IS_ERR(process))
return PTR_ERR(process);
 
-   if (kfd_is_locked())
-   return -EAGAIN;
-
dev_dbg(kfd_device, "process %d opened, compat mode (32 bit) - %d\n",
process->pasid, process->is_32bit_user_mode);
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 8f4b24e84964..4fa8834ce7cb 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -730,9 +730,6 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
return 0;
kgd2kfd_suspend(kfd);
 
-   /* hold dqm->lock to prevent further execution*/
-   dqm_lock(kfd->dqm);
-
kfd_signal_reset_event(kfd);
return 0;
 }
@@ -750,8 +747,6 @@ int kgd2kfd_post_reset(struct kfd_dev *kfd)
if (!kfd->init_complete)
return 0;
 
-   dqm_unlock(kfd->dqm);
-
ret = kfd_resume(kfd);
if (ret)
return ret;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 81fb545cf42c..82e1c6280d13 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -340,6 +340,10 @@ static int create_queue_nocpsch(struct 
device_queue_manager *dqm,
	mqd_mgr->init_mqd(mqd_mgr, &q->mqd, q->mqd_mem_obj,
	&q->gart_mqd_addr, &q->properties);
if (q->properties.is_active) {
+   if (!dqm->sched_running) {
+   WARN_ONCE(1, "Load non-HWS mqd while stopped\n");
+   goto add_queue_to_list;
+   }
 
if (WARN(q->process->mm != current->mm,
"should only run in user thread"))
@@ 

Re: Stack out of bounds in KFD on Arcturus

2019-10-22 Thread Grodzovsky, Andrey
No problem on Vega 20

Andrey

On 10/22/19 1:46 PM, Zeng, Oak wrote:
> Sorry, I searched my kconfig and didn't find the stack size option 
> anymore... Maybe the kernel stack size is not configurable anymore...
>
> Can you try your kernel on vega10 or 20 or navi10? We want to know whether 
> this is an mi100-specific issue.
>
> Oak
>
> -Original Message-
> From: Grodzovsky, Andrey 
> Sent: Tuesday, October 22, 2019 1:28 PM
> To: Zeng, Oak ; Kuehling, Felix 
> Cc: amd-gfx@lists.freedesktop.org
> Subject: Re: Stack out of bounds in KFD on Arcturus
>
> I don't know - what Kconfig flag should I look at ?
>
> Andrey
>
> On 10/22/19 1:17 PM, Zeng, Oak wrote:
>> Sorry, I meant: is the kernel stack size 16KB in your kconfig?
>>
>> Oak
>>
>> -Original Message-
>> From: Grodzovsky, Andrey 
>> Sent: Tuesday, October 22, 2019 12:49 PM
>> To: Zeng, Oak ; Kuehling, Felix
>> 
>> Cc: amd-gfx@lists.freedesktop.org
>> Subject: Re: Stack out of bounds in KFD on Arcturus
>>
>> On 10/18/19 5:31 PM, Zeng, Oak wrote:
>>
>>> Hi Andrey,
>>>
>>> What is your system configuration? I didn’t see this issue before. Also see 
>>> attached QA's configuration - you can compare to see any difference.
>> Attached is my lshw
>>
>>> Also I believe for x86-64, the default kernel stack size is 16kb? Is this 
>>> your Kconfig?
>> What do you mean by "is this my Kconfig"? Is there a particular Kconfig flag 
>> you know of that I can look for?
>>
>> Andrey
>>
>>
>>> Regards,
>>> Oak
>>>
>>> -Original Message-
>>> From: amd-gfx  On Behalf Of
>>> Kuehling, Felix
>>> Sent: Friday, October 18, 2019 4:55 PM
>>> To: Grodzovsky, Andrey 
>>> Cc: amd-gfx@lists.freedesktop.org
>>> Subject: Re: Stack out of bounds in KFD on Arcturus
>>>
>>> On 2019-10-17 6:38 p.m., Grodzovsky, Andrey wrote:
 Not that I aware of, is there a special Kconfig flag to determine
 stack size ?
>>> I remember there used to be a Kconfig option to force a 4KB kernel stack. I 
>>> don't see it in the current kernel any more.
>>>
>>> I don't have time to work on this myself. I'll create a ticket and see if I 
>>> can find someone to investigate.
>>>
>>> Thanks,
>>>   Felix
>>>
>>>
 Andrey

 On 10/17/19 5:29 PM, Kuehling, Felix wrote:
> I don't see why this problem would be specific to Arcturus. I don't
> see any excessive allocations on the stack either. Also the code
> involved here hasn't changed recently.
>
> Are you using some weird kernel config with a smaller stack? Is it
> specific to a compiler version or some optimization flags? I've
> sometimes seen function inlining cause excessive stack usage.
>
> Regards,
>     Felix
>
> On 2019-10-17 4:09 p.m., Grodzovsky, Andrey wrote:
>> Hey Felix - I see this on boot when working with Arcturus.
>>
>> Andrey
>>
>>
>> [  103.602092] kfd kfd: Allocated 3969056 bytes on gart [
>> 103.610769]
>> ==
>> [  103.611469] BUG: KASAN: stack-out-of-bounds in
>> kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] [  103.611646]
>> Read of size 4 at addr 8883cb19ee38 by task modprobe/1122
>>
>> [  103.611836] CPU: 3 PID: 1122 Comm: modprobe Tainted: G O
>> 5.3.0-rc3+ #45 [  103.611847] Hardware name: System manufacturer
>> System Product Name/Z170-PRO, BIOS 1902 06/27/2016 [  103.611856]
>> Call Trace:
>> [  103.611879]  dump_stack+0x71/0xab [  103.611907]
>> print_address_description+0x1da/0x3c0
>> [  103.612453]  ? kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu]
>> [ 103.612479]  __kasan_report+0x13f/0x1a0 [  103.613022]  ?
>> kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] [  103.613580]  ?
>> kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] [  103.613604]
>> kasan_report+0xe/0x20 [  103.614149]
>> kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] [  103.614762]  ?
>> kfd_fill_gpu_memory_affinity+0x110/0x110 [amdgpu] [  103.614796]  ?
>> __alloc_pages_nodemask+0x2c9/0x560
>> [  103.614824]  ? __alloc_pages_slowpath+0x1390/0x1390
>> [  103.614898]  ? kmalloc_order+0x63/0x70 [  103.615469]
>> kfd_create_crat_image_virtual+0x70c/0x770 [amdgpu] [  103.616054]  ?
>> kfd_create_crat_image_acpi+0x1c0/0x1c0 [amdgpu] [  103.616095]  ?
>> up_write+0x4b/0x70 [  103.616649]
>> kfd_topology_add_device+0x98d/0xb10 [amdgpu] [  103.617207]  ?
>> kfd_topology_shutdown+0x60/0x60 [amdgpu] [  103.617743]  ?
>> start_cpsch+0x2ff/0x3a0 [amdgpu] [  103.61]  ?
>> mutex_lock_io_nested+0xac0/0xac0 [  103.617807]  ?
>> __mutex_unlock_slowpath+0xda/0x420
>> [  103.617848]  ? __mutex_unlock_slowpath+0xda/0x420
>> [  103.617877]  ? wait_for_completion+0x200/0x200 [  103.618461]  ?
>> start_cpsch+0x38b/0x3a0 [amdgpu] [  103.619011]  ?
>> create_queue_cpsch+0x670/0x670 [amdgpu] [  103.619573]  ?
>> kfd_iommu_device_init+0x92/0x1e0 [amdgpu] [  103.620112]  ?

RE: Stack out of bounds in KFD on Arcturus

2019-10-22 Thread Zeng, Oak
Sorry, I searched my kconfig and didn't find the stack size option 
anymore... Maybe the kernel stack size is not configurable anymore...
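
For what it's worth, on x86-64 the stack size is a compile-time constant
rather than a Kconfig option. A sketch of the relevant definitions,
paraphrased from arch/x86/include/asm/page_64_types.h in recent kernels;
note that KASAN doubles the stack, which matters here since the report
was caught by KASAN:

#ifdef CONFIG_KASAN
#define KASAN_STACK_ORDER 1
#else
#define KASAN_STACK_ORDER 0
#endif

#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)
#define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER) /* 16KB, or 32KB with KASAN */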

Can you try your kernel on vega10 or 20 or navi10? We want to know whether this 
is an mi100-specific issue.

Oak

-Original Message-
From: Grodzovsky, Andrey  
Sent: Tuesday, October 22, 2019 1:28 PM
To: Zeng, Oak ; Kuehling, Felix 
Cc: amd-gfx@lists.freedesktop.org
Subject: Re: Stack out of bounds in KFD on Arcturus

I don't know - what Kconfig flag should I look at ?

Andrey

On 10/22/19 1:17 PM, Zeng, Oak wrote:
> Sorry, I meant: is the kernel stack size 16KB in your kconfig?
>
> Oak
>
> -Original Message-
> From: Grodzovsky, Andrey 
> Sent: Tuesday, October 22, 2019 12:49 PM
> To: Zeng, Oak ; Kuehling, Felix 
> 
> Cc: amd-gfx@lists.freedesktop.org
> Subject: Re: Stack out of bounds in KFD on Arcturus
>
> On 10/18/19 5:31 PM, Zeng, Oak wrote:
>
>> Hi Andrey,
>>
>> What is your system configuration? I didn’t see this issue before. Also see 
>> attached QA's configuration - you can compare to see any difference.
>
> Attached is my lshw
>
>> Also I believe for x86-64, the default kernel stack size is 16kb? Is this 
>> your Kconfig?
>
> What do you mean by "is this my Kconfig"? Is there a particular Kconfig flag 
> you know of that I can look for?
>
> Andrey
>
>
>> Regards,
>> Oak
>>
>> -Original Message-
>> From: amd-gfx  On Behalf Of 
>> Kuehling, Felix
>> Sent: Friday, October 18, 2019 4:55 PM
>> To: Grodzovsky, Andrey 
>> Cc: amd-gfx@lists.freedesktop.org
>> Subject: Re: Stack out of bounds in KFD on Arcturus
>>
>> On 2019-10-17 6:38 p.m., Grodzovsky, Andrey wrote:
>>> Not that I aware of, is there a special Kconfig flag to determine 
>>> stack size ?
>> I remember there used to be a Kconfig option to force a 4KB kernel stack. I 
>> don't see it in the current kernel any more.
>>
>> I don't have time to work on this myself. I'll create a ticket and see if I 
>> can find someone to investigate.
>>
>> Thanks,
>>  Felix
>>
>>
>>> Andrey
>>>
>>> On 10/17/19 5:29 PM, Kuehling, Felix wrote:
 I don't see why this problem would be specific to Arcturus. I don't 
 see any excessive allocations on the stack either. Also the code 
 involved here hasn't changed recently.

 Are you using some weird kernel config with a smaller stack? Is it 
 specific to a compiler version or some optimization flags? I've 
 sometimes seen function inlining cause excessive stack usage.

 Regards,
    Felix

 On 2019-10-17 4:09 p.m., Grodzovsky, Andrey wrote:
> Hey Felix - I see this on boot when working with Arcturus.
>
> Andrey
>
>
> [  103.602092] kfd kfd: Allocated 3969056 bytes on gart [ 
> 103.610769] 
> ==
> [  103.611469] BUG: KASAN: stack-out-of-bounds in
> kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] [  103.611646] 
> Read of size 4 at addr 8883cb19ee38 by task modprobe/1122
>
> [  103.611836] CPU: 3 PID: 1122 Comm: modprobe Tainted: G O 
> 5.3.0-rc3+ #45 [  103.611847] Hardware name: System manufacturer 
> System Product Name/Z170-PRO, BIOS 1902 06/27/2016 [  103.611856] 
> Call Trace:
> [  103.611879]  dump_stack+0x71/0xab [  103.611907]
> print_address_description+0x1da/0x3c0
> [  103.612453]  ? kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] 
> [ 103.612479]  __kasan_report+0x13f/0x1a0 [  103.613022]  ?
> kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] [  103.613580]  ?
> kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] [  103.613604]
> kasan_report+0xe/0x20 [  103.614149]
> kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] [  103.614762]  ?
> kfd_fill_gpu_memory_affinity+0x110/0x110 [amdgpu] [  103.614796]  ?
> __alloc_pages_nodemask+0x2c9/0x560
> [  103.614824]  ? __alloc_pages_slowpath+0x1390/0x1390
> [  103.614898]  ? kmalloc_order+0x63/0x70 [  103.615469]
> kfd_create_crat_image_virtual+0x70c/0x770 [amdgpu] [  103.616054]  ?
> kfd_create_crat_image_acpi+0x1c0/0x1c0 [amdgpu] [  103.616095]  ?
> up_write+0x4b/0x70 [  103.616649]
> kfd_topology_add_device+0x98d/0xb10 [amdgpu] [  103.617207]  ?
> kfd_topology_shutdown+0x60/0x60 [amdgpu] [  103.617743]  ?
> start_cpsch+0x2ff/0x3a0 [amdgpu] [  103.61]  ?
> mutex_lock_io_nested+0xac0/0xac0 [  103.617807]  ?
> __mutex_unlock_slowpath+0xda/0x420
> [  103.617848]  ? __mutex_unlock_slowpath+0xda/0x420
> [  103.617877]  ? wait_for_completion+0x200/0x200 [  103.618461]  ?
> start_cpsch+0x38b/0x3a0 [amdgpu] [  103.619011]  ?
> create_queue_cpsch+0x670/0x670 [amdgpu] [  103.619573]  ?
> kfd_iommu_device_init+0x92/0x1e0 [amdgpu] [  103.620112]  ?
> kfd_iommu_resume+0x2c/0x2c0 [amdgpu] [  103.620655]  ?
> kfd_iommu_check_device+0xf0/0xf0 [amdgpu] [  103.621228]
> kgd2kfd_device_init+0x474/0x870 [amdgpu] [  103.621781]
> 

Re: Stack out of bounds in KFD on Arcturus

2019-10-22 Thread Grodzovsky, Andrey
I don't know - what Kconfig flag should I look at ?

Andrey

On 10/22/19 1:17 PM, Zeng, Oak wrote:
> Sorry, I meant: is the kernel stack size 16KB in your kconfig?
>
> Oak
>
> -Original Message-
> From: Grodzovsky, Andrey 
> Sent: Tuesday, October 22, 2019 12:49 PM
> To: Zeng, Oak ; Kuehling, Felix 
> Cc: amd-gfx@lists.freedesktop.org
> Subject: Re: Stack out of bounds in KFD on Arcturus
>
> On 10/18/19 5:31 PM, Zeng, Oak wrote:
>
>> Hi Andrey,
>>
>> What is your system configuration? I didn’t see this issue before. Also see 
>> attached QA's configuration - you can compare to see any difference.
>
> Attached is my lshw
>
>> Also I believe for x86-64, the default kernel stack size is 16kb? Is this 
>> your Kconfig?
>
> What do you mean by "is this my Kconfig"? Is there a particular Kconfig flag 
> you know of that I can look for?
>
> Andrey
>
>
>> Regards,
>> Oak
>>
>> -Original Message-
>> From: amd-gfx  On Behalf Of
>> Kuehling, Felix
>> Sent: Friday, October 18, 2019 4:55 PM
>> To: Grodzovsky, Andrey 
>> Cc: amd-gfx@lists.freedesktop.org
>> Subject: Re: Stack out of bounds in KFD on Arcturus
>>
>> On 2019-10-17 6:38 p.m., Grodzovsky, Andrey wrote:
>>> Not that I aware of, is there a special Kconfig flag to determine
>>> stack size ?
>> I remember there used to be a Kconfig option to force a 4KB kernel stack. I 
>> don't see it in the current kernel any more.
>>
>> I don't have time to work on this myself. I'll create a ticket and see if I 
>> can find someone to investigate.
>>
>> Thanks,
>>  Felix
>>
>>
>>> Andrey
>>>
>>> On 10/17/19 5:29 PM, Kuehling, Felix wrote:
 I don't see why this problem would be specific to Arcturus. I don't
 see any excessive allocations on the stack either. Also the code
 involved here hasn't changed recently.

 Are you using some weird kernel config with a smaller stack? Is it
 specific to a compiler version or some optimization flags? I've
 sometimes seen function inlining cause excessive stack usage.

 Regards,
    Felix

 On 2019-10-17 4:09 p.m., Grodzovsky, Andrey wrote:
> Hey Felix - I see this on boot when working with Arcturus.
>
> Andrey
>
>
> [  103.602092] kfd kfd: Allocated 3969056 bytes on gart [
> 103.610769]
> ==
> [  103.611469] BUG: KASAN: stack-out-of-bounds in
> kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] [  103.611646] Read
> of size 4 at addr 8883cb19ee38 by task modprobe/1122
>
> [  103.611836] CPU: 3 PID: 1122 Comm: modprobe Tainted: G O
> 5.3.0-rc3+ #45 [  103.611847] Hardware name: System manufacturer
> System Product Name/Z170-PRO, BIOS 1902 06/27/2016 [  103.611856]
> Call Trace:
> [  103.611879]  dump_stack+0x71/0xab [  103.611907]
> print_address_description+0x1da/0x3c0
> [  103.612453]  ? kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] [
> 103.612479]  __kasan_report+0x13f/0x1a0 [  103.613022]  ?
> kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] [  103.613580]  ?
> kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] [  103.613604]
> kasan_report+0xe/0x20 [  103.614149]
> kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] [  103.614762]  ?
> kfd_fill_gpu_memory_affinity+0x110/0x110 [amdgpu] [  103.614796]  ?
> __alloc_pages_nodemask+0x2c9/0x560
> [  103.614824]  ? __alloc_pages_slowpath+0x1390/0x1390
> [  103.614898]  ? kmalloc_order+0x63/0x70 [  103.615469]
> kfd_create_crat_image_virtual+0x70c/0x770 [amdgpu] [  103.616054]  ?
> kfd_create_crat_image_acpi+0x1c0/0x1c0 [amdgpu] [  103.616095]  ?
> up_write+0x4b/0x70 [  103.616649]
> kfd_topology_add_device+0x98d/0xb10 [amdgpu] [  103.617207]  ?
> kfd_topology_shutdown+0x60/0x60 [amdgpu] [  103.617743]  ?
> start_cpsch+0x2ff/0x3a0 [amdgpu] [  103.61]  ?
> mutex_lock_io_nested+0xac0/0xac0 [  103.617807]  ?
> __mutex_unlock_slowpath+0xda/0x420
> [  103.617848]  ? __mutex_unlock_slowpath+0xda/0x420
> [  103.617877]  ? wait_for_completion+0x200/0x200 [  103.618461]  ?
> start_cpsch+0x38b/0x3a0 [amdgpu] [  103.619011]  ?
> create_queue_cpsch+0x670/0x670 [amdgpu] [  103.619573]  ?
> kfd_iommu_device_init+0x92/0x1e0 [amdgpu] [  103.620112]  ?
> kfd_iommu_resume+0x2c/0x2c0 [amdgpu] [  103.620655]  ?
> kfd_iommu_check_device+0xf0/0xf0 [amdgpu] [  103.621228]
> kgd2kfd_device_init+0x474/0x870 [amdgpu] [  103.621781]
> amdgpu_amdkfd_device_init+0x291/0x390 [amdgpu] [  103.622329]  ?
> amdgpu_amdkfd_device_probe+0x90/0x90 [amdgpu] [  103.622344]  ?
> kmsg_dump_rewind_nolock+0x59/0x59 [  103.622895]  ?
> amdgpu_ras_eeprom_test+0x71/0x90 [amdgpu] [  103.623424]
> amdgpu_device_init+0x1bbe/0x2f00 [amdgpu] [  103.623819]  ?
> amdgpu_device_has_dc_support+0x30/0x30 [amdgpu] [  103.623842]  ?
> __isolate_free_page+0x290/0x290 [  103.623852]  ?
> fs_reclaim_acquire.part.97+0x5/0x30

RE: Stack out of bounds in KFD on Arcturus

2019-10-22 Thread Zeng, Oak
Sorry, I meant: is the kernel stack size 16KB in your kconfig?

Oak

-Original Message-
From: Grodzovsky, Andrey  
Sent: Tuesday, October 22, 2019 12:49 PM
To: Zeng, Oak ; Kuehling, Felix 
Cc: amd-gfx@lists.freedesktop.org
Subject: Re: Stack out of bounds in KFD on Arcturus

On 10/18/19 5:31 PM, Zeng, Oak wrote:

> Hi Andrey,
>
> What is your system configuration? I didn’t see this issue before. Also see 
> attached QA's configuration - you can compare to see any difference.


Attached is my lshw

>
> Also I believe for x86-64, the default kernel stack size is 16kb? Is this 
> your Kconfig?


What do you mean by "is this my Kconfig"? Is there a particular Kconfig flag 
you know of that I can look for?

Andrey


>
> Regards,
> Oak
>
> -Original Message-
> From: amd-gfx  On Behalf Of 
> Kuehling, Felix
> Sent: Friday, October 18, 2019 4:55 PM
> To: Grodzovsky, Andrey 
> Cc: amd-gfx@lists.freedesktop.org
> Subject: Re: Stack out of bounds in KFD on Arcturus
>
> On 2019-10-17 6:38 p.m., Grodzovsky, Andrey wrote:
>> Not that I aware of, is there a special Kconfig flag to determine 
>> stack size ?
> I remember there used to be a Kconfig option to force a 4KB kernel stack. I 
> don't see it in the current kernel any more.
>
> I don't have time to work on this myself. I'll create a ticket and see if I 
> can find someone to investigate.
>
> Thanks,
>     Felix
>
>
>> Andrey
>>
>> On 10/17/19 5:29 PM, Kuehling, Felix wrote:
>>> I don't see why this problem would be specific to Arcturus. I don't 
>>> see any excessive allocations on the stack either. Also the code 
>>> involved here hasn't changed recently.
>>>
>>> Are you using some weird kernel config with a smaller stack? Is it 
>>> specific to a compiler version or some optimization flags? I've 
>>> sometimes seen function inlining cause excessive stack usage.
>>>
>>> Regards,
>>>   Felix
>>>
>>> On 2019-10-17 4:09 p.m., Grodzovsky, Andrey wrote:
 Hey Felix - I see this on boot when working with Arcturus.

 Andrey


 [  103.602092] kfd kfd: Allocated 3969056 bytes on gart [ 
 103.610769] 
 ==
 [  103.611469] BUG: KASAN: stack-out-of-bounds in
 kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] [  103.611646] Read 
 of size 4 at addr 8883cb19ee38 by task modprobe/1122

 [  103.611836] CPU: 3 PID: 1122 Comm: modprobe Tainted: G O 
 5.3.0-rc3+ #45 [  103.611847] Hardware name: System manufacturer 
 System Product Name/Z170-PRO, BIOS 1902 06/27/2016 [  103.611856] 
 Call Trace:
 [  103.611879]  dump_stack+0x71/0xab [  103.611907]
 print_address_description+0x1da/0x3c0
 [  103.612453]  ? kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] [ 
 103.612479]  __kasan_report+0x13f/0x1a0 [  103.613022]  ?
 kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] [  103.613580]  ?
 kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] [  103.613604]
 kasan_report+0xe/0x20 [  103.614149]
 kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] [  103.614762]  ?
 kfd_fill_gpu_memory_affinity+0x110/0x110 [amdgpu] [  103.614796]  ?
 __alloc_pages_nodemask+0x2c9/0x560
 [  103.614824]  ? __alloc_pages_slowpath+0x1390/0x1390
 [  103.614898]  ? kmalloc_order+0x63/0x70 [  103.615469]
 kfd_create_crat_image_virtual+0x70c/0x770 [amdgpu] [  103.616054]  ?
 kfd_create_crat_image_acpi+0x1c0/0x1c0 [amdgpu] [  103.616095]  ?
 up_write+0x4b/0x70 [  103.616649]
 kfd_topology_add_device+0x98d/0xb10 [amdgpu] [  103.617207]  ?
 kfd_topology_shutdown+0x60/0x60 [amdgpu] [  103.617743]  ?
 start_cpsch+0x2ff/0x3a0 [amdgpu] [  103.61]  ?
 mutex_lock_io_nested+0xac0/0xac0 [  103.617807]  ?
 __mutex_unlock_slowpath+0xda/0x420
 [  103.617848]  ? __mutex_unlock_slowpath+0xda/0x420
 [  103.617877]  ? wait_for_completion+0x200/0x200 [  103.618461]  ?
 start_cpsch+0x38b/0x3a0 [amdgpu] [  103.619011]  ?
 create_queue_cpsch+0x670/0x670 [amdgpu] [  103.619573]  ?
 kfd_iommu_device_init+0x92/0x1e0 [amdgpu] [  103.620112]  ?
 kfd_iommu_resume+0x2c/0x2c0 [amdgpu] [  103.620655]  ?
 kfd_iommu_check_device+0xf0/0xf0 [amdgpu] [  103.621228]
 kgd2kfd_device_init+0x474/0x870 [amdgpu] [  103.621781]
 amdgpu_amdkfd_device_init+0x291/0x390 [amdgpu] [  103.622329]  ?
 amdgpu_amdkfd_device_probe+0x90/0x90 [amdgpu] [  103.622344]  ?
 kmsg_dump_rewind_nolock+0x59/0x59 [  103.622895]  ?
 amdgpu_ras_eeprom_test+0x71/0x90 [amdgpu] [  103.623424]
 amdgpu_device_init+0x1bbe/0x2f00 [amdgpu] [  103.623819]  ?
 amdgpu_device_has_dc_support+0x30/0x30 [amdgpu] [  103.623842]  ?
 __isolate_free_page+0x290/0x290 [  103.623852]  ?
 fs_reclaim_acquire.part.97+0x5/0x30
 [  103.623891]  ? __alloc_pages_nodemask+0x2c9/0x560
 [  103.623912]  ? __alloc_pages_slowpath+0x1390/0x1390
 [  103.623945]  ? kasan_unpoison_shadow+0x31/0x40 [  103.623970]  ?
 

[PATCH] drm/amdgpu/sdma5: do not execute 0-sized IBs (v2)

2019-10-22 Thread Pelloux-prayer, Pierre-eric
This seems to help with https://bugs.freedesktop.org/show_bug.cgi?id=111481.

v2: insert a NOP instead of skipping all 0-sized IBs to avoid breaking older hw

Signed-off-by: Pierre-Eric Pelloux-Prayer 
---
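Note on the mechanics (illustrative, matching the diff below): the flush
job's IB starts out empty, so writing a single engine NOP packet
guarantees length_dw != 0 before the IB is padded and submitted:

	job->ibs->ptr[job->ibs->length_dw++] = ring->funcs->nop; /* 1 dword */
	amdgpu_ring_pad_ib(ring, &job->ibs[0]);                  /* align IB */
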
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index fb48622c2abd..6e1b25bd1fe7 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -309,6 +309,7 @@ static void gmc_v10_0_flush_gpu_tlb(struct amdgpu_device 
*adev, uint32_t vmid,
 
job->vm_pd_addr = amdgpu_gmc_pd_addr(adev->gart.bo);
job->vm_needs_flush = true;
+   job->ibs->ptr[job->ibs->length_dw++] = ring->funcs->nop;
amdgpu_ring_pad_ib(ring, >ibs[0]);
	r = amdgpu_job_submit(job, &adev->mman.entity,
			      AMDGPU_FENCE_OWNER_UNDEFINED, &fence);
-- 
2.23.0.rc1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: Stack out of bounds in KFD on Arcturus

2019-10-22 Thread Grodzovsky, Andrey
On 10/18/19 5:31 PM, Zeng, Oak wrote:

> Hi Andrey,
>
> What is your system configuration? I didn’t see this issue before. Also see 
> attached QA's configuration - you can compare to see any difference.


Attached is my lshw

>
> Also I believe for x86-64, the default kernel stack size is 16kb? Is this 
> your Kconfig?


What do you mean by "is this my Kconfig"? Is there a particular Kconfig 
flag you know of that I can look for?

Andrey


>
> Regards,
> Oak
>
> -Original Message-
> From: amd-gfx  On Behalf Of Kuehling, 
> Felix
> Sent: Friday, October 18, 2019 4:55 PM
> To: Grodzovsky, Andrey 
> Cc: amd-gfx@lists.freedesktop.org
> Subject: Re: Stack out of bounds in KFD on Arcturus
>
> On 2019-10-17 6:38 p.m., Grodzovsky, Andrey wrote:
>> Not that I aware of, is there a special Kconfig flag to determine
>> stack size ?
> I remember there used to be a Kconfig option to force a 4KB kernel stack. I 
> don't see it in the current kernel any more.
>
> I don't have time to work on this myself. I'll create a ticket and see if I 
> can find someone to investigate.
>
> Thanks,
>     Felix
>
>
>> Andrey
>>
>> On 10/17/19 5:29 PM, Kuehling, Felix wrote:
>>> I don't see why this problem would be specific to Arcturus. I don't
>>> see any excessive allocations on the stack either. Also the code
>>> involved here hasn't changed recently.
>>>
>>> Are you using some weird kernel config with a smaller stack? Is it
>>> specific to a compiler version or some optimization flags? I've
>>> sometimes seen function inlining cause excessive stack usage.
>>>
>>> Regards,
>>>   Felix
>>>
>>> On 2019-10-17 4:09 p.m., Grodzovsky, Andrey wrote:
 Hey Felix - I see this on boot when working with Arcturus.

 Andrey


 [  103.602092] kfd kfd: Allocated 3969056 bytes on gart [
 103.610769]
 ==
 [  103.611469] BUG: KASAN: stack-out-of-bounds in
 kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] [  103.611646] Read
 of size 4 at addr 8883cb19ee38 by task modprobe/1122

 [  103.611836] CPU: 3 PID: 1122 Comm: modprobe Tainted: G O
 5.3.0-rc3+ #45 [  103.611847] Hardware name: System manufacturer
 System Product Name/Z170-PRO, BIOS 1902 06/27/2016 [  103.611856]
 Call Trace:
 [  103.611879]  dump_stack+0x71/0xab [  103.611907]
 print_address_description+0x1da/0x3c0
 [  103.612453]  ? kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] [
 103.612479]  __kasan_report+0x13f/0x1a0 [  103.613022]  ?
 kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] [  103.613580]  ?
 kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] [  103.613604]
 kasan_report+0xe/0x20 [  103.614149]
 kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu] [  103.614762]  ?
 kfd_fill_gpu_memory_affinity+0x110/0x110 [amdgpu] [  103.614796]  ?
 __alloc_pages_nodemask+0x2c9/0x560
 [  103.614824]  ? __alloc_pages_slowpath+0x1390/0x1390
 [  103.614898]  ? kmalloc_order+0x63/0x70 [  103.615469]
 kfd_create_crat_image_virtual+0x70c/0x770 [amdgpu] [  103.616054]  ?
 kfd_create_crat_image_acpi+0x1c0/0x1c0 [amdgpu] [  103.616095]  ?
 up_write+0x4b/0x70 [  103.616649]
 kfd_topology_add_device+0x98d/0xb10 [amdgpu] [  103.617207]  ?
 kfd_topology_shutdown+0x60/0x60 [amdgpu] [  103.617743]  ?
 start_cpsch+0x2ff/0x3a0 [amdgpu] [  103.61]  ?
 mutex_lock_io_nested+0xac0/0xac0 [  103.617807]  ?
 __mutex_unlock_slowpath+0xda/0x420
 [  103.617848]  ? __mutex_unlock_slowpath+0xda/0x420
 [  103.617877]  ? wait_for_completion+0x200/0x200 [  103.618461]  ?
 start_cpsch+0x38b/0x3a0 [amdgpu] [  103.619011]  ?
 create_queue_cpsch+0x670/0x670 [amdgpu] [  103.619573]  ?
 kfd_iommu_device_init+0x92/0x1e0 [amdgpu] [  103.620112]  ?
 kfd_iommu_resume+0x2c/0x2c0 [amdgpu] [  103.620655]  ?
 kfd_iommu_check_device+0xf0/0xf0 [amdgpu] [  103.621228]
 kgd2kfd_device_init+0x474/0x870 [amdgpu] [  103.621781]
 amdgpu_amdkfd_device_init+0x291/0x390 [amdgpu] [  103.622329]  ?
 amdgpu_amdkfd_device_probe+0x90/0x90 [amdgpu] [  103.622344]  ?
 kmsg_dump_rewind_nolock+0x59/0x59 [  103.622895]  ?
 amdgpu_ras_eeprom_test+0x71/0x90 [amdgpu] [  103.623424]
 amdgpu_device_init+0x1bbe/0x2f00 [amdgpu] [  103.623819]  ?
 amdgpu_device_has_dc_support+0x30/0x30 [amdgpu] [  103.623842]  ?
 __isolate_free_page+0x290/0x290 [  103.623852]  ?
 fs_reclaim_acquire.part.97+0x5/0x30
 [  103.623891]  ? __alloc_pages_nodemask+0x2c9/0x560
 [  103.623912]  ? __alloc_pages_slowpath+0x1390/0x1390
 [  103.623945]  ? kasan_unpoison_shadow+0x31/0x40 [  103.623970]  ?
 kmalloc_order+0x63/0x70 [  103.624337]
 amdgpu_driver_load_kms+0xd9/0x430 [amdgpu] [  103.624690]  ?
 amdgpu_register_gpu_instance+0xe0/0xe0 [amdgpu] [  103.624756]  ?
 drm_dev_register+0x19c/0x310 [drm] [  103.624768]  ?
 __kasan_slab_free+0x133/0x160 [  103.624849]
 

Re: radeon Disabling GPU acceleration (WB disabled?)

2019-10-22 Thread Michel Dänzer
On 2019-10-20 11:21 p.m., Meelis Roos wrote:
> I tried 5.2, 5.3 and 5.4-rc4 on my old Fujitsu RX220 with integrated
> Radeon RV100. Dmesg tells that GPU acceleration is disabled. I do not
> know if it has been enabled in the past - no old kernels handy at the
> moment.
> 
> From dmesg it looks like something with MTRR maybe: WB disabled.

That's harmless.


> [    8.535975] [drm] Driver supports precise vblank timestamp query.
> [    8.535981] radeon :00:05.0: Disabling GPU acceleration

This looks like drm_irq_install returns an error in radeon_irq_kms_init.
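
For reference, a simplified sketch of that path (as in kernels of this
vintage); if drm_irq_install() fails here, the error propagates and the
driver ends up running without acceleration:

	/* radeon_irq_kms_init(), abridged */
	r = drm_irq_install(rdev->ddev, rdev->pdev->irq);
	if (r)
		return r;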


-- 
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH v5 05/14] drm/dp_mst: Add probe_lock

2019-10-22 Thread Sean Paul
On Mon, Oct 21, 2019 at 10:36:00PM -0400, Lyude Paul wrote:
> Currently, MST lacks locking in a lot of places that really should have
> some sort of locking. Hotplugging and link address code paths are some
> of the offenders here, as there is actually nothing preventing us from
> running a link address probe while at the same time handling a
> connection status update request - something that's likely always been
> possible but never seen in the wild because hotplugging has been broken
> for ages now (with the exception of amdgpu, for reasons I don't think
> are worth digging into very far).
> 
> Note: I'm going to start using the term "in-memory topology layout" here
> to refer to drm_dp_mst_port->mstb and drm_dp_mst_branch->ports.
> 
> Locking in these places is a little tougher than it looks though.
> Generally we protect anything having to do with the in-memory topology
> layout under &mgr->lock. But this becomes nearly impossible to do from
> the context of link address probes due to the fact that &mgr->lock is
> usually grabbed under various modesetting locks, meaning that
> there's no way we can just invert the &mgr->lock order and keep it
> locked throughout the whole process of updating the topology.
> 
> Luckily there are only two workers which can modify the in-memory
> topology layout: drm_dp_mst_up_req_work() and
> drm_dp_mst_link_probe_work(), meaning as long as we prevent these two
> workers from traversing the topology layout in parallel with the intent
> of updating it, we don't need to worry about grabbing &mgr->lock in these
> workers for reads. We only need to grab &mgr->lock in these workers for
> writes, so that readers outside these two workers are still protected
> from the topology layout changing beneath them.
> 
> So, add the new &mgr->probe_lock and use it in both
> drm_dp_mst_link_probe_work() and drm_dp_mst_up_req_work(). Additionally,
> add some more detailed explanations for how this locking is intended to
> work to drm_dp_mst_port->mstb and drm_dp_mst_branch->ports.

This seems like a good solution to me, thanks for working this through!

Any chance we could add a WARN_ON(!mutex_is_locked(&mgr->probe_lock)); somewhere
centrally called by all paths modifying the in-memory topology layout?
drm_dp_add_port perhaps? That way we have a fallback in case something else
starts mucking with the topology.
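
Something along these lines at the top of drm_dp_add_port() would do it
(sketch only; lockdep_assert_held() is an alternative that compiles away
when lockdep is disabled):

	WARN_ON(!mutex_is_locked(&mstb->mgr->probe_lock));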

Other than that,

Reviewed-by: Sean Paul 


> 
> Signed-off-by: Lyude Paul 
> Cc: Juston Li 
> Cc: Imre Deak 
> Cc: Ville Syrjälä 
> Cc: Harry Wentland 
> Cc: Daniel Vetter 
> Cc: Sean Paul 
> ---
>  drivers/gpu/drm/drm_dp_mst_topology.c | 28 ++-
>  include/drm/drm_dp_mst_helper.h   | 32 +++
>  2 files changed, 46 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_dp_mst_topology.c 
> b/drivers/gpu/drm/drm_dp_mst_topology.c
> index 08c316a727df..11d842f0bff5 100644
> --- a/drivers/gpu/drm/drm_dp_mst_topology.c
> +++ b/drivers/gpu/drm/drm_dp_mst_topology.c
> @@ -2147,37 +2147,40 @@ static void drm_dp_check_and_send_link_address(struct 
> drm_dp_mst_topology_mgr *m
>  struct drm_dp_mst_branch *mstb)
>  {
>   struct drm_dp_mst_port *port;
> - struct drm_dp_mst_branch *mstb_child;
> +
>   if (!mstb->link_address_sent)
>   drm_dp_send_link_address(mgr, mstb);
>  
>   list_for_each_entry(port, &mstb->ports, next) {
> - if (port->input)
> - continue;
> + struct drm_dp_mst_branch *mstb_child = NULL;
>  
> - if (!port->ddps)
> + if (port->input || !port->ddps)
>   continue;
>  
>   if (!port->available_pbn)
>   drm_dp_send_enum_path_resources(mgr, mstb, port);
>  
> - if (port->mstb) {
> + if (port->mstb)
>   mstb_child = drm_dp_mst_topology_get_mstb_validated(
>   mgr, port->mstb);
> - if (mstb_child) {
> - drm_dp_check_and_send_link_address(mgr, 
> mstb_child);
> - drm_dp_mst_topology_put_mstb(mstb_child);
> - }
> +
> + if (mstb_child) {
> + drm_dp_check_and_send_link_address(mgr, mstb_child);
> + drm_dp_mst_topology_put_mstb(mstb_child);
>   }
>   }
>  }
>  
>  static void drm_dp_mst_link_probe_work(struct work_struct *work)
>  {
> - struct drm_dp_mst_topology_mgr *mgr = container_of(work, struct 
> drm_dp_mst_topology_mgr, work);
> + struct drm_dp_mst_topology_mgr *mgr =
> + container_of(work, struct drm_dp_mst_topology_mgr, work);
> + struct drm_device *dev = mgr->dev;
>   struct drm_dp_mst_branch *mstb;
>   int ret;
>  
> + mutex_lock(&mgr->probe_lock);
> +
>   mutex_lock(&mgr->lock);
>   mstb = mgr->mst_primary;
>   if (mstb) {
> @@ -2190,6 +2193,7 @@ static void drm_dp_mst_link_probe_work(struct 
> 

Re: [PATCH] drm/amdkfd: don't use dqm lock during device reset/suspend/resume

2019-10-22 Thread Yang, Philip


On 2019-10-21 9:03 p.m., Kuehling, Felix wrote:
> 
> On 2019-10-21 5:04 p.m., Yang, Philip wrote:
>> If device reset/suspend/resume fails for some reason, the dqm lock is
>> held forever and this causes a deadlock. Below is a kernel backtrace from
>> an application opening kfd after suspend/resume failed.
>>
>> Instead of holding the dqm lock in pre_reset and releasing it in
>> post_reset, add a dqm->device_stopped flag which is modified in
>> dqm->ops.start and dqm->ops.stop. The flag doesn't need lock protection
>> because all writes/reads happen inside the dqm lock.
>>
>> For the HWS case, map_queues_cpsch and unmap_queues_cpsch check the
>> device_stopped flag before sending the updated runlist.
> 
> What about the non-HWS case?
> 
> In theory, in the non-HWS case new queues should be created in the evicted
> state while the device (and all processes) are suspended. So we should never
> try to map or unmap queues to HQDs during suspend. But I'd feel better
> with a WARN_ON and error return in the right places to make sure we're
> not missing anything. Basically, we can't call any of the
> load/destroy_mqd functions while suspended.
> 
The v2 patch adds the non-HWS case.

> That reminds me, we also have to add some checks in the debugfs code to
> avoid dumping HQDs of a DQM that's stopped.
>
Thanks, done in v2 patch

> Last comment: dqm->device_stopped must be initialized as true. It will
> get set to false when the device is first started. It may be easier to
> reverse the logic, something like dqm->sched_running that gets
> implicitly initialized as false.
> 
Changed to dqm->sched_running in the v2 patch.

Regards,
Philip

> Regards,
>     Felix
> 
> 
>>
>> Backtrace of dqm lock deadlock:
>>
>> [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more
>> than 120 seconds.
>> [Thu Oct 17 16:43:37 2019]   Not tainted
>> 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
>> [Thu Oct 17 16:43:37 2019] "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [Thu Oct 17 16:43:37 2019] rocminfoD0  3024   2947
>> 0x8000
>> [Thu Oct 17 16:43:37 2019] Call Trace:
>> [Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0
>> [Thu Oct 17 16:43:37 2019]  schedule+0x32/0x70
>> [Thu Oct 17 16:43:37 2019]  schedule_preempt_disabled+0xa/0x10
>> [Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0
>> [Thu Oct 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0
>> [Thu Oct 17 16:43:37 2019]  ? process_termination_cpsch+0x24/0x2f0
>> [amdgpu]
>> [Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0
>> [amdgpu]
>> [Thu Oct 17 16:43:37 2019]
>> kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
>> [Thu Oct 17 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220
>> [amdgpu]
>> [Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0
>> [Thu Oct 17 16:43:37 2019]  exit_mmap+0x160/0x1a0
>> [Thu Oct 17 16:43:37 2019]  ? __handle_mm_fault+0xba3/0x1200
>> [Thu Oct 17 16:43:37 2019]  ? exit_robust_list+0x5a/0x110
>> [Thu Oct 17 16:43:37 2019]  mmput+0x4a/0x120
>> [Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20
>> [Thu Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200
>> [Thu Oct 17 16:43:37 2019]  do_group_exit+0x3a/0xa0
>> [Thu Oct 17 16:43:37 2019]  __x64_sys_exit_group+0x14/0x20
>> [Thu Oct 17 16:43:37 2019]  do_syscall_64+0x4f/0x100
>> [Thu Oct 17 16:43:37 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> Suggested-by: Felix Kuehling 
>> Signed-off-by: Philip Yang 
>> ---
>>drivers/gpu/drm/amd/amdkfd/kfd_chardev.c|  6 +++---
>>drivers/gpu/drm/amd/amdkfd/kfd_device.c |  5 -
>>.../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c   | 13 ++---
>>.../gpu/drm/amd/amdkfd/kfd_device_queue_manager.h   |  1 +
>>4 files changed, 14 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
>> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> index d9e36dbf13d5..40d75c39f08e 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> @@ -120,13 +120,13 @@ static int kfd_open(struct inode *inode, struct file 
>> *filep)
>>  return -EPERM;
>>  }
>>
>> +if (kfd_is_locked())
>> +return -EAGAIN;
>> +
>>  process = kfd_create_process(filep);
>>  if (IS_ERR(process))
>>  return PTR_ERR(process);
>>
>> -if (kfd_is_locked())
>> -return -EAGAIN;
>> -
>>  dev_dbg(kfd_device, "process %d opened, compat mode (32 bit) - %d\n",
>>  process->pasid, process->is_32bit_user_mode);
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index 8f4b24e84964..4fa8834ce7cb 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -730,9 +730,6 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>>  return 0;
>>  kgd2kfd_suspend(kfd);
>>
>> -/* hold dqm->lock to prevent further 

Re: [PATCH] drm/amdgpu/vcn: Enable VCN2.5 encoding

2019-10-22 Thread James Zhu
Hi Christian,

I checked that the Arcturus firmware hasn't been published yet.

Best Regards!

James Zhu

On 2019-10-22 8:52 a.m., Christian König wrote:
> Have we ever published an older version of the firmware?
>
> I strongly assume the answer is "no" and if that's the case feel free 
> to add a Reviewed-by: Christian König .
>
> Regards,
> Christian.
>
> Am 22.10.19 um 16:50 schrieb Liu, Leo:
>> Reviewed-by: Leo Liu 
>>
>> -Original Message-
>> From: amd-gfx  On Behalf Of 
>> Zhu, James
>> Sent: Tuesday, October 22, 2019 10:49 AM
>> To: amd-gfx@lists.freedesktop.org
>> Cc: Zhu, James 
>> Subject: [PATCH] drm/amdgpu/vcn: Enable VCN2.5 encoding
>>
>> After VCN2.5 firmware (Version ENC: 1.1  Revision: 11),
>> VCN2.5 encoding can work properly.
>>
>> Signed-off-by: James Zhu 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c | 3 ---
>>   1 file changed, 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c 
>> b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
>> index d270df8..ff6cc77 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
>> @@ -265,9 +265,6 @@ static int vcn_v2_5_hw_init(void *handle)
>>     for (i = 0; i < adev->vcn.num_enc_rings; ++i) {
>> ring = &adev->vcn.inst[j].ring_enc[i];
>> -    /* disable encode rings till the robustness of the FW */
>> -    ring->sched.ready = false;
>> -    continue;
>>   r = amdgpu_ring_test_helper(ring);
>>   if (r)
>>   goto done;
>
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amdgpu/vcn: Enable VCN2.5 encoding

2019-10-22 Thread Christian König

Have we ever published an older version of the firmware?

I strongly assume the answer is "no" and if that's the case feel free to 
add a Reviewed-by: Christian König .


Regards,
Christian.

Am 22.10.19 um 16:50 schrieb Liu, Leo:

Reviewed-by: Leo Liu 

-Original Message-
From: amd-gfx  On Behalf Of Zhu, James
Sent: Tuesday, October 22, 2019 10:49 AM
To: amd-gfx@lists.freedesktop.org
Cc: Zhu, James 
Subject: [PATCH] drm/amdgpu/vcn: Enable VCN2.5 encoding

After VCN2.5 firmware (Version ENC: 1.1  Revision: 11),
VCN2.5 encoding can work properly.

Signed-off-by: James Zhu 
---
  drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c | 3 ---
  1 file changed, 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
index d270df8..ff6cc77 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
@@ -265,9 +265,6 @@ static int vcn_v2_5_hw_init(void *handle)
  
  		for (i = 0; i < adev->vcn.num_enc_rings; ++i) {

ring = &adev->vcn.inst[j].ring_enc[i];
-   /* disable encode rings till the robustness of the FW */
-   ring->sched.ready = false;
-   continue;
r = amdgpu_ring_test_helper(ring);
if (r)
goto done;


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH 2/4] drm/amd/powerplay: Add EEPROM I2C read/write support to Arcturus.

2019-10-22 Thread Grodzovsky, Andrey

On 10/20/19 9:44 PM, Chen, Guchun wrote:
> Comment inline.
>
>
> Regards,
> Guchun
>
> -Original Message-
> From: Andrey Grodzovsky 
> Sent: Saturday, October 19, 2019 4:48 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Chen, Guchun ; Zhou1, Tao ; 
> Deucher, Alexander ; noreply-conflue...@amd.com; 
> Quan, Evan ; Grodzovsky, Andrey 
> Subject: [PATCH 2/4] drm/amd/powerplay: Add EEPROM I2C read/write support to 
> Arcturus.
>
> The communication is done through SMU table and hence the code is in 
> powerplay.
>
> Signed-off-by: Andrey Grodzovsky 
> ---
>   drivers/gpu/drm/amd/powerplay/arcturus_ppt.c | 229 
> +++
>   1 file changed, 229 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c 
> b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
> index 90d871a..53d08de5 100644
> --- a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
> +++ b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
> @@ -36,6 +36,11 @@
>   #include "smu_v11_0_pptable.h"
>   #include "arcturus_ppsmc.h"
>   #include "nbio/nbio_7_4_sh_mask.h"
> +#include 
> +#include 
> +#include "amdgpu_ras.h"
> +
> +#define to_amdgpu_device(x) (container_of(x, struct amdgpu_ras,
> +eeprom_control.eeprom_accessor))->adev
>   
>   #define CTF_OFFSET_EDGE 5
>   #define CTF_OFFSET_HOTSPOT  5
> @@ -171,6 +176,7 @@ static struct smu_11_0_cmn2aisc_mapping 
> arcturus_table_map[SMU_TABLE_COUNT] = {
>   TAB_MAP(SMU_METRICS),
>   TAB_MAP(DRIVER_SMU_CONFIG),
>   TAB_MAP(OVERDRIVE),
> + TAB_MAP(I2C_COMMANDS),
>   };
>   
>   static struct smu_11_0_cmn2aisc_mapping 
> arcturus_pwr_src_map[SMU_POWER_SOURCE_COUNT] = { @@ -293,6 +299,9 @@ static 
> int arcturus_tables_init(struct smu_context *smu, struct smu_table *table
>   SMU_TABLE_INIT(tables, SMU_TABLE_SMU_METRICS, sizeof(SmuMetrics_t),
>  PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
>   
> + SMU_TABLE_INIT(tables, SMU_TABLE_I2C_COMMANDS, sizeof(SwI2cRequest_t),
> +PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> +
>   smu_table->metrics_table = kzalloc(sizeof(SmuMetrics_t), GFP_KERNEL);
>   if (!smu_table->metrics_table)
>   return -ENOMEM;
> @@ -1927,6 +1936,224 @@ static int arcturus_dpm_set_uvd_enable(struct 
> smu_context *smu, bool enable)
>   return ret;
>   }
>   
> +
> +static void arcturus_fill_eeprom_i2c_req(SwI2cRequest_t  *req, bool write,
> +   uint8_t address, uint32_t numbytes,
> +   uint8_t *data)
> +{
> + int i;
> +
> + BUG_ON(numbytes > MAX_SW_I2C_COMMANDS);
> +
> + req->I2CcontrollerPort = 0;
> + req->I2CSpeed = 2;
> + req->SlaveAddress = address;
> + req->NumCmds = numbytes;
> +
> + for (i = 0; i < numbytes; i++) {
> + SwI2cCmd_t *cmd = &req->SwI2cCmds[i];
> +
> + /* First 2 bytes are always write for lower 2b EEPROM address */
> + if (i < 2)
> + cmd->Cmd = 1;
> + else
> + cmd->Cmd = write;
> +
> +
> + /* Add RESTART for read  after address filled */
> + cmd->CmdConfig |= (i == 2 && !write) ? CMDCONFIG_RESTART_MASK : 
> 0;
> +
> + /* Add STOP in the end */
> + cmd->CmdConfig |= (i == (numbytes - 1)) ? CMDCONFIG_STOP_MASK : 
> 0;
> +
> + /* Fill with data regardless if read or write to simplify code 
> */
> + cmd->RegisterAddr = data[i];
> + }
> +}
> +
> +static int arcturus_i2c_eeprom_read_data(struct i2c_adapter *control,
> +uint8_t address,
> +uint8_t *data,
> +uint32_t numbytes)
> +{
> + uint32_t  i, ret = 0;
> + SwI2cRequest_t req;
> + struct amdgpu_device *adev = to_amdgpu_device(control);
> + struct smu_table_context *smu_table = &adev->smu.smu_table;
> + struct smu_table *table = &smu_table->tables[SMU_TABLE_I2C_COMMANDS];
> +
> + memset(&req, 0, sizeof(req));
> + arcturus_fill_eeprom_i2c_req(&req, false, address, numbytes, data);
> +
> + mutex_lock(&adev->smu.mutex);
> + /* Now read data starting with that address */
> + ret = smu_update_table(&adev->smu, SMU_TABLE_I2C_COMMANDS, 0, &req,
> + true);
> + mutex_unlock(&adev->smu.mutex);
> +
> + if (!ret) {
> + SwI2cRequest_t *res = (SwI2cRequest_t *)table->cpu_addr;
> +
> + /* Assume SMU  fills res.SwI2cCmds[i].Data with read bytes */
> + for (i = 0; i < numbytes; i++)
> + data[i] = res->SwI2cCmds[i].Data;
> +
> + pr_debug("arcturus_i2c_eeprom_read_data, address = %x, bytes = 
> %d, data :",
> +   (uint16_t)address, numbytes);
> +
> + print_hex_dump(KERN_DEBUG, "data: ", DUMP_PREFIX_NONE,
> +8, 1, data, numbytes, false);
> + } else
> +   

RE: [PATCH] drm/amdgpu/vcn: Enable VCN2.5 encoding

2019-10-22 Thread Liu, Leo
Reviewed-by: Leo Liu 

-Original Message-
From: amd-gfx  On Behalf Of Zhu, James
Sent: Tuesday, October 22, 2019 10:49 AM
To: amd-gfx@lists.freedesktop.org
Cc: Zhu, James 
Subject: [PATCH] drm/amdgpu/vcn: Enable VCN2.5 encoding

After VCN2.5 firmware (Version ENC: 1.1  Revision: 11),
VCN2.5 encoding can work properly.

Signed-off-by: James Zhu 
---
 drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
index d270df8..ff6cc77 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
@@ -265,9 +265,6 @@ static int vcn_v2_5_hw_init(void *handle)
 
for (i = 0; i < adev->vcn.num_enc_rings; ++i) {
ring = &adev->vcn.inst[j].ring_enc[i];
-   /* disable encode rings till the robustness of the FW */
-   ring->sched.ready = false;
-   continue;
r = amdgpu_ring_test_helper(ring);
if (r)
goto done;
-- 
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

[PATCH] drm/amdgpu/vcn: Enable VCN2.5 encoding

2019-10-22 Thread Zhu, James
After VCN2.5 firmware (Version ENC: 1.1  Revision: 11),
VCN2.5 encoding can work properly.

Signed-off-by: James Zhu 
---
 drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
index d270df8..ff6cc77 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
@@ -265,9 +265,6 @@ static int vcn_v2_5_hw_init(void *handle)
 
for (i = 0; i < adev->vcn.num_enc_rings; ++i) {
ring = &adev->vcn.inst[j].ring_enc[i];
-   /* disable encode rings till the robustness of the FW */
-   ring->sched.ready = false;
-   continue;
r = amdgpu_ring_test_helper(ring);
if (r)
goto done;
-- 
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amdgpu/display: add dc feature mask for psr enablement

2019-10-22 Thread Kazlauskas, Nicholas
On 2019-10-22 9:33 a.m., Li, Roman wrote:
>> Any reason why we're skipping a flag here going from 0x2 to 0x8?
> 
> 0x4 is reserved for the fractional PWM mask:
> https://patchwork.freedesktop.org/patch/336923/

Ah, missed that patch. No problem then, feel free to go ahead with this.

Nicholas Kazlauskas

> 
> Thank you Nicholas for the review.
> 
> -Original Message-
> From: Kazlauskas, Nicholas 
> Sent: Tuesday, October 22, 2019 8:39 AM
> To: Li, Roman ; amd-gfx@lists.freedesktop.org; Deucher, 
> Alexander 
> Cc: Wentland, Harry ; Lakha, Bhawanpreet 
> ; Li, Sun peng (Leo) 
> Subject: Re: [PATCH] drm/amdgpu/display: add dc feature mask for psr 
> enablement
> 
> On 2019-10-21 5:45 p.m., roman...@amd.com wrote:
>> From: Roman Li 
>>
>> [Why]
>> Adding psr mask to dc features allows selectively disable/enable psr.
>> Current psr implementation may not work with non-pageflipping application.
>> Until resolved it should be disabled by default.
>>
>> [How]
>> Add dcfeaturemask for psr enablement. Disable by default.
>> To enable set amdgpu.dcfeaturemask=0x8 in grub kernel command line.
>>
>> Signed-off-by: Roman Li 
>> ---
>>drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 ++-
>>drivers/gpu/drm/amd/include/amd_shared.h  | 1 +
>>2 files changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> index 1cf4beb..0f08879 100644
>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> @@ -2424,7 +2424,8 @@ static int amdgpu_dm_initialize_drm_device(struct 
>> amdgpu_device *adev)
>>  } else if (dc_link_detect(link, DETECT_REASON_BOOT)) {
>>  amdgpu_dm_update_connector_after_detect(aconnector);
>>  register_backlight_device(dm, link);
>> -amdgpu_dm_set_psr_caps(link);
>> +if (amdgpu_dc_feature_mask & DC_PSR_MASK)
>> +amdgpu_dm_set_psr_caps(link);
>>  }
>>
>>
>> diff --git a/drivers/gpu/drm/amd/include/amd_shared.h
>> b/drivers/gpu/drm/amd/include/amd_shared.h
>> index 8889aac..1daa221 100644
>> --- a/drivers/gpu/drm/amd/include/amd_shared.h
>> +++ b/drivers/gpu/drm/amd/include/amd_shared.h
>> @@ -143,6 +143,7 @@ enum PP_FEATURE_MASK {
>>enum DC_FEATURE_MASK {
>>  DC_FBC_MASK = 0x1,
>>  DC_MULTI_MON_PP_MCLK_SWITCH_MASK = 0x2,
>> +DC_PSR_MASK = 0x8,
> 
> Can this just be 0x4 instead? Any reason why we're skipping a flag here going 
> from 0x2 to 0x8?
> 
> You can still have my:
> 
> Reviewed-by: Nicholas Kazlauskas 
> 
> but my preference would be on fixing this up to a 0x4 first in the commit 
> message / DC_FEATURE_MASK.
> 
> Nicholas Kazlauskas
> 
>>};
>>
>>enum amd_dpm_forced_level;
>>
> 

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amdkfd: don't use dqm lock during device reset/suspend/resume

2019-10-22 Thread Kuehling, Felix
On 2019-10-21 22:02, Zeng, Oak wrote:
> If we decline the queue creation request in the suspended state by returning
> -EAGAIN, then this approach works for both HWS and non-HWS. This way the
> driver is clean, but the application needs to re-create the queue later when
> it gets an EAGAIN. Currently applications are not aware of the suspend/resume
> state, so it is hard for an application to know when to re-create the queue.
>
> The main benefit of allowing queue creation in the suspended state is that it
> is easier for the application writer. But there is no actual performance gain,
> as no task will be executed while suspended.

We should not need to prevent queue creation while suspended. The 
processes are suspended. That means new queues will be created in 
evicted state:

     /*
  * Eviction state logic: mark all queues as evicted, even ones
  * not currently active. Restoring inactive queues later only
  * updates the is_evicted flag but is a no-op otherwise.
  */
     q->properties.is_evicted = !!qpd->evicted;

mqd_mgr->load_mqd will only be called for active queues. So even in the 
non-HWS case we should not be touching the HW while suspended. But I'd 
like to see some safeguards in place to make sure those assumptions are 
never violated.
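
For illustration, the safeguard could be as small as this (hedged
sketch only, not the actual v2 patch -- using the sched_running name
suggested below) at the top of map_queues_cpsch/unmap_queues_cpsch and
ahead of any load_mqd/destroy_mqd call:

	if (WARN_ON(!dqm->sched_running))
		return -EAGAIN;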

Regards,
   Felix


>
> Regards,
> Oak
>
> -Original Message-
> From: amd-gfx  On Behalf Of Kuehling, 
> Felix
> Sent: Monday, October 21, 2019 9:04 PM
> To: Yang, Philip ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdkfd: don't use dqm lock during device 
> reset/suspend/resume
>
>
> On 2019-10-21 5:04 p.m., Yang, Philip wrote:
>> If device reset/suspend/resume fails for some reason, the dqm lock is
>> held forever and this causes a deadlock. Below is a kernel backtrace
>> from an application opening kfd after suspend/resume failed.
>>
>> Instead of holding dqm lock in pre_reset and releasing dqm lock in
>> post_reset, add dqm->device_stopped flag which is modified in
>> dqm->ops.start and dqm->ops.stop. The flag doesn't need lock
>> dqm->protection
>> because write/read are all inside dqm lock.
>>
>> For HWS case, map_queues_cpsch and unmap_queues_cpsch checks
>> device_stopped flag before sending the updated runlist.
> What about the non-HWS case?
>
> In theory in non-HWS case new queues should be created in evicted state while 
> the device (and all processes) are suspended. So we should never try to map 
> or unmap queues to HQDs during suspend. But I'd feel better with a WARN_ON 
> and error return in the right places to make sure we're not missing anything. 
> Basically, we can't call any of the load/destroy_mqd functions while 
> suspended.
>
> That reminds me, we also have to add some checks in the debugfs code to avoid 
> dumping HQDs of a DQM that's stopped.
>
> Last comment: dqm->device_stopped must be initialized as true. It will get 
> set to false when the device is first started. It may be easier to reverse 
> the logic, something like dqm->sched_running that gets implicitly initialized 
> as false.
>
> Regards,
>     Felix
>
>
>> Backtrace of dqm lock deadlock:
>>
>> [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more
>> than 120 seconds.
>> [Thu Oct 17 16:43:37 2019]   Not tainted
>> 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1 [Thu Oct 17 16:43:37
>> 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>> message.
>> [Thu Oct 17 16:43:37 2019] rocminfoD0  3024   2947
>> 0x8000
>> [Thu Oct 17 16:43:37 2019] Call Trace:
>> [Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0 [Thu Oct 17
>> 16:43:37 2019]  schedule+0x32/0x70 [Thu Oct 17 16:43:37 2019]
>> schedule_preempt_disabled+0xa/0x10
>> [Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0 [Thu Oct
>> 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0 [Thu Oct 17 16:43:37
>> 2019]  ? process_termination_cpsch+0x24/0x2f0
>> [amdgpu]
>> [Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0
>> [amdgpu]
>> [Thu Oct 17 16:43:37 2019]
>> kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu] [Thu Oct 17
>> 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220
>> [amdgpu]
>> [Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0 [Thu Oct
>> 17 16:43:37 2019]  exit_mmap+0x160/0x1a0 [Thu Oct 17 16:43:37 2019]  ?
>> __handle_mm_fault+0xba3/0x1200 [Thu Oct 17 16:43:37 2019]  ?
>> exit_robust_list+0x5a/0x110 [Thu Oct 17 16:43:37 2019]
>> mmput+0x4a/0x120 [Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20 [Thu
>> Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200 [Thu Oct 17
>> 16:43:37 2019]  do_group_exit+0x3a/0xa0 [Thu Oct 17 16:43:37 2019]
>> __x64_sys_exit_group+0x14/0x20 [Thu Oct 17 16:43:37 2019]
>> do_syscall_64+0x4f/0x100 [Thu Oct 17 16:43:37 2019]
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> Suggested-by: Felix Kuehling 
>> Signed-off-by: Philip Yang 
>> ---
>>drivers/gpu/drm/amd/amdkfd/kfd_chardev.c|  6 +++---
>>drivers/gpu/drm/amd/amdkfd/kfd_device.c |  5 -
>>

RE: [PATCH] drm/amdgpu: define macros for retire page reservation

2019-10-22 Thread Deucher, Alexander
> -Original Message-
> From: amd-gfx  On Behalf Of
> Chen, Guchun
> Sent: Monday, October 21, 2019 11:43 PM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> ; Li, Dennis ;
> Grodzovsky, Andrey ; Zhou1, Tao
> 
> Cc: Chen, Guchun 
> Subject: [PATCH] drm/amdgpu: define macros for retire page reservation
> 
> Easier for maintenance.
> 
> Signed-off-by: Guchun Chen 

Reviewed-by: Alex Deucher 

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 17 +++--
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 2d9e13d2a71a..796326b36e00 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -68,6 +68,11 @@ const char *ras_block_string[] = {
>  /* inject address is 52 bits */
>  #define  RAS_UMC_INJECT_ADDR_LIMIT   (0x1ULL << 52)
> 
> +enum amdgpu_ras_retire_page_reservation {
> + AMDGPU_RAS_RETIRE_PAGE_RESERVED,
> + AMDGPU_RAS_RETIRE_PAGE_PENDING,
> + AMDGPU_RAS_RETIRE_PAGE_FAULT,
> +};
> 
>  atomic_t amdgpu_ras_in_intr = ATOMIC_INIT(0);
> 
> @@ -809,11 +814,11 @@ static int amdgpu_ras_badpages_read(struct
> amdgpu_device *adev,  static char
> *amdgpu_ras_badpage_flags_str(unsigned int flags)  {
>   switch (flags) {
> - case 0:
> + case AMDGPU_RAS_RETIRE_PAGE_RESERVED:
>   return "R";
> - case 1:
> + case AMDGPU_RAS_RETIRE_PAGE_PENDING:
>   return "P";
> - case 2:
> + case AMDGPU_RAS_RETIRE_PAGE_FAULT:
>   default:
>   return "F";
>   };
> @@ -1294,13 +1299,13 @@ static int amdgpu_ras_badpages_read(struct
> amdgpu_device *adev,
>   (*bps)[i] = (struct ras_badpage){
>   .bp = data->bps[i].retired_page,
>   .size = AMDGPU_GPU_PAGE_SIZE,
> - .flags = 0,
> + .flags = AMDGPU_RAS_RETIRE_PAGE_RESERVED,
>   };
> 
>   if (data->last_reserved <= i)
> - (*bps)[i].flags = 1;
> + (*bps)[i].flags =
> AMDGPU_RAS_RETIRE_PAGE_PENDING;
>   else if (data->bps_bo[i] == NULL)
> - (*bps)[i].flags = 2;
> + (*bps)[i].flags =
> AMDGPU_RAS_RETIRE_PAGE_FAULT;
>   }
> 
>   *count = data->count;
> --
> 2.17.1
> 
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH] drm/amdgpu/display: add dc feature mask for psr enablement

2019-10-22 Thread Li, Roman
> Any reason why we're skipping a flag here going from 0x2 to 0x8?

0x4 is reserved for the fractional PWM mask:
https://patchwork.freedesktop.org/patch/336923/
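
So the low bits line up as follows (sketch of the resulting enum; the
0x4 name is taken from that pending patch and may still change):

	enum DC_FEATURE_MASK {
		DC_FBC_MASK = 0x1,
		DC_MULTI_MON_PP_MCLK_SWITCH_MASK = 0x2,
		DC_DISABLE_FRACTIONAL_PWM_MASK = 0x4, /* pending */
		DC_PSR_MASK = 0x8,
	};

e.g. booting with amdgpu.dcfeaturemask=0xa (0x2 | 0x8) keeps the
multi-monitor mclk switching and turns PSR on.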

Thank you Nicholas for the review.

-Original Message-
From: Kazlauskas, Nicholas  
Sent: Tuesday, October 22, 2019 8:39 AM
To: Li, Roman ; amd-gfx@lists.freedesktop.org; Deucher, 
Alexander 
Cc: Wentland, Harry ; Lakha, Bhawanpreet 
; Li, Sun peng (Leo) 
Subject: Re: [PATCH] drm/amdgpu/display: add dc feature mask for psr enablement

On 2019-10-21 5:45 p.m., roman...@amd.com wrote:
> From: Roman Li 
> 
> [Why]
> Adding psr mask to dc features allows selectively disable/enable psr.
> Current psr implementation may not work with non-pageflipping application.
> Until resolved it should be disabled by default.
> 
> [How]
> Add dcfeaturemask for psr enablement. Disable by default.
> To enable set amdgpu.dcfeaturemask=0x8 in grub kernel command line.
> 
> Signed-off-by: Roman Li 
> ---
>   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 ++-
>   drivers/gpu/drm/amd/include/amd_shared.h  | 1 +
>   2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index 1cf4beb..0f08879 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -2424,7 +2424,8 @@ static int amdgpu_dm_initialize_drm_device(struct 
> amdgpu_device *adev)
>   } else if (dc_link_detect(link, DETECT_REASON_BOOT)) {
>   amdgpu_dm_update_connector_after_detect(aconnector);
>   register_backlight_device(dm, link);
> - amdgpu_dm_set_psr_caps(link);
> + if (amdgpu_dc_feature_mask & DC_PSR_MASK)
> + amdgpu_dm_set_psr_caps(link);
>   }
>   
>   
> diff --git a/drivers/gpu/drm/amd/include/amd_shared.h 
> b/drivers/gpu/drm/amd/include/amd_shared.h
> index 8889aac..1daa221 100644
> --- a/drivers/gpu/drm/amd/include/amd_shared.h
> +++ b/drivers/gpu/drm/amd/include/amd_shared.h
> @@ -143,6 +143,7 @@ enum PP_FEATURE_MASK {
>   enum DC_FEATURE_MASK {
>   DC_FBC_MASK = 0x1,
>   DC_MULTI_MON_PP_MCLK_SWITCH_MASK = 0x2,
> + DC_PSR_MASK = 0x8,

Can this just be 0x4 instead? Any reason why we're skipping a flag here going 
from 0x2 to 0x8?

You can still have my:

Reviewed-by: Nicholas Kazlauskas 

but my preference would be on fixing this up to a 0x4 first in the commit 
message / DC_FEATURE_MASK.

Nicholas Kazlauskas

>   };
>   
>   enum amd_dpm_forced_level;
> 

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH] drm/amdgpu: refine reboot debugfs operation in ras case (v3)

2019-10-22 Thread Deucher, Alexander
> -Original Message-
> From: amd-gfx  On Behalf Of
> Chen, Guchun
> Sent: Monday, October 21, 2019 10:29 PM
> To: amd-gfx@lists.freedesktop.org; Koenig, Christian
> ; Zhang, Hawking
> ; Li, Dennis ;
> Grodzovsky, Andrey ; Zhou1, Tao
> 
> Cc: Li, Candice ; Chen, Guchun
> 
> Subject: [PATCH] drm/amdgpu: refine reboot debugfs operation in ras case
> (v3)
> 
> The RAS reboot debugfs node gives the user an easy per-card control to
> avoid the GPU recovery hang problem and reboot the system directly after a
> RAS uncorrectable error happens. However, as a common entry it should be
> moved out of the ras_ctrl node and not require the user to name an IP
> block. So add one new auto_reboot node in the ras debugfs dir to achieve this.
> 
> v2: in commit message, add justification why ras reboot debugfs node is
> needed.
> v3: use debugfs_create_bool to create debugfs file for boolean value
> 
> Signed-off-by: Guchun Chen 

Reviewed-by: Alex Deucher 

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 19 ---
>  1 file changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 6220394521e4..2d9e13d2a71a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -153,8 +153,6 @@ static int
> amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
>   op = 1;
>   else if (sscanf(str, "inject %32s %8s", block_name, err) == 2)
>   op = 2;
> - else if (sscanf(str, "reboot %32s", block_name) == 1)
> - op = 3;
>   else if (str[0] && str[1] && str[2] && str[3])
>   /* ascii string, but commands are not matched. */
>   return -EINVAL;
> @@ -218,12 +216,11 @@ static struct ras_manager
> *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>   * value to the address.
>   *
>   * Second member: struct ras_debug_if::op.
> - * It has four kinds of operations.
> + * It has three kinds of operations.
>   *
>   * - 0: disable RAS on the block. Take ::head as its data.
>   * - 1: enable RAS on the block. Take ::head as its data.
>   * - 2: inject errors on the block. Take ::inject as its data.
> - * - 3: reboot on unrecoverable error
>   *
>   * How to use the interface?
>   * programs:
> @@ -305,9 +302,6 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct
> file *f, const char __user *
>   /* data.inject.address is offset instead of absolute gpu
> address */
> ret = amdgpu_ras_error_inject(adev, &data.inject);
>   break;
> - case 3:
> - amdgpu_ras_get_context(adev)->reboot = true;
> - break;
>   default:
>   ret = -EINVAL;
>   break;
> @@ -1037,6 +1031,17 @@ static void
> amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev)
> adev, &amdgpu_ras_debugfs_ctrl_ops);
>   debugfs_create_file("ras_eeprom_reset", S_IWUGO | S_IRUGO,
> con->dir,
> adev, &amdgpu_ras_debugfs_eeprom_ops);
> +
> + /*
> +  * After one uncorrectable error happens, usually GPU recovery will
> +  * be scheduled. But due to the known problem in GPU recovery
> failing
> +  * to bring GPU back, below interface provides one direct way to
> +  * user to reboot system automatically in such case within
> +  * ERREVENT_ATHUB_INTERRUPT generated. Normal GPU recovery
> routine
> +  * will never be called.
> +  */
> + debugfs_create_bool("auto_reboot", S_IWUGO | S_IRUGO, con->dir,
> + &con->reboot);
>  }
> 
>  void amdgpu_ras_debugfs_create(struct amdgpu_device *adev,
> --
> 2.17.1
> 
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amdgpu/display: add dc feature mask for psr enablement

2019-10-22 Thread Kazlauskas, Nicholas
On 2019-10-21 5:45 p.m., roman...@amd.com wrote:
> From: Roman Li 
> 
> [Why]
> Adding psr mask to dc features allows selectively disable/enable psr.
> Current psr implementation may not work with non-pageflipping application.
> Until resolved it should be disabled by default.
> 
> [How]
> Add dcfeaturemask for psr enablement. Disable by default.
> To enable set amdgpu.dcfeaturemask=0x8 in grub kernel command line.
> 
> Signed-off-by: Roman Li 
> ---
>   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 ++-
>   drivers/gpu/drm/amd/include/amd_shared.h  | 1 +
>   2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index 1cf4beb..0f08879 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -2424,7 +2424,8 @@ static int amdgpu_dm_initialize_drm_device(struct 
> amdgpu_device *adev)
>   } else if (dc_link_detect(link, DETECT_REASON_BOOT)) {
>   amdgpu_dm_update_connector_after_detect(aconnector);
>   register_backlight_device(dm, link);
> - amdgpu_dm_set_psr_caps(link);
> + if (amdgpu_dc_feature_mask & DC_PSR_MASK)
> + amdgpu_dm_set_psr_caps(link);
>   }
>   
>   
> diff --git a/drivers/gpu/drm/amd/include/amd_shared.h 
> b/drivers/gpu/drm/amd/include/amd_shared.h
> index 8889aac..1daa221 100644
> --- a/drivers/gpu/drm/amd/include/amd_shared.h
> +++ b/drivers/gpu/drm/amd/include/amd_shared.h
> @@ -143,6 +143,7 @@ enum PP_FEATURE_MASK {
>   enum DC_FEATURE_MASK {
>   DC_FBC_MASK = 0x1,
>   DC_MULTI_MON_PP_MCLK_SWITCH_MASK = 0x2,
> + DC_PSR_MASK = 0x8,

Can this just be 0x4 instead? Any reason why we're skipping a flag here 
going from 0x2 to 0x8?

You can still have my:

Reviewed-by: Nicholas Kazlauskas 

but my preference would be on fixing this up to a 0x4 first in the 
commit message / DC_FEATURE_MASK.

Nicholas Kazlauskas

>   };
>   
>   enum amd_dpm_forced_level;
> 

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amd/display: Change Navi14's DWB flag to 1

2019-10-22 Thread Kazlauskas, Nicholas
On 2019-10-21 3:36 p.m., Liu, Zhan wrote:
> [PATCH] drm/amd/display: Change Navi14's DWB flag to 1
> 
> [Why]
> DWB (Display Writeback) flag needs to be enabled as 1, or system
> will throw out a few warnings when creating dcn20 resource pool.
> Also, Navi14's dwb setting needs to match Navi10's,
> which has already been set to 1.
> 
> [How]
> Change value of num_dwb from 0 to 1.
> 
> Signed-off-by: Zhan Liu 

Reviewed-by: Nicholas Kazlauskas 

> ---
>   drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c 
> b/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c
> index 914e378bcda4..d4937c475e7c 100644
> --- a/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c
> +++ b/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c
> @@ -816,7 +816,7 @@ static const struct resource_caps res_cap_nv14 = {
>  .num_audio = 6,
>  .num_stream_encoder = 5,
>  .num_pll = 5,
> -   .num_dwb = 0,
> +   .num_dwb = 1,
>  .num_ddc = 5,
>   };
> 
> --
> 2.21.0
> 
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> 

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: Spontaneous reboots when using RX 560

2019-10-22 Thread Sylvain Munaut
Hi All,

More testing over the last few days showed that only the lowest power
mode, or one slightly above it, works. Oh, I also tested 5.4-rc3 just
in case, but same results.
It doesn't seem to be affected by PCIe lane speed. Memory seems
stable at 625M and almost at 1500M (only a sustained heavy workload
eventually brings it down), but the SoC speed seems pretty touchy.

So that would seem to confirm something is wrong either in the power
play table itself, or in its interpretation by the Linux driver.
I tried brute-loading some other RX570 pptable into it, but that
didn't really do much. After writing it to pp_table, the card was
stuck at its lower clock mode. Working fine, but same as if I had
forced it to low power.

Is there any way to extract the power play table from Windows, since
it's running fine there?
I'm kind of running out of ideas of what to try next.
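
(One data point that may help: the table the driver is currently using
can be dumped on the Linux side with e.g.

	cat /sys/class/drm/card0/device/pp_table > pp_table.bin

for a byte-by-byte comparison against whatever table can be pulled out
of the Windows driver package.)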

Cheers,

Sylvain
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amdgpu: define macros for retire page reservation

2019-10-22 Thread Christian König

Am 22.10.19 um 05:43 schrieb Chen, Guchun:

Easier for maintenance.

Signed-off-by: Guchun Chen 


Acked-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 17 +++--
  1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 2d9e13d2a71a..796326b36e00 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -68,6 +68,11 @@ const char *ras_block_string[] = {
  /* inject address is 52 bits */
  #define   RAS_UMC_INJECT_ADDR_LIMIT   (0x1ULL << 52)
  
+enum amdgpu_ras_retire_page_reservation {

+   AMDGPU_RAS_RETIRE_PAGE_RESERVED,
+   AMDGPU_RAS_RETIRE_PAGE_PENDING,
+   AMDGPU_RAS_RETIRE_PAGE_FAULT,
+};
  
  atomic_t amdgpu_ras_in_intr = ATOMIC_INIT(0);
  
@@ -809,11 +814,11 @@ static int amdgpu_ras_badpages_read(struct amdgpu_device *adev,

  static char *amdgpu_ras_badpage_flags_str(unsigned int flags)
  {
switch (flags) {
-   case 0:
+   case AMDGPU_RAS_RETIRE_PAGE_RESERVED:
return "R";
-   case 1:
+   case AMDGPU_RAS_RETIRE_PAGE_PENDING:
return "P";
-   case 2:
+   case AMDGPU_RAS_RETIRE_PAGE_FAULT:
default:
return "F";
};
@@ -1294,13 +1299,13 @@ static int amdgpu_ras_badpages_read(struct 
amdgpu_device *adev,
(*bps)[i] = (struct ras_badpage){
.bp = data->bps[i].retired_page,
.size = AMDGPU_GPU_PAGE_SIZE,
-   .flags = 0,
+   .flags = AMDGPU_RAS_RETIRE_PAGE_RESERVED,
};
  
  		if (data->last_reserved <= i)

-   (*bps)[i].flags = 1;
+   (*bps)[i].flags = AMDGPU_RAS_RETIRE_PAGE_PENDING;
else if (data->bps_bo[i] == NULL)
-   (*bps)[i].flags = 2;
+   (*bps)[i].flags = AMDGPU_RAS_RETIRE_PAGE_FAULT;
}
  
  	*count = data->count;


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amdgpu: refine reboot debugfs operation in ras case (v3)

2019-10-22 Thread Koenig, Christian
Am 22.10.19 um 04:28 schrieb Chen, Guchun:
> The RAS reboot debugfs node gives the user an easy per-card control
> to avoid the GPU recovery hang problem and reboot the system
> directly after a RAS uncorrectable error happens. However, as a
> common entry it should be moved out of the ras_ctrl node and not
> require the user to name an IP block. So add one new auto_reboot
> node in the ras debugfs dir to achieve this.
>
> v2: in commit message, add justification why ras reboot debugfs
> node is needed.
> v3: use debugfs_create_bool to create debugfs file for boolean value
>
> Signed-off-by: Guchun Chen 

Nice cleanup, patch is Reviewed-by: Christian König 
.

Thanks,
Christian.
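
For reference, once merged the knob should be reachable per card under
the ras debugfs dir, e.g. (path assumed from where con->dir is created):

	echo 1 > /sys/kernel/debug/dri/0/ras/auto_reboot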

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 19 ---
>   1 file changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 6220394521e4..2d9e13d2a71a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -153,8 +153,6 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file 
> *f,
>   op = 1;
>   else if (sscanf(str, "inject %32s %8s", block_name, err) == 2)
>   op = 2;
> - else if (sscanf(str, "reboot %32s", block_name) == 1)
> - op = 3;
>   else if (str[0] && str[1] && str[2] && str[3])
>   /* ascii string, but commands are not matched. */
>   return -EINVAL;
> @@ -218,12 +216,11 @@ static struct ras_manager *amdgpu_ras_find_obj(struct 
> amdgpu_device *adev,
>* value to the address.
>*
>* Second member: struct ras_debug_if::op.
> - * It has four kinds of operations.
> + * It has three kinds of operations.
>*
>* - 0: disable RAS on the block. Take ::head as its data.
>* - 1: enable RAS on the block. Take ::head as its data.
>* - 2: inject errors on the block. Take ::inject as its data.
> - * - 3: reboot on unrecoverable error
>*
>* How to use the interface?
>* programs:
> @@ -305,9 +302,6 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file 
> *f, const char __user *
>   /* data.inject.address is offset instead of absolute gpu 
> address */
> ret = amdgpu_ras_error_inject(adev, &data.inject);
>   break;
> - case 3:
> - amdgpu_ras_get_context(adev)->reboot = true;
> - break;
>   default:
>   ret = -EINVAL;
>   break;
> @@ -1037,6 +1031,17 @@ static void amdgpu_ras_debugfs_create_ctrl_node(struct 
> amdgpu_device *adev)
> adev, &amdgpu_ras_debugfs_ctrl_ops);
>   debugfs_create_file("ras_eeprom_reset", S_IWUGO | S_IRUGO, con->dir,
> adev, &amdgpu_ras_debugfs_eeprom_ops);
> +
> + /*
> +  * After one uncorrectable error happens, usually GPU recovery will
> +  * be scheduled. But due to the known problem in GPU recovery failing
> +  * to bring GPU back, below interface provides one direct way to
> +  * user to reboot system automatically in such case within
> +  * ERREVENT_ATHUB_INTERRUPT generated. Normal GPU recovery routine
> +  * will never be called.
> +  */
> + debugfs_create_bool("auto_reboot", S_IWUGO | S_IRUGO, con->dir,
> + &con->reboot);
>   }
>   
>   void amdgpu_ras_debugfs_create(struct amdgpu_device *adev,

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH hmm 00/15] Consolidate the mmu notifier interval_tree and locking

2019-10-22 Thread Daniel Vetter
On Mon, Oct 21, 2019 at 03:12:26PM +, Jason Gunthorpe wrote:
> On Mon, Oct 21, 2019 at 02:28:46PM +, Koenig, Christian wrote:
> > Am 21.10.19 um 15:57 schrieb Jason Gunthorpe:
> > > On Sun, Oct 20, 2019 at 02:21:42PM +, Koenig, Christian wrote:
> > >> Am 18.10.19 um 22:36 schrieb Jason Gunthorpe:
> > >>> On Thu, Oct 17, 2019 at 04:47:20PM +, Koenig, Christian wrote:
> > >>> [SNIP]
> > >>>
> >  So again how are they serialized?
> > >>> The 'driver lock' thing does it, read the hmm documentation, the hmm
> > >>> approach is basically the only approach that was correct of all the
> > >>> drivers..
> > >> Well that's what I've did, but what HMM does still doesn't looks correct
> > >> to me.
> > > It has a bug, but the basic flow seems to work.
> > >
> > > https://patchwork.kernel.org/patch/11191
> > 
> > Maybe wrong link? That link looks like an unrelated discussion on kernel 
> > image relocation.
> 
> Sorry, it got corrupted:
> 
> https://patchwork.kernel.org/patch/11191397/
> 
> > >>> So long as the 'driver lock' is held the range cannot become
> > >>> invalidated as the 'driver lock' prevents progress of invalidation.
> > >> Correct, but the problem is it doesn't wait for ongoing operations to
> > >> complete.
> > >>
> > >> See I'm talking about the following case:
> > >>
> > >> Thread A    Thread B
> > >> invalidate_range_start()
> > >>   mmu_range_read_begin()
> > >>   get_user_pages()/hmm_range_fault()
> > >>   grab_driver_lock()
> > >> Updating the ptes
> > >> invalidate_range_end()
> > >>
> > >> As far as I can see in invalidate_range_start() the driver lock is taken
> > >> to make sure that we can't start any invalidation while the driver is
> > >> using the pages for a command submission.
> > > Again, this uses the seqlock like scheme *and* the driver lock.
> > >
> > > In this case after grab_driver_lock() mmu_range_read_retry() will
> > > return false if Thread A has progressed to 'updating the ptes.
> > >
> > > For instance here is how the concurrency resolves for retry:
> > >
> > > CPU1CPU2
> > >seq = mmu_range_read_begin()
> > > invalidate_range_start()
> > >invalidate_seq++
> > 
> > How that was order was what confusing me. But I've read up on the code 
> > in mmu_range_read_begin() and found the lines I was looking for:
> > 
> > +    if (is_invalidating)
> > +        wait_event(mmn_mm->wq,
> > +               READ_ONCE(mmn_mm->invalidate_seq) != seq);
> > 
> > [SNIP]
> 
> Right, the basic design is that the 'seq' returned by
> mmu_range_read_begin() is never currently under invalidation.
> 
> Thus if the starting seq is not invalidating, then if it doesn't
> change we are guaranteed the ptes haven't changed either.
> 
> > > For the above I've simplified the mechanics of the invalidate_seq, you
> > > need to look through the patch to see how it actually works.
> > 
> > Yea, that you also allow multiple write sides is pretty neat.
> 
> Complicated, but necessary to make the non-blocking OOM stuff able to
> read the interval tree under all conditions :\
> 
> > > One of the motivations for this work is to squash that bug by adding a
> > > seqlock like pattern. But the basic hmm flow and collision-retry
> > > approach seems sound.
> > >
> > > Do you see a problem with this patch?
> > 
> > No, not any more.
> 
> Okay, great, thanks
>  
> > Essentially you are doing the same thing I've tried to do with the 
> > original amdgpu implementation. The difference is that you don't try to 
> > use a per range sequence (which is a good idea, we never got that fully 
> > working) and you allow multiple writers at the same time.
> 
> Yes, ODP had the per-range sequence and it never worked right
> either. Keeping track of things during the invalidate_end was too complex
>  
> > Feel free to stitch an Acked-by: Christian König 
> >  on patch #2,
> 
> Thanks! Can you also take some review and test for the AMD related
> patches? These were quite hard to make, it is very likely I've made an
> error..
> 
> > but you still doing a bunch of things in there which are way beyond
> > my understanding (e.g. where are all the SMP barriers?).
> 
> The only non-locked data is 'struct mmu_range_notifier->invalidate_seq'
> 
> On the write side it uses a WRITE_ONCE. The 'seq' it writes is
> generated under the mmn_mm->lock spinlock (held before and after the
> WRITE_ONCE) such that all concurrent WRITE_ONCE's are storing the same
> value. 
> 
> Essentially the spinlock is providing the barrier to order write:
> 
> invalidate_range_start():
>  spin_lock(&mmn_mm->lock);
>  mmn_mm->active_invalidate_ranges++;
>  mmn_mm->invalidate_seq |= 1;
>  *seq = mmn_mm->invalidate_seq;
>  spin_unlock(&mmn_mm->lock);
> 
>  WRITE_ONCE(mrn->invalidate_seq, cur_seq);
> 
> invalidate_range_end()
>  spin_lock(&mmn_mm->lock);
>  if (--mmn_mm->active_invalidate_ranges)
> mmn_mm->invalidate_seq++

Re: [PATCH hmm 00/15] Consolidate the mmu notifier interval_tree and locking

2019-10-22 Thread Jason Gunthorpe
On Mon, Oct 21, 2019 at 02:40:41PM -0400, Jerome Glisse wrote:
> On Tue, Oct 15, 2019 at 03:12:27PM -0300, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe 
> > 
> > 8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, hfi1,
> > scif_dma, vhost, gntdev, hmm) drivers are using a common pattern where
> > they only use invalidate_range_start/end and immediately check the
> > invalidating range against some driver data structure to tell if the
> > driver is interested. Half of them use an interval_tree, the others are
> > simple linear search lists.
> > 
> > Of the ones I checked they largely seem to have various kinds of races,
> > bugs and poor implementation. This is a result of the complexity in how
> > the notifier interacts with get_user_pages(). It is extremely difficult to
> > use it correctly.
> > 
> > Consolidate all of this code together into the core mmu_notifier and
> > provide a locking scheme similar to hmm_mirror that allows the user to
> > safely use get_user_pages() and reliably know if the page list still
> > matches the mm.
> > 
> > This new arrangement plays nicely with the !blockable mode for
> > OOM. Scanning the interval tree is done such that the intersection test
> > will always succeed, and since there is no invalidate_range_end exposed to
> > drivers the scheme safely allows multiple drivers to be subscribed.
> > 
> > Four places are converted as an example of how the new API is used.
> > Four are left for future patches:
> >  - i915_gem has complex locking around destruction of a registration,
> >needs more study
> >  - hfi1 (2nd user) needs access to the rbtree
> >  - scif_dma has a complicated logic flow
> >  - vhost's mmu notifiers are already being rewritten
> > 
> > This is still being tested, but I figured to send it to start getting help
> > from the xen, amd and hfi drivers which I cannot test here.
> 
> It might be a good opportunity to also switch those users to
> hmm_range_fault() instead of GUP as GUP is pointless for those
> users. In fact the GUP is an impediment to normal mm operations.

I think vhost can use hmm_range_fault

hfi1 does actually need to have the page pin, it doesn't fence DMA
during invalidate.

i915_gem feels alot like amdgpu, so probably it would benefit

No idea about scif_dma

> I will test on nouveau.

Thanks, hopefully it still works, I think Ralph was able to do some
basic checks. But it is a pretty complicated series, I probably made
some mistakes.

FWIW, I know that nouveau gets a lockdep splat now from Daniel
Vetter's recent changes; it tries to do GFP_KERNEL allocations under
a lock also held by the invalidate_range_start path.

Thanks for looking at it!

Regards,
Jason 
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

[PATCH] drm/amdgpu: Fix memory leak in amdgpu_fence_emit

2019-10-22 Thread Navid Emamdoost
In the implementation of amdgpu_fence_emit(), if dma_fence_wait() fails
and returns an errno, the memory allocated for the fence should be
released before returning.

Fixes: 3d2aca8c8620 ("drm/amdgpu: fix old fence check in amdgpu_fence_emit")
Signed-off-by: Navid Emamdoost 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 23085b352cf2..2f59c9270a7e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -166,8 +166,10 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct 
dma_fence **f,
if (old) {
r = dma_fence_wait(old, false);
dma_fence_put(old);
-   if (r)
+   if (r) {
+   dma_fence_put(fence);
return r;
+   }
}
}
 
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH hmm 00/15] Consolidate the mmu notifier interval_tree and locking

2019-10-22 Thread Jason Gunthorpe
On Mon, Oct 21, 2019 at 02:28:46PM +, Koenig, Christian wrote:
> Am 21.10.19 um 15:57 schrieb Jason Gunthorpe:
> > On Sun, Oct 20, 2019 at 02:21:42PM +, Koenig, Christian wrote:
> >> Am 18.10.19 um 22:36 schrieb Jason Gunthorpe:
> >>> On Thu, Oct 17, 2019 at 04:47:20PM +, Koenig, Christian wrote:
> >>> [SNIP]
> >>>
>  So again how are they serialized?
> >>> The 'driver lock' thing does it, read the hmm documentation, the hmm
> >>> approach is basically the only approach that was correct of all the
> >>> drivers..
> >> Well that's what I've did, but what HMM does still doesn't looks correct
> >> to me.
> > It has a bug, but the basic flow seems to work.
> >
> > https://patchwork.kernel.org/patch/11191
> 
> Maybe wrong link? That link looks like an unrelated discussion on kernel 
> image relocation.

Sorry, it got corrupted:

https://patchwork.kernel.org/patch/11191397/

> >>> So long as the 'driver lock' is held the range cannot become
> >>> invalidated as the 'driver lock' prevents progress of invalidation.
> >> Correct, but the problem is it doesn't wait for ongoing operations to
> >> complete.
> >>
> >> See I'm talking about the following case:
> >>
> >> Thread A    Thread B
> >> invalidate_range_start()
> >>   mmu_range_read_begin()
> >>   get_user_pages()/hmm_range_fault()
> >>   grab_driver_lock()
> >> Updating the ptes
> >> invalidate_range_end()
> >>
> >> As far as I can see in invalidate_range_start() the driver lock is taken
> >> to make sure that we can't start any invalidation while the driver is
> >> using the pages for a command submission.
> > Again, this uses the seqlock like scheme *and* the driver lock.
> >
> > In this case after grab_driver_lock() mmu_range_read_retry() will
> > return false if Thread A has progressed to 'updating the ptes.
> >
> > For instance here is how the concurrency resolves for retry:
> >
> > CPU1CPU2
> >seq = mmu_range_read_begin()
> > invalidate_range_start()
> >invalidate_seq++
> 
> How that was order was what confusing me. But I've read up on the code 
> in mmu_range_read_begin() and found the lines I was looking for:
> 
> +    if (is_invalidating)
> +        wait_event(mmn_mm->wq,
> +               READ_ONCE(mmn_mm->invalidate_seq) != seq);
> 
> [SNIP]

Right, the basic design is that the 'seq' returned by
mmu_range_read_begin() is never currently under invalidation.

Thus if the starting seq is not invalidating, then if it doesn't
change we are guaranteed the ptes haven't changed either.
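
For anyone following along, the driver-side pattern being discussed is
roughly this (sketch; driver_lock stands for whatever lock the driver
also holds in its ops->invalidate() callback):

	unsigned long seq;

again:
	seq = mmu_range_read_begin(&mrn);
	/* fault/pin the pages outside any lock,
	 * e.g. via hmm_range_fault() or get_user_pages()
	 */
	mutex_lock(&driver_lock);
	if (mmu_range_read_retry(&mrn, seq)) {
		mutex_unlock(&driver_lock);
		goto again;	/* an invalidation raced us */
	}
	/* seq unchanged: the page list still matches the mm */
	mutex_unlock(&driver_lock);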

> > For the above I've simplified the mechanics of the invalidate_seq, you
> > need to look through the patch to see how it actually works.
> 
> Yea, that you also allow multiple write sides is pretty neat.

Complicated, but necessary to make the non-blocking OOM stuff able to
read the interval tree under all conditions :\

> > One of the motivations for this work is to squash that bug by adding a
> > seqlock like pattern. But the basic hmm flow and collision-retry
> > approach seems sound.
> >
> > Do you see a problem with this patch?
> 
> No, not any more.

Okay, great, thanks
 
> Essentially you are doing the same thing I've tried to do with the 
> original amdgpu implementation. The difference is that you don't try to 
> use a per range sequence (which is a good idea, we never got that fully 
> working) and you allow multiple writers at the same time.

Yes, ODP had the per-range sequence and it never worked right
either. Keeping track of things during the invalidate_end was too complex
 
> Feel free to stitch an Acked-by: Christian König 
>  on patch #2,

Thanks! Can you also take some review and test for the AMD related
patches? These were quite hard to make, it is very likely I've made an
error..

> but you still doing a bunch of things in there which are way beyond
> my understanding (e.g. where are all the SMP barriers?).

The only non-locked data is 'struct mmu_range_notifier->invalidate_seq'

On the write side it uses a WRITE_ONCE. The 'seq' it writes is
generated under the mmn_mm->lock spinlock (held before and after the
WRITE_ONCE) such that all concurrent WRITE_ONCE's are storing the same
value. 

Essentially the spinlock is providing the barrier to order write:

invalidate_range_start():
 spin_lock(&mmn_mm->lock);
 mmn_mm->active_invalidate_ranges++;
 mmn_mm->invalidate_seq |= 1;
 *seq = mmn_mm->invalidate_seq;
 spin_unlock(&mmn_mm->lock);

 WRITE_ONCE(mrn->invalidate_seq, cur_seq);

invalidate_range_end()
 spin_lock(&mmn_mm->lock);
 if (--mmn_mm->active_invalidate_ranges)
mmn_mm->invalidate_seq++
 spin_unlock(&mmn_mm->lock);

ie when we do invalidate_seq++, due to the active_invalidate_ranges
counter and the spinlock, we know all other WRITE_ONCE's have passed
their spin_unlock and no concurrent ones are starting. The spinlock is
providing the barrier here.

On the read side.. It is a similar argument, except with 

Re: [PATCH hmm 15/15] mm/hmm: remove hmm_mirror and related

2019-10-22 Thread Jason Gunthorpe
On Mon, Oct 21, 2019 at 02:38:24PM -0400, Jerome Glisse wrote:
> On Tue, Oct 15, 2019 at 03:12:42PM -0300, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe 
> > 
> > The only two users of this are now converted to use mmu_range_notifier,
> > delete all the code and update hmm.rst.
> 
> I guess I should point out that the reasons for hmm_mirror and hmm
> were:
> 1) Maybe define a common API for userspace to provide memory
>placement hints (NUMA for GPU)

Do you think this needs special code in the notifiers?

> 2) multi-devices sharing same mirror page table

Oh neat, but I think this just means the GPU driver has to register a
single notifier for multiple GPUs??

> But support for multi-GPU in nouveau is way behind and i guess such
> optimization will have to re-materialize what is necessary once that
> happens.

Sure, it will be easier to understand what is needed with a bit of
code!

> Note this patch should also update kernel/fork.c and the mm_struct
> definition AFAICT. With those changes you can add my:

Can you please elaborate on what updates you mean? I'm not sure.

Maybe I already got the things you are thinking of with the get/put
changes?

Thanks,
Jason
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH hmm 02/15] mm/mmu_notifier: add an interval tree notifier

2019-10-22 Thread Jason Gunthorpe
On Mon, Oct 21, 2019 at 02:30:56PM -0400, Jerome Glisse wrote:

> > +/**
> > + * mmu_range_read_retry - End a read side critical section against a VA 
> > range
> > + * mrn: The range under lock
> > + * seq: The return of the paired mmu_range_read_begin()
> > + *
> > + * This MUST be called under a user provided lock that is also held
> > + * unconditionally by op->invalidate(). That lock provides the required SMP
> > + * barrier for handling invalidate_seq.
> > + *
> > + * Each call should be paired with a single mmu_range_read_begin() and
> > + * should be used to conclude the read side.
> > + *
> > + * Returns true if an invalidation collided with this critical section, and
> > + * the caller should retry.
> > + */
> > +static inline bool mmu_range_read_retry(struct mmu_range_notifier *mrn,
> > +   unsigned long seq)
> > +{
> > +   return READ_ONCE(mrn->invalidate_seq) != seq;
> > +}
> 
> What about calling this mmu_range_read_end() instead ? To match
> with the mmu_range_read_begin().

_end makes some sense too, but I picked _retry for symmetry with the
seqcount_* family of functions, which use retry.

I think retry makes it clearer that it is expected to fail and retry
is required.
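
For comparison, the seqcount pattern this mirrors:

	do {
		seq = read_seqcount_begin(&sc);
		/* speculative read side */
	} while (read_seqcount_retry(&sc, seq));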

> > +   /*
> > +* The inv_end incorporates a deferred mechanism like rtnl. Adds and
> 
> The rtnl reference is lost on people unfamiliar with the networking
> code :) Maybe cite rtnl_lock()/rtnl_unlock() so people have a chance
> to grep the right function. Assuming I am myself getting the right
> reference :)

Yep, you got it, I will update

> > +   /*
> > +* mrn->invalidate_seq is always set to an odd value. This ensures
> > +* that if seq does wrap we will always clear the below sleep in some
> > +* reasonable time as mmn_mm->invalidate_seq is even in the idle
> > +* state.
> 
> I think this comment should be with the struct mmu_range_notifier
> definition and you should just point to it from here as the same
> comment would be useful down below.

I had it here because it is critical to understanding the wait_event
and why it doesn't just block indefinitely, but yes, this property
comes up below too and refers back here.

Fundamentally, this wait_event is why the approach of keeping an odd
value in the mrn is used.
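
ie, compressed, the write side keeps this invariant (sketch, same shape
as the pseudocode earlier in the thread):

	/* invalidation start: invalidate_seq is made odd */
	mmn_mm->invalidate_seq |= 1;

	/* last invalidation end: ++ makes it even again, wake sleepers */
	mmn_mm->invalidate_seq++;
	wake_up_all(&mmn_mm->wq);

So a reader sleeping on 'invalidate_seq != seq' (seq being odd) is
always woken in reasonable time, because the idle value is always even
no matter how the counter wraps.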

> > -int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
> > +static int mn_itree_invalidate(struct mmu_notifier_mm *mmn_mm,
> > +			       const struct mmu_notifier_range *range)
> > +{
> > +   struct mmu_range_notifier *mrn;
> > +   unsigned long cur_seq;
> > +
> > +   for (mrn = mn_itree_inv_start_range(mmn_mm, range, &cur_seq); mrn;
> > +        mrn = mn_itree_inv_next(mrn, range)) {
> > +   bool ret;
> > +
> > +   WRITE_ONCE(mrn->invalidate_seq, cur_seq);
> > +   ret = mrn->ops->invalidate(mrn, range);
> > +   if (!ret && !WARN_ON(mmu_notifier_range_blockable(range)))
> 
> Isn't the logic wrong here? We want to warn if the range
> was marked as blockable and invalidate returned false. Also
> we want to back off no matter what if the invalidate returns
> false, ie:

If invalidate returned false and the caller is blockable then we do
not want to return, we must continue processing other ranges - to try
to cope with the defective driver.

Callers in blocking mode ignore the return value and go ahead to
invalidate..

Would it be clearer as 

if (!ret) {
	if (WARN_ON(mmu_notifier_range_blockable(range)))
		continue;
	goto out_would_block;
}

?

> > @@ -284,21 +589,22 @@ int __mmu_notifier_register(struct mmu_notifier *mn, 
> > struct mm_struct *mm)
> >  * the write side of the mmap_sem.
> >  */
> > mmu_notifier_mm =
> > -   kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL);
> > +   kzalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL);
> > if (!mmu_notifier_mm)
> > return -ENOMEM;
> >  
> > INIT_HLIST_HEAD(&mmu_notifier_mm->list);
> > spin_lock_init(&mmu_notifier_mm->lock);
> > +   mmu_notifier_mm->invalidate_seq = 2;
> 
> Why starting at 2 ?

Good question. If everything is coded properly the starting value
doesn't matter.

I left it like this because it makes debugging a tiny bit simpler, ie
if you print the seq number then the first mmu_range_notifiers will
get 1 as their initial seq (see __mmu_range_notifier_insert) instead of
ULONG_MAX.
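
ie roughly what the insert path does when no invalidation is running
(sketch):

	/*
	 * The starting seq for a new mrn must be odd and must not equal
	 * the current invalidate_seq; since invalidate_seq idles at an
	 * even value, subtracting one gives both properties.
	 */
	mrn->invalidate_seq = mmn_mm->invalidate_seq - 1;

With invalidate_seq starting at 2 the first mrns get 1; if it started
at 0 they would get ULONG_MAX, which is much noisier in debug prints.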

> > +   mmu_notifier_mm->itree = RB_ROOT_CACHED;
> > +   init_waitqueue_head(&mmu_notifier_mm->wq);
> > +   INIT_HLIST_HEAD(&mmu_notifier_mm->deferred_list);
> > }
> >  
> > ret = mm_take_all_locks(mm);
> > if (unlikely(ret))
> > goto out_clean;
> >  
> > -   /* Pairs with the mmdrop in mmu_notifier_unregister_* */
> > -   mmgrab(mm);
> > -
> > /*
> >  * Serialize the update against mmu_notifier_unregister. A
> >  * side note: mmu_notifier_release can't run concurrently with

Re: [PATCH hmm 00/15] Consolidate the mmu notifier interval_tree and locking

2019-10-22 Thread Dennis Dalessandro

On 10/15/2019 2:12 PM, Jason Gunthorpe wrote:

> This is still being tested, but I figured to send it to start getting help
> from the xen, amd and hfi drivers which I cannot test here.


Sorry for the delay, I never saw this. I was not on the Cc list and it
didn't register with me that it impacted hfi. I'll take a look and run
it through some hfi1 tests.


-Denny

Re: [PATCH hmm 00/15] Consolidate the mmu notifier interval_tree and locking

2019-10-22 Thread Jason Gunthorpe
On Sun, Oct 20, 2019 at 02:21:42PM +, Koenig, Christian wrote:
> On 18.10.19 22:36, Jason Gunthorpe wrote:
> > On Thu, Oct 17, 2019 at 04:47:20PM +, Koenig, Christian wrote:
> >
> >>> get_user_pages/hmm_range_fault() and invalidate_range_start() both are
> >>> called while holding mm->mmap_sem, so they are always serialized.
> >> Not even remotely.
> >>
> >> For calling get_user_pages()/hmm_range_fault() you only need to hold the
> >> mmap_sem in read mode.
> > Right
> >   
> >> And IIRC invalidate_range_start() is sometimes called without holding
> >> the mmap_sem at all.
> > Yep
> >   
> >> So again how are they serialized?
> > The 'driver lock' thing does it, read the hmm documentation; the hmm
> > approach is basically the only one among all the drivers that was
> > correct..
> 
> Well that's what I did, but what HMM does still doesn't look correct
> to me.

It has a bug, but the basic flow seems to work.

https://patchwork.kernel.org/patch/11191

> > So long as the 'driver lock' is held the range cannot become
> > invalidated as the 'driver lock' prevents progress of invalidation.
> 
> Correct, but the problem is it doesn't wait for ongoing operations to 
> complete.
>
> See I'm talking about the following case:
> 
> Thread A                      Thread B
> invalidate_range_start()
>                               mmu_range_read_begin()
>                               get_user_pages()/hmm_range_fault()
>                               grab_driver_lock()
> Updating the ptes
> invalidate_range_end()
> 
> As far as I can see in invalidate_range_start() the driver lock is taken 
> to make sure that we can't start any invalidation while the driver is 
> using the pages for a command submission.

Again, this uses the seqlock-like scheme *and* the driver lock.

In this case, after grab_driver_lock() mmu_range_read_retry() will
return true if Thread A has progressed to 'Updating the ptes'.

For instance here is how the concurrency resolves for retry:

     CPU1                                 CPU2
                                  seq = mmu_range_read_begin()
invalidate_range_start()
  invalidate_seq++
                                  get_user_pages()/hmm_range_fault()
  grab_driver_lock()
  ungrab_driver_lock()
                                  grab_driver_lock()
                                  seq != invalidate_seq, so retry

  Updating the ptes

invalidate_range_end()
  invalidate_seq++


And here is how it resolves for blocking:

     CPU1                                 CPU2
                                  seq = mmu_range_read_begin()
invalidate_range_start()
                                  get_user_pages()/hmm_range_fault()
                                  grab_driver_lock()
                                  seq == invalidate_seq, so continue
  invalidate_seq++
                                  ungrab_driver_lock()
  grab_driver_lock()
  // Cannot progress to 'Updating the ptes' while the driver_lock is held
  ungrab_driver_lock()

  Updating the ptes

invalidate_range_end()
  invalidate_seq++

For the above I've simplified the mechanics of the invalidate_seq;
you need to look through the patch to see how it actually works.

> Well, we don't update the seqlock after the update to the protected
> data structure (the page table) has happened, but rather before it.

??? This is what mn_itree_inv_end() does; it is called by
invalidate_range_end().

> That doesn't look like the normal pattern for a seqlock to me, and as
> far as I can see that is quite a bug in the HMM design/logic.

Well, hmm has a bug because it doesn't use a seqlock pattern; see the
above URL.

One of the motivations for this work is to squash that bug by adding a
seqlock-like pattern. But the basic hmm flow and collision-retry
approach seems sound.
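
For completeness, here is the shape of a driver's invalidate callback
in this scheme (a sketch; driver_data and driver_lock are made-up names,
not from the patch):

static bool driver_invalidate(struct mmu_range_notifier *mrn,
			      const struct mmu_notifier_range *range)
{
	struct driver_data *data =
		container_of(mrn, struct driver_data, mrn);

	if (mmu_notifier_range_blockable(range))
		mutex_lock(&data->driver_lock);
	else if (!mutex_trylock(&data->driver_lock))
		return false;	/* non-blockable context, ask core to back off */

	/* stop the device from using pages in range->start..range->end */
	mutex_unlock(&data->driver_lock);
	return true;
}

While driver_lock is held by a reader that already passed
mmu_range_read_retry(), this callback cannot complete, which is exactly
what keeps 'Updating the ptes' from racing with a device still using
the old pages.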

Do you see a problem with this patch?

Jason

Re: [PATCH hmm 02/15] mm/mmu_notifier: add an interval tree notifier

2019-10-22 Thread Jason Gunthorpe
On Mon, Oct 21, 2019 at 03:11:57PM -0400, Jerome Glisse wrote:
> > Since that reader is not locked we need release semantics here to
> > ensure the unlocked reader sees a fully initialized mmu_notifier_mm
> > structure when it observes the pointer.
> 
> I thought the mm_take_all_locks() would have had a barrier and thus
> that you could not see the mmu_notifier struct partially initialized.

Not sure, usually a lock acquire doesn't have a store barrier?

Even if it did, we would still need some pairing read barrier..

> having the acquire/release as a safety net does not hurt. Maybe add a
> comment about the struct initialization needing to be visible before
> the pointer is set.

Is this clear?

 * release semantics on the initialization of the
 * mmu_notifier_mm's contents are provided for unlocked readers.
 * acquire can only be used while holding the
 * mmgrab or mmget, and is safe because once created the
 * mmu_notifier_mm is not freed until the mm is destroyed.
 * Users holding the mmap_sem or one of the
 * mm_take_all_locks() do not need to use acquire semantics.

It also helps explain why there is no locking around the other
readers, which has puzzled me in the past at least.
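
ie concretely (sketch):

	/* registration side: publish the pointer only after init */
	smp_store_release(&mm->mmu_notifier_mm, mmu_notifier_mm);

	/* unlocked reader, holding only a mmgrab()/mmget() reference */
	mmn_mm = smp_load_acquire(&mm->mmu_notifier_mm);
	if (mmn_mm) {
		/* all of mmn_mm's initialization is visible here */
	}

Readers that hold the mmap_sem or mm_take_all_locks() already order
against the registration, so a plain load is enough for them.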

Jason

Re: [PATCH hmm 08/15] xen/gntdev: Use select for DMA_SHARED_BUFFER

2019-10-22 Thread Jason Gunthorpe
On Wed, Oct 16, 2019 at 06:35:15AM +, Oleksandr Andrushchenko wrote:
> On 10/16/19 8:11 AM, Jürgen Groß wrote:
> > On 15.10.19 20:12, Jason Gunthorpe wrote:
> >> From: Jason Gunthorpe 
> >>
> >> DMA_SHARED_BUFFER can not be enabled by the user (it represents a 
> >> library
> >> set in the kernel). The kconfig convention is to use select for such
> >> symbols so they are turned on implicitly when the user enables a kconfig
> >> that needs them.
> >>
> >> Otherwise the XEN_GNTDEV_DMABUF kconfig is overly difficult to enable.
> >>
> >> Fixes: 932d6562179e ("xen/gntdev: Add initial support for dma-buf UAPI")
> >> Cc: Oleksandr Andrushchenko 
> >> Cc: Boris Ostrovsky 
> >> Cc: xen-de...@lists.xenproject.org
> >> Cc: Juergen Gross 
> >> Cc: Stefano Stabellini 
> >> Signed-off-by: Jason Gunthorpe 
> >
> > Reviewed-by: Juergen Gross 
> >
> Reviewed-by: Oleksandr Andrushchenko 

Thanks Oleksandr and Juergen, can you also give me some advice on how
to progress the more complex patch:

https://patchwork.kernel.org/patch/11191369/

Is this gntdev stuff still in use? I struggled a bit to understand
what it is doing, but I think I made a reasonable guess?

Jason

Re: [PATCH hmm 00/15] Consolidate the mmu notifier interval_tree and locking

2019-10-22 Thread Jason Gunthorpe
On Mon, Oct 21, 2019 at 11:55:51AM -0400, Dennis Dalessandro wrote:
> On 10/15/2019 2:12 PM, Jason Gunthorpe wrote:
> > This is still being tested, but I figured to send it to start getting help
> > from the xen, amd and hfi drivers which I cannot test here.
> 
> Sorry for the delay, I never saw this. I was not on the Cc list and it
> didn't register with me that it impacted hfi. I'll take a look and run
> it through some hfi1 tests.

Hm, you were cc'd on the hfi1 patch of the series:

https://patchwork.kernel.org/patch/11191395/

So you saw that, right?

But it seems that git send-email didn't pull all the cc's together?

Jason

Re: [PATCH] drm/amdgpu: Fix memory leak in amdgpu_fence_emit

2019-10-22 Thread Koenig, Christian
On 21.10.19 20:09, Navid Emamdoost wrote:
> In the implementation of amdgpu_fence_emit(), if dma_fence_wait() fails
> and returns an errno, the memory allocated for the fence should be
> released before returning.
>
> Fixes: 3d2aca8c8620 ("drm/amdgpu: fix old fence check in amdgpu_fence_emit")
> Signed-off-by: Navid Emamdoost 

Reviewed-by: Christian König 

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 23085b352cf2..2f59c9270a7e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -166,8 +166,10 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct 
> dma_fence **f,
>   if (old) {
>   r = dma_fence_wait(old, false);
>   dma_fence_put(old);
> - if (r)
> + if (r) {
> + dma_fence_put(fence);
>   return r;
> + }
>   }
>   }
>   
