Re: [PATCH 2/2] drm/amdgpu: fix debug sdma vram access checks

2022-01-17 Thread Christian König

Am 18.01.22 um 00:43 schrieb Jonathan Kim:

drm_dev_enter returns true when the device is accessible, so correct the error check.

Source VRAM buffer is reserved by TTM but not pinned, so the GPU offset
fetch throws a warning.


Well, it throws a warning because what you're trying to do here won't work.


Get the GPU offset without a check, then double-check that the source
buffer hasn't fallen into a hole. Otherwise, fall back to MMIO access to
handle non-contiguous VRAM as usual.

Signed-off-by: Jonathan Kim 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 19 +++
  1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 6178ae7ba624..0acfd872bfef 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1440,6 +1440,7 @@ static int amdgpu_ttm_access_memory_sdma(struct 
ttm_buffer_object *bo,
struct dma_fence *fence;
uint64_t src_addr, dst_addr;
unsigned int num_dw;
+   bool vram_is_contiguous;
int r, idx;
  
  	if (len != PAGE_SIZE)

@@ -1448,9 +1449,8 @@ static int amdgpu_ttm_access_memory_sdma(struct 
ttm_buffer_object *bo,
if (!adev->mman.sdma_access_ptr)
return -EACCES;
  
-	r = drm_dev_enter(adev_to_drm(adev), &idx);
-	if (r)
-		return r;
+	if (!drm_dev_enter(adev_to_drm(adev), &idx))
+		return -ENODEV;
  
 	if (write)
 		memcpy(adev->mman.sdma_access_ptr, buf, len);
@@ -1460,7 +1460,18 @@ static int amdgpu_ttm_access_memory_sdma(struct 
ttm_buffer_object *bo,
if (r)
goto out;
  
-	src_addr = amdgpu_bo_gpu_offset(abo);
+	src_addr = amdgpu_bo_gpu_offset_no_check(abo);
+   vram_is_contiguous = (adev->gmc.real_vram_size - 1 ==
+   adev->gmc.vram_end - adev->gmc.vram_start) &&
+   src_addr >= adev->gmc.vram_start &&
+   src_addr + len <= adev->gmc.vram_end;


That code is utter nonsense. What you need to do is use the iterator to
get the real VRAM address.


The return value of amdgpu_bo_gpu_offset() and 
amdgpu_bo_gpu_offset_no_check() is only valid for pinned contiguous buffers.


Regards,
Christian.


+
+   if (!vram_is_contiguous) {
+   amdgpu_job_free(job);
+   r = -EACCES;
+   goto out;
+   }
+
dst_addr = amdgpu_bo_gpu_offset(adev->mman.sdma_access_bo);
if (write)
swap(src_addr, dst_addr);




Re: [PATCH 1/2] drm/amdgpu: fix build up and tear down of debug vram access bounce buffer

2022-01-17 Thread Christian König

Am 18.01.22 um 00:43 schrieb Jonathan Kim:

Move the debug SDMA VRAM bounce buffer GART mapping to device init,
after the GART is ready, to avoid warnings and inaccessible SDMA.


Well, that doesn't seem to make sense. The GART is initialized by the
code around the allocation, so that should work fine.


Freeing the BO indeed needs to be moved up, but that can still be in the 
same function.


Also please don't move TTM related code outside of the TTM code in amdgpu.

Regards,
Christian.



Also move the bounce buffer teardown to after the memory manager has
flushed queued work, for safety.

Signed-off-by: Jonathan Kim 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++
  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c|  8 
  2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index da3348fa7b0e..099460d15258 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2378,6 +2378,13 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
*adev)
if (r)
goto init_failed;
  
+	/* GTT bounce buffer for debug vram access over sdma. */
+	if (amdgpu_bo_create_kernel(adev, PAGE_SIZE, PAGE_SIZE,
+   AMDGPU_GEM_DOMAIN_GTT,
+   &adev->mman.sdma_access_bo, NULL,
+   &adev->mman.sdma_access_ptr))
+   DRM_WARN("Debug VRAM access will use slowpath MM access\n");
+
/*
 * retired pages will be loaded from eeprom and reserved here,
 * it should be called after amdgpu_device_ip_hw_init_phase2  since
@@ -3872,6 +3879,10 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
}
adev->shutdown = true;
  
+	/* remove debug vram sdma access bounce buffer. */
+	amdgpu_bo_free_kernel(&adev->mman.sdma_access_bo, NULL,
+   &adev->mman.sdma_access_ptr);
+
/* make sure IB test finished before entering exclusive mode
 * to avoid preemption on IB test
 * */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index b489cd8abe31..6178ae7ba624 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1855,12 +1855,6 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
return r;
}
  
-	if (amdgpu_bo_create_kernel(adev, PAGE_SIZE, PAGE_SIZE,
-				    AMDGPU_GEM_DOMAIN_GTT,
-				    &adev->mman.sdma_access_bo, NULL,
-				    &adev->mman.sdma_access_ptr))
-   DRM_WARN("Debug VRAM access will use slowpath MM access\n");
-
return 0;
  }
  
@@ -1901,8 +1895,6 @@ void amdgpu_ttm_fini(struct amdgpu_device *adev)

ttm_range_man_fini(&adev->mman.bdev, AMDGPU_PL_OA);
ttm_device_fini(&adev->mman.bdev);
adev->mman.initialized = false;
-   amdgpu_bo_free_kernel(&adev->mman.sdma_access_bo, NULL,
-   &adev->mman.sdma_access_ptr);
DRM_INFO("amdgpu: ttm finalized\n");
  }
  




Re: [PATCH 1/1] Add available memory ioctl for libhsakmt

2022-01-17 Thread Felix Kuehling


Am 2022-01-10 um 8:39 p.m. schrieb Daniel Phillips:
> Add an ioctl to query the memory available for allocation by libhsakmt
> per node, accounting for space consumed by page translation tables.
>
> This ioctl is the underlying mechanism for the new memory availability
> library call posted for review here:
>
>https://lists.freedesktop.org/archives/amd-gfx/2022-January/073352.html
>
> Signed-off-by: Daniel Phillips 
>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h  |  1 +
>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c| 14 ++
>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c| 17 +
>  include/uapi/linux/kfd_ioctl.h  | 14 --
>  4 files changed, 44 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index fcbc8a9c9e06..64c6c36685d3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -266,6 +266,7 @@ int amdgpu_amdkfd_gpuvm_acquire_process_vm(struct 
> amdgpu_device *adev,
>  void amdgpu_amdkfd_gpuvm_release_process_vm(struct amdgpu_device *adev,
>   void *drm_priv);
>  uint64_t amdgpu_amdkfd_gpuvm_get_process_page_dir(void *drm_priv);
> +size_t amdgpu_amdkfd_get_available_memory(struct amdgpu_device *adev);
>  int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
>   struct amdgpu_device *adev, uint64_t va, uint64_t size,
>   void *drm_priv, struct kgd_mem **mem,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index 86a1a6c109d9..b7490a659173 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -190,6 +190,20 @@ static int amdgpu_amdkfd_reserve_mem_limit(struct 
> amdgpu_device *adev,
>   return ret;
>  }
>  
> +size_t amdgpu_amdkfd_get_available_memory(struct amdgpu_device *adev)
> +{
> + uint64_t reserved_for_pt =
> + ESTIMATE_PT_SIZE(amdgpu_amdkfd_total_mem_size);
> + size_t available_memory;
> +
> + spin_lock(&kfd_mem_limit.mem_limit_lock);
> + available_memory =
> + adev->gmc.real_vram_size -
> + adev->kfd.vram_used - reserved_for_pt;
> + spin_unlock(&kfd_mem_limit.mem_limit_lock);
> + return available_memory;
> +}
> +
>  static void unreserve_mem_limit(struct amdgpu_device *adev,
>   uint64_t size, u32 alloc_flag)
>  {
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> index 4bfc0c8ab764..5c2f6d97ff1c 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> @@ -486,6 +486,20 @@ static int kfd_ioctl_get_queue_wave_state(struct file 
> *filep,
>   return r;
>  }
>  
> +static int kfd_ioctl_get_available_memory(struct file *filep,
> +  struct kfd_process *p, void *data)
> +{
> + struct kfd_ioctl_get_available_memory_args *args = data;
> + struct kfd_dev *dev;
> +
> + dev = kfd_device_by_id(args->gpu_id);
> + if (!dev)
> + return -EINVAL;
> +
> + args->available = amdgpu_amdkfd_get_available_memory(dev->adev);
> + return 0;
> +}
> +
>  static int kfd_ioctl_set_memory_policy(struct file *filep,
>   struct kfd_process *p, void *data)
>  {
> @@ -1959,6 +1973,9 @@ static const struct amdkfd_ioctl_desc amdkfd_ioctls[] = 
> {
>  
>   AMDKFD_IOCTL_DEF(AMDKFD_IOC_SET_XNACK_MODE,
>   kfd_ioctl_set_xnack_mode, 0),
> +
> + AMDKFD_IOCTL_DEF(AMDKFD_IOC_AVAILABLE_MEMORY,
> + kfd_ioctl_get_available_memory, 0),
>  };
>  
>  #define AMDKFD_CORE_IOCTL_COUNT  ARRAY_SIZE(amdkfd_ioctls)
> diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
> index af96af174dc4..94a99add2432 100644
> --- a/include/uapi/linux/kfd_ioctl.h
> +++ b/include/uapi/linux/kfd_ioctl.h
> @@ -32,9 +32,10 @@
>   * - 1.4 - Indicate new SRAM EDC bit in device properties
>   * - 1.5 - Add SVM API
>   * - 1.6 - Query clear flags in SVM get_attr API
> + * - 1.7 - Add available_memory ioctl
>   */
>  #define KFD_IOCTL_MAJOR_VERSION 1
> -#define KFD_IOCTL_MINOR_VERSION 6
> +#define KFD_IOCTL_MINOR_VERSION 7
>  
>  struct kfd_ioctl_get_version_args {
>   __u32 major_version;/* from KFD */
> @@ -98,6 +99,12 @@ struct kfd_ioctl_get_queue_wave_state_args {
>   __u32 pad;
>  };
>  
> +struct kfd_ioctl_get_available_memory_args {
> + __u64 available;/* from KFD */
> + __u32 gpu_id;   /* to KFD */
> + __u32 pad;
> +};
> +
>  /* For kfd_ioctl_set_memory_policy_args.default_policy and alternate_policy 
> */
>  #define KFD_IOC_CACHE_POLICY_COHERENT 0
>  #define KFD_IOC_CACHE_POLICY_NONCOHERENT 1
> @@ -742,7 +749,10 @@ struct kfd_ioctl_set_xnack_mode_args {
>  #define AMD

[PATCH v2 1/1] Add test for new hsaKmtAvailableMemory library call

2022-01-17 Thread Felix Kuehling
From: Daniel Phillips 

Using DefaultGPUNode now instead of system memory, usage similar to
other tests. Also cleaned up pSmall, which I originally intended to
just let float away on the mistaken assumption that it would be
cleaned up automatically at the end of the test.

Basic test for the new hsaKmtAvailableMemory library call. This is a
standalone test; it does not modify any of the other tests, just to be
on the safe side. More elaborate tests are coming soon.

v2:
* Change the ioctl to IOWR (the args struct carries gpu_id in and
  available out)
* Allocate non-paged memory to actually get VRAM
* Align available memory down to page size, as required by
  hsaKmtAllocMemory
Still doesn't pass on Fiji: the reported available memory is larger than
the physically free memory on this GPU with only 4GB of VRAM. Needs more
refinement in KFD.

Signed-off-by: Daniel Phillips 
Change-Id: I645006a89bd8d55ef7b1605611e8ef0c010dad1a
---
 include/linux/kfd_ioctl.h   |  2 +-
 tests/kfdtest/src/KFDMemoryTest.cpp | 24 
 tests/kfdtest/src/KFDTestUtil.hpp   |  1 +
 3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/linux/kfd_ioctl.h b/include/linux/kfd_ioctl.h
index a81ae37..f9af830 100644
--- a/include/linux/kfd_ioctl.h
+++ b/include/linux/kfd_ioctl.h
@@ -1252,7 +1252,7 @@ struct kfd_ioctl_set_xnack_mode_args {
AMDKFD_IOWR(0x21, struct kfd_ioctl_set_xnack_mode_args)
 
 #define AMDKFD_IOC_AVAILABLE_MEMORY\
-   AMDKFD_IOR(0x22, struct kfd_ioctl_get_available_memory_args)
+   AMDKFD_IOWR(0x22, struct kfd_ioctl_get_available_memory_args)
 
 #define AMDKFD_COMMAND_START   0x01
 #define AMDKFD_COMMAND_END 0x23
diff --git a/tests/kfdtest/src/KFDMemoryTest.cpp 
b/tests/kfdtest/src/KFDMemoryTest.cpp
index f7ac73f..b722ff8 100644
--- a/tests/kfdtest/src/KFDMemoryTest.cpp
+++ b/tests/kfdtest/src/KFDMemoryTest.cpp
@@ -595,6 +595,30 @@ TEST_F(KFDMemoryTest, MemoryAlloc) {
 TEST_END
 }
 
+// Basic test for hsaKmtAvailableMemory
+TEST_F(KFDMemoryTest, MemoryAllocAll) {
+TEST_START(TESTPROFILE_RUNALL)
+
+int defaultGPUNode = m_NodeInfo.HsaDefaultGPUNode();
+unsigned int* pBig = NULL;
+unsigned int* pSmall = NULL;
+HsaMemFlags memFlags = {0};
+HSAuint64 available;
+
+memFlags.ui32.NonPaged = 1;
+
+EXPECT_SUCCESS(hsaKmtAvailableMemory(defaultGPUNode, &available));
+available = ALIGN_DOWN(available, PAGE_SIZE);
+
+EXPECT_SUCCESS(hsaKmtAllocMemory(defaultGPUNode, available, memFlags, reinterpret_cast<void**>(&pBig)));
+EXPECT_NE(HSAKMT_STATUS_SUCCESS, hsaKmtAllocMemory(defaultGPUNode, PAGE_SIZE, memFlags, reinterpret_cast<void**>(&pSmall)));
+EXPECT_SUCCESS(hsaKmtFreeMemory(pBig, available));
+EXPECT_SUCCESS(hsaKmtAllocMemory(defaultGPUNode, PAGE_SIZE, memFlags, reinterpret_cast<void**>(&pSmall)));
+EXPECT_SUCCESS(hsaKmtFreeMemory(pSmall, PAGE_SIZE));
+
+TEST_END
+}
+
 TEST_F(KFDMemoryTest, AccessPPRMem) {
 TEST_START(TESTPROFILE_RUNALL)
 
diff --git a/tests/kfdtest/src/KFDTestUtil.hpp 
b/tests/kfdtest/src/KFDTestUtil.hpp
index ee88ed3..420b7af 100644
--- a/tests/kfdtest/src/KFDTestUtil.hpp
+++ b/tests/kfdtest/src/KFDTestUtil.hpp
@@ -33,6 +33,7 @@
 class BaseQueue;
 #define ARRAY_SIZE(_x) (sizeof(_x)/sizeof(_x[0]))
 #define ALIGN_UP(x, align) (((uint64_t)(x) + (align) - 1) & ~(uint64_t)((align)-1))
+#define ALIGN_DOWN(x, align) ((uint64_t)(x) & ~(uint64_t)((align)-1))
 #define CounterToNanoSec(x) ((x) * 1000 / (is_dgpu() ? 27 : 100))
 
 void WaitUntilInput();
-- 
2.32.0



[PATCH AUTOSEL 4.19 46/59] drm/amdgpu: fixup bad vram size on gmc v8

2022-01-17 Thread Sasha Levin
From: Zongmin Zhou 

[ Upstream commit 11544d77e3974924c5a9c8a8320b996a3e9b2f8b ]

Some boards (like the RX550) seem to have garbage in the upper
16 bits of the VRAM size register. Check for this and clamp the
size properly. Fixes boards reporting bogus amounts of VRAM.

After this patch the maximum supported GPU VRAM size is 64GB
(0xffff MB), since only the lower 16 bits of the size register
are trusted.
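
For illustration (not part of the patch), the clamp behaves like this on
a hypothetical garbled reading:

	u32 tmp = 0xdead2000;		/* upper 16 bits are garbage */

	if (tmp & 0xffff0000) {		/* garbage detected */
		if (tmp & 0xffff)
			tmp &= 0xffff;	/* 0x2000 MB == 8 GB */
	}
	/* the register counts MB, so the 16-bit clamp caps the
	 * reported VRAM at 0xffff MB, just under 64 GB */
	u64 vram_size = tmp * 1024ULL * 1024ULL;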

Signed-off-by: Zongmin Zhou
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
index 1a744f964b301..358004a4650b6 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
@@ -520,10 +520,10 @@ static void gmc_v8_0_mc_program(struct amdgpu_device 
*adev)
 static int gmc_v8_0_mc_init(struct amdgpu_device *adev)
 {
int r;
+   u32 tmp;
 
adev->gmc.vram_width = amdgpu_atombios_get_vram_width(adev);
if (!adev->gmc.vram_width) {
-   u32 tmp;
int chansize, numchan;
 
/* Get VRAM informations */
@@ -567,8 +567,15 @@ static int gmc_v8_0_mc_init(struct amdgpu_device *adev)
adev->gmc.vram_width = numchan * chansize;
}
/* size in MB on si */
-   adev->gmc.mc_vram_size = RREG32(mmCONFIG_MEMSIZE) * 1024ULL * 1024ULL;
-   adev->gmc.real_vram_size = RREG32(mmCONFIG_MEMSIZE) * 1024ULL * 1024ULL;
+   tmp = RREG32(mmCONFIG_MEMSIZE);
+   /* some boards may have garbage in the upper 16 bits */
+   if (tmp & 0xffff0000) {
+   DRM_INFO("Probable bad vram size: 0x%08x\n", tmp);
+   if (tmp & 0xffff)
+   tmp &= 0xffff;
+   }
+   adev->gmc.mc_vram_size = tmp * 1024ULL * 1024ULL;
+   adev->gmc.real_vram_size = adev->gmc.mc_vram_size;
 
if (!(adev->flags & AMD_IS_APU)) {
r = amdgpu_device_resize_fb_bar(adev);
-- 
2.34.1



Re: [PATCH 2/2] drm/amdgpu: fix debug sdma vram access checks

2022-01-17 Thread Felix Kuehling
Am 2022-01-17 um 6:43 p.m. schrieb Jonathan Kim:
> drm_dev_enter returns true when the device is accessible, so correct the error check.
>
> Source VRAM buffer is reserved by TTM but not pinned, so the GPU offset
> fetch throws a warning.  Get the GPU offset without a check, then
> double-check that the source buffer hasn't fallen into a hole.
> Otherwise, fall back to MMIO access to handle non-contiguous VRAM as
> usual.

There is a way to get the correct offset for non-contiguous memory the
same way amdgpu_ttm_access_memory does, using amdgpu_res_first. Since
you only need a single page, you should never need amdgpu_res_next. Just
make sure that the cursor.size >= len.
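
For reference, here is a minimal sketch of that suggestion (an
illustration, not the final patch; it assumes the amdgpu_res_cursor
helpers and the offset/len arguments that amdgpu_ttm_access_memory_sdma
already receives):

	struct amdgpu_res_cursor cursor;

	/* Resolve the page's real VRAM offset even when the BO is backed
	 * by non-contiguous VRAM; a single page never needs
	 * amdgpu_res_next(). */
	amdgpu_res_first(bo->resource, offset, len, &cursor);
	if (cursor.size < len) {
		/* the page is not contained in a single VRAM block */
		r = -EACCES;
		goto out;
	}
	src_addr = adev->gmc.vram_start + cursor.start;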

Regards,
  Felix


>
> Signed-off-by: Jonathan Kim 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 19 +++
>  1 file changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 6178ae7ba624..0acfd872bfef 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -1440,6 +1440,7 @@ static int amdgpu_ttm_access_memory_sdma(struct 
> ttm_buffer_object *bo,
>   struct dma_fence *fence;
>   uint64_t src_addr, dst_addr;
>   unsigned int num_dw;
> + bool vram_is_contiguous;
>   int r, idx;
>  
>   if (len != PAGE_SIZE)
> @@ -1448,9 +1449,8 @@ static int amdgpu_ttm_access_memory_sdma(struct 
> ttm_buffer_object *bo,
>   if (!adev->mman.sdma_access_ptr)
>   return -EACCES;
>  
> - r = drm_dev_enter(adev_to_drm(adev), &idx);
> - if (r)
> - return r;
> + if (!drm_dev_enter(adev_to_drm(adev), &idx))
> + return -ENODEV;
>  
>   if (write)
>   memcpy(adev->mman.sdma_access_ptr, buf, len);
> @@ -1460,7 +1460,18 @@ static int amdgpu_ttm_access_memory_sdma(struct 
> ttm_buffer_object *bo,
>   if (r)
>   goto out;
>  
> - src_addr = amdgpu_bo_gpu_offset(abo);
> + src_addr = amdgpu_bo_gpu_offset_no_check(abo);
> + vram_is_contiguous = (adev->gmc.real_vram_size - 1 ==
> + adev->gmc.vram_end - adev->gmc.vram_start) &&
> + src_addr >= adev->gmc.vram_start &&
> + src_addr + len <= adev->gmc.vram_end;
> +
> + if (!vram_is_contiguous) {
> + amdgpu_job_free(job);
> + r = -EACCES;
> + goto out;
> + }
> +
>   dst_addr = amdgpu_bo_gpu_offset(adev->mman.sdma_access_bo);
>   if (write)
>   swap(src_addr, dst_addr);


[PATCH AUTOSEL 5.4 59/73] drm/amdgpu: fixup bad vram size on gmc v8

2022-01-17 Thread Sasha Levin
From: Zongmin Zhou 

[ Upstream commit 11544d77e3974924c5a9c8a8320b996a3e9b2f8b ]

Some boards (like the RX550) seem to have garbage in the upper
16 bits of the VRAM size register. Check for this and clamp the
size properly. Fixes boards reporting bogus amounts of VRAM.

After this patch the maximum supported GPU VRAM size is 64GB
(0xffff MB), since only the lower 16 bits of the size register
are trusted.

Signed-off-by: Zongmin Zhou
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
index ea764dd9245db..2975331a7b867 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
@@ -524,10 +524,10 @@ static void gmc_v8_0_mc_program(struct amdgpu_device 
*adev)
 static int gmc_v8_0_mc_init(struct amdgpu_device *adev)
 {
int r;
+   u32 tmp;
 
adev->gmc.vram_width = amdgpu_atombios_get_vram_width(adev);
if (!adev->gmc.vram_width) {
-   u32 tmp;
int chansize, numchan;
 
/* Get VRAM informations */
@@ -571,8 +571,15 @@ static int gmc_v8_0_mc_init(struct amdgpu_device *adev)
adev->gmc.vram_width = numchan * chansize;
}
/* size in MB on si */
-   adev->gmc.mc_vram_size = RREG32(mmCONFIG_MEMSIZE) * 1024ULL * 1024ULL;
-   adev->gmc.real_vram_size = RREG32(mmCONFIG_MEMSIZE) * 1024ULL * 1024ULL;
+   tmp = RREG32(mmCONFIG_MEMSIZE);
+   /* some boards may have garbage in the upper 16 bits */
+   if (tmp & 0xffff0000) {
+   DRM_INFO("Probable bad vram size: 0x%08x\n", tmp);
+   if (tmp & 0xffff)
+   tmp &= 0xffff;
+   }
+   adev->gmc.mc_vram_size = tmp * 1024ULL * 1024ULL;
+   adev->gmc.real_vram_size = adev->gmc.mc_vram_size;
 
if (!(adev->flags & AMD_IS_APU)) {
r = amdgpu_device_resize_fb_bar(adev);
-- 
2.34.1



[PATCH AUTOSEL 5.10 097/116] amdgpu/pm: Make sysfs pm attributes as read-only for VFs

2022-01-17 Thread Sasha Levin
From: Marina Nikolic 

[ Upstream commit 11c9cc95f818f0f187e9b579a7f136f532b42445 ]

== Description ==
Setting values of pm attributes through sysfs
should not be allowed in SRIOV mode.
These calls will not be processed by FW anyway,
but error handling on sysfs level should be improved.

== Changes ==
This patch prohibits performing of all set commands
in SRIOV mode on sysfs level.
It offers better error handling as calls that are
not allowed will not be propagated further.

== Test ==
Writing to any sysfs file in passthrough mode will succeed.
Writing to any sysfs file in ONEVF mode will yield error:
"calling process does not have sufficient permission to execute a command".

Signed-off-by: Marina Nikolic 
Acked-by: Evan Quan 
Reviewed-by: Lijo Lazar 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/pm/amdgpu_pm.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
index 9f383b9041d28..49109614510b8 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
@@ -2098,6 +2098,12 @@ static int default_attr_update(struct amdgpu_device 
*adev, struct amdgpu_device_
}
}
 
+   /* setting should not be allowed from VF */
+   if (amdgpu_sriov_vf(adev)) {
+   dev_attr->attr.mode &= ~S_IWUGO;
+   dev_attr->store = NULL;
+   }
+
 #undef DEVICE_ATTR_IS
 
return 0;
-- 
2.34.1



[PATCH AUTOSEL 5.10 095/116] drm/amdgpu: fixup bad vram size on gmc v8

2022-01-17 Thread Sasha Levin
From: Zongmin Zhou 

[ Upstream commit 11544d77e3974924c5a9c8a8320b996a3e9b2f8b ]

Some boards (like the RX550) seem to have garbage in the upper
16 bits of the VRAM size register. Check for this and clamp the
size properly. Fixes boards reporting bogus amounts of VRAM.

After this patch the maximum supported GPU VRAM size is 64GB
(0xffff MB), since only the lower 16 bits of the size register
are trusted.

Signed-off-by: Zongmin Zhou
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
index 9ab65ca7df777..873bc33912e23 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
@@ -524,10 +524,10 @@ static void gmc_v8_0_mc_program(struct amdgpu_device 
*adev)
 static int gmc_v8_0_mc_init(struct amdgpu_device *adev)
 {
int r;
+   u32 tmp;
 
adev->gmc.vram_width = amdgpu_atombios_get_vram_width(adev);
if (!adev->gmc.vram_width) {
-   u32 tmp;
int chansize, numchan;
 
/* Get VRAM informations */
@@ -571,8 +571,15 @@ static int gmc_v8_0_mc_init(struct amdgpu_device *adev)
adev->gmc.vram_width = numchan * chansize;
}
/* size in MB on si */
-   adev->gmc.mc_vram_size = RREG32(mmCONFIG_MEMSIZE) * 1024ULL * 1024ULL;
-   adev->gmc.real_vram_size = RREG32(mmCONFIG_MEMSIZE) * 1024ULL * 1024ULL;
+   tmp = RREG32(mmCONFIG_MEMSIZE);
+   /* some boards may have garbage in the upper 16 bits */
+   if (tmp & 0xffff0000) {
+   DRM_INFO("Probable bad vram size: 0x%08x\n", tmp);
+   if (tmp & 0xffff)
+   tmp &= 0xffff;
+   }
+   adev->gmc.mc_vram_size = tmp * 1024ULL * 1024ULL;
+   adev->gmc.real_vram_size = adev->gmc.mc_vram_size;
 
if (!(adev->flags & AMD_IS_APU)) {
r = amdgpu_device_resize_fb_bar(adev);
-- 
2.34.1



[PATCH AUTOSEL 5.10 023/116] drm/amdgpu/display: set vblank_disable_immediate for DC

2022-01-17 Thread Sasha Levin
From: Alex Deucher 

[ Upstream commit 92020e81ddbeac351ea4a19bcf01743f32b9c800 ]

Disable vblanks immediately to save power.  I think this was
missed when we merged DC support.

Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1781
Reviewed-by: Harry Wentland 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c   | 1 -
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
index 2f70fdd6104f2..582055136cdbf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
@@ -267,7 +267,6 @@ int amdgpu_irq_init(struct amdgpu_device *adev)
if (!amdgpu_device_has_dc_support(adev)) {
if (!adev->enable_virtual_display)
/* Disable vblank IRQs aggressively for power-saving */
-   /* XXX: can this be enabled for DC? */
adev_to_drm(adev)->vblank_disable_immediate = true;
 
 		r = drm_vblank_init(adev_to_drm(adev), adev->mode_info.num_crtc);
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index a5b6f36fe1d72..6c8f141103da4 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -1069,6 +1069,9 @@ static int amdgpu_dm_init(struct amdgpu_device *adev)
 	adev_to_drm(adev)->mode_config.cursor_width = adev->dm.dc->caps.max_cursor_size;
 	adev_to_drm(adev)->mode_config.cursor_height = adev->dm.dc->caps.max_cursor_size;
 
+   /* Disable vblank IRQs aggressively for power-saving */
+   adev_to_drm(adev)->vblank_disable_immediate = true;
+
if (drm_vblank_init(adev_to_drm(adev), adev->dm.display_indexes_num)) {
DRM_ERROR(
"amdgpu: failed to initialize sw for display support.\n");
-- 
2.34.1



[PATCH AUTOSEL 5.10 022/116] drm/amd/display: check top_pipe_to_program pointer

2022-01-17 Thread Sasha Levin
From: Yang Li 

[ Upstream commit a689e8d1f80012f90384ebac9dcfac4201f9f77e ]

Clang static analysis reports this error

drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c:2870:7: warning:
Dereference of null pointer [clang-analyzer-core.NullDereference]
if
(top_pipe_to_program->stream_res.tg->funcs->lock_doublebuffer_enable) {
^

top_pipe_to_program being NULL is caught as an error
But then it is used to report the error.

So add a check before using it.

Reported-by: Abaci Robot 
Signed-off-by: Yang Li 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/display/dc/core/dc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/dc/core/dc.c 
b/drivers/gpu/drm/amd/display/dc/core/dc.c
index 284ed1c8a35ac..93f5229c303e7 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc.c
@@ -2436,7 +2436,8 @@ static void commit_planes_for_stream(struct dc *dc,
}
 
 	if ((update_type != UPDATE_TYPE_FAST) && stream->update_flags.bits.dsc_changed)
-		if (top_pipe_to_program->stream_res.tg->funcs->lock_doublebuffer_enable) {
+		if (top_pipe_to_program &&
+		    top_pipe_to_program->stream_res.tg->funcs->lock_doublebuffer_enable) {
 			if (should_use_dmub_lock(stream->link)) {
 				union dmub_hw_lock_flags hw_locks = { 0 };
 				struct dmub_hw_lock_inst_flags inst_flags = { 0 };
-- 
2.34.1



[PATCH AUTOSEL 5.15 161/188] amdgpu/pm: Make sysfs pm attributes as read-only for VFs

2022-01-17 Thread Sasha Levin
From: Marina Nikolic 

[ Upstream commit 11c9cc95f818f0f187e9b579a7f136f532b42445 ]

== Description ==
Setting values of pm attributes through sysfs
should not be allowed in SRIOV mode.
These calls will not be processed by FW anyway,
but error handling on sysfs level should be improved.

== Changes ==
This patch prohibits performing of all set commands
in SRIOV mode on sysfs level.
It offers better error handling as calls that are
not allowed will not be propagated further.

== Test ==
Writing to any sysfs file in passthrough mode will succeed.
Writing to any sysfs file in ONEVF mode will yield error:
"calling process does not have sufficient permission to execute a command".

Signed-off-by: Marina Nikolic 
Acked-by: Evan Quan 
Reviewed-by: Lijo Lazar 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/pm/amdgpu_pm.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
index 249cb0aeb5ae4..32a0fd5e84b73 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
@@ -2117,6 +2117,12 @@ static int default_attr_update(struct amdgpu_device 
*adev, struct amdgpu_device_
}
}
 
+   /* setting should not be allowed from VF */
+   if (amdgpu_sriov_vf(adev)) {
+   dev_attr->attr.mode &= ~S_IWUGO;
+   dev_attr->store = NULL;
+   }
+
 #undef DEVICE_ATTR_IS
 
return 0;
-- 
2.34.1



[PATCH AUTOSEL 5.15 159/188] drm/amdgpu: fixup bad vram size on gmc v8

2022-01-17 Thread Sasha Levin
From: Zongmin Zhou 

[ Upstream commit 11544d77e3974924c5a9c8a8320b996a3e9b2f8b ]

Some boards (like the RX550) seem to have garbage in the upper
16 bits of the VRAM size register. Check for this and clamp the
size properly. Fixes boards reporting bogus amounts of VRAM.

After this patch the maximum supported GPU VRAM size is 64GB
(0xffff MB), since only the lower 16 bits of the size register
are trusted.

Signed-off-by: Zongmin Zhou
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
index 492ebed2915be..63b890f1e8afb 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
@@ -515,10 +515,10 @@ static void gmc_v8_0_mc_program(struct amdgpu_device 
*adev)
 static int gmc_v8_0_mc_init(struct amdgpu_device *adev)
 {
int r;
+   u32 tmp;
 
adev->gmc.vram_width = amdgpu_atombios_get_vram_width(adev);
if (!adev->gmc.vram_width) {
-   u32 tmp;
int chansize, numchan;
 
/* Get VRAM informations */
@@ -562,8 +562,15 @@ static int gmc_v8_0_mc_init(struct amdgpu_device *adev)
adev->gmc.vram_width = numchan * chansize;
}
/* size in MB on si */
-   adev->gmc.mc_vram_size = RREG32(mmCONFIG_MEMSIZE) * 1024ULL * 1024ULL;
-   adev->gmc.real_vram_size = RREG32(mmCONFIG_MEMSIZE) * 1024ULL * 1024ULL;
+   tmp = RREG32(mmCONFIG_MEMSIZE);
+   /* some boards may have garbage in the upper 16 bits */
+   if (tmp & 0xffff0000) {
+   DRM_INFO("Probable bad vram size: 0x%08x\n", tmp);
+   if (tmp & 0xffff)
+   tmp &= 0xffff;
+   }
+   adev->gmc.mc_vram_size = tmp * 1024ULL * 1024ULL;
+   adev->gmc.real_vram_size = adev->gmc.mc_vram_size;
 
if (!(adev->flags & AMD_IS_APU)) {
r = amdgpu_device_resize_fb_bar(adev);
-- 
2.34.1



[PATCH AUTOSEL 5.15 158/188] drm/amdgpu: Don't inherit GEM object VMAs in child process

2022-01-17 Thread Sasha Levin
From: Rajneesh Bhardwaj 

[ Upstream commit fbcdbfde87509d523132b59f661a355c731139d0 ]

When an application with open file access to a node forks, its shared
mappings also get reflected in the address space of the child process,
even though the child cannot access them with the object permissions
applied. Given the existing permission checks on the GEM objects, it is
reasonable to also create the VMAs with the VM_DONTCOPY flag, so that a
user space application doesn't need to explicitly call the
madvise(addr, len, MADV_DONTFORK) system call to keep the pages in the
mapped range out of the child's address space. It also prevents memory
leaks from the additional reference counts the child holds on the
mapped BOs, which kept the parent from freeing the memory; we had
previously worked around this in user space inside the thunk library.
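
For context, the user-space workaround referred to above looks roughly
like this (a sketch; render_fd and mmap_offset are placeholder names):

	#include <sys/mman.h>

	/* Map a GEM BO through the render node, then explicitly keep the
	 * mapping out of forked children. With VM_DONTCOPY set on the VMA
	 * by the kernel, this madvise() call becomes unnecessary. */
	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			  render_fd, mmap_offset);
	if (addr != MAP_FAILED)
		madvise(addr, len, MADV_DONTFORK);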

Additionally, we faced this issue when using CRIU to checkpoint restore
an application that had such inherited mappings in the child which
confuse CRIU when it mmaps on restore. Having this flag set for the
render node VMAs helps. VMAs mapped via KFD already take care of this so
this is needed only for the render nodes.

To limit the impact of the change to user space consumers such as OpenGL
etc, limit it to KFD BOs only.

Acked-by: Felix Kuehling 
Reviewed-by: Christian König 
Signed-off-by: David Yat Sin 
Signed-off-by: Rajneesh Bhardwaj 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index a1e63ba4c54a5..630dc99e49086 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -264,6 +264,9 @@ static int amdgpu_gem_object_mmap(struct drm_gem_object 
*obj, struct vm_area_str
!(vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
vma->vm_flags &= ~VM_MAYWRITE;
 
+   if (bo->kfd_bo)
+   vma->vm_flags |= VM_DONTCOPY;
+
return drm_gem_ttm_mmap(obj, vma);
 }
 
-- 
2.34.1



[PATCH AUTOSEL 5.15 120/188] drm/amd/amdgpu: fix gmc bo pin count leak in SRIOV

2022-01-17 Thread Sasha Levin
From: Jingwen Chen 

[ Upstream commit 948e7ce01413b71395723aaf846015062aea3a43 ]

[Why]
The gmc bo is pinned when loading amdgpu and again on reset in SRIOV,
but only unpinned when unloading amdgpu.

[How]
Add amdgpu_in_reset and SRIOV checks to skip re-pinning the bo.

v2: fix wrong judgement

Signed-off-by: Jingwen Chen 
Reviewed-by: Horace Chen 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 4 
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 4 
 2 files changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index e47104a1f5596..3c01be6610144 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -1021,10 +1021,14 @@ static int gmc_v10_0_gart_enable(struct amdgpu_device 
*adev)
return -EINVAL;
}
 
+   if (amdgpu_sriov_vf(adev) && amdgpu_in_reset(adev))
+   goto skip_pin_bo;
+
r = amdgpu_gart_table_vram_pin(adev);
if (r)
return r;
 
+skip_pin_bo:
r = adev->gfxhub.funcs->gart_enable(adev);
if (r)
return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 5551359d5dfdc..b5d93247237b1 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -1708,10 +1708,14 @@ static int gmc_v9_0_gart_enable(struct amdgpu_device 
*adev)
return -EINVAL;
}
 
+   if (amdgpu_sriov_vf(adev) && amdgpu_in_reset(adev))
+   goto skip_pin_bo;
+
r = amdgpu_gart_table_vram_pin(adev);
if (r)
return r;
 
+skip_pin_bo:
r = adev->gfxhub.funcs->gart_enable(adev);
if (r)
return r;
-- 
2.34.1



[PATCH AUTOSEL 5.15 119/188] drm/amd/amdgpu: fix psp tmr bo pin count leak in SRIOV

2022-01-17 Thread Sasha Levin
From: Jingwen Chen 

[ Upstream commit 85dfc1d692c9434c37842e610be37cd4ae4e0081 ]

[Why]
The psp tmr bo is pinned when loading amdgpu and again on reset in
SRIOV, but only unpinned when unloading amdgpu.

[How]
Add amdgpu_in_reset and SRIOV checks to skip re-pinning the bo.

v2: fix wrong judgement

Signed-off-by: Jingwen Chen 
Reviewed-by: Horace Chen 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 9b41cb8c3de54..86e2090bbd6e0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -2207,12 +2207,16 @@ static int psp_hw_start(struct psp_context *psp)
return ret;
}
 
+   if (amdgpu_sriov_vf(adev) && amdgpu_in_reset(adev))
+   goto skip_pin_bo;
+
ret = psp_tmr_init(psp);
if (ret) {
DRM_ERROR("PSP tmr init failed!\n");
return ret;
}
 
+skip_pin_bo:
/*
 * For ASICs with DF Cstate management centralized
 * to PMFW, TMR setup should be performed after PMFW
-- 
2.34.1



[PATCH AUTOSEL 5.15 111/188] drm/amdkfd: Fix error handling in svm_range_add

2022-01-17 Thread Sasha Levin
From: Felix Kuehling 

[ Upstream commit 726be40607264b180a2b336c81e1dcff941de618 ]

Add null-pointer check after the last svm_range_new call. This was
originally reported by Zhou Qingyang  based on a
static analyzer.

To avoid duplicating the unwinding code from svm_range_handle_overlap,
I merged the two functions into one.

Signed-off-by: Felix Kuehling 
Cc: Zhou Qingyang 
Reviewed-by: Philip Yang 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 138 ++-
 1 file changed, 49 insertions(+), 89 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 5a674235ae41a..830809b694dd9 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -936,7 +936,7 @@ svm_range_split(struct svm_range *prange, uint64_t start, 
uint64_t last,
 }
 
 static int
-svm_range_split_tail(struct svm_range *prange, struct svm_range *new,
+svm_range_split_tail(struct svm_range *prange,
 uint64_t new_last, struct list_head *insert_list)
 {
struct svm_range *tail;
@@ -948,7 +948,7 @@ svm_range_split_tail(struct svm_range *prange, struct 
svm_range *new,
 }
 
 static int
-svm_range_split_head(struct svm_range *prange, struct svm_range *new,
+svm_range_split_head(struct svm_range *prange,
 uint64_t new_start, struct list_head *insert_list)
 {
struct svm_range *head;
@@ -1755,49 +1755,54 @@ static struct svm_range *svm_range_clone(struct 
svm_range *old)
 }
 
 /**
- * svm_range_handle_overlap - split overlap ranges
- * @svms: svm range list header
- * @new: range added with this attributes
- * @start: range added start address, in pages
- * @last: range last address, in pages
- * @update_list: output, the ranges attributes are updated. For set_attr, this
- *   will do validation and map to GPUs. For unmap, this will be
- *   removed and unmap from GPUs
- * @insert_list: output, the ranges will be inserted into svms, attributes are
- *   not changes. For set_attr, this will add into svms.
- * @remove_list:output, the ranges will be removed from svms
- * @left: the remaining range after overlap, For set_attr, this will be added
- *as new range.
+ * svm_range_add - add svm range and handle overlap
+ * @p: the range add to this process svms
+ * @start: page size aligned
+ * @size: page size aligned
+ * @nattr: number of attributes
+ * @attrs: array of attributes
+ * @update_list: output, the ranges need validate and update GPU mapping
+ * @insert_list: output, the ranges need insert to svms
+ * @remove_list: output, the ranges are replaced and need remove from svms
  *
- * Total have 5 overlap cases.
+ * Check if the virtual address range has overlap with any existing ranges,
+ * split partly overlapping ranges and add new ranges in the gaps. All changes
+ * should be applied to the range_list and interval tree transactionally. If
+ * any range split or allocation fails, the entire update fails. Therefore any
+ * existing overlapping svm_ranges are cloned and the original svm_ranges left
+ * unchanged.
  *
- * This function handles overlap of an address interval with existing
- * struct svm_ranges for applying new attributes. This may require
- * splitting existing struct svm_ranges. All changes should be applied to
- * the range_list and interval tree transactionally. If any split operation
- * fails, the entire update fails. Therefore the existing overlapping
- * svm_ranges are cloned and the original svm_ranges left unchanged. If the
- * transaction succeeds, the modified clones are added and the originals
- * freed. Otherwise the clones are removed and the old svm_ranges remain.
+ * If the transaction succeeds, the caller can update and insert clones and
+ * new ranges, then free the originals.
  *
- * Context: The caller must hold svms->lock
+ * Otherwise the caller can free the clones and new ranges, while the old
+ * svm_ranges remain unchanged.
+ *
+ * Context: Process context, caller must hold svms->lock
+ *
+ * Return:
+ * 0 - OK, otherwise error code
  */
 static int
-svm_range_handle_overlap(struct svm_range_list *svms, struct svm_range *new,
-unsigned long start, unsigned long last,
-struct list_head *update_list,
-struct list_head *insert_list,
-struct list_head *remove_list,
-unsigned long *left)
+svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size,
+ uint32_t nattr, struct kfd_ioctl_svm_attribute *attrs,
+ struct list_head *update_list, struct list_head *insert_list,
+ struct list_head *remove_list)
 {
+   unsigned long last = start + size - 1UL;
+   struct svm_range_list *svms = &p->svms;
struct interval_tree_node *node;
+   struct svm_range new = {0};
struct svm_range 

[PATCH AUTOSEL 5.15 056/188] drm/amd/display: add else to avoid double destroy clk_mgr

2022-01-17 Thread Sasha Levin
From: Martin Leung 

[ Upstream commit 11dff0e871037a6ad978e52f826a2eb7f5fb274a ]

[Why & How]
when changing some code we accidentally
changed else if-> if. reverting that.

Reviewed-by: Aric Cyr 
Acked-by: Qingqing Zhuo 
Signed-off-by: Martin Leung 
Tested-by: Daniel Wheeler 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/display/dc/clk_mgr/clk_mgr.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/clk_mgr/clk_mgr.c 
b/drivers/gpu/drm/amd/display/dc/clk_mgr/clk_mgr.c
index bb31541f80723..6420527fe476c 100644
--- a/drivers/gpu/drm/amd/display/dc/clk_mgr/clk_mgr.c
+++ b/drivers/gpu/drm/amd/display/dc/clk_mgr/clk_mgr.c
@@ -306,8 +306,7 @@ void dc_destroy_clk_mgr(struct clk_mgr *clk_mgr_base)
 	case FAMILY_NV:
 		if (ASICREV_IS_SIENNA_CICHLID_P(clk_mgr_base->ctx->asic_id.hw_internal_rev)) {
 			dcn3_clk_mgr_destroy(clk_mgr);
-		}
-		if (ASICREV_IS_DIMGREY_CAVEFISH_P(clk_mgr_base->ctx->asic_id.hw_internal_rev)) {
+		} else if (ASICREV_IS_DIMGREY_CAVEFISH_P(clk_mgr_base->ctx->asic_id.hw_internal_rev)) {
 			dcn3_clk_mgr_destroy(clk_mgr);
 		}
 		if (ASICREV_IS_BEIGE_GOBY_P(clk_mgr_base->ctx->asic_id.hw_internal_rev)) {
-- 
2.34.1



[PATCH AUTOSEL 5.15 053/188] drm/amdgpu/display: set vblank_disable_immediate for DC

2022-01-17 Thread Sasha Levin
From: Alex Deucher 

[ Upstream commit 92020e81ddbeac351ea4a19bcf01743f32b9c800 ]

Disable vblanks immediately to save power.  I think this was
missed when we merged DC support.

Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1781
Reviewed-by: Harry Wentland 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c   | 1 -
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
index cc2e0c9cfe0a1..4f3c62adccbde 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
@@ -333,7 +333,6 @@ int amdgpu_irq_init(struct amdgpu_device *adev)
if (!amdgpu_device_has_dc_support(adev)) {
if (!adev->enable_virtual_display)
/* Disable vblank IRQs aggressively for power-saving */
-   /* XXX: can this be enabled for DC? */
adev_to_drm(adev)->vblank_disable_immediate = true;
 
 		r = drm_vblank_init(adev_to_drm(adev), adev->mode_info.num_crtc);
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index fef13e93a99fd..cbc67477a9a1f 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -1279,6 +1279,9 @@ static int amdgpu_dm_init(struct amdgpu_device *adev)
 	adev_to_drm(adev)->mode_config.cursor_width = adev->dm.dc->caps.max_cursor_size;
 	adev_to_drm(adev)->mode_config.cursor_height = adev->dm.dc->caps.max_cursor_size;
 
+   /* Disable vblank IRQs aggressively for power-saving */
+   adev_to_drm(adev)->vblank_disable_immediate = true;
+
if (drm_vblank_init(adev_to_drm(adev), adev->dm.display_indexes_num)) {
DRM_ERROR(
"amdgpu: failed to initialize sw for display support.\n");
-- 
2.34.1



[PATCH AUTOSEL 5.15 052/188] drm/amd/display: check top_pipe_to_program pointer

2022-01-17 Thread Sasha Levin
From: Yang Li 

[ Upstream commit a689e8d1f80012f90384ebac9dcfac4201f9f77e ]

Clang static analysis reports this error

drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c:2870:7: warning:
Dereference of null pointer [clang-analyzer-core.NullDereference]
if
(top_pipe_to_program->stream_res.tg->funcs->lock_doublebuffer_enable) {
^

top_pipe_to_program being NULL is caught as an error
But then it is used to report the error.

So add a check before using it.

Reported-by: Abaci Robot 
Signed-off-by: Yang Li 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/display/dc/core/dc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/dc/core/dc.c 
b/drivers/gpu/drm/amd/display/dc/core/dc.c
index c798c65d42765..1860ccc3f4f2c 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc.c
@@ -2703,7 +2703,8 @@ static void commit_planes_for_stream(struct dc *dc,
 #endif
 
 	if ((update_type != UPDATE_TYPE_FAST) && stream->update_flags.bits.dsc_changed)
-		if (top_pipe_to_program->stream_res.tg->funcs->lock_doublebuffer_enable) {
+		if (top_pipe_to_program &&
+		    top_pipe_to_program->stream_res.tg->funcs->lock_doublebuffer_enable) {
 			if (should_use_dmub_lock(stream->link)) {
 				union dmub_hw_lock_flags hw_locks = { 0 };
 				struct dmub_hw_lock_inst_flags inst_flags = { 0 };
-- 
2.34.1



[PATCH AUTOSEL 5.16 188/217] amdgpu/pm: Make sysfs pm attributes as read-only for VFs

2022-01-17 Thread Sasha Levin
From: Marina Nikolic 

[ Upstream commit 11c9cc95f818f0f187e9b579a7f136f532b42445 ]

== Description ==
Setting values of pm attributes through sysfs
should not be allowed in SRIOV mode.
These calls will not be processed by FW anyway,
but error handling on sysfs level should be improved.

== Changes ==
This patch prohibits performing of all set commands
in SRIOV mode on sysfs level.
It offers better error handling as calls that are
not allowed will not be propagated further.

== Test ==
Writing to any sysfs file in passthrough mode will succeed.
Writing to any sysfs file in ONEVF mode will yield error:
"calling process does not have sufficient permission to execute a command".

Signed-off-by: Marina Nikolic 
Acked-by: Evan Quan 
Reviewed-by: Lijo Lazar 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/pm/amdgpu_pm.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
index 41472ed992530..f8370d54100e8 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
@@ -2123,6 +2123,12 @@ static int default_attr_update(struct amdgpu_device 
*adev, struct amdgpu_device_
}
}
 
+   /* setting should not be allowed from VF */
+   if (amdgpu_sriov_vf(adev)) {
+   dev_attr->attr.mode &= ~S_IWUGO;
+   dev_attr->store = NULL;
+   }
+
 #undef DEVICE_ATTR_IS
 
return 0;
-- 
2.34.1



[PATCH AUTOSEL 5.16 186/217] drm/amdgpu: fixup bad vram size on gmc v8

2022-01-17 Thread Sasha Levin
From: Zongmin Zhou 

[ Upstream commit 11544d77e3974924c5a9c8a8320b996a3e9b2f8b ]

Some boards (like the RX550) seem to have garbage in the upper
16 bits of the VRAM size register. Check for this and clamp the
size properly. Fixes boards reporting bogus amounts of VRAM.

After this patch the maximum supported GPU VRAM size is 64GB
(0xffff MB), since only the lower 16 bits of the size register
are trusted.

Signed-off-by: Zongmin Zhou
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
index 492ebed2915be..63b890f1e8afb 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
@@ -515,10 +515,10 @@ static void gmc_v8_0_mc_program(struct amdgpu_device 
*adev)
 static int gmc_v8_0_mc_init(struct amdgpu_device *adev)
 {
int r;
+   u32 tmp;
 
adev->gmc.vram_width = amdgpu_atombios_get_vram_width(adev);
if (!adev->gmc.vram_width) {
-   u32 tmp;
int chansize, numchan;
 
/* Get VRAM informations */
@@ -562,8 +562,15 @@ static int gmc_v8_0_mc_init(struct amdgpu_device *adev)
adev->gmc.vram_width = numchan * chansize;
}
/* size in MB on si */
-   adev->gmc.mc_vram_size = RREG32(mmCONFIG_MEMSIZE) * 1024ULL * 1024ULL;
-   adev->gmc.real_vram_size = RREG32(mmCONFIG_MEMSIZE) * 1024ULL * 1024ULL;
+   tmp = RREG32(mmCONFIG_MEMSIZE);
+   /* some boards may have garbage in the upper 16 bits */
+   if (tmp & 0xffff0000) {
+   DRM_INFO("Probable bad vram size: 0x%08x\n", tmp);
+   if (tmp & 0xffff)
+   tmp &= 0xffff;
+   }
+   adev->gmc.mc_vram_size = tmp * 1024ULL * 1024ULL;
+   adev->gmc.real_vram_size = adev->gmc.mc_vram_size;
 
if (!(adev->flags & AMD_IS_APU)) {
r = amdgpu_device_resize_fb_bar(adev);
-- 
2.34.1



[PATCH AUTOSEL 5.16 185/217] drm/amdgpu: Don't inherit GEM object VMAs in child process

2022-01-17 Thread Sasha Levin
From: Rajneesh Bhardwaj 

[ Upstream commit fbcdbfde87509d523132b59f661a355c731139d0 ]

When an application with open file access to a node forks, its shared
mappings also get reflected in the address space of the child process,
even though the child cannot access them with the object permissions
applied. Given the existing permission checks on the GEM objects, it is
reasonable to also create the VMAs with the VM_DONTCOPY flag, so that a
user space application doesn't need to explicitly call the
madvise(addr, len, MADV_DONTFORK) system call to keep the pages in the
mapped range out of the child's address space. It also prevents memory
leaks from the additional reference counts the child holds on the
mapped BOs, which kept the parent from freeing the memory; we had
previously worked around this in user space inside the thunk library.

Additionally, we faced this issue when using CRIU to checkpoint restore
an application that had such inherited mappings in the child which
confuse CRIU when it mmaps on restore. Having this flag set for the
render node VMAs helps. VMAs mapped via KFD already take care of this so
this is needed only for the render nodes.

To limit the impact of the change to user space consumers such as OpenGL
etc, limit it to KFD BOs only.

Acked-by: Felix Kuehling 
Reviewed-by: Christian König 
Signed-off-by: David Yat Sin 
Signed-off-by: Rajneesh Bhardwaj 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index a1e63ba4c54a5..630dc99e49086 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -264,6 +264,9 @@ static int amdgpu_gem_object_mmap(struct drm_gem_object 
*obj, struct vm_area_str
!(vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
vma->vm_flags &= ~VM_MAYWRITE;
 
+   if (bo->kfd_bo)
+   vma->vm_flags |= VM_DONTCOPY;
+
return drm_gem_ttm_mmap(obj, vma);
 }
 
-- 
2.34.1



[PATCH AUTOSEL 5.16 141/217] drm/amd/amdgpu: fix gmc bo pin count leak in SRIOV

2022-01-17 Thread Sasha Levin
From: Jingwen Chen 

[ Upstream commit 948e7ce01413b71395723aaf846015062aea3a43 ]

[Why]
The gmc bo is pinned when loading amdgpu and again on reset in SRIOV,
but only unpinned when unloading amdgpu.

[How]
Add amdgpu_in_reset and SRIOV checks to skip re-pinning the bo.

v2: fix wrong judgement

Signed-off-by: Jingwen Chen 
Reviewed-by: Horace Chen 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 4 
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 4 
 2 files changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index 3ec5ff5a6dbe6..61ec6145bbb16 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -992,10 +992,14 @@ static int gmc_v10_0_gart_enable(struct amdgpu_device 
*adev)
return -EINVAL;
}
 
+   if (amdgpu_sriov_vf(adev) && amdgpu_in_reset(adev))
+   goto skip_pin_bo;
+
r = amdgpu_gart_table_vram_pin(adev);
if (r)
return r;
 
+skip_pin_bo:
r = adev->gfxhub.funcs->gart_enable(adev);
if (r)
return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index d84523cf5f759..4420c264c554c 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -1714,10 +1714,14 @@ static int gmc_v9_0_gart_enable(struct amdgpu_device 
*adev)
return -EINVAL;
}
 
+   if (amdgpu_sriov_vf(adev) && amdgpu_in_reset(adev))
+   goto skip_pin_bo;
+
r = amdgpu_gart_table_vram_pin(adev);
if (r)
return r;
 
+skip_pin_bo:
r = adev->gfxhub.funcs->gart_enable(adev);
if (r)
return r;
-- 
2.34.1



[PATCH AUTOSEL 5.16 140/217] drm/amd/amdgpu: fix psp tmr bo pin count leak in SRIOV

2022-01-17 Thread Sasha Levin
From: Jingwen Chen 

[ Upstream commit 85dfc1d692c9434c37842e610be37cd4ae4e0081 ]

[Why]
The psp tmr bo is pinned when loading amdgpu and again on reset in
SRIOV, but only unpinned when unloading amdgpu.

[How]
Add amdgpu_in_reset and SRIOV checks to skip re-pinning the bo.

v2: fix wrong judgement

Signed-off-by: Jingwen Chen 
Reviewed-by: Horace Chen 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index c641f84649d6b..d011ae7e50a54 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -2017,12 +2017,16 @@ static int psp_hw_start(struct psp_context *psp)
return ret;
}
 
+   if (amdgpu_sriov_vf(adev) && amdgpu_in_reset(adev))
+   goto skip_pin_bo;
+
ret = psp_tmr_init(psp);
if (ret) {
DRM_ERROR("PSP tmr init failed!\n");
return ret;
}
 
+skip_pin_bo:
/*
 * For ASICs with DF Cstate management centralized
 * to PMFW, TMR setup should be performed after PMFW
-- 
2.34.1



[PATCH AUTOSEL 5.16 131/217] drm/amdgpu: fix amdgpu_ras_mca_query_error_status scope

2022-01-17 Thread Sasha Levin
From: Isabella Basso 

[ Upstream commit 929bb8e200412da36aca4b61209ec26283f9c184 ]

This commit fixes the compile-time warning below:

 warning: no previous prototype for ‘amdgpu_ras_mca_query_error_status’
 [-Wmissing-prototypes]

Changes since v1:
- As suggested by Alexander Deucher:
  1. Make function static instead of adding prototype.

Signed-off-by: Isabella Basso 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 08133de21fdd6..26b7a4a0b44b7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -867,9 +867,9 @@ static int amdgpu_ras_enable_all_features(struct 
amdgpu_device *adev,
 /* feature ctl end */
 
 
-void amdgpu_ras_mca_query_error_status(struct amdgpu_device *adev,
-  struct ras_common_if *ras_block,
-  struct ras_err_data  *err_data)
+static void amdgpu_ras_mca_query_error_status(struct amdgpu_device *adev,
+ struct ras_common_if *ras_block,
+ struct ras_err_data  *err_data)
 {
switch (ras_block->sub_block_index) {
case AMDGPU_RAS_MCA_BLOCK__MP0:
-- 
2.34.1



[PATCH AUTOSEL 5.16 130/217] drm/amdkfd: Fix error handling in svm_range_add

2022-01-17 Thread Sasha Levin
From: Felix Kuehling 

[ Upstream commit 726be40607264b180a2b336c81e1dcff941de618 ]

Add null-pointer check after the last svm_range_new call. This was
originally reported by Zhou Qingyang  based on a
static analyzer.

To avoid duplicating the unwinding code from svm_range_handle_overlap,
I merged the two functions into one.

Signed-off-by: Felix Kuehling 
Cc: Zhou Qingyang 
Reviewed-by: Philip Yang 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 138 ++-
 1 file changed, 49 insertions(+), 89 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 3cb4681c5f539..c0b8f4ff80b8a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -943,7 +943,7 @@ svm_range_split(struct svm_range *prange, uint64_t start, 
uint64_t last,
 }
 
 static int
-svm_range_split_tail(struct svm_range *prange, struct svm_range *new,
+svm_range_split_tail(struct svm_range *prange,
 uint64_t new_last, struct list_head *insert_list)
 {
struct svm_range *tail;
@@ -955,7 +955,7 @@ svm_range_split_tail(struct svm_range *prange, struct 
svm_range *new,
 }
 
 static int
-svm_range_split_head(struct svm_range *prange, struct svm_range *new,
+svm_range_split_head(struct svm_range *prange,
 uint64_t new_start, struct list_head *insert_list)
 {
struct svm_range *head;
@@ -1764,49 +1764,54 @@ static struct svm_range *svm_range_clone(struct 
svm_range *old)
 }
 
 /**
- * svm_range_handle_overlap - split overlap ranges
- * @svms: svm range list header
- * @new: range added with this attributes
- * @start: range added start address, in pages
- * @last: range last address, in pages
- * @update_list: output, the ranges attributes are updated. For set_attr, this
- *   will do validation and map to GPUs. For unmap, this will be
- *   removed and unmap from GPUs
- * @insert_list: output, the ranges will be inserted into svms, attributes are
- *   not changes. For set_attr, this will add into svms.
- * @remove_list:output, the ranges will be removed from svms
- * @left: the remaining range after overlap, For set_attr, this will be added
- *as new range.
+ * svm_range_add - add svm range and handle overlap
+ * @p: the range add to this process svms
+ * @start: page size aligned
+ * @size: page size aligned
+ * @nattr: number of attributes
+ * @attrs: array of attributes
+ * @update_list: output, the ranges need validate and update GPU mapping
+ * @insert_list: output, the ranges need insert to svms
+ * @remove_list: output, the ranges are replaced and need remove from svms
  *
- * Total have 5 overlap cases.
+ * Check if the virtual address range has overlap with any existing ranges,
+ * split partly overlapping ranges and add new ranges in the gaps. All changes
+ * should be applied to the range_list and interval tree transactionally. If
+ * any range split or allocation fails, the entire update fails. Therefore any
+ * existing overlapping svm_ranges are cloned and the original svm_ranges left
+ * unchanged.
  *
- * This function handles overlap of an address interval with existing
- * struct svm_ranges for applying new attributes. This may require
- * splitting existing struct svm_ranges. All changes should be applied to
- * the range_list and interval tree transactionally. If any split operation
- * fails, the entire update fails. Therefore the existing overlapping
- * svm_ranges are cloned and the original svm_ranges left unchanged. If the
- * transaction succeeds, the modified clones are added and the originals
- * freed. Otherwise the clones are removed and the old svm_ranges remain.
+ * If the transaction succeeds, the caller can update and insert clones and
+ * new ranges, then free the originals.
  *
- * Context: The caller must hold svms->lock
+ * Otherwise the caller can free the clones and new ranges, while the old
+ * svm_ranges remain unchanged.
+ *
+ * Context: Process context, caller must hold svms->lock
+ *
+ * Return:
+ * 0 - OK, otherwise error code
  */
 static int
-svm_range_handle_overlap(struct svm_range_list *svms, struct svm_range *new,
-unsigned long start, unsigned long last,
-struct list_head *update_list,
-struct list_head *insert_list,
-struct list_head *remove_list,
-unsigned long *left)
+svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size,
+ uint32_t nattr, struct kfd_ioctl_svm_attribute *attrs,
+ struct list_head *update_list, struct list_head *insert_list,
+ struct list_head *remove_list)
 {
+   unsigned long last = start + size - 1UL;
+   struct svm_range_list *svms = &p->svms;
struct interval_tree_node *node;
+   struct svm_range new = {0};
struct svm_range 

[PATCH AUTOSEL 5.16 089/217] drm/amd/display: Use oriented source size when checking cursor scaling

2022-01-17 Thread Sasha Levin
From: Vlad Zahorodnii 

[ Upstream commit 69cb56290d9d10cdcc461aa2685e67e540507a96 ]

dm_check_crtc_cursor() doesn't take into account plane transforms when
calculating plane scaling, this can result in false positives.

For example, if there's an output with resolution 3840x2160 and the
output is rotated 90 degrees, CRTC_W and CRTC_H will be 3840 and 2160,
respectively, but SRC_W and SRC_H will be 2160 and 3840, respectively.

Since the cursor plane usually has a square buffer attached to it, the
dm_check_crtc_cursor() will think that there's a scale factor mismatch
even though there isn't really.

This fixes an issue where kwin fails to use hardware plane transforms.
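
To make the false positive concrete, here is a worked example using the
numbers above (a sketch for illustration only; the variable names are
ours, not the driver's):

    /* 3840x2160 output rotated 90 degrees */
    int crtc_w = 3840, crtc_h = 2160;     /* CRTC size */
    int src_w  = 2160, src_h  = 3840;     /* unoriented SRC size */

    int scale_w = crtc_w * 1000 / src_w;  /* 1777 -> reads as ~1.78x */
    int scale_h = crtc_h * 1000 / src_h;  /*  562 -> reads as ~0.56x */

    /*
     * With dm_get_oriented_plane_size() swapping SRC_W/SRC_H for 90/270
     * degree rotations, src becomes 3840x2160 and both scale factors are
     * 1000 (1:1), matching the unscaled cursor plane.
     */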

Changes since version 1:
- s/orientated/oriented/g

Reviewed-by: Nicholas Kazlauskas 
Signed-off-by: Vlad Zahorodnii 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 35 ++-
 1 file changed, 27 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index e08ac474e9d59..21ff6b232fb62 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -10661,6 +10661,24 @@ static int dm_update_plane_state(struct dc *dc,
return ret;
 }
 
+static void dm_get_oriented_plane_size(struct drm_plane_state *plane_state,
+  int *src_w, int *src_h)
+{
+   switch (plane_state->rotation & DRM_MODE_ROTATE_MASK) {
+   case DRM_MODE_ROTATE_90:
+   case DRM_MODE_ROTATE_270:
+   *src_w = plane_state->src_h >> 16;
+   *src_h = plane_state->src_w >> 16;
+   break;
+   case DRM_MODE_ROTATE_0:
+   case DRM_MODE_ROTATE_180:
+   default:
+   *src_w = plane_state->src_w >> 16;
+   *src_h = plane_state->src_h >> 16;
+   break;
+   }
+}
+
 static int dm_check_crtc_cursor(struct drm_atomic_state *state,
struct drm_crtc *crtc,
struct drm_crtc_state *new_crtc_state)
@@ -10669,6 +10687,8 @@ static int dm_check_crtc_cursor(struct drm_atomic_state *state,
struct drm_plane_state *new_cursor_state, *new_underlying_state;
int i;
int cursor_scale_w, cursor_scale_h, underlying_scale_w, 
underlying_scale_h;
+   int cursor_src_w, cursor_src_h;
+   int underlying_src_w, underlying_src_h;
 
/* On DCE and DCN there is no dedicated hardware cursor plane. We get a
 * cursor per pipe but it's going to inherit the scaling and
@@ -10680,10 +10700,9 @@ static int dm_check_crtc_cursor(struct drm_atomic_state *state,
return 0;
}
 
-	cursor_scale_w = new_cursor_state->crtc_w * 1000 /
-			 (new_cursor_state->src_w >> 16);
-	cursor_scale_h = new_cursor_state->crtc_h * 1000 /
-			 (new_cursor_state->src_h >> 16);
+	dm_get_oriented_plane_size(new_cursor_state, &cursor_src_w, &cursor_src_h);
+	cursor_scale_w = new_cursor_state->crtc_w * 1000 / cursor_src_w;
+	cursor_scale_h = new_cursor_state->crtc_h * 1000 / cursor_src_h;
 
 	for_each_new_plane_in_state_reverse(state, underlying, new_underlying_state, i) {
 		/* Narrow down to non-cursor planes on the same CRTC as the cursor */
@@ -10694,10 +10713,10 @@ static int dm_check_crtc_cursor(struct drm_atomic_state *state,
 		if (!new_underlying_state->fb)
 			continue;
 
-		underlying_scale_w = new_underlying_state->crtc_w * 1000 /
-				     (new_underlying_state->src_w >> 16);
-		underlying_scale_h = new_underlying_state->crtc_h * 1000 /
-				     (new_underlying_state->src_h >> 16);
+		dm_get_oriented_plane_size(new_underlying_state,
+					   &underlying_src_w, &underlying_src_h);
+		underlying_scale_w = new_underlying_state->crtc_w * 1000 / underlying_src_w;
+		underlying_scale_h = new_underlying_state->crtc_h * 1000 / underlying_src_h;
 
if (cursor_scale_w != underlying_scale_w ||
cursor_scale_h != underlying_scale_h) {
-- 
2.34.1



[PATCH AUTOSEL 5.16 064/217] drm/amd/display: add else to avoid double destroy clk_mgr

2022-01-17 Thread Sasha Levin
From: Martin Leung 

[ Upstream commit 11dff0e871037a6ad978e52f826a2eb7f5fb274a ]

[Why & How]
When changing some code, we accidentally
changed an 'else if' to an 'if'; reverting that.

Reviewed-by: Aric Cyr 
Acked-by: Qingqing Zhuo 
Signed-off-by: Martin Leung 
Tested-by: Daniel Wheeler 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/display/dc/clk_mgr/clk_mgr.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/clk_mgr/clk_mgr.c 
b/drivers/gpu/drm/amd/display/dc/clk_mgr/clk_mgr.c
index 26f96ee324729..9200c8ce02ba9 100644
--- a/drivers/gpu/drm/amd/display/dc/clk_mgr/clk_mgr.c
+++ b/drivers/gpu/drm/amd/display/dc/clk_mgr/clk_mgr.c
@@ -308,8 +308,7 @@ void dc_destroy_clk_mgr(struct clk_mgr *clk_mgr_base)
 	case FAMILY_NV:
 		if (ASICREV_IS_SIENNA_CICHLID_P(clk_mgr_base->ctx->asic_id.hw_internal_rev)) {
 			dcn3_clk_mgr_destroy(clk_mgr);
-		}
-		if (ASICREV_IS_DIMGREY_CAVEFISH_P(clk_mgr_base->ctx->asic_id.hw_internal_rev)) {
+		} else if (ASICREV_IS_DIMGREY_CAVEFISH_P(clk_mgr_base->ctx->asic_id.hw_internal_rev)) {
 			dcn3_clk_mgr_destroy(clk_mgr);
 		}
 		if (ASICREV_IS_BEIGE_GOBY_P(clk_mgr_base->ctx->asic_id.hw_internal_rev)) {
-- 
2.34.1



[PATCH AUTOSEL 5.16 061/217] drm/amdgpu/display: set vblank_disable_immediate for DC

2022-01-17 Thread Sasha Levin
From: Alex Deucher 

[ Upstream commit 92020e81ddbeac351ea4a19bcf01743f32b9c800 ]

Disable vblanks immediately to save power.  I think this was
missed when we merged DC support.

Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1781
Reviewed-by: Harry Wentland 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c   | 1 -
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
index cc2e0c9cfe0a1..4f3c62adccbde 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
@@ -333,7 +333,6 @@ int amdgpu_irq_init(struct amdgpu_device *adev)
if (!amdgpu_device_has_dc_support(adev)) {
if (!adev->enable_virtual_display)
/* Disable vblank IRQs aggressively for power-saving */
-   /* XXX: can this be enabled for DC? */
adev_to_drm(adev)->vblank_disable_immediate = true;
 
r = drm_vblank_init(adev_to_drm(adev), 
adev->mode_info.num_crtc);
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index e727f1dd2a9a7..e08ac474e9d59 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -1597,6 +1597,9 @@ static int amdgpu_dm_init(struct amdgpu_device *adev)
adev_to_drm(adev)->mode_config.cursor_width = 
adev->dm.dc->caps.max_cursor_size;
adev_to_drm(adev)->mode_config.cursor_height = 
adev->dm.dc->caps.max_cursor_size;
 
+   /* Disable vblank IRQs aggressively for power-saving */
+   adev_to_drm(adev)->vblank_disable_immediate = true;
+
if (drm_vblank_init(adev_to_drm(adev), adev->dm.display_indexes_num)) {
DRM_ERROR(
"amdgpu: failed to initialize sw for display support.\n");
-- 
2.34.1



[PATCH AUTOSEL 5.16 060/217] drm/amd/display: check top_pipe_to_program pointer

2022-01-17 Thread Sasha Levin
From: Yang Li 

[ Upstream commit a689e8d1f80012f90384ebac9dcfac4201f9f77e ]

Clang static analysis reports this error

drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c:2870:7: warning:
Dereference of null pointer [clang-analyzer-core.NullDereference]
if
(top_pipe_to_program->stream_res.tg->funcs->lock_doublebuffer_enable) {
^

top_pipe_to_program being NULL is caught as an error,
but then it is used to report the error.
So add a check before using it.
So add a check before using it.

Reported-by: Abaci Robot 
Signed-off-by: Yang Li 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/display/dc/core/dc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/dc/core/dc.c 
b/drivers/gpu/drm/amd/display/dc/core/dc.c
index 0ded4decee05f..f0fbd8ad56229 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc.c
@@ -2870,7 +2870,8 @@ static void commit_planes_for_stream(struct dc *dc,
 #endif
 
 	if ((update_type != UPDATE_TYPE_FAST) && stream->update_flags.bits.dsc_changed)
-		if (top_pipe_to_program->stream_res.tg->funcs->lock_doublebuffer_enable) {
+		if (top_pipe_to_program &&
+		    top_pipe_to_program->stream_res.tg->funcs->lock_doublebuffer_enable) {
 			if (should_use_dmub_lock(stream->link)) {
 				union dmub_hw_lock_flags hw_locks = { 0 };
 				struct dmub_hw_lock_inst_flags inst_flags = { 0 };
-- 
2.34.1



RE: [PATCH 1/2] drm/amdgpu: fix build up and tear down of debug vram access bounce buffer

2022-01-17 Thread Chen, Guchun
[Public]

Thanks for your quick fix, Jonathan.

I wonder if it's better to encapsulate the bo creation into a function and put 
the function definition in amdgpu_sdma or amdgpu_ttm?
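
For illustration, a minimal sketch of what such a helper could look like
if it lived in amdgpu_ttm (the function name here is made up, not part of
the patch; the body mirrors the hunk being discussed):

    static void amdgpu_ttm_debug_sdma_bo_init(struct amdgpu_device *adev)
    {
            /* GTT bounce buffer for debug VRAM access over SDMA */
            if (amdgpu_bo_create_kernel(adev, PAGE_SIZE, PAGE_SIZE,
                                        AMDGPU_GEM_DOMAIN_GTT,
                                        &adev->mman.sdma_access_bo, NULL,
                                        &adev->mman.sdma_access_ptr))
                    DRM_WARN("Debug VRAM access will use slowpath MM access\n");
    }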

Regards,
Guchun

-Original Message-
From: Kim, Jonathan  
Sent: Tuesday, January 18, 2022 7:43 AM
To: amd-gfx@lists.freedesktop.org
Cc: Kuehling, Felix ; Koenig, Christian 
; Chen, Guchun ; Kim, Jonathan 

Subject: [PATCH 1/2] drm/amdgpu: fix build up and tear down of debug vram 
access bounce buffer

Move the debug SDMA VRAM bounce buffer GART mapping at device init to after
GART is ready, to avoid warnings and failed SDMA access.

Also move bounce buffer tear down after the memory manager has flushed queued 
work for safety.

Signed-off-by: Jonathan Kim 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c|  8 
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index da3348fa7b0e..099460d15258 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2378,6 +2378,13 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
*adev)
if (r)
goto init_failed;
 
+   /* GTT bounce buffer for debug vram access over sdma. */
+   if (amdgpu_bo_create_kernel(adev, PAGE_SIZE, PAGE_SIZE,
+   AMDGPU_GEM_DOMAIN_GTT,
+   &adev->mman.sdma_access_bo, NULL,
+   &adev->mman.sdma_access_ptr))
+   DRM_WARN("Debug VRAM access will use slowpath MM access\n");
+
/*
 * retired pages will be loaded from eeprom and reserved here,
 	 * it should be called after amdgpu_device_ip_hw_init_phase2  since
@@ -3872,6 +3879,10 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
}
adev->shutdown = true;
 
+   /* remove debug vram sdma access bounce buffer. */
+   amdgpu_bo_free_kernel(&adev->mman.sdma_access_bo, NULL,
+   &adev->mman.sdma_access_ptr);
+
/* make sure IB test finished before entering exclusive mode
 * to avoid preemption on IB test
 * */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index b489cd8abe31..6178ae7ba624 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1855,12 +1855,6 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
return r;
}
 
-   if (amdgpu_bo_create_kernel(adev, PAGE_SIZE, PAGE_SIZE,
-   AMDGPU_GEM_DOMAIN_GTT,
-   &adev->mman.sdma_access_bo, NULL,
-   adev->mman.sdma_access_ptr))
-   DRM_WARN("Debug VRAM access will use slowpath MM access\n");
-
return 0;
 }
 
@@ -1901,8 +1895,6 @@ void amdgpu_ttm_fini(struct amdgpu_device *adev)
ttm_range_man_fini(&adev->mman.bdev, AMDGPU_PL_OA);
ttm_device_fini(&adev->mman.bdev);
adev->mman.initialized = false;
-   amdgpu_bo_free_kernel(&adev->mman.sdma_access_bo, NULL,
-   &adev->mman.sdma_access_ptr);
DRM_INFO("amdgpu: ttm finalized\n");
 }
 
--
2.25.1


Re: [PATCH] drm/amd/pm: Enable sysfs required by rocm-smi tool for One VF mode

2022-01-17 Thread Wang, Yang(Kevin)
[AMD Official Use Only]

Reviewed-by: Kevin Wang 

Best Regards,
Kevin

From: amd-gfx  on behalf of Marina 
Nikolic 
Sent: Tuesday, January 18, 2022 12:21 AM
To: amd-gfx@lists.freedesktop.org 
Cc: Mitrovic, Milan ; Veljkovic, Nikola 
; Nikolic, Marina ; Kitchen, 
Greg 
Subject: [PATCH] drm/amd/pm: Enable sysfs required by rocm-smi tool for One VF 
mode

Enable power level, power limit and fan speed
information retrieval in one VF mode.
This is required so that the ROCm-SMI tool
can provide this information to users.
---
 drivers/gpu/drm/amd/pm/amdgpu_pm.c  | 17 ++---
 .../drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c |  2 +-
 2 files changed, 3 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
index 1b03ad7a21ad..bb7d03cd814c 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
@@ -1903,8 +1903,8 @@ static struct amdgpu_device_attr amdgpu_device_attrs[] = {
 AMDGPU_DEVICE_ATTR_RW(pp_dpm_fclk,  
ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
 AMDGPU_DEVICE_ATTR_RW(pp_dpm_vclk,  
ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
 AMDGPU_DEVICE_ATTR_RW(pp_dpm_dclk,  
ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
-	AMDGPU_DEVICE_ATTR_RW(pp_dpm_dcefclk,		ATTR_FLAG_BASIC),
-	AMDGPU_DEVICE_ATTR_RW(pp_dpm_pcie,		ATTR_FLAG_BASIC),
+	AMDGPU_DEVICE_ATTR_RW(pp_dpm_dcefclk,		ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
+	AMDGPU_DEVICE_ATTR_RW(pp_dpm_pcie,		ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
 AMDGPU_DEVICE_ATTR_RW(pp_sclk_od,   
ATTR_FLAG_BASIC),
 AMDGPU_DEVICE_ATTR_RW(pp_mclk_od,   
ATTR_FLAG_BASIC),
 AMDGPU_DEVICE_ATTR_RW(pp_power_profile_mode,
ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
@@ -3152,19 +3152,6 @@ static umode_t hwmon_attributes_visible(struct kobject 
*kobj,
 if (amdgpu_sriov_vf(adev) && !amdgpu_sriov_is_pp_one_vf(adev))
 return 0;

-   /* there is no fan under pp one vf mode */
-   if (amdgpu_sriov_is_pp_one_vf(adev) &&
-   (attr == &sensor_dev_attr_pwm1.dev_attr.attr ||
-attr == &sensor_dev_attr_pwm1_enable.dev_attr.attr ||
-attr == &sensor_dev_attr_pwm1_max.dev_attr.attr ||
-attr == &sensor_dev_attr_pwm1_min.dev_attr.attr ||
-attr == &sensor_dev_attr_fan1_input.dev_attr.attr ||
-attr == &sensor_dev_attr_fan1_min.dev_attr.attr ||
-attr == &sensor_dev_attr_fan1_max.dev_attr.attr ||
-attr == &sensor_dev_attr_fan1_target.dev_attr.attr ||
-attr == &sensor_dev_attr_fan1_enable.dev_attr.attr))
-   return 0;
-
 /* Skip fan attributes if fan is not present */
 if (adev->pm.no_fan && (attr == &sensor_dev_attr_pwm1.dev_attr.attr ||
 attr == &sensor_dev_attr_pwm1_enable.dev_attr.attr ||
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
index 4e37cd8025ed..7ea0427f65d3 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
@@ -134,7 +134,7 @@ static struct cmn2asic_msg_mapping sienna_cichlid_message_map[SMU_MSG_MAX_COUNT]
 	MSG_MAP(PrepareMp1ForUnload,	PPSMC_MSG_PrepareMp1ForUnload,	1),
 	MSG_MAP(AllowGfxOff,		PPSMC_MSG_AllowGfxOff,		0),
 	MSG_MAP(DisallowGfxOff,		PPSMC_MSG_DisallowGfxOff,	0),
-	MSG_MAP(GetPptLimit,		PPSMC_MSG_GetPptLimit,		0),
+	MSG_MAP(GetPptLimit,		PPSMC_MSG_GetPptLimit,		1),
 	MSG_MAP(GetDcModeMaxDpmFreq,	PPSMC_MSG_GetDcModeMaxDpmFreq,	1),
 	MSG_MAP(ExitBaco,		PPSMC_MSG_ExitBaco,		0),
 	MSG_MAP(PowerUpVcn,		PPSMC_MSG_PowerUpVcn,		0),
--
2.20.1



RE: [PATCH] drm/amd/pm: Enable sysfs required by rocm-smi tool for One VF mode

2022-01-17 Thread Quan, Evan
[AMD Official Use Only]

Acked-by: Evan Quan 

> -Original Message-
> From: amd-gfx  On Behalf Of
> Marina Nikolic
> Sent: Tuesday, January 18, 2022 12:22 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Mitrovic, Milan ; Veljkovic, Nikola
> ; Nikolic, Marina ;
> Kitchen, Greg 
> Subject: [PATCH] drm/amd/pm: Enable sysfs required by rocm-smi tool for
> One VF mode
> 
> Enable power level, power limit and fan speed
> information retrieval in one VF mode.
> This is required so that the ROCm-SMI tool
> can provide this information to users.
> ---
>  drivers/gpu/drm/amd/pm/amdgpu_pm.c  | 17 ++---
>  .../drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c |  2 +-
>  2 files changed, 3 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> index 1b03ad7a21ad..bb7d03cd814c 100644
> --- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> +++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> @@ -1903,8 +1903,8 @@ static struct amdgpu_device_attr
> amdgpu_device_attrs[] = {
>   AMDGPU_DEVICE_ATTR_RW(pp_dpm_fclk,
>   ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
>   AMDGPU_DEVICE_ATTR_RW(pp_dpm_vclk,
>   ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
>   AMDGPU_DEVICE_ATTR_RW(pp_dpm_dclk,
>   ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
> -	AMDGPU_DEVICE_ATTR_RW(pp_dpm_dcefclk,		ATTR_FLAG_BASIC),
> -	AMDGPU_DEVICE_ATTR_RW(pp_dpm_pcie,		ATTR_FLAG_BASIC),
> +	AMDGPU_DEVICE_ATTR_RW(pp_dpm_dcefclk,		ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
> +	AMDGPU_DEVICE_ATTR_RW(pp_dpm_pcie,		ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
>   AMDGPU_DEVICE_ATTR_RW(pp_sclk_od,
>   ATTR_FLAG_BASIC),
>   AMDGPU_DEVICE_ATTR_RW(pp_mclk_od,
>   ATTR_FLAG_BASIC),
>   AMDGPU_DEVICE_ATTR_RW(pp_power_profile_mode,
>   ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
> @@ -3152,19 +3152,6 @@ static umode_t hwmon_attributes_visible(struct
> kobject *kobj,
>   if (amdgpu_sriov_vf(adev) && !amdgpu_sriov_is_pp_one_vf(adev))
>   return 0;
> 
> - /* there is no fan under pp one vf mode */
> - if (amdgpu_sriov_is_pp_one_vf(adev) &&
> - (attr == &sensor_dev_attr_pwm1.dev_attr.attr ||
> -  attr == &sensor_dev_attr_pwm1_enable.dev_attr.attr ||
> -  attr == &sensor_dev_attr_pwm1_max.dev_attr.attr ||
> -  attr == &sensor_dev_attr_pwm1_min.dev_attr.attr ||
> -  attr == &sensor_dev_attr_fan1_input.dev_attr.attr ||
> -  attr == &sensor_dev_attr_fan1_min.dev_attr.attr ||
> -  attr == &sensor_dev_attr_fan1_max.dev_attr.attr ||
> -  attr == &sensor_dev_attr_fan1_target.dev_attr.attr ||
> -  attr == &sensor_dev_attr_fan1_enable.dev_attr.attr))
> - return 0;
> -
>   /* Skip fan attributes if fan is not present */
>   if (adev->pm.no_fan && (attr ==
> &sensor_dev_attr_pwm1.dev_attr.attr ||
>   attr == &sensor_dev_attr_pwm1_enable.dev_attr.attr ||
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> index 4e37cd8025ed..7ea0427f65d3 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> @@ -134,7 +134,7 @@ static struct cmn2asic_msg_mapping sienna_cichlid_message_map[SMU_MSG_MAX_COUNT]
>  	MSG_MAP(PrepareMp1ForUnload,	PPSMC_MSG_PrepareMp1ForUnload,	1),
>  	MSG_MAP(AllowGfxOff,		PPSMC_MSG_AllowGfxOff,		0),
>  	MSG_MAP(DisallowGfxOff,		PPSMC_MSG_DisallowGfxOff,	0),
> -	MSG_MAP(GetPptLimit,		PPSMC_MSG_GetPptLimit,		0),
> +	MSG_MAP(GetPptLimit,		PPSMC_MSG_GetPptLimit,		1),
>  	MSG_MAP(GetDcModeMaxDpmFreq,	PPSMC_MSG_GetDcModeMaxDpmFreq,	1),
>  	MSG_MAP(ExitBaco,		PPSMC_MSG_ExitBaco,		0),
>  	MSG_MAP(PowerUpVcn,		PPSMC_MSG_PowerUpVcn,		0),
> --
> 2.20.1


[PATCH 2/2] drm/amdgpu: fix debug sdma vram access checks

2022-01-17 Thread Jonathan Kim
drm_dev_enter() returns true when the device is accessible, so correct the
error check.

The source VRAM buffer is reserved by TTM but not pinned, so the GPU offset
fetch throws a warning.  Get the GPU offset without a check, then double
check that the source buffer hasn't fallen into a hole.  Otherwise fall
back to MMIO access to deal with non-contiguous VRAM cases as usual.

Signed-off-by: Jonathan Kim 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 6178ae7ba624..0acfd872bfef 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1440,6 +1440,7 @@ static int amdgpu_ttm_access_memory_sdma(struct 
ttm_buffer_object *bo,
struct dma_fence *fence;
uint64_t src_addr, dst_addr;
unsigned int num_dw;
+   bool vram_is_contiguous;
int r, idx;
 
if (len != PAGE_SIZE)
@@ -1448,9 +1449,8 @@ static int amdgpu_ttm_access_memory_sdma(struct 
ttm_buffer_object *bo,
if (!adev->mman.sdma_access_ptr)
return -EACCES;
 
-   r = drm_dev_enter(adev_to_drm(adev), &idx);
-   if (r)
-   return r;
+   if (!drm_dev_enter(adev_to_drm(adev), &idx))
+   return -ENODEV;
 
if (write)
memcpy(adev->mman.sdma_access_ptr, buf, len);
@@ -1460,7 +1460,18 @@ static int amdgpu_ttm_access_memory_sdma(struct 
ttm_buffer_object *bo,
if (r)
goto out;
 
-   src_addr = amdgpu_bo_gpu_offset(abo);
+   src_addr = amdgpu_bo_gpu_offset_no_check(abo);
+   vram_is_contiguous = (adev->gmc.real_vram_size - 1 ==
+   adev->gmc.vram_end - adev->gmc.vram_start) &&
+   src_addr >= adev->gmc.vram_start &&
+   src_addr + len <= adev->gmc.vram_end;
+
+   if (!vram_is_contiguous) {
+   amdgpu_job_free(job);
+   r = -EACCES;
+   goto out;
+   }
+
dst_addr = amdgpu_bo_gpu_offset(adev->mman.sdma_access_bo);
if (write)
swap(src_addr, dst_addr);
-- 
2.25.1



[PATCH 1/2] drm/amdgpu: fix build up and tear down of debug vram access bounce buffer

2022-01-17 Thread Jonathan Kim
Move the debug SDMA VRAM bounce buffer GART mapping at device init to after
GART is ready, to avoid warnings and failed SDMA access.

Also move bounce buffer tear down after the memory manager has flushed
queued work for safety.

Signed-off-by: Jonathan Kim 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c|  8 
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index da3348fa7b0e..099460d15258 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2378,6 +2378,13 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
*adev)
if (r)
goto init_failed;
 
+   /* GTT bounce buffer for debug vram access over sdma. */
+   if (amdgpu_bo_create_kernel(adev, PAGE_SIZE, PAGE_SIZE,
+   AMDGPU_GEM_DOMAIN_GTT,
+   &adev->mman.sdma_access_bo, NULL,
+   &adev->mman.sdma_access_ptr))
+   DRM_WARN("Debug VRAM access will use slowpath MM access\n");
+
/*
 * retired pages will be loaded from eeprom and reserved here,
 * it should be called after amdgpu_device_ip_hw_init_phase2  since
@@ -3872,6 +3879,10 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
}
adev->shutdown = true;
 
+   /* remove debug vram sdma access bounce buffer. */
+   amdgpu_bo_free_kernel(&adev->mman.sdma_access_bo, NULL,
+   &adev->mman.sdma_access_ptr);
+
/* make sure IB test finished before entering exclusive mode
 * to avoid preemption on IB test
 * */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index b489cd8abe31..6178ae7ba624 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1855,12 +1855,6 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
return r;
}
 
-   if (amdgpu_bo_create_kernel(adev, PAGE_SIZE, PAGE_SIZE,
-   AMDGPU_GEM_DOMAIN_GTT,
-   &adev->mman.sdma_access_bo, NULL,
-   adev->mman.sdma_access_ptr))
-   DRM_WARN("Debug VRAM access will use slowpath MM access\n");
-
return 0;
 }
 
@@ -1901,8 +1895,6 @@ void amdgpu_ttm_fini(struct amdgpu_device *adev)
ttm_range_man_fini(&adev->mman.bdev, AMDGPU_PL_OA);
ttm_device_fini(&adev->mman.bdev);
adev->mman.initialized = false;
-   amdgpu_bo_free_kernel(&adev->mman.sdma_access_bo, NULL,
-   &adev->mman.sdma_access_ptr);
DRM_INFO("amdgpu: ttm finalized\n");
 }
 
-- 
2.25.1



[PATCH] drm: Add PSR version 4 macro

2022-01-17 Thread sunpeng.li
From: Leo Li 

The eDP 1.5 specification defines PSR version 4.

It defines PSR1 and PSR2 support with selective-update (SU)
capabilities, with additional support for Y-coordinate and Early
Transport of the selective-update region.

This differs from PSR version 3 in that early transport is supported
for version 4, but not for version 3.
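
For context, a minimal sketch of how a driver might use the new macro when
probing the sink's PSR capability over DPCD (assumes a valid AUX channel;
this usage is not part of the patch):

    u8 psr_version;

    if (drm_dp_dpcd_readb(aux, DP_PSR_SUPPORT, &psr_version) == 1 &&
        psr_version >= DP_PSR2_WITH_ET_IS_SUPPORTED) {
            /* sink supports PSR2 selective update with early transport */
    }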

Signed-off-by: Leo Li 
---
 include/drm/drm_dp_helper.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/drm/drm_dp_helper.h b/include/drm/drm_dp_helper.h
index 3f2715eb965f..05268c51acaa 100644
--- a/include/drm/drm_dp_helper.h
+++ b/include/drm/drm_dp_helper.h
@@ -360,6 +360,7 @@ struct drm_dp_aux;
 # define DP_PSR_IS_SUPPORTED1
 # define DP_PSR2_IS_SUPPORTED  2   /* eDP 1.4 */
 # define DP_PSR2_WITH_Y_COORD_IS_SUPPORTED  3  /* eDP 1.4a */
+# define DP_PSR2_WITH_ET_IS_SUPPORTED   4  /* eDP 1.5 (eDP 1.4b SCR) */
 
 #define DP_PSR_CAPS 0x071   /* XXX 1.2? */
 # define DP_PSR_NO_TRAIN_ON_EXIT1
-- 
2.34.1



Re: [PATCH v11 01/19] dyndbg: add _DPRINTK_FLAGS_ENABLED

2022-01-17 Thread jim . cromie
On Fri, Jan 14, 2022 at 4:57 AM Vincent Whitchurch
 wrote:
>
> On Fri, Jan 07, 2022 at 06:29:24AM +0100, Jim Cromie wrote:
> >  #ifdef CONFIG_JUMP_LABEL
> > - if (dp->flags & _DPRINTK_FLAGS_PRINT) {
> > - if (!(modifiers->flags & 
> > _DPRINTK_FLAGS_PRINT))
> > + if (dp->flags & _DPRINTK_FLAGS_ENABLED) {
> > + if (!(modifiers->flags & 
> > _DPRINTK_FLAGS_ENABLED))
> >   
> > static_branch_disable(&dp->key.dd_key_true);
> > - } else if (modifiers->flags & _DPRINTK_FLAGS_PRINT)
> > + } else if (modifiers->flags & _DPRINTK_FLAGS_ENABLED)
> >   static_branch_enable(&dp->key.dd_key_true);
> >  #endif
> >   dp->flags = newflags;
> > --
> > 2.33.1
> >
>
> I haven't tested it so I could be mistaken, but when
> _DPRINTK_FLAGS_ENABLED gets two flags in the next patch, it looks like
> this code still has the problem which I mentioned in
> https://lore.kernel.org/lkml/20211209150910.ga23...@axis.com/?
>

Yes, thanks for noticing.  I missed that detail.
A priori, I don't know why the bit-and of the bit-or'd flags doesn't cover
it, but I will take a careful look.
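
For reference, a walk-through of the three steps (purely illustrative;
_DPRINTK_FLAGS_TRACE stands in here for whatever the series names the
second enable bit behind 'x'):

    unsigned int flags = 0;

    flags |=  _DPRINTK_FLAGS_PRINT;  /* (1) +pf: callsite enabled */
    flags |=  _DPRINTK_FLAGS_TRACE;  /* (2) +x : still enabled */
    flags &= ~_DPRINTK_FLAGS_PRINT;  /* (3) -pf: TRACE is still set, so the
                                      * static branch must stay enabled --
                                      * the disable path has to consult the
                                      * resulting newflags, not just the
                                      * bits being removed. */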

> | I noticed a bug inside the CONFIG_JUMP_LABEL handling (also present
> | in the last version I posted) which should be fixed as part of the
> | diff below (I've added a comment).
> | [...]
> |  #ifdef CONFIG_JUMP_LABEL
> | - if (dp->flags & _DPRINTK_FLAGS_PRINT) {
> | - if (!(modifiers->flags & 
> _DPRINTK_FLAGS_PRINT))
> | + if (dp->flags & _DPRINTK_FLAGS_ENABLE) {
> | + /*
> | +  * The newflags check is to ensure that the
> | +  * static branch doesn't get disabled in step
> | +  * 3:
> | +  *
> | +  * (1) +pf
> | +  * (2) +x
> | +  * (3) -pf
> | +  */
> | + if (!(modifiers->flags & 
> _DPRINTK_FLAGS_ENABLE) &&
> | + !(newflags & _DPRINTK_FLAGS_ENABLE)) {
> |   
> static_branch_disable(&dp->key.dd_key_true);
> | - } else if (modifiers->flags & _DPRINTK_FLAGS_PRINT)
> | + }
> | + } else if (modifiers->flags & _DPRINTK_FLAGS_ENABLE) {
> |   static_branch_enable(&dp->key.dd_key_true);
> | + }
> |  #endif


Re: [PATCH 1/1] Add available memory ioctl for libhsakmt

2022-01-17 Thread Felix Kuehling
Am 2022-01-17 um 9:50 a.m. schrieb Bhardwaj, Rajneesh:
>
> [Public]
>
> *From:* amd-gfx *On Behalf Of* Deucher, Alexander
> *Sent:* Monday, January 10, 2022 4:11 PM
> *To:* Phillips, Daniel ;
> amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org
> *Subject:* Re: [PATCH 1/1] Add available memory ioctl for libhsakmt
>
> [Public]
>
> This is missing your signed-off-by.  Additionally, for UAPI changes,
> we need a link to the patches for the userspace component that will make
> use of it.
>
>  
>
> Alex
>
> 
>
> *From:*amd-gfx  > on behalf of Daniel
> Phillips mailto:daniel.phill...@amd.com>>
> *Sent:* Monday, January 10, 2022 3:54 PM
> *To:* amd-gfx@lists.freedesktop.org
>   >;
> dri-de...@lists.freedesktop.org
> 
> mailto:dri-de...@lists.freedesktop.org>>
> *Cc:* Phillips, Daniel  >
> *Subject:* [PATCH 1/1] Add available memory ioctl for libhsakmt
>
>  
>
> From: Daniel Phillips mailto:dphil...@amd.com>>
>
> Add an ioctl to inquire memory available for allocation by libhsakmt
> per node, allowing for space consumed by page translation tables.
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h  |  1 +
>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c    | 14 ++
>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c    | 17 +
>  include/uapi/linux/kfd_ioctl.h  | 14 --
>  4 files changed, 44 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index fcbc8a9c9e06..64c6c36685d3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -266,6 +266,7 @@ int amdgpu_amdkfd_gpuvm_acquire_process_vm(struct
> amdgpu_device *adev,
>  void amdgpu_amdkfd_gpuvm_release_process_vm(struct amdgpu_device *adev,
>  void *drm_priv);
>  uint64_t amdgpu_amdkfd_gpuvm_get_process_page_dir(void *drm_priv);
> +size_t amdgpu_amdkfd_get_available_memory(struct amdgpu_device *adev);
>  int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
>  struct amdgpu_device *adev, uint64_t va, uint64_t size,
>  void *drm_priv, struct kgd_mem **mem,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index 86a1a6c109d9..b7490a659173 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -190,6 +190,20 @@ static int amdgpu_amdkfd_reserve_mem_limit(struct
> amdgpu_device *adev,
>  return ret;
>  }
>  
> +size_t amdgpu_amdkfd_get_available_memory(struct amdgpu_device *adev)
> +{
> +   uint64_t reserved_for_pt =
> +   ESTIMATE_PT_SIZE(amdgpu_amdkfd_total_mem_size);
> +   size_t available_memory;
> +
> +   spin_lock(&kfd_mem_limit.mem_limit_lock);
> +   available_memory =
> +   adev->gmc.real_vram_size -
> +   adev->kfd.vram_used - reserved_for_pt;
> +   spin_unlock(&kfd_mem_limit.mem_limit_lock);
> +   return available_memory;
> +}
> +
>  static void unreserve_mem_limit(struct amdgpu_device *adev,
>  uint64_t size, u32 alloc_flag)
>  {
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> index 4bfc0c8ab764..5c2f6d97ff1c 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> @@ -486,6 +486,20 @@ static int kfd_ioctl_get_queue_wave_state(struct
> file *filep,
>  return r;
>  }
>  
> +static int kfd_ioctl_get_available_memory(struct file *filep,
> +    struct kfd_process *p, void *data)
> +{
> +   struct kfd_ioctl_get_available_memory_args *args = data;
> +   struct kfd_dev *dev;
> +
> +   dev = kfd_device_by_id(args->gpu_id);
>
> Once CRIU changes land, this needs to change to
> kfd_process_device_data_by_id() and then use pdd->dev
>
The CRIU patches will apply on top of this patch. So the CRIU patches
have to adapt this code, just like other existing ioctl functions.

Regards,
  Felix


>
> +   if (!dev)
> +   return -EINVAL;
> +
> +   args->available = amdgpu_amdkfd_get_available_memory(dev->adev);
> +   return 0;
> +}
> +
>  static int kfd_ioctl_set_memory_policy(struct file *filep,
>  struct kfd_process *p, void
> *data)
>  {
> @@ -1959,6 +1973,9 @@ static const struct amdkfd_ioctl_desc
> amdkfd_ioctls[] = {
>  
>  AMDKFD_IOCTL_DEF(AMDKFD_IOC_SET_XNACK_MODE,
>  kfd_ioctl_set_xnack_mode, 0),
> +
> +   AMDKFD_

Re: [PATCH] drm/radeon: fix UVD suspend error

2022-01-17 Thread Leo Liu



On 2022-01-17 2:47 a.m., Qiang Ma wrote:

I hit a bug recently; the kernel log shows:

[  330.171875] radeon :03:00.0: couldn't schedule ib
[  330.175781] [drm:radeon_uvd_suspend [radeon]] *ERROR* Error destroying UVD 
(-22)!

In the radeon driver, the UVD suspend flow is as follows:

if (rdev->has_uvd) {
 uvd_v1_0_fini(rdev);
 radeon_uvd_suspend(rdev);
}

In the radeon_ib_schedule() function we check the 'ring->ready' state,
but uvd_v1_0_fini() has already cleared it by that point.
So just reorder the suspend code flow to fix the error.
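
In short, the reordered flow is (a summary of the hunks below):

    if (rdev->has_uvd) {
            radeon_uvd_suspend(rdev);  /* may submit a destroy IB; needs ring->ready */
            uvd_v1_0_fini(rdev);       /* stops UVD and clears ring->ready */
    }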


It seems reasonable to me. The suspend sends the destroy message if
there is still an incomplete job, so it should run before the fini,
which stops the hardware.


The series is:

Reviewed-by: Leo Liu 




Signed-off-by: Qiang Ma 
---
  drivers/gpu/drm/radeon/cik.c   | 2 +-
  drivers/gpu/drm/radeon/evergreen.c | 2 +-
  drivers/gpu/drm/radeon/ni.c| 2 +-
  drivers/gpu/drm/radeon/r600.c  | 2 +-
  drivers/gpu/drm/radeon/rv770.c | 2 +-
  drivers/gpu/drm/radeon/si.c| 2 +-
  6 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/radeon/cik.c b/drivers/gpu/drm/radeon/cik.c
index 81b4de7be9f2..5819737c21c6 100644
--- a/drivers/gpu/drm/radeon/cik.c
+++ b/drivers/gpu/drm/radeon/cik.c
@@ -8517,8 +8517,8 @@ int cik_suspend(struct radeon_device *rdev)
cik_cp_enable(rdev, false);
cik_sdma_enable(rdev, false);
if (rdev->has_uvd) {
-   uvd_v1_0_fini(rdev);
radeon_uvd_suspend(rdev);
+   uvd_v1_0_fini(rdev);
}
if (rdev->has_vce)
radeon_vce_suspend(rdev);
diff --git a/drivers/gpu/drm/radeon/evergreen.c 
b/drivers/gpu/drm/radeon/evergreen.c
index eeb590d2dec2..455f8036aa54 100644
--- a/drivers/gpu/drm/radeon/evergreen.c
+++ b/drivers/gpu/drm/radeon/evergreen.c
@@ -5156,8 +5156,8 @@ int evergreen_suspend(struct radeon_device *rdev)
radeon_pm_suspend(rdev);
radeon_audio_fini(rdev);
if (rdev->has_uvd) {
-   uvd_v1_0_fini(rdev);
radeon_uvd_suspend(rdev);
+   uvd_v1_0_fini(rdev);
}
r700_cp_stop(rdev);
r600_dma_stop(rdev);
diff --git a/drivers/gpu/drm/radeon/ni.c b/drivers/gpu/drm/radeon/ni.c
index 4a364ca7a1be..927e5f42e97d 100644
--- a/drivers/gpu/drm/radeon/ni.c
+++ b/drivers/gpu/drm/radeon/ni.c
@@ -2323,8 +2323,8 @@ int cayman_suspend(struct radeon_device *rdev)
cayman_cp_enable(rdev, false);
cayman_dma_stop(rdev);
if (rdev->has_uvd) {
-   uvd_v1_0_fini(rdev);
radeon_uvd_suspend(rdev);
+   uvd_v1_0_fini(rdev);
}
evergreen_irq_suspend(rdev);
radeon_wb_disable(rdev);
diff --git a/drivers/gpu/drm/radeon/r600.c b/drivers/gpu/drm/radeon/r600.c
index ca3fcae2adb5..dd78fc499402 100644
--- a/drivers/gpu/drm/radeon/r600.c
+++ b/drivers/gpu/drm/radeon/r600.c
@@ -3232,8 +3232,8 @@ int r600_suspend(struct radeon_device *rdev)
radeon_audio_fini(rdev);
r600_cp_stop(rdev);
if (rdev->has_uvd) {
-   uvd_v1_0_fini(rdev);
radeon_uvd_suspend(rdev);
+   uvd_v1_0_fini(rdev);
}
r600_irq_suspend(rdev);
radeon_wb_disable(rdev);
diff --git a/drivers/gpu/drm/radeon/rv770.c b/drivers/gpu/drm/radeon/rv770.c
index e592e57be1bb..38796af4fadd 100644
--- a/drivers/gpu/drm/radeon/rv770.c
+++ b/drivers/gpu/drm/radeon/rv770.c
@@ -1894,8 +1894,8 @@ int rv770_suspend(struct radeon_device *rdev)
radeon_pm_suspend(rdev);
radeon_audio_fini(rdev);
if (rdev->has_uvd) {
-   uvd_v1_0_fini(rdev);
radeon_uvd_suspend(rdev);
+   uvd_v1_0_fini(rdev);
}
r700_cp_stop(rdev);
r600_dma_stop(rdev);
diff --git a/drivers/gpu/drm/radeon/si.c b/drivers/gpu/drm/radeon/si.c
index 013e44ed0f39..8d5e4b25609d 100644
--- a/drivers/gpu/drm/radeon/si.c
+++ b/drivers/gpu/drm/radeon/si.c
@@ -6800,8 +6800,8 @@ int si_suspend(struct radeon_device *rdev)
si_cp_enable(rdev, false);
cayman_dma_stop(rdev);
if (rdev->has_uvd) {
-   uvd_v1_0_fini(rdev);
radeon_uvd_suspend(rdev);
+   uvd_v1_0_fini(rdev);
}
if (rdev->has_vce)
radeon_vce_suspend(rdev);


Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs

2022-01-17 Thread Andrey Grodzovsky


On 2022-01-17 2:17 p.m., Christian König wrote:

Am 17.01.22 um 20:14 schrieb Andrey Grodzovsky:


Ping on the question



Oh, my! That was already more than a week ago and is completely 
swapped out of my head again.



Andrey

On 2022-01-05 1:11 p.m., Andrey Grodzovsky wrote:
Also, what about having the reset_active or in_reset flag in the 
reset_domain itself?


Off hand that sounds like a good idea.



What then about the adev->reset_sem semaphore ? Should we also move 
this to reset_domain ?  Both of the moves have functional
implications only for XGMI case because there will be contention 
over accessing those single instance variables from multiple devices

while now each device has it's own copy.


Since this is a rw semaphore that should be unproblematic I think. It 
could just be that the cache line of the lock then plays ping/pong 
between the CPU cores.




What benefit the centralization into reset_domain gives - is it for 
example to prevent one device in a hive trying to access through 
MMIO another one's

VRAM (shared FB memory) while the other one goes through reset ?


I think that this is the killer argument for a centralized lock, yes.



np, I will add a patch centralizing both flags into the reset domain and
resend.
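
For readers following along, a rough sketch of what that centralization
could look like (struct and field names here are assumptions, not taken
from the actual series):

    struct amdgpu_reset_domain {
            struct rw_semaphore reset_sem;  /* moved from struct amdgpu_device */
            atomic_t in_gpu_reset;          /* ditto; one shared copy per XGMI
                                             * hive instead of per-device copies */
    };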


Andrey




Christian.



Andrey 




Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs

2022-01-17 Thread Christian König

Am 17.01.22 um 20:14 schrieb Andrey Grodzovsky:


Ping on the question



Oh, my! That was already more than a week ago and is completely swapped 
out of my head again.



Andrey

On 2022-01-05 1:11 p.m., Andrey Grodzovsky wrote:
Also, what about having the reset_active or in_reset flag in the 
reset_domain itself?


Off hand that sounds like a good idea.



What then about the adev->reset_sem semaphore ? Should we also move 
this to reset_domain ?  Both of the moves have functional
implications only for XGMI case because there will be contention over 
accessing those single instance variables from multiple devices

while now each device has it's own copy.


Since this is a rw semaphore that should be unproblematic I think. It 
could just be that the cache line of the lock then plays ping/pong 
between the CPU cores.




What benefit the centralization into reset_domain gives - is it for 
example to prevent one device in a hive trying to access through MMIO 
another one's

VRAM (shared FB memory) while the other one goes through reset ?


I think that this is the killer argument for a centralized lock, yes.

Christian.



Andrey 




Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs

2022-01-17 Thread Andrey Grodzovsky

Ping on the question

Andrey

On 2022-01-05 1:11 p.m., Andrey Grodzovsky wrote:
Also, what about having the reset_active or in_reset flag in the 
reset_domain itself?


Off hand that sounds like a good idea.



What then about the adev->reset_sem semaphore ? Should we also move 
this to reset_domain ?  Both of the moves have functional
implications only for XGMI case because there will be contention over 
accessing those single instance variables from multiple devices

while now each device has it's own copy.

What benefit the centralization into reset_domain gives - is it for 
example to prevent one device in a hive trying to access through MMIO 
another one's

VRAM (shared FB memory) while the other one goes through reset ?

Andrey 


Re: [PATCH] drm/radeon: fix error handling in radeon_driver_open_kms

2022-01-17 Thread Alex Deucher
On Mon, Jan 17, 2022 at 4:38 AM Christian König
 wrote:
>
> The return value was never initialized, so the cleanup code executed even
> when it wasn't necessary.
>
> Just add proper error handling.
>
> Fixes: 2ad5d8fca195 ("drm/radeon/radeon_kms: Fix a NULL pointer dereference 
> in radeon_driver_open_kms()")
> Signed-off-by: Christian König 

If you kept the same labels, the patch would be smaller, but either way,

Reviewed-by: Alex Deucher 

> ---
>  drivers/gpu/drm/radeon/radeon_kms.c | 22 --
>  1 file changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/radeon/radeon_kms.c 
> b/drivers/gpu/drm/radeon/radeon_kms.c
> index e2488559cc9f..11ad210919c8 100644
> --- a/drivers/gpu/drm/radeon/radeon_kms.c
> +++ b/drivers/gpu/drm/radeon/radeon_kms.c
> @@ -666,18 +666,18 @@ int radeon_driver_open_kms(struct drm_device *dev, 
> struct drm_file *file_priv)
> fpriv = kzalloc(sizeof(*fpriv), GFP_KERNEL);
> if (unlikely(!fpriv)) {
> r = -ENOMEM;
> -   goto out_suspend;
> +   goto err_suspend;
> }
>
> if (rdev->accel_working) {
> vm = &fpriv->vm;
> r = radeon_vm_init(rdev, vm);
> if (r)
> -   goto out_fpriv;
> +   goto err_fpriv;
>
> r = radeon_bo_reserve(rdev->ring_tmp_bo.bo, false);
> if (r)
> -   goto out_vm_fini;
> +   goto err_vm_fini;
>
> /* map the ib pool buffer read only into
>  * virtual address space */
> @@ -685,7 +685,7 @@ int radeon_driver_open_kms(struct drm_device *dev, struct 
> drm_file *file_priv)
> rdev->ring_tmp_bo.bo);
> if (!vm->ib_bo_va) {
> r = -ENOMEM;
> -   goto out_vm_fini;
> +   goto err_vm_fini;
> }
>
> r = radeon_vm_bo_set_addr(rdev, vm->ib_bo_va,
> @@ -693,19 +693,21 @@ int radeon_driver_open_kms(struct drm_device *dev, 
> struct drm_file *file_priv)
>   RADEON_VM_PAGE_READABLE |
>   RADEON_VM_PAGE_SNOOPED);
> if (r)
> -   goto out_vm_fini;
> +   goto err_vm_fini;
> }
> file_priv->driver_priv = fpriv;
> }
>
> -   if (!r)
> -   goto out_suspend;
> +   pm_runtime_mark_last_busy(dev->dev);
> +   pm_runtime_put_autosuspend(dev->dev);
> +   return 0;
>
> -out_vm_fini:
> +err_vm_fini:
> radeon_vm_fini(rdev, vm);
> -out_fpriv:
> +err_fpriv:
> kfree(fpriv);
> -out_suspend:
> +
> +err_suspend:
> pm_runtime_mark_last_busy(dev->dev);
> pm_runtime_put_autosuspend(dev->dev);
> return r;
> --
> 2.25.1
>


RE: [PATCH v2] drm/amd: Warn users about potential s0ix problems

2022-01-17 Thread Limonciello, Mario
[Public]

Yes, that's part of why I want to make sure there are explicit warnings to 
users about using this flow.
When s0ix is not enabled in ACPI, the LPS0 device is not exported either, and
amd-pmc won't load or be used.

I think from amdgpu perspective it should behave relatively similar to an 
aborted suspend.

From: Lazar, Lijo 
Sent: Monday, January 17, 2022 11:20
To: Limonciello, Mario; amd-gfx@lists.freedesktop.org
Cc: Bjoren Dasse 
Subject: Re: [PATCH v2] drm/amd: Warn users about potential s0ix problems

Is there any problem with the PMFW sequence in the way Linux handles s2idle
when it's not enabled in ACPI?

Thanks,
Lijo

From: Limonciello, Mario <mario.limoncie...@amd.com>
Sent: Monday, January 17, 2022 10:45:44 PM
To: amd-gfx@lists.freedesktop.org; Lazar, Lijo <lijo.la...@amd.com>
Cc: Bjoren Dasse <bjoern.da...@gmail.com>
Subject: RE: [PATCH v2] drm/amd: Warn users about potential s0ix problems

[Public]

This has been sitting a week or so.
Bump on review for this patch.

> -Original Message-
> From: Limonciello, Mario <mario.limoncie...@amd.com>
> Sent: Tuesday, January 11, 2022 14:00
> To: amd-gfx@lists.freedesktop.org
> Cc: Limonciello, Mario <mario.limoncie...@amd.com>; Bjoren Dasse <bjoern.da...@gmail.com>
> Subject: [PATCH v2] drm/amd: Warn users about potential s0ix problems
>
> On some OEM setups users can configure the BIOS for S3 or S2idle.
> When configured to S3 users can still choose 's2idle' in the kernel by
> using `/sys/power/mem_sleep`.  Before commit 6dc8265f9803
> ("drm/amdgpu:
> always reset the asic in suspend (v2)"), the GPU would crash.  Now when
> configured this way, the system should resume but will use more power.
>
> As such, adjust the `amdgpu_acpi_is_s0ix_active()` function to warn users about
> potential power consumption issues during their first attempt at
> suspending.
>
> Reported-by: Bjoren Dasse <bjoern.da...@gmail.com>
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1824
> Signed-off-by: Mario Limonciello <mario.limoncie...@amd.com>
> ---
> v1->v2:
>  * Only show messages in s2idle cases
>  drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c | 16 ++--
>  1 file changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> index 4811b0faafd9..1295de6d6c30 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> @@ -1040,11 +1040,15 @@ void amdgpu_acpi_detect(void)
>   */
>  bool amdgpu_acpi_is_s0ix_active(struct amdgpu_device *adev)
>  {
> -#if IS_ENABLED(CONFIG_AMD_PMC) && IS_ENABLED(CONFIG_SUSPEND)
> - if (acpi_gbl_FADT.flags & ACPI_FADT_LOW_POWER_S0) {
> - if (adev->flags & AMD_IS_APU)
> - return pm_suspend_target_state ==
> PM_SUSPEND_TO_IDLE;
> - }
> + if (!(adev->flags & AMD_IS_APU) ||
> + pm_suspend_target_state != PM_SUSPEND_TO_IDLE)
> + return false;
> + if (!(acpi_gbl_FADT.flags & ACPI_FADT_LOW_POWER_S0))
> + dev_warn_once(adev->dev,
> +   "BIOS is not configured for suspend-to-idle, power
> consumption will be higher\n");
> +#if !IS_ENABLED(CONFIG_AMD_PMC)
> + dev_warn_once(adev->dev,
> +   "amd-pmc is not enabled in the kernel, power
> consumption will be higher\n");
>  #endif
> - return false;
> + return true;
>  }
> --
> 2.25.1


Re: [PATCH v2] drm/amd: Warn users about potential s0ix problems

2022-01-17 Thread Lazar, Lijo
Is there any problem with the PMFW sequence in the way Linux handles s2idle
when it's not enabled in ACPI?

Thanks,
Lijo

From: Limonciello, Mario 
Sent: Monday, January 17, 2022 10:45:44 PM
To: amd-gfx@lists.freedesktop.org ; Lazar, Lijo 

Cc: Bjoren Dasse 
Subject: RE: [PATCH v2] drm/amd: Warn users about potential s0ix problems

[Public]

This has been sitting a week or so.
Bump on review for this patch.

> -Original Message-
> From: Limonciello, Mario 
> Sent: Tuesday, January 11, 2022 14:00
> To: amd-gfx@lists.freedesktop.org
> Cc: Limonciello, Mario ; Bjoren Dasse
> 
> Subject: [PATCH v2] drm/amd: Warn users about potential s0ix problems
>
> On some OEM setups users can configure the BIOS for S3 or S2idle.
> When configured to S3 users can still choose 's2idle' in the kernel by
> using `/sys/power/mem_sleep`.  Before commit 6dc8265f9803
> ("drm/amdgpu:
> always reset the asic in suspend (v2)"), the GPU would crash.  Now when
> configured this way, the system should resume but will use more power.
>
> As such, adjust the `amdgpu_acpi_is_s0ix_active()` function to warn users about
> potential power consumption issues during their first attempt at
> suspending.
>
> Reported-by: Bjoren Dasse 
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1824
> Signed-off-by: Mario Limonciello 
> ---
> v1->v2:
>  * Only show messages in s2idle cases
>  drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c | 16 ++--
>  1 file changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> index 4811b0faafd9..1295de6d6c30 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> @@ -1040,11 +1040,15 @@ void amdgpu_acpi_detect(void)
>   */
>  bool amdgpu_acpi_is_s0ix_active(struct amdgpu_device *adev)
>  {
> -#if IS_ENABLED(CONFIG_AMD_PMC) && IS_ENABLED(CONFIG_SUSPEND)
> - if (acpi_gbl_FADT.flags & ACPI_FADT_LOW_POWER_S0) {
> - if (adev->flags & AMD_IS_APU)
> - return pm_suspend_target_state ==
> PM_SUSPEND_TO_IDLE;
> - }
> + if (!(adev->flags & AMD_IS_APU) ||
> + pm_suspend_target_state != PM_SUSPEND_TO_IDLE)
> + return false;
> + if (!(acpi_gbl_FADT.flags & ACPI_FADT_LOW_POWER_S0))
> + dev_warn_once(adev->dev,
> +   "BIOS is not configured for suspend-to-idle, power
> consumption will be higher\n");
> +#if !IS_ENABLED(CONFIG_AMD_PMC)
> + dev_warn_once(adev->dev,
> +   "amd-pmc is not enabled in the kernel, power
> consumption will be higher\n");
>  #endif
> - return false;
> + return true;
>  }
> --
> 2.25.1


RE: [PATCH v2] drm/amd: Warn users about potential s0ix problems

2022-01-17 Thread Limonciello, Mario
[Public]

This has been sitting a week or so.
Bump on review for this patch.

> -Original Message-
> From: Limonciello, Mario 
> Sent: Tuesday, January 11, 2022 14:00
> To: amd-gfx@lists.freedesktop.org
> Cc: Limonciello, Mario ; Bjoren Dasse
> 
> Subject: [PATCH v2] drm/amd: Warn users about potential s0ix problems
> 
> On some OEM setups users can configure the BIOS for S3 or S2idle.
> When configured to S3 users can still choose 's2idle' in the kernel by
> using `/sys/power/mem_sleep`.  Before commit 6dc8265f9803
> ("drm/amdgpu:
> always reset the asic in suspend (v2)"), the GPU would crash.  Now when
> configured this way, the system should resume but will use more power.
> 
> As such, adjust the `amdgpu_acpi_is_s0ix_active()` function to warn users about
> potential power consumption issues during their first attempt at
> suspending.
> 
> Reported-by: Bjoren Dasse 
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1824
> Signed-off-by: Mario Limonciello 
> ---
> v1->v2:
>  * Only show messages in s2idle cases
>  drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c | 16 ++--
>  1 file changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> index 4811b0faafd9..1295de6d6c30 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> @@ -1040,11 +1040,15 @@ void amdgpu_acpi_detect(void)
>   */
>  bool amdgpu_acpi_is_s0ix_active(struct amdgpu_device *adev)
>  {
> -#if IS_ENABLED(CONFIG_AMD_PMC) && IS_ENABLED(CONFIG_SUSPEND)
> - if (acpi_gbl_FADT.flags & ACPI_FADT_LOW_POWER_S0) {
> - if (adev->flags & AMD_IS_APU)
> - return pm_suspend_target_state ==
> PM_SUSPEND_TO_IDLE;
> - }
> + if (!(adev->flags & AMD_IS_APU) ||
> + pm_suspend_target_state != PM_SUSPEND_TO_IDLE)
> + return false;
> + if (!(acpi_gbl_FADT.flags & ACPI_FADT_LOW_POWER_S0))
> + dev_warn_once(adev->dev,
> +   "BIOS is not configured for suspend-to-idle, power
> consumption will be higher\n");
> +#if !IS_ENABLED(CONFIG_AMD_PMC)
> + dev_warn_once(adev->dev,
> +   "amd-pmc is not enabled in the kernel, power
> consumption will be higher\n");
>  #endif
> - return false;
> + return true;
>  }
> --
> 2.25.1


Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-17 Thread Christian König

Am 17.01.22 um 15:50 schrieb Marek Olšák:
I don't think fork() would work with userspace where all buffers are 
shared. It certainly doesn't work now. The driver needs to be notified 
that a buffer or texture is shared to ensure data coherency between 
processes, and the driver must execute decompression and other render 
passes when a buffer or texture is being shared for the first time. 
Those aren't called when fork() is called.


Yeah, that's why you can install handlers which run before/after fork() 
is executed. But to summarize it is illegal for OpenGL, so we don't 
really need to worry about it.
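
A minimal sketch of that approach (the gpu_ctx_* callbacks are
placeholders, not a real driver API):

    #include <pthread.h>

    static void gpu_ctx_prepare(void) { /* drain pending submissions */ }
    static void gpu_ctx_parent(void)  { /* parent keeps its context */ }
    static void gpu_ctx_child(void)   { /* child drops the inherited context
                                         * and reinitializes lazily on
                                         * first use */ }

    static void install_fork_handlers(void)
    {
            pthread_atfork(gpu_ctx_prepare, gpu_ctx_parent, gpu_ctx_child);
    }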


For compute there are a couple of use cases though, but even those are 
not real world ones as far as I know.


But see below.



Marek

On Mon, Jan 17, 2022 at 9:34 AM Felix Kuehling wrote:


Am 2022-01-17 um 9:21 a.m. schrieb Christian König:
> Am 17.01.22 um 15:17 schrieb Felix Kuehling:
>> Am 2022-01-17 um 6:44 a.m. schrieb Christian König:
>>> Am 14.01.22 um 18:40 schrieb Felix Kuehling:
 Am 2022-01-14 um 12:26 p.m. schrieb Christian König:
> Am 14.01.22 um 17:44 schrieb Daniel Vetter:
>> Top post because I tried to catch up on the entire
discussion here.
>>
>> So fundamentally I'm not opposed to just close this fork() hole
>> once and
>> for all. The thing that worries me from a upstream/platform
pov is
>> really
>> only if we don't do it consistently across all drivers.
>>
>> So maybe as an idea:
>> - Do the original patch, but not just for ttm but all gem
rendernode
>>  drivers at least (or maybe even all gem drivers, no
idea), with
>> the
>>  below discussion cleaned up as justification.
> I know of at least one use case which this will break.
>
> A couple of years back we had a discussion on the Mesa
mailing list
> because (IIRC) Marek introduced a background thread to push
command
> submissions to the kernel.
>
> That broke because some compositor used to initialize OpenGL
and then
> do a fork(). This indeed worked previously (no GPUVM at that
time),
> but with the addition of the backround thread obviously broke.
>
> The conclusion back then was that the compositor is broken
and needs
> fixing, but it still essentially means that there could be
people out
> there with really old userspace where this setting would
just break
> the desktop.
>
> I'm not really against that change either, but at least in
theory we
> could make fork() work perfectly fine even with VMs and
background
> threads.
 You may regret this if you ever try to build a shared virtual
address
 space between GPU and CPU. Then you have two processes
(parent and
 child) sharing the same render context and GPU VM address space.
 But the
 CPU address spaces are different. You can't maintain
consistent shared
 virtual address spaces for both processes when the GPU address
 space is
 shared between them.
>>> That's actually not much of a problem.
>>>
>>> All you need to do is to use pthread_atfork() and do the
appropriate
>>> action in parent/child to clean up your context:
>>> https://man7.org/linux/man-pages/man3/pthread_atfork.3.html


>> Thunk already does that. However, it's not foolproof.
pthread_atfork
>> hanlders aren't called when the process is forked with a clone
call.
>
> Yeah, but that's perfectly intentional. clone() is usually used to
> create threads.

Clone can be used to create new processes. Maybe not the common
use today.


>
>>> The rest is just to make sure that all shared and all private
data are
>>> kept separate all the time. Sharing virtual memory is already
done for
>>> decades this way, it's just that nobody ever did it with a
statefull
>>> device like GPUs.
>> My concern is not with sharing or not sharing data. It's with
sharing
>> the address space itself. If you share the render node, you
share GPU
>> virtual address space. However CPU address space is not shared
between
>> parent and child. That's a fundamental mismatch between the CPU
world
>> and current GPU driver implementation.
>
> Correct, but

[PATCH] drm/amd/pm: Enable sysfs required by rocm-smi tool for One VF mode

2022-01-17 Thread Marina Nikolic
Enable power level, power limit and fan speed
information retrieval in one VF mode.
This is required so that the ROCm-SMI tool
can provide this information to users.
---
 drivers/gpu/drm/amd/pm/amdgpu_pm.c  | 17 ++---
 .../drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c |  2 +-
 2 files changed, 3 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
index 1b03ad7a21ad..bb7d03cd814c 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
@@ -1903,8 +1903,8 @@ static struct amdgpu_device_attr amdgpu_device_attrs[] = {
AMDGPU_DEVICE_ATTR_RW(pp_dpm_fclk,  
ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
AMDGPU_DEVICE_ATTR_RW(pp_dpm_vclk,  
ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
AMDGPU_DEVICE_ATTR_RW(pp_dpm_dclk,  
ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
-   AMDGPU_DEVICE_ATTR_RW(pp_dpm_dcefclk,   
ATTR_FLAG_BASIC),
-   AMDGPU_DEVICE_ATTR_RW(pp_dpm_pcie,  
ATTR_FLAG_BASIC),
+   AMDGPU_DEVICE_ATTR_RW(pp_dpm_dcefclk,   
ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
+   AMDGPU_DEVICE_ATTR_RW(pp_dpm_pcie,  
ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
AMDGPU_DEVICE_ATTR_RW(pp_sclk_od,   
ATTR_FLAG_BASIC),
AMDGPU_DEVICE_ATTR_RW(pp_mclk_od,   
ATTR_FLAG_BASIC),
AMDGPU_DEVICE_ATTR_RW(pp_power_profile_mode,
ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
@@ -3152,19 +3152,6 @@ static umode_t hwmon_attributes_visible(struct kobject 
*kobj,
if (amdgpu_sriov_vf(adev) && !amdgpu_sriov_is_pp_one_vf(adev))
return 0;
 
-   /* there is no fan under pp one vf mode */
-   if (amdgpu_sriov_is_pp_one_vf(adev) &&
-   (attr == &sensor_dev_attr_pwm1.dev_attr.attr ||
-attr == &sensor_dev_attr_pwm1_enable.dev_attr.attr ||
-attr == &sensor_dev_attr_pwm1_max.dev_attr.attr ||
-attr == &sensor_dev_attr_pwm1_min.dev_attr.attr ||
-attr == &sensor_dev_attr_fan1_input.dev_attr.attr ||
-attr == &sensor_dev_attr_fan1_min.dev_attr.attr ||
-attr == &sensor_dev_attr_fan1_max.dev_attr.attr ||
-attr == &sensor_dev_attr_fan1_target.dev_attr.attr ||
-attr == &sensor_dev_attr_fan1_enable.dev_attr.attr))
-   return 0;
-
/* Skip fan attributes if fan is not present */
if (adev->pm.no_fan && (attr == &sensor_dev_attr_pwm1.dev_attr.attr ||
attr == &sensor_dev_attr_pwm1_enable.dev_attr.attr ||
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
index 4e37cd8025ed..7ea0427f65d3 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
@@ -134,7 +134,7 @@ static struct cmn2asic_msg_mapping 
sienna_cichlid_message_map[SMU_MSG_MAX_COUNT]
MSG_MAP(PrepareMp1ForUnload,PPSMC_MSG_PrepareMp1ForUnload,  
   1),
MSG_MAP(AllowGfxOff,PPSMC_MSG_AllowGfxOff,  
   0),
MSG_MAP(DisallowGfxOff, PPSMC_MSG_DisallowGfxOff,   
   0),
-   MSG_MAP(GetPptLimit,PPSMC_MSG_GetPptLimit,  
   0),
+   MSG_MAP(GetPptLimit,PPSMC_MSG_GetPptLimit,  
   1),
MSG_MAP(GetDcModeMaxDpmFreq,PPSMC_MSG_GetDcModeMaxDpmFreq,  
   1),
MSG_MAP(ExitBaco,   PPSMC_MSG_ExitBaco, 
   0),
MSG_MAP(PowerUpVcn, PPSMC_MSG_PowerUpVcn,   
   0),
-- 
2.20.1



Re: [PATCH 3/5] drm/amdkfd: add page fault SMI event

2022-01-17 Thread philip yang

  


On 2022-01-14 5:37 p.m., Felix Kuehling wrote:


  Am 2022-01-14 um 3:38 p.m. schrieb Philip Yang:

  
After a GPU page fault is recovered, output the timestamp when the fault
was received, the duration needed to recover the fault, whether migration
happened or only the GPU page table was updated, the fault address, and
whether it was a read or write fault.

Signed-off-by: Philip Yang 
---
 drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 41 +
 drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h |  3 ++
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c| 12 --
 3 files changed, 53 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
index 5818ea8ad4ce..6ed3d85348d6 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
@@ -265,6 +265,47 @@ void kfd_smi_event_update_vmfault(struct kfd_dev *dev, uint16_t pasid)
 	add_event_to_kfifo(task_info.pid, dev, KFD_SMI_EVENT_VMFAULT, fifo_in, len);
 }
 
+static bool kfd_smi_event_pid_duration(struct kfd_dev *dev, uint16_t pasid,
+   pid_t *pid, uint64_t ts,
+   uint64_t *duration)
+{
+	struct amdgpu_task_info task_info = {0};
+
+	if (list_empty(&dev->smi_clients))
+		return false;
+
+	amdgpu_vm_get_task_info(dev->adev, pasid, &task_info);

  
  
The caller (svm_range_restore_pages) already knows the kfd_process. It
should be able to provide the pid directly from p->lead_thread.pid. No
need to look that up from the pasid and vm task info.

yes, I can use p->lead_thread.pid for other events as well.
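
(A minimal sketch of what the call site could then look like; the signature
here is illustrative, not the one posted in this patch:)

	/* in svm_range_restore_pages(), where p is the kfd_process and
	 * dev the kfd_dev the caller already holds */
	kfd_smi_event_page_fault(dev, p->lead_thread.pid, addr,
				 migration, write_fault, timestamp);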

  



  
+	if (!task_info.pid) {
+		pr_debug("task is gone\n");
+		return false;
+	}
+	if (pid)
+		*pid = task_info.pid;
+	if (duration)
+		*duration = ktime_get_ns() - ts;
+	return true;
+}
+
+void kfd_smi_event_page_fault(struct kfd_dev *dev, uint16_t pasid,
+			  unsigned long address, bool migration,
+			  bool write_fault, uint64_t ts)
+{
+	char fifo_in[128];
+	uint64_t duration;
+	pid_t pid;
+	int len;
+
+	if (!kfd_smi_event_pid_duration(dev, pasid, &pid, ts, &duration))
+		return;
+
+	len = snprintf(fifo_in, sizeof(fifo_in), "%d ts=%lld duration=%lld"
+		   " pid=%d pfn=0x%lx write=%d migration=%d\n",

  
  
Please don't break the string over several lines. I believe that would
also trigger a checkpatch warning.

It does trigger a checkpatch warning. I saw other places using the
multi-line format, but that makes the string hard to grep for by line,
so I will shorten it and use a one-line format.

  



  
+		   KFD_SMI_EVENT_PAGE_FAULT, ts, duration, pid, address,
+		   write_fault, migration);

  
  
The existing events use %x for all numbers, including event number and
pid. Maybe better to stick with that convention for consistency. At
least for the event number.


Will output event number as hex.

  
The existing events seem to favor a very compact layout. I could think
of ways to make this event more compact as well (still using decimal for
some things for readability):

"%x %d(%d)-%d @%x %c%c", KFD_SMI_EVENT_PAGE_FAULT, ts, duration, pid,
address, write ? 'W' : 'w', migration ? 'M' : 'm'



Good idea, I will shorten the format and message.
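
(With Felix's suggestions applied, the snprintf call might then look roughly
like this; a sketch only, with types following the existing variables:)

	len = snprintf(fifo_in, sizeof(fifo_in), "%x %lld(%lld)-%d @%lx %c%c\n",
		       KFD_SMI_EVENT_PAGE_FAULT, ts, duration, pid, address,
		       write_fault ? 'W' : 'w', migration ? 'M' : 'm');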

  

  
+
+	add_event_to_kfifo(pid, dev, KFD_SMI_EVENT_PAGE_FAULT, fifo_in, len);
+}
+
 int kfd_smi_event_open(struct kfd_dev *dev, uint32_t *fd)
 {
 	struct kfd_smi_client *client;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h
index bffd0c32b060..fa3a8fdad69f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h
@@ -28,5 +28,8 @@ void kfd_smi_event_update_vmfault(struct kfd_dev *dev, uint16_t pasid);
 void kfd_smi_event_update_thermal_throttling(struct kfd_dev *dev,
 	 uint64_t throttle_bitmask);
 void kfd_smi_event_update_gpu_reset(struct kfd_dev *dev, bool post_reset);
+void kfd_smi_event_page_fault(struct kfd_dev *dev, uint16_t pasid,
+			  unsigned long address, bool migration,
+			  bool write_fault, uint64_t ts);
 
 #endif
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 37b3191615b6..b81667162dc1 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -32,6 +32,7 @@
 #include "kfd_priv.h"
 #include "kfd_svm.h"
 #include "kfd_migrate.h"
+#include "kfd_smi_events.h"
 
 #ifdef dev_fmt
 #undef dev_fmt
@@ -2657,11 +2658,12 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
 	struct svm_range_list *svms;
 	struct svm_range *prange;
 	struct kfd_process *p;
-	uint64_t timestamp;
+	uint64_t timestamp = ktime_get_ns();

  
  
kfd_ioctl_get_clock_counters uses ktime_get_boottime_ns for the
system_clock_counter. We should use the same here (and in the duration
calculation) to make it possible to correlate timestamps f

Re: [PATCH 2/2] drm/amdgpu: add AMDGPURESET uevent on AMD GPU reset

2022-01-17 Thread Christian König

Hi Amarnath,

not a bad idea, but that won't work either because you really need to 
return to make sure that a potential next reset can run.


What could be done is to have a work item which does that, but then I 
think it would just be easier to teach the udev function a GFP mask.
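
(For illustration, the work-item variant could look roughly like the sketch
below; untested, and reset_uevent_work is an invented member that would need
an INIT_WORK() during device init:)

static void amdgpu_reset_uevent_work(struct work_struct *work)
{
	struct amdgpu_device *adev =
		container_of(work, struct amdgpu_device, reset_uevent_work);
	char *envp[] = { "AMDGPURESET=1", NULL };

	/* runs in process context, so kobject_uevent_env() may allocate */
	kobject_uevent_env(&adev_to_drm(adev)->primary->kdev->kobj,
			   KOBJ_CHANGE, envp);
}

	/* the reset path then only queues the work: */
	schedule_work(&adev->reset_uevent_work);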


Regards,
Christian.


Am 17.01.22 um 15:19 schrieb Somalapuram, Amaranath:

Hi Christian,

if sending a udev event during reset is going to create a problem, we can 
move this code from the reset sequence to re-init (after GPU reset 
succeeded).


Regards,

S.Amarnath

On 1/17/2022 5:27 PM, Christian König wrote:

Am 17.01.22 um 12:43 schrieb Sharma, Shashank:

Hello Christian,

On 1/17/2022 11:37 AM, Christian König wrote:

Am 17.01.22 um 11:34 schrieb Somalapuram, Amaranath:

[SNIP]
Any suggestion on how we can notify user space in this situation? 


Well trace points should work. They use a ring buffer and are 
specially made to work in such situations.


We are anyway sending a trace event with data, but can a trace event 
be a wake-up source for an application (like a tracer)? This DRM 
event is just to indicate that a reset happened, so the app has to read 
from the trace file.


Yeah, that's a really good question I can't fully answer. As far as I 
know you can filter in userspace what you get from the tracer, but I 
don't know if you can block on a specific event.


Christian.





Alternative would be to audit the udev code and allow giving a GFP 
mask to the function to make sure that we don't run into the deadlock.


Another alternative is a debugfs file which just blocks for the 
next reset to happen.




- Shashank


Regards,
Christian.


Regards,

S.Amarnath









RE: [PATCH 1/1] Add available memory ioctl for libhsakmt

2022-01-17 Thread Bhardwaj, Rajneesh
[Public]



From: amd-gfx  On Behalf Of Deucher, 
Alexander
Sent: Monday, January 10, 2022 4:11 PM
To: Phillips, Daniel ; amd-gfx@lists.freedesktop.org; 
dri-de...@lists.freedesktop.org
Subject: Re: [PATCH 1/1] Add available memory ioctl for libhsakmt



This is missing your signed-off-by.  Additionally, for UAPI changes, we need a 
link to the patches for the userspace component that will make use of it.
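
(For reference, the userspace side would presumably be something along these
lines; a sketch based on the kfd_ioctl.h hunk below, error handling omitted:)

	struct kfd_ioctl_get_available_memory_args args = { .gpu_id = gpu_id };

	if (ioctl(kfd_fd, AMDKFD_IOC_AVAILABLE_MEMORY, &args) == 0)
		printf("node %u: %llu bytes available\n", gpu_id,
		       (unsigned long long)args.available);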

Alex


From: amd-gfx 
mailto:amd-gfx-boun...@lists.freedesktop.org>>
 on behalf of Daniel Phillips 
mailto:daniel.phill...@amd.com>>
Sent: Monday, January 10, 2022 3:54 PM
To: amd-gfx@lists.freedesktop.org 
mailto:amd-gfx@lists.freedesktop.org>>; 
dri-de...@lists.freedesktop.org 
mailto:dri-de...@lists.freedesktop.org>>
Cc: Phillips, Daniel mailto:daniel.phill...@amd.com>>
Subject: [PATCH 1/1] Add available memory ioctl for libhsakmt

From: Daniel Phillips mailto:dphil...@amd.com>>

Add an ioctl to query the memory available for allocation by libhsakmt
per node, allowing for the space consumed by page translation tables.
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h  |  1 +
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c| 14 ++
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c| 17 +
 include/uapi/linux/kfd_ioctl.h  | 14 --
 4 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index fcbc8a9c9e06..64c6c36685d3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -266,6 +266,7 @@ int amdgpu_amdkfd_gpuvm_acquire_process_vm(struct 
amdgpu_device *adev,
 void amdgpu_amdkfd_gpuvm_release_process_vm(struct amdgpu_device *adev,
 void *drm_priv);
 uint64_t amdgpu_amdkfd_gpuvm_get_process_page_dir(void *drm_priv);
+size_t amdgpu_amdkfd_get_available_memory(struct amdgpu_device *adev);
 int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
 struct amdgpu_device *adev, uint64_t va, uint64_t size,
 void *drm_priv, struct kgd_mem **mem,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 86a1a6c109d9..b7490a659173 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -190,6 +190,20 @@ static int amdgpu_amdkfd_reserve_mem_limit(struct 
amdgpu_device *adev,
 return ret;
 }

+size_t amdgpu_amdkfd_get_available_memory(struct amdgpu_device *adev)
+{
+   uint64_t reserved_for_pt =
+   ESTIMATE_PT_SIZE(amdgpu_amdkfd_total_mem_size);
+   size_t available_memory;
+
+   spin_lock(&kfd_mem_limit.mem_limit_lock);
+   available_memory =
+   adev->gmc.real_vram_size -
+   adev->kfd.vram_used - reserved_for_pt;
+   spin_unlock(&kfd_mem_limit.mem_limit_lock);
+   return available_memory;
+}
+
 static void unreserve_mem_limit(struct amdgpu_device *adev,
 uint64_t size, u32 alloc_flag)
 {
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 4bfc0c8ab764..5c2f6d97ff1c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -486,6 +486,20 @@ static int kfd_ioctl_get_queue_wave_state(struct file 
*filep,
 return r;
 }

+static int kfd_ioctl_get_available_memory(struct file *filep,
+struct kfd_process *p, void *data)
+{
+   struct kfd_ioctl_get_available_memory_args *args = data;
+   struct kfd_dev *dev;
+
+   dev = kfd_device_by_id(args->gpu_id);
Once the CRIU changes land, this needs to change to kfd_process_device_data_by_id() 
and then use pdd->dev
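
(I.e. presumably something along these lines once that series lands; a
sketch only:)

	struct kfd_process_device *pdd;

	pdd = kfd_process_device_data_by_id(p, args->gpu_id);
	if (!pdd)
		return -EINVAL;

	args->available = amdgpu_amdkfd_get_available_memory(pdd->dev->adev);
	return 0;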


+   if (!dev)
+   return -EINVAL;
+
+   args->available = amdgpu_amdkfd_get_available_memory(dev->adev);
+   return 0;
+}
+
 static int kfd_ioctl_set_memory_policy(struct file *filep,
 struct kfd_process *p, void *data)
 {
@@ -1959,6 +1973,9 @@ static const struct amdkfd_ioctl_desc amdkfd_ioctls[] = {

 AMDKFD_IOCTL_DEF(AMDKFD_IOC_SET_XNACK_MODE,
 kfd_ioctl_set_xnack_mode, 0),
+
+   AMDKFD_IOCTL_DEF(AMDKFD_IOC_AVAILABLE_MEMORY,
+   kfd_ioctl_get_available_memory, 0),
 };

 #define AMDKFD_CORE_IOCTL_COUNT ARRAY_SIZE(amdkfd_ioctls)
diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
index af96af174dc4..94a99add2432 100644
--- a/include/uapi/linux/kfd_ioctl.h
+++ b/include/uapi/linux/kfd_ioctl.h
@@ -32,9 +32,10 @@
  * - 1.4 - Indicate new SRAM EDC bit in device properties
  * - 1.5 - Add SVM API
  * - 1.6 - Query clear flags in SVM get_attr API
+ * - 1.7 - Add available_memory ioctl
  */
 #define KF

Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-17 Thread Marek Olšák
I don't think fork() would work with userspace where all buffers are
shared. It certainly doesn't work now. The driver needs to be notified that
a buffer or texture is shared to ensure data coherency between processes,
and the driver must execute decompression and other render passes when a
buffer or texture is being shared for the first time. Those aren't called
when fork() is called.

Marek

On Mon, Jan 17, 2022 at 9:34 AM Felix Kuehling 
wrote:

> Am 2022-01-17 um 9:21 a.m. schrieb Christian König:
> > Am 17.01.22 um 15:17 schrieb Felix Kuehling:
> >> Am 2022-01-17 um 6:44 a.m. schrieb Christian König:
> >>> Am 14.01.22 um 18:40 schrieb Felix Kuehling:
>  Am 2022-01-14 um 12:26 p.m. schrieb Christian König:
> > Am 14.01.22 um 17:44 schrieb Daniel Vetter:
> >> Top post because I tried to catch up on the entire discussion here.
> >>
> >> So fundamentally I'm not opposed to just close this fork() hole
> >> once and
> >> for all. The thing that worries me from an upstream/platform pov is
> >> really
> >> only if we don't do it consistently across all drivers.
> >>
> >> So maybe as an idea:
> >> - Do the original patch, but not just for ttm but all gem rendernode
> >>  drivers at least (or maybe even all gem drivers, no idea), with
> >> the
> >>  below discussion cleaned up as justification.
> > I know of at least one use case which this will break.
> >
> > A couple of years back we had a discussion on the Mesa mailing list
> > because (IIRC) Marek introduced a background thread to push command
> > submissions to the kernel.
> >
> > That broke because some compositor used to initialize OpenGL and then
> > do a fork(). This indeed worked previously (no GPUVM at that time),
> > but with the addition of the background thread obviously broke.
> >
> > The conclusion back then was that the compositor is broken and needs
> > fixing, but it still essentially means that there could be people out
> > there with really old userspace where this setting would just break
> > the desktop.
> >
> > I'm not really against that change either, but at least in theory we
> > could make fork() work perfectly fine even with VMs and background
> > threads.
>  You may regret this if you ever try to build a shared virtual address
>  space between GPU and CPU. Then you have two processes (parent and
>  child) sharing the same render context and GPU VM address space.
>  But the
>  CPU address spaces are different. You can't maintain consistent shared
>  virtual address spaces for both processes when the GPU address
>  space is
>  shared between them.
> >>> That's actually not much of a problem.
> >>>
> >>> All you need to do is to use pthread_atfork() and do the appropriate
> >>> action in parent/child to clean up your context:
> >>> https://man7.org/linux/man-pages/man3/pthread_atfork.3.html
> >> Thunk already does that. However, it's not foolproof. pthread_atfork
> >> handlers aren't called when the process is forked with a clone call.
> >
> > Yeah, but that's perfectly intentional. clone() is usually used to
> > create threads.
>
> Clone can be used to create new processes. Maybe not the common use today.
>
>
> >
> >>> The rest is just to make sure that all shared and all private data are
> >>> kept separate all the time. Sharing virtual memory is already done for
> >>> decades this way, it's just that nobody ever did it with a stateful
> >>> device like GPUs.
> >> My concern is not with sharing or not sharing data. It's with sharing
> >> the address space itself. If you share the render node, you share GPU
> >> virtual address space. However CPU address space is not shared between
> >> parent and child. That's a fundamental mismatch between the CPU world
> >> and current GPU driver implementation.
> >
> > Correct, but even that is easily solvable. As I said before you can
> > hang this state on a VMA and let it be cloned together with the CPU
> > address space.
>
> I'm not following. The address space I'm talking about is struct
> amdgpu_vm. It's associated with the render node file descriptor.
> Inheriting and using that file descriptor in the child inherits the
> amdgpu_vm. I don't see how you can hang that state on any one VMA.
>
> To be consistent with the CPU, you'd need to clone the GPU address space
> (struct amdgpu_vm) in the child process. That means you need a new
> render node file descriptor that imports all the BOs from the parent
> address space. It's a bunch of extra work to fork a process that you're
> proposing to immediately undo with an atfork handler. So I really don't
> see the point.
>
> Regards,
>   Felix
>
>
> >
> > Since VMAs are informed about their cloning (as opposed to file
> > descriptors) it's trivial to even just clone kernel data on first access.
> >
> > Regards,
> > Christian.
> >
> >>
> >> Regards,
> >>Felix
> >>
> >>
> >>> 

Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-17 Thread Felix Kuehling
Am 2022-01-17 um 9:21 a.m. schrieb Christian König:
> Am 17.01.22 um 15:17 schrieb Felix Kuehling:
>> Am 2022-01-17 um 6:44 a.m. schrieb Christian König:
>>> Am 14.01.22 um 18:40 schrieb Felix Kuehling:
 Am 2022-01-14 um 12:26 p.m. schrieb Christian König:
> Am 14.01.22 um 17:44 schrieb Daniel Vetter:
>> Top post because I tried to catch up on the entire discussion here.
>>
>> So fundamentally I'm not opposed to just close this fork() hole
>> once and
>> for all. The thing that worries me from an upstream/platform pov is
>> really
>> only if we don't do it consistently across all drivers.
>>
>> So maybe as an idea:
>> - Do the original patch, but not just for ttm but all gem rendernode
>>  drivers at least (or maybe even all gem drivers, no idea), with
>> the
>>  below discussion cleaned up as justification.
> I know of at least one use case which this will break.
>
> A couple of years back we had a discussion on the Mesa mailing list
> because (IIRC) Marek introduced a background thread to push command
> submissions to the kernel.
>
> That broke because some compositor used to initialize OpenGL and then
> do a fork(). This indeed worked previously (no GPUVM at that time),
> but with the addition of the background thread obviously broke.
>
> The conclusion back then was that the compositor is broken and needs
> fixing, but it still essentially means that there could be people out
> there with really old userspace where this setting would just break
> the desktop.
>
> I'm not really against that change either, but at least in theory we
> could make fork() work perfectly fine even with VMs and background
> threads.
 You may regret this if you ever try to build a shared virtual address
 space between GPU and CPU. Then you have two processes (parent and
 child) sharing the same render context and GPU VM address space.
 But the
 CPU address spaces are different. You can't maintain consistent shared
 virtual address spaces for both processes when the GPU address
 space is
 shared between them.
>>> That's actually not much of a problem.
>>>
>>> All you need to do is to use pthread_atfork() and do the appropriate
>>> action in parent/child to clean up your context:
>>> https://man7.org/linux/man-pages/man3/pthread_atfork.3.html
>> Thunk already does that. However, it's not foolproof. pthread_atfork
>> handlers aren't called when the process is forked with a clone call.
>
> Yeah, but that's perfectly intentional. clone() is usually used to
> create threads.

Clone can be used to create new processes. Maybe not the common use today.


>
>>> The rest is just to make sure that all shared and all private data are
>>> kept separate all the time. Sharing virtual memory is already done for
>>> decades this way, it's just that nobody ever did it with a stateful
>>> device like GPUs.
>> My concern is not with sharing or not sharing data. It's with sharing
>> the address space itself. If you share the render node, you share GPU
>> virtual address space. However CPU address space is not shared between
>> parent and child. That's a fundamental mismatch between the CPU world
>> and current GPU driver implementation.
>
> Correct, but even that is easily solvable. As I said before you can
> hang this state on a VMA and let it be cloned together with the CPU
> address space.

I'm not following. The address space I'm talking about is struct
amdgpu_vm. It's associated with the render node file descriptor.
Inheriting and using that file descriptor in the child inherits the
amdgpu_vm. I don't see how you can hang that state on any one VMA.

To be consistent with the CPU, you'd need to clone the GPU address space
(struct amdgpu_vm) in the child process. That means you need a new
render node file descriptor that imports all the BOs from the parent
address space. It's a bunch of extra work to fork a process that you're
proposing to immediately undo with an atfork handler. So I really don't
see the point.

Regards,
  Felix


>
> Since VMAs are informed about their cloning (as opposed to file
> descriptors) it's trivial to even just clone kernel data on first access.
>
> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>>
>>> Regards,
>>> Christian.
>>>
 Regards,
     Felix

>


Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-17 Thread Christian König

Am 17.01.22 um 15:17 schrieb Felix Kuehling:

Am 2022-01-17 um 6:44 a.m. schrieb Christian König:

Am 14.01.22 um 18:40 schrieb Felix Kuehling:

Am 2022-01-14 um 12:26 p.m. schrieb Christian König:

Am 14.01.22 um 17:44 schrieb Daniel Vetter:

Top post because I tried to catch up on the entire discussion here.

So fundamentally I'm not opposed to just close this fork() hole
once and
for all. The thing that worries me from an upstream/platform pov is
really
only if we don't do it consistently across all drivers.

So maybe as an idea:
- Do the original patch, but not just for ttm but all gem rendernode
     drivers at least (or maybe even all gem drivers, no idea), with
the
     below discussion cleaned up as justification.

I know of at least one use case which this will break.

A couple of years back we had a discussion on the Mesa mailing list
because (IIRC) Marek introduced a background thread to push command
submissions to the kernel.

That broke because some compositor used to initialize OpenGL and then
do a fork(). This indeed worked previously (no GPUVM at that time),
but with the addition of the background thread obviously broke.

The conclusion back then was that the compositor is broken and needs
fixing, but it still essentially means that there could be people out
there with really old userspace where this setting would just break
the desktop.

I'm not really against that change either, but at least in theory we
could make fork() work perfectly fine even with VMs and background
threads.

You may regret this if you ever try to build a shared virtual address
space between GPU and CPU. Then you have two processes (parent and
child) sharing the same render context and GPU VM address space. But the
CPU address spaces are different. You can't maintain consistent shared
virtual address spaces for both processes when the GPU address space is
shared between them.

That's actually not much of a problem.

All you need to do is to use pthread_atfork() and do the appropriate
action in parent/child to clean up your context:
https://man7.org/linux/man-pages/man3/pthread_atfork.3.html

Thunk already does that. However, it's not foolproof. pthread_atfork
handlers aren't called when the process is forked with a clone call.


Yeah, but that's perfectly intentional. clone() is usually used to 
create threads.



The rest is just to make sure that all shared and all private data are
kept separate all the time. Sharing virtual memory is already done for
decades this way, it's just that nobody ever did it with a stateful
device like GPUs.

My concern is not with sharing or not sharing data. It's with sharing
the address space itself. If you share the render node, you share GPU
virtual address space. However CPU address space is not shared between
parent and child. That's a fundamental mismatch between the CPU world
and current GPU driver implementation.


Correct, but even that is easily solvable. As I said before you can hang 
this state on a VMA and let it be cloned together with the CPU address 
space.


Since VMAs are informed about their cloning (as opposed to file 
descriptors) it's trivial to even just clone kernel data on first access.
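
(The mechanism referred to here is the VMA's open() callback, which the
kernel invokes for the new copy whenever a VMA is duplicated, e.g. across
fork(). A rough sketch, with all drv_* names invented:)

static void drv_vm_open(struct vm_area_struct *vma)
{
	/* called for the child's copy of the VMA on fork(); clone or
	 * take a reference on the per-VMA driver state here */
	drv_vma_state_get(vma->vm_private_data);
}

static const struct vm_operations_struct drv_vm_ops = {
	.open = drv_vm_open,
};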


Regards,
Christian.



Regards,
   Felix



Regards,
Christian.


Regards,
    Felix





Re: [PATCH 2/2] drm/amdgpu: add AMDGPURESET uevent on AMD GPU reset

2022-01-17 Thread Somalapuram, Amaranath

Hi Christian,

if sending a udev event during reset is going to create a problem, we can 
move this code from the reset sequence to re-init (after GPU reset succeeded).


Regards,

S.Amarnath

On 1/17/2022 5:27 PM, Christian König wrote:

Am 17.01.22 um 12:43 schrieb Sharma, Shashank:

Hello Christian,

On 1/17/2022 11:37 AM, Christian König wrote:

Am 17.01.22 um 11:34 schrieb Somalapuram, Amaranath:

[SNIP]
Any suggestion on how we can notify user space in this situation? 


Well trace points should work. They use a ring buffer and are 
specially made to work in such situations.


We are anyway sending a trace event with data, but can a trace event 
be a wake-up source for an application (like a tracer)? This DRM 
event is just to indicate that a reset happened, so the app has to read 
from the trace file.


Yeah, that's a really good question I can't fully answer. As far as I 
know you can filter in userspace what you get from the tracer, but I 
don't know if you can block on a specific event.


Christian.





Alternative would be to audit the udev code and allow giving a GFP 
mask to the function to make sure that we don't run into the deadlock.


Another alternative is a debugfs file which just blocks for the next 
reset to happen.




- Shashank


Regards,
Christian.


Regards,

S.Amarnath







Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-17 Thread Felix Kuehling


Am 2022-01-17 um 6:44 a.m. schrieb Christian König:
> Am 14.01.22 um 18:40 schrieb Felix Kuehling:
>> Am 2022-01-14 um 12:26 p.m. schrieb Christian König:
>>> Am 14.01.22 um 17:44 schrieb Daniel Vetter:
 Top post because I tried to catch up on the entire discussion here.

 So fundamentally I'm not opposed to just close this fork() hole
 once and
 for all. The thing that worries me from an upstream/platform pov is
 really
 only if we don't do it consistently across all drivers.

 So maybe as an idea:
 - Do the original patch, but not just for ttm but all gem rendernode
     drivers at least (or maybe even all gem drivers, no idea), with
 the
     below discussion cleaned up as justification.
>>> I know of at least one use case which this will break.
>>>
>>> A couple of years back we had a discussion on the Mesa mailing list
>>> because (IIRC) Marek introduced a background thread to push command
>>> submissions to the kernel.
>>>
>>> That broke because some compositor used to initialize OpenGL and then
>>> do a fork(). This indeed worked previously (no GPUVM at that time),
>>> but with the addition of the background thread obviously broke.
>>>
>>> The conclusion back then was that the compositor is broken and needs
>>> fixing, but it still essentially means that there could be people out
>>> there with really old userspace where this setting would just break
>>> the desktop.
>>>
>>> I'm not really against that change either, but at least in theory we
>>> could make fork() work perfectly fine even with VMs and background
>>> threads.
>> You may regret this if you ever try to build a shared virtual address
>> space between GPU and CPU. Then you have two processes (parent and
>> child) sharing the same render context and GPU VM address space. But the
>> CPU address spaces are different. You can't maintain consistent shared
>> virtual address spaces for both processes when the GPU address space is
>> shared between them.
>
> That's actually not much of a problem.
>
> All you need to do is to use pthread_atfork() and do the appropriate
> action in parent/child to clean up your context:
> https://man7.org/linux/man-pages/man3/pthread_atfork.3.html

Thunk already does that. However, it's not foolproof. pthread_atfork
handlers aren't called when the process is forked with a clone call.


>
> The rest is just to make sure that all shared and all private data are
> kept separate all the time. Sharing virtual memory is already done for
> decades this way, it's just that nobody ever did it with a stateful
> device like GPUs.
My concern is not with sharing or not sharing data. It's with sharing
the address space itself. If you share the render node, you share GPU
virtual address space. However CPU address space is not shared between
parent and child. That's a fundamental mismatch between the CPU world
and current GPU driver implementation.

Regards,
  Felix


>
> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>


Re: RIP: 0010:radeon_vm_fini+0x15/0x220 [radeon]

2022-01-17 Thread Borislav Petkov
On Mon, Jan 17, 2022 at 08:16:09AM +0100, Christian König wrote:
> Interesting to see that even that old stuff is still used.

Well, "used" is a stretch.

This is my way of testing on K8 as pretty much all the big K8 boxes to
which I had access to, got decommissioned so this baby is the only K8
real hw I have now. :-)

Lemme test your patch.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


Re: [PATCH] drm/radeon: fix error handling in radeon_driver_open_kms

2022-01-17 Thread Jan Stancek
On Mon, Jan 17, 2022 at 10:38 AM Christian König
 wrote:
>
> The return value was never initialized, so the cleanup code was executed
> even when it wasn't necessary.
>
> Just add proper error handling.
>
> Fixes: 2ad5d8fca195 ("drm/radeon/radeon_kms: Fix a NULL pointer dereference 
> in radeon_driver_open_kms()")
> Signed-off-by: Christian König 

Thanks for quick patch, applied on top of 8d0749b4f83b Merge tag
'drm-next-2022-01-07'
it fixed the boot panic for me.

Tested-by: Jan Stancek 

> ---
>  drivers/gpu/drm/radeon/radeon_kms.c | 22 --
>  1 file changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/radeon/radeon_kms.c 
> b/drivers/gpu/drm/radeon/radeon_kms.c
> index e2488559cc9f..11ad210919c8 100644
> --- a/drivers/gpu/drm/radeon/radeon_kms.c
> +++ b/drivers/gpu/drm/radeon/radeon_kms.c
> @@ -666,18 +666,18 @@ int radeon_driver_open_kms(struct drm_device *dev, 
> struct drm_file *file_priv)
> fpriv = kzalloc(sizeof(*fpriv), GFP_KERNEL);
> if (unlikely(!fpriv)) {
> r = -ENOMEM;
> -   goto out_suspend;
> +   goto err_suspend;
> }
>
> if (rdev->accel_working) {
> vm = &fpriv->vm;
> r = radeon_vm_init(rdev, vm);
> if (r)
> -   goto out_fpriv;
> +   goto err_fpriv;
>
> r = radeon_bo_reserve(rdev->ring_tmp_bo.bo, false);
> if (r)
> -   goto out_vm_fini;
> +   goto err_vm_fini;
>
> /* map the ib pool buffer read only into
>  * virtual address space */
> @@ -685,7 +685,7 @@ int radeon_driver_open_kms(struct drm_device *dev, struct 
> drm_file *file_priv)
> rdev->ring_tmp_bo.bo);
> if (!vm->ib_bo_va) {
> r = -ENOMEM;
> -   goto out_vm_fini;
> +   goto err_vm_fini;
> }
>
> r = radeon_vm_bo_set_addr(rdev, vm->ib_bo_va,
> @@ -693,19 +693,21 @@ int radeon_driver_open_kms(struct drm_device *dev, 
> struct drm_file *file_priv)
>   RADEON_VM_PAGE_READABLE |
>   RADEON_VM_PAGE_SNOOPED);
> if (r)
> -   goto out_vm_fini;
> +   goto err_vm_fini;
> }
> file_priv->driver_priv = fpriv;
> }
>
> -   if (!r)
> -   goto out_suspend;
> +   pm_runtime_mark_last_busy(dev->dev);
> +   pm_runtime_put_autosuspend(dev->dev);
> +   return 0;
>
> -out_vm_fini:
> +err_vm_fini:
> radeon_vm_fini(rdev, vm);
> -out_fpriv:
> +err_fpriv:
> kfree(fpriv);
> -out_suspend:
> +
> +err_suspend:
> pm_runtime_mark_last_busy(dev->dev);
> pm_runtime_put_autosuspend(dev->dev);
> return r;
> --
> 2.25.1
>



Re: RIP: 0010:radeon_vm_fini+0x15/0x220 [radeon]

2022-01-17 Thread Jan Stancek

On Mon, Jan 17, 2022 at 08:16:09AM +0100, Christian König wrote:

Hi Borislav,

Am 15.01.22 um 17:11 schrieb Borislav Petkov:

Hi folks,

so this is a *very* old K8 laptop - yap, you read it right, family 0xf.

[   31.353032] powernow_k8: fid 0xa (1800 MHz), vid 0xa
[   31.353569] powernow_k8: fid 0x8 (1600 MHz), vid 0xc
[   31.354081] powernow_k8: fid 0x0 (800 MHz), vid 0x16
[   31.354844] powernow_k8: Found 1 AMD Turion(tm) 64 Mobile Technology MT-34 
(1 cpu cores) (version 2.20.00)

This is a true story.


well, that hardware is ancient ^^.

Interesting to see that even that old stuff is still used.


Anyway, it blows up, see below.

Kernel is latest Linus tree, top commit is:

a33f5c380c4b ("Merge tag 'xfs-5.17-merge-3' of 
git://git.kernel.org/pub/scm/fs/xfs/xfs-linux")

I can bisect if you don't see it immediately why it blows up.


Immediately I see that code is called which isn't for this hardware 
generation.


This is extremely odd because it means that we either have recently 
added a logic bug or the detection of the hardware generation doesn't 
work as expected any more.


Please bisect,
Christian.


I'm seeing panics like this one as well on multiple systems in the lab (e.g.
ProLiant SL390s G7, PowerEdge R805). It looks the same as what Bruno reported
here:
 
https://lore.kernel.org/all/ca+qyu4rt2vhwzbot-sega9yabqc-d36poqtzmy6dscwvp+6...@mail.gmail.com/

It started around 8d0749b4f83b - Merge tag 'drm-next-2022-01-07', running a 
bisect atm.

[   15.230105] SGI XFS with ACLs, security attributes, scrub, quota, no debug enabled 
[   15.234816] XFS (sdb1): Mounting V5 Filesystem 
[   15.342261] [drm] ib test succeeded in 0 usecs 
[   15.343311] [drm] No TV DAC info found in BIOS 
[   15.344061] [drm] Radeon Display Connectors 
[   15.344330] [drm] Connector 0: 
[   15.344961] [drm]   VGA-1 
[   15.345174] [drm]   DDC: 0x60 0x60 0x60 0x60 0x60 0x60 0x60 0x60 
[   15.345991] [drm]   Encoders: 
[   15.346617] [drm] CRT1: INTERNAL_DAC1 
[   15.346942] [drm] Connector 1: 
[   15.347561] [drm]   VGA-2 
[   15.347746] [drm]   DDC: 0x6c 0x6c 0x6c 0x6c 0x6c 0x6c 0x6c 0x6c 
[   15.348598] [drm]   Encoders: 
[   15.349217] [drm] CRT2: INTERNAL_DAC2 
[   15.349521] BUG: kernel NULL pointer dereference, address:  
[   15.349974] #PF: supervisor read access in kernel mode 
[   15.350305] #PF: error_code(0x) - not-present page 
[   15.350675] PGD 0 P4D 0  
[   15.350814] Oops:  [#[   15.431048] CPU: 0 PID: 410 Comm: systemd-udevd Tainted: G  I   5.16.0 #1 
[   15.443401] XFS (sdb1): Ending clean mount 
[   15.451541] Hardware name: HP ProLiant SL390s G7/, BIOS P69 07/02/2013 
[   15.451545] RIP: 0010:radeon_vm_fini+0x174/0x300 [radeon] 
[   15.452689] Code: e8 74 cc 7a c1 eb d1 4c 8b 24 24 4d 8d 74 24 48 49 8b 5c 24 48 49 39 de 74 38 66 2e 0f 1f 84 00 00 00 00 00 66 90 4c 8d 7b a8 <48> 8b 2b 48 8d 7b 18 e8 30 1e f4 ff 48 83 c3 c0 48 89 df e8 34 f3 
[   15.454412] RSP: 0018:a349481 R08: 0020 R09:  
[   15.533944] R10:  R11: c04f7810 R12: 979b4ba46730 
[   15.533945] R13: 979d5c26 R14: 979b4ba46778 R15: ffa8 
[   15.533947] FS:  7f3a13141500() GS:979d4ba0() knlGS: 
[   15.533948] CS:  0010 DS:  ES:  CR0: 80050033 
[   15.533950] CR2:  CR3: 00031c7fc005 CR4: 000206f0 
[   15.533952] Call Trace: 
[   15.533956]   
[   15.533959]  radeon_driver_open_kms+0x118/0x180 [radeon] 
[   15.533998]  drm_file_alloc+0x1a8/0x230 [drm] 
[  
 OK   [[   15.961755]  drm_client_init+0x99/0x130 [drm] 
 [   15.961777]  drm_fb_helper_init+0x32/0x50 [drm_kms_helper] 
 [   15.961809]  radeon_fbdev_init+0xbc/0x110 [radeon] 
 [   15.963653]  radeon_modeset_init+0x857/0x9e0 [radeon] 
 0m] Mounted  [0;[   15.964003]  radeon_driver_load_kms+0x19b/0x290 [radeon] 
 [   15.964474]  drm_dev_register+0xf5/0x2d0 [drm] 
 1;39msysroot.mou[   15.965196]  radeon_pci_probe+0xc3/0x120 [radeon] 
 [   15.965972]  pci_device_probe+0x185/0x220 
 [   15.966225]  call_driver_probe+0x32/0xd0 
 [   15.966505]  really_probe+0x157/0x380 
 [   15.99bus_add_driver+0x111/0x210 
 [   16.467150]  ? 0xc0412000 
 [   16.467805]  driver_register+0x81/0x120 
 [   16.468069]  do_one_initcall+0xb0/0x290 
 [   16.468359]  ? down_write+0xe/0x40 
 [   16.469008]  ? kernfs_activate+0x28/0x130 
 [   16.469267]  ? kernfs_add_one+0x1c8/0x210 
 [   16.469563]  ? vunmap_p4d_range+0x3dc/0x420 
 [   16.469858]  ? __vunmap+0x1df/0x2a0 
 [   16.470466]  ? kmem_cache_alloc_trace+0x1a4/0x330 
 [   16.471224]  ? do_init_module+0x24/0x230 
 [   16.471485]  do_init_module+0x5a/0x230 
 [   16.471779]  load_module+0x145f/0x1630 
 [   16.472022]  ? kernel_read_file_from_fd+0x5d/0x80 
 [   16.472762]  __se_sys_finit_module+0x9f/0xd0 
 [   16.473480]  do_syscall_64+0x43/0x90 
 [   16.473778]  entry_SYSCALL_64_after_hwframe+0x44/0xae 
 [   16.474123] RIP: 0033:0x7f3a13d11e2d 
 [   16.474422

Re: [PATCH 2/2] drm/amdgpu: add AMDGPURESET uevent on AMD GPU reset

2022-01-17 Thread Christian König

Am 17.01.22 um 12:43 schrieb Sharma, Shashank:

Hello Christian,

On 1/17/2022 11:37 AM, Christian König wrote:

Am 17.01.22 um 11:34 schrieb Somalapuram, Amaranath:

[SNIP]
Any suggestion on how we can notify user space in this situation? 


Well trace points should work. They use a ring buffer and are 
specially made to work in such situations.


We are anyway sending a trace event with data, but can a trace event be 
a wake-up source for an application (like a tracer)? This DRM event 
is just to indicate that a reset happened, so the app has to read from 
the trace file.


Yeah, that's a really good question I can't fully answer. As far as I 
know you can filter in userspace what you get from the tracer, but I 
don't know if you can block on a specific event.


Christian.





Alternative would be to audit the udev code and allow giving a GFP 
mask to the function to make sure that we don't run into the deadlock.


Another alternative is a debugfs file which just blocks for the next 
reset to happen.




- Shashank


Regards,
Christian.


Regards,

S.Amarnath







Re: [PATCH 2/2] drm/amdgpu: add AMDGPURESET uevent on AMD GPU reset

2022-01-17 Thread Lazar, Lijo




On 1/17/2022 12:03 PM, Somalapuram Amaranath wrote:

AMDGPURESET uevent added to notify userspace, collect dump_stack and trace

Signed-off-by: Somalapuram Amaranath 
---
  drivers/gpu/drm/amd/amdgpu/nv.c | 45 +
  1 file changed, 45 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
index 2ec1ffb36b1f..b73147ae41fb 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -529,10 +529,55 @@ nv_asic_reset_method(struct amdgpu_device *adev)
}
  }
  
+/**
+ * drm_sysfs_reset_event - generate a DRM uevent
+ * @dev: DRM device
+ *
+ * Send a uevent for the DRM device specified by @dev.  Currently we only
+ * set AMDGPURESET=1 in the uevent environment, but this could be expanded to
+ * deal with other types of events.
+ *
+ * Any new uapi should be using the drm_sysfs_connector_status_event()
+ * for uevents on connector status change.
+ */
+void drm_sysfs_reset_event(struct drm_device *dev)
+{
+   char *event_string = "AMDGPURESET=1";
+   char *envp[2] = { event_string, NULL };
+
+   kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
+}
+
+void amdgpu_reset_dumps(struct amdgpu_device *adev)
+{
+   struct drm_device *ddev = adev_to_drm(adev);
+   int r = 0, i;
+
+   /* original raven doesn't have full asic reset */
+   if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
+   !(adev->apu_flags & AMD_APU_IS_RAVEN2))
+   return;
+   for (i = 0; i < adev->num_ip_blocks; i++) {
+   if (!adev->ip_blocks[i].status.valid)
+   continue;
+   if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
+   continue;
+   r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
+
+   if (r)
+   DRM_ERROR("reset_reg_dumps of IP block <%s> failed 
%d\n",
+   
adev->ip_blocks[i].version->funcs->name, r);
+   }
+
+   drm_sysfs_reset_event(ddev);
+   dump_stack();
+}
+
  static int nv_asic_reset(struct amdgpu_device *adev)
  {
int ret = 0;
  
+	amdgpu_reset_dumps(adev);


Alex recently added a patch to reset the GPU on suspend. It doesn't make 
sense to send an event in such cases; I guess the original intention is 
for GPU recovery related cases.
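
(One way to restrict that, presumably; an untested sketch using the existing
in_suspend flag of amdgpu_device:)

	/* only notify userspace for actual recovery, not suspend resets */
	if (!adev->in_suspend)
		drm_sysfs_reset_event(ddev);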


Thanks,
Lijo


switch (nv_asic_reset_method(adev)) {
case AMD_RESET_METHOD_PCI:
dev_info(adev->dev, "PCI reset\n");



Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-17 Thread Christian König

Am 14.01.22 um 18:40 schrieb Felix Kuehling:

Am 2022-01-14 um 12:26 p.m. schrieb Christian König:

Am 14.01.22 um 17:44 schrieb Daniel Vetter:

Top post because I tried to catch up on the entire discussion here.

So fundamentally I'm not opposed to just close this fork() hole once and
for all. The thing that worries me from an upstream/platform pov is
really
only if we don't do it consistently across all drivers.

So maybe as an idea:
- Do the original patch, but not just for ttm but all gem rendernode
    drivers at least (or maybe even all gem drivers, no idea), with the
    below discussion cleaned up as justification.

I know of at least one use case which this will break.

A couple of years back we had a discussion on the Mesa mailing list
because (IIRC) Marek introduced a background thread to push command
submissions to the kernel.

That broke because some compositor used to initialize OpenGL and then
do a fork(). This indeed worked previously (no GPUVM at that time),
but with the addition of the background thread obviously broke.

The conclusion back then was that the compositor is broken and needs
fixing, but it still essentially means that there could be people out
there with really old userspace where this setting would just break
the desktop.

I'm not really against that change either, but at least in theory we
could make fork() work perfectly fine even with VMs and background
threads.

You may regret this if you ever try to build a shared virtual address
space between GPU and CPU. Then you have two processes (parent and
child) sharing the same render context and GPU VM address space. But the
CPU address spaces are different. You can't maintain consistent shared
virtual address spaces for both processes when the GPU address space is
shared between them.


That's actually not much of a problem.

All you need to do is to use pthread_atfork() and do the appropriate 
action in parent/child to clean up your context: 
https://man7.org/linux/man-pages/man3/pthread_atfork.3.html
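
(A minimal userspace sketch of that pattern; gpu_context_invalidate_after_fork()
is an invented stand-in for whatever the UMD does to drop the inherited
context:)

#include <pthread.h>

/* hypothetical UMD hook: drop the inherited render context/VM handles */
extern void gpu_context_invalidate_after_fork(void);

static void atfork_child(void)
{
	/* runs in the child right after fork() */
	gpu_context_invalidate_after_fork();
}

void gpu_library_init(void)
{
	pthread_atfork(NULL /* prepare */, NULL /* parent */, atfork_child);
}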


The rest is just to make sure that all shared and all private data are 
kept separate all the time. Sharing virtual memory is already done for 
decades this way, it's just that nobody ever did it with a stateful 
device like GPUs.


Regards,
Christian.



Regards,
   Felix





Re: [PATCH 2/2] drm/amdgpu: add AMDGPURESET uevent on AMD GPU reset

2022-01-17 Thread Sharma, Shashank

Hello Christian,

On 1/17/2022 11:37 AM, Christian König wrote:

Am 17.01.22 um 11:34 schrieb Somalapuram, Amaranath:

[SNIP]
Any suggestion on how we can notify user space in this situation? 


Well trace points should work. They use a ring buffer and are specially 
made to work in such situations.


We are anyway sending a trace event with data, but can a trace event be a 
wake-up source for an application (like a tracer)? This DRM event is 
just to indicate that a reset happened, so the app has to read from the 
trace file.




Alternative would be to audit the udev code and allow giving a GFP mask 
to the function to make sure that we don't run into the deadlock.


Another alternative is a debugfs file which just blocks for the next 
reset to happen.




- Shashank


Regards,
Christian.


Regards,

S.Amarnath





Re: [PATCH 2/2] drm/amdgpu: add AMDGPURESET uevent on AMD GPU reset

2022-01-17 Thread Christian König

Am 17.01.22 um 11:34 schrieb Somalapuram, Amaranath:

[SNIP]
Any suggestion on how we can notify user space in this situation? 


Well trace points should work. They use a ring buffer and are specially 
made to work in such situations.
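
(A minimal trace point for this could be declared roughly as follows; the
event name and counter argument are invented here:)

TRACE_EVENT(amdgpu_gpu_reset,
	TP_PROTO(uint32_t counter),
	TP_ARGS(counter),
	TP_STRUCT__entry(__field(uint32_t, counter)),
	TP_fast_assign(__entry->counter = counter;),
	TP_printk("gpu reset #%u", __entry->counter)
);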


Alternative would be to audit the udev code and allow giving a GFP mask 
to the function to make sure that we don't run into the deadlock.


Another alternative is a debugfs file which just blocks for the next 
reset to happen.
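
(Roughly, such a debugfs read could be implemented like this; reset_wait and
reset_count are invented members that the reset path would bump and wake:)

static ssize_t amdgpu_debugfs_wait_reset_read(struct file *f,
					      char __user *buf,
					      size_t size, loff_t *pos)
{
	struct amdgpu_device *adev = file_inode(f)->i_private;
	int old = atomic_read(&adev->reset_count);

	/* sleep until the reset path does atomic_inc() plus wake_up() */
	if (wait_event_interruptible(adev->reset_wait,
				     atomic_read(&adev->reset_count) != old))
		return -ERESTARTSYS;

	return 0; /* EOF; userspace reads again to wait for the next reset */
}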


Regards,
Christian.


Regards,

S.Amarnath





Re: [PATCH 2/2] drm/amdgpu: add AMDGPURESET uevent on AMD GPU reset

2022-01-17 Thread Somalapuram, Amaranath


On 1/17/2022 3:49 PM, Christian König wrote:

Am 17.01.22 um 11:09 schrieb Somalapuram, Amaranath:

[AMD Official Use Only]



-Original Message-
From: Koenig, Christian 
Sent: Monday, January 17, 2022 3:33 PM
To: Somalapuram, Amaranath ; 
amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Sharma, Shashank 

Subject: Re: [PATCH 2/2] drm/amdgpu: add AMDGPURESET uevent on AMD 
GPU reset


Am 17.01.22 um 11:01 schrieb Somalapuram, Amaranath:

[AMD Official Use Only]



-Original Message-
From: Koenig, Christian 
Sent: Monday, January 17, 2022 12:57 PM
To: Somalapuram, Amaranath ;
amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Sharma, Shashank

Subject: Re: [PATCH 2/2] drm/amdgpu: add AMDGPURESET uevent on AMD GPU
reset

Am 17.01.22 um 07:33 schrieb Somalapuram Amaranath:

AMDGPURESET uevent added to notify userspace, collect dump_stack and
trace

Signed-off-by: Somalapuram Amaranath 
---
    drivers/gpu/drm/amd/amdgpu/nv.c | 45 
+

    1 file changed, 45 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c
b/drivers/gpu/drm/amd/amdgpu/nv.c index 2ec1ffb36b1f..b73147ae41fb
100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -529,10 +529,55 @@ nv_asic_reset_method(struct amdgpu_device *adev)
    }
    }
    +/**
+ * drm_sysfs_reset_event - generate a DRM uevent
+ * @dev: DRM device
+ *
+ * Send a uevent for the DRM device specified by @dev. Currently we
+only
+ * set AMDGPURESET=1 in the uevent environment, but this could be
+expanded to
+ * deal with other types of events.
+ *
+ * Any new uapi should be using the
+drm_sysfs_connector_status_event()
+ * for uevents on connector status change.
+ */
+void drm_sysfs_reset_event(struct drm_device *dev)

This should probably go directly into the DRM subsystem.


+{
+    char *event_string = "AMDGPURESET=1";
+    char *envp[2] = { event_string, NULL };
+
+ kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
As I said this approach is a clear NAK. We can't allocate any memory 
when we do a reset.


Regards,
Christian.

Can I do something like this:

void drm_sysfs_reset_event(struct drm_device *dev)
   {
-   char *event_string = "AMDGPURESET=1";
-   char *envp[2] = { event_string, NULL };
+   char **envp;
+
+   envp = kcalloc(2,sizeof(char *), GFP_ATOMIC);
+   envp[0] = kcalloc(30, sizeof(char), GFP_ATOMIC);
+   envp[1] = kcalloc(30, sizeof(char), GFP_ATOMIC);
No, not really. kobject_uevent_env() will still allocate memory 
without GFP_ATOMIC.


I think the whole approach of using udev won't work for this.

Regards,
Christian.

I have tested it with sample applications:
Gpu reset:
sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover

And I never missed the AMDGPURESET=1 event in user space,


That's not the problem. Allocating memory when we need to do a reset 
can cause a *HARD* kernel deadlock.


This is absolutely not something we can do and Daniel even tried to 
add a few lockdep annotations for this.


So automated testing scripts will complain that this won't work.

Regards,
Christian.

Any suggestion on how we can notify user space in this situation?

Regards,

S.Amarnath



even with continuous resets with sudo cat 
/sys/kernel/debug/dri/0/amdgpu_gpu_recover .





Regards,
S.Amarnath

+
+   strcpy(envp[0], "AMDGPURESET=1");
+   strcpy(envp[1], "");
+

kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE,
envp);
+
+   kfree(envp[0]);
+   kfree(envp[1]);
+   kfree(envp);
   }

Regards,
S.Amarnath


+}
+
+void amdgpu_reset_dumps(struct amdgpu_device *adev) {
+    struct drm_device *ddev = adev_to_drm(adev);
+    int r = 0, i;
+
+    /* original raven doesn't have full asic reset */
+    if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
+    !(adev->apu_flags & AMD_APU_IS_RAVEN2))
+    return;
+    for (i = 0; i < adev->num_ip_blocks; i++) {
+    if (!adev->ip_blocks[i].status.valid)
+    continue;
+    if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
+    continue;
+    r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
+
+    if (r)
+    DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n",
+ adev->ip_blocks[i].version->funcs->name, r);
+    }
+
+    drm_sysfs_reset_event(ddev);
+    dump_stack();
+}
+
    static int nv_asic_reset(struct amdgpu_device *adev)
    {
    int ret = 0;
    +    amdgpu_reset_dumps(adev);
    switch (nv_asic_reset_method(adev)) {
    case AMD_RESET_METHOD_PCI:
    dev_info(adev->dev, "PCI reset\n");


Re: [PATCH 2/2] drm/amdgpu: add AMDGPURESET uevent on AMD GPU reset

2022-01-17 Thread Christian König

Am 17.01.22 um 11:09 schrieb Somalapuram, Amaranath:

[AMD Official Use Only]



-Original Message-
From: Koenig, Christian 
Sent: Monday, January 17, 2022 3:33 PM
To: Somalapuram, Amaranath ; 
amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Sharma, Shashank 

Subject: Re: [PATCH 2/2] drm/amdgpu: add AMDGPURESET uevent on AMD GPU reset

Am 17.01.22 um 11:01 schrieb Somalapuram, Amaranath:

[AMD Official Use Only]



-Original Message-
From: Koenig, Christian 
Sent: Monday, January 17, 2022 12:57 PM
To: Somalapuram, Amaranath ;
amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Sharma, Shashank

Subject: Re: [PATCH 2/2] drm/amdgpu: add AMDGPURESET uevent on AMD GPU
reset

Am 17.01.22 um 07:33 schrieb Somalapuram Amaranath:

AMDGPURESET uevent added to notify userspace, collect dump_stack and
trace

Signed-off-by: Somalapuram Amaranath 
---
drivers/gpu/drm/amd/amdgpu/nv.c | 45 +
1 file changed, 45 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c
b/drivers/gpu/drm/amd/amdgpu/nv.c index 2ec1ffb36b1f..b73147ae41fb
100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -529,10 +529,55 @@ nv_asic_reset_method(struct amdgpu_device *adev)
}
}

+/**
+ * drm_sysfs_reset_event - generate a DRM uevent
+ * @dev: DRM device
+ *
+ * Send a uevent for the DRM device specified by @dev.  Currently we
+only
+ * set AMDGPURESET=1 in the uevent environment, but this could be
+expanded to
+ * deal with other types of events.
+ *
+ * Any new uapi should be using the
+drm_sysfs_connector_status_event()
+ * for uevents on connector status change.
+ */
+void drm_sysfs_reset_event(struct drm_device *dev)

This should probably go directly into the DRM subsystem.


+{
+   char *event_string = "AMDGPURESET=1";
+   char *envp[2] = { event_string, NULL };
+
+   kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);

As I said this approach is a clear NAK. We can't allocate any memory when we do 
a reset.

Regards,
Christian.

Can I do something like this:

void drm_sysfs_reset_event(struct drm_device *dev)
   {
-   char *event_string = "AMDGPURESET=1";
-   char *envp[2] = { event_string, NULL };
+   char **envp;
+
+   envp = kcalloc(2,sizeof(char *), GFP_ATOMIC);
+   envp[0] = kcalloc(30, sizeof(char), GFP_ATOMIC);
+   envp[1] = kcalloc(30, sizeof(char), GFP_ATOMIC);

No, not really. kobject_uevent_env() will still allocate memory without 
GFP_ATOMIC.

I think the whole approach of using udev won't work for this.

Regards,
Christian.

I have tested it with sample applications:
Gpu reset:
sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover

And I never missed the AMDGPURESET=1 event in user space,


That's not the problem. Allocating memory when we need to do a reset can 
cause a *HARD* kernel deadlock.


This is absolutely not something we can do and Daniel even tried to add 
a few lockdep annotations for this.


So automated testing scripts will complain that this won't work.

Regards,
Christian.


even with continuous resets with sudo cat 
/sys/kernel/debug/dri/0/amdgpu_gpu_recover .





Regards,
S.Amarnath

+
+   strcpy(envp[0], "AMDGPURESET=1");
+   strcpy(envp[1], "");
+

  kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE,
envp);
+
+   kfree(envp[0]);
+   kfree(envp[1]);
+   kfree(envp);
   }

Regards,
S.Amarnath


+}
+
+void amdgpu_reset_dumps(struct amdgpu_device *adev) {
+   struct drm_device *ddev = adev_to_drm(adev);
+   int r = 0, i;
+
+   /* original raven doesn't have full asic reset */
+   if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
+   !(adev->apu_flags & AMD_APU_IS_RAVEN2))
+   return;
+   for (i = 0; i < adev->num_ip_blocks; i++) {
+   if (!adev->ip_blocks[i].status.valid)
+   continue;
+   if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
+   continue;
+   r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
+
+   if (r)
+   DRM_ERROR("reset_reg_dumps of IP block <%s> failed 
%d\n",
+   
adev->ip_blocks[i].version->funcs->name, r);
+   }
+
+   drm_sysfs_reset_event(ddev);
+   dump_stack();
+}
+
static int nv_asic_reset(struct amdgpu_device *adev)
{
int ret = 0;

+	amdgpu_reset_dumps(adev);

switch (nv_asic_reset_method(adev)) {
case AMD_RESET_METHOD_PCI:
dev_info(adev->dev, "PCI reset\n");




RE: [PATCH 2/2] drm/amdgpu: add AMDGPURESET uevent on AMD GPU reset

2022-01-17 Thread Somalapuram, Amaranath
[AMD Official Use Only]



-Original Message-
From: Koenig, Christian  
Sent: Monday, January 17, 2022 3:33 PM
To: Somalapuram, Amaranath ; 
amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Sharma, Shashank 

Subject: Re: [PATCH 2/2] drm/amdgpu: add AMDGPURESET uevent on AMD GPU reset

Am 17.01.22 um 11:01 schrieb Somalapuram, Amaranath:
> [AMD Official Use Only]
>
>
>
> -Original Message-
> From: Koenig, Christian 
> Sent: Monday, January 17, 2022 12:57 PM
> To: Somalapuram, Amaranath ; 
> amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Sharma, Shashank 
> 
> Subject: Re: [PATCH 2/2] drm/amdgpu: add AMDGPURESET uevent on AMD GPU 
> reset
>
> Am 17.01.22 um 07:33 schrieb Somalapuram Amaranath:
>> AMDGPURESET uevent added to notify userspace, collect dump_stack and 
>> trace
>>
>> Signed-off-by: Somalapuram Amaranath 
>> ---
>>drivers/gpu/drm/amd/amdgpu/nv.c | 45 +
>>1 file changed, 45 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c 
>> b/drivers/gpu/drm/amd/amdgpu/nv.c index 2ec1ffb36b1f..b73147ae41fb
>> 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
>> @@ -529,10 +529,55 @@ nv_asic_reset_method(struct amdgpu_device *adev)
>>  }
>>}
>>
>> +/**
>> + * drm_sysfs_reset_event - generate a DRM uevent
>> + * @dev: DRM device
>> + *
>> + * Send a uevent for the DRM device specified by @dev.  Currently we 
>> +only
>> + * set AMDGPURESET=1 in the uevent environment, but this could be 
>> +expanded to
>> + * deal with other types of events.
>> + *
>> + * Any new uapi should be using the
>> +drm_sysfs_connector_status_event()
>> + * for uevents on connector status change.
>> + */
>> +void drm_sysfs_reset_event(struct drm_device *dev)
> This should probably go directly into the DRM subsystem.
>
>> +{
>> +char *event_string = "AMDGPURESET=1";
>> +char *envp[2] = { event_string, NULL };
>> +
>> +kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> As I said this approach is a clear NAK. We can't allocate any memory when we 
> do a reset.
>
> Regards,
> Christian.
>
> Can I do something like this:
>
> void drm_sysfs_reset_event(struct drm_device *dev)
>   {
> -   char *event_string = "AMDGPURESET=1";
> -   char *envp[2] = { event_string, NULL };
> +   char **envp;
> +
> +   envp = kcalloc(2,sizeof(char *), GFP_ATOMIC);
> +   envp[0] = kcalloc(30, sizeof(char), GFP_ATOMIC);
> +   envp[1] = kcalloc(30, sizeof(char), GFP_ATOMIC);

No, not really. kobject_uevent_env() will still allocate memory without 
GFP_ATOMIC.

I think the whole approach of using udev won't work for this.

Regards,
Christian.

I have tested it with sample applications:
GPU reset:
sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover

And I never missed the AMDGPURESET=1 event in user space, even with
continuous resets with sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover .

Regards,
S.Amarnath
> +
> +   strcpy(envp[0], "AMDGPURESET=1");
> +   strcpy(envp[1], "");
> +
>
>  kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> +
> +   kfree(envp[0]);
> +   kfree(envp[1]);
> +   kfree(envp);
>   }
>
> Regards,
> S.Amarnath
>
>> +}
>> +
>> +void amdgpu_reset_dumps(struct amdgpu_device *adev)
>> +{
>> +struct drm_device *ddev = adev_to_drm(adev);
>> +int r = 0, i;
>> +
>> +/* original raven doesn't have full asic reset */
>> +if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
>> +!(adev->apu_flags & AMD_APU_IS_RAVEN2))
>> +return;
>> +for (i = 0; i < adev->num_ip_blocks; i++) {
>> +if (!adev->ip_blocks[i].status.valid)
>> +continue;
>> +if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
>> +continue;
>> +r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
>> +
>> +if (r)
>> +DRM_ERROR("reset_reg_dumps of IP block <%s> failed 
>> %d\n",
>> +
>> adev->ip_blocks[i].version->funcs->name, r);
>> +}
>> +
>> +drm_sysfs_reset_event(ddev);
>> +dump_stack();
>> +}
>> +
>>static int nv_asic_reset(struct amdgpu_device *adev)
>>{
>>  int ret = 0;
>>
>> +amdgpu_reset_dumps(adev);
>>  switch (nv_asic_reset_method(adev)) {
>>  case AMD_RESET_METHOD_PCI:
>>  dev_info(adev->dev, "PCI reset\n");
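
(For reference, a minimal userspace listener for the proposed event could
look like the sketch below. Illustrative only; it assumes the AMDGPURESET=1
property is delivered exactly as in the posted patch and is not part of the
series. Build with: gcc -o reset-listener reset-listener.c -ludev)

#include <libudev.h>
#include <poll.h>
#include <stdio.h>

int main(void)
{
	struct udev *udev = udev_new();
	struct udev_monitor *mon = udev_monitor_new_from_netlink(udev, "udev");
	struct pollfd pfd;

	/* only DRM devices are interesting here */
	udev_monitor_filter_add_match_subsystem_devtype(mon, "drm", NULL);
	udev_monitor_enable_receiving(mon);
	pfd.fd = udev_monitor_get_fd(mon);
	pfd.events = POLLIN;

	while (poll(&pfd, 1, -1) > 0) {
		struct udev_device *dev = udev_monitor_receive_device(mon);

		if (!dev)
			continue;
		/* property name as proposed in the patch above */
		if (udev_device_get_property_value(dev, "AMDGPURESET"))
			printf("GPU reset reported on %s\n",
			       udev_device_get_sysname(dev));
		udev_device_unref(dev);
	}

	udev_monitor_unref(mon);
	udev_unref(udev);
	return 0;
}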


Re: [PATCH 2/2] drm/amdgpu: add AMDGPURESET uevent on AMD GPU reset

2022-01-17 Thread Christian König

On 17.01.22 at 11:01, Somalapuram, Amaranath wrote:

[AMD Official Use Only]



-Original Message-
From: Koenig, Christian 
Sent: Monday, January 17, 2022 12:57 PM
To: Somalapuram, Amaranath ; 
amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Sharma, Shashank 

Subject: Re: [PATCH 2/2] drm/amdgpu: add AMDGPURESET uevent on AMD GPU reset

On 17.01.22 at 07:33, Somalapuram Amaranath wrote:

AMDGPURESET uevent added to notify userspace, collect dump_stack and
trace

Signed-off-by: Somalapuram Amaranath 
---
   drivers/gpu/drm/amd/amdgpu/nv.c | 45 +
   1 file changed, 45 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
index 2ec1ffb36b1f..b73147ae41fb 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -529,10 +529,55 @@ nv_asic_reset_method(struct amdgpu_device *adev)
}
   }
   
+/**
+ * drm_sysfs_reset_event - generate a DRM uevent
+ * @dev: DRM device
+ *
+ * Send a uevent for the DRM device specified by @dev.  Currently we only
+ * set AMDGPURESET=1 in the uevent environment, but this could be expanded
+ * to deal with other types of events.
+ *
+ * Any new uapi should be using the drm_sysfs_connector_status_event()
+ * for uevents on connector status change.
+ */
+void drm_sysfs_reset_event(struct drm_device *dev)

This should probably go directly into the DRM subsystem.


+{
+   char *event_string = "AMDGPURESET=1";
+   char *envp[2] = { event_string, NULL };
+
+   kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);

As I said this approach is a clear NAK. We can't allocate any memory when we do 
a reset.

Regards,
Christian.

Can I do something like this:

void drm_sysfs_reset_event(struct drm_device *dev)
  {
-   char *event_string = "AMDGPURESET=1";
-   char *envp[2] = { event_string, NULL };
+   char **envp;
+
+   envp = kcalloc(2,sizeof(char *), GFP_ATOMIC);
+   envp[0] = kcalloc(30, sizeof(char), GFP_ATOMIC);
+   envp[1] = kcalloc(30, sizeof(char), GFP_ATOMIC);


No, not really. kobject_uevent_env() will still allocate memory without 
GFP_ATOMIC.


I think the whole approach of using udev won't work for this.

Regards,
Christian.


+
+   strcpy(envp[0], "AMDGPURESET=1");
+   strcpy(envp[1], "");
+

 kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
+
+   kfree(envp[0]);
+   kfree(envp[1]);
+   kfree(envp);
  }

Regards,
S.Amarnath


+}
+
+void amdgpu_reset_dumps(struct amdgpu_device *adev)
+{
+   struct drm_device *ddev = adev_to_drm(adev);
+   int r = 0, i;
+
+   /* original raven doesn't have full asic reset */
+   if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
+   !(adev->apu_flags & AMD_APU_IS_RAVEN2))
+   return;
+   for (i = 0; i < adev->num_ip_blocks; i++) {
+   if (!adev->ip_blocks[i].status.valid)
+   continue;
+   if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
+   continue;
+   r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
+
+   if (r)
+   DRM_ERROR("reset_reg_dumps of IP block <%s> failed 
%d\n",
+   
adev->ip_blocks[i].version->funcs->name, r);
+   }
+
+   drm_sysfs_reset_event(ddev);
+   dump_stack();
+}
+
   static int nv_asic_reset(struct amdgpu_device *adev)
   {
int ret = 0;
   
+	amdgpu_reset_dumps(adev);

switch (nv_asic_reset_method(adev)) {
case AMD_RESET_METHOD_PCI:
dev_info(adev->dev, "PCI reset\n");
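
(For context, the internal allocation referred to lives in
lib/kobject_uevent.c. Abridged from memory, so check the tree for the exact
code; the point is that the environment buffer is always allocated with
GFP_KERNEL, independent of how the caller allocated its envp strings:)

int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
		       char *envp_ext[])
{
	struct kobj_uevent_env *env;

	/* ... kset/uevent_ops lookup elided ... */

	env = kzalloc(sizeof(struct kobj_uevent_env), GFP_KERNEL);
	if (!env)
		return -ENOMEM;

	/* ... fill env, netlink broadcast, optional usermode helper ... */
	return 0;
}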




RE: [PATCH 2/2] drm/amdgpu: add AMDGPURESET uevent on AMD GPU reset

2022-01-17 Thread Somalapuram, Amaranath
[AMD Official Use Only]



-Original Message-
From: Koenig, Christian  
Sent: Monday, January 17, 2022 12:57 PM
To: Somalapuram, Amaranath ; 
amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Sharma, Shashank 

Subject: Re: [PATCH 2/2] drm/amdgpu: add AMDGPURESET uevent on AMD GPU reset

On 17.01.22 at 07:33, Somalapuram Amaranath wrote:
> AMDGPURESET uevent added to notify userspace, collect dump_stack and 
> trace
>
> Signed-off-by: Somalapuram Amaranath 
> ---
>   drivers/gpu/drm/amd/amdgpu/nv.c | 45 +
>   1 file changed, 45 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
> index 2ec1ffb36b1f..b73147ae41fb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
> @@ -529,10 +529,55 @@ nv_asic_reset_method(struct amdgpu_device *adev)
>   }
>   }
>   
> +/**
> + * drm_sysfs_reset_event - generate a DRM uevent
> + * @dev: DRM device
> + *
> + * Send a uevent for the DRM device specified by @dev.  Currently we only
> + * set AMDGPURESET=1 in the uevent environment, but this could be expanded
> + * to deal with other types of events.
> + *
> + * Any new uapi should be using the drm_sysfs_connector_status_event()
> + * for uevents on connector status change.
> + */
> +void drm_sysfs_reset_event(struct drm_device *dev)

This should probably go directly into the DRM subsystem.

> +{
> + char *event_string = "AMDGPURESET=1";
> + char *envp[2] = { event_string, NULL };
> +
> + kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);

As I said this approach is a clear NAK. We can't allocate any memory when we do 
a reset.

Regards,
Christian.

Can I do something like this:

void drm_sysfs_reset_event(struct drm_device *dev)
 {
-   char *event_string = "AMDGPURESET=1";
-   char *envp[2] = { event_string, NULL };
+   char **envp;
+
+   envp = kcalloc(2,sizeof(char *), GFP_ATOMIC);
+   envp[0] = kcalloc(30, sizeof(char), GFP_ATOMIC);
+   envp[1] = kcalloc(30, sizeof(char), GFP_ATOMIC);
+
+   strcpy(envp[0], "AMDGPURESET=1");
+   strcpy(envp[1], "");
+

kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
+
+   kfree(envp[0]);
+   kfree(envp[1]);
+   kfree(envp);
 }

Regards,
S.Amarnath

> +}
> +
> +void amdgpu_reset_dumps(struct amdgpu_device *adev)
> +{
> + struct drm_device *ddev = adev_to_drm(adev);
> + int r = 0, i;
> +
> + /* original raven doesn't have full asic reset */
> + if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
> + !(adev->apu_flags & AMD_APU_IS_RAVEN2))
> + return;
> + for (i = 0; i < adev->num_ip_blocks; i++) {
> + if (!adev->ip_blocks[i].status.valid)
> + continue;
> + if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
> + continue;
> + r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
> +
> + if (r)
> + DRM_ERROR("reset_reg_dumps of IP block <%s> failed 
> %d\n",
> + 
> adev->ip_blocks[i].version->funcs->name, r);
> + }
> +
> + drm_sysfs_reset_event(ddev);
> + dump_stack();
> +}
> +
>   static int nv_asic_reset(struct amdgpu_device *adev)
>   {
>   int ret = 0;
>   
> + amdgpu_reset_dumps(adev);
>   switch (nv_asic_reset_method(adev)) {
>   case AMD_RESET_METHOD_PCI:
>   dev_info(adev->dev, "PCI reset\n");


[PATCH] drm/radeon: fix error handling in radeon_driver_open_kms

2022-01-17 Thread Christian König
The return value was never initialized, so the cleanup code was executed
even when it wasn't necessary.

Just add proper error handling.

Fixes: 2ad5d8fca195 ("drm/radeon/radeon_kms: Fix a NULL pointer dereference in radeon_driver_open_kms()")
Signed-off-by: Christian König 
---
 drivers/gpu/drm/radeon/radeon_kms.c | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon_kms.c b/drivers/gpu/drm/radeon/radeon_kms.c
index e2488559cc9f..11ad210919c8 100644
--- a/drivers/gpu/drm/radeon/radeon_kms.c
+++ b/drivers/gpu/drm/radeon/radeon_kms.c
@@ -666,18 +666,18 @@ int radeon_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
fpriv = kzalloc(sizeof(*fpriv), GFP_KERNEL);
if (unlikely(!fpriv)) {
r = -ENOMEM;
-   goto out_suspend;
+   goto err_suspend;
}
 
if (rdev->accel_working) {
vm = &fpriv->vm;
r = radeon_vm_init(rdev, vm);
if (r)
-   goto out_fpriv;
+   goto err_fpriv;
 
r = radeon_bo_reserve(rdev->ring_tmp_bo.bo, false);
if (r)
-   goto out_vm_fini;
+   goto err_vm_fini;
 
/* map the ib pool buffer read only into
 * virtual address space */
@@ -685,7 +685,7 @@ int radeon_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
rdev->ring_tmp_bo.bo);
if (!vm->ib_bo_va) {
r = -ENOMEM;
-   goto out_vm_fini;
+   goto err_vm_fini;
}
 
r = radeon_vm_bo_set_addr(rdev, vm->ib_bo_va,
@@ -693,19 +693,21 @@ int radeon_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
  RADEON_VM_PAGE_READABLE |
  RADEON_VM_PAGE_SNOOPED);
if (r)
-   goto out_vm_fini;
+   goto err_vm_fini;
}
file_priv->driver_priv = fpriv;
}
 
-   if (!r)
-   goto out_suspend;
+   pm_runtime_mark_last_busy(dev->dev);
+   pm_runtime_put_autosuspend(dev->dev);
+   return 0;
 
-out_vm_fini:
+err_vm_fini:
radeon_vm_fini(rdev, vm);
-out_fpriv:
+err_fpriv:
kfree(fpriv);
-out_suspend:
+
+err_suspend:
pm_runtime_mark_last_busy(dev->dev);
pm_runtime_put_autosuspend(dev->dev);
return r;
-- 
2.25.1
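
(The pattern applied above, reduced to a minimal standalone sketch with
hypothetical helpers, not the radeon code itself: every failure site jumps to
its own unwind label and the success path returns 0 explicitly, so no path
ever reads an uninitialized return value.)

struct ctx;

int setup_a(struct ctx *c);
int setup_b(struct ctx *c);
void teardown_a(struct ctx *c);

int example_open(struct ctx *c)
{
	int r;

	r = setup_a(c);
	if (r)
		return r;

	r = setup_b(c);
	if (r)
		goto err_teardown_a;

	return 0;	/* explicit success, no "if (!r)" guesswork */

err_teardown_a:
	teardown_a(c);
	return r;
}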



Re: RIP: 0010:radeon_vm_fini+0x15/0x220 [radeon]

2022-01-17 Thread Christian König




On 17.01.22 at 09:42, Jan Stancek wrote:

On Mon, Jan 17, 2022 at 08:16:09AM +0100, Christian König wrote:

Hi Borislav,

On 15.01.22 at 17:11, Borislav Petkov wrote:

Hi folks,

so this is a *very* old K8 laptop - yap, you read it right, family 0xf.

[   31.353032] powernow_k8: fid 0xa (1800 MHz), vid 0xa
[   31.353569] powernow_k8: fid 0x8 (1600 MHz), vid 0xc
[   31.354081] powernow_k8: fid 0x0 (800 MHz), vid 0x16
[   31.354844] powernow_k8: Found 1 AMD Turion(tm) 64 Mobile Technology MT-34 (1 cpu cores) (version 2.20.00)


This is true story.


well, that hardware is ancient ^^.

Interesting to see that even that old stuff is still used.


Anyway, it blows up, see below.

Kernel is latest Linus tree, top commit is:

a33f5c380c4b ("Merge tag 'xfs-5.17-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux")


I can bisect if you don't see it immediately why it blows up.


Immediately I see that code is called which isn't for this hardware 
generation.


This is extremely odd because it means that we either have recently 
added a logic bug or the detection of the hardware generation doesn't 
work as expected any more.


Please bisect,
Christian.


I'm seeing panics like this one as well on multiple systems in the lab
(e.g. ProLiant SL390s G7, PowerEdge R805). Looks the same as what Bruno
reported here:
https://lore.kernel.org/all/CA+QYu4rt2VHWzbOt-SegA9yABqC-D36PoqTZmy6DscWvp+6ZMQ@mail.gmail.com/



It started around 8d0749b4f83b - Merge tag 'drm-next-2022-01-07', 
running a bisect atm.


Not necessary any more. That is probably caused by commit
ab50cb9df8896b39aae65c537a30de2c79c19735 ("drm/radeon/radeon_kms: Fix a
NULL pointer dereference in radeon_driver_open_kms()").


I'm getting other bug reports for that one as well. Going to take a look.

Regards,
Christian.



[   15.230105] SGI XFS with ACLs, security attributes, scrub, quota, no debug enabled
[   15.234816] XFS (sdb1): Mounting V5 Filesystem
[   15.342261] [drm] ib test succeeded in 0 usecs
[   15.343311] [drm] No TV DAC info found in BIOS
[   15.344061] [drm] Radeon Display Connectors
[   15.344330] [drm] Connector 0:
[   15.344961] [drm]   VGA-1
[   15.345174] [drm]   DDC: 0x60 0x60 0x60 0x60 0x60 0x60 0x60 0x60
[   15.345991] [drm]   Encoders:
[   15.346617] [drm] CRT1: INTERNAL_DAC1
[   15.346942] [drm] Connector 1:
[   15.347561] [drm]   VGA-2
[   15.347746] [drm]   DDC: 0x6c 0x6c 0x6c 0x6c 0x6c 0x6c 0x6c 0x6c
[   15.348598] [drm]   Encoders:
[   15.349217] [drm] CRT2: INTERNAL_DAC2
[   15.349521] BUG: kernel NULL pointer dereference, address:
[   15.349974] #PF: supervisor read access in kernel mode
[   15.350305] #PF: error_code(0x) - not-present page
[   15.350675] PGD 0 P4D 0
[   15.350814] Oops:  [#
[   15.431048] CPU: 0 PID: 410 Comm: systemd-udevd Tainted: G I   5.16.0 #1
[   15.443401] XFS (sdb1): Ending clean mount
[   15.451541] Hardware name: HP ProLiant SL390s G7/, BIOS P69 07/02/2013
[   15.451545] RIP: 0010:radeon_vm_fini+0x174/0x300 [radeon]
[   15.452689] Code: e8 74 cc 7a c1 eb d1 4c 8b 24 24 4d 8d 74 24 48 49 8b 5c 24 48 49 39 de 74 38 66 2e 0f 1f 84 00 00 00 00 00 66 90 4c 8d 7b a8 <48> 8b 2b 48 8d 7b 18 e8 30 1e f4 ff 48 83 c3 c0 48 89 df e8 34 f3
[   15.454412] RSP: 0018:a349481 R08: 0020 R09:
[   15.533944] R10:  R11: c04f7810 R12: 979b4ba46730
[   15.533945] R13: 979d5c26 R14: 979b4ba46778 R15: ffa8
[   15.533947] FS: 7f3a13141500() GS:979d4ba0() knlGS:
[   15.533948] CS:  0010 DS:  ES:  CR0: 80050033
[   15.533950] CR2:  CR3: 00031c7fc005 CR4: 000206f0
[   15.533952] Call Trace:
[   15.533956]
[   15.533959]  radeon_driver_open_kms+0x118/0x180 [radeon]
[   15.533998]  drm_file_alloc+0x1a8/0x230 [drm]
[   15.961755]  drm_client_init+0x99/0x130 [drm]
[   15.961777]  drm_fb_helper_init+0x32/0x50 [drm_kms_helper]
[   15.961809]  radeon_fbdev_init+0xbc/0x110 [radeon]
[   15.963653]  radeon_modeset_init+0x857/0x9e0 [radeon]
[   15.964003]  radeon_driver_load_kms+0x19b/0x290 [radeon]
[   15.964474]  drm_dev_register+0xf5/0x2d0 [drm]
[   15.965196]  radeon_pci_probe+0xc3/0x120 [radeon]
[   15.965972]  pci_device_probe+0x185/0x220
[   15.966225]  call_driver_probe+0x32/0xd0
[   15.966505]  really_probe+0x157/0x380
[   15.99 bus_add_driver+0x111/0x210
[   16.467150]  ? 0xc0412000
[   16.467805]  driver_register+0x81/0x120
[   16.468069]  do_one_initcall+0xb0/0x290
[   16.46

Re: [PATCH] Revert "i2c: core: support bus regulator controlling in adapter"

2022-01-17 Thread Hsin-Yi Wang
hi Tareque,


On Fri, Jan 14, 2022 at 6:09 PM Tareque Md Hanif
 wrote:
>
> Hi Hsin-Yi,
>
> On 1/12/22 16:58, Hsin-Yi Wang wrote:
>
> Can you help provide logs if we apply
> 5a7b95fb993ec399c8a685552aa6a8fc995c40bd but revert
> 8d35a2596164c1c9d34d4656fd42b445cd1e247f?
>
> Issue still exists. journalctl log attached in revert_8d.txt
>
>
> > after apply 5a7b95fb993ec399c8a685552aa6a8fc995c40bd
> > 1. delete SET_LATE_SYSTEM_SLEEP_PM_OPS(i2c_suspend_late,
> > i2c_resume_early) and function i2c_suspend_late() and
> > i2c_resume_early().
>
> No issues. journalctl log attached in test1.txt
>
>
> > 2. delete SET_RUNTIME_PM_OPS(i2c_runtime_suspend, i2c_runtime_resume,
> > NULL) and function i2c_runtime_suspend() and i2c_runtime_resume().
>
> Issue exists. journalctl log attached in test2.txt

Thanks for the testing.
Can you help us test if applying the following patch on top of
5a7b95fb993ec399c8a685552aa6a8fc995c40bd works? Thanks

diff --git a/drivers/i2c/i2c-core-base.c b/drivers/i2c/i2c-core-base.c
index 9eb4009cb250..6b046012aa08 100644
--- a/drivers/i2c/i2c-core-base.c
+++ b/drivers/i2c/i2c-core-base.c
@@ -484,7 +484,7 @@ static int i2c_resume_early(struct device *dev)
struct i2c_client *client = i2c_verify_client(dev);
int err;

-   if (!client)
+   if (!client || dev_pm_skip_resume(dev))
return 0;

if (pm_runtime_status_suspended(&client->dev) &&
@@ -502,7 +502,7 @@ static int i2c_suspend_late(struct device *dev)
struct i2c_client *client = i2c_verify_client(dev);
int err;

-   if (!client)
+   if (!client || dev_pm_skip_suspend(dev))
return 0;

err = pm_generic_suspend_late(&client->dev);
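
(For context, dev_pm_skip_suspend() and dev_pm_skip_resume() are PM-core
helpers. Roughly, abridged from drivers/base/power/main.c from memory, so
check the tree: late suspend is skipped when the driver opted into smart
suspend and the device is already runtime-suspended; dev_pm_skip_resume() is
the analogous check on the resume side.)

bool dev_pm_skip_suspend(struct device *dev)
{
	return dev_pm_test_driver_flags(dev, DPM_FLAG_SMART_SUSPEND) &&
		pm_runtime_status_suspended(dev);
}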


[PATCH] drm/radeon: fix UVD suspend error

2022-01-17 Thread Qiang Ma
I hit a bug recently; the kernel log shows:

[  330.171875] radeon :03:00.0: couldn't schedule ib
[  330.175781] [drm:radeon_uvd_suspend [radeon]] *ERROR* Error destroying UVD (-22)!

In the radeon driver, the UVD suspend sequence is as follows:

if (rdev->has_uvd) {
uvd_v1_0_fini(rdev);
radeon_uvd_suspend(rdev);
}

In the radeon_ib_schedule() function we check the 'ring->ready' state,
but uvd_v1_0_fini() has already cleared it, so the UVD destroy message
submitted by radeon_uvd_suspend() is rejected. Just reorder the suspend
code flow to fix the error.

Signed-off-by: Qiang Ma 
---
 drivers/gpu/drm/radeon/cik.c   | 2 +-
 drivers/gpu/drm/radeon/evergreen.c | 2 +-
 drivers/gpu/drm/radeon/ni.c| 2 +-
 drivers/gpu/drm/radeon/r600.c  | 2 +-
 drivers/gpu/drm/radeon/rv770.c | 2 +-
 drivers/gpu/drm/radeon/si.c| 2 +-
 6 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/radeon/cik.c b/drivers/gpu/drm/radeon/cik.c
index 81b4de7be9f2..5819737c21c6 100644
--- a/drivers/gpu/drm/radeon/cik.c
+++ b/drivers/gpu/drm/radeon/cik.c
@@ -8517,8 +8517,8 @@ int cik_suspend(struct radeon_device *rdev)
cik_cp_enable(rdev, false);
cik_sdma_enable(rdev, false);
if (rdev->has_uvd) {
-   uvd_v1_0_fini(rdev);
radeon_uvd_suspend(rdev);
+   uvd_v1_0_fini(rdev);
}
if (rdev->has_vce)
radeon_vce_suspend(rdev);
diff --git a/drivers/gpu/drm/radeon/evergreen.c b/drivers/gpu/drm/radeon/evergreen.c
index eeb590d2dec2..455f8036aa54 100644
--- a/drivers/gpu/drm/radeon/evergreen.c
+++ b/drivers/gpu/drm/radeon/evergreen.c
@@ -5156,8 +5156,8 @@ int evergreen_suspend(struct radeon_device *rdev)
radeon_pm_suspend(rdev);
radeon_audio_fini(rdev);
if (rdev->has_uvd) {
-   uvd_v1_0_fini(rdev);
radeon_uvd_suspend(rdev);
+   uvd_v1_0_fini(rdev);
}
r700_cp_stop(rdev);
r600_dma_stop(rdev);
diff --git a/drivers/gpu/drm/radeon/ni.c b/drivers/gpu/drm/radeon/ni.c
index 4a364ca7a1be..927e5f42e97d 100644
--- a/drivers/gpu/drm/radeon/ni.c
+++ b/drivers/gpu/drm/radeon/ni.c
@@ -2323,8 +2323,8 @@ int cayman_suspend(struct radeon_device *rdev)
cayman_cp_enable(rdev, false);
cayman_dma_stop(rdev);
if (rdev->has_uvd) {
-   uvd_v1_0_fini(rdev);
radeon_uvd_suspend(rdev);
+   uvd_v1_0_fini(rdev);
}
evergreen_irq_suspend(rdev);
radeon_wb_disable(rdev);
diff --git a/drivers/gpu/drm/radeon/r600.c b/drivers/gpu/drm/radeon/r600.c
index ca3fcae2adb5..dd78fc499402 100644
--- a/drivers/gpu/drm/radeon/r600.c
+++ b/drivers/gpu/drm/radeon/r600.c
@@ -3232,8 +3232,8 @@ int r600_suspend(struct radeon_device *rdev)
radeon_audio_fini(rdev);
r600_cp_stop(rdev);
if (rdev->has_uvd) {
-   uvd_v1_0_fini(rdev);
radeon_uvd_suspend(rdev);
+   uvd_v1_0_fini(rdev);
}
r600_irq_suspend(rdev);
radeon_wb_disable(rdev);
diff --git a/drivers/gpu/drm/radeon/rv770.c b/drivers/gpu/drm/radeon/rv770.c
index e592e57be1bb..38796af4fadd 100644
--- a/drivers/gpu/drm/radeon/rv770.c
+++ b/drivers/gpu/drm/radeon/rv770.c
@@ -1894,8 +1894,8 @@ int rv770_suspend(struct radeon_device *rdev)
radeon_pm_suspend(rdev);
radeon_audio_fini(rdev);
if (rdev->has_uvd) {
-   uvd_v1_0_fini(rdev);
radeon_uvd_suspend(rdev);
+   uvd_v1_0_fini(rdev);
}
r700_cp_stop(rdev);
r600_dma_stop(rdev);
diff --git a/drivers/gpu/drm/radeon/si.c b/drivers/gpu/drm/radeon/si.c
index 013e44ed0f39..8d5e4b25609d 100644
--- a/drivers/gpu/drm/radeon/si.c
+++ b/drivers/gpu/drm/radeon/si.c
@@ -6800,8 +6800,8 @@ int si_suspend(struct radeon_device *rdev)
si_cp_enable(rdev, false);
cayman_dma_stop(rdev);
if (rdev->has_uvd) {
-   uvd_v1_0_fini(rdev);
radeon_uvd_suspend(rdev);
+   uvd_v1_0_fini(rdev);
}
if (rdev->has_vce)
radeon_vce_suspend(rdev);
-- 
2.20.1
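
(For reference, the check that produces the quoted "couldn't schedule ib"
message sits at the top of radeon_ib_schedule(), abridged here from memory;
see drivers/gpu/drm/radeon/radeon_ib.c. This is why the UVD destroy message
has to be submitted while ring->ready is still set:)

int radeon_ib_schedule(struct radeon_device *rdev, struct radeon_ib *ib,
		       struct radeon_ib *const_ib, bool hdp_flush)
{
	struct radeon_ring *ring = &rdev->ring[ib->ring];

	/* once uvd_v1_0_fini() clears ring->ready, the destroy message
	 * sent by radeon_uvd_suspend() is rejected here with -EINVAL */
	if (!ib->length_dw || !ring->ready) {
		dev_err(rdev->dev, "couldn't schedule ib\n");
		return -EINVAL;
	}

	/* ... ring lock, IB emit, fence emit elided ... */
	return 0;
}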