Re: [PATCH v3 00/14] drm: Add a driver for CSF-based Mali GPUs

2023-12-09 Thread Tatsuyuki Ishi
> On Dec 5, 2023, at 2:32, Boris Brezillon  
> wrote:
> 
> Hello,
> 
> This is the 3rd version of the kernel driver for Mali CSF-based GPUs.
> 
> With all the DRM dependencies being merged (drm-sched single entity and
> drm_gpuvm), I thought now was a good time to post a new version. Note
> that the iommu series we depend on [1] has been merged recently. The
> only remaining dependency that hasn't been merged yet is this rather
> trivial drm_gpuvm [2] patch.
> 
> As for v2, I pushed a branch based on drm-misc-next and containing
> all the dependencies that are not yet available in drm-misc-next
> here[3], and another [4] containing extra patches to have things
> working on rk3588. The CSF firmware binary can be found here[5], and
> should be placed under /lib/firmware/arm/mali/arch10.8/mali_csffw.bin.
> 
> The mesa MR adding v10 support on top of panthor is available here [6].
> 
> Regarding the GPL2+MIT relicensing, Liviu did an audit and found two
> more people that I didn't spot initially: Clément Péron for the devfreq
> code, and Alexey Sheplyakov for some bits in panthor_gpu.c. Both are
> Cc-ed on the relevant patches. The rest of the code is either new, or
> covered by the Linaro, Arm and Collabora acks.
> 
> And here is a non-exhaustive changelog, check each commit for a detailed
> changelog.
> 
> v3:
> - Quite a few changes at the MMU/sched level to fix some
>  race conditions and deadlocks
> - Addition of a sync-only VM_BIND operation (to support
>  vkQueueBindSparse with zero commands).

Hi Boris,

Just wanted to point out that vkQueueBindSparse's semantics are rather
different from vkQueueSubmit's when it comes to synchronization.  In short,
vkQueueBindSparse does not operate on a particular timeline (aka scheduling
queue / drm_sched_entity).  The property of following timeline order is known
as “submission order” [1] in Vulkan, and it applies only to vkQueueSubmit, not
to vkQueueBindSparse.

Hence, an implementation that takes full advantage of Vulkan semantics would
essentially have an unbounded number of VM_BIND contexts.  It would also not
need sync-only VM_BIND submissions, assuming that drmSyncobjTransfer works.
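
To make that concrete, here is a minimal sketch (not RADV’s actual code):
every ordering constraint on a sparse bind is carried by the semaphores in
VkBindSparseInfo, so the kernel only ever needs per-bind in/out fences, never
a queue-global timeline.

#include <vulkan/vulkan.h>

/* All synchronization is explicit in the pWaitSemaphores /
 * pSignalSemaphores arrays; there is no submission-order relation to
 * vkQueueSubmit work on the same VkQueue. */
static void bind_one_range(VkQueue queue, VkBuffer sparse_buffer,
			   VkDeviceMemory mem, VkDeviceSize size,
			   VkSemaphore wait_sem, VkSemaphore signal_sem)
{
	VkSparseMemoryBind bind = {
		.resourceOffset = 0,
		.size = size,
		.memory = mem,	/* VK_NULL_HANDLE here would make this an unbind */
		.memoryOffset = 0,
	};
	VkSparseBufferMemoryBindInfo buffer_bind = {
		.buffer = sparse_buffer,
		.bindCount = 1,
		.pBinds = &bind,
	};
	VkBindSparseInfo info = {
		.sType = VK_STRUCTURE_TYPE_BIND_SPARSE_INFO,
		.waitSemaphoreCount = 1,
		.pWaitSemaphores = &wait_sem,	  /* the only "wait" this bind has */
		.bufferBindCount = 1,
		.pBufferBinds = &buffer_bind,
		.signalSemaphoreCount = 1,
		.pSignalSemaphores = &signal_sem, /* the only completion signal */
	};
	vkQueueBindSparse(queue, 1, &info, VK_NULL_HANDLE);
}

Since the fences are explicit here, a driver can route payloads between binds
with drmSyncobjTransfer() on the syncobjs backing the semaphores, which is
what would make a sync-only submission unnecessary in principle.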

I’m not saying that the driver should always be implemented that way; in
particular, app developers are also confused by these semantics, and native
Vulkan games can get them terribly wrong [2].  D3D12 has serializing semantics
for UpdateTileMappings, so for D3D12 emulation a single timeline is also fine.

Tatsuyuki.

[1]: 
https://registry.khronos.org/vulkan/specs/1.3-extensions/html/vkspec.html#synchronization-submission-order
[2]: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10234

> - Addition of a VM_GET_STATE ioctl
> - Various cosmetic changes (see the commit changelogs for more details)
> - Various fixes (see the commit changelogs for more details)
> 
> v2:
> - Rename the driver (pancsf -> panthor)
> - Split the commit adding the driver to ease review
> - Use drm_sched for dependency tracking/job submission
> - Add a VM_BIND ioctl
> - Add the concept of exclusive VM for BOs that are only ever mapped to a
>  single VM
> - Document the code and uAPI
> - Add a DT binding doc
> 
> Regards,
> 
> Boris
> 
> [1]https://lore.kernel.org/linux-arm-kernel/20231124142434.1577550-1-boris.brezil...@collabora.com/T/
> [2]https://patchwork.freedesktop.org/patch/570380/
> [3]https://gitlab.freedesktop.org/panfrost/linux/-/tree/panthor-v3
> [4]https://gitlab.freedesktop.org/panfrost/linux/-/tree/panthor-v3+rk3588
> [5]https://gitlab.com/firefly-linux/external/libmali/-/raw/firefly/firmware/g610/mali_csffw.bin
> [6]https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26358
> 
> Boris Brezillon (13):
>  drm/panthor: Add uAPI
>  drm/panthor: Add GPU register definitions
>  drm/panthor: Add the device logical block
>  drm/panthor: Add the GPU logical block
>  drm/panthor: Add GEM logical block
>  drm/panthor: Add the devfreq logical block
>  drm/panthor: Add the MMU/VM logical block
>  drm/panthor: Add the FW logical block
>  drm/panthor: Add the heap logical block
>  drm/panthor: Add the scheduler logical block
>  drm/panthor: Add the driver frontend block
>  drm/panthor: Allow driver compilation
>  drm/panthor: Add an entry to MAINTAINERS
> 
> Liviu Dudau (1):
>  dt-bindings: gpu: mali-valhall-csf: Add support for Arm Mali CSF GPUs
> 
> .../bindings/gpu/arm,mali-valhall-csf.yaml|  147 +
> Documentation/gpu/driver-uapi.rst |5 +
> MAINTAINERS   |   11 +
> drivers/gpu/drm/Kconfig   |2 +
> drivers/gpu/drm/Makefile  |1 +
> drivers/gpu/drm/panthor/Kconfig   |   23 +
> drivers/gpu/drm/panthor/Makefile  |   15 +
> drivers/gpu/drm/panthor/panthor_devfreq.c |  283 ++
> drivers/gpu/drm/panthor/panthor_devfreq.h |   25 +
> drivers/gpu/drm/panthor/panthor_device.c  |  497 +++
> drivers/gpu/drm/panthor/panthor_device.h  |  381 ++
> drivers/gpu/drm/panthor/panthor_drv

Re: [PATCH v2 2/3] drm/amdgpu: Add flag to disable implicit sync for GEM operations.

2023-11-06 Thread Tatsuyuki Ishi
> On Nov 6, 2023, at 22:44, Christian König  wrote:
> 
>  On 02.11.23 at 15:04, Tatsuyuki Ishi wrote:
>> In Vulkan, it is the application's responsibility to perform adequate
>> synchronization before a sparse unmap, replace or BO destroy operation.
>> Until now, the kernel applied the same rule as implicitly-synchronized
>> APIs like OpenGL, which with per-VM BOs made page table updates stall the
>> queue completely. The newly added AMDGPU_VM_EXPLICIT_SYNC flag allows
>> drivers to opt-out of this behavior, while still ensuring adequate implicit
>> sync happens for kernel-initiated updates (e.g. BO moves).
>> 
>> We record whether to use implicit sync or not for each freed mapping. To
>> avoid increasing the mapping struct's size, this is union-ized with the
>> interval tree field which is unused after the unmap.
>> 
>> The reason this is done with a GEM ioctl flag, instead of being a VM /
>> context global setting, is that the current libdrm implementation shares
>> the DRM handle even between different kind of drivers (radeonsi vs radv).
> 
> It would be nice if we could make this more future-proof by not using a flag, 
> but rather a drm_syncobj.

There is asynchronous VM_BIND and synchronous VM_BIND. Using syncobjs
addresses asynchronous binds, but what this patch set adds is an explicitly
synced synchronous bind.

Even within Vulkan, there are use cases for synchronous binds, namely when a
non-sparse BO is destroyed (or created, though creation needs no
synchronization). Such operations should still be explicitly synced, unlike in
OpenGL where they sync against previous submissions. So adding asynchronous
bind doesn’t supersede this need.

I’ve also considered whether we can just make the unmap asynchronous, since
the spec requires that destroyed objects are not accessed in any way, but I
think it would complicate behavior when the destruction of the BO immediately
follows.

We should implement asynchronous bind someday to make vkQueueBindSparse work 
(even) better, but that will likely involve a larger scope including the 
scheduler. Getting synchronous but explicitly synced binds should be simpler 
and a good incremental step.
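
For reference, my reading of your suggestion quoted below is roughly the
following uapi shape (a hedged sketch; the appended field names are made up
and not part of this series):

struct drm_amdgpu_gem_va {
	__u32 handle;
	__u32 _pad;
	__u32 operation;	/* AMDGPU_VA_OP_* */
	__u32 flags;		/* AMDGPU_VM_* */
	__u64 va_address;
	__u64 offset_in_bo;
	__u64 map_size;
	/* Appended for an asynchronous bind (hypothetical): */
	__u32 in_syncobj;	/* wait on this before applying the op */
	__u32 _pad2;
	__u64 in_syncobj_point;	/* timeline point, 0 for a binary syncobj */
};

That would cover the asynchronous case; this series covers the synchronous,
explicitly synced one.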

Tatsuyuki.

> You can extend the drm_amdgpu_gem_va structure by adding a drm_syncobj handle 
> and timeline point at the end.
> 
> If the syncobj/timeline point results in a fence we give that as input 
> dependency the operation has to wait for.
> 
> And output fence can come later on as well, but that one is much more harder 
> to handle.
> 
> Regards,
> Christian.
> 
>> 
>> Signed-off-by: Tatsuyuki Ishi 
>> ---
>>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |  2 +-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c   |  2 +-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c   | 14 --
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h|  7 ++-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h |  6 ++-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c| 47 +++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h| 23 +
>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c  | 18 +++
>>  include/uapi/drm/amdgpu_drm.h |  2 +
>>  9 files changed, 71 insertions(+), 50 deletions(-)
>> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index 7d6daf8d2bfa..10e129bff977 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -1196,7 +1196,7 @@ static void unmap_bo_from_gpuvm(struct kgd_mem *mem,
>>  struct amdgpu_device *adev = entry->adev;
>>  struct amdgpu_vm *vm = bo_va->base.vm;
>>  -   amdgpu_vm_bo_unmap(adev, bo_va, entry->va);
>> +amdgpu_vm_bo_unmap(adev, bo_va, entry->va, true);
>>  amdgpu_vm_clear_freed(adev, vm, &bo_va->last_pt_update);
>>  diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
>> index 720011019741..612279e65bff 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
>> @@ -122,7 +122,7 @@ int amdgpu_unmap_static_csa(struct amdgpu_device *adev, 
>> struct amdgpu_vm *vm,
>>  }
>>  }
>>  -   r = amdgpu_vm_bo_unmap(adev, bo_va, csa_addr);
>> +r = amdgpu_vm_bo_unmap(adev, bo_va, csa_addr, true);
>>  if (r) {
>>  DRM_ERROR("failed to do bo_unmap on static CSA, err=%d\n", r);
>>  goto error;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> index a1b15d0d6c48..cca68b

Re: [PATCH 3/6] drm/amdgpu: Flush VM updates for split bindings eagerly.

2023-11-05 Thread Tatsuyuki Ishi

> On Oct 31, 2023, at 23:07, Christian König  wrote:
> 
> Am 31.10.23 um 14:59 schrieb Bas Nieuwenhuizen:
>> 
>> 
>> On Tue, Oct 31, 2023 at 2:57 PM Christian König  wrote:
>>> On 31.10.23 at 14:40, Tatsuyuki Ishi wrote:
>>> > The current amdgpu_gem_va_update_vm only tries to perform updates for the
>>> > BO specified in the GEM ioctl; however, when a binding is split, the
>>> > adjacent bindings also need to be updated. Such updates currently end up
>>> > getting deferred until the next submission, which causes stalls.
>>> 
>>> Yeah, that is a necessity. The hardware simply doesn't support what you 
>>> try to do here in all cases.
>> 
>> What can the hardware not do here? Is this just needing to wait for TLB 
>> flushes before we can free pagetables, can we just delay that?
> 
> On some hardware generations (especially Navi1x, but also everything older 
> than Polaris) you can't invalidate the TLB while it is in use.
> 
> For Polaris and older it just means that you don't have a guarantee that the 
> shader can't access the memory any more. So delaying the free operation helps 
> here.
> 
> But for Navi1x it's a workaround for a hardware bug. If you try to invalidate 
> the TLB while it is in use you can potentially trigger memory accesses to 
> random addresses.
> 
> That's why we still delay TLB invalidations to the next CS and use a new 
> VMID for each submission instead of invalidating the old one.

Thanks for the information. I looked into the VMID allocation logic and I see 
that if concurrent_flush is false, then we defer any flush (or VMID reuse that 
requires a flush) until that VMID is idle.

What patch #3 ends up doing is just performing the PT update right away. Any
required TLB flush is deferred by amdgpu_vm_update_range through the increment
of tlb_seq, and amdgpu_vmid.c is responsible for doing the actual TLB flush in
a manner that does not trigger the bug.
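
Roughly, the mechanism as I read the current code (paraphrased, not verbatim):

/* amdgpu_vm_update_range(): the PT update itself is not delayed; once
 * it completes, vm->tlb_seq is bumped to record that a flush is owed. */
atomic64_inc(&vm->tlb_seq);

/* amdgpu_vmid.c at the next CS: a VMID whose flushed_updates lags
 * vm->tlb_seq needs a flush before reuse, and with !concurrent_flush
 * such a VMID is only picked once it is idle, so the TLB is never
 * invalidated while in use. */
if (amdgpu_vm_tlb_seq(vm) != id->flushed_updates)
	needs_flush = true;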

Can you confirm that this would be fine for the current hardware?

Tatsuyuki.

> 
> I'm currently working on changing that for Navi2x and newer (maybe Vega as 
> well), but this is something you can really only do on some hw generations 
> after validating that it works.
> 
> Regards,
> Christian. 
> 
>>  
>>> 
>>> So this approach won't work in general.
>>> 
>>> Regards,
>>> Christian.
>>> 
>>> >
>>> > Introduce a new state "dirty", shared between per-VM BOs and traditional
>>> > BOs, containing all BOs that have pending updates in `invalids`.
>>> > amdgpu_gem_va_update_vm will now simply flush any pending updates for BOs
>>> > in the dirty state.
>>> >
>>> > Signed-off-by: Tatsuyuki Ishi <ishitatsuy...@gmail.com>
>>> > ---
>>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 18 ---
>>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  | 66 ++---
>>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h  |  3 ++
>>> >   3 files changed, 63 insertions(+), 24 deletions(-)
>>> >
>>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c 
>>> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> > index a1b15d0d6c48..01d3a97248b0 100644
>>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> > @@ -604,10 +604,9 @@ int amdgpu_gem_metadata_ioctl(struct drm_device 
>>> > *dev, void *data,
>>> >* vital here, so they are not reported back to userspace.
>>> >*/
>>> >   static void amdgpu_gem_va_update_vm(struct amdgpu_device *adev,
>>> > - struct amdgpu_vm *vm,
>>> > - struct amdgpu_bo_va *bo_va,
>>> > - uint32_t operation)
>>> > + struct amdgpu_vm *vm)
>>> >   {
>>> > + struct amdgpu_bo_va *bo_va;
>>> >   int r;
>>> >   
>>> >   if (!amdgpu_vm_ready(vm))
>>> > @@ -617,12 +616,18 @@ static void amdgpu_gem_va_update_vm(struct 
>>> > amdgpu_device *adev,
>>> >   if (r)
>>> >   goto error;
>>> >   
>>> > - if (operation == AMDGPU_VA_OP_MAP ||
>>> > - operation == AMDGPU_VA_OP_REPLACE) {
>>> > + spin_lock(&vm->status_lock);
>>> > +  

[PATCH v2 3/3] drm/amdgpu: Bump amdgpu driver version.

2023-11-02 Thread Tatsuyuki Ishi
For detection of the new explicit sync functionality without having to try
the ioctl.

Signed-off-by: Tatsuyuki Ishi 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 81edf66dbea8..2aa406dee192 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -113,9 +113,10 @@
 *gl1c_cache_size, gl2c_cache_size, mall_size, enabled_rb_pipes_mask_hi
  *   3.53.0 - Support for GFX11 CP GFX shadowing
  *   3.54.0 - Add AMDGPU_CTX_QUERY2_FLAGS_RESET_IN_PROGRESS support
+ * - 3.55.0 - Add AMDGPU_VM_EXPLICIT_SYNC flag for GEM operations.
  */
 #define KMS_DRIVER_MAJOR   3
-#define KMS_DRIVER_MINOR   54
+#define KMS_DRIVER_MINOR   55
 #define KMS_DRIVER_PATCHLEVEL  0
 
 unsigned int amdgpu_vram_limit = UINT_MAX;
-- 
2.42.0
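
For the userspace side, this makes feature detection a plain version compare
(a sketch using libdrm; the helper name is made up):

#include <stdbool.h>
#include <xf86drm.h>

static bool has_vm_explicit_sync(int fd)
{
	drmVersionPtr ver = drmGetVersion(fd);
	bool ok;

	if (!ver)
		return false;
	/* AMDGPU_VM_EXPLICIT_SYNC is available from amdgpu 3.55 onward. */
	ok = ver->version_major > 3 ||
	     (ver->version_major == 3 && ver->version_minor >= 55);
	drmFreeVersion(ver);
	return ok;
}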



[PATCH v2 0/3] drm/amdgpu: Add flag to disable implicit sync for GEM operations.

2023-11-02 Thread Tatsuyuki Ishi
In Vulkan, it is the application's responsibility to perform adequate
synchronization before a sparse unmap, replace or BO destroy operation.
This adds an option to the AMDGPU_VA_OPs to disable the redundant implicit
sync that happens on sparse unmap or replace operations.
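
As a sketch of the intended use from userspace (hedged: the flag value comes
from the uapi patch, and the real libdrm/Mesa plumbing lives in [2]):

#include <stdint.h>
#include <amdgpu.h>
#include <amdgpu_drm.h>

/* Explicitly synced sparse unmap: the app has already guaranteed that the
 * GPU is done with this range, so don't sync against prior submissions. */
static int sparse_unmap_explicit(amdgpu_device_handle dev, amdgpu_bo_handle bo,
				 uint64_t va, uint64_t size)
{
	return amdgpu_bo_va_op_raw(dev, bo, 0 /* offset */, size, va,
				   AMDGPU_VM_EXPLICIT_SYNC,
				   AMDGPU_VA_OP_UNMAP);
}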

This yields a significant improvement in stutter in Forza Horizon 5
and Forza Horizon 4 (games that had significant sparse-binding-related
stutter).

Compared to the previous series [1], this specifically targets the VM
operations and keeps everything else intact, including implicit sync on
kernel-initiated moves.

I've been able to pass a full Vulkan CTS run on Navi 10 with this.

Userspace code for this is available at [2] and a branch for the kernel
code is available at [3].

v2 changes:
- Drop the changes to flush split bindings eagerly as its incompatible
  with TLB flush quirks in current hardware. Drop the refactoring
  commits related to that change too.
- Fixed a missing doc warning.
- Removed an accidentally included ioctl change.

[1]: 
https://lore.kernel.org/all/20230821062005.109771-1-ishitatsuy...@gmail.com/
[2]: 
https://gitlab.freedesktop.org/ishitatsuyuki/mesa/-/commits/vm-explicit-sync
[3]: https://github.com/ishitatsuyuki/linux/tree/explicit-sync-drm-misc-next

Tatsuyuki Ishi (3):
  drm/amdgpu: Don't implicit sync PRT maps.
  drm/amdgpu: Add flag to disable implicit sync for GEM operations.
  drm/amdgpu: Bump amdgpu driver version.

 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c   |  3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c   | 14 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h|  7 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h |  6 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c| 47 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h| 23 +
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c  | 18 +++
 include/uapi/drm/amdgpu_drm.h |  2 +
 10 files changed, 73 insertions(+), 51 deletions(-)

-- 
2.42.0



[PATCH v2 2/3] drm/amdgpu: Add flag to disable implicit sync for GEM operations.

2023-11-02 Thread Tatsuyuki Ishi
In Vulkan, it is the application's responsibility to perform adequate
synchronization before a sparse unmap, replace or BO destroy operation.
Until now, the kernel applied the same rule as implicitly-synchronized
APIs like OpenGL, which with per-VM BOs made page table updates stall the
queue completely. The newly added AMDGPU_VM_EXPLICIT_SYNC flag allows
drivers to opt-out of this behavior, while still ensuring adequate implicit
sync happens for kernel-initiated updates (e.g. BO moves).

We record whether to use implicit sync or not for each freed mapping. To
avoid increasing the mapping struct's size, this is union-ized with the
interval tree field which is unused after the unmap.

The reason this is done with a GEM ioctl flag, instead of being a VM /
context global setting, is that the current libdrm implementation shares
the DRM handle even between different kind of drivers (radeonsi vs radv).

Signed-off-by: Tatsuyuki Ishi 
---
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c   | 14 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h|  7 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h |  6 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c| 47 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h| 23 +
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c  | 18 +++
 include/uapi/drm/amdgpu_drm.h |  2 +
 9 files changed, 71 insertions(+), 50 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 7d6daf8d2bfa..10e129bff977 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1196,7 +1196,7 @@ static void unmap_bo_from_gpuvm(struct kgd_mem *mem,
struct amdgpu_device *adev = entry->adev;
struct amdgpu_vm *vm = bo_va->base.vm;
 
-   amdgpu_vm_bo_unmap(adev, bo_va, entry->va);
+   amdgpu_vm_bo_unmap(adev, bo_va, entry->va, true);
 
amdgpu_vm_clear_freed(adev, vm, &bo_va->last_pt_update);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
index 720011019741..612279e65bff 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
@@ -122,7 +122,7 @@ int amdgpu_unmap_static_csa(struct amdgpu_device *adev, 
struct amdgpu_vm *vm,
}
}
 
-   r = amdgpu_vm_bo_unmap(adev, bo_va, csa_addr);
+   r = amdgpu_vm_bo_unmap(adev, bo_va, csa_addr, true);
if (r) {
DRM_ERROR("failed to do bo_unmap on static CSA, err=%d\n", r);
goto error;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index a1b15d0d6c48..cca68b89754e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -667,9 +667,9 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
const uint32_t valid_flags = AMDGPU_VM_DELAY_UPDATE |
AMDGPU_VM_PAGE_READABLE | AMDGPU_VM_PAGE_WRITEABLE |
AMDGPU_VM_PAGE_EXECUTABLE | AMDGPU_VM_MTYPE_MASK |
-   AMDGPU_VM_PAGE_NOALLOC;
+   AMDGPU_VM_PAGE_NOALLOC | AMDGPU_VM_EXPLICIT_SYNC;
const uint32_t prt_flags = AMDGPU_VM_DELAY_UPDATE |
-   AMDGPU_VM_PAGE_PRT;
+   AMDGPU_VM_PAGE_PRT | AMDGPU_VM_EXPLICIT_SYNC;
 
struct drm_amdgpu_gem_va *args = data;
struct drm_gem_object *gobj;
@@ -680,6 +680,7 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
struct drm_exec exec;
uint64_t va_flags;
uint64_t vm_size;
+   bool sync_unmap;
int r = 0;
 
if (args->va_address < AMDGPU_VA_RESERVED_SIZE) {
@@ -715,6 +716,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
return -EINVAL;
}
 
+   sync_unmap = !(args->flags & AMDGPU_VM_EXPLICIT_SYNC);
+
switch (args->operation) {
case AMDGPU_VA_OP_MAP:
case AMDGPU_VA_OP_UNMAP:
@@ -774,19 +777,20 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void 
*data,
 va_flags);
break;
case AMDGPU_VA_OP_UNMAP:
-   r = amdgpu_vm_bo_unmap(adev, bo_va, args->va_address);
+   r = amdgpu_vm_bo_unmap(adev, bo_va, args->va_address,
+  sync_unmap);
break;
 
case AMDGPU_VA_OP_CLEAR:
r = amdgpu_vm_bo_clear_mappings(adev, &fpriv->vm,
args->va_address,
-   args->map_size);
+   args->map_size, sync_unmap);
break;
case AMDGPU_VA_OP_
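
The union described in the commit message looks roughly like this (a hedged
sketch; the real layout is in the amdgpu_object.h hunk of this patch):

struct amdgpu_bo_va_mapping {
	struct list_head	list;
	union {
		/* Interval tree node; dead once the mapping is unmapped. */
		struct rb_node	rb;
		/* Reuses that storage: whether the freed mapping should
		 * still be implicitly synced. */
		bool		sync_unmap;
	};
	uint64_t		start;
	uint64_t		last;
	uint64_t		offset;
	uint64_t		flags;
};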

[PATCH v2 1/3] drm/amdgpu: Don't implicit sync PRT maps.

2023-11-02 Thread Tatsuyuki Ishi
These are considered map operations rather than unmaps, and there is no
point in doing implicit synchronization here.

Signed-off-by: Tatsuyuki Ishi 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index f5daadcec865..7b9762f1cddd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -902,7 +902,7 @@ int amdgpu_vm_update_range(struct amdgpu_device *adev, 
struct amdgpu_vm *vm,
/* Implicitly sync to command submissions in the same VM before
 * unmapping. Sync to moving fences before mapping.
 */
-   if (!(flags & AMDGPU_PTE_VALID))
+   if (!(flags & (AMDGPU_PTE_VALID | AMDGPU_PTE_PRT)))
sync_mode = AMDGPU_SYNC_EQ_OWNER;
else
sync_mode = AMDGPU_SYNC_EXPLICIT;
-- 
2.42.0



Re: [PATCH 3/6] drm/amdgpu: Flush VM updates for split bindings eagerly.

2023-10-31 Thread Tatsuyuki Ishi



> On Oct 31, 2023, at 23:17, Bas Nieuwenhuizen  wrote:
> 
> 
> 
> On Tue, Oct 31, 2023 at 3:08 PM Christian König  
> wrote:
> On 31.10.23 at 14:59, Bas Nieuwenhuizen wrote:
>> 
>> 
>> On Tue, Oct 31, 2023 at 2:57 PM Christian König  
>> wrote:
>> On 31.10.23 at 14:40, Tatsuyuki Ishi wrote:
>> > The current amdgpu_gem_va_update_vm only tries to perform updates for the
>> > BO specified in the GEM ioctl; however, when a binding is split, the
>> > adjacent bindings also need to be updated. Such updates currently end up
>> > getting deferred until the next submission, which causes stalls.
>> 
>> Yeah, that is a necessity. The hardware simply doesn't support what you 
>> try to do here in all cases.
>> 
>> What can the hardware not do here? Is this just needing to wait for TLB 
>> flushes before we can free pagetables, can we just delay that?
> 
> On some hardware generations (especially Navi1x, but also everything older 
> than Polaris) you can't invalidate the TLB while it is in use.
> 
> For Polaris and older it just means that you don't have a guarantee that the 
> shader can't access the memory any more. So delaying the free operation helps 
> here.
> 
> But for Navi1x it's a workaround for a hardware bug. If you try to invalidate 
> the TLB while it is in use you can potentially trigger memory accesses to 
> random addresses.
> 
> That's why we still delay TLB invalidations to the next CS and use a new 
> VMID for each submission instead of invalidating the old one.
> 
> I'm currently working on changing that for Navi2x and newer (maybe Vega as 
> well), but this is something you can really only do on some hw generations 
> after validating that it works.
> 
> I think as long as we make sure all significant work gets done 
> asynchronously, doing the TLB flushing on the next submit (/submissions, one 
> per queue?) is fine for our purposes.

For a bit more context, the performance / frame timing in Forza with just 
patch 5 wasn’t quite right. As Bas said, ideally we want to perform all the PT 
updates right away, and only defer the TLB flush.

For now, the state machine part of this patch doesn’t seem to be going in the
right direction, so I’ll consider dropping this change.

Tatsuyuki.

> 
> (As an aside after thinking some more I *think* we also need some work to 
> make these maps/unmaps (VALID->PRT and PRT->VALID) atomic, as I think it is 
> valid Vulkan to make these race. As such I'm speculating we'd need a bit more 
> reworking there too, not just a late free of the lower level pagetables)
> 
> - Bas 
> 
> Regards,
> Christian. 
> 
>>  
>> 
>> So this approach won't work in general.
>> 
>> Regards,
>> Christian.
>> 
>> >
>> > Introduce a new state "dirty", shared between per-VM BOs and traditional
>> > BOs, containing all BOs that have pending updates in `invalids`.
>> > amdgpu_gem_va_update_vm will now simply flush any pending updates for BOs
>> > in the dirty state.
>> >
>> > Signed-off-by: Tatsuyuki Ishi 
>> > ---
>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 18 ---
>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  | 66 ++---
>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h  |  3 ++
>> >   3 files changed, 63 insertions(+), 24 deletions(-)
>> >
>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c 
>> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> > index a1b15d0d6c48..01d3a97248b0 100644
>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> > @@ -604,10 +604,9 @@ int amdgpu_gem_metadata_ioctl(struct drm_device *dev, 
>> > void *data,
>> >* vital here, so they are not reported back to userspace.
>> >*/
>> >   static void amdgpu_gem_va_update_vm(struct amdgpu_device *adev,
>> > - struct amdgpu_vm *vm,
>> > - struct amdgpu_bo_va *bo_va,
>> > - uint32_t operation)
>> > + struct amdgpu_vm *vm)
>> >   {
>> > + struct amdgpu_bo_va *bo_va;
>> >   int r;
>> >   
>> >   if (!amdgpu_vm_ready(vm))
>> > @@ -617,12 +616,18 @@ static void amdgpu_gem_va_update_vm(struct 
>> > amdgpu_device *adev,
>> >   if (r)
>> >   goto error;
>> >   
>> > - if (operation == AMDGPU_VA_OP_MAP ||
>> >

Re: [PATCH 2/6] drm/amdgpu: Separate eviction from VM status.

2023-10-31 Thread Tatsuyuki Ishi



> On Oct 31, 2023, at 22:55, Christian König  wrote:
> 
> On 31.10.23 at 14:40, Tatsuyuki Ishi wrote:
>> In short, eviction never really belonged to the vm_status state machine.
> 
> I strongly disagree to that.
> 
>> Even when evicted, the BO could belong to either the moved or done state.
>> The "evicted" state needed to handle both cases, causing greater confusion.
>> 
>> Additionally, there were inconsistencies in the definition of an evicted
>> BO. Some places based it on the `evict` parameter passed from the TTM move
>> callback, while others updated it based on whether the BO got its
>> optimal placement. The second is more accurate for our use case. With this
>> refactor, the evicted state is solely determined by the second rule.
> 
> That strongly sounds like you don't understand what the evicted state is good 
> for.
> 
> The evicted state is for page directories, page tables and per VM BOs which 
> needs to move around before doing the next CS.
> 
> Please further explain what you try to do here.

This is mainly an attempt to address an inconsistency in the definition of
“eviction”. The TTM move callback sets the evicted state when eviction happens
through ttm_bo_evict. That is, however, not the only way a BO might end up
outside its preferred domains.

amdgpu_vm_bo_update later updates the eviction state based on whether the BO
is in its preferred domains. In my understanding this covers all cases where
the BO is evicted through ttm_bo_evict. Therefore we should apply this
definition right from the move callback, not only after amdgpu_vm_bo_update
has been called at least once.
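
Concretely, the single definition this series converges on is the
preferred-domain check (a sketch; the helper name is made up):

static void update_eviction_state(struct amdgpu_vm_bo_base *vm_bo)
{
	struct amdgpu_bo *bo = vm_bo->bo;
	uint32_t mem_type = bo->tbo.resource->mem_type;

	/* Evicted iff the current placement is outside the preferred
	 * domains, regardless of how the BO got there (ttm_bo_evict()
	 * or any other path). */
	amdgpu_vm_bo_set_evicted(vm_bo,
				 !(bo->preferred_domains &
				   amdgpu_mem_type_to_domain(mem_type)));
}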

Tatsuyuki.

> 
> Regards,
> Christian.
> 
>> 
>> Signed-off-by: Tatsuyuki Ishi 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c| 67 +--
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h|  1 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c |  1 +
>>  3 files changed, 29 insertions(+), 40 deletions(-)
>> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index 7b9762f1cddd..dd6f72e2a1d6 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -174,19 +174,23 @@ int amdgpu_vm_set_pasid(struct amdgpu_device *adev, 
>> struct amdgpu_vm *vm,
>>   * State for PDs/PTs and per VM BOs which are not at the location they 
>> should
>>   * be.
>>   */
>> -static void amdgpu_vm_bo_evicted(struct amdgpu_vm_bo_base *vm_bo)
>> +static void amdgpu_vm_bo_set_evicted(struct amdgpu_vm_bo_base *vm_bo, bool 
>> evicted)
>>  {
>>  struct amdgpu_vm *vm = vm_bo->vm;
>>  struct amdgpu_bo *bo = vm_bo->bo;
>>  -   vm_bo->moved = true;
>>  spin_lock(&vm_bo->vm->status_lock);
>> -if (bo->tbo.type == ttm_bo_type_kernel)
>> -list_move(&vm_bo->vm_status, &vm->evicted);
>> -else
>> -list_move_tail(&vm_bo->vm_status, &vm->evicted);
>> +if (evicted && bo->tbo.base.resv == vm->root.bo->tbo.base.resv) {
>> +if (bo->tbo.type == ttm_bo_type_kernel)
>> +list_move(&vm_bo->eviction_status, &vm->evicted);
>> +else
>> +list_move_tail(&vm_bo->eviction_status, &vm->evicted);
>> +} else {
>> +list_del_init(&vm_bo->eviction_status);
>> +}
>>  spin_unlock(&vm_bo->vm->status_lock);
>>  }
>> +
>>  /**
>>   * amdgpu_vm_bo_moved - vm_bo is moved
>>   *
>> @@ -310,6 +314,7 @@ void amdgpu_vm_bo_base_init(struct amdgpu_vm_bo_base 
>> *base,
>>  base->bo = bo;
>>  base->next = NULL;
>>  INIT_LIST_HEAD(&base->vm_status);
>> +INIT_LIST_HEAD(&base->eviction_status);
>>  if (!bo)
>>  return;
>> @@ -336,7 +341,7 @@ void amdgpu_vm_bo_base_init(struct amdgpu_vm_bo_base 
>> *base,
>>   * is currently evicted. add the bo to the evicted list to make sure it
>>   * is validated on next vm use to avoid fault.
>>   * */
>> -amdgpu_vm_bo_evicted(base);
>> +amdgpu_vm_bo_set_evicted(base, true);
>>  }
>>/**
>> @@ -460,7 +465,7 @@ int amdgpu_vm_validate_pt_bos(struct amdgpu_device 
>> *adev, struct amdgpu_vm *vm,
>>  while (!list_empty(&vm->evicted)) {
>>  bo_base = list_first_entry(&vm->evicted,
>> struct amdgpu_vm_bo_base,
>> -

[PATCH 3/6] drm/amdgpu: Flush VM updates for split bindings eagerly.

2023-10-31 Thread Tatsuyuki Ishi
The current amdgpu_gem_va_update_vm only tries to perform updates for the
BO specified in the GEM ioctl; however, when a binding is split, the
adjacent bindings also need to be updated. Such updates currently end up
getting deferred until the next submission, which causes stalls.

Introduce a new state "dirty", shared between per-VM BOs and traditional
BOs, containing all BOs that have pending updates in `invalids`.
amdgpu_gem_va_update_vm will now simply flush any pending updates for BOs
in the dirty state.

Signed-off-by: Tatsuyuki Ishi 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 18 ---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  | 66 ++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h  |  3 ++
 3 files changed, 63 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index a1b15d0d6c48..01d3a97248b0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -604,10 +604,9 @@ int amdgpu_gem_metadata_ioctl(struct drm_device *dev, void 
*data,
  * vital here, so they are not reported back to userspace.
  */
 static void amdgpu_gem_va_update_vm(struct amdgpu_device *adev,
-   struct amdgpu_vm *vm,
-   struct amdgpu_bo_va *bo_va,
-   uint32_t operation)
+   struct amdgpu_vm *vm)
 {
+   struct amdgpu_bo_va *bo_va;
int r;
 
if (!amdgpu_vm_ready(vm))
@@ -617,12 +616,18 @@ static void amdgpu_gem_va_update_vm(struct amdgpu_device 
*adev,
if (r)
goto error;
 
-   if (operation == AMDGPU_VA_OP_MAP ||
-   operation == AMDGPU_VA_OP_REPLACE) {
+   spin_lock(&vm->status_lock);
+   while (!list_empty(&vm->dirty)) {
+   bo_va = list_first_entry(&vm->dirty, struct amdgpu_bo_va,
+base.vm_status);
+   spin_unlock(&vm->status_lock);
+
r = amdgpu_vm_bo_update(adev, bo_va, false);
if (r)
goto error;
+   spin_lock(&vm->status_lock);
}
+   spin_unlock(&vm->status_lock);
 
r = amdgpu_vm_update_pdes(adev, vm, false);
 
@@ -792,8 +797,7 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
break;
}
if (!r && !(args->flags & AMDGPU_VM_DELAY_UPDATE) && !amdgpu_vm_debug)
-   amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
-   args->operation);
+   amdgpu_gem_va_update_vm(adev, &fpriv->vm);
 
 error:
drm_exec_fini(&exec);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index dd6f72e2a1d6..01d31891cd05 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -191,6 +191,21 @@ static void amdgpu_vm_bo_set_evicted(struct 
amdgpu_vm_bo_base *vm_bo, bool evict
spin_unlock(&vm_bo->vm->status_lock);
 }
 
+/**
+ * amdgpu_vm_bo_dirty - vm_bo is dirty
+ *
+ * @vm_bo: vm_bo which is dirty
+ *
+ * State for normal and per VM BOs that are not moved, but have new entries in
+ * bo_va->invalids.
+ */
+static void amdgpu_vm_bo_dirty(struct amdgpu_vm_bo_base *vm_bo)
+{
+   spin_lock(&vm_bo->vm->status_lock);
+   list_move(&vm_bo->vm_status, &vm_bo->vm->dirty);
+   spin_unlock(&vm_bo->vm->status_lock);
+}
+
 /**
  * amdgpu_vm_bo_moved - vm_bo is moved
  *
@@ -1042,6 +1057,9 @@ void amdgpu_vm_get_memory(struct amdgpu_vm *vm,
list_for_each_entry_safe(bo_va, tmp, &vm->evicted, base.eviction_status)
amdgpu_vm_bo_get_memory(bo_va, stats);
 
+   list_for_each_entry_safe(bo_va, tmp, &vm->dirty, base.vm_status)
+   amdgpu_vm_bo_get_memory(bo_va, stats);
+
list_for_each_entry_safe(bo_va, tmp, &vm->relocated, base.vm_status)
amdgpu_vm_bo_get_memory(bo_va, stats);
 
@@ -1411,6 +1429,17 @@ int amdgpu_vm_handle_moved(struct amdgpu_device *adev,
dma_resv_unlock(resv);
spin_lock(&vm->status_lock);
}
+
+   while (!list_empty(&vm->dirty)) {
+   bo_va = list_first_entry(&vm->dirty, struct amdgpu_bo_va,
+base.vm_status);
+   spin_unlock(&vm->status_lock);
+
+   r = amdgpu_vm_bo_update(adev, bo_va, false);
+   if (r)
+   return r;
+   spin_lock(&vm->status_lock);
+   }
spin_unlock(&vm->status_lock);
 
return 0;
@@ -1476,19 +1505,16 @@ static void amdgpu_vm_bo_insert_map(struct 
amdgpu_device *adev,
   

[PATCH 4/6] drm/amdgpu: Remove redundant state change after validation.

2023-10-31 Thread Tatsuyuki Ishi
All the state changes are handled in the TTM move callback; doing it again
here just leads to more confusion.

The table update remains here because it needs to be done exactly once; doing
it in the move callback would result in it getting triggered twice, once by
the actual BO and once by the shadow BO.

Signed-off-by: Tatsuyuki Ishi 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 01d31891cd05..50f7cee639ac 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -495,12 +495,9 @@ int amdgpu_vm_validate_pt_bos(struct amdgpu_device *adev, 
struct amdgpu_vm *vm,
return r;
}
 
-   if (bo->tbo.type != ttm_bo_type_kernel) {
-   amdgpu_vm_bo_moved(bo_base);
-   } else {
+   if (bo->tbo.type == ttm_bo_type_kernel)
vm->update_funcs->map_table(to_amdgpu_bo_vm(bo));
-   amdgpu_vm_bo_relocated(bo_base);
-   }
+
spin_lock(&vm->status_lock);
}
spin_unlock(&vm->status_lock);
-- 
2.42.0



[PATCH 5/6] drm/amdgpu: Add flag to disable implicit sync for GEM operations.

2023-10-31 Thread Tatsuyuki Ishi
In Vulkan, it is the application's responsibility to perform adequate
synchronization before a sparse unmap, replace or BO destroy operation.
Until now, the kernel applied the same rule as implicitly-synchronized
APIs like OpenGL, which with per-VM BOs made page table updates stall the
queue completely. The newly added AMDGPU_VM_EXPLICIT_SYNC flag allows
drivers to opt-out of this behavior, while still ensuring adequate implicit
sync happens for kernel-initiated updates (e.g. BO moves).

We record whether to use implicit sync or not for each freed mapping. To
avoid increasing the mapping struct's size, this is union-ized with the
interval tree field which is unused after the unmap.

The reason this is done with a GEM ioctl flag, instead of being a VM /
context global setting, is that the current libdrm implementation shares
the DRM handle even between different kind of drivers (radeonsi vs radv).

Signed-off-by: Tatsuyuki Ishi 
---
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c   | 14 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h|  7 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h |  6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c| 55 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h| 23 
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c  | 18 +++---
 include/uapi/drm/amdgpu_drm.h |  2 +
 9 files changed, 74 insertions(+), 55 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 7d6daf8d2bfa..10e129bff977 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1196,7 +1196,7 @@ static void unmap_bo_from_gpuvm(struct kgd_mem *mem,
struct amdgpu_device *adev = entry->adev;
struct amdgpu_vm *vm = bo_va->base.vm;
 
-   amdgpu_vm_bo_unmap(adev, bo_va, entry->va);
+   amdgpu_vm_bo_unmap(adev, bo_va, entry->va, true);
 
amdgpu_vm_clear_freed(adev, vm, &bo_va->last_pt_update);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
index 720011019741..612279e65bff 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
@@ -122,7 +122,7 @@ int amdgpu_unmap_static_csa(struct amdgpu_device *adev, 
struct amdgpu_vm *vm,
}
}
 
-   r = amdgpu_vm_bo_unmap(adev, bo_va, csa_addr);
+   r = amdgpu_vm_bo_unmap(adev, bo_va, csa_addr, true);
if (r) {
DRM_ERROR("failed to do bo_unmap on static CSA, err=%d\n", r);
goto error;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index 01d3a97248b0..0d9496a06947 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -672,9 +672,9 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
const uint32_t valid_flags = AMDGPU_VM_DELAY_UPDATE |
AMDGPU_VM_PAGE_READABLE | AMDGPU_VM_PAGE_WRITEABLE |
AMDGPU_VM_PAGE_EXECUTABLE | AMDGPU_VM_MTYPE_MASK |
-   AMDGPU_VM_PAGE_NOALLOC;
+   AMDGPU_VM_PAGE_NOALLOC | AMDGPU_VM_EXPLICIT_SYNC;
const uint32_t prt_flags = AMDGPU_VM_DELAY_UPDATE |
-   AMDGPU_VM_PAGE_PRT;
+   AMDGPU_VM_PAGE_PRT | AMDGPU_VM_EXPLICIT_SYNC;
 
struct drm_amdgpu_gem_va *args = data;
struct drm_gem_object *gobj;
@@ -685,6 +685,7 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
struct drm_exec exec;
uint64_t va_flags;
uint64_t vm_size;
+   bool sync_unmap;
int r = 0;
 
if (args->va_address < AMDGPU_VA_RESERVED_SIZE) {
@@ -720,6 +721,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
return -EINVAL;
}
 
+   sync_unmap = !(args->flags & AMDGPU_VM_EXPLICIT_SYNC);
+
switch (args->operation) {
case AMDGPU_VA_OP_MAP:
case AMDGPU_VA_OP_UNMAP:
@@ -779,19 +782,20 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void 
*data,
 va_flags);
break;
case AMDGPU_VA_OP_UNMAP:
-   r = amdgpu_vm_bo_unmap(adev, bo_va, args->va_address);
+   r = amdgpu_vm_bo_unmap(adev, bo_va, args->va_address,
+  sync_unmap);
break;
 
case AMDGPU_VA_OP_CLEAR:
r = amdgpu_vm_bo_clear_mappings(adev, &fpriv->vm,
args->va_address,
-   args->map_size);
+   args->map_size, sync_unmap);
break;
case AMDGPU_VA_OP_

[PATCH 6/6] drm/amdgpu: Bump amdgpu driver version.

2023-10-31 Thread Tatsuyuki Ishi
For detection of the new explicit sync functionality without having to try
the ioctl.

Signed-off-by: Tatsuyuki Ishi 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 81edf66dbea8..2aa406dee192 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -113,9 +113,10 @@
 *gl1c_cache_size, gl2c_cache_size, mall_size, enabled_rb_pipes_mask_hi
  *   3.53.0 - Support for GFX11 CP GFX shadowing
  *   3.54.0 - Add AMDGPU_CTX_QUERY2_FLAGS_RESET_IN_PROGRESS support
+ * - 3.55.0 - Add AMDGPU_VM_EXPLICIT_SYNC flag for GEM operations.
  */
 #define KMS_DRIVER_MAJOR   3
-#define KMS_DRIVER_MINOR   54
+#define KMS_DRIVER_MINOR   55
 #define KMS_DRIVER_PATCHLEVEL  0
 
 unsigned int amdgpu_vram_limit = UINT_MAX;
-- 
2.42.0



[PATCH 1/6] drm/amdgpu: Don't implicit sync PRT maps.

2023-10-31 Thread Tatsuyuki Ishi
These are considered map operations rather than unmaps, and there is no
point in doing implicit synchronization here.

Signed-off-by: Tatsuyuki Ishi 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index f5daadcec865..7b9762f1cddd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -902,7 +902,7 @@ int amdgpu_vm_update_range(struct amdgpu_device *adev, 
struct amdgpu_vm *vm,
/* Implicitly sync to command submissions in the same VM before
 * unmapping. Sync to moving fences before mapping.
 */
-   if (!(flags & AMDGPU_PTE_VALID))
+   if (!(flags & (AMDGPU_PTE_VALID | AMDGPU_PTE_PRT)))
sync_mode = AMDGPU_SYNC_EQ_OWNER;
else
sync_mode = AMDGPU_SYNC_EXPLICIT;
-- 
2.42.0



[PATCH 2/6] drm/amdgpu: Separate eviction from VM status.

2023-10-31 Thread Tatsuyuki Ishi
In short, eviction never really belonged to the vm_status state machine.
Even when evicted, the BO could belong to either the moved or done state.
The "evicted" state needed to handle both cases, causing greater confusion.

Additionally, there were inconsistencies in the definition of an evicted
BO. Some places based it on the `evict` parameter passed from the TTM move
callback, while others updated it based on whether the BO got its
optimal placement. The second is more accurate for our use case. With this
refactor, the evicted state is solely determined by the second rule.

Signed-off-by: Tatsuyuki Ishi 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c| 67 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h|  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c |  1 +
 3 files changed, 29 insertions(+), 40 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 7b9762f1cddd..dd6f72e2a1d6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -174,19 +174,23 @@ int amdgpu_vm_set_pasid(struct amdgpu_device *adev, 
struct amdgpu_vm *vm,
  * State for PDs/PTs and per VM BOs which are not at the location they should
  * be.
  */
-static void amdgpu_vm_bo_evicted(struct amdgpu_vm_bo_base *vm_bo)
+static void amdgpu_vm_bo_set_evicted(struct amdgpu_vm_bo_base *vm_bo, bool evicted)
 {
struct amdgpu_vm *vm = vm_bo->vm;
struct amdgpu_bo *bo = vm_bo->bo;
 
-   vm_bo->moved = true;
spin_lock(&vm_bo->vm->status_lock);
-   if (bo->tbo.type == ttm_bo_type_kernel)
-   list_move(&vm_bo->vm_status, &vm->evicted);
-   else
-   list_move_tail(&vm_bo->vm_status, &vm->evicted);
+   if (evicted && bo->tbo.base.resv == vm->root.bo->tbo.base.resv) {
+   if (bo->tbo.type == ttm_bo_type_kernel)
+   list_move(&vm_bo->eviction_status, &vm->evicted);
+   else
+   list_move_tail(&vm_bo->eviction_status, &vm->evicted);
+   } else {
+   list_del_init(&vm_bo->eviction_status);
+   }
spin_unlock(&vm_bo->vm->status_lock);
 }
+
 /**
  * amdgpu_vm_bo_moved - vm_bo is moved
  *
@@ -310,6 +314,7 @@ void amdgpu_vm_bo_base_init(struct amdgpu_vm_bo_base *base,
base->bo = bo;
base->next = NULL;
INIT_LIST_HEAD(&base->vm_status);
+   INIT_LIST_HEAD(&base->eviction_status);
 
if (!bo)
return;
@@ -336,7 +341,7 @@ void amdgpu_vm_bo_base_init(struct amdgpu_vm_bo_base *base,
 * is currently evicted. add the bo to the evicted list to make sure it
 * is validated on next vm use to avoid fault.
 * */
-   amdgpu_vm_bo_evicted(base);
+   amdgpu_vm_bo_set_evicted(base, true);
 }
 
 /**
@@ -460,7 +465,7 @@ int amdgpu_vm_validate_pt_bos(struct amdgpu_device *adev, 
struct amdgpu_vm *vm,
while (!list_empty(&vm->evicted)) {
bo_base = list_first_entry(&vm->evicted,
   struct amdgpu_vm_bo_base,
-  vm_status);
+  eviction_status);
spin_unlock(&vm->status_lock);
 
bo = bo_base->bo;
@@ -1034,7 +1039,7 @@ void amdgpu_vm_get_memory(struct amdgpu_vm *vm,
list_for_each_entry_safe(bo_va, tmp, &vm->idle, base.vm_status)
amdgpu_vm_bo_get_memory(bo_va, stats);
 
-   list_for_each_entry_safe(bo_va, tmp, &vm->evicted, base.vm_status)
+   list_for_each_entry_safe(bo_va, tmp, &vm->evicted, base.eviction_status)
amdgpu_vm_bo_get_memory(bo_va, stats);
 
list_for_each_entry_safe(bo_va, tmp, &vm->relocated, base.vm_status)
@@ -1153,21 +1158,10 @@ int amdgpu_vm_bo_update(struct amdgpu_device *adev, 
struct amdgpu_bo_va *bo_va,
return r;
}
 
-   /* If the BO is not in its preferred location add it back to
-* the evicted list so that it gets validated again on the
-* next command submission.
-*/
-   if (bo && bo->tbo.base.resv == vm->root.bo->tbo.base.resv) {
-   uint32_t mem_type = bo->tbo.resource->mem_type;
-
-   if (!(bo->preferred_domains &
- amdgpu_mem_type_to_domain(mem_type)))
-   amdgpu_vm_bo_evicted(&bo_va->base);
-   else
-   amdgpu_vm_bo_idle(&bo_va->base);
-   } else {
+   if (bo && bo->tbo.base.resv == vm->root.bo->tbo.base.resv)
+   amdgpu_vm_bo_idle(&bo_va->base);
+   else
amdgpu_vm_bo_done(&bo_va->base);
-   

[PATCH 0/6] drm/amdgpu: Add flag to disable implicit sync for GEM operations.

2023-10-31 Thread Tatsuyuki Ishi
In Vulkan, it is the application's responsibility to perform adequate
synchronization before a sparse unmap, replace or BO destroy operation.
This adds an option to the AMDGPU_VA_OPs to disable the redundant implicit
sync that happens on sparse unmap or replace operations.

This yields a significant improvement in stutter in Forza Horizon 5
and Forza Horizon 4 (games that had significant sparse-binding-related
stutter).

This patchset also addresses a tangential issue: some changes were not
being flushed immediately after the ioctls, but were deferred to be
processed on the next CS submission, which also results in stalling.
A refactor of the state machine is included to achieve this.

Compared to the previous series [1], this specifically targets the VM
operations and keeps everything else intact, including implicit sync on
kernel-initiated moves.

I've been able to pass a full Vulkan CTS run on Navi 10 with this.

Userspace code for this is available at [2] and a branch for the kernel
code is available at [3].

[1]: 
https://lore.kernel.org/all/20230821062005.109771-1-ishitatsuy...@gmail.com/
[2]: 
https://gitlab.freedesktop.org/ishitatsuyuki/mesa/-/commits/vm-explicit-sync
[3]: https://github.com/ishitatsuyuki/linux/tree/explicit-sync-drm-misc-next

Tatsuyuki Ishi (6):
  drm/amdgpu: Don't implicit sync PRT maps.
  drm/amdgpu: Separate eviction from VM status.
  drm/amdgpu: Flush VM updates for split bindings eagerly.
  drm/amdgpu: Remove redundant state change after validation.
  drm/amdgpu: Add flag to disable implicit sync for GEM operations.
  drm/amdgpu: Bump amdgpu driver version.

 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c   |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c   |   3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c   |  32 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h|   7 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h |   6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c| 185 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h|  27 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c |   1 +
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c  |  18 +-
 include/uapi/drm/amdgpu_drm.h |   2 +
 11 files changed, 165 insertions(+), 120 deletions(-)

-- 
2.42.0



[PATCH v2 4/4] drm/amdgpu: Bump amdgpu driver version.

2023-08-20 Thread Tatsuyuki Ishi
From: Bas Nieuwenhuizen 

For detection of the new explicit sync functionality without
having to try the ioctl.

Signed-off-by: Bas Nieuwenhuizen 
Signed-off-by: Tatsuyuki Ishi 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index e90f730eb715..d2eef46f0fcb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -113,9 +113,10 @@
 *gl1c_cache_size, gl2c_cache_size, mall_size, enabled_rb_pipes_mask_hi
  *   3.53.0 - Support for GFX11 CP GFX shadowing
  *   3.54.0 - Add AMDGPU_CTX_QUERY2_FLAGS_RESET_IN_PROGRESS support
+ * - 3.55.0 - Add AMDGPU_CTX_OP_SET_IMPLICIT_SYNC context operation.
  */
 #define KMS_DRIVER_MAJOR   3
-#define KMS_DRIVER_MINOR   54
+#define KMS_DRIVER_MINOR   55
 #define KMS_DRIVER_PATCHLEVEL  0
 
 unsigned int amdgpu_vram_limit = UINT_MAX;
-- 
2.41.0



[PATCH v2 3/4] drm/amdgpu: Add option to disable implicit sync for a context.

2023-08-20 Thread Tatsuyuki Ishi
From: Bas Nieuwenhuizen 

This changes all BO usages in a submit to BOOKKEEP instead of READ,
which effectively disables implicit sync for these submits.

This is configured at a context level using the existing IOCTL.

Signed-off-by: Bas Nieuwenhuizen 
Co-developed-by: Tatsuyuki Ishi 
Signed-off-by: Tatsuyuki Ishi 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 16 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 30 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h |  1 +
 include/uapi/drm/amdgpu_drm.h   |  4 
 4 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 2cb814de0149..008547f97fd1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1200,8 +1200,11 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser 
*p)
struct dma_resv *resv = bo->tbo.base.resv;
enum amdgpu_sync_mode sync_mode;
 
-   sync_mode = amdgpu_bo_explicit_sync(bo) ?
-   AMDGPU_SYNC_EXPLICIT : AMDGPU_SYNC_NE_OWNER;
+   if (amdgpu_bo_explicit_sync(bo) || p->ctx->disable_implicit_sync)
+   sync_mode = AMDGPU_SYNC_EXPLICIT;
+   else
+   sync_mode = AMDGPU_SYNC_NE_OWNER;
+
r = amdgpu_sync_resv(p->adev, &p->sync, resv, sync_mode,
 AMDGPU_SYNC_EXPLICIT, &fpriv->vm);
if (r)
@@ -1322,11 +1325,16 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
 
dma_resv_add_fence(gobj->resv,
   &p->jobs[i]->base.s_fence->finished,
-  DMA_RESV_USAGE_READ);
+  p->ctx->disable_implicit_sync ?
+  DMA_RESV_USAGE_BOOKKEEP :
+  DMA_RESV_USAGE_READ);
}
 
/* The gang leader as remembered as writer */
-   dma_resv_add_fence(gobj->resv, p->fence, DMA_RESV_USAGE_WRITE);
+   dma_resv_add_fence(gobj->resv, p->fence,
+  p->ctx->disable_implicit_sync ?
+  DMA_RESV_USAGE_BOOKKEEP :
+  DMA_RESV_USAGE_WRITE);
}
 
seq = amdgpu_ctx_add_fence(p->ctx, p->entities[p->gang_leader_idx],
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 0dc9c655c4fb..fe6f30d8fd70 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -661,6 +661,30 @@ static int amdgpu_ctx_stable_pstate(struct amdgpu_device 
*adev,
return r;
 }
 
+static int amdgpu_ctx_set_implicit_sync(struct amdgpu_device *adev,
+   struct amdgpu_fpriv *fpriv, uint32_t id,
+   bool enable)
+{
+   struct amdgpu_ctx *ctx;
+   struct amdgpu_ctx_mgr *mgr;
+
+   if (!fpriv)
+   return -EINVAL;
+
+   mgr = &fpriv->ctx_mgr;
+   mutex_lock(&mgr->lock);
+   ctx = idr_find(&mgr->ctx_handles, id);
+   if (!ctx) {
+   mutex_unlock(&mgr->lock);
+   return -EINVAL;
+   }
+
+   ctx->disable_implicit_sync = !enable;
+
+   mutex_unlock(&mgr->lock);
+   return 0;
+}
+
 int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
 struct drm_file *filp)
 {
@@ -709,6 +733,12 @@ int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
return -EINVAL;
r = amdgpu_ctx_stable_pstate(adev, fpriv, id, true, &stable_pstate);
break;
+   case AMDGPU_CTX_OP_SET_IMPLICIT_SYNC:
+   if ((args->in.flags & ~AMDGPU_CTX_IMPLICIT_SYNC_ENABLED) || args->in.priority)
+   return -EINVAL;
+   r = amdgpu_ctx_set_implicit_sync(adev, fpriv, id,
+args->in.flags & AMDGPU_CTX_IMPLICIT_SYNC_ENABLED);
+   break;
default:
return -EINVAL;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
index 85376baaa92f..a330e5b65d30 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
@@ -58,6 +58,7 @@ struct amdgpu_ctx {
unsigned long   ras_counter_ue;
uint32_tstable_pstate;
struct amdgpu_ctx_mgr   *ctx_mgr;
+   booldisable_implicit_sync;
 };
 
 struct amdgpu_ctx_mgr {
diff --git a/include/uapi/drm/
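
From userspace, toggling the option would look roughly like this (a hedged
sketch; the op and flag constants come from the uapi hunk truncated above):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include <amdgpu_drm.h>

static int ctx_set_implicit_sync(int fd, uint32_t ctx_id, bool enable)
{
	union drm_amdgpu_ctx args;

	memset(&args, 0, sizeof(args));
	args.in.op = AMDGPU_CTX_OP_SET_IMPLICIT_SYNC;
	args.in.ctx_id = ctx_id;
	args.in.flags = enable ? AMDGPU_CTX_IMPLICIT_SYNC_ENABLED : 0;
	return drmCommandWriteRead(fd, DRM_AMDGPU_CTX, &args, sizeof(args));
}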

[PATCH v2 2/4] drm/amdgpu: Allow explicit sync for VM ops.

2023-08-20 Thread Tatsuyuki Ishi
From: Bas Nieuwenhuizen 

This should be okay because moves themselves use KERNEL usage and
hence still sync with BOOKKEEP usage. Then any later submits still
wait on any pending VM operations.

(i.e. we only made VM ops not wait on BOOKKEEP submits, not the other
 way around)

Signed-off-by: Bas Nieuwenhuizen 
Signed-off-by: Tatsuyuki Ishi 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c  | 3 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
index f10332e1c6c0..e898a549f86d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
@@ -51,7 +51,8 @@ static int amdgpu_vm_cpu_prepare(struct 
amdgpu_vm_update_params *p,
if (!resv)
return 0;
 
-   return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, sync_mode, p->vm, true);
+   return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode,
+   AMDGPU_SYNC_EXPLICIT, p->vm, true);
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index e259a51e7c56..8cb427710d66 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -98,7 +98,8 @@ static int amdgpu_vm_sdma_prepare(struct 
amdgpu_vm_update_params *p,
return 0;
 
amdgpu_sync_create(&sync);
-   r = amdgpu_sync_resv(p->adev, &sync, resv, sync_mode, sync_mode, p->vm);
+   r = amdgpu_sync_resv(p->adev, &sync, resv, sync_mode,
+   AMDGPU_SYNC_EXPLICIT, p->vm);
if (!r)
r = amdgpu_sync_push_to_job(&sync, p->job);
amdgpu_sync_free(&sync);
-- 
2.41.0
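
The safety argument rests on dma_resv usage ordering (a conceptual sketch of
the upstream semantics, not code from this patch):

/* Usage levels are ordered KERNEL < WRITE < READ < BOOKKEEP, and a sync
 * at level U includes every fence attached at a level <= U, so KERNEL
 * fences are seen by every waiter. */

/* A BO move publishes its fence at the lowest level: */
dma_resv_add_fence(bo->tbo.base.resv, move_fence, DMA_RESV_USAGE_KERNEL);

/* An explicitly synced submit parks its fence at BOOKKEEP, invisible to
 * implicit-sync consumers that stop at READ: */
dma_resv_add_fence(bo->tbo.base.resv, job_fence, DMA_RESV_USAGE_BOOKKEEP);

/* A VM op that only cares about moves still waits on them: */
dma_resv_wait_timeout(bo->tbo.base.resv, DMA_RESV_USAGE_KERNEL,
		      false, MAX_SCHEDULE_TIMEOUT);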



[PATCH v2 1/4] drm/amdgpu: Add separate mode for syncing DMA_RESV_USAGE_BOOKKEEP.

2023-08-20 Thread Tatsuyuki Ishi
From: Bas Nieuwenhuizen 

To prepare for allowing different sync modes in a follow-up patch.

Signed-off-by: Bas Nieuwenhuizen 
Signed-off-by: Tatsuyuki Ishi 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c   | 11 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h   |  3 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 11 ---
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h |  4 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c  |  2 +-
 8 files changed, 23 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index edb35a88cb78..240556838094 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1287,7 +1287,7 @@ static int process_sync_pds_resv(struct 
amdkfd_process_info *process_info,
struct amdgpu_bo *pd = peer_vm->root.bo;
 
ret = amdgpu_sync_resv(NULL, sync, pd->tbo.base.resv,
-  AMDGPU_SYNC_NE_OWNER,
+  AMDGPU_SYNC_NE_OWNER, AMDGPU_SYNC_NE_OWNER,
   AMDGPU_FENCE_OWNER_KFD);
if (ret)
return ret;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 977e1804718d..2cb814de0149 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1203,7 +1203,7 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser 
*p)
sync_mode = amdgpu_bo_explicit_sync(bo) ?
AMDGPU_SYNC_EXPLICIT : AMDGPU_SYNC_NE_OWNER;
r = amdgpu_sync_resv(p->adev, &p->sync, resv, sync_mode,
-&fpriv->vm);
+AMDGPU_SYNC_EXPLICIT, &fpriv->vm);
if (r)
return r;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index f7905bce0de1..f20c1d2757e4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1450,7 +1450,8 @@ void amdgpu_bo_fence(struct amdgpu_bo *bo, struct 
dma_fence *fence,
  *
  * @adev: amdgpu device pointer
  * @resv: reservation object to sync to
- * @sync_mode: synchronization mode
+ * @implicit_sync_mode: synchronization mode for usage <= DMA_RESV_USAGE_READ
+ * @explicit_sync_mode: synchronization mode for usage DMA_RESV_USAGE_BOOKKEEP
  * @owner: fence owner
  * @intr: Whether the wait is interruptible
  *
@@ -1460,14 +1461,15 @@ void amdgpu_bo_fence(struct amdgpu_bo *bo, struct 
dma_fence *fence,
  * 0 on success, errno otherwise.
  */
 int amdgpu_bo_sync_wait_resv(struct amdgpu_device *adev, struct dma_resv *resv,
-enum amdgpu_sync_mode sync_mode, void *owner,
+enum amdgpu_sync_mode implicit_sync_mode,
+enum amdgpu_sync_mode explicit_sync_mode, void 
*owner,
 bool intr)
 {
struct amdgpu_sync sync;
int r;
 
amdgpu_sync_create(&sync);
-   amdgpu_sync_resv(adev, &sync, resv, sync_mode, owner);
+   amdgpu_sync_resv(adev, &sync, resv, implicit_sync_mode, 
explicit_sync_mode, owner);
r = amdgpu_sync_wait(&sync, intr);
amdgpu_sync_free(&sync);
return r;
@@ -1488,7 +1490,8 @@ int amdgpu_bo_sync_wait(struct amdgpu_bo *bo, void 
*owner, bool intr)
struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
 
return amdgpu_bo_sync_wait_resv(adev, bo->tbo.base.resv,
-   AMDGPU_SYNC_NE_OWNER, owner, intr);
+   AMDGPU_SYNC_NE_OWNER, 
AMDGPU_SYNC_EXPLICIT,
+   owner, intr);
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index 05496b97ef93..e982b5815969 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -350,7 +350,8 @@ vm_fault_t amdgpu_bo_fault_reserve_notify(struct 
ttm_buffer_object *bo);
 void amdgpu_bo_fence(struct amdgpu_bo *bo, struct dma_fence *fence,
 bool shared);
 int amdgpu_bo_sync_wait_resv(struct amdgpu_device *adev, struct dma_resv *resv,
-enum amdgpu_sync_mode sync_mode, void *owner,
+enum amdgpu_sync_mode implicit_sync_mode,
+enum amdgpu_sync_mode explicit_sync_mode, void 
*owner,
 bool intr);
 int amdgpu_bo_sync_wait(struct amdgpu

[PATCH v2 0/4] amdgpu: Allow explicitly synchronized submissions.

2023-08-20 Thread Tatsuyuki Ishi
This adds a context option to use DMA_RESV_USAGE_BOOKKEEP for userspace
submissions. This is a respin of [1] but rebased on top of the newly
introduced drm_exec helpers.

Disabling implicit sync is something we've wanted in RADV for a while
to resolve some corner cases. A more immediate win is that it also
avoids a bunch of implicit sync on GPU map/unmap operations, which
helps with stutter around sparse maps/unmaps.
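
For illustration, a userspace sketch of the shape of such a context
option. The op name, value and flag semantics below are placeholders,
not the series' actual uAPI; the real definitions live in the
amdgpu_drm.h hunk of the series:

#include <stdint.h>
#include <xf86drm.h>
#include <drm/amdgpu_drm.h>

/* Placeholder constant for illustration only; not the real uAPI. */
#define CTX_OP_SET_IMPLICIT_SYNC_PLACEHOLDER 5

static int ctx_disable_implicit_sync(int fd, uint32_t ctx_id)
{
	union drm_amdgpu_ctx args = {0};

	args.in.op = CTX_OP_SET_IMPLICIT_SYNC_PLACEHOLDER;
	args.in.ctx_id = ctx_id;
	args.in.flags = 0;	/* assumed: 0 = implicit sync off */
	return drmCommandWriteRead(fd, DRM_AMDGPU_CTX, &args, sizeof(args));
}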

This yields a significant improvement in stutter in Forza Horizon 5
and Forza Horizon 4 (games that had significant sparse-binding-related
stutter). I've been able to pass a full Vulkan CTS run on Navi 10 with
this.

(Note that the drm_exec patchset has introduced new CTS failures with
duplicate BOs, but we will be tracking that separately.)

Userspace code for this is available at [2] and a branch for the kernel
code is available at [3].

[1]: https://patchwork.freedesktop.org/series/107233/
[2]: 
https://gitlab.freedesktop.org/ishitatsuyuki/mesa/-/commits/single-sparse-queue-explicit-rebase
[3]: https://github.com/ishitatsuyuki/linux/tree/explicit-sync-drm-misc-next

Bas Nieuwenhuizen (4):
  drm/amdgpu: Add separate mode for syncing DMA_RESV_USAGE_BOOKKEEP.
  drm/amdgpu: Allow explicit sync for VM ops.
  drm/amdgpu: Add option to disable implicit sync for a context.
  drm/amdgpu: Bump amdgpu driver version.

 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c| 18 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c   | 30 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h   |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c   |  3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c| 11 ---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h|  3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c  | 11 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h  |  4 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c|  3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c   |  3 +-
 include/uapi/drm/amdgpu_drm.h |  4 +++
 12 files changed, 74 insertions(+), 19 deletions(-)

-- 
2.41.0



Re: [PATCH 4/6] drm/amdgpu: use drm_exec for GEM and CSA handling

2023-06-29 Thread Tatsuyuki Ishi

On 6/28/23 19:44, Christian König wrote:


diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index 74055cba3dc9..6811fc866494 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -728,36 +723,37 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void 
*data,
return -EINVAL;
}
  
-	INIT_LIST_HEAD(&list);

-   INIT_LIST_HEAD(&duplicates);
if ((args->operation != AMDGPU_VA_OP_CLEAR) &&
!(args->flags & AMDGPU_VM_PAGE_PRT)) {
gobj = drm_gem_object_lookup(filp, args->handle);
if (gobj == NULL)
return -ENOENT;
abo = gem_to_amdgpu_bo(gobj);
-   tv.bo = &abo->tbo;
-   if (abo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID)
-   tv.num_shared = 1;
-   else
-   tv.num_shared = 0;
-   list_add(&tv.head, &list);
} else {
gobj = NULL;
abo = NULL;
}
  
-	amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd);

+   drm_exec_init(&exec, DRM_EXEC_INTERRUPTIBLE_WAIT);


Sorry, I missed this last time, but this needs to allow duplicates as
well or mapping always_valid BOs doesn't work.
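
For reference, the drm_exec that eventually landed grew a flag for
exactly this; a sketch, assuming DRM_EXEC_IGNORE_DUPLICATES and the
two-argument drm_exec_init() of the revision under review:

	/* Tolerate locking the same reservation object twice, which happens
	 * when an always-valid BO shares its resv with the VM root PD. */
	drm_exec_init(&exec, DRM_EXEC_INTERRUPTIBLE_WAIT |
			     DRM_EXEC_IGNORE_DUPLICATES);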


+   drm_exec_until_all_locked(&exec) {
+   if (gobj) {
+   r = drm_exec_lock_obj(&exec, gobj);
+   drm_exec_retry_on_contention(&exec);
+   if (unlikely(r))
+   goto error;
+   }
  
-	r = ttm_eu_reserve_buffers(&ticket, &list, true, &duplicates);

-   if (r)
-   goto error_unref;
+   r = amdgpu_vm_lock_pd(&fpriv->vm, &exec, 2);
+   drm_exec_retry_on_contention(&exec);
+   if (unlikely(r))
+   goto error;
+   }
  
  	if (abo) {

bo_va = amdgpu_vm_bo_find(&fpriv->vm, abo);
if (!bo_va) {
r = -ENOENT;
-   goto error_backoff;
+   goto error;
}
} else if (args->operation != AMDGPU_VA_OP_CLEAR) {
bo_va = fpriv->prt_va;
@@ -794,10 +790,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
args->operation);
  
-error_backoff:

-   ttm_eu_backoff_reservation(&ticket, &list);
-
-error_unref:
+error:
+   drm_exec_fini(&exec);
drm_gem_object_put(gobj);
return r;
  }


Re: [PATCH 6/6] drm/amdgpu: use the new drm_exec object for CS v2

2023-06-29 Thread Tatsuyuki Ishi

On 6/28/23 19:44, Christian König wrote:

@@ -958,18 +904,57 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser 
*p,
e->user_invalidated = userpage_invalidated;
}
  
-	r = ttm_eu_reserve_buffers(&p->ticket, &p->validated, true,

-  &duplicates);
-   if (unlikely(r != 0)) {
-   if (r != -ERESTARTSYS)
-   DRM_ERROR("ttm_eu_reserve_buffers failed.\n");
-   goto out_free_user_pages;
+   drm_exec_until_all_locked(&p->exec) {
+   r = amdgpu_vm_lock_pd(&fpriv->vm, &p->exec, 1 + p->gang_size);
+   drm_exec_retry_on_contention(&p->exec);
+   if (unlikely(r))
+   goto out_free_user_pages;
+
+   amdgpu_bo_list_for_each_entry(e, p->bo_list) {
+   /* One fence for TTM and one for each CS job */
+   r = drm_exec_prepare_obj(&p->exec, &e->bo->tbo.base,
+1 + p->gang_size);
+   drm_exec_retry_on_contention(&p->exec);
+   if (unlikely(r))
+   goto out_free_user_pages;
+
+   e->bo_va = amdgpu_vm_bo_find(vm, e->bo);
+   e->range = NULL;

Still leaking.

+   }
+
+   if (p->uf_bo) {
+   r = drm_exec_prepare_obj(&p->exec, &p->uf_bo->tbo.base,
+1 + p->gang_size);
+   drm_exec_retry_on_contention(&p->exec);
+   if (unlikely(r))
+   goto out_free_user_pages;
+   }
}
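
To make the leak above concrete, a sketch of the userptr range life
cycle as I read this series (the calls are my paraphrase, not a quote):

	/* Setup (amdgpu_cs_parser_bos): the hmm_range produced while pinning
	 * the user pages is stashed in e->range... */
	r = amdgpu_ttm_tt_get_user_pages(e->bo, e->user_pages, &e->range);

	/* Submit (amdgpu_cs_submit): ...and is consumed here. Clearing
	 * e->range before this call means the range is never released. */
	r |= !amdgpu_ttm_tt_get_user_pages_done(e->bo->tbo.ttm, e->range);
	e->range = NULL;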




Re: [PATCH 06/13] drm/amdgpu: use the new drm_exec object for CS v2

2023-06-20 Thread Tatsuyuki Ishi

On 6/20/23 17:16, Tatsuyuki Ishi wrote:

On 6/20/23 17:12, Christian König wrote:

On 20.06.23 at 06:07, Tatsuyuki Ishi wrote:

@@ -928,18 +874,56 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser 
*p,
  e->user_invalidated = userpage_invalidated;
  }
  -    r = ttm_eu_reserve_buffers(&p->ticket, &p->validated, true,
-   &duplicates);
-    if (unlikely(r != 0)) {
-    if (r != -ERESTARTSYS)
-    DRM_ERROR("ttm_eu_reserve_buffers failed.\n");
-    goto out_free_user_pages;
+    drm_exec_while_not_all_locked(&p->exec) {
+    r = amdgpu_vm_lock_pd(&fpriv->vm, &p->exec);
+    drm_exec_continue_on_contention(&p->exec);


Duplicate handling is needed for pretty much every call of amdgpu_vm_lock_pd, as 
bo->tbo.base.resv == vm->root.bo->tbo.base.resv for 
AMDGPU_GEM_CREATE_VM_ALWAYS_VALID.


Well no. AMDGPU_GEM_CREATE_VM_ALWAYS_VALID means that BOs should *not* be part 
of the relocation list. So when those cause an EALREADY here then userspace has 
a bug.


Sounds fair, lemme check how RADV is handling this again.


I checked again and the relocation list was actually fine, but other
places were not. For example, amdgpu_gem_object_close locks both
bo->tbo.base.resv and vm->root.bo->tbo.base.resv (the PD) on its own.

This was the easily debuggable case since it caused an error log, but
some other BO operations on ALWAYS_VALID BOs are presumably broken for
the same reason.

Tatsuyuki


Re: [PATCH 06/13] drm/amdgpu: use the new drm_exec object for CS v2

2023-06-20 Thread Tatsuyuki Ishi

On 6/20/23 17:12, Christian König wrote:

On 20.06.23 at 06:07, Tatsuyuki Ishi wrote:

+Boris and +Matthew in case you want to take over this patch set.

Here are some review / testing comments, including those I posted before to 
ease tracking.

On 5/4/23 20:51, Christian König wrote:

Use the new component here as well and remove the old handling.

v2: drop duplicate handling

Signed-off-by: Christian König 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h |   1 -
  drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.c |  71 ++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h |   5 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 210 +---
  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h  |   7 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  |  22 --
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h  |   3 -
  7 files changed, 115 insertions(+), 204 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 02b827785e39..eba3e4f01ea6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -133,6 +141,8 @@ int amdgpu_bo_list_create(struct amdgpu_device *adev, 
struct drm_file *filp,
    list->first_userptr = first_userptr;
  list->num_entries = num_entries;
+    sort(array, last_entry, sizeof(struct amdgpu_bo_list_entry),
+ amdgpu_bo_list_entry_cmp, NULL);


Previously amdgpu_bo_list_get_list sorted all entries, but this one only sorts 
userptr entries. I think this changes behavior?


The intention here is to sort all entries except the userptrs. Need to double 
check.


Sorry, I mistyped. You're right that it sorts all entries except the userptrs. 
The previous code seems to sort all entries including userptrs.




@@ -928,18 +874,56 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser 
*p,
  e->user_invalidated = userpage_invalidated;
  }
  -    r = ttm_eu_reserve_buffers(&p->ticket, &p->validated, true,
-   &duplicates);
-    if (unlikely(r != 0)) {
-    if (r != -ERESTARTSYS)
-    DRM_ERROR("ttm_eu_reserve_buffers failed.\n");
-    goto out_free_user_pages;
+    drm_exec_while_not_all_locked(&p->exec) {
+    r = amdgpu_vm_lock_pd(&fpriv->vm, &p->exec);
+    drm_exec_continue_on_contention(&p->exec);


Duplicate handling is needed for pretty much every call of amdgpu_vm_lock_pd, as 
bo->tbo.base.resv == vm->root.bo->tbo.base.resv for 
AMDGPU_GEM_CREATE_VM_ALWAYS_VALID.


Well no. AMDGPU_GEM_CREATE_VM_ALWAYS_VALID means that BOs should *not* be part 
of the relocation list. So when those cause an EALREADY here then userspace has 
a bug.


Sounds fair, lemme check how RADV is handling this again.

Tatsuyuki



Re: [PATCH 06/13] drm/amdgpu: use the new drm_exec object for CS v2

2023-06-19 Thread Tatsuyuki Ishi

On 6/20/23 13:07, Tatsuyuki Ishi wrote:
@@ -1296,9 +1271,8 @@ static int amdgpu_cs_submit(struct 
amdgpu_cs_parser *p,

   */
  r = 0;
  amdgpu_bo_list_for_each_userptr_entry(e, p->bo_list) {
-    struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
-
-    r |= !amdgpu_ttm_tt_get_user_pages_done(bo->tbo.ttm, e->range);
+    r |= !amdgpu_ttm_tt_get_user_pages_done(e->bo->tbo.ttm,
+    e->range);
  e->range = NULL;


e->range = NULL; needs to be removed, or it's causing *massive* memory 
leaks.


Actually, I quoted the wrong hunk, the correct one is below.


@@ -928,18 +874,56 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser 
*p,
e->user_invalidated = userpage_invalidated;
}
 
-	r = ttm_eu_reserve_buffers(&p->ticket, &p->validated, true,

-  &duplicates);
-   if (unlikely(r != 0)) {
-   if (r != -ERESTARTSYS)
-   DRM_ERROR("ttm_eu_reserve_buffers failed.\n");
-   goto out_free_user_pages;
+   drm_exec_while_not_all_locked(&p->exec) {
+   r = amdgpu_vm_lock_pd(&fpriv->vm, &p->exec);
+   drm_exec_continue_on_contention(&p->exec);
+   if (unlikely(r))
+   goto out_free_user_pages;
+
+   amdgpu_bo_list_for_each_entry(e, p->bo_list) {
+   r = drm_exec_prepare_obj(&p->exec, &e->bo->tbo.base, 2);
+   drm_exec_break_on_contention(&p->exec);
+   if (unlikely(r))
+   goto out_free_user_pages;
+
+   e->bo_va = amdgpu_vm_bo_find(vm, e->bo);
+   e->range = NULL;


This causes the leak.


+   }
+   drm_exec_continue_on_contention(&p->exec);
+
+   if (p->uf_bo) {
+   r = drm_exec_prepare_obj(&p->exec, &p->uf_bo->tbo.base,
+2);
+   drm_exec_continue_on_contention(&p->exec);
+   if (unlikely(r))
+   goto out_free_user_pages;
+   }
}


Tatsuyuki


Re: [PATCH 06/13] drm/amdgpu: use the new drm_exec object for CS v2

2023-06-19 Thread Tatsuyuki Ishi

+Boris and +Matthew in case you want to take over this patch set.

Here are some review / testing comments, including those I posted before 
to ease tracking.


On 5/4/23 20:51, Christian König wrote:

Use the new component here as well and remove the old handling.

v2: drop duplicate handling

Signed-off-by: Christian König 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h |   1 -
  drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.c |  71 ++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h |   5 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 210 +---
  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h  |   7 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  |  22 --
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h  |   3 -
  7 files changed, 115 insertions(+), 204 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 02b827785e39..eba3e4f01ea6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -133,6 +141,8 @@ int amdgpu_bo_list_create(struct amdgpu_device *adev, 
struct drm_file *filp,
  
  	list->first_userptr = first_userptr;

list->num_entries = num_entries;
+   sort(array, last_entry, sizeof(struct amdgpu_bo_list_entry),
+amdgpu_bo_list_entry_cmp, NULL);


Previously amdgpu_bo_list_get_list sorted all entries, but this one only 
sorts userptr entries. I think this changes behavior?



@@ -928,18 +874,56 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser 
*p,
e->user_invalidated = userpage_invalidated;
}
  
-	r = ttm_eu_reserve_buffers(&p->ticket, &p->validated, true,

-  &duplicates);
-   if (unlikely(r != 0)) {
-   if (r != -ERESTARTSYS)
-   DRM_ERROR("ttm_eu_reserve_buffers failed.\n");
-   goto out_free_user_pages;
+   drm_exec_while_not_all_locked(&p->exec) {
+   r = amdgpu_vm_lock_pd(&fpriv->vm, &p->exec);
+   drm_exec_continue_on_contention(&p->exec);


Duplicate handling is needed for pretty much every call of 
amdgpu_vm_lock_pd, as bo->tbo.base.resv == vm->root.bo->tbo.base.resv 
for AMDGPU_GEM_CREATE_VM_ALWAYS_VALID.


I think Boris's suggestion of having this through a common 
DRM_EXEC_FLAG_ALLOW_DUPLICATES flag fits well.



+   if (unlikely(r))
+   goto out_free_user_pages;
+
+   amdgpu_bo_list_for_each_entry(e, p->bo_list) {
+   r = drm_exec_prepare_obj(&p->exec, &e->bo->tbo.base, 2);


Previously there were comments explaining how the fence count is
calculated; now they seem to be removed. I'd prefer if they were
properly retained, as finding out who calls dma_resv_add_fence() is not
trivial, and a wrong reservation count can also be really hard to debug.


Likewise for amdgpu_vm_lock_pd (which was added in another commit).
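
The convention being asked for, as a sketch (the slot math is my
reading of the series, not a quote from it):

	/* drm_exec_prepare_obj() locks the BO and pre-reserves fence slots
	 * via dma_resv_reserve_fences(); the count must cover every later
	 * dma_resv_add_fence() on this resv: one slot for TTM plus one per
	 * gang-submit job. */
	r = drm_exec_prepare_obj(&p->exec, &e->bo->tbo.base, 1 + p->gang_size);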


+   drm_exec_break_on_contention(&p->exec);
+   if (unlikely(r))
+   goto out_free_user_pages;
+
+   e->bo_va = amdgpu_vm_bo_find(vm, e->bo);
+   e->range = NULL;
+   }
+   drm_exec_continue_on_contention(&p->exec);
+
+   if (p->uf_bo) {
+   r = drm_exec_prepare_obj(&p->exec, &p->uf_bo->tbo.base,
+2);
+   drm_exec_continue_on_contention(&p->exec);
+   if (unlikely(r))
+   goto out_free_user_pages;
+   }
}
  
-	amdgpu_bo_list_for_each_entry(e, p->bo_list) {

-   struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
+   amdgpu_bo_list_for_each_userptr_entry(e, p->bo_list) {
+   struct mm_struct *usermm;
  
-		e->bo_va = amdgpu_vm_bo_find(vm, bo);

+   usermm = amdgpu_ttm_tt_get_usermm(e->bo->tbo.ttm);
+   if (usermm && usermm != current->mm) {
+   r = -EPERM;
+   goto out_free_user_pages;
+   }
+
+   if (amdgpu_ttm_tt_is_userptr(e->bo->tbo.ttm) &&
+   e->user_invalidated && e->user_pages) {
+   amdgpu_bo_placement_from_domain(e->bo,
+   AMDGPU_GEM_DOMAIN_CPU);
+   r = ttm_bo_validate(&e->bo->tbo, &e->bo->placement,
+   &ctx);
+   if (r)
+   goto out_free_user_pages;
+
+   amdgpu_ttm_tt_set_user_pages(e->bo->tbo.ttm,
+e->user_pages);
+   }
+
+   kvfree(e->user_pages);
+   e->user_pages = NULL;
}
  
  	amdgpu_cs_get_threshold_for_moves(p->adev, &p->bytes_moved_threshold,

@@ -1296,9 +1271,8 @@ static int amdgpu_cs_submit(struct amdgpu_cs_pars

Re: [PATCH 06/13] drm/amdgpu: use the new drm_exec object for CS v2

2023-06-12 Thread Tatsuyuki Ishi

Hi Christian,

On May 4, 2023, at 20:51, Christian König <ckoenig.leichtzumer...@gmail.com> wrote:
Use the new component here as well and remove the old handling.

v2: drop duplicate handling


It seems that after dropping the duplicate handling, locking of VM PDs
on the global BO list is basically broken everywhere, as
bo->tbo.base.resv == vm->root.bo->tbo.base.resv for
AMDGPU_GEM_CREATE_VM_ALWAYS_VALID.

Perhaps we need to bring duplicate handling back?

Tatsuyuki



Re: [PATCH 06/13] drm/amdgpu: use the new drm_exec object for CS v2

2023-06-12 Thread Tatsuyuki Ishi
Hi Christian,

> On May 4, 2023, at 20:51, Christian König  
> wrote:
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 08eced097bd8..9e751f5d4aa7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -882,25 +840,13 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser 
> *p,
>
> mutex_lock(&p->bo_list->bo_list_mutex);
>
> - /* One for TTM and one for the CS job */
> - amdgpu_bo_list_for_each_entry(e, p->bo_list)
> - e->tv.num_shared = 2;
> -
> - amdgpu_bo_list_get_list(p->bo_list, &p->validated);
> -
> - INIT_LIST_HEAD(&duplicates);
> - amdgpu_vm_get_pd_bo(&fpriv->vm, &p->validated, &p->vm_pd);
> -
> - if (p->uf_entry.tv.bo && !ttm_to_amdgpu_bo(p->uf_entry.tv.bo)->parent)
> - list_add(&p->uf_entry.tv.head, &p->validated);
> -
> /* Get userptr backing pages. If pages are updated after registered
> * in amdgpu_gem_userptr_ioctl(), amdgpu_cs_list_validate() will do
> * amdgpu_ttm_backend_bind() to flush and invalidate new pages
> */
> amdgpu_bo_list_for_each_userptr_entry(e, p->bo_list) {
> - struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
> bool userpage_invalidated = false;
> + struct amdgpu_bo *bo = e->bo;
> int i;
>
> e->user_pages = kvmalloc_array(bo->tbo.ttm->num_pages,
> @@ -1307,20 +1281,22 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser 
> *p,
> }
>
> p->fence = dma_fence_get(&leader->base.s_fence->finished);
> - list_for_each_entry(e, &p->validated, tv.head) {
> + drm_exec_for_each_locked_object(&p->exec, index, gobj) {
> +
> + ttm_bo_move_to_lru_tail_unlocked(&gem_to_amdgpu_bo(gobj)->tbo);
>
> /* Everybody except for the gang leader uses READ */
> for (i = 0; i < p->gang_size; ++i) {
> if (p->jobs[i] == leader)
> continue;
>
> - dma_resv_add_fence(e->tv.bo->base.resv,
> + dma_resv_add_fence(gobj->resv,
>   &p->jobs[i]->base.s_fence->finished,
>   DMA_RESV_USAGE_READ);
> }
>
> - /* The gang leader is remembered as writer */
> - e->tv.num_shared = 0;
> + /* The gang leader as remembered as writer */
> + dma_resv_add_fence(gobj->resv, p->fence, DMA_RESV_USAGE_WRITE);
> }
>
> seq = amdgpu_ctx_add_fence(p->ctx, p->entities[p->gang_leader_idx],

I believe this changes the usage of VM PDs from READ to WRITE.
Maybe we could check if a BO is a PD and supply DMA_RESV_USAGE_READ in that case?
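
A sketch of that suggestion (the PD test via vm->root.bo is
illustrative only, not tested code):

	/* Keep VM page directories at READ usage so VM updates don't have to
	 * treat every CS job as a writer on the shared PD reservation. */
	bool is_pd = (gobj == &fpriv->vm.root.bo->tbo.base);

	dma_resv_add_fence(gobj->resv, p->fence,
			   is_pd ? DMA_RESV_USAGE_READ : DMA_RESV_USAGE_WRITE);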

Tatsuyuki


[PATCH v2] drm: make drm_syncobj_array_wait() use the range hrtimer feature

2021-12-18 Thread Tatsuyuki Ishi
select(), poll() and epoll_wait() all already support high-precision
timeout handling. This patch makes drm_syncobj_array_wait() handle the
timeout with high precision, using the same heuristics and functions
implemented for select().
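
From userspace the change is transparent; a libdrm caller like this
sketch simply gets sub-jiffy accuracy for the ioctl's absolute
nanosecond timeout (assuming the standard drmSyncobjWait() wrapper):

#include <stdint.h>
#include <time.h>
#include <xf86drm.h>

/* Sketch: wait up to 500us for a set of syncobjs. With hrtimer-based
 * waiting the kernel can honor this far more precisely than the old
 * jiffies-granular schedule_timeout(). */
static int wait_500us(int fd, uint32_t *handles, uint32_t count)
{
	struct timespec ts;
	uint32_t first;
	int64_t abs_ns;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	abs_ns = ts.tv_sec * 1000000000ll + ts.tv_nsec + 500000;

	return drmSyncobjWait(fd, handles, count, abs_ns,
			      DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL, &first);
}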

v2: Fix a name error resulting in NULL dereference.

Signed-off-by: Tatsuyuki Ishi 
---
 drivers/gpu/drm/drm_syncobj.c | 75 ---
 1 file changed, 44 insertions(+), 31 deletions(-)

diff --git a/drivers/gpu/drm/drm_syncobj.c b/drivers/gpu/drm/drm_syncobj.c
index c9a9d74f338c..b2f1631e7dc2 100644
--- a/drivers/gpu/drm/drm_syncobj.c
+++ b/drivers/gpu/drm/drm_syncobj.c
@@ -949,17 +949,30 @@ static void syncobj_wait_syncobj_func(struct drm_syncobj 
*syncobj,
list_del_init(&wait->node);
 }
 
-static signed long drm_syncobj_array_wait_timeout(struct drm_syncobj 
**syncobjs,
- void __user *user_points,
- uint32_t count,
- uint32_t flags,
- signed long timeout,
- uint32_t *idx)
+static int drm_syncobj_array_wait_timeout(struct drm_syncobj **syncobjs,
+ void __user *user_points,
+ uint32_t count, uint32_t flags,
+ struct timespec64 *timeout,
+ uint32_t *idx)
 {
struct syncobj_wait_entry *entries;
struct dma_fence *fence;
uint64_t *points;
uint32_t signaled_count, i;
+   u64 slack = 0;
+   ktime_t expires, *to = NULL;
+   int ret = 0, timed_out = 0;
+
+   if (timeout->tv_sec | timeout->tv_nsec) {
+   slack = select_estimate_accuracy(timeout);
+   to = &expires;
+   *to = timespec64_to_ktime(*timeout);
+   } else {
+   /*
+* Avoid scheduling a hrtimer wait if timeout is zero.
+*/
+   timed_out = 1;
+   }
 
if (flags & DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT)
lockdep_assert_none_held_once();
@@ -973,13 +986,13 @@ static signed long drm_syncobj_array_wait_timeout(struct 
drm_syncobj **syncobjs,
 
} else if (copy_from_user(points, user_points,
  sizeof(uint64_t) * count)) {
-   timeout = -EFAULT;
+   ret = -EFAULT;
goto err_free_points;
}
 
entries = kcalloc(count, sizeof(*entries), GFP_KERNEL);
if (!entries) {
-   timeout = -ENOMEM;
+   ret = -ENOMEM;
goto err_free_points;
}
/* Walk the list of sync objects and initialize entries.  We do
@@ -999,7 +1012,7 @@ static signed long drm_syncobj_array_wait_timeout(struct 
drm_syncobj **syncobjs,
if (flags & DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT) {
continue;
} else {
-   timeout = -EINVAL;
+   ret = -EINVAL;
goto cleanup_entries;
}
}
@@ -1063,17 +1076,18 @@ static signed long 
drm_syncobj_array_wait_timeout(struct drm_syncobj **syncobjs,
if (signaled_count == count)
goto done_waiting;
 
-   if (timeout == 0) {
-   timeout = -ETIME;
+   if (timed_out) {
+   ret = -ETIME;
goto done_waiting;
}
 
if (signal_pending(current)) {
-   timeout = -ERESTARTSYS;
+   ret = -ERESTARTSYS;
goto done_waiting;
}
 
-   timeout = schedule_timeout(timeout);
+   timed_out =
+   !schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS);
} while (1);
 
 done_waiting:
@@ -1092,7 +1106,7 @@ static signed long drm_syncobj_array_wait_timeout(struct 
drm_syncobj **syncobjs,
 err_free_points:
kfree(points);
 
-   return timeout;
+   return ret;
 }
 
 /**
@@ -1134,28 +1148,27 @@ static int drm_syncobj_array_wait(struct drm_device 
*dev,
  struct drm_syncobj_timeline_wait 
*timeline_wait,
  struct drm_syncobj **syncobjs, bool timeline)
 {
-   signed long timeout = 0;
+   int ret = 0;
+   struct timespec64 timeout;
uint32_t first = ~0;
 
if (!timeline) {
-   timeout = drm_timeout_abs_to_jiffies(wait->timeout_nsec);
-   timeout = drm_syncobj_array_wait_timeout(syncobjs,
-NULL,
-

[PATCH] drm: make drm_syncobj_array_wait() use the range hrtimer feature

2021-12-06 Thread Tatsuyuki Ishi
select(), poll() and epoll_wait() all already support high-precision
timeout handling. This patch makes drm_syncobj_array_wait() handle the
timeout with high precision, using the same heuristics and functions
implemented for select().

Signed-off-by: Tatsuyuki Ishi 
---
 drivers/gpu/drm/drm_syncobj.c | 75 ---
 1 file changed, 44 insertions(+), 31 deletions(-)

diff --git a/drivers/gpu/drm/drm_syncobj.c b/drivers/gpu/drm/drm_syncobj.c
index c9a9d74f338c..b2f1631e7dc2 100644
--- a/drivers/gpu/drm/drm_syncobj.c
+++ b/drivers/gpu/drm/drm_syncobj.c
@@ -949,17 +949,30 @@ static void syncobj_wait_syncobj_func(struct drm_syncobj 
*syncobj,
list_del_init(&wait->node);
 }
 
-static signed long drm_syncobj_array_wait_timeout(struct drm_syncobj 
**syncobjs,
- void __user *user_points,
- uint32_t count,
- uint32_t flags,
- signed long timeout,
- uint32_t *idx)
+static int drm_syncobj_array_wait_timeout(struct drm_syncobj **syncobjs,
+ void __user *user_points,
+ uint32_t count, uint32_t flags,
+ struct timespec64 *timeout,
+ uint32_t *idx)
 {
struct syncobj_wait_entry *entries;
struct dma_fence *fence;
uint64_t *points;
uint32_t signaled_count, i;
+   u64 slack = 0;
+   ktime_t expires, *to = NULL;
+   int ret = 0, timed_out = 0;
+
+   if (timeout->tv_sec | timeout->tv_nsec) {
+   slack = select_estimate_accuracy(timeout);
+   to = &expires;
+   *to = timespec64_to_ktime(*timeout);
+   } else {
+   /*
+* Avoid scheduling a hrtimer wait if timeout is zero.
+*/
+   timed_out = 1;
+   }
 
if (flags & DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT)
lockdep_assert_none_held_once();
@@ -973,13 +986,13 @@ static signed long drm_syncobj_array_wait_timeout(struct 
drm_syncobj **syncobjs,
 
} else if (copy_from_user(points, user_points,
  sizeof(uint64_t) * count)) {
-   timeout = -EFAULT;
+   ret = -EFAULT;
goto err_free_points;
}
 
entries = kcalloc(count, sizeof(*entries), GFP_KERNEL);
if (!entries) {
-   timeout = -ENOMEM;
+   ret = -ENOMEM;
goto err_free_points;
}
/* Walk the list of sync objects and initialize entries.  We do
@@ -999,7 +1012,7 @@ static signed long drm_syncobj_array_wait_timeout(struct 
drm_syncobj **syncobjs,
if (flags & DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT) {
continue;
} else {
-   timeout = -EINVAL;
+   ret = -EINVAL;
goto cleanup_entries;
}
}
@@ -1063,17 +1076,18 @@ static signed long 
drm_syncobj_array_wait_timeout(struct drm_syncobj **syncobjs,
if (signaled_count == count)
goto done_waiting;
 
-   if (timeout == 0) {
-   timeout = -ETIME;
+   if (timed_out) {
+   ret = -ETIME;
goto done_waiting;
}
 
if (signal_pending(current)) {
-   timeout = -ERESTARTSYS;
+   ret = -ERESTARTSYS;
goto done_waiting;
}
 
-   timeout = schedule_timeout(timeout);
+   timed_out =
+   !schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS);
} while (1);
 
 done_waiting:
@@ -1092,7 +1106,7 @@ static signed long drm_syncobj_array_wait_timeout(struct 
drm_syncobj **syncobjs,
 err_free_points:
kfree(points);
 
-   return timeout;
+   return ret;
 }
 
 /**
@@ -1134,28 +1148,27 @@ static int drm_syncobj_array_wait(struct drm_device 
*dev,
  struct drm_syncobj_timeline_wait 
*timeline_wait,
  struct drm_syncobj **syncobjs, bool timeline)
 {
-   signed long timeout = 0;
+   int ret = 0;
+   struct timespec64 timeout;
uint32_t first = ~0;
 
if (!timeline) {
-   timeout = drm_timeout_abs_to_jiffies(wait->timeout_nsec);
-   timeout = drm_syncobj_array_wait_timeout(syncobjs,
-NULL,
-