Re: Forcing device loss for debugging

2023-09-11 Thread Christian König
Under the sysfs directory for each PCI device you should have a virtual 
file called "remove".


So if you just do an "echo 1 > /sys/devices/pci/remove" you 
basically simulate hot-unplugging that specific PCI device.


IIRC the PCI bridge the device is connected to should then have a 
"rescan" virtual file which triggers hot-plugging the device back in 
again when written.
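A minimal sketch of the remove/rescan trick described above, as C instead of shell. The helper name is invented for this sketch, the PCI address in the usage comment is an example (substitute your device's), and writing these sysfs files requires root:

```c
#include <fcntl.h>
#include <unistd.h>

/* Write "1" to a sysfs control file; returns 0 on success, -1 on error.
 * (Helper name is made up for this sketch.) */
int sysfs_write_one(const char *path)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, "1", 1);
    close(fd);
    return n == 1 ? 0 : -1;
}

/* Usage (as root, with your device's PCI address):
 *   sysfs_write_one("/sys/bus/pci/devices/0000:03:00.0/remove"); // unplug
 *   sysfs_write_one("/sys/bus/pci/rescan");                      // replug
 */
```

Rescanning from `/sys/bus/pci/rescan` re-enumerates from the root; writing the bridge's own `rescan` file, as the mail suggests, also works.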


Regards,
Christian.

Am 06.09.23 um 02:27 schrieb Jim Blandy:
Is there any way to force a Vulkan device to report itself lost, in 
order to exercise one's application's recovery paths? I looked through 
the Mesa sources but didn't find anything that seemed suitable.




Re: Unit and performance tests for VA-API

2023-06-21 Thread Christian König
Since nobody replied to this in a month, I'm explicitly adding Leo to 
comment.


Christian.

Am 22.05.23 um 22:01 schrieb Shashank Sharma:

Respected sir/madam,

I am a 4th-year student at the National Institute of Technology, Silchar, 
in India.


I am currently looking for a summer internship. I read the description 
and requirements for your project, and I want to work on it.


I have 3 years of experience with the C language and have completed various 
academic projects in C.


I am eager to learn about the system and I am confident that I can 
provide you with satisfactory results. Please provide a task so that I 
can prove myself.



Sincerely,

Shashank Sharma




Re: Regarding project in X.Org Endless Vacation of Code programs

2023-02-19 Thread Christian König

Hi Tushar,

Leo is the lead from our multimedia team and should be able to help you 
with that.


Cheers,
Christian.

Am 18.02.23 um 16:19 schrieb Tushar Gupta:

Hello,

I am a software developer proficient in C++ and Python, and I am interested 
in contributing to the "Unit and Performance test for VA-API" and "piglit 
for VA-API" projects. My skills are aligned with the requirements of this 
project.

I would like to discuss this project further.

Thank you




Re: [PATCH v2] drm/doc: add rfc section for small BAR uapi

2022-04-27 Thread Christian König

Am 27.04.22 um 17:02 schrieb Matthew Auld:

On 27/04/2022 07:55, Christian König wrote:
Well usually we increment the drm minor version when adding some new 
flags on amdgpu.


In addition to that, just one comment from our experience: you don't just 
need one flag, but two. The first one is a hint which says "CPU 
access needed" and the second is a promise which says "CPU access never 
needed".


The background is that for a whole bunch of buffers you can say with 100% 
certainty that you will never ever need CPU access.


Then there is a whole bunch of buffers where we might need 
CPU access, but can't tell for sure.


And last we have stuff like transfer buffers where you can be 100% sure 
that you need CPU access.


Separating it like this helped a lot with performance on small BAR 
systems.


Thanks for the comments. For the "CPU access never needed" flag, what 
extra stuff does that do on the kernel side vs not specifying any 
flag/hint? I assume it still prioritizes using the non-CPU visible 
portion first? What else does it do?


It's used as a hint when you need to pin BOs for scanout for example.

In general we try to allocate BOs which are marked "CPU access needed" 
in the CPU-visible window if possible, but fall back to any memory if 
that won't fit.


Christian.





Regards,
Christian.

Am 27.04.22 um 08:48 schrieb Lionel Landwerlin:
One question though, how do we detect that this flag 
(I915_GEM_CREATE_EXT_FLAG_NEEDS_CPU_ACCESS) is accepted on a given 
kernel?
I assume older kernels are going to reject object creation if we use 
this flag?


I didn't plan to use __drm_i915_query_vma_info, but isn't it 
inconsistent to select the placement on the GEM object and then 
query whether it's mappable by address?
You made a comment stating this is racy, wouldn't querying on the 
GEM object prevent this?


Thanks,

-Lionel

On 27/04/2022 09:35, Lionel Landwerlin wrote:

Hi Matt,


The proposal looks good to me.

Looking forward to trying it out on drm-tip.


-Lionel

On 20/04/2022 20:13, Matthew Auld wrote:

Add an entry for the new uapi needed for small BAR on DG2+.

v2:
   - Some spelling fixes and other small tweaks. (Akeem & Thomas)
   - Rework error capture interactions, including no longer needing
 NEEDS_CPU_ACCESS for objects marked for capture. (Thomas)
   - Add probed_cpu_visible_size. (Lionel)

Signed-off-by: Matthew Auld 
Cc: Thomas Hellström 
Cc: Lionel Landwerlin 
Cc: Jon Bloomfield 
Cc: Daniel Vetter 
Cc: Jordan Justen 
Cc: Kenneth Graunke 
Cc: Akeem G Abodunrin 
Cc: mesa-dev@lists.freedesktop.org
---
  Documentation/gpu/rfc/i915_small_bar.h   | 190 +++
  Documentation/gpu/rfc/i915_small_bar.rst |  58 +++
  Documentation/gpu/rfc/index.rst  |   4 +
  3 files changed, 252 insertions(+)
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.h
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.rst

diff --git a/Documentation/gpu/rfc/i915_small_bar.h b/Documentation/gpu/rfc/i915_small_bar.h
new file mode 100644
index ..7bfd0cf44d35
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_small_bar.h
@@ -0,0 +1,190 @@
+/**
+ * struct __drm_i915_memory_region_info - Describes one region as known to the
+ * driver.
+ *
+ * Note this is using both struct drm_i915_query_item and struct drm_i915_query.
+ * For this new query we are adding the new query id DRM_I915_QUERY_MEMORY_REGIONS
+ * at _i915_query_item.query_id.
+ */
+struct __drm_i915_memory_region_info {
+    /** @region: The class:instance pair encoding */
+    struct drm_i915_gem_memory_class_instance region;
+
+    /** @rsvd0: MBZ */
+    __u32 rsvd0;
+
+    /** @probed_size: Memory probed by the driver (-1 = unknown) */
+    __u64 probed_size;
+
+    /** @unallocated_size: Estimate of memory remaining (-1 = unknown) */
+    __u64 unallocated_size;
+
+    union {
+    /** @rsvd1: MBZ */
+    __u64 rsvd1[8];
+    struct {
+    /**
+ * @probed_cpu_visible_size: Memory probed by the driver
+ * that is CPU accessible. (-1 = unknown).
+ *
+ * This will always be <= @probed_size, and the
+ * remainder (if there is any) will not be CPU
+ * accessible.
+ */
+    __u64 probed_cpu_visible_size;
+    };
+    };
+};
+
+/**
+ * struct __drm_i915_gem_create_ext - Existing gem_create behaviour, with added
+ * extension support using struct i915_user_extension.
+ *
+ * Note that new buffer flags should be added here, at least for the stuff that
+ * is immutable. Previously we would have two ioctls, one to create the object
+ * with gem_create, and another to apply various parameters, however this
+ * creates some ambiguity for the params which are considered immutable. Also in
+ * general we're phasing out the various SET/GET ioctls.
+ */
+struct __drm_i915_gem_create_ext {
+    /**
+ * @size: Requested size for the objec

Re: [PATCH v2] drm/doc: add rfc section for small BAR uapi

2022-04-27 Thread Christian König
Well, usually we increment the drm minor version when adding some new 
flags on amdgpu.


In addition to that, just one comment from our experience: you don't just 
need one flag, but two. The first one is a hint which says 
"CPU access needed" and the second is a promise which says "CPU access 
never needed".


The background is that for a whole bunch of buffers you can say with 100% 
certainty that you will never ever need CPU access.


Then there is a whole bunch of buffers where we might need CPU 
access, but can't tell for sure.


And last we have stuff like transfer buffers where you can be 100% sure that 
you need CPU access.


Separating it like this helped a lot with performance on small BAR systems.
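The three cases above amount to a tiny placement policy, which can be sketched as follows. The enum and function names are illustrative only, not the actual amdgpu or i915 uapi:

```c
/* Three-state CPU-access intent: a hint flag, a promise flag, and the
 * unflagged default in between (names invented for this sketch). */
enum cpu_access_hint {
    CPU_ACCESS_UNKNOWN, /* no flag set: might need CPU access */
    CPU_ACCESS_NEEDED,  /* hint: prefer the CPU-visible window */
    CPU_ACCESS_NEVER,   /* promise: never mapped, any VRAM is fine */
};

/* Reserve the small CPU-visible BAR window for buffers that asked for
 * it, falling back to any VRAM when the window is full. Buffers that
 * promised "never" go straight to non-visible memory. */
const char *pick_placement(enum cpu_access_hint hint, int visible_full)
{
    switch (hint) {
    case CPU_ACCESS_NEEDED:
        return visible_full ? "any-vram" : "cpu-visible-vram";
    case CPU_ACCESS_NEVER:
        return "non-visible-vram";
    default:
        return "any-vram";
    }
}
```

The point of the separation is visible in the policy: only the "needed" hint competes for the scarce visible window, so the promise flag keeps unrelated buffers from crowding it out.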

Regards,
Christian.

Am 27.04.22 um 08:48 schrieb Lionel Landwerlin:
One question though, how do we detect that this flag 
(I915_GEM_CREATE_EXT_FLAG_NEEDS_CPU_ACCESS) is accepted on a given 
kernel?
I assume older kernels are going to reject object creation if we use 
this flag?


I didn't plan to use __drm_i915_query_vma_info, but isn't it 
inconsistent to select the placement on the GEM object and then query 
whether it's mappable by address?
You made a comment stating this is racy, wouldn't querying on the GEM 
object prevent this?


Thanks,

-Lionel

On 27/04/2022 09:35, Lionel Landwerlin wrote:

Hi Matt,


The proposal looks good to me.

Looking forward to trying it out on drm-tip.


-Lionel

On 20/04/2022 20:13, Matthew Auld wrote:

Add an entry for the new uapi needed for small BAR on DG2+.

v2:
   - Some spelling fixes and other small tweaks. (Akeem & Thomas)
   - Rework error capture interactions, including no longer needing
 NEEDS_CPU_ACCESS for objects marked for capture. (Thomas)
   - Add probed_cpu_visible_size. (Lionel)

Signed-off-by: Matthew Auld 
Cc: Thomas Hellström 
Cc: Lionel Landwerlin 
Cc: Jon Bloomfield 
Cc: Daniel Vetter 
Cc: Jordan Justen 
Cc: Kenneth Graunke 
Cc: Akeem G Abodunrin 
Cc: mesa-dev@lists.freedesktop.org
---
  Documentation/gpu/rfc/i915_small_bar.h   | 190 +++
  Documentation/gpu/rfc/i915_small_bar.rst |  58 +++
  Documentation/gpu/rfc/index.rst  |   4 +
  3 files changed, 252 insertions(+)
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.h
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.rst

diff --git a/Documentation/gpu/rfc/i915_small_bar.h b/Documentation/gpu/rfc/i915_small_bar.h
new file mode 100644
index ..7bfd0cf44d35
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_small_bar.h
@@ -0,0 +1,190 @@
+/**
+ * struct __drm_i915_memory_region_info - Describes one region as known to the
+ * driver.
+ *
+ * Note this is using both struct drm_i915_query_item and struct drm_i915_query.
+ * For this new query we are adding the new query id DRM_I915_QUERY_MEMORY_REGIONS
+ * at _i915_query_item.query_id.
+ */
+struct __drm_i915_memory_region_info {
+    /** @region: The class:instance pair encoding */
+    struct drm_i915_gem_memory_class_instance region;
+
+    /** @rsvd0: MBZ */
+    __u32 rsvd0;
+
+    /** @probed_size: Memory probed by the driver (-1 = unknown) */
+    __u64 probed_size;
+
+    /** @unallocated_size: Estimate of memory remaining (-1 = unknown) */
+    __u64 unallocated_size;
+
+    union {
+    /** @rsvd1: MBZ */
+    __u64 rsvd1[8];
+    struct {
+    /**
+ * @probed_cpu_visible_size: Memory probed by the driver
+ * that is CPU accessible. (-1 = unknown).
+ *
+ * This will always be <= @probed_size, and the
+ * remainder (if there is any) will not be CPU
+ * accessible.
+ */
+    __u64 probed_cpu_visible_size;
+    };
+    };
+};
+
+/**
+ * struct __drm_i915_gem_create_ext - Existing gem_create behaviour, with added
+ * extension support using struct i915_user_extension.
+ *
+ * Note that new buffer flags should be added here, at least for the stuff that
+ * is immutable. Previously we would have two ioctls, one to create the object
+ * with gem_create, and another to apply various parameters, however this
+ * creates some ambiguity for the params which are considered immutable. Also in
+ * general we're phasing out the various SET/GET ioctls.
+ */
+struct __drm_i915_gem_create_ext {
+    /**
+ * @size: Requested size for the object.
+ *
+ * The (page-aligned) allocated size for the object will be returned.
+ *
+ * Note that for some devices we might have further minimum
+ * page-size restrictions (larger than 4K), like for device local-memory.
+ * However in general the final size here should always reflect any
+ * rounding up, if for example using the I915_GEM_CREATE_EXT_MEMORY_REGIONS
+ * extension to place the object in device local-memory.
+ */
+    __u64 size;
+    /**
+ * @handle: Returned handle for the object.
+ *
+ * Object handles are nonzero.
+ */
+    __u32 handle;
+    /**
+

Re: [Mesa-dev] Enabling Mesa video frontends on top of D3D12 gallium driver

2021-11-22 Thread Christian König

Hi guys,

Am 22.11.21 um 06:49 schrieb Dave Airlie:

On Thu, 18 Nov 2021 at 18:45, Sil Vilerino
 wrote:

Hello mesa-dev community,



We are starting to work on adding support for D3D12 Video acceleration in the 
mesa gallium D3D12 driver so that mesa frontends such as VA, VDPAU, etc can 
leverage HW acceleration in those environments.

To begin with we are working in a mesa fork project on a simple video decoder 
prototype that implements the pipe_video_codec and pipe_video_buffer interfaces 
with a D3D12 backend.


Welcome! To start, I'm not authoritative on this area of Mesa at all;
I've just started investigating vaapi and vulkan video decode.

I'm not really sure who understands all the ins/outs, I've cc'ed two
AMD contacts who have worked on this code.


Yeah, I haven't worked on that for a long time now, but I think I still 
have all the design pieces in my head.



Wayland/Software screen creation in VA_DRIVER_INIT_FUNC

In our d3d12 gallium driver, we rely on the EGL/GLX and the DRI frontend to 
handle the pure swrast screen creation, as our virtualized environment doesn’t 
have devices listed under /dev/dri/*.

The default for gstreamer/vainfo initialization code in WSL seems to be to 
create a Wayland display and pass it to VAInitialize. If we go ahead and create 
a pure software wayland d3d12_screen in VA_DRIVER_INIT_FUNC(VADriverContextP 
ctx), we hit "vaGetSurfaceBufferWl() is not implemented at VAAPI Gallium state 
tracker" (Mesa issue #587) when trying to run a simple gstreamer pipeline that 
decodes with VAAPI (and the d3d12 video backend) and presents to screen in a 
display sink.

From the discussion on that issue, it looks like “the change removing 
"NOT_IMPLEMENTED" is wayland display should be opened with DRM code path, and 
it's already implemented. the code here is not general switch to turn on the wayland 
support on vaapi, it's just one of the steps to complete that support and which has been 
implemented.”

Could you please provide some details on the high-level design of how wayland 
support for libva and gallium video drivers works?

Which of the currently implemented paths in mesa is recommended in 
general for video driver implementors: Wayland with a DRM fd device, or X DRI/DRI3?

What’d be a recommended way of supporting pure swrast screens like d3d12 for 
libva in VA_DRIVER_INIT_FUNC?


Well long story short: That simply won't work that easily.

As far as I know interop with Wayland only works by passing DMA-buf 
handles around. And the problem is now that software drivers can't 
create any DMA-buf handles because they don't have an underlying kernel 
driver.


This question already came up a couple of times now in different contexts.


Are there any objections to having a pure sw (no DRM devices) screen?

Another alternative we discussed was to enable VGEM in the WSL kernel and 
create a wayland swrast screen using kms_dri_create_winsys with the virtual 
DRM device FD, but that’d still require more work to support the wayland 
presentation-to-screen code paths (see next section below).

Recommended present path

Assuming we create a wayland screen (pure software or using VGEM + DRM) in 
VA_DRIVER_INIT_FUNC, once end_frame() finishes execution and we have the 
decoded image in pipe_video_buffer *target, what are the supported and 
recommended paths for presenting to screen?
Looks like both VDPAU (vlVdpPresentationQueueDisplay) and VAAPI 
(vlVaPutSurface) call texture_from_drawable() to pass the associated 
pipe_resource texture to flush_frontbuffer(), but the texture_from_drawable() 
function seems only implemented on X backends (vl_winsys_dri3/vl_winsys_dri.c) 
and not for vl_winsys_swrast/vl_winsys_drm. As it’s mentioned that there’s 
support for wayland on vaapi, we’d like to get some clarity on how the 
presentation path is designed here.

My expectation here is that making it operate like the GL paths is
probably what you want, and that if you are not using wayland there,
you shouldn't be doing it here. If you are there then adding support
for it here would be required.

I haven't investigated vaapi on wayland at all, and I probably need to
look into that soon, I'd expect adding proper support for presenting
would be a good feature to add there.

Currently at least using things like mplayer -hwdec=vaapi, the vaapi
code is just used to do the decode into dma-buf on Linux and then that
is handed off to a separate GL context for presentation.


Yeah, exactly that's the problem. The pure software driver somehow needs 
to get a DMA-buf file descriptor.


In theory you can maybe use memfd() as long as you don't import the file 
descriptor anywhere. But I'm not sure what a memfd() file descriptor does 
if it sees DMA-buf IOCTLs.
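The memfd() idea can be sketched like this; the function name is invented, and, as the caveat above says, the returned fd is an ordinary file descriptor that will not answer DMA-buf ioctls:

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/* Hand out an ordinary fd backed by anonymous memory, as a stand-in
 * for the DMA-buf a pure software renderer cannot create. */
int swrast_export_fd(size_t size)
{
    int fd = memfd_create("swrast-surface", MFD_CLOEXEC);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, size) < 0) { /* size the backing memory */
        close(fd);
        return -1;
    }
    return fd;
}
```

The fd can be mmap'd and passed over a socket like any other, which is why it only works as long as the receiver never actually imports it as a DMA-buf.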


To summarize, using VGEM for that sounds like the easiest-to-implement 
approach to me as well.


Regards,
Christian.



Dave.




Re: [Mesa-dev] llvmpipe not supporting EGL_EXT_image_dma_buf_import ?

2021-10-29 Thread Christian König

Am 29.10.21 um 11:35 schrieb Michel Dänzer:

On 2021-10-28 14:02, Christian König wrote:

Well I'm not an expert on llvmpipe, but as far as I know that's a general 
problem.

DMA-buf is used by the Linux kernel drivers to pass hardware buffers between 
processes and drivers.

Since llvmpipe is a software renderer, it has no kernel driver, so there is no 
easy way to implement that.

Even if there was, CPU reads from dma-bufs imported from a HW driver may be 
extremely slow, so they should be avoided other than for specific setups where 
they're known to be guaranteed to perform adequately. (This performance trap is 
why I think allowing mmap for dma-buf fds was a mistake)


Yeah, I'm not very keen of that either.

But for testing I think it would still be nice to have the ability to 
share DMA-bufs with software renderers even if it is horribly slow.


Christian.


Re: [Mesa-dev] llvmpipe not supporting EGL_EXT_image_dma_buf_import ?

2021-10-28 Thread Christian König
Well I'm not an expert on llvmpipe, but as far as I know that's a 
general problem.


DMA-buf is used by the Linux kernel drivers to pass hardware buffers 
between processes and drivers.


Since llvmpipe is a software renderer, it has no kernel driver, so there 
is no easy way to implement that.


What could work is to use dma-buf-heaps as a general allocation interface 
for llvmpipe or other software renderers.


Regards,
Christian.

Am 28.10.21 um 13:30 schrieb Irion, Alexander:

Hello,
I would like to use the zwp_linux_dmabuf interface of Weston which requires the 
EGL_EXT_image_dma_buf_import extension. Apparently with llvmpipe renderer (LLVM 
12.0.0, 256 bits) this extension is not enumerated.
Does llvmpipe currently not support EGL_EXT_image_dma_buf_import in general?
Kind regards,
Alexander Irion

-
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
München, HRB 106955




Re: [Mesa-dev] Question about display_gpu and render gpu in dri3_create_screen

2021-09-14 Thread Christian König

Mhm, that looks like something I should look into.

How exactly are you triggering this and which kernel version are you using?

Thanks,
Christian.

Am 14.09.21 um 03:47 schrieb Luc Ma:

Yes, I did. It crashed the kernel as follows

Sep 14 09:12:34 kernel: [  506.676925][ 4] [ T1045] Hardware name: 
WEIBU F20A8/F20A8, BIOS 0.1.3 2021-07-06_20:59:09
Sep 14 09:12:34 kernel: [  506.685091][ 4] [ T1045] pstate: 8005 
(Nzcv daif -PAN -UAO)

Sep 14 09:12:34 kernel: [  506.691001][ 4] [ T1045] pc : 0x0
Sep 14 09:12:34 kernel: [  506.694341][ 4] [ T1045] lr : 
drm_gem_map_dma_buf+0xc0/0x118 [drm]

Sep 14 09:12:34 kernel: [  506.700509][ 4] [ T1045] sp : ffa0f4993b10
Sep 14 09:12:34 kernel: [  506.704941][ 4] [ T1045] x29: 
ffa0f4993b10 x28: ffa0f4993cd8
Sep 14 09:12:34 kernel: [  506.711370][ 4] [ T1045] x27: 
000c x26: ffa0f33e5200
Sep 14 09:12:34 kernel: [  506.717800][ 4] [ T1045] x25: 
ffa0e7452630 x24: ffa0f5a9f000
Sep 14 09:12:34 kernel: [  506.724229][ 4] [ T1045] x23: 
ffa0f33e5328 x22: ffa0e820f180
Sep 14 09:12:34 kernel: [  506.730658][ 4] [ T1045] x21: 
ffa0e820f180 x20: 
Sep 14 09:12:34 kernel: [  506.737087][ 4] [ T1045] x19: 
ffa0e820f180 x18: 
Sep 14 09:12:34 kernel: [  506.743516][ 4] [ T1045] x17: 
 x16: 
Sep 14 09:12:34 kernel: [  506.749945][ 4] [ T1045] x15: 
 x14: 
Sep 14 09:12:34 kernel: [  506.756373][ 4] [ T1045] x13: 
 x12: 
Sep 14 09:12:34 kernel: [  506.762802][ 4] [ T1045] x11: 
 x10: ffa0f43d1938
Sep 14 09:12:34 kernel: [  506.769231][ 4] [ T1045] x9 : 
 x8 : ffa0e820f200
Sep 14 09:12:34 kernel: [  506.775659][ 4] [ T1045] x7 : 
 x6 : 003f
Sep 14 09:12:34 kernel: [  506.782088][ 4] [ T1045] x5 : 
0040 x4 : 
Sep 14 09:12:34 kernel: [  506.788517][ 4] [ T1045] x3 : 
 x2 : ffc008d553b0
Sep 14 09:12:34 kernel: [  506.794945][ 4] [ T1045] x1 : 
 x0 : ffa0f311b000

Sep 14 09:12:34 kernel: [  506.801375][ 4] [ T1045] Call trace:
Sep 14 09:12:34 kernel: [  506.804938][ 4] [ T1045]  0x0
Sep 14 09:12:34 kernel: [  506.807899][ 4] [ T1045] 
 dma_buf_map_attachment+0x60/0xb0
Sep 14 09:12:34 kernel: [  506.813398][ 4] [ T1045] 
 drm_gem_prime_import_dev+0x7c/0x138 [drm]
Sep 14 09:12:34 kernel: [  506.819675][ 4] [ T1045] 
 drm_gem_prime_fd_to_handle+0x1b4/0x1d8 [drm]
Sep 14 09:12:34 kernel: [  506.826213][ 4] [ T1045] 
 drm_prime_fd_to_handle_ioctl+0x24/0x38 [drm]
Sep 14 09:12:34 kernel: [  506.832750][ 4] [ T1045] 
 drm_ioctl_kernel+0x84/0xd8 [drm]
Sep 14 09:12:34 kernel: [  506.838245][ 4] [ T1045] 
 drm_ioctl+0x218/0x408 [drm]
Sep 14 09:12:34 kernel: [  506.843326][ 4] [ T1045] 
 radeon_drm_ioctl+0x50/0x88 [radeon]
Sep 14 09:12:34 kernel: [  506.849063][ 4] [ T1045] 
 do_vfs_ioctl+0x394/0x7e8

Sep 14 09:12:34 kernel: [  506.853843][ 4] [ T1045]  ksys_ioctl+0x78/0xa8
Sep 14 09:12:34 kernel: [  506.858276][ 4] [ T1045] 
 __arm64_sys_ioctl+0x1c/0x28
Sep 14 09:12:34 kernel: [  506.863318][ 4] [ T1045] 
 el0_svc_common.constprop.0+0x68/0x168
Sep 14 09:12:34 kernel: [  506.869226][ 4] [ T1045] 
 el0_svc_handler+0x8c/0x98

Sep 14 09:12:34 kernel: [  506.874093][ 4] [ T1045]  el0_svc+0x8/0xc
Sep 14 09:12:34 kernel: [  506.878093][ 4] [ T1045] Code: bad PC value
Sep 14 09:12:34 kernel: [  506.882265][ 4] [ T1045] ---[ end trace 
3888e65eac0454cc ]---


I'm not familiar with the kernel, but I suspect it may be a kmd 
problem. It seems like something is missing.


On Mon, 13 Sept 2021 at 20:57, Michel Dänzer wrote:


On 2021-09-13 14:40, Luc Ma wrote:
> Hello,
>
> I recently tried multi-GPU support on the mesa gallium
drivers(glx=dri). When I exported the env `DRI_PRIME=1`, I found
that it didn't work with two different drivers loaded at the same
time.
> because there are different driver names in my case. display_gpu
driver name is "r600" while render_gpu driver name is "nouveau".
It failed to create display gpu screen
>
> if (strcmp(driverName, driverNameDisplayGPU) == 0) {
>             psc->driScreenDisplayGPU =
>  psc->image_driver->createNewScreen2(screen, psc->fd_display_gpu,
>  pdp->loader_extensions,
>  extensions,
>  _configs, psc);
> }
>
> so I am wondering
>
> - is it possible to use two GPUs from different vendors on a
system with gallium? one is for display, another for rendering
> - is it possible to use two GPUs driven by a shared driver(say
both "r600") on a system?

Both should work.


Did you hit a problem other than psc->driScreenDisplayGPU related
code being skipped (as is expected ATM with different drivers)?


-- 
Earthling Michel Dänzer               | https://redhat.com


Libre software 

Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-23 Thread Christian König

Am 23.06.21 um 17:12 schrieb Daniel Vetter:

On Wed, Jun 23, 2021 at 05:07:17PM +0200, Christian König wrote:

Am 23.06.21 um 17:03 schrieb Daniel Vetter:

On Wed, Jun 23, 2021 at 04:58:27PM +0200, Bas Nieuwenhuizen wrote:

On Wed, Jun 23, 2021 at 4:50 PM Daniel Vetter  wrote:

On Wed, Jun 23, 2021 at 4:02 PM Christian König
 wrote:

Am 23.06.21 um 15:49 schrieb Daniel Vetter:

On Wed, Jun 23, 2021 at 3:44 PM Christian König
 wrote:

Am 23.06.21 um 15:38 schrieb Bas Nieuwenhuizen:

On Wed, Jun 23, 2021 at 2:59 PM Christian König
 wrote:

Am 23.06.21 um 14:18 schrieb Daniel Vetter:

On Wed, Jun 23, 2021 at 11:45 AM Bas Nieuwenhuizen
 wrote:

On Tue, Jun 22, 2021 at 6:55 PM Daniel Vetter  wrote:

WARNING: Absolutely untested beyond "gcc isn't dying in agony".

Implicit fencing done properly needs to treat the implicit fencing
slots like a funny kind of IPC mailbox. In other words it needs to be
explicit. This is the only way it will mesh well with explicit
fencing userspace like vk, and it's also the bare minimum required to
be able to manage anything else that wants to use the same buffer on
multiple engines in parallel, and still be able to share it through
implicit sync.

amdgpu completely lacks such an uapi. Fix this.

Luckily the concept of ignoring implicit fences exists already, and
takes care of all the complexities of making sure that non-optional
fences (like bo moves) are not ignored. This support was added in

commit 177ae09b5d699a5ebd1cafcee78889db968abf54
Author: Andres Rodriguez 
Date:   Fri Sep 15 20:44:06 2017 -0400

 drm/amdgpu: introduce AMDGPU_GEM_CREATE_EXPLICIT_SYNC v2

Unfortunately it's the wrong semantics, because it's a bo flag and
disables implicit sync on an allocated buffer completely.

We _do_ want implicit sync, but control it explicitly. For this we
need a flag on the drm_file, so that a given userspace (like vulkan)
can manage the implicit sync slots explicitly. The other side of the
pipeline (compositor, other process or just different stage in a media
pipeline in the same process) can then either do the same, or fully
participate in the implicit sync as implemented by the kernel by
default.

By building on the existing flag for buffers we avoid any issues with
opening up additional security concerns - anything this new flag here
allows is already.

All drivers which support this concept of a userspace-specific
opt-out of implicit sync have a flag in their CS ioctl, but in reality
that turned out to be a bit too inflexible. See the discussion below,
let's try to do a bit better for amdgpu.

This alone only allows us to completely avoid any stalls due to
implicit sync, it does not yet allow us to use implicit sync as a
strange form of IPC for sync_file.

For that we need two more pieces:

- a way to get the current implicit sync fences out of a buffer. Could
   be done in a driver ioctl, but everyone needs this, and generally a
   dma-buf is involved anyway to establish the sharing. So an ioctl on
   the dma-buf makes a ton more sense:

   
https://lore.kernel.org/dri-devel/20210520190007.534046-4-jason@jlekstrand.net/

   Current drivers in upstream solves this by having the opt-out flag
   on their CS ioctl. This has the downside that very often the CS
   which must actually stall for the implicit fence is run a while
   after the implicit fence point was logically sampled per the api
   spec (vk passes an explicit syncobj around for that afaiui), and so
   results in oversync. Converting the implicit sync fences into a
   snap-shot sync_file is actually accurate.

- Similarly we need to be able to set the exclusive implicit fence.
   Current drivers again do this with a CS ioctl flag, with again the
   same problems that the time the CS happens additional dependencies
   have been added. An explicit ioctl to only insert a sync_file (while
   respecting the rules for how exclusive and shared fence slots must
   be update in struct dma_resv) is much better. This is proposed here:

   
https://lore.kernel.org/dri-devel/20210520190007.534046-5-jason@jlekstrand.net/

These three pieces together allow userspace to fully control implicit
fencing and remove all uneces

Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-23 Thread Christian König

Am 23.06.21 um 17:03 schrieb Daniel Vetter:

On Wed, Jun 23, 2021 at 04:58:27PM +0200, Bas Nieuwenhuizen wrote:

On Wed, Jun 23, 2021 at 4:50 PM Daniel Vetter  wrote:

On Wed, Jun 23, 2021 at 4:02 PM Christian König
 wrote:

Am 23.06.21 um 15:49 schrieb Daniel Vetter:

On Wed, Jun 23, 2021 at 3:44 PM Christian König
 wrote:

Am 23.06.21 um 15:38 schrieb Bas Nieuwenhuizen:

On Wed, Jun 23, 2021 at 2:59 PM Christian König
 wrote:

Am 23.06.21 um 14:18 schrieb Daniel Vetter:

On Wed, Jun 23, 2021 at 11:45 AM Bas Nieuwenhuizen
 wrote:

On Tue, Jun 22, 2021 at 6:55 PM Daniel Vetter  wrote:

WARNING: Absolutely untested beyond "gcc isn't dying in agony".

Implicit fencing done properly needs to treat the implicit fencing
slots like a funny kind of IPC mailbox. In other words it needs to be
explicit. This is the only way it will mesh well with explicit
fencing userspace like vk, and it's also the bare minimum required to
be able to manage anything else that wants to use the same buffer on
multiple engines in parallel, and still be able to share it through
implicit sync.

amdgpu completely lacks such an uapi. Fix this.

Luckily the concept of ignoring implicit fences exists already, and
takes care of all the complexities of making sure that non-optional
fences (like bo moves) are not ignored. This support was added in

commit 177ae09b5d699a5ebd1cafcee78889db968abf54
Author: Andres Rodriguez 
Date:   Fri Sep 15 20:44:06 2017 -0400

drm/amdgpu: introduce AMDGPU_GEM_CREATE_EXPLICIT_SYNC v2

Unfortunately it's the wrong semantics, because it's a bo flag and
disables implicit sync on an allocated buffer completely.

We _do_ want implicit sync, but control it explicitly. For this we
need a flag on the drm_file, so that a given userspace (like vulkan)
can manage the implicit sync slots explicitly. The other side of the
pipeline (compositor, other process or just different stage in a media
pipeline in the same process) can then either do the same, or fully
participate in the implicit sync as implemented by the kernel by
default.

By building on the existing flag for buffers we avoid any issues with
opening up additional security concerns - anything this new flag here
allows is already.

All drivers which support this concept of a userspace-specific
opt-out of implicit sync have a flag in their CS ioctl, but in reality
that turned out to be a bit too inflexible. See the discussion below,
let's try to do a bit better for amdgpu.

This alone only allows us to completely avoid any stalls due to
implicit sync, it does not yet allow us to use implicit sync as a
strange form of IPC for sync_file.

For that we need two more pieces:

- a way to get the current implicit sync fences out of a buffer. Could
  be done in a driver ioctl, but everyone needs this, and generally a
  dma-buf is involved anyway to establish the sharing. So an ioctl on
  the dma-buf makes a ton more sense:

  
https://lore.kernel.org/dri-devel/20210520190007.534046-4-jason@jlekstrand.net/

  Current drivers in upstream solve this by having the opt-out flag
  on their CS ioctl. This has the downside that very often the CS
  which must actually stall for the implicit fence is run a while
  after the implicit fence point was logically sampled per the api
  spec (vk passes an explicit syncobj around for that afaiui), and so
  results in oversync. Converting the implicit sync fences into a
  snap-shot sync_file is actually accurate.

- Similarly we need to be able to set the exclusive implicit fence.
  Current drivers again do this with a CS ioctl flag, with again the
  same problems that the time the CS happens additional dependencies
  have been added. An explicit ioctl to only insert a sync_file (while
  respecting the rules for how exclusive and shared fence slots must
  be updated in struct dma_resv) is much better. This is proposed here:

  
https://lore.kernel.org/dri-devel/20210520190007.534046-5-jason@jlekstrand.net/

These three pieces together allow userspace to fully control implicit
fencing and remove all unnecessary stall points due to them.

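The export/import ioctl pair described above could be driven from userspace roughly as sketched below. This is only an illustration of the proposal under discussion: the struct layout, flag values and ioctl numbers defined locally here are assumptions, not final uapi.

```c
/* Sketch of driving the proposed dma-buf sync_file ioctls from
 * userspace.  The struct and ioctl numbers are assumptions based on
 * the patch series under discussion, not settled kernel uapi. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

struct dma_buf_sync_file {
	uint32_t flags;	/* DMA_BUF_SYNC_READ and/or DMA_BUF_SYNC_WRITE */
	int32_t fd;	/* out: sync_file fd on export, in: fd on import */
};

#define DMA_BUF_SYNC_READ	(1 << 0)
#define DMA_BUF_SYNC_WRITE	(2 << 0)
#define DMA_BUF_BASE		'b'
#define DMA_BUF_IOCTL_EXPORT_SYNC_FILE \
	_IOWR(DMA_BUF_BASE, 2, struct dma_buf_sync_file)
#define DMA_BUF_IOCTL_IMPORT_SYNC_FILE \
	_IOW(DMA_BUF_BASE, 3, struct dma_buf_sync_file)

/* Snapshot the buffer's current implicit fences into a sync_file fd. */
static int export_implicit_fences(int dmabuf_fd, uint32_t flags)
{
	struct dma_buf_sync_file arg = { .flags = flags, .fd = -1 };

	if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg) < 0)
		return -1;
	return arg.fd;
}

/* Insert an explicit sync_file into the buffer's implicit fences. */
static int import_sync_file(int dmabuf_fd, int sync_file_fd)
{
	struct dma_buf_sync_file arg = {
		.flags = DMA_BUF_SYNC_WRITE, .fd = sync_file_fd,
	};

	return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &arg);
}
```

A compositor would call the export helper right after a client's CS completes its submission, turning the buffer's implicit fences into an explicit sync_file it can wait on or pass along.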
Well, as much as the implicit fencing model fundamentally allows:
There is only one set of fences, you can only choose to sync against
only writers (exclusive slot), or everyone. Hence suballocating
multiple buffers or anything else like this is fundamentally not
possible, and can only be fixed by a proper explicit fencing model.

Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-23 Thread Christian König

Am 23.06.21 um 15:49 schrieb Daniel Vetter:

On Wed, Jun 23, 2021 at 3:44 PM Christian König
 wrote:

Am 23.06.21 um 15:38 schrieb Bas Nieuwenhuizen:

On Wed, Jun 23, 2021 at 2:59 PM Christian König
 wrote:

Am 23.06.21 um 14:18 schrieb Daniel Vetter:

On Wed, Jun 23, 2021 at 11:45 AM Bas Nieuwenhuizen
 wrote:

On Tue, Jun 22, 2021 at 6:55 PM Daniel Vetter  wrote:

WARNING: Absolutely untested beyond "gcc isn't dying in agony".

Implicit fencing done properly needs to treat the implicit fencing
slots like a funny kind of IPC mailbox. In other words it needs to be
managed explicitly. This is the only way it will mesh well with explicit
fencing userspace like vk, and it's also the bare minimum required to
be able to manage anything else that wants to use the same buffer on
multiple engines in parallel, and still be able to share it through
implicit sync.

amdgpu completely lacks such an uapi. Fix this.

Luckily the concept of ignoring implicit fences exists already, and
takes care of all the complexities of making sure that non-optional
fences (like bo moves) are not ignored. This support was added in

commit 177ae09b5d699a5ebd1cafcee78889db968abf54
Author: Andres Rodriguez 
Date:   Fri Sep 15 20:44:06 2017 -0400

   drm/amdgpu: introduce AMDGPU_GEM_CREATE_EXPLICIT_SYNC v2

Unfortunately it's the wrong semantics, because it's a bo flag and
disables implicit sync on an allocated buffer completely.

We _do_ want implicit sync, but control it explicitly. For this we
need a flag on the drm_file, so that a given userspace (like vulkan)
can manage the implicit sync slots explicitly. The other side of the
pipeline (compositor, other process or just different stage in a media
pipeline in the same process) can then either do the same, or fully
participate in the implicit sync as implemented by the kernel by
default.

By building on the existing flag for buffers we avoid any issues with
opening up additional security concerns - anything this new flag here
allows is already.

All drivers which support this concept of a userspace-specific
opt-out of implicit sync have a flag in their CS ioctl, but in reality
that turned out to be a bit too inflexible. See the discussion below,
let's try to do a bit better for amdgpu.

This alone only allows us to completely avoid any stalls due to
implicit sync, it does not yet allow us to use implicit sync as a
strange form of IPC for sync_file.

For that we need two more pieces:

- a way to get the current implicit sync fences out of a buffer. Could
 be done in a driver ioctl, but everyone needs this, and generally a
 dma-buf is involved anyway to establish the sharing. So an ioctl on
 the dma-buf makes a ton more sense:

 
https://lore.kernel.org/dri-devel/20210520190007.534046-4-jason@jlekstrand.net/

 Current drivers in upstream solve this by having the opt-out flag
 on their CS ioctl. This has the downside that very often the CS
 which must actually stall for the implicit fence is run a while
 after the implicit fence point was logically sampled per the api
 spec (vk passes an explicit syncobj around for that afaiui), and so
 results in oversync. Converting the implicit sync fences into a
 snap-shot sync_file is actually accurate.

- Similarly we need to be able to set the exclusive implicit fence.
 Current drivers again do this with a CS ioctl flag, with again the
 same problems that the time the CS happens additional dependencies
 have been added. An explicit ioctl to only insert a sync_file (while
 respecting the rules for how exclusive and shared fence slots must
 be updated in struct dma_resv) is much better. This is proposed here:

 
https://lore.kernel.org/dri-devel/20210520190007.534046-5-jason@jlekstrand.net/

These three pieces together allow userspace to fully control implicit
fencing and remove all unnecessary stall points due to them.

Well, as much as the implicit fencing model fundamentally allows:
There is only one set of fences, you can only choose to sync against
only writers (exclusive slot), or everyone. Hence suballocating
multiple buffers or anything else like this is fundamentally not
possible, and can only be fixed by a proper explicit fencing model.

Aside from that caveat this model gets implicit fencing as closely to
explicit fencing semantics as possible.

Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-23 Thread Christian König

Am 23.06.21 um 15:38 schrieb Bas Nieuwenhuizen:

On Wed, Jun 23, 2021 at 2:59 PM Christian König
 wrote:

Am 23.06.21 um 14:18 schrieb Daniel Vetter:

On Wed, Jun 23, 2021 at 11:45 AM Bas Nieuwenhuizen
 wrote:

On Tue, Jun 22, 2021 at 6:55 PM Daniel Vetter  wrote:

WARNING: Absolutely untested beyond "gcc isn't dying in agony".

Implicit fencing done properly needs to treat the implicit fencing
slots like a funny kind of IPC mailbox. In other words it needs to be
managed explicitly. This is the only way it will mesh well with explicit
fencing userspace like vk, and it's also the bare minimum required to
be able to manage anything else that wants to use the same buffer on
multiple engines in parallel, and still be able to share it through
implicit sync.

amdgpu completely lacks such an uapi. Fix this.

Luckily the concept of ignoring implicit fences exists already, and
takes care of all the complexities of making sure that non-optional
fences (like bo moves) are not ignored. This support was added in

commit 177ae09b5d699a5ebd1cafcee78889db968abf54
Author: Andres Rodriguez 
Date:   Fri Sep 15 20:44:06 2017 -0400

  drm/amdgpu: introduce AMDGPU_GEM_CREATE_EXPLICIT_SYNC v2

Unfortunately it's the wrong semantics, because it's a bo flag and
disables implicit sync on an allocated buffer completely.

We _do_ want implicit sync, but control it explicitly. For this we
need a flag on the drm_file, so that a given userspace (like vulkan)
can manage the implicit sync slots explicitly. The other side of the
pipeline (compositor, other process or just different stage in a media
pipeline in the same process) can then either do the same, or fully
participate in the implicit sync as implemented by the kernel by
default.

By building on the existing flag for buffers we avoid any issues with
opening up additional security concerns - anything this new flag here
allows is already.

All drivers which support this concept of a userspace-specific
opt-out of implicit sync have a flag in their CS ioctl, but in reality
that turned out to be a bit too inflexible. See the discussion below,
let's try to do a bit better for amdgpu.

This alone only allows us to completely avoid any stalls due to
implicit sync, it does not yet allow us to use implicit sync as a
strange form of IPC for sync_file.

For that we need two more pieces:

- a way to get the current implicit sync fences out of a buffer. Could
be done in a driver ioctl, but everyone needs this, and generally a
dma-buf is involved anyway to establish the sharing. So an ioctl on
the dma-buf makes a ton more sense:


https://lore.kernel.org/dri-devel/20210520190007.534046-4-jason@jlekstrand.net/

Current drivers in upstream solve this by having the opt-out flag
on their CS ioctl. This has the downside that very often the CS
which must actually stall for the implicit fence is run a while
after the implicit fence point was logically sampled per the api
spec (vk passes an explicit syncobj around for that afaiui), and so
results in oversync. Converting the implicit sync fences into a
snap-shot sync_file is actually accurate.

- Similarly we need to be able to set the exclusive implicit fence.
Current drivers again do this with a CS ioctl flag, with again the
same problems that the time the CS happens additional dependencies
have been added. An explicit ioctl to only insert a sync_file (while
respecting the rules for how exclusive and shared fence slots must
be updated in struct dma_resv) is much better. This is proposed here:


https://lore.kernel.org/dri-devel/20210520190007.534046-5-jason@jlekstrand.net/

These three pieces together allow userspace to fully control implicit
fencing and remove all unnecessary stall points due to them.

Well, as much as the implicit fencing model fundamentally allows:
There is only one set of fences, you can only choose to sync against
only writers (exclusive slot), or everyone. Hence suballocating
multiple buffers or anything else like this is fundamentally not
possible, and can only be fixed by a proper explicit fencing model.

Aside from that caveat this model gets implicit fencing as closely to
explicit fencing semantics as possible:

On 

Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-23 Thread Christian König
 nice
flag parameter in the VM ioctl which we could use, except:
- it's not checked, so userspace likely passes garbage
- there's already a comment that userspace _does_ pass garbage in the
   priority field
So yeah unfortunately this flag parameter for setting vm flags is
useless, and we need to hack up a new one.

v2: Explain why a new SETPARAM (Jason)

v3: Bas noticed I forgot to hook up the dependency-side shortcut. We
need both, or this doesn't do much.

v4: Rebase over the amdgpu patch to always set the implicit sync
fences.

So I think there is still a case missing in this implementation.
Consider these 3 cases

(format: a->b: b waits on a. Yes, I know arrows are hard)

explicit->explicit: This doesn't wait now, which is good
Implicit->explicit: This doesn't wait now, which is good
explicit->implicit : This still waits as the explicit submission still
adds shared fences and most things that set an exclusive fence for
implicit sync will hence wait on it.
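The three cases above can be captured in a tiny model: with the proposed flag, whether the consumer waits depends only on whether the consumer itself uses implicit sync, because even an explicit producer still adds its fences to the dma_resv object. This is purely illustrative; names and structure are made up, not from the patch.

```c
/* Toy model of the wait behaviour described in the three cases:
 * a consumer syncs against the buffer's implicit fences iff it
 * itself uses implicit sync; the producer's mode doesn't matter,
 * since even explicit submissions still publish shared fences. */
struct submission {
	int implicit_sync;	/* 1 = implicit sync, 0 = explicit (flag set) */
};

static int consumer_waits(const struct submission *producer,
			  const struct submission *consumer)
{
	(void)producer;	/* producer always adds its fences to the resv */
	return consumer->implicit_sync;
}
```

This makes the asymmetry explicit: explicit->implicit still stalls, which is exactly the behaviour flagged above as good enough for radv today but risky to bake into uapi.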

This is probably good enough for what radv needs now but also sounds
like a risk wrt baking in new uapi behavior that we don't want to be
the end result.

Within AMDGPU this is probably solvable in two ways:

1) Downgrade AMDGPU_SYNC_NE_OWNER to AMDGPU_SYNC_EXPLICIT for shared fences.

I'm not sure that works. I think the right fix is that radeonsi also
switches to this model, with maybe a per-bo CS flag to set indicate
write access, to cut down on the number of ioctls that are needed
otherwise on shared buffers. This per-bo flag would essentially select
between SYNC_NE_OWNER and SYNC_EXPLICIT on a per-buffer basis.


Yeah, but I'm still not entirely sure why that approach isn't sufficient?

Problem with the per context or per vm flag is that you then don't get 
any implicit synchronization any more when another process starts using 
the buffer.



The current amdgpu uapi just doesn't allow any other model without an
explicit opt-in. So current implicit sync userspace just has to
oversync, there's not much choice.


2) Have an EXPLICIT fence owner that is used for explicit submissions
that is ignored by AMDGPU_SYNC_NE_OWNER.

But this doesn't solve cross-driver interactions here.

Yeah cross-driver is still entirely unsolved, because
amdgpu_bo_explicit_sync() on the bo didn't solve that either.


Hui? You have lost me. Why is that still unsolved?

Regards,
Christian.


-Daniel


Cc: mesa-dev@lists.freedesktop.org
Cc: Bas Nieuwenhuizen 
Cc: Dave Airlie 
Cc: Rob Clark 
Cc: Kristian H. Kristensen 
Cc: Michel Dänzer 
Cc: Daniel Stone 
Cc: Sumit Semwal 
Cc: "Christian König" 
Cc: Alex Deucher 
Cc: Daniel Vetter 
Cc: Deepak R Varma 
Cc: Chen Li 
Cc: Kevin Wang 
Cc: Dennis Li 
Cc: Luben Tuikov 
Cc: linaro-mm-...@lists.linaro.org
Signed-off-by: Daniel Vetter 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  |  7 +--
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 21 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h  |  6 ++
  include/uapi/drm/amdgpu_drm.h   | 10 ++
  4 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 65df34c17264..c5386d13eb4a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -498,6 +498,7 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
 struct amdgpu_bo *gds;
 struct amdgpu_bo *gws;
 struct amdgpu_bo *oa;
+   bool no_implicit_sync = READ_ONCE(fpriv->vm.no_implicit_sync);
 int r;

 INIT_LIST_HEAD(>validated);
@@ -577,7 +578,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,

 e->bo_va = amdgpu_vm_bo_find(vm, bo);

-   if (bo->tbo.base.dma_buf && !amdgpu_bo_explicit_sync(bo)) {
+   if (bo->tbo.base.dma_buf &&
+   !(no_implicit_sync || amdgpu_bo_explicit_sync(bo))) {
 e->chain = dma_fence_chain_alloc();
 if (!e->chain) {
 r = -ENOMEM;
@@ -649,6 +651,7 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
  {
 struct amdgpu_fpriv *fpriv = p->filp->driver_priv;
 struct amdgpu_bo_list_entry *e;
+   bool no_implicit_sync = READ_ONCE(fpriv->vm.no_implicit_sync);
 int r;

 list_for_each_entry(e, >validated, tv.head) {
@@ -656,7 +659,7 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
 struct dma_resv *resv = bo->tbo.base.resv;
 enum amdgpu_sync_mode sync_mode;

-   sync_mode = amdgpu_bo_explicit_sync(bo) ?
+   sync_mode = no_implicit_sync || amdgpu_bo_explicit_sync(bo) ?
 AMDGPU_SYNC_EXPLICIT : AMDGPU_SYNC_NE_OWNER;
 r = amdgpu_sync_resv(p->adev, >job->sync, resv, sync_mode,
  >vm);
diff --git a/drive

Re: [Mesa-dev] [PATCH 03/15] dma-buf: Document dma-buf implicit fencing/resv fencing rules

2021-06-23 Thread Christian König
o msm_gem_sync_object(). Investing into
   a scheduler might be a good idea.

- all the remaining drivers are ttm based, where I hope they do
   appropriately obey implicit fences already. I didn't do the full
   audit there because a) not following the contract would confuse ttm
   quite well and b) reading non-standard scheduler and submit code
   which isn't based on drm/scheduler is a pain.

Onwards to the display side.

- Any driver using the drm_gem_plane_helper_prepare_fb() helper will
   handle this correctly. Overwhelmingly most drivers get this right, except a few
   totally don't. I'll follow up with a patch to make this the default
   and avoid a bunch of bugs.

- I didn't audit the ttm drivers, but given that dma_resv started
   there I hope they get this right.

In conclusion this IS the contract, both as documented and
overwhelmingly implemented, specifically as implemented by all render
drivers except amdgpu.

Amdgpu tried to fix this already in

commit 049aca4363d8af87cab8d53de5401602db3b
Author: Christian König 
Date:   Wed Sep 19 16:54:35 2018 +0200

 drm/amdgpu: fix using shared fence for exported BOs v2

but this fix falls short on a number of areas:

- It's racy, by the time the buffer is shared it might be too late. To
   make sure there's definitely never a problem we need to set the
   fences correctly for any buffer that's potentially exportable.

- It's breaking uapi, dma-buf fds support poll() and differentiate
   between, which was introduced in

commit 9b495a5887994a6d74d5c261d012083a92b94738
Author: Maarten Lankhorst 
Date:   Tue Jul 1 12:57:43 2014 +0200

    dma-buf: add poll support, v3

- Christian König wants to nack new uapi building further on this
   dma_resv contract because it breaks amdgpu, quoting

   "Yeah, and that is exactly the reason why I will NAK this uAPI change.

   "This doesn't works for amdgpu at all for the reasons outlined above."

   
https://lore.kernel.org/dri-devel/f2eb6751-2f82-9b23-f57e-548de5b729de@gmail.com/

   Rejecting new development because your own driver is broken and
   violates established cross driver contracts and uapi is really not
   how upstream works.

Now this patch will have a severe performance impact on anything that
runs on multiple engines. So we can't just merge it outright, but need
a bit a plan:

- amdgpu needs a proper uapi for handling implicit fencing. The funny
   thing is that to do it correctly, implicit fencing must be treated
   as a very strange IPC mechanism for transporting fences, where both
   setting the fence and dependency intercepts must be handled
   explicitly. Current best practices is a per-bo flag to indicate
writes, and a per-bo flag to skip implicit fencing in the CS
   ioctl as a new chunk.

- Since amdgpu has been shipping with broken behaviour we need an
   opt-out flag from the butchered implicit fencing model to enable the
   proper explicit implicit fencing model.

- for kernel memory fences due to bo moves at least the i915 idea is
   to use ttm_bo->moving. amdgpu probably needs the same.

- since the current p2p dma-buf interface assumes the kernel memory
   fence is in the exclusive dma_resv fence slot we need to add a new
   fence slot for kernel fences, which must never be ignored. Since
   currently only amdgpu supports this there's no real problem here
   yet, until amdgpu gains a NO_IMPLICIT CS flag.

- New userspace needs to ship in enough desktop distros so that users
   won't notice the perf impact. I think we can ignore LTS distros who
   upgrade their kernels but not their mesa3d snapshot.

- Then when this is all in place we can merge this patch here.

What is not a solution to this problem here is trying to make the
dma_resv rules in the kernel more clever. The fundamental issue here
is that the amdgpu CS uapi is the least expressive one across all
drivers (only equalled by panfrost, which has an actual excuse) by not
allowing any userspace control over how implicit sync is conducted.

Until this is fixed it's completely pointless to make the kernel more
clever to improve amdgpu, because all we're doing is papering over
this uapi design issue. amdgpu needs to attain the status quo
established by other drivers first, once that's achieved we can tackle
the remaining issues in a consistent way across drivers.

v2: Bas pointed me at AMDGPU_GEM_CREATE_EXPLICIT_SYNC, which I
entirely missed.

This is great because it means the amdgpu specific piece for proper
implicit fence handling exists already, and that since a while. The
only thing that's now missing is
- fishing the implicit fence

Re: [Mesa-dev] [PATCH 0/6] dma-buf: Add an API for exporting sync files (v12)

2021-06-21 Thread Christian König

Am 18.06.21 um 20:45 schrieb Daniel Vetter:

On Fri, Jun 18, 2021 at 8:02 PM Christian König
 wrote:

Am 18.06.21 um 19:20 schrieb Daniel Vetter:
[SNIP]
The whole thing was introduced with this commit here:

commit f2c24b83ae90292d315aa7ac029c6ce7929e01aa
Author: Maarten Lankhorst 
Date:   Wed Apr 2 17:14:48 2014 +0200

  drm/ttm: flip the switch, and convert to dma_fence

  Signed-off-by: Maarten Lankhorst 

   int ttm_bo_move_accel_cleanup(struct ttm_buffer_object *bo,

-   bo->sync_obj = driver->sync_obj_ref(sync_obj);
+   reservation_object_add_excl_fence(bo->resv, fence);
  if (evict) {

Maarten replaced the bo->sync_obj reference with the dma_resv exclusive
fence.

This means that we need to apply the sync_obj semantic to all drivers
using a DMA-buf with its dma_resv object, otherwise you break imports
from TTM drivers.

Since then and up till now the exclusive fence must be waited on and
never replaced with anything which signals before the old fence.

Maarten and I think Thomas did that and I was always assuming that you
know about this design decision.

Surprisingly I do actually know this.

Still the commit you cite did _not_ change any of the rules around
dma_buf: Importers have _no_ obligation to obey the exclusive fence,
because the buffer is pinned. None of the work that Maarten has done
has fundamentally changed this contract in any way.


Well I now agree that the rules around dma_resv are different than I 
thought, but this change should have raised more eyebrows.


The problem is this completely broke interop with all drivers using TTM 
and I think might even explain some bug reports.


I re-introduced the moving fence by adding bo->moving a few years after 
the initial introduction of dma_resv, but that was just to work around 
performance problems introduced by using the exclusive fence for both 
use cases.



If amdgpu (or any other ttm based driver) hands back and sgt without
waiting for ttm_bo->moving or the exclusive fence first, then that's a
bug we need to fix in these drivers. But if ttm based drivers did get
this wrong, then they got this wrong both before and after the switch
over to using dma_resv - this bug would go back all the way to Dave's
introduction of drm_prime.c and support for that.


I'm not 100% sure, but I think before the switch to the dma_resv object 
drivers just waited for the BOs to become idle and that should have 
prevented this.


Anyway let's stop discussing history and move forward. Sending patches 
for all affected TTM driver with CC: stable tags in a minute.




The only thing which importers have to do is not wreck the DAG nature
of the dma_resv fences and drop dependencies. Currently there's a
handful of drivers which break this (introduced over the last few
years), and I have it somewhere on my todo list to audit them all.


Please give that some priority.

Ignoring the moving fence is a information leak, but messing up the DAG 
gives you access to freed up memory.



The goal with extracting dma_resv from ttm was to make implicit sync
working and get rid of some terrible stalls on the userspace side.
Eventually it was also the goal to make truly dynamic buffer
reservation possible, but that took another 6 or so years to realize
with your work. And we had to make dynamic dma-buf very much opt-in,
because auditing all the users is very hard work and no one
volunteered. And for dynamic dma-buf the rule is that the exclusive
fence must _never_ be ignored, and the two drivers supporting it (mlx5
and amdgpu) obey that.

So yeah for ttm drivers dma_resv is primarily for memory management,
with a side effect of also supporting implicit sync.

For everyone else (and this includes a pile of render drivers, all the
atomic kms drivers, v4l and I have no idea what else on top) dma_resv
was only ever about implicit sync, and it can be ignored. And it (the
implicit sync side) has to be ignored to be able to support vulkan
winsys buffers correctly without stalling where we shouldn't. Also we
have to ignore it on atomic kms side too (and depending upon whether
writeback is supported atomic kms is perfectly capable of reading out
any buffer passed to it).


Oh! That might actually explain some issues, but that just completely 
breaks when TTM based drivers use atomic.


In other words for the first use it is actually rather likely for TTM based 
drivers to need to move the buffer around so that scanout is possible.


And that in turn means you need to wait for this move to finish even if 
you have an explicit fence to wait for. IIRC amdgpu rolled its own 
implementation of this and radeon doesn't have atomic, but nouveau is 
most likely broken.


So we do need a better solution for this sooner or later.


It's absolutely not that this is my invention, I'm just telling you how
it has always been.

Anyway this means we have a seriously misunderstanding and yes now some
of our discussions about dynamic P2P suddenly make much more sense.

Re: [Mesa-dev] [PATCH 0/6] dma-buf: Add an API for exporting sync files (v12)

2021-06-18 Thread Christian König

Hi Daniel,

thanks for jumping in here.

And yes, you are absolutely right we need to get this fixed and not yell 
at each other that we have a different understanding of things.


Your proposal sounds sane to me, but I wouldn't call it slots. Rather 
something like "use cases" since we can have multiple fences for each 
category I think.


And I see at four here:

1. Internal kernel memory management. Everybody needs to wait for this, 
it's equal to bo->moving.

2. Writers for implicit sync, implicit sync readers should wait for them.
3. Readers for implicit sync, implicit sync writers should wait for them.
4. Things like TLB flushes and page table updates, no implicit sync but 
memory management must take them into account before moving/freeing 
backing store.
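These four use cases could be sketched as an ordered usage tag on each fence, where a waiter syncing at a given level waits for every fence at that level or a more exclusive one. The enum and helper below are illustrative assumptions, not an existing kernel interface.

```c
/* Sketch of the four fence use cases as an ordered usage enum.
 * The names and the less-or-equal wait rule are assumptions made
 * for illustration, not existing kernel uapi. */
enum resv_usage {
	RESV_USAGE_KERNEL,	/* 1. kernel memory management (bo->moving) */
	RESV_USAGE_WRITE,	/* 2. implicit-sync writers */
	RESV_USAGE_READ,	/* 3. implicit-sync readers */
	RESV_USAGE_BOOKKEEP,	/* 4. TLB flushes, page table updates */
};

/* Does a fence tagged 'fence' block a waiter syncing at level 'wait'?
 * An implicit-sync reader waits at WRITE level (kernel fences plus
 * writers); an implicit-sync writer waits at READ level (kernel,
 * writers and readers); memory management waits at BOOKKEEP level,
 * i.e. for everything. */
static int resv_usage_blocks(enum resv_usage fence, enum resv_usage wait)
{
	return fence <= wait;
}
```

The ordering encodes the rules listed above directly: everybody waits for kernel fences, while bookkeeping fences only matter to memory management.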


Happy weekend and hopefully not so much heat guys.

Cheers,
Christian.

Am 18.06.21 um 20:20 schrieb Daniel Stone:

Sorry for the mobile reply, but V4L2 is absolutely not write-only; there has 
never been an intersection of V4L2 supporting dmabuf and not supporting reads.

I see your point about the heritage of dma_resv but it’s a red herring. It 
doesn’t matter who’s right, or who was first, or where the code was extracted 
from.

It’s well defined that amdgpu defines resv to be one thing, that every other 
non-TTM user defines it to be something very different, and that the other TTM 
users define it to be something in the middle.

We’ll never get to anything workable if we keep arguing who’s right. Everyone 
is wrong, because dma_resv doesn’t globally mean anything.

It seems clear that there are three classes of synchronisation barrier (not 
using the ‘f’ word here), in descending exclusion order:
   - memory management barriers (amdgpu exclusive fence / ttm_bo->moving)
   - implicit synchronisation write barriers (everyone else’s exclusive fences, 
amdgpu’s shared fences)
   - implicit synchronisation read barriers (everyone else’s shared fences, 
also amdgpu’s shared fences sometimes)

I don’t see a world in which these three uses can be reduced to two slots. What 
also isn’t clear to me though, is how the memory-management barriers can 
exclude all other access in the original proposal with purely userspace CS. 
Retaining the three separate modes also seems like a hard requirement to not 
completely break userspace, but then I don’t see how three separate slots would 
work if they need to be temporally ordered. amdgpu fixed this by redefining the 
meaning of the two slots, others fixed this by not doing one of the three modes.

So how do we square the circle without encoding a DAG into the kernel? Do the 
two slots need to become a single list which is ordered by time + ‘weight’ and 
flattened whenever modified? Something else?

Have a great weekend.

-d


On 18 Jun 2021, at 5:43 pm, Christian König  wrote:

Am 18.06.21 um 17:17 schrieb Daniel Vetter:

[SNIP]
Ignoring _all_ fences is officially ok for pinned dma-buf. This is
what v4l does. Aside from it's definitely not just i915 that does this
even on the drm side, we have a few more drivers nowadays.

No it seriously isn't. If drivers are doing this they are more than broken.

See the comment in dma-resv.h

  * Based on bo.c which bears the following copyright notice,
  * but is dual licensed:



The handling in ttm_bo.c is and always was that the exclusive fence is used for 
buffer moves.

As I said multiple times now the *MAIN* purpose of the dma_resv object is 
memory management and *NOT* synchronization.

Those restrictions come from the original design of TTM where the dma_resv 
object originated from.

The resulting consequences are that:

a) If you access the buffer without waiting for the exclusive fence you run 
into a potential information leak.
 We kind of let that slip for V4L since they only access the buffers for 
writes, so you can't do any harm there.

b) If you overwrite the exclusive fence with a new one without waiting for the 
old one to signal you open up the possibility for userspace to access freed up 
memory.
 This is a complete show stopper since it means that taking over the system 
is just a typing exercise.


What you have done by allowing this in is ripping open a major security hole 
for any DMA-buf import in i915 from all TTM based driver.

This needs to be fixed ASAP, either by waiting in i915 and all other drivers 
doing this for the exclusive fence while importing a DMA-buf or by marking i915 
and all other drivers as broken.

Sorry, but if you allowed that in you seriously have no idea what you are 
talking about here and where all of this originated from.

Regards,
Christian.


___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 0/6] dma-buf: Add an API for exporting sync files (v12)

2021-06-18 Thread Christian König

Am 18.06.21 um 19:20 schrieb Daniel Vetter:

On Fri, Jun 18, 2021 at 6:43 PM Christian König
 wrote:

Am 18.06.21 um 17:17 schrieb Daniel Vetter:

[SNIP]
Ignoring _all_ fences is officially ok for pinned dma-buf. This is
what v4l does. Aside from it's definitely not just i915 that does this
even on the drm side, we have a few more drivers nowadays.

No it seriously isn't. If drivers are doing this they are more than broken.

See the comment in dma-resv.h

   * Based on bo.c which bears the following copyright notice,
   * but is dual licensed:



The handling in ttm_bo.c is and always was that the exclusive fence is
used for buffer moves.

As I said multiple times now the *MAIN* purpose of the dma_resv object
is memory management and *NOT* synchronization.

Those restrictions come from the original design of TTM where the
dma_resv object originated from.

The resulting consequences are that:

a) If you access the buffer without waiting for the exclusive fence you
run into a potential information leak.
  We kind of let that slip for V4L since they only access the buffers
for writes, so you can't do any harm there.

b) If you overwrite the exclusive fence with a new one without waiting
for the old one to signal you open up the possibility for userspace to
access freed up memory.
  This is a complete show stopper since it means that taking over the
system is just a typing exercise.


What you have done by allowing this in is ripping open a major security
hole for any DMA-buf import in i915 from all TTM-based drivers.

This needs to be fixed ASAP, either by waiting in i915 and all other
drivers doing this for the exclusive fence while importing a DMA-buf or
by marking i915 and all other drivers as broken.

Sorry, but if you allowed that in you seriously have no idea what you
are talking about here and where all of this originated from.

Dude, get a grip, seriously. dma-buf landed in 2011

commit d15bd7ee445d0702ad801fdaece348fdb79e6581
Author: Sumit Semwal 
Date:   Mon Dec 26 14:53:15 2011 +0530

dma-buf: Introduce dma buffer sharing mechanism

and drm prime landed in the same year

commit 3248877ea1796915419fba7c89315fdbf00cb56a
(airlied/drm-prime-dmabuf-initial)
Author: Dave Airlie 
Date:   Fri Nov 25 15:21:02 2011 +

drm: base prime/dma-buf support (v5)

dma-resv was extracted much later

commit 786d7257e537da0674c02e16e3b30a44665d1cee
Author: Maarten Lankhorst 
Date:   Thu Jun 27 13:48:16 2013 +0200

reservation: cross-device reservation support, v4

Maarten's patch only extracted the dma_resv stuff so it's there,
optionally. There was never any effort to roll this out to all the
existing drivers, of which there were plenty.

It is, and has been since 10 years, totally fine to access dma-buf
without looking at any fences at all. From your pov of a ttm driver
dma-resv is mainly used for memory management and not sync, but I
think that's also due to some reinterpretation of the actual sync
rules on your side. For everyone else the dma_resv attached to a
dma-buf has been about implicit sync only, nothing else.


No, that was way before my time.

The whole thing was introduced with this commit here:

commit f2c24b83ae90292d315aa7ac029c6ce7929e01aa
Author: Maarten Lankhorst 
Date:   Wed Apr 2 17:14:48 2014 +0200

    drm/ttm: flip the switch, and convert to dma_fence

    Signed-off-by: Maarten Lankhorst 

 int ttm_bo_move_accel_cleanup(struct ttm_buffer_object *bo,

-   bo->sync_obj = driver->sync_obj_ref(sync_obj);
+   reservation_object_add_excl_fence(bo->resv, fence);
    if (evict) {

Maarten replaced the bo->sync_obj reference with the dma_resv exclusive 
fence.


This means that we need to apply the sync_obj semantic to all drivers 
using a DMA-buf with its dma_resv object, otherwise you break imports 
from TTM drivers.


Since then and up till now the exclusive fence must be waited on and 
never replaced with anything which signals before the old fence.


Maarten and I think Thomas did that and I was always assuming that you 
know about this design decision.


It's absolutely not that this is my invention, I'm just telling you how 
it ever was.


Anyway this means we have a seriously misunderstanding and yes now some 
of our discussions about dynamic P2P suddenly make much more sense.


Regards,
Christian.




_only_ when you have a dynamic importer/exporter can you assume that
the dma_resv fences must actually be obeyed. That's one of the reasons
why we had to make this a completely new mode (the other one was
locking, but they really tie together).

Wrt your problems:
a) needs to be fixed in drivers exporting buffers and failing to make
sure the memory is there by the time dma_buf_map_attachment returns.
b) needs to be fixed in the importers, and there's quite a few of
those. There's more than i915 here, which is why I think we should
have the dma_resv_add_shared_exclusive helper extracted from amdgpu.
Avoids hand-rolling this about 5 times (6 i

Re: [Mesa-dev] [PATCH 0/6] dma-buf: Add an API for exporting sync files (v12)

2021-06-18 Thread Christian König

Am 18.06.21 um 17:17 schrieb Daniel Vetter:

[SNIP]
Ignoring _all_ fences is officially ok for pinned dma-buf. This is
what v4l does. Aside from it's definitely not just i915 that does this
even on the drm side, we have a few more drivers nowadays.


No it seriously isn't. If drivers are doing this they are more than broken.

See the comment in dma-resv.h

 * Based on bo.c which bears the following copyright notice,
 * but is dual licensed:



The handling in ttm_bo.c is and always was that the exclusive fence is 
used for buffer moves.


As I said multiple times now the *MAIN* purpose of the dma_resv object 
is memory management and *NOT* synchronization.


Those restrictions come from the original design of TTM where the 
dma_resv object originated from.


The resulting consequences are that:

a) If you access the buffer without waiting for the exclusive fence you 
run into a potential information leak.
    We kind of let that slip for V4L since they only access the buffers 
for writes, so you can't do any harm there.


b) If you overwrite the exclusive fence with a new one without waiting 
for the old one to signal you open up the possibility for userspace to 
access freed up memory.
    This is a complete show stopper since it means that taking over the 
system is just a typing exercise.



What you have done by allowing this in is ripping open a major security 
hole for any DMA-buf import in i915 from all TTM-based drivers.


This needs to be fixed ASAP, either by waiting in i915 and all other 
drivers doing this for the exclusive fence while importing a DMA-buf or 
by marking i915 and all other drivers as broken.


Sorry, but if you allowed that in you seriously have no idea what you 
are talking about here and where all of this originated from.


Regards,
Christian.


Re: [Mesa-dev] [PATCH 0/6] dma-buf: Add an API for exporting sync files (v12)

2021-06-18 Thread Christian König
Mesa MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
IGT tests: https://patchwork.freedesktop.org/series/90490/

v10 (Jason Ekstrand, Daniel Vetter):
 - Add reviews/acks
 - Add a patch to rename _rcu to _unlocked
 - Split things better so import is clearly RFC status

v11 (Daniel Vetter):
 - Add more CCs to try and get maintainers
 - Add a patch to document DMA_BUF_IOCTL_SYNC
 - Generally better docs
 - Use separate structs for import/export (easier to document)
 - Fix an issue in the import patch

v12 (Daniel Vetter):
 - Better docs for DMA_BUF_IOCTL_SYNC

v12 (Christian König):
 - Drop the rename patch in favor of Christian's series
 - Add a comment to the commit message for the dma-buf sync_file export
   ioctl saying why we made it an ioctl on dma-buf

Cc: Christian König 
Cc: Michel Dänzer 
Cc: Dave Airlie 
Cc: Bas Nieuwenhuizen 
Cc: Daniel Stone 
Cc: mesa-dev@lists.freedesktop.org
Cc: wayland-de...@lists.freedesktop.org
Test-with: 20210524205225.872316-1-ja...@jlekstrand.net

Christian König (1):
  dma-buf: Add dma_fence_array_for_each (v2)

Jason Ekstrand (5):
  dma-buf: Add dma_resv_get_singleton (v6)
  dma-buf: Document DMA_BUF_IOCTL_SYNC (v2)
  dma-buf: Add an API for exporting sync files (v12)
  RFC: dma-buf: Add an extra fence to dma_resv_get_singleton_unlocked
  RFC: dma-buf: Add an API for importing sync files (v7)

 Documentation/driver-api/dma-buf.rst |   8 ++
 drivers/dma-buf/dma-buf.c| 103 +
 drivers/dma-buf/dma-fence-array.c|  27 +++
 drivers/dma-buf/dma-resv.c   | 110 +++
 include/linux/dma-fence-array.h  |  17 +
 include/linux/dma-resv.h |   2 +
 include/uapi/linux/dma-buf.h | 103 -
 7 files changed, 369 insertions(+), 1 deletion(-)








Re: [Mesa-dev] [PATCH 0/6] dma-buf: Add an API for exporting sync files (v12)

2021-06-18 Thread Christian König

Am 17.06.21 um 21:58 schrieb Daniel Vetter:

On Thu, Jun 17, 2021 at 09:37:36AM +0200, Christian König wrote:

[SNIP]

But, to the broader point, maybe?  I'm a little fuzzy on exactly where
i915 inserts and/or depends on fences.


When you combine that with complex drivers which use TTM and buffer
moves underneath you can construct an information leak using this and
give userspace access to memory which is allocated to the driver, but
not yet initialized.

This way you can leak things like page tables, passwords, kernel data
etc... in large amounts to userspace and is an absolute no-go for 
security.

Ugh...  Unfortunately, I'm really out of my depth on the implications
going on here but I think I see your point.


That's why I said we need to get this fixed before we upstream this
patch set here and especially the driver change which is using that.

Well, i915 has had uAPI for a while to ignore fences.

Yeah, exactly that's illegal.

You're a few years too late with closing that barn door. The following
drivers have this concept
- i915
- msm
- etnaviv

Because you can't write a competent vulkan driver without this.


WHAT? ^^


This was discussed at absolute epic length in various xdcs iirc. We did ignore a
bit the vram/ttm/bo-moving problem because all the people present were
hacking on integrated gpu (see list above), but that just means we need to
treat the ttm_bo->moving fence properly.


I should have visited more XDCs in the past, the problem is much larger 
than this.


But I now start to understand what you are doing with that design and 
why it looks so messy to me, amdgpu is just currently the only driver 
which does Vulkan and complex memory management at the same time.



At least the kernel internal fences like moving or clearing a buffer object
needs to be taken into account before a driver is allowed to access a
buffer.

Yes i915 needs to make sure it never ignores ttm_bo->moving.


No, that is only the tip of the iceberg. See TTM for example also puts 
fences which drivers need to wait for into the shared slots. Same thing 
for use cases like clear on release etc.


From my point of view the main purpose of the dma_resv object is to 
serve memory management, synchronization for command submission is just 
a secondary use case.


And that drivers choose to ignore the exclusive fence is an absolute 
no-go from a memory management and security point of view. Exclusive 
access means exclusive access. Ignoring that won't work.


The only thing which saved us so far is the fact that drivers doing this 
are not that complex.


BTW: How does it even work? I mean then you would run into the same 
problem as amdgpu with its page table update fences, e.g. that your 
shared fences might signal before the exclusive one.



For dma-buf this isn't actually a problem, because dma-buf are pinned. You
can't move them while other drivers are using them, hence there's not
actually a ttm_bo->moving fence we can ignore.

p2p dma-buf aka dynamic dma-buf is a different beast, and i915 (and fwiw
these other drivers) need to change before they can do dynamic dma-buf.


Otherwise we have an information leak worth a CVE and that is certainly not
something we want.

Because yes otherwise we get a CVE. But right now I don't think we have
one.


Yeah, agree. But this is just because of coincidence and not because of 
good engineering :)



We do have a quite big confusion on what exactly the signaling ordering is
supposed to be between exclusive and the collective set of shared fences,
and there's some unifying that needs to happen here. But I think what
Jason implements here in the import ioctl is the most defensive version
possible, so really can't break any driver. It really works like you have
an ad-hoc gpu engine that does nothing itself, but waits for the current
exclusive fence and then sets the exclusive fence with its "CS" completion
fence.

That's imo perfectly legit use-case.


The use case is certainly legit, but I'm not sure if merging this at the 
moment is a good idea.


Your note that drivers are already ignoring the exclusive fence in the 
dma_resv object was eye opening to me. And I now have the very strong 
feeling that the synchronization and the design of the dma_resv object 
is even more messy than I thought it is.


To summarize, we are really lucky that it didn't blow up in our 
faces already.



Same for the export one. Waiting for a previous snapshot of implicit
fences is imo perfectly ok use-case and useful for compositors - client
might soon start more rendering, and on some drivers that always results
in the exclusive slot being set, so if you don't take a snapshot you
oversync real bad for your atomic flip.


The export use case is unproblematic as far as I can see.


Those changes are years in the past. If we have a real problem here (not sure
on that yet), then we'll have to figure out how to fix it without nuking uAPI.

Well, that was the basic idea

Re: [Mesa-dev] [PATCH 0/6] dma-buf: Add an API for exporting sync files (v12)

2021-06-17 Thread Christian König

Am 16.06.21 um 20:30 schrieb Jason Ekstrand:

On Tue, Jun 15, 2021 at 3:41 AM Christian König
 wrote:

Hi Jason & Daniel,

maybe I should explain once more where the problem with this approach is
and why I think we need to get that fixed before we can do something
like this here.

To summarize what this patch here does is that it copies the exclusive
fence and/or the shared fences into a sync_file. This alone is totally
unproblematic.

The problem is what this implies. When you need to copy the exclusive
fence to a sync_file then this means that the driver is at some point
ignoring the exclusive fence on a buffer object.

Not necessarily.  Part of the point of this is to allow for CPU waits
on a past point in buffers timeline.  Today, we have poll() and
GEM_WAIT both of which wait for the buffer to be idle from whatever
GPU work is currently happening.  We want to wait on something in the
past and ignore anything happening now.


Good point, yes that is indeed a valid use case.


But, to the broader point, maybe?  I'm a little fuzzy on exactly where
i915 inserts and/or depends on fences.


When you combine that with complex drivers which use TTM and buffer
moves underneath you can construct an information leak using this and
give userspace access to memory which is allocated to the driver, but
not yet initialized.

This way you can leak things like page tables, passwords, kernel data
etc... in large amounts to userspace and is an absolute no-go for
security.

Ugh...  Unfortunately, I'm really out of my depth on the implications
going on here but I think I see your point.


That's why I said we need to get this fixed before we upstream this
patch set here and especially the driver change which is using that.

Well, i915 has had uAPI for a while to ignore fences.


Yeah, exactly that's illegal.

At least the kernel internal fences like moving or clearing a buffer 
object needs to be taken into account before a driver is allowed to 
access a buffer.


Otherwise we have an information leak worth a CVE and that is certainly 
not something we want.



Those changes are years in the past. If we have a real problem here (not sure
on that yet), then we'll have to figure out how to fix it without nuking uAPI.


Well, that was the basic idea of attaching flags to the fences in the 
dma_resv object.


In other words you clearly denote when you have to wait for a fence 
before accessing a buffer or you cause a security issue.


Christian.



--Jason



Regards,
Christian.

Am 10.06.21 um 23:09 schrieb Jason Ekstrand:

Modern userspace APIs like Vulkan are built on an explicit
synchronization model.  This doesn't always play nicely with the
implicit synchronization used in the kernel and assumed by X11 and
Wayland.  The client -> compositor half of the synchronization isn't too
bad, at least on intel, because we can control whether or not i915
synchronizes on the buffer and whether or not it's considered written.

The harder part is the compositor -> client synchronization when we get
the buffer back from the compositor.  We're required to be able to
provide the client with a VkSemaphore and VkFence representing the point
in time where the window system (compositor and/or display) finished
using the buffer.  With current APIs, it's very hard to do this in such
a way that we don't get confused by the Vulkan driver's access of the
buffer.  In particular, once we tell the kernel that we're rendering to
the buffer again, any CPU waits on the buffer or GPU dependencies will
wait on some of the client rendering and not just the compositor.

This new IOCTL solves this problem by allowing us to get a snapshot of
the implicit synchronization state of a given dma-buf in the form of a
sync file.  It's effectively the same as a poll() or I915_GEM_WAIT only,
instead of CPU waiting directly, it encapsulates the wait operation, at
the current moment in time, in a sync_file so we can check/wait on it
later.  As long as the Vulkan driver does the sync_file export from the
dma-buf before we re-introduce it for rendering, it will only contain
fences from the compositor or display.  This allows to accurately turn
it into a VkFence or VkSemaphore without any over-synchronization.

This patch series actually contains two new ioctls.  There is the export
one mentioned above as well as an RFC for an import ioctl which provides
the other half.  The intention is to land the export ioctl since it seems
like there's no real disagreement on that one.  The import ioctl, however,
has a lot of debate around it so it's intended to be RFC-only for now.

Mesa MR: 
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037

Re: [Mesa-dev] [PATCH 0/6] dma-buf: Add an API for exporting sync files (v12)

2021-06-15 Thread Christian König

Hi Jason & Daniel,

maybe I should explain once more where the problem with this approach is 
and why I think we need to get that fixed before we can do something 
like this here.


To summarize what this patch here does is that it copies the exclusive 
fence and/or the shared fences into a sync_file. This alone is totally 
unproblematic.


The problem is what this implies. When you need to copy the exclusive 
fence to a sync_file then this means that the driver is at some point 
ignoring the exclusive fence on a buffer object.


When you combine that with complex drivers which use TTM and buffer 
moves underneath you can construct an information leak using this and 
give userspace access to memory which is allocated to the driver, but 
not yet initialized.


This way you can leak things like page tables, passwords, kernel data 
etc... in large amounts to userspace and is an absolute no-go for 
security.


That's why I said we need to get this fixed before we upstream this 
patch set here and especially the driver change which is using that.


Regards,
Christian.

Am 10.06.21 um 23:09 schrieb Jason Ekstrand:

Modern userspace APIs like Vulkan are built on an explicit
synchronization model.  This doesn't always play nicely with the
implicit synchronization used in the kernel and assumed by X11 and
Wayland.  The client -> compositor half of the synchronization isn't too
bad, at least on intel, because we can control whether or not i915
synchronizes on the buffer and whether or not it's considered written.

The harder part is the compositor -> client synchronization when we get
the buffer back from the compositor.  We're required to be able to
provide the client with a VkSemaphore and VkFence representing the point
in time where the window system (compositor and/or display) finished
using the buffer.  With current APIs, it's very hard to do this in such
a way that we don't get confused by the Vulkan driver's access of the
buffer.  In particular, once we tell the kernel that we're rendering to
the buffer again, any CPU waits on the buffer or GPU dependencies will
wait on some of the client rendering and not just the compositor.

This new IOCTL solves this problem by allowing us to get a snapshot of
the implicit synchronization state of a given dma-buf in the form of a
sync file.  It's effectively the same as a poll() or I915_GEM_WAIT only,
instead of CPU waiting directly, it encapsulates the wait operation, at
the current moment in time, in a sync_file so we can check/wait on it
later.  As long as the Vulkan driver does the sync_file export from the
dma-buf before we re-introduce it for rendering, it will only contain
fences from the compositor or display.  This allows to accurately turn
it into a VkFence or VkSemaphore without any over-synchronization.

This patch series actually contains two new ioctls.  There is the export
one mentioned above as well as an RFC for an import ioctl which provides
the other half.  The intention is to land the export ioctl since it seems
like there's no real disagreement on that one.  The import ioctl, however,
has a lot of debate around it so it's intended to be RFC-only for now.

Mesa MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
IGT tests: https://patchwork.freedesktop.org/series/90490/

v10 (Jason Ekstrand, Daniel Vetter):
  - Add reviews/acks
  - Add a patch to rename _rcu to _unlocked
  - Split things better so import is clearly RFC status

v11 (Daniel Vetter):
  - Add more CCs to try and get maintainers
  - Add a patch to document DMA_BUF_IOCTL_SYNC
  - Generally better docs
  - Use separate structs for import/export (easier to document)
  - Fix an issue in the import patch

v12 (Daniel Vetter):
  - Better docs for DMA_BUF_IOCTL_SYNC

v12 (Christian König):
  - Drop the rename patch in favor of Christian's series
  - Add a comment to the commit message for the dma-buf sync_file export
ioctl saying why we made it an ioctl on dma-buf

Cc: Christian König 
Cc: Michel Dänzer 
Cc: Dave Airlie 
Cc: Bas Nieuwenhuizen 
Cc: Daniel Stone 
Cc: mesa-dev@lists.freedesktop.org
Cc: wayland-de...@lists.freedesktop.org
Test-with: 20210524205225.872316-1-ja...@jlekstrand.net

Christian König (1):
   dma-buf: Add dma_fence_array_for_each (v2)

Jason Ekstrand (5):
   dma-buf: Add dma_resv_get_singleton (v6)
   dma-buf: Document DMA_BUF_IOCTL_SYNC (v2)
   dma-buf: Add an API for exporting sync files (v12)
   RFC: dma-buf: Add an extra fence to dma_resv_get_singleton_unlocked
   RFC: dma-buf: Add an API for importing sync files (v7)

  Documentation/driver-api/dma-buf.rst |   8 ++
  drivers/dma-buf/dma-buf.c| 103 +
  drivers/dma-buf/dma-fence-array.c|  27 +++
  drivers/dma-buf/dma-resv.c   | 110 +++
  include/linux/dma-fence-array.h  |  17 +
  include/linux/dma-resv.h |   2 +
  include/uapi/linux/dma-buf.h | 103 -
  7 fil

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-14 Thread Christian König
As long as we can figure out who touched a certain sync object last 
that would indeed work, yes.


Christian.

Am 14.06.21 um 19:10 schrieb Marek Olšák:
The call to the hw scheduler has a limitation on the size of all 
parameters combined. I think we can only pass a 32-bit sequence number 
and a ~16-bit global (per-GPU) syncobj handle in one call and not much 
else.


The syncobj handle can be an element index in a global (per-GPU) 
syncobj table and it's read only for all processes with the exception 
of the signal command. Syncobjs can either have per VMID write access 
flags for the signal command (slow), or any process can write to any 
syncobjs and only rely on the kernel checking the write log (fast).


In any case, we can execute the memory write in the queue engine and 
only use the hw scheduler for logging, which would be perfect.


Marek

On Thu, Jun 10, 2021 at 12:33 PM Christian König 
<mailto:ckoenig.leichtzumer...@gmail.com> wrote:


Hi guys,

maybe soften that a bit. Reading from the shared memory of the
user fence is ok for everybody. What we need to take more care of
is the writing side.

So my current thinking is that we allow read only access, but
writing a new sequence value needs to go through the scheduler/kernel.

So when the CPU wants to signal a timeline fence it needs to call
an IOCTL. When the GPU wants to signal the timeline fence it needs
to hand that off to the hardware scheduler.

If we lock up, the kernel can check with the hardware who did the
last write and what value was written.

That together with an IOCTL to give out sequence number for
implicit sync to applications should be sufficient for the kernel
to track who is responsible if something bad happens.

In other words when the hardware says that the shader wrote stuff
like 0xdeadbeef 0x0 or 0x into memory we kill the process
who did that.

If the hardware says that seq - 1 was written fine, but seq is
missing then the kernel blames whoever was supposed to write seq.

Just piping the write through a privileged instance should be
fine to make sure that we don't run into issues.

Christian.

Am 10.06.21 um 17:59 schrieb Marek Olšák:

Hi Daniel,

We just talked about this whole topic internally and we came up
to the conclusion that the hardware needs to understand sync
object handles and have high-level wait and signal operations in
the command stream. Sync objects will be backed by memory, but
they won't be readable or writable by processes directly. The
hardware will log all accesses to sync objects and will send the
log to the kernel periodically. The kernel will identify
malicious behavior.

Example of a hardware command stream:
...
ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence
number is assigned by the kernel
Draw();
ImplicitSyncSignalWhenDone(syncObjHandle);
...

I'm afraid we have no other choice because of the TLB
invalidation overhead.

Marek


On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter <mailto:dan...@ffwll.ch> wrote:

On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
> Am 09.06.21 um 15:19 schrieb Daniel Vetter:
> > [SNIP]
> > > Yeah, we call this the lightweight and the heavyweight
tlb flush.
> > >
> > > The lightweight can be used when you are sure that you
don't have any of the
> > > PTEs currently in flight in the 3D/DMA engine and you
just need to
> > > invalidate the TLB.
> > >
> > > The heavyweight must be used when you need to
invalidate the TLB *AND* make
> > > sure that no concurrently operation moves new stuff
into the TLB.
> > >
> > > The problem is for this use case we have to use the
heavyweight one.
> > Just for my own curiosity: So the lightweight flush is
only for in-between
> > CS when you know access is idle? Or does that also not
work if userspace
> > has a CS on a dma engine going at the same time because
the tlb aren't
> > isolated enough between engines?
>
> More or less correct, yes.
>
> The problem is a lightweight flush only invalidates the
TLB, but doesn't
> take care of entries which have been handed out to the
different engines.
>
> In other words what can happen is the following:
>
> 1. Shader asks TLB to resolve address X.
> 2. TLB looks into its cache and can't find address X so it
asks the walker
> to resolve.
> 3. Walker comes back with result for address X and TL

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-10 Thread Christian König

Hi guys,

maybe soften that a bit. Reading from the shared memory of the user 
fence is ok for everybody. What we need to take more care of is the 
writing side.


So my current thinking is that we allow read only access, but writing a 
new sequence value needs to go through the scheduler/kernel.


So when the CPU wants to signal a timeline fence it needs to call an 
IOCTL. When the GPU wants to signal the timeline fence it needs to hand 
that off to the hardware scheduler.


If we lock up, the kernel can check with the hardware who did the last 
write and what value was written.


That together with an IOCTL to give out sequence number for implicit 
sync to applications should be sufficient for the kernel to track who is 
responsible if something bad happens.


In other words when the hardware says that the shader wrote stuff like 
0xdeadbeef 0x0 or 0x into memory we kill the process who did that.


If the hardware says that seq - 1 was written fine, but seq is missing 
then the kernel blames whoever was supposed to write seq.


Just piping the write through a privileged instance should be fine to 
make sure that we don't run into issues.


Christian.

Am 10.06.21 um 17:59 schrieb Marek Olšák:

Hi Daniel,

We just talked about this whole topic internally and we came up to the 
conclusion that the hardware needs to understand sync object handles 
and have high-level wait and signal operations in the command stream. 
Sync objects will be backed by memory, but they won't be readable or 
writable by processes directly. The hardware will log all accesses to 
sync objects and will send the log to the kernel periodically. The 
kernel will identify malicious behavior.


Example of a hardware command stream:
...
ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence 
number is assigned by the kernel

Draw();
ImplicitSyncSignalWhenDone(syncObjHandle);
...

I'm afraid we have no other choice because of the TLB invalidation 
overhead.


Marek


On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter <mailto:dan...@ffwll.ch> wrote:


On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
> Am 09.06.21 um 15:19 schrieb Daniel Vetter:
> > [SNIP]
> > > Yeah, we call this the lightweight and the heavyweight tlb
flush.
> > >
> > > The lightweight can be used when you are sure that you don't
have any of the
> > > PTEs currently in flight in the 3D/DMA engine and you just
need to
> > > invalidate the TLB.
> > >
> > > The heavyweight must be used when you need to invalidate the
TLB *AND* make
> > > sure that no concurrent operation moves new stuff into the
TLB.
> > >
> > > The problem is for this use case we have to use the
heavyweight one.
> > Just for my own curiosity: So the lightweight flush is only
for in-between
> > CS when you know access is idle? Or does that also not work if
userspace
> > has a CS on a dma engine going at the same time because the
tlb aren't
> > isolated enough between engines?
>
> More or less correct, yes.
>
> The problem is a lightweight flush only invalidates the TLB, but
doesn't
> take care of entries which have been handed out to the different
engines.
>
> In other words what can happen is the following:
>
> 1. Shader asks TLB to resolve address X.
> 2. TLB looks into its cache and can't find address X so it asks
the walker
> to resolve.
> 3. Walker comes back with result for address X and TLB puts that
into its
> cache and gives it to Shader.
> 4. Shader starts doing some operation using result for address X.
> 5. You send lightweight TLB invalidate and TLB throws away
cached values for
> address X.
> 6. Shader happily still uses whatever the TLB gave to it in step
3 to
> access address X
>
> See it like the shader has their own 1 entry L0 TLB cache which
is not
> affected by the lightweight flush.
>
> The heavyweight flush on the other hand sends out a broadcast
signal to
> everybody and only comes back when we are sure that an address
is not in use
> any more.

Ah makes sense. On intel the shaders only operate in VA,
everything goes
around as explicit async messages to IO blocks. So we don't have
this, the
only difference in tlb flushes is between tlb flush in the IB and
an mmio
one which is independent for anything currently being executed on an
engine.
-Daniel
-- 
Daniel Vetter

Software Engineer, Intel Corporation
http://blog.ffwll.ch



___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-09 Thread Christian König

Am 09.06.21 um 15:19 schrieb Daniel Vetter:

[SNIP]

Yeah, we call this the lightweight and the heavyweight tlb flush.

The lightweight can be used when you are sure that you don't have any of the
PTEs currently in flight in the 3D/DMA engine and you just need to
invalidate the TLB.

The heavyweight must be used when you need to invalidate the TLB *AND* make
sure that no concurrent operation moves new stuff into the TLB.

The problem is for this use case we have to use the heavyweight one.

Just for my own curiosity: So the lightweight flush is only for in-between
CS when you know access is idle? Or does that also not work if userspace
has a CS on a dma engine going at the same time because the tlb aren't
isolated enough between engines?


More or less correct, yes.

The problem is a lightweight flush only invalidates the TLB, but doesn't 
take care of entries which have been handed out to the different engines.


In other words what can happen is the following:

1. Shader asks TLB to resolve address X.
2. TLB looks into its cache and can't find address X so it asks the 
walker to resolve.
3. Walker comes back with result for address X and TLB puts that into 
its cache and gives it to Shader.

4. Shader starts doing some operation using result for address X.
5. You send lightweight TLB invalidate and TLB throws away cached values 
for address X.
6. Shader happily still uses whatever the TLB gave to it in step 3 to 
access address X


See it like the shader has their own 1 entry L0 TLB cache which is not 
affected by the lightweight flush.


The heavyweight flush on the other hand sends out a broadcast signal to 
everybody and only comes back when we are sure that an address is not in 
use any more.
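
The race in steps 1-6 can be replayed as a small simulation. The classes below are purely illustrative; the translation the shader holds on to is modeled as the one-entry L0 cache described above, which a lightweight flush cannot reach:

```python
# Simulation of the stale-entry race: a lightweight flush empties the TLB
# cache but cannot recall the translation the shader already holds, while
# a heavyweight (broadcast) flush also drops the handed-out entries.

class TLB:
    def __init__(self, page_table):
        self.page_table = page_table
        self.cache = {}

    def resolve(self, va):
        if va not in self.cache:                 # steps 1-3: walk on miss
            self.cache[va] = self.page_table[va]
        return self.cache[va]

    def lightweight_flush(self):
        self.cache.clear()                       # step 5: TLB cache only

    def heavyweight_flush(self, shaders):
        self.cache.clear()
        for s in shaders:                        # broadcast: also drop the
            s.l0 = None                          # entries handed to engines

class Shader:
    def __init__(self, tlb):
        self.tlb, self.l0 = tlb, None

    def start(self, va):
        self.l0 = self.tlb.resolve(va)           # step 4: keeps its own copy

    def access(self):
        return self.l0                           # step 6: uses held entry

pt = {0x1000: "page_A"}
tlb = TLB(pt)
shader = Shader(tlb)
shader.start(0x1000)

pt[0x1000] = "page_B"                 # remap the address
tlb.lightweight_flush()
assert shader.access() == "page_A"    # shader still uses the stale entry!

tlb.heavyweight_flush([shader])
assert shader.access() is None        # entry really gone everywhere
```

After the heavyweight flush a fresh `start()` would walk the page table again and pick up the new mapping, which is exactly why it is the only safe option for this use case.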


Christian.


-Daniel





Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-04 Thread Christian König

Am 04.06.21 um 10:57 schrieb Daniel Vetter:

On Fri, Jun 04, 2021 at 09:00:31AM +0200, Christian König wrote:

Am 02.06.21 um 21:19 schrieb Daniel Vetter:

On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:

Am 02.06.21 um 20:48 schrieb Daniel Vetter:

On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:

On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:


Yes, we can't break anything because we don't want to complicate things
for us. It's pretty much all NAK'd already. We are trying to gather more
knowledge and then make better decisions.

The idea we are considering is that we'll expose memory-based sync objects
to userspace for read only, and the kernel or hw will strictly control the
memory writes to those sync objects. The hole in that idea is that
userspace can decide not to signal a job, so even if userspace can't
overwrite memory-based sync object states arbitrarily, it can still decide
not to signal them, and then a future fence is born.


This would actually be treated as a GPU hang caused by that context, so it
should be fine.

This is practically what I proposed already, except you're not doing it with
dma_fence. And on the memory fence side this also doesn't actually give
what you want for that compute model.

This seems like a bit a worst of both worlds approach to me? Tons of work
in the kernel to hide these not-dma_fence-but-almost, and still pain to
actually drive the hardware like it should be for compute or direct
display.

Also maybe I've missed it, but I didn't see any replies to my suggestion
how to fake the entire dma_fence stuff on top of new hw. Would be
interesting to know what doesn't work there instead of amd folks going off
into internal again and then coming back with another rfc from out of
nowhere :-)

Well to be honest I would just push back on our hardware/firmware guys that
we need to keep kernel queues forever before going down that route.

I looked again, and you said the model won't work because preemption is way
too slow, even when the context is idle.

I guess at that point I got maybe too fed up and just figured "not my
problem", but if preempt is too slow as the unload fence, you can do it
with pte removal and tlb shootdown too (that is hopefully not too slow,
otherwise your hw is just garbage and won't even be fast for direct submit
compute workloads).

Have you seen that one here:
https://www.spinics.net/lists/amd-gfx/msg63101.html :)

I've rejected it because I think polling for 6 seconds on a TLB flush which
can block interrupts as well is just madness.

Hm but I thought you had like 2 tlb flush modes, the shitty one (with
retrying page faults) and the not so shitty one?


Yeah, we call this the lightweight and the heavyweight tlb flush.

The lightweight can be used when you are sure that you don't have any of 
the PTEs currently in flight in the 3D/DMA engine and you just need to 
invalidate the TLB.


The heavyweight must be used when you need to invalidate the TLB *AND* 
make sure that no concurrent operation moves new stuff into the TLB.


The problem is for this use case we have to use the heavyweight one.


But yeah at that point I think you just have to bite one of the bullets.


Yeah, completely agree. We can choose which way we want to die, but it's 
certainly not going to be nice whatever we do.




The thing is with hmm/userspace memory fence model this will be even
worse, because you will _have_ to do this tlb flush deep down in core mm
functions, so this is going to be userptr, but worse.

With the dma_resv/dma_fence bo memory management model you can at least
wrap that tlb flush into a dma_fence and push the waiting/pinging onto a
separate thread or something like that. If the hw really is that slow.

Somewhat aside: Personally I think that sriov needs to move over to the
compute model, i.e. indefinite timeouts, no tdr, because everything takes
too long. At least looking around sriov timeouts tend to be 10x bare
metal, across the board.

But for stuff like cloud gaming that's serious amounts of heavy lifting
since it brings us right back "the entire linux/android 3d stack is built
on top of dma_fence right now".


The only thing that you need to do when you use pte clearing + tlb
shootdown instead of preemption as the unload fence for buffers that get
moved is that if you get any gpu page fault, you don't serve that, but
instead treat it as a tdr and shoot the context permanently.

So summarizing the model I proposed:

- you allow userspace to directly write into the ringbuffer, and also
write the fences directly

- actual submit is done by the kernel, using drm/scheduler. The kernel
blindly trusts userspace to set up everything else, and even just wraps
dma_fences around the userspace memory fences.

- the only check is tdr. If a fence doesn't complete a tdr fires, a) the
kernel shoots the entire context and b) userspace recovers by setting up a
new ringbuffer

- memory management is done

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-04 Thread Christian König

Am 02.06.21 um 21:19 schrieb Daniel Vetter:

On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:


Am 02.06.21 um 20:48 schrieb Daniel Vetter:

On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:

On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:


Yes, we can't break anything because we don't want to complicate things
for us. It's pretty much all NAK'd already. We are trying to gather more
knowledge and then make better decisions.

The idea we are considering is that we'll expose memory-based sync objects
to userspace for read only, and the kernel or hw will strictly control the
memory writes to those sync objects. The hole in that idea is that
userspace can decide not to signal a job, so even if userspace can't
overwrite memory-based sync object states arbitrarily, it can still decide
not to signal them, and then a future fence is born.


This would actually be treated as a GPU hang caused by that context, so it
should be fine.

This is practically what I proposed already, except you're not doing it with
dma_fence. And on the memory fence side this also doesn't actually give
what you want for that compute model.

This seems like a bit a worst of both worlds approach to me? Tons of work
in the kernel to hide these not-dma_fence-but-almost, and still pain to
actually drive the hardware like it should be for compute or direct
display.

Also maybe I've missed it, but I didn't see any replies to my suggestion
how to fake the entire dma_fence stuff on top of new hw. Would be
interesting to know what doesn't work there instead of amd folks going off
into internal again and then coming back with another rfc from out of
nowhere :-)

Well to be honest I would just push back on our hardware/firmware guys that
we need to keep kernel queues forever before going down that route.

I looked again, and you said the model won't work because preemption is way
too slow, even when the context is idle.

I guess at that point I got maybe too fed up and just figured "not my
problem", but if preempt is too slow as the unload fence, you can do it
with pte removal and tlb shootdown too (that is hopefully not too slow,
otherwise your hw is just garbage and won't even be fast for direct submit
compute workloads).


Have you seen that one here: 
https://www.spinics.net/lists/amd-gfx/msg63101.html :)


I've rejected it because I think polling for 6 seconds on a TLB flush 
which can block interrupts as well is just madness.






The only thing that you need to do when you use pte clearing + tlb
shootdown instead of preemption as the unload fence for buffers that get
moved is that if you get any gpu page fault, you don't serve that, but
instead treat it as a tdr and shoot the context permanently.

So summarizing the model I proposed:

- you allow userspace to directly write into the ringbuffer, and also
   write the fences directly

- actual submit is done by the kernel, using drm/scheduler. The kernel
   blindly trusts userspace to set up everything else, and even just wraps
   dma_fences around the userspace memory fences.

- the only check is tdr. If a fence doesn't complete a tdr fires, a) the
   kernel shoots the entire context and b) userspace recovers by setting up a
   new ringbuffer

- memory management is done using ttm only, you still need to supply the
   buffer list (ofc that list includes the always present ones, so CS will
   only get the list of special buffers like today). If your hw can't turn
   off gpu page faults and you ever get one we pull up the same old solution:
   Kernel shoots the entire context.

   The important thing is that from the gpu pov memory management works
   exactly like compute workload with direct submit, except that you just
   terminate the context on _any_ page fault, instead of only those that go
   somewhere where there's really no mapping and repair the others.

   Also I guess from reading the old thread this means you'd disable page
   fault retry because that is apparently also way too slow for anything.

- memory management uses an unload fence. That unload fences waits for all
   userspace memory fences (represented as dma_fence) to complete, with
   maybe some fudge to busy-spin until we've reached the actual end of the
   ringbuffer (maybe you have a IB tail there after the memory fence write,
   we have that on intel hw), and it waits for the memory to get
   "unloaded". This is either preemption, or pte clearing + tlb shootdown,
   or whatever else your hw provides which is a) used for dynamic memory
   management b) fast enough for actual memory management.

- any time a context dies we force-complete all it's pending fences,
   in-order ofc

So from hw pov this looks 99% like direct userspace submit, with the exact
same mappings, command sequences and everything else. The only difference
is that the ringbuffer head/tail updates happen from drm/scheduler, instead
of directly from userspace.
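
The model summarized above — userspace writes the ring and fence values, the kernel only moves the tail pointer and enforces a TDR, shooting the context and force-completing its fences on timeout — can be sketched as a toy simulation. Names and structure here are hypothetical, not the actual drm/scheduler API:

```python
# Sketch of the proposed model: the kernel blindly trusts the
# userspace-written ring, records a TDR deadline per submission, and on
# timeout kills the context and force-completes all pending fences in order.
import time

class Context:
    def __init__(self):
        self.ring = []          # userspace-written commands + fence writes
        self.fences = []
        self.dead = False

    def submit(self, cmds, timeout=2.0):
        # Kernel side: no validation of the ring contents, only a deadline.
        self.ring.extend(cmds)
        fence = {"deadline": time.monotonic() + timeout, "signaled": False}
        self.fences.append(fence)
        return fence

    def check_tdr(self):
        now = time.monotonic()
        for f in self.fences:
            if not f["signaled"] and now > f["deadline"]:
                self.kill()     # a) shoot the entire context
                return True
        return False

    def kill(self):
        self.dead = True
        for f in self.fences:   # force-complete all pending fences, in order
            f["signaled"] = True

ctx = Context()
f = ctx.submit(["draw"], timeout=-0.001)  # fence that will never be signaled
assert ctx.check_tdr()                    # TDR fires ...
assert ctx.dead and f["signaled"]         # ... context shot, fence completed
```

Recovery (step b) would then be userspace setting up a fresh `Context` with a new ringbuffer; the dead one never executes again.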

None of this stuff needs funny tricks where the kernel

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-02 Thread Christian König



Am 02.06.21 um 20:48 schrieb Daniel Vetter:

On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:

On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:


Yes, we can't break anything because we don't want to complicate things
for us. It's pretty much all NAK'd already. We are trying to gather more
knowledge and then make better decisions.

The idea we are considering is that we'll expose memory-based sync objects
to userspace for read only, and the kernel or hw will strictly control the
memory writes to those sync objects. The hole in that idea is that
userspace can decide not to signal a job, so even if userspace can't
overwrite memory-based sync object states arbitrarily, it can still decide
not to signal them, and then a future fence is born.


This would actually be treated as a GPU hang caused by that context, so it
should be fine.

This is practically what I proposed already, except you're not doing it with
dma_fence. And on the memory fence side this also doesn't actually give
what you want for that compute model.

This seems like a bit a worst of both worlds approach to me? Tons of work
in the kernel to hide these not-dma_fence-but-almost, and still pain to
actually drive the hardware like it should be for compute or direct
display.

Also maybe I've missed it, but I didn't see any replies to my suggestion
how to fake the entire dma_fence stuff on top of new hw. Would be
interesting to know what doesn't work there instead of amd folks going off
into internal again and then coming back with another rfc from out of
nowhere :-)


Well to be honest I would just push back on our hardware/firmware guys 
that we need to keep kernel queues forever before going down that route.


That syncfile and all that Android stuff aren't working out of the box 
with the new shiny user queue submission model (which in turn is mostly 
because of Windows) has already raised some eyebrows here.


Christian.


-Daniel




Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-02 Thread Christian König

Am 02.06.21 um 11:58 schrieb Marek Olšák:
On Wed, Jun 2, 2021 at 5:44 AM Christian König  wrote:


Am 02.06.21 um 10:57 schrieb Daniel Stone:
> Hi Christian,
>
> On Tue, 1 Jun 2021 at 13:51, Christian König  wrote:
>> Am 01.06.21 um 14:30 schrieb Daniel Vetter:
>>> If you want to enable this use-case with driver magic and
without the
>>> compositor being aware of what's going on, the solution is
EGLStreams.
>>> Not sure we want to go there, but it's definitely a lot more
feasible
>>> than trying to stuff eglstreams semantics into dma-buf implicit
>>> fencing support in a desperate attempt to not change compositors.
>> Well not changing compositors is certainly not something I
would try
>> with this use case.
>>
>> Not changing compositors is more like ok we have Ubuntu 20.04
and need
>> to support that with the newest hardware generation.
> Serious question, have you talked to Canonical?
>
> I mean there's a hell of a lot of effort being expended here, but it
> seems to all be predicated on the assumption that Ubuntu's LTS
> HWE/backport policy is totally immutable, and that we might need to
> make the kernel do backflips to work around that. But ... is it? Has
> anyone actually asked them how they feel about this?

This was merely an example. What I wanted to say is that we need to
support systems already deployed.

In other words our customers won't accept that they need to
replace the
compositor just because they switch to a new hardware generation.

> I mean, my answer to the first email is 'no, absolutely not'
from the
> technical perspective (the initial proposal totally breaks
current and
> future userspace), from a design perspective (it breaks a lot of
> usecases which aren't single-vendor GPU+display+codec, or aren't
just
> a simple desktop), and from a sustainability perspective (cutting
> Android adrift again isn't acceptable collateral damage to make it
> easier to backport things to last year's Ubuntu release).
>
> But then again, I don't even know what I'm NAKing here ... ? The
> original email just lists a proposal to break a ton of things, with
> proposed replacements which aren't technically viable, and it's not
> clear why? Can we please have some more details and some reasoning
> behind them?
>
> I don't mind that userspace (compositor, protocols, clients like
Mesa
> as well as codec APIs) need to do a lot of work to support the new
> model. I do really care though that the hard-binary-switch model
works
> fine enough for AMD but totally breaks heterogeneous systems and
makes
> it impossible for userspace to do the right thing.

Well how the handling for new Android, distributions etc... is
going to
look is a completely different story.

And I completely agree with both Daniel Vetter and you that we
need to
keep this in mind when designing the compatibility with older
software.

For Android I'm really not sure what to do. In general Android is
already trying to do the right thing by using explicit sync, the
problem
is that this is built around the idea that this explicit sync is
syncfile kernel based.

Either we need to change Android and come up with something that
works
with user fences as well or we somehow invent a compatibility
layer for
syncfile as well.


What's the issue with syncfiles that syncobjs don't suffer from?


Syncobjs were designed with future fences in mind. In other words we 
already have the ability to wait for a future submission to appear with 
all the nasty locking implications.


Syncfiles on the other hand are just containers for up to N kernel 
fences, and since we don't have kernel fences any more that is rather 
tricky to keep working.


Going to look into the uAPI around syncfiles once more and see if we can 
somehow use that for user fences as well.


Christian.



Marek




Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-02 Thread Christian König

Am 02.06.21 um 10:57 schrieb Daniel Stone:

Hi Christian,

On Tue, 1 Jun 2021 at 13:51, Christian König
 wrote:

Am 01.06.21 um 14:30 schrieb Daniel Vetter:

If you want to enable this use-case with driver magic and without the
compositor being aware of what's going on, the solution is EGLStreams.
Not sure we want to go there, but it's definitely a lot more feasible
than trying to stuff eglstreams semantics into dma-buf implicit
fencing support in a desperate attempt to not change compositors.

Well not changing compositors is certainly not something I would try
with this use case.

Not changing compositors is more like ok we have Ubuntu 20.04 and need
to support that with the newest hardware generation.

Serious question, have you talked to Canonical?

I mean there's a hell of a lot of effort being expended here, but it
seems to all be predicated on the assumption that Ubuntu's LTS
HWE/backport policy is totally immutable, and that we might need to
make the kernel do backflips to work around that. But ... is it? Has
anyone actually asked them how they feel about this?


This was merely an example. What I wanted to say is that we need to 
support systems already deployed.


In other words our customers won't accept that they need to replace the 
compositor just because they switch to a new hardware generation.



I mean, my answer to the first email is 'no, absolutely not' from the
technical perspective (the initial proposal totally breaks current and
future userspace), from a design perspective (it breaks a lot of
usecases which aren't single-vendor GPU+display+codec, or aren't just
a simple desktop), and from a sustainability perspective (cutting
Android adrift again isn't acceptable collateral damage to make it
easier to backport things to last year's Ubuntu release).

But then again, I don't even know what I'm NAKing here ... ? The
original email just lists a proposal to break a ton of things, with
proposed replacements which aren't technically viable, and it's not
clear why? Can we please have some more details and some reasoning
behind them?

I don't mind that userspace (compositor, protocols, clients like Mesa
as well as codec APIs) need to do a lot of work to support the new
model. I do really care though that the hard-binary-switch model works
fine enough for AMD but totally breaks heterogeneous systems and makes
it impossible for userspace to do the right thing.


Well how the handling for new Android, distributions etc... is going to 
look is a completely different story.


And I completely agree with both Daniel Vetter and you that we need to 
keep this in mind when designing the compatibility with older software.


For Android I'm really not sure what to do. In general Android is 
already trying to do the right thing by using explicit sync, the problem 
is that this is built around the idea that this explicit sync is 
syncfile kernel based.


Either we need to change Android and come up with something that works 
with user fences as well or we somehow invent a compatibility layer for 
syncfile as well.


Regards,
Christian.



Cheers,
Daniel




Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Christian König

Am 01.06.21 um 14:30 schrieb Daniel Vetter:

On Tue, Jun 1, 2021 at 2:10 PM Christian König
 wrote:

Am 01.06.21 um 12:49 schrieb Michel Dänzer:

On 2021-06-01 12:21 p.m., Christian König wrote:

Am 01.06.21 um 11:02 schrieb Michel Dänzer:

On 2021-05-27 11:51 p.m., Marek Olšák wrote:

3) Compositors (and other privileged processes, and display flipping) can't 
trust imported/exported fences. They need a timeout recovery mechanism from the 
beginning, and the following are some possible solutions to timeouts:

a) use a CPU wait with a small absolute timeout, and display the previous 
content on timeout
b) use a GPU wait with a small absolute timeout, and conditional rendering will 
choose between the latest content (if signalled) and previous content (if timed 
out)

The result would be that the desktop can run close to 60 fps even if an app 
runs at 1 fps.

FWIW, this is working with
https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with 
implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide the 
same dma-buf poll semantics as other drivers and high priority GFX contexts via 
EGL_IMG_context_priority which can preempt lower priority ones).

Yeah, that is really nice to have.

One question is if you wait on the CPU or the GPU for the new surface to become 
available?

It's based on polling dma-buf fds, i.e. CPU.


The former is a bit bad for latency and power management.

There isn't a choice for Wayland compositors in general, since there can be 
arbitrary other state which needs to be applied atomically together with the 
new buffer. (Though in theory, a compositor might get fancy and special-case 
surface commits which can be handled by waiting on the GPU)

Latency is largely a matter of scheduling in the compositor. The latency 
incurred by the compositor shouldn't have to be more than single-digit 
milliseconds. (I've seen total latency from when the client starts processing a 
(static) frame to when it starts being scanned out as low as ~6 ms with 
https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than typical 
with Xorg)

Well let me describe it like this:

We have use cases for a 144 Hz guaranteed refresh rate. That
essentially means that the client application needs to be able to spit
out one frame/window content every ~6.9ms. That's tough, but doable.

When you now add 6ms latency in the compositor that means the client
application has only .9ms left for its frame which is basically
impossible to do.

See for the user fences handling the display engine will learn to read
sequence numbers from memory and decide on its own if the old frame or
the new one is scanned out. To get the latency there as low as possible.

This won't work with implicit sync at all.
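
The scanout decision described here — the display engine reading a sequence number from memory and picking the newest frame whose fence has completed — could look roughly like this (purely illustrative):

```python
# Sketch of scanout choosing between the old and new frame based on a
# sequence number read from memory, as described above.

def pick_scanout_buffer(frames, completed_seq):
    """frames: list of (required_seq, buffer) pairs, newest last.
    Scan out the newest frame whose fence sequence number has completed;
    keep showing the oldest frame if nothing new is ready."""
    for required_seq, buf in reversed(frames):
        if required_seq <= completed_seq:
            return buf
    return frames[0][1]   # nothing ready yet: stay on the previous content

frames = [(1, "frame_A"), (2, "frame_B"), (3, "frame_C")]
assert pick_scanout_buffer(frames, completed_seq=2) == "frame_B"  # C not ready
assert pick_scanout_buffer(frames, completed_seq=9) == "frame_C"  # newest ready
assert pick_scanout_buffer(frames, completed_seq=0) == "frame_A"  # nothing ready
```

Because the decision happens at scanout time from a memory read, no CPU round trip sits between the client finishing a frame and it appearing on screen, which is the point about keeping latency low.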

If you want to enable this use-case with driver magic and without the
compositor being aware of what's going on, the solution is EGLStreams.
Not sure we want to go there, but it's definitely a lot more feasible
than trying to stuff eglstreams semantics into dma-buf implicit
fencing support in a desperate attempt to not change compositors.


Well not changing compositors is certainly not something I would try 
with this use case.


Not changing compositors is more like ok we have Ubuntu 20.04 and need 
to support that with the newest hardware generation.



I still think the most reasonable approach here is that we wrap a
dma_fence compat layer/mode over new hw for existing
userspace/compositors. And then enable userspace memory fences and the
fancy new features those allow with a new model that's built for them.


Yeah, that's basically the same direction I'm heading. Question is how 
to fix all those details.



Also even with dma_fence we could implement your model of staying with
the previous buffer (or an older buffer at that's already rendered),
but it needs explicit involvement of the compositor. At least without
adding eglstreams fd to the kernel and wiring up all the relevant
extensions.


Question is do we already have some extension which allows different 
textures to be selected on the fly depending on some state?


E.g. something like use new frame if it's available and old frame otherwise.

If you then apply this to the standard dma_fence based hardware or the 
new user fence based one is then pretty much irrelevant.


Regards,
Christian.


-Daniel


Another question is if that is sufficient as security for the display server or 
if we need further handling down the road? I mean essentially we are moving the 
reliability problem into the display server.

Good question. This should generally protect the display server from freezing 
due to client fences never signalling, but there might still be corner cases.








Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Christian König

Am 01.06.21 um 12:49 schrieb Michel Dänzer:

On 2021-06-01 12:21 p.m., Christian König wrote:

Am 01.06.21 um 11:02 schrieb Michel Dänzer:

On 2021-05-27 11:51 p.m., Marek Olšák wrote:

3) Compositors (and other privileged processes, and display flipping) can't 
trust imported/exported fences. They need a timeout recovery mechanism from the 
beginning, and the following are some possible solutions to timeouts:

a) use a CPU wait with a small absolute timeout, and display the previous 
content on timeout
b) use a GPU wait with a small absolute timeout, and conditional rendering will 
choose between the latest content (if signalled) and previous content (if timed 
out)

The result would be that the desktop can run close to 60 fps even if an app 
runs at 1 fps.

FWIW, this is working with
https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with 
implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide the 
same dma-buf poll semantics as other drivers and high priority GFX contexts via 
EGL_IMG_context_priority which can preempt lower priority ones).

Yeah, that is really nice to have.

One question is if you wait on the CPU or the GPU for the new surface to become 
available?

It's based on polling dma-buf fds, i.e. CPU.


The former is a bit bad for latency and power management.

There isn't a choice for Wayland compositors in general, since there can be 
arbitrary other state which needs to be applied atomically together with the 
new buffer. (Though in theory, a compositor might get fancy and special-case 
surface commits which can be handled by waiting on the GPU)

Latency is largely a matter of scheduling in the compositor. The latency 
incurred by the compositor shouldn't have to be more than single-digit 
milliseconds. (I've seen total latency from when the client starts processing a 
(static) frame to when it starts being scanned out as low as ~6 ms with 
https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than typical 
with Xorg)


Well let me describe it like this:

We have use cases for a 144 Hz guaranteed refresh rate. That 
essentially means that the client application needs to be able to spit 
out one frame/window content every ~6.9ms. That's tough, but doable.


When you now add 6ms latency in the compositor that means the client 
application has only .9ms left for its frame which is basically 
impossible to do.


See for the user fences handling the display engine will learn to read 
sequence numbers from memory and decide on its own if the old frame or 
the new one is scanned out. To get the latency there as low as possible.



Another question is if that is sufficient as security for the display server or 
if we need further handling down the road? I mean essentially we are moving the 
reliability problem into the display server.

Good question. This should generally protect the display server from freezing 
due to client fences never signalling, but there might still be corner cases.






Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Christian König

Am 01.06.21 um 11:02 schrieb Michel Dänzer:

On 2021-05-27 11:51 p.m., Marek Olšák wrote:

3) Compositors (and other privileged processes, and display flipping) can't 
trust imported/exported fences. They need a timeout recovery mechanism from the 
beginning, and the following are some possible solutions to timeouts:

a) use a CPU wait with a small absolute timeout, and display the previous 
content on timeout
b) use a GPU wait with a small absolute timeout, and conditional rendering will 
choose between the latest content (if signalled) and previous content (if timed 
out)

The result would be that the desktop can run close to 60 fps even if an app 
runs at 1 fps.

FWIW, this is working with
https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with 
implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide the 
same dma-buf poll semantics as other drivers and high priority GFX contexts via 
EGL_IMG_context_priority which can preempt lower priority ones).


Yeah, that is really nice to have.

One question is if you wait on the CPU or the GPU for the new surface to 
become available? The former is a bit bad for latency and power management.


Another question is if that is sufficient as security for the display 
server or if we need further handling down the road? I mean essentially 
we are moving the reliability problem into the display server.


Regards,
Christian.


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-05-31 Thread Christian König
Yes, exactly that's my thinking and also the reason why I'm pondering so 
hard on the requirement that the memory for shared user fences should 
not be modifiable by userspace directly.


Christian.

Am 29.05.21 um 05:33 schrieb Marek Olšák:

My first email can be ignored except for the sync files. Oh well.

I think I see what you mean, Christian. If we assume that an imported 
fence is always read only (the buffer with the sequence number is read 
only), only the process that created and exported the fence can signal 
it. If the fence is not signaled, the exporting process is guilty. The 
only thing the importing process must do when it's about to use the 
fence as a dependency is to notify the kernel about it. Thus, the 
kernel will always know the dependency graph. Then if the importing 
process times out, the kernel will blame any of the processes that 
passed it a fence that is still unsignaled. The kernel will blame the 
process that timed out only if all imported fences have been signaled. 
It seems pretty robust.
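A toy model of that blame rule, with invented names, just to make the scheme concrete: on a timeout the kernel first blames any exporter whose imported fence is still unsignaled, and only if every import has signaled does it blame the waiting process itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative only: models the dependency-graph blame rule above. */

struct imported_fence {
    int exporter_pid;   /* process that created/exported the fence */
    bool signaled;
};

/* Returns the pid to blame for the timeout of 'waiter_pid'. */
int blame_for_timeout(int waiter_pid,
                      const struct imported_fence *deps, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!deps[i].signaled)
            return deps[i].exporter_pid;  /* exporter is guilty */
    return waiter_pid;  /* all imports signaled: waiter is guilty */
}
```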


It's the same with implicit sync except that the buffer with the 
sequence number is writable. Any process that has an implicitly-sync'd 
buffer can set the sequence number to 0 or UINT64_MAX. 0 will cause a 
timeout for the next job, while UINT64_MAX might cause a timeout a 
little later. The timeout can be mitigated by the kernel because the 
kernel knows the greatest number that should be there, but it's not 
possible to know which process is guilty (all processes holding the 
buffer handle would be suspects).


Marek

On Fri, May 28, 2021 at 6:25 PM Marek Olšák <mailto:mar...@gmail.com>> wrote:


If both implicit and explicit synchronization are handled the
same, then the kernel won't be able to identify the process that
caused an implicit sync deadlock. The process that is stuck
waiting for a fence can be innocent, and the kernel can't punish
it. Likewise, the GPU reset guery that reports which process is
guilty and innocent will only be able to report unknown. Is that OK?

Marek

On Fri, May 28, 2021 at 10:41 AM Christian König
mailto:ckoenig.leichtzumer...@gmail.com>> wrote:

Hi Marek,

well I don't think that implicit and explicit synchronization
needs to be mutually exclusive.

What we should do is to have the ability to transport a
synchronization object with each BO.

Implicit and explicit synchronization then basically become
the same, they just transport the synchronization object
differently.

The biggest problem are the sync_files for Android, since they
are really not easy to support at all. If Android wants to
support user queues we would probably have to do some changes
there.

Regards,
Christian.

Am 27.05.21 um 23:51 schrieb Marek Olšák:

Hi,

Since Christian believes that we can't deadlock the kernel
with some changes there, we just need to make everything nice
for userspace too. Instead of explaining how it will work, I
will explain the cases where future hardware (and its kernel
driver) will break existing userspace in order to protect
everybody from deadlocks. Anything that uses implicit sync
will be spared, so X and Wayland will be fine, assuming they
don't import/export fences. Those use cases that do
import/export fences might or might not work, depending on
how the fences are used.

One of the necessities is that all fences will become future
fences. The semantics of imported/exported fences will change
completely and will have new restrictions on the usage. The
restrictions are:


1) Android sync files will be impossible to support, so won't
be supported. (they don't allow future fences)


2) Implicit sync and explicit sync will be mutually exclusive
between processes. A process can either use one or the other,
but not both. This is meant to prevent a deadlock condition
with future fences where any process can malevolently
deadlock execution of any other process, even execution of a
higher-privileged process. The kernel will impose the
following restrictions to protect against the deadlock:

a) a process with an implicitly-sync'd imported/exported
buffer can't import/export a fence from/to another process
b) a process with an imported/exported fence can't
import/export an implicitly-sync'd buffer from/to another process

Alternative: A higher-privileged process could enforce both
restrictions instead of the kernel to protect itself from the
deadlock, but this would be a can of worms for existing
userspace. It would be better if the kernel just broke unsafe
userspace on future hw, just like sync files.

If both implicit an

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-05-28 Thread Christian König

Hi Marek,

well I don't think that implicit and explicit synchronization needs to 
be mutually exclusive.


What we should do is to have the ability to transport a synchronization 
object with each BO.


Implicit and explicit synchronization then basically become the same, 
they just transport the synchronization object differently.


The biggest problem are the sync_files for Android, since they are 
really not easy to support at all. If Android wants to support user 
queues we would probably have to do some changes there.


Regards,
Christian.

Am 27.05.21 um 23:51 schrieb Marek Olšák:

Hi,

Since Christian believes that we can't deadlock the kernel with some 
changes there, we just need to make everything nice for userspace too. 
Instead of explaining how it will work, I will explain the cases where 
future hardware (and its kernel driver) will break existing userspace 
in order to protect everybody from deadlocks. Anything that uses 
implicit sync will be spared, so X and Wayland will be fine, assuming 
they don't import/export fences. Those use cases that do import/export 
fences might or might not work, depending on how the fences are used.


One of the necessities is that all fences will become future fences. 
The semantics of imported/exported fences will change completely and 
will have new restrictions on the usage. The restrictions are:



1) Android sync files will be impossible to support, so won't be 
supported. (they don't allow future fences)



2) Implicit sync and explicit sync will be mutually exclusive between 
processes. A process can either use one or the other, but not both. This 
is meant to prevent a deadlock condition with future fences where any 
process can malevolently deadlock execution of any other process, even 
execution of a higher-privileged process. The kernel will impose the 
following restrictions to protect against the deadlock:


a) a process with an implicitly-sync'd imported/exported buffer can't 
import/export a fence from/to another process
b) a process with an imported/exported fence can't import/export an 
implicitly-sync'd buffer from/to another process


Alternative: A higher-privileged process could enforce both 
restrictions instead of the kernel to protect itself from the 
deadlock, but this would be a can of worms for existing userspace. It 
would be better if the kernel just broke unsafe userspace on future 
hw, just like sync files.


If both implicit and explicit sync are allowed to occur 
simultaneously, sending a future fence that will never signal to any 
process will deadlock that process after it acquires the implicit sync 
lock, which is a sequence number that the process is required to write 
to memory and send an interrupt from the GPU in a finite time. This is 
how the deadlock can happen:


* The process gets sequence number N from the kernel for an 
implicitly-sync'd buffer.
* The process inserts (into the GPU user-mapped queue) a wait for 
sequence number N-1.
* The process inserts a wait for a fence, but it doesn't know that it 
will never signal ==> deadlock.

...
* The process inserts a command to write sequence number N to a 
predetermined memory location. (which will make the buffer idle and 
send an interrupt to the kernel)

...
* The kernel will terminate the process because it has never received 
the interrupt. (i.e. a less-privileged process just killed a 
more-privileged process)


It's the interrupt for implicit sync that never arrived that caused 
the termination, and the only way another process can cause it is by 
sending a fence that will never signal. Thus, importing/exporting 
fences from/to other processes can't be allowed simultaneously with 
implicit sync.
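The deadlock sequence above can be modelled with a toy user-queue walk (all names invented, purely illustrative): if a wait on a fence that never signals sits before the write of sequence number N, the write never executes, the kernel never gets its interrupt, and the queue's owner looks guilty even though the bad fence came from elsewhere.

```c
#include <stdbool.h>
#include <stddef.h>

enum op_kind { OP_WAIT, OP_WRITE_SEQ };

struct queue_op {
    enum op_kind kind;
    bool wait_will_signal;  /* only meaningful for OP_WAIT */
};

/* Returns true if the queue eventually writes its sequence number
 * (i.e. the implicit-sync interrupt would eventually arrive). */
bool queue_completes(const struct queue_op *ops, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (ops[i].kind == OP_WAIT && !ops[i].wait_will_signal)
            return false;              /* stuck forever: deadlock */
        if (ops[i].kind == OP_WRITE_SEQ)
            return true;               /* sequence number written */
    }
    return false;
}
```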



3) Compositors (and other privileged processes, and display flipping) 
can't trust imported/exported fences. They need a timeout recovery 
mechanism from the beginning, and the following are some possible 
solutions to timeouts:


a) use a CPU wait with a small absolute timeout, and display the 
previous content on timeout
b) use a GPU wait with a small absolute timeout, and conditional 
rendering will choose between the latest content (if signalled) and 
previous content (if timed out)


The result would be that the desktop can run close to 60 fps even if 
an app runs at 1 fps.


*Redefining imported/exported fences and breaking some users/OSs is 
the only way to have userspace GPU command submission, and the 
deadlock example here is the counterexample proving that there is no 
other way.*


So, what are the chances this is going to fly with the ecosystem?

Thanks,
Marek




Re: [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

2021-05-26 Thread Christian König

Am 25.05.21 um 17:23 schrieb Daniel Vetter:

On Tue, May 25, 2021 at 5:05 PM Christian König
 wrote:

Hi Daniel,

Am 25.05.21 um 15:05 schrieb Daniel Vetter:

Hi Christian,

On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:

Am 21.05.21 um 20:31 schrieb Daniel Vetter:
This works by adding the fence of the last eviction DMA operation to BOs
when their backing store is newly allocated. That's what the
ttm_bo_add_move_fence() function you stumbled over is good for: 
https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo.c#L692

Now the problem is it is possible that the application is terminated before
it can complete it's command submission. But since resource management only
waits for the shared fences when there are some there is a chance that we
free up memory while it is still in use.

Hm where is this code? Would help in my audit that I wanted to do this
week? If you look at most other places like
drm_gem_fence_array_add_implicit() I mentioned earlier, then we don't
treat the shared fences specially and always also include the exclusive one.

See amdgpu_gem_object_close():

...
  fence = dma_resv_get_excl(bo->tbo.base.resv);
  if (fence) {
  amdgpu_bo_fence(bo, fence, true);
  fence = NULL;
  }
...

We explicitly added that because resource management of some other
driver was going totally bananas without that.

But I'm not sure which one that was. Maybe dig a bit in the git and
mailing history of that.

Hm I looked and it's

commit 82c416b13cb7d22b96ec0888b296a48dff8a09eb
Author: Christian König 
Date:   Thu Mar 12 12:03:34 2020 +0100

drm/amdgpu: fix and cleanup amdgpu_gem_object_close v4

That sounded more like amdgpu itself needing this, not another driver?


No, that patch was just a follow up moving the functionality around.


But looking at amdgpu_vm_bo_update_mapping() it seems to pick the
right fencing mode for gpu pte clearing, so I'm really not sure what
the bug was that you worked around here? The implementation boils down
to amdgpu_sync_resv() which syncs for the exclusive fence, always. And
there's nothing else that I could find in public history at least, no
references to bug reports or anything. I think you need to dig
internally, because as-is I'm not seeing the problem here.

Or am I missing something here?


See the code here for example: 
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/nouveau/nouveau_fence.c#L361


Nouveau assumes that when a shared fence is present it doesn't need to 
wait for the exclusive one because the shared ones are always supposed to 
finish after the exclusive one.


But for page table unmap fences that isn't true and we ran into a really 
nasty and hard to reproduce bug because of this.


I think it would be much more defensive if we could say that we always 
wait for the exclusive fence and fix the use case in nouveau and double 
check if somebody else does stuff like that as well.
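The defensive variant could look roughly like this. This is a userspace model, not the real dma_resv API; it only shows the rule of never skipping the exclusive fence when gathering dependencies:

```c
#include <stddef.h>

struct fence { int id; };

struct resv_model {
    struct fence *exclusive;
    struct fence **shared;
    size_t num_shared;
};

/* Writes all fences a new job must wait on into 'out' (capacity
 * 'cap'); returns how many were written. The exclusive fence is
 * always included, even when shared fences are present. */
size_t collect_deps(const struct resv_model *r,
                    struct fence **out, size_t cap)
{
    size_t n = 0;
    if (r->exclusive && n < cap)
        out[n++] = r->exclusive;     /* never skip the exclusive one */
    for (size_t i = 0; i < r->num_shared && n < cap; i++)
        out[n++] = r->shared[i];
    return n;
}
```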


Christian.


-Daniel




Re: [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

2021-05-25 Thread Christian König

Hi Daniel,

Am 25.05.21 um 15:05 schrieb Daniel Vetter:

Hi Christian,

On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:

Am 21.05.21 um 20:31 schrieb Daniel Vetter:

[SNIP]

We could provide an IOCTL for the BO to change the flag.

That's not the semantics we need.


But could we first figure out the semantics we want to use here?

Cause I'm pretty sure we don't actually need those changes at all and as
said before I'm certainly NAKing things which break existing use cases.

Please read how other drivers do this and at least _try_ to understand
it. I'm really losing my patience here with you NAKing patches you're
not even understanding (or did you actually read and fully understand
the entire story I typed up here, and your NAK is on the entire
thing?). There's not much useful conversation to be had with that
approach. And with drivers I mean kernel + userspace here.

Well to be honest I did fully read that, but I was just too emotionally
attached to answer more appropriately in that moment.

And I'm sorry that I react emotionally to that, but it is really frustrating
that I'm not able to convince you that we have a major problem which affects
all drivers and not just amdgpu.

Regarding the reason why I'm NAKing this particular patch, you are breaking
existing uAPI for RADV with that. And as a maintainer of the driver I have
simply no other choice than saying halt, stop we can't do it like this.

I'm perfectly aware that I've some holes in the understanding of how ANV or
other Vulkan/OpenGL stacks work. But you should probably also admit that you
have some holes in how amdgpu works or otherwise I can't imagine why you
suggest a patch which simply breaks RADV.

I mean we are working together for years now and I think you know me pretty
well, do you really think I scream bloody hell we can't do this without a
good reason?

So let's stop throwing half-baked solutions at each other and discuss what
we can do to solve the different problems we are both seeing here.

Well this was meant to be a goal post/semantics discussion starter. Yes
the patch breaks performance (but not correctness) for amdgpu, but it also
contains my suggestion for how to fix that issue (in text form at least).


Well as far as I can see this really breaks uAPI, we have unit tests 
exercising this.


But see more on this below.


Plus a plan how to roll it out so that anyone who cares doesn't hit the
perf issues this patch can cause.

Also the overall series is really meant as a subsystem wide assessment of
the status quo. Aside from a correctness issue Lucas spotted in my panfrost
patches no substantial input from others on this yet unfortunately. I need
to poke more people I think.

Anyway since the plan as a text didn't stick I'm typing up now something
more detailed in form of amdgpu patches. Maybe Bas can do the radv
conversion too.

It won't be complete by far either (I'm not working for amd after all
...), I'll leave out the opengl/media side entirely. But if this works for
radv is should be a useful blueprint for gl/media too (with some changes
in the interfaces, but not really the exposed semantics).


Yeah, but to make my point clear once more: I can't allow any patch in 
that would change amdgpu's existing uAPI behavior.


What we can talk about is changing the behavior by adding flags to the 
fpriv to change the behavior and/or stuff the CS fence by default into 
the exclusive slot.


For the latter I think we could do something like using a dma_fence_chain 
for the exclusive fence in amdgpu. This way we would have the same 
semantics in the CS and still support the epoll and Jasons new import IOCTL.


But the semantics of the amdgpu CS interface, not serializing submissions 
from the same process and always serializing when you see some different 
process, must stay the same or otherwise I have quite a bunch of angry end users.



That's the other frustration part: You're trying to fix this purely in
the kernel. This is exactly one of these issues why we require open
source userspace, so that we can fix the issues correctly across the
entire stack. And meanwhile you're steadfastly refusing to even look
at the userspace side of the picture.

Well I do fully understand the userspace side of the picture for the AMD
stack. I just don't think we should give userspace that much control over
the fences in the dma_resv object without untangling them from resource
management.

And RADV is exercising exclusive sync for amdgpu already. You can do
submission to both the GFX, Compute and SDMA queues in Vulkan and those
currently won't over-synchronize.

When you then send a texture generated by multiple engines to the Compositor
the kernel will correctly insert waits for all submissions of the other
process.

So this already works for RADV and completely without the IOCTL Jason
proposed. IIRC we also have unit tests which exercised that feature for the
video decoding use case long before RADV even existed.

Yeah multiple engines

Re: [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

2021-05-22 Thread Christian König

Am 21.05.21 um 20:31 schrieb Daniel Vetter:

[SNIP]

We could provide an IOCTL for the BO to change the flag.

That's not the semantics we need.


But could we first figure out the semantics we want to use here?

Cause I'm pretty sure we don't actually need those changes at all and as
said before I'm certainly NAKing things which break existing use cases.

Please read how other drivers do this and at least _try_ to understand
it. I'm really losing my patience here with you NAKing patches you're
not even understanding (or did you actually read and fully understand
the entire story I typed up here, and your NAK is on the entire
thing?). There's not much useful conversation to be had with that
approach. And with drivers I mean kernel + userspace here.


Well to be honest I did fully read that, but I was just too emotionally 
attached to answer more appropriately in that moment.


And I'm sorry that I react emotionally to that, but it is really 
frustrating that I'm not able to convince you that we have a major 
problem which affects all drivers and not just amdgpu.


Regarding the reason why I'm NAKing this particular patch, you are 
breaking existing uAPI for RADV with that. And as a maintainer of the 
driver I have simply no other choice than saying halt, stop we can't do 
it like this.


I'm perfectly aware that I've some holes in the understanding of how ANV 
or other Vulkan/OpenGL stacks work. But you should probably also admit 
that you have some holes in how amdgpu works or otherwise I can't imagine 
why you suggest a patch which simply breaks RADV.


I mean we are working together for years now and I think you know me 
pretty well, do you really think I scream bloody hell we can't do this 
without a good reason?


So let's stop throwing half-baked solutions at each other and discuss 
what we can do to solve the different problems we are both seeing here.



That's the other frustration part: You're trying to fix this purely in
the kernel. This is exactly one of these issues why we require open
source userspace, so that we can fix the issues correctly across the
entire stack. And meanwhile you're steadfastly refusing to even look
at the userspace side of the picture.


Well I do fully understand the userspace side of the picture for the AMD 
stack. I just don't think we should give userspace that much control 
over the fences in the dma_resv object without untangling them from 
resource management.


And RADV is exercising exclusive sync for amdgpu already. You can do 
submission to both the GFX, Compute and SDMA queues in Vulkan and those 
currently won't over-synchronize.


When you then send a texture generated by multiple engines to the 
Compositor the kernel will correctly insert waits for all submissions 
of the other process.


So this already works for RADV and completely without the IOCTL Jason 
proposed. IIRC we also have unit tests which exercised that feature for 
the video decoding use case long before RADV even existed.


And yes I have to admit that I haven't thought about interaction with 
other drivers when I came up with this because the rules of that 
interaction weren't clear to me at that time.



Also I thought through your tlb issue, why are you even putting these
tlb flush fences into the shared dma_resv slots? If you store them
somewhere else in the amdgpu private part, the oversync issues goes
away
- in your ttm bo move callback, you can just make your bo copy job
depend on them too (you have to anyway)
- even for p2p there's not an issue here, because you have the
->move_notify callback, and can then lift the tlb flush fences from
your private place to the shared slots so the exporter can see them.


Because adding a shared fence requires that this shared fence signals 
after the exclusive fence. And this is a perfect example to explain why 
this is so problematic and also why we currently stumble over that 
only in amdgpu.


In TTM we have a feature which allows evictions to be pipelined, without 
waiting for the evicting DMA operation. Without that, drivers will 
stall waiting for allocations to finish whenever we need to allocate 
memory.


For certain use cases this gives you a ~20% fps increase under memory 
pressure, so it is a really important feature.


This works by adding the fence of the last eviction DMA operation to BOs 
when their backing store is newly allocated. That's what the 
ttm_bo_add_move_fence() function you stumbled over is good for: 
https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo.c#L692


Now the problem is it is possible that the application is terminated 
before it can complete it's command submission. But since resource 
management only waits for the shared fences when there are some there is 
a chance that we free up memory while it is still in use.
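A simplified model of that bookkeeping follows. The names are invented; the real code is the ttm_bo_add_move_fence() function linked above. The idea is just that a freshly allocated buffer carries the fence of the eviction blit still clearing its memory, and every new job must depend on it:

```c
#include <stdbool.h>
#include <stddef.h>

struct move_fence { bool signaled; };

struct buffer {
    /* fence of the eviction blit still clearing this memory, if any */
    struct move_fence *moving;
};

/* A new job on 'buf' must depend on the pending eviction, if any. */
struct move_fence *job_move_dependency(const struct buffer *buf)
{
    if (buf->moving && !buf->moving->signaled)
        return buf->moving;
    return NULL;  /* memory already idle, job can start immediately */
}
```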


Because of this we have some rather crude workarounds in amdgpu. For 
example IIRC we manually wait for any potential exclusive fence before 
freeing memory.


We 

Re: [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

2021-05-21 Thread Christian König

Am 21.05.21 um 17:16 schrieb Daniel Vetter:

On Fri, May 21, 2021 at 05:00:46PM +0200, Bas Nieuwenhuizen wrote:

On Fri, May 21, 2021 at 4:37 PM Daniel Vetter  wrote:

On Fri, May 21, 2021 at 11:46:23AM +0200, Bas Nieuwenhuizen wrote:

On Fri, May 21, 2021 at 11:10 AM Daniel Vetter  wrote:

---
  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 88a24a0b5691..cc8426e1e8a8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -617,8 +617,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
 amdgpu_bo_list_for_each_entry(e, p->bo_list) {
 struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);

-   /* Make sure we use the exclusive slot for shared BOs */
-   if (bo->prime_shared_count)
+   /* Make sure we use the exclusive slot for all potentially 
shared BOs */
+   if (!(bo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID))
 e->tv.num_shared = 0;

I think it also makes sense to skip this with
AMDGPU_GEM_CREATE_EXPLICIT_SYNC? It can be shared but I don't think
anyone expects implicit sync to happen with those.

Ah yes, I missed this entirely. So the "no implicit flag" is already
there, and the _only_ thing that's missing really is a way to fish out the
implicit fences, and set them.

https://lore.kernel.org/dri-devel/20210520190007.534046-1-ja...@jlekstrand.net/

So I think all that's really needed in radv is not setting
RADEON_FLAG_IMPLICIT_SYNC for winsys buffers when Jason's dma-buf ioctl
are present (means you need to do some import/export and keep the fd
around for winsys buffers, but shouldn't be too bad), and then control the
implicit fences entirely explicitly like vk expects.

That is the part I'm less sure about. This is a BO wide flag so we are
also disabling implicit sync in the compositor. If the compositor does
only do read stuff that is ok, as the inserted exclusive fence will
work for that. But as I learned recently the app provided buffer may
end up being written to by the X server which open a whole can of
potential problems if implicit sync gets disabled between Xserver
operations on the app provided buffer. Hence setting that on the WSI
buffer is a whole new can of potential problems and hence I've said a
submission based flag would be preferred.

I can certainly try it out though.

Hm yeah that's the wrong flag. We need a flag on the drm_file which the
explicit userspace sets. And which is valid only for itself.

There's a nice flags field when creating a ctx, but it's not validated and
there's already a comment that we have to filter out garbage priority, so
that's not usable. I'll whip up something entirely untested just as a draft.


We could provide an IOCTL for the BO to change the flag.

But could we first figure out the semantics we want to use here?

Cause I'm pretty sure we don't actually need those changes at all and as 
said before I'm certainly NAKing things which break existing use cases.


Regards,
Christian.


-Daniel




Are you bored enough to type this up for radv? I'll give Jason's kernel
stuff another review meanwhile.
-Daniel


 e->bo_va = amdgpu_vm_bo_find(vm, bo);
 }
--
2.31.0


--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch




Re: [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

2021-05-21 Thread Christian König
o msm_gem_sync_object(). Investing into
   a scheduler might be a good idea.

- all the remaining drivers are ttm based, where I hope they do
   appropriately obey implicit fences already. I didn't do the full
   audit there because a) not follow the contract would confuse ttm
   quite well and b) reading non-standard scheduler and submit code
   which isn't based on drm/scheduler is a pain.

Onwards to the display side.

- Any driver using the drm_gem_plane_helper_prepare_fb() helper will
   handle this correctly. Overwhelmingly most drivers get this right,
   except a few that totally don't. I'll follow up with a patch to make
   this the default and avoid a bunch of bugs.

- I didn't audit the ttm drivers, but given that dma_resv started
   there I hope they get this right.

In conclusion this IS the contract, both as documented and
overwhelmingly implemented, specically as implemented by all render
drivers except amdgpu.

Amdgpu tried to fix this already in

commit 049aca4363d8af87cab8d53de5401602db3b
Author: Christian König 
Date:   Wed Sep 19 16:54:35 2018 +0200

 drm/amdgpu: fix using shared fence for exported BOs v2

but this fix falls short on a number of areas:

- It's racy, by the time the buffer is shared it might be too late. To
   make sure there's definitely never a problem we need to set the
   fences correctly for any buffer that's potentially exportable.

- It's breaking uapi, dma-buf fds support poll() and differentiate
   between read and write access, which was introduced in

commit 9b495a5887994a6d74d5c261d012083a92b94738
Author: Maarten Lankhorst 
Date:   Tue Jul 1 12:57:43 2014 +0200

    dma-buf: add poll support, v3

- Christian König wants to nack new uapi building further on this
   dma_resv contract because it breaks amdgpu, quoting

   "Yeah, and that is exactly the reason why I will NAK this uAPI change.

   "This doesn't works for amdgpu at all for the reasons outlined above."

   
https://lore.kernel.org/dri-devel/f2eb6751-2f82-9b23-f57e-548de5b729de@gmail.com/

   Rejecting new development because your own driver is broken and
   violates established cross driver contracts and uapi is really not
   how upstream works.

Now this patch will have a severe performance impact on anything that
runs on multiple engines. So we can't just merge it outright, but need
a bit a plan:

- amdgpu needs a proper uapi for handling implicit fencing. The funny
   thing is that to do it correctly, implicit fencing must be treated
   as a very strange IPC mechanism for transporting fences, where both
   setting the fence and dependency intercepts must be handled
   explicitly. Current best practice is a per-bo flag to indicate
   writes, and a per-bo flag to skip implicit fencing in the CS
   ioctl as a new chunk.

- Since amdgpu has been shipping with broken behaviour we need an
   opt-out flag from the butchered implicit fencing model to enable the
   proper explicit implicit fencing model.

- for kernel memory fences due to bo moves at least the i915 idea is
   to use ttm_bo->moving. amdgpu probably needs the same.

- since the current p2p dma-buf interface assumes the kernel memory
   fence is in the exclusive dma_resv fence slot we need to add a new
   fence slot for kernel fences, which must never be ignored. Since
   currently only amdgpu supports this there's no real problem here
   yet, until amdgpu gains a NO_IMPLICIT CS flag.

- New userspace needs to ship in enough desktop distros so that users
   wont notice the perf impact. I think we can ignore LTS distros who
   upgrade their kernels but not their mesa3d snapshot.

- Then when this is all in place we can merge this patch here.

What is not a solution to this problem here is trying to make the
dma_resv rules in the kernel more clever. The fundamental issue here
is that the amdgpu CS uapi is the least expressive one across all
drivers (only equalled by panfrost, which has an actual excuse) by not
allowing any userspace control over how implicit sync is conducted.

Until this is fixed it's completely pointless to make the kernel more
clever to improve amdgpu, because all we're doing is papering over
this uapi design issue. amdgpu needs to attain the status quo
established by other drivers first, once that's achieved we can tackle
the remaining issues in a consistent way across drivers.

Cc: mesa-dev@lists.freedesktop.org
Cc: Bas Nieuwenhuizen 
Cc: Dave Airlie 
Cc: Rob Clark 
Cc: Kristian H. Kristensen 
Cc: Michel Dänzer 
Cc: Daniel Stone 
Cc: Sumit Semwal 
Cc: "Christian König" 
Cc: Alex Deucher 
Cc: Daniel Vetter 
Cc: Deepak R Varma 
Cc: Chen Li 

Re: [Mesa-dev] [Intel-gfx] [RFC 2/2] drm/doc/rfc: i915 new parallel submission uAPI plan

2021-05-21 Thread Christian König

On 20.05.21 at 23:38, Jason Ekstrand wrote:

On Thu, May 20, 2021 at 10:46 AM Matthew Brost  wrote:

On Thu, May 20, 2021 at 01:11:59PM +0200, Christian König wrote:

On 19.05.21 at 18:51, Matthew Brost wrote:

On Wed, May 19, 2021 at 01:45:39PM +0200, Christian König wrote:

Oh, yeah we call that gang submit on the AMD side.

We already had some internal discussions about how to implement this, but so far
couldn't figure out how to cleanly introduce that into the DRM scheduler.

Can you briefly describe in a few words how that is supposed to work on the
Intel side?

On Intel, we actually have two cases which don't fit the current
drm/scheduler model well: balanced and bonded.

In the balanced model, we want to submit a batch which can go to any
one of some set of engines and we don't care which.  It's up to the
kernel to pick an engine.  Imagine you had 64 identical HW compute
queues, for instance.  This could be done by making all the identical
engines share a single drm_gpu_scheduler and round-robin around the HW
queues or something.  I don't know that we strictly need drm/scheduler
to be aware of it but it might be nice if it grew support for this
mode so we could maintain a 1:1 relationship between HW queues and
drm_gpu_schedulers.  That said, I'm not sure how this would play with
GuC queues so maybe it doesn't help?


Oh, we do have support for load balancing like that.

When you call drm_sched_entity_init() you can give a list of 
drm_gpu_scheduler object to use round robing for scheduling.


New jobs are then scheduler to the drm_gpu_scheduler instance which is 
idle or rather the least busy one.



The bonded model is like your ganged, I think.  We want to submit N
batches to run in parallel.  And they actually have to be executing on
the GPU simultaneously and not just sort-of at similar times.  We need
this for video.  There are also potential use-cases in Vulkan or even
GL that might be able to use this.  One difference with the balanced
mode is that bonds don't, strictly speaking, need to be on the same
type of engine.  Imagine, for instance, a 3D batch with a parallel
compute batch doing vertex pre-processing.

I'm pretty sure the bonded case is something that the mobile drivers
(panfrost, etc.) would like as well for doing Vulkan on tilers where
you often have to have two command buffers running in parallel.
They're currently doing it by submitting a giant pile of batches where
they split the batch and add sync primitives every time some GL call
requires them to sync between fragment and vertex pipes.


Yeah, we have exactly the same problem as well.

But so far every model we discussed has some drawbacks and it is rather 
hard for the scheduler to guarantee that stuff runs at the same time.


So if you got any ideas how to cleanly implement that then they would be 
rather welcomed.


Regards,
Christian.



So, to sum up, I think there's likely some good collaboration to be
had here for everyone. :-)

--Jason


Sure, I've done a quick PoC internally and have been able to hook this
into the DRM scheduler.

Basically each BB still maps to a single job as each job is somewhat
unique (e.g. each job has its own ring, lrc, seqno, etc...). However all
the jobs configured to run in parallel map to a single sched_entity
which maintains the order each job was generated from the execbuf IOCTL
(1 - N). When the backend receives jobs 1 to N - 1 it basically just
updates some internal state. When the backend sees job N (last job) it
actually does the submit for jobs 1 - N which with GuC submission is a
simple command moving the LRC tail of the N jobs.

Daniel has suggested that we create a single job for the N BBs but that
would be huge rework to the internals of the i915 and likely won't
happen by the time this code first lands.

Also worth noting one way a job isn't really treated individually is
the excl slot with dma-resv. In that case we create a composite fence of
all jobs (dma_fence_array).

Yeah, that's something we have discussed as well.

How do you prevent the scheduler from overcommitting to a single ring
buffer in this scenario?


Each job has its own ring, the execbuf IOCTL throttles itself for each
job if there isn't space in the ring. This is exactly the same as
non-parallel submits.

I think this is what you were asking? If not, maybe try explaining the
question a bit more.

Matt


Christian.


Matt


Thanks,
Christian.

On 19.05.21 at 01:58, Matthew Brost wrote:

Add entry for i915 new parallel submission uAPI plan.

v2:
(Daniel Vetter):
 - Expand logical order explanation
 - Add dummy header
 - Only allow N BBs in execbuf IOCTL
 - Configure parallel submission per slot not per gem context

Cc: Tvrtko Ursulin 
Cc: Tony Ye 
CC: Carl Zhang 
Cc: Daniel Vetter 
Cc: Jason Ekstrand 
Signed-off-by: Matthew Brost 
---
Documentation/gpu/rfc/i915_parallel_execbuf.h | 144 ++
Documentation/gpu/rfc/i915_scheduler.rst  |  53 ++-
2 files changed

Re: [Mesa-dev] [RFC 2/2] drm/doc/rfc: i915 new parallel submission uAPI plan

2021-05-20 Thread Christian König

On 19.05.21 at 18:51, Matthew Brost wrote:

On Wed, May 19, 2021 at 01:45:39PM +0200, Christian König wrote:

Oh, yeah we call that gang submit on the AMD side.

We already had some internal discussions about how to implement this, but so far
couldn't figure out how to cleanly introduce that into the DRM scheduler.

Can you briefly describe in a few words how that is supposed to work on the
Intel side?


Sure, I've done a quick PoC internally and have been able to hook this
into the DRM scheduler.

Basically each BB still maps to a single job as each job is somewhat
unique (e.g. each job has its own ring, lrc, seqno, etc...). However all
the jobs configured to run in parallel map to a single sched_entity
which maintains the order each job was generated from the execbuf IOCTL
(1 - N). When the backend receives jobs 1 to N - 1 it basically just
updates some internal state. When the backend sees job N (last job) it
actually does the submit for jobs 1 - N which with GuC submission is a
simple command moving the LRC tail of the N jobs.

Daniel has suggested that we create a single job for the N BBs but that
would be huge rework to the internals of the i915 and likely won't
happen by the time this code first lands.

Also worth noting one way a job isn't really treated individually is
the excl slot with dma-resv. In that case we create a composite fence of
all jobs (dma_fence_array).


Yeah, that's something we have discussed as well.

How do you prevent the scheduler from overcommitting to a single ring 
buffer in this scenario?


Christian.



Matt


Thanks,
Christian.

On 19.05.21 at 01:58, Matthew Brost wrote:

Add entry for i915 new parallel submission uAPI plan.

v2:
   (Daniel Vetter):
- Expand logical order explanation
- Add dummy header
- Only allow N BBs in execbuf IOCTL
- Configure parallel submission per slot not per gem context

Cc: Tvrtko Ursulin 
Cc: Tony Ye 
CC: Carl Zhang 
Cc: Daniel Vetter 
Cc: Jason Ekstrand 
Signed-off-by: Matthew Brost 
---
   Documentation/gpu/rfc/i915_parallel_execbuf.h | 144 ++
   Documentation/gpu/rfc/i915_scheduler.rst  |  53 ++-
   2 files changed, 196 insertions(+), 1 deletion(-)
   create mode 100644 Documentation/gpu/rfc/i915_parallel_execbuf.h

diff --git a/Documentation/gpu/rfc/i915_parallel_execbuf.h 
b/Documentation/gpu/rfc/i915_parallel_execbuf.h
new file mode 100644
index ..8c64b983ccad
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_parallel_execbuf.h
@@ -0,0 +1,144 @@
+#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see 
i915_context_engines_parallel_submit */
+
+/*
+ * i915_context_engines_parallel_submit:
+ *
+ * Setup a slot to allow multiple BBs to be submitted in a single execbuf 
IOCTL.
+ * Those BBs will then be scheduled to run on the GPU in parallel. Multiple
+ * hardware contexts are created internally in the i915 to run these BBs. Once a
+ * slot is configured for N BBs only N BBs can be submitted in each execbuf
+ * IOCTL and this is implicit behavior (e.g. the user doesn't tell the execbuf
+ * IOCTL there are N BBs, the execbuf IOCTL knows how many BBs there are based
+ * on the slots configuration).
+ *
+ * There are two currently defined ways to control the placement of the
+ * hardware contexts on physical engines: default behavior (no flags) and
+ * I915_PARALLEL_IMPLICT_BONDS (a flag). More flags may be added in the
+ * future as new hardware / use cases arise. Details of how to use this
+ * interface are below, above the flags.
+ *
+ * Returns -EINVAL if the hardware context placement configuration is invalid
+ * or if the placement configuration isn't supported on the platform /
+ * submission interface.
+ * Returns -ENODEV if the extension isn't supported on the platform /
+ * submission interface.
+ */
+struct i915_context_engines_parallel_submit {
+   struct i915_user_extension base;
+
+   __u16 engine_index; /* slot for parallel engine */
+   __u16 width;/* number of contexts per parallel engine */
+   __u16 num_siblings; /* number of siblings per context */
+   __u16 mbz16;
+/*
+ * Default placement behavior (currently unsupported):
+ *
+ * Rather than restricting parallel submission to a single class with a
+ * logically contiguous placement (I915_PARALLEL_IMPLICT_BONDS), add a mode
+ * that enables parallel submission across multiple engine classes. In this
+ * case each context's logical engine mask indicates where that context can be
+ * placed. It is implied in this mode that all contexts have mutually exclusive
+ * placement (e.g. if one context is running CS0 no other contexts can run on
+ * CS0).
+ *
+ * Example 1 pseudo code:
+ * CSX[Y] = engine class X, logical instance Y
+ * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
+ * set_engines(INVALID)
+ * set_parallel(engine_index=0, width=2, num_siblings=2,
+ * engines=CS0[0],CS0[1],CS1[0],CS1[1])
+ *
+ * Results in the following valid placements:
+ * CS0[0], CS1[0

Re: [Mesa-dev] [RFC 2/2] drm/doc/rfc: i915 new parallel submission uAPI plan

2021-05-19 Thread Christian König

Oh, yeah we call that gang submit on the AMD side.

We already had some internal discussions about how to implement this, but so far 
couldn't figure out how to cleanly introduce that into the DRM scheduler.


Can you briefly describe in a few words how that is supposed to work on 
the Intel side?


Thanks,
Christian.

On 19.05.21 at 01:58, Matthew Brost wrote:

Add entry for i915 new parallel submission uAPI plan.

v2:
  (Daniel Vetter):
   - Expand logical order explanation
   - Add dummy header
   - Only allow N BBs in execbuf IOCTL
   - Configure parallel submission per slot not per gem context

Cc: Tvrtko Ursulin 
Cc: Tony Ye 
CC: Carl Zhang 
Cc: Daniel Vetter 
Cc: Jason Ekstrand 
Signed-off-by: Matthew Brost 
---
  Documentation/gpu/rfc/i915_parallel_execbuf.h | 144 ++
  Documentation/gpu/rfc/i915_scheduler.rst  |  53 ++-
  2 files changed, 196 insertions(+), 1 deletion(-)
  create mode 100644 Documentation/gpu/rfc/i915_parallel_execbuf.h

diff --git a/Documentation/gpu/rfc/i915_parallel_execbuf.h 
b/Documentation/gpu/rfc/i915_parallel_execbuf.h
new file mode 100644
index ..8c64b983ccad
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_parallel_execbuf.h
@@ -0,0 +1,144 @@
+#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see 
i915_context_engines_parallel_submit */
+
+/*
+ * i915_context_engines_parallel_submit:
+ *
+ * Setup a slot to allow multiple BBs to be submitted in a single execbuf 
IOCTL.
+ * Those BBs will then be scheduled to run on the GPU in parallel. Multiple
+ * hardware contexts are created internally in the i915 to run these BBs. Once a
+ * slot is configured for N BBs only N BBs can be submitted in each execbuf
+ * IOCTL and this is implicit behavior (e.g. the user doesn't tell the execbuf
+ * IOCTL there are N BBs, the execbuf IOCTL knows how many BBs there are based
+ * on the slots configuration).
+ *
+ * There are two currently defined ways to control the placement of the
+ * hardware contexts on physical engines: default behavior (no flags) and
+ * I915_PARALLEL_IMPLICT_BONDS (a flag). More flags may be added in the
+ * future as new hardware / use cases arise. Details of how to use this
+ * interface are below, above the flags.
+ *
+ * Returns -EINVAL if the hardware context placement configuration is invalid
+ * or if the placement configuration isn't supported on the platform /
+ * submission interface.
+ * Returns -ENODEV if the extension isn't supported on the platform /
+ * submission interface.
+ */
+struct i915_context_engines_parallel_submit {
+   struct i915_user_extension base;
+
+   __u16 engine_index; /* slot for parallel engine */
+   __u16 width;/* number of contexts per parallel engine */
+   __u16 num_siblings; /* number of siblings per context */
+   __u16 mbz16;
+/*
+ * Default placement behavior (currently unsupported):
+ *
+ * Rather than restricting parallel submission to a single class with a
+ * logically contiguous placement (I915_PARALLEL_IMPLICT_BONDS), add a mode
+ * that enables parallel submission across multiple engine classes. In this
+ * case each context's logical engine mask indicates where that context can be
+ * placed. It is implied in this mode that all contexts have mutually exclusive
+ * placement (e.g. if one context is running CS0 no other contexts can run on
+ * CS0).
+ *
+ * Example 1 pseudo code:
+ * CSX[Y] = engine class X, logical instance Y
+ * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
+ * set_engines(INVALID)
+ * set_parallel(engine_index=0, width=2, num_siblings=2,
+ * engines=CS0[0],CS0[1],CS1[0],CS1[1])
+ *
+ * Results in the following valid placements:
+ * CS0[0], CS1[0]
+ * CS0[0], CS1[1]
+ * CS0[1], CS1[0]
+ * CS0[1], CS1[1]
+ *
+ * This can also be thought of as 2 virtual engines:
+ * VE[0] = CS0[0], CS0[1]
+ * VE[1] = CS1[0], CS1[1]
+ *
+ * Example 2 pseudo code:
+ * CS[X] = generic engine of same class, logical instance X
+ * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
+ * set_engines(INVALID)
+ * set_parallel(engine_index=0, width=2, num_siblings=3,
+ * engines=CS[0],CS[1],CS[2],CS[0],CS[1],CS[2])
+ *
+ * Results in the following valid placements:
+ * CS[0], CS[1]
+ * CS[0], CS[2]
+ * CS[1], CS[0]
+ * CS[1], CS[2]
+ * CS[2], CS[0]
+ * CS[2], CS[1]
+ *
+ *
+ * This can also be thought of as 2 virtual engines:
+ * VE[0] = CS[0], CS[1], CS[2]
+ * VE[1] = CS[0], CS[1], CS[2]
+ *
+ * This enables a use case where all engines are created equally, we don't care
+ * where they are scheduled, we just want a certain number of resources, for
+ * those resources to be scheduled in parallel, and possibly across multiple
+ * engine classes.
+ */
+
+/*
+ * I915_PARALLEL_IMPLICT_BONDS - Create implicit bonds between each context.
+ * Each context must have the same number of siblings and bonds are implicitly
+ * created between the siblings.
+ *
+ * All of the below examples are in logical space.
+ *
+ * Example 1 pseudo code:

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Christian König

On 04.05.21 at 13:13, Daniel Vetter wrote:

On Tue, May 4, 2021 at 12:53 PM Christian König
 wrote:

On 04.05.21 at 11:47, Daniel Vetter wrote:

[SNIP]

Yeah, it just takes too long for the preemption to complete to be really
useful for the feature we are discussing here.

As I said when the kernel requests to preempt a queue we can easily expect a
timeout of ~100ms until that comes back. For compute that is even in the
multiple seconds range.

100ms for preempting an idle request sounds like broken hw to me. Of
course preempting something that actually runs takes a while, that's
nothing new. But it's also not the thing we're talking about here. Is this
100ms actual numbers from hw for an actual idle ringbuffer?

Well 100ms is just an example of the scheduler granularity. Let me
explain in a wider context.

The hardware can have X queues mapped at the same time and every Y time
interval the hardware scheduler checks if those queues have changed and
only if they have changed the necessary steps to reload them are started.

Multiple queues can be rendering at the same time, so you can have X as
a high priority queue active and just waiting for a signal to start and
the client rendering one frame after another and a third background
compute task mining bitcoins for you.

As long as everything is static this is perfectly performant. Adding a
queue to the list of active queues is also relatively simple, but taking
one down requires you to wait until we are sure the hardware has seen
the change and reloaded the queues.

Think of it as an RCU grace period. This is simply not something which
is made to be used constantly, but rather just at process termination.

Uh ... that indeed sounds rather broken.


Well I wouldn't call it broken. It's just not made for the use case we 
are trying to abuse it for.



Otoh it's just a dma_fence that'd we'd inject as this unload-fence.


Yeah, exactly that's why it isn't much of a problem for process 
termination or freeing memory.



So by and large everyone should already be able to cope with it taking a
bit longer. So from a design pov I don't see a huge problem, but I
guess you guys wont be happy since it means on amd hw there will be
random unsightly stalls in desktop linux usage.


The "preemption" feature is really called suspend and made just for the case
when we want to put a process to sleep or need to forcefully kill it for
misbehavior or stuff like that. It is not meant to be used in normal
operation.

If we only attach it on ->move then yeah maybe a last resort possibility to
do it this way, but I think in that case we could rather stick with kernel
submissions.

Well this is a hybrid userspace ring + kernel augmented submit mode, so you
can keep dma-fences working. Because the dma-fence stuff won't work with
pure userspace submit, I think that conclusion is rather solid. Once more
even after this long thread here.

When assisted with unload fences, then yes. Problem is that I can't see
how we could implement those in a performant way currently.

Is there really no way to fix fw here? Like if process start/teardown
takes 100ms, that's going to suck no matter what.


As I said adding the queue is unproblematic and teardown just results in 
a bit more waiting to free things up.


More problematic are overcommit, swapping and OOM situations, which need 
to wait for the hw scheduler to come back and tell us that the queue is now 
unmapped.



Also, if userspace lies to us and keeps pushing crap into the ring
after it's supposed to be idle: Userspace is already allowed to waste
gpu time. If you're too worried about this set a fairly aggressive
preempt timeout on the unload fence, and kill the context if it takes
longer than what preempting an idle ring should take (because that
would indicate broken/evil userspace).

I think you have the wrong expectation here. It is perfectly valid and
expected for userspace to keep writing commands into the ring buffer.

After all when one frame is completed they want to immediately start
rendering the next one.

Sure, for the true userspace direct submit model. But with that you don't
get dma-fence, which means this gpu will not work for 3d accel on any
current linux desktop.

I'm not sure of that. I've looked a bit into how we could add user
fences to dma_resv objects and that isn't that hard after all.

I think as a proof of concept it's fine, but as an actual solution ...
pls no. Two reasons:
- implicit sync is bad


Well can't disagree with that :) But I think we can't avoid supporting it.


- this doesn't fix anything for explicit sync using dma_fence in terms
of sync_file or drm_syncobj.


Exactly.

If we do implicit sync or explicit sync is orthogonal to the problems 
that sync must be made reliable somehow.


So when we sync and timeout the waiter should just continue, but whoever 
failed to signal will be punished.


But since this isn't solved on Windows I don't see how we can solve it 
on Linux either.



So if we go w

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Christian König

On 04.05.21 at 11:47, Daniel Vetter wrote:

[SNIP]

Yeah, it just takes too long for the preemption to complete to be really
useful for the feature we are discussing here.

As I said when the kernel requests to preempt a queue we can easily expect a
timeout of ~100ms until that comes back. For compute that is even in the
multiple seconds range.

100ms for preempting an idle request sounds like broken hw to me. Of
course preempting something that actually runs takes a while, that's
nothing new. But it's also not the thing we're talking about here. Is this
100ms actual numbers from hw for an actual idle ringbuffer?


Well 100ms is just an example of the scheduler granularity. Let me 
explain in a wider context.


The hardware can have X queues mapped at the same time and every Y time 
interval the hardware scheduler checks if those queues have changed and 
only if they have changed the necessary steps to reload them are started.


Multiple queues can be rendering at the same time, so you can have X as 
a high priority queue active and just waiting for a signal to start and 
the client rendering one frame after another and a third background 
compute task mining bitcoins for you.


As long as everything is static this is perfectly performant. Adding a 
queue to the list of active queues is also relatively simple, but taking 
one down requires you to wait until we are sure the hardware has seen 
the change and reloaded the queues.


Think of it as an RCU grace period. This is simply not something which 
is made to be used constantly, but rather just at process termination.



The "preemption" feature is really called suspend and made just for the case
when we want to put a process to sleep or need to forcefully kill it for
misbehavior or stuff like that. It is not meant to be used in normal
operation.

If we only attach it on ->move then yeah maybe a last resort possibility to
do it this way, but I think in that case we could rather stick with kernel
submissions.

Well this is a hybrid userspace ring + kernel augmented submit mode, so you
can keep dma-fences working. Because the dma-fence stuff won't work with
pure userspace submit, I think that conclusion is rather solid. Once more
even after this long thread here.


When assisted with unload fences, then yes. Problem is that I can't see 
how we could implement those in a performant way currently.



Also, if userspace lies to us and keeps pushing crap into the ring
after it's supposed to be idle: Userspace is already allowed to waste
gpu time. If you're too worried about this set a fairly aggressive
preempt timeout on the unload fence, and kill the context if it takes
longer than what preempting an idle ring should take (because that
would indicate broken/evil userspace).

I think you have the wrong expectation here. It is perfectly valid and
expected for userspace to keep writing commands into the ring buffer.

After all when one frame is completed they want to immediately start
rendering the next one.

Sure, for the true userspace direct submit model. But with that you don't
get dma-fence, which means this gpu will not work for 3d accel on any
current linux desktop.


I'm not sure of that. I've looked a bit into how we could add user 
fences to dma_resv objects and that isn't that hard after all.



Which sucks, hence some hybrid model of using the userspace ring and
kernel augmented submit is needed. Which was my idea.


Yeah, I think when our firmware folks would really remove the kernel 
queue and we still don't have





[SNIP]
Can't find that offhand either, but see the amdgpu_noretry module option.

It basically tells the hardware if retry page faults should be supported or
not because this whole TLB shutdown thing when they are supported is
extremely costly.

Hm so synchronous tlb shootdown is a lot more costly when you allow
retrying of page faults?


Partially correct, yes.

See when you have retry page faults enabled and unmap something you need 
to make sure that everybody who could have potentially translated that 
page and has a TLB entry either gets invalidated or you wait until the 
access is completed.


Since every CU could be using a memory location that takes ages to 
complete compared to the normal invalidation where you just invalidate 
the L1/L2 and are done.


Additional to that the recovery adds some extra overhead to every memory 
access, so even without a fault you are quite a bit slower if this is 
enabled.



That sounds bad, because for full hmm mode you need to be able to retry
pagefaults. Well at least the PASID/ATS/IOMMU side will do that, and might just
hang your gpu for a long time while it's waiting for the va->pa lookup
response to return. So retrying lookups shouldn't be any different really.

And you also need fairly fast synchronous tlb shootdown for hmm. So if
your hw has a problem with both together that sounds bad.


Completely agree. And since it was my job to validate the implementation 
on Vega10 I was also the first one to 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Christian König

On 04.05.21 at 10:27, Daniel Vetter wrote:

On Tue, May 4, 2021 at 10:09 AM Christian König
 wrote:

On 04.05.21 at 09:32, Daniel Vetter wrote:

On Tue, May 04, 2021 at 09:01:23AM +0200, Christian König wrote:

Unfortunately as I pointed out to Daniel as well this won't work 100%
reliable either.

You're claiming this, but there's no clear reason why really, and you
didn't reply to my last mail on that sub-thread, so I really don't get
where exactly you're seeing a problem.

Yeah, it's rather hard to explain without pointing out how the hardware
works in detail.


See the signal on the ring buffer needs to be protected from manipulation by
userspace so that we can guarantee that the hardware really has finished
executing when it fires.

Nope you don't. Userspace is already allowed to submit all kinds of random
garbage, the only thing the kernel has to guarantee is:
- the dma-fence DAG stays a DAG
- dma-fence completes in finite time

Everything else is not the kernel's problem, and if userspace mixes stuff
up like manipulates the seqno, that's ok. It can do that kind of garbage
already.


Protecting memory by immediate page table updates is a good first step, but
unfortunately not sufficient (and we would need to restructure large parts
of the driver to make this happen).

This is why you need the unload-fence on top, because indeed you can't
just rely on the fences created from the userspace ring, those are
unreliable for memory management.

And exactly that's the problem! We can't provide a reliable unload-fence
and the user fences are unreliable for that.

I've talked this through lengthy with our hardware/firmware guy last
Thursday but couldn't find a solution either.

We can have a preemption fence for the kernel which says: Hey this queue
was scheduled away you can touch it's hardware descriptor, control
registers, page tables, TLB, memory, GWS, GDS, OA etc etc etc... again.
But that one is only triggered on preemption and then we have the same
ordering problems once more.

Or we can have an end-of-operation fence for userspace which says: Hey
this queue has finished its batch of execution, but this one is
manipulable from userspace in both finishing too early (very very bad for
invalidations and memory management) or finishing too late/never (deadlock
prone but fixable by timeout).

What we could do is to use the preemption fence to emulate the unload
fence, e.g. something like:
1. Preempt the queue in fixed intervals (let's say 100ms).
2. While preempted check if we have reached the checkpoint in question
by looking at the hardware descriptor.
3. If we have reached the checkpoint signal the unload fence.
4. If we haven't reached the checkpoint resume the queue again.

The problem is that this might introduce a maximum of 100ms delay before
signaling the unload fence and preempt/resume has such a hefty overhead
that we waste a horrible amount of time on it.

So your hw can preempt? That's good enough.

The unload fence is just
1. wait for all dma_fence that are based on the userspace ring. This
is unreliable, but we don't care because tdr will make it reliable.
And once tdr shot down a context we'll force-unload and thrash it
completely, which solves the problem.
2. preempt the context, which /should/ now be stuck waiting for more
commands to be stuffed into the ringbuffer. Which means your
preemption is hopefully fast enough to not matter. If your hw takes
forever to preempt an idle ring, I can't help you :-)


Yeah, it just takes too long for the preemption to complete to be really 
useful for the feature we are discussing here.


As I said when the kernel requests to preempt a queue we can easily 
expect a timeout of ~100ms until that comes back. For compute that is 
even in the multiple seconds range.


The "preemption" feature is really called suspend and made just for the 
case when we want to put a process to sleep or need to forcefully kill 
it for misbehavior or stuff like that. It is not meant to be used in 
normal operation.


If we only attach it on ->move then yeah maybe a last resort possibility 
to do it this way, but I think in that case we could rather stick with 
kernel submissions.



Also, if userspace lies to us and keeps pushing crap into the ring
after it's supposed to be idle: Userspace is already allowed to waste
gpu time. If you're too worried about this set a fairly aggressive
preempt timeout on the unload fence, and kill the context if it takes
longer than what preempting an idle ring should take (because that
would indicate broken/evil userspace).


I think you have the wrong expectation here. It is perfectly valid and 
expected for userspace to keep writing commands into the ring buffer.


After all when one frame is completed they want to immediately start 
rendering the next one.



Again, I'm not seeing the problem. Except if your hw is really
completely busted to the point where it can't even support userspace
ringbuffers properly and with sufficient pe

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Christian König

On 04.05.21 at 09:32, Daniel Vetter wrote:

On Tue, May 04, 2021 at 09:01:23AM +0200, Christian König wrote:

Unfortunately as I pointed out to Daniel as well this won't work 100%
reliable either.

You're claiming this, but there's no clear reason why really, and you
didn't reply to my last mail on that sub-thread, so I really don't get
where exactly you're seeing a problem.


Yeah, it's rather hard to explain without pointing out how the hardware 
works in detail.



See the signal on the ring buffer needs to be protected from manipulation by
userspace so that we can guarantee that the hardware really has finished
executing when it fires.

Nope you don't. Userspace is already allowed to submit all kinds of random
garbage, the only thing the kernel has to guarantee is:
- the dma-fence DAG stays a DAG
- dma-fence completes in finite time

Everything else is not the kernel's problem, and if userspace mixes stuff
up like manipulates the seqno, that's ok. It can do that kind of garbage
already.


Protecting memory by immediate page table updates is a good first step, but
unfortunately not sufficient (and we would need to restructure large parts
of the driver to make this happen).

This is why you need the unload-fence on top, because indeed you can't
just rely on the fences created from the userspace ring, those are
unreliable for memory management.


And exactly that's the problem! We can't provide a reliable unload-fence 
and the user fences are unreliable for that.


I've talked this through lengthy with our hardware/firmware guy last 
Thursday but couldn't find a solution either.


We can have a preemption fence for the kernel which says: hey, this queue 
was scheduled away, you can touch its hardware descriptor, control 
registers, page tables, TLB, memory, GWS, GDS, OA etc... again. 
But that one is only triggered on preemption, and then we have the same 
ordering problems once more.


Or we can have an end-of-operation fence for userspace which says: hey, 
this queue has finished its batch of execution. But this one is 
manipulable from userspace and can either finish too early (very bad for 
invalidations and memory management) or too late/never (deadlock 
prone, but fixable by timeout).


What we could do is to use the preemption fence to emulate the unload 
fence, e.g. something like:

1. Preempt the queue in fixed intervals (let's say 100ms).
2. While preempted, check if we have reached the checkpoint in question 
by looking at the hardware descriptor.

3. If we have reached the checkpoint signal the unload fence.
4. If we haven't reached the checkpoint resume the queue again.

The problem is that this might introduce a maximum of 100ms delay before 
signaling the unload fence and preempt/resume has such a hefty overhead 
that we waste a horrible amount of time on it.




btw I thought some more, and I think it's probably best if we only attach
the unload-fence in the ->move(_notify) callbacks. Kinda like we already
do for async copy jobs. So the overall buffer move sequence would be:

1. wait for (untrusted for kernel, but necessary for userspace
correctness) fake dma-fence that rely on the userspace ring

2. unload ctx

3. copy buffer

Ofc 2&3 would be done async behind a dma_fence.


On older hardware we often had the situation that for reliable invalidation
we needed the guarantee that every previous operation had finished executing.
It's not so much of a problem when the next operation has already started,
since then we had the opportunity to do things in between the last and the
next operation. Just see cache invalidation and VM switching for example.

If you have gpu page faults you generally have synchronous tlb
invalidation,


Please tell that our hardware engineers :)

We have two modes of operation, see the whole XNACK on/off discussion on 
the amd-gfx mailing list.



so this also shouldn't be a big problem. Combined with the
unload fence at least. If you don't have synchronous tlb invalidate it
gets a bit more nasty and you need to force a preemption to a kernel
context which has the required flushes across all the caches. Slightly
nasty, but the exact same thing would be required for handling page faults
anyway with the direct userspace submit model.

Again I'm not seeing a problem.


In addition to that, it doesn't really buy us anything; there is not much
advantage to this. Writing the ring buffer in userspace and then ringing in
the kernel has the same overhead as doing everything in the kernel in the
first place.

It gets you dma-fence backwards compat without having to rewrite the
entire userspace ecosystem. Also since you have the hw already designed
for ringbuffer in userspace it would be silly to copy that through the cs
ioctl, that's just overhead.

Also I thought the problem you're having is that all the kernel ringbuf
stuff is going away, so the old cs ioctl won't work anymore for sure?


We still have a bit more time for this. As I learned from our firmware 
e

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Christian König
   > immediately went to much more complicated things but can't
find it.
> Thoughts?

Timeouts are sufficient to protect the kernel but they make
the fences
unpredictable and unreliable from a userspace PoV.  One of the big
problems we face is that, once we expose a dma_fence to userspace,
we've allowed for some pretty crazy potential dependencies that
neither userspace nor the kernel can sort out.  Say you have
marek's
"next serial, please" proposal and a multi-threaded application.
Between the time you ask the kernel for a serial and get a
dma_fence
and submit the work to signal that serial, your process may get
preempted, something else shoved in which allocates memory,
and then
we end up blocking on that dma_fence.  There's no way
userspace can
predict and defend itself from that.

So I think where that leaves us is that there is no safe place to
create a dma_fence except for inside the ioctl which submits
the work
and only after any necessary memory has been allocated. That's a
pretty stiff requirement.  We may still be able to interact with
userspace a bit more explicitly but I think it throws any
notion of
userspace direct submit out the window.

--Jason


> - Bas
> >
> > --Jason
> >
> > > On Mon, May 3, 2021 at 9:42 AM Alex Deucher wrote:
> > >
    > > > On Sat, May 1, 2021 at 6:27 PM Marek Olšák wrote:
> > > >
> > > > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer wrote:
> > > >>
> > > >> On 2021-04-28 8:59 a.m., Christian König wrote:
> > > >> > Hi Dave,
> > > >> >
> > > >> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> > > >> >> Supporting interop with any device is always
possible. It depends on which drivers we need to interoperate
with and update them. We've already found the path forward for
amdgpu. We just need to find out how many other drivers need
to be updated and evaluate the cost/benefit aspect.
> > > >> >>
> > > >> >> Marek
> > > >> >>
> > > >> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie wrote:
> > > >> >>
> > > >> >>     On Tue, 27 Apr 2021 at 22:06, Christian König wrote:
> > > >> >>     >
> > > >> >>     > Correct, we wouldn't have synchronization
between device with and without user queues any more.
> > > >> >>     >
> > > >> >>     > That could only be a problem for A+I Laptops.
> > > >> >>
> > > >> >>     Since I think you mentioned you'd only be
enabling this on newer
> > > >> >>     chipsets, won't it be a problem for A+A where
one A is a generation
> > > >> >>     behind the other?
> > > >> >>
> > > >> >
> > > >> > Crap, that is a good point as well.
> > > >> >
> > > >> >>
> > > >> >>     I'm not really liking where this is going btw,
seems like a ill
> > > >> >>     thought out concept, if AMD is really going
down the road of designing
> > > >> >>     hw that is currently Linux incompatible, you
are going to have to
> > > >> >>     accept a big part of the burden in bringing
this support in to more
> > > >> >>     than just amd drivers for upcoming generations
of gpu.
> > > >> >>
> > > >> >
> > > >> > Well we don't really like that either, but we have
no other option as far as I can see.
> > > >>
> > > >> I don'

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-03 Thread Christian König

Am 03.05.21 um 16:59 schrieb Jason Ekstrand:

Sorry for the top-post but there's no good thing to reply to here...

One of the things pointed out to me recently by Daniel Vetter that I
didn't fully understand before is that dma_buf has a very subtle
second requirement beyond finite time completion:  Nothing required
for signaling a dma-fence can allocate memory.  Why?  Because the act
of allocating memory may wait on your dma-fence.  This, as it turns
out, is a massively more strict requirement than finite time
completion and, I think, throws out all of the proposals we have so
far.

Take, for instance, Marek's proposal for userspace involvement with
dma-fence by asking the kernel for a next serial and the kernel
trusting userspace to signal it.  That doesn't work at all if
allocating memory to trigger a dma-fence can blow up.  There's simply
no way for the kernel to trust userspace to not do ANYTHING which
might allocate memory.  I don't even think there's a way userspace can
trust itself there.  It also blows up my plan of moving the fences to
transition boundaries.

Not sure where that leaves us.


Well at least I was perfectly aware of that :)

I'm currently experimenting with some sample code which would allow 
implicit sync with user fences.


Not that I'm pushing hard into that directly, but I just want to make 
clear how simple or complex the whole thing would be.


Christian.



--Jason

On Mon, May 3, 2021 at 9:42 AM Alex Deucher  wrote:

On Sat, May 1, 2021 at 6:27 PM Marek Olšák  wrote:

On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer  wrote:

On 2021-04-28 8:59 a.m., Christian König wrote:

Hi Dave,

Am 27.04.21 um 21:23 schrieb Marek Olšák:

Supporting interop with any device is always possible. It depends on which 
drivers we need to interoperate with and update them. We've already found the 
path forward for amdgpu. We just need to find out how many other drivers need 
to be updated and evaluate the cost/benefit aspect.

Marek

On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie wrote:

 On Tue, 27 Apr 2021 at 22:06, Christian König wrote:
 >
 > Correct, we wouldn't have synchronization between device with and 
without user queues any more.
 >
 > That could only be a problem for A+I Laptops.

 Since I think you mentioned you'd only be enabling this on newer
 chipsets, won't it be a problem for A+A where one A is a generation
 behind the other?


Crap, that is a good point as well.


 I'm not really liking where this is going btw, seems like a ill
 thought out concept, if AMD is really going down the road of designing
 hw that is currently Linux incompatible, you are going to have to
 accept a big part of the burden in bringing this support in to more
 than just amd drivers for upcoming generations of gpu.


Well we don't really like that either, but we have no other option as far as I 
can see.

I don't really understand what "future hw may remove support for kernel queues" 
means exactly. While the per-context queues can be mapped to userspace directly, they 
don't *have* to be, do they? I.e. the kernel driver should be able to either intercept 
userspace access to the queues, or in the worst case do it all itself, and provide the 
existing synchronization semantics as needed?

Surely there are resource limits for the per-context queues, so the kernel driver 
needs to do some kind of virtualization / multiplexing anyway, or we'll get sad user 
faces when there's no queue available for .

I'm probably missing something though, awaiting enlightenment. :)


The hw interface for userspace is that the ring buffer is mapped to the process address 
space alongside a doorbell aperture (4K page) that isn't real memory, but when the CPU 
writes into it, it tells the hw scheduler that there are new GPU commands in the ring 
buffer. Userspace inserts all the wait, draw, and signal commands into the ring buffer 
and then "rings" the doorbell. It's my understanding that the ring buffer and 
the doorbell are always mapped in the same GPU address space as the process, which makes 
it very difficult to emulate the current protected ring buffers in the kernel. The VMID 
of the ring buffer is also not changeable.


The doorbell does not have to be mapped into the process's GPU virtual
address space.  The CPU could write to it directly.  Mapping it into
the GPU's virtual address space would allow you to have a device kick
off work however rather than the CPU.  E.g., the GPU could kick off
it's own work or multiple devices could kick off work without CPU
involvement.

Alex



The hw scheduler doesn't do any synchronization and it doesn't see any 
dependencies. It only chooses which queue to execute, so it's really just a 
simple queue manager handling the virtualization aspect and not much else.

Marek

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-30 Thread Christian König

Am 30.04.21 um 10:58 schrieb Daniel Vetter:

[SNIP]

When the user allocates usermode queues, the kernel driver sets up a
queue descriptor in the kernel which defines the location of the queue
in memory, what priority it has, what page tables it should use, etc.
User mode can then start writing commands to its queues.  When they
are ready for the hardware to start executing them, they ring a
doorbell which signals the scheduler and it maps the queue descriptors
to HW queue slots and they start executing.  The user only has access
to its queues and any buffers it has mapped in its GPU virtual
address space.  While the queues are scheduled, the user can keep
submitting work to them and they will keep executing unless they get
preempted by the scheduler due to oversubscription or a priority call
or a request from the kernel driver to preempt, etc.

Yeah, works like with our stuff.

I don't see a problem tbh. It's slightly silly going the detour with the
kernel ioctl, and it's annoying that you still have to use drm/scheduler
to resolve dependencies instead of gpu semaphores and all that. But this
only applies to legacy winsys mode, compute (e.g. vk without winsys) can
use the full power. Just needs a flag or something when setting up the
context.

And best part is that from hw pov this really is indistinguishable from
the full on userspace submit model.

The thing where it gets annoying is when you use one of these new cpu
instructions which do direct submit to hw and pass along the pasid id
behind the scenes. That's truly something you can't intercept anymore in
the kernel and fake the legacy dma_fence world.

But what you're describing here sounds like bog standard stuff, and also
pretty easy to keep working with exactly the current model.

Ofc we'll want to push forward a more modern model that better suits
modern gpus, but I don't see any hard requirement here from the hw side.

Adding a bit more detail on what I have in mind:

- memory management works like amdgpu does today, so all buffers are
pre-bound to the gpu vm, we keep the entire bo set marked as busy with
the bulk lru trick for every command submission.

- for the ringbuffer, userspace allocates a suitably sized bo for
ringbuffer, ring/tail/seqno and whatever else it needs

- userspace then asks the kernel to make that into a hw context, with
all the privileges set up. The doorbell will only be mapped into the kernel
(hw can't tell the difference anyway), but if it happens to also be
visible to userspace that's no problem. We assume userspace can ring
the doorbell anytime it wants to.


This doesn't work in hardware. We at least need to set up a few registers 
and memory locations from inside the VM, which userspace shouldn't have 
access to, if we want the end-of-batch fence and ring buffer start to 
be reliable.



- we do double memory management: One dma_fence works similar to the
amdkfd preempt fence, except it doesn't preempt but does anything
required to make the hw context unrunnable and take it out of the hw
scheduler entirely. This might involve unmapping the doorbell if
userspace has access to it.

- but we also do classic end-of-batch fences, so that implicit fencing
and all that keeps working. The "make hw ctx unrunnable" fence must
also wait for all of these pending submissions to complete.


This combination doesn't work from the software side either: you can have 
preemption fences or end-of-batch fences, but never both, because the 
end-of-batch fences would gain an additional dependency on the preemption 
fences, which we currently can't express in the dma_fence framework.


In addition to that, it can't work from the hardware side because we have 
a separation between engine and scheduler there. So we can't 
reliably get a signal inside the kernel that a batch has completed.


What we could do is to get this signal in userspace, e.g. userspace 
inserts the packets into the ring buffer and then the kernel can read 
the fence value and get the IV.


But this has the same problem as user fences because it requires the 
cooperation of userspace.


We just yesterday had a meeting with the firmware developers to discuss 
the possible options and I now have even stronger doubts that this is 
doable.


We either have user queues, where userspace writes the necessary commands 
directly to the ring buffer, or we have kernel queues. A mixture of both 
is supported by neither the hardware nor the firmware.


Regards,
Christian.



- for the actual end-of-batchbuffer dma_fence it's almost all faked,
but with some checks in the kernel to keep up the guarantees. cs flow
is roughly

1. userspace directly writes into the userspace ringbuffer. It needs
to follow the kernel's rule for this if it wants things to work
correctly, but we assume evil userspace is allowed to write whatever
it wants to the ring, and change that whenever it wants. Userspace
does not update ring head/tail pointers.

2. cs ioctl just contains: a) head (the thing userspace 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Christian König

Am 28.04.21 um 16:34 schrieb Daniel Vetter:

On Wed, Apr 28, 2021 at 03:37:49PM +0200, Christian König wrote:

Am 28.04.21 um 15:34 schrieb Daniel Vetter:

On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:

Am 28.04.21 um 14:26 schrieb Daniel Vetter:

On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:

On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:

Am 28.04.21 um 12:05 schrieb Daniel Vetter:

On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:

On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  wrote:

On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach  
wrote:


Ok. So that would only make the following use cases broken for now:

- amd render -> external gpu
- amd video encode -> network device

FWIW, "only" breaking amd render -> external gpu will make us pretty
unhappy

I concur. I have quite a few users with a multi-GPU setup involving
AMD hardware.

Note, if this brokenness can't be avoided, I'd prefer to get a clear
error, and not bad results on screen because nothing is synchronized
anymore.

It's an upcoming requirement for windows[1], so you are likely to
start seeing this across all GPU vendors that support windows.  I
think the timing depends on how quickly the legacy hardware support
sticks around for each vendor.

Yeah but hw scheduling doesn't mean the hw has to be constructed to not
support isolating the ringbuffer at all.

E.g. even if the hw loses the bit to put the ringbuffer outside of the
userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
pte flags. Otherwise the entire "share address space with cpu side,
seamlessly" thing is out of the window.

And with that r/o bit on the ringbuffer you can once more force submit
through kernel space, and all the legacy dma_fence based stuff keeps
working. And we don't have to invent some horrendous userspace fence based
implicit sync mechanism in the kernel, but can instead do this transition
properly with drm_syncobj timeline explicit sync and protocol reving.

At least I think you'd have to work extra hard to create a gpu which
cannot possibly be intercepted by the kernel, even when it's designed to
support userspace direct submit only.

Or are your hw engineers more creative here and we're screwed?

The upcoming hardware generation will have this hardware scheduler as a
must have, but there are certain ways we can still stick to the old
approach:

1. The new hardware scheduler currently still supports kernel queues which
essentially is the same as the old hardware ring buffer.

2. Mapping the top level ring buffer into the VM at least partially solves
the problem. This way you can't manipulate the ring buffer content, but the
location for the fence must still be writeable.

Yeah allowing userspace to lie about completion fences in this model is
ok. Though I haven't thought through full consequences of that, but I
think it's not any worse than userspace lying about which buffers/address
it uses in the current model - we rely on hw vm ptes to catch that stuff.

Also it might be good to switch to a non-recoverable ctx model for these.
That's already what we do in i915 (opt-in, but all current umd use that
mode). So any hang/watchdog just kills the entire ctx and you don't have
to worry about userspace doing something funny with its ringbuffer.
Simplifies everything.

Also ofc userspace fencing still disallowed, but since userspace would
queue up all writes to its ringbuffer through the drm/scheduler, we'd
handle dependencies through that still. Not great, but workable.

Thinking about this, not even mapping the ringbuffer r/o is required, it's
just that we must queue things through the kernel to resolve dependencies
and everything without breaking dma_fence. If userspace lies, tdr will
shoot it and the kernel stops running that context entirely.

Thinking more about that approach I don't think that it will work correctly.

See we not only need to write the fence as signal that an IB is submitted,
but also adjust a bunch of privileged hardware registers.

When userspace could do that from its IBs as well then there is nothing
blocking it from reprogramming the page table base address for example.

We could do those writes with the CPU as well, but that would be a huge
performance drop because of the additional latency.

That's not what I'm suggesting. I'm suggesting you have the queue and
everything in userspace, like in windows. Fences are exactly handled like
on windows too. The difference is:

- All new additions to the ringbuffer are done through a kernel ioctl
call, using the drm/scheduler to resolve dependencies.

- Memory management is also done like today in that ioctl.

- TDR makes sure that if userspace abuses the contract (which it can, but
it can do that already today because there's also no command parser to
e.g. stop gpu semaphores) the entire context is shot and terminally
killed. Userspace has to then set up a n

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Christian König

Am 28.04.21 um 15:34 schrieb Daniel Vetter:

On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:

Am 28.04.21 um 14:26 schrieb Daniel Vetter:

On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:

On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:

Am 28.04.21 um 12:05 schrieb Daniel Vetter:

On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:

On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  wrote:

On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach  
wrote:


Ok. So that would only make the following use cases broken for now:

- amd render -> external gpu
- amd video encode -> network device

FWIW, "only" breaking amd render -> external gpu will make us pretty
unhappy

I concur. I have quite a few users with a multi-GPU setup involving
AMD hardware.

Note, if this brokenness can't be avoided, I'd prefer to get a clear
error, and not bad results on screen because nothing is synchronized
anymore.

It's an upcoming requirement for windows[1], so you are likely to
start seeing this across all GPU vendors that support windows.  I
think the timing depends on how quickly the legacy hardware support
sticks around for each vendor.

Yeah but hw scheduling doesn't mean the hw has to be constructed to not
support isolating the ringbuffer at all.

E.g. even if the hw loses the bit to put the ringbuffer outside of the
userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
pte flags. Otherwise the entire "share address space with cpu side,
seamlessly" thing is out of the window.

And with that r/o bit on the ringbuffer you can once more force submit
through kernel space, and all the legacy dma_fence based stuff keeps
working. And we don't have to invent some horrendous userspace fence based
implicit sync mechanism in the kernel, but can instead do this transition
properly with drm_syncobj timeline explicit sync and protocol reving.

At least I think you'd have to work extra hard to create a gpu which
cannot possibly be intercepted by the kernel, even when it's designed to
support userspace direct submit only.

Or are your hw engineers more creative here and we're screwed?

The upcoming hardware generation will have this hardware scheduler as a
must have, but there are certain ways we can still stick to the old
approach:

1. The new hardware scheduler currently still supports kernel queues which
essentially is the same as the old hardware ring buffer.

2. Mapping the top level ring buffer into the VM at least partially solves
the problem. This way you can't manipulate the ring buffer content, but the
location for the fence must still be writeable.

Yeah allowing userspace to lie about completion fences in this model is
ok. Though I haven't thought through full consequences of that, but I
think it's not any worse than userspace lying about which buffers/address
it uses in the current model - we rely on hw vm ptes to catch that stuff.

Also it might be good to switch to a non-recoverable ctx model for these.
That's already what we do in i915 (opt-in, but all current umd use that
mode). So any hang/watchdog just kills the entire ctx and you don't have
to worry about userspace doing something funny with its ringbuffer.
Simplifies everything.

Also ofc userspace fencing still disallowed, but since userspace would
queue up all writes to its ringbuffer through the drm/scheduler, we'd
handle dependencies through that still. Not great, but workable.

Thinking about this, not even mapping the ringbuffer r/o is required, it's
just that we must queue things through the kernel to resolve dependencies
and everything without breaking dma_fence. If userspace lies, tdr will
shoot it and the kernel stops running that context entirely.

Thinking more about that approach I don't think that it will work correctly.

See we not only need to write the fence as signal that an IB is submitted,
but also adjust a bunch of privileged hardware registers.

When userspace could do that from its IBs as well then there is nothing
blocking it from reprogramming the page table base address for example.

We could do those writes with the CPU as well, but that would be a huge
performance drop because of the additional latency.

That's not what I'm suggesting. I'm suggesting you have the queue and
everything in userspace, like in windows. Fences are exactly handled like
on windows too. The difference is:

- All new additions to the ringbuffer are done through a kernel ioctl
   call, using the drm/scheduler to resolve dependencies.

- Memory management is also done like today in that ioctl.

- TDR makes sure that if userspace abuses the contract (which it can, but
   it can do that already today because there's also no command parser to
   e.g. stop gpu semaphores) the entire context is shot and terminally
   killed. Userspace has to then set up a new one. This isn't how amdgpu
   recovery works right now, but i915 supports it and I think it's also the
   bet

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Christian König

Am 28.04.21 um 14:26 schrieb Daniel Vetter:

On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:

On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:

Am 28.04.21 um 12:05 schrieb Daniel Vetter:

On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:

On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  wrote:

On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach  
wrote:


Ok. So that would only make the following use cases broken for now:

- amd render -> external gpu
- amd video encode -> network device

FWIW, "only" breaking amd render -> external gpu will make us pretty
unhappy

I concur. I have quite a few users with a multi-GPU setup involving
AMD hardware.

Note, if this brokenness can't be avoided, I'd prefer to get a clear
error, and not bad results on screen because nothing is synchronized
anymore.

It's an upcoming requirement for windows[1], so you are likely to
start seeing this across all GPU vendors that support windows.  I
think the timing depends on how quickly the legacy hardware support
sticks around for each vendor.

Yeah but hw scheduling doesn't mean the hw has to be constructed to not
support isolating the ringbuffer at all.

E.g. even if the hw loses the bit to put the ringbuffer outside of the
userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
pte flags. Otherwise the entire "share address space with cpu side,
seamlessly" thing is out of the window.

And with that r/o bit on the ringbuffer you can once more force submit
through kernel space, and all the legacy dma_fence based stuff keeps
working. And we don't have to invent some horrendous userspace fence based
implicit sync mechanism in the kernel, but can instead do this transition
properly with drm_syncobj timeline explicit sync and protocol reving.

At least I think you'd have to work extra hard to create a gpu which
cannot possibly be intercepted by the kernel, even when it's designed to
support userspace direct submit only.

Or are your hw engineers more creative here and we're screwed?

The upcoming hardware generation will have this hardware scheduler as a
must have, but there are certain ways we can still stick to the old
approach:

1. The new hardware scheduler currently still supports kernel queues which
essentially is the same as the old hardware ring buffer.

2. Mapping the top level ring buffer into the VM at least partially solves
the problem. This way you can't manipulate the ring buffer content, but the
location for the fence must still be writeable.

Yeah allowing userspace to lie about completion fences in this model is
ok. Though I haven't thought through full consequences of that, but I
think it's not any worse than userspace lying about which buffers/address
it uses in the current model - we rely on hw vm ptes to catch that stuff.

Also it might be good to switch to a non-recoverable ctx model for these.
That's already what we do in i915 (opt-in, but all current umd use that
mode). So any hang/watchdog just kills the entire ctx and you don't have
to worry about userspace doing something funny with its ringbuffer.
Simplifies everything.

Also ofc userspace fencing still disallowed, but since userspace would
queue up all writes to its ringbuffer through the drm/scheduler, we'd
handle dependencies through that still. Not great, but workable.

Thinking about this, not even mapping the ringbuffer r/o is required, it's
just that we must queue things through the kernel to resolve dependencies
and everything without breaking dma_fence. If userspace lies, tdr will
shoot it and the kernel stops running that context entirely.


Thinking more about that approach I don't think that it will work correctly.

See we not only need to write the fence as signal that an IB is 
submitted, but also adjust a bunch of privileged hardware registers.


When userspace could do that from its IBs as well then there is nothing 
blocking it from reprogramming the page table base address for example.


We could do those writes with the CPU as well, but that would be a huge 
performance drop because of the additional latency.


Christian.



So I think even if we have hw with 100% userspace submit model only we
should be still fine. It's ofc silly, because instead of using userspace
fences and gpu semaphores the hw scheduler understands we still take the
detour through drm/scheduler, but at least it's not a break-the-world
event.

Also no page fault support, userptr invalidates still stall until
end-of-batch instead of just preempting it, and all that too. But I mean
there needs to be some motivation to fix this and roll out explicit sync
:-)
-Daniel


Or do I miss something here?


For now and the next hardware generation we are safe to support the old submission
model, but the functionality of kernel queues will sooner or later go away
if it is only used for Linux.

So we need to work on something which works in the long term and get us away
from this implicit sync.

Yeah

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Christian König

Am 28.04.21 um 12:05 schrieb Daniel Vetter:

On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:

On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  wrote:

On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach  
wrote:


Ok. So that would only make the following use cases broken for now:

- amd render -> external gpu
- amd video encode -> network device

FWIW, "only" breaking amd render -> external gpu will make us pretty
unhappy

I concur. I have quite a few users with a multi-GPU setup involving
AMD hardware.

Note, if this brokenness can't be avoided, I'd prefer a to get a clear
error, and not bad results on screen because nothing is synchronized
anymore.

It's an upcoming requirement for windows[1], so you are likely to
start seeing this across all GPU vendors that support windows.  I
think the timing depends on how long the legacy hardware support
sticks around for each vendor.

Yeah but hw scheduling doesn't mean the hw has to be constructed to not
support isolating the ringbuffer at all.

E.g. even if the hw loses the bit to put the ringbuffer outside of the
userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
pte flags. Otherwise the entire "share address space with cpu side,
seamlessly" thing is out of the window.

And with that r/o bit on the ringbuffer you can once more force submit
through kernel space, and all the legacy dma_fence based stuff keeps
working. And we don't have to invent some horrendous userspace fence based
implicit sync mechanism in the kernel, but can instead do this transition
properly with drm_syncobj timeline explicit sync and protocol reving.
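For reference, the timeline semantics meant here can be modelled in a few lines of Python (a toy of the drm_syncobj timeline payload, not the real ioctl interface): a wait on point P completes once the signalled 64-bit value has reached P, and wait-before-signal is perfectly legal.

```python
class Timeline:
    """Toy timeline syncobj: a monotonically increasing 64-bit payload."""
    def __init__(self):
        self.value = 0        # last signalled point
        self.pending = []     # (point, callback) waits queued early

    def signal(self, point):
        assert point > self.value, "timeline points must increase"
        self.value = point
        ready = [(p, cb) for (p, cb) in self.pending if p <= point]
        self.pending = [(p, cb) for (p, cb) in self.pending if p > point]
        for _, cb in ready:
            cb()              # release every wait at or below this point

    def wait(self, point, callback):
        if point <= self.value:
            callback()        # already reached: completes immediately
        else:
            self.pending.append((point, callback))  # wait-before-signal

hits = []
t = Timeline()
t.wait(2, lambda: hits.append("p2"))   # queued before anything signalled
t.signal(1)
assert hits == []                      # point 2 not reached yet
t.signal(3)
assert hits == ["p2"]                  # signalling 3 also releases point 2
```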

At least I think you'd have to work extra hard to create a gpu which
cannot possibly be intercepted by the kernel, even when it's designed to
support userspace direct submit only.

Or are your hw engineers more creative here and we're screwed?


The upcoming hardware generation will have this hardware scheduler as a 
must-have, but there are certain ways we can still stick to the old 
approach:


1. The new hardware scheduler currently still supports kernel queues, 
which are essentially the same as the old hardware ring buffer.


2. Mapping the top level ring buffer into the VM at least partially 
solves the problem. This way you can't manipulate the ring buffer 
content, but the location for the fence must still be writeable.


For now and the next hardware generation we are safe to support the old 
submission model, but the functionality of kernel queues will sooner or 
later go away if it is only used for Linux.


So we need to work on something which works in the long term and get us 
away from this implicit sync.


Christian.


-Daniel


___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Christian König

Hi Dave,

Am 27.04.21 um 21:23 schrieb Marek Olšák:
Supporting interop with any device is always possible. It depends on 
which drivers we need to interoperate with and update them. We've 
already found the path forward for amdgpu. We just need to find out 
how many other drivers need to be updated and evaluate the 
cost/benefit aspect.


Marek

On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie wrote:


On Tue, 27 Apr 2021 at 22:06, Christian König wrote:
>
> Correct, we wouldn't have synchronization between device with
and without user queues any more.
>
> That could only be a problem for A+I Laptops.

Since I think you mentioned you'd only be enabling this on newer
chipsets, won't it be a problem for A+A where one A is a generation
behind the other?



Crap, that is a good point as well.



I'm not really liking where this is going btw, seems like an
ill-thought-out concept. If AMD is really going down the road of designing
hw that is currently Linux incompatible, you are going to have to
accept a big part of the burden in bringing this support into more
than just amd drivers for upcoming generations of gpu.



Well we don't really like that either, but we have no other option as 
far as I can see.


I have a couple of ideas how to handle this in the kernel without 
dma_fences, but it always require more or less changes to all existing 
drivers.


Christian.



Dave.





Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Christian König
Uff, good question. DMA-buf certainly supports that use case, but I have 
no idea if that is actually used somewhere.


Daniel do you know any case?

Christian.

Am 27.04.21 um 15:26 schrieb Marek Olšák:

Ok. So that would only make the following use cases broken for now:
- amd render -> external gpu
- amd video encode -> network device

What about the case when we get a buffer from an external device and 
we're supposed to make it "busy" when we are using it, and the 
external device wants to wait until we stop using it? Is it something 
that can happen, thus turning "external -> amd" into "external <-> amd"?


Marek

On Tue., Apr. 27, 2021, 08:50 Christian König wrote:


Only amd -> external.

We can easily install something in a user queue which waits for a
dma_fence in the kernel.

But we can't easily wait for a user queue as a dependency of a
dma_fence.

The good thing is we have this wait before signal case on Vulkan
timeline semaphores which have the same problem in the kernel.

The good news is I think we can relatively easily convert i915 and
older amdgpu devices to something which is compatible with user fences.

So yes, getting that fixed case by case should work.

Christian

Am 27.04.21 um 14:46 schrieb Marek Olšák:

I'll defer to Christian and Alex to decide whether dropping sync
with non-amd devices (GPUs, cameras etc.) is acceptable.

Rewriting those drivers to this new sync model could be done on a
case by case basis.

For now, would we only lose the "amd -> external" dependency? Or
the "external -> amd" dependency too?

Marek

On Tue., Apr. 27, 2021, 08:15 Daniel Vetter wrote:

On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák wrote:
> Ok. I'll interpret this as "yes, it will work, let's do it".

It works if all you care about is drm/amdgpu. I'm not sure
that's a
reasonable approach for upstream, but it definitely is an
approach :-)

We've already gone somewhat through the pain of drm/amdgpu
redefining
how implicit sync works without sufficiently talking with other
people, maybe we should avoid a repeat of this ...
-Daniel

>
> Marek
>
> On Tue., Apr. 27, 2021, 08:06 Christian König wrote:
>>
>> Correct, we wouldn't have synchronization between device
with and without user queues any more.
>>
>> That could only be a problem for A+I Laptops.
>>
>> Memory management will just work with preemption fences
which pause the user queues of a process before evicting
something. That will be a dma_fence, but also a well known
approach.
>>
>> Christian.
>>
>> Am 27.04.21 um 13:49 schrieb Marek Olšák:
>>
>> If we don't use future fences for DMA fences at all, e.g.
we don't use them for memory management, it can work, right?
Memory management can suspend user queues anytime. It doesn't
need to use DMA fences. There might be something that I'm
missing here.
>>
>> What would we lose without DMA fences? Just inter-device
synchronization? I think that might be acceptable.
>>
>> The only case when the kernel will wait on a future fence
is before a page flip. Everything today already depends on
userspace not hanging the gpu, which makes everything a
future fence.
>>
>> Marek
>>
>>> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter wrote:
>>>
>>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
>>> > Thanks everybody. The initial proposal is dead. Here
are some thoughts on
>>> > how to do it differently.
>>> >
>>> > I think we can have direct command submission from
userspace via
>>> > memory-mapped queues ("user queues") without changing
window systems.
>>> >
>>> > The memory management doesn't have to use GPU page
faults like HMM.
>>> > Instead, it can wait for user queues of a specific
process to go idle and
>>> > then unmap the queues, so that userspace can't submit
anything. Buffer
>>> > evictions, pinning, etc. can be executed whe

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Christian König

Only amd -> external.

We can easily install something in a user queue which waits for a 
dma_fence in the kernel.


But we can't easily wait for a user queue as a dependency of a dma_fence.

The good thing is we have this wait-before-signal case on Vulkan 
timeline semaphores, which has the same problem in the kernel.


The good news is I think we can relatively easily convert i915 and older 
amdgpu devices to something which is compatible with user fences.


So yes, getting that fixed case by case should work.
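The asymmetry is easy to see in a toy model (pure illustration, invented names): the kernel can make a user queue wait on a dma_fence by installing a memory write that flips a word the queue polls on, whereas the reverse direction would make dma_fence completion depend on unprivileged code, breaking the finite-time guarantee.

```python
class DmaFence:
    """Toy kernel fence: guaranteed to signal in finite time."""
    def __init__(self):
        self.signaled = False
        self.callbacks = []
    def signal(self):
        self.signaled = True
        for cb in self.callbacks:
            cb()

class UserQueue:
    """Toy user-mode queue: packets run in order; a WAIT packet stalls
    the queue until the given memory word becomes nonzero."""
    def __init__(self):
        self.packets = []
        self.executed = []
    def push_wait(self, mem, addr):
        self.packets.append(("wait", mem, addr))
    def push_work(self, name):
        self.packets.append(("work", name))
    def run(self):
        while self.packets:
            pkt = self.packets[0]
            if pkt[0] == "wait" and not pkt[1][pkt[2]]:
                return                  # stalled: word not written yet
            self.packets.pop(0)
            if pkt[0] == "work":
                self.executed.append(pkt[1])

# amd -> external is the easy direction: the kernel installs a
# callback that writes the word the user queue is polling on.
mem = {0: 0}
fence = DmaFence()
fence.callbacks.append(lambda: mem.__setitem__(0, 1))
q = UserQueue()
q.push_wait(mem, 0)
q.push_work("draw")
q.run()
assert q.executed == []         # queue stalls until the fence signals
fence.signal()
q.run()
assert q.executed == ["draw"]   # unblocked purely by a memory write
# The reverse (a dma_fence waiting on the queue) has no bound on when
# userspace writes the word, so it can't be expressed as a dma_fence.
```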

Christian

Am 27.04.21 um 14:46 schrieb Marek Olšák:
I'll defer to Christian and Alex to decide whether dropping sync with 
non-amd devices (GPUs, cameras etc.) is acceptable.


Rewriting those drivers to this new sync model could be done on a case 
by case basis.


For now, would we only lose the "amd -> external" dependency? Or the 
"external -> amd" dependency too?


Marek

On Tue., Apr. 27, 2021, 08:15 Daniel Vetter wrote:


On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák wrote:
> Ok. I'll interpret this as "yes, it will work, let's do it".

It works if all you care about is drm/amdgpu. I'm not sure that's a
reasonable approach for upstream, but it definitely is an approach :-)

We've already gone somewhat through the pain of drm/amdgpu redefining
how implicit sync works without sufficiently talking with other
people, maybe we should avoid a repeat of this ...
-Daniel

    >
    > Marek
>
> On Tue., Apr. 27, 2021, 08:06 Christian König wrote:
>>
>> Correct, we wouldn't have synchronization between device with
and without user queues any more.
>>
>> That could only be a problem for A+I Laptops.
>>
>> Memory management will just work with preemption fences which
pause the user queues of a process before evicting something. That
will be a dma_fence, but also a well known approach.
>>
>> Christian.
>>
>> Am 27.04.21 um 13:49 schrieb Marek Olšák:
>>
>> If we don't use future fences for DMA fences at all, e.g. we
don't use them for memory management, it can work, right? Memory
management can suspend user queues anytime. It doesn't need to use
DMA fences. There might be something that I'm missing here.
>>
>> What would we lose without DMA fences? Just inter-device
synchronization? I think that might be acceptable.
>>
>> The only case when the kernel will wait on a future fence is
before a page flip. Everything today already depends on userspace
not hanging the gpu, which makes everything a future fence.
>>
>> Marek
>>
>> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter wrote:
>>>
>>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
>>> > Thanks everybody. The initial proposal is dead. Here are
some thoughts on
>>> > how to do it differently.
>>> >
>>> > I think we can have direct command submission from userspace via
>>> > memory-mapped queues ("user queues") without changing window
systems.
>>> >
>>> > The memory management doesn't have to use GPU page faults
like HMM.
>>> > Instead, it can wait for user queues of a specific process
to go idle and
>>> > then unmap the queues, so that userspace can't submit
anything. Buffer
>>> > evictions, pinning, etc. can be executed when all queues are
unmapped
>>> > (suspended). Thus, no BO fences and page faults are needed.
>>> >
>>> > Inter-process synchronization can use timeline semaphores.
Userspace will
>>> > query the wait and signal value for a shared buffer from the
kernel. The
>>> > kernel will keep a history of those queries to know which
process is
>>> > responsible for signalling which buffer. There is only the
wait-timeout
>>> > issue and how to identify the culprit. One of the solutions
is to have the
>>> > GPU send all GPU signal commands and all timed out wait
commands via an
>>> > interrupt to the kernel driver to monitor and validate
userspace behavior.
>>> > With that, it can be identified whether the culprit is the
waiting process
>>> > or the signalling process and which one. Invalid signal/wait
parameters can
>>> > also be detected. The kernel can force-signal only the
semaphores that time
>>> >

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Christian König

Am 27.04.21 um 14:15 schrieb Daniel Vetter:

On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák  wrote:

Ok. I'll interpret this as "yes, it will work, let's do it".

It works if all you care about is drm/amdgpu. I'm not sure that's a
reasonable approach for upstream, but it definitely is an approach :-)

We've already gone somewhat through the pain of drm/amdgpu redefining
how implicit sync works without sufficiently talking with other
people, maybe we should avoid a repeat of this ...


BTW: This is coming up again for the plan here.

We once more need to think about the "other" fences which don't 
participate in the implicit sync here.


Christian.


-Daniel


Marek

On Tue., Apr. 27, 2021, 08:06 Christian König, 
 wrote:

Correct, we wouldn't have synchronization between devices with and without user 
queues any more.

That could only be a problem for A+I Laptops.

Memory management will just work with preemption fences which pause the user 
queues of a process before evicting something. That will be a dma_fence, but 
also a well known approach.

Christian.

Am 27.04.21 um 13:49 schrieb Marek Olšák:

If we don't use future fences for DMA fences at all, e.g. we don't use them for 
memory management, it can work, right? Memory management can suspend user 
queues anytime. It doesn't need to use DMA fences. There might be something 
that I'm missing here.

What would we lose without DMA fences? Just inter-device synchronization? I 
think that might be acceptable.

The only case when the kernel will wait on a future fence is before a page 
flip. Everything today already depends on userspace not hanging the gpu, which 
makes everything a future fence.

Marek

On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,  wrote:

On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:

Thanks everybody. The initial proposal is dead. Here are some thoughts on
how to do it differently.

I think we can have direct command submission from userspace via
memory-mapped queues ("user queues") without changing window systems.

The memory management doesn't have to use GPU page faults like HMM.
Instead, it can wait for user queues of a specific process to go idle and
then unmap the queues, so that userspace can't submit anything. Buffer
evictions, pinning, etc. can be executed when all queues are unmapped
(suspended). Thus, no BO fences and page faults are needed.

Inter-process synchronization can use timeline semaphores. Userspace will
query the wait and signal value for a shared buffer from the kernel. The
kernel will keep a history of those queries to know which process is
responsible for signalling which buffer. There is only the wait-timeout
issue and how to identify the culprit. One of the solutions is to have the
GPU send all GPU signal commands and all timed out wait commands via an
interrupt to the kernel driver to monitor and validate userspace behavior.
With that, it can be identified whether the culprit is the waiting process
or the signalling process and which one. Invalid signal/wait parameters can
also be detected. The kernel can force-signal only the semaphores that time
out, and punish the processes which caused the timeout or used invalid
signal/wait parameters.

The question is whether this synchronization solution is robust enough for
dma_fence and whatever the kernel and window systems need.

The proper model here is the preempt-ctx dma_fence that amdkfd uses
(without page faults). That means dma_fence for synchronization is doa, at
least as-is, and we're back to figuring out the winsys problem.

"We'll solve it with timeouts" is very tempting, but doesn't work. It's
akin to saying that we're solving deadlock issues in a locking design by
doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
avoids having to reach the reset button, but that's about it.

And the fundamental problem is that once you throw in userspace command
submission (and syncing, at least within the userspace driver, otherwise
there's kinda no point if you still need the kernel for cross-engine sync)
means you get deadlocks if you still use dma_fence for sync under
perfectly legit use-case. We've discussed that one ad nauseam last summer:

https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences

See silly diagram at the bottom.
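The cross-domain deadlock can be sketched as a toy "waits-on" graph (illustrative only; the node names are made up to summarize the argument):

```python
def find_cycle(edges, start):
    """Tiny DFS cycle finder over a 'waits-on' adjacency dict."""
    seen, path = set(), []
    def dfs(node):
        if node in path:
            return path[path.index(node):] + [node]
        if node in seen:
            return None
        seen.add(node)
        path.append(node)
        for nxt in edges.get(node, ()):
            cycle = dfs(nxt)
            if cycle:
                return cycle
        path.pop()
        return None
    return dfs(start)

# Mixing userspace fences into dma_fence creates exactly this loop:
waits_on = {
    "memory_reclaim": ["dma_fence"],      # eviction waits for the job
    "dma_fence":      ["gpu_job"],        # fence signals when job ends
    "gpu_job":        ["user_fence"],     # job spins on a userspace fence
    "user_fence":     ["alloc"],          # its signaler needs memory first
    "alloc":          ["memory_reclaim"], # which needs reclaim: deadlock
}
cycle = find_cycle(waits_on, "memory_reclaim")
assert cycle is not None and cycle[0] == cycle[-1]   # closed loop found
```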

Now I think all isn't lost, because imo the first step to getting to this
brave new world is rebuilding the driver on top of userspace fences, and
with the adjusted cmd submit model. You probably don't want to use amdkfd,
but port that as a context flag or similar to render nodes for gl/vk. Of
course that means you can only use this mode in headless, without
glx/wayland winsys support, but it's a start.
-Daniel


Marek

On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone  wrote:


Hi,

On Tue, 20 Apr 2021 at 20:30, Daniel Vetter  wrote:


The thing is, you can't do this in drm/scheduler. At least not without
splitting up t

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Christian König
Correct, we wouldn't have synchronization between devices with and 
without user queues any more.


That could only be a problem for A+I Laptops.

Memory management will just work with preemption fences which pause the 
user queues of a process before evicting something. That will be a 
dma_fence, but also a well known approach.
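A minimal sketch of that approach (toy code with invented names, not the real eviction-fence implementation): signalling the fence only requires the kernel to suspend the process's queues, never to wait for any job to finish, so it stays a valid finite-time dma_fence.

```python
class UserRing:
    """Toy user-mode ring: submissions fail once it is suspended."""
    def __init__(self):
        self.suspended = False
    def submit(self, job):
        if self.suspended:
            raise RuntimeError("ring suspended for memory management")
        return job()

class PreemptFence:
    """Toy preemption fence: signalling = suspending the user rings."""
    def __init__(self, rings):
        self.rings = rings
        self.signaled = False
    def enable_signaling(self):
        for ring in self.rings:
            ring.suspended = True   # e.g. unmap the doorbell/ring mapping
        self.signaled = True        # bounded: no GPU progress required

rings = [UserRing(), UserRing()]
fence = PreemptFence(rings)
assert rings[0].submit(lambda: "ok") == "ok"
fence.enable_signaling()            # kernel wants to evict a buffer
assert fence.signaled               # signalled without waiting on jobs
try:
    rings[1].submit(lambda: "ok")   # userspace can't sneak work in
    raise SystemExit("suspended ring must reject work")
except RuntimeError:
    pass                            # expected while evicted
```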


Christian.

Am 27.04.21 um 13:49 schrieb Marek Olšák:
If we don't use future fences for DMA fences at all, e.g. we don't use 
them for memory management, it can work, right? Memory management can 
suspend user queues anytime. It doesn't need to use DMA fences. There 
might be something that I'm missing here.


What would we lose without DMA fences? Just inter-device 
synchronization? I think that might be acceptable.


The only case when the kernel will wait on a future fence is before a 
page flip. Everything today already depends on userspace not hanging 
the gpu, which makes everything a future fence.


Marek

On Tue., Apr. 27, 2021, 04:02 Daniel Vetter wrote:


On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
> Thanks everybody. The initial proposal is dead. Here are some
thoughts on
> how to do it differently.
>
> I think we can have direct command submission from userspace via
> memory-mapped queues ("user queues") without changing window
systems.
>
> The memory management doesn't have to use GPU page faults like HMM.
> Instead, it can wait for user queues of a specific process to go
idle and
> then unmap the queues, so that userspace can't submit anything.
Buffer
> evictions, pinning, etc. can be executed when all queues are
unmapped
> (suspended). Thus, no BO fences and page faults are needed.
>
> Inter-process synchronization can use timeline semaphores.
Userspace will
> query the wait and signal value for a shared buffer from the
kernel. The
> kernel will keep a history of those queries to know which process is
> responsible for signalling which buffer. There is only the
wait-timeout
> issue and how to identify the culprit. One of the solutions is
to have the
> GPU send all GPU signal commands and all timed out wait commands
via an
> interrupt to the kernel driver to monitor and validate userspace
behavior.
> With that, it can be identified whether the culprit is the
waiting process
> or the signalling process and which one. Invalid signal/wait
parameters can
> also be detected. The kernel can force-signal only the
semaphores that time
> out, and punish the processes which caused the timeout or used
invalid
> signal/wait parameters.
>
> The question is whether this synchronization solution is robust
enough for
> dma_fence and whatever the kernel and window systems need.

The proper model here is the preempt-ctx dma_fence that amdkfd uses
(without page faults). That means dma_fence for synchronization is
doa, at
least as-is, and we're back to figuring out the winsys problem.

"We'll solve it with timeouts" is very tempting, but doesn't work.
It's
akin to saying that we're solving deadlock issues in a locking
design by
doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
avoids having to reach the reset button, but that's about it.

And the fundamental problem is that once you throw in userspace
command
submission (and syncing, at least within the userspace driver,
otherwise
there's kinda no point if you still need the kernel for
cross-engine sync)
means you get deadlocks if you still use dma_fence for sync under
perfectly legit use-case. We've discussed that one ad nauseam last
summer:


https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences



See silly diagram at the bottom.

Now I think all isn't lost, because imo the first step to getting
to this
brave new world is rebuilding the driver on top of userspace
fences, and
with the adjusted cmd submit model. You probably don't want to use
amdkfd,
but port that as a context flag or similar to render nodes for
gl/vk. Of
course that means you can only use this mode in headless, without
glx/wayland winsys support, but it's a start.
-Daniel

>
> Marek
>
> On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone wrote:
>
> > Hi,
> >
> > On Tue, 20 Apr 2021 at 20:30, Daniel Vetter wrote:
> >
> >> The thing is, you can't do this in drm/scheduler. At least
not without
> >> splitting up the dma_fence in the kernel into separate memory
fences
> >> and sync fences
> >
> >
> > I'm starting to think this thread 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Christian König



Am 20.04.21 um 19:44 schrieb Daniel Stone:

Hi,

On Tue, 20 Apr 2021 at 16:46, Jason Ekstrand wrote:


It's still early in the morning here and I'm not awake yet so sorry if
this comes out in bits and pieces...


No problem, it's helpful. If I weren't on this thread I'd be 
attempting to put together a 73-piece chest of drawers whose 
instructions are about as clear as this so far, so I'm in the right 
head space anyway.


IMO, there are two problems being solved here which are related in
very subtle and tricky ways.  They're also, admittedly, driver
problems, not really winsys problems.  Unfortunately, they may have
winsys implications.


Yeah ... bingo.

First, is better/real timelines for Vulkan and compute. [...]

We also want something like this for compute workloads. [...]


Totally understand and agree with all of this. Memory fences seem like 
a good and useful primitive here.


Completely agree.


The second biting issue is that, in the current kernel implementation
of dma-fence and dma_resv, we've lumped internal synchronization for
memory management together with execution synchronization for
userspace dependency tracking.  And we have no way to tell the
difference between the two internally.  Even if user space is passing
around sync_files and trying to do explicit sync, once you get inside
the kernel, they're all dma-fences and it can't tell the difference.


Funny, because 'lumped [the two] together' is exactly the crux of my 
issues ...


If we move


Stop here, because ...

to a more userspace-controlled synchronization model with
wait-before-signal and no timeouts unless requested, regardless of the
implementation, it plays really badly dma-fence.  And, by "badly" I
mean the two are nearly incompatible.


I would go further than that, and say completely, fundamentally, 
conceptually, incompatible.


+1


From a user space PoV, it means
it's tricky to provide the finite time dma-fence guarantee. From a
kernel PoV, it's way worse.  Currently, the way dma-fence is
constructed, it's impossible to deadlock assuming everyone follows the
rules.  The moment we allow user space to deadlock itself and allow
those deadlocks to leak into the kernel, we have a problem. Even if
we throw in some timeouts, we still have a scenario where user space
has one linearizable dependency graph for execution synchronization
and the kernel has a different linearizable dependency graph for
memory management and, when you smash them together, you may have
cycles in your graph.

So how do we sort this all out?  Good question.  It's a hard problem.
Probably the hardest problem here is the second one: the intermixing
of synchronization types.  Solving that one is likely going to require
some user space re-plumbing because all the user space APIs we have
for explicit sync are built on dma-fence.


Gotcha.

Firstly, let's stop, as you say, lumping things together. Timeline 
semaphores and compute's GPU-side spinlocks etc, are one thing. I 
accept those now have a hard requirement on something like memory 
fences, where any responsibility is totally abrogated. So let's run 
with that in our strawman: Vulkan compute & graphics & transfer queues 
all degenerate to something spinning (hopefully GPU-assisted gentle 
spin) on a uint64 somewhere. The kernel has (in the general case) no 
visibility or responsibility into these things. Fine - that's one side 
of the story.


Exactly, yes.



But winsys is something _completely_ different. Yes, you're using the 
GPU to do things with buffers A, B, and C to produce buffer Z. Yes, 
you're using vkQueuePresentKHR to schedule that work. Yes, Mutter's 
composition job might depend on a Chromium composition job which 
depends on GTA's render job which depends on GTA's compute job which 
might take a year to complete. Mutter's composition job needs to 
complete in 'reasonable' (again, FSVO) time, no matter what. The two 
are compatible.


How? Don't lump them together. Isolate them aggressively, and 
_predictably_ in a way that you can reason about.


What clients do in their own process space is their own 
business. Games can deadlock themselves if they get wait-before-signal 
wrong. Compute jobs can run for a year. Their problem. Winsys is not 
that, because you're crossing every isolation boundary possible. 
Process, user, container, VM - every kind of privilege boundary. Thus 
far, dma_fence has protected us from the most egregious abuses by 
guaranteeing bounded-time completion; it also acts as a sequencing 
primitive, but from the perspective of a winsys person that's of 
secondary importance, which is probably one of the bigger disconnects 
between winsys people and GPU driver people.


Finally somebody who understands me :)

Well the question is then how do we get winsys and your own process 
space 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Christian König



Am 20.04.21 um 17:07 schrieb Daniel Stone:
On Tue, 20 Apr 2021 at 15:58, Christian König wrote:


Am 20.04.21 um 16:53 schrieb Daniel Stone:

On Mon, 19 Apr 2021 at 11:48, Marek Olšák wrote:

Deadlock mitigation to recover from segfaults:
- The kernel knows which process is obliged to signal which
fence. This information is part of the Present request and
supplied by userspace.
- If the producer crashes, the kernel signals the submit
fence, so that the consumer can make forward progress.
- If the consumer crashes, the kernel signals the return
fence, so that the producer can reclaim the buffer.
- A GPU hang signals all fences. Other deadlocks will be
handled like GPU hangs.


Another thought: with completely arbitrary userspace fencing,
none of this is helpful either. If the compositor can't guarantee
that a hostile client has submitted a fence which will never be
signaled, then it won't be waiting on it, so it already needs
infrastructure to handle something like this.



That already handles the crashed-client case, because if the
client crashes, then its connection will be dropped, which will
trigger the compositor to destroy all its resources anyway,
including any pending waits.


Exactly that's the problem. A compositor isn't immediately
informed that the client crashed; instead it is still referencing
the buffer and trying to use it for compositing.


If the compositor no longer has a guarantee that the buffer will be 
ready for composition in a reasonable amount of time (which dma_fence 
gives us, and this proposal does not appear to give us), then the 
compositor isn't trying to use the buffer for compositing, it's 
waiting asynchronously on a notification that the fence has signaled 
before it attempts to use the buffer.


Marek's initial suggestion is that the kernel signal the fence, which 
would unblock composition (and presumably show garbage on screen, or 
at best jump back to old content).


My position is that the compositor will know the process has crashed 
anyway - because its socket has been closed - at which point we 
destroy all the client's resources including its windows and buffers 
regardless. Signaling the fence doesn't give us any value here, 
_unless_ the compositor is just blindly waiting for the fence to 
signal ... which it can't do because there's no guarantee the fence 
will ever signal.


Yeah, but that assumes that the compositor has a chance to not blindly 
wait for the client to finish rendering, and as Daniel explained that is 
rather unrealistic.


What we need is a fallback mechanism which signals the fence after a 
timeout and gives a penalty to the one causing the timeout.


That gives us the same functionality we have today with the in software 
scheduler inside the kernel.


Regards,
Christian.


Cheers,
Daniel




Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Christian König



Am 20.04.21 um 16:53 schrieb Daniel Stone:

Hi,

On Mon, 19 Apr 2021 at 11:48, Marek Olšák wrote:


Deadlock mitigation to recover from segfaults:
- The kernel knows which process is obliged to signal which fence.
This information is part of the Present request and supplied by
userspace.
- If the producer crashes, the kernel signals the submit fence, so
that the consumer can make forward progress.
- If the consumer crashes, the kernel signals the return fence, so
that the producer can reclaim the buffer.
- A GPU hang signals all fences. Other deadlocks will be handled
like GPU hangs.
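As a toy model of the mitigation quoted above (names invented for 
illustration, not any real UAPI): the kernel tracks which side owes 
which fence and force-signals the right one when that side dies, and a 
GPU hang signals everything.

```c
#include <stdbool.h>

/* Toy model of one Present request and the fences each side owes. */
struct present_req {
    bool submit_signaled;  /* owed by the producer (application) */
    bool return_signaled;  /* owed by the consumer (compositor)  */
};

/* Producer died: signal its submit fence so the consumer can proceed. */
void on_producer_exit(struct present_req *r)
{
    r->submit_signaled = true;
}

/* Consumer died: signal the return fence so the producer can reclaim
 * the buffer. */
void on_consumer_exit(struct present_req *r)
{
    r->return_signaled = true;
}

/* A GPU hang signals all outstanding fences of all pending requests. */
void on_gpu_hang(struct present_req *reqs, int n)
{
    for (int i = 0; i < n; i++) {
        reqs[i].submit_signaled = true;
        reqs[i].return_signaled = true;
    }
}
```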


Another thought: with completely arbitrary userspace fencing, none of 
this is helpful either. If the compositor can't guarantee that a 
hostile client hasn't submitted a fence which will never be signaled, 
then it can't be blindly waiting on it, so it already needs 
infrastructure to handle something like this.


That already handles the crashed-client case, because if the client 
crashes, then its connection will be dropped, which will trigger the 
compositor to destroy all its resources anyway, including any pending 
waits.


Exactly that's the problem. A compositor isn't immediately informed that 
the client crashed, instead it is still referencing the buffer and 
trying to use it for compositing.




GPU hangs also look pretty similar; it's an infinite wait, until the 
client resubmits a new buffer which would replace (& discard) the old.


Correct. You just need to assume that all queues get destroyed and 
re-initialized when a GPU reset happens.




So signal-fence-on-process-exit isn't helpful and doesn't provide any 
extra reliability; it in fact probably just complicates things.


Well it is when you go for partial GPU resets.

Regards,
Christian.



Cheers,
Daniel



Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Christian König

Hi Daniel,

Am 20.04.21 um 14:01 schrieb Daniel Vetter:

On Mon, Apr 19, 2021 at 06:47:48AM -0400, Marek Olšák wrote:

Hi,

This is our initial proposal for explicit fences everywhere and new memory
management that doesn't use BO fences. It's a redesign of how Linux
graphics drivers work, and it can coexist with what we have now.


*1. Introduction*
(skip this if you are already sold on explicit fences)

The current Linux graphics architecture was initially designed for GPUs
with only one graphics queue where everything was executed in the
submission order and per-BO fences were used for memory management and
CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
queues were added on top, which required the introduction of implicit
GPU-GPU synchronization between queues of different processes using per-BO
fences. Recently, even parallel execution within one queue was enabled
where a command buffer starts draws and compute shaders, but doesn't wait
for them, enabling parallelism between back-to-back command buffers.
Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
was created to enable all those use cases, and it's the only reason why the
scheduler exists.

The GPU scheduler, implicit synchronization, BO-fence-based memory
management, and the tracking of per-BO fences increase CPU overhead and
latency, and reduce parallelism. There is a desire to replace all of them
with something much simpler. Below is how we could do it.

I get the feeling you're mixing up a lot of things here that have more
nuance, so first some lingo.

- There's kernel based synchronization, based on dma_fence. These come in
   two major variants: Implicit synchronization, where the kernel attaches
   the dma_fences to a dma-buf, and explicit synchronization, where the
   dma_fence gets passed around as a stand-alone object, either a sync_file
   or a drm_syncobj

- Then there's userspace fence synchronization, where userspace issues any
   fences directly and the kernel doesn't even know what's going on. This
   is the only model that allows you to ditch the kernel overhead, and it's
   also the model that vk uses.

   I concur with Jason that this one is the future, it's the model hw
   wants, compute wants and vk wants. Building an explicit fence world
   which doesn't aim at this is imo wasted effort.

Now you smash them into one thing by also changing the memory model, but I
think that doesn't work:

- Relying on gpu page faults across the board wont happen. I think right
   now only amd's GFX10 or so has enough pagefault support to allow this,


It's even worse. GFX9 has enough support so that this can in theory work.

Because of this Felix and his team are working on HMM support based on 
this generation.


On GFX10 some aspects of it are improved while others are totally broken 
again.



   and not even there I'm really sure. Nothing else will anytime soon, at
   least not as far as I know. So we need to support slightly more hw in
   upstream than just that.  Any plan that's realistic needs to cope with
   dma_fence for a really long time.

- Pown^WPin All The Things! is probably not a general enough memory
   management approach. We've kinda tried for years to move away from it.
   Sure we can support it as an optimization in specific workloads, and it
   will make stuff faster, but it's not going to be the default I think.

- We live in a post xf86-video-$vendor world, and all these other
   compositors rely on implicit sync. You're not going to be able to get
   rid of them anytime soon. What's worse, all the various EGL/vk buffer
   sharing things also rely on implicit sync, so you get to fix up tons of
   applications on top. Any plan that's realistic needs to cope with
   implicit and explicit sync at the same time.

- Absolutely infuriating, but you can't use page-faulting together with any
   dma_fence synchronization primitives, whether implicit or explicit. This
   means until the entire ecosystem moved forward (good luck with that) we
   have to support dma_fence. The only sync model that works together with
   page faults is userspace fence based sync.

Then there's the somewhat aside topic of how amdgpu/radeonsi does implicit
sync, at least last I checked. Currently this oversynchronizes badly
because it's left to the kernel to guess what should be synchronized, and
that gets things wrong. What you need there is explicit implicit
synchronization:

- on the cs side, userspace must set explicit for which buffers the kernel
   should engage in implicit synchronization. That's how it works on all
   other drivers that support more explicit userspace like vk or gl drivers
   that are internally all explicit. So essentially you only set the
   implicit fence slot when you really want to, and only userspace knows
   this. Implementing this without breaking the current logic probably
   needs some flags.

- the other side isn't there yet upstream, but Jason has patches.
   
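The cs-side change described above boils down to a per-BO opt-in flag. 
A sketch of the shape of it (the flag name and struct are hypothetical, 
not any existing amdgpu UAPI):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-BO flag in a CS submission. */
#define BO_FLAG_IMPLICIT_SYNC (1u << 0)

struct cs_bo_entry {
    uint32_t handle;  /* GEM handle of the buffer */
    uint32_t flags;   /* opt-in flags set by userspace */
};

/* Kernel-side decision: engage implicit sync only for buffers that
 * userspace explicitly marked, instead of guessing (and
 * oversynchronizing) for every buffer in the submission. */
bool needs_implicit_sync(const struct cs_bo_entry *e)
{
    return (e->flags & BO_FLAG_IMPLICIT_SYNC) != 0;
}
```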

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Christian König

Yeah. If we go with userspace fences, then userspace can hang itself. Not
the kernel's problem.


Well, the path of inner peace begins with four words. “Not my fucking 
problem.”


But I'm not so much concerned about the kernel as about important 
userspace processes like X, Wayland, SurfaceFlinger etc...


I mean attaching a page to a sync object and allowing it to be waited 
on and signaled from both the CPU and the GPU side is not so much of a 
problem.
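A sketch of that idea (illustrative only; a real implementation would 
put the sequence number in a pinned page mapped into both the CPU and 
the GPU address space):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* The "page" backing the sync object: a single monotonically
 * increasing 64-bit sequence number. */
struct sync_page {
    _Atomic uint64_t seq;
};

/* Signal progress up to 'point'; the sequence only ever moves forward,
 * so late or duplicate signals are harmless. */
void sync_signal(struct sync_page *s, uint64_t point)
{
    uint64_t cur = atomic_load(&s->seq);
    while (cur < point &&
           !atomic_compare_exchange_weak(&s->seq, &cur, point))
        ;  /* CAS failed: 'cur' was reloaded, retry */
}

/* Non-blocking check whether 'point' has been reached; a real waiter
 * would sleep on this condition instead of polling. */
bool sync_reached(struct sync_page *s, uint64_t point)
{
    return atomic_load(&s->seq) >= point;
}
```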



You have to somehow handle that, e.g. perhaps with conditional
rendering and just using the old frame in compositing if the new one
doesn't show up in time.


Nice idea, but how would you handle that on the OpenGL/Glamor/Vulkan level?

Regards,
Christian.

Am 20.04.21 um 13:16 schrieb Daniel Vetter:

On Tue, Apr 20, 2021 at 07:03:19AM -0400, Marek Olšák wrote:

Daniel, are you suggesting that we should skip any deadlock prevention in
the kernel, and just let userspace wait for and signal any fence it has
access to?

Yeah. If we go with userspace fences, then userspace can hang itself. Not
the kernel's problem. The only criteria is that the kernel itself must
never rely on these userspace fences, except for stuff like implementing
optimized cpu waits. And in those we must always guarantee that the
userspace process remains interruptible.

It's a completely different world from dma_fence based kernel fences,
whether those are implicit or explicit.


Do you have any concern with the deprecation/removal of BO fences in the
kernel assuming userspace is only using explicit fences? Any concern with
the submit and return fences for modesetting and other producer<->consumer
scenarios?

Let me work on the full reply for your rfc first, because there's a lot
of details here and nuance.
-Daniel


Thanks,
Marek

On Tue, Apr 20, 2021 at 6:34 AM Daniel Vetter  wrote:


On Tue, Apr 20, 2021 at 12:15 PM Christian König
 wrote:

Am 19.04.21 um 17:48 schrieb Jason Ekstrand:

Not going to comment on everything on the first pass...

On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák  wrote:

Hi,

This is our initial proposal for explicit fences everywhere and new

memory management that doesn't use BO fences. It's a redesign of how Linux
graphics drivers work, and it can coexist with what we have now.


1. Introduction
(skip this if you are already sold on explicit fences)

The current Linux graphics architecture was initially designed for

GPUs with only one graphics queue where everything was executed in the
submission order and per-BO fences were used for memory management and
CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
queues were added on top, which required the introduction of implicit
GPU-GPU synchronization between queues of different processes using per-BO
fences. Recently, even parallel execution within one queue was enabled
where a command buffer starts draws and compute shaders, but doesn't wait
for them, enabling parallelism between back-to-back command buffers.
Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
was created to enable all those use cases, and it's the only reason why the
scheduler exists.

The GPU scheduler, implicit synchronization, BO-fence-based memory

management, and the tracking of per-BO fences increase CPU overhead and
latency, and reduce parallelism. There is a desire to replace all of them
with something much simpler. Below is how we could do it.


2. Explicit synchronization for window systems and modesetting

The producer is an application and the consumer is a compositor or a

modesetting driver.

2.1. The Present request

As part of the Present request, the producer will pass 2 fences (sync

objects) to the consumer alongside the presented DMABUF BO:

- The submit fence: Initially unsignalled, it will be signalled when

the producer has finished drawing into the presented buffer.

- The return fence: Initially unsignalled, it will be signalled when

the consumer has finished using the presented buffer.

I'm not sure syncobj is what we want.  In the Intel world we're trying
to go even further to something we're calling "userspace fences" which
are a timeline implemented as a single 64-bit value in some
CPU-mappable BO.  The client writes a higher value into the BO to
signal the timeline.

Well that is exactly what our Windows guys have suggested as well, but
it strongly looks like this isn't sufficient.

First of all you run into security problems when any application can
just write any value to that memory location. Just imagine an
application sets the counter to zero and X waits forever for some
rendering to finish.

The thing is, with userspace fences security boundary issue prevent
moves into userspace entirely. And it really doesn't matter whether
the event you're waiting on doesn't complete because the other app
crashed or was stupid or intentionally gave you a wrong fence point:
You have to somehow handle that, e.g. perhaps with conditional
rendering and just using the old frame in compositing if the new one
doesn't show up in time.

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Christian König

Am 19.04.21 um 17:48 schrieb Jason Ekstrand:

Not going to comment on everything on the first pass...

On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák  wrote:

Hi,

This is our initial proposal for explicit fences everywhere and new memory 
management that doesn't use BO fences. It's a redesign of how Linux graphics 
drivers work, and it can coexist with what we have now.


1. Introduction
(skip this if you are already sold on explicit fences)

The current Linux graphics architecture was initially designed for GPUs with 
only one graphics queue where everything was executed in the submission order 
and per-BO fences were used for memory management and CPU-GPU synchronization, 
not GPU-GPU synchronization. Later, multiple queues were added on top, which 
required the introduction of implicit GPU-GPU synchronization between queues of 
different processes using per-BO fences. Recently, even parallel execution 
within one queue was enabled where a command buffer starts draws and compute 
shaders, but doesn't wait for them, enabling parallelism between back-to-back 
command buffers. Modesetting also uses per-BO fences for scheduling flips. Our 
GPU scheduler was created to enable all those use cases, and it's the only 
reason why the scheduler exists.

The GPU scheduler, implicit synchronization, BO-fence-based memory management, 
and the tracking of per-BO fences increase CPU overhead and latency, and reduce 
parallelism. There is a desire to replace all of them with something much 
simpler. Below is how we could do it.


2. Explicit synchronization for window systems and modesetting

The producer is an application and the consumer is a compositor or a 
modesetting driver.

2.1. The Present request

As part of the Present request, the producer will pass 2 fences (sync objects) 
to the consumer alongside the presented DMABUF BO:
- The submit fence: Initially unsignalled, it will be signalled when the 
producer has finished drawing into the presented buffer.
- The return fence: Initially unsignalled, it will be signalled when the 
consumer has finished using the presented buffer.

I'm not sure syncobj is what we want.  In the Intel world we're trying
to go even further to something we're calling "userspace fences" which
are a timeline implemented as a single 64-bit value in some
CPU-mappable BO.  The client writes a higher value into the BO to
signal the timeline.


Well that is exactly what our Windows guys have suggested as well, but 
it strongly looks like this isn't sufficient.


First of all you run into security problems when any application can 
just write any value to that memory location. Just imagine an 
application sets the counter to zero and X waits forever for some 
rendering to finish.
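Concretely, a toy model of the failure mode (not any real API): if the 
timeline is just a shared 64-bit value that any mapping can scribble 
over, a bounded wait is the only safe way to consume it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy userspace-fence timeline: just a shared 64-bit value that any
 * process with a mapping can write. */
typedef uint64_t uf_timeline;

/* A well-behaved signaller only moves the value forward... */
void uf_signal(uf_timeline *t, uint64_t point)
{
    if (point > *t)
        *t = point;
}

/* ...but nothing stops a broken or hostile client from doing this. */
void uf_clobber(uf_timeline *t)
{
    *t = 0;
}

/* A waiter therefore needs a bounded wait: an unconditional
 * "while (*t < point);" can spin forever.  Returns true if the wait
 * point was reached within max_polls iterations. */
bool uf_wait_bounded(const uf_timeline *t, uint64_t point, int max_polls)
{
    for (int i = 0; i < max_polls; i++)
        if (*t >= point)
            return true;
    return false;
}
```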


Additionally, in such a model you can't determine which queue is the 
guilty one in case of a hang, and you can't reset the synchronization 
primitives in case of an error.


Apart from that this is rather inefficient, e.g. we don't have any way 
to prevent priority inversion when used as a synchronization mechanism 
between different GPU queues.


Christian.


   The kernel then provides some helpers for
waiting on them reliably and without spinning.  I don't expect
everyone to support these right away but, If we're going to re-plumb
userspace for explicit synchronization, I'd like to make sure we take
this into account so we only have to do it once.



Deadlock mitigation to recover from segfaults:
- The kernel knows which process is obliged to signal which fence. This 
information is part of the Present request and supplied by userspace.

This isn't clear to me.  Yes, if we're using anything dma-fence based
like syncobj, this is true.  But it doesn't seem totally true as a
general statement.



- If the producer crashes, the kernel signals the submit fence, so that the 
consumer can make forward progress.
- If the consumer crashes, the kernel signals the return fence, so that the 
producer can reclaim the buffer.
- A GPU hang signals all fences. Other deadlocks will be handled like GPU hangs.

What do you mean by "all"?  All fences that were supposed to be
signaled by the hung context?



Other window system requests can follow the same idea.

Merged fences where one fence object contains multiple fences will be 
supported. A merged fence is signalled only when its fences are signalled. The 
consumer will have the option to redefine the unsignalled return fence to a 
merged fence.
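The merged-fence semantics above can be sketched as follows (a toy 
model with fences reduced to plain flags, not the proposed UAPI):

```c
#include <stdbool.h>

/* A merged fence over up to MERGE_MAX component fences; it counts as
 * signalled only when every component is signalled. */
#define MERGE_MAX 8

struct merged_fence {
    const bool *parts[MERGE_MAX];
    int         count;
};

bool merged_signaled(const struct merged_fence *m)
{
    for (int i = 0; i < m->count; i++)
        if (!*m->parts[i])
            return false;  /* at least one component still pending */
    return true;
}
```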

2.2. Modesetting

Since a modesetting driver can also be the consumer, the present ioctl will 
contain a submit fence and a return fence too. One small problem with this is 
that userspace can hang the modesetting driver, but in theory, any later 
present ioctl can override the previous one, so the unsignalled presentation is 
never used.


3. New memory management

The per-BO fences will be removed and the kernel will not know which buffers 
are busy. This will reduce CPU overhead and latency. The kernel will not need 
per-BO fences 

Re: [Mesa-dev] Dmabuf based render buffers!?

2021-01-15 Thread Christian König

Am 15.01.21 um 11:26 schrieb Michel Dänzer:

On 2021-01-14 8:02 p.m., Blueroom wrote:


Hi Everyone!

I have a program that’s using dmabufs to create a zero-copy 
camera->GL texture pipeline and it’s working great on my RPi4.


Now as a last step I want to access the pixels that I’ve processed in 
GL with shaders, on the CPU.


I’ve been told that on the Raspberry Pi OpenGL is sharing the same 
memory as the CPU so I’m hoping it would be possible to do something 
like a dmabuf on the ‘way out’ too?


Does anyone have any pointers in how this could work?


I'd recommend using glGetTexImage or other similar GL APIs for getting 
the data out of the GL texture.


While mmap of a dma-buf file descriptor works in theory, direct CPU 
reads from GPU accessible memory can be very slow on some platforms.




Yeah, agree. Additional to that you don't know the format of the DMA-buf 
of the texture.


The input image is most likely linear, but the output might be tiled.

Christian.


Re: [Mesa-dev] Hardware assisted (VDPAU) decoding of MPEG-2 causes GPU lockup on Radeon HD6320

2019-12-02 Thread Christian König

Well how was the stack configured then? Pure software playback?


In 19.10, yes the whole stack was told to use software playback and 
decoding.


I would investigate this way. 1920x1080 is not a high resolution and 
should decode with the CPU just fine.



Our older Gentoo based setup with the old software stack worked fine


The hardware generally does not support some interlaced frame/field 
features rarely used in today's MPEG2 streams, and a software fallback 
isn't easily doable with VA-API/VDPAU and so was never implemented.


Are you sure that the Gentoo based setup isn't software decoding based?

Regards,
Christian.

Am 02.12.19 um 16:02 schrieb Will DeBerry:


What regression was that? The difference between VDPAU and VA-API
is only marginal for codec support.


The regression revolved around deinterlacing the content. If we had to 
deinterlace 1080i for instance, the playback was very choppy and 
dropped frames.


Well how was the stack configured then? Pure software playback?


In 19.10, yes the whole stack was told to use software playback and 
decoding.


As long as you don't do any high resolution H264 decoding the CPU
in that package should be capable of decoding both MPEG2 and MPEG4
with software decode.


That's part of the problem. We do have high resolution (1920x1080) 
mpeg2 at the places we are installed. We have no control over what 
content is available but have to support it.


Our older Gentoo based setup with the old software stack worked fine 
but the Ubuntu 16.04 stack does not playback the same content without 
having to switch to VDPAU and thus introduce the GPU thread lockup 
issue. Ubuntu 18.04 looks to have the same VDPAU GPU lockup issue as 
well and cannot use software playback/decoding successfully.


On Mon, Dec 2, 2019 at 9:16 AM Christian König 
<christian.koe...@amd.com> wrote:



The reason we had to switch to VDPAU with Ubuntu 16.04 is that we
saw a major regression with mpeg2 playback using va-api.

What regression was that? The difference between VDPAU and VA-API
is only marginal for codec support.


During our testing we put Ubuntu 19.10 on one of these boxes and
noticed that full software acceleration has improved to the point
that VA-API nor VDPAU was required for VLC to render the mpeg2
and mpeg4 streams correctly.

Well how was the stack configured then? Pure software playback?

As long as you don't do any high resolution H264 decoding the CPU
in that package should be capable of decoding both MPEG2 and MPEG4
with software decode.

In general I would also try with mpv instead of vlc to rule out
player issues.

Regards,
Christian.

Am 02.12.19 um 15:06 schrieb Will DeBerry:


well that's the very first APU generation and unfortunately
nobody is working on that old hardware any more.


Agreed, definitely old hardware. Unfortunately we have 10,000 of
these things in production and they have been playing hardware
accelerated mpeg2 fine until we upgraded to Ubuntu 16.04 and the
new mesa package. Now to be specific, on our previous version of
Linux on these systems we were using an older software stack and
video acceleration pipeline, but it worked perfectly, so we know
the hardware is capable.

*Old Software Stack:*

  * vlc 2.1.5
  * mesa 11.0.6
  * va-api hardware acceleration
  * libva info: VA-API version 0.38.1
libva info: va_getDriverName() returns 0
libva info: Trying to open /usr/lib/va/drivers/r600_drv_video.so
libva info: Found init function __vaDriverInit_0_35
libva info: va_openDriver() returns 0
vainfo: VA-API version: 0.38 (libva 1.6.2)
vainfo: Driver version: Splitted-Desktop Systems VDPAU
backend for VA-API - 0.7.4
vainfo: Supported profile and entrypoints
      VAProfileMPEG2Simple            : VAEntrypointVLD
      VAProfileMPEG2Main              : VAEntrypointVLD
      VAProfileMPEG4Simple            : VAEntrypointVLD
      VAProfileMPEG4AdvancedSimple    : VAEntrypointVLD
      VAProfileH264Baseline           : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointVLD
      VAProfileVC1Simple              : VAEntrypointVLD
      VAProfileVC1Main                : VAEntrypointVLD
      VAProfileVC1Advanced            : VAEntrypointVLD

*New Software Stack:*

  * vlc 2.2.2
  * mesa 18.2.8-2~bpo9+1
  * vdpau hardware acceleration

The reason we had to switch to VDPAU with Ubuntu 16.04 is that we
saw a major regression with mpeg2 playback using va-api. It was
capable of playing back mpeg4 without any issues. Now that we
have switched to VDPAU however, we are seeing this GPU thread
lockup bug and thus causing X and other GUI related programs to crash 
and requiring a reboot to recover.

Re: [Mesa-dev] Hardware assisted (VDPAU) decoding of MPEG-2 causes GPU lockup on Radeon HD6320

2019-12-02 Thread Christian König
The reason we had to switch to VDPAU with Ubuntu 16.04 is that we saw 
a major regression with mpeg2 playback using va-api.
What regression was that? The difference between VDPAU and VA-API is 
only marginal for codec support.


During our testing we put Ubuntu 19.10 on one of these boxes and 
noticed that full software acceleration has improved to the point that 
VA-API nor VDPAU was required for VLC to render the mpeg2 and mpeg4 
streams correctly.

Well how was the stack configured then? Pure software playback?

As long as you don't do any high resolution H264 decoding the CPU in 
that package should be capable of decoding both MPEG2 and MPEG4 with 
software decode.


In general I would also try with mpv instead of vlc to rule out player 
issues.


Regards,
Christian.

Am 02.12.19 um 15:06 schrieb Will DeBerry:


well that's the very first APU generation and unfortunately nobody
is working on that old hardware any more.


Agreed, definitely old hardware. Unfortunately we have 10,000 of these 
things in production and they have been playing hardware accelerated 
mpeg2 fine until we upgraded to Ubuntu 16.04 and the new mesa package. 
Now to be specific, on our previous version of Linux on these systems we 
were using an older software stack and video acceleration pipeline, but 
it worked perfectly, so we know the hardware is capable.


*Old Software Stack:*

  * vlc 2.1.5
  * mesa 11.0.6
  * va-api hardware acceleration
  * libva info: VA-API version 0.38.1
libva info: va_getDriverName() returns 0
libva info: Trying to open /usr/lib/va/drivers/r600_drv_video.so
libva info: Found init function __vaDriverInit_0_35
libva info: va_openDriver() returns 0
vainfo: VA-API version: 0.38 (libva 1.6.2)
vainfo: Driver version: Splitted-Desktop Systems VDPAU backend for
VA-API - 0.7.4
vainfo: Supported profile and entrypoints
      VAProfileMPEG2Simple            : VAEntrypointVLD
      VAProfileMPEG2Main              : VAEntrypointVLD
      VAProfileMPEG4Simple            : VAEntrypointVLD
      VAProfileMPEG4AdvancedSimple    : VAEntrypointVLD
      VAProfileH264Baseline           : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointVLD
      VAProfileVC1Simple              : VAEntrypointVLD
      VAProfileVC1Main                : VAEntrypointVLD
      VAProfileVC1Advanced            : VAEntrypointVLD

*New Software Stack:*

  * vlc 2.2.2
  * mesa 18.2.8-2~bpo9+1
  * vdpau hardware acceleration

The reason we had to switch to VDPAU with Ubuntu 16.04 is that we saw 
a major regression with mpeg2 playback using va-api. It was capable of 
playing back mpeg4 without any issues. Now that we have switched to 
VDPAU however, we are seeing this GPU thread lockup bug and thus 
causing X and other GUI related programs to crash and requiring a 
reboot to recover.


Changing out hardware for the next best thing is not an option at our 
scale and we know that the hardware is capable due to past 
experiences. We are just in need of assistance with someone or some 
party that knows that stack a lot more than us to help dig to the core 
issue of the lockup or help us get VA-API working for mpeg2 in 16.04.


So the best approach is probably to not use hardware acceleration
for MPEG2 clips in general.


With software decoding, the performance doesn't produce something that 
is watchable. One interesting tidbit to note. During our testing we 
put Ubuntu 19.10 on one of these boxes and noticed that full software 
acceleration has improved to the point that VA-API nor VDPAU was 
required for VLC to render the mpeg2 and mpeg4 streams correctly. Is 
this something that could potentially be backported to Ubuntu 16.04? I 
know this is a much bigger task that the one sentence ask alludes to, 
but figured I'd ask anyway.


We are more than welcome to work together on this, especially since 
the hardware is older and probably hard to find. Just needing to find 
a solution so we can move forward on upgrading the software and on 
these older hardware.


On Thu, Nov 28, 2019 at 7:15 AM Christian König 
<ckoenig.leichtzumer...@gmail.com> wrote:


Hi Will,

well that's the very first APU generation and unfortunately nobody
is working on that old hardware any more.

MPEG2 is known to not be fully supported by that chipset in
general. So the best approach is probably to not use hardware
acceleration for MPEG2 clips in general.

Regards,
Christian.

Am 27.11.19 um 18:35 schrieb Will DeBerry:

Hi all,

I am reaching out hoping to get some assistance with resolving a
bug/crash that we see with the GPU when using VDPAU hardware
acceleration on Ubuntu 16.04. This is specific to the r600
drivers interacting with VDPAU when trying to playback certain
mpeg2 content.

*GPU in question per lscpi: *
00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. 
[AMD/ATI] Wrestler [Radeon HD 6320]

Re: [Mesa-dev] Hardware assisted (VDPAU) decoding of MPEG-2 causes GPU lockup on Radeon HD6320

2019-11-28 Thread Christian König

Hi Will,

well that's the very first APU generation and unfortunately nobody is 
working on that old hardware any more.


MPEG2 is known to not be fully supported by that chipset in general. So 
the best approach is probably to not use hardware acceleration for MPEG2 
clips in general.


Regards,
Christian.

Am 27.11.19 um 18:35 schrieb Will DeBerry:

Hi all,

I am reaching out hoping to get some assistance with resolving a 
bug/crash that we see with the GPU when using VDPAU hardware 
acceleration on Ubuntu 16.04. This is specific to the r600 drivers 
interacting with VDPAU when trying to playback certain mpeg2 content.


*GPU in question per lscpi: *
00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. 
[AMD/ATI] Wrestler [Radeon HD 6320]


We are highly invested in this GPU and would love to get this 
addressed as soon as possible and are also willing to sponsor this 
work if needed.


*Steps to Recreate:*

 1. Launch VLC with VDPAU hardware acceleration and deinterlacing enabled
 2. Play the attached piece of known bad content
 3. Wait for GPU lockup

Per dmesg, the GPU thread gets locked up within the kernel and thus 
breaks all GUI related activities until the PC is rebooted.


*Mesa Version Tested:*
18.0.5-0ubuntu0~16.04.1
18.2.8-2~bpo9+1

Let me know if you have any questions or are interested in discussing 
this further.


Thanks,

--

*Will DeBerry*

Manager, Client Devices

GetWellNetwork

o: 240.482.4237 | m: 813.330.0121 | getwellnetwork.com 




Re: [Mesa-dev] [PATCH] pipe-loader: use radeonsi for MM if amdgpu dri is used

2019-07-15 Thread Christian König

Am 15.07.19 um 16:15 schrieb Michel Dänzer:

On 2019-07-15 4:11 p.m., Newton, Jeremy wrote:

Sorry about that, I've only used git email maybe three times in my life :)

Nothing to apologize for, everybody has to learn that kind of thing. :)


To be honest even after more than a decade I still get this wrong from 
time to time.


So really don't worry about stuff like that :)

Christian.

Re: [Mesa-dev] [PATCH] radeonsi: Expose support for 10-bit VP9 decode

2019-06-28 Thread Christian König

Am 28.06.19 um 07:09 schrieb Vishwakarma, Pratik:

Fix si_vid_is_format_supported to expose support
for 10-bit VP9 decode using P016 format. Without
this change, 10-bit decode will be exposed only
for HEVC even though newer hardware support
10-bit decode for VP9.

Signed-off-by: Pratik Vishwakarma 


Reviewed-by: Christian König 


---
 src/gallium/drivers/radeonsi/si_get.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/gallium/drivers/radeonsi/si_get.c b/src/gallium/drivers/radeonsi/si_get.c
index 4e23d283ab7..8cc5933f9bc 100644
--- a/src/gallium/drivers/radeonsi/si_get.c
+++ b/src/gallium/drivers/radeonsi/si_get.c
@@ -709,6 +709,10 @@ static boolean si_vid_is_format_supported(struct pipe_screen *screen,
 		return (format == PIPE_FORMAT_NV12) ||
 		       (format == PIPE_FORMAT_P016);
 
+	/* VP9 profile 2 supports 10-bit decoding using P016 */
+	if (profile == PIPE_VIDEO_PROFILE_VP9_PROFILE2)
+		return format == PIPE_FORMAT_P016;
+
 	/* we can only handle this one with UVD */
 	if (profile != PIPE_VIDEO_PROFILE_UNKNOWN)
 		return format == PIPE_FORMAT_NV12;



Re: [Mesa-dev] [PATCH] radv: Change memory type order for GPUs without dedicated VRAM

2019-06-11 Thread Christian König

Am 10.06.19 um 15:56 schrieb Bas Nieuwenhuizen:

On Sat, Jun 8, 2019 at 3:36 PM Alex Smith  wrote:

On Mon, 3 Jun 2019 at 13:27, Koenig, Christian  wrote:

Am 03.06.19 um 14:21 schrieb Alex Smith:

On Mon, 3 Jun 2019 at 11:57, Koenig, Christian  wrote:

Am 02.06.19 um 12:32 schrieb Alex Smith:

Put the uncached GTT type at a higher index than the visible VRAM type,
rather than having GTT first.

When we don't have dedicated VRAM, we don't have a non-visible VRAM
type, and the property flags for GTT and visible VRAM are identical.
According to the spec, for types with identical flags, we should give
the one with better performance a lower index.

Previously, apps which follow the spec guidance for choosing a memory
type would have picked the GTT type in preference to visible VRAM (all
Feral games will do this), and end up with lower performance.

On a Ryzen 5 2500U laptop (Raven Ridge), this improves average FPS in
the Rise of the Tomb Raider benchmark by up to ~30%. Tested a couple of
other (Feral) games and saw similar improvement on those as well.
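The spec guidance referenced above — walk the memory types in index order and take the first one whose property flags contain everything the app requires — can be sketched as follows. The typedef and flag defines are simplified stand-ins, not the real Vulkan headers:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the Vulkan definitions. */
typedef uint32_t VkMemoryPropertyFlags;
#define VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT  0x1u
#define VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT  0x2u
#define VK_MEMORY_PROPERTY_HOST_COHERENT_BIT 0x4u

/* The selection loop the spec recommends: scan types in index order and
 * take the first one allowed by memoryTypeBits whose flags are a superset
 * of what is required.  Because drivers must order equally-flagged types
 * by performance, "first match" is also "fastest match". */
static int find_memory_type(const VkMemoryPropertyFlags *types, int count,
                            uint32_t type_bits, VkMemoryPropertyFlags required)
{
   for (int i = 0; i < count; i++) {
      if ((type_bits & (1u << i)) && (types[i] & required) == required)
         return i;
   }
   return -1;
}
```

With GTT listed before visible VRAM and identical property flags, this loop returns the GTT index — which is why reordering the types on APUs changes what spec-following apps pick.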

Well that patch doesn't look like a good idea to me.

Using VRAM over uncached GTT should have something between no and only
minimal performance difference on APU.

To make things even worse VRAM is still needed for scanout and newer
laptops have only a very very low default setting (32 or 16MB). So you
can end up in VRAM clashing on those systems.

Can you check some kernel statistics to figure out what exactly is going
on here?


What statistics should I look at?


First of all take a look at the amdgpu_gem_info file in the debugfs directory while 
using GTT and match that to using VRAM. You should see a lot more GTT ... 
CPU_GTT_USWC entries with the GTT variant. If the CPU_GTT_USWC flag is missing 
we have found the problem.

If that looks ok, then take a look at the ttm_page_pool or ttm_dma_page_pool 
file and see how many wc/uc and wc/uc huge pages you got. Huge pages should be 
used for anything larger than 2MB, if not we have found the problem.

If that still isn't the issue I need to take a look at the VM code again and 
see if we still map VRAM/GTT differently on APUs.
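The debugfs checks described above can be automated. A minimal sketch, assuming debugfs is mounted at /sys/kernel/debug and the usual amdgpu/TTM file names (the DRI minor number and exact file names vary by kernel version):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Count the lines of a (debugfs) file that contain `needle`; returns -1
 * if the file cannot be opened (debugfs not mounted, different kernel,
 * no permission, ...). */
static int count_matching_lines(const char *path, const char *needle)
{
   char line[512];
   int count = 0;
   FILE *f = fopen(path, "r");

   if (!f)
      return -1;
   while (fgets(line, sizeof(line), f)) {
      if (strstr(line, needle))
         count++;
   }
   fclose(f);
   return count;
}
```

Run as root and compare e.g. `count_matching_lines("/sys/kernel/debug/dri/0/amdgpu_gem_info", "CPU_GTT_USWC")` between the GTT and VRAM runs, and likewise grep the ttm_page_pool file for "huge" entries.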


OK, got around to looking at this. amdgpu_gem_info does have more USWC entries 
when using GTT. I've attached the output from VRAM vs GTT in case you can spot 
anything else in there.

ttm_page_pool has 9806 wc, 238 wc huge, no uc or uc huge.

To add to this, I tried rounding up the size of all application GTT
allocations to a multiple of 2 megabytes (+ a suballocator for buffers
< 2M). This increased performance a bit, but not nearly as much as going
from GTT->"VRAM" does.


I need to dig deeper when I have a bit more time.

The logs Alex provided didn't show anything obviously wrong, so no 
idea what the actual problem is here.


Anyway feel free to go ahead with this approach, but please keep in mind 
that this might cause problems on some systems.


Christian.




FWIW this was from kernel 5.0.10, I just upgraded to 5.1.6 and still the same 
perf difference there.

Thanks,
Alex



Thanks,
Christian.


Thanks,
Alex



Regards,
Christian.


Signed-off-by: Alex Smith 
---
I noticed that the memory types advertised on my Raven laptop looked a
bit odd so played around with it and found this. I'm not sure if it is
actually expected that the performance difference between visible VRAM
and GTT is so large, seeing as it's not dedicated VRAM, but the results
are clear (and consistent, tested multiple times).
---
   src/amd/vulkan/radv_device.c | 18 +++---
   1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/src/amd/vulkan/radv_device.c b/src/amd/vulkan/radv_device.c
index 3cf050ed220..d36ee226ebd 100644
--- a/src/amd/vulkan/radv_device.c
+++ b/src/amd/vulkan/radv_device.c
@@ -171,12 +171,11 @@ radv_physical_device_init_mem_types(struct 
radv_physical_device *device)
   .heapIndex = vram_index,
   };
   }
- if (gart_index >= 0) {
+ if (gart_index >= 0 && device->rad_info.has_dedicated_vram) {
   device->mem_type_indices[type_count] = 
RADV_MEM_TYPE_GTT_WRITE_COMBINE;
   device->memory_properties.memoryTypes[type_count++] = 
(VkMemoryType) {
   .propertyFlags = VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
- VK_MEMORY_PROPERTY_HOST_COHERENT_BIT |
- (device->rad_info.has_dedicated_vram ? 0 : 
VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT),
+ VK_MEMORY_PROPERTY_HOST_COHERENT_BIT,
   .heapIndex = gart_index,
   };
   }
@@ -189,6 +188,19 @@ radv_physical_device_init_mem_types(struct 
radv_physical_device *device)
   .heapIndex = visible_vram_index,
   };
   }
+ if (gart_index >= 0 && !device->rad_info.has_dedicated_vram) {
+ /* Put GTT after visible VRAM for GPUs without dedicated VRAM
+  * as they have identical property flags, and according to the

Re: [Mesa-dev] [PATCH] radeon/uvd: fix poc for hevc encode

2019-05-28 Thread Christian König
It would be better to have those checks in the state tracker than in the 
backend code.


Christian.

Am 27.05.19 um 20:41 schrieb boyuan.zh...@amd.com:

From: Boyuan Zhang 

MaxPicOrderCntLsb should be at least 16 according to the spec,
therefore add a minimum value check.

Also use the poc value passed from the state tracker instead of
calculating it in slice header encoding.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110673
Cc: mesa-sta...@lists.freedesktop.org

Signed-off-by: Boyuan Zhang 
---
  src/gallium/drivers/radeon/radeon_uvd_enc.c | 3 ++-
  src/gallium/drivers/radeon/radeon_uvd_enc_1_1.c | 3 +--
  2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/gallium/drivers/radeon/radeon_uvd_enc.c 
b/src/gallium/drivers/radeon/radeon_uvd_enc.c
index 521d08f304..9256e43a08 100644
--- a/src/gallium/drivers/radeon/radeon_uvd_enc.c
+++ b/src/gallium/drivers/radeon/radeon_uvd_enc.c
@@ -73,7 +73,8 @@ radeon_uvd_enc_get_param(struct radeon_uvd_encoder *enc,
 enc->enc_pic.general_tier_flag = pic->seq.general_tier_flag;
 enc->enc_pic.general_profile_idc = pic->seq.general_profile_idc;
 enc->enc_pic.general_level_idc = pic->seq.general_level_idc;
-   enc->enc_pic.max_poc = pic->seq.intra_period;
+   enc->enc_pic.max_poc =
+  (pic->seq.intra_period >= 16) ? pic->seq.intra_period : 16;
 enc->enc_pic.log2_max_poc = 0;
 for (int i = enc->enc_pic.max_poc; i != 0; enc->enc_pic.log2_max_poc++)
i = (i >> 1);
diff --git a/src/gallium/drivers/radeon/radeon_uvd_enc_1_1.c 
b/src/gallium/drivers/radeon/radeon_uvd_enc_1_1.c
index ddb219792a..8f0e0099e7 100644
--- a/src/gallium/drivers/radeon/radeon_uvd_enc_1_1.c
+++ b/src/gallium/drivers/radeon/radeon_uvd_enc_1_1.c
@@ -768,8 +768,7 @@ radeon_uvd_enc_slice_header_hevc(struct radeon_uvd_encoder 
*enc)
 if ((enc->enc_pic.nal_unit_type != 19)
 && (enc->enc_pic.nal_unit_type != 20)) {
radeon_uvd_enc_code_fixed_bits(enc,
- enc->enc_pic.frame_num %
- enc->enc_pic.max_poc,
+ enc->enc_pic.pic_order_cnt,
   enc->enc_pic.log2_max_poc);
if (enc->enc_pic.picture_type == PIPE_H265_ENC_PICTURE_TYPE_P)
   radeon_uvd_enc_code_fixed_bits(enc, 0x1, 1);
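The combined effect of the two hunks — clamp max_poc to the spec minimum of 16, then derive log2_max_poc with the driver's shift loop — can be checked in isolation:

```c
#include <assert.h>

/* Mirrors the patch: clamp max_poc to the spec minimum of 16, then derive
 * log2_max_poc with the same shift loop the driver uses. */
static unsigned log2_max_poc(unsigned intra_period)
{
   unsigned max_poc = intra_period >= 16 ? intra_period : 16;
   unsigned log2 = 0;

   for (unsigned i = max_poc; i != 0; log2++)
      i >>= 1;
   return log2;
}
```

With the clamp in place, log2_max_poc never drops below 5, so the fixed-bit write of pic_order_cnt in the slice header always has a sane width.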



Re: [Mesa-dev] [PATCH] winsys/amdgpu: add VCN JPEG to no user fence group

2019-05-08 Thread Christian König

Am 08.05.19 um 15:23 schrieb Liu, Leo:

On 5/8/19 9:19 AM, Koenig, Christian wrote:

Am 08.05.19 um 15:14 schrieb Liu, Leo:

On 5/8/19 9:02 AM, Christian König wrote:

[CAUTION: External Email]

Am 08.05.19 um 14:56 schrieb Liu, Leo:

There is no user fence for JPEG; the bug triggers the
kernel WARN_ON(flags & AMDGPU_FENCE_FLAG_64BIT).

Oh, we are probably going to need to check for this in the kernel as
well.

Currently we only check for UVD and VCE there,

Are you talking about the checking for JPEG engine? if that, and then
yes the check of " WARN_ON(flags & AMDGPU_FENCE_FLAG_64BIT)" is there,
that's why current JPEG is triggering that.

Yeah, but this check comes way too late.

We usually already reject command submissions when they have user fences
for UVD & VCE, see amdgpu_cs_ib_fill():

      /* UVD & VCE fw doesn't support user fences */
      ring = to_amdgpu_ring(parser->entity->rq->sched);
      if (parser->job->uf_addr && (
      ring->funcs->type == AMDGPU_RING_TYPE_UVD ||
      ring->funcs->type == AMDGPU_RING_TYPE_VCE))
      return -EINVAL;

We should probably make that a ring flag or something like that and
generalize he code here.

Then the WARN_ON in the JPEG fence code can be removed.
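The generalization Christian suggests might look like the following sketch. The struct and field names here are invented for illustration and do not match the real amdgpu code:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical generalization: instead of enumerating ring types at
 * submission time, each ring declares whether its firmware supports
 * user fences. */
struct ring_caps {
   int type;            /* AMDGPU_RING_TYPE_* in the real driver */
   bool no_user_fence;  /* would be set for UVD, VCE, VCN JPEG, ... */
};

/* Reject a submission that carries a user fence on a ring without
 * user-fence support -- the check amdgpu_cs_ib_fill() does today for
 * UVD/VCE only, expressed via the flag. */
static int check_user_fence(const struct ring_caps *ring, uint64_t uf_addr)
{
   if (uf_addr && ring->no_user_fence)
      return -EINVAL;
   return 0;
}
```

Rejecting the submission up front like this would make the late WARN_ON in the fence code redundant.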

Yep. I will take a look at this on the kernel side, in the meantime, can
I have a RB on the Mesa side?


Well Acked-by: Christian König , cause I don't 
know the Mesa code well enough.


Christian.



Thanks,
Leo



Christian.


Regards,

Leo



do you want to take a
look Leo or should I do this?

Christian.


Signed-off-by: Leo Liu 
Cc: mesa-sta...@lists.freedesktop.org
---
     src/gallium/winsys/amdgpu/drm/amdgpu_cs.c | 3 ++-
     1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/gallium/winsys/amdgpu/drm/amdgpu_cs.c
b/src/gallium/winsys/amdgpu/drm/amdgpu_cs.c
index 4a2377f7e09..972030eaaa8 100644
--- a/src/gallium/winsys/amdgpu/drm/amdgpu_cs.c
+++ b/src/gallium/winsys/amdgpu/drm/amdgpu_cs.c
@@ -378,7 +378,8 @@ static bool amdgpu_cs_has_user_fence(struct
amdgpu_cs_context *cs)
       cs->ib[IB_MAIN].ip_type != AMDGPU_HW_IP_VCE &&
       cs->ib[IB_MAIN].ip_type != AMDGPU_HW_IP_UVD_ENC &&
       cs->ib[IB_MAIN].ip_type != AMDGPU_HW_IP_VCN_DEC &&
-  cs->ib[IB_MAIN].ip_type != AMDGPU_HW_IP_VCN_ENC;
+  cs->ib[IB_MAIN].ip_type != AMDGPU_HW_IP_VCN_ENC &&
+  cs->ib[IB_MAIN].ip_type != AMDGPU_HW_IP_VCN_JPEG;
     }

     static bool amdgpu_cs_has_chaining(struct amdgpu_cs *cs)


Re: [Mesa-dev] [PATCH] winsys/amdgpu: add VCN JPEG to no user fence group

2019-05-08 Thread Christian König

Am 08.05.19 um 14:56 schrieb Liu, Leo:

There is no user fence for JPEG; the bug triggers the
kernel WARN_ON(flags & AMDGPU_FENCE_FLAG_64BIT).


Oh, we are probably going to need to check for this in the kernel as well.

Currently we only check for UVD and VCE there, do you want to take a 
look Leo or should I do this?


Christian.



Signed-off-by: Leo Liu 
Cc: mesa-sta...@lists.freedesktop.org
---
  src/gallium/winsys/amdgpu/drm/amdgpu_cs.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/gallium/winsys/amdgpu/drm/amdgpu_cs.c 
b/src/gallium/winsys/amdgpu/drm/amdgpu_cs.c
index 4a2377f7e09..972030eaaa8 100644
--- a/src/gallium/winsys/amdgpu/drm/amdgpu_cs.c
+++ b/src/gallium/winsys/amdgpu/drm/amdgpu_cs.c
@@ -378,7 +378,8 @@ static bool amdgpu_cs_has_user_fence(struct 
amdgpu_cs_context *cs)
cs->ib[IB_MAIN].ip_type != AMDGPU_HW_IP_VCE &&
cs->ib[IB_MAIN].ip_type != AMDGPU_HW_IP_UVD_ENC &&
cs->ib[IB_MAIN].ip_type != AMDGPU_HW_IP_VCN_DEC &&
-  cs->ib[IB_MAIN].ip_type != AMDGPU_HW_IP_VCN_ENC;
+  cs->ib[IB_MAIN].ip_type != AMDGPU_HW_IP_VCN_ENC &&
+  cs->ib[IB_MAIN].ip_type != AMDGPU_HW_IP_VCN_JPEG;
  }
  
  static bool amdgpu_cs_has_chaining(struct amdgpu_cs *cs)



Re: [Mesa-dev] [PATCH] winsys/amdgpu: Restrict allocation to GTT for small vram size

2019-04-26 Thread Christian König

Am 25.04.19 um 13:37 schrieb Agrawal, Akshu:

To avoid evictions, use GTT only for allocation on devices with
small vram size.

Signed-off-by: Akshu Agrawal 
---
  src/gallium/winsys/amdgpu/drm/amdgpu_bo.c | 9 -
  1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c 
b/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c
index 09cf9247755..aab801b6337 100644
--- a/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c
+++ b/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c
@@ -486,8 +486,15 @@ static struct amdgpu_winsys_bo *amdgpu_create_bo(struct 
amdgpu_winsys *ws,
 * shared with the OS, allow VRAM placements too. The idea is not to use
 * VRAM usefully, but to use it so that it's not unused and wasted.
 */
-  if (!ws->info.has_dedicated_vram)
+   if (!ws->info.has_dedicated_vram) {
+  /* For devices having small VRAM size use GTT only to
+   * avoid evictions.
+   */
+  if (ws->info.vram_size <= 16777216)
+ request.preferred_heap = AMDGPU_GEM_DOMAIN_GTT;


Well that will certainly cause problems because it would result in 
scanout BOs being forced into GTT.


Christian.


+  else
   request.preferred_heap |= AMDGPU_GEM_DOMAIN_GTT;
+  }
 }
  
 if (initial_domain & RADEON_DOMAIN_GTT)



Re: [Mesa-dev] [PATCH v2 1/7] gallium/auxiliary/vl: Move dirty define to header file

2019-02-07 Thread Christian König

Am 07.02.19 um 15:21 schrieb James Zhu:

On 2019-02-07 4:49 a.m., Christian König wrote:

Patches #1, #2, #5, #7  are Reviewed-by: Christian König


Patch #3: the csc_matrix needs a better name since we now store more and
more additional info in there, but that can as well be a follow-up patch.

csc_matrix is used by the upper stack. Let me figure out how to use a 2nd
constant buffer to hold the additional info.


Actually you don't need to add a second one.

We should just rename the variable because there is now more in the 
buffer than the csc matrix.


Something like shader_params or something similar.

Christian.



James


Patch #4 is Acked-by: Christian König 

Patch #6 I think there was a simpler option for this.

And when the compute shaders reach the same level of functionality as
the GFX shaders we should make this the default, depending on the
hardware capabilities.

Sure.

James


Christian.

Am 06.02.19 um 20:44 schrieb Zhu, James:

Move dirty define to header file to share with compute shader.

Signed-off-by: James Zhu 
---
   src/gallium/auxiliary/vl/vl_compositor.c | 15 ++-
   src/gallium/auxiliary/vl/vl_compositor.h |  2 ++
   2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/src/gallium/auxiliary/vl/vl_compositor.c
b/src/gallium/auxiliary/vl/vl_compositor.c
index 159a295..41f9e5e 100644
--- a/src/gallium/auxiliary/vl/vl_compositor.c
+++ b/src/gallium/auxiliary/vl/vl_compositor.c
@@ -42,9 +42,6 @@
   #include "vl_types.h"
   #include "vl_compositor.h"
   -#define MIN_DIRTY (0)
-#define MAX_DIRTY (1 << 15)
-
   enum VS_OUTPUT
   {
  VS_O_VPOS = 0,
@@ -899,8 +896,8 @@ gen_vertex_data(struct vl_compositor *c, struct
vl_compositor_state *s, struct u
    dirty->y1 <= drawn.y1) {
    // We clear the dirty area anyway, no need for
clear_render_target
-   dirty->x0 = dirty->y0 = MAX_DIRTY;
-   dirty->x1 = dirty->y1 = MIN_DIRTY;
+   dirty->x0 = dirty->y0 = VL_COMPOSITOR_MAX_DIRTY;
+   dirty->x1 = dirty->y1 = VL_COMPOSITOR_MIN_DIRTY;
   }
    }
     }
@@ -1030,8 +1027,8 @@ vl_compositor_reset_dirty_area(struct u_rect
*dirty)
   {
  assert(dirty);
   -   dirty->x0 = dirty->y0 = MIN_DIRTY;
-   dirty->x1 = dirty->y1 = MAX_DIRTY;
+   dirty->x0 = dirty->y0 = VL_COMPOSITOR_MIN_DIRTY;
+   dirty->x1 = dirty->y1 = VL_COMPOSITOR_MAX_DIRTY;
   }
     void
@@ -1378,8 +1375,8 @@ vl_compositor_render(struct vl_compositor_state
*s,
       c->pipe->clear_render_target(c->pipe, dst_surface,
&s->clear_color,
  0, 0, dst_surface->width,
dst_surface->height, false);
-  dirty_area->x0 = dirty_area->y0 = MAX_DIRTY;
-  dirty_area->x1 = dirty_area->y1 = MIN_DIRTY;
+  dirty_area->x0 = dirty_area->y0 = VL_COMPOSITOR_MAX_DIRTY;
+  dirty_area->x1 = dirty_area->y1 = VL_COMPOSITOR_MIN_DIRTY;
  }
c->pipe->set_framebuffer_state(c->pipe, &c->fb_state);
diff --git a/src/gallium/auxiliary/vl/vl_compositor.h
b/src/gallium/auxiliary/vl/vl_compositor.h
index 8819176..aa843c3 100644
--- a/src/gallium/auxiliary/vl/vl_compositor.h
+++ b/src/gallium/auxiliary/vl/vl_compositor.h
@@ -44,6 +44,8 @@ struct pipe_context;
    */
     #define VL_COMPOSITOR_MAX_LAYERS 16
+#define VL_COMPOSITOR_MIN_DIRTY (0)
+#define VL_COMPOSITOR_MAX_DIRTY (1 << 15)
     /* deinterlace allgorithem */
   enum vl_compositor_deinterlace



Re: [Mesa-dev] [PATCH v2 1/7] gallium/auxiliary/vl: Move dirty define to header file

2019-02-07 Thread Christian König
Patches #1, #2, #5, #7  are Reviewed-by: Christian König 



Patch #3: the csc_matrix needs a better name since we now store more and
more additional info in there, but that can as well be a follow-up patch.


Patch #4 is Acked-by: Christian König 

Patch #6 I think there was a simpler option for this.

And when the compute shaders reach the same level of functionality as 
the GFX shaders we should make this the default, depending on the 
hardware capabilities.


Christian.

Am 06.02.19 um 20:44 schrieb Zhu, James:

Move dirty define to header file to share with compute shader.

Signed-off-by: James Zhu 
---
  src/gallium/auxiliary/vl/vl_compositor.c | 15 ++-
  src/gallium/auxiliary/vl/vl_compositor.h |  2 ++
  2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/src/gallium/auxiliary/vl/vl_compositor.c 
b/src/gallium/auxiliary/vl/vl_compositor.c
index 159a295..41f9e5e 100644
--- a/src/gallium/auxiliary/vl/vl_compositor.c
+++ b/src/gallium/auxiliary/vl/vl_compositor.c
@@ -42,9 +42,6 @@
  #include "vl_types.h"
  #include "vl_compositor.h"
  
-#define MIN_DIRTY (0)

-#define MAX_DIRTY (1 << 15)
-
  enum VS_OUTPUT
  {
 VS_O_VPOS = 0,
@@ -899,8 +896,8 @@ gen_vertex_data(struct vl_compositor *c, struct 
vl_compositor_state *s, struct u
   dirty->y1 <= drawn.y1) {
  
 // We clear the dirty area anyway, no need for clear_render_target

-   dirty->x0 = dirty->y0 = MAX_DIRTY;
-   dirty->x1 = dirty->y1 = MIN_DIRTY;
+   dirty->x0 = dirty->y0 = VL_COMPOSITOR_MAX_DIRTY;
+   dirty->x1 = dirty->y1 = VL_COMPOSITOR_MIN_DIRTY;
  }
   }
}
@@ -1030,8 +1027,8 @@ vl_compositor_reset_dirty_area(struct u_rect *dirty)
  {
 assert(dirty);
  
-   dirty->x0 = dirty->y0 = MIN_DIRTY;

-   dirty->x1 = dirty->y1 = MAX_DIRTY;
+   dirty->x0 = dirty->y0 = VL_COMPOSITOR_MIN_DIRTY;
+   dirty->x1 = dirty->y1 = VL_COMPOSITOR_MAX_DIRTY;
  }
  
  void

@@ -1378,8 +1375,8 @@ vl_compositor_render(struct vl_compositor_state *s,
  
c->pipe->clear_render_target(c->pipe, dst_surface, &s->clear_color,

 0, 0, dst_surface->width, 
dst_surface->height, false);
-  dirty_area->x0 = dirty_area->y0 = MAX_DIRTY;
-  dirty_area->x1 = dirty_area->y1 = MIN_DIRTY;
+  dirty_area->x0 = dirty_area->y0 = VL_COMPOSITOR_MAX_DIRTY;
+  dirty_area->x1 = dirty_area->y1 = VL_COMPOSITOR_MIN_DIRTY;
 }
  
c->pipe->set_framebuffer_state(c->pipe, &c->fb_state);

diff --git a/src/gallium/auxiliary/vl/vl_compositor.h 
b/src/gallium/auxiliary/vl/vl_compositor.h
index 8819176..aa843c3 100644
--- a/src/gallium/auxiliary/vl/vl_compositor.h
+++ b/src/gallium/auxiliary/vl/vl_compositor.h
@@ -44,6 +44,8 @@ struct pipe_context;
   */
  
  #define VL_COMPOSITOR_MAX_LAYERS 16

+#define VL_COMPOSITOR_MIN_DIRTY (0)
+#define VL_COMPOSITOR_MAX_DIRTY (1 << 15)
  
  /* deinterlace allgorithem */

  enum vl_compositor_deinterlace




Re: [Mesa-dev] [PATCH 3/6] gallium\auxiliary\vl: Add compute shader to support video compositor render

2019-02-04 Thread Christian König

Am 04.02.19 um 20:12 schrieb James Zhu:

On 2019-02-04 1:47 p.m., Liu, Leo wrote:

On 2/1/19 11:28 AM, Zhu, James wrote:

Add compute shader to support video compositor render.

Signed-off-by: James Zhu 
---
src/gallium/auxiliary/Makefile.sources  |   2 +
src/gallium/auxiliary/meson.build   |   2 +
src/gallium/auxiliary/vl/vl_compositor_cs.c | 414 

src/gallium/auxiliary/vl/vl_compositor_cs.h |  56 
4 files changed, 474 insertions(+)
create mode 100644 src/gallium/auxiliary/vl/vl_compositor_cs.c
create mode 100644 src/gallium/auxiliary/vl/vl_compositor_cs.h

diff --git a/src/gallium/auxiliary/Makefile.sources 
b/src/gallium/auxiliary/Makefile.sources
index 50e8808..df000f6 100644
--- a/src/gallium/auxiliary/Makefile.sources
+++ b/src/gallium/auxiliary/Makefile.sources
@@ -348,6 +348,8 @@ VL_SOURCES := \
vl/vl_bicubic_filter.h \
vl/vl_compositor.c \
vl/vl_compositor.h \
+   vl/vl_compositor_cs.c \
+   vl/vl_compositor_cs.h \
vl/vl_csc.c \
vl/vl_csc.h \
vl/vl_decoder.c \
diff --git a/src/gallium/auxiliary/meson.build 
b/src/gallium/auxiliary/meson.build
index 57f7e69..74e4b48 100644
--- a/src/gallium/auxiliary/meson.build
+++ b/src/gallium/auxiliary/meson.build
@@ -445,6 +445,8 @@ files_libgalliumvl = files(
  'vl/vl_bicubic_filter.h',
  'vl/vl_compositor.c',
  'vl/vl_compositor.h',
+  'vl/vl_compositor_cs.c',
+  'vl/vl_compositor_cs.h',
  'vl/vl_csc.c',
  'vl/vl_csc.h',
  'vl/vl_decoder.c',
diff --git a/src/gallium/auxiliary/vl/vl_compositor_cs.c 
b/src/gallium/auxiliary/vl/vl_compositor_cs.c
new file mode 100644
index 000..3cd1a76
--- /dev/null
+++ b/src/gallium/auxiliary/vl/vl_compositor_cs.c
@@ -0,0 +1,414 @@
+/**
+ *
+ * Copyright 2019 Advanced Micro Devices, Inc.
+ * All Rights Reserved.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the
+ * "Software"), to deal in the Software without restriction, including
+ * without limitation the rights to use, copy, modify, merge, publish,
+ * distribute, sub license, and/or sell copies of the Software, and to
+ * permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the
+ * next paragraph) shall be included in all copies or substantial portions
+ * of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+ * OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
+ * IN NO EVENT SHALL VMWARE AND/OR ITS SUPPLIERS BE LIABLE FOR
+ * ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+ * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+ * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+ *
+ * Authors: James Zhu 
+ *
+ **/
+
+#include 
+
+#include "tgsi/tgsi_text.h"
+#include "vl_compositor_cs.h"
+
+struct cs_viewport {
+   float scale_x;
+   float scale_y;
+   int translate_x;
+   int translate_y;
+   struct u_rect area;
+};
+
+char *compute_shader_video_buffer =
+  "COMP\n"
+  "PROPERTY CS_FIXED_BLOCK_WIDTH 8\n"
+  "PROPERTY CS_FIXED_BLOCK_HEIGHT 8\n"
+  "PROPERTY CS_FIXED_BLOCK_DEPTH 1\n"
+
+  "DCL SV[0], THREAD_ID\n"
+  "DCL SV[1], BLOCK_ID\n"
+
+  "DCL CONST[0..5]\n"
+  "DCL SVIEW[0..2], RECT, FLOAT\n"
+  "DCL SAMP[0..2]\n"
+
+  "DCL IMAGE[0], 2D, WR\n"
+  "DCL TEMP[0..7]\n"
+
+  "IMM[0] UINT32 { 8, 8, 1, 0}\n"
+  "IMM[1] FLT32 { 1.0, 2.0, 0.0, 0.0}\n"
+
+  "UMAD TEMP[0], SV[1], IMM[0], SV[0]\n"
+
+  /* Drawn area check */
+  "USGE TEMP[1].xy, TEMP[0].xyxy, CONST[4].xyxy\n"
+  "USLT TEMP[1].zw, TEMP[0].xyxy, CONST[4].zwzw\n"
+  "AND TEMP[1].x, TEMP[1]., TEMP[1].\n"
+  "AND TEMP[1].x, TEMP[1]., TEMP[1].\n"
+  "AND TEMP[1].x, TEMP[1]., TEMP[1].\n"
+
+  "UIF TEMP[1]\n"
+ /* Translate */
+ "UADD TEMP[2].xy, TEMP[0], -CONST[5].xyxy\n"
+ "U2F TEMP[2], TEMP[2]\n"
+ "DIV TEMP[3], TEMP[2], IMM[1].\n"
+
+ /* Scale */
+ "DIV TEMP[2], TEMP[2], CONST[3].zwzw\n"
+ "DIV TEMP[3], TEMP[3], CONST[3].zwzw\n"
+
+ /* Fetch texels */
+ "TEX_LZ TEMP[4].x, TEMP[2], SAMP[0], RECT\n"
+ "TEX_LZ TEMP[4].y, TEMP[3], SAMP[1], RECT\n"
+ "TEX_LZ TEMP[4].z, TEMP[3], SAMP[2], RECT\n"
+
+ "MOV TEMP[4].w, IMM[1].\n"
+
+ /* Color Space Conversion */
+ "DP4 TEMP[7].x, CONST[0], TEMP[4]\n"
+ "DP4 TEMP[7].y, CONST[1], TEMP[4]\n"
+ "DP4 TEMP[7].z, CONST[2], TEMP[4]\n"
+
+ "MOV 

Re: [Mesa-dev] [PATCH 6/6] gallium\auxiliary\vl: Add video compute shader render

2019-02-01 Thread Christian König

Am 01.02.19 um 17:28 schrieb Zhu, James:

Add video compute shader render. Set export CS_COMPOSITOR_RENDER=true
to enable the video compute shader render.


Ok that actually makes more sense, but I would either put everything 
into one file or cleanly separate between gfx and compute implementation.


Christian.



Signed-off-by: James Zhu 
---
  src/gallium/auxiliary/vl/vl_compositor.c | 19 +--
  1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/src/gallium/auxiliary/vl/vl_compositor.c 
b/src/gallium/auxiliary/vl/vl_compositor.c
index 7ee8402..66a8fc9 100644
--- a/src/gallium/auxiliary/vl/vl_compositor.c
+++ b/src/gallium/auxiliary/vl/vl_compositor.c
@@ -1376,8 +1376,8 @@ vl_compositor_convert_rgb_to_yuv(struct 
vl_compositor_state *s,
 s->pipe->flush(s->pipe, NULL, 0);
  }
  
-void

-vl_compositor_render(struct vl_compositor_state *s,
+static void
+vl_compositor_gfx_render(struct vl_compositor_state *s,
   struct vl_compositor   *c,
   struct pipe_surface*dst_surface,
   struct u_rect  *dirty_area,
@@ -1419,6 +1419,21 @@ vl_compositor_render(struct vl_compositor_state *s,
 draw_layers(c, s, dirty_area);
  }
  
+void

+vl_compositor_render(struct vl_compositor_state *s,
+ struct vl_compositor   *c,
+ struct pipe_surface*dst_surface,
+ struct u_rect  *dirty_area,
+ boolclear_dirty)
+{
+   assert(s);
+
+   if (cs_compositor_render_enable && s->layers->cs)
+  vl_compositor_cs_render(s, c, dst_surface, dirty_area, clear_dirty);
+   else
+  vl_compositor_gfx_render(s, c, dst_surface, dirty_area, clear_dirty);
+}
+
  bool
  vl_compositor_init(struct vl_compositor *c, struct pipe_context *pipe)
  {




Re: [Mesa-dev] [PATCH 3/6] gallium\auxiliary\vl: Add compute shader to support video compositor render

2019-02-01 Thread Christian König

Am 01.02.19 um 17:28 schrieb Zhu, James:

Add compute shader to support video compositor render.


I don't think that this is actually a good approach.

It adds a second implementation of the compositor instead of adapting 
the original one to use compute shaders when available.


Christian.



Signed-off-by: James Zhu 
---
  src/gallium/auxiliary/Makefile.sources  |   2 +
  src/gallium/auxiliary/meson.build   |   2 +
  src/gallium/auxiliary/vl/vl_compositor_cs.c | 414 
  src/gallium/auxiliary/vl/vl_compositor_cs.h |  56 
  4 files changed, 474 insertions(+)
  create mode 100644 src/gallium/auxiliary/vl/vl_compositor_cs.c
  create mode 100644 src/gallium/auxiliary/vl/vl_compositor_cs.h

diff --git a/src/gallium/auxiliary/Makefile.sources 
b/src/gallium/auxiliary/Makefile.sources
index 50e8808..df000f6 100644
--- a/src/gallium/auxiliary/Makefile.sources
+++ b/src/gallium/auxiliary/Makefile.sources
@@ -348,6 +348,8 @@ VL_SOURCES := \
vl/vl_bicubic_filter.h \
vl/vl_compositor.c \
vl/vl_compositor.h \
+   vl/vl_compositor_cs.c \
+   vl/vl_compositor_cs.h \
vl/vl_csc.c \
vl/vl_csc.h \
vl/vl_decoder.c \
diff --git a/src/gallium/auxiliary/meson.build 
b/src/gallium/auxiliary/meson.build
index 57f7e69..74e4b48 100644
--- a/src/gallium/auxiliary/meson.build
+++ b/src/gallium/auxiliary/meson.build
@@ -445,6 +445,8 @@ files_libgalliumvl = files(
'vl/vl_bicubic_filter.h',
'vl/vl_compositor.c',
'vl/vl_compositor.h',
+  'vl/vl_compositor_cs.c',
+  'vl/vl_compositor_cs.h',
'vl/vl_csc.c',
'vl/vl_csc.h',
'vl/vl_decoder.c',
diff --git a/src/gallium/auxiliary/vl/vl_compositor_cs.c 
b/src/gallium/auxiliary/vl/vl_compositor_cs.c
new file mode 100644
index 000..3cd1a76
--- /dev/null
+++ b/src/gallium/auxiliary/vl/vl_compositor_cs.c
@@ -0,0 +1,414 @@
+/**
+ *
+ * Copyright 2019 Advanced Micro Devices, Inc.
+ * All Rights Reserved.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the
+ * "Software"), to deal in the Software without restriction, including
+ * without limitation the rights to use, copy, modify, merge, publish,
+ * distribute, sub license, and/or sell copies of the Software, and to
+ * permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the
+ * next paragraph) shall be included in all copies or substantial portions
+ * of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+ * OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
+ * IN NO EVENT SHALL VMWARE AND/OR ITS SUPPLIERS BE LIABLE FOR
+ * ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+ * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+ * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+ *
+ * Authors: James Zhu 
+ *
+ **/
+
+#include 
+
+#include "tgsi/tgsi_text.h"
+#include "vl_compositor_cs.h"
+
+struct cs_viewport {
+   float scale_x;
+   float scale_y;
+   int translate_x;
+   int translate_y;
+   struct u_rect area;
+};
+
+char *compute_shader_video_buffer =
+  "COMP\n"
+  "PROPERTY CS_FIXED_BLOCK_WIDTH 8\n"
+  "PROPERTY CS_FIXED_BLOCK_HEIGHT 8\n"
+  "PROPERTY CS_FIXED_BLOCK_DEPTH 1\n"
+
+  "DCL SV[0], THREAD_ID\n"
+  "DCL SV[1], BLOCK_ID\n"
+
+  "DCL CONST[0..5]\n"
+  "DCL SVIEW[0..2], RECT, FLOAT\n"
+  "DCL SAMP[0..2]\n"
+
+  "DCL IMAGE[0], 2D, WR\n"
+  "DCL TEMP[0..7]\n"
+
+  "IMM[0] UINT32 { 8, 8, 1, 0}\n"
+  "IMM[1] FLT32 { 1.0, 2.0, 0.0, 0.0}\n"
+
+  "UMAD TEMP[0], SV[1], IMM[0], SV[0]\n"
+
+  /* Drawn area check */
+  "USGE TEMP[1].xy, TEMP[0].xyxy, CONST[4].xyxy\n"
+  "USLT TEMP[1].zw, TEMP[0].xyxy, CONST[4].zwzw\n"
+  "AND TEMP[1].x, TEMP[1]., TEMP[1].\n"
+  "AND TEMP[1].x, TEMP[1]., TEMP[1].\n"
+  "AND TEMP[1].x, TEMP[1]., TEMP[1].\n"
+
+  "UIF TEMP[1]\n"
+ /* Translate */
+ "UADD TEMP[2].xy, TEMP[0], -CONST[5].xyxy\n"
+ "U2F TEMP[2], TEMP[2]\n"
+ "DIV TEMP[3], TEMP[2], IMM[1].\n"
+
+ /* Scale */
+ "DIV TEMP[2], TEMP[2], CONST[3].zwzw\n"
+ "DIV TEMP[3], TEMP[3], CONST[3].zwzw\n"
+
+ /* Fetch texels */
+ "TEX_LZ TEMP[4].x, TEMP[2], SAMP[0], RECT\n"
+ "TEX_LZ TEMP[4].y, TEMP[3], SAMP[1], RECT\n"
+ "TEX_LZ TEMP[4].z, TEMP[3], SAMP[2], RECT\n"
+
+ "MOV TEMP[4].w, IMM[1].\n"
+
+ /* Color Space Conversion */
+ "DP4 TEMP[7].x, CONST[0], TEMP[4]\n"
+ "DP4 TEMP[7].y, 

Re: [Mesa-dev] [PATCH 1/6] gallium\auxiliary\vl: Move dirty define to header file

2019-02-01 Thread Christian König

Am 01.02.19 um 17:28 schrieb Zhu, James:

Move dirty define to header file to share with compute shader.

Signed-off-by: James Zhu 
---
  src/gallium/auxiliary/vl/vl_compositor.c | 3 ---
  src/gallium/auxiliary/vl/vl_compositor.h | 2 ++
  2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/src/gallium/auxiliary/vl/vl_compositor.c 
b/src/gallium/auxiliary/vl/vl_compositor.c
index 159a295..2c6d585 100644
--- a/src/gallium/auxiliary/vl/vl_compositor.c
+++ b/src/gallium/auxiliary/vl/vl_compositor.c
@@ -42,9 +42,6 @@
  #include "vl_types.h"
  #include "vl_compositor.h"
  
-#define MIN_DIRTY (0)

-#define MAX_DIRTY (1 << 15)
-
  enum VS_OUTPUT
  {
 VS_O_VPOS = 0,
diff --git a/src/gallium/auxiliary/vl/vl_compositor.h 
b/src/gallium/auxiliary/vl/vl_compositor.h
index 8819176..d51b5f5 100644
--- a/src/gallium/auxiliary/vl/vl_compositor.h
+++ b/src/gallium/auxiliary/vl/vl_compositor.h
@@ -44,6 +44,8 @@ struct pipe_context;
   */
  
  #define VL_COMPOSITOR_MAX_LAYERS 16

+#define MIN_DIRTY (0)
+#define MAX_DIRTY (1 << 15)


That needs a proper prefix.

E.g. put VL_COMPOSITOR_ in front of the name and rename all usages.

Christian.

  
  /* deinterlace allgorithem */

  enum vl_compositor_deinterlace




Re: [Mesa-dev] [PATCH 1/3] radeonsi: allow si_cp_dma_clear_buffer to clear GDS from any IB

2018-11-28 Thread Christian König

Are those committed yet? They don't seem to apply cleanly on master.

Christian.

Am 27.11.18 um 02:56 schrieb Marek Olšák:

From: Marek Olšák 

---
  .../drivers/radeonsi/si_compute_blit.c|  4 +-
  src/gallium/drivers/radeonsi/si_cp_dma.c  | 49 ++-
  src/gallium/drivers/radeonsi/si_pipe.h|  8 +--
  .../drivers/radeonsi/si_test_dma_perf.c   |  3 +-
  4 files changed, 33 insertions(+), 31 deletions(-)

diff --git a/src/gallium/drivers/radeonsi/si_compute_blit.c 
b/src/gallium/drivers/radeonsi/si_compute_blit.c
index 20e4f591fbb..086793637f0 100644
--- a/src/gallium/drivers/radeonsi/si_compute_blit.c
+++ b/src/gallium/drivers/radeonsi/si_compute_blit.c
@@ -212,22 +212,22 @@ void si_clear_buffer(struct si_context *sctx, struct 
pipe_resource *dst,
 */
if (clear_value_size > 4 ||
(clear_value_size == 4 &&
 offset % 4 == 0 &&
 (size > 32*1024 || sctx->chip_class <= VI))) {
si_compute_do_clear_or_copy(sctx, dst, offset, NULL, 0,
aligned_size, clear_value,
clear_value_size, coher);
} else {
assert(clear_value_size == 4);
-   si_cp_dma_clear_buffer(sctx, dst, offset,
-  aligned_size, *clear_value, 
coher,
+   si_cp_dma_clear_buffer(sctx, sctx->gfx_cs, dst, offset,
+  aligned_size, *clear_value, 0, 
coher,
   get_cache_policy(sctx, coher, 
size));
}
  
  		offset += aligned_size;

size -= aligned_size;
}
  
  	/* Handle non-dword alignment. */

if (size) {
assert(dst);
diff --git a/src/gallium/drivers/radeonsi/si_cp_dma.c 
b/src/gallium/drivers/radeonsi/si_cp_dma.c
index 839b31b7fdf..33220d9f0fa 100644
--- a/src/gallium/drivers/radeonsi/si_cp_dma.c
+++ b/src/gallium/drivers/radeonsi/si_cp_dma.c
@@ -47,25 +47,24 @@ static inline unsigned cp_dma_max_byte_count(struct 
si_context *sctx)
  
  	/* make it aligned for optimal performance */

return max & ~(SI_CPDMA_ALIGNMENT - 1);
  }
  
  
  /* Emit a CP DMA packet to do a copy from one buffer to another, or to clear

   * a buffer. The size must fit in bits [20:0]. If CP_DMA_CLEAR is set, src_va 
is a 32-bit
   * clear value.
   */
-static void si_emit_cp_dma(struct si_context *sctx, uint64_t dst_va,
-  uint64_t src_va, unsigned size, unsigned flags,
-  enum si_cache_policy cache_policy)
+static void si_emit_cp_dma(struct si_context *sctx, struct radeon_cmdbuf *cs,
+  uint64_t dst_va, uint64_t src_va, unsigned size,
+  unsigned flags, enum si_cache_policy cache_policy)
  {
-   struct radeon_cmdbuf *cs = sctx->gfx_cs;
uint32_t header = 0, command = 0;
  
  	assert(size <= cp_dma_max_byte_count(sctx));

assert(sctx->chip_class != SI || cache_policy == L2_BYPASS);
  
  	if (sctx->chip_class >= GFX9)

command |= S_414_BYTE_COUNT_GFX9(size);
else
command |= S_414_BYTE_COUNT_GFX6(size);
  
@@ -139,21 +138,21 @@ static void si_emit_cp_dma(struct si_context *sctx, uint64_t dst_va,

  }
  
  void si_cp_dma_wait_for_idle(struct si_context *sctx)

  {
/* Issue a dummy DMA that copies zero bytes.
 *
 * The DMA engine will see that there's no work to do and skip this
 * DMA request, however, the CP will see the sync flag and still wait
 * for all DMAs to complete.
 */
-   si_emit_cp_dma(sctx, 0, 0, 0, CP_DMA_SYNC, L2_BYPASS);
+   si_emit_cp_dma(sctx, sctx->gfx_cs, 0, 0, 0, CP_DMA_SYNC, L2_BYPASS);
  }
  
  static void si_cp_dma_prepare(struct si_context *sctx, struct pipe_resource *dst,

  struct pipe_resource *src, unsigned byte_count,
  uint64_t remaining_size, unsigned user_flags,
  enum si_coherency coher, bool *is_first,
  unsigned *packet_flags)
  {
/* Fast exit for a CPDMA prefetch. */
if ((user_flags & SI_CPDMA_SKIP_ALL) == SI_CPDMA_SKIP_ALL) {
@@ -200,51 +199,53 @@ static void si_cp_dma_prepare(struct si_context *sctx, 
struct pipe_resource *dst
 */
if (!(user_flags & SI_CPDMA_SKIP_SYNC_AFTER) &&
byte_count == remaining_size) {
*packet_flags |= CP_DMA_SYNC;
  
  		if (coher == SI_COHERENCY_SHADER)

*packet_flags |= CP_DMA_PFP_SYNC_ME;
}
  }
  
-void si_cp_dma_clear_buffer(struct si_context *sctx, struct pipe_resource *dst,

-   uint64_t offset, uint64_t size, unsigned value,
-   

Re: [Mesa-dev] [PATCH 3/3] winsys/amdgpu: use optimal VM alignment for CPU allocations

2018-11-27 Thread Christian König

On 27.11.18 at 00:02, Marek Olšák wrote:

From: Marek Olšák 

---
  src/gallium/winsys/amdgpu/drm/amdgpu_bo.c | 6 --
  1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c 
b/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c
index 7b239695872..a9170a2bc69 100644
--- a/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c
+++ b/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c
@@ -1541,22 +1541,24 @@ static struct pb_buffer *amdgpu_bo_from_ptr(struct 
radeon_winsys *rws,
  
  bo = CALLOC_STRUCT(amdgpu_winsys_bo);

  if (!bo)
  return NULL;
  
  if (amdgpu_create_bo_from_user_mem(ws->dev, pointer,

  aligned_size, &buf_handle))
  goto error;
  
  if (amdgpu_va_range_alloc(ws->dev, amdgpu_gpu_va_range_general,

-  aligned_size, 1 << 12, 0, &va, &va_handle,
- AMDGPU_VA_RANGE_HIGH))
+  aligned_size,
+  amdgpu_get_optimal_vm_alignment(ws, aligned_size,
+  
ws->info.gart_page_size),
+  0, &va, &va_handle, AMDGPU_VA_RANGE_HIGH))


For userptrs the VA alignment is most likely irrelevant because they are 
composed of 4k pages anyway. On the other hand it shouldn't hurt to 
handle them the same way.


Feel free to add an Acked-by: Christian König  
to the series.


Christian.


  goto error_va_alloc;
  
  if (amdgpu_bo_va_op(buf_handle, 0, aligned_size, va, 0, AMDGPU_VA_OP_MAP))

  goto error_va_map;
  
  /* Initialize it. */

  pipe_reference_init(&bo->base.reference, 1);
  bo->bo = buf_handle;
  bo->base.alignment = 0;
  bo->base.size = size;
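
To illustrate the point being acked, here is a guessed sketch of an "optimal VM alignment" policy (this is NOT the real amdgpu_get_optimal_vm_alignment() from the winsys; the growth rule and the 64 KiB cap are assumptions for illustration only):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative guess: start at the GART page size, grow the alignment
 * to a larger power of two while the buffer is still big enough to
 * cover it, capped at an assumed 64 KiB fragment size. */
static uint64_t optimal_vm_alignment(uint64_t size, uint64_t gart_page_size)
{
   uint64_t alignment = gart_page_size;

   while (alignment * 2 <= size && alignment < 64 * 1024)
      alignment *= 2;
   return alignment;
}
```

As noted in the reply, for userptrs (composed of 4k pages) the result hardly matters, but applying the same policy keeps the code paths uniform.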




Re: [Mesa-dev] [PATCH 0/7] winsys/amdgpu: slab allocators and tweaks for address translation

2018-11-24 Thread Christian König

Patch #5 and #6 are Reviewed-by: Christian König 

For patch #7 I think we really need some testing to see if that gives us an 
improvement. As you noted as well, having buffers which are slightly 
smaller than a power of two is rather unlikely.


Christian.

On 24.11.18 at 00:40, Marek Olšák wrote:

Hi,

This series changes the slab allocation to 3 slab allocators layered
on top of each other, and increases the max slab entry size to 256 KB
and the max slab size to 2 MB.

There are also tweaks for faster address translation, though we don't
know whether it helps anything.

Please review.

Thanks,
Marek
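
A toy sketch of "slab allocators layered on top of each other": each tier serves a band of entry sizes, and anything past the largest entry size falls through to a dedicated buffer. The tier boundaries below are assumptions for illustration; only the 256 KB maximum entry size is taken from the cover letter:

```c
#include <assert.h>

enum slab_tier { SLAB_SMALL, SLAB_MID, SLAB_LARGE, NO_SLAB };

/* Pick one of three layered slab allocators by entry size. */
static enum slab_tier pick_slab_tier(unsigned size)
{
   if (size <= 4 * 1024)
      return SLAB_SMALL;
   if (size <= 64 * 1024)
      return SLAB_MID;
   if (size <= 256 * 1024)   /* max slab entry size from the series */
      return SLAB_LARGE;
   return NO_SLAB;           /* dedicated allocation instead */
}
```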



Re: [Mesa-dev] [PATCH 1/3] vl: get h264 profile idc

2018-10-24 Thread Christian König

On 23.10.18 at 17:43, boyuan.zh...@amd.com wrote:

From: Boyuan Zhang 

Adding a function for converting h264 pipe video profile to profile idc

Signed-off-by: Boyuan Zhang 


Series is Acked-by: Christian König 


---
  src/gallium/auxiliary/util/u_video.h | 24 
  1 file changed, 24 insertions(+)

diff --git a/src/gallium/auxiliary/util/u_video.h 
b/src/gallium/auxiliary/util/u_video.h
index 967ebc57489..f6e93dd0387 100644
--- a/src/gallium/auxiliary/util/u_video.h
+++ b/src/gallium/auxiliary/util/u_video.h
@@ -239,6 +239,30 @@ u_get_h264_level(uint32_t width, uint32_t height, uint32_t 
*max_reference)
return 52;
  }
  
+static inline uint32_t

+u_get_h264_profile_idc(enum pipe_video_profile profile)
+{
+   switch (profile) {
+  case PIPE_VIDEO_PROFILE_MPEG4_AVC_CONSTRAINED_BASELINE:
+  case PIPE_VIDEO_PROFILE_MPEG4_AVC_BASELINE:
+ return 66;
+  case PIPE_VIDEO_PROFILE_MPEG4_AVC_MAIN:
+ return 77;
+  case PIPE_VIDEO_PROFILE_MPEG4_AVC_EXTENDED:
+ return 88;
+  case PIPE_VIDEO_PROFILE_MPEG4_AVC_HIGH:
+ return 100;
+  case PIPE_VIDEO_PROFILE_MPEG4_AVC_HIGH10:
+ return 110;
+  case PIPE_VIDEO_PROFILE_MPEG4_AVC_HIGH422:
+ return 122;
+  case PIPE_VIDEO_PROFILE_MPEG4_AVC_HIGH444:
+ return 244;
+  default:
+ return 66; //use baseline profile instead
+   }
+}
+
  #ifdef __cplusplus
  }
  #endif




Re: [Mesa-dev] [PATCH] st/va: use provided sizes and coords for vlVaGetImage

2018-10-10 Thread Christian König

Acked-by: Christian König 

On 10.10.2018 at 21:18, Ilia Mirkin wrote:

Reviewed-by: Ilia Mirkin 
On Wed, Oct 10, 2018 at 3:12 PM  wrote:

From: Boyuan Zhang 

vlVaGetImage should respect the width, height, and coordinates x and y that
passed in. Therefore, pipe_box should be created with the passed in values
instead of surface width/height.

v2: add input size check, return error when size out of bounds
v3: fix the size check for vaimage
v4: add size adjustment for x and y coordinates

Signed-off-by: Boyuan Zhang 
Cc: "18.2" 
Reviewed-by: Leo Liu 
---
  src/gallium/state_trackers/va/image.c | 31 ---
  1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/src/gallium/state_trackers/va/image.c 
b/src/gallium/state_trackers/va/image.c
index 3f892c9..807fc83 100644
--- a/src/gallium/state_trackers/va/image.c
+++ b/src/gallium/state_trackers/va/image.c
@@ -353,6 +353,23 @@ vlVaGetImage(VADriverContextP ctx, VASurfaceID surface, 
int x, int y,
return VA_STATUS_ERROR_INVALID_IMAGE;
 }

+   if (x < 0 || y < 0) {
+  mtx_unlock(&drv->mutex);
+  return VA_STATUS_ERROR_INVALID_PARAMETER;
+   }
+
+   if (x + width > surf->templat.width ||
+   y + height > surf->templat.height) {
+  mtx_unlock(&drv->mutex);
+  return VA_STATUS_ERROR_INVALID_PARAMETER;
+   }
+
+   if (width > vaimage->width ||
+   height > vaimage->height) {
+  mtx_unlock(&drv->mutex);
+  return VA_STATUS_ERROR_INVALID_PARAMETER;
+   }
+
 img_buf = handle_table_get(drv->htab, vaimage->buf);
 if (!img_buf) {
   mtx_unlock(&drv->mutex);
@@ -400,11 +417,19 @@ vlVaGetImage(VADriverContextP ctx, VASurfaceID surface, 
int x, int y,
 }

 for (i = 0; i < vaimage->num_planes; i++) {
-  unsigned width, height;
+  unsigned box_w = align(width, 2);
+  unsigned box_h = align(height, 2);
+  unsigned box_x = x & ~1;
+  unsigned box_y = y & ~1;
if (!views[i]) continue;
-  vlVaVideoSurfaceSize(surf, i, &width, &height);
+  vl_video_buffer_adjust_size(&box_w, &box_h, i,
+  surf->templat.chroma_format,
+  surf->templat.interlaced);
+  vl_video_buffer_adjust_size(&box_x, &box_y, i,
+  surf->templat.chroma_format,
+  surf->templat.interlaced);
for (j = 0; j < views[i]->texture->array_size; ++j) {
- struct pipe_box box = {0, 0, j, width, height, 1};
+ struct pipe_box box = {box_x, box_y, j, box_w, box_h, 1};
   struct pipe_transfer *transfer;
   uint8_t *map;
   map = drv->pipe->transfer_map(drv->pipe, views[i]->texture, 0,
--
2.7.4





Re: [Mesa-dev] [PATCH] st/va: use provided sizes and coords for getimage

2018-10-05 Thread Christian König
If that also fixes the problem, then yeah that makes perfect sense to me 
as well.


Christian.

On 05.10.2018 at 18:11, Ilia Mirkin wrote:

This is an improvement, but I think you need to clip the box to

1. Size of the surface
2. Size of the image

I think that there are clipping helpers available to do that (maybe
pipe_box_clip or so? I forget, check the auxiliary dir). Christian -
does that make sense to you?

Cheers,

   -ilia
On Fri, Oct 5, 2018 at 12:01 PM  wrote:

From: Boyuan Zhang 

vlVaGetImage should respect the width, height, and coordinates x and y that
passed in. Therefore, pipe_box should be created with the passed in values
instead of surface width/height.

Signed-off-by: Boyuan Zhang 
---
  src/gallium/state_trackers/va/image.c | 9 ++---
  1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/gallium/state_trackers/va/image.c 
b/src/gallium/state_trackers/va/image.c
index 3f892c9..c9f6f18 100644
--- a/src/gallium/state_trackers/va/image.c
+++ b/src/gallium/state_trackers/va/image.c
@@ -400,11 +400,14 @@ vlVaGetImage(VADriverContextP ctx, VASurfaceID surface, 
int x, int y,
 }

 for (i = 0; i < vaimage->num_planes; i++) {
-  unsigned width, height;
+  unsigned w = align(width, 2);
+  unsigned h = align(height, 2);
if (!views[i]) continue;
-  vlVaVideoSurfaceSize(surf, i, &width, &height);
+  vl_video_buffer_adjust_size(&w, &h, i,
+  surf->templat.chroma_format,
+  surf->templat.interlaced);
for (j = 0; j < views[i]->texture->array_size; ++j) {
- struct pipe_box box = {0, 0, j, width, height, 1};
+ struct pipe_box box = {x, y, j, w, h, 1};
   struct pipe_transfer *transfer;
   uint8_t *map;
   map = drv->pipe->transfer_map(drv->pipe, views[i]->texture, 0,
--
2.7.4
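
A hedged sketch of the clipping Ilia suggests: clamp the requested (x, y, width, height) region against the surface dimensions. This is an illustration only, not the pipe_box helper from the auxiliary dir (whose signature differs):

```c
#include <assert.h>
#include <stdbool.h>

/* Clip a requested region to the surface bounds; returns false when the
 * region lies entirely outside the surface. */
static bool clip_region(int *x, int *y, unsigned *w, unsigned *h,
                        unsigned surf_w, unsigned surf_h)
{
   if (*x < 0 || *y < 0 ||
       (unsigned)*x >= surf_w || (unsigned)*y >= surf_h)
      return false;               /* nothing to copy */
   if (*x + *w > surf_w)
      *w = surf_w - *x;           /* shrink width to fit */
   if (*y + *h > surf_h)
      *h = surf_h - *y;           /* shrink height to fit */
   return true;
}
```

The same clamp would be applied a second time against the VAImage dimensions, matching the two-step check (surface size, then image size) suggested above.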





Re: [Mesa-dev] [PATCH] st/va:Aligned image width and height to 16.

2018-10-03 Thread Christian König
What the heck are you talking about? As far as I can see this patch is 
about adding hw specific alignment to vlVaCreateImage which is a state 
tracker function.


Completely agree that vlVaGetImage should respect the parameters given 
instead of using the ones from the surface, but that sounds like a 
different problem.


Maybe mixing mail threads?

Regards,
Christian.

On 02.10.2018 at 21:34, Ilia Mirkin wrote:

vlVaGetImage should respect the x/y/width/height. The surface size
need not have any correlation to the image size. Someone should
double-check the docs for how that function should work, but the
current logic seems completely bogus.
On Tue, Oct 2, 2018 at 3:09 PM Koenig, Christian
 wrote:

Well that's the complete wrong place for that.

The stride of the surface is determined by addrlib. That one should handle 
aligning the parameters.

Christian.

On 02.10.2018 at 20:38, "Sharma, Deepak" wrote:
Christian, the issue we are trying to address here is that vlVaGetImage doesn't
use the width/height passed to the function. box.width is calculated from the
surface, and that will end up in a wrong stride for the dst buffer for said
resolution. So I was thinking to use the aligned width/height for vaCreateImage
as well as for the surface. But as you said that depends on the codec, so I
think we can either use width/height aligned based on the codec, or use the
passed width/height in vlVaGetImage to fix this issue.

Thanks,
Deepak

-Original Message-----
From: Christian König 
Sent: Tuesday, October 2, 2018 3:42 AM
To: Sharma, Deepak ; mesa-dev@lists.freedesktop.org
Cc: Guttula, Suresh 
Subject: Re: [Mesa-dev] [PATCH] st/va:Aligned image width and height to 16.

On 02.10.2018 at 03:47, Sharma, Deepak wrote:

From: suresh guttula 

When decoding a resolution like 40x24, the surface video buffer is always
aligned to the macroblock width/height, which is 16. But when the application
tries to get the data after decoding through vaCreateImage/vaGetImage, the
image width/height is aligned to 2, resulting in a smaller image buffer, which
causes the memory stomping issue.

Well NAK. It depends on the codec if the picture needs to be aligned to
16 or not.

For example VC-1 would create decoding errors with that.

Regards,
Christian.


Signed-off-by: suresh guttula 
---
   src/gallium/state_trackers/va/image.c | 4 ++--
   1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/gallium/state_trackers/va/image.c
b/src/gallium/state_trackers/va/image.c
index 3f892c9..2fc47b7 100644
--- a/src/gallium/state_trackers/va/image.c
+++ b/src/gallium/state_trackers/va/image.c
@@ -123,8 +123,8 @@ vlVaCreateImage(VADriverContextP ctx, VAImageFormat 
*format, int width, int heig
  img->format = *format;
  img->width = width;
  img->height = height;
-   w = align(width, 2);
-   h = align(height, 2);
+   w = align(width, 16);
+   h = align(height, 16);

  switch (format->fourcc) {
  case VA_FOURCC('N','V','1','2'):





Re: [Mesa-dev] [PATCH] st/va:Aligned image width and height to 16.

2018-10-02 Thread Christian König

On 02.10.2018 at 03:47, Sharma, Deepak wrote:

From: suresh guttula 

When decoding a resolution like 40x24, the surface video buffer is always
aligned to the macroblock width/height, which is 16. But when the application
tries to get the data after decoding through vaCreateImage/vaGetImage, the
image width/height is aligned to 2, resulting in a smaller image buffer, which
causes the memory stomping issue.


Well NAK. It depends on the codec if the picture needs to be aligned to 
16 or not.


For example VC-1 would create decoding errors with that.

Regards,
Christian.



Signed-off-by: suresh guttula 
---
  src/gallium/state_trackers/va/image.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/gallium/state_trackers/va/image.c 
b/src/gallium/state_trackers/va/image.c
index 3f892c9..2fc47b7 100644
--- a/src/gallium/state_trackers/va/image.c
+++ b/src/gallium/state_trackers/va/image.c
@@ -123,8 +123,8 @@ vlVaCreateImage(VADriverContextP ctx, VAImageFormat 
*format, int width, int heig
 img->format = *format;
 img->width = width;
 img->height = height;
-   w = align(width, 2);
-   h = align(height, 2);
+   w = align(width, 16);
+   h = align(height, 16);
  
 switch (format->fourcc) {

 case VA_FOURCC('N','V','1','2'):
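
For reference, the align() helper used in the patch rounds up to the given power-of-two alignment; here is a standalone reimplementation for illustration. With the 40x24 example from the commit message, aligning to 2 keeps 40x24 while the decoder's surface is padded to 48x32 (macroblock size 16), which is where the size mismatch comes from:

```c
#include <assert.h>

/* Round value up to the next multiple of a power-of-two alignment. */
static unsigned align_pot(unsigned value, unsigned alignment)
{
   return (value + alignment - 1) & ~(alignment - 1);
}
```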




Re: [Mesa-dev] [PATCH] vl: reorder H264 profiles

2018-09-26 Thread Christian König

On 26.09.2018 at 00:11, boyuan.zh...@amd.com wrote:

From: Boyuan Zhang 

Fix the wrong h264 profiles order. Previously, the constrained baseline was
added in between the baseline and main profiles, which broke the logic in
radeon/vce when converting from pipe_video_profile to profile_idc.


I think it would be better to use a switch/case in radeon/vce or even 
better make a helper function which converts between 
PIPE_VIDEO_PROFILE_MPEG4_AVC_* and the profile_idc from the specification.


Christian.



Signed-off-by: Boyuan Zhang 
---
  src/gallium/include/pipe/p_video_enums.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/gallium/include/pipe/p_video_enums.h 
b/src/gallium/include/pipe/p_video_enums.h
index b5b8b06228..260f47ea8a 100644
--- a/src/gallium/include/pipe/p_video_enums.h
+++ b/src/gallium/include/pipe/p_video_enums.h
@@ -55,8 +55,8 @@ enum pipe_video_profile
 PIPE_VIDEO_PROFILE_VC1_SIMPLE,
 PIPE_VIDEO_PROFILE_VC1_MAIN,
 PIPE_VIDEO_PROFILE_VC1_ADVANCED,
-   PIPE_VIDEO_PROFILE_MPEG4_AVC_BASELINE,
 PIPE_VIDEO_PROFILE_MPEG4_AVC_CONSTRAINED_BASELINE,
+   PIPE_VIDEO_PROFILE_MPEG4_AVC_BASELINE,
 PIPE_VIDEO_PROFILE_MPEG4_AVC_MAIN,
 PIPE_VIDEO_PROFILE_MPEG4_AVC_EXTENDED,
 PIPE_VIDEO_PROFILE_MPEG4_AVC_HIGH,




Re: [Mesa-dev] Request for Mentor - XorgEvoc - Piglit for VA-API

2018-09-09 Thread Christian König

On 09.09.2018 at 12:49, Naveen Naidu wrote:
On Sun 9 Sep, 2018, 4:06 PM Christian König wrote:


On 09.09.2018 at 09:55, Naveen Naidu wrote:


Thank you for the information. Yes, I have an AMD GPU, a
Radeon R5 M430. But unfortunately it does not have a video
decoder. I also have an integrated Intel HD 620 GPU; it's a
Kaby Lake and it has a video encoder and video decoder. I have
already discussed this with Alex Deucher and he said that
I will be fine with the Intel HD 620 and can carry on with the project.

If you don't mind can you please let me know if i am on the right
track?


Yeah, that should indeed work from the hardware side.

But it could be a problem for mentoring you since we are obviously
interested in improving the VA-API state tracker for AMD hardware.


Does this mean I cannot work on this project?


Well, it depends. It means that you won't test with AMD hardware. So if 
your mentor works for AMD and doesn't have the Intel stack available, it 
could get really tricky to reproduce and test things.


On the other hand you could ask if one of the Intel guys wants to mentor 
you.


Will it be okay if I write a proposal using the current hardware and, 
once the proposal gets selected, buy the required hardware?


That could work, but I'm not sure if that fits the requirements for a 
proposal from Google.



Is it possible to write a proposal without the hardware?

It would be really kind of you, if you could let me know any other 
project that is compatible with my hardware.


 I was also looking into the project "Unit Performance tests for VA-API".

Will this be a suitable fit with my hardware.


Same problem with that one.

Regards,
Christian.






Thank you very much for your time

P.S:- Sorry for the starting a new thread with the last mail. I
am new to mailing list. I will see to that I do not repeat the
same mistake.


Well starting a new thread is not much of an issue, but you could
at least remove the digest mail body :)

Regards,
Christian.



P . Naveen Naidu







Re: [Mesa-dev] Request for Mentor - XorgEvoc - Piglit for VA-API

2018-09-09 Thread Christian König

On 09.09.2018 at 09:55, Naveen Naidu wrote:


Thank you for the information. Yes, I have an AMD GPU, a Radeon R5 
M430. But unfortunately it does not have a video decoder. I also 
have an integrated Intel HD 620 GPU; it's a Kaby Lake and it has a 
video encoder and video decoder. I have already discussed this 
with Alex Deucher and he said that I will be fine with the Intel HD 620 
and can carry on with the project.


If you don't mind can you please let me know if i am on the right track?


Yeah, that should indeed work from the hardware side.

But it could be a problem for mentoring you since we are obviously 
interested in improving the VA-API state tracker for AMD hardware.




Thank you very much for your time

P.S:- Sorry for the starting a new thread with the last mail. I am new 
to mailing list. I will see to that I do not repeat the same mistake.


Well starting a new thread is not much of an issue, but you could at 
least remove the digest mail body :)


Regards,
Christian.



P . Naveen Naidu





Re: [Mesa-dev] [PATCH] ac/radeonsi: fix CIK copy max size

2018-08-29 Thread Christian König

On 29.08.2018 at 05:53, Dave Airlie wrote:

From: Dave Airlie 

While adding transfer queues to radv, I started writing some tests,
the first test I wrote fell over copying a buffer larger than this
limit.

Checked AMDVLK and found the correct limit.

Cc: 
---
  src/amd/common/sid.h | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/amd/common/sid.h b/src/amd/common/sid.h
index 0671f7d3998..edb7d06afa6 100644
--- a/src/amd/common/sid.h
+++ b/src/amd/common/sid.h
@@ -9139,7 +9139,9 @@
  #defineCIK_SDMA_PACKET_SEMAPHORE   0x7
  #defineCIK_SDMA_PACKET_CONSTANT_FILL   0xb
  #defineCIK_SDMA_PACKET_SRBM_WRITE  0xe
-#defineCIK_SDMA_COPY_MAX_SIZE  0x3fffe0
+/* There is apparently an undocumented HW "feature" that
+   prevents the HW from copying past 256 bytes of (1 << 22) */
+#defineCIK_SDMA_COPY_MAX_SIZE  0x3fff00


Well that is interesting. IIRC, the hardware documentation explicitly 
states that 0x3fffe0 is the maximum size.


That would also explain a couple of problems we see on CIK, because 
the kernel doesn't get that right either.


Christian.

  
  enum amd_cmp_class_flags {

S_NAN = 1 << 0,// Signaling NaN
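
To illustrate why the constant matters: a large SDMA copy has to be split into packets of at most CIK_SDMA_COPY_MAX_SIZE bytes each. A minimal sketch (not the actual radeonsi code) of computing the packet count under the corrected limit:

```c
#include <assert.h>
#include <stdint.h>

/* Corrected limit from the patch: the HW apparently cannot copy within
 * the last 256 bytes of the (1 << 22) window, so clamp to 0x3fff00. */
#define CIK_SDMA_COPY_MAX_SIZE 0x3fff00

/* Number of SDMA copy packets needed for a copy of the given size. */
static unsigned sdma_copy_packet_count(uint64_t size)
{
   return (unsigned)((size + CIK_SDMA_COPY_MAX_SIZE - 1) /
                     CIK_SDMA_COPY_MAX_SIZE);
}
```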




Re: [Mesa-dev] libdrm: Fix amdgpu build failure

2018-08-07 Thread Christian König

Hi Mike,

it is not the right mailing list, but thanks for the info.

I'm going to push that patch ASAP.

Christian.

On 07.08.2018 at 14:38, Mike Lothian wrote:

Hi

I'm not sure if this is the right mailing list or not but the
following patch gets things building with meson again

Signed-off-by: Mike Lothian 

diff --git a/amdgpu/meson.build b/amdgpu/meson.build
index f39d7bf6..d9d7de2d 100644
--- a/amdgpu/meson.build
+++ b/amdgpu/meson.build
@@ -26,8 +26,7 @@ libdrm_amdgpu = shared_library(
[
  files(
'amdgpu_asic_id.c', 'amdgpu_bo.c', 'amdgpu_cs.c', 'amdgpu_device.c',
-  'amdgpu_gpu_info.c', 'amdgpu_vamgr.c', 'amdgpu_vm.c', 'util_hash.c',
-  'util_hash_table.c',
+  'amdgpu_gpu_info.c', 'amdgpu_vamgr.c', 'amdgpu_vm.c', 'handle_table.c',
  ),
  config_file,
],


Re: [Mesa-dev] [PATCH] st/vdpau: allow progressive video surface with interop

2018-05-07 Thread Christian König

On 07.05.2018 at 15:03, Leo Liu wrote:



On 05/07/2018 05:10 AM, Christian König wrote:

On 02.05.2018 at 16:51, Leo Liu wrote:
mpv now interops with the video surface instead of the output surface as
previously, so it fails with "vlVdpVideoSurfaceDMABuf". This is fine for
Mesa GL, since the code path will fall back to "vlVdpVideoSurfaceGallium",
but this is not the case for others.

Signed-off-by: Leo Liu <leo@amd.com>
Cc: Christian König <christian.koe...@amd.com>
Cc: "18.1 18.0" <mesa-sta...@lists.freedesktop.org>


That won't work correctly.

The NV_VDPAU_interop extension we implement with that needs 
interlaced layout or otherwise can't correctly work with the surfaces.
I thought the same way in the beginning, but then I asked myself why 
it's working with the Mesa interop, and found it resolved with:


static struct pipe_resource *st_vdpau_video_surface_gallium(struct 
gl_context *ctx, const void *vdpSurface,

...

   samplers = buffer->get_sampler_view_planes(buffer);
   if (!samplers)
  return NULL;

   sv = samplers[index >> 1];
   if (!sv)
  return NULL;

   pipe_resource_reference(&res, sv->texture);
}

The above code gave me the hint for this patch, and the patch is 
tested okay with Mesa GL and other GL.


That is just a hack I introduced for testing which also doesn't work 
correctly. E.g. it won't sample correctly from the odd/even fields and 
instead interpolates from the whole frame. In other words you get a more 
or less correct output with that, it's just not binary identical.


But for the DMA-buf interop case that won't work, because the importer 
expects four surfaces with X height and not two with 2*X height. That you 
don't get a corrupted picture in your test is probably pure coincidence, 
because of how we share the DMA-buf tiling data in the background.


When you want to work with DMA-buf sharing the exported surface *MUST* 
be in interlaced form.


That either leaves us with copying from the progressive to the 
interlaced form during export, or change the firmware to directly decode 
everything into the interlaced format.


Regards,
Christian.
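
A sketch of the index mapping the patch introduces: an interlaced NV12 video buffer exposes four half-height surfaces (Y top/bottom, CbCr top/bottom), while a progressive one exposes two full-height planes, so the caller's surface index is halved in the progressive case. This is an illustration of the `plane >>= 1` line, not the actual state-tracker code:

```c
#include <assert.h>
#include <stdbool.h>

/* Map a VDPAU interop plane index onto the video buffer's surface list:
 * interlaced layout has one surface per field per plane, progressive
 * layout has one surface per plane, hence the halving. */
static unsigned surface_index(unsigned plane, bool interlaced)
{
   return interlaced ? plane : plane >> 1;
}
```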



Regards,
Leo




It's probably pure coincidence that you don't get a messed-up picture 
with that.


Christian.


---
  src/gallium/state_trackers/vdpau/surface.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/gallium/state_trackers/vdpau/surface.c 
b/src/gallium/state_trackers/vdpau/surface.c

index 012d303641..d63e761350 100644
--- a/src/gallium/state_trackers/vdpau/surface.c
+++ b/src/gallium/state_trackers/vdpau/surface.c
@@ -513,12 +513,15 @@ VdpStatus 
vlVdpVideoSurfaceDMABuf(VdpVideoSurface surface,

 }
   /* Check if surface match interop requirements */
-   if (p_surf->video_buffer == NULL || 
!p_surf->video_buffer->interlaced ||

+   if (p_surf->video_buffer == NULL ||
 p_surf->video_buffer->buffer_format != PIPE_FORMAT_NV12) {
    mtx_unlock(&p_surf->device->mutex);
    return VDP_STATUS_NO_IMPLEMENTATION;
 }
  +   if (!p_surf->video_buffer->interlaced)
+  plane >>= 1;
+
 surf = 
p_surf->video_buffer->get_surfaces(p_surf->video_buffer)[plane];

 if (!surf) {
    mtx_unlock(&p_surf->device->mutex);






Re: [Mesa-dev] [PATCH] st/vdpau: allow progressive video surface with interop

2018-05-07 Thread Christian König

On 02.05.2018 at 16:51, Leo Liu wrote:

mpv now interops with the video surface instead of the output surface as
previously, so it fails with "vlVdpVideoSurfaceDMABuf". This is fine for
Mesa GL, since the code path will fall back to "vlVdpVideoSurfaceGallium",
but this is not the case for others.

Signed-off-by: Leo Liu <leo@amd.com>
Cc: Christian König <christian.koe...@amd.com>
Cc: "18.1 18.0" <mesa-sta...@lists.freedesktop.org>


That won't work correctly.

The NV_VDPAU_interop extension we implement with that needs interlaced 
layout or otherwise can't correctly work with the surfaces.


It's probably pure coincidence that you don't get a messed-up picture 
with that.


Christian.


---
  src/gallium/state_trackers/vdpau/surface.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/gallium/state_trackers/vdpau/surface.c 
b/src/gallium/state_trackers/vdpau/surface.c
index 012d303641..d63e761350 100644
--- a/src/gallium/state_trackers/vdpau/surface.c
+++ b/src/gallium/state_trackers/vdpau/surface.c
@@ -513,12 +513,15 @@ VdpStatus vlVdpVideoSurfaceDMABuf(VdpVideoSurface surface,
 }
  
 /* Check if surface match interop requirements */

-   if (p_surf->video_buffer == NULL || !p_surf->video_buffer->interlaced ||
+   if (p_surf->video_buffer == NULL ||
 p_surf->video_buffer->buffer_format != PIPE_FORMAT_NV12) {
 mtx_unlock(&p_surf->device->mutex);
return VDP_STATUS_NO_IMPLEMENTATION;
 }
  
+   if (!p_surf->video_buffer->interlaced)

+  plane >>= 1;
+
 surf = p_surf->video_buffer->get_surfaces(p_surf->video_buffer)[plane];
 if (!surf) {
 mtx_unlock(&p_surf->device->mutex);




Re: [Mesa-dev] [PATCH] radeon/vcn: fix mpeg4 msg buffer settings

2018-05-04 Thread Christian König

On 03.05.2018 at 23:29, boyuan.zh...@amd.com wrote:

From: Boyuan Zhang <boyuan.zh...@amd.com>

The previous bit-field assignments are incorrect and will result in certain
mpeg4 decodes failing due to wrong flag values. This patch fixes these assignments.

Cc: 18.0 18.1 <mesa-sta...@lists.freedesktop.org>

Signed-off-by: Boyuan Zhang <boyuan.zh...@amd.com>
Reviewed-by: Leo Liu <leo@amd.com>


Reviewed-by: Christian König <christian.koe...@amd.com> as well.


---
  src/gallium/drivers/radeon/radeon_vcn_dec.c | 18 +-
  1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/src/gallium/drivers/radeon/radeon_vcn_dec.c 
b/src/gallium/drivers/radeon/radeon_vcn_dec.c
index f83e9e5..4bc922d 100644
--- a/src/gallium/drivers/radeon/radeon_vcn_dec.c
+++ b/src/gallium/drivers/radeon/radeon_vcn_dec.c
@@ -554,15 +554,15 @@ static rvcn_dec_message_mpeg4_asp_vld_t 
get_mpeg4_msg(struct radeon_decoder *dec
  
  	result.vop_time_increment_resolution = pic->vop_time_increment_resolution;
  
-	result.short_video_header |= pic->short_video_header << 0;

-   result.interlaced |= pic->interlaced << 2;
-result.load_intra_quant_mat |= 1 << 3;
-   result.load_nonintra_quant_mat |= 1 << 4;
-   result.quarter_sample |= pic->quarter_sample << 5;
-   result.complexity_estimation_disable |= 1 << 6;
-   result.resync_marker_disable |= pic->resync_marker_disable << 7;
-   result.newpred_enable |= 0 << 10; //
-   result.reduced_resolution_vop_enable |= 0 << 11;
+   result.short_video_header = pic->short_video_header;
+   result.interlaced = pic->interlaced;
+   result.load_intra_quant_mat = 1;
+   result.load_nonintra_quant_mat = 1;
+   result.quarter_sample = pic->quarter_sample;
+   result.complexity_estimation_disable = 1;
+   result.resync_marker_disable = pic->resync_marker_disable;
+   result.newpred_enable = 0;
+   result.reduced_resolution_vop_enable = 0;
  
  	result.quant_type = pic->quant_type;
  




Re: [Mesa-dev] [PATCH 1/3] radeonsi: implement mechanism for IBs without partial flushes at the end (v6)

2018-04-16 Thread Christian König

Am 15.04.2018 um 20:46 schrieb Nicolai Hähnle:

On 07.04.2018 04:31, Marek Olšák wrote:

From: Marek Olšák 

(This patch doesn't enable the behavior. It will be enabled in a later
commit.)

Draw calls from multiple IBs can be executed in parallel.

v2: do emit partial flushes on SI
v3: invalidate all shader caches at the beginning of IBs
v4: don't call si_emit_cache_flush in si_flush_gfx_cs if not needed,
 only do this for flushes invoked internally
v5: empty IBs should wait for idle if the flush requires it
v6: split the commit

If we artificially limit the number of draw calls per IB to 5, we'll get
a lot more IBs, leading to a lot more partial flushes. Let's see how
the removal of partial flushes changes GPU utilization in that scenario:

With partial flushes (time busy):
 CP: 99%
 SPI: 86%
 CB: 73%

Without partial flushes (time busy):
 CP: 99%
 SPI: 93%
 CB: 81%
---
  src/gallium/drivers/radeon/radeon_winsys.h |  7 
  src/gallium/drivers/radeonsi/si_gfx_cs.c   | 52 ++
  src/gallium/drivers/radeonsi/si_pipe.h     |  1 +
  3 files changed, 46 insertions(+), 14 deletions(-)
[snip]
+    /* Always invalidate caches at the beginning of IBs, because external
+     * users (e.g. BO evictions and SDMA/UVD/VCE IBs) can modify our
+     * buffers.
+     *
+     * Note that the cache flush done by the kernel at the end of GFX IBs
+     * isn't useful here, because that flush can finish after the following
+     * IB starts drawing.
+     *
+     * TODO: Do we also need to invalidate CB & DB caches?


I don't think so.

Kernel buffer move: CB & DB caches use logical addressing, so should 
be unaffected.


Are you sure about that? Basically we don't do any extra invalidation 
when BOs are moved by the kernel.


But on the other hand the worst that could happen when we skip 
invalidation is that we don't read the same data into the caches which 
is already in the caches. E.g. the content of the BO doesn't change, 
just its location.


In other words it depends how the CB caches work.

Christian.



UVD: APIs should forbid writing to the currently bound framebuffer.

CPU: Shouldn't be writing directly to the framebuffer, and even if it 
does (linear framebuffer?), I believe OpenGL requires re-binding the 
framebuffer.


Cheers,
Nicolai



+ */
+    ctx->flags |= SI_CONTEXT_INV_ICACHE |
+  SI_CONTEXT_INV_SMEM_L1 |
+  SI_CONTEXT_INV_VMEM_L1 |
+  SI_CONTEXT_INV_GLOBAL_L2 |
+  SI_CONTEXT_START_PIPELINE_STATS;
    /* set all valid group as dirty so they get reemited on
   * next draw command
   */
  si_pm4_reset_emitted(ctx);
    /* The CS initialization should be emitted before everything else. */

  si_pm4_emit(ctx, ctx->init_config);
  if (ctx->init_config_gs_rings)
  si_pm4_emit(ctx, ctx->init_config_gs_rings);
diff --git a/src/gallium/drivers/radeonsi/si_pipe.h 
b/src/gallium/drivers/radeonsi/si_pipe.h

index 0c90a6c6e46..f0f323ff3a7 100644
--- a/src/gallium/drivers/radeonsi/si_pipe.h
+++ b/src/gallium/drivers/radeonsi/si_pipe.h
@@ -540,20 +540,21 @@ struct si_context {
  void    *vs_blit_texcoord;
  struct si_screen    *screen;
  struct pipe_debug_callback    debug;
  LLVMTargetMachineRef    tm; /* only non-threaded compilation */

  struct si_shader_ctx_state    fixed_func_tcs_shader;
  struct r600_resource    *wait_mem_scratch;
  unsigned    wait_mem_number;
  uint16_t    prefetch_L2_mask;
    bool    gfx_flush_in_progress:1;
+    bool    gfx_last_ib_is_busy:1;
  bool    compute_is_busy:1;
    unsigned    num_gfx_cs_flushes;
  unsigned    initial_gfx_cs_size;
  unsigned    gpu_reset_counter;
  unsigned    last_dirty_tex_counter;
  unsigned    last_compressed_colortex_counter;
  unsigned    last_num_draw_calls;
  unsigned    flags; /* flush flags */
  /* Current unaccounted memory usage. */







