[Mesa-dev] [PATCH 0/6] dma-buf: Add an API for exporting sync files (v12)

2021-06-10 Thread Jason Ekstrand
Modern userspace APIs like Vulkan are built on an explicit
synchronization model.  This doesn't always play nicely with the
implicit synchronization used in the kernel and assumed by X11 and
Wayland.  The client -> compositor half of the synchronization isn't too
bad, at least on intel, because we can control whether or not i915
synchronizes on the buffer and whether or not it's considered written.

The harder part is the compositor -> client synchronization when we get
the buffer back from the compositor.  We're required to be able to
provide the client with a VkSemaphore and VkFence representing the point
in time where the window system (compositor and/or display) finished
using the buffer.  With current APIs, it's very hard to do this in such
a way that we don't get confused by the Vulkan driver's access of the
buffer.  In particular, once we tell the kernel that we're rendering to
the buffer again, any CPU waits on the buffer or GPU dependencies will
wait on some of the client rendering and not just the compositor.

This new IOCTL solves this problem by allowing us to get a snapshot of
the implicit synchronization state of a given dma-buf in the form of a
sync file.  It's effectively the same as a poll() or I915_GEM_WAIT only,
instead of CPU waiting directly, it encapsulates the wait operation, at
the current moment in time, in a sync_file so we can check/wait on it
later.  As long as the Vulkan driver does the sync_file export from the
dma-buf before we re-introduce it for rendering, it will only contain
fences from the compositor or display.  This allows to accurately turn
it into a VkFence or VkSemaphore without any over- synchronization.

This patch series actually contains two new ioctls.  There is the export
one mentioned above as well as an RFC for an import ioctl which provides
the other half.  The intention is to land the export ioctl since it seems
like there's no real disagreement on that one.  The import ioctl, however,
has a lot of debate around it so it's intended to be RFC-only for now.

Mesa MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
IGT tests: https://patchwork.freedesktop.org/series/90490/

v10 (Jason Ekstrand, Daniel Vetter):
 - Add reviews/acks
 - Add a patch to rename _rcu to _unlocked
 - Split things better so import is clearly RFC status

v11 (Daniel Vetter):
 - Add more CCs to try and get maintainers
 - Add a patch to document DMA_BUF_IOCTL_SYNC
 - Generally better docs
 - Use separate structs for import/export (easier to document)
 - Fix an issue in the import patch

v12 (Daniel Vetter):
 - Better docs for DMA_BUF_IOCTL_SYNC

v12 (Christian König):
 - Drop the rename patch in favor of Christian's series
 - Add a comment to the commit message for the dma-buf sync_file export
   ioctl saying why we made it an ioctl on dma-buf

Cc: Christian König 
Cc: Michel Dänzer 
Cc: Dave Airlie 
Cc: Bas Nieuwenhuizen 
Cc: Daniel Stone 
Cc: mesa-dev@lists.freedesktop.org
Cc: wayland-de...@lists.freedesktop.org
Test-with: 20210524205225.872316-1-ja...@jlekstrand.net

Christian König (1):
  dma-buf: Add dma_fence_array_for_each (v2)

Jason Ekstrand (5):
  dma-buf: Add dma_resv_get_singleton (v6)
  dma-buf: Document DMA_BUF_IOCTL_SYNC (v2)
  dma-buf: Add an API for exporting sync files (v12)
  RFC: dma-buf: Add an extra fence to dma_resv_get_singleton_unlocked
  RFC: dma-buf: Add an API for importing sync files (v7)

 Documentation/driver-api/dma-buf.rst |   8 ++
 drivers/dma-buf/dma-buf.c| 103 +
 drivers/dma-buf/dma-fence-array.c|  27 +++
 drivers/dma-buf/dma-resv.c   | 110 +++
 include/linux/dma-fence-array.h  |  17 +
 include/linux/dma-resv.h |   2 +
 include/uapi/linux/dma-buf.h | 103 -
 7 files changed, 369 insertions(+), 1 deletion(-)

-- 
2.31.1

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 0/7] dma-buf: Add an API for exporting sync files (v11)

2021-06-10 Thread Jason Ekstrand
On Thu, Jun 10, 2021 at 3:11 PM Chia-I Wu  wrote:
>
> On Tue, May 25, 2021 at 2:18 PM Jason Ekstrand  wrote:
> > Modern userspace APIs like Vulkan are built on an explicit
> > synchronization model.  This doesn't always play nicely with the
> > implicit synchronization used in the kernel and assumed by X11 and
> > Wayland.  The client -> compositor half of the synchronization isn't too
> > bad, at least on intel, because we can control whether or not i915
> > synchronizes on the buffer and whether or not it's considered written.
> We might have an important use case for this half, for virtio-gpu and Chrome 
> OS.
>
> When the guest compositor acts as a proxy to connect guest apps to the
> host compositor, implicit fencing requires the guest compositor to do
> a wait before forwarding the buffer to the host compositor.  With this
> patch, the guest compositor can extract the dma-fence from the buffer,
> and if the fence is a virtio-gpu fence, forward both the fence and the
> buffer to the host compositor.  It will allow us to convert a
> guest-side wait into a host-side wait.

Yeah, I think the first half solves a lot of problems.  I'm rebasing
it now and will send a v12 series shortly.  I don't think there's a
lot standing between the first few patches and merging.  I've got IGT
tests and I'm pretty sure the code is good.  The last review cycle got
distracted with some renaming fun.

--Jason
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 0/7] dma-buf: Add an API for exporting sync files (v11)

2021-06-10 Thread Chia-I Wu
On Tue, May 25, 2021 at 2:18 PM Jason Ekstrand  wrote:
> Modern userspace APIs like Vulkan are built on an explicit
> synchronization model.  This doesn't always play nicely with the
> implicit synchronization used in the kernel and assumed by X11 and
> Wayland.  The client -> compositor half of the synchronization isn't too
> bad, at least on intel, because we can control whether or not i915
> synchronizes on the buffer and whether or not it's considered written.
We might have an important use case for this half, for virtio-gpu and Chrome OS.

When the guest compositor acts as a proxy to connect guest apps to the
host compositor, implicit fencing requires the guest compositor to do
a wait before forwarding the buffer to the host compositor.  With this
patch, the guest compositor can extract the dma-fence from the buffer,
and if the fence is a virtio-gpu fence, forward both the fence and the
buffer to the host compositor.  It will allow us to convert a
guest-side wait into a host-side wait.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-10 Thread Christian König

Hi guys,

maybe soften that a bit. Reading from the shared memory of the user 
fence is ok for everybody. What we need to take more care of is the 
writing side.


So my current thinking is that we allow read only access, but writing a 
new sequence value needs to go through the scheduler/kernel.


So when the CPU wants to signal a timeline fence it needs to call an 
IOCTL. When the GPU wants to signal the timeline fence it needs to hand 
that of to the hardware scheduler.


If we lockup the kernel can check with the hardware who did the last 
write and what value was written.


That together with an IOCTL to give out sequence number for implicit 
sync to applications should be sufficient for the kernel to track who is 
responsible if something bad happens.


In other words when the hardware says that the shader wrote stuff like 
0xdeadbeef 0x0 or 0x into memory we kill the process who did that.


If the hardware says that seq - 1 was written fine, but seq is missing 
then the kernel blames whoever was supposed to write seq.


Just pieping the write through a privileged instance should be fine to 
make sure that we don't run into issues.


Christian.

Am 10.06.21 um 17:59 schrieb Marek Olšák:

Hi Daniel,

We just talked about this whole topic internally and we came up to the 
conclusion that the hardware needs to understand sync object handles 
and have high-level wait and signal operations in the command stream. 
Sync objects will be backed by memory, but they won't be readable or 
writable by processes directly. The hardware will log all accesses to 
sync objects and will send the log to the kernel periodically. The 
kernel will identify malicious behavior.


Example of a hardware command stream:
...
ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence 
number is assigned by the kernel

Draw();
ImplicitSyncSignalWhenDone(syncObjHandle);
...

I'm afraid we have no other choice because of the TLB invalidation 
overhead.


Marek


On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter > wrote:


On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
> Am 09.06.21 um 15:19 schrieb Daniel Vetter:
> > [SNIP]
> > > Yeah, we call this the lightweight and the heavyweight tlb
flush.
> > >
> > > The lighweight can be used when you are sure that you don't
have any of the
> > > PTEs currently in flight in the 3D/DMA engine and you just
need to
> > > invalidate the TLB.
> > >
> > > The heavyweight must be used when you need to invalidate the
TLB *AND* make
> > > sure that no concurrently operation moves new stuff into the
TLB.
> > >
> > > The problem is for this use case we have to use the
heavyweight one.
> > Just for my own curiosity: So the lightweight flush is only
for in-between
> > CS when you know access is idle? Or does that also not work if
userspace
> > has a CS on a dma engine going at the same time because the
tlb aren't
> > isolated enough between engines?
>
> More or less correct, yes.
>
> The problem is a lightweight flush only invalidates the TLB, but
doesn't
> take care of entries which have been handed out to the different
engines.
>
> In other words what can happen is the following:
>
> 1. Shader asks TLB to resolve address X.
> 2. TLB looks into its cache and can't find address X so it asks
the walker
> to resolve.
> 3. Walker comes back with result for address X and TLB puts that
into its
> cache and gives it to Shader.
> 4. Shader starts doing some operation using result for address X.
> 5. You send lightweight TLB invalidate and TLB throws away
cached values for
> address X.
> 6. Shader happily still uses whatever the TLB gave to it in step
3 to
> accesses address X
>
> See it like the shader has their own 1 entry L0 TLB cache which
is not
> affected by the lightweight flush.
>
> The heavyweight flush on the other hand sends out a broadcast
signal to
> everybody and only comes back when we are sure that an address
is not in use
> any more.

Ah makes sense. On intel the shaders only operate in VA,
everything goes
around as explicit async messages to IO blocks. So we don't have
this, the
only difference in tlb flushes is between tlb flush in the IB and
an mmio
one which is independent for anything currently being executed on an
egine.
-Daniel
-- 
Daniel Vetter

Software Engineer, Intel Corporation
http://blog.ffwll.ch 



___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-10 Thread Marek Olšák
Hi Daniel,

We just talked about this whole topic internally and we came up to the
conclusion that the hardware needs to understand sync object handles and
have high-level wait and signal operations in the command stream. Sync
objects will be backed by memory, but they won't be readable or writable by
processes directly. The hardware will log all accesses to sync objects and
will send the log to the kernel periodically. The kernel will identify
malicious behavior.

Example of a hardware command stream:
...
ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence number is
assigned by the kernel
Draw();
ImplicitSyncSignalWhenDone(syncObjHandle);
...

I'm afraid we have no other choice because of the TLB invalidation overhead.

Marek


On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter  wrote:

> On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
> > Am 09.06.21 um 15:19 schrieb Daniel Vetter:
> > > [SNIP]
> > > > Yeah, we call this the lightweight and the heavyweight tlb flush.
> > > >
> > > > The lighweight can be used when you are sure that you don't have any
> of the
> > > > PTEs currently in flight in the 3D/DMA engine and you just need to
> > > > invalidate the TLB.
> > > >
> > > > The heavyweight must be used when you need to invalidate the TLB
> *AND* make
> > > > sure that no concurrently operation moves new stuff into the TLB.
> > > >
> > > > The problem is for this use case we have to use the heavyweight one.
> > > Just for my own curiosity: So the lightweight flush is only for
> in-between
> > > CS when you know access is idle? Or does that also not work if
> userspace
> > > has a CS on a dma engine going at the same time because the tlb aren't
> > > isolated enough between engines?
> >
> > More or less correct, yes.
> >
> > The problem is a lightweight flush only invalidates the TLB, but doesn't
> > take care of entries which have been handed out to the different engines.
> >
> > In other words what can happen is the following:
> >
> > 1. Shader asks TLB to resolve address X.
> > 2. TLB looks into its cache and can't find address X so it asks the
> walker
> > to resolve.
> > 3. Walker comes back with result for address X and TLB puts that into its
> > cache and gives it to Shader.
> > 4. Shader starts doing some operation using result for address X.
> > 5. You send lightweight TLB invalidate and TLB throws away cached values
> for
> > address X.
> > 6. Shader happily still uses whatever the TLB gave to it in step 3 to
> > accesses address X
> >
> > See it like the shader has their own 1 entry L0 TLB cache which is not
> > affected by the lightweight flush.
> >
> > The heavyweight flush on the other hand sends out a broadcast signal to
> > everybody and only comes back when we are sure that an address is not in
> use
> > any more.
>
> Ah makes sense. On intel the shaders only operate in VA, everything goes
> around as explicit async messages to IO blocks. So we don't have this, the
> only difference in tlb flushes is between tlb flush in the IB and an mmio
> one which is independent for anything currently being executed on an
> egine.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
>
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Intel: ABI for DRM format modifiers with multi-planar images

2021-06-10 Thread Yiwei Zhang
On Wed, Jun 9, 2021 at 2:19 PM Jason Ekstrand  wrote:
>
> +Nanley
>
> We've not defined those yet.  We had some internal talks a couple
> years ago.  The rough idea we had at the time was to define a modifier
> for those cases which put the CCS after each main surface at some
> fixed calculation offset based on width, height, and stride.  Then the
> one modifier would apply independently to the three different planes.
>
> --Jason
>
> On Wed, Jun 9, 2021 at 2:11 PM Chad Versace  wrote:
> >
> > The Problem: For a given 3-tuple (multi-planar format, DRM format modifier, 
> > chipset), we need Intel ABI that decides (a) the value of 
> > VkDrmFormatModifierPropertiesEXT::drmFormatModifierPlaneCount and (b) the 
> > content of each "modifier" plane.
> >
> > For example, suppose drmFormatModifierPlaneCount is 2 for 
> > (VK_FORMAT_G8_B8R8_2PLANE_420_UNORM, INTEL_FORMAT_MOD_FOO). Is the layout 
> > (plane1=g8, plane2=b8r8); or is it (plane1=g8_b8r8, plane2=aux)? The second 
> > choice isn't crazy; iirc, some non-Intel hardware (sorry, NDA) uses only 2 
> > modifier planes for color-compressed 3-planar formats, with 
> > (plane1=r8_g8_b8, plane2=aux).
> >
> > I'm asking because Yiwei (cc'd) has begun implementing limited support for 
> > multi-planar formats in Anvil in order to pass the Android CTS. To support 
> > more media formats (for imminent Chrome OS projects and future vanilla 
> > Linux stuff too), we need to decide on some ABI.

Hi,

I have sent a 
MR(https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11281)
to add the minimal multi-planar format support with drm format
modifier, so that Venus Android AHB extension layered on top of ANV
VK_EXT_image_drm_format_modifier implementation can pass all the
interop graphics cts for camera and media.

I'd be interested to follow up about the stable ABI for this to expand
multi-planar support there.

Best,
Yiwei
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev