Re: [Mesa-dev] Intel: ABI for DRM format modifiers with multi-planar images

2021-06-09 Thread Chad Versace
That's a reasonable plan for now. For LINEAR, X, and Y, the 
drmFormatModifierCount is the obvious value for the format. That's enough to 
satisfy all the needs of Chrome OS and its zoo of virtual machines. For 
simplicity, we can keep VK_FORMAT_FEATURE_DISJOINT_BIT disabled in 
drmFormatModifierTilingProperties for multi-planar formats. If we ever need to 
squeeze more performance out of shared media images, then we can start 
experiments on compressed modifiers and possibly work on defining the ABI 
(though, it's always better to have it defined before it's needed, due to Mesa 
and kernel release cycles).

On Wed, Jun 9, 2021, at 16:19, Jason Ekstrand wrote:
> I should have said that the minimal support can be for LINEAR, X-tiled
> and Y-tiled.  CCS can and probably should come later.
> 
> On Wed, Jun 9, 2021 at 6:14 PM Yiwei Zhang  wrote:
> >
> > On Wed, Jun 9, 2021 at 2:19 PM Jason Ekstrand  wrote:
> > >
> > > +Nanley
> > >
> > > We've not defined those yet.  We had some internal talks a couple
> > > years ago.  The rough idea we had at the time was to define a modifier
> > > for those cases which put the CCS after each main surface at some
> > > fixed calculation offset based on width, height, and stride.  Then the
> > > one modifier would apply independently to the three different planes.
> > >
> > > --Jason
> > >
> > > On Wed, Jun 9, 2021 at 2:11 PM Chad Versace  wrote:
> > > >
> > > > The Problem: For a given 3-tuple (multi-planar format, DRM format 
> > > > modifier, chipset), we need Intel ABI that decides (a) the value of 
> > > > VkDrmFormatModifierPropertiesEXT::drmFormatModifierPlaneCount and (b) 
> > > > the content of each "modifier" plane.
> > > >
> > > > For example, suppose drmFormatModifierPlaneCount is 2 for 
> > > > (VK_FORMAT_G8_B8R8_2PLANE_420_UNORM, INTEL_FORMAT_MOD_FOO). Is the 
> > > > layout (plane1=g8, plane2=b8r8); or is it (plane1=g8_b8r8, plane2=aux)? 
> > > > The second choice isn't crazy; iirc, some non-Intel hardware (sorry, 
> > > > NDA) uses only 2 modifier planes for color-compressed 3-planar formats, 
> > > > with (plane1=r8_g8_b8, plane2=aux).
> > > >
> > > > I'm asking because Yiwei (cc'd) has begun implementing limited support 
> > > > for multi-planar formats in Anvil in order to pass the Android CTS. To 
> > > > support more media formats (for imminent Chrome OS projects and future 
> > > > vanilla Linux stuff too), we need to decide on some ABI.
> >
> > Hi,
> >
> > I have sent a 
> > MR(https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11281)
> > to add the minimal multi-planar format support with drm format
> > modifier, so that Venus Android AHB extension layered on top of ANV
> > VK_EXT_image_drm_format_modifier implementation can pass all the
> > interop graphics cts for camera and media.
> >
> > I'd be interested to follow up about the stable ABI for this to expand
> > multi-planar support there.
> >
> > Best,
> > Yiwei
> 
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Intel: ABI for DRM format modifiers with multi-planar images

2021-06-09 Thread Jason Ekstrand
I should have said that the minimal support can be for LINEAR, X-tiled
and Y-tiled.  CCS can and probably should come later.

On Wed, Jun 9, 2021 at 6:14 PM Yiwei Zhang  wrote:
>
> On Wed, Jun 9, 2021 at 2:19 PM Jason Ekstrand  wrote:
> >
> > +Nanley
> >
> > We've not defined those yet.  We had some internal talks a couple
> > years ago.  The rough idea we had at the time was to define a modifier
> > for those cases which put the CCS after each main surface at some
> > fixed calculation offset based on width, height, and stride.  Then the
> > one modifier would apply independently to the three different planes.
> >
> > --Jason
> >
> > On Wed, Jun 9, 2021 at 2:11 PM Chad Versace  wrote:
> > >
> > > The Problem: For a given 3-tuple (multi-planar format, DRM format 
> > > modifier, chipset), we need Intel ABI that decides (a) the value of 
> > > VkDrmFormatModifierPropertiesEXT::drmFormatModifierPlaneCount and (b) the 
> > > content of each "modifier" plane.
> > >
> > > For example, suppose drmFormatModifierPlaneCount is 2 for 
> > > (VK_FORMAT_G8_B8R8_2PLANE_420_UNORM, INTEL_FORMAT_MOD_FOO). Is the layout 
> > > (plane1=g8, plane2=b8r8); or is it (plane1=g8_b8r8, plane2=aux)? The 
> > > second choice isn't crazy; iirc, some non-Intel hardware (sorry, NDA) 
> > > uses only 2 modifier planes for color-compressed 3-planar formats, with 
> > > (plane1=r8_g8_b8, plane2=aux).
> > >
> > > I'm asking because Yiwei (cc'd) has begun implementing limited support 
> > > for multi-planar formats in Anvil in order to pass the Android CTS. To 
> > > support more media formats (for imminent Chrome OS projects and future 
> > > vanilla Linux stuff too), we need to decide on some ABI.
>
> Hi,
>
> I have sent a 
> MR(https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11281)
> to add the minimal multi-planar format support with drm format
> modifier, so that Venus Android AHB extension layered on top of ANV
> VK_EXT_image_drm_format_modifier implementation can pass all the
> interop graphics cts for camera and media.
>
> I'd be interested to follow up about the stable ABI for this to expand
> multi-planar support there.
>
> Best,
> Yiwei
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Intel: ABI for DRM format modifiers with multi-planar images

2021-06-09 Thread Jason Ekstrand
+Nanley

We've not defined those yet.  We had some internal talks a couple
years ago.  The rough idea we had at the time was to define a modifier
for those cases which put the CCS after each main surface at some
fixed calculation offset based on width, height, and stride.  Then the
one modifier would apply independently to the three different planes.

--Jason

On Wed, Jun 9, 2021 at 2:11 PM Chad Versace  wrote:
>
> The Problem: For a given 3-tuple (multi-planar format, DRM format modifier, 
> chipset), we need Intel ABI that decides (a) the value of 
> VkDrmFormatModifierPropertiesEXT::drmFormatModifierPlaneCount and (b) the 
> content of each "modifier" plane.
>
> For example, suppose drmFormatModifierPlaneCount is 2 for 
> (VK_FORMAT_G8_B8R8_2PLANE_420_UNORM, INTEL_FORMAT_MOD_FOO). Is the layout 
> (plane1=g8, plane2=b8r8); or is it (plane1=g8_b8r8, plane2=aux)? The second 
> choice isn't crazy; iirc, some non-Intel hardware (sorry, NDA) uses only 2 
> modifier planes for color-compressed 3-planar formats, with (plane1=r8_g8_b8, 
> plane2=aux).
>
> I'm asking because Yiwei (cc'd) has begun implementing limited support for 
> multi-planar formats in Anvil in order to pass the Android CTS. To support 
> more media formats (for imminent Chrome OS projects and future vanilla Linux 
> stuff too), we need to decide on some ABI.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


[Mesa-dev] Intel: ABI for DRM format modifiers with multi-planar images

2021-06-09 Thread Chad Versace
The Problem: For a given 3-tuple (multi-planar format, DRM format modifier, 
chipset), we need Intel ABI that decides (a) the value of 
VkDrmFormatModifierPropertiesEXT::drmFormatModifierPlaneCount and (b) the 
content of each "modifier" plane.

For example, suppose drmFormatModifierPlaneCount is 2 for 
(VK_FORMAT_G8_B8R8_2PLANE_420_UNORM, INTEL_FORMAT_MOD_FOO). Is the layout 
(plane1=g8, plane2=b8r8); or is it (plane1=g8_b8r8, plane2=aux)? The second 
choice isn't crazy; iirc, some non-Intel hardware (sorry, NDA) uses only 2 
modifier planes for color-compressed 3-planar formats, with (plane1=r8_g8_b8, 
plane2=aux).

I'm asking because Yiwei (cc'd) has begun implementing limited support for 
multi-planar formats in Anvil in order to pass the Android CTS. To support more 
media formats (for imminent Chrome OS projects and future vanilla Linux stuff 
too), we need to decide on some ABI.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-09 Thread Daniel Vetter
On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
> Am 09.06.21 um 15:19 schrieb Daniel Vetter:
> > [SNIP]
> > > Yeah, we call this the lightweight and the heavyweight tlb flush.
> > > 
> > > The lighweight can be used when you are sure that you don't have any of 
> > > the
> > > PTEs currently in flight in the 3D/DMA engine and you just need to
> > > invalidate the TLB.
> > > 
> > > The heavyweight must be used when you need to invalidate the TLB *AND* 
> > > make
> > > sure that no concurrently operation moves new stuff into the TLB.
> > > 
> > > The problem is for this use case we have to use the heavyweight one.
> > Just for my own curiosity: So the lightweight flush is only for in-between
> > CS when you know access is idle? Or does that also not work if userspace
> > has a CS on a dma engine going at the same time because the tlb aren't
> > isolated enough between engines?
> 
> More or less correct, yes.
> 
> The problem is a lightweight flush only invalidates the TLB, but doesn't
> take care of entries which have been handed out to the different engines.
> 
> In other words what can happen is the following:
> 
> 1. Shader asks TLB to resolve address X.
> 2. TLB looks into its cache and can't find address X so it asks the walker
> to resolve.
> 3. Walker comes back with result for address X and TLB puts that into its
> cache and gives it to Shader.
> 4. Shader starts doing some operation using result for address X.
> 5. You send lightweight TLB invalidate and TLB throws away cached values for
> address X.
> 6. Shader happily still uses whatever the TLB gave to it in step 3 to
> accesses address X
> 
> See it like the shader has their own 1 entry L0 TLB cache which is not
> affected by the lightweight flush.
> 
> The heavyweight flush on the other hand sends out a broadcast signal to
> everybody and only comes back when we are sure that an address is not in use
> any more.

Ah makes sense. On intel the shaders only operate in VA, everything goes
around as explicit async messages to IO blocks. So we don't have this, the
only difference in tlb flushes is between tlb flush in the IB and an mmio
one which is independent for anything currently being executed on an
egine.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-09 Thread Christian König

Am 09.06.21 um 15:19 schrieb Daniel Vetter:

[SNIP]

Yeah, we call this the lightweight and the heavyweight tlb flush.

The lighweight can be used when you are sure that you don't have any of the
PTEs currently in flight in the 3D/DMA engine and you just need to
invalidate the TLB.

The heavyweight must be used when you need to invalidate the TLB *AND* make
sure that no concurrently operation moves new stuff into the TLB.

The problem is for this use case we have to use the heavyweight one.

Just for my own curiosity: So the lightweight flush is only for in-between
CS when you know access is idle? Or does that also not work if userspace
has a CS on a dma engine going at the same time because the tlb aren't
isolated enough between engines?


More or less correct, yes.

The problem is a lightweight flush only invalidates the TLB, but doesn't 
take care of entries which have been handed out to the different engines.


In other words what can happen is the following:

1. Shader asks TLB to resolve address X.
2. TLB looks into its cache and can't find address X so it asks the 
walker to resolve.
3. Walker comes back with result for address X and TLB puts that into 
its cache and gives it to Shader.

4. Shader starts doing some operation using result for address X.
5. You send lightweight TLB invalidate and TLB throws away cached values 
for address X.
6. Shader happily still uses whatever the TLB gave to it in step 3 to 
accesses address X


See it like the shader has their own 1 entry L0 TLB cache which is not 
affected by the lightweight flush.


The heavyweight flush on the other hand sends out a broadcast signal to 
everybody and only comes back when we are sure that an address is not in 
use any more.


Christian.


-Daniel



___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-09 Thread Daniel Vetter
On Fri, Jun 04, 2021 at 01:27:15PM +0200, Christian König wrote:
> Am 04.06.21 um 10:57 schrieb Daniel Vetter:
> > On Fri, Jun 04, 2021 at 09:00:31AM +0200, Christian König wrote:
> > > Am 02.06.21 um 21:19 schrieb Daniel Vetter:
> > > > On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:
> > > > > Am 02.06.21 um 20:48 schrieb Daniel Vetter:
> > > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  
> > > > > > > wrote:
> > > > > > > 
> > > > > > > > Yes, we can't break anything because we don't want to 
> > > > > > > > complicate things
> > > > > > > > for us. It's pretty much all NAK'd already. We are trying to 
> > > > > > > > gather more
> > > > > > > > knowledge and then make better decisions.
> > > > > > > > 
> > > > > > > > The idea we are considering is that we'll expose memory-based 
> > > > > > > > sync objects
> > > > > > > > to userspace for read only, and the kernel or hw will strictly 
> > > > > > > > control the
> > > > > > > > memory writes to those sync objects. The hole in that idea is 
> > > > > > > > that
> > > > > > > > userspace can decide not to signal a job, so even if userspace 
> > > > > > > > can't
> > > > > > > > overwrite memory-based sync object states arbitrarily, it can 
> > > > > > > > still decide
> > > > > > > > not to signal them, and then a future fence is born.
> > > > > > > > 
> > > > > > > This would actually be treated as a GPU hang caused by that 
> > > > > > > context, so it
> > > > > > > should be fine.
> > > > > > This is practically what I proposed already, except your not doing 
> > > > > > it with
> > > > > > dma_fence. And on the memory fence side this also doesn't actually 
> > > > > > give
> > > > > > what you want for that compute model.
> > > > > > 
> > > > > > This seems like a bit a worst of both worlds approach to me? Tons 
> > > > > > of work
> > > > > > in the kernel to hide these not-dma_fence-but-almost, and still 
> > > > > > pain to
> > > > > > actually drive the hardware like it should be for compute or direct
> > > > > > display.
> > > > > > 
> > > > > > Also maybe I've missed it, but I didn't see any replies to my 
> > > > > > suggestion
> > > > > > how to fake the entire dma_fence stuff on top of new hw. Would be
> > > > > > interesting to know what doesn't work there instead of amd folks 
> > > > > > going of
> > > > > > into internal again and then coming back with another rfc from out 
> > > > > > of
> > > > > > nowhere :-)
> > > > > Well to be honest I would just push back on our hardware/firmware 
> > > > > guys that
> > > > > we need to keep kernel queues forever before going down that route.
> > > > I looked again, and you said the model wont work because preemption is 
> > > > way
> > > > too slow, even when the context is idle.
> > > > 
> > > > I guess at that point I got maybe too fed up and just figured "not my
> > > > problem", but if preempt is too slow as the unload fence, you can do it
> > > > with pte removal and tlb shootdown too (that is hopefully not too slow,
> > > > otherwise your hw is just garbage and wont even be fast for direct 
> > > > submit
> > > > compute workloads).
> > > Have you seen that one here:
> > > https://www.spinics.net/lists/amd-gfx/msg63101.html :)
> > > 
> > > I've rejected it because I think polling for 6 seconds on a TLB flush 
> > > which
> > > can block interrupts as well is just madness.
> > Hm but I thought you had like 2 tlb flush modes, the shitty one (with
> > retrying page faults) and the not so shitty one?
> 
> Yeah, we call this the lightweight and the heavyweight tlb flush.
> 
> The lighweight can be used when you are sure that you don't have any of the
> PTEs currently in flight in the 3D/DMA engine and you just need to
> invalidate the TLB.
> 
> The heavyweight must be used when you need to invalidate the TLB *AND* make
> sure that no concurrently operation moves new stuff into the TLB.
> 
> The problem is for this use case we have to use the heavyweight one.

Just for my own curiosity: So the lightweight flush is only for in-between
CS when you know access is idle? Or does that also not work if userspace
has a CS on a dma engine going at the same time because the tlb aren't
isolated enough between engines?
-Daniel

> > But yeah at that point I think you just have to bite one of the bullets.
> 
> Yeah, completely agree. We can choose which way we want to die, but it's
> certainly not going to be nice whatever we do.
> 
> > 
> > The thing is with hmm/userspace memory fence model this will be even
> > worse, because you will _have_ to do this tlb flush deep down in core mm
> > functions, so this is going to be userptr, but worse.
> > 
> > With the dma_resv/dma_fence bo memory management model you can at least
> > wrap that tlb flush into a dma_fence and push the waiting/pinging onto a
> > separate thread or something like that. If the hw really is that slow.
> > 
> > Somewhat aside: Personally I think