Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-14 Thread Christian König
As long as we can figure out who touched to a certain sync object last 
that would indeed work, yes.


Christian.

Am 14.06.21 um 19:10 schrieb Marek Olšák:
The call to the hw scheduler has a limitation on the size of all 
parameters combined. I think we can only pass a 32-bit sequence number 
and a ~16-bit global (per-GPU) syncobj handle in one call and not much 
else.


The syncobj handle can be an element index in a global (per-GPU) 
syncobj table and it's read only for all processes with the exception 
of the signal command. Syncobjs can either have per VMID write access 
flags for the signal command (slow), or any process can write to any 
syncobjs and only rely on the kernel checking the write log (fast).


In any case, we can execute the memory write in the queue engine and 
only use the hw scheduler for logging, which would be perfect.


Marek

On Thu, Jun 10, 2021 at 12:33 PM Christian König 
> wrote:


Hi guys,

maybe soften that a bit. Reading from the shared memory of the
user fence is ok for everybody. What we need to take more care of
is the writing side.

So my current thinking is that we allow read only access, but
writing a new sequence value needs to go through the scheduler/kernel.

So when the CPU wants to signal a timeline fence it needs to call
an IOCTL. When the GPU wants to signal the timeline fence it needs
to hand that of to the hardware scheduler.

If we lockup the kernel can check with the hardware who did the
last write and what value was written.

That together with an IOCTL to give out sequence number for
implicit sync to applications should be sufficient for the kernel
to track who is responsible if something bad happens.

In other words when the hardware says that the shader wrote stuff
like 0xdeadbeef 0x0 or 0x into memory we kill the process
who did that.

If the hardware says that seq - 1 was written fine, but seq is
missing then the kernel blames whoever was supposed to write seq.

Just pieping the write through a privileged instance should be
fine to make sure that we don't run into issues.

Christian.

Am 10.06.21 um 17:59 schrieb Marek Olšák:

Hi Daniel,

We just talked about this whole topic internally and we came up
to the conclusion that the hardware needs to understand sync
object handles and have high-level wait and signal operations in
the command stream. Sync objects will be backed by memory, but
they won't be readable or writable by processes directly. The
hardware will log all accesses to sync objects and will send the
log to the kernel periodically. The kernel will identify
malicious behavior.

Example of a hardware command stream:
...
ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence
number is assigned by the kernel
Draw();
ImplicitSyncSignalWhenDone(syncObjHandle);
...

I'm afraid we have no other choice because of the TLB
invalidation overhead.

Marek


On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter mailto:dan...@ffwll.ch>> wrote:

On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
> Am 09.06.21 um 15:19 schrieb Daniel Vetter:
> > [SNIP]
> > > Yeah, we call this the lightweight and the heavyweight
tlb flush.
> > >
> > > The lighweight can be used when you are sure that you
don't have any of the
> > > PTEs currently in flight in the 3D/DMA engine and you
just need to
> > > invalidate the TLB.
> > >
> > > The heavyweight must be used when you need to
invalidate the TLB *AND* make
> > > sure that no concurrently operation moves new stuff
into the TLB.
> > >
> > > The problem is for this use case we have to use the
heavyweight one.
> > Just for my own curiosity: So the lightweight flush is
only for in-between
> > CS when you know access is idle? Or does that also not
work if userspace
> > has a CS on a dma engine going at the same time because
the tlb aren't
> > isolated enough between engines?
>
> More or less correct, yes.
>
> The problem is a lightweight flush only invalidates the
TLB, but doesn't
> take care of entries which have been handed out to the
different engines.
>
> In other words what can happen is the following:
>
> 1. Shader asks TLB to resolve address X.
> 2. TLB looks into its cache and can't find address X so it
asks the walker
> to resolve.
> 3. Walker comes back with result for address X and TLB puts
that into its
> cache and gives it to Shader.
> 4. Shader starts doing some operation using result for
address X.
> 5. You send 

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-14 Thread Marek Olšák
The call to the hw scheduler has a limitation on the size of all parameters
combined. I think we can only pass a 32-bit sequence number and a ~16-bit
global (per-GPU) syncobj handle in one call and not much else.

The syncobj handle can be an element index in a global (per-GPU) syncobj
table and it's read only for all processes with the exception of the signal
command. Syncobjs can either have per VMID write access flags for the
signal command (slow), or any process can write to any syncobjs and only
rely on the kernel checking the write log (fast).

In any case, we can execute the memory write in the queue engine and only
use the hw scheduler for logging, which would be perfect.

Marek

On Thu, Jun 10, 2021 at 12:33 PM Christian König <
ckoenig.leichtzumer...@gmail.com> wrote:

> Hi guys,
>
> maybe soften that a bit. Reading from the shared memory of the user fence
> is ok for everybody. What we need to take more care of is the writing side.
>
> So my current thinking is that we allow read only access, but writing a
> new sequence value needs to go through the scheduler/kernel.
>
> So when the CPU wants to signal a timeline fence it needs to call an
> IOCTL. When the GPU wants to signal the timeline fence it needs to hand
> that of to the hardware scheduler.
>
> If we lockup the kernel can check with the hardware who did the last write
> and what value was written.
>
> That together with an IOCTL to give out sequence number for implicit sync
> to applications should be sufficient for the kernel to track who is
> responsible if something bad happens.
>
> In other words when the hardware says that the shader wrote stuff like
> 0xdeadbeef 0x0 or 0x into memory we kill the process who did that.
>
> If the hardware says that seq - 1 was written fine, but seq is missing
> then the kernel blames whoever was supposed to write seq.
>
> Just pieping the write through a privileged instance should be fine to
> make sure that we don't run into issues.
>
> Christian.
>
> Am 10.06.21 um 17:59 schrieb Marek Olšák:
>
> Hi Daniel,
>
> We just talked about this whole topic internally and we came up to the
> conclusion that the hardware needs to understand sync object handles and
> have high-level wait and signal operations in the command stream. Sync
> objects will be backed by memory, but they won't be readable or writable by
> processes directly. The hardware will log all accesses to sync objects and
> will send the log to the kernel periodically. The kernel will identify
> malicious behavior.
>
> Example of a hardware command stream:
> ...
> ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence number is
> assigned by the kernel
> Draw();
> ImplicitSyncSignalWhenDone(syncObjHandle);
> ...
>
> I'm afraid we have no other choice because of the TLB invalidation
> overhead.
>
> Marek
>
>
> On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter  wrote:
>
>> On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
>> > Am 09.06.21 um 15:19 schrieb Daniel Vetter:
>> > > [SNIP]
>> > > > Yeah, we call this the lightweight and the heavyweight tlb flush.
>> > > >
>> > > > The lighweight can be used when you are sure that you don't have
>> any of the
>> > > > PTEs currently in flight in the 3D/DMA engine and you just need to
>> > > > invalidate the TLB.
>> > > >
>> > > > The heavyweight must be used when you need to invalidate the TLB
>> *AND* make
>> > > > sure that no concurrently operation moves new stuff into the TLB.
>> > > >
>> > > > The problem is for this use case we have to use the heavyweight one.
>> > > Just for my own curiosity: So the lightweight flush is only for
>> in-between
>> > > CS when you know access is idle? Or does that also not work if
>> userspace
>> > > has a CS on a dma engine going at the same time because the tlb aren't
>> > > isolated enough between engines?
>> >
>> > More or less correct, yes.
>> >
>> > The problem is a lightweight flush only invalidates the TLB, but doesn't
>> > take care of entries which have been handed out to the different
>> engines.
>> >
>> > In other words what can happen is the following:
>> >
>> > 1. Shader asks TLB to resolve address X.
>> > 2. TLB looks into its cache and can't find address X so it asks the
>> walker
>> > to resolve.
>> > 3. Walker comes back with result for address X and TLB puts that into
>> its
>> > cache and gives it to Shader.
>> > 4. Shader starts doing some operation using result for address X.
>> > 5. You send lightweight TLB invalidate and TLB throws away cached
>> values for
>> > address X.
>> > 6. Shader happily still uses whatever the TLB gave to it in step 3 to
>> > accesses address X
>> >
>> > See it like the shader has their own 1 entry L0 TLB cache which is not
>> > affected by the lightweight flush.
>> >
>> > The heavyweight flush on the other hand sends out a broadcast signal to
>> > everybody and only comes back when we are sure that an address is not
>> in use
>> > any more.
>>
>> Ah makes sense. On intel the shaders 

Re: [Mesa-dev] Intel: ABI for DRM format modifiers with multi-planar images

2021-06-14 Thread Yiwei Zhang
+Chia-I Wu

My latest MR 
there(https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11281)
has addressed these though I still only enable the single format to be
safe (since that's the only one I can thoroughly test and already
verified)

On Wed, Jun 9, 2021 at 5:48 PM Chad Versace  wrote:
>
> That's a reasonable plan for now. For LINEAR, X, and Y, the 
> drmFormatModifierCount is the obvious value for the format. That's enough to 
> satisfy all the needs of Chrome OS and its zoo of virtual machines. For 
> simplicity, we can keep VK_FORMAT_FEATURE_DISJOINT_BIT disabled in 
> drmFormatModifierTilingProperties for multi-planar formats. If we ever need 
> to squeeze more performance out of shared media images, then we can start 
> experiments on compressed modifiers and possibly work on defining the ABI 
> (though, it's always better to have it defined before it's needed, due to 
> Mesa and kernel release cycles).
>
> On Wed, Jun 9, 2021, at 16:19, Jason Ekstrand wrote:
> > I should have said that the minimal support can be for LINEAR, X-tiled
> > and Y-tiled.  CCS can and probably should come later.
> >
> > On Wed, Jun 9, 2021 at 6:14 PM Yiwei Zhang  wrote:
> > >
> > > On Wed, Jun 9, 2021 at 2:19 PM Jason Ekstrand  
> > > wrote:
> > > >
> > > > +Nanley
> > > >
> > > > We've not defined those yet.  We had some internal talks a couple
> > > > years ago.  The rough idea we had at the time was to define a modifier
> > > > for those cases which put the CCS after each main surface at some
> > > > fixed calculation offset based on width, height, and stride.  Then the
> > > > one modifier would apply independently to the three different planes.
> > > >
> > > > --Jason
> > > >
> > > > On Wed, Jun 9, 2021 at 2:11 PM Chad Versace  wrote:
> > > > >
> > > > > The Problem: For a given 3-tuple (multi-planar format, DRM format 
> > > > > modifier, chipset), we need Intel ABI that decides (a) the value of 
> > > > > VkDrmFormatModifierPropertiesEXT::drmFormatModifierPlaneCount and (b) 
> > > > > the content of each "modifier" plane.
> > > > >
> > > > > For example, suppose drmFormatModifierPlaneCount is 2 for 
> > > > > (VK_FORMAT_G8_B8R8_2PLANE_420_UNORM, INTEL_FORMAT_MOD_FOO). Is the 
> > > > > layout (plane1=g8, plane2=b8r8); or is it (plane1=g8_b8r8, 
> > > > > plane2=aux)? The second choice isn't crazy; iirc, some non-Intel 
> > > > > hardware (sorry, NDA) uses only 2 modifier planes for 
> > > > > color-compressed 3-planar formats, with (plane1=r8_g8_b8, plane2=aux).
> > > > >
> > > > > I'm asking because Yiwei (cc'd) has begun implementing limited 
> > > > > support for multi-planar formats in Anvil in order to pass the 
> > > > > Android CTS. To support more media formats (for imminent Chrome OS 
> > > > > projects and future vanilla Linux stuff too), we need to decide on 
> > > > > some ABI.
> > >
> > > Hi,
> > >
> > > I have sent a 
> > > MR(https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11281)
> > > to add the minimal multi-planar format support with drm format
> > > modifier, so that Venus Android AHB extension layered on top of ANV
> > > VK_EXT_image_drm_format_modifier implementation can pass all the
> > > interop graphics cts for camera and media.
> > >
> > > I'd be interested to follow up about the stable ABI for this to expand
> > > multi-planar support there.
> > >
> > > Best,
> > > Yiwei
> >
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev