Re: [Mesa-dev] [PATCH 0/6] dma-buf: Add an API for exporting sync files (v12)

2021-06-17 Thread Daniel Vetter
On Thu, Jun 17, 2021 at 09:37:36AM +0200, Christian König wrote:
> > On 16.06.21 at 20:30, Jason Ekstrand wrote:
> > On Tue, Jun 15, 2021 at 3:41 AM Christian König wrote:
> > > Hi Jason & Daniel,
> > > 
> > > maybe I should explain once more where the problem with this approach is
> > > and why I think we need to get that fixed before we can do something
> > > like this here.
> > > 
> > > To summarize what this patch here does is that it copies the exclusive
> > > fence and/or the shared fences into a sync_file. This alone is totally
> > > unproblematic.
> > > 
> > > The problem is what this implies. When you need to copy the exclusive
> > > fence to a sync_file then this means that the driver is at some point
> > > ignoring the exclusive fence on a buffer object.
> > Not necessarily.  Part of the point of this is to allow for CPU waits
> > on a past point in buffers timeline.  Today, we have poll() and
> > GEM_WAIT both of which wait for the buffer to be idle from whatever
> > GPU work is currently happening.  We want to wait on something in the
> > past and ignore anything happening now.
> 
> Good point, yes that is indeed a valid use case.
> 
> > But, to the broader point, maybe?  I'm a little fuzzy on exactly where
> > i915 inserts and/or depends on fences.
> > 
> > > When you combine that with complex drivers which use TTM and buffer
> > > moves underneath you can construct an information leak using this and
> > > give userspace access to memory which is allocated to the driver, but
> > > not yet initialized.
> > > 
> > > This way you can leak things like page tables, passwords, kernel data
> > > etc... in large amounts to userspace and is an absolutely no-go for
> > > security.
> > Ugh...  Unfortunately, I'm really out of my depth on the implications
> > going on here but I think I see your point.
> > 
> > > That's why I said we need to get this fixed before we upstream this
> > > patch set here and especially the driver change which is using that.
> > Well, i915 has had uAPI for a while to ignore fences.
> 
> Yeah, exactly that's illegal.

You're a few years too late with closing that barn door. The following
drivers have this concept:
- i915
- msm
- etnaviv

Because you can't write a competent Vulkan driver without this. This was
discussed at absolutely epic length at various XDCs, iirc. We did somewhat
ignore the vram/ttm/bo-moving problem because all the people present were
hacking on integrated GPUs (see list above), but that just means we need to
treat the ttm_bo->moving fence properly.
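
For reference, the uAPI in question is the per-BO "skip implicit sync" flag in
the execbuf/submit interfaces - e.g. i915's EXEC_OBJECT_ASYNC (msm and etnaviv
carry similar "no implicit" submit flags). A minimal sketch of the userspace
side, for illustration only:

#include <drm/i915_drm.h>

/*
 * Sketch: opt one exec object out of implicit sync.  With
 * EXEC_OBJECT_ASYNC set, i915 skips the implicit-sync wait on the
 * buffer's existing dma_resv fences for this batch; the userspace
 * driver is expected to express any dependencies explicitly instead.
 */
static void opt_out_of_implicit_sync(struct drm_i915_gem_exec_object2 *obj)
{
	obj->flags |= EXEC_OBJECT_ASYNC;
}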

> At least the kernel internal fences like moving or clearing a buffer object
> need to be taken into account before a driver is allowed to access a
> buffer.

Yes i915 needs to make sure it never ignores ttm_bo->moving.
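
Concretely, "never ignores ttm_bo->moving" boils down to something like the
sketch below (TTM field names as of the kernels discussed in this thread;
illustrative only, not actual i915 code):

#include <linux/dma-fence.h>
#include <drm/ttm/ttm_bo_api.h>

/*
 * Illustrative sketch: even when userspace opted out of implicit sync,
 * the kernel-internal move/clear fence must complete before the GPU may
 * touch the BO.  Assumes the caller holds the BO's reservation lock.
 */
static int sketch_wait_bo_moving(struct ttm_buffer_object *bo)
{
	long ret;

	if (!bo->moving)
		return 0;

	/* interruptible wait for the pending buffer move / clear */
	ret = dma_fence_wait(bo->moving, true);
	return ret < 0 ? ret : 0;
}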

For dma-buf this isn't actually a problem, because dma-bufs are pinned. You
can't move them while other drivers are using them, hence there's not
actually a ttm_bo->moving fence we can ignore.

p2p dma-buf aka dynamic dma-buf is a different beast, and i915 (and fwiw
these other drivers) need to change before they can do dynamic dma-buf.

> Otherwise we have an information leak worth a CVE and that is certainly not
> something we want.

Because yes otherwise we get a CVE. But right now I don't think we have
one.

We do have quite a big confusion about what exactly the signaling ordering is
supposed to be between the exclusive fence and the collective set of shared
fences, and there's some unifying that needs to happen here. But I think what
Jason implements here in the import ioctl is the most defensive version
possible, so it really can't break any driver. It really works as if you had
an ad-hoc gpu engine that does nothing itself, but waits for the current
exclusive fence and then sets the exclusive fence to its "CS" completion
fence.

That's imo a perfectly legit use-case.
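
For concreteness, a hedged userspace sketch of that import side - the struct
and ioctl names below follow the RFC as it later landed in <linux/dma-buf.h>,
so treat them as an assumption in the context of this thread:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/*
 * Attach an existing sync_file as the dma-buf's implicit write
 * (exclusive) fence - the "ad-hoc engine" behaviour described above.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int import_write_fence(int dmabuf_fd, int sync_file_fd)
{
	struct dma_buf_import_sync_file arg;

	memset(&arg, 0, sizeof(arg));
	arg.flags = DMA_BUF_SYNC_WRITE;	/* install as the exclusive fence */
	arg.fd = sync_file_fd;

	return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &arg);
}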

Same for the export one. Waiting for a previous snapshot of implicit
fences is imo a perfectly ok use-case and useful for compositors - the client
might soon start more rendering, and on some drivers that always results
in the exclusive slot being set, so if you don't take a snapshot you
oversync really badly on your atomic flip.
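
A matching sketch of the export-side snapshot a compositor would take before
the client is allowed to render again (same caveat on names as above):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/*
 * Snapshot the dma-buf's current implicit fences into a sync_file.
 * Returns the sync_file fd (>= 0) or -1 on error; waiting on that fd
 * later only covers work submitted before this call, so new client
 * rendering cannot cause oversync.
 */
static int snapshot_implicit_fences(int dmabuf_fd)
{
	struct dma_buf_export_sync_file arg;

	memset(&arg, 0, sizeof(arg));
	arg.flags = DMA_BUF_SYNC_READ | DMA_BUF_SYNC_WRITE;	/* readers + writer */
	arg.fd = -1;

	if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg) < 0)
		return -1;

	return arg.fd;
}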

> > Those changes are years in the past.  If we have a real problem here (not
> > sure on that yet), then we'll have to figure out how to fix it without
> > nuking uAPI.
> 
> Well, that was the basic idea of attaching flags to the fences in the
> dma_resv object.
> 
> In other words you clearly denote when you have to wait for a fence before
> accessing a buffer or you cause a security issue.

Replied somewhere else, and I do kinda like the flag idea. But the problem
is we first need a ton more encapsulation and review of drivers before we
can change the internals. One thing at a time.

And yes for amdgpu this gets triple-hard because you both have the
ttm_bo->moving fence _and_ the current uapi of using fence ownership _and_
you need to figure out how to support vulkan properly with true opt-in
fencing. I'm pretty 

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-17 Thread Marek Olšák
Timeline semaphore waits (polling on memory) will be unmonitored and as
fast as a roundtrip to memory. Semaphore writes will be slower because a
copy of those write requests will also be forwarded to the kernel.
Arbitrary writes are not blocked by the hw, but the kernel will take
action against such behavior because it will receive them too.
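
As an illustration of the wait side described above (names made up, not an
actual driver interface): the wait is just a memory poll on the 64-bit
timeline value, with no kernel involvement.

#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/*
 * Illustrative sketch: poll a user-fence timeline value in shared
 * memory until it reaches "wanted" or the deadline passes.  A real
 * driver would additionally fall back to a kernel wait instead of
 * spinning forever.
 */
static bool wait_timeline(const uint64_t *timeline, uint64_t wanted,
			  const struct timespec *deadline)
{
	struct timespec now;

	for (;;) {
		if (__atomic_load_n(timeline, __ATOMIC_ACQUIRE) >= wanted)
			return true;			/* signalled */

		clock_gettime(CLOCK_MONOTONIC, &now);
		if (now.tv_sec > deadline->tv_sec ||
		    (now.tv_sec == deadline->tv_sec &&
		     now.tv_nsec >= deadline->tv_nsec))
			return false;			/* timed out */
	}
}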

I don't know if that would work with dma_fence.

Marek


On Thu, Jun 17, 2021 at 3:04 PM Daniel Vetter  wrote:

> On Thu, Jun 17, 2021 at 02:28:06PM -0400, Marek Olšák wrote:
> > The kernel will know who should touch the implicit-sync semaphore next, and
> > at the same time, the copy of all write requests to the implicit-sync
> > semaphore will be forwarded to the kernel for monitoring and bo_wait.
> >
> > Syncobjs could either use the same monitored access as implicit sync or be
> > completely unmonitored. We haven't decided yet.
> >
> > Syncfiles could either use one of the above or wait for a syncobj to go
> > idle before converting to a syncfile.
>
> Hm, this all sounds like you're planning to completely rewrap everything
> ... I'm assuming the plan is still that this is going to be largely
> wrapped in dma_fence? Maybe with timeline objects being a bit more
> optimized, but I'm not sure how much you can optimize without breaking the
> interfaces.
> -Daniel
>
> >
> > Marek
> >
> >
> >
> > On Thu, Jun 17, 2021 at 12:48 PM Daniel Vetter  wrote:
> >
> > > On Mon, Jun 14, 2021 at 07:13:00PM +0200, Christian König wrote:
> > > > As long as we can figure out who touched a certain sync object last,
> > > > that would indeed work, yes.
> > >
> > > Don't you need to know who will touch it next, i.e. who is holding up your
> > > fence? Or maybe I'm just again totally confused.
> > > -Daniel
> > >
> > > >
> > > > Christian.
> > > >
> > > > On 14.06.21 at 19:10, Marek Olšák wrote:
> > > > > The call to the hw scheduler has a limitation on the size of all
> > > > > parameters combined. I think we can only pass a 32-bit sequence number
> > > > > and a ~16-bit global (per-GPU) syncobj handle in one call and not much
> > > > > else.
> > > > >
> > > > > The syncobj handle can be an element index in a global (per-GPU) syncobj
> > > > > table and it's read only for all processes with the exception of the
> > > > > signal command. Syncobjs can either have per VMID write access flags for
> > > > > the signal command (slow), or any process can write to any syncobjs and
> > > > > only rely on the kernel checking the write log (fast).
> > > > >
> > > > > In any case, we can execute the memory write in the queue engine and
> > > > > only use the hw scheduler for logging, which would be perfect.
> > > > >
> > > > > Marek
> > > > >
> > > > > On Thu, Jun 10, 2021 at 12:33 PM Christian König wrote:
> > > > >
> > > > > Hi guys,
> > > > >
> > > > > maybe soften that a bit. Reading from the shared memory of the
> > > > > user fence is ok for everybody. What we need to take more care of
> > > > > is the writing side.
> > > > >
> > > > > So my current thinking is that we allow read only access, but
> > > > > writing a new sequence value needs to go through the scheduler/kernel.
> > > > >
> > > > > So when the CPU wants to signal a timeline fence it needs to call
> > > > > an IOCTL. When the GPU wants to signal the timeline fence it needs
> > > > > to hand that off to the hardware scheduler.
> > > > >
> > > > > If we lock up, the kernel can check with the hardware who did the
> > > > > last write and what value was written.
> > > > >
> > > > > That together with an IOCTL to give out sequence number for
> > > > > implicit sync to applications should be sufficient for the kernel
> > > > > to track who is responsible if something bad happens.
> > > > >
> > > > > In other words when the hardware says that the shader wrote stuff
> > > > > like 0xdeadbeef 0x0 or 0x into memory we kill the process
> > > > > who did that.
> > > > >
> > > > > If the hardware says that seq - 1 was written fine, but seq is
> > > > > missing then the kernel blames whoever was supposed to write seq.
> > > > >
> > > > > Just piping the write through a privileged instance should be
> > > > > fine to make sure that we don't run into issues.
> > > > >
> > > > > Christian.
> > > > >
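
A purely illustrative sketch of the blame policy described in the quote above
(all names made up; this is not amdgpu code):

#include <stdint.h>

enum fence_blame {
	BLAME_NONE,		/* seq was written fine */
	BLAME_MISSING_WRITER,	/* seq never arrived: blame its designated writer */
	BLAME_BAD_WRITER	/* garbage value: kill the process that wrote it */
};

static enum fence_blame check_user_fence(uint64_t observed, uint64_t expected)
{
	if (observed >= expected)
		return BLAME_NONE;
	if (observed == expected - 1)
		return BLAME_MISSING_WRITER;
	return BLAME_BAD_WRITER;
}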
> > > > > On 10.06.21 at 17:59, Marek Olšák wrote:
> > > > > > Hi Daniel,
> > > > > >
> > > > > > We just talked about this whole topic internally and we came
> > > > > > to the conclusion that the hardware needs to understand sync
> > > > > > object handles and have high-level wait and signal operations in
> > > > > > the command stream. Sync objects will be backed by memory, but
> > > > > > they won't be readable or writable by processes directly. The
> > > > > > hardware will log all 

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-17 Thread Daniel Vetter
On Thu, Jun 17, 2021 at 02:28:06PM -0400, Marek Olšák wrote:
> The kernel will know who should touch the implicit-sync semaphore next, and
> at the same time, the copy of all write requests to the implicit-sync
> semaphore will be forwarded to the kernel for monitoring and bo_wait.
> 
> Syncobjs could either use the same monitored access as implicit sync or be
> completely unmonitored. We haven't decided yet.
> 
> Syncfiles could either use one of the above or wait for a syncobj to go
> idle before converting to a syncfile.

Hm, this all sounds like you're planning to completely rewrap everything
... I'm assuming the plan is still that this is going to be largely
wrapped in dma_fence? Maybe with timeline objects being a bit more
optimized, but I'm not sure how much you can optimize without breaking the
interfaces.
-Daniel

> 
> Marek
> 
> 
> 
> On Thu, Jun 17, 2021 at 12:48 PM Daniel Vetter  wrote:
> 
> > On Mon, Jun 14, 2021 at 07:13:00PM +0200, Christian König wrote:
> > > As long as we can figure out who touched a certain sync object last,
> > > that would indeed work, yes.
> >
> > Don't you need to know who will touch it next, i.e. who is holding up your
> > fence? Or maybe I'm just again totally confused.
> > -Daniel
> >
> > >
> > > Christian.
> > >
> > > On 14.06.21 at 19:10, Marek Olšák wrote:
> > > > The call to the hw scheduler has a limitation on the size of all
> > > > parameters combined. I think we can only pass a 32-bit sequence number
> > > > and a ~16-bit global (per-GPU) syncobj handle in one call and not much
> > > > else.
> > > >
> > > > The syncobj handle can be an element index in a global (per-GPU) syncobj
> > > > table and it's read only for all processes with the exception of the
> > > > signal command. Syncobjs can either have per VMID write access flags for
> > > > the signal command (slow), or any process can write to any syncobjs and
> > > > only rely on the kernel checking the write log (fast).
> > > >
> > > > In any case, we can execute the memory write in the queue engine and
> > > > only use the hw scheduler for logging, which would be perfect.
> > > >
> > > > Marek
> > > >
> > > > On Thu, Jun 10, 2021 at 12:33 PM Christian König wrote:
> > > >
> > > > Hi guys,
> > > >
> > > > maybe soften that a bit. Reading from the shared memory of the
> > > > user fence is ok for everybody. What we need to take more care of
> > > > is the writing side.
> > > >
> > > > So my current thinking is that we allow read only access, but
> > > > writing a new sequence value needs to go through the scheduler/kernel.
> > > >
> > > > So when the CPU wants to signal a timeline fence it needs to call
> > > > an IOCTL. When the GPU wants to signal the timeline fence it needs
> > > > to hand that off to the hardware scheduler.
> > > >
> > > > If we lock up, the kernel can check with the hardware who did the
> > > > last write and what value was written.
> > > >
> > > > That together with an IOCTL to give out sequence number for
> > > > implicit sync to applications should be sufficient for the kernel
> > > > to track who is responsible if something bad happens.
> > > >
> > > > In other words when the hardware says that the shader wrote stuff
> > > > like 0xdeadbeef 0x0 or 0x into memory we kill the process
> > > > who did that.
> > > >
> > > > If the hardware says that seq - 1 was written fine, but seq is
> > > > missing then the kernel blames whoever was supposed to write seq.
> > > >
> > > > Just piping the write through a privileged instance should be
> > > > fine to make sure that we don't run into issues.
> > > >
> > > > Christian.
> > > >
> > > > On 10.06.21 at 17:59, Marek Olšák wrote:
> > > > > Hi Daniel,
> > > > >
> > > > > We just talked about this whole topic internally and we came
> > > > > to the conclusion that the hardware needs to understand sync
> > > > > object handles and have high-level wait and signal operations in
> > > > > the command stream. Sync objects will be backed by memory, but
> > > > > they won't be readable or writable by processes directly. The
> > > > > hardware will log all accesses to sync objects and will send the
> > > > > log to the kernel periodically. The kernel will identify
> > > > > malicious behavior.
> > > > >
> > > > > Example of a hardware command stream:
> > > > > ...
> > > > > ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence
> > > > > number is assigned by the kernel
> > > > > Draw();
> > > > > ImplicitSyncSignalWhenDone(syncObjHandle);
> > > > > ...
> > > > >
> > > > > I'm afraid we have no other choice because of the TLB
> > > > > invalidation overhead.
> > > > >
> > > > > Marek
> > > > >
> > > > >
> > > > > On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter  > > > > 

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-17 Thread Marek Olšák
The kernel will know who should touch the implicit-sync semaphore next, and
at the same time, the copy of all write requests to the implicit-sync
semaphore will be forwarded to the kernel for monitoring and bo_wait.

Syncobjs could either use the same monitored access as implicit sync or be
completely unmonitored. We haven't decided yet.

Syncfiles could either use one of the above or wait for a syncobj to go
idle before converting to a syncfile.

Marek



On Thu, Jun 17, 2021 at 12:48 PM Daniel Vetter  wrote:

> On Mon, Jun 14, 2021 at 07:13:00PM +0200, Christian König wrote:
> > As long as we can figure out who touched a certain sync object last, that
> > would indeed work, yes.
>
> Don't you need to know who will touch it next, i.e. who is holding up your
> fence? Or maybe I'm just again totally confused.
> -Daniel
>
> >
> > Christian.
> >
> > On 14.06.21 at 19:10, Marek Olšák wrote:
> > > The call to the hw scheduler has a limitation on the size of all
> > > parameters combined. I think we can only pass a 32-bit sequence number
> > > and a ~16-bit global (per-GPU) syncobj handle in one call and not much
> > > else.
> > >
> > > The syncobj handle can be an element index in a global (per-GPU) syncobj
> > > table and it's read only for all processes with the exception of the
> > > signal command. Syncobjs can either have per VMID write access flags for
> > > the signal command (slow), or any process can write to any syncobjs and
> > > only rely on the kernel checking the write log (fast).
> > >
> > > In any case, we can execute the memory write in the queue engine and
> > > only use the hw scheduler for logging, which would be perfect.
> > >
> > > Marek
> > >
> > > On Thu, Jun 10, 2021 at 12:33 PM Christian König wrote:
> > >
> > > Hi guys,
> > >
> > > maybe soften that a bit. Reading from the shared memory of the
> > > user fence is ok for everybody. What we need to take more care of
> > > is the writing side.
> > >
> > > So my current thinking is that we allow read only access, but
> > > writing a new sequence value needs to go through the scheduler/kernel.
> > >
> > > So when the CPU wants to signal a timeline fence it needs to call
> > > an IOCTL. When the GPU wants to signal the timeline fence it needs
> > > to hand that off to the hardware scheduler.
> > >
> > > If we lock up, the kernel can check with the hardware who did the
> > > last write and what value was written.
> > >
> > > That together with an IOCTL to give out sequence number for
> > > implicit sync to applications should be sufficient for the kernel
> > > to track who is responsible if something bad happens.
> > >
> > > In other words when the hardware says that the shader wrote stuff
> > > like 0xdeadbeef 0x0 or 0x into memory we kill the process
> > > who did that.
> > >
> > > If the hardware says that seq - 1 was written fine, but seq is
> > > missing then the kernel blames whoever was supposed to write seq.
> > >
> > > Just piping the write through a privileged instance should be
> > > fine to make sure that we don't run into issues.
> > >
> > > Christian.
> > >
> > > On 10.06.21 at 17:59, Marek Olšák wrote:
> > > > Hi Daniel,
> > > >
> > > > We just talked about this whole topic internally and we came
> > > > to the conclusion that the hardware needs to understand sync
> > > > object handles and have high-level wait and signal operations in
> > > > the command stream. Sync objects will be backed by memory, but
> > > > they won't be readable or writable by processes directly. The
> > > > hardware will log all accesses to sync objects and will send the
> > > > log to the kernel periodically. The kernel will identify
> > > > malicious behavior.
> > > >
> > > > Example of a hardware command stream:
> > > > ...
> > > > ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence
> > > > number is assigned by the kernel
> > > > Draw();
> > > > ImplicitSyncSignalWhenDone(syncObjHandle);
> > > > ...
> > > >
> > > > I'm afraid we have no other choice because of the TLB
> > > > invalidation overhead.
> > > >
> > > > Marek
> > > >
> > > >
> > > > On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter wrote:
> > > >
> > > > On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
> > > > > On 09.06.21 at 15:19, Daniel Vetter wrote:
> > > > > > [SNIP]
> > > > > > > Yeah, we call this the lightweight and the heavyweight
> > > > > > > tlb flush.
> > > > > > >
> > > > > > > The lightweight can be used when you are sure that you
> > > > > > > don't have any of the PTEs currently in flight in the 3D/DMA
> > > > > > > engine and you just need to invalidate the TLB.
> > > 

Re: [Mesa-dev] [Intel-gfx] [RFC PATCH 2/2] drm/doc/rfc: i915 new parallel submission uAPI plan

2021-06-17 Thread Matthew Brost
On Thu, Jun 17, 2021 at 06:46:48PM +0200, Daniel Vetter wrote:
> Sorry I'm behind on mails  ...
> 

Aren't we all.

> On Fri, Jun 11, 2021 at 12:50:29PM -0700, Matthew Brost wrote:
> > On Fri, Jun 04, 2021 at 07:59:05PM +0200, Daniel Vetter wrote:
> > > On Wed, May 26, 2021 at 04:33:57PM -0700, Matthew Brost wrote:
> > > > Add entry for i915 new parallel submission uAPI plan.
> > > > 
> > > > v2:
> > > >  (Daniel Vetter):
> > > >   - Expand logical order explaination
> > > >   - Add dummy header
> > > >   - Only allow N BBs in execbuf IOCTL
> > > >   - Configure parallel submission per slot not per gem context
> > > > v3:
> > > >  (Marcin Ślusarz):
> > > >   - Lot's of typos / bad english fixed
> > > >  (Tvrtko Ursulin):
> > > >   - Consistent pseudo code, clean up wording in descriptions
> > > > 
> > > > Cc: Tvrtko Ursulin 
> > > > Cc: Tony Ye 
> > > > CC: Carl Zhang 
> > > > Cc: Daniel Vetter 
> > > > Cc: Jason Ekstrand 
> > > > Signed-off-by: Matthew Brost 
> > > > ---
> > > >  Documentation/gpu/rfc/i915_parallel_execbuf.h | 145 ++
> > > >  Documentation/gpu/rfc/i915_scheduler.rst  |  55 ++-
> > > >  2 files changed, 198 insertions(+), 2 deletions(-)
> > > >  create mode 100644 Documentation/gpu/rfc/i915_parallel_execbuf.h
> > > > 
> > > > diff --git a/Documentation/gpu/rfc/i915_parallel_execbuf.h 
> > > > b/Documentation/gpu/rfc/i915_parallel_execbuf.h
> > > > new file mode 100644
> > > > index ..20de206e3ab4
> > > > --- /dev/null
> > > > +++ b/Documentation/gpu/rfc/i915_parallel_execbuf.h
> > > > @@ -0,0 +1,145 @@
> > > > +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see 
> > > > i915_context_engines_parallel_submit */
> > > > +
> > > > +/*
> > > > + * i915_context_engines_parallel_submit:
> > > 
> > > So the idea is to make these kerneldoc and pull them into the rfc section.
> > > Then when we merge, move them to the real uapi section, like what Matt has
> > > done for lmem.
> > > 
> > 
> > Yep, will fix in next rev.
> > 
> > > > + *
> > > > + * Setup a slot in the context engine map to allow multiple BBs to be 
> > > > submitted
> > > > + * in a single execbuf IOCTL. Those BBs will then be scheduled to run 
> > > > on the GPU
> > > > + * in parallel. Multiple hardware contexts are created internally in 
> > > > the i915
> > > > + * run these BBs. Once a slot is configured for N BBs only N BBs can be
> > > > + * submitted in each execbuf IOCTL and this is implicit behavior e.g. 
> > > > The user
> > > > + * doesn't tell the execbuf IOCTL there are N BBs, the execbuf IOCTL 
> > > > know how
> > > > + * many BBs there are based on the slots configuration. The N BBs are 
> > > > the last N
> > > > + * buffer objects for first N if I915_EXEC_BATCH_FIRST is set.
> > > 
> > > s/for/or/
> > > 
> > > > + *
> > > > + * There are two currently defined ways to control the placement of the
> > > > + * hardware contexts on physical engines: default behavior (no flags) 
> > > > and
> > > > + * I915_PARALLEL_IMPLICIT_BONDS (a flag). More flags may be added the 
> > > > in the
> > > > + * future as new hardware / use cases arise. Details of how to use this
> > > > + * interface above the flags field in this structure.
> > > > + *
> > > > + * Returns -EINVAL if hardware context placement configuration is 
> > > > invalid or if
> > > > + * the placement configuration isn't supported on the platform / 
> > > > submission
> > > > + * interface.
> > > > + * Returns -ENODEV if extension isn't supported on the platform / 
> > > > submission
> > > > + * inteface.
> > > > + */
> > > > +struct i915_context_engines_parallel_submit {
> > > > +   struct i915_user_extension base;
> > > > +
> > > > +   __u16 engine_index; /* slot for parallel engine */
> > > 
> > > Kernel doc here for the inline comments too.
> > >
> > 
> > Yep.
> >  
> > > > +   __u16 width;/* number of contexts per parallel 
> > > > engine */
> > > > +   __u16 num_siblings; /* number of siblings per context */
> > > > +   __u16 mbz16;
> > > > +/*
> > > > + * Default placement behavior (currently unsupported):
> > > > + *
> > > > + * Allow BBs to be placed on any available engine instance. In this 
> > > > case each
> > > > + * context's engine mask indicates where that context can be placed. 
> > > > It is
> > > > + * implied in this mode that all contexts have mutual exclusive 
> > > > placement.
> > > > + * e.g. If one context is running CSX[0] no other contexts can run on 
> > > > CSX[0]).
> > > > + *
> > > > + * Example 1 pseudo code:
> > > > + * CSX,Y[N] = generic engine class X or Y, logical instance N
> > > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > > + * set_engines(INVALID)
> > > > + * set_parallel(engine_index=0, width=2, num_siblings=2,
> > > > + * engines=CSX[0],CSX[1],CSY[0],CSY[1])
> > > > + *
> > > > + * Results in the following valid placements:
> > > > + * CSX[0], CSY[0]
> > > > + * CSX[0], CSY[1]
> > > > + * 

Re: [Mesa-dev] [Intel-gfx] [PATCH 0/2] GuC submission / DRM scheduler integration plan + new uAPI

2021-06-17 Thread Daniel Vetter
On Fri, Jun 11, 2021 at 04:40:42PM -0700, Matthew Brost wrote:
> Subject and patches say it all.
> 
> v2: Address comments, patches have details of changes
> v3: Address comments, patches have details of changes
> v4: Address comments, patches have details of changes
> 
> Signed-off-by: Matthew Brost 

Imo ready (well overdue) for merging, please annoy Carl or someone from
media for an ack and then ask John or Daniele to merge it into
drm-intel-gt-next.
-Daniel

> 
> Matthew Brost (2):
>   drm/doc/rfc: i915 GuC submission / DRM scheduler
>   drm/doc/rfc: i915 new parallel submission uAPI plan
> 
>  Documentation/gpu/rfc/i915_parallel_execbuf.h | 117 ++
>  Documentation/gpu/rfc/i915_scheduler.rst  | 148 ++
>  Documentation/gpu/rfc/index.rst   |   4 +
>  3 files changed, 269 insertions(+)
>  create mode 100644 Documentation/gpu/rfc/i915_parallel_execbuf.h
>  create mode 100644 Documentation/gpu/rfc/i915_scheduler.rst
> 
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-17 Thread Daniel Vetter
On Mon, Jun 14, 2021 at 07:13:00PM +0200, Christian König wrote:
> As long as we can figure out who touched a certain sync object last, that
> would indeed work, yes.

Don't you need to know who will touch it next, i.e. who is holding up your
fence? Or maybe I'm just again totally confused.
-Daniel

> 
> Christian.
> 
> On 14.06.21 at 19:10, Marek Olšák wrote:
> > The call to the hw scheduler has a limitation on the size of all
> > parameters combined. I think we can only pass a 32-bit sequence number
> > and a ~16-bit global (per-GPU) syncobj handle in one call and not much
> > else.
> > 
> > The syncobj handle can be an element index in a global (per-GPU) syncobj
> > table and it's read only for all processes with the exception of the
> > signal command. Syncobjs can either have per VMID write access flags for
> > the signal command (slow), or any process can write to any syncobjs and
> > only rely on the kernel checking the write log (fast).
> > 
> > In any case, we can execute the memory write in the queue engine and
> > only use the hw scheduler for logging, which would be perfect.
> > 
> > Marek
> > 
> > On Thu, Jun 10, 2021 at 12:33 PM Christian König wrote:
> > 
> > Hi guys,
> > 
> > maybe soften that a bit. Reading from the shared memory of the
> > user fence is ok for everybody. What we need to take more care of
> > is the writing side.
> > 
> > So my current thinking is that we allow read only access, but
> > writing a new sequence value needs to go through the scheduler/kernel.
> > 
> > So when the CPU wants to signal a timeline fence it needs to call
> > an IOCTL. When the GPU wants to signal the timeline fence it needs
> > to hand that off to the hardware scheduler.
> >
> > If we lock up, the kernel can check with the hardware who did the
> > last write and what value was written.
> > 
> > That together with an IOCTL to give out sequence number for
> > implicit sync to applications should be sufficient for the kernel
> > to track who is responsible if something bad happens.
> > 
> > In other words when the hardware says that the shader wrote stuff
> > like 0xdeadbeef 0x0 or 0x into memory we kill the process
> > who did that.
> > 
> > If the hardware says that seq - 1 was written fine, but seq is
> > missing then the kernel blames whoever was supposed to write seq.
> > 
> > Just piping the write through a privileged instance should be
> > fine to make sure that we don't run into issues.
> > 
> > Christian.
> > 
> > On 10.06.21 at 17:59, Marek Olšák wrote:
> > > Hi Daniel,
> > > 
> > > We just talked about this whole topic internally and we came
> > > to the conclusion that the hardware needs to understand sync
> > > object handles and have high-level wait and signal operations in
> > > the command stream. Sync objects will be backed by memory, but
> > > they won't be readable or writable by processes directly. The
> > > hardware will log all accesses to sync objects and will send the
> > > log to the kernel periodically. The kernel will identify
> > > malicious behavior.
> > > 
> > > Example of a hardware command stream:
> > > ...
> > > ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence
> > > number is assigned by the kernel
> > > Draw();
> > > ImplicitSyncSignalWhenDone(syncObjHandle);
> > > ...
> > > 
> > > I'm afraid we have no other choice because of the TLB
> > > invalidation overhead.
> > > 
> > > Marek
> > > 
> > > 
> > > On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter wrote:
> > > 
> > > On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
> > > > On 09.06.21 at 15:19, Daniel Vetter wrote:
> > > > > [SNIP]
> > > > > > Yeah, we call this the lightweight and the heavyweight
> > > > > > tlb flush.
> > > > > >
> > > > > > The lightweight can be used when you are sure that you
> > > > > > don't have any of the PTEs currently in flight in the
> > > > > > 3D/DMA engine and you just need to invalidate the TLB.
> > > > > >
> > > > > > The heavyweight must be used when you need to
> > > > > > invalidate the TLB *AND* make sure that no concurrent
> > > > > > operation moves new stuff into the TLB.
> > > > > >
> > > > > > The problem is for this use case we have to use the
> > > > > > heavyweight one.
> > > > > Just for my own curiosity: So the lightweight flush is
> > > > > only for in-between CS when you know access is idle? Or
> > > > > does that also not work if userspace has a CS on a dma
> > > > > engine going at the same time because the tlb aren't
> > > > > isolated enough between 

Re: [Mesa-dev] [Intel-gfx] [RFC PATCH 2/2] drm/doc/rfc: i915 new parallel submission uAPI plan

2021-06-17 Thread Daniel Vetter
Sorry I'm behind on mails  ...

On Fri, Jun 11, 2021 at 12:50:29PM -0700, Matthew Brost wrote:
> On Fri, Jun 04, 2021 at 07:59:05PM +0200, Daniel Vetter wrote:
> > On Wed, May 26, 2021 at 04:33:57PM -0700, Matthew Brost wrote:
> > > Add entry for i915 new parallel submission uAPI plan.
> > > 
> > > v2:
> > >  (Daniel Vetter):
> > >   - Expand logical order explaination
> > >   - Add dummy header
> > >   - Only allow N BBs in execbuf IOCTL
> > >   - Configure parallel submission per slot not per gem context
> > > v3:
> > >  (Marcin Ślusarz):
> > >   - Lot's of typos / bad english fixed
> > >  (Tvrtko Ursulin):
> > >   - Consistent pseudo code, clean up wording in descriptions
> > > 
> > > Cc: Tvrtko Ursulin 
> > > Cc: Tony Ye 
> > > CC: Carl Zhang 
> > > Cc: Daniel Vetter 
> > > Cc: Jason Ekstrand 
> > > Signed-off-by: Matthew Brost 
> > > ---
> > >  Documentation/gpu/rfc/i915_parallel_execbuf.h | 145 ++
> > >  Documentation/gpu/rfc/i915_scheduler.rst  |  55 ++-
> > >  2 files changed, 198 insertions(+), 2 deletions(-)
> > >  create mode 100644 Documentation/gpu/rfc/i915_parallel_execbuf.h
> > > 
> > > diff --git a/Documentation/gpu/rfc/i915_parallel_execbuf.h 
> > > b/Documentation/gpu/rfc/i915_parallel_execbuf.h
> > > new file mode 100644
> > > index ..20de206e3ab4
> > > --- /dev/null
> > > +++ b/Documentation/gpu/rfc/i915_parallel_execbuf.h
> > > @@ -0,0 +1,145 @@
> > > +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see 
> > > i915_context_engines_parallel_submit */
> > > +
> > > +/*
> > > + * i915_context_engines_parallel_submit:
> > 
> > So the idea is to make these kerneldoc and pull them into the rfc section.
> > Then when we merge, move them to the real uapi section, like what Matt has
> > done for lmem.
> > 
> 
> Yep, will fix in next rev.
> 
> > > + *
> > > + * Setup a slot in the context engine map to allow multiple BBs to be 
> > > submitted
> > > + * in a single execbuf IOCTL. Those BBs will then be scheduled to run on 
> > > the GPU
> > > + * in parallel. Multiple hardware contexts are created internally in the 
> > > i915
> > > + * run these BBs. Once a slot is configured for N BBs only N BBs can be
> > > + * submitted in each execbuf IOCTL and this is implicit behavior e.g. 
> > > The user
> > > + * doesn't tell the execbuf IOCTL there are N BBs, the execbuf IOCTL 
> > > know how
> > > + * many BBs there are based on the slots configuration. The N BBs are 
> > > the last N
> > > + * buffer objects for first N if I915_EXEC_BATCH_FIRST is set.
> > 
> > s/for/or/
> > 
> > > + *
> > > + * There are two currently defined ways to control the placement of the
> > > + * hardware contexts on physical engines: default behavior (no flags) and
> > > + * I915_PARALLEL_IMPLICIT_BONDS (a flag). More flags may be added the in 
> > > the
> > > + * future as new hardware / use cases arise. Details of how to use this
> > > + * interface above the flags field in this structure.
> > > + *
> > > + * Returns -EINVAL if hardware context placement configuration is 
> > > invalid or if
> > > + * the placement configuration isn't supported on the platform / 
> > > submission
> > > + * interface.
> > > + * Returns -ENODEV if extension isn't supported on the platform / 
> > > submission
> > > + * inteface.
> > > + */
> > > +struct i915_context_engines_parallel_submit {
> > > + struct i915_user_extension base;
> > > +
> > > + __u16 engine_index; /* slot for parallel engine */
> > 
> > Kernel doc here for the inline comments too.
> >
> 
> Yep.
>  
> > > + __u16 width;/* number of contexts per parallel engine */
> > > + __u16 num_siblings; /* number of siblings per context */
> > > + __u16 mbz16;
> > > +/*
> > > + * Default placement behavior (currently unsupported):
> > > + *
> > > + * Allow BBs to be placed on any available engine instance. In this case 
> > > each
> > > + * context's engine mask indicates where that context can be placed. It 
> > > is
> > > + * implied in this mode that all contexts have mutual exclusive 
> > > placement.
> > > + * e.g. If one context is running CSX[0] no other contexts can run on 
> > > CSX[0]).
> > > + *
> > > + * Example 1 pseudo code:
> > > + * CSX,Y[N] = generic engine class X or Y, logical instance N
> > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > + * set_engines(INVALID)
> > > + * set_parallel(engine_index=0, width=2, num_siblings=2,
> > > + *   engines=CSX[0],CSX[1],CSY[0],CSY[1])
> > > + *
> > > + * Results in the following valid placements:
> > > + * CSX[0], CSY[0]
> > > + * CSX[0], CSY[1]
> > > + * CSX[1], CSY[0]
> > > + * CSX[1], CSY[1]
> > > + *
> > > + * This can also be thought of as 2 virtual engines described by 2-D 
> > > array in
> > > + * the engines the field:
> > > + * VE[0] = CSX[0], CSX[1]
> > > + * VE[1] = CSY[0], CSY[1]
> > > + *
> > > + * Example 2 pseudo code:
> > > + * CSX[Y] = generic engine of same class X, logical instance N
> > > + * INVALID = 
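
The quoted header is cut off by the archive at this point. For orientation, a
rough sketch of how userspace might fill in the extension, using only the
fields quoted above (the trailing engines[] array and the exact uapi header
location are assumptions based on what was later merged):

#include <string.h>
#include <drm/i915_drm.h>

/*
 * Hypothetical helper mirroring "Example 1" above: configure engine-map
 * slot 0 for 2-wide parallel submission with 2 siblings per context.
 * The width * num_siblings i915_engine_class_instance entries (CSX[0],
 * CSX[1], CSY[0], CSY[1] in the example's notation) would follow in the
 * extension's trailing engines[] array.
 */
static void setup_parallel_slot(struct i915_context_engines_parallel_submit *ext)
{
	memset(ext, 0, sizeof(*ext));
	ext->base.name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT;
	ext->engine_index = 0;	/* slot for the parallel engine */
	ext->width = 2;		/* BBs / contexts per execbuf */
	ext->num_siblings = 2;	/* possible placements per context */
}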

Re: [Mesa-dev] [PATCH 0/6] dma-buf: Add an API for exporting sync files (v12)

2021-06-17 Thread Christian König

On 16.06.21 at 20:30, Jason Ekstrand wrote:
> On Tue, Jun 15, 2021 at 3:41 AM Christian König wrote:
> >
> > Hi Jason & Daniel,
> >
> > maybe I should explain once more where the problem with this approach is
> > and why I think we need to get that fixed before we can do something
> > like this here.
> >
> > To summarize what this patch here does is that it copies the exclusive
> > fence and/or the shared fences into a sync_file. This alone is totally
> > unproblematic.
> >
> > The problem is what this implies. When you need to copy the exclusive
> > fence to a sync_file then this means that the driver is at some point
> > ignoring the exclusive fence on a buffer object.
>
> Not necessarily.  Part of the point of this is to allow for CPU waits
> on a past point in buffers timeline.  Today, we have poll() and
> GEM_WAIT both of which wait for the buffer to be idle from whatever
> GPU work is currently happening.  We want to wait on something in the
> past and ignore anything happening now.

Good point, yes that is indeed a valid use case.

> But, to the broader point, maybe?  I'm a little fuzzy on exactly where
> i915 inserts and/or depends on fences.
>
> > When you combine that with complex drivers which use TTM and buffer
> > moves underneath you can construct an information leak using this and
> > give userspace access to memory which is allocated to the driver, but
> > not yet initialized.
> >
> > This way you can leak things like page tables, passwords, kernel data
> > etc... in large amounts to userspace and is an absolutely no-go for
> > security.
>
> Ugh...  Unfortunately, I'm really out of my depth on the implications
> going on here but I think I see your point.
>
> > That's why I said we need to get this fixed before we upstream this
> > patch set here and especially the driver change which is using that.
>
> Well, i915 has had uAPI for a while to ignore fences.

Yeah, exactly that's illegal.

At least the kernel internal fences like moving or clearing a buffer
object need to be taken into account before a driver is allowed to
access a buffer.

Otherwise we have an information leak worth a CVE and that is certainly
not something we want.

> Those changes are years in the past.  If we have a real problem here (not sure
> on that yet), then we'll have to figure out how to fix it without nuking
> uAPI.

Well, that was the basic idea of attaching flags to the fences in the
dma_resv object.

In other words you clearly denote when you have to wait for a fence
before accessing a buffer or you cause a security issue.

Christian.

> --Jason
>
> > Regards,
> > Christian.

On 10.06.21 at 23:09, Jason Ekstrand wrote:

Modern userspace APIs like Vulkan are built on an explicit
synchronization model.  This doesn't always play nicely with the
implicit synchronization used in the kernel and assumed by X11 and
Wayland.  The client -> compositor half of the synchronization isn't too
bad, at least on intel, because we can control whether or not i915
synchronizes on the buffer and whether or not it's considered written.

The harder part is the compositor -> client synchronization when we get
the buffer back from the compositor.  We're required to be able to
provide the client with a VkSemaphore and VkFence representing the point
in time where the window system (compositor and/or display) finished
using the buffer.  With current APIs, it's very hard to do this in such
a way that we don't get confused by the Vulkan driver's access of the
buffer.  In particular, once we tell the kernel that we're rendering to
the buffer again, any CPU waits on the buffer or GPU dependencies will
wait on some of the client rendering and not just the compositor.

This new IOCTL solves this problem by allowing us to get a snapshot of
the implicit synchronization state of a given dma-buf in the form of a
sync file.  It's effectively the same as a poll() or I915_GEM_WAIT only,
instead of CPU waiting directly, it encapsulates the wait operation, at
the current moment in time, in a sync_file so we can check/wait on it
later.  As long as the Vulkan driver does the sync_file export from the
dma-buf before we re-introduce it for rendering, it will only contain
fences from the compositor or display.  This allows us to accurately turn
it into a VkFence or VkSemaphore without any over-synchronization.

This patch series actually contains two new ioctls.  There is the export
one mentioned above as well as an RFC for an import ioctl which provides
the other half.  The intention is to land the export ioctl since it seems
like there's no real disagreement on that one.  The import ioctl, however,
has a lot of debate around it so it's intended to be RFC-only for now.

Mesa MR: 

Re: [Mesa-dev] [RFC] Concrete proposal to split classic

2021-06-17 Thread Alyssa Rosenzweig
> By far the limiting factor for i915g progress now that I've got some
> CI rigged up is review.  My preference would be that we all agree that
> nobody wants to look at i915, and some responsible folks (ajax and a
> couple Intel volunteers, perhaps?) bless me to merge without review
> once an i915g-only MR has been up for a week.

Congratulations, you're the new i915g maintainer ;-)