On Monday 21 March 2005 18:50, Nicolai Haehnle wrote:
> On Monday 21 March 2005 19:51, I wrote:
> > The DRI socket's job would just be to listen for connections, then
> > create the real socket and hand it to the client.
> >
> > We can probably manage to set up a per-client socket connection
> > within the existing DRI framework, and so be able to offer a driver
> > variant that works without upgrading X/DRI, for what it's worth.  I
> > haven't tried this, and I still haven't looked at a lot of DRI
> > code, so I can't swear it will work.
>
> Stop right there. You have just blown this whole thing up an order of
> magnitude in complexity without a good reason.

Why do you think that using a socket is complex?  (In my experience, it 
is not.)

> What exactly do you want to achieve? I thought you just wanted a way
> for the kernel to notify userspace when some event happened in the
> GPU program. Most of this can be done using the classic ioctl model
> (think "wait_for_xyz" ioctls like all drivers already use) or shared
> memory or a combination of both.

Wait in the kernel so the task can't do anything else?  Kind of crude...

> The only situation where this *isn't* enough is if we ever find the
> need for a fully event-based model, because in that situation we need
> to poll() or select() on multiple event sources - where one of them
> is the DRI file. But we can easily extend the current DRI file to be
> a file that can be waited on by userspace. No need to go crazy with
> sockets here...

Right, you know about poll, but you think using it is complex.

> For the record, I don't think such a fully event-based model is even
> needed for an OpenGL implementation, unless we come up with some
> really fancy new extensions.

Of course, nothing forces you to use it in an event style.  You can just 
write to and read from the socket, blocking in the kernel just like a 
blocking ioctl, only without sucking as much because the read/write 
interface is cleaner (note that ioctls do copy to/from user just like 
read/write).

If you want to implement a fully asynchronous model, which I believe we 
do, the blocking ioctl interface just doesn't cut it.  You suggest:

  * DRM allocates DMA buffer B
     * application draws via DRM into B until full
     * DRM submits B and waits for completion
     * <engine is idle here>
     * DRM wakes up and returns control to application
     * repeat

If the drawing task takes some time to wake up, you may see a noticeable 
stall, and the card bandwidth isn't fully used.

Now consider:

  * DRM allocates two DMA buffers, A and B
  * application draws via DRM into A until full
  * DRM submits A
     * application draws via DRM into B until full
     * DRM submits B and waits for socket data
     * DRM wakes up and receives completion for A
     * application draws via DRM into A until full
     * DRM submits A and waits for socket data
     * DRM wakes up and receives completion for B
     * repeat

With this interface style, the drawing pipeline is never idle.  This 
goodness is realized even though the drawing task is inherently
linear - we haven't even done anything fancy with poll yet.

> > Ideally, the DMA engine would advance its head pointer only after a
> > drawing operation has completely cleared the pipeline.  But perhaps
> > that is too hard to implement in hardware.  A reasonable
> > alternative is to flush the pipeline before recovering a resource
> > that is known to be in use.
>
> If the head pointer could be advanced in that way, that would indeed
> really be helpful.
>
> Alternatively, we could have a special "tag" command that can be
> inserted into the command stream that updates an "age" register and
> optionally writes that age value back to RAM. We could then have
> sequences like this in the command stream:

Interestingly, the suggestion I posted a little earlier is quite 
similar.

> Now for the state register discussion:
> > My assumption is, the kernel driver keeps all the necessary state
> > on behalf of each client.  The client _updates_ the state in
> > process context, the kernel driver accesses the state via kernel
> > address.
>
> And this is exactly the problem. How are state changes submitted by
> the client? The direct path would be to write all
> (non-safety-critical state changes) directly into an indirect DMA
> buffer.

Nothing says the client can't write its state to the card via indirect 
DMA state commands and also to kernel memory.

For efficiency, the client would only supply the hardware state deltas.  
The kernel associates the deltas with the buffer, and there is also a 
cumulative state buffer for the client.  As each indirect buffer 
completes, the kernel applies the associated deltas to the cumulative 
state.  To switch contexts, the kernel compares two hardware contexts 
and submits the differences as state commands via the command ring 
buffer.

Does this sound complex?  It is, a little.  But it avoids having to read 
state from the hardware and it costs only a few bytes of state deltas 
on each buffer submission.

> There are basically four solutions:
> 1. Parse the indirect DMA buffers in the kernel. This is obviously
> slow and therefore a bad idea.

Agreed, rejected.

> 2. Allow userspace to point to sections in indirect DMA buffers that
> need to be rerun in order to restore state information.

Hmm.

> 3. Force userspace to emit state-resetting commands at the beginning
> of all indirect DMA buffers. These reset commands are skipped (by
> providing an offset pointer) if the graphics context hasn't been
> changed between indirect buffers. (I actually did something
> comparable in the experimental R300 driver)

I don't get this one.

> 4. Don't emit state-setting commands in the indirect DMA buffer, but
> in the meta-ring buffer. The kernel parses the commands, stores them
> in an internal structure and forwards them to the card accordingly.

Rejected I think, on the grounds it will break up the indirect DMA 
stream too much.

5. Reconstruct some of the GL state, e.g., GL_BLEND, from the hardware 
registers.  Not appealing because it would be slow and might require 
extra hardware support.

6. At the time of submitting an indirect command buffer, the rendering 
task supplies all the GL state deltas to the kernel.  This is my 
current proposal.

> It's not that there aren't any solutions to this problem. It's just
> that it is far from obvious to me what the right solution is.

Same here.  Now, are we going to allow a context switch in the middle of 
processing an indirect DMA buffer, or only between indirect buffers?  
The latter is considerably easier, but we are then at the mercy of the 
client to submit reasonably granular indirect buffers.

Supporting context switch on partially completed indirect buffers also 
implies doing something about the command ring buffer, which starts to 
get messy.  I vote we just do the easy thing for now, while working out 
a reasonable approach to finer granularity context switching as a 
background project.

To go whole hog, we would save/restore the state of the whole pipeline, 
including temporary registers like the x and y registers and in-flight 
interpolants.  This would allow smooth-as-silk parallel drawing, but is 
clearly too ambitious for a first rev.  It might not require a whole 
lot of additional hardware support though.

Regards,

Daniel
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
