On Mon, 21 Mar 2005 22:32:54 -0500, Daniel Phillips <[EMAIL PROTECTED]> wrote:
> On Monday 21 March 2005 18:50, Nicolai Haehnle wrote:
> > On Monday 21 March 2005 19:51, I  wrote:
> > > The DRI socket's job would just be to listen for connections, then
> > > create the real socket and hand it to the client.
> > >
> > > We can probably manage to set up a per-client socket connection
> > > within the existing DRI framework, and so be able to offer a driver
> > > variant that works without upgrading X/DRI, for what it's worth.  I
> > > haven't tried this, and I still haven't looked at a lot of DRI
> > > code, so I can't swear it will work.
> >
> > Stop right there. You have just blown this whole thing up an order of
> > magnitude in complexity without a good reason.
> 
> Why do you think that using a socket is complex?  (In my experience, it
> is not.)

Simple, conceptually.  Not simple in how it functions.

> 
> > What exactly do you want to achieve? I thought you just wanted a way
> > for the kernel to notify userspace when some event happened in the
> > GPU program. Most of this can be done using the classic ioctl model
> > (think "wait_for_xyz" ioctls like all drivers already use) or shared
> > memory or a combination of both.
> 
> Wait in the kernel so the task can't do anything else?  Kind of crude...

Not crude.  Polite.  When a process is waiting on the GPU, it
shouldn't busy-wait.  It should sleep.  Yield the CPU until the GPU is
available.  That's how it SHOULD be.

> > The only situation where this *isn't* enough is if we ever find the
> > need for a fully event-based model, because in that situation we need
> > to poll() or select() on multiple event sources - where one of them
> > is the DRI file. But we can easily extend the current DRI file to be
> > a file that can be waited on by userspace. No need to go crazy with
> > sockets here...
> 
> Right, you know about poll, but you think using it is complex.

It is.  Much, MUCH more.  What we need is the lowest-overhead means to
block while waiting on a condition or event.  ioctl is that means.

> > For the record, I don't think such a fully event-based model is even
> > needed for an OpenGL implementation, unless we come up with some
> > really fancy new extensions.
> 
> Of course, nothing forces you to use it in an event style.  You can just
> write to and read from the socket, blocking in the kernel just like a
> blocking ioctl, only without sucking as much because the read/write
> interface is cleaner (note that ioctls do copy to/from user just like
> read/write).

ioctl does copy when you give it a pointer.  That's under the control
of your driver.

Why do you use the term 'socket' anyhow?  What do you mean?  Are you
talking about read/write calls for the driver?  That's not a socket in
the usual sense.

> If you want to implement a fully asynchronous model, which I believe we
> do, the blocking ioctl interface just doesn't cut it.  You suggest:

That's a little different but still best done by ioctl.  You don't
really quite want the behavior of read/write.  Read/write would have
to copy data.  What you want to do is pass a pointer to a DMA buffer
to be submitted to the ring buffer.  You can keep doing that with an
ioctl, but if you run out of pages in your buffer, you're going to
have to sleep, and that's another ioctl.

Zero-copy.

>   * DRM allocates DMA buffer B
>      * application draws via DRM into B until full
>      * DRM submits B and waits for completion
>      * <engine is idle here>
>      * DRM wakes up and returns control to application
>      * repeat
> 
> If the drawing task takes some time to wake up, you may see a noticeable
> stall, and the card bandwidth isn't fully used.

Fine.

> Now consider:
> 
>   * DRM allocates two DMA buffers, A and B
>   * application draws via DRM into A until full
>   * DRM submits A
>      * draw into B via DRM until full
>      * DRM submits B and waits for socket data
>      * DRM wakes up and receives completion for A
>      * draw into A via DRM until full
>      * DRM submits A and waits for socket data
>      * DRM wakes up and receives completion for B
>      * repeat

The approach is right; we just don't want the copy overhead of what
you're calling a 'socket', which it really isn't.  You're referring to
read/write calls on a character or block device.  And that's OKAY,
except for the copies they require.

> With this interface style, the drawing pipeline is never idle.  This
> goodness is realized even though the drawing task is inherently
> linear - we haven't even done anything fancy with poll yet.

The only differences between write and ioctl are that write has copy
overhead, and that with ioctl the app decides when it should sleep,
instead of the kernel.

> > > Ideally, the DMA engine would advance its head pointer only after a
> > > drawing operation has completely cleared the pipeline.  But perhaps
> > > that is too hard to implement in hardware.  A reasonable
> > > alternative is to flush the pipeline before recovering a resource
> > > that is known to be in use.
> >
> > If the head pointer could be advanced in that way, that would indeed
> > really be helpful.
> >
> > Alternatively, we could have a special "tag" command that can be
> > inserted into the command stream that updates an "age" register and
> > optionally writes that age value back to RAM. We could then have
> > sequences like this in the command stream:
> 
> Interestingly, the suggestion I posted a little earlier is quite
> similar.
> 
> > Now for the state register discussion:
> > > My assumption is, the kernel driver keeps all the necessary state
> > > on behalf of each client.  The client _updates_ the state in
> > > process context, the kernel driver accesses the state via kernel
> > > address.
> >
> > And this is exactly the problem. How are state changes submitted by
> > the client? The direct path would be to write all
> > (non-safety-critical state changes) directly into an indirect DMA
> > buffer.
> 
> Nothing says the client can't write its state to the card via indirect
> DMA state commands and also to kernel memory.
> 
> For efficiency, the client would only supply the hardware state deltas.
> The kernel associates the deltas with the buffer, and there is also a
> cumulative state buffer for the client.  As each indirect buffer
> completes, the kernel applies the associated deltas to the cumulative
> state.  To switch contexts, the kernel compares two hardware contexts
> and submits the differences as state commands via the command ring
> buffer.
> 
> Does this sound complex?  It is, a little.  But it avoids having to read
> state from the hardware and it costs only a few bytes of state deltas
> on each buffer submission.

Your description needs refinement (it's not 'hardware' enough), but
it's essentially correct.

> > There are basically four solutions:
> > 1. Parse the indirect DMA buffers in the kernel. This is obviously
> > slow and therefore a bad idea.
> 
> Agreed, rejected.
> 
> > 2. Allow userspace to point to sections in indirect DMA buffers that
> > need to be rerun in order to restore state information.
> 
> Hmm.

This is closer to my suggested approach.

> > 3. Force userspace to emit state-resetting commands at the beginning
> > of all indirect DMA buffers. These reset commands are skipped (by
> > providing an offset pointer) if the graphics context hasn't been
> > changed between indirect buffers. (I actually did something
> > comparable in the experimental R300 driver)
> 
> I don't get this one.

It's the same as #2, except that the location of the "rerun" stuff is inline.

> > 4. Don't emit state-setting commands in the indirect DMA buffer, but
> > in the meta-ring buffer. The kernel parses the commands, stores them
> > in an internal structure and forwards them to the card accordingly.
> 
> Rejected I think, on the grounds it will break up the indirect DMA
> stream too much.

This comes back to my suggestion.  There is state information stored
in a client shared memory buffer.  The kernel compares that to the
last state and generates a command packet to the GPU that switches to
the desired state.  This state would be consistent with the state of
the GPU before the last context switch.  Thence, the app can submit
command packets as though it has exclusive access to the GPU (which is
what you want when you virtualize hardware like that).

> 5. Reconstruct some of the GL state, e.g., GL_BLEND, from the hardware
> registers.  Not appealing because of slowness and possible extra
> hardware support.

Something akin to a shadow of the engine state is stored in an
application memory buffer instead.

> 6. At the time of submitting an indirect command buffer, the rendering
> task supplies all the GL state deltas to the kernel.  This is my
> current proposal.

Well, there's "current state", which can be turned into a delta based
on comparing it to the state left behind by another process, and then
there are rendering commands which are also deltas.

> 
> > It's not that there aren't any solutions to this problem. It's just
> > that it is far from obvious to me what the right solution is.
> 
> Same here.  Now, are we going to allow a context switch in the middle of
> processing an indirect DMA buffer, or only between indirect buffers?
> The latter is considerably easier, but we are then at the mercy of the
> client to submit reasonably granular indirect buffers.

Well, if the client can only submit up to 4K at a time, is that so
bad?  Certainly, that could be 4K of really huge copies, which would
be bad.

> Supporting context switch on partially completed indirect buffers also
> implies doing something about the command ring buffer, which starts to
> get messy.  I vote we just do the easy thing for now, while working out
> a reasonable approach to finer granularity context switching as a
> background project.

My opinion is that you'll need hardware support for any finer
granularity than that.  You make a PIO to a register that indicates
the index of the state to switch to.  The GPU automatically saves the
current state (totally in the middle of whatever it's doing), and then
loads the requested state.  This is what you need for true real-time.

That's the "right" way to do it, but you're not going to get it in
this version of the chip.  :)

> 
> To go whole hog, we would save/restore the state of the whole pipeline,
> including temporary registers like the x and y registers and in-flight
> interpolants.  This would allow smooth-as-silk parallel drawing, but is
> clearly too ambitious for a first rev.  It might not require a whole
> lot of additional hardware support though.

Exactly.
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
