On Monday 21 March 2005 14:50, Timothy Miller wrote:
> On Mon, 21 Mar 2005 13:51:31 -0500, Daniel Phillips wrote:
> > This fd is currently just a generic character device and not a
> > socket, so DRI would have to be patched along with adding our
> > driver to the X tree, which isn't necessarily a bad idea.  I can't
> > think of any compatibility problem with changing the DRI character
> > device(s) to a socket.  The quick test is to try it and see if
> > anything breaks.
> >
> > The DRI socket's job would just be to listen for connections, then
> > create the real socket and hand it to the client.
> >
> > We can probably manage to set up a per-client socket connection
> > within the existing DRI framework, and so be able to offer a driver
> > variant that works without upgrading X/DRI, for what it's worth.  I
> > haven't tried this, and I still haven't looked at a lot of DRI
> > code, so I can't swear it will work.
>
> I'm not sure I like the idea of using the read/write interface.
> That'll most likely involve extra copies.

This is only for low volume traffic that has to be synchronized and 
checked for validity.  Most (99% or more) of the traffic will go 
through indirect DMA.

> Rather, I prefer to use 
> shared memory pages, and the GL app sends pointers via an ioctl
> interface.  Zero-copy.

That was one of the ideas I looked at.  The tricky part is making it 
event-driven.  With sockets, we already have a nice mechanism: poll.  
We could get way more creative, but it wouldn't necessarily be better.

> > ...This requires some clever register updating mechanism, which
> > Timothy partly described earlier: there is a queue of register
> > update values that runs in parallel with the pipeline and there is
> > some means of deciding when to pull a value out of the queue into a
> > particular register.  This still leaves a bunch of questions, like:
> > how exactly are the queue values tagged, and what do you do when
> > more than one register needs to be updated at the same pipeline
> > stage?
>
> Since I've done this before, I can answer the question directly....
>
> While it's perfectly valid to look at it the way you've described it,
> I tend to think of the register writes going THROUGH the pipeline. 
> Of course, being hardware, it's also parallel to the pipeline; it's
> just that there's synchronization logic.
>
> As for numbering, that's easy too:  Each pipeline stage has a number,
> and each register within that stage has a number.  N bits for the stage
> number, M bits for the register number, gives you an ADDRESS that is
> M+N bits long.

It's a reasonable addressing scheme.  I had imagined you could have a 
comparator attached to each register, which recognizes a particular tag 
(register number) and loads the register.  So the pipeline stage number 
doesn't matter.

Anyway, that still leaves the question: what about when you have to load 
multiple registers at the same stage?  For example, 16 parameter 
increments have to be loaded all at the same stage, and further down, 
two register base addresses plus (if you accept my theories on accurate 
calculation) four base values for S1, T1, S2 and T2.  Some trick is 
needed.

I know you've already sorted this out, but it's fun to think about it 
anyway.  I imagined you could either stall the pipeline for a few 
clocks until a whole set of parameters arrive, or you could send some 
of the parameters down a few clocks early, and unload them into 
temporary registers until the whole set arrives.

> > For non-DMA control of the card, I suggested emulating the DMA
> > stream with an auto-incrementing PIO register.  This is an inverted
> > way of looking at the problem: instead of seeing the world in terms
> > of PIO registers with the DMA interface built on top of them, you
> > see the world in terms of the DMA stream, or instruction stream if
> > you like, and provide a PIO interface capable of emulating the DMA
> > stream for any special situations where DMA isn't possible (though
> > I still haven't seen a clear explanation of how this might come
> > up).
>
> Regardless of the fate of the registers appearing in the PIO space,
> what you describe is a totally valid and potentially very useful
> alternative.  Add it to the list!

Great, and I'll shut up about the PIO interface, which I perceive as a 
debugging interface.  (Oops, was that me not shutting up about it? ;-)

As a debugging interface, it will obviously give a lot of feedback very 
quickly.  Do we have a way to stop/single step the whole pipeline while 
we look at the registers?

> Normally, you'd have the DMA engine feed fifo A, which is read by
> translation logic that then feeds fifo B, which is read by the
> engine. The PIO method you describe gives PIO access to fifo A.  If
> we also allow PIO access directly to registers by number, then that
> would be PIO access to fifo B.

This sounds clean and simple.

> > The corollary is, except perhaps for debugging, there is no reason
> > to provide read access to the internal pipeline registers.  Which
> > ought to save a little hardware and certainly will save some
> > documentation.  It also removes the need to devise a register
> > numbering scheme, which is replaced by a queue tagging scheme that
> > can be whatever is convenient, and can even change from rev to rev,
> > transparently to the driver.
>
> I don't see why the register numbers and register tags can't be
> exactly the same numbers.

They are the same numbers; the driver just doesn't care what they are, 
because they are purely an internal detail as I see it.  Where my "tag" 
terminology comes from: register updates in the pipeline are "tagged 
with the register number".

> > > And even if it is feasible, the video memory management issues
> > > still remain. They can obviously be reduced by allowing each GPU
> > > program to lock memory in place so that it will not be moved by
> > > the video memory manager until a certain part of the program has
> > > been executed.
> >
> > Ideally, the DMA engine would advance its head pointer only after a
> > drawing operation has completely cleared the pipeline.  But perhaps
> > that is too hard to implement in hardware.  A reasonable
> > alternative is to flush the pipeline before recovering a resource
> > that is known to be in use.
>
> I agree with this approach.  If you have to shuffle memory, you have
> to bite the bullet and do it serially with rendering.

So which is it: do we flush the pipeline to be sure a resource is 
released?  A half-formed alternative idea: we could have a command to 
delay advancing the ring buffer head.  When that command arrives, we 
send a special tag down the pipeline, and when it pops out the other 
end, the DMA head is updated.  The driver sees that and carries on with 
releasing the resource.  A barrier instruction of sorts.

> > > However, this probably isn't too big a problem: This data
> > > shouldn't be too much, so we could just allocate a special page
> > > for it in memory.
> >
> > Indeed.  Even if it is a lot, that's still ok.
>
> Someone want to count?  :)

The GL state isn't more than a page or two per client, I would think.  
Resource tables will be bulkier, but it's hard to say much until we've 
established the model.

There isn't really a practical limit here; the kernel driver can grab as 
many pages as it needs.

Regards,

Daniel
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)