One major thing I forgot here is to include the option to signal an
interrupt for privileged commands.


On 10/8/05, Attila Kinali <[EMAIL PROTECTED]> wrote:
> On Fri, 7 Oct 2005 17:41:23 -0400
> Timothy Miller <[EMAIL PROTECTED]> wrote:
>
>
> > There are two data sources for DMA packets.  One is a ring buffer.
> > There is only one ring buffer (at a time).  As its name suggests, it's
> > circular, and packets can wrap around one end to the other.
>
> I do not see the usefulness of packet wrap-around, since in
> software it is quite tedious to split packets at certain addresses.
> It's a lot more convenient to just leave the rest of the space unused
> and start from the beginning of the ring buffer.

It's a pain to deal with.  I had to write special macros that would
stick different numbers of words into the buffer efficiently.

But although drawing commands are allowed in this buffer, you're
really supposed to use the indirect buffer for drawing.  That makes
this less of a big deal.  If this is a real problem, I just need to
add a one-word NO-OP packet that you can use as filler.
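
A minimal sketch of the filler idea in C. Everything named here is an
assumption for illustration: the one-word NO-OP encoding (`NOP_WORD`,
set to zero following the type-0 suggestion later in this thread), the
`ring_t` bookkeeping structure, and the `ring_reserve` helper. Instead
of splitting a packet at the wrap point, the driver pads the end of the
ring with NO-OP words and starts the packet at offset 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-word NO-OP header; the real encoding is not defined yet. */
#define NOP_WORD 0x00000000u

typedef struct {
    uint32_t *base;  /* start of the ring in host memory */
    size_t    size;  /* ring size in 32-bit words */
    size_t    head;  /* read index, advanced as the hardware consumes words */
    size_t    tail;  /* write index, the driver's private copy */
} ring_t;

/* Words that can be written before the tail would catch up with the head.
 * One slot is kept empty so that head == tail always means "empty". */
static size_t ring_free(const ring_t *r)
{
    return (r->head + r->size - r->tail - 1) % r->size;
}

/* Reserve n contiguous words for one packet.  If the packet does not fit
 * between the tail and the end of the ring, the remainder is filled with
 * NO-OP words and the packet starts at offset 0.  Returns NULL when there
 * is not enough free space yet. */
static uint32_t *ring_reserve(ring_t *r, size_t n)
{
    assert(n > 0 && n < r->size);

    size_t to_end = r->size - r->tail;
    size_t pad = (n > to_end) ? to_end : 0;

    if (ring_free(r) < pad + n)
        return NULL;

    for (size_t i = 0; i < pad; i++)
        r->base[r->tail + i] = NOP_WORD;
    r->tail = (r->tail + pad) % r->size;

    uint32_t *packet = &r->base[r->tail];
    r->tail = (r->tail + n) % r->size;
    return packet;
}
```

A real driver would fill in the reserved words first and only then
publish the new tail to the hardware.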

> IMHO it wouldn't be a bad idea to have fixed-size packets, as these
> would not only make the hardware simpler, but also simplify
> driver programming (given that we do not lose functionality by
> restricting ourselves to a fixed size).

But then you have to pull extra words over the bus.  That's
inefficient, especially for PCI.  I think making them variable is an
advantage since you can minimize the data transfer.

> If you choose not to use fixed-size packets, then you definitely
> have to restrict the sizes to be divisible by some value
> (I'd say either 4 or 8 bytes) and force the sizes to be divisible
> by shifting them.
>
> > DMA from
> > the ring buffer is initiated by enabling it and having the head (read)
> > and tail (write) pointers differ.  The ring buffer has full
> > privileges, meaning that commands in it can be any kind of DMA
> > transaction.  The ring buffer would usually be managed by the kernel.
>
> s/kernel/driver/
> The driver does not necessarily have to be in the kernel.

Well, yes, but it's probably most efficient to have that driver
accessible via system calls to the kernel and to not have to send
messages to another process.  Most of the time, we're just instructing
the driver to load another indirect packet into the ring buffer, and
we want to minimize overhead.

>
> > A packet starts with a 32-bit header.  The upper 4 bits are the packet
> > type, plus some other info about the packet.  This is followed by data
> > words.
> >
> > My format for describing bits in a word is [number of bits:meaning].
> >
> >
> > These are the packet types:
> >
> > 0: Indirect DMA commands
> > This packet is allowed in the ring buffer but not an indirect buffer.
> > The packet contains two words:
> > [4:0][1:privileged][27:length]
> > [32:starting address]
> > When this packet is read from the ring buffer, the indirect sequencer
> > starts reading words from this address and continues reading until
> > length words have been read.  If the privileged bit is not set, the
> > particular indirect buffer is only allowed to contain rendering
> > commands.
>
> Where is the start address? In card memory or host memory?

These are engine commands.  Engine commands always come from the host,
and their target addresses are implicit.
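
To make the bit layout concrete, here is a rough C sketch that packs
and unpacks the type 0 header.  It assumes that the fields in the
[bits:meaning] notation are listed from the most significant bit down,
which matches "the upper 4 bits are the packet type".  The names
(`emit_indirect`, `pkt_type`, and so on) are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

enum { PKT_INDIRECT = 0 };

#define PKT_TYPE_SHIFT 28
#define PKT_PRIV_BIT   (1u << 27)
#define PKT_LEN27_MASK 0x07FFFFFFu

/* Build the two-word indirect DMA packet:
 *   word 0: [4:0][1:privileged][27:length]
 *   word 1: [32:starting address] (a host address) */
static void emit_indirect(uint32_t out[2], uint32_t host_addr,
                          uint32_t length_words, int privileged)
{
    assert(length_words <= PKT_LEN27_MASK);
    out[0] = ((uint32_t)PKT_INDIRECT << PKT_TYPE_SHIFT)
           | (privileged ? PKT_PRIV_BIT : 0u)
           | length_words;
    out[1] = host_addr;
}

/* Field accessors for a header word. */
static unsigned pkt_type(uint32_t header)
{
    return header >> PKT_TYPE_SHIFT;
}

static int pkt_privileged(uint32_t header)
{
    return (header & PKT_PRIV_BIT) != 0;
}

static uint32_t pkt_length27(uint32_t header)
{
    return header & PKT_LEN27_MASK;
}
```

With this layout, an unprivileged indirect buffer of 256 words encodes
as header 0x00000100, and the privileged version as 0x08000100.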

> >
> > 1: Memory download
> > [4:1][28:length]
> > [32:graphics address]
> > [32:host address]
> > This copies length 32-bit words from the graphics memory to the host
> > memory.  Format conversions will apply.
> >
> > 2: Memory upload
> > [4:2][28:length]
> > [32:graphics address]
> > [32:host address]
> > Copy length 32-bit words from host memory to graphics memory.  Format
> > conversions apply.
>
> I would use 64 bits for the host address here from the beginning.
> It will make the upgrade to 64-bit systems simpler.

You have a point.  But I think perhaps I should make it selectable, like this:

[4:2][1:64-bit host address][27:length]
...

Note that the PCI controller in this version doesn't support 64-bit
addresses.  The kernel developers have found all sorts of clever ways
to use the AMD IOMMU to deal with 32-bit devices in a 64-bit address
space, so I'm not too worried.  But reserving the bit isn't a bad idea.

> > 3: Engine upload indirect
> > [4:3][28:length]
> > [32:host address]
> > This pushes length 32-bit pixels through the drawing engine's
> > rectangular area upload mechanism.  This is to make it easier to do
> > putimages to windows.  Format conversions apply.  If you are uploading
> > 8-bit pixels, you're sending 32-bit aligned data, so you will have to
> > program the appropriate engine registers to trim the left and right
> > edges.  The data is treated largely like texture data.
>
> Here again, 64 bits for the host address.
> And where does the image end up?

It ends up wherever you programmed the GPU to put it.  It'll show up
as a register in a GPU unit that I haven't defined yet.  You program
some registers in that unit with some information about where you want
to draw.  When you write to this register (via PIO or this DMA
command), each word you write causes a pixel to be emitted down the
pipeline.  This way, you can use the GPU to control where you're
drawing.

Actually, I have a better idea:  We'll give the rasterizer a special
"single-step" state.  Rather than rasterizing automatically, it waits
for you to send it pixels, and it uses those pixels as though they
were the primary shade color.  Since it's single-step, the state can
be saved, modified, or restored at any time, unlike some GPUs that
lock up when you haven't sent them as much data as you said you would.

> Also, I would change this command to allow multi-plane formats,
> as today most software systems deliver images not interleaved
> but as three (or four) separate planes. But keep in mind here
> that there are subsampled planes that use 4:2:2 and 4:2:0
> (which are easy to deal with) and other, more obscure formats
> (which are not so easy to deal with). I would say, for practical
> purposes, limit subsampling to 4:2:2 and 4:2:0 and let the others
> be converted in software.

The host interface is going to have an elaborate format-conversion
facility.  Everything has to end up 32-bits internally, so you can put
this into a state where 8-bit pixels (packed) are converted to AAAA or
XAAA (where X is programmable) formats.  YUV and other such things are
options too.

> BTW maybe it would be an idea to have the conversion matrix for
> YCrCb -> RGB configurable, as there are IIRC 3 (slightly) different
> standards around. But this might add some additional logic as we
> would lose the advantage of constant value multipliers.

I found one of those matrices a while back and even wrote some Verilog
code for it.  And I think I posted it to the list.  Time to search the
archives.  :)
Oh, it wasn't configurable.  We'll have to see about that.
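
As a software illustration of what "configurable" would mean, here is a
small C sketch with the matrix coefficients held in a structure rather
than hard-wired.  It is not the Verilog mentioned above.  It assumes
full-range 8-bit input (Y in 0..255, Cb and Cr centered on 128) and
uses the BT.601 and BT.709 coefficients as two example presets; the
type and function names are invented for the example.

```c
#include <stdint.h>

/* Coefficient magnitudes in Q16 fixed point (value * 65536). */
typedef struct {
    int32_t cr_to_r;  /* R = Y + cr_to_r * (Cr - 128)                       */
    int32_t cb_to_g;  /* G = Y - cb_to_g * (Cb - 128) - cr_to_g * (Cr - 128) */
    int32_t cr_to_g;
    int32_t cb_to_b;  /* B = Y + cb_to_b * (Cb - 128)                       */
} ycc_matrix_t;

#define Q16(x) ((int32_t)((x) * 65536.0 + 0.5))

static const ycc_matrix_t YCC_BT601 = {
    Q16(1.402), Q16(0.344136), Q16(0.714136), Q16(1.772)
};
static const ycc_matrix_t YCC_BT709 = {
    Q16(1.5748), Q16(0.187324), Q16(0.468124), Q16(1.8556)
};

/* Convert a Q16 value to 8 bits, clamping to 0..255. */
static uint8_t q16_to_u8(int32_t q)
{
    if (q < 0)
        return 0;
    q >>= 16;
    return q > 255 ? 255 : (uint8_t)q;
}

/* Convert one pixel; returns 0x00RRGGBB. */
static uint32_t ycc_to_rgb(const ycc_matrix_t *m,
                           uint8_t y, uint8_t cb, uint8_t cr)
{
    int32_t yq  = ((int32_t)y << 16) + 0x8000;  /* + 0.5 for rounding */
    int32_t dcb = (int32_t)cb - 128;
    int32_t dcr = (int32_t)cr - 128;

    uint32_t r = q16_to_u8(yq + m->cr_to_r * dcr);
    uint32_t g = q16_to_u8(yq - m->cb_to_g * dcb - m->cr_to_g * dcr);
    uint32_t b = q16_to_u8(yq + m->cb_to_b * dcb);
    return (r << 16) | (g << 8) | b;
}
```

In hardware the cost shows up exactly where the quoted text says: with
a loadable matrix the four multiplies need general multipliers instead
of constant-coefficient ones.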

> > 4: Engine upload inline
> > [4:4][28:length]
> > length * [pixel words]
> > This header is followed by length pixels, inline in the packet.  This
> > is much like type 3 but is good for small images where it's more
> > efficient to push the data inline.
>
> Can you give me an example of when this would be more efficient?
> Otherwise I think that a small memory region with the data
> (which you need anyway, as the data has to come from somewhere)
> and a normal engine upload should be as efficient.

Engine upload is privileged, because you don't want user processes
specifying arbitrary addresses.  This is okay, since we can filter
them through the driver.  But this introduces driver overhead.  In
addition, it also introduces bus overhead.  When we encounter the
upload command, the sequencer has to terminate the current DMA
transaction and then start again with a new address.  The inline
version is unprivileged, and it allows the PCI burst to continue
uninterrupted.

This is really useful for very small images, but it can be good for
somewhat larger ones too.

Either way, you do have to copy to somewhere in your DMA buffer, but
your memory management for it might be simpler too, since you already
know where to put it.  You don't have to allocate another area and
then keep track of it and wait for it to get used.
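
For illustration, a short C sketch of emitting the inline variant
straight into a command buffer.  The helper name `emit_upload_inline`
is hypothetical, and the header layout assumes the type sits in the
upper 4 bits with the 28-bit pixel count below it, as in the type 4
description quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { PKT_UPLOAD_INLINE = 4 };

#define PKT_LEN28_MASK 0x0FFFFFFFu

/* Write a type 4 packet at dst: one header word [4:4][28:length]
 * followed by `count` 32-bit pixel words copied inline.
 * Returns the number of words written, so the caller can advance
 * its write position in the DMA buffer. */
static size_t emit_upload_inline(uint32_t *dst, const uint32_t *pixels,
                                 uint32_t count)
{
    assert(count <= PKT_LEN28_MASK);
    dst[0] = ((uint32_t)PKT_UPLOAD_INLINE << 28) | count;
    memcpy(&dst[1], pixels, (size_t)count * sizeof *pixels);
    return (size_t)count + 1;
}
```

There is no host address anywhere in the packet, which is why this
form can be left unprivileged, and no second buffer to allocate and
track.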

>
> > 5: Engine render
> > [4:5][28:register flags]
> > [16:height][16:Ystart]
> > N * [register data]
> > This command initiates a rendering operation.  28 of the most
> > commonly-used engine registers (mostly in the rasterizer) will be
> > selectable by this command, and they'll follow in order from the
> > right-most flag leftward.  The number of them depends on the number of
> > flag bits set.  A 29th register is implicitly required, which is the
> > register that contains the triangle height and starting Y coordinate.
> >
> > There are lots more engine registers, and they can be set by other
> > packet types.  These are likely to change somewhat, but here's the
> > general idea...
> >
> > 6: Engine registers 1
> > [4:6][28:flags]
> > N * [register data]
> > Some selected subset of the registers.
> >
> > 7: Engine registers 2
> > ...
>
> I cannot say anything about these, as I have no experience
> in rendering.
>
> > 14: Engine 2D stipple
> > [4:14][28:reserved]
> > 32 * [stipple bits]
> > The 2D block can overlay a 32x32 mono stipple pattern on whatever it
> > draws.  This packet uploads the stipple.  This is also useful for mono
> > text glyphs smaller than 32x32.
> >
> > 15: Engine 2D tile
> > [4:15][28:reserved]
> > 64 * [tile pixels]
> > Similarly, there is an 8x8 color tile that can be merged with pixel
> > data.  This uploads that tile.
>
> What are these for?

They're just to upload to some registers in the GPU.  And those images
are important for certain kinds of regular patterns, and Windows kinda
sorta requires them.
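
Going back to the flag-selected register packets quoted above (types 5
and 6), here is a rough C sketch of how a driver might build a type 5
render packet.  It assumes that "right-most flag" means bit 0, that
height occupies the upper 16 bits of the second word, and that the
driver keeps shadow copies of the 28 selectable registers in an array;
`emit_render` and the constant names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { PKT_RENDER = 5, RENDER_NUM_REGS = 28 };

/* Emit a render packet at dst:
 *   word 0: [4:5][28:register flags]
 *   word 1: [16:height][16:Ystart]
 *   then one data word per set flag, from bit 0 upward.
 * regs[i] holds the value for the register selected by flag bit i.
 * Returns the number of words written. */
static size_t emit_render(uint32_t *dst, uint32_t flags,
                          uint16_t height, uint16_t ystart,
                          const uint32_t regs[RENDER_NUM_REGS])
{
    assert((flags >> RENDER_NUM_REGS) == 0);

    size_t n = 0;
    dst[n++] = ((uint32_t)PKT_RENDER << 28) | flags;
    dst[n++] = ((uint32_t)height << 16) | ystart;

    for (int bit = 0; bit < RENDER_NUM_REGS; bit++) {
        if (flags & (1u << bit))
            dst[n++] = regs[bit];
    }
    return n;
}
```

The packet length is never stored explicitly; the hardware would
recover it by counting the set flag bits, so only the registers that
actually change cross the bus.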

>
> Also maybe it would be an idea, if we use non-uniform sized packets,
> to have a specialized packet to upload mouse pointer pixmap/masks.
> On the other hand these can be stored in the card memory
> and used by switching the pointing registers.

Changing the mouse pointer is VERY infrequent (relatively speaking). 
It's also something asynchronous to other drawing operations that you
want to happen immediately.  As such, I've decided it's going to be
all PIO.

> What is definitely missing is a NOP packet. And I would
> put this as type 0, so that zero-initialized memory would
> make valid packets.

Yeah, that's good.  I'll work on another version of this and then post it.

