On Thursday 17 March 2005 19:43, Timothy Miller wrote:
> Well, let's think for a moment about what we put into the ring
> buffer. Let's say most packets are 64 bits:  A 40-bit address, 8-bit
> command, and 16-bit length.

Perfect.  The 40-bit address gives 42 bits of physical address range 
because of 4-byte granularity, I presume.  So that's 4 terabytes, 
which ought to do for now ;)
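To make the arithmetic concrete, here is a sketch of how such a 64-bit packet might be packed.  The bit positions are purely illustrative (nothing in this thread fixes the field order); the point is that an 8-bit command, a 16-bit length, and a 40-bit word address fit exactly in one 64-bit word, and that a 40-bit word address at 4-byte granularity spans 2^42 bytes = 4 TB:

```c
#include <stdint.h>

/* Hypothetical field layout, for illustration only:
 *   bits 63..56  command  (8 bits)
 *   bits 55..40  length   (16 bits)
 *   bits 39..0   word address (40 bits, 4-byte units,
 *                so 2^42 bytes = 4 TB of byte-address range)
 */
static inline uint64_t pack_packet(uint8_t cmd, uint16_t len,
                                   uint64_t word_addr)
{
    return ((uint64_t)cmd << 56) |
           ((uint64_t)len << 40) |
           (word_addr & ((1ULL << 40) - 1));
}

/* The byte address the packet refers to. */
static inline uint64_t packet_byte_addr(uint64_t pkt)
{
    return (pkt & ((1ULL << 40) - 1)) << 2;  /* 4-byte granularity */
}
```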

> If you use PIO, that's an immediate PCI burst of two words and
> nothing else.
>
> If you use DMA, and you update the write pointer for every access,

We won't.  We will take care to write the driver so it updates the write 
pointer as few times as possible.

> then that's one PCI transaction for each packet, which is only one
> cycle shorter than the PIO.  But then you have the added latency of
> the GPU taking the bus and doing a read, which won't help you much.

But it provides inherent synchronization, which is a very big deal.  
With your PIO suggestion, you would have to provide an internal DMA 
queue and a hand-rolled means of synchronizing with the host.

> If you use DMA but you only periodically update the write pointer,
> then there's much less host CPU involvement, but there's considerable
> added latency for cases when you have few small packets with long
> times between them.

But in that case we don't care... because it's just a few small packets 
with long times between them!

> We also need to consider, in all cases, what happens when you try to
> do a PIO while DMA is going on:  You wait for, like, 16+ bus cycles
> just to get in one transaction.

My suggestion is to not implement PIO at all, except for the basic card 
control commands.  I do not think PIO as an alternate means of issuing 
commands adds any useful functionality.

I also suggest that PIO commands, direct DMA commands and indirect DMA 
commands should not overlap.  At this point, I haven't seen any 
examples at all of where overlap makes sense.

> Another possibility is to have the write pointer hang out in host
> memory and have the GPU poll it periodically. That eliminates bus 
> overhead entirely from the kernel but does introduce some amount of
> latency.  The advantage is that the write pointer is never passed
> over the bus when it doesn't need to be (it only happens when the GPU
> realizes that it can't do anything else useful).

This isn't a problem.  When the DRM issues a command list or a texture 
ioctl, the kernel driver will:

  1) Parse it into individual 4K DMA regions
  2) Load/lock the pages
  3) Load all the DMA commands into the ring buffer (if they fit)
  4) Update the write pointer just once

(If they don't all fit, the remainder will be loaded when the next 
buffer low interrupt arrives.)

> The simplest approach is to use PIO to push DMA commands into a
> queue. But that has the latency issue when a DMA transaction is
> already going on.  The fastest approach is the one where the host
> does absolutely no PIO at all and it's the GPU's job to poll the
> write pointer and update the read pointer at convenient times.

I doubt the GPU needs to poll.  Each of those ring buffer commands is 
going to take quite some time to execute, and they will almost always 
be submitted in batches.  When they aren't, I don't think we care.

Regards,

Daniel
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)