As you know, the first prototype devices will be PCI-only.  This is
for the "productized prototype" version that will be the first
commercial product that comes out of this.  Other prototypes will be
made, but few people will find much use for an AGP-based FPGA
prototyping board.  For this product, all the usual video hardware
will be available, plus some DDR memory, headers, etc.  It is possible
that the OGP core won't be ready at that time (the prototype board is
a prerequisite for being able to debug the core), but the core will
be available for download anyhow.

TROZ, my first graphics ASIC, was a passive device.  It did not
support DMA, and so it had to be efficient at PIO reads.  Due to the
way the PCI state machine was designed (which is complicated to go
into), it required two address buses for PIO reads. 
One was for the "current" requested address, and the other was for the
"next" one.  A few parts of TROZ didn't pay attention to the "next"
address (for simplicity), and as a result, they were only able to
return data to the host every other clock cycle.  The interface to
graphics memory was, however, highly optimized with two levels of
caching and prefetch, making it able to return data via PIO at nearly
full bus bandwidth.

For OGP, we need utter simplicity.  Furthermore, for most cases, we
will get our efficiency not from accommodating PIO reads but from doing
as much as possible via DMA.  When the DMA controller is working, it
knows well in advance every address it's going to use, eliminating the
need for any weird queueing scheme like the one I used for TROZ.
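To make that concrete, here's a rough sketch in C (the names are all
made up; the real logic would be a state machine in the FPGA) of why a
DMA engine needs no address juggling: given just a base and a length,
every address in the transfer is computable up front, so the
controller can issue them back-to-back:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical descriptor: everything the engine needs is known
 * before the transfer starts, so every address is predictable. */
struct dma_descriptor {
    uint32_t base;   /* starting byte address, 4-byte aligned */
    uint32_t words;  /* number of 32-bit words to move */
};

/* Fill 'out' with the full address stream for a transfer.  Each
 * address is a pure function of (base, index), so there is no
 * "current"/"next" queueing to worry about. */
size_t dma_address_stream(const struct dma_descriptor *d,
                          uint32_t *out, size_t max)
{
    size_t n = d->words < max ? d->words : max;
    for (size_t i = 0; i < n; i++)
        out[i] = d->base + (uint32_t)(4 * i);
    return n;
}
```

PIO is the opposite case: each address arrives from the host one at a
time, which is why TROZ needed its two-bus trick and OGP would rather
just avoid the problem.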

The PCI controller is being designed to sit in a small FPGA or large
CPLD between the PCI bus and the large FPGA that will ultimately hold
the OGP rendering core and video output logic.

The ideal situation (for reads), like what we'll use in the rendering
core to fetch memory, is to have a queue for addresses that goes into
the core, and another queue of data that comes out of the core.  (For
writes, there's a single queue of address and data, and latency is a
non-issue.)  This way, the PCI controller can generate a stream of
addresses at full tilt while the core returns data at whatever speed
it can manage, and the inherent buffering of the queues will smooth
out some of the latency issues.
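Here's a minimal software sketch of that two-queue read path (C, with
invented names; in hardware these would be FIFOs in the FPGA).  The
PCI side pushes addresses into one queue as fast as it likes, and the
core drains them and pushes results into the data queue at its own
pace:

```c
#include <stdint.h>
#include <stdbool.h>

#define QDEPTH 8  /* power of two for cheap wrap-around */

/* A tiny ring-buffer FIFO, standing in for a hardware queue. */
struct fifo {
    uint32_t buf[QDEPTH];
    unsigned head, tail;  /* head: next write, tail: next read */
};

static bool fifo_empty(const struct fifo *f) { return f->head == f->tail; }
static bool fifo_full(const struct fifo *f)
{
    return ((f->head + 1) & (QDEPTH - 1)) == f->tail;
}
static bool fifo_push(struct fifo *f, uint32_t v)
{
    if (fifo_full(f)) return false;        /* producer must stall */
    f->buf[f->head] = v;
    f->head = (f->head + 1) & (QDEPTH - 1);
    return true;
}
static bool fifo_pop(struct fifo *f, uint32_t *v)
{
    if (fifo_empty(f)) return false;       /* consumer must wait */
    *v = f->buf[f->tail];
    f->tail = (f->tail + 1) & (QDEPTH - 1);
    return true;
}

/* One "clock" of the core: consume an address, produce data.  The
 * "memory" here is faked as addr ^ 0xdeadbeef just to have values. */
static void core_step(struct fifo *addr_q, struct fifo *data_q)
{
    uint32_t a;
    if (!fifo_full(data_q) && fifo_pop(addr_q, &a))
        fifo_push(data_q, a ^ 0xdeadbeefu);
}
```

The point is the decoupling: neither side blocks the other until a
queue actually fills or empties, which is exactly the latency
smoothing described above.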

I have thought about the idea of making the interface between PCI and
internals run at a higher speed.  At 2x, we'd be able to get a bit
more efficiency out of 33MHz PCI.  The problem is that I want
ultimately to have a unified state machine for PCI, PCI-X, and AGP,
where the fastest I expect to get out of the FPGA is 266MHz, which is
like AGP 4x.  There ain't gonna be any 2x'ing there.

There are some compromises we can make here for the sake of pin
savings.  For instance, with a 30-bit address bus (plus four write
enables), the upper bits of the address change infrequently.  We could
make the address bus 15 bits and have separate transactions for the
upper and lower bits.  This way, when streaming data, there's one
upper-half transaction followed by many lower-halves.  You'll get a
wasted cycle every 128K bytes to change the upper half.  This also
gives us some room for optimizing the state machine for higher speeds.
I'll get into the specifics of the handshaking as it progresses.
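The 128K figure is just arithmetic: 15 lower bits select one of 2^15
32-bit words, and 2^15 * 4 = 131072 bytes.  Here's a sketch of the
split (C, invented names), including a count of how often the upper
half would actually change while streaming sequentially:

```c
#include <stdint.h>

#define LOW_BITS 15u
#define LOW_MASK ((1u << LOW_BITS) - 1u)

/* Split a 30-bit word address into the two 15-bit halves that would
 * go over the narrow bus in separate transactions. */
static uint32_t addr_upper(uint32_t word_addr)
{
    return (word_addr >> LOW_BITS) & LOW_MASK;
}
static uint32_t addr_lower(uint32_t word_addr)
{
    return word_addr & LOW_MASK;
}

/* How many upper-half transactions a sequential stream of 'nwords'
 * words starting at word address 'start' needs: one per 2^15-word
 * (128K-byte) region touched. */
static unsigned upper_transactions(uint32_t start, uint32_t nwords)
{
    if (nwords == 0) return 0;
    uint32_t first = start >> LOW_BITS;
    uint32_t last  = (start + nwords - 1) >> LOW_BITS;
    return (unsigned)(last - first + 1);
}
```

So a long burst costs one upper-half cycle per 128K bytes crossed,
which is the wasted cycle mentioned above, and nothing more.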

Anyhow, these are just initial, incomplete thoughts.  I've been
working on other things up to this point, but with focus increasing on
the prototype board, I felt it was important to set priorities, and
the host interface came to the top of that list.

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
