Timothy Normand Miller wrote:
Today, I'm leading a round-table discussion at OSU regarding Intel's
Larrabee architecture. I thought that perhaps people on this list
might be interested in engaging in a separate discussion. Larrabee is
a multicore processor that has several in-order x86 cores enhanced
with special vector processing units, specialized cache architecture,
and other things that optimize it for graphics. Most things that OGA
will do in dedicated hardware, they do in software, with the exception
of texture filtering, which is just too slow to do in software. This
paper points out a number of things that are relevant even to our
fixed-function design, such as avoiding wasted bandwidth caused by
over-draw. But even more, it covers a lot of issues we'll have to
deal with should we ever decide to do a programmable GPU.

IIRC, I mentioned this idea a while ago: a graphics board based on
several CPUs or DSPs, with a display controller to drive the display.

IIUC from the Wikipedia article:
http://en.wikipedia.org/wiki/Larrabee_(GPU)
Intel's website:
http://www.intel.com/technology/visual/microarch.htm
and the paper, the Intel chip does not use standard CPUs. (I also read a
trade-press article that had the architecture somewhat confused, so the
paper is the best source.) It appears that each CPU has a 16-wide vector
processing unit that will handle 64-bit floats, as compared with SSE,
which has 128-bit registers that can be partitioned into different
widths for different-size data objects. It appears to me that if this is
only for display, 64-bit float is overkill. 32-bit is more than
sufficient, and I wonder whether 16-bit (half-precision float or
integer) would be sufficient.

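To make the width comparison concrete, here is a small sketch of how many elements fit in one vector register at each element size. Only the 128-bit SSE register width comes from the discussion above; the function name and the list of element sizes are my own, purely for illustration.

```python
def lanes(register_bits, element_bits):
    """Number of elements of a given size that fit in one vector register."""
    if register_bits % element_bits != 0:
        raise ValueError("element size must divide the register width")
    return register_bits // element_bits

SSE_BITS = 128  # width of an SSE register

# Partitioning a fixed 128-bit register into different element widths.
for element_bits in (64, 32, 16):
    count = lanes(SSE_BITS, element_bits)
    print(f"{element_bits}-bit elements per SSE register: {count}")

# Going the other way: register width implied by 16 lanes of each element size.
for element_bits in (64, 32, 16):
    print(f"16 lanes of {element_bits}-bit elements need {16 * element_bits} bits")
```

For a fixed register width, halving the element size doubles the lane count, which is why the choice between 64-, 32-, and 16-bit data matters so much for throughput.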
It appears to me that the major advantage is the large vector
processors, which can handle a 4x4 matrix multiply with a single data
load (though probably with multiple instructions or in microcode). So,
what we are really talking about here is not what I mentioned
previously, but rather multiple 16-wide vector processors, each with a
CPU to control it. It is clearly the number of MAC operations per clock
that is important, no matter how you accomplish it.

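As a rough illustration of why MACs per clock is the figure of merit, the sketch below counts the scalar multiply-accumulates in a 4x4 matrix multiply and divides by the 16 lanes mentioned above. It assumes an idealized unit that keeps every lane busy on every step and ignores the shuffles needed to line up operands; it is not a model of how Larrabee actually schedules the work.

```python
LANES = 16  # vector width from the discussion above

def matmul4_count_macs(a, b):
    """Multiply two 4x4 matrices (lists of rows), counting scalar MACs."""
    c = [[0.0] * 4 for _ in range(4)]
    macs = 0
    for i in range(4):
        for j in range(4):
            for k in range(4):
                c[i][j] += a[i][k] * b[k][j]  # one multiply-accumulate
                macs += 1
    return c, macs

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
m = [[float(4 * i + j) for j in range(4)] for i in range(4)]

product, macs = matmul4_count_macs(m, identity)
vector_steps = macs // LANES  # idealized: every lane busy on every step
print(macs, vector_steps)  # prints "64 4"
```

So a full 4x4 by 4x4 multiply is 64 MACs, or four steps on an ideal 16-wide unit. The 16 elements of one matrix fit in a single 16-element load.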
I noticed that it didn't say how much multiplication hardware each CPU has.

As we have discussed, having this much hardware available means that it
is often wasted since the 4x4 matrix multiplies are not often used when
executing GLSL, but they do exist.

I wonder if it would be practical to have multiple chips and achieve the
same thing? This would make an extendable and upgradeable board.

TI has announced a new DSP, the TMS320C6748, that does both fixed- and
floating-point operations:
http://focus.ti.com/docs/prod/folders/print/tms320c6748.html
so it could handle both pixel operations and geometry operations. It
has only a 16-bit memory interface, but it has 128K of internal SRAM
and, IIUC, DSPs normally work on internal memory and use the internal
DMA controller to move data in and out. It will do two 32-bit x 32-bit
float multiplies (32-bit result) per clock at 300 MHz.
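
For a sense of scale, here is a back-of-the-envelope sketch using only the figures quoted above (two multiplies per clock, 300 MHz, 128K of internal SRAM). It is a peak estimate: it assumes the multipliers never stall, ignores additions and DMA time, and takes "128K" to mean 128 * 1024 bytes. The function names are mine.

```python
MULTIPLIES_PER_CLOCK = 2    # figure quoted above
CLOCK_HZ = 300_000_000      # 300 MHz, as quoted above
SRAM_BYTES = 128 * 1024     # assumption: "128K" means 128 KiB

def multiplies_per_second(per_clock=MULTIPLIES_PER_CLOCK, clock_hz=CLOCK_HZ):
    """Peak multiply rate if both multipliers are busy every clock."""
    return per_clock * clock_hz

def operations_per_second(multiplies_each):
    """Peak rate of an operation that needs a fixed number of multiplies."""
    return multiplies_per_second() / multiplies_each

VERTEX_TRANSFORM_MULS = 16  # 4x4 matrix times a 4-element vector
MATMUL4_MULS = 64           # 4x4 matrix times 4x4 matrix

print("multiplies/s:        ", multiplies_per_second())
print("vertex transforms/s: ", operations_per_second(VERTEX_TRANSFORM_MULS))
print("4x4 matrix products/s:", operations_per_second(MATMUL4_MULS))
print("32-bit words in SRAM:", SRAM_BYTES // 4)
```

That works out to a peak of 600 million multiplies per second, or 37.5 million vertex transforms per second per chip before any memory traffic is counted.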
--
James Tyrer
Linux (mostly) From Scratch
_______________________________________________
Open-graphics mailing list
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)