Timothy Normand Miller wrote:
Today, I'm leading a round-table discussion at OSU regarding Intel's
Larrabee architecture.  I thought that perhaps people on this list
might be interested in engaging in a separate discussion.  Larrabee is
a multicore processor that has several in-order x86 cores enhanced
with special vector processing units, specialized cache architecture,
and other things that optimize it for graphics.  Most things that OGA
will do in dedicated hardware, they do in software, with the exception
of texture filtering, which is just too slow to do in software.  This
paper points out a number of things that are relevant even to our
fixed-function design, such as avoiding wasted bandwidth caused by
over-draw.  But even more, it covers a lot of issues we'll have to
deal with should we ever decide to do a programmable GPU.

IIRC, I mentioned this idea a while ago: a graphics board based on several CPUs or DSPs, with a display controller to drive the display.

IIUC from the Wikipedia article:

http://en.wikipedia.org/wiki/Larrabee_(GPU)

and Intel's website,

http://www.intel.com/technology/visual/microarch.htm

and the paper. I also read a trade paper article that had the architecture somewhat confused, so the paper is the best source.

The Intel chip does not use standard CPUs. It appears that each CPU has a 16-wide vector processing unit that will handle 64-bit floats, as compared with SSE, which has 128 bits that can be partitioned into different widths for different-size data objects. It appears to me that if this is only for display, 64-bit float is overkill. 32-bit is more than sufficient, and I wonder if 16-bit, for half precision or integer, would be sufficient.
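To make the "partitioned into different widths" point concrete, here is a minimal sketch of the lane arithmetic. The only figure taken from the discussion is the 128-bit SSE register; the helper name lanes() is mine, purely for illustration.

```python
def lanes(register_bits, element_bits):
    """Number of elements of a given width that fit in one vector register."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits


SSE_BITS = 128  # SSE register width, as stated above

for element_bits in (64, 32, 16):
    print(element_bits, "bit elements:", lanes(SSE_BITS, element_bits), "lanes")
```

By the same arithmetic, sixteen lanes of 32-bit data would need a 512-bit register, and sixteen lanes of 64-bit data would need 1024 bits.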

It appears to me that it is the large vector processors, which can handle a 4x4 matrix multiply with a single data load (but probably multiple instructions or microcode), that are the major advantage. So what we are really talking about here is not what I mentioned previously, but rather multiple 16-wide vector processors, each with a CPU to control it. Clearly it is the number of MAC operations per clock that is important, no matter how you accomplish it.
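As a rough illustration of why a 16-wide unit suits a 4x4 matrix multiply, here is a sketch that models the multiply as four 16-lane multiply-accumulate steps, one lane per output element. This is only my own model of the idea, not Intel's instruction sequence; the name matmul4x4_vector_steps and the lane layout are assumptions.

```python
LANES = 16  # assumed vector width: one lane per element of the 4x4 result


def matmul4x4_vector_steps(a, b):
    """Multiply two 4x4 matrices using four 16-lane multiply-accumulate steps.

    Returns the 4x4 product and the number of vector MAC steps issued.
    """
    acc = [0.0] * LANES  # lane i*4+j accumulates c[i][j]
    steps = 0
    for k in range(4):
        # Lane (i, j) multiplies a[i][k] by b[k][j]; in hardware all 16
        # lanes would do this in parallel as a single vector MAC.
        lhs = [a[i][k] for i in range(4) for j in range(4)]
        rhs = [b[k][j] for i in range(4) for j in range(4)]
        acc = [c + x * y for c, x, y in zip(acc, lhs, rhs)]
        steps += 1
    product = [acc[i * 4:(i + 1) * 4] for i in range(4)]
    return product, steps
```

Four steps of 16 lanes is 64 scalar MACs, which is exactly what a 4x4 by 4x4 multiply needs, so in this model no lane sits idle.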

I noticed that it didn't say how much multiplication hardware each CPU has.

As we have discussed, having this much hardware available means that it is often wasted, since 4x4 matrix multiplies are not used often when executing GLSL, though they do exist.

I wonder if it would be practical to have multiple chips and achieve the same thing? This would make an extendable and upgradeable board.

TI has announced a new DSP, the TMS320C6748, that does both fixed- and floating-point operations:

http://focus.ti.com/docs/prod/folders/print/tms320c6748.html

that could handle both pixel operations and geometry operations. It has only a 16-bit memory interface, but it has 128K of internal SRAM and, IIUC, DSPs normally work on internal memory and use the internal DMA controller to move data in and out. It will do two 32-bit x 32-bit floating-point multiplies (32-bit result) per clock at 300MHz.
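For a feel of what those numbers mean, here is a back-of-the-envelope sketch. It uses only the two multiplies per clock and 300MHz quoted above, and it assumes a vertex transform is a 4x4 matrix times a 4-vector (16 multiplies). It ignores the adds, memory traffic and DMA overhead, so treat the result as an upper bound, not a benchmark. The function names are mine.

```python
def peak_multiplies_per_second(mults_per_clock, clock_hz):
    """Peak multiply rate, assuming every cycle issues all multipliers."""
    return mults_per_clock * clock_hz


def peak_transforms_per_second(mults_per_clock, clock_hz, mults_per_transform=16):
    """Upper bound on 4x4 matrix * 4-vector transforms per second.

    Assumes 16 multiplies per transform and nothing else limiting throughput.
    """
    return peak_multiplies_per_second(mults_per_clock, clock_hz) / mults_per_transform


CLOCK_HZ = 300_000_000  # 300MHz, as quoted above
MULTS_PER_CLOCK = 2     # two float multiplies per clock, as quoted above

print(peak_multiplies_per_second(MULTS_PER_CLOCK, CLOCK_HZ))
print(peak_transforms_per_second(MULTS_PER_CLOCK, CLOCK_HZ))
```

That works out to 600 million multiplies per second per chip, or at most 37.5 million such transforms per second under these assumptions.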

--
James Tyrer

Linux (mostly) From Scratch
