Note: useless discussion moved to Open-Hardware list at the request of TM.

Nicolas Boulay wrote:
2007/12/15, James Richard Tyrer <[EMAIL PROTECTED]>:
Nicolas Boulay wrote:
2007/12/15, James Richard Tyrer <[EMAIL PROTECTED]>:
Nicolas Boulay wrote:
...
Reread my first post. Most (complex) shader code is _not_ vector code.
What part of 'a color pixel is a vector' don't you understand?

What part of real-world shader code examples don't you understand?

Does this real world shader code operate on RGB or RGBA pixels as data objects? If so, it is vector.

I'm not speaking in generic terms; I'm speaking about real-world examples of
current complex vertex shaders and pixel shaders.

Yes, I noticed that it did contain some scalar operations. However, it is the vector and matrix dot products that are the computationally expensive operations.

I speak of the actual code which we intend to use; it contains a lot of
vector and matrix operations.


The code that is described will be done fully in hardware.

Ah! Magic? The question is how to implement it in hardware. When considering this you MUST understand that most of the arithmetic in that code is vector arithmetic. This logically follows from the fact that a pixel is a vector (4 elements identifying a point in color & transparency space). So, any operation on a pixel as a single data object is a vector operation (by definition).

The use of a CPU for the "standard" 3D engine has been discussed many
times before. It's not an option. It's too slow.

I guess you still don't understand. The UltraSPARC, like a Pentium, has an FPU. This is on the chip, but it is a separate entity that is fed data and instructions by the CPU. The SPARC FPU, like the Pentium's, has the ability to do SIMD vector computations. As you can see from the docs at:

        http://www.opensparc.net/

In the SPARC FPU this is a separate logical unit called the FGX. It is this, the FGX, and only this, that I am suggesting we might use (all or part of it) to do vector arithmetic. IIUC, this unit is 4 words wide (32-bit float) and can pipeline one SIMD MAC instruction per clock. Also please note that the multiplies in matrix and vector dot products have no dependencies on each other -- there are no pipeline stalls to resolve. The idea that I am proposing for discussion is to have 4 of these units controlled by a microcode sequencer; microcode may be a bit slower, but it can be _changed_ by a simple IPL.

I am not in any way whatsoever proposing that we use a CPU. I am suggesting that we consider that the SPARC FGX SIMD vector processor might be useful. As I said, I don't know yet whether it will be or not; I am not going to make any broad-brush statements without further R&D.

After a short discussion with TM, I understand that he was considering using a non-standard float format to reduce the amount of hardware required. This is an interesting idea, since 32-bit float is rather large but the ILM 16-bit half float is really too small for many things.

IAC, the hardware God will declare how much hardware is available. The important thing here is the hardware multipliers. These are hard blocks in the FPGA, not built from programmable logic. When considering how much chip area they take up, I think in terms of a custom chip. If you look at a die photo of a Pentium, you can easily spot the hardware multipliers since they are very large.

So, the Xilinx part has 96 18-bit hardware multipliers available. I figured that we could probably use 64 of them for the pixel processor (though it is possible that more would be available). How many effective multipliers you have depends on the data type.

For a 32x32 float multiply you need 4 of the 18-bit integer multipliers (a 32-bit float has a 24-bit mantissa; splitting each 24-bit operand into two chunks of at most 18 bits gives 2 x 2 = 4 partial products), which would mean that only 16 effective multipliers would be available. It is based on this that I wonder whether we have enough hardware to make a totally custom systolic processor.

However, for other data types you need less hardware. To multiply a 32-bit float by a pixel element you only need 2 of the 18-bit multipliers. To multiply a 16-bit integer or half float by a pixel element you only need one. Other cases are similar; I hope you get the idea.

I do not like the idea of using a non-standard numeric format. However, TM's idea means that you only need one hardware multiplier for each multiply, resulting in a lot more multipliers available.

As a thought experiment, consider how you would implement the code if you had only 16 multipliers available, or a larger number if you will be using smaller data objects.

Also, please keep in mind that a linear transform of a 4-element vector (this uses a 4x4 matrix as the coefficients) takes 16 multiplies, so it will take 16 multipliers if it is to be done in one clock with a data-flow (or systolic array) processor.

Look at the code and understand that this:

    float25 A_00, R_00, G_00, B_00;
    float25 A_01, R_01, G_01, B_01;
    float25 A_10, R_10, G_10, B_10;
    float25 A_11, R_11, G_11, B_11;

defines a matrix, and this:

    A_0 = A_00*(1-DeltaS) + A_10*(DeltaS);
    R_0 = R_00*(1-DeltaS) + R_10*(DeltaS);
    G_0 = G_00*(1-DeltaS) + G_10*(DeltaS);
    B_0 = B_00*(1-DeltaS) + B_10*(DeltaS);

    A_1 = A_01*(1-DeltaS) + A_11*(DeltaS);
    R_1 = R_01*(1-DeltaS) + R_11*(DeltaS);
    G_1 = G_01*(1-DeltaS) + G_11*(DeltaS);
    B_1 = B_01*(1-DeltaS) + B_11*(DeltaS);

    *A = A_0*(1-DeltaT) + A_1*(DeltaT);
    *R = R_0*(1-DeltaT) + R_1*(DeltaT);
    *G = G_0*(1-DeltaT) + G_1*(DeltaT);
    *B = B_0*(1-DeltaT) + B_1*(DeltaT);

is matrix arithmetic decomposed into scalar operations; it takes 24 multipliers to do in parallel as written. A vector processor might do it differently, or the FGX vector processor might have special instructions for stuff like this.

And the final question is whether the FGX can reconfigure when it has data smaller than 32 bits, like it does when it has larger data words. A pixel vector is only 32 bits, so one would think it could be handled more efficiently than by converting it to 128 bits.

Note that however we do this, using the FPGA hardware multipliers is wasteful. A custom chip would always need fewer bits in the multiplier arrays if we used standard-size data words.

--
JRT
_______________________________________________
Open-hardware mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-hardware
