Note: useless discussion moved to Open-Hardware list at the request of TM.

Nicolas Boulay wrote:
2007/12/15, James Richard Tyrer <[EMAIL PROTECTED]>:
Nicolas Boulay wrote:
2007/12/15, James Richard Tyrer <[EMAIL PROTECTED]>:
Nicolas Boulay wrote:
...
Reread my first post. Most (complex) shader code is _not_ vector code.
What part of 'a color pixel is a vector' don't you understand?

What part of real-world shader code examples don't you understand?

Does this real world shader code operate on RGB or RGBA pixels as data objects? If so, it is vector.

I'm not speaking in generic terms; I'm speaking about real-world examples of
current complex vertex shaders and pixel shaders.

Yes, I noticed that it did contain some scalar operations. However, it is the vector and matrix dot products that are the computationally expensive operations.

I speak of the actual code which we intend to use; it contains a lot of
vector and matrix operations.


The code that is described will be done fully in hardware.

Ah! Magic? The question is how to implement it in hardware. When considering this you MUST understand that most of the arithmetic in that code is vector arithmetic. This logically follows from the fact that a pixel is a vector (4 elements identifying a point in color & transparency space). So, any operation on a pixel as a single data object is a vector operation (by definition).

The use of a CPU for the "standard" 3D engine has been discussed many
times before. It's not an option. It's too slow.

I guess you still don't understand. The UltraSPARC, like a Pentium, has an FPU. This is on the chip, but it is a separate entity that is fed data and instructions by the CPU. The SPARC FPU, like the Pentium's, has the ability to do SIMD vector computations. As you can see from the docs at:

        http://www.opensparc.net/

In the SPARC FPU this is a separate logical unit called the FGX. It is this, the FGX, and only this, that I am suggesting we might use (all or part of it) to do vector arithmetic. IIUC, this unit is 4 words wide (32-bit float) and can pipeline one SIMD MAC instruction per clock. Also please note that the multiplies in matrix and vector dot products have no dependencies on each other -- there are no pipeline stalls to resolve. The idea that I am proposing for discussion is to have 4 of these units controlled by a microcode sequencer; microcode may be a bit slower, but it can be _changed_ by a simple IPL.

I am not in any way whatsoever proposing that we use a CPU. I am suggesting that we consider that the SPARC FGX SIMD vector processor might be useful. As I said, I don't know yet whether it will be or not; I am not going to make any broad-brush statements without further R&D.

After a short discussion with TM, I understand that he was considering using a non-standard float format to reduce the amount of hardware required. This is an interesting idea, since 32-bit float is rather large but the ILM 16-bit half float is really too small for many things.

IAC, the hardware God will declare how much hardware is available. The important thing here is the hardware multipliers. These are hard blocks in the FPGA, not built from programmable logic. When considering how much chip area they take up, I think in terms of a custom chip. If you look at a die photo of a Pentium, you can easily spot the hardware multipliers since they are very large.

So, the Xilinx part has 96 18-bit hardware multipliers available. I figured that we could probably use 64 of them for the pixel processor (though it is possible that more would be available). How many effective multipliers you have depends on the data type.

For a 32x32 float multiply you need 4 of the 18-bit integer multipliers (a 32-bit float has a 24-bit mantissa; splitting each 24-bit operand into two chunks of at most 18 bits gives 2 x 2 = 4 partial products), which would mean that only 16 effective multipliers would be available. It is based on this that I wonder whether we have enough hardware to make a totally custom systolic processor.

However, for other data types you need less hardware. To multiply a 32-bit float by a pixel element you only need 2 of the 18-bit multipliers. To multiply a 16-bit integer or half float by a pixel element you only need one. Other cases are similar; I hope you get the idea.

I do not like the idea of using a non-standard numeric format. However, TM's idea means that you only need one hardware multiplier for each multiply, resulting in a lot more multipliers available.

As a thought experiment, consider how you would implement the code if you had only 16 multipliers available, or a larger number if you will be using smaller data objects.

Also, please keep in mind that a linear transform of a 4-element vector (this uses a 4x4 matrix as the coefficients) takes 16 multiplies, so it will take 16 multipliers if it is to be done in one clock with a data-flow (or systolic array) processor.

Look at the code and understand that this:

    float25 A_00, R_00, G_00, B_00;
    float25 A_01, R_01, G_01, B_01;
    float25 A_10, R_10, G_10, B_10;
    float25 A_11, R_11, G_11, B_11;

defines a matrix, and this:

    A_0 = A_00*(1-DeltaS) + A_10*(DeltaS);
    R_0 = R_00*(1-DeltaS) + R_10*(DeltaS);
    G_0 = G_00*(1-DeltaS) + G_10*(DeltaS);
    B_0 = B_00*(1-DeltaS) + B_10*(DeltaS);

    A_1 = A_01*(1-DeltaS) + A_11*(DeltaS);
    R_1 = R_01*(1-DeltaS) + R_11*(DeltaS);
    G_1 = G_01*(1-DeltaS) + G_11*(DeltaS);
    B_1 = B_01*(1-DeltaS) + B_11*(DeltaS);

    *A = A_0*(1-DeltaT) + A_1*(DeltaT);
    *R = R_0*(1-DeltaT) + R_1*(DeltaT);
    *G = G_0*(1-DeltaT) + G_1*(DeltaT);
    *B = B_0*(1-DeltaT) + B_1*(DeltaT);

is matrix arithmetic decomposed into scalar operations; it takes 24 multipliers to do in parallel as written. A vector processor might do it differently, or the FGX vector processor might have special instructions for stuff like this.

And the final question is whether the FGX can reconfigure when it has data smaller than 32 bits, like it does when it has larger data words. A pixel vector is only 32 bits, so one would think it could be handled more efficiently than by converting it to 128 bits.

Note that however we do this, using the FPGA hardware multipliers is wasteful. A custom chip would always need fewer bits in the multiplier arrays if we used standard-size data words.

--
JRT
_______________________________________________
Open-hardware mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-hardware
