Note: useless discussion moved to Open-Hardware list at the request of TM.
Nicolas Boulay wrote:
> 2007/12/15, James Richard Tyrer <[EMAIL PROTECTED]>:
>> Nicolas Boulay wrote:
>>> 2007/12/15, James Richard Tyrer <[EMAIL PROTECTED]>:
>>>> Nicolas Boulay wrote:
>>>>> ...
>>>>> Most (complex) shader code is _not_ vector code.
>>>> What part of 'a color pixel is a vector' don't you understand?
>>> What part of a real-world shader code example don't you understand?
>> Does this real-world shader code operate on RGB or RGBA pixels as data
>> objects? If so, it is vector.
> I am not speaking in generic terms; I am speaking about real-world
> examples of current complex vertex and pixel shaders.
Yes, I noticed that it did contain some scalar operations. However, it
is the vector and matrix dot products that are the computationally
expensive operations.
I speak of the actual code which we intend to use; it contains a lot of
vector and matrix operations.
> The code that is described will be done fully in hardware.
Ah! Magic? The question is how to implement it in hardware. When
considering this you MUST understand that most of the arithmetic in that
code is vector arithmetic. This follows logically from the fact that a
pixel is a vector (4 elements identifying a point in color & transparency
space). So, any operation on a pixel as a single data object is a
vector operation (by definition).
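To make that definition concrete, here is a minimal C sketch (the
`pixel4` type and function name are mine, purely illustrative, not from
any of our code): scaling a pixel's brightness touches all four elements
at once, i.e. it is one vector operation.

```c
/* Illustrative only: an RGBA pixel as a 4-element vector. */
typedef struct { float a, r, g, b; } pixel4;

/* Scaling the pixel is one vector multiply: four parallel scalar
   multiplies that a SIMD unit can issue as a single operation. */
pixel4 pixel_scale(pixel4 p, float k)
{
    pixel4 out = { p.a * k, p.r * k, p.g * k, p.b * k };
    return out;
}
```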
> The use of a CPU for the "standard" 3D engine has been discussed many
> times before. It's not an option; it's too slow.
I guess you still don't understand. The UltraSPARC, like a Pentium, has
an FPU. This is on the chip, but it is a separate entity that is fed
data and instructions by the CPU. The SPARC FPUs, like the Pentium's,
have the ability to do SIMD vector computations. As you can see from
the docs at:
http://www.opensparc.net/
in the SPARC FPU this is a separate logical unit called the FGX. It is
this, the FGX, and only this, that I am suggesting we might use (all or
part of it) to do vector arithmetic. IIUC, this unit is 4 words wide
(32-bit float) and can pipeline one SIMD MAC instruction per clock.
Also, please note that the dot products in matrix and vector products
have no dependencies on each other -- there are no pipeline stalls to
resolve.
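To see why that matters, here is a C model of one dot product (a sketch
of the arithmetic, not FGX code): successive dot products share no
operands, so a pipelined SIMD MAC unit can start a new one every clock.

```c
/* Model of one 4-wide dot product: 4 independent multiplies plus
   accumulation.  Different dot products are themselves independent,
   so they pipeline with no stalls. */
float dot4(const float a[4], const float b[4])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}
```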
The idea that I am proposing for discussion is to have 4 of these units
controlled by a microcode sequencer -- microcode may be a bit slower but
it can be _changed_ by a simple IPL.
I am not in any way whatsoever proposing that we use a CPU. I am
suggesting that we consider that the SPARC FGX SIMD vector processor
might be useful. As I said, I don't know yet whether it will be; I am
not going to make any broad-brush statements without further R&D.
After a short discussion with TM, I understand that he was considering
using a non-standard float format to reduce the amount of hardware
required. This is an interesting idea, since 32-bit float is rather
large but the ILM 16-bit half float is really too small for many things.
IAC, the hardware God will declare how much hardware is available. The
important thing here is the hardware multipliers. These are part of the
FPGA and are not programmable. When considering how much chip area they
take up, I think in terms of a custom chip. If you look at a picture of
a Pentium chip, you can easily see the hardware multipliers since they
are very large.
So, the Xilinx has 96 18-bit hardware multipliers available. I figured
that we could probably use 64 of them for the pixel processor (but it is
possible that more would be available). How many actual multipliers you
have available depends on the data type.
For a 32x32 float multiply you need 4 of the 18-bit integer multipliers,
which would mean that only 16 multipliers would be available. It is
based on this that I wonder whether we have enough hardware to make a
totally custom systolic processor.
However, for other data types you need less hardware. To multiply a
32-bit float by a pixel element you only need 2 of the 18-bit
multipliers. To multiply a 16-bit integer or half float by a pixel
element you only need one. Other cases are similar, and I hope that you
get the idea.
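The counts above follow from splitting wide operands into 18-bit chunks.
A sketch of the tally, under my assumptions: an m-bit by n-bit multiply
needs ceil(m/18) * ceil(n/18) of the 18-bit blocks, a 32-bit float
contributes its 24-bit mantissa, and a pixel element is about 9 bits
(the widths are illustrative, not a hardware spec).

```c
/* Hypothetical tally: 18-bit hardware multiplier blocks needed for an
   m-bit by n-bit multiply, splitting each operand into 18-bit chunks. */
int mults_needed(int m_bits, int n_bits)
{
    int rows = (m_bits + 17) / 18;   /* ceil(m_bits / 18) */
    int cols = (n_bits + 17) / 18;   /* ceil(n_bits / 18) */
    return rows * cols;
}
```

Under these assumptions, mults_needed(24, 24) gives 4 for a 32x32 float
multiply, mults_needed(24, 9) gives 2, and mults_needed(11, 9) gives 1,
matching the counts above.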
I do not like the idea of using a non-standard numeric format. However,
TM's idea means that you only need one hardware multiplier for each
multiply, resulting in a lot more multipliers being available.
As a thought experiment, consider how you would implement the code if
you only had 16 multipliers available, or a larger number if you will be
using smaller data objects.
Also, please keep in mind that a linear transform of a 4-element vector
(this uses a 4x4 matrix as the coefficients) takes 16 multiplies, so it
will take 16 multipliers if it is to be done in one clock with a
data-flow (or systolic-array) processor.
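In C that transform looks like this (a generic sketch, not our shader
code); the inner products expose the 16 independent multiplies:

```c
/* Linear transform of a 4-element vector by a 4x4 coefficient matrix:
   16 multiplies in all, hence 16 hardware multipliers if all are to
   fire in one clock in a data-flow / systolic arrangement. */
void mat4_mul_vec4(const float m[4][4], const float v[4], float out[4])
{
    for (int i = 0; i < 4; i++) {
        out[i] = 0.0f;
        for (int j = 0; j < 4; j++)
            out[i] += m[i][j] * v[j];
    }
}
```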
Look at the code and understand that this:
float25 A_00, R_00, G_00, B_00;
float25 A_01, R_01, G_01, B_01;
float25 A_10, R_10, G_10, B_10;
float25 A_11, R_11, G_11, B_11;
defines a matrix, and this:
A_0 = A_00*(1-DeltaS) + A_10*(DeltaS);
R_0 = R_00*(1-DeltaS) + R_10*(DeltaS);
G_0 = G_00*(1-DeltaS) + G_10*(DeltaS);
B_0 = B_00*(1-DeltaS) + B_10*(DeltaS);
A_1 = A_01*(1-DeltaS) + A_11*(DeltaS);
R_1 = R_01*(1-DeltaS) + R_11*(DeltaS);
G_1 = G_01*(1-DeltaS) + G_11*(DeltaS);
B_1 = B_01*(1-DeltaS) + B_11*(DeltaS);
*A = A_0*(1-DeltaT) + A_1*(DeltaT);
*R = R_0*(1-DeltaT) + R_1*(DeltaT);
*G = G_0*(1-DeltaT) + G_1*(DeltaT);
*B = B_0*(1-DeltaT) + B_1*(DeltaT);
is matrix arithmetic decomposed into scalar operations, and it takes 24
multipliers to do in parallel as written. A vector processor might do
it differently, or the FGX vector processor might have special
instructions for stuff like this.
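For comparison, here is the same bilinear blend written with the pixel
as a single vector object (a sketch; `pixel4` and the function names
are mine). The twelve scalar lines collapse to three vector lerps of
two vector multiplies each -- still 24 scalar multiplies, matching the
count above, but visible to a vector processor as 6 vector operations:

```c
typedef struct { float a, r, g, b; } pixel4;

/* One vector lerp: x*(1-t) + y*t applied to all four elements. */
static pixel4 lerp4(pixel4 x, pixel4 y, float t)
{
    float s = 1.0f - t;
    pixel4 out = { x.a * s + y.a * t, x.r * s + y.r * t,
                   x.g * s + y.g * t, x.b * s + y.b * t };
    return out;
}

/* The blend above: interpolate along S twice, then along T once. */
pixel4 bilerp4(pixel4 p00, pixel4 p10, pixel4 p01, pixel4 p11,
               float delta_s, float delta_t)
{
    pixel4 p0 = lerp4(p00, p10, delta_s);
    pixel4 p1 = lerp4(p01, p11, delta_s);
    return lerp4(p0, p1, delta_t);
}
```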
And the final question is whether the FGX can reconfigure for data
smaller than 32 bits the way it does for larger data words. A pixel
vector is only 32 bits, so one would think that it could be handled
more efficiently than by converting it to 128 bits.
Note that, however we do this, using the FPGA hardware multipliers is
wasteful. A custom chip would always need fewer bits in the multiplier
arrays if we used standard-size data words.
--
JRT
_______________________________________________
Open-hardware mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-hardware