James Richard Tyrer wrote:
> Kenneth Ostby wrote:
>
>> Actually when it comes hardware there is surprisingly little matrix
>> matrix multiplication in the 3D world. 
>
>
> With a 3-word SIMD you need to:
>
>     clear the accumulator
>     load P in a register
>     perform 3 MAC vector instructions
>
> If you have a 4-word SIMD you can also do the same for P which is a 4
> vector and a 4x4 transform matrix, and it only takes one more MAC.
>
What you propose still has the same problem we have debated before; I
can't find the thread in the list archive. Using a CPU, even one custom
made for the task, was already found to be more work for less result. If
people wanted to build one on the card they were free to do so. OGA1
targets a fixed pipeline for now; maybe a version 2 could go the way of
the raster's, but not right now.
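
For reference, here is a rough Python sketch of the sequence quoted
above (clear, load, one vector MAC per component). It is only my reading
of the scheme: the function name transform_point and the column-major
layout of the matrix are assumptions for illustration, not anything
from an actual instruction set.

```python
def transform_point(columns, p):
    """Compute M*p as the sum of columns[i] * p[i], counting operations.

    columns: list of matrix columns (hypothetical column-major layout).
    p:       the point, one scalar per column.
    Returns (result_vector, operation_count).
    """
    ops = 0

    acc = [0.0] * len(columns[0])  # clear the accumulator
    ops += 1

    reg = list(p)                  # load P in a register
    ops += 1

    for col, scalar in zip(columns, reg):
        # one vector MAC: acc += col * scalar, on all lanes at once
        acc = [a + c * scalar for a, c in zip(acc, col)]
        ops += 1

    return acc, ops
```

With a 3x3 matrix this counts 5 operations, and with a 4x4 matrix it
counts 6, i.e. one more MAC.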

If we take the algorithm you propose, it takes 5 operations to do the
transform. Suppose we could implement a processor that does one
operation per cycle. One graphics operation would still only cover a
small group of pixels. Suppose you have something like 20 such blocks of
operations to do: that takes ~100 operations for 1 group of pixels. If
you need to process approximately 307k such groups (640*480), it takes
at least 30 million cycles per frame, so at a 60Hz refresh rate a single
processor would need about 1800 million operations per second.
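
The arithmetic above can be checked with a few lines of Python. The
variable names are mine, and the figures are just the round numbers
from this mail (5 operations per block, 20 blocks, one group per pixel
position), so take it as a back-of-envelope sketch:

```python
OPS_PER_BLOCK = 5        # clear + load + 3 MACs
BLOCKS_PER_GROUP = 20    # assumed number of such blocks per group
WIDTH, HEIGHT = 640, 480
REFRESH_HZ = 60

ops_per_group = OPS_PER_BLOCK * BLOCKS_PER_GROUP   # ~100 operations
groups_per_frame = WIDTH * HEIGHT                  # ~307k groups
ops_per_frame = ops_per_group * groups_per_frame   # ~30 million
ops_per_second = ops_per_frame * REFRESH_HZ        # ~1800 million

print(f"{ops_per_frame / 1e6:.1f} million operations per frame")
print(f"{ops_per_second / 1e6:.0f} million operations per second")
```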

Since we are in an FPGA we could hope to get at best a 100MHz clock for
a CPU. To do 1 operation per cycle we would need a lot of supporting
architecture for intelligent prefetch of instructions and data. If we
suppose we manage it, 1 CPU can do 100 million operations per second,
still far from the 1800 million operations needed. Suppose we put in
enough CPU cores to reach the required number, something like 20
processors, so we can theoretically do 2000 million operations per
second. Now we have a screen that does 640*480 at 60fps. The 20
processors would cost us at least 4 multipliers each, so that's
approximately 80 multipliers. The number of multipliers isn't a problem:
on the 3S4000 there are 96 18*18 signed multipliers.
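
As a sanity check on the core count and the multiplier budget, a small
sketch under the same assumptions (100MHz, one operation per cycle,
4 multipliers per core; the names are hypothetical):

```python
import math

REQUIRED_OPS_PER_SECOND = 640 * 480 * 100 * 60  # from the estimate above
CPU_CLOCK_HZ = 100_000_000                      # hoped-for FPGA CPU clock
MULTIPLIERS_PER_CORE = 4                        # assumed minimum per core
MULTIPLIERS_AVAILABLE = 96                      # 18*18 multipliers, 3S4000

# Minimum number of cores if each one really retires 1 operation/cycle.
min_cores = math.ceil(REQUIRED_OPS_PER_SECOND / CPU_CLOCK_HZ)

# The round figure used in this mail.
cores = 20
multipliers_needed = cores * MULTIPLIERS_PER_CORE
fits = multipliers_needed <= MULTIPLIERS_AVAILABLE

print(min_cores, multipliers_needed, fits)
```

The exact minimum comes out at 19 cores, so 20 is the same order of
magnitude with a little headroom.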

If we do the same with a fixed pipeline, suppose we do the same 100
operations but unrolled, and we run at 100MHz. We have the same
multiplier requirement: 20 stages of 4 multipliers per stage (RGBA), so
that's 80 multipliers. The difference now is that while a pixel is doing
1 operation, the other 99 stages also have a pixel in them, so that
corresponds to 1 pixel completed per cycle. At 100MHz and 60fps that
works out to approximately 1.6 million pixels per frame. That many
pixels is more than a 1280*1024 screen, so you can do 1280*1024 at
60fps.
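
The pipeline numbers work out the same way. This sketch assumes the
pipeline stays full and ignores fill latency and blanking time, so it
gives an upper bound rather than a measured figure:

```python
CLOCK_HZ = 100_000_000   # pipeline clock
FPS = 60
STAGES_WITH_MULTIPLIERS = 20
MULTIPLIERS_PER_STAGE = 4            # one per channel: R, G, B, A

# A full pipeline retires 1 pixel per cycle, whatever its depth.
pixels_per_second = CLOCK_HZ
pixels_per_frame = pixels_per_second // FPS

multipliers_needed = STAGES_WITH_MULTIPLIERS * MULTIPLIERS_PER_STAGE
fits_1280x1024 = pixels_per_frame >= 1280 * 1024

print(pixels_per_frame, multipliers_needed, fits_1280x1024)
```

So for the same 80 multipliers the pipeline covers 1280*1024 at 60fps,
where the 20 processors only cover 640*480.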

I hope you see why the fixed function was preferred. The CPU approach
does give a lot of flexibility, but in gaining that flexibility you lose
performance and gain complexity on the hardware side. The fixed function
pipeline does seem bigger at first glance, but it's much more
straightforward to design than a CPU. Also, you need a lot more
processors to get near the performance of a dedicated pipeline, so the
hardware requirement for equal performance is at least 2 to 10 times
worse for a processor. It's a problem hardware always faces: the balance
between flexibility and performance.
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
