James Richard Tyrer wrote:
> Kenneth Ostby wrote:
>
>> Actually when it comes hardware there is surprisingly little matrix
>> matrix multiplication in the 3D world.
>
> With a 3-word SIMD you need to:
>
> clear the accumulator
> load P in a register
> perform 3 MAC vector instructions
>
> If you have a 4-word SIMD you can also do the same for P which is a 4
> vector and a 4x4 transform matrix, and it only takes one more MAC.
>

What you propose still has the same problem we debated before, though I can't find the thread in the list archive. Using a CPU, even one custom-made for the task, was already found to be more work for less result. If people want to put one on the card, they are free to do so. For now OGA1 targets a fixed pipeline; maybe a version 2 could go the way of the raster's, but not right now.
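For reference, here is a rough Python sketch of the quoted sequence (clear, load, then one MAC per word), just to make the operation count explicit. The names `mac` and `transform` are made up for illustration, and the row-vector convention (each MAC adds one component of P times one matrix row) is my assumption; the quoted text does not say which layout is meant.

```python
def mac(acc, scalar, row):
    """One vector MAC: acc[j] += scalar * row[j] on every SIMD lane."""
    return [a + scalar * r for a, r in zip(acc, row)]


def transform(p, m):
    """Multiply row vector p by square matrix m, counting operations.

    Assumed convention (not stated in the thread): result = p * m,
    built up one matrix row per MAC instruction.
    """
    ops = 0
    acc = [0] * len(p)      # clear the accumulator
    ops += 1
    reg = list(p)           # load P in a register
    ops += 1
    for i in range(len(reg)):
        acc = mac(acc, reg[i], m[i])   # one MAC vector instruction
        ops += 1
    return acc, ops
```

With a 3-vector and a 3x3 matrix this counts 5 operations; with a 4-vector and a 4x4 matrix it counts 6, i.e. one more MAC.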
If we take the algorithm you propose, it takes 5 operations per transform. Suppose we could implement a processor that does one operation per cycle. That still means that each graphics operation only works on one small group of pixels. Suppose you have something like 20 such blocks of operations to do; that is ~100 operations for 1 group of pixels. With approximately 307k such groups (640*480), one frame takes at least 30 million cycles, so at a 60Hz refresh rate a single processor would need about 1800 million operations per second.

Since we are in an FPGA, we could hope to get at best a 100MHz clock for a CPU. To do 1 operation per cycle we would need a lot of supporting architecture for intelligent prefetch of instructions and data. Supposing we manage that, 1 CPU can do 100 million operations per second, still far from the 1800 million needed. Suppose we put in enough CPU cores to reach the required number, something like 20 processors, so we can theoretically do 2000 million operations per second. Now we have a screen that does 640*480 at 60fps. The 20 processors would cost us at least 4 multipliers each, so that's approximately 80 multipliers. The number of multipliers isn't a problem: the 3S4000 has 96 18*18 signed multipliers.

Now do the same with a fixed pipeline: suppose we do the same 100 operations, but unrolled, and run at 100MHz. We have the same multiplier requirement: 20 stages of 4 multipliers per stage (RGBA), so that's 80 multipliers. The difference is that while one pixel is in one stage, the other 99 stages also each hold a pixel, so the pipeline processes 1 pixel per cycle. At 100MHz and 60fps that comes to approximately 1.6 million pixels per frame, which is more than a 1280*1024 screen. So you can do 1280*1024 at 60fps.

I hope you see why the fixed function was preferred. The CPU approach does give a lot of flexibility.
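The arithmetic above can be re-run as a short Python sketch. The constants are the figures used in this mail (5 operations per block, about 20 blocks per pixel, 100MHz as an optimistic FPGA clock); the function and constant names are only for illustration, and this is a back-of-the-envelope estimate, not a model of any real design.

```python
OPS_PER_BLOCK = 5           # clear accumulator, load P, 3 MACs
BLOCKS_PER_PIXEL = 20       # rough guess used in this mail
WIDTH, HEIGHT = 640, 480
REFRESH_HZ = 60
CLOCK_HZ = 100_000_000      # optimistic FPGA clock, 1 operation per cycle
MULTIPLIERS_PER_UNIT = 4    # one per RGBA channel


def cpu_estimate():
    """Return (ops per pixel, ops per frame, ops per second, cores needed)."""
    ops_per_pixel = OPS_PER_BLOCK * BLOCKS_PER_PIXEL
    ops_per_frame = ops_per_pixel * WIDTH * HEIGHT
    ops_per_second = ops_per_frame * REFRESH_HZ
    cores = -(-ops_per_second // CLOCK_HZ)   # ceiling division
    return ops_per_pixel, ops_per_frame, ops_per_second, cores


def pipeline_pixels_per_frame():
    """A full pipeline retires 1 pixel per cycle, whatever its depth."""
    return CLOCK_HZ // REFRESH_HZ
```

With these numbers the CPU route needs about 1843 million operations per second, so at least 19 cores (call it 20) for 640*480, while the pipeline gets about 1.67 million pixels per frame against 1,310,720 pixels for 1280*1024. Both come out at roughly 80 multipliers.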
But the flexibility of the CPU approach costs performance and adds complexity on the hardware side. The fixed-function pipeline does seem bigger at first glance, but it is much more straightforward to design than a CPU. You also need many more processors to get near the performance of a dedicated pipeline, so for equal performance the hardware requirement is at least 2 to 10 times worse for the processor approach. It's a problem hardware always faces: the balance between flexibility and performance.

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
