André Pouliot wrote:
If we do the same with a fixed pipeline, suppose we do the same
100 operations, but unrolled, and run at 100 MHz. The multiplier requirement is the same: 20 stages of 4 multipliers per stage (RGBA), so that's 80 multipliers. The difference now is that while
 one pixel is in one stage, the other 99 stages each hold a pixel
 too, which corresponds to 1 pixel/cycle being processed. At
 100 MHz and 60 fps, that works out to approximately 1.6 million
pixels per frame. That many pixels is more than a 1280*1024
screen holds. So you can do 1280*1024 at 60 fps.
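
The pixel budget quoted above can be checked with a few lines of arithmetic. This is only a rough sketch: the function name is made up for illustration, and it assumes a full pipeline that retires exactly one pixel per clock, ignoring pipeline fill time, overdraw, and memory bandwidth.

```python
# Back-of-the-envelope check of the pixel budget for a fully pipelined design.
CLOCK_HZ = 100_000_000     # 100 MHz pipeline clock (figure from the post)
FRAMES_PER_SECOND = 60     # target frame rate (figure from the post)
PIXELS_PER_CYCLE = 1       # assumption: a full pipeline retires one pixel per clock


def pixels_per_frame(clock_hz, fps, pixels_per_cycle=1):
    """Whole pixels that can be produced in one frame time (hypothetical helper)."""
    return clock_hz * pixels_per_cycle // fps


budget = pixels_per_frame(CLOCK_HZ, FRAMES_PER_SECOND, PIXELS_PER_CYCLE)
screen = 1280 * 1024       # pixels on a 1280*1024 display

print("pixels available per frame:", budget)
print("pixels needed for 1280*1024:", screen)
print("fits at 60 fps:", budget >= screen)
```

With these numbers the budget is about 1.67 million pixels per frame against roughly 1.31 million needed, so the claim holds under the stated assumptions.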

I hope you see why the fixed function was preferred.

No, I don't.  You have asserted that with more hardware you will get
more throughput.  I cannot argue with that.  But the question is how
to get the best performance with a given amount of hardware (as well as
the related question of how much hardware we will have available).  I
understand the issues involved, but I am not certain of the answer.  I
know that systolic arrays are faster, but IIUC, they also take more
hardware/throughput.  It is throwing hardware at the problem, and it works.

The approach with the CPU does give a lot of flexibility.

I am not talking about using a CPU at all.  I am talking about
using 4-word vector processors optimized for graphics to be controlled
by microcode.

But in gaining that flexibility you lose performance and gain complexity on the hardware side. The fixed-function pipeline does seem
 bigger at first glance.

It doesn't just seem bigger, it is bigger.  To do more multiplies at the
same time, you will need more multiplier arrays.

But it's much more straightforward to design than a CPU.

The RTL for the Sun graphics processor is free for the download! But even if we were designing it ourselves, I see no difference. You have 4 MAC units and a sequencer to tell them what to do. For more versatility, the microcode can be in RAM.

Also, you need a lot more processors to get near the performance of a dedicated pipeline, so for equal performance the hardware requirement is at least 2 to 10 times worse for a processor.

I am a hardware person.  I fully understand that a systolic processor
system is faster.  I also understand how much hardware is needed.  With
sufficient hardware, you can do my sample problem at the rate of one
output per clock.  HOWEVER, it will require 9 hardware multipliers and 6
adders vs only 3 of each for the vector processor. To do a 4-vector * 4x4 transform matrix multiply (which is required for RGBA pixels), it will require 16 hardware multipliers and 12 adders.
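
Those counts follow from a simple rule, sketched here. The function names are made up for illustration, and the cycle count for the vector processor assumes each lane completes one multiply-accumulate per clock, which the discussion does not state explicitly.

```python
# Hardware needed to multiply an n-vector by an n x n matrix.

def fully_parallel_cost(n):
    """One result per clock: every product is computed at the same time."""
    multipliers = n * n        # one multiplier per matrix element
    adders = n * (n - 1)       # each of the n dot products sums n terms with n-1 adders
    return multipliers, adders


def vector_processor_cost(n):
    """n MAC lanes: one multiplier and one accumulating adder per lane."""
    return n, n


def vector_processor_cycles(n):
    """Assumption: one MAC per lane per clock, so n clocks per transform."""
    return n


for n in (3, 4):
    print(n, "fully parallel:", fully_parallel_cost(n),
          "vector processor:", vector_processor_cost(n),
          "cycles per transform:", vector_processor_cycles(n))
```

For n = 3 this gives 9 multipliers and 6 adders against 3 of each, and for n = 4 it gives 16 multipliers and 12 adders, matching the figures above.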

Are we really going to have that kind of hardware available?

Next question: if we have that type of hardware available, is this the
best way to deploy it?  With the same 16 hardware multipliers and a few
more adders, we could have four 4-word SIMD vector processors.  IIUC, if we
could then do 4 operations in parallel, we would get the same
throughput. Either way, doing 4x4 transforms will saturate those 16 multipliers, so the rest of your 99 other stages are going to have to be additional hardware.
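
The "same throughput" claim can be sketched as a multiplier-budget calculation. This is an idealized model, not a description of any real design: the function names are hypothetical, and it assumes every multiplier is busy every clock, with no sequencing, fetch, or data-movement overhead on either side.

```python
from fractions import Fraction

MULTIPLIERS = 16   # total hardware multipliers in both layouts (from the post)


def systolic_transforms_per_cycle(n=4):
    """Fixed array: all n*n products of one transform are done every clock."""
    return Fraction(MULTIPLIERS, n * n)


def simd_transforms_per_cycle(processors=4, lanes=4, n=4):
    """Several vector processors, each doing `lanes` multiplies per clock."""
    cycles_per_transform = Fraction(n * n, lanes)   # n*n products, `lanes` at a time
    return processors / cycles_per_transform


print("systolic array:", systolic_transforms_per_cycle())
print("4 x 4-word SIMD:", simd_transforms_per_cycle())
```

Under these assumptions both layouts come out at one 4x4 transform per clock. Any real difference would come from the overheads the model leaves out.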

It's a problem hardware always faces: the balance between flexibility and performance.

Yes, there is always a balance between flexibility and performance.
However, to get the performance you are talking about, it looks to me
like we are going to need a massive number of hardware multiplier arrays
and the exponent adder and control to go with them.  We then need to
ask, if we have that much hardware available, is it better to have it in
a fixed purpose systolic processor array, or in long word FPUs that can
do MACs in parallel? ATI & nVidia have decided against fixed purpose processor arrays.

--
JRT

