Re: [Open-graphics] Sun releases RTL design for Niagra 2 under GPL 2.0

James Richard Tyrer Sat, 15 Dec 2007 07:17:43 -0800

André Pouliot wrote:

The problem remains the same: even if you use microcode, you can't get
near 1 operation per cycle for a processor in an FPGA and do it fast.
It's either fast but multicycle, or 1 cycle but slow.


IIUC, the limiting factor would be the speed of the multiply.
Specifically, the speed of the hardware multiply array.  This is the
same limitation no matter how you arrange the hardware.

<SNIP>

The RTL for the Sun SPARC processor is free. But put the logic used for
the Niagara 1 chip in an FPGA and it is too big to fit in the 3S4000,
even for only 1 core.


But how big is the vector processor?

<SNIP>

Are we really going to have that kind of hardware available?
No, we don't have that kind of hardware to spare if we do it fully
parallel and try to do it in one cycle. But we could easily do it by
using something like a pipeline.


Isn't that what I am suggesting, a pipelined MAC unit?

For your sample problem we receive more data than we generate


How do you figure that?  The matrix T is a parameter.

so if we receive 4 new data (RGBA) per clock (continuous) and we
synchronize them right, we still output 4 results every 3 clocks,


And you said that this would be too slow.

we do have a quiet period of 2 clocks during which we are processing
data. The hardware requirement would be 4 multipliers, 4 adders and
some small control logic


Isn't that what I said was needed for a 4-word MAC unit?
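
As a rough software illustration of the 4-word MAC unit being discussed,
the sketch below accumulates a matrix-vector product one input component
per clock. It is not the design from this thread: the 4x4 size of T, the
name mac4_matvec, and the column-per-clock schedule are assumptions made
for the example, and pipeline latency is ignored.

```python
def mac4_matvec(T, v):
    """Model a 4-lane multiply-accumulate unit computing T * v.

    Assumption: T is 4x4 and v has 4 components (e.g. RGBA).
    Each clock, all 4 lanes multiply one column of T by the same
    input component and add the product to their accumulator.
    Returns (result, clocks).
    """
    lanes = 4
    acc = [0.0] * lanes
    clocks = 0
    for k in range(len(v)):        # one input component per clock
        for lane in range(lanes):  # parallel in hardware, serial here
            acc[lane] += T[lane][k] * v[k]
        clocks += 1
    return acc, clocks


if __name__ == "__main__":
    T = [[1, 0, 0, 0],
         [0, 2, 0, 0],
         [0, 0, 3, 0],
         [0, 0, 0, 4]]
    result, clocks = mac4_matvec(T, [1.0, 1.0, 1.0, 1.0])
    print(result, clocks)
```

Under these assumptions, 4 multipliers and 4 adders finish the 16
multiplies of one 4-component result in 4 clocks.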

and it would take fewer resources than a processor for doing the same
job,


You still seem to think that I am talking about using a whole processor.
 I am only talking about using the 4-word FPU.

the big requirement would be synchronizing data and that part is
rather easy. If you insist on doing the processing at one datum per
clock, we just put 3 blocks in parallel and it would take 12
multipliers and 12 adders. Less than 4 SIMD processors


How do you get 4 multiplies in 3 clocks with a pipelined MAC unit?

It would still be less efficient with a vector processor, unless you
suppose they don't need to fetch or store the data they process.

Fetching and storing the data is done in parallel with the processing;
that is the way that an FPU works.

ATI actually uses stream processors for their architecture, if I
remember right, and nVidia uses small scalar processors. Both need to
run a lot of cores on an ASIC,

Yes, lots of multipliers means lots of cores and lots of power. There is
no magic way around this.

Please take the time to consider this carefully. Your argument has
degenerated into 'hit & run' (a rhetorical device, not valid critical
thinking). As I said, there is no magic solution. If you want 16
multiplies per clock, it will take a pipeline with 16 multipliers.
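
The multiplier-count argument above reduces to simple arithmetic, which
can be sketched as follows. This is a back-of-the-envelope model only: it
assumes every multiplier is fully pipelined and accepts one multiply per
clock, and the function names are hypothetical.

```python
import math


def clocks_per_result(multiplies_per_result, multipliers):
    """Minimum clocks between results when each multiplier
    starts at most one multiply per clock (assumed)."""
    return math.ceil(multiplies_per_result / multipliers)


def multipliers_needed(multiplies_per_result, clocks):
    """Minimum multipliers to sustain one result every `clocks` clocks."""
    return math.ceil(multiplies_per_result / clocks)


if __name__ == "__main__":
    # 16 multiplies per result, as in the paragraph above.
    for n in (4, 12, 16):
        print(n, "multipliers:", clocks_per_result(16, n), "clocks per result")
    print("for 1 result per clock:", multipliers_needed(16, 1), "multipliers")
```

With 16 multiplies per result, this model gives 4 clocks per result for
4 multipliers and 1 clock per result only with 16 multipliers.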

The proof of the proposed algorithm is its implementation. If it is to
be implemented in a systolic array, it will take a lot of multipliers.
The only way I see to reduce the multiplier count is to use some form of
general purpose hardware which, IIUC, ATI & nVidia do use. There are
probably many ways to do this, but presuming that we are going to be
able to directly implement that code in hardware that will produce 1
pixel per clock is an unproven assumption.


--
JRT

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
