André Pouliot wrote:
The problem remains the same: even if you use microcode, you can't get near
1 operation per cycle for a processor in an FPGA and still run it fast. It's either fast but multicycle, or single-cycle but slow.

IIUC, the limiting factor would be the speed of the multiply.
Specifically, the speed of the hardware multiply array.  This is the
same limitation no matter how you arrange the hardware.

<SNIP>

The RTL for the Sun SPARC processor is free. But put the logic used for the Niagara 1 chip in an FPGA and it is too big to fit in the 3S4000, even for only 1 core.

But how big is the vector processor?

<SNIP>
Are we really going to have that kind of hardware available?
No, we don't have that kind of hardware to spare if we do it fully parallel and try to do it in one cycle. But we could easily do it by
using something like a pipeline.

Isn't that what I am suggesting, a pipelined MAC unit?

For your sample problem we receive more data than we generate

How do you figure that?  The matrix T is a parameter.

so if we receive 4 new data values (RGBA) per clock (continuously) and we synchronize them right, we still output 4 results every 3 clocks,

And you said that this would be too slow.

we do have a quiet period of 2 clocks while we are processing data. The
hardware requirement would be 4 multipliers, 4 adders, and some small control logic

Isn't that what I said was needed for a 4-word MAC unit?

and it would take fewer resources than a processor doing the same
job,

You still seem to think that I am talking about using a whole processor.
I am only talking about using the 4-word FPU.
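
To make the 4-word MAC unit concrete, here is a minimal Python sketch. It assumes T is a 4x4 matrix applied to one RGBA vector, which the thread implies but does not state. The name mac4_matvec and the one-column-per-clock schedule are illustrative assumptions, not a description of any actual design.

```python
# Hypothetical model of a 4-word (4-lane) MAC unit: 4 multipliers, 4 adders.
# Assumption: T is 4x4 and v is one RGBA vector; names are illustrative only.

def mac4_matvec(T, v):
    """Return (T @ v, issue_clocks) for a 4x4 matrix T and a 4-vector v."""
    acc = [0.0, 0.0, 0.0, 0.0]   # one accumulator per lane
    clocks = 0
    for j in range(4):           # one input component is consumed per clock
        for i in range(4):       # in hardware these 4 lanes run in parallel
            acc[i] += T[i][j] * v[j]
        clocks += 1              # 4 multiplies issued this clock
    return acc, clocks

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
result, clocks = mac4_matvec(identity, [0.25, 0.5, 0.75, 1.0])
print(result, clocks)
```

In this model the 16 multiplies of one matrix-vector product occupy the 4 multipliers for 4 issue clocks, not counting pipeline latency.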

the big requirement would be synchronizing data, and that part is
rather easy. If you insist on processing one data item per clock, we
just put 3 blocks in parallel and it would take 12 multipliers and 12
adders. Less than 4 SIMD processors

How do you get 4 multiplies in 3 clocks with a pipelined MAC unit?

It would still be less efficient with a vector processor, unless you
suppose they don't need to fetch or store the data they process.

Fetching and storing the data is done in parallel with the processing -- that is the way that an FPU works.

ATI actually uses stream processors for their architecture, if I remember right, and nVidia uses small scalar processors. Both need to run a lot of cores on an ASIC,

Yes, lots of multipliers means lots of core and lots of power. There is no magic way around this.

Please take the time to consider this carefully. Your argument has degenerated into 'hit & run' (a rhetorical device, not valid critical thinking). As I said, there is no magic solution. If you want 16 multiplies per clock, it will take a pipeline with 16 multipliers.
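
The arithmetic behind that statement can be sketched in a few lines of Python. This assumes fully pipelined multipliers that each accept one multiply per clock and ignores pipeline latency; the function names and the figure of 16 multiplies per result are illustrative assumptions.

```python
import math

# Assumption: each multiplier is fully pipelined and accepts one multiply
# per clock. Function names are hypothetical, for illustration only.

def multipliers_needed(multiplies_per_result, results_per_clock):
    """Multipliers required to sustain the requested result rate."""
    return math.ceil(multiplies_per_result * results_per_clock)

def clocks_per_result(multiplies_per_result, multipliers):
    """Issue clocks per result when the multipliers are time-shared."""
    return math.ceil(multiplies_per_result / multipliers)

print(multipliers_needed(16, 1))   # 16 multiplies per result, 1 result per clock
print(clocks_per_result(16, 4))    # same work shared across 4 multipliers
```

Under these assumptions, the only trade available is between multiplier count and clocks per result; the total number of multiplies does not change.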

The proof of the proposed algorithm is its implementation. If it is to be implemented in a systolic array, it will take a lot of multipliers. The only way I see to reduce the multiplier count is to use some form of general-purpose hardware, which, IIUC, ATI & nVidia do use. There are probably many ways to do this, but the idea that we will be able to implement that code directly in hardware that produces 1 pixel per clock is an unproven assumption.

--
JRT
