The latency through an FPGA will be high relative to a CPU/GPU, because the 
FPGA's clock rate is lower (at 200MHz, one clock period is 1/200MHz=5ns). But 
these operations can be pipelined so that you can complete a DSP operation on 
every clock cycle. ROACH-1 and ROACH-2 will both run at 200MHz very easily. 
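
To make the latency/throughput distinction concrete, here is a rough Python 
sketch. The 4-stage pipeline depth is purely an assumed, illustrative number 
(the real depth depends on the design); only the 200MHz clock comes from the 
figures above.

```python
# Sketch: latency vs. throughput of a pipelined operation at a 200 MHz clock.
# The pipeline depth (4 stages) is an assumed, illustrative value.

CLOCK_HZ = 200e6  # 200 MHz, the clock rate quoted above


def pipeline_timing(stages, n_inputs, clock_hz=CLOCK_HZ):
    """Return (period_ns, latency_ns, total_ns) for a fully pipelined unit."""
    period_ns = 1e9 / clock_hz                      # one clock period
    latency_ns = stages * period_ns                 # time for one result to emerge
    total_ns = (stages + n_inputs - 1) * period_ns  # time until the last result
    return period_ns, latency_ns, total_ns


period_ns, latency_ns, total_ns = pipeline_timing(stages=4, n_inputs=1000)
print(f"clock period : {period_ns} ns")
print(f"latency      : {latency_ns} ns")
print(f"1000 results : {total_ns} ns")
```

Under these assumptions each result takes 4 cycles to emerge, but once the 
pipeline is full a new result comes out on every cycle.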

Considering ROACH-1, it has 640 DSP slices, and you can do up to an 18 bit x 25 
bit multiply in a single DSP slice. So you can do 640 multiplies (and/or 
additions) on every clock cycle, i.e. every 1/200MHz=5ns.
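
As a back-of-the-envelope check, the peak rate is just slices times clock 
rate. The sketch below uses the 640 DSP slices and 200MHz quoted above; 
counting a multiply-add as two operations is my own assumption about how you 
might choose to count, not a standard.

```python
# Sketch: peak operation rate from the DSP slices alone.
# 640 slices and 200 MHz are the figures quoted above; ops_per_unit_per_cycle
# is an assumed counting convention (1 = multiply only, 2 = multiply + add).

DSP_SLICES = 640
CLOCK_HZ = 200_000_000


def peak_ops_per_second(units, clock_hz, ops_per_unit_per_cycle=1):
    """Upper bound: every unit completes its operations on every cycle."""
    return units * clock_hz * ops_per_unit_per_cycle


multiplies = peak_ops_per_second(DSP_SLICES, CLOCK_HZ)
multiply_adds = peak_ops_per_second(DSP_SLICES, CLOCK_HZ, ops_per_unit_per_cycle=2)

print(f"multiplies only : {multiplies / 1e9:.0f} GOps/s")
print(f"multiply + add  : {multiply_adds / 1e9:.0f} GOps/s")
```

This is an upper bound that assumes every DSP slice is kept busy on every 
cycle.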

But then you can also start using the 14720 slices for multipliers or adders, 
so you can get many more operations per second. And then, if you're doing 
low-resolution operations, you can fill the 244 BRAMs with lookup tables and 
just look up the product for a given input vector to do even more operations 
on every clock cycle.
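
The lookup-table idea can be modelled in a few lines of Python. This is only 
a software sketch, assuming unsigned 4-bit inputs and an address formed by 
concatenating the two operands; a real BRAM-based design might use signed 
values or pack several tables per memory.

```python
# Sketch: a product lookup table for two unsigned 4-bit operands.
# Unsigned inputs and the {a, b} address layout are assumptions for illustration.

BITS = 4
N = 1 << BITS  # 16 possible values per operand

# Address = a concatenated with b (a in the high bits), 256 entries in total.
PRODUCT_LUT = [a * b for a in range(N) for b in range(N)]


def lookup_product(a, b):
    """Return a*b by table lookup instead of multiplying."""
    if not (0 <= a < N and 0 <= b < N):
        raise ValueError("operands must fit in 4 unsigned bits")
    return PRODUCT_LUT[(a << BITS) | b]


# Storage needed: every product fits in 8 bits (max is 15 * 15 = 225).
table_bits = len(PRODUCT_LUT) * 8

print(lookup_product(3, 5))
print(f"{len(PRODUCT_LUT)} entries, {table_bits} bits")
```

The table replaces the multiplier entirely: the two operands form the read 
address and the stored word is the product.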

If you wanted to throw the whole FPGA at DSP operations, you could easily say 
that a ROACH-1 board is capable of over 2 TeraOps/s for 4-bit operations 
(common in radio astronomy). But this is an unrealistic figure of merit because 
it ignores things like pipelining registers, data routing requirements, memory 
controllers and the like, all of which would be needed in a practical design.
 
Jason

On 18 Sep 2012, at 05:20, Alex Zahn wrote:

> I've been browsing the xilinx literature, but I just can't seem to get any 
> idea how long one can usually expect addition and multiplication operations 
> to take. I realize this depends on a lot of factors in the design, but does 
> anyone know if it's reasonable to multiply two 16 bit numbers in a single 
> clock with a clock rate of 200 MHz? I would test this on my ROACH out to find 
> out, but I'm away from lab for a while, and thus rendered rather helpless for 
> the time being.
> 
> Unrelated, is there any online documentation on the new snapshot block?
> 
> -Alex Zahn

