The latency through an FPGA will be high relative to a CPU/GPU, because the FPGA's clock rate is lower (at 200 MHz, one clock period is 1/200 MHz = 5 ns). But these operations can be pipelined, so you can still start a new DSP operation on every clock cycle. ROACH-1 and ROACH-2 will both run at 200 MHz very easily.
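To put numbers on the latency versus throughput distinction, here is a rough Python sketch. The 200 MHz clock is the figure from this message; the 4-stage pipeline depth is purely an assumed value for illustration (the real depth depends on the design), and the function names are made up for this example.

```python
def clock_period_ns(f_clk_hz: float) -> float:
    """Length of one clock cycle in nanoseconds."""
    return 1e9 / f_clk_hz


def pipeline_latency_ns(f_clk_hz: float, stages: int) -> float:
    """Time for one result to emerge from a pipeline with `stages` register stages."""
    return stages * clock_period_ns(f_clk_hz)


def peak_ops_per_second(f_clk_hz: float, n_units: int = 1) -> float:
    """Peak rate when every unit accepts a new operation on every clock cycle."""
    return f_clk_hz * n_units


F_CLK_HZ = 200e6      # clock rate quoted in this message
ASSUMED_STAGES = 4    # hypothetical pipeline depth, not a measured value

period_ns = clock_period_ns(F_CLK_HZ)
latency_ns = pipeline_latency_ns(F_CLK_HZ, ASSUMED_STAGES)
ops_one_unit = peak_ops_per_second(F_CLK_HZ)
ops_640_units = peak_ops_per_second(F_CLK_HZ, n_units=640)

print(f"clock period: {period_ns} ns")
print(f"latency through {ASSUMED_STAGES} assumed stages: {latency_ns} ns")
print(f"peak rate, one unit: {ops_one_unit:.3g} ops/s")
print(f"peak rate, 640 units: {ops_640_units:.3g} ops/s")
```

The point is that latency grows with pipeline depth while the peak rate does not: a deeper pipeline delays the first result but still delivers one result per clock per unit after that. The 640-unit case corresponds to the ROACH-1 DSP slice count given below and is an upper bound that ignores routing and control overhead.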
Considering ROACH-1, it has 640 DSP slices, and you can do up to an 18 bit x 25 bit multiply in a single DSP slice. So you can do 640 multiplies (and/or additions) every 1/200 MHz = 5 ns. But then you can also start using the 14720 slices for multipliers or adders, so you can get many more operations per second. And then, if you're doing low-resolution operations, you can fill the 244 BRAMs with lookup tables and just look up the product for a given input vector to do even more operations on every clock cycle. If you wanted to throw the whole FPGA at DSP operations, you could easily say that a ROACH-1 board is capable of over 2 TeraOps/s for 4-bit operations (common in radio astronomy). But this is an unrealistic figure of merit, because it ignores things like pipelining registers, data routing requirements, memory controllers and the like, which would all be needed in a practical design.

Jason

On 18 Sep 2012, at 05:20, Alex Zahn wrote:

> I've been browsing the xilinx literature, but I just can't seem to get any
> idea how long one can usually expect addition and multiplication operations
> to take. I realize this depends on a lot of factors in the design, but does
> anyone know if it's reasonable to multiply two 16 bit numbers in a single
> clock with a clock rate of 200 MHz? I would test this on my ROACH out to find
> out, but I'm away from lab for a while, and thus rendered rather helpless for
> the time being.
>
> Unrelated, is there any online documentation on the new snapshot block?
>
> -Alex Zahn

