> Thank you--that's very useful. I didn't know the DSP slices could do 5 ns
> multiplies.

Yes, but this is a little misleading, because the latency is higher than that: a single multiply actually takes a few clock cycles once you register the data on the input and the output. But you can pipeline it so that a new multiply completes on every clock cycle, so the throughput is effectively one result every 5 ns.

It is also possible (with a lot of effort and manual placement) to clock ROACH-1's DSP slices close to 400MHz, which would double your processing throughput. 200MHz is easy, though, and the tools' automatic place-and-route will achieve it.
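To make the latency-versus-throughput distinction concrete, here is a minimal Python sketch of a pipelined multiplier. It is only a software model, not how the DSP slice is actually implemented: the three-stage depth (`PIPELINE_STAGES`) and the function name `pipelined_multiply` are assumptions chosen for illustration, and only the 5 ns clock period comes from the thread.

```python
from collections import deque

CLOCK_NS = 5          # 200 MHz FPGA clock period, as quoted in the thread
PIPELINE_STAGES = 3   # ASSUMED latency in clock cycles, for illustration only


def pipelined_multiply(pairs, stages=PIPELINE_STAGES):
    """Model a multiplier with `stages` register stages.

    Returns a list of (cycle, product) tuples, where `cycle` is the clock
    cycle on which each product appears at the output.
    """
    pipe = deque([None] * stages)   # one slot per register stage
    inputs = iter(pairs)
    pending = len(pairs)            # products still to emerge
    results = []
    cycle = 0
    while pending:
        operands = next(inputs, None)
        # A new operand pair enters on every clock (or a bubble once the
        # input is exhausted), and the oldest slot leaves the pipeline.
        pipe.append(operands[0] * operands[1] if operands is not None else None)
        out = pipe.popleft()
        if out is not None:
            results.append((cycle, out))
            pending -= 1
        cycle += 1
    return results


if __name__ == "__main__":
    for cycle, product in pipelined_multiply([(1, 2), (3, 4), (5, 6), (7, 8)]):
        print(f"t = {cycle * CLOCK_NS} ns: product = {product}")
```

With the assumed three stages, the first product appears after 3 cycles (15 ns), and each later product follows one clock (5 ns) after the previous one. That is the sense in which the multiply is "effectively every 5 ns" even though any single multiply takes longer.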
> Ultimately what I'm getting at here is trying to estimate how many
> filter taps I can reasonably support on a 5 ns clock, with new data words
> arriving on every clock, questions of available chip resources aside.
> If I understand this correctly, even with new data arriving on every 5 ns
> clock, ROACH should (up to practical considerations) be able to operate as
> many taps as can fit on the FPGA. Is this right?

I think you've got the right of it. The key concept is that the FPGA does a lot in parallel to increase throughput. Multiple data samples arrive in parallel (for example, an 800MHz sample clock with a 200MHz FPGA clock means you get 4 samples at a time). You can add as many taps as you want and process any signal as quickly as you want, provided you have sufficient resources to do it in parallel.

Have a look at this great memo from Rurik, which discusses the scaling requirements as you increase the number of taps, the bandwidth, etc.:
https://www.cfa.harvard.edu/twiki/pub/SMAwideband/MemoSeries/sma_wideband_utilization_1.pdf

In most practical designs, we use a 2 to 16 tap PFB, depending on the requirements. Beyond 16 taps for a few-thousand channel FFT, you need more than 18 bits for your coefficients and the problem doesn't map very well to the DSP slices anymore (18x25 multipliers).

Hope this helps!

Jason

> On Mon, Sep 17, 2012 at 11:45 PM, Jason Manley <[email protected]> wrote:
> The latency through an FPGA will be high relative to a CPU/GPU, because the
> FPGA's clock rate is lower (1/200MHz=5ns). But these operations can be
> pipelined so that you can do a DSP operation on every clock cycle. ROACH 1
> and ROACH 2 will both run at 200MHz very easily.
>
> Considering ROACH-1, it has 640 DSP slices and you can do up to an 18 bit x
> 25 bit multiply in a single DSP slice. So you can do 640 multiply (and/or
> addition) operations every 1/200MHz=5ns.
>
> But then you can also start using the 14720 slices for multipliers or adders
> so you can get many more operations per second. And then, if you're doing low
> resolution operations, you can fill the 244 BRAMs with lookup tables and just
> look up the product for a given input vector to do even more operations on
> every clock cycle.
>
> If you wanted to throw the whole FPGA at DSP operations, you could easily say
> that a ROACH-1 board is capable of over 2 TeraOps/s for 4-bit operations
> (common in radio astronomy). But this is an unrealistic figure of merit
> because it ignores things like pipelining registers and data routing
> requirements, memory controllers and the like which would all be needed in a
> practical design.
>
> Jason
>
> On 18 Sep 2012, at 05:20, Alex Zahn wrote:
>
> > I've been browsing the Xilinx literature, but I just can't seem to get any
> > idea how long one can usually expect addition and multiplication operations
> > to take. I realize this depends on a lot of factors in the design, but does
> > anyone know if it's reasonable to multiply two 16 bit numbers in a single
> > clock with a clock rate of 200 MHz? I would test this out on my ROACH to
> > find out, but I'm away from lab for a while, and thus rendered rather
> > helpless for the time being.
> >
> > Unrelated, is there any online documentation on the new snapshot block?
> >
> > -Alex Zahn
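The resource arithmetic running through this thread can be sketched in a few lines of Python. This is a back-of-the-envelope estimate, not a design tool. The 200MHz clock, the 640 DSP slices and the 800MHz example come from the messages above. The function names are made up for illustration, and the model assumes real-valued samples with one hardware multiplier per tap for each sample processed in parallel. It ignores coefficient symmetry, complex data, and the routing and pipelining overheads mentioned above.

```python
# Figures quoted in the thread for ROACH-1.
FPGA_CLOCK_HZ = 200_000_000
DSP_SLICES = 640


def parallel_samples(sample_rate_hz, fpga_clock_hz=FPGA_CLOCK_HZ):
    """Number of samples that arrive together on each FPGA clock cycle."""
    count, remainder = divmod(sample_rate_hz, fpga_clock_hz)
    if count < 1 or remainder:
        raise ValueError("sample rate must be a whole multiple of the FPGA clock")
    return count


def multipliers_needed(taps, sample_rate_hz):
    """ASSUMPTION: one multiplier per tap for each sample handled in parallel."""
    return taps * parallel_samples(sample_rate_hz)


def max_taps(sample_rate_hz, multipliers=DSP_SLICES):
    """Largest tap count that fits if every DSP slice acts as one multiplier."""
    return multipliers // parallel_samples(sample_rate_hz)


def peak_multiplies_per_second(multipliers=DSP_SLICES, fpga_clock_hz=FPGA_CLOCK_HZ):
    """Upper bound with every multiplier busy on every clock cycle."""
    return multipliers * fpga_clock_hz


if __name__ == "__main__":
    print(parallel_samples(800_000_000))        # samples per FPGA clock
    print(multipliers_needed(16, 800_000_000))  # 16 taps at 800 MS/s
    print(max_taps(200_000_000))                # one new sample per 5 ns clock
    print(peak_multiplies_per_second())         # DSP slices only
```

Under these assumptions, an 800MHz sample clock gives 4 samples per FPGA clock, so 16 taps need 64 multipliers. With one new sample per 5 ns clock, the DSP slices alone would cap a design at 640 taps, for a peak of 1.28e11 multiplies per second. As noted above, a practical design will land well below such peak figures.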

