> Thank you--that's very useful. I didn't know the DSP slices could do 5 ns 
> multiplies.
Yes, but this is a little misleading, because the latency is higher: a 
single multiply operation actually takes a few clock cycles once you register 
the data on the input and the output. But you can pipeline it so that you 
start a new multiply on every clock cycle, so the throughput is effectively 
one result every 5ns. It is also possible (with a lot of effort and manual 
placement) to clock ROACH-1's DSP slices close to 400MHz, which would double 
your processing throughput. 200MHz is easy though, and the tools' automatic 
place and route will achieve this.
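
To make the latency versus throughput distinction concrete, here is a minimal 
Python sketch of a pipelined multiplier. It is a toy model, not how the DSP 
slice is actually implemented: the function name pipelined_multiply and the 
3-stage depth are assumptions for illustration, and the real latency depends 
on how many registers your design uses.

```python
from collections import deque

CLOCK_NS = 5          # 200 MHz FPGA clock, as discussed in the thread
PIPELINE_STAGES = 3   # assumed latency in clocks; the real value is design dependent


def pipelined_multiply(pairs, stages=PIPELINE_STAGES):
    """Toy model of a multiplier that accepts one operand pair per clock.

    Returns a list of (clock_cycle, product), one entry per result, in the
    order the results leave the pipeline.
    """
    pipe = deque([None] * stages)          # one slot per pipeline register
    feed = list(pairs) + [None] * stages   # idle clocks at the end flush the pipe
    results = []
    for cycle, item in enumerate(feed):
        # A new product enters the pipeline on every clock...
        pipe.append(None if item is None else item[0] * item[1])
        # ...and whatever entered `stages` clocks ago leaves it.
        done = pipe.popleft()
        if done is not None:
            results.append((cycle, done))
    return results


if __name__ == "__main__":
    for cycle, product in pipelined_multiply([(1, 2), (3, 4), (5, 6), (7, 8)]):
        print(f"t = {cycle * CLOCK_NS:2d} ns: product {product}")
```

The first product only appears after the pipeline has filled (15 ns with the 
assumed 3 stages), but after that a new product appears every 5 ns.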

> Ultimately what I'm getting at here is trying to estimate how many 
> filter taps I can reasonably support on a 5 ns clock, with new data words 
> arriving on every clock, questions of available chip resources aside.
> If I understand this correctly, even with new data arriving on every 5 ns 
> clock, ROACH should (up to practical considerations) be able to operate as 
> many taps as can fit on the FPGA. Is this right?

I think you've got the right of it. The key concept to remember is that the 
FPGA does a lot in parallel to increase throughput.

Multiple data samples arrive in parallel (for example, 800MHz sample clock, 
200MHz FPGA means you get 4 samples at a time). You can add as many taps as you 
want and process any signal as quickly as you want, provided you have 
sufficient resources to do it in parallel. Have a look at this great memo from 
Rurik, which discusses the scaling requirements as you increase the number of 
taps, the bandwidth etc.:
https://www.cfa.harvard.edu/twiki/pub/SMAwideband/MemoSeries/sma_wideband_utilization_1.pdf
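
As a rough back-of-the-envelope sketch of that scaling (not taken from the 
memo), the Python below estimates multiplier usage under some loud 
assumptions: real-valued samples, one dedicated multiplier per tap per 
parallel sample, and no coefficient symmetry or time-multiplexing tricks. 
The function names are made up for illustration; the 640 DSP slice count is 
the ROACH-1 figure quoted further down this thread.

```python
FPGA_CLOCK_MHZ = 200      # FPGA clock rate used throughout the thread
DSP_SLICES_ROACH1 = 640   # ROACH-1 DSP slice count quoted in the thread


def parallel_samples(sample_clock_mhz, fpga_clock_mhz=FPGA_CLOCK_MHZ):
    """Samples the FPGA must accept on each of its own clock cycles."""
    if sample_clock_mhz % fpga_clock_mhz:
        raise ValueError("sample clock assumed to be a multiple of the FPGA clock")
    return sample_clock_mhz // fpga_clock_mhz


def multipliers_needed(taps, samples_per_clock):
    """Assumption: one dedicated multiplier per tap per parallel sample."""
    return taps * samples_per_clock


def max_taps(multipliers_available, samples_per_clock):
    """Upper bound on taps if every available multiplier went to the filter."""
    return multipliers_available // samples_per_clock


p = parallel_samples(800)   # 800 MHz sample clock on a 200 MHz FPGA
print(p, multipliers_needed(16, p), max_taps(DSP_SLICES_ROACH1, p))
# prints: 4 64 160
```

The 160-tap figure is only a ceiling on multiplier count. It ignores 
everything else a real design needs (the FFT, routing, memory), so treat it 
as an upper bound rather than something you would actually build.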

In most practical designs, we use a 2 to 16 tap PFB, depending on the 
requirements. Beyond 16 taps for a few-thousand channel FFT, you need more than 
18 bits for your coefficients, and the problem no longer maps well onto the DSP 
slices (whose multipliers are 18x25 bits).

Hope this helps!

Jason

> On Mon, Sep 17, 2012 at 11:45 PM, Jason Manley <[email protected]> wrote:
> The latency through an FPGA will be high relative to a CPU/GPU, because the 
> FPGA's clock rate is lower (1/200MHz=5ns). But these operations can be 
> pipelined so that you can do a DSP operation on every clock cycle. ROACH 1 
> and ROACH 2 will both run at 200MHz very easily.
> 
> Considering ROACH-1, it has 640 DSP slices and you can do up to an 18 bit x 
> 25 bit multiply in a single DSP slice. So you can do 640 multiply (and/or 
> addition) operations every 1/200MHz=5ns.
> 
> But then you can also start using the 14720 slices for multipliers or adders 
> so you can get many more operations per second. And then, if you're doing low 
> resolution operations, you can fill the 244 BRAMs with lookup tables and just 
> lookup the product for a given input vector to do even more operations on 
> every clock cycle.
> 
> If you wanted to throw the whole FPGA at DSP operations, you could easily say 
> that a ROACH-1 board is capable of over 2 TeraOps/s for 4-bit operations 
> (common in radio astronomy). But this is an unrealistic figure of merit 
> because it ignores things like pipelining registers and data routing 
> requirements, memory controllers and the like which would all be needed in a 
> practical design.
> 
> Jason
> 
> On 18 Sep 2012, at 05:20, Alex Zahn wrote:
> 
> > I've been browsing the xilinx literature, but I just can't seem to get any 
> > idea how long one can usually expect addition and multiplication operations 
> > to take. I realize this depends on a lot of factors in the design, but does 
> > anyone know if it's reasonable to multiply two 16 bit numbers in a single 
> > clock with a clock rate of 200 MHz? I would test this out on my ROACH to 
> > find out, but I'm away from lab for a while, and thus rendered rather 
> > helpless for the time being.
> >
> > Unrelated, is there any online documentation on the new snapshot block?
> >
> > -Alex Zahn
> 
> 

