Wow!  Thanks, Andrew, that sounds like a lot of work!  Do you have any 
utilization comparisons of old vs new?

Dave

On Jul 16, 2013, at 6:22 AM, Andrew Martens wrote:

> Hi All
> 
> I have just completed a rather large overhaul of the CASPER FFT family. 
> Rigorous testing has yet to be performed but it "works" at the moment and has 
> been pushed to the ska-sa repo on github for early adopters/testers.
> 
> The main aim (and the reason this email continues this thread) was to save 
> BRAM usage (the limiting factor in most designs and the source of many 
> timing failures) in coefficient storage, using many of the techniques 
> described by Ryan (as well as some others).
> 
>> 
>> It would work well for the PFB, but what we *really* need is a solid "Direct 
>> Digital Synth (DDS) coefficient generator".  FFT coefficients are really 
>> just sampled points around the unit circle, so you could, in principle, use 
>> a recursive complex multiplier to generate the coefficients on the fly.  
>> You'll lose log2(sqrt(K)) bits for a recursion count of K, but that's 
>> probably OK most of the time.  Say you're doing a 2^14 point FFT, you need 
>> 2^13 coeffs.  You start with 18 bits of resolution and can do 1024 
>> iterations before you degrade down to the est. 2^13 resolution.  So you'll 
>> only need to store 8 "reset points".  Four of those will be 1, j, -1 and -j 
>> in this case.  You could thus replace 8 BRAM36'es with three DSPs.
>> 
> There is a new block (feedback_osc in the Downconverter subsection) that does 
> this. The number of "reset points" is configurable (as are the various bit 
> widths), allowing the user to control the phase error on the resultant 
> vector. Initial phase and phase step (frequency) are configurable at compile 
> time. Because the pipeline through the complex multiplier has non-zero 
> latency, each "reset point" must hold more than a single value, but having 
> many reset points reduces the error, so the extra values are not wasted.
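A minimal sketch of the idea (illustrative only, not the actual feedback_osc implementation; the function name, widths and reset scheme here are made up) using the numbers from Ryan's example, where 2^13 coefficients at 18 bits allow 1024 iterations per reset point and thus 8 reset points:

```python
import numpy as np

def feedback_osc(n, phase_step, n_resets, frac_bits=17):
    """Toy recursive oscillator: generate n points of exp(1j*phase_step*k)
    with a feedback complex multiply, re-seeding from an exact stored value
    at each of n_resets evenly spaced "reset points" to bound the error."""
    q = 2.0 ** -frac_bits                            # multiplier quantisation step
    rot = np.round(np.exp(1j * phase_step) / q) * q  # quantised rotation factor
    period = n // n_resets
    out = np.empty(n, dtype=complex)
    acc = 1.0 + 0.0j
    for k in range(n):
        if k % period == 0:                          # reload an exact stored value
            acc = np.exp(1j * phase_step * k)
        out[k] = acc
        acc = np.round(acc * rot / q) * q            # quantised feedback multiply
    return out

# 2^13 coefficients, 8 reset points => 1024 recursions per segment,
# losing roughly log2(sqrt(1024)) = 5 of the 18 starting bits.
coeffs = feedback_osc(8192, -2 * np.pi / 16384, n_resets=8)
ideal = np.exp(-2j * np.pi * np.arange(8192) / 16384)
err = np.max(np.abs(coeffs - ideal))
```

The error stays bounded because each reset discards the accumulated rounding error of the previous segment.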
> 
> This is (optionally) used in fft_direct to generate all coefficients when it 
> is used in fft_wideband_real or fft.
> 
>> In addition, for the FFT direct, the first stage has only one shared 
>> coefficient pattern, second stage has 2, third 4, etc.  You can, of course, 
>> share coefficients amongst a stage where possible.  
> 
> This is also implemented, and reduces the number of coefficient patterns to 
> be generated (e.g. from 12 to 7 in an 8-input direct FFT), which will be 
> significant for wide-bandwidth applications.
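Counting patterns for the 8-input case shows where the 12-to-7 reduction comes from (a quick sketch, using the stage pattern counts Ryan describes above):

```python
n_inputs = 8
stages = n_inputs.bit_length() - 1   # 3 stages for an 8-input direct FFT
butterflies = n_inputs // 2          # 4 butterflies per stage

# Without sharing, every butterfly in every stage needs its own pattern:
unshared = stages * butterflies      # 3 * 4 = 12

# With sharing, stage s needs only 2^(s-1) distinct patterns (1, 2, 4, ...):
shared = sum(2 ** (s - 1) for s in range(1, stages + 1))  # 1 + 2 + 4 = 7

print(unshared, shared)              # prints: 12 7
```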
> 
>> The real winnings occur when you realize that the other coefficient banks 
>> within later stages are actually the same coeffs as the first stage, with a 
>> constant phase rotation (again, I'm 90% sure but I'll check tomorrow 
>> morning).  So, you could generate your coefficients once, and then use a 
>> couple of complex multipliers to make the coeffs for the other stages.  BAM! 
>>  FFT Direct's coefficient memory utilization is *gone*
>> 
> Almost. Each set of coefficients within a stage is the same as the other 
> sets in that stage, up to a constant phase rotation. I actually implement 
> each set separately, so I am not quite as efficient as this.
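The constant-rotation property follows because the twiddles within a stage form a geometric sequence; a quick numpy check (the stage size and set split are arbitrary illustrative choices):

```python
import numpy as np

stage_size = 32                       # twiddles for this stage: W_32^k, k = 0..15
k = np.arange(stage_size // 2)
twiddles = np.exp(-2j * np.pi * k / stage_size)

n_sets, set_len = 4, 4                # split the 16 twiddles into 4 sets of 4
sets = twiddles.reshape(n_sets, set_len)

# Each set equals set 0 multiplied by one constant phasor:
# W^(m*L + j) = W^(m*L) * W^j, so the rotation depends only on the set index m.
for m in range(n_sets):
    rot = np.exp(-2j * np.pi * m * set_len / stage_size)
    assert np.allclose(sets[m], sets[0] * rot)
```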
> 
>> You could also do this for the FFT Biplex, but it would be a bit more 
>> complicated.  Whoever designed the biplex FFT used in-order inputs.  This is 
>> OK, but it means that the coefficients are in bit-reverse order.  So, you 
>> would have to move the biplex unscrambler to the beginning, change the mux 
>> logic, and replace the delay elements in the delay-commutator with some 
>> flavor of "delay, bit-reversed".  I don't know how that would look quite 
>> yet.  If you did that, your coefficients would become in-order, and you 
>> could achieve the same savings I described with the FFT-Direct.  
> 
> I did not implement this change, but did find a few ways of saving BRAM:
> 1. I store the real and imaginary components next to each other in the same 
> BRAM where it makes sense (18 bits or less of resolution, and few enough 
> coefficients to fit in the same BRAM).
> 2. I share coefficients by storing a single set in a dual-port BRAM and 
> manipulating the addressing on each port to produce the correct values. 
> 3. I store a quarter of a cycle of a sinusoid and generate the other values 
> via address manipulation and selective inversion of the output.
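Technique 3 can be sketched as follows (illustrative Python, not the cosin block itself; table size and names are made up):

```python
import numpy as np

N = 1024                                             # samples per full sine cycle
quarter = np.sin(2 * np.pi * np.arange(N // 4) / N)  # stored quarter cycle

def sin_lookup(k):
    """Reconstruct sin(2*pi*k/N) from the quarter-cycle table using
    address manipulation and selective negation of the output."""
    k = k % N
    quadrant, addr = divmod(k, N // 4)
    if quadrant % 2 == 1:            # quadrants 1 and 3 read the table backwards
        addr = N // 4 - addr
    # addr == N//4 falls just past the table; sin there is the peak value
    val = 1.0 if addr == N // 4 else quarter[addr]
    return -val if quadrant >= 2 else val            # lower half-cycle is negated
```

The same table also yields cosine by offsetting the read address by a quarter cycle, which is presumably how a combined cosin block earns its name.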
> 
> The new cosin block (in the Downconverter section) provides this 
> functionality. It can also be used in fft_direct in place of the 
> feedback_osc block if DSP slices are scarce, allowing the user to trade off 
> multipliers against BRAMs.
> 
> The number of coefficients in the current in-order biplex FFT could be 
> reduced to a single copy. Each biplex stage uses the same coefficients as 
> the previous stage (e.g. bit_reverse(0b00, 0b01, 0b10, 0b11) == 
> bit_reverse(0b000, 0b001, 0b010, 0b011)) for half its coefficients, then 
> adds the same number of coefficients never used in previous stages. In 
> theory, each stage could feed its coefficients to the next stage, which 
> would add its unique coefficients and pass the resultant set on. This would 
> be reasonably expensive in logic and registers (muxes, delay chains, 
> counters etc.) but would halve the number of coefficient BRAMs required.
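The reuse can be checked numerically: indexing the twiddles of stage s in bit-reversed order as exp(-2j*pi*bit_reverse(k)/2^s), the first half of each stage's set reproduces the previous stage's set exactly (a sketch, assuming this conventional bit-reversed biplex coefficient ordering):

```python
import numpy as np

def bit_reverse(x, bits):
    """Reverse the low `bits` bits of x."""
    return int(format(x, f'0{bits}b')[::-1], 2)

def stage_coeffs(s):
    """Twiddles of biplex stage s in bit-reversed order:
    exp(-2j*pi*bit_reverse(k, s-1)/2**s) for k = 0..2**(s-1)-1."""
    return np.array([np.exp(-2j * np.pi * bit_reverse(k, s - 1) / 2 ** s)
                     for k in range(2 ** (s - 1))])

# The first half of stage s's coefficients is exactly stage s-1's full set,
# because bit_reverse(k, s-1) == 2 * bit_reverse(k, s-2) for small k.
for s in range(3, 7):
    assert np.allclose(stage_coeffs(s)[:2 ** (s - 2)], stage_coeffs(s - 1))
```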
> 
>> Also, I implement coefficient and control logic sharing in my biplex and 
>> direct FFT and it works *really well* at managing the fabric and memory 
>> utilization.  Worth a shot. 
>> 
> My changes also allow this. Data is passed through the biplex FFT and 
> operated on as a bus (via various blocks developed for this purpose in the 
> Bus subsection). A single instance of the control logic and coefficients is 
> used. In this way, wide-bandwidth data or many inputs can be processed with 
> maximum efficiency, reducing logic and coefficient usage by approximately a 
> factor of the number of parallel streams. Fanout (specifically twiddle 
> factor fanout to multipliers) can be controlled via a top-level parameter to 
> help achieve timing. 
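Conceptually, the sharing looks like one butterfly datapath vectorised across streams (a numpy sketch of the idea, not the actual Bus-subsection blocks; the sizes are arbitrary):

```python
import numpy as np

n_streams, n_pairs = 4, 8             # 4 parallel streams through one FFT stage
rng = np.random.default_rng(0)
shape = (n_streams, n_pairs)
a = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
b = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# One twiddle set serves every stream; broadcasting applies the same
# coefficients (and, implicitly, the same control sequence) to the whole bus.
twiddles = np.exp(-2j * np.pi * np.arange(n_pairs) / (2 * n_pairs))
top = a + twiddles * b                # butterfly outputs for all streams at once
bot = a - twiddles * b
```

In hardware, only the multipliers and adders scale with the number of streams; the coefficient storage and control counters are instantiated once.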
> 
> Some other FFT additions:
> 
> 1. Asynchronous operation support - the FFT can be generated with en(able) 
> and dvalid ports. The enable signal travels with the data and appears 
> alongside valid data at the output. All data inputs to the block are assumed 
> to carry valid data, i.e. the data ports do not have independent en/dvalid 
> ports; the single en/dvalid pair applies to all of them.
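A toy model of that behaviour (the latency value and names are illustrative, not taken from the actual block):

```python
from collections import deque

LATENCY = 6  # made-up pipeline depth for illustration

def async_pipeline(samples):
    """Toy model of en/dvalid handling: the enable flag travels with its
    data sample through the pipeline and emerges as dvalid at the output."""
    pipe = deque([(0, None)] * LATENCY)   # pipeline starts full of invalid data
    for en, data in samples:
        pipe.append((en, data))
        yield pipe.popleft()              # (dvalid, data_out)

stream = [(1, 10), (0, None), (1, 11)] + [(0, None)] * LATENCY
out = list(async_pipeline(stream))
```

Each valid input re-emerges exactly LATENCY cycles later with dvalid asserted, regardless of the gaps between enables.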
> 
> 2. Bit growth. The FFT can support bit growth instead of shifting at each 
> stage. This can reduce shifting logic and the danger of overflow. An upper 
> limit on the growth (with the remaining stages implementing shifting as 
> before) allows the user to have a mixture of the two.
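For example, a hypothetical width schedule for an 18-bit input through an 8-stage FFT with growth capped at 4 stages (a sketch of the policy; the actual widths depend on the implementation):

```python
def stage_widths(in_bits, n_stages, max_growth):
    """Each radix-2 stage adds one bit of growth until the cap is reached;
    the remaining stages shift (scale down) as before, so the data width
    stays constant from there on."""
    widths, w = [], in_bits
    for s in range(n_stages):
        if s < max_growth:
            w += 1              # growing stage: no shift, no overflow risk
        widths.append(w)        # shifting stages keep the capped width
    return widths

print(stage_widths(18, 8, 4))   # prints: [19, 20, 21, 22, 22, 22, 22, 22]
```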
> 
> 3. As mentioned above, each FFT block can process multiple 
> polarisations/streams/sources simultaneously to increase efficiency. This is 
> controlled via a top-level parameter.
> 
> I will be updating the wiki to reflect these changes and hope to include them 
> in the CASPER library release before this year's CASPER workshop.
> 
> Regards
> Andrew

