Hi All

I have just completed a rather large overhaul of the CASPER FFT family. Rigorous testing has yet to be performed, but it "works" at the moment and has been pushed to the ska-sa repo on github for early adopters/testers.

The main aim (and the reason this email continues this thread) was to reduce BRAM usage in coefficient storage (BRAM being the limiting factor in most designs and the source of many timing failures), using many of the techniques described by Ryan, as well as some others.


    It would work well for the PFB, but what we *really* need is a
    solid "Direct Digital Synth (DDS) coefficient generator".  FFT
    coefficients are really just sampled points around the unit
    circle, so you could, in principle, use a recursive complex
    multiplier to generate the coefficients on the fly.  You'll lose
    log2(sqrt(K)) bits for a recursion count of K, but that's probably
    OK most of the time.  Say you're doing a 2^14 point FFT, you need
    2^13 coeffs.  You start with 18 bits of resolution and can do 1024
    iterations before you degrade down to the est. 2^13 resolution.
     So you'll only need to store 8 "reset points".  Four of those
    will be 1, j, -1 and -j in this case.  You could thus replace 8
    BRAM36'es with three DSPs.

There is a new block (feedback_osc in the Downconverter subsection) that does this. The number of "reset points" is configurable (as are the various bit widths), allowing the user to control the phase error on the resultant vector. Initial phase and phase step (frequency) are compile-time configurable. Because the pipeline through the complex multiplier has non-zero latency, more than a single value must be stored per "reset point"; having many of these reduces error, though, so the extra values are not wasted.
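The recursion-with-reset-points idea can be sketched in a few lines of Python. This is a behavioural illustration only, not the Simulink implementation; the function name and the double-precision arithmetic are mine (in hardware the accumulator is fixed point, so the reset spacing is what bounds the accumulated rounding error):

```python
# Sketch of the feedback_osc idea: generate N points around the unit
# circle by repeated complex multiplication by a fixed phase step,
# reloading an exactly-computed "reset point" every N // R samples so
# rounding error cannot accumulate indefinitely.
import cmath
import math

def recursive_twiddles(n_points, n_resets):
    step = cmath.exp(-2j * math.pi / n_points)   # per-sample phase step
    span = n_points // n_resets                  # samples between resets
    out = []
    acc = 1.0 + 0j
    for k in range(n_points):
        if k % span == 0:
            # Reload an exact reset point (stored in BRAM in hardware;
            # four of them are simply 1, -j, -1 and j).
            acc = cmath.exp(-2j * math.pi * k / n_points)
        out.append(acc)
        acc *= step                              # one complex multiply
    return out

coeffs = recursive_twiddles(1024, 8)
exact = [cmath.exp(-2j * math.pi * k / 1024) for k in range(1024)]
worst = max(abs(a - b) for a, b in zip(coeffs, exact))
```

With 8 reset points the recursion only ever runs for 128 steps before being re-anchored, which is the knob that trades stored values against phase error.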

This is (optionally) used in fft_direct to generate all coefficients when it is used in fft_wideband_real or fft.

    In addition, for the FFT direct, the first stage has only one
    shared coefficient pattern, second stage has 2, third 4, etc.  You
    can, of course, share coefficients amongst a stage where possible.


This is also implemented, and reduces the number of coefficient patterns to be generated (e.g. from 12 to 7 in an 8-input direct FFT), which will be significant for large bandwidth applications.
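The saving follows from the 1, 2, 4, ... pattern per stage that Ryan describes. A small sketch of the bookkeeping (the counting model is mine, assuming one coefficient bank per butterfly per stage when nothing is shared):

```python
# Distinct coefficient patterns in an unscrambled 2**m-point direct FFT:
# stage s (1-indexed) has only 2**(s-1) distinct twiddle patterns, so
# sharing cuts the total from m * 2**(m-1) banks to 2**m - 1 patterns.
def pattern_counts(m):
    unshared = m * 2 ** (m - 1)                          # a bank per butterfly per stage
    shared = sum(2 ** (s - 1) for s in range(1, m + 1))  # 1 + 2 + 4 + ...
    return unshared, shared

# The 8-input (2**3) direct FFT mentioned above:
unshared, shared = pattern_counts(3)   # 12 banks down to 7 patterns
```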

    The real winnings occur when you realize that the other
    coefficient banks within later stages are actually the same coeffs
    as the first stage, with a constant phase rotation (again, I'm 90%
    sure but I'll check tomorrow morning).  So, you could generate
    your coefficients once, and then use a couple of complex
    multipliers to make the coeffs for the other stages.  BAM!  FFT
    Direct's coefficient memory utilization is *gone*

Almost. Each set of coefficients within a stage is the same as the other sets in that stage, up to a constant phase rotation. I actually implement each set separately, so I am not quite as efficient as this.

    You could also do this for the FFT Biplex, but it would be a bit
    more complicated.  Whoever designed the biplex FFT used in-order
    inputs.  This is OK, but it means that the coefficients are in
    bit-reverse order.  So, you would have to move the biplex
    unscrambler to the beginning, change the mux logic, and replace
    the delay elements in the delay-commutator with some flavor of
    "delay, bit-reversed".  I don't know how that would look quite
    yet.  If you did that, your coefficients would become in-order,
    and you could achieve the same savings I described with the
    FFT-Direct.


I did not implement this change but did find a few ways of saving BRAM:

1. I store real and imaginary components next to each other in the same BRAM where it makes sense (18 bits or less resolution, few enough values to fit in the same BRAM).
2. I share coefficients by storing a single set in a dual-port BRAM and manipulating the addressing on each port to produce the correct values.
3. I store a quarter of a cycle of a sinusoid and generate the other values via address manipulation and selective inversion of the output.
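The quarter-cycle trick (point 3) can be sketched as follows. Table size and function names are illustrative only; the hardware does the same mirroring and negation on the BRAM address and output:

```python
# Store only the first quarter of a sine wave; recover the full cycle
# of both sin and cos by mirroring the address within each quadrant
# and selectively negating the output.
import math

QUARTER = 256   # samples in the stored quarter cycle (illustrative)
TABLE = [math.sin(math.pi / 2 * i / QUARTER) for i in range(QUARTER + 1)]

def sin_lookup(addr):
    # addr indexes a full cycle of 4 * QUARTER samples
    addr %= 4 * QUARTER
    quadrant, offset = divmod(addr, QUARTER)
    if quadrant % 2 == 1:
        offset = QUARTER - offset          # falling quadrants: read backwards
    value = TABLE[offset]
    return -value if quadrant >= 2 else value  # negate the second half-cycle

def cos_lookup(addr):
    return sin_lookup(addr + QUARTER)          # cosine is a 90-degree shift
```

Both outputs come from the same stored quarter, which is what lets a single (dual-port) BRAM serve sine and cosine simultaneously.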

The new block (cosin) in the Downconverter section provides this functionality. It can also be used in fft_direct, in place of the feedback_osc block, if DSP slices are scarce, allowing the user to trade off multipliers against BRAMs.

The number of coefficient copies in the current in-order biplex FFT could be reduced to a single set. Each biplex stage uses the same coefficients as the previous stage for half its set (e.g. bit_reverse(0b00, 0b01, 0b10, 0b11) == bit_reverse(0b000, 0b001, 0b010, 0b011)) and then adds the same number of never-used-in-previous-stages coefficients. In theory, each stage could feed its coefficients to the next, which would append its unique coefficients and pass the resultant set on. It would be reasonably expensive in logic and registers (muxes, delay chains, counters etc.) but would halve the number of coefficient BRAMs required.
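The sharing identity can be checked numerically. The sketch below assumes the usual biplex convention of bit-reversed-order twiddles per stage; the key fact is that an s-bit value reversed in s+1 bits is simply doubled, so the first half of each stage's coefficient set equals the previous stage's whole set:

```python
# Check: in an in-order-input biplex FFT the coefficients sit in
# bit-reversed order, and the first half of each stage's set is
# exactly the previous stage's set.
import cmath
import math

def bit_reverse(value, bits):
    return int(format(value, f'0{bits}b')[::-1], 2)

def stage_coeffs(s):
    # bit-reversed-order twiddles for a stage holding 2**s coefficients
    return [cmath.exp(-2j * math.pi * bit_reverse(k, s) / 2 ** (s + 1))
            for k in range(2 ** s)]

# For k < 2**s, bit_reverse(k, s + 1) == 2 * bit_reverse(k, s), so
# stage s+1's first 2**s twiddles collapse onto stage s's twiddles.
for s in range(1, 6):
    prev, this = stage_coeffs(s), stage_coeffs(s + 1)
    assert all(abs(a - b) < 1e-12 for a, b in zip(prev, this[:2 ** s]))
```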

    Also, I implement coefficient and control logic sharing in my
    biplex and direct FFT and it works *really well* at managing the
    fabric and memory utilization.  Worth a shot.

My changes also allow this. Data is passed through the biplex FFT and operated on as a bus (via various blocks developed for this purpose in the Bus subsection). A single instance of control logic and coefficients is used. In this way, wide-bandwidth data or many inputs can be processed with maximum efficiency, reducing logic and coefficient usage by approximately the number of parallel streams. Fanout (specifically twiddle factor fanout to multipliers) can be controlled via a top-level parameter to help achieve timing.

Some other FFT additions:

1. Asynchronous operation support - the FFT can be generated with en(able) and dvalid ports. The enable signal travels with the data and appears with valid data at the output. All data inputs to the block are assumed to contain valid data, i.e. each data port does not have independent en/dvalid ports; the single en/dvalid pair applies to all ports.

2. Bit growth. The FFT can support bit growth at each stage instead of shifting. This can reduce shifting logic and the danger of overflow. An upper limit on this growth (with the remaining stages implementing shifting as before) allows the user to have a mixture of the two.
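For example, the growth-then-shift bookkeeping might look like the following. This is an illustration of the idea only; the function name and the one-bit-per-radix-2-stage growth model are my assumptions, not the generated block's internals:

```python
# Per-stage data-path widths when bit growth is capped: each stage
# grows the path by one bit (no shift, no overflow risk) until the
# cap is reached, after which the remaining stages shift as before.
def stage_widths(input_bits, n_stages, max_growth):
    widths, shifts = [], []
    width = input_bits
    for s in range(n_stages):
        if s < max_growth:
            width += 1            # grow: keep the extra bit
            shifts.append(False)
        else:
            shifts.append(True)   # cap reached: shift (divide by 2) instead
        widths.append(width)
    return widths, shifts

# 18-bit input, 5 stages, growth capped at 3 bits:
widths, shifts = stage_widths(18, 5, 3)   # widths [19, 20, 21, 21, 21]
```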

3. As mentioned above, each FFT block can process multiple polarisations/streams/sources simultaneously to increase efficiency. This is controlled via a top level parameter.

I will be updating the wiki to reflect these changes and hope to include them in the CASPER library release before this year's CASPER workshop.

Regards
Andrew
