Re: [casper] number of coefficients needed in PFB and FFT

Andrew Martens Tue, 16 Jul 2013 06:24:56 -0700

Hi All

I have just completed a rather large overhaul of the CASPER FFT family.Rigorous testing has yet to be performed but it "works" at the momentand has been pushed to the ska-sa repo on github for early adopters/testers.

The main aim (and reason this email continues this thread) was to saveBRAM usage (the limiting factor in most designs and source of manytiming failures) in coefficient storage using many of the techniques (aswell as some others) described by Ryan.


    It would work well for the PFB, but what we *really* need is a
    solid "Direct Digital Synth (DDS) coefficient generator".  FFT
    coefficients are really just sampled points around the unit
    circle, so you could, in principle, use a recursive complex
    multiplier to generate the coefficients on the fly.  You'll lose
    log2(sqrt(K)) bits for a recursion count of K, but that's probably
    OK most of the time.  Say you're doing a 2^14 point FFT, you need
    2^13 coeffs.  You start with 18 bits of resolution and can do 1024
    iterations before you degrade down to the est. 2^13 resolution.
     So you'll only need to store 8 "reset points".  Four of those
    will be 1, j, -1 and -j in this case.  You could thus replace 8
    BRAM36'es with three DSPs.

There is a new block (feedback_osc in the Downconverter subsection) thatdoes this. The number of "reset points" are configurable (as well as thevarious bit widths), allowing the user to control the phase error on theresultant vector. Initial phase and phase step (frequency) are compiletime configurable. The fact that the pipeline through the complexmultiplier is not 0 in length means that there must be more than asingle value per "reset point" but having many of these reduces error sothe extra values are not wasted.

This is (optionally) used in fft_direct to generate all coefficientswhen it is used in fft_wideband_real or fft.

    In addition, for the FFT direct, the first stage has only one
    shared coefficient pattern, second stage has 2, third 4, etc.  You
    can, of course, share coefficients amongst a stage where possible.

This is also implemented, and reduces the number of coefficient patterns(e.g from 12 to 7 in an 8 input direct FFT) to be generated which willbe significant for large bandwidth applications.

    The real winnings occur when you realize that the other
    coefficient banks within later stages are actually the same coeffs
    as the first stage, with a constant phase rotation (again, I'm 90%
    sure but I'll check tomorrow morning).  So, you could generate
    your coefficients once, and then use a couple of complex
    multipliers to make the coeffs for the other stages.  BAM!  FFT
    Direct's coefficient memory utilization is *gone*

Almost. Each set of coefficients within a stage are the same as theother sets in the stage with a constant phase rotation. I actuallyimplement each set separately so am not quite as efficient as this.

    You could also do this for the FFT Biplex, but it would be a bit
    more complicated.  Whoever designed the biplex FFT used in-order
    inputs.  This is OK, but it means that the coefficients are in
    bit-reverse order.  So, you would have to move the biplex
    unscrambler to the beginning, change the mux logic, and replace
    the delay elements in the delay-commutator with some flavor of
    "delay, bit-reversed".  I don't know how that would look quite
    yet.  If you did that, your coefficients would become in-order,
    and you could achieve the same savings I described with the
    FFT-Direct.


I did not implement this change but did find a few ways of saving BRAM;

1. I store real and imaginary components next to each other in the sameBRAM where it makes sense (18 bits or less resolution, few enough to fitin the same BRAM).2. I share coefficients by storing a single set in a dual port BRAM andmanipulate addressing on each port to produce the correct values.3. I store a quarter of a cycle of a sinusoid and generate other valuesvia address manipulation and selective inversion of the output.

The new block (cosin) in the Downconverter section allows thisfunctionality. This block can also be used in fft_direct if requiredrather than the feedback_osc block if DSP slices are scarce allowing theuser to trade off multipliers and BRAMs.

The number of coefficients in the current in-order biplex fft could bereduced to a single copy. Each biplex stage uses the same coefficientsas the previous stage (e.g bit_reverse(0b00, 0b01, 0b10, 0b11) ==bit_reverse(0b000, 0b001, 0b010, 0b011)) for half its coefficients andthen adds on the same number of never-used-in-previous-stagescoefficients. In theory, each stage could feed its coefficients to thenext stage that would add its unique coefficients and feed the resultantset on. It would be reasonably expensive in logic and registers (muxes,delay chains, counters etc) but would 1/2 the number of coefficientBRAMs required.

    Also, I implement coefficient and control logic sharing in my
    biplex and direct FFT and it works *really well* at managing the
    fabric and memory utilization.  Worth a shot.

My changes also allow this. Data is passed through the biplex FFT andoperated on as a bus (via various blocks developed for this purpose inthe Bus subsection). A single instance of control logic and coefficientsare used. In this way wide bandwidth data or many inputs can beprocessed with maximum efficiency, reducing logic and coefficient usageby approximately the number of parallel streams. Fanout (specificallytwiddle factor fanout to multipliers) can be controlled via a top levelparameter to help achieve timing.


Some other FFT additions;

1. Asynchronous operation support - the FFT can be generated to haveen(able) and dvalid ports. The enable signal travels with the data andappears with valid data at the output. All data inputs to the block areassumed to contain valid data i.e each data port does not haveindependent en/dvalid ports and the en/dvalid ports apply to all ports.

2. Bitgrowth. The FFT can support bit growth instead of shifting at eachstage. This can reduce shifting logic and the danger of overflow. Anupper limit on this growth (with remaining stages implementing shiftingas before) allows the user to have a mixture of the two.

3. As mentioned above, each FFT block can process multiplepolarisations/streams/sources simultaneously to increase efficiency.This is controlled via a top level parameter.

I will be updating the wiki to reflect these changes and hope to includethem in the CASPER library release before this year's CASPER workshop.


Regards
Andrew

Re: [casper] number of coefficients needed in PFB and FFT

Reply via email to