Hi All
I have just completed a rather large overhaul of the CASPER FFT family.
Rigorous testing has yet to be performed but it "works" at the moment
and has been pushed to the ska-sa repo on github for early adopters/testers.
The main aim (and reason this email continues this thread) was to save
BRAM usage (the limiting factor in most designs and source of many
timing failures) in coefficient storage using many of the techniques (as
well as some others) described by Ryan.
It would work well for the PFB, but what we *really* need is a
solid "Direct Digital Synth (DDS) coefficient generator". FFT
coefficients are really just sampled points around the unit
circle, so you could, in principle, use a recursive complex
multiplier to generate the coefficients on the fly. You'll lose
log2(sqrt(K)) bits for a recursion count of K, but that's probably
OK most of the time. Say you're doing a 2^14 point FFT, you need
2^13 coeffs. You start with 18 bits of resolution and can do 1024
iterations before you degrade down to the est. 2^13 resolution.
So you'll only need to store 8 "reset points". Four of those
will be 1, j, -1 and -j in this case. You could thus replace 8
BRAM36'es with three DSPs.
There is a new block (feedback_osc in the Downconverter subsection) that
does this. The number of "reset points" are configurable (as well as the
various bit widths), allowing the user to control the phase error on the
resultant vector. Initial phase and phase step (frequency) are compile
time configurable. The fact that the pipeline through the complex
multiplier is not 0 in length means that there must be more than a
single value per "reset point" but having many of these reduces error so
the extra values are not wasted.
This is (optionally) used in fft_direct to generate all coefficients
when it is used in fft_wideband_real or fft.
In addition, for the FFT direct, the first stage has only one
shared coefficient pattern, second stage has 2, third 4, etc. You
can, of course, share coefficients amongst a stage where possible.
This is also implemented, and reduces the number of coefficient patterns
(e.g from 12 to 7 in an 8 input direct FFT) to be generated which will
be significant for large bandwidth applications.
The real winnings occur when you realize that the other
coefficient banks within later stages are actually the same coeffs
as the first stage, with a constant phase rotation (again, I'm 90%
sure but I'll check tomorrow morning). So, you could generate
your coefficients once, and then use a couple of complex
multipliers to make the coeffs for the other stages. BAM! FFT
Direct's coefficient memory utilization is *gone*
Almost. Each set of coefficients within a stage are the same as the
other sets in the stage with a constant phase rotation. I actually
implement each set separately so am not quite as efficient as this.
You could also do this for the FFT Biplex, but it would be a bit
more complicated. Whoever designed the biplex FFT used in-order
inputs. This is OK, but it means that the coefficients are in
bit-reverse order. So, you would have to move the biplex
unscrambler to the beginning, change the mux logic, and replace
the delay elements in the delay-commutator with some flavor of
"delay, bit-reversed". I don't know how that would look quite
yet. If you did that, your coefficients would become in-order,
and you could achieve the same savings I described with the
FFT-Direct.
I did not implement this change but did find a few ways of saving BRAM;
1. I store real and imaginary components next to each other in the same
BRAM where it makes sense (18 bits or less resolution, few enough to fit
in the same BRAM).
2. I share coefficients by storing a single set in a dual port BRAM and
manipulate addressing on each port to produce the correct values.
3. I store a quarter of a cycle of a sinusoid and generate other values
via address manipulation and selective inversion of the output.
The new block (cosin) in the Downconverter section allows this
functionality. This block can also be used in fft_direct if required
rather than the feedback_osc block if DSP slices are scarce allowing the
user to trade off multipliers and BRAMs.
The number of coefficients in the current in-order biplex fft could be
reduced to a single copy. Each biplex stage uses the same coefficients
as the previous stage (e.g bit_reverse(0b00, 0b01, 0b10, 0b11) ==
bit_reverse(0b000, 0b001, 0b010, 0b011)) for half its coefficients and
then adds on the same number of never-used-in-previous-stages
coefficients. In theory, each stage could feed its coefficients to the
next stage that would add its unique coefficients and feed the resultant
set on. It would be reasonably expensive in logic and registers (muxes,
delay chains, counters etc) but would 1/2 the number of coefficient
BRAMs required.
Also, I implement coefficient and control logic sharing in my
biplex and direct FFT and it works *really well* at managing the
fabric and memory utilization. Worth a shot.
My changes also allow this. Data is passed through the biplex FFT and
operated on as a bus (via various blocks developed for this purpose in
the Bus subsection). A single instance of control logic and coefficients
are used. In this way wide bandwidth data or many inputs can be
processed with maximum efficiency, reducing logic and coefficient usage
by approximately the number of parallel streams. Fanout (specifically
twiddle factor fanout to multipliers) can be controlled via a top level
parameter to help achieve timing.
Some other FFT additions;
1. Asynchronous operation support - the FFT can be generated to have
en(able) and dvalid ports. The enable signal travels with the data and
appears with valid data at the output. All data inputs to the block are
assumed to contain valid data i.e each data port does not have
independent en/dvalid ports and the en/dvalid ports apply to all ports.
2. Bitgrowth. The FFT can support bit growth instead of shifting at each
stage. This can reduce shifting logic and the danger of overflow. An
upper limit on this growth (with remaining stages implementing shifting
as before) allows the user to have a mixture of the two.
3. As mentioned above, each FFT block can process multiple
polarisations/streams/sources simultaneously to increase efficiency.
This is controlled via a top level parameter.
I will be updating the wiki to reflect these changes and hope to include
them in the CASPER library release before this year's CASPER workshop.
Regards
Andrew