Wow! Thanks, Andrew, that sounds like a lot of work! Do you have any utilization comparisons of old vs new?
Dave

On Jul 16, 2013, at 6:22 AM, Andrew Martens wrote:

> Hi All
>
> I have just completed a rather large overhaul of the CASPER FFT family.
> Rigorous testing has yet to be performed, but it "works" at the moment and
> has been pushed to the ska-sa repo on github for early adopters/testers.
>
> The main aim (and the reason this email continues this thread) was to save
> BRAM usage (the limiting factor in most designs and the source of many
> timing failures) in coefficient storage, using many of the techniques (as
> well as some others) described by Ryan.
>
>> It would work well for the PFB, but what we *really* need is a solid
>> "Direct Digital Synth (DDS) coefficient generator". FFT coefficients are
>> really just sampled points around the unit circle, so you could, in
>> principle, use a recursive complex multiplier to generate the
>> coefficients on the fly. You'll lose log2(sqrt(K)) bits for a recursion
>> count of K, but that's probably OK most of the time. Say you're doing a
>> 2^14-point FFT: you need 2^13 coeffs. You start with 18 bits of
>> resolution and can do 1024 iterations before you degrade down to the
>> est. 2^13 resolution. So you'll only need to store 8 "reset points".
>> Four of those will be 1, j, -1 and -j in this case. You could thus
>> replace 8 BRAM36s with three DSPs.
>
> There is a new block (feedback_osc in the Downconverter subsection) that
> does this. The number of "reset points" is configurable (as are the
> various bit widths), allowing the user to control the phase error on the
> resultant vector. Initial phase and phase step (frequency) are
> compile-time configurable. Because the pipeline through the complex
> multiplier is not of length 0, there must be more than a single value per
> "reset point", but having many of these reduces error, so the extra
> values are not wasted.
>
> This is (optionally) used in fft_direct to generate all coefficients when
> it is used in fft_wideband_real or fft.
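[Editor's note: the reset-point scheme Ryan describes, which feedback_osc implements in hardware, can be sketched in software. The function below is a hypothetical floating-point model with illustrative names, not the actual block: it generates each twiddle by recursive complex multiplication and re-seeds from an exactly computed "reset point" every few iterations so that rounding error can only accumulate over one span.]

```python
import cmath

def recursive_twiddles(n_coeffs, n_resets):
    """Generate the n_coeffs twiddle factors W^k = exp(-2j*pi*k/(2*n_coeffs))
    by recursive complex multiplication, re-seeding from an exactly
    computed "reset point" every n_coeffs // n_resets iterations.
    Illustrative software model only; the hardware block works in
    fixed point and pipelines the complex multiplier."""
    step = cmath.exp(-2j * cmath.pi / (2 * n_coeffs))  # phase step per coefficient
    span = n_coeffs // n_resets                        # recursions between reseeds
    coeffs = []
    for r in range(n_resets):
        # exact seed: accumulated error is zeroed here, so it can only
        # build up over `span` recursions before the next reseed
        w = cmath.exp(-2j * cmath.pi * (r * span) / (2 * n_coeffs))
        for _ in range(span):
            coeffs.append(w)
            w *= step
    return coeffs

# e.g. the 2^13 coefficients of a 2^14-point FFT from 8 reset points,
# each reset point covering 1024 recursions:
twiddles = recursive_twiddles(2 ** 13, 8)
```

In hardware the trade is exactly as Ryan describes: a handful of stored seeds plus a complex multiplier replace a full coefficient BRAM.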
>
>> In addition, for the FFT direct, the first stage has only one shared
>> coefficient pattern, the second stage has 2, the third 4, etc. You can,
>> of course, share coefficients amongst a stage where possible.
>
> This is also implemented, and reduces the number of coefficient patterns
> to be generated (e.g. from 3 × 4 = 12 down to 1 + 2 + 4 = 7 in an
> 8-input direct FFT), which will be significant for large-bandwidth
> applications.
>
>> The real winnings occur when you realize that the other coefficient
>> banks within later stages are actually the same coeffs as the first
>> stage, with a constant phase rotation (again, I'm 90% sure but I'll
>> check tomorrow morning). So, you could generate your coefficients once,
>> and then use a couple of complex multipliers to make the coeffs for the
>> other stages. BAM! FFT Direct's coefficient memory utilization is
>> *gone*
>
> Almost. Each set of coefficients within a stage is the same as the other
> sets in the stage up to a constant phase rotation. I actually implement
> each set separately, so I am not quite as efficient as this.
>
>> You could also do this for the FFT Biplex, but it would be a bit more
>> complicated. Whoever designed the biplex FFT used in-order inputs. This
>> is OK, but it means that the coefficients are in bit-reversed order.
>> So, you would have to move the biplex unscrambler to the beginning,
>> change the mux logic, and replace the delay elements in the
>> delay-commutator with some flavor of "delay, bit-reversed". I don't
>> know quite how that would look yet. If you did that, your coefficients
>> would become in-order, and you could achieve the same savings I
>> described with the FFT Direct.
>
> I did not implement this change, but did find a few ways of saving BRAM:
>
> 1. I store real and imaginary components next to each other in the same
> BRAM where it makes sense (18 bits or less of resolution, few enough
> values to fit in the same BRAM).
>
> 2.
> I share coefficients by storing a single set in a dual-port BRAM and
> manipulating the addressing on each port to produce the correct values.
>
> 3. I store a quarter of a cycle of a sinusoid and generate the other
> values via address manipulation and selective inversion of the output.
>
> The new block (cosin) in the Downconverter section provides this
> functionality. This block can also be used in fft_direct, rather than
> the feedback_osc block, if DSP slices are scarce, allowing the user to
> trade off multipliers against BRAMs.
>
> The number of coefficient copies in the current in-order biplex FFT
> could be reduced to a single one. Each biplex stage uses the same
> coefficients as the previous stage for half of its coefficients (e.g.
> bit_reverse(0b00, 0b01, 0b10, 0b11) == bit_reverse(0b000, 0b001, 0b010,
> 0b011)) and then adds the same number of never-used-in-previous-stages
> coefficients. In theory, each stage could feed its coefficients to the
> next stage, which would add its unique coefficients and feed the
> resultant set on. It would be reasonably expensive in logic and
> registers (muxes, delay chains, counters, etc.) but would halve the
> number of coefficient BRAMs required.
>
>> Also, I implement coefficient and control logic sharing in my biplex
>> and direct FFT and it works *really well* at managing the fabric and
>> memory utilization. Worth a shot.
>
> My changes also allow this. Data is passed through the biplex FFT and
> operated on as a bus (via various blocks developed for this purpose in
> the Bus subsection). A single instance of control logic and coefficients
> is used. In this way, wide-bandwidth data or many inputs can be
> processed with maximum efficiency, reducing logic and coefficient usage
> by approximately the number of parallel streams. Fanout (specifically
> twiddle-factor fanout to multipliers) can be controlled via a top-level
> parameter to help achieve timing.
>
> Some other FFT additions:
>
> 1.
> Asynchronous operation support: the FFT can be generated with en(able)
> and dvalid ports. The enable signal travels with the data and appears
> with valid data at the output. All data inputs to the block are assumed
> to contain valid data, i.e. each data port does not have independent
> en/dvalid ports; the en/dvalid ports apply to all ports.
>
> 2. Bit growth. The FFT can support bit growth instead of shifting at
> each stage. This can reduce shifting logic and the danger of overflow.
> An upper limit on this growth (with the remaining stages implementing
> shifting as before) allows the user to have a mixture of the two.
>
> 3. As mentioned above, each FFT block can process multiple
> polarisations/streams/sources simultaneously to increase efficiency.
> This is controlled via a top-level parameter.
>
> I will be updating the wiki to reflect these changes and hope to include
> them in the CASPER library release before this year's CASPER workshop.
>
> Regards
> Andrew
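[Editor's note: the quarter-cycle storage trick behind the cosin block (saving 3 in Andrew's list) can be sketched as follows. This is an illustrative software model with made-up names, not the block itself: a table holding one quarter of a sine cycle serves all four quadrants through address reflection and sign inversion.]

```python
import math

QUARTER = 256  # samples per quarter cycle (illustrative depth)
# one quarter cycle of sine, endpoints included: sin(0) .. sin(pi/2)
table = [math.sin(math.pi / 2 * i / QUARTER) for i in range(QUARTER + 1)]

def sin_lookup(k):
    """sin(2*pi*k / (4*QUARTER)) from the quarter-wave table, using
    address reflection within a quadrant and sign inversion for the
    negative half cycle."""
    k %= 4 * QUARTER
    quadrant, idx = divmod(k, QUARTER)
    if quadrant == 0:
        return table[idx]            # rising quarter:  sin(x)
    if quadrant == 1:
        return table[QUARTER - idx]  # falling quarter: sin(pi - x) = sin(x)
    if quadrant == 2:
        return -table[idx]           # negative rising quarter
    return -table[QUARTER - idx]     # negative falling quarter

def cos_lookup(k):
    return sin_lookup(k + QUARTER)   # cos(x) = sin(x + pi/2)
```

One BRAM then supplies both sine and cosine over the full cycle, at the cost of an adder on the address and a conditional negate on the output.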

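[Editor's note: the capped bit growth Andrew describes (addition 2) amounts to simple width bookkeeping, sketched below under an assumed convention of 1 bit of growth per butterfly stage; this is an illustrative helper, not CASPER code.]

```python
def stage_widths(input_bits, n_stages, growth_limit):
    """Data-path width after each FFT stage when bit growth is capped:
    grow by 1 bit per stage for the first `growth_limit` stages (no
    shifting, no overflow risk), then hold the width and shift at each
    remaining stage as before. Illustrative model only."""
    widths = []
    width = input_bits
    for stage in range(n_stages):
        if stage < growth_limit:
            width += 1  # butterfly growth retained in the data path
        widths.append(width)
    return widths

# e.g. an 18-bit input through a 5-stage FFT with growth capped at 3
# stages: stage_widths(18, 5, 3) -> [19, 20, 21, 21, 21]
```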
