Hi all,

The memory footprint of the fft_direct block could be reduced if it were split into two blocks: one for the operator to manipulate the phase of the FFT of each input, and another to calculate the true direct-form FFT, i.e. not mapping together a larger FFT (see p. 615 of http://apps.nrbook.com/empanel/index.html#). This would require n_inputs more twiddle multiplications, but the overall operator would require only n_inputs coefficient memories (a SIGNIFICANT memory saving for wideband FFTs), since the coefficients of a true direct-form FFT are constant. The current implementation instead requires a different coefficient memory for each butterfly, and all of the butterflies compute non-trivial twiddle multiplications.
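To make the proposed split concrete, here is a numpy sketch of the factorisation (the function name and structure are mine, purely illustrative, not any CASPER block): an N-point FFT of P time-interleaved input streams becomes per-stream FFTs, a per-stream twiddle (phase) stage, and a P-point direct-form DFT whose coefficient matrix W_P^{ps} is constant:

```python
import numpy as np

def split_direct_fft(x, P):
    """N-point FFT via P time-interleaved streams: per-stream FFTs,
    a twiddle (phase-manipulation) stage, then a constant-coefficient
    P-point direct-form DFT across the streams."""
    N = len(x)
    M = N // P
    # Stage 0: FFT of each input stream (x[p::P] is the p-th parallel input).
    Xp = np.array([np.fft.fft(x[p::P]) for p in range(P)])       # shape (P, M)
    # Stage 1: per-stream phase manipulation; these twiddles W_N^{p*r}
    # depend on the output bin r, so each stream needs one coefficient memory.
    r = np.arange(M)
    twiddled = Xp * np.exp(-2j * np.pi * np.outer(np.arange(P), r) / N)
    # Stage 2: true direct-form P-point DFT across streams. Its coefficients
    # W_P^{p*s} never change, so no per-butterfly coefficient memories.
    W = np.exp(-2j * np.pi * np.outer(np.arange(P), np.arange(P)) / P)
    out = W.T @ twiddled          # out[s, r] == X[r + s*M]
    return out.reshape(N)         # rows concatenate to the in-order spectrum
```

The identity X[r + sM] = sum_p W_P^{ps} (W_N^{pr} X_p[r]) is the standard decimation-in-time factorisation, which is what makes the stage-2 coefficients constant.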
Changing architectures (1) trades BRAMs for multiplications and (2) doesn't actually require more multiplier circuits, because a huge number of coefficients in the direct-form FFT don't require multiplications (more than half for a 16-input FFT). The FFT architecture also becomes more suitable for manipulating the phase rotation coefficients, which can help reduce interleaving artifacts in the spectrum.

For those interested in this particular change, I made (and tested) the changes in my fork of xblocks_devel. If there is sufficient interest, I could eventually make the changes in the main CASPER library.

-Suraj

On Tue, Jul 16, 2013 at 10:45 AM, David MacMahon <[email protected]> wrote:

> Wow! Thanks, Andrew, that sounds like a lot of work! Do you have any utilization comparisons of old vs new?
>
> Dave
>
> On Jul 16, 2013, at 6:22 AM, Andrew Martens wrote:
>
> > Hi All
> >
> > I have just completed a rather large overhaul of the CASPER FFT family. Rigorous testing has yet to be performed, but it "works" at the moment and has been pushed to the ska-sa repo on github for early adopters/testers.
> >
> > The main aim (and the reason this email continues this thread) was to save BRAM usage (the limiting factor in most designs and the source of many timing failures) in coefficient storage, using many of the techniques (as well as some others) described by Ryan.
> >
> >> It would work well for the PFB, but what we *really* need is a solid "Direct Digital Synth (DDS) coefficient generator". FFT coefficients are really just sampled points around the unit circle, so you could, in principle, use a recursive complex multiplier to generate the coefficients on the fly. You'll lose log2(sqrt(K)) bits for a recursion count of K, but that's probably OK most of the time. Say you're doing a 2^14-point FFT; you need 2^13 coeffs. You start with 18 bits of resolution and can do 1024 iterations before you degrade down to the est. 2^13 resolution.
> >> So you'll only need to store 8 "reset points". Four of those will be 1, j, -1 and -j in this case. You could thus replace 8 BRAM36'es with three DSPs.
> >
> > There is a new block (feedback_osc, in the Downconverter subsection) that does this. The number of "reset points" is configurable (as are the various bit widths), allowing the user to control the phase error on the resultant vector. Initial phase and phase step (frequency) are compile-time configurable. Because the pipeline through the complex multiplier is not zero in length, there must be more than a single value per "reset point", but having many of these reduces error, so the extra values are not wasted.
> >
> > This is (optionally) used in fft_direct to generate all coefficients when it is used in fft_wideband_real or fft.
> >
> >> In addition, for the FFT direct, the first stage has only one shared coefficient pattern, the second stage has 2, the third 4, etc. You can, of course, share coefficients amongst a stage where possible.
> >
> > This is also implemented, and reduces the number of coefficient patterns to be generated (e.g. from 12 to 7 in an 8-input direct FFT), which will be significant for large-bandwidth applications.
> >
> >> The real winnings occur when you realize that the other coefficient banks within later stages are actually the same coeffs as the first stage, with a constant phase rotation (again, I'm 90% sure but I'll check tomorrow morning). So, you could generate your coefficients once, and then use a couple of complex multipliers to make the coeffs for the other stages. BAM! FFT Direct's coefficient memory utilization is *gone*.
> >
> > Almost. Each set of coefficients within a stage is the same as the other sets in the stage with a constant phase rotation. I actually implement each set separately, so I am not quite as efficient as this.
> >
> >> You could also do this for the FFT Biplex, but it would be a bit more complicated.
> >> Whoever designed the biplex FFT used in-order inputs. This is OK, but it means that the coefficients are in bit-reversed order. So, you would have to move the biplex unscrambler to the beginning, change the mux logic, and replace the delay elements in the delay-commutator with some flavor of "delay, bit-reversed". I don't know how that would look quite yet. If you did that, your coefficients would become in-order, and you could achieve the same savings I described for the FFT Direct.
> >
> > I did not implement this change, but did find a few ways of saving BRAM:
> >
> > 1. I store real and imaginary components next to each other in the same BRAM where it makes sense (18 bits or less of resolution, and few enough values to fit in the same BRAM).
> >
> > 2. I share coefficients by storing a single set in a dual-port BRAM and manipulating the addressing on each port to produce the correct values.
> >
> > 3. I store a quarter of a cycle of a sinusoid and generate the other values via address manipulation and selective inversion of the output.
> >
> > The new block (cosin) in the Downconverter section provides this functionality. It can also be used in fft_direct, in place of the feedback_osc block, if DSP slices are scarce, allowing the user to trade off multipliers against BRAMs.
> >
> > The number of coefficient copies in the current in-order biplex FFT could be reduced to one. Each biplex stage uses the same coefficients as the previous stage (e.g. bit_reverse(0b00, 0b01, 0b10, 0b11) == bit_reverse(0b000, 0b001, 0b010, 0b011)) for half of its coefficients, and then adds the same number of never-used-in-previous-stages coefficients. In theory, each stage could feed its coefficients to the next stage, which would add its unique coefficients and pass the resultant set on. This would be reasonably expensive in logic and registers (muxes, delay chains, counters, etc.) but would halve the number of coefficient BRAMs required.
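As an aside, the quarter-cycle storage trick in point 3 above is easy to model. Here is a toy numpy sketch (my own illustration, not the actual cosin block) of recovering a full cosine cycle from a quarter-cycle table via address reflection and selective sign inversion. For clarity I store the quarter-cycle endpoint too (DEPTH+1 samples), where a real BRAM table would fold it away:

```python
import numpy as np

DEPTH = 256  # samples covering one quarter cycle
# Table of cos(theta) for theta in [0, pi/2], endpoint included for clarity.
QUARTER = np.cos(np.pi / 2 * np.arange(DEPTH + 1) / DEPTH)

def cos_lookup(addr):
    """cos(2*pi*addr/(4*DEPTH)) using only the quarter-cycle table."""
    addr %= 4 * DEPTH
    quadrant, offset = divmod(addr, DEPTH)
    if quadrant == 0:
        return QUARTER[offset]                # read forwards
    if quadrant == 1:
        return -QUARTER[DEPTH - offset]       # reflect address, invert sign
    if quadrant == 2:
        return -QUARTER[offset]               # read forwards, invert sign
    return QUARTER[DEPTH - offset]            # reflect address

# A sine is the same table read with a quarter-cycle address offset,
# which is why one dual-port BRAM can serve both outputs (point 2 above).
def sin_lookup(addr):
    return cos_lookup(addr + 3 * DEPTH)
```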
> >
> >> Also, I implement coefficient and control-logic sharing in my biplex and direct FFT, and it works *really well* at managing the fabric and memory utilization. Worth a shot.
> >
> > My changes also allow this. Data is passed through the biplex FFT and operated on as a bus (via various blocks developed for this purpose in the Bus subsection), and a single instance of the control logic and coefficients is used. In this way, wide-bandwidth data or many inputs can be processed with maximum efficiency, reducing logic and coefficient usage by approximately the number of parallel streams. Fanout (specifically, twiddle-factor fanout to multipliers) can be controlled via a top-level parameter to help achieve timing.
> >
> > Some other FFT additions:
> >
> > 1. Asynchronous operation support: the FFT can be generated with en(able) and dvalid ports. The enable signal travels with the data and appears with valid data at the output. All data inputs to the block are assumed to contain valid data, i.e. each data port does not have independent en/dvalid ports; the single en/dvalid pair applies to all ports.
> >
> > 2. Bit growth: the FFT can grow the bit width at each stage instead of shifting. This reduces shifting logic and the danger of overflow. An upper limit on the growth (with the remaining stages implementing shifting as before) allows the user to have a mixture of the two.
> >
> > 3. As mentioned above, each FFT block can process multiple polarisations/streams/sources simultaneously to increase efficiency. This is controlled via a top-level parameter.
> >
> > I will be updating the wiki to reflect these changes and hope to include them in the CASPER library release before this year's CASPER workshop.
> >
> > Regards
> > Andrew
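P.S. For anyone wanting to play with Ryan's reset-point numbers (2^14-point FFT, 18 bits, 1024 iterations, 8 stored points) before diving into feedback_osc, here is a quick numpy model. The rounding scheme is my own simplification (and I store the reset points at full precision), so the error figures are only indicative:

```python
import numpy as np

N = 2**13      # coefficients needed for a 2^14-point FFT
R = 1024       # recursion count between reset points
BITS = 18      # multiplier resolution

def quantize(z, bits=BITS):
    """Round a complex value to a signed fixed-point grid of `bits` bits."""
    scale = 2**(bits - 1)
    return (np.round(z.real * scale) + 1j * np.round(z.imag * scale)) / scale

step = quantize(np.exp(-2j * np.pi / N))  # one quantized phase step
# Only N/R = 8 exact "reset points" are stored; everything else is recursed.
resets = [np.exp(-2j * np.pi * k / N) for k in range(0, N, R)]

coeffs = []
for r0 in resets:
    z = r0
    for _ in range(R):
        coeffs.append(z)
        z = quantize(z * step)            # recursive complex multiply

ideal = np.exp(-2j * np.pi * np.arange(N) / N)
worst = np.abs(np.array(coeffs) - ideal).max()  # bounded by the resets
```

Because the recursion is re-seeded every R steps, the error cannot grow beyond what R quantized multiplies accumulate, which is the trade-off Ryan's log2(sqrt(K)) estimate describes.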

