PS3. You could also have done the 2^16 FFT's coefficients as a narrow cmult... you'd need to use a BRAM to store the 2^13 "reset points", but it would still represent a reduction in memory use by a factor of 4 -- not trivial by any means.
On Mon, Jan 21, 2013 at 9:39 PM, Ryan Monroe <[email protected]> wrote:

> It would work well for the PFB, but what we *really* need is a solid
> "Direct Digital Synth (DDS) coefficient generator". FFT coefficients are
> really just sampled points around the unit circle, so you could, in
> principle, use a recursive complex multiplier to generate the coefficients
> on the fly. You'll lose log2(sqrt(K)) bits for a recursion count of K, but
> that's probably OK most of the time. Say you're doing a 2^14-point FFT:
> you need 2^13 coeffs. You start with 18 bits of resolution and can do 1024
> iterations before you degrade down to the estimated 13-bit resolution
> (18 - log2(sqrt(1024)) = 13). So you'll only need to store 8 "reset
> points". Four of those will be 1, j, -1 and -j in this case. You could
> thus replace 8 BRAM36's with three DSPs.
>
> If you had a much larger FFT, say 2^16, you would have to use a wider
> recursive multiplier. You can achieve a wide cmult in no more than 10
> DSPs... I think. In that case, you would start with 25 bits and be able
> to droop to 16 bits -- so up to 2^(2*9) = <lots> of recursions. You would
> only need to store one "reset point", and your noise performance would be
> more than sufficient. 1, j, -1 and -j are easy to store, though, so I
> would probably go with those.
>
> In addition, for the FFT direct, the first stage has only one shared
> coefficient pattern, the second stage has 2, the third 4, etc. You can,
> of course, share coefficients amongst a stage where possible. The real
> winnings occur when you realize that the other coefficient banks within
> later stages are actually the same coeffs as the first stage, with a
> constant phase rotation (again, I'm 90% sure, but I'll check tomorrow
> morning). So, you could generate your coefficients once, and then use a
> couple of complex multipliers to make the coeffs for the other stages.
> BAM!
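Ryan's recursive-oscillator scheme can be sketched in a few lines of Python. This is a toy model, not a DSP48 implementation: the rotation constant is kept at full precision and only the running 18-bit state is re-quantized each step, and the stored reset points are assumed exact.

```python
import cmath
import math

def quantize(z, bits):
    """Round a complex value onto a grid with `bits` fractional bits."""
    s = float(1 << bits)
    return complex(round(z.real * s) / s, round(z.imag * s) / s)

N = 2 ** 14        # FFT length, so N // 2 = 8192 twiddle factors are needed
BITS = 18          # resolution of the recursion state
RESET = 1024       # iterations between resets

w = cmath.exp(-2j * math.pi / N)   # one rotation step (full precision here)

# Only N//2 // RESET = 8 exact "reset points" need to be stored
resets = [cmath.exp(-2j * math.pi * k / N) for k in range(0, N // 2, RESET)]

coeffs = []
for r in resets:
    z = r
    for _ in range(RESET):
        coeffs.append(z)
        z = quantize(z * w, BITS)  # recursive cmult, state re-quantized

# Compare against exact twiddles to see how far the recursion drifts
worst = max(abs(c - cmath.exp(-2j * math.pi * k / N))
            for k, c in enumerate(coeffs))
print(f"{len(resets)} reset points, worst twiddle error = {worst:.2e}")
```

Shortening RESET trades more stored reset points for less accumulated drift, which is exactly the BRAM-versus-noise knob Ryan describes.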
> FFT Direct's coefficient memory utilization is *gone*!
>
> You could also do this for the FFT Biplex, but it would be a bit more
> complicated. Whoever designed the biplex FFT used in-order inputs. This
> is OK, but it means that the coefficients are in bit-reversed order. So
> you would have to move the biplex unscrambler to the beginning, change
> the mux logic, and replace the delay elements in the delay-commutator
> with some flavor of "delay, bit-reversed". I don't know quite how that
> would look yet. If you did that, your coefficients would become in-order,
> and you could achieve the same savings I described for the FFT Direct.
> Also, I implement coefficient and control logic sharing in my biplex and
> direct FFTs, and it works *really well* at managing the fabric and memory
> utilization. Worth a shot.
>
> :-)
>
> --Ryan Monroe
>
> PS: Sorry, I'm a bit busy at the moment, so I can't implement a
> coefficient interpolator for you guys right now. I'll write back when
> I'm more free.
>
> PS2: I'm a bit anal about noise performance, so I usually use a couple
> more bits than Dan prescribes, but as he demonstrated in the ASIC talks,
> his comments about bit widths are 100% correct. I would recommend them
> as a general design practice as well.
>
> On Mon, Jan 21, 2013 at 3:48 PM, Dan Werthimer <[email protected]> wrote:
>
>> agreed. anybody already have, or want to develop, a coefficient
>> interpolator?
>>
>> dan
>>
>> On Mon, Jan 21, 2013 at 3:44 PM, Aaron Parsons <[email protected]> wrote:
>>
>>> Agreed.
>>>
>>> The coefficient interpolator, however, could get substantial savings
>>> beyond that, even, and could be applicable to many things besides PFBs.
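The "coefficients are in bit-reversed order" point above is easy to make concrete. The sketch below (a toy radix-2 size; the ROM layout is illustrative, not CASPER library code) shows that a coefficient bank stored in bit-reversed order, read back through a bit-reversed address, yields the in-order phase ramp that a DDS-style generator could produce cheaply.

```python
import cmath
import math

def bit_reverse(x, bits):
    """Reverse the low `bits` bits of x, e.g. bit_reverse(0b001, 3) == 0b100."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

N = 16                                  # toy FFT size -> N // 2 = 8 twiddles
BITS = 3                                # address width for 8 coefficients
in_order = [cmath.exp(-2j * math.pi * k / N) for k in range(N // 2)]

# Coefficient ROM as an in-order-input biplex would hold it: bit-reversed order
rom = [in_order[bit_reverse(a, BITS)] for a in range(N // 2)]

# Addressing that ROM through bit_reverse restores the in-order phase ramp
# (bit reversal is its own inverse)
recovered = [rom[bit_reverse(k, BITS)] for k in range(N // 2)]
assert recovered == in_order
print("bit-reversed ROM + bit-reversed addressing = in-order twiddles")
```

In hardware the same idea is an address-line permutation rather than a lookup, which is why the reshuffle Ryan describes costs control logic rather than memory.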
>>> On Mon, Jan 21, 2013 at 3:36 PM, Dan Werthimer <[email protected]> wrote:
>>>
>>>> hi aaron,
>>>>
>>>> if you use xilinx brams for coefficients, they can be configured as
>>>> dual-port memories, so you can get the PFB reverse and forward
>>>> coefficients both at the same time, from the same memory, almost for
>>>> free, without any memory size penalty over single port.
>>>>
>>>> dan
>>>>
>>>> On Mon, Jan 21, 2013 at 3:18 PM, Aaron Parsons <[email protected]> wrote:
>>>>
>>>>> You guys probably appreciate this already, but although the
>>>>> coefficients in the PFB FIR are generally symmetric around the center
>>>>> tap, the upper and lower taps use these coefficients in reverse order
>>>>> from one another. In order to take advantage of the symmetry, you'll
>>>>> have to use dual-port ROMs that support two different addresses (one
>>>>> counting up and one counting down). In the original core I wrote, I
>>>>> instead just shared coefficients between the real and imaginary
>>>>> components. This was an easy factor-of-2 savings. After that first
>>>>> factor of two, we found it was kind of diminishing returns...
>>>>>
>>>>> Another thought could be a small BRAM with a linear interpolator
>>>>> between addresses. This would be a block with a wide range of uses,
>>>>> and could easily cut the size of the PFB coefficients by an order of
>>>>> magnitude. The (hamming/hanning) window and the sinc that the PFB
>>>>> uses for its coefficients are smooth functions, making all the fine
>>>>> subdivisions for N>32 samples rather unnecessary.
>>>>>
>>>>> On Mon, Jan 21, 2013 at 2:56 PM, Dan Werthimer <[email protected]> wrote:
>>>>>
>>>>>> hi danny and ryan,
>>>>>>
>>>>>> i suspect if you are only doing small FFT's and PFB FIR's,
>>>>>> 1K points or so, then BRAM isn't likely to be the limiting resource,
>>>>>> so you might as well store all the coefficients with high precision.
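Aaron's small-BRAM-plus-linear-interpolator idea is quick to prototype. The window shape below (a Hamming-weighted sinc) and the sizes are stand-in assumptions, chosen only because they resemble typical PFB coefficients; the interpolator itself is the one-multiplier structure he describes.

```python
import math

def window(x):
    """Smooth Hamming-weighted sinc (stand-in for a PFB window), x in [0, 1]."""
    ham = 0.54 - 0.46 * math.cos(2 * math.pi * x)
    u = 4.0 * (x - 0.5)
    return ham * (1.0 if u == 0 else math.sin(math.pi * u) / (math.pi * u))

COARSE, FINE = 64, 1024   # store 64 points, synthesize 1024 on the fly
bram = [window(k / (COARSE - 1)) for k in range(COARSE)]

def interp(addr_frac):
    """Linear interpolation between adjacent BRAM entries: one mult + adds."""
    pos = addr_frac * (COARSE - 1)
    i = min(int(pos), COARSE - 2)
    frac = pos - i
    return bram[i] + frac * (bram[i + 1] - bram[i])

err = max(abs(interp(k / (FINE - 1)) - window(k / (FINE - 1)))
          for k in range(FINE))
print(f"{COARSE} stored -> {FINE} synthesized, max abs error = {err:.2e}")
```

Because the window is smooth, a 16x reduction in stored points costs well under a percent of error here, which is the "order of magnitude" savings Aaron is pointing at.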
>>>>>> but for long transforms, perhaps >4K points or so,
>>>>>> then BRAM's might be in short supply, and then one could
>>>>>> consider storing fewer coefficients (and also taking advantage
>>>>>> of sin/cos and mirror symmetries, which don't degrade SNR at all).
>>>>>>
>>>>>> for any length FFT or PFB/FIR, even millions of points,
>>>>>> if you store 1K coefficients with at least 10-bit precision,
>>>>>> then the SNR will only be degraded slightly.
>>>>>> quantization error analysis is nicely written up in memo #1, at
>>>>>> https://casper.berkeley.edu/wiki/Memos
>>>>>>
>>>>>> best wishes,
>>>>>>
>>>>>> dan
>>>>>>
>>>>>> On Mon, Jan 21, 2013 at 4:33 AM, Danny Price <[email protected]> wrote:
>>>>>>
>>>>>>> Hey Jason,
>>>>>>>
>>>>>>> Rewinding the thread a bit:
>>>>>>>
>>>>>>> On Fri, Jan 4, 2013 at 7:39 AM, Jason Manley <[email protected]> wrote:
>>>>>>>
>>>>>>>> Andrew and I have also spoken about symmetrical coefficients in
>>>>>>>> the pfb_fir, and I'd very much like to see this done. We recently
>>>>>>>> added the option to share coefficient generators across multiple
>>>>>>>> inputs, which has helped a lot for designs with multiple ADCs. It
>>>>>>>> seems to me that bigger designs are going to be BRAM limited (FFT
>>>>>>>> BRAM requirements scale linearly), so we need to optimise cores to
>>>>>>>> go light on this resource.
>>>>>>>
>>>>>>> Agreed that BRAM is in general more precious than compute. In
>>>>>>> addition to using symmetrical coefficients, it might be worth
>>>>>>> looking at generating coefficients. I did some tests this morning
>>>>>>> with a simple moving average filter to turn 256 BRAM coefficients
>>>>>>> into 1024 (see attached model file), and it looks pretty promising:
>>>>>>> errors are a max of about 2.5%.
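Dan's 10-bit rule of thumb is easy to sanity-check against the standard 6.02B + 1.76 dB quantization-SNR figure (the full analysis is in the memo he cites). Quantizing a full-scale tone to 10 bits should land near 62 dB; the test frequency below is an arbitrary choice to keep the samples incommensurate with the tone period.

```python
import math

BITS = 10                       # stored coefficient precision
N = 10 ** 5                     # number of test samples
STEP = 2.0 ** (1 - BITS)        # quantization step for values in [-1, 1)

f = 0.1234567                   # arbitrary tone frequency, cycles/sample
sig = [math.sin(2 * math.pi * f * n) for n in range(N)]
q = [round(s / STEP) * STEP for s in sig]   # round to the 10-bit grid

p_sig = sum(s * s for s in sig) / N
p_err = sum((s - t) ** 2 for s, t in zip(sig, q)) / N
snr_db = 10 * math.log10(p_sig / p_err)
print(f"{BITS}-bit quantization SNR ~ {snr_db:.1f} dB")
```

Roughly 6 dB per bit means 10-bit coefficients sit far above the noise floor most CASPER designs care about, which is why interpolating from a small, moderately quantized table degrades SNR only slightly.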
>>>>>>> Coupling this with symmetric coefficients could cut coefficient
>>>>>>> storage to 1/8th, at the cost of a few extra adders for the
>>>>>>> interpolation filter. Thoughts?
>>>>>>>
>>>>>>> Cheers
>>>>>>> Danny
>>>>>
>>>>> --
>>>>> Aaron Parsons
>>>>> 510-306-4322
>>>>> Hearst Field Annex B54, UCB
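Danny's 256-to-1024 experiment can be mimicked without his attached model file: hold each stored coefficient for four output samples, then smooth with a length-4 moving average. The window shape is again an assumed Hamming-weighted sinc, not whatever his model used, so only the rough error magnitude is meaningful.

```python
import math

def window(x):
    """Smooth Hamming-weighted sinc (assumed PFB-like shape), x in [0, 1]."""
    ham = 0.54 - 0.46 * math.cos(2 * math.pi * x)
    u = 4.0 * (x - 0.5)
    return ham * (1.0 if u == 0 else math.sin(math.pi * u) / (math.pi * u))

COARSE, FINE = 256, 1024
R = FINE // COARSE              # 4x expansion, as in Danny's test

coarse = [window((k + 0.5) / COARSE) for k in range(COARSE)]
held = [coarse[k // R] for k in range(FINE)]        # zero-order hold

smooth = []
for k in range(FINE):           # length-R moving average over the held samples
    lo = max(0, k - R + 1)
    seg = held[lo:k + 1]
    smooth.append(sum(seg) / len(seg))

# The boxcar delays by (R-1)/2 fine samples and the held samples sit at
# half-sample centers, a net one-sample shift; compare against that
err = max(abs(smooth[k] - window((k - 1) / FINE)) for k in range(R, FINE))
print(f"max abs error after moving-average upsampling: {err:.2e}")
```

The result lands comfortably within the few-percent range Danny reports, and the hardware cost is just the adders he mentions; folding in the symmetric-coefficient halving then gives the 1/8th storage figure.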

