hi ryan, andrew, we used to use CORDIC for generating coefficients. not sure how CORDIC compares to Goertzel. there are a few open source VHDL CORDICs.
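For anyone comparing the two approaches: a rotation-mode CORDIC computes sin/cos with shift-and-add style updates and no multipliers. A minimal software sketch follows; the iteration count is arbitrary and real VHDL cores pipeline the loop and often fold the gain compensation into the input scaling.

```python
import math

def cordic_sincos(theta, iterations=16):
    """Rotation-mode CORDIC: compute (cos(theta), sin(theta)) using only
    add/subtract and power-of-two scaling, as a hardware pipeline would.
    Converges for |theta| up to about 1.74 rad (extend by quadrant folding)."""
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Gain compensation: the rotations stretch the vector by a fixed factor,
    # so scale by the product of cos(atan(2^-i)) at the end.
    gain = 1.0
    for a in angles:
        gain *= math.cos(a)
    x, y, z = 1.0, 0.0, theta  # start on the x-axis; z is the residual angle
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0  # always rotate toward z = 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x * gain, y * gain  # (cos(theta), sin(theta))
```

With 16 iterations the result is good to roughly 2^-15, comfortably beyond 18-bit coefficient precision.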
i think dave macmahon or someone developed a radix-4 version of the casper streaming FFT.

dan

On Thu, Jan 24, 2013 at 5:44 PM, Ryan Monroe <[email protected]> wrote:

> Hey Andrew, thanks for the designs! I'll have to spend some time looking
> them over later, there's some good stuff there.
>
>> Nice idea (I think the Goertzel algorithm is often used with this
>> technique?). I have considered this for the DDC, it allows almost
>> arbitrary frequency and phase resolution. The only cost is a fair amount
>> of multipliers. For most applications at the moment we are BRAM limited
>> so this is not a problem (the very wide bandwidth instruments might be
>> multiplier limited at some point). It would be good as an option to
>> trade off multipliers for BRAM.
>
> I haven't seen the Goertzel algorithm before, but it looks like a great
> idea for this: we might be able to produce a coefficient DDS in just two
> DSPs!
>
> For my applications, I'm *totally* DSP limited, but I agree that we should
> try to cater to the greater CASPER community of course.
>
>> Coefficient reuse (as you describe between phases) would be nice (at the
>> cost of some register stages I guess).
>
> The CASPER libraries *hemorrhage* pipeline stages. A few more won't hurt,
> and you'll be saving the RAM addressing logic. Not so bad.
>
>> I think the reuse of control logic, coefficients etc. would potentially
>> be the biggest saver assuming wide bandwidth systems. Ideally the
>> compiler would do this for us implicitly, but in the meantime explicit
>> reuse with optional register stages to reduce fanout would be awesome.
>
> You can change a setting on pipeline registers (and maybe other places
> too) which allows it to do this. It's called "Implement using behavioral
> HDL" in Simulink, or "allow_register_retiming" in the xBlock interface. I
> had a bad experience with it though: it'll try to optimize EVERYTHING. Got
> two identical registers which you intend to place on opposite sides of the
> chip?
> They're now the same register. In my experience, the only good way
> to control the sharing (or lack thereof) was to do it manually... YMMV.
>
> I've got another idea we can consider too. This one is farther away. I'm
> building radix-4 versions of my FFTs (1/2 as much fabric, 85% as much DSP
> and 100% as much coeff). Now, for radix 4, you get three coefficient banks
> per butterfly stage, and while the sum total (# coefficients stored) is the
> same, the coefficients are actually in trios of (x^1, x^2, x^3 and an
> implicit x^0). You could, in principle, store just the x^1 and square/cube
> it into x^2 and x^3. I haven't tried this (just thought of it), so no idea
> regarding performance. In addition, while Dan and I are working with JPL
> legal to get my library open-sourced, it's looking pretty clear that I
> won't be able to share the really new stuff, so you'd have to do radix-4 on
> your own :-(
>
> --Ryan
>
> On 01/22/2013 04:41 AM, Andrew Martens wrote:
>
>> Hi all
>>
>>> It would work well for the PFB, but what we *really* need is a
>>> solid "Direct Digital Synth (DDS) coefficient generator".
>>> ...
>>
>> Nice idea (I think the Goertzel algorithm is often used with this
>> technique?). I have considered this for the DDC, it allows almost
>> arbitrary frequency and phase resolution. The only cost is a fair amount
>> of multipliers. For most applications at the moment we are BRAM limited
>> so this is not a problem (the very wide bandwidth instruments might be
>> multiplier limited at some point). It would be good as an option to
>> trade off multipliers for BRAM.
>>
>> Coefficient reuse (as you describe between phases) would be nice (at the
>> cost of some register stages I guess).
>>
>> I think the reuse of control logic, coefficients etc. would potentially
>> be the biggest saver assuming wide bandwidth systems.
>> Ideally the
>> compiler would do this for us implicitly, but in the meantime explicit
>> reuse with optional register stages to reduce fanout would be awesome.
>>
>>> PS, Sorry, I'm a bit busy right now so I can't implement a
>>> coefficient interpolator for you guys right now. I'll write
>>> back when I'm more free.
>>
>> Got a bit carried away and implemented one. Attached is a model that
>> allows the comparison between ideal, interpolator, and Dan's reduced
>> storage idea. The interpolator uses a multiplier; cruder versions might
>> not, at the cost of noise and/or more logic.
>>
>>> PS2. I'm a bit anal about noise performance so I usually use
>>> a couple more bits than Dan prescribes, but as he demonstrated
>>> in the ASIC talks, his comments about bit widths are 100%
>>> correct. I would recommend them as a general design practice
>>> as well.
>>
>> I have also seen papers that show that FFT performance is more dependent
>> on data path bit width than coefficient bit width. We need a proper
>> study on how many bits are required for different performance levels.
>>
>>> but for long transforms, perhaps >4K points or so, then BRAMs might be
>>> in short supply, and then one could consider storing fewer coefficients
>>> (and also taking advantage of sin/cos and mirror symmetries, which
>>> don't degrade SNR at all).
>>
>> Did some work a while back. Attached is a model (sincos_upgrade.mdl)
>> that implements BRAM saving in different ways when generating FFT
>> twiddle factors (or DDC coefficients):
>>
>> 1. For very small numbers of coefficients, store them in the same word
>> (can output up to 36 bits from a BRAM, so can store 18-bit sin and cos
>> values next to each other in the same word) so that we use 1 instead of
>> the current 2 BRAMs (see sincos_single_bram in the design).
>>
>> 2. Store only a quarter of a sinusoid and generate the complex
>> exponential via clever address generation and inversion of the output.
>> This uses 1 BRAM instead of the current 8 (assuming a 'large' FFT) at the
>> cost of logic (and multipliers) (see sincos_min_ram in the design).
>>
>> 3. Store half a sinusoid and generate the complex exponential via clever
>> address generation. Uses 1 BRAM instead of the current 4 (assuming a
>> 'large' FFT) at the cost of some logic (see sincos_med_ram in the
>> design).
>>
>> The interpolator could be integrated into these to use even less BRAM.
>>
>> I will upgrade the library at some point this year to include these (and
>> the interpolator).
>>
>>> I did some tests this morning with a simple moving average filter to
>>> turn 256 BRAM coefficients into 1024 (see attached model file), and it
>>> looks pretty promising: errors are a max of about 2.5%.
>>
>> Could you send me this file? I would like to see how you did your
>> interpolation.
>>
>> Regards
>> Andrew
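On Ryan's "coefficient DDS in just two DSPs": the resonator at the heart of the Goertzel algorithm, run with zero input, generates a sinusoid from one stored coefficient (2·cos ω) and one multiply per sample; a parallel cos recurrence would use the second multiplier. A hedged sketch (the seeding and bin spacing here are assumptions, not Ryan's design):

```python
import math

def goertzel_oscillator(k, n, n_samples):
    """Generate sin(2*pi*k*i/n) recursively via s[i] = c*s[i-1] - s[i-2],
    with the single stored coefficient c = 2*cos(w). One multiply per
    output sample."""
    w = 2.0 * math.pi * k / n
    c = 2.0 * math.cos(w)
    s_prev2, s_prev = 0.0, math.sin(w)  # seed with sin(0*w), sin(1*w)
    out = [s_prev2, s_prev]
    for _ in range(n_samples - 2):
        s = c * s_prev - s_prev2
        out.append(s)
        s_prev2, s_prev = s_prev, s
    # the recurrence is marginally stable, so fixed-point hardware
    # typically re-seeds the state periodically to bound drift
    return out
```

One caveat worth noting for a fixed-point DSP implementation: quantization error accumulates in the recursion, hence the periodic re-seed.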
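Ryan's radix-4 trio idea (store only x^1, derive x^2 and x^3) amounts to trading two of the three coefficient banks for two complex multiplies per fetch. A sketch of the arithmetic (function name is illustrative):

```python
import cmath
import math

def radix4_twiddles(k, n):
    """Derive a radix-4 butterfly's twiddle trio from one stored value:
    w = W_N^k = exp(-2j*pi*k/n) is read from memory; w^2 and w^3 come
    from two complex multiplies instead of two more coefficient BRAMs."""
    w1 = cmath.exp(-2j * math.pi * k / n)  # the only stored coefficient
    w2 = w1 * w1  # square: one complex multiply
    w3 = w2 * w1  # cube: one more
    return w1, w2, w3
```

In fixed point the squaring/cubing would add a little quantization noise on w2 and w3, which may be why Ryan flags performance as untested.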
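Andrew's option 2 (quarter-sinusoid storage with address manipulation and sign inversion) can be sketched as follows. The table depth is an assumption, and the extra endpoint entry is a software convenience a BRAM version would fold into the addressing:

```python
import math

DEPTH = 256  # quarter-wave table depth (assumption); full cycle = 4*DEPTH
# One extra entry for sin(pi/2) = 1.0 keeps the sketch simple.
QUARTER = [math.sin(math.pi * i / (2 * DEPTH)) for i in range(DEPTH + 1)]

def sincos(addr):
    """Return (cos, sin) of 2*pi*addr/(4*DEPTH) using only the quarter
    table, per-quadrant address reversal, and sign inversion."""
    addr %= 4 * DEPTH
    quad, offs = divmod(addr, DEPTH)
    if quad == 0:
        return QUARTER[DEPTH - offs], QUARTER[offs]
    if quad == 1:
        return -QUARTER[offs], QUARTER[DEPTH - offs]
    if quad == 2:
        return -QUARTER[DEPTH - offs], -QUARTER[offs]
    return QUARTER[offs], -QUARTER[DEPTH - offs]
```

The half-sinusoid variant (option 3) is the same idea with only the sign flip and no address reversal, which is the "some logic" versus "logic (and multipliers)" distinction in the trade-off above.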
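Finally, the coefficient-expansion idea: Dan's test used a simple moving-average filter (max error around 2.5%); Andrew's interpolator spends a multiplier to do better. A sketch of a one-multiplier linear interpolator using Dan's 256-to-1024 numbers (the stored waveform here, one full sine cycle, is an assumption):

```python
import math

STORED = 256  # coefficients actually kept in BRAM (from Dan's test)
EXPAND = 4    # expansion factor, 256 -> 1024 (from Dan's test)
TABLE = [math.sin(2.0 * math.pi * i / STORED) for i in range(STORED)]

def coeff(addr):
    """Linearly interpolate coefficient addr out of STORED*EXPAND points:
    one multiply per output sample, as in a single-multiplier interpolator."""
    base, frac = divmod(addr % (STORED * EXPAND), EXPAND)
    a = TABLE[base]
    b = TABLE[(base + 1) % STORED]
    return a + (b - a) * (frac / EXPAND)
```

For a sinusoid sampled this densely, linear interpolation's worst-case error is on the order of (step)^2/8, i.e. well below the moving-average filter's ~2.5%.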

