For long FFTs, you could also use two BRAM18s (as lookup tables) and two
complex multiplies (3 DSPs each for V6, 4 DSPs each for V5) to get a
coefficient with 17 bits of accuracy and enough resolution for a
2^19-point FFT.
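A minimal numpy sketch of one way to read that scheme (the exact table
split here is an assumption): factor the twiddle angle into a coarse part
and a fine part, store each as a small quantized sin/cos lookup, and
combine them with a complex multiply, since e^(j(a+b)) = e^(ja) * e^(jb).

    import numpy as np

    N = 2**19                    # FFT length
    SCALE = 2**17 - 1            # 18-bit signed fixed point

    def q(z):
        """Quantize a complex table to 18-bit fixed point."""
        return (np.round(z.real * SCALE) + 1j * np.round(z.imag * SCALE)) / SCALE

    # Coarse table: 1024 angles in steps of 2*pi/2^10; fine table: 512 angles
    # in steps of 2*pi/2^19.  Each fits in one BRAM18.
    coarse = q(np.exp(-2j * np.pi * np.arange(2**10) / 2**10))
    fine = q(np.exp(-2j * np.pi * np.arange(2**9) / N))

    k = np.arange(N)
    tw = coarse[k >> 9] * fine[k & (2**9 - 1)]  # one complex multiply per twiddle
    print(np.max(np.abs(tw - np.exp(-2j * np.pi * k / N))))  # around 2^-17

The second complex multiply in the count above is presumably the one that
applies the twiddle to the data; only the table-combine multiply is
modeled here.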
On 01/24/2013 05:49 PM, Dan Werthimer wrote:
hi ryan, andrew,
we used to use CORDIC for generating coefficients.
not sure how cordic compares to goertzel.
there are a few open source VHDL cordics.
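for reference, a minimal floating-point model of a CORDIC rotation (in
hardware the 2^-i scalings are shifts; the iteration count here is an
arbitrary choice):

    import numpy as np

    def cordic_sincos(theta, iters=18):
        """Rotate (1, 0) by theta (|theta| <= pi/2) with CORDIC micro-rotations."""
        angles = np.arctan(2.0 ** -np.arange(iters))  # arctan(2^-i) lookup
        gain = np.prod(np.cos(angles))                # compensates CORDIC growth
        x, y, z = 1.0, 0.0, theta
        for i in range(iters):
            d = 1.0 if z >= 0 else -1.0               # steer residual angle to 0
            x, y = x - d * y * 2.0**-i, y + d * x * 2.0**-i
            z -= d * angles[i]
        return gain * x, gain * y                     # (cos(theta), sin(theta))

    print(cordic_sincos(np.pi / 5), (np.cos(np.pi / 5), np.sin(np.pi / 5)))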
i think dave macmahon or someone developed
a radix4 version of the casper streaming FFT.
dan
On Thu, Jan 24, 2013 at 5:44 PM, Ryan Monroe <[email protected]> wrote:
Hey Andrew, thanks for the designs! I'll have to spend some time
looking them over later, there's some good stuff there.
Nice idea (I think the Goertzel algorithm is often used with this
technique?). I have considered this for the DDC; it allows almost
arbitrary frequency and phase resolution. The only cost is a fair amount
of multipliers. For most applications at the moment we are BRAM limited,
so this is not a problem (the very wide bandwidth instruments might be
multiplier limited at some point). It would be good as an option to
trade off multipliers for BRAM.
I haven't seen the Goertzel algorithm before, but it looks like a
great idea for this: we might be able to produce a coefficient DDS
in just two DSPs!
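A quick numpy model of the two-DSP idea as I read it (tone spacing and
lengths are assumptions): a Goertzel-style resonator needs one real
multiply per output sample, so a sin/cos pair costs two.

    import numpy as np

    def resonator(w, n, y0, y1):
        """Goertzel-style recursion y[k] = 2*cos(w)*y[k-1] - y[k-2]:
        one real multiply (one DSP) per output sample."""
        c = 2.0 * np.cos(w)
        y = [y0, y1]
        for _ in range(n - 2):
            y.append(c * y[-1] - y[-2])
        return np.array(y)

    w = 2 * np.pi / 64                           # assumed phase step per sample
    cos_k = resonator(w, 256, 1.0, np.cos(w))    # tracks cos(k*w)
    sin_k = resonator(w, 256, 0.0, np.sin(w))    # tracks sin(k*w)
    k = np.arange(256)
    print(np.max(np.abs(cos_k - np.cos(w * k))),
          np.max(np.abs(sin_k - np.sin(w * k))))

In fixed point the recursion accumulates rounding error, so a real design
would presumably reseed the state periodically.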
For my applications, I'm *totally* DSP limited, but I agree that
we should try to cater to the greater CASPER community of course.
Coefficient reuse (as you describe between phases) would be nice (at the
cost of some register stages I guess).
The CASPER libraries *hemorrhage* pipeline stages. A few more
won't hurt, and you'll be saving the RAM addressing logic. Not so
bad.
I think the reuse of control logic, coefficients, etc. would potentially
be the biggest saver, assuming wide bandwidth systems. Ideally the
compiler would do this for us implicitly, but in the meantime explicit
reuse with optional register stages to reduce fanout would be awesome.
You can change a setting on pipeline registers (and maybe other
places too) which allows it to do this. It's called "Implement
using behavioral HDL" in Simulink, or "allow_register_retiming" in
the xBlock interface. I had a bad experience with it though:
it'll try to optimize EVERYTHING. Got two identical registers
which you intend to place on opposite sides of the chip? They're
now the same register. In my experience, the only good way to
control the sharing (or lack thereof) was to do it manually... YMMV.
I've got another idea we can consider too. This one is farther
away. I'm building radix-4 versions of my FFTs (1/2 as much
fabric, 85% as much DSP, and 100% as much coeff). Now, for radix
4, you get three coefficient banks per butterfly stage, and while
the sum total (# coefficients stored) is the same, the
coefficients actually come in trios of (x^1, x^2, x^3, plus an
implicit x^0). You could, in principle, store just the x^1 and
square/cube it into x^2 and x^3. I haven't tried this (just
thought of it), so no idea regarding performance. In addition,
while Dan and I are working with JPL legal to get my library
open-sourced, it's looking pretty clear that I won't be able to
share the really new stuff, so you'd have to do radix-4 on your
own :-(
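A sketch of that squaring trick (FFT size assumed; fixed-point effects
not modeled):

    import numpy as np

    N = 4096                                  # assumed FFT length
    k = np.arange(N // 4)                     # one radix-4 coefficient bank
    w1 = np.exp(-2j * np.pi * k / N)          # stored x^1 twiddles

    w2 = w1 * w1                              # x^2: one complex multiply
    w3 = w2 * w1                              # x^3: one more complex multiply

    print(np.max(np.abs(w2 - np.exp(-2j * np.pi * 2 * k / N))))  # exact in floats
    print(np.max(np.abs(w3 - np.exp(-2j * np.pi * 3 * k / N))))

With quantized tables the squaring roughly doubles the stored
coefficient's phase error, which is presumably the open performance
question.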
--Ryan
On 01/22/2013 04:41 AM, Andrew Martens wrote:
Hi all
It would work well for the PFB, but what we *really* need is a
solid "Direct Digital Synth (DDS) coefficient generator".
...
Nice idea (I think the Goertzel algorithm is often used with this
technique?). I have considered this for the DDC; it allows almost
arbitrary frequency and phase resolution. The only cost is a fair amount
of multipliers. For most applications at the moment we are BRAM limited,
so this is not a problem (the very wide bandwidth instruments might be
multiplier limited at some point). It would be good as an option to
trade off multipliers for BRAM.
Coefficient reuse (as you describe between phases) would be nice (at the
cost of some register stages I guess).
I think the reuse of control logic, coefficients, etc. would potentially
be the biggest saver, assuming wide bandwidth systems. Ideally the
compiler would do this for us implicitly, but in the meantime explicit
reuse with optional register stages to reduce fanout would be awesome.
PS. Sorry, I'm a bit busy so I can't implement a coefficient
interpolator for you guys right now. I'll write back when I'm
more free.
Got a bit carried away and implemented one. Attached is a model that
allows comparison between the ideal, the interpolator, and Dan's
reduced-storage idea. The interpolator uses a multiplier; cruder
versions might not, at the cost of noise and/or more logic.
PS2. I'm a bit anal about noise performance so I usually use a
couple more bits than Dan prescribes, but as he demonstrated in
the ASIC talks, his comments about bit widths are 100% correct.
I would recommend them as a general design practice as well.
I have also seen papers that show that FFT performance is more dependent
on data path bit width than coefficient bit width. We need a proper
study on how many bits are required for different performance levels.
but for long transforms, perhaps >4K points or so, then BRAMs
might be in short supply, and then one could consider storing
fewer coefficients (and also taking advantage of sin/cos and
mirror symmetries, which don't degrade SNR at all).
Did some work a while back. Attached is a model (sincos_upgrade.mdl)
that implements BRAM saving in different ways when generating FFT
twiddle factors (or DDC coefficients):
1. For very small numbers of coefficients, store them in the same word
(a BRAM can output up to 36 bits, so 18-bit sin and cos values can sit
next to each other in the same word), so that we use 1 BRAM instead of
the current 2. (see sincos_single_bram in the design)
2. Store only a quarter of a sinusoid and generate the complex
exponential via clever address generation and inversion of the output.
This uses 1 BRAM instead of the current 8 (assuming a 'large' FFT) at
the cost of logic (and multipliers). (see sincos_min_ram in the design,
and the sketch after this list)
3. Store half a sinusoid and generate the complex exponential via clever
address generation. Uses 1 BRAM instead of the current 4 (assuming a
'large' FFT) at the cost of some logic. (see sincos_med_ram in the
design)
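A rough Python model of the quarter-wave addressing in item 2 (table
depth assumed; a real design would typically offset the address by half
an LSB so the quadrant endpoints need no special casing):

    import numpy as np

    DEPTH = 1024                              # assumed quarter-wave table depth
    QUARTER = np.sin(0.5 * np.pi * np.arange(DEPTH) / DEPTH)  # first quadrant
    N = 4 * DEPTH                             # full-circle resolution

    def qsin(k):
        """sin(2*pi*k/N) from the quarter table via address fold/inversion."""
        quad, idx = divmod(k % N, DEPTH)
        if quad == 0:
            return QUARTER[idx]
        if quad == 1:                         # sin(pi/2 + x) = sin(pi/2 - x)
            return QUARTER[DEPTH - idx] if idx else 1.0
        if quad == 2:                         # sin(pi + x) = -sin(x)
            return -QUARTER[idx]
        return -QUARTER[DEPTH - idx] if idx else -1.0   # sin(3*pi/2 + x)

    def qcos(k):
        return qsin(k + N // 4)               # cos(x) = sin(x + pi/2)

    k = np.arange(N)
    err = max(abs(qsin(int(i)) - np.sin(2 * np.pi * i / N)) for i in k)
    print(err)   # zero up to float rounding: the symmetry costs no SNR

Item 3's half-sinusoid store is the same idea using only the sign fold
(sin(pi + x) = -sin(x)), trading simpler addressing for twice the BRAM.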
The interpolator could be integrated into these to use even less BRAM.
I will upgrade the library at some point this year to include these
(and the interpolator).
I did some tests this morning with a simple moving average filter
to turn 256 BRAM coefficients into 1024 (see attached model file),
and it looks pretty promising: errors are a max of about 2.5%.
Could you send me this file? I would like to see how you did your
interpolation.
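In the meantime, here is a small numpy model of that kind of scheme (the
4-tap average and the delay alignment are assumptions, so the error
figure need not match the ~2.5% above): hold each stored coefficient for
4 output samples, then smooth with a moving average, which amounts to
linear interpolation between table entries.

    import numpy as np

    COARSE, FINE = 256, 1024
    UP = FINE // COARSE                               # 4x interpolation
    table = np.cos(2 * np.pi * np.arange(COARSE) / COARSE)   # stored values

    held = np.repeat(table, UP)                       # zero-order hold to 1024
    padded = np.concatenate([held[-(UP - 1):], held]) # circular edge handling
    smooth = np.convolve(padded, np.ones(UP) / UP, mode='valid')

    # The hold plus 4-tap average behaves like a 3-sample delay; compare
    # against the ideal sinusoid with that delay removed.
    ideal = np.cos(2 * np.pi * (np.arange(FINE) - 3) / FINE)
    print('max error: %.2e' % np.max(np.abs(smooth - ideal)))

Most of the apparent error in such a scheme comes from how the filter's
delay is aligned; with the delay removed, what remains is just the
linear-interpolation error.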
Regards
Andrew