Re: [casper] number of coefficients needed in PFB and FFT

Ryan Monroe Thu, 24 Jan 2013 17:44:55 -0800

Hey Andrew, thanks for the designs! I'll have to spend some time looking
them over later; there's some good stuff there.


Nice idea (I think the Goertzel algorithm is often used with this
technique?). I have considered this for the DDC, it allows almost
arbitrary frequency and phase resolution. The only cost is a fair amount
of multipliers. For most applications at the moment we are BRAM limited
so this is not a problem (the very wide bandwidth instruments might be
multiplier limited at some point). It would be good as an option to
trade off multipliers for BRAM.

I haven't seen the Goertzel algorithm before, but it looks like a great
idea for this: we might be able to produce a coefficient DDS in just two
DSPs!
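
For reference, here is a minimal sketch of the second-order recursion
that the Goertzel algorithm is built on, used as a sine/cosine
generator. It is only an illustration in floating-point Python, not
anything from the CASPER libraries; the function name and arguments are
made up for this example. Each output needs one multiply by the constant
2*cos(w), so cos and sin together need two multiplies per sample, which
is one possible reading of the "two DSPs" estimate (that mapping is an
assumption).

```python
import math


def recursive_sincos(w, n_samples):
    """Return [(cos(n*w), sin(n*w))] for n = 0 .. n_samples-1.

    Uses y[n+1] = 2*cos(w)*y[n] - y[n-1], which holds for both
    cos(n*w) and sin(n*w). Hypothetical helper, for illustration only.
    """
    k = 2.0 * math.cos(w)  # the only constant the recursion needs
    # Seed with the samples at n = -1 and n = 0.
    cos_prev, cos_cur = math.cos(-w), 1.0
    sin_prev, sin_cur = math.sin(-w), 0.0
    out = []
    for _ in range(n_samples):
        out.append((cos_cur, sin_cur))
        # One multiply per branch per sample.
        cos_prev, cos_cur = cos_cur, k * cos_cur - cos_prev
        sin_prev, sin_cur = sin_cur, k * sin_cur - sin_prev
    return out


if __name__ == "__main__":
    n = 1024
    w = 2.0 * math.pi / n
    samples = recursive_sincos(w, n)
    worst = max(
        max(abs(c - math.cos(i * w)), abs(s - math.sin(i * w)))
        for i, (c, s) in enumerate(samples)
    )
    print("max error over one period:", worst)
```

Round-off accumulates in the recursion, so a fixed-point version would
presumably need wider state than the output, or a periodic reset to
known values.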

For my applications, I'm *totally* DSP limited, but I agree that we
should try to cater to the greater CASPER community, of course.


Coefficient reuse (as you describe between phases) would be nice (at the
cost of some register stages I guess).

The CASPER libraries *hemorrhage* pipeline stages. A few more won't
hurt, and you'll be saving the RAM addressing logic. Not so bad.


I think the reuse of control logic, coefficients etc would potentially
be the biggest saver assuming wide bandwidth systems. Ideally the
compiler would do this for us implicitly, but in the meantime explicit
reuse with optional register stages to reduce fanout would be awesome.

You can change a setting on pipeline registers (and maybe other places
too) which allows it to do this. It's called "Implement using
behavioral HDL" in Simulink, or "allow_register_retiming" in the xBlock
interface. I had a bad experience with it, though: it'll try to optimize
EVERYTHING. Got two identical registers which you intend to place on
opposite sides of the chip? They're now the same register. In my
experience, the only good way to control the sharing (or lack thereof)
was to do it manually. YMMV.

I've got another idea we can consider too. This one is farther away.
I'm building radix-4 versions of my FFTs (1/2 as much fabric, 85% as
much DSP and 100% as much coeff). Now, for radix 4, you get three
coefficient banks per butterfly stage, and while the sum total (# of
coefficients stored) is the same, the coefficients are actually in trios
of (x^1; x^2; x^3 and an implicit x^0). You could, in principle, store
just the x^1 and square/cube it into x^2 and x^3. I haven't tried this
(just thought of it), so I have no idea regarding performance. In
addition, while Dan and I are working with JPL legal to get my library
open-sourced, it's looking pretty clear that I won't be able to share
the really new stuff, so you'd have to do radix-4 on your own :-(
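
A quick way to get a feel for the square/cube idea is to model it in
software. The sketch below is untested against any hardware and makes
several assumptions: 18-bit coefficients, rounding after each product,
and the helper names, which are all invented for this example. It stores
only x^1, derives x^2 and x^3, and reports the worst deviation from the
ideal twiddles.

```python
import cmath
import math


def quantize(z, bits):
    """Round real and imaginary parts to a signed grid of `bits` bits."""
    scale = 2 ** (bits - 1) - 1
    return complex(round(z.real * scale), round(z.imag * scale)) / scale


def twiddle_trio(k, n_points, bits=18):
    """Derive (x^1, x^2, x^3) from a single stored, quantized x^1."""
    x1 = quantize(cmath.exp(-2j * math.pi * k / n_points), bits)
    x2 = quantize(x1 * x1, bits)  # one complex multiply
    x3 = quantize(x2 * x1, bits)  # a second complex multiply
    return x1, x2, x3


def worst_error(n_points, bits=18):
    """Largest |derived - ideal| over the trios for k = 0 .. n_points/4 - 1."""
    worst = 0.0
    for k in range(n_points // 4):
        for power, value in enumerate(twiddle_trio(k, n_points, bits), start=1):
            ideal = cmath.exp(-2j * math.pi * k * power / n_points)
            worst = max(worst, abs(value - ideal))
    return worst


if __name__ == "__main__":
    print("worst error, 1024 points, 18 bits:", worst_error(1024, 18))
```

Note that each derived coefficient costs a complex multiply, so this
trades coefficient storage for multipliers, which may not suit a
DSP-limited design.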


--Ryan

On 01/22/2013 04:41 AM, Andrew Martens wrote:

Hi all

         It would work well for the PFB, but what we *really* need is a
         solid "Direct Digital Synth (DDS) coefficient generator".
         ...

Nice idea (I think the Goertzel algorithm is often used with this
technique?). I have considered this for the DDC, it allows almost
arbitrary frequency and phase resolution. The only cost is a fair amount
of multipliers. For most applications at the moment we are BRAM limited
so this is not a problem (the very wide bandwidth instruments might be
multiplier limited at some point). It would be good as an option to
trade off multipliers for BRAM.

Coefficient reuse (as you describe between phases) would be nice (at the
cost of some register stages I guess).

I think the reuse of control logic, coefficients etc would potentially
be the biggest saver assuming wide bandwidth systems. Ideally the
compiler would do this for us implicitly, but in the meantime explicit
reuse with optional register stages to reduce fanout would be awesome.

         PS, Sorry, I'm a bit busy right now so I can't implement a
         coefficient interpolator for you guys right now.  I'll write
         back when I'm more free

Got a bit carried away and implemented one. Attached is a model that
allows the comparison between ideal, interpolator, and Dan's reduced
storage idea. The interpolator uses a multiplier; cruder versions might
not, at the cost of noise and/or more logic.
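
For readers without the attached model, the general idea of a
coefficient interpolator can be sketched as follows. This is not the
attached design; it is a plain linear interpolator written for
illustration, with made-up names and a full-period sine table assumed.
It stores n_stored samples and produces n_stored*factor outputs using
one multiply (by the fractional position) per output.

```python
import math


def build_table(n_stored):
    """One full period of a sine, sampled at n_stored points."""
    return [math.sin(2.0 * math.pi * i / n_stored) for i in range(n_stored)]


def interpolated_sine(table, index, factor):
    """Linearly interpolate between neighbouring stored samples.

    `index` counts output samples; `factor` is the expansion ratio.
    """
    n_stored = len(table)
    base, frac = divmod(index, factor)
    a = table[base % n_stored]
    b = table[(base + 1) % n_stored]  # wraps at the end of the period
    return a + (b - a) * frac / factor  # the single multiply


def max_interp_error(n_stored, factor):
    """Worst absolute error against the ideal sine over one period."""
    table = build_table(n_stored)
    n_out = n_stored * factor
    return max(
        abs(interpolated_sine(table, i, factor) - math.sin(2.0 * math.pi * i / n_out))
        for i in range(n_out)
    )


if __name__ == "__main__":
    print("256 stored -> 1024 output, max error:", max_interp_error(256, 4))
```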

         PS2.  I'm a bit anal about noise performance so I usually use
         a couple more bits than Dan prescribes, but as he demonstrated
         in the asic talks, his comments about bit widths are 100%
         correct.   I would recommend them as a general design practice
         as well.

I have also seen papers that show that FFT performance is more dependent
on data path bit width than coefficient bit width. We need a proper
study on how many bits are required for different performance levels.
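
As a starting point for such a study, a toy model along these lines
could be used. It is only a sketch under stated assumptions: a recursive
radix-2 FFT, rounding after every butterfly, a divide-by-two per stage
to avoid overflow, no saturation, and random full-band input. None of
this is claimed to match the CASPER FFT blocks, and all names are
invented for the example.

```python
import cmath
import math
import random


def quantize(z, bits):
    """Round to a signed grid of `bits` bits; bits=None means no rounding."""
    if bits is None:
        return z
    scale = 2 ** (bits - 1)
    return complex(round(z.real * scale), round(z.imag * scale)) / scale


def fft_model(x, data_bits, coeff_bits):
    """Radix-2 DIT FFT with quantized twiddles and per-stage data rounding."""
    n = len(x)  # must be a power of two
    if n == 1:
        return list(x)
    even = fft_model(x[0::2], data_bits, coeff_bits)
    odd = fft_model(x[1::2], data_bits, coeff_bits)
    out = [0j] * n
    for k in range(n // 2):
        w = quantize(cmath.exp(-2j * math.pi * k / n), coeff_bits)
        t = w * odd[k]
        # Scale by 1/2 each stage, then round the data path.
        out[k] = quantize((even[k] + t) / 2, data_bits)
        out[k + n // 2] = quantize((even[k] - t) / 2, data_bits)
    return out


def snr_db(n_points, data_bits, coeff_bits, seed=0):
    """SNR of the quantized model against the same FFT without rounding."""
    rng = random.Random(seed)
    x = [complex(rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5))
         for _ in range(n_points)]
    ref = fft_model(x, None, None)
    out = fft_model(x, data_bits, coeff_bits)
    signal = sum(abs(r) ** 2 for r in ref)
    noise = sum(abs(r - o) ** 2 for r, o in zip(ref, out))
    return 10.0 * math.log10(signal / noise)


if __name__ == "__main__":
    for data_bits, coeff_bits in ((18, 18), (18, 10), (10, 18)):
        print(data_bits, coeff_bits, round(snr_db(256, data_bits, coeff_bits), 1))
```

Sweeping data_bits and coeff_bits independently gives a rough picture of
which one dominates for a given transform length in this model.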

                 but for long transforms, perhaps >4K points or so,
                 then BRAMs might be in short supply, and then one
                 could consider storing fewer coefficients (and also
                 taking advantage of sin/cos and mirror symmetries,
                 which don't degrade SNR at all).

Did some work a while back. Attached is a model (sincos_upgrade.mdl)
that implements BRAM saving in different ways when generating FFT
twiddle factors (or DDC coefficients):

1. For very small numbers of coefficients, store them in the same word
(can output up to 36 bits from a BRAM so can store 18 bit sin and cos
values next to each other in the same word) so that we use 1 instead of
(current) 2 BRAMs. (see sincos_single_bram in the design)

2. Store only a quarter of a sinusoid and generate the complex
exponential via clever address generation and inversion of the output.
This uses 1 BRAM instead of (current, assuming a 'large' FFT) 8 at the
cost of logic (and multipliers) (see sincos_min_ram in the design)

3. Store half a sinusoid and generate the complex exponential via clever
address generation. Uses 1 BRAM instead of the (current, assuming a
'large' FFT) 4 at the cost of some logic. (see sincos_med_ram in the
design).
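
To make option 2 concrete for anyone who cannot open the model, here is
a rough software sketch of quarter-wave storage. It is not taken from
sincos_min_ram; the addressing scheme, the extra table entry for
sin(pi/2), and the function names are assumptions chosen to keep the
example short.

```python
import math


def build_quarter_table(n_points):
    """Store sin() over the first quadrant only, including sin(pi/2)."""
    quarter = n_points // 4
    return [math.sin(2.0 * math.pi * i / n_points) for i in range(quarter + 1)]


def sin_from_quarter(table, n_points, idx):
    """Rebuild sin(2*pi*idx/n_points) by address mirroring and negation."""
    quarter = n_points // 4
    quadrant, offset = divmod(idx % n_points, quarter)
    if quadrant == 0:
        return table[offset]             # stored directly
    if quadrant == 1:
        return table[quarter - offset]   # mirrored address
    if quadrant == 2:
        return -table[offset]            # negated output
    return -table[quarter - offset]      # mirrored and negated


def sincos(table, n_points, idx):
    """Return (cos, sin); cos is the same table read a quarter period ahead."""
    s = sin_from_quarter(table, n_points, idx)
    c = sin_from_quarter(table, n_points, idx + n_points // 4)
    return c, s


if __name__ == "__main__":
    n = 1024
    table = build_quarter_table(n)
    print("table entries:", len(table), "for", n, "output points")
    print("sample at idx=300:", sincos(table, n, 300))
```

The half-sinusoid variant in option 3 would drop the address mirroring
and keep only the negation, at the cost of twice the storage.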

The interpolator could be integrated into these to use even less BRAM.

I will upgrade the library at some point this year to include these (and
the interpolator).

                 I did some tests this morning with a simple moving
                 average filter to turn 256 BRAM coefficients into 1024
                 (see attached model file), and it looks pretty
                 promising: errors are a max of about 2.5%.

Could you send me this file? I would like to see how you did your
interpolation.

Regards
Andrew
