hi ryan, andrew, we used to use CORDIC for generating coefficients. not sure how CORDIC compares to Goertzel. there are a few open source VHDL CORDICs.
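For anyone comparing the two approaches: a rotation-mode CORDIC computes sin/cos with shift-and-add style updates and no multipliers. A minimal software sketch follows; the iteration count is arbitrary and real VHDL cores pipeline the loop and often fold the gain compensation into the input scaling.

```python
import math

def cordic_sincos(theta, iterations=16):
    """Rotation-mode CORDIC: compute (cos(theta), sin(theta)) using only
    add/subtract and power-of-two scaling, as a hardware pipeline would.
    Converges for |theta| up to about 1.74 rad (extend by quadrant folding)."""
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Gain compensation: the rotations stretch the vector by a fixed factor,
    # so scale by the product of cos(atan(2^-i)) at the end.
    gain = 1.0
    for a in angles:
        gain *= math.cos(a)
    x, y, z = 1.0, 0.0, theta  # start on the x-axis; z is the residual angle
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0  # always rotate toward z = 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x * gain, y * gain  # (cos(theta), sin(theta))
```

With 16 iterations the result is good to roughly 2^-15, comfortably beyond 18-bit coefficient precision.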
i think dave macmahon or someone developed a radix-4 version of the casper streaming FFT.

dan

On Thu, Jan 24, 2013 at 5:44 PM, Ryan Monroe <[email protected]> wrote:

> Hey Andrew, thanks for the designs! I'll have to spend some time looking
> them over later, there's some good stuff there.
>
>> Nice idea (I think the Goertzel algorithm is often used with this
>> technique?). I have considered this for the DDC, it allows almost
>> arbitrary frequency and phase resolution. The only cost is a fair amount
>> of multipliers. For most applications at the moment we are BRAM limited
>> so this is not a problem (the very wide bandwidth instruments might be
>> multiplier limited at some point). It would be good as an option to
>> trade off multipliers for BRAM.
>
> I haven't seen the Goertzel algorithm before, but it looks like a great
> idea for this: we might be able to produce a coefficient DDS in just two
> DSPs!
>
> For my applications, I'm *totally* DSP limited, but I agree that we should
> try to cater to the greater CASPER community of course.
>
>> Coefficient reuse (as you describe between phases) would be nice (at the
>> cost of some register stages I guess).
>
> The CASPER libraries *hemorrhage* pipeline stages. A few more won't hurt,
> and you'll be saving the RAM addressing logic. Not so bad.
>
>> I think the reuse of control logic, coefficients etc. would potentially
>> be the biggest saver assuming wide bandwidth systems. Ideally the
>> compiler would do this for us implicitly, but in the meantime explicit
>> reuse with optional register stages to reduce fanout would be awesome.
>
> You can change a setting on pipeline registers (and maybe other places
> too) which allows it to do this. It's called "Implement using behavioral
> HDL" in Simulink, or "allow_register_retiming" in the xBlock interface. I
> had a bad experience with it though: it'll try to optimize EVERYTHING. Got
> two identical registers which you intend to place on opposite sides of the
> chip?
> They're now the same register. In my experience, the only good way
> to control the sharing (or lack thereof) was to do it manually... YMMV.
>
> I've got another idea we can consider too. This one is farther away. I'm
> building radix-4 versions of my FFTs (1/2 as much fabric, 85% as much DSP
> and 100% as much coeff). Now, for radix 4, you get three coefficient banks
> per butterfly stage, and while the sum total (# coefficients stored) is the
> same, the coefficients are actually in trios of (x^1, x^2, x^3 and an
> implicit x^0). You could, in principle, store just the x^1 and square/cube
> it into x^2 and x^3. I haven't tried this (just thought of it), so no idea
> regarding performance. In addition, while Dan and I are working with JPL
> legal to get my library open-sourced, it's looking pretty clear that I
> won't be able to share the really new stuff, so you'd have to do radix-4 on
> your own :-(
>
> --Ryan
>
> On 01/22/2013 04:41 AM, Andrew Martens wrote:
>
>> Hi all
>>
>>> It would work well for the PFB, but what we *really* need is a
>>> solid "Direct Digital Synth (DDS) coefficient generator".
>>> ...
>>
>> Nice idea (I think the Goertzel algorithm is often used with this
>> technique?). I have considered this for the DDC, it allows almost
>> arbitrary frequency and phase resolution. The only cost is a fair amount
>> of multipliers. For most applications at the moment we are BRAM limited
>> so this is not a problem (the very wide bandwidth instruments might be
>> multiplier limited at some point). It would be good as an option to
>> trade off multipliers for BRAM.
>>
>> Coefficient reuse (as you describe between phases) would be nice (at the
>> cost of some register stages I guess).
>>
>> I think the reuse of control logic, coefficients etc. would potentially
>> be the biggest saver assuming wide bandwidth systems.
>> Ideally the
>> compiler would do this for us implicitly, but in the meantime explicit
>> reuse with optional register stages to reduce fanout would be awesome.
>>
>>> PS, Sorry, I'm a bit busy right now so I can't implement a
>>> coefficient interpolator for you guys right now. I'll write
>>> back when I'm more free.
>>
>> Got a bit carried away and implemented one. Attached is a model that
>> allows the comparison between ideal, interpolator, and Dan's reduced
>> storage idea. The interpolator uses a multiplier; cruder versions might
>> not, at the cost of noise and/or more logic.
>>
>>> PS2. I'm a bit anal about noise performance so I usually use
>>> a couple more bits than Dan prescribes, but as he demonstrated
>>> in the ASIC talks, his comments about bit widths are 100%
>>> correct. I would recommend them as a general design practice
>>> as well.
>>
>> I have also seen papers that show that FFT performance is more dependent
>> on data path bit width than coefficient bit width. We need a proper
>> study on how many bits are required for different performance levels.
>>
>>> but for long transforms, perhaps >4K points or so, then BRAMs might be
>>> in short supply, and then one could consider storing fewer coefficients
>>> (and also taking advantage of sin/cos and mirror symmetries, which
>>> don't degrade SNR at all).
>>
>> Did some work a while back. Attached is a model (sincos_upgrade.mdl)
>> that implements BRAM saving in different ways when generating FFT
>> twiddle factors (or DDC coefficients):
>>
>> 1. For very small numbers of coefficients, store them in the same word
>> (can output up to 36 bits from a BRAM, so can store 18-bit sin and cos
>> values next to each other in the same word) so that we use 1 instead of
>> the current 2 BRAMs (see sincos_single_bram in the design).
>>
>> 2. Store only a quarter of a sinusoid and generate the complex
>> exponential via clever address generation and inversion of the output.
>> This uses 1 BRAM instead of the current 8 (assuming a 'large' FFT) at the
>> cost of logic (and multipliers) (see sincos_min_ram in the design).
>>
>> 3. Store half a sinusoid and generate the complex exponential via clever
>> address generation. Uses 1 BRAM instead of the current 4 (assuming a
>> 'large' FFT) at the cost of some logic (see sincos_med_ram in the
>> design).
>>
>> The interpolator could be integrated into these to use even less BRAM.
>>
>> I will upgrade the library at some point this year to include these (and
>> the interpolator).
>>
>>> I did some tests this morning with a simple moving average filter to
>>> turn 256 BRAM coefficients into 1024 (see attached model file), and it
>>> looks pretty promising: errors are a max of about 2.5%.
>>
>> Could you send me this file? I would like to see how you did your
>> interpolation.
>>
>> Regards
>> Andrew
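On Ryan's "coefficient DDS in just two DSPs": the resonator at the heart of the Goertzel algorithm, run with zero input, generates a sinusoid from one stored coefficient (2·cos ω) and one multiply per sample; a parallel cos recurrence would use the second multiplier. A hedged sketch (the seeding and bin spacing here are assumptions, not Ryan's design):

```python
import math

def goertzel_oscillator(k, n, n_samples):
    """Generate sin(2*pi*k*i/n) recursively via s[i] = c*s[i-1] - s[i-2],
    with the single stored coefficient c = 2*cos(w). One multiply per
    output sample."""
    w = 2.0 * math.pi * k / n
    c = 2.0 * math.cos(w)
    s_prev2, s_prev = 0.0, math.sin(w)  # seed with sin(0*w), sin(1*w)
    out = [s_prev2, s_prev]
    for _ in range(n_samples - 2):
        s = c * s_prev - s_prev2
        out.append(s)
        s_prev2, s_prev = s_prev, s
    # the recurrence is marginally stable, so fixed-point hardware
    # typically re-seeds the state periodically to bound drift
    return out
```

One caveat worth noting for a fixed-point DSP implementation: quantization error accumulates in the recursion, hence the periodic re-seed.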
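Ryan's radix-4 trio idea (store only x^1, derive x^2 and x^3) amounts to trading two of the three coefficient banks for two complex multiplies per fetch. A sketch of the arithmetic (function name is illustrative):

```python
import cmath
import math

def radix4_twiddles(k, n):
    """Derive a radix-4 butterfly's twiddle trio from one stored value:
    w = W_N^k = exp(-2j*pi*k/n) is read from memory; w^2 and w^3 come
    from two complex multiplies instead of two more coefficient BRAMs."""
    w1 = cmath.exp(-2j * math.pi * k / n)  # the only stored coefficient
    w2 = w1 * w1  # square: one complex multiply
    w3 = w2 * w1  # cube: one more
    return w1, w2, w3
```

In fixed point the squaring/cubing would add a little quantization noise on w2 and w3, which may be why Ryan flags performance as untested.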
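Andrew's option 2 (quarter-sinusoid storage with address manipulation and sign inversion) can be sketched as follows. The table depth is an assumption, and the extra endpoint entry is a software convenience a BRAM version would fold into the addressing:

```python
import math

DEPTH = 256  # quarter-wave table depth (assumption); full cycle = 4*DEPTH
# One extra entry for sin(pi/2) = 1.0 keeps the sketch simple.
QUARTER = [math.sin(math.pi * i / (2 * DEPTH)) for i in range(DEPTH + 1)]

def sincos(addr):
    """Return (cos, sin) of 2*pi*addr/(4*DEPTH) using only the quarter
    table, per-quadrant address reversal, and sign inversion."""
    addr %= 4 * DEPTH
    quad, offs = divmod(addr, DEPTH)
    if quad == 0:
        return QUARTER[DEPTH - offs], QUARTER[offs]
    if quad == 1:
        return -QUARTER[offs], QUARTER[DEPTH - offs]
    if quad == 2:
        return -QUARTER[DEPTH - offs], -QUARTER[offs]
    return QUARTER[offs], -QUARTER[DEPTH - offs]
```

The half-sinusoid variant (option 3) is the same idea with only the sign flip and no address reversal, which is the "some logic" versus "logic (and multipliers)" distinction in the trade-off above.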
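Finally, the coefficient-expansion idea: Dan's test used a simple moving-average filter (max error around 2.5%); Andrew's interpolator spends a multiplier to do better. A sketch of a one-multiplier linear interpolator using Dan's 256-to-1024 numbers (the stored waveform here, one full sine cycle, is an assumption):

```python
import math

STORED = 256  # coefficients actually kept in BRAM (from Dan's test)
EXPAND = 4    # expansion factor, 256 -> 1024 (from Dan's test)
TABLE = [math.sin(2.0 * math.pi * i / STORED) for i in range(STORED)]

def coeff(addr):
    """Linearly interpolate coefficient addr out of STORED*EXPAND points:
    one multiply per output sample, as in a single-multiplier interpolator."""
    base, frac = divmod(addr % (STORED * EXPAND), EXPAND)
    a = TABLE[base]
    b = TABLE[(base + 1) % STORED]
    return a + (b - a) * (frac / EXPAND)
```

For a sinusoid sampled this densely, linear interpolation's worst-case error is on the order of (step)^2/8, i.e. well below the moving-average filter's ~2.5%.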

