It would work well for the PFB, but what we *really* need is a solid "Direct Digital Synth (DDS) coefficient generator". FFT coefficients are really just sampled points around the unit circle, so you could, in principle, use a recursive complex multiplier to generate the coefficients on the fly. You'll lose log2(sqrt(K)) bits for a recursion count of K, but that's probably OK most of the time. Say you're doing a 2^14-point FFT: you need 2^13 coeffs. You start with 18 bits of resolution and can do 1024 iterations before you degrade down to an estimated 13 bits of resolution. So you'll only need to store 8 "reset points". Four of those will be 1, j, -1 and -j in this case. You could thus replace 8 BRAM36s with three DSPs.
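To make the recursion concrete, here's a quick numerical sketch of the scheme (my own illustration, not Ryan's implementation: plain round-to-nearest stands in for the DSP output rounding, and the step rotation is kept at full precision, as a wide cmult constant would be):

```python
import numpy as np

N = 2**14          # FFT length
K = N // 2         # distinct twiddle coefficients needed
STRIDE = 1024      # recursion count between resets
BITS = 18          # starting coefficient resolution

def quantize(z, bits=BITS):
    """Round real/imag to `bits`-bit fixed point (1 sign bit + fraction)."""
    scale = 2.0 ** (bits - 1)
    return (np.round(z.real * scale) + 1j * np.round(z.imag * scale)) / scale

# One-bin rotation, kept at full precision here.
step = np.exp(-2j * np.pi / N)

# Eight stored "reset points", one per STRIDE coefficients.
resets = [quantize(np.exp(-2j * np.pi * k / N)) for k in range(0, K, STRIDE)]

coeffs = np.empty(K, dtype=complex)
for r, w0 in enumerate(resets):
    w = w0
    for i in range(STRIDE):
        coeffs[r * STRIDE + i] = w
        w = quantize(w * step)   # recursive complex multiply, requantized

exact = np.exp(-2j * np.pi * np.arange(K) / N)
worst = np.abs(coeffs - exact).max()
print(f"worst-case twiddle error: {worst:.2e} (~{-np.log2(worst):.1f} bits)")
```

The rounding error random-walks between resets, which is where the log2(sqrt(K))-bit loss estimate comes from; the worst-case error stays near the ~13-bit level for these parameters.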
If you had a much larger FFT, say 2^16, you would have to use a wider recursive multiplier. You can achieve a wide cmult in no more than 10 DSPs... I think. In that case, you would start with 25 bits and be able to droop to 16 bits -- so up to 2^(2*9) = 2^18 recursions. You would only need to have one "reset point" and your noise performance would be more than sufficient. 1, j, -1 and -j are easy to store though, so I would probably go with those.

In addition, for the FFT direct, the first stage has only one shared coefficient pattern, the second stage has 2, the third 4, etc. You can, of course, share coefficients within a stage where possible. The real winnings occur when you realize that the other coefficient banks within later stages are actually the same coeffs as the first stage, with a constant phase rotation (again, I'm 90% sure, but I'll check tomorrow morning). So, you could generate your coefficients once, and then use a couple of complex multipliers to make the coeffs for the other stages. BAM! FFT Direct's coefficient memory utilization is *gone*.

You could also do this for the FFT Biplex, but it would be a bit more complicated. Whoever designed the biplex FFT used in-order inputs. This is OK, but it means that the coefficients are in bit-reversed order. So, you would have to move the biplex unscrambler to the beginning, change the mux logic, and replace the delay elements in the delay-commutator with some flavor of "delay, bit-reversed". I don't know how that would look quite yet. If you did that, your coefficients would become in-order, and you could achieve the same savings I described for the FFT Direct.

Also, I implemented coefficient and control logic sharing in my biplex and direct FFTs and it works *really well* at managing the fabric and memory utilization. Worth a shot. :-)

--Ryan Monroe

PS: Sorry, I'm a bit busy so I can't implement a coefficient interpolator for you guys right now. I'll write back when I'm freer.

PS2:
I'm a bit anal about noise performance so I usually use a couple more bits than Dan prescribes, but as he demonstrated in the ASIC talks, his comments about bit widths are 100% correct. I would recommend them as a general design practice as well.

On Mon, Jan 21, 2013 at 3:48 PM, Dan Werthimer <[email protected]> wrote:

> agreed. anybody already have, or want to develop, a coefficient
> interpolator?
>
> dan
>
> On Mon, Jan 21, 2013 at 3:44 PM, Aaron Parsons <[email protected]> wrote:
>
>> Agreed.
>>
>> The coefficient interpolator, however, could get substantial savings
>> beyond that, even, and could be applicable to many things besides PFBs.
>>
>> On Mon, Jan 21, 2013 at 3:36 PM, Dan Werthimer <[email protected]> wrote:
>>
>>> hi aaron,
>>>
>>> if you use xilinx brams for coefficients, they can be configured as
>>> dual port memories, so you can get the PFB reverse and forward
>>> coefficients both at the same time, from the same memory, almost for
>>> free, without any memory size penalty over single port.
>>>
>>> dan
>>>
>>> On Mon, Jan 21, 2013 at 3:18 PM, Aaron Parsons <[email protected]> wrote:
>>>
>>>> You guys probably appreciate this already, but although the
>>>> coefficients in the PFB FIR are generally symmetric around the center
>>>> tap, the upper and lower taps use these coefficients in reverse order
>>>> from one another. In order to take advantage of the symmetry, you'll
>>>> have to use dual-port ROMs that support two different addresses (one
>>>> counting up and one counting down). In the original core I wrote, I
>>>> instead just shared coefficients between the real and imaginary
>>>> components. This was an easy factor of 2 savings. After that first
>>>> factor of two, we found it was kind of diminishing returns...
>>>>
>>>> Another thought could be a small BRAM with a linear interpolator
>>>> between addresses.
>>>> This would be a block with a wide range of uses, and could easily
>>>> cut the size of the PFB coefficients by an order of magnitude. The
>>>> (Hamming/Hanning) window and the sinc that the PFB uses for its
>>>> coefficients are smooth functions, making all the fine subdivisions
>>>> for N>32 samples rather unnecessary.
>>>>
>>>> On Mon, Jan 21, 2013 at 2:56 PM, Dan Werthimer <[email protected]> wrote:
>>>>
>>>>> hi danny and ryan,
>>>>>
>>>>> i suspect if you are only doing small FFTs and PFB FIRs, 1K points
>>>>> or so, then BRAM isn't likely to be the limiting resource, so you
>>>>> might as well store all the coefficients with high precision.
>>>>>
>>>>> but for long transforms, perhaps >4K points or so, BRAMs might be
>>>>> in short supply, and then one could consider storing fewer
>>>>> coefficients (and also taking advantage of sin/cos and mirror
>>>>> symmetries, which don't degrade SNR at all).
>>>>>
>>>>> for any length FFT or PFB/FIR, even millions of points, if you
>>>>> store 1K coefficients with at least 10-bit precision, then the SNR
>>>>> will only be degraded slightly. quantization error analysis is
>>>>> nicely written up in memo #1, at
>>>>> https://casper.berkeley.edu/wiki/Memos
>>>>>
>>>>> best wishes,
>>>>>
>>>>> dan
>>>>>
>>>>> On Mon, Jan 21, 2013 at 4:33 AM, Danny Price <[email protected]> wrote:
>>>>>
>>>>>> Hey Jason,
>>>>>>
>>>>>> Rewinding the thread a bit:
>>>>>>
>>>>>> On Fri, Jan 4, 2013 at 7:39 AM, Jason Manley <[email protected]> wrote:
>>>>>>
>>>>>>> Andrew and I have also spoken about symmetrical coefficients in
>>>>>>> the pfb_fir and I'd very much like to see this done. We recently
>>>>>>> added the option to share coefficient generators across multiple
>>>>>>> inputs, which has helped a lot for designs with multiple ADCs.
>>>>>>> It seems to me that bigger designs are going to be BRAM limited
>>>>>>> (FFT BRAM requirements scale linearly), so we need to optimise
>>>>>>> cores to go light on this resource.
>>>>>>
>>>>>> Agreed that BRAM is in general more precious than compute. In
>>>>>> addition to using symmetrical coefficients, it might be worth
>>>>>> looking at generating coefficients. I did some tests this morning
>>>>>> with a simple moving average filter to turn 256 BRAM coefficients
>>>>>> into 1024 (see attached model file), and it looks pretty promising:
>>>>>> errors are a max of about 2.5%.
>>>>>>
>>>>>> Coupling this with symmetric coefficients could cut coefficient
>>>>>> storage to 1/8th, at the cost of a few extra adders for the
>>>>>> interpolation filter. Thoughts?
>>>>>>
>>>>>> Cheers
>>>>>> Danny
>>>>
>>>> --
>>>> Aaron Parsons
>>>> 510-306-4322
>>>> Hearst Field Annex B54, UCB
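For reference, Danny's 256-to-1024 experiment is easy to reproduce in spirit with a linear interpolator (a sketch with my own stand-in coefficients: a Hamming-windowed sinc, as Aaron describes, with an assumed 8-tap PFB shape; np.interp stands in for the two-multiply interpolator in fabric):

```python
import numpy as np

N_FULL, FACTOR = 1024, 4     # 256 stored coefficients -> 1024 used
TAPS = 8                     # assumed PFB tap count (sets the sinc width)

n = np.arange(N_FULL)
x = n / N_FULL * TAPS - TAPS / 2          # sinc argument spanning [-4, 4)
full = np.hamming(N_FULL) * np.sinc(x)    # stand-in PFB FIR coefficients

addrs = np.arange(0, N_FULL, FACTOR)      # small-BRAM addresses
stored = full[addrs]                      # the 256 values actually stored

# Linear interpolation between stored addresses.
interp = np.interp(n, addrs, stored)

err = np.abs(interp - full).max() / np.abs(full).max()
print(f"max interpolation error: {err:.3%} of peak")
```

Because the window and sinc are smooth, a true linear interpolator does noticeably better than the moving-average filter's ~2.5% figure quoted above.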

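Ryan's "constant phase rotation" observation is easy to sanity-check numerically, at least in the abstract: the twiddles W^0, W^1, W^2, ... form a geometric sequence, so any contiguous bank of them is the first bank times a single constant. A sketch (my own bank layout, not necessarily the CASPER cores' exact one):

```python
import numpy as np

M = 128        # distinct twiddles in one FFT stage
BANKS = 4      # coefficient banks the stage is split across
L = M // BANKS

# Twiddles W_{2M}^j for j = 0..M-1: a geometric sequence on the unit circle.
tw = np.exp(-2j * np.pi * np.arange(M) / (2 * M))

banks = tw.reshape(BANKS, L)
for b in range(BANKS):
    # Bank b equals bank 0 times the single constant W_{2M}^(b*L).
    rot = np.exp(-2j * np.pi * (b * L) / (2 * M))
    assert np.allclose(banks[b], banks[0] * rot)
print("every bank is a constant phase rotation of bank 0")
```

This is what lets one generated coefficient stream plus a few complex multipliers replace the per-bank coefficient BRAMs.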

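And Aaron's dual-port symmetric-ROM trick in miniature (a Hamming window as the symmetric stand-in; port A counts up through the half-size ROM while port B counts down, which is exactly the reverse-order access the upper taps need):

```python
import numpy as np

M = 16
coefs = np.hamming(2 * M)   # symmetric window: coefs[i] == coefs[2M-1-i]
rom = coefs[:M]             # store only the first half

for addr in range(M):
    lower_tap = rom[addr]          # port A: address counting up
    upper_tap = rom[M - 1 - addr]  # port B: address counting down
    # Port B delivers the mirrored second half: coefs[M + addr].
    assert np.isclose(lower_tap, coefs[addr])
    assert np.isclose(upper_tap, coefs[M + addr])
print("half-size dual-port ROM reproduces all 2*M coefficients")
```

Since Xilinx BRAMs are natively dual-port, as Dan notes above, the second address comes essentially for free.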