PS3.  You could also have generated the 2^16 FFT's coefficients with a
narrow cmult... you'd need a BRAM to store the 2^13 "reset points", but
that's still a factor-of-4 reduction in memory use -- not trivial by any
means.
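
A sanity check on those PS3 numbers (my arithmetic, just restating the
rule from the message below): a 2^16 FFT needs 2^15 twiddles, and an
18-bit narrow cmult allowed to droop one bit buys 4 recursions per
stored value -- which is where the 2^13 reset points and the factor of
4 come from.

    # back-of-the-envelope check (assumed budget: 18-bit operands,
    # one bit of allowed droop)
    N = 2**16                  # FFT length
    n_coeffs = N // 2          # twiddles needed
    bits_lost = 1              # 18 -> 17 bits
    K = 4**bits_lost           # recursions: bits_lost = log2(sqrt(K))
    resets = n_coeffs // K     # reset points stored in BRAM
    print(resets, n_coeffs // resets)   # 8192 (= 2^13) resets, 4x savings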


On Mon, Jan 21, 2013 at 9:39 PM, Ryan Monroe <[email protected]> wrote:

> It would work well for the PFB, but what we *really* need is a solid
> "Direct Digital Synth (DDS) coefficient generator".  FFT coefficients are
> really just sampled points around the unit circle, so you could, in
> principle, use a recursive complex multiplier to generate the coefficients
> on the fly.  You'll lose log2(sqrt(K)) bits for a recursion count of K,
> but that's probably OK most of the time.  Say you're doing a 2^14-point
> FFT: you need 2^13 coeffs.  You start with 18 bits of resolution and can
> do 1024 iterations before you degrade down to an estimated 13 bits of
> resolution, so you'd only need to store 2^13 / 1024 = 8 "reset points".
> Four of those will be 1, j, -1 and -j in this case.  You could thus
> replace 8 BRAM36s with three DSPs.
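>
> Here's a quick numpy sketch of that recursion (a behavioral model, not
> the fabric implementation; I'm assuming the rotation constant sits on
> the DSP48's wider port while the running value rounds back to 18 bits
> each step):
>
>     import numpy as np
>
>     def recursive_twiddles(n_coeffs, reset_every, z_bits=17, w_bits=24):
>         """W^k around the unit circle by repeated cmult, rounding the
>         running value to z_bits fractional bits each step and reloading
>         an exact stored "reset point" every reset_every outputs."""
>         N = 2 * n_coeffs
>         q = lambda z, b: (np.round(z.real * 2**b) +
>                           1j * np.round(z.imag * 2**b)) / 2**b
>         w = q(np.exp(-2j * np.pi / N), w_bits)   # rotation constant
>         out = np.empty(n_coeffs, dtype=complex)
>         for k in range(n_coeffs):
>             if k % reset_every == 0:
>                 z = q(np.exp(-2j * np.pi * k / N), z_bits)  # reset point
>             out[k] = z
>             z = q(z * w, z_bits)                  # one recursion step
>         return out
>
>     # 2^14-point FFT: 2^13 coeffs, reset every 1024 -> 8 reset points
>     n = 2**13
>     exact = np.exp(-2j * np.pi * np.arange(n) / (2 * n))
>     err = np.abs(recursive_twiddles(n, 1024) - exact).max()
>     print(np.log2(err))    # should land near -13, per the estimate above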
>
> If you had a much larger FFT, say 2^16... you would have to use a wider
> recursive multiplier.  You can achieve a wide cmult in no more than 10
> DSPs...I think.  In that case, you would start with 25 bits and be able
> to droop to 16 bits -- so up to 2^(2*9) = <lots> of recursions.  You
> would only need to have one "reset point", and your noise performance
> would be more than sufficient.  1, j, -1 and -j are easy to store
> though, so I would probably go with those.
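>
> (That recursion budget is the same log2(sqrt(K)) rule run backwards:
>
>     start_bits, end_bits = 25, 16
>     K = 4**(start_bits - end_bits)  # lost bits = log2(sqrt(K)), K = 4^9
>     print(K)  # 262144 -- far more than the 2^15 coeffs a 2^16 FFT needs
>
> which is why a single reset point covers the whole table.)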
>
> In addition, for the FFT direct, the first stage has only one shared
> coefficient pattern, the second stage has 2, the third 4, etc.  You can,
> of course, share coefficients within a stage where possible.  The real
> winnings come when you realize that the other coefficient banks within
> later stages are actually the same coeffs as the first stage, with a
> constant phase rotation (again, I'm 90% sure, but I'll check tomorrow
> morning).  So, you could generate your coefficients once, and then use a
> couple of complex multipliers to make the coeffs for the other stages.
> BAM!  FFT Direct's coefficient memory utilization is *gone*.
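>
> The identity underneath that claim is just W_N^(o+k) = W_N^o * W_N^k
> (the part I want to verify is the stage-to-bank mapping, not the math):
> any bank that's a constant-offset run of the base bank costs one cmult
> instead of a ROM.
>
>     import numpy as np
>     N = 2**14
>     k = np.arange(N // 2)
>     W = np.exp(-2j * np.pi * k / N)          # base bank
>     for offset in (1, 7, 1234):              # arbitrary example offsets
>         rotated = np.exp(-2j * np.pi * offset / N) * W    # one cmult
>         stored = np.exp(-2j * np.pi * (k + offset) / N)   # ROM contents
>         assert np.allclose(rotated, stored)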
>
> You could also do this for the biplex FFT, but it would be a bit more
> complicated.  Whoever designed the biplex FFT used in-order inputs.  This
> is OK, but it means that the coefficients are in bit-reversed order.  So,
> you would have to move the biplex unscrambler to the beginning, change the
> mux logic, and replace the delay elements in the delay-commutator with some
> flavor of "delay, bit-reversed".  I don't quite know how that would look
> yet.  If you did that, your coefficients would become in-order, and you
> could achieve the same savings I described for the FFT direct.  Also, I
> implement coefficient and control-logic sharing in my biplex and direct
> FFTs, and it works *really well* at managing fabric and memory
> utilization.  Worth a shot.
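>
> For reference, the bit-reversed indexing in question (a sketch of the
> addressing only, not the biplex core's delay logic):
>
>     import numpy as np
>
>     def bit_reverse(i, nbits):
>         """Reverse the low nbits bits of index i."""
>         out = 0
>         for _ in range(nbits):
>             out = (out << 1) | (i & 1)
>             i >>= 1
>         return out
>
>     # coefficients fetched in bit-reversed order become in-order
>     # twiddles once re-indexed -- which is what moving the unscrambler
>     # up front would buy you
>     nbits = 4
>     k = np.arange(2**nbits)
>     W = np.exp(-2j * np.pi * k / 2**(nbits + 1))
>     rev = [bit_reverse(i, nbits) for i in k]
>     assert np.allclose(W[rev][rev], W)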
>
> :-)
>
> --Ryan Monroe
>
> PS.  Sorry, I'm a bit busy at the moment, so I can't implement a
> coefficient interpolator for you guys right now.  I'll write back when
> I'm more free.
>
> PS2.  I'm a bit anal about noise performance, so I usually use a couple
> more bits than Dan prescribes, but as he demonstrated in the ASIC talks,
> his comments about bit widths are 100% correct.  I would recommend them
> as a general design practice as well.
>
>
>
>
> On Mon, Jan 21, 2013 at 3:48 PM, Dan Werthimer <[email protected]> wrote:
>
>>
>> agreed.   anybody already have, or want to develop, a coefficient
>> interpolator?
>>
>> dan
>>
>> On Mon, Jan 21, 2013 at 3:44 PM, Aaron Parsons <
>> [email protected]> wrote:
>>
>>> Agreed.
>>>
>>> The coefficient interpolator, however, could get substantial savings
>>> even beyond that, and could be applicable to many things besides PFBs.
>>>
>>> On Mon, Jan 21, 2013 at 3:36 PM, Dan Werthimer <[email protected]> wrote:
>>>
>>>>
>>>> hi aaron,
>>>>
>>>> if you use xilinx brams for coefficients, they can be configured as
>>>> dual-port memories, so you can get the PFB forward and reverse
>>>> coefficients at the same time, from the same memory, almost for free,
>>>> without any memory size penalty over single port.
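>>>>
>>>> a behavioral sketch of that, for the record (one coefficient array,
>>>> two mirrored read addresses; window parameters are example values):
>>>>
>>>>     import numpy as np
>>>>
>>>>     taps, N = 4, 2**10
>>>>     M = taps * N
>>>>     n = np.arange(M)
>>>>     h = np.sinc((n - (M - 1) / 2) / N) * np.hamming(M)  # pfb window
>>>>     half = h[:M // 2]                  # store only half in bram
>>>>
>>>>     addr = np.arange(M // 2)           # one counter drives both ports
>>>>     port_a = half[addr]                # port a counts up
>>>>     port_b = half[M // 2 - 1 - addr]   # port b counts down
>>>>
>>>>     # together the two ports reproduce the full symmetric window
>>>>     assert np.allclose(np.concatenate([port_a, port_b]), h)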
>>>>
>>>> dan
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Jan 21, 2013 at 3:18 PM, Aaron Parsons <
>>>> [email protected]> wrote:
>>>>
>>>>> You guys probably appreciate this already, but although the
>>>>> coefficients in the PFB FIR are generally symmetric around the center tap,
>>>>> the upper and lower taps use these coefficients in reverse order from one
>>>>> another.  In order to take advantage of the symmetry, you'll have to use
>>>>> dual-port ROMs that support two different addresses (one counting up and
>>>>> one counting down).  In the original core I wrote, I instead just shared
>>>>> coefficients between the real and imaginary components.  This was an easy
>>>>> factor-of-2 savings.  After that first factor of two, though, the
>>>>> returns diminished quickly...
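>>>>>
>>>>> A sketch of that tap mirroring (the window and sizes here are
>>>>> made-up example values):
>>>>>
>>>>>     import numpy as np
>>>>>
>>>>>     taps, N = 4, 2**10
>>>>>     M = taps * N
>>>>>     n = np.arange(M)
>>>>>     h = np.sinc((n - (M - 1) / 2) / N) * np.hamming(M)
>>>>>     c = h.reshape(taps, N)        # one row of coefficients per tap
>>>>>
>>>>>     # tap t reads what tap (taps-1-t) reads, with the address
>>>>>     # reversed -- hence one up-counting and one down-counting port
>>>>>     for t in range(taps):
>>>>>         assert np.allclose(c[t], c[taps - 1 - t][::-1])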
>>>>>
>>>>> Another thought could be a small BRAM with a linear interpolator
>>>>> between addresses.  This would be a block with a wide range of uses,
>>>>> and could easily cut the size of the PFB coefficients by an order of
>>>>> magnitude.  The (Hamming/Hanning) window and the sinc that the PFB
>>>>> uses for its coefficients are smooth functions, making all the fine
>>>>> subdivisions for N>32 samples rather unnecessary.
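>>>>>
>>>>> Behaviorally, something like this (the 32x decimation and the window
>>>>> are illustrative choices):
>>>>>
>>>>>     import numpy as np
>>>>>
>>>>>     taps, N = 4, 2**13
>>>>>     M = taps * N
>>>>>     n = np.arange(M)
>>>>>     h = np.sinc((n - (M - 1) / 2) / N) * np.hamming(M)
>>>>>
>>>>>     stored = h[::32]           # keep every 32nd coefficient in BRAM
>>>>>     x = np.arange(0, M, 32)    # addresses of the stored samples
>>>>>     approx = np.interp(n, x, stored)  # linear interp between them
>>>>>
>>>>>     # worst-case error relative to the peak coefficient: tiny for a
>>>>>     # smooth window, despite the 32x storage reduction
>>>>>     print(np.abs(approx - h).max() / np.abs(h).max())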
>>>>>
>>>>> On Mon, Jan 21, 2013 at 2:56 PM, Dan Werthimer
>>>>> <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> hi danny and ryan,
>>>>>>
>>>>>> i suspect if you are only doing small FFT's and PFB FIR's,
>>>>>> 1K points or so,  then BRAM isn't likely to be the limiting resource,
>>>>>> so you might as well store all the coefficients with high precision.
>>>>>>
>>>>>> but for long transforms, perhaps >4K points or so,
>>>>>> then BRAM's might be in short supply, and then one could
>>>>>> consider storing fewer coefficients (and also taking advantage
>>>>>> of sin/cos and mirror symmetries, which don't degrade SNR at all).
>>>>>>
>>>>>> for any length FFT or PFB/FIR, even millions of points,
>>>>>> if you store 1K coefficients with at least 10-bit precision,
>>>>>> then the SNR will only be degraded slightly.
>>>>>> quantization error analysis is nicely written up in memo #1, at
>>>>>> https://casper.berkeley.edu/wiki/Memos
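>>>>>>
>>>>>> a rough numerical check of that claim (a sketch; this uses a simple
>>>>>> coarse lookup into the stored table, no interpolation):
>>>>>>
>>>>>>     import numpy as np
>>>>>>
>>>>>>     Nfft = 2**20                     # a million-point fft
>>>>>>     k = np.arange(Nfft // 2)
>>>>>>     exact = np.exp(-2j * np.pi * k / Nfft)
>>>>>>
>>>>>>     # 1K stored coefficients, quantized to 10 bits
>>>>>>     j = np.arange(1024)
>>>>>>     stored = np.exp(-2j * np.pi * j / 2048)
>>>>>>     q = lambda x: np.round(x * 2**9) / 2**9
>>>>>>     stored = q(stored.real) + 1j * q(stored.imag)
>>>>>>
>>>>>>     err = stored[k // 512] - exact   # coarse-lookup + quant error
>>>>>>     print(10 * np.log10(1.0 / np.mean(np.abs(err)**2)))  # ~54 dB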
>>>>>>
>>>>>> best wishes,
>>>>>>
>>>>>> dan
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Jan 21, 2013 at 4:33 AM, Danny Price 
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Hey Jason,
>>>>>>>
>>>>>>> Rewinding the thread a bit:
>>>>>>>
>>>>>>> On Fri, Jan 4, 2013 at 7:39 AM, Jason Manley <[email protected]> wrote:
>>>>>>>
>>>>>>>> Andrew and I have also spoken about symmetrical coefficients in
>>>>>>>> the pfb_fir and I'd very much like to see this done. We recently
>>>>>>>> added the option to share coefficient generators across multiple
>>>>>>>> inputs, which has helped a lot for designs with multiple ADCs. It
>>>>>>>> seems to me that bigger designs are going to be BRAM limited (FFT
>>>>>>>> BRAM requirements scale linearly), so we need to optimise cores to
>>>>>>>> go light on this resource.
>>>>>>>>
>>>>>>>
>>>>>>> Agreed that BRAM is in general more precious than compute. In
>>>>>>> addition to using symmetrical coefficients, it might be worth
>>>>>>> looking at generating coefficients. I did some tests this morning
>>>>>>> with a simple moving-average filter to turn 256 BRAM coefficients
>>>>>>> into 1024 (see attached model file), and it looks pretty promising:
>>>>>>> errors are a max of about 2.5%.
>>>>>>>
>>>>>>> Coupling this with symmetric coefficients could cut coefficient
>>>>>>> storage to 1/8th, at the cost of a few extra adders for the
>>>>>>> interpolation filter. Thoughts?
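>>>>>>>
>>>>>>> For anyone without Simulink handy, here's a rough numpy equivalent
>>>>>>> (the window and tap count here are just example values; the
>>>>>>> attached model is the real thing):
>>>>>>>
>>>>>>>     import numpy as np
>>>>>>>
>>>>>>>     taps, N = 4, 256             # 256 stored coeffs -> 1024 out
>>>>>>>     M = taps * N
>>>>>>>     n = np.arange(M)
>>>>>>>     full = np.sinc((n - (M - 1) / 2) / N) * np.hamming(M)
>>>>>>>
>>>>>>>     stored = full[::4]           # 256 values in BRAM
>>>>>>>     held = np.repeat(stored, 4)  # hold each value for 4 samples
>>>>>>>     smooth = np.convolve(held, np.ones(4) / 4, mode='same')
>>>>>>>
>>>>>>>     # max error vs. the true 1024-point window; the exact figure
>>>>>>>     # depends on the filter's alignment and boundary handling
>>>>>>>     print(np.abs(smooth - full).max() / np.abs(full).max())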
>>>>>>>
>>>>>>> Cheers
>>>>>>> Danny
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Aaron Parsons
>>>>> 510-306-4322
>>>>> Hearst Field Annex B54, UCB
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Aaron Parsons
>>> 510-306-4322
>>> Hearst Field Annex B54, UCB
>>>
>>
>>
>
