Dear all who responded,
First, I apologize for inadvertently cc'ing the entire list with a
message to my internal team. A consequence of using autocomplete in
the cc field to make sure I got the list address right. Thankfully I
think I only said nice things ;)
Second, I really appreciate all the responses, which are very
enlightening. I have little time to read carefully, and less time to
respond, as I leave with my family for Cape Town this morning, and
don't expect to surface for a good few days. I do look forward to
connecting with the SA SKA/KAT group, probably in January, to discuss
this and other things in person. Perhaps the discussion will continue
nonetheless.
A few quick comments, based only on a scan of the responses.
--The SMA is an 8-antenna array, with two active receivers per
antenna. In particular, these might be dual pol, thus 16 "ant-pols".
--we are certainly open to distributing the processing in the manner
suggested by Dan, Mel, and possibly others. Even in such a scheme,
though, an understanding of PFB fit and limits and, relatedly, of
increasing clock rates to improve performance is warranted. We are
also open to not packetizing (on-board corner turn).
--I made mention of 500 MHz FFT cores. Those were advertised by
industry DSP specialists we have had discussions with, and were not
designed with CASPER methods. Multiple clock domains are required, and
perhaps we could "black box" one of these cores. I don't think anyone
has commented on multiple clock domains in CASPER yet. Billy, anyone?
(I may have missed it on scanning.)
--We need to understand memory utilization, including BRAM, QDR, and
DDR, in both amount and bandwidth. I will read your comments carefully.
--Andrew, our finding is that *both* multipliers and adders scale as
D*log2(D) (there are other terms, but this one dominates). If N is the
size of the PFB, they scale only as log(N) (I may misremember whether
this is the dominant term). I don't understand the implication of the
condition "(for large FFT sizes, i.e. not doing straight butterfly)". I
would very much like to discuss all of this with you, and others who
might be interested, in CT if possible.
--Dan, your statement that D=64 or 128 would be possible is very
encouraging, but appears to contradict what Suraj said. I would very
much like to resolve this.
Thanks to all who contributed. I am in a huge rush, so please excuse
misstatements, typos, or questions on matters already addressed. Merry
Christmas to those who celebrate it. I look forward to picking up
this thread again.
Jonathan
On Dec 24, 2010, at 4:34 AM, Andrew Martens wrote:
Hi Jonathan
To start we are looking closely at the FPGA resource utilization of
large PFBs. Something that probably is common knowledge amongst
those experienced in FX correlator design is that the demux factor
drives the utilization much faster than the size of the PFB. In
that sense bandwidth is far more expensive than spectral
resolution. We've put some effort into accurately quantifying the
utilization, at least as far as multipliers and adders are
concerned, and are expanding this analysis to block ram and other
resources. The demux factor is also typically a power of 2, so it is
very much quantized.
Some thoughts on resource usage with the CASPER pfb_fir (for large
FFT sizes, i.e. not doing straight butterfly);
complex multiplier usage;
- scales linearly with the demux factor (often bandwidth)
- scales linearly with number of FIR taps
- is not affected by the FFT size
adder usage (the final adder tree);
- scales as n*log(n) with the demux factor. Will dominate adder usage
for large demux factors
- scales as n*log(n) with the number of FIR taps
- is not affected by the FFT size
BRAM usage;
- currently scales linearly with demux factor, but should not be
affected by it (barring constraints set by underlying hardware). BRAMs
are currently not used efficiently: a separate set of coefficient and
data storage BRAMs is not needed for each data input. The storage
requirements should depend only on FFT size and number of FIR taps.
- scales linearly with the number of FIR taps. The current design
could be improved so that BRAMs are more efficiently used though.
- scales linearly with FFT size.
Routing constraints;
The design is simple, highly pipelined (almost no feedback) with
very low fanout. Major constraints are BRAM to DSP slice, DSP slice
to DSP slice and rounding, all of which are parameterised.
Optimisations possible;
The efficiency of BRAM use can be improved with some small logic
savings.
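
The pfb_fir scaling rules listed above can be put into a rough Python
sketch. It models relative growth only, not absolute resource counts.
The function names, the baseline configuration (demux 4, 4 taps, 1024
channels) and the assumption that the per-parameter scalings simply
multiply are all illustrative and are not taken from the CASPER library.

```python
import math

def linear(x, x0):
    """Relative growth of a resource that scales linearly in x."""
    return x / x0

def nlogn(x, x0):
    """Relative growth of a resource that scales as x*log2(x)."""
    return (x * math.log2(x)) / (x0 * math.log2(x0))

def pfb_fir_relative_cost(demux, taps, fft_size, base=(4, 4, 1024)):
    """Resource use relative to a hypothetical baseline (demux, taps, fft_size).

    Assumption: the individual scalings combine multiplicatively.
    """
    d0, t0, n0 = base
    return {
        # linear in demux and taps, independent of FFT size
        "multipliers": linear(demux, d0) * linear(taps, t0),
        # n*log(n) in demux and in taps, independent of FFT size
        "adders": nlogn(demux, d0) * nlogn(taps, t0),
        # current design: also grows with demux factor
        "bram_current": linear(demux, d0) * linear(taps, t0) * linear(fft_size, n0),
        # improved design: depends only on taps and FFT size
        "bram_ideal": linear(taps, t0) * linear(fft_size, n0),
    }

# Doubling the demux factor from the baseline, everything else fixed:
print(pfb_fir_relative_cost(8, 4, 1024))
# {'multipliers': 2.0, 'adders': 3.0, 'bram_current': 2.0, 'bram_ideal': 1.0}
```

Under these assumptions, doubling the demux factor doubles the
multipliers and triples the adders, while doubling the FFT size leaves
both unchanged and only grows the BRAM terms.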
Resource usage in the CASPER FFT (when using the biplex FFT, e.g.
fft_wideband_real and fft for 'large' FFTs);
complex multiplier usage;
- dominated by the (n/2)*log2(n) multipliers (n = demux factor) needed
in fft_direct for large FFTs.
- scales linearly with increase in FFT size.
BRAM usage;
- scales linearly with bandwidth for large FFTs if FFT size kept
constant.
- for constant (large) FFT size, unaffected by demux factor.
Biplex cores shrink in length by one stage while doubling in number
for each doubling in demux factor.
- scales roughly like n^2 with increase in FFT size.
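
As a small worked example of the figures above, the following Python
sketch evaluates the (n/2)*log2(n) multiplier term for fft_direct and
the trade between biplex and direct stages. The helper names are
hypothetical. The stage split assumes the biplex part covers
log2(FFT size) - log2(demux) stages, which is one reading of "shrink in
length by one stage for each doubling in demux factor" and has not been
checked against the library.

```python
def log2_exact(x):
    """Return log2(x) for a positive power of two, else raise ValueError."""
    if x < 1 or x & (x - 1):
        raise ValueError(f"{x} is not a power of two")
    return x.bit_length() - 1

def fft_direct_multipliers(demux):
    """Dominant complex multiplier count, (n/2)*log2(n), n = demux factor."""
    return (demux // 2) * log2_exact(demux)

def stage_split(fft_size, demux):
    """Assumed split of FFT stages into (biplex, direct) parts."""
    direct = log2_exact(demux)
    biplex = log2_exact(fft_size) - direct
    if biplex < 0:
        raise ValueError("demux factor exceeds FFT size")
    return biplex, direct

for n in (16, 64, 128):
    print(n, fft_direct_multipliers(n), stage_split(2**15, n))
# 16 32 (11, 4)
# 64 192 (9, 6)
# 128 448 (8, 7)
```

The FFT size of 2^15 is an arbitrary illustrative choice. The point is
only that each doubling of the demux factor moves one stage from the
biplex part to the direct part, where the multiplier term grows faster
than linearly.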
Routing constraints;
The FFT is highly pipelined with low fanout except in the
unscrambler (although some work has been done here and the
unscrambler is now optional). Major constraints are BRAM to DSP
slice, DSP slice to DSP slice and rounding, all of which are
parameterised.
Optimisations possible;
Various optimisations are still possible;
- Coefficients could be shared between twiddles, reducing the
number of BRAMs required by the demux factor. This would be
significant for large demux factor designs at the expense of some
fanout.
- BRAMs used for delaying data could be shared between input
streams, saving some BRAMs at the expense of extra routing
constraints.
- As Dan has suggested, grow the bits in the FFT at each stage as
needed to reduce logic (and BRAM) use and probably help timing. Care
should be taken however, as data quality is directly related to the
width of the data path through the FFT.
As noted by Jason, please also remember that other constraints such
as QDR SRAM and XAUI bandwidth need to be considered when building
such a large system.
Dan's suggestion of FFX is worth considering. It is upgradeable,
allowing the addition of newer, more capable boards as they come
online until you end up with a simple FX correlator again.
I would love to see a correlator like that in action.
Regards
Andrew