hi jason, jonathan,
regarding jason's concerns below about corner turns
and 10Gbit links:
in the FFX model that i propose, where the
first FPGA breaks the 9 GHz band up into 8 pieces of
1.25 GHz each, there is no corner turner needed, as
the 8 frequency bands emerge from the PFB in eight
parallel paths and each path goes separately to its own
XAUI or 10Gbit ethernet port.
the FFX design doesn't require any block ram or QDR:
all the coefficients in the 8 channel PFB/FFT are constants,
and there are no BRAM delays in the FFT, only registers,
as the FFT is implemented with fully parallel inputs and outputs.
(the FFT is implemented like a text book diagram of an
8 input FFT with all the butterflies done in parallel).
or instead of an 8 channel PFB, the channelization can
be implemented as 8 DDC's, again with no BRAM's.
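A minimal sketch of what "all the butterflies done in parallel" means, written as plain Python rather than FPGA logic. It assumes a textbook 8-input complex radix-2 decimation-in-time FFT; the function names are illustrative and the real design (a real-input PFB front end) may differ in detail. Each stage is four independent butterflies with constant twiddle factors, so nothing needs to be stored between stages except pipeline registers.

```python
import cmath

# Fixed twiddle factors W8^k = exp(-2j*pi*k/8); in hardware these are constants.
W8 = [cmath.exp(-2j * cmath.pi * k / 8) for k in range(4)]

def butterfly(a, b, w=1):
    """One radix-2 butterfly: returns (a + w*b, a - w*b)."""
    t = w * b
    return a + t, a - t

def fft8_parallel(x):
    """8-point DIT FFT unrolled stage by stage (8 inputs in, 8 outputs out)."""
    assert len(x) == 8
    # Bit-reversed input ordering: 0, 4, 2, 6, 1, 5, 3, 7
    s0 = [x[0], x[4], x[2], x[6], x[1], x[5], x[3], x[7]]
    # Stage 1: four 2-point butterflies, twiddle = 1
    s1 = [0] * 8
    for i in (0, 2, 4, 6):
        s1[i], s1[i + 1] = butterfly(s0[i], s0[i + 1])
    # Stage 2: two 4-point groups, twiddles W8^0 and W8^2
    s2 = [0] * 8
    for base in (0, 4):
        for k in (0, 1):
            s2[base + k], s2[base + k + 2] = butterfly(
                s1[base + k], s1[base + k + 2], W8[2 * k])
    # Stage 3: one 8-point group, twiddles W8^0 .. W8^3
    out = [0] * 8
    for k in range(4):
        out[k], out[k + 4] = butterfly(s2[k], s2[k + 4], W8[k])
    return out
```

Every loop above has a fixed, small trip count, so in hardware it unrolls into 12 butterflies wired together as in the usual flow diagram.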
i think the Roach II's eight 10Gbit links can just barely support
9 GHz of bandwidth, with 4 bit real, 4 bit imaginary data:
(1.25 GHz each * 8 bits = 10Gbits/sec on each link).
this will work with XAUI, but for 10Gbe, the extra overhead
from headers, time stamps, etc will reduce the bandwidth slightly.
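The link arithmetic above can be checked with a short sketch. The function names and the `efficiency` parameter are assumptions made for illustration; the 0.9 used in the usage note is a placeholder, not a measured 10GbE figure.

```python
def link_rate_gbps(subband_ghz, bits_real, bits_imag):
    """Payload rate of one complex-sampled sub-band.

    A sub-band of B GHz carries B Gsamples/s of complex data, each sample
    being bits_real + bits_imag bits wide.
    """
    return subband_ghz * (bits_real + bits_imag)

def usable_bandwidth_ghz(n_links, link_gbps, bits_per_sample, efficiency=1.0):
    """Total bandwidth that fits through n_links links.

    efficiency is the fraction of the line rate left for payload after
    headers, time stamps, etc. (hypothetical value, 1.0 = no overhead).
    """
    return n_links * link_gbps * efficiency / bits_per_sample

per_link = link_rate_gbps(1.25, 4, 4)  # 10.0 Gbit/s on each link
total = 8 * per_link                   # 80.0 Gbit/s over eight links
```

For example, `usable_bandwidth_ghz(8, 10.0, 8, efficiency=0.9)` shows how any payload efficiency below 1.0 pulls the usable band below the 10 GHz that eight full links would carry.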
jonathan,
suraj's concern about achieving high clock rates at high demux values
is for large FFT's (you asked about 32K points).
if you are just doing an 8 point PFB or FFT, or implementing 8 DDC's,
for an FFX correlator, the routing is pretty straightforward -
you won't be using the CASPER PFB or FFT blocks,
it's all fully parallel implementation.
best wishes,
dan
On 12/23/2010 11:17 PM, Jason Manley wrote:
To the best of my knowledge, nobody's built a CASPER correlator that processes
such high bandwidths. I took a closer look at bringing 20Gsps into a ROACH2 for
MeerKAT use a few months back. My conclusion was that this would be possible
with current libraries with minimal changes. However, we weren't aiming for 32k
PFBs and we weren't aiming to process the entire 10GHz band (we'd use a DDC and
only process a couple of GHz).
I believe that you could put a PFB on the whole band if you tweak and optimise
the library block as Billy has just done with the FFT. Though the FPGA might
not run at very high speeds and you might not get the spectral resolution that
you want directly due to resource consumption of pipelining, I think it would
be possible to break this up into subbands (FFX approach) on a single board.
I will highlight the following limitations with processing such large
bandwidths on the ROACH2 platform:
1) QDR corner-turn bandwidth. You don't mention how many inputs you're
planning and so you might not need the packetised infrastructure at all
(perhaps you're considering something like Billy's point-to-point 3GHz 3-input
correlator). ROACH2 will have four 36-bit QDR interfaces. These can be ganged
together and demuxed to produce a single 288bit SDR interface so that your
limits would be:
32 parallel_streams * 4bit * 2complex = need 256-bit interface.
These 32 parallel streams are complex, post-FFT (after the imag half of
spectrum has been tossed) so that it would represent ~300MHz*32=9.6GHz of real
band. So you might be OK here.
2) QDR capacity for the corner turn is much less of a concern. With a packet
length of 128 (what everyone's using right now), you can have up to 64 antennas:
128pkt_len * 32768chan * 4bit * 2complex = 32Mbit per dual-pol antenna.
*) With the huge BRAM reserves on the V6, it might even be possible to
bypass the QDR and do the whole corner-turn in BRAM, especially if you opt for
smaller packet sizes (which'd result in smaller buffers and potentially faster
dump rates but with reduced network efficiencies due to smaller payload/header
ratio).
3) Another consideration, and possible deal-breaker, is the interconnect:
ROACH-II will have 8 10GbE links (or maybe later two 40Gbps links) which could
carry a little over 7GHz bandwidth after network overhead. Again, if you're not
aiming for a packetised system, then you can do a little better. If your ADC is
going to use some of the SERDES lines though (as many of the new high speed
samplers do), then you might have to forfeit some of this interconnect. But
basically, I think you're going to run out of bandwidth to get 10GHz out.
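The QDR figures in points 1 and 2 above reduce to two products, sketched here with the numbers quoted in the text. The function names are illustrative only, and the 288-bit limit is the ganged SDR interface width stated in point 1.

```python
def corner_turn_width_bits(parallel_streams, bits_per_component):
    """Interface width needed to write all parallel complex streams per clock."""
    return parallel_streams * bits_per_component * 2  # real + imaginary

def corner_turn_buffer_bits(pkt_len, n_chan, bits_per_component):
    """Buffer size needed to hold pkt_len spectra of n_chan complex channels."""
    return pkt_len * n_chan * bits_per_component * 2  # real + imaginary

QDR_SDR_WIDTH_BITS = 288                       # four 36-bit QDRs, ganged and demuxed

width = corner_turn_width_bits(32, 4)          # 256 bits
fits = width <= QDR_SDR_WIDTH_BITS             # True
band_mhz = 32 * 300                            # 9600 MHz at a 300 MHz clock
buf_bits = corner_turn_buffer_bits(128, 32768, 4)
buf_mbit = buf_bits / 2**20                    # 32.0 Mbit
```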
WRT clock rates, I think that 300MHz should be achievable on ROACH-II with a
little tweaking. ROACH-1 is able to do 250MHz with much less fiddling than the
iBOBs at these speeds. The iBOB with the old libraries used to start choking
around just 200MHz. So the clock rates are improving a little
generation-to-generation and I don't think it's unreasonable to hope for 300MHz
from V6 but I'm conservatively banking on at least 250MHz.
My conservative conclusion after going through this whole exercise for KAT was
that ROACH-2 could comfortably handle 4GHz bandwidth chunks at ~8Gsps
(8000/32=250MHz clk rate) and that we'd start hitting various limits not long
after that. So I would say that if you're considering ROACH-2 as a platform,
you'd be safe if aiming for IF chunks around 4 or 5 GHz.
Jason
On 24 Dec 2010, at 07:51, Dan Werthimer wrote:
On 2. it seems to me that if we are digitizing a 9 GHz and using 20 Gsps, one
still needs substantial demux (at least 64) no matter how small the PFB. As
Suraj points out this is far in excess of practical limits. This stacks with
what we have found: BW is the difficult part, large PFB for high res less so.
hi jonathan,
i agree you need to demux 20 Gsps by 64 or 128, but i don't think this will be
a problem.
20 Gsps should fit pretty easily into an FPGA in an FFX correlator:
in my example of the FFX, you'd need to implement an 8 point PFB
on the first FPGA to break the 10 GHz band into 8 sub-bands.
let's assume you do demux of 64, and clock the FPGA at 312.5 MHz:
you'd need 64*8 multipliers to implement the FIR part of an 8 tap PFB.
and 64 * 16 multipliers to implement the real to complex FFT part of the PFB.
all the multipliers have fixed coefficients - no need to use block rams to
store coefficients - no block rams are needed for delays or coefficients, as
you'd implement the butterfly diagram directly.
so there's no coefficient routing, but there is data routing.
the data paths can all be 8 bit, and you can add pipeline registers
where needed, so you should be able to get to 312.5 MHz.
if you can't get the FPGA to route at 312.5 MHz, then you'd have
to demux by 128, and you'd need twice as many multipliers.
(instead of 1536 multipliers, it would take 3072 multipliers).
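The multiplier counts above follow directly from the per-lane figures in the text (8 for the FIR part, 16 for the FFT part). A small sketch, with hypothetical parameter names:

```python
def pfb_multipliers(demux, fir_per_lane=8, fft_per_lane=16):
    """Multiplier count for the 8-channel PFB at a given demux factor.

    fir_per_lane and fft_per_lane are the per-lane counts quoted in the text.
    """
    fir = demux * fir_per_lane   # FIR (polyphase filter) part
    fft = demux * fft_per_lane   # real-to-complex FFT part
    return fir + fft

at_demux_64 = pfb_multipliers(64)    # 512 + 1024 = 1536
at_demux_128 = pfb_multipliers(128)  # 1024 + 2048 = 3072
```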
you can use block rams for many of the multipliers, as most of the
computations are multiplying 8 bit data by a fixed coefficient,
so an 8 input, 8 output look up table is all you need.
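A rough software model of the look up table multiplier described above: 8 address bits in, 8 data bits out, one table per fixed coefficient. The rounding and saturation choices here are assumptions for illustration, not a description of any existing CASPER block.

```python
def build_const_mult_lut(coeff):
    """256-entry table: address = 8-bit two's-complement input, value = product.

    coeff is the fixed coefficient (|coeff| <= 1 keeps the result in 8 bits).
    """
    lut = []
    for addr in range(256):
        x = addr - 256 if addr >= 128 else addr  # reinterpret address as signed
        y = round(x * coeff)                     # Python rounds ties to even
        y = max(-128, min(127, y))               # saturate to signed 8 bits
        lut.append(y)
    return lut

def lut_multiply(lut, x):
    """Multiply signed 8-bit x by the table's coefficient with one lookup."""
    return lut[x & 0xFF]

half = build_const_mult_lut(0.5)
```

With this table, `lut_multiply(half, 100)` returns 50 and `lut_multiply(half, -100)` returns -50, using no multiplier at all.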
if you don't want to implement an 8 channel PFB,
you could also implement this as eight DDC's running in parallel
from the same ADC data, each DDC with a different downmix frequency.
the mixer coefficients are fixed, and many of the coefficients are 0, 1, -1.
the DDC's low pass filter coefficients are fixed as well - you can use look up
tables for the low pass filter multipliers and the mixer multipliers if you
are short on DSP48's.
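A minimal sketch of the eight-DDC alternative, in plain Python. Several things here are assumptions rather than part of the proposal: the sub-band centres are placed at (k + 0.5) * fs/16, the decimation factor is 16 (so 20 Gsps real becomes 1.25 Gsps complex per sub-band), and the low-pass filter is whatever `taps` the caller supplies. The point is only that each mixer sequence is a short fixed table.

```python
import cmath
import math

N_BANDS = 8
DECIM = 16      # assumed: real rate fs -> complex rate fs/16 per sub-band
LO_PERIOD = 32  # the mixer sequence repeats every 32 samples

def lo_table(band):
    """One period of the fixed mixer sequence for sub-band `band`.

    Centre frequency (band + 0.5) * fs/16 gives a phase step of
    2*pi*(2*band + 1)/32 per sample.
    """
    return [cmath.exp(-2j * math.pi * (2 * band + 1) * n / LO_PERIOD)
            for n in range(LO_PERIOD)]

def ddc(samples, band, taps):
    """Mix real samples to baseband, low-pass filter, decimate by DECIM."""
    lo = lo_table(band)
    mixed = [s * lo[n % LO_PERIOD] for n, s in enumerate(samples)]
    out = []
    for start in range(0, len(mixed) - len(taps) + 1, DECIM):
        out.append(sum(t * mixed[start + i] for i, t in enumerate(taps)))
    return out

def ddc_bank(samples, taps):
    """Run all eight DDCs on the same input samples."""
    return [ddc(samples, band, taps) for band in range(N_BANDS)]
```

As a sanity check, a tone at the centre of sub-band 3, filtered with a 32-tap boxcar (a placeholder, not a recommended filter), comes out as a constant 0.5 in band 3 and essentially zero in the other seven.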
best wishes,
dan
BTW I realize as I write that my 6 GHz BW demux 32 case suggested in response to
Suraj still requires > 400 MHz FPGA clock, thus not so practical. Can one
gain a factor of 2 in demux doing quadrature sampling, and having I and Q inputs
to a complex input PFB each at 1/2 the rate?
Jonathan
On Dec 23, 2010, at 5:24 PM, Dan Werthimer wrote:
hi jonathan,
some ideas for your correlator:
1)
300 MHz is a good target, especially for V6.
suraj has shown how to achieve 375 MHz for V5
by using floor planning and auto-placing.
suraj or i can send you his draft paper on this if you'd like.
2)
you might want to consider FFX instead of FX:
eg: digitizing your 9 GHz band and using a PFB to break it up into eight
sub-bands of 1.25 GHz each, and then sending the sub-bands into eight 1.25 GHz
FX correlators. this will simplify your switch requirements and each
correlator now has only 4K channels, which is better suited for corner turning
in a roach II.
3)
also, be sure to use billy's latest FFT, (recently checked in),
which moves all the adders and multipliers into DSP48's and makes routing easier.
you should also consider bit growth FFT's and PFB's, which start
out with the 4 or 5 or 8 bits from your ADC, and add bits gradually
as you move into the frequency domain. dave mcmahon and hong chen
have done work on this.
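The FFX numbers in point 2 can be sketched as below. This assumes the eight 1.25 GHz sub-bands tile the 10 GHz Nyquist band of a 20 Gsps sampler (which contains the 9 GHz band of interest) and that the 32k channels of the single-FX case are split evenly; the function name is illustrative.

```python
def ffx_plan(nyquist_band_ghz, total_channels, n_subbands):
    """Per-sub-band bandwidth, channel count and resolution for an FFX split."""
    sub_bw_ghz = nyquist_band_ghz / n_subbands
    chans_per_sub = total_channels // n_subbands
    resolution_khz = sub_bw_ghz * 1e6 / chans_per_sub
    return sub_bw_ghz, chans_per_sub, resolution_khz

sub_bw_ghz, chans_per_sub, resolution_khz = ffx_plan(10.0, 32768, 8)
# sub_bw_ghz = 1.25, chans_per_sub = 4096, resolution_khz = 305.17578125
```

So each of the eight FX correlators sees 4K channels at roughly the 300 kHz resolution asked for.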
best wishes,
dan
On 12/23/2010 1:47 PM, Jonathan Weintroub wrote:
Hi CASPERites,
Here's a somewhat fluffy RFI which I hope might start a little thought and/or
discussion over the season (acknowledging that not all in the global
collaboration celebrate the traditional Western winter holidays):
At SMA we are looking into the use of CASPER methods to build an ultra wideband high
spectral resolution correlator. Typical specs are, say, 18 GHz bandwidth with roughly
300 kHz spectral resolution, by two polarizations, full Stokes. We are considering
using a standard CASPER packetized FX architecture (FX much better for high res than XF),
but in the relatively unexplored "small number of antennas, wide bandwidth"
regime. If the entire 18 GHz were eaten by one ADC, this would require a sample rate of
40 Gsps and 64 kpoint PFB. Perhaps more reasonable would be two 9 GHz BW blocks and a
32 k PFB sampled at about 20 Gsps, or three 6 GHz / 16 or 32 k PFB / 14 Gsps.
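The PFB sizes quoted above are consistent with dividing the processed bandwidth by the target resolution and rounding up to a power of two. A small sketch under that assumption (it counts output channels only and does not settle whether "points" means channels or real input samples):

```python
def channels_needed(bandwidth_hz, resolution_hz):
    """Smallest power-of-two channel count giving at least the target resolution."""
    exact = bandwidth_hz / resolution_hz
    n = 1
    while n < exact:
        n *= 2
    return n

one_block = channels_needed(18e9, 300e3)     # 60000 -> 65536 (64k)
two_blocks = channels_needed(9e9, 300e3)     # 30000 -> 32768 (32k)
three_blocks = channels_needed(6e9, 300e3)   # 20000 -> 32768 (32k)
```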
To start we are looking closely at the FPGA resource utilization of large PFBs.
Something that probably is common knowledge amongst those experienced in FX
correlator design is that the demux factor drives the utilization much faster
than the size of the PFB. In that sense bandwidth is far more expensive than
spectral resolution. We've put some effort into accurately quantifying the
utilization, at least as far as multipliers and adders are concerned, and are
expanding this analysis to block ram and other resources. And demux factor is
typically radix 2, so it is very much quantized.
For example at 20 Gsps one might consider a demux factor of 128 resulting in an
FPGA clock rate of 156 MHz, which is quite comfortable for the FPGA.
Alternatively a demux factor of 64 with corresponding FPGA clock of twice that,
or over 300 MHz. Traditionally a rather uncomfortable regime for CASPER
(we're unusual, I believe, in running iBOBs at 256 MHz for the VLBI phased
array). The trouble is our analysis shows that the difference between these
two demux settings in the size of PFB one can fit in a Virtex 6 is really quite
large, and 128 definitely won't allow us to do what we need to do.
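The quantization described above can be sketched as follows, assuming the demux factor is restricted to powers of two as stated earlier. The function names are illustrative.

```python
def fpga_clock_mhz(sample_rate_gsps, demux):
    """FPGA clock needed to absorb the ADC stream at a given demux factor."""
    return sample_rate_gsps * 1000.0 / demux

def min_demux(sample_rate_gsps, max_clock_mhz):
    """Smallest power-of-two demux whose clock does not exceed max_clock_mhz."""
    demux = 1
    while fpga_clock_mhz(sample_rate_gsps, demux) > max_clock_mhz:
        demux *= 2
    return demux

clk_128 = fpga_clock_mhz(20, 128)   # 156.25 MHz
clk_64 = fpga_clock_mhz(20, 64)     # 312.5 MHz
demux_at_256 = min_demux(20, 256)   # 128
demux_at_312 = min_demux(20, 312.5) # 64
```

Because the demux factor only moves in factors of two, the achievable clock has to cross a specific threshold before any resources are freed up.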
So we are increasingly highly motivated to run the FPGAs faster still. Just a
20% increment from the 256 MHz which we currently view as a practical upper
limit allows us to cross a clock rate threshold which then enables a factor of
two decrease in demux factor, and consequent even larger increment in the
realizable PFB size.
Which is just a long winded way of asking if there are any others in the
collaboration motivated to run the FPGAs faster, and whether any tricks can be
shared? In particular, does the CASPER toolflow support multiple clock
domains? Our understanding is not yet, but that's based on incomplete
information. We know that there exist Virtex 5 (?) IP FFT cores which
supposedly run at greater than 500 MHz rates, using the enhanced interconnect
between DSP slices.
While on this topic of high demux factors, the tool flow largely chokes on
demux factors of 32 or greater. Any tips here would also be appreciated.
If anyone can cast light on this general topic and related concerns it would be
very much appreciated.
Jonathan Weintroub
SAO