Re: [casper] viability of VCU128 eval board as production CASPER instrument

2022-05-11 Thread 'Jonathan Weintroub' via casper@lists.berkeley.edu
Hi Jack, Jonathon, Mitch, Francois, Dan, and the ever-supportive CASPER 
community,

Thank you all for the very thoughtful and helpful input.  It’s all filed away 
for study, particularly a comparative look at alternative COTS solutions 
suggested.

To provide some bulleted responses to specific inputs:

- We need a 16.384 GS/s ADC conversion rate. RFSoC is very attractive in many
  ways, but so far doesn’t provide ADCs at near that rate. Yes, maybe
  interleave? That’s interesting, but we’re also a little allergic to
  interleave artifacts.
- We’re not actually sure whether or not we need HBM, and understand its
  expense in $ and dissipation.  We may need fast RAM storage for various
  functions (FIR coefficient storage, transposes, packet buffering, …); there
  are block RAM resources, which might be sufficient. This is the subject of
  current study. The retail price comparison of VU37P to VCU128 seemed more
  apples-to-apples.
- That said, 16.384 GS/s demands very high demux factors; we think a factor of
  64 is almost certainly required.  As shown in the SWARM paper
  (section 4, equations 1, 2, and fig 5), demux factor D is the primary driver
  of PFB utilization; the number of spectral points N has far less impact, at
  least as far as DSP slices are concerned.  (Memory is driven more by N,
  though that calculation isn’t quite as closed form.)**
- We do think of instruments, even real-time ones, in terms of pure COTS high
  performance computing wherever possible: Alveo, GPUs, CPUs, whatever, with
  packetized inputs and outputs.  However, we go back to FPGAs as the only
  economical way (we think) to access the highest performance SERDES for ADC
  interfacing, and the precise timing of samples for VLBI.  The current
  instrument is a VLBI Digital Back End, not a correlator-beamformer, and it
  seems natural to bundle channelization on the FPGA as well, especially given
  that the DBE quantizes to 2 bits (typically) before spitting out its data
  product (thus it is very attractive to equalize across the wide band).  And
  the utilization considerations in the prior bullet drive us to big expensive
  FPGAs.
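
As a rough illustration of why the demux factor matters in the bullets above: the FPGA fabric has to consume D samples per clock, so the fabric clock is the ADC rate divided by D. The sketch below assumes the quoted rate reads as 16.384 GS/s (16384 MS/s); the function name and the list of candidate demux factors are hypothetical, chosen only for illustration, and nothing here reproduces the SWARM paper's utilization equations.

```python
def fabric_clock_mhz(sample_rate_msps: float, demux: int) -> float:
    """Fabric clock (MHz) needed when `demux` samples are processed per clock."""
    if demux <= 0:
        raise ValueError("demux factor must be positive")
    return sample_rate_msps / demux


# Assumption: the ADC rate discussed above is 16.384 GS/s = 16384 MS/s.
SAMPLE_RATE_MSPS = 16384.0

if __name__ == "__main__":
    # Candidate demux factors are illustrative, not taken from the thread,
    # except D = 64, which is the factor the thread says is likely required.
    for demux in (16, 32, 64, 128):
        clock = fabric_clock_mhz(SAMPLE_RATE_MSPS, demux)
        print(f"D = {demux:3d}: fabric clock = {clock:7.1f} MHz")
```

Under these assumptions, D = 64 corresponds to a 256 MHz fabric clock, while D = 32 would require 512 MHz.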


I’m on an airplane about to take off, heading to this (CASPER-driven)  EHT 
press event. 

  I’ll sign off now.

Best wishes, thanks again.

Jonathan and colleagues


**There is a useful memo, for which section 4 in the SWARM paper is just a
summary; it turns out not to be on the CASPER memo GitHub.
I’ll look into posting it. Can I do that myself, or should I send it to an
admin?




Re: [casper] viability of VCU128 eval board as production CASPER instrument

2022-05-11 Thread Francois Kapp
Hi Jonathan et al,

To close an open question: At SARAO we do not intend to pursue a SKARAB2,
for the same cost inconsistency you mentioned.  Instead, we are
CASPER-ising Xilinx Alveo boards, which are intended for production, albeit
in a data centre environment.  Our intention is to further develop hybrid
FPGA/GPU correlators around these boards.  At the moment, as one would
expect, the GPU development leads the FPGA development.

Others, notably CSIRO for the SKA Low design, are also proposing Alveo in a
very CASPER-like architecture:
https://www.spiedigitallibrary.org/journals/Journal-of-Astronomical-Telescopes-Instruments-and-Systems/volume-8/issue-01/011018/Square-Kilometre-Array-Low-Atomic-commercial-off-the-shelf-correlator/10.1117/1.JATIS.8.1.011018.full
- perhaps our CSIRO colleagues can chime in there, but packing 20 Alveo
U55c's in a server looks like something viable, and it certainly reduces
the overhead of the host server per FPGA.

+F.





On Tue, 10 May 2022 at 00:35, Mitchell Burnett 
wrote:

> Hi Jonathan,
>
> To chime in under the “other” category…
>
> We have recently added six RFSoC platforms to CASPER. (Three Xilinx eval
> boards: ZCU111, ZCU216, ZCU208. The Xilinx education platform PYNQ RFSoC
> 2x2. Two boards from HiTech Global: HTG-ZRF16-29DR, and the 49DR version.)
>
> For ALPACA, we have used a couple ZCU111s, with the current plan to field
> 12 ZCU216’s in the final instrument. These are and will operate in a server
> room, so again, nothing extreme. Beyond our ALPACA project, I am aware of
> several folks that have all had success bringing up RFSoCs using Xilinx
> eval boards with CASPER tools (and others that are not immediately using
> CASPER tools, but are still using eval boards). So far, I have not
> experienced or heard of performance issues or failures with the
> ZCU111/216/208. But I am sure they exist, and perhaps this thread brings
> those out, because with other UltraScale+ parts I have heard anecdotes
> similar to yours, where a significant quantity of eval boards was purchased
> for a wideband system with a ~20% failure rate.
>
> Bringing up some boards with folks has been bumpy, but nothing attributed
> to the board. Those cases have mostly been needing to work out the
> documentation, and some strange outliers (like switching out an SD card
> from the one provided with the board).
>
> At least until now, I have had more issues with non-Xilinx RFSoC boards.
> But, that speaks more in general to the relationship with a vendor and what
> they support. Certainly, as pointed out, eval boards will not receive any
> guarantee and in our case we have just decided to knowingly assume that
> risk.
>
> In the end though, I just parrot much of what Jack said: I would try to
> avoid eval boards, but using them is viable and in scenarios like mine, if
> the project can take on and justify the risk then, OK. I believe SOMs are
> very promising to look for first, with more vendors providing options. When
> possible, choose vendors with whom you have had a good dialog about the
> requirements and who set clear support expectations. Negotiating prices will be
> tough (certainly with how supply is right now).
>
> Don’t think I really added much to the conversation, just another data
> point for you.
>
> Best,
>
> Mitch
>
> On May 9, 2022, at 12:56 PM, Jonathon Kocz  wrote:
>
> Hi Jonathan,
>
> A couple of follow up questions (sorry for getting into nitty gritty you
> wanted to avoid!):
>
> 1) Are you actually using the HBM? You can get much cheaper FPGAs with
> similar DSP/BRAM resources without HBM (if you are using HBM, are you doing
> this via CASPER?!)
> 2) I've been using the VCU128 a bit - I'm working on a couple of projects
> with your ADC board now. I've not found any issues (yet!) with loading of
> code on power up, or with the 1Gb coming up - though I note that the 1Gb
> CASPER core for the VCU128 doesn't work properly (an init issue, it's on my
> list to fix that in the next couple of weeks). - Which set of libraries are
> you using, or are you working outside CASPER?
> 3) On the CASPER conference/busy week front: With the 100Gb, is that also
> a CASPER core? We currently have at least two in the CASPER libraries, and
> part of the busy week I want to try to either integrate or find a use case
> where one might use one or the other to reduce confusion for users - if you
> have a third (and it's open source) it would be good to merge that in as
> well!
>
> In terms of eval boards in general:
>
> I've fielded a few VCU128s and they're fine, but we're not running them in
> an extreme environment - just in labs / server rooms. I've previously had
> issues with other eval boards when trying to use them to maximum capacity -
> as Dan said, they're not really designed for it.
>
> In terms of other boards - which should be merged into the main branch
> after the busy week:
>
> We've put the HiTech Global HTG940 and HTG9200 boards into the CASPER
> library if either of those was useful.
>
> I