David,

It's been a little while now, but not much has changed yet. We've
gotten Chipscope working, and, so far, there aren't any red flags in the
FPGA firmware's 10-GbE control signals.

We also confirmed that the bitstream we are using is in fact
roach2_fengine_2013_Oct_14_1756.bof.gz, so that is unfortunately not the
problem.

I also took a look at the ROACH2 PPC setup: we pulled from the .git
repository on February 12, 2014 (commit
e14df9016c3b7ccba62cc6d0cae05405f4929c94). There haven't been any changes
to that repository since August 2013, so unless the SKA-SA ROACH-2s are
using a pull from before then, I don't think that is our issue.

We also tried out Jason Manley's suggestion of delaying the enabling of the
10-GbE cores to ensure that the sync pulse propagated through the entire
system before buffering up data, but the problem persisted.
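
For reference, here is a rough Python sketch of the start-up ordering in
question. It is for illustration only: the FakeFpga class, the register
names tx_valid_en and gbe_reset, and the one-second settle time are all
hypothetical stand-ins, not the actual paper_feng_init.rb code or the real
register map.

```python
import time


class FakeFpga:
    """Hypothetical stand-in for an FPGA register interface; records writes."""

    def __init__(self):
        self.regs = {}
        self.log = []

    def write(self, name, value):
        # Store the latest value and keep an ordered history of all writes.
        self.regs[name] = value
        self.log.append((name, value))


def start_10gbe(fpga, settle_s=1.0, sleep=time.sleep):
    """Bring up the 10-GbE cores only after a settle delay.

    Register names are assumptions made for this sketch.
    """
    # Stop feeding the cores before touching reset.
    fpga.write("tx_valid_en", 0)
    # Hold the cores in reset while the sync pulse propagates.
    fpga.write("gbe_reset", 1)
    # Delay so the sync pulse (issued elsewhere) can reach the whole design.
    sleep(settle_s)
    # Release reset first, then start feeding valid data.
    fpga.write("gbe_reset", 0)
    fpga.write("tx_valid_en", 1)
    return fpga.log
```

The point of the ordering is that tx_valid is never high while the cores
are in reset, and nothing is enabled until the delay has elapsed.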

Just to rule it out, I double-checked (or more accurately triple-checked)
the U72 part, and, sure enough, it is the correct oscillator, model number
EEG-2121.

There is one other possibility, albeit an unlikely one: we currently have
the ROACH-2 board booting off another PC (i.e. not the same PC that the
ruby control scripts are running on). I can't imagine that this is the
problem, but I'm planning on trying to consolidate the NFS and ruby scripts
onto a single PC to rule it out.

So I suppose at this point, my questions are:

(1) What version of the roach2_nfs_uboot .git repository is SKA-SA using?
(2) Is SKA-SA using the same PCs for ROACH-2 net boots and file systems as
for the ruby control scripts?
(3) Are there any additional steps that need to be taken when installing
the Quad SFP+ mezzanine cards onto the ROACH-2 board? Are there potentially
some drivers or configuration steps that are needed to make sure they
function properly? As I recall, when we got the boards, we didn't do
anything special with the cards outside of simply plugging them in.

Again, thanks for your patient advice and suggestions.


Richard Black

On Mon, Oct 27, 2014 at 2:26 PM, David MacMahon <dav...@astro.berkeley.edu>
wrote:

> Hi, Richard,
>
> On Oct 27, 2014, at 9:25 AM, Richard Black wrote:
>
> > This is a reportedly fully-functional model that shouldn't require any
> > major changes in order to operate. However, this has clearly not been the
> > case in at least two independent situations (us and Peter). This begs the
> > question: what's so different about our use of PAPER?
>
> I just verified that the roach2_fengine_2013_Oct_14_1756.bof.gz file is
> the one being used by the PAPER correlator currently fielded in South
> Africa.  It is definitely a fully functional model.  That image (and all
> source files for it) is available from the git repo listed on the PAPER
> Correlator Manifest page of the CASPER Wiki:
>
> https://casper.berkeley.edu/wiki/PAPER_Correlator_Manifest
>
> > We, at BYU, have made painstakingly sure that our IP addressing schemes,
> > switch ports, and scripts are all configured correctly (thanks to David
> > MacMahon for that, btw), but we still have hit the proverbial brick wall of
> > 10-GbE overflow.  When I last corresponded with David, he explained that he
> > remembers having a similar issue before, but can't recall exactly what the
> > problem was.
>
> Really?  I recall saying that I often forget about increasing the MTU of
> the 10 GbE switch and NICs.  I don't recall saying that I had a similar
> issue before but couldn't remember the problem.
>
> > In any case, the fact that turning down the ADC clock prior to
> > start-up prevents the 10-GbE core from overflowing is a major lead for us
> > at BYU (we've been spinning our wheels on this issue for several months
> > now). By no means are we proposing mid-run ADC clock modifications, but
> > this appears to be a very subtle (and quite sinister, in my opinion) bug.
> >
> > Any thoughts as to what might be going on?
>
> I cannot explain the 10 GbE overflow that you and Peter are experiencing.
> I have pushed some updates to the rb-papergpu.git repository listed on the
> PAPER Correlator Manifest page.  The paper_feng_init.rb script now verifies
> that the ADC clocks are locked and provides options for issuing a software
> sync (only recommended for lab use) and for not storing the time of
> synchronization in redis (also only recommended for lab use).
>
> The 10 GbE cores can overflow if they are fed valid data (i.e. tx_valid=1)
> while they are held in reset.  Since you are using the paper_feng_init.rb
> script, this should not be happening (unless something has gone wrong
> during the running of that script) because that script specifically and
> explicitly disables the tx_valid signal before putting the cores into reset
> and it takes the cores out of reset before enabling the tx_valid signal.
> So assuming that this is not the cause of the overflows, there must be
> something else that is causing the 10 GbE cores to be unable to transmit
> data fast enough to keep up with the data stream they are being fed.  Two
> things that could cause this are 1) running the design faster than the 200
> MHz sample clock that it was built for and/or 2) some link issue that
> prevents the core from sending data.  Unfortunately, I think both of those
> ideas are also pretty far fetched given all you've done to try to get the
> system working.  I wonder whether there is some difference in the ROACH2
> firmware (u-boot version or CPLD programming) or PPC Linux setup or
> tcpborphserver revision or ???.
>
> Have you tried using adc16_dump_chans.rb to dump snapshots of the ADC data
> to make sure that it looks OK?
>
> Dave
>
>
