Yes, I can confirm that reset problem on the old core. Had it in the packetised correlator too. "Don't reset it" was my solution as well. I suppose the rule of thumb is: don't reset anything unless you have to. There was some funny business in the RX FIFO as well, if memory serves. After an overflow, you have to pre-emptively acknowledge the output to get it going again.

Hopefully the bug in the open-source core will be fixed soon so that we can get higher speeds and proper reset ability.

Jason

On 12 Feb 2010, at 11:06, John Ford wrote:

We fixed it!  (we think...)

Hi all.  Has anyone seen any problems when using 4 10 GbE ports on a
single FPGA in the BEE2?

We're having some problems where BORPH access from the control FPGA seems to lock up when we start up our designs. We've done some testing with hacked-up designs, and so far we can only make it lock up when we have four 10 GbE ports active. Interestingly, some of the 10 GbE ports also seem to lock up at that very instant.

The FPGA is still running, but the control FPGA cannot talk to it OR ANY OF THE OTHER 3 FPGAs. If you kill the BORPH process on the control FPGA and restart it, it comes back to life.

We don't know what goes on under the hood of the yellow block system, but it seems that something is going wrong with whatever mechanism controls
the bus to the control fpga.

We think we've finally worked out what is going on with our system. As is often the case, the problem is not what the initial symptoms suggested. Here's a quick summary of what we attempted, and what the symptoms were, for anyone interested...

Power distribution:
Some advice was given that the power supply for that FPGA might be strained by having four 10 GbE cores active and driving the links, so we did the following to try to eliminate that possibility:

1) Left the designs alone, and simply unplugged the CX cables. This should have reduced the drive current required by the chip.
2) Recompiled the design with the end-of-frame signals skewed in time so that all four cores would not fire off a packet at exactly the same time.
3) Compiled the design with only one 10 GbE core.

During testing of each of these attempts, we were still able to cause the system (the software on the control FPGA as well as the core on FPGA 2) to lock up by issuing the arm() command. To my mind, that exonerated the power distribution as the culprit.

Software Lockups:
We found that the software lockup on the control FPGA was caused by a bug in the Python server managing the machine. The server spawned the BORPH process, but failed to read the output from the tinyshell pipes, so the pipe buffers eventually filled and blocked. Paul changed the server to read the pipes and send their output to a log file. Doing that fixed the software lockup on the control FPGA.
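A minimal sketch of that kind of fix in Python (the helper name, command, and log path here are hypothetical, not the actual server code): spawn the child with a pipe, then keep a thread draining the pipe into a log file so the OS pipe buffer never fills and stalls the child.

```python
import subprocess
import threading

def spawn_with_log(cmd, log_path):
    """Spawn a child process and continuously drain its stdout/stderr
    into a log file, so the pipe buffer never fills and blocks the child."""
    log = open(log_path, "ab")
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)

    def drain():
        for line in proc.stdout:   # reads until the child closes the pipe
            log.write(line)
            log.flush()
        log.close()

    t = threading.Thread(target=drain, daemon=True)
    t.start()
    return proc, t

# Example: a child that writes far more than a typical 64 KB pipe buffer.
proc, t = spawn_with_log(
    ["python3", "-c", "print('x' * 200000)"], "/tmp/borph.log")
proc.wait()
t.join()
```

Without the drain thread, a child producing that much output would block on write() once the pipe buffer filled, which matches the lockup symptom described above.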

We found that the "arp" messages that are normally sent by the BEE2's 10 GbE cores at a fairly prodigious rate had started failing, presumably when the 10 GbE core locked up and refused to send packets on behalf of the CPU.

I changed the driver software so that it doesn't send all those ARP packets. Our thinking was that the ARP packets and the packets from the fabric were somehow colliding. My modified code just runs the ARP routine once at startup of the 10 GbE core, and then never again, unless requested through the tinyshell.
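That change can be modelled with a small Python sketch (the class and method names are hypothetical; the real driver lives in the BEE2 firmware, not Python): a flag suppresses repeat announcements at startup, while a separate entry point still allows a manual ARP from the tinyshell.

```python
class TgbeDriver:
    """Illustrative model of the modified ARP behaviour:
    announce once at core startup, then stay quiet unless asked."""

    def __init__(self, send_arp):
        self._send_arp = send_arp   # callable that emits one ARP announce
        self._announced = False

    def start_core(self):
        # Announce exactly once, no matter how often the core is restarted.
        if not self._announced:
            self._send_arp()
            self._announced = True

    def tinyshell_arp(self):
        # Manual re-announce, only on request from the tinyshell.
        self._send_arp()

# Usage: count how many ARPs go out across repeated start calls.
sent = []
drv = TgbeDriver(lambda: sent.append("arp"))
for _ in range(5):
    drv.start_core()
drv.tinyshell_arp()
print(len(sent))  # prints 2: one at startup, one requested manually
```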

This had no effect on the core lockups.

As a final, somewhat desperate, move, I looked at our old systems that behave flawlessly, and found that we had made a subtle change in our handling of the 10 GbE blocks. Whereas in the old system we tied the reset (rst) line low with a constant, in the new design we exercised the reset line when we armed the system. Changing back to the old way seems to have fixed the problem.

Does anyone know if there is a problem with the reset line in general on
that core?  What is the proper use of that line?

Thanks to all who chimed in on the list with ideas!

John
