We fixed it!  (we think...)

> Hi all.  Has anyone seen any problems when using 4 10 GbE ports on a
> single FPGA in the BEE2?
>
> We're having some problems where borph access from the control FPGA seems
> to lock up when we start up our designs.  We've done some testing with
> hacked-up designs, and we can only make it lock up (so far) when we have 4
> 10 GbE ports active.  Interestingly, some of the 10 GbE ports also seem to
> lock up at that very instant.
>
> The FPGA is still running, but the control FPGA cannot talk to it OR ANY
> OF THE OTHER 3 FPGAs.  If you kill the BORPH process on the control FPGA
> and restart it, it comes back to life.
>
> We don't know what goes on under the hood of the yellow block system, but
> it seems that something is going wrong with whatever mechanism controls
> the bus to the control fpga.

We have finally, we think, worked out what is going on with our system.
As is often the case, the problem was not what the initial symptoms
suggested.
Here's a quick summary of what we attempted, and what the symptoms were,
for anyone interested...

Power distribution:
Some advice suggested that the power supply for that FPGA might be
strained by having four 10 GbE cores active and driving the links, so we
did the following to try to rule that out:

1) Left the designs alone and simply unplugged the CX cables.  This
should have reduced the drive current required by the chip.
2) Recompiled the design with the end-of-frame signals skewed in time so
that all four cores would not fire off a packet at the exact same time.
3) Compiled the design with only a single 10 GbE core.

While testing each of these attempts, we were still able to make the
system lock up, both the software on the control FPGA and the core on
fpga2, by issuing the arm() command.  In my mind, that exonerated power
distribution as the culprit.

Software lockups:
We found that the software lockup on the control FPGA was caused by a bug
in the Python server managing the machine.  The server spawned the BORPH
process but failed to properly handle the output from the tinyshell
pipes.  Paul changed the server to read the pipe and write it to a log
file, which fixed the software lockup on the control FPGA.
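For anyone who runs into the same thing: if a server spawns a child
process with stdout/stderr connected to pipes and never reads them, the
OS pipe buffer eventually fills and the child blocks on its next write,
which looks exactly like a lockup.  A minimal sketch of that kind of fix
is below; the .bof name and log path are placeholders, not our actual
server code:

import subprocess
import threading

def drain(pipe, logfile):
    # Copy everything the child writes into a log file so the pipe
    # can never fill up and stall the child.
    with open(logfile, "a") as log:
        for line in iter(pipe.readline, b""):
            log.write(line.decode(errors="replace"))
            log.flush()
    pipe.close()

# "./design.bof" stands in for however the server really launches the
# BORPH process.
proc = subprocess.Popen(["./design.bof"],
                        stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT)
threading.Thread(target=drain,
                 args=(proc.stdout, "/tmp/tinyshell.log"),
                 daemon=True).start()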

We found that the "arp" messages that are normally sent by the bee2's 10
gbe cores at a fairly prodigious rate had started failing, I guess when
the 10 gbe core locked up ande refused to send packets on behalf of the
CPU.

I changed the driver software so that it doesn't send all those ARP
packets.  Our thinking was that the ARP packets and the packets from the
fabric were somehow colliding.  My modified code runs the ARP routine
once when the 10 GbE core starts up, and then never again unless
requested through the tinyshell.

This had no effect on the core lockups.
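For the record, the shape of that driver change was roughly the
following; the class, method, and core-object names are invented for
illustration and are not the actual BEE2 driver source:

class TenGbeDriver:
    # Illustrative only: send a gratuitous ARP once at startup instead
    # of re-announcing on a timer.
    def __init__(self, core):
        self.core = core

    def start(self):
        self.core.enable()
        self.send_arp()      # announce our MAC/IP mapping once at startup

    def periodic_tick(self):
        # The stock behaviour called send_arp() here on every tick; the
        # modified driver does nothing unless explicitly asked.
        pass

    def send_arp(self):
        # Still callable on demand from the tinyshell.
        self.core.transmit_arp_announce()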

As a final, somewhat desperate, move, I looked at our old systems, which
behave flawlessly, and found that we had made a subtle change in our
handling of the 10 GbE blocks.  Whereas the old system tied the reset
(rst) line low with a constant, the new design exercised the reset line
when we armed the system.  Changing back to the old way seems to have
fixed the problem.
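To make the difference concrete, here is a rough, hypothetical sketch of
the two arming sequences as seen from the control software; the register
names and the write_reg() helper are invented, and the actual change was
made in the Simulink design (the core's rst port tied to a constant 0):

def arm_old(fpga):
    # Old, working behaviour: rst is tied low in the design, so arming
    # never touches the 10 GbE cores' reset.
    fpga.write_reg("arm", 0)
    fpga.write_reg("arm", 1)
    fpga.write_reg("arm", 0)

def arm_new(fpga):
    # Newer behaviour that coincided with the lockups: pulse each
    # core's reset as part of arming.
    for core in ("gbe0", "gbe1", "gbe2", "gbe3"):
        fpga.write_reg(core + "_rst", 1)
        fpga.write_reg(core + "_rst", 0)
    fpga.write_reg("arm", 0)
    fpga.write_reg("arm", 1)
    fpga.write_reg("arm", 0)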

Does anyone know if there is a problem with the reset line in general on
that core?  What is the proper use of that line?

Thanks to all who chimed in on the list with ideas!

John



