Yes, I can confirm that reset problem on the old core. Had it in the packetised correlator too. "Don't reset it" was my solution as well. I suppose the rule of thumb is: don't reset anything unless you have to. There was some funny business in the RX FIFO as well, if memory serves. After an overflow, you have to pre-emptively acknowledge the output to get it going again.

Hopefully the bug in the open-source core will be fixed soon so that we can get higher speeds and proper reset ability.

Jason

On 12 Feb 2010, at 11:06, John Ford wrote:

We fixed it!  (we think...)

Hi all.  Has anyone seen any problems when using 4 10 GbE ports on a
single FPGA in the BEE2?

We're having some problems where BORPH access from the control FPGA seems to lock up when we start up our designs. We've done some testing with hacked-up designs, and so far we can only make it lock up when we have four 10 GbE ports active. Interestingly, some of the 10 GbE ports also seem to lock up at that very instant.

The FPGA is still running, but the control FPGA cannot talk to it OR ANY OF THE OTHER 3 FPGAs. If you kill the BORPH process on the control FPGA and restart it, it comes back to life.

We don't know what goes on under the hood of the yellow block system, but it seems that something is going wrong with whatever mechanism controls
the bus to the control fpga.

We think we've finally worked out what is going on with our system. As is often the case, the problem is not what the initial symptoms suggested. Here's a quick summary of what we attempted, and what the symptoms were, for anyone interested...

Power distribution:
Some advice was given that the power supply for that FPGA might be strained by having four 10 GbE cores active and driving the links, so we did the following to try to eliminate that possibility:

1) Left the designs alone, and simply unplugged the CX cables. This should have reduced the drive current required by the chip.
2) Recompiled the design with the end-of-frame signals skewed in time so that all four cores would not fire off a packet at exactly the same time.
3) Compiled the design with only one 10 GbE core.

During testing of each of these attempts, we were still able to cause the system (the software on the control FPGA as well as the core on FPGA 2) to lock up by issuing the arm() command. To my mind, that exonerated the power distribution as the culprit.

Software Lockups:
We found that the software lockup on the control FPGA was caused by a bug in the Python server managing the machine. The server spawned the BORPH process, but failed to read the output from the tinyshell pipes, so the pipe buffers eventually filled and blocked. Paul changed the server to read the pipes and send their output to a log file. Doing that fixed the software lockup on the control FPGA.
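A minimal sketch of that kind of fix in Python (the helper name, command, and log path here are hypothetical, not the actual server code): spawn the child with a pipe, then keep a thread draining the pipe into a log file so the OS pipe buffer never fills and stalls the child.

```python
import subprocess
import threading

def spawn_with_log(cmd, log_path):
    """Spawn a child process and continuously drain its stdout/stderr
    into a log file, so the pipe buffer never fills and blocks the child."""
    log = open(log_path, "ab")
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)

    def drain():
        for line in proc.stdout:   # reads until the child closes the pipe
            log.write(line)
            log.flush()
        log.close()

    t = threading.Thread(target=drain, daemon=True)
    t.start()
    return proc, t

# Example: a child that writes far more than a typical 64 KB pipe buffer.
proc, t = spawn_with_log(
    ["python3", "-c", "print('x' * 200000)"], "/tmp/borph.log")
proc.wait()
t.join()
```

Without the drain thread, a child producing that much output would block on write() once the pipe buffer filled, which matches the lockup symptom described above.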

We found that the "arp" messages that are normally sent by the BEE2's 10 GbE cores at a fairly prodigious rate had started failing, presumably when the 10 GbE core locked up and refused to send packets on behalf of the CPU.

I changed the driver software so that it doesn't send all those ARP packets. Our thinking was that the ARP packets and the packets from the fabric were somehow colliding. My modified code just runs the ARP routine once at startup of the 10 GbE core, and then never again, unless requested through the tinyshell.
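That change can be modelled with a small Python sketch (the class and method names are hypothetical; the real driver lives in the BEE2 firmware, not Python): a flag suppresses repeat announcements at startup, while a separate entry point still allows a manual ARP from the tinyshell.

```python
class TgbeDriver:
    """Illustrative model of the modified ARP behaviour:
    announce once at core startup, then stay quiet unless asked."""

    def __init__(self, send_arp):
        self._send_arp = send_arp   # callable that emits one ARP announce
        self._announced = False

    def start_core(self):
        # Announce exactly once, no matter how often the core is restarted.
        if not self._announced:
            self._send_arp()
            self._announced = True

    def tinyshell_arp(self):
        # Manual re-announce, only on request from the tinyshell.
        self._send_arp()

# Usage: count how many ARPs go out across repeated start calls.
sent = []
drv = TgbeDriver(lambda: sent.append("arp"))
for _ in range(5):
    drv.start_core()
drv.tinyshell_arp()
print(len(sent))  # prints 2: one at startup, one requested manually
```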

This had no effect on the core lockups.

As a final, somewhat desperate, move, I looked at our old systems that behave flawlessly, and found that we had made a subtle change in our handling of the 10 GbE blocks. Whereas in the old system we tied the reset (rst) line low with a constant, in the new design we exercised the reset line when we armed the system. Changing back to the old way seems to have fixed the problem.

Does anyone know if there is a problem with the reset line in general on
that core?  What is the proper use of that line?

Thanks to all who chimed in on the list with ideas!

John
