Yes, I can confirm that reset problem on the old core. I had it in the packetised correlator too, and not resetting was my solution as well. I suppose the rule of thumb is: don't reset anything unless you need to. There was some funny business in the RX FIFO as well, if memory serves: after an overflow, you have to pre-emptively acknowledge the output to get it going again.

Hopefully the bug in the open-source core will be fixed soon so that we can get higher speeds and proper reset ability.
Jason
On 12 Feb 2010, at 11:06, John Ford wrote:
We fixed it! (we think...)
Hi all. Has anyone seen any problems when using 4 10 GbE ports on a single FPGA in the BEE2?

We're having some problems where BORPH access from the control FPGA seems to lock up when we start up our designs. We've done some testing with hacked-up designs, and so far we can only make it lock up when we have 4 10 GbE ports active. Interestingly, some of the 10 GbE ports also seem to lock up at that very instant.

The FPGA is still running, but the control FPGA cannot talk to it OR ANY OF THE OTHER 3 FPGAs. If you kill the BORPH process on the control FPGA and restart it, it comes back to life.

We don't know what goes on under the hood of the yellow block system, but it seems that something is going wrong with whatever mechanism controls the bus to the control FPGA.
We finally, we think, worked out what is going on with our system. As is often the case, the problem is not what the initial symptoms suggested. Here's a quick summary of what we attempted, and what the symptoms were, for anyone interested...
Power distribution:

Some advice was given that maybe the power supply for that FPGA was being strained by having four 10 GbE cores active and driving the links, so we did the following to try to eliminate that possibility:

1) Left the designs alone, and simply unplugged the CX cables. This should have reduced the drive current required by the chip.

2) Recompiled the design with the end-of-frame signals skewed in time so that all 4 cores would not fire off a packet at the exact same time.

3) Compiled the design with only 1 10 GbE core.

During testing of each of these attempts, we were still able to make the system (the software on the control FPGA as well as the core on fpga2) lock up by issuing the arm() command. In my mind, that exonerated the power distribution as the culprit.
Software Lockups:

We found that the software lockup on the control FPGA was caused by a bug in the Python server managing the machine. The server spawned the BORPH process, and failed to properly handle the output from the tinyshell pipes. Paul changed the server to read the pipe and send it to a log file. Doing that fixed the software lockup on the control FPGA.
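The pipe-handling bug is a classic subprocess pitfall: if the parent never reads a child's stdout/stderr, the OS pipe buffer fills up and the child blocks on write(), which looks exactly like a lockup. A minimal sketch of the style of fix described above (the command, log path, and function names here are illustrative placeholders, not our actual server code):

```python
import subprocess
import threading

def drain(pipe, logfile):
    """Continuously read a child's output pipe and append it to a log
    file, so the child can never block on a full pipe buffer."""
    with open(logfile, "a") as log:
        for line in iter(pipe.readline, b""):
            log.write(line.decode(errors="replace"))
            log.flush()
    pipe.close()

def spawn_logged(cmd, logfile):
    """Spawn a child process and drain its stdout/stderr to a log file
    on a background thread."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    t = threading.Thread(target=drain, args=(proc.stdout, logfile),
                         daemon=True)
    t.start()
    return proc
```

Without the drain thread, a child producing more output than the pipe buffer holds (typically 64 KB) will stall forever.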
We found that the "arp" messages that are normally sent by the BEE2's 10 GbE cores at a fairly prodigious rate had started failing, I guess when the 10 GbE core locked up and refused to send packets on behalf of the CPU.

I changed the driver software so that it doesn't send all those ARP packets. Our thinking was that the ARP packets and the packets from the fabric were somehow colliding. My modified code just did the ARP routine once upon startup of the 10 GbE core, and then never again, unless requested through the tinyshell.
This had no effect on the core lockups.
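For context on what those periodic ARP messages are, if I understand them correctly they are gratuitous announcements of the core's own MAC/IP binding. Here is a hand-rolled sketch of such a frame; the addresses are made up, and this is not the actual core or driver code:

```python
import struct

def gratuitous_arp(src_mac: bytes, src_ip: bytes) -> bytes:
    """Build a broadcast gratuitous ARP request announcing that src_ip
    lives at src_mac (Ethernet II framing, 42 bytes total)."""
    bcast = b"\xff" * 6
    eth = bcast + src_mac + struct.pack("!H", 0x0806)  # EtherType: ARP
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)    # Ethernet/IPv4, request
    arp += src_mac + src_ip                            # sender hw/proto addr
    arp += b"\x00" * 6 + src_ip                        # target hw unknown; target IP = own IP
    return eth + arp

# Made-up example addresses:
frame = gratuitous_arp(b"\x02\x00\x00\x00\x00\x01", bytes([10, 0, 0, 2]))
```

Sending one of these at core startup (and then only on request) is all my modified driver does.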
As a final, somewhat desperate, move, I looked at our old systems that behave flawlessly, and found that we had made a subtle change in our handling of the 10 GbE blocks. Whereas in the old system we tied the reset (rst) line low with a constant, in the new design we exercised the reset line when we armed the system. Changing back to the old way seems to have fixed the problem.
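In software terms, the difference between the two arming styles can be sketched as follows. Everything here is hypothetical: the register names, the write_int interface, and the stub modelling the suspected failure mode are illustrations, not the actual BEE2 control API or a claim about why the core wedges:

```python
class FakeFpga:
    """Stand-in for a katcp-style FPGA client, for illustration only.
    It models the *suspected* bug: pulsing the 10 GbE core's reset
    while it is running wedges the core."""
    def __init__(self):
        self.regs = {}
        self.gbe_core_state = "running"

    def write_int(self, name, value):
        self.regs[name] = value
        if name == "gbe0_rst" and value == 1:
            self.gbe_core_state = "locked_up"

def arm_with_reset_pulse(fpga):
    """New-style arm(): exercises the reset line (the version that hung)."""
    fpga.write_int("gbe0_rst", 1)
    fpga.write_int("gbe0_rst", 0)
    fpga.write_int("arm", 1)

def arm_reset_tied_low(fpga):
    """Old-style arm(): rst is tied low with a constant and never touched."""
    fpga.write_int("arm", 1)
```

The old style simply never drives rst, which matches tying it to a constant 0 in the Simulink design.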
Does anyone know if there is a problem with the reset line in general on that core? What is the proper use of that line?
Thanks to all who chimed in on the list with ideas!
John