Hi,

In order to better understand the reason for the system's instability,
I have designed a hardware memory tester. The core generates a
pseudo-random sequence using an LFSR at the peak bandwidth of the DRAM
chips (so it puts the maximum load on the memory bus), writes it to the
DRAM, resets the PRNG, reads back the DRAM, checks that the data from
the DRAM matches the pseudo-random sequence, and counts the erroneous
words.

The test was performed using a 32 megabyte buffer which was written and
read repeatedly, with an alternating PRNG seed so that different data
is written each time.
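
For readers who want to see the write / reset / read-back / compare
loop spelled out, here is a minimal software sketch. It is not the
Verilog core: the 32-bit word width, the LFSR taps (32, 22, 2, 1, a
commonly tabulated tap set), the seeds and the small list standing in
for the DRAM buffer are all assumptions made for illustration.

```python
# Software sketch of the write / reset / read-back / compare loop.
# The LFSR taps, word width, seeds and buffer size are illustrative
# assumptions, not the parameters of the hardware core.

MASK32 = 0xFFFFFFFF


def lfsr_next(state):
    """Advance a 32-bit Galois LFSR by one step (taps 32, 22, 2, 1)."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= 0x80200003
    return state & MASK32


def prng_words(seed, count):
    """Yield `count` pseudo-random words, starting from a non-zero seed."""
    state = seed & MASK32
    if state == 0:
        raise ValueError("an LFSR seed must be non-zero")
    for _ in range(count):
        state = lfsr_next(state)
        yield state


def write_pass(memory, seed):
    """Fill the whole buffer with the sequence generated from `seed`."""
    for addr, word in enumerate(prng_words(seed, len(memory))):
        memory[addr] = word


def read_pass(memory, seed):
    """Restart the PRNG from `seed` and count words that do not match."""
    errors = 0
    for addr, expected in enumerate(prng_words(seed, len(memory))):
        if memory[addr] != expected:
            errors += 1
    return errors


def run_test(memory, seeds):
    """One write pass plus one read pass per seed; returns total errors."""
    total = 0
    for seed in seeds:
        write_pass(memory, seed)
        total += read_pass(memory, seed)
    return total


memory = [0] * 1024  # stand-in for the DRAM buffer
print(run_test(memory, [0x12345678, 0x9ABCDEF0]))
```

Because the PRNG is restarted from the same seed for the read pass, the
checker needs no copy of the expected data, which is what allows the
generator to run at full bus bandwidth.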

Now the big news: I left the thing running for a while, transferring in
excess of 17 terabytes of data between the FPGA and the RAM chips,
WITHOUT A SINGLE BIT FLIPPED.
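
A rough way to put a number on this result is sketched below. It is
back-of-the-envelope arithmetic, not a measurement, and it assumes
decimal terabytes, independent bit errors, a 95% confidence level, and
that every transferred bit counts as one trial. Under those assumptions,
zero errors in 17 TB bounds the bit error rate at roughly 2e-14.

```python
import math

# Assumptions (not from the measurement itself): decimal terabytes,
# independent bit errors, and a 95% confidence level.
BYTES_TRANSFERRED = 17e12
BUFFER_BYTES = 32 * 1024 * 1024
CONFIDENCE = 0.95


def ber_upper_bound(bits_tested, confidence):
    """Upper bound on the bit error rate after observing zero errors.

    Solves (1 - p)**n = 1 - confidence for p using the small-p
    approximation p = -ln(1 - confidence) / n.
    """
    return -math.log(1.0 - confidence) / bits_tested


bits_tested = BYTES_TRANSFERRED * 8
buffer_transfers = BYTES_TRANSFERRED / BUFFER_BYTES  # writes and reads combined

print(f"bits transferred:       {bits_tested:.3g}")
print(f"buffer-sized transfers: {buffer_transfers:,.0f}")
print(f"BER upper bound:        {ber_upper_bound(bits_tested, CONFIDENCE):.2g}")
```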

I believe the memory test core works correctly. I have verified it:
* using Verilog functional simulations
* by injecting errors into the 32MB buffer from the LM32: the memory
  test core reported as many errors as I had rewritten words.
* by injecting electrical noise into one of the memory data lines on
  the PCB while the test was running: the memory test core then
  reported errors.
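
The second check in the list above (corrupt a known number of words,
then confirm the reported error count matches) can be sketched in
software as follows. Python's random module stands in for the hardware
LFSR here, and the buffer size, seeds and helper names are hypothetical.

```python
import random


def expected_words(seed, count):
    """Reference sequence; random.Random is a stand-in for the LFSR."""
    rng = random.Random(seed)
    return [rng.getrandbits(32) for _ in range(count)]


def count_errors(memory, seed):
    """Count words that differ from the sequence regenerated from `seed`."""
    reference = expected_words(seed, len(memory))
    return sum(1 for got, want in zip(memory, reference) if got != want)


def inject_errors(memory, addresses):
    """Flip one bit in each listed word, as a CPU overwriting the buffer would."""
    for addr in addresses:
        memory[addr] ^= 1


seed = 42
memory = expected_words(seed, 4096)  # buffer as left by a clean write pass

# Pick 10 distinct addresses so that each injected fault is a separate word.
corrupted = random.Random(7).sample(range(len(memory)), 10)
inject_errors(memory, corrupted)

# The checker should report exactly as many errors as words were rewritten.
assert count_errors(memory, seed) == len(corrupted)
```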

I had modified the board to reduce SDRAM power supply noise by
soldering an additional 100nF capacitor onto every existing capacitor
in the area, and two additional 47uF capacitors close to each SDRAM
chip. I have not measured the influence of those capacitors, nor have I
run the memory test on an unmodified board. Flickernoise still has an
average life span of 30 seconds on the modified board.

So, if it is not the SDRAM, possible sources of problems are:
* RTL bugs: unlikely. We haven't had anything like that on the ML401,
  and sometimes the application crashes right from the start (sometimes
  with a complaint from the RTEMS memory allocator about incorrect
  free() calls). IMO, RTL bugs have a more deterministic nature. Maybe
  I'll try to get RTEMS/Flickernoise to run on the ML401 to be sure
  (should be basically a matter of taking SoC 0.5.1 and changing the
  memory map; but mouse support will be more of a problem).
* Xst bugs: maybe. We had a similar issue on the ML401 that prevented
  the Linux kernel from booting, and we needed to either disable the L1
  caches or use Synplify. Does anyone have access to Synplify with
  Spartan-6 support so we can test? Btw, disabling L1 caches on the MM1
  does not have any effect (beyond making the system slow).
* Flaky FPGA power supply: likely. At Xilinx they told me that we
  should use 1 via per power or ground pin and not share a via across
  several pins. I checked the SP601 Gerber files and they actually do
  that. The MM1, by contrast, does quite a lot of via sharing.
* Other ideas...?

S.