Hi,

To better understand the cause of the system's instability, I have designed a hardware memory tester. The core uses an LFSR to generate a pseudo-random sequence at the peak bandwidth of the DRAM chips (so it puts maximum load on the memory bus), writes it to the DRAM, resets the PRNG, reads the DRAM back, checks that the data read matches the pseudo-random sequence, and counts the erroneous words.
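
For readers who want to follow the test sequence in software, here is a minimal Python model of the write / reset PRNG / read back / compare loop described above. It is only a sketch: the 32-bit word width, the Galois LFSR form, the tap mask, the seed schedule and all function names are assumptions made for illustration, not details of the actual Verilog core.

```python
from typing import Iterator, List

# Assumed tap mask (x^32 + x^22 + x^2 + x + 1); the real core's polynomial is not given.
LFSR_TAPS = 0x80200003


def lfsr_words(seed: int) -> Iterator[int]:
    """Yield 32-bit words from a Galois LFSR started at a non-zero seed."""
    state = seed & 0xFFFFFFFF
    if state == 0:
        raise ValueError("an LFSR seeded with zero never leaves zero")
    while True:
        # Advance 32 single-bit steps per output word.
        for _ in range(32):
            lsb = state & 1
            state >>= 1
            if lsb:
                state ^= LFSR_TAPS
        yield state


def write_pass(memory: List[int], seed: int) -> None:
    """Fill the whole buffer with the pseudo-random sequence."""
    gen = lfsr_words(seed)
    for addr in range(len(memory)):
        memory[addr] = next(gen)


def read_pass(memory: List[int], seed: int) -> int:
    """Restart the PRNG from the same seed and count mismatching words."""
    gen = lfsr_words(seed)
    return sum(1 for word in memory if word != next(gen))


def run_test(memory: List[int], passes: int, first_seed: int = 1) -> int:
    """Repeat write/read passes, changing the seed on every pass."""
    errors = 0
    for i in range(passes):
        seed = first_seed + i  # hypothetical seed schedule
        write_pass(memory, seed)
        errors += read_pass(memory, seed)
    return errors
```

On a healthy buffer, `run_test([0] * 1024, passes=3)` counts no errors. Flipping bits in a few words between `write_pass` and `read_pass` makes `read_pass` report exactly that many erroneous words, which is the behaviour one would expect from an error-injection check.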
The test used a 32 megabyte buffer that was written and read back repeatedly, with an alternating PRNG seed so that different data is written each time.

Now the big news: I left the thing running for a while, transferring in excess of 17 terabytes of data between the FPGA and the RAM chips, WITHOUT A SINGLE BIT FLIPPED.

I believe the memory test core works correctly. I have verified it:
* using Verilog functional simulations;
* by injecting errors into the 32MB buffer from LM32: the memory test core reported as many errors as I had rewritten words;
* by injecting electrical noise into one of the memory data lines on the PCB while the test was running: the memory test core then reported errors.

I had modified the board to reduce SDRAM power supply noise by soldering an additional 100nF capacitor on every capacitor in the area, and two additional 47uF capacitors close to each SDRAM chip. I have not measured the influence of those capacitors, nor did I run the memory test on an unmodified board. Flickernoise still has an average life span of 30 seconds on the modified board.

So, if it is not the SDRAM, possible sources of problems are:
* RTL bugs: unlikely. We haven't had anything like that on the ML401, and sometimes the application crashes right from the start (sometimes with a complaint from the RTEMS memory allocator about incorrect free's). IMO, RTL bugs have a more deterministic nature. Maybe I'll try to get RTEMS/Flickernoise to run on the ML401 to be sure (it should basically be a matter of taking SoC 0.5.1 and changing the memory map, but mouse support will be more of a problem).
* Xst bugs: maybe. We had a similar issue on the ML401 that prevented the Linux kernel from booting, and we needed to either disable the L1 caches or use Synplify. Does anyone have access to Synplify with Spartan-6 support so we can test? Btw, disabling the L1 caches on the MM1 does not have any effect (beyond making the system slow).
* Flaky FPGA power supply: likely.
At Xilinx they told me that we should use 1 via per power or ground pin and not share a via across several pins. I checked the SP601 Gerber files and they actually do that. The MM1, by contrast, does quite a lot of via sharing.
* Other ideas...?

S.
