Summary: automated testing produced results consistent with the
work-arounds for NOR corruption indeed working as expected, but they do
not yet provide conclusive confirmation. Further testing will focus on
validating the improved rc4 circuit.

Background:

Over the last few weeks, I've been torture-testing the M1's NOR. The
background is that the M1 (at least rc2 and rc3) occasionally suffers a
corruption of NOR content, which prevents it from booting. The
corruption can be removed by re-flashing the affected partition(s).

In rc2, an improper state when ramping power was suspected, and rc3
therefore has a reset circuit that holds the NOR in reset while power
is ramping up. Unfortunately, this reset circuit doesn't solve the
problem entirely. Specifically, it has the following weakness: while it
protects the NOR when power is ramping up, it is less effective when
power is ramping down.

After Adam did numerous manual tests, we now suspect that power ramping
down is part of the condition that causes the NOR corruption.

Work-around: Locking against writing:

Modifying the circuit to keep the NOR in reset also when ramping down
should be quite possible, but it would be risky to rework the existing
rc3 boards this way. We therefore looked for a work-around that would
minimize the risk of NOR corruption in rc3 without introducing other
problems.

In manual testing, we had observed that the NOR corruption always
affected a single word, and that this word was usually located near the
beginning of the NOR, in the partition of the standby bitstream. (This
is the very first bit of "code" that's loaded when the M1 starts.)

The NOR has a feature that allows blocks (each 128 kB in size) to be
locked against accidental erasing or writing. M1 hadn't used locking
thus far.

The beginning of the NOR contains the standby bitstream and the
recovery system. All of this is content that never changes in normal
use. So it was suggested to lock these parts of the NOR, which also
happen to be where we've seen corruption strike.

Manual testing confirmed that things looked better with locking.

First round of automated testing:

To characterize the problem a bit better and to confirm the validity of
the work-around, I ran automated tests with the freshly-built Lab
Switch [1].

The first tests focused on reproducing the problem, with the NOR
unlocked. For this, I powered the M1 on, started the boot process over
JTAG, let it run the "The Tunnel" effect for a few seconds, then cut
power and repeated. Every once in a while, I checked whether the FPGA
would still load the standby bitstream. (The way the test loop was
designed, the test would proceed even if the standby bitstream was
defective, but I could see a change in the pattern of the LEDs.)

The first test run produced two NOR corruptions within approximately
400-600 cycles: one at address 0x8a, the other at address 0x2840. A CRC
check of the other partitions showed no anomalies.

I then accelerated the test by cutting power while RTEMS was still
booting. This time, I found a single corruption after 200-700 cycles,
at address 0x82.

After that, I locked the NOR and repeated the testing. After about
3000-4000 cycles, all the locked partitions - including the standby
partition - were still healthy. However, I found a single-word
corruption in the FlickerNoise partition, at address 0x8a384.

Safe shutdown:

As a further improvement, Sebastien suggested that the NOR corruption
might be avoided if users shut down their M1 "cleanly", as one would do
with a PC, instead of just cutting power. A clean shutdown can be
performed through the GUI by clicking on "Shutdown" and then "Power
off", or by pressing and holding the middle button of the M1 for a few
seconds.

To test this, I unlocked the NOR again, removed the power cycling from
the test loop, and just restarted the M1 over JTAG. After about 2500
cycles, all partitions were still in good shape.

When "powering down", M1 does not cut power to the FPGA completely but
loads the "standby" bitstream and executes it until woken up again.
To further confirm that it's safe to power cycle while running the
standby bitstream, I power cycled without trying to boot out of
standby. Another 2600 cycles later, all partitions were still intact.

Setback:

After these results, I was about to post a jubilant account of how we
vanquished the NOR corruption in just a few days. I only had one more
sanity check to do: confirm that the NOR corruption occurs again if I
remove all protection and don't do a clean shutdown.

Unfortunately, this was not met with success: after about 8800 cycles,
the NOR was still flawless. This means either that one of the
reductions of the cycle time I made early in the test series also
greatly lowered the probability of NOR corruption, or that NOR
corruption depends on some external factor.

Second round of automated testing:

I then went back to check that at least the original (slow) test loop
still produces NOR corruption. It did so almost immediately.

I then ran a long series of power-cycling loops to get more data points
on the frequency of corruptions and to also see which NOR locations are
affected. This table summarizes the results:

Last cycle       First cycle      Non-fatal     Offsets of corrupted
observed without observed with    corruption    words in standby
fatal corruption fatal corruption               (at end)
---------------- ---------------- ------------  -----------------------------
  70 ( 1h29)      223 ( 4h47)     ?             0, 2, 0x5556
 613 (13h14)      639 (13h48)     ?             0, 0x14b0
  23 ( 0h29)      117 ( 2h30)     ?             0x5140
1493 (32h16)     1584 (34h14)     ?             4, 0xa00; soc-rescue 0x20460
 981 (21h12)     1121 (24h13)     980 (21h10)   0, 0x800, 0x1000
 517 (11h09)      597 (12h53)     -             0x1160
 352 ( 7h36)      422 ( 9h06)     -             0x2000
 973 (21h03)     1002 (22h22)     501 (10h50)   6, 0x2ff2; fn-rescue 0x120000
 233 ( 5h02)      260 ( 5h37)     -             0x14b0
  75 ( 1h36)       82 ( 1h45)     -             0x1510
 171 ( 3h41)      194 ( 4h11)     -             2, 0x440
 162 ( 3h30)      168 ( 3h52)     -             0x2000

The first two columns indicate in which cycle and after how many hours
of testing I noticed a corruption that would prevent standby from
loading.
(While the power-cycling is automated, I checked the loading of standby
by occasionally having a look at the LEDs.)

I then also started to dump the beginning of the standby partition to
see if any of the first words had been corrupted. These words are used
for synchronization, and a moderate amount of corruption in them will
still allow the standby bitstream to load. On two occasions, I did
indeed see such "light" corruption before more severe corruption hit.

The last column shows where I found corrupted words. The general
tendency is that lower addresses are hit more often than higher ones.

The detailed log of my experiments can be found at [2].

Statistical analysis:

When plotting the number of cycles between corruptions [3], one can see
that the cumulative distribution is similar to that of an exponential
distribution [4] with lambda = 1/476.

An exponential distribution is what one would expect from a memoryless
process. "Memoryless" means here that the outcome of a cycle (i.e.,
whether corruption occurs or not) is not affected by the cycles
preceding it.

Possible effect of temperature:

There was one last run in this series. This one failed to produce any
corruption within 4000 cycles, or about 87 hours. I then stopped the
test because I needed parts of the setup for other work.

I did all this testing in late winter (Southern Hemisphere), with the
M1 in the path of the air flow from a window I usually keep partially
open. The days and particularly the nights of the last test were quite
warm.

I could not find a clear correlation between the likelihood of
corruption and the time of day in the other events.

Conclusion:

I was able to produce numerous NOR corruptions by power-cycling the M1
while rendering. The number of power-cycling events between corruptions
is exponentially distributed, with a mean of ~500 cycles between
corruptions.

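A minimal Python sketch of how such an estimate could be made from the
table above. It assumes that each corruption happened halfway between
the last good and the first bad observation, which is a simplification
and not necessarily how the plot in [3] was made. The names RUNS,
mean_cycles_to_corruption and prob_no_corruption are made up for this
illustration. The sketch also computes how likely a corruption-free run
of 4000 cycles would be if the process were memoryless at that rate.

```python
import math

# (last cycle observed without fatal corruption,
#  first cycle observed with fatal corruption), copied from the table.
RUNS = [
    (70, 223), (613, 639), (23, 117), (1493, 1584),
    (981, 1121), (517, 597), (352, 422), (973, 1002),
    (233, 260), (75, 82), (171, 194), (162, 168),
]


def mean_cycles_to_corruption(runs):
    """Mean cycles to corruption, using interval midpoints (assumption)."""
    midpoints = [(last_ok + first_bad) / 2 for last_ok, first_bad in runs]
    return sum(midpoints) / len(midpoints)


def prob_no_corruption(cycles, mean_cycles):
    """P(no corruption within `cycles`) under an exponential model."""
    return math.exp(-cycles / mean_cycles)


mean_est = mean_cycles_to_corruption(RUNS)
print(f"midpoint estimate of the mean: {mean_est:.0f} cycles")

# Compare the midpoint estimate with the fitted value 1/lambda = 476.
for mean in (mean_est, 476.0):
    p = prob_no_corruption(4000, mean)
    print(f"mean {mean:.0f}: P(clean run of 4000 cycles) = {p:.1e}")
```

With the table data, the midpoint estimate comes out at about 500
cycles. Under either mean, a clean run of 4000 cycles would have a
probability well below 0.1 %, which is why an external factor such as
temperature looks more plausible than plain luck.
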
Furthermore, experiments showed that the steps taken to improve the
resilience of M1rc3 to NOR corruption, namely write-locking the first
NOR partitions and shutting down M1 in a controlled way, had the
desired effect.

However, the experiments may have been affected by a slow environmental
change (e.g., temperature), and the results should therefore not be
interpreted as a solid confirmation that locking and clean shutdowns
indeed eliminate NOR corruption completely.

Locking of the standby and rescue partitions was added to the setup
process, and all M1rc3 shipped have this protection.

Future work:

For the next revision of the M1 hardware, we will try to further
improve the reset circuit, which will be accompanied by more automated
testing. Work on reducing the boot time will contribute to reducing the
cycle duration of these tests. I'll also try to run my M1 at a
controlled low temperature.

[1] http://downloads.qi-hardware.com/people/werner/labsw/web/
[2] http://projects.qi-hardware.com/index.php/p/wernermisc/source/tree/master/m1rc3/norruption/LOG
[3] http://downloads.qi-hardware.com/people/werner/m1/tmp/nor-dist.png
[4] http://en.wikipedia.org/wiki/Exponential_distribution

- Werner

_______________________________________________
http://lists.milkymist.org/listinfo.cgi/devel-milkymist.org
IRC: #milkymist@Freenode
