Summary: automated testing produced results consistent with the
  work-arounds for NOR corruption working as expected, but does not
  yet provide conclusive confirmation. Further testing will focus on
  validating the improved rc4 reset circuit.


Background:

Over the past few weeks, I've been torture-testing the M1's NOR. The
background is that M1 (at least rc2 and rc3) occasionally suffers a
corruption of NOR content, which prevents it from booting. The
corruption can be removed by re-flashing the affected partition(s).

In rc2, an improper state when ramping power was suspected, and rc3
therefore has a reset circuit that holds the NOR in reset when power
is ramping up. Unfortunately, this reset circuit doesn't solve the
problem entirely.

Specifically, the reset circuit has the following weakness: while it
protects the NOR when power is ramping up, it is less effective when
power is ramping down.  After numerous manual tests by Adam, we
now suspect that power ramping down is part of the condition that
causes the NOR corruption.


Work-around: Locking against writing:

Modifying the circuit to keep the NOR in reset also when ramping
down should be quite possible, but it would be risky to rework the
existing rc3 boards this way. We therefore looked for a work-around
that would minimize the risk of NOR corruption in rc3 without
introducing other problems.

In manual testing, we had observed that the NOR corruption always
affected a single word, and this word was usually located near the
beginning of the NOR, in the partition of the standby bitstream.
(This is the very first bit of "code" that's being loaded when the
M1 starts.)

The NOR has a feature that allows the locking of blocks (each 128 kB
in size) against accidental erasing or writing. M1 hadn't used
locking so far.

The beginning of the NOR contains the standby bitstream and the
recovery system. All this is content that never changes in normal
use. So it was suggested to lock these parts of the NOR, which also
happen to be where we've seen corruption strike.

Manual testing confirmed that things looked better with locking.


First round of automated testing:

To characterize the problem a bit better and to confirm the validity
of the work-around, I did automated tests with the freshly-built Lab
Switch [1].

The first tests focused on reproducing the problem, with the NOR
unlocked. For this, I powered the M1 on, started the boot process
over JTAG, let it run the "The Tunnel" effect for a few seconds,
then cut power and repeated. Every once in a while, I checked if the
FPGA would still load the standby bitstream. (The way the test loop
was designed, the test would proceed even if the standby bitstream
was defective, but I could see a change in the pattern of the LEDs.)
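The structure of that loop can be sketched as follows. The callables
are stand-ins with hypothetical names; the real setup drives the Lab
Switch and a JTAG pod, whose commands are not reproduced here:

```python
def run_cycles(power, boot, render, n_cycles, check_every, check):
    """Generic power-cycling loop: power on, boot over JTAG, let the
    effect render for a while, cut power; every 'check_every' cycles,
    verify that the standby bitstream still loads (in the real test,
    by looking at the LED pattern)."""
    for i in range(1, n_cycles + 1):
        power("on")
        boot()
        render()
        power("off")
        if i % check_every == 0:
            check(i)

# Dry run with stub callables; the real loop would control hardware:
events = []
run_cycles(power=lambda s: events.append(("power", s)),
           boot=lambda: events.append(("boot",)),
           render=lambda: events.append(("render",)),
           n_cycles=4, check_every=2,
           check=lambda i: events.append(("check", i)))
```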

The first test run produced two NOR corruptions within approximately
400-600 cycles. One at address 0x8a, the other at address 0x2840. A
CRC check of the other partitions showed no anomalies.
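How the CRC check is performed isn't spelled out above; as an
illustration, checking one partition of a dumped NOR image could look
like this (CRC-32 is my choice of checksum, and the offset/length
arguments are placeholders, not the actual M1 flash map):

```python
import zlib

def partition_crc(image_path, offset, length):
    """CRC-32 of one partition inside a dumped NOR image.  Comparing
    this against the CRC of a known-good dump reveals corruption in
    that partition."""
    with open(image_path, "rb") as f:
        f.seek(offset)
        return zlib.crc32(f.read(length)) & 0xffffffff
```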

I then accelerated the test by cutting power while RTEMS was still
booting. This time, I found a single corruption after 200-700
cycles, at address 0x82.

After that, I locked the NOR and repeated the testing. After about
3000-4000 cycles, all the locked partitions - including the standby
partition - were still healthy. However, I found a single-word
corruption in the FlickerNoise partition, at address 0x8a384.
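For illustration, mapping the corrupted offsets onto the NOR's 128 kB
lock blocks shows why locking the first blocks covers the failures
seen with the NOR unlocked, while the FlickerNoise hit lies in a
higher, unlocked block (plain block arithmetic; no real flash map
assumed):

```python
BLOCK_SIZE = 128 * 1024   # lock granularity of the NOR (128 kB)

def lock_block(offset):
    """Index of the 128 kB block that contains 'offset'."""
    return offset // BLOCK_SIZE

# Corruptions seen with the NOR unlocked all fall into block 0:
for off in (0x8a, 0x2840, 0x82):
    assert lock_block(off) == 0

# The corruption seen after locking, at 0x8a384 in the FlickerNoise
# partition, sits in a higher block that was left unlocked:
print(lock_block(0x8a384))   # -> 4
```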


Safe shutdown:

As a further improvement, Sebastien suggested that the NOR corruption
may be avoided if users shut down their M1 "cleanly", like one would
do with a PC, instead of just cutting power.

A clean shutdown can be performed through the GUI by clicking on
"Shutdown" and then "Power off", or by pressing and holding the
middle button of the M1 for a few seconds.

To test this, I unlocked the NOR again, removed the power cycling
from the test loop, and just restarted the M1 over JTAG. After about
2500 cycles, all partitions were still in good shape.

When "powering down", M1 does not cut power to the FPGA completely
but loads the "standby" bitstream and executes it until woken up
again. To further confirm that it's safe to power cycle while
running the standby bitstream, I power cycled without trying to boot
out of standby. Another 2600 cycles later, all partitions were still
intact.


Setback:

After these results, I was about to post a jubilant account of how
we vanquished the NOR corruption in just a few days. I only had one
more sanity check to do: confirm that the NOR corruption occurs again
if I remove all protection and don't do a clean shutdown.

Unfortunately, no such luck: after about 8800 cycles, the NOR was
still flawless.

This means either that one of the reductions of the cycle time I made
early in the test series also greatly lowered the probability of NOR
corruption, or that NOR corruption depends on some external factor.


Second round of automated testing:

I then went back to check that at least the original (slow) test loop
still produced NOR corruption. It did, almost immediately.

I then ran a long series of power-cycling loops to get more data
points on the frequency of corruptions and to also see which NOR
locations are affected.

This table summarizes the results:

Last cycle      First cycle     Non-fatal       Offsets of corrupted
without fatal   with fatal      corruption      words in standby
corruption      corruption      (cycle)         (other partitions at end)
--------------- --------------- ------------    -------------------------------
  70 ( 1h29)     223 ( 4h47)    ?               0, 2, 0x5556
 613 (13h14)     639 (13h48)    ?               0, 0x14b0
  23 ( 0h29)     117 ( 2h30)    ?               0x5140
1493 (32h16)    1584 (34h14)    ?               4, 0xa00; soc-rescue 0x20460
 981 (21h12)    1121 (24h13)     980 (21h10)    0, 0x800, 0x1000
 517 (11h09)     597 (12h53)    -               0x1160
 352 ( 7h36)     422 ( 9h06)    -               0x2000
 973 (21h03)    1002 (22h22)     501 (10h50)    6, 0x2ff2; fn-rescue 0x120000
 233 ( 5h02)     260 ( 5h37)    -               0x14b0
  75 ( 1h36)      82 ( 1h45)    -               0x1510
 171 ( 3h41)     194 ( 4h11)    -               2, 0x440
 162 ( 3h30)     168 ( 3h52)    -               0x2000

The first columns indicate in which cycle and after how many hours of
testing I noticed a corruption that would prevent standby from
loading. (While the power-cycling is automated, I checked the loading
of standby by occasionally having a look at the LEDs.)

I then also started to dump the beginning of the standby partition to
see if any of the first words had been corrupted. These words are
used for synchronization, and a moderate amount of corruption in
them still allows the standby bitstream to load. On two occasions, I
did indeed see such "light" corruption before more severe corruption
hit.

The last column shows where I found corrupted words. The general
tendency is that lower addresses are hit more often than higher ones.

The detailed log of my experiments can be found at [2].


Statistical analysis:

When plotting the number of cycles between corruptions [3], one can
see that the cumulative distribution is similar to that of an
exponential distribution [4] with lambda = 1/476.

An exponential distribution is what one would expect from a
memoryless process. "Memoryless" means here that the outcome of a
cycle (i.e., whether corruption occurs or not) is not affected by the
cycles preceding it.
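To put a number on this (a back-of-the-envelope check using the fitted
rate): under such a memoryless model, the 8800 corruption-free cycles
of the earlier fast-loop run would be astronomically unlikely, which
fits the suspicion that the faster loop changed the conditions rather
than the test simply getting lucky:

```python
import math

LAMBDA = 1 / 476.0   # per-cycle corruption rate from the fit

def p_clean(n, lam=LAMBDA):
    """Probability of n consecutive corruption-free cycles under the
    memoryless (exponential) model."""
    return math.exp(-lam * n)

print(p_clean(8800))   # on the order of 1e-8
```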


Possible effect of temperature:

There was one last run in this series. This one failed to produce any
corruption within 4000 cycles or about 87 hours. I then stopped the
test because I needed parts of the setup for other work.

I did all this testing in late winter (Southern Hemisphere), with the
M1 in the path of the air flow from a window I usually keep partially
open. The days and particularly the nights of the last test were
quite warm. I could not find a clear correlation between the
likelihood of corruption and the time of day in the other events.


Conclusion:

I was able to produce numerous NOR corruptions by power-cycling the
M1 while rendering. The number of power-cycling events between
corruptions is exponentially distributed, with a mean time between
corruptions of ~500 cycles.

Furthermore, experiments showed that steps taken to improve the
resilience of M1rc3 to NOR corruption, namely write-locking the first
NOR partitions and shutting down M1 in a controlled way, had the
desired effect.

However, the experiments may have been affected by a slow
environmental change (e.g., temperature), and the results should
therefore not be interpreted as a solid confirmation that locking
and clean shutdowns indeed eliminate NOR corruption completely.

Locking of the standby and rescue partitions was added to the setup
process, and all M1rc3 units shipped have this protection.


Future work:

For the next revision of the M1 hardware, we will try to further
improve the reset circuit, which will be accompanied by more
automated testing. Work on reducing the boot time will contribute to
reducing the cycle duration of these tests. I'll also try to run my
M1 at a controlled low temperature.


[1] http://downloads.qi-hardware.com/people/werner/labsw/web/
[2] http://projects.qi-hardware.com/index.php/p/wernermisc/source/tree/master/m1rc3/norruption/LOG
[3] http://downloads.qi-hardware.com/people/werner/m1/tmp/nor-dist.png
[4] http://en.wikipedia.org/wiki/Exponential_distribution

- Werner