This time I ran the test loop for 10516 cycles, or a bit more than
55 hours. The objective was to gather more data on corruption of
the "standby" partition. I locked the "read-only" partitions after
"standby" with:
http://projects.qi-hardware.com/index.php/p/wernermisc/source/tree/master/m1rc3/norruption/2/lockmost

The test produced a total of 22 standby failures. They occurred in
the following cycles:

  369   786  1730  1844  2356  2565  3703  3760  4356  5500  5725
 6091  6390  6811  6953  7022  7877  8186  9744  9752  9980 10218

This is the probability distribution for the number of cycles
between corruptions:
http://downloads.qi-hardware.com/people/werner/m1/nor/d1/dist.png
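
The intervals behind such a distribution can be recomputed from the
cycle numbers listed above. A minimal sketch in Python, assuming the
first interval is counted from the start of the run (the plot itself
may have been generated differently):

```python
# Cycles in which a standby failure was detected (from the list above).
failures = [
    369, 786, 1730, 1844, 2356, 2565, 3703, 3760, 4356, 5500, 5725,
    6091, 6390, 6811, 6953, 7022, 7877, 8186, 9744, 9752, 9980, 10218,
]
total_cycles = 10516

# Assumption: the first interval is measured from cycle 0.
intervals = [b - a for a, b in zip([0] + failures[:-1], failures)]

# Average number of cycles per failure over the whole run.
cycles_per_failure = total_cycles / len(failures)

print("intervals:", intervals)
print("shortest:", min(intervals), "longest:", max(intervals))
print("cycles per failure: %.0f" % cycles_per_failure)
```

With these numbers the run averages 478 cycles per failure, with
individual intervals ranging from 8 to 1558 cycles.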

The distribution is basically the same as in the previous test round.
Below is a compact representation of where corruptions happened and
what form they had:

Address                    | Corruption pattern
_ = 0, 1 = 1               | 0 = 1->0, _ = 0 unchanged, 1 = 1 unchanged
------------------------------------------------------------------------
00000 ____________________ | 00001011 01000000 | d1/2356-corrupt.bin
                           | 00000000 00000000 | d1/4356-corrupt.bin
                           | 00000000 00000100 | d1/7877-corrupt.bin
                           | 10000110 01000000 | d1/8186-corrupt.bin
0000a ________________1_1_ | 11100110 00001100 | d1/9744-corrupt.bin
0000c ________________11__ | 00000000 00010000 | d1/2356-corrupt.bin
00010 _______________1____ | _1_1_1_1 1__01__1 | d1/1730-corrupt.bin 1/1
                           | _0_0_0_0 0__00__0 | d1/9980-corrupt.bin 1/1
00020 ______________1_____ | 0_0000__ ________ | d1/5500-corrupt.bin 1/2
                           | 0_0000__ ________ | d1/7877-corrupt.bin 1/1
00040 _____________1______ | _____0__ ________ | d1/3703-corrupt.bin 1/2
                           | _____0__ ________ | d1/5500-corrupt.bin 2/2
00050 _____________1_1____ | _____0__ ________ | d1/2356-corrupt.bin 1/2
00066 _____________11__11_ | 0___10__ 1____111 | d1/8186-corrupt.bin 1/1
00082 ____________1_____1_ | _0__11__ 1_____00 | d1/3703-corrupt.bin 2/2
                           | _0__00__ 0_____11 | d1/3760-corrupt.bin 1/1
                           | _0__00__ 0_____01 | d1/9752-corrupt.bin 1/1
00086 ____________1____11_ | _0__11__ 0____000 | d1/6390-corrupt.bin 1/1
000a0 ____________1_1_____ | ________ 0_______ | d1/6091-corrupt.bin 1/1
00310 __________11___1____ | ________ __000___ | d1/369-corrupt.bin 1/1
00480 _________1__1_______ | ________ 0_0_0___ | d1/10218-corrupt.bin 1/1
0049e _________1__1__1111_ | ________ _0______ | d1/4356-corrupt.bin 1/1
00840 ________1____1______ | ________ __0_0___ | d1/786-corrupt.bin 1/1
00850 ________1____1_1____ | ________ 1_0_0___ | d1/7022-corrupt.bin 1/1
00862 ________1____11___1_ | 00__00__ __0__00_ | d1/6811-corrupt.bin 1/1
00c10 ________11_____1____ | ________ __0_____ | d1/2356-corrupt.bin 2/2
018d0 _______11___11_1____ | ________ 0100____ | d1/1844-corrupt.bin 1/1
03880 ______111___1_______ | ________ 1___0___ | d1/2565-corrupt.bin 1/1
03ed0 ______11111_11_1____ | _____0__ ________ | d1/5725-corrupt.bin 1/2
04402 _____1___1________1_ | 01010101 00010000 | d1/5725-corrupt.bin 2/2
20080 __1_________1_______ | 01__11__ __1__10_ | d1/9744-corrupt.bin 1/1
200e0 __1_________111_____ | 11__11__ __0__00_ | d1/6953-corrupt.bin 1/1
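
For anyone who wants to process the table mechanically, the two
columns can be decoded roughly as follows. This is only a sketch: the
helper names addr_bitmap and tally are made up for illustration, and
the 20-bit width is simply what the address column above uses.

```python
def addr_bitmap(addr, width=20):
    """Render an address like the left column: '_' for 0 bits, '1' for 1 bits."""
    return format(addr, "0%db" % width).replace("0", "_")


def tally(pattern):
    """Count the '0' and '1' symbols in a corruption pattern."""
    bits = pattern.replace(" ", "")
    return bits.count("0"), bits.count("1")


print(addr_bitmap(0x00066))            # _____________11__11_
print(tally("0___10__ 1____111"))      # (2, 5)
```

Summing the tally over all rows gives a quick way to check the balance
of zeroes and ones in the patterns.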

Corruptions at addresses below 0x10 don't affect the boot process.
(Or at least not all of them do.)

I don't see any surprising pattern in the above. The general trend
towards having more zeroes than ones could have many causes and does
not point to any specific underlying mechanism.

One new insight is that multiple corruptions (at least up to two) can
occur within a single cycle. In the previous experiment, they could
also have been the result of an accumulation over nearby cycles while
I - literally - wasn't watching.

Getting about the same average interval between corruptions as in the
previous test indicates that the corruption is linked to the number
of cycles and not to the overall run time.

I also looked for anything unexpected in the correlation of adjacent
intervals between corruptions:
http://downloads.qi-hardware.com/people/werner/m1/nor/d1/corr.png

This distribution looks reasonable, as far as I can tell with so few
samples. For comparison, here is a simulation of 100 exponentially-
distributed samples with lambda = 1/478:
http://downloads.qi-hardware.com/people/werner/m1/nor/d1/corr-sim.png
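
A comparable simulation is easy to reproduce. The sketch below draws
100 exponentially distributed intervals with lambda = 1/478 and pairs
each interval with its successor, which is the kind of data a plot of
adjacent intervals shows. The fixed seed and the use of a Pearson
coefficient as a one-number summary are assumptions made here, not
necessarily what produced corr-sim.png.

```python
import math
import random

LAMBDA = 1 / 478     # one corruption per 478 cycles on average
N = 100

rng = random.Random(1)   # fixed seed (arbitrary) so the run is repeatable
samples = [rng.expovariate(LAMBDA) for _ in range(N)]

# Each point is (interval i, interval i+1).
pairs = list(zip(samples, samples[1:]))


def pearson(points):
    """Pearson correlation coefficient of a list of (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cov = sum((x - mx) * (y - my) for x, y in points)
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in points))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in points))
    return cov / (sx * sy)


print("mean interval: %.1f" % (sum(samples) / N))
print("lag-1 correlation: %.3f" % pearson(pairs))
```

For independent samples the coefficient should be close to zero,
though with only about a hundred points it can wander noticeably.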

The various scripts used for this experiment live here:
http://projects.qi-hardware.com/index.php/p/wernermisc/source/tree/master/m1rc3/norruption/2/

A tarball of raw results (console log and standby partition dumps)
is here:
http://downloads.qi-hardware.com/people/werner/m1/nor/d1/raw.tar.bz2

Next: a subtle gotcha.

- Werner