The exploration of the dungeons of NORia has finally led to a meeting with the supposed arch-enemy: the power-down behaviour of the reset circuit.

Background
----------

M1rc3 has a special reset chip (U24, [1]) that resets the FPGA and the NOR
when powering up and also holds them in reset whenever the 3.3 V rail
drops below 2.63 V. The expectation was that this would prevent the NOR
corruption. Alas, it didn't.

After poking around for a while, we started to suspect that, when
powering down, the 3.3 V rail may drop more slowly than some of the other
rails, particularly the power rails supplying the FPGA core. In that
case, the FPGA could get confused and send out weird signals. These would
then be properly amplified by the FPGA's I/O drivers (operating at
3.3 V), received by the NOR (also operating at 3.3 V), and every once in
a while form a valid command that the NOR still has enough time to
process before it also loses power.

Power rails can drop at different speeds because each has its own
regulator and output buffering. It's not trivial to ensure that rails
come up or go down in a specific order, and it's also difficult to
measure the exact order because it can vary a lot with what the system is
doing at the time of the power cut.

However, we know that no power rail can drop faster than the power input:
if a rail did drop faster, the regulator could simply draw more power
from the input to bring the rail back up again.

Thus the idea was born to drive the reset chip not from the regulated
3.3 V rail but from the filtered but unregulated 5 V input. Also, to make
sure we cut out in time, the threshold voltage of the reset chip should
be closer to 5 V.


The rework
----------

I removed the old reset chip and replaced it with an APX803-44SAG-7 [2],
which has a threshold voltage of 4.38 V. To isolate the input pin from
the 3.3 V pad on the PCB, I placed a piece of single-sided 0.36 mm FR4
board [3] between chip and pad. The closest 5 V source I could find is
C125, part of the MIDI TX circuit.

This is what it looks like:
http://downloads.qi-hardware.com/people/werner/m1/nor/d8/u24-to-5V.jpg


M1 behaviour after rework
-------------------------

Immediately after the rework, the M1 behaved a little oddly. It did reset
and enter standby, but when I tried to get into the BIOS to run the CRC
test, it just stopped (maybe a spurious reset). I'm not sure what
happened there.

Later, I checked the voltages, and they're all good: 4.98 V at the DC
jack and 4.94 V at U24 pin 3.

Eventually, it gave in and behaved properly. I then proceeded to run the
usual power-cycling loop.


Testing
-------

I ran the power-cycling test for 4284 cycles. It did not report a single
corruption. Afterwards, I did a CRC check, which also showed that
everything was in good health (*). Last but not least, I dumped the lock
bits and verified that block 0 was indeed unlocked. This means that the
test seems to be valid.

If we assume a previous corruption probability of 1/500 per cycle, the
probability of passing 4284 cycles without hitting a single corruption
would be about 0.02%.

(*) In case you're checking my log [4]: the rescue BIOS failed the CRC
    check. I think it's the MAC address that causes the CRC to fail. I
    never bothered to fix this, so that failure is normal and expected.


Conclusion
----------

It seems that changing the reset circuit such that it always resets FPGA
and NOR when power is ramping down does reduce the rate of NOR
corruptions substantially and may even eliminate the problem entirely.

The instabilities observed immediately after the rework need further
examination. They may have been caused by residues of the rework (e.g.,
flux that hasn't dried completely), but another possible explanation
would be short voltage drops on the 5 V rail during load changes.

We may also consider using a reset chip with a lower threshold voltage.
E.g., the APX803-40SAG-7 with a nominal threshold of 4.0 V should still
give the 3.3 V regulator [5] enough room to do its work, while being less
sensitive to small upsets of the 5 V supply.


What's next
-----------

I'll play with my M1 in "regular use" for a bit and watch for unexplained
resets/hangs/etc.

After that, a longer test run should provide more certainty that the
corruption is really gone. The confidence in that grows roughly
exponentially with the number of cycles: each 5-6 hours of testing lowers
the probability of a lucky clean run by a factor of ten. So a couple of
days should be sufficient.

Last but not least, this needs testing with the supply voltage at its
limits, e.g., the 4.75 V to 5.25 V allowed for a USB host.

[1] http://www.ait-ic.com/uploads//2009-10/21/_1256089836_7ol2c.pdf
[2] http://www.diodes.com/datasheets/APX803.pdf
[3] http://search.digikey.com/us/en/products/PC94/PC94-ND/354417
[4] http://downloads.qi-hardware.com/people/werner/m1/nor/d8/raw.tar.bz2
[5] http://www.national.com/profile/snip.cgi/openDS=LP38690

- Werner

_______________________________________________
http://lists.milkymist.org/listinfo.cgi/devel-milkymist.org
IRC: #milkymist@Freenode
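
For anyone who wants to re-check the figures quoted under "Testing" and
"What's next", here is a minimal Python sketch. It assumes that power
cycles are independent and that the pre-rework corruption probability
really was 1/500 per cycle, as stated above. The function names are made
up for this illustration and are not part of any test script.

```python
import math

# Assumptions taken from the post: 1/500 corruption chance per power cycle
# before the rework, and 4284 clean cycles after it.
P_CORRUPT = 1 / 500
CYCLES = 4284


def pass_probability(cycles, p_corrupt=P_CORRUPT):
    """Chance of seeing no corruption in `cycles` independent power cycles."""
    return (1 - p_corrupt) ** cycles


def cycles_per_decade(p_corrupt=P_CORRUPT):
    """Cycles needed to shrink the pass probability by a factor of ten."""
    return math.log(10) / -math.log(1 - p_corrupt)


if __name__ == "__main__":
    print(f"P(clean run of {CYCLES} cycles) = {pass_probability(CYCLES):.4%}")
    print(f"cycles per factor of ten = {cycles_per_decade():.0f}")
```

Under these assumptions the first figure comes out just under 0.02%, in
line with the value quoted above, and a factor of ten takes roughly 1150
cycles. If 5-6 hours correspond to one factor of ten, that would imply
something like 16-19 s per power cycle, but that is an inference from the
numbers here, not a measured value.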
