Hello
This is quite a complete report with a lot of details; it shows that
you have put a great amount of mental energy into this project, so
congratulations and thank you.
What I'm about to write is not a criticism but a complement that may
interest you.
Since I've worked with critical flash systems for more than 10 years
now, I have read the part of your document that deals with power loss
with great interest.
Resilience to power loss is *absolutely critical* to any embedded
filesystem.
Did you do power-interruption tests on your code? Can you guarantee that
the on-device format stays consistent/recoverable when the power is cut
at any code location? Did you identify power-critical code sections
(critical with respect to power cuts, not CPU access)?
Remember: if it's not tested, it doesn't work...
The most critical part of your work is the journal. Do you make sure
that the checksum is written (1) last and (2) completely? How do you
make sure that the journal entries are correctly applied to their final
storage locations? A sketch of what I mean by "written last" follows.
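
To be concrete, here is the kind of commit ordering I have in mind.
This is only a sketch; flash_write(), flash_sync() and journal_crc()
are placeholders, not the real mnemofs or NuttX MTD interfaces:

/* Sketch of a commit sequence that guarantees the checksum is the last
 * thing to reach the flash.  All names are hypothetical placeholders. */

#include <stddef.h>
#include <stdint.h>

int flash_write(uint32_t off, const void *buf, size_t len);   /* assumed */
int flash_sync(void);                                         /* assumed */
uint32_t journal_crc(const uint8_t *buf, size_t len);         /* assumed */

int journal_commit(uint32_t off, const uint8_t *entry, size_t len)
{
  uint32_t crc = journal_crc(entry, len);
  int ret;

  /* 1. Write the journal payload... */

  ret = flash_write(off, entry, len);
  if (ret < 0)
    {
      return ret;
    }

  /* 2. ...and force it onto the flash before the checksum exists
   *    anywhere on the device. */

  ret = flash_sync();
  if (ret < 0)
    {
      return ret;
    }

  /* 3. Only then write the checksum.  A power cut before this point
   *    leaves an entry without a valid checksum, which the mount code
   *    must simply ignore. */

  ret = flash_write(off + len, &crc, sizeof(crc));
  if (ret < 0)
    {
      return ret;
    }

  return flash_sync();
}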
The biggest problem in that area is flash metastability: the checksum
MIGHT appear correct on one read, but not on the next access. The
reason is the analog nature of flash writes (and erases), which inject
a number of electrons into a floating gate. 0 and 1 bits are separated
by thresholds, but these thresholds vary with temperature and time
(wear), so a bit may read as correct because it sits just at the
threshold, while the next access returns a flipped bit.
These issues are NOT theoretical; they happen all the time in all flash
devices. You just have to tickle the devices often enough at the right
moment, and you will begin to see them.
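
A cheap software counter-measure is to never trust a single read of the
checksum. For instance (again, flash_read() and JOURNAL_CSUM_SIZE are
just placeholder names, not your actual code):

/* Sketch: read the stored checksum twice and require both reads to
 * agree before trusting it. */

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define JOURNAL_CSUM_SIZE 4

int flash_read(uint32_t off, void *buf, size_t len);   /* assumed */

bool csum_read_is_stable(uint32_t off)
{
  uint8_t first[JOURNAL_CSUM_SIZE];
  uint8_t second[JOURNAL_CSUM_SIZE];

  if (flash_read(off, first, sizeof(first)) < 0 ||
      flash_read(off, second, sizeof(second)) < 0)
    {
      return false;
    }

  /* A cell sitting right at the read threshold may flip between two
   * consecutive reads; treat any disagreement as "do not trust". */

  return memcmp(first, second, sizeof(first)) == 0;
}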
These tests require the ability to fully cut the power to a test board
with microsecond precision. No need for pulses, just an adjustable
delay. The test is triggered by a command that also starts a countdown,
and the timeout is increased microsecond by microsecond until you reach
the point where the flash is actually written. Usually, there is a
point where the timeouts result in partial writes. Then the board will
start acting funny and will start entering the error branches that are
usually never taken. Board capacitors are not a problem; they just
increase the delays. They always discharge the same way during all
repeated tests, so they have no influence on the process.
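
The sweep itself is trivial to script. Roughly, and with every function
below standing in for whatever your power-cut rig and target firmware
actually expose:

/* Rough sketch of the delay sweep described above.  relay_arm_cut(),
 * dut_start_write() and dut_check_fs() are hypothetical placeholders. */

#include <stdint.h>
#include <stdio.h>

void relay_arm_cut(uint32_t delay_us);  /* cut power delay_us after trigger */
void dut_start_write(void);             /* command that starts the countdown */
int  dut_check_fs(void);                /* remount and verify after power-up */

int main(void)
{
  uint32_t delay_us;

  for (delay_us = 0; delay_us < 200000; delay_us++)
    {
      relay_arm_cut(delay_us);
      dut_start_write();

      /* Whatever the cut point, the filesystem must always come back
       * mountable and consistent. */

      if (dut_check_fs() != 0)
        {
          printf("inconsistent state with a %lu us cut delay\n",
                 (unsigned long)delay_us);
          return 1;
        }
    }

  return 0;
}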
It is quite hard to make sure that everything is correct; it takes a
sufficient amount of dedication just to be aware of the potential
problems.
How do you know, in your filesystem, that the checksum has been written
only after all the previous data are written? How do you know the
checksum write is complete? There are software techniques for this, but
they also require the flash to support overwrites, so making this work
with ECC is harder (but possible).
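
One classic technique of that kind is a "seal" byte programmed from the
erased 0xFF down to 0x00 as the very last operation of the commit. That
byte lives in a page that has already been programmed, which is exactly
why overwrite support matters and why hardware ECC gets in the way.
Sketched below with hypothetical names only, not your actual on-flash
format:

/* Sketch of a "seal marker" commit: the record is only trusted once a
 * single byte has been programmed from 0xFF to 0x00 AFTER the payload
 * and the checksum.  All names are placeholders. */

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SEAL_COMMITTED 0x00u

int flash_write(uint32_t off, const void *buf, size_t len);   /* assumed */
int flash_read(uint32_t off, void *buf, size_t len);          /* assumed */
uint32_t record_crc_on_flash(uint32_t data_off, size_t len);  /* assumed */

int record_seal(uint32_t seal_off)
{
  uint8_t seal = SEAL_COMMITTED;

  /* Last write of the whole commit: a single byte, so it is either
   * fully there or not there at all. */

  return flash_write(seal_off, &seal, sizeof(seal));
}

bool record_is_committed(uint32_t seal_off, uint32_t crc_off,
                         uint32_t data_off, size_t data_len)
{
  uint8_t seal;
  uint32_t stored;

  if (flash_read(seal_off, &seal, sizeof(seal)) < 0 ||
      seal != SEAL_COMMITTED)
    {
      return false;
    }

  if (flash_read(crc_off, &stored, sizeof(stored)) < 0)
    {
      return false;
    }

  return stored == record_crc_on_flash(data_off, data_len);
}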
Fine details absolutely matter here.
Thanks,
Sebastien
On 12/09/2024 17:48, Saurav Pal wrote:
Hi all,
Here's my final report <https://resyfer.github.io/blogs/mnemofs/endeval/>
on mnemofs, a NAND flash file system for NuttX, on which I worked during my
tenure as a GSoC 2024 Contributor for ASF. I would be grateful for any
suggestions and criticism.
Best regards,
Saurav Pal.