Hi,

No idea, Tomek! Flash works by storing charge (electrons) in a floating gate, like filling a bottle with water. An empty (erased) bottle reads as one, and above a certain fill level it reads as zero.

The problem is that the bottle material has a temperature coefficient, so the bottle becomes larger when hot.

My feeling is that threshold effects WILL appear given the right delay for the power cut; you can't avoid them. We encounter these regularly during development of our products, and we have explicit tests for them.
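
To make the bottle picture concrete, here is a toy Python sketch. Everything in it is an assumption made up for illustration (arbitrary charge units, a linear temperature coefficient, Gaussian read noise); it is not real device physics. It only shows why a cell left near the threshold by an interrupted write can read differently from one access to the next.

```python
import random

THRESHOLD = 0.5  # nominal read threshold, arbitrary units (assumed)

def read_bit(charge, temperature_c, rng, temp_coeff=0.001, noise=0.01):
    """Return the bit sensed for a cell holding `charge`.

    Assumptions: the threshold rises linearly with temperature (the
    bottle gets larger when hot) and each read adds a little noise.
    """
    threshold = THRESHOLD + temp_coeff * (temperature_c - 25.0)
    sensed = charge + rng.gauss(0.0, noise)
    # Above the threshold the cell reads 0, otherwise it reads 1.
    return 0 if sensed > threshold else 1

def read_many(charge, temperature_c, reads=1000, seed=0):
    """Read the same cell repeatedly with a seeded noise source."""
    rng = random.Random(seed)
    return [read_bit(charge, temperature_c, rng) for _ in range(reads)]

full = read_many(charge=1.0, temperature_c=25.0)      # fully programmed cell
empty = read_many(charge=0.0, temperature_c=25.0)     # erased cell
marginal = read_many(charge=0.5, temperature_c=25.0)  # interrupted write

print("full:", set(full), "empty:", set(empty), "marginal:", set(marginal))
```

In this model the fully programmed and erased cells always read the same value, while the marginal cell returns a mix of zeros and ones. A cell slightly above the nominal threshold can also change value when only the temperature changes.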


Alan: I don't know how effective my help will be, since I have so many things pending.

For NAND you will probably map all functions to MTD and error correction features to ioctls. Many of these will vary according to the model.

To get some variety, I have added these Micron SLC NAND devices to my next Farnell order:

https://fr.farnell.com/micron/mt29f1g01abafdwb-it-f/flash-memory-1gbit-40-to-85deg/dp/3954455

https://fr.farnell.com/micron/mt29f4g01abafdwb-it-f/flash-memory-4gbit-40-to-85deg/dp/3954460

These are UPDFN packages, which are the same size as a SOIC-8.

And wow, these devices have much larger capacity than NOR flash!


I have zero experience with NAND, but I can run code on some boards. For example, I can try it on a Nucleo board or on a Pi Pico.

Next we need some kind of system to control a delay with microsecond precision, in order to cut power to the board (or do a RESET).

That kind of test will also help test other file systems like littlefs on NOR flash, which has the same issues.
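
As a rough idea of what the delay sweep could look like on the host side, here is a Python sketch that runs against a simulated target instead of real hardware. The function names, the timing values (`start_us`, `program_us`) and the way a partial write is modelled are all hypothetical; on a real setup the simulated function would be replaced by "send the trigger command, cut power after the delay, power up again and read the page back".

```python
import random

PAGE = bytes([0x00, 0xA5, 0x3C, 0xFF])  # data the target tries to write
ERASED = bytes([0xFF] * len(PAGE))

def simulated_write_with_power_cut(delay_us, start_us=120, program_us=200, seed=0):
    """Return the page content after power is cut `delay_us` after the trigger.

    Hypothetical stand-in for real hardware: the write starts `start_us`
    after the trigger and takes `program_us`. A cut inside that window
    leaves only some of the bits programmed.
    """
    if delay_us <= start_us:
        return ERASED  # power was cut before the write started
    if delay_us >= start_us + program_us:
        return PAGE    # the write had time to finish
    progress = (delay_us - start_us) / program_us
    rng = random.Random(seed + delay_us)
    out = bytearray(ERASED)
    for i, target in enumerate(PAGE):
        for bit in range(8):
            wants_zero = not (target >> bit) & 1
            if wants_zero and rng.random() < progress:
                out[i] &= ~(1 << bit) & 0xFF  # this bit got programmed
    return bytes(out)

def classify(content):
    """Label what was found in the page after power-up."""
    if content == ERASED:
        return "erased"
    if content == PAGE:
        return "complete"
    return "partial"

def sweep(max_delay_us=400, step_us=1):
    """Increase the cut delay step by step and record each outcome."""
    return {d: classify(simulated_write_with_power_cut(d))
            for d in range(0, max_delay_us + 1, step_us)}

results = sweep()
partial_window = [d for d, outcome in results.items() if outcome == "partial"]
print("partial writes seen between", partial_window[0], "and",
      partial_window[-1], "us")
```

The delays that produce "partial" results are the interesting ones: those are the runs where the file system has to recover from a half-written page.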


I won't be shy about it: help with software would be welcome. I can execute things on hardware, but I won't lead the project because I just can't. And I mean my brain can't.


Sebastien


On 13/09/2024 15:02, Tomek CEDRO wrote:
very nice discussion!

can problems with NAND create a semi-deterministic group of well defined issues with random characteristics?

if so then we could create a model nand for sim with controllable errors in order to verify various nand drivers, filesystems, etc?

:-)
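
A minimal Python sketch of what such a model could look like, assuming a seeded random source so that injected errors are reproducible. The class name `SimNand`, its parameters and the per-page erase are invented for illustration (real NAND erases whole blocks); this is not the existing NuttX NAND simulator.

```python
import random

class SimNand:
    """Toy NAND model with seeded, controllable read bit flips (hypothetical API)."""

    def __init__(self, pages=64, page_size=16, flip_rate=0.0, seed=0):
        self.page_size = page_size
        self.pages = [bytearray([0xFF] * page_size) for _ in range(pages)]
        self.flip_rate = flip_rate      # probability that a read bit flips
        self.rng = random.Random(seed)  # same seed gives the same errors

    def erase(self, page):
        """Reset a page to all ones (simplified: real NAND erases by block)."""
        self.pages[page] = bytearray([0xFF] * self.page_size)

    def program(self, page, data):
        """Program a page; programming can only clear bits (1 -> 0)."""
        if len(data) != self.page_size:
            raise ValueError("data must be exactly one page long")
        stored = self.pages[page]
        for i, byte in enumerate(data):
            stored[i] &= byte

    def read(self, page):
        """Return the page content with random bit flips injected."""
        out = bytearray(self.pages[page])
        for i in range(len(out)):
            for bit in range(8):
                if self.rng.random() < self.flip_rate:
                    out[i] ^= 1 << bit
        return bytes(out)

nand = SimNand(flip_rate=0.01, seed=1234)
nand.program(0, bytes(range(16)))
print(nand.read(0).hex())
```

Because the error source is seeded, a failing driver or file system test can be replayed with exactly the same error pattern.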

--
CeDeROM, SQ7MHZ, http://www.tomek.cedro.info

On Fri, Sep 13, 2024, 14:52 Alan C. Assis <acas...@gmail.com> wrote:

    Hi Sebastien,

    Thank you for your helpful considerations.

    As I explained before he used the SIM NAND Simulator that he created
    and integrated on NuttX.

    Also as I explained in my previous email, we need help to test in real
    hardware.

    Since you have previous experience with NAND Flash, maybe you could
    help here (of course, if you are interested to help)!

    First we need to create a driver for a SPI NAND Flash (I bought this
    model: https://aliexpress.com/item/1005005307786079.html) and use it
    with MnemoFS.

    This model that I selected has internal error detection, etc, it means
    we don't need to worry about taking care of bad blocks ourselves.

    If you look inside nuttx/drivers/mtd/ many of the pieces we need are
    already there, we just need to understand how to use SPI NAND, FTL,
    MTD, etc.

    Xiang, since you and your team ported YAFFS to NuttX, maybe you guys
    could help us to get MnemoFS working on real flash on NuttX.

    BR,

    Alan



    On Fri, Sep 13, 2024 at 6:17 AM Sebastien Lorquet <sebast...@lorquet.fr> wrote:

    > Hello
    >
    >
    > This is quite a complete report with a lot of detail. It shows that
    > you have put a large amount of mental energy into this project, so
    > congratulations and thank you.
    >
    > What I'm about to write is not a criticism but a complement that may
    > interest you.
    >
    >
    > Since I've worked with critical flash systems for more than 10 years
    > now, I have read the part of your document that deals with power loss
    > with great interest.
    >
    > Resilience to power loss is *absolutely critical* to any embedded
    > filesystem.
    >
    >
    > Did you do power interruption tests on your code? Can you guarantee
    > that the device format stays consistent/recoverable when the power is
    > cut at any code location? Did you identify power-critical code sections
    > (with relation to power cut, not CPU access)?
    >
    > Remember, if it's not tested, it doesn't work...
    >
    >
    > The most critical part of your work is the journal. Do you make sure
    > that the checksum is written 1-last, and 2-completely? How do you make
    > sure that the journal entries are correctly applied to their final
    > storage locations?
    >
    > The largest problem in that area is flash metastability. The checksum
    > MIGHT appear correct on one read, but not correct at the next access.
    > The reason for this is the analog nature of flash writes (and erases),
    > which inject a number of electrons into a floating gate. 0 and 1 bits
    > are separated by thresholds, but these thresholds vary with temperature
    > and time (wear), so it might appear that a bit is correct by being just
    > at the threshold, but the next access will result in a flipped bit.
    >
    > These issues are NOT theoretical: they happen all the time in all flash
    > devices. You just have to tickle the devices often enough at the right
    > moment to begin to see them.
    >
    > These tests require the ability to fully cut the power to a test board
    > with microsecond precision. No need for pulses, just an adjustable
    > delay. The test is triggered by a command that also starts a countdown,
    > and the timeout is increased microsecond by microsecond until you reach
    > the point where the flash is actually written. Usually, there is a point
    > where timeouts result in partial writes. Then the board will start
    > acting funny and will start entering the error branches that are
    > usually never taken. Board capacitors are not a problem; they just
    > increase the delays. They always discharge the same way during all
    > repeated tests, so they have no influence on the process.
    >
    > It is quite hard to make sure that everything is correct, but a
    > sufficient amount of dedication is required to be aware of the
    > potential problems.
    >
    > How do you know in your filesystem that the checksum has been written
    > only after all the previous data are written? How do you know the
    > checksum write is complete? There are software techniques for this.
    > This also requires the flash to support overwrites, so making this work
    > with ECC is harder (but possible).
    >
    > Fine details absolutely matter here.
    >
    > Thanks,
    >
    > Sebastien
    >
    >
    > On 12/09/2024 17:48, Saurav Pal wrote:
    > > Hi all,
    > >
    > > Here's my final report
    > > <https://resyfer.github.io/blogs/mnemofs/endeval/> on mnemofs, a NAND
    > > flash file system for NuttX, on which I worked during my tenure as a
    > > GSoC 2024 Contributor for ASF. I would be grateful for any suggestions
    > > and criticism.
    > >
    > > Best regards,
    > > Saurav Pal.
    > >
    >
