>>>>> "Kent" == Kent Overstreet <[email protected]> writes:

> Roland Vet spotted a good one: currently, rebalance/copygc get stuck if
> we've got an extent that can't be read, due to checksum error/bitrot.

> This took some doing to fix properly, because

> - We don't want to just delete such data (replace it with
>   KEY_TYPE_error); we never want to delete anything except when
>   explicitly told to by the user, and while we don't yet have an API for
>   "read this file even though there's errors, just give me what you
>   have" we definitely will in the future.

So will open() just return an error?  How will this look from
userspace?  

> - Not being able to move data is a non-option: that would block copygc,
>   device removal, etc.

> - And, moving said extent requires giving it a new checksum - strictly
>   required if the move has to fragment it, teaching the write/move path
>   about extents with bad checksums is unpalatable, and anyway we'd
>   like to be able to guard against more bitrot, if we can.

Why does it need a new checksum if there are read errors?  What
happens if there are more read errors?  If I have a file on a
filesystem that lives on spinning rust and I get a single bit flip,
I'm super happy you catch it.

But now you re-checksum the data, read error and all, and return it?
I'm just a user/IT guy.  I want notifications, but I don't want my
application blocked so hard that I can't kill it, unmount the
filesystem, or keep using it if I choose.
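
Just so I'm following the checksum part: my mental model (which may be
wrong) is that the checksum is stored per extent and covers that
extent's data, so a move that has to fragment the extent can't reuse
the old checksum at all.  A kernel-style sketch, every name invented,
with crc32c() just standing in for whatever checksum bcachefs uses:

  /* Sketch only, not real bcachefs code.  The point: the crc covers
   * the whole extent, so neither half of a split can inherit it. */
  struct fake_extent {
          u64  offset;    /* logical offset in the file, in sectors */
          u32  len;       /* length in sectors */
          u32  crc;       /* checksum over this extent's data */
          bool poisoned;  /* reads fail even though crc is good */
  };

  static void split_extent(const struct fake_extent *e, const void *data,
                           u32 mid, struct fake_extent *a,
                           struct fake_extent *b)
  {
          *a = *b = *e;           /* poison bit carries over to both */
          a->len = mid;
          b->offset += mid;
          b->len = e->len - mid;
          a->crc = crc32c(0, data, mid << 9);
          b->crc = crc32c(0, data + (mid << 9), b->len << 9);
  }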

> So that means:

> - Extents need a poison bit: "reads return errors, even though it now
>   has a good checksum" - this was added in a separate patch queued up
>   for 6.15.

Sorry for being dense, but does a file have one or more extents?  Or
is this at a level below that?  
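
(My working assumption, which may well be wrong: a file's data maps to
one or more extents, keyed by inode plus file offset, and the poison
bit lives on the extent.  Invented names:

  struct fake_extent_key {
          u64  inode;     /* which file */
          u64  offset;    /* position within that file, in sectors */
  };

So a big file would have many such extents, and poisoning one of them
should only make reads overlapping that range error out, leaving the
rest of the file readable.)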

>   It's an incompat feature because it's a new extent field, and old
>   versions can't parse extents with unknown field types, since they
>   won't know their sizes - meaning users will have to explicitly do an
>   incompat upgrade to make use of this stuff.
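
(The parsing problem makes sense to me if the on-disk encoding is
"type implies size", roughly like the sketch below: an old kernel that
doesn't recognize the new field type can't even skip over it, because
nothing else tells it how long the field is.  All names invented:

  enum { FIELD_PTR, FIELD_CRC, FIELD_FLAGS /* the new one */ };

  static int field_size(unsigned type)
  {
          switch (type) {
          case FIELD_PTR:    return 16;
          case FIELD_CRC:    return 8;
          case FIELD_FLAGS:  return 4;
          default:           return -1;  /* unknown: unparseable */
          }
  }

Hence the explicit incompat upgrade.)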

> - The read path needs to do additional retries after checksum errors
>   before giving up and marking it poisoned, so that we don't
>   accidentally convert a transient error to permanent corruption.

While these retries are happening, is the filesystem locked up, and is
the process doing the read() blocked in a way that can't be killed?
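
For what it's worth, the pattern I'd hope for here is a bounded retry
loop that checks for fatal signals, so a stuck read stays killable.  A
kernel-style sketch with invented helpers, the retry limit being the
new filesystem option Kent mentions:

  static int read_extent_with_retries(struct fake_extent *e, void *buf,
                                      unsigned max_retries)
  {
          int ret = -EIO;
          unsigned i;

          for (i = 0; i <= max_retries; i++) {
                  if (fatal_signal_pending(current))
                          return -EINTR;  /* stays killable */

                  ret = read_and_verify_crc(e, buf);  /* invented */
                  if (!ret)
                          return 0;       /* transient, recovered */
                  if (ret != -EBADMSG)    /* retry only csum errors */
                          break;
          }

          if (ret == -EBADMSG)
                  mark_extent_poisoned(e);  /* invented */
          return ret;
  }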

> - The read path gets a whole bunch of work to plumb precise modern error
>   codes around, so that e.g. the retry path, the data move path, and the
>   "mark extent poisoned" path all know exactly what's going on.

> - Read path is responsible for marking extents poisoned after sufficient
>   retry attempts (controlled by a new filesystem option)
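
(This is where I'd guess the precise error codes pay off: the callers
need to distinguish at least "checksum mismatch", "device I/O error",
and "already poisoned", because each gets different handling.  A
made-up illustration:

  static bool may_move_despite_error(int err, bool poisoned)
  {
          switch (err) {
          case -EBADMSG:  /* checksum mismatch */
                  return poisoned;  /* movable once marked poisoned */
          case -EIO:      /* media error: try another replica */
          default:
                  return false;
          }
  }

A single catch-all -EIO couldn't drive those decisions.)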

> - Data move path is allowed to move extents after a read error, if it's
>   a checksum error (giving it a new checksum) if it's been poisoned
>   (i.e. the extent flags feature is enabled).

So if just a single bit flips, the extent gets moved onto better
storage and gets a fresh checksum.  But what if more bits go bad
afterwards?
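
(If I follow the design, the answer is roughly the sketch below: the
extent keeps its poison flag across the move, and the new checksum is
computed over the data as-read, so any further bit flips on the new
device fail that checksum and get caught the same way.  Invented
helpers again:

  static int move_poisoned_extent(struct fake_extent *e, void *buf)
  {
          /* stored crc is known-bad, so read without failing on it */
          read_ignoring_crc(e, buf);              /* invented */

          /* new location, new checksum over whatever we read */
          e->crc = crc32c(0, buf, e->len << 9);
          e->poisoned = true;  /* reads keep erroring until told otherwise */

          return write_to_new_device(e, buf);     /* invented */
  }

So a second round of bitrot on the destination is still detected; it
just can't be healed any more than the first was.)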

> Code should be more or less finalized - still have more tests for corner
> cases to write, but "write corrupt data and then tell rebalance to move
> it to another device" works as expected.

> TODO:

> - NVMe has a "read recovery level" attribute that controls how hard the
>   erasure coding algorithms work - we want that plumbed.

>   Before we give up and move data that we know is bad, we need to try
>   _as hard as possible_ to get a successful read.
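
(For reference, and I may be misreading the spec: I believe that's the
NVMe "Read Recovery Level" feature, FID 0x12, set via an admin Set
Features command.  An untested userspace sketch; double-check the
CDW11 layout against the spec before trusting it:

  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/nvme_ioctl.h>

  int set_read_recovery_level(const char *dev, unsigned rrl)
  {
          struct nvme_admin_cmd cmd;
          int fd = open(dev, O_RDONLY), ret;

          if (fd < 0)
                  return -1;

          memset(&cmd, 0, sizeof(cmd));
          cmd.opcode = 0x09;       /* Set Features */
          cmd.cdw10  = 0x12;       /* FID: Read Recovery Level Config */
          cmd.cdw11  = rrl & 0xf;  /* RRL value, per my spec reading */

          ret = ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
          close(fd);
          return ret;
  }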

What does this mean, exactly, from an end user and sysadmin point of
view?  Locking up the system (no new writes to the now-problematic
storage) in a way I can't kill would be painfully annoying.


Don't get me wrong, I'm happy to see data integrity stuff get added,
I'm just trying to understand how it interacts with userspace.  

John


