Re: Incomplete journal entries

Theodore Ts'o Fri, 28 Oct 2016 08:00:30 -0700

On Fri, Oct 28, 2016 at 10:04:12AM +0200, Oswald Buddenhagen wrote:
> > As for truncation, this might still happen if the file is not fsynced
> > explicitly at critical transaction points (including before fclose).
> > 
> you're not getting truncation, but data corruption, as that's what
> appending a number of null bytes is. there is _no_ standard that permits
> this without an interim system crash, fsync or not.


Actually, without an fsync ***anything*** goes.  In particular, if you
append to a file, and the system allocates a new block, it's fair game
for the file system to attach a block to the disk, but mark the block
as uninitialized, so that reads of that block return zeros.
That's not technically data corruption.  All of the data up to the
last fsync is safe.  What happens after the last fsync is up in the
air.  The behavior I described is what XFS will do.
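
As a minimal sketch of "fsync at the critical points", assuming a
Python application that appends through a buffered file object: the
helper name durable_append is hypothetical and not taken from any
project discussed in this thread.  The userspace buffer has to be
flushed first, because fsync() only covers data the kernel has
already been handed.

```python
import os


def durable_append(path: str, data: bytes) -> None:
    """Append data to path and return only after fsync() has succeeded."""
    with open(path, "ab") as f:
        f.write(data)
        f.flush()             # drain Python's userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the kernel to push the data to stable storage
    # Closing the file (done by the "with" block) does not imply an fsync.
```

Anything appended after the last such call may still be lost or read
back as zeros after a crash.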

With ext4, we use delayed allocation, but the way we do data=ordered
is that we flush the data blocks *before* we do the commit, so in
practice it shouldn't be happening with ext4.  However, we reserve the
right to switch how we do things in the future to be more like XFS,
since there are some performance advantages for not forcing out the
data block, but just marking the block as uninitialized and then
marking the block as initialized after the writeback completes.

If you mount with the data=writeback flag, then we don't force out
data blocks before we do a commit (which gives a performance
advantage, which is why some users might choose to use it), but it
means that it's possible for stale data (the previous contents of the
data block) to be revealed after a crash.  But (and this is
important) it's completely legal as far as the POSIX standard is
concerned.

So if you care about this, I would strongly recommend that you include
a CRC of the contents of the transaction blocks in the commit record.
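
A commit record of that kind could look roughly like the sketch
below.  The layout (a magic string, the length of the transaction
data, and its CRC-32), the magic value and the function names are all
assumptions made up for illustration; they are not the format of any
existing journal.

```python
import struct
import zlib

# Hypothetical commit-record layout, invented for this sketch:
# 4-byte magic, u32 length of the transaction data, u32 CRC-32 of that data.
COMMIT = struct.Struct("<4sII")
MAGIC = b"CMT1"


def make_commit_record(blocks: bytes) -> bytes:
    """Build the commit record that covers the given transaction data."""
    return COMMIT.pack(MAGIC, len(blocks), zlib.crc32(blocks))


def is_committed(blocks: bytes, record: bytes) -> bool:
    """True only if the record is intact and matches the data read back."""
    if len(record) != COMMIT.size:
        return False
    magic, length, crc = COMMIT.unpack(record)
    return magic == MAGIC and length == len(blocks) and crc == zlib.crc32(blocks)
```

The magic string is there so that a commit record which itself reads
back as all zeros is rejected, rather than accidentally matching an
empty transaction (the CRC-32 of zero bytes is 0).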

Also note that technically speaking, although fsync() guarantees that
after it returns, everything written is committed to stable store, it
does not guarantee anything about the *order* in which data will be
committed to stable store before the fsync() completes.  So if you want to be
technically correct, what you need to do is either (a) write the
transaction blocks, fsync, then write the commit record, and then
fsync a second time, or (b) write the transaction blocks, and write
the commit block with a CRC, and then fsync --- and then on the
replay, check the CRC in the commit block, and if the CRC does not
check out, discard the last transaction since it wasn't fully
committed to stable store before the crash.
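
The two options and the replay check can be sketched as follows.
This is an illustration under stated assumptions, not a definitive
implementation: the framing (length, payload, then a commit record
holding a magic string and a CRC-32) and every function name are
hypothetical.

```python
import os
import struct
import zlib

# Hypothetical framing, invented for this sketch:
#   [u32 payload length][payload][4-byte magic][u32 CRC-32 of payload]
HEADER = struct.Struct("<I")
COMMIT = struct.Struct("<4sI")
MAGIC = b"CMT1"


def write_all(fd, buf):
    """os.write() may write fewer bytes than asked, so loop until done."""
    view = memoryview(buf)
    while view:
        view = view[os.write(fd, view):]


def append_two_fsyncs(fd, payload):
    """Option (a): the payload is durable before the commit record is written."""
    write_all(fd, HEADER.pack(len(payload)) + payload)
    os.fsync(fd)  # transaction blocks reach stable storage first
    write_all(fd, COMMIT.pack(MAGIC, zlib.crc32(payload)))
    os.fsync(fd)  # only now does the transaction count as committed


def append_one_fsync(fd, payload):
    """Option (b): a single fsync; the CRC lets replay detect a torn write."""
    commit = COMMIT.pack(MAGIC, zlib.crc32(payload))
    write_all(fd, HEADER.pack(len(payload)) + payload + commit)
    os.fsync(fd)


def replay(data):
    """Return the payloads of all transactions up to the first bad one."""
    committed, pos = [], 0
    while pos + HEADER.size <= len(data):
        (length,) = HEADER.unpack_from(data, pos)
        end = pos + HEADER.size + length
        if end + COMMIT.size > len(data):
            break  # file ends before the commit record does
        payload = data[pos + HEADER.size:end]
        magic, crc = COMMIT.unpack_from(data, end)
        if magic != MAGIC or crc != zlib.crc32(payload):
            break  # zero-filled or partially written transaction: discard
        committed.append(payload)
        pos = end + COMMIT.size
    return committed
```

Option (a) pays for a second fsync per transaction; option (b) trades
that for a CRC computation and the check at replay time.  A real
journal would also need to truncate or overwrite the discarded tail
before appending again, which this sketch leaves out.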

(Yes, storage is hard.  It's hard because users insist on extreme
performance, and so POSIX guarantees are fairly loose.  They have to
be, or everyday performance would be horrific.  What this does mean
is that if you want transaction / atomic guarantees, you have all of
the low-level tools, but it's up to the application programmer or the
database library implementor to use those tools correctly.)

Best regards,

                                        - Ted

