On Fri, Sep 25, 2015 at 11:42:02AM -0700, Jaegeuk Kim <jaeg...@kernel.org> 
wrote:
> AFAIK, there-in *commit* means syncing metadata, not userdata. Doesn't it?

In general, no, they commit userdata, even if not necessarily at the same
time. ext* for example has three modes, and the default commits userdata
before the corresponding metadata (data=ordered).

But even when you relax this (data=writeback), a few minutes after a file is
written, both userdata and metadata are there (usually after 30s). Data that
was just being written is generally mixed, but that's an easy-to-handle
trade-off.

(and then there is data=journal, which should give perfectly ordered
behaviour for both, at a high cost, and flushoncommit).

Early (Linux) versions of XFS were more like a brutal version of writeback
- files recently written before a crash were frequently filled with zero
bytes (something I haven't seen with IRIX, which crashed frequently :).
But they somehow made it work - I used to be a frequent victim of zero-filled
files, but it hasn't happened to me in many years. So while I don't know
if it's a guarantee, in practice, file data is there together with the
metadata, and usually within the writeback period configured in the kernel
(+ whatever time it takes to write out the data, which can be substantial,
but can also be limited via /proc/sys/vm, especially dirty_bytes and
dirty_background_bytes).

Note also that filesystems often special-case write + rename over an old
file, and various other patterns, to give a good user experience.
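
For illustration, this is roughly the pattern such applications use (a
minimal sketch only; the file names and the helper are made up, and error
handling is abbreviated):

   #include <fcntl.h>
   #include <stdio.h>
   #include <string.h>
   #include <unistd.h>

   /* Replace "config" atomically: write a temporary file, fsync it, then
    * rename it over the old name. After a crash you see either the complete
    * old file or the complete new one, never a mix of the two. */
   static int
   replace_file (const char *path, const char *tmp, const char *data, size_t len)
   {
     int fd = open (tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);

     if (fd < 0)
       return -1;

     if (write (fd, data, len) != (ssize_t)len  /* short writes ignored here */
         || fsync (fd) < 0                      /* new contents on stable storage */
         || close (fd) < 0
         || rename (tmp, path) < 0)             /* atomically switch names */
       return -1;

     return 0;
   }

   int
   main (void)
   {
     const char *msg = "hello\n";
     return replace_file ("config", "config.tmp", msg, strlen (msg)) ? 1 : 0;
   }

As far as I understand, the special-casing mentioned above mostly benefits
applications that skip the fsync step in this pattern.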

So for traditional filesystems, /proc/sys/vm/dirty_bytes + dirty_expire
give a pretty good way of defining a) how much data can be lost and b) within
what timeframe. Such filesystems also have their own setting for metadata
commit intervals, but those are generally within the range of a few seconds
to half a minute.

This does not have the nice "exact version of a point in time" quality you
can get from a log-based filesystem, but it gives quite nice guarantees in
practice - if a file was only half-written, for example, it does not end up
with its full length but corrupted data inside.

For things like database files this could be an issue, as indeed you
don't control the order in which things are written, but such programs *know*
about this problem and fsync accordingly (and the kernel has extra support
for these things, as in sync_page_range and so on).
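
As a minimal sketch of what I mean (the file name is made up, and whether
you want fdatasync or just sync_file_range depends entirely on the
durability you need):

   #define _GNU_SOURCE          /* for sync_file_range */
   #include <fcntl.h>
   #include <string.h>
   #include <unistd.h>

   /* After writing a critical record, a database-style program does not wait
    * for the kernel's writeback timer - it either forces the data out itself
    * (fdatasync), or at least kicks off writeback of just that range
    * (sync_file_range), so the data-loss window is its own choice rather
    * than the VM's. */
   static int
   flush_record (int fd, off_t offset, off_t len, int durable)
   {
     if (durable)
       return fdatasync (fd); /* data (and needed metadata) on stable storage */

     /* start asynchronous writeback of just this range, without waiting */
     return sync_file_range (fd, offset, len, SYNC_FILE_RANGE_WRITE);
   }

   int
   main (void)
   {
     const char *rec = "record\n";
     int fd = open ("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);

     if (fd < 0
         || write (fd, rec, strlen (rec)) < 0
         || flush_record (fd, 0, 0, 1) < 0) /* durable variant */
       return 1;

     return close (fd) < 0;
   }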

So, in general, filesystems only commit metadata, but the kernel commits
userdata on its own, and as an extra feature, "good" filesystems such
as xfs or ext* have extra logic to commit userdata before committing the
corresponding metadata (or after it).

Note also that with most journal-based filesystems, the commit just forces
the issue; both metadata and userdata usually hit the disk much earlier.

In addition, the issue at hand is f2fs losing metadata, not userdata,
as all data had been written to the device hours before the crash. The
userdata was all there, but the filesystem forgot how to access it.

> So, even if you saw no data loss, filesystem doesn't guarantee all the data 
> were
> completely recovered, since sync or fsync was not called for that file.

No, but I feel fairly confident that a file written over a minute ago
on a box that has been sitting idle for a minute is still there after a
crash, barring hardware faults.

Now, I am not necessarily criticising f2fs here; after all, the problem at
hand is f2fs _hanging_, which is clearly a bug. I don't know how well f2fs
performs regarding data loss once this bug is fixed.

Also, f2fs is a different beast - syncs can take a veeery long time on f2fs
compared to xfs or ext4, and maybe that is due to the design of f2fs (I
suppose so, but you can correct me), in which case it might not be such a
good idea to commit every 30s. Maybe my performance problem was because
f2fs committed every 30s.

> I think you need to tune the system-wide parameters related to flusher 
> mentioned
> by Chao for your workloads.

I already configure these extensively, according to my workload. On the
box where I did my recent tests:

   vm.dirty_ratio = 80
   vm.dirty_background_ratio = 4
   vm.dirty_writeback_centisecs = 100
   vm.dirty_expire_centisecs = 100

These are pretty aggressive. The reason is that the box has 32GB of RAM, and
with default values it is not uncommon to get 10-20GB of dirty data before a
writeback, which then more or less freezes everything and can take a long
time. So the above values don't wait long to write userdata, and make sure a
process generating lots of dirty blocks can't freeze the system.

Specifically, in the case of tar writing files, tar will start blocking after
only ~1.3GB of dirty data (4% of the 32GB, per dirty_background_ratio above).

That means that with a "conventional" filesystem, I lose at most ~1.3GB of
data plus less than 30 seconds' worth on a crash.

> And, we need to expect periodic checkpoints are able to recover the previously
> flushed data.

Yes, I would consider this a must; however, again, I can accept it if f2fs
needs much higher "commit" intervals than other filesystems (say, 10
minutes), if that is what it takes to make it performant.

But some form of fixed timeframe is needed, I think, whether it's seconds
or minutes.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schm...@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\
