Re: fallocate support for bitmap-based files

Mike Waychison Fri, 29 Jun 2007 15:08:47 -0700

Andrew Morton wrote:

On Fri, 29 Jun 2007 16:55:25 -0400
Theodore Tso <[EMAIL PROTECTED]> wrote:

On Fri, Jun 29, 2007 at 01:01:20PM -0700, Andrew Morton wrote:

Guys, Mike and Sreenivasa at Google are looking into implementing
fallocate() on ext2.  Of course, any such implementation could and should
also be portable to ext3 and ext4 bitmapped files.


What's the eventual goal of this work?  Would it be for mainline use,
or just something that would be used internally at Google?



Mainline, preferably.

I'm not
particularly enthused about supporting two ways of doing fallocate();
one for ext4 and one for bitmap-based files in ext2/3/4.  Is the
benefit really worth it?



umm, it's worth it if you don't want to wear the overhead of journalling,
and/or if you don't want to wait on the, err, rather slow progress of ext4.

What I would suggest, which would make this much easier, is to make this
an incompatible extension (which, as you point out, is needed for
security reasons anyway) and then steal the high bit from the block
number field to indicate whether or not the block has been initialized.
That way you don't end up having to seek to a potentially
distant part of the disk to check out the bitmap.  Also, you don't
have to worry about how to recover if the "block initialized bitmap"
inode gets smashed.
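
A minimal sketch of the high-bit idea described above, assuming a 32-bit
block number field.  The names (BLK_UNINIT_FLAG, blkptr_pack and so on)
are hypothetical and do not come from the ext2 sources; on-disk byte
order and journalling are ignored here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration; not taken from the ext2 code. */
#define BLK_UNINIT_FLAG  UINT32_C(0x80000000)  /* the "stolen" high bit  */
#define BLK_NUMBER_MASK  UINT32_C(0x7fffffff)  /* remaining 31 bits      */

/* Pack a physical block number and its "uninitialized" state into a
 * single 32-bit block pointer. */
static uint32_t blkptr_pack(uint32_t block, bool uninit)
{
    assert((block & BLK_UNINIT_FLAG) == 0);  /* must fit in 31 bits */
    return uninit ? (block | BLK_UNINIT_FLAG) : block;
}

/* Physical block number with the flag bit stripped. */
static uint32_t blkptr_block(uint32_t ptr)
{
    return ptr & BLK_NUMBER_MASK;
}

/* True if the block is allocated but has never been written. */
static bool blkptr_is_uninit(uint32_t ptr)
{
    return (ptr & BLK_UNINIT_FLAG) != 0;
}

/* Clear the flag once real data has been written to the block. */
static uint32_t blkptr_mark_written(uint32_t ptr)
{
    return ptr & BLK_NUMBER_MASK;
}
```

Under this scheme a read of a pointer for which blkptr_is_uninit() is
true would presumably return zeroes instead of the stale disk contents,
and the state sits next to the block number itself, so no separate
bitmap has to be consulted.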

The downside is that it reduces the maximum size of the filesystem
supported by ext2 by a factor of two.  But, there are at least two
patch series floating about that promise to allow filesystem block
sizes greater than PAGE_SIZE, which would allow you to recover the
maximum size supported by the filesystem.
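
The factor of two can be checked with a little arithmetic.  The sketch
below assumes 32-bit block numbers and a 4 KiB block size as the
baseline (an assumption for illustration; the real limit also depends
on other on-disk constraints), and max_fs_bytes is a hypothetical
helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Largest addressable volume in bytes, given the number of usable
 * block-number bits and the block size in bytes (a power of two). */
static uint64_t max_fs_bytes(unsigned block_bits, uint64_t block_size)
{
    assert(block_bits <= 32);
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
    return (UINT64_C(1) << block_bits) * block_size;
}
```

With these assumptions, max_fs_bytes(32, 4096) is 16 TiB, giving up one
bit drops it to 8 TiB, and max_fs_bytes(31, 8192) is back at 16 TiB,
which is the recovery that larger block sizes would offer.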

Furthermore, I suspect (especially after listening to a very fascinating
Usenix Invited Talk by Jeffrey Dean, a fellow from Google, two weeks
ago) that for many of Google's workloads, using a filesystem blocksize
of 16K or 32K might not be a bad thing in any case.

It would be a lot simpler....



Hadn't thought of that.

Also, it's unclear to me why Google is going this way rather than using
(perhaps suitably-tweaked) ext2 reservations code.

Because the stock ext2 block allocator sucks big-time.

The primary reason this is a problem is that our writers into these
files aren't necessarily coming from the same hosts in the cluster, so
their arrival times aren't sequential.  It ends up looking to the kernel
like a random write workload, which in turn ends up causing odd
fragmentation patterns that aren't very deterministic.  That data is
often eventually streamed off the disk though, which is when the
fragmentation hurts.

Currently, our clustered filesystem supports pre-allocation of the
target chunks of files, but this is implemented by effectively writing
zeroes to files, which in turn causes pagecache churn and a double
write-out of the blocks.  Recently, we've changed the code to minimize
this pagecache churn and double write-out by performing an ftruncate to
extend files, but then we'll be back to square one in terms of
fragmentation for the random writes.
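
To make the contrast concrete, here is a rough userspace sketch of the
two ways of extending a file: ftruncate(), which only changes the size
and leaves the range sparse, and posix_fallocate(), which asks for the
blocks to be reserved up front.  posix_fallocate() is used purely as a
stand-in for the kind of interface being discussed; on filesystems
without native support glibc may emulate it by writing to every block,
which is essentially the zero-writing approach.  The helper names are
hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Extend the file to len bytes without allocating blocks.  Later
 * random writes allocate on demand, so fragmentation is still possible.
 * Returns 0 on success, -1 on error (errno set). */
static int extend_sparse(int fd, off_t len)
{
    return ftruncate(fd, len);
}

/* Ask the filesystem to reserve blocks for [0, len) up front.
 * Returns 0 on success or an errno value (errno itself is not set). */
static int extend_prealloc(int fd, off_t len)
{
    assert(len > 0);
    return posix_fallocate(fd, 0, len);
}

/* Current file size in bytes, or -1 if fstat() fails. */
static off_t file_size(int fd)
{
    struct stat st;
    return fstat(fd, &st) == 0 ? st.st_size : (off_t)-1;
}
```

Both calls leave file_size() reporting the new length; the difference
is in whether the blocks behind that length have been allocated, which
is what determines the layout seen by the later streaming reads.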

Relying on (a tweaked) reservations code is also somewhat limiting at
this stage given that reservations are lost on close(fd).  Unless we
change the lifetime of the reservations (maybe for the lifetime of the
in-core inode?), crank up the reservation sizes and deal with the
overcommit issues, I can't think of any better way at this time to deal
with the problem.


Mike Waychison