On Fri, 2007-06-29 at 13:01 -0700, Andrew Morton wrote:
> Guys, Mike and Sreenivasa at google are looking into implementing
> fallocate() on ext2.  Of course, any such implementation could and should
> also be portable to ext3 and ext4 bitmapped files.
> 
> I believe that Sreenivasa will mainly be doing the implementation work.
> 
> 
> The basic plan is as follows:
> 
> - Create (with tune2fs and mke2fs) a hidden file using one of the
>   reserved inode numbers.  That file will be sized to have one bit for each
>   block in the partition.  Let's call this the "unwritten block file".
> 
>   The unwritten block file will be initialised with all-zeroes
> 
> - at fallocate()-time, allocate the blocks to the user's file (in some
>   yet-to-be-determined fashion) and, for each one which is uninitialised,
>   set its bit in the unwritten block file.  The set bit means "this block
>   is uninitialised and needs to be zeroed out on read".
> 
> - truncate() would need to clear out set-bits in the unwritten blocks file.

This could be done by truncating the blocks file at the correct byte
offset, so that only some bits of the last byte of the file need to be
zeroed.
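
For illustration, here is a minimal user-space sketch of the bitmap
described so far. It assumes bits are indexed by on-disk block number and
that truncate() is handed the list of blocks being freed. The struct and
function names (unwritten_map, map_set, and so on) are hypothetical and do
not come from any ext2 code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical in-memory model of the "unwritten block file":
 * one bit per block in the partition, all zeroes initially. */
struct unwritten_map {
    uint8_t *bits;
    uint64_t nblocks;
};

/* Bytes needed to hold one bit per block, rounded up. */
static size_t map_bytes(uint64_t nblocks)
{
    return (size_t)((nblocks + 7) / 8);
}

/* Allocate a zero-filled map; returns 0 on success, -1 on failure. */
static int map_init(struct unwritten_map *m, uint64_t nblocks)
{
    assert(nblocks > 0);
    m->bits = calloc(map_bytes(nblocks), 1);
    m->nblocks = nblocks;
    return m->bits ? 0 : -1;
}

/* fallocate(): mark a newly allocated block as uninitialised. */
static void map_set(struct unwritten_map *m, uint64_t blk)
{
    assert(blk < m->nblocks);
    m->bits[blk / 8] |= (uint8_t)(1u << (blk % 8));
}

/* write path: the block now holds real data. */
static void map_clear(struct unwritten_map *m, uint64_t blk)
{
    assert(blk < m->nblocks);
    m->bits[blk / 8] &= (uint8_t)~(1u << (blk % 8));
}

/* read path: non-zero means "return zeroes instead of disk contents". */
static int map_test(const struct unwritten_map *m, uint64_t blk)
{
    assert(blk < m->nblocks);
    return (m->bits[blk / 8] >> (blk % 8)) & 1;
}

/* truncate(): clear the bit of every block being freed. */
static void map_clear_list(struct unwritten_map *m,
                           const uint64_t *blks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        map_clear(m, blks[i]);
}
```

A real implementation would go through the page cache and the journal
(on ext3/ext4) rather than a malloc'ed buffer; this only shows the bit
arithmetic.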

> - When the fs comes to read a block from disk, it will need to consult
>   the unwritten blocks file to see if that block should be zeroed by the
>   CPU.
> 
> - When the unwritten-block is written to, its bit in the unwritten blocks
>   file gets zeroed.
> 
> - An obvious efficiency concern: if a user file has no unwritten blocks
>   in it, we don't need to consult the unwritten blocks file.
> 
>   Need to work out how to do this.  An obvious solution would be to have
>   a number-of-unwritten-blocks counter in the inode.  But do we have space
>   for that?

Would it be too expensive to test the blocks-file page each time a bit
is cleared to see if it is all-zero, and then free the page, making it a
hole?  This test would stop if it finds any non-zero word, so it may not
be too bad.  (This could further be done on a block basis if the block
size is less than a page.)
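
A rough sketch of that test, assuming a 4096-byte page and a word-at-a-time
scan. MAP_PAGE_SIZE, page_is_zero() and clear_bit_and_check() are names
made up for this example, not existing kernel helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_PAGE_SIZE   4096  /* assumed page size */
#define WORDS_PER_PAGE  (MAP_PAGE_SIZE / sizeof(unsigned long))
#define BITS_PER_WORD   (8 * sizeof(unsigned long))

/* Returns 1 if every word in the page is zero.
 * The scan stops at the first non-zero word it finds. */
static int page_is_zero(const unsigned long *page)
{
    for (size_t i = 0; i < WORDS_PER_PAGE; i++)
        if (page[i] != 0)
            return 0;
    return 1;
}

/* Clear one bit in a bitmap page, then report whether the page is now
 * all-zero and could therefore be freed and turned into a hole. */
static int clear_bit_and_check(unsigned long *page, size_t bit)
{
    assert(bit < (size_t)MAP_PAGE_SIZE * 8);
    page[bit / BITS_PER_WORD] &= ~(1UL << (bit % BITS_PER_WORD));
    return page_is_zero(page);
}
```

The worst case is a page whose only remaining set bits are in its last
word, which costs a full scan on every clear.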

>   (I expect google and others would prefer that the on-disk format be
>   compatible with legacy ext2!)
> 
> - One concern is the following scenario:
> 
>   - Mount fs with "new" kernel, fallocate() some blocks to a file.
> 
>   - Now, mount the fs under "old" kernel (which doesn't understand the
>     unwritten blocks file).
> 
>     - This kernel will be able to read uninitialised data from that
>       fallocated-to file, which is a security concern.
> 
>   - Now, the "old" kernel writes some data to a fallocated block.  But
>     this kernel doesn't know that it needs to clear that block's flag in
>     the unwritten blocks file!
> 
>   - Now mount that fs under the "new" kernel and try to read that file.
>     The flag for the block is set, so this kernel will still zero out the
>     data on a read, thus corrupting the user's data.
> 
>   So how to fix this?  Perhaps with a per-inode flag indicating "this
>   inode has unwritten blocks".  But to fix this problem, we'd require that
>   the "old" kernel clear out that flag.
> 
>   Can anyone propose a solution to this?
> 
>   Ah, I can!  Use the compatibility flags in such a way as to prevent the
>   "old" kernel from mounting this filesystem at all.  To mount this fs
>   under an "old" kernel the user will need to run some tool which will
> 
>   - read the unwritten blocks file
> 
>   - for each set-bit in the unwritten blocks file, zero out the
>     corresponding block
> 
>   - zero out the unwritten blocks file
> 
>   - rewrite the superblock to indicate that this fs may now be mounted
>     by an "old" kernel.
> 
>   Sound sane?

Yeah.  I think it would have to be done under a compatibility flag.  Is
going back to an older kernel really that important?  I think it's more
important to make sure it can't be mounted by an older kernel if bad
things can happen, and they can.
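
To make the proposal concrete, here is a toy model of both halves: the
mount-time check on the incompat feature bits, and the downgrade tool that
zeroes the flagged blocks before clearing the flag. The
FEATURE_INCOMPAT_UNWRITTEN value and the toy_fs layout are invented for
this sketch; only the rule that a kernel refuses to mount when it sees an
incompat bit it does not understand reflects how ext2 behaves.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16u  /* tiny block size for the toy model */
#define TOY_NBLOCKS     8u

/* Hypothetical incompat feature bit; not an actual ext2 constant. */
#define FEATURE_INCOMPAT_UNWRITTEN 0x8000u

struct toy_fs {
    uint32_t feature_incompat;                 /* superblock field  */
    uint8_t  unwritten[(TOY_NBLOCKS + 7) / 8]; /* one bit per block */
    uint8_t  data[TOY_NBLOCKS][TOY_BLOCK_SIZE];
};

/* A kernel may mount only if it understands every incompat bit set. */
static int can_mount(const struct toy_fs *fs, uint32_t supported)
{
    return (fs->feature_incompat & ~supported) == 0;
}

/* Downgrade tool: make the fs safe for an "old" kernel. */
static void downgrade(struct toy_fs *fs)
{
    assert(fs != NULL);

    /* For each set bit, zero out the corresponding block. */
    for (uint32_t b = 0; b < TOY_NBLOCKS; b++)
        if (fs->unwritten[b / 8] & (1u << (b % 8)))
            memset(fs->data[b], 0, TOY_BLOCK_SIZE);

    /* Zero out the unwritten blocks file itself. */
    memset(fs->unwritten, 0, sizeof fs->unwritten);

    /* Rewrite the superblock so an "old" kernel may mount again. */
    fs->feature_incompat &= ~FEATURE_INCOMPAT_UNWRITTEN;
}
```

The flag is cleared last so that an interrupted run leaves the fs still
unmountable by an old kernel.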

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center
