On Thu, Jan 24, 2013 at 08:13:34PM +0800, Miao Xie wrote: > On thu, 24 Jan 2013 11:58:13 +0000, Hugo Mills wrote: > > On Thu, Jan 24, 2013 at 07:39:17PM +0800, Miao Xie wrote: > >> On thu, 24 Jan 2013 10:53:17 +0200, Alex Lyakas wrote: > >>> Hi Chen, > >>> with all due respect, what do you mean by "I see" and "OK for users"? > >>> The semantics of the fallocate API with/without the > >>> FALLOC_FL_PUNCH_HOLE flag is not defined by me or you. As far as I > >>> understand, the file system *must* adhere to the semantics of this > >>> API, as defined elsewhere. > >>> Looking, for example, here: > >>> http://man7.org/linux/man-pages/man2/fallocate.2.html > >>> > >>> "Allocating disk space > >>> The default operation (i.e., mode is zero) of fallocate() allocates > >>> and initializes to zero the disk space within the range specified by > >>> offset and len." > >>> > >>> "Deallocating file space > >>> Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux > >>> 2.6.38) in mode deallocates space (i.e., creates a hole) in the byte > >>> range starting at offset and continuing for len bytes. Within the > >>> specified range, partial file system blocks are zeroed, and whole file > >>> system blocks are removed from the file." > >>> > >>> These are clearly two different modes of operation, and I don't think > >>> you or me can decide otherwise, at this point. > >> > >> I think send/receive commands just need ensure the content of the file is > >> right, not the disk format. That is also the reason why the developers just > >> send out the zero data when they meet a hole or pre-allocated extent. From > >> this viewpoint, they are the same. > > > > I disagree on the "content vs representation" comment above. I > > would very much hope that the reference receive implementation can > > turn a string of zeroes (or a hole) back into a sparse file. As a > > user, I'd be quite irritated if, say, one of my sparse VM images ended > > up 5 times the size when I backed it up with send/receive, simply > > because it's gone from holey to a huge bunch of zeroes. That > > particular issue can be dealt with at the receiver level, though. > > My explanation is not clear, sorry! What I want to say is send command needn't > differentiate between a hole and a pre-allocated extent, because both of them > are considered as zero. So just send out a hole even we meet a pre-allocated > extent is OK, I think.
Ah, gotcha. That sounds more sensible, and I agree with your point (for what little weight I carry -- I'm not as well-versed in the behaviour of fallocate as I might be). Hugo. > Thanks > Miao > > > > > Hugo. > > > >>> However, I may be not knowledgeable enough to confirm this. > >>> Jan/Alexander, can you perhaps comment on this? > >>> > >>> Thanks, > >>> Alex. > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> On Thu, Jan 24, 2013 at 3:54 AM, Chen Yang <[email protected]> > >>> wrote: > >>>> δΊ 2013-1-23 19:56, Alex Lyakas ει: > >>>>> Hi, > >>>>> > >>>>> On Wed, Jan 23, 2013 at 1:04 PM, Chen Yang > >>>>> <[email protected]> wrote: > >>>>>> From: Chen Yang <[email protected]> > >>>>>> Date: Wed, 23 Jan 2013 11:21:51 +0800 > >>>>>> Subject: [PATCH] Btrfs/send: sparse and pre-allocated file support for > >>>>>> btrfs-send mechanism > >>>>>> > >>>>>> When sending a file with sparse or pre-allocated part, > >>>>>> these parts will be sent as ZERO streams, and it's unnecessary. > >>>>>> > >>>>>> There are two ways to improve this, one is just skip the EMPTY parts, > >>>>>> and the other one is to add a punch command to send, when an EMPTY > >>>>>> parts > >>>>>> was detected. But considering a case of incremental sends, if we choose > >>>>>> the first one, when a hole got punched into the file after the initial > >>>>>> send, the data will be unchanged on the receiving side when received > >>>>>> incrementally. So the second choice is right. > >>>>>> > >>>>>> Signed-off-by: Cheng Yang <[email protected]> > >>>>>> --- > >>>>>> fs/btrfs/send.c | 60 > >>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++- > >>>>>> fs/btrfs/send.h | 3 +- > >>>>>> 2 files changed, 61 insertions(+), 2 deletions(-) > >>>>>> > >>>>>> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c > >>>>>> index 5445454..31e9aef 100644 > >>>>>> --- a/fs/btrfs/send.c > >>>>>> +++ b/fs/btrfs/send.c > >>>>>> @@ -3585,6 +3585,52 @@ out: > >>>>>> return ret; > >>>>>> } > >>>>>> > >>>>>> +static int send_punch(struct send_ctx *sctx, u64 offset, u32 len) > >>>>>> +{ > >>>>>> + int ret = 0; > >>>>>> + struct fs_path *p; > >>>>>> + mm_segment_t old_fs; > >>>>>> + > >>>>>> + p = fs_path_alloc(sctx); > >>>>>> + if (!p) > >>>>>> + return -ENOMEM; > >>>>>> + > >>>>>> + /* > >>>>>> + * vfs normally only accepts user space buffers for security > >>>>>> reasons. > >>>>>> + * we only read from the file and also only provide the > >>>>>> read_buf buffer > >>>>>> + * to vfs. As this buffer does not come from a user space > >>>>>> call, it's > >>>>>> + * ok to temporary allow kernel space buffers. > >>>>>> + */ > >>>>>> + old_fs = get_fs(); > >>>>>> + set_fs(KERNEL_DS); > >>>>>> + > >>>>>> +verbose_printk("btrfs: send_fallocate offset=%llu, len=%d\n", offset, > >>>>>> len); > >>>>>> + > >>>>>> + ret = open_cur_inode_file(sctx); > >>>>>> + if (ret < 0) > >>>>>> + goto out; > >>>>>> + > >>>>>> + ret = begin_cmd(sctx, BTRFS_SEND_C_PUNCH); > >>>>>> + if (ret < 0) > >>>>>> + goto out; > >>>>>> + > >>>>>> + ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, > >>>>>> p); > >>>>>> + if (ret < 0) > >>>>>> + goto out; > >>>>>> + > >>>>>> + TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p); > >>>>>> + TLV_PUT_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, offset); > >>>>>> + TLV_PUT_U64(sctx, BTRFS_SEND_A_SIZE, len); > >>>>>> + > >>>>>> + ret = send_cmd(sctx); > >>>>>> + > >>>>>> +tlv_put_failure: > >>>>>> +out: > >>>>>> + fs_path_free(sctx, p); > >>>>>> + set_fs(old_fs); > >>>>>> + return ret; > >>>>>> +} > >>>>>> + > >>>>>> /* > >>>>>> * Read some bytes from the current inode/file and send a write > >>>>>> command to > >>>>>> * user space. > >>>>>> @@ -3718,6 +3764,7 @@ static int send_write_or_clone(struct send_ctx > >>>>>> *sctx, > >>>>>> u64 pos = 0; > >>>>>> u64 len; > >>>>>> u32 l; > >>>>>> + u64 bytenr; > >>>>>> u8 type; > >>>>>> > >>>>>> ei = btrfs_item_ptr(path->nodes[0], path->slots[0], > >>>>>> @@ -3731,8 +3778,19 @@ static int send_write_or_clone(struct send_ctx > >>>>>> *sctx, > >>>>>> * sure to send the whole thing > >>>>>> */ > >>>>>> len = PAGE_CACHE_ALIGN(len); > >>>>>> - } else { > >>>>>> + } else if (type == BTRFS_FILE_EXTENT_REG) { > >>>>>> len = btrfs_file_extent_num_bytes(path->nodes[0], ei); > >>>>>> + bytenr = btrfs_file_extent_disk_bytenr(path->nodes[0], > >>>>>> ei); > >>>>>> + if (bytenr == 0) { > >>>>>> + ret = send_punch(sctx, offset, len); > >>>>>> + goto out; > >>>>>> + } > >>>>>> + } else if (type == BTRFS_FILE_EXTENT_PREALLOC) { > >>>>>> + len = btrfs_file_extent_num_bytes(path->nodes[0], ei); > >>>>>> + ret = send_punch(sctx, offset, len); > >>>>>> + goto out; > >>>>>> + } else { > >>>>>> + BUG(); > >>>>>> } > >>>>> > >>>>> Are these two cases really the same? In the bytenr == 0 we want to > >>>>> deallocate the range. While in the prealloc case, we want to ensure > >>>>> disk space allocation. Or am I mistaken? > >>>>> Looking at the receive side, you use the same command for both: > >>>>> ret = fallocate(r->write_fd, > >>>>> FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, > >>>>> offset, len); > >>>>> > >>>>> Looking at btrfs_fallocate code, in that case it will always punch the > >>>>> hole: > >>>>> static long btrfs_fallocate(struct file *file, int mode, > >>>>> loff_t offset, loff_t len) > >>>>> ... > >>>>> if (mode & FALLOC_FL_PUNCH_HOLE) > >>>>> return btrfs_punch_hole(inode, offset, len); > >>>>> > >>>>> So maybe you should have two different commands, or add a flag to > >>>>> distinguish between the two cases? > >>>>> > >>>>> Thanks, > >>>>> Alex. > >>>>> > >>>> > >>>> I see sparse and pre-allocated parts both as EMPTY hole, > >>>> and although different for a file, it's OK for users. > >>>> > >>>> Thanks > >>>> > >>>>> > >>>>> > >>>>>> > >>>>>> if (offset + len > sctx->cur_inode_size) > >>>>>> diff --git a/fs/btrfs/send.h b/fs/btrfs/send.h > >>>>>> index 1bf4f32..659ac8f 100644 > >>>>>> --- a/fs/btrfs/send.h > >>>>>> +++ b/fs/btrfs/send.h > >>>>>> @@ -20,7 +20,7 @@ > >>>>>> #include "ctree.h" > >>>>>> > >>>>>> #define BTRFS_SEND_STREAM_MAGIC "btrfs-stream" > >>>>>> -#define BTRFS_SEND_STREAM_VERSION 1 > >>>>>> +#define BTRFS_SEND_STREAM_VERSION 2 > >>>>>> > >>>>>> #define BTRFS_SEND_BUF_SIZE (1024 * 64) > >>>>>> #define BTRFS_SEND_READ_SIZE (1024 * 48) > >>>>>> @@ -80,6 +80,7 @@ enum btrfs_send_cmd { > >>>>>> BTRFS_SEND_C_WRITE, > >>>>>> BTRFS_SEND_C_CLONE, > >>>>>> > >>>>>> + BTRFS_SEND_C_PUNCH, > >>>>>> BTRFS_SEND_C_TRUNCATE, > >>>>>> BTRFS_SEND_C_CHMOD, > >>>>>> BTRFS_SEND_C_CHOWN, > > > -- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk === PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- UNIX: British manufacturer of modular shelving units. ---
signature.asc
Description: Digital signature
