Re: [RFC, PATCH] locks: remove posix deadlock detection
On Sun, 28 Oct 2007 13:43:21 -0400 J. Bruce Fields [EMAIL PROTECTED] wrote:

> From: J. Bruce Fields [EMAIL PROTECTED]
>
> We currently attempt to return -EDEADLK to blocking fcntl() file locking
> requests that would create a cycle in the graph of tasks waiting on locks.
>
> This is inefficient: in the general case it requires us determining
> whether we're adding a cycle to an arbitrary directed acyclic graph. And
> this calculation has to be performed while holding a lock (currently the
> BKL) that prevents that graph from changing.
>
> It has historically been a source of bugs; most recently it was noticed
> that it could loop indefinitely while holding the BKL.
>
> It seems unlikely to be useful to applications:
>
> - The difficulty of implementation has kept standards from requiring it.
>   (E.g. SUSv3: "Since implementation of full deadlock detection is not
>   always feasible, the [EDEADLK] error was made optional.")  So portable
>   applications may not be able to depend on it.
>
> - It only detects deadlocks that involve nothing but local posix file
>   locks; deadlocks involving network filesystems or other kinds of locks
>   or resources are missed.
>
> It therefore seems best to remove deadlock detection.
>
> Signed-off-by: J. Bruce Fields [EMAIL PROTECTED]

NAK. This is an ABI change, and one that was rejected before when this was
last discussed in detail. Moving it out of the BKL makes a ton of sense;
even adding a "don't check" flag makes a lot of sense. Removing the
checking does not. I'd much rather see

	if (flags & FL_NODLCHECK)
		posix_deadlock_detect()

The failure case for removing this feature is obscure and hard to debug
application hangs for the afflicted programs - not nice for users at all.

Alan
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC, PATCH] locks: remove posix deadlock detection
> And if posix file locks are to be useful to threaded applications, then
> we have to preserve the same no-false-positives requirement for them as
> well.

It isn't useful to threaded applications. The specification requires this.
Which is another reason for having an additional Linux (for now) flag to
say "don't bother".

Alan
Re: [RFC, PATCH] locks: remove posix deadlock detection
On Sun, 28 Oct 2007, Matthew Wilcox wrote:

> > "A potential for deadlock occurs if a process controlling a locked
> > region is put to sleep by attempting to lock another process' locked
> > region. If the system detects that sleeping until a locked region is
> > unlocked would cause a deadlock, fcntl() shall fail with an [EDEADLK]
> > error."
> > This is what POSIX says [1], even after being modified with respect
> > to the POSIX Threads Extension, right? So it doesn't deal with
> > threads at all, just processes are taken into account. Probably for a
> > reason :)
>
> Did you have a concrete suggestion, or are you just quoting the spec?

I was quoting the spec, and I thought that the suggestion is implicit --
the specification defines what happens only when processes are in place.
If the application uses POSIX threads in combination with locks, it is
going to receive undefined behavior.

> The problem is that it's nonsense -- processes don't sleep, threads do.
> I think the key is "would deadlock", not "might deadlock". Our current
> behaviour is clearly in violation of SUSv3.

- either we can add a special fcntl() Linux-specific flag, that will ask
  the kernel not to perform deadlock detection. Any application that is
  combining threads and posix locks (and thus IMHO asking for undefined
  behavior) could use this flag and not receive EDEADLK any more

- or we can add some heuristics here-and-there to track which
  current->files are shared and which are not, and not return EDEADLK in
  the case of shared ->files (could be a little bit tricky)

-- 
Jiri Kosina
Re: [RFC, PATCH] locks: remove posix deadlock detection
On Sun, 28 Oct 2007, J. Bruce Fields wrote:

> But, OK, if we can identify unshared current->files at the time we put
> a task to sleep, then a slight modification of our current algorithm
> might be sufficient to detect any deadlock that involves purely posix
> file locks and processes. And we can tell people that avoiding deadlock
> is their problem as soon as any task with a shared current->files is
> involved. (Ditto, I assume, if nfsd/lockd acquires a lock.)

Don't forget that comparing file_lock->fl_owner (i.e. current->files) is
not the only way lock ownership can be computed (there can be a specific
file_lock->fl_lmops->fl_compare_owner(), and all of them would have to be
taught this new semantics, right?).

-- 
Jiri Kosina
Re: [patch 4/6][RFC] Attempt to plug race with truncate
On Fri, 26 Oct 2007 16:37:36 -0700 Mike Waychison [EMAIL PROTECTED] wrote:

> Attempt to deal with races with truncate paths.
>
> I'm not really sure on the locking here, but these seem to be taken by
> the truncate path. BKL is left as some filesystem may(?) still require
> it.
>
> Signed-off-by: Mike Waychison [EMAIL PROTECTED]
> ---
>  fs/ioctl.c | 8
>  1 file changed, 8 insertions(+)
>
> Index: linux-2.6.23/fs/ioctl.c
> ===================================================================
> --- linux-2.6.23.orig/fs/ioctl.c	2007-10-26 15:27:29.0 -0700
> +++ linux-2.6.23/fs/ioctl.c	2007-10-26 16:16:28.0 -0700
> @@ -43,13 +43,21 @@ static long do_ioctl(struct file *filp,
>  static int do_fibmap(struct address_space *mapping, sector_t block,
>  		sector_t *phys_block)
>  {
> +	struct inode *inode = mapping->host;
> +
>  	if (!capable(CAP_SYS_RAWIO))
>  		return -EPERM;
>  	if (!mapping->a_ops->bmap)
>  		return -EINVAL;
>
>  	lock_kernel();
> +	/* Avoid races with truncate */
> +	mutex_lock(&inode->i_mutex);
> +	/* FIXME: Do we really need i_alloc_sem? */
> +	down_read(&inode->i_alloc_sem);

i_alloc_sem will avoid races with filesystems filling holes inside
writepage (where i_mutex isn't held). I'd expect everyone to currently
give some consistent result (either the old value or the new, but not
garbage), but I wouldn't expect taking the semaphore to hurt anything.

-chris
Re: [patch 0/6][RFC] Cleanup FIBMAP
On Sat, 27 Oct 2007 18:57:06 +0100 Anton Altaparmakov [EMAIL PROTECTED] wrote:

> Hi,
>
> ->bmap is ugly and horrible! If you have to do this, at the very least
> please cause ->bmap64 to be able to return error values in case the
> file system failed to get the information, or indeed such information
> does not exist, as is the case for compressed and encrypted files for
> example, and also for small files that are inside the on-disk inode
> (NTFS resident files and reiserfs packed tails are examples of this).
>
> And another of my pet peeves with ->bmap is that it uses 0 to mean
> "sparse", which causes a conflict on NTFS at least, as block zero is
> part of the $Boot system file, so it is a real, valid block... NTFS
> uses -1 to denote sparse blocks internally.

Reiserfs and Btrfs also use 0 to mean packed. It would be nice if there
was a way to indicate your-data-is-here-but-isn't-alone. But that's more
of a feature for the FIEMAP stuff.

-chris
Re: [patch 0/6][RFC] Cleanup FIBMAP
Zach Brown wrote:

> > > And another of my pet peeves with ->bmap is that it uses 0 to mean
> > > "sparse", which causes a conflict on NTFS at least, as block zero
> > > is part of the $Boot system file, so it is a real, valid block...
> > > NTFS uses -1 to denote sparse blocks internally.
> >
> > Reiserfs and Btrfs also use 0 to mean packed. It would be nice if
> > there was a way to indicate your-data-is-here-but-isn't-alone. But
> > that's more of a feature for the FIEMAP stuff.
>
> And maybe we can step back and see what the callers of FIBMAP are doing
> with the results they're getting.
>
> One use is to discover the order in which to read file data that will
> result in efficient IO. If we had an interface specifically for this
> use case, then perhaps a sparse block would be better reported as the
> position of the inode relative to other data blocks. Maybe the inode
> block number in ext* land.

Can you clarify what you mean above with an example? I don't really
follow.

Thanks,

Mike Waychison
Re: [patch 0/6][RFC] Cleanup FIBMAP
Chris Mason wrote:

> On Sat, 27 Oct 2007 18:57:06 +0100 Anton Altaparmakov [EMAIL PROTECTED] wrote:
> > Hi,
> >
> > ->bmap is ugly and horrible! If you have to do this, at the very
> > least please cause ->bmap64 to be able to return error values in case
> > the file system failed to get the information, or indeed such
> > information does not exist, as is the case for compressed and
> > encrypted files for example, and also for small files that are inside
> > the on-disk inode (NTFS resident files and reiserfs packed tails are
> > examples of this).
> >
> > And another of my pet peeves with ->bmap is that it uses 0 to mean
> > "sparse", which causes a conflict on NTFS at least, as block zero is
> > part of the $Boot system file, so it is a real, valid block... NTFS
> > uses -1 to denote sparse blocks internally.
>
> Reiserfs and Btrfs also use 0 to mean packed. It would be nice if there
> was a way to indicate your-data-is-here-but-isn't-alone. But that's
> more of a feature for the FIEMAP stuff.

I hadn't heard of FIEMAP, so I went back and read the thread from
April/May. It seems that this is a much better approach than introducing
a FIBMAP64. Whatever happened with this proposal?

Mike Waychison
Re: [patch 0/6][RFC] Cleanup FIBMAP
> But, we shouldn't inflict all of this on fibmap/fiemap; we'll get lost
> trying to make the one true interface for all operations.
>
> For grouping operations on files, I think a read_tree syscall with
> hints for what userland will do (read, stat, delete, list filenames),
> and a better cookie than readdir, should do it.

Agreed, on both points.

- z
Re: [patch 0/6][RFC] Cleanup FIBMAP
> Can you clarify what you mean above with an example? I don't really
> follow.

Sure, take 'tar' as an example. It'll read files in the order that their
names are returned from directory listing. This can produce bad IO
patterns because the order in which the file names are returned doesn't
match the order of the file's blocks on disk. (htree, I'm looking at
you!)

People have noticed that tar-like loads can be sped up greatly just by
sorting the files by their inode number as returned by stat(), never mind
the file blocks themselves. One example of this is Chris Mason's 'acp':

	http://oss.oracle.com/~mason/acp/

The logical extension of that is to use FIBMAP to find the order of file
blocks on disk and then doing IO on blocks in sorted order. It'd take
work to write an app that does this reliably, sure.

In this use the application doesn't actually care what the absolute
numbers are. It cares about their ordering. File systems would be able to
choose whatever scheme they wanted for the actual values of the results
from a FIBMAP-alike, as long as the sorting resulted in the right IO
patterns.

Arguing that this use is significant enough to justify an addition to the
file system API is a stretch. I'm just sharing the observation.

- z
Re: [patch 0/6][RFC] Cleanup FIBMAP
On Oct 29, 2007 12:16 -0700, Mike Waychison wrote:
> Chris Mason wrote:
> > Reiserfs and Btrfs also use 0 to mean packed. It would be nice if
> > there was a way to indicate your-data-is-here-but-isn't-alone. But
> > that's more of a feature for the FIEMAP stuff.
>
> I hadn't heard of FIEMAP, so I went back and read the thread from
> April/May. It seems that this is a much better approach than
> introducing a FIBMAP64. What ever happened with this proposal?

We made a patch for ext4 that we need to update and push upstream. I've
just resent the spec we used in a separate email (attached to old thread)
for reference.

Cheers, Andreas

-- 
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
By request on #linuxfs, here is the FIEMAP spec that we used to implement
the FIEMAP support for ext4. There was an ext4 patch posted on August 29
to linux-ext4 entitled [PATCH] FIEMAP ioctl.

I've asked Kalpak to post an updated version of that patch along with the
changes to the filefrag tool to use FIEMAP.

FIEMAP_1.0.txt
==============

File Mapping Interface
18 June 2007
Andreas Dilger, Kalpak Shah

Introduction

This document covers the user interface and internal implementation of an
efficient fragmentation reporting tool. This will include addition of a
FIEMAP ioctl to fetch extents and changes to filefrag to use this ioctl.
The main objective of this tool is to efficiently and easily allow
inspection of the disk layout of one or more files without requiring user
access to the underlying storage device(s).

1 Requirements

The tool should be efficient in its use of resources, even for large
files. The FIBMAP ioctl is not suitable for use on large files, as this
can result in millions or even billions of ioctls to get the mapping
information for a single file. It should be possible to get the
information about an arbitrary-sized extent in a single call, and the
kernel component and user tool should efficiently use this information.

The user interface should be simple, and the output should be easily
understood - by default the filename(s), a count of extents (for each
file), and the optimal number of extents for a file with the given
striping parameters.

The user interface will be "filefrag [options] {filename ...}" and will
allow retrieving the fragmentation information for one or more files
specified on the command-line. The output will be of the form:

	/path/to/file1: extents=2 optimal=1
	/path/to/file2: extents=10 optimal=4
	...

2 Functional specification

The FIEMAP ioctl (FIle Extent MAP) is similar to the existing FIBMAP
block device ioctl used for mapping an individual logical block address
in a file to a physical block address in the block device.

The FIEMAP ioctl will return the logical to physical mapping for the
extent that contains the specified logical byte address.

	struct fiemap_extent {
		__u64 fe_offset; /* offset in bytes for the start of the extent */
		__u64 fe_length; /* length in bytes for the extent */
		__u32 fe_flags;  /* returned FIEMAP_EXTENT_* flags for the extent */
		__u32 fe_lun;    /* logical device number for extent (starting at 0) */
	};

	struct fiemap {
		__u64 fm_start;        /* logical byte offset (in/out) */
		__u64 fm_length;       /* logical length of map (in/out) */
		__u32 fm_flags;        /* FIEMAP_FLAG_* flags (in/out) */
		__u32 fm_extent_count; /* extents in fm_extents (in/out) */
		__u64 fm_unused;
		struct fiemap_extent fm_extents[0];
	};

In the ioctl request, the fiemap struct is initialized with the desired
mapping information.

	fiemap.fm_start = {desired start byte offset, 0 if whole file};
	fiemap.fm_length = {length of mapping in bytes, ~0ULL if whole file};
	fiemap.fm_extent_count = {number of fiemap_extents in fm_extents array};
	fiemap.fm_flags = {flags from FIEMAP_FLAG_* array, if needed};

	ioctl(fd, FIEMAP, &fiemap);

	{verify fiemap flags are understood}
	for (i = 0; i < fiemap.fm_extent_count; i++) {
		{process extent fiemap.fm_extents[i]};
	}

The logic for filefrag would be similar to the above. The size of the
extent array will be extrapolated from the file size, and multiple ioctls
of increasing extent count may be called for very large files. filefrag
can easily call the FIEMAP ioctls repeatedly using the end of the last
extent as the start offset for the next ioctl:

	fm_start = fm_extents[fm_extent_count - 1].fe_offset +
		   fm_extents[fm_extent_count - 1].fe_length + 1;

We do this until we find an extent with the FIEMAP_EXTENT_LAST flag set.
We will also need to re-initialise the fiemap flags, fm_extent_count,
fm_end.

The FIEMAP_FLAG_* values are specified below. If FIEMAP_FLAG_NO_EXTENTS
is given then the fm_extents array is not filled, and only
fm_extent_count is returned with the total number of extents in the file.

Any new flags that introduce and/or require an incompatible behaviour in
an application or in the kernel need to be in the range specified by
FIEMAP_FLAG_INCOMPAT (e.g. FIEMAP_FLAG_SYNC and FIEMAP_FLAG_NO_EXTENTS
would fall into that range if they were not part of the original
specification). This is currently only for future use. If it turns out
that FIEMAP_FLAG_INCOMPAT is not large enough then it is possible to use
the last INCOMPAT flag 0x0100 to indicate that more of the flag range
contains incompatible flags.

	#define FIEMAP_FLAG_SYNC        0x0001 /* sync file data before map */
	#define FIEMAP_FLAG_HSM_READ    0x0002 /* get data from HSM before map */
	#define FIEMAP_FLAG_NUM_EXTENTS 0x0004 /* return only number of extents */
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Sun, 28 Oct 2007, Erez Zadok wrote:

> I took your advice regarding ~(__GFP_FS|__GFP_IO),
> AOP_WRITEPAGE_ACTIVATE, and such. I revised my unionfs_writepage and
> unionfs_sync_page, and tested it under memory pressure: I have a couple
> of live CDs that use tmpfs and can deterministically reproduce the
> conditions resulting in A_W_A. I also went back to using
> grab_cache_page but with the gfp_mask suggestions you made. I'm happy
> to report that it all works great now!

That's very encouraging...

> Below is the entirety of the new unionfs_mmap and unionfs_sync_page
> code. I'd appreciate if you and others can look it over and see if you
> find any problems.

... but still a few problems, I'm afraid. The greatest problem is a
tmpfs one, that would be for me to solve. But first...

> static int unionfs_writepage(struct page *page,
> 			     struct writeback_control *wbc)
> {
> 	int err = -EIO;
> 	struct inode *inode;
> 	struct inode *lower_inode;
> 	struct page *lower_page;
> 	char *kaddr, *lower_kaddr;
> 	struct address_space *mapping; /* lower inode mapping */
> 	gfp_t old_gfp_mask;
>
> 	inode = page->mapping->host;
> 	lower_inode = unionfs_lower_inode(inode);
> 	mapping = lower_inode->i_mapping;
>
> 	/*
> 	 * find lower page (returns a locked page)
> 	 *
> 	 * We turn off __GFP_IO|__GFP_FS so as to prevent a deadlock under

On reflection, I think I went too far in asking you to mask off __GFP_IO.
Loop has to do so because it's a block device, down towards the IO layer;
but unionfs is a filesystem, so masking off __GFP_FS is enough to prevent
recursion into the FS layer with danger of deadlock, and leaving __GFP_IO
on gives a better chance of success.

> 	 * memory pressure conditions. This is similar to how the loop
> 	 * driver behaves (see loop_set_fd in drivers/block/loop.c).
> 	 * If we can't find the lower page, we redirty our page and
> 	 * return success so that the VM will call us again in the
> 	 * (hopefully near) future.
> 	 */
> 	old_gfp_mask = mapping_gfp_mask(mapping);
> 	mapping_set_gfp_mask(mapping, old_gfp_mask & ~(__GFP_IO|__GFP_FS));
> 	lower_page = grab_cache_page(mapping, page->index);
> 	mapping_set_gfp_mask(mapping, old_gfp_mask);

Hmm, several points on that.

When suggesting something like this, I did remark "what locking needed?".
You've got none: which is problematic if two stacked mounts are playing
with the same underlying file concurrently (yes, userspace would have a
data coherency problem in such a case, but the kernel still needs to
worry about its own internal integrity) - you'd be in danger of masking
(__GFP_IO|__GFP_FS) permanently off the underlying file; and furthermore,
losing error flags (AS_EIO, AS_ENOSPC) which share the same unsigned
long. Neither likely, but both wrong. See the comment on
mapping_set_gfp_mask() in include/pagemap.h:

	* This is non-atomic. Only to be used before the mapping is activated.

Strictly speaking, I guess loop was a little bit guilty even when just
loop_set_fd() did it: the underlying mapping might already be active. It
appears to be just as guilty as you in its do_loop_switch() case (done at
BIO completion time), but that's for a LOOP_CHANGE_FD ioctl which would
only be expected to be called once, during installation; whereas you're
using mapping_set_gfp_mask here with great frequency.

Another point on this is: loop masks __GFP_IO|__GFP_FS off the file for
the whole duration while it is looped, whereas you're flipping it just in
this preliminary section of unionfs_writepage. I think you're probably
okay to be doing it only here within ->writepage: I think loop covered
every operation because it's at the block device level, perhaps both
reads and writes needed to serve reclaim at the higher FS level; and also
easier to do it once for all.

Are you wrong to be doing it only around the grab_cache_page, leaving the
lower level ->writepage further down unprotected? Certainly doing it
around the grab_cache_page is likely to be way more important than around
the ->writepage (but that rather depends on the filesystem). And on
reflection, I think that the lower filesystem's writepage should already
be using GFP_NOFS to avoid deadlocks in any of its allocations when
wbc->for_reclaim, so you should be okay just masking off around the
grab_cache_page.

(Actually, in the wbc->for_reclaim case, I think you don't really need to
call the lower level writepage at all. Just set_page_dirty on the lower
page, unlock it and return. In due course that memory pressure which has
called unionfs_writepage will come around to the lower level page and do
writepage upon it. Whether that's a better strategy or not, I do not
know.)

There's an attractively simple answer to the mapping_set_gfp_mask locking
problem, if we're confident that it's only needed around the
grab_cache_page. Look at the declaration of grab_cache_page in
linux/pagemap.h: it immediately extracts the gfp_mask from the
Re: [patch 0/6][RFC] Cleanup FIBMAP
On Mon, 29 Oct 2007 12:18:22 -0700 Mike Waychison [EMAIL PROTECTED] wrote:

> Zach Brown wrote:
> > > > And another of my pet peeves with ->bmap is that it uses 0 to
> > > > mean "sparse", which causes a conflict on NTFS at least, as block
> > > > zero is part of the $Boot system file, so it is a real, valid
> > > > block... NTFS uses -1 to denote sparse blocks internally.
> > >
> > > Reiserfs and Btrfs also use 0 to mean packed. It would be nice if
> > > there was a way to indicate your-data-is-here-but-isn't-alone. But
> > > that's more of a feature for the FIEMAP stuff.
> >
> > And maybe we can step back and see what the callers of FIBMAP are
> > doing with the results they're getting.
> >
> > One use is to discover the order in which to read file data that will
> > result in efficient IO. If we had an interface specifically for this
> > use case, then perhaps a sparse block would be better reported as the
> > position of the inode relative to other data blocks. Maybe the inode
> > block number in ext* land.
>
> Can you clarify what you mean above with an example? I don't really
> follow.

This is a larger topic of helping userland optimize access to groups of
files. For example, during a readdir, if we knew the next step was to
delete all the files found, we could do one type of readahead (or even
order the returned values). If we knew the next step would be to read all
the files found, a different type of readahead would be useful.

But we shouldn't inflict all of this on fibmap/fiemap; we'll get lost
trying to make the one true interface for all operations.

For grouping operations on files, I think a read_tree syscall with hints
for what userland will do (read, stat, delete, list filenames), and a
better cookie than readdir, should do it.

-chris
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
Hi Andreas,

Thanks for posting this. I believe that an interface such as FIEMAP would
be very useful to Ocfs2 as well. (I added ocfs2-devel to the e-mail)

My comments below are generally geared towards understanding the ioctl
interface.

On Mon, Oct 29, 2007 at 01:45:07PM -0600, Andreas Dilger wrote:
> 2 Functional specification
>
> The FIEMAP ioctl (FIle Extent MAP) is similar to the existing FIBMAP
> block device ioctl used for mapping an individual logical block address
> in a file to a physical block address in the block device.
>
> The FIEMAP ioctl will return the logical to physical mapping for the
> extent that contains the specified logical byte address.
>
> struct fiemap_extent {
> 	__u64 fe_offset; /* offset in bytes for the start of the extent */

I'm a little bit confused by fe_offset. Is it a physical offset, or a
logical offset? The reason I ask is that your description above says
"FIEMAP ioctl will return the logical to physical mapping for the extent
that contains the specified logical byte address", which seems to imply
physical, but your math to get to the next logical start in a very
fragmented file implies that fe_offset is a logical offset:

	fm_start = fm_extents[fm_extent_count - 1].fe_offset +
		   fm_extents[fm_extent_count - 1].fe_length + 1;

> The logic for filefrag would be similar to the above. The size of the
> extent array will be extrapolated from the file size, and multiple
> ioctls of increasing extent count may be called for very large files.
> filefrag can easily call the FIEMAP ioctls repeatedly using the end of
> the last extent as the start offset for the next ioctl:
>
> 	fm_start = fm_extents[fm_extent_count - 1].fe_offset +
> 		   fm_extents[fm_extent_count - 1].fe_length + 1;
>
> We do this until we find an extent with the FIEMAP_EXTENT_LAST flag
> set. We will also need to re-initialise the fiemap flags,
> fm_extent_count, fm_end.

I think you meant 'fm_length' instead of 'fm_end' there.

> The FIEMAP_FLAG_* values are specified below. If
> FIEMAP_FLAG_NO_EXTENTS is given then the fm_extents array is not
> filled, and only fm_extent_count is returned with the total number of
> extents in the file.
>
> Any new flags that introduce and/or require an incompatible behaviour
> in an application or in the kernel need to be in the range specified by
> FIEMAP_FLAG_INCOMPAT (e.g. FIEMAP_FLAG_SYNC and FIEMAP_FLAG_NO_EXTENTS
> would fall into that range if they were not part of the original
> specification). This is currently only for future use. If it turns out
> that FIEMAP_FLAG_INCOMPAT is not large enough then it is possible to
> use the last INCOMPAT flag 0x0100 to indicate that more of the flag
> range contains incompatible flags.
>
> #define FIEMAP_FLAG_SYNC        0x0001 /* sync file data before map */
> #define FIEMAP_FLAG_HSM_READ    0x0002 /* get data from HSM before map */
> #define FIEMAP_FLAG_NUM_EXTENTS 0x0004 /* return only number of extents */
> #define FIEMAP_FLAG_INCOMPAT    0xff00 /* error for unknown flags in here */
>
> The returned data from the FIEMAP ioctl is an array of fiemap_extent
> elements, one per extent in the file. The first extent will contain the
> byte specified by fm_start and the last extent will contain the byte
> specified by fm_start + fm_len, unless there are more than the
> passed-in fm_extent_count extents in the file, or this is beyond the
> EOF, in which case the last extent will be marked with
> FIEMAP_EXTENT_LAST.
>
> Each extent returned has a set of flags associated with it that provide
> additional information about the extent. Not all filesystems will
> support all flags.
>
> FIEMAP_FLAG_NUM_EXTENTS will return only the number of extents used by
> the file. It will be used by default for filefrag since the specific
> extent information is not required in many cases.
>
> #define FIEMAP_EXTENT_HOLE 0x0001 /* has no data or space allocation */

Btw, I really like that holes are explicitly marked.

> #define FIEMAP_EXTENT_UNWRITTEN 0x0002 /* space allocated, but no data */
> #define FIEMAP_EXTENT_UNMAPPED  0x0004 /* has data but no space allocated */
> #define FIEMAP_EXTENT_ERROR     0x0008 /* map error, errno in fe_offset */
> #define FIEMAP_EXTENT_NO_DIRECT 0x0010 /* cannot access data directly */
> #define FIEMAP_EXTENT_LAST      0x0020 /* last extent in the file */
> #define FIEMAP_EXTENT_DELALLOC  0x0040 /* has data but not yet written */
> #define FIEMAP_EXTENT_SECONDARY 0x0080 /* data in secondary storage */
> #define FIEMAP_EXTENT_EOF       0x0100 /* fm_start + fm_len beyond EOF */

Is EOF here considering beyond i_size or beyond allocation?

> #define FIEMAP_EXTENT_UNKNOWN   0x0200 /* in use but location is unknown */
>
> FIEMAP_EXTENT_NO_DIRECT means data cannot be directly accessed (maybe
> encrypted, compressed, etc.)

Would it be valid to use FIEMAP_EXTENT_NO_DIRECT for marking in-inode
data? Btrfs, Ocfs2, and Gfs2 pack small amounts of
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Oct 29, 2007 16:13 -0600, Andreas Dilger wrote:
> On Oct 29, 2007 13:57 -0700, Mark Fasheh wrote:
> > I'm a little bit confused by fe_offset. Is it a physical offset, or a
> > logical offset? The reason I ask is that your description above says
> > "FIEMAP ioctl will return the logical to physical mapping for the
> > extent that contains the specified logical byte address", which seems
> > to imply physical, but your math to get to the next logical start in
> > a very fragmented file implies that fe_offset is a logical offset:
> >
> > 	fm_start = fm_extents[fm_extent_count - 1].fe_offset +
> > 		   fm_extents[fm_extent_count - 1].fe_length + 1;
>
> Note the distinction between fe_offset (which is a physical offset for
> a single extent) and fm_offset (which is a logical offset for that
> file).

Actually, that is completely bunk. What it should say is something like:

	filefrag can easily call the FIEMAP ioctls repeatedly using the
	returned fm_start and fm_length as the start offset for the next
	ioctl:

		fiemap.fm_start = fiemap.fm_start + fiemap.fm_length + 1;

Cheers, Andreas

-- 
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Mon, Oct 29, 2007 at 04:29:07PM -0600, Andreas Dilger wrote:
> On Oct 29, 2007 16:13 -0600, Andreas Dilger wrote:
> > On Oct 29, 2007 13:57 -0700, Mark Fasheh wrote:
> > > I'm a little bit confused by fe_offset. Is it a physical offset, or
> > > a logical offset? The reason I ask is that your description above
> > > says "FIEMAP ioctl will return the logical to physical mapping for
> > > the extent that contains the specified logical byte address", which
> > > seems to imply physical, but your math to get to the next logical
> > > start in a very fragmented file implies that fe_offset is a logical
> > > offset:
> > >
> > > 	fm_start = fm_extents[fm_extent_count - 1].fe_offset +
> > > 		   fm_extents[fm_extent_count - 1].fe_length + 1;
> >
> > Note the distinction between fe_offset (which is a physical offset
> > for a single extent) and fm_offset (which is a logical offset for
> > that file).
>
> Actually, that is completely bunk. What it should say is something
> like:
>
> 	filefrag can easily call the FIEMAP ioctls repeatedly using the
> 	returned fm_start and fm_length as the start offset for the next
> 	ioctl:
>
> 		fiemap.fm_start = fiemap.fm_start + fiemap.fm_length + 1;

Yeah - that's where I was going with my question. This is much more clear
now, thanks.
	--Mark

-- 
Mark Fasheh
Senior Software Developer, Oracle
[EMAIL PROTECTED]
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Mon, Oct 29, 2007 at 01:45:07PM -0600, Andreas Dilger wrote: By request on #linuxfs, here is the FIEMAP spec that we used to implement the FIEMAP support for ext4. There was an ext4 patch posted on August 29 to linux-ext4 entitled [PATCH] FIEMAP ioctl. Link: http://marc.info/?l=linux-ext4&m=118838241209683&w=2 That's a very ext4-specific ioctl interface. Can we get this made generic like the FIBMAP interface so we don't have to replicate all the copyin/copyout handling and interface definitions everywhere? i.e. a ->extent_map aops callout to the filesystem in generic code just like ->bmap? I've asked Kalpak to post an updated version of that patch along with the changes to the filefrag tool to use FIEMAP. Where can I find the test program that validates the implementation? Also, following the fallocate model, can we get the interface definition turned into a man page before anything is submitted upstream? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Oct 29, 2007 13:57 -0700, Mark Fasheh wrote: Thanks for posting this. I believe that an interface such as FIEMAP would be very useful to Ocfs2 as well. (I added ocfs2-devel to the e-mail) I tried to make it as Lustre-agnostic as possible... On Mon, Oct 29, 2007 at 01:45:07PM -0600, Andreas Dilger wrote: The FIEMAP ioctl (FIle Extent MAP) is similar to the existing FIBMAP block device ioctl used for mapping an individual logical block address in a file to a physical block address in the block device. The FIEMAP ioctl will return the logical to physical mapping for the extent that contains the specified logical byte address. struct fiemap_extent { __u64 fe_offset; /* offset in bytes for the start of the extent */ I'm a little bit confused by fe_offset. Is it a physical offset, or a logical offset? The reason I ask is that your description above says FIEMAP ioctl will return the logical to physical mapping for the extent that contains the specified logical byte address. Which seems to imply physical, but your math to get to the next logical start in a very fragmented file, implies that fe_offset is a logical offset: fm_start = fm_extents[fm_extent_count - 1].fe_offset + fm_extents[fm_extent_count - 1].fe_length + 1; Note the distinction between fe_offset (which is a physical offset for a single extent) and fm_offset (which is a logical offset for that file). We do this until we find an extent with FIEMAP_EXTENT_LAST flag set. We will also need to re-initialise the fiemap flags, fm_extent_count, fm_end. I think you meant 'fm_length' instead of 'fm_end' there. You're right, thanks. #define FIEMAP_EXTENT_LAST 0x0020 /* last extent in the file */ #define FIEMAP_EXTENT_EOF 0x0100 /* fm_start + fm_len beyond EOF */ Is EOF here considering beyond i_size or beyond allocation? _EOF == beyond i_size. _LAST == last extent in the file. In most cases FIEMAP_EXTENT_EOF will be set at the same time as FIEMAP_EXTENT_LAST, but in case of e.g.
prealloc beyond i_size the EOF flag may be set on one or more earlier extents. FIEMAP_EXTENT_NO_DIRECT means data cannot be directly accessed (maybe encrypted, compressed, etc.) Would it be valid to use FIEMAP_EXTENT_NO_DIRECT for marking in-inode data? Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode blocks. Hmm, but part of the issue would be how to request the extra data, and what offset it would be given? One could, for example, use negative offsets to represent metadata or something, or add a FIEMAP_EXTENT_META or similar, I hadn't given that much thought. The other issue is that I'd like to get the basics of the API in place before it gets too complex. We can always add functionality with more FIEMAP_FLAG_* (whether in the INCOMPAT range or not, depending on what is being done). Cheers, Andreas -- Andreas Dilger Sr. Software Engineer, Lustre Group Sun Microsystems of Canada, Inc.
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Oct 29, 2007 17:11 -0700, Mark Fasheh wrote: On Mon, Oct 29, 2007 at 04:13:02PM -0600, Andreas Dilger wrote: Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode blocks. Hmm, but part of the issue would be how to request the extra data, and what offset it would be given? One could, for example, use negative offsets to represent metadata or something, or add a FIEMAP_EXTENT_META or similar, I hadn't given that much thought. Well, fe_offset and fe_length are already expressed in bytes, so we could just put the byte offset to where the inline data starts in there. fe_length is just used as the length allocated for inline-data. If fe_offset is required to be block aligned, then we could add a field to express an offset within the block where data would be found - say 'fe_data_start_offset'. In the non-inline case, we could guarantee that fe_data_start_offset is zero. That way software which doesn't want to care whether something is inline-data (for example, a backup program) or not could just blindly add it to fe_offset before looking at the data. Oh, I was confused as to what you are asking. Mapping in-inode data is just fine using the existing interface. The byte offset of the data is given, and the FIEMAP_EXTENT_NO_DIRECT flag is set to indicate that it isn't necessarily safe to do IO directly to that byte offset in the file (e.g. tail packed, compressed data, etc). I was thinking you were asking how to map metadata (e.g. indirect blocks). Cheers, Andreas -- Andreas Dilger Sr. Software Engineer, Lustre Group Sun Microsystems of Canada, Inc.
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Mon, Oct 29, 2007 at 04:13:02PM -0600, Andreas Dilger wrote: On Oct 29, 2007 13:57 -0700, Mark Fasheh wrote: Thanks for posting this. I believe that an interface such as FIEMAP would be very useful to Ocfs2 as well. (I added ocfs2-devel to the e-mail) I tried to make it as Lustre-agnostic as possible... IMHO, your description succeeded at that. I'm hoping that the final patch can have mostly generic code, like FIBMAP does today. #define FIEMAP_EXTENT_LAST 0x0020 /* last extent in the file */ #define FIEMAP_EXTENT_EOF 0x0100 /* fm_start + fm_len beyond EOF*/ Is EOF here considering beyond i_size or beyond allocation? _EOF == beyond i_size. _LAST == last extent in the file. In most cases FIEMAP_EXTENT_EOF will be set at the same time as FIEMAP_EXTENT_LAST, but in case of e.g. prealloc beyond i_size the EOF flag may be set on one or more earlier extents. Oh, ok great - I was primarily looking for a way to say there's allocation past i_size and it looks like we have it. FIEMAP_EXTENT_NO_DIRECT means data cannot be directly accessed (maybe encrypted, compressed, etc.) Would it be valid to use FIEMAP_EXTENT_NO_DIRECT for marking in-inode data? Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode blocks. Hmm, but part of the issue would be how to request the extra data, and what offset it would be given? One could, for example, use negative offsets to represent metadata or something, or add a FIEMAP_EXTENT_META or similar, I hadn't given that much thought. Well, fe_offset and fe_length are already expressed in bytes, so we could just put the byte offset to where the inline data starts in there. fe_length is just used as the length allocated for inline-data. If fe_offset is required to be block aligned, then we could add a field to express an offset within the block where data would be found - say 'fe_data_start_offset'. In the non-inline case, we could guarantee that fe_data_start_offset is zero. 
That way software which doesn't want to care whether something is inline-data (for example, a backup program) or not could just blindly add it to fe_offset before looking at the data. Regardless, I think we also want to explicitly flag this: #define FIEMAP_EXTENT_DATA_IN_INODE 0x0400 /* extent data is stored in inode block */ I'm going to pretend that I completely understand reiserfs tail-packing and say that my approaches above look like they could work for that case too. We'd want to add a separate flag for tail packed data though. The other issue is that I'd like to get the basics of the API in place before it gets too complex. We can always add functionality with more FIEMAP_FLAG_* (whether in the INCOMPAT range or not, depending on what is being done). Sure, but I think whatever goes upstream should be able to handle this case - there's file systems in use _today_ which put data in inode blocks and pack file tails. Thanks, --Mark -- Mark Fasheh Senior Software Developer, Oracle [EMAIL PROTECTED]
Proposal to improve filesystem/block snapshot interaction
G'day, A number of people have already seen this; I'm posting for wider comment and to move some interesting discussion to a public list. I'll apologise in advance for the talk about SGI technologies (including proprietary ones), but all the problems mentioned apply to in-tree technologies too. This proposal seeks to solve three problems in our NAS server product due to the interaction of the filesystem (XFS) and the block-based snapshot feature (XVM snapshot). It's based on discussions held with various people over the last few weeks, including Roger Strassburg, Christoph Hellwig, David Chinner, and Donald Douwsma. a) The first problem is the server's behaviour when a filesystem which is subject to snapshot is written to, and the snapshot repository runs out of room. The failure mode can be quite severe. XFS issues a metadata write to the block device, triggering a Copy-On-Write operation in the XVM snapshot element, which because of the full repository fails with EIO. When XFS sees the failure it shuts down the filesystem. All subsequent attempts to perform IO to the filesystem block indefinitely. In particular any NFS server thread will block and never reply to the NFS client. The NFS client will retry, causing another NFS server thread to block, and repeat until every NFS server thread is blocked. At this point all NFS service for all filesystems ceases. See PV 958220 and PV 958140 for a description of this problem and some of the approaches which have been discussed for resolving it. b) The second problem is that certain common combinations of filesystem operations can cause large wastes of space in the XVM snapshot repository. Examples include writing the same file twice with dd, or writing a new file and deleting it. The cause is the inability of the XVM snapshot code to be able to free regions in the snapshot repository that are no longer in use by the filesystem; this information is simply not available within the block layer. 
Note that problem b) also contributes to problem a) by increasing repository usage and thus making it easier to encounter an out-of-space condition on the repository. c) The third problem is an unfortunate interaction between an XFS internal log and block snapshots. The log is a fixed region of the block device which is written as a side effect of a great many different filesystem operations. The information written there has no value and is not even read until and unless log recovery needs to be performed after the server has crashed. This means the log does not need to be preserved by the block snapshot feature (because at the point in time when the snapshot is taken, log recovery must have already happened). In fact the correct procedure when mounting a read-only snapshot is to use the norecovery option to prevent any attempt to read the log (although the NAS server software actually doesn't do this). However, because the block device layer doesn't have enough information to know any better, the first pass of writes to the log are subjected to Copy-On-Write. This has two undesirable effects. Firstly, it increases the amount of snapshot repository space used by each snapshot, thus contributing to problem a). Secondly, it puts a significant performance penalty on filesystem metadata operations for some time after each snapshot is taken; given that the NAS server can be configured to take regular frequent snapshots this may mean all of the time. An obvious solution is to use an external XFS log, but this is quite inconvenient for the NAS server software to arrange. For one thing, we would need to construct a separate external log device for the main filesystem and one for each mounted snapshot. Note that these problems are not specific to XVM but will be encountered by any Linux block-COWing snapshot implementation. For example the DM snapshot implementation is documented to suffer from problem a).
From the linux/Documentation/device-mapper/snapshot.txt: COW device will often be smaller than the origin and if it fills up the snapshot will become useless and be disabled, returning errors. So it is important to monitor the amount of free space and expand the COW device before it fills up. During discussions, it became clear that we could solve all three of these problems by improving the block device interface to allow a filesystem to provide the block device with dynamic block usage hints. For example, when unlinking a file the filesystem could tell the block device a hint of the form "I'm about to stop using these blocks". Most block devices would silently ignore these hints, but a snapshot COW implementation (the copy-on-write XVM element or the snapshot-origin dm target) could use them to help avoid these problems. For example, the response to the "I'm about to stop using these blocks" hint could be to free the space used in the snapshot repository for unnecessary copies of those blocks.
Re: Proposal to improve filesystem/block snapshot interaction
On Tue, Oct 30, 2007 at 12:51:47AM +0100, Arnd Bergmann wrote: On Monday 29 October 2007, Christoph Hellwig wrote: - Forwarded message from Greg Banks [EMAIL PROTECTED] - Date: Thu, 27 Sep 2007 16:31:13 +1000 From: Greg Banks [EMAIL PROTECTED] Subject: Proposal to improve filesystem/block snapshot interaction To: David Chinner [EMAIL PROTECTED], Donald Douwsma [EMAIL PROTECTED], Christoph Hellwig [EMAIL PROTECTED], Roger Strassburg [EMAIL PROTECTED] Cc: Mark Goodwin [EMAIL PROTECTED], Brett Jon Grandbois [EMAIL PROTECTED] This proposal seeks to solve three problems in our NAS server product due to the interaction of the filesystem (XFS) and the block-based snapshot feature (XVM snapshot). It's based on discussions held with various people over the last few weeks, including Roger Strassburg, Christoph Hellwig, David Chinner, and Donald Douwsma. Hi Greg, Christoph forwarded me your mail, because I mentioned to him that I'm trying to come up with a similar change, and it might make sense to combine our efforts. Excellent, thanks Christoph ;-) For example, when unlinking a file the filesystem could tell the block device a hint of the form I'm about to stop using these blocks. Most block devices would silently ignore these hints, but a snapshot COW implementation (the copy-on-write XVM element or the snapshot-origin dm target) could use them to help avoid these problems. For example, the response to the I'm about to stop using these blocks hint could be to free the space used in the snapshot repository for unnecessary copies of those blocks. The case I'm interested in is the more specific case of 'erase', which is more of a performance optimization than a space optimization. When you have a flash medium, it's useful to erase a block as soon as it's becoming unused, so that a subsequent write will be faster. Moreover, on an MTD medium, you may not even be able to write to a block unless it has been erased before. 
Spending the device's time erasing early, when the CPU isn't waiting for it, rather than later, when it would add to effective write latency, makes sense. Of course snapshot cow elements may be part of more generic element trees. In general there may be more than one consumer of block usage hints in a given filesystem's element tree, and their locations in that tree are not predictable. This means the block extents mentioned in the usage hints need to be subject to the block mapping algorithms provided by the element tree. As those algorithms are currently implemented using bio mapping and splitting, the easiest and simplest way to reuse those algorithms is to add new bio flags. First we need a mechanism to indicate that a bio is a hint rather than a real IO. Perhaps the easiest way is to add a new flag to the bi_rw field: #define BIO_RW_HINT 5 /* bio is a hint not a real io; no pages */ My first thought was to do this on the request layer, not already on bio, but they can easily be combined, I guess. My first thoughts were along similar lines, but I wasn't expecting these hint bios to survive deep enough in the stack to need queuing and thus visibility in struct request; I was expecting their lifetime to be some passage and splitting through a volume manager and then conversion to synchronous metadata operations. Plus, hijacking bios means not having to modify every single DM target to duplicate its block mapping algorithm. Basically, I was thinking of loopback-like block mapping and not considering flash. I suppose for flash where there's a real erase operation, you'd want to be queuing and that means a new request type. We'll also need a field to tell us which kind of hint the bio represents. Perhaps a new field could be added, or perhaps the top 16 bits of bi_rw (currently used to encode the bio's priority, which has no meaning for hint bios) could be reused.
The latter approach may allow hints to be used without modifying the bio structure or any code that uses it other than the filesystem and the snapshot implementation. Such a property would have obvious advantages for our NAS server software, where XFS and XVM modules are provided but the other users of struct bio are stock SLES code. Next we'll need three bio hint types with the following semantics. BIO_HINT_ALLOCATE The bio's block extent will soon be written by the filesystem and any COW that may be necessary to achieve that should begin now. If the COW is going to fail, the bio should fail. Note that this provides a way for the filesystem to manage when and how failures to COW are reported. BIO_HINT_RELEASE The bio's block extent is no longer in use by the filesystem and will not be read in the future. Any storage used to back the extent may be released without any threat to filesystem or data integrity.
Re: Proposal to improve filesystem/block snapshot interaction
On Tuesday October 30, [EMAIL PROTECTED] wrote: Of course snapshot cow elements may be part of more generic element trees. In general there may be more than one consumer of block usage hints in a given filesystem's element tree, and their locations in that tree are not predictable. This means the block extents mentioned in the usage hints need to be subject to the block mapping algorithms provided by the element tree. As those algorithms are currently implemented using bio mapping and splitting, the easiest and simplest way to reuse those algorithms is to add new bio flags. So are you imagining that you might have distinct snapshotable elements, and that some of these might be combined by e.g. RAID0 into a larger device, then a filesystem is created on that? I ask because my first thought was that the sort of communication you want seems like it would be just between a filesystem and the block device that it talks directly to, and as you are particularly interested in XFS and XVM, you could come up with whatever protocol you want for those two to talk to each other, prototype it, iron out all the issues, then say We've got this really cool thing to make snapshots much faster - wanna share? and thus be presenting from a position of more strength (the old 'code talks' mantra). First we need a mechanism to indicate that a bio is a hint rather than a real IO. Perhaps the easiest way is to add a new flag to the bi_rw field: #define BIO_RW_HINT 5 /* bio is a hint not a real io; no pages */ Reminds me of the new approach to issue_flush_fn which is just to have a zero-length barrier bio (is that implemented yet? I lost track). But different as a zero length barrier has zero length, and your hints have a very meaningful length. Next we'll need three bio hint types with the following semantics. BIO_HINT_ALLOCATE The bio's block extent will soon be written by the filesystem and any COW that may be necessary to achieve that should begin now.
If the COW is going to fail, the bio should fail. Note that this provides a way for the filesystem to manage when and how failures to COW are reported. Would it make sense to allow the bi_sector to be changed by the device and to have that change honoured. i.e. Please allocate 128 blocks, maybe 'here' OK, 128 blocks allocated, but they are actually over 'there'. If the device is tracking what space is and isn't used, it might make life easier for it to do the allocation. Maybe even have a variant Allocate 128 blocks, I don't care where. Is this bio supposed to block until the copy has happened? Or only until the space of the copy has been allocated and possibly committed? Or must it return without doing any IO at all? BIO_HINT_RELEASE The bio's block extent is no longer in use by the filesystem and will not be read in the future. Any storage used to back the extent may be released without any threat to filesystem or data integrity. If the allocation unit of the storage device (e.g. a few MB) does not match the allocation unit of the filesystem (e.g. a few KB) then for this to be useful either the storage device must start recording tiny allocations, or the filesystem should re-release areas as they grow. i.e. when releasing a range of a device, look in the filesystem's usage records for the largest surrounding free space, and release all of that. Would this be a burden on the filesystems? Is my imagined disparity between block sizes valid? Would it be just as easy for the storage device to track small allocation/deallocations? BIO_HINT_DONTCOW (the Bart Simpson BIO). The bio's block extent is not needed in mounted snapshots and does not need to be subjected to COW. This seems like a much more domain-specific function than the other two which themselves could be more generally useful (I'm imagining using hints from them to e.g. accelerate RAID reconstruction).
Surely the correct thing to do with the log is to put it on a separate device which itself isn't snapshotted. If you have a storage manager that is smart enough to handle these sorts of things, maybe the functionality you want is Give me a subordinate device which is not snapshotted, size X, then journal to that virtual device. I guess that is equally domain specific, but the difference is that if you try to read from the DONTCOW part of the snapshot, you get bad old data, whereas if you try to access the subordinate device of a snapshot, you get an IO error - which is probably safer. Comments? On the whole it seems reasonably sane providing you are from the school which believes that volume managers and filesystems should be kept separate :-) NeilBrown
Re: Proposal to improve filesystem/block snapshot interaction
On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote: On Tuesday October 30, [EMAIL PROTECTED] wrote: Of course snapshot cow elements may be part of more generic element trees. In general there may be more than one consumer of block usage hints in a given filesystem's element tree, and their locations in that tree are not predictable. This means the block extents mentioned in the usage hints need to be subject to the block mapping algorithms provided by the element tree. As those algorithms are currently implemented using bio mapping and splitting, the easiest and simplest way to reuse those algorithms is to add new bio flags. So are you imagining that you might have distinct snapshotable elements, and that some of these might be combined by e.g. RAID0 into a larger device, then a filesystem is created on that? I was thinking more a concatenation than a stripe, but yes you could do such a thing, e.g. to parallelise the COW procedure. We don't do any such thing in our product; the COW element is always inserted at the top of the logical element tree. I ask because my first thought was that the sort of communication you want seems like it would be just between a filesystem and the block device that it talks directly to, and as you are particularly interested in XFS and XVM, you could come up with whatever protocol you want for those two to talk to each other, prototype it, iron out all the issues, then say We've got this really cool thing to make snapshots much faster - wanna share? and thus be presenting from a position of more strength (the old 'code talks' mantra). Indeed, code talks ;-) I was hoping someone else would do that talking for me, though. First we need a mechanism to indicate that a bio is a hint rather than a real IO.
Perhaps the easiest way is to add a new flag to the bi_rw field: #define BIO_RW_HINT 5 /* bio is a hint not a real io; no pages */ Reminds me of the new approach to issue_flush_fn which is just to have a zero-length barrier bio (is that implemented yet? I lost track). But different as a zero length barrier has zero length, and your hints have a very meaningful length. Yes. Next we'll need three bio hints types with the following semantics. BIO_HINT_ALLOCATE The bio's block extent will soon be written by the filesystem and any COW that may be necessary to achieve that should begin now. If the COW is going to fail, the bio should fail. Note that this provides a way for the filesystem to manage when and how failures to COW are reported. Would it make sense to allow the bi_sector to be changed by the device and to have that change honoured. i.e. Please allocate 128 blocks, maybe 'here' OK, 128 blocks allocated, but they are actually over 'there'. That wasn't the expectation at all. Perhaps allocate is a poor name. I have just allocated, deal with it might be more appropriate. Perhaps BIO_HINT_WILLUSE or something. If the device is tracking what space is and isn't used, it might make life easier for it to do the allocation. Maybe even have a variant Allocate 128 blocks, I don't care where. That kind of thing might perhaps be useful for flash, but I think current filesystems would have conniptions. Is this bio supposed to block until the copy has happened? Or only until the space of the copy has been allocated and possibly committed? The latter. The writes following will block until the COW has completed, or might be performed sufficiently later that the COW has meanwhile completed (I think this implies an extra state in the snapshot metadata to avoid double-COWing). The point of the hint is to allow the snapshot code to test for running out of repo space and report that failure at a time when the filesystem is able to handle it gracefully. 
Or must it return without doing any IO at all? I would expect it would be a useful optimisation to start the IO but not wait for its completion, but that the first implementation would just do a space check. BIO_HINT_RELEASE The bio's block extent is no longer in use by the filesystem and will not be read in the future. Any storage used to back the extent may be released without any threat to filesystem or data integrity. If the allocation unit of the storage device (e.g. a few MB) does not match the allocation unit of the filesystem (e.g. a few KB) then for this to be useful either the storage device must start recording tiny allocations, or the filesystem should re-release areas as they grow. i.e. when releasing a range of a device, look in the filesystem's usage records for the largest surrounding free space, and release all of that. Good point. I was planning on ignoring this problem :-/ Given that current snapshot implementations waste *all* the blocks in deleted files, it would be an improvement to scavenge the blocks in large extents.