Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote:
> On Jun 14, 2007 09:52 +1000, David Chinner wrote:
> > .B FA_PREALLOCATE
> > provides the same functionality as .B FA_ALLOCATE except it does
> > not ever change the file size. This allows allocation of zero
> > blocks beyond the end of file and is useful for optimising append
> > workloads.
> > .TP
> > .B FA_DEALLOCATE
> > removes the underlying disk space within the given range. The disk
> > space shall be removed regardless of its contents, so both space
> > allocated via .B FA_ALLOCATE and .B FA_PREALLOCATE as well as via
> > .B write(3) will be removed. .B FA_DEALLOCATE shall never remove
> > disk blocks outside the range specified.
>
> So this is essentially the same as punch.

Depends on your definition of punch.

> There doesn't seem to be a mechanism to only unallocate unused
> FA_{PRE,}ALLOCATE space at the end.

ftruncate()

> > .B FA_DEALLOCATE shall never change the file size. If changing the
> > file size is required when deallocating blocks from an offset to
> > end of file (or beyond end of file), .B ftruncate64(3) should be
> > used.
>
> This also seems to be a bit of a wart, since it isn't a natural
> converse of either of the above functions. How about having two
> modes, similar to FA_ALLOCATE and FA_PREALLOCATE?

<shrug> Whatever.

> Say, FA_PUNCH (which would be as you describe here - deletes all
> data in the specified range, changing the file size if it overlaps
> EOF),

Punch means different things to different people. To me (and probably
most XFS-aware people) punch implies no change to the file size, i.e.
anyone currently using XFS_IOC_UNRESVSP will expect punching holes to
leave the file size unchanged. This is the behaviour I described for
FA_DEALLOCATE.

> and FA_DEALLOCATE, which only deallocates unused FA_{PRE,}ALLOCATE
> space?

That's an unwritten-to-hole extent conversion. Is that really useful
for anything? That's easily implemented with FIEMAP and FA_DEALLOCATE.
Anyway, because we can't agree on a single pair of flags:

	FA_ALLOCATE     == posix_fallocate()
	FA_DEALLOCATE   == unwritten-to-hole ???
	FA_RESV_SPACE   == XFS_IOC_RESVSP64
	FA_UNRESV_SPACE == XFS_IOC_UNRESVSP64

We might also consider making @mode be a mask instead of an
enumeration:

	FA_FL_DEALLOC   0x01	(default allocate)
	FA_FL_KEEP_SIZE 0x02	(default extend/shrink size)
	FA_FL_DEL_DATA  0x04	(default keep written data on DEALLOC)

i.e.:

	#define FA_ALLOCATE	0
	#define FA_DEALLOCATE	FA_FL_DEALLOC
	#define FA_RESV_SPACE	FA_FL_KEEP_SIZE
	#define FA_UNRESV_SPACE	(FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)

> I suppose it might be a bit late in the game to add a goal parameter
> and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc. to make the API
> more suitable for XFS?

It would suffice for the simpler operations, I think, but we'll
rapidly run out of flags and we'll still need another interface for
doing complex stuff.

> The goal could be a single __u64, or a struct with e.g. __u64 byte
> offset (possibly also __u32 lun like in FIEMAP).

I guess the one potential limitation here is the number of function
parameters on some architectures. To be useful it needs to be __u64.

> > .B ENOSPC
> > There is not enough space left on the device containing the file
> > referred to by .IR fd.
>
> Should probably say whether space is removed on failure or not.

Right. I'd say on error you need to FA_DEALLOCATE to ensure any space
allocated was freed back up. That way the error handling in the
allocate functions is much simpler (i.e. no need to undo there).

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
Hi Chris-

John Stoffel wrote:
> As a user of NetApps, having quotas (if only for reporting purposes)
> and some way to migrate non-used files to slower/cheaper storage
> would be great. I.e. being able to set up two pools, one being
> RAID6, the other being RAID1, where all currently accessed files are
> in the RAID1 setup, but if un-used get migrated to the RAID6 area.
> And of course some way for efficient backups and more importantly
> RESTORES of data which is segregated like this.

I like the way dump and restore was handled in AFS (and now ZFS and
NetApp). There is a simple command to flatten a file system and send
it to another system, which can receive it and re-expand it. The
dump/restore process uses snapshots and can easily send incremental
backups which are significantly smaller than 0-level. This is
somewhat better than rsync, because you don't need checksums to
discover what data has changed -- you already have the new data
segregated into copied-on-write blocks.

NetApp happens to use the standard NDMP protocol for sending the
flattened file system. NetApp uses it for synchronous replication,
volume migration, and backup to nearline storage and tape. AFS used
"vol dump" and "vol restore" for migration, replication, and backup.
ZFS has the "zfs send" and "zfs receive" commands that do basically
the same (Eric Kustarz recently published a blog entry that described
how these work).

And of course, all file system objects are able to be sent this way:
streams, xattrs, ACLs, and so on are all supported.

Note also that NFSv4 supports the idea of migrated or replicated file
objects. All that is needed to support it is a mechanism on the
servers to actually move the data.

--
Chuck Lever
Principal Member of Staff
Oracle Corporation
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
Chris Mason wrote:
> The basic list of features looks like this:

[amazing stuff snipped]

> The current status is a very early alpha state, and the kernel code
> weighs in at a sparsely commented 10,547 lines. I'm releasing now in
> hopes of finding people interested in testing, benchmarking,
> documenting, and contributing to the code.

OK, what kind of benchmarks would help you most? bonnie?
compilebench? Something else?

Is it possible to test it on top of LVM2 on RAID at this stage?

thanks, Florian
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Thu, Jun 14, 2007 at 08:29:10PM +0200, Florian D. wrote:
> Chris Mason wrote:
> > The basic list of features looks like this:
> [amazing stuff snipped]
> > The current status is a very early alpha state, and the kernel
> > code weighs in at a sparsely commented 10,547 lines. I'm releasing
> > now in hopes of finding people interested in testing,
> > benchmarking, documenting, and contributing to the code.
>
> ok, what kind of benchmarks would help you most? bonnie?
> compilebench? sth. other?

Thanks! Let's start with a list of the things I know will go badly:

	O_SYNC (not implemented)
	O_DIRECT (not implemented)
	aio (not implemented)
	multi-threaded (brain-dead tree locking)
	things that fill the drive (will oops)
	mmap() writes (not supported; mmap reads are ok)

Also, overlapping writes are not that well supported. For example, tar
by default will write in 10k chunks, and btrfs_file_write currently
COWs on every single write. So, if your tar file has a bunch of 16k
files, it'll go much faster if you tell tar to use 16k (or 8k)
buffers. In general, I was hoping for a generic delayed allocation
facility to magically appear in the kernel, and so I haven't spent a
lot of time tuning btrfs_file_write for this yet.

Any other workload is fair game, and I'm especially interested in
seeing how badly the COW hurts. For example, on a big file, I'd like
to see how much slower big sequential reads are after small random
writes (fio is good for this). Or, writing to every file on the FS in
random order and then seeing how much slower we are at reading.
Benchmarks that stress the directory structure are interesting too:
huge numbers of files + directories, etc. Ric Wheeler's fs_mark has a
lot of options and output.

But, that's just my list; you can pick anything that you find
interesting ;) Please try btrfsck after the run to see how well it
keeps up.

If you use blktrace to generate io traces, graphs can be generated:

	http://oss.oracle.com/~mason/seekwatcher/

Not that well documented, but drop me a line if you need help running
it. btt is a good alternative to the graphs too, and easier to run.

> is it possible to test it on top of LVM2 on RAID at this stage?

Yes. I haven't done much multi-spindle testing yet, so I'm definitely
interested in these numbers.

-chris
Re: [AppArmor 39/45] AppArmor: Profile loading and manipulation, pathname matching
[EMAIL PROTECTED] wrote:
> On Sun, 10 Jun 2007, Pavel Machek wrote:
> > But you have that regex in _user_ space, in a place where policy
> > is loaded into kernel.
>
> then the kernel is going to have to call out to userspace every time
> a file is created or renamed, and the policy is going to be enforced
> incorrectly until userspace finishes labeling/relabeling whatever is
> moved. building this sort of race condition for security into the
> kernel is highly questionable at best.
>
> > AA has regex parser in _kernel_ space, which is very wrong.
>
> see Linus' rants about why it's not automatically the best thing to
> move functionality into userspace. remember that the files covered
> by an AA policy can change as files are renamed. this isn't the case
> with SELinux, so it doesn't have this sort of problem.

How about using the inotify interface on / to watch for file changes
and updating the SELinux policies on the fly? This could be done from
a userspace daemon and should require minimal SELinux changes.

The only possible problems I can see are the (hopefully) small gap
between the file change and updating the policy, and the performance
problems of watching the whole system for changes.

Just my $0.02.

Jack
Re: [AppArmor 39/45] AppArmor: Profile loading and manipulation, pathname matching
On Thu, 14 Jun 2007, Jack Stone wrote:
> [EMAIL PROTECTED] wrote:
> > On Sun, 10 Jun 2007, Pavel Machek wrote:
> > > But you have that regex in _user_ space, in a place where policy
> > > is loaded into kernel.
> >
> > then the kernel is going to have to call out to userspace every
> > time a file is created or renamed, and the policy is going to be
> > enforced incorrectly until userspace finishes labeling/relabeling
> > whatever is moved. building this sort of race condition for
> > security into the kernel is highly questionable at best.
> >
> > > AA has regex parser in _kernel_ space, which is very wrong.
> >
> > see Linus' rants about why it's not automatically the best thing
> > to move functionality into userspace. remember that the files
> > covered by an AA policy can change as files are renamed. this
> > isn't the case with SELinux, so it doesn't have this sort of
> > problem.
>
> How about using the inotify interface on / to watch for file changes
> and updating the SELinux policies on the fly? This could be done
> from a userspace daemon and should require minimal SELinux changes.
>
> The only possible problems I can see are the (hopefully) small gap
> between the file change and updating the policy and the performance
> problems of watching the whole system for changes.

as was mentioned by someone else, if you rename a directory this can
result in millions of files that need to be relabeled (or otherwise
have the policy changed for them). that can take a significant amount
of time to do.

David Lang
NILFS version 2 now available
Hi,

NILFS (a New Implementation of a Log-structured File System) version 2
is now available at the project website:

	http://www.nilfs.org/

If you are interested, please visit our website.

NILFS version 2 is equipped with a garbage collector that can keep
numerous snapshots of a NILFS filesystem. You can preserve consistent
states (checkpoints) of a NILFS filesystem as snapshots ``after'' the
states are established. For example, when you have removed some files
accidentally, you can get the snapshot that represents the past state
which still holds the vanished files. Checkpoints are collected by the
garbage collector unless they are marked as ``snapshot''.

We welcome any comments or contributions.

Thanks for your attention,

AMAGAI Yoshiji
Nippon Telegraph and Telephone Corporation
NTT Cyber Space Laboratories
Open Source Software Computing Project
[PATCH 0/2] [RFC] configfs: Pin config_items when in use by other subsystems
Many folks know that I've been pretty stubborn on the subject of
configfs item removal. configfs_rmdir() cannot currently be aborted by
a client driver. This is to ensure that userspace has control - if
userspace wants to remove an item, it should have that ability. The
client driver is left to handle the event.

However, there are dependencies in the kernel. One kernel subsystem
may depend on a configfs item and be unable to handle that item
disappearing. So we need a mechanism to describe this dependency.
After lots of beating me over my head, I've been convinced to give it
a shot.

The canonical example is ocfs2 and its heartbeat. Today, you can stop
o2cb heartbeat (o2hb) by rmdir(2) of the heartbeat object in configfs.
o2hb handles this just fine - all of o2hb's clients get node-down
notifications. However, ocfs2 can't handle this. When the node stops
speaking to the cluster, it can't just continue, and it can't force an
unmount of itself. It has only one solution - crash. This is ugly any
way you look at it.

With the configfs_depend_item() API, heartbeat_register_callback() can
pin the heartbeat item. Any rmdir(2) of the heartbeat item will return
-EBUSY until heartbeat_unregister_callback() removes the dependency. A
similar API can be created for any other configfs client driver.

The first patch is the configfs mechanism. The second patch is the
heartbeat use thereof.

Comments and curses welcome.

Joel

--
"You must remember this:
 A kiss is just a kiss,
 A sigh is just a sigh.
 The fundamental rules apply
 As time goes by."

Joel Becker
Principal Software Developer
Oracle
E-mail: [EMAIL PROTECTED]
Phone: (650) 506-8127
[PATCH 2/2] ocfs2: Depend on configfs heartbeat items.
ocfs2 mounts require a heartbeat region. Use the new
configfs_depend_item() facility to actually depend on them so they
can't go away from under us.

First, teach cluster/nodemanager.c to depend an item on the o2cb
subsystem. Then teach o2hb_register_callbacks to take a UUID and
depend on the appropriate region. Finally, teach all users of o2hb to
pass a UUID, or NULL if they don't require a pin.

Signed-off-by: Joel Becker [EMAIL PROTECTED]
---
 fs/ocfs2/cluster/heartbeat.c   | 64 +++-
 fs/ocfs2/cluster/heartbeat.h   |  6 +++-
 fs/ocfs2/cluster/nodemanager.c | 10 ++
 fs/ocfs2/cluster/nodemanager.h |  3 ++
 fs/ocfs2/cluster/tcp.c         |  8 +++--
 fs/ocfs2/dlm/dlmdomain.c       |  8 +++--
 fs/ocfs2/heartbeat.c           | 10 +++---
 7 files changed, 92 insertions(+), 17 deletions(-)

diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index 9791134..e331f4c 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -1665,7 +1665,56 @@ void o2hb_setup_callback(struct o2hb_cal
 }
 EXPORT_SYMBOL_GPL(o2hb_setup_callback);
 
-int o2hb_register_callback(struct o2hb_callback_func *hc)
+static struct o2hb_region *o2hb_find_region(const char *region_uuid)
+{
+	struct o2hb_region *p, *reg = NULL;
+
+	assert_spin_locked(&o2hb_live_lock);
+
+	list_for_each_entry(p, &o2hb_all_regions, hr_all_item) {
+		if (!strcmp(region_uuid, config_item_name(&p->hr_item))) {
+			reg = p;
+			break;
+		}
+	}
+
+	return reg;
+}
+
+static int o2hb_region_get(const char *region_uuid)
+{
+	int ret = 0;
+	struct o2hb_region *reg;
+
+	spin_lock(&o2hb_live_lock);
+
+	reg = o2hb_find_region(region_uuid);
+	if (!reg)
+		ret = -ENOENT;
+	spin_unlock(&o2hb_live_lock);
+
+	if (!ret)
+		ret = o2nm_depend_item(&reg->hr_item);
+
+	return ret;
+}
+
+static void o2hb_region_put(const char *region_uuid)
+{
+	struct o2hb_region *reg;
+
+	spin_lock(&o2hb_live_lock);
+
+	reg = o2hb_find_region(region_uuid);
+
+	spin_unlock(&o2hb_live_lock);
+
+	if (reg)
+		o2nm_undepend_item(&reg->hr_item);
+}
+
+int o2hb_register_callback(const char *region_uuid,
+			   struct o2hb_callback_func *hc)
 {
 	struct o2hb_callback_func *tmp;
 	struct list_head *iter;
@@ -1681,6 +1730,12 @@ int o2hb_register_callback(struct o2hb_c
 		goto out;
 	}
 
+	if (region_uuid) {
+		ret = o2hb_region_get(region_uuid);
+		if (ret)
+			goto out;
+	}
+
 	down_write(&o2hb_callback_sem);
 
 	list_for_each(iter, &hbcall->list) {
@@ -1702,16 +1757,21 @@ out:
 }
 EXPORT_SYMBOL_GPL(o2hb_register_callback);
 
-void o2hb_unregister_callback(struct o2hb_callback_func *hc)
+void o2hb_unregister_callback(const char *region_uuid,
+			      struct o2hb_callback_func *hc)
 {
 	BUG_ON(hc->hc_magic != O2HB_CB_MAGIC);
 
 	mlog(ML_HEARTBEAT, "on behalf of %p for funcs %p\n",
 	     __builtin_return_address(0), hc);
 
+	/* XXX Can this happen _with_ a region reference? */
 	if (list_empty(&hc->hc_item))
 		return;
 
+	if (region_uuid)
+		o2hb_region_put(region_uuid);
+
 	down_write(&o2hb_callback_sem);
 
 	list_del_init(&hc->hc_item);
diff --git a/fs/ocfs2/cluster/heartbeat.h b/fs/ocfs2/cluster/heartbeat.h
index cc6d40b..35397dd 100644
--- a/fs/ocfs2/cluster/heartbeat.h
+++ b/fs/ocfs2/cluster/heartbeat.h
@@ -69,8 +69,10 @@ void o2hb_setup_callback(struct o2hb_cal
 			 o2hb_cb_func *func,
 			 void *data,
 			 int priority);
-int o2hb_register_callback(struct o2hb_callback_func *hc);
-void o2hb_unregister_callback(struct o2hb_callback_func *hc);
+int o2hb_register_callback(const char *region_uuid,
+			   struct o2hb_callback_func *hc);
+void o2hb_unregister_callback(const char *region_uuid,
+			      struct o2hb_callback_func *hc);
 void o2hb_fill_node_map(unsigned long *map,
 			unsigned bytes);
 void o2hb_init(void);
diff --git a/fs/ocfs2/cluster/nodemanager.c b/fs/ocfs2/cluster/nodemanager.c
index 9f5ad0f..50a0524 100644
--- a/fs/ocfs2/cluster/nodemanager.c
+++ b/fs/ocfs2/cluster/nodemanager.c
@@ -900,6 +900,16 @@ static struct o2nm_cluster_group o2nm_cl
 	},
 };
 
+int o2nm_depend_item(struct config_item *item)
+{
+	return configfs_depend_item(&o2nm_cluster_group.cs_subsys, item);
+}
+
+void o2nm_undepend_item(struct config_item *item)
+{
+	configfs_undepend_item(&o2nm_cluster_group.cs_subsys, item);
+}
+
 static void __exit exit_o2nm(void)
 {
 	if (ocfs2_table_header)
diff --git a/fs/ocfs2/cluster/nodemanager.h b/fs/ocfs2/cluster/nodemanager.h
index 0705221..55ae1a0