Re: btrfs_tree_lock trylock
On Mon, Sep 08, 2008 at 10:02:30AM -0400, Chris Mason wrote:
> On Mon, 2008-09-08 at 15:54 +0200, Andi Kleen wrote:
> > > The idea is to try to spin for a bit to avoid scheduling away,
> > > which is especially important for the high levels. Most holders
> > > of the mutex let it go very quickly.
> >
> > Ok but that surely should be implemented in the general mutex code
> > then, or at least in a standard adaptive mutex wrapper?
>
> That depends, am I the only one crazy enough to think it's a good
> idea?

Adaptive mutexes are classic; a lot of other OSes have them. Gregory
et al. also saw big improvements in the RT kernel (they posted a
patchkit a few times). But a lot of people don't like them for some
reason.

Anyway, hiding them in a fs is probably wrong.

-Andi

--
[EMAIL PROTECTED]

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
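[Editor's note: the spin-then-block idea being debated can be sketched in
userspace with POSIX mutexes. This is not the btrfs code; the spin count
of 100 is an arbitrary stand-in for the trial-and-error values mentioned
later in the thread.]

```c
#include <assert.h>
#include <pthread.h>

/* Arbitrary illustrative spin count; the thread notes that real
 * values come from trial and error on particular hardware. */
#define SPIN_TRIES 100

/*
 * Adaptive lock sketch: retry a non-blocking acquire a bounded number
 * of times before falling back to a sleeping lock, on the theory that
 * most holders release the mutex very quickly.
 */
static void adaptive_lock(pthread_mutex_t *m)
{
	for (int i = 0; i < SPIN_TRIES; i++) {
		if (pthread_mutex_trylock(m) == 0)
			return;		/* acquired without sleeping */
	}
	pthread_mutex_lock(m);		/* contended: give up and block */
}
```

The win, when it happens, is skipping the sleep/wakeup round trip for
short critical sections; the cost is burned CPU when the holder is slow,
which is exactly the tuning problem discussed below.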
Re: parity data
On Sat, 2008-09-06 at 23:43 -0600, Eric Anopolsky wrote:
> Hi,
>
> A couple of questions.
>
> 1. Does btrfs currently have anything like raid 5 or 6?

Not yet, it might one day.

> 2. One guy on my LUG's mailing list is really excited about the
> potential for setting redundancy on a per-file basis. I.e.
> /home/eric/criticalfile gets mirrored across all of the drives in the
> filesystem but /home/eric/temporaryfile gets striped. I'm skeptical.
> Is it a good idea to allow people/programs to do this?

In general, yes. Some files or directories are crucial, and some (swap
for example) don't need to survive a crash.

But I think the flexibility should go a little further. The goal is to
be able to define drive groups and tie files or directory trees to the
drive groups. That way you can say these files go to the fastest drives
and these files go to some other drive type, etc etc.

-chris
Re: btrfs_tree_lock trylock
On Mon, 2008-09-08 at 12:13 -0400, jim owens wrote:
> Chris Mason wrote:
> > > My guess is that the improvement happens mostly from the first
> > > couple of tries, not from repeated spinning. And since it is a
> > > mutex, you could even do:
> >
> > I started with lower spin counts, I really didn't want to spin at
> > all, but the current values came from trial and error.
>
> Exactly the problem Steven is saying about adaptive locking.
>
> Using benchmarks (or any test) on a small sample of systems leads you
> to conclude this design/tuning combination is better.
>
> I've been burned repeatedly by that... ugly things happen as you move
> away from your design testing center.
>
> I'm not saying your code does not work, just that we need a lot more
> proof with different configurations and loads to see that it is at
> least no worse.

Oh, I completely agree.  This is tuned on just one CPU in a handful of
workloads.

In general, it makes sense to spin for about as long as it takes
someone to do a btree search in the block, which we could benchmark up
front at mount time.

I could also get better results from an API where the holder of the
lock indicates it is going to hold on to things for a while, which
might happen right before doing an IO.

Over the long term these are important issues, but for today I'm
focused on the disk format ;)

-chris
Re: btrfs_tree_lock trylock
On Mon, 08 Sep 2008 12:20:32 -0400 Chris Mason <[EMAIL PROTECTED]> wrote:
> On Mon, 2008-09-08 at 12:13 -0400, jim owens wrote:
> > Chris Mason wrote:
> > > > My guess is that the improvement happens mostly from the first
> > > > couple of tries, not from repeated spinning. And since it is a
> > > > mutex, you could even do:
> > >
> > > I started with lower spin counts, I really didn't want to spin at
> > > all, but the current values came from trial and error.
> >
> > Exactly the problem Steven is saying about adaptive locking.
> >
> > Using benchmarks (or any test) on a small sample of systems leads
> > you to conclude this design/tuning combination is better.
> >
> > I've been burned repeatedly by that... ugly things happen as you
> > move away from your design testing center.
> >
> > I'm not saying your code does not work, just that we need a lot
> > more proof with different configurations and loads to see that it
> > is at least no worse.
>
> Oh, I completely agree.  This is tuned on just one CPU in a handful of
> workloads.
>
> In general, it makes sense to spin for about as long as it takes
> someone to do a btree search in the block, which we could benchmark up
> front at mount time.
>
> I could also get better results from an API where the holder of the
> lock indicates it is going to hold on to things for a while, which
> might happen right before doing an IO.
>
> Over the long term these are important issues, but for today I'm
> focused on the disk format ;)
>
> -chris

Not to mention the problem that developers seem to have faster machines
than the average user, but slower than enterprise and future-generation
CPUs. So any tuning value seems to get out of date fast.
Re: btrfs_tree_lock trylock
On Mon, Sep 08, 2008 at 09:49:42AM -0700, Stephen Hemminger wrote:
> Not to mention the problem that developers seem to have faster
> machines than the average user, but slower than enterprise and
> future-generation CPUs. So any tuning value seems to get out of
> date fast.

So where do my fellow developers get these fast systems from? :)
[PATCH 1/4] update space balancing code for the new backref format
Hello,

This patch updates the space balancing code to utilize the new backref
format. By using the new format, we can walk up the reference chain and
find all references to a given extent quickly.

This patch improves the space balancing for data extents. Relocating a
data block group happens in two stages: in the first stage, the data is
moved to its new location; in the second stage, the references are
updated to reflect the new location.

Regards
Yan Zheng

---

diff -r 325653e288b3 ctree.h
--- a/ctree.h	Tue Sep 09 02:15:47 2008 +0800
+++ b/ctree.h	Tue Sep 09 02:16:08 2008 +0800
@@ -1484,7 +1484,6 @@
 struct extent_buffer *btrfs_init_new_buffer(struct btrfs_trans_handle *trans,
 					    struct btrfs_root *root,
 					    u64 bytenr, u32 blocksize);
-int btrfs_shrink_extent_tree(struct btrfs_root *root, u64 new_size);
 int btrfs_insert_extent_backref(struct btrfs_trans_handle *trans,
 				struct btrfs_root *root,
 				struct btrfs_path *path,
@@ -1548,6 +1547,9 @@
 			   struct btrfs_root *root, u64 bytes_used,
 			   u64 type, u64 chunk_objectid, u64 chunk_offset,
 			   u64 size);
+int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
+			     struct btrfs_root *root, u64 group_start);
+int btrfs_relocate_block_group(struct btrfs_root *root, u64 group_start);
 /* ctree.c */
 int btrfs_previous_item(struct btrfs_root *root,
 			struct btrfs_path *path, u64 min_objectid,
@@ -1791,6 +1793,8 @@
 int btrfs_update_inode(struct btrfs_trans_handle *trans,
 		       struct btrfs_root *root,
 		       struct inode *inode);
+int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode);
+int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode);
 /* ioctl.c */
 long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
diff -r 325653e288b3 extent-tree.c
--- a/extent-tree.c	Tue Sep 09 02:15:47 2008 +0800
+++ b/extent-tree.c	Tue Sep 09 02:16:08 2008 +0800
@@ -35,6 +35,18 @@
 
 #define BLOCK_GROUP_DIRTY EXTENT_DIRTY
 
+struct reference_path {
+	u64 extent_start;
+	u64 nodes[BTRFS_MAX_LEVEL];
+	u64 root_objectid;
+	u64 root_generation;
+	u64 owner_objectid;
+	u64 owner_offset;
+	u32 num_refs;
+	int lowest_level;
+	int current_level;
+};
+
 static int finish_current_insert(struct btrfs_trans_handle *trans, struct
 				 btrfs_root *extent_root);
 static int del_pending_extents(struct btrfs_trans_handle *trans, struct
@@ -868,7 +880,7 @@
 	int ret;
 
 	key.objectid = bytenr;
-	key.offset = 0;
+	key.offset = (u64)-1;
 	key.type = BTRFS_EXTENT_ITEM_KEY;
 
 	path = btrfs_alloc_path();
@@ -877,7 +889,10 @@
 	if (ret < 0)
 		goto out;
 	BUG_ON(ret == 0);
-
+	if (ret < 0 || path->slots[0] == 0)
+		goto out;
+
+	path->slots[0]--;
 	leaf = path->nodes[0];
 	btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
@@ -914,7 +929,7 @@
 				   struct btrfs_extent_ref);
 		ref_generation = btrfs_ref_generation(leaf, ref_item);
 		/*
-		 * For (parent_gen > 0 && parent_gen > ref_gen):
+		 * For (parent_gen > 0 && parent_gen > ref_generation):
 		 *
 		 * we reach here through the oldest root, therefore
 		 * all other reference from same snapshot should have
@@ -924,8 +939,7 @@
 		    (parent_gen > 0 && parent_gen > ref_generation) ||
 		    (ref_objectid >= BTRFS_FIRST_FREE_OBJECTID &&
 		     ref_objectid != btrfs_ref_objectid(leaf, ref_item))) {
-			if (ref_count)
-				*ref_count = 2;
+			*ref_count = 2;
 			break;
 		}
@@ -1539,6 +1553,219 @@
 	return start;
 }
 
+static int noinline __next_reference_path(struct btrfs_trans_handle *trans,
+					  struct btrfs_root *extent_root,
+					  struct reference_path *ref_path,
+					  int first_time)
+{
+	struct extent_buffer *leaf;
+	struct btrfs_path *path;
+	struct btrfs_extent_ref *ref;
+	struct btrfs_key key;
+	struct btrfs_key found_key;
+	u64 bytenr;
+	u32 nritems;
+	int level;
+	int ret = 1;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	mutex_lock(&extent_root->fs_info->alloc_mutex);
+
+	if (first_time) {
+		ref_path->lowest_level = -1;
+		ref_path->current_level = -1;
+		goto walk_up;
+	}
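[Editor's note: the two-stage scheme from the cover letter (move the
data first, fix the references second) can be illustrated with a toy
model. Every structure and name below is invented for illustration;
this is not the btrfs on-disk format.]

```c
#include <assert.h>

#define MAX_MAPPINGS 16

/* Stage one "moves" each extent out of the block group being freed
 * and records an old -> new byte-number mapping; stage two rewrites
 * every reference through that table. */
struct reloc_mapping {
	unsigned long old_bytenr;
	unsigned long new_bytenr;
};

static struct reloc_mapping table[MAX_MAPPINGS];
static int nr_mappings;

/* Stage 1: move an extent and remember where it went. */
static void relocate_extent(unsigned long old_bytenr,
			    unsigned long new_bytenr)
{
	table[nr_mappings].old_bytenr = old_bytenr;
	table[nr_mappings].new_bytenr = new_bytenr;
	nr_mappings++;
}

/* Stage 2: rewrite one reference; byte numbers outside the relocated
 * group are left alone. */
static unsigned long update_ref(unsigned long bytenr)
{
	for (int i = 0; i < nr_mappings; i++)
		if (table[i].old_bytenr == bytenr)
			return table[i].new_bytenr;
	return bytenr;
}
```

The real patch does the hard part this toy skips: finding every
reference to a moved extent, which is exactly what walking up the new
backref chains makes cheap.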