Re: btrfs_tree_lock trylock

2008-09-08 Thread Andi Kleen
On Mon, Sep 08, 2008 at 10:02:30AM -0400, Chris Mason wrote:
 On Mon, 2008-09-08 at 15:54 +0200, Andi Kleen wrote:
   The idea is to try to spin for a bit to avoid scheduling away, which is
   especially important for the high levels.  Most holders of the mutex
   let it go very quickly.
  
  Ok but that surely should be implemented in the general mutex code then
  or at least in a standard adaptive mutex wrapper? 
 
 That depends, am I the only one crazy enough to think it's a good idea?

Adaptive mutexes are a classic technique; a lot of other OSes have them.

Gregory et al. also saw big improvements in the RT kernel (they
posted a patchkit a few times).

But a lot of people don't like them for some reason. Anyway,
hiding them in a fs is probably wrong.

-Andi
-- 
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: parity data

2008-09-08 Thread Chris Mason
On Sat, 2008-09-06 at 23:43 -0600, Eric Anopolsky wrote:
 Hi,
 
 A couple of questions.
 
 1. Does btrfs currently have anything like raid 5 or 6?
 

Not yet; it might one day.

 2. One guy on my LUG's mailing list is really excited about the
 potential for setting redundancy on a per-file basis.
 I.e. /home/eric/criticalfile gets mirrored across all of the drives in
 the filesystem but /home/eric/temporaryfile gets striped. I'm skeptical.
 Is it a good idea to allow people/programs to do this?

In general, yes.  Some files or directories are crucial, and some (swap
for example) don't need to survive a crash.

But, I think the flexibility should go a little further.  The goal is to
be able to define drive groups and tie files or directory trees to the
drive groups.  That way you can say these files go to the fastest drives
and these files go to some other drive type, etc etc.

-chris




Re: btrfs_tree_lock trylock

2008-09-08 Thread Chris Mason
On Mon, 2008-09-08 at 12:13 -0400, jim owens wrote:
 Chris Mason wrote:
  My guess is that the improvement happens mostly from the first couple of tries, not from repeated spinning. And since it is a mutex, you could even do:
  
  I started with lower spin counts, I really didn't want to spin at all
  but the current values came from trial and error.
 
 Exactly the problem Steven is describing with adaptive locking.
 
 Using benchmarks (or any test) on a small sample of systems
 leads you to conclude that this design/tuning combination is better.
 
 I've been burned repeatedly by that... ugly things happen
 as you move away from your design testing center.
 
 I'm not saying your code does not work, just that we need
 a lot more proof with different configurations and loads
 to see that it is at least no worse.

Oh, I completely agree.  This is tuned on just one CPU in a handful of
workloads.  In general, it makes sense to spin for about as long as it
takes someone to do a btree search in the block, which we could
benchmark up front at mount time.

I could also get better results from an API where the holder of the lock
indicates it is going to hold on to things for a while, which might
happen right before doing an IO.

Over the long term these are important issues, but for today I'm focused
on the disk format ;)

-chris




Re: btrfs_tree_lock trylock

2008-09-08 Thread Stephen Hemminger
On Mon, 08 Sep 2008 12:20:32 -0400
Chris Mason [EMAIL PROTECTED] wrote:

 On Mon, 2008-09-08 at 12:13 -0400, jim owens wrote:
  Chris Mason wrote:
   My guess is that the improvement happens mostly from the first couple of tries, not from repeated spinning. And since it is a mutex, you could even do:
   
   I started with lower spin counts, I really didn't want to spin at all
   but the current values came from trial and error.
  
  Exactly the problem Steven is describing with adaptive locking.
  
  Using benchmarks (or any test) on a small sample of systems
  leads you to conclude that this design/tuning combination is better.
  
  I've been burned repeatedly by that... ugly things happen
  as you move away from your design testing center.
  
  I'm not saying your code does not work, just that we need
  a lot more proof with different configurations and loads
  to see that it is at least no worse.
 
 Oh, I completely agree.  This is tuned on just one CPU in a handful of
 workloads.  In general, it makes sense to spin for about as long as it
 takes someone to do a btree search in the block, which we could
 benchmark up front at mount time.
 
 I could also get better results from an API where the holder of the lock
 indicates it is going to hold on to things for a while, which might
 happen right before doing an IO.
 
 Over the long term these are important issues, but for today I'm focused
 on the disk format ;)
 
 -chris
 
 

Not to mention the problem that developers seem to have faster machines than
the average user, but slower than enterprise and future-generation CPUs.
So any tuning value seems to get out of date fast.


Re: btrfs_tree_lock trylock

2008-09-08 Thread Christoph Hellwig
On Mon, Sep 08, 2008 at 09:49:42AM -0700, Stephen Hemminger wrote:
 Not to mention the problem that developers seem to have faster machines than
 the average user, but slower than enterprise and future-generation CPUs.
 So any tuning value seems to get out of date fast.

So where do my fellow developers get these fast systems from? :)



[PATCH 1/4] update space balancing code for the new backref format

2008-09-08 Thread Zheng Yan

Hello,

This patch updates the space balancing code to utilize the new backref
format. By using the new format, we can walk up the reference chain and
find all references to a given extent quickly.

This patch improves the space balancing for data extents. Relocating a
data block group happens in two stages: in the first stage, data is moved
to its new location; in the second stage, references are updated to point
at the new location.

Regards
Yan Zheng

---
diff -r 325653e288b3 ctree.h
--- a/ctree.h   Tue Sep 09 02:15:47 2008 +0800
+++ b/ctree.h   Tue Sep 09 02:16:08 2008 +0800
@@ -1484,7 +1484,6 @@
struct extent_buffer *btrfs_init_new_buffer(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
u64 bytenr, u32 blocksize);
-int btrfs_shrink_extent_tree(struct btrfs_root *root, u64 new_size);
int btrfs_insert_extent_backref(struct btrfs_trans_handle *trans,
 struct btrfs_root *root,
 struct btrfs_path *path,
@@ -1548,6 +1547,9 @@
   struct btrfs_root *root, u64 bytes_used,
   u64 type, u64 chunk_objectid, u64 chunk_offset,
   u64 size);
+int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
+struct btrfs_root *root, u64 group_start);
+int btrfs_relocate_block_group(struct btrfs_root *root, u64 group_start);
/* ctree.c */
int btrfs_previous_item(struct btrfs_root *root,
struct btrfs_path *path, u64 min_objectid,
@@ -1791,6 +1793,8 @@
int btrfs_update_inode(struct btrfs_trans_handle *trans,
  struct btrfs_root *root,
  struct inode *inode);
+int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode);
+int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode);

/* ioctl.c */
long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
diff -r 325653e288b3 extent-tree.c
--- a/extent-tree.c Tue Sep 09 02:15:47 2008 +0800
+++ b/extent-tree.c Tue Sep 09 02:16:08 2008 +0800
@@ -35,6 +35,18 @@

#define BLOCK_GROUP_DIRTY EXTENT_DIRTY

+struct reference_path {
+   u64 extent_start;
+   u64 nodes[BTRFS_MAX_LEVEL];
+   u64 root_objectid;
+   u64 root_generation;
+   u64 owner_objectid;
+   u64 owner_offset;
+   u32 num_refs;
+   int lowest_level;
+   int current_level;
+};
+
static int finish_current_insert(struct btrfs_trans_handle *trans, struct
 btrfs_root *extent_root);
static int del_pending_extents(struct btrfs_trans_handle *trans, struct
@@ -868,7 +880,7 @@
int ret;

key.objectid = bytenr;
-   key.offset = 0;
+   key.offset = (u64)-1;
key.type = BTRFS_EXTENT_ITEM_KEY;

path = btrfs_alloc_path();
@@ -877,7 +889,10 @@
if (ret < 0)
goto out;
BUG_ON(ret == 0);
-
+   if (ret < 0 || path->slots[0] == 0)
+   goto out;
+
+   path->slots[0]--;
leaf = path->nodes[0];
btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);

@@ -914,7 +929,7 @@
  struct btrfs_extent_ref);
ref_generation = btrfs_ref_generation(leaf, ref_item);
/*
-* For (parent_gen > 0 && parent_gen > ref_gen):
+* For (parent_gen > 0 && parent_gen > ref_generation):
 *
 * we reach here through the oldest root, therefore
 * all other reference from same snapshot should have
@@ -924,8 +939,7 @@
(parent_gen > 0 && parent_gen > ref_generation) ||
(ref_objectid >= BTRFS_FIRST_FREE_OBJECTID &&
 ref_objectid != btrfs_ref_objectid(leaf, ref_item))) {
-   if (ref_count)
-   *ref_count = 2;
+   *ref_count = 2;
break;
}

@@ -1539,6 +1553,219 @@
return start;
}

+static int noinline __next_reference_path(struct btrfs_trans_handle *trans,
+ struct btrfs_root *extent_root,
+ struct reference_path *ref_path,
+ int first_time)
+{
+   struct extent_buffer *leaf;
+   struct btrfs_path *path;
+   struct btrfs_extent_ref *ref;
+   struct btrfs_key key;
+   struct btrfs_key found_key;
+   u64 bytenr;
+   u32 nritems;
+   int level;
+   int ret = 1;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   mutex_lock(&extent_root->fs_info->alloc_mutex);
+
+   if (first_time) {
+   ref_path->lowest_level = -1;
+   ref_path->current_level = -1;
+   goto walk_up;
+   }