Re: read-only subvolumes?

2011-03-24 Thread Li Zefan
 When I am creating subvolumes I get this strange behavior. If
 I create a subvolume with a name longer than 4 characters it
 is read-only, if the name is shorter than 5 characters the
 subvolume is writeable as expected. I think this started when I
 upgraded to kernel version 2.6.38 (I do not create
 subvolumes on a regular basis). I will compile one of the
 latest 2.6.37 kernels to see whether the problem exists
 there, too. Another interesting point is that previously
 created subvolumes are not affected.

 Thanks, Andreas Philipp

 thor btrfs # btrfs subvolume create 123456789 Create
 subvolume './123456789' thor btrfs # touch 123456789/lsdkfj
 touch: cannot touch `123456789/lsdkfj': Read-only file
 system

 This is really odd, but I can't reproduce it.

 I created a btrfs filesystem on 2.6.37 kernel, and rebooted to
 latest 2.6.38+, and tried the procedures as you did, but
 nothing bad happened.
 While playing around I found the following three new points:
 - Now the length of the subvolume name does not matter, so even the
   ones with short names are read-only.
 - It also happens on a fresh, newly created btrfs filesystem.
 - If I take a snapshot of an old (= writeable) subvolume, the
   snapshot is writeable.

 I will now reboot into 2.6.37.4, check there, and then report
 back.

 Well, this was fast. Everything works as expected on 2.6.37.4. See
 the output of uname -a for the exact kernel version below. I will
 now reboot into a differently configured kernel version 2.6.38 and
 look whether the problem is gone there.

 Thanks, Andreas Philipp

 thor ~ # uname -a Linux thor 2.6.37.4 #2 SMP Wed Mar 23 10:25:54
 CET 2011 x86_64 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz GenuineIntel
 GNU/Linux
 
 IMHO, this is related to how the debug options of the kernel are
 configured. Attached you will find two config files, both for kernel
 version 2.6.38: with the one named 2.6.38-debug everything works, and
 with the other one newly created subvolumes are read-only.
 

I'll see if I can reproduce the problem using your config. Thanks!
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: check return value of read_tree_block()

2011-03-24 Thread Tsutomu Itoh
This patch checks the return value of read_tree_block() and performs
error processing if it is NULL.

Signed-off-by: Tsutomu Itoh t-i...@jp.fujitsu.com
---
 fs/btrfs/ctree.c   |3 +++
 fs/btrfs/extent-tree.c |6 ++
 fs/btrfs/relocation.c  |6 ++
 3 files changed, 15 insertions(+)

diff -urNp linux-2.6.38/fs/btrfs/ctree.c linux-2.6.38.new/fs/btrfs/ctree.c
--- linux-2.6.38/fs/btrfs/ctree.c   2011-03-15 10:20:32.0 +0900
+++ linux-2.6.38.new/fs/btrfs/ctree.c   2011-03-24 11:12:54.0 +0900
@@ -686,6 +686,8 @@ int btrfs_realloc_node(struct btrfs_tran
if (!cur) {
cur = read_tree_block(root, blocknr,
 blocksize, gen);
+   if (!cur)
+   return -EIO;
} else if (!uptodate) {
btrfs_read_buffer(cur, gen);
}
@@ -4217,6 +4219,7 @@ find_next_key:
}
btrfs_set_path_blocking(path);
cur = read_node_slot(root, cur, slot);
+   BUG_ON(!cur);
 
btrfs_tree_lock(cur);
 
diff -urNp linux-2.6.38/fs/btrfs/extent-tree.c 
linux-2.6.38.new/fs/btrfs/extent-tree.c
--- linux-2.6.38/fs/btrfs/extent-tree.c 2011-03-15 10:20:32.0 +0900
+++ linux-2.6.38.new/fs/btrfs/extent-tree.c 2011-03-24 11:32:55.0 
+0900
@@ -6047,6 +6047,8 @@ static noinline int do_walk_down(struct 
if (reada && level == 1)
reada_walk_down(trans, root, wc, path);
next = read_tree_block(root, bytenr, blocksize, generation);
+   if (!next)
+   return -EIO;
btrfs_tree_lock(next);
btrfs_set_lock_blocking(next);
}
@@ -7906,6 +7908,10 @@ static noinline int relocate_one_extent(
 
eb = read_tree_block(found_root, block_start,
 block_size, 0);
+   if (!eb) {
+   ret = -EIO;
+   goto out;
+   }
btrfs_tree_lock(eb);
BUG_ON(level != btrfs_header_level(eb));
 
diff -urNp linux-2.6.38/fs/btrfs/relocation.c 
linux-2.6.38.new/fs/btrfs/relocation.c
--- linux-2.6.38/fs/btrfs/relocation.c  2011-03-15 10:20:32.0 +0900
+++ linux-2.6.38.new/fs/btrfs/relocation.c  2011-03-24 11:43:53.0 
+0900
@@ -1724,6 +1724,7 @@ again:
 
eb = read_tree_block(dest, old_bytenr, blocksize,
 old_ptr_gen);
+   BUG_ON(!eb);
btrfs_tree_lock(eb);
if (cow) {
ret = btrfs_cow_block(trans, dest, eb, parent,
@@ -2513,6 +2514,10 @@ static int do_relocation(struct btrfs_tr
blocksize = btrfs_level_size(root, node->level);
generation = btrfs_node_ptr_generation(upper->eb, slot);
eb = read_tree_block(root, bytenr, blocksize, generation);
+   if (!eb) {
+   err = -EIO;
+   goto next;
+   }
btrfs_tree_lock(eb);
btrfs_set_lock_blocking(eb);
 
@@ -2670,6 +2675,7 @@ static int get_tree_block_key(struct rel
BUG_ON(block->key_ready);
eb = read_tree_block(rc->extent_root, block->bytenr,
 block->key.objectid, block->key.offset);
+   BUG_ON(!eb);
WARN_ON(btrfs_header_level(eb) != block->level);
if (block->level == 0)
btrfs_item_key_to_cpu(eb, &block->key, 0);




Re: [RFC] Tree fragmentation and prefetching

2011-03-24 Thread Arne Jansen
On 24.03.2011 02:38, Miao Xie wrote:
 On wed, 23 Mar 2011 21:28:25 +0100, Arne Jansen wrote:
 On 23.03.2011 20:26, Andrey Kuzmin wrote:
 On Wed, Mar 23, 2011 at 4:06 PM, Arne Jansen sensi...@gmx.net wrote:
 The main idea is to load the tree (or parts of it) top-down, order the
 needed blocks and distribute it over all disks.
 To keep you interested, some results first.

 a) by tree enumeration with reada=2
reading extent tree: 242s
reading csum tree: 140s
reading both trees: 324s

 b) prefetch prototype
reading extent tree: 23.5s
reading csum tree: 20.4s
reading both trees: 25.7s

 10x speed-up looks indeed impressive. Just for me to be sure, did I
 get you right in that you attribute this effect specifically to
 enumerating tree leaves in key address vs. disk addresses when these
 two are not aligned?

 Yes. Leaves and the intermediate nodes tend to be quite scattered
 around the disk with respect to their logical order.
 Reading them in logical (ascending/descending) order requires lots
 of seeks.
 
 I'm also dealing with the tree fragmentation problem; I try to store
 the leaves which have the same parent close together.

That's good to hear. Do you already have anything I can repeat the
test with?

-Arne

 
 Regards
 Miao
 


Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-24 Thread Ingo Molnar

* Tejun Heo t...@kernel.org wrote:

 NOT-Signed-off-by: Tejun Heo t...@kernel.org

s/NOT-// ?

Ingo


2.6.38 defragment compression oops...

2011-03-24 Thread Daniel J Blueman
I found that I'm able to provoke undefined behaviour with 2.6.38 with
extent defragmenting + recompression, eg:

mkfs.btrfs /dev/sdb
mount /dev/sdb /mnt
cp -xa / /mnt
find /mnt -print0 | xargs -0 btrfs filesystem defragment -vc

After a short time, I was seeing what looked like a secondary effect
[1]. Reproducing with lock instrumentation reported recursive spinlock
acquisition, probably a false positive from the locking scheme not
being annotated, so better to report it now.

Daniel

--- [1]

BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: [a00e23cb] write_extent_buffer+0xbb/0x1b0 [btrfs]
PGD 0
Oops:  [#1] SMP
last sysfs file: /sys/devices/pci:00/:00:1e.0/:06:04.0/local_cpus
CPU 1
Modules linked in: microcode psmouse serio_raw ioatdma i7core_edac
joydev lp edac_core dca parport raid10 raid456 async_raid6_recov
async_pq usbhid hid raid6_pq async_xor xor async_memcpy async_tx raid1
raid0 multipath linear ahci btrfs zlib_deflate libahci e1000e
libcrc32c

Pid: 1119, comm: btrfs-delalloc- Tainted: GW
2.6.38-020638-generic #201103151303 Supermicro X8STi/X8STi
RIP: 0010:[a00e23cb]  [a00e23cb]
write_extent_buffer+0xbb/0x1b0 [btrfs]
RSP: 0018:880303a0bbc0  EFLAGS: 00010a86
RAX: db74 RBX: 0d26 RCX: 8800
RDX:  RSI: 0002fa19 RDI: 88023c8353f8
RBP: 880303a0bc00 R08: 0001 R09: 
R10:  R11: 0017 R12: db738800
R13: 028c R14: 880303a0bfd8 R15: 
FS:  () GS:8800df48() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2:  CR3: 01a03000 CR4: 06e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process btrfs-delalloc- (pid: 1119, threadinfo 880303a0a000, task
8803046cad80)
Stack:
 880280e63cc0 8802fd10ad26 0001 880303a0a000
 ea000a75ba30 0fb2 08f7 02da
 880303a0bcb0 a00c5bb0 002e0001 
Call Trace:
 [a00c5bb0] insert_inline_extent+0x330/0x350 [btrfs]
 [a00c5cf6] cow_file_range_inline+0x126/0x160 [btrfs]
 [a00c68f0] compress_file_range+0x3b0/0x580 [btrfs]
 [a00c6af5] async_cow_start+0x35/0x50 [btrfs]
 [a00eac0c] worker_loop+0xac/0x260 [btrfs]
 [a00eab60] ? worker_loop+0x0/0x260 [btrfs]
 [81086317] kthread+0x97/0xa0
 [8100ce24] kernel_thread_helper+0x4/0x10
 [81086280] ? kthread+0x0/0xa0
 [8100ce20] ? kernel_thread_helper+0x0/0x10
Code: 16 00 00 48 8d 04 0a 48 b9 b7 6d db b6 6d db b6 6d 48 c1 f8 03
48 0f af c1 48 b9 00 00 00 00 00 88 ff ff 48 c1 e0 0c 4c 8d 24 08 48
8b 02 a8 08 0f 85 9c 00 00 00 be cb 0e 00 00 48 c7 c7 b8 7c
RIP  [a00e23cb] write_extent_buffer+0xbb/0x1b0 [btrfs]
 RSP 880303a0bbc0
CR2: 
---[ end trace a7919e7f17c0a728 ]---
note: btrfs-delalloc-exited with preempt_count 1
-- 
Daniel J Blueman


Re: read-only subvolumes?

2011-03-24 Thread Li Zefan
 IMHO, this is related to how the debug options of the kernel are
 configured. Attached you will find two config files, both for kernel
 version 2.6.38: with the one named 2.6.38-debug everything works, and
 with the other one newly created subvolumes are read-only.
 

I've figured out what's wrong.

The root cause is that the flags field of the root item for a new subvol
is never _initialized_!! So the on-disk root_item->flags can be of
arbitrary value..

(So is root_item->byte_limit, btw.)

I don't have a perfect solution at the moment, but I think a workaround
is to use a flag in root_item->inode_item->flags to indicate whether
root->flags is initialized.


[PATCH 1/2] Subject: mutex: Separate out mutex_spin()

2011-03-24 Thread Tejun Heo
Separate mutex_spin() out of __mutex_lock_common().  The fat
comment is converted to a docbook function description.

While at it, drop the part of the comment which explains that adaptive
spinning considers whether there are pending waiters, which doesn't
match the code.

This patch is to prepare for using adaptive spinning in
mutex_trylock() and doesn't cause any behavior change.

Signed-off-by: Tejun Heo t...@kernel.org
LKML-Reference: 20110323153727.gb12...@htj.dyndns.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@redhat.com
---
Here are split patches with SOB.  Ingo, it's probably best to route
this through -tip, I suppose?

Thanks.

 kernel/mutex.c |   87 -
 1 file changed, 50 insertions(+), 37 deletions(-)

Index: work/kernel/mutex.c
===
--- work.orig/kernel/mutex.c
+++ work/kernel/mutex.c
@@ -126,39 +126,32 @@ void __sched mutex_unlock(struct mutex *
 
 EXPORT_SYMBOL(mutex_unlock);
 
-/*
- * Lock a mutex (possibly interruptible), slowpath:
+/**
+ * mutex_spin - optimistic spinning on mutex
+ * @lock: mutex to spin on
+ *
+ * This function implements optimistic spin for acquisition of @lock when
+ * the lock owner is currently running on a (different) CPU.
+ *
+ * The rationale is that if the lock owner is running, it is likely to
+ * release the lock soon.
+ *
+ * Since this needs the lock owner, and this mutex implementation doesn't
+ * track the owner atomically in the lock field, we need to track it
+ * non-atomically.
+ *
+ * We can't do this for DEBUG_MUTEXES because that relies on wait_lock to
+ * serialize everything.
+ *
+ * CONTEXT:
+ * Preemption disabled.
+ *
+ * RETURNS:
+ * %true if @lock is acquired, %false otherwise.
  */
-static inline int __sched
-__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
-   unsigned long ip)
+static inline bool mutex_spin(struct mutex *lock)
 {
-   struct task_struct *task = current;
-   struct mutex_waiter waiter;
-   unsigned long flags;
-
-   preempt_disable();
-   mutex_acquire(lock->dep_map, subclass, 0, ip);
-
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-   /*
-* Optimistic spinning.
-*
-* We try to spin for acquisition when we find that there are no
-* pending waiters and the lock owner is currently running on a
-* (different) CPU.
-*
-* The rationale is that if the lock owner is running, it is likely to
-* release the lock soon.
-*
-* Since this needs the lock owner, and this mutex implementation
-* doesn't track the owner atomically in the lock field, we need to
-* track it non-atomically.
-*
-* We can't do this for DEBUG_MUTEXES because that relies on wait_lock
-* to serialize everything.
-*/
-
for (;;) {
struct thread_info *owner;
 
@@ -177,12 +170,8 @@ __mutex_lock_common(struct mutex *lock,
if (owner && !mutex_spin_on_owner(lock, owner))
break;
 
-   if (atomic_cmpxchg(lock->count, 1, 0) == 1) {
-   lock_acquired(lock->dep_map, ip);
-   mutex_set_owner(lock);
-   preempt_enable();
-   return 0;
-   }
+   if (atomic_cmpxchg(lock->count, 1, 0) == 1)
+   return true;
 
/*
 * When there's no owner, we might have preempted between the
@@ -190,7 +179,7 @@ __mutex_lock_common(struct mutex *lock,
 * we're an RT task that will live-lock because we won't let
 * the owner complete.
 */
-   if (!owner && (need_resched() || rt_task(task)))
+   if (!owner && (need_resched() || rt_task(current)))
break;
 
/*
@@ -202,6 +191,30 @@ __mutex_lock_common(struct mutex *lock,
arch_mutex_cpu_relax();
}
 #endif
+   return false;
+}
+
+/*
+ * Lock a mutex (possibly interruptible), slowpath:
+ */
+static inline int __sched
+__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
+   unsigned long ip)
+{
+   struct task_struct *task = current;
+   struct mutex_waiter waiter;
+   unsigned long flags;
+
+   preempt_disable();
+   mutex_acquire(lock->dep_map, subclass, 0, ip);
+
+   if (mutex_spin(lock)) {
+   lock_acquired(lock->dep_map, ip);
+   mutex_set_owner(lock);
+   preempt_enable();
+   return 0;
+   }
+
spin_lock_mutex(lock->wait_lock, flags);
 
debug_mutex_lock_common(lock, waiter);


[PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-24 Thread Tejun Heo
Adaptive owner spinning used to be applied only to mutex_lock().  This
patch applies it also to mutex_trylock().

btrfs has developed custom locking to avoid excessive context switches
in its btree implementation.  Generally, doing away with the custom
implementation and just using the mutex shows better behavior;
however, there's an interesting distinction in the custom
implementation of trylock.  It distinguishes between simple trylock
and tryspin, where the former just tries once and then fails while
the latter does some spinning before giving up.

Currently, mutex_trylock() doesn't use adaptive spinning.  It tries
just once.  I got curious whether using adaptive spinning on
mutex_trylock() would be beneficial and it seems so, for btrfs anyway.

The following results are from a dbench 50 run on a two-socket,
eight-core Opteron machine with 4 GiB of memory and an OCZ Vertex SSD.
During the run, disk stays mostly idle and all CPUs are fully occupied
and the difference in locking performance becomes quite visible.

SIMPLE is with the locking simplification patch[1] applied, i.e. it
basically just uses the mutex.  SPIN is with this patch applied on
top - mutex_trylock() uses adaptive spinning.

USER   SYSTEM  SIRQ  CXTSW    THROUGHPUT
 SIMPLE 61107  354977  217   8099529  845.100 MB/sec
 SPIN   63140  364888  214   6840527  879.077 MB/sec

On various runs, the adaptive spinning trylock consistently posts
higher throughput.  The amount of difference varies but it outperforms
consistently.

In general, using adaptive spinning on trylock makes sense as trylock
failure usually leads to costly unlock-relock sequence.

[1] http://article.gmane.org/gmane.comp.file-systems.btrfs/9658

Signed-off-by: Tejun Heo t...@kernel.org
LKML-Reference: 20110323153727.gb12...@htj.dyndns.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@redhat.com
Cc: Chris Mason chris.ma...@oracle.com
---
 kernel/mutex.c |   10 ++
 1 file changed, 10 insertions(+)

Index: work/kernel/mutex.c
===
--- work.orig/kernel/mutex.c
+++ work/kernel/mutex.c
@@ -443,6 +443,15 @@ static inline int __mutex_trylock_slowpa
unsigned long flags;
int prev;
 
+   preempt_disable();
+
+   if (mutex_spin(lock)) {
+   mutex_set_owner(lock);
+   mutex_acquire(lock->dep_map, 0, 1, _RET_IP_);
+   preempt_enable();
+   return 1;
+   }
+
spin_lock_mutex(lock->wait_lock, flags);
 
prev = atomic_xchg(lock->count, -1);
@@ -456,6 +465,7 @@ static inline int __mutex_trylock_slowpa
atomic_set(lock->count, 0);
 
spin_unlock_mutex(lock->wait_lock, flags);
+   preempt_enable();
 
return prev == 1;
 }


Re: [PATCH 1/2] Subject: mutex: Separate out mutex_spin()

2011-03-24 Thread Tejun Heo
Ugh... Please drop the extra "Subject: " from the subject before applying.
Thanks.

-- 
tejun


2.6.38 fs balance lock ordering...

2011-03-24 Thread Daniel J Blueman
While doing a filesystem balance, lockdep detected a potential lock
ordering issue [1].

Thanks,
  Daniel

--- [1]

===
[ INFO: possible circular locking dependency detected ]
2.6.38.1-341cd+ #10
---
btrfs/1101 is trying to acquire lock:
 (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [812cddb9]
prealloc_file_extent_cluster+0x59/0x180

but task is already holding lock:
 (&fs_info->cleaner_mutex){+.+.+.}, at: [812cfcb7]
btrfs_relocate_block_group+0x197/0x2d0

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #2 (&fs_info->cleaner_mutex){+.+.+.}:
   [8109628a] lock_acquire+0x5a/0x70
   [816c9cde] mutex_lock_nested+0x5e/0x390
   [812828e1] btrfs_commit_super+0x21/0xe0
   [812857a2] close_ctree+0x332/0x3a0
   [8125fd08] btrfs_put_super+0x18/0x30
   [8113ae7d] generic_shutdown_super+0x6d/0xf0
   [8113af91] kill_anon_super+0x11/0x60
   [8113b6b5] deactivate_locked_super+0x45/0x60
   [8113c2b5] deactivate_super+0x45/0x60
   [81158729] mntput_no_expire+0x99/0xf0
   [8115996c] sys_umount+0x7c/0x3c0
   [81002dfb] system_call_fastpath+0x16/0x1b

-> #1 (&type->s_umount_key#24){++}:
   [8109628a] lock_acquire+0x5a/0x70
   [816ca372] down_read+0x42/0x60
   [8115e935] writeback_inodes_sb_nr_if_idle+0x35/0x60
   [812723ae] shrink_delalloc+0xee/0x180
   [81273253] btrfs_delalloc_reserve_metadata+0x163/0x180
   [812732ab] btrfs_delalloc_reserve_space+0x3b/0x60
   [8129563d] btrfs_file_aio_write+0x61d/0x9c0
   [81137f12] do_sync_write+0xd2/0x110
   [81138a88] vfs_write+0xc8/0x190
   [81138c3c] sys_write+0x4c/0x90
   [81002dfb] system_call_fastpath+0x16/0x1b

-> #0 (&sb->s_type->i_mutex_key#12){+.+.+.}:
   [810961a8] __lock_acquire+0x1ba8/0x1c30
   [8109628a] lock_acquire+0x5a/0x70
   [816c9cde] mutex_lock_nested+0x5e/0x390
   [812cddb9] prealloc_file_extent_cluster+0x59/0x180
   [812ce0a1] relocate_file_extent_cluster+0x91/0x380
   [812ce44b] relocate_data_extent+0xbb/0xd0
   [812cf843] relocate_block_group+0x323/0x600
   [812cfcc8] btrfs_relocate_block_group+0x1a8/0x2d0
   [812b09c3] btrfs_relocate_chunk+0x83/0x600
   [812b160d] btrfs_balance+0x20d/0x280
   [812b8b86] btrfs_ioctl+0x1b6/0xa80
   [8114a43d] do_vfs_ioctl+0x9d/0x590
   [8114a97a] sys_ioctl+0x4a/0x80
   [81002dfb] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

2 locks held by btrfs/1101:
 #0:  (&fs_info->volume_mutex){+.+.+.}, at: [812b148b]
btrfs_balance+0x8b/0x280
 #1:  (&fs_info->cleaner_mutex){+.+.+.}, at: [812cfcb7]
btrfs_relocate_block_group+0x197/0x2d0

stack backtrace:
Pid: 1101, comm: btrfs Tainted: GW   2.6.38.1-341cd+ #10
Call Trace:
 [810937fb] ? print_circular_bug+0xeb/0xf0
 [810961a8] ? __lock_acquire+0x1ba8/0x1c30
 [812a5fd1] ? map_private_extent_buffer+0xe1/0x210
 [812cddb9] ? prealloc_file_extent_cluster+0x59/0x180
 [8109628a] ? lock_acquire+0x5a/0x70
 [812cddb9] ? prealloc_file_extent_cluster+0x59/0x180
 [810565f5] ? add_preempt_count+0x75/0xd0
 [816c9cde] ? mutex_lock_nested+0x5e/0x390
 [812cddb9] ? prealloc_file_extent_cluster+0x59/0x180
 [81125fa3] ? init_object+0x43/0x80
 [81051121] ? get_parent_ip+0x11/0x50
 [812cddb9] ? prealloc_file_extent_cluster+0x59/0x180
 [812ce0a1] ? relocate_file_extent_cluster+0x91/0x380
 [812ce44b] ? relocate_data_extent+0xbb/0xd0
 [812cf843] ? relocate_block_group+0x323/0x600
 [812cfcc8] ? btrfs_relocate_block_group+0x1a8/0x2d0
 [812b09c3] ? btrfs_relocate_chunk+0x83/0x600
 [812a62d2] ? read_extent_buffer+0xf2/0x230
 [8126c286] ? btrfs_search_slot+0x886/0xa90
 [8105654d] ? sub_preempt_count+0x9d/0xd0
 [812a62d2] ? read_extent_buffer+0xf2/0x230
 [812b160d] ? btrfs_balance+0x20d/0x280
 [812b8b86] ? btrfs_ioctl+0x1b6/0xa80
 [8103146c] ? do_page_fault+0x1cc/0x440
 [8114a43d] ? do_vfs_ioctl+0x9d/0x590
 [8113943f] ? fget_light+0x1df/0x3c0
 [8114a97a] ? sys_ioctl+0x4a/0x80
 [81002dfb] ? system_call_fastpath+0x16/0x1b
-- 
Daniel J Blueman


recurring btrfs csum failed

2011-03-24 Thread Tomasz Chmielewski
I had a system freeze for some reason with 2.6.38.

I made a hard reboot, just to discover that some of the files (KVM images,
in use when the crash happened) on a btrfs RAID-1 filesystem are corrupted:

btrfs csum failed ino 257 off 120180736 csum 4246715593 private 48329
btrfs csum failed ino 257 off 120180736 csum 4246715593 private 48329
btrfs csum failed ino 257 off 120180736 csum 4246715593 private 48329


Not being in the mood to see whether btrfs would try the other device from
the mirror, I decided to remove the corrupted file and copy over a previous
version stored on an ext3 filesystem.

The file copied fine, but to my surprise, the new file is still corrupted:

# md5sum vm-113-disk-1.raw 
md5sum: vm-113-disk-1.raw: Input/output error


Errors reported by btrfs are slightly different now:

btrfs csum failed ino 260 extent 21968855040 csum 582168802 wanted 1727644489 
mirror 1
btrfs csum failed ino 260 extent 21948932096 csum 582168802 wanted 1727644489 
mirror 2
btrfs csum failed ino 260 extent 21968855040 csum 582168802 wanted 1727644489 
mirror 1
btrfs csum failed ino 260 extent 21968855040 csum 582168802 wanted 1727644489 
mirror 1
btrfs csum failed ino 260 extent 21948932096 csum 582168802 wanted 1727644489 
mirror 2
btrfs csum failed ino 260 extent 21968855040 csum 582168802 wanted 1727644489 
mirror 1
btrfs csum failed ino 260 extent 21948932096 csum 582168802 wanted 1727644489 
mirror 2



btrfs is mounted with these flags:

/dev/sdc on /mnt/btrfs type btrfs 
(rw,noatime,compress-force=lzo,device=/dev/sdc,device=/dev/sdd)


I don't need to recover the file, just trying to signal that something
doesn't work well here!

-- 
Tomasz Chmielewski
http://wpkg.org


Re: [PATCH v4 3/6] btrfs: add scrub code and prototypes

2011-03-24 Thread Arne Jansen
On 23.03.2011 18:18, David Sterba wrote:
 Hi,
 
 I'm reviewing the atomic counters and the wait/wake infrastructure,
 and just found two missing mutex_unlock()s in btrfs_scrub_dev() in
 error paths.
 
 On Fri, Mar 18, 2011 at 04:55:06PM +0100, Arne Jansen wrote:
 This is the main scrub code.

 +mutex_lock(&fs_info->scrub_lock);
 +if (dev->scrub_device) {
 +mutex_unlock(&fs_info->scrub_lock);
   mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
 
 +scrub_workers_put(root);
 +return -EINPROGRESS;
 +}
 +sdev = scrub_setup_dev(dev);
 +if (IS_ERR(sdev)) {
 +mutex_unlock(&fs_info->scrub_lock);
   mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
 
 +scrub_workers_put(root);
 +return PTR_ERR(sdev);
 +}

Thanks, I'll add you as Reported-by if that's ok.

-Arne


[PATCH V4 4/4] Btrfs: add btrfs_trim_fs() to handle FITRIM

2011-03-24 Thread Li Dongyang
We take a free extent out of the allocator, trim it, then put it back,
but before we trim the block group we should make sure the block group
is cached, so we also add a small change to make cache_block_group()
run without a transaction.

Signed-off-by: Li Dongyang lidongy...@novell.com
---
 fs/btrfs/ctree.h|1 +
 fs/btrfs/extent-tree.c  |   50 +++-
 fs/btrfs/free-space-cache.c |   92 +++
 fs/btrfs/free-space-cache.h |2 +
 fs/btrfs/ioctl.c|   46 +
 5 files changed, 190 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 94bb772..df206c1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2232,6 +2232,7 @@ int btrfs_error_discard_extent(struct btrfs_root *root, 
u64 bytenr,
   u64 num_bytes, u64 *actual_bytes);
 int btrfs_force_chunk_alloc(struct btrfs_trans_handle *trans,
struct btrfs_root *root, u64 type);
+int btrfs_trim_fs(struct btrfs_root *root, struct fstrim_range *range);
 
 /* ctree.c */
 int btrfs_bin_search(struct extent_buffer *eb, struct btrfs_key *key,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 10e542a..d876759 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -440,7 +440,7 @@ static int cache_block_group(struct btrfs_block_group_cache 
*cache,
 * allocate blocks for the tree root we can't do the fast caching since
 * we likely hold important locks.
 */
-   if (!trans->transaction->in_commit &&
+   if (trans && (!trans->transaction->in_commit) &&
(root && root != root->fs_info->tree_root)) {
spin_lock(&cache->lock);
if (cache->cached != BTRFS_CACHE_NO) {
@@ -8739,3 +8739,51 @@ int btrfs_error_discard_extent(struct btrfs_root *root, 
u64 bytenr,
 {
return btrfs_discard_extent(root, bytenr, num_bytes, actual_bytes);
 }
+
+int btrfs_trim_fs(struct btrfs_root *root, struct fstrim_range *range)
+{
+   struct btrfs_fs_info *fs_info = root->fs_info;
+   struct btrfs_block_group_cache *cache = NULL;
+   u64 group_trimmed;
+   u64 start;
+   u64 end;
+   u64 trimmed = 0;
+   int ret = 0;
+
+   cache = btrfs_lookup_block_group(fs_info, range->start);
+
+   while (cache) {
+   if (cache->key.objectid >= (range->start + range->len)) {
+   btrfs_put_block_group(cache);
+   break;
+   }
+
+   start = max(range->start, cache->key.objectid);
+   end = min(range->start + range->len,
+   cache->key.objectid + cache->key.offset);
+
+   if (end - start >= range->minlen) {
+   if (!block_group_cache_done(cache)) {
+   ret = cache_block_group(cache, NULL, root, 0);
+   if (!ret)
+   wait_block_group_cache_done(cache);
+   }
+   ret = btrfs_trim_block_group(cache,
+&group_trimmed,
+start,
+end,
+range->minlen);
+
+   trimmed += group_trimmed;
+   if (ret) {
+   btrfs_put_block_group(cache);
+   break;
+   }
+   }
+
+   cache = next_block_group(fs_info->tree_root, cache);
+   }
+
+   range->len = trimmed;
+   return ret;
+}
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index a039065..d0dc812 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2154,3 +2154,95 @@ void btrfs_init_free_cluster(struct btrfs_free_cluster 
*cluster)
cluster->block_group = NULL;
 }
 
+int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
+  u64 *trimmed, u64 start, u64 end, u64 minlen)
+{
+   struct btrfs_free_space *entry = NULL;
+   struct btrfs_fs_info *fs_info = block_group->fs_info;
+   u64 bytes = 0;
+   u64 actually_trimmed;
+   int ret = 0;
+
+   *trimmed = 0;
+
+   while (start < end) {
+   spin_lock(&block_group->tree_lock);
+
+   if (block_group->free_space < minlen) {
+   spin_unlock(&block_group->tree_lock);
+   break;
+   }
+
+   entry = tree_search_offset(block_group, start, 0, 1);
+   if (!entry)
+   entry = tree_search_offset(block_group,
+  offset_to_bitmap(block_group,
+   start),
+   

[PATCH V4 2/4] Btrfs: make btrfs_map_block() return entire free extent for each device of RAID0/1/10/DUP

2011-03-24 Thread Li Dongyang
btrfs_map_block() will only return a single stripe length, but we want
the full extent to be mapped to each disk when we are trimming the
extent, so we add a length field to btrfs_bio_stripe and fill it if we
are mapping for REQ_DISCARD.

Signed-off-by: Li Dongyang lidongy...@novell.com
---
 fs/btrfs/volumes.c |  150 
 fs/btrfs/volumes.h |1 +
 2 files changed, 129 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index dd13eb8..e81cce6 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2962,7 +2962,10 @@ static int __btrfs_map_block(struct btrfs_mapping_tree 
*map_tree, int rw,
struct extent_map_tree *em_tree = &map_tree->map_tree;
u64 offset;
u64 stripe_offset;
+   u64 stripe_end_offset;
u64 stripe_nr;
+   u64 stripe_nr_orig;
+   u64 stripe_nr_end;
int stripes_allocated = 8;
int stripes_required = 1;
int stripe_index;
@@ -2971,7 +2974,7 @@ static int __btrfs_map_block(struct btrfs_mapping_tree 
*map_tree, int rw,
int max_errors = 0;
struct btrfs_multi_bio *multi = NULL;
 
-   if (multi_ret && !(rw & REQ_WRITE))
+   if (multi_ret && !(rw & (REQ_WRITE | REQ_DISCARD)))
stripes_allocated = 1;
 again:
if (multi_ret) {
@@ -3017,7 +3020,15 @@ again:
max_errors = 1;
}
}
-   if (multi_ret && (rw & REQ_WRITE) &&
+   if (rw & REQ_DISCARD) {
+   if (map->type & (BTRFS_BLOCK_GROUP_RAID0 |
+BTRFS_BLOCK_GROUP_RAID1 |
+BTRFS_BLOCK_GROUP_DUP |
+BTRFS_BLOCK_GROUP_RAID10)) {
+   stripes_required = map->num_stripes;
+   }
+   }
+   if (multi_ret && (rw & (REQ_WRITE | REQ_DISCARD)) &&
stripes_allocated < stripes_required) {
stripes_allocated = map->num_stripes;
free_extent_map(em);
@@ -3037,12 +3048,15 @@ again:
/* stripe_offset is the offset of this block in its stripe*/
stripe_offset = offset - stripe_offset;
 
-   if (map->type & (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1 |
-BTRFS_BLOCK_GROUP_RAID10 |
-BTRFS_BLOCK_GROUP_DUP)) {
+   if (rw & REQ_DISCARD)
+   *length = min_t(u64, em->len - offset, *length);
+   else if (map->type & (BTRFS_BLOCK_GROUP_RAID0 |
+ BTRFS_BLOCK_GROUP_RAID1 |
+ BTRFS_BLOCK_GROUP_RAID10 |
+ BTRFS_BLOCK_GROUP_DUP)) {
/* we limit the length of each bio to what fits in a stripe */
*length = min_t(u64, em->len - offset,
- map->stripe_len - stripe_offset);
+   map->stripe_len - stripe_offset);
} else {
*length = em->len - offset;
}
@@ -3052,8 +3066,19 @@ again:
 
num_stripes = 1;
stripe_index = 0;
-   if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
-   if (unplug_page || (rw & REQ_WRITE))
+   stripe_nr_orig = stripe_nr;
+   stripe_nr_end = (offset + *length + map->stripe_len - 1) &
+   (~(map->stripe_len - 1));
+   do_div(stripe_nr_end, map->stripe_len);
+   stripe_end_offset = stripe_nr_end * map->stripe_len -
+   (offset + *length);
+   if (map->type & BTRFS_BLOCK_GROUP_RAID0) {
+   if (rw & REQ_DISCARD)
+   num_stripes = min_t(u64, map->num_stripes,
+   stripe_nr_end - stripe_nr_orig);
+   stripe_index = do_div(stripe_nr, map->num_stripes);
+   } else if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
+   if (unplug_page || (rw & (REQ_WRITE | REQ_DISCARD)))
num_stripes = map->num_stripes;
else if (mirror_num)
stripe_index = mirror_num - 1;
@@ -3064,7 +3089,7 @@ again:
}
 
} else if (map->type & BTRFS_BLOCK_GROUP_DUP) {
-   if (rw & REQ_WRITE)
+   if (rw & (REQ_WRITE | REQ_DISCARD))
num_stripes = map->num_stripes;
else if (mirror_num)
stripe_index = mirror_num - 1;
@@ -3077,6 +3102,10 @@ again:
 
if (unplug_page || (rw & REQ_WRITE))
num_stripes = map->sub_stripes;
+   else if (rw & REQ_DISCARD)
+   num_stripes = min_t(u64, map->sub_stripes *
+   (stripe_nr_end - stripe_nr_orig),
+   map->num_stripes);
else if (mirror_num)
stripe_index += mirror_num - 1;
else {
@@ -3094,24 +3123,101 @@ again:
}
BUG_ON(stripe_index >= map->num_stripes);
 
-   for (i 

[PATCH V4 3/4] Btrfs: adjust btrfs_discard_extent() return errors and trimmed bytes

2011-03-24 Thread Li Dongyang
Callers of btrfs_discard_extent() should check if we are mounted with -o
discard, as we want fitrim to work even when the fs is not mounted with
-o discard. Also we should use REQ_DISCARD to map the free extent to get
a full mapping. Finally, we only return errors if
1. the error is not an EOPNOTSUPP, or
2. no device supports discard.

Signed-off-by: Li Dongyang lidongy...@novell.com
---
 fs/btrfs/ctree.h   |2 +-
 fs/btrfs/disk-io.c |5 -
 fs/btrfs/extent-tree.c |   45 ++---
 3 files changed, 31 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2c84551..94bb772 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2229,7 +2229,7 @@ u64 btrfs_account_ro_block_groups_free_space(struct 
btrfs_space_info *sinfo);
 int btrfs_error_unpin_extent_range(struct btrfs_root *root,
   u64 start, u64 end);
 int btrfs_error_discard_extent(struct btrfs_root *root, u64 bytenr,
-  u64 num_bytes);
+  u64 num_bytes, u64 *actual_bytes);
 int btrfs_force_chunk_alloc(struct btrfs_trans_handle *trans,
struct btrfs_root *root, u64 type);
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 100b07f..98b60b0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2947,7 +2947,10 @@ static int btrfs_destroy_pinned_extent(struct btrfs_root 
*root,
break;
 
/* opt_discard */
-   ret = btrfs_error_discard_extent(root, start, end + 1 - start);
+   if (btrfs_test_opt(root, DISCARD))
+   ret = btrfs_error_discard_extent(root, start,
+end + 1 - start,
+NULL);
 
clear_extent_dirty(unpin, start, end, GFP_NOFS);
btrfs_error_unpin_extent_range(root, start, end);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index caa4254..10e542a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1738,40 +1738,44 @@ static int remove_extent_backref(struct 
btrfs_trans_handle *trans,
return ret;
 }
 
-static void btrfs_issue_discard(struct block_device *bdev,
+static int btrfs_issue_discard(struct block_device *bdev,
u64 start, u64 len)
 {
-   blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_KERNEL, 0);
+   return blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_KERNEL, 0);
 }
 
 static int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
-   u64 num_bytes)
+   u64 num_bytes, u64 *actual_bytes)
 {
int ret;
-   u64 map_length = num_bytes;
+   u64 discarded_bytes = 0;
struct btrfs_multi_bio *multi = NULL;
 
-   if (!btrfs_test_opt(root, DISCARD))
-   return 0;
-
/* Tell the block device(s) that the sectors can be discarded */
-   ret = btrfs_map_block(&root->fs_info->mapping_tree, READ,
- bytenr, &map_length, &multi, 0);
+   ret = btrfs_map_block(&root->fs_info->mapping_tree, REQ_DISCARD,
+ bytenr, &num_bytes, &multi, 0);
if (!ret) {
struct btrfs_bio_stripe *stripe = multi->stripes;
int i;
 
-   if (map_length > num_bytes)
-   map_length = num_bytes;
-
for (i = 0; i < multi->num_stripes; i++, stripe++) {
-   btrfs_issue_discard(stripe->dev->bdev,
-   stripe->physical,
-   map_length);
+   ret = btrfs_issue_discard(stripe->dev->bdev,
+ stripe->physical,
+ stripe->length);
+   if (!ret)
+   discarded_bytes += stripe->length;
+   else if (ret != -EOPNOTSUPP)
+   break;
}
kfree(multi);
}
 
+   if (discarded_bytes && ret == -EOPNOTSUPP)
+   ret = 0;
+
+   if (actual_bytes)
+   *actual_bytes = discarded_bytes;
+
return ret;
 }
 
@@ -4361,7 +4365,9 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle 
*trans,
if (ret)
break;
 
-   ret = btrfs_discard_extent(root, start, end + 1 - start);
+   if (btrfs_test_opt(root, DISCARD))
+   ret = btrfs_discard_extent(root, start,
+  end + 1 - start, NULL);
 
clear_extent_dirty(unpin, start, end, GFP_NOFS);
unpin_extent_range(root, start, end);
@@ -5410,7 +5416,8 @@ int btrfs_free_reserved_extent(struct btrfs_root *root, 

[PATCH V4 0/4] Btrfs: batched discard support for btrfs

2011-03-24 Thread Li Dongyang
Dear list,
This is V4 of batched discard support. Now we will get a full mapping of
the free space on each device for RAID0/1/10/DUP instead of just a single
stripe length, tested with xfstests 251. Thanks.
Changelog V4:
*make btrfs_map_block() return full mapping.
Changelog V3:
*fix style problems.
*rebase to 2.6.38-rc7.
Changelog V2:
*Check if we have devices that support trim before trying to trim the fs; also
  adjust minlen according to the discard_granularity.
*Update reserved extent calculations in btrfs_trim_block_group().
*Call cond_resched() without checking need_resched()
*Use bitmap_clear_bits() and unlink_free_space() instead of
  btrfs_remove_free_space(), so we won't search the same extent twice.
*Try harder in btrfs_discard_extent(); now we won't report an error
 when it's just an EOPNOTSUPP.
*Make sure the block group is cached before trimming it, or we'll see an
 empty caching tree if the block group is not cached.
*Minor return value fix in btrfs_discard_block_group().
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V4 1/4] Btrfs: make update_reserved_bytes() public

2011-03-24 Thread Li Dongyang
Make the function public as we should update the reserved extents calculations
after taking out an extent for trimming.

Signed-off-by: Li Dongyang lidongy...@novell.com
---
 fs/btrfs/ctree.h|2 ++
 fs/btrfs/extent-tree.c  |   16 +++-
 2 files changed, 9 insertions(+), 9 deletions(-)
 create mode 100644 fs/btrfs/Module.symvers

diff --git a/fs/btrfs/Module.symvers b/fs/btrfs/Module.symvers
new file mode 100644
index 000..e69de29
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7f78cc7..2c84551 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2157,6 +2157,8 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans,
  u64 root_objectid, u64 owner, u64 offset);
 
 int btrfs_free_reserved_extent(struct btrfs_root *root, u64 start, u64 len);
+int btrfs_update_reserved_bytes(struct btrfs_block_group_cache *cache,
+   u64 num_bytes, int reserve, int sinfo);
 int btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,
struct btrfs_root *root);
 int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 7b3089b..caa4254 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -36,8 +36,6 @@
 static int update_block_group(struct btrfs_trans_handle *trans,
  struct btrfs_root *root,
  u64 bytenr, u64 num_bytes, int alloc);
-static int update_reserved_bytes(struct btrfs_block_group_cache *cache,
-u64 num_bytes, int reserve, int sinfo);
 static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
u64 bytenr, u64 num_bytes, u64 parent,
@@ -4223,8 +4221,8 @@ int btrfs_pin_extent(struct btrfs_root *root,
  * update size of reserved extents. this function may return -EAGAIN
  * if 'reserve' is true or 'sinfo' is false.
  */
-static int update_reserved_bytes(struct btrfs_block_group_cache *cache,
-u64 num_bytes, int reserve, int sinfo)
+int btrfs_update_reserved_bytes(struct btrfs_block_group_cache *cache,
+   u64 num_bytes, int reserve, int sinfo)
 {
int ret = 0;
if (sinfo) {
@@ -4704,10 +4702,10 @@ void btrfs_free_tree_block(struct btrfs_trans_handle 
*trans,
WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags));
 
btrfs_add_free_space(cache, buf->start, buf->len);
-   ret = update_reserved_bytes(cache, buf->len, 0, 0);
+   ret = btrfs_update_reserved_bytes(cache, buf->len, 0, 0);
if (ret == -EAGAIN) {
/* block group became read-only */
-   update_reserved_bytes(cache, buf->len, 0, 1);
+   btrfs_update_reserved_bytes(cache, buf->len, 0, 1);
goto out;
goto out;
}
 
@@ -5191,7 +5189,7 @@ checks:
 search_start - offset);
BUG_ON(offset > search_start);
 
-   ret = update_reserved_bytes(block_group, num_bytes, 1,
+   ret = btrfs_update_reserved_bytes(block_group, num_bytes, 1,
(data & BTRFS_BLOCK_GROUP_DATA));
if (ret == -EAGAIN) {
btrfs_add_free_space(block_group, offset, num_bytes);
@@ -5415,7 +5413,7 @@ int btrfs_free_reserved_extent(struct btrfs_root *root, 
u64 start, u64 len)
ret = btrfs_discard_extent(root, start, len);
 
btrfs_add_free_space(cache, start, len);
-   update_reserved_bytes(cache, len, 0, 1);
+   btrfs_update_reserved_bytes(cache, len, 0, 1);
btrfs_put_block_group(cache);
 
return ret;
@@ -5614,7 +5612,7 @@ int btrfs_alloc_logged_file_extent(struct 
btrfs_trans_handle *trans,
put_caching_control(caching_ctl);
}
 
-   ret = update_reserved_bytes(block_group, ins->offset, 1, 1);
+   ret = btrfs_update_reserved_bytes(block_group, ins->offset, 1, 1);
BUG_ON(ret);
btrfs_put_block_group(block_group);
ret = alloc_reserved_file_extent(trans, root, 0, root_objectid,
-- 
1.7.4.1



[PATCH] Btrfs: add initial tracepoint support for btrfs

2011-03-24 Thread liubo

Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g.
  dd-7822  [000]  2121.641088: btrfs_inode_request: root = 
5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, 
logged_trans = 0
  dd-7822  [000]  2121.641100: btrfs_inode_new: root = 5(FS_TREE), 
gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
 btrfs-transacti-7804  [001]  2146.935420: btrfs_cow_block: root = 
2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 
29388800 (cow_level = 0)
 btrfs-transacti-7804  [001]  2146.935473: btrfs_cow_block: root = 
1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 
29392896 (cow_level = 0)
 btrfs-transacti-7804  [001]  2146.972221: btrfs_transaction_commit: root = 
1(ROOT_TREE), gen = 8
   flush-btrfs-2-7821  [001]  2155.824210: btrfs_chunk_alloc: root = 
3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, 
sub_stripes = 0, type = DATA
   flush-btrfs-2-7821  [001]  2155.824241: btrfs_cow_block: root = 
2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 
29396992 (cow_level = 0)
   flush-btrfs-2-7821  [001]  2155.824255: btrfs_cow_block: root = 4(DEV_TREE), 
refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 
0)
   flush-btrfs-2-7821  [000]  2155.824329: btrfs_cow_block: root = 
3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 
20975616 (cow_level = 0)
 btrfs-endio-wri-7800  [001]  2155.898019: btrfs_cow_block: root = 5(FS_TREE), 
refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 
0)
 btrfs-endio-wri-7800  [001]  2155.898043: btrfs_cow_block: root = 
7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 
29409280 (cow_level = 0)

Here is what I have added:

1) ordered_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put

These provide critical information to understand how ordered_extents are
updated.

2) extent_map:
btrfs_get_extent

extent_map is used in both read and write cases, and it is useful for tracking
how btrfs-specific IO is running.

3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook

Pages are critical resources and produce a lot of corner cases during writeback,
so it is valuable to know how a page is written to disk.

4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict

These can show where and when an inode is created and when an inode is evicted.

5) sync:
btrfs_sync_file
btrfs_sync_fs

These show sync arguments.

6) transaction:
btrfs_transaction_commit

In a transaction-based filesystem, it is useful to know the generation and
who does the commit.

7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block

Btrfs natively supports back references; these tracepoints are helpful for
understanding btrfs's COW mechanism.

8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free

A chunk is a link between a physical offset and a logical offset, and stands for
space information in btrfs; these are helpful for tracing space usage.

9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free

These can show how btrfs uses its space.

Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
---
 fs/btrfs/ctree.c |3 +
 fs/btrfs/ctree.h |1 +
 fs/btrfs/delayed-ref.c   |6 +
 fs/btrfs/extent-tree.c   |4 +
 fs/btrfs/extent_io.c |2 +
 fs/btrfs/file.c  |1 +
 fs/btrfs/inode.c |   12 +
 fs/btrfs/ordered-data.c  |8 +
 fs/btrfs/super.c |5 +
 fs/btrfs/transaction.c   |2 +
 fs/btrfs/volumes.c   |   16 +-
 fs/btrfs/volumes.h   |   11 +
 include/trace/events/btrfs.h |  667 ++
 13 files changed, 727 insertions(+), 11 deletions(-)
 create mode 100644 include/trace/events/btrfs.h

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index b5baff0..351515d 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -542,6 +542,9 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle 
*trans,
 
ret = __btrfs_cow_block(trans, root, buf, parent,
 parent_slot, cow_ret, search_start, 0);
+
+   trace_btrfs_cow_block(root, buf, *cow_ret);
+
return ret;
 }
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 28188a7..cd6906e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -28,6 +28,7 @@
 #include <linux/wait.h>
 #include <linux/slab.h>
 #include <linux/kobject.h>
+#include <trace/events/btrfs.h>
 #include <asm/kmap_types.h>
 #include "extent_io.h"
 #include "extent_map.h"
diff --git a/fs/btrfs/delayed-ref.c 

[PATCH V5 1/2] btrfs: use GFP_NOFS instead of GFP_KERNEL

2011-03-24 Thread Miao Xie
In the filesystem context, we must allocate memory with GFP_NOFS,
or we may start another filesystem operation and make the kswapd thread hang.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/extent-tree.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f1db57d..42061d2 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -471,7 +471,7 @@ static int cache_block_group(struct btrfs_block_group_cache 
*cache,
if (load_cache_only)
return 0;
 
-   caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_KERNEL);
+   caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
BUG_ON(!caching_ctl);
 
INIT_LIST_HEAD(&caching_ctl->list);
@@ -1743,7 +1743,7 @@ static int remove_extent_backref(struct 
btrfs_trans_handle *trans,
 static void btrfs_issue_discard(struct block_device *bdev,
u64 start, u64 len)
 {
-   blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_KERNEL,
+   blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS,
BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
 }
 
-- 
1.7.3.1


Re: [RFC] Tree fragmentation and prefetching

2011-03-24 Thread Miao Xie
On Thu, 24 Mar 2011 08:29:57 +0100, Arne Jansen wrote:
 On 24.03.2011 02:38, Miao Xie wrote:
 On wed, 23 Mar 2011 21:28:25 +0100, Arne Jansen wrote:
 On 23.03.2011 20:26, Andrey Kuzmin wrote:
 On Wed, Mar 23, 2011 at 4:06 PM, Arne Jansen sensi...@gmx.net wrote:
 The main idea is to load the tree (or parts of it) top-down, order the
 needed blocks and distribute it over all disks.
 To keep you interested, some results first.

 a) by tree enumeration with reada=2
reading extent tree: 242s
reading csum tree: 140s
reading both trees: 324s

 b) prefetch prototype
reading extent tree: 23.5s
reading csum tree: 20.4s
reading both trees: 25.7s

 10x speed-up looks indeed impressive. Just for me to be sure, did I
 get you right in that you attribute this effect specifically to
 enumerating tree leaves in key address vs. disk addresses when these
 two are not aligned?

 Yes. Leaves and the intermediate nodes tend to be quite scattered
 around the disk with respect to their logical order.
 Reading them in logical (ascending/descending) order require lots
 of seeks.

 I'm also dealing with tree fragmentation problem, I try to store the leaves
 which have the same parent closely.
 
 That's good to hear. Do you have already anything I can repeat the test
 with?

It is still under development. ;)

Thanks
Miao

 -Arne
 

 Regards
 Miao

 



Re: [PATCH v4 4/6] btrfs: sync scrub with commit device removal

2011-03-24 Thread Arne Jansen
On 23.03.2011 18:28, David Sterba wrote:
 Hi,
 
 you are adding a new smp_mb, can you please explain why it's needed and
 document it?
 
 thanks,
 dave
 
 On Fri, Mar 18, 2011 at 04:55:07PM +0100, Arne Jansen wrote:
 This adds several synchronizations:
  - for a transaction commit, the scrub gets paused before the
tree roots are committed until the super are safely on disk
  - during a log commit, scrubbing of supers is disabled
  - on unmount, the scrub gets cancelled
  - on device removal, the scrub for the particular device gets cancelled


 --- a/fs/btrfs/volumes.c
 +++ b/fs/btrfs/volumes.c
 @@ -1330,6 +1330,8 @@ int btrfs_rm_device(struct btrfs_root *root, char 
 *device_path)
  goto error_undo;
  
 device->in_fs_metadata = 0;
 +smp_mb();
 

The idea was to disallow any new scrubs to start beyond
this point, but it turns out this is not strong enough.
I have to move the check for in_fs_metadata in btrfs_scrub_dev
inside the scrub_lock. In this case, the smp_mb is still needed,
as in_fs_metadata is not protected by any lock. I'll add a
comment.
Thanks for forcing me to rethink this :)

-Arne

 
 +btrfs_scrub_cancel_dev(root, device);
  
  /*
   * the device list mutex makes sure that we don't change




Re: [PATCH v4 4/6] btrfs: sync scrub with commit device removal

2011-03-24 Thread Arne Jansen
On 24.03.2011 13:58, Arne Jansen wrote:
 On 23.03.2011 18:28, David Sterba wrote:
 Hi,

 you are adding a new smp_mb, can you please explain why it's needed and
 document it?

 thanks,
 dave

 On Fri, Mar 18, 2011 at 04:55:07PM +0100, Arne Jansen wrote:
 This adds several synchronizations:
  - for a transaction commit, the scrub gets paused before the
tree roots are committed until the super are safely on disk
  - during a log commit, scrubbing of supers is disabled
  - on unmount, the scrub gets cancelled
  - on device removal, the scrub for the particular device gets cancelled

 
 --- a/fs/btrfs/volumes.c
 +++ b/fs/btrfs/volumes.c
 @@ -1330,6 +1330,8 @@ int btrfs_rm_device(struct btrfs_root *root, char 
 *device_path)
 goto error_undo;
  
 device->in_fs_metadata = 0;
 +   smp_mb();
 
 
 The idea was to disallow any new scrubs to start beyond
 this point, but it turns out this is not strong enough.
 I have to move the check for in_fs_metadata in btrfs_scrub_dev
 inside the scrub_lock. In this case, the smp_mb is still needed,
 as in_fs_metadata is not protected by any lock. I'll add a
 comment.

Thinking more about locking... the smp_mb is not necessary,
because the following cancel_dev aquires a spin_lock, which
implies a barrier.

 
 -Arne
 

 +   btrfs_scrub_cancel_dev(root, device);
  
 /*
  * the device list mutex makes sure that we don't change
 
 



Re: [PATCH v4 3/6] btrfs: add scrub code and prototypes

2011-03-24 Thread David Sterba
On Thu, Mar 24, 2011 at 11:25:29AM +0100, Arne Jansen wrote:
 Thanks, I'll add you as Reported-by if that's ok.

Ok it is :)

dave


drives with more than 2 TByte

2011-03-24 Thread Helmut Hullen
Hallo, linux-btrfs,

what about disks with more than 2 TByte? Other filesystems (?) need GPT.

When I use

   mkfs.btrfs /dev/sdc

(e.g. with drive sdc), does that work without problems with btrfs?

Viele Gruesse!
Helmut


[RFC PATCHSET] btrfs: Simplify extent_buffer locking

2011-03-24 Thread Tejun Heo
Hello,

This is split patchset of the RFC patches[1] to simplify btrfs
locking and contains the following three patches.

 0001-btrfs-Cleanup-extent_buffer-lockdep-code.patch
 0002-btrfs-Use-separate-lockdep-class-keys-for-different-.patch
 0003-btrfs-Simplify-extent_buffer-locking.patch

For more info, please read the patch description on 0003 and the
following two threads.

 http://thread.gmane.org/gmane.comp.file-systems.btrfs/9658
 http://thread.gmane.org/gmane.linux.kernel/1116910

0001 and 0002 improve lockdep key assigning such that extent_buffer
locks get different keys depending on the type (objectid) of the
btrfs_root they belong to.  I think this should provide enough lockdep
annotation resolution to avoid spurious triggering but after applying
this patchset, btrfs triggers several different locking dependency
warnings.

I've followed a couple of them and, to my untrained eyes, they seem to
indicate genuine locking order problems in btrfs which were hidden
till now because the custom locking was invisible to lockdep.

Anyways, so, it seems locking fixes or at least lockdep annotation
improvements will be needed.  Chris, how do you want to proceed?

Thanks.

 fs/btrfs/Makefile  |2 
 fs/btrfs/ctree.c   |   16 +--
 fs/btrfs/disk-io.c |  105 ++
 fs/btrfs/disk-io.h |   21 ++--
 fs/btrfs/extent-tree.c |2 
 fs/btrfs/extent_io.c   |3 
 fs/btrfs/extent_io.h   |   12 --
 fs/btrfs/locking.c |  233 -
 fs/btrfs/locking.h |   65 +++--
 fs/btrfs/volumes.c |2 
 10 files changed, 154 insertions(+), 307 deletions(-)

--
tejun

[1] http://article.gmane.org/gmane.comp.file-systems.btrfs/9658


[PATCH 2/3] btrfs: Use separate lockdep class keys for different roots

2011-03-24 Thread Tejun Heo
Due to the custom extent_buffer locking implementation, currently
lockdep doesn't have visibility into btrfs locking when the locks are
switched to blocking, hiding most of lock ordering issues from
lockdep.

With planned switch to mutex, all extent_buffer locking operations
will be visible to lockdep.  As btrfs_root's used for different
purposes can be lock-nested, sharing the same set of lockdep class
keys leads to spurious locking dependency warnings.

This patch makes btrfs_set_buffer_lockdep_class() take @root parameter
which indicates the btrfs_root the @eb belongs to and use different
sets of keys according to the type of @root.

Signed-off-by: Tejun Heo t...@kernel.org
---
 fs/btrfs/disk-io.c |   91 +--
 fs/btrfs/disk-io.h |   10 --
 fs/btrfs/extent-tree.c |2 +-
 fs/btrfs/volumes.c |2 +-
 4 files changed, 73 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e973e0b..710efbd 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -99,42 +99,79 @@ struct async_submit_bio {
 
 #ifdef CONFIG_LOCKDEP
 /*
- * These are used to set the lockdep class on the extent buffer locks.
- * The class is set by the readpage_end_io_hook after the buffer has
- * passed csum validation but before the pages are unlocked.
+ * Lockdep class keys for extent_buffer->lock's in this root.  For a given
+ * eb, the lockdep key is determined by the btrfs_root it belongs to and
+ * the level the eb occupies in the tree.
  *
- * The lockdep class is also set by btrfs_init_new_buffer on freshly
- * allocated blocks.
+ * Different roots are used for different purposes and may nest inside each
+ * other and they require separate keysets.  As lockdep keys should be
+ * static, assign keysets according to the purpose of the root as indicated
+ * by btrfs_root->objectid.  This ensures that all special purpose roots
+ * have separate keysets.
  *
- * The class is based on the level in the tree block, which allows lockdep
- * to know that lower nodes nest inside the locks of higher nodes.
+ * Lock-nesting across peer nodes is always done with the immediate parent
+ * node locked thus preventing deadlock.  As lockdep doesn't know this, use
+ * subclass to avoid triggering lockdep warning in such cases.
  *
- * We also add a check to make sure the highest level of the tree is
- * the same as our lockdep setup here.  If BTRFS_MAX_LEVEL changes, this
- * code needs update as well.
+ * The key is set by the readpage_end_io_hook after the buffer has passed
+ * csum validation but before the pages are unlocked.  It is also set by
+ * btrfs_init_new_buffer on freshly allocated blocks.
+ *
+ * We also add a check to make sure the highest level of the tree is the
+ * same as our lockdep setup here.  If BTRFS_MAX_LEVEL changes, this code
+ * needs update as well.
  */
 # if BTRFS_MAX_LEVEL != 8
 #  error
 # endif
-static struct lock_class_key btrfs_eb_class[BTRFS_MAX_LEVEL + 1];
-static const char *btrfs_eb_name[BTRFS_MAX_LEVEL + 1] = {
-   /* leaf */
-   "btrfs-extent-00",
-   "btrfs-extent-01",
-   "btrfs-extent-02",
-   "btrfs-extent-03",
-   "btrfs-extent-04",
-   "btrfs-extent-05",
-   "btrfs-extent-06",
-   "btrfs-extent-07",
-   /* highest possible level */
-   "btrfs-extent-08",
+
+static struct btrfs_lockdep_keyset {
+   u64 id; /* root objectid */
+   const char  *name_stem; /* lock name stem */
+   char    names[BTRFS_MAX_LEVEL + 1][20];
+   struct lock_class_key   keys[BTRFS_MAX_LEVEL + 1];
+} btrfs_lockdep_keysets[] = {
+   { .id = BTRFS_ROOT_TREE_OBJECTID,   .name_stem = "root" },
+   { .id = BTRFS_EXTENT_TREE_OBJECTID, .name_stem = "extent"   },
+   { .id = BTRFS_CHUNK_TREE_OBJECTID,  .name_stem = "chunk"    },
+   { .id = BTRFS_DEV_TREE_OBJECTID,    .name_stem = "dev"  },
+   { .id = BTRFS_FS_TREE_OBJECTID, .name_stem = "fs"   },
+   { .id = BTRFS_CSUM_TREE_OBJECTID,   .name_stem = "csum" },
+   { .id = BTRFS_ORPHAN_OBJECTID,  .name_stem = "orphan"   },
+   { .id = BTRFS_TREE_LOG_OBJECTID,    .name_stem = "log"  },
+   { .id = BTRFS_TREE_RELOC_OBJECTID,  .name_stem = "treloc"   },
+   { .id = BTRFS_DATA_RELOC_TREE_OBJECTID, .name_stem = "dreloc"   },
+   { .id = 0,  .name_stem = "tree" },
 };
 
-void btrfs_set_buffer_lockdep_class(struct extent_buffer *eb, int level)
+void __init btrfs_init_lockdep(void)
+{
+   int i, j;
+
+   /* initialize lockdep class names */
+   for (i = 0; i < ARRAY_SIZE(btrfs_lockdep_keysets); i++) {
+   struct btrfs_lockdep_keyset *ks = &btrfs_lockdep_keysets[i];
+
+   for (j = 0; j < ARRAY_SIZE(ks->names); j++)
+   snprintf(ks->names[j], sizeof(ks->names[j]),
+"btrfs-%s-%02d", ks->name_stem, j);
+ 

[PATCH 1/3] btrfs: Cleanup extent_buffer lockdep code

2011-03-24 Thread Tejun Heo
btrfs_set_buffer_lockdep_class() should be dependent upon
CONFIG_LOCKDEP instead of CONFIG_DEBUG_LOCK_ALLOC.  Collect the
related code into one place, use CONFIG_LOCKDEP instead and make some
cosmetic changes.

Signed-off-by: Tejun Heo t...@kernel.org
---
 fs/btrfs/disk-io.c |   22 ++
 fs/btrfs/disk-io.h |   11 +--
 2 files changed, 15 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3e1ea3e..e973e0b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -97,7 +97,9 @@ struct async_submit_bio {
struct btrfs_work work;
 };
 
-/* These are used to set the lockdep class on the extent buffer locks.
+#ifdef CONFIG_LOCKDEP
+/*
+ * These are used to set the lockdep class on the extent buffer locks.
  * The class is set by the readpage_end_io_hook after the buffer has
  * passed csum validation but before the pages are unlocked.
  *
@@ -111,7 +113,6 @@ struct async_submit_bio {
  * the same as our lockdep setup here.  If BTRFS_MAX_LEVEL changes, this
  * code needs update as well.
  */
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
 # if BTRFS_MAX_LEVEL != 8
 #  error
 # endif
@@ -129,7 +130,13 @@ static const char *btrfs_eb_name[BTRFS_MAX_LEVEL + 1] = {
/* highest possible level */
"btrfs-extent-08",
 };
-#endif
+
+void btrfs_set_buffer_lockdep_class(struct extent_buffer *eb, int level)
+{
+   lockdep_set_class_and_name(&eb->lock, &btrfs_eb_class[level],
+  btrfs_eb_name[level]);
+}
+#endif /* CONFIG_LOCKDEP */
 
 /*
  * extents on the btree inode are pretty simple, there's one extent
@@ -419,15 +426,6 @@ static int check_tree_block_fsid(struct btrfs_root *root,
return ret;
 }
 
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-void btrfs_set_buffer_lockdep_class(struct extent_buffer *eb, int level)
-{
-   lockdep_set_class_and_name(&eb->lock,
-  &btrfs_eb_class[level],
-  btrfs_eb_name[level]);
-}
-#endif
-
 static int btree_readpage_end_io_hook(struct page *page, u64 start, u64 end,
   struct extent_state *state)
 {
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 07b20dc..4ab3fa8 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -102,13 +102,12 @@ int btrfs_add_log_tree(struct btrfs_trans_handle *trans,
   struct btrfs_root *root);
 int btree_lock_page_hook(struct page *page);
 
-
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
+#ifdef CONFIG_LOCKDEP
 void btrfs_set_buffer_lockdep_class(struct extent_buffer *eb, int level);
 #else
 static inline void btrfs_set_buffer_lockdep_class(struct extent_buffer *eb,
 int level)
-{
-}
-#endif
-#endif
+{ }
+#endif /* CONFIG_LOCKDEP */
+
+#endif /* __DISKIO__ */
-- 
1.7.1



[PATCH 3/3] btrfs: Simplify extent_buffer locking

2011-03-24 Thread Tejun Heo
extent_buffer implemented custom locking which required explicit
distinction between non-sleepable and sleepable lockings.  This was to
prevent excessive context switches.

For short non-blocking acquisitions, lock was left non-blocking and
other threads which wanted to lock the same eb would spin on it
instead of scheduling out.  If the lock owner wanted to perform
blocking operations, it had to upgrade the locking to blocking mode by
calling btrfs_set_lock_blocking().

The distinction is useful and leads to performance gains compared to
a naive sleeping lock implementation; however, the standard
mutex implementation already has adaptive owner spinning
(CONFIG_MUTEX_SPIN_ON_OWNER) which addresses the same problem in
a transparent manner.

Compared to CONFIG_MUTEX_SPIN_ON_OWNER, the custom implementation has
several disadvantages.

* It requires explicit blocking state management by the lock owner,
  which can be tedious, error-prone and has its own overhead.

* Although the default mutex lacks access to explicit information from
  the lock owner, it has direct visibility into scheduling which is
  often better information for deciding whether optimistic spinning
  would be useful.

* Lockdep annotation comes for free.  This can be added to the custom
  implementation but hasn't been.

This patch removes the custom extent_buffer locking by replacing
eb->lock with a mutex and making the locking API simple wrappers
around mutex operations.

The following is from dbench 50 runs on 8-way opteron w/ 4GiB of
memory and SSD.  CONFIG_PREEMPT_VOLUNTARY is set.

   USER   SYSTEM   SIRQCXTSW  THROUGHPUT
BEFORE 59898  504517377  6814245 782.295
AFTER  61090  493441457  1631688 827.751

Other tests also show generally favorable results for the standard
mutex based implementation.  For more info, please read the following
threads.

 http://thread.gmane.org/gmane.comp.file-systems.btrfs/9658
 http://thread.gmane.org/gmane.linux.kernel/1116910

This patch makes all eb locking visible to lockdep and triggers
various locking ordering warnings along the allocation path.  At least
some of them seem to indicate genuine locking bugs while it is
possible that some are spuriously triggered and simply require better
lockdep annotations.  Note that this patch doesn't change locking
ordering itself.  Lockdep now just has more visibility into btrfs
locking.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Ingo Molnar mi...@redhat.com
---
 fs/btrfs/Makefile|2 +-
 fs/btrfs/ctree.c |   16 ++--
 fs/btrfs/extent_io.c |3 +-
 fs/btrfs/extent_io.h |   12 +--
 fs/btrfs/locking.c   |  233 --
 fs/btrfs/locking.h   |   65 --
 6 files changed, 70 insertions(+), 261 deletions(-)
 delete mode 100644 fs/btrfs/locking.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 31610ea..8688f47 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -5,6 +5,6 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
   file-item.o inode-item.o inode-map.o disk-io.o \
   transaction.o inode.o file.o tree-defrag.o \
   extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \
-  extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
+  extent_io.o volumes.o async-thread.o ioctl.o orphan.o \
   export.o tree-log.o acl.o free-space-cache.o zlib.o lzo.o \
   compression.o delayed-ref.o relocation.o
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index b5baff0..bc1627d 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1074,7 +1074,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
 
left = read_node_slot(root, parent, pslot - 1);
if (left) {
-   btrfs_tree_lock(left);
+   btrfs_tree_lock_nested(left, 1);
btrfs_set_lock_blocking(left);
wret = btrfs_cow_block(trans, root, left,
   parent, pslot - 1, left);
@@ -1085,7 +1085,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
}
right = read_node_slot(root, parent, pslot + 1);
if (right) {
-   btrfs_tree_lock(right);
+   btrfs_tree_lock_nested(right, 2);
btrfs_set_lock_blocking(right);
wret = btrfs_cow_block(trans, root, right,
   parent, pslot + 1, right);
@@ -1241,7 +1241,7 @@ static noinline int push_nodes_for_insert(struct btrfs_trans_handle *trans,
if (left) {
u32 left_nr;
 
-   btrfs_tree_lock(left);
+   btrfs_tree_lock_nested(left, 1);
btrfs_set_lock_blocking(left);
 
left_nr = btrfs_header_nritems(left);
@@ -1291,7 +1291,7 @@ static noinline int push_nodes_for_insert(struct btrfs_trans_handle *trans,
if 

Re: drives with more than 2 TByte

2011-03-24 Thread Goffredo Baroncelli
On 03/24/2011 05:43 PM, Helmut Hullen wrote:
 Hallo, linux-btrfs,
 
 what about disks with more than 2 TByte? Other filesystems (?) need GPT.

The filesystems don't care about the partitioning scheme. The 2TB limit is
related to the maximum MBR partition size. Of course, a filesystem cannot
be larger than the partition where it is allocated.
 
 When I use
 
mkfs.btrfs /dev/sdc
 
 (p.e. with drive sdc), does that work without problems with btrfs?

It should. BTW, why don't you use a GPT partition table?
 
 Viele Gruesse!
 Helmut



Re: drives with more than 2 TByte

2011-03-24 Thread Helmut Hullen
Hallo, Goffredo,

Du meintest am 24.03.11:

 what about disks with more than 2 TByte? Other filesystems (?) need
 GPT.

 The filesystems don't care about the partition system.

Ok - thank you!

Viele Gruesse!
Helmut


Re: [PATCH V5 1/2] btrfs: use GFP_NOFS instead of GFP_KERNEL

2011-03-24 Thread David Sterba
Hi,

On Thu, Mar 24, 2011 at 07:41:21PM +0800, Miao Xie wrote:
 In the filesystem context, we must allocate memory by GFP_NOFS,
 or we may start another filesystem operation and make kswap thread hang up.

indeed. Did you check for other GFP_KERNEL allocations? I've found 8 more
of them and at least these look like candidates for GFP_NOFS too:

diff --git a/fs/btrfs/acl.c b/fs/btrfs/acl.c
index de34bfa..76b9218 100644
--- a/fs/btrfs/acl.c
+++ b/fs/btrfs/acl.c
@@ -289,7 +289,7 @@ int btrfs_acl_chmod(struct inode *inode)
if (IS_ERR(acl) || !acl)
return PTR_ERR(acl);

-   clone = posix_acl_clone(acl, GFP_KERNEL);
+   clone = posix_acl_clone(acl, GFP_NOFS);
posix_acl_release(acl);
if (!clone)
return -ENOMEM;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index f447b78..eb5c01d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -998,7 +998,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
nrptrs = min((iov_iter_count(i) + PAGE_CACHE_SIZE - 1) /
 PAGE_CACHE_SIZE, PAGE_CACHE_SIZE /
 (sizeof(struct page *)));
-   pages = kmalloc(nrptrs * sizeof(struct page *), GFP_KERNEL);
+   pages = kmalloc(nrptrs * sizeof(struct page *), GFP_NOFS);
if (!pages) {
ret = -ENOMEM;
goto out;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d1bace3..e9b9648 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1644,7 +1644,7 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
goto out;
}

-   range = kzalloc(sizeof(*range), GFP_KERNEL);
+   range = kzalloc(sizeof(*range), GFP_NOFS);
if (!range) {
ret = -ENOMEM;
goto out;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index d39a989..5e0fff7 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -399,7 +399,7 @@ static int btrfs_parse_early_options(const char *options, fmode_t flags,
 * strsep changes the string, duplicate it because parse_options
 * gets called twice
 */
-   opts = kstrdup(options, GFP_KERNEL);
+   opts = kstrdup(options, GFP_NOFS);
if (!opts)
return -ENOMEM;
orig = opts;
@@ -446,7 +446,7 @@ static int btrfs_parse_early_options(const char *options, fmode_t flags,
 * mount path doesn't care if it's the default volume or another one.
 */
if (!*subvol_name) {
-	*subvol_name = kstrdup(".", GFP_KERNEL);
+	*subvol_name = kstrdup(".", GFP_NOFS);
if (!*subvol_name)
return -ENOMEM;
}


dave


Re: [PATCH V5 2/2] btrfs: implement delayed inode items operation

2011-03-24 Thread David Sterba
Hi,

there's one thing I want to bring up. It's not related to the delayed
functionality itself but to the git tree base of the patch.

There's a merge conflict when your patch is applied directly onto
Linus' tree, and not when on Chris' one.

On Thu, Mar 24, 2011 at 07:41:31PM +0800, Miao Xie wrote:
...
 --- a/fs/btrfs/inode.c
 +++ b/fs/btrfs/inode.c
 @@ -6726,6 +6775,9 @@ void btrfs_destroy_inode(struct inode *inode)
   inode_tree_del(inode);
   btrfs_drop_extent_cache(inode, 0, (u64)-1, 0);
  free:
 + ret = btrfs_remove_delayed_node(inode);
 + BUG_ON(ret);
 +
   kmem_cache_free(btrfs_inode_cachep, BTRFS_I(inode));
  }
  

the call to kmem_cache_free has been replaced by

commit fa0d7e3de6d6fc5004ad9dea0dd6b286af8f03e9
Author: Nick Piggin npig...@kernel.dk
Date:   Fri Jan 7 17:49:49 2011 +1100
fs: icache RCU free inodes

relevant hunk:

--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6495,6 +6495,13 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
return inode;
 }

+static void btrfs_i_callback(struct rcu_head *head)
+{
+   struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+   kmem_cache_free(btrfs_inode_cachep, BTRFS_I(inode));
+}
+
 void btrfs_destroy_inode(struct inode *inode)
 {
struct btrfs_ordered_extent *ordered;
@@ -6564,7 +6571,7 @@ void btrfs_destroy_inode(struct inode *inode)
inode_tree_del(inode);
btrfs_drop_extent_cache(inode, 0, (u64)-1, 0);
 free:
-   kmem_cache_free(btrfs_inode_cachep, BTRFS_I(inode));
+	call_rcu(&inode->i_rcu, btrfs_i_callback);
 }


I don't think this disqualifies all the testing already done, but maybe it's
time to rebase btrfs-unstable.git onto .38.

Chris?


dave


Re: [PATCH V5 1/2] btrfs: use GFP_NOFS instead of GFP_KERNEL

2011-03-24 Thread Miao Xie
On Fri, 25 Mar 2011 00:07:59 +0100, David Sterba wrote:
 On Thu, Mar 24, 2011 at 07:41:21PM +0800, Miao Xie wrote:
 In the filesystem context, we must allocate memory by GFP_NOFS,
 or we may start another filesystem operation and make kswap thread hang up.
 
 indeed. Did you check for other GFP_KERNEL allocations? I've found 8 more
 of them and at least these look like candidates for GFP_NOFS too:

I just fixed the ones which should use GFP_NOFS.

I think not all of the GFP_KERNEL allocations are wrong. If we don't hold
any btrfs lock except the relevant inode's i_mutex and are not in the
context of a transaction, we can use GFP_KERNEL. So the following
GFP_KERNEL allocations are right, I think.

Thanks
Miao

 
 diff --git a/fs/btrfs/acl.c b/fs/btrfs/acl.c
 index de34bfa..76b9218 100644
 --- a/fs/btrfs/acl.c
 +++ b/fs/btrfs/acl.c
 @@ -289,7 +289,7 @@ int btrfs_acl_chmod(struct inode *inode)
 if (IS_ERR(acl) || !acl)
 return PTR_ERR(acl);
 
 -   clone = posix_acl_clone(acl, GFP_KERNEL);
 +   clone = posix_acl_clone(acl, GFP_NOFS);
 posix_acl_release(acl);
 if (!clone)
 return -ENOMEM;
 diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
 index f447b78..eb5c01d 100644
 --- a/fs/btrfs/file.c
 +++ b/fs/btrfs/file.c
 @@ -998,7 +998,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
 nrptrs = min((iov_iter_count(i) + PAGE_CACHE_SIZE - 1) /
  PAGE_CACHE_SIZE, PAGE_CACHE_SIZE /
  (sizeof(struct page *)));
 -   pages = kmalloc(nrptrs * sizeof(struct page *), GFP_KERNEL);
 +   pages = kmalloc(nrptrs * sizeof(struct page *), GFP_NOFS);
 if (!pages) {
 ret = -ENOMEM;
 goto out;
 diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
 index d1bace3..e9b9648 100644
 --- a/fs/btrfs/ioctl.c
 +++ b/fs/btrfs/ioctl.c
 @@ -1644,7 +1644,7 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
 goto out;
 }
 
 -   range = kzalloc(sizeof(*range), GFP_KERNEL);
 +   range = kzalloc(sizeof(*range), GFP_NOFS);
 if (!range) {
 ret = -ENOMEM;
 goto out;
 diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
 index d39a989..5e0fff7 100644
 --- a/fs/btrfs/super.c
 +++ b/fs/btrfs/super.c
 @@ -399,7 +399,7 @@ static int btrfs_parse_early_options(const char *options, fmode_t flags,
  * strsep changes the string, duplicate it because parse_options
  * gets called twice
  */
 -   opts = kstrdup(options, GFP_KERNEL);
 +   opts = kstrdup(options, GFP_NOFS);
 if (!opts)
 return -ENOMEM;
 orig = opts;
 @@ -446,7 +446,7 @@ static int btrfs_parse_early_options(const char *options, fmode_t flags,
  * mount path doesn't care if it's the default volume or another one.
  */
 if (!*subvol_name) {
 -	*subvol_name = kstrdup(".", GFP_KERNEL);
 +	*subvol_name = kstrdup(".", GFP_NOFS);
 if (!*subvol_name)
 return -ENOMEM;
 }
 
 
 dave
 



Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-24 Thread Steven Rostedt
On Thu, Mar 24, 2011 at 09:18:16AM +0100, Ingo Molnar wrote:
 
 * Tejun Heo t...@kernel.org wrote:
 
  NOT-Signed-off-by: Tejun Heo t...@kernel.org
 
 s/NOT-// ?
 

Perhaps because it is still in RFC context?

-- Steve



Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-24 Thread Steven Rostedt
On Thu, Mar 24, 2011 at 10:41:51AM +0100, Tejun Heo wrote:
 Adaptive owner spinning used to be applied only to mutex_lock().  This
 patch applies it also to mutex_trylock().
 
 btrfs has developed custom locking to avoid excessive context switches
 in its btree implementation.  Generally, doing away with the custom
 implementation and just using the mutex shows better behavior;
 however, there's an interesting distinction in the custom implementation
 of trylock.  It distinguishes between simple trylock and tryspin,
 where the former just tries once and then fails while the latter does
 some spinning before giving up.
 
 Currently, mutex_trylock() doesn't use adaptive spinning.  It tries
 just once.  I got curious whether using adaptive spinning on
 mutex_trylock() would be beneficial and it seems so, for btrfs anyway.
 
 The following results are from dbench 50 run on an opteron two
 socket eight core machine with 4GiB of memory and an OCZ vertex SSD.
 During the run, disk stays mostly idle and all CPUs are fully occupied
 and the difference in locking performance becomes quite visible.
 
 SIMPLE is with the locking simplification patch[1] applied.  i.e. it
 basically just uses mutex.  SPIN is with this patch applied on top -
 mutex_trylock() uses adaptive spinning.
 
 USER   SYSTEM   SIRQCXTSW  THROUGHPUT
  SIMPLE 61107  354977217  8099529  845.100 MB/sec
  SPIN   63140  364888214  6840527  879.077 MB/sec
 
 On various runs, the adaptive spinning trylock consistently posts
 higher throughput.  The amount of difference varies but it outperforms
 consistently.
 
 In general, using adaptive spinning on trylock makes sense as trylock
 failure usually leads to costly unlock-relock sequence.
 
 [1] http://article.gmane.org/gmane.comp.file-systems.btrfs/9658
 
 Signed-off-by: Tejun Heo t...@kernel.org

I'm curious about the effects that this has on those places that do:

again:
	mutex_lock(A);
	if (!mutex_trylock(B)) {
		mutex_unlock(A);
		goto again;
	}

Where the normal locking order is:
 B -> A

If another location does:

mutex_lock(B);
[...]
mutex_lock(A);

But another process has A already, and is running, it may spin waiting
for A as A's owner is still running.

But now, mutex_trylock(B) becomes a spinner too, and since B's owner
is running (spinning on A) it will spin as well, waiting for B's owner
to release it. Unfortunately, B's owner is also spinning, waiting for
A's owner to release A.

If both A and B's owners are real time tasks, then boom! deadlock.

-- Steve



Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-24 Thread Linus Torvalds
On Thu, Mar 24, 2011 at 8:39 PM, Steven Rostedt rost...@goodmis.org wrote:

 But now, mutex_trylock(B) becomes a spinner too, and since the B's owner
 is running (spinning on A) it will spin as well waiting for A's owner to
 release it. Unfortunately, A's owner is also spinning waiting for B to
 release it.

 If both A and B's owners are real time tasks, then boom! deadlock.

Hmm. I think you're right. And it looks pretty fundamental - I don't
see any reasonable approach to avoid it.

I think the RT issue is a red herring too - afaik, you can get a
deadlock with two perfectly normal processes too. Of course, for
non-RT tasks, any other process will eventually disturb the situation
and you'd get kicked out due to need_resched(), but even that might be
avoided for a long time if there are other CPUs - leading to tons of
wasted CPU time.

   Linus