Re: [PATCH] Btrfs: return failure if btrfs_dev_replace_finishing() failed

2014-10-10 Thread Eryu Guan
On Thu, Sep 25, 2014 at 06:28:14PM +0800, Eryu Guan wrote:
 device replace could fail due to another running scrub process, but this
 failure doesn't get returned to userspace.
 
 The following steps could reproduce this issue
 
   mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
   mount /dev/sdb1 /mnt/btrfs
   while true; do
     btrfs scrub start -B /mnt/btrfs > /dev/null 2>&1
   done &
   btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
   # if this replace succeeded, do the following and repeat until
   # you see this log in dmesg
   # BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
   #btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs
 
   # once you see the error log in dmesg, check return value of
   # replace
   echo $?
 
 Also only WARN_ON if the return code is not -EINPROGRESS.
 
 Signed-off-by: Eryu Guan guane...@gmail.com

Ping, any comments on this patch?

Thanks,
Eryu
 ---
  fs/btrfs/dev-replace.c | 8 +++++---
  1 file changed, 5 insertions(+), 3 deletions(-)
 
 diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
 index eea26e1..44d32ab 100644
 --- a/fs/btrfs/dev-replace.c
 +++ b/fs/btrfs/dev-replace.c
 @@ -418,9 +418,11 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
        &dev_replace->scrub_progress, 0, 1);
  
   ret = btrfs_dev_replace_finishing(root->fs_info, ret);
 - WARN_ON(ret);
 + /* don't warn if EINPROGRESS, someone else might be running scrub */
 + if (ret != -EINPROGRESS)
 +     WARN_ON(ret);
  
 - return 0;
 + return ret;
  
  leave:
   dev_replace->srcdev = NULL;
 @@ -538,7 +540,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
   btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
   mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
  
 - return 0;
 + return scrub_ret;
   }
  
   printk_in_rcu(KERN_INFO
 -- 
 1.8.3.1
 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs-progs: force overwrite should wipe stale SB

2014-10-10 Thread David Sterba
On Thu, Oct 02, 2014 at 07:22:09AM +0800, Anand Jain wrote:
 (I am unable to reproduce the issue, tried to go back with progs versions
 but still the same. So as of now this code remains untested, suggest to
 wait till we have a reproducible test case).
 
 Here is a test case which says it all..
 
 mkfs.xfs -f $DEV
 mkfs.btrfs -f $DEV
 mount $DEV $MNT
 mount: /dev/vdiskc: more filesystems detected. This should not happen,
use -t type to explicitly specify the filesystem type or
use wipefs(8) to clean up the device.
 
 mount: you must specify the filesystem type
 
 with this patch btrfs_prepare_device() also wipes old FS if any,
 btrfs_prepare_device() is called after we have verified that
 user has provided -f option.
 
 v2: to satisfy the backward compatibility issue, replace
 blkid_do_wipe() with local wipe function.

Thank you, works for me, added to 3.17 queue.


Re: [PATCH v2] btrfs-progs: force overwrite should wipe stale SB

2014-10-10 Thread Anand Jain



 Thanks.
Anand

On 10/10/2014 03:39 PM, David Sterba wrote:

On Thu, Oct 02, 2014 at 07:22:09AM +0800, Anand Jain wrote:

(I am unable to reproduce the issue, tried to go back with progs versions
but still the same. So as of now this code remains untested, suggest to
wait till we have a reproducible test case).

Here is a test case which says it all..

mkfs.xfs -f $DEV
mkfs.btrfs -f $DEV
mount $DEV $MNT
mount: /dev/vdiskc: more filesystems detected. This should not happen,
use -t type to explicitly specify the filesystem type or
use wipefs(8) to clean up the device.

mount: you must specify the filesystem type

with this patch btrfs_prepare_device() also wipes old FS if any,
btrfs_prepare_device() is called after we have verified that
user has provided -f option.

v2: to satisfy the backward compatibility issue, replace
blkid_do_wipe() with local wipe function.


Thank you, works for me, added to 3.17 queue.


[PATCH 2/2 v2] Btrfs: add helper btrfs_fdatawrite_range

2014-10-10 Thread Filipe Manana
Add a helper to avoid duplicating, at every call site, the double
filemap_fdatawrite_range() call needed for inodes with async extents
(compressed writes).

Signed-off-by: Filipe Manana fdman...@suse.com
---

V2: Pass right arguments to the new helper. Missed unstaged changes.
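
For readers who want to see the call pattern in isolation, here is a minimal userspace sketch of what the helper does. It is not kernel code: struct mock_inode, mock_fdatawrite_range() and the fail_on_call knob are hypothetical stand-ins for the inode, filemap_fdatawrite_range() and a write-out failure, used only to show that the second call is issued when the first one succeeded and the async-extent flag is set.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the inode state the helper looks at. */
struct mock_inode {
	bool has_async_extent; /* models BTRFS_INODE_HAS_ASYNC_EXTENT */
	int writes_started;    /* how many times write-out was kicked off */
	int fail_on_call;      /* 0 = never fail, N = fail on the Nth call */
};

/* Mock of filemap_fdatawrite_range(): returns 0 or a negative errno. */
int mock_fdatawrite_range(struct mock_inode *inode, long start, long end)
{
	assert(inode && start <= end);
	inode->writes_started++;
	if (inode->fail_on_call && inode->writes_started == inode->fail_on_call)
		return -EIO;
	return 0;
}

/*
 * Same shape as the helper in the patch: start write-out once, and
 * start it a second time only if the first call succeeded and the
 * inode has async (compressed) extents.
 */
int mock_btrfs_fdatawrite_range(struct mock_inode *inode, long start, long end)
{
	int ret;

	ret = mock_fdatawrite_range(inode, start, end);
	if (!ret && inode->has_async_extent)
		ret = mock_fdatawrite_range(inode, start, end);

	return ret;
}
```

With this shape, an inode without async extents sees one write-out call, an inode with async extents sees two, and an error from the first call is returned without a second attempt.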

 fs/btrfs/ctree.h|  1 +
 fs/btrfs/file.c | 39 ++-
 fs/btrfs/inode.c|  9 +
 fs/btrfs/ordered-data.c | 24 ++--
 4 files changed, 34 insertions(+), 39 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 089f6da..4e0ad8c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3896,6 +3896,7 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
  struct page **pages, size_t num_pages,
  loff_t pos, size_t write_bytes,
  struct extent_state **cached);
+int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 82c7229..bbd474b 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1676,6 +1676,7 @@ static ssize_t __btrfs_direct_write(struct kiocb *iocb,
 				    loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
+	struct inode *inode = file_inode(file);
 	ssize_t written;
 	ssize_t written_buffered;
 	loff_t endbyte;
@@ -1697,13 +1698,10 @@ static ssize_t __btrfs_direct_write(struct kiocb *iocb,
 	 * able to read what was just written.
 	 */
 	endbyte = pos + written_buffered - 1;
-	err = filemap_fdatawrite_range(file->f_mapping, pos, endbyte);
-	if (!err && test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
-			     &BTRFS_I(file_inode(file))->runtime_flags))
-		err = filemap_fdatawrite_range(file->f_mapping, pos, endbyte);
+	err = btrfs_fdatawrite_range(inode, pos, endbyte);
 	if (err)
 		goto out;
-	err = filemap_fdatawait_range(file->f_mapping, pos, endbyte);
+	err = filemap_fdatawait_range(inode->i_mapping, pos, endbyte);
 	if (err)
 		goto out;
 	written += written_buffered;
@@ -1864,10 +1862,7 @@ static int start_ordered_ops(struct inode *inode, loff_t start, loff_t end)
 	int ret;
 
 	atomic_inc(&BTRFS_I(inode)->sync_writers);
-	ret = filemap_fdatawrite_range(inode->i_mapping, start, end);
-	if (!ret && test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
-			     &BTRFS_I(inode)->runtime_flags))
-		ret = filemap_fdatawrite_range(inode->i_mapping, start, end);
+	ret = btrfs_fdatawrite_range(inode, start, end);
 	atomic_dec(&BTRFS_I(inode)->sync_writers);
 
 	return ret;
@@ -2820,3 +2815,29 @@ int btrfs_auto_defrag_init(void)
 
return 0;
 }
+
+int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end)
+{
+	int ret;
+
+	/*
+	 * So with compression we will find and lock a dirty page and clear the
+	 * first one as dirty, setup an async extent, and immediately return
+	 * with the entire range locked but with nobody actually marked with
+	 * writeback.  So we can't just filemap_write_and_wait_range() and
+	 * expect it to work since it will just kick off a thread to do the
+	 * actual work.  So we need to call filemap_fdatawrite_range _again_
+	 * since it will wait on the page lock, which won't be unlocked until
+	 * after the pages have been marked as writeback and so we're good to go
+	 * from there.  We have to do this otherwise we'll miss the ordered
+	 * extents and that results in badness.  Please Josef, do not think you
+	 * know better and pull this out at some point in the future, it is
+	 * right and you are wrong.
+	 */
+	ret = filemap_fdatawrite_range(inode->i_mapping, start, end);
+	if (!ret && test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
+			     &BTRFS_I(inode)->runtime_flags))
+		ret = filemap_fdatawrite_range(inode->i_mapping, start, end);
+
+	return ret;
+}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 752ff18..be955481 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7052,14 +7052,7 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
 				btrfs_put_ordered_extent(ordered);
 			} else {
 				/* Screw you mmap */
-				ret = filemap_fdatawrite_range(inode->i_mapping,
-							       lockstart,
-							       lockend);
-				if (!ret && test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
-						     &BTRFS_I(inode)->runtime_flags))
-					ret = filemap_fdatawrite_range(inode->i_mapping,
-  

Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map

2014-10-10 Thread Filipe David Manana
On Fri, Oct 10, 2014 at 3:39 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:

  Original Message 
 Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert
 best fitted extent map
 From: Filipe David Manana fdman...@gmail.com
 To: Qu Wenruo quwen...@cn.fujitsu.com
 Date: 2014-10-09 18:27

 On Thu, Oct 9, 2014 at 1:28 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:

  Original Message 
 Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to
 insert
 best fitted extent map
 From: Filipe David Manana fdman...@gmail.com
 To: Qu Wenruo quwen...@cn.fujitsu.com
 Date: 2014-10-08 20:08

 On Fri, Sep 19, 2014 at 1:31 AM, Qu Wenruo quwen...@cn.fujitsu.com
 wrote:

  Original Message 
 Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to
 insert
 best fitted extent map
 From: Filipe David Manana fdman...@gmail.com
 To: Qu Wenruo quwen...@cn.fujitsu.com
 Date: 2014-09-18 21:16

 On Wed, Sep 17, 2014 at 4:53 AM, Qu Wenruo quwen...@cn.fujitsu.com
 wrote:

 The following commit enhanced merge_extent_mapping() to reduce
 fragmentation in the extent map tree, but it can't handle the case where
 the existing extent map lies before map_start:
 51f39 btrfs: Use right extent length when inserting overlap extent
 map.

 [BUG]
 When the existing extent map's start is before map_start,
 em->len becomes negative, which corrupts the extent map and makes
 inserting the new extent map fail.
 This happens when someone gets a large extent map but, by the time it is
 about to be inserted into the extent map tree, someone else has already
 committed some writes and split the huge extent into small parts.
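
 A minimal userspace sketch of why a length can go "negative" in this
 situation. It assumes the length is derived by subtracting map_start from
 the existing map's start in an unsigned 64-bit type; the exact kernel
 expression is not quoted in this thread, and the function names below are
 hypothetical. With unsigned arithmetic the result does not become negative,
 it wraps around to a huge value, which is the kind of bogus length that
 would corrupt an extent map.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Unchecked length: wraps around to a huge value when
 * existing_start < map_start (the case described under [BUG]).
 */
uint64_t len_unchecked(uint64_t existing_start, uint64_t map_start)
{
	return existing_start - map_start;
}

/*
 * Checked variant: refuses to compute a length when the existing
 * extent map starts before map_start. Returns 0 on success and -1
 * when the subtraction would wrap.
 */
int len_checked(uint64_t existing_start, uint64_t map_start, uint64_t *len)
{
	assert(len);
	if (existing_start < map_start)
		return -1;
	*len = existing_start - map_start;
	return 0;
}
```

 For example, len_unchecked(4096, 8192) yields UINT64_MAX - 4095 rather than
 a small number, while len_checked() reports the condition to the caller.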

 This sounds very deterministic to me.
 Any reason not to add tests to the sanity tests that exercise
 this case (or these cases)?

 Yes, thanks for pointing that out.
 I will add a test case for it soon.

 Hi Qu,

 Any progress on the test?

 This is a very important one IMHO, not only because of the bad
 consequences of the bug (extent map corruption, leading to all sorts
 of chaos), but also because this problem was not found by the full
 xfstests suite on several developer machines.

 thanks

 Still trying to reproduce it under the xfstests framework.

 That's the problem, it apparently wasn't reproducible (or detectable at
 least) by anyone with xfstests.

 I'll try to build a C program that behaves the same as filebench and see if
 it works.
 At least with filebench, it can be triggered within 60s and reproduces
 100% of the time.


 But even following the FileBench randomrw behavior (1 thread doing random
 reads, 1 thread doing random writes on preallocated space),
 I still failed to reproduce it.

 Still investigating how to reproduce it.
 Worst case may be adding a new C program into src of xfstests?

 How about the sanity tests (fs/btrfs/tests/*.c)? Create an empty map
 tree, add some extent maps, then try to merge some new extent maps
 that used to fail before this fix. Seems simple, no?

 thanks Qu

 It needs concurrent reads and writes (commits) to trigger it. I am not sure
 it can be reproduced in the sanity tests,
 since they don't seem to commit anything and lack multithreading facilities.

Hum?
Why does concurrency or persistence matter?

Let's review the problem.
So you fixed the function inode.c:merge_extent_mapping(). That
function merges a new extent map (not in the extent map tree) with an
existing extent map (which is in the tree).
The issue was that the merge was incorrect for some cases - producing
a bad extent map (compared to the rest of the existing extent maps)
that either overlaps existing ones or introduces incorrect gaps, etc -
doesn't really matter the reason.
Now, this function is run while holding the write lock of the inode's
extent map tree.
So why does concurrency (or persistence) matter here?

Why can't we have a sanity test that simply reproduces a scenario
where immediately after attempting to merge extent maps, we get an
(in-memory) extent map that is incorrect?

Also, you don't need to go to such great lengths as writing a C
program that mimics filebench's behaviour.
The issue can be reproduced from user space without file write and
read concurrency as well, using only common tools like fallocate (or
xfs_io), dd and filefrag. See the thread at:
http://www.spinics.net/lists/linux-btrfs/msg38045.html

thanks


 I'll give it a try on the filebench-behavior C program first, and then
 sanity tests if former doesn't work at all




 Thanks,

 Qu



 Thanks,
 Qu

 Thanks,
 Qu

 Thanks

 [REPRODUCER]
 It is very easy to trigger using filebench with the randomrw personality.
 It reproduces nearly 100% of the time when using an 8G preallocated file
 in a 60s randomrw test.

 [FIX]
 This patch can now handle any existing extent position.
 Since it does not directly use existing->start, now it will find the
 previous and next extent around map_start.
 So the old existing->start < map_start bug will never happen again.

 [ENHANCE]
 This patch will insert the best fitted extent map into extent map
 tree,
 other than the 

Re: [PATCH 1/2] btrfs-progs: don't report internal dev replace result if ioctl failed

2014-10-10 Thread David Sterba
On Wed, Oct 08, 2014 at 05:42:28PM +0800, Eryu Guan wrote:
 If BTRFS_IOC_DEV_REPLACE ioctl failed, there's no result returned to
 fill args.result, it doesn't make sense to report this internal result
 to user.
 
 And the arg has been initialized with 0, the result is always 0, which
 is BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR, and the resulting error
 message looks confusing too:
 
 ERROR: ioctl(DEV_REPLACE_START) failed on /mnt/btrfs: No such file or 
 directory, no error
 
 So just skip the internal dev replace result if the whole ioctl failed.

The 'no error' is confusing there, but I'm afraid we're losing some
information if the secondary result is completely dropped. How about
initializing the replace result with, e.g., -1 and then printing an empty
string from replace_dev_result2string instead?
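
A rough userspace sketch of that idea, to make the suggestion concrete. Everything prefixed sketch_/SKETCH_ is hypothetical: only "0 means no error" and the -1 sentinel come from this thread, the "not started" entry is purely illustrative, and the real replace_dev_result2string() may look different.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical result codes; -1 is the sentinel set before the ioctl. */
enum sketch_replace_result {
	SKETCH_RESULT_NO_RESULT = -1,
	SKETCH_RESULT_NO_ERROR = 0,
	SKETCH_RESULT_NOT_STARTED = 1, /* illustrative only */
};

/* Map a result code to text; the sentinel maps to an empty string. */
const char *sketch_result2string(long long result)
{
	switch (result) {
	case SKETCH_RESULT_NO_RESULT:
		return "";
	case SKETCH_RESULT_NO_ERROR:
		return "no error";
	case SKETCH_RESULT_NOT_STARTED:
		return "not started";
	default:
		return "<illegal result value>";
	}
}

/*
 * Build the error line. The secondary result is appended only when the
 * kernel actually overwrote the sentinel, so a failed ioctl no longer
 * prints a trailing ", no error".
 */
int sketch_format_error(char *buf, size_t len, const char *path,
			const char *errstr, long long result)
{
	const char *detail = sketch_result2string(result);

	assert(buf && path && errstr);
	if (detail[0] == '\0')
		return snprintf(buf, len,
				"ERROR: ioctl(DEV_REPLACE_START) failed on \"%s\": %s",
				path, errstr);
	return snprintf(buf, len,
			"ERROR: ioctl(DEV_REPLACE_START) failed on \"%s\": %s, %s",
			path, errstr, detail);
}
```

The caller would set the result field to the sentinel before issuing the ioctl and pass whatever value is in the field afterwards to the formatter.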


Re: [PATCH] Btrfs: return failure if btrfs_dev_replace_finishing() failed

2014-10-10 Thread Miao Xie
On Fri, 10 Oct 2014 15:13:31 +0800, Eryu Guan wrote:
 On Thu, Sep 25, 2014 at 06:28:14PM +0800, Eryu Guan wrote:
 device replace could fail due to another running scrub process, but this
 failure doesn't get returned to userspace.

 The following steps could reproduce this issue

  mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
  mount /dev/sdb1 /mnt/btrfs
  while true; do
    btrfs scrub start -B /mnt/btrfs > /dev/null 2>&1
  done &
  btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
  # if this replace succeeded, do the following and repeat until
  # you see this log in dmesg
  # BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
  #btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs

  # once you see the error log in dmesg, check return value of
  # replace
  echo $?

 Also only WARN_ON if the return code is not -EINPROGRESS.

 Signed-off-by: Eryu Guan guane...@gmail.com
 
 Ping, any comments on this patch?
 
 Thanks,
 Eryu
 ---
  fs/btrfs/dev-replace.c | 8 +++++---
  1 file changed, 5 insertions(+), 3 deletions(-)

 diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
 index eea26e1..44d32ab 100644
 --- a/fs/btrfs/dev-replace.c
 +++ b/fs/btrfs/dev-replace.c
 @@ -418,9 +418,11 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
       &dev_replace->scrub_progress, 0, 1);
  
  ret = btrfs_dev_replace_finishing(root->fs_info, ret);
 -WARN_ON(ret);
 +/* don't warn if EINPROGRESS, someone else might be running scrub */
 +if (ret != -EINPROGRESS)
 +WARN_ON(ret);

picky comment

I prefer WARN_ON(ret && ret != -EINPROGRESS).
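
A small userspace check that the two spellings warn in exactly the same cases, so the choice is purely about style. This is only a sketch: the function names are hypothetical and WARN_ON() is modelled as a boolean "would warn" result rather than the kernel macro.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Form used in the patch: if (ret != -EINPROGRESS) WARN_ON(ret); */
bool should_warn_nested(int ret)
{
	if (ret != -EINPROGRESS)
		return ret != 0;
	return false;
}

/* Single-expression form: WARN_ON(ret && ret != -EINPROGRESS); */
bool should_warn_combined(int ret)
{
	return ret && ret != -EINPROGRESS;
}

/* Compare both forms over a few representative return values. */
bool conditions_agree(void)
{
	static const int samples[] = { 0, -EINPROGRESS, -EIO, -ENOMEM, 1 };
	size_t i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		if (should_warn_nested(samples[i]) !=
		    should_warn_combined(samples[i]))
			return false;
	}
	return true;
}
```

Both forms stay silent for 0 and -EINPROGRESS and warn for any other value.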

  
 -return 0;
 +return ret;

Here we will return -EINPROGRESS if scrub is running. I think it would be
better to assign some special number to args->result and then return 0, just
like in the case where the device replace is already running.

Thanks
Miao

  
  leave:
  dev_replace->srcdev = NULL;
 @@ -538,7 +540,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
  btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
  mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
  
 -return 0;
 +return scrub_ret;
  }
  
  printk_in_rcu(KERN_INFO
 -- 
 1.8.3.1



Re: [PATCH] Btrfs-progs: repair missing dir index

2014-10-10 Thread David Sterba
On Thu, Oct 02, 2014 at 02:50:39PM -0400, Josef Bacik wrote:
 +static int repair_inode_backrefs(struct btrfs_root *root,
 +  struct inode_record *rec,
 +  struct cache_tree *inode_cache)
 +{
 + struct btrfs_trans_handle *trans;
 + struct btrfs_path *path;
 + struct inode_backref *tmp, *backref;
 + u64 root_dirid = btrfs_root_dirid(&root->root_item);
 + int ret;

Changed to ret = 1,

 + list_for_each_entry_safe(backref, tmp, &rec->backrefs, list) {
[...]
 + }

gcc prints a warning about a possibly uninitialized ret here:

 + BUG_ON(repaired && ret); /* Poor mans transaction abort */

So it would abort in case the list_for_each loop does not execute for
some reason.
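
A minimal userspace sketch of the pattern behind that warning, assuming nothing about the real function beyond what is quoted above: ret is only assigned inside the loop, so for an empty list it must already hold a defined value when it is read afterwards. The list type and process_all() are hypothetical; the initial value 1 mirrors the change mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical singly linked list standing in for rec->backrefs. */
struct node {
	int value;
	struct node *next;
};

/*
 * Walk the list and report the status of the last step. Without the
 * initializer, ret would be read uninitialized whenever head is NULL,
 * which is what the compiler warning is about.
 */
int process_all(const struct node *head, int *repaired)
{
	int ret = 1; /* defined even when the loop body never runs */

	assert(repaired);
	*repaired = 0;
	for (const struct node *n = head; n; n = n->next) {
		ret = (n->value < 0) ? -1 : 0;
		if (ret)
			break;
		(*repaired)++;
	}
	return ret;
}
```

With an empty list this returns 1 and leaves *repaired at 0, so a later check of the form (repaired && ret) operates on well-defined values.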


[PATCH 1/2 v2] Btrfs: report error after failure inlining extent in compressed write path

2014-10-10 Thread Filipe Manana
If cow_file_range_inline() failed when called from compress_file_range(),
we were tagging the locked page for writeback, ending its writeback and
unlocking it, but neither marking it with an error nor setting AS_EIO in the
inode's mapping flags.

This made it impossible for a caller of filemap_fdatawrite_range (writepages)
or filemap_fdatawait_range() to know that an error happened. And the return
value of compress_file_range() is useless because it's returned to a workqueue
task and not to the task calling filemap_fdatawrite_range (writepages).

This change applies on top of the previous patchset starting at the patch
titled:

[1/5] Btrfs: set page and mapping error on compressed write failure

Which changed extent_clear_unlock_delalloc() to use SetPageError and
mapping_set_error().

Signed-off-by: Filipe Manana fdman...@suse.com
---

V2: Use SET_PAGE_ERROR only if ret < 0, obviously. Thanks btrfs/056.
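
A small userspace sketch of the flag selection this patch adds, for illustration only. The SK_ bit values are hypothetical (only the flag names come from the patch), and page_ops_for() is a made-up wrapper around the expression; it shows that the error bit is set for a negative return value and left clear when the inline extent was created successfully.

```c
#include <assert.h>

/* Hypothetical bit values; only the names mirror the patch. */
#define SK_PAGE_UNLOCK        (1UL << 0)
#define SK_PAGE_CLEAR_DIRTY   (1UL << 1)
#define SK_PAGE_SET_WRITEBACK (1UL << 2)
#define SK_PAGE_END_WRITEBACK (1UL << 3)
#define SK_PAGE_SET_ERROR     (1UL << 4)

/*
 * Compute the page operations for the "inline extent creation worked
 * or returned error" branch, which is only entered when ret <= 0.
 */
unsigned long page_ops_for(int ret)
{
	unsigned long page_error_op;

	assert(ret <= 0);
	page_error_op = ret < 0 ? SK_PAGE_SET_ERROR : 0;

	return SK_PAGE_UNLOCK | SK_PAGE_CLEAR_DIRTY |
	       SK_PAGE_SET_WRITEBACK | page_error_op |
	       SK_PAGE_END_WRITEBACK;
}
```

Setting the error bit unconditionally would also flag pages whose inline extent was written correctly, which is the mistake the V2 note refers to.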

 fs/btrfs/inode.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7635b1d..2b09425 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -527,7 +527,10 @@ cont:
 		if (ret <= 0) {
 			unsigned long clear_flags = EXTENT_DELALLOC |
 				EXTENT_DEFRAG;
+			unsigned long page_error_op;
+
 			clear_flags |= (ret < 0) ? EXTENT_DO_ACCOUNTING : 0;
+			page_error_op = ret < 0 ? PAGE_SET_ERROR : 0;
 
 			/*
 			 * inline extent creation worked or returned error,
@@ -538,6 +541,7 @@ cont:
 						     clear_flags, PAGE_UNLOCK |
 						     PAGE_CLEAR_DIRTY |
 						     PAGE_SET_WRITEBACK |
+						     page_error_op |
 						     PAGE_END_WRITEBACK);
 			goto free_pages_out;
 		}
-- 
1.9.1



Re: [PATCH] Btrfs-progs: repair missing dir index

2014-10-10 Thread Josef Bacik
I'm redoing this patch a bit so don't take it yet. Thanks,

Josef

David Sterba dste...@suse.cz wrote:


On Thu, Oct 02, 2014 at 02:50:39PM -0400, Josef Bacik wrote:
 +static int repair_inode_backrefs(struct btrfs_root *root,
 +  struct inode_record *rec,
 +  struct cache_tree *inode_cache)
 +{
 + struct btrfs_trans_handle *trans;
 + struct btrfs_path *path;
 + struct inode_backref *tmp, *backref;
 + u64 root_dirid = btrfs_root_dirid(&root->root_item);
 + int ret;

Changed to ret = 1,

 + list_for_each_entry_safe(backref, tmp, &rec->backrefs, list) {
[...]
 + }

gcc prints a warning about a possibly uninitialized ret here:

 + BUG_ON(repaired && ret); /* Poor mans transaction abort */

So it would abort in case the list_for_each loop does not execute for
some reason.


Re: What is the vision for btrfs fs repair?

2014-10-10 Thread Bob Marley

On 10/10/2014 03:58, Chris Murphy wrote:



* mount -o recovery
Enable autorecovery attempts if a bad tree root is found at mount 
time.

I'm confused why it's not the default yet. Maybe it's continuing to evolve at a 
pace that suggests something could sneak in that makes things worse? It is 
almost an oxymoron in that I'm manually enabling an autorecovery

If true, maybe the closest indication we'd get of btrfs stability is the default 
enabling of autorecovery.


No way!
I wouldn't want a default like that.

Think of distributed transactions: suppose a sync was issued on 
both sides of a distributed transaction, then power was lost on one 
side, and then btrfs had corruption. When I remount it, the worst 
thing that can possibly happen is that it automatically rolls back to a 
previous known-good state.


Now if I can express wishes:

I would like an option that spits out all the usable tree roots (or 
what's the name, superblocks?) and not just the newest one which is 
corrupt. And then another option that lets me mount *readonly* starting 
from the tree root I specify. So I can check how much of the data is 
still there. After I decide that such tree root is good, I need another 
option that lets me mount with such tree root in readwrite mode, and 
obviously eliminating all tree roots newer than that.
Some time ago I read that mounting the filesystem with an earlier tree 
root was possible, but only by manually erasing the disk regions in 
which the newer superblocks are. This is crazy, it's too risky on too 
many levels, and also as I wrote I want to check what data is available 
on a certain tree root before mounting readwrite from that one.





Re: What is the vision for btrfs fs repair?

2014-10-10 Thread Roman Mamedov
On Fri, 10 Oct 2014 12:53:38 +0200
Bob Marley bobmar...@shiftmail.org wrote:

 On 10/10/2014 03:58, Chris Murphy wrote:
 
  * mount -o recovery
 Enable autorecovery attempts if a bad tree root is found at mount 
  time.
  I'm confused why it's not the default yet. Maybe it's continuing to evolve 
  at a pace that suggests something could sneak in that makes things worse? 
  It is almost an oxymoron in that I'm manually enabling an autorecovery
 
  If true, maybe the closest indication we'd get of btrfs stability is the 
  default enabling of autorecovery.
 
 No way!
 I wouldn't want a default like that.
 
 If you think at distributed transactions: suppose a sync was issued on 
 both sides of a distributed transaction, then power was lost on one 
 side

What distributed transactions? Btrfs is not a clustered filesystem[1], it does
not support and likely will never support being mounted from multiple hosts at
the same time.

[1]http://en.wikipedia.org/wiki/Clustered_file_system

-- 
With respect,
Roman




Re: [PATCH 1/2] btrfs-progs: don't report internal dev replace result if ioctl failed

2014-10-10 Thread Eryu Guan
On Fri, Oct 10, 2014 at 10:20:23AM +0200, David Sterba wrote:
 On Wed, Oct 08, 2014 at 05:42:28PM +0800, Eryu Guan wrote:
  If BTRFS_IOC_DEV_REPLACE ioctl failed, there's no result returned to
  fill args.result, it doesn't make sense to report this internal result
  to user.
  
  And the arg has been initialized with 0, the result is always 0, which
  is BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR, and the resulting error
  message looks confusing too:
  
  ERROR: ioctl(DEV_REPLACE_START) failed on /mnt/btrfs: No such file or 
  directory, no error
  
  So just skip the internal dev replace result if the whole ioctl failed.
 
 The 'no error' is confusing there, but I'm afraid we're losing some
 information if the secondary result is completely dropped. How about
 initializing the replace result with, e.g., -1 and then printing an empty
 string from replace_dev_result2string instead?

I assume the secondary result won't be filled in at all if the ioctl
itself failed for some reason (indicated by errno). But yes, it's
relatively safer to check the replace result instead of dropping it
completely.

I'll send v2 out.

Thanks for your review!

Eryu


Re: What is the vision for btrfs fs repair?

2014-10-10 Thread Bob Marley

On 10/10/2014 12:59, Roman Mamedov wrote:

On Fri, 10 Oct 2014 12:53:38 +0200
Bob Marley bobmar...@shiftmail.org wrote:


On 10/10/2014 03:58, Chris Murphy wrote:

* mount -o recovery
Enable autorecovery attempts if a bad tree root is found at mount 
time.

I'm confused why it's not the default yet. Maybe it's continuing to evolve at a 
pace that suggests something could sneak in that makes things worse? It is 
almost an oxymoron in that I'm manually enabling an autorecovery

If true, maybe the closest indication we'd get of btrfs stability is the default 
enabling of autorecovery.

No way!
I wouldn't want a default like that.

If you think at distributed transactions: suppose a sync was issued on
both sides of a distributed transaction, then power was lost on one
side

What distributed transactions? Btrfs is not a clustered filesystem[1], it does
not support and likely will never support being mounted from multiple hosts at
the same time.

[1]http://en.wikipedia.org/wiki/Clustered_file_system



This is not the only way to do a distributed transaction.
Databases can be hosted on the filesystem, and those can do distributed 
transactions.
Think of two bank accounts, one in a database on btrfs fs1 here, and the 
other in a database on whatever filesystem in another country. You 
want to debit one account and credit the other one: the filesystems on 
the two sides *must not roll back their state* !! (especially not 
transparently without human intervention)




Re: Confusion with newly converted filesystem

2014-10-10 Thread Tim Cuthbertson
Thank you very much, Duncan and Chris. Especially to Duncan, for his
detailed reply and further suggestions. I will try to follow your
advice as best I can. I am old, but I'm still learning!

Tim


[PATCH v2 1/2] btrfs-progs: only report internal dev replace result if there's a result

2014-10-10 Thread Eryu Guan
If BTRFS_IOC_DEV_REPLACE ioctl failed, args.result usually won't be
updated by the ioctl.

And the arg has been initialized with 0, the result is always 0, which
is BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR, and the resulting error
message looks confusing:

ERROR: ioctl(DEV_REPLACE_START) failed on /mnt/btrfs: No such file or 
directory, no error

But in case there's an internal result returned in future, don't drop
the result completely, instead print dev replace result message only
if the result is updated by a failed ioctl call.

Signed-off-by: Eryu Guan guane...@gmail.com
---

I didn't update replace_dev_result2string() function, which seems not
necessary here. And I personally still prefer v1 :)

v2:
- don't drop the result completely but print result if it's got updated

 cmds-replace.c | 44 
 ioctl.h|  1 +
 2 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/cmds-replace.c b/cmds-replace.c
index 9fe7ad8..bfcc161 100644
--- a/cmds-replace.c
+++ b/cmds-replace.c
@@ -184,12 +184,17 @@ static int cmd_start_replace(int argc, char **argv)
 
 	/* check for possible errors before backgrounding */
 	status_args.cmd = BTRFS_IOCTL_DEV_REPLACE_CMD_STATUS;
+	status_args.result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_RESULT;
 	ret = ioctl(fdmnt, BTRFS_IOC_DEV_REPLACE, &status_args);
 	if (ret) {
 		fprintf(stderr,
-			"ERROR: ioctl(DEV_REPLACE_STATUS) failed on \"%s\": %s, %s\n",
-			path, strerror(errno),
-			replace_dev_result2string(status_args.result));
+			"ERROR: ioctl(DEV_REPLACE_STATUS) failed on \"%s\": %s",
+			path, strerror(errno));
+		if (status_args.result != BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_RESULT)
+			fprintf(stderr, ", %s\n",
+				replace_dev_result2string(status_args.result));
+		else
+			fprintf(stderr, "\n");
 		goto leave_with_error;
 	}
 
@@ -301,13 +306,18 @@ static int cmd_start_replace(int argc, char **argv)
 	}
 
 	start_args.cmd = BTRFS_IOCTL_DEV_REPLACE_CMD_START;
+	start_args.result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_RESULT;
 	ret = ioctl(fdmnt, BTRFS_IOC_DEV_REPLACE, &start_args);
 	if (do_not_background) {
 		if (ret) {
 			fprintf(stderr,
-				"ERROR: ioctl(DEV_REPLACE_START) failed on \"%s\": %s, %s\n",
-				path, strerror(errno),
-				replace_dev_result2string(start_args.result));
+				"ERROR: ioctl(DEV_REPLACE_START) failed on \"%s\": %s",
+				path, strerror(errno));
+			if (start_args.result != BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_RESULT)
+				fprintf(stderr, ", %s\n",
+					replace_dev_result2string(start_args.result));
+			else
+				fprintf(stderr, "\n");
 
 			if (errno == EOPNOTSUPP)
 				fprintf(stderr,
@@ -402,11 +412,16 @@ static int print_replace_status(int fd, const char *path, int once)
 
 	for (;;) {
 		args.cmd = BTRFS_IOCTL_DEV_REPLACE_CMD_STATUS;
+		args.result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_RESULT;
 		ret = ioctl(fd, BTRFS_IOC_DEV_REPLACE, &args);
 		if (ret) {
-			fprintf(stderr, "ERROR: ioctl(DEV_REPLACE_STATUS) failed on \"%s\": %s, %s\n",
-				path, strerror(errno),
-				replace_dev_result2string(args.result));
+			fprintf(stderr, "ERROR: ioctl(DEV_REPLACE_STATUS) failed on \"%s\": %s",
+				path, strerror(errno));
+			if (args.result != BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_RESULT)
+				fprintf(stderr, ", %s\n",
+					replace_dev_result2string(args.result));
+			else
+				fprintf(stderr, "\n");
 			return ret;
 		}
 
@@ -550,13 +565,18 @@ static int cmd_cancel_replace(int argc, char **argv)
 	}
 
 	args.cmd = BTRFS_IOCTL_DEV_REPLACE_CMD_CANCEL;
+	args.result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_RESULT;
 	ret = ioctl(fd, BTRFS_IOC_DEV_REPLACE, &args);
 	e = errno;
 	close_file_or_dir(fd, dirstream);
 	if (ret) {
-		fprintf(stderr, "ERROR: ioctl(DEV_REPLACE_CANCEL) failed on \"%s\": %s, %s\n",
-			path, strerror(e),
-			replace_dev_result2string(args.result));
+		fprintf(stderr, "ERROR: ioctl(DEV_REPLACE_CANCEL) failed on \"%s\": %s",
+			path, strerror(e));
+   

Problems running Balance and checking integrity concurrently.

2014-10-10 Thread Wang Shilong
Hello,

I have reproduced a problem with the Btrfs integrity checker and balance on the
latest btrfs kernel. It is very easy to reproduce.

Mount with the "check_int" option and run the xfstests btrfs tests. The
following test reliably reproduces the problem:

./check tests/btrfs/014

[  455.122248] Written block @25800343552 (sdc/1210187776/0) found in hash table, M, bytenr mismatch (!= stored 20968505344).
[  455.122283] Written block @25800359936 (sdc/1210204160/1) found in hash table, M, bytenr mismatch (!= stored 20968521728).
[  455.122301] Written block @25800376320 (sdc/1210220544/1) found in hash table, M, bytenr mismatch (!= stored 20968538112).
[  455.122347] Referenced block @25800343552 (sdc/1344405504/2) found in hash table, M, bytenr mismatch (!= stored 20968505344).
[  455.122427] Referenced block @25800359936 (sdc/1344421888/2) found in hash table, M, bytenr mismatch (!= stored 20968521728).
[  455.122853] Written block @25800376320 (sdc/1344438272/0) found in hash table, M, bytenr mismatch (!= stored 20968538112).
[  455.126275] BTRFS critical (device sdc): unable to find logical 3452043264 len 4096
[  455.126491] btrfsic: btrfsic_map_block(root @3452043264) failed!
[  455.126496] BTRFS critical (device sdc): unable to find logical 3452059648 len 4096
[  455.126769] btrfsic: btrfsic_map_block(root @3452059648) failed!
[  455.126773] BTRFS critical (device sdc): unable to find logical 3452076032 len 4096
[  455.127007] btrfsic: btrfsic_map_block(root @3452076032) failed!
[  455.127011] BTRFS critical (device sdc): unable to find logical 3452092416 len 4096
<snip>
[  455.129864] btrfsic: btrfsic_map_block(root @3452272640) failed!
[  455.129869] BTRFS critical (device sdc): unable to find logical 3452289024 len 4096
[  455.130113] btrfsic: btrfsic_map_block(root @3452289024) failed!
[  455.130756] BTRFS critical (device sdc): unable to find logical 3451994112 len 4096
[  455.130984] btrfsic: btrfsic_map_block(root @3451994112) failed!
[  455.130988] BTRFS critical (device sdc): unable to find logical 3452010496 len 4096
[  455.131192] btrfsic: btrfsic_map_block(root @3452010496) failed!
[  455.131195] BTRFS critical (device sdc): unable to find logical 3452026880 len 4096
[  455.131401] btrfsic: btrfsic_map_block(root @3452026880) failed!
[  455.131914] Written block @25800392704 (sdc/1210236928/0) found in hash table, M, bytenr mismatch (!= stored 20968554496).
[  455.131938] Referenced block @25800409088 (sdc/1210253312/1) found in hash table, M, bytenr mismatch (!= stored 20968570880).
[  455.131971] Referenced block @25800409088 (sdc/1344471040/2) found in hash table, M, bytenr mismatch (!= stored 20968570880).
[  455.132017] Referenced block @25800425472 (sdc/1210269696/1) found in hash table, M, bytenr mismatch (!= stored 20968587264).
[  455.132030] Referenced block @25800425472 (sdc/1344487424/2) found in hash table, M, bytenr mismatch (!= stored 20968587264).
[  455.132344] Written block @25800392704 (sdc/1344454656/0) found in hash table, M, bytenr mismatch (!= stored 20968554496

I suspect there is a race between the integrity checker and Btrfs balance.
Could you please take a look at this issue?


Best Regards,
Wang Shilong

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: fix allocating memory failure for btrfsic_state structure

2014-10-10 Thread Wang Shilong
The @btrfsic_state structure needs more than 2M, so allocating it with
kzalloc() is very likely to fail. See the following message:

[91428.902148] Call Trace:
[816f6e0f] dump_stack+0x4d/0x66
[811b1c7f] warn_alloc_failed+0xff/0x170
[811b66e1] __alloc_pages_nodemask+0x951/0xc30
[811fd9da] alloc_pages_current+0x11a/0x1f0
[811b1e0b] ? alloc_kmem_pages+0x3b/0xf0
[811b1e0b] alloc_kmem_pages+0x3b/0xf0
[811d1018] kmalloc_order+0x18/0x50
[811d1074] kmalloc_order_trace+0x24/0x140
[a06c097b] btrfsic_mount+0x8b/0xae0 [btrfs]
[810af555] ? check_preempt_curr+0x85/0xa0
[810b2de3] ? try_to_wake_up+0x103/0x430
[a063d200] open_ctree+0x1bd0/0x2130 [btrfs]
[a060fdde] btrfs_mount+0x62e/0x8b0 [btrfs]
[811fd9da] ? alloc_pages_current+0x11a/0x1f0
[811b0a5e] ? __get_free_pages+0xe/0x50
[81230429] mount_fs+0x39/0x1b0
[812509fb] vfs_kern_mount+0x6b/0x150
[812537fb] do_mount+0x27b/0xc30
[811b0a5e] ? __get_free_pages+0xe/0x50
[812544f6] SyS_mount+0x96/0xf0
[81701970] system_call_fastpath+0x16/0x1b

Since we are allocating memory for a hash table array, it is still
preferable to get contiguous pages here when we can.

Fix this problem by trying kzalloc() first and, if that fails,
falling back to vzalloc().

Signed-off-by: Wang Shilong wangshilong1...@gmail.com
---
 fs/btrfs/check-integrity.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index cb7f3fe..66e820f 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -3130,10 +3130,13 @@ int btrfsic_mount(struct btrfs_root *root,
   root->sectorsize, PAGE_CACHE_SIZE);
return -1;
}
-   state = kzalloc(sizeof(*state), GFP_NOFS);
-   if (NULL == state) {
-   printk(KERN_INFO "btrfs check-integrity: kmalloc() failed!\n");
-   return -1;
+   state = kzalloc(sizeof(*state), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+   if (!state) {
+   state = vzalloc(sizeof(*state));
+   if (!state) {
+   printk(KERN_INFO "btrfs check-integrity: vzalloc() failed!\n");
+   return -1;
+   }
}
 
if (!btrfsic_is_initialized) {
@@ -3277,5 +3280,8 @@ void btrfsic_unmount(struct btrfs_root *root,
 
mutex_unlock(&btrfsic_mutex);
 
-   kfree(state);
+   if (is_vmalloc_addr(state))
+   vfree(state);
+   else
+   kfree(state);
 }
-- 
1.8.3.1
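
The "try the contiguous allocator first, fall back to the non-contiguous one" pattern from this patch can be sketched in userspace as follows. This is only an illustration under stated assumptions: `contig_zalloc()`, `fallback_zalloc()`, `big_zalloc()`, `big_free()` and the `contig_limit` parameter are hypothetical stand-ins for kzalloc(), vzalloc() and the kfree()/vfree() split, and both stand-ins simply use calloc(). The patch itself distinguishes the two cases at free time with is_vmalloc_addr(); the sketch records the allocator in a tag instead.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical tag recording which allocator produced a buffer. */
enum alloc_kind { ALLOC_NONE, ALLOC_CONTIG, ALLOC_FALLBACK };

struct big_buf {
	void *ptr;
	enum alloc_kind kind;
};

/*
 * Stand-in for kzalloc(): pretend contiguous memory is only available
 * up to contig_limit bytes. The limit is an illustration parameter,
 * not a kernel constant.
 */
static void *contig_zalloc(size_t size, size_t contig_limit)
{
	if (size > contig_limit)
		return NULL;
	return calloc(1, size);
}

/* Stand-in for vzalloc(): no contiguity requirement. */
static void *fallback_zalloc(size_t size)
{
	return calloc(1, size);
}

/* Try the contiguous allocator first, then fall back. */
static struct big_buf big_zalloc(size_t size, size_t contig_limit)
{
	struct big_buf b = { NULL, ALLOC_NONE };

	assert(size > 0);
	b.ptr = contig_zalloc(size, contig_limit);
	if (b.ptr) {
		b.kind = ALLOC_CONTIG;
		return b;
	}
	b.ptr = fallback_zalloc(size);
	if (b.ptr)
		b.kind = ALLOC_FALLBACK;
	return b;
}

/*
 * Release with the routine that matches the allocator. Both branches
 * call free() here only because both stand-ins use calloc(); in the
 * kernel they would be kfree() and vfree() respectively.
 */
static void big_free(struct big_buf *b)
{
	switch (b->kind) {
	case ALLOC_CONTIG:
		free(b->ptr);
		break;
	case ALLOC_FALLBACK:
		free(b->ptr);
		break;
	case ALLOC_NONE:
		break;
	}
	b->ptr = NULL;
	b->kind = ALLOC_NONE;
}
```

A caller would use `big_zalloc(sizeof(*state), limit)` at mount time and `big_free()` at unmount, which mirrors the allocate/free pairing the patch adds to btrfsic_mount() and btrfsic_unmount().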



Re: What is the vision for btrfs fs repair?

2014-10-10 Thread Chris Murphy

On Oct 10, 2014, at 6:53 AM, Bob Marley bobmar...@shiftmail.org wrote:

 On 10/10/2014 03:58, Chris Murphy wrote:
 
 * mount -o recovery
 Enable autorecovery attempts if a bad tree root is found at mount 
 time.
 I'm confused why it's not the default yet. Maybe it's continuing to evolve 
 at a pace that suggests something could sneak in that makes things worse? It 
 is almost an oxymoron in that I'm manually enabling an autorecovery
 
 If true, maybe the closest indication we'd get of btrfs stability is the 
 default enabling of autorecovery.
 
 No way!
 I wouldn't want a default like that.
 
 If you think about distributed transactions: suppose a sync was issued on both 
 sides of a distributed transaction, then power was lost on one side, then 
 btrfs had corruption. When I remount it, definitely the worst thing that can 
 happen is that it auto-rolls-back to a previous known-good state.

For a general purpose file system, losing 30 seconds (or less) of questionably 
committed data, likely corrupt, is better than a file system that won't mount 
without user intervention, which requires a secret decoder ring to get it to 
mount at all, and may require the use of specialized tools to retrieve that 
data in any case.

The fail safe behavior is to treat the known good tree root as the default tree 
root, and bypass the bad tree root if it cannot be repaired, so that the volume 
can be mounted with default mount options (i.e. the ones in fstab). Otherwise 
it's a filesystem that isn't well suited for general purpose use as rootfs let 
alone for boot.

Chris Murphy



Re: What is the vision for btrfs fs repair?

2014-10-10 Thread cwillu
If -o recovery is necessary, then you're either running into a btrfs
bug, or your hardware is lying about when it has actually written
things to disk.

The first case isn't unheard of, although far less common than it used
to be, and it should continue to improve with time.

In the second case, you're potentially screwed regardless of the
filesystem, short of hacks like waiting a good long time before
returning from fsync in the hope that the disk has actually gotten
around to performing the write it said had already finished.

On Fri, Oct 10, 2014 at 5:12 AM, Bob Marley bobmar...@shiftmail.org wrote:
 On 10/10/2014 12:59, Roman Mamedov wrote:

 On Fri, 10 Oct 2014 12:53:38 +0200
 Bob Marley bobmar...@shiftmail.org wrote:

 On 10/10/2014 03:58, Chris Murphy wrote:

 * mount -o recovery
 Enable autorecovery attempts if a bad tree root is found at
 mount time.

 I'm confused why it's not the default yet. Maybe it's continuing to
 evolve at a pace that suggests something could sneak in that makes things
 worse? It is almost an oxymoron in that I'm manually enabling an
 autorecovery

 If true, maybe the closest indication we'd get of btrfs stability is the
 default enabling of autorecovery.

 No way!
 I wouldn't want a default like that.

 If you think about distributed transactions: suppose a sync was issued on
 both sides of a distributed transaction, then power was lost on one
 side

 What distributed transactions? Btrfs is not a clustered filesystem[1], it does
 not support and likely will never support being mounted from multiple hosts at
 the same time.

 [1]http://en.wikipedia.org/wiki/Clustered_file_system


 This is not the only way to do a distributed transaction.
 Databases can be hosted on the filesystem, and those can do distributed
 transactions.
 Think of two bank accounts, one on btrfs fs1 here, and another bank account
 in a database on some other filesystem in another country. You want to debit
 one account and credit the other one: the filesystems at the two sides *must
 not rollback their state* !! (especially not transparently without human
 intervention)




[PATCH] btrfs-progs: debug: print more info about inode

2014-10-10 Thread David Sterba
Add uid, gid, rdev and flags to btrfs_print_leaf.

Signed-off-by: David Sterba dste...@suse.cz
---
 print-tree.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/print-tree.c b/print-tree.c
index 7df5798e539c..70a7acc632f2 100644
--- a/print-tree.c
+++ b/print-tree.c
@@ -823,13 +823,17 @@ void btrfs_print_leaf(struct btrfs_root *root, struct extent_buffer *l)
switch (type) {
case BTRFS_INODE_ITEM_KEY:
ii = btrfs_item_ptr(l, i, struct btrfs_inode_item);
-   printf("\t\tinode generation %llu transid %llu size %llu block group %llu mode %o links %u\n",
+   printf("\t\tinode generation %llu transid %llu size %llu block group %llu mode %o links %u uid %u gid %u rdev %llu flags 0x%llx\n",
   (unsigned long long)btrfs_inode_generation(l, ii),
   (unsigned long long)btrfs_inode_transid(l, ii),
   (unsigned long long)btrfs_inode_size(l, ii),
   (unsigned long long)btrfs_inode_block_group(l,ii),
   btrfs_inode_mode(l, ii),
-  btrfs_inode_nlink(l, ii));
+  btrfs_inode_nlink(l, ii),
+  btrfs_inode_uid(l, ii),
+  btrfs_inode_gid(l, ii),
+  (unsigned long long)btrfs_inode_rdev(l,ii),
+  (unsigned long long)btrfs_inode_flags(l,ii));
break;
case BTRFS_INODE_REF_KEY:
iref = btrfs_item_ptr(l, i, struct btrfs_inode_ref);
-- 
2.1.1



[PATCH] btrfs-progs: document the limit balance filter

2014-10-10 Thread David Sterba
Signed-off-by: David Sterba dste...@suse.cz
---
 Documentation/btrfs-balance.txt | 5 +
 1 file changed, 5 insertions(+)

diff --git a/Documentation/btrfs-balance.txt b/Documentation/btrfs-balance.txt
index 89fd44901a9d..e1ad6511571c 100644
--- a/Documentation/btrfs-balance.txt
+++ b/Documentation/btrfs-balance.txt
@@ -102,6 +102,11 @@ specified as start..end.
 Convert each selected block group to the given profile name identified by
 parameters.
 
+*limit*
+Process only given number of chunks, after all filters apply. This can be used
+to specifically target a chunk in connection with other filters (drange,
+vrange) or just simply limit the amount of work done by a single balance run.
+
 *soft*
 Takes no parameters. Only has meaning when converting between profiles.
 When doing convert from one profile to another and soft mode is on,
-- 
2.1.1
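
The ordering described for *limit* (apply all other filters first, then stop after a fixed number of chunks) can be sketched as below. This is a hedged illustration, not the btrfs implementation: `struct chunk`, `chunk_filter`, `usage_below()` and `select_chunks()` are hypothetical names, and treating a limit of 0 as "no limit" is an assumption of the sketch only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of a chunk for illustration only. */
struct chunk {
	unsigned long long offset;
	unsigned usage_pct;
};

/* A filter returns nonzero when the chunk should be processed. */
typedef int (*chunk_filter)(const struct chunk *c, const void *arg);

/* Example filter: select chunks whose usage is below *arg percent. */
static int usage_below(const struct chunk *c, const void *arg)
{
	return c->usage_pct < *(const unsigned *)arg;
}

/*
 * Walk the chunks in order, apply the filter, and stop once `limit`
 * chunks have been selected. Indices of selected chunks are written
 * to `out`, which must have room for n entries. In this sketch a
 * limit of 0 means "no limit".
 */
static size_t select_chunks(const struct chunk *chunks, size_t n,
			    chunk_filter filter, const void *arg,
			    size_t limit, size_t *out)
{
	size_t i, selected = 0;

	assert(chunks != NULL && out != NULL);
	for (i = 0; i < n; i++) {
		if (filter && !filter(&chunks[i], arg))
			continue;	/* rejected by the other filters */
		out[selected++] = i;
		if (limit && selected == limit)
			break;		/* limit applies after filtering */
	}
	return selected;
}
```

With four chunks at 10%, 90%, 20% and 5% usage, a 50% usage filter and a limit of 2, the sketch selects the first and third chunk and never looks at the fourth, which is the "limit the amount of work done by a single balance run" case from the text.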



Re: What is the vision for btrfs fs repair?

2014-10-10 Thread Bob Marley

On 10/10/2014 16:37, Chris Murphy wrote:

The fail safe behavior is to treat the known good tree root as the default tree 
root, and bypass the bad tree root if it cannot be repaired, so that the volume 
can be mounted with default mount options (i.e. the ones in fstab). Otherwise 
it's a filesystem that isn't well suited for general purpose use as rootfs let 
alone for boot.



A filesystem which is suited for general purpose use is a filesystem 
which honors fsync, and doesn't *ever* auto-roll-back without user 
intervention.


Anything different is not suited for database transactions at all. Any 
paid service which has the users database on btrfs is going to be at 
risk of losing payments, and probably without the company even knowing. 
If btrfs goes this way I hope a big warning is written on the wiki and 
on the manpages telling that this filesystem is totally unsuitable for 
hosting databases performing transactions.


At most I can suggest that a flag in the metadata be added to 
allow/disallow auto-roll-back-on-error on such filesystem, so people can 
decide the tolerant vs. transaction-safe mode at filesystem creation.




Re: What is the vision for btrfs fs repair?

2014-10-10 Thread Bardur Arantsson
On 2014-10-10 19:43, Bob Marley wrote:
 On 10/10/2014 16:37, Chris Murphy wrote:
 The fail safe behavior is to treat the known good tree root as the
 default tree root, and bypass the bad tree root if it cannot be
 repaired, so that the volume can be mounted with default mount options
 (i.e. the ones in fstab). Otherwise it's a filesystem that isn't well
 suited for general purpose use as rootfs let alone for boot.

 
 A filesystem which is suited for general purpose use is a filesystem
 which honors fsync, and doesn't *ever* auto-roll-back without user
 intervention.
 

A file system cannot do anything about the *DISKS* not honouring a sync
command. That's what the PP was talking about.
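
For reference, what an application can actually do on its side of this contract is small: write the data, call fsync(), and treat any error as "not durable". The sketch below shows that pattern with plain POSIX calls; `write_durably()` is a hypothetical helper name. It deliberately does not cover syncing the parent directory for a newly created file, and nothing in it can detect a disk that acknowledges a flush it has not performed, which is the case discussed above.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Write len bytes to path and ask the kernel to flush them to stable
 * storage. Returns 0 on success or a negative errno value on failure.
 */
static int write_durably(const char *path, const void *buf, size_t len)
{
	const char *p = buf;
	int fd, ret = 0;

	assert(path != NULL && buf != NULL);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -errno;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			ret = -errno;
			goto out;
		}
		p += n;
		len -= (size_t)n;
	}

	/* Only after fsync() succeeds may the caller assume durability. */
	if (fsync(fd) < 0)
		ret = -errno;
out:
	if (close(fd) < 0 && ret == 0)
		ret = -errno;
	return ret;
}
```

A database would call something like this for its journal before acknowledging a commit; whether the data then survives a power loss depends on the filesystem and the device both honouring the flush.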





Re: What is the vision for btrfs fs repair?

2014-10-10 Thread Austin S Hemmelgarn

On 2014-10-10 13:43, Bob Marley wrote:

On 10/10/2014 16:37, Chris Murphy wrote:

The fail safe behavior is to treat the known good tree root as the
default tree root, and bypass the bad tree root if it cannot be
repaired, so that the volume can be mounted with default mount options
(i.e. the ones in fstab). Otherwise it's a filesystem that isn't well
suited for general purpose use as rootfs let alone for boot.



A filesystem which is suited for general purpose use is a filesystem
which honors fsync, and doesn't *ever* auto-roll-back without user
intervention.

Anything different is not suited for database transactions at all. Any
paid service which has the users database on btrfs is going to be at
risk of losing payments, and probably without the company even knowing.
If btrfs goes this way I hope a big warning is written on the wiki and
on the manpages telling that this filesystem is totally unsuitable for
hosting databases performing transactions.

If they need reliability, they should have some form of redundancy 
in-place and/or run the database directly on the block device; because 
even ext4, XFS, and pretty much every other filesystem can lose data 
sometimes, the difference being that those tend to give worse results 
when hardware is misbehaving than BTRFS does, because BTRFS usually has 
an old copy of whatever data structure gets corrupted to fall back on.


Also, you really shouldn't be running databases on a BTRFS filesystem at 
the moment anyway, because of the significant performance implications.


At most I can suggest that a flag in the metadata be added to
allow/disallow auto-roll-back-on-error on such filesystem, so people can
decide the tolerant vs. transaction-safe mode at filesystem creation.



The problem with this is that if the auto-recovery code did run (and 
IMHO the kernel should spit out a warning to the system log whenever it 
does), then chances are that you wouldn't have had a consistent view if 
you had prevented it from running either; and, if the database is 
properly distributed/replicated, then it should recover by itself.







[PATCH 01/12] Btrfs-progs: repair missing dir index

2014-10-10 Thread Josef Bacik
If we have an inode backref entry then we know enough to add back a missing dir
index.  When messing with the inode backrefs we need to do all of that first
before we process the inode recs themselves as we may clear errors on the inode
recs as we fix the directory indexes.  This adds the framework for fixing
backref errors and fixes missing dir index issues.  Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 cmds-check.c  | 121 +-
 tests/fsck-tests/004-no-dir-index.img | Bin 0 -> 4096 bytes
 2 files changed, 119 insertions(+), 2 deletions(-)
 create mode 100644 tests/fsck-tests/004-no-dir-index.img

diff --git a/cmds-check.c b/cmds-check.c
index 76df5ae..a7e0840 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -1516,15 +1516,111 @@ static int repair_inode_orphan_item(struct btrfs_trans_handle *trans,
return ret;
 }
 
+static int add_missing_dir_index(struct btrfs_root *root,
+struct cache_tree *inode_cache,
+struct inode_record *rec,
+struct inode_backref *backref)
+{
+   struct btrfs_path *path;
+   struct btrfs_trans_handle *trans;
+   struct btrfs_dir_item *dir_item;
+   struct extent_buffer *leaf;
+   struct btrfs_key key;
+   struct btrfs_disk_key disk_key;
+   struct inode_record *dir_rec;
+   unsigned long name_ptr;
+   u32 data_size = sizeof(*dir_item) + backref->namelen;
+   int ret;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   trans = btrfs_start_transaction(root, 1);
+   if (IS_ERR(trans)) {
+   btrfs_free_path(path);
+   return PTR_ERR(trans);
+   }
+
+   fprintf(stderr, "repairing missing dir index item for inode %llu\n",
+   (unsigned long long)rec->ino);
+   key.objectid = backref->dir;
+   key.type = BTRFS_DIR_INDEX_KEY;
+   key.offset = backref->index;
+
+   ret = btrfs_insert_empty_item(trans, root, path, &key, data_size);
+   BUG_ON(ret);
+
+   leaf = path->nodes[0];
+   dir_item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dir_item);
+
+   disk_key.objectid = cpu_to_le64(rec->ino);
+   disk_key.type = BTRFS_INODE_ITEM_KEY;
+   disk_key.offset = 0;
+
+   btrfs_set_dir_item_key(leaf, dir_item, &disk_key);
+   btrfs_set_dir_type(leaf, dir_item, imode_to_type(rec->imode));
+   btrfs_set_dir_data_len(leaf, dir_item, 0);
+   btrfs_set_dir_name_len(leaf, dir_item, backref->namelen);
+   name_ptr = (unsigned long)(dir_item + 1);
+   write_extent_buffer(leaf, backref->name, name_ptr, backref->namelen);
+   btrfs_mark_buffer_dirty(leaf);
+   btrfs_free_path(path);
+   btrfs_commit_transaction(trans, root);
+
+   backref->found_dir_index = 1;
+   dir_rec = get_inode_rec(inode_cache, backref->dir, 0);
+   if (!dir_rec)
+   return 0;
+   dir_rec->found_size += backref->namelen;
+   if (dir_rec->found_size == dir_rec->isize &&
+   (dir_rec->errors & I_ERR_DIR_ISIZE_WRONG))
+   dir_rec->errors &= ~I_ERR_DIR_ISIZE_WRONG;
+   if (dir_rec->found_size != dir_rec->isize)
+   dir_rec->errors |= I_ERR_DIR_ISIZE_WRONG;
+
+   return 0;
+}
+
+static int repair_inode_backrefs(struct btrfs_root *root,
+struct inode_record *rec,
+struct cache_tree *inode_cache)
+{
+   struct inode_backref *tmp, *backref;
+   u64 root_dirid = btrfs_root_dirid(&root->root_item);
+   int ret = 0;
+
+   list_for_each_entry_safe(backref, tmp, &rec->backrefs, list) {
+   /* Index 0 for root dir's are special, don't mess with it */
+   if (rec->ino == root_dirid && backref->index == 0)
+   continue;
+
+   if (!backref->found_dir_index && backref->found_inode_ref) {
+   ret = add_missing_dir_index(root, inode_cache, rec,
+   backref);
+   if (ret)
+   break;
+   }
+
+   if (backref->found_dir_item && backref->found_dir_index) {
+   if (!backref->errors && backref->found_inode_ref) {
+   list_del(&backref->list);
+   free(backref);
+   }
+   }
+   }
+
+   return ret;
+}
+
 static int try_repair_inode(struct btrfs_root *root, struct inode_record *rec)
 {
struct btrfs_trans_handle *trans;
struct btrfs_path *path;
int ret = 0;
 
-   /* So far we just fix dir isize wrong */
if (!(rec->errors & (I_ERR_DIR_ISIZE_WRONG | I_ERR_NO_ORPHAN_ITEM)))
-   return 1;
+   return rec-errors;
 
path = btrfs_alloc_path();
if (!path)
@@ -1562,6 +1658,27 @@ static int check_inode_recs(struct btrfs_root 

Btrfs-progs: fix horrible things

2014-10-10 Thread Josef Bacik
This is the culmination of two weeks' worth of work with Jaap Pieroen and his
broken file system.  This pulls back a few things from the kernel in order to
support the dir item stuff, and the rbtree stuff to fix a weird bug we were
seeing.  There will be more test images coming, but this is what I have so far.
David if you want this in a branch to pull let me know, this has the updated
repair missing dir index patch so you can drop the old one.  With these
patches we can now deal with most if not all dir index corruption issues and be
able to fix corrupt blocks that are pointed at by multiple snapshots.  Thanks,

Josef



[PATCH 08/12] Btrfs-progs: delete bogus dir indexes

2014-10-10 Thread Josef Bacik
We may run across dir indexes that are corrupt in such a way that it makes them
useless, such as having a bad location key or a bad name.  In this case we can
just delete dir indexes that don't show up properly and then re-create what we
need.  When we delete dir indexes however we need to restart scanning the fs
tree as we could have created bogus inode recs if the location key was bad, so
set it up so that if we had to delete a dir index we go ahead and free up our
inode recs and return -EAGAIN to check_fs_roots so it knows to restart the loop.
Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 cmds-check.c | 137 +++
 ctree.h  |   9 
 dir-item.c   | 101 +++
 3 files changed, 229 insertions(+), 18 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 9522077..8fdc542 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -1585,35 +1585,99 @@ static int add_missing_dir_index(struct btrfs_root *root,
return 0;
 }
 
+static int delete_dir_index(struct btrfs_root *root,
+   struct cache_tree *inode_cache,
+   struct inode_record *rec,
+   struct inode_backref *backref)
+{
+   struct btrfs_trans_handle *trans;
+   struct btrfs_dir_item *di;
+   struct btrfs_path *path;
+   int ret = 0;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   trans = btrfs_start_transaction(root, 1);
+   if (IS_ERR(trans)) {
+   btrfs_free_path(path);
+   return PTR_ERR(trans);
+   }
+
+
+   fprintf(stderr, "Deleting bad dir index [%llu,%u,%llu] root %llu\n",
+   (unsigned long long)backref->dir,
+   BTRFS_DIR_INDEX_KEY, (unsigned long long)backref->index,
+   (unsigned long long)root->objectid);
+
+   di = btrfs_lookup_dir_index(trans, root, path, backref->dir,
+   backref->name, backref->namelen,
+   backref->index, -1);
+   if (IS_ERR(di)) {
+   ret = PTR_ERR(di);
+   btrfs_free_path(path);
+   btrfs_commit_transaction(trans, root);
+   if (ret == -ENOENT)
+   return 0;
+   return ret;
+   }
+
+   if (!di)
+   ret = btrfs_del_item(trans, root, path);
+   else
+   ret = btrfs_delete_one_dir_name(trans, root, path, di);
+   BUG_ON(ret);
+   btrfs_free_path(path);
+   btrfs_commit_transaction(trans, root);
+   return ret;
+}
+
 static int repair_inode_backrefs(struct btrfs_root *root,
 struct inode_record *rec,
-struct cache_tree *inode_cache)
+struct cache_tree *inode_cache,
+int delete)
 {
struct inode_backref *tmp, *backref;
u64 root_dirid = btrfs_root_dirid(&root->root_item);
int ret = 0;
+   int repaired = 0;
 
list_for_each_entry_safe(backref, tmp, &rec->backrefs, list) {
/* Index 0 for root dir's are special, don't mess with it */
if (rec->ino == root_dirid && backref->index == 0)
continue;
 
-   if (!backref->found_dir_index && backref->found_inode_ref) {
-   ret = add_missing_dir_index(root, inode_cache, rec,
-   backref);
+   if (delete && backref->found_dir_index &&
+   !backref->found_inode_ref) {
+   ret = delete_dir_index(root, inode_cache, rec, backref);
if (ret)
break;
+   repaired++;
+   list_del(&backref->list);
+   free(backref);
}
 
-   if (backref->found_dir_item && backref->found_dir_index) {
-   if (!backref->errors && backref->found_inode_ref) {
-   list_del(&backref->list);
-   free(backref);
+   if (!delete && !backref->found_dir_index &&
+   backref->found_dir_item && backref->found_inode_ref) {
+   ret = add_missing_dir_index(root, inode_cache, rec,
+   backref);
+   if (ret)
+   break;
+   repaired++;
+   if (backref->found_dir_item &&
+   backref->found_dir_index &&
+   backref->found_dir_index) {
+   if (!backref->errors &&
+   backref->found_inode_ref) {
+   list_del(&backref->list);
+   free(backref);
+  
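
The diff is cut off at this point in the archive, but the two-pass decision visible in the hunk above (a delete pass that drops dir indexes with no matching inode ref, then an add pass that re-creates missing indexes) can be summarised in a small standalone sketch. The flag names are taken from the diff; `struct backref_state`, `enum backref_action` and `classify_backref()` are hypothetical and exist only for illustration. The root directory index 0 special case and the list handling are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical standalone copy of the flags used in the diff. */
struct backref_state {
	unsigned found_dir_index:1;
	unsigned found_dir_item:1;
	unsigned found_inode_ref:1;
};

enum backref_action {
	BACKREF_KEEP,		/* nothing to repair in this pass */
	BACKREF_DELETE_INDEX,	/* dir index exists but no inode ref backs it */
	BACKREF_ADD_INDEX	/* item and ref exist, dir index is missing */
};

/*
 * Decide what a single pass should do with one backref.
 * delete_pass != 0 corresponds to the "delete" argument in the diff.
 */
static enum backref_action classify_backref(const struct backref_state *b,
					    int delete_pass)
{
	assert(b != NULL);

	if (delete_pass && b->found_dir_index && !b->found_inode_ref)
		return BACKREF_DELETE_INDEX;

	if (!delete_pass && !b->found_dir_index &&
	    b->found_dir_item && b->found_inode_ref)
		return BACKREF_ADD_INDEX;

	return BACKREF_KEEP;
}
```

Separating the passes this way matches the commit message: bogus indexes are removed first, the scan restarts, and only then are missing indexes re-created from the surviving backrefs.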

[PATCH 05/12] Btrfs-progs: lookup all roots that point to a corrupt block

2014-10-10 Thread Josef Bacik
If we have a corrupt block that multiple snapshots point to we will only fix the
guy who originally pointed to the block, and then simply loop forever because we
keep finding the same bad block.  So instead lookup all roots that point to this
block, and then search down to the block for each root and fix the block in all
snapshots.  Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 cmds-check.c | 143 +--
 1 file changed, 70 insertions(+), 73 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index ff9c0d5..1deef77 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -40,6 +40,8 @@
 #include "btrfsck.h"
 #include "qgroup-verify.h"
 #include "rbtree-utils.h"
+#include "backref.h"
+#include "ulist.h"
 
 static u64 bytes_used = 0;
 static u64 total_csum_bytes = 0;
@@ -2554,42 +2556,14 @@ static int swap_values(struct btrfs_root *root, struct btrfs_path *path,
 
 static int fix_key_order(struct btrfs_trans_handle *trans,
 struct btrfs_root *root,
-struct extent_buffer *buf)
+struct btrfs_path *path)
 {
-   struct btrfs_path *path;
+   struct extent_buffer *buf;
struct btrfs_key k1, k2;
int i;
-   int level;
+   int level = path->lowest_level;
int ret;
 
-   k1.objectid = btrfs_header_owner(buf);
-   k1.type = BTRFS_ROOT_ITEM_KEY;
-   k1.offset = (u64)-1;
-
-   root = btrfs_read_fs_root(root->fs_info, &k1);
-   if (IS_ERR(root))
-   return -EIO;
-
-   record_root_in_trans(trans, root);
-
-   path = btrfs_alloc_path();
-   if (!path)
-   return -EIO;
-
-   level = btrfs_header_level(buf);
-   path->lowest_level = level;
-   path->skip_check_block = 1;
-   if (level)
-   btrfs_node_key_to_cpu(buf, &k1, 0);
-   else
-   btrfs_item_key_to_cpu(buf, &k1, 0);
-
-   ret = btrfs_search_slot(trans, root, &k1, path, 0, 1);
-   if (ret) {
-   btrfs_free_path(path);
-   return -EIO;
-   }
-
buf = path->nodes[level];
for (i = 0; i < btrfs_header_nritems(buf) - 1; i++) {
if (level) {
@@ -2607,8 +2581,6 @@ static int fix_key_order(struct btrfs_trans_handle *trans,
btrfs_mark_buffer_dirty(buf);
i = 0;
}
-
-   btrfs_free_path(path);
return ret;
 }
 
@@ -2650,43 +2622,15 @@ static int delete_bogus_item(struct btrfs_trans_handle *trans,
 
 static int fix_item_offset(struct btrfs_trans_handle *trans,
   struct btrfs_root *root,
-  struct extent_buffer *buf)
+  struct btrfs_path *path)
 {
-   struct btrfs_path *path;
-   struct btrfs_key k1;
+   struct extent_buffer *buf;
int i;
-   int level;
-   int ret;
-
-   k1.objectid = btrfs_header_owner(buf);
-   k1.type = BTRFS_ROOT_ITEM_KEY;
-   k1.offset = (u64)-1;
-
-   root = btrfs_read_fs_root(root->fs_info, &k1);
-   if (IS_ERR(root))
-   return -EIO;
-
-   record_root_in_trans(trans, root);
-
-   path = btrfs_alloc_path();
-   if (!path)
-   return -EIO;
-
-   level = btrfs_header_level(buf);
-   path->lowest_level = level;
-   path->skip_check_block = 1;
-   if (level)
-   btrfs_node_key_to_cpu(buf, &k1, 0);
-   else
-   btrfs_item_key_to_cpu(buf, &k1, 0);
-
-   ret = btrfs_search_slot(trans, root, &k1, path, 0, 1);
-   if (ret) {
-   btrfs_free_path(path);
-   return -EIO;
-   }
+   int ret = 0;
 
-   buf = path->nodes[level];
+   /* We should only get this for leaves */
+   BUG_ON(path->lowest_level);
+   buf = path->nodes[0];
 again:
for (i = 0; i < btrfs_header_nritems(buf); i++) {
unsigned int shift = 0, offset;
@@ -2742,7 +2686,6 @@ again:
 * progs this can be changed to something nicer.
 */
BUG_ON(ret);
-   btrfs_free_path(path);
return ret;
 }
 
@@ -2755,11 +2698,65 @@ static int try_to_fix_bad_block(struct btrfs_trans_handle *trans,
struct extent_buffer *buf,
enum btrfs_tree_block_status status)
 {
-   if (status == BTRFS_TREE_BLOCK_BAD_KEY_ORDER)
-   return fix_key_order(trans, root, buf);
-   if (status == BTRFS_TREE_BLOCK_INVALID_OFFSETS)
-   return fix_item_offset(trans, root, buf);
-   return -EIO;
+   struct ulist *roots;
+   struct ulist_node *node;
+   struct btrfs_root *search_root;
+   struct btrfs_path *path;
+   struct ulist_iterator iter;
+   struct btrfs_key root_key, key;
+   int ret;
+
+   if (status != BTRFS_TREE_BLOCK_BAD_KEY_ORDER &&
+   status != BTRFS_TREE_BLOCK_INVALID_OFFSETS)
+   return -EIO;
+
+   path = btrfs_alloc_path();
+   

[PATCH 06/12] Btrfs-progs: reset chunk state if we restart check

2014-10-10 Thread Josef Bacik
If we hit a corrupt block that we fix and we restart the fsck loop you will get
lots of noise about duplicate block groups and such.  This is because we don't
clear the block group and chunk cache when we do this restart.  This patch fixes
that, which is a little tricky since the structs are linked together with
various linked lists, but this passed with a user who was hitting this problem.
Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 cmds-check.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/cmds-check.c b/cmds-check.c
index 1deef77..e81a26c 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -3270,6 +3270,8 @@ static void free_chunk_record(struct cache_extent *cache)
struct chunk_record *rec;
 
rec = container_of(cache, struct chunk_record, cache);
+   list_del_init(&rec->list);
+   list_del_init(&rec->dextents);
free(rec);
 }
 
@@ -3306,6 +3308,7 @@ static void free_block_group_record(struct cache_extent 
*cache)
struct block_group_record *rec;
 
rec = container_of(cache, struct block_group_record, cache);
+   list_del_init(&rec->list);
free(rec);
 }
 
@@ -3339,6 +3342,10 @@ static void free_device_extent_record(struct 
cache_extent *cache)
struct device_extent_record *rec;
 
rec = container_of(cache, struct device_extent_record, cache);
+   if (!list_empty(&rec->chunk_list))
+   list_del_init(&rec->chunk_list);
+   if (!list_empty(&rec->device_list))
+   list_del_init(&rec->device_list);
free(rec);
 }
 
@@ -6051,7 +6058,7 @@ static int check_device_used(struct device_record 
*dev_rec,
if (dev_extent_rec->objectid != dev_rec->devid)
break;
 
-   list_del(&dev_extent_rec->device_list);
+   list_del_init(&dev_extent_rec->device_list);
total_byte += dev_extent_rec->length;
cache = next_cache_extent(cache);
}
@@ -6280,6 +6287,10 @@ again:
free_extent_cache_tree(&pending);
free_extent_cache_tree(&reada);
free_extent_cache_tree(&nodes);
+   free_chunk_cache_tree(&chunk_cache);
+   free_block_group_tree(&block_group_cache);
+   free_device_cache_tree(&dev_cache);
+   free_device_extent_tree(&dev_extent_cache);
free_extent_record_cache(root->fs_info, &extent_cache);
goto again;
}
-- 
1.8.3.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
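
Note on the patch above: the fix relies on unlinking each record from its lists before freeing it, so that no list keeps a pointer into freed memory when the check restarts. The following is a minimal standalone sketch of that idea, not the btrfs-progs code. The list helpers are simplified re-implementations written for this note, and struct demo_record, free_demo_record() and demo_list_is_clean_after_free() are hypothetical names.

```c
#include <stdlib.h>

/* Minimal circular doubly linked list, modeled loosely on kernel-style list.h. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static int list_is_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add_tail_entry(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Unlink the entry and leave it pointing at itself, so a second unlink is harmless. */
static void list_unlink_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	init_list_head(entry);
}

/* Hypothetical record type standing in for chunk_record and friends. */
struct demo_record {
	struct list_head list;	/* must stay the first member for the cast below */
	int id;
};

/* Free a record only after removing it from whatever list still links it. */
static void free_demo_record(struct demo_record *rec)
{
	list_unlink_init(&rec->list);
	free(rec);
}

/* Build a list of n records, free them all, and report whether the head is empty again. */
static int demo_list_is_clean_after_free(int n)
{
	struct list_head head;
	int i;

	init_list_head(&head);
	for (i = 0; i < n; i++) {
		struct demo_record *rec = calloc(1, sizeof(*rec));

		if (!rec)
			break;
		rec->id = i;
		list_add_tail_entry(&rec->list, &head);
	}
	/* Pop from the front until the head links only to itself. */
	while (!list_is_empty(&head)) {
		/* Valid because 'list' is the first member of struct demo_record. */
		struct demo_record *rec = (struct demo_record *)head.next;

		free_demo_record(rec);
	}
	return list_is_empty(&head);
}
```

Without the unlink step in free_demo_record(), the head would still point at freed records, which is the kind of stale state the commit message describes.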


[PATCH 09/12] Btrfs-progs: add a dummy backref if our location is wrong

2014-10-10 Thread Josef Bacik
If our location is bogus in our dir item we were just skipping the thing.
However in this case we want to just delete the dir index, so create a dummy
inode rec using BTRFS_MULTIPLE_OBJECTIDS and just add every backref we find to
the list so we know to straight up delete all of these items.  Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 cmds-check.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/cmds-check.c b/cmds-check.c
index 8fdc542..ca890cc 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -551,6 +551,8 @@ static struct inode_backref *get_inode_backref(struct 
inode_record *rec,
struct inode_backref *backref;
 
list_for_each_entry(backref, &rec->backrefs, list) {
+   if (rec->ino == BTRFS_MULTIPLE_OBJECTIDS)
+   break;
if (backref->dir != dir || backref->namelen != namelen)
continue;
if (memcmp(name, backref->name, namelen))
@@ -990,7 +992,11 @@ static int process_dir_item(struct btrfs_root *root,
  namebuf, len, filetype,
  key->type, error);
} else {
-   fprintf(stderr, "warning line %d\n", __LINE__);
+   fprintf(stderr, "invalid location in dir item %u\n",
+   location.type);
+   add_inode_backref(inode_cache, BTRFS_MULTIPLE_OBJECTIDS,
+ key->objectid, key->offset, namebuf,
+ len, filetype, key->type, error);
}
 
len = sizeof(*di) + name_len + data_len;
-- 
1.8.3.1



[PATCH 07/12] Btrfs-progs: re-search tree root if it changes

2014-10-10 Thread Josef Bacik
If we change something while scanning fs-roots we need to redo our search so
that we get valid root items and have valid root cache.  Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 cmds-check.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index e81a26c..9522077 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -2171,7 +2171,7 @@ static int check_fs_roots(struct btrfs_root *root,
struct btrfs_path path;
struct btrfs_key key;
struct walk_control wc;
-   struct extent_buffer *leaf;
+   struct extent_buffer *leaf, *tree_node;
struct btrfs_root *tmp_root;
struct btrfs_root *tree_root = root->fs_info->tree_root;
int ret;
@@ -2186,13 +2186,19 @@ static int check_fs_roots(struct btrfs_root *root,
memset(&wc, 0, sizeof(wc));
cache_tree_init(&wc.shared);
btrfs_init_path(&path);
-
+again:
key.offset = 0;
key.objectid = 0;
key.type = BTRFS_ROOT_ITEM_KEY;
ret = btrfs_search_slot(NULL, tree_root, &key, &path, 0, 0);
BUG_ON(ret < 0);
+   tree_node = tree_root->node;
while (1) {
+   if (tree_node != tree_root->node) {
+   free_root_recs_tree(root_cache);
+   btrfs_release_path(&path);
+   goto again;
+   }
leaf = path.nodes[0];
if (path.slots[0] >= btrfs_header_nritems(leaf)) {
ret = btrfs_next_leaf(tree_root, &path);
-- 
1.8.3.1
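
Note on the patch above: the change remembers the tree root's node when the scan starts and restarts the whole scan if that node has been replaced in the meantime. A minimal sketch of the same restart pattern follows, assuming a toy array-backed "tree". struct demo_tree, scan_with_restart() and swap_root_once() are hypothetical names, and the summing loop only stands in for the real per-root checks.

```c
/* Hypothetical stand-in for a tree whose root node may be replaced (e.g. by COW). */
struct demo_tree {
	const int *node;	/* current root node; changes when the tree is modified */
	int nritems;
};

/*
 * Sum all items, restarting from scratch whenever the root pointer
 * differs from the one seen when the pass began. The visit callback
 * models a repair step that may replace the root while we iterate.
 */
static long scan_with_restart(struct demo_tree *tree,
			      void (*visit)(struct demo_tree *tree, int slot),
			      int *restarts)
{
	const int *seen_root;
	long sum;
	int slot;

again:
	sum = 0;			/* drop results cached from the stale pass */
	seen_root = tree->node;
	for (slot = 0; slot < tree->nritems; slot++) {
		if (visit)
			visit(tree, slot);	/* may replace tree->node */
		if (seen_root != tree->node) {
			(*restarts)++;
			goto again;
		}
		sum += tree->node[slot];
	}
	return sum;
}

static const int old_items[] = { 1, 2, 3 };
static const int new_items[] = { 10, 20, 30 };

/* Demo visitor: replaces the root once, when slot 1 of the old root is visited. */
static void swap_root_once(struct demo_tree *tree, int slot)
{
	if (slot == 1 && tree->node == old_items)
		tree->node = new_items;
}
```

Resetting the accumulated state before each pass plays the role that free_root_recs_tree() plays in the patch: results gathered against the old root are discarded rather than mixed with the new ones.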



[PATCH 11/12] Btrfs-progs: add ability to corrupt dir items

2014-10-10 Thread Josef Bacik
In order to test the dir index corruption fixing patches in fsck we need to add
functionality to btrfs-corrupt-block to corrupt dir item fields.  Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 btrfs-corrupt-block.c | 103 +-
 1 file changed, 102 insertions(+), 1 deletion(-)

diff --git a/btrfs-corrupt-block.c b/btrfs-corrupt-block.c
index 3e36f90..66d9f02 100644
--- a/btrfs-corrupt-block.c
+++ b/btrfs-corrupt-block.c
@@ -111,6 +111,7 @@ static void print_usage(void)
fprintf(stderr, "\t-d Delete this item (must specify -K)\n");
fprintf(stderr, \t-I An item to corrupt (must also specify the field 
to corrupt and a root+key for the item)\n);
+   fprintf(stderr, "\t-D Corrupt a dir item, must specify key and field\n");
exit(1);
 }
 
@@ -304,6 +305,12 @@ enum btrfs_file_extent_field {
BTRFS_FILE_EXTENT_BAD,
 };
 
+enum btrfs_dir_item_field {
+   BTRFS_DIR_ITEM_NAME,
+   BTRFS_DIR_ITEM_LOCATION_OBJECTID,
+   BTRFS_DIR_ITEM_BAD,
+};
+
 enum btrfs_metadata_block_field {
BTRFS_METADATA_BLOCK_GENERATION,
BTRFS_METADATA_BLOCK_SHIFT_ITEMS,
@@ -364,6 +371,15 @@ static enum btrfs_item_field convert_item_field(char 
*field)
return BTRFS_ITEM_BAD;
 }
 
+static enum btrfs_dir_item_field convert_dir_item_field(char *field)
+{
+   if (!strncmp(field, "name", FIELD_BUF_LEN))
+   return BTRFS_DIR_ITEM_NAME;
+   if (!strncmp(field, "location_objectid", FIELD_BUF_LEN))
+   return BTRFS_DIR_ITEM_LOCATION_OBJECTID;
+   return BTRFS_DIR_ITEM_BAD;
+}
+
 static u64 generate_u64(u64 orig)
 {
u64 ret;
@@ -448,6 +464,80 @@ out:
return ret;
 }
 
+static int corrupt_dir_item(struct btrfs_root *root, struct btrfs_key *key,
+   char *field)
+{
+   struct btrfs_trans_handle *trans;
+   struct btrfs_dir_item *di;
+   struct btrfs_path *path;
+   char *name;
+   struct btrfs_key location;
+   struct btrfs_disk_key disk_key;
+   unsigned long name_ptr;
+   enum btrfs_dir_item_field corrupt_field =
+   convert_dir_item_field(field);
+   u64 bogus;
+   u16 name_len;
+   int ret;
+
+   if (corrupt_field == BTRFS_DIR_ITEM_BAD) {
+   fprintf(stderr, "Invalid field %s\n", field);
+   return -EINVAL;
+   }
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   trans = btrfs_start_transaction(root, 1);
+   if (IS_ERR(trans)) {
+   btrfs_free_path(path);
+   return PTR_ERR(trans);
+   }
+
+   ret = btrfs_search_slot(trans, root, key, path, 0, 1);
+   if (ret) {
+   if (ret > 0)
+   ret = -ENOENT;
+   fprintf(stderr, "Error searching for dir item %d\n", ret);
+   goto out;
+   }
+
+   di = btrfs_item_ptr(path->nodes[0], path->slots[0],
+   struct btrfs_dir_item);
+
+   switch (corrupt_field) {
+   case BTRFS_DIR_ITEM_NAME:
+   name_len = btrfs_dir_name_len(path->nodes[0], di);
+   name = malloc(name_len);
+   if (!name) {
+   ret = -ENOMEM;
+   goto out;
+   }
+   name_ptr = (unsigned long)(di + 1);
+   read_extent_buffer(path->nodes[0], name, name_ptr, name_len);
+   name[0]++;
+   write_extent_buffer(path->nodes[0], name, name_ptr, name_len);
+   btrfs_mark_buffer_dirty(path->nodes[0]);
+   free(name);
+   goto out;
+   case BTRFS_DIR_ITEM_LOCATION_OBJECTID:
+   btrfs_dir_item_key_to_cpu(path->nodes[0], di, &location);
+   bogus = generate_u64(location.objectid);
+   location.objectid = bogus;
+   btrfs_cpu_key_to_disk(&disk_key, &location);
+   btrfs_set_dir_item_key(path->nodes[0], di, &disk_key);
+   btrfs_mark_buffer_dirty(path->nodes[0]);
+   goto out;
+   default:
+   ret = -EINVAL;
+   goto out;
+   }
+out:
+   btrfs_commit_transaction(trans, root);
+   btrfs_free_path(path);
+   return ret;
+}
 
 static int corrupt_inode(struct btrfs_trans_handle *trans,
 struct btrfs_root *root, u64 inode, char *field)
@@ -772,6 +862,7 @@ static struct option long_options[] = {
{ "key", 1, NULL, 'K'},
{ "delete", 0, NULL, 'd'},
{ "item", 0, NULL, 'I'},
+   { "dir-item", 0, NULL, 'D'},
{ 0, 0, 0, 0}
 };
 
@@ -937,6 +1028,7 @@ int main(int ac, char **av)
int chunk_tree = 0;
int delete = 0;
int corrupt_item = 0;
+   int corrupt_di = 0;
u64 metadata_block = 0;
u64 inode = 0;
u64 file_extent = (u64)-1;
@@ -948,7 +1040,7 @@ int main(int ac, char **av)
 
while(1) {
int c;
-   c 
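
Note on the patch above: convert_dir_item_field() maps a field name given on the command line to an enum value and falls back to a "bad" value for anything unrecognized. A table-driven sketch of the same mapping is shown below as an illustration only. The enum and function names are hypothetical, and the value 80 for DEMO_FIELD_BUF_LEN is an assumption, not taken from the patch.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_FIELD_BUF_LEN 80	/* assumed buffer length, for illustration */

enum demo_dir_item_field {
	DEMO_DIR_ITEM_NAME,
	DEMO_DIR_ITEM_LOCATION_OBJECTID,
	DEMO_DIR_ITEM_BAD,
};

/* Map a user-supplied field name to its enum value; unknown names map to BAD. */
static enum demo_dir_item_field demo_convert_dir_item_field(const char *field)
{
	static const struct {
		const char *name;
		enum demo_dir_item_field value;
	} table[] = {
		{ "name", DEMO_DIR_ITEM_NAME },
		{ "location_objectid", DEMO_DIR_ITEM_LOCATION_OBJECTID },
	};
	size_t i;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		/* strncmp also compares the terminating NUL, so "names" does not match "name". */
		if (!strncmp(field, table[i].name, DEMO_FIELD_BUF_LEN))
			return table[i].value;
	}
	return DEMO_DIR_ITEM_BAD;
}
```

A table keeps the name list in one place if more corruptible fields are added later; the if-chain in the patch is equivalent for two entries.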

[PATCH 10/12] Btrfs-progs: deal with mismatch index between dir index and inode ref

2014-10-10 Thread Josef Bacik
Sometimes we have a dir index and an inode ref that don't agree on the index.
In this case just assume that the inode ref is the ultimate authority on the
subject and delete the dir index.  This means we have to not reset index if we
find a mismatched inode ref to make sure we delete the right dir index.  Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 cmds-check.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index ca890cc..80fa244 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -608,9 +608,10 @@ static int add_inode_backref(struct cache_tree 
*inode_cache,
backref->errors |= REF_ERR_DUP_INODE_REF;
if (backref->found_dir_index && backref->index != index)
backref->errors |= REF_ERR_INDEX_UNMATCH;
+   else
+   backref->index = index;
 
backref->ref_type = itemtype;
-   backref->index = index;
backref->found_inode_ref = 1;
} else {
BUG_ON(1);
@@ -1654,8 +1655,10 @@ static int repair_inode_backrefs(struct btrfs_root *root,
if (rec->ino == root_dirid && backref->index == 0)
continue;
 
-   if (delete && backref->found_dir_index &&
-   !backref->found_inode_ref) {
+   if (delete &&
+   ((backref->found_dir_index && !backref->found_inode_ref) ||
+(backref->found_dir_index && backref->found_inode_ref &&
+ backref->errors & REF_ERR_INDEX_UNMATCH))) {
ret = delete_dir_index(root, inode_cache, rec, backref);
if (ret)
break;
-- 
1.8.3.1
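
Note on the patch above: the new condition deletes a dir index either when no inode ref was found for it, or when both were found but disagree on the index. The predicate can be sketched on its own as below. struct demo_backref, should_delete_dir_index() and the bit value chosen for DEMO_REF_ERR_INDEX_UNMATCH are hypothetical, and the separate "delete" flag from the patch is left to the caller.

```c
#define DEMO_REF_ERR_INDEX_UNMATCH (1 << 3)	/* hypothetical bit value */

/* Hypothetical reduced form of the backref fields the condition looks at. */
struct demo_backref {
	int found_dir_index;
	int found_inode_ref;
	int errors;
};

/*
 * Return 1 when the dir index should be deleted:
 *  - a dir index exists but no inode ref backs it, or
 *  - both exist and the index values were flagged as mismatched,
 *    in which case the inode ref is treated as authoritative.
 */
static int should_delete_dir_index(const struct demo_backref *b)
{
	if (!b->found_dir_index)
		return 0;
	if (!b->found_inode_ref)
		return 1;
	return (b->errors & DEMO_REF_ERR_INDEX_UNMATCH) != 0;
}
```

This also shows why add_inode_backref() must keep the dir index's value in backref->index on a mismatch: the later delete needs the index of the item being removed, not the one from the inode ref.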



[PATCH 02/12] Btrfs-progs: pull back backref.c and fix it up

2014-10-10 Thread Josef Bacik
This patch pulls back backref.c, adds a couple of helpers everywhere that it
needs, and cleans up backref.c to fit in btrfs-progs.  Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 Makefile |2 +-
 backref.c| 1651 ++
 backref.h|   73 +++
 ctree.c  |   84 +++
 ctree.h  |   14 +
 extent_io.c  |   43 +-
 extent_io.h  |2 +
 kerncompat.h |7 +
 ulist.h  |   15 +
 9 files changed, 1878 insertions(+), 13 deletions(-)
 create mode 100644 backref.c
 create mode 100644 backref.h

diff --git a/Makefile b/Makefile
index 7cc7783..f954649 100644
--- a/Makefile
+++ b/Makefile
@@ -10,7 +10,7 @@ objects = ctree.o disk-io.o radix-tree.o extent-tree.o 
print-tree.o \
  root-tree.o dir-item.o file-item.o inode-item.o inode-map.o \
  extent-cache.o extent_io.o volumes.o utils.o repair.o \
  qgroup.o raid6.o free-space-cache.o list_sort.o props.o \
- ulist.o qgroup-verify.o
+ ulist.o qgroup-verify.o backref.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
diff --git a/backref.c b/backref.c
new file mode 100644
index 000..593f936
--- /dev/null
+++ b/backref.c
@@ -0,0 +1,1651 @@
+/*
+ * Copyright (C) 2011 STRATO.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include "kerncompat.h"
+#include "ctree.h"
+#include "disk-io.h"
+#include "backref.h"
+#include "ulist.h"
+#include "transaction.h"
+
+#define pr_debug(...) do { } while (0)
+
+struct extent_inode_elem {
+   u64 inum;
+   u64 offset;
+   struct extent_inode_elem *next;
+};
+
+static int check_extent_in_eb(struct btrfs_key *key, struct extent_buffer *eb,
+   struct btrfs_file_extent_item *fi,
+   u64 extent_item_pos,
+   struct extent_inode_elem **eie)
+{
+   u64 offset = 0;
+   struct extent_inode_elem *e;
+
+   if (!btrfs_file_extent_compression(eb, fi) &&
+   !btrfs_file_extent_encryption(eb, fi) &&
+   !btrfs_file_extent_other_encoding(eb, fi)) {
+   u64 data_offset;
+   u64 data_len;
+
+   data_offset = btrfs_file_extent_offset(eb, fi);
+   data_len = btrfs_file_extent_num_bytes(eb, fi);
+
+   if (extent_item_pos < data_offset ||
+   extent_item_pos >= data_offset + data_len)
+   return 1;
+   offset = extent_item_pos - data_offset;
+   }
+
+   e = kmalloc(sizeof(*e), GFP_NOFS);
+   if (!e)
+   return -ENOMEM;
+
+   e->next = *eie;
+   e->inum = key->objectid;
+   e->offset = key->offset + offset;
+   *eie = e;
+
+   return 0;
+}
+
+static void free_inode_elem_list(struct extent_inode_elem *eie)
+{
+   struct extent_inode_elem *eie_next;
+
+   for (; eie; eie = eie_next) {
+   eie_next = eie->next;
+   kfree(eie);
+   }
+}
+
+static int find_extent_in_eb(struct extent_buffer *eb, u64 wanted_disk_byte,
+   u64 extent_item_pos,
+   struct extent_inode_elem **eie)
+{
+   u64 disk_byte;
+   struct btrfs_key key;
+   struct btrfs_file_extent_item *fi;
+   int slot;
+   int nritems;
+   int extent_type;
+   int ret;
+
+   /*
+* from the shared data ref, we only have the leaf but we need
+* the key. thus, we must look into all items and see that we
+* find one (some) with a reference to our extent item.
+*/
+   nritems = btrfs_header_nritems(eb);
+   for (slot = 0; slot < nritems; ++slot) {
+   btrfs_item_key_to_cpu(eb, &key, slot);
+   if (key.type != BTRFS_EXTENT_DATA_KEY)
+   continue;
+   fi = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item);
+   extent_type = btrfs_file_extent_type(eb, fi);
+   if (extent_type == BTRFS_FILE_EXTENT_INLINE)
+   continue;
+   /* don't skip BTRFS_FILE_EXTENT_PREALLOC, we can handle that */
+   disk_byte = btrfs_file_extent_disk_bytenr(eb, fi);
+   

[PATCH 04/12] Btrfs-progs: update rbtree libs

2014-10-10 Thread Josef Bacik
While debugging a broken fs we were seeing hangs in the rb_erase loops.  The
rbtree was simple and wasn't corrupted so it appeared to be a bug in our rbtree
library.  Updating to the kernel's latest rbtree code made the infinite loop go
away, so pull it back.  Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 kerncompat.h   |   5 +
 rbtree.c   | 667 +
 rbtree.h   | 142 
 rbtree_augmented.h | 231 +++
 4 files changed, 695 insertions(+), 350 deletions(-)
 create mode 100644 rbtree_augmented.h

diff --git a/kerncompat.h b/kerncompat.h
index ea34936..a1336de 100644
--- a/kerncompat.h
+++ b/kerncompat.h
@@ -330,6 +330,11 @@ struct __una_u64 { __le64 x; } __attribute__((__packed__));
 #define put_unaligned_le64(val,p) (((struct __una_u64 *)(p))->x = 
cpu_to_le64(val))
 #endif
 
+#ifndef true
+#define true 1
+#define false 0
+#endif
+
 #ifndef noinline
 #define noinline
 #endif
diff --git a/rbtree.c b/rbtree.c
index 6ad800f..92590a5 100644
--- a/rbtree.c
+++ b/rbtree.c
@@ -2,7 +2,8 @@
   Red Black Trees
   (C) 1999  Andrea Arcangeli and...@suse.de
   (C) 2002  David Woodhouse dw...@infradead.org
-  
+  (C) 2012  Michel Lespinasse wal...@google.com
+
   This program is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; either version 2 of the License, or
@@ -20,276 +21,396 @@
   linux/lib/rbtree.c
 */
 
-#include "rbtree.h"
-
-static void __rb_rotate_left(struct rb_node *node, struct rb_root *root)
-{
-   struct rb_node *right = node->rb_right;
-   struct rb_node *parent = rb_parent(node);
+#include "rbtree_augmented.h"
 
-   if ((node->rb_right = right->rb_left))
-   rb_set_parent(right->rb_left, node);
-   right->rb_left = node;
-
-   rb_set_parent(right, parent);
+/*
+ * red-black trees properties:  http://en.wikipedia.org/wiki/Rbtree
+ *
+ *  1) A node is either red or black
+ *  2) The root is black
+ *  3) All leaves (NULL) are black
+ *  4) Both children of every red node are black
+ *  5) Every simple path from root to leaves contains the same number
+ * of black nodes.
+ *
+ *  4 and 5 give the O(log n) guarantee, since 4 implies you cannot have two
+ *  consecutive red nodes in a path and every red node is therefore followed by
+ *  a black. So if B is the number of black nodes on every simple path (as per
+ *  5), then the longest possible path due to 4 is 2B.
+ *
+ *  We shall indicate color with case, where black nodes are uppercase and red
+ *  nodes will be lowercase. Unknown color nodes shall be drawn as red within
+ *  parentheses and have some accompanying text comment.
+ */
 
-   if (parent)
-   {
-   if (node == parent->rb_left)
-   parent->rb_left = right;
-   else
-   parent->rb_right = right;
-   }
-   else
-   root->rb_node = right;
-   rb_set_parent(node, right);
+static inline void rb_set_black(struct rb_node *rb)
+{
+   rb->__rb_parent_color |= RB_BLACK;
 }
 
-static void __rb_rotate_right(struct rb_node *node, struct rb_root *root)
+static inline struct rb_node *rb_red_parent(struct rb_node *red)
 {
-   struct rb_node *left = node->rb_left;
-   struct rb_node *parent = rb_parent(node);
-
-   if ((node->rb_left = left->rb_right))
-   rb_set_parent(left->rb_right, node);
-   left->rb_right = node;
-
-   rb_set_parent(left, parent);
+   return (struct rb_node *)red->__rb_parent_color;
+}
 
-   if (parent)
-   {
-   if (node == parent->rb_right)
-   parent->rb_right = left;
-   else
-   parent->rb_left = left;
-   }
-   else
-   root->rb_node = left;
-   rb_set_parent(node, left);
+/*
+ * Helper function for rotations:
+ * - old's parent and color get assigned to new
+ * - old gets assigned new as a parent and 'color' as a color.
+ */
+static inline void
+__rb_rotate_set_parents(struct rb_node *old, struct rb_node *new,
+   struct rb_root *root, int color)
+{
+   struct rb_node *parent = rb_parent(old);
+   new->__rb_parent_color = old->__rb_parent_color;
+   rb_set_parent_color(old, new, color);
+   __rb_change_child(old, new, parent, root);
 }
 
-void rb_insert_color(struct rb_node *node, struct rb_root *root)
+static __always_inline void
+__rb_insert(struct rb_node *node, struct rb_root *root,
+   void (*augment_rotate)(struct rb_node *old, struct rb_node *new))
 {
-   struct rb_node *parent, *gparent;
-
-   while ((parent = rb_parent(node)) && rb_is_red(parent))
-   {
-   gparent = rb_parent(parent);
-
-   if (parent == gparent->rb_left)
-   {
-   {
-   register struct rb_node *uncle = gparent->rb_right;
-   

[PATCH 03/12] Btrfs-progs: break out rbtree util functions

2014-10-10 Thread Josef Bacik
These were added to deal with duplicated functionality within btrfs-progs, but
we specifically copied rbtree.c from the kernel, so move these functions out
into their own file.  This will make it easier to keep rbtree.c in sync.  
Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 Makefile|  2 +-
 btrfs-list.c|  1 +
 cmds-check.c|  1 +
 disk-io.c   |  1 +
 extent-cache.c  |  1 +
 qgroup-verify.c |  1 +
 rbtree-utils.c  | 82 +
 rbtree-utils.h  | 45 +++
 rbtree.c| 63 
 rbtree.h| 22 
 10 files changed, 133 insertions(+), 86 deletions(-)
 create mode 100644 rbtree-utils.c
 create mode 100644 rbtree-utils.h

diff --git a/Makefile b/Makefile
index f954649..a0afe3b 100644
--- a/Makefile
+++ b/Makefile
@@ -10,7 +10,7 @@ objects = ctree.o disk-io.o radix-tree.o extent-tree.o 
print-tree.o \
  root-tree.o dir-item.o file-item.o inode-item.o inode-map.o \
  extent-cache.o extent_io.o volumes.o utils.o repair.o \
  qgroup.o raid6.o free-space-cache.o list_sort.o props.o \
- ulist.o qgroup-verify.o backref.o
+ ulist.o qgroup-verify.o backref.o rbtree-utils.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
diff --git a/btrfs-list.c b/btrfs-list.c
index 01ccca9..b6b8493 100644
--- a/btrfs-list.c
+++ b/btrfs-list.c
@@ -33,6 +33,7 @@
 #include "utils.h"
 #include <uuid/uuid.h>
 #include "btrfs-list.h"
+#include "rbtree-utils.h"
 
 #define BTRFS_LIST_NFILTERS_INCREASE   (2 * BTRFS_LIST_FILTER_MAX)
 #define BTRFS_LIST_NCOMPS_INCREASE (2 * BTRFS_LIST_COMP_MAX)
diff --git a/cmds-check.c b/cmds-check.c
index a7e0840..ff9c0d5 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -39,6 +39,7 @@
 #include "free-space-cache.h"
 #include "btrfsck.h"
 #include "qgroup-verify.h"
+#include "rbtree-utils.h"
 
 static u64 bytes_used = 0;
 static u64 total_csum_bytes = 0;
diff --git a/disk-io.c b/disk-io.c
index 34c0a97..08be53a 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -34,6 +34,7 @@
 #include "crc32c.h"
 #include "utils.h"
 #include "print-tree.h"
+#include "rbtree-utils.h"
 
 static int check_tree_block(struct btrfs_root *root, struct extent_buffer *buf)
 {
diff --git a/extent-cache.c b/extent-cache.c
index 84de87b..7656ab2 100644
--- a/extent-cache.c
+++ b/extent-cache.c
@@ -19,6 +19,7 @@
 #include <stdlib.h>
 #include "kerncompat.h"
 #include "extent-cache.h"
+#include "rbtree-utils.h"
 
 struct cache_extent_search_range {
u64 objectid;
diff --git a/qgroup-verify.c b/qgroup-verify.c
index 430f099..c0c61d0 100644
--- a/qgroup-verify.c
+++ b/qgroup-verify.c
@@ -28,6 +28,7 @@
 #include "print-tree.h"
 #include "utils.h"
 #include "ulist.h"
+#include "rbtree-utils.h"
 
 #include "qgroup-verify.h"
 
diff --git a/rbtree-utils.c b/rbtree-utils.c
new file mode 100644
index 000..7371bbb
--- /dev/null
+++ b/rbtree-utils.c
@@ -0,0 +1,82 @@
+/*
+ * Copyright (C) 2014 Facebook.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include "rbtree-utils.h"
+
+int rb_insert(struct rb_root *root, struct rb_node *node,
+ rb_compare_nodes comp)
+{
+   struct rb_node **p = &root->rb_node;
+   struct rb_node *parent = NULL;
+   int ret;
+
+   while(*p) {
+   parent = *p;
+
+   ret = comp(parent, node);
+   if (ret < 0)
+   p = &(*p)->rb_left;
+   else if (ret > 0)
+   p = &(*p)->rb_right;
+   else
+   return -EEXIST;
+   }
+
+   rb_link_node(node, parent, p);
+   rb_insert_color(node, root);
+   return 0;
+}
+
+struct rb_node *rb_search(struct rb_root *root, void *key, rb_compare_keys 
comp,
+ struct rb_node **next_ret)
+{
+   struct rb_node *n = root->rb_node;
+   struct rb_node *parent = NULL;
+   int ret = 0;
+
+   while(n) {
+   parent = n;
+
+   ret = comp(n, key);
+   if (ret < 0)
+   n = n->rb_left;
+   else if (ret > 0)
+   n = n->rb_right;
+   else
+ 
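
Note on the patch above: rb_insert() walks down the tree with a pointer-to-pointer, lets a caller-supplied comparator pick the direction, and refuses duplicates with -EEXIST. The sketch below shows only that descent on a plain, unbalanced binary tree. It is not a red-black tree and skips the rb_link_node()/rb_insert_color() step entirely. struct demo_node, demo_insert() and compare_by_key() are hypothetical names, and the comparator's sign convention simply mirrors what is visible in the diff.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical plain binary tree node (no colors, no rebalancing). */
struct demo_node {
	struct demo_node *left, *right;
	int key;
};

/* Comparator takes the node already in the tree first, then the incoming node. */
typedef int (*demo_compare)(const struct demo_node *existing,
			    const struct demo_node *incoming);

/*
 * Descend with a pointer to the link being followed, so the final
 * store works the same for the root and for any child slot.
 * Returns 0 on success or -EEXIST if an equal node is already present.
 */
static int demo_insert(struct demo_node **root, struct demo_node *node,
		       demo_compare comp)
{
	struct demo_node **p = root;

	while (*p) {
		int ret = comp(*p, node);

		if (ret < 0)
			p = &(*p)->left;
		else if (ret > 0)
			p = &(*p)->right;
		else
			return -EEXIST;
	}

	node->left = NULL;
	node->right = NULL;
	*p = node;
	return 0;
}

/* Negative when the incoming key is smaller, so smaller keys go left. */
static int compare_by_key(const struct demo_node *existing,
			  const struct demo_node *incoming)
{
	if (incoming->key < existing->key)
		return -1;
	if (incoming->key > existing->key)
		return 1;
	return 0;
}
```

Passing the comparator as a parameter is what lets one insert routine serve every tree in btrfs-progs that embeds a node in its own record type.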

Re: What is the vision for btrfs fs repair?

2014-10-10 Thread Eric Sandeen
On 10/10/14 2:35 PM, Austin S Hemmelgarn wrote:
> On 2014-10-10 13:43, Bob Marley wrote:
>> On 10/10/2014 16:37, Chris Murphy wrote:
>>> The fail safe behavior is to treat the known good tree root as
>>> the default tree root, and bypass the bad tree root if it cannot
>>> be repaired, so that the volume can be mounted with default mount
>>> options (i.e. the ones in fstab). Otherwise it's a filesystem
>>> that isn't well suited for general purpose use as rootfs let
>>> alone for boot.
>>> 
>> 
>> A filesystem which is suited for general purpose use is a
>> filesystem which honors fsync, and doesn't *ever* auto-roll-back
>> without user intervention.
>> 
>> Anything different is not suited for database transactions at all.
>> Any paid service which has the users database on btrfs is going to
>> be at risk of losing payments, and probably without the company
>> even knowing. If btrfs goes this way I hope a big warning is
>> written on the wiki and on the manpages telling that this
>> filesystem is totally unsuitable for hosting databases performing
>> transactions.
> If they need reliability, they should have some form of redundancy
> in-place and/or run the database directly on the block device;
> because even ext4, XFS, and pretty much every other filesystem can
> lose data sometimes,

Not if i.e. fsync returns.  If the data is gone later, it's a hardware
problem, or occasionally a bug - bugs that are usually found & fixed
pretty quickly.

> the difference being that those tend to give
> worse results when hardware is misbehaving than BTRFS does, because
> BTRFS usually has an old copy of whatever data structure gets
> corrupted to fall back on.

I'm curious, is that based on conjecture or real-world testing?

-Eric
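
Note on the exchange above: the guarantee being argued about is that once fsync() has returned success, the written data is expected to survive a crash. A minimal sketch of an application treating fsync() as its commit point follows, using only POSIX calls. write_durably() is a hypothetical helper, and this illustrates the contract from the application side; it says nothing about how any particular filesystem implements it.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write the whole buffer to path and report success only after
 * fsync() has returned 0. Returns 0 or a negative errno value.
 */
static int write_durably(const char *path, const char *buf, size_t len)
{
	size_t done = 0;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -errno;

	while (done < len) {
		ssize_t n = write(fd, buf + done, len - done);

		if (n < 0) {
			int err = errno;

			if (err == EINTR)
				continue;	/* interrupted, retry the write */
			close(fd);
			return -err;
		}
		done += (size_t)n;
	}

	/* The commit point: do not acknowledge the data before this succeeds. */
	if (fsync(fd) < 0) {
		int err = errno;

		close(fd);
		return -err;
	}

	if (close(fd) < 0)
		return -errno;
	return 0;
}
```

A database would acknowledge a transaction only after a call like this returns 0, which is why an automatic rollback to an older tree root after that point is the concern raised in the thread.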



[GIT PULL] Btrfs

2014-10-10 Thread Chris Mason

Hi Linus,

Please pull my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

This is based on 3.17-rc5 because that's when I forked off for -next,
but I've been testing it against linux-next or 3.17 for a while now.

The largest set of changes here come from Miao Xie.  He's cleaning up
and improving read recovery/repair for raid, and has a number of related
fixes.

I've merged another set of fsync fixes from Filipe, and he's also
improved the way we handle metadata write errors to make sure we force
the FS readonly if things go wrong.

Otherwise we have a collection of fixes and cleanups.  Dave Sterba gets
a cookie for removing the most lines (thanks Dave).

Miao Xie (36) commits (+1193/-537):
Btrfs: cleanup similar code of the buffered data data check and dio read data check (+47/-55)
Btrfs: cleanup double assignment of device->bytes_used when device replace finishes (+0/-1)
Btrfs: cleanup the read failure record after write or when the inode is freeing (+41/-0)
Btrfs: stop mounting the fs if the non-ENOENT errors happen when opening seed fs (+1/-1)
Btrfs: update the comment of total_bytes and disk_total_bytes of btrfs_devie (+2/-2)
Btrfs: make the device lock and its protected data in the same cacheline (+12/-13)
Btrfs: fix unprotected device list access when getting the fs information (+7/-1)
Btrfs: fix use-after-free problem of the device during device replace (+8/-1)
Btrfs: Fix wrong free_chunk_space assignment during removing a device (+0/-5)
Btrfs: Cleanup unused variant and argument of IO failure handlers (+10/-16)
Btrfs: cleanup unused latest_devid and latest_trans in fs_devices (+11/-34)
Btrfs: Fix the problem that the dirty flag of dev stats is cleared (+20/-6)
Btrfs: load checksum data once when submitting a direct read io (+35/-34)
Btrfs: fix unprotected device list access when cloning fs devices (+3/-0)
Btrfs: fix wrong generation check of super block on a seed device (+5/-1)
Btrfs: fix missing error handler if submiting re-read bio fails (+5/-0)
Btrfs: Set real mirror number for read operation on RAID0/5/6 (+5/-0)
Btrfs: fix unprotected device's variants on 32bits machine (+124/-29)
Btrfs: modify clean_io_failure and make it suit direct io (+19/-21)
Btrfs: make the logic of source device removing more clear (+8/-14)
Btrfs: update free_chunk_space during allocting a new chunk (+5/-5)
Btrfs: implement repair function when direct read fails (+281/-29)
Btrfs: modify repair_io_failure and make it suit direct io (+7/-4)
Btrfs: modify rw_devices counter under chunk_mutex context (+2/-2)
Btrfs: move the missing device to its own fs device list (+52/-26)
Btrfs: split bio_readpage_error into several functions (+123/-64)
Btrfs: fix unprotected assignment of the target device (+28/-28)
Btrfs: fix wrong device bytes_used in the super block (+37/-1)
Btrfs: fix wrong disk size when writing super blocks (+83/-5)
Btrfs: cleanup unused num_can_discard in fs_devices (+2/-15)
Btrfs: fix unprotected system chunk array insertion (+6/-1)
Btrfs: fix writing data into the seed filesystem (+36/-16)
Btrfs: fix unprotected device->bytes_used update (+3/-1)
Btrfs: do file data check by sub-bio's self (+87/-29)
Btrfs: fix wrong fsid check of scrub (+13/-5)
Btrfs: Fix misuse of chunk mutex (+65/-72)

David Sterba (32) commits (+507/-584):
btrfs: remove unused parameter blocksize from btrfs_find_tree_block (+9/-12)
btrfs: remove unlikely from data-dependent branches and slow paths (+10/-10)
btrfs: remove obsolete comment in btrfs_clean_one_deleted_snapshot (+1/-4)
btrfs: drop constant param from btrfs_release_extent_buffer_page (+6/-9)
btrfs: remove blocksize from btrfs_alloc_free_block and rename (+21/-27)
btrfs: clenaup: don't call btrfs_release_path before free_path (+0/-1)
btrfs: remove stale define after removing ordered operations (+0/-7)
btrfs: hide typecast to definition of BTRFS_SEND_TRANS_STUB (+4/-5)
btrfs: remove unused parameter from readahead_tree_block (+8/-16)
btrfs: remove parameter blocksize from read_tree_block (+18/-37)
btrfs: use DIV_ROUND_UP instead of open-coded variants (+22/-32)
btrfs: remove unused members from struct scrub_warning (+2/-15)
btrfs: remove unused variable from btrfs_parse_options (+1/-3)
btrfs: inline code of reada_tree_block and remove it (+2/-10)
btrfs: wake up transaction thread from SYNC_FS ioctl (+6/-0)
btrfs: new define for the inline extent data start (+9/-10)
btrfs: defrag, use unsigned type for extent thresh (+4/-4)
btrfs: move checks for DUMMY_ROOT into a helper (+23/-21)
btrfs: use nodesize everywhere, kill leafsize (+89/-141)
btrfs: cleanup ino cache members of btrfs_root (+52/-52)
btrfs: use enum for wq endio metadata type (+14/-18)
btrfs: return void from readahead_tree_block (+4/-8)