Re: Latest kernel to use?

2015-09-25 Thread Sjoerd
On Friday 25 September 2015 13:51:34 Hugo Mills wrote:
> On Fri, Sep 25, 2015 at 03:36:18PM +0200, Sjoerd wrote:
> > Thanks all for the feedback. Still doubting though to go for 4.2.1 or not.
> > Main reason is that I am currently running 4.1.7 on my laptop which seems
> > to work fine and had some issues with the 4.2.0 kernel. No issues I think
> > that were btrfs related, but more related to my nvidia card. Anyway
> > switching back to 4.1.7 resolved those, so I am a bit holding back to try
> > the 4.2.1 version ;)
> > Anyway I'll see and can always revert back if I don't like it ;)
> 
>    If 4.1.7 is working OK for you, stick with it. It's getting much
> less important now, as btrfs matures, to keep up with the _very_
> latest. Purely on gut feeling about issues we see on IRC and here,
> 3.19 or later would be reasonable at the moment.
> 

OK, I'll stick with the longterm 4.1.x branch then.

>    Compared to, say, 3 or 4 years ago, when running late -rc kernels
> was often preferable to running the latest stable, things have
> improved quite a bit. :)

Good to know and thanks for the feedback :)

Cheers,
Sjoerd

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Latest kernel to use?

2015-09-25 Thread Bostjan Skufca
On 25 September 2015 at 15:51, Hugo Mills  wrote:
> On Fri, Sep 25, 2015 at 03:36:18PM +0200, Sjoerd wrote:
>> Thanks all for the feedback. Still doubting though to go for 4.2.1 or not.
>> Main reason is that I am currently running 4.1.7 on my laptop which seems to
>> work fine and had some issues with the 4.2.0 kernel. No issues I think that
>> were btrfs related, but more related to my nvidia card. Anyway switching back
>> to 4.1.7 resolved those, so I am a bit holding back to try the 4.2.1 version
>> ;)
>> Anyway I'll see and can always revert back if I don't like it ;)
>
>    If 4.1.7 is working OK for you, stick with it. It's getting much
> less important now, as btrfs matures, to keep up with the _very_
> latest. Purely on gut feeling about issues we see on IRC and here,
> 3.19 or later would be reasonable at the moment.

Similar here: I am sticking with 3.19.2 which has proven to work fine
for me (backup systems with btrfs on lvm, lots of snapshots/subvolumes
and occasional rebalance, no fancy/fresh stuff like btrfs-raid, online
compression or subvolume quota, though this last one is tempting in my
use case).

b.


[PATCH V2] Btrfs: add a flags field to btrfs_transaction

2015-09-25 Thread Josef Bacik
I want to set some per transaction flags, so instead of adding yet another int
lets just convert the current two int indicators to flags and add a flags field
for future use.  Thanks,

Signed-off-by: Josef Bacik 
---
V1->V2: set the wrong bit in my conversion

 fs/btrfs/extent-tree.c |  5 +++--
 fs/btrfs/transaction.c | 18 --
 fs/btrfs/transaction.h |  9 -
 fs/btrfs/volumes.c |  2 +-
 4 files changed, 16 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index eca1840..78a1504 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4024,7 +4024,8 @@ commit_trans:
if (IS_ERR(trans))
return PTR_ERR(trans);
if (have_pinned_space >= 0 ||
-   trans->transaction->have_free_bgs ||
+   test_bit(BTRFS_TRANS_HAVE_FREE_BGS,
+            &trans->transaction->flags) ||
need_commit > 0) {
ret = btrfs_commit_transaction(trans, root);
if (ret)
@@ -9049,7 +9050,7 @@ again:
 * back off and let this transaction commit
 */
mutex_lock(&root->fs_info->ro_block_group_mutex);
-   if (trans->transaction->dirty_bg_run) {
+   if (test_bit(BTRFS_TRANS_DIRTY_BG_RUN, &trans->transaction->flags)) {
u64 transid = trans->transid;
 
mutex_unlock(&root->fs_info->ro_block_group_mutex);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index df1e61e..debfc99 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -230,9 +230,8 @@ loop:
 * commit the transaction.
 */
atomic_set(&cur_trans->use_count, 2);
-   cur_trans->have_free_bgs = 0;
+   cur_trans->flags = 0;
cur_trans->start_time = get_seconds();
-   cur_trans->dirty_bg_run = 0;
 
cur_trans->delayed_refs.href_root = RB_ROOT;
cur_trans->delayed_refs.dirty_extent_root = RB_ROOT;
@@ -1846,7 +1845,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
return ret;
}
 
-   if (!cur_trans->dirty_bg_run) {
+   if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, &cur_trans->flags)) {
int run_it = 0;
 
/* this mutex is also taken before trying to set
@@ -1855,18 +1854,17 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 * after a extents from that block group have been
 * allocated for cache files.  btrfs_set_block_group_ro
 * will wait for the transaction to commit if it
-* finds dirty_bg_run = 1
+* finds BTRFS_TRANS_DIRTY_BG_RUN set.
 *
-* The dirty_bg_run flag is also used to make sure only
-* one process starts all the block group IO.  It wouldn't
+* The BTRFS_TRANS_DIRTY_BG_RUN flag is also used to make sure
+* only one process starts all the block group IO.  It wouldn't
 * hurt to have more than one go through, but there's no
 * real advantage to it either.
 */
mutex_lock(&root->fs_info->ro_block_group_mutex);
-   if (!cur_trans->dirty_bg_run) {
+   if (!test_and_set_bit(BTRFS_TRANS_DIRTY_BG_RUN,
+ &cur_trans->flags))
run_it = 1;
-   cur_trans->dirty_bg_run = 1;
-   }
mutex_unlock(&root->fs_info->ro_block_group_mutex);
 
if (run_it)
@@ -2134,7 +2132,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 
btrfs_finish_extent_commit(trans, root);
 
-   if (cur_trans->have_free_bgs)
+   if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, &cur_trans->flags))
btrfs_clear_space_info_full(root->fs_info);
 
root->fs_info->last_trans_committed = cur_trans->transid;
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index f158ab4..02c6ca1 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -32,6 +32,9 @@ enum btrfs_trans_state {
TRANS_STATE_MAX = 6,
 };
 
+#define BTRFS_TRANS_HAVE_FREE_BGS  0
+#define BTRFS_TRANS_DIRTY_BG_RUN   1
+
 struct btrfs_transaction {
u64 transid;
/*
@@ -47,10 +50,7 @@ struct btrfs_transaction {
atomic_t num_writers;
atomic_t use_count;
 
-   /*
-* true if there is free bgs operations in this transaction
-*/
-   int have_free_bgs;
+   unsigned long flags;
 
/* Be protected by fs_info->trans_lock when we want to change it. */
enum btrfs_trans_state state;
@@ -78,7 +78,6 @@ struct btrfs_transaction {
spinlock_t dropped_roots_lock;
struct btrfs_delayed_ref_root delayed_refs;
int aborted;
-   int 

Re: [PATCH 0/4] Fix for btrfs-convert chunk type and fsck support

2015-09-25 Thread David Sterba
On Thu, Sep 24, 2015 at 09:18:51AM +0800, Qu Wenruo wrote:
> After some ext* lecture given by my teammate, Wang Xiaoguang, I'm more 
> convinced that, at least for convert from ext*, separate chunk type will 
> not be a good idea.

Thanks for the additional research.

> For the above almost-empty ext4 case, it will cause fewer problems, as
> we can batch several groups and put a large data chunk to cover them,
> then allocate a 100+M metadata chunk.

> But if the filesystem is used and most groups only have scattered
> available space, we may end up unable to allocate any metadata chunk.

I think this is the price we pay for in-place conversion: we can
safely use only the free space and respect the constraints it gives.
As long as the ext2_subvol exists, i.e. the original data and metadata
are pinned, we don't have much choice. Rebalancing is possible but,
given the remaining space, may fail to allocate block groups of the
desired size.

> This will make such limited block groups of little practical use.
> Although I'll continue adding such support for btrfs-convert, I'm quite
> concerned about the usability...

So can we do it like this:

1) enable support for mixed bg in convert
2) implement mixed -> split conversion in balance
3) force convert to do mixed bgs by default

The conversion from/to mixed would be good on its own but may not be
trivial to implement. If the main concern about the result of the
conversion is a bad bg layout, I'd say we rely on balance to reshuffle
the bgs and bring the filesystem closer to the desired layout.


Re: [PATCH 0/4] Fix for btrfs-convert chunk type and fsck support

2015-09-25 Thread David Sterba
On Thu, Sep 10, 2015 at 10:34:13AM +0800, Qu Wenruo wrote:
>   btrfs-progs: fsck: Add check for extent and parent chunk type
>   btrfs-progs: utils: Check nodesize against features

Applied the two, thanks.

>   btrfs-progs: convert: force convert to used mixed block group
>   btrfs-progs: util: add parameter for btrfs_list_all_fs_features
>   btrfs-progs: convert-test: Disable different nodesize test


Re: [PATCH] Btrfs: change how we wait for pending ordered extents

2015-09-25 Thread Holger Hoffstätte
On 09/24/15 22:56, Josef Bacik wrote:
> We have a mechanism to make sure we don't lose updates for ordered extents 
> that
> were logged in the transaction that is currently running.  We add the ordered
> extent to a transaction list and then the transaction waits on all the ordered
> extents in that list.  However on substantially large file systems this list
> can be extremely large, and can give us soft lockups, since the ordered 
> extents
> don't remove themselves from the list when they do complete.
> 
> To fix this we simply add a counter to the transaction that is incremented any
> time we have a logged extent that needs to be completed in the current
> transaction.  Then when the ordered extent finally completes it decrements the
> per transaction counter and wakes up the transaction if we are the last ones.
> This will eliminate the softlockup.  Thanks,

Tried this and unexpectedly didn't get any lockups or 'splosions during normal
operation, but balance now seems very slow and sits idle most of the time.

Running balance without filters on a previously balanced but empty fs:
$time btrfs filesystem balance start /mnt/usb
Done, had to relocate 3 out of 3 chunks
btrfs filesystem balance start /mnt/usb  0.01s user 0.00s system 0% cpu 1:00.91 total

On a completely new fs:
$mkfs.btrfs -L Test -msingle -dsingle -f /dev/sdf1
$mount /dev/sdf1 /mnt/usb
$time btrfs filesystem balance start /mnt/usb
Done, had to relocate 4 out of 4 chunks
btrfs filesystem balance start /mnt/usb  0.01s user 0.00s system 0% cpu 1:30.88 total

Time seems suspiciously in tune with (#chunks) * (tx commits every 30 seconds),
maybe needs a few more wakeups?

hope this helps :)

Holger



Re: [PATCH] Btrfs: change how we wait for pending ordered extents

2015-09-25 Thread Holger Hoffstätte
On 09/25/15 13:05, Holger Hoffstätte wrote:
> Tried this and unexpectedly didn't get any lockups or 'splosions during normal
> operation, but balance now seems very slow and sits idle most of the time.

Meh.. this doesn't seem to have anything to do with this particular patch after
all. Whoopdedoo. Sorry for the noise.

-h



Re: [PATCH v2 2/2] btrfs-progs: Modify confuse error message in scrub

2015-09-25 Thread David Sterba
On Thu, Aug 06, 2015 at 11:05:55AM +0800, Zhao Lei wrote:
> Scrub output following error message in my test:
>   ERROR: scrubbing /var/ltf/tester/scratch_mnt failed for device id 5 (Success)
> 
> It is caused by a broken kernel and fs, but we need to avoid
> outputting both "error" and "success" in a one-line message as above.
> 
> This patch modifies the above message to:
>   ERROR: scrubbing /var/ltf/tester/scratch_mnt failed for device id 5, ret=1, errno=0(Success)
> 
> Signed-off-by: Zhao Lei 

Applied, thanks.


Re: [PATCH 2/4] btrfs-progs: Read the whole superblock instead of struct btrfs_super_block.

2015-09-25 Thread David Sterba
On Wed, May 13, 2015 at 05:15:34PM +0800, Qu Wenruo wrote:
> Before the patch, btrfs-progs will only read
> sizeof(struct btrfs_super_block) and store it into super_copy.
> 
> This makes checksum check for superblock impossible.
> Change it to read the whole superblock.
> 
> Signed-off-by: Qu Wenruo 

I've applied this despite the issues on ppc64. This patch is
required for the extended superblock checks in the next patch.


[PATCH v3 9/9] btrfs: btrfs_copy_file_range() only supports reflinks

2015-09-25 Thread Anna Schumaker
Reject copies that don't have the COPY_FR_REFLINK flag set.

Signed-off-by: Anna Schumaker 
Reviewed-by: David Sterba 
---
 fs/btrfs/ioctl.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 4311554..2e14b91 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -3848,6 +3849,9 @@ ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
 {
ssize_t ret;
 
+   if (!(flags & COPY_FR_REFLINK))
+   return -EOPNOTSUPP;
+
ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out);
if (ret == 0)
ret = len;
-- 
2.5.3



[PATCH v3 3/9] btrfs: add .copy_file_range file operation

2015-09-25 Thread Anna Schumaker
From: Zach Brown 

This rearranges the existing COPY_RANGE ioctl implementation so that the
.copy_file_range file operation can call the core loop that copies file
data extent items.

The extent copying loop is lifted up into its own function.  It retains
the core btrfs error checks that should be shared.

Signed-off-by: Zach Brown 
Signed-off-by: Anna Schumaker 
Reviewed-by: Josef Bacik 
Reviewed-by: David Sterba 
---
 fs/btrfs/ctree.h |  3 ++
 fs/btrfs/file.c  |  1 +
 fs/btrfs/ioctl.c | 91 
 3 files changed, 56 insertions(+), 39 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 938efe3..5d06a4f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3996,6 +3996,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct 
inode *inode,
  loff_t pos, size_t write_bytes,
  struct extent_state **cached);
 int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, int flags);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b823fac..b05449c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2816,6 +2816,7 @@ const struct file_operations btrfs_file_operations = {
 #ifdef CONFIG_COMPAT
.compat_ioctl   = btrfs_ioctl,
 #endif
+   .copy_file_range = btrfs_copy_file_range,
 };
 
 void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0adf542..4311554 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3727,17 +3727,16 @@ out:
return ret;
 }
 
-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
-  u64 off, u64 olen, u64 destoff)
+static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
+   u64 off, u64 olen, u64 destoff)
 {
struct inode *inode = file_inode(file);
+   struct inode *src = file_inode(file_src);
struct btrfs_root *root = BTRFS_I(inode)->root;
-   struct fd src_file;
-   struct inode *src;
int ret;
u64 len = olen;
u64 bs = root->fs_info->sb->s_blocksize;
-   int same_inode = 0;
+   int same_inode = src == inode;
 
/*
 * TODO:
@@ -3750,49 +3749,20 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 *   be either compressed or non-compressed.
 */
 
-   /* the destination must be opened for writing */
-   if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
-   return -EINVAL;
-
if (btrfs_root_readonly(root))
return -EROFS;
 
-   ret = mnt_want_write_file(file);
-   if (ret)
-   return ret;
-
-   src_file = fdget(srcfd);
-   if (!src_file.file) {
-   ret = -EBADF;
-   goto out_drop_write;
-   }
-
-   ret = -EXDEV;
-   if (src_file.file->f_path.mnt != file->f_path.mnt)
-   goto out_fput;
-
-   src = file_inode(src_file.file);
-
-   ret = -EINVAL;
-   if (src == inode)
-   same_inode = 1;
-
-   /* the src must be open for reading */
-   if (!(src_file.file->f_mode & FMODE_READ))
-   goto out_fput;
+   if (file_src->f_path.mnt != file->f_path.mnt ||
+   src->i_sb != inode->i_sb)
+   return -EXDEV;
 
/* don't make the dst file partly checksummed */
if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
-   goto out_fput;
+   return -EINVAL;
 
-   ret = -EISDIR;
if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
-   goto out_fput;
-
-   ret = -EXDEV;
-   if (src->i_sb != inode->i_sb)
-   goto out_fput;
+   return -EISDIR;
 
if (!same_inode) {
btrfs_double_inode_lock(src, inode);
@@ -3869,6 +3839,49 @@ out_unlock:
btrfs_double_inode_unlock(src, inode);
else
mutex_unlock(&src->i_mutex);
+   return ret;
+}
+
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, int flags)
+{
+   ssize_t ret;
+
+   ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out);
+   if (ret == 0)
+   ret = len;
+   return ret;
+}
+
+static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
+  u64 off, u64 olen, u64 destoff)
+{
+ 

[PATCH v3 6/9] vfs: Copy should use file_out rather than file_in

2015-09-25 Thread Anna Schumaker
The way to think about this is that the destination filesystem reads the
data from the source file and processes it accordingly.  This is
especially important to avoid an infinite loop when doing a "server to
server" copy on NFS.

Signed-off-by: Anna Schumaker 
---
 fs/read_write.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 8e7cb33..6f74f1f 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1355,7 +1355,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (!(file_in->f_mode & FMODE_READ) ||
!(file_out->f_mode & FMODE_WRITE) ||
(file_out->f_flags & O_APPEND) ||
-   !file_in->f_op || !file_in->f_op->copy_file_range)
+   !file_out->f_op || !file_out->f_op->copy_file_range)
return -EBADF;
 
inode_in = file_inode(file_in);
@@ -1378,8 +1378,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (ret)
return ret;
 
-   ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
-len, flags);
+   ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
+ len, flags);
if (ret > 0) {
fsnotify_access(file_in);
add_rchar(current, ret);
-- 
2.5.3



[PATCH v3 1/9] vfs: add copy_file_range syscall and vfs helper

2015-09-25 Thread Anna Schumaker
From: Zach Brown 

Add a copy_file_range() system call for offloading copies between
regular files.

This gives an interface to underlying layers of the storage stack which
can copy without reading and writing all the data.  There are a few
candidates that should support copy offloading in the nearer term:

- btrfs shares extent references with its clone ioctl
- NFS has patches to add a COPY command which copies on the server
- SCSI has a family of XCOPY commands which copy in the device

This system call avoids the complexity of also accelerating the creation
of the destination file by operating on an existing destination file
descriptor, not a path.

Currently the high level vfs entry point limits copy offloading to files
on the same mount and super (and not in the same file).  This can be
relaxed if we get implementations which can copy between file systems
safely.

Signed-off-by: Zach Brown 
[Anna Schumaker: Change -EINVAL to -EBADF during file verification]
[Anna Schumaker: Change flags parameter from int to unsigned int]
Signed-off-by: Anna Schumaker 
---
v3:
- Change flags parameter to take an unsigned int instead of an int
---
 fs/read_write.c   | 129 ++
 include/linux/fs.h|   3 +
 include/uapi/asm-generic/unistd.h |   2 +
 kernel/sys_ni.c   |   1 +
 4 files changed, 135 insertions(+)

diff --git a/fs/read_write.c b/fs/read_write.c
index 819ef3f..dd10750 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 #include 
@@ -1327,3 +1328,131 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
return do_sendfile(out_fd, in_fd, NULL, count, 0);
 }
 #endif
+
+/*
+ * copy_file_range() differs from regular file read and write in that it
+ * specifically allows returning partial success.  When it does so is up to
+ * the copy_file_range method.
+ */
+ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
+   struct file *file_out, loff_t pos_out,
+   size_t len, unsigned int flags)
+{
+   struct inode *inode_in;
+   struct inode *inode_out;
+   ssize_t ret;
+
+   if (flags)
+   return -EINVAL;
+
+   if (len == 0)
+   return 0;
+
+   /* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */
+   ret = rw_verify_area(READ, file_in, &pos_in, len);
+   if (ret >= 0)
+   ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+   if (ret < 0)
+   return ret;
+
+   if (!(file_in->f_mode & FMODE_READ) ||
+   !(file_out->f_mode & FMODE_WRITE) ||
+   (file_out->f_flags & O_APPEND) ||
+   !file_in->f_op || !file_in->f_op->copy_file_range)
+   return -EBADF;
+
+   inode_in = file_inode(file_in);
+   inode_out = file_inode(file_out);
+
+   /* make sure offsets don't wrap and the input is inside i_size */
+   if (pos_in + len < pos_in || pos_out + len < pos_out ||
+   pos_in + len > i_size_read(inode_in))
+   return -EINVAL;
+
+   /* this could be relaxed once a method supports cross-fs copies */
+   if (inode_in->i_sb != inode_out->i_sb ||
+   file_in->f_path.mnt != file_out->f_path.mnt)
+   return -EXDEV;
+
+   /* forbid ranges in the same file */
+   if (inode_in == inode_out)
+   return -EINVAL;
+
+   ret = mnt_want_write_file(file_out);
+   if (ret)
+   return ret;
+
+   ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
+len, flags);
+   if (ret > 0) {
+   fsnotify_access(file_in);
+   add_rchar(current, ret);
+   fsnotify_modify(file_out);
+   add_wchar(current, ret);
+   }
+   inc_syscr(current);
+   inc_syscw(current);
+
+   mnt_drop_write_file(file_out);
+
+   return ret;
+}
+EXPORT_SYMBOL(vfs_copy_file_range);
+
+SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in,
+   int, fd_out, loff_t __user *, off_out,
+   size_t, len, unsigned int, flags)
+{
+   loff_t pos_in;
+   loff_t pos_out;
+   struct fd f_in;
+   struct fd f_out;
+   ssize_t ret;
+
+   f_in = fdget(fd_in);
+   f_out = fdget(fd_out);
+   if (!f_in.file || !f_out.file) {
+   ret = -EBADF;
+   goto out;
+   }
+
+   ret = -EFAULT;
+   if (off_in) {
+   if (copy_from_user(&pos_in, off_in, sizeof(loff_t)))
+   goto out;
+   } else {
+   pos_in = f_in.file->f_pos;
+   }
+
+   if (off_out) {
+   if (copy_from_user(&pos_out, off_out, sizeof(loff_t)))
+   goto out;
+   } else {
+   pos_out = 

Re: strange i/o errors with btrfs on raid/lvm

2015-09-25 Thread Chris Murphy
On Fri, Sep 25, 2015 at 2:26 PM, Jogi Hofmüller  wrote:

> That was right while the RAID was in degraded state and rebuilding.

On the guest:

Aug 28 05:17:01 vm kernel: [140683.741688] BTRFS info (device vdc):
disk space caching is enabled
Aug 28 05:17:13 vm kernel: [140695.575896] BTRFS warning (device vdc):
block group 13988003840 has wrong amount of free space
Aug 28 05:17:13 vm kernel: [140695.575901] BTRFS warning (device vdc):
failed to load free space cache for block group 13988003840, rebuild
it now
Aug 28 05:17:13 vm kernel: [140695.626035] BTRFS warning (device vdc):
block group 17209229312 has wrong amount of free space
Aug 28 05:17:13 vm kernel: [140695.626039] BTRFS warning (device vdc):
failed to load free space cache for block group 17209229312, rebuild
it now
Aug 28 05:17:13 vm kernel: [140695.683517] BTRFS warning (device vdc):
block group 20430454784 has wrong amount of free space
Aug 28 05:17:13 vm kernel: [140695.683521] BTRFS warning (device vdc):
failed to load free space cache for block group 20430454784, rebuild
it now
Aug 28 05:17:13 vm kernel: [140695.822818] BTRFS warning (device vdc):
block group 68211965952 has wrong amount of free space
Aug 28 05:17:13 vm kernel: [140695.822822] BTRFS warning (device vdc):
failed to load free space cache for block group 68211965952, rebuild
it now



On the host, there are no messages that correspond to this time index,
but a bit over an hour and a half later are when there are sas error
messages, and the first reported write error.

I see the rebuild event starting:

Aug 28 07:04:23 host mdadm[2751]: RebuildStarted event detected on md
device /dev/md/0

But there are subsequent sas errors still, including hard resetting of
the link, and additional read errors. This continues more than once...

And then

Aug 28 17:06:49 host mdadm[2751]: RebuildFinished event detected on md
device /dev/md/0, component device  mismatches found: 2048 (on raid
level 10)
Aug 28 17:06:49 host mdadm[2751]: SpareActive event detected on md
device /dev/md/0, component device /dev/sdd1

and also a number of SMART warnings about seek error on another device

Aug 28 17:35:55 host smartd[3146]: Device: /dev/sda [SAT], SMART Usage
Attribute: 7 Seek_Error_Rate changed from 180 to 179


So it sounds like more than one problem, either two drives, or maybe
even a controller problem, I can't really tell as there are lots of
messages.

But 2048 mismatches found after a rebuild is a problem. So there's
already some discrepancy in the mdadm raid10. And mdadm raid1 (or 10)
cannot resolve mismatches because which block is correct is ambiguous.
So that means something is definitely going to get corrupted. Btrfs, if
the metadata profile is DUP, can recover from that. But data can't.
Normally this results in an explicit Btrfs message about a checksum
mismatch and no ability to fix it, but it will still report the path to
the affected file.  But I'm not finding that.

Anyway, if the hardware errors are resolved, try doing a scrub on the
file system, maybe when there's a period of reduced usage, to make
sure data and metadata are OK.

And then you could also manually reset the free space cache by
unmounting and then remounting with -o clear_cache. This is a one-time
thing; you do not need to keep the mount option afterwards.



-- 
Chris Murphy


[PATCH v3 7/9] vfs: Remove copy_file_range mountpoint checks

2015-09-25 Thread Anna Schumaker
I still want to do an in-kernel copy even if the files are on different
mountpoints, and NFS has a "server to server" copy that expects two
files on different mountpoints.  Let's have individual filesystems
implement this check instead.

Signed-off-by: Anna Schumaker 
Reviewed-by: David Sterba 
---
 fs/read_write.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 6f74f1f..ee9fa37 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1366,11 +1366,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
pos_in + len > i_size_read(inode_in))
return -EINVAL;
 
-   /* this could be relaxed once a method supports cross-fs copies */
-   if (inode_in->i_sb != inode_out->i_sb ||
-   file_in->f_path.mnt != file_out->f_path.mnt)
-   return -EXDEV;
-
if (len == 0)
return 0;
 
-- 
2.5.3



[PATCH v3 10/9] copy_file_range.2: New page documenting copy_file_range()

2015-09-25 Thread Anna Schumaker
copy_file_range() is a new system call for copying ranges of data
completely in the kernel.  This gives filesystems an opportunity to
implement some kind of "copy acceleration", such as reflinks or
server-side-copy (in the case of NFS).

Signed-off-by: Anna Schumaker 
---
v3:
- Added license information
- Updated splice(2)
- Various other edits after mailing list discussion
---
 man2/copy_file_range.2 | 211 +
 man2/splice.2  |   1 +
 2 files changed, 212 insertions(+)
 create mode 100644 man2/copy_file_range.2

diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
new file mode 100644
index 000..6d66d4a
--- /dev/null
+++ b/man2/copy_file_range.2
@@ -0,0 +1,211 @@
+.\" This manpage is Copyright (C) 2015 Anna Schumaker
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of
+.\" this manual under the conditions for verbatim copying, provided that
+.\" the entire resulting derived work is distributed under the terms of
+.\" a permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume
+.\" no responsibility for errors or omissions, or for damages resulting
+.\" from the use of the information contained herein.  The author(s) may
+.\" not have taken the same level of care in the production of this
+.\" manual, which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH COPY 2 2015-08-31 "Linux" "Linux Programmer's Manual"
+.SH NAME
+copy_file_range \- Copy a range of data from one file to another
+.SH SYNOPSIS
+.nf
+.B #include 
+.B #include 
+.B #include 
+
+.BI "ssize_t copy_file_range(int " fd_in ", loff_t *" off_in ", int " fd_out ",
+.BI "loff_t *" off_out ", size_t " len \
+", unsigned int " flags );
+.fi
+.SH DESCRIPTION
+The
+.BR copy_file_range ()
+system call performs an in-kernel copy between two file descriptors
+without the additional cost of transferring data from the kernel to userspace
+and then back into the kernel.
+It copies up to
+.I len
+bytes of data from file descriptor
+.I fd_in
+to file descriptor
+.IR fd_out ,
+overwriting any data that exists within the requested range of the target file.
+
+The following semantics apply for
+.IR off_in ,
+and similar statements apply to
+.IR off_out :
+.IP * 3
+If
+.I off_in
+is NULL, then bytes are read from
+.I fd_in
+starting from the current file offset, and the offset is
+adjusted by the number of bytes copied.
+.IP *
+If
+.I off_in
+is not NULL, then
+.I off_in
+must point to a buffer that specifies the starting
+offset where bytes from
+.I fd_in
+will be read.  The current file offset of
+.I fd_in
+is not changed, but
+.I off_in
+is adjusted appropriately.
+.PP
+
+The
+.I flags
+argument can have one of the following flags set:
+.TP 1.9i
+.B COPY_FR_COPY
+Copy all the file data in the requested range.
+Some filesystems might be able to accelerate this copy
+to avoid unnecessary data transfers.
+.TP
+.B COPY_FR_REFLINK
+Create a lightweight "reflink", where data is not copied until
+one of the files is modified.
+.PP
+The default behavior
+.RI ( flags
+== 0) is to try creating a reflink,
+and if reflinking fails
+.BR copy_file_range ()
+will fall back to performing a full data copy.
+.SH RETURN VALUE
+Upon successful completion,
+.BR copy_file_range ()
+will return the number of bytes copied between files.
+This could be less than the length originally requested.
+
+On error,
+.BR copy_file_range ()
+returns \-1 and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B EBADF
+One or more file descriptors are not valid; or
+.I fd_in
+is not open for reading; or
+.I fd_out
+is not open for writing.
+.TP
+.B EINVAL
+Requested range extends beyond the end of the source file; or the
+.I flags
+argument is set to an invalid value.
+.TP
+.B EIO
+A low level I/O error occurred while copying.
+.TP
+.B ENOMEM
+Out of memory.
+.TP
+.B ENOSPC
+There is not enough space on the target filesystem to complete the copy.
+.TP
+.B EOPNOTSUPP
+.B COPY_FR_REFLINK
+was specified in
+.IR flags ,
+but the target filesystem does not support reflinks.
+.TP
+.B EXDEV
+Target filesystem doesn't support cross-filesystem copies.
+.SH VERSIONS
+The
+.BR copy_file_range ()
+system call first appeared in Linux 4.4.
+.SH CONFORMING TO
+The
+.BR copy_file_range ()
+system call is a nonstandard Linux extension.
+.SH EXAMPLE
+.nf
+#define _GNU_SOURCE
+#include 

[PATCH v3 2/9] x86: add sys_copy_file_range to syscall tables

2015-09-25 Thread Anna Schumaker
From: Zach Brown 

Add sys_copy_file_range to the x86 syscall tables.

Signed-off-by: Zach Brown 
[Anna Schumaker: Update syscall number in syscall_32.tbl]
Signed-off-by: Anna Schumaker 
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 7663c45..0531270 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -382,3 +382,4 @@
 373	i386	shutdown		sys_shutdown
 374	i386	userfaultfd		sys_userfaultfd
 375	i386	membarrier		sys_membarrier
+376	i386	copy_file_range		sys_copy_file_range
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 278842f..03a9396 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -331,6 +331,7 @@
 322	64	execveat		stub_execveat
 323	common	userfaultfd		sys_userfaultfd
 324	common	membarrier		sys_membarrier
+325	common	copy_file_range		sys_copy_file_range
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.5.3



[PATCH v3 8/9] vfs: copy_file_range() can do a pagecache copy with splice

2015-09-25 Thread Anna Schumaker
The NFS server will need some kind of fallback for filesystems that don't
have any kind of copy acceleration, and it should be generally useful to
have an in-kernel copy to avoid lots of switches between kernel and user
space.

I make this configurable by adding two new flags.  Users who only want a
reflink can pass COPY_FR_REFLINK, and users who want a full data copy can
pass COPY_FR_COPY.  The default (flags=0) means to first attempt a
reflink, but use the pagecache if that fails.

I moved the rw_verify_area() calls into the fallback code since some
filesystems can handle reflinking a large range.

Signed-off-by: Anna Schumaker 
---
v3:
- Check that both filesystems have the same filesystem type
- Add COPY_FR_DEDUPE flag for Darrick
- Check that at most one flag is set at a time
---
 fs/read_write.c   | 61 +++
 include/linux/copy.h  |  6 +
 include/uapi/linux/Kbuild |  1 +
 include/uapi/linux/copy.h |  8 +++
 4 files changed, 56 insertions(+), 20 deletions(-)
 create mode 100644 include/linux/copy.h
 create mode 100644 include/uapi/linux/copy.h

diff --git a/fs/read_write.c b/fs/read_write.c
index ee9fa37..a0fd9dc 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -7,6 +7,7 @@
 #include  
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1329,6 +1330,29 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
 }
 #endif
 
+static ssize_t vfs_copy_file_pagecache(struct file *file_in, loff_t pos_in,
+  struct file *file_out, loff_t pos_out,
+  size_t len)
+{
+   ssize_t ret;
+
+   ret = rw_verify_area(READ, file_in, &pos_in, len);
+   if (ret >= 0) {
+   len = ret;
+   ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+   if (ret >= 0)
+   len = ret;
+   }
+   if (ret < 0)
+   return ret;
+
+   file_start_write(file_out);
+   ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out, len, 0);
+   file_end_write(file_out);
+
+   return ret;
+}
+
 /*
  * copy_file_range() differs from regular file read and write in that it
  * specifically allows return partial success.  When it does so is up to
@@ -1338,34 +1362,26 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t len, unsigned int flags)
 {
-   struct inode *inode_in;
-   struct inode *inode_out;
ssize_t ret;
 
-   if (flags)
+   /* Flags should only be used exclusively. */
+   if ((flags & COPY_FR_COPY) && (flags & ~COPY_FR_COPY))
return -EINVAL;
+   if ((flags & COPY_FR_REFLINK) && (flags & ~COPY_FR_REFLINK))
+   return -EINVAL;
+   if (flags & COPY_FR_DEDUPE)
+   return -EOPNOTSUPP;
 
-   /* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */
-   ret = rw_verify_area(READ, file_in, &pos_in, len);
-   if (ret >= 0)
-   ret = rw_verify_area(WRITE, file_out, &pos_out, len);
-   if (ret < 0)
-   return ret;
+   /* Default behavior is to try both. */
+   if (flags == 0)
+   flags = COPY_FR_COPY | COPY_FR_REFLINK;
 
if (!(file_in->f_mode & FMODE_READ) ||
!(file_out->f_mode & FMODE_WRITE) ||
(file_out->f_flags & O_APPEND) ||
-   !file_out->f_op || !file_out->f_op->copy_file_range)
+   !file_out->f_op)
return -EBADF;
 
-   inode_in = file_inode(file_in);
-   inode_out = file_inode(file_out);
-
-   /* make sure offsets don't wrap and the input is inside i_size */
-   if (pos_in + len < pos_in || pos_out + len < pos_out ||
-   pos_in + len > i_size_read(inode_in))
-   return -EINVAL;
-
if (len == 0)
return 0;
 
@@ -1373,8 +1389,13 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (ret)
return ret;
 
-   ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
- len, flags);
+   ret = -EOPNOTSUPP;
+   if (file_out->f_op->copy_file_range && (file_in->f_op == file_out->f_op))
+   ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
+ pos_out, len, flags);
+   if ((ret < 0) && (flags & COPY_FR_COPY))
+   ret = vfs_copy_file_pagecache(file_in, pos_in, file_out,
+ pos_out, len);
if (ret > 0) {
fsnotify_access(file_in);
add_rchar(current, ret);
diff --git a/include/linux/copy.h b/include/linux/copy.h
new file mode 100644
index 000..fd54543
--- /dev/null
+++ b/include/linux/copy.h
@@ -0,0 +1,6 @@
+#ifndef 

[PATCH v3 0/9] VFS: In-kernel copy system call

2015-09-25 Thread Anna Schumaker
Copy system calls came up during Plumbers a while ago, mostly because several
filesystems (including NFS and XFS) are currently working on copy acceleration
implementations.  We haven't heard from Zach Brown in a while, so I volunteered
to push his patches upstream so individual filesystems don't need to keep
writing their own ioctls. 

The question has come up about how vfs_copy_file_range() responds to signals,
and I don't have a good answer.  The pagecache copy option uses splice,
which (as far as I can tell) doesn't get interrupted.  Please let me know if
I'm missing something or completely misunderstanding the question!

Changes in v3:
- Update against the most recent Linus kernel
- Flags parameter should be an unsigned int
- Add COPY_FR_DEDUPE flag for Darrick
- Make each flag exclusive
- Update man page
- Added "Reiewed-by" tags

I tested the COPY_FR_COPY flag by using /dev/urandom to generate files of
varying sizes and copying them.  I compared the output from `time` against
that of `cp` to see if there is any noticeable difference. Values in the
tables below are averages across multiple trials.

 /usr/bin/cp |   512  |  1024  |  1536  |  2048  |  2560  |  3072  |  5120
-------------|--------|--------|--------|--------|--------|--------|--------
user |  0.00s |  0.00s |  0.01s |  0.01s |  0.01s |  0.01s |  0.02s
  system |  0.92s |  0.59s |  0.88s |  1.18s |  1.48s |  1.78s |  2.98s
 cpu |43% |18% |17% |18% |18% |18% |17%
   total |  2.116 |  3.200 |  4.950 |  6.541 |  8.105 |  9.811 | 17.211


VFS Copy |   512  |  1024  |  1536  |  2048  |  2560  |  3072  |  5120
-------------|--------|--------|--------|--------|--------|--------|--------
user |  0.00s |  0.00s |  0.00s |  0.00s |  0.00s |  0.00s |  0.00s
  system |  0.80s |  0.56s |  0.84s |  1.10s |  1.39s |  1.67s |  2.81s
 cpu |41% |18% |19% |17% |17% |17% |17%
   total |  1.922 |  2.990 |  4.448 |  6.292 |  7.855 |  9.480 | 15.944

Questions?  Comments?  Thoughts?

Anna


Anna Schumaker (6):
  vfs: Copy should check len after file open mode
  vfs: Copy shouldn't forbid ranges inside the same file
  vfs: Copy should use file_out rather than file_in
  vfs: Remove copy_file_range mountpoint checks
  vfs: copy_file_range() can do a pagecache copy with splice
  btrfs: btrfs_copy_file_range() only supports reflinks

Zach Brown (3):
  vfs: add copy_file_range syscall and vfs helper
  x86: add sys_copy_file_range to syscall tables
  btrfs: add .copy_file_range file operation

 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/btrfs/ctree.h   |   3 +
 fs/btrfs/file.c|   1 +
 fs/btrfs/ioctl.c   |  95 +-
 fs/read_write.c| 141 +
 include/linux/copy.h   |   6 ++
 include/linux/fs.h |   3 +
 include/uapi/asm-generic/unistd.h  |   2 +
 include/uapi/linux/Kbuild  |   1 +
 include/uapi/linux/copy.h  |   8 ++
 kernel/sys_ni.c|   1 +
 12 files changed, 224 insertions(+), 39 deletions(-)
 create mode 100644 include/linux/copy.h
 create mode 100644 include/uapi/linux/copy.h

-- 
2.5.3



Re: [PATCH 3/4] btrfs-progs: Introduce better superblock check.

2015-09-25 Thread David Sterba
On Wed, May 13, 2015 at 05:15:35PM +0800, Qu Wenruo wrote:
> Now btrfs-progs will have a much stricter superblock check, based on the
> kernel superblock check.
> 
> This should at least prevent some hostile crafted images from crashing
> commands like btrfsck.
> 
> Signed-off-by: Qu Wenruo 

Applied with some changes.

> +/* Just to save some space */
> +#define pr_err(fmt, args...) (fprintf(stderr, fmt, ##args))

fprintf(stderr, ...)

> + /*
> +  * Hint to catch really bogus numbers, bitflips or so
> +  */
> + if (btrfs_super_num_devices(sb) > (1UL << 31)) {
> + pr_err("ERROR: suspicious number of devices: %llu\n",
> + btrfs_super_num_devices(sb));
> + return -EIO;

This is supposed to be only a warning.

> + }
> +


Re: Latest kernel to use?

2015-09-25 Thread Rich Freeman
On Fri, Sep 25, 2015 at 9:25 AM, Bostjan Skufca  wrote:
>
> Similar here: I am sticking with 3.19.2 which has proven to work fine for me

I'd recommend still tracking SOME stable series.  I'm sure there were
fixes in 3.19 for btrfs (to say nothing of other subsystems) that
you're missing with that version.  3.19 is also unsupported at this
time.  You might want to consider moving to either 3.18.21 or 4.1.8
and tracking those series instead.  I doubt you'd give up much moving
back to 3.18 and there have been a bunch of btrfs fixes in that series
(though it seems to me that 3.18 has been slower to receive btrfs
patches than some of the other series).

I'm on the fence right now about making the move to 4.1.  Maybe in a
few releases I'll be there, depending on what the noise on the lists
sounds like.

There was a time when you were better off on bleeding-edge linux for
btrfs.  If you REALLY want to run btrfs raid5 or something like that
then I'd say that is still your best strategy.  However, if you stick
with features that have been around for a year the longterm kernels
seem a lot less likely to hit you with a regression, as long as you
don't switch to a new one the day it is declared as such.

--
Rich


Re: strange i/o errors with btrfs on raid/lvm

2015-09-25 Thread Chris Murphy
From offlist host logs (the Btrfs errors happen in a VM guest), I
think this is a hardware problem.

Aug 28 07:04:22 host kernel: [41367948.153031] sas:
sas_scsi_find_task: task 0x880e85c09c00 is done
Aug 28 07:04:22 host kernel: [41367948.153033] sas:
sas_eh_handle_sas_errors: task 0x880e85c09c00 is done
Aug 28 07:04:22 host kernel: [41367948.153036] sas: ata7:
end_device-0:0: cmd error handler
Aug 28 07:04:22 host kernel: [41367948.153085] sas: ata7:
end_device-0:0: dev error handler
Aug 28 07:04:22 host kernel: [41367948.153094] ata7.00: exception
Emask 0x0 SAct 0x7 SErr 0x0 action 0x6 frozen
Aug 28 07:04:22 host kernel: [41367948.153119] sas: ata8:
end_device-0:1: dev error handler
Aug 28 07:04:22 host kernel: [41367948.153130] sas: ata9:
end_device-0:2: dev error handler
Aug 28 07:04:22 host kernel: [41367948.153134] sas: ata10:
end_device-0:3: dev error handler
Aug 28 07:04:22 host kernel: [41367948.153181] ata7.00: failed
command: WRITE FPDMA QUEUED
Aug 28 07:04:22 host kernel: [41367948.153234] ata7.00: cmd
61/01:00:08:08:00/00:00:00:00:00/40 tag 0 ncq 512 out
Aug 28 07:04:22 host kernel: [41367948.153234]  res
40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 28 07:04:22 host kernel: [41367948.153393] ata7.00: status: { DRDY }
Aug 28 07:04:22 host kernel: [41367948.153437] ata7.00: failed
command: READ FPDMA QUEUED
Aug 28 07:04:22 host kernel: [41367948.153490] ata7.00: cmd
60/08:00:80:ac:15/00:00:00:00:00/40 tag 1 ncq 4096 in
Aug 28 07:04:22 host kernel: [41367948.153490]  res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 28 07:04:22 host kernel: [41367948.153648] ata7.00: status: { DRDY }
Aug 28 07:04:22 host kernel: [41367948.153693] ata7.00: failed
command: READ FPDMA QUEUED
Aug 28 07:04:22 host kernel: [41367948.153744] ata7.00: cmd
60/30:00:90:ac:15/00:00:00:00:00/40 tag 2 ncq 24576 in
Aug 28 07:04:22 host kernel: [41367948.153744]  res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 28 07:04:22 host kernel: [41367948.153903] ata7.00: status: { DRDY }
Aug 28 07:04:22 host kernel: [41367948.153950] ata7: hard resetting link
Aug 28 07:04:22 host kernel: [41367948.320311] ata7.00: configured for
UDMA/133 (device error ignored)
Aug 28 07:04:22 host kernel: [41367948.320371] ata7.00: device
reported invalid CHS sector 0
Aug 28 07:04:22 host kernel: [41367948.320421] ata7.00: device
reported invalid CHS sector 0
Aug 28 07:04:22 host kernel: [41367948.320478] ata7: EH complete
Aug 28 07:04:22 host kernel: [41367948.320558] sas: --- Exit
sas_scsi_recover_host: busy: 0 failed: 0 tries: 1


So at the moment I don't think this is a Btrfs problem.


Chris Murphy


Re: [PATCH] btrfs-progs: Increase running state's priority in stat output

2015-09-25 Thread David Sterba
On Tue, Jul 28, 2015 at 03:53:58PM +0800, Zhaolei wrote:
> From: Zhao Lei 
> 
> Anthony Plack  reported a output bug in maillist:
>   title: btrfs-progs SCRUB reporting aborted but still running - minor
> 
> btrfs scrub status reports it was aborted but it still runs to completion.
>   # btrfs scrub status /mnt/data
>   scrub status for f591ac13-1a69-476d-bd30-346f87a491da
>   scrub started at Mon Apr 27 06:48:44 2015 and was aborted after 1089 
> seconds
>   total bytes scrubbed: 1.02TiB with 0 errors
>   #
>   # btrfs scrub status /mnt/data
>   scrub status for f591ac13-1a69-476d-bd30-346f87a491da
>   scrub started at Mon Apr 27 06:48:44 2015 and was aborted after 1664 
> seconds
>   total bytes scrubbed: 1.53TiB with 0 errors
>   #
>   ...
> 
> Reason:
>   When scrubbing multiple devices simultaneously, if scrub was canceled
>   on one device while another device is still running, the canceled state
>   has higher priority in the global report output.
>   So we see "scrub aborted" in the status line while the running time
>   keeps increasing.
> 
> Fix:
>   Increase the running state's priority in the output: if any device is
>   still in the scrubbing state, output the running state instead of the
>   cancelled state.
> 
> Reported-by: Anthony Plack 
> Signed-off-by: Zhao Lei 

Applied, thanks.


Re: Latest kernel to use?

2015-09-25 Thread Bostjan Skufca
Thanks for the heart-warming recommendation; this is also what I generally do.

In this case (as I vaguely remember) the reasoning for going with
3.19.x at the time was that I was hitting some btrfs issues around
3.16 while eyeing the btrfs changesets going into mainline.
This, combined with the general recommendation to use the latest
kernels for btrfs, and the fact that these systems mainly do backups
(so, no btrfs-related bugs if at all possible; the rest is
unimportant), resulted in what is now a stable configuration.

Downgrading - well, no :)

For such systems (backend/backup servers), I tend to upgrade kernels when:
a) some exploit is discovered, or
b) feature present in newer kernel is needed.

I understand the difference between mainline and stable, but I haven't
had problems with mainline 'since forever'.

b.

On 25 September 2015 at 19:00, Rich Freeman  wrote:
> On Fri, Sep 25, 2015 at 9:25 AM, Bostjan Skufca  wrote:
>>
>> Similar here: I am sticking with 3.19.2 which has proven to work fine for me
>
> I'd recommend still tracking SOME stable series.  I'm sure there were
> fixes in 3.19 for btrfs (to say nothing of other subsystems) that
> you're missing with that version.  3.19 is also unsupported at this
> time.  You might want to consider moving to either 3.18.21 or 4.1.8
> and tracking those series instead.  I doubt you'd give up much moving
> back to 3.18 and there have been a bunch of btrfs fixes in that series
> (though it seems to me that 3.18 has been slower to receive btrfs
> patches than some of the other series).
>
> I'm on the fence right now about making the move to 4.1.  Maybe in a
> few releases I'll be there, depending on what the noise on the lists
> sounds like.
>
> There was a time when you were better off on bleeding-edge linux for
> btrfs.  If you REALLY want to run btrfs raid5 or something like that
> then I'd say that is still your best strategy.  However, if you stick
> with features that have been around for a year the longterm kernels
> seem a lot less likely to hit you with a regression, as long as you
> don't switch to a new one the day it is declared as such.
>
> --
> Rich


Re: [PATCH] btrfs-progs: Show detail error message when write sb failed in write_dev_supers()

2015-09-25 Thread David Sterba
On Mon, Jul 27, 2015 at 07:32:37PM +0800, Zhaolei wrote:
> From: Zhao Lei 
> 
> fsck-tests.sh failed and show following message in my node:
>   # ./fsck-tests.sh
>  [TEST]   001-bad-file-extent-bytenr
>   disk-io.c:1444: write_dev_supers: Assertion `ret != BTRFS_SUPER_INFO_SIZE` 
> failed.
>   /root/btrfsprogs/btrfs-image(write_all_supers+0x2d2)[0x41031c]
>   /root/btrfsprogs/btrfs-image(write_ctree_super+0xc5)[0x41042e]
>   /root/btrfsprogs/btrfs-image(btrfs_commit_transaction+0x208)[0x410976]
>   /root/btrfsprogs/btrfs-image[0x438780]
>   /root/btrfsprogs/btrfs-image(main+0x3d5)[0x438c5c]
>   /lib64/libc.so.6(__libc_start_main+0xfd)[0x335e01ecdd]
>   /root/btrfsprogs/btrfs-image[0x4074e9]
>   failed to restore image 
> /root/btrfsprogs/tests/fsck-tests/001-bad-file-extent-bytenr/default_case.img
>   #
> 
>   # cat fsck-tests-results.txt
>   === Entering /root/btrfsprogs/tests/fsck-tests/001-bad-file-extent-bytenr
>   restoring image default_case.img
>   failed to restore image 
> /root/btrfsprogs/tests/fsck-tests/001-bad-file-extent-bytenr/default_case.img
>   #
> 
> Reason:
>   I ran the above test on an NFS mountpoint, which didn't have enough space
>   to write all superblocks to the image file and doesn't support sparse files.
>   So write_dev_supers() failed to write the sb and printed the message above.
> 
> It took me quite some time to figure out what happened; we can save that
> time by printing exact information in the write-sb-fail case.
> 
> After patch:
>   # ./fsck-tests.sh
> [TEST]   001-bad-file-extent-bytenr
>   WARNING: Write sb failed: File too large
>   disk-io.c:1492: write_all_supers: Assertion `ret` failed.
>   ...
>   #
> 
> Signed-off-by: Zhao Lei 

Applied, thanks.


Re: [PATCH] btrfs-progs: convert: Print different error message if convert partly failed.

2015-09-25 Thread David Sterba
On Tue, Jun 09, 2015 at 03:57:40PM +0800, Qu Wenruo wrote:
> When testing under libguestfs, btrfs-convert never succeeds in fixing the
> chunk map, and always fails.
> 
> But in that case, it's already a mountable btrfs.
> So it's better to inform the user with a different error message for that case.
> 
> The root cause of it is still under investigation.
> 
> Signed-off-by: Qu Wenruo 

I've adjusted wording of the error message and applied, thanks. What are
the consequences of the unfinished conversion process when such
filesystem is mounted?


Re: [PATCH v2 1/2] btrfs-progs: use switch instead of a series of ifs for output errormsg

2015-09-25 Thread David Sterba
On Thu, Aug 06, 2015 at 11:05:54AM +0800, Zhao Lei wrote:
> A switch statement is more suitable for outputting the corresponding
> message for each errno.
> 
> Suggested-by: David Sterba 
> Signed-off-by: Zhao Lei 

Applied, thanks.


[GIT PULL] Btrfs

2015-09-25 Thread Chris Mason
Hi Linus,

My for-linus-4.3 branch has a few fixes:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.3

This is an assorted set I've been queuing up:

Jeff Mahoney tracked down a tricky one where we ended up starting IO on
the wrong mapping for special files in btrfs_evict_inode.  A few people
reported this one on the list.

Filipe found (and provided a test for) a difficult bug in reading
compressed extents, and Josef fixed up some quota record keeping with
snapshot deletion.  Chandan killed off an accounting bug during DIO that
led to WARN_ONs as we freed inodes.

Filipe Manana (3) commits (+58/-16):
Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlock (+0/-4)
Btrfs: don't initialize a space info as full to prevent ENOSPC (+1/-4)
Btrfs: fix read corruption of compressed and shared extents (+57/-8)

Josef Bacik (1) commits (+37/-2):
Btrfs: keep dropped roots in cache until transaction commit

Jeff Mahoney (1) commits (+2/-1):
btrfs: skip waiting on ordered range for special files

chandan (1) commits (+21/-23):
Btrfs: Direct I/O: Fix space accounting

Total: (6) commits (+118/-42)

 fs/btrfs/btrfs_inode.h |  2 --
 fs/btrfs/disk-io.c |  2 --
 fs/btrfs/extent-tree.c |  7 ++
 fs/btrfs/extent_io.c   | 65 +++---
 fs/btrfs/inode.c   | 45 +-
 fs/btrfs/super.c   |  2 --
 fs/btrfs/transaction.c | 32 +
 fs/btrfs/transaction.h |  5 +++-
 8 files changed, 118 insertions(+), 42 deletions(-)


Too many missing devices, writeable mount is not allowed

2015-09-25 Thread Marcel Bischoff

Hello all,

I have kind of a serious problem with one of my disks.

The controller of one of my external drives died (WD Studio). The disk 
is alright though. I cracked open the case, got the drive out and 
connected it via a SATA-USB interface.


Now, mounting the filesystem is not possible. Here's the message:

$ btrfs fi show
warning devid 3 not found already
Label: none  uuid: bd6090df-5179-490e-a5f8-8fbad433657f
   Total devices 3 FS bytes used 3.02TiB
   devid    1 size 596.17GiB used 532.03GiB path /dev/sdd
   devid    2 size 931.51GiB used 867.03GiB path /dev/sde
   *** Some devices missing

Yes, I did bundle up three drives with very different sizes with the 
--single option on creating the file system.


I have already asked for help on StackExchange but replies have been 
few. Now I thought people on this list, close to btrfs development may 
be able and willing to help. This would be so much appreciated.


Here's the issue with lots of information and a record of what I/we have 
tried up until now: 
http://unix.stackexchange.com/questions/231174/btrfs-too-many-missing-devices-writeable-mount-is-not-allowed


Thanks for your time and consideration!

Best,
Marcel


Re: [PATCH 0/2] Btrfs: Fix a insane extent_buffer copy behavior for qgroup

2015-09-25 Thread Stéphane Lesimple

Le 2015-09-25 04:37, Qu Wenruo a écrit :

Stephane Lesimple reported an qgroup rescan bug:

[92098.841309] general protection fault:  [#1] SMP
[92098.841338] Modules linked in: ...
[92098.841814] CPU: 1 PID: 24655 Comm: kworker/u4:12 Not tainted
4.3.0-rc1 #1
[92098.841868] Workqueue: btrfs-qgroup-rescan 
btrfs_qgroup_rescan_helper

[btrfs]
[92098.842261] Call Trace:
[92098.842277]  [] ? read_extent_buffer+0xb8/0x110
[btrfs]
[92098.842304]  [] ? btrfs_find_all_roots+0x60/0x70
[btrfs]
[92098.842329]  []
btrfs_qgroup_rescan_worker+0x28d/0x5a0 [btrfs]
...

The function that triggers it, btrfs_disk_key_to_cpu(), should never
fail.
But it turned out that the extent_buffer it is called on is memcpied
from an existing one.

Copying a structure with page pointers and locks in it like that is
never a sane thing to do.

Fix it by doing the key operation on plain memory rather than on an
extent_buffer.

Qu Wenruo (2):
  btrfs: Add support to do stack item key operation
  btrfs: qgroup: Don't copy extent buffer to do qgroup rescan

 fs/btrfs/ctree.h  | 20 
 fs/btrfs/qgroup.c | 22 --
 2 files changed, 32 insertions(+), 10 deletions(-)


I applied this patch and ran 100+ rescans on my filesystem for several
hours. No crash happened, whereas previously only a few rescans were
enough to trigger the bug.


I'm confident this patch fixes the GPF, thanks Qu !

--
Stéphane.




Re: [PATCH v3 8/9] vfs: copy_file_range() can do a pagecache copy with splice

2015-09-25 Thread Pádraig Brady
On 25/09/15 21:48, Anna Schumaker wrote:
> The NFS server will need some kind of fallback for filesystems that don't
> have any kind of copy acceleration, and it should be generally useful to
> have an in-kernel copy to avoid lots of switches between kernel and user
> space.
> 
> I make this configurable by adding two new flags.  Users who only want a
> reflink can pass COPY_FR_REFLINK, and users who want a full data copy can
> pass COPY_FR_COPY.  The default (flags=0) means to first attempt a
> reflink, but use the pagecache if that fails.
> 
> I moved the rw_verify_area() calls into the fallback code since some
> filesystems can handle reflinking a large range.
> 
> Signed-off-by: Anna Schumaker 

Reviewed-by: Pádraig Brady 

LGTM. For my reference, for cp(1), mv(1), install(1), this will avoid
user space copies in the normal case, client side copies in the network
file system case, and provide a more generalized interface to reflink().

coreutils pseudo code is:

  unsigned int cfr_flags = COPY_FR_COPY;
  if (mode == mv)
cfr_flags = 0; /* reflink falling back to normal */
  else if (mode == cp) {
if --reflink || --reflink==always
  cfr_flags = COPY_FR_REFLINK;
else if --reflink==auto
  cfr_flags = 0; /* reflink falling back to normal */
  }
  if vfs_copy_file_range(..., cfr_flags) == ENOTSUP
normal_user_space_copy();

thanks,
Pádraig.


Re: Too many missing devices, writeable mount is not allowed

2015-09-25 Thread Hugo Mills
On Fri, Sep 25, 2015 at 11:45:44PM +0200, Marcel Bischoff wrote:
> Hello all,
> 
> I have kind of a serious problem with one of my disks.
> 
> The controller of one of my external drives died (WD Studio). The
> disk is alright though. I cracked open the case, got the drive out
> and connected it via a SATA-USB interface.
> 
> Now, mounting the filesystem is not possible. Here's the message:
> 
> $ btrfs fi show
> warning devid 3 not found already
> Label: none  uuid: bd6090df-5179-490e-a5f8-8fbad433657f
>    Total devices 3 FS bytes used 3.02TiB
>    devid    1 size 596.17GiB used 532.03GiB path /dev/sdd
>    devid    2 size 931.51GiB used 867.03GiB path /dev/sde
>*** Some devices missing
> 
> Yes, I did bundle up three drives with very different sizes with the
> --single option on creating the file system.

   OK, that's entirely possible. Not a problem in itself.

   Now, assuming that the missing device is actually unrecoverable:

   Since you've said it's single, you've lost some large fraction of
the file data on your filesystem, so this isn't going to end well in
any case. I hope you have good backups.

   Was the metadata on the filesystem also single? If so, then I have
no hesitation in declaring this filesystem completely dead. If it was
RAID-1 (or RAID-5 or RAID-6), then the metadata should still be OK,
and you should be able to mount the FS with -o degraded. That will
give you a working (read-only) filesystem where some of the data will
return EIO where the data is missing. ddrescue should help you to
recover partial files for those cases where partial recovery is
acceptable.

   But it might be recoverable, because...

> I have already asked for help on StackExchange but replies have been
> few. Now I thought people on this list, close to btrfs development
> may be able and willing to help. This would be so much appreciated.
> 
> Here's the issue with lots of information and a record of what I/we
> have tried up until now: 
> http://unix.stackexchange.com/questions/231174/btrfs-too-many-missing-devices-writeable-mount-is-not-allowed

   I think Vincent Yu there has the right idea -- there's no
superblock showing up on the device in the place that's expected.
However, your update 3 shows that there is a superblock offset by 1
MiB (1114176-65600 = 1048576 = 1024*1024). So the recovery approach
here would be to construct a block device using an offset of 1 MiB
into /dev/sdc. dmsetup should be able to do this, I think.

   It's been a long time since I used dmsetup in anger, but something
like this may work:

# dmsetup load /dev/sdc --table "256 <size> linear /dev/mapper/sdc_offset 0"

where <size> is the number of sectors of /dev/sdc, less the 256 at the
start. I recommend reading the man page in detail and double-checking
that what I've got there is actually what's needed.

   That will (I think) give you a device /dev/mapper/sdc_offset, which
should then show up in btrfs fi show, and allow you to keep using the
FS.
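
A sketch of that approach with the arithmetic spelled out. The device and dm
target names are assumptions, and the device-touching commands are left
commented out so the man page can be checked first:

```shell
# Offset implied by the superblock positions quoted above.
OFFSET_BYTES=$((1114176 - 65600))        # = 1048576 = 1 MiB
OFFSET_SECTORS=$((OFFSET_BYTES / 512))   # dm tables count 512-byte sectors
echo "$OFFSET_SECTORS"                   # prints 2048

# With the numbers verified, the shifted device might be built like this
# (sketch only -- check dmsetup(8) before running against real data):
# SECTORS=$(blockdev --getsz /dev/sdc)
# dmsetup create sdc_offset \
#     --table "0 $((SECTORS - OFFSET_SECTORS)) linear /dev/sdc $OFFSET_SECTORS"
# btrfs filesystem show /dev/mapper/sdc_offset
```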

   Hugo.

-- 
Hugo Mills | If you see something, say nothing and drink to
hugo@... carfax.org.uk | forget
http://carfax.org.uk/  |
PGP: E2AB1DE4  | Welcome to Night Vale




Re: [PATCH v2] btrfs: qgroup: exit the rescan worker during umount

2015-09-25 Thread Justin Maggard
On Tue, Sep 22, 2015 at 7:45 AM, David Sterba  wrote:
> On Wed, Sep 02, 2015 at 06:05:17PM -0700, Justin Maggard wrote:
>> v2: Fix stupid error while making formatting changes...
>
> I haven't noticed any difference between the patches, what exactly did
> you change?
>

I broke compiling while cleaning up some checkpatch.pl feedback.
Here's what changed between v1 and v2:

-   if (!btrfs_fs_closing(fs_info)) {
+   if (!btrfs_fs_closing(fs_info))


>> I was hitting a consistent NULL pointer dereference during shutdown that
>> showed the trace running through end_workqueue_bio().  I traced it back to
>> the endio_meta_workers workqueue being poked after it had already been
>> destroyed.
>>
>> Eventually I found that the root cause was a qgroup rescan that was still
>> in progress while we were stopping all the btrfs workers.
>>
>> Currently we explicitly pause balance and scrub operations in
>> close_ctree(), but we do nothing to stop the qgroup rescan.  We should
>> probably be doing the same for qgroup rescan, but that's a much larger
>> change.  This small change is good enough to allow me to unmount without
>> crashing.
>>
>> Signed-off-by: Justin Maggard 
>
> Can you please submit the test you've used to trigger the crash to
> fstests?
>

Sure, I've got a reproducer coded up for xfstests now.  Should I just
send that to this list, or is there a better place to send it?

-Justin


Re: raw devices or partitions?

2015-09-25 Thread Duncan
Sjoerd posted on Fri, 25 Sep 2015 15:40:39 +0200 as excerpted:

> Is it better to use raw devices for a RAID setup or make one partition
> on the drive and then create your RAID from there?
> Right now if have one setup that uses raw, but get messages "unknown
> partition table" all the time in my logs.
> I am planning to create a RAID 5 setup (seems to be stable these days?),
> but wondering  to deal with  raw drives or partitions (4 at the moment).
> In the wiki they're referring to raw devices in the examples, but that
> could be outdated?

Raw device vs. partition (vs mdraid vs dmraid vs lvm) doesn't matter to 
btrfs.  They're all block devices.

That unknown partition table log entry is from elsewhere in the kernel, 
where it would normally read partition tables if there were any to read.  
It's simply telling you it couldn't find any, to help with diagnostics in 
case there's supposed to be one.  But if you deliberately used a raw 
device, there isn't supposed to be a partition table, so no big deal.

So just ignore the warning as a diagnostic aid for a case that doesn't 
apply to you, if you like.  Or, if it's irritating enough that ignoring 
it isn't possible, then do the big-partition thing and make the warning 
disappear.  No big deal either way. =:^)
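
The same "nothing recognised" probe can be reproduced harmlessly against a
blank image file. A sketch, assuming util-linux's blkid is installed:

```shell
# Probe an all-zero image: no partition table, no filesystem signatures,
# so the low-level probe finds nothing and exits non-zero.
img=$(mktemp)
truncate -s 8M "$img"
blkid -p "$img" || echo "no recognised signatures"
rm -f "$img"
```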


Tho if the device is either going to be bootable, or you might want to 
repurpose it to bootable in the future, you may well want to partition 
it, preferably gpt, and create a couple small partitions to make that 
easier, before creating the big one.

Here, I create both a tiny BIOS-reserved partition, for legacy BIOS boot 
(which I use now, grub2 installs to it for BIOS systems), and a slightly 
larger but still tiny EFI-reserved partition, for forward compatibility.  
The two together still take only an eighth of a gig (128 MiB), which I 
figure is a fair trade for the additional flexibility it gives me.

Here's my layout, FWIW.  (This is on SSD where partition order doesn't 
matter.  Since the first sections of a spinning rust device are the 
fastest, and these will only ever be used for boot if used at all, you 
might wish to put these last, on that.)

Generally 2 MiB minimum alignment, for better efficiency on modern 
devices, and the first few KiB of a device are taken by the boot sector 
and partition table.  But the BIOS partition is so small and accessed 
using primitive and inefficient BIOS routines anyway...

#   Start (MiB) End (MiB)   Size (MiB)  Name/Description
x   0   1   1   GPT/free
1   1   4   3   BIOS/reserved
2   4   128 124 EFI/reserved

As you will note, that ends at an even 128 MiB alignment, 1/8 GiB.  FWIW, 
I put another couple small partitions (boot and log) under a gig, ending 
with GiB alignment, with everything else GiB aligned.

And 1/8 GiB is small enough, I'll often start off with that even on USB 
flash drives.  To me it's worth it, just to have the flexibility of 
making it bootable, should I decide to, without having to move partitions 
around to fit in the boot stuff.  Then I partition up the rest, or leave 
it one big partition, as the use-case calls for.

As an additional benefit, because I use this layout consistently, should 
I need to (tho with gpt having checksumming and a partition table at each 
end of the device, the chance of entirely losing it without the whole 
device being junk is dramatically lowered), I can recreate it with good 
confidence, knowing the normal partitions always start at 128 MiB.
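
As a concrete sketch, that table translates into an sfdisk script along these
lines. The device name is a placeholder, and the type GUIDs are the standard
BIOS-boot and EFI System Partition types; verify everything before aiming it
at a disk that holds data:

```shell
# sfdisk input matching the layout table above; printed only, not applied.
# /dev/sdX is a placeholder device name.
cat <<'EOF'    # pipe to: sfdisk /dev/sdX   once verified
label: gpt
start=1MiB, size=3MiB,   type=21686148-6449-6E6F-744E-656564454649, name="BIOS boot"
start=4MiB, size=124MiB, type=C12A7328-F81F-11D2-BA4B-00A0C93EC93B, name="EFI system"
EOF
# The second partition ends exactly on the 128 MiB (1/8 GiB) boundary:
echo $((4 + 124))
```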

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [PATCH v3 8/9] vfs: copy_file_range() can do a pagecache copy with splice

2015-09-25 Thread Andy Lutomirski
On Fri, Sep 25, 2015 at 1:48 PM, Anna Schumaker
 wrote:
> The NFS server will need some kind of fallback for filesystems that don't
> have any kind of copy acceleration, and it should be generally useful to
> have an in-kernel copy to avoid lots of switches between kernel and user
> space.
>
> I make this configurable by adding two new flags.  Users who only want a
> reflink can pass COPY_FR_REFLINK, and users who want a full data copy can
> pass COPY_FR_COPY.  The default (flags=0) means to first attempt a
> reflink, but use the pagecache if that fails.
>

Can you clarify how the subject line fits in?  I'm a bit lost.

--Andy


Re: Too many missing devices, writeable mount is not allowed

2015-09-25 Thread Duncan
Marcel Bischoff posted on Fri, 25 Sep 2015 23:45:44 +0200 as excerpted:

> Hello all,
> 
> I have kind of a serious problem with one of my disks.
> 
> The controller of one of my external drives died (WD Studio). The disk
> is alright though. I cracked open the case, got the drive out and
> connected it via a SATA-USB interface.
> 
> Now, mounting the filesystem is not possible. Here's the message:
> 
> $ btrfs fi show
> warning devid 3 not found already
> Label: none  uuid: bd6090df-5179-490e-a5f8-8fbad433657f
> Total devices 3 FS bytes used 3.02TiB
> devid1 size 596.17GiB used 532.03GiB path /dev/sdd
> devid2 size 931.51GiB used 867.03GiB path /dev/sde
> *** Some devices missing
> 
> Yes, I did bundle up three drives with very different sizes with the
> --single option on creating the file system.

[FWIW, the additional comments on the stackexchange link didn't load for 
me, presumably due to my default security settings.  I could of course 
fiddle with them to try to get it to work, but meh...  So I only saw the 
first three comments or so.  As a result, some of this might be repeat 
territory for you.]

?? --single doesn't appear to be a valid option for mkfs.btrfs.  Did you 
mean --metadata single and/or --data single?  Which?  Both?

If you were running single metadata, like raid0, you're effectively 
declaring the filesystem dead and not worth the effort to fix if a device 
dies and disappears.  In which case you got what you requested, a multi-
device filesystem that dies when one of the devices dies. =:^)  Tho it 
may still be possible to revive the filesystem if you can get the bad 
device recovered enough to get it to be pulled into the filesystem, again.

That's why metadata defaults to raid1 (tho btrfs raid1 is only pair-
mirror, even if there's more than two devices) on a multi-device 
filesystem.  So if you didn't specify --metadata single, then it should 
be raid1 (unless the filesystem started as a single device and was never 
balance-converted when the other devices were added).

--data single is the default on both single and multi-device filesystems, 
however, which, given raid1 metadata, should at least let you recover 
files that were 100% on the remaining devices.  I'm assuming this, as it 
would normally allow read-only mounting due to the second copy of the 
metadata, but isn't going to allow writable mounting because with single 
data, that would damage any possible chance of getting the data on the 
missing device back.  Chances of getting writable if the missing device 
is as damaged as it could be are slim, but it's possible, if you can 
bandaid it up.  However, even then I'd consider it suspect and would 
strongly recommend taking the chance you've been given to freshen your 
backups, then at least btrfs device delete (or btrfs replace with another 
device), if not blow away the filesystem and start over with a fresh 
mkfs.  Meanwhile, do a full write/read test (badblocks or the like) of 
the bad device, before trying to use it again.

The other (remote) possibility is mixed-bg mode, combining data and 
metadata in the same block-groups.  But that's default only with 1 GiB 
and under filesystems (and with filesystems converted from ext* with some 
versions of btrfs-convert), so it's extremely unlikely unless you 
specified that at mkfs.btrfs time, in which case mentioning that would 
have been useful.

A btrfs filesystem df (or usage) should confirm both data and metadata 
status.  The filesystem must be mounted to run it, but read-only degraded 
mount should do.

[More specific suggestions below.]

> I have already asked for help on StackExchange but replies have been
> few. Now I thought people on this list, close to btrfs development may
> be able and willing to help. This would be so much appreciated.
> 
> Here's the issue with lots of information and a record of what I/we have
> tried up until now:
> http://unix.stackexchange.com/questions/231174/btrfs-too-many-missing-
devices-writeable-mount-is-not-allowed

OK, first the safe stuff, then some more risky possibilities...

1) Sysadmin's rule of backups:  If you value data, by definition, you 
have it backed up.  If it's not backed up, by definition, you definitely 
value it less than the time and resources saved by not doing the backups, 
not withstanding any claims to the contrary.  (And by the same token, a 
would-be backup that hasn't been tested restorable isn't yet a backup, as 
the job isn't complete until you know it can be restored.)

1a) Btrfs addendum: Because btrfs is still a maturing filesystem not yet 
fully stabilized, the above backup rule applies even more strongly than 
it does to a more mature filesystem.

So in worst-case, just blow away the existing filesystem and start over, 
either restoring from those backups, or happy in the knowledge that since 
you didn't have them, you self-evidently didn't value the data on the 
filesystem, and can go on without it.[1]


Re: Latest kernel to use?

2015-09-25 Thread Duncan
Bostjan Skufca posted on Fri, 25 Sep 2015 16:34:16 +0200 as excerpted:

> Similar here: I am sticking with 3.19.2 which has proven to work fine
> for me (backup systems with btrfs on lvm, lots of snapshots/subvolumes
> and occasional rebalance, no fancy/fresh stuff like btrfs-raid, online
> compression or subvolume quota, though this last one is tempting in my
> use case).

On that last one, subvolume quota, you've been following list discussion, 
right?  Just in case you haven't...

Btrfs quotas have been an incredibly tough feature to stabilize.  They're 
on the third rewrite and still have some major bugs to fix, so I'd say at 
least a couple more kernel cycles, then check again before using the 
feature.

So, unless you're actively testing and reporting bugs in that specific 
feature (in which case, thanks and please continue! =:^), I continue to 
strongly recommend leaving quotas off for now, as I have been for a year 
or more, now.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [PATCH 3/5] btrfs: Do per-chunk degraded check for remount

2015-09-25 Thread Qu Wenruo



Anand Jain wrote on 2015/09/25 14:54 +0800:



On 09/21/2015 10:10 AM, Qu Wenruo wrote:

Just the same for mount time check, use new btrfs_check_degraded() to do
per chunk check.

Signed-off-by: Qu Wenruo 
---
  fs/btrfs/super.c | 11 +++
  1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index c389c13..720c044 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1681,11 +1681,14 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
  goto restore;
  }

-if (fs_info->fs_devices->missing_devices >
- fs_info->num_tolerated_disk_barrier_failures &&
-!(*flags & MS_RDONLY)) {
+ret = btrfs_check_degradable(fs_info, *flags);
+if (ret < 0) {
+btrfs_error(fs_info, ret,
+"degraded writable remount failed");


btrfs_error(), which is an error handling routine, isn't appropriate
here, mainly because we are in the remount context; I am not sure if
you meant to change the fs state to readonly (on error) in the remount
context? Instead, btrfs_err(), which is error reporting/logging,
would be appropriate.

btrfs_error() and btrfs_err() are very different in action, but it's easy
to have an oversight and use the wrong one. The patch below changed it:

Btrfs: consolidate btrfs_error() to btrfs_std_error()

Thanks, Anand


Thanks for pointing this out.

I was quite unsure about using btrfs_info/warn/error.

In this case, I just want to output a dmesg message to let the user know 
exactly what caused the mount to fail.
The original code outputs nothing but "failed to open chunk tree", which is 
quite confusing for the end user.

I was planning to use btrfs_info, but this really is an error message, 
even if its only purpose is to tell the user the real cause.


Maybe btrfs_warn will be a better choice?

Thanks,
Qu





+goto restore;
+} else if (ret > 0 && !btrfs_test_opt(root, DEGRADED)) {
  btrfs_warn(fs_info,
-"too many missing devices, writeable remount is not allowed");
+"some device missing, but still degraded mountable, please remount with -o degraded option");
  ret = -EACCES;
  goto restore;
  }




[PATCH 1/1] btrfs: Do per-chunk degraded check for remount

2015-09-25 Thread Anand Jain
From: Qu Wenruo 

Just the same for mount time check, use new btrfs_check_degraded() to do
per chunk check.

Signed-off-by: Qu Wenruo 

Btrfs: use btrfs_error instead of btrfs_err during remount

apply on top of the patch

[PATCH 1/1] Btrfs: consolidate btrfs_error() to btrfs_std_error()

Signed-off-by: Anand Jain 
---
 fs/btrfs/super.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 181db38..16f1412 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1664,11 +1664,14 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
goto restore;
}
 
-   if (fs_info->fs_devices->missing_devices >
-fs_info->num_tolerated_disk_barrier_failures &&
-   !(*flags & MS_RDONLY)) {
+   ret = btrfs_check_degradable(fs_info, *flags);
+   if (ret < 0) {
+   btrfs_err(fs_info,
+   "degraded writable remount failed %d", ret);
+   goto restore;
+   } else if (ret > 0 && !btrfs_test_opt(root, DEGRADED)) {
btrfs_warn(fs_info,
-			"too many missing devices, writeable remount is not allowed");
+			"some device missing, but still degraded mountable, please remount with -o degraded option");
ret = -EACCES;
goto restore;
}
-- 
2.4.1



Re: [PATCH 3/5] btrfs: Do per-chunk degraded check for remount

2015-09-25 Thread Qu Wenruo

Thanks Anand,

I'm OK with both new patches.

Thanks for the modification.
Qu

Anand Jain wrote on 2015/09/25 16:30 +0800:

Qu,

Strictly speaking, IMO it should be reported to the user on the CLI
terminal, with no logging required. Since it's not that easy to do
that at this point, I am OK with logging it as an error. Since we are
failing the task (mount), error is better.

I have made that change on top of the patch

   [PATCH 1/1] Btrfs: consolidate btrfs_error() to btrfs_std_error()

and sent them both.

Thanks, Anand



Thanks for pointing this out.

I was quite unsure about using btrfs_info/warn/error.

In this case, I just want to output a dmesg message to let the user know
exactly what caused the mount to fail.
The original code outputs nothing but "failed to open chunk tree", which is
quite confusing for the end user.

I was planning to use btrfs_info, but this really is an error message,
even if its only purpose is to tell the user the real cause.

Maybe btrfs_warn will be a better choice?

Thanks,
Qu






[PATCH 1/1] btrfs: Do per-chunk check for mount time check

2015-09-25 Thread Anand Jain
From: Qu Wenruo 

Now use the btrfs_check_degraded() to do mount time degraded check.

With this patch, now we can mount with the following case:
 # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
 # wipefs -a /dev/sdc
 # mount /dev/sdb /mnt/btrfs -o degraded
 As the single data chunk is only in sdb, so it's OK to mount as
 degraded, as missing one device is OK for RAID1.

But still fail with the following case as expected:
 # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
 # wipefs -a /dev/sdb
 # mount /dev/sdc /mnt/btrfs -o degraded
 As the data chunk is only in sdb, so it's not OK to mount it as
 degraded.

Reported-by: Zhao Lei 
Reported-by: Anand Jain 
Signed-off-by: Qu Wenruo 

Btrfs: use btrfs_error instead of btrfs_err during mount

fix up on top of the patch
[PATCH 1/1] Btrfs: consolidate btrfs_error() to btrfs_std_error()

Signed-off-by: Anand Jain 
---
 fs/btrfs/disk-io.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ccb1f28..ae7b180 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2863,6 +2863,16 @@ int open_ctree(struct super_block *sb,
goto fail_tree_roots;
}
 
+   ret = btrfs_check_degradable(fs_info, fs_info->sb->s_flags);
+   if (ret < 0) {
+   btrfs_err(fs_info, "degraded writable mount failed %d", ret);
+   goto fail_tree_roots;
+   } else if (ret > 0 && !btrfs_test_opt(chunk_root, DEGRADED)) {
+   btrfs_warn(fs_info,
+		"Some device missing, but still degraded mountable, please mount with -o degraded option");
+   ret = -EACCES;
+   goto fail_tree_roots;
+   }
/*
 * keep the device that is marked to be the target device for the
 * dev_replace procedure
@@ -2940,14 +2950,6 @@ retry_root_backup:
}
fs_info->num_tolerated_disk_barrier_failures =
btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
-   if (fs_info->fs_devices->missing_devices >
-fs_info->num_tolerated_disk_barrier_failures &&
-   !(sb->s_flags & MS_RDONLY)) {
-		pr_warn("BTRFS: missing devices(%llu) exceeds the limit(%d), writeable mount is not allowed\n",
-   fs_info->fs_devices->missing_devices,
-   fs_info->num_tolerated_disk_barrier_failures);
-   goto fail_sysfs;
-   }
 
fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
   "btrfs-cleaner");
-- 
2.4.1



Re: [PATCH 3/5] btrfs: Do per-chunk degraded check for remount

2015-09-25 Thread Anand Jain

Qu,

Strictly speaking, IMO it should be reported to the user on the CLI 
terminal, with no logging required. Since it's not that easy to do 
that at this point, I am OK with logging it as an error. Since we are 
failing the task (mount), error is better.


I have made that change on top of the patch

  [PATCH 1/1] Btrfs: consolidate btrfs_error() to btrfs_std_error()

and sent them both.

Thanks, Anand



Thanks for pointing this out.

I was quite unsure about using btrfs_info/warn/error.

In this case, I just want to output a dmesg message to let the user know
exactly what caused the mount to fail.
The original code outputs nothing but "failed to open chunk tree", which is
quite confusing for the end user.

I was planning to use btrfs_info, but this really is an error message,
even if its only purpose is to tell the user the real cause.

Maybe btrfs_warn will be a better choice?

Thanks,
Qu





Re: BTRFS as image store for KVM?

2015-09-25 Thread Rich Freeman
On Sat, Sep 19, 2015 at 9:26 PM, Jim Salter  wrote:
>
> ZFS, by contrast, works like absolute gangbusters for KVM image storage.

I'd be interested in what allows ZFS to handle KVM image storage well,
and whether this could be implemented in btrfs.  I'd think that the
fragmentation issues would potentially apply to any COW filesystem,
and if ZFS has a solution for this then it would probably benefit
btrfs to implement the same solution, and not just for VM images.

--
Rich


Re: [PATCH] Btrfs: add a flags field to btrfs_transaction

2015-09-25 Thread Holger Hoffstätte
Followup from my observation wrt. "Btrfs: change how we wait for
pending ordered extents" and balance sitting idle:

On Thu, Sep 24, 2015 at 4:47 PM, Josef Bacik  wrote:
> I want to set some per transaction flags, so instead of adding yet another int
> lets just convert the current two int indicators to flags and add a flags 
> field
> for future use.  Thanks,

..snip..

> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index ff64689..3f5a781 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1399,7 +1399,7 @@ again:
> btrfs_error(root->fs_info, ret,
> "Failed to remove dev extent item");
> } else {
> -   trans->transaction->have_free_bgs = 1;
> +   set_bit(BTRFS_TRANS_DIRTY_BG_RUN, >transaction->flags);

Judging from the rest of the code transformation TRANS_DIRTY_BG_RUN
seems like the wrong bit to set here. A quick test with
set_bit(BTRFS_TRANS_HAVE_FREE_BGS) confirms that it fixes the balance
delays.

cheers
Holger


Re: BTRFS as image store for KVM?

2015-09-25 Thread Jim Salter

I suspect that the answer most likely boils down to "the ARC".

ZFS uses an Adaptive Replacement Cache instead of a standard FIFO, which 
keeps blocks in cache longer if they have been accessed in cache.  This 
means much higher cache hit rates, which also means minimizing the 
effects of fragmentation.


That's an off-the-top-of-my-head guess, though.  All I can tell you for 
certain is that I've done both - KVM stores on btrfs and on ZFS (and on 
LVM and on mdraid and...) - and it works extremely, extremely well on 
ZFS for long periods of time, where with btrfs it works very well at 
first but then degrades rapidly.


FWIW I've been using KVM + ZFS in wide production (>50 hosts) for 5+ 
years now.


On 09/25/2015 08:48 AM, Rich Freeman wrote:

On Sat, Sep 19, 2015 at 9:26 PM, Jim Salter  wrote:

ZFS, by contrast, works like absolute gangbusters for KVM image storage.

I'd be interested in what allows ZFS to handle KVM image storage well,
and whether this could be implemented in btrfs.  I'd think that the
fragmentation issues would potentially apply to any COW filesystem,
and if ZFS has a solution for this then it would probably benefit
btrfs to implement the same solution, and not just for VM images.

--
Rich




Re: BTRFS as image store for KVM?

2015-09-25 Thread Austin S Hemmelgarn

On 2015-09-25 08:48, Rich Freeman wrote:

On Sat, Sep 19, 2015 at 9:26 PM, Jim Salter  wrote:


ZFS, by contrast, works like absolute gangbusters for KVM image storage.


I'd be interested in what allows ZFS to handle KVM image storage well,
and whether this could be implemented in btrfs.  I'd think that the
fragmentation issues would potentially apply to any COW filesystem,
and if ZFS has a solution for this then it would probably benefit
btrfs to implement the same solution, and not just for VM images.
That may be tough to do, however; the internal design of ZFS is _very_ 
different from that of BTRFS (and for that matter, every other 
filesystem on Linux).  Part of it may just be better data locality (if 
all of the fragments of a file are close to each other, then the 
fragmentation of the file is not as much of a performance hit), and part 
of it is probably how they do caching (and I personally _do not_ want 
BTRFS to try to do caching the way ZFS does, we have a unified pagecache 
in the VFS for a reason, we should be improving that, not trying to come 
up with multiple independent solutions).



Even aside from that however, just saying that ZFS works great for some 
particular use case isn't giving enough info, it has so many optional 
features and configuration knobs, you really need to give specifics on 
how you have ZFS set up in that case.






Incremental btrfs send/receive fails if file is unlinked and cloned afterwards

2015-09-25 Thread Martin Raiber
Hi,

the commit "Btrfs: incremental send, check if orphanized dir inode needs
delayed rename" causes incremental send/receive to fail if a file is
unlinked and then reflinked to the same location from the parent
snapshot. An xfstest reproducing the issue is attached.

Regards,
Martin
From ebc5e8721264823a0df92b31e5fb1381f7f5e6f8 Mon Sep 17 00:00:00 2001
From: Martin Raiber 
Date: Fri, 25 Sep 2015 13:24:13 +0200
Subject: [PATCH 1/1] btrfs: test for incremental send after file unlink and
 then cloning

Creating a snapshot, then removing a file and cloning it back to its
original location, causes btrfs send/receive to fail, because
it doesn't use the correct path for the file unlink.
---
 tests/btrfs/104 | 94 +
 tests/btrfs/104.out |  7 
 tests/btrfs/group   |  1 +
 3 files changed, 102 insertions(+)
 create mode 100755 tests/btrfs/104
 create mode 100644 tests/btrfs/104.out

diff --git a/tests/btrfs/104 b/tests/btrfs/104
new file mode 100755
index 000..976228f
--- /dev/null
+++ b/tests/btrfs/104
@@ -0,0 +1,94 @@
+#! /bin/bash
+# FS QA Test No. btrfs/104
+#
+# Test that an incremental send works after a file gets unlinked
+# and then cloned back from the previous snapshot.
+#
+#---
+# Copyright (C) 2015 SUSE Linux Products GmbH. All Rights Reserved.
+# Copyright (C) 2015 Martin Raiber 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   rm -fr $send_files_dir
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_cloner
+_need_to_be_root
+_require_cp_reflink
+
+send_files_dir=$TEST_DIR/btrfs-test-$seq
+
+rm -f $seqres.full
+rm -fr $send_files_dir
+mkdir $send_files_dir
+
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount
+
+# Create our test file with a single extent of 64K starting at file offset 128K.
+mkdir -p $SCRATCH_MNT/foo
+$XFS_IO_PROG -f -c "pwrite -S 0xaa 128K 64K" $SCRATCH_MNT/foo/bar | _filter_xfs_io
+
+_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1
+_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/mysnap2
+
+#Remove the file and then reflink it back from the original snapshot
+rm $SCRATCH_MNT/mysnap2/foo/bar
+cp --reflink=always $SCRATCH_MNT/foo/bar $SCRATCH_MNT/mysnap2/foo/bar
+
+_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT/mysnap2 $SCRATCH_MNT/mysnap2_ro
+
+_run_btrfs_util_prog send $SCRATCH_MNT/mysnap1 -f $send_files_dir/1.snap
+_run_btrfs_util_prog send -p $SCRATCH_MNT/mysnap1 $SCRATCH_MNT/mysnap2_ro \
+   -f $send_files_dir/2.snap
+
+echo "File digest in the original filesystem:"
+md5sum $SCRATCH_MNT/mysnap2_ro/foo/bar | _filter_scratch
+
+# Now recreate the filesystem by receiving both send streams and verify we get
+# the same file contents that the original filesystem had.
+_scratch_unmount
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount
+
+_run_btrfs_util_prog receive -vv $SCRATCH_MNT -f $send_files_dir/1.snap
+_run_btrfs_util_prog receive -vv $SCRATCH_MNT -f $send_files_dir/2.snap
+
+echo "File digest in the new filesystem:"
+md5sum $SCRATCH_MNT/mysnap2_ro/foo/bar | _filter_scratch
+
+status=0
+exit
diff --git a/tests/btrfs/104.out b/tests/btrfs/104.out
new file mode 100644
index 000..6d18932
--- /dev/null
+++ b/tests/btrfs/104.out
@@ -0,0 +1,7 @@
+QA output created by 104
+wrote 65536/65536 bytes at offset 131072
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+File digest in the original filesystem:
+eb4de91c30abc45b27bc4c00a653d84e  SCRATCH_MNT/mysnap2_ro/foo/bar
+File digest in the new filesystem:
+eb4de91c30abc45b27bc4c00a653d84e  SCRATCH_MNT/mysnap2_ro/foo/bar
diff --git a/tests/btrfs/group b/tests/btrfs/group
index e92a65a..91580b9 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -106,3 +106,4 @@
 101 auto quick replace
 102 auto quick metadata enospc
 103 auto quick clone compress
+104 auto quick send clone
-- 

[PATCH 1/1] Btrfs: consolidate btrfs_error() to btrfs_std_error()

2015-09-25 Thread Anand Jain
btrfs_error() and btrfs_std_error() do the same thing
and call _btrfs_std_error(), so consolidate them.
The main motivation is that btrfs_error() is too closely
named to btrfs_err(): one performs the error action, the
other only logs the error, so they should not share such
similar names.

Signed-off-by: Anand Jain 
Suggested-by: David Sterba 
---
 fs/btrfs/ctree.c   |  6 +++---
 fs/btrfs/ctree.h   |  9 +
 fs/btrfs/disk-io.c |  8 
 fs/btrfs/extent-tree.c |  2 +-
 fs/btrfs/inode-item.c  |  2 +-
 fs/btrfs/ioctl.c   |  2 +-
 fs/btrfs/relocation.c  |  2 +-
 fs/btrfs/root-tree.c   |  4 ++--
 fs/btrfs/transaction.c |  2 +-
 fs/btrfs/tree-log.c|  8 
 fs/btrfs/volumes.c | 14 +++---
 11 files changed, 26 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 5f745ea..1063315 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1011,7 +1011,7 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
return ret;
if (refs == 0) {
ret = -EROFS;
-   btrfs_std_error(root->fs_info, ret);
+   btrfs_std_error(root->fs_info, ret, NULL);
return ret;
}
} else {
@@ -1927,7 +1927,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
child = read_node_slot(root, mid, 0);
if (!child) {
ret = -EROFS;
-   btrfs_std_error(root->fs_info, ret);
+   btrfs_std_error(root->fs_info, ret, NULL);
goto enospc;
}
 
@@ -2030,7 +2030,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
 */
if (!left) {
ret = -EROFS;
-   btrfs_std_error(root->fs_info, ret);
+   btrfs_std_error(root->fs_info, ret, NULL);
goto enospc;
}
wret = balance_node_right(trans, root, mid, left);
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4484063..a86051e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4127,14 +4127,7 @@ do {  \
  __LINE__, (errno));   \
 } while (0)
 
-#define btrfs_std_error(fs_info, errno)\
-do {   \
-   if ((errno))\
-   __btrfs_std_error((fs_info), __func__,  \
-  __LINE__, (errno), NULL);\
-} while (0)
-
-#define btrfs_error(fs_info, errno, fmt, args...)  \
+#define btrfs_std_error(fs_info, errno, fmt, args...)  \
 do {   \
__btrfs_std_error((fs_info), __func__, __LINE__,\
  (errno), fmt, ##args);\
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e0dbe41..18796c9 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2377,7 +2377,7 @@ static int btrfs_replay_log(struct btrfs_fs_info *fs_info,
/* returns with log_tree_root freed on success */
ret = btrfs_recover_log_trees(log_tree_root);
if (ret) {
-   btrfs_error(tree_root->fs_info, ret,
+   btrfs_std_error(tree_root->fs_info, ret,
"Failed to recover log tree");
free_extent_buffer(log_tree_root->node);
kfree(log_tree_root);
@@ -3570,7 +3570,7 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
if (ret) {
mutex_unlock(
>fs_info->fs_devices->device_list_mutex);
-   btrfs_error(root->fs_info, ret,
+   btrfs_std_error(root->fs_info, ret,
"errors while submitting device barriers.");
return ret;
}
@@ -3610,7 +3610,7 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
mutex_unlock(>fs_info->fs_devices->device_list_mutex);
 
/* FUA is masked off if unsupported and can't be the reason */
-   btrfs_error(root->fs_info, -EIO,
+   btrfs_std_error(root->fs_info, -EIO,
"%d errors while writing supers", total_errors);
return -EIO;
}
@@ -3628,7 +3628,7 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
}
mutex_unlock(>fs_info->fs_devices->device_list_mutex);
if (total_errors > max_errors) {
-   btrfs_error(root->fs_info, -EIO,
+   btrfs_std_error(root->fs_info, -EIO,

Re: [PATCH 3/5] btrfs: Do per-chunk degraded check for remount

2015-09-25 Thread Anand Jain



On 09/21/2015 10:10 AM, Qu Wenruo wrote:

Just as with the mount-time check, use the new btrfs_check_degraded() to do
the per-chunk check.

Signed-off-by: Qu Wenruo 
---
  fs/btrfs/super.c | 11 +++
  1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index c389c13..720c044 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1681,11 +1681,14 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
goto restore;
}

-   if (fs_info->fs_devices->missing_devices >
-fs_info->num_tolerated_disk_barrier_failures &&
-   !(*flags & MS_RDONLY)) {
+   ret = btrfs_check_degradable(fs_info, *flags);
+   if (ret < 0) {
+   btrfs_error(fs_info, ret,
+   "degraded writable remount failed");


btrfs_error(), which is an error handling routine, isn't appropriate
here, mainly because we are in the remount context: I am not sure you
meant to change the fs state to readonly (on error) during a remount.
Instead, btrfs_err(), which only reports/logs the error, would be
appropriate.


btrfs_error() and btrfs_err() are very different in action, but it is
easy to have an oversight and use the wrong one. The patch below changed it:


   Btrfs: consolidate btrfs_error() to btrfs_std_error()

Thanks, Anand



+   goto restore;
+   } else if (ret > 0 && !btrfs_test_opt(root, DEGRADED)) {
btrfs_warn(fs_info,
-   "too many missing devices, writeable remount is not allowed");
+   "some device missing, but still degraded mountable, please remount with -o degraded option");
ret = -EACCES;
goto restore;
}


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/5] btrfs: Do per-chunk check for mount time check

2015-09-25 Thread Anand Jain



On 09/21/2015 10:10 AM, Qu Wenruo wrote:

Now use btrfs_check_degraded() to do the mount-time degraded check.

With this patch, now we can mount with the following case:
  # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
  # wipefs -a /dev/sdc
  # mount /dev/sdb /mnt/btrfs -o degraded
  As the single data chunk is only on sdb, it's OK to mount as
  degraded, since missing one device is tolerable for RAID1.

But still fail with the following case as expected:
  # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
  # wipefs -a /dev/sdb
  # mount /dev/sdc /mnt/btrfs -o degraded
  As the data chunk is only on sdb, it's not OK to mount it as
  degraded.

Reported-by: Zhao Lei 
Reported-by: Anand Jain 
Signed-off-by: Qu Wenruo 
---
  fs/btrfs/disk-io.c | 18 ++
  1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0b658d0..d64299f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2858,6 +2858,16 @@ int open_ctree(struct super_block *sb,
goto fail_tree_roots;
}

+   ret = btrfs_check_degradable(fs_info, fs_info->sb->s_flags);
+   if (ret < 0) {
+   btrfs_error(fs_info, ret, "degraded writable mount failed");



+   goto fail_tree_roots;


Same here too: if we are failing the mount anyway, there is no point
in doing the error handling (btrfs_error()); plain error reporting
(btrfs_err()) is better.


Thanks, Anand




+   } else if (ret > 0 && !btrfs_test_opt(chunk_root, DEGRADED)) {
+   btrfs_warn(fs_info,
"Some device missing, but still degraded mountable, please mount with -o degraded option");
+   ret = -EACCES;
+   goto fail_tree_roots;
+   }
/*
 * keep the device that is marked to be the target device for the
 * dev_replace procedure
@@ -2947,14 +2957,6 @@ retry_root_backup:
}
fs_info->num_tolerated_disk_barrier_failures =
btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
-   if (fs_info->fs_devices->missing_devices >
-fs_info->num_tolerated_disk_barrier_failures &&
-   !(sb->s_flags & MS_RDONLY)) {
pr_warn("BTRFS: missing devices(%llu) exceeds the limit(%d), writeable mount is not allowed\n",
-   fs_info->fs_devices->missing_devices,
-   fs_info->num_tolerated_disk_barrier_failures);
-   goto fail_sysfs;
-   }

fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
   "btrfs-cleaner");




Re: [PATCH 1/1] Btrfs: consolidate btrfs_error() to btrfs_std_error()

2015-09-25 Thread David Sterba
On Fri, Sep 25, 2015 at 02:43:01PM +0800, Anand Jain wrote:
> btrfs_error() and btrfs_std_error() do the same thing
> and call _btrfs_std_error(), so consolidate them.
> The main motivation is that btrfs_error() is too closely
> named to btrfs_err(): one performs the error action, the
> other only logs the error, so they should not share such
> similar names.
> 
> Signed-off-by: Anand Jain 
> Suggested-by: David Sterba 

Reviewed-by: David Sterba 

I guess we can live with the extra NULL argument; in some cases it does
not make sense to put a string there.

> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -4852,7 +4852,7 @@ static long btrfs_ioctl_qgroup_assign(struct file 
> *file, void __user *arg)
>   /* update qgroup status and info */
>   err = btrfs_run_qgroups(trans, root->fs_info);
>   if (err < 0)
> - btrfs_error(root->fs_info, ret,
> + btrfs_std_error(root->fs_info, ret,

This looks like a bug, ret instead of err. The value of 'ret' is set by
add/del qgroup relation which might fail if the relations are there, but
we do not care. We're likely interested in the return code of
btrfs_run_qgroups, ie. err. Can you please send a new patch on top of this?

>   "failed to update qgroup status and info\n");
>   err = btrfs_end_transaction(trans, root);
>   if (err && !ret)


Re: [PATCH v2 2/2] btrfs: reada: Fix returned errno code

2015-09-25 Thread David Sterba
On Thu, Sep 24, 2015 at 08:13:33PM +0100, Luis de Bethencourt wrote:
> reada is using -1 instead of the -ENOMEM defined macro to specify that
> a buffer allocation failed. Since the error number is propagated, the
> caller will get a -EPERM which is the wrong error condition.
> 
> Also, updating the caller to return the exact value from
> reada_add_block.
> 
> Smatch tool warning:
> reada_add_block() warn: returning -1 instead of -ENOMEM is sloppy
> 
> Signed-off-by: Luis de Bethencourt 

Reviewed-by: David Sterba 


Re: [PATCH v2 1/2] btrfs-progs: Introduce warning and error for common use

2015-09-25 Thread David Sterba
On Wed, Sep 16, 2015 at 05:40:46PM +0800, Zhao Lei wrote:
> +static inline void __veprintf(const char *prefix, const char *format,
> +   va_list ap)
> +{
> + if (prefix)
> + fprintf(stderr, "%s", prefix);
> + vfprintf(stderr, format, ap);

I'm not sure we need this helper. All it does is print a fixed string;
we can simply add fputs("prefix", stderr) into the warning/error functions.

> +static inline int warning_on(int condition, const char *fmt, ...)
> +{
> + if (!condition)
> + return 0;
> +
> + va_list args;

Please do not put declaration(s) after statements.

> +static inline int error_on(int condition, const char *fmt, ...)
> +{
> + if (!condition)
> + return 0;
> +
> + va_list args;

dtto

> +
> + va_start(args, fmt);
> + __veprintf("ERROR: ", fmt, args);
> + va_end(args);
> +
> + return 1;
> +}


Re: Latest kernel to use?

2015-09-25 Thread Sjoerd
Thanks all for the feedback. Still doubting though whether to go for 4.2.1 or not.
The main reason is that I am currently running 4.1.7 on my laptop, which seems to
work fine, and I had some issues with the 4.2.0 kernel. No issues, I think, that
were btrfs related, but more related to my nvidia card. Anyway, switching back
to 4.1.7 resolved those, so I am holding back a bit on trying the 4.2.1 version ;)
Anyway I'll see and can always revert back if I don't like it ;)

Cheers,
Sjoerd



Re: [PATCH] Btrfs: add a flags field to btrfs_transaction

2015-09-25 Thread Josef Bacik

On 09/25/2015 08:30 AM, Holger Hoffstätte wrote:

Followup from my observation wrt. "Btrfs: change how we wait for
pending ordered extents" and balance sitting idle:

On Thu, Sep 24, 2015 at 4:47 PM, Josef Bacik  wrote:

I want to set some per-transaction flags, so instead of adding yet another int
let's just convert the current two int indicators to flags and add a flags field
for future use.  Thanks,


..snip..


diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ff64689..3f5a781 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1399,7 +1399,7 @@ again:
 btrfs_error(root->fs_info, ret,
 "Failed to remove dev extent item");
 } else {
-   trans->transaction->have_free_bgs = 1;
+   set_bit(BTRFS_TRANS_DIRTY_BG_RUN, >transaction->flags);


Judging from the rest of the code transformation TRANS_DIRTY_BG_RUN
seems like the wrong bit to set here. A quick test with
set_bit(BTRFS_TRANS_HAVE_FREE_BGS) confirms that it fixes the balance
delays.


Haha oops, thanks for catching that, I'll fix it up locally and send out 
an updated one in a bit.


Josef



Re: Latest kernel to use?

2015-09-25 Thread Roman Mamedov
On Fri, 25 Sep 2015 09:12:15 -0400
Rich Freeman  wrote:

> I'll just say that my btrfs stability has gone WAY up when I stopped
> following this advice and instead followed a recent longterm.  Right
> now I'm following 3.18.  There were some really bad corruption issues
> in 3.17/18/19 that burned me, and today while considering moving up to
> 4.1 I'm still seeing a lot of threads about issues during balance/etc.
> I still run into the odd issue with 3.18, but not nearly to the degree
> that I used to.
> 
> Now, I would stick with a recent longterm.  The older longterms go
> back to a time when btrfs was far more experimental.  Even 3.16
> probably has a lot of issues that are fixed in 3.18.

Absolutely that! I was pondering whether or not to chime in with my praise of
"longterm" as far as Btrfs stability goes, but apparently it's not just me who
uses it. In my experience 3.18 just works* and is very stable, and before that
it was 3.14, which by luck(?) happened to go longterm IIRC just before Btrfs
transitioned to "kernel worker threads" in 3.15 (and that caused ALL sorts of
trouble initially).

[*] at least in a relatively simple scenario -- with snapshots, but without
using any of the multi-device features or stuff such as qgroups or
send/receive.

-- 
With respect,
Roman




Re: BTRFS as image store for KVM?

2015-09-25 Thread Jim Salter
Pretty much bog-standard, as ZFS goes.  Nothing different than what's 
recommended for any generic ZFS use.


* set blocksize to match hardware blocksize - 4K drives get 4K 
blocksize, 8K drives get 8K blocksize (Samsung SSDs)
* LZO compression is a win.  But it's not like anything sucks without 
it.  No real impact on performance for most use, + or -. Just saves space.
* > 4GB allocated to the ARC.  General rule of thumb: half the RAM 
belongs to the host (which is mostly ARC), half belongs to the guests.


I strongly prefer pool-of-mirrors topology, but nothing crazy happens if 
you use striped-with-parity instead.  I used to use RAIDZ1 (the rough 
equivalent of RAID5) quite frequently, and there wasn't anything 
amazingly sucky about it; it performed at least as well as you'd expect 
ext4 on mdraid5 to perform.


ZFS might or might not do a better job of managing fragmentation; I 
really don't know.  I strongly suspect the design difference between the 
kernel's simple FIFO page cache and ZFS' weighted cache makes a really, 
really big difference.




On 09/25/2015 09:04 AM, Austin S Hemmelgarn wrote:
> you really need to give specifics on how you have ZFS set up in that 
case.




Re: Latest kernel to use?

2015-09-25 Thread Hugo Mills
On Fri, Sep 25, 2015 at 03:36:18PM +0200, Sjoerd wrote:
> Thanks all for the feedback. Still doubting though whether to go for 4.2.1 or not.
> The main reason is that I am currently running 4.1.7 on my laptop, which seems to
> work fine, and I had some issues with the 4.2.0 kernel. No issues, I think, that
> were btrfs related, but more related to my nvidia card. Anyway, switching back
> to 4.1.7 resolved those, so I am holding back a bit on trying the 4.2.1 version ;)
> Anyway I'll see and can always revert back if I don't like it ;)

   If 4.1.7 is working OK for you, stick with it. It's getting much
less important now, as btrfs matures, to keep up with the _very_
latest. Purely on gut feeling about issues we see on IRC and here,
3.19 or later would be reasonable at the moment.

   Compared to, say, 3 or 4 years ago when running late -rc kernels
was often preferable to running the latest stable, and things have
improved quite a bit. :)

   Hugo.

-- 
Hugo Mills | For months now, we have been making triumphant
hugo@... carfax.org.uk | retreats before a demoralised enemy who is advancing
http://carfax.org.uk/  | in utter disorder.
PGP: E2AB1DE4  |  Eric Frank Russell, Wasp




Re: Latest kernel to use?

2015-09-25 Thread Rich Freeman
On Fri, Sep 25, 2015 at 7:20 AM, Austin S Hemmelgarn
 wrote:
> On 2015-09-24 17:07, Sjoerd wrote:
>>
>> Maybe a silly question for most of you, but the wiki states to always try
>> to
>> use the latest kernel with btrfs. Which one would be best:
>> - 4.2.1 (currently latest stable and matches the btrfs-progs versioning)
>> or
>> - the 4.3.x (mainline)?
>>
>> Stable sounds more stable to me(hence the name ;) ), but the mainline
>> kernel
>> seems to be in more active development?
>>
> Like Hugo said, 4.2.1 is what you want right now.  In general, go with the
> highest version number that isn't a -rc version (4.3 isn't actually released
> yet, IIRC they're up to 4.3-rc2 right now, and almost at -rc3) (we should
> probably be specific like this on the wiki).
>

I'll just say that my btrfs stability has gone WAY up when I stopped
following this advice and instead followed a recent longterm.  Right
now I'm following 3.18.  There were some really bad corruption issues
in 3.17/18/19 that burned me, and today while considering moving up to
4.1 I'm still seeing a lot of threads about issues during balance/etc.
I still run into the odd issue with 3.18, but not nearly to the degree
that I used to.

Now, I would stick with a recent longterm.  The older longterms go
back to a time when btrfs was far more experimental.  Even 3.16
probably has a lot of issues that are fixed in 3.18.

That said, if you do run into an issue on a longterm kernel nobody
around here is likely to be able to help you much unless you can
reproduce it on the most recent stable kernel.

Just tossing that out as an alternative opinion.  Right now I'm
sticking with 3.18, but I'm interested in making the 4.1 switch once
issues with that seem to have died down.

--
Rich


Re: BTRFS as image store for KVM?

2015-09-25 Thread Austin S Hemmelgarn

On 2015-09-25 09:12, Jim Salter wrote:

Pretty much bog-standard, as ZFS goes.  Nothing different than what's
recommended for any generic ZFS use.

* set blocksize to match hardware blocksize - 4K drives get 4K
blocksize, 8K drives get 8K blocksize (Samsung SSDs)
* LZO compression is a win.  But it's not like anything sucks without
it.  No real impact on performance for most use, + or -. Just saves space.
* > 4GB allocated to the ARC.  General rule of thumb: half the RAM
belongs to the host (which is mostly ARC), half belongs to the guests.

I strongly prefer pool-of-mirrors topology, but nothing crazy happens if
you use striped-with-parity instead.  I use to use RAIDZ1 (the rough
equivalent of RAID5) quite frequently, and there wasn't anything
amazingly sucky about it; it performed at least as well as you'd expect
ext4 on mdraid5 to perform.

ZFS might or might not do a better job of managing fragmentation; I
really don't know.  I /strongly/ suspect the design difference between
the kernel's simple FIFO page cache and ZFS' weighted cache makes a
really, really big difference.
I've been coming to that same conclusion myself over the years.  I would
really love to see a drop-in replacement for Linux's page cache with
better performance (I don't remember for sure, but I seem to recall
that the native page cache isn't straight FIFO), but the likelihood of
that actually getting into mainline is slim to none (can you imagine
though how fast XFS or ext* would be with a good caching algorithm?).




On 09/25/2015 09:04 AM, Austin S Hemmelgarn wrote:

you really need to give specifics on how you have ZFS set up in that
case.










raw devices or partitions?

2015-09-25 Thread Sjoerd
Is it better to use raw devices for a RAID setup, or to make one partition on
each drive and then create your RAID from there?
Right now I have one setup that uses raw devices, but I get "unknown partition
table" messages all the time in my logs.
I am planning to create a RAID 5 setup (seems to be stable these days?), but am
wondering whether to deal with raw drives or partitions (4 at the moment).
In the wiki they're referring to raw devices in the examples, but that could
be outdated?


Cheers,
Sjoerd



Re: Btrfs check dont repair

2015-09-25 Thread Vackář František
Now my memory is OK.

I found this patch, but I am not sure if it's the correct one:
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg33496.html

Can you help me, please: which patch should I use?

Frantisek

2015-09-23 17:22 GMT+02:00 Vackář František :
> Yes, I have bad ram. I ran memtest and memory is really bad.
>
> So I must buy new memory first.
>
> Thank you.
>
> Frantisek
>
> 2015-09-23 16:43 GMT+02:00 Hugo Mills :
>> On Wed, Sep 23, 2015 at 04:39:27PM +0200, Vackář František wrote:
>>> Hello,
>>>
>>> i have problem with my btrfs, can you help me, please? Its on my
>>> notebook. Sometimes it exhausted battery during sleep and died. But
>>> everytime FS was ok. But ones mount fail.
>>>
>>> Do you have any idea how repair it?
>> [snip]
>>> [root@rak ~]# dmesg | tail
>> [snip]
>>> [ 4108.72] BTRFS critical (device sdb2): corrupt leaf, bad key
>>> order: block=3242455040,root=1, slot=0
>>
>>You have bad RAM. You should run memtest and fix the hardware first.
>>
>>After that, there's some patches that should allow btrfs repair to
>> fix bad key orders in most situations -- I think David was picking
>> them up again, but I don't know what state they're in right now.
>>
>>Hugo.
>>
>>> [ 4108.444649] BTRFS critical (device sdb2): corrupt leaf, bad key
>>> order: block=3242455040,root=1, slot=0
>>> [ 4108.444681] BTRFS error (device sdb2): Error removing orphan entry,
>>> stopping orphan cleanup
>>> [ 4108.444684] BTRFS error (device sdb2): could not do orphan cleanup -22
>>> [ 4111.047323] BTRFS: open_ctree failed
>>>
>>> [root@rak ~]# uname -a
>>> Linux rak 4.1.2-2-ARCH #1 SMP PREEMPT Wed Jul 15 08:30:32 UTC 2015
>>> x86_64 GNU/Linux
>>>
>>> [root@rak ~]# btrfs fi show /dev/sdb2
>>> Label: 'data'  uuid: 754a3186-c0ae-4680-ab28-864c8bdad8b5
>>> Total devices 1 FS bytes used 1.23TiB
>>> devid1 size 1.72TiB used 1.24TiB path /dev/sdb2
>>>
>>> btrfs-progs v4.2
>>>
>>> [root@rak ~]# btrfs --version
>>> btrfs-progs v4.2
>>>
>>> [root@rak ~]#   btrfs fi show
>>> Label: 'data'  uuid: 754a3186-c0ae-4680-ab28-864c8bdad8b5
>>> Total devices 1 FS bytes used 1.23TiB
>>> devid1 size 1.72TiB used 1.24TiB path /dev/sdb2
>>>
>>> btrfs-progs v4.2
>>
>> --
>> Hugo Mills | Sometimes, when I'm alone, I Google myself.
>> hugo@... carfax.org.uk |
>> http://carfax.org.uk/  |
>> PGP: E2AB1DE4  |


Re: Btrfs check dont repair

2015-09-25 Thread Hugo Mills
On Fri, Sep 25, 2015 at 11:36:55AM +0200, Vackář František wrote:
> Now my memory is OK.
> 
> I found this patch, but I am not sure if it's the correct one:
> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg33496.html
> 
> Can you help me, please: which patch should I use?

   Yes, that's the patch series (you'll need all three patches). David
Sterba said it didn't apply cleanly any more; I don't know how hard
it'll be to fix up the problems, though.

   Hugo.

> Frantisek
> 
> 2015-09-23 17:22 GMT+02:00 Vackář František :
> > Yes, I have bad ram. I ran memtest and memory is really bad.
> >
> > So a must buy new memory first.
> >
> > Thank you.
> >
> > Frantisek
> >
> > 2015-09-23 16:43 GMT+02:00 Hugo Mills :
> >> On Wed, Sep 23, 2015 at 04:39:27PM +0200, Vackář František wrote:
> >>> Hello,
> >>>
> >>> i have problem with my btrfs, can you help me, please? Its on my
> >>> notebook. Sometimes it exhausted battery during sleep and died. But
> >>> everytime FS was ok. But ones mount fail.
> >>>
> >>> Do you have any idea how repair it?
> >> [snip]
> >>> [root@rak ~]# dmesg | tail
> >> [snip]
> >>> [ 4108.72] BTRFS critical (device sdb2): corrupt leaf, bad key
> >>> order: block=3242455040,root=1, slot=0
> >>
> >>You have bad RAM. You should run memtest and fix the hardware first.
> >>
> >>After that, there's some patches that should allow btrfs repair to
> >> fix bad key orders in most situations -- I think David was picking
> >> them up again, but I don't know what state they're in right now.
> >>
> >>Hugo.
> >>
> >>> [ 4108.444649] BTRFS critical (device sdb2): corrupt leaf, bad key
> >>> order: block=3242455040,root=1, slot=0
> >>> [ 4108.444681] BTRFS error (device sdb2): Error removing orphan entry,
> >>> stopping orphan cleanup
> >>> [ 4108.444684] BTRFS error (device sdb2): could not do orphan cleanup -22
> >>> [ 4111.047323] BTRFS: open_ctree failed
> >>>
> >>> [root@rak ~]# uname -a
> >>> Linux rak 4.1.2-2-ARCH #1 SMP PREEMPT Wed Jul 15 08:30:32 UTC 2015
> >>> x86_64 GNU/Linux
> >>>
> >>> [root@rak ~]# btrfs fi show /dev/sdb2
> >>> Label: 'data'  uuid: 754a3186-c0ae-4680-ab28-864c8bdad8b5
> >>> Total devices 1 FS bytes used 1.23TiB
> >>> devid1 size 1.72TiB used 1.24TiB path /dev/sdb2
> >>>
> >>> btrfs-progs v4.2
> >>>
> >>> [root@rak ~]# btrfs --version
> >>> btrfs-progs v4.2
> >>>
> >>> [root@rak ~]#   btrfs fi show
> >>> Label: 'data'  uuid: 754a3186-c0ae-4680-ab28-864c8bdad8b5
> >>> Total devices 1 FS bytes used 1.23TiB
> >>> devid1 size 1.72TiB used 1.24TiB path /dev/sdb2
> >>>
> >>> btrfs-progs v4.2
> >>

-- 
Hugo Mills | Do not meddle in the affairs of wizards, for they
hugo@... carfax.org.uk | are subtle, and quick to anger.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |J.R.R. Tolkein




Re: [PATCH 2/2] btrfs-progs: optimize not to scan repeated fsid mount points

2015-09-25 Thread David Sterba
On Tue, Sep 15, 2015 at 04:46:23PM +0800, Anand Jain wrote:
> An fsid can be mounted multiple times, with different subvolids,
> and we don't have to scan a mount point if we already have
> it in the scanned list.
> 
> This nicely avoids the following warnings with multiple
> subvolume mounts on an older kernel like 2.6.32, where the
> BTRFS_IOC_GET_FSLABEL ioctl does not exist.
> 
> ./btrfs fi show -m
> Label: none  uuid: 31845933-611e-422d-ae6f-386e57ad81aa
>   Total devices 2 FS bytes used 172.00KiB
>   devid1 size 3.00GiB used 642.38MiB path /dev/sdd
>   devid2 size 3.00GiB used 622.38MiB path /dev/sde
> 
> warning, device 2 is missing
> warning devid 2 not found already
> warning, device 2 is missing
> warning devid 2 not found already
> 
> Signed-off-by: Anand Jain 

Applied, thanks.


Re: BTRFS as image store for KVM?

2015-09-25 Thread Timofey Titovets
2015-09-25 16:52 GMT+03:00 Jim Salter :
> Pretty much bog-standard, as ZFS goes.  Nothing different than what's
> recommended for any generic ZFS use.
>
> * set blocksize to match hardware blocksize - 4K drives get 4K blocksize, 8K
> drives get 8K blocksize (Samsung SSDs)
> * LZO compression is a win.  But it's not like anything sucks without it.
> No real impact on performance for most use, + or -. Just saves space.
> * > 4GB allocated to the ARC.  General rule of thumb: half the RAM belongs
> to the host (which is mostly ARC), half belongs to the guests.
>
> I strongly prefer pool-of-mirrors topology, but nothing crazy happens if you
> use striped-with-parity instead.  I use to use RAIDZ1 (the rough equivalent
> of RAID5) quite frequently, and there wasn't anything amazingly sucky about
> it; it performed at least as well as you'd expect ext4 on mdraid5 to
> perform.
>
> ZFS might or might not do a better job of managing fragmentation; I really
> don't know.  I strongly suspect the design difference between the kernel's
> simple FIFO page cache and ZFS' weighted cache makes a really, really big
> difference.
>
>
>
> On 09/25/2015 09:04 AM, Austin S Hemmelgarn wrote:
>> you really need to give specifics on how you have ZFS set up in that case.
>

FYI:
The Linux page cache uses an LRU-based algorithm, and in the general case it works well enough.

-- 
Have a nice day,
Timofey.


Re: BTRFS as image store for KVM?

2015-09-25 Thread Austin S Hemmelgarn

On 2015-09-25 10:02, Timofey Titovets wrote:
> 2015-09-25 16:52 GMT+03:00 Jim Salter :
>> Pretty much bog-standard, as ZFS goes.  Nothing different than what's
>> recommended for any generic ZFS use.
>>
>> * set blocksize to match hardware blocksize - 4K drives get 4K blocksize, 8K
>> drives get 8K blocksize (Samsung SSDs)
>> * LZO compression is a win.  But it's not like anything sucks without it.
>> No real impact on performance for most use, + or -. Just saves space.
>> * > 4GB allocated to the ARC.  General rule of thumb: half the RAM belongs
>> to the host (which is mostly ARC), half belongs to the guests.
>>
>> I strongly prefer pool-of-mirrors topology, but nothing crazy happens if you
>> use striped-with-parity instead.  I used to use RAIDZ1 (the rough equivalent
>> of RAID5) quite frequently, and there wasn't anything amazingly sucky about
>> it; it performed at least as well as you'd expect ext4 on mdraid5 to
>> perform.
>>
>> ZFS might or might not do a better job of managing fragmentation; I really
>> don't know.  I strongly suspect the design difference between the kernel's
>> simple FIFO page cache and ZFS' weighted cache makes a really, really big
>> difference.
>>
>> On 09/25/2015 09:04 AM, Austin S Hemmelgarn wrote:
>>> you really need to give specifics on how you have ZFS set up in that case.
>
> FYI:
> The Linux pagecache uses an LRU-style cache algorithm, and in the general
> case it works well enough.

I'd argue that 'general usage' should be better defined in this 
statement.  Obviously, ZFS's ARC implementation provides better 
performance in a significant number of common use cases for Linux, 
otherwise people wouldn't be using it to the degree they are.  LRU often 
gives abysmal performance for VM images in my experience, and 
virtualization is becoming a very common use case for Linux.  On top of 
that, there are lots of applications that bypass the cache almost 
completely, and while that is a valid option in some cases, it shouldn't 
be needed most of the time.


If it's just plain LRU, I may take the time at some point to try and 
write some patches to test if SLRU works any better (as SLRU is 
essentially ARC without the auto-tuning), although I have nowhere near 
the resources to test something like that to the degree that would be 
required to get it even considered for inclusion in mainline.
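To make the comparison concrete, here is a toy SLRU sketch in Python
(purely illustrative, not a proposed kernel patch): the cache is split into
a probationary and a protected segment, new entries land in probation, and
only a hit promotes an entry to the protected segment.  A one-shot scan
therefore churns only the probationary segment, while repeatedly-used
entries survive.

```python
from collections import OrderedDict

class SLRUCache:
    """Segmented LRU: new entries enter a probationary segment; a hit
    promotes an entry to a protected segment, so single-use entries
    (e.g. a large sequential scan) cannot flush the hot working set."""

    def __init__(self, probation_size, protected_size):
        self.probation_size = probation_size
        self.protected_size = protected_size
        self.probation = OrderedDict()
        self.protected = OrderedDict()

    def get(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)
            return self.protected[key]
        if key in self.probation:
            value = self.probation.pop(key)  # promote on hit
            self.protected[key] = value
            if len(self.protected) > self.protected_size:
                # demote the coldest protected entry back to probation
                old_key, old_val = self.protected.popitem(last=False)
                self._insert_probation(old_key, old_val)
            return value
        return None

    def put(self, key, value):
        if key in self.protected:
            self.protected[key] = value
            self.protected.move_to_end(key)
        else:
            self._insert_probation(key, value)

    def _insert_probation(self, key, value):
        self.probation.pop(key, None)
        self.probation[key] = value
        if len(self.probation) > self.probation_size:
            self.probation.popitem(last=False)  # evict coldest probationer
```

With plain LRU, streaming ten one-shot keys through a small cache would
evict everything that came before; here, an entry that was hit once sits in
the protected segment and survives the scan.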



