Re: [PATCH v4 3/5] btrfs: tree-checker: Enhance output for btrfs_check_leaf

2017-10-11 Thread Nikolay Borisov


On 12.10.2017 03:28, Qu Wenruo wrote:
> 
> 
> On 2017年10月11日 23:41, Nikolay Borisov wrote:
>>
>>
>> On  9.10.2017 04:51, Qu Wenruo wrote:
>>> Enhance the output to print:
>>> 1) Reason
>>> 2) Bad value
>>>     If reason can't explain enough
>>> 3) Good value (range)
>>>
>>> Signed-off-by: Qu Wenruo 
>>> ---
>>>   fs/btrfs/tree-checker.c | 27 +--
>>>   1 file changed, 21 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
>>> index b4ced8d3ce2a..7bba195ecc8b 100644
>>> --- a/fs/btrfs/tree-checker.c
>>> +++ b/fs/btrfs/tree-checker.c
>>> @@ -233,8 +233,9 @@ int btrfs_check_leaf(struct btrfs_root *root,
>>> struct extent_buffer *leaf)
>>>   eb = btrfs_root_node(check_root);
>>>   /* if leaf is the root, then it's fine */
>>>   if (leaf != eb) {
>>> -    CORRUPT("non-root leaf's nritems is 0",
>>> -    leaf, check_root, 0);
>>> +    generic_err(check_root, leaf, 0,
>>> +    "invalid nritems, have %u shouldn't be 0 for
>>> non-root leaf",
>>> +    nritems);
>>
>> I'm a bit confused by what this error messages wants to convey. Even
>> reading the previous version with CORRUPT() it still didn't make sense.
>> So what we want to say here is we shouldn't have empty leaf nodes. So
>> Something along the line of "Unexpected empty leaf".
> 
> Yes, the error message is too fixed to follow the output format.
>>
>> Why would the (leaf != eb) check not trigger, given that we call
>> btrfs_check_leaf when we now that the item is a leaf (level is 0 )?
> 
> What's the problem here? I didn't really get your point.
> 
> Did you mean leaf can't be tree root? Or empty tree root is not possible?
> 
> 
> It's completely possible for a leaf to be a tree root.
> 
> All tree roots of a newly created (without --rootdir) is leaf.
> Because the content of each tree is so few that one leaf can contain
> them all.
> 
> 
> And it's also very possible to have empty tree, whose root (leaf) is
> also empty.
> Still for a newly created btrfs, its csum tree is empty.
> Its uuid tree is also empty.
> 
> 
> But the only valid case for empty leaf is when it's a tree root.
> So the code just checks it, and I didn't find anything wrong here.

I was just confused when the invariant wouldn't hold. Now that you've
explained it I agree with the change.


> 
> Thanks,
> Qu
> 
>>
>>
>>>   free_extent_buffer(eb);
>>>   return -EUCLEAN;
>>>   }
>>> @@ -265,7 +266,11 @@ int btrfs_check_leaf(struct btrfs_root *root,
>>> struct extent_buffer *leaf)
>>>     /* Make sure the keys are in the right order */
>>>   if (btrfs_comp_cpu_keys(&prev_key, &key) >= 0) {
>>> -    CORRUPT("bad key order", leaf, root, slot);
>>> +    generic_err(root, leaf, slot,
>>> +    "bad key order, prev key (%llu %u %llu) current key
>>> (%llu %u %llu)",
>>> +    prev_key.objectid, prev_key.type,
>>> +    prev_key.offset, key.objectid, key.type,
>>> +    key.offset);
>>>   return -EUCLEAN;
>>>   }
>>>   @@ -280,7 +285,10 @@ int btrfs_check_leaf(struct btrfs_root *root,
>>> struct extent_buffer *leaf)
>>>   item_end_expected = btrfs_item_offset_nr(leaf,
>>>    slot - 1);
>>>   if (btrfs_item_end_nr(leaf, slot) != item_end_expected) {
>>> -    CORRUPT("slot offset bad", leaf, root, slot);
>>> +    generic_err(root, leaf, slot,
>>> +    "discontinious item end, have %u expect %u",
>>
>> s/discontinious/unexpected ?
>>
>>> +    btrfs_item_end_nr(leaf, slot),
>>> +    item_end_expected);
>>>   return -EUCLEAN;
>>>   }
>>>   @@ -291,14 +299,21 @@ int btrfs_check_leaf(struct btrfs_root *root,
>>> struct extent_buffer *leaf)
>>>    */
>>>   if (btrfs_item_end_nr(leaf, slot) >
>>>   BTRFS_LEAF_DATA_SIZE(fs_info)) {
>>> -    CORRUPT("slot end outside of leaf", leaf, root, slot);
>>> +    generic_err(root, leaf, slot,
>>> +    "slot end outside of leaf, have %u expect range [0,
>>> %u]",> +    btrfs_item_end_nr(leaf, slot),
>>> +    BTRFS_LEAF_DATA_SIZE(fs_info));
>>>   return -EUCLEAN;
>>>   }
>>>     /* Also check if the item pointer overlaps with btrfs
>>> item. */
>>>   if (btrfs_item_nr_offset(slot) + sizeof(struct btrfs_item) >
>>>   btrfs_item_ptr_offset(leaf, slot)) {
>>> -    CORRUPT("slot overlap with its data", leaf, root, slot);
>>> +    generic_err(root, leaf, slot,
>>> +    "slot overlap with its data, item end %lu data start
>>> %lu",
>>> +    btrfs_item_nr_offset(slot) +
>>> +    sizeof(struct btrfs_item),
>>> +    btrfs_item_ptr_offset(leaf, slot));
>

Re: btrfs seed question

2017-10-11 Thread Anand Jain



On 10/12/2017 08:47 AM, Joseph Dunn wrote:

After seeing how btrfs seeds work I wondered if it was possible to push
specific files from the seed to the rw device.  I know that removing
the seed device will flush all the contents over to the rw device, but
what about flushing individual files on demand?

I found that opening a file, reading the contents, seeking back to 0,
and writing out the contents does what I want, but I was hoping for a
bit less of a hack.

Is there maybe an ioctl or something else that might trigger a similar
action?


  You mean to say - seed-device delete to trigger copy of only the 
specified or the modified files only, instead of whole of seed-device ? 
What's the use case around this ?



Thanks, Anand



Thanks,
-Joseph
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB upgrade fun

2017-10-11 Thread Kai Hendry
A guy on #btrsfs suggests:

15:09  hendry: super_total_bytes 8001581707264 mismatch with
fs_devices total_rw_bytes 8001581710848 that one is because unaligned
partitions, 4.12 - 4.13 kernels are affected (at least some versions)


However I rebooted into 4.9.54-1-lts and I have the same issue.
super_total_bytes 8001581707264 mismatch with fs_devices total_rw_bytes
8001581710848

https://s.natalian.org/2017-10-12/1507775104_2548x1398.png

Any ideas what I can do now? I am getting rather nervous!

Btw I checked sudo btrfs check -p /dev/sd{b,c,d}1 and they are all fine.

Just can't mount my data! :(

Hope you can help me,
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: set include path relatively

2017-10-11 Thread Naohiro Aota
Currently, gcc is passed the include directory with full path. As a result,
dependency files (*.o.d) also record the full path at the build time. Such
full path dependency is annoying for sharing the source between multiple
machines, containers, or anything the path differ.

And this is the same way what other program using autotools e.g. e2fsprogs
is doing:

$ grep top_builddir Makefile
top_builddir = .
CPPFLAGS = -I. -I$(top_builddir)/lib -I$(top_srcdir)/lib
BUILD_CFLAGS = -g -O2  -I. -I$(top_builddir)/lib -I$(top_srcdir)/lib 
-DHAVE_CONFIG_H


Signed-off-by: Naohiro Aota 
---
 Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index b1f3388..69b94fb 100644
--- a/Makefile
+++ b/Makefile
@@ -70,8 +70,8 @@ CFLAGS = $(SUBST_CFLAGS) \
 -D_XOPEN_SOURCE=700  \
 -fno-strict-aliasing \
 -fPIC \
--I$(TOPDIR) \
--I$(TOPDIR)/kernel-lib \
+-I. \
+-I./kernel-lib \
 $(EXTRAWARN_CFLAGS) \
 $(DEBUG_CFLAGS_INTERNAL) \
 $(EXTRA_CFLAGS)
-- 
2.14.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


btrfs seed question

2017-10-11 Thread Joseph Dunn
After seeing how btrfs seeds work I wondered if it was possible to push
specific files from the seed to the rw device.  I know that removing
the seed device will flush all the contents over to the rw device, but
what about flushing individual files on demand?

I found that opening a file, reading the contents, seeking back to 0,
and writing out the contents does what I want, but I was hoping for a
bit less of a hack.

Is there maybe an ioctl or something else that might trigger a similar
action?

Thanks,
-Joseph
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 3/5] btrfs: tree-checker: Enhance output for btrfs_check_leaf

2017-10-11 Thread Qu Wenruo



On 2017年10月11日 23:41, Nikolay Borisov wrote:



On  9.10.2017 04:51, Qu Wenruo wrote:

Enhance the output to print:
1) Reason
2) Bad value
If reason can't explain enough
3) Good value (range)

Signed-off-by: Qu Wenruo 
---
  fs/btrfs/tree-checker.c | 27 +--
  1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index b4ced8d3ce2a..7bba195ecc8b 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -233,8 +233,9 @@ int btrfs_check_leaf(struct btrfs_root *root, struct 
extent_buffer *leaf)
eb = btrfs_root_node(check_root);
/* if leaf is the root, then it's fine */
if (leaf != eb) {
-   CORRUPT("non-root leaf's nritems is 0",
-   leaf, check_root, 0);
+   generic_err(check_root, leaf, 0,
+   "invalid nritems, have %u shouldn't be 0 for 
non-root leaf",
+   nritems);


I'm a bit confused by what this error messages wants to convey. Even
reading the previous version with CORRUPT() it still didn't make sense.
So what we want to say here is we shouldn't have empty leaf nodes. So
Something along the line of "Unexpected empty leaf".


Yes, the error message is too fixed to follow the output format.


Why would the (leaf != eb) check not trigger, given that we call
btrfs_check_leaf when we now that the item is a leaf (level is 0 )?


What's the problem here? I didn't really get your point.

Did you mean leaf can't be tree root? Or empty tree root is not possible?


It's completely possible for a leaf to be a tree root.

All tree roots of a newly created (without --rootdir) is leaf.
Because the content of each tree is so few that one leaf can contain 
them all.



And it's also very possible to have empty tree, whose root (leaf) is 
also empty.

Still for a newly created btrfs, its csum tree is empty.
Its uuid tree is also empty.


But the only valid case for empty leaf is when it's a tree root.
So the code just checks it, and I didn't find anything wrong here.

Thanks,
Qu





free_extent_buffer(eb);
return -EUCLEAN;
}
@@ -265,7 +266,11 @@ int btrfs_check_leaf(struct btrfs_root *root, struct 
extent_buffer *leaf)
  
  		/* Make sure the keys are in the right order */

if (btrfs_comp_cpu_keys(&prev_key, &key) >= 0) {
-   CORRUPT("bad key order", leaf, root, slot);
+   generic_err(root, leaf, slot,
+   "bad key order, prev key (%llu %u %llu) current key 
(%llu %u %llu)",
+   prev_key.objectid, prev_key.type,
+   prev_key.offset, key.objectid, key.type,
+   key.offset);
return -EUCLEAN;
}
  
@@ -280,7 +285,10 @@ int btrfs_check_leaf(struct btrfs_root *root, struct extent_buffer *leaf)

item_end_expected = btrfs_item_offset_nr(leaf,
 slot - 1);
if (btrfs_item_end_nr(leaf, slot) != item_end_expected) {
-   CORRUPT("slot offset bad", leaf, root, slot);
+   generic_err(root, leaf, slot,
+   "discontinious item end, have %u expect %u",


s/discontinious/unexpected ?


+   btrfs_item_end_nr(leaf, slot),
+   item_end_expected);
return -EUCLEAN;
}
  
@@ -291,14 +299,21 @@ int btrfs_check_leaf(struct btrfs_root *root, struct extent_buffer *leaf)

 */
if (btrfs_item_end_nr(leaf, slot) >
BTRFS_LEAF_DATA_SIZE(fs_info)) {
-   CORRUPT("slot end outside of leaf", leaf, root, slot);
+   generic_err(root, leaf, slot,
+   "slot end outside of leaf, have %u expect range [0, 
%u]",> +   btrfs_item_end_nr(leaf, slot),
+   BTRFS_LEAF_DATA_SIZE(fs_info));
return -EUCLEAN;
}
  
  		/* Also check if the item pointer overlaps with btrfs item. */

if (btrfs_item_nr_offset(slot) + sizeof(struct btrfs_item) >
btrfs_item_ptr_offset(leaf, slot)) {
-   CORRUPT("slot overlap with its data", leaf, root, slot);
+   generic_err(root, leaf, slot,
+   "slot overlap with its data, item end %lu data start 
%lu",
+   btrfs_item_nr_offset(slot) +
+   sizeof(struct btrfs_item),
+ 

[RFC] Btrfs: compression heuristic performance

2017-10-11 Thread Timofey Titovets
Hi David, (and list folks in CC of course).

TL;DR
Sorting is a slowest part of heuristic for now (in worst case ~55% of run time).
Kernel use heap sort, i.e. heuristic also use that.
Radix sort speedup heuristic a lot (~+30-40%).
But radix have some tradeoffs (i.e. need more memory, but that can be
easy fixed in theory),
does that a normal idea to port radix sort for kernel?

Long read version below.
---
I already mentioned that i reverse port heuristic code from kernel to
userspace [1],
to make some experiments and profiling easy for me.

(So i spend about a week with some experiments and testings.)

(I firstly profiling with Valgrind)
So i found heuristic bottleneck for now, is a kernel heap sort()
(sort uses about 55% of runtime, that look crazy for me)

And as sort used in byte_core_set_size(), all not text data, sort will
be called.
i.e. from my point of view in 90% cases.

(Tool have 2 mods in main, stats and non stats %) )
For show the scale of a problem:
  Stats mode:
- For average and worst case data (random) ~2600 MiB/s
- Zeroed data - ~3400 MiB/s
  Non stats (kernel like)
- For average and worst case data (random) ~3000 MiB/s
- Zeroed data - ~8000 MiB/s


I spend several days try to beat that by different way =\
Only way that i found (i.e. that show good profit)
Is replace kernel heap sort with radix_sort() [2]   =\

With radix sort:
  Stats mode:
- For average and worst case data (random) ~3800 MiB/s ~+31%
- Zeroed data - ~3400 MiB/s (not sure why performance are same)
  Non stats (kernel like)
- For average and worst case data (random) ~4800 MiB/s ~+37%
- Zeroed data - ~8000 MiB/s

(Inline Radix get additional show +200MiB/s)

That seems cool, but please, wait a second nothing come free.

Radix sort need a 2 buffers, one for counters, and one for output array.

So in our case, (i use 4 bit as radix base),
16 * 4 bytes for counters (stored at stack of sort function)
+ 4*256 bytes for bucket_buffer. (stored at heuristic workspace).

So, in theory, that make a heuristic workspace weight more, i.e. +1KiB per CPU.

As heuristic have some assumptions, i.e. we really care only about sample size.
And sample size in our case are 8KiB max (i.e. 8KiB per CPU).
That possible to replace u32 counters in buckets with u16 counters.

And get same:
8KiB sample per CPU
1KiB buckets per CPU

But main question:
does that make a sense?

Thanks!

Links:
1. https://github.com/Nefelim4ag/companalyze
2. 
https://github.com/Nefelim4ag/companalyze/commit/1cd371eac5056927e5f567c6bd32e179ba859629
-- 
Have a nice day,
Timofey.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] Btrfs: free btrfs_device in place

2017-10-11 Thread Liu Bo
It's pointless to defer it to a kthread helper as we're not under any
special context.

A bit more about device's lifetime, filesystems use an exclusive way
to access devices and hold a reference count on the devices in use.
Now that we've run blkdev_put() in btrfs_close_bdev(), device->bdev's
lifetime ends at btrfs_close_bdev(), not free_device(), and %bdev is
the only thing that others who need to access it really care about.

Since free_device() only frees the resources of 'struct btrfs_device',
this change won't result in the problem, ie. others like mkfs and md
are unable to access the device immediately after we do umount.

Signed-off-by: Liu Bo 
Reviewed-by: Anand Jain 
---

v2: Clarify the lifetime of device and device->bdev respectively and
clear the concern about raising the 'device is in use' problem.

 fs/btrfs/volumes.c | 14 ++
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d983cea..4a72c45 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -836,26 +836,16 @@ void btrfs_close_extra_devices(struct btrfs_fs_devices 
*fs_devices, int step)
mutex_unlock(&uuid_mutex);
 }
 
-static void __free_device(struct work_struct *work)
+static void free_device(struct rcu_head *head)
 {
struct btrfs_device *device;
 
-   device = container_of(work, struct btrfs_device, rcu_work);
+   device = container_of(head, struct btrfs_device, rcu);
rcu_string_free(device->name);
bio_put(device->flush_bio);
kfree(device);
 }
 
-static void free_device(struct rcu_head *head)
-{
-   struct btrfs_device *device;
-
-   device = container_of(head, struct btrfs_device, rcu);
-
-   INIT_WORK(&device->rcu_work, __free_device);
-   schedule_work(&device->rcu_work);
-}
-
 static void btrfs_close_bdev(struct btrfs_device *device)
 {
if (device->bdev && device->writeable) {
-- 
2.9.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] Btrfs-progs: do not add stale device into fs_devices

2017-10-11 Thread Liu Bo
If one of btrfs's devices was pulled out and we've replaced it with a
new one, then they have the same uuid.

If that device gets reconnected, 'btrfs filesystem show' will show the
stale one instead of the new one, but on kernel side btrfs has a fix
to not include the stale one, this could confuse users as people may
monitor btrfs by running that cli.

This does the similar thing to what kernel side has done.

Signed-off-by: Liu Bo 
Reviewed-by: Anand Jain 
---
v2: remove a wrong parameter.

 volumes.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/volumes.c b/volumes.c
index 2f3943d..036c58d 100644
--- a/volumes.c
+++ b/volumes.c
@@ -138,7 +138,20 @@ static int device_list_add(const char *path,
list_add(&device->dev_list, &fs_devices->devices);
device->fs_devices = fs_devices;
} else if (!device->name || strcmp(device->name, path)) {
-   char *name = strdup(path);
+   char *name;
+
+   /*
+* The existing device has newer generation, so this
+* one could be a stale one, don't add it.
+*/
+   if (found_transid < device->generation) {
+   warning("adding device %s gen %llu but found an 
existing device %s gen %llu\n",
+   path, found_transid, device->name,
+   device->generation);
+   return -EEXIST;
+   }
+
+   name = strdup(path);
 if (!name)
 return -ENOMEM;
 kfree(device->name);
-- 
2.9.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: free btrfs_device in place

2017-10-11 Thread Liu Bo
On Wed, Oct 11, 2017 at 07:54:04PM +0200, David Sterba wrote:
> On Tue, Oct 10, 2017 at 03:51:03PM -0600, Liu Bo wrote:
> > It's pointless to defer it to a kthread helper as we're not under any
> > special context.
> 
> I agree the doubly deferred freeing is pointless. It's a weird mix of
> RCU and workques and understanding all the interactions turned out to be
> hard, last time I tried.
> 
> The RCU stuff needs the rcu_barriers, and the callback can be served
> from any process context. While the workqueus have their dedicated
> kthreads.
> 
> Calling free_device() is quick, it just adds the work to the queue and
> returns. This makes __btrfs_close_devices/btrfs_rm_device/... and all
> other callers fast, at the cost that there must be some explicit barrier
> or waiting done when we want to make sure all the device resources have
> been freed.
> 
> I can't say the quick return is wrong, but it makes the device lifetime
> hard to understand. The device freeing callback (__free_device) is
> lightweight, but also calls "rcu_free" for the device name.
>

I did check the history, at the time that this mix was introduced, it
was not under from a non-sleep context, so I'm not sure why it needs
to be deferred.

rcu_barrier() was needed because we used to do blkdev_put() in this
free_device(), which means the bdev's refcnt is held until the rcu
callback runs.  Later we found this is unnecessary because rcu
protected resources and blkdev_put() can be done separated.

> I have WIP patches to clean up the rcu and locking around devices and
> actually document the rules, but with unreviewed pile in the mailinglist
> I can't tell when this is going to land. If you want to simplify at
> least the device freeing, please go on, and explain in the changelog
> that it's not going to break anything. The hand-wavy sentence is not
> what I'd expect :)

I'll update the changelog to clarify the concern.

Thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] Btrfs: remove nr_async_submits and async_submit_draining

2017-10-11 Thread David Sterba
On Wed, Oct 11, 2017 at 10:49:14AM -0600, Liu Bo wrote:
> On Wed, Oct 11, 2017 at 07:20:32PM +0200, David Sterba wrote:
> > On Wed, Sep 27, 2017 at 02:31:17PM +0200, David Sterba wrote:
> > > On Thu, Sep 07, 2017 at 11:22:22AM -0600, Liu Bo wrote:
> > > > Now that we have the combo of flushing twice, which can make sure IO
> > > > have started since the second flush will wait for page lock which
> > > > won't be unlocked unless setting page writeback and queuing ordered
> > > > extents, we don't need %async_submit_draining, %async_delalloc_pages
> > > > and %nr_async_submits to tell whether IO have actually started.
> > > > 
> > > > Moreover, all the flushers in use are followed by functions that wait
> > > > for ordered extents to complete, so %nr_async_submits, which tracks
> > > > whether bio's async submit has made progress, doesn't really make
> > > > sense.
> > > > 
> > > > However, %async_delalloc_pages is still required by shrink_delalloc()
> > > > as that function doesn't flush twice in the normal case (just issues a
> > > > writeback with WB_REASON_FS_FREE_SPACE).
> > > > 
> > > > Signed-off-by: Liu Bo 
> > > 
> > > Patches like this are almost impossible to review just from the code.
> > > This has runtime implications that we'd need to observe in real on
> > > various workloads.
> > > 
> > > So, the patches "look good", but I did not try to forsee all the
> > > scenarios where threads interact through the counters. I think it should
> > > be safe to add them to for-next and give enough time to test them.
> > > 
> > > The performance has varied wildly in the last cycle(s) so it's hard to
> > > get a baseline, other than another major release. I do expect changes in
> > > performance characteristics but still hope it'll be the same on average.
> > 
> > Testing appears normal, we get more weirdnes just by enabling the
> > writeback throttling, but without that I haven't observed anything
> > unusual. Patches go to the misc-next part, meaning I don't expect
> > changes other than fixups, as separate patches. Thanks.
> 
> Thank you Dave, I'm interesting in that weirdness part, will look into
> it.

The symptoms are long stalls, without any IO activity (tens of seconds)
when the superblock is being written, this can be identified in the
kernel stack of the process. This was partially fixed by the REQ_ flags
added to the write requests, but I vaguely remember that even after that
the stalls happen. I don't have more details, sorry.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: remove rcu_barrier in btrfs_close_devices

2017-10-11 Thread Liu Bo
On Wed, Oct 11, 2017 at 03:41:23PM +0800, Anand Jain wrote:
> 
> 
> On 10/11/2017 02:11 PM, Anand Jain wrote:
> > 
> > 
> > On 10/11/2017 05:51 AM, Liu Bo wrote:
> > > It was introduced because btrfs used to do blkdev_put in a deferred
> > > work, now that btrfs has put blkdev in place, this rcu_barrier can be
> > > removed.
> 
>  On the 2nd thought, modprobe -r btrfs would still need rcu_barrier(), some
> where else outside of umount context ?

Thanks a lot for the comments.

modprobe -r btrfs will do btrfs_cleanup_fs_uuids(), where it cleanup
every %fs_devices on the list, but when we do btrfs_close_devices(), we
have replaced the devices on the list with dummy ones which only have
the same name and uuid, so modprobe -r btrfs will free those instead
of what we were using, this change won't cause a problem for it.

Thanks,

-liubo
> 
> Thanks, Anand
> 
> 
> >   Reviewed-by: Anand Jain 
> > 
> > Thanks, Anand
> > 
> > 
> > > Signed-off-by: Liu Bo 
> > > ---
> > >   fs/btrfs/volumes.c | 6 --
> > >   1 file changed, 6 deletions(-)
> > > 
> > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > > index 0e8f16c..d983cea 100644
> > > --- a/fs/btrfs/volumes.c
> > > +++ b/fs/btrfs/volumes.c
> > > @@ -958,12 +958,6 @@ int btrfs_close_devices(struct btrfs_fs_devices
> > > *fs_devices)
> > >   __btrfs_close_devices(fs_devices);
> > >   free_fs_devices(fs_devices);
> > >   }
> > > -/*
> > > - * Wait for rcu kworkers under __btrfs_close_devices
> > > - * to finish all blkdev_puts so device is really
> > > - * free when umount is done.
> > > - */
> > > -rcu_barrier();
> > >   return ret;
> > >   }
> > > 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/5] Enhance tree block validation checker

2017-10-11 Thread David Sterba
On Mon, Oct 09, 2017 at 01:51:01AM +, Qu Wenruo wrote:
> The patchset can be fetched from github:
> https://github.com/adam900710/linux/tree/checker_enhance
> 
> It's based on David's misc-next branch, with following commit as base:
> a5e50b4b444c ("btrfs: Add checker for EXTENT_CSUM")
> 
> According to David's suggestion, enhance the output format of tree block
> validation checker.
> 
> And move them into separate files: tree-checker.[ch].
> 
> Also added a output format rule to try to make all output message
> follow the same format.
> 
> Some example output using btrfs-progs fsck-test images looks like:
> 
> For unagliend file extent member:
> ---
> BTRFS critical (device loop0): corrupt leaf: root=1 block=29360128 slot=7 
> ino=257 file_offset=0, invalid disk_bytenr for file extent, have 755944791, 
> should be aligned to 4096
> ---
> 
> For bad leaf holes:
> ---
> BTRFS critical (device loop0): corrupt leaf: root=1 block=29360128 slot=28, 
> discontinious item end, have 9387 expect 15018
> ---
> 
> Changelog:
> v2:
>   Unify the error string format, so it should be easier to grep them
>   from dmesg. Thanks Nikolay for pointing this out.
>   Remove unused CORRUPT() macro.
> v3:
>   Replace EIO with EUCLEAN in 2nd patch. Thanks Nikolay for pointing
>   this out.
>   Correct "btrfs-progs:" to "btrfs:" for 1st patch.
> v4:
>   Code style change suggested by David.
>   Use more easy-to-understand error message for NULL node pointer,
>   suggested by Nikolay.
>   Helper macro enhancement, including naming change and argument
>   protection suggested by David.
>   Separate tree-checker.h suggested by David.

Patches merged to the devel queue, with updates. The strings that
overflow 80 chars should be un-indented, plase do that in future
patches. Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 3/5] btrfs: tree-checker: Enhance output for btrfs_check_leaf

2017-10-11 Thread David Sterba
On Wed, Oct 11, 2017 at 06:41:32PM +0300, Nikolay Borisov wrote:
> 
> 
> On  9.10.2017 04:51, Qu Wenruo wrote:
> > Enhance the output to print:
> > 1) Reason
> > 2) Bad value
> >If reason can't explain enough
> > 3) Good value (range)
> > 
> > Signed-off-by: Qu Wenruo 
> > ---
> >  fs/btrfs/tree-checker.c | 27 +--
> >  1 file changed, 21 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> > index b4ced8d3ce2a..7bba195ecc8b 100644
> > --- a/fs/btrfs/tree-checker.c
> > +++ b/fs/btrfs/tree-checker.c
> > @@ -233,8 +233,9 @@ int btrfs_check_leaf(struct btrfs_root *root, struct 
> > extent_buffer *leaf)
> > eb = btrfs_root_node(check_root);
> > /* if leaf is the root, then it's fine */
> > if (leaf != eb) {
> > -   CORRUPT("non-root leaf's nritems is 0",
> > -   leaf, check_root, 0);
> > +   generic_err(check_root, leaf, 0,
> > +   "invalid nritems, have %u shouldn't be 
> > 0 for non-root leaf",
> > +   nritems);
> 
> I'm a bit confused by what this error messages wants to convey. Even
> reading the previous version with CORRUPT() it still didn't make sense.
> So what we want to say here is we shouldn't have empty leaf nodes. So
> Something along the line of "Unexpected empty leaf".
> 
> Why would the (leaf != eb) check not trigger, given that we call
> btrfs_check_leaf when we now that the item is a leaf (level is 0 )?

I've merged the patches, with more adjusmtents to the wording, so any
updates please send as separate patches.

> 
> 
> > free_extent_buffer(eb);
> > return -EUCLEAN;
> > }
> > @@ -265,7 +266,11 @@ int btrfs_check_leaf(struct btrfs_root *root, struct 
> > extent_buffer *leaf)
> >  
> > /* Make sure the keys are in the right order */
> > if (btrfs_comp_cpu_keys(&prev_key, &key) >= 0) {
> > -   CORRUPT("bad key order", leaf, root, slot);
> > +   generic_err(root, leaf, slot,
> > +   "bad key order, prev key (%llu %u %llu) current 
> > key (%llu %u %llu)",
> > +   prev_key.objectid, prev_key.type,
> > +   prev_key.offset, key.objectid, key.type,
> > +   key.offset);
> > return -EUCLEAN;
> > }
> >  
> > @@ -280,7 +285,10 @@ int btrfs_check_leaf(struct btrfs_root *root, struct 
> > extent_buffer *leaf)
> > item_end_expected = btrfs_item_offset_nr(leaf,
> >  slot - 1);
> > if (btrfs_item_end_nr(leaf, slot) != item_end_expected) {
> > -   CORRUPT("slot offset bad", leaf, root, slot);
> > +   generic_err(root, leaf, slot,
> > +   "discontinious item end, have %u expect %u",
> 
> s/discontinious/unexpected ?

I've changed that to 'unexpected item end, ...'
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs-progs: do not add stale device into fs_devices

2017-10-11 Thread Liu Bo
On Wed, Oct 11, 2017 at 12:33:15PM +0300, Nikolay Borisov wrote:
> 
> 
> On 11.10.2017 03:28, Liu Bo wrote:
> > If one of btrfs's devices was pulled out and we've replaced it with a
> > new one, then they have the same uuid.
> > 
> > If that device gets reconnected, 'btrfs filesystem show' will show the
> > stale one instead of the new one, but on kernel side btrfs has a fix
> > to not include the stale one, this could confuse users as people may
> > monitor btrfs by running that cli.
> > 
> > This does the similar thing to what kernel side has done.
> > 
> > Signed-off-by: Liu Bo 
> > ---
> >  volumes.c | 15 ++-
> >  1 file changed, 14 insertions(+), 1 deletion(-)
> > 
> > diff --git a/volumes.c b/volumes.c
> > index 2f3943d..c7b7a41 100644
> > --- a/volumes.c
> > +++ b/volumes.c
> > @@ -138,7 +138,20 @@ static int device_list_add(const char *path,
> > list_add(&device->dev_list, &fs_devices->devices);
> > device->fs_devices = fs_devices;
> > } else if (!device->name || strcmp(device->name, path)) {
> > -   char *name = strdup(path);
> > +   char *name;
> > +
> > +   /*
> > +* The existing device has newer generation, so this
> > +* one could be a stale one, don't add it.
> > +*/
> > +   if (found_transid < device->generation) {
> > +   warning("adding device %s gen %llu but found a existing 
> > device %s gen %llu\n",
> > +   path, found_transid, device->name,
> > +   device->generation, found_transid);
> 
> You pass in 5 parameters but have only 4 formatting strings. I don't see
> the same happening on other warning() invocations? Perhaps the last
> found_transid is not necessary?
> 

Oh, thanks for the comments.

I messed it up again, what I was testing is warning("blabla",
device->name, device->generation, path, found_transid), but later I
updated the message but got the parameters wrong.

Will do a v2.

Thanks,

-liubo
> > +   return -EEXIST;
> > +   }
> > +
> > +   name = strdup(path);
> >  if (!name)
> >  return -ENOMEM;
> >  kfree(device->name);
> > 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: free btrfs_device in place

2017-10-11 Thread David Sterba
On Tue, Oct 10, 2017 at 03:51:03PM -0600, Liu Bo wrote:
> It's pointless to defer it to a kthread helper as we're not under any
> special context.

I agree the doubly deferred freeing is pointless. It's a weird mix of
RCU and workques and understanding all the interactions turned out to be
hard, last time I tried.

The RCU stuff needs the rcu_barriers, and the callback can be served
from any process context. While the workqueus have their dedicated
kthreads.

Calling free_device() is quick, it just adds the work to the queue and
returns. This makes __btrfs_close_devices/btrfs_rm_device/... and all
other callers fast, at the cost that there must be some explicit barrier
or waiting done when we want to make sure all the device resources have
been freed.

I can't say the quick return is wrong, but it makes the device lifetime
hard to understand. The device freeing callback (__free_device) is
lightweight, but also calls "rcu_free" for the device name.

I have WIP patches to clean up the rcu and locking around devices and
actually document the rules, but with unreviewed pile in the mailinglist
I can't tell when this is going to land. If you want to simplify at
least the device freeing, please go on, and explain in the changelog
that it's not going to break anything. The hand-wavy sentence is not
what I'd expect :)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] Btrfs: remove nr_async_submits and async_submit_draining

2017-10-11 Thread Liu Bo
On Wed, Oct 11, 2017 at 07:20:32PM +0200, David Sterba wrote:
> On Wed, Sep 27, 2017 at 02:31:17PM +0200, David Sterba wrote:
> > On Thu, Sep 07, 2017 at 11:22:22AM -0600, Liu Bo wrote:
> > > Now that we have the combo of flushing twice, which can make sure IO
> > > have started since the second flush will wait for page lock which
> > > won't be unlocked unless setting page writeback and queuing ordered
> > > extents, we don't need %async_submit_draining, %async_delalloc_pages
> > > and %nr_async_submits to tell whether IO have actually started.
> > > 
> > > Moreover, all the flushers in use are followed by functions that wait
> > > for ordered extents to complete, so %nr_async_submits, which tracks
> > > whether bio's async submit has made progress, doesn't really make
> > > sense.
> > > 
> > > However, %async_delalloc_pages is still required by shrink_delalloc()
> > > as that function doesn't flush twice in the normal case (just issues a
> > > writeback with WB_REASON_FS_FREE_SPACE).
> > > 
> > > Signed-off-by: Liu Bo 
> > 
> > Patches like this are almost impossible to review just from the code.
> > This has runtime implications that we'd need to observe in real on
> > various workloads.
> > 
> > So, the patches "look good", but I did not try to forsee all the
> > scenarios where threads interact through the counters. I think it should
> > be safe to add them to for-next and give enough time to test them.
> > 
> > The performance has varied wildly in the last cycle(s) so it's hard to
> > get a baseline, other than another major release. I do expect changes in
> > performance characteristics but still hope it'll be the same on average.
> 
> Testing appears normal, we get more weirdnes just by enabling the
> writeback throttling, but without that I haven't observed anything
> unusual. Patches go to the misc-next part, meaning I don't expect
> changes other than fixups, as separate patches. Thanks.

Thank you Dave, I'm interesting in that weirdness part, will look into
it.

Thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] Btrfs: remove nr_async_submits and async_submit_draining

2017-10-11 Thread David Sterba
On Wed, Sep 27, 2017 at 02:31:17PM +0200, David Sterba wrote:
> On Thu, Sep 07, 2017 at 11:22:22AM -0600, Liu Bo wrote:
> > Now that we have the combo of flushing twice, which can make sure IO
> > have started since the second flush will wait for page lock which
> > won't be unlocked unless setting page writeback and queuing ordered
> > extents, we don't need %async_submit_draining, %async_delalloc_pages
> > and %nr_async_submits to tell whether IO have actually started.
> > 
> > Moreover, all the flushers in use are followed by functions that wait
> > for ordered extents to complete, so %nr_async_submits, which tracks
> > whether bio's async submit has made progress, doesn't really make
> > sense.
> > 
> > However, %async_delalloc_pages is still required by shrink_delalloc()
> > as that function doesn't flush twice in the normal case (just issues a
> > writeback with WB_REASON_FS_FREE_SPACE).
> > 
> > Signed-off-by: Liu Bo 
> 
> Patches like this are almost impossible to review just from the code.
> This has runtime implications that we'd need to observe in real on
> various workloads.
> 
> So, the patches "look good", but I did not try to forsee all the
> scenarios where threads interact through the counters. I think it should
> be safe to add them to for-next and give enough time to test them.
> 
> The performance has varied wildly in the last cycle(s) so it's hard to
> get a baseline, other than another major release. I do expect changes in
> performance characteristics but still hope it'll be the same on average.

Testing appears normal, we get more weirdnes just by enabling the
writeback throttling, but without that I haven't observed anything
unusual. Patches go to the misc-next part, meaning I don't expect
changes other than fixups, as separate patches. Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 5/5] btrfs: ensure that metadata and flush are issued from the root cgroup

2017-10-11 Thread David Sterba
On Tue, Oct 10, 2017 at 09:43:26AM -0700, Tejun Heo wrote:
> >From 3bbed8c7747739cda48f592f165e8839da076a3a Mon Sep 17 00:00:00 2001
> 
> Issuing metdata or otherwise shared IOs from !root cgroup can lead to
> priority inversion.  This patch ensures that those IOs are always
> issued from the root cgroup.
> 
> This patch updates btrfs_update_iflags() to not set S_CGROUPWB on
> btree_inodes.

The 'btree_inode' is only one, with inode number 1, and represents all
the metadata, so I don't understand what it means in plural.

> This isn't strictly necessary as those inodes don't
> call the function during init; however, this serves as documentation
> and prevents possible future mistakes.  If this isn't desirable,
> please feel free to drop the section.
> 
> v2: Fixed missing @bh in submit_bh_blkcg_css() call.
> 
> Signed-off-by: Tejun Heo 
> Cc: Chris Mason 
> Cc: Josef Bacik 
> ---
>  fs/btrfs/check-integrity.c | 2 +-
>  fs/btrfs/disk-io.c | 4 
>  fs/btrfs/ioctl.c   | 4 +++-
>  3 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
> index 7d5a9b5..d66774e 100644
> --- a/fs/btrfs/check-integrity.c
> +++ b/fs/btrfs/check-integrity.c
> @@ -2741,7 +2741,7 @@ int btrfsic_submit_bh(int op, int op_flags, struct 
> buffer_head *bh)
>   struct btrfsic_dev_state *dev_state;
>  
>   if (!btrfsic_is_initialized)
> - return submit_bh(op, op_flags, bh);
> + return submit_bh_blkcg_css(op, op_flags, bh, blkcg_root_css);
>  
>   mutex_lock(&btrfsic_mutex);
>   /* since btrfsic_submit_bh() might also be called before
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index dfdab84..fe8bbe1 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1025,6 +1025,8 @@ static blk_status_t btree_submit_bio_hook(void 
> *private_data, struct bio *bio,
>   int async = check_async_write(bio_flags);
>   blk_status_t ret;
>  
> + bio_associate_blkcg(bio, blkcg_root_css);
> +
>   if (bio_op(bio) != REQ_OP_WRITE) {
>   /*
>* called for a read, do the setup so that checksum validation
> @@ -3512,6 +3514,8 @@ static void write_dev_flush(struct btrfs_device *device)
>   return;
>  
>   bio_reset(bio);
> + bio_associate_blkcg(bio, blkcg_root_css);
> +
>   bio->bi_end_io = btrfs_end_empty_barrier;
>   bio_set_dev(bio, device->bdev);
>   bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH;
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 117cc63..8a7db6c 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -150,7 +150,9 @@ void btrfs_update_iflags(struct inode *inode)
>   new_fl |= S_NOATIME;
>   if (ip->flags & BTRFS_INODE_DIRSYNC)
>   new_fl |= S_DIRSYNC;
> - new_fl |= S_CGROUPWB;
> + /* btree_inodes are always in the root cgroup */
> + if (btrfs_ino(ip) != BTRFS_BTREE_INODE_OBJECTID)
> + new_fl |= S_CGROUPWB;

The comment is useful, but the condition will be always true, so I don't
see the point.

/*
 * The btree_inode will be always in the root cgroup. The cgroup
 * writeback can be enabled on regular inodes selectively.
 */
new_fl |= S_CGROUPWB;

is IMHO enough, based on my reading of patch 2/5 changelog.

>  
>   set_mask_bits(&inode->i_flags,
> S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC |
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 3/5] btrfs: tree-checker: Enhance output for btrfs_check_leaf

2017-10-11 Thread Nikolay Borisov


On  9.10.2017 04:51, Qu Wenruo wrote:
> Enhance the output to print:
> 1) Reason
> 2) Bad value
>If reason can't explain enough
> 3) Good value (range)
> 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/tree-checker.c | 27 +--
>  1 file changed, 21 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> index b4ced8d3ce2a..7bba195ecc8b 100644
> --- a/fs/btrfs/tree-checker.c
> +++ b/fs/btrfs/tree-checker.c
> @@ -233,8 +233,9 @@ int btrfs_check_leaf(struct btrfs_root *root, struct 
> extent_buffer *leaf)
>   eb = btrfs_root_node(check_root);
>   /* if leaf is the root, then it's fine */
>   if (leaf != eb) {
> - CORRUPT("non-root leaf's nritems is 0",
> - leaf, check_root, 0);
> + generic_err(check_root, leaf, 0,
> + "invalid nritems, have %u shouldn't be 
> 0 for non-root leaf",
> + nritems);

I'm a bit confused by what this error messages wants to convey. Even
reading the previous version with CORRUPT() it still didn't make sense.
So what we want to say here is we shouldn't have empty leaf nodes. So
Something along the line of "Unexpected empty leaf".

Why would the (leaf != eb) check not trigger, given that we call
btrfs_check_leaf when we now that the item is a leaf (level is 0 )?


>   free_extent_buffer(eb);
>   return -EUCLEAN;
>   }
> @@ -265,7 +266,11 @@ int btrfs_check_leaf(struct btrfs_root *root, struct 
> extent_buffer *leaf)
>  
>   /* Make sure the keys are in the right order */
>   if (btrfs_comp_cpu_keys(&prev_key, &key) >= 0) {
> - CORRUPT("bad key order", leaf, root, slot);
> + generic_err(root, leaf, slot,
> + "bad key order, prev key (%llu %u %llu) current 
> key (%llu %u %llu)",
> + prev_key.objectid, prev_key.type,
> + prev_key.offset, key.objectid, key.type,
> + key.offset);
>   return -EUCLEAN;
>   }
>  
> @@ -280,7 +285,10 @@ int btrfs_check_leaf(struct btrfs_root *root, struct 
> extent_buffer *leaf)
>   item_end_expected = btrfs_item_offset_nr(leaf,
>slot - 1);
>   if (btrfs_item_end_nr(leaf, slot) != item_end_expected) {
> - CORRUPT("slot offset bad", leaf, root, slot);
> + generic_err(root, leaf, slot,
> + "discontinious item end, have %u expect %u",

s/discontinious/unexpected ?

> + btrfs_item_end_nr(leaf, slot),
> + item_end_expected);
>   return -EUCLEAN;
>   }
>  
> @@ -291,14 +299,21 @@ int btrfs_check_leaf(struct btrfs_root *root, struct 
> extent_buffer *leaf)
>*/
>   if (btrfs_item_end_nr(leaf, slot) >
>   BTRFS_LEAF_DATA_SIZE(fs_info)) {
> - CORRUPT("slot end outside of leaf", leaf, root, slot);
> + generic_err(root, leaf, slot,
> + "slot end outside of leaf, have %u expect range 
> [0, %u]",> +btrfs_item_end_nr(leaf, slot),
> + BTRFS_LEAF_DATA_SIZE(fs_info));
>   return -EUCLEAN;
>   }
>  
>   /* Also check if the item pointer overlaps with btrfs item. */
>   if (btrfs_item_nr_offset(slot) + sizeof(struct btrfs_item) >
>   btrfs_item_ptr_offset(leaf, slot)) {
> - CORRUPT("slot overlap with its data", leaf, root, slot);
> + generic_err(root, leaf, slot,
> + "slot overlap with its data, item end %lu data 
> start %lu",
> + btrfs_item_nr_offset(slot) +
> + sizeof(struct btrfs_item),
> + btrfs_item_ptr_offset(leaf, slot));
>   return -EUCLEAN;
>   }
>  
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fatal failure, btrfs raid with duplicated metadata

2017-10-11 Thread Ian Kumlien
On Wed, Oct 11, 2017 at 2:42 PM, Jeff Mahoney  wrote:
> On 10/11/17 2:20 PM, Ian Kumlien wrote:
>>
>>
>> On Wed, Oct 11, 2017 at 2:10 PM Jeff Mahoney > > wrote:
>>
>> On 10/11/17 12:41 PM, Ian Kumlien wrote:
>>
>> [--8<--]
>>
>> > Eventually the filesystem becomes read-only and everything is odd...
>>
>> Are you still able to mount it?  I'd be surprised if you could if check
>> can't open the file system.
>>
>>
>> Nope, it's like there never was a filesystem in the first place...
>>
>> But since metadata should be duplicated all over, i'd assume that it
>> would be able to mount it and survive =)
>
> If you'd been using RAID1 or something instead, you'd be able to mount
> the file system and replace the disk.

Yep, It wasn't a problem, i had a backup and so on (and the disks had
been odd before)

>> > Trying to run btrfs check on the disks results in:
>> > btrfs check -b /dev/disk/by-uuid/8d431da9-dad4-481c-a5ad-5e6844f31da0
>> > bytenr mismatch, want=912228352, have=0
>> > Couldn't read tree root
>> > ERROR: cannot open file system
>> >
>> > (For backup and normal)
>> >
>> > So even if the data is duplicated on all disks, something in the above
>> > errors seemed to cause it to abort
>> > (These disks are seagate sshd disks, never ever buying them again)
>>
>> If you have metadata: dup, that doesn't mean the metadata is duplicated
>> on every disk.  It means that there are two copies of the metadata on a
>> single disk.  If that disk is going bad and returning failures for both
>> copies of the metadata, you may be out of luck.  It's really intended
>> for single spinning disks to get a little bit more resiliency in the
>> face of bad sectors.
>>
>>
>> Oh? it looks like it would be 2 per 1 device, but ok - Then i could have
>> had a issue where the drive that keeps the metadata is gone... I
>> suspected that I did do DUP on multiple devices
>>
>> from the man page:
>>Note 1: DUP may exist on more than 1 device if it starts on a
>> single device and another one is added. Since version 4.5.1,
>>mkfs.btrfs will let you create DUP on multiple devices.
>
> I can see how you'd reach that conclusion.  The wording is somewhat
> confusing.  We allocate space in chunks that are usually about 1GB in
> size.  When DUP is used, we allocate two chunks on the same device and
> that is presented as a single usable chunk.  The constituent chunks will
> be allocated on the same device, but which device is used can change
> with each allocation.
>
> Say you have 5 disks and 8 metadata chunks.  They can be allocated like so:
>
> sda: A A D D
> sdb: B B
> sdc: C C
> sde: F F G G
> sdf: H H I I
>
> There is no redundancy in the case of a disk failure, only for sector
> failures.  To spread metadata across disks for redundancy you'll need to
> use a raid mode instead.  If one of those disks is failing and it
> contains a critical part of the metadata, the file system won't be
> mountable.

Yeah, thanks =)

>> The check error above means that it wasn't able to map a logical address
>> to a physical address.  Typically that means that the mapping was lost.
>>
>>
>> I was more reporting that it happened and if there was any useful data
>> that we could extract from this if it's a failure that shouldn't happen :)
>>
>> I haven't wiped anything yet - preparing to replace the disks though
>
> Thanks for reporting it, but in this context, it's somewhat of an
> expected failure mode.

Yep, i see that now, thanks =)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs-progs: do not add stale device into fs_devices

2017-10-11 Thread Anand Jain



On 10/11/2017 05:33 PM, Nikolay Borisov wrote:



On 11.10.2017 03:28, Liu Bo wrote:

If one of btrfs's devices was pulled out and we've replaced it with a
new one, then they have the same uuid.

If that device gets reconnected, 'btrfs filesystem show' will show the
stale one instead of the new one, but on kernel side btrfs has a fix
to not include the stale one, this could confuse users as people may
monitor btrfs by running that cli.

This does the similar thing to what kernel side has done.

Signed-off-by: Liu Bo 
---
  volumes.c | 15 ++-
  1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/volumes.c b/volumes.c
index 2f3943d..c7b7a41 100644
--- a/volumes.c
+++ b/volumes.c
@@ -138,7 +138,20 @@ static int device_list_add(const char *path,
list_add(&device->dev_list, &fs_devices->devices);
device->fs_devices = fs_devices;
} else if (!device->name || strcmp(device->name, path)) {
-   char *name = strdup(path);
+   char *name;
+
+   /*
+* The existing device has newer generation, so this
+* one could be a stale one, don't add it.
+*/
+   if (found_transid < device->generation) {
+   warning("adding device %s gen %llu but found a existing 
device %s gen %llu\n",
+   path, found_transid, device->name,
+   device->generation, found_transid);


You pass in 5 parameters but have only 4 formatting strings. I don't see
the same happening on other warning() invocations? Perhaps the last
found_transid is not necessary?


 I missed that. Sorry.

Thanks, Anand



+   return -EEXIST;
+   }
+
+   name = strdup(path);
  if (!name)
  return -ENOMEM;
  kfree(device->name);


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fatal failure, btrfs raid with duplicated metadata

2017-10-11 Thread Jeff Mahoney
On 10/11/17 2:20 PM, Ian Kumlien wrote:
> 
> 
> On Wed, Oct 11, 2017 at 2:10 PM Jeff Mahoney  > wrote:
> 
> On 10/11/17 12:41 PM, Ian Kumlien wrote:
> 
> [--8<--] 
> 
> > Eventually the filesystem becomes read-only and everything is odd...
> 
> Are you still able to mount it?  I'd be surprised if you could if check
> can't open the file system.
> 
> 
> Nope, it's like there never was a filesystem in the first place... 
> 
> But since metadata should be duplicated all over, i'd assume that it
> would be able to mount it and survive =)

If you'd been using RAID1 or something instead, you'd be able to mount
the file system and replace the disk.
> > Trying to run btrfs check on the disks results in:
> > btrfs check -b /dev/disk/by-uuid/8d431da9-dad4-481c-a5ad-5e6844f31da0
> > bytenr mismatch, want=912228352, have=0
> > Couldn't read tree root
> > ERROR: cannot open file system
> >
> > (For backup and normal)
> >
> > So even if the data is duplicated on all disks, something in the above
> > errors seemed to cause it to abort
> > (These disks are seagate sshd disks, never ever buying them again)
> 
> If you have metadata: dup, that doesn't mean the metadata is duplicated
> on every disk.  It means that there are two copies of the metadata on a
> single disk.  If that disk is going bad and returning failures for both
> copies of the metadata, you may be out of luck.  It's really intended
> for single spinning disks to get a little bit more resiliency in the
> face of bad sectors.
> 
> 
> Oh? it looks like it would be 2 per 1 device, but ok - Then i could have
> had a issue where the drive that keeps the metadata is gone... I
> suspected that I did do DUP on multiple devices
> 
> from the man page:
>        Note 1: DUP may exist on more than 1 device if it starts on a
> single device and another one is added. Since version 4.5.1,
>        mkfs.btrfs will let you create DUP on multiple devices.

I can see how you'd reach that conclusion.  The wording is somewhat
confusing.  We allocate space in chunks that are usually about 1GB in
size.  When DUP is used, we allocate two chunks on the same device and
that is presented as a single usable chunk.  The constituent chunks will
be allocated on the same device, but which device is used can change
with each allocation.

Say you have 5 disks and 8 metadata chunks.  They can be allocated like so:

sda: A A D D
sdb: B B
sdc: C C
sde: F F G G
sdf: H H I I

There is no redundancy in the case of a disk failure, only for sector
failures.  To spread metadata across disks for redundancy you'll need to
use a raid mode instead.  If one of those disks is failing and it
contains a critical part of the metadata, the file system won't be
mountable.

> The check error above means that it wasn't able to map a logical address
> to a physical address.  Typically that means that the mapping was lost.
> 
> 
> I was more reporting that it happened and if there was any useful data
> that we could extract from this if it's a failure that shouldn't happen :)
> 
> I haven't wiped anything yet - preparing to replace the disks though

Thanks for reporting it, but in this context, it's somewhat of an
expected failure mode.

-Jeff

-- 
Jeff Mahoney
SUSE Labs



signature.asc
Description: OpenPGP digital signature


Re: Fatal failure, btrfs raid with duplicated metadata

2017-10-11 Thread Ian Kumlien
Resent since google inbox is still not doing clear-text emails...

On Wed, Oct 11, 2017 at 2:09 PM, Jeff Mahoney  wrote:
> On 10/11/17 12:41 PM, Ian Kumlien wrote:

[--8<---]

>> Eventually the filesystem becomes read-only and everything is odd...
>
> Are you still able to mount it?  I'd be surprised if you could if check
> can't open the file system.

Nope, it's like there never was a filesystem in the first place...

But since metadata should be duplicated all over, i'd assume that it
would be able to mount it and survive =)

>> Trying to run btrfs check on the disks results in:
>> btrfs check -b /dev/disk/by-uuid/8d431da9-dad4-481c-a5ad-5e6844f31da0
>> bytenr mismatch, want=912228352, have=0
>> Couldn't read tree root
>> ERROR: cannot open file system
>>
>> (For backup and normal)
>>
>> So even if the data is duplicated on all disks, something in the above
>> errors seemed to cause it to abort
>> (These disks are seagate sshd disks, never ever buying them again)
>
> If you have metadata: dup, that doesn't mean the metadata is duplicated
> on every disk.  It means that there are two copies of the metadata on a
> single disk.  If that disk is going bad and returning failures for both
> copies of the metadata, you may be out of luck.  It's really intended
> for single spinning disks to get a little bit more resiliency in the
> face of bad sectors.

Oh? it looks like it would be 2 per 1 device, but ok - Then i could
have had a issue where
the drive that keeps the metadata is gone... I suspected that I did do
DUP on multiple devices

from the man page:
   Note 1: DUP may exist on more than 1 device if it starts on a
single device and another one is added. Since version 4.5.1,
   mkfs.btrfs will let you create DUP on multiple devices.

> The check error above means that it wasn't able to map a logical address
> to a physical address.  Typically that means that the mapping was lost.

I was more reporting that it happened and if there was any useful data
that we could
extract from this if it's a failure that shouldn't happen :)

I haven't wiped anything yet - preparing to replace the disks though
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fatal failure, btrfs raid with duplicated metadata

2017-10-11 Thread Jeff Mahoney
On 10/11/17 12:41 PM, Ian Kumlien wrote:
> Hi,
> 
> I was running a btrfs raid with 6 disks, metadata: dup and data: raid 6
> 
> Two of the disks started behaving oddly:
> [436823.570296] sd 3:1:0:4: [sdf] Unaligned partial completion
> (resid=244, sector_sz=512)
> [436823.578604] sd 3:1:0:4: [sdf] Unaligned partial completion
> (resid=52, sector_sz=512) [436823.617593]
> sd 3:1:0:4: [sdf] Unaligned partial completion (resid=56,
> sector_sz=512)
> [436823.617771] sd 3:1:0:4: [sdf] Unaligned partial completion
> (resid=222, sector_sz=512)
> [436823.618386] sd 3:1:0:4: [sdf] Unaligned partial completion
> (resid=246, sector_sz=512)
> [436823.618463] sd 3:1:0:4: [sdf] Unaligned partial completion
> (resid=56, sector_sz=512)
> [436977.701944] scsi_io_completion: 68 callbacks suppressed
> [436977.701973] sd 3:1:0:4: [sdf] tag#0 FAILED Result: hostbyte=DID_OK
> driverbyte=DRIVER_SENSE
> [436977.701982] sd 3:1:0:4: [sdf] tag#0 Sense Key : Hardware Error
> [current]
> [436977.701991] sd 3:1:0:4: [sdf] tag#0 Add. Sense: Logical unit
> failure
> [436977.702000] sd 3:1:0:4: [sdf] tag#0 CDB: Read(10) 28 00 02 fb fb
> 80 00 00 28 00
> [436977.702005] print_req_error: 68 callbacks suppressed
> [436977.702010] print_req_error: critical target error, dev sdf,
> sector 50068352
> [498132.144319] print_req_error: 450 callbacks suppressed
> [498132.144324] print_req_error: critical target error, dev sdf,
> sector 41777640
> [498132.144590] btrfs_dev_stat_print_on_error: 540 callbacks
> suppressed
> [498132.144600] BTRFS error (device sdb1): bdev /dev/sdf1 errs: wr
> 632, rd 1526, flush 0, corrupt 0, gen 0
> 
> Eventually the filesystem becomes read-only and everything is odd...

Are you still able to mount it?  I'd be surprised if you could if check
can't open the file system.

> Trying to run btrfs check on the disks results in:
> btrfs check -b /dev/disk/by-uuid/8d431da9-dad4-481c-a5ad-5e6844f31da0
> bytenr mismatch, want=912228352, have=0
> Couldn't read tree root
> ERROR: cannot open file system
> 
> (For backup and normal)
> 
> So even if the data is duplicated on all disks, something in the above
> errors seemed to cause it to abort
> (These disks are seagate sshd disks, never ever buying them again)

If you have metadata: dup, that doesn't mean the metadata is duplicated
on every disk.  It means that there are two copies of the metadata on a
single disk.  If that disk is going bad and returning failures for both
copies of the metadata, you may be out of luck.  It's really intended
for single spinning disks to get a little bit more resiliency in the
face of bad sectors.

The check error above means that it wasn't able to map a logical address
to a physical address.  Typically that means that the mapping was lost.

-Jeff


-- 
Jeff Mahoney
SUSE Labs



signature.asc
Description: OpenPGP digital signature


Re: [PATCH] btrfs: fix max chunk size on dup

2017-10-11 Thread Hans van Kranenburg
On 10/11/2017 11:58 AM, Naohiro Aota wrote:
> Hi,
> 
> You may notice my @wdc.com address is bounced. As my internship job at
> wdc is finished, the address is already expired.
> Please contact me on this @elisp.net address.
> 
> 2017-10-11 2:42 GMT+09:00 Hans van Kranenburg 
> :
>> Sorry for the mail spam, it's an interesting code puzzle... :)
>>
>> On 10/10/2017 07:22 PM, Hans van Kranenburg wrote:
>>> On 10/10/2017 07:07 PM, Hans van Kranenburg wrote:
 On 10/10/2017 01:31 PM, David Sterba wrote:
> Hi,
>
> On Fri, Sep 29, 2017 at 04:20:51PM +0900, Naohiro Aota wrote:
>> Balancing a fresh METADATA=dup btrfs file system (with size < 50G)
>> generates a 128MB sized block group. While we set max_stripe_size =
>> max_chunk_size = 256MB, we get this half sized block group:
>>
>> $ btrfs ins dump-t -t CHUNK_TREE btrfs.img|grep length
>> length 8388608 owner 2 stripe_len 65536 type DATA
>> length 33554432 owner 2 stripe_len 65536 type SYSTEM|DUP
>> length 134217728 owner 2 stripe_len 65536 type 
>> METADATA|DUP
>>
>> Before commit 86db25785a6e ("Btrfs: fix max chunk size on raid5/6"), we
>> used "stripe_size * ndevs > max_chunk_size * ncopies" to check the max
>> chunk size. Since stripe_size = 256MB * dev_stripes (= 2) = 512MB, ndevs
>> = 1, max_chunk_size = 256MB, and ncopies = 2, we allowed 256MB
>> METADATA|DUP block group.
>>
>> But now, we use "stripe_size * data_stripes > max_chunk_size". Since
>> data_stripes = 1 on DUP, it disallows the block group to have > 128MB.
>> What missing here is "dev_stripes". Proper logical space used by the 
>> block
>> group is "stripe_size * data_stripes / dev_stripes". Tweak the equations 
>> to
>> use the right value.
>
> I started looking into it and still don't fully understand it. Change
> deep in the allocator can easily break some blockgroup combinations, so
> I'm rather conservative here.

 I think that the added usage of data_stripes in 86db25785a6e is the
 problematic change. data_stripes is something that was introduced as
 part of RAID56 in 53b381b3a and clearly only has a meaning that's
 properly thought of for RAID56. The RAID56 commit already adds "this
 will have to be fixed for RAID1 and RAID10 over more drives", only the
 author doesn't catch the DUP case, which already breaks at that point.

> 
> Thank you for explaining in detail :) Yes, the allocator was already
> broken by using the data_stripes.

Well, technically, data_stripes is right (I first was confused about it
yesterday). But it's something specific for parity raid.

> (and it will also "break" existing file systems, in terms of making a
> DUP|META block group smaller)

Yes, and on a fs larger than 50GiB I indeed see 512MB DUP metadata
chunks, instead of 1GiB like the code says.

> I believe this patch fix the things.
> 
 At the beginning it says:

 int data_stripes; /* number of stripes that count for block group size */

 For the example:

 This is DUP:

 .sub_stripes= 1,
 .dev_stripes= 2,
 .devs_max   = 1,
 .devs_min   = 1,
 .tolerated_failures = 0,
 .devs_increment = 1,
 .ncopies= 2,

 In the code:

 max_stripe_size = SZ_256M
 max_chunk_size = max_stripe_size-> SZ_256M

 Then we have find_free_dev_extent:
 max_stripe_size * dev_stripes   -> SZ_256M * 2 -> 512M

 So we like to find 512M on a disk, to stuff 2 stripes of 256M inside for
 the DUP. (remember: The two parts of DUP *never* end up on a different
 disk, even if you have multiple)

>>
>> Another better fix would be to make a change here...
>>
 If we find one:
 stripe_size = devices_info[ndevs-1].max_avail   -> 512M, yay
>>
>> ...because this is not yay. The 512M is max_avail, which needs to holds
>> *2* stripes, not 1. So stripe_size is suddenly changed to twice the
>> stripe size for DUP.
>>
>> So, an additional division again by dev_stripes would translate the
>> max_avail on device ndevs-1 to the stripe size to use.
> 
> This catch the point. Notice stripe_size is divided by dev_stripes
> after the lines in the patch.

Yes,

stripe_size = div_u64(stripe_size, dev_stripes);

Anyone who reads this should think.. "So, ok, what IS a stripe then". O_o

> It means that at this region (from "stripe_size =
> devices_info[ndevs-1].max_avail;" to "stripe_size =
> div_u64(stripe_size, dev_stripes);"),
> "stripe_size" stands for "stripe size written in one device". IOW,
> stripe_size (here, size of region in one device) = dev_stripes *
> stripe_size (final, one chunk).
> These two "stripe_size" is mostly the same value, because we set
> dev_stripes > 1 only with DUP.
> 
> We are using this stripe_size in one device manner to calculate logical size.
> This is correct, if we actually stripe wri

Fatal failure, btrfs raid with duplicated metadata

2017-10-11 Thread Ian Kumlien
Hi,

I was running a btrfs raid with 6 disks, metadata: dup and data: raid 6

Two of the disks started behaving oddly:
[436823.570296] sd 3:1:0:4: [sdf] Unaligned partial completion
(resid=244, sector_sz=512)
[436823.578604] sd 3:1:0:4: [sdf] Unaligned partial completion
(resid=52, sector_sz=512) [436823.617593]
sd 3:1:0:4: [sdf] Unaligned partial completion (resid=56,
sector_sz=512)
[436823.617771] sd 3:1:0:4: [sdf] Unaligned partial completion
(resid=222, sector_sz=512)
[436823.618386] sd 3:1:0:4: [sdf] Unaligned partial completion
(resid=246, sector_sz=512)
[436823.618463] sd 3:1:0:4: [sdf] Unaligned partial completion
(resid=56, sector_sz=512)
[436977.701944] scsi_io_completion: 68 callbacks suppressed
[436977.701973] sd 3:1:0:4: [sdf] tag#0 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[436977.701982] sd 3:1:0:4: [sdf] tag#0 Sense Key : Hardware Error
[current]
[436977.701991] sd 3:1:0:4: [sdf] tag#0 Add. Sense: Logical unit
failure
[436977.702000] sd 3:1:0:4: [sdf] tag#0 CDB: Read(10) 28 00 02 fb fb
80 00 00 28 00
[436977.702005] print_req_error: 68 callbacks suppressed
[436977.702010] print_req_error: critical target error, dev sdf,
sector 50068352
[498132.144319] print_req_error: 450 callbacks suppressed
[498132.144324] print_req_error: critical target error, dev sdf,
sector 41777640
[498132.144590] btrfs_dev_stat_print_on_error: 540 callbacks
suppressed
[498132.144600] BTRFS error (device sdb1): bdev /dev/sdf1 errs: wr
632, rd 1526, flush 0, corrupt 0, gen 0

Eventually the filesystem becomes read-only and everything is odd...

Trying to run btrfs check on the disks results in:
btrfs check -b /dev/disk/by-uuid/8d431da9-dad4-481c-a5ad-5e6844f31da0
bytenr mismatch, want=912228352, have=0
Couldn't read tree root
ERROR: cannot open file system

(For backup and normal)

So even if the data is duplicated on all disks, something in the above
errors seemed to cause it to abort
(These disks are seagate sshd disks, never ever buying them again)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: fix max chunk size on dup

2017-10-11 Thread Naohiro Aota
Hi,

You may notice my @wdc.com address is bounced. As my internship job at
wdc is finished, the address is already expired.
Please contact me on this @elisp.net address.

2017-10-11 2:42 GMT+09:00 Hans van Kranenburg :
> Sorry for the mail spam, it's an interesting code puzzle... :)
>
> On 10/10/2017 07:22 PM, Hans van Kranenburg wrote:
>> On 10/10/2017 07:07 PM, Hans van Kranenburg wrote:
>>> On 10/10/2017 01:31 PM, David Sterba wrote:
 Hi,

 On Fri, Sep 29, 2017 at 04:20:51PM +0900, Naohiro Aota wrote:
> Balancing a fresh METADATA=dup btrfs file system (with size < 50G)
> generates a 128MB sized block group. While we set max_stripe_size =
> max_chunk_size = 256MB, we get this half sized block group:
>
> $ btrfs ins dump-t -t CHUNK_TREE btrfs.img|grep length
> length 8388608 owner 2 stripe_len 65536 type DATA
> length 33554432 owner 2 stripe_len 65536 type SYSTEM|DUP
> length 134217728 owner 2 stripe_len 65536 type 
> METADATA|DUP
>
> Before commit 86db25785a6e ("Btrfs: fix max chunk size on raid5/6"), we
> used "stripe_size * ndevs > max_chunk_size * ncopies" to check the max
> chunk size. Since stripe_size = 256MB * dev_stripes (= 2) = 512MB, ndevs
> = 1, max_chunk_size = 256MB, and ncopies = 2, we allowed 256MB
> METADATA|DUP block group.
>
> But now, we use "stripe_size * data_stripes > max_chunk_size". Since
> data_stripes = 1 on DUP, it disallows the block group to have > 128MB.
> What missing here is "dev_stripes". Proper logical space used by the block
> group is "stripe_size * data_stripes / dev_stripes". Tweak the equations 
> to
> use the right value.

 I started looking into it and still don't fully understand it. Change
 deep in the allocator can easily break some blockgroup combinations, so
 I'm rather conservative here.
>>>
>>> I think that the added usage of data_stripes in 86db25785a6e is the
>>> problematic change. data_stripes is something that was introduced as
>>> part of RAID56 in 53b381b3a and clearly only has a meaning that's
>>> properly thought of for RAID56. The RAID56 commit already adds "this
>>> will have to be fixed for RAID1 and RAID10 over more drives", only the
>>> author doesn't catch the DUP case, which already breaks at that point.
>>>

Thank you for explaining in detail :) Yes, the allocator was already
broken by using the data_stripes.
(and it will also "break" existing file systems, in terms of making a
DUP|META block group smaller)
I believe this patch fix the things.

>>> At the beginning it says:
>>>
>>> int data_stripes; /* number of stripes that count for block group size */
>>>
>>> For the example:
>>>
>>> This is DUP:
>>>
>>> .sub_stripes= 1,
>>> .dev_stripes= 2,
>>> .devs_max   = 1,
>>> .devs_min   = 1,
>>> .tolerated_failures = 0,
>>> .devs_increment = 1,
>>> .ncopies= 2,
>>>
>>> In the code:
>>>
>>> max_stripe_size = SZ_256M
>>> max_chunk_size = max_stripe_size-> SZ_256M
>>>
>>> Then we have find_free_dev_extent:
>>> max_stripe_size * dev_stripes   -> SZ_256M * 2 -> 512M
>>>
>>> So we like to find 512M on a disk, to stuff 2 stripes of 256M inside for
>>> the DUP. (remember: The two parts of DUP *never* end up on a different
>>> disk, even if you have multiple)
>>>
>
> Another better fix would be to make a change here...
>
>>> If we find one:
>>> stripe_size = devices_info[ndevs-1].max_avail   -> 512M, yay
>
> ...because this is not yay. The 512M is max_avail, which needs to holds
> *2* stripes, not 1. So stripe_size is suddenly changed to twice the
> stripe size for DUP.
>
> So, an additional division again by dev_stripes would translate the
> max_avail on device ndevs-1 to the stripe size to use.

This catch the point. Notice stripe_size is divided by dev_stripes
after the lines in the patch.
It means that at this region (from "stripe_size =
devices_info[ndevs-1].max_avail;" to "stripe_size =
div_u64(stripe_size, dev_stripes);"),
"stripe_size" stands for "stripe size written in one device". IOW,
stripe_size (here, size of region in one device) = dev_stripes *
stripe_size (final, one chunk).
These two "stripe_size" is mostly the same value, because we set
dev_stripes > 1 only with DUP.

We are using this stripe_size in one device manner to calculate logical size.
This is correct, if we actually stripe write to stripes in one device.
But, it actually is
copied stripes at least with current implementation. So the
if-condition "if (stripe_size * data_stripes > max_chunk_size)"
is doing wrong calculation.

So the solution would be:

1. as in my patch, consider dev_stripes in the if-condition
or,
2. as Hans suggests, move the division by dev_stripes above the if-condition,
   and properly re-calculate stripe_size if logical size > max_chunk_size

Now, I think taking the 2nd solution would make the code more clear,
because it eliminate two different meanings of "stripe

Re: [PATCH] Btrfs-progs: do not add stale device into fs_devices

2017-10-11 Thread Nikolay Borisov


On 11.10.2017 03:28, Liu Bo wrote:
> If one of btrfs's devices was pulled out and we've replaced it with a
> new one, then they have the same uuid.
> 
> If that device gets reconnected, 'btrfs filesystem show' will show the
> stale one instead of the new one, but on kernel side btrfs has a fix
> to not include the stale one, this could confuse users as people may
> monitor btrfs by running that cli.
> 
> This does the similar thing to what kernel side has done.
> 
> Signed-off-by: Liu Bo 
> ---
>  volumes.c | 15 ++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/volumes.c b/volumes.c
> index 2f3943d..c7b7a41 100644
> --- a/volumes.c
> +++ b/volumes.c
> @@ -138,7 +138,20 @@ static int device_list_add(const char *path,
>   list_add(&device->dev_list, &fs_devices->devices);
>   device->fs_devices = fs_devices;
>   } else if (!device->name || strcmp(device->name, path)) {
> - char *name = strdup(path);
> + char *name;
> +
> + /*
> +  * The existing device has newer generation, so this
> +  * one could be a stale one, don't add it.
> +  */
> + if (found_transid < device->generation) {
> + warning("adding device %s gen %llu but found a existing 
> device %s gen %llu\n",
> + path, found_transid, device->name,
> + device->generation, found_transid);

You pass in 5 parameters but have only 4 formatting strings. I don't see
the same happening on other warning() invocations? Perhaps the last
found_transid is not necessary?

> + return -EEXIST;
> + }
> +
> + name = strdup(path);
>  if (!name)
>  return -ENOMEM;
>  kfree(device->name);
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Remove unused dedupe argument btrfs_set_extent_delalloc()

2017-10-11 Thread Nikolay Borisov


On 11.10.2017 05:29, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> 

I had already sent similar patch and David said he is not removing the
in-band dedup prep for now.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] btrfs: Use bd_dev to generate index when dev_state_hashtable add items.

2017-10-11 Thread Nikolay Borisov


On 11.10.2017 12:16, Nikolay Borisov wrote:
> 
> 
> On 11.10.2017 11:44, Gu Jinxiang wrote:
>> From: Gu JinXiang 
>>
>> Fix missing modify of
>> commit f8f84b2dfda5 ("btrfs: index check-integrity state hash by a dev_t").
>> Function btrfsic_dev_state_hashtable_lookup use dev_t to generate hashval
>> when lookup a btrfsic_dev_state in hash table. So when add a
>> btrfsic_dev_state into hash table, it should also use dev_t.
> 
> Your description contradicts the actual code as of a957fd420ca8.
> btrfsic_block_hashtable_lookup creates the hash like so:
> 
> const unsigned int hashval =
> (((unsigned int)(dev_bytenr >> 16)) ^
> 
>  ((unsigned int)((uintptr_t)bdev))) &
> 
>  (BTRFSIC_BLOCK_HASHTABLE_SIZE - 1)
> 
> and bdev is being passed as an argument. And hashtable_add also uses the
> pointer of the bdev, which seems correct. So what's the deal here?

Disregard, there seem to be multiple _lookup functions and I was looking
at the wrong one. btrfsic_dev_state_hashtable_lookup definitely uses the
passed in dev_t as the key. So this fix looks correct.


On a slightly different note the check-integrity.c file could really
make use of the hashtable.h API but that's a bit more work. In any case
I think what could be done to reduce the likelihood of such errors is to
introduce functions/macros which calculate the correct key when it's
necessary rather than open-coding it.

Any way:

Reviewed-by: Nikolay Borisov 

> 
>>
>> Reproduce of this bug:
>> Use MOUNT_OPTIONS="-o check_int" when run xfstest, device can not be
>> mount successfully. So xfstest can not run. 
>>
>> changelog:
>> v1->v2: Add how to reproduce this bug.
>> v2->v3: Add description of this bug.
>>
>> Signed-off-by: Gu JinXiang 
>> ---
>>  fs/btrfs/check-integrity.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
>> index 9d3854839038..86d79bc4cfb3 100644
>> --- a/fs/btrfs/check-integrity.c
>> +++ b/fs/btrfs/check-integrity.c
>> @@ -613,7 +613,7 @@ static void btrfsic_dev_state_hashtable_add(
>>  struct btrfsic_dev_state_hashtable *h)
>>  {
>>  const unsigned int hashval =
>> -(((unsigned int)((uintptr_t)ds->bdev)) &
>> +(((unsigned int)((uintptr_t)ds->bdev->bd_dev)) &
>>   (BTRFSIC_DEV2STATE_HASHTABLE_SIZE - 1));
>>  
>>  list_add(&ds->collision_resolving_node, h->table + hashval);
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] btrfs: Use bd_dev to generate index when dev_state_hashtable add items.

2017-10-11 Thread Nikolay Borisov


On 11.10.2017 11:44, Gu Jinxiang wrote:
> From: Gu JinXiang 
> 
> Fix missing modify of
> commit f8f84b2dfda5 ("btrfs: index check-integrity state hash by a dev_t").
> Function btrfsic_dev_state_hashtable_lookup use dev_t to generate hashval
> when lookup a btrfsic_dev_state in hash table. So when add a
> btrfsic_dev_state into hash table, it should also use dev_t.

Your description contradicts the actual code as of a957fd420ca8.
btrfsic_block_hashtable_lookup creates the hash like so:

const unsigned int hashval =
(((unsigned int)(dev_bytenr >> 16)) ^

 ((unsigned int)((uintptr_t)bdev))) &

 (BTRFSIC_BLOCK_HASHTABLE_SIZE - 1)

and bdev is being passed as an argument. And hashtable_add also uses the
pointer of the bdev, which seems correct. So what's the deal here?

> 
> Reproduce of this bug:
> Use MOUNT_OPTIONS="-o check_int" when run xfstest, device can not be
> mount successfully. So xfstest can not run. 
> 
> changelog:
> v1->v2: Add how to reproduce this bug.
> v2->v3: Add description of this bug.
> 
> Signed-off-by: Gu JinXiang 
> ---
>  fs/btrfs/check-integrity.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
> index 9d3854839038..86d79bc4cfb3 100644
> --- a/fs/btrfs/check-integrity.c
> +++ b/fs/btrfs/check-integrity.c
> @@ -613,7 +613,7 @@ static void btrfsic_dev_state_hashtable_add(
>   struct btrfsic_dev_state_hashtable *h)
>  {
>   const unsigned int hashval =
> - (((unsigned int)((uintptr_t)ds->bdev)) &
> + (((unsigned int)((uintptr_t)ds->bdev->bd_dev)) &
>(BTRFSIC_DEV2STATE_HASHTABLE_SIZE - 1));
>  
>   list_add(&ds->collision_resolving_node, h->table + hashval);
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3] btrfs: Use bd_dev to generate index when dev_state_hashtable add items.

2017-10-11 Thread Gu Jinxiang
From: Gu JinXiang 

Fix missing modify of
commit f8f84b2dfda5 ("btrfs: index check-integrity state hash by a dev_t").
Function btrfsic_dev_state_hashtable_lookup use dev_t to generate hashval
when lookup a btrfsic_dev_state in hash table. So when add a
btrfsic_dev_state into hash table, it should also use dev_t.

Reproduce of this bug:
Use MOUNT_OPTIONS="-o check_int" when run xfstest, device can not be
mount successfully. So xfstest can not run. 

changelog:
v1->v2: Add how to reproduce this bug.
v2->v3: Add description of this bug.

Signed-off-by: Gu JinXiang 
---
 fs/btrfs/check-integrity.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 9d3854839038..86d79bc4cfb3 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -613,7 +613,7 @@ static void btrfsic_dev_state_hashtable_add(
struct btrfsic_dev_state_hashtable *h)
 {
const unsigned int hashval =
-   (((unsigned int)((uintptr_t)ds->bdev)) &
+   (((unsigned int)((uintptr_t)ds->bdev->bd_dev)) &
 (BTRFSIC_DEV2STATE_HASHTABLE_SIZE - 1));
 
list_add(&ds->collision_resolving_node, h->table + hashval);
-- 
2.13.5



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] cgroup, buffer_head: implement submit_bh_blkcg_css()

2017-10-11 Thread Jan Kara
On Tue 10-10-17 08:54:40, Tejun Heo wrote:
> Implement submit_bh_blkcg_css() which will be used to override cgroup
> membership on specific buffer_heads.
> 
> v2: Reimplemented using create_bh_bio() as suggested by Jan.
> 
> Signed-off-by: Tejun Heo 
> Cc: Jan Kara 
> Cc: Jens Axboe 

Looks good. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/buffer.c | 12 
>  include/linux/buffer_head.h |  3 +++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index b4b2169..ed0e473 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -3147,6 +3147,18 @@ static int submit_bh_wbc(int op, int op_flags, struct 
> buffer_head *bh,
>   return 0;
>  }
>  
> +int submit_bh_blkcg_css(int op, int op_flags, struct buffer_head *bh,
> + struct cgroup_subsys_state *blkcg_css)
> +{
> + struct bio *bio;
> +
> + bio = create_bh_bio(op, op_flags, bh, 0);
> + bio_associate_blkcg(bio, blkcg_css);
> + submit_bio(bio);
> + return 0;
> +}
> +EXPORT_SYMBOL(submit_bh_blkcg_css);
> +
>  int submit_bh(int op, int op_flags, struct buffer_head *bh)
>  {
>   struct bio *bio;
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index c8dae55..dca5b3b 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -13,6 +13,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #ifdef CONFIG_BLOCK
>  
> @@ -197,6 +198,8 @@ void ll_rw_block(int, int, int, struct buffer_head * 
> bh[]);
>  int sync_dirty_buffer(struct buffer_head *bh);
>  int __sync_dirty_buffer(struct buffer_head *bh, int op_flags);
>  void write_dirty_buffer(struct buffer_head *bh, int op_flags);
> +int submit_bh_blkcg_css(int op, int op_flags, struct buffer_head *bh,
> + struct cgroup_subsys_state *blkcg_css);
>  int submit_bh(int, int, struct buffer_head *);
>  void write_boundary_block(struct block_device *bdev,
>   sector_t bblock, unsigned blocksize);
> -- 
> 2.9.5
> 
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/5] buffer_head: separate out create_bh_bio() from submit_bh_wbc()

2017-10-11 Thread Jan Kara
On Tue 10-10-17 08:54:39, Tejun Heo wrote:
> submit_bh_wbc() creates a bio matching the specific @bh and submits it
> at the end.  This patch separates out the bio creation part to its own
> function, create_bh_bio(), and reimplement submit_bh[_wbc]() using the
> function.
> 
> As bio can now be manipulated before submitted, we can move out @wbc
> handling into submit_bh_wbc() and similarly this will make adding more
> submit_bh variants straight-forward.
> 
> This patch is pure refactoring and doesn't cause any functional
> changes.
> 
> Signed-off-by: Tejun Heo 
> Suggested-by: Jan Kara 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza


> ---
>  fs/buffer.c | 30 ++
>  1 file changed, 22 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 170df85..b4b2169 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -3086,8 +3086,8 @@ void guard_bio_eod(int op, struct bio *bio)
>   }
>  }
>  
> -static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
> -  enum rw_hint write_hint, struct writeback_control *wbc)
> +struct bio *create_bh_bio(int op, int op_flags, struct buffer_head *bh,
> +  enum rw_hint write_hint)
>  {
>   struct bio *bio;
>  
> @@ -3109,11 +3109,6 @@ static int submit_bh_wbc(int op, int op_flags, struct 
> buffer_head *bh,
>*/
>   bio = bio_alloc(GFP_NOIO, 1);
>  
> - if (wbc) {
> - wbc_init_bio(wbc, bio);
> - wbc_account_io(wbc, bh->b_page, bh->b_size);
> - }
> -
>   bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
>   bio_set_dev(bio, bh->b_bdev);
>   bio->bi_write_hint = write_hint;
> @@ -3133,13 +3128,32 @@ static int submit_bh_wbc(int op, int op_flags, struct 
> buffer_head *bh,
>   op_flags |= REQ_PRIO;
>   bio_set_op_attrs(bio, op, op_flags);
>  
> + return bio;
> +}
> +
> +static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
> +  enum rw_hint write_hint, struct writeback_control *wbc)
> +{
> + struct bio *bio;
> +
> + bio = create_bh_bio(op, op_flags, bh, write_hint);
> +
> + if (wbc) {
> + wbc_init_bio(wbc, bio);
> + wbc_account_io(wbc, bh->b_page, bh->b_size);
> + }
> +
>   submit_bio(bio);
>   return 0;
>  }
>  
>  int submit_bh(int op, int op_flags, struct buffer_head *bh)
>  {
> - return submit_bh_wbc(op, op_flags, bh, 0, NULL);
> + struct bio *bio;
> +
> + bio = create_bh_bio(op, op_flags, bh, 0);
> + submit_bio(bio);
> + return 0;
>  }
>  EXPORT_SYMBOL(submit_bh);
>  
> -- 
> 2.9.5
> 
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] blkcg: export blkcg_root_css

2017-10-11 Thread Jan Kara
On Tue 10-10-17 08:54:37, Tejun Heo wrote:
> Export blkcg_root_css so that filesystem modules can use it.
> 
> Signed-off-by: Tejun Heo 

Looks good. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  block/blk-cgroup.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index d3f56ba..597a457 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -45,6 +45,7 @@ struct blkcg blkcg_root;
>  EXPORT_SYMBOL_GPL(blkcg_root);
>  
>  struct cgroup_subsys_state * const blkcg_root_css = &blkcg_root.css;
> +EXPORT_SYMBOL_GPL(blkcg_root_css);
>  
>  static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];
>  
> -- 
> 2.9.5
> 
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: remove rcu_barrier in btrfs_close_devices

2017-10-11 Thread Anand Jain



On 10/11/2017 02:11 PM, Anand Jain wrote:



On 10/11/2017 05:51 AM, Liu Bo wrote:

It was introduced because btrfs used to do blkdev_put in a deferred
work, now that btrfs has put blkdev in place, this rcu_barrier can be
removed.


 On the 2nd thought, modprobe -r btrfs would still need rcu_barrier(), 
some where else outside of umount context ?


Thanks, Anand



  Reviewed-by: Anand Jain 

Thanks, Anand



Signed-off-by: Liu Bo 
---
  fs/btrfs/volumes.c | 6 --
  1 file changed, 6 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0e8f16c..d983cea 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -958,12 +958,6 @@ int btrfs_close_devices(struct btrfs_fs_devices 
*fs_devices)

  __btrfs_close_devices(fs_devices);
  free_fs_devices(fs_devices);
  }
-/*
- * Wait for rcu kworkers under __btrfs_close_devices
- * to finish all blkdev_puts so device is really
- * free when umount is done.
- */
-rcu_barrier();
  return ret;
  }


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: Use bd_dev to generate index when dev_state_hashtable add items.

2017-10-11 Thread Nikolay Borisov


On 11.10.2017 04:55, Gu Jinxiang wrote:
> From: Gu JinXiang 
> 
> Fix bug of
>  ().
> 
> Description of this bug:
> Use MOUNT_OPTIONS="-o check_int" when run xfstest, device can not be
> mount successfully. So xfstest can not run.

You haven't described what causes the bug but rather how to reproduce it
? Describe what is the difference between using ->bdev and bdev->bd_dev.
And how using one or the other affects the integrity checks and why
those checks fail.

> 
> Signed-off-by: Gu JinXiang 
> ---
>  fs/btrfs/check-integrity.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
> index 9d3854839038..86d79bc4cfb3 100644
> --- a/fs/btrfs/check-integrity.c
> +++ b/fs/btrfs/check-integrity.c
> @@ -613,7 +613,7 @@ static void btrfsic_dev_state_hashtable_add(
>   struct btrfsic_dev_state_hashtable *h)
>  {
>   const unsigned int hashval =
> - (((unsigned int)((uintptr_t)ds->bdev)) &
> + (((unsigned int)((uintptr_t)ds->bdev->bd_dev)) &
>(BTRFSIC_DEV2STATE_HASHTABLE_SIZE - 1));
>  
>   list_add(&ds->collision_resolving_node, h->table + hashval);
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: avoid losing data raid profile when deleting a device

2017-10-11 Thread Nikolay Borisov


On 10.10.2017 20:53, Liu Bo wrote:
> We've avoided data losing raid profile when doing balance, but it
> turns out that deleting a device could also result in the same
> problem
> 
> This fixes the problem by creating an empty data chunk before
> relocating the data chunk.

Why is this needed - copy the metadata of the to-be-relocated chunk into
the newly created empty chunk? I don't entirely understand that code but
doesn't this seem a bit like a hack in order to stash some information?
Perhaps you could elaborate the logic a bit more in the changelog?

> 
> Metadata/System chunk are supposed to have non-zero bytes all the time
> so their raid profile is persistent.

I think this changelog is a bit scarce on detail as to the culprit of
the problem. Could you perhaps put a sentence or two what the underlying
logic which deletes the raid profile if a chunk is empty ?

> 
> Reported-by: James Alandt 
> Signed-off-by: Liu Bo 
> ---
> 
> v2: - return the correct error.
> - move helper ahead of __btrfs_balance().
> 
>  fs/btrfs/volumes.c | 84 
> ++
>  1 file changed, 65 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 4a72c45..a74396d 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3018,6 +3018,48 @@ static int btrfs_relocate_sys_chunks(struct 
> btrfs_fs_info *fs_info)
>   return ret;
>  }
>  
> +/*
> + * return 1 : allocate a data chunk successfully,
> + * return <0: errors during allocating a data chunk,
> + * return 0 : no need to allocate a data chunk.
> + */
> +static int btrfs_may_alloc_data_chunk(struct btrfs_fs_info *fs_info,
> +   u64 chunk_offset)
> +{
> + struct btrfs_block_group_cache *cache;
> + u64 bytes_used;
> + u64 chunk_type;
> +
> + cache = btrfs_lookup_block_group(fs_info, chunk_offset);
> + ASSERT(cache);
> + chunk_type = cache->flags;
> + btrfs_put_block_group(cache);
> +
> + if (chunk_type & BTRFS_BLOCK_GROUP_DATA) {
> + spin_lock(&fs_info->data_sinfo->lock);
> + bytes_used = fs_info->data_sinfo->bytes_used;
> + spin_unlock(&fs_info->data_sinfo->lock);
> +
> + if (!bytes_used) {
> + struct btrfs_trans_handle *trans;
> + int ret;
> +
> + trans = btrfs_join_transaction(fs_info->tree_root);
> + if (IS_ERR(trans))
> + return PTR_ERR(trans);
> +
> + ret = btrfs_force_chunk_alloc(trans, fs_info,
> +   BTRFS_BLOCK_GROUP_DATA);
> + btrfs_end_transaction(trans);
> + if (ret < 0)
> + return ret;
> +
> + return 1;
> + }
> + }
> + return 0;
> +}
> +
>  static int insert_balance_item(struct btrfs_fs_info *fs_info,
>  struct btrfs_balance_control *bctl)
>  {
> @@ -3476,7 +3518,6 @@ static int __btrfs_balance(struct btrfs_fs_info 
> *fs_info)
>   u32 count_meta = 0;
>   u32 count_sys = 0;
>   int chunk_reserved = 0;
> - u64 bytes_used = 0;
>  
>   /* step one make some room on all the devices */
>   devices = &fs_info->fs_devices->devices;
> @@ -3635,28 +3676,21 @@ static int __btrfs_balance(struct btrfs_fs_info 
> *fs_info)
>   goto loop;
>   }
>  
> - ASSERT(fs_info->data_sinfo);
> - spin_lock(&fs_info->data_sinfo->lock);
> - bytes_used = fs_info->data_sinfo->bytes_used;
> - spin_unlock(&fs_info->data_sinfo->lock);
> -
> - if ((chunk_type & BTRFS_BLOCK_GROUP_DATA) &&
> - !chunk_reserved && !bytes_used) {
> - trans = btrfs_start_transaction(chunk_root, 0);
> - if (IS_ERR(trans)) {
> - mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> - ret = PTR_ERR(trans);
> - goto error;
> - }
> -
> - ret = btrfs_force_chunk_alloc(trans, fs_info,
> -   BTRFS_BLOCK_GROUP_DATA);
> - btrfs_end_transaction(trans);
> + if (!chunk_reserved) {
> + /*
> +  * We may be relocating the only data chunk we have,
> +  * which could potentially end up with losing data's
> +  * raid profile, so lets allocate an empty one in
> +  * advance.
> +  */
> + ret = btrfs_may_alloc_data_chunk(fs_info,
> +  found_key.offset);
>   if (ret < 0) {
>   mutex_unlock(&fs_info->delete_unused_bgs_mutex