Re: [RFC] BTRFS_DEV_REPLACE_ITEM_STATE_* doesn't match with on disk

2018-11-20 Thread Anand Jain



David, any comments on this please.

Thanks, Anand


On 11/13/2018 06:32 PM, Anand Jain wrote:


David, Gentle ping.

Thanks, Anand

On 11/12/2018 03:50 PM, Nikolay Borisov wrote:



On 12.11.18 at 6:58, Anand Jain wrote:


The dev_replace_state defines are mismatched between
BTRFS_IOCTL_DEV_REPLACE_STATE_* and BTRFS_DEV_REPLACE_ITEM_STATE_* [1].

[1]
-
btrfs.h:#define BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED    2
btrfs.h:#define BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED    3
btrfs.h:#define BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED    4

btrfs_tree.h:#define BTRFS_DEV_REPLACE_ITEM_STATE_SUSPENDED    2
btrfs_tree.h:#define BTRFS_DEV_REPLACE_ITEM_STATE_FINISHED    3
btrfs_tree.h:#define BTRFS_DEV_REPLACE_ITEM_STATE_CANCELED    4
-

The BTRFS_DEV_REPLACE_ITEM_STATE_* series is unused in both btrfs.ko and
btrfs-progs, and the on-disk format also follows
BTRFS_IOCTL_DEV_REPLACE_STATE_* (we set dev_replace->replace_state using
the BTRFS_IOCTL_DEV_REPLACE_STATE_* defines and write that value to disk).

  btrfs_set_dev_replace_replace_state(eb, ptr, dev_replace->replace_state);

IMO it should be OK to delete BTRFS_DEV_REPLACE_ITEM_STATE_* altogether,
but what about userland programs other than btrfs-progs? If not, at least
fix the mismatch as in [2]. Any comments?


Unfortunately you are right. This seems to stem from sloppy work back in
the days of initial dev-replace support. BTRFS_DEV_REPLACE_ITEM_STATE_*
were added in e922e087a35c ("Btrfs: enhance btrfs structures for device
replace support"), yet they were never used. And the
IOCTL_DEV_REPLACE_STATE* were added in e93c89c1 ("Btrfs: add new
sources for device replace code").

It looks like the ITEM_STATE* definitions were stillborn so to speak and
personally I'm in favor of removing them. They shouldn't have been
merged in the first place and indeed the patch doesn't even have a
Reviewed-by tag. So it originated from the, I'd say, spartan days of
btrfs development...

David, any code which is using BTRFS_DEV_REPLACE_ITEM_STATE_SUSPENDED
is inherently broken, so how about we remove those definitions? Then,
when its compilation breaks in the future, the author will actually
have a chance to fix it, though it's highly unlikely anyone is relying
on those definitions.




[2]
--
diff --git a/include/uapi/linux/btrfs_tree.h
b/include/uapi/linux/btrfs_tree.h
index aff1356c2bb8..9ffa7534cadf 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -805,9 +805,9 @@ struct btrfs_dev_stats_item {
  #define BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID 1

  #define BTRFS_DEV_REPLACE_ITEM_STATE_NEVER_STARTED 0
  #define BTRFS_DEV_REPLACE_ITEM_STATE_STARTED   1
-#define BTRFS_DEV_REPLACE_ITEM_STATE_SUSPENDED 2
-#define BTRFS_DEV_REPLACE_ITEM_STATE_FINISHED  3
-#define BTRFS_DEV_REPLACE_ITEM_STATE_CANCELED  4
+#define BTRFS_DEV_REPLACE_ITEM_STATE_FINISHED  2
+#define BTRFS_DEV_REPLACE_ITEM_STATE_CANCELED  3
+#define BTRFS_DEV_REPLACE_ITEM_STATE_SUSPENDED 4

  struct btrfs_dev_replace_item {
 /*
--


Thanks, Anand





Re: [PATCH v2] btrfs: Fix error handling in btrfs_cleanup_ordered_extents

2018-11-20 Thread Nikolay Borisov



On 20.11.18 at 21:00, Josef Bacik wrote:
> On Fri, Oct 26, 2018 at 02:41:55PM +0300, Nikolay Borisov wrote:
>> Running btrfs/124 in a loop hung up on me sporadically with the
>> following call trace:
>>  btrfs   D0  5760   5324 0x
>>  Call Trace:
>>   ? __schedule+0x243/0x800
>>   schedule+0x33/0x90
>>   btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
>>   ? wait_woken+0xa0/0xa0
>>   btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
>>   btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
>>   btrfs_relocate_chunk+0x49/0x100 [btrfs]
>>   btrfs_balance+0xbeb/0x1740 [btrfs]
>>   btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
>>   btrfs_ioctl+0x1691/0x3110 [btrfs]
>>   ? lockdep_hardirqs_on+0xed/0x180
>>   ? __handle_mm_fault+0x8e7/0xfb0
>>   ? _raw_spin_unlock+0x24/0x30
>>   ? __handle_mm_fault+0x8e7/0xfb0
>>   ? do_vfs_ioctl+0xa5/0x6e0
>>   ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
>>   do_vfs_ioctl+0xa5/0x6e0
>>   ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
>>   ksys_ioctl+0x3a/0x70
>>   __x64_sys_ioctl+0x16/0x20
>>   do_syscall_64+0x60/0x1b0
>>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>> Turns out during page writeback it's possible that the code in
>> writepage_delalloc can instantiate a delalloc range which doesn't
>> belong to the page currently being written back. This happens since
>> find_lock_delalloc_range returns a delalloc range of up to
>> BTRFS_MAX_EXTENT_SIZE when asked and doesn't really consider the range
>> of the passed page. When such a foreign range is found the code proceeds to
>> run_delalloc_range and calls the appropriate function to fill the
>> delalloc and create ordered extents. If, however, a failure occurs
>> while this operation is in effect then the clean up code in
>> btrfs_cleanup_ordered_extents will be called. This function has the
>> wrong assumption that the caller of run_delalloc_range always properly
>> cleans the first page of the range, hence when it calls
>> __endio_write_update_ordered it explicitly omits the first page of
>> the delalloc range. This is wrong because this function could be
>> cleaning a delalloc range that doesn't belong to the current page. This
>> in turn means that the page cleanup code in __extent_writepage will
>> not really free the initial page for the range, leaving a hanging
>> ordered extent with bytes_left set to 4k. This bug has been present
>> ever since the original introduction of the cleanup code in 524272607e88.
>>
>> Fix this by correctly checking whether the current page belongs to the
>> range being instantiated and if so correctly adjust the range parameters
>> passed for cleaning up. If it doesn't, then just clean the whole OE
>> range directly.
>>
>> Signed-off-by: Nikolay Borisov 
>> Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
> 
> Can we just remove the endio cleanup in __extent_writepage() and make this do
> the proper cleanup?  I'm not sure if that is feasible or not, but it seems
> like it would make the cleanup stuff less confusing and more
> straightforward.  If not
> you can add

Quickly skimming the code I think the cleanup in __extent_writepage
could be moved into __extent_writepage_io where we have 2 branches that
set PageError. So I guess it could be done, but I will have to
experiment with it.

> 
> Reviewed-by: Josef Bacik 
> 
> Thanks,
> 
> Josef
> 


Re: [PATCH v2] btrfs: Fix error handling in btrfs_cleanup_ordered_extents

2018-11-20 Thread Josef Bacik
On Fri, Oct 26, 2018 at 02:41:55PM +0300, Nikolay Borisov wrote:
> Running btrfs/124 in a loop hung up on me sporadically with the
> following call trace:
>   btrfs   D0  5760   5324 0x
>   Call Trace:
>? __schedule+0x243/0x800
>schedule+0x33/0x90
>btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
>? wait_woken+0xa0/0xa0
>btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
>btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
>btrfs_relocate_chunk+0x49/0x100 [btrfs]
>btrfs_balance+0xbeb/0x1740 [btrfs]
>btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
>btrfs_ioctl+0x1691/0x3110 [btrfs]
>? lockdep_hardirqs_on+0xed/0x180
>? __handle_mm_fault+0x8e7/0xfb0
>? _raw_spin_unlock+0x24/0x30
>? __handle_mm_fault+0x8e7/0xfb0
>? do_vfs_ioctl+0xa5/0x6e0
>? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
>do_vfs_ioctl+0xa5/0x6e0
>? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
>ksys_ioctl+0x3a/0x70
>__x64_sys_ioctl+0x16/0x20
>do_syscall_64+0x60/0x1b0
>entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Turns out during page writeback it's possible that the code in
> writepage_delalloc can instantiate a delalloc range which doesn't
> belong to the page currently being written back. This happens since
> find_lock_delalloc_range returns a delalloc range of up to
> BTRFS_MAX_EXTENT_SIZE when asked and doesn't really consider the range
> of the passed page. When such a foreign range is found the code proceeds to
> run_delalloc_range and calls the appropriate function to fill the
> delalloc and create ordered extents. If, however, a failure occurs
> while this operation is in effect then the clean up code in
> btrfs_cleanup_ordered_extents will be called. This function has the
> wrong assumption that the caller of run_delalloc_range always properly
> cleans the first page of the range, hence when it calls
> __endio_write_update_ordered it explicitly omits the first page of
> the delalloc range. This is wrong because this function could be
> cleaning a delalloc range that doesn't belong to the current page. This
> in turn means that the page cleanup code in __extent_writepage will
> not really free the initial page for the range, leaving a hanging
> ordered extent with bytes_left set to 4k. This bug has been present
> ever since the original introduction of the cleanup code in 524272607e88.
> 
> Fix this by correctly checking whether the current page belongs to the
> range being instantiated and if so correctly adjust the range parameters
> passed for cleaning up. If it doesn't, then just clean the whole OE
> range directly.
> 
> Signed-off-by: Nikolay Borisov 
> Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")

Can we just remove the endio cleanup in __extent_writepage() and make this do
the proper cleanup?  I'm not sure if that is feasible or not, but it seems like it
would make the cleanup stuff less confusing and more straightforward.  If not
you can add

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH v6 0/3] btrfs: balance: improve kernel logs

2018-11-20 Thread David Sterba
On Tue, Nov 20, 2018 at 04:12:54PM +0800, Anand Jain wrote:
> v5->v6:
>  Mostly non-functional changes to the defines.
>  Use goto instead of return in the middle of the define.
>  Pls ref individual patches 1/3 and 2/3 for more info.
> 
> v4->v5:
>  Mainly address David review comment [1].
>  [1]
>   https://patchwork.kernel.org/patch/10425987/
>  pls ref to individual patch 2/3 for details.
> 
> v3->v4:
>  Pls ref to individual patches.
> 
> Based on misc-next.
> 
> v2->v3:
>   Inspired by describe_relocation(); improve it, make it a helper
>   function, and use it to log the balance operations.
> 
> Kernel logs are very important for forensic investigation of issues;
> these patches make balance logs easy to review.
> 
> Anand Jain (3):
>   btrfs: add helper function describe_block_group()
>   btrfs: balance: add args info during start and resume
>   btrfs: balance: add kernel log for end or paused

Now in the topic branch ext/anand/balance-description in my devel repo; I
made some changes as I don't want to go through more rounds of mail. There
are some cosmetic changes, like reordering the balance args, added
comments, and aligning the \ even farther to the right. Please check the
committed patches for some inspiration.

I'll add that to misc-next after it passes some tests; until then it'll be
in for-next. Thanks.


Re: [PATCH RESEND 0/9 v2] fix warn_on for replace cancel

2018-11-20 Thread David Sterba
On Tue, Nov 20, 2018 at 07:56:14PM +0800, Anand Jain wrote:
> These two patches were sent as part of
>[PATCH 0/9 v2] fix replace-start and replace-cancel racing
> before, but as they weren't integrated, I am sending them again.
> 
> The patch [1], which is in misc-next, calls btrfs_dev_replace_finishing()
> after the replace is canceled, so the ret argument passed to
> btrfs_dev_replace_finishing() is -ECANCELED, which is not a real error.
> So these patches quieten the warning and error log if it's -ECANCELED.
> 
> These should be integrated, otherwise we see the WARN_ON and btrfs_error()
> after replace cancel.
> 
> [1] 08bdfc3a0611 btrfs: fix use-after-free due to race between replace
> start and cancel
> 
> Anand Jain (2):
>   btrfs: quieten warn if the replace is canceled at finish
>   btrfs: user requested replace cancel is not an error

Thanks, patches added to misc-next, right after the other dev-replace
patches.


Re: [PATCH RESEND v2 1/2] btrfs: quieten warn if replace is canceled at finish

2018-11-20 Thread David Sterba
On Tue, Nov 20, 2018 at 07:56:15PM +0800, Anand Jain wrote:
> When we successfully cancel the replace, its scrub returns -ECANCELED,
> which is then passed to btrfs_dev_replace_finishing(); it cleans up based
> on the returned scrub status and propagates the same -ECANCELED back to
> the parent function. As of now only the user can cancel the replace-scrub,
> so it's OK to quieten the warning here.
> 
> Signed-off-by: Anand Jain 

OK for getting rid of the ECANCELED warning, though it would be better
to replace the WARN_ON too.

Reviewed-by: David Sterba 


[PATCH 4/4] btrfs: dev-replace: open code trivial locking helpers

2018-11-20 Thread David Sterba
The dev-replace locking functions are now trivial wrappers around the rw
semaphore, which can be used directly everywhere. No functional change.

Signed-off-by: David Sterba 
---
 fs/btrfs/dev-replace.c | 81 --
 fs/btrfs/dev-replace.h |  4 ---
 fs/btrfs/reada.c   | 12 +++
 fs/btrfs/scrub.c   | 15 
 fs/btrfs/volumes.c | 14 
 5 files changed, 52 insertions(+), 74 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index a37af7aaca86..9b8ab26deb41 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -284,13 +284,13 @@ int btrfs_run_dev_replace(struct btrfs_trans_handle *trans,
struct btrfs_dev_replace_item *ptr;
	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
 
-   btrfs_dev_replace_read_lock(dev_replace);
+   down_read(&dev_replace->rwsem);
if (!dev_replace->is_valid ||
!dev_replace->item_needs_writeback) {
-   btrfs_dev_replace_read_unlock(dev_replace);
+   up_read(&dev_replace->rwsem);
return 0;
}
-   btrfs_dev_replace_read_unlock(dev_replace);
+   up_read(&dev_replace->rwsem);
 
key.objectid = 0;
key.type = BTRFS_DEV_REPLACE_KEY;
@@ -348,7 +348,7 @@ int btrfs_run_dev_replace(struct btrfs_trans_handle *trans,
ptr = btrfs_item_ptr(eb, path->slots[0],
 struct btrfs_dev_replace_item);
 
-   btrfs_dev_replace_write_lock(dev_replace);
+   down_write(&dev_replace->rwsem);
if (dev_replace->srcdev)
btrfs_set_dev_replace_src_devid(eb, ptr,
dev_replace->srcdev->devid);
@@ -371,7 +371,7 @@ int btrfs_run_dev_replace(struct btrfs_trans_handle *trans,
btrfs_set_dev_replace_cursor_right(eb, ptr,
dev_replace->cursor_right);
dev_replace->item_needs_writeback = 0;
-   btrfs_dev_replace_write_unlock(dev_replace);
+   up_write(&dev_replace->rwsem);
 
btrfs_mark_buffer_dirty(eb);
 
@@ -432,7 +432,7 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
}
 
need_unlock = true;
-   btrfs_dev_replace_write_lock(dev_replace);
+   down_write(_replace->rwsem);
switch (dev_replace->replace_state) {
case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
case BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED:
@@ -470,7 +470,7 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
dev_replace->item_needs_writeback = 1;
	atomic64_set(&dev_replace->num_write_errors, 0);
	atomic64_set(&dev_replace->num_uncorrectable_read_errors, 0);
-   btrfs_dev_replace_write_unlock(dev_replace);
+   up_write(&dev_replace->rwsem);
need_unlock = false;
 
ret = btrfs_sysfs_add_device_link(tgt_device->fs_devices, tgt_device);
@@ -484,7 +484,7 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
if (IS_ERR(trans)) {
ret = PTR_ERR(trans);
need_unlock = true;
-   btrfs_dev_replace_write_lock(dev_replace);
+   down_write(_replace->rwsem);
dev_replace->replace_state =
BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED;
dev_replace->srcdev = NULL;
@@ -511,7 +511,7 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 
 leave:
if (need_unlock)
-   btrfs_dev_replace_write_unlock(dev_replace);
+   up_write(_replace->rwsem);
btrfs_destroy_dev_replace_tgtdev(tgt_device);
return ret;
 }
@@ -579,18 +579,18 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
/* don't allow cancel or unmount to disturb the finishing procedure */
	mutex_lock(&dev_replace->lock_finishing_cancel_unmount);
 
-   btrfs_dev_replace_read_lock(dev_replace);
+   down_read(&dev_replace->rwsem);
/* was the operation canceled, or is it finished? */
if (dev_replace->replace_state !=
BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED) {
-   btrfs_dev_replace_read_unlock(dev_replace);
+   up_read(&dev_replace->rwsem);
	mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
return 0;
}
 
tgt_device = dev_replace->tgtdev;
src_device = dev_replace->srcdev;
-   btrfs_dev_replace_read_unlock(dev_replace);
+   up_read(&dev_replace->rwsem);
 
/*
 * flush all outstanding I/O and inode extent mappings before the
@@ -614,7 +614,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
/* keep away write_all_supers() during the finishing procedure */
	mutex_lock(&fs_info->fs_devices->device_list_mutex);
	mutex_lock(&fs_info->chunk_mutex);
-   btrfs_dev_replace_write_lock(dev_replace);
+   down_write(&dev_replace->rwsem);
dev_replace->replace_state =
scrub_ret ? 

[PATCH 1/4] btrfs: reada: reorder dev-replace locks before radix tree preload

2018-11-20 Thread David Sterba
The read lock is going to use a rw semaphore that might sleep, which is
not possible in the radix tree preload section. The lock nesting is now:

* device replace
  * radix tree preload
* readahead spinlock

Signed-off-by: David Sterba 
---
 fs/btrfs/reada.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index dec14b739b10..6c4fdd98582d 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -376,26 +376,28 @@ static struct reada_extent *reada_find_extent(struct btrfs_fs_info *fs_info,
goto error;
}
 
+   /* insert extent in reada_tree + all per-device trees, all or nothing */
+   btrfs_dev_replace_read_lock(&fs_info->dev_replace);
ret = radix_tree_preload(GFP_KERNEL);
-   if (ret)
+   if (ret) {
+   btrfs_dev_replace_read_unlock(&fs_info->dev_replace);
goto error;
+   }
 
-   /* insert extent in reada_tree + all per-device trees, all or nothing */
-   btrfs_dev_replace_read_lock(&fs_info->dev_replace);
	spin_lock(&fs_info->reada_lock);
	ret = radix_tree_insert(&fs_info->reada_tree, index, re);
if (ret == -EEXIST) {
	re_exist = radix_tree_lookup(&fs_info->reada_tree, index);
re_exist->refcnt++;
	spin_unlock(&fs_info->reada_lock);
-   btrfs_dev_replace_read_unlock(&fs_info->dev_replace);
radix_tree_preload_end();
+   btrfs_dev_replace_read_unlock(&fs_info->dev_replace);
goto error;
}
if (ret) {
	spin_unlock(&fs_info->reada_lock);
-   btrfs_dev_replace_read_unlock(&fs_info->dev_replace);
radix_tree_preload_end();
+   btrfs_dev_replace_read_unlock(&fs_info->dev_replace);
goto error;
}
radix_tree_preload_end();
-- 
2.19.1



[PATCH 2/4] btrfs: dev-replace: switch locking to rw semaphore

2018-11-20 Thread David Sterba
This is the first part of removing the custom locking and waiting scheme
used for device replace. It was probably copied from extent buffer
locking, but there's nothing here that requires more than the common
locking primitives provide.

The rw spinlock protects the counter of waiting tasks (in case of
incompatible locks) and the waitqueue; a rw semaphore provides the same.

This patch only switches the locking primitive, for better
bisectability.  There should be no functional change other than the
overhead of the locking and potential sleeping instead of spinning when
the lock is contended.

Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.h   |  2 +-
 fs/btrfs/dev-replace.c | 12 ++--
 fs/btrfs/disk-io.c |  2 +-
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b58c5e73e458..01efe3849968 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -360,7 +360,7 @@ struct btrfs_dev_replace {
struct btrfs_device *tgtdev;
 
struct mutex lock_finishing_cancel_unmount;
-   rwlock_t lock;
+   struct rw_semaphore rwsem;
atomic_t blocking_readers;
wait_queue_head_t read_lock_wq;
 
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 85d93bd3b27a..387f85fcc26e 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -1001,12 +1001,12 @@ int btrfs_dev_replace_is_ongoing(struct btrfs_dev_replace *dev_replace)
 
 void btrfs_dev_replace_read_lock(struct btrfs_dev_replace *dev_replace)
 {
-   read_lock(&dev_replace->lock);
+   down_read(&dev_replace->rwsem);
 }
 
 void btrfs_dev_replace_read_unlock(struct btrfs_dev_replace *dev_replace)
 {
-   read_unlock(&dev_replace->lock);
+   down_read? no
 }
 
 void btrfs_dev_replace_write_lock(struct btrfs_dev_replace *dev_replace)
@@ -1014,16 +1014,16 @@ void btrfs_dev_replace_write_lock(struct btrfs_dev_replace *dev_replace)
 again:
wait_event(dev_replace->read_lock_wq,
		   atomic_read(&dev_replace->blocking_readers) == 0);
-   write_lock(&dev_replace->lock);
+   down_write(&dev_replace->rwsem);
	if (atomic_read(&dev_replace->blocking_readers)) {
-   write_unlock(&dev_replace->lock);
+   up_write(&dev_replace->rwsem);
goto again;
}
 }
 
 void btrfs_dev_replace_write_unlock(struct btrfs_dev_replace *dev_replace)
 {
-   write_unlock(&dev_replace->lock);
+   up_write(&dev_replace->rwsem);
 }
 
 /* inc blocking cnt and release read lock */
@@ -1032,7 +1032,7 @@ void btrfs_dev_replace_set_lock_blocking(
 {
/* only set blocking for read lock */
	atomic_inc(&dev_replace->blocking_readers);
-   read_unlock(&dev_replace->lock);
+   up_read(&dev_replace->rwsem);
 }
 
 void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c1d127decc8d..38717aa9c47d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2119,7 +2119,7 @@ static void btrfs_init_btree_inode(struct btrfs_fs_info *fs_info)
 static void btrfs_init_dev_replace_locks(struct btrfs_fs_info *fs_info)
 {
	mutex_init(&fs_info->dev_replace.lock_finishing_cancel_unmount);
-   rwlock_init(&fs_info->dev_replace.lock);
+   init_rwsem(&fs_info->dev_replace.rwsem);
	atomic_set(&fs_info->dev_replace.blocking_readers, 0);
	init_waitqueue_head(&fs_info->dev_replace.replace_wait);
	init_waitqueue_head(&fs_info->dev_replace.read_lock_wq);
-- 
2.19.1



[PATCH 0/4] Replace custom device-replace locking with rwsem

2018-11-20 Thread David Sterba
The first cleanup part went to 4.19; the actual switch from the custom
locking to rwsem was postponed as I found a performance degradation. That
turned out to be related to VM cache settings, so I'm resending the
series again.

The custom locking is based on rwlock-protected reader/writer counters
and waitqueues, which is essentially what the read-write semaphore does.

Previous patchset:
https://lore.kernel.org/linux-btrfs/cover.1536331604.git.dste...@suse.com/

Patches correspond to 8/11-11/11 and there's no change besides
refreshing on top of current misc-next.

David Sterba (4):
  btrfs: reada: reorder dev-replace locks before radix tree preload
  btrfs: dev-replace: switch locking to rw semaphore
  btrfs: dev-replace: remove custom read/write blocking scheme
  btrfs: dev-replace: open code trivial locking helpers

 fs/btrfs/ctree.h   |  4 +-
 fs/btrfs/dev-replace.c | 97 ++
 fs/btrfs/dev-replace.h |  5 ---
 fs/btrfs/disk-io.c |  4 +-
 fs/btrfs/reada.c   | 16 ---
 fs/btrfs/scrub.c   | 15 ---
 fs/btrfs/volumes.c | 27 ++--
 7 files changed, 63 insertions(+), 105 deletions(-)

-- 
2.19.1



[PATCH 3/4] btrfs: dev-replace: remove custom read/write blocking scheme

2018-11-20 Thread David Sterba
After the rw semaphore has been added, the custom blocking using
::blocking_readers and ::read_lock_wq is redundant.

The blocking logic in __btrfs_map_block is replaced by extending the
time the semaphore is held; that has the same blocking effect on writes
as the previous custom scheme, which waited until ::blocking_readers
was zero.

Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.h   |  2 --
 fs/btrfs/dev-replace.c | 16 
 fs/btrfs/dev-replace.h |  1 -
 fs/btrfs/disk-io.c |  2 --
 fs/btrfs/volumes.c | 13 ++---
 5 files changed, 6 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 01efe3849968..b711d4ee4c83 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -361,8 +361,6 @@ struct btrfs_dev_replace {
 
struct mutex lock_finishing_cancel_unmount;
struct rw_semaphore rwsem;
-   atomic_t blocking_readers;
-   wait_queue_head_t read_lock_wq;
 
struct btrfs_scrub_progress scrub_progress;
 
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 387f85fcc26e..a37af7aaca86 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -1011,14 +1011,7 @@ void btrfs_dev_replace_read_unlock(struct btrfs_dev_replace *dev_replace)
 
 void btrfs_dev_replace_write_lock(struct btrfs_dev_replace *dev_replace)
 {
-again:
-   wait_event(dev_replace->read_lock_wq,
-	   atomic_read(&dev_replace->blocking_readers) == 0);
	down_write(&dev_replace->rwsem);
-   if (atomic_read(&dev_replace->blocking_readers)) {
-   up_write(&dev_replace->rwsem);
-   goto again;
-   }
 }
 
 void btrfs_dev_replace_write_unlock(struct btrfs_dev_replace *dev_replace)
@@ -1026,15 +1019,6 @@ void btrfs_dev_replace_write_unlock(struct btrfs_dev_replace *dev_replace)
	up_write(&dev_replace->rwsem);
 }
 
-/* inc blocking cnt and release read lock */
-void btrfs_dev_replace_set_lock_blocking(
-   struct btrfs_dev_replace *dev_replace)
-{
-   /* only set blocking for read lock */
-   atomic_inc(&dev_replace->blocking_readers);
-   up_read(&dev_replace->rwsem);
-}
-
 void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info)
 {
	percpu_counter_inc(&fs_info->dev_replace.bio_counter);
diff --git a/fs/btrfs/dev-replace.h b/fs/btrfs/dev-replace.h
index 27e3bb0cca11..dd1dcb22c1e3 100644
--- a/fs/btrfs/dev-replace.h
+++ b/fs/btrfs/dev-replace.h
@@ -23,6 +23,5 @@ void btrfs_dev_replace_read_lock(struct btrfs_dev_replace *dev_replace);
 void btrfs_dev_replace_read_unlock(struct btrfs_dev_replace *dev_replace);
 void btrfs_dev_replace_write_lock(struct btrfs_dev_replace *dev_replace);
 void btrfs_dev_replace_write_unlock(struct btrfs_dev_replace *dev_replace);
-void btrfs_dev_replace_set_lock_blocking(struct btrfs_dev_replace *dev_replace);
 
 #endif
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 38717aa9c47d..39fd33ef8f80 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2120,9 +2120,7 @@ static void btrfs_init_dev_replace_locks(struct btrfs_fs_info *fs_info)
 {
	mutex_init(&fs_info->dev_replace.lock_finishing_cancel_unmount);
	init_rwsem(&fs_info->dev_replace.rwsem);
-   atomic_set(&fs_info->dev_replace.blocking_readers, 0);
	init_waitqueue_head(&fs_info->dev_replace.replace_wait);
-   init_waitqueue_head(&fs_info->dev_replace.read_lock_wq);
 }
 
 static void btrfs_init_qgroup(struct btrfs_fs_info *fs_info)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1f476669479b..dc09c6abaf6d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5759,10 +5759,12 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 
btrfs_dev_replace_read_lock(dev_replace);
dev_replace_is_ongoing = btrfs_dev_replace_is_ongoing(dev_replace);
+   /*
+* Hold the semaphore for read during the whole operation, write is
+* requested at commit time but must wait.
+*/
if (!dev_replace_is_ongoing)
btrfs_dev_replace_read_unlock(dev_replace);
-   else
-   btrfs_dev_replace_set_lock_blocking(dev_replace);
 
if (dev_replace_is_ongoing && mirror_num == map->num_stripes + 1 &&
!need_full_stripe(op) && dev_replace->tgtdev != NULL) {
@@ -5957,11 +5959,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
}
 out:
if (dev_replace_is_ongoing) {
-   ASSERT(atomic_read(&dev_replace->blocking_readers) > 0);
-   btrfs_dev_replace_read_lock(dev_replace);
-   /* Barrier implied by atomic_dec_and_test */
-   if (atomic_dec_and_test(&dev_replace->blocking_readers))
-   cond_wake_up_nomb(&dev_replace->read_lock_wq);
+   lockdep_assert_held(&dev_replace->rwsem);
+   /* Unlock and let waiting writers proceed */
btrfs_dev_replace_read_unlock(dev_replace);
}
free_extent_map(em);
-- 
2.19.1



[PATCH RESEND v2 1/2] btrfs: quieten warn if replace is canceled at finish

2018-11-20 Thread Anand Jain
When we successfully cancel the replace, its scrub returns -ECANCELED,
which is then passed to btrfs_dev_replace_finishing(); it cleans up based
on the returned scrub status and propagates the same -ECANCELED back to
the parent function. As of now only the user can cancel the replace-scrub,
so it's OK to quieten the warning here.

Signed-off-by: Anand Jain 
---
[fix: quieten spelling]
v1->v2: Use the condition within the WARN_ON()

 fs/btrfs/dev-replace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 1dc8e86546db..9031a362921a 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -497,7 +497,7 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
ret = btrfs_dev_replace_finishing(fs_info, ret);
if (ret == -EINPROGRESS) {
ret = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
-   } else {
+   } else if (ret != -ECANCELED) {
WARN_ON(ret);
}
 
@@ -954,7 +954,7 @@ static int btrfs_dev_replace_kthread(void *data)
  btrfs_device_get_total_bytes(dev_replace->srcdev),
		  &dev_replace->scrub_progress, 0, 1);
ret = btrfs_dev_replace_finishing(fs_info, ret);
-   WARN_ON(ret);
+   WARN_ON(ret && ret != -ECANCELED);
 
	clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
return 0;
-- 
1.8.3.1



[PATCH RESEND v2 2/2] btrfs: user requested replace cancel is not an error

2018-11-20 Thread Anand Jain
As of now only a user-requested replace cancel can cancel the
replace-scrub, so there is no need to log an error for it.

Signed-off-by: Anand Jain 
---
v1->v2: none.

 fs/btrfs/dev-replace.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 9031a362921a..40a0942b4659 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -622,7 +622,8 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
src_device,
tgt_device);
} else {
-   btrfs_err_in_rcu(fs_info,
+   if (scrub_ret != -ECANCELED)
+   btrfs_err_in_rcu(fs_info,
 "btrfs_scrub_dev(%s, %llu, %s) failed %d",
 btrfs_dev_name(src_device),
 src_device->devid,
-- 
1.8.3.1



[PATCH RESEND 0/9 v2] fix warn_on for replace cancel

2018-11-20 Thread Anand Jain
These two patches were sent as part of
   [PATCH 0/9 v2] fix replace-start and replace-cancel racing
before, but as they weren't integrated, I am sending them again.

The patch [1], which is in misc-next, calls btrfs_dev_replace_finishing()
after the replace is canceled, so the ret argument passed to
btrfs_dev_replace_finishing() is -ECANCELED, which is not a real error.
So these patches quieten the warning and error log if it's -ECANCELED.

These should be integrated, otherwise we see the WARN_ON and btrfs_error()
after replace cancel.

[1] 08bdfc3a0611 btrfs: fix use-after-free due to race between replace
start and cancel

Anand Jain (2):
  btrfs: quieten warn if the replace is canceled at finish
  btrfs: user requested replace cancel is not an error


Re: [PATCH] btrfs: qgroup: Skip delayed data ref for reloc trees

2018-11-20 Thread Qu Wenruo



On 2018/11/20 5:11 PM, Nikolay Borisov wrote:
> 
> 
> On 20.11.18 at 11:07, Qu Wenruo wrote:
>>
>>
>> On 2018/11/20 4:51 PM, Nikolay Borisov wrote:
> 
>>> I'm beginning to wonder: should we document
>>> btrfs_add_delayed_data_ref/btrfs_add_tree_ref arguments separately for
>>> each function, or should only the differences be documented - in this
>>> case the newly added root parameter? The rest of the arguments are being
>>> documented at init_delayed_ref_common.
>>
>> You won't be happy with my later plan: it will add a new parameter to
>> btrfs_add_delayed_tree_ref(), and it may not be @root, but some bool.
> 
> You are right, but I'm starting to think that the interface for adding
> those references is wrong, because we shouldn't really need to add more
> and more arguments. All of this feels like piling hack on top of hack
> for some legacy infrastructure which no one bothers to fix at a
> high level.

Couldn't agree more!

But the problem is that such a big change to the low-level delayed-ref
infrastructure could easily introduce new bugs, so there isn't much
driving force to change it.

I'm considering changing the ever-growing parameter list into a
structure as the first cleanup step.
(By the nature of structures and unions, some parameters can easily be
merged into a union, which makes the parameter structure easier to read)

Feel free to share any better suggestion.
(I also hate the current btrfs_inc_extent_ref() and btrfs_inc_ref()
interfaces, but I don't have enough confidence to change them right now)

Thanks,
Qu

> 
> 
>>
>> So I think we may need to document at least the difference.
>>
>> Thanks,
>> Qu
>>
>>>
> 
> 


Re: [PATCH] btrfs: qgroup: Skip delayed data ref for reloc trees

2018-11-20 Thread Nikolay Borisov



On 20.11.18 11:07, Qu Wenruo wrote:
> 
> 
> On 2018/11/20 4:51 PM, Nikolay Borisov wrote:

>> I'm beginning to wonder, should we document
>> btrfs_add_delayed_data_ref/btrfs_add_tree_ref arguments separately for
>> each function, or should only the differences be documented - in this
>> case the newly added root parameter. The rest of the arguments are
>> documented in init_delayed_ref_common.
> 
> You won't be happy with my later plan: it will add a new parameter to
> btrfs_add_delayed_tree_ref(), and it may not be @root, but some bool.

You are right, but I'm starting to think that the interface for adding
those references is wrong, because we shouldn't need to keep adding more
and more arguments. All of this feels like piling hack on top of hack
onto legacy infrastructure which no one bothers to fix from a high
level.


> 
> So I think we may need to document at least the difference.
> 
> Thanks,
> Qu
> 
>>



Re: [PATCH] btrfs: qgroup: Skip delayed data ref for reloc trees

2018-11-20 Thread Qu Wenruo



On 2018/11/20 4:51 PM, Nikolay Borisov wrote:
> 
> 
> On 20.11.18 10:46, Qu Wenruo wrote:
>> Currently we only check @ref_root in btrfs_add_delayed_data_ref() to
>> determine whether a data delayed ref is for a reloc tree.
>>
>> Such a check is insufficient: for relocation we could pass @ref_root
>> as the source file tree, causing qgroup to trace unchanged data extents
>> even when we're only relocating metadata chunks.
> 
>> We could insert a qgroup extent record for the following call trace even
>> when we're only relocating a metadata block group:
>>
>> do_relocation()
>> |- btrfs_cow_block(root=reloc_root)
>>|- update_ref_for_cow(root=reloc_root)
>>   |- __btrfs_mod_ref(root=reloc_root)
>>  |- ref_root = btrfs_header_owner()
>>  |- btrfs_add_delayed_data_ref(ref_root=source_root)
>>
>> And another case when dropping reloc tree:
>>
>> clean_dirty_root()
>>|- btrfs_drop_snapshot(root=reloc_root)
>>|- walk_up_tree(root=reloc_root)
>>   |- walk_up_proc(root=reloc_root)
>>  |- btrfs_dec_ref(root=reloc_root)
>> |- __btrfs_mod_ref(root=reloc_root)
>>|- ref_root = btrfs_header_owner()
>>|- btrfs_add_delayed_data_ref(ref_root=source_root)
>>
>> This patch introduces a @root parameter for
>> btrfs_add_delayed_data_ref(), so that we can determine whether this data
>> extent belongs to a reloc tree or not.
>>
>> This lets us skip data extents which aren't really modified during
>> relocation.
>>
>> For the same real world 4G data 16 snapshots 4K nodesize metadata
>> balance test:
>>                      | v4.20-rc1 + delayed* | w/ patch | diff
>> ---------------------+----------------------+----------+-------
>> relocated extents    | 22773                | 22656    | -0.1%
>> qgroup dirty extents | 122879               | 74316    | -39.5%
>> time (real)          | 23.073s              | 14.971s  | -35.1%
>>
>> *: v4.20-rc1 + delayed subtree scan patchset
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>>  fs/btrfs/delayed-ref.c | 3 ++-
>>  fs/btrfs/delayed-ref.h | 1 +
>>  fs/btrfs/extent-tree.c | 6 +++---
>>  3 files changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
>> index 9301b3ad9217..269bd6ecb8f3 100644
>> --- a/fs/btrfs/delayed-ref.c
>> +++ b/fs/btrfs/delayed-ref.c
>> @@ -798,6 +798,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
>> *trans,
>>   * add a delayed data ref. it's similar to btrfs_add_delayed_tree_ref.
>>   */
>>  int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
>> +   struct btrfs_root *root,
> 
> I'm beginning to wonder, should we document
> btrfs_add_delayed_data_ref/btrfs_add_tree_ref arguments separately for
> each function, or should only the differences be documented - in this
> case the newly added root parameter. The rest of the arguments are
> documented in init_delayed_ref_common.

You won't be happy with my later plan: it will add a new parameter to
btrfs_add_delayed_tree_ref(), and it may not be @root, but some bool.

So I think we may need to document at least the difference.

Thanks,
Qu

> 
>> u64 bytenr, u64 num_bytes,
>> u64 parent, u64 ref_root,
>> u64 owner, u64 offset, u64 reserved, int action,
>> @@ -835,7 +836,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle 
>> *trans,
>>  }
>>  
>>  if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
>> -is_fstree(ref_root)) {
>> +is_fstree(ref_root) && is_fstree(root->root_key.objectid)) {
>>  record = kmalloc(sizeof(*record), GFP_NOFS);
>>  if (!record) {
>>  kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
>> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
>> index 8e20c5cb5404..6c60737e55d6 100644
>> --- a/fs/btrfs/delayed-ref.h
>> +++ b/fs/btrfs/delayed-ref.h
>> @@ -240,6 +240,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
>> *trans,
>> struct btrfs_delayed_extent_op *extent_op,
>> int *old_ref_mod, int *new_ref_mod);
>>  int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
>> +   struct btrfs_root *root,
>> u64 bytenr, u64 num_bytes,
>> u64 parent, u64 ref_root,
>> u64 owner, u64 offset, u64 reserved, int action,
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index a1febf155747..0554d2cc2ea1 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -2046,7 +2046,7 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle 
>> *trans,
>>   BTRFS_ADD_DELAYED_REF, NULL,
>>   &old_ref_mod, &new_ref_mod);
>>  } else {
>> -ret = 

Re: [PATCH] btrfs: qgroup: Skip delayed data ref for reloc trees

2018-11-20 Thread Nikolay Borisov



On 20.11.18 10:46, Qu Wenruo wrote:
> Currently we only check @ref_root in btrfs_add_delayed_data_ref() to
> determine whether a data delayed ref is for a reloc tree.
> 
> Such a check is insufficient: for relocation we could pass @ref_root
> as the source file tree, causing qgroup to trace unchanged data extents
> even when we're only relocating metadata chunks.
> 
> We could insert a qgroup extent record for the following call trace even
> when we're only relocating a metadata block group:
> 
> do_relocation()
> |- btrfs_cow_block(root=reloc_root)
>|- update_ref_for_cow(root=reloc_root)
>   |- __btrfs_mod_ref(root=reloc_root)
>  |- ref_root = btrfs_header_owner()
>  |- btrfs_add_delayed_data_ref(ref_root=source_root)
> 
> And another case when dropping reloc tree:
> 
> clean_dirty_root()
>|- btrfs_drop_snapshot(root=reloc_root)
>|- walk_up_tree(root=reloc_root)
>   |- walk_up_proc(root=reloc_root)
>  |- btrfs_dec_ref(root=reloc_root)
> |- __btrfs_mod_ref(root=reloc_root)
>|- ref_root = btrfs_header_owner()
>|- btrfs_add_delayed_data_ref(ref_root=source_root)
> 
> This patch introduces a @root parameter for
> btrfs_add_delayed_data_ref(), so that we can determine whether this data
> extent belongs to a reloc tree or not.
> 
> This lets us skip data extents which aren't really modified during
> relocation.
> 
> For the same real world 4G data 16 snapshots 4K nodesize metadata
> balance test:
>                      | v4.20-rc1 + delayed* | w/ patch | diff
> ---------------------+----------------------+----------+-------
> relocated extents    | 22773                | 22656    | -0.1%
> qgroup dirty extents | 122879               | 74316    | -39.5%
> time (real)          | 23.073s              | 14.971s  | -35.1%
> 
> *: v4.20-rc1 + delayed subtree scan patchset
> 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/delayed-ref.c | 3 ++-
>  fs/btrfs/delayed-ref.h | 1 +
>  fs/btrfs/extent-tree.c | 6 +++---
>  3 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 9301b3ad9217..269bd6ecb8f3 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -798,6 +798,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
> *trans,
>   * add a delayed data ref. it's similar to btrfs_add_delayed_tree_ref.
>   */
>  int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
> +struct btrfs_root *root,

I'm beginning to wonder, should we document
btrfs_add_delayed_data_ref/btrfs_add_tree_ref arguments separately for
each function, or should only the differences be documented - in this
case the newly added root parameter. The rest of the arguments are
documented in init_delayed_ref_common.

>  u64 bytenr, u64 num_bytes,
>  u64 parent, u64 ref_root,
>  u64 owner, u64 offset, u64 reserved, int action,
> @@ -835,7 +836,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle 
> *trans,
>   }
>  
>   if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
> - is_fstree(ref_root)) {
> + is_fstree(ref_root) && is_fstree(root->root_key.objectid)) {
>   record = kmalloc(sizeof(*record), GFP_NOFS);
>   if (!record) {
>   kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index 8e20c5cb5404..6c60737e55d6 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -240,6 +240,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
> *trans,
>  struct btrfs_delayed_extent_op *extent_op,
>  int *old_ref_mod, int *new_ref_mod);
>  int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
> +struct btrfs_root *root,
>  u64 bytenr, u64 num_bytes,
>  u64 parent, u64 ref_root,
>  u64 owner, u64 offset, u64 reserved, int action,
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index a1febf155747..0554d2cc2ea1 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2046,7 +2046,7 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle 
> *trans,
>BTRFS_ADD_DELAYED_REF, NULL,
>&old_ref_mod, &new_ref_mod);
>   } else {
> - ret = btrfs_add_delayed_data_ref(trans, bytenr,
> + ret = btrfs_add_delayed_data_ref(trans, root, bytenr,
>num_bytes, parent,
>root_objectid, owner, offset,
>0, BTRFS_ADD_DELAYED_REF,
> @@ -7104,7 

[PATCH] btrfs: qgroup: Skip delayed data ref for reloc trees

2018-11-20 Thread Qu Wenruo
Currently we only check @ref_root in btrfs_add_delayed_data_ref() to
determine whether a data delayed ref is for a reloc tree.

Such a check is insufficient: for relocation we could pass @ref_root
as the source file tree, causing qgroup to trace unchanged data extents
even when we're only relocating metadata chunks.

We could insert a qgroup extent record for the following call trace even
when we're only relocating a metadata block group:

do_relocation()
|- btrfs_cow_block(root=reloc_root)
   |- update_ref_for_cow(root=reloc_root)
  |- __btrfs_mod_ref(root=reloc_root)
 |- ref_root = btrfs_header_owner()
 |- btrfs_add_delayed_data_ref(ref_root=source_root)

And another case when dropping reloc tree:

clean_dirty_root()
|- btrfs_drop_snapshot(root=reloc_root)
   |- walk_up_tree(root=reloc_root)
  |- walk_up_proc(root=reloc_root)
 |- btrfs_dec_ref(root=reloc_root)
|- __btrfs_mod_ref(root=reloc_root)
   |- ref_root = btrfs_header_owner()
   |- btrfs_add_delayed_data_ref(ref_root=source_root)

This patch introduces a @root parameter for
btrfs_add_delayed_data_ref(), so that we can determine whether this data
extent belongs to a reloc tree or not.

This lets us skip data extents which aren't really modified during
relocation.

For the same real world 4G data 16 snapshots 4K nodesize metadata
balance test:
                     | v4.20-rc1 + delayed* | w/ patch | diff
---------------------+----------------------+----------+-------
relocated extents    | 22773                | 22656    | -0.1%
qgroup dirty extents | 122879               | 74316    | -39.5%
time (real)          | 23.073s              | 14.971s  | -35.1%

*: v4.20-rc1 + delayed subtree scan patchset

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/delayed-ref.c | 3 ++-
 fs/btrfs/delayed-ref.h | 1 +
 fs/btrfs/extent-tree.c | 6 +++---
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 9301b3ad9217..269bd6ecb8f3 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -798,6 +798,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
*trans,
  * add a delayed data ref. it's similar to btrfs_add_delayed_tree_ref.
  */
 int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
+  struct btrfs_root *root,
   u64 bytenr, u64 num_bytes,
   u64 parent, u64 ref_root,
   u64 owner, u64 offset, u64 reserved, int action,
@@ -835,7 +836,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle 
*trans,
}
 
if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
-   is_fstree(ref_root)) {
+   is_fstree(ref_root) && is_fstree(root->root_key.objectid)) {
record = kmalloc(sizeof(*record), GFP_NOFS);
if (!record) {
kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 8e20c5cb5404..6c60737e55d6 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -240,6 +240,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
*trans,
   struct btrfs_delayed_extent_op *extent_op,
   int *old_ref_mod, int *new_ref_mod);
 int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
+  struct btrfs_root *root,
   u64 bytenr, u64 num_bytes,
   u64 parent, u64 ref_root,
   u64 owner, u64 offset, u64 reserved, int action,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a1febf155747..0554d2cc2ea1 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2046,7 +2046,7 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 BTRFS_ADD_DELAYED_REF, NULL,
 &old_ref_mod, &new_ref_mod);
} else {
-   ret = btrfs_add_delayed_data_ref(trans, bytenr,
+   ret = btrfs_add_delayed_data_ref(trans, root, bytenr,
 num_bytes, parent,
 root_objectid, owner, offset,
 0, BTRFS_ADD_DELAYED_REF,
@@ -7104,7 +7104,7 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans,
 BTRFS_DROP_DELAYED_REF, NULL,
 &old_ref_mod, &new_ref_mod);
} else {
-   ret = btrfs_add_delayed_data_ref(trans, bytenr,
+   ret = btrfs_add_delayed_data_ref(trans, root, bytenr,
 num_bytes, parent,
 root_objectid, 

[PATCH v6 1/3] btrfs: add helper function describe_block_group()

2018-11-20 Thread Anand Jain
Improve on describe_relocation(): add a common helper function to describe
the block groups.

Signed-off-by: Anand Jain 
Reviewed-by: David Sterba 
---
v5->v6: Use () around the args in the define bodies
Use right indent to align the '\'
Use goto out_overflow instead of return
v4.1->v5: Initialize buf[128] to null.
v4->v4.1: Use strcpy(buf, "|NONE"); as in the original
v3->v4: Just pass the full flag name in the define DESCRIBE_FLAG(flag,..),
 so that it can be used at a couple more places.
Rename describe_block_groups() to btrfs_describe_block_groups().
Drop useless return u32.
v3: Born.

 fs/btrfs/relocation.c | 30 +++---
 fs/btrfs/volumes.c| 46 ++
 fs/btrfs/volumes.h|  1 +
 3 files changed, 50 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 924116f654a1..0373b3cc1d36 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4184,37 +4184,13 @@ static struct reloc_control *alloc_reloc_control(void)
 static void describe_relocation(struct btrfs_fs_info *fs_info,
struct btrfs_block_group_cache *block_group)
 {
-   char buf[128];  /* prefixed by a '|' that'll be dropped */
-   u64 flags = block_group->flags;
+   char buf[128] = {'\0'};
 
-   /* Shouldn't happen */
-   if (!flags) {
-   strcpy(buf, "|NONE");
-   } else {
-   char *bp = buf;
-
-#define DESCRIBE_FLAG(f, d) \
-   if (flags & BTRFS_BLOCK_GROUP_##f) { \
-   bp += snprintf(bp, buf - bp + sizeof(buf), "|%s", d); \
-   flags &= ~BTRFS_BLOCK_GROUP_##f; \
-   }
-   DESCRIBE_FLAG(DATA, "data");
-   DESCRIBE_FLAG(SYSTEM,   "system");
-   DESCRIBE_FLAG(METADATA, "metadata");
-   DESCRIBE_FLAG(RAID0,"raid0");
-   DESCRIBE_FLAG(RAID1,"raid1");
-   DESCRIBE_FLAG(DUP,  "dup");
-   DESCRIBE_FLAG(RAID10,   "raid10");
-   DESCRIBE_FLAG(RAID5,"raid5");
-   DESCRIBE_FLAG(RAID6,"raid6");
-   if (flags)
-   snprintf(bp, buf - bp + sizeof(buf), "|0x%llx", flags);
-#undef DESCRIBE_FLAG
-   }
+   btrfs_describe_block_groups(block_group->flags, buf, sizeof(buf));
 
btrfs_info(fs_info,
   "relocating block group %llu flags %s",
-  block_group->key.objectid, buf + 1);
+  block_group->key.objectid, buf);
 }
 
 /*
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f435d397019e..3ab22e7e404e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -123,6 +123,52 @@ const char *get_raid_name(enum btrfs_raid_types type)
return btrfs_raid_array[type].raid_name;
 }
 
+void btrfs_describe_block_groups(u64 bg_flags, char *buf, u32 size_buf)
+{
+   int i;
+   int ret;
+   char *bp = buf;
+   u64 flags = bg_flags;
+   u32 size_bp = size_buf;
+
+   if (!flags) {
+   strcpy(bp, "NONE");
+   return;
+   }
+
+#define DESCRIBE_FLAG(f, d)\
+   do {\
+   if (flags & (f)) {  \
+   ret = snprintf(bp, size_bp, "%s|", (d));\
+   if (ret < 0 || ret >= size_bp)  \
+   goto out_overflow;  \
+   size_bp -= ret; \
+   bp += ret;  \
+   flags &= ~(f);  \
+   }   \
+   } while (0)
+
+   DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_DATA, "data");
+   DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_SYSTEM, "system");
+   DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_METADATA, "metadata");
+   DESCRIBE_FLAG(BTRFS_AVAIL_ALLOC_BIT_SINGLE, "single");
+   for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
+   DESCRIBE_FLAG(btrfs_raid_array[i].bg_flag,
+ btrfs_raid_array[i].raid_name);
+#undef DESCRIBE_FLAG
+
+   if (flags) {
+   ret = snprintf(bp, size_bp, "0x%llx|", flags);
+   size_bp -= ret;
+   }
+
+   if (size_bp < size_buf)
+   buf[size_buf - size_bp - 1] = '\0'; /* remove last | */
+
+out_overflow:
+   return;
+}
+
 static int init_first_rw_device(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info);
 static int btrfs_relocate_sys_chunks(struct btrfs_fs_info *fs_info);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index aefce895e994..3e914effcdf6 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -430,6 +430,7 @@ struct btrfs_device 

[PATCH v6 0/3] btrfs: balance: improve kernel logs

2018-11-20 Thread Anand Jain
v5->v6:
 Mostly the defines non functional changes.
 Use goto instead of return in middle of the define.
 Pls ref individual patches 1/3 and 2/3 for more info.

v4->v5:
 Mainly address David review comment [1].
 [1]
  https://patchwork.kernel.org/patch/10425987/
 pls ref to individual patch 2/3 for details.

v3->v4:
 Pls ref to individual patches.

Based on misc-next.

v2->v3:
  Inspired by describe_relocation(): improve it, make it a helper
  function, and use it to log the balance operations.

Kernel logs are very important for forensic investigation of issues;
these patches make balance logs easy to review.

Anand Jain (3):
  btrfs: add helper function describe_block_group()
  btrfs: balance: add args info during start and resume
  btrfs: balance: add kernel log for end or paused

 fs/btrfs/relocation.c |  30 +--
 fs/btrfs/volumes.c| 216 +-
 fs/btrfs/volumes.h|   1 +
 3 files changed, 217 insertions(+), 30 deletions(-)

-- 
1.8.3.1


[PATCH v6 3/3] btrfs: balance: add kernel log for end or paused

2018-11-20 Thread Anand Jain
Add a kernel log when the balance ends: canceled, completed, or paused.

Signed-off-by: Anand Jain 
---
v5->v6: nothing.
v4->v5: nothing.
v3->v4: nothing.
v2->v3: nothing.
v1->v2: Moved from 2/3 to 3/3
 fs/btrfs/volumes.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index eb18739f19ef..0684f14ff70d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4060,6 +4060,13 @@ int btrfs_balance(struct btrfs_fs_info *fs_info,
ret = __btrfs_balance(fs_info);
 
mutex_lock(&fs_info->balance_mutex);
+   if (ret == -ECANCELED && atomic_read(&fs_info->balance_pause_req))
+   btrfs_info(fs_info, "balance: paused");
+   else if (ret == -ECANCELED && atomic_read(&fs_info->balance_cancel_req))
+   btrfs_info(fs_info, "balance: canceled");
+   else
+   btrfs_info(fs_info, "balance: ended with status: %d", ret);
+
clear_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags);
 
if (bargs) {
-- 
1.8.3.1



[PATCH v6 2/3] btrfs: balance: add args info during start and resume

2018-11-20 Thread Anand Jain
Balance args are important information to review during a system audit,
so this patch adds them to the kernel log.

Example:
->btrfs bal start -f -mprofiles=raid1,convert=single,soft 
-dlimit=10..20,usage=50 /btrfs

kernel: BTRFS info (device sdb): balance: start -f -dusage=50,limit=10..20 
-msoft,profiles=raid1,convert=single -ssoft,profiles=raid1,convert=single

Signed-off-by: Anand Jain 
---
v5->v6: Use () around the args in the define bodies
Use right indent to align the '\'
Use goto out_overflow instead of return (also fixes a mem leak in v5)
Rename CHECK_UPDATE_BP_XXX to CHECK_APPEND_XXX
v4.1->v5: Per David's review comment, the code
   bp += snprintf(bp, buf - bp + size_buf, "soft,");
  is not safe if 'buf - bp + size_buf' accidentally becomes
  negative, so fix this by using the following snprintf:

 #define CHECK_UPDATE_BP_1(a) \
   do { \
   ret = snprintf(bp, size_bp, a); \
   if (ret < 0 || ret >= size_bp) \
   goto out; \
   size_bp -= ret; \
   bp += ret; \
   } while (0)

And initialize the tmp_buf to null.

v4->v4.1: Rename last missed sz to size_buf.

v3->v4: Change log updated with new example.
Log format is changed to almost match with the cli.
snprintf drop %s for inline string.
Print range as =%u..%u instead of min=%umax=%u.
soft is under the args now.
force is printed as -f.

v2->v3: Change log updated.
Make use of describe_block_group() added in 1/4
Drop using strcat instead use snprintf.
Logs string format updated as shown in the example above.

v1->v2: Change log update.
Move adding the logs for balance complete and end to a new patch

 fs/btrfs/volumes.c | 163 -
 1 file changed, 160 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 3ab22e7e404e..eb18739f19ef 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3757,6 +3757,164 @@ static inline int validate_convert_profile(struct 
btrfs_balance_args *bctl_arg,
 (bctl_arg->target & ~allowed)));
 }
 
+static void describe_balance_args(struct btrfs_balance_args *bargs, char *buf,
+u32 size_buf)
+{
+   int ret;
+   u32 size_bp = size_buf;
+   char *bp = buf;
+   u64 flags = bargs->flags;
+   char tmp_buf[128] = {'\0'};
+
+   if (!flags)
+   return;
+
+#define CHECK_APPEND_NOARG(a)  \
+   do {\
+   ret = snprintf(bp, size_bp, (a));   \
+   if (ret < 0 || ret >= size_bp)  \
+   goto out_overflow;  \
+   size_bp -= ret; \
+   bp += ret;  \
+   } while (0)
+
+#define CHECK_APPEND_1ARG(a, v1)   \
+   do {\
+   ret = snprintf(bp, size_bp, (a), (v1)); \
+   if (ret < 0 || ret >= size_bp)  \
+   goto out_overflow;  \
+   size_bp -= ret; \
+   bp += ret;  \
+   } while (0)
+
+#define CHECK_APPEND_2ARG(a, v1, v2)   \
+   do {\
+   ret = snprintf(bp, size_bp, (a), (v1), (v2));   \
+   if (ret < 0 || ret >= size_bp)  \
+   goto out_overflow;  \
+   size_bp -= ret; \
+   bp += ret;  \
+   } while (0)
+
+   if (flags & BTRFS_BALANCE_ARGS_SOFT)
+   CHECK_APPEND_NOARG("soft,");
+
+   if (flags & BTRFS_BALANCE_ARGS_PROFILES) {
+   btrfs_describe_block_groups(bargs->profiles, tmp_buf,
+   sizeof(tmp_buf));
+   CHECK_APPEND_1ARG("profiles=%s,", tmp_buf);
+   }
+
+   if (flags & BTRFS_BALANCE_ARGS_USAGE)
+   CHECK_APPEND_1ARG("usage=%llu,", bargs->usage);
+
+   if (flags & BTRFS_BALANCE_ARGS_USAGE_RANGE)
+   CHECK_APPEND_2ARG("usage=%u..%u,",
+ bargs->usage_min, bargs->usage_max);
+
+   if (flags & BTRFS_BALANCE_ARGS_DEVID)
+   CHECK_APPEND_1ARG("devid=%llu,", bargs->devid);
+
+   if (flags & BTRFS_BALANCE_ARGS_DRANGE)
+   CHECK_APPEND_2ARG("drange=%llu..%llu,",
+ bargs->pstart, bargs->pend);
+
+   if (flags & BTRFS_BALANCE_ARGS_VRANGE)
+   CHECK_APPEND_2ARG("vrange=%llu..%llu,",
+ bargs->vstart, bargs->vend);
+
+   if (flags & 

Re: [PATCH v5 2/3] btrfs: balance: add args info during start and resume

2018-11-20 Thread Anand Jain




On 11/20/2018 01:07 AM, David Sterba wrote:

On Wed, Nov 14, 2018 at 09:17:11PM +0800, Anand Jain wrote:

Balance arg info is an important information to be reviewed for the
system audit. So this patch adds them to the kernel log.

Example:
->btrfs bal start -f -mprofiles=raid1,convert=single,soft 
-dlimit=10..20,usage=50 /btrfs

kernel: BTRFS info (device sdb): balance: start -f -dusage=50,limit=10..20 
-msoft,profiles=raid1,convert=single -ssoft,profiles=raid1,convert=single

Signed-off-by: Anand Jain 
---
v4.1->v5: Per David review comment the code..
bp += snprintf(bp, buf - bp + size_buf, "soft,");
  is not safe if 'buf - bp + size_buf' becomes accidentally
  negative, so fix this by using following snprintf.

  #define CHECK_UPDATE_BP_1(a) \
do { \
ret = snprintf(bp, size_bp, a); \
if (ret < 0 || ret >= size_bp) \
goto out; \
size_bp -= ret; \
bp += ret; \
} while (0)

And initialize the tmp_buf to null.

v4->v4.1: Rename last missed sz to size_buf.

v3->v4: Change log updated with new example.
Log format is changed to almost match with the cli.
snprintf drop %s for inline string.
Print range as =%u..%u instead of min=%umax=%u.
soft is under the args now.
force is printed as -f.

v2->v3: Change log updated.
 Make use of describe_block_group() added in 1/4
 Drop using strcat instead use snprintf.
 Logs string format updated as shown in the example above.

v1->v2: Change log update.
 Move adding the logs for balance complete and end to a new patch

  fs/btrfs/volumes.c | 172 -
  1 file changed, 169 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 12b3d0625c6a..8397f72bdac7 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3754,6 +3754,173 @@ static inline int validate_convert_profile(struct 
btrfs_balance_args *bctl_arg,
 (bctl_arg->target & ~allowed)));
  }
  
+static void describe_balance_args(struct btrfs_balance_args *bargs, char *buf,

+u32 size_buf)
+{
+   int ret;
+   u32 size_bp = size_buf;
+   char *bp = buf;
+   u64 flags = bargs->flags;
+   char tmp_buf[128] = {'\0'};
+
+   if (!flags)
+   return;
+
+#define CHECK_UPDATE_BP_1(a) \


CHECK_UPDATE_BP_NOARG and the others CHECK_UPDATE_BP_1ARG, though I'm
thinking about picking another name for the macro like CHECK_APPEND_2ARG
etc.


 I liked CHECK_APPEND_XXX, so I have renamed
  CHECK_UPDATE_BP_1 to CHECK_APPEND_NOARG
  CHECK_UPDATE_BP_2 to CHECK_APPEND_1ARG
  CHECK_UPDATE_BP_3 to CHECK_APPEND_2ARG

Hope it's better now.


+   do { \
+   ret = snprintf(bp, size_bp, a); \
+   if (ret < 0 || ret >= size_bp) \
+   goto out; \


out_overflow and align \ to the right, please.


 Yep done.


+   size_bp -= ret; \
+   bp += ret; \
+   } while (0)
+
+#define CHECK_UPDATE_BP_2(a, v1) \
+   do { \
+   ret = snprintf(bp, size_bp, a, v1); \
+   if (ret < 0 || ret >= size_bp) \
+   goto out; \
+   size_bp -= ret; \
+   bp += ret; \
+   } while (0)
+
+#define CHECK_UPDATE_BP_3(a, v1, v2) \
+   do { \
+   ret = snprintf(bp, size_bp, a, v1, v2); \
+   if (ret < 0 || ret >= size_bp) \
+   goto out; \
+   size_bp -= ret; \
+   bp += ret; \
+   } while (0)
+
+   if (flags & BTRFS_BALANCE_ARGS_SOFT) {
+   CHECK_UPDATE_BP_1("soft,");
+   }


no { } around single statement


 fixed.


+
+   if (flags & BTRFS_BALANCE_ARGS_PROFILES) {
+   btrfs_describe_block_groups(bargs->profiles, tmp_buf,
+   sizeof(tmp_buf));
+   CHECK_UPDATE_BP_2("profiles=%s,", tmp_buf);
+   }
+
+   if (flags & BTRFS_BALANCE_ARGS_USAGE) {
+   CHECK_UPDATE_BP_2("usage=%llu,", bargs->usage);
+   }
+
+   if (flags & BTRFS_BALANCE_ARGS_USAGE_RANGE) {
+   CHECK_UPDATE_BP_3("usage=%u..%u,",
+ bargs->usage_min, bargs->usage_max);
+   }
+
+   if (flags & BTRFS_BALANCE_ARGS_DEVID) {
+   CHECK_UPDATE_BP_2("devid=%llu,", bargs->devid);
+   }
+
+   if (flags & BTRFS_BALANCE_ARGS_DRANGE) {
+   CHECK_UPDATE_BP_3("drange=%llu..%llu,",
+ bargs->pstart, bargs->pend);
+   }
+
+   if (flags & BTRFS_BALANCE_ARGS_VRANGE) {
+   CHECK_UPDATE_BP_3("vrange=%llu..%llu,",
+ bargs->vstart, bargs->vend);
+   }
+
+   if (flags & BTRFS_BALANCE_ARGS_LIMIT) {
+   CHECK_UPDATE_BP_2("limit=%llu,", bargs->limit);
+ 
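Putting the review feedback together, the v6 helpers might look roughly like this. This is only a sketch, not the merged kernel code: the CHECK_APPEND_* names and the out_overflow label follow the suggestions above, and the ARG_* flag bits and describe_args() function are simplified stand-ins for BTRFS_BALANCE_ARGS_* and describe_balance_args().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sketch of the renamed append macros, with the overflow exit made
 * explicit via a goto label as requested in review. */
#define CHECK_APPEND_NOARG(a)						\
	do {								\
		ret = snprintf(bp, size_bp, (a));			\
		if (ret < 0 || ret >= size_bp)				\
			goto out_overflow;				\
		size_bp -= ret;						\
		bp += ret;						\
	} while (0)

#define CHECK_APPEND_1ARG(a, v1)					\
	do {								\
		ret = snprintf(bp, size_bp, (a), (v1));			\
		if (ret < 0 || ret >= size_bp)				\
			goto out_overflow;				\
		size_bp -= ret;						\
		bp += ret;						\
	} while (0)

/* Hypothetical flag bits standing in for BTRFS_BALANCE_ARGS_*. */
#define ARG_SOFT  (1ULL << 0)
#define ARG_USAGE (1ULL << 1)

static void describe_args(unsigned long long flags,
			  unsigned long long usage,
			  char *buf, unsigned int size_buf)
{
	int ret;
	char *bp = buf;
	unsigned int size_bp = size_buf;

	buf[0] = '\0';
	if (flags & ARG_SOFT)
		CHECK_APPEND_NOARG("soft,");
	if (flags & ARG_USAGE)
		CHECK_APPEND_1ARG("usage=%llu,", usage);
out_overflow:
	/* Drop the trailing comma (or a truncated partial token). */
	if (bp > buf)
		bp[-1] = '\0';
}
```

On truncation the macros jump out before advancing bp, so the buffer ends after the last complete token rather than mid-token.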

Re: [PATCH v5 1/3] btrfs: add helper function describe_block_group()

2018-11-20 Thread Anand Jain



Thanks for the review; more below.

On 11/20/2018 01:02 AM, David Sterba wrote:

On Wed, Nov 14, 2018 at 09:17:10PM +0800, Anand Jain wrote:

Improve on describe_relocation(): add a common helper function to describe
the block groups.

Signed-off-by: Anand Jain 
Reviewed-by: David Sterba 
---
v4.1->v5: Initialize buf[128] to null.
v4->v4.1: Use strcpy(buf, "|NONE"); as in the original
v3->v4: Just pass the full flag name in the define DESCRIBE_FLAG(flag,..),
 so that it can be used at a couple more places.
Rename describe_block_groups() to btrfs_describe_block_groups().
Drop useless return u32.
v3: Born.

  fs/btrfs/relocation.c | 30 +++---
  fs/btrfs/volumes.c| 43 +++
  fs/btrfs/volumes.h|  1 +
  3 files changed, 47 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 924116f654a1..0373b3cc1d36 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4184,37 +4184,13 @@ static struct reloc_control *alloc_reloc_control(void)
  static void describe_relocation(struct btrfs_fs_info *fs_info,
struct btrfs_block_group_cache *block_group)
  {
-   char buf[128];  /* prefixed by a '|' that'll be dropped */
-   u64 flags = block_group->flags;
+   char buf[128] = {'\0'};
  
-	/* Shouldn't happen */
-   if (!flags) {
-   strcpy(buf, "|NONE");
-   } else {
-   char *bp = buf;
-
-#define DESCRIBE_FLAG(f, d) \
-   if (flags & BTRFS_BLOCK_GROUP_##f) { \
-   bp += snprintf(bp, buf - bp + sizeof(buf), "|%s", d); \
-   flags &= ~BTRFS_BLOCK_GROUP_##f; \
-   }
-   DESCRIBE_FLAG(DATA, "data");
-   DESCRIBE_FLAG(SYSTEM,   "system");
-   DESCRIBE_FLAG(METADATA, "metadata");
-   DESCRIBE_FLAG(RAID0,"raid0");
-   DESCRIBE_FLAG(RAID1,"raid1");
-   DESCRIBE_FLAG(DUP,  "dup");
-   DESCRIBE_FLAG(RAID10,   "raid10");
-   DESCRIBE_FLAG(RAID5,"raid5");
-   DESCRIBE_FLAG(RAID6,"raid6");
-   if (flags)
-   snprintf(bp, buf - bp + sizeof(buf), "|0x%llx", flags);
-#undef DESCRIBE_FLAG
-   }
+   btrfs_describe_block_groups(block_group->flags, buf, sizeof(buf));
  
  	btrfs_info(fs_info,

   "relocating block group %llu flags %s",
-  block_group->key.objectid, buf + 1);
+  block_group->key.objectid, buf);
  }
  
  /*

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f435d397019e..12b3d0625c6a 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -123,6 +123,49 @@ const char *get_raid_name(enum btrfs_raid_types type)
return btrfs_raid_array[type].raid_name;
  }
  
+void btrfs_describe_block_groups(u64 bg_flags, char *buf, u32 size_buf)
+{
+   int i;
+   int ret;
+   char *bp = buf;
+   u64 flags = bg_flags;
+   u32 size_bp = size_buf;
+
+   if (!flags) {
+   strcpy(bp, "NONE");
+   return;
+   }
+
+#define DESCRIBE_FLAG(f, d) \


Macro arguments should be in ( ) in the body


 Fixed in v6.


+   do { \
+   if (flags & f) { \
+   ret = snprintf(bp, size_bp, "%s|", d); \
+   if (ret < 0 || ret >= size_bp) \
+   return; \


Ok, this seems to be safe. snprintf will not write more than size_bp and
this will be caught if it overflows.

The return is obscured by the macro, I'd rather make it more visible
that there is a potential change in the code flow. The proposed change:

goto out_overflow;
...

and define out_overflow at the end of function. Same for the other
helper macros.


 makes sense. Thanks for explaining. Fixed in v6.
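The property relied on here is standard C snprintf behaviour: the return value is the length the formatted string would have had, so ret >= size signals truncation while the destination buffer is never overrun and stays NUL-terminated. A minimal standalone illustration (append_u64 is a made-up helper, not kernel code):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format v into a bounded buffer. Returns the number of characters
 * written on success, or -1 if the output was truncated (mirroring
 * the "ret < 0 || ret >= size_bp" check in the macros above). */
static int append_u64(char *bp, unsigned int size_bp,
		      const char *fmt, unsigned long long v)
{
	int ret = snprintf(bp, size_bp, fmt, v);

	if (ret < 0 || ret >= (int)size_bp)
		return -1;
	return ret;
}
```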


Also please align the backslashes of the macro a few tabs to the right.


 Yep done in v6.

Thanks, Anand



+   size_bp -= ret; \
+   bp += ret; \
+   flags &= ~f; \
+   } \
+   } while (0)
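On the "macro arguments should be in ( )" point: without the parentheses, passing a compound expression changes the expansion's meaning because of operator precedence. A small illustration with generic flag names (not the btrfs ones):

```c
#include <assert.h>

#define FLAG_A 0x1UL
#define FLAG_B 0x2UL

/* Unparenthesized body: for f == FLAG_A | FLAG_B this expands to
 * ((flags) & FLAG_A | FLAG_B), and since & binds tighter than |,
 * the result is always nonzero: a false positive. */
#define HAS_FLAG_BAD(flags, f)  ((flags) & f)

/* Parenthesized, as requested in review. */
#define HAS_FLAG_GOOD(flags, f) ((flags) & (f))
```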