date:20180403

Re: Status of RAID5/6

2018-04-03 Thread Goffredo Baroncelli

On 04/04/2018 12:57 AM, Zygo Blaxell wrote:
>> I have to point out that in any case the extent is physically
>> interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if
>> you want to write 128KB, the first half is written in the first disk,
>> the other in the 2nd disk.  If you want to write 96kb, the first 64
>> are written in the first disk, the last part in the 2nd, only on a
>> different BG.
> The "only on a different BG" part implies something expensive, either
> a seek or a new erase page depending on the hardware.  Without that,
> nearby logical blocks are nearby physical blocks as well.

In any case it happens on a different disk

> 
>> So yes there is a fragmentation from a logical point of view; from a
>> physical point of view the data is spread on the disks in any case.

> What matters is the extent-tree point of view.  There is (currently)
> no fragmentation there, even for RAID5/6.  The extent tree is unaware
> of RAID5/6 (to its peril).

Before you pointed out that the non-contiguous block written has an impact on 
performance. I am replaying  that the switching from a different BG happens at 
the stripe-disk boundary, so in any case the block is physically interrupted 
and switched to another disk

However yes: from an extent-tree point of view there will be an increase of 
number extents, because the end of the writing is allocated to another BG (if 
the size is not stripe-boundary)

> If an application does a loop writing 68K then fsync(), the multiple-BG
> solution adds two seeks to read every 68K.  That's expensive if sequential
> read bandwidth is more scarce than free space.

Why you talk about an additional seeks? In any case (even without the 
additional BG) the read happens from another disks

>> * c),d),e) are applied only for the tail of the extent, in case the
> size is less than the stripe size.
> 
> It's only necessary to split an extent if there are no other writes
> in the same transaction that could be combined with the extent tail
> into a single RAID stripe.  As long as everything in the RAID stripe
> belongs to a single transaction, there is no write hole

May be that a more "simpler" optimization would be close the transaction when 
the data reach the stripe boundary... But I suspect that it is not so simple to 
implement.

> Not for d.  Balance doesn't know how to get rid of unreachable blocks
> in extents (it just moves the entire extent around) so after a balance
> the writes would still be rounded up to the stripe size.  Balance would
> never be able to free the rounded-up space.  That space would just be
> gone until the file was overwritten, deleted, or defragged.

If balance is capable to move the extent, why not place one near the other 
during a balance ? The goal is not to limit the the writing of the end of a 
extent, but avoid writing the end of an extent without further data (e.g. the 
gap to the stripe has to be filled in the same transaction)

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Status of RAID5/6

2018-04-03 Thread Chris Murphy

On Tue, Apr 3, 2018 at 11:03 AM, Goffredo Baroncelli  wrote:
> On 04/03/2018 02:31 AM, Zygo Blaxell wrote:
>> On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
>>> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
 On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> I thought that a possible solution is to create BG with different
 number of data disks. E.g. supposing to have a raid 6 system with 6
 disks, where 2 are parity disk; we should allocate 3 BG
> BG #1: 1 data disk, 2 parity disks
> BG #2: 2 data disks, 2 parity disks,
> BG #3: 4 data disks, 2 parity disks
>
> For simplicity, the disk-stripe length is assumed = 4K.
>
> So If you have a write with a length of 4 KB, this should be placed
 in BG#1; if you have a write with a length of 4*3KB, the first 8KB,
 should be placed in in BG#2, then in BG#1.
> This would avoid space wasting, even if the fragmentation will
 increase (but shall the fragmentation matters with the modern solid
 state disks ?).
>>> I don't really see why this would increase fragmentation or waste space.
>
>> Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
>> to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
>> remaining 2 blocks).  It also flips the usual order of "determine size
>> of extent, then allocate space for it" which might require major surgery
>> on the btrfs allocator to implement.
>
> I have to point out that in any case the extent is physically interrupted at 
> the disk-stripe size. Assuming disk-stripe=64KB, if you want to write 128KB, 
> the first half is written in the first disk, the other in the 2nd disk.  If 
> you want to write 96kb, the first 64 are written in the first disk, the last 
> part in the 2nd, only on a different BG.
> So yes there is a fragmentation from a logical point of view; from a physical 
> point of view the data is spread on the disks in any case.
>
> In any case, you are right, we should gather some data, because the 
> performance impact are no so clear.

They're pretty clear, and there's a lot written about small file size
and parity raid performance being shit, no matter the implementation
(md, ZFS, Btrfs, hardware maybe less so just because of all the
caching and extra processing hardware that's dedicated to the task).

The linux-raid@ list is full of optimizations for this that are use
case specific. One of those that often comes up is how badly suited
raid56 are for e.g. mail servers, tons of small file reads and writes,
and all the disk contention that comes up, and it's even worse when
you lose a disk, and even if you're running raid 6 and lose two disk
it's really god awful. It can be unexpectedly a disqualifying setup
without prior testing in that condition: can your workload really be
usable for two or three days in a double degraded state on that raid6?
*shrug*

Parity raid is well suited for full stripe reads and writes, lots of
sequential writes. Ergo a small file is anything less than a full
stripe write. Of course, delayed allocation can end up making for more
full stripe writes. But now you have more RMW which is the real
performance killer, again no matter the raid.

>
> I am not worried abut having different BG; we have problem with these because 
> we never developed tool to handle this issue properly (i.e. a daemon which 
> starts a balance when needed). But I hope that this will be solved in future.
>
> In any case, the all solutions proposed have their trade off:
>
> - a) as is: write hole bug
> - b) variable stripe size (like ZFS): big impact on how btrfs handle the 
> extent. limited waste of space
> - c) logging data before writing: we wrote the data two times in a short time 
> window. Moreover the log area is written several order of magnitude more than 
> the other area; there was some patch around
> - d) rounding the writing to the stripe size: waste of space; simple to 
> implement;
> - e) different BG with different stripe size: limited waste of space; logical 
> fragmentation.

I'd say for sure you're worse off with metadata raid5 vs metadata
raid1. And if there are many devices you might be better off with
metadata raid1 even on a raid6, it's not an absolute certainty you
lose the file system with a 2nd drive failure - it depends on the
device and what chunk copies happen to be on it. But at the least if
you have a script or some warning you can relatively easily rebalance
... HMMM

Actually that should be a test. Single drive degraded raid6 with
metadata raid1, can you do a metadata only balance to force the
missing copy of metadata to be replicated again? In theory this should
be quite fast.

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH typo-fixed] fstests: btrfs: 159 superblock corruption test case

2018-04-03 Thread Eryu Guan

On Thu, Mar 29, 2018 at 06:28:48PM +0800, Anand Jain wrote:
> Verify if the superblock corruption is handled correctly.
> 
> Signed-off-by: Anand Jain 
> ---
>  tests/btrfs/159 | 142 
> 
>  tests/btrfs/159.out |  35 +
>  tests/btrfs/group   |   1 +
>  3 files changed, 178 insertions(+)
>  create mode 100755 tests/btrfs/159
>  create mode 100644 tests/btrfs/159.out
> 
> diff --git a/tests/btrfs/159 b/tests/btrfs/159
> new file mode 100755
> index ..f42ba4b777e7
> --- /dev/null
> +++ b/tests/btrfs/159
> @@ -0,0 +1,142 @@
> +#! /bin/bash
> +# FS QA Test 159
> +#
> +# Test if the superblock corruption is handled correctly.
> +#   - Check and validate superblock csum in mount and scan context
> +#   - Check and validate superblock for all disks in the fs
> +#   - Make sure if the copies are really a copy of the primary superblock
> +#
> +#-
> +# Copyright (c) 2018 Oracle.  All Rights Reserved.
> +# Author: Anand Jain 
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#-
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +status=1 # failure is the default!

Please use './new btrfs' to generate new test template, which defines
variables like 'tmp' and 'here', these variables may not be used in the
test explicitly, but they could be used by some helper functions,
especially the "$tmp" var. (Or please don't remove them from template :)

> +
> +_cleanup()
> +{
> + _scratch_dev_pool_put
> +}
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/module
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch_dev_pool 2
> +_require_loadable_fs_module "btrfs"
> +_require_command "$WIPEFS_PROG" wipefs
> +
> +_scratch_dev_pool_get 2
> +
> +MNT_OPT=$(echo $MOUNT_OPTIONS | cut -d" " -f2-)

I'm not sure about the purpose of this MNT_OPT, and it could be empty
and causes problems in test. Please see below.

> +DEV_GOOD=$(echo $SCRATCH_DEV_POOL | awk '{print $1}')
> +DEV_BAD=$(echo $SCRATCH_DEV_POOL | awk '{print $2}')
> +
> +wipe()
> +{
> + $WIPEFS_PROG -a $DEV_GOOD > /dev/null 2>&1
> + $WIPEFS_PROG -a $DEV_BAD > /dev/null 2>&1
> +}
> +
> +# If copy1 fsid does not match with primary fail the scan thus the mount as 
> well
> +check_copy1_fsid()
> +{
> + local bytenr=67108864
> + echo -e "\\ncheck_copy1_fsid\\n{" | tee -a $seqres.full

Hmm, I don't think the "{ }" block is necessary in .out file, IMHO it
makes the diff context harder to read when test fails. Just echo the
test name would be fine.

> +
> + wipe
> + $MKFS_BTRFS_PROG -fq $DEV_GOOD
> + $MKFS_BTRFS_PROG -fq $DEV_BAD
> + _reload_fs_module "btrfs"
> +
> + dd status=none of=$DEV_BAD if=$DEV_GOOD ibs=1 obs=1 skip=$bytenr 
> seek=$bytenr count=4K|\
> + tee -a $seqres.full
> +
> + _mount -o $MNT_OPT $DEV_BAD $SCRATCH_MNT 2>&1 | _filter_scratch_pool

When $MOUNT_OPTIONS is empty, this mount fails as

+mount: can't find /mnt/scratch in /etc/fstab

And I think we need to umount $SCRATCH_MNT here in case a buggy btrfs
module allowed the mount, as the next test assumes the device is not
mounted anywhere. Otherwise I see failures like:

+ERROR: /dev/loop1 is mounted
+mount failed

> +
> + echo -e "good\\n}" | tee -a $seqres.full
> +}
> +
> +# As in Linux kernel 4.16 if the primary is corrupted mount will fail.
> +# Which might change in the long run.
> +check_primary()
> +{
> + local bytenr=65536
> + echo -e "\\ncheck_primary\\n{" | tee -a $seqres.full
> +
> + wipe
> + _scratch_pool_mkfs "-mraid1 -draid1"
> + _reload_fs_module "btrfs"
> +
> + #To corrupt a disk block, read in hex, write in dec
> + od -j$bytenr -N1 -An -x $DEV_BAD |\
> + dd status=none of=$DEV_BAD ibs=1 obs=1 seek=$bytenr count=1|\
> + tee -a $seqres.full
> + _mount -o $MNT_OPT,device=$DEV_GOOD $DEV_BAD $SCRATCH_MNT 2>&1 | 
> _filter_scratch_pool

Same here, need to umount

Re: System freezes a lot - btrfs trace in dmesg - symptom or cause?

2018-04-03 Thread Qu Wenruo



On 2018年04月04日 05:35, Richard Schwab wrote:
> Hi there,
> 
> I'm currently having some issues with my system freezing a lot, from a
> few seconds up to a few minutes. I haven't figured out yet what might be
> causing this, however, coincidentally some btrfs notices appeared in my
> dmesg after I had a freeze for about 5 minutes. I'm wondering whether
> this is just another symptom of my system having some weird issues or
> whether this might actually be the cause of the freezes I've been
> experiencing.
> 
> $ uname -a
> Linux mypc 4.15.14-1-ARCH #1 SMP PREEMPT Wed Mar 28 17:34:29 UTC 2018
> x86_64 GNU/Linux
> 
> $ btrfs --version
> btrfs-progs v4.15.1
> 
> $ btrfs fi show
> Label: none  uuid: uuid1
>     Total devices 1 FS bytes used 226.60GiB
>     devid    1 size 237.37GiB used 237.31GiB path /dev/mapper/cryptroot256g
> 
> Label: none  uuid: uuid2
>     Total devices 1 FS bytes used 116.61MiB
>     devid    1 size 1.00GiB used 460.00MiB path /dev/sdb2
> 
> $ btrfs fi df /
> Data, single: total=231.28GiB, used=223.66GiB
> System, single: total=32.00MiB, used=48.00KiB
> Metadata, single: total=6.00GiB, used=2.93GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> Partial dmesg is attached as file. Please tell me if you need more than
> this, however, before these lines it only contained the startup process
> (up until 87s).

[10691.693202] btrfs-transacti D0   447  2 0x8000
[10691.693207] Call Trace:
[10691.693221]  ? __schedule+0x24b/0x8c0
[10691.693229]  ? enqueue_entity+0x59b/0xc70
[10691.693234]  schedule+0x32/0x90
[10691.693239]  io_schedule+0x12/0x40
[10691.693247]  __lock_page+0xf8/0x130
[10691.693252]  ? add_to_page_cache_lru+0xe0/0xe0
[10691.693309]  read_extent_buffer_pages+0xf0/0x2f0 [btrfs]
[10691.693337]  ? btrfs_alloc_root+0x30/0x30 [btrfs]
[10691.693365]  btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
[10691.693389]  read_block_for_search.isra.14+0x15a/0x2f0 [btrfs]

This means the tree block pages btrfs trying to read is being locked by
other guy.

And considering later operations also waiting on the same page, it looks
like a deadlock.

Not sure if it's something serious or just bad luck.
But looks like there is something wrong with btrfs tree locks.

Is it reproducible and what's the workload?

And if it's just bad luck, reboot the system and run "btrfs check" on
the fs is highly recommended.

Thanks,
Qu

> 
> Sincerely
> Richard Schwab
> 



signature.asc
Description: OpenPGP digital signature

Re: System freezes a lot - btrfs trace in dmesg - symptom or cause?

2018-04-03 Thread Chris Murphy

FWIW, gmail considers the original post a phishing attempt, and put it
into spam. Also the sending mail server marked the header with a
request that subsequent mail servers fail to forward the email, which
effectively makes it incompatible with mail servers which modify the
email body, making dkim message verification fail. So I suggest either
clearing it up with the mail provider or using a different email for
lists.

This message has a from address in central-intelligence.agency but has
failed central-intelligence.agency's required tests for
authentication.  Learn more

Authentication-Results: mx.google.com;
   dkim=neutral (body hash did not verify)
header.i=@central-intelligence.agency header.s=default
header.b=eqsLMYnQ;
   spf=pass (google.com: best guess record for domain of
linux-btrfs-ow...@vger.kernel.org designates 209.132.180.67 as
permitted sender) smtp.mailfrom=linux-btrfs-ow...@vger.kernel.org;
   dmarc=fail (p=REJECT sp=REJECT dis=QUARANTINE)
header.from=central-intelligence.agency

On Tue, Apr 3, 2018 at 3:35 PM, Richard Schwab
 wrote:
> Hi there,
>
> I'm currently having some issues with my system freezing a lot, from a
> few seconds up to a few minutes. I haven't figured out yet what might be
> causing this, however, coincidentally some btrfs notices appeared in my
> dmesg after I had a freeze for about 5 minutes. I'm wondering whether
> this is just another symptom of my system having some weird issues or
> whether this might actually be the cause of the freezes I've been
> experiencing.

>From the attached text file

10691.693188] INFO: task btrfs-transacti:447 blocked for more than 120 seconds.
[10691.693196]   Tainted: P   O 4.15.14-1-ARCH #1
[10691.693199] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[10691.693202] btrfs-transacti D0   447  2 0x8000
[10691.693207] Call Trace:
[10691.693221]  ? __schedule+0x24b/0x8c0
[10691.693229]  ? enqueue_entity+0x59b/0xc70
[10691.693234]  schedule+0x32/0x90
[10691.693239]  io_schedule+0x12/0x40
[10691.693247]  __lock_page+0xf8/0x130
[10691.693252]  ? add_to_page_cache_lru+0xe0/0xe0
[10691.693309]  read_extent_buffer_pages+0xf0/0x2f0 [btrfs]
[10691.693337]  ? btrfs_alloc_root+0x30/0x30 [btrfs]
[10691.693365]  btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
[10691.693389]  read_block_for_search.isra.14+0x15a/0x2f0 [btrfs]
[10691.693413]  btrfs_search_slot+0x3cd/0xa00 [btrfs]
[10691.693437]  ? block_group_cache_tree_search+0xba/0xd0 [btrfs]
[10691.693460]  btrfs_insert_empty_items+0x66/0xb0 [btrfs]
[10691.693484]  alloc_reserved_file_extent+0x96/0x320 [btrfs]
[10691.693511]  __btrfs_run_delayed_refs+0x908/0x1180 [btrfs]
[10691.693539]  btrfs_run_delayed_refs+0xe5/0x1b0 [btrfs]
[10691.693566]  btrfs_start_dirty_block_groups+0x2cb/0x470 [btrfs]
[10691.693596]  btrfs_commit_transaction+0x117/0x930 [btrfs]
[10691.693626]  ? start_transaction+0x9e/0x420 [btrfs]
[10691.693654]  transaction_kthread+0x17b/0x1a0 [btrfs]
[10691.693684]  ? btrfs_cleanup_transaction+0x570/0x570 [btrfs]
[10691.693687]  kthread+0x113/0x130
[10691.693691]  ? kthread_create_on_node+0x70/0x70
[10691.693695]  ret_from_fork+0x35/0x40
[10691.693703] INFO: task journal-offline:7151 blocked for more than
120 seconds.

Off hand it just looks like noise, it's only an info message. But of
course the hung task is super annoying and maybe makes it worse than
if it were to just crash. Anyway, best to reproduce and issue sysrq +
w and hopefully a dev will get around to giving an opinion on what's
going on. In the meantime you can try reverting to 4.14.32 or try
4.16.0. I'm not seeing anything off hand in 4.15.15 that relates, so I
suspect it still happens with that version. Another thing to try is
booting without the proprietary driver you're using that taints the
kernel. It may very well be unrelated to this problem, but ...

https://www.kernel.org/doc/Documentation/admin-guide/sysrq.rst

Translated it means do this:

sudo -i
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
exit
sudo journalctl -o short-monotonic -k > kernelmsg+sysrqw.log

And then post that log file somewhere like dropbox or whatever, it may
be too big to attach to the list but you can try it.

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Status of RAID5/6

2018-04-03 Thread Zygo Blaxell

On Tue, Apr 03, 2018 at 07:03:06PM +0200, Goffredo Baroncelli wrote:
> On 04/03/2018 02:31 AM, Zygo Blaxell wrote:
> > On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
> >> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
> >>> On 2018-04-02 11:18, Goffredo Baroncelli wrote:
>  I thought that a possible solution is to create BG with different
> >>> number of data disks. E.g. supposing to have a raid 6 system with 6
> >>> disks, where 2 are parity disk; we should allocate 3 BG
>  BG #1: 1 data disk, 2 parity disks
>  BG #2: 2 data disks, 2 parity disks,
>  BG #3: 4 data disks, 2 parity disks
> 
>  For simplicity, the disk-stripe length is assumed = 4K.
> 
>  So If you have a write with a length of 4 KB, this should be placed
> >>> in BG#1; if you have a write with a length of 4*3KB, the first 8KB,
> >>> should be placed in in BG#2, then in BG#1.
>  This would avoid space wasting, even if the fragmentation will
> >>> increase (but shall the fragmentation matters with the modern solid
> >>> state disks ?).
> >> I don't really see why this would increase fragmentation or waste space.
> 
> > Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
> > to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
> > remaining 2 blocks).  It also flips the usual order of "determine size
> > of extent, then allocate space for it" which might require major surgery
> > on the btrfs allocator to implement.
> 
> I have to point out that in any case the extent is physically
> interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if
> you want to write 128KB, the first half is written in the first disk,
> the other in the 2nd disk.  If you want to write 96kb, the first 64
> are written in the first disk, the last part in the 2nd, only on a
> different BG.

The "only on a different BG" part implies something expensive, either
a seek or a new erase page depending on the hardware.  Without that,
nearby logical blocks are nearby physical blocks as well.

> So yes there is a fragmentation from a logical point of view; from a
> physical point of view the data is spread on the disks in any case.

What matters is the extent-tree point of view.  There is (currently)
no fragmentation there, even for RAID5/6.  The extent tree is unaware
of RAID5/6 (to its peril).

ZFS makes its thing-like-the-extent-tree aware of RAID5/6, and it can
put a stripe of any size anywhere.  If we're going to do that in btrfs,
you might as well just do what ZFS does.

OTOH, variable-size block groups give us read-compatibility with old
kernel versions (and write-compatibility for that matter--a kernel that
didn't know about the BG separation would just work but have write hole).

If an application does a loop writing 68K then fsync(), the multiple-BG
solution adds two seeks to read every 68K.  That's expensive if sequential
read bandwidth is more scarce than free space.

> In any case, you are right, we should gather some data, because the
> performance impact are no so clear.
> 
> I am not worried abut having different BG; we have problem with these
> because we never developed tool to handle this issue properly (i.e. a
> daemon which starts a balance when needed). But I hope that this will
> be solved in future.

Balance daemons are easy to the point of being trivial to write in Python.

The balancing itself is quite expensive and invasive:  can't usefully
ionice it, can only abort it on block group boundaries, can't delete
snapshots while it's running.

If balance could be given a vrange that was the size of one extent...then
we could talk about daemons.

> In any case, the all solutions proposed have their trade off:
> 
> - a) as is: write hole bug
> - b) variable stripe size (like ZFS): big impact on how btrfs handle
> the extent. limited waste of space
> - c) logging data before writing: we wrote the data two times in a
> short time window. Moreover the log area is written several order of
> magnitude more than the other area; there was some patch around
> - d) rounding the writing to the stripe size: waste of space; simple
> to implement;
> - e) different BG with different stripe size: limited waste of space;
> logical fragmentation.

Also:

  - f) avoiding writes to partially filled stripes: free space
  fragmentation; simple to implement (ssd_spread does it accidentally)

The difference between d) and f) is that d) allocates the space to the
extent while f) leaves the space unallocated, but skips any free space
fragments smaller than the stripe size when allocating.  f) gets the
space back with a balance (i.e. it is exactly as space-efficient as (a)
after balance).

> * c),d),e) are applied only for the tail of the extent, in case the
size is less than the stripe size.

It's only necessary to split an extent if there are no other writes
in the same transaction that could be combined with the extent tail
into a single RAID stripe.  As long as

System freezes a lot - btrfs trace in dmesg - symptom or cause?

2018-04-03 Thread Richard Schwab

Hi there,

I'm currently having some issues with my system freezing a lot, from a
few seconds up to a few minutes. I haven't figured out yet what might be
causing this, however, coincidentally some btrfs notices appeared in my
dmesg after I had a freeze for about 5 minutes. I'm wondering whether
this is just another symptom of my system having some weird issues or
whether this might actually be the cause of the freezes I've been
experiencing.

$ uname -a
Linux mypc 4.15.14-1-ARCH #1 SMP PREEMPT Wed Mar 28 17:34:29 UTC 2018
x86_64 GNU/Linux

$ btrfs --version
btrfs-progs v4.15.1

$ btrfs fi show
Label: none  uuid: uuid1
    Total devices 1 FS bytes used 226.60GiB
    devid    1 size 237.37GiB used 237.31GiB path /dev/mapper/cryptroot256g

Label: none  uuid: uuid2
    Total devices 1 FS bytes used 116.61MiB
    devid    1 size 1.00GiB used 460.00MiB path /dev/sdb2

$ btrfs fi df /
Data, single: total=231.28GiB, used=223.66GiB
System, single: total=32.00MiB, used=48.00KiB
Metadata, single: total=6.00GiB, used=2.93GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Partial dmesg is attached as file. Please tell me if you need more than
this, however, before these lines it only contained the startup process
(up until 87s).

Sincerely
Richard Schwab
[ 6995.028268] perf: interrupt took too long (2513 > 2500), lowering 
kernel.perf_event_max_sample_rate to 79000
[10513.344255] Chrome_~dThread[6211]: segfault at 0 ip 7f0cf13d6aed sp 
7f0ceb37eb00 error 6 in libxul.so[7f0cf0fe2000+4c14000]
[10513.344553] Chrome_~dThread[3242]: segfault at 0 ip 7f8bcb0d6aed sp 
7f8bc507eb00 error 6
[10513.344557] Chrome_~dThread[3110]: segfault at 0 ip 7f13a50d6aed sp 
7f139f07eb00 error 6 in libxul.so[7f13a4ce2000+4c14000]
[10513.344562]  in libxul.so[7f8bcace2000+4c14000]
[10513.344704] Chrome_~dThread[3202]: segfault at 0 ip 7f089e7d6aed sp 
7f089877eb00 error 6 in libxul.so[7f089e3e2000+4c14000]
[10691.693188] INFO: task btrfs-transacti:447 blocked for more than 120 seconds.
[10691.693196]   Tainted: P   O 4.15.14-1-ARCH #1
[10691.693199] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[10691.693202] btrfs-transacti D0   447  2 0x8000
[10691.693207] Call Trace:
[10691.693221]  ? __schedule+0x24b/0x8c0
[10691.693229]  ? enqueue_entity+0x59b/0xc70
[10691.693234]  schedule+0x32/0x90
[10691.693239]  io_schedule+0x12/0x40
[10691.693247]  __lock_page+0xf8/0x130
[10691.693252]  ? add_to_page_cache_lru+0xe0/0xe0
[10691.693309]  read_extent_buffer_pages+0xf0/0x2f0 [btrfs]
[10691.693337]  ? btrfs_alloc_root+0x30/0x30 [btrfs]
[10691.693365]  btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
[10691.693389]  read_block_for_search.isra.14+0x15a/0x2f0 [btrfs]
[10691.693413]  btrfs_search_slot+0x3cd/0xa00 [btrfs]
[10691.693437]  ? block_group_cache_tree_search+0xba/0xd0 [btrfs]
[10691.693460]  btrfs_insert_empty_items+0x66/0xb0 [btrfs]
[10691.693484]  alloc_reserved_file_extent+0x96/0x320 [btrfs]
[10691.693511]  __btrfs_run_delayed_refs+0x908/0x1180 [btrfs]
[10691.693539]  btrfs_run_delayed_refs+0xe5/0x1b0 [btrfs]
[10691.693566]  btrfs_start_dirty_block_groups+0x2cb/0x470 [btrfs]
[10691.693596]  btrfs_commit_transaction+0x117/0x930 [btrfs]
[10691.693626]  ? start_transaction+0x9e/0x420 [btrfs]
[10691.693654]  transaction_kthread+0x17b/0x1a0 [btrfs]
[10691.693684]  ? btrfs_cleanup_transaction+0x570/0x570 [btrfs]
[10691.693687]  kthread+0x113/0x130
[10691.693691]  ? kthread_create_on_node+0x70/0x70
[10691.693695]  ret_from_fork+0x35/0x40
[10691.693703] INFO: task journal-offline:7151 blocked for more than 120 
seconds.
[10691.693707]   Tainted: P   O 4.15.14-1-ARCH #1
[10691.693709] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[10691.693712] journal-offline D0  7151  1 0x0104
[10691.693715] Call Trace:
[10691.693721]  ? __schedule+0x24b/0x8c0
[10691.693726]  ? preempt_count_add+0x68/0xa0
[10691.693730]  schedule+0x32/0x90
[10691.693759]  wait_current_trans+0xc3/0xf0 [btrfs]
[10691.693763]  ? wait_woken+0x80/0x80
[10691.693792]  start_transaction+0x332/0x420 [btrfs]
[10691.693824]  btrfs_sync_file+0x2b6/0x440 [btrfs]
[10691.693832]  do_fsync+0x38/0x60
[10691.693836]  SyS_fsync+0xc/0x10
[10691.693840]  do_syscall_64+0x74/0x190
[10691.693845]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[10691.693848] RIP: 0033:0x7f7147b5a0ec
[10691.693850] RSP: 002b:7f714191cd30 EFLAGS: 0293 ORIG_RAX: 
004a
[10691.693854] RAX: ffda RBX: 55d002846410 RCX: 7f7147b5a0ec
[10691.693856] RDX:  RSI: 7f71474b6811 RDI: 0015
[10691.693857] RBP: 7f71474b7c60 R08: 7f714191d700 R09: 7f714191d700
[10691.693859] R10: 0011 R11: 0293 R12: 0002
[10691.693861] R13: 7fff8456870f R14: 7fff84568710 R15: 0016
[10691.693866] INFO: task journal-offline:7152 blocked for more than 120 
seconds.
[10691.693869]

Re: Please add 21035965f60b ("bitmap: fix memset optimization on big-endian systems") to the stable tree

2018-04-03 Thread Linus Torvalds

On Tue, Apr 3, 2018 at 11:14 AM, Omar Sandoval  wrote:
> Just wanted to make sure this doesn't get missed because I misspelled
> the stable email address in the patch. It applies to v4.13+.

The "sta...@kernel.org" address for a stable cc is actually better
than stable@vger" imho, because afaik it still matches Greg's
automation (that also picks up on "Fixes:" tags), and it does *not*
cause extra email flurries when people use git-send-email scripts.

iirc, we had some vendor leak a security bug early because they
actually just cc'd everybody - including the stable list - on the
not-yet-publicly-committed change.

Greg - if your automation has changed, and you actually really want
the "vger", let me know. Because I tend to just use
"sta...@kernel.org"

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 14/16] btrfs: drop lock parameter from update_ioctl_balance_args and rename

2018-04-03 Thread David Sterba

The parameter controls locking of the stats part but we can lock it
unconditionally, as this only happens once when balance starts. This is
not performance critical.

Add the prefix for an exported function.

Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.h   |  2 +-
 fs/btrfs/ioctl.c   | 14 +-
 fs/btrfs/volumes.c |  2 +-
 3 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index fda94a264eb7..7ac2e69bfb03 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3258,7 +3258,7 @@ int btrfs_defrag_file(struct inode *inode, struct file 
*file,
  u64 newer_than, unsigned long max_pages);
 void btrfs_get_block_group_info(struct list_head *groups_list,
struct btrfs_ioctl_space_info *space);
-void update_ioctl_balance_args(struct btrfs_fs_info *fs_info, int lock,
+void btrfs_update_ioctl_balance_args(struct btrfs_fs_info *fs_info,
   struct btrfs_ioctl_balance_args *bargs);
 ssize_t btrfs_dedupe_file_range(struct file *src_file, u64 loff, u64 olen,
   struct file *dst_file, u64 dst_loff);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e6f21f30416a..45759f1ea438 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4499,7 +4499,7 @@ static long btrfs_ioctl_logical_to_ino(struct 
btrfs_fs_info *fs_info,
return ret;
 }
 
-void update_ioctl_balance_args(struct btrfs_fs_info *fs_info, int lock,
+void btrfs_update_ioctl_balance_args(struct btrfs_fs_info *fs_info,
   struct btrfs_ioctl_balance_args *bargs)
 {
struct btrfs_balance_control *bctl = fs_info->balance_ctl;
@@ -4517,13 +4517,9 @@ void update_ioctl_balance_args(struct btrfs_fs_info 
*fs_info, int lock,
memcpy(>meta, >meta, sizeof(bargs->meta));
memcpy(>sys, >sys, sizeof(bargs->sys));
 
-   if (lock) {
-   spin_lock(_info->balance_lock);
-   memcpy(>stat, >stat, sizeof(bargs->stat));
-   spin_unlock(_info->balance_lock);
-   } else {
-   memcpy(>stat, >stat, sizeof(bargs->stat));
-   }
+   spin_lock(_info->balance_lock);
+   memcpy(>stat, >stat, sizeof(bargs->stat));
+   spin_unlock(_info->balance_lock);
 }
 
 static long btrfs_ioctl_balance(struct file *file, void __user *arg)
@@ -4709,7 +4705,7 @@ static long btrfs_ioctl_balance_progress(struct 
btrfs_fs_info *fs_info,
goto out;
}
 
-   update_ioctl_balance_args(fs_info, 1, bargs);
+   btrfs_update_ioctl_balance_args(fs_info, bargs);
 
if (copy_to_user(arg, bargs, sizeof(*bargs)))
ret = -EFAULT;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 2956e7b4cb9f..4b7d54abfa34 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3897,7 +3897,7 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
 
if (bargs) {
memset(bargs, 0, sizeof(*bargs));
-   update_ioctl_balance_args(fs_info, 0, bargs);
+   btrfs_update_ioctl_balance_args(fs_info, bargs);
}
 
if ((ret && ret != -ECANCELED && ret != -ENOSPC) ||
-- 
2.16.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 13/16] btrfs: remove redundant read-only check from btrfs_cancel_balance

2018-04-03 Thread David Sterba

Balance cannot be started on a read-only filesystem and will have to
finish/exit before eg. going to read-only via remount. Cancelling does
not need to check for that.

In case the filesystem is forcibly set to read-only after an error,
balance will finish anyway and if the cancel call is too fast it will
just wait for that to happen. Again does not have to check.

Signed-off-by: David Sterba 
---
 fs/btrfs/volumes.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a0420af1fad9..2956e7b4cb9f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4053,9 +4053,6 @@ int btrfs_pause_balance(struct btrfs_fs_info *fs_info)
 
 int btrfs_cancel_balance(struct btrfs_fs_info *fs_info)
 {
-   if (sb_rdonly(fs_info->sb))
-   return -EROFS;
-
mutex_lock(_info->balance_mutex);
if (!fs_info->balance_ctl) {
mutex_unlock(_info->balance_mutex);
-- 
2.16.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 12/16] btrfs: track running balance in a simpler way

2018-04-03 Thread David Sterba

Currently fs_info::balance_running is 0 or 1 and does not use the
semantics of atomics. The pause and cancel check for 0, that can happen
only after __btrfs_balance exits for whatever reason.

Parallel calls to balance ioctl may enter btrfs_ioctl_balance multiple
times but will block on the balance_mutex that protects the
fs_info::flags bit.

Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.h   |  6 +-
 fs/btrfs/disk-io.c |  1 -
 fs/btrfs/ioctl.c   |  6 +++---
 fs/btrfs/volumes.c | 18 ++
 4 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 011cb9a8a967..fda94a264eb7 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -726,6 +726,11 @@ struct btrfs_delayed_root;
  * (device replace, resize, device add/delete, balance)
  */
 #define BTRFS_FS_EXCL_OP   16
+/*
+ * Indicate that balance has been set up from the ioctl and is in the main
+ * phase. The fs_info::balance_ctl is initialized.
+ */
+#define BTRFS_FS_BALANCE_RUNNING   17
 
 struct btrfs_fs_info {
u8 fsid[BTRFS_FSID_SIZE];
@@ -991,7 +996,6 @@ struct btrfs_fs_info {
/* restriper state */
spinlock_t balance_lock;
struct mutex balance_mutex;
-   atomic_t balance_running;
atomic_t balance_pause_req;
atomic_t balance_cancel_req;
struct btrfs_balance_control *balance_ctl;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c0482ecea11f..b62559dfb053 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2168,7 +2168,6 @@ static void btrfs_init_balance(struct btrfs_fs_info 
*fs_info)
 {
spin_lock_init(_info->balance_lock);
mutex_init(_info->balance_mutex);
-   atomic_set(_info->balance_running, 0);
atomic_set(_info->balance_pause_req, 0);
atomic_set(_info->balance_cancel_req, 0);
fs_info->balance_ctl = NULL;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 410037281f2a..e6f21f30416a 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4506,7 +4506,7 @@ void update_ioctl_balance_args(struct btrfs_fs_info 
*fs_info, int lock,
 
bargs->flags = bctl->flags;
 
-   if (atomic_read(_info->balance_running))
+   if (test_bit(BTRFS_FS_BALANCE_RUNNING, _info->flags))
bargs->state |= BTRFS_BALANCE_STATE_RUNNING;
if (atomic_read(_info->balance_pause_req))
bargs->state |= BTRFS_BALANCE_STATE_PAUSE_REQ;
@@ -4558,7 +4558,7 @@ static long btrfs_ioctl_balance(struct file *file, void 
__user *arg)
mutex_lock(_info->balance_mutex);
if (fs_info->balance_ctl) {
/* this is either (2) or (3) */
-   if (!atomic_read(_info->balance_running)) {
+   if (!test_bit(BTRFS_FS_BALANCE_RUNNING, _info->flags)) {
mutex_unlock(_info->balance_mutex);
/*
 * Lock released to allow other waiters to continue,
@@ -4567,7 +4567,7 @@ static long btrfs_ioctl_balance(struct file *file, void 
__user *arg)
mutex_lock(_info->balance_mutex);
 
if (fs_info->balance_ctl &&
-   !atomic_read(_info->balance_running)) {
+   !test_bit(BTRFS_FS_BALANCE_RUNNING, 
_info->flags)) {
/* this is (3) */
need_unlock = false;
goto locked;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d1e3486343ae..a0420af1fad9 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3886,13 +3886,14 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
spin_unlock(_info->balance_lock);
}
 
-   atomic_inc(_info->balance_running);
+   ASSERT(!test_bit(BTRFS_FS_BALANCE_RUNNING, _info->flags));
+   set_bit(BTRFS_FS_BALANCE_RUNNING, _info->flags);
mutex_unlock(_info->balance_mutex);
 
ret = __btrfs_balance(fs_info);
 
mutex_lock(_info->balance_mutex);
-   atomic_dec(_info->balance_running);
+   clear_bit(BTRFS_FS_BALANCE_RUNNING, _info->flags);
 
if (bargs) {
memset(bargs, 0, sizeof(*bargs));
@@ -4031,16 +4032,16 @@ int btrfs_pause_balance(struct btrfs_fs_info *fs_info)
return -ENOTCONN;
}
 
-   if (atomic_read(_info->balance_running)) {
+   if (test_bit(BTRFS_FS_BALANCE_RUNNING, _info->flags)) {
atomic_inc(_info->balance_pause_req);
mutex_unlock(_info->balance_mutex);
 
wait_event(fs_info->balance_wait_q,
-  atomic_read(_info->balance_running) == 0);
+  !test_bit(BTRFS_FS_BALANCE_RUNNING, 
_info->flags));
 
mutex_lock(_info->balance_mutex);
/* we are good with balance_ctl ripped off from under us */
-   BUG_ON(atomic_read(_info->balance_running));

[PATCH 16/16] btrfs: open code set_balance_control

2018-04-03 Thread David Sterba

The helper is quite simple and I'd like to see the locking in the
caller.

Signed-off-by: David Sterba 
---
 fs/btrfs/volumes.c | 20 
 1 file changed, 4 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e93c7ba44d06..db9f72c6af85 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3178,21 +3178,6 @@ static void update_balance_args(struct 
btrfs_balance_control *bctl)
}
 }
 
-/*
- * Should be called with balance mutex held to protect against checking the
- * balance status or progress. Same goes for reset_balance_state.
- */
-static void set_balance_control(struct btrfs_balance_control *bctl)
-{
-   struct btrfs_fs_info *fs_info = bctl->fs_info;
-
-   BUG_ON(fs_info->balance_ctl);
-
-   spin_lock(_info->balance_lock);
-   fs_info->balance_ctl = bctl;
-   spin_unlock(_info->balance_lock);
-}
-
 /*
  * Clear the balance status in fs_info and delete the balance item from disk.
  */
@@ -3878,7 +3863,10 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
 
if (!(bctl->flags & BTRFS_BALANCE_RESUME)) {
BUG_ON(ret == -EEXIST);
-   set_balance_control(bctl);
+   BUG_ON(fs_info->balance_ctl);
+   spin_lock(_info->balance_lock);
+   fs_info->balance_ctl = bctl;
+   spin_unlock(_info->balance_lock);
} else {
BUG_ON(ret != -EEXIST);
spin_lock(_info->balance_lock);
-- 
2.16.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 11/16] btrfs: kill btrfs_fs_info::volume_mutex

2018-04-03 Thread David Sterba

Mutual exclusion of device add/rm and balance was done by the volume
mutex up to version 3.7. The commit 5ac00addc7ac091109 ("Btrfs: disallow
mutually exclusive admin operations from user mode") added a bit that
essentially tracked the same information.

The status bit has an advantage over a mutex that it can be set without
restrictions of function context, so it started to be used in the
mount-time resuming of balance or device replace.

But we don't really need to track the same information in two ways.

1) After the previous cleanups, the main ioctl handlers for
   add/del/resize copy the EXCL_OP bit next to the volume mutex, here
   it's clearly safe.

2) Resuming balance during mount or after rw remount will set only the
   EXCL_OP bit and the volume_mutex is held in the kernel thread that
   calls btrfs_balance.

3) Resuming device replace during mount or after rw remount is done
   after balance and is excluded by the EXCL_OP bit. It does not take
   the volume_mutex at all and completely relies on the EXCL_OP bit.

4) The resuming of balance and dev-replace cannot hapen at the same time
   as the ioctls cannot be started in parallel. Nevertheless, a crafted
   image could trigger that and a warning is printed.

5) Balance is normally excluded by EXCL_OP and also uses own mutex to
   protect against concurrent access to its status data. There's some
   trickery to maintain the right lock nesting in case we need to
   reexamine the status in btrfs_ioctl_balance. The volume_mutex is
   removed and the unlock/lock sequence is left in place as we might
   expect other waiters to proceed.

6) Similar to 5, the unlock/lock sequence is kept in
   btrfs_cancel_balance to allow waiters to continue.

Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.h   |  1 -
 fs/btrfs/disk-io.c |  1 -
 fs/btrfs/extent-tree.c |  2 +-
 fs/btrfs/ioctl.c   | 17 -
 fs/btrfs/volumes.c | 36 ++--
 5 files changed, 15 insertions(+), 42 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0eb55825862a..011cb9a8a967 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -826,7 +826,6 @@ struct btrfs_fs_info {
struct mutex transaction_kthread_mutex;
struct mutex cleaner_mutex;
struct mutex chunk_mutex;
-   struct mutex volume_mutex;
 
/*
 * this is taken to make sure we don't set block groups ro after
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 07b5e6f7df67..c0482ecea11f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2605,7 +2605,6 @@ int open_ctree(struct super_block *sb,
mutex_init(_info->chunk_mutex);
mutex_init(_info->transaction_kthread_mutex);
mutex_init(_info->cleaner_mutex);
-   mutex_init(_info->volume_mutex);
mutex_init(_info->ro_block_group_mutex);
init_rwsem(_info->commit_root_sem);
init_rwsem(_info->cleanup_work_sem);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f30548d7e0d2..90d28a3727c6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4122,7 +4122,7 @@ static void set_avail_alloc_bits(struct btrfs_fs_info 
*fs_info, u64 flags)
  * returns target flags in extended format or 0 if restripe for this
  * chunk_type is not in progress
  *
- * should be called with either volume_mutex or balance_lock held
+ * should be called with balance_lock held
  */
 static u64 get_restripe_target(struct btrfs_fs_info *fs_info, u64 flags)
 {
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2f172122b63d..410037281f2a 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1470,7 +1470,6 @@ static noinline int btrfs_ioctl_resize(struct file *file,
return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
}
 
-   mutex_lock(_info->volume_mutex);
vol_args = memdup_user(arg, sizeof(*vol_args));
if (IS_ERR(vol_args)) {
ret = PTR_ERR(vol_args);
@@ -1578,7 +1577,6 @@ static noinline int btrfs_ioctl_resize(struct file *file,
 out_free:
kfree(vol_args);
 out:
-   mutex_unlock(_info->volume_mutex);
clear_bit(BTRFS_FS_EXCL_OP, _info->flags);
mnt_drop_write_file(file);
return ret;
@@ -2626,7 +2624,6 @@ static long btrfs_ioctl_add_dev(struct btrfs_fs_info 
*fs_info, void __user *arg)
if (test_and_set_bit(BTRFS_FS_EXCL_OP, _info->flags))
return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
 
-   mutex_lock(_info->volume_mutex);
vol_args = memdup_user(arg, sizeof(*vol_args));
if (IS_ERR(vol_args)) {
ret = PTR_ERR(vol_args);
@@ -2641,7 +2638,6 @@ static long btrfs_ioctl_add_dev(struct btrfs_fs_info 
*fs_info, void __user *arg)
 
kfree(vol_args);
 out:
-   mutex_unlock(_info->volume_mutex);
clear_bit(BTRFS_FS_EXCL_OP, _info->flags);
return ret;
 }
@@ -2674,7 +2670,6 @@ static long btrfs_ioctl_rm_dev_v2(struct file *file, void

[PATCH 15/16] btrfs: use mutex in btrfs_resume_balance_async

2018-04-03 Thread David Sterba

While the spinlock does not cause problems, using the mutex is more
correct and consistent with others. The global status of balance is eg.
checked from btrfs_pause_balance or btrfs_cancel_balance with mutex.

Resuming balance happens during mount or ro->rw remount. In the former
case, no other user of the balance_ctl exists, in the latter, balance
cannot run until the ro/rw transition is finished.

Signed-off-by: David Sterba 
---
 fs/btrfs/volumes.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 4b7d54abfa34..e93c7ba44d06 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3938,12 +3938,12 @@ int btrfs_resume_balance_async(struct btrfs_fs_info 
*fs_info)
 {
struct task_struct *tsk;
 
-   spin_lock(_info->balance_lock);
+   mutex_lock(_info->balance_mutex);
if (!fs_info->balance_ctl) {
-   spin_unlock(_info->balance_lock);
+   mutex_unlock(_info->balance_mutex);
return 0;
}
-   spin_unlock(_info->balance_lock);
+   mutex_unlock(_info->balance_mutex);
 
if (btrfs_test_opt(fs_info, SKIP_BALANCE)) {
btrfs_info(fs_info, "force skipping balance");
-- 
2.16.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 09/16] btrfs: cleanup helpers that reset balance state

2018-04-03 Thread David Sterba

The function __cancel_balance name is confusing with the cancel
operation of balance and it really resets the state of balance back to
zero. The unset_balance_control helper is called only from one place and
simple enough to be inlined.

Signed-off-by: David Sterba 
---
 fs/btrfs/ioctl.c   |  8 
 fs/btrfs/volumes.c | 27 ---
 2 files changed, 16 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 582bde5b7eda..2f172122b63d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4653,10 +4653,10 @@ static long btrfs_ioctl_balance(struct file *file, void 
__user *arg)
 
 do_balance:
/*
-* Ownership of bctl and filesystem flag BTRFS_FS_EXCL_OP
-* goes to to btrfs_balance.  bctl is freed in __cancel_balance,
-* or, if restriper was paused all the way until unmount, in
-* free_fs_info.  The flag should be cleared after __cancel_balance.
+* Ownership of bctl and filesystem flag BTRFS_FS_EXCL_OP goes to to
+* btrfs_balance.  bctl is freed in reset_balance_state, or, if
+* restriper was paused all the way until unmount, in free_fs_info.
+* The flag should be cleared after reset_balance_state.
 */
need_unlock = false;
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 843982a2cbdb..b073ab4c3c70 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3192,7 +3192,7 @@ static void update_balance_args(struct 
btrfs_balance_control *bctl)
 /*
  * Should be called with both balance and volume mutexes held to
  * serialize other volume operations (add_dev/rm_dev/resize) with
- * restriper.  Same goes for unset_balance_control.
+ * restriper.  Same goes for reset_balance_state.
  */
 static void set_balance_control(struct btrfs_balance_control *bctl)
 {
@@ -3205,9 +3205,13 @@ static void set_balance_control(struct 
btrfs_balance_control *bctl)
spin_unlock(_info->balance_lock);
 }
 
-static void unset_balance_control(struct btrfs_fs_info *fs_info)
+/*
+ * Clear the balance status in fs_info and delete the balance item from disk.
+ */
+static void reset_balance_state(struct btrfs_fs_info *fs_info)
 {
struct btrfs_balance_control *bctl = fs_info->balance_ctl;
+   int ret;
 
BUG_ON(!fs_info->balance_ctl);
 
@@ -3216,6 +3220,9 @@ static void unset_balance_control(struct btrfs_fs_info 
*fs_info)
spin_unlock(_info->balance_lock);
 
kfree(bctl);
+   ret = del_balance_item(fs_info);
+   if (ret)
+   btrfs_handle_fs_error(fs_info, ret, NULL);
 }
 
 /*
@@ -3752,16 +3759,6 @@ static inline int balance_need_close(struct 
btrfs_fs_info *fs_info)
 atomic_read(_info->balance_cancel_req) == 0);
 }
 
-static void __cancel_balance(struct btrfs_fs_info *fs_info)
-{
-   int ret;
-
-   unset_balance_control(fs_info);
-   ret = del_balance_item(fs_info);
-   if (ret)
-   btrfs_handle_fs_error(fs_info, ret, NULL);
-}
-
 /* Non-zero return value signifies invalidity */
 static inline int validate_convert_profile(struct btrfs_balance_args *bctl_arg,
u64 allowed)
@@ -3916,7 +3913,7 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
 
if ((ret && ret != -ECANCELED && ret != -ENOSPC) ||
balance_need_close(fs_info)) {
-   __cancel_balance(fs_info);
+   reset_balance_state(fs_info);
clear_bit(BTRFS_FS_EXCL_OP, _info->flags);
}
 
@@ -4095,13 +4092,13 @@ int btrfs_cancel_balance(struct btrfs_fs_info *fs_info)
   atomic_read(_info->balance_running) == 0);
mutex_lock(_info->balance_mutex);
} else {
-   /* __cancel_balance needs volume_mutex */
+   /* reset_balance_state needs volume_mutex */
mutex_unlock(_info->balance_mutex);
mutex_lock(_info->volume_mutex);
mutex_lock(_info->balance_mutex);
 
if (fs_info->balance_ctl) {
-   __cancel_balance(fs_info);
+   reset_balance_state(fs_info);
clear_bit(BTRFS_FS_EXCL_OP, _info->flags);
}
 
-- 
2.16.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 10/16] btrfs: remove wrong use of volume_mutex from btrfs_dev_replace_start

2018-04-03 Thread David Sterba

The volume mutex does not protect against anything in this case, the
comment about scrub is right but not related to locking and looks
confusing. The comment in btrfs_find_device_missing_or_by_path is wrong
and confusing too.

The device_list_mutex is not held here to protect device lookup, but in
this case device replace cannot run in parallel with device removal (due
to exclusive op protection), so we don't need further locking here.

Signed-off-by: David Sterba 
---
 fs/btrfs/dev-replace.c | 7 +--
 fs/btrfs/volumes.c | 4 
 2 files changed, 1 insertion(+), 10 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 346bd460f8e7..ba011af9b0d2 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -426,18 +426,13 @@ int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
struct btrfs_device *tgt_device = NULL;
struct btrfs_device *src_device = NULL;
 
-   /* the disk copy procedure reuses the scrub code */
-   mutex_lock(_info->volume_mutex);
ret = btrfs_find_device_by_devspec(fs_info, srcdevid,
srcdev_name, _device);
-   if (ret) {
-   mutex_unlock(_info->volume_mutex);
+   if (ret)
return ret;
-   }
 
ret = btrfs_init_dev_replace_tgtdev(fs_info, tgtdev_name,
src_device, _device);
-   mutex_unlock(_info->volume_mutex);
if (ret)
return ret;
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b073ab4c3c70..0ae29cd69363 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2198,10 +2198,6 @@ int btrfs_find_device_missing_or_by_path(struct 
btrfs_fs_info *fs_info,
struct btrfs_device *tmp;
 
devices = _info->fs_devices->devices;
-   /*
-* It is safe to read the devices since the volume_mutex
-* is held by the caller.
-*/
list_for_each_entry(tmp, devices, dev_list) {
if (test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
>dev_state) && !tmp->bdev) {
-- 
2.16.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 07/16] btrfs: add proper safety check before resuming dev-replace

2018-04-03 Thread David Sterba

The device replace is paused by unmount or read only remount, and
resumed on next mount or write remount.

The exclusive status should be checked properly as it's a global
invariant and we must not allow 2 operations run. In this case, the
balance can be also paused and resumed under same conditions. It's
always checked first so dev-replace could see the EXCL_OP already taken,
BUT, the ioctl would never let start both at the same time.

Replace the WARN_ON with message and return 0, indicating no error as
this is purely theoretical and the user will be informed. Resolving that
manually should be possible by waiting for the other operation to finish
or cancel the paused state.

Signed-off-by: David Sterba 
---
 fs/btrfs/dev-replace.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 7a87ffad041e..346bd460f8e7 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -908,7 +908,17 @@ int btrfs_resume_dev_replace_async(struct btrfs_fs_info 
*fs_info)
}
btrfs_dev_replace_write_unlock(dev_replace);
 
-   WARN_ON(test_and_set_bit(BTRFS_FS_EXCL_OP, _info->flags));
+   /*
+* This could collide with a paused balance, but the exclusive op logic
+* should never allow both to start and pause. We don't want to allow
+* dev-replace to start anyway.
+*/
+   if (test_and_set_bit(BTRFS_FS_EXCL_OP, _info->flags)) {
+   btrfs_info(fs_info,
+   "cannot resume dev-replace, other exclusive operation running");
+   return 0;
+   }
+
task = kthread_run(btrfs_dev_replace_kthread, fs_info, "btrfs-devrepl");
return PTR_ERR_OR_ZERO(task);
 }
-- 
2.16.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 08/16] btrfs: add sanity check when resuming balance after mount

2018-04-03 Thread David Sterba

Replace a WARN_ON with a proper check and message in case something goes
really wrong and resumed balance cannot set up its exclusive status.
The check is a user friendly assertion, I don't expect to ever happen
under normal circumstances.

Also document that the paused balance starts here and owns the exclusive
op status.

Signed-off-by: David Sterba 
---
 fs/btrfs/volumes.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index eb78c1d0ce2b..843982a2cbdb 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3982,6 +3982,20 @@ int btrfs_recover_balance(struct btrfs_fs_info *fs_info)
struct btrfs_key key;
int ret;
 
+   /*
+* This should never happen, as the paused balance state is recovered
+* during mount without any chance of other exclusive ops to collide.
+* Let it fail early and do not continue mount.
+*
+* Otherwise, this gives the exclusive op status to balance and keeps
+* in paused state until user intervention (cancel or umount).
+*/
+   if (test_and_set_bit(BTRFS_FS_EXCL_OP, _info->flags)) {
+   btrfs_err(fs_info,
+   "cannot set exclusive op status to resume balance");
+   return -EINVAL;
+   }
+
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
@@ -4018,8 +4032,6 @@ int btrfs_recover_balance(struct btrfs_fs_info *fs_info)
btrfs_balance_sys(leaf, item, _bargs);
btrfs_disk_balance_args_to_cpu(>sys, _bargs);
 
-   WARN_ON(test_and_set_bit(BTRFS_FS_EXCL_OP, _info->flags));
-
mutex_lock(_info->volume_mutex);
mutex_lock(_info->balance_mutex);
 
-- 
2.16.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 02/16] btrfs: make success path out of btrfs_init_dev_replace_tgtdev more clear

2018-04-03 Thread David Sterba

This is a preparatory cleanup that will make clear that the only
successful way out of btrfs_init_dev_replace_tgtdev will also set the
device_out to a valid pointer. With this guarantee, the callers can be
simplified.

Signed-off-by: David Sterba 
---
 fs/btrfs/dev-replace.c | 1 -
 fs/btrfs/volumes.c | 8 +++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index c0397cd2a3ba..e3b8a79c1665 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -370,7 +370,6 @@ int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
dev_replace->cont_reading_from_srcdev_mode = read_src;
WARN_ON(!src_device);
dev_replace->srcdev = src_device;
-   WARN_ON(!tgt_device);
dev_replace->tgtdev = tgt_device;
 
btrfs_info_in_rcu(fs_info,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 93f8f17cacca..8086c5687b72 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2592,6 +2592,12 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, 
const char *device_path
return ret;
 }
 
+/*
+ * Initialize a new device for device replace target from a given source dev
+ * and path.
+ *
+ * Return 0 and new device in @device_out, otherwise return < 0
+ */
 int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
  const char *device_path,
  struct btrfs_device *srcdev,
@@ -2678,7 +2684,7 @@ int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info 
*fs_info,
mutex_unlock(_info->fs_devices->device_list_mutex);
 
*device_out = device;
-   return ret;
+   return 0;
 
 error:
blkdev_put(bdev, FMODE_EXCL);
-- 
2.16.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 01/16] btrfs: squeeze btrfs_dev_replace_continue_on_mount to its caller

2018-04-03 Thread David Sterba

The function is called once and is fairly small, we can merge it with
the caller.

Signed-off-by: David Sterba 
---
 fs/btrfs/dev-replace.c | 16 +++-
 1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 0d203633bb96..c0397cd2a3ba 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -45,8 +45,6 @@ static void btrfs_dev_replace_update_device_in_mapping_tree(
struct btrfs_device *srcdev,
struct btrfs_device *tgtdev);
 static int btrfs_dev_replace_kthread(void *data);
-static int btrfs_dev_replace_continue_on_mount(struct btrfs_fs_info *fs_info);
-
 
 int btrfs_init_dev_replace(struct btrfs_fs_info *fs_info)
 {
@@ -822,6 +820,7 @@ static int btrfs_dev_replace_kthread(void *data)
struct btrfs_fs_info *fs_info = data;
struct btrfs_dev_replace *dev_replace = _info->dev_replace;
u64 progress;
+   int ret;
 
progress = btrfs_dev_replace_progress(fs_info);
progress = div_u64(progress, 10);
@@ -832,23 +831,14 @@ static int btrfs_dev_replace_kthread(void *data)
btrfs_dev_name(dev_replace->tgtdev),
(unsigned int)progress);
 
-   btrfs_dev_replace_continue_on_mount(fs_info);
-   clear_bit(BTRFS_FS_EXCL_OP, _info->flags);
-
-   return 0;
-}
-
-static int btrfs_dev_replace_continue_on_mount(struct btrfs_fs_info *fs_info)
-{
-   struct btrfs_dev_replace *dev_replace = _info->dev_replace;
-   int ret;
-
ret = btrfs_scrub_dev(fs_info, dev_replace->srcdev->devid,
  dev_replace->committed_cursor_left,
  btrfs_device_get_total_bytes(dev_replace->srcdev),
  _replace->scrub_progress, 0, 1);
ret = btrfs_dev_replace_finishing(fs_info, ret);
WARN_ON(ret);
+
+   clear_bit(BTRFS_FS_EXCL_OP, _info->flags);
return 0;
 }
 
-- 
2.16.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 05/16] btrfs: move volume_mutex to callers of btrfs_rm_device

2018-04-03 Thread David Sterba

Move locking and unlocking next to the BTRFS_FS_EXCL_OP bit manipulation
so it's obvious that the two happen at the same time.

Signed-off-by: David Sterba 
---
 fs/btrfs/ioctl.c   | 4 
 fs/btrfs/volumes.c | 2 --
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index ac85e07f567b..b93dea445802 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2674,6 +2674,7 @@ static long btrfs_ioctl_rm_dev_v2(struct file *file, void 
__user *arg)
ret = BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
goto out;
}
+   mutex_lock(_info->volume_mutex);
 
if (vol_args->flags & BTRFS_DEVICE_SPEC_BY_ID) {
ret = btrfs_rm_device(fs_info, NULL, vol_args->devid);
@@ -2681,6 +2682,7 @@ static long btrfs_ioctl_rm_dev_v2(struct file *file, void 
__user *arg)
vol_args->name[BTRFS_SUBVOL_NAME_MAX] = '\0';
ret = btrfs_rm_device(fs_info, vol_args->name, 0);
}
+   mutex_unlock(_info->volume_mutex);
clear_bit(BTRFS_FS_EXCL_OP, _info->flags);
 
if (!ret) {
@@ -2716,6 +2718,7 @@ static long btrfs_ioctl_rm_dev(struct file *file, void 
__user *arg)
ret = BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
goto out_drop_write;
}
+   mutex_lock(_info->volume_mutex);
 
vol_args = memdup_user(arg, sizeof(*vol_args));
if (IS_ERR(vol_args)) {
@@ -2730,6 +2733,7 @@ static long btrfs_ioctl_rm_dev(struct file *file, void 
__user *arg)
btrfs_info(fs_info, "disk deleted %s", vol_args->name);
kfree(vol_args);
 out:
+   mutex_unlock(_info->volume_mutex);
clear_bit(BTRFS_FS_EXCL_OP, _info->flags);
 out_drop_write:
mnt_drop_write_file(file);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index df2d6d06e014..e0bd181dc9e0 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1932,7 +1932,6 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const 
char *device_path,
u64 num_devices;
int ret = 0;
 
-   mutex_lock(_info->volume_mutex);
mutex_lock(_mutex);
 
num_devices = fs_info->fs_devices->num_devices;
@@ -2048,7 +2047,6 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const 
char *device_path,
 
 out:
mutex_unlock(_mutex);
-   mutex_unlock(_info->volume_mutex);
return ret;
 
 error_undo:
-- 
2.16.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 03/16] btrfs: export and rename free_device

2018-04-03 Thread David Sterba

The function will be used outside of volumes.c, the allocation
btrfs_alloc_device is also exported.

Signed-off-by: David Sterba 
---
 fs/btrfs/volumes.c | 24 
 fs/btrfs/volumes.h |  1 +
 2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8086c5687b72..ca5521cc1a5b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -246,7 +246,7 @@ static struct btrfs_fs_devices *alloc_fs_devices(const u8 
*fsid)
return fs_devs;
 }
 
-static void free_device(struct btrfs_device *device)
+void btrfs_free_device(struct btrfs_device *device)
 {
rcu_string_free(device->name);
bio_put(device->flush_bio);
@@ -261,7 +261,7 @@ static void free_fs_devices(struct btrfs_fs_devices 
*fs_devices)
device = list_entry(fs_devices->devices.next,
struct btrfs_device, dev_list);
list_del(>dev_list);
-   free_device(device);
+   btrfs_free_device(device);
}
kfree(fs_devices);
 }
@@ -294,7 +294,7 @@ void __exit btrfs_cleanup_fs_uuids(void)
 /*
  * Returns a pointer to a new btrfs_device on success; ERR_PTR() on error.
  * Returned struct is not linked onto any lists and must be destroyed using
- * free_device.
+ * btrfs_free_device.
  */
 static struct btrfs_device *__alloc_device(void)
 {
@@ -650,7 +650,7 @@ static void btrfs_free_stale_devices(const char *path,
} else {
fs_devs->num_devices--;
list_del(>dev_list);
-   free_device(dev);
+   btrfs_free_device(dev);
}
}
}
@@ -765,7 +765,7 @@ static noinline struct btrfs_device *device_list_add(const 
char *path,
 
name = rcu_string_strdup(path, GFP_NOFS);
if (!name) {
-   free_device(device);
+   btrfs_free_device(device);
return ERR_PTR(-ENOMEM);
}
rcu_assign_pointer(device->name, name);
@@ -878,7 +878,7 @@ static struct btrfs_fs_devices *clone_fs_devices(struct 
btrfs_fs_devices *orig)
name = rcu_string_strdup(orig_dev->name->str,
GFP_KERNEL);
if (!name) {
-   free_device(device);
+   btrfs_free_device(device);
goto error;
}
rcu_assign_pointer(device->name, name);
@@ -950,7 +950,7 @@ void btrfs_free_extra_devids(struct btrfs_fs_devices 
*fs_devices, int step)
}
list_del_init(>dev_list);
fs_devices->num_devices--;
-   free_device(device);
+   btrfs_free_device(device);
}
 
if (fs_devices->seed) {
@@ -968,7 +968,7 @@ static void free_device_rcu(struct rcu_head *head)
struct btrfs_device *device;
 
device = container_of(head, struct btrfs_device, rcu);
-   free_device(device);
+   btrfs_free_device(device);
 }
 
 static void btrfs_close_bdev(struct btrfs_device *device)
@@ -2582,7 +2582,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, 
const char *device_path
if (trans)
btrfs_end_transaction(trans);
 error_free_device:
-   free_device(device);
+   btrfs_free_device(device);
 error:
blkdev_put(bdev, FMODE_EXCL);
if (seeding_dev && !unlocked) {
@@ -2653,7 +2653,7 @@ int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info 
*fs_info,
 
name = rcu_string_strdup(device_path, GFP_KERNEL);
if (!name) {
-   free_device(device);
+   btrfs_free_device(device);
ret = -ENOMEM;
goto error;
}
@@ -6419,7 +6419,7 @@ static struct btrfs_device *add_missing_dev(struct 
btrfs_fs_devices *fs_devices,
  *
  * Return: a pointer to a new  btrfs_device on success; ERR_PTR()
  * on error.  Returned struct is not linked onto any lists and must be
- * destroyed with free_device.
+ * destroyed with btrfs_free_device.
  */
 struct btrfs_device *btrfs_alloc_device(struct btrfs_fs_info *fs_info,
const u64 *devid,
@@ -6442,7 +6442,7 @@ struct btrfs_device *btrfs_alloc_device(struct 
btrfs_fs_info *fs_info,
 
ret = find_next_devid(fs_info, );
if (ret) {
-   free_device(dev);
+   btrfs_free_device(dev);
return ERR_PTR(ret);
}
}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index d1fcaea9fef5..45e3ece21290 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -434,6 +434,7 @@ int btrfs_find_device_by_devspec(struct btrfs_fs_info 
*fs_info,

[PATCH 04/16] btrfs: move btrfs_init_dev_replace_tgtdev to dev-replace.c and make static

2018-04-03 Thread David Sterba

The function logically belongs there and there's only a single caller,
no need to export it. No code changes.

Signed-off-by: David Sterba 
---
 fs/btrfs/dev-replace.c | 99 ++
 fs/btrfs/volumes.c | 99 --
 fs/btrfs/volumes.h |  4 --
 3 files changed, 99 insertions(+), 103 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index e3b8a79c1665..7a87ffad041e 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -188,6 +188,105 @@ int btrfs_init_dev_replace(struct btrfs_fs_info *fs_info)
return ret;
 }
 
+/*
+ * Initialize a new device for device replace target from a given source dev
+ * and path.
+ *
+ * Return 0 and new device in @device_out, otherwise return < 0
+ */
+static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
+ const char *device_path,
+ struct btrfs_device *srcdev,
+ struct btrfs_device **device_out)
+{
+   struct btrfs_device *device;
+   struct block_device *bdev;
+   struct list_head *devices;
+   struct rcu_string *name;
+   u64 devid = BTRFS_DEV_REPLACE_DEVID;
+   int ret = 0;
+
+   *device_out = NULL;
+   if (fs_info->fs_devices->seeding) {
+   btrfs_err(fs_info, "the filesystem is a seed filesystem!");
+   return -EINVAL;
+   }
+
+   bdev = blkdev_get_by_path(device_path, FMODE_WRITE | FMODE_EXCL,
+ fs_info->bdev_holder);
+   if (IS_ERR(bdev)) {
+   btrfs_err(fs_info, "target device %s is invalid!", device_path);
+   return PTR_ERR(bdev);
+   }
+
+   filemap_write_and_wait(bdev->bd_inode->i_mapping);
+
+   devices = _info->fs_devices->devices;
+   list_for_each_entry(device, devices, dev_list) {
+   if (device->bdev == bdev) {
+   btrfs_err(fs_info,
+ "target device is in the filesystem!");
+   ret = -EEXIST;
+   goto error;
+   }
+   }
+
+
+   if (i_size_read(bdev->bd_inode) <
+   btrfs_device_get_total_bytes(srcdev)) {
+   btrfs_err(fs_info,
+ "target device is smaller than source device!");
+   ret = -EINVAL;
+   goto error;
+   }
+
+
+   device = btrfs_alloc_device(NULL, , NULL);
+   if (IS_ERR(device)) {
+   ret = PTR_ERR(device);
+   goto error;
+   }
+
+   name = rcu_string_strdup(device_path, GFP_KERNEL);
+   if (!name) {
+   btrfs_free_device(device);
+   ret = -ENOMEM;
+   goto error;
+   }
+   rcu_assign_pointer(device->name, name);
+
+   mutex_lock(_info->fs_devices->device_list_mutex);
+   set_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state);
+   device->generation = 0;
+   device->io_width = fs_info->sectorsize;
+   device->io_align = fs_info->sectorsize;
+   device->sector_size = fs_info->sectorsize;
+   device->total_bytes = btrfs_device_get_total_bytes(srcdev);
+   device->disk_total_bytes = btrfs_device_get_disk_total_bytes(srcdev);
+   device->bytes_used = btrfs_device_get_bytes_used(srcdev);
+   device->commit_total_bytes = srcdev->commit_total_bytes;
+   device->commit_bytes_used = device->bytes_used;
+   device->fs_info = fs_info;
+   device->bdev = bdev;
+   set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, >dev_state);
+   set_bit(BTRFS_DEV_STATE_REPLACE_TGT, >dev_state);
+   device->mode = FMODE_EXCL;
+   device->dev_stats_valid = 1;
+   set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
+   device->fs_devices = fs_info->fs_devices;
+   list_add(>dev_list, _info->fs_devices->devices);
+   fs_info->fs_devices->num_devices++;
+   fs_info->fs_devices->open_devices++;
+   mutex_unlock(_info->fs_devices->device_list_mutex);
+
+   *device_out = device;
+   return 0;
+
+error:
+   blkdev_put(bdev, FMODE_EXCL);
+   return ret;
+}
+
 /*
  * called from commit_transaction. Writes changed device replace state to
  * disk.
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ca5521cc1a5b..df2d6d06e014 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2592,105 +2592,6 @@ int btrfs_init_new_device(struct btrfs_fs_info 
*fs_info, const char *device_path
return ret;
 }
 
-/*
- * Initialize a new device for device replace target from a given source dev
- * and path.
- *
- * Return 0 and new device in @device_out, otherwise return < 0
- */
-int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
- const char *device_path,
- struct btrfs_device *srcdev,
- struct

[PATCH 06/16] btrfs: move clearing of EXCL_OP out of __cancel_balance

2018-04-03 Thread David Sterba

Make the clearning visible in the callers so we can pair it with the
test_and_set part.

Signed-off-by: David Sterba 
---
 fs/btrfs/ioctl.c   |  2 +-
 fs/btrfs/volumes.c | 15 ---
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b93dea445802..582bde5b7eda 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4656,7 +4656,7 @@ static long btrfs_ioctl_balance(struct file *file, void 
__user *arg)
 * Ownership of bctl and filesystem flag BTRFS_FS_EXCL_OP
 * goes to to btrfs_balance.  bctl is freed in __cancel_balance,
 * or, if restriper was paused all the way until unmount, in
-* free_fs_info.  The flag is cleared in __cancel_balance.
+* free_fs_info.  The flag should be cleared after __cancel_balance.
 */
need_unlock = false;
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e0bd181dc9e0..eb78c1d0ce2b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3760,8 +3760,6 @@ static void __cancel_balance(struct btrfs_fs_info 
*fs_info)
ret = del_balance_item(fs_info);
if (ret)
btrfs_handle_fs_error(fs_info, ret, NULL);
-
-   clear_bit(BTRFS_FS_EXCL_OP, _info->flags);
 }
 
 /* Non-zero return value signifies invalidity */
@@ -3919,6 +3917,7 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
if ((ret && ret != -ECANCELED && ret != -ENOSPC) ||
balance_need_close(fs_info)) {
__cancel_balance(fs_info);
+   clear_bit(BTRFS_FS_EXCL_OP, _info->flags);
}
 
wake_up(_info->balance_wait_q);
@@ -3926,11 +3925,11 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
return ret;
 out:
if (bctl->flags & BTRFS_BALANCE_RESUME)
-   __cancel_balance(fs_info);
-   else {
+   reset_balance_state(fs_info);
+   else
kfree(bctl);
-   clear_bit(BTRFS_FS_EXCL_OP, _info->flags);
-   }
+   clear_bit(BTRFS_FS_EXCL_OP, _info->flags);
+
return ret;
 }
 
@@ -4089,8 +4088,10 @@ int btrfs_cancel_balance(struct btrfs_fs_info *fs_info)
mutex_lock(_info->volume_mutex);
mutex_lock(_info->balance_mutex);
 
-   if (fs_info->balance_ctl)
+   if (fs_info->balance_ctl) {
__cancel_balance(fs_info);
+   clear_bit(BTRFS_FS_EXCL_OP, _info->flags);
+   }
 
mutex_unlock(_info->volume_mutex);
}
-- 
2.16.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 00/16 v1] Kill fs_info::volume_mutex

2018-04-03 Thread David Sterba

This series gets rid of the volume mutex. The fstests do not pass
cleanly, 2 or more tests fail so this needs to be fixed, but otherwise
majority of the work ready for review.

The merge target is 4.18 and I'll probably not get back to this pathset
during merge window (nor add it to next), so there should be enough time
for review.

David Sterba (16):
  btrfs: squeeze btrfs_dev_replace_continue_on_mount to its caller
  btrfs: make success path out of btrfs_init_dev_replace_tgtdev more
clear
  btrfs: export and rename free_device
  btrfs: move btrfs_init_dev_replace_tgtdev to dev-replace.c and make
static
  btrfs: move volume_mutex to callers of btrfs_rm_device
  btrfs: move clearing of EXCL_OP out of __cancel_balance
  btrfs: add proper safety check before resuming dev-replace
  btrfs: add sanity check when resuming balance after mount
  btrfs: cleanup helpers that reset balance state
  btrfs: remove wrong use of volume_mutex from btrfs_dev_replace_start
  btrfs: kill btrfs_fs_info::volume_mutex
  btrfs: track running balance in a simpler way
  btrfs: remove redundant read-only check from btrfs_cancel_balance
  btrfs: drop lock parameter from update_ioctl_balance_args and rename
  btrfs: use mutex in btrfs_resume_balance_async
  btrfs: open code set_balance_control

 fs/btrfs/ctree.h   |   9 +-
 fs/btrfs/dev-replace.c | 135 +-
 fs/btrfs/disk-io.c |   2 -
 fs/btrfs/extent-tree.c |   2 +-
 fs/btrfs/ioctl.c   |  41 
 fs/btrfs/volumes.c | 252 +
 fs/btrfs/volumes.h |   5 +-
 7 files changed, 205 insertions(+), 241 deletions(-)

-- 
2.16.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 0/3] SPDX to btrfs

2018-04-03 Thread David Sterba

Switch fs/btrfs/*.[ch] and fs/btrfs/tests/*.[ch] to SPDX. I've briefly
verified that there are no exceptions to GPL-2.0 (ie. no 'or later').
The changes match what I've seen in other patches and I don't think
ther's more needed to be done. If there are no objections I'm going to
add the patches to a 4.17 pull.

David Sterba (3):
  btrfs: replace GPL boilerplate by SPDX -- headers
  btrfs: replace GPL boilerplate by SPDX -- sources
  btrfs: add SPDX header to Kconfig

 fs/btrfs/Kconfig   |  2 ++
 fs/btrfs/acl.c | 15 +--
 fs/btrfs/async-thread.c| 15 +--
 fs/btrfs/async-thread.h| 21 +
 fs/btrfs/backref.c | 15 +--
 fs/btrfs/backref.h | 19 +++
 fs/btrfs/btrfs_inode.h | 19 +++
 fs/btrfs/check-integrity.c | 15 +--
 fs/btrfs/check-integrity.h | 19 +++
 fs/btrfs/compression.c | 15 +--
 fs/btrfs/compression.h | 19 +++
 fs/btrfs/ctree.c   | 15 +--
 fs/btrfs/ctree.h   | 20 
 fs/btrfs/dedupe.h  | 20 
 fs/btrfs/delayed-inode.c   | 15 +--
 fs/btrfs/delayed-inode.h   | 19 +++
 fs/btrfs/delayed-ref.c | 15 +--
 fs/btrfs/delayed-ref.h | 21 +
 fs/btrfs/dev-replace.c | 16 ++--
 fs/btrfs/dev-replace.h | 20 
 fs/btrfs/dir-item.c| 15 +--
 fs/btrfs/disk-io.c | 15 +--
 fs/btrfs/disk-io.h | 20 
 fs/btrfs/export.c  |  1 +
 fs/btrfs/export.h  |  1 +
 fs/btrfs/extent-tree.c | 16 ++--
 fs/btrfs/extent_io.c   |  1 +
 fs/btrfs/extent_io.h   |  6 --
 fs/btrfs/extent_map.c  |  1 +
 fs/btrfs/extent_map.h  |  6 --
 fs/btrfs/file-item.c   | 15 +--
 fs/btrfs/file.c| 15 +--
 fs/btrfs/free-space-cache.c| 15 +--
 fs/btrfs/free-space-cache.h| 19 +++
 fs/btrfs/free-space-tree.c | 15 +--
 fs/btrfs/free-space-tree.h | 19 +++
 fs/btrfs/inode-item.c  | 15 +--
 fs/btrfs/inode-map.c   | 15 +--
 fs/btrfs/inode-map.h   |  5 +++--
 fs/btrfs/inode.c   | 15 +--
 fs/btrfs/ioctl.c   | 15 +--
 fs/btrfs/locking.c | 16 ++--
 fs/btrfs/locking.h | 19 +++
 fs/btrfs/lzo.c | 15 +--
 fs/btrfs/math.h| 20 +++-
 fs/btrfs/ordered-data.c| 15 +--
 fs/btrfs/ordered-data.h| 20 
 fs/btrfs/orphan.c  | 15 +--
 fs/btrfs/print-tree.c  | 15 +--
 fs/btrfs/print-tree.h  | 21 +
 fs/btrfs/props.c   | 15 +--
 fs/btrfs/props.h   | 19 +++
 fs/btrfs/qgroup.c  | 15 +--
 fs/btrfs/qgroup.h  | 22 +-
 fs/btrfs/raid56.c  | 16 ++--
 fs/btrfs/raid56.h  | 21 +
 fs/btrfs/rcu-string.h  | 20 ++--
 fs/btrfs/reada.c   | 15 +--
 fs/btrfs/ref-verify.c  | 15 +--
 fs/btrfs/ref-verify.h  | 23 ++-
 fs/btrfs/relocation.c  | 15 +--
 fs/btrfs/root-tree.c   | 15 +--
 fs/btrfs/scrub.c   | 15 +--
 fs/btrfs/send.c| 15 +--
 fs/btrfs/send.h| 20 ++--
 fs/btrfs/struct-funcs.c| 15 +--
 fs/btrfs/super.c   | 15 +--
 fs/btrfs/sysfs.c   | 15 +--
 fs/btrfs/sysfs.h   |  7 ---
 fs/btrfs/tests/btrfs-tests.c   | 15 +--
 fs/btrfs/tests/btrfs-tests.h   | 19 +++
 fs/btrfs/tests/extent-buffer-tests.c   | 15 +--
 fs/btrfs/tests/extent-io-tests.c   | 15 +--
 fs/btrfs/tests/extent-map-tests.c  | 15

[PATCH 1/3] btrfs: replace GPL boilerplate by SPDX -- headers

2018-04-03 Thread David Sterba

Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.

Unify the include protection macros to match the file names.

Signed-off-by: David Sterba 
---
 fs/btrfs/async-thread.h  | 21 +
 fs/btrfs/backref.h   | 19 +++
 fs/btrfs/btrfs_inode.h   | 19 +++
 fs/btrfs/check-integrity.h   | 19 +++
 fs/btrfs/compression.h   | 19 +++
 fs/btrfs/ctree.h | 20 
 fs/btrfs/dedupe.h| 20 
 fs/btrfs/delayed-inode.h | 19 +++
 fs/btrfs/delayed-ref.h   | 21 +
 fs/btrfs/dev-replace.h   | 20 
 fs/btrfs/disk-io.h   | 20 
 fs/btrfs/export.h|  1 +
 fs/btrfs/extent_io.h |  6 --
 fs/btrfs/extent_map.h|  6 --
 fs/btrfs/free-space-cache.h  | 19 +++
 fs/btrfs/free-space-tree.h   | 19 +++
 fs/btrfs/inode-map.h |  5 +++--
 fs/btrfs/locking.h   | 19 +++
 fs/btrfs/math.h  | 20 +++-
 fs/btrfs/ordered-data.h  | 20 
 fs/btrfs/print-tree.h| 21 +
 fs/btrfs/props.h | 19 +++
 fs/btrfs/qgroup.h| 22 +-
 fs/btrfs/raid56.h| 21 +
 fs/btrfs/rcu-string.h| 20 ++--
 fs/btrfs/ref-verify.h| 23 ++-
 fs/btrfs/send.h  | 20 ++--
 fs/btrfs/sysfs.h |  7 ---
 fs/btrfs/tests/btrfs-tests.h | 19 +++
 fs/btrfs/transaction.h   | 20 
 fs/btrfs/tree-checker.h  | 17 +++--
 fs/btrfs/tree-log.h  | 20 
 fs/btrfs/ulist.h |  7 +++
 fs/btrfs/volumes.h   | 19 +++
 fs/btrfs/xattr.h | 21 -
 35 files changed, 133 insertions(+), 475 deletions(-)

diff --git a/fs/btrfs/async-thread.h b/fs/btrfs/async-thread.h
index fc957e00cef1..7861c9feba5f 100644
--- a/fs/btrfs/async-thread.h
+++ b/fs/btrfs/async-thread.h
@@ -1,24 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
 /*
  * Copyright (C) 2007 Oracle.  All rights reserved.
  * Copyright (C) 2014 Fujitsu.  All rights reserved.
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public
- * License v2 as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License for more details.
- *
- * You should have received a copy of the GNU General Public
- * License along with this program; if not, write to the
- * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
- * Boston, MA 021110-1307, USA.
  */
 
-#ifndef __BTRFS_ASYNC_THREAD_
-#define __BTRFS_ASYNC_THREAD_
+#ifndef BTRFS_ASYNC_THREAD_H
+#define BTRFS_ASYNC_THREAD_H
+
 #include 
 
 struct btrfs_fs_info;
@@ -85,4 +73,5 @@ void btrfs_set_work_high_priority(struct btrfs_work *work);
 struct btrfs_fs_info *btrfs_work_owner(const struct btrfs_work *work);
 struct btrfs_fs_info *btrfs_workqueue_owner(const struct __btrfs_workqueue 
*wq);
 bool btrfs_workqueue_normal_congested(const struct btrfs_workqueue *wq);
+
 #endif
diff --git a/fs/btrfs/backref.h b/fs/btrfs/backref.h
index 0a30028d5196..54d58988483a 100644
--- a/fs/btrfs/backref.h
+++ b/fs/btrfs/backref.h
@@ -1,23 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
 /*
  * Copyright (C) 2011 STRATO.  All rights reserved.
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public
- * License v2 as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License for more details.
- *
- * You should have received a copy of the GNU General Public
- * License along with this program; if not, write to the
- * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
- * Boston, MA 021110-1307, USA.
  */
 
-#ifndef __BTRFS_BACKREF__
-#define __BTRFS_BACKREF__
+#ifndef BTRFS_BACKREF_H
+#define BTRFS_BACKREF_H
 
 #include 
 #include "ulist.h"
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index ca15be569d69..234bae55b85d 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -1,23 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
 /*
  * Copyright (C) 2007 Oracle.  All

[PATCH 3/3] btrfs: add SPDX header to Kconfig

2018-04-03 Thread David Sterba

Signed-off-by: David Sterba 
---
 fs/btrfs/Kconfig | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index 167e5dc7eadd..23537bc8c827 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -1,3 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+
 config BTRFS_FS
tristate "Btrfs filesystem support"
select LIBCRC32C
-- 
2.16.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/3] btrfs: replace GPL boilerplate by SPDX -- sources

2018-04-03 Thread David Sterba

Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.

Signed-off-by: David Sterba 
---
 fs/btrfs/acl.c | 15 +--
 fs/btrfs/async-thread.c| 15 +--
 fs/btrfs/backref.c | 15 +--
 fs/btrfs/check-integrity.c | 15 +--
 fs/btrfs/compression.c | 15 +--
 fs/btrfs/ctree.c   | 15 +--
 fs/btrfs/delayed-inode.c   | 15 +--
 fs/btrfs/delayed-ref.c | 15 +--
 fs/btrfs/dev-replace.c | 16 ++--
 fs/btrfs/dir-item.c| 15 +--
 fs/btrfs/disk-io.c | 15 +--
 fs/btrfs/export.c  |  1 +
 fs/btrfs/extent-tree.c | 16 ++--
 fs/btrfs/extent_io.c   |  1 +
 fs/btrfs/extent_map.c  |  1 +
 fs/btrfs/file-item.c   | 15 +--
 fs/btrfs/file.c| 15 +--
 fs/btrfs/free-space-cache.c| 15 +--
 fs/btrfs/free-space-tree.c | 15 +--
 fs/btrfs/inode-item.c  | 15 +--
 fs/btrfs/inode-map.c   | 15 +--
 fs/btrfs/inode.c   | 15 +--
 fs/btrfs/ioctl.c   | 15 +--
 fs/btrfs/locking.c | 16 ++--
 fs/btrfs/lzo.c | 15 +--
 fs/btrfs/ordered-data.c| 15 +--
 fs/btrfs/orphan.c  | 15 +--
 fs/btrfs/print-tree.c  | 15 +--
 fs/btrfs/props.c   | 15 +--
 fs/btrfs/qgroup.c  | 15 +--
 fs/btrfs/raid56.c  | 16 ++--
 fs/btrfs/reada.c   | 15 +--
 fs/btrfs/ref-verify.c  | 15 +--
 fs/btrfs/relocation.c  | 15 +--
 fs/btrfs/root-tree.c   | 15 +--
 fs/btrfs/scrub.c   | 15 +--
 fs/btrfs/send.c| 15 +--
 fs/btrfs/struct-funcs.c| 15 +--
 fs/btrfs/super.c   | 15 +--
 fs/btrfs/sysfs.c   | 15 +--
 fs/btrfs/tests/btrfs-tests.c   | 15 +--
 fs/btrfs/tests/extent-buffer-tests.c   | 15 +--
 fs/btrfs/tests/extent-io-tests.c   | 15 +--
 fs/btrfs/tests/extent-map-tests.c  | 15 +--
 fs/btrfs/tests/free-space-tests.c  | 15 +--
 fs/btrfs/tests/free-space-tree-tests.c | 15 +--
 fs/btrfs/tests/inode-tests.c   | 15 +--
 fs/btrfs/tests/qgroup-tests.c  | 15 +--
 fs/btrfs/transaction.c | 15 +--
 fs/btrfs/tree-checker.c| 13 +
 fs/btrfs/tree-defrag.c | 15 +--
 fs/btrfs/tree-log.c| 15 +--
 fs/btrfs/ulist.c   |  2 +-
 fs/btrfs/uuid-tree.c   | 16 ++--
 fs/btrfs/volumes.c | 16 ++--
 fs/btrfs/xattr.c   | 16 +---
 fs/btrfs/zlib.c| 15 +--
 fs/btrfs/zstd.c| 10 ++
 58 files changed, 65 insertions(+), 750 deletions(-)

diff --git a/fs/btrfs/acl.c b/fs/btrfs/acl.c
index 0066d95b133f..15e1dfef56a5 100644
--- a/fs/btrfs/acl.c
+++ b/fs/btrfs/acl.c
@@ -1,19 +1,6 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Copyright (C) 2007 Red Hat.  All rights reserved.
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public
- * License v2 as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License for more details.
- *
- * You should have received a copy of the GNU General Public
- * License along with this program; if not, write to the
- * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
- * Boston, MA 021110-1307, USA.
  */
 
 #include 
diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index d5540749f0e5..d522494698fa 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -1,20 +1,7 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Copyright (C) 2007 Oracle.  All rights reserved.
  * Copyright (C) 2014 Fujitsu.  All rights reserved.
- *
-

Please add 21035965f60b ("bitmap: fix memset optimization on big-endian systems") to the stable tree

2018-04-03 Thread Omar Sandoval

Just wanted to make sure this doesn't get missed because I misspelled
the stable email address in the patch. It applies to v4.13+.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] bitmap: fix memset optimization on big-endian systems

2018-04-03 Thread Omar Sandoval

On Mon, Apr 02, 2018 at 04:49:39PM -0700, Linus Torvalds wrote:
> On Mon, Apr 2, 2018 at 4:37 PM, Linus Torvalds
>  wrote:
> >
> > We should probably have at least switched it to "unsigned long int"
> 
> I meant just "unsigned int", of course.
> 
> Right now we occasionally have a silly 64-bit field just for a couple of 
> flags.

Not to mention the mix of bit fields, macros, and enums, some of which
are bit masks, some of which are the bit number :)

> Of course, we do have cases where 64-bit architectures happily end up
> using more than 32 bits too, so the "unsigned long" is occasionally
> useful. But it's rare enough that it probably wasn't the right thing
> to do.
> 
> Water under the bridge. Part of it is due to another historical
> accident: the alpha port was one of the early ports, and it didn't do
> atomic byte accesses at all.
> 
> So we had multiple issues that all conspired to "unsigned long arrays
> are the natural for atomic bit operations" even though they have this
> really annoying byte order issue.

Thanks for the historical context, this is exactly what I was wondering
when I spotted this bug and fixed a similar one in Btrfs a couple of
years back.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Status of RAID5/6

2018-04-03 Thread Goffredo Baroncelli

On 04/03/2018 02:31 AM, Zygo Blaxell wrote:
> On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
>> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
>>> On 2018-04-02 11:18, Goffredo Baroncelli wrote:
 I thought that a possible solution is to create BG with different
>>> number of data disks. E.g. supposing to have a raid 6 system with 6
>>> disks, where 2 are parity disk; we should allocate 3 BG
 BG #1: 1 data disk, 2 parity disks
 BG #2: 2 data disks, 2 parity disks,
 BG #3: 4 data disks, 2 parity disks

 For simplicity, the disk-stripe length is assumed = 4K.

 So If you have a write with a length of 4 KB, this should be placed
>>> in BG#1; if you have a write with a length of 4*3KB, the first 8KB,
>>> should be placed in in BG#2, then in BG#1.
 This would avoid space wasting, even if the fragmentation will
>>> increase (but shall the fragmentation matters with the modern solid
>>> state disks ?).
>> I don't really see why this would increase fragmentation or waste space.

> Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
> to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
> remaining 2 blocks).  It also flips the usual order of "determine size
> of extent, then allocate space for it" which might require major surgery
> on the btrfs allocator to implement.

I have to point out that in any case the extent is physically interrupted at 
the disk-stripe size. Assuming disk-stripe=64KB, if you want to write 128KB, 
the first half is written in the first disk, the other in the 2nd disk.  If you 
want to write 96kb, the first 64 are written in the first disk, the last part 
in the 2nd, only on a different BG.
So yes there is a fragmentation from a logical point of view; from a physical 
point of view the data is spread on the disks in any case.

In any case, you are right, we should gather some data, because the performance 
impact are no so clear.

I am not worried abut having different BG; we have problem with these because 
we never developed tool to handle this issue properly (i.e. a daemon which 
starts a balance when needed). But I hope that this will be solved in future.

In any case, the all solutions proposed have their trade off:

- a) as is: write hole bug
- b) variable stripe size (like ZFS): big impact on how btrfs handle the 
extent. limited waste of space
- c) logging data before writing: we wrote the data two times in a short time 
window. Moreover the log area is written several order of magnitude more than 
the other area; there was some patch around
- d) rounding the writing to the stripe size: waste of space; simple to 
implement;
- e) different BG with different stripe size: limited waste of space; logical 
fragmentation.

* c),d),e) are applied only for the tail of the extent, in case the size is 
less than the stripe size.
* for b),d), e), the wasting of space may be reduced with a balance 

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[GIT PULL] Btrfs updates for 4.17

2018-04-03 Thread David Sterba

Hi,

please pull the following btrfs changes. There are a several user
visible changes, the rest is mostly invisible and continues to clean up
the whole code base.

There are no merge conflicts with current master. Please pull, thanks.

User visible changes:

  - new mount option nossd_spread (pair for ssd_spread)

  - mount option subvolid will detect junk after the number and fail the mount

  - add message after cancelled device replace

  - direct module dependency on libcrc32, removed own crc wrappers

  - removed user space transaction ioctls

  - use lighter locking when reading /proc/self/mounts, RCU instead of mutex
to avoid unnecessary contention

Enhancements:

  - skip writeback of last page when truncating file to same size

  - send: do not issue unnecessary truncate operations

  - mount option token specifiers: use %u for unsigned values, more validation

  - selftests: more tree block validations

qgroups:

  - preparatory work for splitting reservation types for data and metadata,
this should allow for more accurate tracking and fix some issues with
underflows or do further enhancements

  - split metadata reservations for started and joined transaction so they do
not get mixed up and are accounted correctly at commit time

  - with the above, it's possible to revert patch that potentially deadlocks
when trying to make more space by explicitly committing when the quota
limit is hit

  - fix root item corruption when multiple same source snapshots are created
with quota enabled

RAID56:

  - make sure target is identical to source when raid56 rebuild fails after
dev-replace

  - faster rebuild during scrub, batch by stripes and not block-by-block

  - make more use of cached data when rebuilding from a missing device

Fixes:

  - null pointer deref when device replace target is missing

  - fix fsync after hole punching when using no-holes feature

  - fix lockdep splat when allocating percpu data with wrong GFP flags

Cleanups, refactoring, core changes:

  - drop redunant parameters from various functions

  - kill and opencode trivial helpers

  - __cold/__exit function annotations

  - dead code removal

  - continued audit and documentation of memory barriers

  - error handling: handle removal from uuid tree

  - error handling: remove handling of impossible condtitons

  - more debugging or error messages

  - updated tracepoints

  - 1 VLA use removal (1 still left)


The following changes since commit 3eb2ce825ea1ad89d20f7a3b5780df850e4be274:

  Linux 4.16-rc7 (2018-03-25 12:44:30 -1000)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-4.17-tag

for you to fetch changes up to 57599c7e7722daf5f8c2dba4b0e4628f5c500771:

  btrfs: lift errors from add_extent_changeset to the callers (2018-03-31 
02:03:25 +0200)


Anand Jain (25):
  btrfs: open code btrfs_dev_replace_cancel()
  btrfs: rename __btrfs_dev_replace_cancel()
  btrfs: btrfs_dev_replace_cancel() can return int
  btrfs: open code btrfs_init_dev_replace_tgtdev_for_resume()
  btrfs: extent_buffer_uptodate() make it static and inline
  btrfs: manage thread_pool mount option as %u
  btrfs: manage metadata_ratio mount option as %u
  btrfs: manage check_int_print_mask mount option as %u
  btrfs: manage commit mount option as %u
  btrfs: add a comment to mark the deprecated mount option
  btrfs: fix null pointer deref when target device is missing
  btrfs: log, when replace, is canceled by the user
  btrfs: remove unused function btrfs_async_submit_limit()
  btrfs: cow_file_range() num_bytes and disk_num_bytes are same
  btrfs: use ASSERT to report logical error in cow_file_range()
  btrfs: not a disk error if the bio_add_page fails
  btrfs: keep device list sorted
  btrfs: insert newly opened device to the end of the list
  btrfs: verify subvolid mount parameter
  btrfs: remove assert in btrfs_init_dev_replace_tgtdev()
  btrfs: unify types for metadata_ratio and data_chunk_allocations
  btrfs: rename btrfs_close_extra_device to btrfs_free_extra_devids
  btrfs: add define for oldest generation
  btrfs: drop num argument from find_live_mirror()
  btrfs: drop optimal argument from find_live_mirror()

Colin Ian King (2):
  btrfs: remove redundant check on ret and goto
  Btrfs: extent map selftest: add missing void parameter to 
btrfs_test_extent_map

David Sterba (37):
  btrfs: add (the only possible) __exit annotation
  btrfs: add more __cold annotations
  btrfs: drop underscores from exported xattr functions
  btrfs: drop extern from function declarations
  btrfs: adjust return type of btrfs_getxattr
  btrfs: move btrfs_listxattr prototype to xattr.h
  btrfs: open code trivial helper

Re: [PATCH] btrfs-progs: let mkfs return nozero value on thin provision device

2018-04-03 Thread Qu Wenruo



On 2018年04月03日 16:39, Su Yue wrote:
> when mkfs.btrfs on a thin provision device which has very small
> backing size and big virtual size, all code works well in
> mkfs.btrfs until close_ctree() is called.
> close_ctree() fails to sync device due to small backing size
> while closing devices.
> However, mkfs returns 0 in such situation which causes failure of
> xfstests generic/405.
> 
> So, let mkfs returns nonzero value if previous steps succeeded but
> close_ctree() failed.
> Then xfstests generic/405 passes now.
> 
> Signed-off-by: Su Yue 

Reviewed-by: Qu Wenruo 

Thanks,
Qu

> ---
>  mkfs/main.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/mkfs/main.c b/mkfs/main.c
> index 5a717f701cd5..9f7f2396df8f 100644
> --- a/mkfs/main.c
> +++ b/mkfs/main.c
> @@ -1285,6 +1285,12 @@ out:
>   }
>   }
>  
> + if (!ret && close_ret) {
> + ret = close_ret;
> + error("failed to close ctree, the filesystem may be 
> inconsistent: %d",
> +   ret);
> + }
> +
>   btrfs_close_all_devices();
>   free(label);
>  
> 



signature.asc
Description: OpenPGP digital signature

[PATCH] btrfs-progs: let mkfs return nozero value on thin provision device

2018-04-03 Thread Su Yue

when mkfs.btrfs on a thin provision device which has very small
backing size and big virtual size, all code works well in
mkfs.btrfs until close_ctree() is called.
close_ctree() fails to sync device due to small backing size
while closing devices.
However, mkfs returns 0 in such situation which causes failure of
xfstests generic/405.

So, let mkfs returns nonzero value if previous steps succeeded but
close_ctree() failed.
Then xfstests generic/405 passes now.

Signed-off-by: Su Yue 
---
 mkfs/main.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/mkfs/main.c b/mkfs/main.c
index 5a717f701cd5..9f7f2396df8f 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -1285,6 +1285,12 @@ out:
}
}
 
+   if (!ret && close_ret) {
+   ret = close_ret;
+   error("failed to close ctree, the filesystem may be 
inconsistent: %d",
+ ret);
+   }
+
btrfs_close_all_devices();
free(label);
 
-- 
2.16.3



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 10/10] btrfs: qgroup: Use independent and accurate per inode qgroup rsv

2018-04-03 Thread Qu Wenruo

Hi David,

I didn't see this patch merged in your misc-next branch but only the
remaining patches.

However without this patch, btrfs qgroup reserved space will get
obviously increased as prealloc metadata reserved space is never freed
until inode reserved space is freed.

This would cause a lot of qgroup related test cases to fail.

Would you please merge this patch with all qgroup patchset?

Thanks,
Qu

On 2017年12月22日 14:18, Qu Wenruo wrote:
> Unlike reservation calculation used in inode rsv for metadata, qgroup
> doesn't really need to care things like csum size or extent usage for
> whole tree COW.
> 
> Qgroup care more about net change of extent usage.
> That's to say, if we're going to insert one file extent, it will mostly
> find its place in CoWed tree block, leaving no change in extent usage.
> Or cause leaf split, result one new net extent, increasing qgroup number
> by nodesize.
> (Or even more rare case, increase the tree level, increasing qgroup
> number by 2 * nodesize)
> 
> So here instead of using the way overkilled calculation for extent
> allocator, which cares more about accurate and no error, qgroup doesn't
> need that over-calculated reservation.
> 
> This patch will maintain 2 new members in btrfs_block_rsv structure for
> qgroup, using much smaller calculation for qgroup rsv, reducing false
> EDQUOT.
> 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/ctree.h   | 18 +
>  fs/btrfs/extent-tree.c | 55 
> --
>  2 files changed, 62 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 0c58f92c2d44..97783ba91e00 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -467,6 +467,24 @@ struct btrfs_block_rsv {
>   unsigned short full;
>   unsigned short type;
>   unsigned short failfast;
> +
> + /*
> +  * Qgroup equivalent for @size @reserved
> +  *
> +  * Unlike normal normal @size/@reserved for inode rsv,
> +  * qgroup doesn't care about things like csum size nor how many tree
> +  * blocks it will need to reserve.
> +  *
> +  * Qgroup cares more about *NET* change of extent usage.
> +  * So for ONE newly inserted file extent, in worst case it will cause
> +  * leaf split and level increase, nodesize for each file extent
> +  * is already way overkilled.
> +  *
> +  * In short, qgroup_size/reserved is the up limit of possible needed
> +  * qgroup metadata reservation.
> +  */
> + u64 qgroup_rsv_size;
> + u64 qgroup_rsv_reserved;
>  };
>  
>  /*
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 986660f301de..9d579c7bcf7f 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -5565,14 +5565,18 @@ static void space_info_add_new_bytes(struct 
> btrfs_fs_info *fs_info,
>  
>  static u64 block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
>   struct btrfs_block_rsv *block_rsv,
> - struct btrfs_block_rsv *dest, u64 num_bytes)
> + struct btrfs_block_rsv *dest, u64 num_bytes,
> + u64 *qgroup_to_release_ret)
>  {
>   struct btrfs_space_info *space_info = block_rsv->space_info;
> + u64 qgroup_to_release = 0;
>   u64 ret;
>  
>   spin_lock(_rsv->lock);
> - if (num_bytes == (u64)-1)
> + if (num_bytes == (u64)-1) {
>   num_bytes = block_rsv->size;
> + qgroup_to_release = block_rsv->qgroup_rsv_size;
> + }
>   block_rsv->size -= num_bytes;
>   if (block_rsv->reserved >= block_rsv->size) {
>   num_bytes = block_rsv->reserved - block_rsv->size;
> @@ -5581,6 +5585,12 @@ static u64 block_rsv_release_bytes(struct 
> btrfs_fs_info *fs_info,
>   } else {
>   num_bytes = 0;
>   }
> + if (block_rsv->qgroup_rsv_reserved >= block_rsv->qgroup_rsv_size) {
> + qgroup_to_release = block_rsv->qgroup_rsv_reserved -
> + block_rsv->qgroup_rsv_size;
> + block_rsv->qgroup_rsv_reserved = block_rsv->qgroup_rsv_size;
> + } else
> + qgroup_to_release = 0;
>   spin_unlock(_rsv->lock);
>  
>   ret = num_bytes;
> @@ -5603,6 +5613,8 @@ static u64 block_rsv_release_bytes(struct btrfs_fs_info 
> *fs_info,
>   space_info_add_old_bytes(fs_info, space_info,
>num_bytes);
>   }
> + if (qgroup_to_release_ret)
> + *qgroup_to_release_ret = qgroup_to_release;
>   return ret;
>  }
>  
> @@ -5744,17 +5756,21 @@ int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
>   struct btrfs_root *root = inode->root;
>   struct btrfs_block_rsv *block_rsv = >block_rsv;
>   u64 num_bytes = 0;
> + u64 qgroup_num_bytes = 0;
>   int ret = -ENOSPC;
>  
>   spin_lock(_rsv->lock);
>   if

Re: [PATCH 1/2] btrfs-progs: check: Skip data csum verification for metadata dump

2018-04-03 Thread Nikolay Borisov



On  3.04.2018 08:55, Qu Wenruo wrote:
> 
> 
> On 2018年04月03日 13:51, Nikolay Borisov wrote:
>>
>>
>> On  3.04.2018 08:39, Qu Wenruo wrote:
>>> For metadata dump (fs restored by btrfs-image), since no data is
>>> restored check sum verification will definitely report error.
>>>
>>> Add such check in check_csums() and prompt for user.
>>>
>>> Issue: #103
>>> Signed-off-by: Qu Wenruo 
>>
>> Reviewed-by: Nikolay Borisov 
>>
>>> ---
>>>  check/main.c | 13 -
>>>  1 file changed, 12 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/check/main.c b/check/main.c
>>> index 891a6d797756..15c94543946a 100644
>>> --- a/check/main.c
>>> +++ b/check/main.c
>>> @@ -5601,6 +5601,7 @@ static int check_csums(struct btrfs_root *root)
>>> int ret;
>>> u64 data_len;
>>> unsigned long leaf_offset;
>>> +   bool verify_csum = !!check_data_csum;
>>
>> nit: Why don't you convert check_data_csum to bool in this patch.
> 
> Next thing to do.
> 
>> My
>> recent testing in !! usage showed that it's generating quite a number of
>> additional instructions.
> 
> That's pretty interesting.
> 
> Would you please shared some more info about this?

So it seems I might have been mistaken. The latest experiment I did was
with the attached diff. So I did a bloat-o-meter comparison between
stock random.c against the diff applied and then modified the diff to
use uuid rather than !!uuid. And in my initial experiment I did see a
reduction in the size of the function. However, now that I tried
reproducing I couldn't. So just ignore my comment.

Bloat-o-meter shows in both cases:

nborisov@fisk:~/projects/kernel/source$ ./scripts/bloat-o-meter
random.original drivers/char/random.o
add/remove: 0/0 grow/shrink: 1/0 up/down: 49/0 (49)
Function old new   delta
proc_do_uuid 227 276 +49
Total: Before=30542, After=30591, chg +0.16%



  diff --git a/drivers/char/random.c b/drivers/char/random.c

index ec42c8bb9b0d..e05daf7f38f4 100644

--- a/drivers/char/random.c

+++ b/drivers/char/random.c

@@ -1960,6 +1960,10 @@ static int proc_do_uuid(struct ctl_table *table,
int write,
unsigned char buf[64], tmp_uuid[16], *uuid;



uuid = table->data;

+#ifdef CONFIG_UTS_NS

+   if (!!uuid && (uuid == (unsigned char *)sysctl_bootid))

+   uuid = current->nsproxy->uts_ns->sysctl_bootid;

+#endif

if (!uuid) {

uuid = tmp_uuid;

generate_random_uuid(uuid);

diff --git a/include/linux/utsname.h b/include/linux/utsname.h

index c8060c2ecd04..f704aca3e95a 100644

--- a/include/linux/utsname.h

+++ b/include/linux/utsname.h

@@ -27,6 +27,7 @@ struct uts_namespace {

struct user_namespace *user_ns;

struct ucounts *ucounts;

struct ns_common ns;

+   char sysctl_bootid[16];

 } __randomize_layout;

 extern struct uts_namespace init_uts_ns;



diff --git a/kernel/utsname.c b/kernel/utsname.c

index 913fe4336d2b..f1749cdcd341 100644

--- a/kernel/utsname.c

+++ b/kernel/utsname.c

@@ -34,8 +34,10 @@ static struct uts_namespace *create_uts_ns(void)

struct uts_namespace *uts_ns;



uts_ns = kmalloc(sizeof(struct uts_namespace), GFP_KERNEL);

-   if (uts_ns)

+   if (uts_ns) {

kref_init(_ns->kref);

+   memset(uts_ns->sysctl_bootid, 0, 16);

+   }

return uts_ns;

 }
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

37 matches

Mail list logo