Re: Status of RAID5/6
On 04/04/2018 12:57 AM, Zygo Blaxell wrote: >> I have to point out that in any case the extent is physically >> interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if >> you want to write 128KB, the first half is written in the first disk, >> the other in the 2nd disk. If you want to write 96kb, the first 64 >> are written in the first disk, the last part in the 2nd, only on a >> different BG. > The "only on a different BG" part implies something expensive, either > a seek or a new erase page depending on the hardware. Without that, > nearby logical blocks are nearby physical blocks as well. In any case it happens on a different disk > >> So yes there is a fragmentation from a logical point of view; from a >> physical point of view the data is spread on the disks in any case. > What matters is the extent-tree point of view. There is (currently) > no fragmentation there, even for RAID5/6. The extent tree is unaware > of RAID5/6 (to its peril). Before you pointed out that the non-contiguous block written has an impact on performance. I am replaying that the switching from a different BG happens at the stripe-disk boundary, so in any case the block is physically interrupted and switched to another disk However yes: from an extent-tree point of view there will be an increase of number extents, because the end of the writing is allocated to another BG (if the size is not stripe-boundary) > If an application does a loop writing 68K then fsync(), the multiple-BG > solution adds two seeks to read every 68K. That's expensive if sequential > read bandwidth is more scarce than free space. Why you talk about an additional seeks? In any case (even without the additional BG) the read happens from another disks >> * c),d),e) are applied only for the tail of the extent, in case the > size is less than the stripe size. > > It's only necessary to split an extent if there are no other writes > in the same transaction that could be combined with the extent tail > into a single RAID stripe. As long as everything in the RAID stripe > belongs to a single transaction, there is no write hole May be that a more "simpler" optimization would be close the transaction when the data reach the stripe boundary... But I suspect that it is not so simple to implement. > Not for d. Balance doesn't know how to get rid of unreachable blocks > in extents (it just moves the entire extent around) so after a balance > the writes would still be rounded up to the stripe size. Balance would > never be able to free the rounded-up space. That space would just be > gone until the file was overwritten, deleted, or defragged. If balance is capable to move the extent, why not place one near the other during a balance ? The goal is not to limit the the writing of the end of a extent, but avoid writing the end of an extent without further data (e.g. the gap to the stripe has to be filled in the same transaction) BR G.Baroncelli -- gpg @keyserver.linux.it: Goffredo Baroncelli Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Status of RAID5/6
On Tue, Apr 3, 2018 at 11:03 AM, Goffredo Baroncelliwrote: > On 04/03/2018 02:31 AM, Zygo Blaxell wrote: >> On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote: >>> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote: On 2018-04-02 11:18, Goffredo Baroncelli wrote: > I thought that a possible solution is to create BG with different number of data disks. E.g. supposing to have a raid 6 system with 6 disks, where 2 are parity disk; we should allocate 3 BG > BG #1: 1 data disk, 2 parity disks > BG #2: 2 data disks, 2 parity disks, > BG #3: 4 data disks, 2 parity disks > > For simplicity, the disk-stripe length is assumed = 4K. > > So If you have a write with a length of 4 KB, this should be placed in BG#1; if you have a write with a length of 4*3KB, the first 8KB, should be placed in in BG#2, then in BG#1. > This would avoid space wasting, even if the fragmentation will increase (but shall the fragmentation matters with the modern solid state disks ?). >>> I don't really see why this would increase fragmentation or waste space. > >> Oh, wait, yes I do. If there's a write of 6 blocks, we would have >> to split an extent between BG #3 (the first 4 blocks) and BG #2 (the >> remaining 2 blocks). It also flips the usual order of "determine size >> of extent, then allocate space for it" which might require major surgery >> on the btrfs allocator to implement. > > I have to point out that in any case the extent is physically interrupted at > the disk-stripe size. Assuming disk-stripe=64KB, if you want to write 128KB, > the first half is written in the first disk, the other in the 2nd disk. If > you want to write 96kb, the first 64 are written in the first disk, the last > part in the 2nd, only on a different BG. > So yes there is a fragmentation from a logical point of view; from a physical > point of view the data is spread on the disks in any case. > > In any case, you are right, we should gather some data, because the > performance impact are no so clear. They're pretty clear, and there's a lot written about small file size and parity raid performance being shit, no matter the implementation (md, ZFS, Btrfs, hardware maybe less so just because of all the caching and extra processing hardware that's dedicated to the task). The linux-raid@ list is full of optimizations for this that are use case specific. One of those that often comes up is how badly suited raid56 are for e.g. mail servers, tons of small file reads and writes, and all the disk contention that comes up, and it's even worse when you lose a disk, and even if you're running raid 6 and lose two disk it's really god awful. It can be unexpectedly a disqualifying setup without prior testing in that condition: can your workload really be usable for two or three days in a double degraded state on that raid6? *shrug* Parity raid is well suited for full stripe reads and writes, lots of sequential writes. Ergo a small file is anything less than a full stripe write. Of course, delayed allocation can end up making for more full stripe writes. But now you have more RMW which is the real performance killer, again no matter the raid. > > I am not worried abut having different BG; we have problem with these because > we never developed tool to handle this issue properly (i.e. a daemon which > starts a balance when needed). But I hope that this will be solved in future. > > In any case, the all solutions proposed have their trade off: > > - a) as is: write hole bug > - b) variable stripe size (like ZFS): big impact on how btrfs handle the > extent. limited waste of space > - c) logging data before writing: we wrote the data two times in a short time > window. Moreover the log area is written several order of magnitude more than > the other area; there was some patch around > - d) rounding the writing to the stripe size: waste of space; simple to > implement; > - e) different BG with different stripe size: limited waste of space; logical > fragmentation. I'd say for sure you're worse off with metadata raid5 vs metadata raid1. And if there are many devices you might be better off with metadata raid1 even on a raid6, it's not an absolute certainty you lose the file system with a 2nd drive failure - it depends on the device and what chunk copies happen to be on it. But at the least if you have a script or some warning you can relatively easily rebalance ... HMMM Actually that should be a test. Single drive degraded raid6 with metadata raid1, can you do a metadata only balance to force the missing copy of metadata to be replicated again? In theory this should be quite fast. -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH typo-fixed] fstests: btrfs: 159 superblock corruption test case
On Thu, Mar 29, 2018 at 06:28:48PM +0800, Anand Jain wrote: > Verify if the superblock corruption is handled correctly. > > Signed-off-by: Anand Jain> --- > tests/btrfs/159 | 142 > > tests/btrfs/159.out | 35 + > tests/btrfs/group | 1 + > 3 files changed, 178 insertions(+) > create mode 100755 tests/btrfs/159 > create mode 100644 tests/btrfs/159.out > > diff --git a/tests/btrfs/159 b/tests/btrfs/159 > new file mode 100755 > index ..f42ba4b777e7 > --- /dev/null > +++ b/tests/btrfs/159 > @@ -0,0 +1,142 @@ > +#! /bin/bash > +# FS QA Test 159 > +# > +# Test if the superblock corruption is handled correctly. > +# - Check and validate superblock csum in mount and scan context > +# - Check and validate superblock for all disks in the fs > +# - Make sure if the copies are really a copy of the primary superblock > +# > +#- > +# Copyright (c) 2018 Oracle. All Rights Reserved. > +# Author: Anand Jain > +# > +# This program is free software; you can redistribute it and/or > +# modify it under the terms of the GNU General Public License as > +# published by the Free Software Foundation. > +# > +# This program is distributed in the hope that it would be useful, > +# but WITHOUT ANY WARRANTY; without even the implied warranty of > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > +# GNU General Public License for more details. > +# > +# You should have received a copy of the GNU General Public License > +# along with this program; if not, write the Free Software Foundation, > +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA > +#- > +# > + > +seq=`basename $0` > +seqres=$RESULT_DIR/$seq > +echo "QA output created by $seq" > +status=1 # failure is the default! Please use './new btrfs' to generate new test template, which defines variables like 'tmp' and 'here', these variables may not be used in the test explicitly, but they could be used by some helper functions, especially the "$tmp" var. (Or please don't remove them from template :) > + > +_cleanup() > +{ > + _scratch_dev_pool_put > +} > +trap "_cleanup; exit \$status" 0 1 2 3 15 > + > +# get standard environment, filters and checks > +. ./common/rc > +. ./common/filter > +. ./common/module > + > +# remove previous $seqres.full before test > +rm -f $seqres.full > + > +# real QA test starts here > + > +_supported_fs btrfs > +_supported_os Linux > +_require_scratch_dev_pool 2 > +_require_loadable_fs_module "btrfs" > +_require_command "$WIPEFS_PROG" wipefs > + > +_scratch_dev_pool_get 2 > + > +MNT_OPT=$(echo $MOUNT_OPTIONS | cut -d" " -f2-) I'm not sure about the purpose of this MNT_OPT, and it could be empty and causes problems in test. Please see below. > +DEV_GOOD=$(echo $SCRATCH_DEV_POOL | awk '{print $1}') > +DEV_BAD=$(echo $SCRATCH_DEV_POOL | awk '{print $2}') > + > +wipe() > +{ > + $WIPEFS_PROG -a $DEV_GOOD > /dev/null 2>&1 > + $WIPEFS_PROG -a $DEV_BAD > /dev/null 2>&1 > +} > + > +# If copy1 fsid does not match with primary fail the scan thus the mount as > well > +check_copy1_fsid() > +{ > + local bytenr=67108864 > + echo -e "\\ncheck_copy1_fsid\\n{" | tee -a $seqres.full Hmm, I don't think the "{ }" block is necessary in .out file, IMHO it makes the diff context harder to read when test fails. Just echo the test name would be fine. > + > + wipe > + $MKFS_BTRFS_PROG -fq $DEV_GOOD > + $MKFS_BTRFS_PROG -fq $DEV_BAD > + _reload_fs_module "btrfs" > + > + dd status=none of=$DEV_BAD if=$DEV_GOOD ibs=1 obs=1 skip=$bytenr > seek=$bytenr count=4K|\ > + tee -a $seqres.full > + > + _mount -o $MNT_OPT $DEV_BAD $SCRATCH_MNT 2>&1 | _filter_scratch_pool When $MOUNT_OPTIONS is empty, this mount fails as +mount: can't find /mnt/scratch in /etc/fstab And I think we need to umount $SCRATCH_MNT here in case a buggy btrfs module allowed the mount, as the next test assumes the device is not mounted anywhere. Otherwise I see failures like: +ERROR: /dev/loop1 is mounted +mount failed > + > + echo -e "good\\n}" | tee -a $seqres.full > +} > + > +# As in Linux kernel 4.16 if the primary is corrupted mount will fail. > +# Which might change in the long run. > +check_primary() > +{ > + local bytenr=65536 > + echo -e "\\ncheck_primary\\n{" | tee -a $seqres.full > + > + wipe > + _scratch_pool_mkfs "-mraid1 -draid1" > + _reload_fs_module "btrfs" > + > + #To corrupt a disk block, read in hex, write in dec > + od -j$bytenr -N1 -An -x $DEV_BAD |\ > + dd status=none of=$DEV_BAD ibs=1 obs=1 seek=$bytenr count=1|\ > + tee -a $seqres.full > + _mount -o $MNT_OPT,device=$DEV_GOOD $DEV_BAD $SCRATCH_MNT 2>&1 | > _filter_scratch_pool Same here, need to umount
Re: System freezes a lot - btrfs trace in dmesg - symptom or cause?
On 2018年04月04日 05:35, Richard Schwab wrote: > Hi there, > > I'm currently having some issues with my system freezing a lot, from a > few seconds up to a few minutes. I haven't figured out yet what might be > causing this, however, coincidentally some btrfs notices appeared in my > dmesg after I had a freeze for about 5 minutes. I'm wondering whether > this is just another symptom of my system having some weird issues or > whether this might actually be the cause of the freezes I've been > experiencing. > > $ uname -a > Linux mypc 4.15.14-1-ARCH #1 SMP PREEMPT Wed Mar 28 17:34:29 UTC 2018 > x86_64 GNU/Linux > > $ btrfs --version > btrfs-progs v4.15.1 > > $ btrfs fi show > Label: none uuid: uuid1 > Total devices 1 FS bytes used 226.60GiB > devid 1 size 237.37GiB used 237.31GiB path /dev/mapper/cryptroot256g > > Label: none uuid: uuid2 > Total devices 1 FS bytes used 116.61MiB > devid 1 size 1.00GiB used 460.00MiB path /dev/sdb2 > > $ btrfs fi df / > Data, single: total=231.28GiB, used=223.66GiB > System, single: total=32.00MiB, used=48.00KiB > Metadata, single: total=6.00GiB, used=2.93GiB > GlobalReserve, single: total=512.00MiB, used=0.00B > > Partial dmesg is attached as file. Please tell me if you need more than > this, however, before these lines it only contained the startup process > (up until 87s). [10691.693202] btrfs-transacti D0 447 2 0x8000 [10691.693207] Call Trace: [10691.693221] ? __schedule+0x24b/0x8c0 [10691.693229] ? enqueue_entity+0x59b/0xc70 [10691.693234] schedule+0x32/0x90 [10691.693239] io_schedule+0x12/0x40 [10691.693247] __lock_page+0xf8/0x130 [10691.693252] ? add_to_page_cache_lru+0xe0/0xe0 [10691.693309] read_extent_buffer_pages+0xf0/0x2f0 [btrfs] [10691.693337] ? btrfs_alloc_root+0x30/0x30 [btrfs] [10691.693365] btree_read_extent_buffer_pages+0x94/0xf0 [btrfs] [10691.693389] read_block_for_search.isra.14+0x15a/0x2f0 [btrfs] This means the tree block pages btrfs trying to read is being locked by other guy. And considering later operations also waiting on the same page, it looks like a deadlock. Not sure if it's something serious or just bad luck. But looks like there is something wrong with btrfs tree locks. Is it reproducible and what's the workload? And if it's just bad luck, reboot the system and run "btrfs check" on the fs is highly recommended. Thanks, Qu > > Sincerely > Richard Schwab > signature.asc Description: OpenPGP digital signature
Re: System freezes a lot - btrfs trace in dmesg - symptom or cause?
FWIW, gmail considers the original post a phishing attempt, and put it into spam. Also the sending mail server marked the header with a request that subsequent mail servers fail to forward the email, which effectively makes it incompatible with mail servers which modify the email body, making dkim message verification fail. So I suggest either clearing it up with the mail provider or using a different email for lists. This message has a from address in central-intelligence.agency but has failed central-intelligence.agency's required tests for authentication. Learn more Authentication-Results: mx.google.com; dkim=neutral (body hash did not verify) header.i=@central-intelligence.agency header.s=default header.b=eqsLMYnQ; spf=pass (google.com: best guess record for domain of linux-btrfs-ow...@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-btrfs-ow...@vger.kernel.org; dmarc=fail (p=REJECT sp=REJECT dis=QUARANTINE) header.from=central-intelligence.agency On Tue, Apr 3, 2018 at 3:35 PM, Richard Schwabwrote: > Hi there, > > I'm currently having some issues with my system freezing a lot, from a > few seconds up to a few minutes. I haven't figured out yet what might be > causing this, however, coincidentally some btrfs notices appeared in my > dmesg after I had a freeze for about 5 minutes. I'm wondering whether > this is just another symptom of my system having some weird issues or > whether this might actually be the cause of the freezes I've been > experiencing. >From the attached text file 10691.693188] INFO: task btrfs-transacti:447 blocked for more than 120 seconds. [10691.693196] Tainted: P O 4.15.14-1-ARCH #1 [10691.693199] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10691.693202] btrfs-transacti D0 447 2 0x8000 [10691.693207] Call Trace: [10691.693221] ? __schedule+0x24b/0x8c0 [10691.693229] ? enqueue_entity+0x59b/0xc70 [10691.693234] schedule+0x32/0x90 [10691.693239] io_schedule+0x12/0x40 [10691.693247] __lock_page+0xf8/0x130 [10691.693252] ? add_to_page_cache_lru+0xe0/0xe0 [10691.693309] read_extent_buffer_pages+0xf0/0x2f0 [btrfs] [10691.693337] ? btrfs_alloc_root+0x30/0x30 [btrfs] [10691.693365] btree_read_extent_buffer_pages+0x94/0xf0 [btrfs] [10691.693389] read_block_for_search.isra.14+0x15a/0x2f0 [btrfs] [10691.693413] btrfs_search_slot+0x3cd/0xa00 [btrfs] [10691.693437] ? block_group_cache_tree_search+0xba/0xd0 [btrfs] [10691.693460] btrfs_insert_empty_items+0x66/0xb0 [btrfs] [10691.693484] alloc_reserved_file_extent+0x96/0x320 [btrfs] [10691.693511] __btrfs_run_delayed_refs+0x908/0x1180 [btrfs] [10691.693539] btrfs_run_delayed_refs+0xe5/0x1b0 [btrfs] [10691.693566] btrfs_start_dirty_block_groups+0x2cb/0x470 [btrfs] [10691.693596] btrfs_commit_transaction+0x117/0x930 [btrfs] [10691.693626] ? start_transaction+0x9e/0x420 [btrfs] [10691.693654] transaction_kthread+0x17b/0x1a0 [btrfs] [10691.693684] ? btrfs_cleanup_transaction+0x570/0x570 [btrfs] [10691.693687] kthread+0x113/0x130 [10691.693691] ? kthread_create_on_node+0x70/0x70 [10691.693695] ret_from_fork+0x35/0x40 [10691.693703] INFO: task journal-offline:7151 blocked for more than 120 seconds. Off hand it just looks like noise, it's only an info message. But of course the hung task is super annoying and maybe makes it worse than if it were to just crash. Anyway, best to reproduce and issue sysrq + w and hopefully a dev will get around to giving an opinion on what's going on. In the meantime you can try reverting to 4.14.32 or try 4.16.0. I'm not seeing anything off hand in 4.15.15 that relates, so I suspect it still happens with that version. Another thing to try is booting without the proprietary driver you're using that taints the kernel. It may very well be unrelated to this problem, but ... https://www.kernel.org/doc/Documentation/admin-guide/sysrq.rst Translated it means do this: sudo -i echo 1 > /proc/sys/kernel/sysrq echo w > /proc/sysrq-trigger exit sudo journalctl -o short-monotonic -k > kernelmsg+sysrqw.log And then post that log file somewhere like dropbox or whatever, it may be too big to attach to the list but you can try it. -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Status of RAID5/6
On Tue, Apr 03, 2018 at 07:03:06PM +0200, Goffredo Baroncelli wrote: > On 04/03/2018 02:31 AM, Zygo Blaxell wrote: > > On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote: > >> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote: > >>> On 2018-04-02 11:18, Goffredo Baroncelli wrote: > I thought that a possible solution is to create BG with different > >>> number of data disks. E.g. supposing to have a raid 6 system with 6 > >>> disks, where 2 are parity disk; we should allocate 3 BG > BG #1: 1 data disk, 2 parity disks > BG #2: 2 data disks, 2 parity disks, > BG #3: 4 data disks, 2 parity disks > > For simplicity, the disk-stripe length is assumed = 4K. > > So If you have a write with a length of 4 KB, this should be placed > >>> in BG#1; if you have a write with a length of 4*3KB, the first 8KB, > >>> should be placed in in BG#2, then in BG#1. > This would avoid space wasting, even if the fragmentation will > >>> increase (but shall the fragmentation matters with the modern solid > >>> state disks ?). > >> I don't really see why this would increase fragmentation or waste space. > > > Oh, wait, yes I do. If there's a write of 6 blocks, we would have > > to split an extent between BG #3 (the first 4 blocks) and BG #2 (the > > remaining 2 blocks). It also flips the usual order of "determine size > > of extent, then allocate space for it" which might require major surgery > > on the btrfs allocator to implement. > > I have to point out that in any case the extent is physically > interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if > you want to write 128KB, the first half is written in the first disk, > the other in the 2nd disk. If you want to write 96kb, the first 64 > are written in the first disk, the last part in the 2nd, only on a > different BG. The "only on a different BG" part implies something expensive, either a seek or a new erase page depending on the hardware. Without that, nearby logical blocks are nearby physical blocks as well. > So yes there is a fragmentation from a logical point of view; from a > physical point of view the data is spread on the disks in any case. What matters is the extent-tree point of view. There is (currently) no fragmentation there, even for RAID5/6. The extent tree is unaware of RAID5/6 (to its peril). ZFS makes its thing-like-the-extent-tree aware of RAID5/6, and it can put a stripe of any size anywhere. If we're going to do that in btrfs, you might as well just do what ZFS does. OTOH, variable-size block groups give us read-compatibility with old kernel versions (and write-compatibility for that matter--a kernel that didn't know about the BG separation would just work but have write hole). If an application does a loop writing 68K then fsync(), the multiple-BG solution adds two seeks to read every 68K. That's expensive if sequential read bandwidth is more scarce than free space. > In any case, you are right, we should gather some data, because the > performance impact are no so clear. > > I am not worried abut having different BG; we have problem with these > because we never developed tool to handle this issue properly (i.e. a > daemon which starts a balance when needed). But I hope that this will > be solved in future. Balance daemons are easy to the point of being trivial to write in Python. The balancing itself is quite expensive and invasive: can't usefully ionice it, can only abort it on block group boundaries, can't delete snapshots while it's running. If balance could be given a vrange that was the size of one extent...then we could talk about daemons. > In any case, the all solutions proposed have their trade off: > > - a) as is: write hole bug > - b) variable stripe size (like ZFS): big impact on how btrfs handle > the extent. limited waste of space > - c) logging data before writing: we wrote the data two times in a > short time window. Moreover the log area is written several order of > magnitude more than the other area; there was some patch around > - d) rounding the writing to the stripe size: waste of space; simple > to implement; > - e) different BG with different stripe size: limited waste of space; > logical fragmentation. Also: - f) avoiding writes to partially filled stripes: free space fragmentation; simple to implement (ssd_spread does it accidentally) The difference between d) and f) is that d) allocates the space to the extent while f) leaves the space unallocated, but skips any free space fragments smaller than the stripe size when allocating. f) gets the space back with a balance (i.e. it is exactly as space-efficient as (a) after balance). > * c),d),e) are applied only for the tail of the extent, in case the size is less than the stripe size. It's only necessary to split an extent if there are no other writes in the same transaction that could be combined with the extent tail into a single RAID stripe. As long as
System freezes a lot - btrfs trace in dmesg - symptom or cause?
Hi there, I'm currently having some issues with my system freezing a lot, from a few seconds up to a few minutes. I haven't figured out yet what might be causing this, however, coincidentally some btrfs notices appeared in my dmesg after I had a freeze for about 5 minutes. I'm wondering whether this is just another symptom of my system having some weird issues or whether this might actually be the cause of the freezes I've been experiencing. $ uname -a Linux mypc 4.15.14-1-ARCH #1 SMP PREEMPT Wed Mar 28 17:34:29 UTC 2018 x86_64 GNU/Linux $ btrfs --version btrfs-progs v4.15.1 $ btrfs fi show Label: none uuid: uuid1 Total devices 1 FS bytes used 226.60GiB devid 1 size 237.37GiB used 237.31GiB path /dev/mapper/cryptroot256g Label: none uuid: uuid2 Total devices 1 FS bytes used 116.61MiB devid 1 size 1.00GiB used 460.00MiB path /dev/sdb2 $ btrfs fi df / Data, single: total=231.28GiB, used=223.66GiB System, single: total=32.00MiB, used=48.00KiB Metadata, single: total=6.00GiB, used=2.93GiB GlobalReserve, single: total=512.00MiB, used=0.00B Partial dmesg is attached as file. Please tell me if you need more than this, however, before these lines it only contained the startup process (up until 87s). Sincerely Richard Schwab [ 6995.028268] perf: interrupt took too long (2513 > 2500), lowering kernel.perf_event_max_sample_rate to 79000 [10513.344255] Chrome_~dThread[6211]: segfault at 0 ip 7f0cf13d6aed sp 7f0ceb37eb00 error 6 in libxul.so[7f0cf0fe2000+4c14000] [10513.344553] Chrome_~dThread[3242]: segfault at 0 ip 7f8bcb0d6aed sp 7f8bc507eb00 error 6 [10513.344557] Chrome_~dThread[3110]: segfault at 0 ip 7f13a50d6aed sp 7f139f07eb00 error 6 in libxul.so[7f13a4ce2000+4c14000] [10513.344562] in libxul.so[7f8bcace2000+4c14000] [10513.344704] Chrome_~dThread[3202]: segfault at 0 ip 7f089e7d6aed sp 7f089877eb00 error 6 in libxul.so[7f089e3e2000+4c14000] [10691.693188] INFO: task btrfs-transacti:447 blocked for more than 120 seconds. [10691.693196] Tainted: P O 4.15.14-1-ARCH #1 [10691.693199] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10691.693202] btrfs-transacti D0 447 2 0x8000 [10691.693207] Call Trace: [10691.693221] ? __schedule+0x24b/0x8c0 [10691.693229] ? enqueue_entity+0x59b/0xc70 [10691.693234] schedule+0x32/0x90 [10691.693239] io_schedule+0x12/0x40 [10691.693247] __lock_page+0xf8/0x130 [10691.693252] ? add_to_page_cache_lru+0xe0/0xe0 [10691.693309] read_extent_buffer_pages+0xf0/0x2f0 [btrfs] [10691.693337] ? btrfs_alloc_root+0x30/0x30 [btrfs] [10691.693365] btree_read_extent_buffer_pages+0x94/0xf0 [btrfs] [10691.693389] read_block_for_search.isra.14+0x15a/0x2f0 [btrfs] [10691.693413] btrfs_search_slot+0x3cd/0xa00 [btrfs] [10691.693437] ? block_group_cache_tree_search+0xba/0xd0 [btrfs] [10691.693460] btrfs_insert_empty_items+0x66/0xb0 [btrfs] [10691.693484] alloc_reserved_file_extent+0x96/0x320 [btrfs] [10691.693511] __btrfs_run_delayed_refs+0x908/0x1180 [btrfs] [10691.693539] btrfs_run_delayed_refs+0xe5/0x1b0 [btrfs] [10691.693566] btrfs_start_dirty_block_groups+0x2cb/0x470 [btrfs] [10691.693596] btrfs_commit_transaction+0x117/0x930 [btrfs] [10691.693626] ? start_transaction+0x9e/0x420 [btrfs] [10691.693654] transaction_kthread+0x17b/0x1a0 [btrfs] [10691.693684] ? btrfs_cleanup_transaction+0x570/0x570 [btrfs] [10691.693687] kthread+0x113/0x130 [10691.693691] ? kthread_create_on_node+0x70/0x70 [10691.693695] ret_from_fork+0x35/0x40 [10691.693703] INFO: task journal-offline:7151 blocked for more than 120 seconds. [10691.693707] Tainted: P O 4.15.14-1-ARCH #1 [10691.693709] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10691.693712] journal-offline D0 7151 1 0x0104 [10691.693715] Call Trace: [10691.693721] ? __schedule+0x24b/0x8c0 [10691.693726] ? preempt_count_add+0x68/0xa0 [10691.693730] schedule+0x32/0x90 [10691.693759] wait_current_trans+0xc3/0xf0 [btrfs] [10691.693763] ? wait_woken+0x80/0x80 [10691.693792] start_transaction+0x332/0x420 [btrfs] [10691.693824] btrfs_sync_file+0x2b6/0x440 [btrfs] [10691.693832] do_fsync+0x38/0x60 [10691.693836] SyS_fsync+0xc/0x10 [10691.693840] do_syscall_64+0x74/0x190 [10691.693845] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [10691.693848] RIP: 0033:0x7f7147b5a0ec [10691.693850] RSP: 002b:7f714191cd30 EFLAGS: 0293 ORIG_RAX: 004a [10691.693854] RAX: ffda RBX: 55d002846410 RCX: 7f7147b5a0ec [10691.693856] RDX: RSI: 7f71474b6811 RDI: 0015 [10691.693857] RBP: 7f71474b7c60 R08: 7f714191d700 R09: 7f714191d700 [10691.693859] R10: 0011 R11: 0293 R12: 0002 [10691.693861] R13: 7fff8456870f R14: 7fff84568710 R15: 0016 [10691.693866] INFO: task journal-offline:7152 blocked for more than 120 seconds. [10691.693869]
Re: Please add 21035965f60b ("bitmap: fix memset optimization on big-endian systems") to the stable tree
On Tue, Apr 3, 2018 at 11:14 AM, Omar Sandovalwrote: > Just wanted to make sure this doesn't get missed because I misspelled > the stable email address in the patch. It applies to v4.13+. The "sta...@kernel.org" address for a stable cc is actually better than stable@vger" imho, because afaik it still matches Greg's automation (that also picks up on "Fixes:" tags), and it does *not* cause extra email flurries when people use git-send-email scripts. iirc, we had some vendor leak a security bug early because they actually just cc'd everybody - including the stable list - on the not-yet-publicly-committed change. Greg - if your automation has changed, and you actually really want the "vger", let me know. Because I tend to just use "sta...@kernel.org" Linus -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 14/16] btrfs: drop lock parameter from update_ioctl_balance_args and rename
The parameter controls locking of the stats part but we can lock it unconditionally, as this only happens once when balance starts. This is not performance critical. Add the prefix for an exported function. Signed-off-by: David Sterba--- fs/btrfs/ctree.h | 2 +- fs/btrfs/ioctl.c | 14 +- fs/btrfs/volumes.c | 2 +- 3 files changed, 7 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index fda94a264eb7..7ac2e69bfb03 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3258,7 +3258,7 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, u64 newer_than, unsigned long max_pages); void btrfs_get_block_group_info(struct list_head *groups_list, struct btrfs_ioctl_space_info *space); -void update_ioctl_balance_args(struct btrfs_fs_info *fs_info, int lock, +void btrfs_update_ioctl_balance_args(struct btrfs_fs_info *fs_info, struct btrfs_ioctl_balance_args *bargs); ssize_t btrfs_dedupe_file_range(struct file *src_file, u64 loff, u64 olen, struct file *dst_file, u64 dst_loff); diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index e6f21f30416a..45759f1ea438 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -4499,7 +4499,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info, return ret; } -void update_ioctl_balance_args(struct btrfs_fs_info *fs_info, int lock, +void btrfs_update_ioctl_balance_args(struct btrfs_fs_info *fs_info, struct btrfs_ioctl_balance_args *bargs) { struct btrfs_balance_control *bctl = fs_info->balance_ctl; @@ -4517,13 +4517,9 @@ void update_ioctl_balance_args(struct btrfs_fs_info *fs_info, int lock, memcpy(>meta, >meta, sizeof(bargs->meta)); memcpy(>sys, >sys, sizeof(bargs->sys)); - if (lock) { - spin_lock(_info->balance_lock); - memcpy(>stat, >stat, sizeof(bargs->stat)); - spin_unlock(_info->balance_lock); - } else { - memcpy(>stat, >stat, sizeof(bargs->stat)); - } + spin_lock(_info->balance_lock); + memcpy(>stat, >stat, sizeof(bargs->stat)); + spin_unlock(_info->balance_lock); } static long btrfs_ioctl_balance(struct file *file, void __user *arg) @@ -4709,7 +4705,7 @@ static long btrfs_ioctl_balance_progress(struct btrfs_fs_info *fs_info, goto out; } - update_ioctl_balance_args(fs_info, 1, bargs); + btrfs_update_ioctl_balance_args(fs_info, bargs); if (copy_to_user(arg, bargs, sizeof(*bargs))) ret = -EFAULT; diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 2956e7b4cb9f..4b7d54abfa34 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -3897,7 +3897,7 @@ int btrfs_balance(struct btrfs_balance_control *bctl, if (bargs) { memset(bargs, 0, sizeof(*bargs)); - update_ioctl_balance_args(fs_info, 0, bargs); + btrfs_update_ioctl_balance_args(fs_info, bargs); } if ((ret && ret != -ECANCELED && ret != -ENOSPC) || -- 2.16.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 13/16] btrfs: remove redundant read-only check from btrfs_cancel_balance
Balance cannot be started on a read-only filesystem and will have to finish/exit before eg. going to read-only via remount. Cancelling does not need to check for that. In case the filesystem is forcibly set to read-only after an error, balance will finish anyway and if the cancel call is too fast it will just wait for that to happen. Again does not have to check. Signed-off-by: David Sterba--- fs/btrfs/volumes.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index a0420af1fad9..2956e7b4cb9f 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -4053,9 +4053,6 @@ int btrfs_pause_balance(struct btrfs_fs_info *fs_info) int btrfs_cancel_balance(struct btrfs_fs_info *fs_info) { - if (sb_rdonly(fs_info->sb)) - return -EROFS; - mutex_lock(_info->balance_mutex); if (!fs_info->balance_ctl) { mutex_unlock(_info->balance_mutex); -- 2.16.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 12/16] btrfs: track running balance in a simpler way
Currently fs_info::balance_running is 0 or 1 and does not use the semantics of atomics. The pause and cancel check for 0, that can happen only after __btrfs_balance exits for whatever reason. Parallel calls to balance ioctl may enter btrfs_ioctl_balance multiple times but will block on the balance_mutex that protects the fs_info::flags bit. Signed-off-by: David Sterba--- fs/btrfs/ctree.h | 6 +- fs/btrfs/disk-io.c | 1 - fs/btrfs/ioctl.c | 6 +++--- fs/btrfs/volumes.c | 18 ++ 4 files changed, 18 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 011cb9a8a967..fda94a264eb7 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -726,6 +726,11 @@ struct btrfs_delayed_root; * (device replace, resize, device add/delete, balance) */ #define BTRFS_FS_EXCL_OP 16 +/* + * Indicate that balance has been set up from the ioctl and is in the main + * phase. The fs_info::balance_ctl is initialized. + */ +#define BTRFS_FS_BALANCE_RUNNING 17 struct btrfs_fs_info { u8 fsid[BTRFS_FSID_SIZE]; @@ -991,7 +996,6 @@ struct btrfs_fs_info { /* restriper state */ spinlock_t balance_lock; struct mutex balance_mutex; - atomic_t balance_running; atomic_t balance_pause_req; atomic_t balance_cancel_req; struct btrfs_balance_control *balance_ctl; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index c0482ecea11f..b62559dfb053 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2168,7 +2168,6 @@ static void btrfs_init_balance(struct btrfs_fs_info *fs_info) { spin_lock_init(_info->balance_lock); mutex_init(_info->balance_mutex); - atomic_set(_info->balance_running, 0); atomic_set(_info->balance_pause_req, 0); atomic_set(_info->balance_cancel_req, 0); fs_info->balance_ctl = NULL; diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 410037281f2a..e6f21f30416a 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -4506,7 +4506,7 @@ void update_ioctl_balance_args(struct btrfs_fs_info *fs_info, int lock, bargs->flags = bctl->flags; - if (atomic_read(_info->balance_running)) + if (test_bit(BTRFS_FS_BALANCE_RUNNING, _info->flags)) bargs->state |= BTRFS_BALANCE_STATE_RUNNING; if (atomic_read(_info->balance_pause_req)) bargs->state |= BTRFS_BALANCE_STATE_PAUSE_REQ; @@ -4558,7 +4558,7 @@ static long btrfs_ioctl_balance(struct file *file, void __user *arg) mutex_lock(_info->balance_mutex); if (fs_info->balance_ctl) { /* this is either (2) or (3) */ - if (!atomic_read(_info->balance_running)) { + if (!test_bit(BTRFS_FS_BALANCE_RUNNING, _info->flags)) { mutex_unlock(_info->balance_mutex); /* * Lock released to allow other waiters to continue, @@ -4567,7 +4567,7 @@ static long btrfs_ioctl_balance(struct file *file, void __user *arg) mutex_lock(_info->balance_mutex); if (fs_info->balance_ctl && - !atomic_read(_info->balance_running)) { + !test_bit(BTRFS_FS_BALANCE_RUNNING, _info->flags)) { /* this is (3) */ need_unlock = false; goto locked; diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index d1e3486343ae..a0420af1fad9 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -3886,13 +3886,14 @@ int btrfs_balance(struct btrfs_balance_control *bctl, spin_unlock(_info->balance_lock); } - atomic_inc(_info->balance_running); + ASSERT(!test_bit(BTRFS_FS_BALANCE_RUNNING, _info->flags)); + set_bit(BTRFS_FS_BALANCE_RUNNING, _info->flags); mutex_unlock(_info->balance_mutex); ret = __btrfs_balance(fs_info); mutex_lock(_info->balance_mutex); - atomic_dec(_info->balance_running); + clear_bit(BTRFS_FS_BALANCE_RUNNING, _info->flags); if (bargs) { memset(bargs, 0, sizeof(*bargs)); @@ -4031,16 +4032,16 @@ int btrfs_pause_balance(struct btrfs_fs_info *fs_info) return -ENOTCONN; } - if (atomic_read(_info->balance_running)) { + if (test_bit(BTRFS_FS_BALANCE_RUNNING, _info->flags)) { atomic_inc(_info->balance_pause_req); mutex_unlock(_info->balance_mutex); wait_event(fs_info->balance_wait_q, - atomic_read(_info->balance_running) == 0); + !test_bit(BTRFS_FS_BALANCE_RUNNING, _info->flags)); mutex_lock(_info->balance_mutex); /* we are good with balance_ctl ripped off from under us */ - BUG_ON(atomic_read(_info->balance_running));
[PATCH 16/16] btrfs: open code set_balance_control
The helper is quite simple and I'd like to see the locking in the caller. Signed-off-by: David Sterba--- fs/btrfs/volumes.c | 20 1 file changed, 4 insertions(+), 16 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index e93c7ba44d06..db9f72c6af85 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -3178,21 +3178,6 @@ static void update_balance_args(struct btrfs_balance_control *bctl) } } -/* - * Should be called with balance mutex held to protect against checking the - * balance status or progress. Same goes for reset_balance_state. - */ -static void set_balance_control(struct btrfs_balance_control *bctl) -{ - struct btrfs_fs_info *fs_info = bctl->fs_info; - - BUG_ON(fs_info->balance_ctl); - - spin_lock(_info->balance_lock); - fs_info->balance_ctl = bctl; - spin_unlock(_info->balance_lock); -} - /* * Clear the balance status in fs_info and delete the balance item from disk. */ @@ -3878,7 +3863,10 @@ int btrfs_balance(struct btrfs_balance_control *bctl, if (!(bctl->flags & BTRFS_BALANCE_RESUME)) { BUG_ON(ret == -EEXIST); - set_balance_control(bctl); + BUG_ON(fs_info->balance_ctl); + spin_lock(_info->balance_lock); + fs_info->balance_ctl = bctl; + spin_unlock(_info->balance_lock); } else { BUG_ON(ret != -EEXIST); spin_lock(_info->balance_lock); -- 2.16.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 11/16] btrfs: kill btrfs_fs_info::volume_mutex
Mutual exclusion of device add/rm and balance was done by the volume mutex up to version 3.7. The commit 5ac00addc7ac091109 ("Btrfs: disallow mutually exclusive admin operations from user mode") added a bit that essentially tracked the same information. The status bit has an advantage over a mutex that it can be set without restrictions of function context, so it started to be used in the mount-time resuming of balance or device replace. But we don't really need to track the same information in two ways. 1) After the previous cleanups, the main ioctl handlers for add/del/resize copy the EXCL_OP bit next to the volume mutex, here it's clearly safe. 2) Resuming balance during mount or after rw remount will set only the EXCL_OP bit and the volume_mutex is held in the kernel thread that calls btrfs_balance. 3) Resuming device replace during mount or after rw remount is done after balance and is excluded by the EXCL_OP bit. It does not take the volume_mutex at all and completely relies on the EXCL_OP bit. 4) The resuming of balance and dev-replace cannot hapen at the same time as the ioctls cannot be started in parallel. Nevertheless, a crafted image could trigger that and a warning is printed. 5) Balance is normally excluded by EXCL_OP and also uses own mutex to protect against concurrent access to its status data. There's some trickery to maintain the right lock nesting in case we need to reexamine the status in btrfs_ioctl_balance. The volume_mutex is removed and the unlock/lock sequence is left in place as we might expect other waiters to proceed. 6) Similar to 5, the unlock/lock sequence is kept in btrfs_cancel_balance to allow waiters to continue. Signed-off-by: David Sterba--- fs/btrfs/ctree.h | 1 - fs/btrfs/disk-io.c | 1 - fs/btrfs/extent-tree.c | 2 +- fs/btrfs/ioctl.c | 17 - fs/btrfs/volumes.c | 36 ++-- 5 files changed, 15 insertions(+), 42 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 0eb55825862a..011cb9a8a967 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -826,7 +826,6 @@ struct btrfs_fs_info { struct mutex transaction_kthread_mutex; struct mutex cleaner_mutex; struct mutex chunk_mutex; - struct mutex volume_mutex; /* * this is taken to make sure we don't set block groups ro after diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 07b5e6f7df67..c0482ecea11f 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2605,7 +2605,6 @@ int open_ctree(struct super_block *sb, mutex_init(_info->chunk_mutex); mutex_init(_info->transaction_kthread_mutex); mutex_init(_info->cleaner_mutex); - mutex_init(_info->volume_mutex); mutex_init(_info->ro_block_group_mutex); init_rwsem(_info->commit_root_sem); init_rwsem(_info->cleanup_work_sem); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index f30548d7e0d2..90d28a3727c6 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4122,7 +4122,7 @@ static void set_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags) * returns target flags in extended format or 0 if restripe for this * chunk_type is not in progress * - * should be called with either volume_mutex or balance_lock held + * should be called with balance_lock held */ static u64 get_restripe_target(struct btrfs_fs_info *fs_info, u64 flags) { diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 2f172122b63d..410037281f2a 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1470,7 +1470,6 @@ static noinline int btrfs_ioctl_resize(struct file *file, return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS; } - mutex_lock(_info->volume_mutex); vol_args = memdup_user(arg, sizeof(*vol_args)); if (IS_ERR(vol_args)) { ret = PTR_ERR(vol_args); @@ -1578,7 +1577,6 @@ static noinline int btrfs_ioctl_resize(struct file *file, out_free: kfree(vol_args); out: - mutex_unlock(_info->volume_mutex); clear_bit(BTRFS_FS_EXCL_OP, _info->flags); mnt_drop_write_file(file); return ret; @@ -2626,7 +2624,6 @@ static long btrfs_ioctl_add_dev(struct btrfs_fs_info *fs_info, void __user *arg) if (test_and_set_bit(BTRFS_FS_EXCL_OP, _info->flags)) return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS; - mutex_lock(_info->volume_mutex); vol_args = memdup_user(arg, sizeof(*vol_args)); if (IS_ERR(vol_args)) { ret = PTR_ERR(vol_args); @@ -2641,7 +2638,6 @@ static long btrfs_ioctl_add_dev(struct btrfs_fs_info *fs_info, void __user *arg) kfree(vol_args); out: - mutex_unlock(_info->volume_mutex); clear_bit(BTRFS_FS_EXCL_OP, _info->flags); return ret; } @@ -2674,7 +2670,6 @@ static long btrfs_ioctl_rm_dev_v2(struct file *file, void
[PATCH 15/16] btrfs: use mutex in btrfs_resume_balance_async
While the spinlock does not cause problems, using the mutex is more correct and consistent with others. The global status of balance is eg. checked from btrfs_pause_balance or btrfs_cancel_balance with mutex. Resuming balance happens during mount or ro->rw remount. In the former case, no other user of the balance_ctl exists, in the latter, balance cannot run until the ro/rw transition is finished. Signed-off-by: David Sterba--- fs/btrfs/volumes.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 4b7d54abfa34..e93c7ba44d06 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -3938,12 +3938,12 @@ int btrfs_resume_balance_async(struct btrfs_fs_info *fs_info) { struct task_struct *tsk; - spin_lock(_info->balance_lock); + mutex_lock(_info->balance_mutex); if (!fs_info->balance_ctl) { - spin_unlock(_info->balance_lock); + mutex_unlock(_info->balance_mutex); return 0; } - spin_unlock(_info->balance_lock); + mutex_unlock(_info->balance_mutex); if (btrfs_test_opt(fs_info, SKIP_BALANCE)) { btrfs_info(fs_info, "force skipping balance"); -- 2.16.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 09/16] btrfs: cleanup helpers that reset balance state
The function __cancel_balance name is confusing with the cancel operation of balance and it really resets the state of balance back to zero. The unset_balance_control helper is called only from one place and simple enough to be inlined. Signed-off-by: David Sterba--- fs/btrfs/ioctl.c | 8 fs/btrfs/volumes.c | 27 --- 2 files changed, 16 insertions(+), 19 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 582bde5b7eda..2f172122b63d 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -4653,10 +4653,10 @@ static long btrfs_ioctl_balance(struct file *file, void __user *arg) do_balance: /* -* Ownership of bctl and filesystem flag BTRFS_FS_EXCL_OP -* goes to to btrfs_balance. bctl is freed in __cancel_balance, -* or, if restriper was paused all the way until unmount, in -* free_fs_info. The flag should be cleared after __cancel_balance. +* Ownership of bctl and filesystem flag BTRFS_FS_EXCL_OP goes to to +* btrfs_balance. bctl is freed in reset_balance_state, or, if +* restriper was paused all the way until unmount, in free_fs_info. +* The flag should be cleared after reset_balance_state. */ need_unlock = false; diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 843982a2cbdb..b073ab4c3c70 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -3192,7 +3192,7 @@ static void update_balance_args(struct btrfs_balance_control *bctl) /* * Should be called with both balance and volume mutexes held to * serialize other volume operations (add_dev/rm_dev/resize) with - * restriper. Same goes for unset_balance_control. + * restriper. Same goes for reset_balance_state. */ static void set_balance_control(struct btrfs_balance_control *bctl) { @@ -3205,9 +3205,13 @@ static void set_balance_control(struct btrfs_balance_control *bctl) spin_unlock(_info->balance_lock); } -static void unset_balance_control(struct btrfs_fs_info *fs_info) +/* + * Clear the balance status in fs_info and delete the balance item from disk. + */ +static void reset_balance_state(struct btrfs_fs_info *fs_info) { struct btrfs_balance_control *bctl = fs_info->balance_ctl; + int ret; BUG_ON(!fs_info->balance_ctl); @@ -3216,6 +3220,9 @@ static void unset_balance_control(struct btrfs_fs_info *fs_info) spin_unlock(_info->balance_lock); kfree(bctl); + ret = del_balance_item(fs_info); + if (ret) + btrfs_handle_fs_error(fs_info, ret, NULL); } /* @@ -3752,16 +3759,6 @@ static inline int balance_need_close(struct btrfs_fs_info *fs_info) atomic_read(_info->balance_cancel_req) == 0); } -static void __cancel_balance(struct btrfs_fs_info *fs_info) -{ - int ret; - - unset_balance_control(fs_info); - ret = del_balance_item(fs_info); - if (ret) - btrfs_handle_fs_error(fs_info, ret, NULL); -} - /* Non-zero return value signifies invalidity */ static inline int validate_convert_profile(struct btrfs_balance_args *bctl_arg, u64 allowed) @@ -3916,7 +3913,7 @@ int btrfs_balance(struct btrfs_balance_control *bctl, if ((ret && ret != -ECANCELED && ret != -ENOSPC) || balance_need_close(fs_info)) { - __cancel_balance(fs_info); + reset_balance_state(fs_info); clear_bit(BTRFS_FS_EXCL_OP, _info->flags); } @@ -4095,13 +4092,13 @@ int btrfs_cancel_balance(struct btrfs_fs_info *fs_info) atomic_read(_info->balance_running) == 0); mutex_lock(_info->balance_mutex); } else { - /* __cancel_balance needs volume_mutex */ + /* reset_balance_state needs volume_mutex */ mutex_unlock(_info->balance_mutex); mutex_lock(_info->volume_mutex); mutex_lock(_info->balance_mutex); if (fs_info->balance_ctl) { - __cancel_balance(fs_info); + reset_balance_state(fs_info); clear_bit(BTRFS_FS_EXCL_OP, _info->flags); } -- 2.16.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 10/16] btrfs: remove wrong use of volume_mutex from btrfs_dev_replace_start
The volume mutex does not protect against anything in this case, the comment about scrub is right but not related to locking and looks confusing. The comment in btrfs_find_device_missing_or_by_path is wrong and confusing too. The device_list_mutex is not held here to protect device lookup, but in this case device replace cannot run in parallel with device removal (due to exclusive op protection), so we don't need further locking here. Signed-off-by: David Sterba--- fs/btrfs/dev-replace.c | 7 +-- fs/btrfs/volumes.c | 4 2 files changed, 1 insertion(+), 10 deletions(-) diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index 346bd460f8e7..ba011af9b0d2 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -426,18 +426,13 @@ int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info, struct btrfs_device *tgt_device = NULL; struct btrfs_device *src_device = NULL; - /* the disk copy procedure reuses the scrub code */ - mutex_lock(_info->volume_mutex); ret = btrfs_find_device_by_devspec(fs_info, srcdevid, srcdev_name, _device); - if (ret) { - mutex_unlock(_info->volume_mutex); + if (ret) return ret; - } ret = btrfs_init_dev_replace_tgtdev(fs_info, tgtdev_name, src_device, _device); - mutex_unlock(_info->volume_mutex); if (ret) return ret; diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index b073ab4c3c70..0ae29cd69363 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -2198,10 +2198,6 @@ int btrfs_find_device_missing_or_by_path(struct btrfs_fs_info *fs_info, struct btrfs_device *tmp; devices = _info->fs_devices->devices; - /* -* It is safe to read the devices since the volume_mutex -* is held by the caller. -*/ list_for_each_entry(tmp, devices, dev_list) { if (test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, >dev_state) && !tmp->bdev) { -- 2.16.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 07/16] btrfs: add proper safety check before resuming dev-replace
The device replace is paused by unmount or read only remount, and resumed on next mount or write remount. The exclusive status should be checked properly as it's a global invariant and we must not allow 2 operations run. In this case, the balance can be also paused and resumed under same conditions. It's always checked first so dev-replace could see the EXCL_OP already taken, BUT, the ioctl would never let start both at the same time. Replace the WARN_ON with message and return 0, indicating no error as this is purely theoretical and the user will be informed. Resolving that manually should be possible by waiting for the other operation to finish or cancel the paused state. Signed-off-by: David Sterba--- fs/btrfs/dev-replace.c | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index 7a87ffad041e..346bd460f8e7 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -908,7 +908,17 @@ int btrfs_resume_dev_replace_async(struct btrfs_fs_info *fs_info) } btrfs_dev_replace_write_unlock(dev_replace); - WARN_ON(test_and_set_bit(BTRFS_FS_EXCL_OP, _info->flags)); + /* +* This could collide with a paused balance, but the exclusive op logic +* should never allow both to start and pause. We don't want to allow +* dev-replace to start anyway. +*/ + if (test_and_set_bit(BTRFS_FS_EXCL_OP, _info->flags)) { + btrfs_info(fs_info, + "cannot resume dev-replace, other exclusive operation running"); + return 0; + } + task = kthread_run(btrfs_dev_replace_kthread, fs_info, "btrfs-devrepl"); return PTR_ERR_OR_ZERO(task); } -- 2.16.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 08/16] btrfs: add sanity check when resuming balance after mount
Replace a WARN_ON with a proper check and message in case something goes really wrong and resumed balance cannot set up its exclusive status. The check is a user friendly assertion, I don't expect to ever happen under normal circumstances. Also document that the paused balance starts here and owns the exclusive op status. Signed-off-by: David Sterba--- fs/btrfs/volumes.c | 16 ++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index eb78c1d0ce2b..843982a2cbdb 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -3982,6 +3982,20 @@ int btrfs_recover_balance(struct btrfs_fs_info *fs_info) struct btrfs_key key; int ret; + /* +* This should never happen, as the paused balance state is recovered +* during mount without any chance of other exclusive ops to collide. +* Let it fail early and do not continue mount. +* +* Otherwise, this gives the exclusive op status to balance and keeps +* in paused state until user intervention (cancel or umount). +*/ + if (test_and_set_bit(BTRFS_FS_EXCL_OP, _info->flags)) { + btrfs_err(fs_info, + "cannot set exclusive op status to resume balance"); + return -EINVAL; + } + path = btrfs_alloc_path(); if (!path) return -ENOMEM; @@ -4018,8 +4032,6 @@ int btrfs_recover_balance(struct btrfs_fs_info *fs_info) btrfs_balance_sys(leaf, item, _bargs); btrfs_disk_balance_args_to_cpu(>sys, _bargs); - WARN_ON(test_and_set_bit(BTRFS_FS_EXCL_OP, _info->flags)); - mutex_lock(_info->volume_mutex); mutex_lock(_info->balance_mutex); -- 2.16.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 02/16] btrfs: make success path out of btrfs_init_dev_replace_tgtdev more clear
This is a preparatory cleanup that will make clear that the only successful way out of btrfs_init_dev_replace_tgtdev will also set the device_out to a valid pointer. With this guarantee, the callers can be simplified. Signed-off-by: David Sterba--- fs/btrfs/dev-replace.c | 1 - fs/btrfs/volumes.c | 8 +++- 2 files changed, 7 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index c0397cd2a3ba..e3b8a79c1665 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -370,7 +370,6 @@ int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info, dev_replace->cont_reading_from_srcdev_mode = read_src; WARN_ON(!src_device); dev_replace->srcdev = src_device; - WARN_ON(!tgt_device); dev_replace->tgtdev = tgt_device; btrfs_info_in_rcu(fs_info, diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 93f8f17cacca..8086c5687b72 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -2592,6 +2592,12 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path return ret; } +/* + * Initialize a new device for device replace target from a given source dev + * and path. + * + * Return 0 and new device in @device_out, otherwise return < 0 + */ int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info, const char *device_path, struct btrfs_device *srcdev, @@ -2678,7 +2684,7 @@ int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info, mutex_unlock(_info->fs_devices->device_list_mutex); *device_out = device; - return ret; + return 0; error: blkdev_put(bdev, FMODE_EXCL); -- 2.16.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 01/16] btrfs: squeeze btrfs_dev_replace_continue_on_mount to its caller
The function is called once and is fairly small, we can merge it with the caller. Signed-off-by: David Sterba--- fs/btrfs/dev-replace.c | 16 +++- 1 file changed, 3 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index 0d203633bb96..c0397cd2a3ba 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -45,8 +45,6 @@ static void btrfs_dev_replace_update_device_in_mapping_tree( struct btrfs_device *srcdev, struct btrfs_device *tgtdev); static int btrfs_dev_replace_kthread(void *data); -static int btrfs_dev_replace_continue_on_mount(struct btrfs_fs_info *fs_info); - int btrfs_init_dev_replace(struct btrfs_fs_info *fs_info) { @@ -822,6 +820,7 @@ static int btrfs_dev_replace_kthread(void *data) struct btrfs_fs_info *fs_info = data; struct btrfs_dev_replace *dev_replace = _info->dev_replace; u64 progress; + int ret; progress = btrfs_dev_replace_progress(fs_info); progress = div_u64(progress, 10); @@ -832,23 +831,14 @@ static int btrfs_dev_replace_kthread(void *data) btrfs_dev_name(dev_replace->tgtdev), (unsigned int)progress); - btrfs_dev_replace_continue_on_mount(fs_info); - clear_bit(BTRFS_FS_EXCL_OP, _info->flags); - - return 0; -} - -static int btrfs_dev_replace_continue_on_mount(struct btrfs_fs_info *fs_info) -{ - struct btrfs_dev_replace *dev_replace = _info->dev_replace; - int ret; - ret = btrfs_scrub_dev(fs_info, dev_replace->srcdev->devid, dev_replace->committed_cursor_left, btrfs_device_get_total_bytes(dev_replace->srcdev), _replace->scrub_progress, 0, 1); ret = btrfs_dev_replace_finishing(fs_info, ret); WARN_ON(ret); + + clear_bit(BTRFS_FS_EXCL_OP, _info->flags); return 0; } -- 2.16.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 05/16] btrfs: move volume_mutex to callers of btrfs_rm_device
Move locking and unlocking next to the BTRFS_FS_EXCL_OP bit manipulation so it's obvious that the two happen at the same time. Signed-off-by: David Sterba--- fs/btrfs/ioctl.c | 4 fs/btrfs/volumes.c | 2 -- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index ac85e07f567b..b93dea445802 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2674,6 +2674,7 @@ static long btrfs_ioctl_rm_dev_v2(struct file *file, void __user *arg) ret = BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS; goto out; } + mutex_lock(_info->volume_mutex); if (vol_args->flags & BTRFS_DEVICE_SPEC_BY_ID) { ret = btrfs_rm_device(fs_info, NULL, vol_args->devid); @@ -2681,6 +2682,7 @@ static long btrfs_ioctl_rm_dev_v2(struct file *file, void __user *arg) vol_args->name[BTRFS_SUBVOL_NAME_MAX] = '\0'; ret = btrfs_rm_device(fs_info, vol_args->name, 0); } + mutex_unlock(_info->volume_mutex); clear_bit(BTRFS_FS_EXCL_OP, _info->flags); if (!ret) { @@ -2716,6 +2718,7 @@ static long btrfs_ioctl_rm_dev(struct file *file, void __user *arg) ret = BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS; goto out_drop_write; } + mutex_lock(_info->volume_mutex); vol_args = memdup_user(arg, sizeof(*vol_args)); if (IS_ERR(vol_args)) { @@ -2730,6 +2733,7 @@ static long btrfs_ioctl_rm_dev(struct file *file, void __user *arg) btrfs_info(fs_info, "disk deleted %s", vol_args->name); kfree(vol_args); out: + mutex_unlock(_info->volume_mutex); clear_bit(BTRFS_FS_EXCL_OP, _info->flags); out_drop_write: mnt_drop_write_file(file); diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index df2d6d06e014..e0bd181dc9e0 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1932,7 +1932,6 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path, u64 num_devices; int ret = 0; - mutex_lock(_info->volume_mutex); mutex_lock(_mutex); num_devices = fs_info->fs_devices->num_devices; @@ -2048,7 +2047,6 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path, out: mutex_unlock(_mutex); - mutex_unlock(_info->volume_mutex); return ret; error_undo: -- 2.16.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 03/16] btrfs: export and rename free_device
The function will be used outside of volumes.c, the allocation btrfs_alloc_device is also exported. Signed-off-by: David Sterba--- fs/btrfs/volumes.c | 24 fs/btrfs/volumes.h | 1 + 2 files changed, 13 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 8086c5687b72..ca5521cc1a5b 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -246,7 +246,7 @@ static struct btrfs_fs_devices *alloc_fs_devices(const u8 *fsid) return fs_devs; } -static void free_device(struct btrfs_device *device) +void btrfs_free_device(struct btrfs_device *device) { rcu_string_free(device->name); bio_put(device->flush_bio); @@ -261,7 +261,7 @@ static void free_fs_devices(struct btrfs_fs_devices *fs_devices) device = list_entry(fs_devices->devices.next, struct btrfs_device, dev_list); list_del(>dev_list); - free_device(device); + btrfs_free_device(device); } kfree(fs_devices); } @@ -294,7 +294,7 @@ void __exit btrfs_cleanup_fs_uuids(void) /* * Returns a pointer to a new btrfs_device on success; ERR_PTR() on error. * Returned struct is not linked onto any lists and must be destroyed using - * free_device. + * btrfs_free_device. */ static struct btrfs_device *__alloc_device(void) { @@ -650,7 +650,7 @@ static void btrfs_free_stale_devices(const char *path, } else { fs_devs->num_devices--; list_del(>dev_list); - free_device(dev); + btrfs_free_device(dev); } } } @@ -765,7 +765,7 @@ static noinline struct btrfs_device *device_list_add(const char *path, name = rcu_string_strdup(path, GFP_NOFS); if (!name) { - free_device(device); + btrfs_free_device(device); return ERR_PTR(-ENOMEM); } rcu_assign_pointer(device->name, name); @@ -878,7 +878,7 @@ static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig) name = rcu_string_strdup(orig_dev->name->str, GFP_KERNEL); if (!name) { - free_device(device); + btrfs_free_device(device); goto error; } rcu_assign_pointer(device->name, name); @@ -950,7 +950,7 @@ void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, int step) } list_del_init(>dev_list); fs_devices->num_devices--; - free_device(device); + btrfs_free_device(device); } if (fs_devices->seed) { @@ -968,7 +968,7 @@ static void free_device_rcu(struct rcu_head *head) struct btrfs_device *device; device = container_of(head, struct btrfs_device, rcu); - free_device(device); + btrfs_free_device(device); } static void btrfs_close_bdev(struct btrfs_device *device) @@ -2582,7 +2582,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path if (trans) btrfs_end_transaction(trans); error_free_device: - free_device(device); + btrfs_free_device(device); error: blkdev_put(bdev, FMODE_EXCL); if (seeding_dev && !unlocked) { @@ -2653,7 +2653,7 @@ int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info, name = rcu_string_strdup(device_path, GFP_KERNEL); if (!name) { - free_device(device); + btrfs_free_device(device); ret = -ENOMEM; goto error; } @@ -6419,7 +6419,7 @@ static struct btrfs_device *add_missing_dev(struct btrfs_fs_devices *fs_devices, * * Return: a pointer to a new btrfs_device on success; ERR_PTR() * on error. Returned struct is not linked onto any lists and must be - * destroyed with free_device. + * destroyed with btrfs_free_device. */ struct btrfs_device *btrfs_alloc_device(struct btrfs_fs_info *fs_info, const u64 *devid, @@ -6442,7 +6442,7 @@ struct btrfs_device *btrfs_alloc_device(struct btrfs_fs_info *fs_info, ret = find_next_devid(fs_info, ); if (ret) { - free_device(dev); + btrfs_free_device(dev); return ERR_PTR(ret); } } diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index d1fcaea9fef5..45e3ece21290 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -434,6 +434,7 @@ int btrfs_find_device_by_devspec(struct btrfs_fs_info *fs_info,
[PATCH 04/16] btrfs: move btrfs_init_dev_replace_tgtdev to dev-replace.c and make static
The function logically belongs there and there's only a single caller, no need to export it. No code changes. Signed-off-by: David Sterba--- fs/btrfs/dev-replace.c | 99 ++ fs/btrfs/volumes.c | 99 -- fs/btrfs/volumes.h | 4 -- 3 files changed, 99 insertions(+), 103 deletions(-) diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index e3b8a79c1665..7a87ffad041e 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -188,6 +188,105 @@ int btrfs_init_dev_replace(struct btrfs_fs_info *fs_info) return ret; } +/* + * Initialize a new device for device replace target from a given source dev + * and path. + * + * Return 0 and new device in @device_out, otherwise return < 0 + */ +static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info, + const char *device_path, + struct btrfs_device *srcdev, + struct btrfs_device **device_out) +{ + struct btrfs_device *device; + struct block_device *bdev; + struct list_head *devices; + struct rcu_string *name; + u64 devid = BTRFS_DEV_REPLACE_DEVID; + int ret = 0; + + *device_out = NULL; + if (fs_info->fs_devices->seeding) { + btrfs_err(fs_info, "the filesystem is a seed filesystem!"); + return -EINVAL; + } + + bdev = blkdev_get_by_path(device_path, FMODE_WRITE | FMODE_EXCL, + fs_info->bdev_holder); + if (IS_ERR(bdev)) { + btrfs_err(fs_info, "target device %s is invalid!", device_path); + return PTR_ERR(bdev); + } + + filemap_write_and_wait(bdev->bd_inode->i_mapping); + + devices = _info->fs_devices->devices; + list_for_each_entry(device, devices, dev_list) { + if (device->bdev == bdev) { + btrfs_err(fs_info, + "target device is in the filesystem!"); + ret = -EEXIST; + goto error; + } + } + + + if (i_size_read(bdev->bd_inode) < + btrfs_device_get_total_bytes(srcdev)) { + btrfs_err(fs_info, + "target device is smaller than source device!"); + ret = -EINVAL; + goto error; + } + + + device = btrfs_alloc_device(NULL, , NULL); + if (IS_ERR(device)) { + ret = PTR_ERR(device); + goto error; + } + + name = rcu_string_strdup(device_path, GFP_KERNEL); + if (!name) { + btrfs_free_device(device); + ret = -ENOMEM; + goto error; + } + rcu_assign_pointer(device->name, name); + + mutex_lock(_info->fs_devices->device_list_mutex); + set_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state); + device->generation = 0; + device->io_width = fs_info->sectorsize; + device->io_align = fs_info->sectorsize; + device->sector_size = fs_info->sectorsize; + device->total_bytes = btrfs_device_get_total_bytes(srcdev); + device->disk_total_bytes = btrfs_device_get_disk_total_bytes(srcdev); + device->bytes_used = btrfs_device_get_bytes_used(srcdev); + device->commit_total_bytes = srcdev->commit_total_bytes; + device->commit_bytes_used = device->bytes_used; + device->fs_info = fs_info; + device->bdev = bdev; + set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, >dev_state); + set_bit(BTRFS_DEV_STATE_REPLACE_TGT, >dev_state); + device->mode = FMODE_EXCL; + device->dev_stats_valid = 1; + set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE); + device->fs_devices = fs_info->fs_devices; + list_add(>dev_list, _info->fs_devices->devices); + fs_info->fs_devices->num_devices++; + fs_info->fs_devices->open_devices++; + mutex_unlock(_info->fs_devices->device_list_mutex); + + *device_out = device; + return 0; + +error: + blkdev_put(bdev, FMODE_EXCL); + return ret; +} + /* * called from commit_transaction. Writes changed device replace state to * disk. diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index ca5521cc1a5b..df2d6d06e014 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -2592,105 +2592,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path return ret; } -/* - * Initialize a new device for device replace target from a given source dev - * and path. - * - * Return 0 and new device in @device_out, otherwise return < 0 - */ -int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info, - const char *device_path, - struct btrfs_device *srcdev, - struct
[PATCH 06/16] btrfs: move clearing of EXCL_OP out of __cancel_balance
Make the clearning visible in the callers so we can pair it with the test_and_set part. Signed-off-by: David Sterba--- fs/btrfs/ioctl.c | 2 +- fs/btrfs/volumes.c | 15 --- 2 files changed, 9 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index b93dea445802..582bde5b7eda 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -4656,7 +4656,7 @@ static long btrfs_ioctl_balance(struct file *file, void __user *arg) * Ownership of bctl and filesystem flag BTRFS_FS_EXCL_OP * goes to to btrfs_balance. bctl is freed in __cancel_balance, * or, if restriper was paused all the way until unmount, in -* free_fs_info. The flag is cleared in __cancel_balance. +* free_fs_info. The flag should be cleared after __cancel_balance. */ need_unlock = false; diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index e0bd181dc9e0..eb78c1d0ce2b 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -3760,8 +3760,6 @@ static void __cancel_balance(struct btrfs_fs_info *fs_info) ret = del_balance_item(fs_info); if (ret) btrfs_handle_fs_error(fs_info, ret, NULL); - - clear_bit(BTRFS_FS_EXCL_OP, _info->flags); } /* Non-zero return value signifies invalidity */ @@ -3919,6 +3917,7 @@ int btrfs_balance(struct btrfs_balance_control *bctl, if ((ret && ret != -ECANCELED && ret != -ENOSPC) || balance_need_close(fs_info)) { __cancel_balance(fs_info); + clear_bit(BTRFS_FS_EXCL_OP, _info->flags); } wake_up(_info->balance_wait_q); @@ -3926,11 +3925,11 @@ int btrfs_balance(struct btrfs_balance_control *bctl, return ret; out: if (bctl->flags & BTRFS_BALANCE_RESUME) - __cancel_balance(fs_info); - else { + reset_balance_state(fs_info); + else kfree(bctl); - clear_bit(BTRFS_FS_EXCL_OP, _info->flags); - } + clear_bit(BTRFS_FS_EXCL_OP, _info->flags); + return ret; } @@ -4089,8 +4088,10 @@ int btrfs_cancel_balance(struct btrfs_fs_info *fs_info) mutex_lock(_info->volume_mutex); mutex_lock(_info->balance_mutex); - if (fs_info->balance_ctl) + if (fs_info->balance_ctl) { __cancel_balance(fs_info); + clear_bit(BTRFS_FS_EXCL_OP, _info->flags); + } mutex_unlock(_info->volume_mutex); } -- 2.16.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 00/16 v1] Kill fs_info::volume_mutex
This series gets rid of the volume mutex. The fstests do not pass cleanly, 2 or more tests fail so this needs to be fixed, but otherwise majority of the work ready for review. The merge target is 4.18 and I'll probably not get back to this pathset during merge window (nor add it to next), so there should be enough time for review. David Sterba (16): btrfs: squeeze btrfs_dev_replace_continue_on_mount to its caller btrfs: make success path out of btrfs_init_dev_replace_tgtdev more clear btrfs: export and rename free_device btrfs: move btrfs_init_dev_replace_tgtdev to dev-replace.c and make static btrfs: move volume_mutex to callers of btrfs_rm_device btrfs: move clearing of EXCL_OP out of __cancel_balance btrfs: add proper safety check before resuming dev-replace btrfs: add sanity check when resuming balance after mount btrfs: cleanup helpers that reset balance state btrfs: remove wrong use of volume_mutex from btrfs_dev_replace_start btrfs: kill btrfs_fs_info::volume_mutex btrfs: track running balance in a simpler way btrfs: remove redundant read-only check from btrfs_cancel_balance btrfs: drop lock parameter from update_ioctl_balance_args and rename btrfs: use mutex in btrfs_resume_balance_async btrfs: open code set_balance_control fs/btrfs/ctree.h | 9 +- fs/btrfs/dev-replace.c | 135 +- fs/btrfs/disk-io.c | 2 - fs/btrfs/extent-tree.c | 2 +- fs/btrfs/ioctl.c | 41 fs/btrfs/volumes.c | 252 + fs/btrfs/volumes.h | 5 +- 7 files changed, 205 insertions(+), 241 deletions(-) -- 2.16.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] SPDX to btrfs
Switch fs/btrfs/*.[ch] and fs/btrfs/tests/*.[ch] to SPDX. I've briefly verified that there are no exceptions to GPL-2.0 (ie. no 'or later'). The changes match what I've seen in other patches and I don't think ther's more needed to be done. If there are no objections I'm going to add the patches to a 4.17 pull. David Sterba (3): btrfs: replace GPL boilerplate by SPDX -- headers btrfs: replace GPL boilerplate by SPDX -- sources btrfs: add SPDX header to Kconfig fs/btrfs/Kconfig | 2 ++ fs/btrfs/acl.c | 15 +-- fs/btrfs/async-thread.c| 15 +-- fs/btrfs/async-thread.h| 21 + fs/btrfs/backref.c | 15 +-- fs/btrfs/backref.h | 19 +++ fs/btrfs/btrfs_inode.h | 19 +++ fs/btrfs/check-integrity.c | 15 +-- fs/btrfs/check-integrity.h | 19 +++ fs/btrfs/compression.c | 15 +-- fs/btrfs/compression.h | 19 +++ fs/btrfs/ctree.c | 15 +-- fs/btrfs/ctree.h | 20 fs/btrfs/dedupe.h | 20 fs/btrfs/delayed-inode.c | 15 +-- fs/btrfs/delayed-inode.h | 19 +++ fs/btrfs/delayed-ref.c | 15 +-- fs/btrfs/delayed-ref.h | 21 + fs/btrfs/dev-replace.c | 16 ++-- fs/btrfs/dev-replace.h | 20 fs/btrfs/dir-item.c| 15 +-- fs/btrfs/disk-io.c | 15 +-- fs/btrfs/disk-io.h | 20 fs/btrfs/export.c | 1 + fs/btrfs/export.h | 1 + fs/btrfs/extent-tree.c | 16 ++-- fs/btrfs/extent_io.c | 1 + fs/btrfs/extent_io.h | 6 -- fs/btrfs/extent_map.c | 1 + fs/btrfs/extent_map.h | 6 -- fs/btrfs/file-item.c | 15 +-- fs/btrfs/file.c| 15 +-- fs/btrfs/free-space-cache.c| 15 +-- fs/btrfs/free-space-cache.h| 19 +++ fs/btrfs/free-space-tree.c | 15 +-- fs/btrfs/free-space-tree.h | 19 +++ fs/btrfs/inode-item.c | 15 +-- fs/btrfs/inode-map.c | 15 +-- fs/btrfs/inode-map.h | 5 +++-- fs/btrfs/inode.c | 15 +-- fs/btrfs/ioctl.c | 15 +-- fs/btrfs/locking.c | 16 ++-- fs/btrfs/locking.h | 19 +++ fs/btrfs/lzo.c | 15 +-- fs/btrfs/math.h| 20 +++- fs/btrfs/ordered-data.c| 15 +-- fs/btrfs/ordered-data.h| 20 fs/btrfs/orphan.c | 15 +-- fs/btrfs/print-tree.c | 15 +-- fs/btrfs/print-tree.h | 21 + fs/btrfs/props.c | 15 +-- fs/btrfs/props.h | 19 +++ fs/btrfs/qgroup.c | 15 +-- fs/btrfs/qgroup.h | 22 +- fs/btrfs/raid56.c | 16 ++-- fs/btrfs/raid56.h | 21 + fs/btrfs/rcu-string.h | 20 ++-- fs/btrfs/reada.c | 15 +-- fs/btrfs/ref-verify.c | 15 +-- fs/btrfs/ref-verify.h | 23 ++- fs/btrfs/relocation.c | 15 +-- fs/btrfs/root-tree.c | 15 +-- fs/btrfs/scrub.c | 15 +-- fs/btrfs/send.c| 15 +-- fs/btrfs/send.h| 20 ++-- fs/btrfs/struct-funcs.c| 15 +-- fs/btrfs/super.c | 15 +-- fs/btrfs/sysfs.c | 15 +-- fs/btrfs/sysfs.h | 7 --- fs/btrfs/tests/btrfs-tests.c | 15 +-- fs/btrfs/tests/btrfs-tests.h | 19 +++ fs/btrfs/tests/extent-buffer-tests.c | 15 +-- fs/btrfs/tests/extent-io-tests.c | 15 +-- fs/btrfs/tests/extent-map-tests.c | 15
[PATCH 1/3] btrfs: replace GPL boilerplate by SPDX -- headers
Remove GPL boilerplate text (long, short, one-line) and keep the rest, ie. personal, company or original source copyright statements. Add the SPDX header. Unify the include protection macros to match the file names. Signed-off-by: David Sterba--- fs/btrfs/async-thread.h | 21 + fs/btrfs/backref.h | 19 +++ fs/btrfs/btrfs_inode.h | 19 +++ fs/btrfs/check-integrity.h | 19 +++ fs/btrfs/compression.h | 19 +++ fs/btrfs/ctree.h | 20 fs/btrfs/dedupe.h| 20 fs/btrfs/delayed-inode.h | 19 +++ fs/btrfs/delayed-ref.h | 21 + fs/btrfs/dev-replace.h | 20 fs/btrfs/disk-io.h | 20 fs/btrfs/export.h| 1 + fs/btrfs/extent_io.h | 6 -- fs/btrfs/extent_map.h| 6 -- fs/btrfs/free-space-cache.h | 19 +++ fs/btrfs/free-space-tree.h | 19 +++ fs/btrfs/inode-map.h | 5 +++-- fs/btrfs/locking.h | 19 +++ fs/btrfs/math.h | 20 +++- fs/btrfs/ordered-data.h | 20 fs/btrfs/print-tree.h| 21 + fs/btrfs/props.h | 19 +++ fs/btrfs/qgroup.h| 22 +- fs/btrfs/raid56.h| 21 + fs/btrfs/rcu-string.h| 20 ++-- fs/btrfs/ref-verify.h| 23 ++- fs/btrfs/send.h | 20 ++-- fs/btrfs/sysfs.h | 7 --- fs/btrfs/tests/btrfs-tests.h | 19 +++ fs/btrfs/transaction.h | 20 fs/btrfs/tree-checker.h | 17 +++-- fs/btrfs/tree-log.h | 20 fs/btrfs/ulist.h | 7 +++ fs/btrfs/volumes.h | 19 +++ fs/btrfs/xattr.h | 21 - 35 files changed, 133 insertions(+), 475 deletions(-) diff --git a/fs/btrfs/async-thread.h b/fs/btrfs/async-thread.h index fc957e00cef1..7861c9feba5f 100644 --- a/fs/btrfs/async-thread.h +++ b/fs/btrfs/async-thread.h @@ -1,24 +1,12 @@ +/* SPDX-License-Identifier: GPL-2.0 */ /* * Copyright (C) 2007 Oracle. All rights reserved. * Copyright (C) 2014 Fujitsu. All rights reserved. - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public - * License v2 as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. - * - * You should have received a copy of the GNU General Public - * License along with this program; if not, write to the - * Free Software Foundation, Inc., 59 Temple Place - Suite 330, - * Boston, MA 021110-1307, USA. */ -#ifndef __BTRFS_ASYNC_THREAD_ -#define __BTRFS_ASYNC_THREAD_ +#ifndef BTRFS_ASYNC_THREAD_H +#define BTRFS_ASYNC_THREAD_H + #include struct btrfs_fs_info; @@ -85,4 +73,5 @@ void btrfs_set_work_high_priority(struct btrfs_work *work); struct btrfs_fs_info *btrfs_work_owner(const struct btrfs_work *work); struct btrfs_fs_info *btrfs_workqueue_owner(const struct __btrfs_workqueue *wq); bool btrfs_workqueue_normal_congested(const struct btrfs_workqueue *wq); + #endif diff --git a/fs/btrfs/backref.h b/fs/btrfs/backref.h index 0a30028d5196..54d58988483a 100644 --- a/fs/btrfs/backref.h +++ b/fs/btrfs/backref.h @@ -1,23 +1,10 @@ +/* SPDX-License-Identifier: GPL-2.0 */ /* * Copyright (C) 2011 STRATO. All rights reserved. - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public - * License v2 as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. - * - * You should have received a copy of the GNU General Public - * License along with this program; if not, write to the - * Free Software Foundation, Inc., 59 Temple Place - Suite 330, - * Boston, MA 021110-1307, USA. */ -#ifndef __BTRFS_BACKREF__ -#define __BTRFS_BACKREF__ +#ifndef BTRFS_BACKREF_H +#define BTRFS_BACKREF_H #include #include "ulist.h" diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index ca15be569d69..234bae55b85d 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -1,23 +1,10 @@ +/* SPDX-License-Identifier: GPL-2.0 */ /* * Copyright (C) 2007 Oracle. All
[PATCH 3/3] btrfs: add SPDX header to Kconfig
Signed-off-by: David Sterba--- fs/btrfs/Kconfig | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig index 167e5dc7eadd..23537bc8c827 100644 --- a/fs/btrfs/Kconfig +++ b/fs/btrfs/Kconfig @@ -1,3 +1,5 @@ +# SPDX-License-Identifier: GPL-2.0 + config BTRFS_FS tristate "Btrfs filesystem support" select LIBCRC32C -- 2.16.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] btrfs: replace GPL boilerplate by SPDX -- sources
Remove GPL boilerplate text (long, short, one-line) and keep the rest, ie. personal, company or original source copyright statements. Add the SPDX header. Signed-off-by: David Sterba--- fs/btrfs/acl.c | 15 +-- fs/btrfs/async-thread.c| 15 +-- fs/btrfs/backref.c | 15 +-- fs/btrfs/check-integrity.c | 15 +-- fs/btrfs/compression.c | 15 +-- fs/btrfs/ctree.c | 15 +-- fs/btrfs/delayed-inode.c | 15 +-- fs/btrfs/delayed-ref.c | 15 +-- fs/btrfs/dev-replace.c | 16 ++-- fs/btrfs/dir-item.c| 15 +-- fs/btrfs/disk-io.c | 15 +-- fs/btrfs/export.c | 1 + fs/btrfs/extent-tree.c | 16 ++-- fs/btrfs/extent_io.c | 1 + fs/btrfs/extent_map.c | 1 + fs/btrfs/file-item.c | 15 +-- fs/btrfs/file.c| 15 +-- fs/btrfs/free-space-cache.c| 15 +-- fs/btrfs/free-space-tree.c | 15 +-- fs/btrfs/inode-item.c | 15 +-- fs/btrfs/inode-map.c | 15 +-- fs/btrfs/inode.c | 15 +-- fs/btrfs/ioctl.c | 15 +-- fs/btrfs/locking.c | 16 ++-- fs/btrfs/lzo.c | 15 +-- fs/btrfs/ordered-data.c| 15 +-- fs/btrfs/orphan.c | 15 +-- fs/btrfs/print-tree.c | 15 +-- fs/btrfs/props.c | 15 +-- fs/btrfs/qgroup.c | 15 +-- fs/btrfs/raid56.c | 16 ++-- fs/btrfs/reada.c | 15 +-- fs/btrfs/ref-verify.c | 15 +-- fs/btrfs/relocation.c | 15 +-- fs/btrfs/root-tree.c | 15 +-- fs/btrfs/scrub.c | 15 +-- fs/btrfs/send.c| 15 +-- fs/btrfs/struct-funcs.c| 15 +-- fs/btrfs/super.c | 15 +-- fs/btrfs/sysfs.c | 15 +-- fs/btrfs/tests/btrfs-tests.c | 15 +-- fs/btrfs/tests/extent-buffer-tests.c | 15 +-- fs/btrfs/tests/extent-io-tests.c | 15 +-- fs/btrfs/tests/extent-map-tests.c | 15 +-- fs/btrfs/tests/free-space-tests.c | 15 +-- fs/btrfs/tests/free-space-tree-tests.c | 15 +-- fs/btrfs/tests/inode-tests.c | 15 +-- fs/btrfs/tests/qgroup-tests.c | 15 +-- fs/btrfs/transaction.c | 15 +-- fs/btrfs/tree-checker.c| 13 + fs/btrfs/tree-defrag.c | 15 +-- fs/btrfs/tree-log.c| 15 +-- fs/btrfs/ulist.c | 2 +- fs/btrfs/uuid-tree.c | 16 ++-- fs/btrfs/volumes.c | 16 ++-- fs/btrfs/xattr.c | 16 +--- fs/btrfs/zlib.c| 15 +-- fs/btrfs/zstd.c| 10 ++ 58 files changed, 65 insertions(+), 750 deletions(-) diff --git a/fs/btrfs/acl.c b/fs/btrfs/acl.c index 0066d95b133f..15e1dfef56a5 100644 --- a/fs/btrfs/acl.c +++ b/fs/btrfs/acl.c @@ -1,19 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Copyright (C) 2007 Red Hat. All rights reserved. - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public - * License v2 as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. - * - * You should have received a copy of the GNU General Public - * License along with this program; if not, write to the - * Free Software Foundation, Inc., 59 Temple Place - Suite 330, - * Boston, MA 021110-1307, USA. */ #include diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c index d5540749f0e5..d522494698fa 100644 --- a/fs/btrfs/async-thread.c +++ b/fs/btrfs/async-thread.c @@ -1,20 +1,7 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Copyright (C) 2007 Oracle. All rights reserved. * Copyright (C) 2014 Fujitsu. All rights reserved. - * -
Please add 21035965f60b ("bitmap: fix memset optimization on big-endian systems") to the stable tree
Just wanted to make sure this doesn't get missed because I misspelled the stable email address in the patch. It applies to v4.13+. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] bitmap: fix memset optimization on big-endian systems
On Mon, Apr 02, 2018 at 04:49:39PM -0700, Linus Torvalds wrote: > On Mon, Apr 2, 2018 at 4:37 PM, Linus Torvalds >wrote: > > > > We should probably have at least switched it to "unsigned long int" > > I meant just "unsigned int", of course. > > Right now we occasionally have a silly 64-bit field just for a couple of > flags. Not to mention the mix of bit fields, macros, and enums, some of which are bit masks, some of which are the bit number :) > Of course, we do have cases where 64-bit architectures happily end up > using more than 32 bits too, so the "unsigned long" is occasionally > useful. But it's rare enough that it probably wasn't the right thing > to do. > > Water under the bridge. Part of it is due to another historical > accident: the alpha port was one of the early ports, and it didn't do > atomic byte accesses at all. > > So we had multiple issues that all conspired to "unsigned long arrays > are the natural for atomic bit operations" even though they have this > really annoying byte order issue. Thanks for the historical context, this is exactly what I was wondering when I spotted this bug and fixed a similar one in Btrfs a couple of years back. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Status of RAID5/6
On 04/03/2018 02:31 AM, Zygo Blaxell wrote: > On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote: >> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote: >>> On 2018-04-02 11:18, Goffredo Baroncelli wrote: I thought that a possible solution is to create BG with different >>> number of data disks. E.g. supposing to have a raid 6 system with 6 >>> disks, where 2 are parity disk; we should allocate 3 BG BG #1: 1 data disk, 2 parity disks BG #2: 2 data disks, 2 parity disks, BG #3: 4 data disks, 2 parity disks For simplicity, the disk-stripe length is assumed = 4K. So If you have a write with a length of 4 KB, this should be placed >>> in BG#1; if you have a write with a length of 4*3KB, the first 8KB, >>> should be placed in in BG#2, then in BG#1. This would avoid space wasting, even if the fragmentation will >>> increase (but shall the fragmentation matters with the modern solid >>> state disks ?). >> I don't really see why this would increase fragmentation or waste space. > Oh, wait, yes I do. If there's a write of 6 blocks, we would have > to split an extent between BG #3 (the first 4 blocks) and BG #2 (the > remaining 2 blocks). It also flips the usual order of "determine size > of extent, then allocate space for it" which might require major surgery > on the btrfs allocator to implement. I have to point out that in any case the extent is physically interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if you want to write 128KB, the first half is written in the first disk, the other in the 2nd disk. If you want to write 96kb, the first 64 are written in the first disk, the last part in the 2nd, only on a different BG. So yes there is a fragmentation from a logical point of view; from a physical point of view the data is spread on the disks in any case. In any case, you are right, we should gather some data, because the performance impact are no so clear. I am not worried abut having different BG; we have problem with these because we never developed tool to handle this issue properly (i.e. a daemon which starts a balance when needed). But I hope that this will be solved in future. In any case, the all solutions proposed have their trade off: - a) as is: write hole bug - b) variable stripe size (like ZFS): big impact on how btrfs handle the extent. limited waste of space - c) logging data before writing: we wrote the data two times in a short time window. Moreover the log area is written several order of magnitude more than the other area; there was some patch around - d) rounding the writing to the stripe size: waste of space; simple to implement; - e) different BG with different stripe size: limited waste of space; logical fragmentation. * c),d),e) are applied only for the tail of the extent, in case the size is less than the stripe size. * for b),d), e), the wasting of space may be reduced with a balance -- gpg @keyserver.linux.it: Goffredo Baroncelli Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[GIT PULL] Btrfs updates for 4.17
Hi, please pull the following btrfs changes. There are a several user visible changes, the rest is mostly invisible and continues to clean up the whole code base. There are no merge conflicts with current master. Please pull, thanks. User visible changes: - new mount option nossd_spread (pair for ssd_spread) - mount option subvolid will detect junk after the number and fail the mount - add message after cancelled device replace - direct module dependency on libcrc32, removed own crc wrappers - removed user space transaction ioctls - use lighter locking when reading /proc/self/mounts, RCU instead of mutex to avoid unnecessary contention Enhancements: - skip writeback of last page when truncating file to same size - send: do not issue unnecessary truncate operations - mount option token specifiers: use %u for unsigned values, more validation - selftests: more tree block validations qgroups: - preparatory work for splitting reservation types for data and metadata, this should allow for more accurate tracking and fix some issues with underflows or do further enhancements - split metadata reservations for started and joined transaction so they do not get mixed up and are accounted correctly at commit time - with the above, it's possible to revert patch that potentially deadlocks when trying to make more space by explicitly committing when the quota limit is hit - fix root item corruption when multiple same source snapshots are created with quota enabled RAID56: - make sure target is identical to source when raid56 rebuild fails after dev-replace - faster rebuild during scrub, batch by stripes and not block-by-block - make more use of cached data when rebuilding from a missing device Fixes: - null pointer deref when device replace target is missing - fix fsync after hole punching when using no-holes feature - fix lockdep splat when allocating percpu data with wrong GFP flags Cleanups, refactoring, core changes: - drop redunant parameters from various functions - kill and opencode trivial helpers - __cold/__exit function annotations - dead code removal - continued audit and documentation of memory barriers - error handling: handle removal from uuid tree - error handling: remove handling of impossible condtitons - more debugging or error messages - updated tracepoints - 1 VLA use removal (1 still left) The following changes since commit 3eb2ce825ea1ad89d20f7a3b5780df850e4be274: Linux 4.16-rc7 (2018-03-25 12:44:30 -1000) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-4.17-tag for you to fetch changes up to 57599c7e7722daf5f8c2dba4b0e4628f5c500771: btrfs: lift errors from add_extent_changeset to the callers (2018-03-31 02:03:25 +0200) Anand Jain (25): btrfs: open code btrfs_dev_replace_cancel() btrfs: rename __btrfs_dev_replace_cancel() btrfs: btrfs_dev_replace_cancel() can return int btrfs: open code btrfs_init_dev_replace_tgtdev_for_resume() btrfs: extent_buffer_uptodate() make it static and inline btrfs: manage thread_pool mount option as %u btrfs: manage metadata_ratio mount option as %u btrfs: manage check_int_print_mask mount option as %u btrfs: manage commit mount option as %u btrfs: add a comment to mark the deprecated mount option btrfs: fix null pointer deref when target device is missing btrfs: log, when replace, is canceled by the user btrfs: remove unused function btrfs_async_submit_limit() btrfs: cow_file_range() num_bytes and disk_num_bytes are same btrfs: use ASSERT to report logical error in cow_file_range() btrfs: not a disk error if the bio_add_page fails btrfs: keep device list sorted btrfs: insert newly opened device to the end of the list btrfs: verify subvolid mount parameter btrfs: remove assert in btrfs_init_dev_replace_tgtdev() btrfs: unify types for metadata_ratio and data_chunk_allocations btrfs: rename btrfs_close_extra_device to btrfs_free_extra_devids btrfs: add define for oldest generation btrfs: drop num argument from find_live_mirror() btrfs: drop optimal argument from find_live_mirror() Colin Ian King (2): btrfs: remove redundant check on ret and goto Btrfs: extent map selftest: add missing void parameter to btrfs_test_extent_map David Sterba (37): btrfs: add (the only possible) __exit annotation btrfs: add more __cold annotations btrfs: drop underscores from exported xattr functions btrfs: drop extern from function declarations btrfs: adjust return type of btrfs_getxattr btrfs: move btrfs_listxattr prototype to xattr.h btrfs: open code trivial helper
Re: [PATCH] btrfs-progs: let mkfs return nozero value on thin provision device
On 2018年04月03日 16:39, Su Yue wrote: > when mkfs.btrfs on a thin provision device which has very small > backing size and big virtual size, all code works well in > mkfs.btrfs until close_ctree() is called. > close_ctree() fails to sync device due to small backing size > while closing devices. > However, mkfs returns 0 in such situation which causes failure of > xfstests generic/405. > > So, let mkfs returns nonzero value if previous steps succeeded but > close_ctree() failed. > Then xfstests generic/405 passes now. > > Signed-off-by: Su YueReviewed-by: Qu Wenruo Thanks, Qu > --- > mkfs/main.c | 6 ++ > 1 file changed, 6 insertions(+) > > diff --git a/mkfs/main.c b/mkfs/main.c > index 5a717f701cd5..9f7f2396df8f 100644 > --- a/mkfs/main.c > +++ b/mkfs/main.c > @@ -1285,6 +1285,12 @@ out: > } > } > > + if (!ret && close_ret) { > + ret = close_ret; > + error("failed to close ctree, the filesystem may be > inconsistent: %d", > + ret); > + } > + > btrfs_close_all_devices(); > free(label); > > signature.asc Description: OpenPGP digital signature
[PATCH] btrfs-progs: let mkfs return nozero value on thin provision device
when mkfs.btrfs on a thin provision device which has very small backing size and big virtual size, all code works well in mkfs.btrfs until close_ctree() is called. close_ctree() fails to sync device due to small backing size while closing devices. However, mkfs returns 0 in such situation which causes failure of xfstests generic/405. So, let mkfs returns nonzero value if previous steps succeeded but close_ctree() failed. Then xfstests generic/405 passes now. Signed-off-by: Su Yue--- mkfs/main.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/mkfs/main.c b/mkfs/main.c index 5a717f701cd5..9f7f2396df8f 100644 --- a/mkfs/main.c +++ b/mkfs/main.c @@ -1285,6 +1285,12 @@ out: } } + if (!ret && close_ret) { + ret = close_ret; + error("failed to close ctree, the filesystem may be inconsistent: %d", + ret); + } + btrfs_close_all_devices(); free(label); -- 2.16.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 10/10] btrfs: qgroup: Use independent and accurate per inode qgroup rsv
Hi David, I didn't see this patch merged in your misc-next branch but only the remaining patches. However without this patch, btrfs qgroup reserved space will get obviously increased as prealloc metadata reserved space is never freed until inode reserved space is freed. This would cause a lot of qgroup related test cases to fail. Would you please merge this patch with all qgroup patchset? Thanks, Qu On 2017年12月22日 14:18, Qu Wenruo wrote: > Unlike reservation calculation used in inode rsv for metadata, qgroup > doesn't really need to care things like csum size or extent usage for > whole tree COW. > > Qgroup care more about net change of extent usage. > That's to say, if we're going to insert one file extent, it will mostly > find its place in CoWed tree block, leaving no change in extent usage. > Or cause leaf split, result one new net extent, increasing qgroup number > by nodesize. > (Or even more rare case, increase the tree level, increasing qgroup > number by 2 * nodesize) > > So here instead of using the way overkilled calculation for extent > allocator, which cares more about accurate and no error, qgroup doesn't > need that over-calculated reservation. > > This patch will maintain 2 new members in btrfs_block_rsv structure for > qgroup, using much smaller calculation for qgroup rsv, reducing false > EDQUOT. > > Signed-off-by: Qu Wenruo> --- > fs/btrfs/ctree.h | 18 + > fs/btrfs/extent-tree.c | 55 > -- > 2 files changed, 62 insertions(+), 11 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 0c58f92c2d44..97783ba91e00 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -467,6 +467,24 @@ struct btrfs_block_rsv { > unsigned short full; > unsigned short type; > unsigned short failfast; > + > + /* > + * Qgroup equivalent for @size @reserved > + * > + * Unlike normal normal @size/@reserved for inode rsv, > + * qgroup doesn't care about things like csum size nor how many tree > + * blocks it will need to reserve. > + * > + * Qgroup cares more about *NET* change of extent usage. > + * So for ONE newly inserted file extent, in worst case it will cause > + * leaf split and level increase, nodesize for each file extent > + * is already way overkilled. > + * > + * In short, qgroup_size/reserved is the up limit of possible needed > + * qgroup metadata reservation. > + */ > + u64 qgroup_rsv_size; > + u64 qgroup_rsv_reserved; > }; > > /* > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 986660f301de..9d579c7bcf7f 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -5565,14 +5565,18 @@ static void space_info_add_new_bytes(struct > btrfs_fs_info *fs_info, > > static u64 block_rsv_release_bytes(struct btrfs_fs_info *fs_info, > struct btrfs_block_rsv *block_rsv, > - struct btrfs_block_rsv *dest, u64 num_bytes) > + struct btrfs_block_rsv *dest, u64 num_bytes, > + u64 *qgroup_to_release_ret) > { > struct btrfs_space_info *space_info = block_rsv->space_info; > + u64 qgroup_to_release = 0; > u64 ret; > > spin_lock(_rsv->lock); > - if (num_bytes == (u64)-1) > + if (num_bytes == (u64)-1) { > num_bytes = block_rsv->size; > + qgroup_to_release = block_rsv->qgroup_rsv_size; > + } > block_rsv->size -= num_bytes; > if (block_rsv->reserved >= block_rsv->size) { > num_bytes = block_rsv->reserved - block_rsv->size; > @@ -5581,6 +5585,12 @@ static u64 block_rsv_release_bytes(struct > btrfs_fs_info *fs_info, > } else { > num_bytes = 0; > } > + if (block_rsv->qgroup_rsv_reserved >= block_rsv->qgroup_rsv_size) { > + qgroup_to_release = block_rsv->qgroup_rsv_reserved - > + block_rsv->qgroup_rsv_size; > + block_rsv->qgroup_rsv_reserved = block_rsv->qgroup_rsv_size; > + } else > + qgroup_to_release = 0; > spin_unlock(_rsv->lock); > > ret = num_bytes; > @@ -5603,6 +5613,8 @@ static u64 block_rsv_release_bytes(struct btrfs_fs_info > *fs_info, > space_info_add_old_bytes(fs_info, space_info, >num_bytes); > } > + if (qgroup_to_release_ret) > + *qgroup_to_release_ret = qgroup_to_release; > return ret; > } > > @@ -5744,17 +5756,21 @@ int btrfs_inode_rsv_refill(struct btrfs_inode *inode, > struct btrfs_root *root = inode->root; > struct btrfs_block_rsv *block_rsv = >block_rsv; > u64 num_bytes = 0; > + u64 qgroup_num_bytes = 0; > int ret = -ENOSPC; > > spin_lock(_rsv->lock); > if
Re: [PATCH 1/2] btrfs-progs: check: Skip data csum verification for metadata dump
On 3.04.2018 08:55, Qu Wenruo wrote: > > > On 2018年04月03日 13:51, Nikolay Borisov wrote: >> >> >> On 3.04.2018 08:39, Qu Wenruo wrote: >>> For metadata dump (fs restored by btrfs-image), since no data is >>> restored check sum verification will definitely report error. >>> >>> Add such check in check_csums() and prompt for user. >>> >>> Issue: #103 >>> Signed-off-by: Qu Wenruo>> >> Reviewed-by: Nikolay Borisov >> >>> --- >>> check/main.c | 13 - >>> 1 file changed, 12 insertions(+), 1 deletion(-) >>> >>> diff --git a/check/main.c b/check/main.c >>> index 891a6d797756..15c94543946a 100644 >>> --- a/check/main.c >>> +++ b/check/main.c >>> @@ -5601,6 +5601,7 @@ static int check_csums(struct btrfs_root *root) >>> int ret; >>> u64 data_len; >>> unsigned long leaf_offset; >>> + bool verify_csum = !!check_data_csum; >> >> nit: Why don't you convert check_data_csum to bool in this patch. > > Next thing to do. > >> My >> recent testing in !! usage showed that it's generating quite a number of >> additional instructions. > > That's pretty interesting. > > Would you please shared some more info about this? So it seems I might have been mistaken. The latest experiment I did was with the attached diff. So I did a bloat-o-meter comparison between stock random.c against the diff applied and then modified the diff to use uuid rather than !!uuid. And in my initial experiment I did see a reduction in the size of the function. However, now that I tried reproducing I couldn't. So just ignore my comment. Bloat-o-meter shows in both cases: nborisov@fisk:~/projects/kernel/source$ ./scripts/bloat-o-meter random.original drivers/char/random.o add/remove: 0/0 grow/shrink: 1/0 up/down: 49/0 (49) Function old new delta proc_do_uuid 227 276 +49 Total: Before=30542, After=30591, chg +0.16% diff --git a/drivers/char/random.c b/drivers/char/random.c index ec42c8bb9b0d..e05daf7f38f4 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -1960,6 +1960,10 @@ static int proc_do_uuid(struct ctl_table *table, int write, unsigned char buf[64], tmp_uuid[16], *uuid; uuid = table->data; +#ifdef CONFIG_UTS_NS + if (!!uuid && (uuid == (unsigned char *)sysctl_bootid)) + uuid = current->nsproxy->uts_ns->sysctl_bootid; +#endif if (!uuid) { uuid = tmp_uuid; generate_random_uuid(uuid); diff --git a/include/linux/utsname.h b/include/linux/utsname.h index c8060c2ecd04..f704aca3e95a 100644 --- a/include/linux/utsname.h +++ b/include/linux/utsname.h @@ -27,6 +27,7 @@ struct uts_namespace { struct user_namespace *user_ns; struct ucounts *ucounts; struct ns_common ns; + char sysctl_bootid[16]; } __randomize_layout; extern struct uts_namespace init_uts_ns; diff --git a/kernel/utsname.c b/kernel/utsname.c index 913fe4336d2b..f1749cdcd341 100644 --- a/kernel/utsname.c +++ b/kernel/utsname.c @@ -34,8 +34,10 @@ static struct uts_namespace *create_uts_ns(void) struct uts_namespace *uts_ns; uts_ns = kmalloc(sizeof(struct uts_namespace), GFP_KERNEL); - if (uts_ns) + if (uts_ns) { kref_init(_ns->kref); + memset(uts_ns->sysctl_bootid, 0, 16); + } return uts_ns; } -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html