Re: [PATCH] btrfs: Use immediate assignment when referencing cc-option
On 17/03/2021 06:46, Victor Erminpour wrote: Calling cc-option will use KBUILD_CFLAGS, which when lazy setting subdir-ccflags-y produces the following build error: scripts/Makefile.lib:10: *** Recursive variable `KBUILD_CFLAGS' \ references itself (eventually). Stop. Use := assignment to subdir-ccflags-y when referencing cc-option. This causes make to also evaluate += immediately, cc-option calls are done right away and we don't end up with KBUILD_CFLAGS referencing itself. Thanks for the patch. Reviewed-by: Anand Jain Signed-off-by: Victor Erminpour --- fs/btrfs/Makefile | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile index b634c42115ea..3dba1336fa95 100644 --- a/fs/btrfs/Makefile +++ b/fs/btrfs/Makefile @@ -7,10 +7,10 @@ subdir-ccflags-y += -Wmissing-format-attribute subdir-ccflags-y += -Wmissing-prototypes subdir-ccflags-y += -Wold-style-definition subdir-ccflags-y += -Wmissing-include-dirs -subdir-ccflags-y += $(call cc-option, -Wunused-but-set-variable) -subdir-ccflags-y += $(call cc-option, -Wunused-const-variable) -subdir-ccflags-y += $(call cc-option, -Wpacked-not-aligned) -subdir-ccflags-y += $(call cc-option, -Wstringop-truncation) +subdir-ccflags-y := $(call cc-option, -Wunused-but-set-variable) +subdir-ccflags-y := $(call cc-option, -Wunused-const-variable) +subdir-ccflags-y := $(call cc-option, -Wpacked-not-aligned) +subdir-ccflags-y := $(call cc-option, -Wstringop-truncation) # The following turn off the warnings enabled by -Wextra subdir-ccflags-y += -Wno-missing-field-initializers subdir-ccflags-y += -Wno-sign-compare
Re: [PATCH] btrfs: turn btrfs_destroy_delayed_refs() into void function
On 9/3/21 5:32 pm, Yang Li wrote: This function always return '0' and no callers use the return value. So make it a void function. This eliminates the following coccicheck warning: ./fs/btrfs/disk-io.c:4522:5-8: Unneeded variable: "ret". Return "0" on line 4530 Reported-by: Abaci Robot Signed-off-by: Yang Li Reviewed-by: Anand Jain --- fs/btrfs/disk-io.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 41b718c..b75d2d9 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -52,7 +52,7 @@ static void end_workqueue_fn(struct btrfs_work *work); static void btrfs_destroy_ordered_extents(struct btrfs_root *root); -static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, +static void btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, struct btrfs_fs_info *fs_info); A nit... The declare here can be removed without moving the code. Thanks, Anand static void btrfs_destroy_delalloc_inodes(struct btrfs_root *root); static int btrfs_destroy_marked_extents(struct btrfs_fs_info *fs_info, @@ -4513,13 +4513,12 @@ static void btrfs_destroy_all_ordered_extents(struct btrfs_fs_info *fs_info) btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1); } -static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, +static void btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, struct btrfs_fs_info *fs_info) { struct rb_node *node; struct btrfs_delayed_ref_root *delayed_refs; struct btrfs_delayed_ref_node *ref; - int ret = 0; delayed_refs = >delayed_refs; @@ -4592,8 +4591,6 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, btrfs_qgroup_destroy_extent_records(trans); spin_unlock(_refs->lock); - - return ret; } static void btrfs_destroy_delalloc_inodes(struct btrfs_root *root)
Re: [PATCH RFC 6/6] btrfs: Add roundrobin raid1 read policy
On 10/02/2021 04:30, Michal Rostecki wrote: From: Michal Rostecki Add a new raid1 read policy `roundrobin`. For each read request, it selects the mirror which has lower load than queue depth and it starts iterating from the last used mirror (by the current CPU). Load is defined as the number of inflight requests + a potential penalty value. The policy can be enabled through sysfs: # echo roundrobin > /sys/fs/btrfs/[fsid]/read_policies/policy This policy was tested with fio and compared with the default `pid` policy. The singlethreaded test has the following parameters: [global] name=btrfs-raid1-seqread filename=btrfs-raid1-seqread rw=read bs=64k direct=0 numjobs=1 time_based=0 [file1] size=10G ioengine=libaio and shows the following results: - raid1c3 with 3 HDDs: 3 x Segate Barracuda ST2000DM008 (2TB) * pid policy READ: bw=217MiB/s (228MB/s), 217MiB/s-217MiB/s (228MB/s-228MB/s), io=10.0GiB (10.7GB), run=47082-47082msec * roundrobin policy READ: bw=409MiB/s (429MB/s), 409MiB/s-409MiB/s (429MB/s-429MB/s), io=10.0GiB (10.7GB), run=25028-25028mse - raid1c3 with 2 HDDs and 1 SSD: 2 x Segate Barracuda ST2000DM008 (2TB) 1 x Crucial CT256M550SSD1 (256GB) * pid policy (the worst case when only HDDs were chosen) READ: bw=220MiB/s (231MB/s), 220MiB/s-220MiB/s (231MB/s-231MB/s), io=10.0GiB (10.7GB), run=46577-46577mse * pid policy (the best case when SSD was used as well) READ: bw=513MiB/s (538MB/s), 513MiB/s-513MiB/s (538MB/s-538MB/s), io=10.0GiB (10.7GB), run=19954-19954msec * roundrobin (there are no noticeable differences when testing multiple times) READ: bw=541MiB/s (567MB/s), 541MiB/s-541MiB/s (567MB/s-567MB/s), io=10.0GiB (10.7GB), run=18933-18933msec The multithreaded test has the following parameters: [global] name=btrfs-raid1-seqread filename=btrfs-raid1-seqread rw=read bs=64k direct=0 numjobs=8 time_based=0 [file1] size=10G ioengine=libaio and shows the following results: - raid1c3 with 3 HDDs: 3 x Segate Barracuda ST2000DM008 (2TB) 3 x Segate Barracuda ST2000DM008 (2TB) * pid policy READ: bw=1569MiB/s (1645MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=80.0GiB (85.9GB), run=52210-52211msec * roundrobin READ: bw=1733MiB/s (1817MB/s), 217MiB/s-217MiB/s (227MB/s-227MB/s), io=80.0GiB (85.9GB), run=47269-47271msec - raid1c3 with 2 HDDs and 1 SSD: 2 x Segate Barracuda ST2000DM008 (2TB) 1 x Crucial CT256M550SSD1 (256GB) * pid policy READ: bw=1843MiB/s (1932MB/s), 230MiB/s-230MiB/s (242MB/s-242MB/s), io=80.0GiB (85.9GB), run=9-44450msec * roundrobin READ: bw=2485MiB/s (2605MB/s), 311MiB/s-311MiB/s (326MB/s-326MB/s), io=80.0GiB (85.9GB), run=32969-32970msec The penalty value is an additional value added to the number of inflight requests when a scheduled request is non-local (which means it would start from the different physical location than the physical location of the last request processed by the given device). By default, it's applied only in filesystems which have mixed types of devices (non-rotational and rotational), but it can be configured to be applied without that condition. The configuration is done through sysfs: > - /sys/fs/btrfs/[fsid]/read_policies/roundrobin_nonlocal_inc_mixed_only where 1 (the default) value means applying penalty only in mixed arrays, 0 means applying it unconditionally. > The exact penalty value is defined separately for non-rotational and rotational devices. By default, it's 0 for non-rotational devices and 1 for rotational devices. Both values are configurable through sysfs: - /sys/fs/btrfs/[fsid]/read_policies/roundrobin_nonrot_nonlocal_inc - /sys/fs/btrfs/[fsid]/read_policies/roundrobin_rot_nonlocal_inc To sum it up - the default case is applying the penalty under the following conditions: - the raid1 array consists of mixed types of devices - the scheduled request is going to be non-local for the given disk - the device is rotational > That default case is based on a slight preference towards non-rotational disks in mixed arrays and has proven to give the best performance in tested arrays. >> For the array with 3 HDDs, not adding any penalty resulted in 409MiB/s (429MB/s) performance. Adding the penalty value 1 resulted in a performance drop to 404MiB/s (424MB/s). Increasing the value towards 10 was making the performance even worse. For the array with 2 HDDs and 1 SSD, adding penalty value 1 to rotational disks resulted in the best performance - 541MiB/s (567MB/s). Not adding any value and increasing the value was making the performance worse. > Adding penalty value to non-rotational disks was always decreasing the performance, which motivated setting it as 0 by default. For the purpose of testing, it's still configurable. > To measure the performance of each policy and find optimal penalty values, I created scripts which
Re: [PATCH RFC 0/6] Add roundrobin raid1 read policy
On 10/02/2021 04:30, Michal Rostecki wrote: From: Michal Rostecki This patch series adds a new raid1 read policy - roundrobin. For each request, it selects the mirror which has lower load than queue depth. Load is defined as the number of inflight requests + a penalty value (if the scheduled request is not local to the last processed request for a rotational disk). The series consists of preparational changes which add necessary information to the btrfs_device struct and the change with the policy. This policy was tested with fio and compared with the default `pid` policy. The singlethreaded test has the following parameters: [global] name=btrfs-raid1-seqread filename=btrfs-raid1-seqread rw=read bs=64k direct=0 numjobs=1 time_based=0 [file1] size=10G ioengine=libaio and shows the following results: - raid1c3 with 3 HDDs: 3 x Segate Barracuda ST2000DM008 (2TB) * pid policy READ: bw=217MiB/s (228MB/s), 217MiB/s-217MiB/s (228MB/s-228MB/s), io=10.0GiB (10.7GB), run=47082-47082msec * roundrobin policy READ: bw=409MiB/s (429MB/s), 409MiB/s-409MiB/s (429MB/s-429MB/s), io=10.0GiB (10.7GB), run=25028-25028mse - raid1c3 with 2 HDDs and 1 SSD: 2 x Segate Barracuda ST2000DM008 (2TB) 1 x Crucial CT256M550SSD1 (256GB) * pid policy (the worst case when only HDDs were chosen) READ: bw=220MiB/s (231MB/s), 220MiB/s-220MiB/s (231MB/s-231MB/s), io=10.0GiB (10.7GB), run=46577-46577mse * pid policy (the best case when SSD was used as well) READ: bw=513MiB/s (538MB/s), 513MiB/s-513MiB/s (538MB/s-538MB/s), io=10.0GiB (10.7GB), run=19954-19954msec * roundrobin (there are no noticeable differences when testing multiple times) READ: bw=541MiB/s (567MB/s), 541MiB/s-541MiB/s (567MB/s-567MB/s), io=10.0GiB (10.7GB), run=18933-18933msec The multithreaded test has the following parameters: [global] name=btrfs-raid1-seqread filename=btrfs-raid1-seqread rw=read bs=64k direct=0 numjobs=8 time_based=0 [file1] size=10G ioengine=libaio and shows the following results: - raid1c3 with 3 HDDs: 3 x Segate Barracuda ST2000DM008 (2TB) 3 x Segate Barracuda ST2000DM008 (2TB) * pid policy READ: bw=1569MiB/s (1645MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=80.0GiB (85.9GB), run=52210-52211msec * roundrobin READ: bw=1733MiB/s (1817MB/s), 217MiB/s-217MiB/s (227MB/s-227MB/s), io=80.0GiB (85.9GB), run=47269-47271msec - raid1c3 with 2 HDDs and 1 SSD: 2 x Segate Barracuda ST2000DM008 (2TB) 1 x Crucial CT256M550SSD1 (256GB) * pid policy READ: bw=1843MiB/s (1932MB/s), 230MiB/s-230MiB/s (242MB/s-242MB/s), io=80.0GiB (85.9GB), run=9-44450msec * roundrobin READ: bw=2485MiB/s (2605MB/s), 311MiB/s-311MiB/s (326MB/s-326MB/s), io=80.0GiB (85.9GB), run=32969-32970msec Both of the above test cases are sequential. How about some random IO workload? Also, the seek time for non-rotational devices does not exist. So it is a good idea to test with ssd + nvme and all nvme or all ssd. To measure the performance of each policy and find optimal penalty values, I created scripts which are available here: https://gitlab.com/vadorovsky/btrfs-perf https://github.com/mrostecki/btrfs-perf Michal Rostecki (6): btrfs: Add inflight BIO request counter btrfs: Store the last device I/O offset These patches look good. But as only round-robin policy requires to monitor the inflight and last-offset. Could you bring them under if policy=roundrobin? Otherwise, it is just a waste of CPU cycles if the policy != roundrobin. >btrfs: Add stripe_physical function >btrfs: Check if the filesystem is has mixed type of devices Thanks, Anand btrfs: sysfs: Add directory for read policies btrfs: Add roundrobin raid1 read policy fs/btrfs/ctree.h | 3 + fs/btrfs/disk-io.c | 3 + fs/btrfs/sysfs.c | 144 ++ fs/btrfs/volumes.c | 218 +++-- fs/btrfs/volumes.h | 22 + 5 files changed, 366 insertions(+), 24 deletions(-)
Re: [PATCH] btrfs: remove redundant NULL check
On 21/1/21 4:19 pm, Yang Li wrote: Fix below warnings reported by coccicheck: ./fs/btrfs/raid56.c:237:2-8: WARNING: NULL check before some freeing functions is not needed. Reported-by: Abaci Robot Signed-off-by: Yang Li --- fs/btrfs/raid56.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 93fbf87..5394641 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -233,8 +233,7 @@ int btrfs_alloc_stripe_hash_table(struct btrfs_fs_info *info) } x = cmpxchg(>stripe_hash_table, NULL, table); - if (x) - kvfree(x); + kvfree(x); Reviewed-by: Anand Jain Thanks, Anand return 0; }
Re: [PATCH v2] btrfs: return EAGAIN when receiving a pending signal in the defrag loops
On 18/11/20 12:02 pm, xiakaixu1...@gmail.com wrote: From: Kaixu Xia The variable ret is overwritten by the following variable defrag_count. Actually the code should return EAGAIN when receiving a pending signal in the defrag loops. Reported-by: Tosk Robot Signed-off-by: Kaixu Xia --- v2 -return EAGAIN instead of remove the EAGAIN error. Sorry I might have missed in v1. Why was EAGAIN needed here? Return of defrag_count rather makes sense to me as of now. Thanks, Anand fs/btrfs/ioctl.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 69a384145dc6..6f13db6d30bd 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1519,7 +1519,7 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, if (btrfs_defrag_cancelled(fs_info)) { btrfs_debug(fs_info, "defrag_file cancelled"); ret = -EAGAIN; - break; + goto out_ra; } if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT,
Re: WARNING in close_fs_devices (2)
On 30/10/20 6:10 pm, Dmitry Vyukov wrote: On Tue, Sep 22, 2020 at 2:37 PM Anand Jain wrote: On 18/9/20 7:22 pm, syzbot wrote: Hello, syzbot found the following issue on: HEAD commit:e4c26faa Merge tag 'usb-5.9-rc5' of git://git.kernel.org/p.. git tree: upstream console output: https://syzkaller.appspot.com/x/log.txt?x=15bf162190 kernel config: https://syzkaller.appspot.com/x/.config?x=c61610091f4ca8c4 dashboard link: https://syzkaller.appspot.com/bug?extid=4cfe71a4da060be47502 compiler: gcc (GCC) 10.1.0-syz 20200507 Unfortunately, I don't have any reproducer for this issue yet. IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+4cfe71a4da060be47...@syzkaller.appspotmail.com [ cut here ] WARNING: CPU: 1 PID: 3612 at fs/btrfs/volumes.c:1166 close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 3612 Comm: syz-executor.2 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x198/0x1fd lib/dump_stack.c:118 panic+0x347/0x7c0 kernel/panic.c:231 __warn.cold+0x20/0x46 kernel/panic.c:600 report_bug+0x1bd/0x210 lib/bug.c:198 handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234 exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254 asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536 RIP: 0010:close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166 Code: 0f b6 04 02 84 c0 74 02 7e 33 48 8b 44 24 18 c6 80 30 01 00 00 00 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 e8 99 ce 6a fe <0f> 0b e9 71 ff ff ff e8 8d ce 6a fe 0f 0b e9 20 ff ff ff e8 d1 d5 RSP: 0018:c900091777e0 EFLAGS: 00010246 RAX: 0004 RBX: RCX: c9000c8b7000 RDX: 0004 RSI: 83097f47 RDI: 0007 RBP: dc00 R08: 0001 R09: 8880988a187f R10: R11: 0001 R12: 88809593a130 R13: 88809593a1ec R14: 8880988a1908 R15: 88809593a050 close_fs_devices fs/btrfs/volumes.c:1193 [inline] btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179 open_ctree+0x4984/0x4a2d fs/btrfs/disk-io.c:3434 btrfs_fill_super fs/btrfs/super.c:1316 [inline] btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672 legacy_get_tree+0x105/0x220 fs/fs_context.c:592 vfs_get_tree+0x89/0x2f0 fs/super.c:1547 fc_mount fs/namespace.c:978 [inline] vfs_kern_mount.part.0+0xd3/0x170 fs/namespace.c:1008 vfs_kern_mount+0x3c/0x60 fs/namespace.c:995 btrfs_mount+0x234/0xaa0 fs/btrfs/super.c:1732 legacy_get_tree+0x105/0x220 fs/fs_context.c:592 vfs_get_tree+0x89/0x2f0 fs/super.c:1547 do_new_mount fs/namespace.c:2875 [inline] path_mount+0x1387/0x2070 fs/namespace.c:3192 do_mount fs/namespace.c:3205 [inline] __do_sys_mount fs/namespace.c:3413 [inline] __se_sys_mount fs/namespace.c:3390 [inline] __x64_sys_mount+0x27f/0x300 fs/namespace.c:3390 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x46004a Code: b8 a6 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 fd 89 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 da 89 fb ff c3 66 0f 1f 84 00 00 00 00 00 RSP: 002b:7f414d78da88 EFLAGS: 0202 ORIG_RAX: 00a5 RAX: ffda RBX: 7f414d78db20 RCX: 0046004a RDX: 2000 RSI: 2100 RDI: 7f414d78dae0 RBP: 7f414d78dae0 R08: 7f414d78db20 R09: 2000 R10: R11: 0202 R12: 2000 R13: 2100 R14: 2200 R15: 2001a800 Kernel Offset: disabled Rebooting in 86400 seconds.. #syz fix: btrfs: fix rw_devices count in __btrfs_free_extra_devids Is it the correct patch title? It still does not exist anywhere including linux-next... V2 is here [1]. The patch title is changed. Also the cover letter talks about the path dependency. It is not yet integrated. [1] https://patchwork.kernel.org/project/linux-btrfs/patch/d0b5790792b8b826504dd239ad9efc514f3d9109.1604009248.git.anand.j...@oracle.com/ Thanks, Anand
[PATCH stable-5.4] backport enospc issues during balance
Patch 1 is a preparatory patch to reduce conflicts. Patch 2 fixes balance failure due to ENOSPC in btrfs/156 on arm64 systems with pagesize=64k. Minor conflicts in fs/btrfs/block-group.c are resolved. Thanks. Josef Bacik (2): btrfs: don't pass system_chunk into can_overcommit btrfs: take overcommit into account in inc_block_group_ro fs/btrfs/block-group.c | 38 ++-- fs/btrfs/space-info.c | 50 +- fs/btrfs/space-info.h | 3 +++ 3 files changed, 49 insertions(+), 42 deletions(-) -- 2.18.4
[PATCH stable-5.4 1/2] btrfs: don't pass system_chunk into can_overcommit
From: Josef Bacik commit 9f246926b4d5db4c5e8c78e4897757de26c95be6 upstream We have the space_info, we can just check its flags to see if it's the system chunk space info. Reviewed-by: Nikolay Borisov Reviewed-by: Qu Wenruo Reviewed-by: Johannes Thumshirn Signed-off-by: Josef Bacik Reviewed-by: David Sterba Signed-off-by: David Sterba Signed-off-by: Anand Jain --- fs/btrfs/space-info.c | 42 +++--- 1 file changed, 15 insertions(+), 27 deletions(-) diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c index 6f484f0d347e..e19e538d05f9 100644 --- a/fs/btrfs/space-info.c +++ b/fs/btrfs/space-info.c @@ -162,8 +162,7 @@ static inline u64 calc_global_rsv_need_space(struct btrfs_block_rsv *global) static int can_overcommit(struct btrfs_fs_info *fs_info, struct btrfs_space_info *space_info, u64 bytes, - enum btrfs_reserve_flush_enum flush, - bool system_chunk) + enum btrfs_reserve_flush_enum flush) { u64 profile; u64 avail; @@ -174,7 +173,7 @@ static int can_overcommit(struct btrfs_fs_info *fs_info, if (space_info->flags & BTRFS_BLOCK_GROUP_DATA) return 0; - if (system_chunk) + if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM) profile = btrfs_system_alloc_profile(fs_info); else profile = btrfs_metadata_alloc_profile(fs_info); @@ -228,8 +227,7 @@ void btrfs_try_granting_tickets(struct btrfs_fs_info *fs_info, /* Check and see if our ticket can be satisified now. */ if ((used + ticket->bytes <= space_info->total_bytes) || - can_overcommit(fs_info, space_info, ticket->bytes, flush, - false)) { + can_overcommit(fs_info, space_info, ticket->bytes, flush)) { btrfs_space_info_update_bytes_may_use(fs_info, space_info, ticket->bytes); @@ -634,8 +632,7 @@ static void flush_space(struct btrfs_fs_info *fs_info, static inline u64 btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info, -struct btrfs_space_info *space_info, -bool system_chunk) +struct btrfs_space_info *space_info) { struct reserve_ticket *ticket; u64 used; @@ -651,13 +648,12 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info, to_reclaim = min_t(u64, num_online_cpus() * SZ_1M, SZ_16M); if (can_overcommit(fs_info, space_info, to_reclaim, - BTRFS_RESERVE_FLUSH_ALL, system_chunk)) + BTRFS_RESERVE_FLUSH_ALL)) return 0; used = btrfs_space_info_used(space_info, true); - if (can_overcommit(fs_info, space_info, SZ_1M, - BTRFS_RESERVE_FLUSH_ALL, system_chunk)) + if (can_overcommit(fs_info, space_info, SZ_1M, BTRFS_RESERVE_FLUSH_ALL)) expected = div_factor_fine(space_info->total_bytes, 95); else expected = div_factor_fine(space_info->total_bytes, 90); @@ -673,7 +669,7 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info, static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info, struct btrfs_space_info *space_info, - u64 used, bool system_chunk) + u64 used) { u64 thresh = div_factor_fine(space_info->total_bytes, 98); @@ -681,8 +677,7 @@ static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info, if ((space_info->bytes_used + space_info->bytes_reserved) >= thresh) return 0; - if (!btrfs_calc_reclaim_metadata_size(fs_info, space_info, - system_chunk)) + if (!btrfs_calc_reclaim_metadata_size(fs_info, space_info)) return 0; return (used >= thresh && !btrfs_fs_closing(fs_info) && @@ -805,8 +800,7 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work) space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA); spin_lock(_info->lock); - to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info, - false); + to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info); if (!to_reclaim) { space_info->flush = 0; spin_unlock(_info->lock); @@ -825,8 +819,7 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work) return;
[PATCH stable-5.4 2/2] btrfs: take overcommit into account in inc_block_group_ro
From: Josef Bacik commit a30a3d2067536cbcce26c055e70cc3a6ae4fd45c upstream inc_block_group_ro does a calculation to see if we have enough room left over if we mark this block group as read only in order to see if it's ok to mark the block group as read only. The problem is this calculation _only_ works for data, where our used is always less than our total. For metadata we will overcommit, so this will almost always fail for metadata. Fix this by exporting btrfs_can_overcommit, and then see if we have enough space to remove the remaining free space in the block group we are trying to mark read only. If we do then we can mark this block group as read only. Reviewed-by: Qu Wenruo Signed-off-by: Josef Bacik Reviewed-by: David Sterba Signed-off-by: David Sterba Signed-off-by: Anand Jain --- fs/btrfs/block-group.c | 38 ++ fs/btrfs/space-info.c | 18 ++ fs/btrfs/space-info.h | 3 +++ 3 files changed, 39 insertions(+), 20 deletions(-) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index b167649f5f5d..ace49a999ece 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1186,7 +1186,6 @@ static int inc_block_group_ro(struct btrfs_block_group_cache *cache, int force) { struct btrfs_space_info *sinfo = cache->space_info; u64 num_bytes; - u64 sinfo_used; u64 min_allocable_bytes; int ret = -ENOSPC; @@ -1213,20 +1212,38 @@ static int inc_block_group_ro(struct btrfs_block_group_cache *cache, int force) num_bytes = cache->key.offset - cache->reserved - cache->pinned - cache->bytes_super - btrfs_block_group_used(>item); - sinfo_used = btrfs_space_info_used(sinfo, true); /* -* sinfo_used + num_bytes should always <= sinfo->total_bytes. -* -* Here we make sure if we mark this bg RO, we still have enough -* free space as buffer (if min_allocable_bytes is not 0). +* Data never overcommits, even in mixed mode, so do just the straight +* check of left over space in how much we have allocated. */ - if (sinfo_used + num_bytes + min_allocable_bytes <= - sinfo->total_bytes) { + if (force) { + ret = 0; + } else if (sinfo->flags & BTRFS_BLOCK_GROUP_DATA) { + u64 sinfo_used = btrfs_space_info_used(sinfo, true); + + /* +* Here we make sure if we mark this bg RO, we still have enough +* free space as buffer. +*/ + if (sinfo_used + num_bytes <= sinfo->total_bytes) + ret = 0; + } else { + /* +* We overcommit metadata, so we need to do the +* btrfs_can_overcommit check here, and we need to pass in +* BTRFS_RESERVE_NO_FLUSH to give ourselves the most amount of +* leeway to allow us to mark this block group as read only. +*/ + if (btrfs_can_overcommit(cache->fs_info, sinfo, num_bytes, +BTRFS_RESERVE_NO_FLUSH)) + ret = 0; + } + + if (!ret) { sinfo->bytes_readonly += num_bytes; cache->ro++; list_add_tail(>ro_list, >ro_bgs); - ret = 0; } out: spin_unlock(>lock); @@ -1235,9 +1252,6 @@ static int inc_block_group_ro(struct btrfs_block_group_cache *cache, int force) btrfs_info(cache->fs_info, "unable to make block group %llu ro", cache->key.objectid); - btrfs_info(cache->fs_info, - "sinfo_used=%llu bg_num_bytes=%llu min_allocable=%llu", - sinfo_used, num_bytes, min_allocable_bytes); btrfs_dump_space_info(cache->fs_info, cache->space_info, 0, 0); } return ret; diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c index e19e538d05f9..90500b6c41fc 100644 --- a/fs/btrfs/space-info.c +++ b/fs/btrfs/space-info.c @@ -160,9 +160,9 @@ static inline u64 calc_global_rsv_need_space(struct btrfs_block_rsv *global) return (global->size << 1); } -static int can_overcommit(struct btrfs_fs_info *fs_info, - struct btrfs_space_info *space_info, u64 bytes, - enum btrfs_reserve_flush_enum flush) +int btrfs_can_overcommit(struct btrfs_fs_info *fs_info, +struct btrfs_space_info *space_info, u64 bytes, +enum btrfs_reserve_flush_enum flush) { u64 profile; u64 avail; @@ -227,7 +227,8 @@ void btrfs_try_granting_tickets(struct btrfs_fs_info *fs_info, /* Check and see if our ticket can be satisified now. */
Re: [PATCH v2 add prerequisite-patch-id] btrfs: fix devid 0 without a replace item by failing the mount
Accidentally this patch went to the stable. Pls, ignore. Sorry for the noise. Thanks. On 12/10/20 1:26 pm, Anand Jain wrote: If there is a BTRFS_DEV_REPLACE_DEVID without a replace item, then it means some device is trying to attack or may be corrupted. Fail the mount so that the user can remove the attacking or fix the corrupted device. As of now if BTRFS_DEV_REPLACE_DEVID is present without the replace item, in __btrfs_free_extra_devids() we determine that there is an extra device, and free those extra devices but continue to mount the device. However, we were wrong in keeping tack of the rw_devices so the syzbot testcase failed as below [1]. [1] WARNING: CPU: 1 PID: 3612 at fs/btrfs/volumes.c:1166 close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 3612 Comm: syz-executor.2 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x198/0x1fd lib/dump_stack.c:118 panic+0x347/0x7c0 kernel/panic.c:231 __warn.cold+0x20/0x46 kernel/panic.c:600 report_bug+0x1bd/0x210 lib/bug.c:198 handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234 exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254 asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536 RIP: 0010:close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166 Code: 0f b6 04 02 84 c0 74 02 7e 33 48 8b 44 24 18 c6 80 30 01 00 00 00 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 e8 99 ce 6a fe <0f> 0b e9 71 ff ff ff e8 8d ce 6a fe 0f 0b e9 20 ff ff ff e8 d1 d5 RSP: 0018:c900091777e0 EFLAGS: 00010246 RAX: 0004 RBX: RCX: c9000c8b7000 RDX: 0004 RSI: 83097f47 RDI: 0007 RBP: dc00 R08: 0001 R09: 8880988a187f R10: R11: 0001 R12: 88809593a130 R13: 88809593a1ec R14: 8880988a1908 R15: 88809593a050 close_fs_devices fs/btrfs/volumes.c:1193 [inline] btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179 open_ctree+0x4984/0x4a2d fs/btrfs/disk-io.c:3434 btrfs_fill_super fs/btrfs/super.c:1316 [inline] btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672 The fix here is, when we determine that there isn't a replace item then fail the mount if there is a replace target device (devid 0). Reported-by: syzbot+4cfe71a4da060be47...@syzkaller.appspotmail.com Signed-off-by: Anand Jain --- Depends on the patches btrfs: drop never met condition of disk_total_bytes == 0 btrfs: fix btrfs_find_device unused arg seed If these patches aren't integrated yet, then please add the last arg in the function btrfs_find_device(). Any value is fine as it doesn't care. fstest case will follow. v2: changed title old: btrfs: fix rw_devices count in __btrfs_free_extra_devids In btrfs_init_dev_replace() try to match the presence of replace_item to the BTRFS_DEV_REPLACE_DEVID device. If fails then fail the mount. So drop the similar check in __btrfs_free_extra_devids(). fs/btrfs/dev-replace.c | 26 -- fs/btrfs/volumes.c | 26 +++--- 2 files changed, 31 insertions(+), 21 deletions(-) diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index 9340d03661cd..e2b7ae386224 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -91,6 +91,17 @@ int btrfs_init_dev_replace(struct btrfs_fs_info *fs_info) ret = btrfs_search_slot(NULL, dev_root, , path, 0, 0); if (ret) { no_valid_dev_replace_entry_found: + /* +* We don't have a replace item or it's corrupted. +* If there is a replace target, fail the mount. +*/ + if (btrfs_find_device(fs_info->fs_devices, + BTRFS_DEV_REPLACE_DEVID, NULL, NULL)) { + btrfs_err(fs_info, + "found replace target device without a replace item"); + ret = -EIO; + goto out; + } ret = 0; dev_replace->replace_state = BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED; @@ -143,8 +154,19 @@ int btrfs_init_dev_replace(struct btrfs_fs_info *fs_info) case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED: case BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED: case BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED: - dev_replace->srcdev = NULL; - dev_replace->tgtdev = NULL; + /* +* We don't have an active replace item but if there is a +* replace target, fail the mount. +*/ + if (btrfs_find_device(fs_info->fs_devices, + BTRFS_DEV_REPLACE_DEVID, NULL, NULL)) { +
[PATCH v2 add prerequisite-patch-id] btrfs: fix devid 0 without a replace item by failing the mount
If there is a BTRFS_DEV_REPLACE_DEVID without a replace item, then it means some device is trying to attack or may be corrupted. Fail the mount so that the user can remove the attacking or fix the corrupted device. As of now if BTRFS_DEV_REPLACE_DEVID is present without the replace item, in __btrfs_free_extra_devids() we determine that there is an extra device, and free those extra devices but continue to mount the device. However, we were wrong in keeping tack of the rw_devices so the syzbot testcase failed as below [1]. [1] WARNING: CPU: 1 PID: 3612 at fs/btrfs/volumes.c:1166 close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 3612 Comm: syz-executor.2 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x198/0x1fd lib/dump_stack.c:118 panic+0x347/0x7c0 kernel/panic.c:231 __warn.cold+0x20/0x46 kernel/panic.c:600 report_bug+0x1bd/0x210 lib/bug.c:198 handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234 exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254 asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536 RIP: 0010:close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166 Code: 0f b6 04 02 84 c0 74 02 7e 33 48 8b 44 24 18 c6 80 30 01 00 00 00 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 e8 99 ce 6a fe <0f> 0b e9 71 ff ff ff e8 8d ce 6a fe 0f 0b e9 20 ff ff ff e8 d1 d5 RSP: 0018:c900091777e0 EFLAGS: 00010246 RAX: 0004 RBX: RCX: c9000c8b7000 RDX: 0004 RSI: 83097f47 RDI: 0007 RBP: dc00 R08: 0001 R09: 8880988a187f R10: R11: 0001 R12: 88809593a130 R13: 88809593a1ec R14: 8880988a1908 R15: 88809593a050 close_fs_devices fs/btrfs/volumes.c:1193 [inline] btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179 open_ctree+0x4984/0x4a2d fs/btrfs/disk-io.c:3434 btrfs_fill_super fs/btrfs/super.c:1316 [inline] btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672 The fix here is, when we determine that there isn't a replace item then fail the mount if there is a replace target device (devid 0). Reported-by: syzbot+4cfe71a4da060be47...@syzkaller.appspotmail.com Signed-off-by: Anand Jain --- Depends on the patches btrfs: drop never met condition of disk_total_bytes == 0 btrfs: fix btrfs_find_device unused arg seed If these patches aren't integrated yet, then please add the last arg in the function btrfs_find_device(). Any value is fine as it doesn't care. fstest case will follow. v2: changed title old: btrfs: fix rw_devices count in __btrfs_free_extra_devids In btrfs_init_dev_replace() try to match the presence of replace_item to the BTRFS_DEV_REPLACE_DEVID device. If fails then fail the mount. So drop the similar check in __btrfs_free_extra_devids(). fs/btrfs/dev-replace.c | 26 -- fs/btrfs/volumes.c | 26 +++--- 2 files changed, 31 insertions(+), 21 deletions(-) diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index 9340d03661cd..e2b7ae386224 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -91,6 +91,17 @@ int btrfs_init_dev_replace(struct btrfs_fs_info *fs_info) ret = btrfs_search_slot(NULL, dev_root, , path, 0, 0); if (ret) { no_valid_dev_replace_entry_found: + /* +* We don't have a replace item or it's corrupted. +* If there is a replace target, fail the mount. +*/ + if (btrfs_find_device(fs_info->fs_devices, + BTRFS_DEV_REPLACE_DEVID, NULL, NULL)) { + btrfs_err(fs_info, + "found replace target device without a replace item"); + ret = -EIO; + goto out; + } ret = 0; dev_replace->replace_state = BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED; @@ -143,8 +154,19 @@ int btrfs_init_dev_replace(struct btrfs_fs_info *fs_info) case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED: case BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED: case BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED: - dev_replace->srcdev = NULL; - dev_replace->tgtdev = NULL; + /* +* We don't have an active replace item but if there is a +* replace target, fail the mount. +*/ + if (btrfs_find_device(fs_info->fs_devices, + BTRFS_DEV_REPLACE_DEVID, NULL, NULL)) { + btrfs_err(fs_info, + "replace devid present without an active replace item"); +
[PATCH stable-5.4 4/6] btrfs: Ensure we trim ranges across block group boundary
From: Qu Wenruo commit 6b7faadd985c990324b5b5bd18cc4ba5c395eb65 upstream. [BUG] When deleting large files (which cross block group boundary) with discard mount option, we find some btrfs_discard_extent() calls only trimmed part of its space, not the whole range: btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50% type: bbio->map_type, in above case, it's SINGLE DATA. start: Logical address of this trim len:Logical length of this trim trimmed:Physically trimmed bytes ratio: trimmed / len Thus leaving some unused space not discarded. [CAUSE] When discard mount option is specified, after a transaction is fully committed (super block written to disk), we begin to cleanup pinned extents in the following call chain: btrfs_commit_transaction() |- btrfs_finish_extent_commit() |- find_first_extent_bit(unpin, 0, , , EXTENT_DIRTY); |- btrfs_discard_extent() However, pinned extents are recorded in an extent_io_tree, which can merge adjacent extent states. When a large file gets deleted and it has adjacent file extents across block group boundary, we will get a large merged range like this: |<---BG1--->|<--- BG2 --->| |//|<-- Range to discard --->|/| To discard that range, we have the following calls: btrfs_discard_extent() |- btrfs_map_block() | Returned bbio will end at BG1's end. As btrfs_map_block() | never returns result across block group boundary. |- btrfs_issuse_discard() Issue discard for each stripe. So we will only discard the range in BG1, not the remaining part in BG2. Furthermore, this bug is not that reliably observed, for above case, if there is no other extent in BG2, BG2 will be empty and btrfs will trim all space of BG2, covering up the bug. [FIX] - Allow __btrfs_map_block_for_discard() to modify @length parameter btrfs_map_block() uses its @length paramter to notify the caller how many bytes are mapped in current call. With __btrfs_map_block_for_discard() also modifing the @length, btrfs_discard_extent() now understands when to do extra trim. - Call btrfs_map_block() in a loop until we hit the range end Since we now know how many bytes are mapped each time, we can iterate through each block group boundary and issue correct trim for each range. Reviewed-by: Filipe Manana Reviewed-by: Nikolay Borisov Tested-by: Nikolay Borisov Reviewed-by: Josef Bacik Signed-off-by: Qu Wenruo Signed-off-by: David Sterba Signed-off-by: Anand Jain --- fs/btrfs/extent-tree.c | 41 +++-- fs/btrfs/volumes.c | 6 -- 2 files changed, 35 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index ef05cbacef73..a546212594e8 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -1306,8 +1306,10 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len, int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr, u64 num_bytes, u64 *actual_bytes) { - int ret; + int ret = 0; u64 discarded_bytes = 0; + u64 end = bytenr + num_bytes; + u64 cur = bytenr; struct btrfs_bio *bbio = NULL; @@ -1316,15 +1318,23 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr, * associated to its stripes that don't go away while we are discarding. */ btrfs_bio_counter_inc_blocked(fs_info); - /* Tell the block device(s) that the sectors can be discarded */ - ret = btrfs_map_block(fs_info, BTRFS_MAP_DISCARD, bytenr, _bytes, - , 0); - /* Error condition is -ENOMEM */ - if (!ret) { - struct btrfs_bio_stripe *stripe = bbio->stripes; + while (cur < end) { + struct btrfs_bio_stripe *stripe; int i; + num_bytes = end - cur; + /* Tell the block device(s) that the sectors can be discarded */ + ret = btrfs_map_block(fs_info, BTRFS_MAP_DISCARD, cur, + _bytes, , 0); + /* +* Error can be -ENOMEM, -ENOENT (no such chunk mapping) or +* -EOPNOTSUPP. For any such error, @num_bytes is not updated, +* thus we can't continue anyway. +*/ + if (ret < 0) + goto out; + stripe = bbio->stripes; for (i = 0; i < bbio->num_stripes; i++, stripe++) { u64 bytes; struct request_queue *req_q; @@ -1341,10 +1351,19 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr, stripe->physical,
[PATCH stable-5.4 0/6] backport fixes for xfstests btrfs/199,200,203,204
This series backports fixes for the xfstests test cases btrfs/199 btrfs/200 btrfs/203 and btrfs/204. patch 1 is fix for btrfs/200 patch 2 fixes regression in patch 1 and is fix for btrfs/203 patch 3 helps to fix conflicts in patch 4 patch 4 is fix for btrfs/199 patch 5 helps to avoid conflicts in patch6 patch 6 is fix for btrfs/204 patch1 Btrfs: send, allow clone operations within the same file patch2 Btrfs: send, fix emission of invalid clone operations within the same file patch3 btrfs: volumes: Use more straightforward way to calculate map length patch4 btrfs: Ensure we trim ranges across block group boundary patch5 btrfs: fix RWF_NOWAIT write not failling when we need to cow patch6 btrfs: allow btrfs_truncate_block() to fallback to nocow for data space reservation Filipe Manana (3): Btrfs: send, allow clone operations within the same file Btrfs: send, fix emission of invalid clone operations within the same file btrfs: fix RWF_NOWAIT write not failling when we need to cow Qu Wenruo (3): btrfs: volumes: Use more straightforward way to calculate map length btrfs: Ensure we trim ranges across block group boundary btrfs: allow btrfs_truncate_block() to fallback to nocow for data space reservation fs/btrfs/ctree.h | 2 ++ fs/btrfs/extent-tree.c | 41 +-- fs/btrfs/file.c| 23 ++ fs/btrfs/inode.c | 44 +++--- fs/btrfs/send.c| 19 +- fs/btrfs/volumes.c | 8 +--- 6 files changed, 108 insertions(+), 29 deletions(-) -- 2.25.1
[PATCH stable-5.4 5/6] btrfs: fix RWF_NOWAIT write not failling when we need to cow
From: Filipe Manana commit 260a63395f90f67d6ab89e4266af9e3dc34a77e9 upstream. If we attempt to do a RWF_NOWAIT write against a file range for which we can only do NOCOW for a part of it, due to the existence of holes or shared extents for example, we proceed with the write as if it were possible to NOCOW the whole range. Example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ touch /mnt/sdj/bar $ chattr +C /mnt/sdj/bar $ xfs_io -d -c "pwrite -S 0xab -b 256K 0 256K" /mnt/bar wrote 262144/262144 bytes at offset 0 256 KiB, 1 ops; 0.0003 sec (694.444 MiB/sec and 2777.7778 ops/sec) $ xfs_io -c "fpunch 64K 64K" /mnt/bar $ sync $ xfs_io -d -c "pwrite -N -V 1 -b 128K -S 0xfe 0 128K" /mnt/bar wrote 131072/131072 bytes at offset 0 128 KiB, 1 ops; 0.0007 sec (160.051 MiB/sec and 1280.4097 ops/sec) This last write should fail with -EAGAIN since the file range from 64K to 128K is a hole. On xfs it fails, as expected, but on ext4 it currently succeeds because apparently it is expensive to check if there are extents allocated for the whole range, but I'll check with the ext4 people. Fix the issue by checking if check_can_nocow() returns a number of NOCOW'able bytes smaller then the requested number of bytes, and if it does return -EAGAIN. Fixes: edf064e7c6fec3 ("btrfs: nowait aio support") CC: sta...@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana Reviewed-by: David Sterba Signed-off-by: David Sterba Signed-off-by: Anand Jain --- fs/btrfs/file.c | 16 +++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 4e4ddd5629e5..4d0e1f9452a5 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1919,13 +1919,27 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, pos = iocb->ki_pos; count = iov_iter_count(from); if (iocb->ki_flags & IOCB_NOWAIT) { + size_t nocow_bytes = count; + /* * We will allocate space in case nodatacow is not set, * so bail */ if (!(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW | BTRFS_INODE_PREALLOC)) || - check_can_nocow(BTRFS_I(inode), pos, ) <= 0) { + check_can_nocow(BTRFS_I(inode), pos, _bytes) <= 0) { + inode_unlock(inode); + return -EAGAIN; + } + + /* check_can_nocow() locks the snapshot lock on success */ + btrfs_end_write_no_snapshotting(root); + /* +* There are holes in the range or parts of the range that must +* be COWed (shared extents, RO block groups, etc), so just bail +* out. +*/ + if (nocow_bytes < count) { inode_unlock(inode); return -EAGAIN; } -- 2.25.1
[PATCH stable-5.4 6/6] btrfs: allow btrfs_truncate_block() to fallback to nocow for data space reservation
From: Qu Wenruo commit 6d4572a9d71d5fc2affee0258d8582d39859188c upstream. [BUG] When the data space is exhausted, even if the inode has NOCOW attribute, we will still refuse to truncate unaligned range due to ENOSPC. The following script can reproduce it pretty easily: #!/bin/bash dev=/dev/test/test mnt=/mnt/btrfs umount $dev &> /dev/null umount $mnt &> /dev/null mkfs.btrfs -f $dev -b 1G mount -o nospace_cache $dev $mnt touch $mnt/foobar chattr +C $mnt/foobar xfs_io -f -c "pwrite -b 4k 0 4k" $mnt/foobar > /dev/null xfs_io -f -c "pwrite -b 4k 0 1G" $mnt/padding &> /dev/null sync xfs_io -c "fpunch 0 2k" $mnt/foobar umount $mnt Currently this will fail at the fpunch part. [CAUSE] Because btrfs_truncate_block() always reserves space without checking the NOCOW attribute. Since the writeback path follows NOCOW bit, we only need to bother the space reservation code in btrfs_truncate_block(). [FIX] Make btrfs_truncate_block() follow btrfs_buffered_write() to try to reserve data space first, and fall back to NOCOW check only when we don't have enough space. Such always-try-reserve is an optimization introduced in btrfs_buffered_write(), to avoid expensive btrfs_check_can_nocow() call. This patch will export check_can_nocow() as btrfs_check_can_nocow(), and use it in btrfs_truncate_block() to fix the problem. Reported-by: Martin Doucha Reviewed-by: Filipe Manana Reviewed-by: Anand Jain Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Signed-off-by: Anand Jain Conflicts: fs/btrfs/ctree.h fs/btrfs/file.c fs/btrfs/inode.c --- fs/btrfs/ctree.h | 2 ++ fs/btrfs/file.c | 9 + fs/btrfs/inode.c | 44 +--- 3 files changed, 44 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 9a690c10afaa..23b4f38e2392 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2956,6 +2956,8 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end); loff_t btrfs_remap_file_range(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, loff_t len, unsigned int remap_flags); +int btrfs_check_can_nocow(struct btrfs_inode *inode, loff_t pos, + size_t *write_bytes); /* tree-defrag.c */ int btrfs_defrag_leaves(struct btrfs_trans_handle *trans, diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 4d0e1f9452a5..4126513e2429 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1546,8 +1546,8 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct page **pages, return ret; } -static noinline int check_can_nocow(struct btrfs_inode *inode, loff_t pos, - size_t *write_bytes) +int btrfs_check_can_nocow(struct btrfs_inode *inode, loff_t pos, + size_t *write_bytes) { struct btrfs_fs_info *fs_info = inode->root->fs_info; struct btrfs_root *root = inode->root; @@ -1647,7 +1647,7 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb, if (ret < 0) { if ((BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW | BTRFS_INODE_PREALLOC)) && - check_can_nocow(BTRFS_I(inode), pos, + btrfs_check_can_nocow(BTRFS_I(inode), pos, _bytes) > 0) { /* * For nodata cow case, no need to reserve @@ -1927,7 +1927,8 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, */ if (!(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW | BTRFS_INODE_PREALLOC)) || - check_can_nocow(BTRFS_I(inode), pos, _bytes) <= 0) { + btrfs_check_can_nocow(BTRFS_I(inode), pos, + _bytes) <= 0) { inode_unlock(inode); return -EAGAIN; } diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 9ac40991a640..2356b28afebe 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5133,11 +5133,13 @@ int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len, struct extent_state *cached_state = NULL; struct extent_changeset *data_reserved = NULL; char *kaddr; + bool only_release_metadata = false; u32 blocksize = fs_info->sectorsize; pgoff_t index = from >> PAGE_SHIFT; unsigned offset = from & (blocksize - 1); struct page *page; gfp_t mask = btrfs_alloc_write_mask(mapping); + size_t write_bytes = blocksize; int ret = 0;
[PATCH stable-5.4 2/6] Btrfs: send, fix emission of invalid clone operations within the same file
From: Filipe Manana commit 9722b10148504c4153a74a9c89725af271e490fc upstream. When doing an incremental send and a file has extents shared with itself at different file offsets, it's possible for send to emit clone operations that will fail at the destination because the source range goes beyond the file's current size. This happens when the file size has increased in the send snapshot, there is a hole between the shared extents and both shared extents are at file offsets which are greater the file's size in the parent snapshot. Example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt/sdb $ xfs_io -f -c "pwrite -S 0xf1 0 64K" /mnt/sdb/foobar $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base $ btrfs send -f /tmp/1.snap /mnt/sdb/base # Create a 320K extent at file offset 512K. $ xfs_io -c "pwrite -S 0xab 512K 64K" /mnt/sdb/foobar $ xfs_io -c "pwrite -S 0xcd 576K 64K" /mnt/sdb/foobar $ xfs_io -c "pwrite -S 0xef 640K 64K" /mnt/sdb/foobar $ xfs_io -c "pwrite -S 0x64 704K 64K" /mnt/sdb/foobar $ xfs_io -c "pwrite -S 0x73 768K 64K" /mnt/sdb/foobar # Clone part of that 320K extent into a lower file offset (192K). # This file offset is greater than the file's size in the parent # snapshot (64K). Also the clone range is a bit behind the offset of # the 320K extent so that we leave a hole between the shared extents. $ xfs_io -c "reflink /mnt/sdb/foobar 448K 192K 192K" /mnt/sdb/foobar $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr $ btrfs send -p /mnt/sdb/base -f /tmp/2.snap /mnt/sdb/incr $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt/sdc $ btrfs receive -f /tmp/1.snap /mnt/sdc $ btrfs receive -f /tmp/2.snap /mnt/sdc ERROR: failed to clone extents to foobar: Invalid argument The problem is that after processing the extent at file offset 256K, which refers to the first 128K of the 320K extent created by the buffered write operations, we have 'cur_inode_next_write_offset' set to 384K, which corresponds to the end offset of the partially shared extent (256K + 128K) and to the current file size in the receiver. Then when we process the extent at offset 512K, we do extent backreference iteration to figure out if we can clone the extent from some other inode or from the same inode, and we consider the extent at offset 256K of the same inode as a valid source for a clone operation, which is not correct because at that point the current file size in the receiver is 384K, which corresponds to the end of last processed extent (at file offset 256K), so using a clone source range from 256K to 256K + 320K is invalid because that goes past the current size of the file (384K) - this makes the receiver get an -EINVAL error when attempting the clone operation. So fix this by excluding clone sources that have a range that goes beyond the current file size in the receiver when iterating extent backreferences. A test case for fstests follows soon. Fixes: 11f2069c113e02 ("Btrfs: send, allow clone operations within the same file") CC: sta...@vger.kernel.org # 5.5+ Reviewed-by: Josef Bacik Signed-off-by: Filipe Manana Signed-off-by: David Sterba Signed-off-by: Anand Jain --- fs/btrfs/send.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c index 48d028b1cfaf..b0e5dfb9be7a 100644 --- a/fs/btrfs/send.c +++ b/fs/btrfs/send.c @@ -1270,7 +1270,8 @@ static int __iterate_backrefs(u64 ino, u64 offset, u64 root, void *ctx_) * destination of the stream. */ if (ino == bctx->cur_objectid && - offset >= bctx->sctx->cur_inode_next_write_offset) + offset + bctx->extent_len > + bctx->sctx->cur_inode_next_write_offset) return 0; } -- 2.25.1
[PATCH stable-5.4 3/6] btrfs: volumes: Use more straightforward way to calculate map length
From: Qu Wenruo commit 2d974619a77f106f3d1341686dea95c0eaad601f upstream. The old code goes: offset = logical - em->start; length = min_t(u64, em->len - offset, length); Where @length calculation is dependent on offset, it can take reader several more seconds to find it's just the same code as: offset = logical - em->start; length = min_t(u64, em->start + em->len - logical, length); Use above code to make the length calculate independent from other variable, thus slightly increase the readability. Reviewed-by: Johannes Thumshirn Reviewed-by: Josef Bacik Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Signed-off-by: Anand Jain --- fs/btrfs/volumes.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 196ddbcd2936..1dba45d43485 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -5712,7 +5712,7 @@ static int __btrfs_map_block_for_discard(struct btrfs_fs_info *fs_info, } offset = logical - em->start; - length = min_t(u64, em->len - offset, length); + length = min_t(u64, em->start + em->len - logical, length); stripe_len = map->stripe_len; /* -- 2.25.1
[PATCH stable-5.4 1/6] Btrfs: send, allow clone operations within the same file
From: Filipe Manana commit 11f2069c113e02971b8db6fda62f9b9cd31a030f upstream. For send we currently skip clone operations when the source and destination files are the same. This is so because clone didn't support this case in its early days, but support for it was added back in May 2013 by commit a96fbc72884fcb ("Btrfs: allow file data clone within a file"). This change adds support for it. Example: $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt/sdd $ xfs_io -f -c "pwrite -S 0xab -b 64K 0 64K" /mnt/sdd/foobar $ xfs_io -c "reflink /mnt/sdd/foobar 0 64K 64K" /mnt/sdd/foobar $ btrfs subvolume snapshot -r /mnt/sdd /mnt/sdd/snap $ mkfs.btrfs -f /dev/sde $ mount /dev/sde /mnt/sde $ btrfs send /mnt/sdd/snap | btrfs receive /mnt/sde Without this change file foobar at the destination has a single 128Kb extent: $ filefrag -v /mnt/sde/snap/foobar Filesystem type is: 9123683e File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes) ext: logical_offset:physical_offset: length: expected: flags: 0:0.. 31: 0..31: 32: last,unknown_loc,delalloc,eof /mnt/sde/snap/foobar: 1 extent found With this we get a single 64Kb extent that is shared at file offsets 0 and 64K, just like in the source filesystem: $ filefrag -v /mnt/sde/snap/foobar Filesystem type is: 9123683e File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes) ext: logical_offset:physical_offset: length: expected: flags: 0:0.. 15: 3328.. 3343: 16: shared 1: 16.. 31: 3328.. 3343: 16: 3344: last,shared,eof /mnt/sde/snap/foobar: 2 extents found Reviewed-by: Josef Bacik Signed-off-by: Filipe Manana Reviewed-by: David Sterba Signed-off-by: David Sterba Signed-off-by: Anand Jain --- fs/btrfs/send.c | 18 +- 1 file changed, 13 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c index 6ad216e8178e..48d028b1cfaf 100644 --- a/fs/btrfs/send.c +++ b/fs/btrfs/send.c @@ -1257,12 +1257,20 @@ static int __iterate_backrefs(u64 ino, u64 offset, u64 root, void *ctx_) */ if (found->root == bctx->sctx->send_root) { /* -* TODO for the moment we don't accept clones from the inode -* that is currently send. We may change this when -* BTRFS_IOC_CLONE_RANGE supports cloning from and to the same -* file. +* If the source inode was not yet processed we can't issue a +* clone operation, as the source extent does not exist yet at +* the destination of the stream. */ - if (ino >= bctx->cur_objectid) + if (ino > bctx->cur_objectid) + return 0; + /* +* We clone from the inode currently being sent as long as the +* source extent is already processed, otherwise we could try +* to clone from an extent that does not exist yet at the +* destination of the stream. +*/ + if (ino == bctx->cur_objectid && + offset >= bctx->sctx->cur_inode_next_write_offset) return 0; } -- 2.25.1
[PATCH stable-4.14.y] Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space
From: Robbie Ko [ Upstream commit 8ecebf4d767e2307a946c8905278d6358eda35c3 ] Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting") forced nocow writes to fallback to COW, during writeback, when a snapshot is created. This resulted in writes made before creating the snapshot to unexpectedly fail with ENOSPC during writeback when success (0) was returned to user space through the write system call. The steps leading to this problem are: 1. When it's not possible to allocate data space for a write, the buffered write path checks if a NOCOW write is possible. If it is, it will not reserve space and success (0) is returned to user space. 2. Then when a snapshot is created, the root's will_be_snapshotted atomic is incremented and writeback is triggered for all inode's that belong to the root being snapshotted. Incrementing that atomic forces all previous writes to fallback to COW during writeback (running delalloc). 3. This results in the writeback for the inodes to fail and therefore setting the ENOSPC error in their mappings, so that a subsequent fsync on them will report the error to user space. So it's not a completely silent data loss (since fsync will report ENOSPC) but it's a very unexpected and undesirable behaviour, because if a clean shutdown/unmount of the filesystem happens without previous calls to fsync, it is expected to have the data present in the files after mounting the filesystem again. So fix this by adding a new atomic named snapshot_force_cow to the root structure which prevents this behaviour and works the following way: 1. It is incremented when we start to create a snapshot after triggering writeback and before waiting for writeback to finish. 2. This new atomic is now what is used by writeback (running delalloc) to decide whether we need to fallback to COW or not. Because we incremented this new atomic after triggering writeback in the snapshot creation ioctl, we ensure that all buffered writes that happened before snapshot creation will succeed and not fallback to COW (which would make them fail with ENOSPC). 3. The existing atomic, will_be_snapshotted, is kept because it is used to force new buffered writes, that start after we started snapshotting, to reserve data space even when NOCOW is possible. This makes these writes fail early with ENOSPC when there's no available space to allocate, preventing the unexpected behaviour of writeback later failing with ENOSPC due to a fallback to COW mode. Fixes: e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting") Signed-off-by: Robbie Ko Reviewed-by: Filipe Manana Signed-off-by: David Sterba Signed-off-by: Anand Jain Conflicts: fs/btrfs/disk-io.c --- fs/btrfs/ctree.h | 1 + fs/btrfs/disk-io.c | 1 + fs/btrfs/inode.c | 25 - fs/btrfs/ioctl.c | 16 4 files changed, 22 insertions(+), 21 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index de951987fd23..19a668e9164b 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1257,6 +1257,7 @@ struct btrfs_root { int send_in_progress; struct btrfs_subvolume_writers *subv_writers; atomic_t will_be_snapshotted; + atomic_t snapshot_force_cow; /* For qgroup metadata space reserve */ atomic64_t qgroup_meta_rsv; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 495430e4f84b..ace58d6a270b 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1200,6 +1200,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info, refcount_set(>refs, 1); atomic_set(>will_be_snapshotted, 0); atomic64_set(>qgroup_meta_rsv, 0); + atomic_set(>snapshot_force_cow, 0); root->log_transid = 0; root->log_transid_committed = -1; root->last_log_commit = 0; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index c9e7b92d0f21..e985e820724e 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1335,7 +1335,7 @@ static noinline int run_delalloc_nocow(struct inode *inode, u64 disk_num_bytes; u64 ram_bytes; int extent_type; - int ret, err; + int ret; int type; int nocow; int check_prev = 1; @@ -1460,11 +1460,8 @@ static noinline int run_delalloc_nocow(struct inode *inode, * if there are pending snapshots for this root, * we fall into common COW way. */ - if (!nolock) { - err = btrfs_start_write_no_snapshotting(root); - if (!err) - goto out_check; - } + if (!nolock && atomic_read(>snapshot_force_cow)) + goto out_check; /*
Re: WARNING in close_fs_devices (2)
On 18/9/20 7:22 pm, syzbot wrote: Hello, syzbot found the following issue on: HEAD commit:e4c26faa Merge tag 'usb-5.9-rc5' of git://git.kernel.org/p.. git tree: upstream console output: https://syzkaller.appspot.com/x/log.txt?x=15bf162190 kernel config: https://syzkaller.appspot.com/x/.config?x=c61610091f4ca8c4 dashboard link: https://syzkaller.appspot.com/bug?extid=4cfe71a4da060be47502 compiler: gcc (GCC) 10.1.0-syz 20200507 Unfortunately, I don't have any reproducer for this issue yet. IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+4cfe71a4da060be47...@syzkaller.appspotmail.com [ cut here ] WARNING: CPU: 1 PID: 3612 at fs/btrfs/volumes.c:1166 close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 3612 Comm: syz-executor.2 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x198/0x1fd lib/dump_stack.c:118 panic+0x347/0x7c0 kernel/panic.c:231 __warn.cold+0x20/0x46 kernel/panic.c:600 report_bug+0x1bd/0x210 lib/bug.c:198 handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234 exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254 asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536 RIP: 0010:close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166 Code: 0f b6 04 02 84 c0 74 02 7e 33 48 8b 44 24 18 c6 80 30 01 00 00 00 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 e8 99 ce 6a fe <0f> 0b e9 71 ff ff ff e8 8d ce 6a fe 0f 0b e9 20 ff ff ff e8 d1 d5 RSP: 0018:c900091777e0 EFLAGS: 00010246 RAX: 0004 RBX: RCX: c9000c8b7000 RDX: 0004 RSI: 83097f47 RDI: 0007 RBP: dc00 R08: 0001 R09: 8880988a187f R10: R11: 0001 R12: 88809593a130 R13: 88809593a1ec R14: 8880988a1908 R15: 88809593a050 close_fs_devices fs/btrfs/volumes.c:1193 [inline] btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179 open_ctree+0x4984/0x4a2d fs/btrfs/disk-io.c:3434 btrfs_fill_super fs/btrfs/super.c:1316 [inline] btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672 legacy_get_tree+0x105/0x220 fs/fs_context.c:592 vfs_get_tree+0x89/0x2f0 fs/super.c:1547 fc_mount fs/namespace.c:978 [inline] vfs_kern_mount.part.0+0xd3/0x170 fs/namespace.c:1008 vfs_kern_mount+0x3c/0x60 fs/namespace.c:995 btrfs_mount+0x234/0xaa0 fs/btrfs/super.c:1732 legacy_get_tree+0x105/0x220 fs/fs_context.c:592 vfs_get_tree+0x89/0x2f0 fs/super.c:1547 do_new_mount fs/namespace.c:2875 [inline] path_mount+0x1387/0x2070 fs/namespace.c:3192 do_mount fs/namespace.c:3205 [inline] __do_sys_mount fs/namespace.c:3413 [inline] __se_sys_mount fs/namespace.c:3390 [inline] __x64_sys_mount+0x27f/0x300 fs/namespace.c:3390 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x46004a Code: b8 a6 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 fd 89 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 da 89 fb ff c3 66 0f 1f 84 00 00 00 00 00 RSP: 002b:7f414d78da88 EFLAGS: 0202 ORIG_RAX: 00a5 RAX: ffda RBX: 7f414d78db20 RCX: 0046004a RDX: 2000 RSI: 2100 RDI: 7f414d78dae0 RBP: 7f414d78dae0 R08: 7f414d78db20 R09: 2000 R10: R11: 0202 R12: 2000 R13: 2100 R14: 2200 R15: 2001a800 Kernel Offset: disabled Rebooting in 86400 seconds.. #syz fix: btrfs: fix rw_devices count in __btrfs_free_extra_devids
Re: WARNING in close_fs_devices (2)
On 21/9/20 6:42 am, syzbot wrote: syzbot has found a reproducer for the following issue on: HEAD commit:b652d2a5 Add linux-next specific files for 20200918 git tree: linux-next console output: https://syzkaller.appspot.com/x/log.txt?x=17e84b0790 kernel config: https://syzkaller.appspot.com/x/.config?x=3cf0782933432b43 dashboard link: https://syzkaller.appspot.com/bug?extid=4cfe71a4da060be47502 compiler: gcc (GCC) 10.1.0-syz 20200507 syz repro: https://syzkaller.appspot.com/x/repro.syz?x=112425d990 C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1486929b90 IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+4cfe71a4da060be47...@syzkaller.appspotmail.com [ cut here ] WARNING: CPU: 0 PID: 6972 at fs/btrfs/volumes.c:1172 close_fs_devices+0x715/0x930 fs/btrfs/volumes.c:1172 Modules linked in: CPU: 1 PID: 6972 Comm: syz-executor044 Not tainted 5.9.0-rc5-next-20200918-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:close_fs_devices+0x715/0x930 fs/btrfs/volumes.c:1172 Code: e8 00 b8 4c fe 85 db 0f 85 65 f9 ff ff e8 93 bb 4c fe 0f 0b e9 59 f9 ff ff e8 87 bb 4c fe 0f 0b e9 c0 fe ff ff e8 7b bb 4c fe <0f> 0b e9 f9 fe ff ff 48 c7 c7 fc a1 8f 8b e8 e8 0b 8e fe e9 19 f9 RSP: 0018:c900061b7758 EFLAGS: 00010293 RAX: RBX: RCX: 83285c2c RDX: 8880a6bbe4c0 RSI: 83285d35 RDI: 0007 RBP: dc00 R08: R09: 8880a2be1133 R10: R11: R12: 8880a2be1130 R13: 8880a2be11ec R14: 888093ab0508 R15: 8880a2be1050 FS: 0208a880() GS:8880ae50() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 004babec CR3: a7bc7000 CR4: 001506e0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: btrfs_close_devices+0x8e/0x4b0 fs/btrfs/volumes.c:1184 open_ctree+0x492a/0x49cf fs/btrfs/disk-io.c:3381 btrfs_fill_super fs/btrfs/super.c:1316 [inline] btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672 legacy_get_tree+0x105/0x220 fs/fs_context.c:592 vfs_get_tree+0x89/0x2f0 fs/super.c:1547 fc_mount fs/namespace.c:983 [inline] vfs_kern_mount.part.0+0xd3/0x170 fs/namespace.c:1013 vfs_kern_mount+0x3c/0x60 fs/namespace.c:1000 btrfs_mount+0x234/0xaa0 fs/btrfs/super.c:1732 legacy_get_tree+0x105/0x220 fs/fs_context.c:592 vfs_get_tree+0x89/0x2f0 fs/super.c:1547 do_new_mount fs/namespace.c:2896 [inline] path_mount+0x12ae/0x1e70 fs/namespace.c:3216 do_mount fs/namespace.c:3229 [inline] __do_sys_mount fs/namespace.c:3437 [inline] __se_sys_mount fs/namespace.c:3414 [inline] __x64_sys_mount+0x27f/0x300 fs/namespace.c:3414 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x44851a Code: b8 08 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 cd a2 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 aa a2 fb ff c3 66 0f 1f 84 00 00 00 00 00 RSP: 002b:7ffcb26bce08 EFLAGS: 0293 ORIG_RAX: 00a5 RAX: ffda RBX: 0004 RCX: 0044851a RDX: 2000 RSI: 2100 RDI: 7ffcb26bce50 RBP: 7ffcb26bce90 R08: 7ffcb26bce90 R09: 2000 R10: R11: 0293 R12: 0003 R13: 7ffcb26bce50 R14: 0003 R15: 0001 I am able to reproduce. And its quite strange at the moment. A devid 0 is being scanned. Looks like crafted image. [19592.946397] BTRFS: device fsid f90cac8b-044b-4fa8-8bee-4b8d3da88dc2 devid 0 transid 0 /dev/loop0 scanned by t (70902)
Re: [PATCH -next] btrfs: Make btrfs_sysfs_add_fs_devices static
On 16/9/20 10:26 pm, YueHaibing wrote: Fix sparse warning: fs/btrfs/sysfs.c:1386:5: warning: symbol 'btrfs_sysfs_add_fs_devices' was not declared. Should it be static? misc-next branch has it declared static. It was fixed later. Thanks, Anand Signed-off-by: YueHaibing --- fs/btrfs/sysfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c index e7b0e10685d9..279d9262b676 100644 --- a/fs/btrfs/sysfs.c +++ b/fs/btrfs/sysfs.c @@ -1383,7 +1383,7 @@ int btrfs_sysfs_add_device(struct btrfs_device *device) return ret; } -int btrfs_sysfs_add_fs_devices(struct btrfs_fs_devices *fs_devices) +static int btrfs_sysfs_add_fs_devices(struct btrfs_fs_devices *fs_devices) { int ret; struct btrfs_device *device;
Re: [PATCH -next] btrfs: Remove unused function calc_global_rsv_need_space()
On 9/9/20 9:51 pm, YueHaibing wrote: It is not used since commit 0096420adb03 ("btrfs: do not account global reserve in can_overcommit") Signed-off-by: YueHaibing Reviewed-by: Anand Jain Thanks, Anand
Re: [PATCH -next] btrfs: remove set but not used variable 'offset'
On 2/7/19 10:15 PM, YueHaibing wrote: Fixes gcc '-Wunused-but-set-variable' warning: fs/btrfs/volumes.c: In function __btrfs_map_block: fs/btrfs/volumes.c:6023:6: warning: variable offset set but not used [-Wunused-but-set-variable] It is not used any more since commit 343abd1c0ca9 ("btrfs: Use btrfs_get_io_geometry appropriately") Reported-by: Hulk Robot Signed-off-by: YueHaibing Reviewed-by: Anand Jain --- fs/btrfs/volumes.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index d1fd910..5d5a9ff 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -6020,7 +6020,6 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, { struct extent_map *em; struct map_lookup *map; - u64 offset; u64 stripe_offset; u64 stripe_nr; u64 stripe_len; @@ -6055,7 +6054,6 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, map = em->map_lookup; *length = geom.len; - offset = geom.offset; stripe_len = geom.stripe_len; stripe_nr = geom.stripe_nr; stripe_offset = geom.stripe_offset;
Re: [PATCH 3.16 315/410] btrfs: use proper endianness accessors for super_copy
Hi Ben, This patch was reverted from the mainline and 4.14 stable as it only added half of the endianness conversion and caused regression. This patch should not be included. Thanks, Anand On 06/07/2018 10:05 PM, Ben Hutchings wrote: 3.16.57-rc1 review patch. If anyone has any objections, please let me know. -- From: Anand Jain commit 3c181c12c431fe33b669410d663beb9cceefcd1b upstream. The fs_info::super_copy is a byte copy of the on-disk structure and all members must use the accessor macros/functions to obtain the right value. This was missing in update_super_roots and in sysfs readers. Moving between opposite endianness hosts will report bogus numbers in sysfs, and mount may fail as the root will not be restored correctly. If the filesystem is always used on a same endian host, this will not be a problem. Fix this by using the btrfs_set_super...() functions to set fs_info::super_copy values, and for the sysfs, use the cached fs_info::nodesize/sectorsize values. Fixes: df93589a17378 ("btrfs: export more from FS_INFO to sysfs") Signed-off-by: Anand Jain Reviewed-by: Liu Bo Reviewed-by: David Sterba [ update changelog ] Signed-off-by: David Sterba [bwh: Backported to 3.16: - btrfs_fs_info doesn't have cached nodesize or sectorsize fields, so use the accessor functions - Adjust context] Signed-off-by: Ben Hutchings --- --- a/fs/btrfs/sysfs.c +++ b/fs/btrfs/sysfs.c @@ -406,7 +406,7 @@ static ssize_t btrfs_nodesize_show(struc { struct btrfs_fs_info *fs_info = to_fs_info(kobj); - return snprintf(buf, PAGE_SIZE, "%u\n", fs_info->super_copy->nodesize); + return snprintf(buf, PAGE_SIZE, "%u\n", btrfs_super_nodesize(fs_info->super_copy)); } BTRFS_ATTR_RW(nodesize, 0444, btrfs_nodesize_show, btrfs_no_store); @@ -416,7 +416,7 @@ static ssize_t btrfs_sectorsize_show(str { struct btrfs_fs_info *fs_info = to_fs_info(kobj); - return snprintf(buf, PAGE_SIZE, "%u\n", fs_info->super_copy->sectorsize); + return snprintf(buf, PAGE_SIZE, "%u\n", btrfs_super_sectorsize(fs_info->super_copy)); } BTRFS_ATTR_RW(sectorsize, 0444, btrfs_sectorsize_show, btrfs_no_store); @@ -426,7 +426,7 @@ static ssize_t btrfs_clone_alignment_sho { struct btrfs_fs_info *fs_info = to_fs_info(kobj); - return snprintf(buf, PAGE_SIZE, "%u\n", fs_info->super_copy->sectorsize); + return snprintf(buf, PAGE_SIZE, "%u\n", btrfs_super_sectorsize(fs_info->super_copy)); } BTRFS_ATTR_RW(clone_alignment, 0444, btrfs_clone_alignment_show, btrfs_no_store); --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1428,19 +1428,23 @@ static void update_super_roots(struct bt super = root->fs_info->super_copy; + /* update latest btrfs_super_block::chunk_root refs */ root_item = >fs_info->chunk_root->root_item; - super->chunk_root = root_item->bytenr; - super->chunk_root_generation = root_item->generation; - super->chunk_root_level = root_item->level; + btrfs_set_super_chunk_root(super, root_item->bytenr); + btrfs_set_super_chunk_root_generation(super, root_item->generation); + btrfs_set_super_chunk_root_level(super, root_item->level); + /* update latest btrfs_super_block::root refs */ root_item = >fs_info->tree_root->root_item; - super->root = root_item->bytenr; - super->generation = root_item->generation; - super->root_level = root_item->level; + btrfs_set_super_root(super, root_item->bytenr); + btrfs_set_super_generation(super, root_item->generation); + btrfs_set_super_root_level(super, root_item->level); + if (btrfs_test_opt(root, SPACE_CACHE)) - super->cache_generation = root_item->generation; + btrfs_set_super_cache_generation(super, root_item->generation); if (root->fs_info->update_uuid_tree_gen) - super->uuid_tree_generation = root_item->generation; + btrfs_set_super_uuid_tree_generation(super, +root_item->generation); } int btrfs_transaction_in_commit(struct btrfs_fs_info *info)
Re: [PATCH 3.16 315/410] btrfs: use proper endianness accessors for super_copy
Hi Ben, This patch was reverted from the mainline and 4.14 stable as it only added half of the endianness conversion and caused regression. This patch should not be included. Thanks, Anand On 06/07/2018 10:05 PM, Ben Hutchings wrote: 3.16.57-rc1 review patch. If anyone has any objections, please let me know. -- From: Anand Jain commit 3c181c12c431fe33b669410d663beb9cceefcd1b upstream. The fs_info::super_copy is a byte copy of the on-disk structure and all members must use the accessor macros/functions to obtain the right value. This was missing in update_super_roots and in sysfs readers. Moving between opposite endianness hosts will report bogus numbers in sysfs, and mount may fail as the root will not be restored correctly. If the filesystem is always used on a same endian host, this will not be a problem. Fix this by using the btrfs_set_super...() functions to set fs_info::super_copy values, and for the sysfs, use the cached fs_info::nodesize/sectorsize values. Fixes: df93589a17378 ("btrfs: export more from FS_INFO to sysfs") Signed-off-by: Anand Jain Reviewed-by: Liu Bo Reviewed-by: David Sterba [ update changelog ] Signed-off-by: David Sterba [bwh: Backported to 3.16: - btrfs_fs_info doesn't have cached nodesize or sectorsize fields, so use the accessor functions - Adjust context] Signed-off-by: Ben Hutchings --- --- a/fs/btrfs/sysfs.c +++ b/fs/btrfs/sysfs.c @@ -406,7 +406,7 @@ static ssize_t btrfs_nodesize_show(struc { struct btrfs_fs_info *fs_info = to_fs_info(kobj); - return snprintf(buf, PAGE_SIZE, "%u\n", fs_info->super_copy->nodesize); + return snprintf(buf, PAGE_SIZE, "%u\n", btrfs_super_nodesize(fs_info->super_copy)); } BTRFS_ATTR_RW(nodesize, 0444, btrfs_nodesize_show, btrfs_no_store); @@ -416,7 +416,7 @@ static ssize_t btrfs_sectorsize_show(str { struct btrfs_fs_info *fs_info = to_fs_info(kobj); - return snprintf(buf, PAGE_SIZE, "%u\n", fs_info->super_copy->sectorsize); + return snprintf(buf, PAGE_SIZE, "%u\n", btrfs_super_sectorsize(fs_info->super_copy)); } BTRFS_ATTR_RW(sectorsize, 0444, btrfs_sectorsize_show, btrfs_no_store); @@ -426,7 +426,7 @@ static ssize_t btrfs_clone_alignment_sho { struct btrfs_fs_info *fs_info = to_fs_info(kobj); - return snprintf(buf, PAGE_SIZE, "%u\n", fs_info->super_copy->sectorsize); + return snprintf(buf, PAGE_SIZE, "%u\n", btrfs_super_sectorsize(fs_info->super_copy)); } BTRFS_ATTR_RW(clone_alignment, 0444, btrfs_clone_alignment_show, btrfs_no_store); --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1428,19 +1428,23 @@ static void update_super_roots(struct bt super = root->fs_info->super_copy; + /* update latest btrfs_super_block::chunk_root refs */ root_item = >fs_info->chunk_root->root_item; - super->chunk_root = root_item->bytenr; - super->chunk_root_generation = root_item->generation; - super->chunk_root_level = root_item->level; + btrfs_set_super_chunk_root(super, root_item->bytenr); + btrfs_set_super_chunk_root_generation(super, root_item->generation); + btrfs_set_super_chunk_root_level(super, root_item->level); + /* update latest btrfs_super_block::root refs */ root_item = >fs_info->tree_root->root_item; - super->root = root_item->bytenr; - super->generation = root_item->generation; - super->root_level = root_item->level; + btrfs_set_super_root(super, root_item->bytenr); + btrfs_set_super_generation(super, root_item->generation); + btrfs_set_super_root_level(super, root_item->level); + if (btrfs_test_opt(root, SPACE_CACHE)) - super->cache_generation = root_item->generation; + btrfs_set_super_cache_generation(super, root_item->generation); if (root->fs_info->update_uuid_tree_gen) - super->uuid_tree_generation = root_item->generation; + btrfs_set_super_uuid_tree_generation(super, +root_item->generation); } int btrfs_transaction_in_commit(struct btrfs_fs_info *info)
Re: [btrfs_put_block_group] WARNING: CPU: 1 PID: 14674 at fs/btrfs/disk-io.c:3675 free_fs_root+0xc2/0xd0 [btrfs]
On 04/19/2018 03:25 PM, Nikolay Borisov wrote: On 19.04.2018 08:32, Fengguang Wu wrote: Hello, FYI this happens in mainline kernel and at least dates back to v4.16 . It's rather rare error and happens when running xfstests. Yeah, so this is something which only recently was characterised as leaking delalloc inodes. I can easily reproduce this when running generic/019 test. A fix is in the works. Hi Nikolay Is the fix available? generic/475 also notice the same thing. -Anand [ 438.327552] BTRFS: error (device dm-0) in __btrfs_free_extent:6962: errno=-5 IO failure [ 438.336415] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:3070: errno=-5 IO failure [ 438.345590] BTRFS error (device dm-0): pending csums is 1028096 [ 438.369254] BTRFS error (device dm-0): cleaner transaction attach returned -30 [ 438.377674] BTRFS info (device dm-0): at unmount delalloc count 98304 [ 438.385166] WARNING: CPU: 1 PID: 14674 at fs/btrfs/disk-io.c:3675 free_fs_root+0xc2/0xd0 [btrfs] [ 438.396562] Modules linked in: dm_snapshot dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_flakey dm_mod netconsole btrfs xor zstd_decompress zstd_compress xxhash raid6_pq sd_mod sg snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ata_generic pata_acpi intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_intel kvm_intel snd_hda_codec kvm irqbypass crct10dif_pclmul eeepc_wmi crc32_pclmul crc32c_intel ghash_clmulni_intel pata_via asus_wmi sparse_keymap snd_hda_core ata_piix snd_hwdep ppdev rfkill wmi_bmof i915 pcbc snd_pcm snd_timer drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops snd aesni_intel parport_pc crypto_simd pcspkr libata soundcore cryptd glue_helper drm wmi parport video shpchp ip_tables [ 438.467607] CPU: 1 PID: 14674 Comm: umount Not tainted 4.17.0-rc1 #1 [ 438.474798] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 1002 04/01/2011 [ 438.484804] RIP: 0010:free_fs_root+0xc2/0xd0 [btrfs] [ 438.490590] RSP: 0018:c90008b0fda8 EFLAGS: 00010282 [ 438.496641] RAX: 88017c5954b0 RBX: 880137f6d800 RCX: 00018013 [ 438.504652] RDX: 0001 RSI: ea0006e93600 RDI: [ 438.512679] RBP: 88017b36 R08: 8801ba4dd000 R09: 00018013 [ 438.520644] R10: c90008b0fc70 R11: R12: c90008b0fdd0 [ 438.528657] R13: 88017b360080 R14: c90008b0fdc8 R15: [ 438.536662] FS: 7f06c1a80fc0() GS:8801bfa8() knlGS: [ 438.545582] CS: 0010 DS: ES: CR0: 80050033 [ 438.552157] CR2: 7f06c12b0260 CR3: 00017d3de004 CR4: 000606e0 [ 438.642653] RAX: RBX: 0234a2d0 RCX: 7f06c1359cf7 [ 438.793559] CPU: 1 PID: 14674 Comm: umount Tainted: GW 4.17.0-rc1 #1 [ 438.802152] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 1002 04/01/2011 [ 438.812108] RIP: 0010:btrfs_put_block_group+0x41/0x60 [btrfs] [ 438.819364] RSP: 0018:c90008b0fde0 EFLAGS: 00010206 [ 438.825378] RAX: RBX: 8801abf63000 RCX: e38e38e38e38e38f [ 438.833307] RDX: 0001 RSI: 09f6 RDI: 8801abf63000 [ 438.841230] RBP: 88017b36 R08: 88017d3b7750 R09: 000180380010 [ 438.849133] R10: c90008b0fca0 R11: R12: 8801abf63000 [ 438.857047] R13: 88017b3600a0 R14: 8801abf630e0 R15: dead0100 [ 438.864943] FS: 7f06c1a80fc0() GS:8801bfa8() knlGS: [ 438.873793] CS: 0010 DS: ES: CR0: 80050033 [ 438.880320] CR2: 7f06c12b0260 CR3: 00017d3de004 CR4: 000606e0 [ 438.888226] Call Trace: [ 438.891454] btrfs_free_block_groups+0x138/0x3d0 [btrfs] [ 438.897569] close_ctree+0x13b/0x2f0 [btrfs] [ 438.902618] generic_shutdown_super+0x6c/0x120: __read_once_size at include/linux/compiler.h:188 (inlined by) list_empty at include/linux/list.h:203 (inlined by) generic_shutdown_super at fs/super.c:442 [ 438.907801] kill_anon_super+0xe/0x20: kill_anon_super at fs/super.c:1038 [ 438.912223] btrfs_kill_super+0x13/0x100 [btrfs] [ 438.917598] deactivate_locked_super+0x3f/0x70: deactivate_locked_super at fs/super.c:320 [ 438.922757] cleanup_mnt+0x3b/0x70: cleanup_mnt at fs/namespace.c:1174 [ 438.926879] task_work_run+0xa3/0xe0: task_work_run at kernel/task_work.c:115 (discriminator 1) [ 438.931205] exit_to_usermode_loop+0x9e/0xa0: tracehook_notify_resume at include/linux/tracehook.h:191
Re: [btrfs_put_block_group] WARNING: CPU: 1 PID: 14674 at fs/btrfs/disk-io.c:3675 free_fs_root+0xc2/0xd0 [btrfs]
On 04/19/2018 03:25 PM, Nikolay Borisov wrote: On 19.04.2018 08:32, Fengguang Wu wrote: Hello, FYI this happens in mainline kernel and at least dates back to v4.16 . It's rather rare error and happens when running xfstests. Yeah, so this is something which only recently was characterised as leaking delalloc inodes. I can easily reproduce this when running generic/019 test. A fix is in the works. Hi Nikolay Is the fix available? generic/475 also notice the same thing. -Anand [ 438.327552] BTRFS: error (device dm-0) in __btrfs_free_extent:6962: errno=-5 IO failure [ 438.336415] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:3070: errno=-5 IO failure [ 438.345590] BTRFS error (device dm-0): pending csums is 1028096 [ 438.369254] BTRFS error (device dm-0): cleaner transaction attach returned -30 [ 438.377674] BTRFS info (device dm-0): at unmount delalloc count 98304 [ 438.385166] WARNING: CPU: 1 PID: 14674 at fs/btrfs/disk-io.c:3675 free_fs_root+0xc2/0xd0 [btrfs] [ 438.396562] Modules linked in: dm_snapshot dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_flakey dm_mod netconsole btrfs xor zstd_decompress zstd_compress xxhash raid6_pq sd_mod sg snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ata_generic pata_acpi intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_intel kvm_intel snd_hda_codec kvm irqbypass crct10dif_pclmul eeepc_wmi crc32_pclmul crc32c_intel ghash_clmulni_intel pata_via asus_wmi sparse_keymap snd_hda_core ata_piix snd_hwdep ppdev rfkill wmi_bmof i915 pcbc snd_pcm snd_timer drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops snd aesni_intel parport_pc crypto_simd pcspkr libata soundcore cryptd glue_helper drm wmi parport video shpchp ip_tables [ 438.467607] CPU: 1 PID: 14674 Comm: umount Not tainted 4.17.0-rc1 #1 [ 438.474798] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 1002 04/01/2011 [ 438.484804] RIP: 0010:free_fs_root+0xc2/0xd0 [btrfs] [ 438.490590] RSP: 0018:c90008b0fda8 EFLAGS: 00010282 [ 438.496641] RAX: 88017c5954b0 RBX: 880137f6d800 RCX: 00018013 [ 438.504652] RDX: 0001 RSI: ea0006e93600 RDI: [ 438.512679] RBP: 88017b36 R08: 8801ba4dd000 R09: 00018013 [ 438.520644] R10: c90008b0fc70 R11: R12: c90008b0fdd0 [ 438.528657] R13: 88017b360080 R14: c90008b0fdc8 R15: [ 438.536662] FS: 7f06c1a80fc0() GS:8801bfa8() knlGS: [ 438.545582] CS: 0010 DS: ES: CR0: 80050033 [ 438.552157] CR2: 7f06c12b0260 CR3: 00017d3de004 CR4: 000606e0 [ 438.642653] RAX: RBX: 0234a2d0 RCX: 7f06c1359cf7 [ 438.793559] CPU: 1 PID: 14674 Comm: umount Tainted: GW 4.17.0-rc1 #1 [ 438.802152] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 1002 04/01/2011 [ 438.812108] RIP: 0010:btrfs_put_block_group+0x41/0x60 [btrfs] [ 438.819364] RSP: 0018:c90008b0fde0 EFLAGS: 00010206 [ 438.825378] RAX: RBX: 8801abf63000 RCX: e38e38e38e38e38f [ 438.833307] RDX: 0001 RSI: 09f6 RDI: 8801abf63000 [ 438.841230] RBP: 88017b36 R08: 88017d3b7750 R09: 000180380010 [ 438.849133] R10: c90008b0fca0 R11: R12: 8801abf63000 [ 438.857047] R13: 88017b3600a0 R14: 8801abf630e0 R15: dead0100 [ 438.864943] FS: 7f06c1a80fc0() GS:8801bfa8() knlGS: [ 438.873793] CS: 0010 DS: ES: CR0: 80050033 [ 438.880320] CR2: 7f06c12b0260 CR3: 00017d3de004 CR4: 000606e0 [ 438.888226] Call Trace: [ 438.891454] btrfs_free_block_groups+0x138/0x3d0 [btrfs] [ 438.897569] close_ctree+0x13b/0x2f0 [btrfs] [ 438.902618] generic_shutdown_super+0x6c/0x120: __read_once_size at include/linux/compiler.h:188 (inlined by) list_empty at include/linux/list.h:203 (inlined by) generic_shutdown_super at fs/super.c:442 [ 438.907801] kill_anon_super+0xe/0x20: kill_anon_super at fs/super.c:1038 [ 438.912223] btrfs_kill_super+0x13/0x100 [btrfs] [ 438.917598] deactivate_locked_super+0x3f/0x70: deactivate_locked_super at fs/super.c:320 [ 438.922757] cleanup_mnt+0x3b/0x70: cleanup_mnt at fs/namespace.c:1174 [ 438.926879] task_work_run+0xa3/0xe0: task_work_run at kernel/task_work.c:115 (discriminator 1) [ 438.931205] exit_to_usermode_loop+0x9e/0xa0: tracehook_notify_resume at include/linux/tracehook.h:191
Re: [PATCH 4.14 024/110] btrfs: use proper endianness accessors for super_copy
On 03/20/2018 03:32 AM, David Sterba wrote: On Sat, Mar 17, 2018 at 06:27:15PM +0100, Christoph Biedl wrote: Greg Kroah-Hartman wrote... On Thu, Mar 15, 2018 at 07:55:42PM +0100, Christoph Biedl wrote: commit 3c181c12c431fe33b669410d663beb9cceefcd1b upstream. On big-endian systems, this change intruduces severe corruption, resulting in complete loss of the data on the used block device. That sucks. Can you test Linus's tree to verify the problem is there? I'll gladly revert this if Linus's tree also gets the revert, I don't want you to hit this when you upgrade to a newer kernel. Confirmed: The problem is, err ... was in Linus' tree as well. The rather recent commit 8f5fd927c3a7 reverted the change, after that everything is as expected again. Thanks for checking. Looking at the original commit, I don't have a clue why things go wrong so horribly It's a half endianness conversion. The plain in-memory structures are in LE and has to be accessed via the helpers, super block copy and the root item. The commit adds only one half of the conversion, that naturally does not exhibit on LE, because the macros are no-op. Originally, the items were stored from the on-disk type to on-disk type, regardless of the CPU: super->chunk_root = root_item->bytenr; The patch should have added conversion of both values, like btrfs_set_super_chunk_root(super, btrfs_root_bytenr(root_item)); and not btrfs_set_super_chunk_root(super, root_item->bytenr); It's not very obvious from the context though, typically there's one structure that needs the accessors and is set from a value that's in CPU byteorder. I think that the exception to this pattern confused all involved developers. The root_item members are annotated as __le64 that should be caught by sparse/smatch checker in the buggy case, but we don't run the checkers every the time. Ah! the RC is at the other side of the equation. Makes sense to me. Thanks. - otherwise don't be afraid of my data. I took this as a chance to verify my data recovery procedure, with success. Should that not be the case, the damaged items in superblock can be byteswapped back. That's 6 x u64 at most, I have a tool for that now. Thanks again for the report and sorry for the trouble. It's entirely my mistake. My apologies for the inconvenience. Thanks, Anand -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4.14 024/110] btrfs: use proper endianness accessors for super_copy
On 03/20/2018 03:32 AM, David Sterba wrote: On Sat, Mar 17, 2018 at 06:27:15PM +0100, Christoph Biedl wrote: Greg Kroah-Hartman wrote... On Thu, Mar 15, 2018 at 07:55:42PM +0100, Christoph Biedl wrote: commit 3c181c12c431fe33b669410d663beb9cceefcd1b upstream. On big-endian systems, this change intruduces severe corruption, resulting in complete loss of the data on the used block device. That sucks. Can you test Linus's tree to verify the problem is there? I'll gladly revert this if Linus's tree also gets the revert, I don't want you to hit this when you upgrade to a newer kernel. Confirmed: The problem is, err ... was in Linus' tree as well. The rather recent commit 8f5fd927c3a7 reverted the change, after that everything is as expected again. Thanks for checking. Looking at the original commit, I don't have a clue why things go wrong so horribly It's a half endianness conversion. The plain in-memory structures are in LE and has to be accessed via the helpers, super block copy and the root item. The commit adds only one half of the conversion, that naturally does not exhibit on LE, because the macros are no-op. Originally, the items were stored from the on-disk type to on-disk type, regardless of the CPU: super->chunk_root = root_item->bytenr; The patch should have added conversion of both values, like btrfs_set_super_chunk_root(super, btrfs_root_bytenr(root_item)); and not btrfs_set_super_chunk_root(super, root_item->bytenr); It's not very obvious from the context though, typically there's one structure that needs the accessors and is set from a value that's in CPU byteorder. I think that the exception to this pattern confused all involved developers. The root_item members are annotated as __le64 that should be caught by sparse/smatch checker in the buggy case, but we don't run the checkers every the time. Ah! the RC is at the other side of the equation. Makes sense to me. Thanks. - otherwise don't be afraid of my data. I took this as a chance to verify my data recovery procedure, with success. Should that not be the case, the damaged items in superblock can be byteswapped back. That's 6 x u64 at most, I have a tool for that now. Thanks again for the report and sorry for the trouble. It's entirely my mistake. My apologies for the inconvenience. Thanks, Anand -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4.14 024/110] btrfs: use proper endianness accessors for super_copy
On 03/16/2018 10:02 PM, Greg Kroah-Hartman wrote: On Fri, Mar 16, 2018 at 02:22:02PM +0100, David Sterba wrote: On Fri, Mar 16, 2018 at 01:30:49PM +0100, Greg Kroah-Hartman wrote: On Thu, Mar 15, 2018 at 07:55:42PM +0100, Christoph Biedl wrote: Greg Kroah-Hartman wrote... 4.14-stable review patch. If anyone has any objections, please let me know. commit 3c181c12c431fe33b669410d663beb9cceefcd1b upstream. (...) If the filesystem is always used on a same endian host, this will not be a problem. >From my observations I cannot quite subscribe to that. On big-endian systems, this change intruduces severe corruption, resulting in complete loss of the data on the used block device. Steps to reproduce (tested on ppc/powerpc and parisc/hppa): # mkfs.btrfs $DEV # mount $DEV /mnt/tmp/ # umount /mnt/tmp/ This simple umount corrupts the file system: # mount $DEV /mnt/tmp/ mount: /mnt/tmp: wrong fs type, bad option, bad superblock on $DEV, missing codepage or helper program, or other error. # dmesg: BTRFS critical (device ): unable to find logical 4294967296 length 4096 BTRFS critical (device ): unable to find logical 4294967296 length 4096 BTRFS critical (device ): unable to find logical 18102363734671360 length 16384 BTRFS error (device ): failed to read chunk root BTRFS error (device ): open_ctree failed Also fsck is of no help: # btrfsck $DEV Couldn't map the block 18102363734671360 No mapping for 18102363734671360-18102363734687744 Couldn't map the block 18102363734671360 bytenr mismatch, want=18102363734671360, have=0 ERROR: cannot read chunk root ERROR: cannot open file system Trying mount or fsck on a little-endian system does not help either. So I consider the data on that device lost - luckily I use btrfs only for files where a backup exists all the time. Reverting that change restored the previous error-free behaviour. I didn't check HEAD, i.e. v4.16-rc5, since the upstream commt was the last that affected these files. Still I could give this a try if anybody wishes so. That sucks. Can you test Linus's tree to verify the problem is there? I'll gladly revert this if Linus's tree also gets the revert, I don't want you to hit this when you upgrade to a newer kernel. I'll push a fix for the upcoming rc but I think it would be better to remove the broken patch from stable kernels ASAP, so I'd recommend to revert it now. Now reverted, thanks. Thanks ! Sorry for the mess. -Anand greg k-h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4.14 024/110] btrfs: use proper endianness accessors for super_copy
On 03/16/2018 10:02 PM, Greg Kroah-Hartman wrote: On Fri, Mar 16, 2018 at 02:22:02PM +0100, David Sterba wrote: On Fri, Mar 16, 2018 at 01:30:49PM +0100, Greg Kroah-Hartman wrote: On Thu, Mar 15, 2018 at 07:55:42PM +0100, Christoph Biedl wrote: Greg Kroah-Hartman wrote... 4.14-stable review patch. If anyone has any objections, please let me know. commit 3c181c12c431fe33b669410d663beb9cceefcd1b upstream. (...) If the filesystem is always used on a same endian host, this will not be a problem. >From my observations I cannot quite subscribe to that. On big-endian systems, this change intruduces severe corruption, resulting in complete loss of the data on the used block device. Steps to reproduce (tested on ppc/powerpc and parisc/hppa): # mkfs.btrfs $DEV # mount $DEV /mnt/tmp/ # umount /mnt/tmp/ This simple umount corrupts the file system: # mount $DEV /mnt/tmp/ mount: /mnt/tmp: wrong fs type, bad option, bad superblock on $DEV, missing codepage or helper program, or other error. # dmesg: BTRFS critical (device ): unable to find logical 4294967296 length 4096 BTRFS critical (device ): unable to find logical 4294967296 length 4096 BTRFS critical (device ): unable to find logical 18102363734671360 length 16384 BTRFS error (device ): failed to read chunk root BTRFS error (device ): open_ctree failed Also fsck is of no help: # btrfsck $DEV Couldn't map the block 18102363734671360 No mapping for 18102363734671360-18102363734687744 Couldn't map the block 18102363734671360 bytenr mismatch, want=18102363734671360, have=0 ERROR: cannot read chunk root ERROR: cannot open file system Trying mount or fsck on a little-endian system does not help either. So I consider the data on that device lost - luckily I use btrfs only for files where a backup exists all the time. Reverting that change restored the previous error-free behaviour. I didn't check HEAD, i.e. v4.16-rc5, since the upstream commt was the last that affected these files. Still I could give this a try if anybody wishes so. That sucks. Can you test Linus's tree to verify the problem is there? I'll gladly revert this if Linus's tree also gets the revert, I don't want you to hit this when you upgrade to a newer kernel. I'll push a fix for the upcoming rc but I think it would be better to remove the broken patch from stable kernels ASAP, so I'd recommend to revert it now. Now reverted, thanks. Thanks ! Sorry for the mess. -Anand greg k-h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4.14 024/110] btrfs: use proper endianness accessors for super_copy
On 03/16/2018 02:55 AM, Christoph Biedl wrote: Greg Kroah-Hartman wrote... 4.14-stable review patch. If anyone has any objections, please let me know. commit 3c181c12c431fe33b669410d663beb9cceefcd1b upstream. (...) If the filesystem is always used on a same endian host, this will not be a problem. From my observations I cannot quite subscribe to that. On big-endian systems, this change intruduces severe corruption, resulting in complete loss of the data on the used block device. Thanks for the report. That's really bad, my mistake. I am digging to know how it happened. Our on-disk root bytenr are little-endian compatible. So using the cpu_to_le for write on a big-endian arch is a correct thing to do. If it fails, certainly there is something which I have overlooked. I am digging to know. Thanks for the report again. Fsck won't be able to figure out the correct on-disk btyenr either. If there isn't any backup we could try to find out the correct pointers manually. However, restore from the backup approach is much better. -Anand Steps to reproduce (tested on ppc/powerpc and parisc/hppa): # mkfs.btrfs $DEV # mount $DEV /mnt/tmp/ # umount /mnt/tmp/ This simple umount corrupts the file system: # mount $DEV /mnt/tmp/ mount: /mnt/tmp: wrong fs type, bad option, bad superblock on $DEV, missing codepage or helper program, or other error. # dmesg: BTRFS critical (device ): unable to find logical 4294967296 length 4096 BTRFS critical (device ): unable to find logical 4294967296 length 4096 BTRFS critical (device ): unable to find logical 18102363734671360 length 16384 BTRFS error (device ): failed to read chunk root BTRFS error (device ): open_ctree failed Also fsck is of no help: # btrfsck $DEV Couldn't map the block 18102363734671360 No mapping for 18102363734671360-18102363734687744 Couldn't map the block 18102363734671360 bytenr mismatch, want=18102363734671360, have=0 ERROR: cannot read chunk root ERROR: cannot open file system Trying mount or fsck on a little-endian system does not help either. So I consider the data on that device lost - luckily I use btrfs only for files where a backup exists all the time. Reverting that change restored the previous error-free behaviour. I didn't check HEAD, i.e. v4.16-rc5, since the upstream commt was the last that affected these files. Still I could give this a try if anybody wishes so. Cheers, Christoph
Re: [PATCH 4.14 024/110] btrfs: use proper endianness accessors for super_copy
On 03/16/2018 02:55 AM, Christoph Biedl wrote: Greg Kroah-Hartman wrote... 4.14-stable review patch. If anyone has any objections, please let me know. commit 3c181c12c431fe33b669410d663beb9cceefcd1b upstream. (...) If the filesystem is always used on a same endian host, this will not be a problem. From my observations I cannot quite subscribe to that. On big-endian systems, this change intruduces severe corruption, resulting in complete loss of the data on the used block device. Thanks for the report. That's really bad, my mistake. I am digging to know how it happened. Our on-disk root bytenr are little-endian compatible. So using the cpu_to_le for write on a big-endian arch is a correct thing to do. If it fails, certainly there is something which I have overlooked. I am digging to know. Thanks for the report again. Fsck won't be able to figure out the correct on-disk btyenr either. If there isn't any backup we could try to find out the correct pointers manually. However, restore from the backup approach is much better. -Anand Steps to reproduce (tested on ppc/powerpc and parisc/hppa): # mkfs.btrfs $DEV # mount $DEV /mnt/tmp/ # umount /mnt/tmp/ This simple umount corrupts the file system: # mount $DEV /mnt/tmp/ mount: /mnt/tmp: wrong fs type, bad option, bad superblock on $DEV, missing codepage or helper program, or other error. # dmesg: BTRFS critical (device ): unable to find logical 4294967296 length 4096 BTRFS critical (device ): unable to find logical 4294967296 length 4096 BTRFS critical (device ): unable to find logical 18102363734671360 length 16384 BTRFS error (device ): failed to read chunk root BTRFS error (device ): open_ctree failed Also fsck is of no help: # btrfsck $DEV Couldn't map the block 18102363734671360 No mapping for 18102363734671360-18102363734687744 Couldn't map the block 18102363734671360 bytenr mismatch, want=18102363734671360, have=0 ERROR: cannot read chunk root ERROR: cannot open file system Trying mount or fsck on a little-endian system does not help either. So I consider the data on that device lost - luckily I use btrfs only for files where a backup exists all the time. Reverting that change restored the previous error-free behaviour. I didn't check HEAD, i.e. v4.16-rc5, since the upstream commt was the last that affected these files. Still I could give this a try if anybody wishes so. Cheers, Christoph
Re: [PATCH] f2fs: support multiple devices
(this is deviating from the subject, sorry about that) Pretty much, if you're using just raid1 mode, without compression, on reasonable storage devices, things are rock-solid relative to the rest of BTRFS. IMO, BTRFS volume manger feature is incomplete and there is RAID1 critical bug which affects availability, so its not suitable for enterprise solutions yet. However it should be fine in a setup where dedicated sysadmin and maintenance downtime is a choice.
Re: [PATCH] f2fs: support multiple devices
(this is deviating from the subject, sorry about that) Pretty much, if you're using just raid1 mode, without compression, on reasonable storage devices, things are rock-solid relative to the rest of BTRFS. IMO, BTRFS volume manger feature is incomplete and there is RAID1 critical bug which affects availability, so its not suitable for enterprise solutions yet. However it should be fine in a setup where dedicated sysadmin and maintenance downtime is a choice.
Re: [lkp] [btrfs] 164a91c9be: xfstests.generic.088.fail
Thanks for the report. > _check_btrfs_filesystem: filesystem on /dev/sda5 is inconsistent (see /lkp/benchmarks/xfstests/results//check.full) > generic/088 0s Ran tests again unfortunately I couldn't reproduce.. - generic/088 0s ... 0s generic/089 21s ... 39s generic/090 4s ... 4s generic/091 58s ... 58s generic/092 1s ... 0s generic/096 [not run] xfs_io fzero failed (old kernel/wrong fs?) generic/098 2s ... 2s generic/100 18s ... 18s generic/101 3s generic/102 8s generic/103 2s -- But a possible scenario .. in systems where rcu_callback itself could be deferred it needs rcu_barrier(). Fixed this in V2, by restoring rcu_barrier() and then grace period wait for devices to close. +mount: /dev/sda6 is already mounted or /fs/scratch busy :: +umount: /dev/sda6: not mounted This might be different issue, when this happens generally wipefs all scratch pool devices helps. Thanks, Anand
Re: [lkp] [btrfs] 164a91c9be: xfstests.generic.088.fail
Thanks for the report. > _check_btrfs_filesystem: filesystem on /dev/sda5 is inconsistent (see /lkp/benchmarks/xfstests/results//check.full) > generic/088 0s Ran tests again unfortunately I couldn't reproduce.. - generic/088 0s ... 0s generic/089 21s ... 39s generic/090 4s ... 4s generic/091 58s ... 58s generic/092 1s ... 0s generic/096 [not run] xfs_io fzero failed (old kernel/wrong fs?) generic/098 2s ... 2s generic/100 18s ... 18s generic/101 3s generic/102 8s generic/103 2s -- But a possible scenario .. in systems where rcu_callback itself could be deferred it needs rcu_barrier(). Fixed this in V2, by restoring rcu_barrier() and then grace period wait for devices to close. +mount: /dev/sda6 is already mounted or /fs/scratch busy :: +umount: /dev/sda6: not mounted This might be different issue, when this happens generally wipefs all scratch pool devices helps. Thanks, Anand
Re: [PATCH 1/1] Btrfs: Code Cleanup
Hi Flex, diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 366b335..5c16f04 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -2325,7 +2325,10 @@ int btrfs_init_new_device(struct btrfs_root *root, char *device_path) if (seeding_dev) { sb->s_flags &= ~MS_RDONLY; This is not undone in the failure code. Theoretically it should report error during unmount, did you notice ? (in general, $subject can be more specific). Thanks, Anand ret = btrfs_prepare_sprout(root); - BUG_ON(ret); /* -ENOMEM */ + if (ret) { + btrfs_abort_transaction(trans, root, ret); + goto error_trans; + } } device->fs_devices = root->fs_info->fs_devices;
Re: [PATCH 1/1] Btrfs: Code Cleanup
Hi Flex, diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 366b335..5c16f04 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -2325,7 +2325,10 @@ int btrfs_init_new_device(struct btrfs_root *root, char *device_path) if (seeding_dev) { sb->s_flags &= ~MS_RDONLY; This is not undone in the failure code. Theoretically it should report error during unmount, did you notice ? (in general, $subject can be more specific). Thanks, Anand ret = btrfs_prepare_sprout(root); - BUG_ON(ret); /* -ENOMEM */ + if (ret) { + btrfs_abort_transaction(trans, root, ret); + goto error_trans; + } } device->fs_devices = root->fs_info->fs_devices;