[PATCH v2 1/2] btrfs-progs: fi-usage: Fix wrong RAID10 used and unallocated space
[BUG] For a very basic RAID10 with 4 disks, "fi usage" and "fi show" are outputting conflicting result: -- # btrfs fi show /mnt/btrfs/ Label: none uuid: 6d0229db-28d1-4696-ac20-e828cc45dc40 Total devices 4 FS bytes used 1.12MiB devid1 size 5.00GiB used 2.01GiB path /dev/mapper/data-disk1 devid2 size 5.00GiB used 2.01GiB path /dev/mapper/data-disk2 devid3 size 5.00GiB used 2.01GiB path /dev/mapper/data-disk3 devid4 size 5.00GiB used 2.01GiB path /dev/mapper/data-disk4 Here the unallocated space for disk4 should be a little less than 3G. (5.00 Gib - 2.01GiB) # btrfs fi usage /mnt/btrfs/ Overall: Device size: 20.00GiB Device allocated: 8.03GiB Device unallocated: 11.97GiB Device missing: 0.00B Used: 2.25MiB Free (estimated): 7.98GiB (min: 7.98GiB) Data ratio: 2.00 Metadata ratio: 2.00 Global reserve: 16.00MiB (used: 0.00B) Data,RAID10: Size:2.00GiB, Used:1.00MiB ... /dev/mapper/data-disk4512.00MiB Metadata,RAID10: Size:2.00GiB, Used:112.00KiB ... /dev/mapper/data-disk4512.00MiB System,RAID10: Size:16.00MiB, Used:16.00KiB ... /dev/mapper/data-disk4 4.00MiB Unallocated: ... /dev/mapper/data-disk4 4.00GiB While fi usage shows we still have 4.00GiB unallocated space. -- [CAUSE] calc_chunk_size() is used to convert chunk size to device extent size, which is used to get the per-device data/meta/sys used space. However, for RAID10 we just divide the chunk size with num_stripes, without taking sub stripes into account. Resulting data/meta/sys usage in RAID10 halved. [FIX] Take the missing sub stripes into consideration. Reported-by: Adam BaheSigned-off-by: Qu Wenruo --- v2: Nothing --- cmds-fi-usage.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/cmds-fi-usage.c b/cmds-fi-usage.c index 0b0e47fee194..0af4bdbc9370 100644 --- a/cmds-fi-usage.c +++ b/cmds-fi-usage.c @@ -664,7 +664,7 @@ static u64 calc_chunk_size(struct chunk_info *ci) else if (ci->type & BTRFS_BLOCK_GROUP_RAID6) return ci->size / (ci->num_stripes -2); else if (ci->type & BTRFS_BLOCK_GROUP_RAID10) - return ci->size / ci->num_stripes; + return ci->size / (ci->num_stripes / 2); return ci->size; } -- 2.15.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/2] btrfs-progs: misc-test: Add test case to check if fi usage and show report consistent unallocated space
Signed-off-by: Qu Wenruo--- v2: Don't create unnecessary temporary dir. Use variable directly to save output Use run_check_mount_test_dev() macro with TEST_DEV set manually. --- tests/misc-tests/028-fi-usage-cross-check/test.sh | 44 +++ 1 file changed, 44 insertions(+) create mode 100755 tests/misc-tests/028-fi-usage-cross-check/test.sh diff --git a/tests/misc-tests/028-fi-usage-cross-check/test.sh b/tests/misc-tests/028-fi-usage-cross-check/test.sh new file mode 100755 index ..0343888b3413 --- /dev/null +++ b/tests/misc-tests/028-fi-usage-cross-check/test.sh @@ -0,0 +1,44 @@ +#!/bin/bash +# check fi-usage is output correct unallocated space for RAID10 + +source "$TOP/tests/common" + +check_prereq mkfs.btrfs +check_prereq btrfs + +setup_root_helper + +setup_loopdevs 4 +prepare_loopdevs +dev1=${loopdevs[1]} + +run_check $SUDO_HELPER "$TOP/mkfs.btrfs" -d raid10 -m raid10 -f ${loopdevs[@]} +TEST_DEV=$dev1 +run_check_mount_test_dev + +output=$(run_check_stdout $SUDO_HELPER "$TOP/btrfs" filesystem show --raw "$TEST_MNT" |\ +grep "$dev1") +if [ -z "$output" ]; then + _fail "failed to get correct fi show output" +fi + +dev1_total=$(echo $output | awk '{print $4}') +dev1_used=$(echo $output | awk '{print $6}') +dev1_fi_show_unallocated=$(( $dev1_total - $dev1_used)) + +output=$(run_check_stdout $SUDO_HELPER "$TOP/btrfs" filesystem usage --raw "$TEST_MNT" |\ +grep "$dev1" | tail -n1) +if [ -z "$output" ]; then + _fail "failed to get correct fi usage output" +fi + +dev1_fi_usage_unallocated=$(echo $output | awk '{print $2}') +echo "fi usage unallocated for devid1 is $dev1_fi_usage_unallocated" >> "$RESULTS" +echo "fi show unallocated for devid1 is $dev1_fi_show_unallocated" >> "$RESULTS" + +if [ $dev1_fi_show_unallocated -ne $dev1_fi_usage_unallocated ]; then + _fail "fi usage unallocated mismatch with fi show" +fi + +run_check $SUDO_HELPER umount "$TEST_MNT" +cleanup_loopdevs -- 2.15.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs-progs: Fix wrong chunk allocation type in btrfs_reserve_extent
btrfs_reserve_extent() will try to allocate new chunk if there is not enough space (mostly meta space). However when it tries to allocate new meta chunk, it will always try to allocate SINGLE meta chunk, and if the fs is using other profile, it will cause dead allocation which can't be really used. And when trying to allocate new meta chunk for fs trees, it uses the @num_bytes which is normally times larger than metadata space. Fix it by using proper meta profile and calculate based on file extent other than @num_bytes if we are allocating meta space for file extents. And all such chunk allocation condition is not explained at all. Add needed (although maybe a little overkilled) comment for it. Signed-off-by: Qu Wenruo--- Needed prerequisite is pushed to github: https://github.com/adam900710/btrfs-progs/tree/chunk_alloc_fixes And unfortunately, the condition to trigger chunk allocation is pretty hard to trigger. For normal mkfs case, we will always have enough space, while for normal usage, kernel will ensure there is enough space (global reservation) for extent tree update, which is normally larger than the 2M headroom used in btrfs-progs. So no easy test case here. --- extent-tree.c | 55 +++ 1 file changed, 47 insertions(+), 8 deletions(-) diff --git a/extent-tree.c b/extent-tree.c index 90e792a3fe62..aab09d926043 100644 --- a/extent-tree.c +++ b/extent-tree.c @@ -1907,7 +1907,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, return 0; ret = btrfs_alloc_chunk(trans, fs_info, , _bytes, - space_info->flags, false); + flags, false); if (ret == -ENOSPC) { space_info->full = 1; return ret; @@ -2652,8 +2652,11 @@ int btrfs_reserve_extent(struct btrfs_trans_handle *trans, u64 search_start = 0; u64 alloc_profile; u64 profile; + u64 meta_profile; struct btrfs_fs_info *info = root->fs_info; + meta_profile = info->avail_metadata_alloc_bits & + info->metadata_alloc_profile; if (is_data) { alloc_profile = info->avail_data_alloc_bits & info->data_alloc_profile; @@ -2663,21 +2666,57 @@ int btrfs_reserve_extent(struct btrfs_trans_handle *trans, info->system_alloc_profile; profile = BTRFS_BLOCK_GROUP_SYSTEM | alloc_profile; } else { - alloc_profile = info->avail_metadata_alloc_bits & - info->metadata_alloc_profile; + alloc_profile = meta_profile; profile = BTRFS_BLOCK_GROUP_METADATA | alloc_profile; } + /* +* If we're modifying global trees likes extent/root/device tree +* (whose ref_cows is 0), we shouldn't trigger chunk allocation AT ALL! +* +* The problem is, if root is extent tree, we trigger a chunk +* allocation, then it will modify extent tree, causing a path deadlock. +* (Although not implemented in btrfs-progs, it will still cause tree +* modification screwed up) +*/ if (root->ref_cows) { + u64 meta_reserve; + if (!(profile & BTRFS_BLOCK_GROUP_METADATA)) { - ret = do_chunk_alloc(trans, info, -num_bytes, -BTRFS_BLOCK_GROUP_METADATA); - BUG_ON(ret); + /* +* If we're modifying fs trees and allocating data +* extent, we need to ensure we have enough metadata +* space to insert new file extent item. +* +* Here we need to ensure we have enough space for the +* following operations: +* 1) Insert one file extent +*A full tree CoW + Leaf split should be enough. +* +* 2) Possible new chunk allocation +*4 trees are involved, chunk, device, extent and +*root tree. +*However chunk tree is counted as system chunk, and +*never +*really needs a new system chunk even for super +*large fs. +*Only 3 trees are need. +* +* So it's total 4 trees, subvolume, extent, device and +* root tree. +* Each tree needs a full tree CoW + Leaf split. +*/ + meta_reserve = 4 * ( BTRFS_MAX_LEVEL + 1) * +
[PATCH 1/4] btrfs: move pr_info into device_list_add
Commit 60999ca4b403 ("btrfs: make device scan less noisy") adds return value 1 to device_list_add(), so that parent function can call pr_info only when new device is added. Move the pr_info() part into device_list_add() so that this function can be kept simple. Signed-off-by: Anand JainReviewed-by: Josef Bacik --- fs/btrfs/volumes.c | 29 +++-- 1 file changed, 11 insertions(+), 18 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 5adf2d3f949a..62141284b577 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -726,8 +726,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices, * Add new device to list of registered devices * * Returns: - * 1 - first time device is seen - * 0 - device already known + * 0 - device already known or newly added * < 0 - error */ static noinline int device_list_add(const char *path, @@ -737,7 +736,6 @@ static noinline int device_list_add(const char *path, struct btrfs_device *device; struct btrfs_fs_devices *fs_devices; struct rcu_string *name; - int ret = 0; u64 found_transid = btrfs_super_generation(disk_super); fs_devices = find_fsid(disk_super->fsid); @@ -777,9 +775,16 @@ static noinline int device_list_add(const char *path, fs_devices->num_devices++; mutex_unlock(_devices->device_list_mutex); - ret = 1; device->fs_devices = fs_devices; btrfs_free_stale_devices(path, device); + + if (disk_super->label[0]) + pr_info("BTRFS: device label %s devid %llu transid %llu %s\n", + disk_super->label, devid, found_transid, path); + else + pr_info("BTRFS: device fsid %pU devid %llu transid %llu %s\n", + disk_super->fsid, devid, found_transid, path); + } else if (!device->name || strcmp(device->name->str, path)) { /* * When FS is already mounted. @@ -840,7 +845,7 @@ static noinline int device_list_add(const char *path, *fs_devices_ret = fs_devices; - return ret; + return 0; } static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig) @@ -1179,7 +1184,6 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, struct page *page; int ret; u64 devid; - u64 transid; u64 total_devices; u64 bytenr; @@ -1202,25 +1206,14 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, } devid = btrfs_stack_device_id(_super->dev_item); - transid = btrfs_super_generation(disk_super); total_devices = btrfs_super_num_devices(disk_super); mutex_lock(_mutex); ret = device_list_add(path, disk_super, devid, fs_devices_ret); - if (ret >= 0 && fs_devices_ret) + if (!ret && fs_devices_ret) (*fs_devices_ret)->total_devices = total_devices; mutex_unlock(_mutex); - if (ret > 0) { - if (disk_super->label[0]) - pr_info("BTRFS: device label %s ", disk_super->label); - else - pr_info("BTRFS: device fsid %pU ", disk_super->fsid); - - pr_cont("devid %llu transid %llu %s\n", devid, transid, path); - ret = 0; - } - btrfs_release_disk_super(page); error_bdev_put: -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/4] btrfs: drop devid as device_list_add() arg
As struct btrfs_disk_super is being passed, so it can get devid the same way its parent does. Signed-off-by: Anand JainReviewed-by: Josef Bacik --- fs/btrfs/volumes.c | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index b5305bd48338..ffeb02cbcf3d 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -730,12 +730,13 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices, * error pointer when failed */ static noinline struct btrfs_device *device_list_add(const char *path, - struct btrfs_super_block *disk_super, u64 devid) + struct btrfs_super_block *disk_super) { struct btrfs_device *device; struct btrfs_fs_devices *fs_devices; struct rcu_string *name; u64 found_transid = btrfs_super_generation(disk_super); + u64 devid = btrfs_stack_device_id(_super->dev_item); fs_devices = find_fsid(disk_super->fsid); if (!fs_devices) { @@ -1183,7 +1184,6 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, struct block_device *bdev; struct page *page; int ret = 0; - u64 devid; u64 bytenr; /* @@ -1204,10 +1204,8 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, goto error_bdev_put; } - devid = btrfs_stack_device_id(_super->dev_item); - mutex_lock(_mutex); - device = device_list_add(path, disk_super, devid); + device = device_list_add(path, disk_super); mutex_unlock(_mutex); if (IS_ERR(device)) ret = PTR_ERR(device); -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/4] btrfs: get device pointer from device_list_add()
Instead of pointer to btrfs_fs_devices as an arg in device_list_add() better to get pointer to btrfs_device as return value, then we have both, pointer to btrfs_device and btrfs_fs_devices. btrfs_device is needed to handle reappearing missing device. Signed-off-by: Anand Jain--- fs/btrfs/volumes.c | 34 ++ 1 file changed, 18 insertions(+), 16 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 4e5b82976455..b5305bd48338 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -726,12 +726,11 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices, * Add new device to list of registered devices * * Returns: - * 0 - device already known or newly added - * < 0 - error + * device pointer which was just added or updated when successful + * error pointer when failed */ -static noinline int device_list_add(const char *path, - struct btrfs_super_block *disk_super, - u64 devid, struct btrfs_fs_devices **fs_devices_ret) +static noinline struct btrfs_device *device_list_add(const char *path, + struct btrfs_super_block *disk_super, u64 devid) { struct btrfs_device *device; struct btrfs_fs_devices *fs_devices; @@ -742,7 +741,7 @@ static noinline int device_list_add(const char *path, if (!fs_devices) { fs_devices = alloc_fs_devices(disk_super->fsid); if (IS_ERR(fs_devices)) - return PTR_ERR(fs_devices); + return ERR_PTR(PTR_ERR(fs_devices)); list_add(_devices->list, _uuids); @@ -754,19 +753,19 @@ static noinline int device_list_add(const char *path, if (!device) { if (fs_devices->opened) - return -EBUSY; + return ERR_PTR(-EBUSY); device = btrfs_alloc_device(NULL, , disk_super->dev_item.uuid); if (IS_ERR(device)) { /* we can safely leave the fs_devices entry around */ - return PTR_ERR(device); + return ERR_PTR(PTR_ERR(device)); } name = rcu_string_strdup(path, GFP_NOFS); if (!name) { free_device(device); - return -ENOMEM; + return ERR_PTR(-ENOMEM); } rcu_assign_pointer(device->name, name); @@ -820,12 +819,12 @@ static noinline int device_list_add(const char *path, * with larger generation number or the last-in if * generation are equal. */ - return -EEXIST; + return ERR_PTR(-EEXIST); } name = rcu_string_strdup(path, GFP_NOFS); if (!name) - return -ENOMEM; + return ERR_PTR(-ENOMEM); rcu_string_free(device->name); rcu_assign_pointer(device->name, name); if (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state)) { @@ -845,9 +844,7 @@ static noinline int device_list_add(const char *path, fs_devices->total_devices = btrfs_super_num_devices(disk_super); - *fs_devices_ret = fs_devices; - - return 0; + return device; } static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig) @@ -1182,9 +1179,10 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, struct btrfs_fs_devices **fs_devices_ret) { struct btrfs_super_block *disk_super; + struct btrfs_device *device; struct block_device *bdev; struct page *page; - int ret; + int ret = 0; u64 devid; u64 bytenr; @@ -1209,8 +1207,12 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, devid = btrfs_stack_device_id(_super->dev_item); mutex_lock(_mutex); - ret = device_list_add(path, disk_super, devid, fs_devices_ret); + device = device_list_add(path, disk_super, devid); mutex_unlock(_mutex); + if (IS_ERR(device)) + ret = PTR_ERR(device); + + *fs_devices_ret = device->fs_devices; btrfs_release_disk_super(page); -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 0/6] preparatory work to add device forget
v3->v4: Mainly fix as per comments from Josef. @3/6: rename btrfs_free_stale_device() to btrfs_free_stale_devices() @4/6: reorg logic, init not_found = 0; drop else part @5/6: added new in v4. Renames arg cur_dev to skip_dev @6/6: v3:5/6 is merged to v4:6/6 checkpath error fixes. v2->v3: @ 6/6: add btrfs_free_stale_device() fn description, suggested by Nikolay Fix line with longer than 80 char v1->v2: @ 6/6: btrfs_device::name is null when we have missing device and unmounted. So we still need to check for dev->name. We can reuse the function btrfs_free_stale_device() to add feature to forget a scanned device or all stale devices. So this patch set proposes following changes to it. Anand Jain (6): btrfs: cleanup btrfs_free_stale_device() usage btrfs: no need to check for btrfs_fs_devices::seeding btrfs: make btrfs_free_stale_device() to iterate all stales btrfs: make btrfs_free_stale_device() argument optional btrfs: rename btrfs_free_stale_devices() arg to skip_dev btrfs: make btrfs_free_stale_device() to match the path fs/btrfs/volumes.c | 59 +++--- 1 file changed, 25 insertions(+), 34 deletions(-) -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/6] btrfs: make btrfs_free_stale_device() to match the path
From: Anand JainThe btrfs_free_stale_device() is updated to match for the given device path and delete it. (It searches for only unmounted list of devices.) Also drop the comment about different path being used for the same device, since now we will have cli to clean any device that's not a concern any more. Signed-off-by: Anand Jain --- fs/btrfs/volumes.c | 29 + 1 file changed, 17 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index a3edd4d92c57..5adf2d3f949a 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -605,8 +605,17 @@ static void pending_bios_fn(struct btrfs_work *work) run_scheduled_bios(device); } - -static void btrfs_free_stale_devices(struct btrfs_device *skip_dev) +/* + * btrfs_free_stale_device() + * Search and remove all stale (devices which are not mounted) devices. + * When both inputs are NULL, it will search and release all stale devices. + * path: Optional. When provided will it release all unmounted devices + * matching this path only. + * skip_dev: Optional. Will skip this device when searching for the stale + * devices. + */ +static void btrfs_free_stale_devices(const char *path, +struct btrfs_device *skip_dev) { struct btrfs_fs_devices *fs_devs, *tmp_fs_devs; struct btrfs_device *dev, *tmp_dev; @@ -620,19 +629,15 @@ static void btrfs_free_stale_devices(struct btrfs_device *skip_dev) _devs->devices, dev_list) { int not_found = 0; - if (skip_dev && (skip_dev == dev || !dev->name)) + if (skip_dev && skip_dev == dev) + continue; + if (path && !dev->name) continue; - /* -* Todo: This won't be enough. What if the same device -* comes back (with new uuid and) with its mapper path? -* But for now, this does help as mostly an admin will -* either use mapper or non mapper path throughout. -*/ rcu_read_lock(); - if (skip_dev) + if (path) not_found = strcmp(rcu_str_deref(dev->name), - rcu_str_deref(skip_dev->name)); + path); rcu_read_unlock(); if (not_found) continue; @@ -774,7 +779,7 @@ static noinline int device_list_add(const char *path, ret = 1; device->fs_devices = fs_devices; - btrfs_free_stale_devices(device); + btrfs_free_stale_devices(path, device); } else if (!device->name || strcmp(device->name->str, path)) { /* * When FS is already mounted. -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/4] btrfs: set the total_devices in device_list_add()
There is no other parent for device_list_add() except for btrfs_scan_one_device(), which would set btrfs_fs_devices::total_devices if device_list_add is successful and this can be done with in device_list_add() itself. Signed-off-by: Anand JainReviewed-by: Josef Bacik --- fs/btrfs/volumes.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 62141284b577..4e5b82976455 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -843,6 +843,8 @@ static noinline int device_list_add(const char *path, if (!fs_devices->opened) device->generation = found_transid; + fs_devices->total_devices = btrfs_super_num_devices(disk_super); + *fs_devices_ret = fs_devices; return 0; @@ -1184,7 +1186,6 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, struct page *page; int ret; u64 devid; - u64 total_devices; u64 bytenr; /* @@ -1206,12 +1207,9 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, } devid = btrfs_stack_device_id(_super->dev_item); - total_devices = btrfs_super_num_devices(disk_super); mutex_lock(_mutex); ret = device_list_add(path, disk_super, devid, fs_devices_ret); - if (!ret && fs_devices_ret) - (*fs_devices_ret)->total_devices = total_devices; mutex_unlock(_mutex); btrfs_release_disk_super(page); -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/6] btrfs: rename btrfs_free_stale_devices() arg to skip_dev
No functional changes. Rename btrfs_free_stale_devices() arg to skip_dev, so that it reflects what that arg for. Signed-off-by: Anand Jain--- fs/btrfs/volumes.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index bba98d043402..a3edd4d92c57 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -606,7 +606,7 @@ static void pending_bios_fn(struct btrfs_work *work) } -static void btrfs_free_stale_devices(struct btrfs_device *cur_dev) +static void btrfs_free_stale_devices(struct btrfs_device *skip_dev) { struct btrfs_fs_devices *fs_devs, *tmp_fs_devs; struct btrfs_device *dev, *tmp_dev; @@ -620,7 +620,7 @@ static void btrfs_free_stale_devices(struct btrfs_device *cur_dev) _devs->devices, dev_list) { int not_found = 0; - if (cur_dev && (cur_dev == dev || !dev->name)) + if (skip_dev && (skip_dev == dev || !dev->name)) continue; /* @@ -630,9 +630,9 @@ static void btrfs_free_stale_devices(struct btrfs_device *cur_dev) * either use mapper or non mapper path throughout. */ rcu_read_lock(); - if (cur_dev) + if (skip_dev) not_found = strcmp(rcu_str_deref(dev->name), - rcu_str_deref(cur_dev->name)); + rcu_str_deref(skip_dev->name)); rcu_read_unlock(); if (not_found) continue; -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 0/4] device_list_add() peparation to add reappearing missing device
(Apply on top of my patchset [PATCH v4 0/6] preparatory work to add device forget for conflict free apply. They don't actually depend on each other though). v2->v3: Fix device_list_add() fn description which was still referring to the previous return values. v1->v2: Drop patch 5/5 for uuid_mutex optimize. That was wrong. Thanks Josef. In patch 3/5 make btrfs_device * as return. Cleanup of device_list_add(), mainly in preparation to handle reappearing missing device which its next reroll will be sent separately. Anand Jain (4): btrfs: move pr_info into device_list_add btrfs: set the total_devices in device_list_add() btrfs: get device pointer from device_list_add() btrfs: drop devid as device_list_add() arg fs/btrfs/volumes.c | 63 +++--- 1 file changed, 27 insertions(+), 36 deletions(-) -- 2.7.0 >From 84319f498ffcdca9bbb63fec680872d15392bf98 Mon Sep 17 00:00:00 2001 From: Anand JainDate: Wed, 10 Jan 2018 09:40:38 +0800 Anand Jain (4): btrfs: move pr_info into device_list_add btrfs: set the total_devices in device_list_add() btrfs: get device pointer from device_list_add() btrfs: drop devid as device_list_add() arg fs/btrfs/volumes.c | 61 +++--- 1 file changed, 26 insertions(+), 35 deletions(-) -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/6] btrfs: make btrfs_free_stale_device() argument optional
From: Anand JainThis updates btrfs_free_stale_device() helper function to delete all unmouted devices, when arg is NULL. Signed-off-by: Anand Jain --- fs/btrfs/volumes.c | 14 +- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index cd234a5dc763..bba98d043402 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -611,9 +611,6 @@ static void btrfs_free_stale_devices(struct btrfs_device *cur_dev) struct btrfs_fs_devices *fs_devs, *tmp_fs_devs; struct btrfs_device *dev, *tmp_dev; - if (!cur_dev->name) - return; - list_for_each_entry_safe(fs_devs, tmp_fs_devs, _uuids, list) { if (fs_devs->opened) @@ -621,11 +618,9 @@ static void btrfs_free_stale_devices(struct btrfs_device *cur_dev) list_for_each_entry_safe(dev, tmp_dev, _devs->devices, dev_list) { - int not_found; + int not_found = 0; - if (dev == cur_dev) - continue; - if (!dev->name) + if (cur_dev && (cur_dev == dev || !dev->name)) continue; /* @@ -635,8 +630,9 @@ static void btrfs_free_stale_devices(struct btrfs_device *cur_dev) * either use mapper or non mapper path throughout. */ rcu_read_lock(); - not_found = strcmp(rcu_str_deref(dev->name), - rcu_str_deref(cur_dev->name)); + if (cur_dev) + not_found = strcmp(rcu_str_deref(dev->name), + rcu_str_deref(cur_dev->name)); rcu_read_unlock(); if (not_found) continue; -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/6] btrfs: no need to check for btrfs_fs_devices::seeding
There is no need to check for btrfs_fs_devices::seeding when we have checked for btrfs_fs_devices::opened, because we can't sprout without its seed FS being opened. Signed-off-by: Anand JainReviewed-by: Josef Bacik --- fs/btrfs/volumes.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 25b91776d036..3f481da9cae7 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -619,8 +619,6 @@ static void btrfs_free_stale_device(struct btrfs_device *cur_dev) if (fs_devs->opened) continue; - if (fs_devs->seeding) - continue; list_for_each_entry(dev, _devs->devices, dev_list) { -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/6] btrfs: make btrfs_free_stale_device() to iterate all stales
From: Anand JainLet the list iterator iterate further and find other stale devices and delete it. This is in preparation to add support for user land request-able stale devices cleanup. Also rename btrfs_free_stale_device() to btrfs_free_stale_devices(). Signed-off-by: Anand Jain --- fs/btrfs/volumes.c | 24 +++- 1 file changed, 11 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 3f481da9cae7..cd234a5dc763 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -606,21 +606,22 @@ static void pending_bios_fn(struct btrfs_work *work) } -static void btrfs_free_stale_device(struct btrfs_device *cur_dev) +static void btrfs_free_stale_devices(struct btrfs_device *cur_dev) { - struct btrfs_fs_devices *fs_devs; - struct btrfs_device *dev; + struct btrfs_fs_devices *fs_devs, *tmp_fs_devs; + struct btrfs_device *dev, *tmp_dev; if (!cur_dev->name) return; - list_for_each_entry(fs_devs, _uuids, list) { - int del = 1; + list_for_each_entry_safe(fs_devs, tmp_fs_devs, _uuids, list) { if (fs_devs->opened) continue; - list_for_each_entry(dev, _devs->devices, dev_list) { + list_for_each_entry_safe(dev, tmp_dev, +_devs->devices, dev_list) { + int not_found; if (dev == cur_dev) continue; @@ -634,14 +635,12 @@ static void btrfs_free_stale_device(struct btrfs_device *cur_dev) * either use mapper or non mapper path throughout. */ rcu_read_lock(); - del = strcmp(rcu_str_deref(dev->name), + not_found = strcmp(rcu_str_deref(dev->name), rcu_str_deref(cur_dev->name)); rcu_read_unlock(); - if (!del) - break; - } + if (not_found) + continue; - if (!del) { /* delete the stale device */ if (fs_devs->num_devices == 1) { btrfs_sysfs_remove_fsid(fs_devs); @@ -652,7 +651,6 @@ static void btrfs_free_stale_device(struct btrfs_device *cur_dev) list_del(>dev_list); free_device(dev); } - break; } } } @@ -780,7 +778,7 @@ static noinline int device_list_add(const char *path, ret = 1; device->fs_devices = fs_devices; - btrfs_free_stale_device(device); + btrfs_free_stale_devices(device); } else if (!device->name || strcmp(device->name->str, path)) { /* * When FS is already mounted. -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/6] btrfs: cleanup btrfs_free_stale_device() usage
We call btrfs_free_stale_device() only when we alloc a new struct btrfs_device (ret=1), so move it closer to where we alloc the new device. Also drop the comments. Signed-off-by: Anand JainReviewed-by: Josef Bacik --- fs/btrfs/volumes.c | 8 +--- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index d393808071d5..25b91776d036 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -782,6 +782,7 @@ static noinline int device_list_add(const char *path, ret = 1; device->fs_devices = fs_devices; + btrfs_free_stale_device(device); } else if (!device->name || strcmp(device->name->str, path)) { /* * When FS is already mounted. @@ -840,13 +841,6 @@ static noinline int device_list_add(const char *path, if (!fs_devices->opened) device->generation = found_transid; - /* -* if there is new btrfs on an already registered device, -* then remove the stale device entry. -*/ - if (ret > 0) - btrfs_free_stale_device(device); - *fs_devices_ret = fs_devices; return ret; -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] generic/015: Change the test filesystem size to 101mb
On Mon, Jan 08, 2018 at 10:43:30AM +0200, Nikolay Borisov wrote: > This test has been failing for btrfs for quite some time, > at least since 4.7. There are 2 implementation details of btrfs that > it exposes: > > 1. Currently btrfs filesystem under 100mb are created in Mixed block > group mode. Freespace accounting for it is not 100% accurate - I've mkfs.btrfs does this too? Because I noticed _scratch_mkfs_sized adds '--mixed' mkfs option explicitly. > observed around 100-200kb discrepancy between a newly created filesystem, > then writing a file and deleting it and checking the free space. This > falls within %3 and not %1 as hardcoded in the test. > > 2. BTRFS won't flush it's delayed allocation on file deletion if less > than 32mb are deleted. On such files we need to perform sync (missing > in the test) or wait until time elapses for transaction commit. > > Since mixed mode is somewhat deprecated and btrfs is not really intended > to be used on really small devices let's just adjust the test to > create a 101mb fs, which doesn't use mixed mode and really test > freespace accounting. > > Signed-off-by: Nikolay Borisov> --- > tests/generic/015 | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/tests/generic/015 b/tests/generic/015 > index 78f2b13..416c4ae 100755 > --- a/tests/generic/015 > +++ b/tests/generic/015 > @@ -53,7 +53,7 @@ _supported_os Linux > _require_scratch > _require_no_large_scratch_dev > > -_scratch_mkfs_sized `expr 50 \* 1024 \* 1024` >/dev/null 2>&1 \ > +_scratch_mkfs_sized `expr 101 \* 1024 \* 1024` >/dev/null 2>&1 \ Better to have some comments in the code too, to explain why we choose 101m filesystem size. Thanks, Eryu > || _fail "mkfs failed" > _scratch_mount || _fail "mount failed" > out=$SCRATCH_MNT/fillup.$$ > -- > 2.7.4 > > -- > To unsubscribe from this list: send the line "unsubscribe fstests" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Recommendations for balancing as part of regular maintenance?
Austin S. Hemmelgarn posted on Tue, 09 Jan 2018 07:46:48 -0500 as excerpted: >> On 08/01/18 23:29, Martin Raiber wrote: >>> There have been reports of (rare) corruption caused by balance (won't >>> be detected by a scrub) here on the mailing list. So I would stay a >>> away from btrfs balance unless it is absolutely needed (ENOSPC), and >>> while it is run I would try not to do anything else wrt. to writes >>> simultaneously. >> >> This is my opinion too as a normal user, based upon reading this list >> and own attempts to recover from ENOSPC. I'd rather re-create >> filesystem from scratch, or at least make full verified backup before >> attempting to fix problems with balance. > While I'm generally of the same opinion (and I have a feeling most other > people who have been server admins are too), it's not a very user > friendly position to recommend that. Keep in mind that many (probably > most) users don't keep proper backups, and just targeting 'sensible' > people as your primary audience is a bad idea. It also needs to work at > at least a basic level anyway though simply because you can't always > just nuke the volume and rebuild it from scratch. > > Personally though, I don't think I've ever seen issues with balance > corrupting data, and I don't recall seeing complaints about it either > (though I would love to see some links that prove me wrong). AFAIK, such corruption reports re balance aren't really balance, per se, at all. Instead, what I've seen in nearly all cases is a number of filesystem maintenance commands involving heavy I/O colliding, that is, being run at the same time, possibly because some of them are scheduled, and the admin didn't take into account scheduled commands when issuing others manually. I don't believe anyone would recommend running balance, scrub, snapshot- deletion, and backups (rsync or btrfs send/receive being the common ones), all at the same time, or even two or more at the same time, if for no other reason than because they're all IO intensive and running just /one/ of them at a time is hard /enough/ on the system and the performance of anything else running at the same time, even when all components are fully stable and mature (and as we all know, btrfs is stabilizing, but not yet fully stable and mature), yet that's what these sorts of reports invariably involve. Of course, with a certainty btrfs /should/ be able to handle more than one of these at once without corruption, because anything else is a bug, but... btrfs /is/ still stabilizing and maturing, and it's precisely this sort of rare corner-case race-condition bugs where more than one extremely heavy IO filesystem maintenance command is being run at the same time that tend to be the last to be found and fixed, because they /are/ rare corner-cases, often depending on race conditions, that tend to be rare enough reported, and then extremely difficult to duplicate, so that's exactly the type of bugs that tend to remain around at this point. So rather than discouraging a sane-filtered regular balance (which I'll discuss in a different reply), I'd suggest that the more sane recommendation is to be aware of other major-IO filesystem maintenance commands (not just btrfs commands but rsync-based backups, etc, too, rsync being demanding enough on its own to have triggered a number of btrfs bug reports and fixes over the years), including scheduled commands, and to only run one at a time. IOW, don't do a balance if your scheduled backup or snapshot-deletion is about to kick in. One at a time is stressful enough on the filesystem and hardware, don't compound the problem trying to do two or more at once! So assuming a weekly schedule, do one a day of balance, scrub, snapshot- deletion, backups (after ensuring that none of them take over a day, balance in particular could at TiB-scale+ if not sanely filtered, particularly if quotas are enabled due to the scaling issues of that feature). And if any of those are scheduled daily or more frequently, space the scheduling appropriately and ensure they're done before starting the next task. And keep in mind the scheduled tasks when running things manually, so as not to collide there either. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs scrub not repair corruption
Wolf posted on Mon, 08 Jan 2018 23:27:27 +0100 as excerpted: > I'm running btrfs scrub on my raid each week (is that too often?) and > I'm having a problem that it reports corruption, says it's repaired but > next week reports it again. I won't attempt to answer the larger question, but on the narrow "too often?" question, no, running scrub once a week shouldn't be a problem. Scrub is read-only unless it finds errors, so even running it repeatedly end-to-end shouldn't be a problem, other than the obvious performance issue and the potential increased head-seek wear on non-ssd devices. The obvious issue would be slowing down whatever else you're doing at the same time, and at whatever presumably scheduled weekly time you run it that's evidently not a problem for your use-case. Also, a bit OT as I don't believe it's related to this, but FWIW... There *has* been a recent kernel issue with gentoo-hardened compiling kernel code incorrectly due to a gcc option enabled by default on hardened. I don't remember the details, but I ran across in in one of the kernel development articles I read. I /think/ it applied only to 4.15-rc, however, or possibly 4.14. The fix is to disable that specific gcc option when building the kernel, as it was designed for userspace and doesn't make much sense for the kernel anyway. A patch doing just that should already be part of the latest 4.15-rcs and if the bug applied to 4.14 it'll be backported there as well, but I'm not sure of current 4.14- stable status. (I run gentoo, so my interest perked when I came across the discussion, but not hardened, so I didn't need to retain the details.) If you're not already aware of that, you might wish to research it a bit more, and disable whatever option manually in your kernel-build CFLAGS, tho as mentioned once the patch is applied the kernel make files automatically apply the appropriate option. (The official kernel CFLAGS related vars are KCFLAGS (C), KCPPFLAGS (pre-processor), and KAFLAGS (assembler).) Unfortunately IDR what the specific flag was, -fno-something, IIRC. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] Btrfs: raid56: fix race between merge_bio and rbio_orig_end_io
Before rbio_orig_end_io() goes to free rbio, rbio may get merged with more bios from other rbios and rbio->bio_list becomes non-empty, in that case, these newly merged bios don't end properly. Once unlock_stripe() is done, rbio->bio_list will not be updated any more and we can call bio_endio() on all queued bios. It should only happen in error-out cases, the normal path of recover and full stripe write have already set RBIO_RMW_LOCKED_BIT to disable merge before doing IO, so rbio_orig_end_io() called by them doesn't have the above issue. Reported-by: Jérôme CarreteroSigned-off-by: Liu Bo --- v2: - Remove the usage spin_lock as there is a chance of deadlock in interrupt context, it's reported by lockdep, although it'd never happen because we've taken care of it by saving irq flags at all places. - Update commit log and comments of code to explain the new idea. - This has been tested against btrfs/011 for 50 times. fs/btrfs/raid56.c | 37 + 1 file changed, 25 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 7747323..b2b426d 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -864,10 +864,17 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio) kfree(rbio); } -static void free_raid_bio(struct btrfs_raid_bio *rbio) +static void rbio_endio_bio_list(struct bio *cur, blk_status_t err) { - unlock_stripe(rbio); - __free_raid_bio(rbio); + struct bio *next; + + while (cur) { + next = cur->bi_next; + cur->bi_next = NULL; + cur->bi_status = err; + bio_endio(cur); + cur = next; + } } /* @@ -877,20 +884,26 @@ static void free_raid_bio(struct btrfs_raid_bio *rbio) static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err) { struct bio *cur = bio_list_get(>bio_list); - struct bio *next; + struct bio *extra; if (rbio->generic_bio_cnt) btrfs_bio_counter_sub(rbio->fs_info, rbio->generic_bio_cnt); - free_raid_bio(rbio); + /* +* At this moment, rbio->bio_list is empty, however since rbio does not +* always have RBIO_RMW_LOCKED_BIT set and rbio is still linked on the +* hash list, rbio may be merged with others so that rbio->bio_list +* becomes non-empty. +* Once unlock_stripe() is done, rbio->bio_list will not be updated any +* more and we can call bio_endio() on all queued bios. +*/ + unlock_stripe(rbio); + extra = bio_list_get(>bio_list); + __free_raid_bio(rbio); - while (cur) { - next = cur->bi_next; - cur->bi_next = NULL; - cur->bi_status = err; - bio_endio(cur); - cur = next; - } + rbio_endio_bio_list(cur, err); + if (extra) + rbio_endio_bio_list(extra, err); } /* -- 2.9.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] btrfs: Fix slight df misreporting when listing mixed block groups fs
On 2018年01月10日 09:30, Qu Wenruo wrote: > > > On 2018年01月10日 00:04, Nikolay Borisov wrote: >> Currently when df (resp. statfs) is executed on a btrfs instance the >> free space is calculated as the freespace of the block groups + >> any unallocated space on the devices, constituting this filesystem. >> There is a catch, however, in that the unallocated space from the >> devices has 1 mb subtracted from each device to closely follow btrfs' >> allocator behavior. This is all good except that when we have a file >> system and the system chunk's allocation profile is single (as is the >> default in mixed mode) it occupies the range 0..4mb and thus implicitly >> prevents allocation starting in the range 0..1mb. In those cases >> additionally subtracting 1mb during statfs makes us misreport the >> actual free space and this causes generic/015 to fail. >> >> Signed-off-by: Nikolay Borisov>> --- >> Hello, >> >> >> >> So with this patch and adding sync to generic/015 (which is a different >> issue) >> I can make generic/015 pass. In any case I think this behavior is correct >> >> though frankly I feel it's not really worth it, in the worst case we are >> >> underreporting by at most 1mb per device. >> >> fs/btrfs/super.c | 25 + >> 1 file changed, 13 insertions(+), 12 deletions(-) >> >> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c >> index f33a55272deb..a4dfc9701220 100644 >> --- a/fs/btrfs/super.c >> +++ b/fs/btrfs/super.c >> @@ -1925,12 +1925,13 @@ static int btrfs_calc_avail_data_space(struct >> btrfs_fs_info *fs_info, >> struct btrfs_device_info *devices_info; >> struct btrfs_fs_devices *fs_devices = fs_info->fs_devices; >> struct btrfs_device *device; >> -u64 skip_space; >> u64 type; >> u64 avail_space; >> u64 min_stripe_size; >> int min_stripes = 1, num_stripes = 1; >> int i = 0, nr_devices; >> +bool sys_chunk_single = !(btrfs_system_alloc_profile(fs_info) & >> + ~BTRFS_BLOCK_GROUP_SYSTEM); >> >> /* >> * We aren't under the device list lock, so this is racy-ish, but good >> @@ -1988,18 +1989,18 @@ static int btrfs_calc_avail_data_space(struct >> btrfs_fs_info *fs_info, >> /* >> * In order to avoid overwriting the superblock on the drive, >> * btrfs starts at an offset of at least 1MB when doing chunk >> - * allocation. >> + * allocation. There is one notable exception, and that is >> + * when the System chunk has allocation profile of single. In >> + * this case it occupies 0..4m and we don't really need to >> + * subtract 1mb as it's implied > > This is in fact a bug in mkfs.btrfs. > > The problem is, current chunk allocator in mkfs.btrfs doesn't really > avoid the first 1M. Just a side note, it's the custom allocator (Yep, custom allocator again) ignoring the 1M limit. While the generic btrfs_alloc_chunk() is completely fine. The fix to mkfs.btrfs can arrive very soon. This also reminds me about the long forgot project to get rid of such SINGLE temporary chunks (not allocating them at all) and allocate them in native profile at the very beginning of mkfs.btrfs. Thanks, Qu > > So temporary (profile SINGLE) chunks can still using that reserved 1M, > while normally such chunks get removed so reserved 1M will not be used, > but if we specified -m single, such chunks will just be used as is. > > And such behavior makes unallocated space calculation pretty nasty. > >> */ >> -skip_space = SZ_1M; >> - >> -/* >> - * we can use the free space in [0, skip_space - 1], subtract >> - * it from the total. >> - */ >> -if (avail_space && avail_space >= skip_space) >> -avail_space -= skip_space; >> -else >> -avail_space = 0; >> +if (device->devid != 1 || >> +(!sys_chunk_single && device->devid == 1)) { >> +if (avail_space && avail_space >= SZ_1M) >> +avail_space -= SZ_1M; >> +else >> +avail_space = 0; >> +} > > Due to above reason, using system chunk profile and devid to check if we > need to reduce 1M is not reliable. > > That system chunk can be relocated by balance, and in that case new > chunk may start beyond 1M. > > So the most reliable method would be manually checking the the first > device extent of this device, and to determine if we should minus that 1M. > > If the first device extent starts beyond 1M, reducing 1M would be good. > Or don't modify it. > > Thanks, > Qu > >> >> if (avail_space <
[PATCH 4/4] btrfs: drop devid as device_list_add() arg
As struct btrfs_disk_super is being passed, so it can get devid the same way its parent does. Signed-off-by: Anand JainReviewed-by: Josef Bacik --- fs/btrfs/volumes.c | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index d74d01971d0c..bd2e8616755e 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -731,12 +731,13 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices, * < 0 - error */ static noinline struct btrfs_device *device_list_add(const char *path, - struct btrfs_super_block *disk_super, u64 devid) + struct btrfs_super_block *disk_super) { struct btrfs_device *device; struct btrfs_fs_devices *fs_devices; struct rcu_string *name; u64 found_transid = btrfs_super_generation(disk_super); + u64 devid = btrfs_stack_device_id(_super->dev_item); fs_devices = find_fsid(disk_super->fsid); if (!fs_devices) { @@ -1184,7 +1185,6 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, struct block_device *bdev; struct page *page; int ret = 0; - u64 devid; u64 bytenr; /* @@ -1205,10 +1205,8 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, goto error_bdev_put; } - devid = btrfs_stack_device_id(_super->dev_item); - mutex_lock(_mutex); - device = device_list_add(path, disk_super, devid); + device = device_list_add(path, disk_super); mutex_unlock(_mutex); if (IS_ERR(device)) ret = PTR_ERR(device); -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH V2 0/4] device_list_add() peparation to add reappearing missing device
v1->v2: Drop patch 5/5 for uuid_mutex optimize. That was wrong. Thanks Josef. In patch 3/5 make btrfs_device * as return. Cleanup of device_list_add(), mainly in preparation to handle reappearing missing device which its next reroll will be sent separately. Anand Jain (4): btrfs: move pr_info into device_list_add btrfs: set the total_devices in device_list_add() btrfs: get device pointer from device_list_add() btrfs: drop devid as device_list_add() arg fs/btrfs/volumes.c | 61 +++--- 1 file changed, 26 insertions(+), 35 deletions(-) -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/4] btrfs: set the total_devices in device_list_add()
There is no other parent for device_list_add() except for btrfs_scan_one_device(), which would set btrfs_fs_devices::total_devices if device_list_add is successful and this can be done with in device_list_add() itself. Signed-off-by: Anand JainReviewed-by: Josef Bacik --- fs/btrfs/volumes.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 091eb452f3c8..f9aaf65e27f5 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -844,6 +844,8 @@ static noinline int device_list_add(const char *path, if (!fs_devices->opened) device->generation = found_transid; + fs_devices->total_devices = btrfs_super_num_devices(disk_super); + *fs_devices_ret = fs_devices; return 0; @@ -1185,7 +1187,6 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, struct page *page; int ret; u64 devid; - u64 total_devices; u64 bytenr; /* @@ -1207,12 +1208,9 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, } devid = btrfs_stack_device_id(_super->dev_item); - total_devices = btrfs_super_num_devices(disk_super); mutex_lock(_mutex); ret = device_list_add(path, disk_super, devid, fs_devices_ret); - if (!ret && fs_devices_ret) - (*fs_devices_ret)->total_devices = total_devices; mutex_unlock(_mutex); btrfs_release_disk_super(page); -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/4] btrfs: move pr_info into device_list_add
Commit 60999ca4b403 ("btrfs: make device scan less noisy") adds return value 1 to device_list_add(), so that parent function can call pr_info only when new device is added. Move the pr_info() part into device_list_add() so that this function can be kept simple. Signed-off-by: Anand JainReviewed-by: Josef Bacik --- fs/btrfs/volumes.c | 29 +++-- 1 file changed, 11 insertions(+), 18 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 0d7f912f68ff..091eb452f3c8 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -727,8 +727,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices, * Add new device to list of registered devices * * Returns: - * 1 - first time device is seen - * 0 - device already known + * 0 - device already known or newly added * < 0 - error */ static noinline int device_list_add(const char *path, @@ -738,7 +737,6 @@ static noinline int device_list_add(const char *path, struct btrfs_device *device; struct btrfs_fs_devices *fs_devices; struct rcu_string *name; - int ret = 0; u64 found_transid = btrfs_super_generation(disk_super); fs_devices = find_fsid(disk_super->fsid); @@ -778,9 +776,16 @@ static noinline int device_list_add(const char *path, fs_devices->num_devices++; mutex_unlock(_devices->device_list_mutex); - ret = 1; device->fs_devices = fs_devices; btrfs_free_stale_device(path, device); + + if (disk_super->label[0]) + pr_info("BTRFS: device label %s devid %llu transid %llu %s\n", + disk_super->label, devid, found_transid, path); + else + pr_info("BTRFS: device fsid %pU devid %llu transid %llu %s\n", + disk_super->fsid, devid, found_transid, path); + } else if (!device->name || strcmp(device->name->str, path)) { /* * When FS is already mounted. @@ -841,7 +846,7 @@ static noinline int device_list_add(const char *path, *fs_devices_ret = fs_devices; - return ret; + return 0; } static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig) @@ -1180,7 +1185,6 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, struct page *page; int ret; u64 devid; - u64 transid; u64 total_devices; u64 bytenr; @@ -1203,25 +1207,14 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, } devid = btrfs_stack_device_id(_super->dev_item); - transid = btrfs_super_generation(disk_super); total_devices = btrfs_super_num_devices(disk_super); mutex_lock(_mutex); ret = device_list_add(path, disk_super, devid, fs_devices_ret); - if (ret >= 0 && fs_devices_ret) + if (!ret && fs_devices_ret) (*fs_devices_ret)->total_devices = total_devices; mutex_unlock(_mutex); - if (ret > 0) { - if (disk_super->label[0]) - pr_info("BTRFS: device label %s ", disk_super->label); - else - pr_info("BTRFS: device fsid %pU ", disk_super->fsid); - - pr_cont("devid %llu transid %llu %s\n", devid, transid, path); - ret = 0; - } - btrfs_release_disk_super(page); error_bdev_put: -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/4] btrfs: get device pointer from device_list_add()
Instead of pointer to btrfs_fs_devices as an arg in device_list_add() better to get pointer to btrfs_device as return value, then we have both, pointer to btrfs_device and btrfs_fs_devices. btrfs_device is needed to handle reappearing missing device. Signed-off-by: Anand Jain--- fs/btrfs/volumes.c | 30 -- 1 file changed, 16 insertions(+), 14 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index f9aaf65e27f5..d74d01971d0c 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -730,9 +730,8 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices, * 0 - device already known or newly added * < 0 - error */ -static noinline int device_list_add(const char *path, - struct btrfs_super_block *disk_super, - u64 devid, struct btrfs_fs_devices **fs_devices_ret) +static noinline struct btrfs_device *device_list_add(const char *path, + struct btrfs_super_block *disk_super, u64 devid) { struct btrfs_device *device; struct btrfs_fs_devices *fs_devices; @@ -743,7 +742,7 @@ static noinline int device_list_add(const char *path, if (!fs_devices) { fs_devices = alloc_fs_devices(disk_super->fsid); if (IS_ERR(fs_devices)) - return PTR_ERR(fs_devices); + return ERR_PTR(PTR_ERR(fs_devices)); list_add(_devices->list, _uuids); @@ -755,19 +754,19 @@ static noinline int device_list_add(const char *path, if (!device) { if (fs_devices->opened) - return -EBUSY; + return ERR_PTR(-EBUSY); device = btrfs_alloc_device(NULL, , disk_super->dev_item.uuid); if (IS_ERR(device)) { /* we can safely leave the fs_devices entry around */ - return PTR_ERR(device); + return ERR_PTR(PTR_ERR(device)); } name = rcu_string_strdup(path, GFP_NOFS); if (!name) { free_device(device); - return -ENOMEM; + return ERR_PTR(-ENOMEM); } rcu_assign_pointer(device->name, name); @@ -821,12 +820,12 @@ static noinline int device_list_add(const char *path, * with larger generation number or the last-in if * generation are equal. */ - return -EEXIST; + return ERR_PTR(-EEXIST); } name = rcu_string_strdup(path, GFP_NOFS); if (!name) - return -ENOMEM; + return ERR_PTR(-ENOMEM); rcu_string_free(device->name); rcu_assign_pointer(device->name, name); if (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state)) { @@ -846,9 +845,7 @@ static noinline int device_list_add(const char *path, fs_devices->total_devices = btrfs_super_num_devices(disk_super); - *fs_devices_ret = fs_devices; - - return 0; + return device; } static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig) @@ -1183,9 +1180,10 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, struct btrfs_fs_devices **fs_devices_ret) { struct btrfs_super_block *disk_super; + struct btrfs_device *device; struct block_device *bdev; struct page *page; - int ret; + int ret = 0; u64 devid; u64 bytenr; @@ -1210,8 +1208,12 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, devid = btrfs_stack_device_id(_super->dev_item); mutex_lock(_mutex); - ret = device_list_add(path, disk_super, devid, fs_devices_ret); + device = device_list_add(path, disk_super, devid); mutex_unlock(_mutex); + if (IS_ERR(device)) + ret = PTR_ERR(device); + + *fs_devices_ret = device->fs_devices; btrfs_release_disk_super(page); -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] btrfs: move uuid_mutex closer to exclusivity
On 01/10/2018 12:15 AM, Josef Bacik wrote: On Tue, Jan 09, 2018 at 10:46:25PM +0800, Anand Jain wrote: move uuid_mutex with in device_list_add(). Signed-off-by: Anand JainThis isn't going to work, there's a bunch of places we return errors so this just deadlocks the box. Leave it like it is, it's not hurting anybody leaving it like this. Thanks, Err. My mistake. Can't believe how did I miss it. Pls ignore this patch. Thanks for the review. - Anand Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] btrfs: Fix slight df misreporting when listing mixed block groups fs
On 2018年01月10日 00:04, Nikolay Borisov wrote: > Currently when df (resp. statfs) is executed on a btrfs instance the > free space is calculated as the freespace of the block groups + > any unallocated space on the devices, constituting this filesystem. > There is a catch, however, in that the unallocated space from the > devices has 1 mb subtracted from each device to closely follow btrfs' > allocator behavior. This is all good except that when we have a file > system and the system chunk's allocation profile is single (as is the > default in mixed mode) it occupies the range 0..4mb and thus implicitly > prevents allocation starting in the range 0..1mb. In those cases > additionally subtracting 1mb during statfs makes us misreport the > actual free space and this causes generic/015 to fail. > > Signed-off-by: Nikolay Borisov> --- > Hello, > > > > So with this patch and adding sync to generic/015 (which is a different > issue) > I can make generic/015 pass. In any case I think this behavior is correct > > though frankly I feel it's not really worth it, in the worst case we are > > underreporting by at most 1mb per device. > > fs/btrfs/super.c | 25 + > 1 file changed, 13 insertions(+), 12 deletions(-) > > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c > index f33a55272deb..a4dfc9701220 100644 > --- a/fs/btrfs/super.c > +++ b/fs/btrfs/super.c > @@ -1925,12 +1925,13 @@ static int btrfs_calc_avail_data_space(struct > btrfs_fs_info *fs_info, > struct btrfs_device_info *devices_info; > struct btrfs_fs_devices *fs_devices = fs_info->fs_devices; > struct btrfs_device *device; > - u64 skip_space; > u64 type; > u64 avail_space; > u64 min_stripe_size; > int min_stripes = 1, num_stripes = 1; > int i = 0, nr_devices; > + bool sys_chunk_single = !(btrfs_system_alloc_profile(fs_info) & > + ~BTRFS_BLOCK_GROUP_SYSTEM); > > /* >* We aren't under the device list lock, so this is racy-ish, but good > @@ -1988,18 +1989,18 @@ static int btrfs_calc_avail_data_space(struct > btrfs_fs_info *fs_info, > /* >* In order to avoid overwriting the superblock on the drive, >* btrfs starts at an offset of at least 1MB when doing chunk > - * allocation. > + * allocation. There is one notable exception, and that is > + * when the System chunk has allocation profile of single. In > + * this case it occupies 0..4m and we don't really need to > + * subtract 1mb as it's implied This is in fact a bug in mkfs.btrfs. The problem is, current chunk allocator in mkfs.btrfs doesn't really avoid the first 1M. So temporary (profile SINGLE) chunks can still using that reserved 1M, while normally such chunks get removed so reserved 1M will not be used, but if we specified -m single, such chunks will just be used as is. And such behavior makes unallocated space calculation pretty nasty. >*/ > - skip_space = SZ_1M; > - > - /* > - * we can use the free space in [0, skip_space - 1], subtract > - * it from the total. > - */ > - if (avail_space && avail_space >= skip_space) > - avail_space -= skip_space; > - else > - avail_space = 0; > + if (device->devid != 1 || > + (!sys_chunk_single && device->devid == 1)) { > + if (avail_space && avail_space >= SZ_1M) > + avail_space -= SZ_1M; > + else > + avail_space = 0; > + } Due to above reason, using system chunk profile and devid to check if we need to reduce 1M is not reliable. That system chunk can be relocated by balance, and in that case new chunk may start beyond 1M. So the most reliable method would be manually checking the the first device extent of this device, and to determine if we should minus that 1M. If the first device extent starts beyond 1M, reducing 1M would be good. Or don't modify it. Thanks, Qu > > if (avail_space < min_stripe_size) > continue; > signature.asc Description: OpenPGP digital signature
Re: [PATCH 3/5] btrfs: get device pointer from device_list_add()
On 01/10/2018 12:13 AM, Josef Bacik wrote: On Tue, Jan 09, 2018 at 10:46:23PM +0800, Anand Jain wrote: Instead of pointer to btrfs_fs_devices from device_list_add() its better to get pointer to btrfs_device, then we have both, pointer to btrfs_device and btrfs_fs_devices. This is needed in preparation to add handling reappearing missing device feature. Signed-off-by: Anand JainCan we just change device_list_add to return struct btrfs_device * instead, and just do PTR_ERR() if there's a problem? Thanks, That's much better. Will do. Thanks, Anand Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 19/19] fs: handle inode->i_version more efficiently
On Tue, Jan 09, 2018 at 09:10:59AM -0500, Jeff Layton wrote: > From: Jeff Layton> > Since i_version is mostly treated as an opaque value, we can exploit that > fact to avoid incrementing it when no one is watching. With that change, > we can avoid incrementing the counter on writes, unless someone has > queried for it since it was last incremented. If the a/c/mtime don't > change, and the i_version hasn't changed, then there's no need to dirty > the inode metadata on a write. > > Convert the i_version counter to an atomic64_t, and use the lowest order > bit to hold a flag that will tell whether anyone has queried the value > since it was last incremented. > > When we go to maybe increment it, we fetch the value and check the flag > bit. If it's clear then we don't need to do anything if the update > isn't being forced. > > If we do need to update, then we increment the counter by 2, and clear > the flag bit, and then use a CAS op to swap it into place. If that > works, we return true. If it doesn't then do it again with the value > that we fetch from the CAS operation. > > On the query side, if the flag is already set, then we just shift the > value down by 1 bit and return it. Otherwise, we set the flag in our > on-stack value and again use cmpxchg to swap it into place if it hasn't > changed. If it has, then we use the value from the cmpxchg as the new > "old" value and try again. > > This method allows us to avoid incrementing the counter on writes (and > dirtying the metadata) under typical workloads. We only need to increment > if it has been queried since it was last changed. > > Signed-off-by: Jeff Layton > Reviewed-by: Jan Kara Documentation helps a lot in understanding all this. Thanks for adding it into the patch! Acked-by: Dave Chinner -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 17/19] xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing
On Tue, Jan 09, 2018 at 09:10:57AM -0500, Jeff Layton wrote: > From: Jeff Layton> > If XFS_ILOG_CORE is already set then go ahead and increment it. > > Signed-off-by: Jeff Layton > Acked-by: Darrick J. Wong Acked-by: Dave Chinner -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 14/19] xfs: convert to new i_version API
On Tue, Jan 09, 2018 at 09:10:54AM -0500, Jeff Layton wrote: > From: Jeff Layton> > Signed-off-by: Jeff Layton > Acked-by: Darrick J. Wong Looks ok, but I haven't tested it at all. Acked-by: Dave Chinner -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: use correct string length
On Mon, Jan 08, 2018 at 07:51:22PM +0800, Xiongfeng Wang wrote: > From: Xiongfeng Wang> > gcc-8 reports > > fs/btrfs/ioctl.c: In function 'btrfs_ioctl': > ./include/linux/string.h:245:9: warning: '__builtin_strncpy' specified > bound 1024 equals destination size [-Wstringop-truncation] > > We need one less byte or call strlcpy() to make it a nul-terminated > string. The null termination is on the following line, so this patch fixes namely the gcc warning: > - strncpy(di_args->path, name->str, sizeof(di_args->path)); > + strncpy(di_args->path, name->str, sizeof(di_args->path) - 1); > rcu_read_unlock(); > di_args->path[sizeof(di_args->path) - 1] = 0; Reviewed-by: David Sterba -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 10/10] Btrfs: noinline merge_extent_mapping
On Fri, Jan 05, 2018 at 12:51:17PM -0700, Liu Bo wrote: > In order to debug subtle bugs around merge_extent_mapping(), perf probe > can be used to check the arguments, but sometimes merge_extent_mapping() > got inlined by compiler and couldn't be probed. > > This is adding noinline attribute to merge_extent_mapping(). > > Signed-off-by: Liu BoReviewed-by: Josef Bacik Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 09/10] Btrfs: add tracepoint for em's EEXIST case
On Fri, Jan 05, 2018 at 12:51:16PM -0700, Liu Bo wrote: > This is adding a tracepoint 'btrfs_handle_em_exist' to help debug the > subtle bugs around merge_extent_mapping. > > Signed-off-by: Liu BoReviewed-by: Josef Bacik Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 08/10] Btrfs: add WARN_ONCE to detect unexpected error from merge_extent_mapping
On Fri, Jan 05, 2018 at 12:51:15PM -0700, Liu Bo wrote: > This is a subtle case, so in order to understand the problem, it'd be good > to know the content of existing and em when any error occurs. > > Signed-off-by: Liu BoReviewed-by: Josef Bacik Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 07/10] Btrfs: extent map selftest: dio write vs dio read
On Fri, Jan 05, 2018 at 12:51:14PM -0700, Liu Bo wrote: > This test case simulates the racy situation of dio write vs dio read, > and see if btrfs_get_extent() would return -EEXIST. > > Signed-off-by: Liu BoReviewed-by: Josef Bacik Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 06/10] Btrfs: extent map selftest: buffered write vs dio read
On Fri, Jan 05, 2018 at 12:51:13PM -0700, Liu Bo wrote: > This test case simulates the racy situation of buffered write vs dio > read, and see if btrfs_get_extent() would return -EEXIST. > > Signed-off-by: Liu BoReviewed-by: Josef Bacik Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 05/10] Btrfs: add extent map selftests
On Fri, Jan 05, 2018 at 12:51:12PM -0700, Liu Bo wrote: > We've observed that btrfs_get_extent() and merge_extent_mapping() could > return -EEXIST in several cases, and they are caused by some racy > condition, e.g dio read vs dio write, which makes the problem very tricky > to reproduce. > > This adds extent map selftests in order to simulate those racy situations. > > Signed-off-by: Liu BoReviewed-by: Josef Bacik Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 04/10] Btrfs: move extent map specific code to extent_map.c
On Fri, Jan 05, 2018 at 12:51:11PM -0700, Liu Bo wrote: > These helpers are extent map specific, move them to extent_map.c. > > Signed-off-by: Liu BoReviewed-by: Josef Bacik Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 02/10] Btrfs: fix unexpected EEXIST from btrfs_get_extent
On Fri, Jan 05, 2018 at 12:51:09PM -0700, Liu Bo wrote: > This fixes a corner case that is caused by a race of dio write vs dio > read/write. > > Here is how the race could happen. > > Suppose that no extent map has been loaded into memory yet. > There is a file extent [0, 32K), two jobs are running concurrently > against it, t1 is doing dio write to [8K, 32K) and t2 is doing dio > read from [0, 4K) or [4K, 8K). > > t1 goes ahead of t2 and splits em [0, 32K) to em [0K, 8K) and [8K 32K). > > -- > t1t2 > btrfs_get_blocks_direct() btrfs_get_blocks_direct() >-> btrfs_get_extent() -> btrfs_get_extent() >-> lookup_extent_mapping() >-> add_extent_mapping()-> lookup_extent_mapping() > # load [0, 32K) >-> btrfs_new_extent_direct() >-> btrfs_drop_extent_cache() > # split [0, 32K) and > # drop [8K, 32K) >-> add_extent_mapping() > # add [8K, 32K) > -> add_extent_mapping() > # handle -EEXIST when adding > # [0, 32K) > -- > About how t2(dio read/write) runs into -EEXIST: > > a) add_extent_mapping() gets -EEXIST for adding em [0, 32k), > > b) search_extent_mapping() then returns [0, 8k) as the existing em, >even though start == existing->start, em is [0, 32k) so that >extent_map_end(em) > extent_map_end(existing), i.e. 32k > 8k, > > c) then it goes thru merge_extent_mapping() which tries to add a [8k, 8k) >(with a length 0) and returns -EEXIST as [8k, 32k) is already in tree, > > d) so btrfs_get_extent() ends up returning -EEXIST to dio read/write, >which is confusing applications. > > Here I conclude all the possible situations, > 1) start < existing->start > > +---+em+---+ > +--prev---+ | +-+ | > | | | | | | > +-+ + +---+existing++ ++ > + > | > + > start > > 2) start == existing->start > > +em+ > | +-+ | > | | | | > + +existing-+ + > | > | > + > start > > 3) start > existing->start && start < (existing->start + existing->len) > > +em+ > | +-+ | > | | | | > + +existing-+ + >| >| >+ > start > > 4) start >= (existing->start + existing->len) > > +---+em+---+ > | +-+ | +--next---+ > | | | | | | > + +---+existing++ + +-+ > + > | > + >start > > As we can see, it turns out that if start is within existing em (front > inclusive), then the existing em should be returned as is, otherwise, > we try our best to merge candidate em with sibling ems to form a > larger em (in order to reduce the total number of em). > > Reported-by: David Vallender> Signed-off-by: Liu Bo Reviewed-by: Josef Bacik Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 03/10] Btrfs: add helper for em merge logic
On Fri, Jan 05, 2018 at 12:51:10PM -0700, Liu Bo wrote: > This is a prepare work for the following extent map selftest, which > runs tests against em merge logic. > > Signed-off-by: Liu BoReviewed-by: Josef Bacik Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 01/10] Btrfs: fix incorrect block_len in merge_extent_mapping
On Fri, Jan 05, 2018 at 12:51:08PM -0700, Liu Bo wrote: > %block_len could be checked on deciding if two em are mergeable. > > merge_extent_mapping() has only added the front pad if the front part > of em gets truncated, but it's possible that the end part gets > truncated. > > For both compressed extent and inline extent, em->block_len is not > adjusted accordingly, and for regular extent, em->block_len always > equals to em->len, hence this sets em->block_len with em->len. > > Signed-off-by: Liu BoReviewed-by: Josef Bacik Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] btrfs: add support for SUPER_FLAG_CHANGING_FSID in btrfs.ko
On Tue, Jan 09, 2018 at 09:05:42AM +0800, Anand Jain wrote: > Userland sets SUPER_FLAG_CHANGING_FSID and resets it only when changing > fsid is complete. Its not a good idea to mount the device anything in > between. > This patch doesn't add SUPER_FLAG_CHANGING_FSID into BTRFS_SUPER_FLAG_SUPP > list, so mount would fail (along with the fix in the next patch) when > SUPER_FLAG_CHANGING_FSID is set. > > Signed-off-by: Anand Jain> cc: w...@suse.com > Reviewed-by: Qu Wenruo Reviewed-by: David Sterba -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] btrfs: fail mount when sb flag is not in BTRFS_SUPER_FLAG_SUPP
On Tue, Jan 09, 2018 at 09:05:43AM +0800, Anand Jain wrote: > It appear from the original commit [1] that there isn't any design > specific reason not to fail the mount instead of just warning. This > patch will change it to fail. > > [1] > commit 319e4d0661e5323c9f9945f0f8fb5905e5fe74c3 > btrfs: Enhance super validation check > > Signed-off-by: Anand Jain> cc: w...@suse.com > Reviewed-by: Qu Wenruo Reviewed-by: David Sterba -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] btrfs: define SUPER_FLAG_METADUMP_V2
On Tue, Jan 09, 2018 at 09:05:41AM +0800, Anand Jain wrote: > btrfs-progs uses super flag bit BTRFS_SUPER_FLAG_METADUMP_V2 (1ULL << 34). > So just define that in kernel so that we know its been used. > > Signed-off-by: Anand JainReviewed-by: David Sterba > --- > fs/btrfs/disk-io.c | 3 ++- > include/uapi/linux/btrfs_tree.h | 1 + > 2 files changed, 3 insertions(+), 1 deletion(-) > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index a69e5944dc08..7cd4935f690f 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -61,7 +61,8 @@ >BTRFS_HEADER_FLAG_RELOC |\ >BTRFS_SUPER_FLAG_ERROR |\ >BTRFS_SUPER_FLAG_SEEDING |\ > - BTRFS_SUPER_FLAG_METADUMP) > + BTRFS_SUPER_FLAG_METADUMP |\ > + BTRFS_SUPER_FLAG_METADUMP_V2) > > static const struct extent_io_ops btree_extent_io_ops; > static void end_workqueue_fn(struct btrfs_work *work); > diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h > index 6d6e5da51527..38ab0e06259a 100644 > --- a/include/uapi/linux/btrfs_tree.h > +++ b/include/uapi/linux/btrfs_tree.h > @@ -456,6 +456,7 @@ struct btrfs_free_space_header { > > #define BTRFS_SUPER_FLAG_SEEDING (1ULL << 32) > #define BTRFS_SUPER_FLAG_METADUMP(1ULL << 33) > +#define BTRFS_SUPER_FLAG_METADUMP_V2 (1ULL << 34) > > > /* > -- > 2.15.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHSET v5] blk-mq: reimplement timeout handling
On 1/9/18 9:29 AM, Tejun Heo wrote: > Hello, > > Changes from [v4] > > - Comments added. Patch description updated. > > Changes from [v3] > > - Rebased on top of for-4.16/block. > > - Integrated Jens's hctx_[un]lock() factoring patch and refreshed the > patches accordingly. > > - Added comment explaining the use of hctx_lock() instead of > rcu_read_lock() in completion path. > > Changes from [v2] > > - Possible extended looping around seqcount and u64_stat_sync fixed. > > - Misplaced MQ_RQ_IDLE state setting fixed. > > - RQF_MQ_TIMEOUT_EXPIRED added to prevent firing the same timeout > multiple times. > > - s/queue_rq_src/srcu/ patch added. > > - Other misc changes. > > Changes from [v1] > > - BLK_EH_RESET_TIMER handling fixed. > > - s/request->gstate_seqc/request->gstate_seq/ > > - READ_ONCE() added to blk_mq_rq_udpate_state(). > > - Removed left over blk_clear_rq_complete() invocation from > blk_mq_rq_timed_out(). > > Currently, blk-mq timeout path synchronizes against the usual > issue/completion path using a complex scheme involving atomic > bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence > rules. Unfortunatley, it contains quite a few holes. > > It's pretty easy to make blk_mq_check_expired() terminate a later > instance of a request. If we induce 5 sec delay before > time_after_eq() test in blk_mq_check_expired(), shorten the timeout to > 2s, and issue back-to-back large IOs, blk-mq starts timing out > requests spuriously pretty quickly. Nothing actually timed out. It > just made the call on a recycle instance of a request and then > terminated a later instance long after the original instance finished. > The scenario isn't theoretical either. > > This patchset replaces the broken synchronization mechanism with a RCU > and generation number based one. Please read the patch description of > the second path for more details. Applied for 4.16, and added a patch to silence that false positive srcu_idx for hctx_unlock() that got reported. This grows the request a bit, which sucks, but probably unavoidable. There's some room for improvement with a hole or two, however. -- Jens Axboe -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCHSET v5] blk-mq: reimplement timeout handling
Hello, Changes from [v4] - Comments added. Patch description updated. Changes from [v3] - Rebased on top of for-4.16/block. - Integrated Jens's hctx_[un]lock() factoring patch and refreshed the patches accordingly. - Added comment explaining the use of hctx_lock() instead of rcu_read_lock() in completion path. Changes from [v2] - Possible extended looping around seqcount and u64_stat_sync fixed. - Misplaced MQ_RQ_IDLE state setting fixed. - RQF_MQ_TIMEOUT_EXPIRED added to prevent firing the same timeout multiple times. - s/queue_rq_src/srcu/ patch added. - Other misc changes. Changes from [v1] - BLK_EH_RESET_TIMER handling fixed. - s/request->gstate_seqc/request->gstate_seq/ - READ_ONCE() added to blk_mq_rq_udpate_state(). - Removed left over blk_clear_rq_complete() invocation from blk_mq_rq_timed_out(). Currently, blk-mq timeout path synchronizes against the usual issue/completion path using a complex scheme involving atomic bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence rules. Unfortunatley, it contains quite a few holes. It's pretty easy to make blk_mq_check_expired() terminate a later instance of a request. If we induce 5 sec delay before time_after_eq() test in blk_mq_check_expired(), shorten the timeout to 2s, and issue back-to-back large IOs, blk-mq starts timing out requests spuriously pretty quickly. Nothing actually timed out. It just made the call on a recycle instance of a request and then terminated a later instance long after the original instance finished. The scenario isn't theoretical either. This patchset replaces the broken synchronization mechanism with a RCU and generation number based one. Please read the patch description of the second path for more details. This patchset contains the following eight patches. 0001-blk-mq-move-hctx-lock-unlock-into-a-helper.patch 0002-blk-mq-protect-completion-path-with-RCU.patch 0003-blk-mq-replace-timeout-synchronization-with-a-RCU-an.patch 0004-blk-mq-use-blk_mq_rq_state-instead-of-testing-REQ_AT.patch 0005-blk-mq-make-blk_abort_request-trigger-timeout-path.patch 0006-blk-mq-remove-REQ_ATOM_COMPLETE-usages-from-blk-mq.patch 0007-blk-mq-remove-REQ_ATOM_STARTED.patch 0008-blk-mq-rename-blk_mq_hw_ctx-queue_rq_srcu-to-srcu.patch and is available in the following git branch. git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git blk-mq-timeout-v5 diffstat follows. Thanks. block/blk-core.c |2 block/blk-mq-debugfs.c |4 block/blk-mq.c | 346 + block/blk-mq.h | 49 ++ block/blk-timeout.c| 16 +- block/blk.h|7 include/linux/blk-mq.h |3 include/linux/blkdev.h | 25 +++ 8 files changed, 294 insertions(+), 158 deletions(-) -- tejun [v4] http://lkml.kernel.org/r/20180108191542.379478-1...@kernel.org [v3] http://lkml.kernel.org/r/20171216120726.517153-1...@kernel.org [v2] http://lkml.kernel.org/r/20171010155441.753966-1...@kernel.org [v1] http://lkml.kernel.org/r/20171209192525.982030-1...@kernel.org -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/8] blk-mq: move hctx lock/unlock into a helper
From: Jens AxboeMove the RCU vs SRCU logic into lock/unlock helpers, which makes the actual functional bits within the locked region much easier to read. tj: Reordered in front of timeout revamp patches and added the missing blk_mq_run_hw_queue() conversion. Signed-off-by: Jens Axboe Signed-off-by: Tejun Heo --- block/blk-mq.c | 66 -- 1 file changed, 32 insertions(+), 34 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 111e1aa..ddc9261 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -557,6 +557,22 @@ static void __blk_mq_complete_request(struct request *rq) put_cpu(); } +static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx) +{ + if (!(hctx->flags & BLK_MQ_F_BLOCKING)) + rcu_read_unlock(); + else + srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx); +} + +static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx) +{ + if (!(hctx->flags & BLK_MQ_F_BLOCKING)) + rcu_read_lock(); + else + *srcu_idx = srcu_read_lock(hctx->queue_rq_srcu); +} + /** * blk_mq_complete_request - end I/O on a request * @rq:the request being processed @@ -1214,17 +1230,11 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx) */ WARN_ON_ONCE(in_interrupt()); - if (!(hctx->flags & BLK_MQ_F_BLOCKING)) { - rcu_read_lock(); - blk_mq_sched_dispatch_requests(hctx); - rcu_read_unlock(); - } else { - might_sleep(); + might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING); - srcu_idx = srcu_read_lock(hctx->queue_rq_srcu); - blk_mq_sched_dispatch_requests(hctx); - srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx); - } + hctx_lock(hctx, _idx); + blk_mq_sched_dispatch_requests(hctx); + hctx_unlock(hctx, srcu_idx); } /* @@ -1296,17 +1306,10 @@ bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async) * And queue will be rerun in blk_mq_unquiesce_queue() if it is * quiesced. */ - if (!(hctx->flags & BLK_MQ_F_BLOCKING)) { - rcu_read_lock(); - need_run = !blk_queue_quiesced(hctx->queue) && - blk_mq_hctx_has_pending(hctx); - rcu_read_unlock(); - } else { - srcu_idx = srcu_read_lock(hctx->queue_rq_srcu); - need_run = !blk_queue_quiesced(hctx->queue) && - blk_mq_hctx_has_pending(hctx); - srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx); - } + hctx_lock(hctx, _idx); + need_run = !blk_queue_quiesced(hctx->queue) && + blk_mq_hctx_has_pending(hctx); + hctx_unlock(hctx, srcu_idx); if (need_run) { __blk_mq_delay_run_hw_queue(hctx, async, 0); @@ -1618,7 +1621,7 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq) static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, struct request *rq, - blk_qc_t *cookie, bool may_sleep) + blk_qc_t *cookie) { struct request_queue *q = rq->q; struct blk_mq_queue_data bd = { @@ -1668,25 +1671,20 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, } insert: - blk_mq_sched_insert_request(rq, false, run_queue, false, may_sleep); + blk_mq_sched_insert_request(rq, false, run_queue, false, + hctx->flags & BLK_MQ_F_BLOCKING); } static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, struct request *rq, blk_qc_t *cookie) { - if (!(hctx->flags & BLK_MQ_F_BLOCKING)) { - rcu_read_lock(); - __blk_mq_try_issue_directly(hctx, rq, cookie, false); - rcu_read_unlock(); - } else { - unsigned int srcu_idx; + int srcu_idx; - might_sleep(); + might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING); - srcu_idx = srcu_read_lock(hctx->queue_rq_srcu); - __blk_mq_try_issue_directly(hctx, rq, cookie, true); - srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx); - } + hctx_lock(hctx, _idx); + __blk_mq_try_issue_directly(hctx, rq, cookie); + hctx_unlock(hctx, srcu_idx); } static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) -- 2.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/8] blk-mq: protect completion path with RCU
Currently, blk-mq protects only the issue path with RCU. This patch puts the completion path under the same RCU protection. This will be used to synchronize issue/completion against timeout by later patches, which will also add the comments. Signed-off-by: Tejun Heo--- block/blk-mq.c | 5 + 1 file changed, 5 insertions(+) diff --git a/block/blk-mq.c b/block/blk-mq.c index ddc9261..6741c3e 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -584,11 +584,16 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx) void blk_mq_complete_request(struct request *rq) { struct request_queue *q = rq->q; + struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu); + int srcu_idx; if (unlikely(blk_should_fake_timeout(q))) return; + + hctx_lock(hctx, _idx); if (!blk_mark_rq_complete(rq)) __blk_mq_complete_request(rq); + hctx_unlock(hctx, srcu_idx); } EXPORT_SYMBOL(blk_mq_complete_request); -- 2.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/8] blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual issue/completion path using a complex scheme involving atomic bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence rules. Unfortunately, it contains quite a few holes. There's a complex dancing around REQ_ATOM_STARTED and REQ_ATOM_COMPLETE between issue/completion and timeout paths; however, they don't have a synchronization point across request recycle instances and it isn't clear what the barriers add. blk_mq_check_expired() can easily read STARTED from N-2'th iteration, deadline from N-1'th, blk_mark_rq_complete() against Nth instance. In fact, it's pretty easy to make blk_mq_check_expired() terminate a later instance of a request. If we induce 5 sec delay before time_after_eq() test in blk_mq_check_expired(), shorten the timeout to 2s, and issue back-to-back large IOs, blk-mq starts timing out requests spuriously pretty quickly. Nothing actually timed out. It just made the call on a recycle instance of a request and then terminated a later instance long after the original instance finished. The scenario isn't theoretical either. This patch replaces the broken synchronization mechanism with a RCU and generation number based one. 1. Each request has a u64 generation + state value, which can be updated only by the request owner. Whenever a request becomes in-flight, the generation number gets bumped up too. This provides the basis for the timeout path to distinguish different recycle instances of the request. Also, marking a request in-flight and setting its deadline are protected with a seqcount so that the timeout path can fetch both values coherently. 2. The timeout path fetches the generation, state and deadline. If the verdict is timeout, it records the generation into a dedicated request abortion field and does RCU wait. 3. The completion path is also protected by RCU (from the previous patch) and checks whether the current generation number and state match the abortion field. If so, it skips completion. 4. The timeout path, after RCU wait, scans requests again and terminates the ones whose generation and state still match the ones requested for abortion. By now, the timeout path knows that either the generation number and state changed if it lost the race or the completion will yield to it and can safely timeout the request. While it's more lines of code, it's conceptually simpler, doesn't depend on direct use of subtle memory ordering or coherence, and hopefully doesn't terminate the wrong instance. While this change makes REQ_ATOM_COMPLETE synchronization unnecessary between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't removed yet as it's still used in other places. Future patches will move all state tracking to the new mechanism and remove all bitops in the hot paths. Note that this patch adds a comment explaining a race condition in BLK_EH_RESET_TIMER path. The race has always been there and this patch doesn't change it. It's just documenting the existing race. v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao. - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter. - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter. v3: - Fixed possible extended seqcount / u64_stats_sync read looping spotted by Peter. - MQ_RQ_IDLE was incorrectly being set in complete_request instead of free_request. Fixed. v4: - Rebased on top of hctx_lock() refactoring patch. - Added comment explaining the use of hctx_lock() in completion path. v5: - Added comments requested by Bart. - Note the addition of BLK_EH_RESET_TIMER race condition in the commit message. Signed-off-by: Tejun HeoCc: "jianchao.wang" Cc: Peter Zijlstra Cc: Christoph Hellwig Cc: Bart Van Assche --- block/blk-core.c | 2 + block/blk-mq.c | 229 + block/blk-mq.h | 46 ++ block/blk-timeout.c| 2 +- block/blk.h| 6 -- include/linux/blk-mq.h | 1 + include/linux/blkdev.h | 23 + 7 files changed, 230 insertions(+), 79 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 2e0d041..f843ae4f 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -126,6 +126,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq) rq->start_time = jiffies; set_start_time_ns(rq); rq->part = NULL; + seqcount_init(>gstate_seq); + u64_stats_init(>aborted_gstate_sync); } EXPORT_SYMBOL(blk_rq_init); diff --git a/block/blk-mq.c b/block/blk-mq.c index 6741c3e..052fee5 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -483,6 +483,7 @@ void blk_mq_free_request(struct request *rq) if (blk_rq_rl(rq)) blk_put_rl(blk_rq_rl(rq)); +
[PATCH 5/8] blk-mq: make blk_abort_request() trigger timeout path
With issue/complete and timeout paths now using the generation number and state based synchronization, blk_abort_request() is the only one which depends on REQ_ATOM_COMPLETE for arbitrating completion. There's no reason for blk_abort_request() to be a completely separate path. This patch makes blk_abort_request() piggyback on the timeout path instead of trying to terminate the request directly. This removes the last dependency on REQ_ATOM_COMPLETE in blk-mq. Note that this makes blk_abort_request() asynchronous - it initiates abortion but the actual termination will happen after a short while, even when the caller owns the request. AFAICS, SCSI and ATA should be fine with that and I think mtip32xx and dasd should be safe but not completely sure. It'd be great if people who know the drivers take a look. v2: - Add comment explaining the lack of synchronization around ->deadline update as requested by Bart. Signed-off-by: Tejun HeoCc: Asai Thambi SP Cc: Stefan Haberland Cc: Jan Hoeppner Cc: Bart Van Assche --- block/blk-mq.c | 2 +- block/blk-mq.h | 2 -- block/blk-timeout.c | 13 + 3 files changed, 10 insertions(+), 7 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 51e9704..b419746 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -820,7 +820,7 @@ struct blk_mq_timeout_data { unsigned int nr_expired; }; -void blk_mq_rq_timed_out(struct request *req, bool reserved) +static void blk_mq_rq_timed_out(struct request *req, bool reserved) { const struct blk_mq_ops *ops = req->q->mq_ops; enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER; diff --git a/block/blk-mq.h b/block/blk-mq.h index cf01f6f..6b2d616 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -94,8 +94,6 @@ extern int blk_mq_sysfs_register(struct request_queue *q); extern void blk_mq_sysfs_unregister(struct request_queue *q); extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx); -extern void blk_mq_rq_timed_out(struct request *req, bool reserved); - void blk_mq_release(struct request_queue *q); /** diff --git a/block/blk-timeout.c b/block/blk-timeout.c index 6427be7..4f04cd1 100644 --- a/block/blk-timeout.c +++ b/block/blk-timeout.c @@ -156,12 +156,17 @@ void blk_timeout_work(struct work_struct *work) */ void blk_abort_request(struct request *req) { - if (blk_mark_rq_complete(req)) - return; - if (req->q->mq_ops) { - blk_mq_rq_timed_out(req, false); + /* +* All we need to ensure is that timeout scan takes place +* immediately and that scan sees the new timeout value. +* No need for fancy synchronizations. +*/ + req->deadline = jiffies; + mod_timer(>q->timeout, 0); } else { + if (blk_mark_rq_complete(req)) + return; blk_delete_timer(req); blk_rq_timed_out(req); } -- 2.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/8] blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETE
blk_mq_check_inflight() and blk_mq_poll_hybrid_sleep() test REQ_ATOM_COMPLETE to determine the request state. Both uses are speculative and we can test REQ_ATOM_STARTED and blk_mq_rq_state() for equivalent results. Replace the tests. This will allow removing REQ_ATOM_COMPLETE usages from blk-mq. Signed-off-by: Tejun Heo--- block/blk-mq.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 052fee5..51e9704 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -95,8 +95,7 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx, { struct mq_inflight *mi = priv; - if (test_bit(REQ_ATOM_STARTED, >atomic_flags) && - !test_bit(REQ_ATOM_COMPLETE, >atomic_flags)) { + if (blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT) { /* * index[0] counts the specific partition that was asked * for. index[1] counts the ones that are active on the @@ -3023,7 +3022,8 @@ static bool blk_mq_poll_hybrid_sleep(struct request_queue *q, hrtimer_init_sleeper(, current); do { - if (test_bit(REQ_ATOM_COMPLETE, >atomic_flags)) + if (test_bit(REQ_ATOM_STARTED, >atomic_flags) && + blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT) break; set_current_state(TASK_UNINTERRUPTIBLE); hrtimer_start_expires(, mode); -- 2.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/8] blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq
After the recent updates to use generation number and state based synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE except to avoid firing the same timeout multiple times. Remove all REQ_ATOM_COMPLETE usages and use a new rq_flags flag RQF_MQ_TIMEOUT_EXPIRED to avoid firing the same timeout multiple times. This removes atomic bitops from hot paths too. v2: Removed blk_clear_rq_complete() from blk_mq_rq_timed_out(). v3: Added RQF_MQ_TIMEOUT_EXPIRED flag. Signed-off-by: Tejun HeoCc: "jianchao.wang" --- block/blk-mq.c | 15 +++ block/blk-timeout.c| 1 + include/linux/blkdev.h | 2 ++ 3 files changed, 10 insertions(+), 8 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index b419746..3f297f0 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -634,8 +634,7 @@ void blk_mq_complete_request(struct request *rq) * hctx_lock() covers both issue and completion paths. */ hctx_lock(hctx, _idx); - if (blk_mq_rq_aborted_gstate(rq) != rq->gstate && - !blk_mark_rq_complete(rq)) + if (blk_mq_rq_aborted_gstate(rq) != rq->gstate) __blk_mq_complete_request(rq); hctx_unlock(hctx, srcu_idx); } @@ -685,8 +684,6 @@ void blk_mq_start_request(struct request *rq) preempt_enable(); set_bit(REQ_ATOM_STARTED, >atomic_flags); - if (test_bit(REQ_ATOM_COMPLETE, >atomic_flags)) - clear_bit(REQ_ATOM_COMPLETE, >atomic_flags); if (q->dma_drain_size && blk_rq_bytes(rq)) { /* @@ -837,6 +834,8 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved) if (!test_bit(REQ_ATOM_STARTED, >atomic_flags)) return; + req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED; + if (ops->timeout) ret = ops->timeout(req, reserved); @@ -852,7 +851,6 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved) */ blk_mq_rq_update_aborted_gstate(req, 0); blk_add_timer(req); - blk_clear_rq_complete(req); break; case BLK_EH_NOT_HANDLED: break; @@ -871,7 +869,8 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx, might_sleep(); - if (!test_bit(REQ_ATOM_STARTED, >atomic_flags)) + if ((rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) || + !test_bit(REQ_ATOM_STARTED, >atomic_flags)) return; /* read coherent snapshots of @rq->state_gen and @rq->deadline */ @@ -906,8 +905,8 @@ static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx, * now guaranteed to see @rq->aborted_gstate and yield. If * @rq->aborted_gstate still matches @rq->gstate, @rq is ours. */ - if (READ_ONCE(rq->gstate) == rq->aborted_gstate && - !blk_mark_rq_complete(rq)) + if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) && + READ_ONCE(rq->gstate) == rq->aborted_gstate) blk_mq_rq_timed_out(rq, reserved); } diff --git a/block/blk-timeout.c b/block/blk-timeout.c index 4f04cd1..ebe9996 100644 --- a/block/blk-timeout.c +++ b/block/blk-timeout.c @@ -214,6 +214,7 @@ void blk_add_timer(struct request *req) req->timeout = q->rq_timeout; req->deadline = jiffies + req->timeout; + req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED; /* * Only the non-mq case needs to add the request to a protected list. diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index ae563d0..007a7cf 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -125,6 +125,8 @@ typedef __u32 __bitwise req_flags_t; #define RQF_SPECIAL_PAYLOAD((__force req_flags_t)(1 << 18)) /* The per-zone write lock is held for this request */ #define RQF_ZONE_WRITE_LOCKED ((__force req_flags_t)(1 << 19)) +/* timeout is expired */ +#define RQF_MQ_TIMEOUT_EXPIRED ((__force req_flags_t)(1 << 20)) /* flags that prevent us from merging requests: */ #define RQF_NOMERGE_FLAGS \ -- 2.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/8] blk-mq: remove REQ_ATOM_STARTED
After the recent updates to use generation number and state based synchronization, we can easily replace REQ_ATOM_STARTED usages by adding an extra state to distinguish completed but not yet freed state. Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with blk_mq_rq_state() tests. REQ_ATOM_STARTED no longer has any users left and is removed. Signed-off-by: Tejun Heo--- block/blk-mq-debugfs.c | 4 +--- block/blk-mq.c | 37 - block/blk-mq.h | 1 + block/blk.h| 1 - 4 files changed, 10 insertions(+), 33 deletions(-) diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c index b56a4f3..8adc837 100644 --- a/block/blk-mq-debugfs.c +++ b/block/blk-mq-debugfs.c @@ -271,7 +271,6 @@ static const char *const cmd_flag_name[] = { #define RQF_NAME(name) [ilog2((__force u32)RQF_##name)] = #name static const char *const rqf_name[] = { RQF_NAME(SORTED), - RQF_NAME(STARTED), RQF_NAME(QUEUED), RQF_NAME(SOFTBARRIER), RQF_NAME(FLUSH_SEQ), @@ -295,7 +294,6 @@ static const char *const rqf_name[] = { #define RQAF_NAME(name) [REQ_ATOM_##name] = #name static const char *const rqaf_name[] = { RQAF_NAME(COMPLETE), - RQAF_NAME(STARTED), RQAF_NAME(POLL_SLEPT), }; #undef RQAF_NAME @@ -409,7 +407,7 @@ static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved) const struct show_busy_params *params = data; if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx && - test_bit(REQ_ATOM_STARTED, >atomic_flags)) + blk_mq_rq_state(rq) != MQ_RQ_IDLE) __blk_mq_debugfs_rq_show(params->m, list_entry_rq(>queuelist)); } diff --git a/block/blk-mq.c b/block/blk-mq.c index 3f297f0..e6e74c0 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -483,7 +483,6 @@ void blk_mq_free_request(struct request *rq) blk_put_rl(blk_rq_rl(rq)); blk_mq_rq_update_state(rq, MQ_RQ_IDLE); - clear_bit(REQ_ATOM_STARTED, >atomic_flags); clear_bit(REQ_ATOM_POLL_SLEPT, >atomic_flags); if (rq->tag != -1) blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag); @@ -531,6 +530,7 @@ static void __blk_mq_complete_request(struct request *rq) int cpu; WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT); + blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE); if (rq->internal_tag != -1) blk_mq_sched_completed_request(rq); @@ -642,7 +642,7 @@ EXPORT_SYMBOL(blk_mq_complete_request); int blk_mq_request_started(struct request *rq) { - return test_bit(REQ_ATOM_STARTED, >atomic_flags); + return blk_mq_rq_state(rq) != MQ_RQ_IDLE; } EXPORT_SYMBOL_GPL(blk_mq_request_started); @@ -661,7 +661,6 @@ void blk_mq_start_request(struct request *rq) } WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE); - WARN_ON_ONCE(test_bit(REQ_ATOM_STARTED, >atomic_flags)); /* * Mark @rq in-flight which also advances the generation number, @@ -683,8 +682,6 @@ void blk_mq_start_request(struct request *rq) write_seqcount_end(>gstate_seq); preempt_enable(); - set_bit(REQ_ATOM_STARTED, >atomic_flags); - if (q->dma_drain_size && blk_rq_bytes(rq)) { /* * Make sure space for the drain appears. We know we can do @@ -697,13 +694,9 @@ void blk_mq_start_request(struct request *rq) EXPORT_SYMBOL(blk_mq_start_request); /* - * When we reach here because queue is busy, REQ_ATOM_COMPLETE - * flag isn't set yet, so there may be race with timeout handler, - * but given rq->deadline is just set in .queue_rq() under - * this situation, the race won't be possible in reality because - * rq->timeout should be set as big enough to cover the window - * between blk_mq_start_request() called from .queue_rq() and - * clearing REQ_ATOM_STARTED here. + * When we reach here because queue is busy, it's safe to change the state + * to IDLE without checking @rq->aborted_gstate because we should still be + * holding the RCU read lock and thus protected against timeout. */ static void __blk_mq_requeue_request(struct request *rq) { @@ -715,7 +708,7 @@ static void __blk_mq_requeue_request(struct request *rq) wbt_requeue(q->rq_wb, >issue_stat); blk_mq_sched_requeue_request(rq); - if (test_and_clear_bit(REQ_ATOM_STARTED, >atomic_flags)) { + if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) { blk_mq_rq_update_state(rq, MQ_RQ_IDLE); if (q->dma_drain_size && blk_rq_bytes(rq)) rq->nr_phys_segments--; @@ -822,18 +815,6 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved) const struct blk_mq_ops *ops = req->q->mq_ops; enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER; - /* -* We know that complete is set at this point. If STARTED
Re: [PATCH 4/6] btrfs: make btrfs_free_stale_device() argument optional
On Tue, Jan 09, 2018 at 10:13:12PM +0800, Anand Jain wrote: > From: Anand Jain> > This updates btrfs_free_stale_device() helper function to delete all > unmouted devices, when arg is NULL. > > Signed-off-by: Anand Jain > --- > fs/btrfs/volumes.c | 14 ++ > 1 file changed, 6 insertions(+), 8 deletions(-) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index 7b253da6c0a4..7646f8860096 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -611,9 +611,6 @@ static void btrfs_free_stale_device(struct btrfs_device > *cur_dev) > struct btrfs_fs_devices *fs_devs, *tmp_fs_devs; > struct btrfs_device *dev, *tmp_dev; > > - if (!cur_dev->name) > - return; > - > list_for_each_entry_safe(fs_devs, tmp_fs_devs, _uuids, list) { > > if (fs_devs->opened) > @@ -623,9 +620,7 @@ static void btrfs_free_stale_device(struct btrfs_device > *cur_dev) >_devs->devices, dev_list) { > int not_found; Change this to int not_found = 0; > > - if (dev == cur_dev) > - continue; > - if (!dev->name) > + if (cur_dev && (cur_dev == dev || !dev->name)) > continue; > > /* > @@ -635,8 +630,11 @@ static void btrfs_free_stale_device(struct btrfs_device > *cur_dev) >* either use mapper or non mapper path throughout. >*/ > rcu_read_lock(); > - not_found = strcmp(rcu_str_deref(dev->name), > - rcu_str_deref(cur_dev->name)); > + if (cur_dev) > + not_found = strcmp(rcu_str_deref(dev->name), > + > rcu_str_deref(cur_dev->name)); > + else > + not_found = 0; And drop the else part of this. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/8] blk-mq: protect completion path with RCU
On 1/9/18 9:19 AM, t...@kernel.org wrote: > Hello, Bart. > > On Tue, Jan 09, 2018 at 04:12:40PM +, Bart Van Assche wrote: >> I'm concerned about the additional CPU cycles needed for the new >> blk_mq_map_queue() >> call, although I know this call is cheap. Would the timeout code really get >> that > > So, if that is really a concern, let's cache that mapping instead of > changing synchronization rules for that. > >> much more complicated if the hctx_lock() and hctx_unlock() calls would be >> changed >> into rcu_read_lock() and rcu_read_unlock() calls? Would it be sufficient if >> "if (has_rcu) synchronize_rcu();" would be changed into "synchronize_rcu();" >> in >> blk_mq_timeout_work()? > > Code-wise, it won't be too much extra code but I think diverging the > sync methods between issue and completion paths is more fragile and > likely to invite confusions and mistakes in the future. We have the > normal path (issue) synchronizing against the exception > path (timeout). I think it's best to keep the sync constructs aligned > with that conceptual picture. Guys, the map call is FINE. We do it at least once per request anyway, usually multiple times. If it's too expensive, then THAT is what needs fixing, not adding an arbitrary caching of that mapping. Look at the actual code: return q->queue_hw_ctx[q->mq_map[cpu]]; that's it. It's just a double index. So let's put this thing to rest right now - the map call is perfectly fine, especially since the alternatives are either bloating struct request or making the code less maintainable. Foot -> down. -- Jens Axboe -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/6] btrfs: make btrfs_free_stale_device() to iterate all stales
On Tue, Jan 09, 2018 at 10:13:11PM +0800, Anand Jain wrote: > From: Anand Jain> > Let the list iterator iterate further and find other stale > devices and delete it. This is in preparation to add support > for user land request-able stale devices cleanup. > > Signed-off-by: Anand Jain If we're going to do this then please change the function name to btrfs_free_stale_devices() so it's clear what its intentions are. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/8] blk-mq: protect completion path with RCU
Hello, Bart. On Tue, Jan 09, 2018 at 04:12:40PM +, Bart Van Assche wrote: > I'm concerned about the additional CPU cycles needed for the new > blk_mq_map_queue() > call, although I know this call is cheap. Would the timeout code really get > that So, if that is really a concern, let's cache that mapping instead of changing synchronization rules for that. > much more complicated if the hctx_lock() and hctx_unlock() calls would be > changed > into rcu_read_lock() and rcu_read_unlock() calls? Would it be sufficient if > "if (has_rcu) synchronize_rcu();" would be changed into "synchronize_rcu();" > in > blk_mq_timeout_work()? Code-wise, it won't be too much extra code but I think diverging the sync methods between issue and completion paths is more fragile and likely to invite confusions and mistakes in the future. We have the normal path (issue) synchronizing against the exception path (timeout). I think it's best to keep the sync constructs aligned with that conceptual picture. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/6] btrfs: no need to check for btrfs_fs_devices::seeding
On Tue, Jan 09, 2018 at 10:13:10PM +0800, Anand Jain wrote: > There is no need to check for btrfs_fs_devices::seeding when we > have checked for btrfs_fs_devices::opened, because we can't sprout > without its seed FS being opened. > > Signed-off-by: Anand JainReviewed-by: Josef Bacik Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/6] btrfs: cleanup btrfs_free_stale_device() usage
On Tue, Jan 09, 2018 at 10:13:09PM +0800, Anand Jain wrote: > We call btrfs_free_stale_device() only when we alloc a new > struct btrfs_device (ret=1), so move it closer to where we > alloc the new device. Also drop the comments. > > Signed-off-by: Anand JainReviewed-by: Josef Bacik Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] btrfs: move uuid_mutex closer to exclusivity
On Tue, Jan 09, 2018 at 10:46:25PM +0800, Anand Jain wrote: > move uuid_mutex with in device_list_add(). > > Signed-off-by: Anand JainThis isn't going to work, there's a bunch of places we return errors so this just deadlocks the box. Leave it like it is, it's not hurting anybody leaving it like this. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/5] btrfs: get device pointer from device_list_add()
On Tue, Jan 09, 2018 at 10:46:23PM +0800, Anand Jain wrote: > Instead of pointer to btrfs_fs_devices from device_list_add() its > better to get pointer to btrfs_device, then we have both, pointer > to btrfs_device and btrfs_fs_devices. This is needed in preparation > to add handling reappearing missing device feature. > > Signed-off-by: Anand JainCan we just change device_list_add to return struct btrfs_device * instead, and just do PTR_ERR() if there's a problem? Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/8] blk-mq: protect completion path with RCU
On Mon, 2018-01-08 at 11:15 -0800, Tejun Heo wrote: > Currently, blk-mq protects only the issue path with RCU. This patch > puts the completion path under the same RCU protection. This will be > used to synchronize issue/completion against timeout by later patches, > which will also add the comments. > > Signed-off-by: Tejun Heo> --- > block/blk-mq.c | 5 + > 1 file changed, 5 insertions(+) > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index ddc9261..6741c3e 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -584,11 +584,16 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int > *srcu_idx) > void blk_mq_complete_request(struct request *rq) > { > struct request_queue *q = rq->q; > + struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu); > + int srcu_idx; > > if (unlikely(blk_should_fake_timeout(q))) > return; > + > + hctx_lock(hctx, _idx); > if (!blk_mark_rq_complete(rq)) > __blk_mq_complete_request(rq); > + hctx_unlock(hctx, srcu_idx); > } > EXPORT_SYMBOL(blk_mq_complete_request); Hello Tejun, I'm concerned about the additional CPU cycles needed for the new blk_mq_map_queue() call, although I know this call is cheap. Would the timeout code really get that much more complicated if the hctx_lock() and hctx_unlock() calls would be changed into rcu_read_lock() and rcu_read_unlock() calls? Would it be sufficient if "if (has_rcu) synchronize_rcu();" would be changed into "synchronize_rcu();" in blk_mq_timeout_work()? Thanks, Bart.N�r��yb�X��ǧv�^�){.n�+{�n�߲)w*jg����ݢj/���z�ޖ��2�ޙ&�)ߡ�a�����G���h��j:+v���w��٥
Re: [PATCH 1/5] btrfs: move pr_info into device_list_add
On Tue, Jan 09, 2018 at 10:46:21PM +0800, Anand Jain wrote: > Commit 60999ca4b403 ("btrfs: make device scan less noisy") > adds return value 1 to device_list_add(), so that parent function can > call pr_info only when new device is added. Move the pr_info() part > into device_list_add() so that this function can be kept simple. > > Signed-off-by: Anand JainReviewed-by: Josef Bacik Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] btrfs: Fix slight df misreporting when listing mixed block groups fs
On Tue, Jan 09, 2018 at 06:04:05PM +0200, Nikolay Borisov wrote: > Currently when df (resp. statfs) is executed on a btrfs instance the > free space is calculated as the freespace of the block groups + > any unallocated space on the devices, constituting this filesystem. > There is a catch, however, in that the unallocated space from the > devices has 1 mb subtracted from each device to closely follow btrfs' > allocator behavior. This is all good except that when we have a file > system and the system chunk's allocation profile is single (as is the > default in mixed mode) it occupies the range 0..4mb and thus implicitly > prevents allocation starting in the range 0..1mb. In those cases > additionally subtracting 1mb during statfs makes us misreport the > actual free space and this causes generic/015 to fail. > > Signed-off-by: Nikolay BorisovThis is fine by me, Reviewed-by: Josef Bacik Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/8] blk-mq: make blk_abort_request() trigger timeout path
On Mon, Jan 08, 2018 at 10:10:01PM +, Bart Van Assche wrote: > Other req->deadline writes are protected by preempt_disable(), > write_seqcount_begin(>gstate_seq), write_seqcount_end(>gstate_seq) > and preempt_enable(). I think it's fine that the above req->deadline store > does not have that protection but I also think that that deserves a comment. Will add. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH] btrfs: Fix slight df misreporting when listing mixed block groups fs
Currently when df (resp. statfs) is executed on a btrfs instance the free space is calculated as the freespace of the block groups + any unallocated space on the devices, constituting this filesystem. There is a catch, however, in that the unallocated space from the devices has 1 mb subtracted from each device to closely follow btrfs' allocator behavior. This is all good except that when we have a file system and the system chunk's allocation profile is single (as is the default in mixed mode) it occupies the range 0..4mb and thus implicitly prevents allocation starting in the range 0..1mb. In those cases additionally subtracting 1mb during statfs makes us misreport the actual free space and this causes generic/015 to fail. Signed-off-by: Nikolay Borisov--- Hello, So with this patch and adding sync to generic/015 (which is a different issue) I can make generic/015 pass. In any case I think this behavior is correct though frankly I feel it's not really worth it, in the worst case we are underreporting by at most 1mb per device. fs/btrfs/super.c | 25 + 1 file changed, 13 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index f33a55272deb..a4dfc9701220 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -1925,12 +1925,13 @@ static int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info, struct btrfs_device_info *devices_info; struct btrfs_fs_devices *fs_devices = fs_info->fs_devices; struct btrfs_device *device; - u64 skip_space; u64 type; u64 avail_space; u64 min_stripe_size; int min_stripes = 1, num_stripes = 1; int i = 0, nr_devices; + bool sys_chunk_single = !(btrfs_system_alloc_profile(fs_info) & + ~BTRFS_BLOCK_GROUP_SYSTEM); /* * We aren't under the device list lock, so this is racy-ish, but good @@ -1988,18 +1989,18 @@ static int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info, /* * In order to avoid overwriting the superblock on the drive, * btrfs starts at an offset of at least 1MB when doing chunk -* allocation. +* allocation. There is one notable exception, and that is +* when the System chunk has allocation profile of single. In +* this case it occupies 0..4m and we don't really need to +* subtract 1mb as it's implied */ - skip_space = SZ_1M; - - /* -* we can use the free space in [0, skip_space - 1], subtract -* it from the total. -*/ - if (avail_space && avail_space >= skip_space) - avail_space -= skip_space; - else - avail_space = 0; + if (device->devid != 1 || + (!sys_chunk_single && device->devid == 1)) { + if (avail_space && avail_space >= SZ_1M) + avail_space -= SZ_1M; + else + avail_space = 0; + } if (avail_space < min_stripe_size) continue; -- 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/8] blk-mq: replace timeout synchronization with a RCU and generation based scheme
On Mon, Jan 08, 2018 at 11:29:11PM +, Bart Van Assche wrote: > Does "gstate" perhaps stand for "generation number and state"? If so, please > mention this in one of the above comments. Yeah, will do. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/8] blk-mq: replace timeout synchronization with a RCU and generation based scheme
On Mon, Jan 08, 2018 at 09:06:55PM +, Bart Van Assche wrote: > On Mon, 2018-01-08 at 11:15 -0800, Tejun Heo wrote: > > +static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate) > > +{ > > + unsigned long flags; > > + > > + local_irq_save(flags); > > + u64_stats_update_begin(>aborted_gstate_sync); > > + rq->aborted_gstate = gstate; > > + u64_stats_update_end(>aborted_gstate_sync); > > + local_irq_restore(flags); > > +} > > Please add a comment that explains the purpose of local_irq_save() and > local_irq_restore(). Please also explain why you chose to disable interrupts Will do. > instead of disabling preemption. I think that disabling preemption should be > sufficient since this is the only code that updates rq->aborted_gstate and > since this function is always called from thread context. blk_mq_complete_request() can read it from the irq context. If that happens between update_begin and end, it'll end up looping infinitely. > > @@ -801,6 +840,12 @@ void blk_mq_rq_timed_out(struct request *req, bool > > reserved) > > __blk_mq_complete_request(req); > > break; > > case BLK_EH_RESET_TIMER: > > + /* > > +* As nothing prevents from completion happening while > > +* ->aborted_gstate is set, this may lead to ignored > > +* completions and further spurious timeouts. > > +*/ > > + blk_mq_rq_update_aborted_gstate(req, 0); > > blk_add_timer(req); > > blk_clear_rq_complete(req); > > break; > > Is the race that the comment refers to addressed by one of the later patches? No, but it's not a new race. It has always been there and I suppose doesn't lead to practical problems - the race window is pretty small and the effect isn't critical. I'm just documenting the existing race condition. Will note that in the description. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/8] blk-mq: protect completion path with RCU
On 1/9/18 12:08 AM, Hannes Reinecke wrote: > On 01/08/2018 08:15 PM, Tejun Heo wrote: >> Currently, blk-mq protects only the issue path with RCU. This patch >> puts the completion path under the same RCU protection. This will be >> used to synchronize issue/completion against timeout by later patches, >> which will also add the comments. >> >> Signed-off-by: Tejun Heo>> --- >> block/blk-mq.c | 5 + >> 1 file changed, 5 insertions(+) >> >> diff --git a/block/blk-mq.c b/block/blk-mq.c >> index ddc9261..6741c3e 100644 >> --- a/block/blk-mq.c >> +++ b/block/blk-mq.c >> @@ -584,11 +584,16 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int >> *srcu_idx) >> void blk_mq_complete_request(struct request *rq) >> { >> struct request_queue *q = rq->q; >> +struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu); >> +int srcu_idx; >> >> if (unlikely(blk_should_fake_timeout(q))) >> return; >> + >> +hctx_lock(hctx, _idx); >> if (!blk_mark_rq_complete(rq)) >> __blk_mq_complete_request(rq); >> +hctx_unlock(hctx, srcu_idx); >> } >> EXPORT_SYMBOL(blk_mq_complete_request); >> >> > Hmm. Why do we need to call blk_mq_map_queue() here? > Is there a chance that we end up with a _different_ hctx on completion > than that one used for submission? > If not, why can't we just keep a pointer to the hctx in struct request? Mapping is the right thing to do. We cache the software queue, which allows us to map back to the same hardware queue. There would be no point in storing both, the mapping is very cheap. No point in bloating the request with redundant information. -- Jens Axboe -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 02/19] fs: don't take the i_lock in inode_inc_iversion
On Tue 09-01-18 09:10:42, Jeff Layton wrote: > From: Jeff Layton> > The rationale for taking the i_lock when incrementing this value is > lost in antiquity. The readers of the field don't take it (at least > not universally), so my assumption is that it was only done here to > serialize incrementors. > > If that is indeed the case, then we can drop the i_lock from this > codepath and treat it as a atomic64_t for the purposes of > incrementing it. This allows us to use inode_inc_iversion without > any danger of lock inversion. > > Note that the read side is not fetched atomically with this change. > The assumption here is that that is not a critical issue since the > i_version is not fully synchronized with anything else anyway. > > Signed-off-by: Jeff Layton This changes the memory barrier behavior but IMO it is good enough for an intermediate version. You can add: Reviewed-by: Jan Kara Honza > --- > include/linux/iversion.h | 7 --- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h > index d09cc3a08740..5ad9eaa3a9b0 100644 > --- a/include/linux/iversion.h > +++ b/include/linux/iversion.h > @@ -104,12 +104,13 @@ inode_set_iversion_queried(struct inode *inode, u64 new) > static inline bool > inode_maybe_inc_iversion(struct inode *inode, bool force) > { > - spin_lock(>i_lock); > - inode->i_version++; > - spin_unlock(>i_lock); > + atomic64_t *ivp = (atomic64_t *)>i_version; > + > + atomic64_inc(ivp); > return true; > } > > + > /** > * inode_inc_iversion - forcibly increment i_version > * @inode: inode that needs to be updated > -- > 2.14.3 > -- Jan Kara SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/5] btrfs: move uuid_mutex closer to exclusivity
move uuid_mutex with in device_list_add(). Signed-off-by: Anand Jain--- fs/btrfs/volumes.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index f942e8193862..283417bf3b00 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -740,6 +740,7 @@ static noinline int device_list_add(const char *path, u64 found_transid = btrfs_super_generation(disk_super); u64 devid = btrfs_stack_device_id(_super->dev_item); + mutex_lock(_mutex); fs_devices = find_fsid(disk_super->fsid); if (!fs_devices) { fs_devices = alloc_fs_devices(disk_super->fsid); @@ -847,6 +848,8 @@ static noinline int device_list_add(const char *path, fs_devices->total_devices = btrfs_super_num_devices(disk_super); + mutex_unlock(_mutex); + *device_ret = device; return 0; @@ -1208,9 +1211,7 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, goto error_bdev_put; } - mutex_lock(_mutex); ret = device_list_add(path, disk_super, ); - mutex_unlock(_mutex); *fs_devices_ret = device->fs_devices; -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] generic/015: Change the test filesystem size to 101mb
On 8.01.2018 10:43, Nikolay Borisov wrote: > This test has been failing for btrfs for quite some time, > at least since 4.7. There are 2 implementation details of btrfs that > it exposes: > > 1. Currently btrfs filesystem under 100mb are created in Mixed block > group mode. Freespace accounting for it is not 100% accurate - I've > observed around 100-200kb discrepancy between a newly created filesystem, > then writing a file and deleting it and checking the free space. This > falls within %3 and not %1 as hardcoded in the test. > > 2. BTRFS won't flush it's delayed allocation on file deletion if less > than 32mb are deleted. On such files we need to perform sync (missing > in the test) or wait until time elapses for transaction commit. > > Since mixed mode is somewhat deprecated and btrfs is not really intended > to be used on really small devices let's just adjust the test to > create a 101mb fs, which doesn't use mixed mode and really test > freespace accounting. > > Signed-off-by: Nikolay BorisovAfter further experimentation this test is indeed hitting a very pathological edge case. So when we create a btrfs file system in mixed mode with default raid levels the system chunk occupies 0-4mb. As such we can never allocate a chunk which spans this range so the -1mb in btrfs_calc_avail_data_space is not required in this case. Anyway, I believe this to be a really minor nuisance not worth complicating the code to fix, so I'd suggest this test change be merged. > --- > tests/generic/015 | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/tests/generic/015 b/tests/generic/015 > index 78f2b13..416c4ae 100755 > --- a/tests/generic/015 > +++ b/tests/generic/015 > @@ -53,7 +53,7 @@ _supported_os Linux > _require_scratch > _require_no_large_scratch_dev > > -_scratch_mkfs_sized `expr 50 \* 1024 \* 1024` >/dev/null 2>&1 \ > +_scratch_mkfs_sized `expr 101 \* 1024 \* 1024` >/dev/null 2>&1 \ > || _fail "mkfs failed" > _scratch_mount || _fail "mount failed" > out=$SCRATCH_MNT/fillup.$$ > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/5] device_list_add() peparation to add reappearing missing device
Cleanup of device_list_add(), mainly in preparation to handle reappearing missing device which its next reroll will be sent separately. Anand Jain (5): btrfs: move pr_info into device_list_add btrfs: set the total_devices in device_list_add() btrfs: get device pointer from device_list_add() btrfs: drop devid as device_list_add() arg btrfs: move uuid_mutex closer to exclusivity fs/btrfs/volumes.c | 49 + 1 file changed, 21 insertions(+), 28 deletions(-) -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/5] btrfs: set the total_devices in device_list_add()
There is no other parent for device_list_add() except for btrfs_scan_one_device(), which would set btrfs_fs_devices::total_devices if device_list_add is successful and this can be done with in device_list_add() itself. Signed-off-by: Anand Jain--- fs/btrfs/volumes.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 091eb452f3c8..f9aaf65e27f5 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -844,6 +844,8 @@ static noinline int device_list_add(const char *path, if (!fs_devices->opened) device->generation = found_transid; + fs_devices->total_devices = btrfs_super_num_devices(disk_super); + *fs_devices_ret = fs_devices; return 0; @@ -1185,7 +1187,6 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, struct page *page; int ret; u64 devid; - u64 total_devices; u64 bytenr; /* @@ -1207,12 +1208,9 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, } devid = btrfs_stack_device_id(_super->dev_item); - total_devices = btrfs_super_num_devices(disk_super); mutex_lock(_mutex); ret = device_list_add(path, disk_super, devid, fs_devices_ret); - if (!ret && fs_devices_ret) - (*fs_devices_ret)->total_devices = total_devices; mutex_unlock(_mutex); btrfs_release_disk_super(page); -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/5] btrfs: get device pointer from device_list_add()
Instead of pointer to btrfs_fs_devices from device_list_add() its better to get pointer to btrfs_device, then we have both, pointer to btrfs_device and btrfs_fs_devices. This is needed in preparation to add handling reappearing missing device feature. Signed-off-by: Anand Jain--- fs/btrfs/volumes.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index f9aaf65e27f5..2317ca1b3d83 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -732,7 +732,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices, */ static noinline int device_list_add(const char *path, struct btrfs_super_block *disk_super, - u64 devid, struct btrfs_fs_devices **fs_devices_ret) + u64 devid, struct btrfs_device **device_ret) { struct btrfs_device *device; struct btrfs_fs_devices *fs_devices; @@ -846,7 +846,7 @@ static noinline int device_list_add(const char *path, fs_devices->total_devices = btrfs_super_num_devices(disk_super); - *fs_devices_ret = fs_devices; + *device_ret = device; return 0; } @@ -1183,6 +1183,7 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, struct btrfs_fs_devices **fs_devices_ret) { struct btrfs_super_block *disk_super; + struct btrfs_device *device; struct block_device *bdev; struct page *page; int ret; @@ -1210,9 +1211,11 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, devid = btrfs_stack_device_id(_super->dev_item); mutex_lock(_mutex); - ret = device_list_add(path, disk_super, devid, fs_devices_ret); + ret = device_list_add(path, disk_super, devid, ); mutex_unlock(_mutex); + *fs_devices_ret = device->fs_devices; + btrfs_release_disk_super(page); error_bdev_put: -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/5] btrfs: move pr_info into device_list_add
Commit 60999ca4b403 ("btrfs: make device scan less noisy") adds return value 1 to device_list_add(), so that parent function can call pr_info only when new device is added. Move the pr_info() part into device_list_add() so that this function can be kept simple. Signed-off-by: Anand Jain--- fs/btrfs/volumes.c | 29 +++-- 1 file changed, 11 insertions(+), 18 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 0d7f912f68ff..091eb452f3c8 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -727,8 +727,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices, * Add new device to list of registered devices * * Returns: - * 1 - first time device is seen - * 0 - device already known + * 0 - device already known or newly added * < 0 - error */ static noinline int device_list_add(const char *path, @@ -738,7 +737,6 @@ static noinline int device_list_add(const char *path, struct btrfs_device *device; struct btrfs_fs_devices *fs_devices; struct rcu_string *name; - int ret = 0; u64 found_transid = btrfs_super_generation(disk_super); fs_devices = find_fsid(disk_super->fsid); @@ -778,9 +776,16 @@ static noinline int device_list_add(const char *path, fs_devices->num_devices++; mutex_unlock(_devices->device_list_mutex); - ret = 1; device->fs_devices = fs_devices; btrfs_free_stale_device(path, device); + + if (disk_super->label[0]) + pr_info("BTRFS: device label %s devid %llu transid %llu %s\n", + disk_super->label, devid, found_transid, path); + else + pr_info("BTRFS: device fsid %pU devid %llu transid %llu %s\n", + disk_super->fsid, devid, found_transid, path); + } else if (!device->name || strcmp(device->name->str, path)) { /* * When FS is already mounted. @@ -841,7 +846,7 @@ static noinline int device_list_add(const char *path, *fs_devices_ret = fs_devices; - return ret; + return 0; } static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig) @@ -1180,7 +1185,6 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, struct page *page; int ret; u64 devid; - u64 transid; u64 total_devices; u64 bytenr; @@ -1203,25 +1207,14 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, } devid = btrfs_stack_device_id(_super->dev_item); - transid = btrfs_super_generation(disk_super); total_devices = btrfs_super_num_devices(disk_super); mutex_lock(_mutex); ret = device_list_add(path, disk_super, devid, fs_devices_ret); - if (ret >= 0 && fs_devices_ret) + if (!ret && fs_devices_ret) (*fs_devices_ret)->total_devices = total_devices; mutex_unlock(_mutex); - if (ret > 0) { - if (disk_super->label[0]) - pr_info("BTRFS: device label %s ", disk_super->label); - else - pr_info("BTRFS: device fsid %pU ", disk_super->fsid); - - pr_cont("devid %llu transid %llu %s\n", devid, transid, path); - ret = 0; - } - btrfs_release_disk_super(page); error_bdev_put: -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/5] btrfs: drop devid as device_list_add() arg
As struct btrfs_disk_super is being passed, so it can get devid the same way its parent does. Signed-off-by: Anand Jain--- fs/btrfs/volumes.c | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 2317ca1b3d83..f942e8193862 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -732,12 +732,13 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices, */ static noinline int device_list_add(const char *path, struct btrfs_super_block *disk_super, - u64 devid, struct btrfs_device **device_ret) + struct btrfs_device **device_ret) { struct btrfs_device *device; struct btrfs_fs_devices *fs_devices; struct rcu_string *name; u64 found_transid = btrfs_super_generation(disk_super); + u64 devid = btrfs_stack_device_id(_super->dev_item); fs_devices = find_fsid(disk_super->fsid); if (!fs_devices) { @@ -1187,7 +1188,6 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, struct block_device *bdev; struct page *page; int ret; - u64 devid; u64 bytenr; /* @@ -1208,10 +1208,8 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, goto error_bdev_put; } - devid = btrfs_stack_device_id(_super->dev_item); - mutex_lock(_mutex); - ret = device_list_add(path, disk_super, devid, ); + ret = device_list_add(path, disk_super, ); mutex_unlock(_mutex); *fs_devices_ret = device->fs_devices; -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/6] btrfs: make btrfs_free_stale_device() to iterate all stales
From: Anand JainLet the list iterator iterate further and find other stale devices and delete it. This is in preparation to add support for user land request-able stale devices cleanup. Signed-off-by: Anand Jain --- fs/btrfs/volumes.c | 20 +--- 1 file changed, 9 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index ab6ccad08ef7..7b253da6c0a4 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -608,19 +608,20 @@ static void pending_bios_fn(struct btrfs_work *work) static void btrfs_free_stale_device(struct btrfs_device *cur_dev) { - struct btrfs_fs_devices *fs_devs; - struct btrfs_device *dev; + struct btrfs_fs_devices *fs_devs, *tmp_fs_devs; + struct btrfs_device *dev, *tmp_dev; if (!cur_dev->name) return; - list_for_each_entry(fs_devs, _uuids, list) { - int del = 1; + list_for_each_entry_safe(fs_devs, tmp_fs_devs, _uuids, list) { if (fs_devs->opened) continue; - list_for_each_entry(dev, _devs->devices, dev_list) { + list_for_each_entry_safe(dev, tmp_dev, +_devs->devices, dev_list) { + int not_found; if (dev == cur_dev) continue; @@ -634,14 +635,12 @@ static void btrfs_free_stale_device(struct btrfs_device *cur_dev) * either use mapper or non mapper path throughout. */ rcu_read_lock(); - del = strcmp(rcu_str_deref(dev->name), + not_found = strcmp(rcu_str_deref(dev->name), rcu_str_deref(cur_dev->name)); rcu_read_unlock(); - if (!del) - break; - } + if (not_found) + continue; - if (!del) { /* delete the stale device */ if (fs_devs->num_devices == 1) { btrfs_sysfs_remove_fsid(fs_devs); @@ -652,7 +651,6 @@ static void btrfs_free_stale_device(struct btrfs_device *cur_dev) list_del(>dev_list); free_device(dev); } - break; } } } -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 00/19] fs: rework and optimize i_version handling in filesystems
From: Jeff Laytonv5: - don't corrupt refcounts stashed in i_version of ext4 xattr inodes - add raw variants of inc and cmp functions, and have nfs use them v4: - fix SB_LAZYTIME handling in generic_update_time - add memory barriers to patch to convert i_version field to atomic64_t v3: - move i_version handling functions to new header file - document that the kernel-managed i_version implementation will appear to increase over time - fix inode_cmp_iversion to handle wraparound correctly v2: - xfs should use inode_peek_iversion instead of inode_peek_iversion_raw - rework file_update_time patch - don't dirty inode when only S_ATIME is set and SB_LAZYTIME is enabled - better comments and documentation I think this is now approaching merge readiness. Special thanks to Jan Kara and Dave Chinner who helped me tighten up the memory barriers in the final patch, and Krzysztof Kozlowski for help in tracking down a set of bugs in the NFS client patch. tl;dr: I think we can greatly reduce the cost of the inode->i_version counter, by exploiting the fact that we don't need to increment it if no one is looking at it. We can also clean up the code to prepare to eventually expose this value via statx(). Note that this set relies on a few patches that are in other trees. The full stack that I've been testing with is here: https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/log/?h=iversion The inode->i_version field is supposed to be a value that changes whenever there is any data or metadata change to the inode. Some filesystems use it internally to detect directory changes during readdir. knfsd will use it if the filesystem has MS_I_VERSION set. IMA will also use it to optimize away some remeasurement if it's available. NFS and AFS just use it to store an opaque change attribute from the server. Only btrfs, ext4, and xfs increment it for data changes. Because of this, these filesystems must log the inode to disk whenever the i_version counter changes. That has a non-zero performance impact, especially on write-heavy workloads, because we end up dirtying the inode metadata on every write, not just when the times change. It turns out though that none of these users of i_version require that it change on every change to the file. The only real requirement is that it be different if something changed since the last time we queried for it. If we keep track of when something queries the value, we can avoid bumping the counter and an on-disk update when nothing else has changed if no one has queried it since it was last incremented. This patchset changes the code to only bump the i_version counter when it's strictly necessary, or when we're updating the inode metadata anyway (e.g. when times change). It takes the approach of converting the existing accessors of i_version to use a new API, while leaving the underlying implementation mostly the same. The last patch then converts the existing implementation to keep track of whether the value has been queried since it was last incremented. It then uses that to avoid incrementing the counter when it can. With this, we reduce inode metadata updates across all 3 filesystems down to roughly the frequency of the timestamp granularity, particularly when it's not being queried (the vastly common case). I can see measurable performance gains on xfs and ext4 with iversion enabled, when streaming small (4k) I/Os. btrfs shows some slight gain in testing, but not quite the magnitude that xfs and ext4 show. I'm not sure why yet and would appreciate some input from btrfs folks. My goal is to get this into linux-next fairly soon. If it shows no problems then we can look at merging it for 4.16, or 4.17 if all of the prequisite patches are not yet merged. Jeff Layton (19): fs: new API for handling inode->i_version fs: don't take the i_lock in inode_inc_iversion fat: convert to new i_version API affs: convert to new i_version API afs: convert to new i_version API btrfs: convert to new i_version API exofs: switch to new i_version API ext2: convert to new i_version API ext4: convert to new i_version API nfs: convert to new i_version API nfsd: convert to new i_version API ocfs2: convert to new i_version API ufs: use new i_version API xfs: convert to new i_version API IMA: switch IMA over to new i_version API fs: only set S_VERSION when updating times if necessary xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing btrfs: only dirty the inode in btrfs_update_time if something was changed fs: handle inode->i_version more efficiently fs/affs/amigaffs.c| 5 +- fs/affs/dir.c | 5 +- fs/affs/super.c | 3 +- fs/afs/fsclient.c | 3 +- fs/afs/inode.c| 5 +- fs/btrfs/delayed-inode.c | 7 +- fs/btrfs/file.c | 1 + fs/btrfs/inode.c | 12 +-
[PATCH v5 01/19] fs: new API for handling inode->i_version
From: Jeff LaytonAdd a documentation blob that explains what the i_version field is, how it is expected to work, and how it is currently implemented by various filesystems. We already have inode_inc_iversion. Add several other functions for manipulating and accessing the i_version counter. For now, the implementation is trivial and basically works the way that all of the open-coded i_version accesses work today. Future patches will convert existing users of i_version to use the new API, and then convert the backend implementation to do things more efficiently. Signed-off-by: Jeff Layton Reviewed-by: Jan Kara --- fs/btrfs/file.c | 1 + fs/btrfs/inode.c | 1 + fs/btrfs/ioctl.c | 1 + fs/btrfs/xattr.c | 1 + fs/ext4/inode.c | 1 + fs/ext4/namei.c | 1 + fs/inode.c | 1 + include/linux/fs.h | 15 --- include/linux/iversion.h | 236 +++ 9 files changed, 243 insertions(+), 15 deletions(-) create mode 100644 include/linux/iversion.h diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index eb1bac7c8553..c95d7b2efefb 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -31,6 +31,7 @@ #include #include #include +#include #include "ctree.h" #include "disk-io.h" #include "transaction.h" diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index e1a7f3cb5be9..27f008b33fc1 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -43,6 +43,7 @@ #include #include #include +#include #include "ctree.h" #include "disk-io.h" #include "transaction.h" diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 2ef8acaac688..aa452c9e2eff 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -43,6 +43,7 @@ #include #include #include +#include #include "ctree.h" #include "disk-io.h" #include "transaction.h" diff --git a/fs/btrfs/xattr.c b/fs/btrfs/xattr.c index 2c7e53f9ff1b..5258c1714830 100644 --- a/fs/btrfs/xattr.c +++ b/fs/btrfs/xattr.c @@ -23,6 +23,7 @@ #include #include #include +#include #include "ctree.h" #include "btrfs_inode.h" #include "transaction.h" diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 7df2c5644e59..fa5d8bc52d2d 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -39,6 +39,7 @@ #include #include #include +#include #include "ext4_jbd2.h" #include "xattr.h" diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 798b3ac680db..bcf0dff517be 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -34,6 +34,7 @@ #include #include #include +#include #include "ext4.h" #include "ext4_jbd2.h" diff --git a/fs/inode.c b/fs/inode.c index 03102d6ef044..19e72f500f71 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -18,6 +18,7 @@ #include /* for inode_has_buffers */ #include #include +#include #include #include "internal.h" diff --git a/include/linux/fs.h b/include/linux/fs.h index 511fbaabf624..76382c24e9d0 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2036,21 +2036,6 @@ static inline void inode_dec_link_count(struct inode *inode) mark_inode_dirty(inode); } -/** - * inode_inc_iversion - increments i_version - * @inode: inode that need to be updated - * - * Every time the inode is modified, the i_version field will be incremented. - * The filesystem has to be mounted with i_version flag - */ - -static inline void inode_inc_iversion(struct inode *inode) -{ - spin_lock(>i_lock); - inode->i_version++; - spin_unlock(>i_lock); -} - enum file_time_flags { S_ATIME = 1, S_MTIME = 2, diff --git a/include/linux/iversion.h b/include/linux/iversion.h new file mode 100644 index ..d09cc3a08740 --- /dev/null +++ b/include/linux/iversion.h @@ -0,0 +1,236 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_IVERSION_H +#define _LINUX_IVERSION_H + +#include + +/* + * The change attribute (i_version) is mandated by NFSv4 and is mostly for + * knfsd, but is also used for other purposes (e.g. IMA). The i_version must + * appear different to observers if there was a change to the inode's data or + * metadata since it was last queried. + * + * Observers see the i_version as a 64-bit number that never changes. If it + * remains the same since it was last checked, then nothing has changed in the + * inode. If it's different then something has changed. Observers cannot infer + * anything about the nature or magnitude of the changes from the value, only + * that the inode has changed in some fashion. + * + * Not all filesystems properly implement the i_version counter. Subsystems that + * want to use i_version field on an inode should first check whether the + * filesystem sets the SB_I_VERSION flag (usually via the IS_I_VERSION macro). + * + * Those that set SB_I_VERSION will automatically have their i_version counter + * incremented on writes to normal files. If the SB_I_VERSION is not set, then + * the VFS will not
[PATCH v5 02/19] fs: don't take the i_lock in inode_inc_iversion
From: Jeff LaytonThe rationale for taking the i_lock when incrementing this value is lost in antiquity. The readers of the field don't take it (at least not universally), so my assumption is that it was only done here to serialize incrementors. If that is indeed the case, then we can drop the i_lock from this codepath and treat it as a atomic64_t for the purposes of incrementing it. This allows us to use inode_inc_iversion without any danger of lock inversion. Note that the read side is not fetched atomically with this change. The assumption here is that that is not a critical issue since the i_version is not fully synchronized with anything else anyway. Signed-off-by: Jeff Layton --- include/linux/iversion.h | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/include/linux/iversion.h b/include/linux/iversion.h index d09cc3a08740..5ad9eaa3a9b0 100644 --- a/include/linux/iversion.h +++ b/include/linux/iversion.h @@ -104,12 +104,13 @@ inode_set_iversion_queried(struct inode *inode, u64 new) static inline bool inode_maybe_inc_iversion(struct inode *inode, bool force) { - spin_lock(>i_lock); - inode->i_version++; - spin_unlock(>i_lock); + atomic64_t *ivp = (atomic64_t *)>i_version; + + atomic64_inc(ivp); return true; } + /** * inode_inc_iversion - forcibly increment i_version * @inode: inode that needs to be updated -- 2.14.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 03/19] fat: convert to new i_version API
From: Jeff LaytonSigned-off-by: Jeff Layton --- fs/fat/dir.c | 3 ++- fs/fat/inode.c | 9 + fs/fat/namei_msdos.c | 7 --- fs/fat/namei_vfat.c | 22 +++--- 4 files changed, 22 insertions(+), 19 deletions(-) diff --git a/fs/fat/dir.c b/fs/fat/dir.c index b833ffeee1e1..8e100c3bf72c 100644 --- a/fs/fat/dir.c +++ b/fs/fat/dir.c @@ -16,6 +16,7 @@ #include #include #include +#include #include "fat.h" /* @@ -1055,7 +1056,7 @@ int fat_remove_entries(struct inode *dir, struct fat_slot_info *sinfo) brelse(bh); if (err) return err; - dir->i_version++; + inode_inc_iversion(dir); if (nr_slots) { /* diff --git a/fs/fat/inode.c b/fs/fat/inode.c index 20a0a89eaca5..ffbbf0520d9e 100644 --- a/fs/fat/inode.c +++ b/fs/fat/inode.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "fat.h" #ifndef CONFIG_FAT_DEFAULT_IOCHARSET @@ -507,7 +508,7 @@ int fat_fill_inode(struct inode *inode, struct msdos_dir_entry *de) MSDOS_I(inode)->i_pos = 0; inode->i_uid = sbi->options.fs_uid; inode->i_gid = sbi->options.fs_gid; - inode->i_version++; + inode_inc_iversion(inode); inode->i_generation = get_seconds(); if ((de->attr & ATTR_DIR) && !IS_FREE(de->name)) { @@ -590,7 +591,7 @@ struct inode *fat_build_inode(struct super_block *sb, goto out; } inode->i_ino = iunique(sb, MSDOS_ROOT_INO); - inode->i_version = 1; + inode_set_iversion(inode, 1); err = fat_fill_inode(inode, de); if (err) { iput(inode); @@ -1377,7 +1378,7 @@ static int fat_read_root(struct inode *inode) MSDOS_I(inode)->i_pos = MSDOS_ROOT_INO; inode->i_uid = sbi->options.fs_uid; inode->i_gid = sbi->options.fs_gid; - inode->i_version++; + inode_inc_iversion(inode); inode->i_generation = 0; inode->i_mode = fat_make_mode(sbi, ATTR_DIR, S_IRWXUGO); inode->i_op = sbi->dir_ops; @@ -1828,7 +1829,7 @@ int fat_fill_super(struct super_block *sb, void *data, int silent, int isvfat, if (!root_inode) goto out_fail; root_inode->i_ino = MSDOS_ROOT_INO; - root_inode->i_version = 1; + inode_set_iversion(root_inode, 1); error = fat_read_root(root_inode); if (error < 0) { iput(root_inode); diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c index d24d2758a363..582ca731a6c9 100644 --- a/fs/fat/namei_msdos.c +++ b/fs/fat/namei_msdos.c @@ -7,6 +7,7 @@ */ #include +#include #include "fat.h" /* Characters that are undesirable in an MS-DOS file name */ @@ -480,7 +481,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned char *old_name, } else mark_inode_dirty(old_inode); - old_dir->i_version++; + inode_inc_iversion(old_dir); old_dir->i_ctime = old_dir->i_mtime = current_time(old_dir); if (IS_DIRSYNC(old_dir)) (void)fat_sync_inode(old_dir); @@ -508,7 +509,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned char *old_name, goto out; new_i_pos = sinfo.i_pos; } - new_dir->i_version++; + inode_inc_iversion(new_dir); fat_detach(old_inode); fat_attach(old_inode, new_i_pos); @@ -540,7 +541,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned char *old_name, old_sinfo.bh = NULL; if (err) goto error_dotdot; - old_dir->i_version++; + inode_inc_iversion(old_dir); old_dir->i_ctime = old_dir->i_mtime = ts; if (IS_DIRSYNC(old_dir)) (void)fat_sync_inode(old_dir); diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c index 02c03a3a..cefea792cde8 100644 --- a/fs/fat/namei_vfat.c +++ b/fs/fat/namei_vfat.c @@ -20,7 +20,7 @@ #include #include #include - +#include #include "fat.h" static inline unsigned long vfat_d_version(struct dentry *dentry) @@ -46,7 +46,7 @@ static int vfat_revalidate_shortname(struct dentry *dentry) { int ret = 1; spin_lock(>d_lock); - if (vfat_d_version(dentry) != d_inode(dentry->d_parent)->i_version) + if (inode_cmp_iversion(d_inode(dentry->d_parent), vfat_d_version(dentry))) ret = 0; spin_unlock(>d_lock); return ret; @@ -759,7 +759,7 @@ static struct dentry *vfat_lookup(struct inode *dir, struct dentry *dentry, out: mutex_unlock(_SB(sb)->s_lock); if (!inode) - vfat_d_version_set(dentry, dir->i_version); + vfat_d_version_set(dentry, inode_query_iversion(dir)); return d_splice_alias(inode, dentry); error:
[PATCH v5 04/19] affs: convert to new i_version API
From: Jeff LaytonSigned-off-by: Jeff Layton --- fs/affs/amigaffs.c | 5 +++-- fs/affs/dir.c | 5 +++-- fs/affs/super.c| 3 ++- 3 files changed, 8 insertions(+), 5 deletions(-) diff --git a/fs/affs/amigaffs.c b/fs/affs/amigaffs.c index 0f0e6925e97d..14a6c1b90c9f 100644 --- a/fs/affs/amigaffs.c +++ b/fs/affs/amigaffs.c @@ -10,6 +10,7 @@ */ #include +#include #include "affs.h" /* @@ -60,7 +61,7 @@ affs_insert_hash(struct inode *dir, struct buffer_head *bh) affs_brelse(dir_bh); dir->i_mtime = dir->i_ctime = current_time(dir); - dir->i_version++; + inode_inc_iversion(dir); mark_inode_dirty(dir); return 0; @@ -114,7 +115,7 @@ affs_remove_hash(struct inode *dir, struct buffer_head *rem_bh) affs_brelse(bh); dir->i_mtime = dir->i_ctime = current_time(dir); - dir->i_version++; + inode_inc_iversion(dir); mark_inode_dirty(dir); return retval; diff --git a/fs/affs/dir.c b/fs/affs/dir.c index a105e77df2c1..d180b46453cf 100644 --- a/fs/affs/dir.c +++ b/fs/affs/dir.c @@ -14,6 +14,7 @@ * */ +#include #include "affs.h" static int affs_readdir(struct file *, struct dir_context *); @@ -80,7 +81,7 @@ affs_readdir(struct file *file, struct dir_context *ctx) * we can jump directly to where we left off. */ ino = (u32)(long)file->private_data; - if (ino && file->f_version == inode->i_version) { + if (ino && inode_cmp_iversion(inode, file->f_version) == 0) { pr_debug("readdir() left off=%d\n", ino); goto inside; } @@ -130,7 +131,7 @@ affs_readdir(struct file *file, struct dir_context *ctx) } while (ino); } done: - file->f_version = inode->i_version; + file->f_version = inode_query_iversion(inode); file->private_data = (void *)(long)ino; affs_brelse(fh_bh); diff --git a/fs/affs/super.c b/fs/affs/super.c index 1117e36134cc..e602619aed9d 100644 --- a/fs/affs/super.c +++ b/fs/affs/super.c @@ -21,6 +21,7 @@ #include #include #include +#include #include "affs.h" static int affs_statfs(struct dentry *dentry, struct kstatfs *buf); @@ -102,7 +103,7 @@ static struct inode *affs_alloc_inode(struct super_block *sb) if (!i) return NULL; - i->vfs_inode.i_version = 1; + inode_set_iversion(>vfs_inode, 1); i->i_lc = NULL; i->i_ext_bh = NULL; i->i_pa_cnt = 0; -- 2.14.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 05/19] afs: convert to new i_version API
From: Jeff LaytonFor AFS, it's generally treated as an opaque value, so we use the *_raw variants of the API here. Note that AFS has quite a different definition for this counter. AFS only increments it on changes to the data to the data in regular files and contents of the directories. Inode metadata changes do not result in a version increment. We'll need to reconcile that somehow if we ever want to present this to userspace via statx. Signed-off-by: Jeff Layton --- fs/afs/fsclient.c | 3 ++- fs/afs/inode.c| 5 +++-- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c index b90ef39ae914..88ec38c2d83c 100644 --- a/fs/afs/fsclient.c +++ b/fs/afs/fsclient.c @@ -13,6 +13,7 @@ #include #include #include +#include #include "internal.h" #include "afs_fs.h" @@ -124,7 +125,7 @@ static void xdr_decode_AFSFetchStatus(const __be32 **_bp, vnode->vfs_inode.i_ctime.tv_sec = status->mtime_client; vnode->vfs_inode.i_mtime= vnode->vfs_inode.i_ctime; vnode->vfs_inode.i_atime= vnode->vfs_inode.i_ctime; - vnode->vfs_inode.i_version = data_version; + inode_set_iversion_raw(>vfs_inode, data_version); } expected_version = status->data_version; diff --git a/fs/afs/inode.c b/fs/afs/inode.c index 3415eb7484f6..dcd2e08d6cdb 100644 --- a/fs/afs/inode.c +++ b/fs/afs/inode.c @@ -21,6 +21,7 @@ #include #include #include +#include #include "internal.h" static const struct inode_operations afs_symlink_inode_operations = { @@ -89,7 +90,7 @@ static int afs_inode_map_status(struct afs_vnode *vnode, struct key *key) inode->i_atime = inode->i_mtime = inode->i_ctime; inode->i_blocks = 0; inode->i_generation = vnode->fid.unique; - inode->i_version= vnode->status.data_version; + inode_set_iversion_raw(inode, vnode->status.data_version); inode->i_mapping->a_ops = _fs_aops; read_sequnlock_excl(>cb_lock); @@ -218,7 +219,7 @@ struct inode *afs_iget_autocell(struct inode *dir, const char *dev_name, inode->i_ctime.tv_nsec = 0; inode->i_atime = inode->i_mtime = inode->i_ctime; inode->i_blocks = 0; - inode->i_version= 0; + inode_set_iversion_raw(inode, 0); inode->i_generation = 0; set_bit(AFS_VNODE_PSEUDODIR, >flags); -- 2.14.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 06/19] btrfs: convert to new i_version API
From: Jeff LaytonSigned-off-by: Jeff Layton Acked-by: David Sterba --- fs/btrfs/delayed-inode.c | 7 +-- fs/btrfs/inode.c | 6 -- fs/btrfs/tree-log.c | 4 +++- 3 files changed, 12 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index 5d73f79ded8b..6a246ae2bcb2 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -18,6 +18,7 @@ */ #include +#include #include "delayed-inode.h" #include "disk-io.h" #include "transaction.h" @@ -1700,7 +1701,8 @@ static void fill_stack_inode_item(struct btrfs_trans_handle *trans, btrfs_set_stack_inode_nbytes(inode_item, inode_get_bytes(inode)); btrfs_set_stack_inode_generation(inode_item, BTRFS_I(inode)->generation); - btrfs_set_stack_inode_sequence(inode_item, inode->i_version); + btrfs_set_stack_inode_sequence(inode_item, + inode_peek_iversion(inode)); btrfs_set_stack_inode_transid(inode_item, trans->transid); btrfs_set_stack_inode_rdev(inode_item, inode->i_rdev); btrfs_set_stack_inode_flags(inode_item, BTRFS_I(inode)->flags); @@ -1754,7 +1756,8 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev) BTRFS_I(inode)->generation = btrfs_stack_inode_generation(inode_item); BTRFS_I(inode)->last_trans = btrfs_stack_inode_transid(inode_item); - inode->i_version = btrfs_stack_inode_sequence(inode_item); + inode_set_iversion_queried(inode, + btrfs_stack_inode_sequence(inode_item)); inode->i_rdev = 0; *rdev = btrfs_stack_inode_rdev(inode_item); BTRFS_I(inode)->flags = btrfs_stack_inode_flags(inode_item); diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 27f008b33fc1..ac8692849a81 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -3778,7 +3778,8 @@ static int btrfs_read_locked_inode(struct inode *inode) BTRFS_I(inode)->generation = btrfs_inode_generation(leaf, inode_item); BTRFS_I(inode)->last_trans = btrfs_inode_transid(leaf, inode_item); - inode->i_version = btrfs_inode_sequence(leaf, inode_item); + inode_set_iversion_queried(inode, + btrfs_inode_sequence(leaf, inode_item)); inode->i_generation = BTRFS_I(inode)->generation; inode->i_rdev = 0; rdev = btrfs_inode_rdev(leaf, inode_item); @@ -3946,7 +3947,8 @@ static void fill_inode_item(struct btrfs_trans_handle *trans, ); btrfs_set_token_inode_generation(leaf, item, BTRFS_I(inode)->generation, ); - btrfs_set_token_inode_sequence(leaf, item, inode->i_version, ); + btrfs_set_token_inode_sequence(leaf, item, inode_peek_iversion(inode), + ); btrfs_set_token_inode_transid(leaf, item, trans->transid, ); btrfs_set_token_inode_rdev(leaf, item, inode->i_rdev, ); btrfs_set_token_inode_flags(leaf, item, BTRFS_I(inode)->flags, ); diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 7bf9b31561db..1b7d92075c1f 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "tree-log.h" #include "disk-io.h" #include "locking.h" @@ -3609,7 +3610,8 @@ static void fill_inode_item(struct btrfs_trans_handle *trans, btrfs_set_token_inode_nbytes(leaf, item, inode_get_bytes(inode), ); - btrfs_set_token_inode_sequence(leaf, item, inode->i_version, ); + btrfs_set_token_inode_sequence(leaf, item, + inode_peek_iversion(inode), ); btrfs_set_token_inode_transid(leaf, item, trans->transid, ); btrfs_set_token_inode_rdev(leaf, item, inode->i_rdev, ); btrfs_set_token_inode_flags(leaf, item, BTRFS_I(inode)->flags, ); -- 2.14.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 08/19] ext2: convert to new i_version API
From: Jeff LaytonSigned-off-by: Jeff Layton Reviewed-by: Jan Kara --- fs/ext2/dir.c | 9 + fs/ext2/super.c | 5 +++-- 2 files changed, 8 insertions(+), 6 deletions(-) diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c index 987647986f47..4111085a129f 100644 --- a/fs/ext2/dir.c +++ b/fs/ext2/dir.c @@ -26,6 +26,7 @@ #include #include #include +#include typedef struct ext2_dir_entry_2 ext2_dirent; @@ -92,7 +93,7 @@ static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len) struct inode *dir = mapping->host; int err = 0; - dir->i_version++; + inode_inc_iversion(dir); block_write_end(NULL, mapping, pos, len, len, page, NULL); if (pos+len > dir->i_size) { @@ -293,7 +294,7 @@ ext2_readdir(struct file *file, struct dir_context *ctx) unsigned long npages = dir_pages(inode); unsigned chunk_mask = ~(ext2_chunk_size(inode)-1); unsigned char *types = NULL; - int need_revalidate = file->f_version != inode->i_version; + bool need_revalidate = inode_cmp_iversion(inode, file->f_version); if (pos > inode->i_size - EXT2_DIR_REC_LEN(1)) return 0; @@ -319,8 +320,8 @@ ext2_readdir(struct file *file, struct dir_context *ctx) offset = ext2_validate_entry(kaddr, offset, chunk_mask); ctx->pos = (n< f_version = inode->i_version; - need_revalidate = 0; + file->f_version = inode_query_iversion(inode); + need_revalidate = false; } de = (ext2_dirent *)(kaddr+offset); limit = kaddr + ext2_last_byte(inode, n) - EXT2_DIR_REC_LEN(1); diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 7646818ab266..554c98b8a93a 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -33,6 +33,7 @@ #include #include #include +#include #include "ext2.h" #include "xattr.h" #include "acl.h" @@ -184,7 +185,7 @@ static struct inode *ext2_alloc_inode(struct super_block *sb) if (!ei) return NULL; ei->i_block_alloc_info = NULL; - ei->vfs_inode.i_version = 1; + inode_set_iversion(>vfs_inode, 1); #ifdef CONFIG_QUOTA memset(>i_dquot, 0, sizeof(ei->i_dquot)); #endif @@ -1569,7 +1570,7 @@ static ssize_t ext2_quota_write(struct super_block *sb, int type, return err; if (inode->i_size < off+len-towrite) i_size_write(inode, off+len-towrite); - inode->i_version++; + inode_inc_iversion(inode); inode->i_mtime = inode->i_ctime = current_time(inode); mark_inode_dirty(inode); return len - towrite; -- 2.14.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 07/19] exofs: switch to new i_version API
From: Jeff LaytonSigned-off-by: Jeff Layton --- fs/exofs/dir.c | 9 + fs/exofs/super.c | 3 ++- 2 files changed, 7 insertions(+), 5 deletions(-) diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c index 98233a97b7b8..c5a53fcc43ea 100644 --- a/fs/exofs/dir.c +++ b/fs/exofs/dir.c @@ -31,6 +31,7 @@ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA */ +#include #include "exofs.h" static inline unsigned exofs_chunk_size(struct inode *inode) @@ -60,7 +61,7 @@ static int exofs_commit_chunk(struct page *page, loff_t pos, unsigned len) struct inode *dir = mapping->host; int err = 0; - dir->i_version++; + inode_inc_iversion(dir); if (!PageUptodate(page)) SetPageUptodate(page); @@ -241,7 +242,7 @@ exofs_readdir(struct file *file, struct dir_context *ctx) unsigned long n = pos >> PAGE_SHIFT; unsigned long npages = dir_pages(inode); unsigned chunk_mask = ~(exofs_chunk_size(inode)-1); - int need_revalidate = (file->f_version != inode->i_version); + bool need_revalidate = inode_cmp_iversion(inode, file->f_version); if (pos > inode->i_size - EXOFS_DIR_REC_LEN(1)) return 0; @@ -264,8 +265,8 @@ exofs_readdir(struct file *file, struct dir_context *ctx) chunk_mask); ctx->pos = (n< f_version = inode->i_version; - need_revalidate = 0; + file->f_version = inode_query_iversion(inode); + need_revalidate = false; } de = (struct exofs_dir_entry *)(kaddr + offset); limit = kaddr + exofs_last_byte(inode, n) - diff --git a/fs/exofs/super.c b/fs/exofs/super.c index 819624cfc8da..7e244093c0e5 100644 --- a/fs/exofs/super.c +++ b/fs/exofs/super.c @@ -38,6 +38,7 @@ #include #include #include +#include #include "exofs.h" @@ -159,7 +160,7 @@ static struct inode *exofs_alloc_inode(struct super_block *sb) if (!oi) return NULL; - oi->vfs_inode.i_version = 1; + inode_set_iversion(>vfs_inode, 1); return >vfs_inode; } -- 2.14.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/6] btrfs: make btrfs_free_stale_device() to match the path
From: Anand JainThe btrfs_free_stale_device() is updated to match for the given device path and delete it. (It searchs for only unmounted list of devices.) Signed-off-by: Anand Jain --- fs/btrfs/volumes.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 7646f8860096..f87d30aa0e18 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -606,7 +606,7 @@ static void pending_bios_fn(struct btrfs_work *work) } -static void btrfs_free_stale_device(struct btrfs_device *cur_dev) +static void btrfs_free_stale_device(struct btrfs_device *cur_dev, char *path) { struct btrfs_fs_devices *fs_devs, *tmp_fs_devs; struct btrfs_device *dev, *tmp_dev; @@ -633,6 +633,8 @@ static void btrfs_free_stale_device(struct btrfs_device *cur_dev) if (cur_dev) not_found = strcmp(rcu_str_deref(dev->name), rcu_str_deref(cur_dev->name)); + else if (path) + not_found = strcmp(rcu_str_deref(dev->name), path); else not_found = 0; rcu_read_unlock(); @@ -776,7 +778,7 @@ static noinline int device_list_add(const char *path, ret = 1; device->fs_devices = fs_devices; - btrfs_free_stale_device(device); + btrfs_free_stale_device(device, NULL); } else if (!device->name || strcmp(device->name->str, path)) { /* * When FS is already mounted. -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/6] btrfs: make btrfs_free_stale_device() argument optional
From: Anand JainThis updates btrfs_free_stale_device() helper function to delete all unmouted devices, when arg is NULL. Signed-off-by: Anand Jain --- fs/btrfs/volumes.c | 14 ++ 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 7b253da6c0a4..7646f8860096 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -611,9 +611,6 @@ static void btrfs_free_stale_device(struct btrfs_device *cur_dev) struct btrfs_fs_devices *fs_devs, *tmp_fs_devs; struct btrfs_device *dev, *tmp_dev; - if (!cur_dev->name) - return; - list_for_each_entry_safe(fs_devs, tmp_fs_devs, _uuids, list) { if (fs_devs->opened) @@ -623,9 +620,7 @@ static void btrfs_free_stale_device(struct btrfs_device *cur_dev) _devs->devices, dev_list) { int not_found; - if (dev == cur_dev) - continue; - if (!dev->name) + if (cur_dev && (cur_dev == dev || !dev->name)) continue; /* @@ -635,8 +630,11 @@ static void btrfs_free_stale_device(struct btrfs_device *cur_dev) * either use mapper or non mapper path throughout. */ rcu_read_lock(); - not_found = strcmp(rcu_str_deref(dev->name), - rcu_str_deref(cur_dev->name)); + if (cur_dev) + not_found = strcmp(rcu_str_deref(dev->name), + rcu_str_deref(cur_dev->name)); + else + not_found = 0; rcu_read_unlock(); if (not_found) continue; -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/6] btrfs: cleanup btrfs_free_stale_device() usage
We call btrfs_free_stale_device() only when we alloc a new struct btrfs_device (ret=1), so move it closer to where we alloc the new device. Also drop the comments. Signed-off-by: Anand Jain--- fs/btrfs/volumes.c | 8 +--- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 9a04245003ab..7e04cd509ab9 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -782,6 +782,7 @@ static noinline int device_list_add(const char *path, ret = 1; device->fs_devices = fs_devices; + btrfs_free_stale_device(device); } else if (!device->name || strcmp(device->name->str, path)) { /* * When FS is already mounted. @@ -840,13 +841,6 @@ static noinline int device_list_add(const char *path, if (!fs_devices->opened) device->generation = found_transid; - /* -* if there is new btrfs on an already registered device, -* then remove the stale device entry. -*/ - if (ret > 0) - btrfs_free_stale_device(device); - *fs_devices_ret = fs_devices; return ret; -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 09/19] ext4: convert to new i_version API
From: Jeff LaytonSigned-off-by: Jeff Layton Acked-by: Theodore Ts'o --- fs/ext4/dir.c| 9 + fs/ext4/inline.c | 7 --- fs/ext4/inode.c | 12 fs/ext4/ioctl.c | 3 ++- fs/ext4/namei.c | 4 ++-- fs/ext4/super.c | 3 ++- fs/ext4/xattr.c | 5 +++-- 7 files changed, 26 insertions(+), 17 deletions(-) diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c index d5babc9f222b..afda0a0499ce 100644 --- a/fs/ext4/dir.c +++ b/fs/ext4/dir.c @@ -25,6 +25,7 @@ #include #include #include +#include #include "ext4.h" #include "xattr.h" @@ -208,7 +209,7 @@ static int ext4_readdir(struct file *file, struct dir_context *ctx) * readdir(2), then we might be pointing to an invalid * dirent right now. Scan from the start of the block * to make sure. */ - if (file->f_version != inode->i_version) { + if (inode_cmp_iversion(inode, file->f_version)) { for (i = 0; i < sb->s_blocksize && i < offset; ) { de = (struct ext4_dir_entry_2 *) (bh->b_data + i); @@ -227,7 +228,7 @@ static int ext4_readdir(struct file *file, struct dir_context *ctx) offset = i; ctx->pos = (ctx->pos & ~(sb->s_blocksize - 1)) | offset; - file->f_version = inode->i_version; + file->f_version = inode_query_iversion(inode); } while (ctx->pos < inode->i_size @@ -568,10 +569,10 @@ static int ext4_dx_readdir(struct file *file, struct dir_context *ctx) * cached entries. */ if ((!info->curr_node) || - (file->f_version != inode->i_version)) { + inode_cmp_iversion(inode, file->f_version)) { info->curr_node = NULL; free_rb_tree_fname(>root); - file->f_version = inode->i_version; + file->f_version = inode_query_iversion(inode); ret = ext4_htree_fill_tree(file, info->curr_hash, info->curr_minor_hash, >next_hash); diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c index 1367553c43bb..a8b987b71173 100644 --- a/fs/ext4/inline.c +++ b/fs/ext4/inline.c @@ -14,6 +14,7 @@ #include #include +#include #include "ext4_jbd2.h" #include "ext4.h" @@ -1042,7 +1043,7 @@ static int ext4_add_dirent_to_inline(handle_t *handle, */ dir->i_mtime = dir->i_ctime = current_time(dir); ext4_update_dx_flag(dir); - dir->i_version++; + inode_inc_iversion(dir); return 1; } @@ -1494,7 +1495,7 @@ int ext4_read_inline_dir(struct file *file, * dirent right now. Scan from the start of the inline * dir to make sure. */ - if (file->f_version != inode->i_version) { + if (inode_cmp_iversion(inode, file->f_version)) { for (i = 0; i < extra_size && i < offset;) { /* * "." is with offset 0 and @@ -1526,7 +1527,7 @@ int ext4_read_inline_dir(struct file *file, } offset = i; ctx->pos = offset; - file->f_version = inode->i_version; + file->f_version = inode_query_iversion(inode); } while (ctx->pos < extra_size) { diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index fa5d8bc52d2d..1b0d54b372f2 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4874,12 +4874,14 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino) EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode); if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) { - inode->i_version = le32_to_cpu(raw_inode->i_disk_version); + u64 ivers = le32_to_cpu(raw_inode->i_disk_version); + if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) { if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi)) - inode->i_version |= + ivers |= (__u64)(le32_to_cpu(raw_inode->i_version_hi)) << 32; } + inode_set_iversion_queried(inode, ivers); } ret = 0; @@ -5165,11 +5167,13 @@ static int ext4_do_update_inode(handle_t *handle, } if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) { - raw_inode->i_disk_version = cpu_to_le32(inode->i_version); + u64 ivers = inode_peek_iversion(inode); + + raw_inode->i_disk_version = cpu_to_le32(ivers); if (ei->i_extra_isize) { if (EXT4_FITS_IN_INODE(raw_inode, ei,
[PATCH v5 10/19] nfs: convert to new i_version API
From: Jeff LaytonFor NFS, we just use the "raw" API since the i_version is mostly managed by the server. The exception there is when the client holds a write delegation, but we only need to bump it once there anyway to handle CB_GETATTR. Tested-by: Krzysztof Kozlowski Signed-off-by: Jeff Layton --- fs/nfs/delegation.c| 3 ++- fs/nfs/fscache-index.c | 5 +++-- fs/nfs/inode.c | 18 +- fs/nfs/nfs4proc.c | 10 ++ fs/nfs/nfstrace.h | 5 +++-- fs/nfs/write.c | 8 +++- 6 files changed, 26 insertions(+), 23 deletions(-) diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c index ade44ca0c66c..d8b47624fee2 100644 --- a/fs/nfs/delegation.c +++ b/fs/nfs/delegation.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include @@ -347,7 +348,7 @@ int nfs_inode_set_delegation(struct inode *inode, struct rpc_cred *cred, struct nfs4_stateid_copy(>stateid, >delegation); delegation->type = res->delegation_type; delegation->pagemod_limit = res->pagemod_limit; - delegation->change_attr = inode->i_version; + delegation->change_attr = inode_peek_iversion_raw(inode); delegation->cred = get_rpccred(cred); delegation->inode = inode; delegation->flags = 1< vfs_inode.i_ctime; if (NFS_SERVER(>vfs_inode)->nfs_client->rpc_ops->version == 4) - auxdata.change_attr = nfsi->vfs_inode.i_version; + auxdata.change_attr = inode_peek_iversion_raw(>vfs_inode); if (bufmax > sizeof(auxdata)) bufmax = sizeof(auxdata); @@ -243,7 +244,7 @@ enum fscache_checkaux nfs_fscache_inode_check_aux(void *cookie_netfs_data, auxdata.ctime = nfsi->vfs_inode.i_ctime; if (NFS_SERVER(>vfs_inode)->nfs_client->rpc_ops->version == 4) - auxdata.change_attr = nfsi->vfs_inode.i_version; + auxdata.change_attr = inode_peek_iversion_raw(>vfs_inode); if (memcmp(data, , datalen) != 0) return FSCACHE_CHECKAUX_OBSOLETE; diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c index b992d2382ffa..93552c482992 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -38,8 +38,8 @@ #include #include #include - #include +#include #include "nfs4_fs.h" #include "callback.h" @@ -483,7 +483,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr, st memset(>i_atime, 0, sizeof(inode->i_atime)); memset(>i_mtime, 0, sizeof(inode->i_mtime)); memset(>i_ctime, 0, sizeof(inode->i_ctime)); - inode->i_version = 0; + inode_set_iversion_raw(inode, 0); inode->i_size = 0; clear_nlink(inode); inode->i_uid = make_kuid(_user_ns, -2); @@ -508,7 +508,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr, st else if (nfs_server_capable(inode, NFS_CAP_CTIME)) nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR); if (fattr->valid & NFS_ATTR_FATTR_CHANGE) - inode->i_version = fattr->change_attr; + inode_set_iversion_raw(inode, fattr->change_attr); else nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR | NFS_INO_REVAL_PAGECACHE); @@ -1289,8 +1289,8 @@ static unsigned long nfs_wcc_update_inode(struct inode *inode, struct nfs_fattr if ((fattr->valid & NFS_ATTR_FATTR_PRECHANGE) && (fattr->valid & NFS_ATTR_FATTR_CHANGE) - && inode->i_version == fattr->pre_change_attr) { - inode->i_version = fattr->change_attr; + && !inode_cmp_iversion_raw(inode, fattr->pre_change_attr)) { + inode_set_iversion_raw(inode, fattr->change_attr); if (S_ISDIR(inode->i_mode)) nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA); ret |= NFS_INO_INVALID_ATTR; @@ -1348,7 +1348,7 @@ static int nfs_check_inode_attributes(struct inode *inode, struct nfs_fattr *fat if (!nfs_file_has_buffered_writers(nfsi)) { /* Verify a few of the more important attributes */ - if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && inode->i_version != fattr->change_attr) + if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && inode_cmp_iversion_raw(inode,
[PATCH v5 13/19] ufs: use new i_version API
From: Jeff LaytonSigned-off-by: Jeff Layton --- fs/ufs/dir.c | 9 + fs/ufs/inode.c | 3 ++- fs/ufs/super.c | 3 ++- 3 files changed, 9 insertions(+), 6 deletions(-) diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c index 2edc1755b7c5..50dfce000864 100644 --- a/fs/ufs/dir.c +++ b/fs/ufs/dir.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "ufs_fs.h" #include "ufs.h" @@ -47,7 +48,7 @@ static int ufs_commit_chunk(struct page *page, loff_t pos, unsigned len) struct inode *dir = mapping->host; int err = 0; - dir->i_version++; + inode_inc_iversion(dir); block_write_end(NULL, mapping, pos, len, len, page, NULL); if (pos+len > dir->i_size) { i_size_write(dir, pos+len); @@ -428,7 +429,7 @@ ufs_readdir(struct file *file, struct dir_context *ctx) unsigned long n = pos >> PAGE_SHIFT; unsigned long npages = dir_pages(inode); unsigned chunk_mask = ~(UFS_SB(sb)->s_uspi->s_dirblksize - 1); - int need_revalidate = file->f_version != inode->i_version; + bool need_revalidate = inode_cmp_iversion(inode, file->f_version); unsigned flags = UFS_SB(sb)->s_flags; UFSD("BEGIN\n"); @@ -455,8 +456,8 @@ ufs_readdir(struct file *file, struct dir_context *ctx) offset = ufs_validate_entry(sb, kaddr, offset, chunk_mask); ctx->pos = (n< f_version = inode->i_version; - need_revalidate = 0; + file->f_version = inode_query_iversion(inode); + need_revalidate = false; } de = (struct ufs_dir_entry *)(kaddr+offset); limit = kaddr + ufs_last_byte(inode, n) - UFS_DIR_REC_LEN(1); diff --git a/fs/ufs/inode.c b/fs/ufs/inode.c index afb601c0dda0..c843ec858cf7 100644 --- a/fs/ufs/inode.c +++ b/fs/ufs/inode.c @@ -36,6 +36,7 @@ #include #include #include +#include #include "ufs_fs.h" #include "ufs.h" @@ -693,7 +694,7 @@ struct inode *ufs_iget(struct super_block *sb, unsigned long ino) if (err) goto bad_inode; - inode->i_version++; + inode_inc_iversion(inode); ufsi->i_lastfrag = (inode->i_size + uspi->s_fsize - 1) >> uspi->s_fshift; ufsi->i_dir_start_lookup = 0; diff --git a/fs/ufs/super.c b/fs/ufs/super.c index 4d497e9c6883..b6ba80e05bff 100644 --- a/fs/ufs/super.c +++ b/fs/ufs/super.c @@ -88,6 +88,7 @@ #include #include #include +#include #include "ufs_fs.h" #include "ufs.h" @@ -1440,7 +1441,7 @@ static struct inode *ufs_alloc_inode(struct super_block *sb) if (!ei) return NULL; - ei->vfs_inode.i_version = 1; + inode_set_iversion(>vfs_inode, 1); seqlock_init(>meta_lock); mutex_init(>truncate_mutex); return >vfs_inode; -- 2.14.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Recommendations for balancing as part of regular maintenance?
On 9 Jan 2018, at 7:23, Austin S. Hemmelgarn wrote: On 2018-01-08 16:43, Tom Worster wrote: Given the documentation and the usage stats, I did not know what options to use with balance. I spent some time reading and researching and trying to understand the filters and how they should relate to my situation. Eventually I abandoned that effort and ran balance without options. Hopefully the explanation I gave on the filters in the Github issue helped some. In this case though, it sounds like running a filtered balance probably wouldn't have saved you much over a full one. Yes, it helped. Hugo's email helped too. I now have a better understanding of balance filters. At the same time, Hugo's email and others in this thread added to my belief that I'm now managing systems with a filesystem I'm unqualified to use. Tom -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 11/19] nfsd: convert to new i_version API
From: Jeff LaytonMostly just making sure we use the "get" wrappers so we know when it is being fetched for later use. Signed-off-by: Jeff Layton --- fs/nfsd/nfsfh.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h index 43f31cf49bae..b8444189223b 100644 --- a/fs/nfsd/nfsfh.h +++ b/fs/nfsd/nfsfh.h @@ -11,6 +11,7 @@ #include #include #include +#include static inline __u32 ino_t_to_u32(ino_t ino) { @@ -259,7 +260,7 @@ static inline u64 nfsd4_change_attribute(struct inode *inode) chattr = inode->i_ctime.tv_sec; chattr <<= 30; chattr += inode->i_ctime.tv_nsec; - chattr += inode->i_version; + chattr += inode_query_iversion(inode); return chattr; } -- 2.14.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 12/19] ocfs2: convert to new i_version API
From: Jeff LaytonSigned-off-by: Jeff Layton Reviewed-by: Jan Kara --- fs/ocfs2/dir.c | 15 --- fs/ocfs2/inode.c| 3 ++- fs/ocfs2/namei.c| 3 ++- fs/ocfs2/quota_global.c | 3 ++- 4 files changed, 14 insertions(+), 10 deletions(-) diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c index febe6312ceff..32f9c72dff17 100644 --- a/fs/ocfs2/dir.c +++ b/fs/ocfs2/dir.c @@ -42,6 +42,7 @@ #include #include #include +#include #include @@ -1174,7 +1175,7 @@ static int __ocfs2_delete_entry(handle_t *handle, struct inode *dir, le16_add_cpu(>rec_len, le16_to_cpu(de->rec_len)); de->inode = 0; - dir->i_version++; + inode_inc_iversion(dir); ocfs2_journal_dirty(handle, bh); goto bail; } @@ -1729,7 +1730,7 @@ int __ocfs2_add_entry(handle_t *handle, if (ocfs2_dir_indexed(dir)) ocfs2_recalc_free_list(dir, handle, lookup); - dir->i_version++; + inode_inc_iversion(dir); ocfs2_journal_dirty(handle, insert_bh); retval = 0; goto bail; @@ -1775,7 +1776,7 @@ static int ocfs2_dir_foreach_blk_id(struct inode *inode, * readdir(2), then we might be pointing to an invalid * dirent right now. Scan from the start of the block * to make sure. */ - if (*f_version != inode->i_version) { + if (inode_cmp_iversion(inode, *f_version)) { for (i = 0; i < i_size_read(inode) && i < offset; ) { de = (struct ocfs2_dir_entry *) (data->id_data + i); @@ -1791,7 +1792,7 @@ static int ocfs2_dir_foreach_blk_id(struct inode *inode, i += le16_to_cpu(de->rec_len); } ctx->pos = offset = i; - *f_version = inode->i_version; + *f_version = inode_query_iversion(inode); } de = (struct ocfs2_dir_entry *) (data->id_data + ctx->pos); @@ -1869,7 +1870,7 @@ static int ocfs2_dir_foreach_blk_el(struct inode *inode, * readdir(2), then we might be pointing to an invalid * dirent right now. Scan from the start of the block * to make sure. */ - if (*f_version != inode->i_version) { + if (inode_cmp_iversion(inode, *f_version)) { for (i = 0; i < sb->s_blocksize && i < offset; ) { de = (struct ocfs2_dir_entry *) (bh->b_data + i); /* It's too expensive to do a full @@ -1886,7 +1887,7 @@ static int ocfs2_dir_foreach_blk_el(struct inode *inode, offset = i; ctx->pos = (ctx->pos & ~(sb->s_blocksize - 1)) | offset; - *f_version = inode->i_version; + *f_version = inode_query_iversion(inode); } while (ctx->pos < i_size_read(inode) @@ -1940,7 +1941,7 @@ static int ocfs2_dir_foreach_blk(struct inode *inode, u64 *f_version, */ int ocfs2_dir_foreach(struct inode *inode, struct dir_context *ctx) { - u64 version = inode->i_version; + u64 version = inode_query_iversion(inode); ocfs2_dir_foreach_blk(inode, , ctx, true); return 0; } diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c index 1a1e0078ab38..d51b80edd972 100644 --- a/fs/ocfs2/inode.c +++ b/fs/ocfs2/inode.c @@ -28,6 +28,7 @@ #include #include #include +#include #include @@ -302,7 +303,7 @@ void ocfs2_populate_inode(struct inode *inode, struct ocfs2_dinode *fe, OCFS2_I(inode)->ip_attr = le32_to_cpu(fe->i_attr); OCFS2_I(inode)->ip_dyn_features = le16_to_cpu(fe->i_dyn_features); - inode->i_version = 1; + inode_set_iversion(inode, 1); inode->i_generation = le32_to_cpu(fe->i_generation); inode->i_rdev = huge_decode_dev(le64_to_cpu(fe->id1.dev1.i_rdev)); inode->i_mode = le16_to_cpu(fe->i_mode); diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c index 3b0a10d9b36f..c801eddc4bf3 100644 --- a/fs/ocfs2/namei.c +++ b/fs/ocfs2/namei.c @@ -41,6 +41,7 @@ #include #include #include +#include #include @@ -1520,7 +1521,7 @@ static int ocfs2_rename(struct inode *old_dir, mlog_errno(status); goto bail; } - new_dir->i_version++; + inode_inc_iversion(new_dir); if (S_ISDIR(new_inode->i_mode))
[PATCH v5 15/19] IMA: switch IMA over to new i_version API
From: Jeff LaytonSigned-off-by: Jeff Layton --- security/integrity/ima/ima_api.c | 3 ++- security/integrity/ima/ima_main.c | 3 ++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/security/integrity/ima/ima_api.c b/security/integrity/ima/ima_api.c index c7e8db0ea4c0..c6ae42266270 100644 --- a/security/integrity/ima/ima_api.c +++ b/security/integrity/ima/ima_api.c @@ -18,6 +18,7 @@ #include #include #include +#include #include "ima.h" @@ -215,7 +216,7 @@ int ima_collect_measurement(struct integrity_iint_cache *iint, * which do not support i_version, support is limited to an initial * measurement/appraisal/audit. */ - i_version = file_inode(file)->i_version; + i_version = inode_query_iversion(inode); hash.hdr.algo = algo; /* Initialize hash digest to 0's in case of failure */ diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c index 50b8254d..06a70c5a2329 100644 --- a/security/integrity/ima/ima_main.c +++ b/security/integrity/ima/ima_main.c @@ -24,6 +24,7 @@ #include #include #include +#include #include "ima.h" @@ -128,7 +129,7 @@ static void ima_check_last_writer(struct integrity_iint_cache *iint, inode_lock(inode); if (atomic_read(>i_writecount) == 1) { if (!IS_I_VERSION(inode) || - (iint->version != inode->i_version) || + inode_cmp_iversion(inode, iint->version) || (iint->flags & IMA_NEW_FILE)) { iint->flags &= ~(IMA_DONE_MASK | IMA_NEW_FILE); iint->measured_pcrs = 0; -- 2.14.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 14/19] xfs: convert to new i_version API
From: Jeff LaytonSigned-off-by: Jeff Layton Acked-by: Darrick J. Wong --- fs/xfs/libxfs/xfs_inode_buf.c | 7 +-- fs/xfs/xfs_icache.c | 5 +++-- fs/xfs/xfs_inode.c| 3 ++- fs/xfs/xfs_inode_item.c | 3 ++- fs/xfs/xfs_trans_inode.c | 4 +++- 5 files changed, 15 insertions(+), 7 deletions(-) diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c index 6b7989038d75..b9c0bf80669c 100644 --- a/fs/xfs/libxfs/xfs_inode_buf.c +++ b/fs/xfs/libxfs/xfs_inode_buf.c @@ -32,6 +32,8 @@ #include "xfs_ialloc.h" #include "xfs_dir2.h" +#include + /* * Check that none of the inode's in the buffer have a next * unlinked field of 0. @@ -264,7 +266,8 @@ xfs_inode_from_disk( to->di_flags= be16_to_cpu(from->di_flags); if (to->di_version == 3) { - inode->i_version = be64_to_cpu(from->di_changecount); + inode_set_iversion_queried(inode, + be64_to_cpu(from->di_changecount)); to->di_crtime.t_sec = be32_to_cpu(from->di_crtime.t_sec); to->di_crtime.t_nsec = be32_to_cpu(from->di_crtime.t_nsec); to->di_flags2 = be64_to_cpu(from->di_flags2); @@ -314,7 +317,7 @@ xfs_inode_to_disk( to->di_flags = cpu_to_be16(from->di_flags); if (from->di_version == 3) { - to->di_changecount = cpu_to_be64(inode->i_version); + to->di_changecount = cpu_to_be64(inode_peek_iversion(inode)); to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec); to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec); to->di_flags2 = cpu_to_be64(from->di_flags2); diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 43005fbe8b1e..4c315adb05e6 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -37,6 +37,7 @@ #include #include +#include /* * Allocate and initialise an xfs_inode. @@ -293,14 +294,14 @@ xfs_reinit_inode( int error; uint32_tnlink = inode->i_nlink; uint32_tgeneration = inode->i_generation; - uint64_tversion = inode->i_version; + uint64_tversion = inode_peek_iversion(inode); umode_t mode = inode->i_mode; error = inode_init_always(mp->m_super, inode); set_nlink(inode, nlink); inode->i_generation = generation; - inode->i_version = version; + inode_set_iversion_queried(inode, version); inode->i_mode = mode; return error; } diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 801274126648..dfc5e60d8af3 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -16,6 +16,7 @@ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA */ #include +#include #include "xfs.h" #include "xfs_fs.h" @@ -833,7 +834,7 @@ xfs_ialloc( ip->i_d.di_flags = 0; if (ip->i_d.di_version == 3) { - inode->i_version = 1; + inode_set_iversion(inode, 1); ip->i_d.di_flags2 = 0; ip->i_d.di_cowextsize = 0; ip->i_d.di_crtime.t_sec = (int32_t)tv.tv_sec; diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index 6ee5c3bf19ad..7571abf5dfb3 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -30,6 +30,7 @@ #include "xfs_buf_item.h" #include "xfs_log.h" +#include kmem_zone_t*xfs_ili_zone; /* inode log item zone */ @@ -354,7 +355,7 @@ xfs_inode_to_log_dinode( to->di_next_unlinked = NULLAGINO; if (from->di_version == 3) { - to->di_changecount = inode->i_version; + to->di_changecount = inode_peek_iversion(inode); to->di_crtime.t_sec = from->di_crtime.t_sec; to->di_crtime.t_nsec = from->di_crtime.t_nsec; to->di_flags2 = from->di_flags2; diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c index daa7615497f9..225544327c4f 100644 --- a/fs/xfs/xfs_trans_inode.c +++ b/fs/xfs/xfs_trans_inode.c @@ -28,6 +28,8 @@ #include "xfs_inode_item.h" #include "xfs_trace.h" +#include + /* * Add a locked inode to the transaction. * @@ -117,7 +119,7 @@ xfs_trans_log_inode( */ if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) && IS_I_VERSION(VFS_I(ip))) { - VFS_I(ip)->i_version++; + inode_inc_iversion(VFS_I(ip)); flags |= XFS_ILOG_CORE; } -- 2.14.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 17/19] xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing
From: Jeff LaytonIf XFS_ILOG_CORE is already set then go ahead and increment it. Signed-off-by: Jeff Layton Acked-by: Darrick J. Wong --- fs/xfs/xfs_trans_inode.c | 14 -- 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c index 225544327c4f..4a89da4b6fe7 100644 --- a/fs/xfs/xfs_trans_inode.c +++ b/fs/xfs/xfs_trans_inode.c @@ -112,15 +112,17 @@ xfs_trans_log_inode( /* * First time we log the inode in a transaction, bump the inode change -* counter if it is configured for this to occur. We don't use -* inode_inc_version() because there is no need for extra locking around -* i_version as we already hold the inode locked exclusively for -* metadata modification. +* counter if it is configured for this to occur. While we have the +* inode locked exclusively for metadata modification, we can usually +* avoid setting XFS_ILOG_CORE if no one has queried the value since +* the last time it was incremented. If we have XFS_ILOG_CORE already +* set however, then go ahead and bump the i_version counter +* unconditionally. */ if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) && IS_I_VERSION(VFS_I(ip))) { - inode_inc_iversion(VFS_I(ip)); - flags |= XFS_ILOG_CORE; + if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE)) + flags |= XFS_ILOG_CORE; } tp->t_flags |= XFS_TRANS_DIRTY; -- 2.14.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html