[PATCH v2 1/2] btrfs-progs: fi-usage: Fix wrong RAID10 used and unallocated space

2018-01-09 Thread Qu Wenruo
[BUG]
For a very basic RAID10 with 4 disks, "fi usage" and "fi show" are
outputting conflicting result:
--
 # btrfs fi show /mnt/btrfs/
Label: none  uuid: 6d0229db-28d1-4696-ac20-e828cc45dc40
Total devices 4 FS bytes used 1.12MiB
devid1 size 5.00GiB used 2.01GiB path /dev/mapper/data-disk1
devid2 size 5.00GiB used 2.01GiB path /dev/mapper/data-disk2
devid3 size 5.00GiB used 2.01GiB path /dev/mapper/data-disk3
devid4 size 5.00GiB used 2.01GiB path /dev/mapper/data-disk4

Here the unallocated space for disk4 should be a little less than 3G.
(5.00 Gib - 2.01GiB)

 # btrfs fi usage  /mnt/btrfs/
Overall:
Device size:  20.00GiB
Device allocated:  8.03GiB
Device unallocated:   11.97GiB
Device missing:  0.00B
Used:  2.25MiB
Free (estimated):  7.98GiB  (min: 7.98GiB)
Data ratio:   2.00
Metadata ratio:   2.00
Global reserve:   16.00MiB  (used: 0.00B)

Data,RAID10: Size:2.00GiB, Used:1.00MiB
   ...
   /dev/mapper/data-disk4512.00MiB

Metadata,RAID10: Size:2.00GiB, Used:112.00KiB
   ...
   /dev/mapper/data-disk4512.00MiB

System,RAID10: Size:16.00MiB, Used:16.00KiB
   ...
   /dev/mapper/data-disk4  4.00MiB

Unallocated:
   ...
   /dev/mapper/data-disk4  4.00GiB

While fi usage shows we still have 4.00GiB unallocated space.
--

[CAUSE]
calc_chunk_size() is used to convert chunk size to device extent size,
which is used to get the per-device data/meta/sys used space.

However, for RAID10 we just divide the chunk size with num_stripes,
without taking sub stripes into account.

Resulting data/meta/sys usage in RAID10 halved.

[FIX]
Take the missing sub stripes into consideration.

Reported-by: Adam Bahe 
Signed-off-by: Qu Wenruo 
---
v2:
  Nothing
---
 cmds-fi-usage.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cmds-fi-usage.c b/cmds-fi-usage.c
index 0b0e47fee194..0af4bdbc9370 100644
--- a/cmds-fi-usage.c
+++ b/cmds-fi-usage.c
@@ -664,7 +664,7 @@ static u64 calc_chunk_size(struct chunk_info *ci)
else if (ci->type & BTRFS_BLOCK_GROUP_RAID6)
return ci->size / (ci->num_stripes -2);
else if (ci->type & BTRFS_BLOCK_GROUP_RAID10)
-   return ci->size / ci->num_stripes;
+   return ci->size / (ci->num_stripes / 2);
return ci->size;
 }
 
-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/2] btrfs-progs: misc-test: Add test case to check if fi usage and show report consistent unallocated space

2018-01-09 Thread Qu Wenruo
Signed-off-by: Qu Wenruo 
---
v2:
  Don't create unnecessary temporary dir. Use variable directly to save
  output
  Use run_check_mount_test_dev() macro with TEST_DEV set manually.
---
 tests/misc-tests/028-fi-usage-cross-check/test.sh | 44 +++
 1 file changed, 44 insertions(+)
 create mode 100755 tests/misc-tests/028-fi-usage-cross-check/test.sh

diff --git a/tests/misc-tests/028-fi-usage-cross-check/test.sh 
b/tests/misc-tests/028-fi-usage-cross-check/test.sh
new file mode 100755
index ..0343888b3413
--- /dev/null
+++ b/tests/misc-tests/028-fi-usage-cross-check/test.sh
@@ -0,0 +1,44 @@
+#!/bin/bash
+# check fi-usage is output correct unallocated space for RAID10
+
+source "$TOP/tests/common"
+
+check_prereq mkfs.btrfs
+check_prereq btrfs
+
+setup_root_helper
+
+setup_loopdevs 4
+prepare_loopdevs
+dev1=${loopdevs[1]}
+
+run_check $SUDO_HELPER "$TOP/mkfs.btrfs" -d raid10 -m raid10 -f ${loopdevs[@]}
+TEST_DEV=$dev1
+run_check_mount_test_dev
+
+output=$(run_check_stdout $SUDO_HELPER "$TOP/btrfs" filesystem show --raw 
"$TEST_MNT" |\
+grep "$dev1")
+if [ -z "$output" ]; then
+   _fail "failed to get correct fi show output"
+fi
+
+dev1_total=$(echo $output | awk '{print $4}')
+dev1_used=$(echo $output | awk '{print $6}')
+dev1_fi_show_unallocated=$(( $dev1_total - $dev1_used))
+
+output=$(run_check_stdout $SUDO_HELPER "$TOP/btrfs" filesystem usage --raw 
"$TEST_MNT" |\
+grep "$dev1" | tail -n1)
+if [ -z "$output" ]; then
+   _fail "failed to get correct fi usage output"
+fi
+
+dev1_fi_usage_unallocated=$(echo $output | awk '{print $2}')
+echo "fi usage unallocated for devid1 is $dev1_fi_usage_unallocated" >> 
"$RESULTS"
+echo "fi show unallocated for devid1 is $dev1_fi_show_unallocated" >> 
"$RESULTS"
+
+if [ $dev1_fi_show_unallocated -ne $dev1_fi_usage_unallocated ]; then
+   _fail "fi usage unallocated mismatch with fi show"
+fi
+
+run_check $SUDO_HELPER umount "$TEST_MNT"
+cleanup_loopdevs
-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs-progs: Fix wrong chunk allocation type in btrfs_reserve_extent

2018-01-09 Thread Qu Wenruo
btrfs_reserve_extent() will try to allocate new chunk if there is not
enough space (mostly meta space).

However when it tries to allocate new meta chunk, it will always try to
allocate SINGLE meta chunk, and if the fs is using other profile, it
will cause dead allocation which can't be really used.

And when trying to allocate new meta chunk for fs trees, it uses the
@num_bytes which is normally times larger than metadata space.

Fix it by using proper meta profile and calculate based on file extent
other than @num_bytes if we are allocating meta space for file extents.

And all such chunk allocation condition is not explained at all.
Add needed (although maybe a little overkilled) comment for it.

Signed-off-by: Qu Wenruo 
---
Needed prerequisite is pushed to github:
https://github.com/adam900710/btrfs-progs/tree/chunk_alloc_fixes

And unfortunately, the condition to trigger chunk allocation is pretty
hard to trigger.
For normal mkfs case, we will always have enough space, while for normal
usage, kernel will ensure there is enough space (global reservation) for
extent tree update, which is normally larger than the 2M headroom used in 
btrfs-progs.

So no easy test case here.
---
 extent-tree.c | 55 +++
 1 file changed, 47 insertions(+), 8 deletions(-)

diff --git a/extent-tree.c b/extent-tree.c
index 90e792a3fe62..aab09d926043 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1907,7 +1907,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle 
*trans,
return 0;
 
ret = btrfs_alloc_chunk(trans, fs_info, , _bytes,
-   space_info->flags, false);
+   flags, false);
if (ret == -ENOSPC) {
space_info->full = 1;
return ret;
@@ -2652,8 +2652,11 @@ int btrfs_reserve_extent(struct btrfs_trans_handle 
*trans,
u64 search_start = 0;
u64 alloc_profile;
u64 profile;
+   u64 meta_profile;
struct btrfs_fs_info *info = root->fs_info;
 
+   meta_profile = info->avail_metadata_alloc_bits &
+  info->metadata_alloc_profile;
if (is_data) {
alloc_profile = info->avail_data_alloc_bits &
info->data_alloc_profile;
@@ -2663,21 +2666,57 @@ int btrfs_reserve_extent(struct btrfs_trans_handle 
*trans,
info->system_alloc_profile;
profile = BTRFS_BLOCK_GROUP_SYSTEM | alloc_profile;
} else {
-   alloc_profile = info->avail_metadata_alloc_bits &
-   info->metadata_alloc_profile;
+   alloc_profile = meta_profile;
profile = BTRFS_BLOCK_GROUP_METADATA | alloc_profile;
}
 
+   /*
+* If we're modifying global trees likes extent/root/device tree
+* (whose ref_cows is 0), we shouldn't trigger chunk allocation AT ALL!
+*
+* The problem is, if root is extent tree, we trigger a chunk
+* allocation, then it will modify extent tree, causing a path deadlock.
+* (Although not implemented in btrfs-progs, it will still cause tree
+* modification screwed up)
+*/
if (root->ref_cows) {
+   u64 meta_reserve;
+
if (!(profile & BTRFS_BLOCK_GROUP_METADATA)) {
-   ret = do_chunk_alloc(trans, info,
-num_bytes,
-BTRFS_BLOCK_GROUP_METADATA);
-   BUG_ON(ret);
+   /*
+* If we're modifying fs trees and allocating data
+* extent, we need to ensure we have enough metadata
+* space to insert new file extent item.
+*
+* Here we need to ensure we have enough space for the
+* following operations:
+* 1) Insert one file extent
+*A full tree CoW + Leaf split should be enough.
+*
+* 2) Possible new chunk allocation
+*4 trees are involved, chunk, device, extent and
+*root tree.
+*However chunk tree is counted as system chunk, and
+*never
+*really needs a new system chunk even for super
+*large fs.
+*Only 3 trees are need.
+*
+* So it's total 4 trees, subvolume, extent, device and
+* root tree.
+* Each tree needs a full tree CoW + Leaf split.
+*/
+   meta_reserve = 4 * ( BTRFS_MAX_LEVEL + 1) *
+  

[PATCH 1/4] btrfs: move pr_info into device_list_add

2018-01-09 Thread Anand Jain
Commit 60999ca4b403 ("btrfs: make device scan less noisy")
adds return value 1 to device_list_add(), so that parent function can
call pr_info only when new device is added. Move the pr_info() part
into device_list_add() so that this function can be kept simple.

Signed-off-by: Anand Jain 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/volumes.c | 29 +++--
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5adf2d3f949a..62141284b577 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -726,8 +726,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices 
*fs_devices,
  * Add new device to list of registered devices
  *
  * Returns:
- * 1   - first time device is seen
- * 0   - device already known
+ * 0   - device already known or newly added
  * < 0 - error
  */
 static noinline int device_list_add(const char *path,
@@ -737,7 +736,6 @@ static noinline int device_list_add(const char *path,
struct btrfs_device *device;
struct btrfs_fs_devices *fs_devices;
struct rcu_string *name;
-   int ret = 0;
u64 found_transid = btrfs_super_generation(disk_super);
 
fs_devices = find_fsid(disk_super->fsid);
@@ -777,9 +775,16 @@ static noinline int device_list_add(const char *path,
fs_devices->num_devices++;
mutex_unlock(_devices->device_list_mutex);
 
-   ret = 1;
device->fs_devices = fs_devices;
btrfs_free_stale_devices(path, device);
+
+   if (disk_super->label[0])
+   pr_info("BTRFS: device label %s devid %llu transid %llu 
%s\n",
+   disk_super->label, devid, found_transid, path);
+   else
+   pr_info("BTRFS: device fsid %pU devid %llu transid %llu 
%s\n",
+   disk_super->fsid, devid, found_transid, path);
+
} else if (!device->name || strcmp(device->name->str, path)) {
/*
 * When FS is already mounted.
@@ -840,7 +845,7 @@ static noinline int device_list_add(const char *path,
 
*fs_devices_ret = fs_devices;
 
-   return ret;
+   return 0;
 }
 
 static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig)
@@ -1179,7 +1184,6 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
struct page *page;
int ret;
u64 devid;
-   u64 transid;
u64 total_devices;
u64 bytenr;
 
@@ -1202,25 +1206,14 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
}
 
devid = btrfs_stack_device_id(_super->dev_item);
-   transid = btrfs_super_generation(disk_super);
total_devices = btrfs_super_num_devices(disk_super);
 
mutex_lock(_mutex);
ret = device_list_add(path, disk_super, devid, fs_devices_ret);
-   if (ret >= 0 && fs_devices_ret)
+   if (!ret && fs_devices_ret)
(*fs_devices_ret)->total_devices = total_devices;
mutex_unlock(_mutex);
 
-   if (ret > 0) {
-   if (disk_super->label[0])
-   pr_info("BTRFS: device label %s ", disk_super->label);
-   else
-   pr_info("BTRFS: device fsid %pU ", disk_super->fsid);
-
-   pr_cont("devid %llu transid %llu %s\n", devid, transid, path);
-   ret = 0;
-   }
-
btrfs_release_disk_super(page);
 
 error_bdev_put:
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/4] btrfs: drop devid as device_list_add() arg

2018-01-09 Thread Anand Jain
As struct btrfs_disk_super is being passed, so it can get devid
the same way its parent does.

Signed-off-by: Anand Jain 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/volumes.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b5305bd48338..ffeb02cbcf3d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -730,12 +730,13 @@ static int btrfs_open_one_device(struct btrfs_fs_devices 
*fs_devices,
  * error pointer when failed
  */
 static noinline struct btrfs_device *device_list_add(const char *path,
-  struct btrfs_super_block *disk_super, u64 devid)
+  struct btrfs_super_block *disk_super)
 {
struct btrfs_device *device;
struct btrfs_fs_devices *fs_devices;
struct rcu_string *name;
u64 found_transid = btrfs_super_generation(disk_super);
+   u64 devid = btrfs_stack_device_id(_super->dev_item);
 
fs_devices = find_fsid(disk_super->fsid);
if (!fs_devices) {
@@ -1183,7 +1184,6 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
struct block_device *bdev;
struct page *page;
int ret = 0;
-   u64 devid;
u64 bytenr;
 
/*
@@ -1204,10 +1204,8 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
goto error_bdev_put;
}
 
-   devid = btrfs_stack_device_id(_super->dev_item);
-
mutex_lock(_mutex);
-   device = device_list_add(path, disk_super, devid);
+   device = device_list_add(path, disk_super);
mutex_unlock(_mutex);
if (IS_ERR(device))
ret = PTR_ERR(device);
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/4] btrfs: get device pointer from device_list_add()

2018-01-09 Thread Anand Jain
Instead of pointer to btrfs_fs_devices as an arg in device_list_add()
better to get pointer to btrfs_device as return value, then we have
both, pointer to btrfs_device and btrfs_fs_devices. btrfs_device is
needed to handle reappearing missing device.

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 34 ++
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 4e5b82976455..b5305bd48338 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -726,12 +726,11 @@ static int btrfs_open_one_device(struct btrfs_fs_devices 
*fs_devices,
  * Add new device to list of registered devices
  *
  * Returns:
- * 0   - device already known or newly added
- * < 0 - error
+ * device pointer which was just added or updated when successful
+ * error pointer when failed
  */
-static noinline int device_list_add(const char *path,
-  struct btrfs_super_block *disk_super,
-  u64 devid, struct btrfs_fs_devices **fs_devices_ret)
+static noinline struct btrfs_device *device_list_add(const char *path,
+  struct btrfs_super_block *disk_super, u64 devid)
 {
struct btrfs_device *device;
struct btrfs_fs_devices *fs_devices;
@@ -742,7 +741,7 @@ static noinline int device_list_add(const char *path,
if (!fs_devices) {
fs_devices = alloc_fs_devices(disk_super->fsid);
if (IS_ERR(fs_devices))
-   return PTR_ERR(fs_devices);
+   return ERR_PTR(PTR_ERR(fs_devices));
 
list_add(_devices->list, _uuids);
 
@@ -754,19 +753,19 @@ static noinline int device_list_add(const char *path,
 
if (!device) {
if (fs_devices->opened)
-   return -EBUSY;
+   return ERR_PTR(-EBUSY);
 
device = btrfs_alloc_device(NULL, ,
disk_super->dev_item.uuid);
if (IS_ERR(device)) {
/* we can safely leave the fs_devices entry around */
-   return PTR_ERR(device);
+   return ERR_PTR(PTR_ERR(device));
}
 
name = rcu_string_strdup(path, GFP_NOFS);
if (!name) {
free_device(device);
-   return -ENOMEM;
+   return ERR_PTR(-ENOMEM);
}
rcu_assign_pointer(device->name, name);
 
@@ -820,12 +819,12 @@ static noinline int device_list_add(const char *path,
 * with larger generation number or the last-in if
 * generation are equal.
 */
-   return -EEXIST;
+   return ERR_PTR(-EEXIST);
}
 
name = rcu_string_strdup(path, GFP_NOFS);
if (!name)
-   return -ENOMEM;
+   return ERR_PTR(-ENOMEM);
rcu_string_free(device->name);
rcu_assign_pointer(device->name, name);
if (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state)) {
@@ -845,9 +844,7 @@ static noinline int device_list_add(const char *path,
 
fs_devices->total_devices = btrfs_super_num_devices(disk_super);
 
-   *fs_devices_ret = fs_devices;
-
-   return 0;
+   return device;
 }
 
 static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig)
@@ -1182,9 +1179,10 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
  struct btrfs_fs_devices **fs_devices_ret)
 {
struct btrfs_super_block *disk_super;
+   struct btrfs_device *device;
struct block_device *bdev;
struct page *page;
-   int ret;
+   int ret = 0;
u64 devid;
u64 bytenr;
 
@@ -1209,8 +1207,12 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
devid = btrfs_stack_device_id(_super->dev_item);
 
mutex_lock(_mutex);
-   ret = device_list_add(path, disk_super, devid, fs_devices_ret);
+   device = device_list_add(path, disk_super, devid);
mutex_unlock(_mutex);
+   if (IS_ERR(device))
+   ret = PTR_ERR(device);
+
+   *fs_devices_ret = device->fs_devices;
 
btrfs_release_disk_super(page);
 
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 0/6] preparatory work to add device forget

2018-01-09 Thread Anand Jain
v3->v4:
 Mainly fix as per comments from Josef.
 @3/6: rename btrfs_free_stale_device() to btrfs_free_stale_devices()
 @4/6: reorg logic, init not_found = 0; drop else part
 @5/6: added new in v4. Renames arg cur_dev to skip_dev
 @6/6: v3:5/6 is merged to v4:6/6
 checkpath error fixes.

v2->v3:
 @ 6/6:
 add btrfs_free_stale_device() fn description, suggested by Nikolay
 Fix line with longer than 80 char
 
v1->v2:
 @ 6/6:
 btrfs_device::name is null when we have missing device and
 unmounted. So we still need to check for dev->name.

We can reuse the function btrfs_free_stale_device() to add feature
to forget a scanned device or all stale devices. So this patch set
proposes following changes to it.

Anand Jain (6):
  btrfs: cleanup btrfs_free_stale_device() usage
  btrfs: no need to check for btrfs_fs_devices::seeding
  btrfs: make btrfs_free_stale_device() to iterate all stales
  btrfs: make btrfs_free_stale_device() argument optional
  btrfs: rename btrfs_free_stale_devices() arg to skip_dev
  btrfs: make btrfs_free_stale_device() to match the path

 fs/btrfs/volumes.c | 59 +++---
 1 file changed, 25 insertions(+), 34 deletions(-)

-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 6/6] btrfs: make btrfs_free_stale_device() to match the path

2018-01-09 Thread Anand Jain
From: Anand Jain 

The btrfs_free_stale_device() is updated to match for the given
device path and delete it. (It searches for only unmounted list of
devices.) Also drop the comment about different path being used
for the same device, since now we will have cli to clean any
device that's not a concern any more.

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 29 +
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a3edd4d92c57..5adf2d3f949a 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -605,8 +605,17 @@ static void pending_bios_fn(struct btrfs_work *work)
run_scheduled_bios(device);
 }
 
-
-static void btrfs_free_stale_devices(struct btrfs_device *skip_dev)
+/*
+ * btrfs_free_stale_device()
+ *  Search and remove all stale (devices which are not mounted) devices.
+ *  When both inputs are NULL, it will search and release all stale devices.
+ *  path:  Optional. When provided will it release all unmounted devices
+ * matching this path only.
+ *  skip_dev:  Optional. Will skip this device when searching for the stale
+ * devices.
+ */
+static void btrfs_free_stale_devices(const char *path,
+struct btrfs_device *skip_dev)
 {
struct btrfs_fs_devices *fs_devs, *tmp_fs_devs;
struct btrfs_device *dev, *tmp_dev;
@@ -620,19 +629,15 @@ static void btrfs_free_stale_devices(struct btrfs_device 
*skip_dev)
 _devs->devices, dev_list) {
int not_found = 0;
 
-   if (skip_dev && (skip_dev == dev || !dev->name))
+   if (skip_dev && skip_dev == dev)
+   continue;
+   if (path && !dev->name)
continue;
 
-   /*
-* Todo: This won't be enough. What if the same device
-* comes back (with new uuid and) with its mapper path?
-* But for now, this does help as mostly an admin will
-* either use mapper or non mapper path throughout.
-*/
rcu_read_lock();
-   if (skip_dev)
+   if (path)
not_found = strcmp(rcu_str_deref(dev->name),
-  
rcu_str_deref(skip_dev->name));
+  path);
rcu_read_unlock();
if (not_found)
continue;
@@ -774,7 +779,7 @@ static noinline int device_list_add(const char *path,
 
ret = 1;
device->fs_devices = fs_devices;
-   btrfs_free_stale_devices(device);
+   btrfs_free_stale_devices(path, device);
} else if (!device->name || strcmp(device->name->str, path)) {
/*
 * When FS is already mounted.
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/4] btrfs: set the total_devices in device_list_add()

2018-01-09 Thread Anand Jain
There is no other parent for device_list_add() except for
btrfs_scan_one_device(), which would set btrfs_fs_devices::total_devices
if device_list_add is successful and this can be done with in
device_list_add() itself.

Signed-off-by: Anand Jain 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/volumes.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 62141284b577..4e5b82976455 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -843,6 +843,8 @@ static noinline int device_list_add(const char *path,
if (!fs_devices->opened)
device->generation = found_transid;
 
+   fs_devices->total_devices = btrfs_super_num_devices(disk_super);
+
*fs_devices_ret = fs_devices;
 
return 0;
@@ -1184,7 +1186,6 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
struct page *page;
int ret;
u64 devid;
-   u64 total_devices;
u64 bytenr;
 
/*
@@ -1206,12 +1207,9 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
}
 
devid = btrfs_stack_device_id(_super->dev_item);
-   total_devices = btrfs_super_num_devices(disk_super);
 
mutex_lock(_mutex);
ret = device_list_add(path, disk_super, devid, fs_devices_ret);
-   if (!ret && fs_devices_ret)
-   (*fs_devices_ret)->total_devices = total_devices;
mutex_unlock(_mutex);
 
btrfs_release_disk_super(page);
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/6] btrfs: rename btrfs_free_stale_devices() arg to skip_dev

2018-01-09 Thread Anand Jain
No functional changes.
Rename btrfs_free_stale_devices() arg to skip_dev, so that it
reflects what that arg for.

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index bba98d043402..a3edd4d92c57 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -606,7 +606,7 @@ static void pending_bios_fn(struct btrfs_work *work)
 }
 
 
-static void btrfs_free_stale_devices(struct btrfs_device *cur_dev)
+static void btrfs_free_stale_devices(struct btrfs_device *skip_dev)
 {
struct btrfs_fs_devices *fs_devs, *tmp_fs_devs;
struct btrfs_device *dev, *tmp_dev;
@@ -620,7 +620,7 @@ static void btrfs_free_stale_devices(struct btrfs_device 
*cur_dev)
 _devs->devices, dev_list) {
int not_found = 0;
 
-   if (cur_dev && (cur_dev == dev || !dev->name))
+   if (skip_dev && (skip_dev == dev || !dev->name))
continue;
 
/*
@@ -630,9 +630,9 @@ static void btrfs_free_stale_devices(struct btrfs_device 
*cur_dev)
 * either use mapper or non mapper path throughout.
 */
rcu_read_lock();
-   if (cur_dev)
+   if (skip_dev)
not_found = strcmp(rcu_str_deref(dev->name),
-  
rcu_str_deref(cur_dev->name));
+  
rcu_str_deref(skip_dev->name));
rcu_read_unlock();
if (not_found)
continue;
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 0/4] device_list_add() peparation to add reappearing missing device

2018-01-09 Thread Anand Jain
(Apply on top of my patchset
   [PATCH v4 0/6] preparatory work to add device forget
 for conflict free apply. They don't actually depend on
 each other though).

v2->v3:
 Fix device_list_add() fn description which was still referring to the
 previous return values.

v1->v2:
 Drop patch 5/5 for uuid_mutex optimize. That was wrong. Thanks Josef.
 In patch 3/5 make btrfs_device * as return.

Cleanup of device_list_add(), mainly in preparation to handle
reappearing missing device which its next reroll will be sent
separately.

Anand Jain (4):
  btrfs: move pr_info into device_list_add
  btrfs: set the total_devices in device_list_add()
  btrfs: get device pointer from device_list_add()
  btrfs: drop devid as device_list_add() arg

 fs/btrfs/volumes.c | 63 +++---
 1 file changed, 27 insertions(+), 36 deletions(-)

-- 
2.7.0

>From 84319f498ffcdca9bbb63fec680872d15392bf98 Mon Sep 17 00:00:00 2001
From: Anand Jain 
Date: Wed, 10 Jan 2018 09:40:38 +0800

Anand Jain (4):
  btrfs: move pr_info into device_list_add
  btrfs: set the total_devices in device_list_add()
  btrfs: get device pointer from device_list_add()
  btrfs: drop devid as device_list_add() arg

 fs/btrfs/volumes.c | 61 +++---
 1 file changed, 26 insertions(+), 35 deletions(-)

-- 
2.7.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/6] btrfs: make btrfs_free_stale_device() argument optional

2018-01-09 Thread Anand Jain
From: Anand Jain 

This updates btrfs_free_stale_device() helper function to delete all
unmouted devices, when arg is NULL.

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 14 +-
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index cd234a5dc763..bba98d043402 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -611,9 +611,6 @@ static void btrfs_free_stale_devices(struct btrfs_device 
*cur_dev)
struct btrfs_fs_devices *fs_devs, *tmp_fs_devs;
struct btrfs_device *dev, *tmp_dev;
 
-   if (!cur_dev->name)
-   return;
-
list_for_each_entry_safe(fs_devs, tmp_fs_devs, _uuids, list) {
 
if (fs_devs->opened)
@@ -621,11 +618,9 @@ static void btrfs_free_stale_devices(struct btrfs_device 
*cur_dev)
 
list_for_each_entry_safe(dev, tmp_dev,
 _devs->devices, dev_list) {
-   int not_found;
+   int not_found = 0;
 
-   if (dev == cur_dev)
-   continue;
-   if (!dev->name)
+   if (cur_dev && (cur_dev == dev || !dev->name))
continue;
 
/*
@@ -635,8 +630,9 @@ static void btrfs_free_stale_devices(struct btrfs_device 
*cur_dev)
 * either use mapper or non mapper path throughout.
 */
rcu_read_lock();
-   not_found = strcmp(rcu_str_deref(dev->name),
-   rcu_str_deref(cur_dev->name));
+   if (cur_dev)
+   not_found = strcmp(rcu_str_deref(dev->name),
+  
rcu_str_deref(cur_dev->name));
rcu_read_unlock();
if (not_found)
continue;
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/6] btrfs: no need to check for btrfs_fs_devices::seeding

2018-01-09 Thread Anand Jain
There is no need to check for btrfs_fs_devices::seeding when we
have checked for btrfs_fs_devices::opened, because we can't sprout
without its seed FS being opened.

Signed-off-by: Anand Jain 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/volumes.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 25b91776d036..3f481da9cae7 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -619,8 +619,6 @@ static void btrfs_free_stale_device(struct btrfs_device 
*cur_dev)
 
if (fs_devs->opened)
continue;
-   if (fs_devs->seeding)
-   continue;
 
list_for_each_entry(dev, _devs->devices, dev_list) {
 
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/6] btrfs: make btrfs_free_stale_device() to iterate all stales

2018-01-09 Thread Anand Jain
From: Anand Jain 

Let the list iterator iterate further and find other stale
devices and delete it. This is in preparation to add support
for user land request-able stale devices cleanup. Also rename
btrfs_free_stale_device() to btrfs_free_stale_devices().

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 24 +++-
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 3f481da9cae7..cd234a5dc763 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -606,21 +606,22 @@ static void pending_bios_fn(struct btrfs_work *work)
 }
 
 
-static void btrfs_free_stale_device(struct btrfs_device *cur_dev)
+static void btrfs_free_stale_devices(struct btrfs_device *cur_dev)
 {
-   struct btrfs_fs_devices *fs_devs;
-   struct btrfs_device *dev;
+   struct btrfs_fs_devices *fs_devs, *tmp_fs_devs;
+   struct btrfs_device *dev, *tmp_dev;
 
if (!cur_dev->name)
return;
 
-   list_for_each_entry(fs_devs, _uuids, list) {
-   int del = 1;
+   list_for_each_entry_safe(fs_devs, tmp_fs_devs, _uuids, list) {
 
if (fs_devs->opened)
continue;
 
-   list_for_each_entry(dev, _devs->devices, dev_list) {
+   list_for_each_entry_safe(dev, tmp_dev,
+_devs->devices, dev_list) {
+   int not_found;
 
if (dev == cur_dev)
continue;
@@ -634,14 +635,12 @@ static void btrfs_free_stale_device(struct btrfs_device 
*cur_dev)
 * either use mapper or non mapper path throughout.
 */
rcu_read_lock();
-   del = strcmp(rcu_str_deref(dev->name),
+   not_found = strcmp(rcu_str_deref(dev->name),
rcu_str_deref(cur_dev->name));
rcu_read_unlock();
-   if (!del)
-   break;
-   }
+   if (not_found)
+   continue;
 
-   if (!del) {
/* delete the stale device */
if (fs_devs->num_devices == 1) {
btrfs_sysfs_remove_fsid(fs_devs);
@@ -652,7 +651,6 @@ static void btrfs_free_stale_device(struct btrfs_device 
*cur_dev)
list_del(>dev_list);
free_device(dev);
}
-   break;
}
}
 }
@@ -780,7 +778,7 @@ static noinline int device_list_add(const char *path,
 
ret = 1;
device->fs_devices = fs_devices;
-   btrfs_free_stale_device(device);
+   btrfs_free_stale_devices(device);
} else if (!device->name || strcmp(device->name->str, path)) {
/*
 * When FS is already mounted.
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/6] btrfs: cleanup btrfs_free_stale_device() usage

2018-01-09 Thread Anand Jain
We call btrfs_free_stale_device() only when we alloc a new
struct btrfs_device (ret=1), so move it closer to where we
alloc the new device. Also drop the comments.

Signed-off-by: Anand Jain 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/volumes.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d393808071d5..25b91776d036 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -782,6 +782,7 @@ static noinline int device_list_add(const char *path,
 
ret = 1;
device->fs_devices = fs_devices;
+   btrfs_free_stale_device(device);
} else if (!device->name || strcmp(device->name->str, path)) {
/*
 * When FS is already mounted.
@@ -840,13 +841,6 @@ static noinline int device_list_add(const char *path,
if (!fs_devices->opened)
device->generation = found_transid;
 
-   /*
-* if there is new btrfs on an already registered device,
-* then remove the stale device entry.
-*/
-   if (ret > 0)
-   btrfs_free_stale_device(device);
-
*fs_devices_ret = fs_devices;
 
return ret;
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] generic/015: Change the test filesystem size to 101mb

2018-01-09 Thread Eryu Guan
On Mon, Jan 08, 2018 at 10:43:30AM +0200, Nikolay Borisov wrote:
> This test has been failing for btrfs for quite some time,
> at least since 4.7. There are 2 implementation details of btrfs that
> it exposes:
> 
> 1. Currently btrfs filesystem under 100mb are created in Mixed block
> group mode. Freespace accounting for it is not 100% accurate - I've

mkfs.btrfs does this too? Because I noticed _scratch_mkfs_sized adds
'--mixed' mkfs option explicitly.

> observed around 100-200kb discrepancy between a newly created filesystem,
> then writing a file and deleting it and checking the free space. This
> falls within %3 and not %1 as hardcoded in the test. 
> 
> 2. BTRFS won't flush it's delayed allocation on file deletion if less
> than 32mb are deleted. On such files we need to perform sync (missing
> in the test) or wait until time elapses for transaction commit.
> 
> Since mixed mode is somewhat deprecated and btrfs is not really intended
> to be used on really small devices let's just adjust the test to
> create a 101mb fs, which doesn't use mixed mode and really test
> freespace accounting.
> 
> Signed-off-by: Nikolay Borisov 
> ---
>  tests/generic/015 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tests/generic/015 b/tests/generic/015
> index 78f2b13..416c4ae 100755
> --- a/tests/generic/015
> +++ b/tests/generic/015
> @@ -53,7 +53,7 @@ _supported_os Linux
>  _require_scratch
>  _require_no_large_scratch_dev
>  
> -_scratch_mkfs_sized `expr 50 \* 1024 \* 1024` >/dev/null 2>&1 \
> +_scratch_mkfs_sized `expr 101 \* 1024 \* 1024` >/dev/null 2>&1 \

Better to have some comments in the code too, to explain why we choose
101m filesystem size.

Thanks,
Eryu

>  || _fail "mkfs failed"
>  _scratch_mount || _fail "mount failed"
>  out=$SCRATCH_MNT/fillup.$$
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Recommendations for balancing as part of regular maintenance?

2018-01-09 Thread Duncan
Austin S. Hemmelgarn posted on Tue, 09 Jan 2018 07:46:48 -0500 as
excerpted:

>> On 08/01/18 23:29, Martin Raiber wrote:
>>> There have been reports of (rare) corruption caused by balance (won't
>>> be detected by a scrub) here on the mailing list. So I would stay a
>>> away from btrfs balance unless it is absolutely needed (ENOSPC), and
>>> while it is run I would try not to do anything else wrt. to writes
>>> simultaneously.
>> 
>> This is my opinion too as a normal user, based upon reading this list
>> and own attempts to recover from ENOSPC. I'd rather re-create
>> filesystem from scratch, or at least make full verified backup before
>> attempting to fix problems with balance.

> While I'm generally of the same opinion (and I have a feeling most other
> people who have been server admins are too), it's not a very user
> friendly position to recommend that.  Keep in mind that many (probably
> most) users don't keep proper backups, and just targeting 'sensible'
> people as your primary audience is a bad idea.  It also needs to work at
> at least a basic level anyway though simply because you can't always
> just nuke the volume and rebuild it from scratch.
> 
> Personally though, I don't think I've ever seen issues with balance
> corrupting data, and I don't recall seeing complaints about it either
> (though I would love to see some links that prove me wrong).

AFAIK, such corruption reports re balance aren't really balance, per se, 
at all.

Instead, what I've seen in nearly all cases is a number of filesystem 
maintenance commands involving heavy I/O colliding, that is, being run at 
the same time, possibly because some of them are scheduled, and the admin 
didn't take into account scheduled commands when issuing others manually.

I don't believe anyone would recommend running balance, scrub, snapshot-
deletion, and backups (rsync or btrfs send/receive being the common 
ones), all at the same time, or even two or more at the same time, if for 
no other reason than because they're all IO intensive and running just 
/one/ of them at a time is hard /enough/ on the system and the 
performance of anything else running at the same time, even when all 
components are fully stable and mature (and as we all know, btrfs is 
stabilizing, but not yet fully stable and mature), yet that's what these 
sorts of reports invariably involve.

Of course, with a certainty btrfs /should/ be able to handle more than 
one of these at once without corruption, because anything else is a bug, 
but... btrfs /is/ still stabilizing and maturing, and it's precisely this 
sort of rare corner-case race-condition bugs where more than one 
extremely heavy IO filesystem maintenance command is being run at the 
same time that tend to be the last to be found and fixed, because they 
/are/ rare corner-cases, often depending on race conditions, that tend to 
be rare enough reported, and then extremely difficult to duplicate, so 
that's exactly the type of bugs that tend to remain around at this point.


So rather than discouraging a sane-filtered regular balance (which I'll 
discuss in a different reply), I'd suggest that the more sane 
recommendation is to be aware of other major-IO filesystem maintenance 
commands (not just btrfs commands but rsync-based backups, etc, too, 
rsync being demanding enough on its own to have triggered a number of 
btrfs bug reports and fixes over the years), including scheduled 
commands, and to only run one at a time.

IOW, don't do a balance if your scheduled backup or snapshot-deletion is 
about to kick in.  One at a time is stressful enough on the filesystem 
and hardware, don't compound the problem trying to do two or more at once!

So assuming a weekly schedule, do one a day of balance, scrub, snapshot-
deletion, backups (after ensuring that none of them take over a day, 
balance in particular could at TiB-scale+ if not sanely filtered, 
particularly if quotas are enabled due to the scaling issues of that 
feature).  And if any of those are scheduled daily or more frequently, 
space the scheduling appropriately and ensure they're done before 
starting the next task.

And keep in mind the scheduled tasks when running things manually, so as 
not to collide there either.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs scrub not repair corruption

2018-01-09 Thread Duncan
Wolf posted on Mon, 08 Jan 2018 23:27:27 +0100 as excerpted:

> I'm running btrfs scrub on my raid each week (is that too often?) and
> I'm having a problem that it reports corruption, says it's repaired but
> next week reports it again.

I won't attempt to answer the larger question, but on the narrow "too 
often?" question, no, running scrub once a week shouldn't be a problem.

Scrub is read-only unless it finds errors, so even running it repeatedly 
end-to-end shouldn't be a problem, other than the obvious performance 
issue and the potential increased head-seek wear on non-ssd devices.  The 
obvious issue would be slowing down whatever else you're doing at the 
same time, and at whatever presumably scheduled weekly time you run it 
that's evidently not a problem for your use-case.


Also, a bit OT as I don't believe it's related to this, but FWIW...

There *has* been a recent kernel issue with gentoo-hardened compiling 
kernel code incorrectly due to a gcc option enabled by default on 
hardened.  I don't remember the details, but I ran across in in one of 
the kernel development articles I read.  I /think/ it applied only to 
4.15-rc, however, or possibly 4.14.  The fix is to disable that specific 
gcc option when building the kernel, as it was designed for userspace and 
doesn't make much sense for the kernel anyway.  A patch doing just that 
should already be part of the latest 4.15-rcs and if the bug applied to 
4.14 it'll be backported there as well, but I'm not sure of current 4.14-
stable status.

(I run gentoo, so my interest perked when I came across the discussion, 
but not hardened, so I didn't need to retain the details.)

If you're not already aware of that, you might wish to research it a bit 
more, and disable whatever option manually in your kernel-build CFLAGS, 
tho as mentioned once the patch is applied the kernel make files 
automatically apply the appropriate option. (The official kernel CFLAGS 
related vars are KCFLAGS (C), KCPPFLAGS (pre-processor), and KAFLAGS 
(assembler).)  Unfortunately IDR what the specific flag was, 
-fno-something, IIRC.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] Btrfs: raid56: fix race between merge_bio and rbio_orig_end_io

2018-01-09 Thread Liu Bo
Before rbio_orig_end_io() goes to free rbio, rbio may get merged with
more bios from other rbios and rbio->bio_list becomes non-empty,
in that case, these newly merged bios don't end properly.

Once unlock_stripe() is done, rbio->bio_list will not be updated any
more and we can call bio_endio() on all queued bios.

It should only happen in error-out cases, the normal path of recover
and full stripe write have already set RBIO_RMW_LOCKED_BIT to disable
merge before doing IO, so rbio_orig_end_io() called by them doesn't
have the above issue.

Reported-by: Jérôme Carretero 
Signed-off-by: Liu Bo 
---
v2: - Remove the usage spin_lock as there is a chance of deadlock in
  interrupt context, it's reported by lockdep, although it'd never
  happen because we've taken care of it by saving irq flags at all
  places.
- Update commit log and comments of code to explain the new idea.
- This has been tested against btrfs/011 for 50 times.

 fs/btrfs/raid56.c | 37 +
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 7747323..b2b426d 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -864,10 +864,17 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio)
kfree(rbio);
 }
 
-static void free_raid_bio(struct btrfs_raid_bio *rbio)
+static void rbio_endio_bio_list(struct bio *cur, blk_status_t err)
 {
-   unlock_stripe(rbio);
-   __free_raid_bio(rbio);
+   struct bio *next;
+
+   while (cur) {
+   next = cur->bi_next;
+   cur->bi_next = NULL;
+   cur->bi_status = err;
+   bio_endio(cur);
+   cur = next;
+   }
 }
 
 /*
@@ -877,20 +884,26 @@ static void free_raid_bio(struct btrfs_raid_bio *rbio)
 static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
 {
struct bio *cur = bio_list_get(>bio_list);
-   struct bio *next;
+   struct bio *extra;
 
if (rbio->generic_bio_cnt)
btrfs_bio_counter_sub(rbio->fs_info, rbio->generic_bio_cnt);
 
-   free_raid_bio(rbio);
+   /*
+* At this moment, rbio->bio_list is empty, however since rbio does not
+* always have RBIO_RMW_LOCKED_BIT set and rbio is still linked on the
+* hash list, rbio may be merged with others so that rbio->bio_list
+* becomes non-empty.
+* Once unlock_stripe() is done, rbio->bio_list will not be updated any
+* more and we can call bio_endio() on all queued bios.
+*/
+   unlock_stripe(rbio);
+   extra = bio_list_get(>bio_list);
+   __free_raid_bio(rbio);
 
-   while (cur) {
-   next = cur->bi_next;
-   cur->bi_next = NULL;
-   cur->bi_status = err;
-   bio_endio(cur);
-   cur = next;
-   }
+   rbio_endio_bio_list(cur, err);
+   if (extra)
+   rbio_endio_bio_list(extra, err);
 }
 
 /*
-- 
2.9.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] btrfs: Fix slight df misreporting when listing mixed block groups fs

2018-01-09 Thread Qu Wenruo


On 2018年01月10日 09:30, Qu Wenruo wrote:
> 
> 
> On 2018年01月10日 00:04, Nikolay Borisov wrote:
>> Currently when df (resp. statfs) is executed on a btrfs instance the
>> free space is calculated as the freespace of the block groups +
>> any unallocated space on the devices, constituting this filesystem.
>> There is a catch, however, in that the unallocated space from the
>> devices has 1 mb subtracted from each device to closely follow btrfs'
>> allocator behavior. This is all good except that when we have a file
>> system and the system chunk's allocation profile is single (as is the
>> default in mixed mode) it occupies the range 0..4mb and thus implicitly
>> prevents allocation starting in the range 0..1mb. In those cases
>> additionally subtracting 1mb during statfs makes us misreport the
>> actual free space and this causes generic/015 to fail.
>>
>> Signed-off-by: Nikolay Borisov 
>> ---
>> Hello,   
>>
>>  
>>
>> So with this patch and adding sync to generic/015 (which is a different 
>> issue)  
>> I can make generic/015 pass. In any case I think this behavior is correct
>>
>> though frankly I feel it's not really worth it, in the worst case we are 
>>
>> underreporting by at most 1mb per device.
>>
>>  fs/btrfs/super.c | 25 +
>>  1 file changed, 13 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>> index f33a55272deb..a4dfc9701220 100644
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -1925,12 +1925,13 @@ static int btrfs_calc_avail_data_space(struct 
>> btrfs_fs_info *fs_info,
>>  struct btrfs_device_info *devices_info;
>>  struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>>  struct btrfs_device *device;
>> -u64 skip_space;
>>  u64 type;
>>  u64 avail_space;
>>  u64 min_stripe_size;
>>  int min_stripes = 1, num_stripes = 1;
>>  int i = 0, nr_devices;
>> +bool sys_chunk_single = !(btrfs_system_alloc_profile(fs_info) &
>> +  ~BTRFS_BLOCK_GROUP_SYSTEM);
>>  
>>  /*
>>   * We aren't under the device list lock, so this is racy-ish, but good
>> @@ -1988,18 +1989,18 @@ static int btrfs_calc_avail_data_space(struct 
>> btrfs_fs_info *fs_info,
>>  /*
>>   * In order to avoid overwriting the superblock on the drive,
>>   * btrfs starts at an offset of at least 1MB when doing chunk
>> - * allocation.
>> + * allocation. There is one notable exception, and that is
>> + * when the System chunk has allocation profile of single. In
>> + * this case it occupies 0..4m  and we don't really need to
>> + * subtract 1mb as it's implied
> 
> This is in fact a bug in mkfs.btrfs.
> 
> The problem is, current chunk allocator in mkfs.btrfs doesn't really
> avoid the first 1M.

Just a side note, it's the custom allocator (Yep, custom allocator
again) ignoring the 1M limit.
While the generic btrfs_alloc_chunk() is completely fine.

The fix to mkfs.btrfs can arrive very soon.

This also reminds me about the long forgot project to get rid of such
SINGLE temporary chunks (not allocating them at all) and allocate them
in native profile at the very beginning of mkfs.btrfs.

Thanks,
Qu

> 
> So temporary (profile SINGLE) chunks can still using that reserved 1M,
> while normally such chunks get removed so reserved 1M will not be used,
> but if we specified -m single, such chunks will just be used as is.
> 
> And such behavior makes unallocated space calculation pretty nasty.
> 
>>   */
>> -skip_space = SZ_1M;
>> -
>> -/*
>> - * we can use the free space in [0, skip_space - 1], subtract
>> - * it from the total.
>> - */
>> -if (avail_space && avail_space >= skip_space)
>> -avail_space -= skip_space;
>> -else
>> -avail_space = 0;
>> +if (device->devid != 1 ||
>> +(!sys_chunk_single && device->devid == 1)) {
>> +if (avail_space && avail_space >= SZ_1M)
>> +avail_space -= SZ_1M;
>> +else
>> +avail_space = 0;
>> +}
> 
> Due to above reason, using system chunk profile and devid to check if we
> need to reduce 1M is not reliable.
> 
> That system chunk can be relocated by balance, and in that case new
> chunk may start beyond 1M.
> 
> So the most reliable method would be manually checking the the first
> device extent of this device, and to determine if we should minus that 1M.
> 
> If the first device extent starts beyond 1M, reducing 1M would be good.
> Or don't modify it.
> 
> Thanks,
> Qu
> 
>>  
>>  if (avail_space < 

[PATCH 4/4] btrfs: drop devid as device_list_add() arg

2018-01-09 Thread Anand Jain
As struct btrfs_disk_super is being passed, so it can get devid
the same way its parent does.

Signed-off-by: Anand Jain 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/volumes.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d74d01971d0c..bd2e8616755e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -731,12 +731,13 @@ static int btrfs_open_one_device(struct btrfs_fs_devices 
*fs_devices,
  * < 0 - error
  */
 static noinline struct btrfs_device *device_list_add(const char *path,
-  struct btrfs_super_block *disk_super, u64 devid)
+  struct btrfs_super_block *disk_super)
 {
struct btrfs_device *device;
struct btrfs_fs_devices *fs_devices;
struct rcu_string *name;
u64 found_transid = btrfs_super_generation(disk_super);
+   u64 devid = btrfs_stack_device_id(_super->dev_item);
 
fs_devices = find_fsid(disk_super->fsid);
if (!fs_devices) {
@@ -1184,7 +1185,6 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
struct block_device *bdev;
struct page *page;
int ret = 0;
-   u64 devid;
u64 bytenr;
 
/*
@@ -1205,10 +1205,8 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
goto error_bdev_put;
}
 
-   devid = btrfs_stack_device_id(_super->dev_item);
-
mutex_lock(_mutex);
-   device = device_list_add(path, disk_super, devid);
+   device = device_list_add(path, disk_super);
mutex_unlock(_mutex);
if (IS_ERR(device))
ret = PTR_ERR(device);
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V2 0/4] device_list_add() peparation to add reappearing missing device

2018-01-09 Thread Anand Jain
v1->v2:
 Drop patch 5/5 for uuid_mutex optimize. That was wrong. Thanks Josef.
 In patch 3/5 make btrfs_device * as return.

Cleanup of device_list_add(), mainly in preparation to handle
reappearing missing device which its next reroll will be sent
separately.

Anand Jain (4):
  btrfs: move pr_info into device_list_add
  btrfs: set the total_devices in device_list_add()
  btrfs: get device pointer from device_list_add()
  btrfs: drop devid as device_list_add() arg

 fs/btrfs/volumes.c | 61 +++---
 1 file changed, 26 insertions(+), 35 deletions(-)

-- 
2.7.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/4] btrfs: set the total_devices in device_list_add()

2018-01-09 Thread Anand Jain
There is no other parent for device_list_add() except for
btrfs_scan_one_device(), which would set btrfs_fs_devices::total_devices
if device_list_add is successful and this can be done with in
device_list_add() itself.

Signed-off-by: Anand Jain 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/volumes.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 091eb452f3c8..f9aaf65e27f5 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -844,6 +844,8 @@ static noinline int device_list_add(const char *path,
if (!fs_devices->opened)
device->generation = found_transid;
 
+   fs_devices->total_devices = btrfs_super_num_devices(disk_super);
+
*fs_devices_ret = fs_devices;
 
return 0;
@@ -1185,7 +1187,6 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
struct page *page;
int ret;
u64 devid;
-   u64 total_devices;
u64 bytenr;
 
/*
@@ -1207,12 +1208,9 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
}
 
devid = btrfs_stack_device_id(_super->dev_item);
-   total_devices = btrfs_super_num_devices(disk_super);
 
mutex_lock(_mutex);
ret = device_list_add(path, disk_super, devid, fs_devices_ret);
-   if (!ret && fs_devices_ret)
-   (*fs_devices_ret)->total_devices = total_devices;
mutex_unlock(_mutex);
 
btrfs_release_disk_super(page);
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/4] btrfs: move pr_info into device_list_add

2018-01-09 Thread Anand Jain
Commit 60999ca4b403 ("btrfs: make device scan less noisy")
adds return value 1 to device_list_add(), so that parent function can
call pr_info only when new device is added. Move the pr_info() part
into device_list_add() so that this function can be kept simple.

Signed-off-by: Anand Jain 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/volumes.c | 29 +++--
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0d7f912f68ff..091eb452f3c8 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -727,8 +727,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices 
*fs_devices,
  * Add new device to list of registered devices
  *
  * Returns:
- * 1   - first time device is seen
- * 0   - device already known
+ * 0   - device already known or newly added
  * < 0 - error
  */
 static noinline int device_list_add(const char *path,
@@ -738,7 +737,6 @@ static noinline int device_list_add(const char *path,
struct btrfs_device *device;
struct btrfs_fs_devices *fs_devices;
struct rcu_string *name;
-   int ret = 0;
u64 found_transid = btrfs_super_generation(disk_super);
 
fs_devices = find_fsid(disk_super->fsid);
@@ -778,9 +776,16 @@ static noinline int device_list_add(const char *path,
fs_devices->num_devices++;
mutex_unlock(_devices->device_list_mutex);
 
-   ret = 1;
device->fs_devices = fs_devices;
btrfs_free_stale_device(path, device);
+
+   if (disk_super->label[0])
+   pr_info("BTRFS: device label %s devid %llu transid %llu 
%s\n",
+   disk_super->label, devid, found_transid, path);
+   else
+   pr_info("BTRFS: device fsid %pU devid %llu transid %llu 
%s\n",
+   disk_super->fsid, devid, found_transid, path);
+
} else if (!device->name || strcmp(device->name->str, path)) {
/*
 * When FS is already mounted.
@@ -841,7 +846,7 @@ static noinline int device_list_add(const char *path,
 
*fs_devices_ret = fs_devices;
 
-   return ret;
+   return 0;
 }
 
 static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig)
@@ -1180,7 +1185,6 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
struct page *page;
int ret;
u64 devid;
-   u64 transid;
u64 total_devices;
u64 bytenr;
 
@@ -1203,25 +1207,14 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
}
 
devid = btrfs_stack_device_id(_super->dev_item);
-   transid = btrfs_super_generation(disk_super);
total_devices = btrfs_super_num_devices(disk_super);
 
mutex_lock(_mutex);
ret = device_list_add(path, disk_super, devid, fs_devices_ret);
-   if (ret >= 0 && fs_devices_ret)
+   if (!ret && fs_devices_ret)
(*fs_devices_ret)->total_devices = total_devices;
mutex_unlock(_mutex);
 
-   if (ret > 0) {
-   if (disk_super->label[0])
-   pr_info("BTRFS: device label %s ", disk_super->label);
-   else
-   pr_info("BTRFS: device fsid %pU ", disk_super->fsid);
-
-   pr_cont("devid %llu transid %llu %s\n", devid, transid, path);
-   ret = 0;
-   }
-
btrfs_release_disk_super(page);
 
 error_bdev_put:
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/4] btrfs: get device pointer from device_list_add()

2018-01-09 Thread Anand Jain
Instead of pointer to btrfs_fs_devices as an arg in device_list_add()
better to get pointer to btrfs_device as return value, then we have
both, pointer to btrfs_device and btrfs_fs_devices. btrfs_device is
needed to handle reappearing missing device.

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 30 --
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f9aaf65e27f5..d74d01971d0c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -730,9 +730,8 @@ static int btrfs_open_one_device(struct btrfs_fs_devices 
*fs_devices,
  * 0   - device already known or newly added
  * < 0 - error
  */
-static noinline int device_list_add(const char *path,
-  struct btrfs_super_block *disk_super,
-  u64 devid, struct btrfs_fs_devices **fs_devices_ret)
+static noinline struct btrfs_device *device_list_add(const char *path,
+  struct btrfs_super_block *disk_super, u64 devid)
 {
struct btrfs_device *device;
struct btrfs_fs_devices *fs_devices;
@@ -743,7 +742,7 @@ static noinline int device_list_add(const char *path,
if (!fs_devices) {
fs_devices = alloc_fs_devices(disk_super->fsid);
if (IS_ERR(fs_devices))
-   return PTR_ERR(fs_devices);
+   return ERR_PTR(PTR_ERR(fs_devices));
 
list_add(_devices->list, _uuids);
 
@@ -755,19 +754,19 @@ static noinline int device_list_add(const char *path,
 
if (!device) {
if (fs_devices->opened)
-   return -EBUSY;
+   return ERR_PTR(-EBUSY);
 
device = btrfs_alloc_device(NULL, ,
disk_super->dev_item.uuid);
if (IS_ERR(device)) {
/* we can safely leave the fs_devices entry around */
-   return PTR_ERR(device);
+   return ERR_PTR(PTR_ERR(device));
}
 
name = rcu_string_strdup(path, GFP_NOFS);
if (!name) {
free_device(device);
-   return -ENOMEM;
+   return ERR_PTR(-ENOMEM);
}
rcu_assign_pointer(device->name, name);
 
@@ -821,12 +820,12 @@ static noinline int device_list_add(const char *path,
 * with larger generation number or the last-in if
 * generation are equal.
 */
-   return -EEXIST;
+   return ERR_PTR(-EEXIST);
}
 
name = rcu_string_strdup(path, GFP_NOFS);
if (!name)
-   return -ENOMEM;
+   return ERR_PTR(-ENOMEM);
rcu_string_free(device->name);
rcu_assign_pointer(device->name, name);
if (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state)) {
@@ -846,9 +845,7 @@ static noinline int device_list_add(const char *path,
 
fs_devices->total_devices = btrfs_super_num_devices(disk_super);
 
-   *fs_devices_ret = fs_devices;
-
-   return 0;
+   return device;
 }
 
 static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig)
@@ -1183,9 +1180,10 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
  struct btrfs_fs_devices **fs_devices_ret)
 {
struct btrfs_super_block *disk_super;
+   struct btrfs_device *device;
struct block_device *bdev;
struct page *page;
-   int ret;
+   int ret = 0;
u64 devid;
u64 bytenr;
 
@@ -1210,8 +1208,12 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
devid = btrfs_stack_device_id(_super->dev_item);
 
mutex_lock(_mutex);
-   ret = device_list_add(path, disk_super, devid, fs_devices_ret);
+   device = device_list_add(path, disk_super, devid);
mutex_unlock(_mutex);
+   if (IS_ERR(device))
+   ret = PTR_ERR(device);
+
+   *fs_devices_ret = device->fs_devices;
 
btrfs_release_disk_super(page);
 
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5] btrfs: move uuid_mutex closer to exclusivity

2018-01-09 Thread Anand Jain



On 01/10/2018 12:15 AM, Josef Bacik wrote:

On Tue, Jan 09, 2018 at 10:46:25PM +0800, Anand Jain wrote:

move uuid_mutex with in device_list_add().

Signed-off-by: Anand Jain 


This isn't going to work, there's a bunch of places we return errors so this
just deadlocks the box.  Leave it like it is, it's not hurting anybody leaving
it like this.  Thanks,


 Err. My mistake. Can't believe how did I miss it.
 Pls ignore this patch.
 Thanks for the review.

- Anand


Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] btrfs: Fix slight df misreporting when listing mixed block groups fs

2018-01-09 Thread Qu Wenruo


On 2018年01月10日 00:04, Nikolay Borisov wrote:
> Currently when df (resp. statfs) is executed on a btrfs instance the
> free space is calculated as the freespace of the block groups +
> any unallocated space on the devices, constituting this filesystem.
> There is a catch, however, in that the unallocated space from the
> devices has 1 mb subtracted from each device to closely follow btrfs'
> allocator behavior. This is all good except that when we have a file
> system and the system chunk's allocation profile is single (as is the
> default in mixed mode) it occupies the range 0..4mb and thus implicitly
> prevents allocation starting in the range 0..1mb. In those cases
> additionally subtracting 1mb during statfs makes us misreport the
> actual free space and this causes generic/015 to fail.
> 
> Signed-off-by: Nikolay Borisov 
> ---
> Hello,
>   
>   
>   
> So with this patch and adding sync to generic/015 (which is a different 
> issue)  
> I can make generic/015 pass. In any case I think this behavior is correct 
>   
> though frankly I feel it's not really worth it, in the worst case we are  
>   
> underreporting by at most 1mb per device.
> 
>  fs/btrfs/super.c | 25 +
>  1 file changed, 13 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index f33a55272deb..a4dfc9701220 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1925,12 +1925,13 @@ static int btrfs_calc_avail_data_space(struct 
> btrfs_fs_info *fs_info,
>   struct btrfs_device_info *devices_info;
>   struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>   struct btrfs_device *device;
> - u64 skip_space;
>   u64 type;
>   u64 avail_space;
>   u64 min_stripe_size;
>   int min_stripes = 1, num_stripes = 1;
>   int i = 0, nr_devices;
> + bool sys_chunk_single = !(btrfs_system_alloc_profile(fs_info) &
> +   ~BTRFS_BLOCK_GROUP_SYSTEM);
>  
>   /*
>* We aren't under the device list lock, so this is racy-ish, but good
> @@ -1988,18 +1989,18 @@ static int btrfs_calc_avail_data_space(struct 
> btrfs_fs_info *fs_info,
>   /*
>* In order to avoid overwriting the superblock on the drive,
>* btrfs starts at an offset of at least 1MB when doing chunk
> -  * allocation.
> +  * allocation. There is one notable exception, and that is
> +  * when the System chunk has allocation profile of single. In
> +  * this case it occupies 0..4m  and we don't really need to
> +  * subtract 1mb as it's implied

This is in fact a bug in mkfs.btrfs.

The problem is, current chunk allocator in mkfs.btrfs doesn't really
avoid the first 1M.

So temporary (profile SINGLE) chunks can still using that reserved 1M,
while normally such chunks get removed so reserved 1M will not be used,
but if we specified -m single, such chunks will just be used as is.

And such behavior makes unallocated space calculation pretty nasty.

>*/
> - skip_space = SZ_1M;
> -
> - /*
> -  * we can use the free space in [0, skip_space - 1], subtract
> -  * it from the total.
> -  */
> - if (avail_space && avail_space >= skip_space)
> - avail_space -= skip_space;
> - else
> - avail_space = 0;
> + if (device->devid != 1 ||
> + (!sys_chunk_single && device->devid == 1)) {
> + if (avail_space && avail_space >= SZ_1M)
> + avail_space -= SZ_1M;
> + else
> + avail_space = 0;
> + }

Due to above reason, using system chunk profile and devid to check if we
need to reduce 1M is not reliable.

That system chunk can be relocated by balance, and in that case new
chunk may start beyond 1M.

So the most reliable method would be manually checking the the first
device extent of this device, and to determine if we should minus that 1M.

If the first device extent starts beyond 1M, reducing 1M would be good.
Or don't modify it.

Thanks,
Qu

>  
>   if (avail_space < min_stripe_size)
>   continue;
> 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 3/5] btrfs: get device pointer from device_list_add()

2018-01-09 Thread Anand Jain



On 01/10/2018 12:13 AM, Josef Bacik wrote:

On Tue, Jan 09, 2018 at 10:46:23PM +0800, Anand Jain wrote:

Instead of pointer to btrfs_fs_devices from device_list_add() its
better to get pointer to btrfs_device, then we have both, pointer
to btrfs_device and btrfs_fs_devices. This is needed in preparation
to add handling reappearing missing device feature.

Signed-off-by: Anand Jain 


Can we just change device_list_add to return struct btrfs_device * instead, and
just do PTR_ERR() if there's a problem?  Thanks,


That's much better. Will do.

Thanks, Anand


Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 19/19] fs: handle inode->i_version more efficiently

2018-01-09 Thread Dave Chinner
On Tue, Jan 09, 2018 at 09:10:59AM -0500, Jeff Layton wrote:
> From: Jeff Layton 
> 
> Since i_version is mostly treated as an opaque value, we can exploit that
> fact to avoid incrementing it when no one is watching. With that change,
> we can avoid incrementing the counter on writes, unless someone has
> queried for it since it was last incremented. If the a/c/mtime don't
> change, and the i_version hasn't changed, then there's no need to dirty
> the inode metadata on a write.
> 
> Convert the i_version counter to an atomic64_t, and use the lowest order
> bit to hold a flag that will tell whether anyone has queried the value
> since it was last incremented.
> 
> When we go to maybe increment it, we fetch the value and check the flag
> bit.  If it's clear then we don't need to do anything if the update
> isn't being forced.
> 
> If we do need to update, then we increment the counter by 2, and clear
> the flag bit, and then use a CAS op to swap it into place. If that
> works, we return true. If it doesn't then do it again with the value
> that we fetch from the CAS operation.
> 
> On the query side, if the flag is already set, then we just shift the
> value down by 1 bit and return it. Otherwise, we set the flag in our
> on-stack value and again use cmpxchg to swap it into place if it hasn't
> changed. If it has, then we use the value from the cmpxchg as the new
> "old" value and try again.
> 
> This method allows us to avoid incrementing the counter on writes (and
> dirtying the metadata) under typical workloads. We only need to increment
> if it has been queried since it was last changed.
> 
> Signed-off-by: Jeff Layton 
> Reviewed-by: Jan Kara 

Documentation helps a lot in understanding all this. Thanks for
adding it into the patch!

Acked-by: Dave Chinner 
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 17/19] xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing

2018-01-09 Thread Dave Chinner
On Tue, Jan 09, 2018 at 09:10:57AM -0500, Jeff Layton wrote:
> From: Jeff Layton 
> 
> If XFS_ILOG_CORE is already set then go ahead and increment it.
> 
> Signed-off-by: Jeff Layton 
> Acked-by: Darrick J. Wong 

Acked-by: Dave Chinner 

-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 14/19] xfs: convert to new i_version API

2018-01-09 Thread Dave Chinner
On Tue, Jan 09, 2018 at 09:10:54AM -0500, Jeff Layton wrote:
> From: Jeff Layton 
> 
> Signed-off-by: Jeff Layton 
> Acked-by: Darrick J. Wong 

Looks ok, but I haven't tested it at all.

Acked-by: Dave Chinner 

-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: use correct string length

2018-01-09 Thread David Sterba
On Mon, Jan 08, 2018 at 07:51:22PM +0800, Xiongfeng Wang wrote:
> From: Xiongfeng Wang 
> 
> gcc-8 reports
> 
> fs/btrfs/ioctl.c: In function 'btrfs_ioctl':
> ./include/linux/string.h:245:9: warning: '__builtin_strncpy' specified
> bound 1024 equals destination size [-Wstringop-truncation]
> 
> We need one less byte or call strlcpy() to make it a nul-terminated
> string.

The null termination is on the following line, so this patch fixes
namely the gcc warning:

> - strncpy(di_args->path, name->str, sizeof(di_args->path));
> + strncpy(di_args->path, name->str, sizeof(di_args->path) - 1);
>   rcu_read_unlock();
>   di_args->path[sizeof(di_args->path) - 1] = 0;

Reviewed-by: David Sterba 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 10/10] Btrfs: noinline merge_extent_mapping

2018-01-09 Thread Josef Bacik
On Fri, Jan 05, 2018 at 12:51:17PM -0700, Liu Bo wrote:
> In order to debug subtle bugs around merge_extent_mapping(), perf probe
> can be used to check the arguments, but sometimes merge_extent_mapping()
> got inlined by compiler and couldn't be probed.
> 
> This is adding noinline attribute to merge_extent_mapping().
> 
> Signed-off-by: Liu Bo 

Reviewed-by: Josef Bacik 

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 09/10] Btrfs: add tracepoint for em's EEXIST case

2018-01-09 Thread Josef Bacik
On Fri, Jan 05, 2018 at 12:51:16PM -0700, Liu Bo wrote:
> This is adding a tracepoint 'btrfs_handle_em_exist' to help debug the
> subtle bugs around merge_extent_mapping.
> 
> Signed-off-by: Liu Bo 

Reviewed-by: Josef Bacik 

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 08/10] Btrfs: add WARN_ONCE to detect unexpected error from merge_extent_mapping

2018-01-09 Thread Josef Bacik
On Fri, Jan 05, 2018 at 12:51:15PM -0700, Liu Bo wrote:
> This is a subtle case, so in order to understand the problem, it'd be good
> to know the content of existing and em when any error occurs.
> 
> Signed-off-by: Liu Bo 

Reviewed-by: Josef Bacik 

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 07/10] Btrfs: extent map selftest: dio write vs dio read

2018-01-09 Thread Josef Bacik
On Fri, Jan 05, 2018 at 12:51:14PM -0700, Liu Bo wrote:
> This test case simulates the racy situation of dio write vs dio read,
> and see if btrfs_get_extent() would return -EEXIST.
> 
> Signed-off-by: Liu Bo 

Reviewed-by: Josef Bacik 

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 06/10] Btrfs: extent map selftest: buffered write vs dio read

2018-01-09 Thread Josef Bacik
On Fri, Jan 05, 2018 at 12:51:13PM -0700, Liu Bo wrote:
> This test case simulates the racy situation of buffered write vs dio
> read, and see if btrfs_get_extent() would return -EEXIST.
> 
> Signed-off-by: Liu Bo 

Reviewed-by: Josef Bacik 

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 05/10] Btrfs: add extent map selftests

2018-01-09 Thread Josef Bacik
On Fri, Jan 05, 2018 at 12:51:12PM -0700, Liu Bo wrote:
> We've observed that btrfs_get_extent() and merge_extent_mapping() could
> return -EEXIST in several cases, and they are caused by some racy
> condition, e.g dio read vs dio write, which makes the problem very tricky
> to reproduce.
> 
> This adds extent map selftests in order to simulate those racy situations.
> 
> Signed-off-by: Liu Bo 

Reviewed-by: Josef Bacik 

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 04/10] Btrfs: move extent map specific code to extent_map.c

2018-01-09 Thread Josef Bacik
On Fri, Jan 05, 2018 at 12:51:11PM -0700, Liu Bo wrote:
> These helpers are extent map specific, move them to extent_map.c.
> 
> Signed-off-by: Liu Bo 

Reviewed-by: Josef Bacik 

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 02/10] Btrfs: fix unexpected EEXIST from btrfs_get_extent

2018-01-09 Thread Josef Bacik
On Fri, Jan 05, 2018 at 12:51:09PM -0700, Liu Bo wrote:
> This fixes a corner case that is caused by a race of dio write vs dio
> read/write.
> 
> Here is how the race could happen.
> 
> Suppose that no extent map has been loaded into memory yet.
> There is a file extent [0, 32K), two jobs are running concurrently
> against it, t1 is doing dio write to [8K, 32K) and t2 is doing dio
> read from [0, 4K) or [4K, 8K).
> 
> t1 goes ahead of t2 and splits em [0, 32K) to em [0K, 8K) and [8K 32K).
> 
> --
>  t1t2
>   btrfs_get_blocks_direct() btrfs_get_blocks_direct()
>-> btrfs_get_extent()  -> btrfs_get_extent()
>-> lookup_extent_mapping()
>-> add_extent_mapping()-> lookup_extent_mapping()
>   # load [0, 32K)
>-> btrfs_new_extent_direct()
>-> btrfs_drop_extent_cache()
>   # split [0, 32K) and
> # drop [8K, 32K)
>-> add_extent_mapping()
>   # add [8K, 32K)
>   -> add_extent_mapping()
>  # handle -EEXIST when adding
>  # [0, 32K)
> --
> About how t2(dio read/write) runs into -EEXIST:
> 
> a) add_extent_mapping() gets -EEXIST for adding em [0, 32k),
> 
> b) search_extent_mapping() then returns [0, 8k) as the existing em,
>even though start == existing->start, em is [0, 32k) so that
>extent_map_end(em) > extent_map_end(existing), i.e. 32k > 8k,
> 
> c) then it goes thru merge_extent_mapping() which tries to add a [8k, 8k)
>(with a length 0) and returns -EEXIST as [8k, 32k) is already in tree,
> 
> d) so btrfs_get_extent() ends up returning -EEXIST to dio read/write,
>which is confusing applications.
> 
> Here I conclude all the possible situations,
> 1) start < existing->start
> 
> +---+em+---+
> +--prev---+ | +-+  |
> | | | | |  |
> +-+ + +---+existing++  ++
> +
> |
> +
>  start
> 
> 2) start == existing->start
> 
>   +em+
>   | +-+  |
>   | | |  |
>   + +existing-+  +
> |
> |
> +
>  start
> 
> 3) start > existing->start && start < (existing->start + existing->len)
> 
>   +em+
>   | +-+  |
>   | | |  |
>   + +existing-+  +
>|
>|
>+
>  start
> 
> 4) start >= (existing->start + existing->len)
> 
> +---+em+---+
> | +-+  | +--next---+
> | | |  | | |
> + +---+existing++  + +-+
>   +
>   |
>   +
>start
> 
> As we can see, it turns out that if start is within existing em (front
> inclusive), then the existing em should be returned as is, otherwise,
> we try our best to merge candidate em with sibling ems to form a
> larger em (in order to reduce the total number of em).
> 
> Reported-by: David Vallender 
> Signed-off-by: Liu Bo 

Reviewed-by: Josef Bacik 

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 03/10] Btrfs: add helper for em merge logic

2018-01-09 Thread Josef Bacik
On Fri, Jan 05, 2018 at 12:51:10PM -0700, Liu Bo wrote:
> This is a prepare work for the following extent map selftest, which
> runs tests against em merge logic.
> 
> Signed-off-by: Liu Bo 

Reviewed-by: Josef Bacik 

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 01/10] Btrfs: fix incorrect block_len in merge_extent_mapping

2018-01-09 Thread Josef Bacik
On Fri, Jan 05, 2018 at 12:51:08PM -0700, Liu Bo wrote:
> %block_len could be checked on deciding if two em are mergeable.
> 
> merge_extent_mapping() has only added the front pad if the front part
> of em gets truncated, but it's possible that the end part gets
> truncated.
> 
> For both compressed extent and inline extent, em->block_len is not
> adjusted accordingly, and for regular extent, em->block_len always
> equals to em->len, hence this sets em->block_len with em->len.
> 
> Signed-off-by: Liu Bo 

Reviewed-by: Josef Bacik 

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] btrfs: add support for SUPER_FLAG_CHANGING_FSID in btrfs.ko

2018-01-09 Thread David Sterba
On Tue, Jan 09, 2018 at 09:05:42AM +0800, Anand Jain wrote:
> Userland sets SUPER_FLAG_CHANGING_FSID and resets it only when changing
> fsid is complete. Its not a good idea to mount the device anything in
> between.
> This patch doesn't add SUPER_FLAG_CHANGING_FSID into BTRFS_SUPER_FLAG_SUPP
> list, so mount would fail (along with the fix in the next patch) when
> SUPER_FLAG_CHANGING_FSID is set.
> 
> Signed-off-by: Anand Jain 
>  cc: w...@suse.com
> Reviewed-by: Qu Wenruo 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] btrfs: fail mount when sb flag is not in BTRFS_SUPER_FLAG_SUPP

2018-01-09 Thread David Sterba
On Tue, Jan 09, 2018 at 09:05:43AM +0800, Anand Jain wrote:
> It appear from the original commit [1] that there isn't any design
> specific reason not to fail the mount instead of just warning. This
> patch will change it to fail.
> 
> [1]
>  commit 319e4d0661e5323c9f9945f0f8fb5905e5fe74c3
> btrfs: Enhance super validation check
> 
> Signed-off-by: Anand Jain 
>  cc: w...@suse.com
> Reviewed-by: Qu Wenruo 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] btrfs: define SUPER_FLAG_METADUMP_V2

2018-01-09 Thread David Sterba
On Tue, Jan 09, 2018 at 09:05:41AM +0800, Anand Jain wrote:
> btrfs-progs uses super flag bit BTRFS_SUPER_FLAG_METADUMP_V2 (1ULL << 34).
> So just define that in kernel so that we know its been used.
> 
> Signed-off-by: Anand Jain 
Reviewed-by: David Sterba 
> ---
>  fs/btrfs/disk-io.c  | 3 ++-
>  include/uapi/linux/btrfs_tree.h | 1 +
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index a69e5944dc08..7cd4935f690f 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -61,7 +61,8 @@
>BTRFS_HEADER_FLAG_RELOC |\
>BTRFS_SUPER_FLAG_ERROR |\
>BTRFS_SUPER_FLAG_SEEDING |\
> -  BTRFS_SUPER_FLAG_METADUMP)
> +  BTRFS_SUPER_FLAG_METADUMP |\
> +  BTRFS_SUPER_FLAG_METADUMP_V2)
>  
>  static const struct extent_io_ops btree_extent_io_ops;
>  static void end_workqueue_fn(struct btrfs_work *work);
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index 6d6e5da51527..38ab0e06259a 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -456,6 +456,7 @@ struct btrfs_free_space_header {
>  
>  #define BTRFS_SUPER_FLAG_SEEDING (1ULL << 32)
>  #define BTRFS_SUPER_FLAG_METADUMP(1ULL << 33)
> +#define BTRFS_SUPER_FLAG_METADUMP_V2 (1ULL << 34)
>  
>  
>  /*
> -- 
> 2.15.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHSET v5] blk-mq: reimplement timeout handling

2018-01-09 Thread Jens Axboe
On 1/9/18 9:29 AM, Tejun Heo wrote:
> Hello,
> 
> Changes from [v4]
> 
> - Comments added.  Patch description updated.
> 
> Changes from [v3]
> 
> - Rebased on top of for-4.16/block.
> 
> - Integrated Jens's hctx_[un]lock() factoring patch and refreshed the
>   patches accordingly.
> 
> - Added comment explaining the use of hctx_lock() instead of
>   rcu_read_lock() in completion path.
> 
> Changes from [v2]
> 
> - Possible extended looping around seqcount and u64_stat_sync fixed.
> 
> - Misplaced MQ_RQ_IDLE state setting fixed.
> 
> - RQF_MQ_TIMEOUT_EXPIRED added to prevent firing the same timeout
>   multiple times.
> 
> - s/queue_rq_src/srcu/ patch added.
> 
> - Other misc changes.
> 
> Changes from [v1]
> 
> - BLK_EH_RESET_TIMER handling fixed.
> 
> - s/request->gstate_seqc/request->gstate_seq/
> 
> - READ_ONCE() added to blk_mq_rq_udpate_state().
> 
> - Removed left over blk_clear_rq_complete() invocation from
>   blk_mq_rq_timed_out().
> 
> Currently, blk-mq timeout path synchronizes against the usual
> issue/completion path using a complex scheme involving atomic
> bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> rules.  Unfortunatley, it contains quite a few holes.
> 
> It's pretty easy to make blk_mq_check_expired() terminate a later
> instance of a request.  If we induce 5 sec delay before
> time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
> 2s, and issue back-to-back large IOs, blk-mq starts timing out
> requests spuriously pretty quickly.  Nothing actually timed out.  It
> just made the call on a recycle instance of a request and then
> terminated a later instance long after the original instance finished.
> The scenario isn't theoretical either.
> 
> This patchset replaces the broken synchronization mechanism with a RCU
> and generation number based one.  Please read the patch description of
> the second path for more details.

Applied for 4.16, and added a patch to silence that false positive
srcu_idx for hctx_unlock() that got reported.

This grows the request a bit, which sucks, but probably unavoidable.
There's some room for improvement with a hole or two, however.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHSET v5] blk-mq: reimplement timeout handling

2018-01-09 Thread Tejun Heo
Hello,

Changes from [v4]

- Comments added.  Patch description updated.

Changes from [v3]

- Rebased on top of for-4.16/block.

- Integrated Jens's hctx_[un]lock() factoring patch and refreshed the
  patches accordingly.

- Added comment explaining the use of hctx_lock() instead of
  rcu_read_lock() in completion path.

Changes from [v2]

- Possible extended looping around seqcount and u64_stat_sync fixed.

- Misplaced MQ_RQ_IDLE state setting fixed.

- RQF_MQ_TIMEOUT_EXPIRED added to prevent firing the same timeout
  multiple times.

- s/queue_rq_src/srcu/ patch added.

- Other misc changes.

Changes from [v1]

- BLK_EH_RESET_TIMER handling fixed.

- s/request->gstate_seqc/request->gstate_seq/

- READ_ONCE() added to blk_mq_rq_udpate_state().

- Removed left over blk_clear_rq_complete() invocation from
  blk_mq_rq_timed_out().

Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules.  Unfortunatley, it contains quite a few holes.

It's pretty easy to make blk_mq_check_expired() terminate a later
instance of a request.  If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly.  Nothing actually timed out.  It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.

This patchset replaces the broken synchronization mechanism with a RCU
and generation number based one.  Please read the patch description of
the second path for more details.

This patchset contains the following eight patches.

0001-blk-mq-move-hctx-lock-unlock-into-a-helper.patch
0002-blk-mq-protect-completion-path-with-RCU.patch
0003-blk-mq-replace-timeout-synchronization-with-a-RCU-an.patch
0004-blk-mq-use-blk_mq_rq_state-instead-of-testing-REQ_AT.patch
0005-blk-mq-make-blk_abort_request-trigger-timeout-path.patch
0006-blk-mq-remove-REQ_ATOM_COMPLETE-usages-from-blk-mq.patch
0007-blk-mq-remove-REQ_ATOM_STARTED.patch
0008-blk-mq-rename-blk_mq_hw_ctx-queue_rq_srcu-to-srcu.patch

and is available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git blk-mq-timeout-v5

diffstat follows.  Thanks.

 block/blk-core.c   |2 
 block/blk-mq-debugfs.c |4 
 block/blk-mq.c |  346 +
 block/blk-mq.h |   49 ++
 block/blk-timeout.c|   16 +-
 block/blk.h|7 
 include/linux/blk-mq.h |3 
 include/linux/blkdev.h |   25 +++
 8 files changed, 294 insertions(+), 158 deletions(-)

--
tejun

[v4] http://lkml.kernel.org/r/20180108191542.379478-1...@kernel.org
[v3] http://lkml.kernel.org/r/20171216120726.517153-1...@kernel.org
[v2] http://lkml.kernel.org/r/20171010155441.753966-1...@kernel.org
[v1] http://lkml.kernel.org/r/20171209192525.982030-1...@kernel.org
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/8] blk-mq: move hctx lock/unlock into a helper

2018-01-09 Thread Tejun Heo
From: Jens Axboe 

Move the RCU vs SRCU logic into lock/unlock helpers, which makes
the actual functional bits within the locked region much easier
to read.

tj: Reordered in front of timeout revamp patches and added the missing
blk_mq_run_hw_queue() conversion.

Signed-off-by: Jens Axboe 
Signed-off-by: Tejun Heo 
---
 block/blk-mq.c | 66 --
 1 file changed, 32 insertions(+), 34 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 111e1aa..ddc9261 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -557,6 +557,22 @@ static void __blk_mq_complete_request(struct request *rq)
put_cpu();
 }
 
+static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
+{
+   if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+   rcu_read_unlock();
+   else
+   srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
+}
+
+static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
+{
+   if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+   rcu_read_lock();
+   else
+   *srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
+}
+
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:the request being processed
@@ -1214,17 +1230,11 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx 
*hctx)
 */
WARN_ON_ONCE(in_interrupt());
 
-   if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
-   rcu_read_lock();
-   blk_mq_sched_dispatch_requests(hctx);
-   rcu_read_unlock();
-   } else {
-   might_sleep();
+   might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
 
-   srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
-   blk_mq_sched_dispatch_requests(hctx);
-   srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
-   }
+   hctx_lock(hctx, _idx);
+   blk_mq_sched_dispatch_requests(hctx);
+   hctx_unlock(hctx, srcu_idx);
 }
 
 /*
@@ -1296,17 +1306,10 @@ bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, 
bool async)
 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
 * quiesced.
 */
-   if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
-   rcu_read_lock();
-   need_run = !blk_queue_quiesced(hctx->queue) &&
-   blk_mq_hctx_has_pending(hctx);
-   rcu_read_unlock();
-   } else {
-   srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
-   need_run = !blk_queue_quiesced(hctx->queue) &&
-   blk_mq_hctx_has_pending(hctx);
-   srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
-   }
+   hctx_lock(hctx, _idx);
+   need_run = !blk_queue_quiesced(hctx->queue) &&
+   blk_mq_hctx_has_pending(hctx);
+   hctx_unlock(hctx, srcu_idx);
 
if (need_run) {
__blk_mq_delay_run_hw_queue(hctx, async, 0);
@@ -1618,7 +1621,7 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx 
*hctx, struct request *rq)
 
 static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
struct request *rq,
-   blk_qc_t *cookie, bool may_sleep)
+   blk_qc_t *cookie)
 {
struct request_queue *q = rq->q;
struct blk_mq_queue_data bd = {
@@ -1668,25 +1671,20 @@ static void __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
}
 
 insert:
-   blk_mq_sched_insert_request(rq, false, run_queue, false, may_sleep);
+   blk_mq_sched_insert_request(rq, false, run_queue, false,
+   hctx->flags & BLK_MQ_F_BLOCKING);
 }
 
 static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
struct request *rq, blk_qc_t *cookie)
 {
-   if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
-   rcu_read_lock();
-   __blk_mq_try_issue_directly(hctx, rq, cookie, false);
-   rcu_read_unlock();
-   } else {
-   unsigned int srcu_idx;
+   int srcu_idx;
 
-   might_sleep();
+   might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
 
-   srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
-   __blk_mq_try_issue_directly(hctx, rq, cookie, true);
-   srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
-   }
+   hctx_lock(hctx, _idx);
+   __blk_mq_try_issue_directly(hctx, rq, cookie);
+   hctx_unlock(hctx, srcu_idx);
 }
 
 static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
-- 
2.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/8] blk-mq: protect completion path with RCU

2018-01-09 Thread Tejun Heo
Currently, blk-mq protects only the issue path with RCU.  This patch
puts the completion path under the same RCU protection.  This will be
used to synchronize issue/completion against timeout by later patches,
which will also add the comments.

Signed-off-by: Tejun Heo 
---
 block/blk-mq.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ddc9261..6741c3e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -584,11 +584,16 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int 
*srcu_idx)
 void blk_mq_complete_request(struct request *rq)
 {
struct request_queue *q = rq->q;
+   struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
+   int srcu_idx;
 
if (unlikely(blk_should_fake_timeout(q)))
return;
+
+   hctx_lock(hctx, _idx);
if (!blk_mark_rq_complete(rq))
__blk_mq_complete_request(rq);
+   hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
-- 
2.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/8] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2018-01-09 Thread Tejun Heo
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules.  Unfortunately, it contains quite a few holes.

There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.

In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request.  If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly.  Nothing actually timed out.  It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.

This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.

1. Each request has a u64 generation + state value, which can be
   updated only by the request owner.  Whenever a request becomes
   in-flight, the generation number gets bumped up too.  This provides
   the basis for the timeout path to distinguish different recycle
   instances of the request.

   Also, marking a request in-flight and setting its deadline are
   protected with a seqcount so that the timeout path can fetch both
   values coherently.

2. The timeout path fetches the generation, state and deadline.  If
   the verdict is timeout, it records the generation into a dedicated
   request abortion field and does RCU wait.

3. The completion path is also protected by RCU (from the previous
   patch) and checks whether the current generation number and state
   match the abortion field.  If so, it skips completion.

4. The timeout path, after RCU wait, scans requests again and
   terminates the ones whose generation and state still match the ones
   requested for abortion.

   By now, the timeout path knows that either the generation number
   and state changed if it lost the race or the completion will yield
   to it and can safely timeout the request.

While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.

While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places.  Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.

Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path.  The race has always been there and this
patch doesn't change it.  It's just documenting the existing race.

v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.

v3: - Fixed possible extended seqcount / u64_stats_sync read looping
  spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
  of free_request.  Fixed.

v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.

v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
  commit message.

Signed-off-by: Tejun Heo 
Cc: "jianchao.wang" 
Cc: Peter Zijlstra 
Cc: Christoph Hellwig 
Cc: Bart Van Assche 
---
 block/blk-core.c   |   2 +
 block/blk-mq.c | 229 +
 block/blk-mq.h |  46 ++
 block/blk-timeout.c|   2 +-
 block/blk.h|   6 --
 include/linux/blk-mq.h |   1 +
 include/linux/blkdev.h |  23 +
 7 files changed, 230 insertions(+), 79 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2e0d041..f843ae4f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -126,6 +126,8 @@ void blk_rq_init(struct request_queue *q, struct request 
*rq)
rq->start_time = jiffies;
set_start_time_ns(rq);
rq->part = NULL;
+   seqcount_init(>gstate_seq);
+   u64_stats_init(>aborted_gstate_sync);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6741c3e..052fee5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -483,6 +483,7 @@ void blk_mq_free_request(struct request *rq)
if (blk_rq_rl(rq))
blk_put_rl(blk_rq_rl(rq));
 
+   

[PATCH 5/8] blk-mq: make blk_abort_request() trigger timeout path

2018-01-09 Thread Tejun Heo
With issue/complete and timeout paths now using the generation number
and state based synchronization, blk_abort_request() is the only one
which depends on REQ_ATOM_COMPLETE for arbitrating completion.

There's no reason for blk_abort_request() to be a completely separate
path.  This patch makes blk_abort_request() piggyback on the timeout
path instead of trying to terminate the request directly.

This removes the last dependency on REQ_ATOM_COMPLETE in blk-mq.

Note that this makes blk_abort_request() asynchronous - it initiates
abortion but the actual termination will happen after a short while,
even when the caller owns the request.  AFAICS, SCSI and ATA should be
fine with that and I think mtip32xx and dasd should be safe but not
completely sure.  It'd be great if people who know the drivers take a
look.

v2: - Add comment explaining the lack of synchronization around
  ->deadline update as requested by Bart.

Signed-off-by: Tejun Heo 
Cc: Asai Thambi SP 
Cc: Stefan Haberland 
Cc: Jan Hoeppner 
Cc: Bart Van Assche 
---
 block/blk-mq.c  |  2 +-
 block/blk-mq.h  |  2 --
 block/blk-timeout.c | 13 +
 3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 51e9704..b419746 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -820,7 +820,7 @@ struct blk_mq_timeout_data {
unsigned int nr_expired;
 };
 
-void blk_mq_rq_timed_out(struct request *req, bool reserved)
+static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 {
const struct blk_mq_ops *ops = req->q->mq_ops;
enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index cf01f6f..6b2d616 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -94,8 +94,6 @@ extern int blk_mq_sysfs_register(struct request_queue *q);
 extern void blk_mq_sysfs_unregister(struct request_queue *q);
 extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
 
-extern void blk_mq_rq_timed_out(struct request *req, bool reserved);
-
 void blk_mq_release(struct request_queue *q);
 
 /**
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 6427be7..4f04cd1 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -156,12 +156,17 @@ void blk_timeout_work(struct work_struct *work)
  */
 void blk_abort_request(struct request *req)
 {
-   if (blk_mark_rq_complete(req))
-   return;
-
if (req->q->mq_ops) {
-   blk_mq_rq_timed_out(req, false);
+   /*
+* All we need to ensure is that timeout scan takes place
+* immediately and that scan sees the new timeout value.
+* No need for fancy synchronizations.
+*/
+   req->deadline = jiffies;
+   mod_timer(>q->timeout, 0);
} else {
+   if (blk_mark_rq_complete(req))
+   return;
blk_delete_timer(req);
blk_rq_timed_out(req);
}
-- 
2.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/8] blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETE

2018-01-09 Thread Tejun Heo
blk_mq_check_inflight() and blk_mq_poll_hybrid_sleep() test
REQ_ATOM_COMPLETE to determine the request state.  Both uses are
speculative and we can test REQ_ATOM_STARTED and blk_mq_rq_state() for
equivalent results.  Replace the tests.  This will allow removing
REQ_ATOM_COMPLETE usages from blk-mq.

Signed-off-by: Tejun Heo 
---
 block/blk-mq.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 052fee5..51e9704 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -95,8 +95,7 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
 {
struct mq_inflight *mi = priv;
 
-   if (test_bit(REQ_ATOM_STARTED, >atomic_flags) &&
-   !test_bit(REQ_ATOM_COMPLETE, >atomic_flags)) {
+   if (blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT) {
/*
 * index[0] counts the specific partition that was asked
 * for. index[1] counts the ones that are active on the
@@ -3023,7 +3022,8 @@ static bool blk_mq_poll_hybrid_sleep(struct request_queue 
*q,
 
hrtimer_init_sleeper(, current);
do {
-   if (test_bit(REQ_ATOM_COMPLETE, >atomic_flags))
+   if (test_bit(REQ_ATOM_STARTED, >atomic_flags) &&
+   blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT)
break;
set_current_state(TASK_UNINTERRUPTIBLE);
hrtimer_start_expires(, mode);
-- 
2.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 6/8] blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq

2018-01-09 Thread Tejun Heo
After the recent updates to use generation number and state based
synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE except
to avoid firing the same timeout multiple times.

Remove all REQ_ATOM_COMPLETE usages and use a new rq_flags flag
RQF_MQ_TIMEOUT_EXPIRED to avoid firing the same timeout multiple
times.  This removes atomic bitops from hot paths too.

v2: Removed blk_clear_rq_complete() from blk_mq_rq_timed_out().

v3: Added RQF_MQ_TIMEOUT_EXPIRED flag.

Signed-off-by: Tejun Heo 
Cc: "jianchao.wang" 
---
 block/blk-mq.c | 15 +++
 block/blk-timeout.c|  1 +
 include/linux/blkdev.h |  2 ++
 3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b419746..3f297f0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -634,8 +634,7 @@ void blk_mq_complete_request(struct request *rq)
 * hctx_lock() covers both issue and completion paths.
 */
hctx_lock(hctx, _idx);
-   if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
-   !blk_mark_rq_complete(rq))
+   if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);
hctx_unlock(hctx, srcu_idx);
 }
@@ -685,8 +684,6 @@ void blk_mq_start_request(struct request *rq)
preempt_enable();
 
set_bit(REQ_ATOM_STARTED, >atomic_flags);
-   if (test_bit(REQ_ATOM_COMPLETE, >atomic_flags))
-   clear_bit(REQ_ATOM_COMPLETE, >atomic_flags);
 
if (q->dma_drain_size && blk_rq_bytes(rq)) {
/*
@@ -837,6 +834,8 @@ static void blk_mq_rq_timed_out(struct request *req, bool 
reserved)
if (!test_bit(REQ_ATOM_STARTED, >atomic_flags))
return;
 
+   req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
+
if (ops->timeout)
ret = ops->timeout(req, reserved);
 
@@ -852,7 +851,6 @@ static void blk_mq_rq_timed_out(struct request *req, bool 
reserved)
 */
blk_mq_rq_update_aborted_gstate(req, 0);
blk_add_timer(req);
-   blk_clear_rq_complete(req);
break;
case BLK_EH_NOT_HANDLED:
break;
@@ -871,7 +869,8 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 
might_sleep();
 
-   if (!test_bit(REQ_ATOM_STARTED, >atomic_flags))
+   if ((rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) ||
+   !test_bit(REQ_ATOM_STARTED, >atomic_flags))
return;
 
/* read coherent snapshots of @rq->state_gen and @rq->deadline */
@@ -906,8 +905,8 @@ static void blk_mq_terminate_expired(struct blk_mq_hw_ctx 
*hctx,
 * now guaranteed to see @rq->aborted_gstate and yield.  If
 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
 */
-   if (READ_ONCE(rq->gstate) == rq->aborted_gstate &&
-   !blk_mark_rq_complete(rq))
+   if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
+   READ_ONCE(rq->gstate) == rq->aborted_gstate)
blk_mq_rq_timed_out(rq, reserved);
 }
 
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 4f04cd1..ebe9996 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -214,6 +214,7 @@ void blk_add_timer(struct request *req)
req->timeout = q->rq_timeout;
 
req->deadline = jiffies + req->timeout;
+   req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
 
/*
 * Only the non-mq case needs to add the request to a protected list.
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ae563d0..007a7cf 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -125,6 +125,8 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_SPECIAL_PAYLOAD((__force req_flags_t)(1 << 18))
 /* The per-zone write lock is held for this request */
 #define RQF_ZONE_WRITE_LOCKED  ((__force req_flags_t)(1 << 19))
+/* timeout is expired */
+#define RQF_MQ_TIMEOUT_EXPIRED ((__force req_flags_t)(1 << 20))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \
-- 
2.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 7/8] blk-mq: remove REQ_ATOM_STARTED

2018-01-09 Thread Tejun Heo
After the recent updates to use generation number and state based
synchronization, we can easily replace REQ_ATOM_STARTED usages by
adding an extra state to distinguish completed but not yet freed
state.

Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with
blk_mq_rq_state() tests.  REQ_ATOM_STARTED no longer has any users
left and is removed.

Signed-off-by: Tejun Heo 
---
 block/blk-mq-debugfs.c |  4 +---
 block/blk-mq.c | 37 -
 block/blk-mq.h |  1 +
 block/blk.h|  1 -
 4 files changed, 10 insertions(+), 33 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index b56a4f3..8adc837 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -271,7 +271,6 @@ static const char *const cmd_flag_name[] = {
 #define RQF_NAME(name) [ilog2((__force u32)RQF_##name)] = #name
 static const char *const rqf_name[] = {
RQF_NAME(SORTED),
-   RQF_NAME(STARTED),
RQF_NAME(QUEUED),
RQF_NAME(SOFTBARRIER),
RQF_NAME(FLUSH_SEQ),
@@ -295,7 +294,6 @@ static const char *const rqf_name[] = {
 #define RQAF_NAME(name) [REQ_ATOM_##name] = #name
 static const char *const rqaf_name[] = {
RQAF_NAME(COMPLETE),
-   RQAF_NAME(STARTED),
RQAF_NAME(POLL_SLEPT),
 };
 #undef RQAF_NAME
@@ -409,7 +407,7 @@ static void hctx_show_busy_rq(struct request *rq, void 
*data, bool reserved)
const struct show_busy_params *params = data;
 
if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
-   test_bit(REQ_ATOM_STARTED, >atomic_flags))
+   blk_mq_rq_state(rq) != MQ_RQ_IDLE)
__blk_mq_debugfs_rq_show(params->m,
 list_entry_rq(>queuelist));
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3f297f0..e6e74c0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -483,7 +483,6 @@ void blk_mq_free_request(struct request *rq)
blk_put_rl(blk_rq_rl(rq));
 
blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
-   clear_bit(REQ_ATOM_STARTED, >atomic_flags);
clear_bit(REQ_ATOM_POLL_SLEPT, >atomic_flags);
if (rq->tag != -1)
blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
@@ -531,6 +530,7 @@ static void __blk_mq_complete_request(struct request *rq)
int cpu;
 
WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
+   blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE);
 
if (rq->internal_tag != -1)
blk_mq_sched_completed_request(rq);
@@ -642,7 +642,7 @@ EXPORT_SYMBOL(blk_mq_complete_request);
 
 int blk_mq_request_started(struct request *rq)
 {
-   return test_bit(REQ_ATOM_STARTED, >atomic_flags);
+   return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
 }
 EXPORT_SYMBOL_GPL(blk_mq_request_started);
 
@@ -661,7 +661,6 @@ void blk_mq_start_request(struct request *rq)
}
 
WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
-   WARN_ON_ONCE(test_bit(REQ_ATOM_STARTED, >atomic_flags));
 
/*
 * Mark @rq in-flight which also advances the generation number,
@@ -683,8 +682,6 @@ void blk_mq_start_request(struct request *rq)
write_seqcount_end(>gstate_seq);
preempt_enable();
 
-   set_bit(REQ_ATOM_STARTED, >atomic_flags);
-
if (q->dma_drain_size && blk_rq_bytes(rq)) {
/*
 * Make sure space for the drain appears.  We know we can do
@@ -697,13 +694,9 @@ void blk_mq_start_request(struct request *rq)
 EXPORT_SYMBOL(blk_mq_start_request);
 
 /*
- * When we reach here because queue is busy, REQ_ATOM_COMPLETE
- * flag isn't set yet, so there may be race with timeout handler,
- * but given rq->deadline is just set in .queue_rq() under
- * this situation, the race won't be possible in reality because
- * rq->timeout should be set as big enough to cover the window
- * between blk_mq_start_request() called from .queue_rq() and
- * clearing REQ_ATOM_STARTED here.
+ * When we reach here because queue is busy, it's safe to change the state
+ * to IDLE without checking @rq->aborted_gstate because we should still be
+ * holding the RCU read lock and thus protected against timeout.
  */
 static void __blk_mq_requeue_request(struct request *rq)
 {
@@ -715,7 +708,7 @@ static void __blk_mq_requeue_request(struct request *rq)
wbt_requeue(q->rq_wb, >issue_stat);
blk_mq_sched_requeue_request(rq);
 
-   if (test_and_clear_bit(REQ_ATOM_STARTED, >atomic_flags)) {
+   if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
if (q->dma_drain_size && blk_rq_bytes(rq))
rq->nr_phys_segments--;
@@ -822,18 +815,6 @@ static void blk_mq_rq_timed_out(struct request *req, bool 
reserved)
const struct blk_mq_ops *ops = req->q->mq_ops;
enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
 
-   /*
-* We know that complete is set at this point. If STARTED 

Re: [PATCH 4/6] btrfs: make btrfs_free_stale_device() argument optional

2018-01-09 Thread Josef Bacik
On Tue, Jan 09, 2018 at 10:13:12PM +0800, Anand Jain wrote:
> From: Anand Jain 
> 
> This updates btrfs_free_stale_device() helper function to delete all
> unmouted devices, when arg is NULL.
> 
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/volumes.c | 14 ++
>  1 file changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 7b253da6c0a4..7646f8860096 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -611,9 +611,6 @@ static void btrfs_free_stale_device(struct btrfs_device 
> *cur_dev)
>   struct btrfs_fs_devices *fs_devs, *tmp_fs_devs;
>   struct btrfs_device *dev, *tmp_dev;
>  
> - if (!cur_dev->name)
> - return;
> -
>   list_for_each_entry_safe(fs_devs, tmp_fs_devs, _uuids, list) {
>  
>   if (fs_devs->opened)
> @@ -623,9 +620,7 @@ static void btrfs_free_stale_device(struct btrfs_device 
> *cur_dev)
>_devs->devices, dev_list) {
>   int not_found;

Change this to

int not_found = 0;

>  
> - if (dev == cur_dev)
> - continue;
> - if (!dev->name)
> + if (cur_dev && (cur_dev == dev || !dev->name))
>   continue;
>  
>   /*
> @@ -635,8 +630,11 @@ static void btrfs_free_stale_device(struct btrfs_device 
> *cur_dev)
>* either use mapper or non mapper path throughout.
>*/
>   rcu_read_lock();
> - not_found = strcmp(rcu_str_deref(dev->name),
> - rcu_str_deref(cur_dev->name));
> + if (cur_dev)
> + not_found = strcmp(rcu_str_deref(dev->name),
> +
> rcu_str_deref(cur_dev->name));
> + else
> + not_found = 0;

And drop the else part of this.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/8] blk-mq: protect completion path with RCU

2018-01-09 Thread Jens Axboe
On 1/9/18 9:19 AM, t...@kernel.org wrote:
> Hello, Bart.
> 
> On Tue, Jan 09, 2018 at 04:12:40PM +, Bart Van Assche wrote:
>> I'm concerned about the additional CPU cycles needed for the new 
>> blk_mq_map_queue()
>> call, although I know this call is cheap. Would the timeout code really get 
>> that
> 
> So, if that is really a concern, let's cache that mapping instead of
> changing synchronization rules for that.
> 
>> much more complicated if the hctx_lock() and hctx_unlock() calls would be 
>> changed
>> into rcu_read_lock() and rcu_read_unlock() calls? Would it be sufficient if
>> "if (has_rcu) synchronize_rcu();" would be changed into "synchronize_rcu();" 
>> in
>> blk_mq_timeout_work()?
> 
> Code-wise, it won't be too much extra code but I think diverging the
> sync methods between issue and completion paths is more fragile and
> likely to invite confusions and mistakes in the future.  We have the
> normal path (issue) synchronizing against the exception
> path (timeout).  I think it's best to keep the sync constructs aligned
> with that conceptual picture.

Guys, the map call is FINE. We do it at least once per request anyway,
usually multiple times. If it's too expensive, then THAT is what needs
fixing, not adding an arbitrary caching of that mapping. Look at the
actual code:

return q->queue_hw_ctx[q->mq_map[cpu]];

that's it. It's just a double index.

So let's put this thing to rest right now - the map call is perfectly
fine, especially since the alternatives are either bloating struct
request or making the code less maintainable.

Foot -> down.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/6] btrfs: make btrfs_free_stale_device() to iterate all stales

2018-01-09 Thread Josef Bacik
On Tue, Jan 09, 2018 at 10:13:11PM +0800, Anand Jain wrote:
> From: Anand Jain 
> 
> Let the list iterator iterate further and find other stale
> devices and delete it. This is in preparation to add support
> for user land request-able stale devices cleanup.
> 
> Signed-off-by: Anand Jain 

If we're going to do this then please change the function name to
btrfs_free_stale_devices() so it's clear what its intentions are.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/8] blk-mq: protect completion path with RCU

2018-01-09 Thread t...@kernel.org
Hello, Bart.

On Tue, Jan 09, 2018 at 04:12:40PM +, Bart Van Assche wrote:
> I'm concerned about the additional CPU cycles needed for the new 
> blk_mq_map_queue()
> call, although I know this call is cheap. Would the timeout code really get 
> that

So, if that is really a concern, let's cache that mapping instead of
changing synchronization rules for that.

> much more complicated if the hctx_lock() and hctx_unlock() calls would be 
> changed
> into rcu_read_lock() and rcu_read_unlock() calls? Would it be sufficient if
> "if (has_rcu) synchronize_rcu();" would be changed into "synchronize_rcu();" 
> in
> blk_mq_timeout_work()?

Code-wise, it won't be too much extra code but I think diverging the
sync methods between issue and completion paths is more fragile and
likely to invite confusions and mistakes in the future.  We have the
normal path (issue) synchronizing against the exception
path (timeout).  I think it's best to keep the sync constructs aligned
with that conceptual picture.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/6] btrfs: no need to check for btrfs_fs_devices::seeding

2018-01-09 Thread Josef Bacik
On Tue, Jan 09, 2018 at 10:13:10PM +0800, Anand Jain wrote:
> There is no need to check for btrfs_fs_devices::seeding when we
> have checked for btrfs_fs_devices::opened, because we can't sprout
> without its seed FS being opened.
> 
> Signed-off-by: Anand Jain 

Reviewed-by: Josef Bacik 

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/6] btrfs: cleanup btrfs_free_stale_device() usage

2018-01-09 Thread Josef Bacik
On Tue, Jan 09, 2018 at 10:13:09PM +0800, Anand Jain wrote:
> We call btrfs_free_stale_device() only when we alloc a new
> struct btrfs_device (ret=1), so move it closer to where we
> alloc the new device. Also drop the comments.
> 
> Signed-off-by: Anand Jain 

Reviewed-by: Josef Bacik 

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5] btrfs: move uuid_mutex closer to exclusivity

2018-01-09 Thread Josef Bacik
On Tue, Jan 09, 2018 at 10:46:25PM +0800, Anand Jain wrote:
> move uuid_mutex with in device_list_add().
> 
> Signed-off-by: Anand Jain 

This isn't going to work, there's a bunch of places we return errors so this
just deadlocks the box.  Leave it like it is, it's not hurting anybody leaving
it like this.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/5] btrfs: get device pointer from device_list_add()

2018-01-09 Thread Josef Bacik
On Tue, Jan 09, 2018 at 10:46:23PM +0800, Anand Jain wrote:
> Instead of pointer to btrfs_fs_devices from device_list_add() its
> better to get pointer to btrfs_device, then we have both, pointer
> to btrfs_device and btrfs_fs_devices. This is needed in preparation
> to add handling reappearing missing device feature.
> 
> Signed-off-by: Anand Jain 

Can we just change device_list_add to return struct btrfs_device * instead, and
just do PTR_ERR() if there's a problem?  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/8] blk-mq: protect completion path with RCU

2018-01-09 Thread Bart Van Assche
On Mon, 2018-01-08 at 11:15 -0800, Tejun Heo wrote:
> Currently, blk-mq protects only the issue path with RCU.  This patch
> puts the completion path under the same RCU protection.  This will be
> used to synchronize issue/completion against timeout by later patches,
> which will also add the comments.
> 
> Signed-off-by: Tejun Heo 
> ---
>  block/blk-mq.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index ddc9261..6741c3e 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -584,11 +584,16 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int 
> *srcu_idx)
>  void blk_mq_complete_request(struct request *rq)
>  {
>   struct request_queue *q = rq->q;
> + struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
> + int srcu_idx;
>  
>   if (unlikely(blk_should_fake_timeout(q)))
>   return;
> +
> + hctx_lock(hctx, _idx);
>   if (!blk_mark_rq_complete(rq))
>   __blk_mq_complete_request(rq);
> + hctx_unlock(hctx, srcu_idx);
>  }
>  EXPORT_SYMBOL(blk_mq_complete_request);

Hello Tejun,

I'm concerned about the additional CPU cycles needed for the new 
blk_mq_map_queue()
call, although I know this call is cheap. Would the timeout code really get that
much more complicated if the hctx_lock() and hctx_unlock() calls would be 
changed
into rcu_read_lock() and rcu_read_unlock() calls? Would it be sufficient if
"if (has_rcu) synchronize_rcu();" would be changed into "synchronize_rcu();" in
blk_mq_timeout_work()?

Thanks,

Bart.N�r��yb�X��ǧv�^�)޺{.n�+{�n�߲)w*jg����ݢj/���z�ޖ��2�ޙ&�)ߡ�a�����G���h��j:+v���w��٥

Re: [PATCH 1/5] btrfs: move pr_info into device_list_add

2018-01-09 Thread Josef Bacik
On Tue, Jan 09, 2018 at 10:46:21PM +0800, Anand Jain wrote:
> Commit 60999ca4b403 ("btrfs: make device scan less noisy")
> adds return value 1 to device_list_add(), so that parent function can
> call pr_info only when new device is added. Move the pr_info() part
> into device_list_add() so that this function can be kept simple.
> 
> Signed-off-by: Anand Jain 

Reviewed-by: Josef Bacik 

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] btrfs: Fix slight df misreporting when listing mixed block groups fs

2018-01-09 Thread Josef Bacik
On Tue, Jan 09, 2018 at 06:04:05PM +0200, Nikolay Borisov wrote:
> Currently when df (resp. statfs) is executed on a btrfs instance the
> free space is calculated as the freespace of the block groups +
> any unallocated space on the devices, constituting this filesystem.
> There is a catch, however, in that the unallocated space from the
> devices has 1 mb subtracted from each device to closely follow btrfs'
> allocator behavior. This is all good except that when we have a file
> system and the system chunk's allocation profile is single (as is the
> default in mixed mode) it occupies the range 0..4mb and thus implicitly
> prevents allocation starting in the range 0..1mb. In those cases
> additionally subtracting 1mb during statfs makes us misreport the
> actual free space and this causes generic/015 to fail.
> 
> Signed-off-by: Nikolay Borisov 

This is fine by me,

Reviewed-by: Josef Bacik 

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/8] blk-mq: make blk_abort_request() trigger timeout path

2018-01-09 Thread t...@kernel.org
On Mon, Jan 08, 2018 at 10:10:01PM +, Bart Van Assche wrote:
> Other req->deadline writes are protected by preempt_disable(),
> write_seqcount_begin(>gstate_seq), write_seqcount_end(>gstate_seq)
> and preempt_enable(). I think it's fine that the above req->deadline store
> does not have that protection but I also think that that deserves a comment.

Will add.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH] btrfs: Fix slight df misreporting when listing mixed block groups fs

2018-01-09 Thread Nikolay Borisov
Currently when df (resp. statfs) is executed on a btrfs instance the
free space is calculated as the freespace of the block groups +
any unallocated space on the devices, constituting this filesystem.
There is a catch, however, in that the unallocated space from the
devices has 1 mb subtracted from each device to closely follow btrfs'
allocator behavior. This is all good except that when we have a file
system and the system chunk's allocation profile is single (as is the
default in mixed mode) it occupies the range 0..4mb and thus implicitly
prevents allocation starting in the range 0..1mb. In those cases
additionally subtracting 1mb during statfs makes us misreport the
actual free space and this causes generic/015 to fail.

Signed-off-by: Nikolay Borisov 
---
Hello,  

So with this patch and adding sync to generic/015 (which is a different issue)  
I can make generic/015 pass. In any case I think this behavior is correct   
though frankly I feel it's not really worth it, in the worst case we are
underreporting by at most 1mb per device.

 fs/btrfs/super.c | 25 +
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f33a55272deb..a4dfc9701220 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1925,12 +1925,13 @@ static int btrfs_calc_avail_data_space(struct 
btrfs_fs_info *fs_info,
struct btrfs_device_info *devices_info;
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
struct btrfs_device *device;
-   u64 skip_space;
u64 type;
u64 avail_space;
u64 min_stripe_size;
int min_stripes = 1, num_stripes = 1;
int i = 0, nr_devices;
+   bool sys_chunk_single = !(btrfs_system_alloc_profile(fs_info) &
+ ~BTRFS_BLOCK_GROUP_SYSTEM);
 
/*
 * We aren't under the device list lock, so this is racy-ish, but good
@@ -1988,18 +1989,18 @@ static int btrfs_calc_avail_data_space(struct 
btrfs_fs_info *fs_info,
/*
 * In order to avoid overwriting the superblock on the drive,
 * btrfs starts at an offset of at least 1MB when doing chunk
-* allocation.
+* allocation. There is one notable exception, and that is
+* when the System chunk has allocation profile of single. In
+* this case it occupies 0..4m  and we don't really need to
+* subtract 1mb as it's implied
 */
-   skip_space = SZ_1M;
-
-   /*
-* we can use the free space in [0, skip_space - 1], subtract
-* it from the total.
-*/
-   if (avail_space && avail_space >= skip_space)
-   avail_space -= skip_space;
-   else
-   avail_space = 0;
+   if (device->devid != 1 ||
+   (!sys_chunk_single && device->devid == 1)) {
+   if (avail_space && avail_space >= SZ_1M)
+   avail_space -= SZ_1M;
+   else
+   avail_space = 0;
+   }
 
if (avail_space < min_stripe_size)
continue;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/8] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2018-01-09 Thread t...@kernel.org
On Mon, Jan 08, 2018 at 11:29:11PM +, Bart Van Assche wrote:
> Does "gstate" perhaps stand for "generation number and state"? If so, please
> mention this in one of the above comments.

Yeah, will do.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/8] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2018-01-09 Thread t...@kernel.org
On Mon, Jan 08, 2018 at 09:06:55PM +, Bart Van Assche wrote:
> On Mon, 2018-01-08 at 11:15 -0800, Tejun Heo wrote:
> > +static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
> > +{
> > +   unsigned long flags;
> > +
> > +   local_irq_save(flags);
> > +   u64_stats_update_begin(>aborted_gstate_sync);
> > +   rq->aborted_gstate = gstate;
> > +   u64_stats_update_end(>aborted_gstate_sync);
> > +   local_irq_restore(flags);
> > +}
> 
> Please add a comment that explains the purpose of local_irq_save() and
> local_irq_restore(). Please also explain why you chose to disable interrupts

Will do.

> instead of disabling preemption. I think that disabling preemption should be
> sufficient since this is the only code that updates rq->aborted_gstate and
> since this function is always called from thread context.

blk_mq_complete_request() can read it from the irq context.  If that
happens between update_begin and end, it'll end up looping infinitely.

> > @@ -801,6 +840,12 @@ void blk_mq_rq_timed_out(struct request *req, bool 
> > reserved)
> > __blk_mq_complete_request(req);
> > break;
> > case BLK_EH_RESET_TIMER:
> > +   /*
> > +* As nothing prevents from completion happening while
> > +* ->aborted_gstate is set, this may lead to ignored
> > +* completions and further spurious timeouts.
> > +*/
> > +   blk_mq_rq_update_aborted_gstate(req, 0);
> > blk_add_timer(req);
> > blk_clear_rq_complete(req);
> > break;
> 
> Is the race that the comment refers to addressed by one of the later patches?

No, but it's not a new race.  It has always been there and I suppose
doesn't lead to practical problems - the race window is pretty small
and the effect isn't critical.  I'm just documenting the existing race
condition.  Will note that in the description.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/8] blk-mq: protect completion path with RCU

2018-01-09 Thread Jens Axboe
On 1/9/18 12:08 AM, Hannes Reinecke wrote:
> On 01/08/2018 08:15 PM, Tejun Heo wrote:
>> Currently, blk-mq protects only the issue path with RCU.  This patch
>> puts the completion path under the same RCU protection.  This will be
>> used to synchronize issue/completion against timeout by later patches,
>> which will also add the comments.
>>
>> Signed-off-by: Tejun Heo 
>> ---
>>  block/blk-mq.c | 5 +
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index ddc9261..6741c3e 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -584,11 +584,16 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int 
>> *srcu_idx)
>>  void blk_mq_complete_request(struct request *rq)
>>  {
>>  struct request_queue *q = rq->q;
>> +struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
>> +int srcu_idx;
>>  
>>  if (unlikely(blk_should_fake_timeout(q)))
>>  return;
>> +
>> +hctx_lock(hctx, _idx);
>>  if (!blk_mark_rq_complete(rq))
>>  __blk_mq_complete_request(rq);
>> +hctx_unlock(hctx, srcu_idx);
>>  }
>>  EXPORT_SYMBOL(blk_mq_complete_request);
>>  
>>
> Hmm. Why do we need to call blk_mq_map_queue() here?
> Is there a chance that we end up with a _different_ hctx on completion
> than that one used for submission?
> If not, why can't we just keep a pointer to the hctx in struct request?

Mapping is the right thing to do. We cache the software queue, which
allows us to map back to the same hardware queue. There would be no
point in storing both, the mapping is very cheap. No point in bloating
the request with redundant information.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 02/19] fs: don't take the i_lock in inode_inc_iversion

2018-01-09 Thread Jan Kara
On Tue 09-01-18 09:10:42, Jeff Layton wrote:
> From: Jeff Layton 
> 
> The rationale for taking the i_lock when incrementing this value is
> lost in antiquity. The readers of the field don't take it (at least
> not universally), so my assumption is that it was only done here to
> serialize incrementors.
> 
> If that is indeed the case, then we can drop the i_lock from this
> codepath and treat it as a atomic64_t for the purposes of
> incrementing it. This allows us to use inode_inc_iversion without
> any danger of lock inversion.
> 
> Note that the read side is not fetched atomically with this change.
> The assumption here is that that is not a critical issue since the
> i_version is not fully synchronized with anything else anyway.
> 
> Signed-off-by: Jeff Layton 

This changes the memory barrier behavior but IMO it is good enough for an
intermediate version. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  include/linux/iversion.h | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index d09cc3a08740..5ad9eaa3a9b0 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -104,12 +104,13 @@ inode_set_iversion_queried(struct inode *inode, u64 new)
>  static inline bool
>  inode_maybe_inc_iversion(struct inode *inode, bool force)
>  {
> - spin_lock(>i_lock);
> - inode->i_version++;
> - spin_unlock(>i_lock);
> + atomic64_t *ivp = (atomic64_t *)>i_version;
> +
> + atomic64_inc(ivp);
>   return true;
>  }
>  
> +
>  /**
>   * inode_inc_iversion - forcibly increment i_version
>   * @inode: inode that needs to be updated
> -- 
> 2.14.3
> 
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/5] btrfs: move uuid_mutex closer to exclusivity

2018-01-09 Thread Anand Jain
move uuid_mutex with in device_list_add().

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f942e8193862..283417bf3b00 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -740,6 +740,7 @@ static noinline int device_list_add(const char *path,
u64 found_transid = btrfs_super_generation(disk_super);
u64 devid = btrfs_stack_device_id(_super->dev_item);
 
+   mutex_lock(_mutex);
fs_devices = find_fsid(disk_super->fsid);
if (!fs_devices) {
fs_devices = alloc_fs_devices(disk_super->fsid);
@@ -847,6 +848,8 @@ static noinline int device_list_add(const char *path,
 
fs_devices->total_devices = btrfs_super_num_devices(disk_super);
 
+   mutex_unlock(_mutex);
+
*device_ret = device;
 
return 0;
@@ -1208,9 +1211,7 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
goto error_bdev_put;
}
 
-   mutex_lock(_mutex);
ret = device_list_add(path, disk_super, );
-   mutex_unlock(_mutex);
 
*fs_devices_ret = device->fs_devices;
 
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] generic/015: Change the test filesystem size to 101mb

2018-01-09 Thread Nikolay Borisov


On  8.01.2018 10:43, Nikolay Borisov wrote:
> This test has been failing for btrfs for quite some time,
> at least since 4.7. There are 2 implementation details of btrfs that
> it exposes:
> 
> 1. Currently btrfs filesystem under 100mb are created in Mixed block
> group mode. Freespace accounting for it is not 100% accurate - I've
> observed around 100-200kb discrepancy between a newly created filesystem,
> then writing a file and deleting it and checking the free space. This
> falls within %3 and not %1 as hardcoded in the test. 
> 
> 2. BTRFS won't flush it's delayed allocation on file deletion if less
> than 32mb are deleted. On such files we need to perform sync (missing
> in the test) or wait until time elapses for transaction commit.
> 
> Since mixed mode is somewhat deprecated and btrfs is not really intended
> to be used on really small devices let's just adjust the test to
> create a 101mb fs, which doesn't use mixed mode and really test
> freespace accounting.
> 
> Signed-off-by: Nikolay Borisov 

After further experimentation this test is indeed hitting a very
pathological edge case. So when we create a btrfs file system in mixed
mode with default raid levels the system chunk occupies 0-4mb. As such
we can never allocate a chunk which spans this range so the -1mb in
btrfs_calc_avail_data_space is not required in this case. Anyway, I
believe this to be a really minor nuisance not worth complicating the
code to fix, so I'd suggest this test change be merged.

> ---
>  tests/generic/015 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tests/generic/015 b/tests/generic/015
> index 78f2b13..416c4ae 100755
> --- a/tests/generic/015
> +++ b/tests/generic/015
> @@ -53,7 +53,7 @@ _supported_os Linux
>  _require_scratch
>  _require_no_large_scratch_dev
>  
> -_scratch_mkfs_sized `expr 50 \* 1024 \* 1024` >/dev/null 2>&1 \
> +_scratch_mkfs_sized `expr 101 \* 1024 \* 1024` >/dev/null 2>&1 \
>  || _fail "mkfs failed"
>  _scratch_mount || _fail "mount failed"
>  out=$SCRATCH_MNT/fillup.$$
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/5] device_list_add() peparation to add reappearing missing device

2018-01-09 Thread Anand Jain
Cleanup of device_list_add(), mainly in preparation to handle
reappearing missing device which its next reroll will be sent
separately.

Anand Jain (5):
  btrfs: move pr_info into device_list_add
  btrfs: set the total_devices in device_list_add()
  btrfs: get device pointer from device_list_add()
  btrfs: drop devid as device_list_add() arg
  btrfs: move uuid_mutex closer to exclusivity

 fs/btrfs/volumes.c | 49 +
 1 file changed, 21 insertions(+), 28 deletions(-)

-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/5] btrfs: set the total_devices in device_list_add()

2018-01-09 Thread Anand Jain
There is no other parent for device_list_add() except for
btrfs_scan_one_device(), which would set btrfs_fs_devices::total_devices
if device_list_add is successful and this can be done with in
device_list_add() itself.

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 091eb452f3c8..f9aaf65e27f5 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -844,6 +844,8 @@ static noinline int device_list_add(const char *path,
if (!fs_devices->opened)
device->generation = found_transid;
 
+   fs_devices->total_devices = btrfs_super_num_devices(disk_super);
+
*fs_devices_ret = fs_devices;
 
return 0;
@@ -1185,7 +1187,6 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
struct page *page;
int ret;
u64 devid;
-   u64 total_devices;
u64 bytenr;
 
/*
@@ -1207,12 +1208,9 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
}
 
devid = btrfs_stack_device_id(_super->dev_item);
-   total_devices = btrfs_super_num_devices(disk_super);
 
mutex_lock(_mutex);
ret = device_list_add(path, disk_super, devid, fs_devices_ret);
-   if (!ret && fs_devices_ret)
-   (*fs_devices_ret)->total_devices = total_devices;
mutex_unlock(_mutex);
 
btrfs_release_disk_super(page);
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/5] btrfs: get device pointer from device_list_add()

2018-01-09 Thread Anand Jain
Instead of pointer to btrfs_fs_devices from device_list_add() its
better to get pointer to btrfs_device, then we have both, pointer
to btrfs_device and btrfs_fs_devices. This is needed in preparation
to add handling reappearing missing device feature.

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f9aaf65e27f5..2317ca1b3d83 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -732,7 +732,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices 
*fs_devices,
  */
 static noinline int device_list_add(const char *path,
   struct btrfs_super_block *disk_super,
-  u64 devid, struct btrfs_fs_devices **fs_devices_ret)
+  u64 devid, struct btrfs_device **device_ret)
 {
struct btrfs_device *device;
struct btrfs_fs_devices *fs_devices;
@@ -846,7 +846,7 @@ static noinline int device_list_add(const char *path,
 
fs_devices->total_devices = btrfs_super_num_devices(disk_super);
 
-   *fs_devices_ret = fs_devices;
+   *device_ret = device;
 
return 0;
 }
@@ -1183,6 +1183,7 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
  struct btrfs_fs_devices **fs_devices_ret)
 {
struct btrfs_super_block *disk_super;
+   struct btrfs_device *device;
struct block_device *bdev;
struct page *page;
int ret;
@@ -1210,9 +1211,11 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
devid = btrfs_stack_device_id(_super->dev_item);
 
mutex_lock(_mutex);
-   ret = device_list_add(path, disk_super, devid, fs_devices_ret);
+   ret = device_list_add(path, disk_super, devid, );
mutex_unlock(_mutex);
 
+   *fs_devices_ret = device->fs_devices;
+
btrfs_release_disk_super(page);
 
 error_bdev_put:
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/5] btrfs: move pr_info into device_list_add

2018-01-09 Thread Anand Jain
Commit 60999ca4b403 ("btrfs: make device scan less noisy")
adds return value 1 to device_list_add(), so that parent function can
call pr_info only when new device is added. Move the pr_info() part
into device_list_add() so that this function can be kept simple.

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 29 +++--
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0d7f912f68ff..091eb452f3c8 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -727,8 +727,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices 
*fs_devices,
  * Add new device to list of registered devices
  *
  * Returns:
- * 1   - first time device is seen
- * 0   - device already known
+ * 0   - device already known or newly added
  * < 0 - error
  */
 static noinline int device_list_add(const char *path,
@@ -738,7 +737,6 @@ static noinline int device_list_add(const char *path,
struct btrfs_device *device;
struct btrfs_fs_devices *fs_devices;
struct rcu_string *name;
-   int ret = 0;
u64 found_transid = btrfs_super_generation(disk_super);
 
fs_devices = find_fsid(disk_super->fsid);
@@ -778,9 +776,16 @@ static noinline int device_list_add(const char *path,
fs_devices->num_devices++;
mutex_unlock(_devices->device_list_mutex);
 
-   ret = 1;
device->fs_devices = fs_devices;
btrfs_free_stale_device(path, device);
+
+   if (disk_super->label[0])
+   pr_info("BTRFS: device label %s devid %llu transid %llu 
%s\n",
+   disk_super->label, devid, found_transid, path);
+   else
+   pr_info("BTRFS: device fsid %pU devid %llu transid %llu 
%s\n",
+   disk_super->fsid, devid, found_transid, path);
+
} else if (!device->name || strcmp(device->name->str, path)) {
/*
 * When FS is already mounted.
@@ -841,7 +846,7 @@ static noinline int device_list_add(const char *path,
 
*fs_devices_ret = fs_devices;
 
-   return ret;
+   return 0;
 }
 
 static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig)
@@ -1180,7 +1185,6 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
struct page *page;
int ret;
u64 devid;
-   u64 transid;
u64 total_devices;
u64 bytenr;
 
@@ -1203,25 +1207,14 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
}
 
devid = btrfs_stack_device_id(_super->dev_item);
-   transid = btrfs_super_generation(disk_super);
total_devices = btrfs_super_num_devices(disk_super);
 
mutex_lock(_mutex);
ret = device_list_add(path, disk_super, devid, fs_devices_ret);
-   if (ret >= 0 && fs_devices_ret)
+   if (!ret && fs_devices_ret)
(*fs_devices_ret)->total_devices = total_devices;
mutex_unlock(_mutex);
 
-   if (ret > 0) {
-   if (disk_super->label[0])
-   pr_info("BTRFS: device label %s ", disk_super->label);
-   else
-   pr_info("BTRFS: device fsid %pU ", disk_super->fsid);
-
-   pr_cont("devid %llu transid %llu %s\n", devid, transid, path);
-   ret = 0;
-   }
-
btrfs_release_disk_super(page);
 
 error_bdev_put:
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/5] btrfs: drop devid as device_list_add() arg

2018-01-09 Thread Anand Jain
As struct btrfs_disk_super is being passed, so it can get devid
the same way its parent does.

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 2317ca1b3d83..f942e8193862 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -732,12 +732,13 @@ static int btrfs_open_one_device(struct btrfs_fs_devices 
*fs_devices,
  */
 static noinline int device_list_add(const char *path,
   struct btrfs_super_block *disk_super,
-  u64 devid, struct btrfs_device **device_ret)
+  struct btrfs_device **device_ret)
 {
struct btrfs_device *device;
struct btrfs_fs_devices *fs_devices;
struct rcu_string *name;
u64 found_transid = btrfs_super_generation(disk_super);
+   u64 devid = btrfs_stack_device_id(_super->dev_item);
 
fs_devices = find_fsid(disk_super->fsid);
if (!fs_devices) {
@@ -1187,7 +1188,6 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
struct block_device *bdev;
struct page *page;
int ret;
-   u64 devid;
u64 bytenr;
 
/*
@@ -1208,10 +1208,8 @@ int btrfs_scan_one_device(const char *path, fmode_t 
flags, void *holder,
goto error_bdev_put;
}
 
-   devid = btrfs_stack_device_id(_super->dev_item);
-
mutex_lock(_mutex);
-   ret = device_list_add(path, disk_super, devid, );
+   ret = device_list_add(path, disk_super, );
mutex_unlock(_mutex);
 
*fs_devices_ret = device->fs_devices;
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/6] btrfs: make btrfs_free_stale_device() to iterate all stales

2018-01-09 Thread Anand Jain
From: Anand Jain 

Let the list iterator iterate further and find other stale
devices and delete it. This is in preparation to add support
for user land request-able stale devices cleanup.

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 20 +---
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ab6ccad08ef7..7b253da6c0a4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -608,19 +608,20 @@ static void pending_bios_fn(struct btrfs_work *work)
 
 static void btrfs_free_stale_device(struct btrfs_device *cur_dev)
 {
-   struct btrfs_fs_devices *fs_devs;
-   struct btrfs_device *dev;
+   struct btrfs_fs_devices *fs_devs, *tmp_fs_devs;
+   struct btrfs_device *dev, *tmp_dev;
 
if (!cur_dev->name)
return;
 
-   list_for_each_entry(fs_devs, _uuids, list) {
-   int del = 1;
+   list_for_each_entry_safe(fs_devs, tmp_fs_devs, _uuids, list) {
 
if (fs_devs->opened)
continue;
 
-   list_for_each_entry(dev, _devs->devices, dev_list) {
+   list_for_each_entry_safe(dev, tmp_dev,
+_devs->devices, dev_list) {
+   int not_found;
 
if (dev == cur_dev)
continue;
@@ -634,14 +635,12 @@ static void btrfs_free_stale_device(struct btrfs_device 
*cur_dev)
 * either use mapper or non mapper path throughout.
 */
rcu_read_lock();
-   del = strcmp(rcu_str_deref(dev->name),
+   not_found = strcmp(rcu_str_deref(dev->name),
rcu_str_deref(cur_dev->name));
rcu_read_unlock();
-   if (!del)
-   break;
-   }
+   if (not_found)
+   continue;
 
-   if (!del) {
/* delete the stale device */
if (fs_devs->num_devices == 1) {
btrfs_sysfs_remove_fsid(fs_devs);
@@ -652,7 +651,6 @@ static void btrfs_free_stale_device(struct btrfs_device 
*cur_dev)
list_del(>dev_list);
free_device(dev);
}
-   break;
}
}
 }
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 00/19] fs: rework and optimize i_version handling in filesystems

2018-01-09 Thread Jeff Layton
From: Jeff Layton 

v5:
- don't corrupt refcounts stashed in i_version of ext4 xattr inodes
- add raw variants of inc and cmp functions, and have nfs use them

v4:
- fix SB_LAZYTIME handling in generic_update_time
- add memory barriers to patch to convert i_version field to atomic64_t

v3:
- move i_version handling functions to new header file
- document that the kernel-managed i_version implementation will appear to
  increase over time
- fix inode_cmp_iversion to handle wraparound correctly

v2:
- xfs should use inode_peek_iversion instead of inode_peek_iversion_raw
- rework file_update_time patch
- don't dirty inode when only S_ATIME is set and SB_LAZYTIME is enabled
- better comments and documentation

I think this is now approaching merge readiness.

Special thanks to Jan Kara and Dave Chinner who helped me tighten up the
memory barriers in the final patch, and Krzysztof Kozlowski for help in
tracking down a set of bugs in the NFS client patch.

tl;dr: I think we can greatly reduce the cost of the inode->i_version
counter, by exploiting the fact that we don't need to increment it if no
one is looking at it. We can also clean up the code to prepare to
eventually expose this value via statx().

Note that this set relies on a few patches that are in other trees. The
full stack that I've been testing with is here:


https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/log/?h=iversion

The inode->i_version field is supposed to be a value that changes
whenever there is any data or metadata change to the inode. Some
filesystems use it internally to detect directory changes during
readdir. knfsd will use it if the filesystem has MS_I_VERSION set. IMA
will also use it to optimize away some remeasurement if it's available.
NFS and AFS just use it to store an opaque change attribute from the
server.

Only btrfs, ext4, and xfs increment it for data changes. Because of
this, these filesystems must log the inode to disk whenever the
i_version counter changes. That has a non-zero performance impact,
especially on write-heavy workloads, because we end up dirtying the
inode metadata on every write, not just when the times change.

It turns out though that none of these users of i_version require that
it change on every change to the file. The only real requirement is that
it be different if something changed since the last time we queried for
it.

If we keep track of when something queries the value, we can avoid
bumping the counter and an on-disk update when nothing else has changed
if no one has queried it since it was last incremented.

This patchset changes the code to only bump the i_version counter when
it's strictly necessary, or when we're updating the inode metadata
anyway (e.g. when times change).

It takes the approach of converting the existing accessors of i_version
to use a new API, while leaving the underlying implementation mostly the
same.  The last patch then converts the existing implementation to keep
track of whether the value has been queried since it was last
incremented. It then uses that to avoid incrementing the counter when
it can.

With this, we reduce inode metadata updates across all 3 filesystems
down to roughly the frequency of the timestamp granularity, particularly
when it's not being queried (the vastly common case).

I can see measurable performance gains on xfs and ext4 with iversion
enabled, when streaming small (4k) I/Os.

btrfs shows some slight gain in testing, but not quite the magnitude
that xfs and ext4 show. I'm not sure why yet and would appreciate some
input from btrfs folks.

My goal is to get this into linux-next fairly soon. If it shows no
problems then we can look at merging it for 4.16, or 4.17 if all of the
prequisite patches are not yet merged.

Jeff Layton (19):
  fs: new API for handling inode->i_version
  fs: don't take the i_lock in inode_inc_iversion
  fat: convert to new i_version API
  affs: convert to new i_version API
  afs: convert to new i_version API
  btrfs: convert to new i_version API
  exofs: switch to new i_version API
  ext2: convert to new i_version API
  ext4: convert to new i_version API
  nfs: convert to new i_version API
  nfsd: convert to new i_version API
  ocfs2: convert to new i_version API
  ufs: use new i_version API
  xfs: convert to new i_version API
  IMA: switch IMA over to new i_version API
  fs: only set S_VERSION when updating times if necessary
  xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need
incrementing
  btrfs: only dirty the inode in btrfs_update_time if something was
changed
  fs: handle inode->i_version more efficiently

 fs/affs/amigaffs.c|   5 +-
 fs/affs/dir.c |   5 +-
 fs/affs/super.c   |   3 +-
 fs/afs/fsclient.c |   3 +-
 fs/afs/inode.c|   5 +-
 fs/btrfs/delayed-inode.c  |   7 +-
 fs/btrfs/file.c   |   1 +
 fs/btrfs/inode.c  |  12 +-
 

[PATCH v5 01/19] fs: new API for handling inode->i_version

2018-01-09 Thread Jeff Layton
From: Jeff Layton 

Add a documentation blob that explains what the i_version field is, how
it is expected to work, and how it is currently implemented by various
filesystems.

We already have inode_inc_iversion. Add several other functions for
manipulating and accessing the i_version counter. For now, the
implementation is trivial and basically works the way that all of the
open-coded i_version accesses work today.

Future patches will convert existing users of i_version to use the new
API, and then convert the backend implementation to do things more
efficiently.

Signed-off-by: Jeff Layton 
Reviewed-by: Jan Kara 
---
 fs/btrfs/file.c  |   1 +
 fs/btrfs/inode.c |   1 +
 fs/btrfs/ioctl.c |   1 +
 fs/btrfs/xattr.c |   1 +
 fs/ext4/inode.c  |   1 +
 fs/ext4/namei.c  |   1 +
 fs/inode.c   |   1 +
 include/linux/fs.h   |  15 ---
 include/linux/iversion.h | 236 +++
 9 files changed, 243 insertions(+), 15 deletions(-)
 create mode 100644 include/linux/iversion.h

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index eb1bac7c8553..c95d7b2efefb 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e1a7f3cb5be9..27f008b33fc1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2ef8acaac688..aa452c9e2eff 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
diff --git a/fs/btrfs/xattr.c b/fs/btrfs/xattr.c
index 2c7e53f9ff1b..5258c1714830 100644
--- a/fs/btrfs/xattr.c
+++ b/fs/btrfs/xattr.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "btrfs_inode.h"
 #include "transaction.h"
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7df2c5644e59..fa5d8bc52d2d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ext4_jbd2.h"
 #include "xattr.h"
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 798b3ac680db..bcf0dff517be 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ext4.h"
 #include "ext4_jbd2.h"
 
diff --git a/fs/inode.c b/fs/inode.c
index 03102d6ef044..19e72f500f71 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -18,6 +18,7 @@
 #include  /* for inode_has_buffers */
 #include 
 #include 
+#include 
 #include 
 #include "internal.h"
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 511fbaabf624..76382c24e9d0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2036,21 +2036,6 @@ static inline void inode_dec_link_count(struct inode 
*inode)
mark_inode_dirty(inode);
 }
 
-/**
- * inode_inc_iversion - increments i_version
- * @inode: inode that need to be updated
- *
- * Every time the inode is modified, the i_version field will be incremented.
- * The filesystem has to be mounted with i_version flag
- */
-
-static inline void inode_inc_iversion(struct inode *inode)
-{
-   spin_lock(>i_lock);
-   inode->i_version++;
-   spin_unlock(>i_lock);
-}
-
 enum file_time_flags {
S_ATIME = 1,
S_MTIME = 2,
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
new file mode 100644
index ..d09cc3a08740
--- /dev/null
+++ b/include/linux/iversion.h
@@ -0,0 +1,236 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_IVERSION_H
+#define _LINUX_IVERSION_H
+
+#include 
+
+/*
+ * The change attribute (i_version) is mandated by NFSv4 and is mostly for
+ * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
+ * appear different to observers if there was a change to the inode's data or
+ * metadata since it was last queried.
+ *
+ * Observers see the i_version as a 64-bit number that never changes. If it
+ * remains the same since it was last checked, then nothing has changed in the
+ * inode. If it's different then something has changed. Observers cannot infer
+ * anything about the nature or magnitude of the changes from the value, only
+ * that the inode has changed in some fashion.
+ *
+ * Not all filesystems properly implement the i_version counter. Subsystems 
that
+ * want to use i_version field on an inode should first check whether the
+ * filesystem sets the SB_I_VERSION flag (usually via the IS_I_VERSION macro).
+ *
+ * Those that set SB_I_VERSION will automatically have their i_version counter
+ * incremented on writes to normal files. If the SB_I_VERSION is not set, then
+ * the VFS will not 

[PATCH v5 02/19] fs: don't take the i_lock in inode_inc_iversion

2018-01-09 Thread Jeff Layton
From: Jeff Layton 

The rationale for taking the i_lock when incrementing this value is
lost in antiquity. The readers of the field don't take it (at least
not universally), so my assumption is that it was only done here to
serialize incrementors.

If that is indeed the case, then we can drop the i_lock from this
codepath and treat it as a atomic64_t for the purposes of
incrementing it. This allows us to use inode_inc_iversion without
any danger of lock inversion.

Note that the read side is not fetched atomically with this change.
The assumption here is that that is not a critical issue since the
i_version is not fully synchronized with anything else anyway.

Signed-off-by: Jeff Layton 
---
 include/linux/iversion.h | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index d09cc3a08740..5ad9eaa3a9b0 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -104,12 +104,13 @@ inode_set_iversion_queried(struct inode *inode, u64 new)
 static inline bool
 inode_maybe_inc_iversion(struct inode *inode, bool force)
 {
-   spin_lock(>i_lock);
-   inode->i_version++;
-   spin_unlock(>i_lock);
+   atomic64_t *ivp = (atomic64_t *)>i_version;
+
+   atomic64_inc(ivp);
return true;
 }
 
+
 /**
  * inode_inc_iversion - forcibly increment i_version
  * @inode: inode that needs to be updated
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 03/19] fat: convert to new i_version API

2018-01-09 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
---
 fs/fat/dir.c |  3 ++-
 fs/fat/inode.c   |  9 +
 fs/fat/namei_msdos.c |  7 ---
 fs/fat/namei_vfat.c  | 22 +++---
 4 files changed, 22 insertions(+), 19 deletions(-)

diff --git a/fs/fat/dir.c b/fs/fat/dir.c
index b833ffeee1e1..8e100c3bf72c 100644
--- a/fs/fat/dir.c
+++ b/fs/fat/dir.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "fat.h"
 
 /*
@@ -1055,7 +1056,7 @@ int fat_remove_entries(struct inode *dir, struct 
fat_slot_info *sinfo)
brelse(bh);
if (err)
return err;
-   dir->i_version++;
+   inode_inc_iversion(dir);
 
if (nr_slots) {
/*
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 20a0a89eaca5..ffbbf0520d9e 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "fat.h"
 
 #ifndef CONFIG_FAT_DEFAULT_IOCHARSET
@@ -507,7 +508,7 @@ int fat_fill_inode(struct inode *inode, struct 
msdos_dir_entry *de)
MSDOS_I(inode)->i_pos = 0;
inode->i_uid = sbi->options.fs_uid;
inode->i_gid = sbi->options.fs_gid;
-   inode->i_version++;
+   inode_inc_iversion(inode);
inode->i_generation = get_seconds();
 
if ((de->attr & ATTR_DIR) && !IS_FREE(de->name)) {
@@ -590,7 +591,7 @@ struct inode *fat_build_inode(struct super_block *sb,
goto out;
}
inode->i_ino = iunique(sb, MSDOS_ROOT_INO);
-   inode->i_version = 1;
+   inode_set_iversion(inode, 1);
err = fat_fill_inode(inode, de);
if (err) {
iput(inode);
@@ -1377,7 +1378,7 @@ static int fat_read_root(struct inode *inode)
MSDOS_I(inode)->i_pos = MSDOS_ROOT_INO;
inode->i_uid = sbi->options.fs_uid;
inode->i_gid = sbi->options.fs_gid;
-   inode->i_version++;
+   inode_inc_iversion(inode);
inode->i_generation = 0;
inode->i_mode = fat_make_mode(sbi, ATTR_DIR, S_IRWXUGO);
inode->i_op = sbi->dir_ops;
@@ -1828,7 +1829,7 @@ int fat_fill_super(struct super_block *sb, void *data, 
int silent, int isvfat,
if (!root_inode)
goto out_fail;
root_inode->i_ino = MSDOS_ROOT_INO;
-   root_inode->i_version = 1;
+   inode_set_iversion(root_inode, 1);
error = fat_read_root(root_inode);
if (error < 0) {
iput(root_inode);
diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c
index d24d2758a363..582ca731a6c9 100644
--- a/fs/fat/namei_msdos.c
+++ b/fs/fat/namei_msdos.c
@@ -7,6 +7,7 @@
  */
 
 #include 
+#include 
 #include "fat.h"
 
 /* Characters that are undesirable in an MS-DOS file name */
@@ -480,7 +481,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned 
char *old_name,
} else
mark_inode_dirty(old_inode);
 
-   old_dir->i_version++;
+   inode_inc_iversion(old_dir);
old_dir->i_ctime = old_dir->i_mtime = 
current_time(old_dir);
if (IS_DIRSYNC(old_dir))
(void)fat_sync_inode(old_dir);
@@ -508,7 +509,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned 
char *old_name,
goto out;
new_i_pos = sinfo.i_pos;
}
-   new_dir->i_version++;
+   inode_inc_iversion(new_dir);
 
fat_detach(old_inode);
fat_attach(old_inode, new_i_pos);
@@ -540,7 +541,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned 
char *old_name,
old_sinfo.bh = NULL;
if (err)
goto error_dotdot;
-   old_dir->i_version++;
+   inode_inc_iversion(old_dir);
old_dir->i_ctime = old_dir->i_mtime = ts;
if (IS_DIRSYNC(old_dir))
(void)fat_sync_inode(old_dir);
diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index 02c03a3a..cefea792cde8 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -20,7 +20,7 @@
 #include 
 #include 
 #include 
-
+#include 
 #include "fat.h"
 
 static inline unsigned long vfat_d_version(struct dentry *dentry)
@@ -46,7 +46,7 @@ static int vfat_revalidate_shortname(struct dentry *dentry)
 {
int ret = 1;
spin_lock(>d_lock);
-   if (vfat_d_version(dentry) != d_inode(dentry->d_parent)->i_version)
+   if (inode_cmp_iversion(d_inode(dentry->d_parent), 
vfat_d_version(dentry)))
ret = 0;
spin_unlock(>d_lock);
return ret;
@@ -759,7 +759,7 @@ static struct dentry *vfat_lookup(struct inode *dir, struct 
dentry *dentry,
 out:
mutex_unlock(_SB(sb)->s_lock);
if (!inode)
-   vfat_d_version_set(dentry, dir->i_version);
+   vfat_d_version_set(dentry, inode_query_iversion(dir));
return d_splice_alias(inode, dentry);
 error:

[PATCH v5 04/19] affs: convert to new i_version API

2018-01-09 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
---
 fs/affs/amigaffs.c | 5 +++--
 fs/affs/dir.c  | 5 +++--
 fs/affs/super.c| 3 ++-
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/affs/amigaffs.c b/fs/affs/amigaffs.c
index 0f0e6925e97d..14a6c1b90c9f 100644
--- a/fs/affs/amigaffs.c
+++ b/fs/affs/amigaffs.c
@@ -10,6 +10,7 @@
  */
 
 #include 
+#include 
 #include "affs.h"
 
 /*
@@ -60,7 +61,7 @@ affs_insert_hash(struct inode *dir, struct buffer_head *bh)
affs_brelse(dir_bh);
 
dir->i_mtime = dir->i_ctime = current_time(dir);
-   dir->i_version++;
+   inode_inc_iversion(dir);
mark_inode_dirty(dir);
 
return 0;
@@ -114,7 +115,7 @@ affs_remove_hash(struct inode *dir, struct buffer_head 
*rem_bh)
affs_brelse(bh);
 
dir->i_mtime = dir->i_ctime = current_time(dir);
-   dir->i_version++;
+   inode_inc_iversion(dir);
mark_inode_dirty(dir);
 
return retval;
diff --git a/fs/affs/dir.c b/fs/affs/dir.c
index a105e77df2c1..d180b46453cf 100644
--- a/fs/affs/dir.c
+++ b/fs/affs/dir.c
@@ -14,6 +14,7 @@
  *
  */
 
+#include 
 #include "affs.h"
 
 static int affs_readdir(struct file *, struct dir_context *);
@@ -80,7 +81,7 @@ affs_readdir(struct file *file, struct dir_context *ctx)
 * we can jump directly to where we left off.
 */
ino = (u32)(long)file->private_data;
-   if (ino && file->f_version == inode->i_version) {
+   if (ino && inode_cmp_iversion(inode, file->f_version) == 0) {
pr_debug("readdir() left off=%d\n", ino);
goto inside;
}
@@ -130,7 +131,7 @@ affs_readdir(struct file *file, struct dir_context *ctx)
} while (ino);
}
 done:
-   file->f_version = inode->i_version;
+   file->f_version = inode_query_iversion(inode);
file->private_data = (void *)(long)ino;
affs_brelse(fh_bh);
 
diff --git a/fs/affs/super.c b/fs/affs/super.c
index 1117e36134cc..e602619aed9d 100644
--- a/fs/affs/super.c
+++ b/fs/affs/super.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "affs.h"
 
 static int affs_statfs(struct dentry *dentry, struct kstatfs *buf);
@@ -102,7 +103,7 @@ static struct inode *affs_alloc_inode(struct super_block 
*sb)
if (!i)
return NULL;
 
-   i->vfs_inode.i_version = 1;
+   inode_set_iversion(>vfs_inode, 1);
i->i_lc = NULL;
i->i_ext_bh = NULL;
i->i_pa_cnt = 0;
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 05/19] afs: convert to new i_version API

2018-01-09 Thread Jeff Layton
From: Jeff Layton 

For AFS, it's generally treated as an opaque value, so we use the
*_raw variants of the API here.

Note that AFS has quite a different definition for this counter. AFS
only increments it on changes to the data to the data in regular files
and contents of the directories. Inode metadata changes do not result
in a version increment.

We'll need to reconcile that somehow if we ever want to present this to
userspace via statx.

Signed-off-by: Jeff Layton 
---
 fs/afs/fsclient.c | 3 ++-
 fs/afs/inode.c| 5 +++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c
index b90ef39ae914..88ec38c2d83c 100644
--- a/fs/afs/fsclient.c
+++ b/fs/afs/fsclient.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 #include "afs_fs.h"
 
@@ -124,7 +125,7 @@ static void xdr_decode_AFSFetchStatus(const __be32 **_bp,
vnode->vfs_inode.i_ctime.tv_sec = status->mtime_client;
vnode->vfs_inode.i_mtime= vnode->vfs_inode.i_ctime;
vnode->vfs_inode.i_atime= vnode->vfs_inode.i_ctime;
-   vnode->vfs_inode.i_version  = data_version;
+   inode_set_iversion_raw(>vfs_inode, data_version);
}
 
expected_version = status->data_version;
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index 3415eb7484f6..dcd2e08d6cdb 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 static const struct inode_operations afs_symlink_inode_operations = {
@@ -89,7 +90,7 @@ static int afs_inode_map_status(struct afs_vnode *vnode, 
struct key *key)
inode->i_atime  = inode->i_mtime = inode->i_ctime;
inode->i_blocks = 0;
inode->i_generation = vnode->fid.unique;
-   inode->i_version= vnode->status.data_version;
+   inode_set_iversion_raw(inode, vnode->status.data_version);
inode->i_mapping->a_ops = _fs_aops;
 
read_sequnlock_excl(>cb_lock);
@@ -218,7 +219,7 @@ struct inode *afs_iget_autocell(struct inode *dir, const 
char *dev_name,
inode->i_ctime.tv_nsec  = 0;
inode->i_atime  = inode->i_mtime = inode->i_ctime;
inode->i_blocks = 0;
-   inode->i_version= 0;
+   inode_set_iversion_raw(inode, 0);
inode->i_generation = 0;
 
set_bit(AFS_VNODE_PSEUDODIR, >flags);
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 06/19] btrfs: convert to new i_version API

2018-01-09 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
Acked-by: David Sterba 
---
 fs/btrfs/delayed-inode.c | 7 +--
 fs/btrfs/inode.c | 6 --
 fs/btrfs/tree-log.c  | 4 +++-
 3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 5d73f79ded8b..6a246ae2bcb2 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -18,6 +18,7 @@
  */
 
 #include 
+#include 
 #include "delayed-inode.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -1700,7 +1701,8 @@ static void fill_stack_inode_item(struct 
btrfs_trans_handle *trans,
btrfs_set_stack_inode_nbytes(inode_item, inode_get_bytes(inode));
btrfs_set_stack_inode_generation(inode_item,
 BTRFS_I(inode)->generation);
-   btrfs_set_stack_inode_sequence(inode_item, inode->i_version);
+   btrfs_set_stack_inode_sequence(inode_item,
+  inode_peek_iversion(inode));
btrfs_set_stack_inode_transid(inode_item, trans->transid);
btrfs_set_stack_inode_rdev(inode_item, inode->i_rdev);
btrfs_set_stack_inode_flags(inode_item, BTRFS_I(inode)->flags);
@@ -1754,7 +1756,8 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
BTRFS_I(inode)->generation = btrfs_stack_inode_generation(inode_item);
 BTRFS_I(inode)->last_trans = btrfs_stack_inode_transid(inode_item);
 
-   inode->i_version = btrfs_stack_inode_sequence(inode_item);
+   inode_set_iversion_queried(inode,
+  btrfs_stack_inode_sequence(inode_item));
inode->i_rdev = 0;
*rdev = btrfs_stack_inode_rdev(inode_item);
BTRFS_I(inode)->flags = btrfs_stack_inode_flags(inode_item);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 27f008b33fc1..ac8692849a81 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3778,7 +3778,8 @@ static int btrfs_read_locked_inode(struct inode *inode)
BTRFS_I(inode)->generation = btrfs_inode_generation(leaf, inode_item);
BTRFS_I(inode)->last_trans = btrfs_inode_transid(leaf, inode_item);
 
-   inode->i_version = btrfs_inode_sequence(leaf, inode_item);
+   inode_set_iversion_queried(inode,
+  btrfs_inode_sequence(leaf, inode_item));
inode->i_generation = BTRFS_I(inode)->generation;
inode->i_rdev = 0;
rdev = btrfs_inode_rdev(leaf, inode_item);
@@ -3946,7 +3947,8 @@ static void fill_inode_item(struct btrfs_trans_handle 
*trans,
 );
btrfs_set_token_inode_generation(leaf, item, BTRFS_I(inode)->generation,
 );
-   btrfs_set_token_inode_sequence(leaf, item, inode->i_version, );
+   btrfs_set_token_inode_sequence(leaf, item, inode_peek_iversion(inode),
+  );
btrfs_set_token_inode_transid(leaf, item, trans->transid, );
btrfs_set_token_inode_rdev(leaf, item, inode->i_rdev, );
btrfs_set_token_inode_flags(leaf, item, BTRFS_I(inode)->flags, );
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 7bf9b31561db..1b7d92075c1f 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "tree-log.h"
 #include "disk-io.h"
 #include "locking.h"
@@ -3609,7 +3610,8 @@ static void fill_inode_item(struct btrfs_trans_handle 
*trans,
btrfs_set_token_inode_nbytes(leaf, item, inode_get_bytes(inode),
 );
 
-   btrfs_set_token_inode_sequence(leaf, item, inode->i_version, );
+   btrfs_set_token_inode_sequence(leaf, item,
+  inode_peek_iversion(inode), );
btrfs_set_token_inode_transid(leaf, item, trans->transid, );
btrfs_set_token_inode_rdev(leaf, item, inode->i_rdev, );
btrfs_set_token_inode_flags(leaf, item, BTRFS_I(inode)->flags, );
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 08/19] ext2: convert to new i_version API

2018-01-09 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
Reviewed-by: Jan Kara 
---
 fs/ext2/dir.c   | 9 +
 fs/ext2/super.c | 5 +++--
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 987647986f47..4111085a129f 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 typedef struct ext2_dir_entry_2 ext2_dirent;
 
@@ -92,7 +93,7 @@ static int ext2_commit_chunk(struct page *page, loff_t pos, 
unsigned len)
struct inode *dir = mapping->host;
int err = 0;
 
-   dir->i_version++;
+   inode_inc_iversion(dir);
block_write_end(NULL, mapping, pos, len, len, page, NULL);
 
if (pos+len > dir->i_size) {
@@ -293,7 +294,7 @@ ext2_readdir(struct file *file, struct dir_context *ctx)
unsigned long npages = dir_pages(inode);
unsigned chunk_mask = ~(ext2_chunk_size(inode)-1);
unsigned char *types = NULL;
-   int need_revalidate = file->f_version != inode->i_version;
+   bool need_revalidate = inode_cmp_iversion(inode, file->f_version);
 
if (pos > inode->i_size - EXT2_DIR_REC_LEN(1))
return 0;
@@ -319,8 +320,8 @@ ext2_readdir(struct file *file, struct dir_context *ctx)
offset = ext2_validate_entry(kaddr, offset, 
chunk_mask);
ctx->pos = (n<f_version = inode->i_version;
-   need_revalidate = 0;
+   file->f_version = inode_query_iversion(inode);
+   need_revalidate = false;
}
de = (ext2_dirent *)(kaddr+offset);
limit = kaddr + ext2_last_byte(inode, n) - EXT2_DIR_REC_LEN(1);
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 7646818ab266..554c98b8a93a 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ext2.h"
 #include "xattr.h"
 #include "acl.h"
@@ -184,7 +185,7 @@ static struct inode *ext2_alloc_inode(struct super_block 
*sb)
if (!ei)
return NULL;
ei->i_block_alloc_info = NULL;
-   ei->vfs_inode.i_version = 1;
+   inode_set_iversion(>vfs_inode, 1);
 #ifdef CONFIG_QUOTA
memset(>i_dquot, 0, sizeof(ei->i_dquot));
 #endif
@@ -1569,7 +1570,7 @@ static ssize_t ext2_quota_write(struct super_block *sb, 
int type,
return err;
if (inode->i_size < off+len-towrite)
i_size_write(inode, off+len-towrite);
-   inode->i_version++;
+   inode_inc_iversion(inode);
inode->i_mtime = inode->i_ctime = current_time(inode);
mark_inode_dirty(inode);
return len - towrite;
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 07/19] exofs: switch to new i_version API

2018-01-09 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
---
 fs/exofs/dir.c   | 9 +
 fs/exofs/super.c | 3 ++-
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c
index 98233a97b7b8..c5a53fcc43ea 100644
--- a/fs/exofs/dir.c
+++ b/fs/exofs/dir.c
@@ -31,6 +31,7 @@
  * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
  */
 
+#include 
 #include "exofs.h"
 
 static inline unsigned exofs_chunk_size(struct inode *inode)
@@ -60,7 +61,7 @@ static int exofs_commit_chunk(struct page *page, loff_t pos, 
unsigned len)
struct inode *dir = mapping->host;
int err = 0;
 
-   dir->i_version++;
+   inode_inc_iversion(dir);
 
if (!PageUptodate(page))
SetPageUptodate(page);
@@ -241,7 +242,7 @@ exofs_readdir(struct file *file, struct dir_context *ctx)
unsigned long n = pos >> PAGE_SHIFT;
unsigned long npages = dir_pages(inode);
unsigned chunk_mask = ~(exofs_chunk_size(inode)-1);
-   int need_revalidate = (file->f_version != inode->i_version);
+   bool need_revalidate = inode_cmp_iversion(inode, file->f_version);
 
if (pos > inode->i_size - EXOFS_DIR_REC_LEN(1))
return 0;
@@ -264,8 +265,8 @@ exofs_readdir(struct file *file, struct dir_context *ctx)
chunk_mask);
ctx->pos = (n<f_version = inode->i_version;
-   need_revalidate = 0;
+   file->f_version = inode_query_iversion(inode);
+   need_revalidate = false;
}
de = (struct exofs_dir_entry *)(kaddr + offset);
limit = kaddr + exofs_last_byte(inode, n) -
diff --git a/fs/exofs/super.c b/fs/exofs/super.c
index 819624cfc8da..7e244093c0e5 100644
--- a/fs/exofs/super.c
+++ b/fs/exofs/super.c
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "exofs.h"
 
@@ -159,7 +160,7 @@ static struct inode *exofs_alloc_inode(struct super_block 
*sb)
if (!oi)
return NULL;
 
-   oi->vfs_inode.i_version = 1;
+   inode_set_iversion(>vfs_inode, 1);
return >vfs_inode;
 }
 
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/6] btrfs: make btrfs_free_stale_device() to match the path

2018-01-09 Thread Anand Jain
From: Anand Jain 

The btrfs_free_stale_device() is updated to match for the given
device path and delete it. (It searchs for only unmounted list of
devices.)

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 7646f8860096..f87d30aa0e18 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -606,7 +606,7 @@ static void pending_bios_fn(struct btrfs_work *work)
 }
 
 
-static void btrfs_free_stale_device(struct btrfs_device *cur_dev)
+static void btrfs_free_stale_device(struct btrfs_device *cur_dev, char *path)
 {
struct btrfs_fs_devices *fs_devs, *tmp_fs_devs;
struct btrfs_device *dev, *tmp_dev;
@@ -633,6 +633,8 @@ static void btrfs_free_stale_device(struct btrfs_device 
*cur_dev)
if (cur_dev)
not_found = strcmp(rcu_str_deref(dev->name),
   
rcu_str_deref(cur_dev->name));
+   else if (path)
+   not_found = strcmp(rcu_str_deref(dev->name), 
path);
else
not_found = 0;
rcu_read_unlock();
@@ -776,7 +778,7 @@ static noinline int device_list_add(const char *path,
 
ret = 1;
device->fs_devices = fs_devices;
-   btrfs_free_stale_device(device);
+   btrfs_free_stale_device(device, NULL);
} else if (!device->name || strcmp(device->name->str, path)) {
/*
 * When FS is already mounted.
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/6] btrfs: make btrfs_free_stale_device() argument optional

2018-01-09 Thread Anand Jain
From: Anand Jain 

This updates btrfs_free_stale_device() helper function to delete all
unmouted devices, when arg is NULL.

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 7b253da6c0a4..7646f8860096 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -611,9 +611,6 @@ static void btrfs_free_stale_device(struct btrfs_device 
*cur_dev)
struct btrfs_fs_devices *fs_devs, *tmp_fs_devs;
struct btrfs_device *dev, *tmp_dev;
 
-   if (!cur_dev->name)
-   return;
-
list_for_each_entry_safe(fs_devs, tmp_fs_devs, _uuids, list) {
 
if (fs_devs->opened)
@@ -623,9 +620,7 @@ static void btrfs_free_stale_device(struct btrfs_device 
*cur_dev)
 _devs->devices, dev_list) {
int not_found;
 
-   if (dev == cur_dev)
-   continue;
-   if (!dev->name)
+   if (cur_dev && (cur_dev == dev || !dev->name))
continue;
 
/*
@@ -635,8 +630,11 @@ static void btrfs_free_stale_device(struct btrfs_device 
*cur_dev)
 * either use mapper or non mapper path throughout.
 */
rcu_read_lock();
-   not_found = strcmp(rcu_str_deref(dev->name),
-   rcu_str_deref(cur_dev->name));
+   if (cur_dev)
+   not_found = strcmp(rcu_str_deref(dev->name),
+  
rcu_str_deref(cur_dev->name));
+   else
+   not_found = 0;
rcu_read_unlock();
if (not_found)
continue;
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/6] btrfs: cleanup btrfs_free_stale_device() usage

2018-01-09 Thread Anand Jain
We call btrfs_free_stale_device() only when we alloc a new
struct btrfs_device (ret=1), so move it closer to where we
alloc the new device. Also drop the comments.

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9a04245003ab..7e04cd509ab9 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -782,6 +782,7 @@ static noinline int device_list_add(const char *path,
 
ret = 1;
device->fs_devices = fs_devices;
+   btrfs_free_stale_device(device);
} else if (!device->name || strcmp(device->name->str, path)) {
/*
 * When FS is already mounted.
@@ -840,13 +841,6 @@ static noinline int device_list_add(const char *path,
if (!fs_devices->opened)
device->generation = found_transid;
 
-   /*
-* if there is new btrfs on an already registered device,
-* then remove the stale device entry.
-*/
-   if (ret > 0)
-   btrfs_free_stale_device(device);
-
*fs_devices_ret = fs_devices;
 
return ret;
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 09/19] ext4: convert to new i_version API

2018-01-09 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
Acked-by: Theodore Ts'o 
---
 fs/ext4/dir.c|  9 +
 fs/ext4/inline.c |  7 ---
 fs/ext4/inode.c  | 12 
 fs/ext4/ioctl.c  |  3 ++-
 fs/ext4/namei.c  |  4 ++--
 fs/ext4/super.c  |  3 ++-
 fs/ext4/xattr.c  |  5 +++--
 7 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index d5babc9f222b..afda0a0499ce 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ext4.h"
 #include "xattr.h"
 
@@ -208,7 +209,7 @@ static int ext4_readdir(struct file *file, struct 
dir_context *ctx)
 * readdir(2), then we might be pointing to an invalid
 * dirent right now.  Scan from the start of the block
 * to make sure. */
-   if (file->f_version != inode->i_version) {
+   if (inode_cmp_iversion(inode, file->f_version)) {
for (i = 0; i < sb->s_blocksize && i < offset; ) {
de = (struct ext4_dir_entry_2 *)
(bh->b_data + i);
@@ -227,7 +228,7 @@ static int ext4_readdir(struct file *file, struct 
dir_context *ctx)
offset = i;
ctx->pos = (ctx->pos & ~(sb->s_blocksize - 1))
| offset;
-   file->f_version = inode->i_version;
+   file->f_version = inode_query_iversion(inode);
}
 
while (ctx->pos < inode->i_size
@@ -568,10 +569,10 @@ static int ext4_dx_readdir(struct file *file, struct 
dir_context *ctx)
 * cached entries.
 */
if ((!info->curr_node) ||
-   (file->f_version != inode->i_version)) {
+   inode_cmp_iversion(inode, file->f_version)) {
info->curr_node = NULL;
free_rb_tree_fname(>root);
-   file->f_version = inode->i_version;
+   file->f_version = inode_query_iversion(inode);
ret = ext4_htree_fill_tree(file, info->curr_hash,
   info->curr_minor_hash,
   >next_hash);
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 1367553c43bb..a8b987b71173 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -14,6 +14,7 @@
 
 #include 
 #include 
+#include 
 
 #include "ext4_jbd2.h"
 #include "ext4.h"
@@ -1042,7 +1043,7 @@ static int ext4_add_dirent_to_inline(handle_t *handle,
 */
dir->i_mtime = dir->i_ctime = current_time(dir);
ext4_update_dx_flag(dir);
-   dir->i_version++;
+   inode_inc_iversion(dir);
return 1;
 }
 
@@ -1494,7 +1495,7 @@ int ext4_read_inline_dir(struct file *file,
 * dirent right now.  Scan from the start of the inline
 * dir to make sure.
 */
-   if (file->f_version != inode->i_version) {
+   if (inode_cmp_iversion(inode, file->f_version)) {
for (i = 0; i < extra_size && i < offset;) {
/*
 * "." is with offset 0 and
@@ -1526,7 +1527,7 @@ int ext4_read_inline_dir(struct file *file,
}
offset = i;
ctx->pos = offset;
-   file->f_version = inode->i_version;
+   file->f_version = inode_query_iversion(inode);
}
 
while (ctx->pos < extra_size) {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index fa5d8bc52d2d..1b0d54b372f2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4874,12 +4874,14 @@ struct inode *ext4_iget(struct super_block *sb, 
unsigned long ino)
EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode);
 
if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) {
-   inode->i_version = le32_to_cpu(raw_inode->i_disk_version);
+   u64 ivers = le32_to_cpu(raw_inode->i_disk_version);
+
if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) {
if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
-   inode->i_version |=
+   ivers |=
(__u64)(le32_to_cpu(raw_inode->i_version_hi)) << 32;
}
+   inode_set_iversion_queried(inode, ivers);
}
 
ret = 0;
@@ -5165,11 +5167,13 @@ static int ext4_do_update_inode(handle_t *handle,
}
 
if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) {
-   raw_inode->i_disk_version = cpu_to_le32(inode->i_version);
+   u64 ivers = inode_peek_iversion(inode);
+
+   raw_inode->i_disk_version = cpu_to_le32(ivers);
if (ei->i_extra_isize) {
if (EXT4_FITS_IN_INODE(raw_inode, ei, 

[PATCH v5 10/19] nfs: convert to new i_version API

2018-01-09 Thread Jeff Layton
From: Jeff Layton 

For NFS, we just use the "raw" API since the i_version is mostly
managed by the server. The exception there is when the client
holds a write delegation, but we only need to bump it once
there anyway to handle CB_GETATTR.

Tested-by: Krzysztof Kozlowski 
Signed-off-by: Jeff Layton 
---
 fs/nfs/delegation.c|  3 ++-
 fs/nfs/fscache-index.c |  5 +++--
 fs/nfs/inode.c | 18 +-
 fs/nfs/nfs4proc.c  | 10 ++
 fs/nfs/nfstrace.h  |  5 +++--
 fs/nfs/write.c |  8 +++-
 6 files changed, 26 insertions(+), 23 deletions(-)

diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index ade44ca0c66c..d8b47624fee2 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -347,7 +348,7 @@ int nfs_inode_set_delegation(struct inode *inode, struct 
rpc_cred *cred, struct
nfs4_stateid_copy(>stateid, >delegation);
delegation->type = res->delegation_type;
delegation->pagemod_limit = res->pagemod_limit;
-   delegation->change_attr = inode->i_version;
+   delegation->change_attr = inode_peek_iversion_raw(inode);
delegation->cred = get_rpccred(cred);
delegation->inode = inode;
delegation->flags = 1<vfs_inode.i_ctime;
 
if (NFS_SERVER(>vfs_inode)->nfs_client->rpc_ops->version == 4)
-   auxdata.change_attr = nfsi->vfs_inode.i_version;
+   auxdata.change_attr = inode_peek_iversion_raw(>vfs_inode);
 
if (bufmax > sizeof(auxdata))
bufmax = sizeof(auxdata);
@@ -243,7 +244,7 @@ enum fscache_checkaux nfs_fscache_inode_check_aux(void 
*cookie_netfs_data,
auxdata.ctime = nfsi->vfs_inode.i_ctime;
 
if (NFS_SERVER(>vfs_inode)->nfs_client->rpc_ops->version == 4)
-   auxdata.change_attr = nfsi->vfs_inode.i_version;
+   auxdata.change_attr = inode_peek_iversion_raw(>vfs_inode);
 
if (memcmp(data, , datalen) != 0)
return FSCACHE_CHECKAUX_OBSOLETE;
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index b992d2382ffa..93552c482992 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -38,8 +38,8 @@
 #include 
 #include 
 #include 
-
 #include 
+#include 
 
 #include "nfs4_fs.h"
 #include "callback.h"
@@ -483,7 +483,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct 
nfs_fattr *fattr, st
memset(>i_atime, 0, sizeof(inode->i_atime));
memset(>i_mtime, 0, sizeof(inode->i_mtime));
memset(>i_ctime, 0, sizeof(inode->i_ctime));
-   inode->i_version = 0;
+   inode_set_iversion_raw(inode, 0);
inode->i_size = 0;
clear_nlink(inode);
inode->i_uid = make_kuid(_user_ns, -2);
@@ -508,7 +508,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct 
nfs_fattr *fattr, st
else if (nfs_server_capable(inode, NFS_CAP_CTIME))
nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR);
if (fattr->valid & NFS_ATTR_FATTR_CHANGE)
-   inode->i_version = fattr->change_attr;
+   inode_set_iversion_raw(inode, fattr->change_attr);
else
nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR
| NFS_INO_REVAL_PAGECACHE);
@@ -1289,8 +1289,8 @@ static unsigned long nfs_wcc_update_inode(struct inode 
*inode, struct nfs_fattr
 
if ((fattr->valid & NFS_ATTR_FATTR_PRECHANGE)
&& (fattr->valid & NFS_ATTR_FATTR_CHANGE)
-   && inode->i_version == fattr->pre_change_attr) {
-   inode->i_version = fattr->change_attr;
+   && !inode_cmp_iversion_raw(inode, 
fattr->pre_change_attr)) {
+   inode_set_iversion_raw(inode, fattr->change_attr);
if (S_ISDIR(inode->i_mode))
nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA);
ret |= NFS_INO_INVALID_ATTR;
@@ -1348,7 +1348,7 @@ static int nfs_check_inode_attributes(struct inode 
*inode, struct nfs_fattr *fat
 
if (!nfs_file_has_buffered_writers(nfsi)) {
/* Verify a few of the more important attributes */
-   if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && 
inode->i_version != fattr->change_attr)
+   if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && 
inode_cmp_iversion_raw(inode, 

[PATCH v5 13/19] ufs: use new i_version API

2018-01-09 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
---
 fs/ufs/dir.c   | 9 +
 fs/ufs/inode.c | 3 ++-
 fs/ufs/super.c | 3 ++-
 3 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c
index 2edc1755b7c5..50dfce000864 100644
--- a/fs/ufs/dir.c
+++ b/fs/ufs/dir.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ufs_fs.h"
 #include "ufs.h"
@@ -47,7 +48,7 @@ static int ufs_commit_chunk(struct page *page, loff_t pos, 
unsigned len)
struct inode *dir = mapping->host;
int err = 0;
 
-   dir->i_version++;
+   inode_inc_iversion(dir);
block_write_end(NULL, mapping, pos, len, len, page, NULL);
if (pos+len > dir->i_size) {
i_size_write(dir, pos+len);
@@ -428,7 +429,7 @@ ufs_readdir(struct file *file, struct dir_context *ctx)
unsigned long n = pos >> PAGE_SHIFT;
unsigned long npages = dir_pages(inode);
unsigned chunk_mask = ~(UFS_SB(sb)->s_uspi->s_dirblksize - 1);
-   int need_revalidate = file->f_version != inode->i_version;
+   bool need_revalidate = inode_cmp_iversion(inode, file->f_version);
unsigned flags = UFS_SB(sb)->s_flags;
 
UFSD("BEGIN\n");
@@ -455,8 +456,8 @@ ufs_readdir(struct file *file, struct dir_context *ctx)
offset = ufs_validate_entry(sb, kaddr, offset, 
chunk_mask);
ctx->pos = (n<f_version = inode->i_version;
-   need_revalidate = 0;
+   file->f_version = inode_query_iversion(inode);
+   need_revalidate = false;
}
de = (struct ufs_dir_entry *)(kaddr+offset);
limit = kaddr + ufs_last_byte(inode, n) - UFS_DIR_REC_LEN(1);
diff --git a/fs/ufs/inode.c b/fs/ufs/inode.c
index afb601c0dda0..c843ec858cf7 100644
--- a/fs/ufs/inode.c
+++ b/fs/ufs/inode.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ufs_fs.h"
 #include "ufs.h"
@@ -693,7 +694,7 @@ struct inode *ufs_iget(struct super_block *sb, unsigned 
long ino)
if (err)
goto bad_inode;
 
-   inode->i_version++;
+   inode_inc_iversion(inode);
ufsi->i_lastfrag =
(inode->i_size + uspi->s_fsize - 1) >> uspi->s_fshift;
ufsi->i_dir_start_lookup = 0;
diff --git a/fs/ufs/super.c b/fs/ufs/super.c
index 4d497e9c6883..b6ba80e05bff 100644
--- a/fs/ufs/super.c
+++ b/fs/ufs/super.c
@@ -88,6 +88,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ufs_fs.h"
 #include "ufs.h"
@@ -1440,7 +1441,7 @@ static struct inode *ufs_alloc_inode(struct super_block 
*sb)
if (!ei)
return NULL;
 
-   ei->vfs_inode.i_version = 1;
+   inode_set_iversion(>vfs_inode, 1);
seqlock_init(>meta_lock);
mutex_init(>truncate_mutex);
return >vfs_inode;
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Recommendations for balancing as part of regular maintenance?

2018-01-09 Thread Tom Worster

On 9 Jan 2018, at 7:23, Austin S. Hemmelgarn wrote:


On 2018-01-08 16:43, Tom Worster wrote:


Given the documentation and the usage stats, I did not know what 
options to use with balance. I spent some time reading and 
researching and trying to understand the filters and how they should 
relate to my situation. Eventually I abandoned that effort and ran 
balance without options.
Hopefully the explanation I gave on the filters in the Github issue 
helped some.  In this case though, it sounds like running a filtered 
balance probably wouldn't have saved you much over a full one.


Yes, it helped. Hugo's email helped too. I now have a better 
understanding of balance filters.


At the same time, Hugo's email and others in this thread added to my 
belief that I'm now managing systems with a filesystem I'm unqualified 
to use.


Tom
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 11/19] nfsd: convert to new i_version API

2018-01-09 Thread Jeff Layton
From: Jeff Layton 

Mostly just making sure we use the "get" wrappers so we know when
it is being fetched for later use.

Signed-off-by: Jeff Layton 
---
 fs/nfsd/nfsfh.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 43f31cf49bae..b8444189223b 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static inline __u32 ino_t_to_u32(ino_t ino)
 {
@@ -259,7 +260,7 @@ static inline u64 nfsd4_change_attribute(struct inode 
*inode)
chattr =  inode->i_ctime.tv_sec;
chattr <<= 30;
chattr += inode->i_ctime.tv_nsec;
-   chattr += inode->i_version;
+   chattr += inode_query_iversion(inode);
return chattr;
 }
 
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 12/19] ocfs2: convert to new i_version API

2018-01-09 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
Reviewed-by: Jan Kara 
---
 fs/ocfs2/dir.c  | 15 ---
 fs/ocfs2/inode.c|  3 ++-
 fs/ocfs2/namei.c|  3 ++-
 fs/ocfs2/quota_global.c |  3 ++-
 4 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c
index febe6312ceff..32f9c72dff17 100644
--- a/fs/ocfs2/dir.c
+++ b/fs/ocfs2/dir.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -1174,7 +1175,7 @@ static int __ocfs2_delete_entry(handle_t *handle, struct 
inode *dir,
le16_add_cpu(>rec_len,
le16_to_cpu(de->rec_len));
de->inode = 0;
-   dir->i_version++;
+   inode_inc_iversion(dir);
ocfs2_journal_dirty(handle, bh);
goto bail;
}
@@ -1729,7 +1730,7 @@ int __ocfs2_add_entry(handle_t *handle,
if (ocfs2_dir_indexed(dir))
ocfs2_recalc_free_list(dir, handle, lookup);
 
-   dir->i_version++;
+   inode_inc_iversion(dir);
ocfs2_journal_dirty(handle, insert_bh);
retval = 0;
goto bail;
@@ -1775,7 +1776,7 @@ static int ocfs2_dir_foreach_blk_id(struct inode *inode,
 * readdir(2), then we might be pointing to an invalid
 * dirent right now.  Scan from the start of the block
 * to make sure. */
-   if (*f_version != inode->i_version) {
+   if (inode_cmp_iversion(inode, *f_version)) {
for (i = 0; i < i_size_read(inode) && i < offset; ) {
de = (struct ocfs2_dir_entry *)
(data->id_data + i);
@@ -1791,7 +1792,7 @@ static int ocfs2_dir_foreach_blk_id(struct inode *inode,
i += le16_to_cpu(de->rec_len);
}
ctx->pos = offset = i;
-   *f_version = inode->i_version;
+   *f_version = inode_query_iversion(inode);
}
 
de = (struct ocfs2_dir_entry *) (data->id_data + ctx->pos);
@@ -1869,7 +1870,7 @@ static int ocfs2_dir_foreach_blk_el(struct inode *inode,
 * readdir(2), then we might be pointing to an invalid
 * dirent right now.  Scan from the start of the block
 * to make sure. */
-   if (*f_version != inode->i_version) {
+   if (inode_cmp_iversion(inode, *f_version)) {
for (i = 0; i < sb->s_blocksize && i < offset; ) {
de = (struct ocfs2_dir_entry *) (bh->b_data + 
i);
/* It's too expensive to do a full
@@ -1886,7 +1887,7 @@ static int ocfs2_dir_foreach_blk_el(struct inode *inode,
offset = i;
ctx->pos = (ctx->pos & ~(sb->s_blocksize - 1))
| offset;
-   *f_version = inode->i_version;
+   *f_version = inode_query_iversion(inode);
}
 
while (ctx->pos < i_size_read(inode)
@@ -1940,7 +1941,7 @@ static int ocfs2_dir_foreach_blk(struct inode *inode, u64 
*f_version,
  */
 int ocfs2_dir_foreach(struct inode *inode, struct dir_context *ctx)
 {
-   u64 version = inode->i_version;
+   u64 version = inode_query_iversion(inode);
ocfs2_dir_foreach_blk(inode, , ctx, true);
return 0;
 }
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index 1a1e0078ab38..d51b80edd972 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -302,7 +303,7 @@ void ocfs2_populate_inode(struct inode *inode, struct 
ocfs2_dinode *fe,
OCFS2_I(inode)->ip_attr = le32_to_cpu(fe->i_attr);
OCFS2_I(inode)->ip_dyn_features = le16_to_cpu(fe->i_dyn_features);
 
-   inode->i_version = 1;
+   inode_set_iversion(inode, 1);
inode->i_generation = le32_to_cpu(fe->i_generation);
inode->i_rdev = huge_decode_dev(le64_to_cpu(fe->id1.dev1.i_rdev));
inode->i_mode = le16_to_cpu(fe->i_mode);
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index 3b0a10d9b36f..c801eddc4bf3 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -1520,7 +1521,7 @@ static int ocfs2_rename(struct inode *old_dir,
mlog_errno(status);
goto bail;
}
-   new_dir->i_version++;
+   inode_inc_iversion(new_dir);
 
if (S_ISDIR(new_inode->i_mode))

[PATCH v5 15/19] IMA: switch IMA over to new i_version API

2018-01-09 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
---
 security/integrity/ima/ima_api.c  | 3 ++-
 security/integrity/ima/ima_main.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/security/integrity/ima/ima_api.c b/security/integrity/ima/ima_api.c
index c7e8db0ea4c0..c6ae42266270 100644
--- a/security/integrity/ima/ima_api.c
+++ b/security/integrity/ima/ima_api.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ima.h"
 
@@ -215,7 +216,7 @@ int ima_collect_measurement(struct integrity_iint_cache 
*iint,
 * which do not support i_version, support is limited to an initial
 * measurement/appraisal/audit.
 */
-   i_version = file_inode(file)->i_version;
+   i_version = inode_query_iversion(inode);
hash.hdr.algo = algo;
 
/* Initialize hash digest to 0's in case of failure */
diff --git a/security/integrity/ima/ima_main.c 
b/security/integrity/ima/ima_main.c
index 50b8254d..06a70c5a2329 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ima.h"
 
@@ -128,7 +129,7 @@ static void ima_check_last_writer(struct 
integrity_iint_cache *iint,
inode_lock(inode);
if (atomic_read(>i_writecount) == 1) {
if (!IS_I_VERSION(inode) ||
-   (iint->version != inode->i_version) ||
+   inode_cmp_iversion(inode, iint->version) ||
(iint->flags & IMA_NEW_FILE)) {
iint->flags &= ~(IMA_DONE_MASK | IMA_NEW_FILE);
iint->measured_pcrs = 0;
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 14/19] xfs: convert to new i_version API

2018-01-09 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
Acked-by: Darrick J. Wong 
---
 fs/xfs/libxfs/xfs_inode_buf.c | 7 +--
 fs/xfs/xfs_icache.c   | 5 +++--
 fs/xfs/xfs_inode.c| 3 ++-
 fs/xfs/xfs_inode_item.c   | 3 ++-
 fs/xfs/xfs_trans_inode.c  | 4 +++-
 5 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 6b7989038d75..b9c0bf80669c 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -32,6 +32,8 @@
 #include "xfs_ialloc.h"
 #include "xfs_dir2.h"
 
+#include 
+
 /*
  * Check that none of the inode's in the buffer have a next
  * unlinked field of 0.
@@ -264,7 +266,8 @@ xfs_inode_from_disk(
to->di_flags= be16_to_cpu(from->di_flags);
 
if (to->di_version == 3) {
-   inode->i_version = be64_to_cpu(from->di_changecount);
+   inode_set_iversion_queried(inode,
+  be64_to_cpu(from->di_changecount));
to->di_crtime.t_sec = be32_to_cpu(from->di_crtime.t_sec);
to->di_crtime.t_nsec = be32_to_cpu(from->di_crtime.t_nsec);
to->di_flags2 = be64_to_cpu(from->di_flags2);
@@ -314,7 +317,7 @@ xfs_inode_to_disk(
to->di_flags = cpu_to_be16(from->di_flags);
 
if (from->di_version == 3) {
-   to->di_changecount = cpu_to_be64(inode->i_version);
+   to->di_changecount = cpu_to_be64(inode_peek_iversion(inode));
to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec);
to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
to->di_flags2 = cpu_to_be64(from->di_flags2);
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 43005fbe8b1e..4c315adb05e6 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -37,6 +37,7 @@
 
 #include 
 #include 
+#include 
 
 /*
  * Allocate and initialise an xfs_inode.
@@ -293,14 +294,14 @@ xfs_reinit_inode(
int error;
uint32_tnlink = inode->i_nlink;
uint32_tgeneration = inode->i_generation;
-   uint64_tversion = inode->i_version;
+   uint64_tversion = inode_peek_iversion(inode);
umode_t mode = inode->i_mode;
 
error = inode_init_always(mp->m_super, inode);
 
set_nlink(inode, nlink);
inode->i_generation = generation;
-   inode->i_version = version;
+   inode_set_iversion_queried(inode, version);
inode->i_mode = mode;
return error;
 }
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 801274126648..dfc5e60d8af3 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -16,6 +16,7 @@
  * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
  */
 #include 
+#include 
 
 #include "xfs.h"
 #include "xfs_fs.h"
@@ -833,7 +834,7 @@ xfs_ialloc(
ip->i_d.di_flags = 0;
 
if (ip->i_d.di_version == 3) {
-   inode->i_version = 1;
+   inode_set_iversion(inode, 1);
ip->i_d.di_flags2 = 0;
ip->i_d.di_cowextsize = 0;
ip->i_d.di_crtime.t_sec = (int32_t)tv.tv_sec;
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 6ee5c3bf19ad..7571abf5dfb3 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -30,6 +30,7 @@
 #include "xfs_buf_item.h"
 #include "xfs_log.h"
 
+#include 
 
 kmem_zone_t*xfs_ili_zone;  /* inode log item zone */
 
@@ -354,7 +355,7 @@ xfs_inode_to_log_dinode(
to->di_next_unlinked = NULLAGINO;
 
if (from->di_version == 3) {
-   to->di_changecount = inode->i_version;
+   to->di_changecount = inode_peek_iversion(inode);
to->di_crtime.t_sec = from->di_crtime.t_sec;
to->di_crtime.t_nsec = from->di_crtime.t_nsec;
to->di_flags2 = from->di_flags2;
diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
index daa7615497f9..225544327c4f 100644
--- a/fs/xfs/xfs_trans_inode.c
+++ b/fs/xfs/xfs_trans_inode.c
@@ -28,6 +28,8 @@
 #include "xfs_inode_item.h"
 #include "xfs_trace.h"
 
+#include 
+
 /*
  * Add a locked inode to the transaction.
  *
@@ -117,7 +119,7 @@ xfs_trans_log_inode(
 */
if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
IS_I_VERSION(VFS_I(ip))) {
-   VFS_I(ip)->i_version++;
+   inode_inc_iversion(VFS_I(ip));
flags |= XFS_ILOG_CORE;
}
 
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 17/19] xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing

2018-01-09 Thread Jeff Layton
From: Jeff Layton 

If XFS_ILOG_CORE is already set then go ahead and increment it.

Signed-off-by: Jeff Layton 
Acked-by: Darrick J. Wong 
---
 fs/xfs/xfs_trans_inode.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
index 225544327c4f..4a89da4b6fe7 100644
--- a/fs/xfs/xfs_trans_inode.c
+++ b/fs/xfs/xfs_trans_inode.c
@@ -112,15 +112,17 @@ xfs_trans_log_inode(
 
/*
 * First time we log the inode in a transaction, bump the inode change
-* counter if it is configured for this to occur. We don't use
-* inode_inc_version() because there is no need for extra locking around
-* i_version as we already hold the inode locked exclusively for
-* metadata modification.
+* counter if it is configured for this to occur. While we have the
+* inode locked exclusively for metadata modification, we can usually
+* avoid setting XFS_ILOG_CORE if no one has queried the value since
+* the last time it was incremented. If we have XFS_ILOG_CORE already
+* set however, then go ahead and bump the i_version counter
+* unconditionally.
 */
if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
IS_I_VERSION(VFS_I(ip))) {
-   inode_inc_iversion(VFS_I(ip));
-   flags |= XFS_ILOG_CORE;
+   if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
+   flags |= XFS_ILOG_CORE;
}
 
tp->t_flags |= XFS_TRANS_DIRTY;
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   >