From: Filipe Manana <[email protected]>
During the final phase of a device replace operation, I ran into a
transaction abort that resulted in the following trace:
[23919.655368] WARNING: CPU: 10 PID: 30175 at fs/btrfs/extent-tree.c:9843
btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]()
[23919.664742] BTRFS: Transaction aborted (error -2)
[23919.665749] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd
auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse
parport_pc i2c_piix4 parport psmouse acpi_cpufreq processor i2c_core evdev
microcode pcspkr button serio_raw ext4 crc16 jbd2 mbcache sd_mod sg sr_mod
cdrom virtio_scsi ata_generic ata_piix virtio_pci floppy virtio_ring libata
e1000 virtio scsi_mod [last unloaded: btrfs]
[23919.679442] CPU: 10 PID: 30175 Comm: fsstress Not tainted
4.3.0-rc5-btrfs-next-17+ #1
[23919.682392] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[23919.689151] 0000000000000000 ffff8804020cbb50 ffffffff812566f4
ffff8804020cbb98
[23919.692604] ffff8804020cbb88 ffffffff8104d0a6 ffffffffa03eea69
ffff88041b678a48
[23919.694230] ffff88042ac38000 ffff88041b678930 00000000fffffffe
ffff8804020cbbf0
[23919.696716] Call Trace:
[23919.698669] [<ffffffff812566f4>] dump_stack+0x4e/0x79
[23919.700597] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
[23919.701958] [<ffffffffa03eea69>] ?
btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
[23919.703612] [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
[23919.705047] [<ffffffffa03eea69>]
btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
[23919.706967] [<ffffffffa0402097>] __btrfs_end_transaction+0x84/0x2dd [btrfs]
[23919.708611] [<ffffffffa0402300>] btrfs_end_transaction+0x10/0x12 [btrfs]
[23919.710099] [<ffffffffa03ef0b8>]
btrfs_alloc_data_chunk_ondemand+0x121/0x28b [btrfs]
[23919.711970] [<ffffffffa0413025>] btrfs_fallocate+0x7d3/0xc6d [btrfs]
[23919.713602] [<ffffffff8108b78f>] ? lock_acquire+0x10d/0x194
[23919.714756] [<ffffffff81086dbc>] ? percpu_down_read+0x51/0x78
[23919.716155] [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
[23919.718918] [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
[23919.724170] [<ffffffff8116b579>] vfs_fallocate+0x170/0x1ff
[23919.725482] [<ffffffff8117c1d7>] ioctl_preallocate+0x89/0x9b
[23919.726790] [<ffffffff8117c5ef>] do_vfs_ioctl+0x406/0x4e6
[23919.728428] [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
[23919.729642] [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
[23919.730782] [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
[23919.731847] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[23919.733330] ---[ end trace 166ef301a335832a ]---
This is due to a race between device replace and chunk allocation, which
the following diagram illustrates:
CPU 1 CPU 2
btrfs_dev_replace_finishing()
at this point
dev_replace->tgtdev->devid ==
BTRFS_DEV_REPLACE_DEVID (0ULL)
...
btrfs_start_transaction()
btrfs_commit_transaction()
btrfs_fallocate()
btrfs_alloc_data_chunk_ondemand()
btrfs_join_transaction()
--> starts a new
transaction
do_chunk_alloc()
lock fs_info->chunk_mutex
btrfs_alloc_chunk()
--> creates extent map
for
the new chunk with
em->bdev->map->stripes[i]->dev->devid
== X (X > 0)
--> extent map is
added to
fs_info->mapping_tree
--> initial phase of
bg A
allocation
completes
unlock fs_info->chunk_mutex
lock fs_info->chunk_mutex
btrfs_dev_replace_update_device_in_mapping_tree()
--> iterates fs_info->mapping_tree and
replaces the device in every extent
map's map->stripes[] with
dev_replace->tgtdev, which still has
an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)
btrfs_end_transaction()
btrfs_create_pending_block_groups()
--> starts final phase of
bg A creation
(update device,
extent, and chunk
trees, etc)
btrfs_finish_chunk_alloc()
btrfs_update_device()
--> attempts to
update a device
item with ID ==
0ULL
(BTRFS_DEV_REPLACE_DEVID)
which is the
current ID of
bg A's
em->bdev->map->stripes[i]->dev->devid
--> doesn't find
such item
returns -ENOENT
--> the device id
should have been X
and not 0ULL
got -ENOENT from
btrfs_finish_chunk_alloc()
and aborts current
transaction
finishes setting up the target device,
namely it sets tgtdev->devid to the value
of srcdev->devid, which is X (and X > 0)
So fix this by replacing the device object in the extent maps only after
finishing the setup for the target device (setting its new ID, etc) and
before setting the old device's ID to BTRFS_DEV_REPLACE_DEVID.
This happened while running fstest btrfs/071.
Signed-off-by: Filipe Manana <[email protected]>
---
V2: Made the attribution of a new id from the old device to happen
after replacing the device for the extent maps with the new device,
as obviously it was missing before.
fs/btrfs/dev-replace.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 1e668fb..923b375 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -516,12 +516,7 @@ static int btrfs_dev_replace_finishing(struct
btrfs_fs_info *fs_info,
dev_replace->time_stopped = get_seconds();
dev_replace->item_needs_writeback = 1;
- /* replace old device with new one in mapping tree */
- if (!scrub_ret) {
- btrfs_dev_replace_update_device_in_mapping_tree(fs_info,
- src_device,
- tgt_device);
- } else {
+ if (scrub_ret) {
btrfs_err_in_rcu(root->fs_info,
"btrfs_scrub_dev(%s, %llu, %s) failed %d",
src_device->missing ? "<missing disk>" :
@@ -547,7 +542,6 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info
*fs_info,
rcu_str_deref(tgt_device->name));
tgt_device->is_tgtdev_for_dev_replace = 0;
tgt_device->devid = src_device->devid;
- src_device->devid = BTRFS_DEV_REPLACE_DEVID;
memcpy(uuid_tmp, tgt_device->uuid, sizeof(uuid_tmp));
memcpy(tgt_device->uuid, src_device->uuid, sizeof(tgt_device->uuid));
memcpy(src_device->uuid, uuid_tmp, sizeof(src_device->uuid));
@@ -565,6 +559,18 @@ static int btrfs_dev_replace_finishing(struct
btrfs_fs_info *fs_info,
list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list);
fs_info->fs_devices->rw_devices++;
+ /*
+ * Now that our new device replaced the old device and it's completely
+ * setup, replace the old device with the new one in the mapping tree.
+ * But do it before setting the old device's id to
+ * BTRFS_DEV_REPLACE_DEVID, to avoid racing with tasks finishing chunk
+ * allocation.
+ */
+ btrfs_dev_replace_update_device_in_mapping_tree(fs_info,
+ src_device,
+ tgt_device);
+ src_device->devid = BTRFS_DEV_REPLACE_DEVID;
+
btrfs_dev_replace_unlock(dev_replace);
btrfs_rm_dev_replace_blocked(fs_info);
--
2.1.3
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html