This is a followup to my previous post, "About free space fragmentation,
metadata write amplification and (no)ssd", exploring how well or how
badly btrfs handles filesystems that are larger than your average
desktop computer's.
One of the things I'm looking at doing is converting the metadata of a
large filesystem from DUP to single, because:
1. In this particular situation, the underlying disk storage is
   considered reliable enough to handle bitrot and failed disks itself.
2. It would immediately reduce metadata writes (the favourite thing
   this filesystem wants to do all the time) by 50%.

So, I used the clone functionality of the underlying iSCSI target to get
a writable throw-away version of the filesystem to experiment with
(great!).

== The starting point ==

Data, single: total=39.46TiB, used=35.63TiB
System, DUP: total=40.00MiB, used=6.23MiB
Metadata, DUP: total=454.50GiB, used=441.46GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

~90000 subvolumes, which are related to each other in groups of about
32, with no shared data extents between those groups. That's around 900x
a 512MiB metadata DUP block group, all >90% used.

I wrote a simple script to count the metadata tree sizes; here's the
result, sorted by tree and level (0 = leaf, >0 are nodes):

# ./show_metadata_tree_sizes.py /srv/backup/
ROOT_TREE        68.11MiB 0(    4346) 1(    12) 2(     1)
EXTENT_TREE      14.63GiB 0(  955530) 1(  3572) 2(    16) 3(     1)
CHUNK_TREE        6.17MiB 0(     394) 1(     1)
DEV_TREE          3.50MiB 0(     223) 1(     1)
FS_TREE         382.63GiB 0(24806886) 1(250284) 2( 18258) 3(   331)
CSUM_TREE        41.98GiB 0( 2741020) 1(  9930) 2(    45) 3(     1)
QUOTA_TREE          0.00B
UUID_TREE         3.28MiB 0(     209) 1(     1)
FREE_SPACE_TREE  79.31MiB 0(    5063) 1(    12) 2(     1)
DATA_RELOC_TREE  16.00KiB 0(       1)

FS_TREE counts tree 5 and all other subvolumes together.

Kernel: Linux 4.9.18 (Debian)
Progs: 4.9.1 (Debian)

== Test 1: Just trying it ==

So, let's just do a

  btrfs balance start -f -mconvert=single /srv/backup

The result:
* One by one, the metadata block groups are emptied, from highest vaddr
  to lowest vaddr.
* For each 512MiB that is removed, a new 1024MiB block group is forcibly
  added.
* For each 512MiB, it takes on average 3 hours to empty it, during which
  the filesystem is writing metadata to disk at 100MiB/s.
  That means that to move 512MiB to another place, it needs to write a
  bit more than *1TiB* to disk (3*3600*100MiB). And it seems to be
  touching almost all of the 900 metadata block groups on every
  committed transaction.
* Instead of moving the metadata into the new single-type block groups,
  it seems to prefer to keep shuffling it around inside the DUP block
  groups, as long as there's any free space to be found in them.

I let it run for a day, and then stopped it.

So, naively extrapolating: running at full speed and doing nothing else,
this would take 112.5 days (900 block groups x 3 hours each), while
writing almost a Petabyte of metadata to disk. Hmm...

== Test 2: Does reducing metadata size help? ==

Another thing I tried was to see what the effect of removing a lot of
subvolumes would be. I simply ran the backup expiries for everything
that would expire in the next two weeks (which is at least all daily
backup snapshots, which are kept for 14 days by default). After that:

Data, single: total=38.62TiB, used=30.15TiB
System, DUP: total=40.00MiB, used=6.16MiB
Metadata, DUP: total=454.00GiB, used=391.78GiB
GlobalReserve, single: total=512.00MiB, used=265.62MiB

About 54000 subvolumes are left now. Hmmzzz... the FS trees shrank from
~380 to ~340 GiB... not spectacular.

ROOT_TREE        48.05MiB 0(    3064) 1(    10) 2(     1)
EXTENT_TREE      14.41GiB 0(  940473) 1(  3559) 2(    16) 3(     1)
CHUNK_TREE        6.16MiB 0(     393) 1(     1)
DEV_TREE          3.50MiB 0(     223) 1(     1)
FS_TREE         339.80GiB 0(22072422) 1(183505) 2( 12821) 3(   272)
CSUM_TREE        37.33GiB 0( 2437006) 1(  9519) 2(    44) 3(     1)
QUOTA_TREE          0.00B
UUID_TREE         3.25MiB 0(     207) 1(     1)
FREE_SPACE_TREE 119.44MiB 0(    7619) 1(    24) 2(     1)
DATA_RELOC_TREE  16.00KiB 0(       1)

Now trying it again:

  btrfs balance start -f -mconvert=single /srv/backup

The result:
* For each 512MiB, it now takes on average 1 hour, writing 100MiB/s to
  disk.
* Almost all metadata is rewritten into existing DUP chunks (!!), since
  there's more room in them now, because of the reduced total metadata
  amount.
* For the little bit of metadata that does get written to the new
  single chunks (which have twice the amount of total space in them,
  because every 512MiB is traded for a new 1024MiB...), it shows a
  somewhat interesting pattern:

vaddr length flags used used_pct
[.. many more above here ..]
v 87196153413632 l 512MiB f METADATA|DUP used 379338752 pct 71
v 87784027062272 l 512MiB f METADATA|DUP used 351125504 pct 65
v 87784563933184 l 512MiB f METADATA|DUP used 365297664 pct 68
v 87901064921088 l 512MiB f METADATA|DUP used 403718144 pct 75
v 87901601792000 l 512MiB f METADATA|DUP used 373047296 pct 69
v 87969784397824 l 512MiB f METADATA|DUP used 376979456 pct 70
v 87971395010560 l 512MiB f METADATA|DUP used 398917632 pct 74
v 87971931881472 l 512MiB f METADATA|DUP used 391757824 pct 73
v 88126013833216 l 512MiB f METADATA|DUP used 426967040 pct 80
v 88172721602560 l 512MiB f METADATA|DUP used 418840576 pct 78
v 88186143375360 l 512MiB f METADATA|DUP used 422821888 pct 79
v 88187753988096 l 512MiB f METADATA|DUP used 395575296 pct 74
v 88190438342656 l 512MiB f METADATA|DUP used 388841472 pct 72
v 88545310015488 l 512MiB f METADATA|DUP used 347045888 pct 65
v 88545846886400 l 512MiB f METADATA|DUP used 318111744 pct 59
v 88546383757312 l 512MiB f METADATA|DUP used 101662720 pct 19
v 89532615622656 l 1GiB f METADATA used 150994944 pct 14
v 89533689364480 l 1GiB f METADATA used 150716416 pct 14
v 89534763106304 l 1GiB f METADATA used 144375808 pct 13
v 89535836848128 l 1GiB f METADATA used 140738560 pct 13
v 89536910589952 l 1GiB f METADATA used 144637952 pct 13
v 89537984331776 l 1GiB f METADATA used 153124864 pct 14
v 89539058073600 l 1GiB f METADATA used 127434752 pct 12
v 89540131815424 l 1GiB f METADATA used 113655808 pct 11
v 89541205557248 l 1GiB f METADATA used 99450880 pct 9
v 89542279299072 l 1GiB f METADATA used 90652672 pct 8
v 89543353040896 l 1GiB f METADATA used 78725120 pct 7
v 89544426782720 l 1GiB f METADATA used 74186752 pct 7
v 89545500524544 l 1GiB f METADATA used 65175552 pct 6
v 89546574266368 l 1GiB f METADATA used 47136768 pct 4
v 89547648008192 l 1GiB f METADATA used 30965760 pct 3
v 89548721750016 l 1GiB f METADATA used 15187968 pct 1

So, it's 3 times faster per block group, but it will just keep redoing
the same work over and over again, apparently leading to a
900 + 899 + 898 + 897 + ... pattern in the total amount of work. Still
doesn't sound encouraging.

== Intermezzo: endless btrfs_merge_delayed_refs ==

Test 3: What happens when starting at the lowest vaddr?

This test is mainly about just trying things and finding out what
happens. I tried to feed the metadata block group with the lowest vaddr
to balance:

  btrfs balance start -f -mconvert=single,soft,vrange=29360128..29360129 /srv/backup

When doing so, the filesystem immediately ends up using 100% kernel cpu
and does not read from or write to disk anymore. After letting it run
for two hours, there's no change. These two processes are just burning
100% cpu, showing the following stack traces (which do not change over
time) in /proc/<pid>/stack:

kworker/u20:3
[<ffffffffc00caa74>] btrfs_insert_empty_items+0x94/0xc0 [btrfs]
[<ffffffff815fc689>] error_exit+0x9/0x20
[<ffffffffc01426fe>] btrfs_merge_delayed_refs+0xee/0x570 [btrfs]
[<ffffffffc00d5ded>] __btrfs_run_delayed_refs+0xad/0x13a0 [btrfs]
[<ffffffff810abdb1>] update_curr+0xe1/0x160
[<ffffffff811e02dc>] kmem_cache_alloc+0xbc/0x520
[<ffffffff810aabc4>] account_entity_dequeue+0xa4/0xc0
[<ffffffffc00da07d>] btrfs_run_delayed_refs+0x9d/0x2b0 [btrfs]
[<ffffffffc00da319>] delayed_ref_async_start+0x89/0xa0 [btrfs]
[<ffffffffc0124fff>] btrfs_scrubparity_helper+0xcf/0x2d0 [btrfs]
[<ffffffff81090384>] process_one_work+0x184/0x410
[<ffffffff8109065d>] worker_thread+0x4d/0x480
[<ffffffff81090610>] process_one_work+0x410/0x410
[<ffffffff81090610>] process_one_work+0x410/0x410
[<ffffffff8107bb0a>] do_group_exit+0x3a/0xa0
[<ffffffff810965ce>] kthread+0xce/0xf0
[<ffffffff81024701>] __switch_to+0x2c1/0x6c0
[<ffffffff81096500>] kthread_park+0x60/0x60
[<ffffffff815fb2f5>] ret_from_fork+0x25/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

btrfs-transacti
[<ffffffff815fc689>] error_exit+0x9/0x20
[<ffffffffc01426fe>] btrfs_merge_delayed_refs+0xee/0x570 [btrfs]
[<ffffffffc01426a5>] btrfs_merge_delayed_refs+0x95/0x570 [btrfs]
[<ffffffffc00d5ded>] __btrfs_run_delayed_refs+0xad/0x13a0 [btrfs]
[<ffffffff8109d94d>] finish_task_switch+0x7d/0x1f0
[<ffffffffc00da07d>] btrfs_run_delayed_refs+0x9d/0x2b0 [btrfs]
[<ffffffff8107bb0a>] do_group_exit+0x3a/0xa0
[<ffffffffc00f0b10>] btrfs_commit_transaction+0x40/0xa10 [btrfs]
[<ffffffffc00f1576>] start_transaction+0x96/0x480 [btrfs]
[<ffffffffc00eb9ac>] transaction_kthread+0x1dc/0x200 [btrfs]
[<ffffffffc00eb7d0>] btrfs_cleanup_transaction+0x580/0x580 [btrfs]
[<ffffffff810965ce>] kthread+0xce/0xf0
[<ffffffff81024701>] __switch_to+0x2c1/0x6c0
[<ffffffff81096500>] kthread_park+0x60/0x60
[<ffffffff815fb2f5>] ret_from_fork+0x25/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

== Considering the options ==

Well, this all doesn't look good, that's for sure. Especially the
tendency to empty DUP block groups into other DUP block groups (which
then need to be removed again later) instead of into single ones when
converting is a bit sad.

== Thinking out of the box ==

Technically, converting from DUP to single could also mean:
* Flipping one bit to 0 in the block group type flags, for each block
  group item.
* Flipping one bit to 0 in the chunk type flags, and removing 1 stripe
  struct, for each metadata chunk item.
* Removing the
* Anything else?

How feasible would it be to write a btrfs-progs style conversion tool
to do this?

-- 
Hans van Kranenburg
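To illustrate the "flipping one bit" idea from the list above: the
profile is just a bit in the same flags field that holds the block group
type. The BTRFS_BLOCK_GROUP_* values below match the constants in the
kernel's ctree.h; everything else is a hypothetical toy sketch of the
idea, not a working converter (the actual job of rewriting block group
items, chunk items and the stripe structs consistently on disk is left
out entirely).

```python
# Toy sketch of the DUP -> single conversion described above.
# Constants mirror BTRFS_BLOCK_GROUP_* from the kernel's ctree.h.
BTRFS_BLOCK_GROUP_DATA     = 1 << 0
BTRFS_BLOCK_GROUP_SYSTEM   = 1 << 1
BTRFS_BLOCK_GROUP_METADATA = 1 << 2
BTRFS_BLOCK_GROUP_DUP      = 1 << 5

def dup_to_single(flags):
    """Clear the DUP profile bit; the type bits stay untouched."""
    return flags & ~BTRFS_BLOCK_GROUP_DUP

# A METADATA|DUP block group item becomes plain METADATA (single):
flags = BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DUP
print(hex(flags), '->', hex(dup_to_single(flags)))  # 0x24 -> 0x4
```

The same bit would have to be cleared in the corresponding chunk item,
whose num_stripes would go from 2 to 1, and the space of the dropped
second stripe given back, which is exactly the kind of offline surgery
that sounds like a btrfs-progs style tool rather than a balance run.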