Sorry for the mail spam, it's an interesting code puzzle... :)

On 10/10/2017 07:22 PM, Hans van Kranenburg wrote:
> On 10/10/2017 07:07 PM, Hans van Kranenburg wrote:
>> On 10/10/2017 01:31 PM, David Sterba wrote:
>>> Hi,
>>>
>>> On Fri, Sep 29, 2017 at 04:20:51PM +0900, Naohiro Aota wrote:
>>>> Balancing a fresh METADATA=dup btrfs file system (with size < 50G)
>>>> generates a 128MB sized block group. While we set max_stripe_size =
>>>> max_chunk_size = 256MB, we get this half-sized block group:
>>>>
>>>> $ btrfs ins dump-t -t CHUNK_TREE btrfs.img | grep length
>>>>     length 8388608 owner 2 stripe_len 65536 type DATA
>>>>     length 33554432 owner 2 stripe_len 65536 type SYSTEM|DUP
>>>>     length 134217728 owner 2 stripe_len 65536 type METADATA|DUP
>>>>
>>>> Before commit 86db25785a6e ("Btrfs: fix max chunk size on raid5/6"),
>>>> we used "stripe_size * ndevs > max_chunk_size * ncopies" to check the
>>>> max chunk size. Since stripe_size = 256MB * dev_stripes (= 2) = 512MB,
>>>> ndevs = 1, max_chunk_size = 256MB, and ncopies = 2, we allowed a 256MB
>>>> METADATA|DUP block group.
>>>>
>>>> But now we use "stripe_size * data_stripes > max_chunk_size". Since
>>>> data_stripes = 1 for DUP, it disallows a block group larger than
>>>> 128MB. What's missing here is "dev_stripes". The proper logical space
>>>> used by the block group is "stripe_size * data_stripes / dev_stripes".
>>>> Tweak the equations to use the right value.
>>>
>>> I started looking into it and still don't fully understand it. A change
>>> deep in the allocator can easily break some block group combinations,
>>> so I'm rather conservative here.
>>
>> I think that the added usage of data_stripes in 86db25785a6e is the
>> problematic change. data_stripes is something that was introduced as
>> part of RAID56 in 53b381b3a, and it clearly only has a meaning that's
>> properly thought out for RAID56. The RAID56 commit already adds "this
>> will have to be fixed for RAID1 and RAID10 over more drives", only the
>> author didn't catch the DUP case, which already breaks at that point.
>>
>> At the beginning it says:
>>
>>     int data_stripes;  /* number of stripes that count for
>>                           block group size */
>>
>> For the example, this is DUP:
>>
>>     .sub_stripes        = 1,
>>     .dev_stripes        = 2,
>>     .devs_max           = 1,
>>     .devs_min           = 1,
>>     .tolerated_failures = 0,
>>     .devs_increment     = 1,
>>     .ncopies            = 2,
>>
>> In the code:
>>
>>     max_stripe_size = SZ_256M
>>     max_chunk_size = max_stripe_size -> SZ_256M
>>
>> Then we have find_free_dev_extent:
>>
>>     max_stripe_size * dev_stripes -> SZ_256M * 2 -> 512M
>>
>> So we'd like to find 512M on a disk, to stuff 2 stripes of 256M inside
>> for the DUP. (Remember: the two parts of DUP *never* end up on
>> different disks, even if you have multiple.)
>>
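Before continuing with the quoted walkthrough below, here's a minimal
standalone C sketch that just replays the full chunk size arithmetic for
METADATA|DUP on one device, the way I read the 4.14-era
__btrfs_alloc_chunk. It's not the kernel code itself, only the numbers:

    #include <stdio.h>

    #define SZ_256M (256ULL * 1024 * 1024)

    int main(void)
    {
        unsigned long long max_stripe_size = SZ_256M;
        unsigned long long max_chunk_size = SZ_256M;
        unsigned long long dev_stripes = 2, ncopies = 2, ndevs = 1;

        /* find_free_dev_extent() is asked for max_stripe_size *
         * dev_stripes = 512M; assume the device has that much free */
        unsigned long long max_avail = max_stripe_size * dev_stripes;

        unsigned long long stripe_size = max_avail;              /* 512M */
        unsigned long long num_stripes = ndevs * dev_stripes;    /* 2 */
        unsigned long long data_stripes = num_stripes / ncopies; /* 1 */

        /* the check from 86db25785a6e: 512M * 1 > 256M, so clamp */
        if (stripe_size * data_stripes > max_chunk_size)
            stripe_size = max_chunk_size / data_stripes;         /* 256M */

        /* ...and then it's halved again for the 2 stripes per device */
        stripe_size /= dev_stripes;                              /* 128M */

        printf("block group: %lluM\n",
               (stripe_size * data_stripes) / (1024 * 1024));
        return 0;
    }

This prints "block group: 128M" instead of the expected 256M, matching
the dump-tree output above.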
Another, better fix would be to make a change here...

>> If we find one:
>>
>>     stripe_size = devices_info[ndevs-1].max_avail -> 512M, yay

...because this is not "yay". The 512M is max_avail, which needs to hold
*2* stripes, not 1. So for DUP, stripe_size is suddenly set to twice the
actual stripe size.

So an additional division by dev_stripes would translate the max_avail
on device ndevs-1 into the stripe size to use (see the sketch in the
P.S. below).

>>     num_stripes = ndevs * dev_stripes -> 1 * 2 -> 2
>>
>>     data_stripes = num_stripes / ncopies = 2 / 2 = 1
>
> Oh, wow, this is not true of course, because the "number of stripes
> that count for block group size" should still be 1 for DUP...
>
>> BOOM! There's the problem. data_stripes only thinks about data that is
>> spread horizontally over disks, not vertically...
>
> Hm... no Hans, you're wrong.
>
>> What I would propose is changing...
>>     the data_stripes = <blah> and afterwards trying to correct it
>>     with some ifs
>> ...to...
>>     a switch/case thing where the explicit logic is put in to get the
>>     right value for each specific raid type.
>>
>> In the case of DUP this simply means data_stripes = 2, because there
>> is no need for fancy calculations about spreading DUP data over X
>> devices. It's always 2 copies on 1 device.
>
> Eh, nope. 1.
>
> So then I end up at "stripe_size * data_stripes > max_chunk_size"
> again.
>
> So, yes, Naohiro is right, and DUP is the only case in which this
> logic breaks. DUP is the only profile for which this change makes a
> difference, because it's the only one that has dev_stripes set to
> something other than 1.

If stripe_size doesn't incorrectly get set to dev_stripes * stripe_size
(the max_avail) above, the if works.

The comments below still stand, because doing all those calculations
with stripes and numbers of devices for something as predictable as DUP
is only confusing. :D

> \o/
>
>> ...
>>
>> My general feeling when looking at the code is that this single piece
>> of code is responsible for too many different cases; or rather, for
>> more possible cases than a developer can reason about at once "in his
>> head" while working on it.
>>
>> 7 raid options * 3 different types (data, metadata, system) = 21
>> combinations already... Some parts of the algorithm only make sense
>> for a subset of the combinations, but they're still part of the
>> computation, which sometimes "by accident" results in the correct
>> outcome. :)
>>
>> If it can't be done in a way that's easier to understand when reading
>> the code, it should have unit tests with a list of known inputs and
>> outputs to detect unwanted changes.

--
Hans van Kranenburg
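P.S. To make the "divide max_avail by dev_stripes" idea above concrete:
roughly this, as an untested sketch against the 4.14-era
__btrfs_alloc_chunk (rounding details omitted):

    -	stripe_size = devices_info[ndevs-1].max_avail;
    +	/*
    +	 * max_avail has room for dev_stripes stripes on this device
    +	 * (2 for DUP), so divide to get the actual stripe size.
    +	 */
    +	stripe_size = div_u64(devices_info[ndevs-1].max_avail,
    +			      dev_stripes);
     	num_stripes = ndevs * dev_stripes;

...plus removing the later "stripe_size = div_u64(stripe_size,
dev_stripes);", since the division has then already happened. With that,
the check compares 256M * 1 > 256M for the example above, nothing gets
clamped, and the full 256M METADATA|DUP block group survives.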