Sorry for the mail spam, it's an interesting code puzzle... :)

On 10/10/2017 07:22 PM, Hans van Kranenburg wrote:
> On 10/10/2017 07:07 PM, Hans van Kranenburg wrote:
>> On 10/10/2017 01:31 PM, David Sterba wrote:
>>> Hi,
>>>
>>> On Fri, Sep 29, 2017 at 04:20:51PM +0900, Naohiro Aota wrote:
>>>> Balancing a fresh METADATA=dup btrfs file system (with size < 50G)
>>>> generates a 128MB sized block group. Even though we set
>>>> max_stripe_size = max_chunk_size = 256MB, we get this half-sized
>>>> block group:
>>>>
>>>> $ btrfs ins dump-t -t CHUNK_TREE btrfs.img|grep length
>>>>                 length 8388608 owner 2 stripe_len 65536 type DATA
>>>>                 length 33554432 owner 2 stripe_len 65536 type SYSTEM|DUP
>>>>                 length 134217728 owner 2 stripe_len 65536 type METADATA|DUP
>>>>
>>>> Before commit 86db25785a6e ("Btrfs: fix max chunk size on raid5/6"), we
>>>> used "stripe_size * ndevs > max_chunk_size * ncopies" to check the max
>>>> chunk size. Since stripe_size = 256MB * dev_stripes (= 2) = 512MB, ndevs
>>>> = 1, max_chunk_size = 256MB, and ncopies = 2, we allowed a 256MB
>>>> METADATA|DUP block group.
>>>>
>>>> But now, we use "stripe_size * data_stripes > max_chunk_size". Since
>>>> data_stripes = 1 on DUP, it disallows the block group from growing
>>>> beyond 128MB. What is missing here is "dev_stripes". The proper
>>>> logical space used by the block group is
>>>> "stripe_size * data_stripes / dev_stripes". Tweak the equations to
>>>> use the right value.
>>>
>>> I started looking into it and still don't fully understand it. A change
>>> deep in the allocator can easily break some block group combinations, so
>>> I'm rather conservative here.
>>
>> I think that the added usage of data_stripes in 86db25785a6e is the
>> problematic change. data_stripes is something that was introduced as
>> part of RAID56 in 53b381b3a and clearly only has a meaning that was
>> properly thought through for RAID56. The RAID56 commit already adds
>> "this will have to be fixed for RAID1 and RAID10 over more drives", but
>> the author doesn't catch the DUP case, which already breaks at that
>> point.
>>
>> At the beginning it says:
>>
>> int data_stripes; /* number of stripes that count for block group size */
>>
>> For the example:
>>
>> This is DUP:
>>
>> .sub_stripes    = 1,
>> .dev_stripes    = 2,
>> .devs_max       = 1,
>> .devs_min       = 1,
>> .tolerated_failures = 0,
>> .devs_increment = 1,
>> .ncopies        = 2,
>>
>> In the code:
>>
>> max_stripe_size = SZ_256M
>> max_chunk_size = max_stripe_size    -> SZ_256M
>>
>> Then we have find_free_dev_extent:
>>     max_stripe_size * dev_stripes   -> SZ_256M * 2 -> 512M
>>
>> So we want to find 512M on a disk, to stuff 2 stripes of 256M inside
>> for the DUP. (remember: the two copies of DUP *never* end up on
>> different disks, even if you have multiple)
>>

Another, better fix would be to make a change here...

>> If we find one:
>> stripe_size = devices_info[ndevs-1].max_avail   -> 512M, yay

...because this is not yay. The 512M is max_avail, which needs to hold
*2* stripes, not 1. So stripe_size is suddenly set to twice the actual
stripe size for DUP.

So, dividing by dev_stripes again at this point would translate the
max_avail on device ndevs-1 into the stripe size to actually use.
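
For example, something like this (just a sketch, not a tested patch;
div_u64 is the usual kernel helper for 64-bit division):

-    stripe_size = devices_info[ndevs - 1].max_avail;
+    stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);

If the code further down then compensates by dividing by dev_stripes
again, that division would have to be dropped as well.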

>>
>> num_stripes = ndevs * dev_stripes -> 1 * 2  -> 2
>>
>> data_stripes = num_stripes / ncopies = 2 / 2 = 1
> 
> Oh, wow, this is not true of course, because the "number of stripes that
> count for block group size" should still be 1 for DUP...
> 
>> BOOM! There's the problem. The data_stripes only thinks about data that
>> is horizontally spread over disks, and not vertically spread...
> 
> Hm... no Hans, you're wrong.
> 
>> What I would propose is changing...
>>    the data_stripes = <blah> and afterwards trying to correct it with
>> some ifs
>> ...to...
>>    a switch/case construct where the logic to get the right value
>> for each specific raid type is spelled out explicitly.
>>
>> In case of DUP this simply means data_stripes = 2, because there is no
>> need for fancy calculations about spreading DUP data over X devices.
>> It's always 2 copies on 1 device.
> 
> Eh, nope. 1.
> 
> So, then I end up at the "stripe_size * data_stripes > max_chunk_size"
> again.
> 
> So, yes, Naohiro is right, and DUP is the only case in which this logic
> breaks. DUP is the only one in which this change makes a difference,
> because it's the only one which has dev_stripes set to something other
> than 1.

If stripe_size doesn't incorrectly get set to max_avail above (which is
dev_stripes times the actual stripe size), the if works.
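
Concretely, redoing the trace from above with that division in place:

stripe_size = max_avail / dev_stripes       -> 512M / 2 -> 256M
num_stripes = ndevs * dev_stripes           -> 1 * 2 -> 2
data_stripes = num_stripes / ncopies        -> 2 / 2 -> 1
stripe_size * data_stripes > max_chunk_size -> 256M > 256M -> false

...so the METADATA|DUP block group keeps its full 256M.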

The comments below still stand, because doing all those calculations
with stripes and number of devices for something as predictable as DUP
is only confusing. :D
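
For what it's worth, a minimal sketch of what that switch/case could
look like, with the corrected value (1, not 2) for DUP; purely
illustrative, using the usual BTRFS_BLOCK_GROUP_* profile bits:

    switch (type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
    case BTRFS_BLOCK_GROUP_DUP:
        data_stripes = 1;               /* 2 copies, 1 counts for size */
        break;
    case BTRFS_BLOCK_GROUP_RAID5:
        data_stripes = num_stripes - 1; /* 1 parity stripe */
        break;
    case BTRFS_BLOCK_GROUP_RAID6:
        data_stripes = num_stripes - 2; /* 2 parity stripes */
        break;
    default:                            /* single, RAID0/1/10 */
        data_stripes = num_stripes / ncopies;
        break;
    }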

> 
> \o/
> 
>> ...
>>
>> My general feeling when looking at the code is that this single piece
>> of code is responsible for too many different cases, or rather, more
>> possible cases than a developer can reason about at once "in his head"
>> when working on it.
>>
>> 7 raid options * 3 different types (data, metadata, system) = 21
>> already... Some parts of the algorithm only make sense for a subset of
>> the combinations, but they're still part of the computation, which
>> sometimes "by accident" results in the correct outcome. :)
>>
>> If it can't be done in a way that's easier to understand when reading
>> the code, it should have unit tests with a list of known input/output to
>> detect unwanted changes.
>>
> 
> 
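
As an aside, a completely made-up user-space illustration of what such
a table of known input/output could look like; calc_dup_chunk_size()
here is only a toy model of the DUP math discussed above, not actual
btrfs code:

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SZ_256M (256ULL * 1024 * 1024)

    /* Toy model: stripe_size is max_avail divided by dev_stripes, then
     * clamped so that stripe_size * data_stripes stays within
     * max_chunk_size; the result is the logical block group size. */
    static uint64_t calc_dup_chunk_size(uint64_t max_avail)
    {
        const uint64_t dev_stripes = 2, data_stripes = 1;
        const uint64_t max_chunk_size = SZ_256M;
        uint64_t stripe_size = max_avail / dev_stripes;

        if (stripe_size * data_stripes > max_chunk_size)
            stripe_size = max_chunk_size / data_stripes;
        return stripe_size * data_stripes;
    }

    struct chunk_case {
        uint64_t max_avail;
        uint64_t expected;
    };

    int main(void)
    {
        /* Known input/output pairs; a regression trips the assert. */
        const struct chunk_case cases[] = {
            { 512ULL * 1024 * 1024, SZ_256M },
            { 128ULL * 1024 * 1024, 64ULL * 1024 * 1024 },
        };

        for (size_t i = 0; i < sizeof(cases) / sizeof(cases[0]); i++)
            assert(calc_dup_chunk_size(cases[i].max_avail) ==
                   cases[i].expected);

        printf("all chunk size cases ok\n");
        return 0;
    }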
-- 
Hans van Kranenburg
