Hi Naohiro,

Are you still planning to work on this?

Hans

On 10/11/2017 11:58 AM, Naohiro Aota wrote:
> Hi,
> 
> You may have noticed that mail to my @wdc.com address bounces. As my
> internship at wdc has finished, that address has expired.
> Please contact me at this @elisp.net address.
> 
> 2017-10-11 2:42 GMT+09:00 Hans van Kranenburg 
> <hans.van.kranenb...@mendix.com>:
>> Sorry for the mail spam, it's an interesting code puzzle... :)
>>
>> On 10/10/2017 07:22 PM, Hans van Kranenburg wrote:
>>> On 10/10/2017 07:07 PM, Hans van Kranenburg wrote:
>>>> On 10/10/2017 01:31 PM, David Sterba wrote:
>>>>> Hi,
>>>>>
>>>>> On Fri, Sep 29, 2017 at 04:20:51PM +0900, Naohiro Aota wrote:
>>>>>> Balancing a fresh METADATA=dup btrfs file system (with size < 50G)
>>>>>> generates a 128MB sized block group. While we set max_stripe_size =
>>>>>> max_chunk_size = 256MB, we get this half sized block group:
>>>>>>
>>>>>> $ btrfs ins dump-t -t CHUNK_TREE btrfs.img|grep length
>>>>>>                 length 8388608 owner 2 stripe_len 65536 type DATA
>>>>>>                 length 33554432 owner 2 stripe_len 65536 type SYSTEM|DUP
>>>>>>                 length 134217728 owner 2 stripe_len 65536 type METADATA|DUP
>>>>>>
>>>>>> Before commit 86db25785a6e ("Btrfs: fix max chunk size on raid5/6"), we
>>>>>> used "stripe_size * ndevs > max_chunk_size * ncopies" to check the max
>>>>>> chunk size. Since stripe_size = 256MB * dev_stripes (= 2) = 512MB, ndevs
>>>>>> = 1, max_chunk_size = 256MB, and ncopies = 2, we allowed a 256MB
>>>>>> METADATA|DUP block group.
>>>>>>
>>>>>> But now we use "stripe_size * data_stripes > max_chunk_size". Since
>>>>>> data_stripes = 1 for DUP, it disallows the block group from growing
>>>>>> beyond 128MB. What is missing here is "dev_stripes". The proper logical
>>>>>> space used by the block group is "stripe_size * data_stripes /
>>>>>> dev_stripes". Tweak the equations to use the right value.
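>>>>>>
>>>>>> To recap those numbers for DUP (not code, just the two checks with the
>>>>>> values above plugged in):
>>>>>>
>>>>>>   old: stripe_size * ndevs > max_chunk_size * ncopies
>>>>>>        512M * 1 > 256M * 2 is false, so stripe_size stays 512M and the
>>>>>>        later division by dev_stripes gives a 256MB block group
>>>>>>   new: stripe_size * data_stripes > max_chunk_size
>>>>>>        512M * 1 > 256M is true, so stripe_size is capped to 256M and the
>>>>>>        later division by dev_stripes gives a 128MB block group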
>>>>>
>>>>> I started looking into it and still don't fully understand it. A change
>>>>> deep in the allocator can easily break some block group combinations, so
>>>>> I'm rather conservative here.
>>>>
>>>> I think that the added usage of data_stripes in 86db25785a6e is the
>>>> problematic change. data_stripes was introduced as part of RAID56 in
>>>> 53b381b3a, and its meaning was only properly thought through for RAID56.
>>>> The RAID56 commit already notes "this will have to be fixed for RAID1
>>>> and RAID10 over more drives", but the author doesn't catch the DUP case,
>>>> which already breaks at that point.
>>>>
> 
> Thank you for explaining in detail :) Yes, the allocator was already
> broken by the use of data_stripes.
> (and it will also "break" existing file systems, in the sense of making a
> DUP|META block group smaller)
> I believe this patch fixes that.
> 
>>>> At the beginning it says:
>>>>
>>>> int data_stripes; /* number of stripes that count for block group size */
>>>>
>>>> For the example:
>>>>
>>>> This is DUP:
>>>>
>>>> .sub_stripes    = 1,
>>>> .dev_stripes    = 2,
>>>> .devs_max       = 1,
>>>> .devs_min       = 1,
>>>> .tolerated_failures = 0,
>>>> .devs_increment = 1,
>>>> .ncopies        = 2,
>>>>
>>>> In the code:
>>>>
>>>> max_stripe_size = SZ_256M
>>>> max_chunk_size = max_stripe_size    -> SZ_256M
>>>>
>>>> Then we have find_free_dev_extent:
>>>>     max_stripe_size * dev_stripes   -> SZ_256M * 2 -> 512M
>>>>
>>>> So we want to find 512M on a disk, to fit 2 stripes of 256M inside for
>>>> the DUP. (Remember: the two parts of DUP *never* end up on different
>>>> disks, even if you have multiple.)
>>>>
>>
>> Another, better fix would be to make a change here...
>>
>>>> If we find one:
>>>> stripe_size = devices_info[ndevs-1].max_avail   -> 512M, yay
>>
>> ...because this is not yay. The 512M is max_avail, which needs to hold
>> *2* stripes, not 1. So stripe_size is suddenly set to twice the actual
>> stripe size for DUP.
>>
>> So an additional division by dev_stripes would translate the max_avail
>> on device ndevs-1 into the stripe size to use.
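>>
>> In code that would be roughly this (an untested sketch, not a real diff;
>> the later div_u64(stripe_size, dev_stripes) would then have to be
>> dropped or moved here):
>>
>>     /* max_avail has room for dev_stripes stripes on this device,
>>      * so derive the size of a single stripe right away */
>>     stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);
>>
>>     if (stripe_size * data_stripes > max_chunk_size)
>>             stripe_size = div_u64(max_chunk_size, data_stripes);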
> 
> That catches the point. Notice that stripe_size is divided by dev_stripes
> after the lines changed by the patch.
> It means that in this region (from "stripe_size =
> devices_info[ndevs-1].max_avail;" to "stripe_size =
> div_u64(stripe_size, dev_stripes);"),
> "stripe_size" stands for "the stripe size written on one device". IOW,
> stripe_size (here, the size of the region on one device) = dev_stripes *
> stripe_size (final, for one chunk).
> These two "stripe_size" values are mostly the same, because we set
> dev_stripes > 1 only for DUP.
> 
> We are using this per-device stripe_size to calculate the logical size.
> That would be correct if we actually striped writes across the stripes on
> one device, but with the current implementation they are copies. So the
> if-condition "if (stripe_size * data_stripes > max_chunk_size)"
> is doing the wrong calculation.
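> 
> For reference, the shape of that region is roughly this (paraphrased,
> not a verbatim quote of volumes.c):
> 
>     stripe_size = devices_info[ndevs - 1].max_avail; /* room for dev_stripes stripes */
>     ...
>     /* stripe_size is still a per-device size here, while data_stripes
>      * and max_chunk_size describe the logical chunk */
>     if (stripe_size * data_stripes > max_chunk_size)
>             stripe_size = div_u64(max_chunk_size, data_stripes);
>     ...
>     stripe_size = div_u64(stripe_size, dev_stripes); /* only now it is one stripe */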
> 
> So the solution would be either:
> 
> 1. as in my patch, consider dev_stripes in the if-condition (sketched
>    just below), or
> 2. as Hans suggests, move the division by dev_stripes above the
>    if-condition, and properly re-calculate stripe_size if the logical
>    size > max_chunk_size
> 
> I think the 2nd solution would make the code clearer, because it
> eliminates the two different meanings of "stripe_size".
> 
> Again, keep in mind that dev_stripes = 1 for all BG types but DUP, so
> multiplying/dividing by dev_stripes won't affect any BG but the DUP BG
> (at least with the current RAID feature set).
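> 
> To make option 1 concrete (again just a rough sketch, not the literal
> patch):
> 
>     /* compare the logical size, stripe_size * data_stripes / dev_stripes,
>      * against max_chunk_size */
>     if (div_u64(stripe_size * data_stripes, dev_stripes) > max_chunk_size)
>             ...
> 
> Option 2 is the up-front division that Hans sketched above.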
> 
>>
>>>>
>>>> num_stripes = ndevs * dev_stripes -> 1 * 2  -> 2
>>>>
>>>> data_stripes = num_stripes / ncopies = 2 / 2 = 1
>>>
>>> Oh, wow, this is not true of course, because the "number of stripes that
>>> count for block group size" should still be 1 for DUP...
>>>
>>>> BOOM! There's the problem. The data_stripes only thinks about data that
>>>> is horizontally spread over disks, and not vertically spread...
>>>
>>> Hm... no Hans, you're wrong.
>>>
>>>> What I would propose is changing...
>>>>    the data_stripes = <blah> and afterwards trying to correct it with
>>>> some ifs
>>>> ...to...
>>>>    a switch/case thing where the explicit logic is put to get the right
>>>> value for each specific raid type.
>>>>
>>>> In case of DUP this simply means data_stripes = 2, because there is no
>>>> need for fancy calculations about spreading DUP data over X devices.
>>>> It's always 2 copies on 1 device.
>>>
>>> Eh, nope. 1.
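>>>
>>> (Roughly what I have in mind, as a sketch only, with DUP being 1 as
>>> corrected above:)
>>>
>>>     switch (type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
>>>     case BTRFS_BLOCK_GROUP_DUP:
>>>             data_stripes = 1;               /* 2 copies, but on 1 device */
>>>             break;
>>>     case BTRFS_BLOCK_GROUP_RAID5:
>>>             data_stripes = num_stripes - 1;
>>>             break;
>>>     case BTRFS_BLOCK_GROUP_RAID6:
>>>             data_stripes = num_stripes - 2;
>>>             break;
>>>     default:
>>>             data_stripes = num_stripes / ncopies;
>>>             break;
>>>     }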
>>>
>>> So, then I end up at the "stripe_size * data_stripes > max_chunk_size"
>>> again.
>>>
>>> So, yes, Naohiro is right, and DUP is the only case in which this logic
>>> breaks. DUP is the only one for which this change makes a difference,
>>> because it's the only one that has dev_stripes set to something other
>>> than 1.
>>
>> If stripe_size didn't incorrectly get set to dev_stripes * stripe_size
>> (i.e. max_avail) above, the if would work.
>>
>> The comments below still stand, because doing all those calculations
>> with stripes and number of devices for something as predictable as DUP
>> is only confusing. :D
>>
>>>
>>> \o/
>>>
>>>> ...
>>>>
>>>> My general feeling when looking at the code is that this single piece
>>>> of code is responsible for too many different cases; more possible
>>>> cases than a developer can reason about at once "in his head" when
>>>> working on it.
>>>>
>>>> 7 raid options * 3 different types (data, metadata, system) = 21
>>>> already... Some parts of the algorithm only make sense for a subset of
>>>> the combinations, but they're still part of the computation, which
>>>> sometimes "by accident" results in the correct outcome. :)
>>>>
>>>> If it can't be done in a way that's easier to understand when reading
>>>> the code, it should have unit tests with a list of known inputs and
>>>> outputs to detect unwanted changes.
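>>>>
>>>> For example (completely hypothetical, just to show the shape; no such
>>>> test table exists today):
>>>>
>>>>     struct chunk_alloc_case {
>>>>             u64 profile;      /* e.g. METADATA | DUP */
>>>>             u64 dev_free;     /* free space on the device */
>>>>             u64 want_length;  /* expected block group length */
>>>>     };
>>>>
>>>>     static const struct chunk_alloc_case cases[] = {
>>>>             { BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DUP,
>>>>               10ULL * SZ_1G, SZ_256M },
>>>>             /* ... one entry per profile/type combination ... */
>>>>     };
>>>>
>>>> plus a loop that feeds each entry through the chunk size calculation
>>>> and compares the result.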
> 
> I agree. The code is too complicated for the current RAID feature set,
> though it may support future expansion naturally.
> 
>>>>
>>>
>>>
>> --
>> Hans van Kranenburg


-- 
Hans van Kranenburg