On 04/03/2018 02:31 AM, Zygo Blaxell wrote:
> On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
>> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
>>> On 2018-04-02 11:18, Goffredo Baroncelli wrote:
>>>> I thought that a possible solution is to create BGs with different
>>>> numbers of data disks. E.g. supposing we have a RAID6 system with 6
>>>> disks, where 2 are parity disks, we should allocate 3 BGs:
>>>> BG #1: 1 data disk, 2 parity disks
>>>> BG #2: 2 data disks, 2 parity disks
>>>> BG #3: 4 data disks, 2 parity disks
>>>>
>>>> For simplicity, the disk-stripe length is assumed to be 4 KB.
>>>>
>>>> So if you have a write with a length of 4 KB, it should be placed
>>>> in BG #1; if you have a write with a length of 3*4 KB, the first
>>>> 8 KB should be placed in BG #2, then the rest in BG #1.
>>>> This would avoid wasting space, even if fragmentation will increase
>>>> (but does fragmentation matter with modern solid state disks?).
>> I don't really see why this would increase fragmentation or waste space.

> Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
> to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
> remaining 2 blocks).  It also flips the usual order of "determine size
> of extent, then allocate space for it" which might require major surgery
> on the btrfs allocator to implement.

I have to point out that in any case the extent is physically split at the
disk-stripe boundary. Assuming a disk-stripe of 64 KB, if you want to write
128 KB, the first half is written to the first disk and the other to the
2nd disk. If you want to write 96 KB, the first 64 KB are written to the
first disk and the last 32 KB to the 2nd, only in a different BG.
So yes, there is fragmentation from a logical point of view; from a
physical point of view, the data is spread across the disks in any case.
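
To make the proposed dispatch concrete, here is a toy user-space sketch (my
own illustration, not code from btrfs or from any patch) of the greedy
placement: serve a write with full stripes of the widest BG first, then
fall back to the narrower ones:

/* Greedy split of a write across BGs with 4, 2 and 1 data disks
 * (the 6-disk RAID6 example from this thread).  Toy code, not
 * the btrfs allocator. */
#include <stdio.h>

int main(void)
{
	const unsigned int data_disks[] = { 4, 2, 1 };	/* BG #3, #2, #1 */
	unsigned int blocks = 6;	/* write length in 4 KB blocks */

	for (int i = 0; i < 3 && blocks > 0; i++) {
		unsigned int stripes = blocks / data_disks[i];

		if (!stripes)
			continue;
		printf("BG #%d (%u data disks): %u full stripe(s) = %u blocks\n",
		       3 - i, data_disks[i], stripes,
		       stripes * data_disks[i]);
		blocks -= stripes * data_disks[i];
	}
	return 0;
}

For blocks = 6 this prints one full stripe in BG #3 (4 blocks) and one in
BG #2 (2 blocks), i.e. exactly the split described above.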

In any case, you are right: we should gather some data, because the
performance impact is not so clear.

I am not worried about having different BGs; we have problems with these
because we never developed a tool to handle them properly (i.e. a daemon
which starts a balance when needed). But I hope that this will be solved in
the future.

In any case, all the proposed solutions have their trade-offs:

- a) as is: write hole bug
- b) variable stripe size (like ZFS): big impact on how btrfs handles
extents; limited waste of space
- c) logging data before writing: the data is written twice within a short
time window. Moreover, the log area is written several orders of magnitude
more often than the other areas; there were some patches around
- d) rounding the write up to the stripe size: wastes space; simple to
implement
- e) different BGs with different stripe sizes: limited waste of space;
logical fragmentation


* c), d) and e) are applied only to the tail of the extent, when its size
is less than the stripe size.
* for b), d) and e), the wasted space may be reduced with a balance (see
the sketch below)
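
To put rough numbers on the d)/e) trade-off for such a tail, here is a toy
sketch (again my own illustration, using the same assumed geometry as
above, i.e. 4 data disks and a 4 KB stripe unit; none of this comes from
any patch):

/* Toy comparison of tail handling: d) pads the tail up to a full
 * stripe, e) redirects it to a narrower BG.  Assumed geometry:
 * 4 data disks, 4 KB stripe unit. */
#include <stdio.h>

int main(void)
{
	const unsigned int data_disks = 4;

	for (unsigned int tail = 1; tail < data_disks; tail++)
		printf("tail of %u block(s): d) pads %u block(s), "
		       "e) pads none but costs at least one extra extent\n",
		       tail, data_disks - tail);
	return 0;
}

So d) trades capacity for contiguity, while e) trades an extra extent (and
the later balance) for capacity.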

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5