On Tue, Apr 03, 2018 at 07:03:06PM +0200, Goffredo Baroncelli wrote:
> On 04/03/2018 02:31 AM, Zygo Blaxell wrote:
> > On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
> >> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
> >>> On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> >>>> I thought that a possible solution is to create BGs with different
> >>>> numbers of data disks.  E.g. supposing we have a raid6 system with
> >>>> 6 disks, where 2 are parity disks, we should allocate 3 BGs:
> >>>> BG #1: 1 data disk, 2 parity disks
> >>>> BG #2: 2 data disks, 2 parity disks
> >>>> BG #3: 4 data disks, 2 parity disks
> >>>>
> >>>> For simplicity, the disk-stripe length is assumed = 4K.
> >>>>
> >>>> So if you have a write with a length of 4KB, this should be placed
> >>>> in BG#1; if you have a write with a length of 4*3KB, the first 8KB
> >>>> should be placed in BG#2, then the rest in BG#1.
> >>>> This would avoid wasting space, even if fragmentation will
> >>>> increase (but does fragmentation matter with modern solid state
> >>>> disks?).
> >>
> >> I don't really see why this would increase fragmentation or waste
> >> space.
> >
> > Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
> > to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
> > remaining 2 blocks).  It also flips the usual order of "determine
> > size of extent, then allocate space for it", which might require
> > major surgery on the btrfs allocator to implement.
>
> I have to point out that in any case the extent is physically
> interrupted at the disk-stripe size.  Assuming disk-stripe=64KB, if
> you want to write 128KB, the first half is written on the first disk,
> the other half on the 2nd disk.  If you want to write 96KB, the first
> 64KB are written on the first disk, the last part on the 2nd, only in
> a different BG.
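For concreteness, the placement scheme quoted above can be sketched as a
toy model (not btrfs code; the 4K disk-stripe and the three BG widths
are the assumptions from the quoted proposal):

```python
# Toy model of the quoted multi-BG proposal (not btrfs code): place a
# write of N 4K blocks into block groups with 4, 2, and 1 data disks,
# widest stripe first.  Returns the data-disk width used per chunk.

BG_WIDTHS = [4, 2, 1]  # data disks in BG#3, BG#2, BG#1

def split_write(nblocks):
    """Split a write of `nblocks` 4K blocks into per-BG chunks."""
    chunks = []
    for width in BG_WIDTHS:
        while nblocks >= width:
            chunks.append(width)  # one full stripe in this BG
            nblocks -= width
    return chunks
```

split_write(3) reproduces the 12KB example from the quote (8KB in BG#2,
then 4KB in BG#1), and split_write(6) shows the 6-block case: 4 blocks
in BG#3 plus 2 in BG#2, i.e. one extent split across two block groups.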
The "only on a different BG" part implies something expensive: either a
seek or a new erase page, depending on the hardware.  Without that,
nearby logical blocks are nearby physical blocks as well.

> So yes there is fragmentation from a logical point of view; from a
> physical point of view the data is spread across the disks in any
> case.

What matters is the extent-tree point of view.  There is (currently) no
fragmentation there, even for RAID5/6.  The extent tree is unaware of
RAID5/6 (to its peril).

ZFS makes its thing-like-the-extent-tree aware of RAID5/6, and it can
put a stripe of any size anywhere.  If we're going to do that in btrfs,
we might as well just do what ZFS does.  OTOH, variable-size block
groups give us read-compatibility with old kernel versions (and
write-compatibility for that matter--a kernel that didn't know about
the BG separation would still work, but would have the write hole).

If an application does a loop writing 68K then calling fsync(), the
multiple-BG solution adds two seeks to every 68K read.  That's
expensive if sequential read bandwidth is more scarce than free space.

> In any case, you are right, we should gather some data, because the
> performance impact is not so clear.
>
> I am not worried about having different BGs; we have problems with
> these because we never developed tools to handle the issue properly
> (i.e. a daemon which starts a balance when needed).  But I hope that
> this will be solved in the future.

Balance daemons are easy to the point of being trivial to write in
Python.  The balancing itself is quite expensive and invasive: we can't
usefully ionice it, we can only abort it on block group boundaries, and
we can't delete snapshots while it's running.  If balance could be
given a vrange the size of one extent...then we could talk about
daemons.

> In any case, all the proposed solutions have their trade-offs:
>
> - a) as is: write hole bug
> - b) variable stripe size (like ZFS): big impact on how btrfs handles
> the extent;
> limited waste of space
> - c) logging data before writing: we write the data twice in a short
> time window; moreover, the log area is written several orders of
> magnitude more often than the other areas; there were some patches
> around
> - d) rounding the write up to the stripe size: waste of space; simple
> to implement
> - e) different BGs with different stripe sizes: limited waste of
> space; logical fragmentation

Also:

- f) avoiding writes to partially filled stripes: free space
fragmentation; simple to implement (ssd_spread does it accidentally)

The difference between d) and f) is that d) allocates the space to the
extent while f) leaves the space unallocated, but skips any free space
fragment smaller than the stripe size when allocating.  f) gets the
space back with a balance (i.e. it is exactly as space-efficient as (a)
after balance).

> * c), d), e) are applied only to the tail of the extent, in case the
> size is less than the stripe size.

It's only necessary to split an extent if there are no other writes in
the same transaction that could be combined with the extent tail into a
single RAID stripe.  As long as everything in the RAID stripe belongs
to a single transaction, there is no write hole.

> * for b), d), e), the wasted space may be reduced with a balance

Not for d).  Balance doesn't know how to get rid of unreachable blocks
in extents (it just moves the entire extent around), so after a balance
the writes would still be rounded up to the stripe size.  Balance would
never be able to free the rounded-up space.  That space would just be
gone until the file was overwritten, deleted, or defragged.

Possibly not for b) either, for the same reason.  Defrag is the
existing btrfs tool to fix extents with unused space attached to them.
Or some new thing designed explicitly to handle these cases.

And also not for e), but it's a little different there.  In e) the
wasted space is the extra metadata extent refs due to discontiguous
extent allocation.
You don't get that back with balance; you need defrag here too.

e) also effectively can't claim that unused space in BGs is "free",
since there are non-trivial restrictions on whether it can be allocated
for any given write.  So even if you have free space, 'df' has to tell
you that you don't.

> --
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
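As a postscript, here is a minimal sketch of the sort of balance daemon
called "trivial to write in Python" above.  The mount point, poll
interval, and trigger threshold are all invented for illustration; it
just shells out to the real `btrfs` CLI, using `balance start
-dusage=N` to compact only data block groups that are at most N percent
full.  It is a sketch, not a production daemon, and it inherits all the
balance caveats discussed above (expensive, can't be usefully ioniced,
blocks snapshot deletion).

```python
# Hypothetical minimal balance daemon.  MOUNT, INTERVAL, and the
# trigger/usage numbers are illustrative assumptions, not btrfs
# defaults.
import subprocess
import time

MOUNT = "/mnt/pool"    # hypothetical mount point
INTERVAL = 30 * 60     # hypothetical poll interval: 30 minutes

def parse_data_line(fi_df_output):
    """Extract (total, used) bytes from the Data line of
    `btrfs filesystem df -b` output."""
    for line in fi_df_output.splitlines():
        if line.startswith("Data"):
            # e.g. "Data, RAID6: total=1073741824, used=268435456"
            total = int(line.split("total=")[1].split(",")[0])
            used = int(line.split("used=")[1])
            return total, used
    raise ValueError("no Data line found")

def data_slack_percent(mount):
    """Percent of allocated data space that is currently unused."""
    out = subprocess.run(["btrfs", "filesystem", "df", "-b", mount],
                         capture_output=True, text=True,
                         check=True).stdout
    total, used = parse_data_line(out)
    return 100 * (total - used) // total

def daemon():
    # Reclaim mostly-empty data block groups when slack builds up.
    while True:
        if data_slack_percent(MOUNT) > 25:  # arbitrary trigger
            subprocess.run(["btrfs", "balance", "start",
                            "-dusage=20", MOUNT], check=False)
        time.sleep(INTERVAL)
```

As the thread notes, the daemon part really is the easy bit; the hard
costs are all inside the balance itself.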