On Tue, Aug 14, 2018 at 09:32:51AM +0200, Menion wrote:
> Hi
> Well, I think it is worth giving more details on the array.
> The array is built with 5x8TB HDDs in an external USB 3.0 to SATA III
> enclosure. The enclosure is cheap JMicron-based Chinese stuff (from
> Orico). There is one USB 3.0 link for all 5 HDDs, with a SATA III
> 3.0Gb/s multiplexer behind it. So you cannot expect peak performance,
> which is not the goal of this array (domestic data storage).
> Also, the USB-to-SATA firmware is buggy, so UAS operations are not
> stable; it runs in BOT mode.
> Having said so, the scrub has been started (and resumed) on the array
> mount point:
> 
> sudo btrfs scrub start(resume) /media/storage/das1

So is 2.59TiB the amount scrubbed _since the resume_?  If you run a
complete scrub end to end, without cancelling or rebooting in between,
what is the size reported on all disks (btrfs scrub status -d)?
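
For example, something like this (using your mount point from above;
-B keeps the scrub in the foreground so the status afterwards describes
one complete run):

    sudo btrfs scrub start -B /media/storage/das1
    sudo btrfs scrub status -d /media/storage/das1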

> even if, reading the documentation, I understand that invoking it on
> the mount point or on one of the HDDs in the array is the same thing.
> In the end, especially for a RAID5 array, does it really make sense to
> scrub only one disk in the array?

You would set up a shell for-loop and scrub each disk of the array
in turn.  Each scrub would correct errors on a single device.
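
A minimal sketch, assuming the device names from your fi show output
below (-B keeps each scrub in the foreground, so the disks really are
done one at a time):

    for dev in /dev/sd[a-e]; do
        sudo btrfs scrub start -B "$dev"
    done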

There was a bug in btrfs scrub where scrubbing the filesystem would
create one thread for each disk, and the threads would issue commands
to all disks and compete with each other for IO, resulting in terrible
performance on most non-SSD hardware.  By scrubbing disks one at a time,
there are no competing threads, so the scrub runs many times faster.
With this bug the total time to scrub all disks individually is usually
less than the time to scrub the entire filesystem at once, especially
on HDD (and even if it's not faster, one-at-a-time disk scrubs are
much kinder to any other process trying to use the filesystem at the
same time).

It appears this bug is not fixed, based on some timing results I am
getting from a test array.  iostat shows 10x more reads than writes on
all disks, even when all blocks on one disk are corrupted and the scrub
is given only a single disk to process (which should result in roughly
equal reads on all disks, slightly above the number of writes on the
corrupted disk).
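
(For reference, I'm watching it with roughly this; iostat is from the
sysstat package, and the device names match your array:)

    # extended per-device statistics, refreshed every 5 seconds
    iostat -x sda sdb sdc sdd sde 5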

This is where my earlier caveat about performance comes from.  Many parts
of btrfs raid5 are somewhere between slower and *much* slower than
comparable software raid5 implementations.  Some of that is by design:
btrfs must be at least 1% slower than mdadm because btrfs needs to read
metadata to verify data block csums during scrub, and the difference
would be much larger in practice due to HDD seek times.  Still, 500%-900%
overhead seems high, especially compared to btrfs raid1, which has the
same metadata csum reading issue without the huge performance gap.

It seems like btrfs raid5 could still use a thorough profiling to figure
out where it's spending all its IO.
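
If anyone wants to dig into it, tracing one member device while a
single-device scrub runs would be a reasonable starting point, e.g.
(assuming blktrace/blkparse are installed; the device is just an
example):

    # live block-layer trace of one array member, parsed as it arrives
    sudo blktrace -d /dev/sda -o - | blkparse -i -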

> Regarding the data usage, here you have the current figures:
> 
> menion@Menionubuntu:~$ sudo btrfs fi show
> [sudo] password for menion:
> Label: none  uuid: 6db4baf7-fda8-41ac-a6ad-1ca7b083430f
> Total devices 1 FS bytes used 11.44GiB
> devid    1 size 27.07GiB used 18.07GiB path /dev/mmcblk0p3
> 
> Label: none  uuid: 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> Total devices 5 FS bytes used 6.57TiB
> devid    1 size 7.28TiB used 1.64TiB path /dev/sda
> devid    2 size 7.28TiB used 1.64TiB path /dev/sdb
> devid    3 size 7.28TiB used 1.64TiB path /dev/sdc
> devid    4 size 7.28TiB used 1.64TiB path /dev/sdd
> devid    5 size 7.28TiB used 1.64TiB path /dev/sde
> 
> menion@Menionubuntu:~$ sudo btrfs fi df /media/storage/das1
> Data, RAID5: total=6.57TiB, used=6.56TiB
> System, RAID5: total=12.75MiB, used=416.00KiB
> Metadata, RAID5: total=9.00GiB, used=8.16GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> menion@Menionubuntu:~$ sudo btrfs fi usage /media/storage/das1
> WARNING: RAID56 detected, not implemented
> WARNING: RAID56 detected, not implemented
> WARNING: RAID56 detected, not implemented
> Overall:
>     Device size:   36.39TiB
>     Device allocated:      0.00B
>     Device unallocated:   36.39TiB
>     Device missing:      0.00B
>     Used:      0.00B
>     Free (estimated):      0.00B (min: 8.00EiB)
>     Data ratio:       0.00
>     Metadata ratio:       0.00
>     Global reserve: 512.00MiB (used: 32.00KiB)
> 
> Data,RAID5: Size:6.57TiB, Used:6.56TiB
>    /dev/sda    1.64TiB
>    /dev/sdb    1.64TiB
>    /dev/sdc    1.64TiB
>    /dev/sdd    1.64TiB
>    /dev/sde    1.64TiB
> 
> Metadata,RAID5: Size:9.00GiB, Used:8.16GiB
>    /dev/sda    2.25GiB
>    /dev/sdb    2.25GiB
>    /dev/sdc    2.25GiB
>    /dev/sdd    2.25GiB
>    /dev/sde    2.25GiB
> 
> System,RAID5: Size:12.75MiB, Used:416.00KiB
>    /dev/sda    3.19MiB
>    /dev/sdb    3.19MiB
>    /dev/sdc    3.19MiB
>    /dev/sdd    3.19MiB
>    /dev/sde    3.19MiB
> 
> Unallocated:
>    /dev/sda    5.63TiB
>    /dev/sdb    5.63TiB
>    /dev/sdc    5.63TiB
>    /dev/sdd    5.63TiB
>    /dev/sde    5.63TiB
> menion@Menionubuntu:~$
> menion@Menionubuntu:~$ sf -h
> The program 'sf' is currently not installed. You can install it by typing:
> sudo apt install ruby-sprite-factory
> menion@Menionubuntu:~$ df -h
> Filesystem      Size  Used Avail Use% Mounted on
> udev            934M     0  934M   0% /dev
> tmpfs           193M   22M  171M  12% /run
> /dev/mmcblk0p3   28G   12G   15G  44% /
> tmpfs           962M     0  962M   0% /dev/shm
> tmpfs           5,0M     0  5,0M   0% /run/lock
> tmpfs           962M     0  962M   0% /sys/fs/cgroup
> /dev/mmcblk0p1  188M  3,4M  184M   2% /boot/efi
> /dev/mmcblk0p3   28G   12G   15G  44% /home
> /dev/sda         37T  6,6T   29T  19% /media/storage/das1
> tmpfs           193M     0  193M   0% /run/user/1000
> menion@Menionubuntu:~$ btrfs --version
> btrfs-progs v4.17
> 
> So I don't fully understand where the scrub data size comes from
> On Mon, 13 Aug 2018 at 23:56, <erentheti...@mail.de> wrote:
> >
> > A running time of 55:06:35 indicates that the counter is right; it is
> > not enough time to scrub the entire array on HDDs.
> >
> > 2TiB might be right if you only scrubbed one disk: "sudo btrfs scrub
> > start /dev/sdx1" only scrubs the selected partition, whereas "sudo
> > btrfs scrub start /media/storage/das1" scrubs the whole array.
> >
> > Use "sudo btrfs scrub status -d" to view per-disk scrubbing statistics
> > and post the output.
> > For live statistics, prefix the command with "watch -n 1".
> >
> > By the way:
> > 0 errors despite multiple unclean shutdowns? I assumed that the write
> > hole would corrupt parity the first time around; was I wrong?
> >
> > On 13-Aug-2018 09:20:36 +0200, men...@gmail.com wrote:
> > > Hi
> > > I have a BTRFS RAID5 array built on 5x8TB HDDs. As for how full it
> > > is, well :), there are contradicting opinions from the, well,
> > > "several" ways to check the used space on a BTRFS RAID5 array, but I
> > > should be at around 8TB of data.
> > > This array is running on kernel 4.17.3 and it definitely experienced
> > > power loss while data was being written.
> > > I can say that it went through at least a dozen unclean shutdowns.
> > > So, following this thread, I started my first scrub on the array, and
> > > this is the outcome (after having resumed it 4 times, two of them
> > > after a power loss...):
> > >
> > > menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/
> > > scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> > > scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35
> > > total bytes scrubbed: 2.59TiB with 0 errors
> > >
> > > So, there are 0 errors, but I don't understand why it says 2.59TiB of
> > > scrubbed data. Is it possible that this value is also bogus, like the
> > > non-zero counters for RAID5 arrays?
> > > On Sat, 11 Aug 2018 at 17:29, Zygo Blaxell
> > > <ce3g8...@umail.furryterror.org> wrote:
> > > >
> > > > On Sat, Aug 11, 2018 at 08:27:04AM +0200, erentheti...@mail.de wrote:
> > > > > I guess that covers most topics, two last questions:
> > > > >
> > > > > Will the write hole behave differently on Raid 6 compared to Raid 5 ?
> > > >
> > > > Not really. It changes the probability distribution (you get an extra
> > > > chance to recover using a parity block in some cases), but there are
> > > > still cases where data gets lost that didn't need to be.
> > > >
> > > > > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ?
> > > >
> > > > There may be benefits of raid5 metadata, but they are small compared to
> > > > the risks.
> > > >
> > > > In some configurations it may not be possible to allocate the last
> > > > gigabyte of space. raid1 will allocate 1GB chunks from 2 disks at a
> > > > time while raid5 will allocate 1GB chunks from N disks at a time, and if
> > > > N is an odd number there could be one chunk left over in the array that
> > > > is unusable. Most users will find this irrelevant because a large disk
> > > > array that is filled to the last GB will become quite slow due to long
> > > > free space search and seek times--you really want to keep usage below
> > > > 95%, maybe 98% at most, and that means the last GB will never be needed.
> > > >
> > > > Reading raid5 metadata could theoretically be faster than raid1, but that
> > > > depends on a lot of variables, so you can't assume it as a rule of thumb.
> > > >
> > > > Raid6 metadata is more interesting because it's the only currently
> > > > supported way to get 2-disk failure tolerance in btrfs. Unfortunately
> > > > that benefit is rather limited due to the write hole bug.
> > > >
> > > > There are patches floating around that implement multi-disk raid1 (i.e. 3
> > > > or 4 mirror copies instead of just 2). This would be much better for
> > > > metadata than raid6--more flexible, more robust, and my guess is that
> > > > it will be faster as well (no need for RMW updates or journal seeks).
> > > >
