Hello Hans,

Hans van Kranenburg - 27.10.17, 20:17:
> This is a followup to my previous threads named "About free space
> fragmentation, metadata write amplification and (no)ssd" [0] and
> "Experiences with metadata balance/convert" [1], exploring how good or
> bad btrfs can handle filesystems that are larger than your average
> desktop computer and/or which see a pattern of writing and deleting huge
> amounts of files of wildly varying sizes all the time.

[…]

> Q: How do I fight this and prevent getting into a situation where all
> raw space is allocated, risking a filesystem crash?
> A: Use btrfs balance to fight the symptoms. It reads data and writes it
> out again without the free space fragments.
What do you mean by a filesystem crash? Since kernel 4.5 or 4.6 I don't
see any BTRFS related filesystem hangs anymore on the /home BTRFS dual
SSD RAID 1 on my laptop, which one or two copies of Akonadi, Baloo and
other desktop related stuff write *heavily* to, and which has had all
free space allocated into chunks for a pretty long time:

merkaba:~> btrfs fi usage -T /home
Overall:
    Device size:                 340.00GiB
    Device allocated:            340.00GiB
    Device unallocated:            2.00MiB
    Device missing:                  0.00B
    Used:                        290.32GiB
    Free (estimated):             23.09GiB  (min: 23.09GiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB  (used: 0.00B)

                          Data       Metadata System
Id Path                   RAID1      RAID1    RAID1    Unallocated
-- ---------------------- ---------  -------- -------- -----------
 1 /dev/mapper/msata-home 163.94GiB   6.03GiB 32.00MiB     1.00MiB
 2 /dev/mapper/sata-home  163.94GiB   6.03GiB 32.00MiB     1.00MiB
-- ---------------------- ---------  -------- -------- -----------
   Total                  163.94GiB   6.03GiB 32.00MiB     2.00MiB
   Used                   140.85GiB   4.31GiB 48.00KiB

I haven't done a balance on this filesystem for a long time (since
kernel 4.6). Granted, my filesystem is smaller than the typical backup
BTRFS. I do have two 3 TB and one 1.5 TB SATA disks I back up to, and
another 2 TB BTRFS on a backup server that I use for borgbackup (and
that doesn't yet do any snapshots; it may be better off running as XFS,
as it doesn't really need snapshots since borgbackup takes care of
that. A BTRFS snapshot would only come in handy to be able to go back
to a previous borgbackup repo in case it gets corrupted for whatever
reason, or damaged / deleted by an attacker who only has access to a
non-privileged user). However, all of these filesystems currently have
plenty of free space and are not accessed daily.

> Q: Why would it crash the file system when all raw space is allocated?
> Won't it start trying harder to reuse the free space inside?
> A: Yes, it will, for data. The big problem here is that allocation of a
> new metadata chunk when needed is not possible any more.
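As a side note for anyone scripting around this: whether a filesystem has reached that fully-allocated state can be spotted by parsing `btrfs fi usage` output. A minimal sketch in Python; the parsing helper is my own, not part of btrfs-progs or python-btrfs:

```python
import re

# IEC suffixes as printed by btrfs-progs.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

def device_unallocated(usage_text):
    """Return the overall 'Device unallocated' value in bytes from the
    output of 'btrfs fi usage', or None if the line is not found."""
    m = re.search(r"Device unallocated:\s*([0-9.]+)(B|KiB|MiB|GiB|TiB)",
                  usage_text)
    if m is None:
        return None
    return float(m.group(1)) * UNITS[m.group(2)]

sample = """\
Overall:
    Device size:                 340.00GiB
    Device allocated:            340.00GiB
    Device unallocated:            2.00MiB
"""
print(device_unallocated(sample))  # prints 2097152.0 (only 2 MiB left)
```

A monitoring job could warn well before that value drops near zero, so a balance can be scheduled before new metadata chunk allocation becomes impossible.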
And then it hangs or really crashes?

[…]

> Q: Why do the pictures of my data block groups look like someone fired a
> shotgun at it? [3], [4]
> A: Because the data extent allocator that is active when using the 'ssd'
> mount option both tends to ignore smaller free space fragments all the
> time, and also behaves in a way that causes more of them to appear. [5]
>
> Q: Wait, why is there "ssd" in my mount options? Why does btrfs think my
> iSCSI attached lun is an SSD?
> A: Because it makes wrong assumptions based on the rotational attribute,
> which we can also see in sysfs.
>
> Q: Why does this ssd mode ignore free space?
> A: Because it makes assumptions about the mapping of the addresses of
> the block device we see in linux and the storage in actual flash chips
> inside the ssd. Based on that information it decides where to write or
> where not to write any more.
>
> Q: Does this make sense in 2017?
> A: No. The interesting relevant optimization when writing to an ssd
> would be to write all data together that will be deleted or overwritten
> together at the same time in the future. Since btrfs does not come with
> a time machine included, it can't do this. So, remove this behaviour
> instead. [6]
>
> Q: What will happen when I use kernel 4.14 with the previously mentioned
> change, or if I change to the nossd mount option explicitly already?
> A: Relatively small free space fragments in existing chunks will
> actually be reused for new writes that fit, working from the beginning
> of the virtual address space upwards. It's like tetris, trying to
> completely fill up the lowest lines first. See the big difference in
> behavior when changing the extent allocator happening at 16 seconds into
> this timelapse movie: [7] (virtual address space)

I see a difference in behavior, but I do not yet fully understand what I
am looking at.

> Q: But what if all my chunks have badly fragmented free space right now?
> A: If your situation allows for it, the simplest way is running a full
> balance of the data, as some sort of big reset button. If you only want
> to clean up chunks with excessive free space fragmentation, then you can
> use the helper I used to identify them, which is
> show_free_space_fragmentation.py in [8]. Just feed the chunks to balance
> starting with the one with the highest score. The script requires the
> free space tree to be used, which is a good idea anyway.

Okay, if I understand this correctly, I don't need to use "nossd" with
kernel 4.14, but it would be good to do a full "btrfs filesystem
balance" run on all the SSD BTRFS filesystems, or all other ones with
rotational=0.

What would be the benefit of that? Would the filesystem run faster
again? My subjective impression is that performance got worse over
time. *However*, all my previous full balance attempts made the
performance even worse. So… is a full balance safe for filesystem
performance meanwhile?

I still have the issue that fstrim on /home only works with the patch
from Lutz Euler from 2014, which is still not in mainline BTRFS. Maybe
it would be a good idea to recreate /home in order to get rid of that
special "anomaly" of the BTRFS that fstrim doesn't work without this
patch.

Maybe at least a part of this should go into the BTRFS kernel wiki, as
it would be easier for users to find there. I wonder about an "upgrade
notes for users" / "BTRFS maintenance" page that gives recommendations
in case some step is advisable after a major kernel update, plus
general recommendations for maintenance.

Ideally most of this would be integrated into BTRFS or a userspace
daemon for it and be handled transparently and automatically. Yet a
full balance is an expensive operation time-wise and probably should
not be started without user consent.
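Which filesystems fall into that rotational=0 category can be checked straight from sysfs, the same attribute btrfs bases its 'ssd' autodetection on. A small sketch; the device name is a placeholder you would substitute for your own:

```python
from pathlib import Path

def is_rotational(device, sysfs_root="/sys/block"):
    """Read queue/rotational for a block device. '1' means the kernel
    believes the device rotates; '0' is what makes btrfs autodetect
    the 'ssd' mount option -- including for iSCSI luns that merely
    report rotational=0, which is how the wrong assumption gets made."""
    flag = Path(sysfs_root, device, "queue", "rotational").read_text()
    return flag.strip() == "1"

# Placeholder device name; on a real system e.g.:
#   is_rotational("sda")   -> True on a spinning disk, False on an SSD
```

The `sysfs_root` parameter only exists so the helper can be tested against a fake directory tree; on a real system the default path is what the kernel exposes.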
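And for "feed the chunks to balance starting with the one with the highest score": a sketch of how a chunk list with scores could be turned into balance invocations, one chunk at a time via the vrange filter. The chunk addresses and scores below are made-up placeholders; only the `-dvrange=start..end` filter syntax is from btrfs-progs:

```python
def balance_commands(chunks, mountpoint):
    """Given (vaddr, length, score) tuples -- e.g. collected from the
    output of show_free_space_fragmentation.py -- emit one
    'btrfs balance start' command per chunk, worst score first.
    The vrange filter relocates only block groups that intersect
    the given virtual address range, i.e. exactly one chunk here."""
    cmds = []
    for vaddr, length, score in sorted(chunks, key=lambda c: -c[2]):
        cmds.append("btrfs balance start "
                    f"-dvrange={vaddr}..{vaddr + length} {mountpoint}")
    return cmds

# Made-up example: two 1 GiB data chunks with fragmentation scores.
chunks = [(320029589504, 1073741824, 12),
          (321103331328, 1073741824, 87)]
for cmd in balance_commands(chunks, "/home"):
    print(cmd)
```

Going one chunk per invocation keeps the whole operation interruptible, which matters given how expensive a balance run is.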
I do wonder about the ton of tools here and there, and I would love
some btrfsd or… maybe an even more generic fsd filesystem maintenance
daemon which would do regular scrubs and whatever else makes sense. It
could use some configuration in the root directory of a filesystem and
work for BTRFS and other filesystems that have beneficial online /
background upgrades, like XFS, which also has online scrubbing by now
(at least for metadata).

> [0] https://www.spinics.net/lists/linux-btrfs/msg64446.html
> [1] https://www.spinics.net/lists/linux-btrfs/msg64771.html
> [2] https://github.com/knorrie/btrfs-heatmap/
> [3] https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-10-27-shotgunblast.png
> [4] https://syrinx.knorrie.org/~knorrie/btrfs/keep/2016-12-18-heatmap-scripting/fsid_ed10a358-c846-4e76-a071-3821d423a99d_startat_320029589504_at_1482095269.png
> [5] https://www.spinics.net/lists/linux-btrfs/msg64418.html
> [6] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=583b723151794e2ff1691f1510b4e43710293875
> [7] https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-ssd-to-nossd.mp4
> [8] https://github.com/knorrie/python-btrfs/tree/develop/examples

Thanks,
--
Martin