Quoting General Zed <general-...@zedlx.com>:

Quoting Zygo Blaxell <ce3g8...@umail.furryterror.org>:

On Sun, Sep 15, 2019 at 01:54:07PM -0400, General Zed wrote:

Quoting Zygo Blaxell <ce3g8...@umail.furryterror.org>:

On Fri, Sep 13, 2019 at 09:50:38PM -0400, General Zed wrote:
>
> Quoting Zygo Blaxell <ce3g8...@umail.furryterror.org>:
>
> > On Fri, Sep 13, 2019 at 01:05:52AM -0400, General Zed wrote:
> > >
> > > Quoting Zygo Blaxell <ce3g8...@umail.furryterror.org>:
> > >
> > > > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
> > > > >
> > > > > Quoting Zygo Blaxell <ce3g8...@umail.furryterror.org>:
> > > > >
> > > > > > Don't forget you have to write new checksum and free space
> > > > > > tree pages.  In the worst case, you'll need about 1GB of new
> > > > > > metadata pages for each 128MB you defrag (though you get to
> > > > > > delete 99.5% of them immediately after).
> > > > >
> > > > > Yes, here we are debating some worst-case scenario which is
> > > > > actually impossible in practice due to various reasons.
> > > >
> > > > No, it's quite possible.  A log file written slowly on an active
> > > > filesystem above a few TB will do that accidentally.  Every now
> > > > and then I hit that case.  It can take several hours to do a
> > > > logrotate on spinning arrays because of all the metadata fetches
> > > > and updates associated with worst-case file delete.  Long enough
> > > > to watch the delete happen, and even follow along in the source
> > > > code.
> > > >
> > > > I guess if I did a proactive defrag every few hours, it might take
> > > > less time to do the logrotate, but that would mean spreading out
> > > > all the seeky IO load during the day instead of getting it all
> > > > done at night.  Logrotate does the same job as defrag in this case
> > > > (replacing a file in thousands of fragments spread across the disk
> > > > with a few large fragments close together), except logrotate gets
> > > > better compression.
> > > >
> > > > To be more accurate, the example I gave above is the worst case you
> > > > can expect from normal user workloads.  If I throw in some reflinks
> > > > and snapshots, I can make it arbitrarily worse, until the entire
> > > > disk is consumed by the metadata update of a single extent defrag.
> > > >
> > >
> > > I can't believe I am considering this case.
> > >
> > > So, we have a 1TB log file "ultralog" split into 256 million 4 KB
> > > extents randomly over the entire disk. We have 512 GB free RAM and
> > > 2% free disk space. The file needs to be defragmented.
> > >
> > > In order to do that, defrag needs to be able to copy-move multiple
> > > extents in one batch, and update the metadata.
> > >
> > > The metadata has a total of at least 256 million entries, each of
> > > some size, but each one should hold at least a pointer to the extent
> > > (8 bytes) and a checksum (8 bytes). In reality, it could be that
> > > there is a lot of other data there per entry.
> >
> > It's about 48KB per 4K extent, plus a few hundred bytes on average
> > for each reference.
>
> Sorry, could you be more clear there? A file fragment/extent that holds
> file data can be any size up to 128 MB. What metadata is there for every
> file fragment/extent?
>
> Because "48 KB per 4 K extent" ... I cannot decode what you mean.

An extent has 3 associated records in btrfs, not including its references.
The first two exist while the extent exists, the third appears after it
is removed.

        - extent tree:  location, size of extent, pointers to backref trees.
        Length is around 60 bytes plus the size of the backref pointer list.

Wait.. and where are the reflinks? Backrefs are there for going up the tree,
but where are reflinks for going down the tree?

Reflinks are the forward references--there is no other kind of forward
reference in btrfs (contrast with other filesystems which use one data
structure for single references and another for multiple references).

There are two distinct objects with similar names:  extent data items,
and extent ref items.

A file consists of an inode item followed by extent ref items (aka
reflinks) in a subvol tree keyed by (inode, offset) pairs.  Subvol tree
pages can be shared with other subvol trees to make snapshots.
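A toy model of that layout, just for illustration (the struct and field names below are invented, not the real btrfs on-disk items): a file's reflinks sit in the subvol tree under (inode, offset) keys, and each one carries the virtual address of the extent it points at.

/* Toy model of a subvol tree entry ("reflink"), not the real btrfs
 * on-disk format: keyed by (inode, file offset) and pointing at an
 * extent by its virtual address. */
#include <stdint.h>
#include <stdio.h>

struct reflink {
    uint64_t inode;         /* key part 1: inode number                   */
    uint64_t file_offset;   /* key part 2: offset within the file         */
    uint64_t extent_vaddr;  /* virtual address of the extent (4K aligned) */
    uint64_t length;        /* bytes of that extent used by this file     */
};

int main(void)
{
    /* A file made of two 512 KB extents: one inode item (not shown)
     * followed by two reflinks keyed by (inode, offset). */
    struct reflink file[] = {
        { 12345, 0,      0x10000000, 524288 },
        { 12345, 524288, 0x2a000000, 524288 },
    };
    for (int i = 0; i < 2; i++)
        printf("(%llu, %llu) -> extent at 0x%llx\n",
               (unsigned long long)file[i].inode,
               (unsigned long long)file[i].file_offset,
               (unsigned long long)file[i].extent_vaddr);
    return 0;
}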

Ok, so a reflink contains a virtual address. Did I get that right?

All extent ref items are reflinks which contain a 4 KB aligned address because the extents have that same alignment. Did I get that right?

Virtual addresses are 8-bytes in size?

I hope that virtual addresses are not wasteful of address space (that is, with many top bits in an 8-byte virtual address always being zero).

Extent data items are stored in a single tree (with other trees using
the same keys) that just lists which parts of the filesystem are occupied,
how long they are, and what data/metadata they contain.  Each extent
item contains a list of references to one of four kinds of object that
refers to the extent item (aka backrefs).  The free space tree is the
inverse of the extent data tree.
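And the matching toy model for the other side (again with invented field names, not the real extent item layout): the extent tree is keyed by virtual address, and each item carries a reference count plus the backref list naming whoever refers to it. The free space tree, being the inverse, just records the gaps between such items.

/* Toy model of an extent tree record, keyed by (virtual address, length),
 * with an inline backref list.  Field names are illustrative only. */
#include <stdint.h>
#include <stdio.h>

struct backref {
    uint64_t root;    /* which subvol tree (or tree block) refers to it */
    uint64_t owner;   /* e.g. inode number, for data extents            */
};

struct extent_item {
    uint64_t vaddr;          /* key part 1: virtual address of the extent */
    uint64_t length;         /* key part 2: extent size in bytes          */
    uint64_t refcount;       /* total number of references                */
    struct backref refs[2];  /* backrefs (grows as the extent is shared)  */
};

int main(void)
{
    /* One 512 KB data extent currently referenced from two subvol trees. */
    struct extent_item e = {
        0x2a000000, 524288, 2,
        { { 256, 12345 }, { 257, 12345 } },
    };
    printf("extent 0x%llx (%llu bytes), %llu refs\n",
           (unsigned long long)e.vaddr,
           (unsigned long long)e.length,
           (unsigned long long)e.refcount);
    return 0;
}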

Ok, so there is an "extent tree" keyed by virtual addresses. Items there contain extent data.

But, how are nodes in this extent tree addressed (how do you travel from the parent to the child)? I guess by full virtual address, i.e. by a reflink, but this reflink can point within-extent, meaning its address is not 4 KB aligned.

Or, an alternative explanation:
each whole metadata extent is a single node. Each node is often half-full to allow for various tree operations to be performed. Due to there being many items per each node, there is additional CPU processing effort required when updating a node.

Each extent ref item is a reference to an extent data item, but it
also contains all the information required to access the data.  For
normal read operations the extent data tree can be ignored (though
you still need to do a lookup in the csum tree to verify csums).

So, for normal reads, the information in subvol tree is sufficient.

So, you are saying that backrefs are already in the extent tree (or
reachable from it). I didn't know that; that information makes my defrag
much simpler to implement and describe. Someone in this thread previously
misled me into believing that backref information is not easily
available.

The backref isn't a precise location--it just tells you which metadata
blocks are holding at least one reference to the extent.  Some CPU
and linear searching has to be done to resolve that fully to an (inode,
offset) pair in the subvol tree(s).  It's a tradeoff to make normal POSIX
go faster, because you don't need to update the extent tree again when
you do some operations on the forward ref side, even though they add or
remove references.  e.g. creating a snapshot does not change the backrefs
list on individual extents--it creates two roots sharing a subset of the
subvol trees' branches.
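A minimal sketch of that resolution step, assuming a leaf is nothing more than an array of a few hundred items (illustrative structs, not the real code): the backref gets you to the right 16 KB leaf, and the linear scan is what turns it into concrete (inode, offset) pairs.

/* The backref only names a metadata leaf, so resolving it to (inode,
 * offset) means scanning that leaf's items for reflinks that point at
 * the extent in question.  Structs are illustrative, not the real
 * btrfs item layout. */
#include <stdint.h>
#include <stdio.h>

struct leaf_item {
    uint64_t inode;         /* key objectid                 */
    uint64_t file_offset;   /* key offset                   */
    uint64_t extent_vaddr;  /* where this reflink points to */
};

/* Scan one 16 KB leaf (a few hundred items) for references to `target`. */
static void resolve_backref(const struct leaf_item *leaf, int nritems,
                            uint64_t target)
{
    for (int i = 0; i < nritems; i++)
        if (leaf[i].extent_vaddr == target)
            printf("extent 0x%llx referenced at (inode %llu, offset %llu)\n",
                   (unsigned long long)target,
                   (unsigned long long)leaf[i].inode,
                   (unsigned long long)leaf[i].file_offset);
}

int main(void)
{
    struct leaf_item leaf[] = {
        { 257, 0,      0x10000000 },
        { 257, 131072, 0x2a000000 },
        { 258, 0,      0x2a000000 },   /* a second file sharing the extent */
    };
    resolve_backref(leaf, 3, 0x2a000000);
    return 0;
}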

This reads like a major fu**** to me.

I don't get it. If a backref doesn't point to an exact item, then the CPU has to scan the entire 16 KB metadata extent to find the matching reflink. However, this would imply that all the items in a metadata extent are always valid (not stale from older versions of metadata). This then implies that, when an item of a metadata extent is updated, all the parents of all the items in the same extent have to be updated. Now, that would be such a waste, wouldn't it? Especially if the metadata extent is allowed to contain stale items.

An alternative explanation: all the b-trees have 16 KB nodes, where each node matches a metadata extent. Therefore, the entire node has a single parent in a particular tree.

This means all virtual addresses are always 4 K aligned, furthermore, all virtual addresses that point to metadata extents are 16 K aligned.

16 KB is pretty big for a tree node. I wonder why this size was selected vs. 4 KB nodes? But, it doesn't matter.

        - csum tree:  location, 1 or more 4-byte csums packed in an array.
        Length of item is number of extent data blocks * 4 bytes plus a
        168-bit header (ish...csums from adjacent extents may be packed
        using a shared header)

        - free space tree:  location, size of free space.  This appears
        when the extent is deleted.  It may be merged with adjacent
        records.  Length is maybe 20 bytes?

Each page contains a few hundred items, so if there are a few hundred
unrelated extents between extents in the log file, each log file extent
gets its own metadata page in each tree.
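A back-of-envelope check of that worst case, using the 16 KB node size and the three trees listed above: if each 4 KB log-file extent dirties its own page in each tree, whole-page copy-on-write dominates the cost, which is the same order of magnitude as the "48KB per 4K extent" and "1GB per 128MB" figures quoted earlier (the exact numbers depend on how much page sharing survives).

/* Back-of-envelope check: if every 4 KB log-file extent lands on its own
 * 16 KB metadata page in each of the three trees (extent, csum, free
 * space), the new-metadata cost is dominated by whole pages, not by the
 * item sizes listed above. */
#include <stdio.h>

int main(void)
{
    const long page   = 16 * 1024;          /* metadata node size        */
    const long extent = 4 * 1024;           /* one log-file extent       */
    const int  trees  = 3;                  /* extent, csum, free space  */
    const long batch  = 128L * 1024 * 1024; /* amount of data defragged  */

    long per_extent = trees * page;         /* worst case: 48 KB         */
    long extents    = batch / extent;       /* 32768 extents per 128 MB  */
    long per_batch  = extents * per_extent; /* new metadata per 128 MB   */

    printf("worst-case metadata per 4K extent: %ld KB\n", per_extent / 1024);
    printf("worst-case metadata per 128 MB defragged: ~%ld MB\n",
           per_batch / (1024 * 1024));
    return 0;
}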

As far as I can understand it, the extents in the extent tree are indexed
(keyed) by inode & offset. Therefore, no matter how many unrelated extents
there are between (physical locations of data) extents in the log file, the
log file extent tree entries will (generally speaking) be localized, because
multiple extent entries (extent items) are bunched together in one 16 KB
metadata extent node.

No, extents in the extent tree are indexed by virtual address (roughly the
same as physical address over small scales, let's leave the device tree
out of it for now).  The subvol trees are organized the way you are
thinking of.

So, I guess that the virtual-to-physical address translation tables are always loaded in memory and that this translation is very fast? And the translation in the opposite direction, too.
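That guess is plausible even with a toy model (this is not the actual btrfs chunk-handling code): if the translation works on large chunks rather than individual blocks, the whole map stays small enough to hold in memory, and a lookup is a binary search over a sorted array; the reverse direction can be served the same way from a second array sorted by physical address.

/* Toy model only, not the actual btrfs chunk map code: the translation
 * works on large chunks, so the whole map fits in memory and a lookup
 * is a binary search over a small sorted array. */
#include <stdint.h>
#include <stdio.h>

struct chunk {
    uint64_t vaddr;   /* start of the range in the virtual address space */
    uint64_t length;  /* chunk size                                      */
    uint64_t paddr;   /* start of the range on the device                */
};

static uint64_t v2p(const struct chunk *map, int n, uint64_t vaddr)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (vaddr < map[mid].vaddr)
            hi = mid - 1;
        else if (vaddr >= map[mid].vaddr + map[mid].length)
            lo = mid + 1;
        else
            return map[mid].paddr + (vaddr - map[mid].vaddr);
    }
    return (uint64_t)-1;  /* unmapped */
}

int main(void)
{
    struct chunk map[] = {
        { 0x100000000ULL, 1UL << 30, 0x00100000 },
        { 0x140000000ULL, 1UL << 30, 0x40100000 },
    };
    printf("0x14000a000 -> 0x%llx\n",
           (unsigned long long)v2p(map, 2, 0x14000a000ULL));
    return 0;
}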

Anyway, thanks for explaining this all to me, makes it all much more clear.

Taking into account the explanation that b-tree nodes are sized 16 KB, this is what my imagined defrag would do:

This defrag makes batched updates to metadata. In most common cases, a batched update (with a single commit) will have the following properties:

- it moves many file data extents from one place to another. However, the update is mostly localized in space (for common fragmentation cases), meaning that only a low number of fragments per update will be significantly scattered. Therefore, updates to the extent tree are localized.

- the free space tree - mostly localized updates, for the same reason as above

- the checksum tree - the same as above

- the subvol tree - a (potentially big) number of files is affected per update. In most cases of fragmentation, the dispersion of fragments will not be high. However, due to the number of files, updating this tree looks like the most performance-troubling part of all. The problem is that while there is only a small dispersion per file, when this gets multiplied by the number of files, it can get bad. Some dispersion in this tree is to be expected. The update size can effectively be limited by the size of changes to the subvol tree (in complex cases).

- the expected dispersion in the subvol tree can get better after multiple defrags, if the defrag is made to approximately order the files on disk according to their position in the filesystem directory tree. So, the files of the same directory should be grouped together on disk in order to speed up future defrags.

So, each b-tree node is 16 KB, but I guess most nodes are only half-full, making the average node size 8 KB. If the defrag reserves 128 MB for a metadata update computation, it could update about 16000 (16 thousand) metadata nodes per commit. It would be interesting to try to predict what amount of file data, on average, is referred to by such a 16000-node commit.
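A rough, parameterised stab at that prediction, where every input is a guess: nodes assumed half full, moved extents assumed to be 4 KB (the worst case from this thread), and the number of batch-related reflinks per dirtied node swept from badly dispersed to well packed.

/* Rough estimate: every input here is a guess.  Nodes are assumed half
 * full, moved extents are assumed to be 4 KB, and the number of
 * batch-related reflinks per dirtied node is swept from badly dispersed
 * to well packed. */
#include <stdio.h>

int main(void)
{
    const long budget   = 128L * 1024 * 1024;  /* metadata RAM per commit */
    const long avg_node = 16 * 1024 / 2;       /* nodes assumed half full */
    const long nodes    = budget / avg_node;   /* ~16k nodes per commit   */
    const long extent   = 4 * 1024;            /* assumed extent size     */

    /* How many of a dirtied node's items actually belong to the batch
     * being defragged?  That depends on dispersion. */
    const int useful_per_node[] = { 2, 20, 200 };

    printf("dirtied nodes per commit: %ld\n", nodes);
    for (int i = 0; i < 3; i++) {
        double data = (double)nodes * useful_per_node[i] * extent;
        printf("%3d useful items/node -> ~%.1f GB of file data per commit\n",
               useful_per_node[i], data / (1024.0 * 1024 * 1024));
    }
    return 0;
}

With only a couple of useful items per dirtied node the result lands in the hundreds of MB, and with well-packed nodes it climbs past 10 GB, so the guesses below fall inside this range.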

Ok, let's do some pure guessing. For a home user with a 256 GB root or home partition, I would guess the average will be well over 500 MB of file data per commit. It depends on how long it has been since the last defrag. If it is a weekly or a daily defrag, then I guess well over 1 GB of file data per commit.

And for a server, there are so many different kinds of servers, so I'll just skip the guessing.

