From: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
---
.../xfs-data-structures/allocation_groups.rst | 2
.../filesystems/xfs-data-structures/rmapbt.rst | 336 ++++++++++++++++++++
2 files changed, 338 insertions(+)
create mode 100644 Documentation/filesystems/xfs-data-structures/rmapbt.rst
diff --git a/Documentation/filesystems/xfs-data-structures/allocation_groups.rst b/Documentation/filesystems/xfs-data-structures/allocation_groups.rst
index 30d169ab5cc5..6c0ffd3a170b 100644
--- a/Documentation/filesystems/xfs-data-structures/allocation_groups.rst
+++ b/Documentation/filesystems/xfs-data-structures/allocation_groups.rst
@@ -1379,3 +1379,5 @@ response times that come from metadata operations.
None of the XFS per-AG B+trees are involved with real time files. It is not
possible for real time files to share data blocks.
+
+.. include:: rmapbt.rst
diff --git a/Documentation/filesystems/xfs-data-structures/rmapbt.rst b/Documentation/filesystems/xfs-data-structures/rmapbt.rst
new file mode 100644
index 000000000000..eefcee5d4e95
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/rmapbt.rst
@@ -0,0 +1,336 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Reverse-Mapping B+tree
+~~~~~~~~~~~~~~~~~~~~~~
+
+If the feature is enabled, each allocation group has its own reverse
+block-mapping B+tree, which grows in the free space like the free space
+B+trees. As mentioned in the chapter about
+`reconstruction <#metadata-reconstruction>`__, this data structure is another
+piece of the puzzle necessary to reconstruct the data or attribute fork of a
+file from reverse-mapping records. It can also be used to double-check
+allocations to ensure that blocks are not accidentally cross-linked, which
+can cause severe damage to the filesystem.
+
+This B+tree is only present if the XFS\_SB\_FEAT\_RO\_COMPAT\_RMAPBT feature
+is enabled. The feature requires a version 5 filesystem.
+
+Each record in the reverse-mapping B+tree has the following structure:
+
+.. code:: c
+
+    struct xfs_rmap_rec {
+        __be32 rm_startblock;
+        __be32 rm_blockcount;
+        __be64 rm_owner;
+        __be64 rm_fork:1;
+        __be64 rm_bmbt:1;
+        __be64 rm_unwritten:1;
+        __be64 rm_unused:7;
+        __be64 rm_offset:54;
+    };
+
+**rm\_startblock**
+ AG block number of this record.
+
+**rm\_blockcount**
+ The length of this extent.
+
+**rm\_owner**
+ A 64-bit number describing the owner of this extent. This is typically the
+ absolute inode number, but can also correspond to one of the following:
+
+.. list-table::
+    :widths: 28 52
+    :header-rows: 1
+
+    * - Flag
+      - Description
+
+    * - XFS\_RMAP\_OWN\_NULL
+      - No owner. This should never appear on disk.
+
+    * - XFS\_RMAP\_OWN\_UNKNOWN
+      - Unknown owner; for EFI recovery. This should never appear on disk.
+
+    * - XFS\_RMAP\_OWN\_FS
+      - Allocation group headers.
+
+    * - XFS\_RMAP\_OWN\_LOG
+      - XFS log blocks.
+
+    * - XFS\_RMAP\_OWN\_AG
+      - Per-allocation group B+tree blocks. This means free space B+tree blocks,
+        blocks on the freelist, and reverse-mapping B+tree blocks.
+
+    * - XFS\_RMAP\_OWN\_INOBT
+      - Per-allocation group inode B+tree blocks. This includes free inode
+        B+tree blocks.
+
+    * - XFS\_RMAP\_OWN\_INODES
+      - Inode chunks.
+
+    * - XFS\_RMAP\_OWN\_REFC
+      - Per-allocation group refcount B+tree blocks. This will be used for
+        reflink support.
+
+    * - XFS\_RMAP\_OWN\_COW
+      - Blocks that have been reserved for a copy-on-write operation that has
+        not completed.
+
+Table: Special owner values
+
+**rm\_fork**
+ If rm\_owner describes an inode, this can be 1 if this record is for an
+ attribute fork.
+
+**rm\_bmbt**
+ If rm\_owner describes an inode, this can be 1 to signify that this record
+ is for a block map B+tree block. In this case, rm\_offset has no meaning.
+
+**rm\_unwritten**
+ A flag indicating that the extent is unwritten. This corresponds to the
+ flag in the `extent record <#data-extents>`__ format which means
+ XFS\_EXT\_UNWRITTEN.
+
+**rm\_offset**
+ The 54-bit logical file block offset, if rm\_owner describes an inode.
+ Meaningless otherwise.
+
+ **Note**
+
+ The single-bit flag values rm\_unwritten, rm\_fork, and rm\_bmbt are
+ packed into the larger fields in the C structure definition.
+
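The field descriptions above can be made concrete with a small decoding sketch. It is not the kernel's code: every name below is hypothetical, and it assumes that the bitfields in the structure are listed from bit 63 downward, so that rm_fork is the top bit of the 64-bit word and rm_offset is the low 54 bits. The special owner names are assumed to be numbered -1 through -9 in table order; only -5 (XFS_RMAP_OWN_AG) and -7 (XFS_RMAP_OWN_INODES) are stated in this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit positions: bitfields listed from bit 63 downward. */
#define RMAP_OFF_ATTR_FORK  (UINT64_C(1) << 63)        /* rm_fork */
#define RMAP_OFF_BMBT_BLOCK (UINT64_C(1) << 62)        /* rm_bmbt */
#define RMAP_OFF_UNWRITTEN  (UINT64_C(1) << 61)        /* rm_unwritten */
#define RMAP_OFF_MASK       ((UINT64_C(1) << 54) - 1)  /* rm_offset */

/* Hypothetical unpacked form of the 64-bit flags-plus-offset word. */
struct rmap_offset_info {
    bool attr_fork;
    bool bmbt_block;
    bool unwritten;
    uint64_t offset;
};

/* Convert eight big-endian bytes, as stored on disk, to a host integer. */
uint64_t be64_decode(const unsigned char *bytes)
{
    uint64_t value = 0;

    assert(bytes != NULL);
    for (size_t i = 0; i < 8; i++)
        value = (value << 8) | bytes[i];
    return value;
}

/* Split the packed word into its three flags and the 54-bit offset. */
struct rmap_offset_info rmap_unpack_offset(uint64_t word)
{
    struct rmap_offset_info info;

    info.attr_fork = (word & RMAP_OFF_ATTR_FORK) != 0;
    info.bmbt_block = (word & RMAP_OFF_BMBT_BLOCK) != 0;
    info.unwritten = (word & RMAP_OFF_UNWRITTEN) != 0;
    info.offset = word & RMAP_OFF_MASK;
    return info;
}

/*
 * Return the name of a special owner, or NULL if the owner is taken to be
 * an inode number. Assumes the table entries are numbered -1 to -9 in order.
 */
const char *rmap_special_owner_name(uint64_t owner)
{
    static const char *const names[] = {
        "XFS_RMAP_OWN_NULL", "XFS_RMAP_OWN_UNKNOWN", "XFS_RMAP_OWN_FS",
        "XFS_RMAP_OWN_LOG", "XFS_RMAP_OWN_AG", "XFS_RMAP_OWN_INOBT",
        "XFS_RMAP_OWN_INODES", "XFS_RMAP_OWN_REFC", "XFS_RMAP_OWN_COW",
    };
    uint64_t code = UINT64_C(0) - owner; /* -1 maps to 1, -9 maps to 9 */

    if (code >= 1 && code <= 9)
        return names[code - 1];
    return NULL;
}
```

A caller would first run the raw on-disk bytes through be64_decode() and then pass the result to rmap_unpack_offset(); an owner for which rmap_special_owner_name() returns NULL is treated as an inode number in this sketch.
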
+The key has the following structure:
+
+.. code:: c
+
+    struct xfs_rmap_key {
+        __be32 rm_startblock;
+        __be64 rm_owner;
+        __be64 rm_fork:1;
+        __be64 rm_bmbt:1;
+        __be64 rm_reserved:1;
+        __be64 rm_unused:7;
+        __be64 rm_offset:54;
+    };
+
+For the reverse-mapping B+tree on a filesystem that supports sharing of file
+data blocks, the key definition is larger than the usual AG block number. On a
+classic XFS filesystem, each block has only one owner, which means that
+rm\_startblock is sufficient to uniquely identify each record. However, shared
+block support (reflink) on XFS breaks that assumption; now filesystem blocks
+can be linked to any logical block offset of any file inode. Therefore, the
+key must include the owner and offset information to preserve the one-to-one
+relation between key and record.
+
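To illustrate why the wider key restores uniqueness, here is a minimal comparison sketch. It assumes keys are ordered by start block first, then owner, then offset, which is the order xfs_db lists the key fields in; treating the owner as an unsigned 64-bit value is also an assumption. The struct and function names are hypothetical and use host-endian fields rather than the on-disk layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-endian key with the flag bits already masked off. */
struct rmap_key {
    uint32_t startblock;
    uint64_t owner;
    uint64_t offset;
};

/*
 * Three-way comparison: negative if a sorts before b, zero if equal,
 * positive if a sorts after b. Assumed order: startblock, owner, offset.
 */
int rmap_key_cmp(const struct rmap_key *a, const struct rmap_key *b)
{
    assert(a != NULL && b != NULL);

    if (a->startblock != b->startblock)
        return a->startblock < b->startblock ? -1 : 1;
    if (a->owner != b->owner)
        return a->owner < b->owner ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}
```

With only rm_startblock in the key, two files sharing physical block 24 would produce identical keys; with the owner and offset included, the keys differ and the records sort deterministically.
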
+- As the reverse-mapping records are AG relative, all the block numbers are
+  only 32 bits.
+
+- The bb\_magic value is "RMB3" (0x524d4233).
+
+- The xfs\_btree\_sblock\_t header is used for intermediate B+tree nodes as
+  well as the leaves.
+
+- Each pointer is associated with two keys. The first of these is the "low
+  key", which is the key of the smallest record accessible through the
+  pointer. This low key has the same meaning as the key in all other btrees.
+  The second key is the "high key", which is the largest key that can be
+  used to access any record underneath the pointer. Recall that each record
+  in the reverse-mapping B+tree describes an interval of physical blocks
+  mapped to an interval of logical file block offsets; therefore, a range of
+  keys can be used to find a given record.
+
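The high key of a single record can be sketched as follows: it is the key of the last block the record maps. This is an illustration under stated assumptions rather than the kernel's implementation. The names are hypothetical, special owners are assumed to be the values with the top bit set, and the logical offset is advanced only for inode-owned extents that are not block map B+tree blocks, since rm_offset has no meaning otherwise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-endian forms of a record and a key. */
struct rmap_rec {
    uint32_t startblock;
    uint32_t blockcount;
    uint64_t owner;
    uint64_t offset;
    bool bmbt_block;
};

struct rmap_key {
    uint32_t startblock;
    uint64_t owner;
    uint64_t offset;
};

/* Assumption: special (non-inode) owners have the top bit set. */
bool rmap_owner_is_special(uint64_t owner)
{
    return (owner & (UINT64_C(1) << 63)) != 0;
}

/* Build the largest key that still refers to a block inside the record. */
struct rmap_key rmap_high_key(const struct rmap_rec *rec)
{
    struct rmap_key key;
    uint32_t adj;

    assert(rec != NULL && rec->blockcount > 0);
    adj = rec->blockcount - 1;

    key.startblock = rec->startblock + adj;
    key.owner = rec->owner;
    key.offset = rec->offset;
    /* The offset is only a file offset for inode-owned data extents. */
    if (!rmap_owner_is_special(rec->owner) && !rec->bmbt_block)
        key.offset += adj;
    return key;
}
```

For instance, a record that starts at physical block 24 and file offset 0 with a length of 525 blocks has a last physical block of 548 and a last file offset of 524. The high key stored next to a node pointer would then be the largest such key over all records beneath that pointer.
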
+xfs\_db rmapbt Example
+^^^^^^^^^^^^^^^^^^^^^^
+
+This example shows a reverse-mapping B+tree from a freshly populated root
+filesystem:
+
+::
+
+ xfs_db> agf 0
+ xfs_db> addr rmaproot
+ xfs_db> p
+ magic = 0x524d4233
+ level = 1
+ numrecs = 43
+ leftsib = null
+ rightsib = null
+ bno = 56
+ lsn = 0x3000004c8
+ uuid = 1977221d-8345-464e-b1f4-aa2ea36895f4
+ owner = 0
+ crc = 0x7cf8be6f (correct)
+ keys[1-43] = [startblock,owner,offset]
+ keys[1-43] = [startblock,owner,offset,attrfork,bmbtblock,startblock_hi,owner_hi,
+ offset_hi,attrfork_hi,bmbtblock_hi]
+ 1:[0,-3,0,0,0,351,4418,66,0,0]
+ 2:[417,285,0,0,0,827,4419,2,0,0]
+ 3:[829,499,0,0,0,2352,573,55,0,0]
+ 4:[1292,710,0,0,0,32168,262923,47,0,0]
+ 5:[32215,-5,0,0,0,34655,2365,3411,0,0]
+ 6:[34083,1161,0,0,0,34895,265220,1,0,1]
+ 7:[34896,256191,0,0,0,36522,-9,0,0,0]
+ ...
+ 41:[50998,326734,0,0,0,51430,-5,0,0,0]
+ 42:[51431,327010,0,0,0,51600,325722,11,0,0]
+ 43:[51611,327112,0,0,0,94063,23522,28375272,0,0]
+ ptrs[1-43] = 1:5 2:6 3:8 4:9 5:10 6:11 7:418 ... 41:46377 42:48784 43:49522
+
+We arbitrarily pick pointer 17 to traverse downwards:
+
+::
+
+ xfs_db> addr ptrs[17]
+ xfs_db> p
+ magic = 0x524d4233
+ level = 0
+ numrecs = 168
+ leftsib = 36284
+ rightsib = 37617
+ bno = 294760
+ lsn = 0x200002761
+ uuid = 1977221d-8345-464e-b1f4-aa2ea36895f4
+ owner = 0
+ crc = 0x2dad3fbe (correct)
+ recs[1-168] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock]
+ 1:[40326,1,259615,0,0,0,0] 2:[40327,1,-5,0,0,0,0]
+ 3:[40328,2,259618,0,0,0,0] 4:[40330,1,259619,0,0,0,0]
+ ...
+ 127:[40540,1,324266,0,0,0,0] 128:[40541,1,324266,8388608,0,0,0]
+ 129:[40542,2,324266,1,0,0,0] 130:[40544,32,-7,0,0,0,0]
+
+Several interesting things pop out here. The first record shows that inode
+259,615 has mapped AG block 40,326 at offset 0. We confirm this by looking at
+the block map for that inode:
+
+::
+
+ xfs_db> inode 259615
+ xfs_db> bmap
+ data offset 0 startblock 40326 (0/40326) count 1 flag 0
+
+Next, notice records 127 and 128, which describe neighboring AG blocks that
+are mapped to non-contiguous logical blocks in inode 324,266. Given the
+logical offset of 8,388,608 we surmise that this is a leaf directory, but let
+us confirm:
+
+::
+
+ xfs_db> inode 324266
+ xfs_db> p core.mode
+ core.mode = 040755
+ xfs_db> bmap
+ data offset 0 startblock 40540 (0/40540) count 1 flag 0
+ data offset 1 startblock 40542 (0/40542) count 2 flag 0
+ data offset 3 startblock 40576 (0/40576) count 1 flag 0
+ data offset 8388608 startblock 40541 (0/40541) count 1 flag 0
+ xfs_db> dblock 0
+ xfs_db> p dhdr.hdr.magic
+ dhdr.hdr.magic = 0x58444433
+ xfs_db> dblock 8388608
+ xfs_db> p lhdr.info.hdr.magic
+ lhdr.info.hdr.magic = 0x3df1
+
+Indeed, this inode 324,266 appears to be a leaf directory, as it has regular
+directory data blocks at low offsets, and a single leaf block.
+
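The logical offset of 8,388,608 is not arbitrary. The sketch below shows the arithmetic, assuming that directory address space is divided into 32 GiB segments (data first, then leaf blocks) and that the directory block size equals the 4096-byte filesystem block size seen in this example. The constant and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: each directory segment spans 32 GiB of logical byte space. */
#define DIR_SEGMENT_BYTES (UINT64_C(1) << 35)

/*
 * First logical file block of a directory segment.
 * Segment 0 holds data blocks and segment 1 holds leaf blocks.
 */
uint64_t dir_segment_first_block(unsigned int segment, uint32_t block_size)
{
    /* Block sizes are powers of two. */
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
    return (uint64_t)segment * DIR_SEGMENT_BYTES / block_size;
}
```

With 4096-byte blocks, dir_segment_first_block(1, 4096) evaluates to 8,388,608, which matches the offset of the leaf block in the bmap output above.
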
+Notice further the two reverse-mapping records with negative owners. An owner
+of -7 corresponds to XFS\_RMAP\_OWN\_INODES, which is an inode chunk, and an
+owner code of -5 corresponds to XFS\_RMAP\_OWN\_AG, which covers free space
+B+tree blocks, freelist blocks, and reverse-mapping B+tree blocks. Let’s see
+if block 40,544 is part of an inode chunk:
+
+::
+
+ xfs_db> blockget
+ xfs_db> fsblock 40544
+ xfs_db> blockuse
+ block 40544 (0/40544) type inode
+ xfs_db> stack
+ 1:
+ byte offset 166068224, length 4096
+ buffer block 324352 (fsbno 40544), 8 bbs
+ inode 324266, dir inode 324266, type data
+ xfs_db> type inode
+ xfs_db> p
+ core.magic = 0x494e
+
+Our suspicions are confirmed. Let’s also see if block 40,327 is a per-AG
+B+tree block:
+
+::
+
+ xfs_db> fsblock 40327
+ xfs_db> blockuse
+ block 40327 (0/40327) type btrmap
+ xfs_db> type rmapbt
+ xfs_db> p
+ magic = 0x524d4233
+
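The byte offset and buffer block numbers that xfs_db prints in the stack output follow directly from the block number. The sketch below reproduces that arithmetic under the assumptions visible in the transcript: 4096-byte filesystem blocks, 512-byte basic blocks, and allocation group 0, where the filesystem block number equals the AG block number. It does not handle other allocation groups, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define BASIC_BLOCK_BYTES 512u /* size of one "bb" in xfs_db output */

/* Disk address in 512-byte basic blocks; valid only for AG 0 in this sketch. */
uint64_t ag0_block_to_daddr(uint64_t agbno, uint32_t block_size)
{
    assert(block_size >= BASIC_BLOCK_BYTES &&
           block_size % BASIC_BLOCK_BYTES == 0);
    return agbno * (block_size / BASIC_BLOCK_BYTES);
}

/* Byte offset from the start of the device; valid only for AG 0. */
uint64_t ag0_block_to_byte_offset(uint64_t agbno, uint32_t block_size)
{
    assert(block_size != 0);
    return agbno * block_size;
}
```

For block 40,544 this gives basic block 324,352 and byte offset 166,068,224, which are the values shown in the stack output.
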
+As you can see, the reverse block-mapping B+tree is an important secondary
+metadata structure, which can be used to reconstruct damaged primary metadata.
+Now let’s look at an extended rmap btree:
+
+::
+
+ xfs_db> agf 0
+ xfs_db> addr rmaproot
+ xfs_db> p
+ magic = 0x34524d42
+ level = 1
+ numrecs = 5
+ leftsib = null
+ rightsib = null
+ bno = 6368
+ lsn = 0x100000d1b
+ uuid = 400f0928-6b88-4c37-af1e-cef1f8911f3f
+ owner = 0
+ crc = 0x8d4ace05 (correct)
+ keys[1-5] = [startblock,owner,offset,attrfork,bmbtblock,startblock_hi,owner_hi,offset_hi,attrfork_hi,bmbtblock_hi]
+ 1:[0,-3,0,0,0,705,132,681,0,0]
+ 2:[24,5761,0,0,0,548,5761,524,0,0]
+ 3:[24,5929,0,0,0,380,5929,356,0,0]
+ 4:[24,6097,0,0,0,212,6097,188,0,0]
+ 5:[24,6277,0,0,0,807,-7,0,0,0]
+ ptrs[1-5] = 1:5 2:771 3:9 4:10 5:11
+
+The second pointer stores both the low key [24,5761,0,0,0] and the high key
+[548,5761,524,0,0], which means that we can expect block 771 to contain
+records starting at physical block 24, inode 5761, offset zero; and that one
+of the records can be used to find a reverse mapping for physical block 548,
+inode 5761, and offset 524:
+
+::
+
+ xfs_db> addr ptrs[2]
+ xfs_db> p
+ magic = 0x34524d42
+ level = 0
+ numrecs = 168
+ leftsib = 5
+ rightsib = 9
+ bno = 6168
+ lsn = 0x100000d1b
+ uuid = 400f0928-6b88-4c37-af1e-cef1f8911f3f
+ owner = 0
+ crc = 0xd58eff0e (correct)
+ recs[1-168] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock]
+ 1:[24,525,5761,0,0,0,0]
+ 2:[24,524,5762,0,0,0,0]
+ 3:[24,523,5763,0,0,0,0]
+ ...
+ 166:[24,360,5926,0,0,0,0]
+ 167:[24,359,5927,0,0,0,0]
+ 168:[24,358,5928,0,0,0,0]
+
+Observe that the first record in the block starts at physical block 24, inode
+5761, offset zero, just as we expected. Note that this first record is also
+indexed by the high key provided in the node block; physical block 548,
+inode 5761, offset 524 is the very last block mapped by this record.
+Furthermore, note that record 168, despite being the last record in this
+block, has a lower maximum key (physical block 381, inode 5928, offset 357)
+than the first record.
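
Because the records in this block overlap, a lookup by physical block cannot stop at the first match or assume that later records extend further. The sketch below is a hypothetical linear scan over one leaf's records, not the kernel's lookup code; it counts how many records map a given AG block, using a few of the records from the dump above as sample data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-endian record, matching the xfs_db column order. */
struct rmap_rec {
    uint32_t startblock;
    uint32_t blockcount;
    uint64_t owner;
    uint64_t offset;
};

/* True if the record's physical extent contains the given AG block. */
bool rmap_rec_covers(const struct rmap_rec *rec, uint32_t agbno)
{
    assert(rec != NULL);
    return agbno >= rec->startblock &&
           agbno - rec->startblock < rec->blockcount;
}

/* Count the records in one leaf block that map the given AG block. */
size_t rmap_count_mappings(const struct rmap_rec *recs, size_t nrecs,
                           uint32_t agbno)
{
    size_t hits = 0;

    for (size_t i = 0; i < nrecs; i++) {
        if (rmap_rec_covers(&recs[i], agbno))
            hits++;
    }
    return hits;
}
```

Given records 1, 2, and 168 from the dump, only record 1 maps physical block 548, while all three map physical block 24. This is why the node block keeps a high key: it lets a search decide whether any record beneath a pointer can reach the block in question.
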