On 15/03/2021 13:53, Naohiro Aota wrote:
The following patch will change the superblock logging zones' location from
fixed zone numbers to fixed LBAs.
Here is some background on how the superblock works on zoned btrfs.
This document will be promoted to btrfs-dev-docs in the future.
# Superblock logging for zoned btrfs
The superblock and its copies are the only data structures in btrfs with a
fixed location on a device. Since we cannot overwrite these blocks if they
are placed in sequential write required zones, we cannot use the regular
method of updating superblocks with zoned btrfs.
It looks like a ZBC command that combines a write pointer reset with a
write could have helped here.
We also cannot limit the
position of superblocks to conventional zones as that would prevent using
zoned block devices that do not have this zone type (e.g. NVMe ZNS SSDs).
To solve this problem, we use superblock log writing. This method uses two
sequential write required zones as a circular buffer to write updated
superblocks. Once the first zone is filled up, writing continues in the
second zone. When both zones are full, the first zone is reset before
writing resumes there. Once the first zone is full again, the second zone
is reset and the latest superblock is written into it. With this logging,
we can always
determine the position of the latest superblock by inspecting the zones'
write pointer information provided by the device. One corner case is when
both zones are full. For this situation, we read out the last superblock of
each zone and compare them to determine which copy is the latest one.
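The rules above can be sketched as a small decision function. This is a simplified userspace illustration, not the kernel code: `struct log_zone`, `enum log_state` and `next_log_pos()` are hypothetical names, and the caller is assumed to have already read the generation of the last superblock in each zone for the case where both zones are full.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of one sequential write required zone. */
struct log_zone {
	uint64_t start;	/* first byte of the zone */
	uint64_t len;	/* zone size in bytes */
	uint64_t wp;	/* write pointer reported by the device */
};

enum log_state {
	LOG_OK,		/* *pos holds the next write position */
	LOG_EMPTY,	/* no superblock was ever written to the pair */
	LOG_INVALID,	/* combination the logging scheme never produces */
};

/*
 * Find where the next superblock goes in a log zone pair. For LOG_OK, the
 * latest superblock is the block just before that position, wrapping to
 * the end of the other zone when the position is a zone start.
 * gen0 and gen1 are the generations of the last superblock in zone 0 and
 * zone 1; they are only consulted when both zones are full.
 */
enum log_state next_log_pos(const struct log_zone z[2],
			    uint64_t gen0, uint64_t gen1, uint64_t *pos)
{
	bool empty[2], full[2];

	for (int i = 0; i < 2; i++) {
		empty[i] = (z[i].wp == z[i].start);
		full[i] = (z[i].wp == z[i].start + z[i].len);
	}

	if (empty[0] && empty[1]) {
		*pos = z[0].start;
		return LOG_EMPTY;
	}
	if (full[0] && full[1]) {
		/* The zone holding the older copy is reset and reused. */
		*pos = (gen0 > gen1) ? z[1].start : z[0].start;
		return LOG_OK;
	}
	if (!full[0] && (empty[1] || full[1])) {
		*pos = z[0].wp;		/* zone 0 is the active one */
		return LOG_OK;
	}
	if (full[0]) {
		*pos = z[1].wp;		/* zone 0 filled up, zone 1 is active */
		return LOG_OK;
	}
	return LOG_INVALID;
}
```

Only the both-full case needs any superblock to be read; every other valid state is decided from the write pointers alone.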
## Placement of superblock logging zones
We use the following three pairs of zones containing fixed offset
locations, regardless of the device zone size.
- Primary superblock: zone starting at offset 0 and the following zone
- First copy: zone containing offset 64GB and the following zone
- Second copy: zone containing offset 256GB and the following zone
These zones are reserved for superblock logging and never used for data or
metadata blocks. Zones containing the offsets used to store superblocks in
a regular btrfs volume (the non-zoned case) are also reserved to avoid
confusion.
The first copy is placed at a much higher offset than on a regular btrfs
volume (64M). This increase avoids overlapping with the log zones for the
primary superblock. The higher location is arbitrary but allows supporting
devices with very large zone sizes, up to 32GB. For now, however, we only
allow zone sizes up to 8GB.
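As an illustration of the placement rule, the sketch below maps each fixed offset to the zone that contains it and checks whether the primary pair and the first copy's pair would collide. The names `sb_log_offset`, `sb_log_first_zone()` and `sb_log_pairs_overlap()` are made up for this example; only the offsets (0, 64GB, 256GB) come from the text above.

```c
#include <stdint.h>

#define GiB (1024ULL * 1024 * 1024)

/* Fixed byte offsets taken from the list above; the names are made up. */
static const uint64_t sb_log_offset[3] = { 0, 64 * GiB, 256 * GiB };

/*
 * First zone of the log pair for superblock copy `mirror` (0, 1 or 2).
 * The pair is this zone and the next one. zone_size must be a power of 2.
 */
uint64_t sb_log_first_zone(int mirror, uint64_t zone_size)
{
	return sb_log_offset[mirror] / zone_size;
}

/* Nonzero if the primary pair and the first copy's pair share a zone. */
int sb_log_pairs_overlap(uint64_t zone_size)
{
	return sb_log_first_zone(1, zone_size) <
	       sb_log_first_zone(0, zone_size) + 2;
}
```

With 32GB zones the primary pair covers bytes [0, 64GB) and the first copy's pair starts exactly at 64GB, so the two do not overlap; with 64GB zones they would, which is where the 32GB limit comes from.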
## Writing superblock in conventional zones
Conventional zones do not have a write pointer. This zone type thus cannot
be used with superblock logging since determining the position of the
latest copy of the superblock in a zone pair would be impossible.
To address this problem, if either of the zones containing the fixed offset
locations for zone logging is a conventional zone, superblock updates are
done in-place using the first block of the conventional zone.
## Reading zoned btrfs dump image without zone information
Reading a zoned btrfs image without zone information is challenging but
possible.
We can always find a superblock copy at or after the fixed offset locations
that determine the position of the logging zones. With such a copy, the
superblock incompatible flags indicate whether the volume is zoned. With a
chunk item in the sys_chunk_array, we can determine the zone size from the
size of a device extent, itself determined from the chunk length,
num_stripes, and sub_stripes. With this information, all blocks within the
two logging zones containing the fixed locations can be inspected to find
the newest superblock copy.
The first zone of a log pair may be empty and have no superblock copy. This
can happen if a system crashes after resetting the first zone of a pair and
before writing out a new superblock. In this case, a superblock copy can be
found in the second zone of a log pair. The start of this second zone can
be found by inspecting the blocks located at the fixed offset of the log
pair plus the possible zone size (4M [1], 8M, 16M, 32M, 64M, 128M, 256M,
512M, 1G, 2G, 4G, 8G [2]) [3]. Once we find a superblock, we can follow the
same procedure as above to find the latest superblock copy within the zone
pair.
[1] 4M = BTRFS_MKFS_SYSTEM_GROUP_SIZE. We cannot mkfs on a device with a
zone size less than 4MB because we cannot create the initial temporary
system chunk with that size.
[2] The maximum size we support for now.
[3] The zone size is limited to these 12 cases, as it must be a power of 2.
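The probing step can be sketched as follows: for each possible zone size, the second zone of a pair would start at the pair's fixed offset plus that size. This assumes the fixed offset is itself zone aligned, which holds for 0, 64GB and 256GB with any power-of-2 zone size up to 8G. The function and macro names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MiB (1024ULL * 1024)
#define MIN_ZONE_SIZE (4 * MiB)		/* smallest size mkfs accepts */
#define MAX_ZONE_SIZE (8192 * MiB)	/* 8G, largest size allowed for now */
#define MAX_CANDIDATES 12

/*
 * Fill out[] with the byte offsets where the second zone of a log pair
 * may start, one per possible zone size, given the pair's fixed offset.
 * Returns the number of candidates written.
 */
size_t second_zone_candidates(uint64_t fixed_offset,
			      uint64_t out[MAX_CANDIDATES])
{
	size_t n = 0;

	/* Zone sizes are powers of 2, so doubling visits every candidate. */
	for (uint64_t zs = MIN_ZONE_SIZE; zs <= MAX_ZONE_SIZE; zs <<= 1)
		out[n++] = fixed_offset + zs;
	return n;
}
```

A reader of a dump image would then check each candidate offset for a valid superblock and stop at the first hit, which also reveals the zone size.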
Once we find the latest superblock, it is no different than reading a
regular btrfs image. You can further confirm the determined zone size by
comparing it with the size of a device extent, which is the same as the
zone size.
In fact, since the write offset within the log buffer differs between the
primary and the copies [4], the first zone of each pair is reset at a
different time. So, if the superblock at offset 0 is missing, we can also
try reading the head of a copy's log buffer.
[4] Because mkfs updates the primary during its initial process, advancing
only the write pointer of the primary log buffer.
## Superblock writing on an emulated zoned device
When a regular device is mounted in zoned mode, btrfs emulates conventional
zones by slicing the device into fixed-size zones. In this case, however, we
do not follow the above rule of writing superblocks at the head of the
logging zones if they are conventional. Doing so would introduce a
chicken-and-egg problem. To know that a given btrfs is zoned, we need to
read a superblock and check the incompatible flags. But to read a superblock
properly from a zoned position, we need to know a priori that the file
system is zoned (e.g. because it resides on a zoned device), leading to a
recursive dependency.
We can use the regular superblock update method on an emulated zoned
device to break the recursion. Since the zones containing the regular
locations are always reserved, it is safe to do so. Then, we can naturally
read a regular superblock on a regular device and determine whether the
file system is zoned.
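The check itself is small once a superblock has been read from the regular location. The sketch below works on an in-memory buffer; the magic string, the field offsets (0x40 for the magic, 0xbc for the incompatible flags) and the zoned flag bit (bit 12) are assumptions based on my understanding of the on-disk format and should be verified against the kernel headers. The helper names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/*
 * Values believed to match the btrfs on-disk format; check them against
 * the kernel headers before relying on them.
 */
#define SB_MAGIC		"_BHRfS_M"
#define SB_MAGIC_OFFSET		0x40
#define SB_INCOMPAT_OFFSET	0xbc
#define SB_INCOMPAT_ZONED	(1ULL << 12)

/* Read a little-endian 64-bit value from an unaligned buffer. */
uint64_t get_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Classify a superblock-sized buffer read from the regular location:
 * -1 if it is not a btrfs superblock, 1 if zoned, 0 otherwise.
 */
int sb_is_zoned(const uint8_t *sb)
{
	if (memcmp(sb + SB_MAGIC_OFFSET, SB_MAGIC, 8) != 0)
		return -1;
	return (get_le64(sb + SB_INCOMPAT_OFFSET) & SB_INCOMPAT_ZONED) != 0;
}
```

A real reader would also verify the checksum before trusting any field; that step is left out here to keep the example short.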
Naohiro Aota (1):
btrfs: zoned: move superblock logging zone location
fs/btrfs/zoned.c | 40 ++++++++++++++++++++++++++++++----------
1 file changed, 30 insertions(+), 10 deletions(-)