On 2021/03/29 12:34, Anand Jain wrote:
> On 15/03/2021 13:53, Naohiro Aota wrote:
>> The following patch will change the superblock logging zones' location from
>> fixed zone number to fixed LBAs.
>> Here is some background on how the superblock works on zoned btrfs.
>> This document will be promoted to btrfs-dev-docs in the future.
>> # Superblock logging for zoned btrfs
>> The superblock and its copies are the only data structures in btrfs with a
>> fixed location on a device. Since we cannot overwrite these blocks if they
>> are placed in sequential write required zones, we cannot use the regular
>> method of updating superblocks with zoned btrfs.
>   Looks like a ZBC command which does the write pointer reset and write
> could have helped here.

Yes and no. A two-part command like this could fail either on the reset part or
the write part (which would leave the zone empty). So in the end, the possible
error patterns are very similar to using 2 commands, and for the SB logging, we
would still need 2 zones to make sure we do not end up with an SB log that is 

>> We also cannot limit the
>> position of superblocks to conventional zones as that would prevent using
>> zoned block devices that do not have this zone type (e.g. NVMe ZNS SSDs).
>> To solve this problem, we use superblock log writing. This method uses two
>> sequential write required zones as a circular buffer to write updated
>> superblocks. Once the first zone is filled up, start writing into the
>> second zone. When both zones are filled up, and before writing to the
>> first zone again, the first zone is reset and writing continues in the
>> first zone. Once the first zone is full, reset the second zone, and write
>> the latest superblock in the second zone. With this logging, we can always
>> determine the position of the latest superblock by inspecting the zones'
>> write pointer information provided by the device. One corner case is when
>> both zones are full. For this situation, we read out the last superblock of
>> each zone and compare them to determine which copy is the latest one.
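The write-pointer rules above can be sketched roughly as follows. This is a simplified model, not the kernel code: the `Zone` class, `latest_sb_offset()` and its `last_gen` argument are hypothetical names, and each log entry is assumed to be one 4 KiB superblock.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

SB_SIZE = 4096  # assumed size of one superblock copy, in bytes


@dataclass
class Zone:
    start: int  # byte offset of the zone start
    size: int   # zone size in bytes
    wp: int     # write pointer, as an absolute byte offset

    @property
    def empty(self) -> bool:
        return self.wp == self.start

    @property
    def full(self) -> bool:
        return self.wp == self.start + self.size


def latest_sb_offset(z0: Zone, z1: Zone,
                     last_gen: Optional[Tuple[int, int]] = None) -> Optional[int]:
    """Return the byte offset of the newest superblock in a log pair.

    last_gen holds the generations of the last superblock of each zone and
    is only needed for the corner case where both zones are full.
    Returns None if nothing has been written yet.
    """
    if z0.empty and z1.empty:
        return None
    if z0.full and z1.full:
        # Corner case: the write pointers cannot tell the zones apart.
        if last_gen is None:
            raise ValueError("both zones full: generations are required")
        newest = z0 if last_gen[0] > last_gen[1] else z1
        return newest.start + newest.size - SB_SIZE
    if not z0.full and (z1.empty or z1.full):
        # The log head is in the first zone.
        if z0.empty:
            # First zone was just reset: newest copy ends the second zone.
            return z1.start + z1.size - SB_SIZE
        return z0.wp - SB_SIZE
    if z0.full:
        # The log head is in the second zone.
        if z1.empty:
            # Second zone was just reset: newest copy ends the first zone.
            return z0.start + z0.size - SB_SIZE
        return z1.wp - SB_SIZE
    raise ValueError("inconsistent write pointers for a log pair")
```

Only the both-full case needs to read any superblock; every other state is resolved from the two write pointers alone.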
>> ## Placement of superblock logging zones
>> We use the following three pairs of zones containing fixed offset
>> locations, regardless of the device zone size.
>>    - Primary superblock: zone starting at offset 0 and the following zone
>>    - First copy: zone containing offset 64GB and the following zone
>>    - Second copy: zone containing offset 256GB and the following zone
>> These zones are reserved for superblock logging and never used for data or
>> metadata blocks. Zones containing the offsets used to store superblocks in
>> a regular btrfs volume (the non-zoned case) are also reserved to avoid
>> confusion.
>> The first copy position is much larger than for a regular btrfs volume
>> (64M).  This increase is to avoid overlapping with the log zones for the
>> primary superblock. This higher location is arbitrary but allows supporting
>> devices with very large zone size, up to 32GB. But we only allow zone sizes
>> up to 8GB for now.
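To make the placement rule concrete, here is a small sketch that maps the fixed offsets to zone numbers for a given zone size. The names `SB_LOG_OFFSETS` and `sb_log_zone_pairs()` are made up for this illustration and are not the identifiers used in the patch; the offsets are the ones listed above.

```python
GIB = 1 << 30

# Fixed byte offsets from the text: primary, first copy, second copy.
SB_LOG_OFFSETS = (0, 64 * GIB, 256 * GIB)


def sb_log_zone_pairs(zone_size):
    """Return the (first, second) zone numbers of each superblock log pair."""
    if zone_size <= 0 or zone_size & (zone_size - 1):
        raise ValueError("zone size must be a power of 2")
    pairs = []
    for offset in SB_LOG_OFFSETS:
        first = offset // zone_size  # zone containing the fixed offset
        pairs.append((first, first + 1))
    return pairs
```

With 256 MiB zones this gives zones (0, 1), (256, 257) and (1024, 1025). With 32 GiB zones it gives (0, 1), (2, 3) and (8, 9), so the primary pair ends exactly where the first copy pair begins, which is why 32GB is the largest zone size this layout can support.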
>> ## Writing superblock in conventional zones
>> Conventional zones do not have a write pointer. This zone type thus cannot
>> be used with superblock logging since determining the position of the
>> latest copy of the superblock in a zone pair would be impossible.
>> To address this problem, if either of the zones containing the fixed offset
>> locations for superblock logging is a conventional zone, superblock updates are
>> done in-place using the first block of the conventional zone.
>> ## Reading zoned btrfs dump image without zone information
>> Reading a zoned btrfs image without zone information is challenging but
>> possible.
>> We can always find a superblock copy at or after the fixed offset locations
>> determining the logging zones' position. With such a copy, the superblock
>> incompatible flags indicate whether the volume is zoned or not. With a chunk
>> item in the sys_chunk_array, we can determine the zone size from the size
>> of a device extent, itself determined from the chunk length, num_stripes,
>> and sub_stripes.  With this information, all blocks within the 2 logging
>> zones containing the fixed locations can be inspected to find the newest
>> superblock copy.
>> The first zone of a log pair may be empty and have no superblock copy. This
>> can happen if a system crashes after resetting the first zone of a pair and
>> before writing out a new superblock. In this case, a superblock copy can be
>> found in the second zone of a log pair. The start of this second zone can
>> be found by inspecting the blocks located at the fixed offset of the log
>> pair plus the possible zone size (4M [1], 8M, 16M, 32M, 64M, 128M, 256M,
>> 512M, 1G, 2G, 4G, 8G [2])[3]. Once we find a superblock, we can follow the
>> same instructions as above to find the latest superblock copy within the zone
>> log pair.
>> [1] 4M = BTRFS_MKFS_SYSTEM_GROUP_SIZE. We cannot mkfs on a device with a
>> zone size less than 4MB because we cannot create the initial temporary
>> system chunk with that size.
>> [2] The maximum size we support for now.
>> [3] The zone size is limited to these 12 cases, as it must be a power of 2.
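The probing described above could look roughly like this for a dump image. This is only a sketch: the helper names are hypothetical, and it checks nothing but the btrfs magic string at byte 0x40 of a candidate block, whereas a real tool would also verify the checksum and the rest of the superblock.

```python
import io

MIB = 1 << 20
GIB = 1 << 30
BTRFS_MAGIC = b"_BHRfS_M"  # magic string stored in every btrfs superblock
MAGIC_OFFSET = 0x40        # byte offset of the magic within a superblock


def candidate_zone_sizes(min_size=4 * MIB, max_size=8 * GIB):
    """Yield the power-of-2 zone sizes from 4M to 8G, smallest first."""
    size = min_size
    while size <= max_size:
        yield size
        size *= 2


def find_second_zone_start(image, fixed_offset):
    """Probe fixed_offset + zone size for each candidate zone size.

    image is any seekable binary file object. Returns the byte offset of
    the first candidate holding a superblock magic, or None.
    """
    for zone_size in candidate_zone_sizes():
        start = fixed_offset + zone_size
        image.seek(start + MAGIC_OFFSET)
        if image.read(len(BTRFS_MAGIC)) == BTRFS_MAGIC:
            return start
    return None


# Example: a fake image whose second primary log zone starts at 8 MiB.
fake = io.BytesIO(bytes(8 * MIB + MAGIC_OFFSET) + BTRFS_MAGIC)
```

Trying the sizes in ascending order means that candidates smaller than the real zone size land inside the empty first zone, so the first hit should be the start of the second zone, assuming an empty zone reads back without stale superblock data.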
>> Once we find the latest superblock, it is no different than reading a
>> regular btrfs image. You can further confirm the determined zone size by
>> comparing it with the size of a device extent because it is the same as the
>> zone size.
>> Actually, since the write offset within the logging buffer differs between
>> the primary and the copies [4], the former zone of each pair is reset at a
>> different time. So, if the superblock at offset 0 is missing, we can also
>> try reading the head of the buffer of a copy.
>> [4] Because mkfs updates the primary in the initial process, advancing only
>> the write pointer of the primary log buffer.
>> ## Superblock writing on an emulated zoned device
>> By mounting a regular device in zoned mode, btrfs emulates conventional
>> zones by slicing the device with a fixed size. In this case, however, we do
>> not follow the above rule of writing superblocks at the head of the logging
>> zones if they are conventional. Doing so would introduce a chicken-and-egg
>> problem. To know whether a given btrfs is zoned, we need to read a
>> superblock to see the incompatible flags. But, to read a superblock
>> properly from a zoned position, we need to know a priori that the file
>> system is zoned (e.g. because it resides on a zoned device), leading to a
>> recursive dependency.
>> We can use the regular super block update method on an emulated zoned
>> device to break the recursion. Since the zones containing the regular
>> locations are always reserved, it is safe to do so. Then, we can naturally
>> read a regular superblock on a regular device and determine whether the
>> file system is zoned or not.
>> Naohiro Aota (1):
>>    btrfs: zoned: move superblock logging zone location
>>   fs/btrfs/zoned.c | 40 ++++++++++++++++++++++++++++++----------
>>   1 file changed, 30 insertions(+), 10 deletions(-)

Damien Le Moal
Western Digital Research
