On 2018年03月16日 08:49, Nicholas D Steeves wrote:
> Hi Qu,
>
> So sorry for the incredibly delayed reply [it got lost in my drafts
> folder], I sincerely appreciate the time you took to respond.  There
> is a lot in your responses that I suspect would benefit readers of the
> btrfs wiki, so I've drawn attention to them by replying inline.  I've
> omitted the sections David resolved with his merge.
>
> P.S. Even graduate-level native speakers struggle with the
> multitude of special cases in English!
>
> On Sun, Oct 22, 2017 at 06:54:16PM +0800, Qu Wenruo wrote:
>> Hi Nicholas,
>>
>> Thanks for the documentation update.
>> Since I'm not a native English speaker, I may not be able to help much
>> with organizing the sentences, but I can help to explain the questions
>> noted in the modification.
>>
>> On 2017年10月22日 08:00, Nicholas D Steeves wrote:
>>> In one big patch, as requested
> [...]
>>> --- a/Documentation/btrfs-balance.asciidoc
>>> +++ b/Documentation/btrfs-balance.asciidoc
>>> @@ -21,7 +21,7 @@ filesystem.
>>>  The balance operation is cancellable by the user. The on-disk state of the
>>>  filesystem is always consistent so an unexpected interruption (eg. system crash,
>>>  reboot) does not corrupt the filesystem. The progress of the balance operation
>>> -is temporarily stored and will be resumed upon mount, unless the mount option
>>> +****is temporarily stored**** (EDIT: where is it stored?) and will be
>>> resumed upon mount, unless the mount option
>>
>> To be specific, they are stored in the data reloc tree and the tree reloc trees.
>>
>> The data reloc tree stores the data/metadata written to the new location.
>>
>> And a tree reloc tree is a kind of special snapshot of each tree whose tree
>> blocks get relocated during the relocation.
>
> Is there already a document on btrfs allocation?  This seems like
> it might be a nice addition for the wiki.  I'm guessing it would fit
> under
> https://btrfs.wiki.kernel.org/index.php/Main_Page#Developer_documentation
Yep, it's a good idea to document such things in the btrfs wiki.  I would
add such a doc in my spare time.

>
>>> @@ -200,11 +200,11 @@ section 'PROFILES'.
>>>  ENOSPC
>>>  ------
>>>
>>> -The way balance operates, it usually needs to temporarily create a new block
>>> +****The way balance operates, it usually needs to temporarily create a new block
>>>  group and move the old data there. For that it needs work space, otherwise
>>>  it fails for ENOSPC reasons.
>>>  This is not the same ENOSPC as if the free space is exhausted. This refers to
>>> -the space on the level of block groups.
>>> +the space on the level of block groups.**** (EDIT: What is the
>>> relationship between the new block group and the work space?  Is the "old
>>> data" removed from the new block group?  Please say something about block
>>> groups to clarify)
>>
>> Here I think we're talking about allocating a new block group, so it's
>> using unallocated space.
>>
>> While for normal space usage, we're allocating from *allocated* block
>> group space.
>>
>> So there are two levels of space allocation:
>>
>> 1) Extent level
>>    Always allocated from an existing block group (or chunk).
>>    Data extent and tree block allocations all happen at this level.
>>
>> 2) Block group (or chunk, which is the same thing) level
>>    Always allocated from free (unallocated) device space.
>>
>> I think the original sentence just wants to address this.
>
> Also seems like a good fit for a btrfs allocation document.
>
>>>
>>>  The free work space can be calculated from the output of the *btrfs filesystem show*
>>>  command:
>>> @@ -227,7 +227,7 @@ space. After that it might be possible to run other filters.
>>>
>>>  Conversion to profiles based on striping (RAID0, RAID5/6) require the work
>>>  space on each device. An interrupted balance may leave partially filled block
>>> -groups that might consume the work space.
>>> +groups that ****might**** (EDIT: is this 2nd level of uncertainty
>>> necessary?) consume the work space.
>>>
> [...]
>>> @@ -3,7 +3,7 @@ btrfs-filesystem(8)
> [...]
>>>  SYNOPSIS
>>>  --------
>>> @@ -53,8 +53,8 @@ not total size of filesystem.
>>>  when the filesystem is full. Its 'total' size is dynamic based on the
>>>  filesystem size, usually not larger than 512MiB, 'used' may fluctuate.
>>>  +
>>> -The global block reserve is accounted within Metadata. In case the filesystem
>>> -metadata are exhausted, 'GlobalReserve/total + Metadata/used = Metadata/total'.
>>> +The global block reserve is accounted within Metadata. ****In case the filesystem
>>> +metadata are exhausted, 'GlobalReserve/total + Metadata/used =
>>> Metadata/total'.**** (EDIT: s/are/is/?  And please write more for clarity.
>>> Is "global block reserve" part of GlobalReserve that is accounted within
>>> Metadata?  Isn't all of GlobalReserve's metadata accounted within Metadata?
>>> eg: "global block reserve" is the data portion of GlobalReserve, but all
>>> metadata is accounted for in Metadata.)
>>
>> GlobalReserve is accounted as Metadata, but most of the time it's just a
>> buffer until we really run out of metadata space.
>>
>> It's like metadata headroom reserved for when it is really needed.
>>
>> So under most situations GlobalReserve usage should be 0, and it's not
>> accounted as Meta/used (the reserved but unused space still counts
>> towards Meta/free).
>>
>> But when GlobalReserve/used is not 0, the used part is accounted to
>> Meta/used, and the unused part (GlobalReserve/free, if it exists) belongs
>> to Meta/free.
>>
>> Not sure how to explain it better.
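
To make that a bit more concrete from the command line: the reserve is
visible in the normal space reports.  A minimal sketch (assuming a
filesystem mounted at /mnt; the exact numbers will of course differ per
filesystem):

  # Both commands print a GlobalReserve line.  Its 'used' is normally 0
  # and only grows under heavy metadata pressure; the reserved space is
  # carved out of the Metadata chunks, not the Data chunks.
  btrfs filesystem df /mnt
  btrfs filesystem usage /mnt
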
>
> Thank you, you've explained it wonderfully.  (This also seems like a
> good fit for a btrfs allocation document)
>
>>> +
>>>  `Options`
>>> +
>>> @@ -93,10 +93,10 @@ You can also turn on compression in defragment operations.
>>>  +
>>>  WARNING: Defragmenting with Linux kernel versions < 3.9 or ≥ 3.14-rc2 as well as
>>>  with Linux stable kernel versions ≥ 3.10.31, ≥ 3.12.12 or ≥ 3.13.4 will break up
>>> -the ref-links of COW data (for example files copied with `cp --reflink`,
>>> +the reflinks of COW data (for example files copied with `cp --reflink`,
>>>  snapshots or de-duplicated data).
>>>  This may cause considerable increase of space usage depending on the broken up
>>> -ref-links.
>>> +reflinks.
>>>  +
>> [snip]
>>> +broken up reflinks.
>>>
>>>  *barrier*::
>>>  *nobarrier*::
>>>  (default: on)
>>>  +
>>>  Ensure that all IO write operations make it through the device cache and are stored
>>> -permanently when the filesystem is at it's consistency checkpoint. This
>>> +permanently when the filesystem is at ****(EDIT: "its" or "one of its"
>>> consistency checkpoint[s])****. This
>>
>> I think it is "one of its", as there are in fact 2 checkpoints for btrfs:
>> 1) Normal transaction commitment
>> 2) Log tree commitment
>>    Which only commits the log trees and the log tree root.
>>
>> But I'm not really sure if log tree commitment is also under the control
>> of barrier.
>
> Is there a document on the topic of "Things btrfs does to keep your
> data safe, and things it does to maintain a consistent state"?  This
> can go there, with a subsection for "differences during a balance
> operation" if necessary.  David merged "its consistency checkpoint",
> which I think is fine for general-user-facing documentation, but
> because you mentioned log tree commitment I'm also wondering if 2) is
> not under the control of a barrier.  Without this barrier, aren't the
> log trees more likely to be corrupted and/or out-of-date in the event
> of sudden loss of power or crash?

For the log tree I investigated a little further, and I'm now pretty
confident to say that all super block updates use FUA unless the nobarrier
mount option is specified.

So it's as safe as a normal transaction commit.

>
> [...]
>>>
>>>  *sync* <path> [subvolid...]::
>>> -Wait until given subvolume(s) are completely removed from the filesystem
>>> -after deletion. If no subvolume id is given, wait until all current deletion
>>> -requests are completed, but do not wait for subvolumes deleted meanwhile.
>>> -The status of subvolume ids is checked periodically.
>>> +Wait until given subvolume[s] are completely removed from the filesystem after
>>> +deletion. If no subvolume id is given, wait until all current deletion requests
>>> +are completed, but do not wait for subvolumes deleted in the meantime. ****The
>>> +status of subvolume ids is checked periodically.**** (EDIT: How is this
>>> relevant to sync?  Should it read "the status of all subvolume ids are
>>> periodically synced as a normal background operation"?)
>>
>> The background is, subvolume deletion is expensive for btrfs, so
>> subvolume deletion is split into 2 stages:
>> 1) Unlink the subvolume
>>    So no one can access the deleted subvolume any more.
>>
>> 2) Delete the subvolume tree blocks and its data in the background
>>    And for the tree blocks we skip the normal tree balance, to speed up
>>    the deletion.
>>
>> I think the original sentence means we won't wait for the 2nd stage.
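
Just to illustrate the two stages from the command line (a minimal sketch,
assuming a snapshot at /mnt/snap; adjust the paths as needed):

  # Stage 1: unlink the subvolume; it disappears from the namespace right
  # away, but its tree blocks and data are not freed yet.
  btrfs subvolume delete /mnt/snap

  # Stage 2 runs in the background; 'sync' just waits until the cleaner
  # has actually removed the deleted subvolume(s).
  btrfs subvolume sync /mnt
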
>
> When I started using btrfs with linux-3.16 I regularly ran into issues
> when I omitted a btrfs sub sync step when deleting, creating, and then
> deleting snapshots, so I started syncing subvolumes religiously after
> each operation.  If the btrfs sub sync step is still a recommended
> practice, I wonder if this is the place to say so.  Maybe it's no
> longer necessary?

This may need extra testing, so I'm not 100% confident yet.

>
> [...]
>>>  *-d|--data <profile>*::
>>> @@ -79,7 +79,7 @@ default value is 16KiB (16384) or the page size, whichever is bigger. Must be a
>>>  multiple of the sectorsize and a power of 2, but not larger than 64KiB (65536).
>>>  Leafsize always equals nodesize and the options are aliases.
>>>  +
>>> -Smaller node size increases fragmentation but lead to higher b-trees which in
>>> +Smaller node size increases fragmentation ****but lead to higher
>>> b-trees**** (EDIT: "but leads to taller/deeper/more/increased-usage-of
>>> b-trees"?) which in
>>
>> What's the difference between "higher" and "taller"?
>> They seem quite similar to me.
>
> I could be wrong, but I think one of
> "taller/deeper/more/increased-usage-of b-trees" is closer to what you
> want to say,

Yep.

> because "smaller node size...leads to higher b-trees"
> sounds like a smaller node size leads to the emergence of something
> like a higher-order of b-trees that operate or function differently
> than b-trees usually do in btrfs.

Makes sense.

Thanks,
Qu

>
> [I've deleted my pedantic explanation, because I think googling for
> "taller vs higher" will provide the resources you need]
>
>>> @@ -166,7 +166,7 @@ root partition created with RAID1/10/5/6 profiles. The mount action can happen
>>>  before all block devices are discovered. The waiting is usually done on the
>>>  initramfs/initrd systems.
>>>
>>> -As of kernel 4.9, RAID5/6 is still considered experimental and shouldn't be
>>> +As of kernel ****4.9**** (EDIT: 4.14 status?), RAID5/6 is still considered
>>> experimental and shouldn't be
>>
>> Well, this changed a lot in v4.14, so it definitely needs to be updated.
>>
>> At least Oracle is considering RAID5/6 stable.  Maybe we'd better wait
>> for several more releases to see if this holds.
>
> Wow!  If so, congratulations!
>
> Sincerely,
> Nicholas
>