Author: stefan2
Date: Sun Aug 4 12:29:14 2013
New Revision: 1510158
URL: http://svn.apache.org/r1510158
Log:
The structure of FSX is nowhere near its completion.
So, remove everything *not* currently applying to FSX from the
'structure' file carried over from FSFS, put a disclaimer at
the top and mark the mostly empty sections as TBD.
* subversion/libsvn_fs_x/structure
(): add disclaimer
(Design,
Filesystem formats,
Revision file format): put a "TBD" in here and remove most text
(Packing revision*): drop entirely
(Layout of the FS directory,
Node-revision IDs,
Transaction layout): update to reflect the current state of FSX
(Locks layout): s/FSFS/FSX/
Modified:
subversion/trunk/subversion/libsvn_fs_x/structure
Modified: subversion/trunk/subversion/libsvn_fs_x/structure
URL:
http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_fs_x/structure?rev=1510158&r1=1510157&r2=1510158&view=diff
==============================================================================
--- subversion/trunk/subversion/libsvn_fs_x/structure (original)
+++ subversion/trunk/subversion/libsvn_fs_x/structure Sun Aug 4 12:29:14 2013
@@ -1,21 +1,24 @@
-This file describes the design, layouts, and file formats of a
-libsvn_fs_fs repository.
+This file will describe the design, layouts, and file formats of a
+libsvn_fs_x repository.
+
+Since FSX is still in a very early phase of its development, all sections
+are either subject to major change or simply "TBD".
Design
------
+TBD.
+
+Similar to FSFS format 7 but using a radically different on-disk format.
+
In FSFS, each committed revision is represented as an immutable file
containing the new node-revisions, contents, and changed-path
information for the revision, plus a second, changeable file
containing the revision properties.
-In contrast to the BDB back end, the contents of recent revision of
-files are stored as deltas against earlier revisions, instead of the
-other way around. This is less efficient for common-case checkouts,
-but brings greater simplicity and robustness, as well as the
-flexibility to make commits work without write access to existing
-revisions. Skip-deltas and delta combination mitigate the checkout
-cost.
+To reduce the size of the on-disk representation, revision data gets
+packed, i.e. multiple revision files get combined into a single pack
+file of smaller total size. The same strategy is applied to revprops.
In-progress transactions are represented with a prototype rev file
containing only the new text representations of files (appended to as
@@ -53,15 +56,13 @@ repository) is:
locks/ Subdirectory containing locks
<partial-digest>/ Subdirectory named for first 3 letters of an MD5 digest
<digest> File containing locks/children for path with <digest>
- node-origins/ Lazy cache of origin noderevs for nodes
- <partial-nodeid> File containing noderev ID of origins of nodes
current File specifying current revision and next node/copy id
fs-type File identifying this filesystem as an FSFS filesystem
write-lock Empty file, locked to serialise writers
txn-current-lock Empty file, locked to serialise 'txn-current'
uuid File containing the UUID of the repository
format File containing the format number of this filesystem
- fsfs.conf Configuration file
+ fsx.conf Configuration file
min-unpacked-rev File containing the oldest revision not in a pack file
min-unpacked-revprop File containing the oldest revision of unpacked revprop
rep-cache.db SQLite database mapping rep checksums to locations
@@ -69,16 +70,9 @@ repository) is:
Files in the revprops directory are in the hash dump format used by
svn_hash_write.
-The format of the "current" file is:
-
- * Format 3 and above: a single line of the form
- "<youngest-revision>\n" giving the youngest revision for the
- repository.
-
- * Format 2 and below: a single line of the form "<youngest-revision>
- <next-node-id> <next-copy-id>\n" giving the youngest revision, the
- next unique node-ID, and the next unique copy-ID for the
- repository.
+The format of the "current" file is a single line of the form
+"<youngest-revision>\n" giving the youngest revision for the
+repository.
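As a rough illustration of the single-line format described above, a reader for the "current" file could look like the sketch below. This is not the libsvn_fs_x implementation: the function name parse_current is hypothetical, and the sketch assumes the revision number is written as plain ASCII decimal digits.

```python
def parse_current(contents: str) -> int:
    """Parse the text of a "current" file of the form "<youngest-revision>\n".

    Assumption (not stated by the structure file for this field): the
    revision number consists of ASCII decimal digits only.
    """
    if not contents.endswith("\n"):
        raise ValueError("'current' must end with a newline")
    line = contents[:-1]
    # Reject empty lines, extra lines, signs, whitespace and non-ASCII digits.
    if not line.isascii() or not line.isdigit():
        raise ValueError(f"malformed 'current' contents: {contents!r}")
    return int(line)
```

For example, parse_current("1510158\n") yields the integer 1510158, while contents carrying extra fields (as in the old FSFS format 2 layout) are rejected.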
The "write-lock" file is an empty file which is locked before the
final stage of a commit and unlocked after the new "current" file has
@@ -97,7 +91,7 @@ based on the same revision is begun. Th
performs on this file is "get and increment"; the "txn-current-lock"
file is locked during this operation.
-"fsfs.conf" is a configuration file in the standard Subversion/Python
+"fsx.conf" is a configuration file in the standard Subversion/Python
config format. It is automatically generated when you create a new
repository; read the generated file for details on what it controls.
@@ -113,207 +107,13 @@ revisions written thereafter.
Filesystem formats
------------------
+TBD.
+
The "format" file defines what features are permitted within the
filesystem, and indicates changes that are not backward-compatible.
It serves the same purpose as the repository file of the same name.
-The filesystem format file was introduced in Subversion 1.2, and so
-will not be present if the repository was created with an older
-version of Subversion. An absent format file should be interpreted as
-indicating a format 1 filesystem.
-
-The format file is a single line of the form "<format number>\n",
-followed by any number of lines specifying 'format options' -
-additional information about the filesystem's format. Each format
-option line is of the form "<option>\n" or "<option> <parameters>\n".
-
-Clients should raise an error if they encounter an option not
-permitted by the format number in use.
-
-The formats are:
-
- Format 1, understood by Subversion 1.1+
- Format 2, understood by Subversion 1.4+
- Format 3, understood by Subversion 1.5+
- Format 4, understood by Subversion 1.6+
- Format 5, understood by Subversion 1.7-dev, never released
- Format 6, understood by Subversion 1.8
-
-The differences between the formats are:
-
-Delta representation in revision files
- Format 1: svndiff0 only
- Formats 2+: svndiff0 or svndiff1
-
-Format options
- Formats 1-2: none permitted
- Format 3+: "layout" option
-
-Transaction name reuse
- Formats 1-2: transaction names may be reused
- Format 3+: transaction names generated using txn-current file
-
-Location of proto-rev file and its lock
- Formats 1-2: transactions/<txnid>/rev and
- transactions/<txnid>/rev-lock.
- Format 3+: txn-protorevs/<txnid>.rev and
- txn-protorevs/<txnid>.rev-lock.
-
-Node-ID and copy-ID generation
- Formats 1-2: Node-IDs and copy-IDs are guaranteed to form a
- monotonically increasing base36 sequence using the "current"
- file.
- Format 3+: Node-IDs and copy-IDs use the new revision number to
- ensure uniqueness and the "current" file just contains the
- youngest revision.
-
-Mergeinfo metadata:
- Format 1-2: minfo-here and minfo-count node-revision fields are not
- stored. svn_fs_get_mergeinfo returns an error.
- Format 3+: minfo-here and minfo-count node-revision fields are
- maintained. svn_fs_get_mergeinfo works.
-
-Revision changed paths list:
- Format 1-3: Does not contain the node's kind.
- Format 4+: Contains the node's kind.
-
-Shard packing:
- Format 4: Applied to revision data only.
- Format 5: Revprops would be packed independently of revision data.
- Format 6+: Applied equally to revision data and revprop data
- (i.e. same min packed revision)
-
-# Incomplete list. See SVN_FS_FS__MIN_*_FORMAT
-
-
-Filesystem format options
--------------------------
-
-Currently, the only recognised format option is "layout", which
-specifies the paths that will be used to store the revision files and
-revision property files.
-
-The "layout" option is followed by the name of the filesystem layout
-and any required parameters. The default layout, if no "layout"
-keyword is specified, is the 'linear' layout.
-
-The known layouts, and the parameters they require, are as follows:
-
-"linear"
- Revision files and rev-prop files are named after the revision they
- represent, and are placed directly in the revs/ and revprops/
- directories. r1234 will be represented by the revision file
- revs/1234 and the rev-prop file revprops/1234.
-
-"sharded <max-files-per-directory>"
- Revision files and rev-prop files are named after the revision they
- represent, and are placed in a subdirectory of the revs/ and
- revprops/ directories named according to the 'shard' they belong to.
-
- Shards are numbered from zero and contain between one and the
- maximum number of files per directory specified in the layout's
- parameters.
-
- For the "sharded 1000" layout, r1234 will be represented by the
- revision file revs/1/1234 and rev-prop file revprops/1/1234. The
- revs/0/ directory will contain revisions 0-999, revs/1/ will contain
- 1000-1999, and so on.
-
-Packing revisions
------------------
-
-A filesystem can optionally be "packed" to conserve space on disk. The
-packing process concatenates all the revision files in each full shard to
-create pack files. A manifest file is also created for each shard which
-records the indexes of the corresponding revision files in the pack file.
-In addition, the original shard is removed, and reads are redirected to the
-pack file.
-
-The manifest file consists of a list of offsets, one for each revision in the
-pack file. The offsets are stored as ASCII decimal, and separated by a newline
-character.
-
-Packing revision properties (format 5: SQLite)
----------------------------
-
-This was supported by 1.7-dev builds but never included in a blessed release.
-
-See r1143829 of this file:
-http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_fs_fs/structure?view=markup&pathrev=1143829
-
-
-Packing revision properties (format 6+)
----------------------------
-
-Similarly to the revision data, packing will concatenate multiple
-revprops into a single file. Since they are mutable data, we put an
-upper limit to the size of these files: We will concatenate the data
-up to the limit and then use a new file for the following revisions.
-
-The limit can be set and changed at will in the configuration file.
-It is 64kB by default. Because a pack file must contain at least one
-complete property list, files containing just one revision may exceed
-that limit.
-
-Furthermore, pack files can be compressed which saves about 75% of
-disk space. A configuration file flag enables the compression; it is
-off by default and may be switched on and off at will. The pack size
-limit is always applied to the uncompressed data. For this reason,
-the default is 256kB while compression has been enabled.
-
-Files are named after their start revision as "<rev>.<counter>" where
-counter will be increased whenever we rewrite a pack file due to a
-revprop change. The manifest file contains the list of pack file
-names, one line for each revision.
-
-Many tools track repository global data in revision properties at
-revision 0. To minimize I/O overhead for those applications, we
-will never pack that revision, i.e. its data is always being kept
-in revprops/0/0.
-
-Pack file format
-
- Top level: <packed container>
-
- We always apply data compression to the pack file - using the
- SVN_DELTA_COMPRESSION_LEVEL_NONE level if compression is disabled.
- (Note that compression at SVN_DELTA_COMPRESSION_LEVEL_NONE is not
- a no-op stream transformation although most of the data will remain
- human readable.)
-
- container := header '\n' (revprops)+
- header := start_rev '\n' rev_count '\n' (size '\n')+
-
- All numbers in the header are given as ASCII decimals. rev_count
- is the number of revisions packed into this container. There must
- be exactly as many "size" and serialized "revprops". The "size"
- values in the list are the length in bytes of the serialized
- revprops of the respective revision.
-
-Writing to packed revprops
-
- The old pack file is being read and the new revprops serialized.
- If they fit into the same pack file, a temp file with the new
- content gets written and moved into place just like a non-packed
- revprop file would. No name change or manifest update required.
-
- If they don't fit into the same pack file, i.e. exceed the pack
- size limit, the pack will be split into 2 or 3 new packs just
- before and / or after the modified revision.
-
- In the current implementation, they will never be merged again.
- To minimize fragmentation, the initial packing process will only
- use about 90% of the limit, i.e. leave some room for growth.
-
- When a pack file gets split, its counter is being increased
- creating a new file and leaving the old content in place and
- available for concurrent readers. Only after the new manifest
- file got moved into place, will the old pack files be deleted.
-
- Write access to revprops is being serialized by the global
- filesystem write lock. We only need to build a few retries into
- the reader code to gracefully handle manifest changes and pack
- file deletions.
+So far, there is only format 1.
Node-revision IDs
@@ -349,32 +149,17 @@ Within a revision:
to have repository-wide unique node-ID and copy-ID fields, and to have
"r<rev>/<offset>" txn-id fields.
- In Format 3 and above, this uniqueness is done by changing a temporary
+ This uniqueness is achieved by changing a temporary
id of "_<base36>" to "<base36>-<rev>". Note that this means that the
originating revision of a line of history or a copy can be determined
by looking at the node ID.
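The ID rewrite described above can be sketched as follows. The helper names finalize_id and origin_rev are hypothetical and only illustrate the string transformation; they are not taken from the Subversion sources, and the sketch does not validate that the counter part is well-formed base36.

```python
def finalize_id(temp_id: str, rev: int) -> str:
    """Turn a temporary "_<base36>" ID into its committed "<base36>-<rev>" form."""
    if len(temp_id) < 2 or not temp_id.startswith("_"):
        raise ValueError(f"not a temporary ID: {temp_id!r}")
    # Drop the leading underscore and append the committing revision.
    return f"{temp_id[1:]}-{rev}"


def origin_rev(final_id: str) -> int:
    """Recover the originating revision from a "<base36>-<rev>" ID."""
    base36, sep, rev = final_id.rpartition("-")
    if not sep or not base36:
        raise ValueError(f"not a committed ID: {final_id!r}")
    return int(rev)
```

With these helpers, a temporary node-ID "_1" committed in revision 42 becomes "1-42", and origin_rev("1-42") gives back 42 without any further lookup.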
- In Format 2 and below, the "current" file contains global base36
- node-ID and copy-ID counters; during the commit, the counter value is
- added to the transaction-specific base36 ID, and the value in
- "current" is adjusted.
-
- (It is legal for Format 3 repositories to contain Format 2-style IDs;
- this just prevents I/O-less node-origin-rev lookup for those nodes.)
-
The temporary assignment of node-ID and copy-ID fields has
implications for svn_fs_compare_ids and svn_fs_check_related. The ID
_1.0.t1 is not related to the ID _1.0.t2 even though they have the
same node-ID, because temporary node-IDs are restricted in scope to
the transactions they belong to.
-There is a lazily created cache mapping from node-IDs to the full
-node-revision ID where they are created. This is in the node-origins
-directory; the file name is the node-ID without its last character (or
-"0" for single-character node IDs) and the contents is a serialized
-hash mapping from node-ID to node-revision ID. This cache is only
-used for node-IDs of the pre-Format 3 style.
-
Copy-IDs and copy roots
-----------------------
@@ -424,96 +209,16 @@ rev 0.
Revision file format
--------------------
+TBD.
+
A revision file contains a concatenation of various kinds of data:
* Text and property representations
* Node-revisions
* The changed-path data
- * Two offsets at the very end
-A representation begins with a line containing either "PLAIN\n" or
-"DELTA\n" or "DELTA <rev> <offset> <length>\n", where <rev>, <offset>,
-and <length> give the location of the delta base of the representation
-and the amount of data it contains (not counting the header or
-trailer). If no base location is given for a delta, the base is the
-empty stream. After the initial line comes raw svndiff data, followed
-by a cosmetic trailer "ENDREP\n".
-
-If the representation is for the text contents of a directory node,
-the expanded contents are in hash dump format mapping entry names to
-"<type> <id>" pairs, where <type> is "file" or "dir" and <id> gives
-the ID of the child node-rev.
-
-If a representation is for a property list, the expanded contents are
-in the form of a dumped hash map mapping property names to property
-values.
-
-The marshalling syntax for node-revs is a series of fields terminated
-by a blank line. Fields have the syntax "<name>: <value>\n", where
-<name> is a symbolic field name (each symbolic name is used only once
-in a given node-rev) and <value> is the value data. Unrecognized
-fields are ignored, for extensibility. The following fields are
-defined:
-
- id The ID of the node-rev
- type "file" or "dir"
- pred The ID of the predecessor node-rev
- count Count of node-revs since the base of the node
- text "<rev> <offset> <length> <size> <digest>" for text rep
- props "<rev> <offset> <length> <size> <digest>" for props rep
- <rev> and <offset> give location of rep
- <length> gives length of rep, sans header and trailer
- <size> gives size of expanded rep; may be 0 if equal
- to the length
- <digest> gives hex MD5 digest of expanded rep
- ### in formats >=4, also present:
- <sha1-digest> gives hex SHA1 digest of expanded rep
- <uniquifier> see representation_t->uniquifier in fs.h
- cpath FS pathname node was created at
- copyfrom "<rev> <path>" of copyfrom data
- copyroot "<rev> <created-path>" of the root of this copy
- minfo-cnt The number of nodes under (and including) this node
- which have svn:mergeinfo.
- minfo-here Exists if this node itself has svn:mergeinfo.
-
-The predecessor of a node-rev crosses both soft and true copies;
-together with the count field, it allows efficient determination of
-the base for skip-deltas. The first node-rev of a node contains no
-"pred" field. A node-revision with no properties may omit the "props"
-field. A node-revision with no contents (a zero-length file or an
-empty directory) may omit the "text" field. In a node-revision
-resulting from a true copy operation, the "copyfrom" field gives the
-copyfrom data. The "copyroot" field identifies the root node-revision
-of the copy; it may be omitted if the node-rev is its own copy root
-(as is the case for node-revs with copy history, and for the root node
-of revision 0). Copy roots are identified by revision and
-created-path, not by node-rev ID, because a copy root may be a
-node-rev which exists later on within the same revision file, meaning
-its offset is not yet known.
-
-The changed-path data is represented as a series of changed-path
-items, each consisting of two lines. The first line has the format
-"<id> <action> <text-mod> <prop-mod> <path>\n", where <id> is the
-node-rev ID of the new node-rev, <action> is "add", "delete",
-"replace", or "modify", <text-mod> and <prop-mod> are "true" or
-"false" indicating whether the text and/or properties changed, and
-<path> is the changed pathname. For deletes, <id> is the node-rev ID
-of the deleted node-rev, and <text-mod> and <prop-mod> are always
-"false". The second line has the format "<rev> <path>\n" containing
-the node-rev's copyfrom information if it has any; if it does not, the
-second line is blank.
-
-Starting with FS format 4, <action> may contain the kind ("file" or
-"dir") of the node, after a hyphen; for example, an added directory
-may be represented as "add-dir".
-
-At the very end of a rev file is a pair of lines containing
-"\n<root-offset> <cp-offset>\n", where <root-offset> is the offset of
-the root directory node revision and <cp-offset> is the offset of the
-changed-path data.
-
-All numbers in the rev file format are unsigned and are represented as
-ASCII decimal.
+That data is aggregated in compressed containers with a binary on-disk
+representation.
Transaction layout
------------------
@@ -528,12 +233,8 @@ A transaction directory has the followin
node.<nid>.<cid>.children Directory contents for node-rev
<sha1> Text representation of that sha1
-In FS formats 1 and 2, it also contains:
-
- rev Prototype rev file with new text reps
- rev-lock Lockfile for writing to the above
-
-In newer formats, these files are in the txn-protorevs/ directory.
+ txn-protorevs/rev Prototype rev file with new text reps
+ txn-protorevs/rev-lock Lockfile for writing to the above
The prototype rev file is used to store the text representations as
they are received from the client. To ensure that only one client is
@@ -544,8 +245,8 @@ The two kinds of props files are all in
file will always be present. The "node.<nid>.<cid>.props" file will
only be present if the node-rev properties have been changed.
-The <sha1> files have been introduced in FS format 6. Their content
-is that of text rep references: "<rev> <offset> <length> <size> <digest>"
+The <sha1> files' content is that of text rep references:
+"<rev> <offset> <length> <size> <digest>"
They will be written for text reps in the current transaction and be
used to eliminate duplicate reps within that transaction.
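A minimal sketch of reading such a text rep reference is given below. The RepRef type and the parse_rep_ref function are hypothetical names introduced for illustration; the sketch assumes the five fields are separated by whitespace, that the first four are decimal integers, and that the digest can be kept as an opaque string.

```python
from typing import NamedTuple


class RepRef(NamedTuple):
    """One "<rev> <offset> <length> <size> <digest>" reference."""
    rev: int
    offset: int
    length: int
    size: int
    digest: str


def parse_rep_ref(line: str) -> RepRef:
    """Split a text rep reference into its five fields."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    # The numeric fields are assumed to be decimal; int() raises
    # ValueError on anything else.
    rev, offset, length, size = (int(f) for f in fields[:4])
    return RepRef(rev, offset, length, size, fields[4])
```

A caller could then compare the parsed reference of a <sha1> file against a newly written rep to decide whether the new rep is a duplicate within the transaction.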
@@ -566,7 +267,7 @@ may both be "reset" (in which case <text
always "false") to indicate that all changes to a path should be
considered undone. Reset entries are only used during the final merge
phase of a transaction. Actions in the "changes" file always contain
-a node kind, even if the FS format is older than format 4.
+a node kind.
The node-rev files have the same format as node-revs in a revision
file, except that the "text" and "props" fields are augmented as
@@ -591,10 +292,11 @@ follows:
* The "copyroot" field may have the value "-1 <created-path>" if the
copy root of the node-rev is part of the transaction in process.
+
Locks layout
------------
-Locks in FSFS are stored in serialized hash format in files whose
+Locks in FSX are stored in serialized hash format in files whose
names are MD5 digests of the FS path which the lock is associated
with. For the purposes of keeping directory inode usage down, these
digest files live in subdirectories of the main lock directory whose