Author: esr
Date: Wed Jan 18 07:44:48 2012
New Revision: 1232771

URL: http://svn.apache.org/viewvc?rev=1232771&view=rev
Log:
New, much more comprehensive dumpfile format notes.

Modified:
    subversion/trunk/notes/dump-load-format.txt

Modified: subversion/trunk/notes/dump-load-format.txt
URL: 
http://svn.apache.org/viewvc/subversion/trunk/notes/dump-load-format.txt?rev=1232771&r1=1232770&r2=1232771&view=diff
==============================================================================
--- subversion/trunk/notes/dump-load-format.txt (original)
+++ subversion/trunk/notes/dump-load-format.txt Wed Jan 18 07:44:48 2012
@@ -1,139 +1,375 @@
-This file describes the format produced by 'svnadmin dump' and
-consumed by 'svnadmin load'.  
+= How to interpret Subversion dumpfiles =
 
-The format has undergone revisions over time.  They are presented in
-reverse chronological order here.  You may wish to start with the
-VERSION 1 description in order to get a baseline understanding first.
+Version 1.0, 2012-01-18
 
-===== SVN DUMPFILE VERSION 3 FORMAT =====
+== Introduction ==
 
-(generated by SVN versions 1.1.0-present, if requested by the user)
+The Subversion dumpfile format is a serialized description of the
+actions required to (re)build a version history from scratch.
 
-This format is equivalent to the VERSION 2 format except for the
-following:
+The goal of this document is that it be sufficient for people writing
+dumpfile interpreters to emulate the actions the dumpfile describes on
+a versioned filesystem-like store, such as another version-control
+system.  It derives from and incorporates some incomplete notes from 
+before r39883.
 
-1.) The format starts with the new version number of the dump format
-    ("SVN-fs-dump-format-version: 3\n").
+1. In interpreting a Node record which has both a copyfrom source and
+a property section, it is possible that the copy source node itself
+has a property section.  How are they to be combined?
 
-2.) There are several new optional headers for node changes:
+Also note that the section on the semantics of kinds of operations 
+documents a minor bug at r39883 in the behavior of "add", which 
+should be fixed.
 
-[Text-delta: true|false]
-[Prop-delta: true|false]
-[Text-delta-base-md5: blob]
-[Text-delta-base-sha1: blob]
-[Text-copy-source-sha1: blob]
-[Text-content-sha1: blob]
+== Syntax ==
 
-    The default value for the boolean headers is "false".  If the value is
-    set to "true", then the text and property contents will be treated
-    as deltas against the previous contents of the node (as determined
-    by copy history for adds with history, or by the value in the
-    previous revision for changes--just as with commits).
+=== Encoding and delimiters ===
 
-Property deltas have the same format as regular property lists except
-that (1) properties with the same value as in the previous contents of
-the node are not printed, and (2) deleted properties will be written
-out as
+Subversion dumpfiles are plain byte streams. The structural parts are
+ASCII.  Text sections and property key/value pairs may be interpreted
+as binary data in any encoding by client tools.
 
-D <name length>
-<name>
+A dumpfile consists of four kinds of records.  A record is a group of
+RFC822-style header lines (each consisting of a key, followed by a
+colon, followed by text data to end of line), followed by an empty
+spacer line, followed optionally by a body section.  If the body
+section is present, another empty spacer line separates it from the
+following record.
 
-just as a regular property is printed, but with the "K " changed to a
-"D " and with no value part.
+For forward compatibility, unrecognized headers are ignored.
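The header layout described above can be read with a few lines of Python. This is a minimal sketch, not the loader's actual code: the function name `read_headers` is hypothetical, and it assumes header values decode as UTF-8 and that a single space follows the colon.

```python
import io


def read_headers(stream):
    """Read 'Key: value' header lines up to the empty spacer line.

    `stream` is a binary file-like object.  Returns a dict of str -> str,
    or None at end of input.  Blank lines before the first header are
    skipped, since spacer lines separate records.
    """
    headers = {}
    while True:
        line = stream.readline()
        if not line:                       # end of input
            return headers or None
        line = line.rstrip(b"\n")
        if not line:
            if headers:                    # spacer line ends the header group
                return headers
            continue                       # blank line between records
        key, _, value = line.partition(b":")
        if value.startswith(b" "):         # assumption: one space after colon
            value = value[1:]
        # Assumption: values are UTF-8; real paths may need other handling.
        headers[key.decode("ascii")] = value.decode("utf-8")
```

A caller that wants the forward-compatible behavior simply looks up the keys it knows and leaves the others in the dict untouched.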
 
-Text deltas are written out as a series of svndiff0 windows.  If
-Text-delta-base-md5 is provided, it is the checksum of the base to
-which the text delta is applied; note that older versions (pre-1.5) of
-'svnadmin load' may ignore the checksum.
+=== Record types ===
 
-Text-delta-base-sha1, Text-copy-source-sha1, and Text-content-sha1 are not
-currently used by the loader.  They are written by 1.6-and-later versions of
-Subversion so that future loaders can optionally choose which checksum to
-use for checking for corruption.
+Dumpfiles include four record types.  Two, the version stamp and UUID
+record, consist of single header lines. The bulk of a dumpfile
+consists of Revision and Node records.
+
+A version stamp record is always the first line of the file and
+looks like this:
+
+-------------------------------------------------------------------
+SVN-fs-dump-format-version: <N>\n
+-------------------------------------------------------------------
+
+where <N> is replaced by the dump format version. Except where 
+specified, the descriptions in this document apply to all
+versions of the format.
+
+Versions 2 and later may have a UUID record following the version
+stamp. It is of the form 
 
-===== SVN DUMPFILE VERSION 2 FORMAT =====
+-------------------------------------------------------------------
+UUID: <hex-string>
+-------------------------------------------------------------------
 
-(generated by SVN versions 0.18.0-present, by default)
+where the <hex-string> is the UUID of the originating repository.
+An example UUID is "7bf7a5ef-cabf-0310-b7d4-93df341afa7e".
 
-This format is equivalent to the VERSION 1 format in every respect,
-except for the following:
+A Revision record has three headers and is always followed by a
+property section.  Expect the following form and sequence:
+
+-------------------------------------------------------------------
+Revision-number: <N>
+Prop-content-length: <P>
+Content-length: <L>
+!
+-------------------------------------------------------------------
+
+with the Revision-number header always first and the '!' indicating
+a mandatory empty spacer line.  <P> gives the length in bytes of the
+following property section. <L> gives the body length of the entire
+Revision record.  These two numbers will be *identical* for a Revision
+record; the Content-length header is added for the benefit of software
+that can parse RFC-822 messages.
+
+A revision record is followed by one or more Node records (see below).
+
+=== Property sections ===
+
+A Revision record *must* have a property section, and a Node record *may*
+have a property section. Every record with a property section has 
+a Prop-content-length header.
+
+A property section consists of pairs of key and value records and
+is ended by a fixed trailer.  Here is an example attached to a
+Revision record:
+
+-------------------------------------------------------------------
+Revision-number: 1422
+Prop-content-length: 80
+Content-length: 80
+
+K 6
+author
+V 7
+sussman
+K 3
+log
+V 33
+Added two files, changed a third.
+PROPS-END
+-------------------------------------------------------------------
+
+The fixed trailer is "PROPS-END\n" and its length is included in the
+Prop-content-length. Before it, each K and V record consists of a
+header line giving the length of the key or value content in bytes.  
+The content follows.  The content is itself always followed by \n.
+
+In version 3 of the format, a third type 'D' of property record is
+introduced to describe property deletion. This feature will be
+described later, in the specification of delta dumps.
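As an illustration of the K/V layout, here is a rough Python sketch of a property-section parser. The name `parse_props` is made up for this example; it takes the section as bytes (exactly Prop-content-length of them) and also collects version 3 'D' records, on the assumption that they carry a name and no value.

```python
def parse_props(section):
    """Parse a property section given as bytes.

    Returns (props, deleted): a dict of key -> value (both bytes) and a
    list of keys named by version 3 'D' records.
    """
    props, deleted = {}, []
    trailer = b"PROPS-END\n"

    def read_record(pos):
        # The header line is e.g. b"K 6"; the content follows, then b"\n".
        eol = section.index(b"\n", pos)
        tag, length = section[pos:eol].split(b" ")
        start = eol + 1
        end = start + int(length)
        if section[end:end + 1] != b"\n":
            raise ValueError("property content not followed by newline")
        return tag, section[start:end], end + 1

    pos = 0
    while True:
        if section.startswith(trailer, pos):
            if pos + len(trailer) != len(section):
                raise ValueError("data after PROPS-END")
            return props, deleted
        tag, key, pos = read_record(pos)
        if tag == b"D":
            deleted.append(key)
        elif tag == b"K":
            vtag, value, pos = read_record(pos)
            if vtag != b"V":
                raise ValueError("K record not followed by V record")
            props[key] = value
        else:
            raise ValueError("unexpected record tag %r" % tag)
```

Lengths are used rather than line splitting because a value may itself contain newlines.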
+
+=== Node records ===
+
+Each Revision record is followed by one or more Node records.
+Node records have the following sequence of header lines:
+
+-------------------------------------------------------------------
+Node-path: <path/to/node/in/filesystem>
+[Node-kind: {file | dir}]
+Node-action: {change | add | delete | replace}
+[Node-copyfrom-rev: <rev>]
+[Node-copyfrom-path: <path> ]
+[Text-copy-source-md5: <blob>]
+[Text-content-md5: <blob>]
+[Text-content-length: <T>]
+[Prop-content-length: <P>]
+[Content-length: Y]
+!
+-------------------------------------------------------------------
+
+Bracketing in [] indicates optional lines; { | } is an alternation group.
+
+Dump decoders should be prepared for the optional lines after
+Node-action to be in any order, except that Content-length is 
+always last if it is present.
+
+A Node record describes an action on a path relative to the repository
+root, and always begins with the Node-path specification.
+
+The Node-kind line indicates whether the path is a file or directory.
+The header value will be one of the strings "file" or "dir". 
+This header may be (and usually is) absent if the node action is a delete.  
+
+The Node-action line is always present and specifies the type of
+operation for this node.  The header value is one of the strings
+"change", "add", "delete", or "replace".  These operations will be
+described in detail later in this document.
+
+Either both the Node-copyfrom-rev and Node-copyfrom-path lines will be
+present, or neither will be.  They pair to describe a copy source for
+the node. Copy-source semantics will be described in detail later in
+this document.
+
+The Text-content-md5 and Text-copy-source-md5 lines are hash integrity
+checks and will be present only if Text-content-length and the copyfrom
+pair (respectively) are also present. A decoder may use them to verify
+that the source content they refer to has not been corrupted.
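A decoder-side integrity check might look like the following sketch. It assumes the header value is the hexadecimal MD5 digest of the text section; the helper name `text_matches_md5` is hypothetical.

```python
import hashlib


def text_matches_md5(text, expected_hex):
    """Return True if the MD5 of `text` (bytes) equals the header value."""
    # Assumption: the header carries a hex digest; compare case-insensitively.
    return hashlib.md5(text).hexdigest() == expected_hex.strip().lower()
```

The same comparison would apply to Text-copy-source-md5, using the content found at the copy source instead of the record's own text.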
+
+Text-content-length will be present only when there is a text section.
+Zero is a legal value for this length, indicating an empty file.
+
+Prop-content-length will be present only when there is a properties section.
+
+Content-length will be present if there is either a text or a
+properties section.  This is not always the case.  In particular, 
+a delete operation cannot have either.  Some other operations that use
+copyfrom sources may also not have either.
+
+Again, the '!' stands in for a mandatory empty line following the
+RFC822-style headers. A body may follow.
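One way to divide a Node record body, sketched under the assumption that the property section precedes the text section and that the two length headers account for the whole body. The function name `split_body` and the dict-of-strings header representation are illustrative only.

```python
def split_body(headers, body):
    """Split a Node record body into (props_bytes, text_bytes).

    Either element is None when the corresponding length header is absent.
    """
    prop_len = headers.get("Prop-content-length")
    text_len = headers.get("Text-content-length")
    pos = 0
    props = text = None
    if prop_len is not None:              # properties come first
        props = body[pos:pos + int(prop_len)]
        pos += int(prop_len)
    if text_len is not None:              # zero is legal: an empty file
        text = body[pos:pos + int(text_len)]
        pos += int(text_len)
    if pos != len(body):
        raise ValueError("length headers do not add up to the body size")
    return props, text
```

Returning None rather than empty bytes keeps an absent section distinguishable from an empty one.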
+
+== Semantics ==
+
+=== The kinds of things ===
+
+There are four kinds of things described by a dumpfile: paths,
+properties, content, and flows.  The distinctions among content,
+paths, and flows matter for understanding some operations.
+
+A path is a filesystem location (a file or directory).  There are two
+kinds of paths in a dumpfile; node paths and copy sources.
+
+Properties are key-value pairs associated with revisions or paths.
+Subversion interprets and reserves some properties, those beginning
+with "svn:". Others are not interpreted by Subversion; they may
+be set and read for the convenience of other applications, such
+as repository browsers or translators.
+
+A flow is a sequence of actions on a file or directory path that is
+considered to be a single history for change-tracking purposes.
+Creating a flow tells Subversion that you want to track the history of
+the path or paths it contains. Destroying a flow breaks the chain of
+history; changes will not be tracked across the break, even if another
+flow is created at the same path.  A copy operation creates a new
+flow connected to the flow from which it was copied.
+
+Content is what file paths point at (one timewise slice of a flow). It
+is the payload of program source code, documents, images, and so forth
+that a version control system actually manages.
+
+A Node record describes a change in properties, the addition or deletion
+of a flow, or a change in content.  It must do at least one of these things,
+otherwise it would be a no-op and omitted.
+
+When no copyfrom is present, and the action isn't an add or copy, then
+the kind of the thing identified by (PATH, REVISION) must agree with
+the kind of the thing identified by (PATH, -1+REVISION).
+
+Terminological note: in Subversion-speak, the term "node" is
+historically ambiguous.  Sometimes it refers to what this document
+calls a "flow", and sometimes it refers to the internal per-revision
+structure that a Node record represents (that is, just one action in a
+flow).  For clarity, most of this document avoids the term "node" in
+favor of the more specific "flow" and "Node record", but knowing 
+about this issue will help if you read the Ancient History section.
+
+=== The kinds of operations ===
+
+.File operations
+|======================================================================
+|                           |   add    | delete | replace  |  change  |
+|Can have text section?     | optional |   no   | optional | optional |
+|Can have property section? | optional |   no   | optional | optional |
+|Can have copy source?      | optional |   no   | optional |    no    |
+|Fails on existent path     |   yes*   |   no   |    no    |    no    |   
+|Fails on non-existent path |    no    |  yes   |   yes    |   yes    |   
+|======================================================================
+
+* As of December 2011 there is a minor bug: Adding a file with history
+twice _in two different revisions_ succeeds silently.
+
+.Directory operations
+|======================================================================
+|                           |   add    | delete | replace  |  change  |
+|Can have text section?     |    no    |   no   |    no    |    no    |
+|Can have property section? | optional |   no   | optional | required |
+|Can have copy source?      | optional |   no   | optional |    no    |
+|Fails on existent path     |   yes    |   no   |    no    |    no    |   
+|Fails on non-existent path |    no    |  yes   |   yes    |   yes    |   
+|======================================================================
+
+A Node record represents an operation that does one of four things: add,
+delete, change, or replace.
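The two "Fails on ..." rows of the tables can be turned into a precondition check. The sketch below is one possible encoding for an interpreter; `check_action` is a hypothetical helper, and it deliberately ignores the add-with-history bug noted under the file table.

```python
# Rows "Fails on existent path" and "Fails on non-existent path".
FAILS_IF_EXISTS = {"add"}
FAILS_IF_MISSING = {"delete", "replace", "change"}


def check_action(action, path_exists):
    """Raise ValueError if `action` cannot apply given path existence."""
    if action not in FAILS_IF_EXISTS | FAILS_IF_MISSING:
        raise ValueError("unknown Node-action: %r" % action)
    if path_exists and action in FAILS_IF_EXISTS:
        raise ValueError("cannot %s: path already exists" % action)
    if not path_exists and action in FAILS_IF_MISSING:
        raise ValueError("cannot %s: path does not exist" % action)
```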
+
+Node records can carry content in one (or both!) of two ways: from a text
+section or from a copy source (that is, a copy-path and copy-revision
+pair).
+
+Giving a copy source appends the node to the flow of which that source
+is part; when you 'add' or 'replace' with a copy source, the content
+at the path becomes a copy of the source (but see below for a
+qualification about directories).
+
+Giving a text section also changes the content of the flow. In the
+(unusual) case that a node has both a copy source and a text section,
+the correct semantics is to attach the path to the source flow and
+then change the content.
+
+An add operation creates a new flow for a file or directory. See the
+table above for possible operand combinations.
+
+A delete operation deletes a flow and its content. If the path is a
+file, the file is deleted.  If the path is a directory, the directory
+and all its children are deleted. A subsequent add at the same path
+will create a new and different flow with its own history.
+
+A change operation changes content or properties on a file or directory path. See the
+table above for possible operand combinations.
+
+A replace operation behaves exactly like a delete followed by an add
+(destroying an old flow, producing a new one) when it has no copy
+source. When a replace has a copy source, it produces a new flow
+with history extending back through the copy source. A Node record
+representing a replace operation may have a property section.
+
+The main reason "replace" exists is because it helps sequential
+processors of the dump stream avoid possibly notifying about multiple
+actions on the same path.
+
+It is even possible to have a replace with a copyfrom source *and*
+text, such as would result from this on the client side:
+ 
+-------------------------------------------------------------------
+$ svn rm dir/file.txt
+$ svn cp otherdir/otherfile.txt dir/file.txt
+$ echo "Replacement text" > dir/file.txt
+$ svn ci -m "Replace dir/file.txt with a copy of otherdir/otherfile.txt and replace its text, too."
+-------------------------------------------------------------------
+
+Subversion filesystems do not allow the root directory ("/") to be
+deleted or replaced.
+
+=== Some details about copyfroms ===
+
+The source and target of a copyfrom are always of like kind; that is,
+Subversion dump will never generate a node with a source type of file
+and a target type of directory or vice-versa.
+
+Interpreting copyfrom_path for file copies is straightforward; the
+target pathname gets the contents of the source pathname.
+
+Directory copies (the primitive beneath branching and tagging) are
+tricky.  For each source path under the source directory, a new path
+is generated by removing the head segment of the pathname that is
+the source directory.  That new path under the target directory gets
+the content of the source path.
+
+After this operation:
+
+-------------------------------------------------------------------
+Node-path: x/y/z
+Node-kind: dir
+Node-action: add
+Node-copyfrom-rev: 10
+Node-copyfrom-path: a/b/c
+-------------------------------------------------------------------
 
-1.) The format starts with the new version number of the dump format
-    ("SVN-fs-dump-format-version: 2\n").
-
-2.) In addition to "Revision Records", another sort of record is supported:
-    the "UUID" record, which should be of the form:
-
-UUID: 7bf7a5ef-cabf-0310-b7d4-93df341afa7e
-
-    This should be used to indicate the UUID of the originating repository.
-
-===== SVN DUMPFILE VERSION 1 FORMAT =====
-
-(generated by SVN versions prior to 0.18.0)
-
-The binary format starts with the version number of the dump format
-("SVN-fs-dump-format-version: 1\n"), followed by a series of revision
-records.  Each revision record starts with information about the
-revision, followed by a variable number of node changes for that
-revision.  Fields in [braces] are optional, and unknown headers are
-always ignored, for backwards compatibility.
-
-Revision-number: N
-Prop-content-length: P
-Content-length: L
-
-   ...P bytes of property data.  Properties are stored in the same
-   human-readable hashdump format used by working copy property files,
-   except that they end with "PROPS-END\n" for better readability.
-
-Node-path: absolute/path/to/node/in/filesystem
-Node-kind: file | dir  (1)
-Node-action: change | add | delete | replace
-[Node-copyfrom-rev: X]
-[Node-copyfrom-path: path ]
-[Text-copy-source-md5: blob] (2)
-[Text-content-md5: blob]
-[Text-content-length: T]
-[Prop-content-length: P]
-Content-length: Y (3)
-
-   ... Y bytes of content data, divided into P bytes of "property"
-   data and T bytes of "text" data.  The properties come first; their
-   total length (including formatting) is Prop-content-length, and is
-   included in Node-content-length.  The "PROPS-END\n" line always
-   terminates the property section if there are props.  The remainder
-   of the Y bytes (expected to be equivalent to Text-content-length)
-   represent the contents of the node.
-
-
-Notes:
-
-   (1) if the node represents a deletion, this field is optional.
-   
-   (2) this is a checksum of the source of the copy.  a loader process
-       can use this checksum to determine that the copyfrom path/rev
-       already present in a filesystem is really the *correct* one to
-       use.
-   
-   (3) the Content-length header is technically unnecessary, since the
-       information it holds (and more) can be found in the
-       Prop-content-length and Text-content-length fields.  Though
-       Subversion itself does not make use of the header when reading
-       a dumpfile, we include it for compatibility with generic RFC822
-       parsers.
-   
-   (4) There are actually 2 types of version 1 dump streams. The
-       regular ones are generated since r2634 (svn 0.14.0). Older ones
-       also claim to be version 1, but miss the Props-content-length
-       and Text-content-length fields in the block header. In those
-       days there *always* was a properties block.
-   
-EXAMPLE:
+the file a/b/c/d will have been copied to x/y/z/d.
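The path rewriting for directory copies can be sketched as below. `copy_target_path` is an illustrative name, and paths are assumed to be "/"-separated strings relative to the repository root, as in Node-path.

```python
def copy_target_path(source_dir, target_dir, source_path):
    """Map a path under source_dir to the matching path under target_dir."""
    prefix = source_dir.rstrip("/") + "/"
    if not source_path.startswith(prefix):
        raise ValueError("%r is not under %r" % (source_path, source_dir))
    # Strip the source directory and graft the remainder onto the target.
    return target_dir.rstrip("/") + "/" + source_path[len(prefix):]
```

With the record shown above, `copy_target_path("a/b/c", "x/y/z", "a/b/c/d")` gives `"x/y/z/d"`. An interpreter would apply this to every path found under a/b/c in its map of revision 10.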
 
-Here's an example of revision 1422, whereby I added a new directory
+A single revision may include multiple copyfrom Node records, even multiple
+copyfroms to the same directory, even mixed directory and file copies
+to the same directory. 
+
+=== Properties and persistence ===
+
+The properties section of a Revision record consists of some subset
+of the three reserved per-commit properties: svn:author, svn:date,
+and svn:log. These properties do not persist to later revisions.
+
+The key thing to know about Node properties is that they are 
+persistent, once set, until modified by a future property 
+section on the same path.
+
+Normally, a dumpfile re-lists the entire property set for a directory
+or file in every Node record that changes any part of it. (But see
+the material on delta dumps for an exception.)
+
+This implies that to delete a given property from a path, a dumpfile
+generator will issue a Node record with all other properties listed in it;
+to delete all properties from a path, the dumpfile generator will
+simply issue a node with an empty properties section. Note that this
+is different from an *absent* properties section, which will change
+no properties and will be associated with a change to content!
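For non-delta dumps, the persistence rule comes down to distinguishing an absent section from a present one. A minimal sketch, with `apply_prop_section` as a hypothetical helper and None standing for "no property section in this Node record":

```python
def apply_prop_section(current, section_props):
    """Return the property dict for a path after a non-delta Node record.

    `section_props` is None when the record has no property section, or a
    dict (possibly empty) parsed from the section.
    """
    if section_props is None:
        return dict(current)       # absent section: nothing changes
    return dict(section_props)     # present section: full replacement
```

An empty dict therefore clears every property, while None leaves them alone.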
+
+=== Implementation pragmatics ===
+
+Because directory operations with copyfroms don't specify all the file
+paths they modify, an interpreter for this format must build a map of
+the paths in the file store it is manipulating, and update that map as
+it processes each Node record.
+
+On a repository with thousands of commits, the per-revision list of
+maps can become quite large. For space economy, the file map for each 
+revision can be discarded after it is processed *unless it is a source
+revision for a copyfrom*. 
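The pruning rule might be implemented along these lines. This sketch assumes the interpreter can learn the set of copy-source revisions in advance (for example from a preliminary pass over the Node headers); both function names are hypothetical.

```python
def copy_source_revisions(node_headers):
    """Collect revision numbers named by Node-copyfrom-rev headers."""
    return {
        int(h["Node-copyfrom-rev"])
        for h in node_headers
        if "Node-copyfrom-rev" in h
    }


def prune_maps(maps_by_rev, needed, latest):
    """Keep only the latest path map and those that are copy sources."""
    return {
        rev: paths
        for rev, paths in maps_by_rev.items()
        if rev == latest or rev in needed
    }
```

The latest map is always kept because the next revision is built from it.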
+
+== An example ==
+
+Here's an example of revision 1422, which added a new directory
 "baz", added a new file "bop" inside it, and modified the file "foo.c":
 
+-------------------------------------------------------------------
 Revision-number: 1422
 Prop-content-length: 80
 Content-length: 80
@@ -188,8 +424,79 @@ Content-length: 102
 
 Here is the fulltext of my change to an existing /bar/foo.c.
 Notice that this file has no properties.
+-------------------------------------------------------------------
+
+== Format variants ==
+
+=== Version 3 format ===
+
+Version 3 format is a delta dump; text changes are represented 
+as diffs against the original file, and properties as incremental
+changes to a persistent set (that is, a property section does not
+necessarily implicitly clear the property set on a path before the
+new property settings are evaluated).
+
+This change is a space optimization. It requires additional 
+computing time to integrate the diff history.
+
+Version 3 is generated by SVN versions 1.1.0-present, if requested by the user.
+
+This format is equivalent to the VERSION 2 format except for the
+following:
+
+1. The format starts with the new version number of the dump format
+   ("SVN-fs-dump-format-version: 3\n").
+
+2. There are several new optional headers for Node records:
+
+-------------------------------------------------------------------
+[Text-delta: true|false]
+[Prop-delta: true|false]
+[Text-delta-base-md5: blob]
+[Text-delta-base-sha1: blob]
+[Text-copy-source-sha1: blob]
+[Text-content-sha1: blob]
+-------------------------------------------------------------------
 
--*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-
+The default value for the boolean headers is "false".  If the value is
+set to "true", then the text and property contents will be treated
+as deltas against the previous contents of the flow (as determined
+by copy history for adds with history, or by the value in the
+previous revision for changes--just as with commits).
+
+Property deltas have the same format as regular property lists except
+that (1) properties with the same value as in the previous contents of
+the flow are not printed, and (2) deleted properties will be written
+out as
+
+D <name length>
+<name>
+
+just as a regular property is printed, but with the "K " changed to a
+"D " and with no value part.
+
+Text deltas are written out as a series of svndiff0 windows.  If
+Text-delta-base-md5 is provided, it is the checksum of the base to
+which the text delta is applied; note that older versions (pre-1.5) of
+'svnadmin load' may ignore the checksum.
+
+Text-delta-base-sha1, Text-copy-source-sha1, and Text-content-sha1 are not
+currently used by the loader.  They are written by 1.6-and-later versions of
+Subversion so that future loaders can optionally choose which checksum to
+use for checking for corruption.
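Returning to the property deltas described in this section: with Prop-delta set to true, a section carries only changed and deleted properties, which could be applied as in this sketch. `apply_prop_delta` is an illustrative name, and applying the deletions after the changes is an assumption, since the notes do not say how a key appearing in both lists should be treated.

```python
def apply_prop_delta(current, changed, deleted):
    """Apply a version 3 property delta to the previous property dict.

    `changed` maps keys to new values (from K/V records); `deleted` is a
    list of keys (from D records).
    """
    result = dict(current)         # unlisted properties persist unchanged
    result.update(changed)
    for key in deleted:
        result.pop(key, None)      # tolerate deleting an absent property
    return result
```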
+
+=== Archaic version 1 format ===
+
+There are actually two types of version 1 dump streams. The regular ones
+are generated since r2634 (svn 0.14.0). Older ones also claim to be
+version 1, but lack the Prop-content-length and Text-content-length
+fields in the block header. In those days there *always* was a
+properties block.
+
+This note is included for historical completeness only, as it is highly
+unlikely that any Subversion instances that old remain in production.
+
+== Ancient history ==
 
 Old discussion: 
 
@@ -197,8 +504,7 @@ Old discussion: 
 
 A proposal for an svn filesystem dump/restore format.
 
-Two problems we want to solve
-=============================
+=== Two problems we want to solve ===
 
  1.  When we change our node-id schema, we need to migrate all of our
      data (by dumping and restoring).
@@ -207,8 +513,7 @@ Two problems we want to solve
      someday.
 
 
-Design Goals
-============
+=== Design Goals ===
 
  A.  Written as two new public functions in svn_fs.h.  To be invoked
      by new 'svnadmin' subcommands.
@@ -221,9 +526,7 @@ Design Goals
      backend.  In other words, we're talking about the basic ideas in
      our original "design spec" from May 2000.
 
-
-Format Semantics
-================
+=== Format Semantics ===
 
 Here are the timeless semantics of our fs design -- the things that
 would be stored in our dump format.
@@ -248,10 +551,9 @@ would be stored in our dump format.
     The history values can be non-existent (meaning the node is
     completely new), or can have a value of {revision, path}.
 
+=== Refinement of proposal #2: ===
 
-------------------------------------------------------------------------
-Refinement of proposal #2:  (after discussion with gstein)
-=========================
+(after discussion with gstein)
 
 Each node starts with RFC822-style headers at the top.  The final
 header is a 'Content-length:', followed by the content, so record
@@ -261,3 +563,5 @@ The content section has two implicit par
 fulltext.  The division between these two sections is implied by the
 "PROPS-END\n" tag at the end of the prophash.  In the case of a
 directory node or a revision, only the prophash is present.
+
+//End of document.

