This is a series of small fixes and features for fast-export and
fast-import, mostly on the fast-export side.
Changes since v1 (full range-diff below):
- used {tilde} in asciidoc documentation to avoid subscripting and
escaping problems
- renamed ABORT/ERROR enum values to help avoid further misusage
- multiple small testcase cleanups (use $ZERO_OID, remove grep -A, etc.)
- add FIXME comment to code about string_list usage
- record Peff's idea for a future optimization in patch 8 commit message
(is there a better place to put that??)
- New patch (9/11): remove the unmaintained copy of fast-import stream
format documentation at the beginning of fast-import.c
- Rewrite commit message for 10/11 to match the wording Peff liked
better, s/originally/original-oid/, and add documentation to
git-fast-import.txt
- Rewrite commit message for 11/11; the last one didn't make sense to
Peff. I hope this one does.
Elijah Newren (11):
git-fast-import.txt: fix documentation for --quiet option
git-fast-export.txt: clarify misleading documentation about rev-list
args
fast-export: use value from correct enum
fast-export: avoid dying when filtering by paths and old tags exist
fast-export: move commit rewriting logic into a function for reuse
fast-export: when using paths, avoid corrupt stream with non-existent
mark
fast-export: ensure we export requested refs
fast-export: add --reference-excluded-parents option
fast-import: remove unmaintained duplicate documentation
fast-export: add a --show-original-ids option to show original names
fast-export: add --always-show-modify-after-rename
Documentation/git-fast-export.txt | 34 +++++-
Documentation/git-fast-import.txt | 23 +++-
builtin/fast-export.c | 172 ++++++++++++++++++++++--------
fast-import.c | 166 +++-------------------------
t/t9350-fast-export.sh | 116 +++++++++++++++++++-
5 files changed, 308 insertions(+), 203 deletions(-)
1: 0744f65b0d = 1: 8870fb1340 git-fast-import.txt: fix documentation for
--quiet option
2: aba1e22fdd ! 2: 16d1c3e22d git-fast-export.txt: clarify misleading
documentation about rev-list args
@@ -13,7 +13,7 @@
current master reference to be exported along with all objects
- added since its 10th ancestor commit.
+ added since its 10th ancestor commit and all files common to
-+ master\~9 and master~10.
++ master{tilde}9 and master{tilde}10.
EXAMPLES
--------
3: 6983e845b2 < -: ---------- fast-export: use value from correct enum
-: ---------- > 3: e19f6b36f9 fast-export: use value from correct enum
4: 761ba324d5 ! 4: 2b305561d5 fast-export: avoid dying when filtering by
paths and old tags exist
@@ -49,18 +49,14 @@
+ (
+ cd rewrite_tag_predating_pathspecs &&
+
-+ touch ignored &&
-+ git add ignored &&
+ test_commit initial &&
+
+ git tag -a -m "Some old tag" v0.0.0.0.0.0.1 &&
+
-+ echo foo >bar &&
-+ git add bar &&
-+ test_commit add-bar &&
++ test_commit bar &&
+
-+ git fast-export --tag-of-filtered-object=rewrite --all -- bar
>output &&
-+ grep -A 1 refs/tags/v0.0.0.0.0.0.1 output | grep -E ^from.0{40}
++ git fast-export --tag-of-filtered-object=rewrite --all -- bar.t
>output &&
++ grep from.$ZERO_OID output
+ )
+'
+
5: 64e9f0d360 = 5: 607b1dc2b2 fast-export: move commit rewriting logic into
a function for reuse
6: fd14d9749a ! 6: ec1862e858 fast-export: when using paths, avoid corrupt
stream with non-existent mark
@@ -54,22 +54,18 @@
+ (
+ cd avoid_non_existent_mark &&
+
-+ touch important-path &&
-+ git add important-path &&
-+ test_commit initial &&
++ test_commit important-path &&
+
-+ touch ignored &&
-+ git add ignored &&
-+ test_commit whatever &&
++ test_commit ignored &&
+
+ git branch A &&
+ git branch B &&
+
-+ echo foo >>important-path &&
-+ git add important-path &&
++ echo foo >>important-path.t &&
++ git add important-path.t &&
+ test_commit more changes &&
+
-+ git fast-export --all -- important-path | git fast-import
--force
++ git fast-export --all -- important-path.t | git fast-import
--force
+ )
+'
+
7: 4e67a2bc7f ! 7: 9da26e3ccb fast-export: ensure we export requested refs
@@ -21,9 +21,6 @@
were just silently omitted from being exported despite having been
explicitly requested for export.
- NOTE: The usage of string_list should really be replaced with the
- strmap proposal, once it materializes.
-
Signed-off-by: Elijah Newren <[email protected]>
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
@@ -41,6 +38,12 @@
export_blob(&diff_queued_diff.queue[i]->two->oid);
refname = *revision_sources_at(&revision_sources, commit);
++ /*
++ * FIXME: string_list_remove() below for each ref is overall
++ * O(N^2). Compared to a history walk and diffing trees, this is
++ * just lost in the noise in practice. However, theoretically a
++ * repo may have enough refs for this to become slow.
++ */
+ string_list_remove(&extra_refs, refname, 0);
if (anonymize) {
refname = anonymize_refname(refname);
8: be02337f29 ! 8: 7e5fe2f02e fast-export: add --reference-excluded-parents
option
@@ -30,6 +30,15 @@
repository which already contains the necessary commits (much like the
restriction imposed when using --no-data).
+ Note from Peff:
+ I think we might be able to do a little more optimization here. If
+ we're exporting HEAD^..HEAD and there's an object in HEAD^ which is
+ unchanged in HEAD, I think we'd still print it (because it would not
+ be marked SHOWN), but we could omit it (by walking the tree of the
+ boundary commits and marking them shown). I don't think it's a
+ blocker for what you're doing here, but just a possible future
+ optimization.
+
Signed-off-by: Elijah Newren <[email protected]>
diff --git a/Documentation/git-fast-export.txt
b/Documentation/git-fast-export.txt
@@ -41,14 +50,15 @@
+--reference-excluded-parents::
+ By default, running a command such as `git fast-export
-+ master~5..master` will not include the commit master\~5 and
-+ will make master\~4 no longer have master\~5 as a parent (though
-+ both the old master\~4 and new master~4 will have all the same
-+ files). Use --reference-excluded-parents to instead have the
-+ the stream refer to commits in the excluded range of history
-+ by their sha1sum. Note that the resulting stream can only be
-+ used by a repository which already contains the necessary
-+ parent commits.
++ master~5..master` will not include the commit master{tilde}5
++ and will make master{tilde}4 no longer have master{tilde}5 as
++ a parent (though both the old master{tilde}4 and new
++ master{tilde}4 will have all the same files). Use
++ --reference-excluded-parents to instead have the the stream
++ refer to commits in the excluded range of history by their
++ sha1sum. Note that the resulting stream can only be used by a
++ repository which already contains the necessary parent
++ commits.
+
--refspec::
Apply the specified refspec to each ref exported. Multiple of them can
@@ -58,10 +68,10 @@
to export. For example, `master~10..master` causes the
current master reference to be exported along with all objects
- added since its 10th ancestor commit and all files common to
-- master\~9 and master~10.
+- master{tilde}9 and master{tilde}10.
+ added since its 10th ancestor commit and (unless the
+ --reference-excluded-parents option is specified) all files
-+ common to master\~9 and master~10.
++ common to master{tilde}9 and master{tilde}10.
EXAMPLES
--------
-: ---------- > 9: 14306a8436 fast-import: remove unmaintained duplicate
documentation
9: 7ab314849d ! 10: 72487a61e4 fast-export: add a --show-original-ids option
to show original names
@@ -2,16 +2,24 @@
fast-export: add a --show-original-ids option to show original names
- Knowing the original names (hashes) of commits, blobs, and tags can
- sometimes enable post-filtering that would otherwise be difficult or
- impossible. In particular, the desire to rewrite commit messages which
- refer to other prior commits (on top of whatever other filtering is
- being done) is very difficult without knowing the original names of
each
- commit.
+ Knowing the original names (hashes) of commits can sometimes enable
+ post-filtering that would otherwise be difficult or impossible. In
+ particular, the desire to rewrite commit messages which refer to other
+ prior commits (on top of whatever other filtering is being done) is
+ very difficult without knowing the original names of each commit.
+
+ In addition, knowing the original names (hashes) of blobs can allow
+ filtering by blob-id without requiring re-hashing the content of the
+ blob, and is thus useful as a small optimization.
+
+ Once we add original ids for both commits and blobs, we may as well
+ add them for tags too for completeness. Perhaps someone will have a
+ use for them.
This commit teaches a new --show-original-ids option to fast-export
- which will make it add a 'originally <hash>' line to blob, commits, and
- tags. It also teaches fast-import to parse (and ignore) such lines.
+ which will make it add a 'original-oid <hash>' line to blob, commits,
+ and tags. It also teaches fast-import to parse (and ignore) such
+ lines.
Signed-off-by: Elijah Newren <[email protected]>
@@ -19,12 +27,12 @@
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@
- used by a repository which already contains the necessary
- parent commits.
+ repository which already contains the necessary parent
+ commits.
+--show-original-ids::
+ Add an extra directive to the output for commits and blobs,
-+ `originally <SHA1SUM>`. While such directives will likely be
++ `original-oid <SHA1SUM>`. While such directives will likely be
+ ignored by importers such as git-fast-import, it may be useful
+ for intermediary filters (e.g. for rewriting commit messages
+ which refer to older commits, or for stripping blobs by id).
@@ -33,6 +41,54 @@
Apply the specified refspec to each ref exported. Multiple of them can
be specified.
+ diff --git a/Documentation/git-fast-import.txt
b/Documentation/git-fast-import.txt
+ --- a/Documentation/git-fast-import.txt
+ +++ b/Documentation/git-fast-import.txt
+@@
+ ....
+ 'commit' SP <ref> LF
+ mark?
++ original-oid?
+ ('author' (SP <name>)? SP LT <email> GT SP <when> LF)?
+ 'committer' (SP <name>)? SP LT <email> GT SP <when> LF
+ data
+@@
+ to another object simply by reusing the same `<idnum>` in another
+ `mark` command.
+
++`original-oid`
++~~~~~~~~~~~~~~
++Provides the name of the object in the original source control system.
++fast-import will simply ignore this directive, but filter processes
++which operate on and modify the stream before feeding to fast-import
++may have uses for this information
++
++....
++ 'original-oid' SP <object-identifier> LF
++....
++
++where `<object-identifer>` is any string not containing LF.
++
+ `tag`
+ ~~~~~
+ Creates an annotated tag referring to a specific commit. To create
+@@
+ ....
+ 'tag' SP <name> LF
+ 'from' SP <commit-ish> LF
++ original-oid?
+ 'tagger' (SP <name>)? SP LT <email> GT SP <when> LF
+ data
+ ....
+@@
+ ....
+ 'blob' LF
+ mark?
++ original-oid?
+ data
+ ....
+
+
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -51,7 +107,7 @@
- printf("blob\nmark :%"PRIu32"\ndata %lu\n", last_idnum, size);
+ printf("blob\nmark :%"PRIu32"\n", last_idnum);
+ if (show_original_ids)
-+ printf("originally %s\n", oid_to_hex(oid));
++ printf("original-oid %s\n", oid_to_hex(oid));
+ printf("data %lu\n", size);
if (size && fwrite(buf, size, 1, stdout) != 1)
die_errno("could not write blob '%s'", oid_to_hex(oid));
@@ -64,7 +120,7 @@
- refname, last_idnum,
+ printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum);
+ if (show_original_ids)
-+ printf("originally %s\n", oid_to_hex(&commit->object.oid));
++ printf("original-oid %s\n", oid_to_hex(&commit->object.oid));
+ printf("%.*s\n%.*s\ndata %u\n%s",
(int)(author_end - author), author,
(int)(committer_end - committer), committer,
@@ -77,7 +133,7 @@
- name, tagged_mark,
+ printf("tag %s\nfrom :%d\n", name, tagged_mark);
+ if (show_original_ids)
-+ printf("originally %s\n", oid_to_hex(&tag->object.oid));
++ printf("original-oid %s\n", oid_to_hex(&tag->object.oid));
+ printf("%.*s%sdata %d\n%.*s\n",
(int)(tagger_end - tagger), tagger,
tagger == tagger_end ? "" : "\n",
@@ -96,44 +152,13 @@
--- a/fast-import.c
+++ b/fast-import.c
@@
-
- new_blob ::= 'blob' lf
- mark?
-+ originally?
- file_content;
- file_content ::= data;
-
- new_commit ::= 'commit' sp ref_str lf
- mark?
-+ originally?
- ('author' (sp name)? sp '<' email '>' sp when lf)?
- 'committer' (sp name)? sp '<' email '>' sp when lf
- commit_msg
-@@
-
- new_tag ::= 'tag' sp tag_str lf
- 'from' sp commit-ish lf
-+ originally?
- ('tagger' (sp name)? sp '<' email '>' sp when lf)?
- tag_msg;
- tag_msg ::= data;
-@@
- data ::= (delimited_data | exact_data)
- lf?;
-
-+ originally ::= 'originally' sp not_lf+ lf
-+
- # note: delim may be any string but must not contain lf.
- # data_line may contain any data but must not be exactly
- # delim.
-@@
next_mark = 0;
}
+static void parse_original_identifier(void)
+{
+ const char *v;
-+ if (skip_prefix(command_buf.buf, "originally ", &v))
++ if (skip_prefix(command_buf.buf, "original-oid ", &v))
+ read_next_command();
+}
+
@@ -160,7 +185,7 @@
die("Invalid ref name or SHA1 expression: %s", from);
read_next_command();
-+ /* originally ... */
++ /* original-oid ... */
+ parse_original_identifier();
+
/* tagger ... */
@@ -177,7 +202,7 @@
+test_expect_success 'fast-export --show-original-ids' '
+
+ git fast-export --show-original-ids master >output &&
-+ grep ^originally output| sed -e s/^originally.// | sort >actual &&
++ grep ^original-oid output| sed -e s/^original-oid.// | sort >actual &&
+ git rev-list --objects master muss >objects-and-names &&
+ awk "{print \$1}" objects-and-names | sort >commits-trees-blobs &&
+ comm -23 actual commits-trees-blobs >unfound &&
10: 82735bcbde ! 11: 1796373474 fast-export: add
--always-show-modify-after-rename
@@ -2,29 +2,53 @@
fast-export: add --always-show-modify-after-rename
- fast-export output is traditionally used as an input to a fast-import
- program, but it is also useful to help gather statistics about the
- history of a repository (particularly when --no-data is also passed).
- For example, two of the types of information we may want to collect
- could include:
- 1) general information about renames that have occurred
- 2) what the biggest objects in a repository are and what names
- they appear under.
+ I wanted a way to gather all the following information efficiently
+ (with as few history traversals as possible):
+ * Get all blob sizes
+ * Map blob shas to filename(s) they appeared under in the history
+ * Find when files and directories were deleted (and whether they
+ were later reinstated, since that means they aren't actually gone)
+ * Find sets of filenames referring to the same logical 'file'. (e.g.
+ foo->bar in commit A and bar->baz in commit B mean that
+ {foo,bar,baz} refer to the same 'file', so someone wanting to just
+ "keep baz and its history" need all versions of those three
+ filenames). I need to know about things like another foo or bar
+ being introduced after the rename though, since that breaks the
+ connection between filenames)
+ and then I would generate various aggregations on the data and display
+ some type of report for the user.
- The first bit of information can be gathered by just passing -M to
- fast-export. The second piece of information can partially be gotten
- from running
- git cat-file --batch-check --batch-all-objects
- However, that only shows what the biggest objects in the repository are
- and their sizes, not what names those objects appear as or what commits
- they were introduced in. We can get that information from fast-export,
- but when we only see
+ The only way I know of to get blob sizes is via
+ cat-file --batch-all-objects --batch-check
+
+ The rest of the data would traditionally be gathered from a log
command,
+ e.g.
+
+ git log --format='%H%n%P%n%cd' --date=short --topo-order --reverse \
+ -M --diff-filter=RAMD --no-abbrev --raw -c
+
+ however, parsing log output seems slightly dangerous given that it is a
+ porcelain command. While we have specified --format and --raw to try
+ to avoid the most obvious problems, I'm still slightly concerned about
+ --date=short, the combinations of --raw and -c, options that might
+ colorize the output, and also the --diff-filter (there is no current
+ option named --no-find-copies or --no-break-rewrites, but what if those
+ turn on by default in the future much as we changed the default with
+ detecting renames?). Each of those is a small worry, but they add up.
+
+ A command meant for data serialization, such as fast-export, seems like
+ a better candidate for this job. There's just one missing item: in
+ order to connect blob sizes to filenames, I need fast-export to tell me
+ the blob sha1sum of any file changes. It does this for modifies, but
+ not always for renames. In particular, if a file is a 100% rename, it
+ only prints
R oldname newname
instead of
R oldname newname
M 100644 $SHA1 newname
- then it makes the job more difficult. Add an option which allows us to
- force the latter output even when commits have exact renames of files.
+ as occurs when there is a rename+modify. Add an option which allows us
+ to force the latter output even when commits have exact renames of
+ files.
Signed-off-by: Elijah Newren <[email protected]>
--
2.19.1.1063.g2b8e4a4f82.dirty