[PATCH v2 00/11] fast export and import fixes and features

Elijah Newren Tue, 13 Nov 2018 16:27:02 -0800

This is a series of small fixes and features for fast-export and
fast-import, mostly on the fast-export side.


Changes since v1 (full range-diff below):
  - used {tilde} in asciidoc documentation to avoid subscripting and
    escaping problems
  - renamed ABORT/ERROR enum values to help avoid further misusage
  - multiple small testcase cleanups (use $ZERO_OID, remove grep -A, etc.)
  - add FIXME comment to code about string_list usage
  - record Peff's idea for a future optimization in patch 8 commit message
    (is there a better place to put that??)
  - New patch (9/11): remove the unmaintained copy of fast-import stream
    format documentation at the beginning of fast-import.c
  - Rewrite commit message for 10/11 to match the wording Peff liked
    better, s/originally/original-oid/, and add documentation to
    git-fast-import.txt
  - Rewrite commit message for 11/11; the last one didn't make sense to
    Peff.  I hope this one does.

Elijah Newren (11):
  git-fast-import.txt: fix documentation for --quiet option
  git-fast-export.txt: clarify misleading documentation about rev-list
    args
  fast-export: use value from correct enum
  fast-export: avoid dying when filtering by paths and old tags exist
  fast-export: move commit rewriting logic into a function for reuse
  fast-export: when using paths, avoid corrupt stream with non-existent
    mark
  fast-export: ensure we export requested refs
  fast-export: add --reference-excluded-parents option
  fast-import: remove unmaintained duplicate documentation
  fast-export: add a --show-original-ids option to show original names
  fast-export: add --always-show-modify-after-rename

 Documentation/git-fast-export.txt |  34 +++++-
 Documentation/git-fast-import.txt |  23 +++-
 builtin/fast-export.c             | 172 ++++++++++++++++++++++--------
 fast-import.c                     | 166 +++-------------------------
 t/t9350-fast-export.sh            | 116 +++++++++++++++++++-
 5 files changed, 308 insertions(+), 203 deletions(-)

 1:  0744f65b0d =  1:  8870fb1340 git-fast-import.txt: fix documentation for 
--quiet option
 2:  aba1e22fdd !  2:  16d1c3e22d git-fast-export.txt: clarify misleading 
documentation about rev-list args
    @@ -13,7 +13,7 @@
        current master reference to be exported along with all objects
     -  added since its 10th ancestor commit.
     +  added since its 10th ancestor commit and all files common to
    -+  master\~9 and master~10.
    ++  master{tilde}9 and master{tilde}10.
      
      EXAMPLES
      --------
 3:  6983e845b2 <  -:  ---------- fast-export: use value from correct enum
 -:  ---------- >  3:  e19f6b36f9 fast-export: use value from correct enum
 4:  761ba324d5 !  4:  2b305561d5 fast-export: avoid dying when filtering by 
paths and old tags exist
    @@ -49,18 +49,14 @@
     +  (
     +          cd rewrite_tag_predating_pathspecs &&
     +
    -+          touch ignored &&
    -+          git add ignored &&
     +          test_commit initial &&
     +
     +          git tag -a -m "Some old tag" v0.0.0.0.0.0.1 &&
     +
    -+          echo foo >bar &&
    -+          git add bar &&
    -+          test_commit add-bar &&
    ++          test_commit bar &&
     +
    -+          git fast-export --tag-of-filtered-object=rewrite --all -- bar 
>output &&
    -+          grep -A 1 refs/tags/v0.0.0.0.0.0.1 output | grep -E ^from.0{40}
    ++          git fast-export --tag-of-filtered-object=rewrite --all -- bar.t 
>output &&
    ++          grep from.$ZERO_OID output
     +  )
     +'
     +
 5:  64e9f0d360 =  5:  607b1dc2b2 fast-export: move commit rewriting logic into 
a function for reuse
 6:  fd14d9749a !  6:  ec1862e858 fast-export: when using paths, avoid corrupt 
stream with non-existent mark
    @@ -54,22 +54,18 @@
     +  (
     +          cd avoid_non_existent_mark &&
     +
    -+          touch important-path &&
    -+          git add important-path &&
    -+          test_commit initial &&
    ++          test_commit important-path &&
     +
    -+          touch ignored &&
    -+          git add ignored &&
    -+          test_commit whatever &&
    ++          test_commit ignored &&
     +
     +          git branch A &&
     +          git branch B &&
     +
    -+          echo foo >>important-path &&
    -+          git add important-path &&
    ++          echo foo >>important-path.t &&
    ++          git add important-path.t &&
     +          test_commit more changes &&
     +
    -+          git fast-export --all -- important-path | git fast-import 
--force
    ++          git fast-export --all -- important-path.t | git fast-import 
--force
     +  )
     +'
     +
 7:  4e67a2bc7f !  7:  9da26e3ccb fast-export: ensure we export requested refs
    @@ -21,9 +21,6 @@
         were just silently omitted from being exported despite having been
         explicitly requested for export.
     
    -    NOTE: The usage of string_list should really be replaced with the
    -    strmap proposal, once it materializes.
    -
         Signed-off-by: Elijah Newren <[email protected]>
     
      diff --git a/builtin/fast-export.c b/builtin/fast-export.c
    @@ -41,6 +38,12 @@
                        export_blob(&diff_queued_diff.queue[i]->two->oid);
      
        refname = *revision_sources_at(&revision_sources, commit);
    ++  /*
    ++   * FIXME: string_list_remove() below for each ref is overall
    ++   * O(N^2).  Compared to a history walk and diffing trees, this is
    ++   * just lost in the noise in practice.  However, theoretically a
    ++   * repo may have enough refs for this to become slow.
    ++   */
     +  string_list_remove(&extra_refs, refname, 0);
        if (anonymize) {
                refname = anonymize_refname(refname);
 8:  be02337f29 !  8:  7e5fe2f02e fast-export: add --reference-excluded-parents 
option
    @@ -30,6 +30,15 @@
         repository which already contains the necessary commits (much like the
         restriction imposed when using --no-data).
     
    +    Note from Peff:
    +      I think we might be able to do a little more optimization here. If
    +      we're exporting HEAD^..HEAD and there's an object in HEAD^ which is
    +      unchanged in HEAD, I think we'd still print it (because it would not
    +      be marked SHOWN), but we could omit it (by walking the tree of the
    +      boundary commits and marking them shown).  I don't think it's a
    +      blocker for what you're doing here, but just a possible future
    +      optimization.
    +
         Signed-off-by: Elijah Newren <[email protected]>
     
      diff --git a/Documentation/git-fast-export.txt 
b/Documentation/git-fast-export.txt
    @@ -41,14 +50,15 @@
      
     +--reference-excluded-parents::
     +  By default, running a command such as `git fast-export
    -+  master~5..master` will not include the commit master\~5 and
    -+  will make master\~4 no longer have master\~5 as a parent (though
    -+  both the old master\~4 and new master~4 will have all the same
    -+  files).  Use --reference-excluded-parents to instead have the
    -+  the stream refer to commits in the excluded range of history
    -+  by their sha1sum.  Note that the resulting stream can only be
    -+  used by a repository which already contains the necessary
    -+  parent commits.
    ++  master~5..master` will not include the commit master{tilde}5
    ++  and will make master{tilde}4 no longer have master{tilde}5 as
    ++  a parent (though both the old master{tilde}4 and new
    ++  master{tilde}4 will have all the same files).  Use
    ++  --reference-excluded-parents to instead have the the stream
    ++  refer to commits in the excluded range of history by their
    ++  sha1sum.  Note that the resulting stream can only be used by a
    ++  repository which already contains the necessary parent
    ++  commits.
     +
      --refspec::
        Apply the specified refspec to each ref exported. Multiple of them can
    @@ -58,10 +68,10 @@
        to export.  For example, `master~10..master` causes the
        current master reference to be exported along with all objects
     -  added since its 10th ancestor commit and all files common to
    --  master\~9 and master~10.
    +-  master{tilde}9 and master{tilde}10.
     +  added since its 10th ancestor commit and (unless the
     +  --reference-excluded-parents option is specified) all files
    -+  common to master\~9 and master~10.
    ++  common to master{tilde}9 and master{tilde}10.
      
      EXAMPLES
      --------
 -:  ---------- >  9:  14306a8436 fast-import: remove unmaintained duplicate 
documentation
 9:  7ab314849d ! 10:  72487a61e4 fast-export: add a --show-original-ids option 
to show original names
    @@ -2,16 +2,24 @@
     
         fast-export: add a --show-original-ids option to show original names
     
    -    Knowing the original names (hashes) of commits, blobs, and tags can
    -    sometimes enable post-filtering that would otherwise be difficult or
    -    impossible.  In particular, the desire to rewrite commit messages which
    -    refer to other prior commits (on top of whatever other filtering is
    -    being done) is very difficult without knowing the original names of 
each
    -    commit.
    +    Knowing the original names (hashes) of commits can sometimes enable
    +    post-filtering that would otherwise be difficult or impossible.  In
    +    particular, the desire to rewrite commit messages which refer to other
    +    prior commits (on top of whatever other filtering is being done) is
    +    very difficult without knowing the original names of each commit.
    +
    +    In addition, knowing the original names (hashes) of blobs can allow
    +    filtering by blob-id without requiring re-hashing the content of the
    +    blob, and is thus useful as a small optimization.
    +
    +    Once we add original ids for both commits and blobs, we may as well
    +    add them for tags too for completeness.  Perhaps someone will have a
    +    use for them.
     
         This commit teaches a new --show-original-ids option to fast-export
    -    which will make it add a 'originally <hash>' line to blob, commits, and
    -    tags.  It also teaches fast-import to parse (and ignore) such lines.
    +    which will make it add a 'original-oid <hash>' line to blob, commits,
    +    and tags.  It also teaches fast-import to parse (and ignore) such
    +    lines.
     
         Signed-off-by: Elijah Newren <[email protected]>
     
    @@ -19,12 +27,12 @@
      --- a/Documentation/git-fast-export.txt
      +++ b/Documentation/git-fast-export.txt
     @@
    -   used by a repository which already contains the necessary
    -   parent commits.
    +   repository which already contains the necessary parent
    +   commits.
      
     +--show-original-ids::
     +  Add an extra directive to the output for commits and blobs,
    -+  `originally <SHA1SUM>`.  While such directives will likely be
    ++  `original-oid <SHA1SUM>`.  While such directives will likely be
     +  ignored by importers such as git-fast-import, it may be useful
     +  for intermediary filters (e.g. for rewriting commit messages
     +  which refer to older commits, or for stripping blobs by id).
    @@ -33,6 +41,54 @@
        Apply the specified refspec to each ref exported. Multiple of them can
        be specified.
     
    + diff --git a/Documentation/git-fast-import.txt 
b/Documentation/git-fast-import.txt
    + --- a/Documentation/git-fast-import.txt
    + +++ b/Documentation/git-fast-import.txt
    +@@
    + ....
    +   'commit' SP <ref> LF
    +   mark?
    ++  original-oid?
    +   ('author' (SP <name>)? SP LT <email> GT SP <when> LF)?
    +   'committer' (SP <name>)? SP LT <email> GT SP <when> LF
    +   data
    +@@
    + to another object simply by reusing the same `<idnum>` in another
    + `mark` command.
    + 
    ++`original-oid`
    ++~~~~~~~~~~~~~~
    ++Provides the name of the object in the original source control system.
    ++fast-import will simply ignore this directive, but filter processes
    ++which operate on and modify the stream before feeding to fast-import
    ++may have uses for this information
    ++
    ++....
    ++  'original-oid' SP <object-identifier> LF
    ++....
    ++
    ++where `<object-identifer>` is any string not containing LF.
    ++
    + `tag`
    + ~~~~~
    + Creates an annotated tag referring to a specific commit.  To create
    +@@
    + ....
    +   'tag' SP <name> LF
    +   'from' SP <commit-ish> LF
    ++  original-oid?
    +   'tagger' (SP <name>)? SP LT <email> GT SP <when> LF
    +   data
    + ....
    +@@
    + ....
    +   'blob' LF
    +   mark?
    ++  original-oid?
    +   data
    + ....
    + 
    +
      diff --git a/builtin/fast-export.c b/builtin/fast-export.c
      --- a/builtin/fast-export.c
      +++ b/builtin/fast-export.c
    @@ -51,7 +107,7 @@
     -  printf("blob\nmark :%"PRIu32"\ndata %lu\n", last_idnum, size);
     +  printf("blob\nmark :%"PRIu32"\n", last_idnum);
     +  if (show_original_ids)
    -+          printf("originally %s\n", oid_to_hex(oid));
    ++          printf("original-oid %s\n", oid_to_hex(oid));
     +  printf("data %lu\n", size);
        if (size && fwrite(buf, size, 1, stdout) != 1)
                die_errno("could not write blob '%s'", oid_to_hex(oid));
    @@ -64,7 +120,7 @@
     -         refname, last_idnum,
     +  printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum);
     +  if (show_original_ids)
    -+          printf("originally %s\n", oid_to_hex(&commit->object.oid));
    ++          printf("original-oid %s\n", oid_to_hex(&commit->object.oid));
     +  printf("%.*s\n%.*s\ndata %u\n%s",
               (int)(author_end - author), author,
               (int)(committer_end - committer), committer,
    @@ -77,7 +133,7 @@
     -         name, tagged_mark,
     +  printf("tag %s\nfrom :%d\n", name, tagged_mark);
     +  if (show_original_ids)
    -+          printf("originally %s\n", oid_to_hex(&tag->object.oid));
    ++          printf("original-oid %s\n", oid_to_hex(&tag->object.oid));
     +  printf("%.*s%sdata %d\n%.*s\n",
               (int)(tagger_end - tagger), tagger,
               tagger == tagger_end ? "" : "\n",
    @@ -96,44 +152,13 @@
      --- a/fast-import.c
      +++ b/fast-import.c
     @@
    - 
    -   new_blob ::= 'blob' lf
    -     mark?
    -+    originally?
    -     file_content;
    -   file_content ::= data;
    - 
    -   new_commit ::= 'commit' sp ref_str lf
    -     mark?
    -+    originally?
    -     ('author' (sp name)? sp '<' email '>' sp when lf)?
    -     'committer' (sp name)? sp '<' email '>' sp when lf
    -     commit_msg
    -@@
    - 
    -   new_tag ::= 'tag' sp tag_str lf
    -     'from' sp commit-ish lf
    -+    originally?
    -     ('tagger' (sp name)? sp '<' email '>' sp when lf)?
    -     tag_msg;
    -   tag_msg ::= data;
    -@@
    -   data ::= (delimited_data | exact_data)
    -     lf?;
    - 
    -+  originally ::= 'originally' sp not_lf+ lf
    -+
    -     # note: delim may be any string but must not contain lf.
    -     # data_line may contain any data but must not be exactly
    -     # delim.
    -@@
                next_mark = 0;
      }
      
     +static void parse_original_identifier(void)
     +{
     +  const char *v;
    -+  if (skip_prefix(command_buf.buf, "originally ", &v))
    ++  if (skip_prefix(command_buf.buf, "original-oid ", &v))
     +          read_next_command();
     +}
     +
    @@ -160,7 +185,7 @@
                die("Invalid ref name or SHA1 expression: %s", from);
        read_next_command();
      
    -+  /* originally ... */
    ++  /* original-oid ... */
     +  parse_original_identifier();
     +
        /* tagger ... */
    @@ -177,7 +202,7 @@
     +test_expect_success 'fast-export --show-original-ids' '
     +
     +  git fast-export --show-original-ids master >output &&
    -+  grep ^originally output| sed -e s/^originally.// | sort >actual &&
    ++  grep ^original-oid output| sed -e s/^original-oid.// | sort >actual &&
     +  git rev-list --objects master muss >objects-and-names &&
     +  awk "{print \$1}" objects-and-names | sort >commits-trees-blobs &&
     +  comm -23 actual commits-trees-blobs >unfound &&
10:  82735bcbde ! 11:  1796373474 fast-export: add 
--always-show-modify-after-rename
    @@ -2,29 +2,53 @@
     
         fast-export: add --always-show-modify-after-rename
     
    -    fast-export output is traditionally used as an input to a fast-import
    -    program, but it is also useful to help gather statistics about the
    -    history of a repository (particularly when --no-data is also passed).
    -    For example, two of the types of information we may want to collect
    -    could include:
    -      1) general information about renames that have occurred
    -      2) what the biggest objects in a repository are and what names
    -         they appear under.
    +    I wanted a way to gather all the following information efficiently
    +    (with as few history traversals as possible):
    +      * Get all blob sizes
    +      * Map blob shas to filename(s) they appeared under in the history
    +      * Find when files and directories were deleted (and whether they
    +        were later reinstated, since that means they aren't actually gone)
    +      * Find sets of filenames referring to the same logical 'file'. (e.g.
    +        foo->bar in commit A and bar->baz in commit B mean that
    +        {foo,bar,baz} refer to the same 'file', so someone wanting to just
    +        "keep baz and its history" need all versions of those three
    +        filenames).  I need to know about things like another foo or bar
    +        being introduced after the rename though, since that breaks the
    +        connection between filenames)
    +    and then I would generate various aggregations on the data and display
    +    some type of report for the user.
     
    -    The first bit of information can be gathered by just passing -M to
    -    fast-export.  The second piece of information can partially be gotten
    -    from running
    -        git cat-file --batch-check --batch-all-objects
    -    However, that only shows what the biggest objects in the repository are
    -    and their sizes, not what names those objects appear as or what commits
    -    they were introduced in.  We can get that information from fast-export,
    -    but when we only see
    +    The only way I know of to get blob sizes is via
    +      cat-file --batch-all-objects --batch-check
    +
    +    The rest of the data would traditionally be gathered from a log 
command,
    +    e.g.
    +
    +      git log --format='%H%n%P%n%cd' --date=short --topo-order --reverse \
    +          -M --diff-filter=RAMD --no-abbrev --raw -c
    +
    +    however, parsing log output seems slightly dangerous given that it is a
    +    porcelain command.  While we have specified --format and --raw to try
    +    to avoid the most obvious problems, I'm still slightly concerned about
    +    --date=short, the combinations of --raw and -c, options that might
    +    colorize the output, and also the --diff-filter (there is no current
    +    option named --no-find-copies or --no-break-rewrites, but what if those
    +    turn on by default in the future much as we changed the default with
    +    detecting renames?).  Each of those is a small worry, but they add up.
    +
    +    A command meant for data serialization, such as fast-export, seems like
    +    a better candidate for this job.  There's just one missing item: in
    +    order to connect blob sizes to filenames, I need fast-export to tell me
    +    the blob sha1sum of any file changes.  It does this for modifies, but
    +    not always for renames.  In particular, if a file is a 100% rename, it
    +    only prints
             R oldname newname
         instead of
             R oldname newname
             M 100644 $SHA1 newname
    -    then it makes the job more difficult.  Add an option which allows us to
    -    force the latter output even when commits have exact renames of files.
    +    as occurs when there is a rename+modify.  Add an option which allows us
    +    to force the latter output even when commits have exact renames of
    +    files.
     
         Signed-off-by: Elijah Newren <[email protected]>

-- 
2.19.1.1063.g2b8e4a4f82.dirty

[PATCH v2 00/11] fast export and import fixes and features

Reply via email to