Re: full kernel history, in patchset format

Junio C Hamano Sat, 16 Apr 2005 11:31:38 -0700

>>>>> "LT" == Linus Torvalds <[EMAIL PROTECTED]> writes:


LT> What do people think? I'm not so much worried about the data itself: the
LT> git architecture is _so_ damn simple that now that the size estimate has
LT> been confirmed, I don't think it would be a problem per se to put
LT> 3.2GB into the archive. But it will bog down "rsync" horribly, so it will
LT> actually hurt synchronization until somebody writes the rev-tree-like
LT> stuff to communicate changes more efficiently.

LT> IOW, it smells to me like we don't have the infrastructure to really work
LT> with 3GB archives, and that if we start from scratch (2.6.12-rc2), we can
LT> build up the infrastructure in parallel with starting to really need it.

LT> But it's _great_ to have the history in this format, especially since 
LT> looking at CVS just reminded me how much I hated it.

LT> Comments?

I have been cooking this idea before I dove into the merge stuff
and did not have time to implement it myself (Hint Hint), but I
think something along the following lines would work nicely:

 * A script git-archive-tar is used to create a "base tarball"
   that roughly corresponds to "linux-*.tar.gz".  This works as
   follows:

    $ git-archive-tar C [B1 B2...]

   This reads the named commit C, grabs the associated tree
   (i.e.  its sub-tree objects and the blobs they refer to), and
   makes a tarball of ??/??????????????????????????????????????
   files.  The tarball does not have to contain the extra
   information needed to reproduce any ancestor of the named
   commit.

   When extra parameters, B1 B2..., are given, it also creates
   a "diff package" that roughly corresponds to "patch-*.gz" for
   each Bn given.  Each Bn must be an ancestor of C.  The
   intention is to store enough information to ensure that the
   recipient can recreate all the SHA1 files that the "base
   tarball" for any commit in (Bn, C] would contain, provided
   that the recipient already has all the SHA1 files that the
   "base tarball" for Bn contains.

 * A script git-archive-patch is used to read such a "diff
   package".
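
A minimal sketch of the object collection that a "base tarball"
implies, in Python.  It is only an illustration: the in-memory
dictionaries stand in for the real object database, and the names
objects_for_commit and object_path are hypothetical, not part of
any existing tool.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical model: each tree SHA1 maps to a list of (kind, sha1)
# entries, where kind is "tree" for a sub-tree or "blob" for a file.
Trees = Dict[str, List[Tuple[str, str]]]


def objects_for_commit(commit: str,
                       commit_tree: Dict[str, str],
                       trees: Trees) -> Set[str]:
    """Return the commit, its tree, all sub-trees and all their blobs."""
    needed = {commit}
    stack = [commit_tree[commit]]
    while stack:
        tree = stack.pop()
        if tree in needed:
            continue  # shared sub-tree already visited
        needed.add(tree)
        for kind, sha in trees[tree]:
            if kind == "tree":
                stack.append(sha)
            else:
                needed.add(sha)
    return needed


def object_path(sha: str) -> str:
    """Map a 40-character hex SHA1 to its ??/?????... file name."""
    return f"{sha[:2]}/{sha[2:]}"
```

Nothing here walks parent commits, which is why such a tarball
cannot reproduce ancestors of C.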

So a user needs to:

 * First pick some baseline B and download the base tarball for
   commit B.  It is up to him to make trade-offs between how far
   back he wants to see the history and how much bandwidth he
   wants to waste.  Untar it to get the baseline.

 * Then periodically pick up the "diff package" for (B, C],
   where C is the latest available.  Run git-archive-patch to
   populate the rest.

 * In addition the user can run rsync with a timestamp option
   to pick up SHA1 files created upstream since C after this
   happens.

What git-archive-tar needs to do to produce a "diff package" for
(Bn, C] is fairly obvious.

 * From rev-tree output, find all the commits that are on the
   path from Bn to C.

 * Find all the SHA1 objects that appear on this commit chain;
   subtract what is in Bn since we assume the recipient has them
   already.

 * Run diff-tree between neighboring commits [*1*] to find out
   the set of blobs that are "related".  Extract those related
   blobs and run "diff" [*2*] between them to see if it produces
   a patch smaller than the whole thing when compressed.  If
   diff+patch is a win, then we do not have to transmit the blob
   that we could reproduce by sending the diff.  Note that fact.

 * When you are all done, you have a single patch file that
   contains small edits on numerous blobs, and a set of SHA1
   files that are cheaper to transmit whole than in patch form.
   Compress the patch file and package them together to make a
   tar archive.
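
The steps above can be sketched roughly as follows.  This is a
sketch under stated assumptions, not a design: the commit graph
is a plain parents dictionary rather than real rev-tree output,
the "related" blob pairs are given as input rather than derived
from diff-tree, Python's difflib stands in for "diff", and zlib
stands in for whatever compression is used.  The function names
are hypothetical.

```python
import difflib
import zlib
from typing import Dict, Iterable, List, Set, Tuple


def ancestors(parents: Dict[str, List[str]], commit: str) -> Set[str]:
    """Return commit plus everything reachable through parent links."""
    seen: Set[str] = set()
    stack = [commit]
    while stack:
        c = stack.pop()
        if c in seen:
            continue
        seen.add(c)
        stack.extend(parents.get(c, []))
    return seen


def commits_on_path(parents: Dict[str, List[str]],
                    base: str, tip: str) -> Set[str]:
    """Commits in (base, tip]: ancestors of tip that descend from base."""
    return {c for c in ancestors(parents, tip)
            if c != base and base in ancestors(parents, c)}


def make_patch(old: bytes, new: bytes) -> bytes:
    """Build a unified diff, used here only to estimate patch size."""
    old_lines = old.decode("latin-1").splitlines(keepends=True)
    new_lines = new.decode("latin-1").splitlines(keepends=True)
    text = "".join(difflib.unified_diff(old_lines, new_lines, "a", "b"))
    return text.encode("latin-1")


def plan_package(chain_objects: Set[str],
                 base_objects: Set[str],
                 related: Iterable[Tuple[str, str]],
                 blobs: Dict[str, bytes]) -> Tuple[Set[str], Set[str]]:
    """Split what the recipient lacks into (send_whole, send_as_patch)."""
    # Subtract what is in Bn; the recipient is assumed to have it.
    missing = chain_objects - base_objects
    as_patch: Set[str] = set()
    for old_sha, new_sha in related:
        if new_sha not in missing:
            continue
        patch = make_patch(blobs[old_sha], blobs[new_sha])
        # The diff is a win only if it compresses smaller than the blob.
        if len(zlib.compress(patch)) < len(zlib.compress(blobs[new_sha])):
            as_patch.add(new_sha)
    return missing - as_patch, as_patch
```

A one-line change to a large file ends up in the patch set, while
a tiny blob whose diff headers outweigh its contents is sent
whole.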

Given the above, the operation of git-archive-patch is also
quite obvious.  Extract the "diff package" tarball into the
objects/ directory that has (at least) the full Bn, uncompress
the patch file part, and apply it with patch.


[Footnotes]

*1* Alternatively, this diff-tree can be run between Bn and each
commit in (Bn, C].  This is like an incremental dump strategy.
We should experiment and find a good balance.
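
The two pairing strategies can be compared side by side with a
small Python sketch.  It assumes, purely for illustration, that
the commits from Bn to C form a linear chain given oldest first;
the function names are hypothetical.

```python
from typing import List, Tuple


def neighbor_pairs(chain: List[str]) -> List[Tuple[str, str]]:
    """Pair each commit with its immediate predecessor in the chain."""
    return list(zip(chain, chain[1:]))


def base_pairs(chain: List[str]) -> List[Tuple[str, str]]:
    """Pair every later commit with the first commit (Bn) directly."""
    return [(chain[0], c) for c in chain[1:]]
```

With neighbor pairs, each diff is small but rebuilding a late
commit depends on every diff before it; with base pairs, each
diff stands alone against Bn but tends to grow with distance.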

*2* This does not have to be "diff -u".  The recipient has the
exact blob the diff was made against, so the patch needs no
context, and diff -e or xdelta would do.  We should experiment
and find a good diff+patch pair.


