>>>>> "LT" == Linus Torvalds <[EMAIL PROTECTED]> writes:
LT> What do people think? I'm not so much worried about the data itself: the
LT> git architecture is _so_ damn simple that now that the size estimate has
LT> been confirmed, that I don't think it would be a problem per se to put
LT> 3.2GB into the archive. But it will bog down "rsync" horribly, so it will
LT> actually hurt synchronization until somebody writes the rev-tree-like
LT> stuff to communicate changes more efficiently..
LT> IOW, it smells to me like we don't have the infrastructure to really work
LT> with 3GB archives, and that if we start from scratch (2.6.12-rc2), we can
LT> build up the infrastructure in parallel with starting to really need it.
LT> But it's _great_ to have the history in this format, especially since
LT> looking at CVS just reminded me how much I hated it.
I have been cooking this idea before I dove into the merge stuff
and did not have time to implement it myself (Hint Hint), but I
think something along the following lines would work nicely:
* A script git-archive-tar is used to create a "base tarball"
that roughly corresponds to "linux-*.tar.gz". This works as
$ git-archive-tar C [B1 B2...]
This reads the named commit C, grabs the associated tree
(i.e. its sub-tree objects and the blobs they refer to), and
makes a tarball of ??/??????????????????????????????????????
files. The tarball does not have to contain any extra
information to reproduce any ancestor of the named commit.
When extra parameters, B1 B2..., are given, it also creates a
"diff package" that roughly corresponds to "patch-*.gz" for
each Bn given. Each Bn must be an ancestor of commit C. The
intention is to store enough information to ensure that the
recipient can recreate all the SHA1 files the "base tarball" for
any commit in (Bn, C] would contain, provided that the
recipient already has all the SHA1 files "base tarball" for
* A script git-archive-patch is used to read such a "diff
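To make the "base tarball" part of git-archive-tar concrete, here is a minimal Python sketch of the packing step only. It assumes the set of objects reachable from commit C has already been collected into a dict; the names object_path and make_base_tarball are made up for illustration, and the object names in the usage note are plain SHA1s of the bytes, which is a stand-in and not a claim about how git names its objects.

```python
import hashlib
import io
import tarfile


def object_path(sha1_hex):
    """Map a 40-hex-digit object name to its ??/??...?? file path."""
    return sha1_hex[:2] + "/" + sha1_hex[2:]


def make_base_tarball(objects):
    """Pack {object name: file bytes} into an in-memory gzip'ed tarball.

    'objects' is assumed to already hold everything reachable from the
    chosen commit; collecting that set is outside this sketch.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name in sorted(objects):
            data = objects[name]
            info = tarfile.TarInfo(object_path(name))
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

For example, make_base_tarball({hashlib.sha1(d).hexdigest(): d for d in (b"a", b"b")}) yields a tarball whose members are two files named in the ??/??...?? form.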
So a user needs to:
* First pick some baseline B and download the base tarball for
commit B. It is up to him to make trade-offs between how far
back he wants to see the history and how much bandwidth he
wants to waste. Untar it to get the baseline.
* Then periodically pick up the "diff package" for (B, C] where C
is the latest available. Run git-archive-patch to populate
* In addition the user can run rsync with timestamp option to
pick up SHA1 files created upstream since C after this
What git-archive-tar needs to do to produce the "diff package" for
(Bn, C] is fairly obvious.
* From rev-tree output, find all the commits that are on the path
from Bn to C.
* Find all the SHA1 objects that appear on this commit chain;
subtract what is in Bn, since we assume the recipient has them.
* Run diff-tree between neighboring commits [*1*] to find out
the set of blobs that are "related". Extract those related
blobs and run "diff" [*2*] between them to see if it produces
a patch smaller than the whole thing when compressed. If
diff+patch is a win, then we do not have to transmit the blob
that we could reproduce by sending the diff. Note that fact.
* When you are all done, you have a single patch file that
contains small edits on numerous blobs, and a set of SHA1 files
that are cheaper to transmit than in the patch form.
Compress the patch file and package them together to make a
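The first two steps above (find the commits on the path, then subtract what Bn already has) can be sketched in Python as follows. This is only an illustration under assumptions: 'parents' is a hypothetical dict mapping each commit to its parent list (what one would parse out of rev-tree output), and 'objects_of' is a hypothetical dict mapping each commit to the set of object names its tree refers to. It reads "on the path from Bn to C" as "ancestor of C that has Bn as an ancestor".

```python
def ancestors(parents, start):
    """Return start plus every commit reachable from it via parent links."""
    seen = set()
    stack = [start]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, ()))
    return seen


def commits_between(parents, base, tip):
    """Commits in (base, tip]: ancestors of tip that descend from base."""
    return {c for c in ancestors(parents, tip)
            if c != base and base in ancestors(parents, c)}


def objects_to_send(parents, objects_of, base, tip):
    """Objects used by commits in (base, tip] that base does not have."""
    needed = set()
    for commit in commits_between(parents, base, tip):
        needed |= objects_of[commit]
    return needed - objects_of[base]
```

Under this reading, commits merged in from a side branch that does not descend from Bn are not "on the path"; whether they belong in the package is left open here.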
Given the above, the operation of git-archive-patch is also
quite obvious. Extract the "diff package" tarball into the
objects/ directory that has (at least) the full Bn, uncompress
the patch file part, and run patch on it.
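As a rough illustration of the receiving side, the Python sketch below rebuilds a blob from an old blob plus a delta and refuses the result if it does not hash to the expected name. The copy/data delta format and the function names are invented for this sketch (a stand-in for whatever diff+patch pair is eventually chosen), and checking a plain SHA1 of the rebuilt bytes is an assumption, not a statement about how git computes object names.

```python
import difflib
import hashlib


def make_delta(old, new):
    """Describe 'new' (bytes) as copy/data operations against 'old' (bytes)."""
    ops = []
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif j2 > j1:
            # "replace" or "insert": carry the new bytes literally.
            ops.append(("data", new[j1:j2]))
        # "delete" needs no operation: those old bytes are just not copied.
    return ops


def apply_delta(old, ops):
    """Rebuild the new blob from the old blob and the operations."""
    out = []
    for op in ops:
        if op[0] == "copy":
            out.append(old[op[1]:op[2]])
        else:
            out.append(op[1])
    return b"".join(out)


def rebuild_and_check(old, ops, expected_sha1):
    """Apply the delta and raise ValueError unless the result hashes as expected."""
    new = apply_delta(old, ops)
    if hashlib.sha1(new).hexdigest() != expected_sha1:
        raise ValueError("rebuilt blob does not match expected SHA1")
    return new
```

The check matters because the sender dropped the whole blob from the package on the assumption that the recipient can reproduce it exactly.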
*1* Alternatively, this diff-tree can be run between Bn and each
commit in (Bn, C]. It is like an incremental dump strategy.
We should experiment and find a good balance.
*2* This does not have to be "diff -u": the patch is only ever applied
to the exact blob it was generated from, so diff -e or xdelta would
do. We should experiment
and find a good diff+patch pair.
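One way to run that experiment is to compare compressed sizes directly. The Python sketch below does this for a pair of text blobs, using a unified diff and zlib only because both are in the standard library; the function name diff_is_a_win is made up, and a real comparison would plug in diff -e, xdelta, or whatever candidates are being evaluated.

```python
import difflib
import zlib


def diff_is_a_win(old_text, new_text):
    """Return (win, patch_size, whole_size) for sending a diff vs the blob.

    Sizes are in bytes after zlib compression. 'win' is True when the
    compressed patch is strictly smaller than the compressed new blob.
    """
    patch = "".join(difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        "a", "b"))
    patch_size = len(zlib.compress(patch.encode()))
    whole_size = len(zlib.compress(new_text.encode()))
    return patch_size < whole_size, patch_size, whole_size
```

A one-line change to a large file should come out as a win, while two blobs with nothing in common should not, since the patch then carries both the removed and the added lines.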